
Declarative Analysis of Noisy
Information Networks

Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor

University of Maryland

Outline


Motivations/Contributions


Framework


Declarative Language


Implementation


Results


Related and Future Work

Motivation


Users/objects are modeled as nodes, relationships as edges.

The observed networks are noisy and incomplete:

Some users may have more than one account

Communication may contain a lot of spam

Missing attributes, missing links, multiple references to the same entity

Need to extract the underlying information network.

Inference Operations


Attribute Prediction


To predict values of missing attributes


Link Prediction


To predict missing links


Entity Resolution


To predict if two references refer to the same entity


These prediction tasks can use:


Local node information


Relational information surrounding the node

Attribute Prediction

[Figure: a citation network of paper titles (e.g. "Automatic Rule Refinement for Information Extraction", "Join Optimization of Information Extraction Output: Quality Matters!"), with nodes labeled DB or NL and unlabeled nodes marked "?"]

Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]

Task: Predict the topic of the paper


Link Prediction

Goal: Predict new links

Using local similarity

Using relational similarity [Liben-Nowell et al., CIKM 2003]

[Figure: a co-authorship network — Divesh Srivastava, Vladislav Shkapenyuk, Nick Koudas, Avishek Saha, Graham Cormode, Flip Korn, Lukasz Golab, Theodore Johnson]

Entity Resolution

Goal: to deduce that two references refer to the same entity

Can be based on node attributes (local)

e.g. string similarity between titles or author names

Local information only may not be enough

[Figure: two author references, both named "Jian Li"]

Entity Resolution

[Figure: the two "Jian Li" references disambiguated by their co-authors — William Roberts, Petre Stoica, Prabhu Babu vs. Amol Deshpande, Samir Khuller, Barna Saha]

Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]


Joint Inference

Each task helps the others make better predictions.

How to combine the tasks?

One after the other (pipelined), or interleaved?

GAIA:

A Java library for applying multiple joint AP, LP, and ER learning and inference tasks [Namata et al., MLG 2009; Namata et al., KDUD 2009]

Inference can be pipelined or interleaved.

Our Goal and Contributions


Motivation: To support declarative network inference


Desiderata:


User declaratively specifies the prediction features


Local features


Relational features


Declaratively specify tasks


Attribute prediction, Link prediction, Entity resolution


Specify arbitrary interleaving or pipelining

Support for complex prediction functions

Handle all that efficiently

Outline


Motivations/Contributions


Framework


Declarative Language


Implementation


Results


Related and Future Work

Unifying Framework

Specify the domain

Compute features

Make Predictions, and Compute
Confidence in the Predictions

Choose Which Predictions to
Apply

For attribute prediction, the domain is a subset of the graph nodes.

For link prediction and entity resolution, the domain is a subset of pairs of nodes.


Unifying Framework

Specify the domain

Compute features

Make Predictions, and Compute
Confidence in the Predictions

Choose Which Predictions to
Apply

Local: word frequency, income, etc.

Relational: degree, clustering coeff., no. of neighbors with each attribute value, common neighbors between pairs of nodes, etc.

Unifying Framework

Specify the domain

Compute features

Make Predictions, and Compute
Confidence in the Predictions

Choose Which Predictions to
Apply

Attribute prediction: the missing attribute

Link prediction: add link or not?

Entity resolution: merge two nodes or not?

Unifying Framework

Specify the Domain

Compute Features

Make Predictions, and Compute
Confidence in the Predictions

Choose Which Predictions to
Apply

After predictions are made, the graph changes:

Attribute prediction changes local attributes.

Link prediction changes the graph links.

Entity resolution changes both local attributes and graph links.

Outline


Motivations/Contributions


Framework


Declarative Language


Implementation


Results


Related and Future Work

Datalog

Use Datalog to express:

Domains

Local and relational features

Extend Datalog with operational semantics (vs. fixpoint semantics) to express:

Predictions (in the form of updates)

Iteration

Specifying Features

Degree:

Degree(X, COUNT<Y>) :- Edge(X, Y)

Number of neighbors with attribute 'A':

NumNeighbors(X, COUNT<Y>) :- Edge(X, Y), Node(Y, Att='A')

Clustering coefficient:

NeighborCluster(X, COUNT<Y,Z>) :- Edge(X,Y), Edge(X,Z), Edge(Y,Z)

ClusteringCoeff(X, C) :- NeighborCluster(X,N), Degree(X,D), C=2*N/(D*(D-1))

Jaccard coefficient:

IntersectionCount(X, Y, COUNT<Z>) :- Edge(X, Z), Edge(Y, Z)

UnionCount(X, Y, D) :- Degree(X,D1), Degree(Y,D2), IntersectionCount(X, Y, D3), D=D1+D2-D3

Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J=N/D
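As a concrete illustration (mine, not from the slides), the rule bodies above map directly onto set operations over an adjacency-set graph:

```python
# Sketch: the Degree, ClusteringCoeff, and Jaccard rules above,
# computed imperatively over a toy undirected graph.
from itertools import combinations

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

def degree(x):
    return len(adj.get(x, ()))

def clustering_coeff(x):
    # C = 2*N / (D*(D-1)), where N = #edges among X's neighbors
    d = degree(x)
    if d < 2:
        return 0.0
    n = sum(1 for y, z in combinations(adj[x], 2) if z in adj[y])
    return 2 * n / (d * (d - 1))

def jaccard(x, y):
    # |N(x) ∩ N(y)| / |N(x) ∪ N(y)|, with the union via inclusion-exclusion
    inter = len(adj[x] & adj[y])
    union = degree(x) + degree(y) - inter
    return inter / union if union else 0.0
```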


Specifying Domains


Domains are used to restrict the space of computation for the prediction elements.

The space for this feature is |V|²:

Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S=f(V1, V2)

Using this domain, the space becomes |E|:

DOMAIN D(X,Y) :- Edge(X, Y)

Other DOMAIN predicates:

Equality

Locality sensitive hashing

String similarity joins

Traverse edges
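The effect of a DOMAIN predicate can be sketched as follows (an illustration under my own toy data, not the system's implementation): the similarity feature is scored only on pairs admitted by the domain, here the edge set, rather than on all |V|² pairs.

```python
# Sketch: without a domain, a pairwise feature touches all |V|^2 pairs;
# with DOMAIN D(X,Y) :- Edge(X,Y), only |E| pairs are scored.
nodes = {1: "data", 2: "date", 3: "graph", 4: "graphs"}
edges = {(1, 2), (3, 4)}

def sim(a, b):  # stand-in for the user's f(V1, V2): character Jaccard
    return len(set(a) & set(b)) / len(set(a) | set(b))

# Without a domain: all candidate pairs.
all_pairs = [(x, y) for x in nodes for y in nodes if x < y]

# With the domain: only pairs that are edges.
domain_pairs = [p for p in all_pairs if p in edges]

scores = {(x, y): sim(nodes[x], nodes[y]) for x, y in domain_pairs}
```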

Feature Vector

Features of prediction elements are combined in a single predicate to create the feature vector:

DOMAIN D(X, Y) :- …

{

P1(X, Y, F1) :- …

…

Pn(X, Y, Fn) :- …

Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)

}

Update Operation

DEFINE Merge(X, Y)

{

INSERT Edge(X, Z) :- Edge(Y, Z)

DELETE Edge(Y, Z)

UPDATE Node(X, A=ANew) :- Node(X, A=AX), Node(Y, A=AY), ANew=(AX+AY)/2

UPDATE Node(X, B=BNew) :- Node(X, B=BX), Node(Y, B=BY), BNew=max(BX, BY)

DELETE Node(Y)

}

Merge(X, Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) > 0.95
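The Merge(X, Y) semantics above can be sketched on a dict-based graph (my illustration, not the system's storage layer): Y's edges are redirected to X, the attributes are combined as the rules specify, and Y is deleted.

```python
# Sketch of the Merge(X, Y) update operation on a toy graph.
def merge(nodes, adj, x, y):
    # INSERT Edge(X, Z) :- Edge(Y, Z); DELETE Edge(Y, Z)
    for z in adj.pop(y, set()):
        adj[z].discard(y)
        if z != x:
            adj[z].add(x)
            adj.setdefault(x, set()).add(z)
    # UPDATE attribute A to the average, attribute B to the max
    nodes[x]["A"] = (nodes[x]["A"] + nodes[y]["A"]) / 2
    nodes[x]["B"] = max(nodes[x]["B"], nodes[y]["B"])
    # DELETE Node(Y)
    del nodes[y]

nodes = {1: {"A": 2.0, "B": 5}, 2: {"A": 4.0, "B": 7}}
adj = {1: {3}, 2: {3, 4}, 3: {1, 2}, 4: {2}}
merge(nodes, adj, 1, 2)
```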

Prediction and Confidence Functions

The prediction and confidence functions are user-defined functions

Can be based on logistic regression, a Bayes classifier, or any other classification algorithm

The confidence is the class membership value

In logistic regression, the confidence can be the value of the logistic function

In a Bayes classifier, the confidence can be the posterior probability value
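A minimal predict/confidence pair backed by logistic regression might look as follows; the weights here are made up for illustration, since the slides leave the classifier entirely user-defined:

```python
# Sketch: confidence-ER as the logistic function's value,
# predict-ER as a threshold on it. Weights are hypothetical.
import math

WEIGHTS = [0.8, -0.3]   # assumed learned coefficients
BIAS = -0.1

def confidence_er(features):
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))   # logistic function

def predict_er(features):
    return confidence_er(features) > 0.5
```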

Iteration

Iteration is supported by the ITERATE construct.

Takes the number of iterations as a parameter, or * to iterate until no more predictions are made.

ITERATE (*)

{

MERGE(X,Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) IN TOP 10%

}
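The ITERATE(*) semantics with a TOP 10% acceptance policy can be sketched as a loop (my reading of the construct, not the engine's code): apply the highest-confidence predictions, let the graph change, and stop once no candidate passes.

```python
# Sketch of ITERATE(*) with a TOP-k% policy over merge candidates.
def iterate_star(candidates, confidence, apply_fn, top_frac=0.10, threshold=0.5):
    applied = []
    while True:
        viable = sorted((c for c in candidates if confidence(c) > threshold),
                        key=confidence, reverse=True)
        if not viable:
            break                      # * : no more predictions to make
        k = max(1, int(len(viable) * top_frac))
        for c in viable[:k]:
            apply_fn(c)                # e.g. MERGE(X, Y)
            applied.append(c)
        candidates = [c for c in candidates if c not in viable[:k]]
    return applied

# Toy run: candidates are their own confidence scores.
merges = iterate_star([0.9, 0.3, 0.8, 0.6],
                      confidence=lambda c: c,
                      apply_fn=lambda c: None)
```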



Pipelining

DOMAIN ER(X,Y) :- …

{

ER1(X,Y,F1) :- …

ER2(X,Y,F2) :- …

Features-ER(X,Y,F1,F2) :- …

}

DOMAIN LP(X,Y) :- …

{

LP1(X,Y,F1) :- …

LP2(X,Y,F2) :- …

Features-LP(X,Y,F1,F2) :- …

}

ITERATE(*)

{

INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%

}

ITERATE(*)

{

MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%

}


Interleaving

DOMAIN ER(X,Y) :- …

{

ER1(X,Y,F1) :- …

ER2(X,Y,F2) :- …

Features-ER(X,Y,F1,F2) :- …

}

DOMAIN LP(X,Y) :- …

{

LP1(X,Y,F1) :- …

LP2(X,Y,F2) :- …

Features-LP(X,Y,F1,F2) :- …

}

ITERATE(*)

{

INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%

MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%

}
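The contrast between the two programs is purely loop structure; the rules inside are identical. A sketch (illustrative only, with stand-in step functions):

```python
# Sketch: pipelining drains one task's ITERATE(*) before starting the
# next; interleaving applies both tasks in every round of one loop.
def make_step(name, rounds, trace):
    state = {"left": rounds}
    def step():
        if state["left"] > 0:
            state["left"] -= 1
            trace.append(name)
            return True                # made a prediction this round
        return False
    return step

def pipelined(lp_step, er_step):
    while lp_step():                   # ITERATE(*) over link prediction...
        pass
    while er_step():                   # ...then ITERATE(*) over entity resolution
        pass

def interleaved(lp_step, er_step):
    progress = True
    while progress:                    # one ITERATE(*) applying both per round
        progress = lp_step() | er_step()   # '|' so both always run

t1, t2 = [], []
pipelined(make_step("LP", 2, t1), make_step("ER", 2, t1))
interleaved(make_step("LP", 2, t2), make_step("ER", 2, t2))
```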




Outline


Motivations/Contributions


Framework


Declarative Language


Implementation


Results


Related and Future Work

Implementation


Prototype based on Java Berkeley DB

Implemented a query parser, plan generator, and query evaluation engine

Incremental maintenance:

Aggregate/non-aggregate incremental maintenance

DOMAIN maintenance

Incremental Maintenance


Predicates in the program correspond to materialized tables (key/value maps).

Every set of changes made by AP, LP, or ER is logged into two change tables, ΔNodes and ΔEdges:

Insertions: | Record | +1 |

Deletions: | Record | -1 |

Updates: a deletion followed by an insertion

Aggregate maintenance is performed by aggregating the change table, then refreshing the old table.
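For a count aggregate like Degree, the maintenance step above can be sketched as follows (my illustration of the |record|±1| scheme, not the engine's code):

```python
# Sketch: changes logged as (record, ±1) rows in a change table,
# aggregated, then applied to the materialized count table.
from collections import Counter

degree = Counter({1: 2, 2: 1})              # materialized Degree(X, COUNT<Y>)
delta_edges = [(1, +1), (2, -1), (1, +1)]   # ΔEdges change table

agg = Counter()                             # aggregate the change table...
for node, sign in delta_edges:
    agg[node] += sign

for node, change in agg.items():            # ...then refresh the old table
    degree[node] += change
    if degree[node] <= 0:
        del degree[node]                    # count dropped to zero
```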


DOMAIN:

DOMAIN L(X) :- Subgoals of L

{

P1(X,Y) :- Subgoals of P1

}

L(X) :- Subgoals of L

P1'(X) :- L(X), Subgoals of P1

P1(X) :- L(X) >> Subgoals of P1

Outline


Motivations/Contributions


Framework


Declarative Language


Implementation


Results


Related and Future Work

Synthetic Experiments

Synthetic graphs, generated using the forest fire and preferential attachment generation models.


Three tasks:


Attribute Prediction, Link Prediction and Entity Resolution


Two approaches:

Recomputing features after every iteration

Incremental maintenance


Varied parameters:


Graph size


Graph density


Confidence threshold (update size)


Changing Graph Size

Varied the graph size from 20K nodes and 200K edges to 100K nodes and 1M edges

Comparison with Derby

Compared the evaluation of 4 features: degree, clustering coefficient, common neighbors, and Jaccard.

Real-world Experiment

Real-world PubMed graph

A set of publications from the medical domain, their abstracts, and citations

50,634 publications, 115,323 citation edges

Task: attribute prediction

Predict whether the paper is categorized as Cognition, Learning, Perception, or Thinking

Choose the top 10% predictions after each iteration, for 10 iterations

Incremental: 28 minutes. Recompute: 42 minutes



Program

DOMAIN Uncommitted(X) :- Node(X, Committed='no')

{

ThinkingNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')

PerceptionNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Perception')

CognitionNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')

LearningNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Learning')

Features-AP(X, A, B, C, D, Abstract) :- ThinkingNeighbors(X,A), PerceptionNeighbors(X,B), CognitionNeighbors(X,C), LearningNeighbors(X,D), Node(X, Abstract, _, _)

}

ITERATE(10)

{

UPDATE Node(X, _, P, 'yes') :- Features-AP(X,A,B,C,D,Text), P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%

}
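The per-label neighbor counts that make up Features-AP can be sketched in a few lines (my stand-in for the Datalog counts, over toy data):

```python
# Sketch: count each node's neighbors carrying each label — the
# feature vector (A, B, C, D) for an uncommitted node.
from collections import Counter

LABELS = ("Thinking", "Perception", "Cognition", "Learning")
labels = {1: "Thinking", 2: "Learning", 3: "Thinking", 4: None}  # 4 is unlabeled
adj = {1: [4], 2: [4], 3: [4], 4: [1, 2, 3]}

def features_ap(x):
    counts = Counter(labels[y] for y in adj[x] if labels[y] is not None)
    return [counts[l] for l in LABELS]
```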

Outline


Motivations/Contributions


Framework


Declarative Language


Implementation


Results


Related and Future Work

Related Work

Dedupalog [Arasu et al., ICDE 2009]:

Datalog-based entity resolution

User defines hard and soft rules for deduplication

System satisfies hard rules and minimizes violations of soft rules when deduplicating references

Swoosh [Benjelloun et al., VLDBJ 2008]:

Generic entity resolution

Match function for pairs of nodes (based on a set of features)

Merge function determines which pairs should be merged



Conclusions and Ongoing Work


Conclusions:


We built a declarative system for specifying graph inference operations

We implemented the system on top of Berkeley DB, with incremental maintenance techniques


Future work:


Direct computation of top-k predictions

Multi-query evaluation (especially on graphs)

Employing a graph DB engine (e.g. Neo4j)

Support recursive queries and recursive view maintenance

References

[Sen et al., AI Magazine 2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad: Collective Classification in Network Data. AI Magazine 29(3): 93-106, 2008.

[Liben-Nowell et al., CIKM 2003] David Liben-Nowell, Jon M. Kleinberg: The Link Prediction Problem for Social Networks. CIKM 2003.

[Bhattacharya et al., TKDD 2007] I. Bhattacharya and L. Getoor: Collective Entity Resolution in Relational Data. ACM TKDD, 1:1-36, 2007.

[Namata et al., MLG 2009] G. Namata and L. Getoor: A Pipeline Approach to Graph Identification. MLG Workshop, 2009.

[Namata et al., KDUD 2009] G. Namata and L. Getoor: Identifying Graphs From Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.

[Arasu et al., ICDE 2009] A. Arasu, C. Re, and D. Suciu: Large-Scale Deduplication with Constraints Using Dedupalog. ICDE 2009.

[Benjelloun et al., VLDBJ 2008] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom: Swoosh: A Generic Approach to Entity Resolution. The VLDB Journal, 2008.