Declarative Analysis of Noisy Information Networks
Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor
University of Maryland
Outline
Motivations/Contributions
Framework
Declarative Language
Implementation
Results
Related and Future Work
Motivation
• Users/objects are modeled as nodes, relationships as edges.
• The observed networks are noisy and incomplete.
  – Some users may have more than one account
  – Communication may contain a lot of spam
• Noise shows up as missing attributes, missing links, and multiple references to the same entity.
• We need to extract the underlying information network.
Inference Operations
• Attribute Prediction
  – To predict values of missing attributes
• Link Prediction
  – To predict missing links
• Entity Resolution
  – To predict if two references refer to the same entity
• These prediction tasks can use:
  – Local node information
  – Relational information surrounding the node
Attribute Prediction
Task: Predict topic of the paper
[Figure: graph of papers labeled by topic. Paper titles: "Automatic Rule Refinement for Information Extraction"; "Join Optimization of Information Extraction Output: Quality Matters!"; "A Statistical Model for Multilingual Entity Detection and Tracking"; "Why Not?"; "Tracing Lineage Beyond Relational Operators"; "An Annotation Management System for Relational Databases"; "Language Model Based Arabic Word Segmentation". Legend: DB, NL, ?]
• Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]
Link Prediction
• Goal: Predict new links
• Using local similarity
• Using relational similarity [Liben-Nowell et al., CIKM 2003]
[Figure: graph of authors. Node labels: Divesh Srivastava, Vladislav Shkapenyuk, Nick Koudas, Avishek Saha, Graham Cormode, Flip Korn, Lukasz Golab, Theodore Johnson]
Entity Resolution
• Goal: to deduce that two references refer to the same entity
• Can be based on node attributes (local)
  – e.g. string similarity between titles or author names
• Local information alone may not be enough
[Figure: two references, both named Jian Li]
Entity Resolution
[Figure: two "Jian Li" references and their neighbors. Node labels: William Roberts, Petre Stoica, Jian Li, Prabhu Babu, Amol Deshpande, Samir Khuller, Barna Saha, Jian Li]
• Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]
Joint Inference
• Each task helps the others get better predictions.
• How to combine the tasks?
  – One after the other (pipelined), or interleaved?
• GAIA:
  – A Java library for applying multiple joint attribute prediction (AP), link prediction (LP), and entity resolution (ER) learning and inference tasks [Namata et al., MLG 2009; Namata et al., KDUD 2009]
  – Inference can be pipelined or interleaved.
Our Goal and Contributions
• Motivation: To support declarative network inference
• Desiderata:
  – User declaratively specifies the prediction features
    • Local features
    • Relational features
  – Declaratively specify tasks
    • Attribute prediction, Link prediction, Entity resolution
  – Specify arbitrary interleaving or pipelining
  – Support for complex prediction functions
• Handle all that efficiently
Unifying Framework
• Specify the domain
  – For attribute prediction, the domain is a subset of the graph nodes.
  – For link prediction and entity resolution, the domain is a subset of pairs of nodes.
• Compute features
  – Local: word frequency, income, etc.
  – Relational: degree, clustering coeff., no. of neighbors with each attribute value, common neighbors between pairs of nodes, etc.
• Make predictions, and compute confidence in the predictions
  – Attribute prediction: the missing attribute
  – Link prediction: add link or not?
  – Entity resolution: merge two nodes or not?
• Choose which predictions to apply
  – After predictions are made, the graph changes:
    • Attribute prediction changes local attributes.
    • Link prediction changes the graph links.
    • Entity resolution changes both local attributes and graph links.
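The four framework steps can be read as a single loop. The following is a minimal Python sketch for a toy attribute-prediction task. The function names, the majority-vote predictor, and the 0.6 confidence threshold are illustrative assumptions, not part of the system described in these slides.

```python
from collections import Counter

def domain(labels):
    """Step 1: the domain is the set of nodes whose label is still missing."""
    return [n for n, lab in labels.items() if lab is None]

def features(node, edges, labels):
    """Step 2: a relational feature, the count of neighbors per known label."""
    return Counter(labels[m] for m in edges[node] if labels[m] is not None)

def predict(feat):
    """Step 3: predict the majority label; confidence is its share of the votes."""
    if not feat:
        return None, 0.0
    label, votes = feat.most_common(1)[0]
    return label, votes / sum(feat.values())

def run_once(edges, labels, threshold=0.6):
    """Step 4: apply only predictions whose confidence clears the threshold."""
    applied = {}
    for node in domain(labels):
        label, conf = predict(features(node, edges, labels))
        if label is not None and conf >= threshold:
            applied[node] = label
    labels.update(applied)  # the graph changes only after all candidates are scored
    return applied

# Toy graph: node "a" has two DB neighbors and one NL neighbor.
edges = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}
labels = {"a": None, "b": "DB", "c": "DB", "d": "NL"}
applied = run_once(edges, labels)
```

In this toy run, node "a" receives the label "DB" with confidence 2/3, which clears the assumed threshold.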
Datalog
• Use Datalog to express:
  – Domains
  – Local and relational features
• Extend Datalog with operational semantics (vs. fixpoint semantics) to express:
  – Predictions (in the form of updates)
  – Iteration
Specifying Features
Degree:
  Degree(X, COUNT<Y>) :- Edge(X, Y)
Number of neighbors with attribute 'A':
  NumNeighbors(X, COUNT<Y>) :- Edge(X, Y), Node(Y, Att='A')
Clustering coefficient:
  NeighborCluster(X, COUNT<Y,Z>) :- Edge(X,Y), Edge(X,Z), Edge(Y,Z)
  ClusteringCoeff(X, C) :- NeighborCluster(X,N), Degree(X,D), C=2*N/(D*(D-1))
Jaccard coefficient:
  IntersectionCount(X, Y, COUNT<Z>) :- Edge(X, Z), Edge(Y, Z)
  UnionCount(X, Y, D) :- Degree(X,D1), Degree(Y,D2), D=D1+D2-D3, IntersectionCount(X, Y, D3)
  Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J=N/D
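For readers less familiar with Datalog aggregates, the same three features can be sketched procedurally in Python. The adjacency-map layout and function names are assumptions for illustration. The sketch also counts each unordered pair of linked neighbors once, which may differ from how COUNT<Y,Z> counts pairs in the rule above.

```python
from itertools import combinations

# Undirected toy graph stored as a symmetric adjacency map (an assumed layout).
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}

def degree(x):
    """Degree(X, COUNT<Y>) :- Edge(X, Y)"""
    return len(adj[x])

def clustering_coeff(x):
    """Fraction of neighbor pairs of x that are themselves linked."""
    d = degree(x)
    if d < 2:
        return 0.0  # undefined for fewer than two neighbors; 0.0 is a convention
    links = sum(1 for y, z in combinations(adj[x], 2) if z in adj[y])
    return 2 * links / (d * (d - 1))

def jaccard(x, y):
    """Common neighbors divided by the size of the neighbor union."""
    inter = len(adj[x] & adj[y])
    union = degree(x) + degree(y) - inter  # same identity as UnionCount
    return inter / union if union else 0.0
```

On this graph, node 1 has three neighbors of which only the pair (2, 3) is linked, so its clustering coefficient is 1/3. Nodes 2 and 3 share one neighbor out of a union of three, giving a Jaccard coefficient of 1/3.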
Specifying Domains
• Domains are used to restrict the space of computation for the prediction elements.
• Space for this feature is |V|^2:
    Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S=f(V1, V2)
• Using this domain the space becomes |E|:
    DOMAIN D(X,Y) :- Edge(X, Y)
• Other DOMAIN predicates:
  – Equality
  – Locality sensitive hashing
  – String similarity joins
  – Traverse edges
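The effect of a domain on the candidate space can be illustrated with a small Python sketch. The node names, attribute values, and the position-matching similarity function standing in for f(V1, V2) are all hypothetical.

```python
from itertools import combinations

# Toy data: node -> attribute value, plus an explicit edge list (assumed layout).
nodes = {"n1": "smith", "n2": "smyth", "n3": "jones", "n4": "smith"}
edges = [("n1", "n2"), ("n2", "n3")]

def similarity(a, b):
    """Hypothetical f(V1, V2): fraction of positions where the strings agree."""
    same = sum(1 for ca, cb in zip(a, b) if ca == cb)
    return same / max(len(a), len(b))

# Without a domain, every pair of nodes is a candidate: |V| choose 2 pairs.
all_pairs = list(combinations(sorted(nodes), 2))

# With DOMAIN D(X,Y) :- Edge(X, Y), only linked pairs are scored: |E| pairs.
scores = {(x, y): similarity(nodes[x], nodes[y]) for x, y in edges}
```

Here the unrestricted space has 6 pairs, while the edge domain scores only 2 of them.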
Feature Vector
• Features of prediction elements are combined in a single predicate to create the feature vector:
    DOMAIN D(X, Y) :- …
    {
      P1(X, Y, F1) :- …
      …
      Pn(X, Y, Fn) :- …
      Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
    }
Update Operation
DEFINE Merge(X, Y)
{
  INSERT Edge(X, Z) :- Edge(Y, Z)
  DELETE Edge(Y, Z)
  UPDATE Node(X, A=ANew) :- Node(X,A=AX), Node(Y,A=AY), ANew=(AX+AY)/2
  UPDATE Node(X, B=BNew) :- Node(X,B=BX), Node(Y,B=BY), BNew=max(BX,BY)
  DELETE Node(Y)
}
Merge(X, Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) > 0.95
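The Merge update block can be mirrored step by step in Python. This is a sketch under stated assumptions: edges are stored as a set of directed (source, target) pairs, node attributes as dictionaries, and only outgoing edges of Y are rewired because the rules above mention only Edge(Y, Z).

```python
def merge(x, y, edges, nodes):
    """Merge node y into node x, following the four kinds of update in order."""
    # INSERT Edge(X, Z) :- Edge(Y, Z)
    edges.update({(x, z) for (src, z) in edges if src == y})
    # DELETE Edge(Y, Z)
    edges.difference_update({(src, z) for (src, z) in edges if src == y})
    # UPDATE Node(X, A=ANew): average of the two A values
    nodes[x]["A"] = (nodes[x]["A"] + nodes[y]["A"]) / 2
    # UPDATE Node(X, B=BNew): maximum of the two B values
    nodes[x]["B"] = max(nodes[x]["B"], nodes[y]["B"])
    # DELETE Node(Y)
    del nodes[y]

# Toy data (hypothetical): y points to p and q, x already points to p.
edges = {("y", "p"), ("y", "q"), ("x", "p")}
nodes = {
    "x": {"A": 2.0, "B": 5},
    "y": {"A": 4.0, "B": 9},
    "p": {"A": 0.0, "B": 0},
    "q": {"A": 0.0, "B": 0},
}
merge("x", "y", edges, nodes)
```

After the call, x holds the edges to both p and q, its A value is the average 3.0, its B value is the maximum 9, and y is gone.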
Prediction and Confidence Functions
• The prediction and confidence functions are user-defined functions.
• They can be based on logistic regression, a Bayes classifier, or any other classification algorithm.
• The confidence is the class membership value.
  – In logistic regression, the confidence can be the value of the logistic function
  – In a Bayes classifier, the confidence can be the posterior probability value
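As one possible instance of such user-defined functions, here is a minimal logistic-regression sketch in Python. The function names echo predict-ER and confidence-ER, but the weights, bias, and the 0.5 decision boundary are assumptions for illustration. The confidence returned is that of the positive class.

```python
import math

def confidence_er(features, weights, bias):
    """Value of the logistic function for a weighted sum of the features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_er(features, weights, bias):
    """Predict 'merge' when the positive-class confidence exceeds 0.5."""
    return confidence_er(features, weights, bias) > 0.5

# Hypothetical weights for two features, e.g. name similarity and Jaccard.
weights, bias = [2.0, 2.0], 0.0
strong_match = [1.0, 1.0]
```

With these assumed weights, a pair with both features at 1.0 has z = 4 and a confidence of about 0.98, so it would pass a 0.95 threshold like the one on the Update Operation slide.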
Iteration
• Iteration is supported by the ITERATE construct.
• It takes the number of iterations as a parameter, or * to iterate until there are no more predictions.
• Example:
    ITERATE(*)
    {
      MERGE(X,Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) IN TOP 10%
    }
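The control flow of ITERATE with an IN TOP 10% filter can be sketched in Python as follows. The helper names are hypothetical, and the rule of always keeping at least one candidate per round is an assumption made so the toy example makes progress. The slides do not say how the system rounds the 10% cut.

```python
import math

def top_fraction(scored, fraction=0.10):
    """Keep the highest-confidence fraction of candidates (at least one)."""
    if not scored:
        return []
    k = max(1, math.ceil(len(scored) * fraction))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

def iterate(score_candidates, apply, max_iters=None):
    """Repeat score/select/apply; max_iters=None plays the role of ITERATE(*)."""
    rounds = 0
    while max_iters is None or rounds < max_iters:
        chosen = top_fraction(score_candidates())
        if not chosen:
            break  # no more predictions
        for item, _conf in chosen:
            apply(item)
        rounds += 1
    return rounds

# Toy candidates with fixed confidences (hypothetical values).
pending = {"p1": 0.7, "p2": 0.4, "p3": 0.9}
order = []

def score():
    return list(pending.items())

def apply_prediction(item):
    order.append(item)
    del pending[item]

rounds = iterate(score, apply_prediction)
```

In a real program the scores would be recomputed from the changed graph each round, which is why selection happens inside the loop rather than once up front.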
Pipelining
DOMAIN ER(X,Y) :- …
{
  ER1(X,Y,F1) :- …
  ER2(X,Y,F2) :- …
  Features-ER(X,Y,F1,F2) :- …
}
DOMAIN LP(X,Y) :- …
{
  LP1(X,Y,F1) :- …
  LP2(X,Y,F2) :- …
  Features-LP(X,Y,F1,F2) :- …
}
ITERATE(*)
{
  INSERT EDGE(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
}
ITERATE(*)
{
  MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
}
Interleaving
DOMAIN ER(X,Y) :- …
{
  ER1(X,Y,F1) :- …
  ER2(X,Y,F2) :- …
  Features-ER(X,Y,F1,F2) :- …
}
DOMAIN LP(X,Y) :- …
{
  LP1(X,Y,F1) :- …
  LP2(X,Y,F2) :- …
  Features-LP(X,Y,F1,F2) :- …
}
ITERATE(*)
{
  INSERT EDGE(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
  MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
}
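The difference between pipelining and interleaving comes down to whether the tasks share one ITERATE block. The Python sketch below shows only that scheduling difference. The task bodies are stand-ins that "make a prediction" by decrementing a counter, and all names are hypothetical.

```python
def run_until_quiet(tasks, state):
    """ITERATE(*) over one block: repeat its tasks in order until none fires."""
    log = []
    changed = True
    while changed:
        changed = False
        for name, step in tasks:
            if step(state):
                log.append(name)
                changed = True
    return log

def make_step(key):
    """Stand-in task: fires while state[key] > 0, consuming one unit each time."""
    def step(state):
        if state[key] > 0:
            state[key] -= 1
            return True
        return False
    return step

lp, er = make_step("lp"), make_step("er")

# Pipelined: one ITERATE block per task, run back to back.
state = {"lp": 2, "er": 2}
pipelined = run_until_quiet([("LP", lp)], state) + run_until_quiet([("ER", er)], state)

# Interleaved: both tasks inside the same ITERATE block.
state = {"lp": 2, "er": 2}
interleaved = run_until_quiet([("LP", lp), ("ER", er)], state)
```

The pipelined schedule runs all link predictions before any merge, while the interleaved schedule alternates them, so each task can see the other's most recent changes.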
Implementation
• Prototype based on Java Berkeley DB
• Implemented a query parser, plan generator, query evaluation engine
• Incremental maintenance:
  – Aggregate/non-aggregate incremental maintenance
  – DOMAIN maintenance
Incremental Maintenance
• Predicates in the program correspond to materialized tables (key/value maps).
• Every set of changes made by AP, LP, or ER is logged into two change tables, ΔNodes and ΔEdges.
  – Insertions: | Record | +1 |
  – Deletions: | Record | -1 |
  – Updates: deletion followed by an insertion
• Aggregate maintenance is performed by aggregating the change table, then refreshing the old table.
• DOMAIN:
    DOMAIN L(X) :- Subgoals of L
    {
      P1(X,Y) :- Subgoals of P1
    }

    L(X) :- Subgoals of L
    P1'(X) :- L(X), Subgoals of P1
    P1(X) :- L(X) >> Subgoals of P1
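Aggregate maintenance from a change table can be sketched in Python for the Degree feature. The table layout, a list of ((x, y), ±1) rows standing in for ΔEdges, and the function name are assumptions for illustration, not the prototype's actual data structures.

```python
from collections import Counter

def apply_edge_delta(degree, delta_edges):
    """Refresh Degree(X, COUNT<Y>) from change rows of the form ((x, y), +1 or -1)."""
    # Aggregate the change table first: net degree change per node.
    net = Counter()
    for (x, _y), sign in delta_edges:
        net[x] += sign
    # Then refresh only the affected rows of the materialized table.
    for x, change in net.items():
        new_value = degree.get(x, 0) + change
        if new_value:
            degree[x] = new_value
        else:
            degree.pop(x, None)  # a node with no remaining edges leaves the table
    return degree

# Materialized Degree table and one batch of logged changes (toy values).
degree = {"a": 2, "b": 1}
delta_edges = [(("a", "c"), +1), (("b", "a"), -1), (("c", "a"), +1)]
apply_edge_delta(degree, delta_edges)
```

Only the rows for a, b, and c are touched, whereas recomputation would rescan every edge. This is the kind of saving the incremental approach aims for.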
Synthetic Experiments
• Synthetic graphs, generated using the forest fire and preferential attachment generation models.
• Three tasks:
  – Attribute Prediction, Link Prediction and Entity Resolution
• Two approaches:
  – Recomputing features after every iteration
  – Incremental maintenance
• Varied parameters:
  – Graph size
  – Graph density
  – Confidence threshold (update size)
Changing Graph Size
• Varied the graph size from 20K nodes and 200K edges to 100K nodes and 1M edges
Comparison with Derby
• Compared the evaluation of 4 features: degree, clustering coefficient, common neighbors and Jaccard.
Real-world Experiment
• Real-world PubMed graph
  – Set of publications from the medical domain, their abstracts, and citations
• 50,634 publications, 115,323 citation edges
• Task: Attribute prediction
  – Predict if the paper is categorized as Cognition, Learning, Perception or Thinking
• Choose top 10% predictions after each iteration, for 10 iterations
• Incremental: 28 minutes. Recompute: 42 minutes
Program
DOMAIN Uncommitted(X) :- Node(X, Committed='no')
{
  ThinkingNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')
  PerceptionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Perception')
  CognitionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')
  LearningNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Learning')
  Features-AP(X,A,B,C,D,Abstract) :- ThinkingNeighbors(X,A), PerceptionNeighbors(X,B), CognitionNeighbors(X,C), LearningNeighbors(X,D), Node(X,Abstract,_,_)
}
ITERATE(10)
{
  UPDATE Node(X,_,P,'yes') :- Features-AP(X,A,B,C,D,Text), P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%
}
Related Work
• Dedupalog [Arasu et al., ICDE 2009]:
  – Datalog-based entity resolution
    • User defines hard and soft rules for deduplication
    • System satisfies hard rules and minimizes violations of soft rules when deduplicating references
• Swoosh [Benjelloun et al., VLDBJ 2008]:
  – Generic entity resolution
    • Match function for pairs of nodes (based on a set of features)
    • Merge function determines which pairs should be merged
Conclusions and Ongoing Work
• Conclusions:
  – We built a declarative system to specify graph inference operations
  – We implemented the system on top of Berkeley DB and implemented incremental maintenance techniques
• Future work:
  – Direct computation of top-k predictions
  – Multi-query evaluation (especially on graphs)
  – Employing a graph DB engine (e.g. Neo4j)
  – Support recursive queries and recursive view maintenance
References
• [Sen et al., AI Magazine 2008]
  – Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad: Collective Classification in Network Data. AI Magazine 29(3): 93-106 (2008)
• [Liben-Nowell et al., CIKM 2003]
  – David Liben-Nowell, Jon M. Kleinberg: The link prediction problem for social networks. CIKM 2003.
• [Bhattacharya et al., TKDD 2007]
  – I. Bhattacharya and L. Getoor: Collective entity resolution in relational data. ACM TKDD, 1:1–36, 2007.
• [Namata et al., MLG 2009]
  – G. Namata and L. Getoor: A Pipeline Approach to Graph Identification. MLG Workshop, 2009.
• [Namata et al., KDUD 2009]
  – G. Namata and L. Getoor: Identifying Graphs From Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.
• [Arasu et al., ICDE 2009]
  – A. Arasu, C. Re, and D. Suciu: Large-scale deduplication with constraints using Dedupalog. ICDE, 2009.
• [Benjelloun et al., VLDBJ 2008]
  – O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom: Swoosh: a generic approach to entity resolution. The VLDB Journal, 2008.