Nov 25, 2013


1

On Querying Historical Evolving Graph Sequences

Chenghui Ren$, Eric Lo*, Ben Kao$, Xinjie Zhu$, Reynold Cheng$

$ The University of Hong Kong
$ {chren, kao, xjzhu, ckcheng}@cs.hku.hk

* Hong Kong Polytechnic University
* ericlo@comp.polyu.edu.hk

2

Motivation


Graphs are widely used to model the world


The world is ever changing, so graphs evolve with time





3

Motivation


How does the importance of a vertex change?


E.g. closeness centrality

Evolving Graph Sequence (EGS)
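
As a minimal sketch of this kind of question, the Python below computes the closeness centrality of one vertex in every snapshot of a tiny EGS. The adjacency-dict representation, the helper names, and the three toy snapshots are illustrative assumptions, not taken from the slides. Closeness is taken here as (number of reachable vertices) / (sum of hop distances to them), one common definition for unweighted graphs.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, v):
    """(reachable vertices) / (sum of distances); 0.0 if v reaches nothing."""
    dist = bfs_distances(adj, v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

def undirected(edges):
    """Build an adjacency dict from a list of undirected edges."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj

# Hypothetical three-snapshot EGS: a path a-b-c-d, then vertex a gains edges.
egs = [
    undirected([("a", "b"), ("b", "c"), ("c", "d")]),
    undirected([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]),
    undirected([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("a", "d")]),
]

# One centrality value per snapshot shows how the importance of a changes.
closeness_over_time = [closeness(g, "a") for g in egs]
```

In this toy sequence the closeness of a rises from snapshot to snapshot as it acquires direct edges to more vertices.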



4

Motivation


How does the shortest path between a and e change?

Evolving Graph Sequence (EGS)



5

Example Study on Facebook EGS: Shortest Path Query

The shortest-path distances between two particular Facebook users over a one-year period (365 snapshots)

[Figure: shortest-path distance (0 to 4.5) versus snapshot number (0 to 400); the labels 365, 304, 186 and 178 appear on the plot]

Key moments:

Their distance changed

How did they get closer?

6

Problem Definition

Evolving Graph Sequence (EGS)

Problem: Given a query (e.g., shortest path between a and e), find the solution for each snapshot in the EGS.
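
One way to write the problem down, using notation assumed here (the symbols $\mathcal{G}$, $G_i$, $n$ and $q$ do not appear in the slides): an EGS is a sequence of $n$ snapshot graphs, and the answer to a query $q$ is the sequence of its per-snapshot answers.

```latex
\mathcal{G} = \langle G_1, G_2, \ldots, G_n \rangle, \qquad
\mathrm{answer}(q, \mathcal{G}) = \langle q(G_1), q(G_2), \ldots, q(G_n) \rangle
```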





7

Issues of Querying EGS

We are interested in EGSs whose snapshot graphs are:

a) Large

b) Numerous

c) Gradually evolving

We need:

Efficient algorithms to process queries on EGSs

Effective storage models to store EGSs

Example: Facebook EGS

a) 60,000 vertices, 900,000 edges

b) 365 snapshots

c) 99%+ edges in common

8

Outline


Introduction


Solution framework


Storage models


Experimental evaluation


Conclusions

9

Baseline Algorithm


Baseline algorithm: run a traditional algorithm directly on each snapshot in an EGS

E.g., breadth-first search for shortest path query

Not efficient

Graphs in an EGS are usually large and numerous

Our goal: Exploit graph redundancies in an EGS to make query processing faster
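
The baseline described above can be sketched as follows: one breadth-first search per snapshot, with no work shared between snapshots. The adjacency-dict representation, the function names, and the toy snapshots are assumptions made for illustration.

```python
from collections import deque

def bfs_shortest_path(adj, source, target):
    """Return one shortest path as a list of vertices, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj.get(u, ())):  # sorted only to make ties deterministic
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def baseline(egs, source, target):
    """Run BFS independently on every snapshot of the EGS."""
    return [bfs_shortest_path(adj, source, target) for adj in egs]

# Hypothetical EGS in which a and e move closer over time.
g1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "e"}, "e": {"c"}}
g2 = {"a": {"b"}, "b": {"a", "c", "e"}, "c": {"b", "e"}, "e": {"b", "c"}}
g3 = {"a": {"b", "e"}, "b": {"a", "c", "e"}, "c": {"b", "e"}, "e": {"a", "b", "c"}}

paths = baseline([g1, g2, g3], "a", "e")
```

The cost is one full traversal per snapshot, which is what makes this approach expensive when snapshots are large and numerous.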

10

Find-Verify-Fix (FVF) Framework

An EGS

11

Find-Verify-Fix (FVF) Framework









12

Preprocessing: Construct Representative Graphs
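
The figure for this slide did not survive extraction. A minimal sketch, assuming that the representative graphs of a cluster are the edge-wise intersection and the edge-wise union of its snapshots (the names `g_cap` and `g_cup` and the edge-set representation are hypothetical):

```python
def edge(u, v):
    """Canonical form of an undirected edge."""
    return tuple(sorted((u, v)))

def representative_graphs(cluster):
    """cluster: non-empty list of snapshots, each a set of canonical edges.

    Returns (g_cap, g_cup): the edges present in every snapshot and the
    edges present in at least one snapshot.
    """
    g_cap = set.intersection(*cluster)
    g_cup = set.union(*cluster)
    return g_cap, g_cup

# Hypothetical cluster of three gradually evolving snapshots.
cluster = [
    {edge("a", "b"), edge("b", "c"), edge("c", "e")},
    {edge("a", "b"), edge("b", "c"), edge("c", "e"), edge("b", "e")},
    {edge("a", "b"), edge("b", "c"), edge("c", "e"), edge("b", "e"), edge("a", "e")},
]
g_cap, g_cup = representative_graphs(cluster)
```

Under this assumption every snapshot in the cluster contains all edges of `g_cap` and only edges of `g_cup`, which is what allows one result computed on a representative graph to say something about all snapshots at once.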

13

Preprocessing: Cluster Analysis

Segmentation clustering algorithm:


A cluster consists of successive snapshots


A cluster satisfies:



EGS
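
The cluster condition on this slide did not survive extraction, so the test used below is an assumption: a run of successive snapshots is kept together while the similarity between its intersection graph and its union graph stays at or above a threshold. The Dice-style similarity, the greedy left-to-right strategy, and the names `ges`, `segment` and `sigma` are all illustrative choices.

```python
def ges(e1, e2):
    """Assumed similarity of two edge sets: 2|E1 ∩ E2| / (|E1| + |E2|)."""
    total = len(e1) + len(e2)
    return 2 * len(e1 & e2) / total if total else 1.0

def segment(egs, sigma):
    """Greedily split an EGS into clusters of successive snapshots.

    egs: list of edge sets. Returns (first_index, last_index) pairs.
    """
    if not egs:
        return []
    clusters = []
    start = 0
    g_cap, g_cup = set(egs[0]), set(egs[0])
    for i in range(1, len(egs)):
        new_cap, new_cup = g_cap & egs[i], g_cup | egs[i]
        if ges(new_cap, new_cup) >= sigma:
            g_cap, g_cup = new_cap, new_cup      # snapshot i joins the cluster
        else:
            clusters.append((start, i - 1))      # close the current cluster
            start = i
            g_cap, g_cup = set(egs[i]), set(egs[i])
    clusters.append((start, len(egs) - 1))
    return clusters

# Hypothetical EGS with an abrupt change between snapshots 2 and 3.
egs = [
    {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")},
    {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "e")},
    {("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")},
    {("a", "x"), ("x", "y"), ("y", "z")},
    {("a", "x"), ("x", "y"), ("y", "z"), ("a", "z")},
]
loose = segment(egs, 0.7)
strict = segment(egs, 0.8)
```

With this toy input a higher threshold yields more clusters, each holding fewer snapshots.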

14

Query Processing Phase


Types of queries we use FVF to solve:


Shortest path


Closeness centrality


Graph diameter


15

Shortest Path Query Processing

FIND

Representative Solutions

16

Shortest Path Query Processing

VERIFY

Representative Solutions

Bounding property:

17

Shortest Path Query Processing

VERIFY

Representative Solutions




18

Shortest Path Query Processing

VERIFY

Representative Solutions






19

Shortest Path Query Processing

FIX
Representative Solutions
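
The find, verify and fix steps for a shortest path query can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the representative graphs are taken to be the edge-wise intersection (`g_cap`) and union (`g_cup`) of a cluster, the bounding property is taken to be d(g_cup) <= d(snapshot) <= d(g_cap), and the fix step simply reruns BFS on the snapshot. All names are hypothetical.

```python
from collections import deque

def edge(u, v):
    """Canonical form of an undirected edge."""
    return tuple(sorted((u, v)))

def shortest_path(edges, s, t):
    """BFS shortest path over a set of undirected edges, or None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj.get(u, ())):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def fvf_shortest_path(cluster, s, t):
    """Return (one answer per snapshot, number of snapshots that needed a fix)."""
    g_cap = set.intersection(*cluster)
    g_cup = set.union(*cluster)
    # FIND: solve the query once on each representative graph.
    p_cup = shortest_path(g_cup, s, t)
    if p_cup is None:                      # unreachable even in the union
        return [None] * len(cluster), 0
    p_cap = shortest_path(g_cap, s, t)
    cup_edges = {edge(u, v) for u, v in zip(p_cup, p_cup[1:])}
    tight = p_cap is not None and len(p_cap) == len(p_cup)
    answers, fixes = [], 0
    for g in cluster:
        # VERIFY: a union-graph path whose edges all exist in g is optimal in g;
        # an intersection-graph path exists in every g and is optimal if it
        # matches the lower bound given by the union graph.
        if cup_edges <= g:
            answers.append(p_cup)
        elif tight:
            answers.append(p_cap)
        else:
            # FIX: fall back to solving this snapshot directly.
            answers.append(shortest_path(g, s, t))
            fixes += 1
    return answers, fixes

# Hypothetical cluster: edge b-e appears from the second snapshot on.
base = {edge("a", "b"), edge("b", "c"), edge("c", "e")}
cluster = [base, base | {edge("b", "e")}, base | {edge("b", "e")}]
answers, fixes = fvf_shortest_path(cluster, "a", "e")
```

Here two of the three snapshots are answered by verification alone, and only the first one needs its own traversal.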

20

Outline


Introduction


Solution framework


Storage models


Experimental evaluation


Conclusions

21

EGS Storage Models


Wikipedia dataset (365 snapshots, >1M articles, >20M hyperlinks)

Space cost: more than 365 × 20M = 7.3 billion hyperlinks!

Aims of storage models:

1) Compress data to fit in memory

2) Support the application of the FVF algorithm framework

Effectiveness of our storage models:

50M hyperlinks for the baseline algorithm,

100M hyperlinks for the FVF algorithm,

compared to 7.3 billion hyperlinks without compression!
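
The arithmetic behind these figures can be checked in a few lines. The variable names are illustrative, and because the slide gives the per-snapshot count as ">20M", the results are rough lower-bound estimates rather than exact ratios.

```python
snapshots = 365
links_per_snapshot = 20_000_000          # slide says ">20M", so a lower bound

# Storing every snapshot in full.
uncompressed = snapshots * links_per_snapshot

# Hyperlinks stored by the storage models, as reported on the slide.
baseline_stored = 50_000_000
fvf_stored = 100_000_000

# Approximate reduction factors relative to storing every snapshot in full.
baseline_ratio = uncompressed / baseline_stored
fvf_ratio = uncompressed / fvf_stored
```

So the stored size is roughly two orders of magnitude smaller than the uncompressed sequence in both cases.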

22

Experimental Evaluation


Datasets


Real datasets


Facebook-friendship


YouTube


Wikipedia


Synthetic datasets


FVF vs. Baseline

Baseline: execute a graph algorithm on each snapshot independently

Settings

C++, Linux, CPU: 2.83 GHz dual core, memory: 4 GB


23

Experimental Evaluation

Average graph edit similarity (ges) between successive snapshots


Dataset statistics
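
The slides do not give the formula for ges, and the statistics table itself was lost in extraction. As a hedged sketch, the code below assumes a Dice-style ratio over edge sets, 2|E1 ∩ E2| / (|E1| + |E2|), and averages it over successive snapshot pairs. The function names and the toy snapshots are hypothetical.

```python
def ges(e1, e2):
    """Assumed graph edit similarity of two edge sets (1.0 means identical)."""
    total = len(e1) + len(e2)
    return 2 * len(e1 & e2) / total if total else 1.0

def average_successive_ges(egs):
    """Mean similarity over all pairs of successive snapshots (needs >= 2)."""
    pairs = list(zip(egs, egs[1:]))
    return sum(ges(a, b) for a, b in pairs) / len(pairs)

# Hypothetical EGS: one edge is added, then one edge is removed.
egs = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")},
    {("a", "b"), ("b", "c"), ("a", "d")},
]
avg = average_successive_ges(egs)
```

A value close to 1 indicates a gradually evolving sequence, the setting in which sharing work across snapshots is expected to pay off.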

24

Experimental Evaluation - Shortest Path Queries

500 queries

25

Experimental Evaluation - Shortest Path Queries

FBFriend dataset

A cluster satisfies:

[Figure: number of clusters (0 to 50) versus similarity threshold (0.4 to 1)]

1. Fewer graphs in a cluster

2. More clusters

Find Time

VF-Time

Residual-SPA Time

26

Experimental Evaluation - Shortest Path Queries

FBFriend dataset

[Figure: time in seconds (0 to 1.5) versus similarity threshold (0.4 to 1); series: FVF, Find Time, VF Time, Residual-SPA Time, Decompression Time]

[Figure: number of clusters (0 to 50) versus similarity threshold (0.4 to 1)]

1. Fewer graphs in a cluster

2. More clusters

27

Experimental Evaluation - Shortest Path Queries

FBFriend dataset

[Figure: time in seconds (0 to 1.5) versus similarity threshold (0.4 to 1); series: FVF, Find Time, Residual-SPA Time]

[Figure: number of clusters (0 to 50) versus similarity threshold (0.4 to 1)]

1. Fewer graphs in a cluster

2. More clusters

28

Experimental Evaluation - Shortest Path Queries

FBFriend dataset

[Figure: speedup (0 to 10) versus similarity threshold (0.4 to 1)]

29

Experimental Evaluation - Closeness Centrality Queries

FBFriend dataset

[Figure: speedup (0 to 10) versus similarity threshold (0.4 to 1)]

30

Conclusions


We proposed evolving graph sequences (EGSs) to model how the world evolves

We demonstrated that interesting information can be obtained by posing queries on EGSs

We introduced the find-verify-fix (FVF) framework to query EGSs

We discussed how to store EGSs

Experiments showed that our FVF framework is efficient and that interesting information can be unveiled

31

Thank you!

Chenghui Ren$, Eric Lo*, Ben Kao$, Xinjie Zhu$, Reynold Cheng$

$ The University of Hong Kong
$ {chren, kao, xjzhu, ckcheng}@cs.hku.hk

* The Hong Kong Polytechnic University
* ericlo@comp.polyu.edu.hk

32

Related Work


Distance-based queries on a single large graph [F. Wei 2010, Y. Xiao 2009]

Our work focuses on processing queries on an evolving graph sequence

Graph database [D. Shasha 2002, X. Yan 2005]

Different: Their work usually supports only graph queries (e.g. sub/super-graph query)

Similar: Both aim to minimize the number of expensive graph operations

Time-dependent graph [B. Ding 2008]


Our work is different in two ways:


Node set is not fixed


Find answers on all snapshots