Frequent subgraph discovery


Nov 20, 2013


HAKAN KARDES



Data mining has emerged as a critical tool for
knowledge discovery in large data sets.



It has been extensively used to analyze business,
financial, and textual data sets.



The success of these techniques has renewed
interest in applying them to various scientific and
engineering fields.



Astronomy



Life Sciences



Ecosystem Modeling



Structural Mechanics







Most existing data mining algorithms assume that
the data is represented as one of:


Transactions (set of items)


Sequence of items or events


Multi-dimensional vectors


Time series


Scientific datasets with structures, layers, hierarchy,
geometry, and arbitrary relations cannot be accurately
modeled using such frameworks.


e.g., Numerical simulations, 3D protein structures, chemical
compounds, etc.


Need algorithms that operate on scientific
datasets in their native representation



There are two basic choices.


Treat each dataset/application differently and
develop custom representations/algorithms.


Employ a new way of modeling such datasets and
develop algorithms that span across different
applications!



What should be the properties of this general
modeling framework?


Abstract compared with the original raw data.


Yet powerful enough to capture the important
characteristics.


Labeled directed/undirected
topological/geometric graphs and hypergraphs

Graphs are suitable for capturing arbitrary relations
between the various elements.

Graph instance: data instance

Vertex: element

Vertex label: element’s attributes

Edge: relation between two elements

Edge label: type of relation

Hyper edge: relation between a set of elements

Graphs provide enormous flexibility for modeling the underlying data, as they allow the
modeler to decide what the elements should be and which types of relations
are modeled.
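The mapping between data and graph components can be sketched as a small data structure. This is only an illustrative sketch: the class name `LabeledGraph`, its fields, and the three-element example are assumptions made here, not something taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph; one graph instance models one data instance."""
    # vertex id -> vertex label (the element's attributes)
    vertex_labels: dict = field(default_factory=dict)
    # frozenset({u, v}) -> edge label (the type of relation)
    edge_labels: dict = field(default_factory=dict)

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added first")
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        """Sorted ids of the vertices that share an edge with v."""
        return sorted(w for e in self.edge_labels if v in e for w in e if w != v)


# Hypothetical data instance: three elements joined by two kinds of relation.
g = LabeledGraph()
g.add_vertex(0, "beta")
g.add_vertex(1, "alpha")
g.add_vertex(2, "beta")
g.add_edge(0, 1, "backbone")
g.add_edge(1, 2, "backbone")
g.add_edge(0, 2, "contact")
```

What counts as a vertex and what counts as an edge label is a modeling decision; the same raw data could be encoded with different elements and relations.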


PDB 1MWP: N-Terminal Domain of the Amyloid Precursor Protein
(Alzheimer's disease amyloid A4 protein precursor)

[Figure: graph of the protein's α and β elements as vertices, connected by Backbone and Contact edges]



Develop algorithms to mine and analyze
graph data sets.



Finding patterns in these graphs



Finding groups of similar graphs (clustering)



Building predictive models for the graphs
(classification)


Structural motif discovery


High-throughput screening


Protein fold recognition


VLSI reverse engineering



A lot more …


Beyond Scientific Applications


Semantic web


Mining relational profiles


Behavioral modeling


Intrusion detection


Citation analysis





Approach #1: Frequent Subgraph Mining


Find all subgraphs g within a set of graph transactions G such that

$$\frac{\mathrm{freq}(g)}{|G|} \geq t$$

where t is the minimum support.

Focus on pruning and fast, code-based graph matching.
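The support condition can be made concrete with a brute-force sketch. Everything here is an assumption for illustration: a graph is a pair of a vertex-label list and an edge list, `contains` tries every vertex assignment (exponential, so only usable on tiny graphs), and freq(g) is taken to be the number of transactions that contain g. Practical miners rely on pruning and code-based matching instead.

```python
from itertools import permutations


def contains(graph, pattern):
    """True if pattern occurs in graph as a (not necessarily induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    g_edge_set = {frozenset(e) for e in g_edges}
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(range(len(g_labels)), len(p_labels)):
        labels_ok = all(p_labels[i] == g_labels[image[i]]
                        for i in range(len(p_labels)))
        edges_ok = all(frozenset((image[u], image[v])) in g_edge_set
                       for u, v in p_edges)
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, transactions):
    """freq(g) / |G|: fraction of transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in transactions) / len(transactions)


def is_frequent(pattern, transactions, t):
    return support(pattern, transactions) >= t


# Hypothetical transaction set of three small graphs.
db = [
    (["A", "B", "B"], [(0, 1), (1, 2)]),   # path A-B-B
    (["A", "B"], [(0, 1)]),                # single edge A-B
    (["B", "B"], [(0, 1)]),                # single edge B-B
]
ab = (["A", "B"], [(0, 1)])
print(support(ab, db))           # A-B occurs in 2 of the 3 transactions
print(is_frequent(ab, db, 0.5))  # True
```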

Approach #1: Algorithms


Apriori-based Graph Mining (AGM)

Inokuchi, Washio & Motoda (Osaka U., Japan)


Frequent Sub-Graph discovery (FSG)

Kuramochi & Karypis (U. Minnesota)


Graph-based Substructure pattern mining (gSpan)

Yan & Han (UIUC)


Fast Frequent Subgraph Mining (FFSM), Spanning tree based maximal graph mining (Spin)

Huan, Wang & Prins (UNC Chapel Hill)


Graph, Sequences and Tree extraction (Gaston)

Kazius & Nijssen (U. Leiden, Netherlands)


A pattern is a relation between an object’s elements that
recurs again and again.


Common structures in a family of chemical compounds
or proteins.


Similar arrangements of vortices at different “instances”
of turbulent fluid flows.






There are two general ways to formally define a pattern in
the context of graphs:

Arbitrary subgraphs (connected or not)

Induced subgraphs (connected or not)
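The difference between the two definitions can be shown with a short sketch. It assumes vertex identities are fixed (no isomorphism search) and ignores labels; the function names are hypothetical. An arbitrary subgraph may keep any subset of the edges among its vertices, while an induced subgraph must keep all of them.

```python
def normalize(edges):
    """Undirected edges as a set of frozensets."""
    return {frozenset(e) for e in edges}


def is_subgraph(sub_vertices, sub_edges, graph_edges):
    """Arbitrary subgraph: every pattern edge exists in the graph."""
    sub, full = normalize(sub_edges), normalize(graph_edges)
    return all(e <= set(sub_vertices) for e in sub) and sub <= full


def is_induced_subgraph(sub_vertices, sub_edges, graph_edges):
    """Induced subgraph: the pattern holds exactly the graph's edges
    among the chosen vertices."""
    sub, full = normalize(sub_edges), normalize(graph_edges)
    induced = {e for e in full if e <= set(sub_vertices)}
    return sub == induced


triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]

# The path 0-1-2 is a subgraph of the triangle, but not an induced one,
# because the triangle also has the edge 0-2 among those vertices.
print(is_subgraph({0, 1, 2}, path, triangle))          # True
print(is_induced_subgraph({0, 1, 2}, path, triangle))  # False
```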



Frequent pattern discovery translates to frequent
subgraph discovery…


Candidate generation



Candidate pruning



Frequency counting



Key to FSG’s computational efficiency

Simple operations become complicated & expensive when
dealing with graphs…


Multiple candidates for the same core!

Multiple cores between two (k-1)-subgraphs

[Figure: first core and second core shared by two (k-1)-subgraphs]




























[Figure: adjacency matrices for two orderings of six vertices v0 to v5 (v0 to v3 labeled B; v4 and v5 labeled A)]

Label = “1 01 011 0001 00010”

Label = “1 11 100 1000 01000”
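The two label strings illustrate that the same adjacency structure yields different codes under different vertex orderings. The sketch below is an assumption-laden illustration, not FSG's actual procedure: it reads the upper triangle of the adjacency matrix column by column, then picks the lexicographically largest (vertex-label string, code) pair over all orderings as the canonical label. Enumerating every permutation is exponential, so this only works for very small graphs.

```python
from itertools import permutations


def code(order, edges):
    """Concatenate upper-triangular adjacency entries, column by column."""
    edge_set = {frozenset(e) for e in edges}
    columns = []
    for j in range(1, len(order)):
        columns.append("".join(
            "1" if frozenset((order[i], order[j])) in edge_set else "0"
            for i in range(j)))
    return " ".join(columns)


def canonical_label(labels, edges):
    """Largest (vertex-label string, code) pair over all vertex orderings.
    Choosing the maximum is an assumed convention for this sketch."""
    best = None
    for order in permutations(range(len(labels))):
        candidate = ("".join(labels[v] for v in order), code(order, edges))
        if best is None or candidate > best:
            best = candidate
    return best


# Hypothetical three-vertex path A-B-B.
labels, edges = ["A", "B", "B"], [(0, 1), (1, 2)]
print(code((0, 1, 2), edges))  # 1 01
print(code((1, 0, 2), edges))  # 1 10

# The same path with its vertices numbered differently.
labels2, edges2 = ["B", "A", "B"], [(1, 0), (0, 2)]
print(canonical_label(labels, edges) == canonical_label(labels2, edges2))  # True
```

Because the canonical label does not depend on how the vertices were numbered, two graphs can be compared for isomorphism by comparing their labels.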

Graph Databases

1. Discover Frequent Sub-graphs

2. Select Discriminating Features

3. Transform Graphs in Feature Representation

4. Learn a Classification Model
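Steps 1 and 3 of this pipeline can be sketched in a few lines. To stay short, the sketch makes a strong simplifying assumption: the only patterns considered are single labeled edges, standing in for general frequent subgraphs. Feature selection (step 2) and model learning (step 4) are left out, and all function names are hypothetical.

```python
from collections import Counter


def edge_patterns(graph):
    """Label pairs of a graph's edges; a stand-in for the subgraphs it contains."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}


def frequent_edge_patterns(graphs, min_support):
    """Step 1: patterns contained in at least min_support of the graphs."""
    counts = Counter(p for g in graphs for p in edge_patterns(g))
    return sorted(p for p, c in counts.items() if c / len(graphs) >= min_support)


def to_feature_vectors(graphs, features):
    """Step 3: one binary vector per graph, 1 where the pattern is present."""
    return [[1 if f in edge_patterns(g) else 0 for f in features]
            for g in graphs]


# Hypothetical graph database.
graphs = [
    (["A", "B", "B"], [(0, 1), (1, 2)]),
    (["A", "B"], [(0, 1)]),
    (["B", "B"], [(0, 1)]),
]
features = frequent_edge_patterns(graphs, min_support=0.5)
vectors = to_feature_vectors(graphs, features)
print(features)  # [('A', 'B'), ('B', 'B')]
print(vectors)   # [[1, 1], [1, 0], [0, 1]]
```

The resulting vectors can be passed to any standard classifier, which is what makes the feature representation useful.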


Approach #2:


Find subgraph S within a set of one or more graphs G that maximally compresses G

$$\arg\max_{S} \frac{\mathrm{size}(G)}{\mathrm{size}(S) + \mathrm{size}(G|S)}$$

where (G|S) is G compressed by S, i.e., instances of S
in G replaced by a single vertex.

Focus on efficient subgraph generation and heuristic search.
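A small worked sketch of the compression value follows. It rests on several assumptions not stated in the slides: size is taken as vertices plus edges, the instances of S are given as disjoint vertex sets rather than discovered, and parallel edges created by the replacement are merged.

```python
def size(num_vertices, edges):
    """Assumed size measure: number of vertices plus number of edges."""
    return num_vertices + len(edges)


def compress(num_vertices, edges, instances):
    """Replace each instance (a set of vertex ids, assumed disjoint)
    by one new vertex and return the compressed graph."""
    owner = {}
    for k, inst in enumerate(instances):
        for v in inst:
            owner[v] = ("S", k)
    new_vertices = {owner.get(v, v) for v in range(num_vertices)}
    new_edges = set()
    for u, v in edges:
        a, b = owner.get(u, u), owner.get(v, v)
        if a != b:  # edges inside an instance vanish
            new_edges.add(frozenset((a, b)))
    return len(new_vertices), new_edges


def compression_value(g_vertices, g_edges, s_vertices, s_edges, instances):
    """size(G) / (size(S) + size(G|S))."""
    c_vertices, c_edges = compress(g_vertices, g_edges, instances)
    return size(g_vertices, g_edges) / (
        size(s_vertices, s_edges) + size(c_vertices, c_edges))


# Hypothetical G: two triangles joined by one bridging edge.
g_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Candidate S = triangle, with two instances.
tri_value = compression_value(6, g_edges, 3, [(0, 1), (1, 2), (0, 2)],
                              [{0, 1, 2}, {3, 4, 5}])
# Candidate S = single edge, with two instances.
edge_value = compression_value(6, g_edges, 2, [(0, 1)],
                               [{0, 1}, {3, 4}])
print(tri_value > edge_value)  # True: the triangle compresses G better
```

Here size(G) is 13; the triangle gives 13 / (6 + 3) and the single edge gives 13 / (3 + 7), so the triangle would be preferred.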


THE BASIC IDEA BEHIND THE GBI


PAIRWISE CHUNKING


Graphs provide a powerful mechanism to represent
relational and physical datasets.



Can be used as a quick prototyping tool to test
whether data-mining techniques can help a
particular application area and problem.



Their benefits can be realized if there exists an
extensive set of efficient and scalable algorithms to
mine them…



Takashi Matsuda, Hiroshi Motoda, Takashi Washio,
Graph-based induction and its applications, Advanced
Engineering Informatics, Volume 16, Issue 2, April 2002,
Pages 135-143.





Michihiro Kuramochi, George Karypis, "Frequent Subgraph Discovery,"
Data Mining, IEEE International Conference on, pp. 313, First IEEE International
Conference on Data Mining (ICDM'01), 2001.

THANK YOU FOR

YOUR ATTENTION

ANY QUESTIONS?