HAKAN KARDES
Data mining has emerged as a critical tool for knowledge discovery in large data sets.
• It has been extensively used to analyze business, financial, and textual data sets.
The success of these techniques has renewed interest in applying them to various scientific and engineering fields.
• Astronomy
• Life Sciences
• Ecosystem Modeling
• Structural Mechanics
• …
Most existing data mining algorithms assume that the data is represented via:
• Transactions (sets of items)
• Sequences of items or events
• Multi-dimensional vectors
• Time series
Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations cannot be accurately modeled using such frameworks.
• e.g., numerical simulations, 3D protein structures, chemical compounds, etc.
We need algorithms that operate on scientific datasets in their native representation.
There are two basic choices:
• Treat each dataset/application differently and develop custom representations/algorithms.
• Employ a new way of modeling such datasets and develop algorithms that span across different applications!
What should be the properties of this general modeling framework?
• Abstract compared with the original raw data.
• Yet powerful enough to capture the important characteristics.
Labeled directed/undirected topological/geometric graphs and hypergraphs
Graphs are suitable for capturing arbitrary relations between the various elements.
Data Instance → Graph Instance
Element → Vertex
Element’s Attributes → Vertex Label
Relation Between Two Elements → Edge
Type Of Relation → Edge Label
Relation Between a Set of Elements → Hyper Edge
Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations are modeled.
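As an illustration of this modeling choice, here is a minimal Python sketch of a labeled graph container. The class name `LabeledGraph`, its methods, and the water-molecule example are hypothetical and not taken from the slides; they only show how elements, attributes, relations, and relation types could map to vertices, vertex labels, edges, and edge labels.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph with optional hyperedges (illustrative only)."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label
    hyper_edges: list = field(default_factory=list)    # (frozenset of ids, label)

    def add_vertex(self, vid, label):
        # A vertex is an element; its label carries the element's attributes.
        self.vertex_labels[vid] = label

    def add_edge(self, u, v, label):
        # An edge is a relation between two elements; its label is the relation type.
        self.edge_labels[frozenset((u, v))] = label

    def add_hyper_edge(self, vids, label):
        # A hyperedge relates a whole set of elements at once.
        self.hyper_edges.append((frozenset(vids), label))


# One data instance (here a water molecule, chosen by the modeler) becomes one graph instance.
water = LabeledGraph()
for vid, element in [("o", "O"), ("h1", "H"), ("h2", "H")]:
    water.add_vertex(vid, element)
water.add_edge("o", "h1", "single-bond")
water.add_edge("o", "h2", "single-bond")
water.add_hyper_edge(["o", "h1", "h2"], "molecule")
```

A different modeler could equally treat bonds as vertices or add geometric distances as edge labels; the container does not force either choice.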
PDB: 1MWP, N-Terminal Domain of the Amyloid Precursor Protein (Alzheimer's disease amyloid A4 protein precursor)
[Figure: graph of the protein structure, with vertices labeled α or β and edges labeled Backbone or Contact]
Develop algorithms to mine and analyze graph data sets.
• Finding patterns in these graphs
• Finding groups of similar graphs (clustering)
• Building predictive models for the graphs (classification)
• Structural motif discovery
• High-throughput screening
• Protein fold recognition
• VLSI reverse engineering
• A lot more …
Beyond Scientific Applications
Semantic web
Mining relational profiles
Behavioral modeling
Intrusion detection
Citation analysis
…
Approach #1: Frequent Subgraph Mining
Find all subgraphs g within a set of graph transactions G such that
$$\frac{\mathrm{freq}(g)}{|G|} \geq t$$
where t is the minimum support.
Focus on pruning and fast, code-based graph matching.
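The support test can be sketched in a few lines of Python. This is a brute-force illustration, not any of the published algorithms: it assumes support is the fraction of graph transactions that contain the pattern, tries every vertex mapping, and ignores edge labels. The helper names (`make_graph`, `contains`, `support`) and the toy transactions are hypothetical.

```python
from itertools import permutations


def make_graph(vertex_labels, edges):
    """Return (labels, edge set); each edge is stored as a frozenset of two ids."""
    return dict(vertex_labels), {frozenset(e) for e in edges}


def contains(graph, pattern):
    """True if pattern occurs in graph as a (not necessarily induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
        edges_ok = all(frozenset(m[v] for v in e) in g_edges for e in p_edges)
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, transactions):
    """Fraction of graph transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in transactions) / len(transactions)


transactions = [
    make_graph({0: "C", 1: "O", 2: "H"}, [(0, 1), (0, 2)]),
    make_graph({0: "C", 1: "O"}, [(0, 1)]),
    make_graph({0: "C", 1: "H"}, [(0, 1)]),
]
c_o = make_graph({0: "C", 1: "O"}, [(0, 1)])
o_h = make_graph({0: "O", 1: "H"}, [(0, 1)])

t = 0.5  # minimum support
frequent = [p for p in (c_o, o_h) if support(p, transactions) >= t]
```

The exhaustive search over mappings is exponential in the pattern size, which is why practical miners concentrate on pruning and fast matching.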
Approach #1: Algorithms
• Apriori-based Graph Mining (AGM): Inokuchi, Washio & Motoda (Osaka U., Japan)
• Frequent Sub-Graph discovery (FSG): Kuramochi & Karypis (U. Minnesota)
• Graph-based Substructure pattern mining (gSpan): Yan & Han (UIUC)
• Fast Frequent Subgraph Mining (FFSM), Spanning tree based maximal graph mining (Spin): Huan, Wang & Prins (UNC Chapel Hill)
• Graph, Sequences and Tree extraction (Gaston): Kazius & Nijssen (U. Leiden, Netherlands)
A pattern is a relation between the object’s elements that recurs over and over again.
• Common structures in a family of chemical compounds or proteins.
• Similar arrangements of vortices at different “instances” of turbulent fluid flows.
• …
There are two general ways to formally define a pattern in the context of graphs:
• Arbitrary subgraphs (connected or not)
• Induced subgraphs (connected or not)
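The difference between the two definitions can be made concrete with a small Python sketch. It uses the standard graph-theoretic meanings: an arbitrary subgraph may keep any subset of the edges among its chosen vertices, while an induced subgraph must keep all of them. The function names and the four-vertex example are hypothetical, and the check compares vertex ids directly rather than searching for an isomorphism.

```python
def edge(u, v):
    """Undirected edge as a frozenset of its two endpoints."""
    return frozenset((u, v))


def is_subgraph(sub_v, sub_e, g_v, g_e):
    """Arbitrary subgraph: vertices and edges are subsets of the graph's."""
    return sub_v <= g_v and sub_e <= g_e and all(e <= sub_v for e in sub_e)


def is_induced_subgraph(sub_v, sub_e, g_v, g_e):
    """Induced subgraph: must contain every graph edge among its vertices."""
    induced_edges = {e for e in g_e if e <= sub_v}
    return sub_v <= g_v and sub_e == induced_edges


# A triangle a-b-c with a pendant vertex d attached to c.
g_v = {"a", "b", "c", "d"}
g_e = {edge("a", "b"), edge("b", "c"), edge("c", "a"), edge("c", "d")}

# The path a-b-c drops the edge c-a, so it is a subgraph but not an induced one.
path_v = {"a", "b", "c"}
path_e = {edge("a", "b"), edge("b", "c")}
triangle_e = path_e | {edge("c", "a")}

path_is_subgraph = is_subgraph(path_v, path_e, g_v, g_e)
path_is_induced = is_induced_subgraph(path_v, path_e, g_v, g_e)
triangle_is_induced = is_induced_subgraph(path_v, triangle_e, g_v, g_e)
```

Neither check requires connectivity, matching the "connected or not" qualifier on both definitions.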
Frequent pattern discovery translates to frequent subgraph discovery…
Key to FSG’s computational efficiency:
• Candidate generation
• Candidate pruning
• Frequency counting
Simple operations become complicated & expensive when dealing with graphs…
[Figure: two difficulties when combining (k-1)-subgraphs: multiple candidates for the same core, and multiple cores (first core, second core) between two (k-1)-subgraphs]
[Figure: a six-vertex labeled graph (v0 to v3 labeled B, v4 and v5 labeled A) and two adjacency matrices of the same graph under different vertex orderings. Each ordering yields a different code:]
Label = “1 01 011 0001 00010”
Label = “1 11 100 1000 01000”
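The two codes above have groups of length 1, 2, 3, 4, and 5, which is what one gets by reading the upper triangle of a 6×6 adjacency matrix column by column. The Python sketch below builds such a code for a given vertex ordering and then picks one canonical representative by brute force over all orderings. Prefixing the vertex labels and choosing the lexicographically largest code are assumptions made for illustration; the function names are hypothetical and this is not FSG's actual procedure.

```python
from itertools import permutations


def matrix_code(order, vlabels, edges):
    """Code for one vertex ordering: vertex labels, then upper-triangle columns."""
    pos = {v: i for i, v in enumerate(order)}
    n = len(order)
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[pos[u]][pos[v]] = adj[pos[v]][pos[u]] = 1
    # Column j contributes the entries in rows 0..j-1 (above the diagonal).
    cols = ["".join(str(adj[i][j]) for i in range(j)) for j in range(1, n)]
    return "".join(vlabels[v] for v in order) + " " + " ".join(cols)


def canonical_label(vlabels, edges):
    """Assumed convention: the largest code over all vertex orderings."""
    return max(matrix_code(p, vlabels, edges) for p in permutations(vlabels))


# The same path A-B-B written with two different vertex numberings.
g1 = ({0: "A", 1: "B", 2: "B"}, [(0, 1), (1, 2)])
g2 = ({0: "B", 1: "A", 2: "B"}, [(1, 2), (2, 0)])
# A different graph on the same labels: B-A-B.
g3 = ({0: "A", 1: "B", 2: "B"}, [(0, 1), (0, 2)])

same = canonical_label(*g1) == canonical_label(*g2)
different = canonical_label(*g1) != canonical_label(*g3)
```

Because isomorphic graphs share a canonical label, comparing labels replaces a pairwise isomorphism test. Trying all orderings is factorial in the number of vertices, so this sketch is only usable on very small graphs.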
Graph Databases
1. Discover Frequent Sub-graphs
2. Select Discriminating Features
3. Transform Graphs in Feature Representation
4. Learn a Classification Model
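The discovery and transformation steps of this pipeline can be sketched in Python under a strong simplifying assumption: the only patterns considered are single edges, identified by the labels of their two endpoints, so no subgraph isomorphism search is needed. The function names and toy graphs are hypothetical. Feature selection and the classifier itself are left out; any standard vector-based learner could consume the resulting vectors.

```python
from collections import Counter


def edge_features(graph):
    """Set of single-edge patterns, each a sorted pair of endpoint labels."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}


def frequent_patterns(graphs, min_support):
    """Patterns present in at least min_support (a fraction) of the graphs."""
    counts = Counter(p for g in graphs for p in edge_features(g))
    return sorted(p for p, c in counts.items() if c / len(graphs) >= min_support)


def to_feature_vector(graph, patterns):
    """Binary vector: 1 where the graph contains the pattern, else 0."""
    present = edge_features(graph)
    return [1 if p in present else 0 for p in patterns]


graphs = [
    ({0: "C", 1: "O", 2: "H"}, [(0, 1), (0, 2)]),
    ({0: "C", 1: "O"}, [(0, 1)]),
    ({0: "C", 1: "N"}, [(0, 1)]),
]

patterns = frequent_patterns(graphs, min_support=0.3)
vectors = [to_feature_vector(g, patterns) for g in graphs]
```

With larger frequent subgraphs as features, only `edge_features` would change: it would need a real subgraph containment test in place of the label-pair lookup.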
Approach #2:
• Find subgraph S within a set of one or more graphs G that maximally compresses G:
$$\arg\max_{S} \frac{\mathrm{size}(G)}{\mathrm{size}(S) + \mathrm{size}(G|S)}$$
• where (G|S) is G compressed by S, i.e., instances of S in G replaced by a single vertex
Focus on efficient subgraph generation and heuristic search.
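A small worked example helps show how candidate substructures would be ranked by compression. The Python sketch below assumes the score is size(G) divided by size(S) plus the size of G after compression by S, and assumes size means vertices plus edges; both are illustrative assumptions, as are all the numbers and names.

```python
def size(graph):
    """Assumed size measure: number of vertices plus number of edges."""
    num_vertices, num_edges = graph
    return num_vertices + num_edges


def compression_value(g, s, g_compressed):
    """Higher is better: S is small and leaves a small compressed graph."""
    return size(g) / (size(s) + size(g_compressed))


# Each graph is summarized as (vertex count, edge count); values are made up.
G = (12, 14)
candidates = {
    # name: (substructure S, G after replacing each instance of S by one vertex)
    "triangle": ((3, 3), (6, 6)),
    "edge": ((2, 1), (9, 10)),
}

scores = {name: compression_value(G, s, gs) for name, (s, gs) in candidates.items()}
best = max(scores, key=scores.get)
```

Here the triangle scores 26 / 18 and the single edge scores 26 / 22, so the triangle is preferred. A real search would generate candidates and compute the compressed graph itself, rather than take the counts as given.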
THE BASIC IDEA BEHIND THE GBI
PAIRWISE CHUNKING
Graphs provide a powerful mechanism to represent relational and physical datasets.
Can be used as a quick prototyping tool to test out whether or not data-mining techniques can help a particular application area and problem.
Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…
Takashi Matsuda, Hiroshi Motoda, Takashi Washio, "Graph-based induction and its applications," Advanced Engineering Informatics, Volume 16, Issue 2, April 2002, Pages 135-143.
Michihiro Kuramochi, George Karypis, "Frequent Subgraph Discovery," First IEEE International Conference on Data Mining (ICDM'01), pp. 313, 2001.
THANK YOU FOR YOUR ATTENTION
ANY QUESTIONS?