1
Graph Mining Applications to
Machine Learning Problems
Max Planck Institute for Biological Cybernetics
Koji Tsuda
2
Graphs
…
3
Graph Structures in Biology
DNA Sequence (figure: nodes A, C, G, C)
RNA (figure: nodes CG, CG, U, U, U, U, UA)
Texts in literature (figure: "Amitriptyline inhibits adenosine uptake")
Compounds (figure: a molecule with C, O and H atoms)
4
Substructure Representation
0/1 vector of pattern indicators
Huge dimensionality!
Need Graph Mining for selecting features
Better than paths (Marginalized graph kernels)
5
Overview
Quick Review of Graph Mining
EM-based clustering algorithm
Mixture model with L1 feature selection
Graph Boosting
Supervised Regression for QSAR Analysis
Linear programming meets graph mining
6
Quick Review of Graph Mining
7
Graph Mining
Analysis of Graph Databases
Find all patterns satisfying predetermined conditions
Frequent Substructure Mining
Combinatorial, Exhaustive
Recently developed algorithms
AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
8
Graph Mining
Frequent Substructure Mining
Enumerate all patterns that occur in at least m graphs
$x_{ik} \in \{0, 1\}$: indicator of pattern k in graph i
Support(k): number of graphs in which pattern k occurs
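As a small illustration of the definition above, the following sketch computes supports from a 0/1 indicator table and keeps the patterns that reach a threshold m. The pattern names and the toy table are hypothetical; a real miner obtains the indicators by subgraph matching rather than from a precomputed table.

```python
# Toy indicator table: indicators[i][k] is 1 if pattern k occurs in graph i.
# Pattern names and values are made up for illustration.
indicators = [
    {"C-C": 1, "C-O": 1, "C-C-O": 1},
    {"C-C": 1, "C-O": 0, "C-C-O": 0},
    {"C-C": 1, "C-O": 1, "C-C-O": 0},
]

def support(pattern, table):
    """Number of graphs in which the pattern occurs."""
    return sum(row[pattern] for row in table)

def frequent_patterns(table, m):
    """Patterns that occur in at least m graphs, in sorted order."""
    patterns = table[0].keys()
    return sorted(p for p in patterns if support(p, table) >= m)

print(frequent_patterns(indicators, 2))  # ['C-C', 'C-O']
```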
9
gSpan (Yan and Han, 2002)
Efficient Frequent Substructure Mining Method
DFS Code
Efficient detection of isomorphic patterns
We extend gSpan for our work
10
Enumeration on Tree-shaped Search Space
Each node has a pattern
Generate nodes from the root:
Add an edge at each step
11
Tree Pruning
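Anti-monotonicity: the support of a pattern never exceeds the support of any of its subpatterns
If support(g) < m, stop exploring! Patterns that extend g are not generated
Support(g): number of graphs in which pattern g occurs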
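The enumeration and pruning described above can be sketched on a simpler pattern class. The code below uses item sets as a stand-in for subgraphs (an assumption made to keep the example short): each search node holds a pattern, children add one item (the analogue of adding one edge), and a branch is cut as soon as its support drops below m. Adding items in sorted order plays the role that DFS codes play in gSpan, namely making sure each pattern is generated once.

```python
def support(pattern, database):
    """Number of transactions (stand-ins for graphs) that contain the pattern."""
    return sum(1 for items in database if pattern <= items)

def mine_frequent(database, m):
    """Depth-first enumeration with anti-monotone pruning."""
    alphabet = sorted(set().union(*database))
    found = {}

    def expand(pattern, start):
        for idx in range(start, len(alphabet)):
            child = pattern | {alphabet[idx]}
            s = support(child, database)
            if s < m:
                continue  # prune: no superpattern of child can be frequent
            found[frozenset(child)] = s
            expand(child, idx + 1)

    expand(frozenset(), 0)
    return found

# Hypothetical toy database of four "graphs".
database = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
frequent = mine_frequent(database, 2)
for pattern, s in frequent.items():
    print(sorted(pattern), s)
```

With m = 2 the patterns {b, c} and {a, b, c} are never reported, and the subtree below {b, c} is never visited.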
12
Discriminative patterns: Weighted Substructure Mining
w_i > 0: positive class
w_i < 0: negative class
Weighted Substructure Mining
Patterns with large frequency difference between the classes
Not anti-monotonic: use a bound
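A minimal sketch of how a bound can replace anti-monotonicity, under stated assumptions: patterns are item sets rather than subgraphs, and the score of a pattern is taken to be |sum of w_i over the graphs containing it|, which may differ in detail from the score used in the talk. Because a superpattern occurs only in a subset of the graphs containing the current pattern, its score cannot exceed the larger of the positive weight mass and the negative weight mass on those graphs.

```python
def weighted_score(pattern, database, weights):
    """Signed score: sum of w_i over the graphs that contain the pattern."""
    return sum(w for items, w in zip(database, weights) if pattern <= items)

def score_bound(pattern, database, weights):
    """Upper bound on |score| of every superpattern of `pattern`."""
    pos = sum(w for items, w in zip(database, weights) if w > 0 and pattern <= items)
    neg = sum(-w for items, w in zip(database, weights) if w < 0 and pattern <= items)
    return max(pos, neg)

def best_weighted_pattern(database, weights):
    """Branch and bound search for the pattern with the largest |score|."""
    alphabet = sorted(set().union(*database))
    best = {"pattern": None, "score": 0.0, "visited": 0}

    def expand(pattern, start):
        for idx in range(start, len(alphabet)):
            child = pattern | {alphabet[idx]}
            best["visited"] += 1
            score = abs(weighted_score(child, database, weights))
            if score > best["score"]:
                best["pattern"], best["score"] = child, score
            if score_bound(child, database, weights) > best["score"]:
                expand(child, idx + 1)  # a superpattern might still do better

    expand(frozenset(), 0)
    return best

# Hypothetical data: the first two graphs are positive, the last two negative.
database = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
weights = [1.0, 1.0, -1.0, -1.0]
result = best_weighted_pattern(database, weights)
print(sorted(result["pattern"]), result["score"], result["visited"])
```

On this toy input the search returns {a, b} after scoring 5 of the 7 non-empty patterns; the remaining two are cut by the bound.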
13
Multiclass version
Multiple weight vectors, one per class
Each weight takes one value if the graph belongs to the class and another value otherwise
Search for patterns overrepresented in a class
14
EM-based clustering of graphs
Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960, 2006
15
EM-based graph clustering
Motivation
Learning a mixture model in the feature space of patterns
Basis for more complex probabilistic inference
L1 regularization & Graph Mining
E-step -> Mining -> M-step
16
Probabilistic Model
Binomial Mixture
Each Component
Mixing weight for each cluster
Feature vector of a graph (each entry 0 or 1)
Parameter vector for each cluster
17
Function to minimize
L1-regularized log likelihood
Baseline constant: ML parameter estimate using a single binomial distribution
In the solution, most parameters are exactly equal to the baseline constants
18
E-step
Active pattern: a pattern whose parameter differs from its baseline constant in at least one cluster
E-step computed only with active patterns (computable!)
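Why the E-step needs only the active patterns can be checked numerically. The sketch below assumes a mixture whose components are products of independent 0/1 variables with occurrence probabilities theta[l][k]; this parametrization, the variable names and all numbers are illustrative assumptions, not the exact form used in the talk. For a pattern whose parameter equals the baseline in every cluster, the likelihood factor is the same for all clusters and cancels when the posterior is normalized.

```python
import math

def responsibilities(x, mixing, theta, patterns):
    """Posterior cluster probabilities for one graph, using only `patterns`."""
    log_post = []
    for pi, th in zip(mixing, theta):
        lp = math.log(pi)
        for k in patterns:
            lp += math.log(th[k] if x[k] == 1 else 1.0 - th[k])
        log_post.append(lp)
    top = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

baseline = [0.5, 0.2, 0.7, 0.4]   # one baseline occurrence probability per pattern
theta = [
    [0.9, 0.2, 0.7, 0.4],         # cluster 0 deviates on pattern 0 only
    [0.5, 0.2, 0.1, 0.4],         # cluster 1 deviates on pattern 2 only
]
mixing = [0.6, 0.4]
active = [k for k in range(len(baseline))
          if any(th[k] != baseline[k] for th in theta)]

x = [1, 0, 1, 1]                  # pattern indicators of one graph
full = responsibilities(x, mixing, theta, range(len(baseline)))
restricted = responsibilities(x, mixing, theta, active)
print(active, full, restricted)
```

The two posteriors agree up to rounding, so the sum over an intractably large pattern set can be replaced by a sum over the active patterns.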
19
M-step
Putative cluster assignment by E-step
Each parameter is solved separately
Use graph mining to find active patterns
Then, solve it only for active patterns
20
Solution
Occurrence probability in a cluster
Overall occurrence probability
21
Important Observation
For an active pattern k, the occurrence probability in a graph cluster is significantly different from the average
22
Mining for Active Patterns F
F is rewritten in the following form
Active patterns can be found by graph
mining! (multiclass)
23
Experiments: RNA graphs
Stem as a node
Secondary structure by RNAfold
0/1 Vertex label (self loop or not)
24
Clustering RNA graphs
Three Rfam families
Intron GP I (Int, 30 graphs)
SSU rRNA 5 (SSU, 50 graphs)
RNase bact a (RNase, 50 graphs)
Three bipartition problems
Results evaluated by ROC scores (Area
under the ROC curve)
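For reference, the ROC score can be computed directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below assumes that each graph is scored by its posterior probability of belonging to one cluster and that the true family serves as the label; the numbers are hypothetical.

```python
def roc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

posterior = [0.9, 0.8, 0.35, 0.4, 0.1]   # hypothetical cluster posteriors
family = [1, 1, 1, 0, 0]                 # hypothetical true family labels
print(roc_score(posterior, family))      # 5 of 6 pairs are ordered correctly
```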
25
Examples of RNA Graphs
26
ROC Scores
27
No of Patterns & Time
28
Found Patterns
29
Summary (EM)
Probabilistic clustering based on
substructure representation
Inference helped by graph mining
Many possible extensions
Naïve Bayes
Graph PCA, LFD, CCA
Semi-supervised learning
Applications in Biology?
30
Graph Boosting
Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006
31
Graph Regression Problem
Known as the QSAR problem in chemical informatics
Quantitative Structure-Activity Relationship analysis
Given a graph, predict a real value
Typically, features (descriptors) are given
32
QSAR with conventional descriptors
#atoms  #bonds  #rings  …  Activity
22      25                 3
20      21                 1.2
23      24                 0.77
11      11                 -3.52
21      22                 -4
33
Motivation of Graph Boosting
Descriptors are not always available
New features by obtaining informative patterns (i.e., subgraphs)
Greedy pattern discovery by Boosting + gSpan
Linear Programming (LP) Boosting for reducing the number of graph mining calls
Accurate prediction & interpretable results
34
Molecule as a labeled graph
[Figure: a molecule drawn as a graph whose nodes carry atom labels (C, O)]
35
QSAR with patterns
Pattern 1  Pattern 2  Pattern 3  …  Activity
 1          1          1            3
-1          1         -1            1.2
-1          1         -1            0.77
-1          1         -1            -3.52
 1          1         -1            -4
[Figure: training molecules, the subgraph patterns 1, 2, 3, ... used as columns, and a new molecule whose activity f(?) is to be predicted]
36
Sparse regression in a very high dimensional space
G: all possible patterns (intractably large)
|G|-dimensional feature vector x for a molecule
Linear Regression
$f(x) = \sum_{j=1}^{d} \alpha_j x_j$
Use L1 regularizer to have sparse α
Select a tractable number of patterns
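A sparse α makes prediction cheap even though d = |G| is huge, because only the selected patterns contribute. The sketch below stores the nonzero weights in a dictionary and assumes x_j = +1 when pattern j occurs in the molecule and -1 otherwise; the pattern names and weights are hypothetical.

```python
def predict(alpha, present_patterns):
    """f(x) = sum_j alpha_j * x_j with x_j = +1 if pattern j occurs, else -1.

    `alpha` holds only the nonzero weights, so the sum runs over the
    selected patterns rather than over all of G.
    """
    return sum(a * (1.0 if j in present_patterns else -1.0)
               for j, a in alpha.items())

alpha = {"C-C-O": 1.5, "C-Cl": -2.0}   # hypothetical selected patterns and weights
molecule = {"C-C", "C-C-O"}            # patterns found in a hypothetical molecule
print(predict(alpha, molecule))         # 1.5 * 1 + (-2.0) * (-1) = 3.5
```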
37
Problem formulation
We introduce ε-insensitive loss and L1 regularizer
m: # of training graphs
d = |G|
$\xi^{+}, \xi^{-}$: slack variables
ε: parameter
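To make the roles of the slack variables concrete, the sketch below evaluates one common form of an L1-regularized ε-insensitive objective, the sum of |α_j| plus C times the total slack. The trade-off constant C, the function name and the toy data are assumptions; the slide itself only names m, d, ξ+, ξ- and ε.

```python
def l1_eps_objective(alpha, X, y, epsilon, C):
    """Value of an L1-regularized epsilon-insensitive objective.

    X[i][j] is the indicator of pattern j in graph i, y[i] the target.
    Returns (objective, xi_plus, xi_minus), where the slacks are the
    smallest nonnegative values satisfying
        f(x_i) - y_i <= epsilon + xi_plus[i]
        y_i - f(x_i) <= epsilon + xi_minus[i].
    """
    xi_plus, xi_minus = [], []
    for row, target in zip(X, y):
        f = sum(a * x for a, x in zip(alpha, row))
        xi_plus.append(max(0.0, f - target - epsilon))
        xi_minus.append(max(0.0, target - f - epsilon))
    l1 = sum(abs(a) for a in alpha)
    return l1 + C * (sum(xi_plus) + sum(xi_minus)), xi_plus, xi_minus

X = [[1, 1, 1], [-1, 1, -1], [1, 1, -1]]   # toy +1/-1 pattern indicators
y = [3.0, 1.2, -4.0]                        # toy activities
alpha = [1.0, 0.0, 2.0]                     # hypothetical weights
objective, xi_plus, xi_minus = l1_eps_objective(alpha, X, y, epsilon=0.5, C=1.0)
print(objective, xi_plus, xi_minus)
```

Predictions within ε of the target cost nothing; only the excess beyond ε enters through ξ+ (overestimates) or ξ- (underestimates).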
38
Dual LP
Primal: Huge number of weight variables
Dual: Huge number of constraints
LP1-Dual
39
Column Generation Algorithm
for LP Boost
(Demiriz et al., 2002)
Start from the dual with no constraints
Add the most violated constraint each time
Guaranteed to converge
[Figure: constraint matrix with the used part highlighted]
40
Finding the most violated constraint
Constraint for a pattern (shown again)
$-1 \le \sum_{i=1}^{m} u_i x_{ij} \le 1$
Finding the most violated one
$j^{*} = \arg\max_{j} \left| \sum_{i=1}^{m} u_i x_{ij} \right|$
Searched by weighted substructure mining
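The pricing step can be sketched on a small explicit feature matrix. This is only a stand-in: with d = |G| columns the matrix cannot be materialized, which is why the talk searches the maximizer by weighted substructure mining instead. The function name, tolerance and data are hypothetical.

```python
def most_violated(u, X, tol=1e-9):
    """Return (j, value) for the column maximizing |sum_i u_i * x_ij|,
    or None when every constraint |sum_i u_i * x_ij| <= 1 holds."""
    n_patterns = len(X[0])
    gains = [abs(sum(ui * row[j] for ui, row in zip(u, X)))
             for j in range(n_patterns)]
    j = max(range(n_patterns), key=gains.__getitem__)
    return (j, gains[j]) if gains[j] > 1.0 + tol else None

X = [[1, 1, 1], [-1, 1, -1], [1, 1, -1]]   # toy +1/-1 pattern indicators
u = [0.5, -0.5, 0.5]                        # hypothetical dual variables
print(most_violated(u, X))                  # (0, 1.5)
```

A return value of None corresponds to the stopping condition of column generation: no remaining constraint is violated.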
41
Algorithm Overview
Iteration
Find a new pattern by graph mining with weight u
If all constraints are satisfied, break
Add a new constraint
Update u by LP1-Dual
Return
Convert dual solution to obtain primal solution α
42
Speed-up by adding multiple patterns (multiple pricing)
So far, the most violated pattern is chosen
$j^{*} = \arg\max_{j} \left| \sum_{i=1}^{m} u_i x_{ij} \right|$
Mining and inclusion of top k patterns at each iteration
Reduction of the number of mining calls
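Multiple pricing changes only the selection rule: instead of one maximizer, up to k violated columns are returned per call. The sketch below again uses an explicit toy matrix as a stand-in for the mining step; names and data are hypothetical.

```python
import heapq

def top_k_violated(u, X, k, tol=1e-9):
    """Indices of up to k columns with the largest |sum_i u_i * x_ij|,
    keeping only those that violate |sum_i u_i * x_ij| <= 1."""
    n_patterns = len(X[0])
    gains = {j: abs(sum(ui * row[j] for ui, row in zip(u, X)))
             for j in range(n_patterns)}
    violated = [j for j in gains if gains[j] > 1.0 + tol]
    return heapq.nlargest(k, violated, key=gains.get)

X = [[1, 1, 1, -1], [-1, 1, -1, 1], [1, 1, -1, -1]]   # toy +1/-1 indicators
u = [0.8, -0.8, 0.4]                                   # hypothetical dual variables
print(top_k_violated(u, X, k=2))
```

Each call then adds several constraints to the dual at once, so fewer mining calls are needed before all constraints are satisfied.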
43
Speed-up by multiple pricing
44
Clearly negative data
#atoms  #bonds  #rings  …  Activity
22      25                 3
20      21                 1.2
23      24                 0.77
11      11                 -3.52
21      22                 -4
22      20                 -10000
23      19                 -10000
45
Inclusion of clearly negative data
LP2-Primal
l: # of clearly negative data
z: predetermined upper bound
ξ': slack variable
46
Experiments
Data from the Endocrine Disruptors Knowledge Base
59 compounds labeled by a real number and 61 compounds labeled by a large negative number
Label (target) is a log-transformed relative proliferative potency (log(RPP)) normalized between -1 and 1
Comparison with
Marginalized Graph Kernel + ridge regression
Marginalized Graph Kernel + kNN regression
47
Results with or without clearly negative data
[Figure: results for LP2 (with clearly negative data) and LP1 (without)]
48
Extracted patterns
Interpretable, in contrast to the implicitly expressed features of the Marginalized Graph Kernel
49
Summary (Graph Boosting)
Graph Boosting simultaneously generates patterns and learns their weights
Finite convergence by column generation
Potentially interpretable by chemists
Flexible constraints and speed-up by LP
50
Concluding Remarks
Using graph mining as a part of
machine learning algorithms
Weights are essential
Please include weights when you implement your item-set/tree/graph mining algorithms
Make it available on the web!
Then ML researchers can use it