
Nov 30, 2013


1

Graph Mining Applications to
Machine Learning Problems

Max Planck Institute for Biological Cybernetics



Koji Tsuda

2

Graphs


3

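Graph Structures in Biology

DNA sequence

RNA

Texts in literature (e.g., "Amitriptyline inhibits adenosine uptake")

Compounds

[Figure: example graphs for each data type: a DNA sequence with bases A, C, G, C; an RNA structure with C-G pairs and U, A bases; the example sentence; a compound with C, O and H atoms as vertices]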

4

Substructure Representation

0/1 vector of pattern indicators

Huge dimensionality!

Need graph mining for selecting features

Better than paths (marginalized graph kernels)
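As a rough illustration of the 0/1 substructure representation, the sketch below builds an indicator vector for a toy molecule. It relies on a strong simplification that is not in the slides: a graph and a pattern are both treated as sets of labeled edges, and a pattern "occurs" when its edge set is contained in the graph's edge set. Real substructure matching needs a subgraph isomorphism test. The names `make_graph`, `occurs` and `indicator_vector` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, ...]          # an undirected labeled edge, e.g. ("C", "O")
Graph = FrozenSet[Edge]         # simplification: a graph is a set of labeled edges


def make_graph(edges: List[Tuple[str, str]]) -> Graph:
    """Sort the two labels of each edge so ("O", "C") equals ("C", "O")."""
    return frozenset(tuple(sorted(e)) for e in edges)


def occurs(pattern: Graph, graph: Graph) -> bool:
    """Assumed containment test: every pattern edge is also a graph edge."""
    return pattern <= graph


def indicator_vector(graph: Graph, patterns: List[Graph]) -> List[int]:
    """Return x with x[k] = 1 if pattern k occurs in the graph, else 0."""
    return [1 if occurs(p, graph) else 0 for p in patterns]


patterns = [
    make_graph([("C", "C")]),
    make_graph([("C", "O")]),
    make_graph([("C", "C"), ("C", "Cl")]),
]
molecule = make_graph([("C", "C"), ("O", "C"), ("C", "H")])
x = indicator_vector(molecule, patterns)
print(x)
```

With all possible patterns as columns, the vector becomes far too long to store, which is why the patterns have to be selected by mining.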

5

Overview

Quick Review of Graph Mining

EM-based Clustering Algorithm

Mixture model with L1 feature selection

Graph Boosting

Supervised Regression for QSAR Analysis

Linear programming meets graph mining



6

Quick Review of Graph Mining

7

Graph Mining

Analysis of Graph Databases


Find all patterns satisfying predetermined conditions

Frequent Substructure Mining

Combinatorial, exhaustive

Recently developed

AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)


8

Graph Mining

Frequent Substructure Mining


Enumerate all patterns that occur in at least m graphs

$x_{ik} \in \{0, 1\}$: indicator of pattern k in graph i

Support(k) $= \sum_i x_{ik}$: number of graphs in which pattern k occurs

9

gSpan (Yan and Han, 2002)

Efficient frequent substructure mining method

DFS code

Efficient detection of isomorphic patterns

We extend gSpan for our work


10

Enumeration on Tree-shaped Search Space

Each node has a pattern

Generate nodes from the root:


Add an edge at each step

11

Tree Pruning

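Anti-monotonicity: if pattern g is a subgraph of pattern g', then support(g) >= support(g')

If support(g) < m, stop exploring! Supergraphs of g are not generated

Support(g): number of graphs in which pattern g occurs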
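The tree-shaped enumeration with support-based pruning can be sketched on a simpler pattern class. The code below is not gSpan: as an assumption for illustration, each graph is reduced to a set of edge-label strings and a pattern is a subset of those strings (item-set mining), so "adding an edge" becomes "adding an item" and no DFS code or isomorphism check is needed. The function names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Pattern = FrozenSet[str]


def support(pattern: Pattern, database: List[Pattern]) -> int:
    """Number of database entries that contain every item of the pattern."""
    return sum(1 for entry in database if pattern <= entry)


def mine_frequent(database: List[Pattern], m: int) -> List[Tuple[Pattern, int]]:
    """Depth-first enumeration of all non-empty patterns with support >= m."""
    items = sorted(set().union(*database))
    found: List[Tuple[Pattern, int]] = []

    def grow(pattern: Pattern, start: int) -> None:
        # A child adds one item that sorts after the last added item, so every
        # pattern is generated exactly once (tree-shaped search space).
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            s = support(child, database)
            if s < m:
                continue  # anti-monotonicity: no extension of child is frequent
            found.append((child, s))
            grow(child, idx + 1)

    grow(frozenset(), 0)
    return found


database = [
    frozenset({"C-C", "C-O", "C-H"}),
    frozenset({"C-C", "C-O"}),
    frozenset({"C-C", "C-Cl"}),
]
frequent = mine_frequent(database, m=2)
for pattern, s in frequent:
    print(sorted(pattern), s)
```

The `continue` branch is the pruning step: a whole subtree is skipped without ever computing the support of its members.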

12

Discriminative patterns: Weighted Substructure Mining

w_i > 0: positive class

w_i < 0: negative class

Weighted substructure mining finds patterns with a large frequency difference between the classes

Not anti-monotonic: use a bound for pruning
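A minimal sketch of how a bound can replace anti-monotonicity. It assumes the pattern score is the weighted sum of 0/1 indicators, sum_i w_i x_i, which may differ from the exact criterion on the slide. Because an extended pattern can only occur in a subset of the graphs where the current pattern occurs, its absolute score cannot exceed the larger of the positive and the negative weight mass covered by the current pattern. The function names are hypothetical.

```python
from typing import List


def weighted_score(w: List[float], x: List[int]) -> float:
    """Weighted frequency difference sum_i w_i * x_i for one pattern column x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def superpattern_bound(w: List[float], x: List[int]) -> float:
    """Upper bound on |weighted_score| of any pattern that extends this one.

    The best case for an extension keeps every occurrence of one sign and
    loses every occurrence of the other sign.
    """
    pos = sum(wi for wi, xi in zip(w, x) if xi == 1 and wi > 0)
    neg = -sum(wi for wi, xi in zip(w, x) if xi == 1 and wi < 0)
    return max(pos, neg)


w = [0.5, 0.25, -0.25, -0.5]   # two positive-class and two negative-class graphs
parent = [1, 1, 1, 0]          # occurrences of a pattern
child = [1, 1, 0, 0]           # occurrences of one of its extensions

print(weighted_score(w, parent), superpattern_bound(w, parent))
print(weighted_score(w, child))
```

During the search, a subtree can be skipped whenever `superpattern_bound` falls below the current threshold, even though the score itself may grow when a pattern is extended (as it does from `parent` to `child` here).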

13

Multiclass version

Multiple weight vectors



(graph belongs to class )



(otherwise)


Search for patterns overrepresented in a class


14

EM-based clustering of graphs

Tsuda, K. and T. Kudo: Clustering Graphs by Weighted Substructure Mining. ICML 2006, 953-960, 2006






15

EM-based graph clustering

Motivation

Learning a mixture model in the feature space of patterns

Basis for more complex probabilistic inference

L1 regularization & graph mining

E-step -> Mining -> M-step

16

Probabilistic Model

Binomial Mixture



Each Component

:Mixing weight for cluster

:Feature vector of a graph (0 or 1)

:Parameter vector for cluster
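One common way to write a binomial mixture over 0/1 pattern indicators is given below. The symbol names ($\pi_\ell$ for the mixing weight, $\mathbf{x}$ for the feature vector, $\boldsymbol{\theta}_\ell$ for the cluster parameter vector, $c$ clusters, $d$ patterns) and the logistic parametrization are assumptions chosen for this sketch, not notation taken from the slide.

```latex
p(\mathbf{x} \mid \Theta) = \sum_{\ell=1}^{c} \pi_\ell \, p_\ell(\mathbf{x} \mid \boldsymbol{\theta}_\ell),
\qquad
p_\ell(\mathbf{x} \mid \boldsymbol{\theta}_\ell)
  = \prod_{k=1}^{d} \frac{\exp(\theta_{\ell k} x_k)}{1 + \exp(\theta_{\ell k})},
\qquad x_k \in \{0, 1\}
```

In this form the occurrence probability of pattern $k$ in cluster $\ell$ is $\exp(\theta_{\ell k}) / (1 + \exp(\theta_{\ell k}))$.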

17

Function to minimize

L1-regularized negative log-likelihood

Baseline constant

ML parameter estimate using a single binomial distribution

In the solution, most parameters are exactly equal to the baseline constants

18

E-step

Active pattern

E-step computed only with active patterns (computable!)
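A sketch of why the E-step stays computable. If a pattern's parameter equals the same baseline constant in every cluster, its likelihood factor is identical across clusters and cancels when the posterior is normalized, so only the active patterns need to be visited. The code assumes the model is parametrized directly by occurrence probabilities `p[l][k]` strictly between 0 and 1; the function name and argument layout are hypothetical.

```python
import math
from typing import List


def e_step(x: List[List[int]], mix: List[float], p: List[List[float]],
           active: List[int]) -> List[List[float]]:
    """Posterior cluster responsibilities using only the listed pattern columns.

    x[i][k] : 0/1 indicator of pattern k in graph i
    mix[l]  : mixing weight of cluster l
    p[l][k] : occurrence probability of pattern k in cluster l
    active  : indices of patterns whose parameters differ between clusters
    """
    resp = []
    for xi in x:
        log_post = []
        for l, weight in enumerate(mix):
            lp = math.log(weight)
            for k in active:
                lp += math.log(p[l][k] if xi[k] == 1 else 1.0 - p[l][k])
            log_post.append(lp)
        top = max(log_post)  # subtract the maximum for numerical stability
        unnorm = [math.exp(v - top) for v in log_post]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp


x = [[1, 0, 1], [0, 1, 1]]
mix = [0.5, 0.5]
p = [[0.9, 0.3, 0.5],    # cluster 0
     [0.1, 0.3, 0.5]]    # cluster 1: only pattern 0 differs from cluster 0
r_active = e_step(x, mix, p, active=[0])
r_full = e_step(x, mix, p, active=[0, 1, 2])
print(r_active)
```

Both calls give the same responsibilities up to rounding, because patterns 1 and 2 contribute the same factor to every cluster.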

19

M-step

Putative cluster assignment by E-step

Each parameter is solved separately

Use graph mining to find active patterns

Then, solve it only for active patterns




20

Solution

Occurrence probability in a cluster



Overall occurrence probability



21

Important Observation

For active pattern k, the occurrence probability in a graph cluster is significantly different from the average

22

Mining for Active Patterns F

F is rewritten in the following form






Active patterns can be found by graph mining! (multiclass)

23

Experiments: RNA graphs

Stem as a node

Secondary structure by RNAfold

0/1 Vertex label (self loop or not)

24

Clustering RNA graphs

Three Rfam families


Intron GP I (Int, 30 graphs)


SSU rRNA 5 (SSU, 50 graphs)


RNase bact a (RNase, 50 graphs)

Three bipartition problems


Results evaluated by ROC scores (area under the ROC curve)
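For reference, a ROC score can be computed directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The sketch below assumes each graph receives a real-valued score (for instance a posterior probability of belonging to one cluster) and a 0/1 family label; how the talk derived the scores is not stated here, and the function name is hypothetical.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(auc)
```

A score of 1.0 means the two families are separated perfectly, and 0.5 corresponds to a random ordering.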

25

Examples of RNA Graphs

26

ROC Scores

27

No. of Patterns & Time


28

Found Patterns


29

Summary (EM)

Probabilistic clustering based on substructure representation

Inference helped by graph mining

Many possible extensions

Naïve Bayes

Graph PCA, LFD, CCA

Semi-supervised learning

Applications in Biology?


30

Graph Boosting

Saigo, H., T. Kadowaki and K. Tsuda: A Linear Programming Approach for Molecular QSAR Analysis. International Workshop on Mining and Learning with Graphs, 85-96, 2006

31

Graph Regression Problem

Known as the QSAR problem in chemical informatics

Quantitative Structure-Activity Relationship (QSAR) analysis

Given a graph, predict a real value

Typically, features (descriptors) are given


32

QSAR with conventional descriptors

#atoms   #bonds   #rings   ...   Activity
22       25                      3
20       21                      1.2
23       24                      0.77
11       11                      -3.52
21       22                      -4

33

Motivation of Graph Boosting

Descriptors are not always available

New features by obtaining informative patterns (i.e., subgraphs)

Greedy pattern discovery by Boosting + gSpan

Linear Programming (LP) Boosting for reducing the number of graph mining calls

Accurate prediction & interpretable results


34

Molecule as a labeled graph

[Figure: a molecule drawn as a labeled graph, with atom labels (C, O) on the vertices]

35

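QSAR with patterns

1     2     3     ...   Activity
 1     1     1          3
-1     1    -1          1.2
-1     1    -1          0.77
-1     1    -1          -3.52
 1     1    -1          -4

[Figure: each row is a molecule drawn as a labeled graph (atoms C, O, Cl) and columns 1, 2, 3, ... are subgraph patterns; the +1/-1 entries mark whether a pattern occurs in the molecule. For a new molecule, f(molecule) = ?]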

36

Sparse regression in a very high dimensional space

G: all possible patterns (intractably large)

|G|-dimensional feature vector x for a molecule

Linear regression: $f(x) = \sum_{j=1}^{d} \alpha_j x_j$

Use L1 regularizer to have sparse α

Select a tractable number of patterns
37

Problem formulation

We introduce ε-insensitive loss and L1 regularizer

m: # of training graphs

d = |G|

ξ+, ξ-: slack variables

ε: parameter
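A standard way to combine an ε-insensitive loss with an L1 regularizer as a linear program is sketched below. It uses the symbols listed on this slide ($m$, $d$, $\xi^{+}$, $\xi^{-}$, $\varepsilon$) together with the weight vector $\boldsymbol{\alpha}$; the trade-off constant $C$ and the target values $y_i$ are names assumed here, and the exact form used in the talk may differ.

```latex
\min_{\boldsymbol{\alpha},\, \boldsymbol{\xi}^{+},\, \boldsymbol{\xi}^{-}} \;
  \|\boldsymbol{\alpha}\|_1 + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right)
\quad \text{s.t.} \quad
\boldsymbol{\alpha}^{\top} \mathbf{x}_i - y_i \le \varepsilon + \xi_i^{+}, \quad
y_i - \boldsymbol{\alpha}^{\top} \mathbf{x}_i \le \varepsilon + \xi_i^{-}, \quad
\xi_i^{+},\, \xi_i^{-} \ge 0, \quad i = 1, \dots, m
```

Errors smaller than $\varepsilon$ cost nothing, and the L1 term drives most of the $d$ weights to exactly zero.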

38

Dual LP

Primal: Huge number of weight variables

Dual: Huge number of constraints

LP1-Dual

39

Column Generation Algorithm for LP Boost (Demiriz et al., 2002)

Start from the dual with no constraints

Add the most violated constraint each time

Guaranteed to converge

[Figure: constraint matrix, with the used part marked]

40

Finding the most violated constraint

Constraint for a pattern (shown again): $-1 \le \sum_{i=1}^{m} u_i x_{ij} \le 1$

Finding the most violated one: $j^* = \arg\max_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|$

Searched by weighted substructure mining
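The search for the most violated constraint can be illustrated on a small, explicitly listed set of candidate patterns. This is only a stand-in: in the talk the maximization runs over all patterns via weighted substructure mining, not over a stored list. The sketch assumes each pattern constraint has the form -1 <= sum_i u_i x_ij <= 1; the function name, the `tol` parameter and the column layout are hypothetical.

```python
from typing import List, Optional, Tuple


def most_violated(u: List[float], columns: List[List[int]],
                  tol: float = 1e-9) -> Optional[Tuple[int, float]]:
    """Return (index, score) of the pattern maximizing |sum_i u_i x_ij|.

    Returns None when every candidate satisfies -1 <= sum_i u_i x_ij <= 1.
    columns[j][i] is the feature value of pattern j in graph i.
    """
    best_j, best_score = -1, 0.0
    for j, col in enumerate(columns):
        score = sum(ui * xij for ui, xij in zip(u, col))
        if abs(score) > abs(best_score):
            best_j, best_score = j, score
    if best_j < 0 or abs(best_score) <= 1.0 + tol:
        return None
    return best_j, best_score


u = [1.0, 0.5, -1.0]                         # dual weights, one per graph
columns = [[1, 1, 1], [1, 1, -1], [-1, 1, 1]]
result = most_violated(u, columns)
print(result)
```

Here the dual variables play the role of the graph weights w_i in weighted substructure mining, which is what lets a graph miner do this search.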
41

Algorithm Overview

Iteration

Find a new pattern by graph mining with weight u

If all constraints are satisfied, break

Add a new constraint

Update u by LP1-Dual

Return

Convert dual solution to obtain primal solution α

42

Speed-up by adding multiple patterns (multiple pricing)

So far, the most violated pattern is chosen: $j^* = \arg\max_j \left| \sum_{i=1}^{m} u_i x_{ij} \right|$

Mining and inclusion of top k patterns at each iteration

Reduction of the number of mining calls
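Multiple pricing can be sketched as returning the k most violated constraints instead of a single one. As an assumption for illustration, the candidates are an explicit list of pattern columns rather than the output of a graph miner, and a constraint counts as violated when |sum_i u_i x_ij| exceeds 1. The function name and the `tol` parameter are hypothetical.

```python
import heapq
from typing import List, Tuple


def top_k_violated(u: List[float], columns: List[List[int]], k: int,
                   tol: float = 1e-9) -> List[Tuple[int, float]]:
    """Return up to k (index, score) pairs with the largest |sum_i u_i x_ij| > 1."""
    scored = [(j, sum(ui * xij for ui, xij in zip(u, col)))
              for j, col in enumerate(columns)]
    violated = [(j, s) for j, s in scored if abs(s) > 1.0 + tol]
    return heapq.nlargest(k, violated, key=lambda js: abs(js[1]))


u = [1.0, 0.5, -1.0]
columns = [[1, 1, 1], [1, 1, -1], [-1, 1, 1]]
picked = top_k_violated(u, columns, k=2)
print(picked)
```

Adding several constraints per iteration means the restricted LP grows faster, but the expensive mining step is called fewer times.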

43

Speed-up by multiple pricing

44

Clearly negative data

#atoms   #bonds   #rings   ...   Activity
22       25                      3
20       21                      1.2
23       24                      0.77
11       11                      -3.52
21       22                      -4
22       20                      -10000
23       19                      -10000


45

Inclusion of clearly negative data

LP2-Primal

l: # of clearly negative data

z: predetermined upper bound

ξ: slack variable
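One way such a program could look is sketched below: the ε-insensitive constraints for the $m$ graphs with real-valued labels are kept, and each of the $l$ clearly negative graphs only has to be predicted below the upper bound $z$, up to a slack. The notation $\bar{\mathbf{x}}_k$ for a clearly negative graph, the prime on its slack variable $\xi'_k$, the targets $y_i$ and the shared constant $C$ are assumptions made for this sketch.

```latex
\min_{\boldsymbol{\alpha},\, \boldsymbol{\xi}^{+},\, \boldsymbol{\xi}^{-},\, \boldsymbol{\xi}'} \;
  \|\boldsymbol{\alpha}\|_1
  + C \sum_{i=1}^{m} \left( \xi_i^{+} + \xi_i^{-} \right)
  + C \sum_{k=1}^{l} \xi'_k
\quad \text{s.t.} \quad
\begin{aligned}
  & \boldsymbol{\alpha}^{\top} \mathbf{x}_i - y_i \le \varepsilon + \xi_i^{+}, \quad
    y_i - \boldsymbol{\alpha}^{\top} \mathbf{x}_i \le \varepsilon + \xi_i^{-}, \quad
    \xi_i^{+},\, \xi_i^{-} \ge 0, && i = 1, \dots, m \\
  & \boldsymbol{\alpha}^{\top} \bar{\mathbf{x}}_k \le z + \xi'_k, \quad
    \xi'_k \ge 0, && k = 1, \dots, l
\end{aligned}
```

The one-sided constraint avoids fitting an arbitrary large negative target such as -10000 exactly.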

46

Experiments

Data from Endocrine Disruptors Knowledge Base

59 compounds labeled by a real number and 61 compounds labeled by a large negative number

Label (target) is the log-transformed relative proliferative potency (log(RPP)), normalized between -1 and 1

Comparison with

Marginalized Graph Kernel + ridge regression

Marginalized Graph Kernel + kNN regression

47

Results with or without clearly negative data

[Figure: results for LP2 and LP1]


48

Extracted patterns

Interpretable, compared with the implicitly expressed features of the Marginalized Graph Kernel

49

Summary (Graph Boosting)

Graph Boosting simultaneously generates patterns and learns their weights

Finite convergence by column generation

Potentially interpretable by chemists

Flexible constraints and speed-up by LP

50

Concluding Remarks

Using graph mining as a part of machine learning algorithms

Weights are essential

Please include weights when you implement your item-set/tree/graph mining algorithms


Make it available on the web!


Then ML researchers can use it