Graph mining in bioinformatics
Laur Tooming
Graphs in biology
• Graphs are often used in bioinformatics for describing processes in the cell
• Vertices are genes or proteins
• The meaning of an edge depends on the type of the graph
  – Protein-protein interaction
  – Gene regulation
What we’re looking for
• We want to find sets of genes that have a biological meaning.
• Idea: find graph-theoretically relevant sets of vertices and find out if they are also biologically meaningful.
• Simple example: connected components
• A more advanced idea: graph clustering. Find subgraphs that have a high edge density.
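To make the connected-components example concrete, here is a minimal sketch in Python. The gene names and the edge list are placeholders invented for illustration, and the function name `connected_components` is our own, not something from the slides.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search collects every vertex reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical interaction pairs; the names are placeholders, not real data.
edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneD", "geneE")]
components = connected_components(edges)
```

Each returned set is a candidate gene group that could then be checked for biological meaning.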
Markov Cluster Algorithm (MCL)
• If there is cluster structure in a graph, random walks tend to remain in a cluster for a long time
• Graph modelled as a stochastic matrix: the sum of entries in each column is 1
• $a_{ij}$: the probability that a random walk leaving j goes to i on the next step
• A bigger edge weight means a greater probability of choosing that edge
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
http://micans.org/mcl/
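As an illustration of the stochastic-matrix model, the sketch below turns a matrix of edge weights into a column-stochastic matrix by dividing each column by its sum. The 3-vertex weight matrix and the function name `column_stochastic` are assumptions made up for this example.

```python
def column_stochastic(weights):
    """Scale each column of a square weight matrix so that it sums to 1."""
    n = len(weights)
    column_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    return [
        # A column with no outgoing weight is left as zeros (isolated vertex).
        [weights[i][j] / column_sums[j] if column_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric edge weights for a 3-vertex graph (illustration only).
weights = [
    [0, 3, 1],
    [3, 0, 0],
    [1, 0, 0],
]
transition = column_stochastic(weights)
# transition[i][j] is the probability of stepping from vertex j to vertex i.
```

Here vertex 0 has edges of weight 3 and 1, so a walk leaving it picks the heavier edge with probability 0.75 and the lighter one with probability 0.25.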
Markov Cluster Algorithm (MCL)
• Two procedures, inflation and expansion, are applied alternately
• Expansion: matrix squaring
  – Considers longer random walks
• Inflation: raising entries to some power, rescaling to remain stochastic
  – Weakens weak edges and strengthens strong ones
• Converges to a steady state
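The alternation of expansion and inflation can be sketched in plain Python as follows. This is a simplified illustration, not the reference MCL implementation: the inflation power of 2, the added self-loops, the stopping tolerance, and the way clusters are read off the final matrix are all assumptions chosen for this example, as is the small test graph.

```python
def normalise_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. consider two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise each entry to `power`, then rescale the columns."""
    return normalise_columns([[value ** power for value in row] for row in m])


def mcl(adjacency, power=2.0, max_iterations=100, tolerance=1e-9):
    n = len(adjacency)
    # Assumption: add self-loops so that every column has non-zero mass.
    m = normalise_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iterations):
        nxt = inflate(expand(m), power)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tolerance:  # steady state reached
            break
    return m


def clusters_from(m, threshold=1e-6):
    """Read clusters as the sets of columns supported by each non-zero row."""
    found = set()
    for row in m:
        members = frozenset(j for j, value in enumerate(row) if value > threshold)
        if members:
            found.add(members)
    return found


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
result = mcl(adjacency)
clusters = clusters_from(result)
```

A larger inflation power generally yields finer clusters, so `power` is the main parameter to experiment with.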
Markov Cluster Algorithm (MCL)
Images from http://micans.org/mcl/ani/mcl-animation.html
Betweenness centrality clustering
• An edge between different clusters is on many shortest paths from one cluster to another.
• An edge inside a cluster is on fewer shortest paths, because there are more alternative paths inside a cluster.
• Betweenness centrality of an edge: the number of shortest paths in the graph containing that edge.
• Remove edges with the highest centrality from the graph to obtain a clustering.
• Optimisations:
  – instead of all shortest paths, pick a sample of vertices and calculate shortest paths from them
  – remove several edges at once
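A minimal sketch of one round of this procedure, for an unweighted, undirected graph, is given below. It uses a Brandes-style breadth-first accumulation, which is an assumption on our part: the slides do not say how the centrality is computed. Note also that when two vertices are joined by several shortest paths, this sketch gives each path a fractional share rather than counting it fully. The example graph and all function names are made up for illustration.

```python
from collections import defaultdict, deque


def edge_betweenness(adjacency):
    """Edge betweenness for an unweighted, undirected graph (dict of sets)."""
    score = defaultdict(float)
    for source in adjacency:
        # Breadth-first search: distances, shortest-path counts, predecessors.
        distance = {source: 0}
        path_count = defaultdict(int)
        path_count[source] = 1
        predecessors = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Walk back from the farthest vertices, crediting edges on the way.
        dependency = defaultdict(float)
        for w in reversed(order):
            for v in predecessors[w]:
                share = path_count[v] / path_count[w] * (1 + dependency[w])
                score[frozenset((v, w))] += share
                dependency[v] += share
    # Every vertex pair was visited from both of its endpoints.
    return {edge: value / 2 for edge, value in score.items()}


def remove_most_central_edge(adjacency):
    """Return a copy of the graph without its highest-betweenness edge."""
    scores = edge_betweenness(adjacency)
    u, v = max(scores, key=scores.get)
    pruned = {node: set(neighbours) for node, neighbours in adjacency.items()}
    pruned[u].discard(v)
    pruned[v].discard(u)
    return pruned


def components(adjacency):
    """Connected components, used here to read off the clusters."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical graph: triangles a-b-c and d-e-f joined by the edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
betweenness = edge_betweenness(graph)
clusters = components(remove_most_central_edge(graph))
```

In this example the bridge c-d carries all nine shortest paths between the two triangles, so it is removed first and the two triangles remain as clusters. On larger graphs the removal step would be repeated, recomputing the centralities each time.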
GraphWeb
• Web interface for analysing biological graphs
• Simple syntax for entering graphs
  – multiple datasets
  – directed edges
  – edge weights
• Visualising graphs with GraphViz
• Finding biological meaning with g:Profiler
ds1: A > B 10
ds2: A > B 4
ds1: B C 5
ds2: C > D 12
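The four lines above suggest a format of dataset label, source vertex, an optional `>` marking a directed edge, target vertex, and an optional weight. The parser below is a sketch under that reading only; it is not GraphWeb's own parser, and the default weight of 1.0 for a missing weight is our assumption.

```python
def parse_edge_line(line):
    """Parse a line such as 'ds1: A > B 10' into a dictionary.

    Assumed layout (inferred from the example, not from GraphWeb docs):
    '<dataset>: <source> [>] <target> [<weight>]'
    """
    dataset, _, rest = line.partition(":")
    tokens = rest.split()
    directed = ">" in tokens
    tokens = [token for token in tokens if token != ">"]
    source, target = tokens[0], tokens[1]
    # Assumption: an edge without an explicit weight gets weight 1.0.
    weight = float(tokens[2]) if len(tokens) > 2 else 1.0
    return {
        "dataset": dataset.strip(),
        "source": source,
        "target": target,
        "directed": directed,
        "weight": weight,
    }


lines = [
    "ds1: A > B 10",
    "ds2: A > B 4",
    "ds1: B C 5",
    "ds2: C > D 12",
]
parsed = [parse_edge_line(line) for line in lines]
```

Under this reading, the third line is the only undirected edge, and the edge from A to B appears in both datasets with different weights.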
Combining several datasets
• Whether or not there is an edge between two vertices is determined in biological experiments, which may sometimes give false results.
• For a given graph, different sources may give different information. Some sources may be more trustworthy than others.
• We would like to combine different sources and assess the trustworthiness of each edge in the resulting graph.
• Edge weight in summary graph: sum over datasets
  – $w(e, G) = \sum_i w(e, G_i) \cdot w(G_i)$
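The summary weight can be computed directly from the formula, reading $w(G_i)$ as a trust weight assigned to dataset $G_i$. In the sketch below the edge weights are taken from the GraphWeb input example earlier in the slides, while the dataset weights 1.0 and 0.5 are hypothetical. Edge direction is ignored for simplicity, and an edge missing from a dataset contributes nothing to the sum.

```python
from collections import defaultdict


def combine(datasets, dataset_weights):
    """Summary weight of each edge: sum over datasets of w(e, G_i) * w(G_i)."""
    combined = defaultdict(float)
    for name, edges in datasets.items():
        for edge, weight in edges.items():
            combined[edge] += weight * dataset_weights[name]
    return dict(combined)


# Edge weights per dataset, keyed by (source, target).
datasets = {
    "ds1": {("A", "B"): 10, ("B", "C"): 5},
    "ds2": {("A", "B"): 4, ("C", "D"): 12},
}
# Hypothetical trust weights: ds1 is trusted twice as much as ds2.
dataset_weights = {"ds1": 1.0, "ds2": 0.5}

summary = combine(datasets, dataset_weights)
```

For the edge from A to B this gives 10 · 1.0 + 4 · 0.5 = 12, so an edge confirmed by several sources ends up with a higher summary weight than either source alone would give it.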
The end