Graph mining in bioinformatics


Oct 2, 2013



Laur Tooming

Graphs in biology


Graphs are often used in bioinformatics for
describing processes in the cell


Vertices are genes or proteins


The meaning of an edge depends on the type of
the graph


Protein-protein interaction


Gene regulation


What we’re looking for


We want to find sets of genes that have a
biological meaning.


Idea: find graph-theoretically relevant sets
of vertices and find out if they are also
biologically meaningful.


Simple example: connected components


A more advanced idea: graph clustering.
Find subgraphs that have a high edge
density.
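The simple example above, connected components, can be sketched in a few lines. This is only an illustration: the gene names and edges are made up, and the graph is assumed to be undirected.

```python
from collections import deque

def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components

# Hypothetical genes and interactions, for illustration only.
genes = ["g1", "g2", "g3", "g4", "g5"]
interactions = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
print(connected_components(genes, interactions))
```

Each returned list is a candidate gene set that could then be checked for biological meaning.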

Markov Cluster Algorithm (MCL)


If there is cluster structure in a graph, random
walks tend to remain in a cluster for a long time


Graph modelled as a stochastic matrix: sum of
entries in a column is 1


a_ij: probability that a random walk leaving
vertex j goes to vertex i on the next step


Bigger edge weight means greater probability of
choosing that edge

Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.

http://micans.org/mcl/
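As a rough illustration of the stochastic-matrix view, the sketch below turns a weighted undirected edge list into transition probabilities. The function name and the example edges are assumptions made for this sketch, and the graph is assumed to have no self-loops. Column j of the matrix is stored as the dictionary p[j], so p[j][i] plays the role of a_ij.

```python
from collections import defaultdict

def transition_probabilities(edges):
    """edges: iterable of (u, v, weight) for an undirected graph.

    Returns p where p[j][i] is the probability that a random walk
    leaving j goes to i on the next step (one column per vertex j).
    """
    weight = defaultdict(dict)
    for u, v, w in edges:
        weight[u][v] = weight[u].get(v, 0.0) + w
        weight[v][u] = weight[v].get(u, 0.0) + w

    p = {}
    for j, neighbours in weight.items():
        total = sum(neighbours.values())
        # Dividing by the column total makes the column sum to 1.
        p[j] = {i: w / total for i, w in neighbours.items()}
    return p

# Hypothetical weighted edges, for illustration only.
p = transition_probabilities([("A", "B", 3), ("A", "C", 1), ("B", "C", 2)])
print(p["A"])  # the heavier edge A-B gets the larger probability
```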


Markov Cluster Algorithm (MCL)


Two procedures, inflation and expansion,
are applied alternately


Expansion: matrix squaring


considers longer random walks


Inflation: raising each entry to some power, then
rescaling each column to remain stochastic


Weakens weak edges and strengthens strong
ones


Converges to a steady state
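The two procedures can be sketched in plain Python as follows. This is a minimal illustration, not the reference implementation from micans.org: the inflation power r = 2, the fixed iteration count, the added self-loops and the way clusters are read off the final matrix are all assumptions made for this sketch.

```python
def normalize_columns(m):
    """Rescale every column so that its entries sum to 1."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out

def expand(m):
    """Expansion: matrix squaring (random walks of twice the length)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: raise entries to the power r, then rescale the columns."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl(weights, r=2.0, iterations=50, self_loop=1.0):
    """weights: symmetric matrix of edge weights. Returns a set of clusters."""
    n = len(weights)
    # Self-loops are an assumption of this sketch, not stated on the slide.
    m = [[weights[i][j] + (self_loop if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), r)
    # Rows that still carry flow belong to "attractor" vertices; the columns
    # they cover are read as one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    w[a][b] = w[b][a] = 1.0
print(mcl(w))
```

A larger inflation power tends to produce smaller clusters, a smaller one tends to produce larger clusters.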

Markov Cluster Algorithm (MCL)

Images from http://micans.org/mcl/ani/mcl-animation.html

Betweenness centrality clustering


An edge between different clusters is on many
shortest paths from one cluster to another.


An edge inside a cluster is on fewer shortest paths,
because there are more alternative paths inside a
cluster.


Betweenness centrality of an edge: the number of
shortest paths in the graph containing that edge.


Remove edges with the highest centrality from the
graph to obtain clustering.


Optimisations:


instead of all shortest paths, pick a sample of vertices
and calculate shortest paths from them


remove several edges at once
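A minimal sketch of the basic procedure, without the two optimisations, is given below. It assumes an undirected, unweighted graph without self-loops, removes one edge at a time, and stops once a requested number of clusters k is reached; the stopping rule and the function names are assumptions of this sketch. Edge betweenness is computed with Brandes-style accumulation, which splits the count fractionally when two vertices are joined by several shortest paths, so it can differ from a raw path count.

```python
from collections import deque

def edge_betweenness(adj):
    """adj: dict vertex -> set of neighbours. Returns {frozenset({u, v}): score}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, pushing path counts onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # Every unordered pair of endpoints was counted from both sides.
    return {e: c / 2.0 for e, c in score.items()}

def components(adj):
    seen, result = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(comp)
    return result

def cluster_by_betweenness(adj, k):
    """Remove the highest-betweenness edge until there are k components."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while len(components(adj)) < k:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the bridge C-D.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
         "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"}}
print(cluster_by_betweenness(graph, 2))
```

Recomputing all scores after every removal is what makes the plain version slow, which is the motivation for sampling source vertices and removing several edges per round.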


GraphWeb


Web interface for analysing biological graphs


Simple syntax for entering graphs


multiple datasets


directed edges


edge weights


Visualising graphs with GraphViz


Finding biological meaning with g:Profiler



ds1: A > B 10

ds2: A > B 4

ds1: B C 5

ds2: C > D 12
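To make the input format concrete, here is a small parser for the four example lines above. It is a guess based only on those lines: it assumes every line has a dataset label before the colon, that ">" marks a directed edge and its absence an undirected one, and that a weight is always present. The actual GraphWeb grammar may accept more than this.

```python
def parse_line(line):
    """Parse one line such as 'ds1: A > B 10'.

    Returns (dataset, source, target, weight, directed).
    Assumes the line shapes shown in the example; nothing else is handled.
    """
    dataset, _, rest = line.partition(":")
    tokens = rest.split()
    directed = ">" in tokens
    tokens = [t for t in tokens if t != ">"]
    source, target, weight = tokens[0], tokens[1], float(tokens[2])
    return dataset.strip(), source, target, weight, directed

example = ["ds1: A > B 10", "ds2: A > B 4", "ds1: B C 5", "ds2: C > D 12"]
for line in example:
    print(parse_line(line))
```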


Combining several datasets


Whether or not there is an edge between two
vertices is determined in biological experiments,
which may sometimes give false results.


For a given graph, different sources may give
different information. Some sources may be
more trustworthy than others.


We would like to combine different sources and
assess the trustworthiness of each edge in the
resulting graph.


Edge weight in summary graph: sum over
datasets


w(e, G) = Σ_i w(e, G_i) * w(G_i)
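The weighted sum can be sketched as follows, reading w(e, G_i) as the weight of edge e in dataset G_i and w(G_i) as the weight given to the dataset itself. The edge weights reuse the ds1/ds2 example from the GraphWeb slide; the dataset weights 1.0 and 0.5 are invented for illustration, and an edge missing from a dataset is treated as having weight 0 there.

```python
def combine(datasets, dataset_weights):
    """datasets: {name: {edge: weight}}, dataset_weights: {name: weight}.

    Returns the summary graph as {edge: combined weight}.
    """
    summary = {}
    for name, edges in datasets.items():
        for edge, w in edges.items():
            # Each dataset contributes its edge weight scaled by its own weight.
            summary[edge] = summary.get(edge, 0.0) + w * dataset_weights[name]
    return summary

datasets = {
    "ds1": {("A", "B"): 10, ("B", "C"): 5},
    "ds2": {("A", "B"): 4, ("C", "D"): 12},
}
dataset_weights = {"ds1": 1.0, "ds2": 0.5}  # hypothetical trust levels
print(combine(datasets, dataset_weights))
```

With these numbers the edge A > B, which is reported by both datasets, gets 10 * 1.0 + 4 * 0.5 = 12.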





The end