# Graph mining in bioinformatics

Βιοτεχνολογία

2 Οκτ 2013 (πριν από 4 χρόνια και 9 μήνες)

88 εμφανίσεις

Graph mining in bioinformatics

Laur Tooming

Graphs in biology

Graphs are often used in bioinformatics for
describing processes in the cell

Vertices are genes or proteins

The meaning of an edge depends on the type of
the graph

Protein
-
protein interaction

Gene regulation

What we’re looking for

We

want to find sets of genes that have a
biological meaning.

Idea: find graph
-
theoretically relevant sets
of vertices and find out if they are also
biologically meaningful.

Simple example: connected components

A more advanced idea: graph clustering.
Find subgraphs that have a high edge
density.

Markov Cluster Algorithm (MCL)

If there is cluster structure in a graph, random
walks tend to remain in a cluster for a long time

Graph modelled as a
stochastic matrix
: sum of
entries in a column is 1

a
ij

-

probability that random
ly walking out of

j will
go to i on
the
next step

B
igger

edge

weight means greater probability of
choosing
that
edge

Stijn van Dongen,
Graph Clustering by Flow Simulation
. PhD thesis, University of Utrecht, May 2000.

http://micans.org/mcl/

Markov Cluster Algorithm (MCL)

Two procedures,
inflation

and
expansion
,
are applied alternatively

Expansion: matrix squaring

c
onsiders longer random walks

Inflation: raising entries to some power,
rescaling to remain stochastic

Weakens weak edges and strengthens strong
ones

C

Markov Cluster Algorithm (MCL)

Images from
http://micans.org/mcl/ani/mcl
-
animation.html

Betweenness centrality clustering

An edge between different clusters is on many
shortest paths from one cluster to another.

An edge inside a cluster is on less shortest paths,
because there are more alternative paths inside a
cluster.

Betweenness centrality

of an edge
-

the number of
shortest paths in the graph containing that edge.

R
emove edges with the highest centrality from the
graph

to obtain clustering.

Optimisation
s
:

instead of all shortest paths, pick a sample of vertices
and calculate shortest paths from them

remove several edges at once

GraphWeb

Web interface for analysing biological graphs

Simple syntax for entering graphs

multiple datasets

directed edges

edge weights

Visualising graphs with GraphViz

Finding biological meaning with g:Profiler

ds1: A > B 10

ds2: A > B 4

ds1: B C 5

ds2: C > D 12

Combining several datasets

Whether or not there is an edge between two
vertices is determined in biological experiments,
which may sometimes give false results.

For a given graph different sources may give
different information. Some sources may be
more trustworthy than others.

We would like to combine different sources and
assess the trustworthyness of each edge in the
resulting graph.

Edge weight in summary graph: sum over
datasets

w(e,G) =
Σ

w(e,G
i
)*w(G
i
)

Combining several datasets

The end