Extracting relevant clusters from biological networks


25 Nov 2013


Sylvain Brohée <sylvain@bigre.ulb.ac.be>

Université Libre de Bruxelles, Belgique

Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)

http://www.bigre.ulb.ac.be/


Cluster


A cluster is a group of nodes of the graph.


The nodes can be grouped in a cluster according to many criteria.


Groups of neighbors


Groups of nodes fully connected (clique)



Groups of nodes tightly inter-connected

Selection of groups of nodes with high density

Partitioning the graph into near-cliques which are sparsely inter-connected

Selection of the nodes presenting more interactions with the cluster nodes than with the other nodes of the graph


Looking for the neighbors

In an interaction network, all proteins linked to a given protein could form a complex with it.

#neighb   seed
prot1     prot1
prot2     prot1
prot3     prot1
prot4     prot1
prot5     prot1

With NeAT


Use graph-neighbours to compute the neighbors of all or part of the nodes of a graph.
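As an illustration of the neighbor search, here is a minimal Python sketch. It is not the NeAT graph-neighbours implementation; the function names and the toy edge list (prot1 linked to prot2 ... prot5) are assumptions chosen to reproduce the table above, where the seed is listed among its own neighbors.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def neighbours(adjacency, seed, include_seed=True):
    """Return the sorted neighbors of `seed` (optionally with the seed itself)."""
    result = set(adjacency.get(seed, set()))
    if include_seed:
        result.add(seed)
    return sorted(result)


# Hypothetical toy network: prot1 interacts with four other proteins.
toy_edges = [("prot1", "prot2"), ("prot1", "prot3"),
             ("prot1", "prot4"), ("prot1", "prot5")]
toy_adjacency = build_adjacency(toy_edges)
for neighbour in neighbours(toy_adjacency, "prot1"):
    print(neighbour, "prot1", sep="\t")
```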

Looking for the cliques

Looking for sets of nodes that are completely inter-connected.

#element  cluster
prot2     clique_1
prot4     clique_1
prot1     clique_1
prot3     clique_1
prot5     clique_2
prot4     clique_2
prot1     clique_2



A bit extreme

With NeAT


Use graph-cliques to compute all maximal cliques in a graph.
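To make the notion of maximal clique concrete, the sketch below enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting). This is an illustrative technique, not necessarily what graph-cliques uses internally. The toy graph is an assumption built to be consistent with the clique table: prot1 to prot4 are fully connected, and prot5 is linked to prot1 and prot4.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques of an undirected graph without self-loops.

    adjacency: dict {node: set of neighbors}.
    Returns a sorted list of sorted node lists.
    """
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no node can extend it any further.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Hypothetical toy graph consistent with the clique table.
toy_adjacency = {
    "prot1": {"prot2", "prot3", "prot4", "prot5"},
    "prot2": {"prot1", "prot3", "prot4"},
    "prot3": {"prot1", "prot2", "prot4"},
    "prot4": {"prot1", "prot2", "prot3", "prot5"},
    "prot5": {"prot1", "prot4"},
}
for number, clique in enumerate(maximal_cliques(toy_adjacency), start=1):
    for element in clique:
        print(element, f"clique_{number}", sep="\t")
```

The number of maximal cliques can grow exponentially with graph size, so this simple version is only suitable for small networks.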



Graph clustering - Definition

Assignment of graph vertices to clusters.

Clustering criteria

Selection of groups of nodes with high density

Partitioning the graph into near-cliques which are sparsely inter-connected

Selection of the nodes presenting more interactions with the cluster nodes than with the other nodes of the graph

Clustering algorithm development is a very active domain in bioinformatics


Graph clustering - Interests


Many fields of application


Database searching


Graph drawing


...


Bioinformatics


Detection of protein families


Gene expression pattern


Protein complex detection


...



RNSC


King 2004


Cost-based local search algorithm

Maximum number of requested clusters may be defined

Intuitive

Cannot deal with edge-weighted graphs

Many parameters of limited usefulness


Available upon request


Available on NeAT


RNSC (2)

RNSC calculates two types of score:

Naive score: sum, for each node i, of

The number of neighbours of i that are not in the same cluster as i

The number of non-neighbours of i that are in the same cluster as i

Scaled score: sum, for each node i, of

The naive score of i divided by the size of the union of the cluster containing i and the set of neighbours of i.

Example for one node i:

neighbours not in the same cluster: 3
non-neighbours in the same cluster: 1
size of union: 6
naive score: 4
scaled score: 4/6
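The two scores can be sketched in Python as below. This follows the description on the slide literally (plain sums over nodes); the published RNSC cost functions may include additional normalisation factors, so treat it as an illustration only. The function names and the toy graph are assumptions; the toy graph is built so that node "i" reproduces the slide's example (3 + 1 = 4, union of size 6).

```python
def node_scores(adjacency, cluster_of, node):
    """Return (naive, scaled) score contributions of a single node."""
    same_cluster = {v for v, c in cluster_of.items() if c == cluster_of[node]}
    # Neighbours that ended up in another cluster.
    outside_neighbours = len(adjacency[node] - same_cluster)
    # Cluster mates (other than the node itself) that are not neighbours.
    inside_non_neighbours = len(same_cluster - adjacency[node] - {node})
    naive = outside_neighbours + inside_non_neighbours
    union_size = len(same_cluster | adjacency[node])
    return naive, naive / union_size


def clustering_scores(adjacency, cluster_of):
    """Sum the naive and scaled contributions over all nodes."""
    naive_total, scaled_total = 0, 0.0
    for node in adjacency:
        naive, scaled = node_scores(adjacency, cluster_of, node)
        naive_total += naive
        scaled_total += scaled
    return naive_total, scaled_total


# Hypothetical toy graph: "i" has one neighbour (x) and one non-neighbour (y)
# in its own cluster, and three neighbours (p, q, r) in another cluster.
toy_adjacency = {
    "i": {"x", "p", "q", "r"},
    "x": {"i"}, "y": set(),
    "p": {"i"}, "q": {"i"}, "r": {"i"},
}
toy_clusters = {"i": 0, "x": 0, "y": 0, "p": 1, "q": 1, "r": 1}
print(node_scores(toy_adjacency, toy_clusters, "i"))
```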

RNSC (3)


Starting from an initial random clustering, RNSC makes a move at each iteration (it moves a vertex from one cluster to another).

The RNSC process has two steps:

The naive scheme:

If the move decreases the naive cost of the previous iteration, RNSC keeps the new clustering, else it keeps the old one.

The naive scheme stops when the score has not decreased during the last k moves. RNSC then goes to the second step.

The scaled scheme:

Same as the naive scheme but using the scaled cost.
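The two-step process described above can be sketched as a simple local search. This is a deliberately simplified illustration, not the RNSC program: moves are drawn at random (the real tool has more elaborate move selection and many more parameters), and all names, the default k and the seed are assumptions.

```python
import random


def total_cost(adjacency, cluster_of, scaled):
    """Naive or scaled cost of a clustering, summed over all nodes."""
    members = {}
    for node, label in cluster_of.items():
        members.setdefault(label, set()).add(node)
    cost = 0.0
    for node, neighbours in adjacency.items():
        same = members[cluster_of[node]]
        naive = len(neighbours - same) + len(same - neighbours - {node})
        cost += naive / len(same | neighbours) if scaled else naive
    return cost


def local_search(adjacency, cluster_of, scaled, k, rng):
    """Move random vertices; stop after k consecutive moves without improvement.

    Modifies cluster_of in place and returns the final cost.
    """
    nodes = sorted(adjacency)
    labels = sorted(set(cluster_of.values()))
    best = total_cost(adjacency, cluster_of, scaled)
    moves_without_gain = 0
    while moves_without_gain < k:
        node, new_label = rng.choice(nodes), rng.choice(labels)
        old_label = cluster_of[node]
        cluster_of[node] = new_label
        cost = total_cost(adjacency, cluster_of, scaled)
        if cost < best:
            best, moves_without_gain = cost, 0   # keep the new clustering
        else:
            cluster_of[node] = old_label         # keep the old clustering
            moves_without_gain += 1
    return best


def rnsc_like(adjacency, n_clusters, k=200, seed=0):
    """Random initial clustering, then the naive scheme, then the scaled scheme."""
    rng = random.Random(seed)
    cluster_of = {node: rng.randrange(n_clusters) for node in sorted(adjacency)}
    local_search(adjacency, cluster_of, scaled=False, k=k, rng=rng)
    local_search(adjacency, cluster_of, scaled=True, k=k, rng=rng)
    return cluster_of


# Hypothetical toy graph: two triangles joined by a single edge.
toy_adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(rnsc_like(toy_adjacency, n_clusters=2))
```

Recomputing the full cost after every move keeps the sketch short; an efficient implementation would update the cost incrementally.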


MCL


Van Dongen, 2000 (Ph.D. thesis)

Based on random walks on a graph

Supports edge weights

Widely used in bioinformatics

Very simple to use

Very high computational efficiency

Stand-alone version (Unix command-line)


http://micans.org/mcl/


Web interface + interconnection with other tools


NeAT

http://rsat.ulb.ac.be/neat/


MCL (2)

Imagine random walks starting from the same point. Once inside a dense region, a random walker has little chance to escape from it.

By repeating the process, an underlying cluster structure will gradually become visible. The process ends up with a number of regions with strong internal flow (clusters), separated by 'dry' boundaries with no flow.

MCL simulates many random walks within the whole graph (using matrix multiplication), strengthening flow where it is already strong and weakening it where it is weak (using inflation).


Assessment of clustering algorithms


Which method? Which parameters?


Evaluation of a clustering algorithm:

Building a network in which the clusters are already known

Altering the network by adding and removing various proportions of edges (addition of noise, signal depletion)

Applying the clustering algorithm to the altered networks with varying parameter values

Comparing the clusters returned by the algorithm with the known complexes in a contingency table

contingency-stats in NeAT
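The alteration and comparison steps of this evaluation can be sketched as follows. This is an assumed, simplified procedure and not the NeAT contingency-stats tool: the function names, the way percentages are turned into edge counts (relative to the original number of edges, rounded down) and the toy data are all illustrative choices.

```python
import random
from itertools import combinations


def alter_network(nodes, edges, add_pct, remove_pct, seed=0):
    """Remove remove_pct % of the edges and add add_pct % random new edges.

    Both percentages are relative to the original number of edges.
    Returns a set of frozenset({a, b}) edges.
    """
    rng = random.Random(seed)
    original = {frozenset(edge) for edge in edges}
    n_remove = int(len(original) * remove_pct / 100)
    n_add = int(len(original) * add_pct / 100)
    # Signal depletion: keep a random subset of the true edges.
    kept = rng.sample(sorted(original, key=sorted), len(original) - n_remove)
    # Noise: add edges between node pairs that were not connected.
    absent = [frozenset(pair) for pair in combinations(sorted(nodes), 2)
              if frozenset(pair) not in original]
    added = rng.sample(absent, min(n_add, len(absent)))
    return set(kept) | set(added)


def contingency_table(complexes, clusters):
    """table[i][j] = number of nodes shared by complex i and cluster j."""
    return [[len(set(cx) & set(cl)) for cl in clusters] for cx in complexes]


# Hypothetical reference: two complexes, each fully connected.
complexes = [{"a", "b", "c"}, {"d", "e", "f"}]
nodes = set().union(*complexes)
edges = [pair for cx in complexes for pair in combinations(sorted(cx), 2)]
noisy = alter_network(nodes, edges, add_pct=50, remove_pct=50)
print(len(edges), "true edges ->", len(noisy), "edges after alteration")

# Clusters as they might be returned by some clustering algorithm.
clusters = [{"a", "b"}, {"c", "d", "e", "f"}]
print(contingency_table(complexes, clusters))
```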

Algorithm robustness

[Figure: robustness plots; axes: % of removed edges, % of added edges]

Looking for relevant clusters

[Figure: an experimental network goes through the clustering process, which returns cluster 1, cluster 2, ..., cluster n; example reference complexes: ribosome, nuclear pore, replication fork complex, proteasome]


Comparing the clusters with some reference clusters (annotated complexes, functional classes)

[Figure: cluster i compared with the ribosomal proteins]

Finding relevant complexes


Each cluster must be compared with each reference class.

Calculation of the significance of the intersection according to some metrics:

Jaccard

Hypergeometric P-value

What is the probability of having an intersection of at least a given size, knowing the number of elements, the size of the cluster and the size of the class?

Look for the most significant intersections.

NeAT: compare-classes
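The two metrics named above can be written down in a few lines. The sketch below is a plain Python illustration, not the compare-classes program; the function and parameter names are assumptions. The P-value is the upper tail of the hypergeometric distribution: the probability that a random cluster of the same size shares at least the observed number of elements with the class.

```python
from math import comb


def jaccard(cluster, reference):
    """Size of the intersection divided by the size of the union."""
    cluster, reference = set(cluster), set(reference)
    return len(cluster & reference) / len(cluster | reference)


def hypergeometric_pvalue(n_elements, class_size, cluster_size, intersection):
    """P(X >= intersection) when cluster_size elements are drawn at random
    from n_elements, of which class_size belong to the reference class."""
    largest = min(class_size, cluster_size)
    favourable = sum(
        comb(class_size, k) * comb(n_elements - class_size, cluster_size - k)
        for k in range(intersection, largest + 1)
    )
    return favourable / comb(n_elements, cluster_size)


# Hypothetical example: 10 proteins in total, a reference class of 4,
# a cluster of 3, of which 2 belong to the class.
print(jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4", "p5"}))
print(hypergeometric_pvalue(10, 4, 3, 2))
```

Because every cluster is compared with every reference class, the resulting P-values would normally be corrected for multiple testing before being interpreted.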

Cluster comparison: compare-classes results

MCL (3)

Example graph with four nodes: a, b, c, d.

Adjacency matrix

     a     b     c     d
a    1     1     1     0
b    1     1     1     1
c    1     1     1     0
d    0     1     0     1

Column scaling gives the Markov matrix

     a     b     c     d
a    0.33  0.25  0.33  0
b    0.33  0.25  0.33  0.5
c    0.33  0.25  0.33  0
d    0     0.25  0     0.5

Squaring and scaling (expansion)

     a     b     c     d
a    0.26  0.26  0.26  0.26
b    0.32  0.32  0.32  0.32
c    0.26  0.26  0.26  0.26
d    0.16  0.16  0.16  0.16

Expansion, raising all entries to power k (inflation), scaling (∞)

     a     b     c     d
a    0     0     0     0
b    1     1     1     1
c    0     0     0     0
d    0     0     0     0

Result: all nodes belong to the same cluster
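The matrix steps of this example (column scaling, expansion, inflation) can be reproduced with a few lines of plain Python. This is a didactic sketch, not the mcl program: the inflation value of 2, the fixed number of iterations and the function names are assumptions, and a real implementation works on sparse matrices with pruning.

```python
def scale_columns(matrix):
    """Divide every entry by its column sum (makes the matrix column-stochastic)."""
    size = len(matrix)
    sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [[matrix[r][c] / sums[c] if sums[c] else 0.0 for c in range(size)]
            for r in range(size)]


def expand(matrix):
    """Expansion: square the matrix (two-step random walks)."""
    size = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(size))
             for c in range(size)] for r in range(size)]


def inflate(matrix, power):
    """Inflation: raise every entry to `power`, then rescale the columns."""
    return scale_columns([[value ** power for value in row] for row in matrix])


def mcl(adjacency, inflation=2.0, iterations=50):
    """Alternate expansion and inflation for a fixed number of iterations."""
    matrix = scale_columns(adjacency)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    return matrix


def clusters_from(matrix, eps=1e-6):
    """Each row with non-zero entries is an attractor; its columns form a cluster."""
    clusters = []
    for row in matrix:
        members = [c for c, value in enumerate(row) if value > eps]
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Adjacency matrix of the four-node example (order a, b, c, d; self-loops included).
adjacency = [[1, 1, 1, 0],
             [1, 1, 1, 1],
             [1, 1, 1, 0],
             [0, 1, 0, 1]]
names = "abcd"
for cluster in clusters_from(mcl(adjacency)):
    print([names[index] for index in cluster])
```

On this small graph, node b attracts all the flow, which matches the final matrix of the example: all four nodes end up in one cluster.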


Statistics for comparing complexes and clusters (Brohée & van Helden, 2006)

Separation at the level of complex i

Are the proteins of complex i placed in a single cluster or in a large number of different clusters?

Separation at the level of cluster j

Do the proteins of cluster j come from a large or a small number of complexes?

Global separation (complexes and clusters)

Global separation for all complexes

Sep_Co = average(SepCo_i)

Global separation for all clusters

Sep_Cl = average(SepCl_j)
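The slides do not spell out how SepCo_i and SepCl_j are computed. The sketch below assumes a definition based on the contingency table T (complexes in rows, clusters in columns): each cell gets the product of its row-wise and column-wise frequencies, SepCo_i is the sum of these products over row i, and SepCl_j the sum over column j. This assumed formula should be checked against the cited paper before being relied upon; the function name is hypothetical.

```python
def separation(table):
    """Return (Sep_Co, Sep_Cl) for a contingency table (list of rows).

    Assumed definition: sep[i][j] = (T[i][j] / row_sum_i) * (T[i][j] / col_sum_j),
    SepCo_i = sum over j of sep[i][j], SepCl_j = sum over i of sep[i][j],
    and the global values are the averages over complexes and over clusters.
    """
    n_complexes, n_clusters = len(table), len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n_complexes))
                for j in range(n_clusters)]
    sep = [[0.0] * n_clusters for _ in range(n_complexes)]
    for i in range(n_complexes):
        for j in range(n_clusters):
            if table[i][j]:
                sep[i][j] = (table[i][j] / row_sums[i]) * (table[i][j] / col_sums[j])
    sep_co = [sum(row) for row in sep]                              # SepCo_i
    sep_cl = [sum(sep[i][j] for i in range(n_complexes))
              for j in range(n_clusters)]                           # SepCl_j
    return sum(sep_co) / n_complexes, sum(sep_cl) / n_clusters


# Perfect correspondence between two complexes and two clusters.
print(separation([[3, 0], [0, 2]]))
# One complex split evenly over two clusters.
print(separation([[2, 2]]))
```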

Brohée, S. and van Helden, J. (2006). Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 488.