Sylvain Brohée <sylvain@bigre.ulb.ac.be>
Université Libre de Bruxelles, Belgique
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)
http://www.bigre.ulb.ac.be/
Extracting relevant clusters from biological networks
Cluster
A cluster is a group of nodes of the graph.
The nodes can be grouped in a cluster according to many criteria.
Groups of neighbors
Groups of fully connected nodes (cliques)
Groups of tightly inter-connected nodes
• Selection of groups of nodes with high density
• Partitioning the graph into near-cliques which are sparsely inter-connected
• Selection of the nodes presenting more interactions with the cluster nodes than with the other nodes of the graph
In an interaction network, all proteins linked to a given protein could form a
complex with it.
#neighb   seed
prot1     prot1
prot2     prot1
prot3     prot1
prot4     prot1
prot5     prot1
Looking for the neighbors
With NeAT
Use graph-neighbours to compute the neighbors of all or part of the nodes of a graph.
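The neighbour table above can be reproduced with a few lines of Python. This is only a sketch of the idea, not the graph-neighbours implementation: the function name `neighbours`, the `include_seed` option and the edge list are assumptions made for illustration.

```python
from collections import defaultdict


def neighbours(edges, seeds=None, include_seed=True):
    """Return (neighbour, seed) pairs for every seed node.

    Hypothetical helper: `include_seed` lists the seed among its own
    neighbours, as in the table of the slide.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    if seeds is None:  # default: use every node as a seed
        seeds = sorted(adjacency)
    pairs = []
    for seed in seeds:
        members = set(adjacency[seed])
        if include_seed:
            members.add(seed)
        pairs.extend((n, seed) for n in sorted(members))
    return pairs


# Illustrative interaction list (assumed, not taken from a real data set).
edges = [("prot1", "prot2"), ("prot1", "prot3"),
         ("prot1", "prot4"), ("prot1", "prot5")]
pairs = neighbours(edges, seeds=["prot1"])
for neighbour, seed in pairs:
    print(neighbour, seed, sep="\t")
```

Calling `neighbours(edges)` without `seeds` treats every node of the graph as a seed.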
#element  cluster
prot2     clique_1
prot4     clique_1
prot1     clique_1
prot3     clique_1
prot5     clique_2
prot4     clique_2
prot1     clique_2
Looking for the cliques
Looking for sets of nodes that are completely inter-connected.
A bit extreme
With NeAT
Use graph-cliques to compute all maximal cliques in a graph.
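Maximal cliques can be enumerated with the basic Bron-Kerbosch recursion. The sketch below is one standard way to do it and is not claimed to be what graph-cliques does internally. The edge list is an assumption chosen so that the result matches the clique_1 / clique_2 table.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no node can extend it any further.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in sorted(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Assumed edges: prot1..prot4 fully connected, prot5 linked to prot1 and prot4.
edges = [("prot1", "prot2"), ("prot1", "prot3"), ("prot1", "prot4"),
         ("prot2", "prot3"), ("prot2", "prot4"), ("prot3", "prot4"),
         ("prot1", "prot5"), ("prot4", "prot5")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

cliques = maximal_cliques(adjacency)
for index, clique in enumerate(cliques, start=1):
    for node in clique:
        print(node, f"clique_{index}", sep="\t")
```

A node may belong to several maximal cliques (here prot1 and prot4), so the output is not a partition of the graph.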
Graph clustering - Definition
Assignment of graph vertices to clusters.
Clustering criteria
Selection of groups of nodes with high density
Partitioning the graph into near-cliques which are sparsely inter-connected
Selection of the nodes presenting more interactions with the cluster nodes than with the other nodes of the graph
Clustering algorithm development is a very active domain in bioinformatics.
Graph clustering - Interests
Many fields of application
Database searching
Graph drawing
...
Bioinformatics
• Detection of protein families
• Gene expression pattern
• Protein complex detection
• ...
RNSC
King 2004
Cost-based local search algorithm
Maximum number of requested clusters may be defined
Intuitive
Cannot deal with edge-weighted graphs
Many parameters, most of them of limited use
Available upon request
Available on NeAT
RNSC calculates two types of score:
Naive score: sum, over each node i, of
• the number of neighbours of i that are not in the same cluster as i
• the number of non-neighbours of i that are in the same cluster as i
Scaled score: sum, over each node i, of
• the naive score of i divided by the size of the union of the cluster containing i and the set of neighbours of i.
Example (values from the figure):
neighbours not in the same cluster: 3
non-neighbours in the same cluster: 1
size of union: 6
naive score: 4
scaled score: 4/6
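The two scores can be sketched in Python as follows. The code follows the simplified wording of the slide, so any normalisation constant used in the published RNSC cost functions is left out. The function name and the toy graph are assumptions, built so that node i reproduces the figure values (3, 1, 6, 4 and 4/6).

```python
def rnsc_scores(adjacency, cluster_of):
    """Return (naive total, scaled total, per-node details).

    adjacency:  node -> set of neighbours (undirected, no self-loops)
    cluster_of: node -> cluster label
    """
    clusters = {}
    for node, label in cluster_of.items():
        clusters.setdefault(label, set()).add(node)

    naive_total, scaled_total, per_node = 0, 0.0, {}
    for node, label in cluster_of.items():
        cluster = clusters[label]
        neighbours = adjacency[node]
        outside = len(neighbours - cluster)            # neighbours in another cluster
        missing = len(cluster - neighbours - {node})   # cluster mates that are not neighbours
        union = len(cluster | neighbours)              # never empty: the cluster contains node
        naive = outside + missing
        per_node[node] = (outside, missing, union, naive)
        naive_total += naive
        scaled_total += naive / union
    return naive_total, scaled_total, per_node


# Toy graph (assumed): i has four neighbours, only x shares its cluster;
# y is in the cluster of i without being a neighbour.
adjacency = {"i": {"x", "n1", "n2", "n3"}, "x": {"i"}, "y": set(),
             "n1": {"i"}, "n2": {"i"}, "n3": {"i"}}
cluster_of = {"i": "A", "x": "A", "y": "A", "n1": "B", "n2": "B", "n3": "B"}

naive_total, scaled_total, per_node = rnsc_scores(adjacency, cluster_of)
outside, missing, union, naive = per_node["i"]
print(outside, missing, union, naive, naive / union)
```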
RNSC (2)
RNSC (3)
Starting from an initial random clustering, RNSC makes a move at each iteration (moving a vertex from one cluster to another).
The RNSC process has two steps:
The naive scheme:
• If the move decreases the naive cost of the previous iteration, RNSC keeps the new clustering; otherwise it keeps the old one.
• The naive scheme stops when the cost has not decreased during the last k moves. RNSC then goes to the second step.
The scaled scheme:
• Same as the naive scheme but using the scaled cost.
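The naive scheme can be sketched as a simple local search. This is a deliberately reduced illustration and not the RNSC program: moves are drawn uniformly at random here, which is an assumption, and the function names, the toy graph and the values of `k` and `seed` are hypothetical.

```python
import random


def naive_cost(adjacency, cluster_of):
    """Naive cost as described on the slide (no normalisation constant)."""
    cost = 0
    for node, label in cluster_of.items():
        cluster = {n for n, l in cluster_of.items() if l == label}
        cost += len(adjacency[node] - cluster)            # neighbours elsewhere
        cost += len(cluster - adjacency[node] - {node})   # non-neighbours inside
    return cost


def naive_scheme(adjacency, n_clusters, k=50, seed=0):
    """Random initial clustering, then keep only cost-decreasing moves."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    cluster_of = {n: rng.randrange(n_clusters) for n in nodes}
    initial_cost = cost = naive_cost(adjacency, cluster_of)
    stalled = 0
    while stalled < k:                      # stop after k moves without improvement
        node = rng.choice(nodes)
        old_label = cluster_of[node]
        cluster_of[node] = rng.randrange(n_clusters)
        new_cost = naive_cost(adjacency, cluster_of)
        if new_cost < cost:
            cost, stalled = new_cost, 0     # keep the new clustering
        else:
            cluster_of[node] = old_label    # keep the old clustering
            stalled += 1
    return cluster_of, cost, initial_cost


# Toy graph (assumed): two triangles joined by a single edge.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

clustering, final_cost, initial_cost = naive_scheme(adjacency, n_clusters=2)
print(initial_cost, "->", final_cost, clustering)
```

The scaled scheme would reuse the same loop with the scaled cost in place of `naive_cost`, starting from the clustering returned here.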
MCL
Van Dongen, 2000 (Ph.D. thesis)
Based on random walks on a graph
Takes edge weights into account
Widely used in bioinformatics
Very simple to use
Very high computational efficiency
Stand-alone version (Unix command line)
http://micans.org/mcl/
Web interface + interconnection with other tools
NeAT
http://rsat.ulb.ac.be/neat/
Imagine random walks, all starting from the same point. Once inside a dense region, a random walker has little chance to escape from it.
By repeating the process, an underlying cluster structure gradually becomes visible.
The process ends up with a number of regions with strong internal flow (clusters), separated by 'dry' boundaries with no flow.
MCL simulates many random walks within the whole graph (using matrix multiplication), strengthens flow where it is already strong, and weakens it where it is weak (using inflation).
MCL (2)
[Figure; axes: % of removed edges -> % of added edges]
Assessment of clustering algorithms
Which method? Which parameters?
Evaluation of clustering algorithms:
• Building a network in which we already know the clusters.
• Altering the network by adding and removing various proportions of edges (addition of noise, signal depletion).
• Applying the clustering algorithm to the networks with varying parameter values.
• Comparing the clusters returned by the algorithm with the known complexes in a contingency table.
• contingency-stats in NeAT
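The network alteration step can be sketched as follows. The function name, the toy network and the choice to express both percentages relative to the original number of edges are assumptions made for illustration. The slides do not specify how the proportions are defined.

```python
import random
from itertools import combinations


def alter_network(nodes, edges, removed_pct, added_pct, seed=0):
    """Remove a percentage of the true edges and add random false edges.

    Both percentages are taken relative to the original edge count
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    original = {frozenset(edge) for edge in edges}
    non_edges = [frozenset(pair) for pair in combinations(sorted(nodes), 2)
                 if frozenset(pair) not in original]
    n_remove = round(len(original) * removed_pct / 100)
    n_add = min(round(len(original) * added_pct / 100), len(non_edges))
    # Signal depletion: keep a random subset of the true edges.
    kept = set(rng.sample(sorted(original, key=sorted), len(original) - n_remove))
    # Noise: add edges that were absent from the original network.
    added = set(rng.sample(non_edges, n_add))
    return kept | added


# Toy reference network (assumed): two complexes of four proteins, each a clique.
complexes = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]]
nodes = [protein for members in complexes for protein in members]
edges = [pair for members in complexes for pair in combinations(members, 2)]

altered = alter_network(nodes, edges, removed_pct=50, added_pct=25)
print(len(edges), "original edges ->", len(altered), "altered edges")
```

Running the clustering algorithm on a grid of `removed_pct` and `added_pct` values, and comparing each result with `complexes`, gives the robustness assessment described above.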
Algorithm robustness
[Figure panels; axes: % of removed edges -> % of added edges]
Experimental network -> clustering process -> cluster1, cluster2, ..., cluster n
Looking for relevant clusters
[Figure: clusters (cluster1, cluster2, cluster3, ..., cluster n) and reference complexes (ribosome, nuclear pore, replication fork complex, proteasome, ...)]
Comparing the clusters with some reference clusters (annotated complexes,
functional classes)
[Figure labels: cluster i, ribosomal proteins]
Finding relevant complexes
Each cluster must be compared with each reference class.
Calculation of the significance of the intersection according to some metrics:
Jaccard
Hypergeometric P-value
• What is the probability of observing an intersection of at least a given size, knowing the total number of elements, the size of the cluster and the size of the class?
Look for the most significant intersections.
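Both metrics are easy to compute for one cluster and one reference class. The sketch below uses only the Python standard library. The function names and the toy protein sets are assumptions, and the code is not claimed to reproduce the output format of compare-classes.

```python
from math import comb


def jaccard(cluster, ref_class):
    """Size of the intersection divided by the size of the union."""
    cluster, ref_class = set(cluster), set(ref_class)
    return len(cluster & ref_class) / len(cluster | ref_class)


def hypergeometric_pvalue(intersection, cluster_size, class_size, population):
    """P(X >= intersection) when cluster_size elements are drawn without
    replacement from a population containing class_size class members."""
    upper = min(cluster_size, class_size)
    favourable = sum(
        comb(class_size, x) * comb(population - class_size, cluster_size - x)
        for x in range(intersection, upper + 1)
    )
    return favourable / comb(population, cluster_size)


# Toy example (assumed): 10 proteins in total, a cluster of 3, a class of 4.
cluster = {"p1", "p2", "p3"}
ref_class = {"p2", "p3", "p4", "p5"}
population = 10

jac = jaccard(cluster, ref_class)
pval = hypergeometric_pvalue(len(cluster & ref_class), len(cluster),
                             len(ref_class), population)
print(f"Jaccard = {jac:.2f}, P-value = {pval:.3f}")
```

For these sets the intersection has 2 elements, so the Jaccard coefficient is 2/5 and the P-value is (6 × 6 + 4 × 1) / 120 = 1/3. When every cluster is compared with every class, the P-values would normally be corrected for multiple testing.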
NeAT: compare-classes
Cluster comparison: compare-classes results
[Figure: graph with nodes a, b, c, d]

Adjacency matrix
     a     b     c     d
a    1     1     1     0
b    1     1     1     1
c    1     1     1     0
d    0     1     0     1

Column scaling -> Markov matrix
     a     b     c     d
a    0.33  0.25  0.33  0
b    0.33  0.25  0.33  0.5
c    0.33  0.25  0.33  0
d    0     0.25  0     0.5

Squaring and scaling (expansion) (∞)
     a     b     c     d
a    0.26  0.26  0.26  0.26
b    0.32  0.32  0.32  0.32
c    0.26  0.26  0.26  0.26
d    0.16  0.16  0.16  0.16

Expansion, raising all entries to power k (inflation), scaling (∞)
     a     b     c     d
a    0     0     0     0
b    1     1     1     1
c    0     0     0     0
d    0     0     0     0

Results: all nodes belong to the same cluster
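The matrix steps above can be replayed with a short pure-Python sketch. It is a minimal illustration of the expansion / inflation loop, not the mcl program. The inflation power of 2, the fixed number of iterations and the way clusters are read from the final matrix (rows with a non-zero diagonal entry are taken as attractors) are assumptions of this sketch.

```python
def scale_columns(matrix):
    """Divide every entry by its column sum (columns then sum to 1)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(matrix):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, power):
    """Inflation: raise all entries to `power`, then rescale the columns."""
    return scale_columns([[x ** power for x in row] for row in matrix])


def mcl(adjacency, inflation=2.0, iterations=50):
    markov = scale_columns(adjacency)
    matrix = markov
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    return markov, matrix


def read_clusters(matrix, labels, eps=1e-6):
    """Each row with a non-zero diagonal entry is an attractor; the
    columns with flow towards it form one cluster."""
    clusters = []
    for i, row in enumerate(matrix):
        if row[i] > eps:
            clusters.append([labels[j] for j, x in enumerate(row) if x > eps])
    return clusters


labels = ["a", "b", "c", "d"]
adjacency = [[1, 1, 1, 0],   # adjacency matrix of the slide (self-loops included)
             [1, 1, 1, 1],
             [1, 1, 1, 0],
             [0, 1, 0, 1]]

markov, final = mcl(adjacency)
clusters = read_clusters(final, labels)
print(clusters)
```

With these settings all the flow ends up in the row of node b, so the four nodes form a single cluster, as on the slide. Higher inflation values generally yield more, smaller clusters on larger graphs.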
MCL (3)
Separation at the level of complex i
Are the proteins of complex i placed in a single cluster or in a large number of different clusters?
Separation at the level of cluster j
Do the proteins of cluster j come from a large or a small number of complexes?
Global separation (complexes and clusters)
Global separation for all complexes
$\mathrm{Sep}_{Co} = \mathrm{average}(\mathrm{SepCo}_i)$
Global separation for all clusters
$\mathrm{Sep}_{Cl} = \mathrm{average}(\mathrm{SepCl}_j)$
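The slide gives only the averaging step. The sketch below fills in the per-complex and per-cluster values under an assumption: each cell of the complex × cluster contingency table receives the product of its row-wise and column-wise frequencies, SepCo_i is the sum of these products over row i, and SepCl_j is the sum over column j. This is one reading of the statistics in Brohée & van Helden (2006) and should be checked against the paper. The function name and the toy tables are hypothetical.

```python
def separation(table):
    """table[i][j] = number of proteins shared by complex i and cluster j.

    Returns (Sep_Co, Sep_Cl) under the assumed definition
    sep_ij = (T_ij / row_sum_i) * (T_ij / col_sum_j).
    """
    n, m = len(table), len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n)) for j in range(m)]
    sep = [[(table[i][j] / row_sums[i]) * (table[i][j] / col_sums[j])
            if table[i][j] else 0.0
            for j in range(m)] for i in range(n)]
    sep_co_i = [sum(sep[i]) for i in range(n)]                       # one value per complex
    sep_cl_j = [sum(sep[i][j] for i in range(n)) for j in range(m)]  # one value per cluster
    return sum(sep_co_i) / n, sum(sep_cl_j) / m


# Two complexes recovered exactly as two clusters.
perfect = separation([[3, 0],
                      [0, 4]])
# One complex split evenly over two clusters.
split = separation([[2, 2]])
print(perfect, split)
```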
Statistics for comparing complexes and clusters
(Brohée & van Helden, 2006)
Brohée, S. and van Helden, J. (2006). Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 488.