An Efficient Algorithm for Large-Scale Detection of Protein Families
A. J. Enright, S. Van Dongen and C. A. Ouzounis
Nucleic Acids Research, 2002, Vol. 30, No. 7, 1575-1584
© 2002 Oxford University Press
Introduction
Why do protein families need to be detected?
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics.
Protein family classification can significantly contribute to:
• the delineation of functional diversity of homologous proteins
• prediction of function based on domain architecture or the presence of sequence motifs
• comparative genomics
• valuable evolutionary insights
Why a new method?
The multi-domain structure of many protein families confounds existing sequence-similarity methods: a shared domain does not imply a shared biochemical function, which results in the incorrect grouping of proteins
Attempted Solutions
detection of individual domains
• using BLAST reports
• domain database dictionaries
• iterative sequence comparison
manual intervention for family assignment of multi-domain proteins
Drawbacks:
• either too computationally intensive
• or somewhat inaccurate
• or not fully automatic
Promiscuous Domains

Smaller, quite widespread protein modules
Families based on such domains are unlikely to share a common
evolutionary history
New Challenges: Eukaryotic Genomes
• Larger in size
• Greater complexity
• Eukaryotic protein families: bottleneck for most methods
• Large Protein Domains: need efficient clustering algorithms
• Traditional iterative automatic domain detection algorithms rendered impractical
Suggested Solution:
Detecting similar domain architectures (vs. detecting each domain individually)
Assumption: proteins with near-identical sets of domains may have very similar biochemical roles
Why a new method? (contd.)
Case Study: GeneRAGE Algorithm
Clusters proteins within complete genomes
Uses the Smith-Waterman dynamic programming alignment algorithm for error detection and correction
Multi-domain proteins: sequence comparison using a domain detection algorithm (also based on Smith-Waterman)
For small data sets (e.g. prokaryotic genomes)
• effectively and accurately identifies protein families
• correctly detects multi-domain proteins
For larger data sets (e.g. eukaryotic organisms), detection of protein domains becomes hampered to a large extent by
• promiscuous domains
• peptide fragments (representing incomplete database entries)
• proteins of complex domain structure
Example
Domains such as a ‘response regulator’ domain from two-component systems cause proteins with vastly differing functions (such as heat shock factors and phytochromes) to be assigned incorrectly to the same family
What is TRIBE-MCL?
TRIBE-MCL: a method for rapid and accurate clustering of protein sequences into families
Uses the Markov Cluster (MCL) algorithm (previously developed for graph clustering using flow simulation) for the assignment of proteins into families based on precomputed sequence similarity information
Claims that it:
is ideally suited to the rapid and accurate detection of protein families on a large scale
does NOT suffer from the problems that normally hinder other protein sequence clustering algorithms, such as
• the presence of multi-domain proteins
• promiscuous domains
• fragmented proteins
Some Definitions
Random Walks
• http://www.gypsymoth.ento.vt.edu/~sharov/PopEcol/lec12/randwalk.html
• http://mathworld.wolfram.com/RandomWalk1-Dimensional.html
Stochastic Matrix
• A column stochastic matrix is a non-negative matrix with the property that each of its columns sums to 1
Markov Chain
• A random process in which the probability that a certain future state will occur depends only on the present or immediately preceding state of the system, and not on the events leading up to the present state
Bootstrapping procedure
• http://www.wordiq.com/definition/Bootstrap
Doubly idempotent matrix
• http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/special.html#Doub_Stochastic
Quadratic Convergence
• http://www.seas.upenn.edu/~gksuresh/courses/fall97/lectures.html
• http://www.cse.buffalo.edu/faculty/miller/Courses/CS237/Pitman/node7.html
• http://planetmath.org/encyclopedia/QuadraticConvergence.html
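The stochastic-matrix and Markov-chain definitions above can be made concrete with a small sketch. The 3-node weight matrix and the helper name to_column_stochastic are invented here for illustration and do not come from the paper.

```python
import numpy as np

def to_column_stochastic(weights):
    """Scale each column of a non-negative matrix so that it sums to 1."""
    weights = np.asarray(weights, dtype=float)
    col_sums = weights.sum(axis=0)
    if np.any(col_sums == 0):
        raise ValueError("every column needs at least one positive entry")
    return weights / col_sums

# Hypothetical 3-node graph: symmetric edge weights with loops on the diagonal.
W = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
M = to_column_stochastic(W)  # M[i, j] = probability of stepping from j to i

# Markov property: the next distribution depends only on the current one.
start = np.array([1.0, 0.0, 0.0])     # walker starts at node 0
after_one_step = M @ start            # equals column 0 of M
after_two_steps = M @ after_one_step  # a random walk of length 2
```

Each product with M advances the walk by one step, and the result stays a probability distribution because every column of M sums to 1.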
TRIBE-MCL Algorithm: Overview
Assembly of a FASTA file containing the sequences to be clustered
Filtering using CAST, an accurate and sensitive compositional bias detection algorithm
The filtered sequence set is compared against itself using BLAST
The all-against-all sequence similarities generated by this analysis are parsed and stored in a square matrix
This matrix represents sequence similarities as a connection graph
• Nodes represent proteins
• Edges represent sequence similarity that connects such proteins
• Weighted edges: average pairwise -log10(E-value)
• Symmetric matrix generated
Weights are transformed into transition probabilities
Iterative rounds of matrix multiplication and matrix inflation until there is little or no net change in the matrix
Inflation value parameter: controls cluster granularity (or ‘tightness’)
Final matrix: interpreted as a protein family clustering
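The matrix-building steps above can be sketched as follows. This is an illustration under stated assumptions, not the TRIBE-MCL implementation: the function names, the dictionary format for BLAST hits, the clamping of E-values at 1e-200, the treatment of a missing direction as weight 0, and the use of the column maximum as loop weight are all choices made here for the example.

```python
import math
import numpy as np

def similarity_matrix(hits, proteins, min_evalue=1e-200):
    """Symmetric weights: average of -log10(E-value) over both hit directions.

    hits maps (query, subject) -> E-value. A missing direction counts as 0
    (assumption of this sketch).
    """
    index = {name: i for i, name in enumerate(proteins)}
    directed = np.zeros((len(proteins), len(proteins)))
    for (query, subject), evalue in hits.items():
        if query == subject:
            continue  # loops are added separately below
        score = -math.log10(max(evalue, min_evalue))  # clamp avoids log10(0)
        directed[index[query], index[subject]] = max(score, 0.0)
    return (directed + directed.T) / 2.0

def transition_matrix(weights):
    """Add loop weights (assumed: column maximum) and normalise each column to 1."""
    w = weights.copy()
    loops = w.max(axis=0)
    loops[loops == 0] = 1.0  # isolated proteins keep all flow on themselves
    np.fill_diagonal(w, loops)
    return w / w.sum(axis=0)

# Hypothetical hits for three proteins.
proteins = ["P1", "P2", "P3"]
hits = {("P1", "P2"): 1e-50, ("P2", "P1"): 1e-40, ("P2", "P3"): 1e-5}
weights = similarity_matrix(hits, proteins)
transitions = transition_matrix(weights)
```

The resulting transitions matrix is column stochastic, so it can be handed directly to an expansion and inflation loop.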
MCL Algorithm: Operators
Inflation
corresponds with taking the Hadamard power of a matrix (taking powers entry-wise) followed by a scaling step, such that the resulting matrix is stochastic again
Γr -> Inflation operator
M -> Stochastic matrix
r -> a real number (power coefficient) > 1
Mij -> probability of going from j to i
For values of r > 1, inflation changes the probabilities by favoring more probable walks over less probable walks
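Written out, the inflation step described above (entry-wise power followed by column rescaling) can be stated as below. The summation index k is introduced here for illustration; the other symbols are those defined on this slide.

```latex
\[
(\Gamma_r M)_{ij} \;=\; \frac{(M_{ij})^{r}}{\sum_{k} (M_{kj})^{r}}
\]
```

For example, with r = 2 a column (0.5, 0.3, 0.2) has squares (0.25, 0.09, 0.04) summing to 0.38, so it becomes approximately (0.658, 0.237, 0.105): the most probable transition gains weight and the others lose it.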
Expansion
corresponds to computing random walks of ‘higher length’ (which means random walks with many steps)
Expansion coincides with taking the power of a stochastic matrix using the normal matrix product (i.e. matrix squaring)
It associates new probabilities with all pairs of nodes
Premise: Higher length paths are more common within clusters than between different clusters
the probabilities associated with node pairs within a cluster will, in general, be relatively large
Inflation will then have the effect of boosting the probabilities of intra-cluster walks and will demote inter-cluster walks
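The two operators can be sketched in a few lines of NumPy. The function names and the 3-node example matrix are hypothetical and chosen only to show the effect of one expansion followed by one inflation.

```python
import numpy as np

def expand(M):
    """Expansion: square the matrix with the ordinary matrix product."""
    return M @ M

def inflate(M, r):
    """Inflation: entry-wise power r, then rescale each column to sum to 1."""
    powered = np.power(M, r)
    return powered / powered.sum(axis=0)

# Hypothetical column-stochastic matrix: nodes 0 and 1 are tightly linked,
# node 2 is attached to node 1 only.
M = np.array([[0.5, 0.4, 0.0],
              [0.5, 0.4, 0.5],
              [0.0, 0.2, 0.5]])
expanded = expand(M)               # probabilities of walks of length 2
inflated = inflate(expanded, 2.0)  # favours the more probable walks
```

After inflation, the largest entry of each column is at least as large as before, which is the "boosting" effect described above.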
Salient Features
Iterating expansion and inflation -> Segmentation of the graph
Expansion and inflation: tidal forces alternated until equilibrium (doubly idempotent matrix) is reached
Stochastically
• Expansion causes flow to dissipate within clusters
• Inflation eliminates flow between different clusters
Mathematically
• Inflation strengthens the structural property and will never change the associated DAG;
• Expansion is in fact able to change the associated DAG
Increasing the inflation parameter makes the operator stronger, and increases the granularity or ‘tightness’ of clusters
The process simulated by the algorithm converges quadratically around the equilibrium states
In practice, the algorithm starts to converge noticeably after 3-10 iterations
Conjecture: process always converges if the input graph is symmetric
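The alternation of expansion and inflation until equilibrium can be sketched as a short loop. This is a dense, unpruned toy version written for this summary, not the mcl program: the function names, the tolerance, the threshold eps and the 6-node graph (two triangles joined by one weak edge) are all assumptions made for the example.

```python
import numpy as np

def mcl(weights, inflation=2.0, max_iter=100, tol=1e-8):
    """Alternate expansion and inflation until the matrix stops changing."""
    M = np.asarray(weights, dtype=float)
    M = M / M.sum(axis=0)              # make the input column stochastic
    for iteration in range(1, max_iter + 1):
        previous = M
        M = M @ M                      # expansion
        M = np.power(M, inflation)     # inflation: entry-wise power ...
        M = M / M.sum(axis=0)          # ... then column rescaling
        if np.abs(M - previous).max() < tol:
            break
    return M, iteration

def clusters_from_limit(M, eps=1e-6):
    """Each row that retains flow lists the members of one cluster."""
    found = set()
    for row in M:
        members = frozenset(np.flatnonzero(row > eps).tolist())
        if members:
            found.add(members)
    return sorted(sorted(c) for c in found)

# Hypothetical graph: triangles {0,1,2} and {3,4,5}, weak edge 2-3, loops of 1.
W = np.array([[1, 1, 1,   0,   0, 0],
              [1, 1, 1,   0,   0, 0],
              [1, 1, 1,   0.1, 0, 0],
              [0, 0, 0.1, 1,   1, 1],
              [0, 0, 0,   1,   1, 1],
              [0, 0, 0,   1,   1, 1]], dtype=float)
limit, n_iter = mcl(W)
clusters = clusters_from_limit(limit)
```

On this toy input the weak edge between the two triangles carries little flow, so the limit matrix separates the nodes into the two triangles.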
Salient Features (contd.)
‘Neutral’ value: a weight that will not change when the inflation operator is applied to the stochastic column associated with the node
‘Bootstrapping’ nature: retrieving cluster structure via the imprint made by this structure on the flow process
Alternating expansion with inflation turns out to be an appropriate way of exploiting this recombination property
Based on a very different paradigm than any linkage-based algorithm.
Worst Case Time Complexity: O(N*K^2)
• K -> pruning factor
• Extremely Tight & Dense Graphs: O(N*N*log(K))
Space Complexity: O(N*K)
Max RAM needed: 2 * s * N * K
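As a rough illustration of the memory bound above, the arithmetic can be checked in a couple of lines. The slide does not define s; this sketch assumes it is the size in bytes of one stored matrix entry and uses 8 bytes as a guess. The value K = 1000 is a hypothetical pruning factor, and N is taken from the human-genome data set described later in these slides.

```python
def max_ram_bytes(n_nodes, k, cell_bytes=8):
    """Upper bound 2 * s * N * K, with s assumed to be bytes per matrix entry."""
    return 2 * cell_bytes * n_nodes * k

# N = 103 038 proteins; K = 1000 is a hypothetical pruning factor.
estimate = max_ram_bytes(103_038, 1_000)
print(f"{estimate / 2**30:.2f} GiB")
```

Because the bound is linear in both N and K, halving the pruning factor halves the estimate.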
Tweaking MCL
Optimal Resource Usage: -scheme parameter
70 < Jury Marks < 100
Run using different schemes, compare clusters using clmdist
Cluster Granularity
Effect of Inflation: -I, the principal handle
• 1.1 to 5.0
Effect of Similarity Distribution
• Homogeneous Similarities – Coarse Grained Graph
Effect of initial centering
• -c parameter (adds loops to input graph)
Two Level Approach - for coarse graphs
Cluster Overlap:
Overlap – Theoretical (very strong symmetries)
Default - No overlap in mcl output
MCL iterand clusters
-dump cls: writes clusters associated with each iterand to file
-dump ite: writes the intermediate iterands to file, then apply clmimac
Application to Biological Graphs - clustering of proteins into protein families
• Nodes -> represent proteins
• Edges -> represent similarity between these proteins
• Edges are weighted -> according to sequence similarity score obtained from BLAST etc.
• A Markov matrix is constructed -> representing transition probabilities from any protein in the graph to any other protein for which a similarity has been detected.
• Diagonal elements are set arbitrarily to a ‘neutral’ value
• This Markov matrix is supplied to the MCL algorithm
• Initial expansion - simulates random walks (which allow one to measure ‘flow’ in the graph)
– Areas of high flow indicate that a large number of random walks go through this area.
• Iterations of expansion and inflation - to promote flow through the graph where it is strong (within protein families), and remove flow where it is weak (across protein families)
• Process terminates when equilibrium has been reached
• “A random walk starting at any given protein in a family is more likely to linger within this family than to cross to another family”
– Flow between protein families will be weaker than flow within a family as there are relatively few (if any) paths that cross two distinct protein families.
– Inter-family paths represent either sequence similarity relationships due to multi-domain proteins or mere false positive similarity detections.
• MCL approach overcomes many of the protein sequence clustering problems
– Promiscuous domains will connect a member of a given protein family to all members of that family and possibly to other protein families
– the algorithm gradually eliminates these inter-family similarities and detects protein families accurately.
– MCL algorithm clusters proteins with different domain structures into distinct families
Protein family analysis of the draft human genome
Input
a full set of peptides from the EnsEMBL 0.80 release (29 691 proteins) along with a vertebrate subset of proteins from the SwissProt and SPTrEMBL databases (73 347 entries).
Procedure:
Protein similarities within these 103 038 proteins detected using BLAST
BLAST protein similarities (over 15 million) given to the TRIBE-MCL algorithm
Took 15 min on a single CPU of a Compaq ES40 server.
The RLCS algorithm (A. J. Enright, unpublished data) was then used to determine a consensus annotation for each detected protein family
Results:
13 023 protein families detected
11 481 families (88% of the total) are human specific
Avg. family size: 2.5 members
Only 1110 single-member families (3% of the total number of families).
The family size distribution has an exponential shape
hundreds of protein families with more than 20 members
347 families with more than 50 members (indicating a high degree of paralogy)
Some well known families detected: zinc finger-containing proteins, olfactory receptors, members of the ras superfamily of GTPases, myosin, actin, keratin, immunoglobulin, certain ribosomal proteins and multiple kinase types
Also detected: a number of novel families in the human genome whose functions are either unknown or predicted.
Improvements Identified:
Some of the largest families (with more than 1000 members) may contain a number of unrelated members
Arises from presence of multiply repeated sequence patterns (not individual promiscuous domains)
Working towards the definition of multiple levels of protein family classification, using post-processing of the initial clusters with multiple-threshold clustering
Advantages
Graph theory allows a global treatment of all
relationships in similarity space simultaneously (as
against pairwise treatment by traditional methods)
It is not misled by edges linking different clusters
It is very fast and very scalable
It has a natural parameter for influencing cluster
granularity
The mathematics associated shows that there is an
intrinsic relationship between the process it simulates
and cluster structure in the input graph
Its formulation is simple and elegant
TRIBE-MCL Execution
mcxdeblast
mcxassemble
mcl
clmformat
zoem
mclpipeline
mclblastline
Conclusion
We have a novel algorithm that generates accurate
protein families using the MCL formalism for graph
clustering by flow simulation.
Avoids expensive sequence alignment, because the
method operates on a graph that contains similarity
information
Global patterns of sequence similarity are detected and
used for partitioning
Very fast, and accurate. The quality of the clustering is
impressive.
Up to 95% agreement can be obtained in a comparison
of the resulting classification using TRIBE-MCL and the
manually curated InterPro database.
Questions???