An Efficient Algorithm for Large-Scale Detection of ... - Bioinformatics

Nov 25, 2013 (3 years and 10 months ago)

An Efficient Algorithm for Large-Scale Detection of Protein Families

A. J. Enright, S. Van Dongen and C. A. Ouzounis



Nucleic Acids Research, 2002, Vol. 30, No. 7, 1575-1584


© 2002 Oxford University Press


Introduction


Why do protein families need to be detected?



Detection of protein families in large databases is one of the
principal research objectives in structural and functional
genomics




Protein family classification can significantly contribute to:



the delineation of functional diversity of homologous proteins


prediction of function based on domain architecture or the presence of sequence motifs


comparative genomics




Valuable evolutionary insights



Why a new method?


The multi-domain structure of many protein families confounds existing clustering methods: a shared domain does not imply a shared biochemical function


This results in the incorrect grouping of proteins


Attempted Solutions


detection of individual domains



using BLAST reports


domain database dictionaries


iterative sequence comparison


manual intervention for family assignment of multi-domain proteins



Drawbacks:



either too computationally intensive


or somewhat inaccurate


or not fully automatic






Promiscuous domains: smaller, quite widespread protein modules


Families based on such domains are unlikely to share a common
evolutionary history



New Challenges: Eukaryotic Genomes


Larger in size


Greater complexity


Eukaryotic protein families: bottleneck for most methods


Large Protein Domains: need efficient Clustering Algorithms


Traditional iterative automatic domain detection algorithms rendered
impractical


Suggested Solution:



Detecting similar domain architectures (vs. detecting each domain individually)


Assumption: proteins with near-identical sets of domains may have very similar biochemical roles




Case Study: GeneRAGE Algorithm


Clusters proteins within complete genomes



Uses the Smith-Waterman dynamic programming alignment algorithm for error detection and correction



Multi-domain proteins: sequence comparison using a domain detection algorithm (also based on Smith-Waterman)



For small data sets (e.g. prokaryotic genomes)



effectively and accurately identifies protein families


correctly detects multi-domain proteins




For larger data sets (e.g. eukaryotic organisms), detection of protein domains becomes hampered to a large extent by


promiscuous domains


peptide fragments (representing incomplete database entries)


proteins of complex domain structure.



Example: domains such as a ‘response regulator’ domain from two-component systems cause proteins with vastly differing functions (such as heat shock factors and phytochromes) to be assigned incorrectly to the same family

What is TRIBE-MCL?


TRIBE-MCL: a method for rapid and accurate clustering of protein sequences into families



Uses the Markov Cluster (MCL) algorithm (previously developed for graph clustering using flow simulation) for the assignment of proteins into families based on precomputed sequence similarity information



Claims that it:


is ideally suited to the rapid and accurate detection of protein
families on a large scale


does NOT suffer from the problems that normally hinder other
protein sequence clustering algorithms, such as


the presence of multi-domain proteins


promiscuous domains


fragmented proteins

Some Definitions


Random Walks


http://www.gypsymoth.ento.vt.edu/~sharov/PopEcol/lec12/randwalk.html


http://mathworld.wolfram.com/RandomWalk1-Dimensional.html


Stochastic Matrix


A column stochastic matrix is a non-negative matrix with the property that each of its columns sums to 1
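The definition can be made concrete with a small sketch. The function names and the example weights below are illustrative assumptions, not part of TRIBE-MCL itself; the code only shows how a non-negative weight matrix is rescaled so that every column sums to 1.

```python
def normalize_columns(matrix):
    """Rescale each column of a non-negative square matrix to sum to 1."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    if any(s == 0 for s in col_sums):
        raise ValueError("a column with total weight 0 cannot be normalized")
    return [[matrix[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def is_column_stochastic(matrix, tol=1e-9):
    """True if all entries are >= 0 and every column sums to 1 (within tol)."""
    n = len(matrix)
    non_negative = all(x >= 0 for row in matrix for x in row)
    sums_ok = all(
        abs(sum(matrix[i][j] for i in range(n)) - 1.0) <= tol for j in range(n)
    )
    return non_negative and sums_ok


# Hypothetical symmetric weights for three nodes (made-up numbers).
weights = [
    [1.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
stochastic = normalize_columns(weights)
print(is_column_stochastic(weights), is_column_stochastic(stochastic))
```

Entry (i, j) of the normalized matrix can then be read as the probability of stepping from node j to node i.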


Markov Chain


A random process in which the probability that a certain future state will occur
depends only on the present or immediately preceding state of the system, and
not on the events leading up to the present state


Bootstrapping procedure


http://www.wordiq.com/definition/Bootstrap


doubly idempotent matrix



http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/special.html#Doub_Stochastic


Quadratic Convergence


http://www.seas.upenn.edu/~gksuresh/courses/fall97/lectures.html


http://www.cse.buffalo.edu/faculty/miller/Courses/CS237/Pitman/node7.html


http://planetmath.org/encyclopedia/QuadraticConvergence.html


TRIBE-MCL Algorithm: Overview


Assembly of a FASTA file containing the sequences to be clustered


Filtering using CAST, an accurate and sensitive compositional bias
detection algorithm


The filtered sequence set is compared against itself using BLAST


The all-against-all sequence similarities generated by this analysis are parsed and stored in a square matrix


This matrix represents sequence similarities as a connection graph


Nodes represent proteins


Edges represent sequence similarity that connects such proteins


Weighted edges: average pairwise -log10(E-value)


Symmetric matrix generated


Weights are transformed into transition probabilities


Iterative rounds of matrix multiplication and matrix inflation until there is little
or no net change in the matrix


Inflation value parameter : controls cluster granularity (or ‘tightness’)


Final matrix: interpreted as a protein family clustering
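The matrix-building steps of the overview can be sketched as follows. This is a minimal illustration, not the TRIBE-MCL implementation: the protein names, the E-values, the `floor` guard against log(0), the clamping of negative weights to 0, the reuse of a one-directional hit when the reverse hit is missing, and the placeholder diagonal weight are all assumptions made for the example.

```python
import math


def edge_weight(e_ab, e_ba, floor=1e-200):
    """Average of -log10(E-value) over both directions of a hit pair."""
    w_ab = -math.log10(max(e_ab, floor))  # floor avoids log10(0)
    w_ba = -math.log10(max(e_ba, floor))
    return (w_ab + w_ba) / 2.0


def build_transition_matrix(names, hits, self_weight=1.0):
    """Turn BLAST-style E-values into a column-stochastic matrix."""
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    w = [[0.0] * n for _ in range(n)]
    for (a, b), e_ab in hits.items():
        if a == b:
            continue
        e_ba = hits.get((b, a), e_ab)  # assumption: reuse if reverse is missing
        weight = max(edge_weight(e_ab, e_ba), 0.0)  # assumption: clamp E > 1
        i, j = index[a], index[b]
        w[i][j] = w[j][i] = weight  # symmetric matrix
    for i in range(n):
        w[i][i] = self_weight  # placeholder diagonal (assumed value)
    col_sums = [sum(w[i][j] for i in range(n)) for j in range(n)]
    return [[w[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Hypothetical proteins and E-values, for illustration only.
names = ["P1", "P2", "P3"]
hits = {("P1", "P2"): 1e-50, ("P2", "P1"): 1e-40, ("P2", "P3"): 1e-5}
transition = build_transition_matrix(names, hits)
for row in transition:
    print([round(x, 3) for x in row])
```

Here the P1/P2 edge gets weight (50 + 40) / 2 = 45, so a walk leaving P1 moves to P2 with probability 45/46 under the assumed diagonal weight of 1.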

MCL Algorithm: Operators



Inflation



corresponds with taking the Hadamard power of a matrix (taking powers entry wise)


followed by a scaling step, such that the resulting matrix is stochastic again








\[ (\Gamma_r M)_{ij} = \frac{(M_{ij})^r}{\sum_{k} (M_{kj})^r} \]


Γ_r -> Inflation operator


M -> Stochastic matrix


r -> a real number (power coefficient) > 1


M_ij -> probability of going from j to i




For values of r > 1, inflation changes the probabilities by favoring more probable walks over less
probable walks
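A small worked example may help show what inflation does to a column. The sketch below assumes a column-stochastic input with no all-zero columns; the function name and the 2x2 matrix are made up for illustration.

```python
def inflate(matrix, r):
    """Raise every entry to the power r, then rescale each column to sum to 1."""
    n = len(matrix)
    powered = [[matrix[i][j] ** r for j in range(n)] for i in range(n)]
    col_sums = [sum(powered[i][j] for i in range(n)) for j in range(n)]
    return [[powered[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Column 0 is uniform (0.5, 0.5); column 1 is skewed (0.25, 0.75).
M = [
    [0.5, 0.25],
    [0.5, 0.75],
]
inflated = inflate(M, 2.0)
print(inflated)
```

With r = 2 the uniform column stays at (0.5, 0.5), while the skewed column moves from (0.25, 0.75) to (0.1, 0.9): the more probable step gains weight at the expense of the less probable one.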



Expansion


corresponds to computing random walks of ‘higher length’ (which means random walks with many steps)


Expansion coincides with taking the power of a stochastic matrix using the normal matrix product (i.e. matrix
squaring)


It associates new probabilities with all pairs of nodes


Premise: Higher length paths are more common within clusters than between different clusters


the probabilities associated with node pairs within a cluster will, in general, be relatively large


Inflation will then have the effect of boosting the probabilities of intra-cluster walks and will demote inter-cluster walks
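Expansion as matrix squaring can be sketched in a few lines. The three-node matrix below is a made-up example (a path 0 - 1 - 2 with self-loops), and `expand` is an illustrative name rather than an MCL API.

```python
def expand(matrix):
    """Square a column-stochastic matrix with the ordinary matrix product."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Entry [i][j] is the probability of stepping from node j to node i.
M = [
    [0.5, 0.25, 0.0],
    [0.5, 0.5, 0.5],
    [0.0, 0.25, 0.5],
]
M2 = expand(M)
print(M[2][0], M2[2][0])
```

Node 2 cannot be reached from node 0 in one step (probability 0), but after expansion the two-step probability is 0.5 * 0.25 = 0.125, so expansion has associated a new probability with that pair of nodes.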




Salient Features


Iterating expansion and inflation -> Segmentation of the graph


Expansion and inflation: tidal forces alternated until equilibrium (doubly idempotent matrix) is reached


Stochastically


Expansion causes flow to dissipate within clusters


Inflation eliminates flow between different clusters


Mathematically


Inflation strengthens the structural property and will never change the associated
DAG;


Expansion is in fact able to change the associated DAG


Increasing this Inflation parameter makes the operator stronger, and
increases the granularity or ‘tightness’ of clusters


The process simulated by the algorithm converges quadratically around the equilibrium states


In practice, the algorithm starts to converge noticeably after 3-10 iterations


Conjecture: process always converges if the input graph is
symmetric
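The alternation of expansion and inflation until the matrix stops changing can be sketched on a toy graph. This is a simplified, dense, unpruned illustration and not the mcl program: the helper names, the tolerance, the iteration cap, the added self-loops and the way clusters are read off the limit matrix (each non-zero row lists the nodes attracted to that row's node) are assumptions of this sketch.

```python
def expand(m):
    """Expansion: ordinary matrix product of m with itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: entry-wise power r, then rescale each column to sum to 1."""
    n = len(m)
    powered = [[m[i][j] ** r for j in range(n)] for i in range(n)]
    sums = [sum(powered[i][j] for i in range(n)) for j in range(n)]
    return [[powered[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(m, r=2.0, max_iter=100, tol=1e-9):
    """Alternate expansion and inflation until the largest change is below tol."""
    n = len(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), r)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    return m


def clusters(limit, eps=1e-6):
    """Read clusters off the limit matrix: one cluster per distinct non-zero row."""
    found = set()
    for row in limit:
        members = frozenset(j for j, p in enumerate(row) if p > eps)
        if members:
            found.add(members)
    return sorted(sorted(c) for c in found)


def from_edges(n, edges):
    """Column-stochastic matrix of an undirected graph with self-loops added."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [[a[i][j] / sums[j] for j in range(n)] for i in range(n)]


# Two triangles (0,1,2) and (3,4,5) joined by a single bridging edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
result = clusters(mcl(from_edges(6, edges), r=2.0))
print(result)
```

On this toy graph the sketch separates the two triangles, because the flow across the single bridging edge is progressively removed by inflation.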



‘Neutral’ value: a weight that will not change when the
inflation operator is applied to the stochastic column
associated with the node


‘bootstrapping’ nature: retrieving cluster structure via the
imprint made by this structure on the flow process


Alternating expansion with inflation turns out to be an
appropriate way of exploiting this recombination property


based on a very different paradigm than any linkage-based algorithm.


Worst Case Time Complexity: O(N*K^2)


K -> pruning factor


Extremely Tight & Dense Graphs: O(N*N*log(K))


Space Complexity: O(N*K)


Max RAM needed: 2 * s * N * K
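As a rough worked example of the memory bound, the sketch below evaluates 2 * s * N * K. The slide does not define s; it is assumed here to be the number of bytes used to store one matrix entry. The values K = 1000 and s = 8 are hypothetical; N = 103 038 is the number of proteins in the human genome analysis described in this talk.

```python
def max_ram_bytes(n_nodes, pruning_factor, bytes_per_entry):
    """Evaluate the bound 2 * s * N * K (s assumed to be bytes per matrix entry)."""
    return 2 * bytes_per_entry * n_nodes * pruning_factor


# N from the talk; K and s are assumed values for illustration only.
estimate = max_ram_bytes(n_nodes=103038, pruning_factor=1000, bytes_per_entry=8)
print(estimate, "bytes, about", round(estimate / 1e9, 2), "GB")
```

Under these assumed values the bound comes to roughly 1.65 GB; a smaller pruning factor lowers it proportionally.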

Tweaking MCL


Optimal Resource Usage


-scheme parameter


70 < Jury Marks < 100


Run using different schemes and compare the resulting clusterings using clmdist


Cluster Granularity


Effect of inflation: -I, the principal handle


1.1 to 5.0


Effect of Similarity Distribution


Homogeneous Similarities


Coarse Grained Graph


Effect of initial centering


-c parameter (adds loops to input graph)


Two-level approach for coarse graphs


Cluster Overlap:


Overlap


Theoretical (Very strong symmetries)


Default: no overlap in mcl output


MCL iterand clusters


-dump cls: writes clusters associated with each iterand to file


-dump ite: writes the intermediate iterands to file, then apply clmimac



Application to Biological Graphs: clustering of proteins into protein families


Nodes -> represent proteins


Edges -> represent similarity between these proteins


Edges are weighted -> according to sequence similarity score obtained from BLAST etc.


A Markov matrix is constructed -> representing transition probabilities from any protein in the graph to any other protein for which a similarity has been detected.



Diagonal elements are set arbitrarily to a ‘neutral’ value



This Markov matrix is supplied to the MCL algorithm



Initial expansion: simulates random walks (which allow one to measure ‘flow’ in the graph)


Areas of high flow indicate that a large number of random walks go through this area.


Iterations of expansion and inflation: promote flow through the graph where it is strong (within protein families), and remove flow where it is weak (across protein families)



Process terminates when equilibrium has been reached



“A random walk starting at any given protein in a family is more likely to linger within this family
than to cross to another family”


Flow between protein families will be weaker than flow within a family as there are relatively few (if any)
paths that cross two distinct protein families.


Inter-family paths represent either sequence similarity relationships due to multi-domain proteins or mere false positive similarity detections.



MCL approach overcomes many of the protein sequence clustering problems


Promiscuous domains will connect a member of a given protein family to all members of that family and
possibly to other protein families


The algorithm gradually eliminates these inter-family similarities and detects protein families accurately.


MCL algorithm clusters proteins with different domain structures into distinct families

Protein family analysis of the draft human genome



Input


a full set of peptides from the EnsEMBL 0.80 release (29 691 proteins)


along with a vertebrate subset of proteins from the SwissProt and SPTrEMBL databases (73 347 entries).


Procedure:


Protein similarities within these 103 038 proteins detected using BLAST


BLAST protein similarities (over 15 million) given to the TRIBE-MCL algorithm


Took 15 min on a single CPU of a Compaq ES40 server.


the RLCS algorithm (A.J.Enright, unpublished data) was then used to determine a consensus annotation for each
detected protein family


Results:


13 023 protein families detected,


11 481 families (88% of the total) are human specific


Avg. family size: 2.5 members


Only 1110 single-member families (3% of the total number of families).


The family size distribution has an exponential shape


hundreds of protein families with more than 20 members


347 families with more than 50 members (indicating a high degree of paralogy)


Some well-known families detected: zinc finger-containing proteins, olfactory receptors, members of the ras superfamily of GTPases, myosin, actin, keratin, immunoglobulin, certain ribosomal proteins and multiple kinase types


Also detected: a number of novel families in the human genome whose functions are either unknown or predicted.


Improvements Identified:


Some of the largest families (with more than 1000 members) may contain a number of unrelated members


Arises from presence of multiply repeated sequence patterns (not individual promiscuous domains)


Working towards the definition of multiple levels of protein family classification, using post-processing of the initial clusters with multiple-threshold clustering


Advantages


Graph theory allows a global treatment of all
relationships in similarity space simultaneously (as
against pairwise treatment by traditional methods)


It is not misled by edges linking different clusters


It is very fast and very scalable


It has a natural parameter for influencing cluster
granularity


The mathematics associated shows that there is an
intrinsic relationship between the process it simulates
and cluster structure in the input graph


Its formulation is simple and elegant


TRIBE-MCL Execution


mcxdeblast


mcxassemble


mcl


clmformat


zoem


mclpipeline


mclblastline

Conclusion


We have a novel algorithm that generates accurate
protein families using the MCL formalism for graph
clustering by flow simulation.


Avoids expensive sequence alignment, because the
method operates on a graph that contains precomputed
similarity information


Global patterns of sequence similarity are detected and
used for partitioning


Very fast, and accurate. The quality of the clustering is
impressive.


Up to 95% agreement can be obtained in a comparison
of the resulting classification using TRIBE-MCL and the
manually curated InterPro database.


Questions???