An Efficient Algorithm for Large-Scale Detection of ... - Bioinformatics


Nov 25, 2013


An Efficient Algorithm for Large-Scale Detection of Protein Families

A. J. Enright, S. Van Dongen and C. A. Ouzounis

Nucleic Acids Research, 2002, Vol. 30, No. 7

© 2002 Oxford University Press


Why Do Protein Families Need to Be Detected?

Detection of protein families in large databases is one of the
principal research objectives in structural and functional genomics

Protein family classification can significantly contribute to:

the delineation of functional diversity of homologous proteins

prediction of function based on domain architecture

presence of sequence motifs

comparative genomics

Valuable evolutionary insights

Why a new method?


The multi-domain structure of many protein families confounds existing methods: a shared domain does not imply a shared biochemical function

This can result in the incorrect grouping of proteins

Attempted Solutions

detection of individual domains

using BLAST reports

domain database dictionaries

iterative sequence comparison

manual intervention for family assignment of multi-domain proteins

These approaches are either too computationally intensive,

or somewhat inaccurate,

or not fully automatic

Promiscuous Domains


Smaller, quite widespread protein modules

Families based on such domains are unlikely to share a common
evolutionary history

New Challenges: Eukaryotic Genomes

Larger in size

Greater complexity

Eukaryotic protein families: bottleneck for most methods

Large Protein Domains: need efficient Clustering Algorithms

Traditional iterative automatic domain detection algorithms rendered impractical

Suggested Solution:

Detecting similar domain architectures (vs. detecting each domain individually)

Assumption: proteins with near-identical sets of domains may have very
similar biochemical roles

Why a new method?


Case Study: GeneRAGE Algorithm

Clusters proteins within complete genomes

Uses the Smith-Waterman dynamic programming alignment algorithm for error detection
and correction

Multi-domain proteins: sequence comparison using a domain detection algorithm (also
based on Smith-Waterman)

For Small Data Sets (e.g. prokaryotic genomes)

effectively and accurately identifies protein families

correctly detects multi-domain proteins

For Larger Data Sets (e.g. eukaryotic organisms)

detection of protein domains becomes hampered to a large extent by

promiscuous domains

peptide fragments (representing incomplete database entries)

proteins of complex domain structure.


Domains such as a ‘response regulator’ domain from two-component systems
cause proteins with vastly differing functions (such as heat shock factors and phytochromes) to be
assigned incorrectly to the same family

What is TRIBE-MCL?

TRIBE-MCL: an approach for rapid and accurate clustering of protein sequences into families

Uses the Markov Cluster (MCL) algorithm (previously developed for graph
clustering using flow simulation) for the assignment of proteins into families
based on precomputed sequence similarity information

Claims that it:

is ideally suited to the rapid and accurate detection of protein
families on a large scale

does NOT suffer from the problems that normally hinder other
protein sequence clustering algorithms, such as

the presence of multi-domain proteins

promiscuous domains

fragmented proteins

Some Definitions

Random Walks

Stochastic Matrix

A column stochastic matrix is a non-negative matrix with the property that each of
its columns sums to 1

Markov Chain

A random process in which the probability that a certain future state will occur
depends only on the present or immediately preceding state of the system, and
not on the events leading up to the present state

Bootstrapping procedure

doubly idempotent matrix

Quadratic Convergence
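To make the first definitions concrete, here is a minimal Python sketch of a column stochastic matrix and a single Markov chain step. The weight values and the helper names `column_normalize` and `step` are illustrative assumptions, not taken from the slides.

```python
def column_normalize(matrix):
    """Scale each column of a non-negative square matrix so that it sums to 1."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]

def step(matrix, state):
    """One Markov chain step: the next distribution depends only on the current one."""
    n = len(matrix)
    return [sum(matrix[i][j] * state[j] for j in range(n)) for i in range(n)]

# Hypothetical symmetric edge weights for a 3-node graph.
weights = [[1.0, 2.0, 0.0],
           [2.0, 1.0, 1.0],
           [0.0, 1.0, 1.0]]

M = column_normalize(weights)   # M[i][j] = probability of going from j to i
start = [1.0, 0.0, 0.0]         # the random walk starts at node 0
after_one = step(M, start)      # distribution after one step (column 0 of M)
```

Entry `M[i][j]` is read as the probability of going from node j to node i, which matches the column convention used for the inflation operator later in the deck.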

MCL Algorithm: Overview

Assembly of a FASTA file containing the sequences to be clustered

Filtering using CAST, an accurate and sensitive compositional bias
detection algorithm

The filtered sequence set is compared against itself using BLAST

The all-against-all sequence similarities generated by this analysis are
parsed and stored in a square matrix

This matrix represents sequence similarities as a connection graph

Nodes represent proteins

Edges represent sequence similarity that connects such proteins

Weighted edges: average pairwise -log10(E-value)

Symmetric matrix generated

Weights are transformed into transition probabilities

Iterative rounds of matrix multiplication and matrix inflation until there is little
or no net change in the matrix

Inflation value parameter : controls cluster granularity (or ‘tightness’)

Final matrix: interpreted as a protein family clustering
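The graph-construction steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the hits are given as hypothetical `(query, subject, e_value)` tuples rather than parsed from a real BLAST report, the edge weight is taken to be -log10 of the E-value averaged over both directions, and the cap of 200 for E-values reported as 0 is an arbitrary choice.

```python
import math

def similarity_matrix(ids, hits, max_score=200.0):
    """Build a symmetric weight matrix from (query, subject, e_value) hits."""
    index = {pid: k for k, pid in enumerate(ids)}
    n = len(ids)
    directed = [[0.0] * n for _ in range(n)]
    for query, subject, e_value in hits:
        if query == subject:
            continue  # self-hits are ignored here; loops are handled separately
        if e_value <= 0.0:
            score = max_score  # assumed cap for E-values reported as 0
        else:
            score = min(max_score, max(0.0, -math.log10(e_value)))
        i, j = index[query], index[subject]
        directed[i][j] = max(directed[i][j], score)
    # Average the two directions so that the matrix is symmetric.
    return [[(directed[i][j] + directed[j][i]) / 2.0 for j in range(n)]
            for i in range(n)]

def to_transition_probabilities(weights):
    """Rescale every column to sum to 1 (column stochastic matrix)."""
    n = len(weights)
    sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    return [[weights[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

# Hypothetical protein identifiers and BLAST hits.
ids = ["P1", "P2", "P3"]
hits = [("P1", "P2", 1e-50), ("P2", "P1", 1e-30), ("P2", "P3", 1e-10)]

W = similarity_matrix(ids, hits)
T = to_transition_probabilities(W)
```

Averaging the two directions is what makes the matrix symmetric even though BLAST can report different E-values for the query-subject and subject-query comparisons.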

MCL Algorithm: Operators


Inflation

corresponds with taking the Hadamard power of a matrix (taking powers entry-wise)

followed by a scaling step, such that the resulting matrix is stochastic again

$$(\Gamma_r M)_{ij} = \frac{(M_{ij})^r}{\sum_{k} (M_{kj})^r}$$

$\Gamma_r$ -> inflation operator

$M$ -> stochastic matrix

$r$ -> a real number (power coefficient) > 1

$M_{ij}$ -> probability of going from j to i

For values of r > 1, inflation changes the probabilities by favoring more probable walks over less
probable walks


Expansion

corresponds to computing random walks of ‘higher length’ (which means random walks with many steps)

Expansion coincides with taking the power of a stochastic matrix using the normal matrix product (i.e. matrix squaring)

It associates new probabilities with all pairs of nodes

Premise: Higher length paths are more common within clusters than between different clusters

the probabilities associated with node pairs within a cluster will, in general, be relatively large

Inflation will then have the effect of boosting the probabilities of intra-cluster walks
and will demote inter-cluster walks
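The two operators described above can be written in a few lines of plain Python. This is a minimal sketch on dense nested lists, not the sparse implementation a real MCL program would use; the function names and the 2x2 example matrix are assumptions made for illustration.

```python
def expand(matrix):
    """Expansion: square the matrix using the normal matrix product."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def inflate(matrix, r):
    """Inflation: entry-wise (Hadamard) power r, then rescale each column to sum to 1."""
    n = len(matrix)
    powered = [[matrix[i][j] ** r for j in range(n)] for i in range(n)]
    sums = [sum(powered[i][j] for i in range(n)) for j in range(n)]
    return [[powered[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

# Hypothetical column stochastic matrix: each column sums to 1.
M = [[0.6, 0.2],
     [0.4, 0.8]]

expanded = expand(M)        # two-step walk probabilities
inflated = inflate(M, 2.0)  # the larger entry in each column gains weight
```

In column 0 the entries 0.6 and 0.4 become 0.36/0.52 and 0.16/0.52 after inflation with r = 2, so the more probable walk is favored while the column still sums to 1.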

Salient Features

Iterating expansion and inflation -> segmentation of the graph

Expansion and inflation: tidal forces alternated until equilibrium (a doubly idempotent matrix) is reached


Expansion causes flow to dissipate within clusters

Inflation eliminates flow between different clusters


Inflation strengthens the structural property and will never change the associated DAG

Expansion is in fact able to change the associated DAG

Increasing this Inflation parameter makes the operator stronger, and
increases the granularity or ‘tightness’ of clusters

The process simulated by the algorithm converges quadratically around the equilibrium states

In practice, the algorithm starts to converge noticeably after 3-10 iterations

Conjecture: process always converges if the input graph is symmetric
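The statement that a larger inflation parameter makes the operator stronger can be checked on a single column. The sketch below applies inflation to one hypothetical column of transition probabilities for three values of r; the column values and the helper name `inflate_column` are assumptions.

```python
def inflate_column(column, r):
    """Raise each probability to the power r, then rescale so the column sums to 1."""
    powered = [p ** r for p in column]
    total = sum(powered)
    return [p / total for p in powered]

# Hypothetical transition probabilities out of one node.
column = [0.5, 0.3, 0.2]

for r in (1.1, 2.0, 5.0):
    sharpened = inflate_column(column, r)
    print(r, [round(p, 3) for p in sharpened])
```

The largest entry takes a growing share of the column as r increases, which is the mechanism behind tighter, more fine-grained clusters at higher inflation values.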

Salient Features


‘Neutral’ value: a weight that will not change when the
inflation operator is applied to the stochastic column
associated with the node

‘bootstrapping’ nature: retrieving cluster structure via the
imprint made by this structure on the flow process

Alternating expansion with inflation turns out to be an
appropriate way of exploiting this recombination property

MCL is based on a very different paradigm than any linkage-based algorithm.

Worst Case Time Complexity: O(N*K^2)

K -> pruning factor

Extremely Tight & Dense Graphs: O(N*N*log(K))

Space Complexity: O(N*K)

Max RAM needed: 2 * s * N * K
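As a worked example of the memory bound above, the sketch below evaluates 2 * s * N * K. The slides do not define s; it is assumed here to be the number of bytes used per stored matrix entry, and the values chosen for K and s are hypothetical. N is set to the number of proteins in the human genome analysis described later in the deck.

```python
def max_ram_bytes(n_nodes, k_prune, cell_bytes):
    """Upper bound on memory use: 2 * s * N * K."""
    return 2 * cell_bytes * n_nodes * k_prune

# N from the draft human genome analysis; K and s are assumed values.
estimate = max_ram_bytes(n_nodes=103038, k_prune=500, cell_bytes=8)
print(f"{estimate / 1e6:.1f} MB")
```

Because the bound is linear in K, halving the pruning factor halves the memory estimate, at the cost of discarding more small matrix entries per column.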

Tweaking MCL

Optimal Resource Usage



70 < Jury Marks < 100

Run using different schemes, compare clusters using clmdist

Cluster Granularity

Effect of Inflation:
the principal handle

1.1 to 5.0

Effect of Similarity Distribution

Homogeneous Similarities

Coarse Grained Graph

Effect of initial centering

c Parameter (adds loops to input graph)

Two Level Approach

for coarse graphs

Cluster Overlap:


Theoretical (Very strong symmetries)


No Overlap in mcl output

MCL iterand clusters

dump cls: writes clusters associated with each iterand to file

dump ite: writes the intermediate iterands to file, then apply
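Comparing the clusterings produced by runs with different settings needs a distance between partitions. The sketch below computes the split/join distance, which is assumed here to be the kind of measure reported by a tool such as clmdist; it is an independent illustration, not that tool's implementation, and the example clusterings are hypothetical.

```python
def split_join_distance(clustering_a, clustering_b):
    """Split/join distance between two partitions of the same node set.

    It counts how many nodes must be moved to turn one partition into the
    other by way of their common refinement; identical partitions give 0.
    """
    a = [set(c) for c in clustering_a]
    b = [set(c) for c in clustering_b]

    def projection(p, q):
        # For every cluster in p, the size of its best overlap with a cluster in q.
        return sum(max(len(c & d) for d in q) for c in p)

    n = sum(len(c) for c in a)
    return 2 * n - projection(a, b) - projection(b, a)

# Hypothetical results of two runs with different inflation values.
coarse = [[1, 2, 3], [4, 5]]
fine = [[1, 2], [3], [4, 5]]

distance = split_join_distance(coarse, fine)
```

A small distance relative to the number of nodes suggests that the two parameter settings produce essentially the same families.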

Application to Biological Graphs


clustering of proteins into protein families

Nodes -> represent proteins

Edges -> represent similarity between these proteins

Edges are weighted -> according to sequence similarity score obtained from BLAST etc.

A Markov matrix is constructed -> representing transition probabilities from any protein in the
graph to any other protein for which a similarity has been detected.

Diagonal elements are set arbitrarily to a ‘neutral’ value

This Markov matrix is supplied to the MCL algorithm

Initial expansion

simulates random walks (which allow one to measure ‘flow’ in the graph)

Areas of high flow indicate that a large number of random walks go through this area.

Iterations of expansion and inflation

to promote flow through the graph where it is strong (within
protein families), and remove flow where it is weak (across protein families)

Process terminates when equilibrium has been reached

“A random walk starting at any given protein in a family is more likely to linger within this family
than to cross to another family”

Flow between protein families will be weaker than flow within a family as there are relatively few (if any)
paths that cross two distinct protein families.

Inter-family paths represent either sequence similarity relationships due to multi-domain proteins or mere
false positive similarity detections.

MCL approach overcomes many of the protein sequence clustering problems

Promiscuous domains will connect a member of a given protein family to all members of that family and
possibly to other protein families

The algorithm gradually eliminates these inter-family similarities and detects protein families accurately.

MCL algorithm clusters proteins with different domain structures into distinct families
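The whole procedure described in this section can be sketched end to end on a toy graph. The example below is an assumption-laden illustration, not the TRIBE-MCL implementation: it uses dense nested lists, sets each diagonal entry to the largest weight in its column as one possible ‘neutral’ value, omits pruning, and uses hypothetical protein names and weights. Two families of three proteins are joined by one weak similarity, standing in for a shared domain or a false positive.

```python
def expand(m):
    """Expansion: square the matrix with the normal matrix product."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: entry-wise power r, then rescale each column to sum to 1."""
    n = len(m)
    p = [[m[i][j] ** r for j in range(n)] for i in range(n)]
    s = [sum(p[i][j] for i in range(n)) for j in range(n)]
    return [[p[i][j] / s[j] if s[j] else 0.0 for j in range(n)] for i in range(n)]

def markov_matrix(weights):
    """Add loops (assumed 'neutral' value: column maximum) and normalize columns."""
    n = len(weights)
    m = [row[:] for row in weights]
    for j in range(n):
        m[j][j] = max(weights[i][j] for i in range(n)) or 1.0
    s = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / s[j] for j in range(n)] for i in range(n)]

def mcl(weights, r=2.0, tol=1e-9, max_iter=100):
    """Alternate expansion and inflation until the matrix stops changing."""
    m = markov_matrix(weights)
    n = len(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), r)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    return m

def clusters(m, eps=1e-6):
    """Group nodes (columns) by the set of rows that still hold their flow."""
    n = len(m)
    groups = {}
    for j in range(n):
        attractors = frozenset(i for i in range(n) if m[i][j] > eps)
        groups.setdefault(attractors, []).append(j)
    return sorted(groups.values())

# Hypothetical proteins: two families linked by one weak similarity (index 2 to 3).
names = ["kinA1", "kinA2", "kinA3", "recB1", "recB2", "recB3"]
edges = {(0, 1): 60.0, (0, 2): 60.0, (1, 2): 60.0,
         (3, 4): 60.0, (3, 5): 60.0, (4, 5): 60.0,
         (2, 3): 5.0}
weights = [[0.0] * 6 for _ in range(6)]
for (i, j), w in edges.items():
    weights[i][j] = weights[j][i] = w

families = [[names[k] for k in group] for group in clusters(mcl(weights))]
print(families)
```

The weak edge between the two groups carries little flow after the first expansion, and inflation then shrinks it further, so the bridge disappears and each triangle ends up as its own family.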

Protein family analysis of the draft human genome


a full set of peptides from the EnsEMBL 0.80 release (29 691 proteins)

along with a vertebrate subset of proteins from the SwissProt and SPTrEMBL databases (73 347 entries).


Protein similarities within these 103 038 proteins detected using BLAST

BLAST protein similarities (over 15 million) given to the TRIBE-MCL algorithm

Took 15 min on a single CPU of a Compaq ES40 server.

the RLCS algorithm (A.J.Enright, unpublished data) was then used to determine a consensus annotation for each
detected protein family


13 023 protein families detected

11 481 families (88% of the total) are human specific

Avg. Family Size: 2.5 members,

Only 1110 single-member families (3% of the total number of families).

The family size distribution has an exponential shape

hundreds of protein families with more than 20 members

347 families with more than 50 members (indicating a high degree of paralogy)

Some well-known families detected: zinc finger-containing proteins, olfactory receptors, members of the Ras
superfamily of GTPases, myosin, actin, keratin, immunoglobulin, certain ribosomal proteins and multiple kinase types

Also detected: a number of novel families in the human genome whose functions are either unknown or predicted.

Improvements Identified:

Some of the largest families (with more than 1000 members) may contain a number of unrelated members

Arises from presence of multiply repeated sequence patterns (not individual promiscuous domains)

Working towards the definition of multiple levels of protein family classification, using post-processing of the initial
clusters with multiple-threshold clustering


Graph theory allows a global treatment of all
relationships in similarity space simultaneously (as
against pairwise treatment by traditional methods)

It is not misled by edges linking different clusters

It is very fast and very scalable

It has a natural parameter for influencing cluster granularity

The mathematics associated shows that there is an
intrinsic relationship between the process it simulates
and cluster structure in the input graph

Its formulation is simple and elegant

MCL Execution









We have a novel algorithm that generates accurate
protein families using the MCL formalism for graph
clustering by flow simulation.

Avoids expensive sequence alignment: the method operates on a graph that contains similarity information

Global patterns of sequence similarity are detected and
used for partitioning

Very fast, and accurate. The quality of the clustering is

Up to 95% agreement can be obtained in a comparison
of the resulting classification using TRIBE-MCL and the
manually curated InterPro database.