
Nov 7, 2013

Daniel KAHN


INRIA HELIX project

Laboratoire de Biométrie et Biologie Evolutive

University Lyon 1

Protein domain families and protein innovation

The classical paradigm of modular protein evolution

Modular proteins evolve as combinations of pre-existing domains

LuxR

GerE

FixJ

OmpR

SpoOA

NtrC

NifA

prodom.prabi.fr

Exploring the evolution of protein domains

Identify protein domain families systematically by comparison of all available sequences (ProDom)

Restrict analysis to completely sequenced genomes (ProDom-CG)

Deduce a phylogenetic profile for each family

Model evolutionary scenarios with Bayesian networks


Bayesian networks


Directed acyclic graph

Random variable at each node

Conditional dependencies associated with edges

Conditional probability distribution attached at each node

[Figure from N. Friedman: example network with nodes Earthquake, Burglary, Radio, Alarm and Call, and the conditional probability table P(A | E,B) with entries 0.9/0.1, 0.2/0.8, 0.01/0.99 and 0.9/0.1 (row labels not recoverable)]
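A minimal sketch of how conditional probability distributions attached to the nodes of a directed acyclic graph define a joint distribution, using the Earthquake/Burglary/Alarm part of the example. All numbers are illustrative assumptions: the priors are invented, and the assignment of the Alarm table entries to parent states is a guess, since the row labels of the figure are not recoverable.

```python
from itertools import product

# Hypothetical priors for the two root nodes (not taken from the slides).
P_E = {True: 0.01, False: 0.99}   # P(Earthquake)
P_B = {True: 0.02, False: 0.98}   # P(Burglary)

# P(Alarm = True | Earthquake, Burglary); the mapping of values to rows is assumed.
P_A_GIVEN = {
    (True, True): 0.9,
    (True, False): 0.2,
    (False, True): 0.9,
    (False, False): 0.01,
}


def joint(e, b, a):
    """P(E=e, B=b, A=a) as the product of each node's conditional distribution."""
    p_alarm = P_A_GIVEN[(e, b)]
    return P_E[e] * P_B[b] * (p_alarm if a else 1.0 - p_alarm)


# The joint distribution over all 8 configurations sums to 1.
total = sum(joint(e, b, a) for e, b, a in product([True, False], repeat=3))
```

Each node only needs a table indexed by the states of its parents, which is what keeps the model compact compared with a full joint table.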


Use of Bayesian networks


Posterior probabilities

Probability of any event given any evidence

[Figure from N. Friedman: network with nodes Earthquake, Radio, Burglary, Alarm and Call]

Explaining away effect

Most likely explanation

Scenario that best explains evidence
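The explaining away effect can be illustrated by brute-force enumeration on the five-node example. This is a sketch under assumed parameters: every probability below is hypothetical, and the edge structure (Earthquake and Burglary cause Alarm, Earthquake causes Radio, Alarm causes Call) is the usual reading of this textbook network rather than something stated on the slide.

```python
from itertools import product

NAMES = ("E", "B", "A", "R", "C")  # Earthquake, Burglary, Alarm, Radio, Call

# Hypothetical parameters (illustrative only).
P_E, P_B = 0.01, 0.02
P_A = {(True, True): 0.9, (True, False): 0.2,
       (False, True): 0.9, (False, False): 0.01}   # P(A | E, B)
P_R = {True: 0.95, False: 0.001}                    # P(R | E)
P_C = {True: 0.8, False: 0.05}                      # P(C | A)


def bern(p, x):
    """Probability that a binary variable with P(True) = p takes value x."""
    return p if x else 1.0 - p


def joint(e, b, a, r, c):
    return (bern(P_E, e) * bern(P_B, b) * bern(P_A[(e, b)], a)
            * bern(P_R[e], r) * bern(P_C[a], c))


def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over all 32 configurations."""
    num = den = 0.0
    for values in product([True, False], repeat=5):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(*values)
        den += p
        if world[query]:
            num += p
    return num / den


p_burglary_call = posterior("B", {"C": True})
p_burglary_call_radio = posterior("B", {"C": True, "R": True})
```

With these assumed numbers a call raises the probability of a burglary well above its prior, and a radio report of an earthquake lowers it again: the earthquake explains the alarm away.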


Two-state Bayesian network to model protein domain gain and loss

[Figure: tree from LUCA to contemporary species; nodes are labelled + (domain present), - (domain absent) or +/- (state unknown)]
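A minimal sketch of a two-state gain/loss model on a species tree, assuming a single gain probability and a single loss probability per branch. The tree, the species names and all parameter values are hypothetical, and the slides do not specify the implementation used. The sketch computes the posterior probability that the family was present at the root by passing partial likelihoods up from the leaves.

```python
def transition(parent, child, gain, loss):
    """P(child state | parent state); states are 0 (absent) and 1 (present)."""
    if parent:
        return 1.0 - loss if child else loss
    return gain if child else 1.0 - gain


def partial_likelihood(node, observed, gain, loss):
    """Return (L0, L1): probability of the leaf data below `node`
    given that `node` is in state 0 or 1."""
    if isinstance(node, str):  # leaf: a contemporary species
        present = observed[node]
        return (0.0 if present else 1.0, 1.0 if present else 0.0)
    result = [1.0, 1.0]
    for child in node:
        below = partial_likelihood(child, observed, gain, loss)
        for s in (0, 1):
            result[s] *= sum(transition(s, t, gain, loss) * below[t] for t in (0, 1))
    return tuple(result)


def root_posterior(tree, observed, gain, loss, prior_present):
    """Posterior probability that the family was present at the root."""
    l0, l1 = partial_likelihood(tree, observed, gain, loss)
    num = prior_present * l1
    return num / (num + (1.0 - prior_present) * l0)


# Hypothetical four-species tree, written as nested tuples.
tree = (("sp1", "sp2"), ("sp3", "sp4"))
everywhere = {"sp1": True, "sp2": True, "sp3": True, "sp4": True}
only_sp1 = {"sp1": True, "sp2": False, "sp3": False, "sp4": False}

p_root_everywhere = root_posterior(tree, everywhere, 0.05, 0.2, 0.5)
p_root_only_sp1 = root_posterior(tree, only_sp1, 0.05, 0.2, 0.5)
```

A family found in every species is almost certainly ancestral under these assumed rates, whereas a family found in a single species is more likely a recent gain.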

Advantages of Bayesian networks


Flexible, more general and rigorous than parsimony analysis

Allows for different parameter classes: gain and loss frequencies can vary widely!

Parameters estimated by EM with missing information

Propagation of evidence provides not only the most probable explanation of evidence, but also marginal state probabilities at all nodes

Takes into account potentially complex scenarios (e.g., loss followed by gain)
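The most probable explanation can be sketched as a max-product pass over the tree: instead of summing over hidden ancestral states, keep the best state of each child for every state of its parent. The code below is an illustration under assumed parameters (one gain and one loss probability for all branches, hypothetical species names), not the implementation behind the slides.

```python
def transition(parent, child, gain, loss):
    """P(child state | parent state); states are 0 (absent) and 1 (present)."""
    if parent:
        return 1.0 - loss if child else loss
    return gain if child else 1.0 - gain


def best_scenarios(node, observed, gain, loss):
    """For each state s of `node`, return (score, labelled subtree): the highest
    probability of the data and hidden states below the node, given state s."""
    if isinstance(node, str):  # leaf: state is fixed by the observation
        x = int(observed[node])
        return {s: (1.0 if s == x else 0.0, (node, s)) for s in (0, 1)}
    children = [best_scenarios(child, observed, gain, loss) for child in node]
    table = {}
    for s in (0, 1):
        score, labelled = 1.0, []
        for c in children:
            # Child state that maximises transition probability times best score below.
            t = max((0, 1), key=lambda t: transition(s, t, gain, loss) * c[t][0])
            score *= transition(s, t, gain, loss) * c[t][0]
            labelled.append(c[t][1])
        table[s] = (score, (s, labelled))
    return table


def most_probable_scenario(tree, observed, gain, loss, prior_present):
    """Return (probability, labelled tree) of the single best joint assignment."""
    table = best_scenarios(tree, observed, gain, loss)
    prior = {0: 1.0 - prior_present, 1: prior_present}
    root = max((0, 1), key=lambda s: prior[s] * table[s][0])
    return prior[root] * table[root][0], table[root][1]


tree = (("sp1", "sp2"), ("sp3", "sp4"))
only_sp1 = {"sp1": True, "sp2": False, "sp3": False, "sp4": False}
prob, scenario = most_probable_scenario(tree, only_sp1, 0.05, 0.2, 0.5)
```

Here `scenario` is a nested `(state, [children])` structure. For this pattern and these assumed rates the best scenario keeps every ancestor absent and places a single gain on the branch leading to `sp1`.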


Evolutionary scenarios for all ProDom-CG

Systematic analysis of evolutionary scenarios

For each family, at each node of the taxonomy tree:

Is this family novel, i.e. first found since LUCA?

If novel, is it phylum-specific?

Has this family been lost?

Deduce numbers of events at each node of the taxonomy tree:

Is domain innovation going on?

How prevalent is horizontal transfer?


Only a minority of truly ancestral families

[Bar chart, number of families: LUCA 1662 (ancestral families); Bacteria 770, Archaea 345, Eukaryota 3651 (families restricted to kingdom)]

Suggests continuous domain innovation

Fates of domain families

Example: Methanosarcina, 4323 domain families

[Chart categories: new domain families, phylum-specific; new domain families, horizontally transferred; regained domain families; vertically inherited domain families. Two categories are labelled a and b]

Lower bound for domain innovation = a

Upper bound for domain innovation = a + b

The global picture

LUCA: 1662 families

… for bacteria

Bacteria: 2097 families

Proteobacteria: 2751 families

[Tree labels: Rickettsia; β, γ, α]

… proteobacteria

Cases of high domain innovation: Xanthomonas, N. meningitidis, rhizobiales, H. pylori

Cases of massive horizontal transfer, above 40%: e.g. P. aeruginosa, R. solanacearum

Expected cases of massive domain losses, up to 72% for B. aphidicola. Not exclusive with domain innovation: see Rickettsia

[Tree labels: R.s., P. aeruginosa, B. aphidicola, Xanthomonas, N. meningitidis, rhizobiales, H.p.]

[Tree labels: Cyanobacteria, Listeria, S. aureus, S. pyogenes, Mycobacterium, Mycoplasma, Chlamydia, Actinomycetales, Bacillus, M.p., M.l., M.t.]

Domain innovation can be above 30% (e.g., S. aureus)

Only a few cases of domain regain (maximum 8% for M. pulmonis)

… archaea

Archaea: 1668 families

[Tree labels: Methanosarcina, Pyrococcus, Thermoplasma, Sulfolobus]

Case of massive turnover, above 50%: Methanosarcina [turnover defined as (gains + losses) / 2]
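The turnover definition can be written out directly. This is a small sketch that assumes gains and losses are both expressed as fractions of a genome's domain families; the example values are hypothetical and are not the Methanosarcina figures.

```python
def turnover(gain_fraction, loss_fraction):
    """Turnover as defined on the slide: (gains + losses) / 2."""
    return (gain_fraction + loss_fraction) / 2


# Hypothetical lineage with 60% gained and 45% lost families.
example = turnover(0.60, 0.45)
is_massive = example > 0.5  # the slide's threshold for "massive turnover"
```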

… eukaryota

Eukaryota: 5233 families

[Tree labels: A. thaliana, Bilateria, Eutheria, C. elegans, H. sapiens, M. musculus, D. melanogaster, S.p., S.c.]

Much more domain innovation in higher organisms

Much less horizontal transfer

Conclusions


Complex evolutionary scenarios can be predicted under a maximum likelihood principle

The traditional view of modular protein evolution: proteins are made of a combinatorial assortment of pre-existing domains

The new view: domain innovation is an ongoing process

Importance of systematic domain analyses

INRA Toulouse

Catherine BRU

Emmanuel COURCELLE

Thomas SCHIEX

Daniel KAHN

Support

- « Après Séquençage Génomique » (Proteus)
- ACI IMPBio
- Toulouse Génopole
- EU
- Humboldt University

Phylogenetic profiling: why?


Exploit the idea that correlated evolution may inform us about function (Pellegrini et al., 1999)

Best used in conjunction with other types of correlations such as synteny, gene fusions, correlated expression, etc. (Marcotte et al., 1999, and numerous follow-up articles)

Conversely, the interpretation of phylogenetic profiles may inform us about the evolution of a biological function


Phylogenetic profiling: how?


Compare all vs. all

Cluster homologous genes

The phylogenetic distribution of each family across n genomes is represented as a Boolean vector: a ‘phylogenetic profile’

Compare phylogenetic profiles using various distances in {0,1}^n, or mutual information
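As a sketch of the comparison step, the code below represents each family as a Boolean vector over n genomes and compares two profiles with the Hamming distance (one possible distance in {0,1}^n) and with mutual information. The family names and profiles are invented for illustration.

```python
from math import log2


def hamming(p, q):
    """Number of genomes in which the two profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def mutual_information(p, q):
    """Mutual information (in bits) between two presence/absence profiles."""
    n = len(p)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            joint = sum(1 for x, y in zip(p, q) if x == a and y == b) / n
            pa = sum(1 for x in p if x == a) / n
            pb = sum(1 for y in q if y == b) / n
            if joint > 0:
                mi += joint * log2(joint / (pa * pb))
    return mi


# Hypothetical profiles over n = 8 genomes (1 = family present).
fam_a = [1, 1, 0, 0, 1, 0, 1, 0]
fam_b = [1, 1, 0, 0, 1, 0, 1, 0]   # same distribution as fam_a
fam_c = [0, 0, 1, 1, 0, 1, 0, 1]   # exact complement of fam_a

d_ab, d_ac = hamming(fam_a, fam_b), hamming(fam_a, fam_c)
mi_ab, mi_ac = mutual_information(fam_a, fam_b), mutual_information(fam_a, fam_c)
```

The two measures answer different questions: the complementary profile is maximally distant from `fam_a` in Hamming terms, yet carries the same mutual information as the identical profile, because knowing one profile fully determines the other.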


Critique of conventional
phylogenetic profiling


It does not take into account phylogenetic relations between species

It is strongly affected by species sampling bias and by sporadic horizontal transfer

Implicitly, single gene gain or loss events are counted multiple times

One should therefore take species phylogeny into account

Parameter estimation



Compilation of 61 families with known evolutionary origin

One parameter class per kingdom (an advantage of the Bayesian approach)

Use Expectation-Maximization for parameter estimation with missing information

Conditional probabilities

Domain loss at least 3-fold more frequent in prokaryotes

[Bar chart: conditional probability (axis from 0% to 25%) of domain loss and domain gain for Bacteria, Archaea and Eukaryota]

Example of evolutionary scenario

CG00307: LuxR type activator domain

[Figure labels: Gram +, Proteobacteria, Archaea]

Gain/loss frequencies are not correlated with evolutionary distances

Boussau et al. (2004) PNAS 101:9722-9727