Genomics and Bioinformatics

abalonestrawBiotechnology

Oct 2, 2013 (3 years and 11 months ago)

90 views

Phylogenetic methods


Thanks to:

Applications of phylogenetic
trees


Evolution studies


Systematic biology


Medical research and epidemiology


Ecology


Others like grouping languages, medieval
manuscripts …

Phylogenetics: basics


With only 2 sequences, we can simply align them and look at their
similarities or differences (how many, where they are in the molecule,
and so on).


More than 2 sequences, they are related to each other as a result of a
branching evolutionary history, and they are located at the tips of a
tree of species
(except if HGT or sampling within species
)



Phylogenies (evolutionary trees) can be inferred from molecular
sequences. They can be used to address a wide variety of questions
about the course of evolution. They are needed not only to know the
relationships of organisms, but to enable us to correctly interpret rates
of evolution and the evolution of different parts of different molecules.



In examining a phylogeny diagram, it is important to look at the
branching pattern and not be misled by the order of species names on
the page, or the left
-
right order of branching of lineages.


Terminology


Rooted tree





Unrooted tree

Trees: rooted or unrooted?


Rooted trees indicate the direction of evolution, and the ancestor
(more useful)


Unrooted trees only show patterns of relationship (more common)


The root can be placed on a tree by two methods:



• The outgroup method. This amounts to knowing in advance
where it is.


• A molecular clock. If we assume that rates of change are about
the same up all lineages, then the root will be a point which is
equidistant from all the tips. It is not necessarily easy to see where
this is.


The molecular clock is the assumption that change occurs at a
constant expected rate, so that all tips of the tree are equidistant
from the root. This is better the more closely related the species
are, but breaks down as their biology becomes more different. It is
often a useful approximation.

Advantages of molecular
phylogenetic analysis


Analogous features (share common
function, but NOT common ancestry) can
be misleading


DNA sequences more simple

to model, we
only have the four states A, C, G, T


DNA samples for sequence analysis easy
to prepare


Procedure:

Steps of a molecular



phylogenetic analysis

1.
Decide what sequences to examine

2.
Determine the evolutionary distances
between the sequences and build
distance matrix

3.
Phylogenetic tree construction



1. Decide what to examine


Choose
homologous sequences

in
different species


Homologous sequences must, by
definition, be derived from a common
ancestral sequence


Homology is not similarity

2. Determine the evolutionary
distances and build distance matrix


For molecular data, evolutionary distances
can be the observed number of
nucleotide
differences

between the pairs of species.


Distance matrix
: simply a table showing
the evolutionary distances between all
pairs of sequences in the dataset


2. Determine the evolutionary distances and build
distance matrix

-

A simple example

1.
AGGCCATGAATTAAGAATAA


2.
AG
C
CCATG
G
AT
A
AAGA
G
TAA

3.
AGG
A
CATGAATTAAGAATAA

4.
A
A
GCCA
A
GAATTA
C
GAATAA

Distance Matrix








In this example the evolutionary





distance is expressed as the





number of nucleotide differences





for each sequence pair. For





example, sequences 1 and 2 are





20 nucleotides in length and have





four differences, corresponding to





an evolutionary difference of 4/20





= 0.2.


1

2

3

4

1

-

0.2

0.05

0.15

2

-

0.25

0.4

3

-

0.2

4

-

3. Phylogenetic Tree Construction
example (UPGMA algorithm)


1.

Pick smallest entry D
ij


2.

Join the two intersecting species and assign branch lengths
D
ij
/2

to each of the nodes







D
ij

Bear

Raccoon


Weasel

Seal


Bear

-

0.26

0.34

0.29

Raccoon

-

0.42

0.44

Weasel

-

0.44

Seal

-


Bear


Raccoon



0.13

0.13



UPMGA (Michener & Sokal 1957)

3. Phylogenetic Tree Construction
example (UPGMA algorithm)

D
ij


Bear

Raccoon


Weasel

Seal


Bear

-

0.26

0.34

0.29

Raccoon

-

0.42

0.44

Weasel

-

0.44

Seal

-


3.

Compute new distances to the other species using

arithmetic means


Bear


Raccoon



0.13 0.13




3. Phylogenetic Tree Construction
example (UPGMA algorithm)

D
ij


BR

Weasel

Seal


BR

-

0.38

0.365

Weasel

-

0.44

Seal

-



1.

Pick smallest entry Dij



2.

Join the two intersecting species and assign branch


lengths Dij/2 to each of the nodes









Bear Raccoon Seal




0.13


0.1825



0.1825


3. Phylogenetic Tree Construction
example (UPGMA algorithm)

D
ij


BR


Weasel

Seal


BR

-

0.38

0.365

Weasel

-

0.44

Seal

-

3.
Compute new distances to the other species using
arithmetic means

Bear Raccoon Seal




0.13


0.1825



0.1825


3. Phylogenetic Tree Construction
example (UPGMA algorithm)

D
ij


BRS

Weasel

BRS

-

0.4

Weasel

-

1.
Pick smallest entry Dij.


2.
Join the two intersecting species and assign branch lengths
Dij/2 to each of the nodes.


3.
Done!








Bear Raccoon Seal


Weasel




0.13


0.1825



0.2





0.2






Weakness of UPGMA


UPGMA assumes a constant molecular clock
(i.e. accumulate mutations at the same rate)


All leaves in the same level


Only constructs rooted trees


2

3

4

1

1

4

3

2

Correct tree

UPGMA

Neighbor Joining

(Saitou and Nei, 1987)


Most widely used distance
-
based method


UPGMA showed that it is not enough to
just pick closest neighbors


Neighbor Joining:


Principle: Join nodes that are close to each
other
and

far from everything else


Produces an unrooted tree


Multiple substitutions at the same site
(
homoplasy
) obscure true distance and make
sequences artificially close to one another.

Methods


When there is conflict among characters
as to what phylogenies they support, we
need some way of choosing among them
a best estimate.
Maximum parsimony

chooses that phylogeny on which the
characters can evolve with the fewest
evolutionary events.


Long branch attraction


A phenomenon in phylogeny (esp. parsimony) when rapidly evolving
lineages are inferred to be closely related, regardless of their true
evolutionary relationships.



Esp. a problem with DNA sequence (rather than protein sequence or
other more complex trait combinations), b/c there are only 4
nucleotides



When DNA substitution rates are high, the probability that two
lineages will convergently evolve the same nucleotide at the same
site increases.



Parsimony wrongly interprets this similarity as having evolved only
once in a common ancestor of the two lineages



Problem can be minimized by using methods that incorporate
differential rates of substitution among lineages (e.g., maximum
likelihood) or by breaking up long branches by adding taxa that are
related to those with the long branches.

Other methods


Maximum likelihood.

The probability of
the data is computed, given the tree and a
probabilistic model of evolution. We
choose that tree that gives the highest
probability to the observed data, among all
trees.



Maximum Likelihood


Search through all trees to find the one
with highest probability


Assumed to be “NP
-
complete“


For a given tree, we can calculate its
likelihood score


Markov model


Dynamic programming


Very time
-
consuming


Which method to use?


Distance based


fast


Maximum Parsimony


Strong sequence similarity


Maximum Likelihood


Very slow


Use only for small number of sequences


Most packages (e.g. Phylip, MEGA) use
software for all three methods.

Methods of evaluating trees


Bootstrap
: resample initial data set with
one datum removed and replaced with
another member


Jackknife
: resample initial distribution with
one datum missing and not replaced


MCMC
: complex, but generates random
numbers to produce a desired probability
distribution with which to compare model

Monte Carlo
method

Bayesian phylogenetics



Newer method



Produces both a tree estimate and measures of
uncertainty for the groups on the tree



Similar to ML.



Optimal hypothesis is the one that maximizes the
posterior probability, = the likelihood multiplied by the
PRIOR PROBABILITY of that hypothesis.



Prior probabilities of different hypotheses convey the
scientist’s beliefs before having seen the data.

Rates and causes of molecular
evolution


Different parts of the genome are useful for
different problems. Fast evolving sequences are
useful for recent events, but become saturated
and unrecognizable when comparing more
distant relatives. Slow evolving sequences are
useful around the base of the tree, but don’t
have any variability at all among close relatives.

Different molecular regions, different rates


DNA distant from genes evolves very quickly (at
about one substitution per 10
8

years),


Flanking regions upstream and downstream
from a gene evolve less quickly than that,


Introns evolve less quickly than those, though
not much less,


Third positions of codons evolve less quickly
than introns,


First and second positions of codons evolve less
quickly than that,

Within a protein:



active sites evolve very slowly,



sites that bind heme, or interact with other proteins evolve a bit faster
but also very slowly,



interior sites evolve less quickly than exterior sites,



substitutions that involve less radical changes of the amino acid (i.e.
that change to a rather similar amino acid) happen more readily.


Of base changes
, transitions (A
-
> G or C
-
> T) happen several times
more readily than transversions (all other changes).

Between protein
-
coding loci, some (fibrinopeptide, for example) evolve
rapidly, some less so (hemoglobins, cytochromes), and some
(histones, for example)change very slowly.


Different molecular regions, different rates