As shown in previous section, Bayesian phylogenetic inferences explicitly use
phylogenetic models and likelihood functions for phylogeny estimation. Though
Bayesian phylogenetic inferenc
e essentially can be applied to various data type
s
including
molecular sequences
[58, 87, 102]
, morphological features, gene order
[104]
,
genomic contents and combined data
[25, 26, 56, 110]
, here we limit our discussion to
mole
cular sequences.
2
.
5
.1 The substitute rate matrix
Though phylogenetic signals exist in various mutation events which can be observed by
sequence comparisons, most phylogenetic methods consider substitution events because
other events
are
either difficult
to model mathematically or the derived model is
computational intractable.
30
A
T
C
G
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
)
(
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
p
t
P
GG
GC
GT
GA
CG
CC
CT
CA
TG
TC
TT
TA
AG
AC
AT
AA
Figure 2

5
: The transition diagram and transition matrix of nucleotides
DNA sequences for phylogenetic inference are treated
as an aligned character matrix.
Each site can have multiple states. For nucleotides, the number of states is 4; for amino
acids, the state is 20; for codons
–
the triplet of nucleotides, the number of states if 64 (or
61 if stopping codons are excluded). T
he character at each site can transit from one state
to another state stochastically
.
The probability
( )
ab
p t
with which a site is substituted
from state
a
by state
b
after a time interval
t
is determined by a molecular
substitution
mod
el.
Figure 2

5
shows the transition diagram of nucleotides and corresponding
transition matrix.
The molecular substitution can be modeled as a continuous

time Markov process
which
has
a
set of
character states as its state space
[111]
. This Markov process,
described by a transition matrix
)
(
)
(
t
p
t
P
ij
, is determined by an instantan
eous
substitution rate matrix
Q
. This substitution rate matrix is independent of time and has its
definition as:
t
I
t
P
Q
t
)
(
lim
0
(2

12
)
Once the rate matrix
Q
is known, the trans
ition matrix
)
(
t
P
can be computed by:
Qt
e
t
P
)
(
(2

13
)
31
To compute
)
(
t
P
with equation (
2

1
3
), we first transform
Q
as
1
U
U
Q
(2

14
)
In (2

1
4
),
is a diagonal matrix with eigenvalues of
Q
as its diagonal entries,
}
,
,
,
{
0
0
0
0
0
0
2
1
2
1
N
N
diag
(2

15
)
U
is the matrix consisting of th
e eigenvectors of
Q
in the same order of
.
1
U
is the
inverse Matrix of
U
. Apply
ing
(2

1
4
) and (
2

1
5
) to (
2

1
3
),
)
(
t
P
can be calculated by:
1
1
}
,
,
,
{
)
(
1
1
U
e
e
e
diag
U
U
Ue
t
P
N
t
(2

16
)
2
.
5
.2 Properties of the substitution rate matrix
Suppose there are
S
possible states at each site, then the substitution rate matrix can be
written as
sj
s
s
j
j
ij
q
q
q
q
q
q
q
q
q
q
Q
2
1
2
22
21
1
12
11
.
(2

17
)
We also denote stationary frequency distribution of the states
)
,
,
,
(
2
1
N
. Then
the following properties hold for
Q
and
.
0
ij
q
(2

18
)
i
j
j
ij
ii
q
q
,
(2

19
)
i
i
1
(2

20
)
32
0
Q
(2

21
)
t
Q
Qt
(
0
).
(2

22
)
Property (
2

21
) is the result of stationary assumption for the Markov chain, i.e.
)
(
t
P
. Property (
2

22
) indicates the substitution rate and the evolutionary time are
co

founded
[111]
. Therefore it is impossible to distinguish between the mutation rate and
the divergence time. The substitution rate can be fixed by assuming the total number of
mutation events per unit time is
constant, i.e.
c
q
i
i
j
j
ij
i
,
From
equation
(
2
–
19
),
this constraint
can be simplified as
c
q
i
ii
i
(2

23
)
2
.
5
.3 The general time reversible (GTR) model
There are 12 substitution rate parameters and 4
state frequency parameters for
a general
substitute rate matrix, 11
of
them
are free parameters due to the constraints of (
2

20
)
,
(
2

2
1
)
and (
2

23).
Various models with fewer model parameters have been proposed by
making more assumptions. Some
widely us
ed nucleotide substitution models include the
Jukes

Cantor model (JC69)
[112]
, the Kimura model (K2P)
[113]
, the Felsenstein
models (F81 and F84)
[114]
, the HKY model
[115]
, and the GTR model (GTR)
[116]
.
Details of these models and methods to calculate their transition probability are
discussed by
Swofford and
et al.
[77]
, Yang
[117]
and
other
r
esearchers
[18, 118]
.
The GTR model adds the time reversible assumption into the substation rate matrix
which requires
33
ji
j
ij
i
q
q
(2

24
)
or
i
ji
j
ij
q
q
(2

25
)
Therefore, the nucleotide substitution rate matrix for GTR model can be simplified as
C
T
A
G
T
A
G
C
A
G
C
T
GTR
f
e
c
f
d
b
e
d
a
c
b
a
Q
(2

26
)
By
introduc
ing
another matrix,
}
,
,
,
{
2
1
S
diag
, it is easy to verify that
GTR
GTR
Q
Q
(2

27
)
Further, we have
'
2
/
1
2
/
1
2
/
1
2
/
1
2
/
1
2
/
1
'
GTR
GTR
GTR
Q
Q
Q
(2

28
)
Equation (
2

28) states that the substitutio
n rate matrix
GTR
Q
is similar to the symmetric
matrix
2
/
1
2
/
1
GTR
Q
. Therefore, all eigenvalues of
GTR
Q
are real numbers and can be
computed by
2
/
1
1
2
/
1
U
U
Q
GTR
(2

29
)
Here
U
and
are the eigenvectors and eigenvalues of
2
/
1
2
/
1
GTR
Q
respectively. The
equation (
2

29)
reduces
the task of solving the eigensystem for a non

symmetric matrix
to solving the eigensystem for a symmet
ric matrix.
An additional benefit from
the
GTR model is that under
the
GTR model, the
likelihood value for
the
phylogenetic tree is independent of the root position of the tree
34
[114]
. Therefore, we can the change the root position at free without changing the
likelihood value of the tree.
2
.
5
.4 Rate heterogeneity among different sites
In
the
previous discussion,
the molecular substitution models are derived based on a
single homologous site. Because the mutation events are constrained by
physical and
chemical
structures of the DNA and protein molecular, and purified by natural selection,
substitution rate varies
greatly among different genes, different codon positions, and
different gene regions
[17]
. Rate heterogeneity among different sites is accommodated by
including an additional related rate coefficient
r
in the substitution rate matrix, i.e.
( )
rQt
P t e
(2

30
)
There are several possible
way
s to determine
r
[77]
:
(1)
Assigning different
r
and different substit
ution rate matrix to different partitions of
the dataset (by genes or by positions in the codon)
[77]
;
(2)
Assuming
r
at each site is drawn independently from a distribution, such
distribution could be continuous (such as Gamma d
istribution
[119

121]
or Log
distribution) or discrete (assume several categories of rate and each has a separate
probability to be chosen);
(3)
Assume some fraction of the sites is invariable
(i.e.
0
r
) while others mutate at
constant rate
[77]
;
(4)
Combining several methods, for example, “invariant + gamma” model.
35
2
.
5
.5 Other more realistic evolutionary models
C
onsiderable amount of effort
has targeted
in developing more realistic evolutionary
model.
Felsenstein and Churchill proposed Hidden Markov
model (HMM) to
accommodate rate variance
[122]
along the sequences.
Similarly, the
assumption
that rate
should be the same in all branches of the trees also needs to be relaxed and a variety of
methods have been proposed
.
The gaps within the alignments provide import
ant phylogenetic signals
. However
they
are o
ften neglected or removed in common phylogenetic inference
methods/packages
.
S
ome models ha
ve
been
proposed to in
corporate
gap in evolutionary
models for phylogenetic inference. These developments include
the fra
gment substitution
mode
l proposed
by Thorne
et al.
[123]
, the tree HMM approach by Mitchison & Durbin
[23, 123, 124]
.
In the future,
incorporation of rate variation correlated with the three
dimensional structure
may be needed
[21]
.
2
.
6
Likelihood function and its evaluation
Evaluating the likelihood of the data under a given model is a
key component in Bayesian
phylogenetic inference and maximum likelihood estimation. Most
computation
time in
likelihood

based phylogenetic inference methods
is spent in likelihood evaluation
.
2
.
6
.1 The likelihood function
The likelihood of
a specific phylogenetic model
is proportional to the probability of
observing the dataset
}
{
ij
x
X
given the
phylogenetic
model
. Here we assume
N
is
the number of taxa and
M
is the sequence length. Each site
)
,
,
,
(
2
1
M
u
u
u
u
x
x
x
x
is an
36
individual observation. The probability of obse
rving the nucleotide pattern at site depends
on the phylogenetic model
,
which includes a phylogenetic tree
T
,
a
vector of branch
length
3
2
2
1
,
,
,
N
, and an evolutionary model
.
As described in
the
previous section, the model of molecular evolution gives the
probability of a mutation from the
th
i
state to the
th
j
state at a site
u
over a finite
period of time
t
. This transition probability is computed as
,
,

)
(
t
i
s
j
s
p
t
p
ij
,
(2

31
)
Here
i
is the star
t
ing state,
j
is the ending state, and
t
is the diver
gence time (i.e. the
length of the branch).
)
(
t
P
ij
is computed by equation (
2

1
3
) when the substitution rate
matrix
}
{
ij
q
Q
is known. The substitution rate matrix is determined by
, param
eters
of an
evolutionary model.
The probability of observing the data at a site
u
given the phylogenetic tree is a sum
over all possible state
s
at the internal nodes of the tree, which is computed by
)
)
,
,

(
)
,
,

(
(
)
,
,

(
1
2
2
1
1
2
,
2
2
1
1
)
(
)
(
N
N
N
N
a
a
a
N
N
i
N
i
i
i
u
i
u
i
i
u
i
u
a
u
a
x
p
a
a
p
T
x
L
(2

32
)
In the above equation,
)
(
i
denotes the immediate ancestral node of node
i
,
i
u
a
denotes
the residu
al
state at node
i
, and
i
u
x
denotes the residu
al
at the
u
th site of the
i
th
sequence.
When rate heterogeneity across different sites is considered, and
we assumes
the rate
at a site follows a distribution
)

(
r
f
with the shape parameter
(e.g.
, the gamma
distribution), equation (
2

32) is replaced by
37
0
,
2
2
1
1
)
(
)
(
)

(
)
,
,

(
)
,
,

(
)
,
,
,

(
1
2
2
1
1
2
dr
r
f
a
x
p
a
a
p
T
x
L
N
N
N
N
a
a
a
N
N
i
N
i
i
i
u
i
u
i
i
u
i
u
a
u
(2

33
)
Equation
(
2

33) can be approximated by replacing the continuous gamma distribution
with a discrete gamma distribu
tion
[125]
.
Assuming the observation
at
each site is independent, the likelihood to observe the
entire sequence is:
M
u
u
T
x
L
T
X
L
1
)
,
,

(
)
,
,

(
(2

34
)
Generally, a logarithmic form of the likelihood is used because the likel
ihood itself is
very small
number.
2
.6
.
2
Felsenstein’s algorithm for likelihood evaluation
The probability given by (
2

32) and (
2

34) can be computed by traversing the tree in
post order using the algorithm proposed by Felsenstein
[114]
. Let
a
L
p
k
u

denote the
probability of all the leaves below node
k
when
the state at
site
u
on node
k
is
a
and
u
L
denote
s
the
likelihood of all leaves at site
u
.
The Felsenstein a
lgorithm is shown in
Figure 2

6
. Starting from the le
aves, the
Felsenstein’s algorithm continuously prunes the subtrees through step 9

14 until there are
no other nodes left except the root.
Without redundant computation, the Felsenstein’s algorithm needs about
(2 1)
NMS S
multiplication operati
ons in step 4. The memory space requirements vary
with implementation. If the site likelihood values at the most recently visited nodes are
saved, only
2
16 24
NS MS NM
byte memory space is needed: the first item stores the
38
transition matrix for each
branch (there are total
2 2
N
branches for a rooted tree); the
second item stores the site likelihood of current node and its two children; and the third
item stores the data matrix. However, this scheme will cause the algorithm to r
e

compute
the site likelihood for all nodes even if only a small portion of the tree has been changed
between two adjacent likelihood evaluations.
Computer

Node

Likelihood
( k )
1.
Compute

Transition

Matrix
(
)
(
k
P
,
k
)
2.
If
L
eaf

N
ode
( k )
3.
Then
4.
For
1
u
;
M
u
;
1
u
u
5.
If
k
u
x
a
6.
Then
1
)

(
a
L
p
k
u
7.
Else
0
)

(
a
L
p
k
u
8.
Else
9.
i
k.leftChild; j
k.rightChild
10.
Compute

Transition

Matrix
(
( )
i
P
,
i
)
11.
Compute

Transition

Matrix
(
( )
j
P
,
j
)
12.
For
1
u
;
M
u
;
1
u
u
13.
Foreach
States
of
Set
a
14.
(  ) (  ) ( ,) (  ) ( ,)
k i j
u u i u i
b c
p L a p L b p b a p L c p c a
Computer

Tree

Likelihood
(
T
)
15.
0
ln
L
16.
Computer

Node

Likelihood
( 2N

1 )
17.
0
u
L
18.
For
1
u
;
M
u
;
1
u
u
19.
2 1
(  )
N
u a u
a
L P L a
20.
)
ln(
ln
ln
u
L
L
L
Figure 2

6
: T
he Felsenstein algorithm for likelihood evaluation
39
2
.
7
Optimizations
of likelihood computation
The likelihood evaluation can be optimized
in
several ways between two adjacent
evaluations of tree likelihood
.
2
.
7
.1 Sequence packing
Repeated patterns are common for real dataset
s
used in phylogenetic inference. The
length of t
he sequences can be cut down by packing the sites with the same pattern and
speed up the likelihood computation. For example, if there are w column
s
of all state “a”
in the
dataset
calculate the likelihood of this once and raise it to the power of w. Throu
gh
sequence packing, equation (2

34) is replaced by
P
u
p
p
p
T
x
L
w
T
X
L
1
)
,
,

(
)
,
,

(
(2

35
)
Here
P
is
the total number of site patterns,
p
w
is the weight of pattern (i.e. the number of
sites wit
h
pattern
p
),
and
,
,

T
x
L
p
p
is the site likelihood of pattern.
Sequence packing can reduce the likelihood computation
by
M
P
1
.
Here
P
is the
number of unique site patterns;
M
is the number of characters.
2.7.2 Likelihood local update
Since in most MCMC algorithms the phylogenetic model changes continuously and
the change between two adjacent generations is small, if we record nodes affected by a
change in the
parameter values and trace them back to the root, then only the conditional
probability of those nodes appearing in the back tracing path needs to be recomputed; all
other parts of the computation remain the same. We call this shortcut a local update of t
he
40
likelihood. Figure 2

7 shows how local update works. The local update can reduce the
average number of nodes to be evaluated from
N
to
N
log
2
1
. The disadvantage is that all
the conditional probability values in the p
revious computation need to be kept in the
memory, increasing the memory requirement to
2
16 8
NS NMS NM
: the first item
stores the transition matrix for each branch; the second item stores the likelihood for all
internal nodes; and the third item st
ores the data matrix (we time a multiplier of
8
for
likelihood and transition probability since they are stored as double precision numbers).
W
hen
N
is large, the local update schema can speed up the likelihood evaluation
remarkably. For this reason, most Bayesian inference programs adopt local update
schema despite the additional memory requirement. However,
smart
memory
ma
nagement is required to keep the local update property without
frequent
data
copy
operations.
root
c
a
b
If branch a

b has been changed
,
only the
nodes in the path from node a to the root need
to recomputed to get the likelihood of the tree
.
Figure 2

7
: Illustration of likelihood local update
41
2
.
7
.3 Tree balance
After multiple mutation operations, the current
tree
may become imbalanced: one subtree
of the root is much deeper than the other subtree.
We
define the
depth of a subtree
as
the
maximum number of internal node
s
from the root of this subtree to any
leaf
node
s
in the
subtree. As discussed above, the
average
number of nodes to be re

computed for local
update is
1
log(.)
2
root depth
;
reducing t
he depth of the
root can speed up the computation
because
there are fewer nodes to be recomputed. The tree balancing algorithm is
shown
in Figure 2

8.
2
.
8
Markov
C
hain Monte Carlo methods
2.8.1 The Metropolis

Hasting algorithm
Though Bayesian analysis of phylogeny provides a direct, formal approach of dealing
with uncertainty in phylogenetic inference with sophisticated statistical model
s
, the
computations required by integrations over unknown parameters is a major obstacle.
t
ree

b
alance
(
T
)
1.
root
T
.root
2.
gap
depth(root.leftChild)
–
depth(root.rightChild)
3.
If
abs(gap) < 2
4.
Then
return
5.
Else
node
root
6.
For
i
0; i
abs(gap)/2; i
i+1
7.
If
de
pth(node.leftChild)>
depth(node.rightChild)
8.
Then
node
node.leftChild
9.
Else
node
node.rightChild
10.
Tree

Reroot (
T
, node )
Figure 2

8
: The tree

ba
lance algorithm
42
Until the advancement of computing technologies and the introduction of Markov
C
hain
Monte Carlo methods, Bayesian phylogenetic inferences weren’t feasible.
Markov
C
h
ain Monte Carlo refers to a class of methods that simulate random
variables from a target distribution, known up to a normalizing constant. The basic idea
of the MCMC methods is first to construct a Markov chain that has the space of the
parameters to be e
stimated as its state space and the posterior probability distribution of
the parameters as its stationary distribution. Next, simulate the chain and treat the
realization as a hopefully large and representative sample from the posterior probability
of the
parameters
of interests
. Two major strategies can be used to construct the Mark
ov
chains for exploring posterior
distribution: t
h
e Metropolis

Hasting algorithm
[126, 127]
and the Gibbs sampler
[128]
.
When applied to phylogenetic inference, the Metropolis

Hasting algorithm can be
descried as follows
(Figure 2

9):
Metropolis

Hasting algorithm
1.
0
t
;
0
)
0
(
2.
Repeat 3

9
3.
Draw a sample
from
)
(

t
q
4.
Draw a random variable
u
from a uniform
distribution
)
1
,
0
(
U
5.
Compute
,
)
(
t
6.
If
)
,
(
)
(
t
u
7.
Then
)
1
(
t
8.
Else
)
(
)
1
(
t
t
9.
1
t
t
.
Figure 2

9
: Metropolis

Hasting algorithm
43
In the above algorithm,
,
)
(
t
is called the
acceptance probability, an
d its
definition
distinguishes different
MCMC algorithms.
In
the original
Metropolis algorithm
[126]
, the acceptance probability is:
X
X
t
t


,
1
min
)
,
(
)
(
)
(
(2

36
)
Hasting
[127]
extended
the original Metropolis algorithm
by allowing an asymmetric
proposal probability
)

(
)
(
t
q
and introduced a new
transitional kernel
( )
( )
( ) ( )


(,) min 1,
 
t
t
t t
q
X
X q
.
(2

37
)
The
proposal probability
(
2

3
7) can be in any form
that satisfies
0
)

(
M
q
. The
choice of
)

(
M
q
may affect the convergence rate of the Markov chain.
In (
2

36) and (
2

37),
X

is the posterior distribution of phylogenetic m
odels
which is proportional to the product of the likelihood and the prior probability. As shown
in the form of the acceptance probability, only the likelihood ratio and the prior ratio for
current sample
t
and candidate sample
are needed to decide whether accepting the
proposal or not; the computation o
f the normalizing constant in (2

2) is unnecessary.
2.8.2 Exploring the posterior distribution
The direct objective of MCMC in Bayesian analysis is to calculate th
e integral appearing
in the marginal distribution shown in equation (2

5) or (2

6). Thus an MCMC method
plays the same role as the Monte Carlo integral approximation method. According to the
law of large numbers, the variance of the Monte Carlo integr
al approximation is
proportional to
1
N
, regardless of the dimensionality of the state space of target
44
distribution
[129]
.
Note that
N
is the number of samples and the variance decrease as
N
increases. To directly draw samples from a complex space with high dimensions is
difficult. Metropolis

based MCMC provides a
n effective method of sampling mechanism
by evolving a Markov chain.
In theory, a Markov chain constructed with the Metropolis

Hasting algorithm will
converge to a stationary distribution if the chain is irreducible, aperiodic, and possesses a
stationary d
istribution given the chain runs long enough
[130, 131]
. The irreducible
property requires the chain has a positive probability to move from one state into any
other state in a finite number of time steps, i.e.
 ( ) 0
j i
P t s t
(2

38
)
Here
i
and
j
are any pair of states in the state space,
t
is the current time,
s
is the
number of time steps needed to move
from
i
to
j
.
If the chain is irreducible, then the chain can reach any state after a sufficiently large
number of time steps no matter
what the
starting state is. It is intuitive that all tree
proposal me
thods shown in Chapter 3 guarantee the irreducible property of the MCMC
chain. Thus MCMC is a promising method for phylogenetic inference.
2.8.3 The
i
ssues
Eq
uation (
2

38) does
not provide any information
regarding
how large
s
mu
st be. As the
length of any MCMC chain used in real analysis is limited, there are risks that some
states are never reached after the chain has been terminated. One fundamental reason is
that the samples generated using Metropolis

Hasting algorithms are de
pendent samples.
As a result, samples between two time steps are correlated. The samples drawn using
45
MCMC tend to stick around
a
local optima
mode o
f
the target distribution. Due to
such
“stickiness”, the chain may mix extremely slowly: it may take a huge
large number of
time (perhaps infinite) steps for the chain to move
from
one mode to another mode.
Increasing the mixing rate of an MCMC chain may improve the quality of the
posterior distribution approximated by the chain. However, if the chain moves too
fast,
the acceptance ratio (the ratio of the number of accepted proposals to the total number of
proposals) become very low and a large percentage of computation is wasted; if the chain
moves too slowly, the acceptance ratio is high but it may take a extr
aordinary large
number of time steps for the chain to converge
[132]
. Neither is satisfactory.
The quality of the posterior distribution sampled by a MCMC sampler is critical to
the accuracy of the conclusions summarized from the distribution. If the approximated
distribution
deviates from the real distribution,
, the conclusions based on
may be completely misleading. For example, a poorly implemented Markov chain
may be trapped at local optima
local
, thus samples generated from this chain give an
extremely high posterior probability for
local
which is far away from the truth.
Therefore
, some practical issues exist for
the original Metropolis

Hasting
and have
to
be avoided in implementation.
1)
C
hoosing
the appropriate proposal steps;
2)
M
ak
ing
the chain move fast
when
exploring
a
posterior distribution
in a
high
dimensional parameter space;
3)
A
void
ing
the chain stop at local optima;
and
4)
D
etect
ing
the
ha
lting
time for the chain.
46
2
.
9
Summary
of the posterior distribution
Once the posterior probability has been approximated with MCMC samplers, various
kinds of information can be summarized from such posterior distribution.
2.
9
.1
Summary
of the phylogeneti
c trees
The posterior probability of phylogenetic trees can be summarized in several ways:
I.
Summarizing with the posterior probability of trees.
The occurring frequency of a
phylogeny in the samples can be interpreted as an approximation of its posterior
p
robability. By ranking the trees in the order of their posterior probability, the 99%
credible set of trees can be obtained.
Among the
se
tree
s,
the one with the maximum
posterior probability is called the MPP tree
.
II.
Summarizing with the posterior probabili
ty of clades.
A clade is the group of taxa
which is included in the same partition divided by an internal branch. Similar to the
posterior probability of the phylogenetic tree, the frequency of a clade in the
samples can be interpreted as its posterior pro
bability. Using the clade posterior
probability distribution and a specific consensus rule, a consensus tree can be
constructed as the summary of the posterior probability of the clades.
III.
Summarizing with the likelihood values.
Though seldom used in practi
ce, the
samples can also be summarized using their likelihood values as maximum
likelihood methods.
2.
9
.2
Summary of
the model parameters
When model parameters are also included as the interested of state parameters in MCMC,
Bayesian analysis also output
samples drawn during the MCMC run. Model parameters
47
and some evolutionary characteristics (e.g. tree length, distance between two partitions
separated by an internal branch) can be summarized from the approximated posterior
distribution samples from Bayesi
an phylogenetic inference using conventional statistical
methods.
2.
10
Chapter
s
ummary
This chapter provides an overview of phylogenetic inference method and the framework
of Bayesian phylogenetic inference. There are various competing phylogenetic method
s
to
build a phylogenetic tree
from
a dataset which provides clues for the
e
volutionary
history
of
a set of taxa.
These methods may use different source of data, different
optimality to choosing the closest estimation. But their common objective is same:
e
stimating the correct tree, or if impossible, making the estimation close to the true tree
as much as possible.
When the phylogenetic trees become large and large (up to thousands and ten of
thousands of taxa), advanced algorithms and high performance imp
lementations become
critical to guarantee the estimated trees close to the true tree sufficiently.
In this
dissertation, we choose Bayesian methods as a possible candidate to infer large
phylogenies.
Bayesian phylogenetics inference is founded on likeliho
od function for a dataset
under
some
phylogenetic model. It is a twin of Maximum Likelihood Estimation, both
require explicit
probabilistic
model of evolutions
.
48
The computational complexity of Bayesian phylogenetic inference is handled by a
family of Mark
ov
C
hain Monte Carlo methods
which can be viewed as a sampling
method or a stochastic search method depending on the interest of study and context.
Using MCMC, both the posterior distributions of phylogenetic trees and the
parameters in the model of evolu
tion can be approximated and conventional statistic
procedure can be applied to summarize the variables of interest.
49
C
hapter
3
Improved M
onte Carlo Strategies
3
.1
Introduction
As described in Chapter 2,
Bayesian phylogenetic inference relies on the poster
ior
distribution approximated
by
Markov
C
hain Monte Carlo (MCMC) methods. The quality
of the posterior distribution sampled by a MCMC sampler is critical to the accuracy of
the conclusions.
If the approximated distribution deviates from the real distributi
on, the
conclusions may be completely misleading
,
as observed in
the literature
[49, 50]
.
Though
the
MCMC method play
s
a critical role in Bayesian phylogenetic inference,
there
are
few stud
ies
o
f
its
performance
due to its computational expens
e
.
Developing
better MCMC
strategies and study
ing
their performance are necessary
for two reasons:
1)
Some experimental results indicate that Bayesian phylogenetic inference using
MCMC (BMCMC) produces misleading posterior probability for trees and clades
under some evolutionary scen
arios. It is not clear whether the
MCMC implementation
or
some deeper reasons in
the
Bayesian phylogenetic inference is responsible for such
discrepancies.
50
2)
T
here is no
theoretical
or
formal
proof
to detect or
guarantee that
an
MCMC
run will
converge
to th
e
correct posterior distribution after the chain stop at
a
certain number
of time steps.
Thus,
improved
MCMC
strategies are required
for
more robust, more efficient
Baye
sian
phylogenetic inference. This chapter provides our observations and
improvements f
or MCMC strategies
for
use in Bayesian phylogenetic inference.
3
.2 Observations
To illustrate the problem
in MCMC implementation
, we use the Metropolis

Hasting
algorithm
to approximate
the
distribution
shown in Figure
3

1
. The target distribution
has
the
a
nalytical form
2
1
2
3
2
1
[0,1]
( )
0 [0,1]
i
x
i
i
c a e x
f x
x
(3

1
)
A Distribution with Three Modes
0.0
1.0
2.0
3.0
4.0
5.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
x
p.d.f
Figure 3

1
: A target distribution with three modes
51
where
1
0.5
a
,
1
0.1
,
1
0.04
,
2
1.0
a
,
2
0.5
,
2
0.04
,
3
0.5
a
,
3
0.9
,
3
0.04
and
5.0
c
. The
modes of this
target distribution are
0.1
,
0.5
and
0.9
.
We use the delta method described in the previous section to propose a candidate
sample point which uses
E
quation (
3

2) to draw a new proposal point, namely
:
1 0.5
x t x t u
,
(3

2
)
Approximated Distribution Using Metropolis Algorithm
(lamda=0.7, x0=0.5)
0.0
1.0
2.0
3.0
4.0
5.0
6.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
x
p.d.f
(a)
Approximated Distribution Using Metropolis Algorithm
(lamda=0.2, x0=0.5)
0.0
2.0
4.0
6.0
8.0
10.0
12.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
x
p.d.f
(b)
Figure 3

2
: Distribution approximated using Metropolis MCMC methods
52
T
he target distribution seems simple,
but
it is difficult to approximate accurately
using
an
MCMC
constructed with the original Metropolis

Hasting algorithm. Figure 4

2
shows the two approximation
s
using
0.7
and
0.2
. Both chains start at
0 0.5
x
.
Though all three modes appear in the a
pproximation shown in Figure
3

2
(a), the
shape of each mode differs
from
the target distribution slightly.
In Figure
3

2 (b), t
he
approximation shows only
one mode
;
the other two modes have disappeared.
Neither
shows
the expected distribution
from
Figure
3

1.
Samples Drawn at Each Time Step (lamada=0.7, x0=0.5)
0.0
0.2
0.4
0.6
0.8
1.0
0
100
200
300
400
500
600
700
800
900
1000
Time Step (t)
x
(a)
Samples Drawn at Each Time Step (lamada=0.2, x0=0.5)
0.0
0.2
0.4
0.6
0.8
1.0
0
100
200
300
400
500
600
700
800
900
1000
Time Step (t)
x
(b)
Figure 3

3
: Samples drawn using Metropolis MCMC method
53
Figure
3

3 shows s
amples drawn at each time step during the above two
approx
imations.
We observe that for larger proposal step
s
, the chain mixes faster
; for a
smaller proposal step, the chain “sticks” around a local mode.
For the above example, though the target distribution is known and simple, it is
nontrivial to choose a proper proposal step parameter to achieve an efficient MCMC
chain. Intuitively,
since we know little about the shape of posterior distribution of the
phylogenetic model, we have to be cautious in interpreting the summary of the posterior
distribution sampled using MCMC methods. At the same time, we must develop more
robust MCMC meth
ods and investigate their performance in practical Phylogenetic
inference.
3
.3
Strategy #1:
r
educing
stickiness using variable
proposal step length
In the previous section, we discussed the
risks
that a Markov chain constructed using the
Metropolis

Hasting
algorithm may be trapped by a local mode and fail to explore the
desired distribution. Changing proposal step length
may
improve the mixing property of
an underlying Markov chain.
According to the irreducible property requirement
,
to approximate the targ
et
distribution correctly, at time
t
, the chain must have a positive probability to move from
one state
i
into any other state
j
in a finite number of time steps
s
, i.e.
 ( ) 0
j i
P t s t
.
(3

3
)
Under ideal situations, the chain can move from one state to any other state within
one step
(as
shown
in
Figure
3

4)
.
A
ccording to
a
proof of
the
MCMC algorithm
[130]
,
54
the chain will approximate the distribution accurately. However, such
a
situation is rare;
the chain
actually
need
s to
traverse some intermediate states to reach
another
state. If the
transition probability between the intermediate states are sma
ller than some threshold, the
probability given in (
3
–
3)
may
be
close to
0, which
means
the target state will never be
reached and the
theoretical
approximation has
a
large deviation from the real
approximation.
Therefore, we p
roposed
a variable step length MCMC which draws the step length
from certain distributions (for example, a uniform distribution)
; we
use
d
this step length
to propose a new candidate state. Using
variable st
ep length
, we can move the chain
between different states more
freely and overcome its “stickiness” to local optimal mode.
Figure
3

5 shows the approximation of the target distribution shown in Figure
3

1.
The distribution of the samples is close to the t
arget distribution, both in the number of
modes and in the shape of each mode.
The number of possible states of phylogenetic trees, even for a small number of taxa,
is extraordinary large. However, the distance between any two states is less than the
numb
er of taxa, even using the simplest tree proposal method, such as NNI (Nearest
Neighbor Interchange). We used the extended tree mutation operator for variable step
length proposal in phylogenetic inference; this algorithm is shown in Sections
3
.6.
0
2
1
3
Figure 3

4
: Illustration of state move
s
55
3
.
4
Strategy #2:
r
educing
sampling intervals using
multipoint
MC
MC
Choosing a good proposal mechanism is
very
difficult in phylogenetic inference.
Variable step
length MCMC can reduce the risk of trapping local optima. But it
introduces another issue: low acceptance rate. A chain must
try
many proposals to
accept
one successful candidate
,
which usually require
s
a large sampling interval.
One strategy is to
prop
os
e
multiple sample candidates and
consider
their combined
effects in
deciding the next move of the Markov chain.
We call this strategy multipoint
MCMC.
In this dissertation, we implement
ed
a variant of multipoint MCMC proposed
by
Liu
et al.
[132]
.
Figure
3

6
illustrat
es the proce
ss of
mult
ipoint MCMC. This algorithm
includes 4 steps:
1)
Propose
K
samples
1 2
,,,
K
X X X
from the distribution of
X
;
2)
Select a candidate
Y
from
1 2
,,,
K
X X X
according to the
probabilities
of
1 2
,,,
K
X X X
;
Approximated Distribution Using Metropolis Algorithm
(lamda=variable, x0=0.5)
0.0
1.0
2.0
3.0
4.0
5.0
6.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
x
p.d.f
Figure
3

5
: Approximated distribution using variable step length MCMC
56
3)
Propose
K
samples
1 2
,,,
K
Y Y Y
from the distribution of
Y
;
4)
Accept
Y
with
acceptan
ce ratio
1 2
1 2
,,,
min 1,
,,,
K
K
w X X w X X w X X
r
w Y Y w Y Y w Y Y
.
(3

4
)
and reject it with
1
r
.
In equation (
3
–
4),
(,) (,) (,)
w X Y X T X Y X Y
,
(,)
T X Y
is an arbitrary
proposal function,
and
(,)
X Y
is an arbitrary
,
symmetric
,
non

negative function.
In our implementation, we chose
*
*
(,) ln ln
i
i i
X
w X X L X L X
X
(3

5
)
and
*
*
(,) ln ln
i
i i
Y
w Y Y L Y L X
X
.
(3

6
)
X
X
4
X
3
X
2
X
1
Step
1
:
Sample X
1
..
Xk from X
Step
2
:
Choose Y from X
1
...
Xk
Step
3
:
Sample Y
1
...
Yk from Y
Step
4
:
Decide X
*=
X or X
*=
Y
Y
Y
4
Y
3
Y
2
Y
1
X
*
Figure 3

6
:
The
m
ultip
oint MCMC
57
Thus, the acceptance ratio
r
becomes:
1 2
1 2
ln ( ) ln ( ) ln ( )
ln ( ) ln ( ) ln ( )
K
K
L X L X L X
r
L Y L Y L Y
(3

7
We can use a similar technique
[132]
to prove that the above algorithm will
correctly
approximate the posterior distribution o
f phylogenetic models.
Though
multipoint MCMC
allows the chain to keep moving with a large step size,
one potential issue is that if the step length is
still
small
and the distribution has multiple
modes,
multipoint MCMC
may fail just as often as the class
ical Metropolis algorithms.
By
combing multipoint

MCMC with variable step length to
draw candidate samples with
different step sizes, we can overcome this issue.
3.5
Strategy #3:
i
mproving
mixing rate with
parallel tempering
As shown in
S
ection
3

2
, a targ
et distribution may contain multiple modes
which
are
separated by high energy barriers, or low probability regions. In phylogenetic inference,
such regions could be phylogenetic models with low likelihood scores. If a proposal
mechanism fails to draw cand
idate samples in regions which are separated from current
states
by
low probability regions, the chain may seem to converge
,
but the approximation
is far from complete.
One strategy is to use an augmented distribution
( )
i
x
)
..
1
(
m
i
,
which
consists of multiple “tempered” distributions, each distribution having a different
temperature
i
T
. Increasing
i
T
will result in a flatter distribution
,
given a heating schema
like
58
1 1
0
i
T
i
.
(3

8
)
Figure
3

7
shows four tempered distributions based on the target distribution
given
in
Figure
3

1
. The temperatures of the four distributions are 0.0, 1.0, 3.0, and 8.0.
Intuitively, Metropolis alg
orithms can approximate a flatter distribution accurately. This
was
verified in our simulation.
Figure 3

7
:
A family of tempered distributions with different
temperatures
The Metropolis

coupled
MCMC, first proposed by Geyer
[133, 134]
was also
called
Parallel Tempering, exchange Monte Carlo, or (MC)
3
.
This strategy currently has been
adopted in MrBayes
[58]
.
The idea of
Metropolis

coupled MCMC
i
s to run several chains
in pa
rallel, each chain having a different stationary
distribution
)
(
i
, and
with
index
swap operations conducted in place of the temperature transition
of
simulated annealing.
The chain with distribution
)
(
1
is used
in
sam
pling and
is called the cool chain. The
o
ther chains are
used to improve the mixing of the chains
and are called heated chains
.
The
Metropolis

coupled MCMC
algorithm is shown in Figure 3

8
.
A family of tempered distributions
0.0
1.0
2.0
3.0
4.0
5.0
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
x
p. d. f.
T=0.0
T=1.0
T=3.0
T=8.0
59
An alternative is to combine the parallel step and the s
wap step into
a super
step and
conduct
s
a
swap step at every generation.
A parallel version of Metropolis

coupled
MCMC was implemented in this dissertation and will be described in Chapter
4
.
Th
ere are th
ree
related
questions when
applying
Metropolis

coupl
ed MCMC
in
Bayesian phylogenetic inference:
1)
How many chains are needed
?
2)
W
hich heating schema should be used
?
3)
Will Metropolis

coupled MCMC fail?
H
ow many chains
to use
is an empirical issue. Usually, more chains will provide
more chances to improve
the
mix
ing rate and avoid local optima
,
but
more chains
also
incur
a
higher computational cost. Parallel computing can keep the total time
from
Metropolis

coupled MCMC
1.
0
t
;
2.
For
1
i
to
m
3.
0
t
i i
4.
While
(stop

condition

not

met)
5.
Draw random variable
1
u
from
)
1
,
0
(
U
6.
If
1 0
u
7.
Then
do

classical

MCM
C in parallel
8.
Else
//do a chain swap operation
9.
choose two chains
i
and j
10.
compute
( ) ( )
( ) ( )
( ) ( )
min 1,
( ) ( )
t t
i j j i
s
t t
i i j j
a
11.
Draw random variable
2
u
from
)
1
,
0
(
U
12.
If
2
s
u
13.
Then
swap

index

temperature
,
i j
14.
1
t t
Figure 3

8
:
The Metropolis

coupled MCMC algorithm
60
increas
ing
as the number of chains increase.
We observed the benefit of
choos
ing
the
number of chains according to the
heating schema and the target distribution.
Assuming two adjacent chains with temperature
i
T
and
1
i
T
, there is a rough
relationship among chain temperatures, the logarithm format likelihood, and
the
desired
acceptance ratio,
1
1
1 1
1 1
1
1 1
1 1
1
i i
i i
T T
i i
T T
i i
.
(3

9
)
From (
3
–
9), we have
1
1 1
ln log
1 1
i i
L
T T
.
(3

10
)
Here,
ln
L
is the typical difference of the logarithmic form of the l
ikelihood which can
be obtained by averaging the differences between random samples and the one estimated
using maximum likelihood
;
is the lower bound of the acceptance ratio of the chains.
During our experiments, we observed
that for phylogenetic inference problem
s
with a
large number of taxa, Metropolis

coupled MCMC may have very low acceptance ratio
s
.
3
.
6
Proposal
algorithms for phylogenetic models
A phylogenetic model
,,
T
includes three components:
a tree topology (
T
)
;
a
vector of branch lengths (
1
(,,)
n
,
where
n
is the number of branches)
;
and the
parameters of an evolutionary model (
1
,,
m
,
where
m
is the number of the
parameters for describing the model). Thus, a phylogenetic model is in the
space
T
,
which consists of discrete sub

space
s
separated by all possible tree
topolog
ies
. Another characteristic of
this space is
that
m
varies with different model
61
assumption
s
(for example,
1
m
for
the
JC69 model;
6
m
for
the
HKY model;
and
10
m
for
the
GTR model).
An MCMC
method is
essentially
a sampler which draws dependent samples from the
target distribution using one or more Markov chains constructed using Metropolis

Hasting algorithms or their variants. We break an MCMC sampler into two parts:
1)
generate a proposal fo
r
the
next move, and
2)
design a transition function to guide the
move. In this section, we discuss how to generate a proposal in the space of phylogenetic
models, i.e.
T
.
There
is
one discrete random
variable
,
T
,
and
there are
n m
continuous random
variables
1 1
,,,,,
n m
for one state in the phylogenetic model space
. The Gibbs
sampling strategies provide various mechanisms to update each co
mponent either
randomly or systematically
[132]
.
The r
andom

scan Gibbs sampler chooses a component
c
at
random and then draw
s
a sample from
( )
t
c
and leaves other components
unchanged at time step
t
.
A s
ystem

scan Gibbs sampler updates every component in
order within one super step. Therefore, given
the
current phylogenetic model state
( )
t
,
we can propose a new state by updat
ing
T
,
i
a
nd
i
separately.
3
.6.1 Basic tree mutation operators
A new tree topology can be generated by mutating current topology through three basic
tree operators: NNI (Nearest Neighbor Interchanges), SPR (Subtree Pruning

Regrafting),
and
TBR
(Tree Bisection

Reconnection)
[77]
.
1)
NNI changes a tree topology locally. Any internal branch
,
b u v
of a tree
topology connects four subtrees
(
A
,
B
,
C
, and
D
),
where
u
and
v
are
the
62
labels of the two nodes connected by
b
.
Assuming the original tree topology is
,,
A B C D
, two
additional tree topologies
,,
A C B D
and
,,
A D B C
are
obtained
by swapping one subtree of node
u
and one subtree of
node
v
.
3)
SPR prunes a subtree
s
T
from
the
current tree
T
and then attaches
s
T
to a branch of
the pruned tree
\
s
T T
. Assuming the number of leaves
in
T
is
n
and
the number of
leaves
in
s
T
is
m
,
then
2( ) 3
n m
topologies may be obtained from a single SPR
operation.
4)
TBR partitions a tree
T
into two subtrees
(
A
T
and
B
T
)
along a branch, chooses one
branch
a
from
A
T
and another branch
b
from
B
T
.
T
hen a new topology can be
obta
ined by connecting branch
a
and branch
b
. Assuming the number of leaves
on
A
T
is
a
n
and the number of leaves
on
B
T
is
b
n
, a single TBR operation can result
in
2 3 2 3
a b
n n
new topologies.
It can be
shown
that NNI, SPR, and TBR can change
any
tree topology into
any other
tree topology with a finite number of mutation operations used separately.
This is true
under
the
condition that all proposed moves will be accepted; it may be not true if the
move is
subjected to selection
according to its likelihood score because a high energy
barrier
(or low probability region)
will constrain the move around s
ome local optima.
3
.6.2 Basic tree branch length proposal methods
The length of a branch can be any real number within the interval
0,
. Two proposal
methods can be used to propose
new branch length
s
:
the
scaling and delta method
s
.
63
3
.6.2.1
The s
caling method
Denote
the
current branch
length
0
.
A
new branch length is generated as
( 0.5)
0
u
e
,
where
u
is draw from a uniform
distribution
(0,1)
U
.
When
0.5
u
, the branch is
shorten
ed
;
when
0.5
u
, the branch length is extended. The parameter
controls the
range of the proposed branch length.
3.6.2.2
The d
elta method
The delta method add
s a mutation to
the
current branch length as
0
0.5
u
,
where
u
is draw from a uniform
distribution
(0,1)
U
, and
controls the step length.
3
.
6
.3
Propose new parameter
s
Basically, the method for proposing branch lengths can be used to propose a new
parameter
. Some prior distribution
s
can be used to increase the acceptance ratio. For
example, the distribution of the frequencies of all nucleoti
de states can be draw
n
from the
Dirichlet distribution.
3.6.
4 Co

propose topology and branch length
Tree topology and branch length can be updated simultaneously using a single tree
mutation.
T
wo methods have been
proposed
in the past: the Traversal profi
le method
[40]
and
the LOCAL methods
[41]
.
3
.
7
Extended
proposal algorithms for phylogenetic
models
This
section presents our extension
of
the basic proposal
algorithm described in previous
sections.
64
3
.
7
.1 Extended tree mutation operator
As discussed in
Section
4.3, in order t
o avoid local optima phenomena in MCMC
sampling, we
need to
design tree mutation operator
s
that
can
move
rapidly in the tree
space.
The
idea is
to
combine multiple basic tree mutations within one operator, which
we name extended tree mutation operators. We use parameter
D
to control the maximum
number of basic tree mutations in an extended tree mutation operator.
An extended tree
mutation operator work as follows:
We can choose a proper distribution
( )
f k
to control the percentage of NNI, SPR and
TBR. The choice of parameter
D
depends
on the number of leaves of the tree topology.
An extended tree mutation can discover as many as
D
N
topologies
,
where
N
is the
number of tree topologies that can be explored by a single basic t
ree operator.
3
.
7
.2 Multiple

tree

merge operator
Most tree search methods search the optimal tree along a single trajectory. Larger spaces
can be explored using multiple independent trajectories. After some training period under
the likelihood function, ea
ch tree on those independent trajectories may contain some
e
xtended

t
ree

m
utation
(
T
,
D
)
1.
draw
d
from a uniform distribution
(1,)
U D
2.
0
T T
3.
F
or
1
i
to
d
4.
draw
k
from a distribution
( )
f k
for
1..3
k
5.
If
1
k
Then
i
T
Tree

NNI
(
1
i
T
)
6.
If
2
k
Then
i
T
Tree

SPR(
1
i
T
)
7.
If
3
k
Then
i
T
Tree

TBR(
1
i
T
)
8.
Return
i
T
9.
Figure 3

9
:
The e
xtended

t
ree

m
utation
m
ethod
65
subtrees which are
partially
optimal.
M
erging these optimal subtrees can result in a good
proposal for the next move. This is one of the basic ideas of the genetic algorithm. We
introduce
this meth
od
here as another tree proposal operator. The multiple

tree

merge
operator merges subtrees from several “good” candidates into a new candidate as
Figure
3

10
:
For
2
K
, the Multiple

tree

merge operator becomes the crossover operato
r
used
in
GAML
[135]
.
3.7.3 Backbone

slide

and

slide operator
For any two internal nodes
u
and
v
on an unrooted tree, there exists a path from node
u
to node
v
. We call this path backbone
u v
. Assuming there are a total
of
n
internal
nodes (including node
u
and node
v
) on backbone
u v
, then the backbone connects
2
n
subtrees labeled from
1
to
n
in the order they are visited
,
from node
u
to node
v
.
Label the other subtree of node
u
as subtree
0
, and the other subtree of node
v
as
m
ultiple

t
ree

m
erge
(
1
,,
K
T T
)
1.
draw
k
from a unif
orm distribution
(1,)
U K
2.
0
k
T T
3.
For
1
i
to
K
4.
If
i k
Then
5.
Select a subtree
s
T
from
i
T
6.
For each leave node in
s
T
, prune it from
0
T
,
denote the pruned tree
0
\
s
T T
7.
Attach
s
T
to
0
\
s
T T
at a random branch, repla
ce
0
s
T
with the new tree
8.
Return
0
T
Figure 3

10
:
The m
ultiple

t
ree

m
erge
method
66
subtree
1
n
. Arrange subtree
0
, subtree
1
n
and all internal nodes into one
vertical
line, and denote the distance from each internal node to subtree
0
i
y
. Then we randomly
choose one internal node
k
, and slide it along the backbone by draw
ing
a random
nu
mber
y
from a uniform distribution
1
(0,)
n
U y
. The value of
y
will decide the new
position of node
k
. Finally, we scale the length of the backbone.
The backbone

slide

scale method is shown in Figure
3

11
, it is an extension of the
LOCAL me
thod proposed by Larget
et al.
[41]
when the back bone includes two nodes.
3
.
8
Chapter summary
MCMC is the cornerstone of Bayesian phylogenetic inference. The proper
implementation of MCMC is critical to the correctness of Bayesian phylogenetic
u
v
0
1
2
3
4
5
6
0
y
1
y
3
y
4
y
5
y
2
y
6
a
b
c
u
v
0
1
2
3
4
5
6
0
y
1
y
3
y
4
y
5
y
2
'
y
6
a
b
c
u
v
0
1
2
3
4
5
6
0
y
1
*
y
3
*
y
4
*
y
5
*
y
2
*
y
6
*
a
b
c
Step
1
:
Construct the Backbone
Step
2
:
Slide Subtree
3
Step
3
:
Scale Backbone length
Figure
3

11
:
The b
ackbone
s
li
de and
s
cale
m
ethod
67
inference.
In theory an MCMC chain constructed using the Metropoli
s

Hasting algorithm
will visit every state after a sufficiently large number of time steps. In practice, many
chains can not efficiently mix between two states separated by low probability regions.
We analyzed the dangers of MCMC to output misleading appro
ximations and proposed
several strategies to overcome those pitfalls. The key idea is to design some transitional
kernel which can move from one state to any other state within limited number of steps
without blocking by probable high energy barriers.
We
implemented this idea as improved MCMC strategies and extended proposal
methods. Using variable proposal step length, we can bring two distant states close to
each other. Using multipoint MCMC, we can improve the quality of candidate state and
reduce the
sample intervals. Using population

based MCMC, we can expand the search
range of the MCMC algorithm.
By introducing the above

described proposal methods
and MCMC strategies
the steps needed for the chain jump from one state to any other
state is greatly sh
ortened
;
therefore, the chain can cross valleys much easier.
These described strategies and proposal methods are implemented in PBPI, our high
performance implementation of Bayesian phylogenetic inference.
68
C
hapter
4
Parallel Bayesian Phylogenetic Infere
nce
4.
1 The need for parallel Bayesian phylogenetic
inference
Large phylogenies deepen our understanding about biological evolution and diversity.
With the rapid accumulation of genomic data through various genome sequencing
projects, constructing large ph
ylogenies across the tree of life is
be
coming
a
reality.
Si
mulation studies
indicate
that
the accuracy of a phylogenetic method can be improved
by adding more taxa and including more characters
[136]
.
Bayesian inference of large phylogeny is a computationally intensive process.
Consider a
realistic
problem
:
estimating the phylogeny of 200 aligned amino acid
sequences with 3500 characters using
a model
that
allow
s
five different rates across sites
(
200
N
,
5000
M
,
20
S
,
and
5
K
)
.
Assum
e
we
use
a
Metropolis

coupled
MCMC
algorithm with 5 chains, each chain lasting 100,000,000
generations,
and
that we
use a local update schema in the implementation
. T
hen
we need
at least
on the order of
10
10
bytes
of
memory space and
at least
on the order of
17
10
multiplication operations.
To
be competitive
in analytic quality, more
complicated models
are desirable; this together
69
with an exponential growth rate in the number of sequenced taxa
makes
growing
computational demand. Even now these demands exceed the
ability of a single

CPU
computer and require a l
onger computation time than is reasonable
, hen
ce
the motivation
for parallel implementation of Bayesian phylogenetic inference to reduce computational
time.
4
.
2
TAPS:
a
tree

based abstraction
of
parallel
system
A parallel system
uses
a collection of proce
ssing elements that communicate and
cooperate to solve large problem
s
fast
er
[137]
.
Modern computer systems effectively
exploit various hardware parallelisms to gain raw performance
at different levels ranging
from instruction, architecture, vector, processor core,
microprocessor chip, SMP node,
cluster, and grid.
In the past three decades, we have observed tremendous performance
increase
s
and cost decrease
s
in microprocessor
s
, storage,
and
networking
.
Beowulf
clusters with hundreds of nodes are common in
major univ
ersities
and research institute
s
.
The g
rid
,
as
a
new infrastructure
,
makes
sharing
geographically

distributed
computing
resources
a
reality.
Parallel algorithm design and analysis relies on an abstract model of parallel
computation to model the key attrib
utes of physical parallel systems. Existing models
include PRAM (parallel random access machine)
[138]
, BSP (bulk sy
nchronous parallel)
[139]
, LogP
[140]
,
and their variants. N
one of them can be directly applied to grid
systems or clusters of heterogeneous clusters.
Therefore, we use TAPS, a tree

based
abstraction of ubiquitous parallel systems
,
as the model guiding our design and analysis
of parallel algorithms for Bayesian phy
logenetic inference.
70
As shown in Figure
4

1, TAPS represents the parallel systems as a rooted

tree
,
where
all the physical processors are located on the leaves and clusters of computing
resources
are represented as an internal node. The root is the largest organization available to the
user. Each leaf node has its independent processing unit (
P
), memory space (
M
) and
network interface (
N
). The internal node is a virtual organization which includes an
interconnection network and a collection of computing resources which could be a
physical processing unit or a lower level virtual organization. Each edge of the tree
represent
s a communication link with fixed bandwidth and latency. Each node (
labeled
k
)
on the tree incurs an overhead (
k
o
) when communicating with other nodes. Similar
to
the
LogP
and
P
n
log
model
[141]
, the communication cost between a pair of nodes
i
and
j
is modeled as