HIGH PERFORMANCE, BAYESIAN-BASED PHYLOGENETIC INFERENCE FRAMEWORK
By
Xizhou Feng
Bachelor of Engineering
China Textile University, 1993
Master of Science
Tsinghua University, 1996
————————————————————————
Submitted in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy in the
Department of Computer Science and Engineering
College of Engineering and Information Technology
University of South Carolina
2006
Major Professor
Chairman, Examining Committee
Committee Member
Committee Member
Committee Member
Dean of The Graduate School
Dedication
To Rong, Kevin and Katherine
Acknowledgements
During the course of my graduate study, I have been fortunate to receive advice, support, and encouragement from many people. Foremost is the debt of gratitude that I owe to my thesis advisors, Professor Duncan A. Buell and Professor Kirk W. Cameron. Not only was Duncan responsible for introducing me to this interesting and fruitful field, he also provided me with inspiring guidance, great patience, and never-ending encouragement during the past several years. I especially thank Professor Kirk W. Cameron for his invaluable mentoring, insightful advising, and constant investment. Kirk guided me into the exciting field of systems study and provided opportunities and support to conduct quality research work in several cutting-edge areas.
I thank Professor Manton Matthews for his years of academic advising and for being on my advisory committee. His guidance and support made it possible for me to explore various fields in computer science and engineering.
I thank Professor John R. Rose and Professor Peter Waddell for their valuable suggestions in this research work. The discussions and collaborative work with John and Peter generated some important ideas which have been included in this thesis.
I appreciate Professor Austin L. Hughes for being on my advisory committee and providing me with critical opinions which led me to rethink and significantly improve this dissertation.
I also thank the faculty and staff in the Department of Computer Science and Engineering for providing me with one of the most wonderful training programs in the world.
Finally, I thank my family for their love and support during the hard time of
completing my dissertation.
This dissertation is dedicated to my wife Rong, my son Kevin, and my daughter Katherine.
Abstract
Comparative analyses of biological data rely on a phylogenetic tree that describes the evolutionary relationship of the organisms studied. By combining the Markov Chain Monte Carlo (MCMC) method with likelihood-based assessment of phylogenies, Bayesian phylogenetic inference incorporates complex statistical models into the process of phylogenetic tree estimation. This combination can be used to address a number of complex questions in evolutionary biology. However, Bayesian analyses are computationally expensive because they almost invariably require high-dimensional integrations over unknown parameters. Thoroughly investigating and exploiting the power of the Bayesian approach requires a high performance computing framework. Otherwise one cannot tackle the computational challenges of Bayesian phylogenetic inference for large phylogeny problems.
This dissertation extended the existing Bayesian phylogenetic inference framework in three aspects: 1) exploring various strategies to improve the performance of the MCMC sampling method; 2) developing high performance, parallel algorithms for Bayesian phylogenetic inference; and 3) combining data uncertainty and model uncertainty in Bayesian phylogenetic inference. We implemented all these extensions in PBPI, a software package for parallel Bayesian phylogenetic inference. We validated the PBPI implementation using a simulation study, a common method used in phylogenetics and other scientific disciplines. The simulation results showed that PBPI can estimate the model trees accurately given a sufficient number of sequences and correct models.
We evaluated the computational speed of PBPI using simulated datasets on a Terascale computing facility and observed significant performance improvement. On a single processor, PBPI ran up to 19 times faster than the current leading Bayesian phylogenetic inference program with the same quality of output. On 64 processors, PBPI achieved a 46-fold parallel speedup on average. Combining both sequential improvement and parallel computation, PBPI can speed up current Bayesian phylogenetic inference by up to 870 times.
Table of Contents
Dedication
Acknowledgements
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Phylogeny and its applications
1.2 Phylogenetic inference
1.3 The challenges
1.3.1 Searching a complex tree space
1.3.2 Developing realistic evolutionary models
1.3.3 Dealing with incomplete and unequal data distribution
1.3.4 Resolving conflicts among different methods and data sources
1.4 Bayesian phylogenetic inference and its issues
1.5 Motivation
1.6 Research objectives and contributions
1.7 Organization of this dissertation
Chapter 2 Background
2.1 Representations of phylogenetic trees
2.2 Methods for phylogenetic inference
2.2.1 Sequenced-based methods and genome-based methods
2.2.2 Distance-, MP-, ML- and BP-based methods
2.2.3 Tree search strategies
2.3 High performance computing phylogenetic inference methods
2.4 Bayesian phylogenetic inference
2.4.1 Introduction
2.4.2 The Bayesian framework
2.4.3 Components of Bayesian phylogenetic inference
2.4.4 Likelihood, prior and posterior probability
2.4.5 Empirical and hierarchical Bayesian analysis
2.5 Models of molecular evolution
2.5.1 The substitute rate matrix
2.5.2 Properties of the substitution rate matrix
2.5.3 The general time reversible (GTR) model
2.5.4 Rate heterogeneity among different sites
2.5.5 Other more realistic evolutionary models
2.6 Likelihood function and its evaluation
2.6.1 The likelihood function
2.6.2 Felsenstein’s algorithm for likelihood evaluation
2.7 Optimizations of likelihood computation
2.7.1 Sequence packing
2.7.2 Likelihood local update
2.7.3 Tree balance
2.8 Markov Chain Monte Carlo methods
2.8.1 The Metropolis-Hasting algorithm
2.8.2 Exploring the posterior distribution
2.8.3 The issues
2.9 Summary of the posterior distribution
2.9.1 Summary of the phylogenetic trees
2.9.2 Summary of the model parameters
2.10 Chapter summary
Chapter 3 Improved Monte Carlo Strategies
3.1 Introduction
3.2 Observations
3.3 Strategy #1: reducing stickiness using variable proposal step length
3.4 Strategy #2: reducing sampling intervals using multipoint MCMC
3.5 Strategy #3: improving mixing rate with parallel tempering
3.6 Proposal algorithms for phylogenetic models
3.6.1 Basic tree mutation operators
3.6.2 Basic tree branch length proposal methods
3.6.3 Propose new parameters
3.6.4 Co-propose topology and branch length
3.7 Extended proposal algorithms for phylogenetic models
3.7.1 Extended tree mutation operator
3.7.2 Multiple-tree-merge operator
3.7.3 Backbone-slide-and-slide operator
3.8 Chapter summary
Chapter 4 Parallel Bayesian Phylogenetic Inference
4.1 The need for parallel Bayesian phylogenetic inference
4.2 TAPS: a tree-based abstraction of parallel system
4.3 Performance models for parallel algorithms
4.4 Concurrencies in Bayesian phylogenetic inference
4.5 Issues of parallel Bayesian phylogenetic inference
4.6 Parallel algorithms for Bayesian phylogenetic inference
4.6.1 Task decomposition and assignment
4.6.2 Synchronization and communication
4.6.3 Load balancing
4.6.4 Symmetric MCMC algorithm
4.6.5 Asymmetric MCMC algorithm
4.7 Justifying the correctness of the parallel algorithms
4.8 Chapter summary
Chapter 5 Validation and Verification
5.1 Introduction
5.2 Experimental methodology
5.2.1 The model trees
5.2.2 The simulated datasets
5.2.3 The accuracy metrics
5.2.4 Tested programs and their run configurations
5.2.5 The computing platforms
5.3 Results on model tree FUSO024
5.3.1 The overall accuracy of results
5.3.2 Further analysis
5.3.3 PBPI stability
5.4 Results on model tree BURK050
5.5 Chapter summary
Chapter 6 Performance Evaluation
6.1 Introduction
6.2 Experimental methodology
6.3 The sequential performance of PBPI
6.3.1 The execution time of PBPI and MrBayes
6.3.2 The quality of the tree samples drawn by PBPI
6.3.3 The execution time of PBPI and MrBayes
6.4 Parallel speedup for fixed problem size
6.5 Scalability analysis
6.6 Parallel speedup with scaled workload
6.6.1 Scalability with different problem sizes
6.6.2 Scalability with the number of chains
6.7 Chapter summary
Chapter 7 Summary and Future Work
7.1 The big picture
7.2 Future work
Bibliography
List of Tables
Table 1-1: The number of unrooted bifurcating trees as a function of taxa
Table 5-1: The four model trees used in experiments
Table 5-2: PBPI run configurations for validation and verification
Table 5-3: The number of datasets where the model tree FUSO024 is found in the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets are used in each case.
Table 5-4: The average distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets are used in each case.
Table 5-5: The topological distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree for datasets with 10,000 characters. Datasets are simulated under the JC69 model.
Table 5-6: The average distances between the model tree BURK050 and the maximum probability tree, the 95% credible set of tree and the 50% majority consensus tree. A total of 5 datasets were used in each case.
Table 6-1: Benchmark dataset used in the evaluation
Table 6-2: Sequential execution time of PBPI and MrBayes
List of Figures
Figure 1-1: The procedure of a phylogenetic inference
Figure 2-1: Phylogenetic trees of 12 primates mitochondrial DNA sequences
Figure 2-2: The NEWICK representation of the primate phylogenetic tree
Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree
Figure 2-4: A phylogenetic tree with support values for each clade
Figure 2-5: The transition diagram and transition matrix of nucleotides
Figure 2-6: The Felsenstein algorithm for likelihood evaluation
Figure 2-7: Illustration of likelihood local update
Figure 2-8: The tree-balance algorithm
Figure 2-9: Metropolis-Hasting algorithm
Figure 3-1: A target distribution with three modes
Figure 3-2: Distribution approximated using Metropolis MCMC methods
Figure 3-3: Samples drawn using Metropolis MCMC method
Figure 3-4: Illustration of state moves
Figure 3-5: Approximated distribution using variable step length MCMC
Figure 3-6: The multipoint MCMC
Figure 3-7: A family of tempered distributions with different temperatures
Figure 3-8: The Metropolis-coupled MCMC algorithm
Figure 3-9: The extended-tree-mutation method
Figure 3-10: The multiple-tree-merge method
Figure 3-11: The backbone slide and scale method
Figure 4-1: An illustration of TAPS
Figure 4-2: Speedup under fixed workload
Figure 4-3: The procedure of a generic Bayesian phylogenetic inference
Figure 4-4: Map 8 chains to a 4 x 4 grid, where the length of each sequence is 2000
Figure 4-5: The symmetric parallel MCMC algorithm
Figure 5-1: The procedure of a simulation method for accuracy assessment
Figure 5-2: Run configuration for MrBayes
Figure 5-3: The phylogram of the model tree FUSO024
Figure 5-4: The MPP tree estimated from dataset fuso024_L10000_jc69_D001
Figure 5-5: Estimation variances in 10 individual runs
Figure 5-6: The phylogram of the model tree BURK050
Figure 5-7: The MPP tree estimated from dataset burk050_L10000_jc69_D001.nex
Figure 5-8: The posterior distribution of the top 50 most probable trees
Figure 5-9: The topological distances distribution of the top 50 most probable trees
Figure 6-1: Different speedup values computed by wall clock time and user time
Figure 6-2: Log likelihood plot of the tree samples drawn by PBPI and MrBayes
Figure 6-3: The consensus tree estimated by PBPI
Figure 6-4: The consensus tree estimated by MrBayes
Figure 6-5: Parallel speedup of PBPI for dataset FUSO024_L10000
Figure 6-6: Parallel speedup of PBPI for dataset ARCH107_L1000
Figure 6-7: Parallel speedup of PBPI for dataset BACK218_L10000
Figure 6-8: The consensus tree estimated by PBPI on 64 processors
Figure 6-9: Parallel speedup with different number of taxa
Chapter 1
Introduction
1.1 Phylogeny and its applications
All life on the earth, both present and past, is believed to be descended from a common ancestor. The descending pattern or evolutionary relationship among species or organisms, or the relatedness of their genes, is usually described by a phylogeny, a tree or network structure, with edge length representing the evolutionary divergence along different lineages. In a phylogeny, all existing organisms are placed on its “leaves” and ancestral organisms are placed at its “branches,” or internal nodes. Since all biological phenomena are the result of evolution, most biological studies have to be conducted in the light of evolution and require information on phylogeny to interpret data [1]. Thus, phylogenies play important roles not only in evolutionary biology, genetics and genomics, but also in modern pharmaceutical research, drug discovery, agricultural plant improvement, disease control studies (detection, prevention and prediction) and other biology-related fields.
The importance of phylogeny in scientific research and human society has never been made clearer than by the ambitious “Tree of Life” project initiated by the US National Science Foundation, which aims to assemble a phylogeny for all 1.7 million described species (ATOL) to benefit society and science [2].
The applications of phylogenies span a wide range of fields, both in industry and science. Several examples follow:
Identifying, organizing and classifying organisms [3, 4];
Interpreting and understanding the organization and evolution of genomes [5, 6];
Identifying and characterizing newly discovered pathogens [7];
Reconstructing the evolution and radiation of life on the earth [8, 9]; and
Identifying mutations most likely associated with diseases [10].
1.2 Phylogenetic inference
Phylogeny describes the pattern of evolutionary history among a group of taxa. But history happens only once, and people have to use the clues left by history to reconstruct actual events. One of the fundamental tasks of phylogenetic inference is to approximate the “true” phylogenetic tree for a group of taxa using a set of evolutionary evidence in which the phylogenetic signals reside.
Various kinds of data are used in phylogenetic inference, but recently DNA/RNA molecular sequences have become the most common. There are three reasons:
1) DNA sequences are the inheritance materials of all organisms on the earth;
2) Mathematical models of molecular evolution are feasible and can be improved incrementally;
3) Huge numbers of genomic sequences have been generated and are publicly accessible.
The third reason is the most important for the rapid advancement of phylogenetic inference using genomic data. Worldwide genome projects, such as the Human Genome Project (HGP) [11], have generated an ever-increasing amount of biological data. These data are publicly accessible through several government-supported database efforts, such as GenBank [12], EMBL [13], DDBJ [14], and Swiss-Prot [15]. On August 22, 2005, the public collections of DNA and RNA sequences provided by GenBank, EMBL, and DDBJ reached 100 gigabases (i.e. 100,000,000,000 bases), representing genes and genomes of over 165,000 organisms. Those massive, complex data sets already generated, and those yet to be generated, have been fueling the emergence or renaissance of a few interdisciplinary fields, including large scale phylogenetic analysis of genomic data.
The problem of phylogenetic inference using genomic (molecular) sequences is formalized as follows: Given an aligned character matrix $X = (x_{ij})_{N \times M}$ for a set of $N$ taxa, each taxon being represented by an $M$-character sequence and $x_{ij}$ denoting the character of the $i$-th taxon at the $j$-th site of its sequence, phylogenetic inference typically seeks to answer two basic questions:
1) What is the phylogenetic tree (or model) that “best” explains the evolutionary relations among these taxa?
2) With how much confidence is a particular tree expected to be “correct”?
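As a minimal sketch of the input described above, the aligned character matrix $X$ can be held as a list of $N$ equal-length strings, one per taxon, so that `X[i][j]` plays the role of $x_{ij}$ (with zero-based indices). The taxon names, the toy sequences, and the helper names `build_character_matrix` and `site_column` are illustrative assumptions and do not come from the dissertation.

```python
from typing import Dict, List, Tuple


def build_character_matrix(aligned: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (taxa, X), where X[i][j] is the character of taxon i at site j."""
    taxa = sorted(aligned)
    rows = [aligned[name].upper() for name in taxa]
    if not rows:
        raise ValueError("at least one sequence is required")
    m = len(rows[0])
    if any(len(row) != m for row in rows):
        raise ValueError("aligned sequences must all have the same length M")
    return taxa, rows


def site_column(matrix: List[str], j: int) -> str:
    """Return column j of X: the characters of all N taxa at site j."""
    return "".join(row[j] for row in matrix)


# Hypothetical toy alignment with N = 3 taxa and M = 8 sites.
toy_alignment = {
    "taxon_a": "ACGTACGT",
    "taxon_b": "ACGTACGA",
    "taxon_c": "ACGAACGA",
}
taxa, X = build_character_matrix(toy_alignment)
N, M = len(X), len(X[0])
```

Storing rows as strings keeps the sketch short; a real implementation would more likely encode characters as small integers so that per-site likelihood computations can index into model matrices directly.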
Every phylogenetic method can output a phylogenetic tree which the method views as the “best” tree according to certain optimization criteria. However, given the inherent complexities in biological evolution and some unrealistic assumptions in phylogenetic inference, an inference method usually not only produces a tree but also provides a measurement of the confidence in the tree. Bootstrapping and Bayesian posterior probability (discussed later) are two common statistical tools to provide such confidence measurements.
As shown in Figure 1-1, a phylogenetic inference is usually preceded by multiple alignment and model selection to generate its input. Most phylogenetic methods rely on some phylogenetic tree as their input as well. To reduce the errors produced by the interdependence among multiple alignment, model selection and phylogenetic inference, several iterations of alignment, selection, and inference may be required.
[Figure 1-1 is a flowchart. Recoverable box labels: Collect Data; Retrieve Homologous Sequences; Align Multiple Sequences; Select Model of Evolution; Phylogenetic Inference; Assess Confidence; Hypothesis Testing. Recoverable data labels: Aligned Data Matrix; Phylogenetic Tree(s); “Best” tree with measures of support.]
Figure 1-1: The procedure of a phylogenetic inference
1.3 The challenges
Though there have been significant advances in phylogenetic inference in the past several decades, large scale phylogenetic inference is still a challenging problem.
1.3.1 Searching a complex tree space
The biggest challenge of phylogenetic inference is the super-exponential growth in the number of unrooted trees, described by
$$Z = \prod_{i=3}^{N} (2i - 5) \tag{1-1}$$
Here $Z$ denotes the number of possible tree topologies and $N$ denotes the number of taxa. Table 1-1 shows the number of unrooted trees corresponding to the number of taxa. For example, the tree space for 100 taxa contains $1.7 \times 10^{182}$ unrooted trees. Searching this space exhaustively to find the best tree is computationally impractical. Most optimization-based phylogenetic methods, such as maximum parsimony and maximum likelihood, are NP-hard problems. Many heuristic strategies for tree searching have been studied, but much work remains to be done to improve these methods [16].
Table 1-1: The number of unrooted bifurcating trees as a function of taxa
Number of taxa    Number of unrooted trees
3                 $1$
10                $2.03 \times 10^{6}$
50                $2.84 \times 10^{74}$
100               $1.70 \times 10^{182}$
1000              $1.93 \times 10^{2860}$
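The growth described by Equation (1-1) can be checked directly with exact integer arithmetic. The following is a small sketch, not code from PBPI; the function names `count_unrooted_trees` and `order_of_magnitude` are assumptions introduced for illustration.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies, Z = prod_{i=3}^{N} (2i - 5)."""
    if n_taxa < 3:
        raise ValueError("the formula is defined for at least 3 taxa")
    total = 1
    for i in range(3, n_taxa + 1):
        total *= 2 * i - 5
    return total


def order_of_magnitude(value: int) -> int:
    """Exponent e such that 10**e <= value < 10**(e + 1), for value >= 1."""
    return len(str(value)) - 1


# Print the power of ten for a few of the taxa counts listed in Table 1-1.
for n in (3, 10, 50, 100):
    z = count_unrooted_trees(n)
    print(f"N = {n:4d}: Z is on the order of 10^{order_of_magnitude(z)}")
```

Python integers have arbitrary precision, so the product is exact; only the final formatting reduces it to an order of magnitude.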
1.3.2 Developing realistic evolutionary models
Most phylogenetic methods explicitly or implicitly assume a model of genomic sequence evolution and use such a model to estimate the rate of evolution, calculate pairwise distance, or compute the likelihood of a given phylogeny. The process of genomic sequence evolution has been affected by two factors: mutation and selection. Mutations are errors incurred during DNA replication. Mutations create genetic diversity among populations, and natural selection steers evolutionary direction. Possible forms of mutation include substitution, recombination, duplication, insertion, deletion, and inversion [17]. At the same time, mutations are constrained by the geometric, physical and chemical structures of nucleotides, amino acids, codons, protein secondary structures, and protein tertiary structures [18].
Though phylogenetic signals exist in all kinds of mutation events, most evolutionary models only consider substitution events because it is either difficult or computationally intractable to integrate other events into the models used by phylogenetic analysis [19, 20].
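To make the idea of a substitution-only model concrete, the sketch below uses the Jukes-Cantor (JC69) model, the simplest such model, to convert the observed proportion $p$ of differing sites between two aligned sequences into an evolutionary distance, $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$. JC69 is chosen here purely for illustration; the function names and the treatment of gaps and ambiguous characters (they are simply skipped) are assumptions of this sketch rather than a description of any particular program.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only sites where both sequences carry an unambiguous nucleotide.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """Expected number of substitutions per site under the JC69 model."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance is always at least as large as the raw proportion of differences, because the model accounts for multiple substitutions that may have occurred at the same site.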
With increasing computational power, researchers have relaxed some early assumptions in evolutionary models and proposed more realistic models, such as allowing rate variation across sites [21], considering the effect of insertion and deletion, and combining secondary structure information [22-24]. Given multiple possible models, it is necessary for the phylogenetic inference approach to select a model that best fits the data. Also, this approach should be robust enough to give a correct tree even when some assumptions have been violated.
Besides the complexity of modeling single type sequence evolution, the need
for
combined analysis of multiple dataset
s
with different data type
s
and sources require
s
7
some unified model
which is both mathematically founded
and biologically
meaningful
[25, 26]
.
1.3.3 Dealing with incomplete and unequal data distribution
The imperfect process of sampling, sequencing and alignment may introduce various kinds of noise into an available data set. Bias or errors in multiple sequence alignment are the cause of most noise because: 1) most multiple sequence alignment methods depend on a “correct” phylogeny to guide the alignment process; and 2) it is necessary to search across trees to find the overall optimum. It is possible to refine the alignment by repeating the cycle of multiple alignment, model selection, and phylogenetic inference, but it is always dangerous to assume the alignment is “perfect”.
To assess the reliability or sensitivity of a phylogeny on data with uncertainty, the bootstrap approach [28] was suggested by Felsenstein [29] and further refined by Efron et al. [30]. Bootstrapping requires repeating the phylogenetic inference procedure many times (typically on the order of 1000 times [23]) on derived datasets obtained by resampling the original data with replacement.
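The resampling step can be sketched as follows. This is a minimal illustration, assuming the usual convention that the resampled units are alignment columns (sites); the function name and the toy alignment are hypothetical, and the tree-building step that would be run on every replicate is omitted.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    rng: a random.Random instance, passed in so that runs are reproducible.
    Columns are drawn with replacement until the replicate has the original length.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(alignment[name]) != n_sites for name in names):
        raise ValueError("all sequences must have the same aligned length")
    # Draw column indices with replacement; the same index is used for every taxon
    # so that each sampled site stays intact across taxa.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


example = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "TCGAAC"}
rng = random.Random(2006)
replicates = [bootstrap_alignment(example, rng) for _ in range(3)]
for replicate in replicates:
    print(replicate)
```

Each replicate would then be passed to the same inference procedure as the original data, and the resulting trees summarized.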
The usefulness of phylogenetic inference methods is also limited by the sparse and uneven distribution of sequence data among species and the uncertainty inherent in the available data. Some species have been sequenced for many genes; a few genes have been sequenced for many species; but most of the potential data available for phylogenetic purposes is still missing [31, 32].
1.3.4 Resolving conflicts among different methods and data sources
Researchers usually represent a species with one or more genes in phylogeny reconstruction. However, a gene tree is not the same as a species tree [23]. Phylogenetic trees constructed with different genes or different data types (morphological data vs. molecular data) may be different. These conflicts may come from improper model assumptions or tree-building approaches.
1.4 Bayesian phylogenetic inference and its issues
This dissertation aims to extend the framework of Bayesian phylogenetic inference to achieve high performance on large phylogeny problems. By combining several factors into a comprehensive probability model and removing unknown parameters with a marginal probability distribution, Bayesian analysis has the potential to integrate complex (i.e. realistic) models and existing knowledge into phylogenetic inference.
However, like other methods when they were first introduced, Bayesian phylogenetic inference generated both excitement and debate. Supporters of the Bayesian approach claim that Bayesian phylogenetic methods have at least two advantages over traditional phylogenetic methods [33-36]:
1) The primary Bayesian phylogenetic analysis produces both a tree estimate and a measure of uncertainty for the groups on the estimated tree [10, 37, 38]. The uncertainty is measured by a quantity called the Bayesian posterior probability, which is approximated by the percentage of occurrences of a group in the tree samples generated by certain MCMC (Markov Chain Monte Carlo) methods [39-41].
2) Bayesian methods can implement very complex models of sequence evolution, because a well-designed MCMC can traverse various highly probable regions of the tree space instead of sticking around only one region which is locally optimal but may not be globally optimal [37].
However, with more thorough investigations, Bayesian phylogenetic inference also brings various highly debated issues [34, 36, 42]. Several major issues are summarized below:
1) Some Bayesian analyses offer findings that conflict with those from other approaches, such as maximum parsimony (MP) and maximum likelihood (ML) [43, 44]. Some highly debated topics include: “How meaningful are Bayesian support values?” [45]; “Do Bayesian support values reflect the probability of being true?” [46]; and “Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics” [47]. The supporters' claim that the Bayesian posterior probability of a tree is “the probability that the estimated tree is correct under the correct model” [10] is highly debatable. Some convincing interpretation is necessary to reconcile these debates.
2) One cornerstone of Bayesian phylogenetic inference is posterior probability approximation using Markov Chain Monte Carlo (MCMC). Shortly after MCMC came out, people expected that it would be more efficient than traditional ML with bootstrapping [41]. However, experience shows that the chains have to run much longer than previously expected to converge to the correct approximation [48]. More seriously, research shows that the MCMC method may give a misleading “posterior probability” under certain conditions [42, 49], for example on a mixture of trees [50].
In spite of the above and other issues, Bayesian analysis has still gained wide acceptance since it was introduced into phylogenetics [8, 51-57].
1.5 Motivation
Given the advantages and issues described above, it is necessary to investigate Bayesian phylogenetic inference more thoroughly. Given the stochastic nature of molecular evolution, statistical analyses such as Bayesian methods do have the potential to provide a unified framework that combines multiple data sources and existing knowledge into phylogenetic inference.
Some of the debates about Bayesian phylogenetic inference are due to insufficient understanding or implementation of this method, especially the MCMC algorithm. An improper MCMC implementation does have the danger of stopping at local optima. In addition, it cannot cross low probability zones to reach other optimal modes. Therefore, we need to explore improved MCMC strategies to develop more reliable and more efficient implementations.
One barrier for extensive investigation of Bayesian methods is that the method itself is time consuming. Given hundreds of taxa and complex models, a complete MCMC-based Bayesian analysis may run several months to obtain a solution. A similar situation occurred when the maximum likelihood method was first introduced. However, when computing systems became more and more powerful and better algorithms were developed, the maximum likelihood method came into wide use. This phenomenon may happen again to the Bayesian-based phylogenetic method.
1.6 Research objectives and contributions
This dissertation aims to develop a high performance framework for Bayesian phylogenetic inference. The following summarizes the research objectives and contributions of this dissertation.
1) Developing a high performance computing framework for Bayesian phylogenetic inference. In this dissertation, we investigate technologies and platforms for Bayesian phylogenetic inference and abstract different computing platforms into the TAPS (Tree-based Abstraction of Parallel System) model. Based on this model, we developed parallel MCMC algorithms for Bayesian phylogenetic inference and implemented them in the PBPI (Parallel Bayesian Phylogenetic Inference) program. Both analytical analyses and numerical simulations show that PBPI achieves roughly linear speedup for datasets with different problem sizes. This means a Bayesian phylogenetic inference lasting several months by former methods can be finished in several hours using parallel algorithms on mid-sized Beowulf-like clusters.
2) Developing better MCMC strategies for Bayesian phylogenetic inference. In this dissertation, we proposed and implemented several MCMC strategies for exploring the posterior probability distribution of the phylogenetic model. By using a variable proposal step length, we made the MCMC chain cross high energy barriers (i.e., low probability regions) and overcome “stickiness” around locally optimal regions. By introducing directional search within each proposal step, we improved the quality of each proposal and shortened the sample intervals, thereby reducing the total number of generations needed to produce an acceptable distribution. To improve the mixing rate of the chain, we also implemented a class of population-based MCMC methods which used multiple chains to explore the search space more efficiently. We demonstrated that classical MCMC methods risk generating misleading posterior probabilities on some models; by using an improved MCMC framework, this risk was reduced. Various novel algorithms and MCMC strategies were implemented in this research.
3) Accommodating data uncertainty in phylogenetic inference with data resampling in the MCMC. We extended Bayesian phylogenetic inference to include data noise in the inference procedure and showed that ML with bootstrapping can be viewed as a special case of generic Bayesian phylogenetic inference. We argued that Bayesian posterior probability and bootstrap support value measure two kinds of phylogenetic uncertainty: the former refers to multiple possible models for the same dataset; the latter refers to the robustness of a tree on a specific data set. Both uncertainties can be assessed jointly by incorporating data resampling during a single MCMC run.
1.7 Organization of this dissertation
This dissertation includes three parts. The first part consists of Chapters 1 and 2, which present background, methods, and results in the field of Bayesian phylogenetic inference. In this chapter we introduce the phylogenetic inference problem, its applications, and its challenges. We also provide a short review of positive and negative views of Bayesian phylogenetic methods. In Chapter 2, we review various phylogenetic approaches and recent advances in high performance computing for solving large phylogeny problems.
The second part includes Chapters 3 and 4, in which we describe our extended, high performance, Bayesian phylogenetic inference framework. In Chapter 3, we demonstrate the weaknesses of traditional MCMC methods and propose how to overcome these weaknesses using improved MCMC algorithms. In Chapter 4, we describe our parallel Bayesian phylogenetic inference framework. We first discuss the general models and methods for parallelizing Bayesian phylogenetic inference that can be used as the foundation for introducing high performance computing support to the phylogenetic inference problem. Then we present an implementation of parallel Metropolis-coupled MCMC and numerical results.
The third part consists of Chapters 5 and 6, where we provide a performance evaluation of the Bayesian method and our implementations. Using datasets simulated under several model trees, we verified that our implementation not only output the correct results but also ran faster than MrBayes [58], the most popular Bayesian phylogenetic inference program currently available, in both its sequential and parallel versions. Our results also demonstrated that Bayesian phylogenetic methods achieve high accuracy under the current models of evolution.
Finally, in Chapter 7, we summarize the results, conclusions and contributions from this dissertation and outline future research.
Chapter 2
Background
2.1 Representations of phylogenetic trees
A phylogenetic tree is a graph representation of the evolutionary relationship among a set of species or organisms. Since species are organized as a hierarchical classification in taxonomy, in phylogenetic inference we call the species at a leaf node of the tree a taxon (plural taxa). A phylogenetic tree is usually represented by a binary tree in which each node is connected to at most three other nodes, but it could be represented by a multi-forked tree when some parts of the tree cannot be fully resolved [59-62].
Each internal branch of the tree maps a divergence event in evolution and divides all taxa into two groups. Each group is called a clade, and each taxon in the clade shares the same common ancestor with the other taxa in the clade. If the length of the branch is set, it is proportional to the divergence time since the two groups of taxa separated from their most recent common ancestor. A phylogenetic tree can be rooted or unrooted, depending on whether a unique node is chosen as the most recent common ancestor of all taxa. Determining the “true” root for a group of taxa is usually impractical, so unrooted trees are most often used in phylogenetic inference.
[Figure 2-1 image, panels (a) to (d), scale bar 0.1. Taxa shown: Tarsius syrichta, Lemur catta, Saimiri sciureus, Hylobates, Pongo, Gorilla, Homo sapiens, Pan, M sylvanus, M fascicularis, Macaca fuscata, M mulatta]
Figure 2-1: Phylogenetic trees of 12 primate mitochondrial DNA sequences
Figure 2-1 shows the phylogenetic tree of 12 primate mitochondrial DNA sequences. This tree was constructed using MrBayes from 898 DNA characters using the JC69 model. Figure 2-1 (a) and (b) are called cladograms, which provide topological information only. Figure 2-1 (c) and (d) are called phylograms, which provide both branching order and divergence time.
The NEWICK format representation of the phylogenetic tree [63, 64] in Figure 2-1 is shown as follows.
#NEXUS
BEGIN TREES;
    TRANSLATE
        1 Tarsius_syrichta,
        2 Lemur_catta,
        3 Homo_sapiens,
        4 Pan,
        5 Gorilla,
        6 Pongo,
        7 Hylobates,
        8 Macaca_fuscata,
        9 M_mulatta,
        10 M_fascicularis,
        11 M_sylvanus,
        12 Saimiri_sciureus
    ;
    UTREE * PRIMATE = (1,2,(12,((7,(6,(5,(3,4)))),(11,(10,(8,9))))));
ENDBLOCK;
Figure 2-2: The NEWICK representation of the primate phylogenetic tree
To make the NEWICK representation unique, we define the signature of an unrooted tree as the one of its NEWICK representations that satisfies two requirements:
1) The root of the tree is fixed at the internal node that has the taxon with the smallest label as one of its children; and
2) The children of each internal node are ordered by their labels lexicographically.
For example, the signature of the above tree is:
(1,2,((((((3,4),5),6),7),(((8,9),10),11)),12))
Using the tree signature, we can easily test the equality of two trees in the same way as string comparison.
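The signature rule can be sketched in a few lines. This is a minimal illustration under two stated assumptions: the tree is given as nested Python tuples of integer taxon labels and is already rooted at the internal node adjacent to the smallest label, and the label used to order an internal node is taken to be the smallest taxon label in its subtree (compared numerically). The function names are hypothetical.

```python
def min_label(tree):
    """Smallest taxon label found in a subtree (a leaf is a bare integer)."""
    return tree if isinstance(tree, int) else min(min_label(child) for child in tree)


def canonical(tree):
    """Recursively reorder the children of every internal node by smallest label."""
    if isinstance(tree, int):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=min_label))


def to_newick(tree):
    """Write a nested tuple as a NEWICK-style string without branch lengths."""
    if isinstance(tree, int):
        return str(tree)
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def signature(tree):
    return to_newick(canonical(tree))


# The primate tree of Figure 2-2 and the same topology written in a different order.
primate = (1, 2, (12, ((7, (6, (5, (3, 4)))), (11, (10, (8, 9))))))
shuffled = (2, (((11, (10, (9, 8))), (7, (6, (5, (4, 3))))), 12), 1)

print(signature(primate))
print(signature(primate) == signature(shuffled))
```

A full implementation would also need a rerooting step for trees that are not yet rooted next to the smallest taxon; that step is left out here to keep the sketch short.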
When the distance between two trees, rather than their equality, is of interest, a phylogenetic tree can also be treated as a hierarchy of bipartitions. Each branch in the phylogenetic tree divides the set of taxa into one bipartition. For example, the complete set of nontrivial bipartitions (i.e., bipartitions in which each part has at least two nodes) for the primate phylogenetic tree shown in Figure 2-2 is:
(1,2) (3,4,5,6,7,8,9,10,11,12)
(1,2,12) (3,4,5,6,7,8,9,10,11)
(3,4) (1,2,5,6,7,8,9,10,11,12)
(3,4,5) (1,2,6,7,8,9,10,11,12)
(3,4,5,6) (1,2,7,8,9,10,11,12)
(3,4,5,6,7) (1,2,8,9,10,11,12)
(8,9) (1,2,3,4,5,6,7,10,11,12)
(8,9,10) (1,2,3,4,5,6,7,11,12)
(8,9,10,11) (1,2,3,4,5,6,7,12)
Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree
Like the signature of a phylogenetic tree, we can view each bipartition as a signature of its corresponding tree node and thus can compare two nodes from two different phylogenetic trees that include the same group of taxa. The total number of bipartitions which appear in only one of the two trees but not both is defined as the Robinson and Foulds topological distance of these two trees [24], a distance widely used in tree comparisons.
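The definition above translates directly into a set computation. The following is a minimal sketch, assuming trees are written as nested tuples of integer taxon labels over the same taxon set; each bipartition is stored as the side that contains the smallest taxon so that the same split always has the same representation. The function names are illustrative only.

```python
def leaf_set(tree):
    """All taxon labels below a node, as a frozenset."""
    if isinstance(tree, int):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_bipartitions(tree):
    """Set of bipartitions in which both parts contain at least two taxa."""
    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, int):
            return
        for child in node:
            visit(child)
        side = leaf_set(node)
        other = all_taxa - side
        if len(side) >= 2 and len(other) >= 2:
            # Normalize: always keep the part that contains the smallest taxon.
            splits.add(side if anchor in side else other)

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(nontrivial_bipartitions(tree_a) ^ nontrivial_bipartitions(tree_b))


primate = (1, 2, (12, ((7, (6, (5, (3, 4)))), (11, (10, (8, 9))))))
# Same tree with Gorilla (5) and Pongo (6) exchanged.
variant = (1, 2, (12, ((7, (5, (6, (3, 4)))), (11, (10, (8, 9))))))

print(len(nontrivial_bipartitions(primate)))
print(robinson_foulds(primate, variant))
```

Recomputing the leaf set at every node keeps the sketch short at the cost of quadratic work; a production implementation would more likely use bit vectors built in a single traversal.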
[Figure 2-4 image: the 12-taxon primate tree annotated with clade support values, 0.91 for one clade and 1.00 for the other eight]
Figure 2-4: A phylogenetic tree with support values for each clade
The support for a phylogenetic tree given the data is usually assessed with bootstrapping [65] or Bayesian posterior probability [66]. In both methods, a consensus tree is commonly used to summarize common structures among a group of trees sampled using MCMC (Markov Chain Monte Carlo) or computed from the bootstrapped datasets. Either way, the occurrences of each bipartition are counted and the frequency of each bipartition is shown in the phylogram, as in Figure 2-4. The consensus tree is also used to combine trees estimated using different genes or datasets for the same group of taxa.
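The counting step behind such support values can be sketched as follows. This is an illustration only, assuming the sampled trees are nested tuples of integer taxon labels over one common taxon set; each clade is reported as the part of the bipartition that does not contain the smallest taxon. The names are hypothetical and the construction of the consensus tree itself is omitted.

```python
from collections import Counter


def collect_splits(tree):
    """Return the nontrivial bipartitions of one tree, found in a single traversal."""
    found = []

    def walk(node):
        if isinstance(node, int):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.append(below)
        return below

    taxa = walk(tree)
    anchor = min(taxa)
    splits = set()
    for side in found:
        other = taxa - side
        if len(side) >= 2 and len(other) >= 2:
            # Report the part that excludes the smallest taxon.
            splits.add(other if anchor in side else side)
    return splits


def split_support(sampled_trees):
    """Fraction of sampled trees that contain each bipartition."""
    counts = Counter()
    for tree in sampled_trees:
        counts.update(collect_splits(tree))
    total = len(sampled_trees)
    return {split: count / total for split, count in counts.items()}


# Four hypothetical tree samples on five taxa.
samples = [(1, 2, (3, (4, 5)))] * 3 + [(1, 2, (4, (3, 5)))]
for split, support in sorted(split_support(samples).items(), key=lambda kv: -kv[1]):
    print(sorted(split), support)
```

The same counting applies whether the trees come from an MCMC sample or from bootstrap replicates; only the interpretation of the resulting frequency differs.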
When the individual trees have different but overlapping sets of taxa, a supertree is used in place of the consensus tree as the summarized output [67].
Considering the possibility of horizontal gene transfer, a phylogenetic network is used as an alternative representation of the evolutionary relationship of a group of taxa [68].
2.2 Methods for phylogenetic inference
Various methods have been developed to build phylogenetic trees from different kinds of data. These methods can be classified by: 1) the data type used in tree estimation; 2) the criteria to define an “optimal” tree; and 3) the tree search strategies.
2.2.1 Sequence-based methods and genome-based methods
Currently, molecular sequences and whole genome features are the two major data types used in phylogenetic inference [69]:
1) Sequence-based methods use one or multiple gene alignments to estimate the phylogenetic tree. Phylogenetic inference with multiple gene alignments has become common in recent years. The supermatrix [70] and supertree [71] methods are two major approaches to handle combined data such as multiple gene alignments. Both approaches rely on standard sequence-based phylogenetic inference methods.
2) Genome-based methods use phylogenetic signals contained in gene content [72-74] or gene order [75, 76] to estimate the phylogenetic tree. Phylogenetic inference using whole-genome features has recently attracted researchers' attention, and much effort is devoted to formulating distance metrics and probabilistic models. An overview of genome-based methods is provided by Delsuc et al. [69].
2.2.2 Distance-, MP-, ML- and BP-based methods
There are four major criteria to define an “optimal” tree: distance, maximum parsimony (MP), maximum likelihood (ML), and Bayesian posterior probability (BP). Comparisons among these methods are reviewed in [33, 62, 77].
Briefly, distance-based methods are much faster than the other three methods but have some potential weaknesses, including: 1) information loss in converting sequences into a distance matrix; and 2) inconsistency for data sets with large distances.
MP and ML are both optimization-based methods which break the tree estimation process into two major components: scoring a given tree and searching for the tree (or trees) with the best score. MP uses the minimum number of mutations that could produce a given tree as the score. ML uses the likelihood of the given tree under an explicit evolutionary model as the score. MP runs much faster than ML because: 1) counting mutations for MP needs much less computation than evaluating the likelihood for ML; and 2) MP does not need to optimize the branch lengths. Drawbacks of MP include: 1) multiple (or too many) trees may have the same MP score while only one of them is true; and 2) MP is subject to the “long-branch attraction” problem [78], since it does not account for the fact that the number of mutations varies on different branches.
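The MP score of a given tree can be computed with Fitch's small-parsimony algorithm, which is named here as a standard technique rather than as the method used in this dissertation. The sketch below is a minimal version for a rooted binary tree written as nested 2-tuples of taxon names; the names and the toy data are hypothetical. For an unrooted tree, the root can be placed on any branch without changing the score.

```python
def fitch_site_score(tree, states):
    """Minimum number of changes needed to explain one site on a binary tree.

    tree: nested 2-tuples whose leaves are taxon names (strings).
    states: dict mapping taxon name -> observed character at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        # Disjoint child sets force one change on a branch below this node.
        changes += 1
        return left | right

    visit(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum of per-site Fitch scores over an alignment (dict name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


alignment = {"a": "AAG", "b": "AAG", "c": "GAT", "d": "GCT"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

The score involves only set operations per node and no branch lengths, which is consistent with the speed advantage of MP noted above.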
Both ML and BP are likelihood-based methods which explicitly use a probabilistic model of molecular evolution. Their major difference is that ML uses point estimates for the unknown parameters, while BP uses the marginal distribution to integrate “out” the unknown parameters. BP has been suggested as a faster alternative to ML with bootstrapping [41]; however, this argument needs to be further justified [79]. Whether BP should be classified as an optimization-based method is questionable, since theoretically BP requires more computation than ML in order to find the probabilities of all modes of the posterior distribution. As ML is conjectured to be an NP-hard problem, BP is at least as difficult as ML. Therefore, we put BP in a new category of phylogenetic methods: sampling-based methods.
2.2.3 Tree search strategies
Any phylogenetic inference method relies on one or more tree search strategies once the “optimal” criterion is formulated. We divide the tree search strategies into the following categories:
1) Clustering method [23]: a clustering method builds the tree using a sequence of clustering operations. Examples are UPGMA [80] and neighbor-joining [81]. A clustering method runs much faster than other methods. Its limitation is that it produces only one tree, which may not be the global optimum.
2) Exact search [77]: this method examines every possible tree to locate the “best” tree. Exact search can be further divided into exhaustive search and branch-and-bound search. Exhaustive search enumerates all possible trees for evaluation. Considering the huge number of possible trees described in Chapter 1, exhaustive search is practical only for small data sizes. Branch-and-bound can prune the search space by discarding those trees that have a worse score than a preset bound (or threshold). The stricter the bound, the further the space is pruned. Like exhaustive search, branch-and-bound is limited to small problem sizes.
3) Deterministic heuristic search: the tree space is not completely randomly distributed; there is a certain order in it. A heuristic search attempts to exploit such an order to find the “best” or a near “best” tree. Commonly used deterministic search strategies include stepwise addition, local rearrangement, and global rearrangement [64, 77]. One potential problem of deterministic heuristic search is that it does not guarantee a globally optimal solution.
4) Stochastic search: by introducing some random moves, a stochastic search may avoid local optima and move toward the global optimum. Three stochastic algorithms are used in phylogenetic inference: simulated annealing [82, 83], genetic algorithms [84-86] and MCMC [40, 41, 87, 88].
5) Divide and conquer: a large problem can be solved by dividing the original problem into a set of smaller problems, solving each of them separately, and then merging the solutions of the smaller problems to obtain the solution for the original problem. The disk-covering method (DCM) [89], quartet-puzzling [90] and supertree [67] methods are used in phylogenetic inference.
2.3 High performance computing phylogenetic inference methods
As phylogenetic inference moves to large problem sizes and parallel processing becomes common, high performance computing support in phylogenetic inference is needed. High performance computing support includes algorithm tuning, parallel algorithm design, and parallel platform deployment.
Algorithm tuning seeks alternative approaches for computation-intensive parts of the phylogenetic inference. One common technique for likelihood-based phylogenetic methods is not to optimize the branch lengths frequently, because this optimization process takes $O(N^2)$ likelihood calculations. This technique has been used in [85, 86, 91, 92].
Besides algorithm improvement and exploration, parallel processing has the potential to reduce the computation time from several months to several hours in an efficient and immediate manner. Several parallel implementations of widely used phylogenetic inference methods have been developed recently, among them parallel fastDNAml [93, 94], parallel TREE-PUZZLE [95], a parallel genetic algorithm for ML [96], GRAPPA [97], and parallel MCMC algorithms [98, 99]. We note that there are multiple levels of concurrency in most phylogenetic inference methods and that these methods can run in an embarrassingly parallel fashion.
2.4 Bayesian phylogenetic inference
2.4.1 Introduction
As described in the previous chapter, the task of phylogenetic inference includes two major steps: 1) constructing a phylogenetic tree that maps the evolutionary relationship among a group of taxa, and 2) assessing the confidence in the estimated tree given the observed data. Various methods are available for building the phylogenetic tree and some of them are based on a probabilistic model of molecular evolution. Due to the stochastic nature of molecular evolution and the complicated mechanisms that affect the evolutionary process, almost every phylogenetic method has to deal with uncertainties caused by unknown parameters. Also, the fact that multiple phylogenetic trees are possible for the same group of taxa has to be considered in applications which explicitly use a phylogeny as the basis of study.
Using a comprehensive probabilistic model, Bayesian analysis provides a methodology to describe relationships among all variables under consideration. Bayesian phylogenetic inference can learn the phylogenetic model from observed data based on a quantity called posterior probability. The posterior probability of a phylogenetic model $\Psi = (T, \tau, \theta)$ can be interpreted as the probability with which this phylogenetic model is correct.
Bayesian phylogenetic inference shares some similarities with maximum likelihood estimation [10, 33]: both explicitly use a model of molecular evolution and a formalization of the likelihood function. However, the underlying methodologies are quite different. First, the Bayesian approach deals with parameter uncertainty by integrating over all possible values that a parameter might assume, while maximum likelihood estimation uses a point estimate in analysis. Second, Bayesian analysis requires specifying prior distributions of the parameters of a phylogenetic model, which provides an advantage in incorporating existing knowledge but also invites criticism, since the prior distributions are often unknown. Finally, Bayesian analysis outputs the posterior probability of trees and clades as a measurement of the confidence in the estimated results. Therefore, Bayesian phylogenetic inference is considered a faster alternative to maximum likelihood estimation with bootstrap resampling [41].
Though the idea of Bayesian phylogenetic inference emerged at almost the same time as the maximum likelihood method [100], the computation of the Bayesian posterior probability of a phylogeny was not feasible until Markov Chain Monte Carlo methods were implemented for phylogenetic inference by three independent research groups [87, 101-103] in 1996.
Bayesian phylogenetic inference became widely used after the method of computing posterior probability was described [10, 33, 39-41, 87, 104, 105] and several phylogenetic inference programs (BAMBE [106] and MrBayes [58]) became publicly available.
Despite some obvious benefits and ever-increasing applications, Bayesian phylogenetic inference has been hotly debated on several issues, including the amount of bias caused by inappropriate prior probability, the interpretation of Bayesian posterior probability [46], and the accuracy of Bayesian clade support [34, 36, 42, 45]. This calls for further examination of the power and performance of Bayesian phylogenetic analysis, and therefore for improved and faster implementations of current Bayesian phylogenetic methods.
2.4.2 The Bayesian framework
A phylogenetic model $\Psi = (T, \tau, \theta)$ consists of three components: a tree structure ($T$) that represents the evolutionary relationships of a set of organisms under study, a vector of branch lengths ($\tau$) which maps the divergence time along different lineages, and a model of the molecular evolution ($\theta$) that approximates how the characters at each site evolve over time along the tree.
In the Bayesian framework, both the observed data $X$ and the parameters of the phylogenetic model $\Psi$ are treated as random variables. Then the joint distribution of the data and the model can be set up as follows:
$$P(X, \Psi) = P(X \mid \Psi)\, P(\Psi) \tag{2-1}$$
Once the data is known, Bayesian theory can be used to compute the posterior probability of the model using
$$P(\Psi \mid X) = \frac{P(X \mid \Psi)\, P(\Psi)}{P(X)} \tag{2-2}$$
Here, $P(X \mid \Psi)$ is called the likelihood (the probability of the data given the model), $P(\Psi)$ is called the prior probability of the model (the unconditional probability of the model without any knowledge of the observed data), and $P(X)$ is the unconditional probability of the data. For the continuous case, $P(X)$ is computed by
$$P(X) = \int P(X \mid \Psi)\, P(\Psi)\, d\Psi \tag{2-3}$$
For the discrete case, $P(X)$ is computed by
$$P(X) = \sum_i P(X \mid \Psi_i)\, P(\Psi_i) \tag{2-4}$$
Since $P(X)$ is just a normalizing constant, the computation of (2-3) or (2-4) is not needed in practical inference.
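A small numerical illustration of the discrete case may help. The sketch below applies (2-2) and (2-4) to three hypothetical candidate models with made-up prior and likelihood values, and then checks that the ratio of two posterior probabilities equals the ratio of the corresponding likelihood-times-prior products. That ratio is all that a sampler which compares two models at a time needs, which is why the normalizing constant can be skipped.

```python
import math


def discrete_posterior(priors, likelihoods):
    """Posterior over a finite set of models, following (2-2) with P(X) from (2-4)."""
    joint = [prior * like for prior, like in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(X), the normalizing constant
    return [value / evidence for value in joint]


# Hypothetical numbers for three candidate models.
priors = [1 / 3, 1 / 3, 1 / 3]
likelihoods = [0.02, 0.01, 0.01]

posterior = discrete_posterior(priors, likelihoods)
print(posterior)

# The posterior ratio of model 0 to model 1 does not involve P(X).
ratio_from_posterior = posterior[0] / posterior[1]
ratio_without_evidence = (likelihoods[0] * priors[0]) / (likelihoods[1] * priors[1])
print(math.isclose(ratio_from_posterior, ratio_without_evidence))
```

For real phylogenetic models the sum in (2-4) ranges over an astronomically large set of trees, so working with ratios is a practical necessity rather than a convenience.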
The posterior probability distribution of the phylogenetic model can be written as
$$P(\Psi \mid X) = P(T_i, \tau, \theta \mid X) = \frac{P(X \mid T_i, \tau, \theta)\, P(T_i, \tau, \theta)}{\sum_{T_j} \iint P(X \mid T_j, \tau, \theta)\, P(T_j, \tau, \theta)\, d\tau\, d\theta} \tag{2-5}$$
This distribution is the current basis of Bayesian phylogenetic inference; useful information can be obtained from this distribution. For example, the posterior probability of a phylogenetic tree $T_i$ can be computed as
$$P(T_i \mid X) = \iint P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \tag{2-6}$$
Similarly, the posterior probability of the $i$-th component of the parameter $\theta$ in the evolutionary model can be summarized by
$$P(\theta_i \mid X) = \sum_{T_j} \iint P(T_j, \tau, \theta_i, \theta_{\setminus i} \mid X)\, d\tau\, d\theta_{\setminus i} \tag{2-7}$$
Here, $\theta_i$ is the $i$-th component of the parameter $\theta$ and $\theta_{\setminus i}$ are the remaining components of the parameter $\theta$.
2.4.3 Components of Bayesian phylogenetic inference
A complete Bayesian phylogenetic inference consists of four major components:
(1) Formulating the phylogenetic model $P(X \mid T_i, \tau, \theta)$;
(2) Choosing a proper prior probability $P(T_i, \tau, \theta)$;
(3) Approximating the posterior probability distribution of phylogenetic models;
(4) Inferring characteristics from the posterior probability distribution.
We briefly describe the second component in this section; the other three components will be described in the following sections.
2.4.4 Likelihood, prior and posterior probability
Bayesian theory shown in (2-2) can be expressed informally in English as:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \tag{2-8}$$
This formula indicates that by observing some new evidence (i.e. the data $X$) our starting belief (i.e. the prior probability $P(\Psi)$) may be converted into a new belief (i.e. the posterior probability $P(\Psi \mid X)$). The prior probability and the posterior probability are connected through the likelihood, the probability with which the evidence can be observed.
A phylogenetic model is a hypothesis about how the data evolve. Hypotheses cannot be observed directly, so both the prior and the posterior should be interpreted as a measure of confidence in a model rather than as frequencies [107].
A major concern in Bayesian analysis is how to choose the prior. Prior probability has the potential to incorporate existing knowledge about phylogenetic models into the current analysis, but it is also a controversial issue, since choosing the appropriate prior distribution can be subjective. Two approaches are often used for choosing the prior probability: using a non-informative prior (or flat prior, which treats every hypothesis as equally possible); and using the knowledge obtained from past experience. In Bayesian phylogenetic inference, the prior probability on phylogenetic models can be introduced as constraints to prune the parameter search space.
The posterior probability of a phylogenetic model (for example, a phylogenetic tree) can be interpreted as the probability with which this model can be correctly estimated for a set of random data simulated from this model. The accuracy of the posterior probability will be affected adversely by the use of an improper hypothesis [108].
2.4.5 Empirical and hierarchical Bayesian analysis
The comprehensive posterior distribution $P(T_i, \tau, \theta \mid X)$ requires knowledge of uncertain parameters that are not of interest in our current analysis (e.g., branch lengths or model parameters). In addition to directly exploring $P(T_i, \tau, \theta \mid X)$, two alternative approximations are used in practice to accommodate these uncertain parameters [109].
The first method is called empirical Bayesian analysis, which uses a point estimate to eliminate one of the integrals over $P(T_i, \tau, \theta \mid X)$. For example, we estimate the best-fit parameters $\theta^{*}$ and then replace equation (2-6) with
$$P(T_i \mid X) = \iint P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \approx \int P(T_i, \tau \mid X, \theta^{*})\, d\tau \tag{2-9}$$
The second method is called hierarchical Bayesian analysis, which takes the posterior probability of the phylogenetic tree as the integral over all possible combinations of branch lengths and model parameters. The hierarchical Bayesian analysis can be written as
$$P(T_i \mid X) = \frac{P(X \mid T_i)\, P(T_i)}{\sum_{T_j} P(X \mid T_j)\, P(T_j)} \tag{2-10}$$
$$P(X \mid T_i) = \iint P(X \mid T_i, \tau, \theta)\, d\tau\, d\theta \tag{2-11}$$
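The difference between the two approximations can be illustrated with a toy calculation. The sketch below is not the procedure used in this dissertation: it assumes two candidate trees, a single nuisance parameter restricted to a three-point grid with explicit prior weights (so the integrals become sums), made-up likelihood values, and a point estimate chosen as the grid value that maximizes the tree-averaged likelihood. All names are hypothetical.

```python
import math

# Hypothetical likelihoods P(X | T, theta) on a three-point grid of theta values.
likelihood = {
    "T1": [0.010, 0.040, 0.010],
    "T2": [0.030, 0.020, 0.030],
}
tree_prior = {"T1": 0.5, "T2": 0.5}
theta_prior = [1 / 3, 1 / 3, 1 / 3]  # assumed grid weights for the nuisance parameter


def normalize(weights):
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def hierarchical_posterior():
    """Sum the nuisance parameter out, in the spirit of (2-10) and (2-11)."""
    marginal = {
        tree: sum(like * w for like, w in zip(values, theta_prior))
        for tree, values in likelihood.items()
    }
    return normalize({tree: marginal[tree] * tree_prior[tree] for tree in marginal})


def empirical_posterior():
    """Fix the nuisance parameter at a point estimate, in the spirit of (2-9)."""
    n_grid = len(theta_prior)
    averaged = [
        sum(tree_prior[tree] * likelihood[tree][k] for tree in likelihood)
        for k in range(n_grid)
    ]
    best = max(range(n_grid), key=lambda k: averaged[k])  # index of theta*
    return normalize({tree: likelihood[tree][best] * tree_prior[tree] for tree in likelihood})


print(hierarchical_posterior())
print(empirical_posterior())
```

With these particular numbers the point estimate favors T1 while the full sum favors T2, which shows how conditioning on a single parameter value can change the ranking of trees when the likelihood is spread differently across parameter values.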
2.5 Models of molecular evolution