HIGH PERFORMANCE, BAYESIAN-BASED PHYLOGENETIC INFERENCE FRAMEWORK



By


Xizhou Feng


Bachelor of Engineering

China Textile University, 1993


Master of Science

Tsinghua University, 1996


————————————————————————


Submitted in Partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy in the


Department of Computer Science and Engineering


College of Engineering and Information Technology


University of South Carolina


2006

Major Professor

Chairman, Examining Committee


Committee Member


Committee Member


Committee Member


Dean of The Graduate School

ii

Dedication

To Rong, Kevin and Katherine

iii

Acknowledgements

During the course of my graduate study, I have been fortunate to receive advice, support,
and encouragement from many people. Foremost is the debt of gratitude that I owe to my
thesis advisors, Professor Duncan A. Buell and Professor Kirk W. Cameron. Not only was
Duncan responsible for introducing me to this interesting and fruitful field, he also
provided me with inspiring guidance, great patience, and never-ending encouragement during
the past several years. I especially thank Professor Kirk W. Cameron for his invaluable
mentoring, insightful advising, and constant investment. Kirk guided me into the exciting
field of systems study and provided opportunities and support to conduct quality research
work in several cutting-edge areas.


I thank Professor Manton Matthews for his years of academic advising and being on my
advisory committee. His guidance and support made it possible for me to explore
various fields in computer science and engineering.


I thank Professor John R. Rose and Professor Peter Waddell for their valuable
suggestions in this research work. The discussions and collaborative work with John and
Peter generated some important ideas which have been included in this thesis.


I appreciate Professor Austin L. Hughes for being on my advisory committee and providing
me with critical opinions which led me to rethink and significantly improve this
dissertation.


I also thank the faculty and staff in the Department of Computer Science and Engineering
for providing me with one of the most wonderful training programs in the world.



iv


Finally, I thank my family for their love and support during the hard time of
completing my dissertation.


This dissertation is dedicated to my wife Rong, my son Kevin, and my daughter
Katherine.


v

Abstract

Comparative analyses of biological data rely on a phylogenetic tree that describes the
evolutionary relationship of the organisms studied. By combining the Markov Chain
Monte Carlo (MCMC) method with likelihood-based assessment of phylogenies,
Bayesian phylogenetic inferences incorporate complex statistical models into the process
of phylogenetic tree estimation. This combination can be used to address a number of
complex questions in evolutionary biology.


However, Bayesian analyses are computationally expensive because they almost invariably
require high dimensional integrations over unknown parameters. Thoroughly investigating
and exploiting the power of the Bayesian approach requires a high performance computing
framework. Otherwise one cannot tackle the computational challenges of Bayesian
phylogenetic inference for large phylogeny problems.


This dissertation extended the existing Bayesian phylogenetic inference framework in
three aspects: 1) exploring various strategies to improve the performance of the MCMC
sampling method; 2) developing high performance, parallel algorithms for Bayesian
phylogenetic inference; and 3) combining data uncertainty and model uncertainty in
Bayesian phylogenetic inference. We implemented all these extensions in PBPI, a
software package for parallel Bayesian phylogenetic inference.


We validated the PBPI implementation using a simulation study, a common method
used in phylogenetics and other scientific disciplines. The simulation results showed
that PBPI can estimate the model trees accurately given a sufficient number of sequences
and correct models.

vi


We evaluated the computational speed of PBPI using simulated datasets on a Terascale
computing facility and observed significant performance improvement. On a single
processor, PBPI ran up to 19 times faster than the current leading Bayesian phylogenetic
inference program with the same quality output. On 64 processors, PBPI achieved a
parallel speedup of 46 times on average. Combining both sequential improvement and
parallel computation, PBPI can speed up current Bayesian phylogenetic inference by up
to 870 times.





vii

Table of Contents

Dedication ........ ii
Acknowledgements ........ iii
Abstract ........ v
List of Tables ........ xiii
List of Figures ........ xiv
Chapter 1 Introduction ........ 1
1.1 Phylogeny and its applications ........ 1
1.2 Phylogenetic inference ........ 2
1.3 The challenges ........ 5
1.3.1 Searching a complex tree space ........ 5
1.3.2 Developing realistic evolutionary models ........ 6
1.3.3 Dealing with incomplete and unequal data distribution ........ 7
1.3.4 Resolving conflicts among different methods and data sources ........ 8
1.4 Bayesian phylogenetic inference and its issues ........ 8
1.5 Motivation ........ 10
1.6 Research objectives and contributions ........ 11
1.7 Organization of this dissertation ........ 12
Chapter 2 Background ........ 14
2.1 Representations of phylogenetic trees ........ 14
2.2 Methods for phylogenetic inference ........ 19

viii

2.2.1 Sequence-based methods and genome-based methods ........ 19
2.2.2 Distance-, MP-, ML- and BP-based methods ........ 20
2.2.3 Tree search strategies ........ 21
2.3 High performance computing phylogenetic inference methods ........ 22
2.4 Bayesian phylogenetic inference ........ 23
2.4.1 Introduction ........ 23
2.4.2 The Bayesian framework ........ 25
2.4.3 Components of Bayesian phylogenetic inference ........ 27
2.4.4 Likelihood, prior and posterior probability ........ 27
2.4.5 Empirical and hierarchical Bayesian analysis ........ 28
2.5 Models of molecular evolution ........ 29
2.5.1 The substitution rate matrix ........ 29
2.5.2 Properties of the substitution rate matrix ........ 31
2.5.3 The general time reversible (GTR) model ........ 32
2.5.4 Rate heterogeneity among different sites ........ 34
2.5.5 Other more realistic evolutionary models ........ 35
2.6 Likelihood function and its evaluation ........ 35
2.6.1 The likelihood function ........ 35
2.6.2 Felsenstein’s algorithm for likelihood evaluation ........ 37
2.7 Optimizations of likelihood computation ........ 39
2.7.1 Sequence packing ........ 39
2.7.2 Likelihood local update ........ 39
2.7.3 Tree balance ........ 41

ix

2.8 Markov Chain Monte Carlo methods ........ 41
2.8.1 The Metropolis-Hastings algorithm ........ 41
2.8.2 Exploring the posterior distribution ........ 43
2.8.3 The issues ........ 44
2.9 Summary of the posterior distribution ........ 46
2.9.1 Summary of the phylogenetic trees ........ 46
2.9.2 Summary of the model parameters ........ 46
2.10 Chapter summary ........ 47
Chapter 3 Improved Monte Carlo Strategies ........ 49
3.1 Introduction ........ 49
3.2 Observations ........ 50
3.3 Strategy #1: reducing stickiness using variable proposal step length ........ 53
3.4 Strategy #2: reducing sampling intervals using multipoint MCMC ........ 55
3.5 Strategy #3: improving mixing rate with parallel tempering ........ 57
3.6 Proposal algorithms for phylogenetic models ........ 60
3.6.1 Basic tree mutation operators ........ 61
3.6.2 Basic tree branch length proposal methods ........ 62
3.6.3 Propose new parameters ........ 63
3.6.4 Co-propose topology and branch length ........ 63
3.7 Extended proposal algorithms for phylogenetic models ........ 63
3.7.1 Extended tree mutation operator ........ 64
3.7.2 Multiple-tree-merge operator ........ 64
3.7.3 Backbone-slide-and-slide operator ........ 65

x

3.8 Chapter summary ........ 66
Chapter 4 Parallel Bayesian Phylogenetic Inference ........ 68
4.1 The need for parallel Bayesian phylogenetic inference ........ 68
4.2 TAPS: a tree-based abstraction of parallel system ........ 69
4.3 Performance models for parallel algorithms ........ 71
4.4 Concurrencies in Bayesian phylogenetic inference ........ 74
4.5 Issues of parallel Bayesian phylogenetic inference ........ 75
4.6 Parallel algorithms for Bayesian phylogenetic inference ........ 77
4.6.1 Task decomposition and assignment ........ 77
4.6.2 Synchronization and communication ........ 79
4.6.3 Load balancing ........ 80
4.6.4 Symmetric MCMC algorithm ........ 80
4.6.5 Asymmetric MCMC algorithm ........ 83
4.7 Justifying the correctness of the parallel algorithms ........ 83
4.8 Chapter summary ........ 84
Chapter 5 Validation and Verification ........ 86
5.1 Introduction ........ 86
5.2 Experimental methodology ........ 89
5.2.1 The model trees ........ 89
5.2.2 The simulated datasets ........ 90
5.2.3 The accuracy metrics ........ 90
5.2.4 Tested programs and their run configurations ........ 92
5.2.5 The computing platforms ........ 93

xi

5.3 Results on model tree FUSO024 ........ 94
5.3.1 The overall accuracy of results ........ 94
5.3.2 Further analysis ........ 96
5.3.3 PBPI stability ........ 100
5.4 Results on model tree BURK050 ........ 103
5.5 Chapter summary ........ 105
Chapter 6 Performance Evaluation ........ 107
6.1 Introduction ........ 107
6.2 Experimental methodology ........ 108
6.3 The sequential performance of PBPI ........ 110
6.3.1 The execution time of PBPI and MrBayes ........ 110
6.3.2 The quality of the tree samples drawn by PBPI ........ 111
6.3.3 The execution time of PBPI and MrBayes ........ 112
6.4 Parallel speedup for fixed problem size ........ 115
6.5 Scalability analysis ........ 119
6.6 Parallel speedup with scaled workload ........ 121
6.6.1 Scalability with different problem sizes ........ 121
6.6.2 Scalability with the number of chains ........ 122
6.7 Chapter summary ........ 123
Chapter 7 Summary and Future Work ........ 124
7.1 The big picture ........ 124
7.2 Future work ........ 127

xii

Bibliography ........ 129


xiii

List of Tables

Table 1-1: The number of unrooted bifurcating trees as a function of taxa ........ 5
Table 5-1: The four model trees used in experiments ........ 89
Table 5-2: PBPI run configurations for validation and verification ........ 95
Table 5-3: The number of datasets where the model tree FUSO024 is found in the
maximum probability tree, the 95% credible set of trees and the 50% majority
consensus tree. A total of 5 datasets are used in each case. ........ 96
Table 5-4: The average distances between the model tree FUSO024 and the maximum
probability tree, the 95% credible set of trees and the 50% majority consensus tree.
A total of 5 datasets are used in each case. ........ 96
Table 5-5: The topological distances between the model tree FUSO024 and the
maximum probability tree, the 95% credible set of trees and the 50% majority
consensus tree for datasets with 10,000 characters. Datasets are simulated under
the JC69 model. ........ 97
Table 5-6: The average distances between the model tree BURK050 and the maximum
probability tree, the 95% credible set of trees and the 50% majority consensus tree.
A total of 5 datasets were used in each case. ........ 103
Table 6-1: Benchmark dataset used in the evaluation ........ 109
Table 6-2: Sequential execution time of PBPI and MrBayes ........ 110


xiv

List of Figures

Figure 1-1: The procedure of a phylogenetic inference ........ 4
Figure 2-1: Phylogenetic trees of 12 primates mitochondrial DNA sequences ........ 15
Figure 2-2: The NEWICK representation of the primate phylogenetic tree ........ 16
Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree ........ 17
Figure 2-4: A phylogenetic tree with support values for each clade ........ 18
Figure 2-5: The transition diagram and transition matrix of nucleotides ........ 30
Figure 2-6: The Felsenstein algorithm for likelihood evaluation ........ 38
Figure 2-7: Illustration of likelihood local update ........ 40
Figure 2-8: The tree-balance algorithm ........ 41
Figure 2-9: Metropolis-Hastings algorithm ........ 42
Figure 3-1: A target distribution with three modes ........ 50
Figure 3-2: Distribution approximated using Metropolis MCMC methods ........ 51
Figure 3-3: Samples drawn using Metropolis MCMC method ........ 52
Figure 3-4: Illustration of state moves ........ 54
Figure 3-5: Approximated distribution using variable step length MCMC ........ 55
Figure 3-6: The multipoint MCMC ........ 56
Figure 3-7: A family of tempered distributions with different temperatures ........ 58
Figure 3-8: The Metropolis-coupled MCMC algorithm ........ 59
Figure 3-9: The extended-tree-mutation method ........ 64
Figure 3-10: The multiple-tree-merge method ........ 65
Figure 3-11: The backbone slide and scale method ........ 66

xv

Figure 4-1: An illustration of TAPS ........ 70
Figure 4-2: Speedup under fixed workload ........ 73
Figure 4-3: The procedure of a generic Bayesian phylogenetic inference ........ 75
Figure 4-4: Map 8 chains to a 4 x 4 grid, where the length of each sequence is 2000 ........ 78
Figure 4-5: The symmetric parallel MCMC algorithm ........ 82
Figure 5-1: The procedure of a simulation method for accuracy assessment ........ 88
Figure 5-2: Run configuration for MrBayes ........ 93
Figure 5-3: The phylogram of the model tree FUSO024 ........ 98
Figure 5-4: The MPP tree estimated from dataset fuso024_L10000_jc69_D001 ........ 99
Figure 5-5: Estimation variances in 10 individual runs ........ 100
Figure 5-6: The phylogram of the model tree BURK050 ........ 101
Figure 5-7: The MPP tree estimated from dataset burk050_L10000_jc69_D001.nex ........ 102
Figure 5-8: The posterior distribution of the top 50 most probable trees ........ 104
Figure 5-9: The topological distances distribution of the top 50 most probable trees ........ 105
Figure 6-1: Different speedup values computed by wall clock time and user time ........ 108
Figure 6-2: Log likelihood plot of the tree samples drawn by PBPI and MrBayes ........ 111
Figure 6-3: The consensus tree estimated by PBPI ........ 113
Figure 6-4: The consensus tree estimated by MrBayes ........ 114
Figure 6-5: Parallel speedup of PBPI for dataset FUSO024_L10000 ........ 116
Figure 6-6: Parallel speedup of PBPI for dataset ARCH107_L1000 ........ 117
Figure 6-7: Parallel speedup of PBPI for dataset BACK218_L10000 ........ 117
Figure 6-8: The consensus tree estimated by PBPI on 64 processors ........ 118
Figure 6-9: Parallel speedup with different number of taxa ........ 122

xvi

1

Chapter 1

Introduction

1.1 Phylogeny and its applications

All life on the earth, both present and past, is believed to be descended from a common
ancestor. The descending pattern or evolutionary relationship among species or
organisms, or the relatedness of their genes, is usually described by a phylogeny, a tree
or network structure, with edge length representing the evolutionary divergence along
different lineages. In a phylogeny, all existing organisms are placed on its “leaves” and
ancestral organisms are placed at its “branches,” or internal nodes.


Since all biological phenomena are the result of evolution, most biological studies
have to be conducted in the light of evolution and require information on phylogeny to
interpret data [1]. Thus, phylogenies play important roles not only in evolutionary
biology, genetics and genomics, but also in modern pharmaceutical research, drug
discovery, agricultural plant improvement, disease control studies (detection, prevention
and prediction) and other biology-related fields. The importance of phylogeny in
scientific research and human society has never been made clearer than by the ambitious
“Tree of Life” project initiated by the US National Science Foundation, which aims to
assemble a phylogeny for all 1.7 million described species (ATOL) to benefit society and
science [2].

The applications of phylogenies span a wide range of fields, both in industry and
science. Several examples follow:

• Identifying, organizing and classifying organisms [3, 4];
• Interpreting and understanding the organization and evolution of genomes [5, 6];
• Identifying and characterizing newly discovered pathogens [7];
• Reconstructing the evolution and radiation of life on the earth [8, 9]; and
• Identifying mutations most likely associated with diseases [10].

1.2 Phylogenetic inference

Phylogeny describes the pattern of evolutionary history among a group of taxa. But
history happens only once, and people have to use the clues left by history to
reconstruct actual events. One of the fundamental tasks of phylogenetic inference is to
approximate the “true” phylogenetic tree for a group of taxa using a set of evolutionary
evidence in which the phylogenetic signals reside.


Various kinds of data are used in phylogenetic inference, but recently DNA/RNA
molecular sequences have become the most common. There are three reasons:

1) DNA sequences are the hereditary material of all organisms on the earth;
2) Mathematical models of molecular evolution are feasible and can be improved
incrementally;
3) Huge numbers of genomic sequences have been generated and are publicly
accessible.

3


The third reason is the most important for the rapid advancement of phylogenetic
inference using genomic data. Worldwide genome projects, such as the Human Genome
Project (HGP) [11], have generated an ever-increasing amount of biological data. These
data are publicly accessible through several government-supported database efforts, such
as GenBank [12], EMBL [13], DDBJ [14], and Swiss-Prot [15]. On August 22, 2005, the
public collections of DNA and RNA sequences provided by GenBank, EMBL, and DDBJ
reached 100 gigabases (i.e. 100,000,000,000 bases), representing genes and genomes of
over 165,000 organisms. Those massive, complex data sets already generated and those
yet to be generated have been fueling the emergence or renaissance of a few
interdisciplinary fields, including large scale phylogenetic analysis of genomic data.

The problem of phylogenetic inference using genomic (molecular) sequences is
formalized as follows:

Given an aligned character matrix $X = (x_{ij})_{N \times M}$ for a set of $N$ taxa,
each taxon being represented by an $M$-character sequence, with $x_{ij}$ denoting the
character of the $i$-th taxon at the $j$-th site of its sequence, phylogenetic inference
typically seeks to answer two basic questions:

1) What is the phylogenetic tree (or model) that “best” explains the evolutionary
relations among these taxa?
2) With how much confidence is a particular tree expected to be “correct”?
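
As a minimal sketch of this formalization, the aligned character matrix can be held as N
equal-length strings, so that x_ij is the character of taxon i at site j. The taxon names,
the sequences and the helper names `as_matrix` and `site_column` below are invented for
illustration; they are not data or code from this work.

```python
# Hypothetical toy alignment: N = 4 taxa, M = 8 aligned sites (made-up data).
alignment = {
    "taxon_A": "ACGTACGT",
    "taxon_B": "ACGTACGA",
    "taxon_C": "ACGAACTA",
    "taxon_D": "TCGAACTA",
}


def as_matrix(aln):
    """Return (taxa, X), where X[i][j] is the character of taxon i at site j."""
    taxa = list(aln)
    X = [list(aln[name]) for name in taxa]
    # An aligned matrix requires every row to have the same length M.
    if len({len(row) for row in X}) != 1:
        raise ValueError("sequences are not aligned to a common length")
    return taxa, X


def site_column(X, j):
    """Return column j of X: the characters of all taxa at site j."""
    return [row[j] for row in X]


taxa, X = as_matrix(alignment)
N, M = len(X), len(X[0])
```

Each column of the matrix is one aligned site, which is the unit that likelihood-based
methods evaluate and that resampling schemes such as the bootstrap draw from.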

Every phylogenetic method can output a phylogenetic tree which the method views
as the “best” tree according to certain optimization criteria. However, given the inherent
complexities in biological evolution and some unrealistic assumptions in phylogenetic
inference, an inference method usually not only produces a tree but also provides
a measurement of the confidence in the tree. Bootstrapping and Bayesian posterior
probability (discussed later) are two common statistical tools to provide such
confidence measurements.
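
To make the bootstrapping idea concrete, the sketch below resamples alignment sites
(columns) with replacement to build pseudo-replicate datasets. In a full analysis each
replicate would be passed to a tree-building method, and the support for a clade would be
the fraction of replicate trees that contain it; that step is deliberately omitted here.
The function name `bootstrap_replicate`, the toy sequences and the fixed seed are
assumptions made for illustration only.

```python
import random


def bootstrap_replicate(sequences, rng):
    """Resample alignment columns with replacement (nonparametric bootstrap).

    `sequences` is a list of equal-length strings, one per taxon. The same
    sampled column indices are applied to every taxon, so each replicate
    keeps the original number of taxa and sites.
    """
    n_sites = len(sequences[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[j] for j in cols) for seq in sequences]


# Hypothetical toy alignment (made-up data).
toy = ["ACGTACGT", "ACGTACGA", "ACGAACTA", "TCGAACTA"]
rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
```

Resampling whole columns, not individual characters, keeps each site pattern intact across
taxa, which is what allows a replicate to be analyzed exactly like the original alignment.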

As shown in Figure 1-1, a phylogenetic inference usually is preceded by multiple
alignments and model selections to generate input. Most phylogenetic methods rely on
some phylogenetic tree as their input as well. To reduce the errors produced by the
interdependence among multiple alignments, model selections and phylogenetic
inference, several iterations of alignments, selections, and inferences may be required.

[Figure 1-1 (flowchart). Box labels: Collect Data; Retrieve Homologous Sequences; Align
Multiple Sequences; Select Model of Evolution; Phylogenetic Inference; Assess Confidence;
Hypothesis Testing. Data labels: Aligned Data Matrix; Phylogenetic Tree(s); “Best” tree
with measures of support.]

Figure 1-1: The procedure of a phylogenetic inference


5

1.3 The challenges

Though there have been significant advances in phylogenetic inference in the past several
decades, large scale phylogenetic inference is still a challenging problem.

1.3.1 Searching a complex tree space

The biggest challenge of phylogenetic inference is the growth in the number of unrooted
trees, described by

$$Z = \prod_{i=3}^{N} (2i - 5) \qquad (1\text{-}1)$$

Here $Z$ denotes the number of possible tree topologies, and $N$ denotes the number of
taxa.

Table 1-1 shows the number of unrooted trees corresponding to the number of taxa.
For example, the tree space for 100 taxa will contain $1.7 \times 10^{182}$ unrooted
trees. Searching this space to find the best tree is computationally impractical. Most
optimization-based phylogenetic methods, such as maximum parsimony and maximum
likelihood, are NP-hard problems. Many heuristic strategies for tree searching have been
studied, but much work remains to be done to improve these methods [16].

Table 1-1: The number of unrooted bifurcating trees as a function of taxa

Number of taxa    Number of unrooted trees
3                 $1$
10                $2.03 \times 10^{6}$
50                $2.84 \times 10^{74}$
100               $1.70 \times 10^{182}$
1000              $1.93 \times 10^{2860}$



6

1.3.2 Developing realistic evolutionary models

Most phylogenetic methods explicitly or implicitly assume a model of genomic sequence
evolution and use such a model to estimate the rate of evolution, calculate pair-wise
distance, or compute the likelihood of a given phylogeny. The process of genomic
sequence evolution has been affected by two factors: mutations and selections. Mutations
are errors incurred during DNA replication. Mutations create genetic diversity among
populations, and natural selection steers evolutionary direction. Types of mutation
events include substitution, recombination, duplication, insertion, deletion, and
inversion [17]. At the same time, mutations are constrained by the geometric, physical
and chemical structures of nucleotides, amino acids, codons, protein secondary structures,
and protein tertiary structures [18].


Though phylogenetic signals exist in all kinds of mutation events, most evolutionary models consider only substitution events because it is either difficult or computationally intractable to integrate other events into the models used by phylogenetic analysis [19, 20].

With increasing computational power, researchers have relaxed some early assumptions in evolutionary models and proposed more realistic models, such as allowing rate variation across sites [21], considering the effect of insertion and deletion, and combining secondary structure information [22-24]. Given multiple possible models, it is necessary for the phylogenetic inference approach to select a model that best fits the data. This approach should also be robust enough to give a correct tree even when some assumptions have been violated.

Besides the complexity of modeling the evolution of a single type of sequence, the need for combined analysis of multiple datasets with different data types and sources requires a unified model that is both mathematically founded and biologically meaningful [25, 26].

1.3.3 Dealing with incomplete and unequal data distribution

The imperfect process of sampling, sequencing, and alignment may introduce varied noise into an available data set. Bias or errors in multiple sequence alignment are the cause of most noise because: 1) most multiple sequence alignment methods depend on a correct phylogeny to guide the alignment process; and 2) it is necessary to search across trees to find the overall optimum. It is possible to refine the alignment by repeating the procedure of "multiple alignment → model selection → phylogenetic inference," but it is always dangerous to assume the alignment is "perfect".


To assess the reliability or sensitivity of a phylogeny on data with uncertainty, the bootstrap approach [28] was suggested by Felsenstein [29] and further refined by Efron et al. [30]. Bootstrapping requires repeating the phylogenetic inference procedure many times (typically on the order of 1000 times [23]) on derived datasets obtained by resampling the original data with replacement.
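The resampling step can be sketched as follows. The text does not state which unit of the data is resampled; this sketch assumes the common convention of resampling alignment columns (sites) with replacement. The function names and the toy alignment are illustrative only.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Build one pseudo-alignment by drawing columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices; some columns repeat, others are left out.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns)
            for taxon, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Generate n_replicates derived datasets from one alignment."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy = {"Homo": "ACGTAC", "Pan": "ACGTTC", "Gorilla": "ACCTTC"}
    for replicate in bootstrap_replicates(toy, n_replicates=3):
        print(replicate)
```

Each replicate would then be passed through the full inference procedure, which is why bootstrapping multiplies the cost of a single analysis by the number of replicates.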

The usefulness of phylogenetic inference methods is also limited by the sparse and uneven distribution of sequence data among species and the uncertainty inherent in the available data. Some species have been sequenced for many genes; a few genes have been sequenced for many species; but most of the potential data available for phylogenetic purposes is still missing [31, 32].




1.3.4 Resolving conflicts among different methods and data sources

Researchers usually represent a species with one or more genes in phylogeny reconstruction. However, a gene tree is not the same as a species tree [23]. Phylogenetic trees constructed with different genes or different data types (morphological data vs. molecular data) may be different. These conflicts may come from improper model assumptions or tree-building approaches.

1.4 Bayesian phylogenetic inference and its issues

This dissertation aims to extend the framework of Bayesian phylogenetic inference to achieve high performance on large phylogeny problems. By combining several factors into a comprehensive probability model and removing unknown parameters with a marginal probability distribution, Bayesian analysis has the potential to integrate complex (i.e. realistic) models and existing knowledge into phylogenetic inference.

However, like other methods when they were first introduced, Bayesian phylogenetic inference generated both excitement and debate.

Supporters of the Bayesian approach claim that Bayesian phylogenetic methods have at least two advantages over traditional phylogenetic methods [33-36]:

1) The primary Bayesian phylogenetic analysis produces both a tree estimate and a measure of uncertainty for the groups on the estimated tree [10, 37, 38]. The uncertainty is measured by a quantity called Bayesian posterior probability, which is approximated by the percentage of occurrences of a group in the tree samples generated by certain MCMC (Markov Chain Monte Carlo) methods [39-41].

2) Bayesian methods can implement very complex models of sequence evolution, because a well-designed MCMC can traverse various highly probable regions of the tree space instead of sticking around only one region that is locally optimal but may not be globally optimal [37].


However, with more thorough investigations, Bayesian phylogenetic inference has also raised various highly debated issues [34, 36, 42]. Several major issues are summarized below:


1) Some Bayesian analyses offer findings that conflict with those from other approaches, such as maximum parsimony (MP) and maximum likelihood (ML) [43, 44]. Some highly debated topics include: "How meaningful are Bayesian support values?" [45]; "Do Bayesian support values reflect the probability of being true?" [46]; and "Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics" [47]. The supporters' claim that the Bayesian posterior probability of a tree is "the probability that the estimated tree is correct under the correct model" [10] is highly debatable. Some convincing interpretation is necessary to reconcile these debates.

2) One cornerstone of Bayesian phylogenetic inference is posterior probability approximation using Markov Chain Monte Carlo (MCMC). Shortly after MCMC was introduced, people expected that it would be more efficient than traditional ML with bootstrapping [41]. However, experience shows that the chains have to run much longer than previously expected to converge to the correct approximation [48]. More seriously, research shows that the MCMC method may give misleading "posterior probability" under certain conditions [42, 49], for example on a mixture of trees [50].

In spite of the above and other issues, Bayesian analysis has gained wide acceptance since it was introduced into phylogenetics [8, 51-57].


1.5 Motivation

Given the views described above, both positive and negative, it is necessary to investigate Bayesian phylogenetic inference more thoroughly. Given the stochastic nature of molecular evolution, statistical analyses such as Bayesian methods do have the potential to provide a unified framework that combines multiple data sources and existing knowledge into phylogenetic inference.

Some of the debates about Bayesian phylogenetic inference are due to insufficient understanding or implementation of this method, especially the MCMC algorithm. An improper MCMC implementation does risk stopping at local optima. In addition, it cannot cross low probability zones to reach other optimal modes. Therefore, we need to explore improved MCMC strategies to develop more reliable, more efficient implementations.


One barrier to extensive investigation of Bayesian methods is that the method itself is time consuming. Given hundreds of taxa and complex models, a complete MCMC-based Bayesian analysis may run for several months to obtain a solution. A similar situation occurred when the maximum likelihood method was first introduced. However, when computing systems became more and more powerful and better algorithms were developed, the maximum likelihood method came into wide use. This phenomenon may happen again for the Bayesian-based phylogenetic method.

1.6 Research objectives and contributions

This dissertation aims to develop a high performance framework for Bayesian phylogenetic inference. The following summarizes the research objectives and contributions of this dissertation.

1) Developing a high performance computing framework for Bayesian phylogenetic inference. In this dissertation, we investigate technologies and platforms for Bayesian phylogenetic inference and abstract different computing platforms into the TAPS (Tree-based Abstraction of Parallel System) model. Based on this model, we developed parallel MCMC algorithms for Bayesian phylogenetic inference and implemented them in the PBPI (Parallel Bayesian Phylogenetic Inference) program. Both analytical analyses and numerical simulations show that PBPI achieves roughly linear speedup for datasets with different problem sizes. This means a Bayesian phylogenetic inference lasting several months by former methods can be finished in several hours using parallel algorithms on mid-sized Beowulf-like clusters.

2) Developing better MCMC strategies for Bayesian phylogenetic inference. In this dissertation, we proposed and implemented several MCMC strategies for exploring the posterior probability distribution of the phylogenetic model. By using a variable proposal step length, we made the MCMC chain cross high energy barriers (i.e., low probability regions) and overcome "stickiness" around locally optimal regions. By introducing directional search within each proposal step, we improved the quality of each proposal and shortened the sample intervals, thereby reducing the total number of generations needed to produce an acceptable distribution. To improve the mixing rate of the chain, we also implemented a class of population-based MCMC methods which used multiple chains to explore the search space more efficiently. We demonstrated that classical MCMC methods risk generating misleading posterior probability on some models; by using an improved MCMC framework, this risk was reduced. Various novel algorithms and MCMC strategies were implemented in this research.

3) Accommodating data uncertainty in phylogenetic inference with data resampling in the MCMC. We extended Bayesian phylogenetic inference to include data noise in the inference procedure and showed that ML with bootstrapping can be viewed as a special case of generic Bayesian phylogenetic inference. We argued that Bayesian posterior probability and bootstrap support value measure two kinds of phylogenetic uncertainties: the former refers to multiple possible models for the same dataset; the latter refers to the robustness of a tree on a specific dataset. Both uncertainties can be assessed jointly by incorporating data resampling during a single MCMC run.

1.7 Organization of this dissertation

This dissertation includes three parts.

The first part consists of Chapters 1 and 2, which present background, methods, and results in the field of Bayesian phylogenetic inference. In this chapter we introduce the phylogenetic inference problem, its applications, and its challenges. We also provide a short review of positive and negative views of Bayesian phylogenetic methods. In Chapter 2, we review various phylogenetic approaches and recent advances in high performance computing for solving large phylogeny problems.

The second part includes Chapters 3 and 4, in which we describe our extended, high performance, Bayesian phylogenetic inference framework. In Chapter 3, we demonstrate the weaknesses of traditional MCMC methods and propose how to overcome these weaknesses using improved MCMC algorithms. In Chapter 4, we describe our parallel Bayesian phylogenetic inference framework. We first discuss the general models and methods for parallelizing Bayesian phylogenetic inference that can be used as the foundation for introducing high performance computing support to the phylogenetic inference problem. Then we present an implementation of parallel Metropolis-coupled MCMC and numerical results.

The third part consists of Chapters 5 and 6, where we provide a performance evaluation of the Bayesian method and our implementations. Using simulated datasets under several model trees, we verified that our implementation not only output the correct results but also ran faster, in both sequential and parallel implementations, than MrBayes [58], the most popular Bayesian phylogenetic inference program currently available. Our results also demonstrated that the accuracy of the Bayesian-based phylogenetic method is well suited to the current models of evolution.

Finally, in Chapter 7, we summarize the results, conclusions, and contributions of this dissertation and outline future research.


Chapter 2

Background

2.1 Representations of phylogenetic trees

A phylogenetic tree is a graph representation of the evolutionary relationship among a set of species or organisms. Since species are organized as a hierarchical classification in taxonomy, we call a species at a leaf node of the tree a taxon (plural taxa) in phylogenetic inference. A phylogenetic tree is usually represented by a binary tree in which each tree node is connected to at most three other nodes, but it could be represented by a multi-forked tree when some parts of the tree cannot be fully resolved [59-62].

Each internal branch of the tree maps a divergence event in evolution and divides all taxa into two groups. Each group is called a clade, and each taxon in the clade shares the same common ancestor with the other taxa in the clade. If the length of the branch is set, it is proportional to the divergence time since the two groups of taxa separated from their latest common ancestor. A phylogenetic tree could be rooted or unrooted depending on whether a unique node is chosen as the latest common ancestor of all taxa. Determining the "true" root for a group of taxa is usually impractical, so unrooted trees are most often used in phylogenetic inference.

[Figure 2-1: Phylogenetic trees of 12 primate mitochondrial DNA sequences, panels (a) to (d). Taxa shown: Tarsius syrichta, Lemur catta, Saimiri sciureus, Hylobates, Pongo, Gorilla, Homo sapiens, Pan, M. sylvanus, M. fascicularis, Macaca fuscata, M. mulatta. Scale bar: 0.1.]


Figure 2-1 shows the phylogenetic tree of 12 primate mitochondrial DNA sequences. This tree was constructed using MrBayes from 898 DNA characters under the JC69 model. Figure 2-1 (a) and (b) are called cladograms, which provide topological information only. Figure 2-1 (c) and (d) are called phylograms, which provide both branching order and divergence time.

The NEWICK format representation of the phylogenetic tree [63, 64] in Figure 2-1 is shown in Figure 2-2.

    #NEXUS
    BEGIN TREES;
        TRANSLATE
            1  Tarsius_syrichta,
            2  Lemur_catta,
            3  Homo_sapiens,
            4  Pan,
            5  Gorilla,
            6  Pongo,
            7  Hylobates,
            8  Macaca_fuscata,
            9  M_mulatta,
            10 M_fascicularis,
            11 M_sylvanus,
            12 Saimiri_sciureus
        ;
        UTREE * PRIMATE = (1,2,(12,((7,(6,(5,(3,4)))),(11,(10,(8,9))))));
    ENDBLOCK;

Figure 2-2: The NEWICK representation of the primate phylogenetic tree

To make the NEWICK representation unique, we define the signature of an unrooted tree as the one of its NEWICK representations that satisfies two requirements:

1) The root of the tree is fixed at the internal node that has the taxon with the smallest label as one of its children; and

2) The children of each internal node are ordered by their labels lexicographically.

For example, the signature of the above tree is:

(1,2,((((((3,4),5),6),7),(((8,9),10),11)),12))

Using the tree signature, we can easily test the equality of two trees in the same way as string comparison.
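The two requirements can be turned into a small program. This is a sketch under stated assumptions rather than the dissertation's implementation: taxa are integer labels, the input is a topology-only NEWICK string without spaces or branch lengths, every internal node has degree three (the top level is a trifurcation), and "ordered by their labels" is read as ordering each subtree by the smallest taxon label it contains, which reproduces the example above. All function names are hypothetical.

```python
import itertools


def parse_newick(text):
    """Parse a topology-only NEWICK string with integer labels into nested lists."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return int(text[start:pos])

    return node()


def to_graph(tree):
    """Forget the root: return an adjacency list of the unrooted tree."""
    adjacency = {}
    ids = itertools.count()

    def visit(node, parent):
        name = node if isinstance(node, int) else ("internal", next(ids))
        adjacency[name] = [] if parent is None else [parent]
        if parent is not None:
            adjacency[parent].append(name)
        if not isinstance(node, int):
            for child in node:
                visit(child, name)

    visit(tree, None)
    return adjacency


def signature(newick):
    """Canonical NEWICK string: rooted next to the smallest taxon, children sorted."""
    adjacency = to_graph(parse_newick(newick))
    smallest = min(n for n in adjacency if isinstance(n, int))
    root = adjacency[smallest][0]         # requirement 1: root beside smallest taxon

    def canon(node, parent):
        kids = [k for k in adjacency[node] if k != parent]
        if not kids:                      # a leaf
            return node, str(node)
        parts = sorted(canon(k, node) for k in kids)   # requirement 2
        return parts[0][0], "(" + ",".join(s for _, s in parts) + ")"

    return canon(root, None)[1]


if __name__ == "__main__":
    print(signature("(1,2,(12,((7,(6,(5,(3,4)))),(11,(10,(8,9))))));"))
```

Two NEWICK strings that describe the same unrooted topology, even if written from different starting nodes, then compare equal as strings after canonicalization.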

When the distance between two trees, rather than their equality, is of interest, a phylogenetic tree can also be treated as a hierarchy of bipartitions. Each branch in the phylogenetic tree divides the set of taxa into one bipartition. For example, the complete set of nontrivial bipartitions (i.e., bipartitions in which each part has at least two nodes) for the primate phylogenetic tree shown in Figure 2-2 is given in Figure 2-3.

(1,2) | (3,4,5,6,7,8,9,10,11,12)
(1,2,12) | (3,4,5,6,7,8,9,10,11)
(3,4) | (1,2,5,6,7,8,9,10,11,12)
(3,4,5) | (1,2,6,7,8,9,10,11,12)
(3,4,5,6) | (1,2,7,8,9,10,11,12)
(3,4,5,6,7) | (1,2,8,9,10,11,12)
(8,9) | (1,2,3,4,5,6,7,10,11,12)
(8,9,10) | (1,2,3,4,5,6,7,11,12)
(8,9,10,11) | (1,2,3,4,5,6,7,12)

Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree

Like the signature of a phylogenetic tree, we can view each bipartition as a signature of its corresponding tree node and thus can compare two nodes from two different phylogenetic trees that include the same group of taxa. The total number of bipartitions that appear in only one of the two trees but not both is defined as the Robinson and Foulds topological distance between the two trees [24], a distance widely used in tree comparisons.
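The bipartition view translates directly into code. In this sketch a tree is written as nested Python tuples of integer taxon labels, and each bipartition is stored as the side that does not contain the smallest taxon, so that the same split always has the same representation. These representation choices and the function names are assumptions made for illustration.

```python
def leaf_set(node):
    """All taxon labels below a node of a nested-tuple tree."""
    if isinstance(node, int):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def nontrivial_bipartitions(tree):
    """Set of nontrivial bipartitions, each stored as the side without the smallest taxon."""
    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, int):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_taxa - below
        # Keep only splits with at least two taxa on each side.
        if len(below) >= 2 and len(other) >= 2:
            splits.add(other if anchor in below else below)
        return below

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(nontrivial_bipartitions(tree_a) ^ nontrivial_bipartitions(tree_b))


if __name__ == "__main__":
    primate = (1, 2, (12, ((7, (6, (5, (3, 4)))), (11, (10, (8, 9))))))
    swapped = (1, 2, (12, ((7, (5, (6, (3, 4)))), (11, (10, (8, 9))))))
    print(len(nontrivial_bipartitions(primate)))
    print(robinson_foulds(primate, swapped))
```

For the primate tree the function returns the nine bipartitions of Figure 2-3; exchanging taxa 5 and 6 replaces the split (3,4,5) by (3,4,6), so the distance between the two trees is 2.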

[Figure 2-4: A phylogenetic tree with support values for each clade. The tree covers the same 12 primate taxa as Figure 2-1; one clade has a support value of 0.91 and the remaining eight have a support value of 1.00.]

The support for a phylogenetic tree given the data is usually assessed with bootstrapping [65] or Bayesian posterior probability [66]. In both methods, a consensus tree is commonly used to summarize common structures among a group of trees sampled using MCMC (Markov Chain Monte Carlo) or computed using the bootstrapped datasets. In either way, the occurrences of each bipartition are counted and the frequency of each bipartition is shown in the phylogram, as in Figure 2-4. The consensus tree is also used to combine trees estimated using different genes or datasets for the same group of taxa.
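Counting bipartition frequencies over a sample of trees is a short computation. The sketch below assumes each sampled tree has already been reduced to a set of hashable bipartitions; the 50% majority-rule threshold is one common way to build a consensus and is an assumption here, as are the function names and the toy data.

```python
from collections import Counter


def split_support(sampled_split_sets):
    """Fraction of sampled trees in which each bipartition occurs."""
    counts = Counter()
    for splits in sampled_split_sets:
        counts.update(set(splits))        # count each split once per tree
    n_trees = len(sampled_split_sets)
    return {split: count / n_trees for split, count in counts.items()}


def majority_rule(support, threshold=0.5):
    """Bipartitions whose support exceeds the threshold."""
    return {split for split, freq in support.items() if freq > threshold}


if __name__ == "__main__":
    a, b, c = frozenset({3, 4}), frozenset({3, 4, 5}), frozenset({4, 5})
    samples = [{a, b}, {a, b}, {a, c}, {a, b}]   # four toy "sampled trees"
    support = split_support(samples)
    print(support)
    print(majority_rule(support))
```

The same counting applies whether the trees come from an MCMC sample or from bootstrap replicates; only the interpretation of the resulting frequencies differs.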


When the individual trees have different but overlapping sets of taxa, a supertree is used in place of the consensus tree as the summarized output [67].

Considering the possibility of horizontal gene transfer, a phylogenetic network is used as an alternative representation of the evolutionary relationship of a group of taxa [68].

2.2 Methods for phylogenetic inference

Various methods have been developed to build phylogenetic trees from different kinds of data. These methods can be classified by: 1) the data type used in tree estimation; 2) the criteria to define an "optimal" tree; and 3) the tree search strategies.

2.2.1 Sequence-based methods and genome-based methods

Currently, molecular sequences and whole genome features are the two major data types used in phylogenetic inference [69]:

1) Sequence-based methods use one or multiple gene alignments to estimate the phylogenetic tree. Phylogenetic inference with multiple gene alignments has become common in recent years. The supermatrix [70] and supertree [71] methods are two major approaches to handle combined data such as multiple gene alignments. Both approaches rely on standard sequence-based phylogenetic inference methods.

2) Genome-based methods use phylogenetic signals contained in gene content [72-74] or gene order [75, 76] to estimate the phylogenetic tree. Phylogenetic inference using whole-genome features has recently attracted researchers' attention, and much effort is devoted to formulating distance metrics and probabilistic models. An overview of genome-based methods is provided by Delsuc et al. [69].

2.2.2 Distance-, MP-, ML- and BP-based methods

There are four major criteria to define an "optimal" tree: distance, maximum parsimony (MP), maximum likelihood (ML), and Bayesian posterior probability (BP). Comparisons among these methods are reviewed in [33, 62, 77].

Briefly, distance-based methods are much faster than the other three methods but have some potential weaknesses, including: 1) information loss in converting sequences into a distance matrix; and 2) inconsistency for data sets with large distances.

MP and ML are both optimization-based methods which break the tree estimation process into two major components: scoring a given tree and searching for the tree (or trees) with the best score. MP uses the minimum number of mutations that could produce the observed data on a given tree as the score. ML uses the likelihood of the given tree under an explicit evolutionary model as the score. MP runs much faster than ML because: 1) MP needs much less computation to evaluate the number of mutations than ML needs to evaluate the likelihood; and 2) MP does not need to optimize the branch lengths. Drawbacks of MP include: 1) multiple (or too many) trees may have the same MP score and only one of them is true; and 2) MP is subject to the "long-branch attraction" problem [78] since it does not account for the fact that the number of mutations varies on different branches.
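To make the MP score concrete, the sketch below counts the minimum number of changes with Fitch's algorithm, a standard method for this task that the text itself does not name. It assumes a fully resolved tree written as nested tuples of taxon names (a trifurcation at the top level is also handled); the function names and the toy data are illustrative.

```python
def fitch_site_score(tree, states):
    """Minimum number of state changes for one site on a binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if not isinstance(node, tuple):
            return {states[node]}          # a leaf carries its observed state
        child_sets = [visit(child) for child in node]
        result = child_sets[0]
        for other in child_sets[1:]:
            if result & other:
                result = result & other    # children agree: no change needed
            else:
                result = result | other    # children disagree: one change
                changes += 1
        return result

    visit(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum of per-site Fitch scores over an alignment (dict taxon -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


if __name__ == "__main__":
    alignment = {"A": "AA", "B": "AG", "C": "GA", "D": "GG"}
    print(parsimony_score((("A", "B"), ("C", "D")), alignment))
    print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

Only set intersections and unions are needed per node and per site, and no branch lengths appear anywhere, which matches the two reasons given above for MP being faster than ML.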

Both ML and BP are likelihood-based methods which explicitly use a probabilistic model of molecular evolution. Their major difference is that ML uses point estimates for the unknown parameters, whereas BP uses a marginal distribution to integrate "out" the unknown parameters. BP has been suggested as a faster alternative to ML with bootstrapping [41]; however, this argument needs to be further justified [79]. Whether BP should be classified as an optimization-based method is questionable, since theoretically BP requires more computation than ML in order to find the probabilities of all modes of the posterior distribution. As ML is conjectured to be an NP-hard problem, BP is at least as difficult as ML. Therefore, we put BP in a new category of phylogenetic methods: sampling-based methods.

2.2.3 Tree search strategies

Every phylogenetic inference method relies on one or more tree search strategies once the "optimal" criterion is formulated. We divide the tree search strategies into the following categories:

1) Clustering method [23]: a clustering method builds the tree using a sequence of clustering operations. Examples are UPGMA [80] and neighbor-joining [81]. A clustering method runs much faster than other methods. Its limitation is that it produces only one tree, which may not be the global optimum.

2) Exact search [77]: this method examines every possible tree to locate the "best" tree. Exact search can be further divided into exhaustive search and branch-and-bound search. Exhaustive search enumerates all possible trees for evaluation. Considering the huge number of possible trees as described in Chapter 1, exhaustive search is practical only for small data sizes. Branch-and-bound can prune the search space by discarding those trees that have a worse score than a preset bound (or threshold). The stricter the bound, the further the space will be pruned. Like exhaustive search, branch-and-bound is limited to small problem sizes.

3) Deterministic heuristic search: the tree space is not completely randomly distributed; there is a certain order in it. A heuristic search attempts to exploit such an order to find the "best" or near "best" tree. Commonly used deterministic search strategies include stepwise addition, local rearrangement, and global rearrangement [64, 77]. One potential problem of deterministic heuristic search is that it does not guarantee a globally optimal solution.

4) Stochastic search: by introducing some random moves, a stochastic search may avoid local optima and move toward the global optimum. Three stochastic algorithms are used in phylogenetic inference: simulated annealing [82, 83], genetic algorithms [84-86], and MCMC [40, 41, 87, 88].

5) Divide and conquer: a large problem can be solved by dividing the original problem into a set of smaller problems, solving each of them separately, and then merging the solutions of the smaller problems to obtain the solution for the original problem. The disk-covering method (DCM) [89], quartet puzzling [90], and supertree methods [67] are used in phylogenetic inference.

2.3 High performance computing phylogenetic inference methods

As phylogenetic inference moves to larger problem sizes and parallel processing becomes common, high performance computing support for phylogenetic inference is needed. High performance computing support includes algorithm tuning, parallel algorithm design, and parallel platform deployment.

Algorithm tuning seeks alternative approaches for the computation-intensive parts of phylogenetic inference. One common technique for likelihood-based phylogenetic methods is to avoid optimizing the branch lengths frequently, because this optimization process takes $O(N^2)$ likelihood calculations. This technique has been used in [85, 86, 91, 92].

Besides algorithm improvement and exploration, parallel processing has the potential to reduce the computation time from several months to several hours in an efficient and immediate manner. Several parallel implementations of widely used phylogenetic inference methods have been developed recently, among them parallel fastDNAml [93, 94], parallel TREE-PUZZLE [95], a parallel genetic algorithm for ML [96], GRAPPA [97], and parallel MCMC algorithms [98, 99]. We note that there are multiple levels of concurrency in most phylogenetic inference methods, and that these methods can run in an embarrassingly parallel fashion.

2.4 Bayesian phylogenetic inference

2.4.1 Introduction

As described in the previous chapter, the task of phylogenetic inference includes two major steps: 1) constructing a phylogenetic tree that maps the evolutionary relationship among a group of taxa, and 2) assessing the confidence in the estimated tree given the observed data. Various methods are available for building the phylogenetic tree, and some of them are based on a probabilistic model of molecular evolution. Due to the stochastic nature of molecular evolution and the complicated mechanisms that affect the evolutionary process, almost every phylogenetic method has to deal with uncertainties caused by unknown parameters. Also, the fact that multiple phylogenetic trees are possible for the same group of taxa has to be considered in applications which explicitly use a phylogeny as the basis of study.

Using a comprehensive probabilistic model, Bayesian analysis provides a methodology to describe relationships among all variables under consideration. Bayesian phylogenetic inference can learn the phylogenetic model from observed data based on a quantity called posterior probability. The posterior probability of a phylogenetic model $\Psi = (T, \tau, \theta)$ can be interpreted as the probability with which this phylogenetic model is correct.

Bayesian phylogenetic inference shares some similarities with maximum likelihood estimation [10, 33]: both explicitly use a model of molecular evolution and a formalization of the likelihood function. However, the underlying methodologies are quite different. First, the Bayesian approach deals with parameter uncertainty by integrating over all possible values that a parameter might assume, while maximum likelihood estimation uses a point estimate in analysis. Second, Bayesian analysis requires specifying prior distributions of the parameters of a phylogenetic model, which provides an advantage in incorporating existing knowledge but also invites criticism since the prior distributions are often unknown. Finally, Bayesian analysis outputs the posterior probability of trees and clades as a measurement of the confidence in the estimated results. Therefore, Bayesian phylogenetic inference is considered a faster alternative to maximum likelihood estimation with bootstrap resampling [41].

Though the idea of Bayesian phylogenetic inference emerged at almost the same period as the maximum likelihood method [100], the computation of the Bayesian posterior probability of a phylogeny was not feasible until Markov Chain Monte Carlo methods were implemented for phylogenetic inference by three independent research groups [87, 101-103] in 1996. Bayesian phylogenetic inference became widely used after the method of computing posterior probability was described [10, 33, 39-41, 87, 104, 105] and several phylogenetic inference programs (BAMBE [106] and MrBayes [58]) became publicly available.

Despite some obvious benefits and ever-increasing applications, Bayesian phylogenetic inference has been hotly debated on several issues, including the amount of bias caused by inappropriate prior probability, the interpretation of Bayesian posterior probability [46], and the accuracy of Bayesian clade support [34, 36, 42, 45]. This calls for further examination of the power and performance of Bayesian phylogenetic analysis, and therefore for improved and faster implementations of current Bayesian phylogenetic methods.

2.4.2 The Bayesian framework

A phylogenetic model $\Psi = (T, \tau, \theta)$ consists of three components: a tree structure ($T$) that represents the evolutionary relationships of a set of organisms under study, a vector of branch lengths ($\tau$) which maps the divergence time along different lineages, and a model of the molecular evolution ($\theta$) that approximates how the characters at each site evolve over time along the tree. In the Bayesian framework, both the observed data $X$ and the parameters of the phylogenetic model $\Psi$ are treated as random variables. Then the joint distribution of the data and the model can be set up as follows:

$$P(X, \Psi) = P(X \mid \Psi)\, P(\Psi) \qquad (2\text{-}1)$$

Once the data is known, Bayes' theorem can be used to compute the posterior probability of the model using

$$P(\Psi \mid X) = \frac{P(X \mid \Psi)\, P(\Psi)}{P(X)} \qquad (2\text{-}2)$$

Here, $P(X \mid \Psi)$ is called the likelihood (the probability of the data given the model), $P(\Psi)$ is called the prior probability of the model (the unconditional probability of the model without any knowledge of the observed data), and $P(X)$ is the unconditional probability of the data. For the continuous case, $P(X)$ is computed by

$$P(X) = \int P(X \mid \Psi)\, P(\Psi)\, d\Psi \qquad (2\text{-}3)$$

For the discrete case, $P(X)$ is computed by

$$P(X) = \sum_{i} P(X \mid \Psi_i)\, P(\Psi_i) \qquad (2\text{-}4)$$

Since $P(X)$ is just a normalizing constant, the computation of (2-3) or (2-4) is not needed in practical inference.
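A toy example shows why the normalizing constant can be skipped. The sketch below runs a Metropolis sampler over three hypothetical models; the acceptance ratio uses only the product of likelihood and prior, so $P(X)$ cancels. The model names, likelihood values, uniform proposal, and function names are all assumptions made for illustration and do not come from the dissertation.

```python
import random
from collections import Counter


def metropolis_frequencies(unnormalized, n_steps=50_000, seed=1):
    """Estimate posterior probabilities from visit frequencies of a Metropolis chain.

    unnormalized: dict mapping model -> likelihood * prior (no division by P(X)).
    """
    rng = random.Random(seed)
    states = list(unnormalized)
    current = states[0]
    visits = Counter()
    for _ in range(n_steps):
        proposal = rng.choice(states)     # symmetric (uniform) proposal
        # The ratio of unnormalized posteriors equals the ratio of true posteriors.
        ratio = unnormalized[proposal] / unnormalized[current]
        if rng.random() < ratio:
            current = proposal
        visits[current] += 1
    return {s: visits[s] / n_steps for s in states}


def exact_posterior(unnormalized):
    """Discrete Bayes' theorem: divide by the sum over all models, as in (2-4)."""
    evidence = sum(unnormalized.values())
    return {s: w / evidence for s, w in unnormalized.items()}


if __name__ == "__main__":
    likelihood = {"T1": 0.020, "T2": 0.010, "T3": 0.005}   # made-up values
    prior = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}        # flat prior
    weights = {t: likelihood[t] * prior[t] for t in likelihood}
    print(exact_posterior(weights))
    print(metropolis_frequencies(weights))
```

With three models the sum in (2-4) is trivial, but with the number of trees given by equation (1-1) it cannot be enumerated, which is why sampling methods that only need ratios are used in practice.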

The posterior probability distribution of the phylogenetic model can be written as

$$P(\Psi \mid X) = P(T_i, \tau, \theta \mid X) = \frac{P(X \mid T_i, \tau, \theta)\, P(T_i, \tau, \theta)}{\sum_{T_j} \int_{\tau} \int_{\theta} P(X \mid T_j, \tau, \theta)\, P(T_j, \tau, \theta)\, d\tau\, d\theta} \qquad (2\text{-}5)$$

This distribution is the current basis of Bayesian phylogenetic inference; useful information can be obtained from it. For example, the posterior probability of a phylogenetic tree $T_i$ can be computed as

$$P(T_i \mid X) = \int_{\tau} \int_{\theta} P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \qquad (2\text{-}6)$$

Similarly, the posterior probability of the $i$-th component of the parameter $\theta$ in the evolutionary model can be summarized by

$$P(\theta_i \mid X) = \sum_{T_j} \int_{\tau} \int_{\theta_{\setminus i}} P(T_j, \tau, \theta_i, \theta_{\setminus i} \mid X)\, d\tau\, d\theta_{\setminus i} \qquad (2\text{-}7)$$

Here, $\theta_i$ is the $i$-th component of the parameter $\theta$ and $\theta_{\setminus i}$ are the remaining components of the parameter $\theta$.
2.4.3 Components of Bayesian phylogenetic inference

A complete Bayesian phylogenetic inference consists of four major components:

(1) Formulating the phylogenetic model $P(X \mid T_i, \tau, \theta)$;

(2) Choosing a proper prior probability $P(T_i, \tau, \theta)$;

(3) Approximating the posterior probability distribution of phylogenetic models;

(4) Inferring characteristics from the posterior probability distribution.

We briefly describe the second component in this section; the other three components
will be described in the following sections.

2.4.4 Likelihood, prior and posterior probability

Bayes' theorem, as shown in (2-2), can be expressed informally in English as:


$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \tag{2-8}$$

This formula indicates that by observing some new evidence (i.e., the data $X$), our starting belief (i.e., the prior probability $P(\Psi)$) may be converted into a new belief (i.e., the posterior probability $P(\Psi \mid X)$). The prior probability and the posterior probability are connected through the likelihood, the probability with which the evidence can be observed under the model.

A phylogenetic model is a hypothesis about how the data evolved. Hypotheses cannot be observed directly, so both the prior and the posterior should be interpreted as degrees of confidence in a model rather than as frequencies [107].


A major concern in Bayesian analysis is how to choose the prior. The prior probability has the potential to incorporate existing knowledge about phylogenetic models into the current analysis, but it is also controversial, since choosing an appropriate prior distribution can be subjective. Two approaches are often used for choosing the prior probability: using a non-informative prior (or flat prior, which treats every hypothesis as equally probable), and using knowledge obtained from past experience. In Bayesian phylogenetic inference, the prior probability on phylogenetic models can be introduced as constraints to prune the parameter search space.
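To make the role of the prior concrete, the sketch below applies (2-2) to the same likelihoods under three different priors: a flat prior, an informative prior, and a prior used as a hard constraint that excludes one topology. All numbers are hypothetical and serve only to show how the posterior shifts.

```python
def posterior(likelihoods, priors):
    """Discrete form of (2-2): normalize likelihood times prior."""
    unnormalized = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

# Hypothetical likelihoods P(X | T_i) for three candidate trees.
likelihoods = [0.02, 0.05, 0.01]

flat_prior = [1 / 3, 1 / 3, 1 / 3]     # every tree equally probable
informative_prior = [0.7, 0.2, 0.1]    # assumed past experience favoring T_1
constraint_prior = [0.5, 0.5, 0.0]     # T_3 excluded from the search space

post_flat = posterior(likelihoods, flat_prior)
post_informative = posterior(likelihoods, informative_prior)
post_constrained = posterior(likelihoods, constraint_prior)
```

Under the flat prior the best-supported tree is the one with the highest likelihood, while the informative prior moves the preference to the first tree. A tree with zero prior probability receives zero posterior probability regardless of the data, which is how a prior acts as a pruning constraint.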

The posterior probability of a phylogenetic model (for example, a phylogenetic tree)
can be interpreted as the probability with which this model can be correctly estimated for
a set of random data simulated from this model. The accuracy of the posterior probability will be adversely affected by the use of an improper hypothesis [108].

2.4.5 Empirical and hierarchical Bayesian analysis

The comprehensive posterior distribution $P(T_i, \tau, \theta \mid X)$ requires knowledge of uncertain parameters that are not of interest in our current analysis (e.g., branch lengths or model parameters). In addition to directly exploring $P(T_i, \tau, \theta \mid X)$, two alternative approximations are used in practice to accommodate these uncertain parameters [109].

The first method is called empirical Bayesian analysis, which uses a point estimate to eliminate one of the integrals over $P(T_i, \tau, \theta \mid X)$. For example, we estimate the best-fit parameters $\theta^{*}$ and then approximate equation (2-6) as













$$P(T_i \mid X) = \int_{\tau} \int_{\theta} P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \approx \int_{\tau} P(T_i, \tau \mid X, \theta^{*})\, d\tau \tag{2-9}$$


The second method is called hierarchical Bayesian analysis, which takes the posterior probability of the phylogenetic tree as the integral over all possible combinations of branch lengths and model parameters. The hierarchical Bayesian analysis can be written as




$$P(T_i \mid X) = \frac{P(X \mid T_i)\, P(T_i)}{\sum_{T_j} P(X \mid T_j)\, P(T_j)} \tag{2-10}$$








$$P(X \mid T_i) = \int_{\tau} \int_{\theta} P(X \mid T_i, \tau, \theta)\, d\tau\, d\theta \tag{2-11}$$
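The difference between the two approaches can be seen in a toy setting. The sketch below assumes two trees and a single nuisance parameter $\theta$ on a grid, drops the branch lengths $\tau$ for brevity, and uses a flat prior on $\theta$; the `likelihood` function and the rule used to pick $\theta^{*}$ are hypothetical choices made only for this example.

```python
from math import exp

thetas = [0.1 * k for k in range(1, 51)]   # grid over the nuisance parameter
tree_prior = [0.5, 0.5]                    # P(T_1), P(T_2)

def likelihood(i, theta):
    """Hypothetical P(X | T_i, theta): sharp peak for T_1, broad peak for T_2."""
    if i == 0:
        return exp(-((theta - 1.0) ** 2) / (2 * 0.1 ** 2))
    return 0.6 * exp(-((theta - 2.0) ** 2) / (2 * 1.0 ** 2))

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def fit(theta):
    """Assumed best-fit criterion: prior-weighted likelihood summed over trees."""
    return sum(tree_prior[i] * likelihood(i, theta) for i in range(2))

# Empirical Bayes, in the spirit of (2-9): plug in the point estimate theta*.
theta_star = max(thetas, key=fit)
empirical = normalize(
    [likelihood(i, theta_star) * tree_prior[i] for i in range(2)]
)

# Hierarchical Bayes, (2-10) and (2-11): average the likelihood over the grid
# (a Riemann sum under a flat prior on theta), then apply Bayes' rule over trees.
marginal = [
    sum(likelihood(i, t) for t in thetas) / len(thetas) for i in range(2)
]
hierarchical = normalize([marginal[i] * tree_prior[i] for i in range(2)])
```

With these made-up likelihoods the point estimate lands on the sharp peak and the empirical analysis favors the first tree, whereas integrating over $\theta$ favors the second tree, whose likelihood is lower at its peak but spread over a much wider range. The example is meant to show that the two approximations can disagree, not which one is preferable in general.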

2.5 Models of molecular evolution