Validation of
A
ccuracy of the
H
yper

P
arameters
E
stimation
Getting
from
MTML

msBayes
XIAOOU XIE
, MIKE HICKERSON
Biology Department
Q
ueens College
65
–
30 Kissena Blvd.
Flushing,
NY, 11367, USA
Fall
2011
Email: xiaoouxie@gmail.com
ABSTRACT
The
divergence time of bird
species
has long
be
en
argued
for a long time
. Using the set
of
COI genes in
mtDNA
sequences
of birds
, we
inferr
ed
the
two general
time
of speciation,
either
about 2 million years ago or 20,000 years ago.
To
validat
e
the
conclusion
we got from
MTML

msBayes, a Linux

based software package implemented by command lines,
we use
d
a
series of command lines, along with certain R script,
to estimate hyper

parameters (
Ψ
,
Ω
, etc.)
given P
seudo

Observed Data Set (PODS).
Then we verified the accuracy of estimation by RMSPE
and Bayes Factor, which proved that MTML

msBayes estimated DNA data well.
Introduction
Bayesian statistics, as a popular tool
in scientific
inference
especially in genetics and
genomics
, has been contributing a lot of strong support in solving evolution problems.
In our
program, we used ABC (
Approximate Bayesian computation
) [1] to analyze our data
, building
massive models, fitting them, validat
ing
the
m and improv
ing
these models [2].
As a tool based on Bayesian Theory,
MTML

msBayes
offer
s
a set of functions to support
complex and flexible comparative phylogeographic inference.
It uses
hierarchical approximate
Bayesian computation (HABC) to estimate hyp
er

parameters given DNA sequence data
[
3
]
.
Hence, w
e use
d
msBayes to
build
our
estim
ate
d dataset
s
.
To more accurately estimate data, we dividing the simulation process into two parts
,
combining
msBayes rejection command line and certain R script, which
is
widely used in
bioinformatics. In R script, we import
ed
R abc package
[4]
. This package offers a number of
functions based on ABC algorithm, to implement further estimation on our considerable data.
A
valuable function in R abc package is the main functio
n, abc function, in which there are a
variety of estimating method. We tested and found the best methods are
loclinear
[5
,6
] and
neural network (neuralnet)
[
7
].
Accuracy of that how well MTML

msBayes and R functions estimates these hyper

parameters is impo
rtant and often difficult to validate.
Root mean square posterior error
(RMSPE)
[
8
]
and Bayes Factor
(BF)
[
9
]
are two
most useful
criteria that we used for validation.
Implementation
Generally
,
we did 100 times of the entire process, and analyze these 100 data sets by
criteria of RMSPE and BF.
The result of validation would show how well the estimation works.
Due to the massive time we need to finish these 100 times, it is necessary to remotely
run the
codes using SH script.
There are several steps in the SH script. First, we simulated a PODS one

line file, then
executed msBayes Rejection with PODS and 3,000,000 simulations we obtained before, to get
rough estimation, and remotely ran R script
to input all necessary files to R environment.
From
3,000,000 to 900, this was how the simulations were selected throughout the whole process.
R
script would output the final results, including RMSPE calculation and BF calculation results.
1
Obtain PODS from MTML

msBayes
PODS are simulated from
the batch file (which we made in the first part of this
program),
, by choos
ing
a
random
number
from the hyper

prior of Ψ, where Ψ ranges
discrete
ly
from 1 to Y
(Y is the number of taxon pairs).
The fol
lowing msBayes command line was
executed to get a PODS without heading:
perl msbayes.pl

r 1

o
BirdCOIPODS_
Obs

c batch.BirdCOI_II
tail

1 BirdCOIObs > BirdCOIObsTAIL
2
Implement the first part of estimation by msBayes Rejection command line.
By comparin
g
the observed summary statistics
file (PODS)
and
3,000,000
simulated
summary statistics, msBayes implemented
its rejection step, accepting the closest
t
of the
number of simulations, where
t
refers to the tolerance of acceptance. Here we set
t
=0.0003, in
order to get 9,000 rough accepted simulations.
3 Run R script for further estimation
With R
abc
package and
all data files (as matrices) imported, we executed abc function,
setting
t
to 0.1, in order to get 900 adjusted simulations, which
wa
s
the final simulation.
We achieved the final 900 points through abc function by the following R code:
Z1noLL5 <

try(abc(OBS[1,c(148:360,432:502)], POSTVEC[,2], POSTVEC[,c(148:360,432:502)],
tol=0.1,method="loclinear"),T)
if(class(Z1noLL5) == "try

error")
{TDnoLLRmode5 <

NA
rmse5<

NA} else { TDnoLLRmode5 <

loc1stats(Z1noLL5$adj.values, prob=0.95)[1]
rmse5<

rmsevalues(Z1true1,as.numeric(as.vector(Z1noLL5$adj.values)))}
Next,
the
RMSPE
was calculated
by the following formula:
√
∑
Here x
i
ref
ers to any of the hyper

parameters of the 900 accepted estimations, and x
0
refers to the true value of this parameter in PODS.
RMSPE could be implemented through this following R code:
rmsevalues<

function(true, values){
SumSqure<

0
for (i in 1:900){
SumSqu
re<

SumSqure+(values[i]

true)^2
}
return (sqrt(SumSqure/900))
}
The second criterion is Bayes Factor, which was calculated after all 100 loops were done
and output. With files recording 100 lines imported, each of which contained 900 accepted
estimated parameter values and their corresponding true value, BF could be calculated by the
following formula:
Here M1 refers to the thresh
old
of hyper

parameter, for example,
Ψ
=1.
M2 refers to all possible
conditions other than M1. In this case,
, while
.
We validate the final estimations by that how many BF is greater than 10 or 3, given
Ψ
>1.5. This algorithm was executed
in
the following code:
BFWrongCal<

functio
n (bfmatrix, cutoff
) {
bfwrong<

NA
for (i in 1:100){
if (bfmatrix[i,1]<=cutoff)
next
postm1<

length(which(bfmatrix[i,c(2:901)]<cutoff))/900
if((postm1/(1

postm1))>(10/17))
bfwrong[i]<

1 else
bfwrong[i]<

0}
prob_wrong<

length(which(bfwrong==1))/
(length(which(bfwrong==1))+
length(which(bfwrong==0)))
return (prob_wrong)
}
2A
Figure.2 Histogram of RMSPE of Ψ and Ω. error
of Ω in 2B performs better than
error of Ψ in 2A.
2B
Results
The final result
consists of
direct plotting
, RMSPE and Bayes Factor
validation
. Figure 1
shows the
plot of True
Ω
value and mode of
Ω
, in the 100 loops
, using loclinear
and neuralnet
methods
.
Figure.1 Plot of True

Estimated Ω. 1A and 1B both showed good regression
relationship between true Ω and mode Ω. Plotting came from the data set of HUIB.
1A
1B
Figure 2
shows the RMSPE of
Ω
and
Ψ
,
getting from neuralnet method in abc function.
We can see the error of
Ω
is very small, while the error of
Ψ
is 2 out of 18 (the number of taxon
pairs of HUIB
).
Table 1 shows Bayes Factors of the data set of HUIB.
The table shows different BF values
and the probabilities of estimated parameter smaller than thresh
old
, given true parameters
greater than thresh
old
(which is 1.5 for
Ψ
and 0.01 for Ω).
In the table, NN refers to neural
network method, and LL refers to local linear method in abc function.
Table 1 probability of Bayes Factors wrong and right.
prob
BF
Wrong
prob
BF
Right
probEst_True
NN
Ψ
LL Ω
NN Ω
NN Ψ
LL Ω
NN Ω
NN Ψ
LL Ω
NN Ω
〮0
〮0
〮0
〮0㐶2
〮0〲2
ㄮ1
〮0
〮0
〮0
C潮c汵獩潮
䝥G敲慬汹ⰠtU攠c潭b楮ati潮 of 浳m慹敳敪散e楯n 慮d 删慢c p慣a慧攠f潲 ABC U祰yr

p慲慭et敲 敳e業it楯n pe牦潲浳mw敬e 潮潮獩d敲ab汥l慭潵湴 潦業o污t楯湳⸠.獂慹敳敪散ei潮
f楲獴汹 敳t業it
e
猠r潵杨 su浭慲礠獴慴楳i楣猬nd p慳獥猠楴 t漠慢c funct楯n f潲 獥捯sd汹
敳t業慴i潮
⸠
B礠慤橵獴in朠p牯p敲 t潬or慮c攠楮 b潴U 浳m慹敳敪散e楯渠ind abc 牥橥cti
潮H 楴 楳 not d楦f楣ult f潲
u獥牳st漠f楮d b敳e 牥獵lt献
周牯u杨 tU攠牥su汴猬 b礠y敮敲
at楮朠tw漠c物t敲楡i潦oB䘠慮d RM卐䔬 w攠c慮 s敥 b潴o of
tU敭 獵pp潲t tUat tU楳ic潭b楮at楯n 敳t業慴敳e
Ω
better than
Ψ
. Meanwhile, Figure 1 suggests
that there is a good regression between true and mode
Ω
, which supports the congruence
above as well.
However
, comparing BF and RMSPE,
users can see some difference of them. In BF
calculation,
the probability of BF wrong doesn’t show any difference among the three
parameters / meth
ods.
The
probability of BF
right only shows a small difference. On the other
hand, RMSPE clearly supports that the estimation of Ω is better than
Ψ
, which completely
conforms to the result of direct
plots
.
References
1.
J.S. Lopes
and
M.A. Beaumont
.
(2010)
ABC: A useful Bayesian
tool for the analysis of
population data
.
Infection, Genetics and Evolution
10 (6
),
825
–
832.
2. Katalin Csilléry
.
et al.
(2010)
Approximate Bayesian Computation (ABC) in practice.
Trends in
Ecology & Evolution
25 (7), 410
–
418
.
3.
Michael J Hickerson
.
et
al.
(2007)
msBayes:
Pipeline for testing comparative phylogeographic
histories using hierarchical approximate Bayesian computation
.
BMC Bioinformatics
8, 268
.
4.
Csillery K, Francois O, Blum MGB (2011)
abc: an R package for Approximate Bayesian
Computatio
n
(ABC)
.
Methods Ecol Evol
(in press)
.
5.
Kevin R. Thornton. (2009) approximate Bayesian computation by local linear regression
. BMC
Genetics
10:35
.
6
. Michael G.B. Blum and Olivier François.
(2010
)
Non

linear regression models for Approximate
Bayesian Computation
.
Statistics and Computing
(in press).
7
.
Michael D. Richard and Richard P. Lippmann
.
(1991)
Neural Network Classifiers Estimate
Bayesian a posteriori Probabilities
.
Neural Computation
3
(
4
)
,
461
–
483
.
8.
We
n Huang.
et al.
(2011)
MTML

msBayes: Approximate Bayesian comparative
phylogeographic inference from multiple taxa and
multiple loci with rate heterogeneity
.
BMC
Bioinformatics
12:1
.
9.
JO Berger and L. Pericchi, The intrinsic Bayes factor f
or model selection and
prediction. The
Journal of the American Statistical Association 91 (1996), pp. 109
–
122
.
Acknowledgments
I
deeply
thank for Mike’
s instructive advice and
useful suggestion
s
on my research,
as well as
his patient and
detailed
explanation on
my questions
throughout
this semester. I am deeply
grateful of his
guidance
into the gateway of biostatistics.
I am also indebted to Terry, JT, and all the other partners in the lab.
Finally, t
his
report
has gained a lot
of
technical support
from
EGGNOG, our hardworking
computer, which has been working so hard and never had a break for the whole semester.
Thanks for the work it has done and is doing.
Comments 0
Log in to post a comment