Download Xie's research report - Qcpages.qc.edu

fabulousgalaxyBiotechnology

Oct 1, 2013 (4 years and 13 days ago)

90 views



Validation of
A
ccuracy of the
H
yper
-
P
arameters

E
stimation

Getting

from

MTML
-
msBayes



XIAOOU XIE
, MIKE HICKERSON



Biology Department

Q
ueens College

65


30 Kissena Blvd.

Flushing,
NY, 11367, USA


Fall

2011

Email: xiaoouxie@gmail.com







ABSTRACT



The
divergence time of bird

species
has long
be
en

argued

for a long time
. Using the set
of
COI genes in
mtDNA

sequences

of birds
, we
inferr
ed

the
two general

time

of speciation,
either
about 2 million years ago or 20,000 years ago.

To
validat
e
the

conclusion

we got from
MTML
-
msBayes, a Linux
-
based software package implemented by command lines,
we use
d

a
series of command lines, along with certain R script,
to estimate hyper
-
parameters (
Ψ
,
Ω
, etc.)
given P
seudo
-
Observed Data Set (PODS).
Then we verified the accuracy of estimation by RMSPE
and Bayes Factor, which proved that MTML
-
msBayes estimated DNA data well.


Introduction

Bayesian statistics, as a popular tool

in scientific

inference

especially in genetics and
genomics
, has been contributing a lot of strong support in solving evolution problems.
In our
program, we used ABC (
Approximate Bayesian computation
) [1] to analyze our data
, building
massive models, fitting them, validat
ing

the
m and improv
ing

these models [2].

As a tool based on Bayesian Theory,
MTML
-
msBayes
offer
s

a set of functions to support

complex and flexible comparative phylogeographic inference.
It uses
hierarchical approximate
Bayesian computation (HABC) to estimate hyp
er
-
parameters given DNA sequence data

[
3
]
.

Hence, w
e use
d

msBayes to
build
our
estim
ate
d dataset
s
.


To more accurately estimate data, we dividing the simulation process into two parts
,
combining

msBayes rejection command line and certain R script, which
is

widely used in
bioinformatics. In R script, we import
ed

R abc package

[4]
. This package offers a number of
functions based on ABC algorithm, to implement further estimation on our considerable data.

A
valuable function in R abc package is the main functio
n, abc function, in which there are a
variety of estimating method. We tested and found the best methods are
loclinear

[5
,6
] and
neural network (neuralnet)

[
7
].

Accuracy of that how well MTML
-
msBayes and R functions estimates these hyper
-
parameters is impo
rtant and often difficult to validate.
Root mean square posterior error
(RMSPE)

[
8
]

and Bayes Factor

(BF)

[
9
]
are two
most useful
criteria that we used for validation.





Implementation

Generally
,

we did 100 times of the entire process, and analyze these 100 data sets by
criteria of RMSPE and BF.

The result of validation would show how well the estimation works.

Due to the massive time we need to finish these 100 times, it is necessary to remotely

run the
codes using SH script.

There are several steps in the SH script. First, we simulated a PODS one
-
line file, then
executed msBayes Rejection with PODS and 3,000,000 simulations we obtained before, to get
rough estimation, and remotely ran R script
to input all necessary files to R environment.
From
3,000,000 to 900, this was how the simulations were selected throughout the whole process.
R
script would output the final results, including RMSPE calculation and BF calculation results.


1
Obtain PODS from MTML
-
msBayes

PODS are simulated from

the batch file (which we made in the first part of this
program),
, by choos
ing
a
random

number

from the hyper
-
prior of Ψ, where Ψ ranges

discrete
ly
from 1 to Y
(Y is the number of taxon pairs).

The fol
lowing msBayes command line was
executed to get a PODS without heading:

perl msbayes.pl
-
r 1
-
o
BirdCOIPODS_
Obs
-
c batch.BirdCOI_II

tail
-
1 BirdCOIObs > BirdCOIObsTAIL

2
Implement the first part of estimation by msBayes Rejection command line.

By comparin
g

the observed summary statistics

file (PODS)
and
3,000,000
simulated
summary statistics, msBayes implemented
its rejection step, accepting the closest
t

of the
number of simulations, where
t

refers to the tolerance of acceptance. Here we set
t

=0.0003, in
order to get 9,000 rough accepted simulations.

3 Run R script for further estimation

With R
abc

package and

all data files (as matrices) imported, we executed abc function,
setting
t

to 0.1, in order to get 900 adjusted simulations, which
wa
s
the final simulation.

We achieved the final 900 points through abc function by the following R code:


Z1noLL5 <
-

try(abc(OBS[1,c(148:360,432:502)], POSTVEC[,2], POSTVEC[,c(148:360,432:502)],
tol=0.1,method="loclinear"),T)

if(class(Z1noLL5) == "try
-
error")

{TDnoLLRmode5 <
-

NA

rmse5<
-
NA} else { TDnoLLRmode5 <
-

loc1stats(Z1noLL5$adj.values, prob=0.95)[1]

rmse5<
-
rmsevalues(Z1true1,as.numeric(as.vector(Z1noLL5$adj.values)))}

Next,
the

RMSPE

was calculated

by the following formula:

















Here x
i

ref
ers to any of the hyper
-
parameters of the 900 accepted estimations, and x
0

refers to the true value of this parameter in PODS.

RMSPE could be implemented through this following R code:

rmsevalues<
-
function(true, values){

SumSqure<
-
0

for (i in 1:900){

SumSqu
re<
-
SumSqure+(values[i]
-
true)^2
}

return (sqrt(SumSqure/900))

}


The second criterion is Bayes Factor, which was calculated after all 100 loops were done
and output. With files recording 100 lines imported, each of which contained 900 accepted
estimated parameter values and their corresponding true value, BF could be calculated by the
following formula:

























Here M1 refers to the thresh
old

of hyper
-
parameter, for example,
Ψ
=1.

M2 refers to all possible
conditions other than M1. In this case,















, while











.

We validate the final estimations by that how many BF is greater than 10 or 3, given
Ψ
>1.5. This algorithm was executed
in

the following code:

BFWrongCal<
-
functio
n (bfmatrix, cutoff
) {

bfwrong<
-
NA

for (i in 1:100){

if (bfmatrix[i,1]<=cutoff)

next

postm1<
-
length(which(bfmatrix[i,c(2:901)]<cutoff))/900

if((postm1/(1
-
postm1))>(10/17))

bfwrong[i]<
-
1 else

bfwrong[i]<
-
0}

prob_wrong<
-
length(which(bfwrong==1))/
(length(which(bfwrong==1))+
length(which(bfwrong==0)))

return (prob_wrong)
}

2A

Figure.2 Histogram of RMSPE of Ψ and Ω. error

of Ω in 2B performs better than
error of Ψ in 2A.

2B

Results

The final result
consists of
direct plotting
, RMSPE and Bayes Factor

validation
. Figure 1
shows the

plot of True
Ω

value and mode of
Ω
, in the 100 loops
, using loclinear
and neuralnet
methods
.




















Figure.1 Plot of True
-
Estimated Ω. 1A and 1B both showed good regression
relationship between true Ω and mode Ω. Plotting came from the data set of HUIB.

1A

1B

Figure 2
shows the RMSPE of
Ω

and
Ψ
,

getting from neuralnet method in abc function.

We can see the error of

Ω

is very small, while the error of
Ψ

is 2 out of 18 (the number of taxon
pairs of HUIB
).

Table 1 shows Bayes Factors of the data set of HUIB.
The table shows different BF values
and the probabilities of estimated parameter smaller than thresh
old
, given true parameters
greater than thresh
old

(which is 1.5 for
Ψ

and 0.01 for Ω).

In the table, NN refers to neural
network method, and LL refers to local linear method in abc function.


Table 1 probability of Bayes Factors wrong and right.

prob
BF
Wrong

prob
BF
Right

probEst_True

NN

Ψ

LL Ω

NN Ω

NN Ψ

LL Ω

NN Ω

NN Ψ

LL Ω

NN Ω

〮0

〮0

〮0

〮0㐶2

〮0〲2

ㄮ1

〮0

〮0

〮0



C潮c汵獩潮

䝥G敲慬汹ⰠtU攠c潭b楮ati潮 of 浳m慹敳⁲敪散e楯n 慮d 删慢c p慣a慧攠f潲 ABC U祰yr
-
p慲慭et敲 敳e業it楯n pe牦潲浳mw敬e 潮⁣潮獩d敲ab汥l慭潵湴 潦⁳業o污t楯湳⸠.獂慹敳⁲敪散ei潮
f楲獴汹 敳t業it
e
猠r潵杨 su浭慲礠獴慴楳i楣猬⁡nd p慳獥猠楴 t漠慢c funct楯n f潲 獥捯sd汹
敳t業慴i潮

B礠慤橵獴in朠p牯p敲 t潬or慮c攠楮 b潴U 浳m慹敳⁲敪散e楯渠ind abc 牥橥cti
潮H 楴 楳 not d楦f楣ult f潲
u獥牳st漠f楮d b敳e 牥獵lt献

周牯u杨 tU攠牥su汴猬 b礠y敮敲
at楮朠tw漠c物t敲楡i潦oB䘠慮d RM卐䔬 w攠c慮 s敥 b潴o of
tU敭 獵pp潲t tUat tU楳ic潭b楮at楯n 敳t業慴敳e
Ω

better than
Ψ
. Meanwhile, Figure 1 suggests
that there is a good regression between true and mode
Ω
, which supports the congruence
above as well.

However
, comparing BF and RMSPE,

users can see some difference of them. In BF
calculation,
the probability of BF wrong doesn’t show any difference among the three
parameters / meth
ods.
The

probability of BF
right only shows a small difference. On the other
hand, RMSPE clearly supports that the estimation of Ω is better than
Ψ
, which completely
conforms to the result of direct
plots
.




References

1.

J.S. Lopes

and
M.A. Beaumont
.
(2010)
ABC: A useful Bayesian
tool for the analysis of
population data
.
Infection, Genetics and Evolution

10 (6
),

825

832.

2. Katalin Csilléry
.
et al.

(2010)
Approximate Bayesian Computation (ABC) in practice.
Trends in
Ecology & Evolution

25 (7), 410

418
.

3.

Michael J Hickerson
.
et

al.

(2007)

msBayes:
Pipeline for testing comparative phylogeographic
histories using hierarchical approximate Bayesian computation
.

BMC Bioinformatics

8, 268
.

4.

Csillery K, Francois O, Blum MGB (2011)

abc: an R package for Approximate Bayesian
Computatio
n

(ABC)
.
Methods Ecol Evol

(in press)
.

5.
Kevin R. Thornton. (2009) approximate Bayesian computation by local linear regression
. BMC
Genetics

10:35
.

6
. Michael G.B. Blum and Olivier François.

(2010
)

Non
-
linear regression models for Approximate
Bayesian Computation
.

Statistics and Computing

(in press).

7
.

Michael D. Richard and Richard P. Lippmann
.
(1991)
Neural Network Classifiers Estimate
Bayesian a posteriori Probabilities
.
Neural Computation

3
(
4
)
,

461

483
.

8.
We
n Huang.
et al.

(2011)
MTML
-
msBayes: Approximate Bayesian comparative

phylogeographic inference from multiple taxa and

multiple loci with rate heterogeneity
.

BMC
Bioinformatics

12:1
.

9.
JO Berger and L. Pericchi, The intrinsic Bayes factor f
or model selection and

prediction. The
Journal of the American Statistical Association 91 (1996), pp. 109

122
.




Acknowledgments


I
deeply

thank for Mike’
s instructive advice and

useful suggestion
s

on my research,
as well as
his patient and
detailed
explanation on
my questions
throughout
this semester. I am deeply
grateful of his
guidance

into the gateway of biostatistics.

I am also indebted to Terry, JT, and all the other partners in the lab.

Finally, t
his
report

has gained a lot

of

technical support

from

EGGNOG, our hardworking
computer, which has been working so hard and never had a break for the whole semester.

Thanks for the work it has done and is doing.