Technical Prospectus


Oct 1, 2013


Species Gene Prediction with PhyloHMMs

One of the major disciplines in the field of biology is genetics, the science of genes and
heredity. A distinct discipline within genetics is genomics, the study of the genomes of
organisms. Scientists have made a great deal of progress in genetic mapping, due in large part
to government sponsorship of the Human Genome Project. While this project has been quite
successful in mapping the human genome, there is still a significant amount of work to be done
in this field. Computational prediction of protein-coding gene structure is an important method
being employed by researchers. These prediction methods are never 100% sensitive or 100%
specific, because their performance varies across different input data. One way to improve these
measures is to use consensus-generating methods that combine multiple individual predictions
into a single prediction. Another important concept is that gene structure is evolutionarily
conserved across closely related species. Using this knowledge, I propose to implement a novel
consensus-generating method to integrate a mixture of gene predictions. I will do this by
developing a PhyloHMM finite state transducer framework to model gene structure alphabets at
ancestral nodes in a phylogenetic tree. I will implement algorithms for model estimation and for
finding the optimal consensus gene structure using dynamic programming. I also plan to use an
open source software package called xRate, which is an interpreter for phylo-grammars. I will
evaluate my results based on their fit with respect to existing experimental data.
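
As a point of reference for what "consensus-generating" means, the following is a minimal
Python sketch of the simplest possible combiner: a weighted per-position vote over several
predicted label strings. It is a baseline for illustration only and is not the PhyloHMM method
proposed here. The function name majority_consensus and the one-letter labels (N for intergenic,
E for exon, I for intron) are assumptions made for this example.

```python
from collections import Counter


def majority_consensus(predictions, weights=None):
    """Combine equal-length label strings by a weighted per-position vote.

    Hypothetical helper: each prediction is a string with one label per
    nucleotide position, e.g. "NNEEEINN".
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same positions")
    if weights is None:
        weights = [1.0] * len(predictions)

    consensus = []
    for i in range(length):
        votes = Counter()
        for pred, weight in zip(predictions, weights):
            votes[pred[i]] += weight
        # Ties go to the label that sorts first, so the result is deterministic.
        best = max(sorted(votes), key=lambda label: votes[label])
        consensus.append(best)
    return "".join(consensus)


if __name__ == "__main__":
    preds = ["NNEEEINN", "NEEEEINN", "NNEEIINN"]
    print(majority_consensus(preds))
```

A per-position vote ignores both the grammar of gene structure (for example, which labels may
follow which) and the relationships between species, which is the gap the proposed framework is
meant to address.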

Gene Prediction

I will first explain some background on gene prediction. Biologists have made excellent progress
in genome annotation since drafts of the human genome were first analyzed. The aim of genome
annotation is to determine the biochemical function of nucleotides in the genome (Brent, 2008).
Gene prediction models can be improved by integrating multiple sources of gene evidence. One
program that implements this idea is Evigan, which is an automated gene annotation program
(Liu, Mackey, Roos, Pereira, 2008). My project will take several concepts from Evigan,
including the use of a hidden Markov model (HMM). An HMM incorporates ideas from probability and
computer theory, specifically in its resemblance to a finite state transducer. A finite state
transducer has an input tape, an output tape, a symbol alphabet, and a set of transitions
(Bradley, Holmes, 2007). An HMM is similar, but it also has a set of “hidden” states that are
not observed directly; they are inferred from the observed symbols using the model's transition
and emission probabilities. My project will make use of phyloHMMs, which combine the ideas of
HMMs and phylogenetics. Phylogenetics studies the evolutionary history of lineages of organisms.
One of the big questions for biologists is how to reconstruct that history (Cracraft, 1974).
PhyloHMMs are often used to accomplish this. They are probabilistic models that consider the way
substitutions occur through evolutionary history at each site of a genome and the way this
process changes from one site to the next (Siepel, Haussler, 2005). They work as two combined
Markov processes: one that operates along a genome and one that operates along the branches of a
phylogenetic tree. These models are essential for creating accurate gene predictions in my
project.
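
To make the idea of inferring hidden states concrete, here is a minimal Python sketch of Viterbi
decoding, the standard dynamic programming algorithm for finding the most probable hidden state
path of an HMM. The three states (N for intergenic, E for exon, I for intron) and every
probability below are made-up illustration values; they are not parameters from Evigan or from
this project.

```python
import math

# Hypothetical toy model: all numbers are illustrative assumptions.
STATES = ("N", "E", "I")
START = {"N": 0.8, "E": 0.1, "I": 0.1}
TRANS = {
    "N": {"N": 0.90, "E": 0.10, "I": 0.00},  # intergenic never jumps to intron
    "E": {"N": 0.05, "E": 0.85, "I": 0.10},
    "I": {"N": 0.00, "E": 0.10, "I": 0.90},  # intron must return to exon
}
EMIT = {
    "N": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "E": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "I": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def _log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for seq as a string of labels."""
    if not seq:
        return ""
    # score[t][s]: best log probability of any path ending in state s at t.
    score = [{s: _log(start[s]) + _log(emit[s][seq[0]]) for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at position t
    for sym in seq[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + _log(trans[r][s]))
            cur[s] = prev[best_prev] + _log(trans[best_prev][s]) + _log(emit[s][sym])
            ptr[s] = best_prev
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCATATAT"))
```

Working in log space avoids numerical underflow on long sequences, and the zero entries in the
transition table show how a model can forbid structurally impossible label orders.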

xRate Software

My project will use the xRate software package to create a phylo-grammar that will implement a
hierarchical dynamic Bayesian network. The phylo-grammar will use different phylogenies to
enforce correlations in gene structure along the phylogenetic tree. I will use the DNA alphabet
described by Evigan (Liu et al., 2008) and a transition matrix that will provide correlation
between exon-intron states. The intellectual merit of this project stems from my ability to
model complex algorithms and create detailed grammars to generate the desired results. These
ideas are heavily based in computer theory and computational science.
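
The way a transition matrix can correlate gene structure labels along a tree can be sketched
with Felsenstein's pruning algorithm, the standard recursion for computing likelihoods on a
phylogeny. The Python example below is not xRate grammar syntax and is not the model proposed
here. It assumes a two-letter alphabet (E for exon, I for intron), one shared and purely
hypothetical transition matrix for every branch, and a tree written as nested tuples with made-up
species names.

```python
LABELS = ("E", "I")

# Hypothetical values: BRANCH[parent][child] is the probability that a
# child node carries a given label when its parent carries another.
BRANCH = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
ROOT_PRIOR = {"E": 0.5, "I": 0.5}


def partial(node, leaf_labels):
    """Return {label: P(observed leaves below node | node has that label)}."""
    if isinstance(node, str):  # a leaf is identified by its species name
        return {x: 1.0 if leaf_labels[node] == x else 0.0 for x in LABELS}
    result = {x: 1.0 for x in LABELS}
    for child in node:  # an internal node is a tuple of child subtrees
        child_partial = partial(child, leaf_labels)
        for x in LABELS:
            result[x] *= sum(BRANCH[x][y] * child_partial[y] for y in LABELS)
    return result


def column_likelihood(tree, leaf_labels):
    """Probability of the labels observed at the leaves for one position."""
    p = partial(tree, leaf_labels)
    return sum(ROOT_PRIOR[x] * p[x] for x in LABELS)


def root_posterior(tree, leaf_labels):
    """Posterior distribution over the label at the ancestral root node."""
    p = partial(tree, leaf_labels)
    total = sum(ROOT_PRIOR[x] * p[x] for x in LABELS)
    return {x: ROOT_PRIOR[x] * p[x] / total for x in LABELS}


if __name__ == "__main__":
    tree = (("human", "chimp"), "mouse")
    labels = {"human": "E", "chimp": "E", "mouse": "I"}
    print(column_likelihood(tree, labels))
    print(root_posterior(tree, labels))
```

In a phyloHMM, a per-position tree likelihood of this kind plays the role of the emission
probability, while a separate Markov chain along the genome links neighboring positions.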

Results and Impact

With this project, I hope to combine my understanding of computer science and programming
with research in the fields of genomics, computational probability, and bioinformatics. I will
perform this project with the intent of generating better gene predictions by combining
existing predictions with probabilistic models. Hopefully, the results will show that combining
phylogenies and HMMs produces better gene prediction results with higher sensitivities and
specificities. It is possible that problems could occur in the implementation of this project
that lead to unforeseen outcomes. In this case, the issues that arise will be well documented
and will still provide insight for future projects in this area.
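
One way sensitivity and specificity could be measured at the nucleotide level is sketched below
in Python. This sketch assumes the convention often used in gene prediction evaluations, where
sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the general statistical
definition of specificity, TN / (TN + FP), is different. The function name and the label
strings are hypothetical.

```python
def nucleotide_accuracy(predicted, reference, coding="E"):
    """Return (sensitivity, specificity) for one label, position by position.

    Assumed convention: sensitivity = TP / (TP + FN) and
    specificity = TP / (TP + FP), as is common in gene prediction work.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    tp = fp = fn = 0
    for p, r in zip(predicted, reference):
        if p == coding and r == coding:
            tp += 1  # coding position correctly predicted
        elif p == coding:
            fp += 1  # predicted coding, reference says otherwise
        elif r == coding:
            fn += 1  # coding position that was missed
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    print(nucleotide_accuracy("NEEEENNN", "NNEEEENN"))
```

Comparing these two numbers for each individual predictor and for the combined prediction, on
the same reference annotation, would show whether the combination actually improves on its
inputs.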

Works Cited

Bradley, R.K., & Holmes, I. (2007). Transducers: an emerging probabilistic framework for
modeling indels on trees. Bioinformatics, 23, 3258.

Brent, M.R. (2008). Steady progress and recent breakthroughs in the accuracy of automated
genome annotation. Nature Reviews Genetics, 9, 62.

Cracraft, J. (1974). Phylogenetic Models and Classification. Systematic Zoology, 23.

Makarov, V. (2002). Computer Programs for Eukaryotic Gene Prediction. Briefings in
Bioinformatics, 3, 195.

Meyer, I.M., & Durbin, R. (2002). Comparative Ab Initio Prediction of Gene Structures Using
Pair HMMs. Bioinformatics, 18(10), 1309.

Mossel, E., & Roch, S. (2006). Learning Nonsingular Phylogenies and Hidden Markov Models.
The Annals of Applied Probability, 16, 583.

Liu, Q., Mackey, A., Roos, D.S., & Pereira, F.C.N. (2008). Evigan: a hidden variable model
integrating gene evidence for eukaryotic gene prediction. 597.

Siepel, A., & Haussler, D. (2005). Phylogenetic Hidden Markov Models. Statistics for
Biology and Health, 3, 325.