Hidden Markov Models in Bioinformatics

powerfultennesseeΒιοτεχνολογία

2 Οκτ 2013 (πριν από 3 χρόνια και 10 μήνες)

88 εμφανίσεις

Hidden Markov Models
in Bioinformatics

Example Domains: Gene Finding
&

Protein Family Modeling



5 Second Overview


Today’s goal: Introduce HMMs as general
tools in bioinformatics



I will use the problem of
Gene Finding

as an
example of an “ideal” HMM problem domain


I will use the problem of
Protein Family
Modeling

as an example of a clever way to
fit HMMs to a problem

Learning Objectives


When I’m done you should know:


1.
When is an HMM a good fit for a problem space?

2.
What materials are needed before work can
begin with an HMM?

3.
What are the advantages and disadvantages of
using HMMs?

Outline


HMMs as Statistical Models


The example tasks at a glance


Good problems for HMMs


HMM Advantages


HMM Disadvantages


Gene Finding Examples

Statistical Models


Definition:


Any mathematical construct that attempts to parameterize
a random process


Example: A normal distribution


Assumptions


Parameters


Estimation


Usage


HMMs are just a little more complicated…

HMM Assumptions


Observations are ordered


Random process can be represented by a
stochastic finite state machine with emitting states.

HMM Parameters


Using weather example


Modeling daily weather
for a year


Ra Ra Su Su Su Ra..



Lots of parameters


One for each table entry


Represented in two
tables.


One for emissions


One for transitions

HMM Estimation



Called training, it falls under machine learning


Feed an architecture (given in advance) a set
of observation sequences


The training process will iteratively alter its
parameters to fit the training set


The trained model will assign the training
sequences high probability

HMM Usage


Two major tasks


Evaluate the probability of an observation
sequence given the model (Forward)


Find the most likely path through the model
for a given observation sequence (Viterbi)

Gene Finding

(An Ideal HMM Domain)


Our Objective:


To find the coding and non
-
coding regions of an
unlabeled string of DNA nucleotides


Our Motivation:


Assist in the annotation of genomic data produced
by genome sequencing methods


Gain insight into the mechanisms involved in
transcription, splicing and other processes

Gene Finding Terminology


A string of DNA nucleotides containing a gene
will have separate regions (lines):


Introns


non
-
coding regions within a gene


Exons


coding regions


Separated by functional sites (boxes)


Start and stop codons


Splice sites


acceptors and donors

Gene Finding Challenges


Need the correct reading frame


Introns can interrupt an exon in mid
-
codon


There is no hard and fast rule for identifying
donor and acceptor splice sites


Signals are very weak

Protein Family Modeling

(A clever fit of HMMs)


I have a protein sequence.



What family is it in?


Can you give me a quick alignment to the
other members of the family?


These amino acids here, do they match the
families consensus positions, or are they
inserts?

Profile HMM


Square: Match (consensus) state


Diamond: Insert state


notice the loops


Circle: Delete state


allows you to “jump” a match

What makes a good HMM
problem space?

Characteristics:


Classification problems


There are two main types of output from an
HMM:


Scoring of sequences


(Protein family modeling)


Labeling of observations within a sequence


(Gene Finding)

HMM Problem Characteristics

Continued


The observations in a sequence should have
a clear, and meaningful order


Unordered observations will not map easily to
states


It’s beneficial, but not necessary for the
observations follow some sort of grammar


Makes it easier to design an architecture


Gene Finding


Protein Family Modeling

HMM Requirements

So you’ve decided you want to build an HMM,

here’s what you need:



An architecture


Probably the hardest part


Should be biologically sound & easy to interpret



A well
-
defined success measure


Necessary for any form of machine learning

HMM Requirements

Continued


Training data


Labeled or unlabeled


it depends


You do not always need a labeled training set to do
observation labeling, but it helps


Amount of training data needed is:


Directly proportional to the number of free parameters
in the model


Inversely proportional to the size of the training
sequences


Why HMMs might be a good fit for

Gene Finding


Classification
: Classifying observations within a sequence


Order
: A DNA sequence is a set of ordered observations


Grammar

/
Architecture
: Our grammatical structure (and the
beginnings of our architecture) is right here:






Success measure
: # of complete exons correctly labeled


Training data
: Available from various genome annotation
projects

Why HMMs can be made to fit
Protein Family Modeling


Classification
: What model fits a sequence best?


Order
: An amino acid sequence is well ordered


Grammar
: Any two matches can be separated by a series of
inserts and deletes… okay, maybe the word “grammar” is a
bit of a stretch


Success Measure
: How many sequences can we correctly
label after training?

HMM Advantages


Statistical Grounding


Statisticians are comfortable with the theory
behind hidden Markov models


Freedom to manipulate the training and
verification processes


Mathematical / theoretical analysis of the results
and processes


HMMs are still very powerful modeling tools


far
more powerful than many statistical methods


HMM Advantages continued



Modularity


HMMs can be combined into larger HMMs



Transparency of the Model


Assuming an architecture with a good design


People can read the model and make sense of it


The model itself can help increase understanding

HMM Advantages continued


Incorporation of Prior Knowledge



Incorporate prior knowledge into the architecture



Initialize the model close to something believed to
be correct



Use prior knowledge to constrain training process

How does Gene Finding make
use of HMM advantages?


Statistics
:


Many systems alter the training process to better
suit their success measure


Modularity
:


Almost all systems use a combination of models,
each individually trained for each gene region


Prior Knowledge
:


A fair amount of prior biological knowledge is built
into each architecture

HMM Disadvantages


Markov Chains


States are supposed to be independent





P(y) must be independent of P(x), and vice versa


This usually isn’t true


Can get around it when relationships are local


Not good for RNA folding problems

P(x)



P(y)

HMM Disadvantages

continued


…and then there are the standard
Machine Learning Problems


Watch out for local maxima


Model may not converge to a truly optimal
parameter set for a given training set



Avoid over
-
fitting


You’re only as good as your training set


More training is not always good

HMM Disadvantages

continued


Speed!!!



Almost everything one does in an HMM involves:
“enumerating all possible paths through the model”



There are efficient ways to do this



Still slow in comparison to other methods

HMM Gene Finders:

VEIL


A straight HMM Gene Finder


Takes advantage of grammatical structure and
modular design


Uses many states that can only emit one symbol to
get around state independence



HMM Gene Finders:

HMMGene



Uses an extended HMM called a CHMM


CHMM = HMM with classes


Takes full advantage of being able to modify
the statistical algorithms


Uses high
-
order states


Trains everything at once

HMM Gene Finders:

Genie


Uses a generalized HMM (GHMM)


Edges in model are complete HMMs


States can be any arbitrary program


States are actually neural networks specially
designed for signal finding


Conclusions


HMMs have problems where they excel, and
problems where they do not


You should consider using one if:


Problem can be phrased as classification


Observations are ordered


The observations follow some sort of grammatical
structure (optional)


Conclusions



Advantages:


Statistics


Modularity


Transparency


Prior Knowledge


Disadvantages:


State independence


Over
-
fitting


Local Maximums


Speed

Some final words…


Lots of problems can be phrased as
classification problems


Homology search


Build a model of the sequence with a few close
homologs, and use the model the search for more
distant homologs


Sequence alignment


Align all of these sequences to the model that
represents their family

Questions


Any Questions?