Hidden Markov Models in Bioinformatics


Oct 2, 2013 (3 years and 8 months ago)


Hidden Markov Models
in Bioinformatics

Example Domains: Gene Finding

Protein Family Modeling

5 Second Overview

Today’s goal: Introduce HMMs as general
tools in bioinformatics

I will use the problem of
Gene Finding

as an
example of an “ideal” HMM problem domain

I will use the problem of
Protein Family

as an example of a clever way to
fit HMMs to a problem

Learning Objectives

When I’m done you should know:

When is an HMM a good fit for a problem space?

What materials are needed before work can
begin with an HMM?

What are the advantages and disadvantages of
using HMMs?


HMMs as Statistical Models

The example tasks at a glance

Good problems for HMMs

HMM Advantages

HMM Disadvantages

Gene Finding Examples

Statistical Models


Any mathematical construct that attempts to parameterize
a random process

Example: A normal distribution





HMMs are just a little more complicated…

HMM Assumptions

Observations are ordered

Random process can be represented by a
stochastic finite state machine with emitting states.

HMM Parameters

Using weather example

Modeling daily weather
for a year

Ra Ra Su Su Su Ra..

Lots of parameters

One for each table entry

Represented in two

One for emissions

One for transitions

HMM Estimation

Called training, it falls under machine learning

Feed an architecture (given in advance) a set
of observation sequences

The training process will iteratively alter its
parameters to fit the training set

The trained model will assign the training
sequences high probability

HMM Usage

Two major tasks

Evaluate the probability of an observation
sequence given the model (Forward)

Find the most likely path through the model
for a given observation sequence (Viterbi)

Gene Finding

(An Ideal HMM Domain)

Our Objective:

To find the coding and non
coding regions of an
unlabeled string of DNA nucleotides

Our Motivation:

Assist in the annotation of genomic data produced
by genome sequencing methods

Gain insight into the mechanisms involved in
transcription, splicing and other processes

Gene Finding Terminology

A string of DNA nucleotides containing a gene
will have separate regions (lines):


coding regions within a gene


coding regions

Separated by functional sites (boxes)

Start and stop codons

Splice sites

acceptors and donors

Gene Finding Challenges

Need the correct reading frame

Introns can interrupt an exon in mid

There is no hard and fast rule for identifying
donor and acceptor splice sites

Signals are very weak

Protein Family Modeling

(A clever fit of HMMs)

I have a protein sequence.

What family is it in?

Can you give me a quick alignment to the
other members of the family?

These amino acids here, do they match the
families consensus positions, or are they

Profile HMM

Square: Match (consensus) state

Diamond: Insert state

notice the loops

Circle: Delete state

allows you to “jump” a match

What makes a good HMM
problem space?


Classification problems

There are two main types of output from an

Scoring of sequences

(Protein family modeling)

Labeling of observations within a sequence

(Gene Finding)

HMM Problem Characteristics


The observations in a sequence should have
a clear, and meaningful order

Unordered observations will not map easily to

It’s beneficial, but not necessary for the
observations follow some sort of grammar

Makes it easier to design an architecture

Gene Finding

Protein Family Modeling

HMM Requirements

So you’ve decided you want to build an HMM,

here’s what you need:

An architecture

Probably the hardest part

Should be biologically sound & easy to interpret

A well
defined success measure

Necessary for any form of machine learning

HMM Requirements


Training data

Labeled or unlabeled

it depends

You do not always need a labeled training set to do
observation labeling, but it helps

Amount of training data needed is:

Directly proportional to the number of free parameters
in the model

Inversely proportional to the size of the training

Why HMMs might be a good fit for

Gene Finding

: Classifying observations within a sequence

: A DNA sequence is a set of ordered observations


: Our grammatical structure (and the
beginnings of our architecture) is right here:

Success measure
: # of complete exons correctly labeled

Training data
: Available from various genome annotation

Why HMMs can be made to fit
Protein Family Modeling

: What model fits a sequence best?

: An amino acid sequence is well ordered

: Any two matches can be separated by a series of
inserts and deletes… okay, maybe the word “grammar” is a
bit of a stretch

Success Measure
: How many sequences can we correctly
label after training?

HMM Advantages

Statistical Grounding

Statisticians are comfortable with the theory
behind hidden Markov models

Freedom to manipulate the training and
verification processes

Mathematical / theoretical analysis of the results
and processes

HMMs are still very powerful modeling tools

more powerful than many statistical methods

HMM Advantages continued


HMMs can be combined into larger HMMs

Transparency of the Model

Assuming an architecture with a good design

People can read the model and make sense of it

The model itself can help increase understanding

HMM Advantages continued

Incorporation of Prior Knowledge

Incorporate prior knowledge into the architecture

Initialize the model close to something believed to
be correct

Use prior knowledge to constrain training process

How does Gene Finding make
use of HMM advantages?


Many systems alter the training process to better
suit their success measure


Almost all systems use a combination of models,
each individually trained for each gene region

Prior Knowledge

A fair amount of prior biological knowledge is built
into each architecture

HMM Disadvantages

Markov Chains

States are supposed to be independent

P(y) must be independent of P(x), and vice versa

This usually isn’t true

Can get around it when relationships are local

Not good for RNA folding problems



HMM Disadvantages


…and then there are the standard
Machine Learning Problems

Watch out for local maxima

Model may not converge to a truly optimal
parameter set for a given training set

Avoid over

You’re only as good as your training set

More training is not always good

HMM Disadvantages



Almost everything one does in an HMM involves:
“enumerating all possible paths through the model”

There are efficient ways to do this

Still slow in comparison to other methods

HMM Gene Finders:


A straight HMM Gene Finder

Takes advantage of grammatical structure and
modular design

Uses many states that can only emit one symbol to
get around state independence

HMM Gene Finders:


Uses an extended HMM called a CHMM

CHMM = HMM with classes

Takes full advantage of being able to modify
the statistical algorithms

Uses high
order states

Trains everything at once

HMM Gene Finders:


Uses a generalized HMM (GHMM)

Edges in model are complete HMMs

States can be any arbitrary program

States are actually neural networks specially
designed for signal finding


HMMs have problems where they excel, and
problems where they do not

You should consider using one if:

Problem can be phrased as classification

Observations are ordered

The observations follow some sort of grammatical
structure (optional)






Prior Knowledge


State independence


Local Maximums


Some final words…

Lots of problems can be phrased as
classification problems

Homology search

Build a model of the sequence with a few close
homologs, and use the model the search for more
distant homologs

Sequence alignment

Align all of these sequences to the model that
represents their family


Any Questions?