Profile Hidden Markov Models
Bioinformatics, Fall 2004
Dr Webb Miller and Dr Claude Depamphilis
Dhiraj Joshi
Department of Computer Science and Engineering
The Pennsylvania State University
Outline
Introduction to HMMs
Profile HMMs
Available resources for Profile HMMs
Some online demonstrations
Introduction to HMMs
Hidden Markov Models: Formalism
Statistical techniques for modeling patterns in data
First-order Markov property (memorylessness): the next state depends only on the current state
A state is generally a hidden entity that emits symbols or features
The same symbol can be emitted by several states
An HMM is characterized by its transition probabilities and emission distributions
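The formalism above can be illustrated with a minimal sketch that samples from a toy HMM. The two state names, the DNA alphabet and every probability here are hypothetical values chosen for illustration; they do not come from the slides.

```python
import random

# Hypothetical two-state model; all numbers are illustrative assumptions.
TRANSITIONS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMISSIONS = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample(length, start_state="AT_rich", seed=0):
    """Return (hidden_states, emitted_symbols) for a path of the given length."""
    rng = random.Random(seed)
    state = start_state
    states, symbols = [], []
    for _ in range(length):
        states.append(state)
        # Each state emits a symbol; both states can emit any of A, C, G, T.
        symbols.append(draw(EMISSIONS[state], rng))
        # First-order Markov property: the next state depends only on the current one.
        state = draw(TRANSITIONS[state], rng)
    return states, symbols


states, symbols = sample(20)
print("".join(symbols))
```

An observer sees only the printed symbols; the list of states stays hidden, which is what the parameter estimation and decoding algorithms have to recover.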
Introduction to HMMs
Hidden Markov Models: Parameter Estimation
Parameters: transition probabilities and emission probabilities
Iterative computational algorithms are used: the EM algorithm (Baum-Welch for HMMs) and the Viterbi algorithm
The algorithms are based on dynamic programming to save computational cost
The iterations usually involve variants of the following two steps:
estimate the state sequence that maximizes the likelihood under the current parameter set
update the parameter set based on the estimated state sequence
These algorithms are only guaranteed to reach a local optimum, which may not be the global one
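The two steps can be sketched for a generic HMM as follows. This is a simplified illustration of Viterbi-style training on a single observation sequence, not the procedure of any particular package. The function names, the coin-toss example and the pseudocount of 1 are assumptions made for the sketch.

```python
import math
from collections import Counter


def viterbi(obs, states, start, trans, emit):
    """Step 1: most probable state path for obs, computed in log space."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


def reestimate(obs, path, states, alphabet, pseudo=1.0):
    """Step 2: update parameters from counts along the decoded path."""
    t_counts = Counter(zip(path, path[1:]))
    e_counts = Counter(zip(path, obs))
    trans, emit = {}, {}
    for s in states:
        t_total = sum(t_counts[(s, t)] + pseudo for t in states)
        trans[s] = {t: (t_counts[(s, t)] + pseudo) / t_total for t in states}
        e_total = sum(e_counts[(s, a)] + pseudo for a in alphabet)
        emit[s] = {a: (e_counts[(s, a)] + pseudo) / e_total for a in alphabet}
    return trans, emit


# Hypothetical fair (F) versus loaded (L) coin model.
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}
obs = "HHTHTTHHHHHHHH"
for _ in range(5):  # a few iterations of decode, then update
    path, logp = viterbi(obs, states, start, trans, emit)
    trans, emit = reestimate(obs, path, states, "HT")
```

The pseudocount keeps every probability strictly positive, so the logarithms in the decoding step are always defined. Different starting parameters can lead to different final estimates, which is the local optimum behaviour noted above.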
Profile Hidden Markov Models
Stochastic methods to model multiple sequence alignments of protein and DNA sequences
Potential application domains:
Protein families can be modeled as an HMM or a group of HMMs (constructing a profile HMM)
New protein sequences can be aligned with stored models to detect remote homology (aligning a sequence with a stored profile HMM)
Two or more protein family profile HMMs can be aligned to detect homology (finding statistical similarities between two profile HMMs)
Profile Hidden Markov Models
Constructing a profile HMM
A multiple sequence alignment is assumed as the starting point
Each consensus column is modeled by three states: match, insert and delete
The number of states depends on the length of the alignment
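A minimal sketch of how the alignment length determines the number of states. It assumes a common heuristic, not stated on the slide, that a column is a consensus (match) column when fewer than half of its entries are gaps. It also assumes the usual layout with one delete state per match state and one insert state before, between and after the match states. The toy alignment is made up.

```python
def match_columns(alignment, gap="-", max_gap_fraction=0.5):
    """Indices of alignment columns treated as consensus (match) columns."""
    n_seqs = len(alignment)
    width = len(alignment[0])
    columns = []
    for j in range(width):
        gaps = sum(1 for row in alignment if row[j] == gap)
        if gaps / n_seqs < max_gap_fraction:
            columns.append(j)
    return columns


def state_counts(n_match):
    """State counts for n_match consensus columns (begin and end states excluded)."""
    return {"match": n_match, "insert": n_match + 1, "delete": n_match}


# Hypothetical toy alignment of four DNA sequences.
alignment = ["ACG-T", "AC--T", "A-GAT", "ACG-T"]
columns = match_columns(alignment)
print(columns)                     # [0, 1, 2, 4]
print(state_counts(len(columns)))  # {'match': 4, 'insert': 5, 'delete': 4}
```

Column 3 is mostly gaps, so it is handled by an insert state rather than receiving match and delete states of its own.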
Profile Hidden Markov Models
A typical profile HMM architecture
Squares represent match states
Diamonds represent insert states
Circles represent delete states
Arrows represent transitions
Profile Hidden Markov Models
A typical profile HMM architecture
Transition between match states
Transition from a match state to an insert state
Transition within an insert state
Transition from a match state to a delete state
Transition between delete states
Emission of a symbol at a state
Profile Hidden Markov Models
Estimation of parameters
Transition probabilities are estimated as the frequency of each transition in a given alignment
Emission probabilities are estimated as the frequency of each emission in a given alignment
Pseudocounts are usually introduced to account for transitions and emissions that were not observed in the alignment
Profile Hidden Markov Models
Estimation of parameters with pseudocounts
A Dirichlet prior distribution is used to determine the pseudocounts
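The effect of pseudocounts on emission estimates can be sketched as below. The estimate (count + pseudocount) / (total count + total pseudocount) is the posterior mean under a Dirichlet prior whose parameters are the pseudocounts. The default of one pseudocount per symbol (Laplace's rule) and the function name are assumptions for illustration; real tools use priors fitted to protein data.

```python
from collections import Counter


def emission_probs(column, alphabet, pseudocounts=None):
    """Estimate emission probabilities for one match column of an alignment.

    pseudocounts maps each symbol to its pseudocount; the default of 1.0 per
    symbol corresponds to a uniform Dirichlet prior (an assumption).
    """
    if pseudocounts is None:
        pseudocounts = {a: 1.0 for a in alphabet}
    # Gap characters and anything outside the alphabet are skipped.
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts[a] + pseudocounts[a] for a in alphabet)
    return {a: (counts[a] + pseudocounts[a]) / total for a in alphabet}


# Column with three A's, one gap and one C; G and T were never observed.
probs = emission_probs("AAA-C", "ACGT")
print(probs)  # A: 0.5, C: 0.25, G: 0.125, T: 0.125
```

Without pseudocounts G and T would receive probability zero, and any new sequence with G or T in this column would be ruled out entirely.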
Profile Hidden Markov Models
Scoring a sequence against a profile HMM
The Viterbi algorithm is used to find the best state path
Simulated annealing based methods are also used
Maximization criteria: log likelihood or log odds
The log likelihood score generally depends on the length of the sequence and is hence not preferred
If an alignment is not given initially, it can be learnt iteratively using Viterbi
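The difference between the two criteria can be sketched with a deliberately simplified model: an ungapped profile with match states only, so that residue i of the sequence is emitted by match state i. The two-position profile and the uniform background distribution are made-up assumptions. A full profile HMM would also search over insert and delete states with Viterbi.

```python
import math


def log_likelihood(seq, match_emissions):
    """Sum of log emission probabilities along the match states."""
    return sum(math.log(match_emissions[i][x]) for i, x in enumerate(seq))


def log_odds(seq, match_emissions, background):
    """Log ratio of the profile probability to a background (null) model."""
    return sum(math.log(match_emissions[i][x] / background[x])
               for i, x in enumerate(seq))


# Hypothetical two-position profile that prefers "AC".
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for seq in ("AC", "GT"):
    print(seq, log_likelihood(seq, profile), log_odds(seq, profile, background))
```

Every extra residue adds a negative term to the log likelihood, so longer sequences score lower regardless of fit. In the log-odds score each term is positive when the profile explains the residue better than the background and negative otherwise, which gives a natural threshold at zero.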
Profile Hidden Markov Models
Comparing two profile HMMs
Profile-profile comparison tool based on information theory
Based on the Kullback-Leibler divergence criterion for comparing two statistical distributions
Dynamic programming is used to compare entire profiles
Detects weak similarities between models
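A minimal sketch of the Kullback-Leibler divergence between the emission distributions of two profile columns. This shows only the basic criterion. It is not the scoring function of any published profile-profile tool. The symmetrized variant and the example distributions are assumptions added for illustration.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in nats; assumes q[a] > 0 wherever p[a] > 0."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)


def symmetric_kl(p, q):
    """Average of both directions, since D(p || q) != D(q || p) in general."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical emission distributions for one column from each of two profiles.
col_a = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
col_b = {"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1}
col_c = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}

print(symmetric_kl(col_a, col_b))  # small: the columns are similar
print(symmetric_kl(col_a, col_c))  # larger: the columns disagree
```

A column-to-column score of this kind can then be fed into a dynamic programming alignment of the two profiles, in the same way that a substitution matrix is used when aligning two sequences.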
Available resources for Profile HMMs
HMMER and SAM were among the first available programs for profile HMMs
HMMER: S. Eddy at Washington University
SAM (Sequence Alignment and Modeling System): R. Hughey at University of California, Santa Cruz
Available free for research
SAM has online servers to perform sequence comparisons
http://www.cse.ucsc.edu/research/compbio/sam.html
Available resources for Profile HMMs
The InterPro consortium in Europe has many resources for protein data
Database of protein families and domains
Brings together several different databases under one umbrella
Pfam and Superfamily are profile HMM libraries associated with InterPro
Pfam is based on HMMER search, and Superfamily is based on SAM search and modeling
Available resources for Profile HMMs
SAM’s iterative approach for building an HMM
Find a set of close homologs using BLASTP
Learn the alignment and build a model using the close homologs
Use BLASTP with a relaxed E-value to get more remote homologs, starting from the first set of sequences
Iteratively refine the HMM
SAM uses Dirichlet priors to derive pseudocounts for the parameters
Unlike HMMER, hand-tuned seed alignments are not required, as the alignments are learnt by the algorithm
Available resources for Profile HMMs
The SUPERFAMILY database incorporates:
A library of profile HMMs representing all proteins of known structure
Assignments to predicted proteins from all completely sequenced genomes
Search and alignment services
Models and domain assignments are freely available
Based on the SCOP classification of protein domains
The SAM HMM iterative procedure is used for model building and sequence alignment
Available resources for Profile HMMs
In Superfamily:
Each SCOP superfamily is represented as an HMM
The model is built using the SAM procedure, based on four variants:
accurate structure-based alignments
hand-labeled alignments
automatic alignments using ClustalW
sequence members used separately as seeds
Assignment of superfamilies:
For a given sequence, every model is scored across the whole sequence using Viterbi scoring
The model that scores highest has its superfamily assigned to the region
Online Demonstrations
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/temp/624288710157514.html
References
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., "Biological Sequence Analysis", Cambridge University Press, 2002
Baldi, P. and Brunak, S., "Bioinformatics, the Machine Learning Approach", the MIT Press, Cambridge, 1998
Eddy, S., "Profile Hidden Markov Models", Bioinformatics (Review), vol. 14, no. 9, pp. 755-763, 1998
Karplus, K., Barrett, C., and Hughey, R., "Hidden Markov models for detecting remote homologies", Bioinformatics, vol. 14, no. 10, pp. 846-856, 1998
Madera, M. and Gough, J., "A comparison of profile hidden Markov model procedures for remote homology detection", Nucleic Acids Research, vol. 30, no. 19, pp. 4321-4328, 2002
Gough, J., Karplus, K., Hughey, R., and Chothia, C., "Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that represent all Proteins of known structure", J. Mol. Biol., vol. 313, pp. 903-919, 2001
References
Yona, G. and Levitt, M., "Within the Twilight Zone: A sensitive Profile-Profile comparison tool based on Information Theory", J. Mol. Biol., vol. 315, pp. 1257-1275, 2002
Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C., and Gough, J., "The SUPERFAMILY database in 2004: additions and improvements", Nucleic Acids Research, vol. 32, Database Issue, pp. D235-D239, 2004
Bateman, A., Birney, E., Durbin, R., Eddy, S., Finn, R., and Sonnhammer, E., "Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins", Nucleic Acids Research, vol. 27, no. 1, 1999
Andreeva, A., et al., "SCOP database in 2004: refinements integrate structure and sequence family data", Nucleic Acids Research, vol. 32, Database Issue, pp. D226-D229, 2004
Many other online resources and tutorials