Profile Hidden Markov Models




Bioinformatics, Fall 2004

Dr. Webb Miller and Dr. Claude Depamphilis



Dhiraj Joshi

Department of Computer Science and Engineering


The Pennsylvania State University



Outline


Introduction to HMMs


Profile HMMs


Available resources for Profile HMMs


Some online demonstrations









Introduction to HMMs


Hidden Markov Models


Formalism


statistical techniques for modeling patterns in data
first-order Markov property (memorylessness): the next state depends only on the current state
a state is generally a hidden entity which emits symbols or features
the same symbol can be emitted by several states
an HMM is characterized by its transition probabilities and emission distributions
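
To make the formalism concrete, here is a minimal sketch of a two-state HMM and the joint probability of a state path and its emitted symbols. The state names, symbols and all probabilities are illustrative assumptions, not taken from the slides.

```python
import math

# Toy two-state HMM; every name and number here is an illustrative assumption.
states = ("fair", "loaded")
start = {"fair": 0.5, "loaded": 0.5}
trans = {
    "fair": {"fair": 0.9, "loaded": 0.1},
    "loaded": {"fair": 0.2, "loaded": 0.8},
}
# The same symbol ("H" or "T") can be emitted by both hidden states.
emit = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.9, "T": 0.1},
}

def joint_log_prob(path, symbols):
    """Log probability of a state path together with the symbols it emits."""
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][symbols[0]])
    for prev, cur, sym in zip(path, path[1:], symbols[1:]):
        # First-order Markov property: only the previous state matters.
        logp += math.log(trans[prev][cur]) + math.log(emit[cur][sym])
    return logp

print(math.exp(joint_log_prob(["fair", "fair"], "HT")))
```

In practice the path is hidden, so only the symbols are observed and the path has to be inferred.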






Introduction to HMMs


Hidden Markov Models


Parameter Estimation


Parameters: transition probabilities and emission probabilities
iterative computational algorithms are used: the EM algorithm, the Viterbi algorithm
the algorithms are based on dynamic programming to save computational cost
the iterations usually involve variants of the following two steps:
    estimate the state sequence which maximizes the likelihood under the current parameter set
    update the parameter set based on the estimated state sequence
the algorithms may converge to a local rather than the global optimum
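
The first of the two steps, finding the most likely state sequence under a fixed parameter set, can be sketched with the Viterbi recursion in log space. This is a minimal sketch on a toy model whose names and numbers are assumptions. It shows the decoding step only, not the parameter update.

```python
import math

# Toy parameters (illustrative assumptions only).
states = ("fair", "loaded")
start = {"fair": 0.5, "loaded": 0.5}
trans = {
    "fair": {"fair": 0.9, "loaded": 0.1},
    "loaded": {"fair": 0.2, "loaded": 0.8},
}
emit = {
    "fair": {"H": 0.5, "T": 0.5},
    "loaded": {"H": 0.9, "T": 0.1},
}

def viterbi(symbols, states, start, trans, emit):
    """Return (best_log_prob, best_path) for an observed symbol sequence."""
    # score[s] = best log probability of any path that ends in state s so far
    score = {s: math.log(start[s]) + math.log(emit[s][symbols[0]]) for s in states}
    backpointers = []
    for sym in symbols[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Dynamic programming: reuse the best scores of the previous position.
            best_prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(trans[best_prev][s])
                            + math.log(emit[s][sym]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

logp, path = viterbi("HHH", states, start, trans, emit)
print(path, math.exp(logp))
```

A Viterbi-style training loop would alternate this decoding step with re-estimating `trans` and `emit` from the decoded paths until the parameters stop changing.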
















Profile Hidden Markov Models


Stochastic methods to model multiple sequence alignments of protein and DNA sequences

Potential application domains:
    protein families could be modeled as an HMM or a group of HMMs
        (constructing a profile HMM)
    new protein sequences could be aligned with stored models to detect remote homology
        (aligning a sequence with a stored profile HMM)
    two or more protein family profile HMMs could be aligned to detect homology
        (finding statistical similarities between two profile HMMs)








Profile Hidden Markov Models


Constructing a profile HMM






a multiple sequence alignment is assumed to be given
each consensus column is modeled by 3 states: match, insert and delete
the number of states depends on the length of the alignment
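
As an illustration of how the number of states follows from the alignment, the sketch below marks consensus columns with a simple gap-fraction rule and counts the resulting states. The 50% threshold, the toy alignment and the convention of one extra insert state before the first column are assumptions made for this example. The slides do not specify them.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of alignment columns treated as consensus (match) columns."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    cols = []
    for j in range(n_cols):
        gaps = sum(1 for seq in alignment if seq[j] == "-")
        # Assumed rule: a column is a consensus column if under half its rows are gaps.
        if gaps / n_seqs < max_gap_fraction:
            cols.append(j)
    return cols

def state_counts(n_match):
    """State counts for n_match consensus columns (begin and end states excluded)."""
    # Assumed convention: one insert state before the first column and one after each column.
    return {"match": n_match, "insert": n_match + 1, "delete": n_match}

# Toy alignment of seven sequences, ten columns, "-" marks a gap.
alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

cols = match_columns(alignment)
print(cols, state_counts(len(cols)))
```

Columns that fail the rule are not discarded: residues in them are treated as emissions from insert states.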

Profile Hidden Markov Models


A typical profile HMM architecture








squares represent match states
diamonds represent insert states
circles represent delete states
arrows represent transitions






Profile Hidden Markov Models


A typical profile HMM architecture








transition between consecutive match states
transition from a match state to an insert state
transition within an insert state (self-loop)
transition from a match state to a delete state
transition between consecutive delete states
emission of a symbol at a state
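
The transition types listed above can be enumerated programmatically. The sketch below assumes a fully connected layout in which every match, insert and delete state at one position can reach the insert state at that position and the match and delete states at the next one. The state labels (`B`, `M1`, `I0`, `E`, ...) are hypothetical names chosen for this example. Some implementations omit the insert-to-delete and delete-to-insert arrows.

```python
def profile_transitions(n_match):
    """List (source, target) state pairs for a profile HMM with n_match match states."""
    edges = []
    for j in range(n_match + 1):
        # Position 0 holds the begin state "B", which acts like a silent match state.
        sources = ["B"] if j == 0 else [f"M{j}", f"D{j}"]
        sources.append(f"I{j}")
        # Next position: match and delete states, or the end state "E" after the last column.
        nxt = [f"M{j + 1}", f"D{j + 1}"] if j < n_match else ["E"]
        for src in sources:
            for dst in [f"I{j}"] + nxt:
                edges.append((src, dst))
    return edges

edges = profile_transitions(3)
print(len(edges))
print(("M1", "M2") in edges, ("I1", "I1") in edges, ("D1", "D2") in edges)
```

Under this layout the number of transitions grows linearly with the number of consensus columns, which keeps dynamic programming over the model tractable.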













Profile Hidden Markov Models


Estimation of parameters







transition probabilities are estimated as the frequency of a transition in a given alignment
emission probabilities are estimated as the frequency of an emission in a given alignment
pseudocounts are usually introduced to account for transitions / emissions which were not present in the alignment








Profile Hidden Markov Models


Estimation of parameters








with pseudo counts





Dirichlet prior distribution used to determine pseudo counts
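
A minimal sketch of frequency-based estimation with pseudocounts for a single match column follows. It uses one uniform pseudocount per symbol, which corresponds to a symmetric Dirichlet prior. This is a simplifying assumption: the function name and the toy column are hypothetical, and real tools typically use richer priors.

```python
from collections import Counter

def emission_probs(column, alphabet, pseudocount=1.0):
    """Estimate emission probabilities for one match column of an alignment."""
    # Count only residues from the alphabet; gap characters are ignored.
    counts = Counter(ch for ch in column if ch in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

column = "AAAC"  # residues observed in one alignment column (toy data)
print(emission_probs(column, "ACGT", pseudocount=0.0))  # raw frequencies, G and T get 0
print(emission_probs(column, "ACGT", pseudocount=1.0))  # every symbol keeps some probability
```

Without pseudocounts, any sequence containing an unseen symbol at this position would receive probability zero, however well it matches elsewhere.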

Profile Hidden Markov Models


Scoring a sequence against a profile HMM








the Viterbi algorithm is used to find the best state path
simulated annealing based methods are also used
maximization criteria: log likelihood or log odds
the log likelihood score generally depends on the length of the sequence and hence is not preferred
if an alignment is not given initially, it can be learnt iteratively using Viterbi






Profile Hidden Markov Models


Comparing two profile HMMs






profile-profile comparison tool based on information theory
based on the Kullback-Leibler divergence criterion for comparing 2 statistical distributions
dynamic programming is used to compare entire profiles
detects weak similarities between models
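
For reference, the Kullback-Leibler divergence between two emission distributions can be computed as in the sketch below. The symmetrised variant (the Jensen-Shannon divergence) is included as one common way to obtain a symmetric, bounded column score. Whether a given comparison tool uses exactly this form is not stated on the slides, and the toy distributions are assumptions.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[a] > 0 wherever p[a] > 0."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)

def jensen_shannon(p, q):
    """Symmetrised divergence: average KL divergence to the midpoint distribution."""
    mid = {a: 0.5 * (p[a] + q[a]) for a in p}
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

# Two toy profile columns over a two-letter alphabet.
p = {"A": 0.5, "C": 0.5}
q = {"A": 0.25, "C": 0.75}

print(kl_divergence(p, q), kl_divergence(q, p))    # KL divergence is not symmetric
print(jensen_shannon(p, q), jensen_shannon(q, p))  # the symmetrised form is
```

A column-by-column score of this kind can then be fed into a dynamic programming alignment of the two profiles.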














Available resources for Profile HMMs


HMMER and SAM were among the first available programs for profile HMMs
    HMMER: S. Eddy at Washington University
    SAM (Sequence Alignment and Modeling System): R. Hughey at University of California, Santa Cruz
    available free for research
SAM has online servers to perform sequence comparisons
    http://www.cse.ucsc.edu/research/compbio/sam.html





Available resources for Profile HMMs


The InterPro consortium in Europe has many resources for protein data
    database of protein families and domains
    brings together several different databases under one umbrella
Pfam and Superfamily are profile HMM libraries associated with InterPro
    Pfam is based on HMMER search, and Superfamily on SAM search and modeling





Available resources for Profile HMMs


SAM's iterative approach for building an HMM
    find a set of close homologs using BLASTP
    learn the alignment and build a model using the close homologs
    use BLASTP to get more remote homologs using the first set of sequences (relax the E value)
    iteratively refine the HMM
SAM uses Dirichlet priors as pseudocounts for parameters
Unlike HMMER, hand-tuned seed alignments are not required, as the alignments are learnt by the algorithm











Available resources for Profile HMMs


The SUPERFAMILY database incorporates:
    a library of profile HMMs representing all proteins of known structure
    assignments to predicted proteins from all completely sequenced genomes
    search and alignment services
    models and domain assignments, which are freely available
Based on the SCOP classification of protein domains
The SAM HMM iterative procedure is used for model building and sequence alignment













Available resources for Profile HMMs


In Superfamily:
Each SCOP superfamily is represented as an HMM
Models are built using the SAM procedure, in 4 variants:
    accurate structure-based alignments
    hand-labeled alignments
    automatic alignments using ClustalW
    sequence members used separately as seeds
Assignment of superfamilies:
    for a given sequence, every model is scored across the whole sequence using Viterbi scoring
    the model which scores highest has its superfamily assigned to the region
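
The assignment rule amounts to picking the highest-scoring model. A minimal sketch follows, assuming the per-model scores for a region have already been computed. The model names and score values are hypothetical placeholders, not real SUPERFAMILY output.

```python
def assign_superfamily(scores):
    """Return the (model_name, score) pair with the highest score."""
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical Viterbi scores of three models against one sequence region.
scores = {"superfamily_A": 41.2, "superfamily_B": -3.5, "superfamily_C": 7.9}

print(assign_superfamily(scores))
```

A practical pipeline would presumably also apply a significance cutoff so that regions with no convincing match stay unassigned. That step is not shown here.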






















Online Demonstrations


http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/temp/624288710157514.html











Many other online resources and tutorials