Introduction to Profile Hidden Markov Models
Mark Stamp
PHMM 1
Hidden Markov Models
Here, we assume you know about HMMs
If not, see "A revealing introduction to hidden Markov models"
Executive summary of HMMs:
An HMM is a machine learning technique
Also a discrete hill climb technique
Train a model based on an observation sequence
Score a given sequence to see how closely it matches the model
Efficient algorithms, many useful applications
HMM Notation
Recall, an HMM is denoted λ = (A, B, π)
The observation sequence is O
Notation:
Hidden Markov Models
Among the many uses for HMMs...
Speech analysis
Music search engines
Malware detection
Intrusion detection systems (IDS)
Many more, and more all the time
Limitations of HMMs
Positional information is not considered
The HMM has no "memory"
Higher order models have some memory
But no explicit use of positional information
Does not handle insertions or deletions
These limitations are serious problems in some applications
In bioinformatics string comparison, sequence alignment is critical
Also, insertions and deletions occur
Profile HMM
The profile HMM (PHMM) is designed to overcome the limitations on the previous slide
In some ways, a PHMM is easier than an HMM
In some ways, a PHMM is more complex
The basic idea of the PHMM:
Define multiple B matrices
Almost like having an HMM for each position in the sequence
PHMM
In bioinformatics, begin by aligning multiple related sequences
Multiple sequence alignment (MSA)
This is like the training phase for an HMM
Generate the PHMM based on the given MSA
Easy, once the MSA is known
The hard part is generating the MSA
Then we can score sequences using the PHMM
Use the forward algorithm, as with an HMM
Generic View of PHMM
Circles are Delete states
Diamonds are Insert states
Rectangles are Match states
Match states correspond to HMM states
Arrows are possible transitions
Each transition has an associated probability
Transition probabilities form the A matrix
Emission probabilities form the B matrices
In a PHMM, observations are emissions
Match and insert states have emissions
Generic View of PHMM
Circles are Delete states, diamonds are Insert states, rectangles are Match states
Also, begin and end states
PHMM Notation
Notation
PHMM
Match state probabilities are easily determined from the MSA, that is,
a_{Mi,Mi+1} transitions between match states
e_{Mi}(k) emission probability at match state Mi
Note: there are other transition probabilities
For example, a_{Mi,Ii} and a_{Mi,Di+1}
Emissions at all match and insert states
Remember, emission == observation
MSA
First we show MSA construction
This is the difficult part
Lots of ways to do this
The "best" way depends on the specific problem
Then construct the PHMM from the MSA
The easy part
Standard algorithm for this
How to score a sequence?
Forward algorithm, similar to HMM
MSA
How to construct an MSA?
Construct pairwise alignments
Combine pairwise alignments to obtain the MSA
Allow gaps to be inserted
Makes better matches
But gaps tend to weaken scoring
So there is a tradeoff
Global vs Local Alignment
In these pairwise alignment examples:
"-" is a gap
"|" marks aligned symbols
"*" marks omitted beginning and ending symbols
Global vs Local Alignment
Global alignment is lossless
But gaps tend to proliferate
And gaps increase when we do MSA
More gaps implies more sequences match
So, the result is less useful for scoring
We usually only consider local alignment
That is, omit the ends for better alignment
For simplicity, we assume global alignment here
Pairwise Alignment
We allow gaps when aligning
How to score an alignment?
Based on an n x n substitution matrix S
Where n is the number of symbols
What algorithm(s) to align sequences?
Usually, dynamic programming
Sometimes, an HMM is used
Other?
Local alignment: more issues
Pairwise Alignment Example
Note gaps vs misaligned elements
Depends on S and the gap penalty
Substitution Matrix
Masquerade detection
Detect an imposter using an account
Consider 4 different operations:
E == send email
G == play games
C == C programming
J == Java programming
How similar are these to each other?
Substitution Matrix
Consider 4 different operations: E, G, C, J
Possible substitution matrix:
Diagonal entries are matches
High positive scores
Which others are most similar?
J and C, so substituting C for J gets a high score
Game playing and programming are very different
So substituting G for C gets a negative score
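The matrix itself appeared as a table on the original slide; the sketch below uses illustrative values of my own (high diagonal, positive for C/J, negative for games vs. programming), not the slide's actual numbers:

```python
# Hypothetical substitution matrix for the four operations
# E (email), G (games), C (C programming), J (Java programming).
# Values are illustrative assumptions, not the slide's numbers:
# high positive on the diagonal, positive for similar operations (C/J),
# negative for dissimilar ones (G vs. programming).
S = {
    ('E', 'E'): 9, ('G', 'G'): 9, ('C', 'C'): 9, ('J', 'J'): 9,
    ('C', 'J'): 5,                   # both programming: high score
    ('E', 'G'): -1, ('E', 'C'): -2, ('E', 'J'): -2,
    ('G', 'C'): -4, ('G', 'J'): -4,  # games vs. programming: very different
}

def score(a, b):
    """Symmetric lookup into the substitution matrix."""
    return S.get((a, b), S.get((b, a)))
```

A symmetric lookup is used so only one triangle of the matrix needs to be stored.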
Substitution Matrix
Depending on the problem, it might be easy or very difficult to get a useful S matrix
Consider masquerade detection based on UNIX commands
Sometimes it is difficult to say how "close" 2 commands are
Suppose we are aligning DNA sequences
There is a biological rationale for closeness of symbols
Gap Penalty
Generally we must allow gaps to be inserted
But gaps make the alignment more generic
So, less useful for scoring
Therefore, we penalize gaps
How to penalize gaps?
Linear gap penalty function: f(g) = dg
(i.e., constant penalty per gap)
Affine gap penalty function: f(g) = a + e(g - 1)
Gap opening penalty a, then constant factor of e
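The two penalty functions translate directly into code; the default values d = 2, a = 3, e = 1 below are illustrative assumptions, not values from the slides:

```python
def linear_gap(g, d=2):
    """Linear gap penalty: constant cost d per gap symbol, f(g) = d*g."""
    return d * g

def affine_gap(g, a=3, e=1):
    """Affine gap penalty: opening cost a, then e per additional gap
    symbol, f(g) = a + e*(g - 1)."""
    return a + e * (g - 1)
```

With a > e, the affine penalty makes opening a new gap more expensive than extending an existing one, which favors a few long gaps over many scattered ones.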
Pairwise Alignment Algorithm
We use dynamic programming
Based on the S matrix and gap penalty function
Notation:
Pairwise Alignment DP
Initialization:
Recursion:
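The initialization and recursion shown as equations on the original slide amount to the standard global-alignment dynamic program (Needleman-Wunsch); a minimal sketch, assuming a substitution score function and a linear gap penalty d:

```python
def needleman_wunsch(x, y, score, d=2):
    """Global pairwise alignment score by dynamic programming.

    F[i][j] is the best score for aligning x[:i] with y[:j].
    Initialization: F[i][0] = -d*i and F[0][j] = -d*j (all gaps).
    Recursion: take the max of the diagonal (align two symbols),
    up (gap in y), and left (gap in x) moves.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align symbols
                F[i - 1][j] - d,                              # gap in y
                F[i][j - 1] - d,                              # gap in x
            )
    return F[m][n]
```

The full algorithm would also keep back-pointers to recover the alignment itself; the sketch returns only the score.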
MSA from Pairwise Alignments
Given pairwise alignments...
...how to construct the MSA?
The generic approach is "progressive alignment"
Select one pairwise alignment
Select another and combine with the first
Continue to add more until all are combined
Relatively easy (good)
Gaps may proliferate, unstable (bad)
MSA from Pairwise Alignments
Lots of ways to improve on generic progressive alignment
Here, we mention one such approach
Not necessarily "best" or most popular
Feng-Doolittle progressive alignment
Compute scores for all pairs of n sequences
Select n - 1 alignments that a) "connect" all sequences and b) maximize pairwise scores
Then generate a minimum spanning tree
For the MSA, add sequences in the order that they appear in the spanning tree
MSA Construction
Create pairwise alignments
Generate substitution matrix
Dynamic program for pairwise alignments
Use pairwise alignments to make the MSA
Use pairwise alignments to construct a spanning tree (e.g., Prim's algorithm)
Add sequences to the MSA in spanning tree order (from highest score, insert gaps as needed)
Note: the gap penalty is used here
MSA Example
Suppose 10 sequences, with the following pairwise alignment scores:
MSA Example: Spanning Tree
Spanning tree based on scores
So process pairs in the following order:
(5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
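The spanning-tree step can be sketched with a Prim-style greedy construction over the pairwise scores (the slide's actual score table was a figure and is not reproduced here; the function and the small example are illustrative assumptions):

```python
def max_spanning_order(scores, start):
    """Prim-style construction of a maximum-score spanning tree.

    scores: dict mapping frozenset({i, j}) -> pairwise alignment score.
    Returns the edges in the order they are added to the tree, which is
    the order the corresponding sequences would be merged into the MSA.
    """
    nodes = {n for pair in scores for n in pair}
    in_tree = {start}
    order = []
    while in_tree != nodes:
        # Pick the highest-scoring edge joining the tree to a new node.
        best = max(
            (pair for pair in scores if len(pair & in_tree) == 1),
            key=lambda pair: scores[pair],
        )
        order.append(tuple(sorted(best)))
        in_tree |= best
    return order
```

Each iteration greedily attaches the as-yet-unplaced sequence with the best alignment score against the tree, matching the "from highest score" ordering described above.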
MSA Snapshot
Intermediate step and final result
Use "+" for the neutral symbol
Then "-" for gaps in the MSA
Note the increase in gaps
PHMM from MSA
For the PHMM, we must determine the match and insert states & probabilities from the MSA
"Conservative" columns are match states
Half or less of the symbols are gaps
Other columns are insert states
Majority of the symbols are gaps
Delete states are a separate issue
PHMM States from MSA
Consider a simpler MSA...
Columns 1, 2, 6 are match states 1, 2, 3, respectively
Since less than half gaps
Columns 3, 4, 5 are combined to form insert state 2
Since more than half gaps
The insert state lies between match states
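The column classification rule above can be sketched as follows, assuming "-" marks a gap (a hypothetical helper, not code from the slides; merging consecutive insert columns into one insert state is left as a note):

```python
def column_states(msa):
    """Classify MSA columns: 'match' if half or fewer of the symbols
    are gaps ('-'), otherwise 'insert'.  Runs of consecutive insert
    columns would then be merged into a single insert state."""
    n = len(msa)                       # number of aligned sequences
    states = []
    for col in zip(*msa):              # iterate over columns
        gaps = sum(1 for c in col if c == '-')
        states.append('match' if gaps <= n / 2 else 'insert')
    return states
```

On a toy MSA like the one described above, columns 1, 2, 6 come out as match states and columns 3-5 as insert columns.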
PHMM Probabilities from MSA
Emission probabilities
Based on the symbol distribution in match and insert states
State transition probabilities
Based on transitions in the MSA
PHMM Probabilities from MSA
Emission probabilities:
But 0 probabilities are bad
The model "overfits" the data
So, use the "add one" rule
Add one to each numerator, add the total to the denominators
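The add-one rule for one state's emission probabilities can be sketched as below (the function name and example counts are illustrative, not from the slides):

```python
def emission_probs(counts, alphabet):
    """Add-one smoothing of emission counts for one PHMM state.

    The raw estimate count/total overfits: any symbol never seen in
    the MSA column would get probability 0.  So add one to each
    numerator and the alphabet size to the denominator.
    """
    total = sum(counts.get(sym, 0) for sym in alphabet)
    return {
        sym: (counts.get(sym, 0) + 1) / (total + len(alphabet))
        for sym in alphabet
    }
```

For example, counts {A: 3, C: 1} over a 4-symbol alphabet give A probability (3+1)/(4+4) = 1/2 and each unseen symbol 1/8, so no emission probability is ever 0.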
PHMM Probabilities from MSA
More emission probabilities:
But 0 probabilities are bad
The model "overfits" the data
Again, use the "add one" rule
Add one to each numerator, add the total to the denominators
PHMM Probabilities from MSA
Transition probabilities:
We look at some examples
Note that "-" indicates the delete state
First, consider the begin state:
Again, use the add one rule
PHMM Probabilities from MSA
Transition probabilities
When there is no information in the MSA, set the probabilities to uniform
For example, I1 does not appear in the MSA, so its transition probabilities are set to uniform
PHMM Probabilities from MSA
Transition probabilities, another example
What about transitions from state D1?
We can only go to M2, so a_{D1,M2} = 1
Again, use the add one rule:
PHMM Emission Probabilities
Emission probabilities for the given MSA
Using the add-one rule
PHMM Transition Probabilities
Transition probabilities for the given MSA
Using the add-one rule
PHMM Summary
Construct pairwise alignments
Usually, use dynamic programming
Use these to construct the MSA
Lots of ways to do this
Using the MSA, determine probabilities
Emission probabilities
State transition probabilities
In effect, we have trained a PHMM
Now what???
PHMM Scoring
We want to score sequences to see how closely they match the PHMM
How did we score sequences with an HMM?
Forward algorithm
How to score sequences with a PHMM?
Forward algorithm
But the algorithm is a little more complex
Due to the complex state transitions
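For reference, a minimal sketch of the plain-HMM forward algorithm that the PHMM version generalizes (the PHMM adds match/insert/delete bookkeeping); the function name and the toy parameters in the test are assumptions, not from the slides:

```python
def forward(A, B, pi, obs):
    """Standard HMM forward algorithm: returns P(O | lambda).

    A[i][j] = P(state j at t+1 | state i at t)
    B[i][k] = P(observing symbol k | state i)
    pi[i]   = initial state distribution
    obs     = observation sequence, as symbol indices
    """
    n = len(pi)
    # alpha[i] = P(O_0..O_t and state i at time t)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ]
    return sum(alpha)
```

In practice the sums are computed in log space (or scaled) to avoid underflow on long sequences; the PHMM forward algorithm adds separate recurrences for the match, insert, and delete states.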
Forward Algorithm
Notation
Indices i and j are columns in the MSA
x_i is the ith observation symbol
q_{x_i} is the distribution of x_i in the "random model"
Base case is F^M_0(0) = 0
F^M_j(i) is the score of x_1,...,x_i up to state j (note that in a PHMM, i and j may not agree)
Some states are undefined
Undefined states are ignored in the calculation
Forward Algorithm
Compute P(X | λ) recursively
Note that F^M_j(i) depends on F^M_{j-1}(i-1), F^I_{j-1}(i-1), and F^D_{j-1}(i-1)
And the corresponding state transition probabilities
PHMM
We will see examples of PHMMs later
In particular:
Malware detection based on opcodes
Masquerade detection based on UNIX commands
References
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, R. Durbin, et al.
Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security
Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169