Oct 2, 2013

Introduction to Profile Hidden Markov Models

Mark Stamp

1

PHMM

Hidden Markov Models


Here, we assume you know about HMMs

If not, see “A revealing introduction to hidden Markov models”

Executive summary of HMMs

An HMM is a machine learning technique

Also, a discrete hill climb technique

Train a model based on an observation sequence

Score a given sequence to see how closely it matches the model

Efficient algorithms, many useful applications

HMM Notation


Recall, an HMM model is denoted λ = (A, B, π)

The observation sequence is O

Notation:
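
A minimal sketch of scoring an observation sequence O against λ = (A, B, π) with the forward algorithm. It assumes the usual conventions: A is the N x N state transition matrix, B gives the probability of each observation symbol in each state, and π is the initial state distribution. The function name and the 2-state model are made up for illustration.

```python
def hmm_forward(A, B, pi, O):
    """Return P(O | lambda) for lambda = (A, B, pi) using the forward algorithm."""
    N = len(A)
    # alpha[i] = P(observations so far, current state is i)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for obs in O[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs]
            for j in range(N)
        ]
    return sum(alpha)


# Hypothetical model: 2 hidden states, 2 observation symbols (0 and 1)
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]

score = hmm_forward(A, B, pi, [0, 1, 0])
```

A higher value of `score` means the sequence is a closer match to the model. For long sequences a scaled or log-space version would be needed to avoid underflow.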

Hidden Markov Models


Among the many uses for HMMs

Speech analysis

Music search engine

Malware detection

Intrusion detection systems (IDS)

Many more, and more all the time

Limitations of HMMs

Positional information is not considered

An HMM has no “memory”

Higher order models have some memory

But no explicit use of positional information

Does not handle insertions or deletions

These limitations are serious problems in some applications

In bioinformatics string comparison, sequence alignment is critical

Also, insertions and deletions occur

Profile HMM


Profile HMM (PHMM) is designed to overcome the limitations on the previous slide

In some ways, a PHMM is easier than an HMM

In some ways, a PHMM is more complex

The basic idea of PHMM

Define multiple B matrices

Almost like having an HMM for each position in the sequence

PHMM


In bioinformatics, begin by aligning multiple related sequences

Multiple sequence alignment (MSA)

This is like the training phase for an HMM

Generate a PHMM based on the given MSA

Easy, once the MSA is known

Hard part is generating the MSA

Then can score sequences using the PHMM

Use the forward algorithm, as with an HMM

Generic View of PHMM


Circles are Delete states

Diamonds are Insert states

Rectangles are Match states

Match states correspond to HMM states

Arrows are possible transitions

Each transition has an associated probability

Transition probabilities are the A matrix

Emission probabilities are the B matrices

In a PHMM, observations are emissions

Match and insert states have emissions

Generic View of PHMM


Circles are Delete states, diamonds are Insert states, rectangles are Match states

Also, begin and end states
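
The arrows in the generic view can be sketched by listing the allowed transitions. This assumes the common PHMM layout in which the begin state acts as match state 0, the end state follows the last match state, and every match, insert, and delete state at position j can move to the next match state, the insert state at j, or the next delete state. The function name and state labels are hypothetical.

```python
def phmm_transitions(n_match):
    """List allowed (source, target) pairs for a PHMM with n_match match states.

    Assumed layout: "B" is the begin state (acts as M0), "E" is the end state,
    and insert states I0..IN sit between consecutive match positions.
    """
    def match(j):
        if j == 0:
            return "B"
        if j == n_match + 1:
            return "E"
        return f"M{j}"

    transitions = []
    for j in range(n_match + 1):
        sources = [match(j), f"I{j}"]
        if j >= 1:
            sources.append(f"D{j}")
        for src in sources:
            transitions.append((src, match(j + 1)))      # advance to next match (or end)
            transitions.append((src, f"I{j}"))           # emit an inserted symbol
            if j < n_match:
                transitions.append((src, f"D{j + 1}"))   # skip the next match position
    return transitions


arrows = phmm_transitions(3)
```

Each pair in `arrows` corresponds to one entry of the A matrix that can be nonzero; all other transitions have probability 0.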

PHMM Notation


Notation


PHMM


Match state probabilities are easily determined from the MSA, that is

$a_{M_i,M_{i+1}}$ transitions between match states

$e_{M_i}(k)$ emission probability at match state

Note: other transition probabilities

For example, $a_{M_i,I_i}$ and $a_{M_i,D_{i+1}}$

Emissions at all match & insert states

Remember, emission == observation

MSA


First we show MSA construction


This is the difficult part


Lots of ways to do this


“Best” way depends on specific problem


Then construct PHMM from MSA


The easy part


Standard algorithm for this


How to score a sequence?


Forward algorithm, similar to HMM


MSA


How to construct an MSA?

Construct pairwise alignments

Combine pairwise alignments to obtain the MSA

Allow gaps to be inserted

Makes better matches

But gaps tend to weaken scoring

So there is a tradeoff

Global vs Local Alignment

In these pairwise alignment examples

“-” is a gap

“|” marks aligned symbols

“*” marks omitted beginning and ending symbols

Global vs Local Alignment

Global alignment is lossless

But gaps tend to proliferate

And gaps increase when we do MSA

More gaps implies more sequences match

So, result is less useful for scoring

We usually only consider local alignment

That is, omit ends for better alignment

For simplicity, we assume global alignment here

Pairwise Alignment

We allow gaps when aligning

How to score an alignment?

Based on n x n substitution matrix S

Where n is the number of symbols

What algorithm(s) to align sequences?

Usually, dynamic programming

Sometimes, HMM is used

Other?

Local alignment: more issues

Pairwise Alignment

Example

Note gaps vs misaligned elements

Depends on S and gap penalty

Substitution Matrix


Masquerade detection


Detect imposter using an account


Consider 4 different operations


E == send email


G == play games


C == C programming


J == Java programming


How similar are these to each other?


Substitution Matrix


Consider 4 different operations:

E, G, C, J

Possible substitution matrix:

Diagonal: matches

High positive scores

Which others most similar?

J and C, so substituting C for J is a high score

Game playing/programming, very different

So substituting G for C is a negative score
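
A small sketch of what such a substitution matrix could look like in code. The numeric values are hypothetical, chosen only to follow the pattern described: high positive scores on the diagonal, a high score for C vs J, and negative scores for G against the programming operations. The names `SYMBOLS`, `S`, and `substitution_score` are illustrative.

```python
SYMBOLS = "EGCJ"

# Hypothetical scores; rows and columns are in the order E, G, C, J
S = [
    [9, -4, 2, 2],     # E == send email
    [-4, 9, -5, -5],   # G == play games
    [2, -5, 10, 7],    # C == C programming
    [2, -5, 7, 10],    # J == Java programming
]


def substitution_score(x, y):
    """Look up the score for aligning symbol x with symbol y."""
    return S[SYMBOLS.index(x)][SYMBOLS.index(y)]


def gapless_alignment_score(seq1, seq2):
    """Score two equal-length sequences aligned position by position (no gaps)."""
    return sum(substitution_score(x, y) for x, y in zip(seq1, seq2))
```

For example, `gapless_alignment_score("ECJ", "EJJ")` adds the scores for E/E, C/J, and J/J.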

Substitution Matrix


Depending on problem, might be easy or very difficult to get a useful S matrix

Consider masquerade detection based on UNIX commands

Sometimes difficult to say how “close” 2 commands are

Suppose aligning DNA sequences

Biological rationale for closeness of symbols

Gap Penalty


Generally must allow gaps to be inserted


But gaps make alignment more generic


So, less useful for scoring


Therefore, we penalize gaps


How to penalize gaps?


Linear gap penalty function


f(g) = dg (i.e., constant penalty per gap)

Affine gap penalty function

f(g) = a + e(g - 1)

Gap opening penalty a, then constant factor of e
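
The two gap penalty functions can be written directly, taking g to be the length of the gap. The function names are hypothetical, and returning 0 for an empty gap in the affine case is an assumption, since the formula is only meaningful for g >= 1.

```python
def linear_gap_penalty(g, d):
    """Linear penalty: f(g) = d * g."""
    return d * g


def affine_gap_penalty(g, a, e):
    """Affine penalty: f(g) = a + e * (g - 1) for g >= 1.

    Assumption: a gap of length 0 costs nothing.
    """
    if g == 0:
        return 0
    return a + e * (g - 1)
```

With a large opening penalty a and a small extension factor e, the affine function charges more for starting a new gap than for lengthening one that is already open.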

Pairwise Alignment Algorithm

We use dynamic programming

Based on S matrix, gap penalty function

Notation:

Pairwise Alignment DP

Initialization:

Recursion:
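
A minimal sketch of a dynamic program of this kind, assuming global alignment and a linear gap penalty d per gap symbol (a standard textbook formulation, not necessarily the exact one used here). The function names and the simple match/mismatch scoring are hypothetical stand-ins for the S matrix.

```python
def global_alignment_score(x, y, score, d):
    """Best global alignment score of x and y with linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: aligning a prefix against nothing costs one gap per symbol
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d

    # Recursion: align x[i-1] with y[j-1], or put a gap in y, or a gap in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F[n][m]


def simple_score(a, b):
    """Hypothetical substitution scores: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1
```

Calling `global_alignment_score("EGCJ", "ECJ", simple_score, 2)` aligns the three shared symbols and pays for one gap. Recovering the alignment itself would require a traceback through F, which is omitted here.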

MSA from Pairwise Alignments

Given pairwise alignments…

…how to construct MSA?

Generic approach is “progressive alignment”

Select one pairwise alignment

Select another and combine with first

Continue to add more until all are combined

Relatively easy (good)

Gaps may proliferate, unstable (bad)

MSA from Pairwise Alignments

Lots of ways to improve on generic progressive alignment

Here, we mention one such approach

Not necessarily “best” or most popular

Feng-Doolittle progressive alignment

Compute scores for all pairs of n sequences

Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores

Then generate a minimum spanning tree

For MSA, add sequences in the order that they appear in the spanning tree

MSA Construction


Create pairwise alignments

Generate substitution matrix

Dynamic program for pairwise alignments

Use pairwise alignments to make MSA

Use pairwise alignments to construct spanning tree (e.g., Prim’s Algorithm)

Add sequences to MSA in spanning tree order (from highest score, insert gaps as needed)

Note: gap penalty is used
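
The spanning tree step can be sketched with a Prim-style greedy loop. This version starts from the highest-scoring pair and repeatedly attaches the outside sequence with the best score to any sequence already in the tree, which is equivalent to a minimum spanning tree on negated scores. The function name and the 4-sequence score table are hypothetical, and at least two sequences are assumed.

```python
def spanning_tree_order(score):
    """Return (in_tree, new) pairs in the order sequences join the tree.

    score is a symmetric matrix of pairwise alignment scores (higher = closer).
    Assumes at least two sequences.
    """
    n = len(score)
    # Start from the highest-scoring pair
    first = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: score[p[0]][p[1]],
    )
    in_tree = set(first)
    order = [first]
    # Greedily attach the closest remaining sequence
    while len(in_tree) < n:
        i, j = max(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda p: score[p[0]][p[1]],
        )
        in_tree.add(j)
        order.append((i, j))
    return order


# Hypothetical pairwise scores for 4 sequences (indices 0..3)
scores = [
    [0, 85, 63, 74],
    [85, 0, 79, 70],
    [63, 79, 0, 82],
    [74, 70, 82, 0],
]
order = spanning_tree_order(scores)
```

The resulting `order` gives the sequence in which pairwise alignments would be merged into the MSA.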

MSA Example


Suppose 10 sequences, with the following pairwise alignment scores:

MSA Example: Spanning Tree


Spanning tree based on scores

So process pairs in following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

MSA Snapshot


Intermediate step and final

Use “+” for neutral symbol

Then “-” for gaps in MSA

Note increase in gaps

PHMM from MSA


For PHMM, must determine match and insert states & probabilities from MSA

“Conservative” columns are match states

Half or less of symbols are gaps

Other columns are insert states

Majority of symbols are gaps

Delete states are a separate issue

PHMM States from MSA


Consider a simpler MSA…

Columns 1,2,6 are match states 1,2,3, respectively

Since less than half gaps

Columns 3,4,5 are combined to form insert state 2

Since more than half gaps

Match states between insert
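
The column rule can be sketched as follows: a column is a match state when half or fewer of its symbols are gaps, and consecutive insert columns are merged into one insert state numbered by the preceding match state. The MSA below is hypothetical, constructed so that columns 1, 2, 6 come out as match states and columns 3, 4, 5 as insert state 2; the function names are also illustrative.

```python
def classify_columns(msa, gap="-"):
    """Label each MSA column "M" (match) or "I" (insert) by its gap fraction."""
    n_rows = len(msa)
    labels = []
    for column in zip(*msa):
        gaps = sum(1 for c in column if c == gap)
        labels.append("M" if 2 * gaps <= n_rows else "I")
    return labels


def group_states(labels):
    """Map state indices to 1-based column numbers.

    Returns (match_cols, insert_cols): match state k -> column,
    insert state k -> list of merged columns following match state k.
    """
    match_cols, insert_cols = {}, {}
    m = 0
    for col_number, label in enumerate(labels, start=1):
        if label == "M":
            m += 1
            match_cols[m] = col_number
        else:
            insert_cols.setdefault(m, []).append(col_number)
    return match_cols, insert_cols


# Hypothetical MSA: 5 sequences, 6 columns
msa = ["EC---G", "E-C--G", "EC-JEG", "-C---J", "EG--C-"]
labels = classify_columns(msa)
match_cols, insert_cols = group_states(labels)
```

Here `match_cols` maps match states 1, 2, 3 to columns 1, 2, 6, and `insert_cols` maps insert state 2 to columns 3, 4, 5.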

PHMM Probabilities from MSA


Emission probabilities

Based on symbol distribution in match and insert states

State transition probabilities

Based on transitions in the MSA

PHMM Probabilities from MSA


Emission probabilities:

But 0 probabilities are bad

Model “overfits” the data

So, use “add one” rule

Add one to each numerator, add the total number of symbols to each denominator

PHMM Probabilities from MSA


More emission probabilities:

But 0 probabilities are bad

Model “overfits” the data

Again, use “add one” rule

Add one to each numerator, add the total number of symbols to each denominator
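
A short sketch of the add-one rule for emission probabilities, assuming the alphabet E, G, C, J from the earlier masquerade example. The function name and the sample column are hypothetical. For an insert state, the symbols from all merged columns would be passed in together.

```python
from collections import Counter


def emission_probs(symbols, alphabet, gap="-"):
    """Add-one emission estimates from the symbols observed in a state.

    Each count gets +1 and the denominator grows by the alphabet size,
    so no symbol ends up with probability 0.
    """
    counts = Counter(s for s in symbols if s != gap)
    total = sum(counts.values())
    return {s: (counts[s] + 1) / (total + len(alphabet)) for s in alphabet}


# Hypothetical match column: four E's and one gap
probs = emission_probs("EEE-E", "EGCJ")
```

Without the rule, E would get probability 4/4 and the other symbols 0/4; with it, E gets 5/8 and each of the others 1/8.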

PHMM Probabilities from MSA


Transition probabilities:

We look at some examples

Note that “-” is delete state

First, consider begin state:

Again, use add one rule

PHMM Probabilities from MSA


Transition probabilities

When no information in MSA, set probabilities to uniform

For example, $I_1$ does not appear in MSA, so $a_{I_1,M_2} = a_{I_1,I_1} = a_{I_1,D_2} = 1/3$

PHMM Probabilities from MSA


Transition probabilities, another example

What about transitions from state $D_1$?

Can only go to $M_2$, so

Again, use add one rule:
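
The add-one rule for transitions can be sketched the same way, assuming each state has three possible targets (next match, insert, next delete). The function name and the counts are illustrative: a single observed transition to the match state stands in for the case where a state only ever goes to the next match state in the MSA.

```python
def transition_probs(counts, targets=("M", "I", "D")):
    """Add-one transition estimates from observed transition counts.

    counts maps a target type ("M", "I", "D") to the number of times the
    transition was seen in the MSA; missing entries count as 0.
    """
    total = sum(counts.get(t, 0) for t in targets)
    return {t: (counts.get(t, 0) + 1) / (total + len(targets)) for t in targets}


# One observed transition, always to the next match state
from_delete = transition_probs({"M": 1})

# State never appears in the MSA: no counts at all
unseen = transition_probs({})
```

With one observed transition the estimates become 2/4, 1/4, 1/4 instead of 1, 0, 0. With no observations the rule yields 1/3 for each target, which matches the uniform setting used when the MSA carries no information.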

PHMM Emission Probabilities


Emission probabilities for the given MSA

Using add-one rule

PHMM Transition Probabilities


Transition probabilities for the given MSA

Using add-one rule

PHMM Summary


Construct pairwise alignments


Usually, use dynamic programming


Use these to construct MSA


Lots of ways to do this


Using MSA, determine probabilities


Emission probabilities


State transition probabilities


In effect, we have trained a PHMM


Now what???


PHMM Scoring


Want to score sequences to see how
closely they match PHMM


How did we score sequences with HMM?


Forward algorithm


How to score sequences with PHMM?


Forward algorithm


But, algorithm is a little more complex


Due to complex state transitions


Forward Algorithm


Notation

Indices i and j are columns in MSA

$x_i$ is the i-th observation symbol

$q_{x_i}$ is distribution of $x_i$ in “random model”

Base case is $F^M_0(0) = 0$

$F^M_j(i)$ is score of $x_1,…,x_i$ up to state j (note that in PHMM, i and j may not agree)

Some states undefined

Undefined states ignored in calculation

Forward Algorithm


Compute P(X|λ) recursively

Note that $F^M_j(i)$ depends on $F^M_{j-1}(i-1)$, $F^I_{j-1}(i-1)$, and $F^D_{j-1}(i-1)$

And corresponding state transition probabilities
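
A sketch of a PHMM forward pass, following the standard log-odds formulation in Durbin et al. rather than any specific equations from these slides. To keep it short, it works with odds ratios (the exponentials of the log-odds scores) and takes the log only at the end, so the base case is 1 rather than 0. The function name, the state labels ("M0" for begin, "M{N+1}" for end), and the dictionary layout of the parameters are all assumptions.

```python
import math


def phmm_forward_log_odds(x, N, a, e_match, e_insert, q):
    """Log-odds forward score of sequence x for a PHMM with N match states.

    a:        dict (src, dst) -> transition probability; missing pairs are 0
    e_match:  list, e_match[j][sym] for j = 1..N (index 0 unused)
    e_insert: list, e_insert[j][sym] for j = 0..N
    q:        dict sym -> probability under the random model
    """
    L = len(x)
    # f*[j][i]: odds score of x[0:i] ending in state j; 0.0 = undefined/unreachable
    fM = [[0.0] * (L + 1) for _ in range(N + 1)]
    fI = [[0.0] * (L + 1) for _ in range(N + 1)]
    fD = [[0.0] * (L + 1) for _ in range(N + 1)]
    fM[0][0] = 1.0  # base case (log-odds score 0)

    def t(src, dst):
        return a.get((src, dst), 0.0)

    for i in range(L + 1):
        for j in range(N + 1):
            if i >= 1 and j >= 1:  # match state emits x[i-1]
                fM[j][i] = e_match[j][x[i - 1]] / q[x[i - 1]] * (
                    t(f"M{j-1}", f"M{j}") * fM[j - 1][i - 1]
                    + t(f"I{j-1}", f"M{j}") * fI[j - 1][i - 1]
                    + t(f"D{j-1}", f"M{j}") * fD[j - 1][i - 1]
                )
            if i >= 1:  # insert state emits x[i-1], stays at position j
                fI[j][i] = e_insert[j][x[i - 1]] / q[x[i - 1]] * (
                    t(f"M{j}", f"I{j}") * fM[j][i - 1]
                    + t(f"I{j}", f"I{j}") * fI[j][i - 1]
                    + t(f"D{j}", f"I{j}") * fD[j][i - 1]
                )
            if j >= 1:  # delete state emits nothing, advances position j
                fD[j][i] = (
                    t(f"M{j-1}", f"D{j}") * fM[j - 1][i]
                    + t(f"I{j-1}", f"D{j}") * fI[j - 1][i]
                    + t(f"D{j-1}", f"D{j}") * fD[j - 1][i]
                )

    end = f"M{N+1}"
    total = (
        t(f"M{N}", end) * fM[N][L]
        + t(f"I{N}", end) * fI[N][L]
        + t(f"D{N}", end) * fD[N][L]
    )
    return math.log(total) if total > 0 else float("-inf")


# Hypothetical one-match-state model over symbols "A" and "B"
a = {("M0", "M1"): 0.9, ("M0", "I0"): 0.1, ("I0", "M1"): 1.0, ("M1", "M2"): 1.0}
e_match = [None, {"A": 0.8, "B": 0.2}]
e_insert = [{"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}]
q = {"A": 0.5, "B": 0.5}
score = phmm_forward_log_odds("AA", 1, a, e_match, e_insert, q)
```

A positive score means the sequence is more likely under the PHMM than under the random model. For realistic sequence lengths the recursion would normally be carried out in log space to avoid overflow and underflow.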

PHMM


We will see examples of PHMM later

In particular,

Malware detection based on opcodes

Masquerade detection based on UNIX commands

References


Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin, et al

Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security

Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169