2 Oct 2013
Introduction to Profile
Hidden Markov Models

Mark Stamp

PHMM

Hidden Markov Models

Here, we assume you know about HMMs

If not, see “A revealing introduction to hidden
Markov models”

Executive summary of HMMs

HMM is a machine learning technique

Also, a discrete hill climb technique

Train model based on observation sequence

Score given sequence to see how closely it
matches the model

Efficient algorithms, many useful applications
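The scoring step above can be sketched with a minimal forward algorithm; the model values below are hypothetical, chosen only to illustrate.

```python
def hmm_score(obs, A, B, pi):
    """Forward algorithm: P(O | lambda) for HMM lambda = (A, B, pi).
    A[i][j] = state transition probs, B[i][k] = observation probs,
    pi[i] = initial state distribution, obs = observation sequence."""
    N = len(pi)
    # alpha[i] = P(o_0, ..., o_t and state i at time t)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)

# Hypothetical 2-state model over a binary observation alphabet {0, 1}
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
p = hmm_score([0, 1, 0], A, B, pi)
```

A higher score means the sequence matches the trained model more closely; in practice the computation is done in log space to avoid underflow.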


HMM Notation

Recall, HMM model denoted λ = (A, B, π)

Observation sequence is O

Notation:

T = length of the observation sequence
N = number of states in the model
M = number of observation symbols
A = state transition probability matrix
B = observation probability matrix
π = initial state distribution
O = (O0, O1, …, OT−1) = observation sequence


Hidden Markov Models

Among the many uses for HMMs

Speech analysis

Music search engine

Malware detection

Intrusion detection systems (IDS)

Many more, and more all the time


Limitations of HMMs

Positional information not considered

HMM has no “memory”

Higher order models have some memory

But no explicit use of positional information

Does not handle insertions or deletions

These limitations are serious problems in
some applications

In bioinformatics string comparison, sequence
alignment is critical

Also, insertions and deletions occur


Profile HMM

Profile HMM (PHMM) designed to
overcome limitations on previous slide

In some ways, PHMM easier than HMM

In some ways, PHMM more complex

The basic idea of PHMM

Define multiple B matrices

Almost like having an HMM for each position in sequence


PHMM

In bioinformatics, begin by aligning
multiple related sequences

Multiple sequence alignment (MSA)

This is like training phase for HMM

Generate PHMM based on given MSA

Easy, once MSA is known

Hard part is generating MSA

Then can score sequences using PHMM

Use forward algorithm, like HMM


Generic View of PHMM

Circles are Delete states

Diamonds are Insert states

Rectangles are Match states

Match states correspond to HMM states

Arrows are possible transitions

Each transition has associated probability

Transition probabilities are the A matrix

Emission probabilities are the B matrices

In PHMM, observations are emissions

Match and insert states have emissions


Generic View of PHMM

Circles are Delete states, diamonds are Insert states, rectangles are Match states

Also, begin and end states


PHMM Notation

Notation:

M_i, I_i, D_i = match, insert, and delete states
a_{p,q} = transition probability from state p to state q
e_p(k) = emission probability of symbol k at state p


PHMM

Match state probabilities easily determined from MSA, that is

a_{Mi,Mi+1}, transitions between match states

e_{Mi}(k), emission probability of symbol k at match state M_i

Note: other transition probabilities

For example, a_{Mi,Ii} and a_{Mi,Di+1}

Emissions at all match & insert states

Remember, emission == observation


MSA

First we show MSA construction

This is the difficult part

Lots of ways to do this

“Best” way depends on specific problem

Then construct PHMM from MSA

The easy part

Standard algorithm for this

How to score a sequence?

Forward algorithm, similar to HMM


MSA

How to construct MSA?

Construct pairwise alignments

Combine pairwise alignments to obtain MSA

Allow gaps to be inserted

Makes better matches

But gaps tend to weaken scoring


Global vs Local Alignment

In these pairwise alignment examples

"-" is a gap

"|" marks aligned symbols

"*" marks omitted beginning and ending symbols


Global vs Local Alignment

Global alignment is lossless

But gaps tend to proliferate

And gaps increase when we do MSA

More gaps implies more sequences match

So, result is less useful for scoring

We usually only consider local alignment

That is, omit ends for better alignment

For simplicity, we assume global
alignment here


Pairwise Alignment

We allow gaps when aligning

How to score an alignment?

Based on n x n substitution matrix S

Where n is number of symbols

What algorithm(s) to align sequences?

Usually, dynamic programming

Sometimes, HMM is used

Other?

Local alignment raises more issues


Pairwise Alignment Example

Note gaps vs misaligned elements

Depends on S and gap penalty


Substitution Matrix

Example: detect an imposter using an account

Consider 4 different operations

E == send email

G == play games

C == C programming

J == Java programming

How similar are these to each other?


Substitution Matrix

Consider 4 different operations:

E, G, C, J

Possible substitution matrix:

Diagonal entries are matches

High positive scores

Which others most similar?

J and C, so substituting C for J is a high score

Game playing/programming, very different

So substituting G for C is a negative score
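The matrix itself did not survive extraction; the sketch below encodes the qualitative rules on this slide (high diagonal, C/J similar, G vs programming negative) with made-up numeric scores.

```python
# Hypothetical substitution matrix for operations E, G, C, J.
# Numeric values are illustrative, not from the original slides.
S = {
    ('E', 'E'): 9, ('G', 'G'): 9, ('C', 'C'): 9, ('J', 'J'): 9,
    ('C', 'J'): 6,                  # C and Java programming: quite similar
    ('E', 'C'): -1, ('E', 'J'): -1,
    ('E', 'G'): -3,
    ('G', 'C'): -4, ('G', 'J'): -4, # games vs programming: very different
}

def sub_score(a, b):
    """Symmetric lookup into the substitution matrix."""
    return S.get((a, b), S.get((b, a)))
```

Any symmetric table with this shape works; what matters for alignment is the relative ordering of the scores.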


Substitution Matrix

Depending on problem, might be easy or very difficult to get useful S matrix

UNIX commands

Sometimes difficult to say how “close” 2
commands are

Suppose aligning DNA sequences

Biological rationale for closeness of symbols


Gap Penalty

Generally must allow gaps to be inserted

But gaps make alignment more generic

So, less useful for scoring

Therefore, we penalize gaps

How to penalize gaps?

Linear gap penalty function

f(g) = dg (i.e., constant penalty d per gap)

Affine gap penalty function

f(g) = a + e(g - 1)

Gap opening penalty a, then constant penalty e for each additional gap

Pairwise Alignment Algorithm

We use dynamic programming

Based on S matrix, gap penalty function

Notation:


Pairwise Alignment DP

Initialization:


Recursion:
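The initialization and recursion formulas did not survive extraction; the sketch below is the standard Needleman-Wunsch global-alignment DP with a linear gap penalty d, which may differ in detail from the exact variant on these slides.

```python
def align(x, y, score, d=2):
    """Global (Needleman-Wunsch) alignment score with linear gap penalty d.
    score(a, b) is a substitution-matrix lookup."""
    m, n = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i          # initialization: x[:i] against all gaps
    for j in range(1, n + 1):
        F[0][j] = -d * j          # initialization: y[:j] against all gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i-1][j-1] + score(x[i-1], y[j-1]),  # align x_i with y_j
                          F[i-1][j] - d,                        # gap in y
                          F[i][j-1] - d)                        # gap in x
    return F[m][n]
```

Keeping back-pointers at each max would recover the alignment itself, not just its score.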

MSA from Pairwise Alignments

Given pairwise alignments…

…how to construct MSA?

Generic approach is “progressive
alignment”

Select one pairwise alignment

Select another and combine with first

Continue to add more until all are combined

Relatively easy (good)


MSA from Pairwise Alignments

Lots of ways to improve on generic
progressive alignment

Here, we mention one such approach

Not necessarily “best” or most popular

Feng-Doolittle progressive alignment

Compute scores for all pairs of n sequences

Select n - 1 alignments that (a) "connect" all sequences and (b) maximize pairwise scores

Then generate a minimum spanning tree

For MSA, add sequences in the order that
they appear in the spanning tree


MSA Construction

Create pairwise alignments

Generate substitution matrix

Dynamic program for pairwise alignments

Use pairwise alignments to make MSA

Use pairwise alignments to construct spanning tree (e.g., Prim's Algorithm)

Add sequences to MSA in spanning tree order (from highest score, insert gaps as needed)

Note: gap penalty is used
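The spanning-tree step can be sketched with Prim's algorithm, adapted to maximize pairwise alignment scores (higher score = more similar); the starting sequence and score values here are arbitrary illustrations.

```python
def spanning_order(scores, n):
    """Prim's algorithm over pairwise alignment scores, maximizing.
    scores[(i, j)] holds the score for sequence pair (i, j), i < j.
    Returns the tree edges in the order they are added."""
    def s(i, j):
        return scores.get((i, j), scores.get((j, i), float('-inf')))
    in_tree = {1}            # start from an arbitrary sequence
    order = []
    while len(in_tree) < n:
        # pick the highest-scoring edge from the tree to an outside sequence
        i, j = max(((a, b) for a in in_tree
                    for b in range(1, n + 1) if b not in in_tree),
                   key=lambda e: s(*e))
        order.append((i, j))
        in_tree.add(j)
    return order
```

Sequences are then merged into the MSA following this edge order, so each new sequence joins via its best-aligned partner already in the MSA.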


MSA Example

Suppose 10 sequences, with the following pairwise alignment scores:


MSA Example: Spanning Tree

Spanning tree
based on scores

So process pairs
in following order:
(5,4), (5,8), (8,3),
(3,2), (2,7), (2,1),
(1,6), (6,10), (10,9)


MSA Snapshot

Intermediate
step and final

Use “+” for
neutral symbol

Then "-" for gaps in MSA

Note increase
in gaps


PHMM from MSA

For PHMM, must determine match and
insert states & probabilities from MSA

“Conservative” columns are match states

Half or less of symbols are gaps

Other columns are insert states

Majority of symbols are gaps

Delete states are a separate issue
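The column rule above is mechanical; a minimal sketch, with '-' as the gap symbol and a hypothetical example MSA:

```python
def classify_columns(msa):
    """Classify each MSA column: 'match' if half or fewer of its symbols
    are gaps ('-'), else 'insert', per the rule on this slide."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):               # iterate columns of the alignment
        gaps = col.count('-')
        labels.append('match' if gaps * 2 <= n_rows else 'insert')
    return labels

# Hypothetical 4-sequence MSA
msa = ["AC--T", "AG-GT", "A--CT", "ACG-T"]
labels = classify_columns(msa)
```

Consecutive insert columns are then merged into a single insert state between the neighboring match states.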


PHMM States from MSA

Consider a simpler MSA…

Columns 1,2,6 are match
states 1,2,3, respectively

Since less than half gaps

Columns 3,4,5 are combined
to form insert state 2

Since more than half gaps

Match states between insert


PHMM Probabilities from MSA

Emission probabilities

Based on symbol
distribution in match and
insert states

State transition probs

Based on transitions in the MSA


PHMM Probabilities from MSA

Emission probabilities:

Model "overfits" the data


PHMM Probabilities from MSA

More emission probabilities:

Model "overfits" the data


PHMM Probabilities from MSA

Transition probabilities:

We look at some examples

Note that "-" is delete state

First, consider begin state:


PHMM Probabilities from MSA

Transition probabilities

When no information in MSA, set probs to uniform

For example, I_1 does not appear in MSA, so we set each of its three outgoing transition probabilities to 1/3


PHMM Probabilities from MSA

Transition probabilities, another example

What about transitions from state D_1?

Can only go to M_2, so a_{D1,M2} = 1


PHMM Emission Probabilities

Emission probabilities for the given MSA, using the add-one rule
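The add-one rule (Laplace smoothing) counts every alphabet symbol once extra, so symbols unseen in a column still get small nonzero probability; this counters the overfitting noted on the previous slides. A minimal sketch with a hypothetical column:

```python
def emissions_add_one(column, alphabet):
    """Emission probabilities for one match-state column of the MSA,
    using the add-one rule: add 1 to each symbol's count so that
    unseen symbols get a small nonzero probability."""
    symbols = [c for c in column if c != '-']   # gaps do not emit
    total = len(symbols) + len(alphabet)
    return {s: (symbols.count(s) + 1) / total for s in alphabet}

# Hypothetical column with symbols A, A, B and one gap
probs = emissions_add_one("AAB-", "AB")   # {'A': 3/5, 'B': 2/5}
```

The same add-one idea applies to the transition counts.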


PHMM Transition Probabilities

Transition probabilities for the given MSA, using the add-one rule


PHMM Summary

Construct pairwise alignments

Usually, use dynamic programming

Use these to construct MSA

Lots of ways to do this

Using MSA, determine probabilities

Emission probabilities

State transition probabilities

In effect, we have trained a PHMM

Now what???


PHMM Scoring

Want to score sequences to see how
closely they match PHMM

How did we score sequences with HMM?

Forward algorithm

How to score sequences with PHMM?

Forward algorithm

But, algorithm is a little more complex

Due to complex state transitions


Forward Algorithm

Notation

Indices i and j are columns in MSA

x_i is the i-th observation symbol

q_{x_i} is distribution of x_i in "random model"

Base case is F_0^M(0) = 0

F_j^M(i), F_j^I(i), F_j^D(i) give the score of x_1, …, x_i up to state j (note that in PHMM, i and j may not agree)

Some states undefined

Undefined states ignored in calculation


Forward Algorithm

Compute P(X | λ) recursively

Note that F_j^M(i) depends on F_{j-1}^M(i-1), F_{j-1}^I(i-1), and F_{j-1}^D(i-1)

And the corresponding state transition probs
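The recursion equations themselves were lost in extraction; the sketch below follows the standard profile-HMM forward recursion (as in Durbin et al.), in plain probability space rather than the usual log-odds form, and the state encoding and data layout are my own assumptions.

```python
def phmm_forward(seq, N, a, eM, eI):
    """Forward algorithm for a profile HMM with N match states.
    a[(p, q)]: transition probability from state p to state q, where
    states are tuples like ('M', 1), ('I', 0), ('D', 2); ('M', 0) is
    Begin and ('M', N+1) is End.  eM[j], eI[j]: emission dicts.
    Returns P(seq | model)."""
    n = len(seq)
    t = lambda p, q: a.get((p, q), 0.0)
    # f*[j][i] = P(first i symbols emitted, currently in that state)
    fM = [[0.0] * (n + 1) for _ in range(N + 1)]
    fI = [[0.0] * (n + 1) for _ in range(N + 1)]
    fD = [[0.0] * (n + 1) for _ in range(N + 1)]
    fM[0][0] = 1.0                        # Begin state
    for j in range(N + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:         # match state M_j emits seq[i-1]
                fM[j][i] = eM[j].get(seq[i-1], 0.0) * (
                    t(('M', j-1), ('M', j)) * fM[j-1][i-1]
                    + t(('I', j-1), ('M', j)) * fI[j-1][i-1]
                    + t(('D', j-1), ('M', j)) * fD[j-1][i-1])
            if j >= 1:                    # delete state D_j emits nothing
                fD[j][i] = (t(('M', j-1), ('D', j)) * fM[j-1][i]
                            + t(('I', j-1), ('D', j)) * fI[j-1][i]
                            + t(('D', j-1), ('D', j)) * fD[j-1][i])
            if i >= 1:                    # insert state I_j emits seq[i-1]
                fI[j][i] = eI[j].get(seq[i-1], 0.0) * (
                    t(('M', j), ('I', j)) * fM[j][i-1]
                    + t(('I', j), ('I', j)) * fI[j][i-1]
                    + t(('D', j), ('I', j)) * fD[j][i-1])
    end = ('M', N + 1)
    return (t(('M', N), end) * fM[N][n]
            + t(('I', N), end) * fI[N][n]
            + t(('D', N), end) * fD[N][n])

# Toy model: 2 match states, no insert/delete transitions used
a = {(('M', 0), ('M', 1)): 1.0,
     (('M', 1), ('M', 2)): 1.0,
     (('M', 2), ('M', 3)): 1.0}
eM = {1: {'A': 0.7, 'B': 0.3}, 2: {'A': 0.2, 'B': 0.8}}
eI = {0: {}, 1: {}, 2: {}}
p = phmm_forward('AB', 2, a, eM, eI)
```

In practice this is computed in log space with the q_{x_i} random-model normalization, but the dependency structure is exactly the one described above.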


PHMM

We will see examples of PHMM later

In particular,

Malware detection based on opcodes

Masquerade detection based on commands


References

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, R. Durbin, et al.

Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security

Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169
