BLAST; Hidden Markov Models in Sequence Alignment (1)


Hidden Markov Models (1)

- Brief review of discrete time finite Markov Chain
- Hidden Markov Model
- Examples of HMM in Bioinformatics
- Estimations

Basic Local Alignment Search Tool (BLAST)

- The strategy
- Important parameters
- Example
- Variations

BLAST

“Basic Local Alignment Search Tool”

- Given a sequence, how do we find its ‘relatives’ in a database of millions of sequences?
- An exhaustive search using pairwise alignment is computationally prohibitive.
- BLAST is a heuristic method that finds good, but not necessarily optimal, matches. It is fast.
- Basis: good alignments should contain “words” that match each other very well (identical, or with a very high alignment score).

BLAST

(Figure from Pandit et al., In Silico Biology 4, 0047.)

BLAST

- The scoring matrix: for example, BLOSUM62.
- The expected score of two random sequences aligned to each other is negative.
- High scores of alignments between two random sequences follow the Extreme Value Distribution.

BLAST

The general idea of the original BLAST:

- Find word matches between the query and the database. This is done efficiently using a pre-processed database and a hashing technique.
- Extend the matches to find longer local alignments.
- Report results exceeding a predefined threshold.

http://www.cbi.pku.edu.cn/images/blast_algorithm.gif
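These three steps can be illustrated with a minimal Python sketch (not the NCBI implementation): it pre-processes a toy "database" sequence into a hash of words, looks up exact word matches for the query, and extends each hit in both directions while the running score stays near its best. The toy sequences, the +1/-1 scoring, and the drop-off value are invented for illustration; real BLAST scores words with a substitution matrix and also seeds on near-matching "neighborhood" words above the threshold t described below.

```python
from collections import defaultdict

W = 4  # word size (3~5 is typical for proteins, 11~12 for nucleotides)

def index_words(db_seq, w=W):
    """Pre-process the database: map every word of length w to its positions."""
    index = defaultdict(list)
    for i in range(len(db_seq) - w + 1):
        index[db_seq[i:i + w]].append(i)
    return index

def find_hits(query, index, w=W):
    """Find exact word matches (real BLAST also allows high-scoring neighbor words)."""
    hits = []
    for q in range(len(query) - w + 1):
        for d in index.get(query[q:q + w], []):
            hits.append((q, d))
    return hits

def extend_hit(query, db_seq, q, d, w=W, match=1, mismatch=-1, dropoff=3):
    """Ungapped extension of a hit in both directions until the score drops too far."""
    def run(step, qi, di):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= di < len(db_seq):
            score += match if query[qi] == db_seq[di] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            if best - score >= dropoff:
                break
            qi += step
            di += step
        return best, best_len
    seed_score = w * match
    left_score, left_len = run(-1, q - 1, d - 1)
    right_score, right_len = run(+1, q + w, d + w)
    total = seed_score + left_score + right_score
    return total, query[q - left_len:q + w + right_len]

if __name__ == "__main__":
    database = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy "database" sequence
    query = "AKQRQISWVKSH"                            # toy query
    idx = index_words(database)
    for q, d in find_hits(query, idx):
        score, segment = extend_hit(query, database, q, d)
        print(f"hit at query {q}, db {d}: score {score}, segment {segment}")
```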

BLAST

Two key parameters:

- w: the word size. A word size of 3~5 is normally used for proteins; a word size of 3 yields 20^3 = 8000 possible words. A word size of 11~12 is used for nucleotides.
- t: the threshold of the word-matching score. A higher t gives fewer word matches, hence lower sensitivity and higher specificity.

BLAST

(Figure: the BLAST search pipeline, showing a threshold for extension and a threshold for reporting a significant hit.)

BLAST

How significant is a hit?

- S: the raw alignment score.
- S': the bit score, a normalized score. Bit scores from different alignments, or from different scoring matrices, are comparable.

      S' = (λS - ln K) / ln 2

- K and λ: constants adjusting for the scoring matrix.
- E: the expected value, the number of chance alignments with scores equivalent to or better than S'.

      E = m N 2^(-S')

- m: the query size.
- N: the size of the database.
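As a numeric illustration of these two formulas, here is a minimal sketch that converts a raw score S into a bit score S' and then into an E-value. The values of λ, K, the raw score, the query length m, and the database size N below are made-up placeholders, not numbers from a real BLOSUM62 search.

```python
import math

def bit_score(S, lam, K):
    """S' = (lambda*S - ln K) / ln 2"""
    return (lam * S - math.log(K)) / math.log(2)

def e_value(S_prime, m, N):
    """E = m * N * 2^(-S')"""
    return m * N * 2.0 ** (-S_prime)

if __name__ == "__main__":
    lam, K = 0.267, 0.041      # scoring-system constants (placeholder values)
    S = 58                     # raw alignment score (placeholder)
    m = 250                    # query length (placeholder)
    N = 5_000_000              # database size in residues (placeholder)

    Sp = bit_score(S, lam, K)
    print(f"bit score S' = {Sp:.2f}")
    print(f"E-value      = {e_value(Sp, m, N):.3g}")
```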

BLAST

Blasting a sample sequence. (Figures: BLAST output screenshots.)

The “two
-
hit” method.


Two sequences with good similarity should share >1 word.


Only extend matches when two hits occur on the same
diagonal within certain distance.


It reduces sensitivity.

Pair it with lower per
-
word threshold t.
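A minimal sketch of the two-hit rule, assuming hits are given as (query position, database position) pairs: a hit is kept for extension only if another hit lies on the same diagonal (database position minus query position) within a window. The window size and the hit list below are invented for illustration.

```python
from collections import defaultdict

def two_hit_seeds(hits, window=40):
    """Keep hits that have a second hit on the same diagonal within `window`
    positions; only these would be passed on to the extension step."""
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(d)
    seeds = []
    for diag, positions in by_diag.items():
        positions.sort()
        for prev, cur in zip(positions, positions[1:]):
            if 0 < cur - prev <= window:
                seeds.append((cur - diag, cur))   # report the second hit of the pair
    return seeds

if __name__ == "__main__":
    # (query_pos, db_pos) word hits; invented numbers.
    hits = [(3, 103), (20, 120), (50, 400), (7, 210)]
    print(two_hit_seeds(hits))   # -> [(20, 120)]: two hits on diagonal 100, 17 apart
```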

BLAST

There are many variants with different specialties. For example:

- Gapped BLAST: allows the detection of gapped alignments.
- PSI-BLAST (“Position Specific Iterated BLAST”): detects weak relationships by constructing a Position Specific Score Matrix automatically.
- ……


Discrete time finite Markov Chain

- Possible states: a finite discrete set S = {E1, E2, ..., Es}.
- From time t to t+1, the process makes a stochastic movement from one state to another.
- The Markov property: if at time t the process is at Ej, then the probability that it is at Ek at time t+1 depends only on Ej.
- The temporally homogeneous transition probabilities property: Prob(Ej -> Ek) is independent of the time t.

Discrete time finite Markov Chain

The transition probability matrix P:

    P_ij = Prob(Ei -> Ej)

An N-step transition:

    P_ij(N) = Prob(Ei -> ... -> Ej), the probability of going from Ei to Ej in N steps.

It can be shown that P(N) = P^N, the Nth power of the one-step transition matrix.

Discrete time finite Markov Chain

Consider the two step transition probability from
Ei

to
Ej
:




This is the
ij
th

element of


So, the two
-
step transition matrix


Extending this argument, we have
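A quick numerical check of this identity, using the two-state transition matrix from the HMM example later in these notes (E1->E1 = 0.9, E1->E2 = 0.1, E2->E1 = 0.8, E2->E2 = 0.2):

```python
import numpy as np

# One-step transition matrix P: rows = current state, columns = next state.
P = np.array([[0.9, 0.1],
              [0.8, 0.2]])

# Two-step transition probabilities: P_ij(2) = sum_k P_ik * P_kj
P2 = P @ P
print(P2)

# N-step transition matrix as the Nth matrix power.
N = 5
PN = np.linalg.matrix_power(P, N)
print(PN)

# Each row still sums to 1, as a transition matrix should.
print(PN.sum(axis=1))   # -> [1. 1.]
```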



Absorbing states:

- p_ii = 1. Once the chain enters such a state, it stays there.
- We don't consider this case here.

Discrete time finite Markov Chain

The stationary state (stationary distribution) π:

    π P = π,  with the entries of π summing to 1

- The probability of being at each state stays constant over time; such a π is the stationary state.
- For finite, aperiodic, irreducible Markov chains, π exists and is unique.
- Periodic: a state can only be returned to at times t0, 2t0, 3t0, ......, with t0 > 1.
- Irreducible: any state can eventually be reached from any state.
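A small sketch that finds the stationary distribution of the same two-state matrix, by solving πP = π together with the normalization constraint, and checks it against a high matrix power; the answer is π = (8/9, 1/9).

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.8, 0.2]])
S = P.shape[0]

# Solve pi P = pi together with sum(pi) = 1.
# pi P = pi is equivalent to (P.T - I) pi = 0, plus the normalization row.
A = np.vstack([P.T - np.eye(S), np.ones(S)])
b = np.zeros(S + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                                # -> [0.888... 0.111...] = (8/9, 1/9)

# Sanity check: every row of a high matrix power converges to pi.
print(np.linalg.matrix_power(P, 50))
```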

Discrete time finite Markov Chain

Note:

- The Markov chain is an elegant model for many situations in bioinformatics, e.g. evolutionary processes, sequence analysis, etc.
- BUT the two Markov assumptions may not hold true.

Hidden Markov Model

- More general than a Markov model.
- A discrete time Markov model with extra features.

(Figure: a Markov chain with states E1, E2, E3, an initiation step, and emissions O1, O2 produced by the states.)

Hidden Markov Model

The emissions are probabilistic, state-dependent and time-independent.

Example (a two-state chain emitting A or B):

    Transitions:   E1 -> E1: 0.9    E1 -> E2: 0.1
                   E2 -> E1: 0.8    E2 -> E2: 0.2
    Emissions:     E1: A 0.5,  B 0.5
                   E2: A 0.25, B 0.75

- We don't observe the states of the Markov chain, but we observe the emissions.
- If both E1 and E2 have the same chance to initiate the chain, and we observe the sequence “BBB”, what is the most likely state sequence that the chain went through?

Hidden Markov Model

Let Q denote the state sequence, and O denote the observed emissions. Our goal is to find the Q that maximizes P(Q | O).

O = “BBB” (each product is initiation * transitions * emissions):

    E1 -> E1 -> E1: 0.5*0.9*0.9*0.5*0.5*0.5    = 0.050625
    E1 -> E1 -> E2: 0.5*0.9*0.1*0.5*0.5*0.75   = 0.0084375
    E1 -> E2 -> E1: 0.5*0.1*0.8*0.5*0.75*0.5   = 0.0075
    E1 -> E2 -> E2: 0.5*0.1*0.2*0.5*0.75*0.75  = 0.0028125
    E2 -> E1 -> E1: 0.5*0.8*0.9*0.75*0.5*0.5   = 0.0675
    E2 -> E1 -> E2: 0.5*0.8*0.1*0.75*0.5*0.75  = 0.01125
    E2 -> E2 -> E1: 0.5*0.2*0.8*0.75*0.75*0.5  = 0.0225
    E2 -> E2 -> E2: 0.5*0.2*0.2*0.75*0.75*0.75 = 0.0084375

The maximum is 0.0675, so the most likely state sequence is E2 -> E1 -> E1.
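The same enumeration can be written as a short brute-force sketch: it loops over all S^T hidden paths, multiplies the initiation, transition and emission probabilities exactly as in the eight lines above, and confirms that E2 -> E1 -> E1 is the best path.

```python
from itertools import product

states = ["E1", "E2"]
init = {"E1": 0.5, "E2": 0.5}
trans = {("E1", "E1"): 0.9, ("E1", "E2"): 0.1,
         ("E2", "E1"): 0.8, ("E2", "E2"): 0.2}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

def joint_prob(path, obs):
    """P(Q, O): initiation * transitions along the path * emissions."""
    p = init[path[0]]
    for prev, cur in zip(path, path[1:]):
        p *= trans[(prev, cur)]
    for state, symbol in zip(path, obs):
        p *= emit[state][symbol]
    return p

obs = "BBB"
scores = {path: joint_prob(path, obs) for path in product(states, repeat=len(obs))}
for path, p in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" -> ".join(path), f"{p:.7f}")
print("P(O) =", sum(scores.values()))   # 0.1790625, used again by the forward algorithm
```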

Hidden Markov Model

Parameters:

- the initial distribution π
- the transition probabilities P_ij
- the emission probabilities b_i

Common questions:

- How to efficiently calculate the probability of an emitted sequence, P(O) (the forward and backward algorithm)
- How to find the most likely hidden state sequence, argmax_Q P(Q | O) (the Viterbi algorithm)
- How to find the most likely parameters, i.e. the parameter values maximizing P(O) (the Baum-Welch method)

Hidden Markov Models in Bioinformatics

A generative model for multiple sequence alignment.

Hidden Markov Models in Bioinformatics

Data segmentation in array-based copy number analysis.

R package aCGH in Bioconductor.

Hidden Markov Models in Bioinformatics

(Figure: an over-simplified model with three hidden states: Amplified, Normal, Deleted.)

The Forward and Backward Algorithm

If there are a total of S states, and we study an emission sequence from T steps of the chain, then #(all possible Q) = S^T.

When either S or T is big, direct calculation over all paths is infeasible.

- Let q_t denote the hidden state at time t.
- Let b_i denote the emission probabilities from state Ei.

Consider the “forward variables”:

    α_t(i) = P(O_1 ... O_t, q_t = Ei)    (emissions up to step t, chain at state Ei at step t)

The Forward and Backward Algorithm

At step 1:

    α_1(i) = π_i b_i(O_1)

At step t+1:

    α_{t+1}(j) = [ Σ_i α_t(i) P_ij ] b_j(O_{t+1})

... and so on up to step T.

If we know all α_T(i), then P(O) can be found by

    P(O) = Σ_i α_T(i)

A total of about 2T*S^2 computations is needed.
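A minimal sketch of this recursion for the two-state example above; it reproduces P("BBB") = 0.1790625 (the sum of the eight joint probabilities listed earlier) using on the order of T*S^2 operations instead of enumerating S^T paths.

```python
def forward(obs, states, init, trans, emit):
    """alpha[t][i] = P(O_1..O_t, q_t = i); returns the table and P(O)."""
    alpha = [{i: init[i] * emit[i][obs[0]] for i in states}]        # step 1
    for t in range(1, len(obs)):                                    # steps 2..T
        alpha.append({
            j: sum(alpha[t - 1][i] * trans[(i, j)] for i in states) * emit[j][obs[t]]
            for j in states
        })
    return alpha, sum(alpha[-1].values())

states = ["E1", "E2"]
init = {"E1": 0.5, "E2": 0.5}
trans = {("E1", "E1"): 0.9, ("E1", "E2"): 0.1,
         ("E2", "E1"): 0.8, ("E2", "E2"): 0.2}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

alpha, p_obs = forward("BBB", states, init, trans, emit)
for t, row in enumerate(alpha, start=1):
    print(f"t={t}", {k: round(v, 7) for k, v in row.items()})
print("P(O) =", p_obs)   # 0.1790625
```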

The Forward and Backward Algorithm

Back to our simple example, find P("BBB"):

    α_1(E1) = 0.5*0.5 = 0.25                              α_1(E2) = 0.5*0.75 = 0.375
    α_2(E1) = (0.25*0.9 + 0.375*0.8)*0.5 = 0.2625         α_2(E2) = (0.25*0.1 + 0.375*0.2)*0.75 = 0.075
    α_3(E1) = (0.2625*0.9 + 0.075*0.8)*0.5 = 0.148125     α_3(E2) = (0.2625*0.1 + 0.075*0.2)*0.75 = 0.0309375

    P("BBB") = 0.148125 + 0.0309375 = 0.1790625

Now, what saved us the computing time? Reusing the shared components.

And what allowed us to do that? The short memory of the Markov Chain.

The Forward and Backward Algorithm

The backward algorithm uses the “backward variables”:

    β_t(i) = P(O_{t+1} ... O_T | q_t = Ei)    (emissions after step t, given the chain is at state Ei at step t)

At step T, set β_T(i) = 1 for all i. From T-1 down to 1, we can iteratively calculate:

    β_t(i) = Σ_j P_ij b_j(O_{t+1}) β_{t+1}(j)

The Forward and Backward Algorithm

Back to our simple example, observing "BBB", we have

    β_3(E1) = 1                                          β_3(E2) = 1
    β_2(E1) = 0.9*0.5 + 0.1*0.75 = 0.525                 β_2(E2) = 0.8*0.5 + 0.2*0.75 = 0.55
    β_1(E1) = 0.9*0.5*0.525 + 0.1*0.75*0.55 = 0.2775     β_1(E2) = 0.8*0.5*0.525 + 0.2*0.75*0.55 = 0.2925
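A minimal sketch of the backward recursion for the same example; it reproduces the β values above and recovers the same P("BBB") = 0.1790625 from the backward variables.

```python
def backward(obs, states, trans, emit):
    """beta[t][i] = P(O_{t+1}..O_T | q_t = i), computed from t = T down to 1."""
    T = len(obs)
    beta = [{i: 1.0 for i in states}]                                 # step T
    for t in range(T - 2, -1, -1):                                    # steps T-1 .. 1
        nxt = beta[0]
        beta.insert(0, {
            i: sum(trans[(i, j)] * emit[j][obs[t + 1]] * nxt[j] for j in states)
            for i in states
        })
    return beta

states = ["E1", "E2"]
init = {"E1": 0.5, "E2": 0.5}
trans = {("E1", "E1"): 0.9, ("E1", "E2"): 0.1,
         ("E2", "E1"): 0.8, ("E2", "E2"): 0.2}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

obs = "BBB"
beta = backward(obs, states, trans, emit)
for t, row in enumerate(beta, start=1):
    print(f"t={t}", row)
# P(O) again, now from the backward variables:
print("P(O) =", sum(init[i] * emit[i][obs[0]] * beta[0][i] for i in states))  # 0.1790625
```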









The Viterbi Algorithm

To find:

    argmax_Q P(Q | O)

Since P(Q | O) = P(Q, O) / P(O), and P(O) does not depend on Q, to maximize the conditional probability we can simply maximize the joint probability P(Q, O).

Define:

    δ_t(i) = the maximum of P(q_1 ... q_{t-1}, q_t = Ei, O_1 ... O_t) over all paths ending in state Ei at step t

The Viterbi Algorithm

Then our goal becomes:

    max_i δ_T(i), together with the path that achieves it

Compare this to dynamic programming pairwise sequence alignment:

- At every step, we aim at the maximum.
- The maximum depends on the last step.
- At the end, we want the maximum and the trace.

So, this problem is also solved by dynamic programming.

The Viterbi Algorithm

    Time:         1          2          3
    States:  E1   0.5*0.5    0.15       0.0675
             E2   0.5*0.75   0.05625    0.01125

The calculation runs from the left border to the right border, while in sequence alignment it runs from the upper-left corner to the right and lower borders.


Again, why is this efficient calculation possible?

The short memory of the Markov Chain!
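A minimal Viterbi sketch for the same two-state example; it fills the table above from left to right, keeps back-pointers for the trace, and recovers the best path E2 -> E1 -> E1 with joint probability 0.0675.

```python
def viterbi(obs, states, init, trans, emit):
    """delta[t][i] = best joint probability of a path ending in state i at step t."""
    delta = [{i: init[i] * emit[i][obs[0]] for i in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] * trans[(i, j)])
            delta[t][j] = delta[t - 1][best_i] * trans[(best_i, j)] * emit[j][obs[t]]
            back[t][j] = best_i
    # Trace back from the best final state.
    last = max(states, key=lambda i: delta[-1][i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, delta[-1][last], delta

states = ["E1", "E2"]
init = {"E1": 0.5, "E2": 0.5}
trans = {("E1", "E1"): 0.9, ("E1", "E2"): 0.1,
         ("E2", "E1"): 0.8, ("E2", "E2"): 0.2}
emit = {"E1": {"A": 0.5, "B": 0.5},
        "E2": {"A": 0.25, "B": 0.75}}

path, prob, delta = viterbi("BBB", states, init, trans, emit)
print(" -> ".join(path), prob)        # E2 -> E1 -> E1  0.0675
for t, row in enumerate(delta, start=1):
    print(f"t={t}", row)              # matches the table above
```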


The Viterbi Algorithm

http://www.telecom.tuc.gr/~ntsourak/demo_viterbi.htm

The Estimation of parameters

When the topology of an HMM is known, how do we find
the initial distribution, the transition probabilities and the
emission probabilities, from a set of emitted sequences?



Difficulties:


Usually the number of parameters is not small. Even the
simple chain in our example has 5 parameters.


Normally the landscape of the likelihood function is
bumpy. Finding the global optimum is hard.



The Estimation of parameters

The general idea of the Baum-Welch method:

(1) Start with some initial values (a uniform or informed guess).
(2) Re-estimate the parameters.
    - The previously mentioned forward and backward probabilities are used.
    - It is guaranteed that the likelihood of the data improves with the re-estimation.
(3) Repeat (2) until a local optimum is reached.

We will discuss the details next time.

To be continued
… … …