From Suffix Trees to Approximate String Matching - Jarek Meller's ...

weinerthreeforksBiotechnology

Oct 2, 2013 (3 years and 8 months ago)

74 views

JM
-

http://folding.chmcc.org

1

Introduction to Bioinformatics: Lecture IV

Sequence Similarity and Dynamic Programming



Jarek Meller


Division of Biomedical Informatics,

Children’s Hospital Research Foundation

& Department of Biomedical Engineering, UC

JM
-

http://folding.chmcc.org

2

Outline of the lecture




Wrapping up the previous lecture: a quick look at the
NCBI Map Viewer and suffix trees by way of an
example


Inexact string matching: from generalizations of suffix
trees to dynamic programming


The dynamic programming algorithm for sequence
alignment: how it works


The dynamic programming algorithm for sequence
alignment: why it works


Limitations and faster heuristic approaches

JM
-

http://folding.chmcc.org

3


Web watch: NCBI Map Viewer

With the knowledge about STSs and physical maps

(hopefully) acquired last week we can have another look

at the NCBI Map Viewer:

http://www.ncbi.nlm.nih.gov/mapview


http://www.ncbi.nlm.nih.gov/genome/guide/human/



JM
-

http://folding.chmcc.org

4

Computationally efficient and elegant solutions for
the exact string matching problem:

http://www
-
igm.univ
-
mlv.fr/~lecroq/string/index.html

Christian Charras and Thierry Lecroq

JM
-

http://folding.chmcc.org

5

The idea of the suffix tree method

Phase 1: Preprocessing of the “text”

A string with
m

characters has m suffixes, which can be represented

as
m

leaves of a rooted directed tree. Consider for example
T=cabca

c

a

b

c

a

$

1

a

b

c

a

$

2

b

c

a

$

3

$

4

$

5

For simplicity one leaf, due to the terminal character $, is not included.

Problem

What is the reason for adding the terminal character?

JM
-

http://folding.chmcc.org

6

Suffix tree based matching: why does it work?

Phase II: Search

A substring of a string is a prefix of a suffix in that string. For example,

a substring
P=ab

is a prefix of the suffix
abca

in

T=cabca.
Thus, if
P


occurs in
T

there is a leaf in the suffix tree that has a label starting with
P
.

c

a

b

c

a

$

1

a

b

c

a

$

2

b

c

a

$

3

$

4

$

5

Problem

Does the size of the alphabet matter (and if so, how)?

Hint: how many edges may originate in a node, given that label of each

edge out of a node has to start with a different character?

JM
-

http://folding.chmcc.org

7

Generalized suffix tree for a set of strings and
the longest common substring problem

Consider for example two strings:
T=cabca
and
U=bbcb
.

c

a

b

c

a

$

T1

a

b

c

a

$

T2

b

c

a

$

T3

$

T4

$

T5

Remark
By building the generalized suffix tree for a set of
k

strings of the total

length
m
one can find the longest prefix
-
suffix match for all pairs of strings in

O(m+k
2
) time (an additional trick is required for that).

b

c

b

$

U1

b

$

U2

U4

U3

$

$

b

JM
-

http://folding.chmcc.org

8

Assembling DNA from fragment and the suffix
-
prefix matching problem

Hierarchical sequencing: physical maps, clone libraries and shotgun

(see Chapter 2 in “A Primer on Genome Science” by Gibson and Muse)


Definition

The algorithmic problem of shotgun sequence assembly

is to deduce the sequence of the DNA string from a set of sequenced

and partially overlapping short substrings derived from that string.


Analogy to physical map assembly: DNA sequence of a substring may

be viewed as a precise ordered fingerprint (in analogy to STSs) and the

suffix
-
prefix match determines if two substrings would be assembled

together.


In general, the shortest superstring problem (find the shortest string

that contains each string from a certain set of strings as its substring)

is NP
-
hard and heuristics are being developed to address the problem.

JM
-

http://folding.chmcc.org

9

Inexact or approximate string matching

Two major reasons for the importance of approximate matching in

computational molecular biology are:


i)
Measurement (e.g. sequencing) errors and fuzzy nature of underlying


molecular processes (e.g. hybridization may occur despite some


mismatches)

ii) Redundancy in biology with evolutionary processes resulting in closely


related, yet, different sequences that require approximate matching in


order to detect their relatedness and identify variable as well as


conserved features that may reveal fingerprints of structure and function



Either generalizations of exact string matching methods, such as suffix trees,

or dynamic programming (or their heuristic combinations) are being used to

solve this problem.

JM
-

http://folding.chmcc.org

10

Redundancy in biological systems

--
MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE

M

LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE


MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE



DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFI
--
AIIHVLHSRH

DLKKHG TVLTALG ILKKKGHHEAE KP AQSHATKHKIP+KYLEFI I VL S+H


DLKKHGATVLTALGGILKKKGHHEAE
-
KPLAQSHATKHKIPVKYLEFISEC
-
IQVLQSKH


PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG

PG+FGADAQGAMNKALELFRKD+A+ YKE


PGDFGADAQGAMNKALELFRKDMASNYKE
-----

Note that there are two types of mismatches:

i)
Due to point mutations

ii)
Due to insertions and deletions (gaps)

An example: two globin
-
like sequences:

JM
-

http://folding.chmcc.org

11

Gap penalties: evolutionary and
computational considerations


Linear gap penalties:


g
(g) =
-

g d



for a gap of length
g

and constant
d


Affine gap penalties:


g
(g) =
-

[ d + (g
-
1) e ]



where
d

is opening gap penalty and
e

an extension gap penalty.


JM
-

http://folding.chmcc.org

12

Dynamic programming algorithm for string alignment


Our goal is to find an optimal matching for two strings
S
1

=
a
1
a
2
…a
n

and
S
2

=
b
1
b
2
…b
m

over a certain alphabet
S
, given a
scoring matrix
s(a,b)

for each
a

and
b
in
S

and (for simplicity) a
linear gap penalty


Relation to minimal edit distance (number of insertions,
deletions and substitutions required to transform one string into
the other) problem


The similarity measure (scoring matrix) should represent
biological relatedness and separate true matches from random
alignments (find more in Chapter 2 of “Biological Sequence
Analysis” by Durbin et. al.)


JM
-

http://folding.chmcc.org

13

How many alignments are there?

All the possible alignments (with gaps) may be represented in the

Form of a DP graph (DP table). Consider an example with two

strings of length 2:

a
1

a
2

b
1

b
2

0

1

1

1

1

3

5

13

5


_



a
1
a
2
b
1
b
2


\



|


|



b
1
a
1
b
2
a
2


\
_

| _ _ |_ _ |_ _ _

|_ _ | |_ |_ | |



b
1
b
2
a
1
a
2


| | |_

\



a
1
b
1
a
2
b
2


\


\

\

_


|_ |


a
1
b
1
b
2
a
2

|_ _


\

|


b
1
a
1
a
2
b
2


\


JM
-

http://folding.chmcc.org

14

Computing the number of alignments with gaps


Definition

A string of length
n+m
, obtained by intercalating two strings
S
1

= a
1
a
2
…a
n

and

S
2

= b
1
b
2
…b
m

, while preserving the order of the
symbols in
S
1

and

S
2
, will be referred to as an
intercalated string

and
denoted by
S
1/2
. Note that
S
1

and
S
2

are subsequences of
S
1/2

but in
general they are not substrings of
S
1/2.



Definition

Two alignments are called
redundant

if their score is identical.
The relationship of “having the same score” may be used to define
equivalence classes of non
-
redundant alignments.



For example, the class
a
1
b
1
b
2
a
2
:



a
1
b
1
b
2
a
2



a
1
-
a
2
a
1
a
2
-


b
1
b
2
-

; b
1
-
b
2




JM
-

http://folding.chmcc.org

15

Computing the number of alignments with gaps


Lemma

There is one
-
to
-
one correspondence (bijection) between the set of
the non
-
redundant gapped alignments of two strings
S
1

and

S
2

and the set
of the intercalated strings {
S
1/2
}.



Corollary
The number of non
-
redundant gapped alignments of two strings,
of length
n

and
m
, respectively, is equal to
(n+m)!/[m!n!].



Proof

Since the order of each of the sequences is preserved when intercalating
them, we have in fact
n+m

positions to put
m

elements of the second sequence
(once this is done the position of each of the elements of the first sequence is fixed
unambiguously). Hence, the total number of intercalated sequences
S
1/2

is given by
the number of
m
-
element combinations of
n+m

elements and the corollary is a
simple consequence of the one
-
to
-
one correspondence between alignments and
intercalated sequences stated in the lemma. QED


JM
-

http://folding.chmcc.org

16

Computing the number of alignments with gaps


Problem
Consider for simplicity two strings of the same length and using
the Stirling formula (
x! ~ (2
p
)
1/2

x
x+1/2

e
-
x

) show that:



(n+n)!/[n!n!] ~ 2
2n

/ (2
p
n)
1/2




Note that for a very short by biology standards sequence of length
n=50

one needs to perform about
10
30

basic operations for an exhaustive search,
making the naïve approach infeasible.

Dynamic programming provides in polynomial time an optimal
solution for a class of optimization problems with exponentially
scaling search space, including the approximate string matching.