JM

http://folding.chmcc.org
1
Introduction to Bioinformatics: Lecture IV
Sequence Similarity and Dynamic Programming
Jarek Meller
Division of Biomedical Informatics,
Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC
JM

http://folding.chmcc.org
2
Outline of the lecture
Wrapping up the previous lecture: a quick look at the
NCBI Map Viewer and suffix trees by way of an
example
Inexact string matching: from generalizations of suffix
trees to dynamic programming
The dynamic programming algorithm for sequence
alignment: how it works
The dynamic programming algorithm for sequence
alignment: why it works
Limitations and faster heuristic approaches
JM

http://folding.chmcc.org
3
Web watch: NCBI Map Viewer
With the knowledge about STSs and physical maps
(hopefully) acquired last week we can have another look
at the NCBI Map Viewer:
http://www.ncbi.nlm.nih.gov/mapview
http://www.ncbi.nlm.nih.gov/genome/guide/human/
JM

http://folding.chmcc.org
4
Computationally efficient and elegant solutions for
the exact string matching problem:
http://www

igm.univ

mlv.fr/~lecroq/string/index.html
Christian Charras and Thierry Lecroq
JM

http://folding.chmcc.org
5
The idea of the suffix tree method
Phase 1: Preprocessing of the “text”
A string with
m
characters has m suffixes, which can be represented
as
m
leaves of a rooted directed tree. Consider for example
T=cabca
c
a
b
c
a
$
1
a
b
c
a
$
2
b
c
a
$
3
$
4
$
5
For simplicity one leaf, due to the terminal character $, is not included.
Problem
What is the reason for adding the terminal character?
JM

http://folding.chmcc.org
6
Suffix tree based matching: why does it work?
Phase II: Search
A substring of a string is a prefix of a suffix in that string. For example,
a substring
P=ab
is a prefix of the suffix
abca
in
T=cabca.
Thus, if
P
occurs in
T
there is a leaf in the suffix tree that has a label starting with
P
.
c
a
b
c
a
$
1
a
b
c
a
$
2
b
c
a
$
3
$
4
$
5
Problem
Does the size of the alphabet matter (and if so, how)?
Hint: how many edges may originate in a node, given that label of each
edge out of a node has to start with a different character?
JM

http://folding.chmcc.org
7
Generalized suffix tree for a set of strings and
the longest common substring problem
Consider for example two strings:
T=cabca
and
U=bbcb
.
c
a
b
c
a
$
T1
a
b
c
a
$
T2
b
c
a
$
T3
$
T4
$
T5
Remark
By building the generalized suffix tree for a set of
k
strings of the total
length
m
one can find the longest prefix

suffix match for all pairs of strings in
O(m+k
2
) time (an additional trick is required for that).
b
c
b
$
U1
b
$
U2
U4
U3
$
$
b
JM

http://folding.chmcc.org
8
Assembling DNA from fragment and the suffix

prefix matching problem
Hierarchical sequencing: physical maps, clone libraries and shotgun
(see Chapter 2 in “A Primer on Genome Science” by Gibson and Muse)
Definition
The algorithmic problem of shotgun sequence assembly
is to deduce the sequence of the DNA string from a set of sequenced
and partially overlapping short substrings derived from that string.
Analogy to physical map assembly: DNA sequence of a substring may
be viewed as a precise ordered fingerprint (in analogy to STSs) and the
suffix

prefix match determines if two substrings would be assembled
together.
In general, the shortest superstring problem (find the shortest string
that contains each string from a certain set of strings as its substring)
is NP

hard and heuristics are being developed to address the problem.
JM

http://folding.chmcc.org
9
Inexact or approximate string matching
Two major reasons for the importance of approximate matching in
computational molecular biology are:
i)
Measurement (e.g. sequencing) errors and fuzzy nature of underlying
molecular processes (e.g. hybridization may occur despite some
mismatches)
ii) Redundancy in biology with evolutionary processes resulting in closely
related, yet, different sequences that require approximate matching in
order to detect their relatedness and identify variable as well as
conserved features that may reveal fingerprints of structure and function
Either generalizations of exact string matching methods, such as suffix trees,
or dynamic programming (or their heuristic combinations) are being used to
solve this problem.
JM

http://folding.chmcc.org
10
Redundancy in biological systems

MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE
M
LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFI

AIIHVLHSRH
DLKKHG TVLTALG ILKKKGHHEAE KP AQSHATKHKIP+KYLEFI I VL S+H
DLKKHGATVLTALGGILKKKGHHEAE

KPLAQSHATKHKIPVKYLEFISEC

IQVLQSKH
PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG
PG+FGADAQGAMNKALELFRKD+A+ YKE
PGDFGADAQGAMNKALELFRKDMASNYKE

Note that there are two types of mismatches:
i)
Due to point mutations
ii)
Due to insertions and deletions (gaps)
An example: two globin

like sequences:
JM

http://folding.chmcc.org
11
Gap penalties: evolutionary and
computational considerations
Linear gap penalties:
g
(g) =

g d
for a gap of length
g
and constant
d
Affine gap penalties:
g
(g) =

[ d + (g

1) e ]
where
d
is opening gap penalty and
e
an extension gap penalty.
JM

http://folding.chmcc.org
12
Dynamic programming algorithm for string alignment
Our goal is to find an optimal matching for two strings
S
1
=
a
1
a
2
…a
n
and
S
2
=
b
1
b
2
…b
m
over a certain alphabet
S
, given a
scoring matrix
s(a,b)
for each
a
and
b
in
S
and (for simplicity) a
linear gap penalty
Relation to minimal edit distance (number of insertions,
deletions and substitutions required to transform one string into
the other) problem
The similarity measure (scoring matrix) should represent
biological relatedness and separate true matches from random
alignments (find more in Chapter 2 of “Biological Sequence
Analysis” by Durbin et. al.)
JM

http://folding.chmcc.org
13
How many alignments are there?
All the possible alignments (with gaps) may be represented in the
Form of a DP graph (DP table). Consider an example with two
strings of length 2:
a
1
a
2
b
1
b
2
0
1
1
1
1
3
5
13
5
_
a
1
a
2
b
1
b
2
\


b
1
a
1
b
2
a
2
\
_
 _ _ _ _ _ _ _
_ _  _ _  
b
1
b
2
a
1
a
2
  _
\
a
1
b
1
a
2
b
2
\
\
\
_
_ 
a
1
b
1
b
2
a
2
_ _
\

b
1
a
1
a
2
b
2
\
JM

http://folding.chmcc.org
14
Computing the number of alignments with gaps
Definition
A string of length
n+m
, obtained by intercalating two strings
S
1
= a
1
a
2
…a
n
and
S
2
= b
1
b
2
…b
m
, while preserving the order of the
symbols in
S
1
and
S
2
, will be referred to as an
intercalated string
and
denoted by
S
1/2
. Note that
S
1
and
S
2
are subsequences of
S
1/2
but in
general they are not substrings of
S
1/2.
Definition
Two alignments are called
redundant
if their score is identical.
The relationship of “having the same score” may be used to define
equivalence classes of non

redundant alignments.
For example, the class
a
1
b
1
b
2
a
2
:
a
1
b
1
b
2
a
2
a
1

a
2
a
1
a
2

b
1
b
2

; b
1

b
2
JM

http://folding.chmcc.org
15
Computing the number of alignments with gaps
Lemma
There is one

to

one correspondence (bijection) between the set of
the non

redundant gapped alignments of two strings
S
1
and
S
2
and the set
of the intercalated strings {
S
1/2
}.
Corollary
The number of non

redundant gapped alignments of two strings,
of length
n
and
m
, respectively, is equal to
(n+m)!/[m!n!].
Proof
Since the order of each of the sequences is preserved when intercalating
them, we have in fact
n+m
positions to put
m
elements of the second sequence
(once this is done the position of each of the elements of the first sequence is fixed
unambiguously). Hence, the total number of intercalated sequences
S
1/2
is given by
the number of
m

element combinations of
n+m
elements and the corollary is a
simple consequence of the one

to

one correspondence between alignments and
intercalated sequences stated in the lemma. QED
JM

http://folding.chmcc.org
16
Computing the number of alignments with gaps
Problem
Consider for simplicity two strings of the same length and using
the Stirling formula (
x! ~ (2
p
)
1/2
x
x+1/2
e

x
) show that:
(n+n)!/[n!n!] ~ 2
2n
/ (2
p
n)
1/2
Note that for a very short by biology standards sequence of length
n=50
one needs to perform about
10
30
basic operations for an exhaustive search,
making the naïve approach infeasible.
Dynamic programming provides in polynomial time an optimal
solution for a class of optimization problems with exponentially
scaling search space, including the approximate string matching.
Comments 0
Log in to post a comment