Pairwise sequence alignment algorithms
Elya Flax & Inbar Matarasso
Seminar in Structural Bioinformatics

Pairwise sequence alignment algorithms

Outline
• The importance of (sub)sequence comparison in molecular biology
• The edit distance between two strings
• Dynamic programming
• String similarity
• Computing alignments in linear space
• Local alignment
• Gaps

Motivation
The area of approximate matching and sequence comparison is central in computational molecular biology because of the active mutational processes that (sub)sequence comparison methods seek to model and reveal.
Much of computational biology concerns sequence alignments.

The importance of (Sub)sequence comparison in Molecular Biology
The first fact of biological sequence analysis: in biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.
“Redundancy” and “similarity” are central phenomena in biology. But similarity has its limits: humans differ in some respects. These differences make conserved similarity even more significant, which in turn makes comparison and analogy very powerful tools in biology.

The importance of (Sub)sequence comparison in Molecular Biology
“... Similar sequences yield similar structures, but quite distinct sequences can produce remarkably similar structures.”
F. E. Cohen. Folding the sheets: using computational methods to predict structures of proteins. In E. Lander and M. S. Waterman, editors, Calculating the Secrets of Life, pages 236–71. National Academy Press, 1995.

Terminology
Approximate: some errors, of various types detailed later, are acceptable in valid matches.
Alignment: lining up characters of strings, allowing mismatches as well as matches and allowing characters of one string to be placed opposite spaces made in opposing strings.
q a c _ d b d
q a w x _ b _

Terminology
Subsequence versus substring: a subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be.
For example, the string xyz is a subsequence, but not a substring, in axayaz.
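The distinction can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function names `is_substring` and `is_subsequence` are chosen here for illustration.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur contiguously in text."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text in order, not necessarily contiguously."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to the first occurrence of ch,
    # so each pattern character must be found after the previous one.
    return all(ch in remaining for ch in pattern)


print(is_subsequence("xyz", "axayaz"))  # True
print(is_substring("xyz", "axayaz"))    # False
```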

Dynamic Programming
Dynamic programming is typically applied to optimization problems. The development of a dynamic programming algorithm can be broken into a sequence of four steps:
i. Characterize the structure of an optimal solution.
ii. Recursively define the value of an optimal solution.
iii. Compute the value of an optimal solution in a bottom-up fashion.
iv. Construct an optimal solution from computed information.

Edit Distance
Instance: two sequences x[1..m] and y[1..n], and a set of operation costs.
Problem: find the cost of the least expensive sequence of operations that transforms x into y.

The edit distance between two strings
The permitted edit operations are I (Insertion), D (Deletion), and R (Replacement); M denotes a Match.
Definition: A string over the alphabet I, D, R, M that describes a transformation of one string into another is called an edit transcript, or transcript for short, of the two strings.
R I M D M D M M I
v   i n t n e r
w r i   t   e r s

The edit distance between two strings
Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second.
For emphasis, note that matches are not counted.

String alignment
Definition: A (global) alignment of two strings $S_1$ and $S_2$ is obtained by first inserting chosen spaces (or dashes), either into or at the ends of $S_1$ and $S_2$, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.

String alignment
Example: an alignment of the strings qacdbd and qawxb:
q a c _ d b d
q a w x _ b _

Alignment versus edit transcript
From the mathematical standpoint, an alignment and an edit transcript are equivalent ways to describe a relationship between two strings.
From a modeling standpoint, an edit transcript emphasizes the putative mutational events that transform one string into another, whereas an alignment only displays a relationship between the two strings.

Dynamic programming calculation of edit distance
Definition: For two strings $S_1$ and $S_2$, $D(i,j)$ is defined to be the edit distance of $S_1[1..i]$ and $S_2[1..j]$.
That is, $D(i,j)$ denotes the minimum number of edit operations needed to transform the first $i$ characters of $S_1$ into the first $j$ characters of $S_2$.
With $n = |S_1|$ and $m = |S_2|$, $D(n,m)$ is the edit distance of $S_1$ and $S_2$.

The recurrence relation
The base conditions are:
$D(i,0) = i$ and $D(0,j) = j$.
The recurrence relation for $D(i,j)$ when both $i$ and $j$ are strictly positive is:
$$D(i,j) = \min[\,D(i-1,j) + 1,\; D(i,j-1) + 1,\; D(i-1,j-1) + t(i,j)\,]$$
where $t(i,j)$ is defined to have value 1 if $S_1(i) \neq S_2(j)$, and 0 otherwise.

Correctness of the general recurrence
Lemma 1: The value of $D(i,j)$ must be $D(i-1,j) + 1$, $D(i,j-1) + 1$, or $D(i-1,j-1) + t(i,j)$. There are no other possibilities.
Lemma 2: $D(i,j) \le \min[\,D(i-1,j) + 1,\; D(i,j-1) + 1,\; D(i-1,j-1) + t(i,j)\,]$.
Theorem: When both $i$ and $j$ are strictly positive,
$$D(i,j) = \min[\,D(i-1,j) + 1,\; D(i,j-1) + 1,\; D(i-1,j-1) + t(i,j)\,].$$

Tabular computation of edit distance
Goal: efficiently compute the value $D(n,m)$.
Top-down computation: there are only $(n+1) \times (m+1)$ combinations of $i$ and $j$, but a direct recursive evaluation makes many redundant recursive calls.
Bottom-up computation: fill in a table of the $D(i,j)$ values, starting from the base conditions.
Time analysis: there are $O(nm)$ cells in the table.

Tabular computation of edit distance
Theorem: The dynamic programming table for computing the edit distance between a string of length $n$ and a string of length $m$ can be filled in with $O(nm)$ work. Hence, using dynamic programming, the edit distance $D(n,m)$ can be computed in $O(nm)$ time.
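The bottom-up computation can be sketched as follows. This is an illustrative sketch rather than the presenters' code; the function name `edit_distance_table` is an assumption, and the strings vintner and writers are reused from the edit transcript example.

```python
def edit_distance_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the (n+1) x (m+1) table D, where D[i][j] is the edit distance
    between the first i characters of s1 and the first j characters of s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i          # base condition D(i,0) = i
    for j in range(m + 1):
        D[0][j] = j          # base condition D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1   # t(i,j)
            D[i][j] = min(D[i - 1][j] + 1,           # cell above
                          D[i][j - 1] + 1,           # cell to the left
                          D[i - 1][j - 1] + t)       # diagonal cell
    return D


table = edit_distance_table("vintner", "writers")
print(table[-1][-1])  # 5
```

Each cell is computed from three already-filled neighbours in constant time, which gives the $O(nm)$ bound stated in the theorem.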

The traceback
When the value of cell $(i,j)$ is computed, set a pointer according to the following rules:
If $D(i,j) = D(i,j-1) + 1$, set a pointer from $(i,j)$ to $(i,j-1)$.
If $D(i,j) = D(i-1,j) + 1$, set a pointer from $(i,j)$ to $(i-1,j)$.
If $D(i,j) = D(i-1,j-1) + t(i,j)$, set a pointer from $(i,j)$ to $(i-1,j-1)$.
To obtain an optimal edit transcript, follow any path of pointers from cell $(n,m)$ to cell $(0,0)$.

The traceback
A horizontal edge denotes an insertion.
A vertical edge denotes a deletion.
A diagonal edge denotes a substitution if $S_1(i) \neq S_2(j)$, and a match otherwise.

The traceback
Theorem: Once the dynamic programming table with pointers has been computed, an optimal edit transcript can be found in $O(n+m)$ time.
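A minimal sketch of the traceback, under one simplifying assumption: instead of storing pointers while filling the table, the sketch re-derives a valid pointer at each cell from the table values, which follows one of the possibly many optimal paths. The function name `optimal_transcript` is hypothetical.

```python
def optimal_transcript(s1: str, s2: str) -> str:
    """Return one optimal edit transcript (over I, D, R, M) transforming s1 into s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:                      # at most n + m steps
        t = 0 if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            ops.append("M" if t == 0 else "R")  # diagonal edge
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                     # vertical edge
            i -= 1
        else:
            ops.append("I")                     # horizontal edge
            j -= 1
    return "".join(reversed(ops))


transcript = optimal_transcript("vintner", "writers")
print(transcript, sum(op != "M" for op in transcript))
```

The number of non-M symbols in the returned transcript equals $D(n,m)$; other tie-breaking orders would return other optimal transcripts.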

The traceback
Theorem: Any path from $(n,m)$ to $(0,0)$ following pointers established during the computation of $D(i,j)$ specifies an edit transcript with the minimum number of edit operations. Conversely, any optimal edit transcript is specified by such a path. Moreover, since a path describes only one transcript, the correspondence between paths and optimal transcripts is one-to-one.

Edit graphs
Definition: Given two strings $S_1$ and $S_2$ of length $n$ and $m$, respectively, a weighted edit graph has $(n+1) \times (m+1)$ nodes, each labeled with a distinct pair $(i,j)$ ($0 \le i \le n$, $0 \le j \le m$). The specific edges and their edge weights depend on the specific string problem.
For the edit distance problem:
• The weight on the edges $(i,j) \to (i,j+1)$ and $(i,j) \to (i+1,j)$ is one.
• The weight on the edges $(i,j) \to (i+1,j+1)$ is $t(i+1,j+1)$.
[Figure: edit graph for the strings CAN (rows 0 to 3) and ANN (columns 0 to 3).]

Weighted edit distance
Definition: With arbitrary operation weights, the operation-weight edit distance problem is to find an edit transcript that transforms string $S_1$ into $S_2$ with the minimum total operation weight.
For example: if each mismatch has a weight of 2, each space has a weight of 4, and each match a weight of 1, then the following alignment has a total weight of 17 and is an optimal alignment.
w r i t _ e r s
v i n t n e r _

Alphabet-weight edit distance
The weight of a substitution depends on exactly which character in the alphabet is being removed and which is being added.

String similarity
A way of formalizing the relatedness of two strings by measuring their similarity rather than their distance.
Definition: Let $\Sigma$ be the alphabet used for strings $S_1$ and $S_2$, and let $\Sigma'$ be $\Sigma$ with the added character “_”. Then, for any two characters $x, y$ in $\Sigma'$, $s(x,y)$ denotes the value (or score) obtained by aligning character $x$ against character $y$.

String similarity
Definition: For a given alignment $A$ of $S_1$ and $S_2$, let $S_1'$ and $S_2'$ denote the strings after the chosen insertion of spaces, and let $l$ denote the (equal) length of the two strings in $A$. The value of alignment $A$ is defined as
$$\sum_{i=1}^{l} s(S_1'(i), S_2'(i)).$$

String similarity
For example, let $\Sigma = \{a, b, c, d\}$ and let the pairwise scores be defined in the following matrix:
s   a   b   c   d   _
a   1  -1  -2   0  -1
b       3  -2  -1   0
c           0  -4  -2
d               3  -1
_                   0
Then the alignment
c a c _ d b d
c a b b d b _
has a total value of 0 + 1 - 2 + 0 + 3 + 3 - 1 = 4.
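The alignment value can be recomputed column by column. The sketch below assumes the scoring matrix is symmetric, since only its upper triangle is shown; the names `SCORES`, `s`, and `alignment_value` are chosen here for illustration.

```python
# Upper triangle of the example scoring matrix; assumed symmetric.
SCORES = {
    ("a", "a"): 1, ("a", "b"): -1, ("a", "c"): -2, ("a", "d"): 0, ("a", "_"): -1,
    ("b", "b"): 3, ("b", "c"): -2, ("b", "d"): -1, ("b", "_"): 0,
    ("c", "c"): 0, ("c", "d"): -4, ("c", "_"): -2,
    ("d", "d"): 3, ("d", "_"): -1,
    ("_", "_"): 0,
}


def s(x: str, y: str) -> int:
    """Score of aligning character x against character y."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_value(top: str, bottom: str) -> int:
    """Sum of the column scores of an alignment given as two equal-length rows."""
    assert len(top) == len(bottom)
    return sum(s(x, y) for x, y in zip(top, bottom))


print(alignment_value("cac_dbd", "cabbdb_"))  # 4
```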

String similarity
Definition: Given a pairwise scoring matrix over the alphabet $\Sigma'$, the similarity of two strings $S_1$ and $S_2$ is defined as the value of the alignment $A$ of $S_1$ and $S_2$ that maximizes total alignment value.

Computing similarity
Definition: $V(i,j)$ is defined as the value of the optimal alignment of prefixes $S_1[1..i]$ and $S_2[1..j]$.
The base conditions are
$$V(0,j) = \sum_{1 \le k \le j} s(\_, S_2(k)), \qquad V(i,0) = \sum_{1 \le k \le i} s(S_1(k), \_).$$

Computing similarity
For $i$ and $j$ both strictly positive, the general recurrence is
$$V(i,j) = \max[\,V(i-1,j-1) + s(S_1(i), S_2(j)),\; V(i-1,j) + s(S_1(i), \_),\; V(i,j-1) + s(\_, S_2(j))\,].$$
If $S_1$ and $S_2$ are of length $n$ and $m$, then the value of their optimal alignment, $V(n,m)$, can be found (using a dynamic programming table) in $O(nm)$ time.
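The base conditions and the general recurrence translate directly into a table fill. This sketch is illustrative only; `global_similarity` and `simple_score` are hypothetical names, and the scoring scheme (match +2, mismatch -2, space -1) is borrowed from the local alignment example later in the talk.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Assumed scoring scheme: match +2, mismatch -2, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def global_similarity(s1: str, s2: str, score: Callable[[str, str], int]) -> int:
    """Value V(n,m) of an optimal global alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # V(i,0): s1 prefix against spaces
        V[i][0] = V[i - 1][0] + score(s1[i - 1], "_")
    for j in range(1, m + 1):                      # V(0,j): spaces against s2 prefix
        V[0][j] = V[0][j - 1] + score("_", s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + score(s1[i - 1], "_"),
                          V[i][j - 1] + score("_", s2[j - 1]))
    return V[n][m]


print(global_similarity("axabcs", "axbacs", simple_score))  # 8
```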

Alignment graphs for similarity
As was the case for edit distance, the computation of similarity can be viewed as a path problem on a directed acyclic graph called an alignment graph.
The longest start-to-destination paths in the alignment graph are in one-to-one correspondence with the optimal (maximum value) alignments.

End-space free variant
Spaces at the end or the beginning of the alignment contribute a weight of zero.
Example: the shotgun sequence assembly problem.
Implementation: use the recurrence for global alignment, but change the base conditions to $V(i,0) = V(0,j) = 0$.
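A sketch of the end-space free variant follows. The zero base conditions make leading spaces free; to make trailing spaces free as well, the sketch takes the maximum over the last row and last column, a step the slide does not spell out and which is stated here as an assumption. The names `overlap_value` and `simple_score` are hypothetical.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Assumed scoring scheme: match +2, mismatch -2, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def overlap_value(s1: str, s2: str, score: Callable[[str, str], int]) -> int:
    """Best alignment value when spaces at either end of the alignment cost nothing."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]      # V(i,0) = V(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + score(s1[i - 1], "_"),
                          V[i][j - 1] + score("_", s2[j - 1]))
    # Trailing spaces are free: stop anywhere in the last row or last column.
    return max(max(V[n]), max(row[m] for row in V))


# A suffix of the first fragment overlaps a prefix of the second, as in assembly.
print(overlap_value("xxxabcd", "abcdyyy", simple_score))  # 8
```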

Approximate occurrences of P in T
Definition: Given a parameter $\delta$, a substring $T'$ of $T$ is said to be an approximate occurrence of $P$ if and only if the optimal alignment of $P$ to $T'$ has value at least $\delta$.
Theorem: Let $n$ be the length of $P$. There is an approximate occurrence of $P$ in $T$ ending at position $j$ of $T$ if and only if $V(n,j) \ge \delta$. Moreover, $T[k..j]$ is an approximate occurrence of $P$ in $T$ if and only if $V(n,j) \ge \delta$ and there is a path of backpointers from cell $(n,j)$ to cell $(0,k)$.
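The first part of the theorem can be sketched as below. The slide does not state the base conditions for this table; the sketch assumes $V(0,j) = 0$ (an occurrence may start anywhere in $T$) and the usual global base condition for $V(i,0)$. The function and parameter names are hypothetical.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Assumed scoring scheme: match +2, mismatch -2, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def approximate_end_positions(p: str, t: str,
                              score: Callable[[str, str], int],
                              delta: int) -> list[int]:
    """1-based positions j of t with V(n, j) >= delta, where n = len(p)."""
    m = len(t)
    prev = [0] * (m + 1)                           # assumed base: V(0,j) = 0
    for i in range(1, len(p) + 1):
        cur = [prev[0] + score(p[i - 1], "_")] + [0] * m   # V(i,0)
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + score(p[i - 1], t[j - 1]),
                         prev[j] + score(p[i - 1], "_"),
                         cur[j - 1] + score("_", t[j - 1]))
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] >= delta]


print(approximate_end_positions("abc", "xxabcxxabdxx", simple_score, 6))  # [5]
```

Only the end positions are reported here; recovering the start position $k$ would additionally require the backpointers mentioned in the theorem.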

How to find the optimal alignment in linear space?
Definition: For any string $\alpha$, let $\alpha^r$ denote the reverse of string $\alpha$.
Definition: Given strings $S_1$ and $S_2$, define $V^r(i,j)$ as the similarity of the string consisting of the first $i$ characters of $S_1^r$ and the string consisting of the first $j$ characters of $S_2^r$. Equivalently, $V^r(i,j)$ is the similarity of the last $i$ characters of $S_1$ and the last $j$ characters of $S_2$.

How to find the optimal alignment in linear space?
Lemma 1: $V(n,m) = \max_{0 \le k \le m} [\,V(n/2, k) + V^r(n/2, m-k)\,]$.
Definition: Let $k^*$ be a position $k$ that maximizes $[\,V(n/2, k) + V^r(n/2, m-k)\,]$.
Definition: For an optimal path $L$ through cell $(n/2, k^*)$, let $L_{n/2}$ be the subpath of $L$ that starts with the last node of $L$ in row $n/2 - 1$ and ends with the first node of $L$ in row $n/2 + 1$.

How to find the optimal alignment in linear space?
Lemma 2: A position $k^*$ in row $n/2$ can be found in $O(nm)$ time and $O(m)$ space. Moreover, a subpath $L_{n/2}$ can be found and stored in those time and space bounds.

How to find the optimal alignment in linear space?
[Figure: the dynamic programming table divided around row $n/2$ into subproblems A and B, with rows $n/2 - 1$, $n/2$, $n/2 + 1$, and $n$ and columns $k_1$, $k^*$, $k_2$, and $m$ marked.]

How to find the optimal alignment in linear space?
• Execute dynamic programming to compute the optimal alignment of $S_1$ and $S_2$, but stop after iteration $n/2$. When filling in row $n/2$, save the normal traceback pointers for the cells in that row. This takes $O(m)$ space.
• Do the same first steps for $S_1^r$ and $S_2^r$.
• Using the first set of saved pointers, follow any traceback path from cell $(n/2, k^*)$ to a cell $k_1$ in row $n/2 - 1$. (Do the same for $k_2$ and row $n/2 + 1$.)
• $O(nm)$ time and $O(m)$ space is used to find $k^*$, $k_1$, $k_2$, and $L_{n/2}$.
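The key step, finding $k^*$ while keeping only one row of the table at a time, can be sketched as follows. This is a partial sketch under stated assumptions: it computes $k^*$ and checks Lemma 1, but it does not store traceback pointers or carry out the recursion on the two subproblems. The names `last_row`, `best_split`, and `simple_score` are hypothetical, and for odd $n$ the second half is taken to be the remaining $n - \lfloor n/2 \rfloor$ characters.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Assumed scoring scheme: match +2, mismatch -2, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def last_row(s1: str, s2: str, score: Callable[[str, str], int]) -> list[int]:
    """Row len(s1) of the similarity table V, using O(len(s2)) space."""
    m = len(s2)
    prev = [0] * (m + 1)
    for j in range(1, m + 1):
        prev[j] = prev[j - 1] + score("_", s2[j - 1])
    for ch in s1:
        cur = [prev[0] + score(ch, "_")] + [0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + score(ch, s2[j - 1]),
                         prev[j] + score(ch, "_"),
                         cur[j - 1] + score("_", s2[j - 1]))
        prev = cur                                 # only two rows are ever alive
    return prev


def best_split(s1: str, s2: str, score: Callable[[str, str], int]) -> tuple[int, int]:
    """Return (k*, V(n,m)) by combining the forward and reverse middle rows."""
    m, half = len(s2), len(s1) // 2
    forward = last_row(s1[:half], s2, score)                 # V(n/2, k)
    backward = last_row(s1[half:][::-1], s2[::-1], score)    # V^r(., m - k)
    k_star = max(range(m + 1), key=lambda k: forward[k] + backward[m - k])
    return k_star, forward[k_star] + backward[m - k_star]


k_star, value = best_split("axabcs", "axbacs", simple_score)
print(k_star, value)
```

The returned value agrees with the last entry of `last_row(s1, s2, score)`, which is what Lemma 1 asserts.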

Local alignment
Local alignment problem: Given two strings $S_1$ and $S_2$, find substrings $\alpha$ and $\beta$ of $S_1$ and $S_2$, respectively, whose similarity (optimal global alignment value) is maximum over all pairs of substrings from $S_1$ and $S_2$.

Local alignment
$S_1$ = pqraxabcstvq, $S_2$ = xyaxbacsll
match = +2, mismatch = -2, space = -1
Optimal local alignment:
a x a b _ c s
a x _ b a c s
The optimal local alignment of $S_1$ and $S_2$ has value 8 and is defined by substrings axabcs and axbacs.

Why local alignment?
Global alignment of protein sequences is often meaningful when the two strings are members of the same protein family.
Local alignment is critical when comparing long stretches of anonymous DNA or proteins from very different families.

Computing local alignment
Definition: Given a pair of indices $i \le n$ and $j \le m$, the local suffix alignment problem is to find a (possibly empty) suffix $\alpha$ of $S_1[1..i]$ and a (possibly empty) suffix $\beta$ of $S_2[1..j]$ such that $V(\alpha, \beta)$ is the maximum over all pairs of suffixes of $S_1[1..i]$ and $S_2[1..j]$.

Computing local alignment
Theorem: Let $V(i,j)$ be the value of the optimal local suffix alignment for the given index pair $i, j$, and let $v^*$ be the value of the optimal local alignment for two strings of length $n$ and $m$. Then $v^* = \max[\,V(i,j) : i \le n,\ j \le m\,]$.

Computing local alignment
Theorem: If $i', j'$ is an index pair maximizing $V(i,j)$ over all $i, j$ pairs, then a pair of substrings solving the local suffix alignment problem for $i', j'$ also solves the local alignment problem.

How to solve the local suffix alignment problem
First, $V(i,0) = V(0,j) = 0$ for all $i, j$, since we can always choose an empty suffix.
Theorem: For $i > 0$ and $j > 0$, the proper recurrence for $V(i,j)$ is
$$V(i,j) = \max[\,0,\; V(i-1,j-1) + s(S_1(i), S_2(j)),\; V(i-1,j) + s(S_1(i), \_),\; V(i,j-1) + s(\_, S_2(j))\,].$$
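The recurrence with the extra 0 term can be sketched as below, using the strings and scores of the earlier local alignment example. The function names are chosen for illustration; the sketch returns only the optimal value $v^*$ and one cell attaining it, not the alignment itself.

```python
from typing import Callable


def simple_score(x: str, y: str) -> int:
    """Scores from the example: match +2, mismatch -2, space -1."""
    if x == "_" or y == "_":
        return -1
    return 2 if x == y else -2


def local_alignment_value(s1: str, s2: str,
                          score: Callable[[str, str], int]) -> tuple[int, tuple[int, int]]:
    """Return (v*, (i, j)) where cell (i, j) holds the maximum of the table."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]      # V(i,0) = V(0,j) = 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(0,                        # empty suffixes
                          V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + score(s1[i - 1], "_"),
                          V[i][j - 1] + score("_", s2[j - 1]))
            if V[i][j] > best:
                best, best_cell = V[i][j], (i, j)
    return best, best_cell


value, cell = local_alignment_value("pqraxabcstvq", "xyaxbacsll", simple_score)
print(value, cell)
```

Tracing pointers back from the reported cell until a cell with value 0 is reached would recover the substrings themselves.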

Time analysis
Theorem: For two strings $S_1$ and $S_2$ of lengths $n$ and $m$, the local alignment problem can be solved in $O(nm)$ time, the same time as for global alignment.
Theorem: All optimal local alignments of two strings are represented in the dynamic programming table for $V(i,j)$ and can be found by tracing any pointers back from any cell with value $v^*$.

Gaps
Definition: A gap is any maximal, consecutive run of spaces in a single string of a given alignment.
An alignment with seven spaces distributed into four gaps:
c t t t a a c _ _ a _ a c
c _ _ _ c a c c c a t _ c
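Counting spaces and gaps in the example can be sketched as follows; the function name `count_spaces_and_gaps` is an assumption made for this illustration.

```python
from itertools import groupby


def count_spaces_and_gaps(row: str) -> tuple[int, int]:
    """Return (number of spaces, number of gaps) in one row of an alignment."""
    spaces = row.count("_")
    # A gap is a maximal run of consecutive spaces, so count the runs.
    gaps = sum(1 for is_space, _ in groupby(row, key=lambda ch: ch == "_") if is_space)
    return spaces, gaps


top = "ctttaac__a_ac"
bottom = "c___cacccat_c"
top_counts = count_spaces_and_gaps(top)
bottom_counts = count_spaces_and_gaps(bottom)
print(top_counts, bottom_counts)  # (3, 2) (4, 2)
print(top_counts[0] + bottom_counts[0], top_counts[1] + bottom_counts[1])  # 7 4
```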

Why gaps?
A gap in string $S_1$ opposite substring $\alpha$ in string $S_2$ corresponds to either a deletion of $\alpha$ from $S_1$ or an insertion of $\alpha$ into $S_2$. The concept of a gap in an alignment is therefore important in many biological applications, because the insertion or deletion of an entire substring (particularly in DNA) often occurs as a single mutational event.

Choices for gap weights
We will examine in detail four general types of gap weights: constant, affine, convex, and arbitrary.
The objective in the constant gap weight model is to find an alignment $A$ that maximizes
$$\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g \cdot (\#\,\text{gaps}).$$
The objective in the affine gap weight model is to find an alignment $A$ that maximizes
$$\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g \cdot (\#\,\text{gaps}) - W_s \cdot (\#\,\text{spaces}),$$
where $W_g$ is the weight given to each gap and $W_s$ is the weight given to each space.

Choices for gap weights
In the convex gap weight model, each additional space in a gap contributes less to the gap weight than the preceding space, so the gap weight is a convex function of the gap length. Example: $W_g + \log_e q$, where $q$ is the length of the gap.
In the arbitrary gap weight model, the weight of a gap is an arbitrary function $w(q)$ of its length $q$. The constant, affine, and convex weight models are of course subcases of the arbitrary weight model.

Time bounds for gap choices
Solving the above problems using dynamic programming:
Arbitrary gap: $O(nm^2 + n^2 m)$
Convex gap: $O(nm \log m)$
Affine gap: $O(nm)$
Constant gap: $O(nm)$

Arbitrary gap weights
[Figure: the three types of alignment (1, 2, and 3) of the prefixes $S_1[1..i]$ and $S_2[1..j]$, corresponding to E, F, and G.]

Arbitrary gap weights
Definition: Define $E(i,j)$ as the maximum value of any alignment of type 1; define $F(i,j)$ as the maximum value of any alignment of type 2; define $G(i,j)$ as the maximum value of any alignment of type 3; and finally define $V(i,j)$ as the maximum value of the three terms $E(i,j)$, $F(i,j)$, $G(i,j)$.

Arbitrary gap weights
Recurrences for the case of arbitrary gap weights:
$$V(i,j) = \max[\,E(i,j),\; F(i,j),\; G(i,j)\,]$$
$$G(i,j) = V(i-1,j-1) + s(S_1(i), S_2(j))$$
$$E(i,j) = \max_{0 \le k \le j-1} [\,V(i,k) - w(j-k)\,]$$
$$F(i,j) = \max_{0 \le l \le i-1} [\,V(l,j) - w(i-l)\,]$$

Arbitrary gap weights
Base case if all spaces are included in the objective function:
$V(i,0) = -w(i)$, $V(0,j) = -w(j)$
$E(i,0) = -w(i)$, $F(0,j) = -w(j)$
$G(0,0) = 0$
Base case if end spaces, and hence end gaps, are free:
$V(i,0) = 0$, $V(0,j) = 0$
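The recurrences and the first set of base cases (all spaces counted) can be sketched as below. This is an illustration under assumptions: `score` is applied only to pairs of characters, the gap weight function `w` is passed in by the caller, and the two example weight functions (a constant weight of 3 and the affine weight 2 + q) are invented for the demonstration.

```python
from typing import Callable


def similarity_arbitrary_gaps(s1: str, s2: str,
                              score: Callable[[str, str], int],
                              w: Callable[[int], float]) -> float:
    """V(n,m) under an arbitrary gap weight function w(q), q = gap length."""
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)                            # V(i,0) = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)                            # V(0,j) = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            E = max(V[i][k] - w(j - k) for k in range(j))   # look j cells to the left
            F = max(V[l][j] - w(i - l) for l in range(i))   # look i cells above
            V[i][j] = max(E, F, G)
    return V[n][m]


def match_score(x: str, y: str) -> int:
    """Assumed character scores: match +2, mismatch -2."""
    return 2 if x == y else -2


print(similarity_arbitrary_gaps("abcd", "ad", match_score, lambda q: 3))      # constant gap weight
print(similarity_arbitrary_gaps("abcd", "ad", match_score, lambda q: 2 + q))  # affine gap weight
```

The two inner `max` scans over a row and a column are what raise the running time from $O(nm)$ to $O(nm^2 + n^2 m)$.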

Time analysis
Theorem: Assuming that $|S_1| = n$ and $|S_2| = m$, the recurrences can be evaluated in $O(nm^2 + n^2 m)$ time.
Before gaps were included in the model, $V(i,j)$ depended only on the three cells adjacent to $(i,j)$; now we need to look $j$ cells to the left and $i$ cells above to determine $V(i,j)$.

Summary
• The first fact of biological sequence analysis
• Dynamic programming: edit distance, the recurrence relation, tabular computation
• Optimal alignment in linear space
• Global alignment vs. local alignment
• Gaps

Food for thought…
Repeated substrings: find inexact repeats in a single string.
If we do local alignment of a string against itself, the best substring will be the entire string.
Even using all the values in the table, the best path may be strongly influenced by the main diagonal.

Bibliography
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 2nd edition. Cambridge, MA: MIT Press, 2001. (The MIT electrical engineering and computer science series.)