Space-Efficient Sequence Alignment
Bioinformatics 202
University of California, San Diego
Lecture Notes No. 7
Dr. Pavel Pevzner
(prepared by Iman Famili)
Outline
New computational ideas for sequence comparison:
• Divide-and-conquer technique
• Recursive programs
• Hash tables
Edit Graphs
• An edit graph is used to find similarities between two sequences.
• Every alignment corresponds to a path from the source to the sink, so finding the best alignment is a longest-path problem.
• The alignment is computed by constructing an “edit graph”.
• There are 3 types of edges in the edit graph: horizontal (H), diagonal (D), and vertical (V), corresponding to insertion (I), match/mismatch (M), and deletion (D), respectively.
• Every edge of the edit graph (i.e. every movement) has a weight corresponding to the penalty or premium for that action.
• The best path is the path with the maximum length, i.e. the maximum total edge weight.
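The longest-path view above can be sketched in a few lines of Python. This is a minimal illustration, not the notes' own implementation: the function name `align` and the edge weights (match +1, mismatch -1, insertion/deletion -1) are assumptions chosen for the example. Rows follow the first sequence, so a vertical edge is a deletion (D), a horizontal edge is an insertion (I), and a diagonal edge is a match/mismatch (M).

```python
def align(v, w, match=1, mismatch=-1, indel=-1):
    """Longest source-to-sink path in the edit graph of v and w.

    The weights are assumed values for illustration only.
    Returns (score, edge labels of one best path).
    """
    n, m = len(v), len(w)
    # s[i][j] = weight of the best path from the source (0, 0) to vertex (i, j)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + indel, s[i][j - 1] + indel)

    # Backtrack from the sink (n, m) to the source, recording the edge types.
    edges = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + (
            match if v[i - 1] == w[j - 1] else mismatch
        ):
            edges.append("M")
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            edges.append("D")
            i -= 1
        else:
            edges.append("I")
            j -= 1
    return s[n][m], "".join(reversed(edges))
```

For example, `align("ACGT", "AGT")` pairs A, G, and T and deletes C, which under the assumed weights gives a score of 2.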
Edit Graph
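[Figure: edit graph with the source and the sink marked. The sequence letters along its axes read T G C A T A A T C T G A T, and the legend distinguishes deletions, mismatches, insertions, and matches.]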
Computational Complexity of Dynamic Programming
Sequence alignment is limited by:
• Time:
– Four operations are needed at each vertex.
– The required time is proportional to the number of edges in the edit graph, i.e. O(nm), where n and m are the sequence lengths.
• Space:
– The required memory is proportional to the number of vertices in the edit graph, O(nm).
Computational Complexity of Dynamic Programming
– To compute the score of an alignment, we only need to keep 2 columns in memory at any time, because the score of each box in the dynamic programming (DP) matrix depends only on three previously calculated boxes. Therefore only linear memory is required to compute the scores of the DP matrix.
– To compute the alignment itself (backtracking through the matrix), however, quadratic memory is needed, O(nm) (that is, O($n^2$) for two sequences of length n), since all the scores are needed to find the best alignment.
[Figure: only 2 columns are needed to determine the score of each box (forward calculation); all columns are needed for calculating the best alignment (backtracking).]
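The two-column idea can be sketched as follows. This is an illustrative sketch under assumed edge weights (match +1, mismatch -1, insertion/deletion -1); the function name `alignment_score` is hypothetical. It returns only the score, since the discarded columns make backtracking impossible.

```python
def alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best alignment score of v and w, keeping only two DP columns.

    The weights are assumed values for illustration only.
    """
    n = len(v)
    # prev holds column j - 1 of the DP matrix, curr holds column j.
    prev = [i * indel for i in range(n + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * indel] + [0] * n
        for i in range(1, n + 1):
            diag = prev[i - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # Each box depends only on its three neighbours: diagonal, left, up.
            curr[i] = max(diag, prev[i] + indel, curr[i - 1] + indel)
        prev = curr
    return prev[n]
```

Memory use is proportional to the length of `v` alone, whatever the length of `w`.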
Space-Efficient Sequence Alignment
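To reduce the space complexity of sequence alignment:
• Find the middle vertex between a source and a sink by computing the scores $s_{*,m/2}$ of the best paths from (0,0) to (i,m/2) and the scores $s^{reverse}_{*,m/2}$ of the best paths from (i,m/2) to (n,m), for every row i (i.e. find the longest path between the source and the middle vertex and between the middle vertex and the sink). The middle vertex is the vertex (i,m/2) that maximizes $s_{i,m/2} + s^{reverse}_{i,m/2}$.
• Repeat this process recursively on the two resulting subproblems.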
[Figure: the rectangle between the source (0,0) and the sink (n,m) is split at column m/2 by the middle vertex (i,m/2); each of the two resulting rectangles is then split at its own middle vertex, and so on.]
Space-Efficient Sequence Alignment
• The computing time is proportional to the area of the rectangles. The total time to find the middle vertices is therefore: area + area/2 + area/4 + … ≤ 2·area
• The space complexity is of order n, O(n).
• Pseudocode for this algorithm is:
Path (source, sink)
    If source and sink are in consecutive columns
        output the longest path from the source to the sink
    Else
        middle ← middle vertex between source and sink
        Path (source, middle)
        Path (middle, sink)
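A runnable version of the pseudocode might look like the sketch below. It is one possible reading, not the lecture's reference implementation: the names `last_column`, `path`, and `path_score` and the weights `MATCH`, `MISMATCH`, `INDEL` are assumptions made for the example. The middle vertex is found by adding the forward scores to the scores computed on the reversed sequences, and the base case handles a source and sink in consecutive columns directly.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1  # assumed edge weights, not from the notes


def last_column(v, w):
    """Scores of the best paths from (0, 0) to (i, len(w)) for every row i.

    Only two columns are kept in memory at any time.
    """
    prev = [i * INDEL for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * INDEL] + [0] * len(v)
        for i in range(1, len(v) + 1):
            diag = prev[i - 1] + (MATCH if v[i - 1] == w[j - 1] else MISMATCH)
            curr[i] = max(diag, prev[i] + INDEL, curr[i - 1] + INDEL)
        prev = curr
    return prev


def path(v, w):
    """Edge labels of a best path: M (diagonal), D (vertical), I (horizontal)."""
    n, m = len(v), len(w)
    if n == 0:
        return "I" * m
    if m == 0:
        return "D" * n
    if m == 1:
        # Source and sink are in consecutive columns: solve directly.
        i = v.find(w)                      # pair w with a matching letter if any
        pair = MATCH if i >= 0 else MISMATCH
        i = max(i, 0)
        if pair + (n - 1) * INDEL >= (n + 1) * INDEL:
            return "D" * i + "M" + "D" * (n - 1 - i)
        return "D" * n + "I"
    mid = m // 2
    forward = last_column(v, w[:mid])               # (0, 0) to (i, mid)
    backward = last_column(v[::-1], w[mid:][::-1])  # (i, mid) to (n, m), at index n - i
    i = max(range(n + 1), key=lambda r: forward[r] + backward[n - r])
    return path(v[:i], w[:mid]) + path(v[i:], w[mid:])


def path_score(edges, v, w):
    """Total weight of a path given as a string of edge labels."""
    i = j = total = 0
    for edge in edges:
        if edge == "M":
            total += MATCH if v[i] == w[j] else MISMATCH
            i, j = i + 1, j + 1
        elif edge == "D":
            total, i = total + INDEL, i + 1
        else:
            total, j = total + INDEL, j + 1
    return total
```

A quick sanity check is that `path_score(path(v, w), v, w)` equals `last_column(v, w)[-1]`, the score obtained by the plain two-column computation.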
String Matching: naïve approach
Let’s say we want to compare a query sequence of length $l = 10$ against a database of length, for example, $n = 10^9$, and we want to find exact occurrences of the query in the database. We can:
1. Move the query along the database one base at a time and compare it with the database at every position (this takes a long time).
[Figure: a query of length $l = 10$ sliding along a database of length $n = 10^9$, which essentially amounts to moving diagonally along the database alignments.]
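The naïve scan can be written as a short sketch (the function name `naive_find` is hypothetical). It compares the query with the database at every start position, which is why it is slow for a database of a billion bases.

```python
def naive_find(query, database):
    """Start positions of every exact occurrence of query in database."""
    l, n = len(query), len(database)
    hits = []
    for start in range(n - l + 1):
        # One comparison of up to l characters per start position.
        if database[start:start + l] == query:
            hits.append(start)
    return hits
```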
String Matching: hashing
2. Create a hash table of all $l$-length strings that occur in the database, and search your $l$-length string against the hash table.
[Figure: hash table]
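A minimal sketch of the hashing approach, using a Python dictionary as the hash table. The names `build_index` and `lookup` are hypothetical, and the whole table is held in memory, which is an assumption that would not hold as-is for a very large database.

```python
from collections import defaultdict


def build_index(database, l):
    """Hash table mapping each l-length substring to its start positions."""
    index = defaultdict(list)
    for start in range(len(database) - l + 1):
        index[database[start:start + l]].append(start)
    return index


def lookup(index, query):
    """Positions of the query in the indexed database (empty list if absent)."""
    return index.get(query, [])
```

The table is built once in a single pass over the database; each later query is then answered by one hash lookup instead of a full scan.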
Approximate String Matching
• Now if instead of $l = 10$ we have $l = 1000$, we can apply the same method by dividing the query into overlapping strings 10 bases long and combining the resulting matches.
• String matching in this fashion may be done using filtration/verification algorithms, which are described next.
Filtration/Verification Method
• Let’s say we want to find a string in a database with up to 2 mismatches, or in general, find a query $q_1 \ldots q_p$ in a text (database) $t_1 \ldots t_n$ with up to $k$ mismatches.
• The query matching problem is to find all pairs of $m$-substrings (substrings of length $m$), one from the query and one from the text, that match with at most $k$ mismatches. Filtration/verification algorithms are used to perform this task.
• Filtration/verification algorithms involve a two-stage process. First, a set of positions in the text that are potentially similar to the query is preselected. Second, each potential position is verified: it is accepted if it has at most $k$ mismatches and rejected if more than $k$ mismatches are found.
[Figure: from a potential match, walk in both directions while the number of mismatches is at most $k$.]
Filtration/Verification Method
• The filtration/verification algorithm is done in 2 steps:
1. Potential match detection: Find all matches of $l$-tuples in both the query and the text for $l = \lfloor m/(k+1) \rfloor$. Two $m$-substrings that match with at most $k$ mismatches must share at least one exact $l$-tuple of this length, and such $l$-tuple matches are sparse (they happen rarely by chance).
2. Potential match verification: Verify each potential match by extending it to the left and to the right until either (i) the first $k+1$ mismatches are found or (ii) the beginning or end of the query or the text is found.
• This is the idea behind BLAST and FASTA.
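The two steps can be sketched as below. This is a simplified illustration, not how BLAST or FASTA are actually implemented: the function name `query_matches` is hypothetical, and the verification step checks every length-m window on the seed's diagonal directly instead of extending left and right until k+1 mismatches, which is easier to get right in a short example.

```python
from collections import defaultdict


def query_matches(query, text, m, k):
    """All (i, j) such that query[i:i+m] and text[j:j+m] differ in at most k positions."""
    l = m // (k + 1)  # seed length from the filtration step
    if l == 0:
        raise ValueError("need m >= k + 1 so that the seed length is at least 1")

    # Filtration: hash every l-tuple of the text, then look up each l-tuple of the query.
    index = defaultdict(list)
    for b in range(len(text) - l + 1):
        index[text[b:b + l]].append(b)

    found = set()
    for a in range(len(query) - l + 1):
        for b in index.get(query[a:a + l], []):
            # Verification: try each length-m window on this diagonal containing the seed.
            for shift in range(m - l + 1):
                i, j = a - shift, b - shift
                if i < 0 or j < 0 or i + m > len(query) or j + m > len(text):
                    continue
                mismatches = sum(x != y for x, y in zip(query[i:i + m], text[j:j + m]))
                if mismatches <= k:
                    found.add((i, j))
    return sorted(found)
```

For instance, `query_matches("ACGTAC", "TTACGAACTT", 6, 1)` uses seeds of length 3 and reports the pair (0, 2), where the query matches the text substring ACGAAC with one mismatch.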