Bioinformatics 202 Alignment and Sequence Comparison

Oct 2, 2013

Space-Efficient Sequence Alignment


Bioinformatics 202

University of California, San Diego


Lecture Notes No. 7

Dr. Pavel Pevzner

(prepared by Iman Famili)

Outline

New computational ideas for sequence comparison:

Divide-and-conquer technique

Recursive programs

Hash tables

Edit Graphs


An edit graph is used to find similarities between two sequences.

Every alignment corresponds to a path from the source to the sink, so finding the best alignment is a longest path problem.


The alignment is done by constructing an "edit graph".


There are 3 types of edges in the edit graph: horizontal (H), diagonal (D), and vertical (V), corresponding to insertion (I), match/mismatch (M), and deletion (D), respectively.


Every edge of the edit graph (i.e. every movement) has a weight corresponding to the penalty or premium for that action.


The best path is the path with the maximum length, i.e. the maximum total edge weight.
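The longest-path view can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `longest_path_score` and the weights (`match=1`, `mismatch=-1`, `indel=-1`) are assumptions chosen for the example, and the first sequence is placed along the vertical axis.

```python
def longest_path_score(v, w, match=1, mismatch=-1, indel=-1):
    """Weight of the longest source-to-sink path in the edit graph of v and w.

    The weights are illustrative assumptions; any premium/penalty scheme works.
    """
    n, m = len(v), len(w)
    # s[i][j] = weight of the longest path from the source (0, 0) to vertex (i, j)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + indel              # only vertical edges reach column 0
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + indel              # only horizontal edges reach row 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j] + indel,      # vertical edge
                          s[i][j - 1] + indel,      # horizontal edge
                          s[i - 1][j - 1] + diag)   # diagonal edge
    return s[n][m]                                  # the sink is vertex (n, m)


if __name__ == "__main__":
    print(longest_path_score("TGCATA", "ATCTGAT"))
```

Setting `match=1, mismatch=0, indel=0` turns the same routine into a longest-common-subsequence length computation.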


Edit Graph

[Figure: edit graph for the sequences TGCATA and ATCTGAT, with the source and the sink marked. The legend distinguishes deletion, mismatch, insertion, and match edges.]

Computational Complexity of Dynamic Programming

Sequence alignment is limited by:

Time:
Four operations are needed at each vertex.
The required time is proportional to the number of edges in the edit graph, i.e. O(nm), where n and m are the sequence lengths.

Space:
The required memory is proportional to the number of vertices in the edit graph, O(nm).

Computational Complexity of Dynamic Programming (continued)


To compute only the score of the alignment, it is enough to keep 2 columns of the dynamic programming (DP) matrix in memory at any time. This works because the score of each box in the DP matrix depends only on three previously calculated boxes: the one to the left, the one above, and the one diagonally above-left. Therefore only linear memory, O(n), is required to compute the score.






To calculate the alignment itself (backtracking through the matrix), however, quadratic memory is needed, O(nm), or O(n^2) for two sequences of length n, since straightforward backtracking needs all the scores to find the best alignment.

[Figure: two DP matrices. Forward calculation: only 2 columns are needed to determine the score of each box. Backtracking: all columns are needed to calculate the best alignment.]
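The two-column idea can be sketched as follows. This is a minimal illustration under assumed weights (`match=1`, `mismatch=-1`, `indel=-1`); the function name `alignment_score_two_columns` is hypothetical. It returns only the score, because the columns needed for backtracking are discarded as the computation proceeds.

```python
def alignment_score_two_columns(v, w, match=1, mismatch=-1, indel=-1):
    """Global alignment score using two columns of the DP matrix (O(len(v)) memory)."""
    n, m = len(v), len(w)
    # prev holds column j-1 of the DP matrix, one entry per row i = 0..n
    prev = [i * indel for i in range(n + 1)]
    for j in range(1, m + 1):
        curr = [prev[0] + indel] + [0] * n          # row 0 of column j
        for i in range(1, n + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            curr[i] = max(curr[i - 1] + indel,      # box above (same column)
                          prev[i] + indel,          # box to the left
                          prev[i - 1] + diag)       # box diagonally above-left
        prev = curr                                 # column j-1 is no longer needed
    return prev[n]


if __name__ == "__main__":
    print(alignment_score_two_columns("TGCATA", "ATCTGAT"))
```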

Space-Efficient Sequence Alignment

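To reduce the space complexity of sequence alignment:

Find the middle vertex between a source and a sink. For every row i, compute s(i, m/2), the score of the longest path from (0,0) to (i, m/2), and s_reverse(i, m/2), the score of the longest path from (i, m/2) to (n,m). The middle vertex is the vertex (i, m/2) that maximizes s(i, m/2) + s_reverse(i, m/2), i.e. the vertex in the middle column that lies on the longest path between the source and the sink.

Repeat this process recursively for the path between the source and the middle vertex and for the path between the middle vertex and the sink.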

[Figure: the rectangle from the source (0,0) to the sink (n,m). The middle vertex is found at row i of column m/2; middle vertices are then found in the same way inside successively smaller sub-rectangles.]

Space-Efficient Sequence Alignment (continued)


The computing time is proportional to the area of the rectangles. The total time to find the middle vertices is therefore:

area + area/2 + area/4 + ... ≤ 2 * area

The space complexity is of order n, O(n).
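The time bound follows from a geometric series. Taking area = nm for the full rectangle, the two sub-rectangles on either side of a middle vertex always add up to half the area of the rectangle they came from, so each level of the recursion costs half as much as the previous one:

```latex
\[
nm + \frac{nm}{2} + \frac{nm}{4} + \cdots
  \;\le\; nm \sum_{j=0}^{\infty} \frac{1}{2^{j}}
  \;=\; 2nm .
\]
```

The total work is therefore still O(nm), at most about twice the cost of the ordinary quadratic-space algorithm.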


Pseudocode for this algorithm is:

Path(source, sink)
    If source and sink are in consecutive columns
        output the longest path from the source to the sink
    Else
        middle ← middle vertex between source and sink
        Path(source, middle)
        Path(middle, sink)
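The recursion can be made concrete in Python. The sketch below is one possible reading of the pseudocode (the approach is commonly known as Hirschberg's algorithm), not the lecture's own implementation. The names `last_column` and `align` are hypothetical, the weights are assumptions, and the base case assumes `match >= mismatch`. String slices make temporary copies, so this favors readability over strict memory economy.

```python
def last_column(v, w, match, mismatch, indel):
    """Scores of the longest paths from (0, 0) to (i, len(w)) for every row i."""
    prev = [i * indel for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [prev[0] + indel] + [0] * len(v)
        for i in range(1, len(v) + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            curr[i] = max(curr[i - 1] + indel, prev[i] + indel, prev[i - 1] + diag)
        prev = curr
    return prev


def align(v, w, match=1, mismatch=-1, indel=-1):
    """Return an optimal global alignment of v and w as two gapped strings."""
    n, m = len(v), len(w)
    if m == 0:
        return v, "-" * n
    if n == 0:
        return "-" * m, w
    if m == 1:
        # Base case (one column): pair w with its best partner in v,
        # unless gapping every symbol scores higher.
        i = max(v.find(w), 0)
        diag = match if v[i] == w else mismatch
        if diag + (n - 1) * indel >= (n + 1) * indel:
            return v, "-" * i + w + "-" * (n - 1 - i)
        return v + "-", "-" * n + w
    mid = m // 2
    # forward[i]  = s(i, m/2): best score from the source to (i, mid)
    forward = last_column(v, w[:mid], match, mismatch, indel)
    # backward[k] = best score for aligning the last k symbols of v with w[mid:]
    backward = last_column(v[::-1], w[mid:][::-1], match, mismatch, indel)
    # Middle vertex: the row i that maximizes s(i, m/2) + s_reverse(i, m/2)
    i = max(range(n + 1), key=lambda r: forward[r] + backward[n - r])
    left = align(v[:i], w[:mid], match, mismatch, indel)     # Path(source, middle)
    right = align(v[i:], w[mid:], match, mismatch, indel)    # Path(middle, sink)
    return left[0] + right[0], left[1] + right[1]


if __name__ == "__main__":
    top, bottom = align("TGCATA", "ATCTGAT")
    print(top)
    print(bottom)
```

Only two columns are alive inside each call to `last_column`, which is where the linear-space behavior comes from; the alignment is assembled from the pieces returned by the two recursive calls.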

String Matching: naïve approach

Let's say we want to search for a query sequence of length l = 10 in a database of length, for example, n = 10^9, and we want to find exact occurrences of the query in the database. We can:

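1. Move the query along the database one base at a time and compare at every position (this takes a long time, up to about l * n character comparisons).

[Figure: the query (l = 10) sliding along the database (n = 10^9).]

In effect, this moves diagonal by diagonal through the alignment of the query against the database.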
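The sliding comparison can be sketched as below. The function name `naive_search` is hypothetical; the sketch simply tests every start position of the database.

```python
def naive_search(query, database):
    """Return every start position where query occurs exactly in database."""
    l, n = len(query), len(database)
    hits = []
    for start in range(n - l + 1):                 # slide one base at a time
        if database[start:start + l] == query:     # up to l character comparisons
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(naive_search("ATA", "GATATATC"))
```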

String Matching: hashing

2. Create a hash table of all l-length strings that occur in the database, and look up your l-length query string in the hash table.
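A hash-table version might look like the following sketch. The names `build_index` and `lookup` are hypothetical, and a Python dictionary stands in for the hash table. Building the index takes one pass over the database; each lookup afterwards takes roughly constant time.

```python
from collections import defaultdict


def build_index(database, l):
    """Map every l-length substring of database to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(database) - l + 1):
        index[database[start:start + l]].append(start)
    return index


def lookup(index, query):
    """Return the start positions of query, or an empty list if it never occurs."""
    return index.get(query, [])


if __name__ == "__main__":
    index = build_index("GATATATC", 3)
    print(lookup(index, "ATA"))
```

The index is tied to a single value of l, so a query of a different length needs a different index or a splitting strategy.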

Approximate String Matching


Now if instead of l = 10 we have l = 1000, we can apply the same method by dividing the query into overlapping strings of 10 bases each and combining the resulting matches.







String matching in this fashion may be done using
filtration/verification algorithms that will be described next.
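One plausible way to combine the matches of the short overlapping strings, offered here as an assumption rather than as the lecture's method, is to count how many of them fall on the same diagonal (the same offset between query and database positions). The name `seed_hits_by_diagonal` is hypothetical.

```python
from collections import Counter, defaultdict


def seed_hits_by_diagonal(query, database, l=10):
    """Count exact l-mer matches between query and database on each diagonal."""
    index = defaultdict(list)
    for j in range(len(database) - l + 1):
        index[database[j:j + l]].append(j)
    diagonals = Counter()
    for i in range(len(query) - l + 1):            # overlapping l-mers of the query
        for j in index.get(query[i:i + l], []):
            diagonals[j - i] += 1                  # same diagonal = same offset
    return diagonals


if __name__ == "__main__":
    query = "ACGTACGTTGCA"
    database = "GG" + query + "CC"
    print(seed_hits_by_diagonal(query, database, l=4).most_common(1))
```

A diagonal that collects many hits is a candidate region where the long query may align to the database.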

Filtration/Verification Method


Let's say we want to find a string in a database with up to 2 mismatches or, in general, to find a query q_1 ... q_p in a text (database) t_1 ... t_n with up to k mismatches.


The query matching problem is to find all m-substrings of the query and the text that match with at most k mismatches. Filtration/verification algorithms are used to perform this task.


Filtration/verification algorithms involve a two-stage process. First, a set of positions in the text that are potentially similar to the query is preselected. Second, each potential position is verified: it is accepted if it has at most k mismatches and rejected if more than k mismatches are found.

[Figure: from a potential match, walk in both directions while the number of mismatches is at most k.]

Filtration/Verification Method


The filtration algorithm is done in 2 steps:

1. Potential match detection:
Find all matches of l-tuples in both the query and the text for l = floor(m/(k+1)). These matches are sparse: they happen rarely.

2. Potential match verification:
Verify each potential match by extending it to the left and to the right until either (i) the first k+1 mismatches are found or (ii) the beginning or end of the query or the text is reached.
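The choice l = floor(m/(k+1)) rests on a pigeonhole argument: k mismatches cut an m-substring into at most k+1 runs of matches, so at least one run has length l or more. The sketch below illustrates the two steps under that assumption. The names `hamming` and `filtration_search` are hypothetical, and verification is done by counting mismatches in every m-window that contains the seed, a simpler stand-in for the left/right extension described above.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def filtration_search(query, text, m, k):
    """Return sorted (query_start, text_start) pairs whose m-substrings
    match with at most k mismatches."""
    l = m // (k + 1)
    if l == 0:
        raise ValueError("need m >= k + 1 so that l-tuples are non-empty")
    # Step 1 (filtration): index the l-tuples of the text, look up those of the query
    index = defaultdict(list)
    for j in range(len(text) - l + 1):
        index[text[j:j + l]].append(j)
    hits = set()
    for i in range(len(query) - l + 1):
        for j in index.get(query[i:i + l], []):
            # Step 2 (verification): check each m-window on this diagonal
            # that fully contains the matching l-tuple
            for shift in range(m - l + 1):
                qs, ts = i - shift, j - shift
                if qs < 0 or ts < 0 or qs + m > len(query) or ts + m > len(text):
                    continue
                if hamming(query[qs:qs + m], text[ts:ts + m]) <= k:
                    hits.add((qs, ts))
    return sorted(hits)


if __name__ == "__main__":
    print(filtration_search("ACGTACGT", "TTACGAACGTTT", m=8, k=1))
```

Only positions that share an exact l-tuple are ever verified, which is what keeps the expensive mismatch counting rare.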


This is the idea behind BLAST and FASTA.