Survey of Sequence Matching and Alignment Algorithms


4 Oct 2013


Presented by Jennifer Johnstone


What is Bioinformatics?


Sequence Matching Problem


The Alignment Problem


Future Research

Bioinformatics is the application of computers in
Biology using algorithms, statistics and other
mathematical techniques to decipher the
language of DNA.


Given a string s of size n and a pattern p of size m, for which indices I of s does p exactly match s?

Example: Let p = ABA and s = AABAAGTABA. Then I = {2, 8}, since

AABAAGTABA
 ABA

and

AABAAGTABA
       ABA
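
The problem statement above can be checked directly with a brute-force scan. The following is a minimal sketch, not code from the presentation; the function name `naive_match` is an assumption, and indices are reported 1-based to agree with the example.

```python
def naive_match(s: str, p: str) -> list[int]:
    """Return every 1-based index of s at which p occurs exactly."""
    n, m = len(s), len(p)
    hits = []
    # Try each of the n - m + 1 possible placements of p against s.
    for start in range(n - m + 1):
        # Each comparison inspects up to m characters, hence O(m*n) overall.
        if s[start:start + m] == p:
            hits.append(start + 1)  # convert to the 1-based indexing of the slides
    return hits


print(naive_match("AABAAGTABA", "ABA"))  # [2, 8]
```

This is the simplest baseline; the faster methods trade preprocessing of p or s for fewer comparisons.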



Naive String Matching Algorithm, O(m*n).

String Matching with Finite Automata, O((m*|Σ|) + n).

Boyer-Moore Algorithm, O(m + n) (in practice).

String Matching with Compact Suffix Trees, O(n log(n) + m*|Σ| + k).

String Matching using Suffix Arrays, O(n + m log(n) + k).
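
To illustrate the last entry in the list, here is a rough sketch of matching with a suffix array. It assumes the array is built by simply sorting the suffixes, which is slower than the specialised constructions behind the quoted bound; only the search step follows the m log(n) + k pattern. The names `build_suffix_array` and `find_occurrences` are illustrative.

```python
def build_suffix_array(s: str) -> list[int]:
    """Return the 0-based start positions of the suffixes of s in sorted order.

    Sorting the suffixes directly is easy to read but is not the fast
    construction that the quoted complexity assumes.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s: str, sa: list[int], p: str) -> list[int]:
    """Return the 1-based positions where p occurs in s, in increasing order."""
    m = len(p)

    # Lower bound: first suffix whose first m characters are >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[first:lo] begins with p; these are the k matches.
    return sorted(i + 1 for i in sa[first:lo])


text = "AABAAGTABA"
print(find_occurrences(text, build_suffix_array(text), "ABA"))  # [2, 8]
```

Because the suffixes are sorted, all suffixes that begin with p sit next to each other, so two binary searches are enough to locate them.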

Given a pattern p = aba and a string s = acbababa, we must first define the state function δ(q,x).



q    x    δ(q,x)
0    a    1
0    b    0
0    c    0
1    a    1
1    b    2
1    c    0
2    a    3
2    b    0
2    c    0
3    a    1
3    b    2
3    c    0

i    s_i    State
1    a      δ(0,a) = 1
2    c      δ(1,c) = 0
3    b      δ(0,b) = 0
4    a      δ(0,a) = 1
5    b      δ(1,b) = 2
6    a      δ(2,a) = 3
7    b      δ(3,b) = 2
8    a      δ(2,a) = 3

Now we see that the match condition (reaching state 3, the length of p) is met for i = 6, 8.
The starting indexes are then j = i - 3 + 1, so that I = {4, 6}.
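
The table and trace above can be reproduced with a short sketch. It assumes δ(q,x) is defined as the length of the longest prefix of p that is a suffix of the text read so far, which agrees with every row of the table. The function names and the explicit `alphabet` argument are assumptions, and the table is built by brute force rather than in the O(m*|Σ|) time quoted earlier.

```python
def build_transitions(p: str, alphabet: str) -> dict[tuple[int, str], int]:
    """Build delta(q, x): the length of the longest prefix of p
    that is a suffix of p[:q] + x."""
    m = len(p)
    delta = {}
    for q in range(m + 1):
        for x in alphabet:
            seen = p[:q] + x
            k = min(m, q + 1)
            # Shrink k until the first k characters of p end the text seen so far.
            while k > 0 and not seen.endswith(p[:k]):
                k -= 1
            delta[(q, x)] = k
    return delta


def automaton_match(s: str, p: str, alphabet: str) -> list[int]:
    """Return the 1-based start indices of p in s (s must use only alphabet)."""
    delta = build_transitions(p, alphabet)
    m = len(p)
    q = 0
    starts = []
    for i, x in enumerate(s, start=1):  # i is the 1-based position in s
        q = delta[(q, x)]
        if q == m:  # accepting state: an occurrence of p ends at position i
            starts.append(i - m + 1)
    return starts


print(automaton_match("acbababa", "aba", "abc"))  # [4, 6]
```

Once the table exists, the scan reads each character of s exactly once, which is where the "+ n" term in the running time comes from.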


Given two strings, we want to generate an optimal alignment. The alignment of two strings may involve the insertion of gaps and/or the acceptance of mismatched entries.

Example: Consider the following possible alignment of the two strings GACGGATTATG and GATCGGAATAG:

GA-CGGATTATG
GATCGGAATA-G

Dynamic Programming Approach

Computing an optimal alignment using a dynamic programming matrix and a scoring function, O(m*n).
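
A compact sketch of the dynamic programming approach follows. The slides do not give a scoring function, so the values used here (match +1, mismatch -1, gap -1) are assumptions, as is the function name `align`. The recurrence is the standard global alignment scheme, usually attributed to Needleman and Wunsch.

```python
def align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    # Fill the (m+1) x (n+1) matrix: O(m*n) cells, constant work per cell.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = align("GACGGATTATG", "GATCGGAATAG")
print(total)
print(top)
print(bottom)
```

When several alignments share the best score, the traceback returns only one of them, and which one depends on the order in which the three cases are tested.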

Heuristic Approach

Used in practice to speed up search times on large databases. Consider the human genome, which is over 3 billion characters long, and of which you may need to align only a small portion.


FASTP and FASTA Programs


BLAST Algorithm
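
The programs listed above rest on the idea of locating short exact matches first and spending alignment effort only near them. The sketch below shows that seeding idea in isolation, under stated assumptions: it is not an implementation of FASTP, FASTA, or BLAST, and the function names and the word length `k` are illustrative.

```python
from collections import Counter, defaultdict


def build_kmer_index(text: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring of text to its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def find_seeds(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_pos, text_pos) pairs where a k-mer of query occurs in text."""
    seeds = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            seeds.append((q, t))
    return seeds


def busiest_diagonal(seeds: list[tuple[int, int]]):
    """Return the offset text_pos - query_pos shared by the most seeds, or None."""
    if not seeds:
        return None
    return Counter(t - q for q, t in seeds).most_common(1)[0][0]


index = build_kmer_index("AABAAGTABA", 3)
seeds = find_seeds("GTAB", index, 3)
print(seeds, busiest_diagonal(seeds))
```

Seeds that share an offset lie on one diagonal of the dynamic programming matrix, so a full alignment can be restricted to a narrow band around that diagonal instead of the whole database sequence.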


The heuristic approaches are still being actively researched and improved, as the algorithms themselves are only 10-15 years old.


Development of tools that can perform a 10-way comparison of genomes.


Bioinformatics as a whole is an active field of research that strongly needs qualified professionals who have an aptitude for computing and/or biology.

