Literature Survey - Villanova Department of Computing Sciences ...


Bioinformatics: Sequence Alignment



Carmen Nigro



Computing Research
Department of Computing Sciences
Villanova University, Villanova, Pa, 19085

carmen.nigro@villanova.edu



November 11, 2009


1. Introduction

Sequence alignment is an important division of bioinformatics, which attempts to analyze and compare the sequences that make up DNA or proteins. Sequence alignment is a way of comparing two or more sequences by searching for a series of individual characters that appear in the same order in the sequences [14]. As methods improved for collecting biological data, such as nucleotide and amino acid sequences, a need arose for databases that allow easy storage, retrieval, and revision of the data. Today, bioinformatics scientists are interested in the analysis and interpretation of that data. Because these sequences are too long to be analyzed by people, efficient and accurate alignment programs are essential for comparing sequences of DNA or proteins.

The sequences being compared are usually represented by nitrogenous bases for DNA sequences or by amino acids for protein sequences. Four different nitrogenous bases make up DNA, while 20 different amino acids make up proteins.


Through sequence alignments, attempts can be made to identify homologous sequences, or sequences with a common evolutionary origin [14]. The discovery of homologous sequences may help to predict the evolutionary process based on segments with mutations and segments which have remained the same over time. Sequence alignment also has functional importance, as sequences that are alike may have the same role or code for the same entity. The drug industry has benefited from applying this notion when designing new drugs to treat certain diseases. Some diseases are caused by the lack of certain parts of a protein sequence. Sequence alignment can help to identify those regions, and the lack may be compensated for by injecting the missing sequence into the protein. Sequence alignment has also been useful for analyzing protein structure. Protein molecules that are alike in sequence are also more likely to have similar structures, as many of the same bonds will form. In addition, similar protein sequences have been used to determine protein structure-function relationships [1].

2. Global and Local Alignments

Global and local alignments are two different methods of aligning sequences. Deciding which method to choose depends on the purpose of the alignment. Global alignments attempt to compare every residue of every sequence and are best employed when the sequences are similar and of the same size, because sequences of different sizes will produce mismatches at the ends of an alignment. However, when attempting to align every element of dissimilar sequences, many gaps will be produced because of the many mismatches between the two sequences, as seen in Figure 1. When comparing two long sequences, these gaps can become difficult to analyze. Local alignments are best employed for dissimilar sequences that may have similar regions [3]. Local alignments are very useful for finding a particular pattern that exists in both sequences, as that pattern may also have a similar function. If both sequences are very similar, it should not make a difference which method is used, because the alignments should produce similar results. There is also no difference in time efficiency between the two methods. The most fundamental global and local alignment algorithms are both based on dynamic programming: the Needleman-Wunsch algorithm solves the global alignment problem, while the Smith-Waterman algorithm solves the local alignment problem.


Figure 1. Examples of local and global alignments

3. Dynamic Programming



Dynamic programming involves breaking a larger problem down into smaller, more manageable pieces. The basic dynamic programming approach for sequence alignment finds an optimal path through a rectangular path graph. It accomplishes this by turning one sequence into another through a series of edits. Each edit to the sequence is associated with a particular cost, and the goal is to find the series of edits with the lowest total cost [1]. This method drastically reduces the number of alignments to be considered while always producing an optimal alignment.
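The edit-based formulation above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any cited paper: the function name `edit_distance` and the unit costs for substitutions and for insertions or deletions are assumptions chosen for the example.

```python
def edit_distance(source: str, target: str,
                  sub_cost: int = 1, indel_cost: int = 1) -> int:
    """Return the lowest total cost of edits that turn `source` into `target`.

    The costs are hypothetical defaults; any non-negative costs would work.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # cost[i][j] holds the cheapest way to turn source[:i] into target[:j].
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * indel_cost      # delete every character of source[:i]
    for j in range(1, cols):
        cost[0][j] = j * indel_cost      # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else sub_cost),  # keep or substitute
                cost[i - 1][j] + indel_cost,                     # delete from source
                cost[i][j - 1] + indel_cost,                     # insert into source
            )
    return cost[-1][-1]
```

For instance, `edit_distance("ACGT", "AGT")` returns 1, because deleting the C is the single edit needed. Each cell is computed once from three neighbouring cells, which is why the table-based approach considers far fewer possibilities than enumerating every alignment.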

Both the Needleman-Wunsch and Smith-Waterman algorithms are based on the dynamic programming method and have a time efficiency of O(nm), n and m being the lengths of the two sequences. The Needleman-Wunsch algorithm works by maximizing the number of matches and minimizing the number of gaps needed to align the two sequences. A scoring function must exist so that scores may be assigned to the alignments based on the number of matches and the number of gaps in the alignment. The alignment with the largest score will be the optimal alignment. The algorithm is implemented through the use of a scoring matrix in which the horizontal and vertical axes correspond to the two sequences. It compares every element of one sequence with every element of the other sequence and then traces back through the matrix to find the optimal alignment. The Smith-Waterman algorithm acts in a similar manner, but produces a local alignment by finding the region with the highest similarity. The Smith-Waterman algorithm may be obtained from the Needleman-Wunsch algorithm by adjusting the scoring function and changing the method of tracing back so that it recovers the best-matching subsequences.
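The relationship between the two algorithms may be easier to see in code. The following Python sketch is a simplified illustration under stated assumptions: the scores (match +1, mismatch -1, gap -1) and the names `score_pair`, `needleman_wunsch`, and `smith_waterman` are chosen for this example and are not taken from the cited papers. The local version differs in two places only: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def score_pair(x: str, y: str, match: int = 1, mismatch: int = -1) -> int:
    """Hypothetical scoring function: reward matches, penalize mismatches."""
    return match if x == y else mismatch


def needleman_wunsch(a: str, b: str, gap: int = -1):
    """Global alignment: returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap                # prefix of a aligned against gaps
    for j in range(1, m + 1):
        H[0][j] = j * gap                # prefix of b aligned against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j - 1] + score_pair(a[i - 1], b[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    # Trace back from the bottom-right corner to the top-left corner.
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score_pair(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def smith_waterman(a: str, b: str, gap: int = -1):
    """Local alignment: returns (score, aligned_a_region, aligned_b_region)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The extra 0 lets an alignment restart anywhere in the matrix.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score_pair(a[i - 1], b[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    (i, j), top, bottom = best_pos, [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score_pair(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

With these assumed scores, `needleman_wunsch("ACGT", "AGT")` returns `(2, "ACGT", "A-GT")`, while `smith_waterman("GGGACGTCCC", "TTACGTAA")` returns `(4, "ACGT", "ACGT")`, ignoring the dissimilar flanking regions. Both functions fill an (n+1) by (m+1) matrix once, which matches the O(nm) running time stated above.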


4. Pairwise and Multiple Alignments

Pairwise alignments attempt to align two sequences at a time, while multiple alignments attempt to align three or more sequences at a time. Analyzing three or more sequences at a time can be useful for studying molecular evolution and analyzing sequence-structure relationships [1]. Also, a pattern common to a set of sequences may only become apparent through multiple sequence alignment [1]. While the dynamic programming techniques described above are reliable methods of alignment, they are not practical to implement for multiple alignments. Extending the dynamic programming algorithm to multiple alignments produces an optimal alignment in time O(n^k) for k sequences of length n [13]. The cost of multiple sequence alignment therefore grows exponentially with the number of sequences and becomes unreasonable for comparing more than three sequences at a time [1]. Due to the impracticality of using dynamic programming algorithms to solve the multiple alignment problem, many heuristic algorithms have been developed, which sacrifice accuracy for time efficiency. Heuristic approaches attempt to optimize pairwise alignments rather than searching for an overall optimal alignment [15]. Over 75 methods of solving the multiple alignment problem have been identified, and the problem continues to be central to computational molecular biology [15].
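A rough sense of this growth can be obtained by counting the cells of the dynamic programming table. The Python sketch below is only a back-of-the-envelope illustration; the helper names `msa_dp_cells` and `msa_dp_moves` are hypothetical, and the count assumes the straightforward extension in which the table has one axis per sequence.

```python
def msa_dp_cells(lengths: list[int]) -> int:
    """Number of cells in a dynamic programming table with one axis per sequence."""
    cells = 1
    for n in lengths:
        cells *= n + 1               # each axis has one extra row for the empty prefix
    return cells


def msa_dp_moves(k: int) -> int:
    """Number of predecessor cells examined per cell when aligning k sequences.

    Each sequence either contributes a character or a gap to a column,
    and the all-gap column is excluded, leaving 2**k - 1 choices.
    """
    return 2 ** k - 1
```

Under these assumptions, two sequences of length 100 need 10,201 cells with 3 moves each, whereas three such sequences already need 1,030,301 cells with 7 moves each.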

5. Scoring Functions

Scoring schemes are important for sequence alignment programs, because they are a means of comparing different alignments. In alignment algorithms a scoring function must exist so that scores may be assigned to different alignments based on the number of gaps and the number of matches. Scores are assigned to each possible pair of elements based on the similarity of their chemical properties and the evolutionary probability of the mutation. Gap costs are also an important part of any sequence alignment program and have been studied extensively. Gap costs may take into account that a single mutational event may insert or delete multiple elements [1]. Gap costs must also take into account aligning elements with nulls when sequences are of different lengths. Algorithms that have a fixed penalty for each gap are popular and are easily extendable to multiple alignments [1]. An example of a scoring scheme with fixed penalties can be seen in Figure 2.


Figure 2. A simple scoring scheme
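As a hedged illustration of how gap costs enter a score, the Python sketch below scores an already aligned pair of sequences. The numeric values and the name `alignment_score` are assumptions made for this example and do not reproduce the scheme in Figure 2. Setting `gap_extend` equal to `gap_open` charges the same fixed penalty for every gap position, while a smaller `gap_extend` is one way to reflect that a single mutational event may insert or delete several elements at once.

```python
def alignment_score(top: str, bottom: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -2, gap_extend: int = -2) -> int:
    """Score a pairwise alignment in which '-' marks a null (gap) position."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap_row = None                  # row that held a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of two gaps")
        if x == "-" or y == "-":
            gap_row = "top" if x == "-" else "bottom"
            # Continuing a gap in the same row costs gap_extend; a new gap costs gap_open.
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score
```

With the defaults, `alignment_score("ACGGT", "A---T")` gives -4 (two matches and three gap positions at -2 each). Passing `gap_extend=-1` raises it to -2, because the three-element gap is treated as one opening followed by two cheaper extensions.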

One type of scoring for multiple alignments is the sum-of-pairs score, which increases with the number of sequences aligned correctly [30]. For multiple alignments, the sum of the pairs is the total of the alignment costs for every pair of sequences in the alignment. A column score may also be implemented in a multiple sequence alignment program; it tests the capability of the program to align all of the sequences correctly. Scoring functions are crucial to any alignment program, because they directly affect the choice of the optimal alignment.
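The pairwise-sum definition can be sketched as follows. This is a minimal Python illustration, assuming a simple per-column scoring of match +1, mismatch -1, and gap -1, with a column of two gaps scoring zero; the names `pair_score` and `sum_of_pairs` and all of these values are hypothetical choices for the example.

```python
from itertools import combinations


def pair_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score one column of one pair of rows; '-' marks a gap."""
    if x == "-" and y == "-":
        return 0                         # two gaps carry no information about the pair
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment: list[str]) -> int:
    """Total the pairwise scores over every pair of rows in a multiple alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of the alignment must have the same length")
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(pair_score(x, y) for x, y in zip(row_a, row_b))
    return total
```

For example, `sum_of_pairs(["ACGT", "A-GT", "ACGA"])` returns 4: the three row pairs contribute 2, 2, and 0 respectively.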


References

[1] D.J. Lipman, S.F. Altschul, and J.D. Kececioglu, “A Tool for Multiple Sequence Alignment”, Proc. Natl. Acad. Sci. USA, Vol. 86, pp. 4412-4415, June 1989.

[2] R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, and J.D. Thompson, “Multiple sequence alignment with the Clustal series of programs”, Oxford Journals: Nucleic Acids Research, Vol. 31, pp. 3497-3500, 2003.

[3] J.D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence alignment programs”, Oxford Journals: Nucleic Acids Research, Vol. 27, pp. 2682-2690, 1999.

[4] J.D. Thompson, T.J. Gibson, F. Plewniak, F. Jeanmougin, and D.G. Higgins, “The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools”, Oxford Journals: Nucleic Acids Research, Vol. 25, pp. 4876-4882, 1997.

[5] I.M. Wallace, O. Orla, and D.G. Higgins, “Evaluation of Iterative Alignment Algorithms for Multiple Alignment”, Oxford Journals: Bioinformatics, Vol. 21, pp. 1408-1414, 2005.

[6] M.S. Waterman, “Efficient Sequence Alignment Algorithms”, J. theor. Biol., Vol. 108, pp. 333-337, 1984.

[7] H. Rangwala and G. Karypis, “Incremental window-based protein sequence alignment algorithms”, Oxford Journals: Bioinformatics, Vol. 23, pp. e17-e23, 2007.

[8] L.A. Newberg, “Memory efficient dynamic programming backtrace and pairwise local sequence alignment”, Oxford Journals: Bioinformatics, Vol. 24, pp. 1772-1778, 2008.

[9] J. Hérisson, G. Payen, and R. Gherbi, “A 3D pattern matching algorithm for DNA sequences”, Oxford Journals: Bioinformatics, Vol. 23, pp. 680-686, 2007.

[10] T.W. Lam, W.K. Sung, S.L. Tam, C.K. Wong, and S.M. Yiu, “Compressed indexing and local alignment of DNA”, Oxford Journals: Bioinformatics, Vol. 24, pp. 791-797, 2008.

[11] J.M. Sauder, J.W. Arthur, and R.L. Dunbrack, Jr., “Large-Scale Comparison of Protein Sequence Alignment Algorithms With Structure Alignments”, Proteins: Structure, Function, and Genetics, Vol. 40, pp. 6-22, 2000.

[12] L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg, “Fast algorithms for large-scale genome alignment and comparison”, Oxford Journals: Nucleic Acids Research, Vol. 30, pp. 2478-2483, 2002.

[13] Y. Bilu, P.K. Agarwal, and R. Kolodny, “Faster Algorithms for Optimal Multiple Sequence Alignment Based on Pairwise Comparisons”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 3, pp. 408-422, 2006.

[14] D.G. Brown, “A survey of seeding for sequence alignment”, University of Waterloo, Waterloo, Ontario, Canada, 2007.

[15] S. Kumar and A. Filipski, “Multiple Sequence Alignment: In pursuit of homologous DNA positions”, Cold Spring Harbor Laboratory Press: Genome Research, Vol. 17, pp. 127-135, 2007.