Bioinformatics: Sequence Alignment
November 11, 2009
Sequence alignment is an important part
of bioinformatics, which attempts to analyze and
compare sequences that make up DNA or proteins. Sequence alignment is a way of comparing
two or more sequences by searching for a series of individual characters that are in the same
order in both sequences. As methods improved for collecting biological data, such as
nucleotide and amino acid sequences, the need arose for databases that allow easy
storage, retrieval, and revision of the data. Today, bioinformatics scientists are interested in the
analysis and interpretation of that
data. Because these sequences are too long to be analyzed by
people, efficient and accurate alignment programs are essential for comparing
sequences of DNA and proteins. The elements of the sequences that are being compared are
nucleotides for DNA sequences or amino acids for protein sequences. There are four different
nitrogenous bases that make up DNA, while there are 20 different amino acids that make up
proteins.
Through sequence alignment, attempts can be made to identify homologous sequences, or
sequences with a common evolutionary origin. The discovery of homologous sequences
may help to predict the evolutionary process based on segments with mutations and segments
that have remained the same over time. Sequence alignment also has functional importance, as
sequences that are alike may have the same role or code for the same entity.
The drug industry has benefited from applying this notion
when designing new drugs to treat certain diseases.
Some diseases are caused by the lack of certain parts of a protein sequence. Sequence alignment
can help to identify those regions, and the lack may be compensated
by injecting the missing sequence into the protein. Sequence alignment has also been useful
for analyzing protein structure. Protein molecules that are alike in sequence are also more
likely to have similar structures, as many of the same bonds will form.
In addition, s
protein sequences have been
function relationships .
Global and Local Alignments
Global and local alignments are two different methods of aligning sequences. Deciding which
method to choose depends on the purpose of the alignment.
Global alignments attempt to compare every residue of every sequence and are best employed
when the sequences are similar and are of the same size, because sequences of different sizes
will produce mismatches at the ends of an alignment. However, when attempting to align every
element of dissimilar sequences, many gaps will be produced because of the many mismatches
between the two sequences, as seen in the figure "Examples of local and global alignments".
When comparing two long sequences, these gaps can become difficult to analyze. Local
alignments are best employed for dissimilar sequences that may have similar regions. Local
alignments are very useful for finding a particular pattern that exists on both sequences, as that
pattern may also have a similar function. If both sequences are very similar, it should not make a
difference which method is used, because the alignments should produce similar results.
There is also no difference in time efficiency between the two methods. The most fundamental
global and local alignment algorithms are based on dynamic programming. The Needleman-Wunsch
algorithm is based on dynamic programming and solves the global alignment problem, while the
Smith-Waterman algorithm is also based on dynamic programming and solves the local
alignment problem.
[Figure: Examples of local and global alignments]
Dynamic programming involves breaking a larger problem down into smaller, more manageable
pieces. The basic dynamic programming approach for sequence alignment finds an optimal path
through a rectangular path graph. It accomplishes this by turning one sequence into another
through a series of edits. Each edit to the sequence is associated with a particular cost, and the
purpose is to find the edits that produce the lowest total cost.
This method drastically reduces the
number of alignments to be considered while always producing an optimal alignment.
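The edit-based view described above can be illustrated with a minimal sketch. The function name `edit_distance` and the unit costs for substitutions and insertions/deletions are assumptions chosen for illustration; the text does not specify particular costs.

```python
def edit_distance(a: str, b: str, sub_cost: int = 1, indel_cost: int = 1) -> int:
    """Lowest total cost of edits that turn sequence a into sequence b.

    The costs are illustrative assumptions, not values taken from the text.
    """
    n, m = len(a), len(b)
    # dist[i][j] holds the cheapest way to turn a[:i] into b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost          # delete every element of a[:i]
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost          # insert every element of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = 0 if a[i - 1] == b[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j - 1] + change,  # keep or substitute an element
                dist[i - 1][j] + indel_cost,  # delete an element of a
                dist[i][j - 1] + indel_cost,  # insert an element of b
            )
    return dist[n][m]
```

Each cell of the table is a smaller subproblem that is solved once and reused, which is why the number of alignments that must be examined drops so sharply compared with enumerating every possible alignment.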
Both the Needleman-Wunsch and Smith-Waterman algorithms are based on the dynamic
programming method and have a time complexity of O(nm),
n and m being the lengths of the two
sequences. The Needleman-Wunsch algorithm works by maximizing the number of matches and
minimizing the number of gaps needed to align the two sequences. A scoring function must exist
so that scores may be assigned to the alignments based on the number of matches and the
number of gaps of the alignment. The alignment with the largest score will be the optimal
alignment. The algorithm is implemented through the use of
a scoring matrix in which the horizontal and
vertical axes correspond to the two sequences. The algorithm compares every element of one
sequence to every element of the other sequence
and then traces back to find the optimal
alignment. The Smith-Waterman algorithm acts in a similar manner, but produces a local
alignment by finding the region with the highest similarity. The Smith-Waterman algorithm may
be obtained from the Needleman-Wunsch algorithm by adjusting
the scoring function and the
method of tracing back to find the highest-scoring matching subsequences.
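The relationship between the two algorithms can be sketched in a few lines. This is a simplified illustration, not a reference implementation: the function names and the match, mismatch, and gap values (+1, -1, -1) are assumptions chosen for the example, and ties during traceback are broken arbitrarily.

```python
def alignment_matrix(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Fill the (len(a)+1) x (len(b)+1) scoring matrix in O(nm) time."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment pays for gaps at the ends of the sequences.
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            best = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            # Local alignment never lets a cell drop below zero.
            score[i][j] = max(best, 0) if local else best
    return score


def traceback(a, b, score, i, j, match, mismatch, gap, local):
    """Walk back from cell (i, j), rebuilding the two aligned rows."""
    top, bottom = [], []
    while (i > 0 or j > 0) and not (local and score[i][j] == 0):
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: trace back from the bottom-right corner."""
    s = alignment_matrix(a, b, match, mismatch, gap, local=False)
    return s[len(a)][len(b)], traceback(a, b, s, len(a), len(b), match, mismatch, gap, False)


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment: trace back from the highest-scoring cell until a zero."""
    s = alignment_matrix(a, b, match, mismatch, gap, local=True)
    cells = ((i, j) for i in range(len(a) + 1) for j in range(len(b) + 1))
    i, j = max(cells, key=lambda p: s[p[0]][p[1]])
    return s[i][j], traceback(a, b, s, i, j, match, mismatch, gap, True)
```

Under these assumed scores, `needleman_wunsch("ACGT", "AGT")` aligns the full sequences with one gap, while `smith_waterman("TTTACGTTTT", "GGACGGG")` reports only the shared `ACG` region. The only differences between the two functions are the zero floor in the matrix and the start and stop conditions of the traceback.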
Pairwise alignments attempt to align two sequences at a time, while multiple alignments attempt
to align three or more sequences at a time. Analyzing three or more sequences at a time can
be useful for studying molecular evolution and analyzing sequence-structure
relationships. Also,
the detection of a pattern common to a set of sequences may only be apparent through multiple
sequence alignment.
While the dynamic programming techniques
described above are
reliable methods of alignment, they are not practical to implement for multiple alignments.
Extending the dynamic programming algorithm to multiple alignments produces an optimal
alignment in time O(n^k)
for k sequences [13]. The cost of multiple sequence
alignment grows exponentially every time another sequence is added and becomes impractical
for comparing more than three sequences at a time.
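A small calculation makes the growth concrete. The sketch below assumes k sequences of equal length n and counts only the cells of the k-dimensional scoring table, ignoring the work done per cell; the function name `dp_cells` is hypothetical.

```python
def dp_cells(n: int, k: int) -> int:
    """Cells in a k-dimensional dynamic programming table for k sequences
    of length n (one extra row per axis for the empty prefix)."""
    return (n + 1) ** k


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        print(f"{k} sequences of length 100: {dp_cells(100, k):,} cells")
```

Under these assumptions, each additional sequence of length 100 multiplies the table size by 101.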
Due to the impracticality of using
dynamic programming algorithms to solve the multiple alignment problem, many heuristic
algorithms have been developed, which trade some accuracy for time efficiency. Heuristic
approaches attempt to optimize pairwise alignments rather than searching for an overall optimal
alignment. Over 75 methods of solving the multiple alignment
problem have been identified,
and the problem continues to be central to computational molecular biology.
Scoring schemes are important for sequence alignment programs, because they are a means of
comparing different alignments. In alignment algorithms a scoring function must exist so that
scores may be assigned to different alignments based on the number of gaps and the number of
matches. Scores are assigned to each possible pair of elements based on the similarity of their
chemical properties and the evolutionary probability of the mutation. Gap costs are also an
important part of
any sequence alignment program.
Gap costs may take into
account that a single mutational event may insert or delete multiple elements.
Gap costs must also
take into account
aligning elements with nulls when sequences are of different lengths.
Algorithms that have a
fixed penalty for each gap are popular and are easily extendable to
An example of a scoring scheme with fixed penalties can be seen in
the figure "A simple scoring scheme".
[Figure: A simple scoring scheme]
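A fixed-penalty scheme of this kind can be sketched as a function that scores an alignment that has already been built. The numeric values and function names below are assumptions for illustration and are not taken from the figure. The second function shows one common way for a gap cost to reflect that a single mutational event may insert or delete several elements: opening a gap costs more than extending it.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows ('-' marks a gap) with a fixed penalty per gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def score_alignment_affine(row_a, row_b, match=1, mismatch=-1,
                           gap_open=-2, gap_extend=-1):
    """Same idea, but a run of gaps in one row is charged once for opening
    and less for each additional position (an assumed affine scheme)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap_row = None  # which row held the gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            total += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += match if x == y else mismatch
            previous_gap_row = None
    return total
```

With these assumed values, the alignment of `A--T` against `ACGT` scores -2 under the fixed scheme and -1 under the affine scheme, because the two adjacent gaps are treated as one event.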
One type of scoring for multiple alignments is the sum-of-pairs
score, which increases with the
number of sequences aligned correctly.
For multiple alignments, the sum of the pairs is the
total of all alignment costs for each pair of the sequences in the alignment.
A column score may
also be implemented in a multiple sequence alignment program, which tests the capability of the
program to align all of the sequences correctly.
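Both scores can be sketched briefly. The pair scores (+1 match, -1 mismatch, -2 gap, 0 for two gaps) are illustrative assumptions. The column score here is a simplified version that assumes a trusted reference alignment is available and counts the fraction of columns in the test alignment that also occur in that reference. The function names are hypothetical.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols from the same column (assumed values)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Add up the pairwise scores of every pair of rows, column by column."""
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


def residue_columns(alignment):
    """Describe each column by the residue index taken from each row
    (None where the row has a gap), so columns can be compared."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        ids = []
        for row, symbol in enumerate(column):
            if symbol == "-":
                ids.append(None)
            else:
                ids.append(counters[row])
                counters[row] += 1
        columns.append(tuple(ids))
    return columns


def column_score(test, reference):
    """Fraction of test columns that appear unchanged in the reference."""
    reference_columns = set(residue_columns(reference))
    test_columns = residue_columns(test)
    return sum(c in reference_columns for c in test_columns) / len(test_columns)
```

For example, `sum_of_pairs(["ACGT", "A-GT", "ACGT"])` totals the scores of the three row pairs in each of the four columns, and `column_score` returns 1.0 only when every column of the test alignment matches the reference.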
Scoring functions are crucial to any alignment
program, because they directly
affect the choice of the optimal alignment.
References
D.J. Lipman, S.F. Altschul, and J.D. Kececioglu, “A Tool for Multiple Sequence
Alignment”, Proc. Natl. Acad. Sci. USA, Vol. 86, pp. 4412-4415, June 1989.
R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, and J.D.
Thompson, “Multiple sequence alignment with the Clustal series of programs”, Oxford
Journals: Nucleic Acids Research, Vol. 31, pp.
J.D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple
sequence alignment programs”, Oxford Journals: Nucleic Acids Research, Vol. 27, pp.
CLUSTAL_X windows interface:
flexible strategies for multiple sequence alignment
aided by quality analysis tools”, Oxford Journals: Nucleic Acids Research, Vol. 25, pp.
I. M. Wallace and D. G. Higgins, “Evaluation of Iterative Alignment
Algorithms for Multiple Alignment”, Oxford Journals: Bioinformatics, Vol. 21, pp.
S. Waterman, “Efficient Sequence Alignment Algorithms”,
J. theor. Biol.,
G. Karypis, “
algorithms”, Oxford Journals: Bioinformatics, Vol. 23, pp. e17
L. A. Newberg, “Memory efficient dynamic programming backtrace and pairwise local
sequence alignment”, Oxford Journals: Bioinformatics, Vol. 24, pp.
R. Gherbi, “A 3D pattern matching algorithm for DNA
sequences”, Oxford Journals: Bioinformatics, Vol. 23, pp. 680
T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, S. M. Yiu, “Compressed
and local alignment of DNA”, Oxford Journals: Bioinformatics, Vol. 24, pp. 791
J. M. Sauder, J. W. Arthur, and R. L. Dunbrack, Jr., “
Scale Comparison of
Structure Alignments,” Proteins:
Structure, Function, and Genetics, Vol. 40, pp. 6
and S. L. Salzberg, “
Fast algorithms for large
genome alignment and comparison”, Oxford Journals: Nucleic Acids Research, Vol. 30,
Y. Bilu, P. K. Agarwal, R. Kolodny, “Faster Algorithms for Optimal Multiple Sequence
Alignment Based on Pairwise Comparisons”, IEEE/ACM Transactions on Computational
Biology and Bioinformatics, Vol. 3, pp. 408
D.G. Brown, “A survey of seeding for sequence alignment”, University of Waterloo,
Waterloo, Ontario, Canada, 2007.
S. Kumar, A. Filipski, “Multiple Sequence Alignment: In pursuit of homologous DNA
positions”, Cold Spring Harbor Laboratory Press: Genome Research, Vol. 17, pp. 127