QOMA: quasi-optimal multiple alignment of protein sequences

wickedshortpumpΒιοτεχνολογία

1 Οκτ 2013 (πριν από 4 χρόνια και 1 μήνα)

131 εμφανίσεις

Vol.23 no.2 2007,pages 162–168
doi:10.1093/bioinformatics/btl590
BIOINFORMATICS ORIGINAL PAPER
Sequence analysis
QOMA:quasi-optimal multiple alignment of protein sequences
Xu Zhang and Tamer Kahveci
￿
Computer and Information Science and Engineering,University of Florida,Gainesville,FL 32611,USA
Received on August 8,2006;revised on November 15,2006;accepted on November 16,2006
Advance Access publication November 22,2006
Associate Editor:John Quackenbush
ABSTRACT
Motivation:We consider the problemof multiple alignment of protein
sequences with the goal of achieving a large SP (Sum-of-Pairs) score.
Results:We introduce a new graph-based method.We name our
method QOMA (Quasi-Optimal Multiple Alignment).QOMA starts
with an initial alignment.It represents this alignment using a K-partite
graph.It then improves the SP score of the initial alignment through
local optimizations within a window that moves greedily on the align-
ment.QOMA uses two parameters to permit flexibility in time/
accuracy trade off:(1) The size of the window for local optimization.
(2) The sparsity of the K-partite graph.Unlike traditional progressive
methods,QOMA is independent of the order of sequences.The
experimental results on BAliBASE benchmarks show that QOMA
produces higher SP score than the existing tools including ClustalW,
Probcons,Muscle,T-Coffee and DCA.The difference is more
significant for distant proteins.
Availability:The software is available fromthe authors upon request.
Contact:tamer@cise.ufl.edu
Supplementary information:Supplementary material is available at
Bioinformatics online.
1 MOTIVATION
Alignment of multiple (i.e.three or more) protein sequences is
one of the most fundamental problems in computational biology.
Multiple sequence alignment is widely used in many applications,
such as protein structure prediction (Jones,1999),phylogenetic
analysis (Phillips et al.,2000),identification of conserved motifs
and domains (Thompson et al.,1999) and protein classification
(Grundy,1997).
The quality of a multiple alignment is usually measured in terms
of its SP (Sum-of-Pairs) score (Lee et al.,2002;Lipman et al.,1989;
Walle et al.,2004;Stoye et al.,1997).The SP Score of an align-
ment,A,of sequences P
1
,P
2
,...,P
K
is computed by adding the
alignment scores of all of its induced pairwise alignments.It can be
expressed as SPðAÞ ¼
P
i<j
ScoreðP
i
‚P
j
Þ,where K is the number
of sequences,P
i
is the sequence indexed by i and Score(P
i
,P
j
) is the
score of the alignment of P
i
and P
j
induced by A.For simplicity,we
assume fixed penalty for each indel during the computation of Score
(P
i
,P
j
).Extension to affine gap is trivial.
Variations of the SP score are often used to reflect the biological
relevance of the multiple alignment.Assigning different weights to
sequences depending on their degree of similarity (Thompson et al.,
1994) or different weights to different positions of the sequences
depending on the structural and functional annotations (Zhang and
Kahveci,2006) are two examples to such variations.ClustalW
(Thompson et al.,1994),e.g.optimizes
P
i<j
score ðP
i
‚P
j
ÞW
i
W
j
,
where W
i
and W
j
are weights assigned to sequences P
i
and
P
j
,respectively.For simplicity,we use the SP score in this
paper.Extension to weighted SP score is trivial.
Finding the multiple sequence alignment that maximizes the SP
score is an NP-complete problem (Wang and Jiangs 1994).A vari-
ety of heuristics have been developed.Progressive alignment
methods form an important class of the heuristic methods.Such
methods progressively align pairs of profiles in a certain order and
produce a newprofile until a single profile is left.Aprofile is either a
sequence or the alignment of a set of sequences.Progressive meth-
ods,however,have an important shortcoming.The order that the
profiles are chosen for alignment affects the quality of the alignment
significantly.The optimal alignment may be different than all
possible alignments obtained by considering all possible orderings
of sequences (Zhang and Kahveci,2006).A method,which can
balance the running time and the alignment accuracy is seriously in
demand.
In this paper,we consider the problem of maximizing the SP
score of the alignment of multiple protein sequences.We develop a
graph-based method named QOMA (Quasi-Optimal Multiple
Alignment).QOMA starts by constructing an initial multiple align-
ment with the help of optimal pairwise alignments.It then builds a
graph corresponding to the initial alignment.It iteratively places a
window on this graph and improves the SP score of the initial
alignment by optimizing the alignment inside the window.The
location of the window is selected greedily as the one that has a
chance of improving the SP score by the largest amount.QOMA
uses two parameters to permit flexibility in time/accuracy trade off:
(1) The window size.(2) The sparsity of the K-partite graph.The
experimental results show that QOMA finds alignments with better
SP score compared to existing tools including ClustalW,Probcons,
Muscle,T-Coffee and DCA at similar running time.The improve-
ment is more significant for distant proteins.The running time of
QOMA is larger than that of the competing methods for large
window sizes and for large number of sequences.
The rest of the paper is organized as follows.Section 2 discusses
related work.Section 3 introduces the basic algorithm in detail.
Section 4 discusses the improved algorithm.Section 5 presents
experimental results and analysis.Section 6 concludes the paper
and points out possible future directions.
2 RELATED WORK
Given a table of scores for matches and mismatches between all
amino acids and penalties for insertions or deletions,the optimal
￿
To whom correspondence should be addressed.
162
 The Author 2006.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
alignment of two sequences can be determined by dynamic program-
ming (DP).The time and space complexity of this methods is O(N
2
)
(Needleman and Wunsch,1970;Smith and Waterman,1981;Gotoh,
1982),where N is the length of each sequence.This algorithm can
be extended to align K sequences,but requires O(2
K
N
K
K
2
) time
(Lipman et al.,1989;Gupta et al.,1995).MSA(Lipman et al.,1989)
is a representative program using this algorithm.However,MSA
suffers from computation expenses.Variety of heuristic algorithms
have been developed to overcome this difficulty (Thompson
et al.,1994).These heuristic methods can be classified into four
groups:progressive,iterative,anchor-based and probabilistic.
Progressive methods find multiple alignment by iteratively
picking two sequences from this set and replacing them with
their alignment (i.e.consensus sequence) until all sequences are
aligned into a single consensus sequence.Thus,progressive meth-
ods guarantee that never more than two sequences are simultane-
ously aligned.This approach is sufficiently fast to allowalignments
of virtually any size.ClustalW (Thompson et al.,1994),T-coffee
(Notredame et al.,2000),Treealign (Hein,1989),POA (Lee et al.,
2002) and MAFFT (Katoh et al.,2002) can be grouped into this
class (Sze et al.,2005).Iterative methods start with an initial align-
ment.They then repeatedly refine this alignment through a series of
iterations until no more improvements can be made.Muscle (Edgar,
2004) can be grouped into this class as well as the progressive
method class since it uses a progressive alignment at each iteration.
Anchor-based methods use local motifs (short common subseq-
uences) as anchors.Later,the unaligned regions between conse-
cutive anchors are aligned using other techniques.DIALIGN
(Morgenstern et al.,1998),Align-m (Walle et al.,2004),L-align
(Huang and Miller,1991),Mavid (Bray and Pachter,2004),and
PRRP (Gotoh,1996) belong to this class.Probabilistic methods
pre-compute the substitution probabilities by analyzing known
multiple alignments.They use these probabilities to maximize
the substitution probabilities for a given set of sequences.Probcons
(Do et al.,2004),and Hmmt (Eddy,1995) can be grouped into
this class.The common shortcomings of the methods above are:
(1) The alignments of progressive methods depend on the order of
sequences.(2) They do not provide flexible quality/time trade off.
Improving the alignment quality of an initial alignment has been
traditionally done manually (e.g.through programs like MaM and
WebMaM(Alkan et al.,2005)).There are a fewmethods,which aim
to optimize the SP score by running DP alignment on all sequences
simultaneously.MSAis the representative in this class (Gupta et al.,
1996).DCA extends MSA by utilizing ‘divide-and-conquer’ strat-
egy (Stoye et al.,1997).Unlike progressive methods,DCA divides
the sequences recursively until they are shorter than a given thresh-
old.It then uses MSA to find the optimal solutions for the smaller
problems.The performance of DCA depends on how it divides the
sequences.DCA uses a cut strategy that minimizes additional costs
(Stoye,1998) and does not guarantee to find optimal solution.It uses
the longest sequence in the input sequences as reference to select the
cut positions.This makes it order dependent,as there is no justifica-
tionwhythis selection(or anyother selection) optimizes the SPscore
of the alignment.On the contrary,our method is order independent.
3 ALGORITHM:COMPLETEK-PARTITEGRAPH
In this section,we introduce the basic QOMAalgorithmfor aligning
Kprotein sequences.QOMAworks in two steps:(1) It constructs an
initial alignment and a K-partite graph corresponding to this align-
ment.(2) It iteratively places a window on the sequences and
replaces the window with its optimal alignment.We call this the
complete K-partite graph algorithmsince a letter of a protein can be
aligned with any letter of the other proteins within the same
window.For simplicity,we use fixed penalty for each indel.Exten-
sion of the developed method to affine gap is trivial.Next,we
describe these two steps in detail.
3.1 Constructing initial alignment
The purpose of constructing an initial alignment is to roughly iden-
tify the position of each node in final alignment.It is important to
find this initial alignment quickly in order to minimize initialization
overhead.
The initial alignment can be constructed in many ways.One
possibility is to use an existing tool,such as ClustalW.This strategy,
however,has the shortcoming that the initial alignment depends on
another tool,which may be order-dependent.This makes QOMA
partially order-dependent.We propose to construct the initial align-
ment from pairwise optimal alignments of sequences.In this strat-
egy,first,sequence pairs are optimally aligned using DP (Gotoh,
1982).An edge is added between two nodes if the nodes are
matched in this alignment.A weight is assigned to each edge as
the substitution score of the two residues that constitute that edge.
The substitution score is obtained from the underlying scoring
matrix,such as BLOSUM62 (Henikoff and Henikoff,1992).The
weight of each node is defined as the sumof the weights of the edges
that have that node on one end.A node set is then defined by
selecting one node from the head of each sequence.The node
which has the highest weight is selected from this set.This node
is aligned with the nodes adjacent to it.Thus,the letters aligned at
the end of this step constitute one column of the initial multiple
alignment.The node set is then updated as the nodes immediately
after the nodes in current set in each sequence.This process is
repeated and columns are found until all the nodes are consumed.
The alignment is obtained by concatenating all the columns.Gaps
are inserted between nodes if needed.Unlike progressive tools,this
strategy is order-independent.An example for initial alignment
construction is shown in Supplementary Figure A1.
The time complexity of both of these strategies are O(K
2
N
2
) since
pairwise comparisons dominate the running time.However,latter
approach is faster.This is because it runs dynamic programming
only once for each sequence pair.On the other hand,the former one
performs two set of pairwise alignments.One to find a guide tree
and another to align sequences progressively according to the guide
tree.
3.2 Improving the SP score via local optimizations
After constructing the initial alignment,the nodes are placed
roughly in their correct positions (or in a close by position) in
the alignment.Next,the alignment is iteratively improved.At
each iteration,a short window is placed on the existing alignment.
The subsequences contained in this window are then replaced by
their optimal alignments (see Supplementary Figure A2).DP algo-
rithm (Gotoh,1982) is used to find the optimal alignment.This is
feasible since the cost of aligning a windowis much less than that of
the entire sequences.
This algorithm requires solving two problems.First,where
should the window be placed?Second,when should the iterations
QOMA of protein sequences
163
stop?One obvious solution is to slide a windowfromleft to right (or
right to left) shifting by some predefined amount D at each iteration.
In this case,the iterations will end once the window reaches the
right end (or the left end) of the alignment (see Supplementary,
Figure A2).This solution,however,has two problems.First,it is not
clear which direction the window should be slid.Second,a window
is optimized even if it is already a good alignment,resulting in a
waste of computation time.We propose another solution.We com-
pute an upper bound of the improvement of the SP score for every
possible windowposition as follows.Let X
i
denote the upper bound
to the SP score for the windowstarting at position i in the alignment.
This number can be computed as the sum of the scores of all
pairwise optimal alignments of the subsequences in this window.
Let Y
i
denote the current SP score of that window.The upper bound
is computed as X
i
 Y
i
.We propose to greedily select the window
that has the largest upper bound at each iteration.In order to ensure
that this solution does not optimize more windows than the first one
(i.e.,sliding windows),we do not select a window position that is
within D/2 positions to a previously optimized window.The itera-
tions stop when all the remaining windows have an upper bound of
zero or they are within D/2 positions of a previously optimized
window.In our experiments,the second solution was slightly better.
The second solution converged to the final result much faster than
the first one.(data not shown.)
The time complexity of QOMA is O
2
K
W
K
K
2
ðN  Wþ1Þ
D
 
.This is
because there are
ðN  Wþ1Þ
D
positions for window.A dynamic pro-
gramming solution is computed for each such window.The cost of
each dynamic programming solution is O(2
K
W
K
K
2
).This algorithm
is much faster than the optimal dynamic programming when W is
much smaller than N.The space complexity is O(W
K
+ KN).This is
because dynamic programming for a windowrequires O(W
K
) space,
and only one window is maintained at a time.Also O(KN) space is
needed to store the sequences and the alignment.Note that the edges
of the complete K-partite graph are not stored at this step as we
already know that the graph is complete.
3.3 QOMA and optimality
This section analyzes QOMA.Let P
1
,P
2
,...,P
K
be the protein
sequences to be aligned.Let A
￿
be an optimal alignment of P
1
,
P
2
,...,P
K
.Let S
￿
denote the SP score of A
￿
.Let A be an alignment
of P
1
,P
2
,...,P
K
.Let SP(A) be the SP score of A.We define the error
induced by Aas error (A) ¼S
￿
SP(A).This expression,however,is
not computable for finding S
￿
is NP-complete.Instead,we compute
the error of Aas eðAÞ ¼

SS  SPðAÞ,where

SS is an upper bound to S
￿
.
Hence,e(A) error(A).Here,

SS is computed as the sumof the scores
of all optimal pairwise alignments of P
1
,P
2
,...,P
K
.Let QOMA
(A,W) be the alignment obtained by QOMA starting from initial
alignment A using a window of size W.We define the percentage
of improvement provided by QOMA over A using a window size
of W as
improveðA‚WÞ ¼ f1  eðQOMAðA‚WÞÞ/eðAÞg · 100 ð1Þ
Our first lemma shows that QOMAalways results in an alignment
at least as good as the initial alignment (The proof is shown in the
Supplementary material).
L
EMMA
1.improve(A,W)  0,8A,W.&
Corollary 1 follows from Lemma 1.
C
OROLLARY
1.S
￿
¼ SP(QOMA(A
￿
,W)),8W.&
Corollary 1 implies that QOMAalters an initial alignment A only
if A is not optimal.Next lemma discusses the impact of the window
size on QOMA (See Supplementary material for proof).
L
EMMA
2.SP(QOMA(A,W))  SP(QOMA(A,2W)).
Lemma 2 indicates that as W increases,the SP score of the
resulting alignment increases.In other words,as W increases,
SP(QOMA(A,W)) converges to S
￿
.
4 IMPROVED ALGORITHM:SPARSE GRAPH
QOMA converges to optimal alignment as the window size (W)
grows.However,this happens at the expense of exponential time
complexity.In Section 3.1 we computed the time complexity of
QOMA using complete K-partite graph as O
2
K
W
K
ðN  Wþ1ÞK
2
D
 
for
proteins P
1
,P
2
,...,P
K
.In this section,we reduce the time
complexity of QOMAby sacrificing accuracy through use of sparse
K-partite graph.The goal is to run QOMA within a limited time
budget while using a larger window size.
The factor 2
K
in the complexity is incurred because each cell
of the DP matrix is computed by considering 2
K
 1 conditions
(i.e.2
K
 1 neighboring cells).This is because there are 2
K
 1
possible non-empty subsets of K residues.Each subset,here
corresponds to a set of residues that align together,and thus to a
neighboring cell.We propose to reduce the complexity by reducing
the number of residues that can be aligned together.We do this
by keeping only the edges between node pairs with high-possibility
of matching.
Choosing the promising edges is crucial for the quality of the
resulting alignment.We propose to store an edge between two nodes
only if their corresponding letters are aligned in the optimal pairwise
alignment.Thus we produce at most K1 edges per node since each
node is aligned with at most one node from each of the K  1
sequences.We also introduce a deviation parameter d,where d is a
non-negative integer.Let p[i] and q[ j] be the nodes corresponding
to protein sequences p and q at positions i and j in the initial graph,
respectively.We draw an edge between p[i] and q[ j] only if one of
the following two conditions holds in the optimal pairwise align-
ment of p and q:(1) 9d,|d|  d,such that p[i] is aligned with q[ j +
d];(2) 9d,|d|  d,such that q[ j] is aligned with p[i + d].In other
words,we draw an edge between two nodes if their positions differ
by at most d in the optimal alignment of p and q.For example,in
Figure 1,p[2] aligns with q[2].Therefore,we draw an edge from
p[2] to q[1] and q[3] as well as q[2] since q[1] and q[3] are within
d-neighborhood of (d ¼ 1) of q[2].
DP is modified for sparse K-partite graph as follows:each cell,
[x
1
,x
2
,...,x
K
] in the K-dimensional DP matrix corresponds to
Fig.1.Sparse K-partite graph for two sequences for d ¼ 0 and d ¼ 1.
X.Zhang and T.Kahveci
164
nodes P
1
[x
1
],P
2
[x
2
],...,P
K
[x
K
].Here P
i
[ j] stands for the node at
position j in sequence i.The set contains one node from each
sequence,and can be either a residue or a gap.Thus,each cell
defines a subgraph induced by its node set.For example,during
the alignment of the sequences that have the K-partite graph as
shown in Figure 2(a),the cell [3,4,4] corresponds to nodes
P
1
[3],P
2
[4] and P
3
[4].Figure 2(b) shows the induced subgraph
of cell [3,4,4].
The induced subgraph for each cell yields a set of connected
components.The sparse graph algorithm uses connected compo-
nents to improve the running time of DP as follows.During the
computation of the value of a DP matrix cell,we allowtwo nodes to
align only if they belong to the same connected component of the
induced subgraph of that cell.For example,for cell [3,4,4],P
2
[4]
and P
3
[4] can be aligned together,but P
1
[3] can not be aligned with
P
2
[4] or P
3
[4] (see Fig.2b).A connected component with n nodes
produce 2
n
1 non-empty subsets.Thus,for a given cell,if there are
t connected components and the tth component has n
t
nodes,then
the cost of that cell becomes
P
t
i¼1
ð2
n
i
 1Þ.This is a significant
improvement as the cost of a single cell is 2
n
1
+n
2
+...+n
t
1 using the
complete K-partite graph.For example,in Figure 2,the cost for cell
[3,4,4] drops from 2
3
 1 ¼7 to (2
1
1) +(2
2
 1) ¼ 4.
The connected components of an induced subgraph can be
found in O(K
2
) time (i.e.the size of the induced subgraph) by
traversing the induced subgraph once.Thus,the total time
complexity of the sparse K-partite graph approach is
O
P
W
K
i¼1
P
j
ð2
n
j
 1Þ
 h i
ðN  Wþ1ÞK
2
/D
 
.The space com-
plexity of using the sparse K-partite graph is O(W
K
+ KN +
N(K1)K(2d + 1)/2).The first term denotes the space for the
dynamic programming alignment within a window.The second
termdenotes the number of letters.The last termdenotes the number
of edges.The space complexity for the last two terms can be reduced
by storing only the subgraph inside the window.
5 EXPERIMENTAL EVALUATION
Experimental setup:We used BAliBASE benchmarks (Thompson
et al.,1999) reference 1 from version 1 (www-igbmc.
u-strasbg.fr/BioInfo/BAliBASE/) and references 1,
2,...,8 from version 3 (www-bio3d-igbmc.u-strasbg.
fr/BAliBASE/) for evaluation of our method.We use V1 and
V3 to denote BAliBASE versions 1 and 3,respectively.We use R1
to R8 to denote reference 1–8.For example,we use V3-R4 to
represent the reference 4 dataset from version 3.We split the
V1-R1 dataset into three datasets (V1-R1-low,V1-R1-medium
and V1-R1-high) according to the similarity of the sequences in
the benchmarks as denoted in BAliBASE (low- and high-
similarities).Similarly,V3-R1 is split into two datasets V3-R1-
low and V3-R1-high containing low and high-similarity bench-
marks.The number of sequences in the benchmarks in version
3 were usually too large for QOMA and DCA.Therefore,we cre-
ated 1000 benchmarks from each reference by randomly selecting
five sequences from the existing benchmarks.Thus,each of the
benchmarks from version 3 contains five sequences.
We evaluated the SP score and the running time in our experi-
ments.We do not report the BAliBASE scores since the purpose of
QOMA is to maximize the SP score.
We implemented the complete and the sparse K-partite QOMA
algorithms as discussed in the paper,using standard C.We used
BLOSUM62 as a measure of similarity between amino acids.We
used gap cost ¼4 to penalize each indel.We used D ¼W/2 in our
experiments since we achieved best quality per time with this value.
We also downloaded ClustalW,Probcons,Muscle,T-coffee and
DCA for comparison.We did not compare QOMA with our pre-
vious work HSA (Zhang and Kahveci,2006) since HSA needs
secondary structure information of proteins for alignment.To
ensure a fair comparison,we ran ClustalW,Muscle,T-coffee,
DCA and QOMA using the same parameters (gap cost ¼ 4,
similarity matrix ¼ BLOSUM62).This was not possible for Prob-
cons.We also ran all the competing methods using their default
parameters.
We ran all our experiments on Intel Pentium 4,with 2.6 G Hz
speed,and 512 MB memory.The operating system was Windows
2000.
Quality evaluation:We first evaluate the quality of QOMA.
Table 1 shows the average SP score of QOMA using two strategies
for constructing initial alignment and four values of W.Strategy 1
obtains the initial alignments fromClustalW.Strategy 2 obtains the
initial alignments from the algorithm provided in Section 3.1.The
table also shows the upper bound for the SP score and the SP score
of ClustalWfor comparison.The best results are shown in bold for
each test.QOMA achieves higher SP score compared to ClustalW
on average for all windowsizes and for all datasets.The SP score of
QOMA consistently increases as W increases.These results are
justified by Lemmas 1 and 1.The SP score of Strategy 2 is usually
higher than that of Strategy 1 for almost all cases of low- and
medium-similarity.Both strategies are almost identical for highly
similar sequences.There is a loose correlation between the initial SP
score and the final SP score of QOMA.Higher initial SP scores
usually imply higher SP scores of the end result.There are however
exceptions especially for highly similar sequences.In the rest of the
experiments,we use Strategy 2 to construct the initial alignments by
default.
Table 2 shows us the SP scores of five existing tools,and QOMA
on all the datasets when the competing tools are run using the same
parameters as QOMA and using their default parameters.QOMA
has higher SP score than all the tools compared for all the datasets.
DCA always has second best score since it also targets on maxi-
mizing the SP score of alignments.The difference between the SP
scores of QOMA and the other tools are more significant for low-
and medium- similarity sequences.This is an important achieve-
ment because the alignment of such sequences are usually harder
than highly similar sequences.
Table 3 shows the average percentage of improvement of QOMA
over alignments of ClustalW using the improvement formula as
Fig.2.(a) Asparse K-partite graph for three sequences froma windowof size
4 (b) The induced subgraph for the cell [3,4,4] of the dynamic programming
matrix for the K-partite graph in (a).
QOMA of protein sequences
165
given in Section 3.3,the dataset is V1-R1.As the window size
increases,the increase in improvement percentage reduces.This
indicates that QOMA converges to the optimal score at reasonably
small window sizes.In other words,using window size larger than
16 will not improve the SP score significantly.
Table 4 shows the average and the SD of the error incurred for
each window due to using the sparse K-partite graph for QOMA.
Tabel 1.The average SP scores of QOMA using complete K-partite graph with D ¼ W/2 on BAliBASE benchmarks and upper bound score (

SS)
Dataset

SS Initialization strategy 1 Initialization strategy 2
Initial W¼2 W¼4 W¼8 W¼16 Initial W¼2 W¼4 W¼8 W¼16
V1-R1-low 565
839 780 637 401 243 797 586 429 273 182
V1-R1-medium
2880 1982 2037 2181 2347 2442 2041 2192 2338 2446 2508
V1-R1-high
5324 4883 4933 5008 5071 5092 4867 4965 5057 5110 5122
(Initialization strategy 1:initial alignments are obtained from ClustalW,Initialization strategy 2:initial alignments are obtained from optimal pairwise alignments as discussed in
Section 3.1.) The best result is shown in bold face.
Table 2.The average SP scores of QOMA (using complete K-partite graph with W ¼ 16) and five other tools on BAliBASE benchmarks
Dataset ClustalW Probcons T-coffee Muscle DCA QOMA
V1-R1-low 808 (839)
1303 (1303) 1499 (1486) 2029 (778) 440 (440) 182 (182)
V1-R1-medium
2156 (1982) 2128 (2068) 1955 (2007) 424 (2161) 2356 (2289) 2582 (2508)
V1-R1-high
4954 (4935) 4924 (4975) 4842 (4920) 3468 (5001) 5007 (5055) 5122 (5172)
V3-R1-low
1233 (1316) 1763 (1763) 2141 (2052) 2617 (1200) 760 (760) 421 (421)
V3-R1-high
2048 (1911) 2008 (2008) 1803 (1902) 263 (2101) 2288 (2288) 2507 (2507)
V3-R2
5984 (5953) 5862 (5892) 5793 (5847) 4564 (6029) 6126 (6153) 6287 (6313)
V3-R3
5838 (6005) 5741 (5976) N/A (5945) 4406 (6088) 5968 (6202) 6110 (6348)
V3-R4
251 (347) 280 (95) N/A (148) 935 (463) 714 (880) 926 (1107)
V3-R5
3899 (3778) 3658 (3658) N/A (3553) 2117 (3935) 4310 (4310) 4446 (4446)
V3-R6
1601 (1554) 1781 (1782) N/A (1859) 1710 (1570) 1335 (1336) 1151 (1152)
V3-R7
6977 (6832) 6555 (6555) N/A (6409) 4663 (6973) 7502 (7502) 8000 (8000)
V3-R8
8432 (8321) 8238 (8238) N/A (8190) 6927 (8494) 8831 (8831) 9248 (9248)
The numbers outside (inside) parenthesis show the SP scores when the tools are run with the same parameters as QOMA (with their default parameters).Some of the tools,namely
T-coffee andClustalW,didnot produce anyalignment for some benchmarks for eachparameter settings.The results of all the tools are ignoredfor suchbenchmarks.‘N/A’ indicates that
thecorrespondingtool failedtoproducealignment for most of thebenchmarks inadataset for that parameter setting.We ignoresuchtools (i.e.,T-coffee) for thosedatasets andparameter
setting.The best result for each dataset and parameter setting is shown in bold.
Table 3.The improvement (see Formula 1 in Section 3.3) of QOMA (using
complete K-partite graph) over ClustalW on the V1-R1 dataset
Length Window Size
2 4 8 16
Short 18.0 29.2 40.2 46.7
Medium 23.3 39.6 51.6 58.6
Long 18.6 39.4 51.5 54.1
The dataset is split into three subsets (short,mediumand long) according to the length of
the sequences.
Table 4.The average (m),SD (s) of the error,S
￿
SP,for a window using
sparse version of QOMA on the V1-R1 dataset
W Error using sparse K-partite graph
d ¼ 0 d ¼ 1 d ¼ 2
m s e m s e m s e
8 1.127 3.820 7 0.358 2.089 1 0.200 1.591 1
16 2.159 5.909
11
0.751 3.249 1 0.433 2.434 1
Results are shown for window sizes W¼8 and 16,and deviation d ¼0,1 and 2.The «
value denotes the 95%confidence interval,i.e 95%of the expected improvement values
are in [m  e,m + e] interval.
Fig.3.The SPscores of QOMAalignments usingthe complete andthe sparse
K-partite graphs for different values of d and W on the V1-R1 dataset.
X.Zhang and T.Kahveci
166
The average error decreases as d increases.Most of the reduction
in error is obtained when d increases from 0 to 1.For example,for
W ¼ 16,the error reduction is 1.408 and 0.318 as d increases from
0 to 1 and from 1 to 2,respectively.This implies that the average
improvement in the SP score degrades quickly for d > 1.The 95%
error interval also significantly shrinks as d increases to 1.These
results suggest that the sparse K-partite graph algorithmis effective
even for small values of d.
Figure 3 shows the average SP scores of the resulting alignments
using sparse K-partite graph for different values of d and using
complete K-partite graph on the V1-R1 dataset.The complete
K-partite graph algorithm produces the best SP scores.However,
the SP scores of results from the sparse K-partite graph algorithm
are very close to that of the complete K-partite graph algorithm.The
quality of the sparse K-partite graph algorithm improves signifi-
cantly when d increases from0 to 1.The improvement is less when
d increases from1 to 2.This implies when d becomes larger than 1,
it has small impact on the quality of alignment.
Performance evaluation:Our second experiment set evaluates
the running time of QOMA.Table 5 lists the running time of QOMA
for the complete and the sparse K-partite graph algorithms for
varying values of W.Experimental results show that QOMA
runs faster for small W.The sparse K-partite graph algorithm is
faster than the complete K-partite graph algorithm for all values
of d for large W.The running time of QOMA increases as d
increases.The results in this table agree with the time complexity
we computed in Sections 3.3 and 4.Referring to Tables 1,3 and
4,we conclude that when the window size is small,QOMA runs
fast and has high SP score.As the window size increases,its
performance drops but the SP score improves further.Thus,it is
better to increase the window size and use sparse K-partite graph
strategy to obtain high scoring results quickly.In the light of
Figure 3,and Tables 4 and 5 we recommend d ¼ 1 using sparse
K-partite graph strategy.This is because most of the improvement
in the SP score is obtained when d ¼ 1.For d > 1,the score
improves slightly while the running time increases at the same or
higher rate.
6 CONCLUSION AND FUTURE WORK
We considered the problem of multiple alignment with high-SP
score for protein sequences.We developed a graph-based algorithm
called QOMA.QOMA first constructs an initial alignment of
multiple sequences.In order to create this initial alignment,we
developed a method based on the optimal alignment between
all pairs of sequences.QOMA represents this alignment using a
K-partite graph.It then improves the SP score of the initial
alignment by iteratively placing a window on it and optimizing
the alignment within this window.QOMAprovides two parameters
to permit flexibility in time and accuracy:(1) The window size.(2)
The sparsity of the K-partite graph.Unlike traditional tools,QOMA
is independent of the order of sequences.
In our experiments QOMA achieved significantly higher SP
scores than existing tools,including ClustalW,Probcons,Muscle,
T-Coffee and DCA.The quality improvement increased with the
window size.QOMA had slightly better SP score using the com-
plete K-partite graph algorithm than the sparse K-partite graph
algorithm.However,the sparse K-partite graph algorithm was
much faster when large window size is used.We found a good
balance point between quality of result and running time at
d ¼ 1 using sparse K-partite graph.
Currently,we are working on extending the QOMA algorithmto
align large number of sequences.
Conflict of Interest:none declared.
REFERENCES
Alkan,C.et al.(2005) Manipulating multiple sequence alignments via MaM and
WebMaM.Nucleic Acids Res.,33,W295–W298.
Bray,N.and Pachter,L.(2004) MAVID:constrained ancestral alignment of multiple
sequences.Genome Res.,14,693–699.
Do,C.B.et al.(2005) PROBCONS:Probabilistic consistency-based multiple sequence
alignment.Genome Res.,15,330–340.
Eddy,S.R.(1995) Multiple alignment using hidden Markov models.Proc Int Conf Intell
Syst Mol Biol.,3,114–120.
Edgar,R.(2004) MUSCLE:multiple sequence alignment with high accuracy and high
throughput.Nucleic Acids Res.,32,1792–1797.
Gotoh,O.(1982) An improved algorithm for matching biological sequences.
J.Mol.Biol.,162,705–708.
Gotoh,O.(1996) Significant improvement in accuracy of multiple protein sequence
alignments by iterative refinement as assessed by reference to structural align-
ments.J.Mol.Biol.,264,823–838.
Grundy,W.N.(1998) Family-based homology detection via pairwise sequence
comparison.In Proceedings of the Second Annual International Conference on
Computational Molecular Biology (RECOMB’98),ACM Press,New York,NY,
pp.94–100.
Gupta,S.et al.(1996) Improving the practical space and time efficiency of the shortest-
paths approach to sum-of-pairs multiple sequence alignment.J.Comput.Biol.
Gupta,S.K.et al.(1995) Improving the practical space and time efficiency of the
shortest-paths approach to sum-of-pairs multiple sequence alignment.J.Comput.
Biol.,2,459–462.
Hein,J.(1989) A new method that simultaneously aligns and reconstructs ancestral
sequences for any number of homologous sequences,when the phylogeny is given.
Mol.Biol.Evol.,6,649–668.
Table 5.The running time of QOMA (in seconds) using complete K-partite graph and spare graph for different value of d and W on the V1-R1 dataset
Window
Size
Short Medium Long
A B C D A B C D A B C D
W¼2 0.27
0.24 0.35 0.3 0.58 0.61 0.75 0.74 1.11 1.99 2.31 2.27
W¼4
1.76 1.01 1.53 1.78 3.62 2.19 3.2 3.74 8 5.48 7.83 8.93
W¼8
17 6 9 12 37 13 19 26 86 31 43 61
W¼16
232 51 63 81 535 110 132 166 1087 257 306 386
(A:complete K-partite graph.B:sparse K-partite graphwithd¼0.C:sparse K-partite graphwithd¼1.D:sparse K-partite graphwithd¼2.) The dataset is split intothree subsets (short,
mediumand long) according to the length of the sequences in the benchmarks.The running time of the competing methods,in seconds,are as follows.ClustalW:between 0.39 and 2.1,
Probcons:between 0.9 and 8.8,T-coffee:between 3.9 and 24.2,Muscle:between 0.4 and 1.6.
QOMA of protein sequences
167
Henikoff,S.and Henikoff,J.G.(1992) Amino acid substitution matrices from protein
blocks.Proc.Natl Acad.Sci.USA,89,10915–10919.
Huang,X.and Miller,W.(1991) A time-efficient,linear-space local similarity algo-
rithm.Adv.Appl.Math.,12,337–357.
Jones,D.T.(1999) Protein secondary structure prediction based on position-specific
scoring matrices.J.Mol.Biol.,292,195–202.
Katoh,K.et al.(2002) MAFFT:a novel method for rapid multiple sequence alignment
based on fast Fourier transform.Nucleic Acids Res.,30,3059–3066.
Lee,C.et al.(2002) Multiple sequence alignment using partial order graphs.
Bioinformatics,18,452–464.
Lipman,D.et al.(1989) A tool for multiple sequence alignment.Proc.Natl Acad.Sci.
USA,86,4412–4415.
Morgenstern,B.et al.(1998) DIALIGN:finding local similarities by multiple sequence
alignment.Bioinformatics,14,290–294.
Needleman,S.B.and Wunsch,C.D.(1970) A general method applicable to the search
for similarities in the amino acid sequence of two proteins.J.Mol.Biol.,48,
443–453.
Notredame,C.et al.(2000) T-coffee:a novel method for fast and accurate multiple
sequence alignment.J.Mol.Biol.,302,205–217.
Phillips,A.et al.(2000) Multiple sequence alignment in phylogenetic analysis.
Mol.Phylogene.Evol.,16,317–330.
Smith,T.F.and Waterman,M.S.(1981) Identification of common molecular subse-
quences.J.Mol.Biol.,147,195–197.
Stoye,J.(1998) Multiple sequence alignment with the divide-and-conquer method.
Gene,211,GC45–GC56.
Stoye,J.et al.(1997) DCA:an efficient implementation of the divide-and-conquer
approach to simultaneous multiple sequence alignment.Comput.Appl.Biosci.,
13,625–626.
Sze,S.H.et al.(2005) A polynomial time solvable formulation multiple sequence
alignment.In:Miyano,S.,Mesirov,J.P.,Kasif,S.,Istrail,S.Pevzner,P.A.and
Waterman,M.S.(eds),Proceedings of the 9th Annual International Conference
on Research in Computational Molecular Biology (RECOMB),Lecture Notes in
Computer Science,Springer,pp.204–216.
Thompson,J.et al.(1994) CLUSTAL W:improving the sensitivity of progressive
multiple sequence alignment through sequence weighting,position-specific gap
penalties and weight matrix choice.Nucleic Acids Res.,22,4673–4680.
Thompson,J.et al.(1999) A comprehensive comparison of multiple sequence align-
ment programs.Nucleic Acids Res.,27,2682–2690.
Van Walle,I.et al.(2004) Align-m—a new algorithmfor multiple alignment of highly
divergent sequences.Bioinformatics,20,1428–1435.
Wang,L.and Jiang,T.(1994) On the complexity of multiple sequence alignment.
J.Comput.Biol.,1,337–348.
Zhang,X.and Kahveci,T.(2006) A new approach for alignment of multiple proteins.
In:Altman,R.B.,Murray,T.,Klein,T.E.,Dunker,A.K.and Hunter,L.(eds),
Proceedings of the Pacific Symposium on Biocomputing (PSB),World Scientific
pp.339–350,
X.Zhang and T.Kahveci
168