Vol.23 no.2 2007,pages 162–168
doi:10.1093/bioinformatics/btl590
BIOINFORMATICS ORIGINAL PAPER
Sequence analysis
QOMA:quasioptimal multiple alignment of protein sequences
Xu Zhang and Tamer Kahveci
Computer and Information Science and Engineering,University of Florida,Gainesville,FL 32611,USA
Received on August 8,2006;revised on November 15,2006;accepted on November 16,2006
Advance Access publication November 22,2006
Associate Editor:John Quackenbush
ABSTRACT
Motivation:We consider the problemof multiple alignment of protein
sequences with the goal of achieving a large SP (SumofPairs) score.
Results:We introduce a new graphbased method.We name our
method QOMA (QuasiOptimal Multiple Alignment).QOMA starts
with an initial alignment.It represents this alignment using a Kpartite
graph.It then improves the SP score of the initial alignment through
local optimizations within a window that moves greedily on the align
ment.QOMA uses two parameters to permit flexibility in time/
accuracy trade off:(1) The size of the window for local optimization.
(2) The sparsity of the Kpartite graph.Unlike traditional progressive
methods,QOMA is independent of the order of sequences.The
experimental results on BAliBASE benchmarks show that QOMA
produces higher SP score than the existing tools including ClustalW,
Probcons,Muscle,TCoffee and DCA.The difference is more
significant for distant proteins.
Availability:The software is available fromthe authors upon request.
Contact:tamer@cise.ufl.edu
Supplementary information:Supplementary material is available at
Bioinformatics online.
1 MOTIVATION
Alignment of multiple (i.e.three or more) protein sequences is
one of the most fundamental problems in computational biology.
Multiple sequence alignment is widely used in many applications,
such as protein structure prediction (Jones,1999),phylogenetic
analysis (Phillips et al.,2000),identiﬁcation of conserved motifs
and domains (Thompson et al.,1999) and protein classiﬁcation
(Grundy,1997).
The quality of a multiple alignment is usually measured in terms
of its SP (SumofPairs) score (Lee et al.,2002;Lipman et al.,1989;
Walle et al.,2004;Stoye et al.,1997).The SP Score of an align
ment,A,of sequences P
1
,P
2
,...,P
K
is computed by adding the
alignment scores of all of its induced pairwise alignments.It can be
expressed as SPðAÞ ¼
P
i<j
ScoreðP
i
‚P
j
Þ,where K is the number
of sequences,P
i
is the sequence indexed by i and Score(P
i
,P
j
) is the
score of the alignment of P
i
and P
j
induced by A.For simplicity,we
assume ﬁxed penalty for each indel during the computation of Score
(P
i
,P
j
).Extension to afﬁne gap is trivial.
Variations of the SP score are often used to reﬂect the biological
relevance of the multiple alignment.Assigning different weights to
sequences depending on their degree of similarity (Thompson et al.,
1994) or different weights to different positions of the sequences
depending on the structural and functional annotations (Zhang and
Kahveci,2006) are two examples to such variations.ClustalW
(Thompson et al.,1994),e.g.optimizes
P
i<j
score ðP
i
‚P
j
ÞW
i
W
j
,
where W
i
and W
j
are weights assigned to sequences P
i
and
P
j
,respectively.For simplicity,we use the SP score in this
paper.Extension to weighted SP score is trivial.
Finding the multiple sequence alignment that maximizes the SP
score is an NPcomplete problem (Wang and Jiangs 1994).A vari
ety of heuristics have been developed.Progressive alignment
methods form an important class of the heuristic methods.Such
methods progressively align pairs of proﬁles in a certain order and
produce a newproﬁle until a single proﬁle is left.Aproﬁle is either a
sequence or the alignment of a set of sequences.Progressive meth
ods,however,have an important shortcoming.The order that the
proﬁles are chosen for alignment affects the quality of the alignment
signiﬁcantly.The optimal alignment may be different than all
possible alignments obtained by considering all possible orderings
of sequences (Zhang and Kahveci,2006).A method,which can
balance the running time and the alignment accuracy is seriously in
demand.
In this paper,we consider the problem of maximizing the SP
score of the alignment of multiple protein sequences.We develop a
graphbased method named QOMA (QuasiOptimal Multiple
Alignment).QOMA starts by constructing an initial multiple align
ment with the help of optimal pairwise alignments.It then builds a
graph corresponding to the initial alignment.It iteratively places a
window on this graph and improves the SP score of the initial
alignment by optimizing the alignment inside the window.The
location of the window is selected greedily as the one that has a
chance of improving the SP score by the largest amount.QOMA
uses two parameters to permit ﬂexibility in time/accuracy trade off:
(1) The window size.(2) The sparsity of the Kpartite graph.The
experimental results show that QOMA ﬁnds alignments with better
SP score compared to existing tools including ClustalW,Probcons,
Muscle,TCoffee and DCA at similar running time.The improve
ment is more signiﬁcant for distant proteins.The running time of
QOMA is larger than that of the competing methods for large
window sizes and for large number of sequences.
The rest of the paper is organized as follows.Section 2 discusses
related work.Section 3 introduces the basic algorithm in detail.
Section 4 discusses the improved algorithm.Section 5 presents
experimental results and analysis.Section 6 concludes the paper
and points out possible future directions.
2 RELATED WORK
Given a table of scores for matches and mismatches between all
amino acids and penalties for insertions or deletions,the optimal
To whom correspondence should be addressed.
162
The Author 2006.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from
alignment of two sequences can be determined by dynamic program
ming (DP).The time and space complexity of this methods is O(N
2
)
(Needleman and Wunsch,1970;Smith and Waterman,1981;Gotoh,
1982),where N is the length of each sequence.This algorithm can
be extended to align K sequences,but requires O(2
K
N
K
K
2
) time
(Lipman et al.,1989;Gupta et al.,1995).MSA(Lipman et al.,1989)
is a representative program using this algorithm.However,MSA
suffers from computation expenses.Variety of heuristic algorithms
have been developed to overcome this difﬁculty (Thompson
et al.,1994).These heuristic methods can be classiﬁed into four
groups:progressive,iterative,anchorbased and probabilistic.
Progressive methods ﬁnd multiple alignment by iteratively
picking two sequences from this set and replacing them with
their alignment (i.e.consensus sequence) until all sequences are
aligned into a single consensus sequence.Thus,progressive meth
ods guarantee that never more than two sequences are simultane
ously aligned.This approach is sufﬁciently fast to allowalignments
of virtually any size.ClustalW (Thompson et al.,1994),Tcoffee
(Notredame et al.,2000),Treealign (Hein,1989),POA (Lee et al.,
2002) and MAFFT (Katoh et al.,2002) can be grouped into this
class (Sze et al.,2005).Iterative methods start with an initial align
ment.They then repeatedly reﬁne this alignment through a series of
iterations until no more improvements can be made.Muscle (Edgar,
2004) can be grouped into this class as well as the progressive
method class since it uses a progressive alignment at each iteration.
Anchorbased methods use local motifs (short common subseq
uences) as anchors.Later,the unaligned regions between conse
cutive anchors are aligned using other techniques.DIALIGN
(Morgenstern et al.,1998),Alignm (Walle et al.,2004),Lalign
(Huang and Miller,1991),Mavid (Bray and Pachter,2004),and
PRRP (Gotoh,1996) belong to this class.Probabilistic methods
precompute the substitution probabilities by analyzing known
multiple alignments.They use these probabilities to maximize
the substitution probabilities for a given set of sequences.Probcons
(Do et al.,2004),and Hmmt (Eddy,1995) can be grouped into
this class.The common shortcomings of the methods above are:
(1) The alignments of progressive methods depend on the order of
sequences.(2) They do not provide ﬂexible quality/time trade off.
Improving the alignment quality of an initial alignment has been
traditionally done manually (e.g.through programs like MaM and
WebMaM(Alkan et al.,2005)).There are a fewmethods,which aim
to optimize the SP score by running DP alignment on all sequences
simultaneously.MSAis the representative in this class (Gupta et al.,
1996).DCA extends MSA by utilizing ‘divideandconquer’ strat
egy (Stoye et al.,1997).Unlike progressive methods,DCA divides
the sequences recursively until they are shorter than a given thresh
old.It then uses MSA to ﬁnd the optimal solutions for the smaller
problems.The performance of DCA depends on how it divides the
sequences.DCA uses a cut strategy that minimizes additional costs
(Stoye,1998) and does not guarantee to ﬁnd optimal solution.It uses
the longest sequence in the input sequences as reference to select the
cut positions.This makes it order dependent,as there is no justiﬁca
tionwhythis selection(or anyother selection) optimizes the SPscore
of the alignment.On the contrary,our method is order independent.
3 ALGORITHM:COMPLETEKPARTITEGRAPH
In this section,we introduce the basic QOMAalgorithmfor aligning
Kprotein sequences.QOMAworks in two steps:(1) It constructs an
initial alignment and a Kpartite graph corresponding to this align
ment.(2) It iteratively places a window on the sequences and
replaces the window with its optimal alignment.We call this the
complete Kpartite graph algorithmsince a letter of a protein can be
aligned with any letter of the other proteins within the same
window.For simplicity,we use ﬁxed penalty for each indel.Exten
sion of the developed method to afﬁne gap is trivial.Next,we
describe these two steps in detail.
3.1 Constructing initial alignment
The purpose of constructing an initial alignment is to roughly iden
tify the position of each node in ﬁnal alignment.It is important to
ﬁnd this initial alignment quickly in order to minimize initialization
overhead.
The initial alignment can be constructed in many ways.One
possibility is to use an existing tool,such as ClustalW.This strategy,
however,has the shortcoming that the initial alignment depends on
another tool,which may be orderdependent.This makes QOMA
partially orderdependent.We propose to construct the initial align
ment from pairwise optimal alignments of sequences.In this strat
egy,ﬁrst,sequence pairs are optimally aligned using DP (Gotoh,
1982).An edge is added between two nodes if the nodes are
matched in this alignment.A weight is assigned to each edge as
the substitution score of the two residues that constitute that edge.
The substitution score is obtained from the underlying scoring
matrix,such as BLOSUM62 (Henikoff and Henikoff,1992).The
weight of each node is deﬁned as the sumof the weights of the edges
that have that node on one end.A node set is then deﬁned by
selecting one node from the head of each sequence.The node
which has the highest weight is selected from this set.This node
is aligned with the nodes adjacent to it.Thus,the letters aligned at
the end of this step constitute one column of the initial multiple
alignment.The node set is then updated as the nodes immediately
after the nodes in current set in each sequence.This process is
repeated and columns are found until all the nodes are consumed.
The alignment is obtained by concatenating all the columns.Gaps
are inserted between nodes if needed.Unlike progressive tools,this
strategy is orderindependent.An example for initial alignment
construction is shown in Supplementary Figure A1.
The time complexity of both of these strategies are O(K
2
N
2
) since
pairwise comparisons dominate the running time.However,latter
approach is faster.This is because it runs dynamic programming
only once for each sequence pair.On the other hand,the former one
performs two set of pairwise alignments.One to ﬁnd a guide tree
and another to align sequences progressively according to the guide
tree.
3.2 Improving the SP score via local optimizations
After constructing the initial alignment,the nodes are placed
roughly in their correct positions (or in a close by position) in
the alignment.Next,the alignment is iteratively improved.At
each iteration,a short window is placed on the existing alignment.
The subsequences contained in this window are then replaced by
their optimal alignments (see Supplementary Figure A2).DP algo
rithm (Gotoh,1982) is used to ﬁnd the optimal alignment.This is
feasible since the cost of aligning a windowis much less than that of
the entire sequences.
This algorithm requires solving two problems.First,where
should the window be placed?Second,when should the iterations
QOMA of protein sequences
163
stop?One obvious solution is to slide a windowfromleft to right (or
right to left) shifting by some predeﬁned amount D at each iteration.
In this case,the iterations will end once the window reaches the
right end (or the left end) of the alignment (see Supplementary,
Figure A2).This solution,however,has two problems.First,it is not
clear which direction the window should be slid.Second,a window
is optimized even if it is already a good alignment,resulting in a
waste of computation time.We propose another solution.We com
pute an upper bound of the improvement of the SP score for every
possible windowposition as follows.Let X
i
denote the upper bound
to the SP score for the windowstarting at position i in the alignment.
This number can be computed as the sum of the scores of all
pairwise optimal alignments of the subsequences in this window.
Let Y
i
denote the current SP score of that window.The upper bound
is computed as X
i
Y
i
.We propose to greedily select the window
that has the largest upper bound at each iteration.In order to ensure
that this solution does not optimize more windows than the ﬁrst one
(i.e.,sliding windows),we do not select a window position that is
within D/2 positions to a previously optimized window.The itera
tions stop when all the remaining windows have an upper bound of
zero or they are within D/2 positions of a previously optimized
window.In our experiments,the second solution was slightly better.
The second solution converged to the ﬁnal result much faster than
the ﬁrst one.(data not shown.)
The time complexity of QOMA is O
2
K
W
K
K
2
ðN Wþ1Þ
D
.This is
because there are
ðN Wþ1Þ
D
positions for window.A dynamic pro
gramming solution is computed for each such window.The cost of
each dynamic programming solution is O(2
K
W
K
K
2
).This algorithm
is much faster than the optimal dynamic programming when W is
much smaller than N.The space complexity is O(W
K
+ KN).This is
because dynamic programming for a windowrequires O(W
K
) space,
and only one window is maintained at a time.Also O(KN) space is
needed to store the sequences and the alignment.Note that the edges
of the complete Kpartite graph are not stored at this step as we
already know that the graph is complete.
3.3 QOMA and optimality
This section analyzes QOMA.Let P
1
,P
2
,...,P
K
be the protein
sequences to be aligned.Let A
be an optimal alignment of P
1
,
P
2
,...,P
K
.Let S
denote the SP score of A
.Let A be an alignment
of P
1
,P
2
,...,P
K
.Let SP(A) be the SP score of A.We deﬁne the error
induced by Aas error (A) ¼S
SP(A).This expression,however,is
not computable for ﬁnding S
is NPcomplete.Instead,we compute
the error of Aas eðAÞ ¼
SS SPðAÞ,where
SS is an upper bound to S
.
Hence,e(A) error(A).Here,
SS is computed as the sumof the scores
of all optimal pairwise alignments of P
1
,P
2
,...,P
K
.Let QOMA
(A,W) be the alignment obtained by QOMA starting from initial
alignment A using a window of size W.We deﬁne the percentage
of improvement provided by QOMA over A using a window size
of W as
improveðA‚WÞ ¼ f1 eðQOMAðA‚WÞÞ/eðAÞg · 100 ð1Þ
Our ﬁrst lemma shows that QOMAalways results in an alignment
at least as good as the initial alignment (The proof is shown in the
Supplementary material).
L
EMMA
1.improve(A,W) 0,8A,W.&
Corollary 1 follows from Lemma 1.
C
OROLLARY
1.S
¼ SP(QOMA(A
,W)),8W.&
Corollary 1 implies that QOMAalters an initial alignment A only
if A is not optimal.Next lemma discusses the impact of the window
size on QOMA (See Supplementary material for proof).
L
EMMA
2.SP(QOMA(A,W)) SP(QOMA(A,2W)).
Lemma 2 indicates that as W increases,the SP score of the
resulting alignment increases.In other words,as W increases,
SP(QOMA(A,W)) converges to S
.
4 IMPROVED ALGORITHM:SPARSE GRAPH
QOMA converges to optimal alignment as the window size (W)
grows.However,this happens at the expense of exponential time
complexity.In Section 3.1 we computed the time complexity of
QOMA using complete Kpartite graph as O
2
K
W
K
ðN Wþ1ÞK
2
D
for
proteins P
1
,P
2
,...,P
K
.In this section,we reduce the time
complexity of QOMAby sacriﬁcing accuracy through use of sparse
Kpartite graph.The goal is to run QOMA within a limited time
budget while using a larger window size.
The factor 2
K
in the complexity is incurred because each cell
of the DP matrix is computed by considering 2
K
1 conditions
(i.e.2
K
1 neighboring cells).This is because there are 2
K
1
possible nonempty subsets of K residues.Each subset,here
corresponds to a set of residues that align together,and thus to a
neighboring cell.We propose to reduce the complexity by reducing
the number of residues that can be aligned together.We do this
by keeping only the edges between node pairs with highpossibility
of matching.
Choosing the promising edges is crucial for the quality of the
resulting alignment.We propose to store an edge between two nodes
only if their corresponding letters are aligned in the optimal pairwise
alignment.Thus we produce at most K1 edges per node since each
node is aligned with at most one node from each of the K 1
sequences.We also introduce a deviation parameter d,where d is a
nonnegative integer.Let p[i] and q[ j] be the nodes corresponding
to protein sequences p and q at positions i and j in the initial graph,
respectively.We draw an edge between p[i] and q[ j] only if one of
the following two conditions holds in the optimal pairwise align
ment of p and q:(1) 9d,d d,such that p[i] is aligned with q[ j +
d];(2) 9d,d d,such that q[ j] is aligned with p[i + d].In other
words,we draw an edge between two nodes if their positions differ
by at most d in the optimal alignment of p and q.For example,in
Figure 1,p[2] aligns with q[2].Therefore,we draw an edge from
p[2] to q[1] and q[3] as well as q[2] since q[1] and q[3] are within
dneighborhood of (d ¼ 1) of q[2].
DP is modiﬁed for sparse Kpartite graph as follows:each cell,
[x
1
,x
2
,...,x
K
] in the Kdimensional DP matrix corresponds to
Fig.1.Sparse Kpartite graph for two sequences for d ¼ 0 and d ¼ 1.
X.Zhang and T.Kahveci
164
nodes P
1
[x
1
],P
2
[x
2
],...,P
K
[x
K
].Here P
i
[ j] stands for the node at
position j in sequence i.The set contains one node from each
sequence,and can be either a residue or a gap.Thus,each cell
deﬁnes a subgraph induced by its node set.For example,during
the alignment of the sequences that have the Kpartite graph as
shown in Figure 2(a),the cell [3,4,4] corresponds to nodes
P
1
[3],P
2
[4] and P
3
[4].Figure 2(b) shows the induced subgraph
of cell [3,4,4].
The induced subgraph for each cell yields a set of connected
components.The sparse graph algorithm uses connected compo
nents to improve the running time of DP as follows.During the
computation of the value of a DP matrix cell,we allowtwo nodes to
align only if they belong to the same connected component of the
induced subgraph of that cell.For example,for cell [3,4,4],P
2
[4]
and P
3
[4] can be aligned together,but P
1
[3] can not be aligned with
P
2
[4] or P
3
[4] (see Fig.2b).A connected component with n nodes
produce 2
n
1 nonempty subsets.Thus,for a given cell,if there are
t connected components and the tth component has n
t
nodes,then
the cost of that cell becomes
P
t
i¼1
ð2
n
i
1Þ.This is a signiﬁcant
improvement as the cost of a single cell is 2
n
1
+n
2
+...+n
t
1 using the
complete Kpartite graph.For example,in Figure 2,the cost for cell
[3,4,4] drops from 2
3
1 ¼7 to (2
1
1) +(2
2
1) ¼ 4.
The connected components of an induced subgraph can be
found in O(K
2
) time (i.e.the size of the induced subgraph) by
traversing the induced subgraph once.Thus,the total time
complexity of the sparse Kpartite graph approach is
O
P
W
K
i¼1
P
j
ð2
n
j
1Þ
h i
ðN Wþ1ÞK
2
/D
.The space com
plexity of using the sparse Kpartite graph is O(W
K
+ KN +
N(K1)K(2d + 1)/2).The ﬁrst term denotes the space for the
dynamic programming alignment within a window.The second
termdenotes the number of letters.The last termdenotes the number
of edges.The space complexity for the last two terms can be reduced
by storing only the subgraph inside the window.
5 EXPERIMENTAL EVALUATION
Experimental setup:We used BAliBASE benchmarks (Thompson
et al.,1999) reference 1 from version 1 (wwwigbmc.
ustrasbg.fr/BioInfo/BAliBASE/) and references 1,
2,...,8 from version 3 (wwwbio3digbmc.ustrasbg.
fr/BAliBASE/) for evaluation of our method.We use V1 and
V3 to denote BAliBASE versions 1 and 3,respectively.We use R1
to R8 to denote reference 1–8.For example,we use V3R4 to
represent the reference 4 dataset from version 3.We split the
V1R1 dataset into three datasets (V1R1low,V1R1medium
and V1R1high) according to the similarity of the sequences in
the benchmarks as denoted in BAliBASE (low and high
similarities).Similarly,V3R1 is split into two datasets V3R1
low and V3R1high containing low and highsimilarity bench
marks.The number of sequences in the benchmarks in version
3 were usually too large for QOMA and DCA.Therefore,we cre
ated 1000 benchmarks from each reference by randomly selecting
ﬁve sequences from the existing benchmarks.Thus,each of the
benchmarks from version 3 contains ﬁve sequences.
We evaluated the SP score and the running time in our experi
ments.We do not report the BAliBASE scores since the purpose of
QOMA is to maximize the SP score.
We implemented the complete and the sparse Kpartite QOMA
algorithms as discussed in the paper,using standard C.We used
BLOSUM62 as a measure of similarity between amino acids.We
used gap cost ¼4 to penalize each indel.We used D ¼W/2 in our
experiments since we achieved best quality per time with this value.
We also downloaded ClustalW,Probcons,Muscle,Tcoffee and
DCA for comparison.We did not compare QOMA with our pre
vious work HSA (Zhang and Kahveci,2006) since HSA needs
secondary structure information of proteins for alignment.To
ensure a fair comparison,we ran ClustalW,Muscle,Tcoffee,
DCA and QOMA using the same parameters (gap cost ¼ 4,
similarity matrix ¼ BLOSUM62).This was not possible for Prob
cons.We also ran all the competing methods using their default
parameters.
We ran all our experiments on Intel Pentium 4,with 2.6 G Hz
speed,and 512 MB memory.The operating system was Windows
2000.
Quality evaluation:We ﬁrst evaluate the quality of QOMA.
Table 1 shows the average SP score of QOMA using two strategies
for constructing initial alignment and four values of W.Strategy 1
obtains the initial alignments fromClustalW.Strategy 2 obtains the
initial alignments from the algorithm provided in Section 3.1.The
table also shows the upper bound for the SP score and the SP score
of ClustalWfor comparison.The best results are shown in bold for
each test.QOMA achieves higher SP score compared to ClustalW
on average for all windowsizes and for all datasets.The SP score of
QOMA consistently increases as W increases.These results are
justiﬁed by Lemmas 1 and 1.The SP score of Strategy 2 is usually
higher than that of Strategy 1 for almost all cases of low and
mediumsimilarity.Both strategies are almost identical for highly
similar sequences.There is a loose correlation between the initial SP
score and the ﬁnal SP score of QOMA.Higher initial SP scores
usually imply higher SP scores of the end result.There are however
exceptions especially for highly similar sequences.In the rest of the
experiments,we use Strategy 2 to construct the initial alignments by
default.
Table 2 shows us the SP scores of ﬁve existing tools,and QOMA
on all the datasets when the competing tools are run using the same
parameters as QOMA and using their default parameters.QOMA
has higher SP score than all the tools compared for all the datasets.
DCA always has second best score since it also targets on maxi
mizing the SP score of alignments.The difference between the SP
scores of QOMA and the other tools are more signiﬁcant for low
and medium similarity sequences.This is an important achieve
ment because the alignment of such sequences are usually harder
than highly similar sequences.
Table 3 shows the average percentage of improvement of QOMA
over alignments of ClustalW using the improvement formula as
Fig.2.(a) Asparse Kpartite graph for three sequences froma windowof size
4 (b) The induced subgraph for the cell [3,4,4] of the dynamic programming
matrix for the Kpartite graph in (a).
QOMA of protein sequences
165
given in Section 3.3,the dataset is V1R1.As the window size
increases,the increase in improvement percentage reduces.This
indicates that QOMA converges to the optimal score at reasonably
small window sizes.In other words,using window size larger than
16 will not improve the SP score signiﬁcantly.
Table 4 shows the average and the SD of the error incurred for
each window due to using the sparse Kpartite graph for QOMA.
Tabel 1.The average SP scores of QOMA using complete Kpartite graph with D ¼ W/2 on BAliBASE benchmarks and upper bound score (
SS)
Dataset
SS Initialization strategy 1 Initialization strategy 2
Initial W¼2 W¼4 W¼8 W¼16 Initial W¼2 W¼4 W¼8 W¼16
V1R1low 565
839 780 637 401 243 797 586 429 273 182
V1R1medium
2880 1982 2037 2181 2347 2442 2041 2192 2338 2446 2508
V1R1high
5324 4883 4933 5008 5071 5092 4867 4965 5057 5110 5122
(Initialization strategy 1:initial alignments are obtained from ClustalW,Initialization strategy 2:initial alignments are obtained from optimal pairwise alignments as discussed in
Section 3.1.) The best result is shown in bold face.
Table 2.The average SP scores of QOMA (using complete Kpartite graph with W ¼ 16) and five other tools on BAliBASE benchmarks
Dataset ClustalW Probcons Tcoffee Muscle DCA QOMA
V1R1low 808 (839)
1303 (1303) 1499 (1486) 2029 (778) 440 (440) 182 (182)
V1R1medium
2156 (1982) 2128 (2068) 1955 (2007) 424 (2161) 2356 (2289) 2582 (2508)
V1R1high
4954 (4935) 4924 (4975) 4842 (4920) 3468 (5001) 5007 (5055) 5122 (5172)
V3R1low
1233 (1316) 1763 (1763) 2141 (2052) 2617 (1200) 760 (760) 421 (421)
V3R1high
2048 (1911) 2008 (2008) 1803 (1902) 263 (2101) 2288 (2288) 2507 (2507)
V3R2
5984 (5953) 5862 (5892) 5793 (5847) 4564 (6029) 6126 (6153) 6287 (6313)
V3R3
5838 (6005) 5741 (5976) N/A (5945) 4406 (6088) 5968 (6202) 6110 (6348)
V3R4
251 (347) 280 (95) N/A (148) 935 (463) 714 (880) 926 (1107)
V3R5
3899 (3778) 3658 (3658) N/A (3553) 2117 (3935) 4310 (4310) 4446 (4446)
V3R6
1601 (1554) 1781 (1782) N/A (1859) 1710 (1570) 1335 (1336) 1151 (1152)
V3R7
6977 (6832) 6555 (6555) N/A (6409) 4663 (6973) 7502 (7502) 8000 (8000)
V3R8
8432 (8321) 8238 (8238) N/A (8190) 6927 (8494) 8831 (8831) 9248 (9248)
The numbers outside (inside) parenthesis show the SP scores when the tools are run with the same parameters as QOMA (with their default parameters).Some of the tools,namely
Tcoffee andClustalW,didnot produce anyalignment for some benchmarks for eachparameter settings.The results of all the tools are ignoredfor suchbenchmarks.‘N/A’ indicates that
thecorrespondingtool failedtoproducealignment for most of thebenchmarks inadataset for that parameter setting.We ignoresuchtools (i.e.,Tcoffee) for thosedatasets andparameter
setting.The best result for each dataset and parameter setting is shown in bold.
Table 3.The improvement (see Formula 1 in Section 3.3) of QOMA (using
complete Kpartite graph) over ClustalW on the V1R1 dataset
Length Window Size
2 4 8 16
Short 18.0 29.2 40.2 46.7
Medium 23.3 39.6 51.6 58.6
Long 18.6 39.4 51.5 54.1
The dataset is split into three subsets (short,mediumand long) according to the length of
the sequences.
Table 4.The average (m),SD (s) of the error,S
SP,for a window using
sparse version of QOMA on the V1R1 dataset
W Error using sparse Kpartite graph
d ¼ 0 d ¼ 1 d ¼ 2
m s e m s e m s e
8 1.127 3.820 7 0.358 2.089 1 0.200 1.591 1
16 2.159 5.909
11
0.751 3.249 1 0.433 2.434 1
Results are shown for window sizes W¼8 and 16,and deviation d ¼0,1 and 2.The «
value denotes the 95%confidence interval,i.e 95%of the expected improvement values
are in [m e,m + e] interval.
Fig.3.The SPscores of QOMAalignments usingthe complete andthe sparse
Kpartite graphs for different values of d and W on the V1R1 dataset.
X.Zhang and T.Kahveci
166
The average error decreases as d increases.Most of the reduction
in error is obtained when d increases from 0 to 1.For example,for
W ¼ 16,the error reduction is 1.408 and 0.318 as d increases from
0 to 1 and from 1 to 2,respectively.This implies that the average
improvement in the SP score degrades quickly for d > 1.The 95%
error interval also signiﬁcantly shrinks as d increases to 1.These
results suggest that the sparse Kpartite graph algorithmis effective
even for small values of d.
Figure 3 shows the average SP scores of the resulting alignments
using sparse Kpartite graph for different values of d and using
complete Kpartite graph on the V1R1 dataset.The complete
Kpartite graph algorithm produces the best SP scores.However,
the SP scores of results from the sparse Kpartite graph algorithm
are very close to that of the complete Kpartite graph algorithm.The
quality of the sparse Kpartite graph algorithm improves signiﬁ
cantly when d increases from0 to 1.The improvement is less when
d increases from1 to 2.This implies when d becomes larger than 1,
it has small impact on the quality of alignment.
Performance evaluation:Our second experiment set evaluates
the running time of QOMA.Table 5 lists the running time of QOMA
for the complete and the sparse Kpartite graph algorithms for
varying values of W.Experimental results show that QOMA
runs faster for small W.The sparse Kpartite graph algorithm is
faster than the complete Kpartite graph algorithm for all values
of d for large W.The running time of QOMA increases as d
increases.The results in this table agree with the time complexity
we computed in Sections 3.3 and 4.Referring to Tables 1,3 and
4,we conclude that when the window size is small,QOMA runs
fast and has high SP score.As the window size increases,its
performance drops but the SP score improves further.Thus,it is
better to increase the window size and use sparse Kpartite graph
strategy to obtain high scoring results quickly.In the light of
Figure 3,and Tables 4 and 5 we recommend d ¼ 1 using sparse
Kpartite graph strategy.This is because most of the improvement
in the SP score is obtained when d ¼ 1.For d > 1,the score
improves slightly while the running time increases at the same or
higher rate.
6 CONCLUSION AND FUTURE WORK
We considered the problem of multiple alignment with highSP
score for protein sequences.We developed a graphbased algorithm
called QOMA.QOMA ﬁrst constructs an initial alignment of
multiple sequences.In order to create this initial alignment,we
developed a method based on the optimal alignment between
all pairs of sequences.QOMA represents this alignment using a
Kpartite graph.It then improves the SP score of the initial
alignment by iteratively placing a window on it and optimizing
the alignment within this window.QOMAprovides two parameters
to permit ﬂexibility in time and accuracy:(1) The window size.(2)
The sparsity of the Kpartite graph.Unlike traditional tools,QOMA
is independent of the order of sequences.
In our experiments QOMA achieved signiﬁcantly higher SP
scores than existing tools,including ClustalW,Probcons,Muscle,
TCoffee and DCA.The quality improvement increased with the
window size.QOMA had slightly better SP score using the com
plete Kpartite graph algorithm than the sparse Kpartite graph
algorithm.However,the sparse Kpartite graph algorithm was
much faster when large window size is used.We found a good
balance point between quality of result and running time at
d ¼ 1 using sparse Kpartite graph.
Currently,we are working on extending the QOMA algorithmto
align large number of sequences.
Conﬂict of Interest:none declared.
REFERENCES
Alkan,C.et al.(2005) Manipulating multiple sequence alignments via MaM and
WebMaM.Nucleic Acids Res.,33,W295–W298.
Bray,N.and Pachter,L.(2004) MAVID:constrained ancestral alignment of multiple
sequences.Genome Res.,14,693–699.
Do,C.B.et al.(2005) PROBCONS:Probabilistic consistencybased multiple sequence
alignment.Genome Res.,15,330–340.
Eddy,S.R.(1995) Multiple alignment using hidden Markov models.Proc Int Conf Intell
Syst Mol Biol.,3,114–120.
Edgar,R.(2004) MUSCLE:multiple sequence alignment with high accuracy and high
throughput.Nucleic Acids Res.,32,1792–1797.
Gotoh,O.(1982) An improved algorithm for matching biological sequences.
J.Mol.Biol.,162,705–708.
Gotoh,O.(1996) Signiﬁcant improvement in accuracy of multiple protein sequence
alignments by iterative reﬁnement as assessed by reference to structural align
ments.J.Mol.Biol.,264,823–838.
Grundy,W.N.(1998) Familybased homology detection via pairwise sequence
comparison.In Proceedings of the Second Annual International Conference on
Computational Molecular Biology (RECOMB’98),ACM Press,New York,NY,
pp.94–100.
Gupta,S.et al.(1996) Improving the practical space and time efﬁciency of the shortest
paths approach to sumofpairs multiple sequence alignment.J.Comput.Biol.
Gupta,S.K.et al.(1995) Improving the practical space and time efﬁciency of the
shortestpaths approach to sumofpairs multiple sequence alignment.J.Comput.
Biol.,2,459–462.
Hein,J.(1989) A new method that simultaneously aligns and reconstructs ancestral
sequences for any number of homologous sequences,when the phylogeny is given.
Mol.Biol.Evol.,6,649–668.
Table 5.The running time of QOMA (in seconds) using complete Kpartite graph and spare graph for different value of d and W on the V1R1 dataset
Window
Size
Short Medium Long
A B C D A B C D A B C D
W¼2 0.27
0.24 0.35 0.3 0.58 0.61 0.75 0.74 1.11 1.99 2.31 2.27
W¼4
1.76 1.01 1.53 1.78 3.62 2.19 3.2 3.74 8 5.48 7.83 8.93
W¼8
17 6 9 12 37 13 19 26 86 31 43 61
W¼16
232 51 63 81 535 110 132 166 1087 257 306 386
(A:complete Kpartite graph.B:sparse Kpartite graphwithd¼0.C:sparse Kpartite graphwithd¼1.D:sparse Kpartite graphwithd¼2.) The dataset is split intothree subsets (short,
mediumand long) according to the length of the sequences in the benchmarks.The running time of the competing methods,in seconds,are as follows.ClustalW:between 0.39 and 2.1,
Probcons:between 0.9 and 8.8,Tcoffee:between 3.9 and 24.2,Muscle:between 0.4 and 1.6.
QOMA of protein sequences
167
Henikoff,S.and Henikoff,J.G.(1992) Amino acid substitution matrices from protein
blocks.Proc.Natl Acad.Sci.USA,89,10915–10919.
Huang,X.and Miller,W.(1991) A timeefﬁcient,linearspace local similarity algo
rithm.Adv.Appl.Math.,12,337–357.
Jones,D.T.(1999) Protein secondary structure prediction based on positionspeciﬁc
scoring matrices.J.Mol.Biol.,292,195–202.
Katoh,K.et al.(2002) MAFFT:a novel method for rapid multiple sequence alignment
based on fast Fourier transform.Nucleic Acids Res.,30,3059–3066.
Lee,C.et al.(2002) Multiple sequence alignment using partial order graphs.
Bioinformatics,18,452–464.
Lipman,D.et al.(1989) A tool for multiple sequence alignment.Proc.Natl Acad.Sci.
USA,86,4412–4415.
Morgenstern,B.et al.(1998) DIALIGN:ﬁnding local similarities by multiple sequence
alignment.Bioinformatics,14,290–294.
Needleman,S.B.and Wunsch,C.D.(1970) A general method applicable to the search
for similarities in the amino acid sequence of two proteins.J.Mol.Biol.,48,
443–453.
Notredame,C.et al.(2000) Tcoffee:a novel method for fast and accurate multiple
sequence alignment.J.Mol.Biol.,302,205–217.
Phillips,A.et al.(2000) Multiple sequence alignment in phylogenetic analysis.
Mol.Phylogene.Evol.,16,317–330.
Smith,T.F.and Waterman,M.S.(1981) Identiﬁcation of common molecular subse
quences.J.Mol.Biol.,147,195–197.
Stoye,J.(1998) Multiple sequence alignment with the divideandconquer method.
Gene,211,GC45–GC56.
Stoye,J.et al.(1997) DCA:an efﬁcient implementation of the divideandconquer
approach to simultaneous multiple sequence alignment.Comput.Appl.Biosci.,
13,625–626.
Sze,S.H.et al.(2005) A polynomial time solvable formulation multiple sequence
alignment.In:Miyano,S.,Mesirov,J.P.,Kasif,S.,Istrail,S.Pevzner,P.A.and
Waterman,M.S.(eds),Proceedings of the 9th Annual International Conference
on Research in Computational Molecular Biology (RECOMB),Lecture Notes in
Computer Science,Springer,pp.204–216.
Thompson,J.et al.(1994) CLUSTAL W:improving the sensitivity of progressive
multiple sequence alignment through sequence weighting,positionspeciﬁc gap
penalties and weight matrix choice.Nucleic Acids Res.,22,4673–4680.
Thompson,J.et al.(1999) A comprehensive comparison of multiple sequence align
ment programs.Nucleic Acids Res.,27,2682–2690.
Van Walle,I.et al.(2004) Alignm—a new algorithmfor multiple alignment of highly
divergent sequences.Bioinformatics,20,1428–1435.
Wang,L.and Jiang,T.(1994) On the complexity of multiple sequence alignment.
J.Comput.Biol.,1,337–348.
Zhang,X.and Kahveci,T.(2006) A new approach for alignment of multiple proteins.
In:Altman,R.B.,Murray,T.,Klein,T.E.,Dunker,A.K.and Hunter,L.(eds),
Proceedings of the Paciﬁc Symposium on Biocomputing (PSB),World Scientiﬁc
pp.339–350,
X.Zhang and T.Kahveci
168
Enter the password to open this PDF file:
File name:

File size:

Title:

Author:

Subject:

Keywords:

Creation Date:

Modification Date:

Creator:

PDF Producer:

PDF Version:

Page Count:

Preparing document for printing…
0%
Σχόλια 0
Συνδεθείτε για να κοινοποιήσετε σχόλιο