Vol.23 no.2 2007,pages 162–168

doi:10.1093/bioinformatics/btl590

BIOINFORMATICS ORIGINAL PAPER

Sequence analysis

QOMA:quasi-optimal multiple alignment of protein sequences

Xu Zhang and Tamer Kahveci

Computer and Information Science and Engineering,University of Florida,Gainesville,FL 32611,USA

Received on August 8,2006;revised on November 15,2006;accepted on November 16,2006

Advance Access publication November 22,2006

Associate Editor:John Quackenbush

ABSTRACT

Motivation:We consider the problemof multiple alignment of protein

sequences with the goal of achieving a large SP (Sum-of-Pairs) score.

Results:We introduce a new graph-based method.We name our

method QOMA (Quasi-Optimal Multiple Alignment).QOMA starts

with an initial alignment.It represents this alignment using a K-partite

graph.It then improves the SP score of the initial alignment through

local optimizations within a window that moves greedily on the align-

ment.QOMA uses two parameters to permit flexibility in time/

accuracy trade off:(1) The size of the window for local optimization.

(2) The sparsity of the K-partite graph.Unlike traditional progressive

methods,QOMA is independent of the order of sequences.The

experimental results on BAliBASE benchmarks show that QOMA

produces higher SP score than the existing tools including ClustalW,

Probcons,Muscle,T-Coffee and DCA.The difference is more

significant for distant proteins.

Availability:The software is available fromthe authors upon request.

Contact:tamer@cise.ufl.edu

Supplementary information:Supplementary material is available at

Bioinformatics online.

1 MOTIVATION

Alignment of multiple (i.e.three or more) protein sequences is

one of the most fundamental problems in computational biology.

Multiple sequence alignment is widely used in many applications,

such as protein structure prediction (Jones,1999),phylogenetic

analysis (Phillips et al.,2000),identiﬁcation of conserved motifs

and domains (Thompson et al.,1999) and protein classiﬁcation

(Grundy,1997).

The quality of a multiple alignment is usually measured in terms

of its SP (Sum-of-Pairs) score (Lee et al.,2002;Lipman et al.,1989;

Walle et al.,2004;Stoye et al.,1997).The SP Score of an align-

ment,A,of sequences P

1

,P

2

,...,P

K

is computed by adding the

alignment scores of all of its induced pairwise alignments.It can be

expressed as SPðAÞ ¼

P

i<j

ScoreðP

i

‚P

j

Þ,where K is the number

of sequences,P

i

is the sequence indexed by i and Score(P

i

,P

j

) is the

score of the alignment of P

i

and P

j

induced by A.For simplicity,we

assume ﬁxed penalty for each indel during the computation of Score

(P

i

,P

j

).Extension to afﬁne gap is trivial.

Variations of the SP score are often used to reﬂect the biological

relevance of the multiple alignment.Assigning different weights to

sequences depending on their degree of similarity (Thompson et al.,

1994) or different weights to different positions of the sequences

depending on the structural and functional annotations (Zhang and

Kahveci,2006) are two examples to such variations.ClustalW

(Thompson et al.,1994),e.g.optimizes

P

i<j

score ðP

i

‚P

j

ÞW

i

W

j

,

where W

i

and W

j

are weights assigned to sequences P

i

and

P

j

,respectively.For simplicity,we use the SP score in this

paper.Extension to weighted SP score is trivial.

Finding the multiple sequence alignment that maximizes the SP

score is an NP-complete problem (Wang and Jiangs 1994).A vari-

ety of heuristics have been developed.Progressive alignment

methods form an important class of the heuristic methods.Such

methods progressively align pairs of proﬁles in a certain order and

produce a newproﬁle until a single proﬁle is left.Aproﬁle is either a

sequence or the alignment of a set of sequences.Progressive meth-

ods,however,have an important shortcoming.The order that the

proﬁles are chosen for alignment affects the quality of the alignment

signiﬁcantly.The optimal alignment may be different than all

possible alignments obtained by considering all possible orderings

of sequences (Zhang and Kahveci,2006).A method,which can

balance the running time and the alignment accuracy is seriously in

demand.

In this paper,we consider the problem of maximizing the SP

score of the alignment of multiple protein sequences.We develop a

graph-based method named QOMA (Quasi-Optimal Multiple

Alignment).QOMA starts by constructing an initial multiple align-

ment with the help of optimal pairwise alignments.It then builds a

graph corresponding to the initial alignment.It iteratively places a

window on this graph and improves the SP score of the initial

alignment by optimizing the alignment inside the window.The

location of the window is selected greedily as the one that has a

chance of improving the SP score by the largest amount.QOMA

uses two parameters to permit ﬂexibility in time/accuracy trade off:

(1) The window size.(2) The sparsity of the K-partite graph.The

experimental results show that QOMA ﬁnds alignments with better

SP score compared to existing tools including ClustalW,Probcons,

Muscle,T-Coffee and DCA at similar running time.The improve-

ment is more signiﬁcant for distant proteins.The running time of

QOMA is larger than that of the competing methods for large

window sizes and for large number of sequences.

The rest of the paper is organized as follows.Section 2 discusses

related work.Section 3 introduces the basic algorithm in detail.

Section 4 discusses the improved algorithm.Section 5 presents

experimental results and analysis.Section 6 concludes the paper

and points out possible future directions.

2 RELATED WORK

Given a table of scores for matches and mismatches between all

amino acids and penalties for insertions or deletions,the optimal

To whom correspondence should be addressed.

162

The Author 2006.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@oxfordjournals.org

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

by guest on October 1, 2013http://bioinformatics.oxfordjournals.org/Downloaded from

alignment of two sequences can be determined by dynamic program-

ming (DP).The time and space complexity of this methods is O(N

2

)

(Needleman and Wunsch,1970;Smith and Waterman,1981;Gotoh,

1982),where N is the length of each sequence.This algorithm can

be extended to align K sequences,but requires O(2

K

N

K

K

2

) time

(Lipman et al.,1989;Gupta et al.,1995).MSA(Lipman et al.,1989)

is a representative program using this algorithm.However,MSA

suffers from computation expenses.Variety of heuristic algorithms

have been developed to overcome this difﬁculty (Thompson

et al.,1994).These heuristic methods can be classiﬁed into four

groups:progressive,iterative,anchor-based and probabilistic.

Progressive methods ﬁnd multiple alignment by iteratively

picking two sequences from this set and replacing them with

their alignment (i.e.consensus sequence) until all sequences are

aligned into a single consensus sequence.Thus,progressive meth-

ods guarantee that never more than two sequences are simultane-

ously aligned.This approach is sufﬁciently fast to allowalignments

of virtually any size.ClustalW (Thompson et al.,1994),T-coffee

(Notredame et al.,2000),Treealign (Hein,1989),POA (Lee et al.,

2002) and MAFFT (Katoh et al.,2002) can be grouped into this

class (Sze et al.,2005).Iterative methods start with an initial align-

ment.They then repeatedly reﬁne this alignment through a series of

iterations until no more improvements can be made.Muscle (Edgar,

2004) can be grouped into this class as well as the progressive

method class since it uses a progressive alignment at each iteration.

Anchor-based methods use local motifs (short common subseq-

uences) as anchors.Later,the unaligned regions between conse-

cutive anchors are aligned using other techniques.DIALIGN

(Morgenstern et al.,1998),Align-m (Walle et al.,2004),L-align

(Huang and Miller,1991),Mavid (Bray and Pachter,2004),and

PRRP (Gotoh,1996) belong to this class.Probabilistic methods

pre-compute the substitution probabilities by analyzing known

multiple alignments.They use these probabilities to maximize

the substitution probabilities for a given set of sequences.Probcons

(Do et al.,2004),and Hmmt (Eddy,1995) can be grouped into

this class.The common shortcomings of the methods above are:

(1) The alignments of progressive methods depend on the order of

sequences.(2) They do not provide ﬂexible quality/time trade off.

Improving the alignment quality of an initial alignment has been

traditionally done manually (e.g.through programs like MaM and

WebMaM(Alkan et al.,2005)).There are a fewmethods,which aim

to optimize the SP score by running DP alignment on all sequences

simultaneously.MSAis the representative in this class (Gupta et al.,

1996).DCA extends MSA by utilizing ‘divide-and-conquer’ strat-

egy (Stoye et al.,1997).Unlike progressive methods,DCA divides

the sequences recursively until they are shorter than a given thresh-

old.It then uses MSA to ﬁnd the optimal solutions for the smaller

problems.The performance of DCA depends on how it divides the

sequences.DCA uses a cut strategy that minimizes additional costs

(Stoye,1998) and does not guarantee to ﬁnd optimal solution.It uses

the longest sequence in the input sequences as reference to select the

cut positions.This makes it order dependent,as there is no justiﬁca-

tionwhythis selection(or anyother selection) optimizes the SPscore

of the alignment.On the contrary,our method is order independent.

3 ALGORITHM:COMPLETEK-PARTITEGRAPH

In this section,we introduce the basic QOMAalgorithmfor aligning

Kprotein sequences.QOMAworks in two steps:(1) It constructs an

initial alignment and a K-partite graph corresponding to this align-

ment.(2) It iteratively places a window on the sequences and

replaces the window with its optimal alignment.We call this the

complete K-partite graph algorithmsince a letter of a protein can be

aligned with any letter of the other proteins within the same

window.For simplicity,we use ﬁxed penalty for each indel.Exten-

sion of the developed method to afﬁne gap is trivial.Next,we

describe these two steps in detail.

3.1 Constructing initial alignment

The purpose of constructing an initial alignment is to roughly iden-

tify the position of each node in ﬁnal alignment.It is important to

ﬁnd this initial alignment quickly in order to minimize initialization

overhead.

The initial alignment can be constructed in many ways.One

possibility is to use an existing tool,such as ClustalW.This strategy,

however,has the shortcoming that the initial alignment depends on

another tool,which may be order-dependent.This makes QOMA

partially order-dependent.We propose to construct the initial align-

ment from pairwise optimal alignments of sequences.In this strat-

egy,ﬁrst,sequence pairs are optimally aligned using DP (Gotoh,

1982).An edge is added between two nodes if the nodes are

matched in this alignment.A weight is assigned to each edge as

the substitution score of the two residues that constitute that edge.

The substitution score is obtained from the underlying scoring

matrix,such as BLOSUM62 (Henikoff and Henikoff,1992).The

weight of each node is deﬁned as the sumof the weights of the edges

that have that node on one end.A node set is then deﬁned by

selecting one node from the head of each sequence.The node

which has the highest weight is selected from this set.This node

is aligned with the nodes adjacent to it.Thus,the letters aligned at

the end of this step constitute one column of the initial multiple

alignment.The node set is then updated as the nodes immediately

after the nodes in current set in each sequence.This process is

repeated and columns are found until all the nodes are consumed.

The alignment is obtained by concatenating all the columns.Gaps

are inserted between nodes if needed.Unlike progressive tools,this

strategy is order-independent.An example for initial alignment

construction is shown in Supplementary Figure A1.

The time complexity of both of these strategies are O(K

2

N

2

) since

pairwise comparisons dominate the running time.However,latter

approach is faster.This is because it runs dynamic programming

only once for each sequence pair.On the other hand,the former one

performs two set of pairwise alignments.One to ﬁnd a guide tree

and another to align sequences progressively according to the guide

tree.

3.2 Improving the SP score via local optimizations

After constructing the initial alignment,the nodes are placed

roughly in their correct positions (or in a close by position) in

the alignment.Next,the alignment is iteratively improved.At

each iteration,a short window is placed on the existing alignment.

The subsequences contained in this window are then replaced by

their optimal alignments (see Supplementary Figure A2).DP algo-

rithm (Gotoh,1982) is used to ﬁnd the optimal alignment.This is

feasible since the cost of aligning a windowis much less than that of

the entire sequences.

This algorithm requires solving two problems.First,where

should the window be placed?Second,when should the iterations

QOMA of protein sequences

163

stop?One obvious solution is to slide a windowfromleft to right (or

right to left) shifting by some predeﬁned amount D at each iteration.

In this case,the iterations will end once the window reaches the

right end (or the left end) of the alignment (see Supplementary,

Figure A2).This solution,however,has two problems.First,it is not

clear which direction the window should be slid.Second,a window

is optimized even if it is already a good alignment,resulting in a

waste of computation time.We propose another solution.We com-

pute an upper bound of the improvement of the SP score for every

possible windowposition as follows.Let X

i

denote the upper bound

to the SP score for the windowstarting at position i in the alignment.

This number can be computed as the sum of the scores of all

pairwise optimal alignments of the subsequences in this window.

Let Y

i

denote the current SP score of that window.The upper bound

is computed as X

i

Y

i

.We propose to greedily select the window

that has the largest upper bound at each iteration.In order to ensure

that this solution does not optimize more windows than the ﬁrst one

(i.e.,sliding windows),we do not select a window position that is

within D/2 positions to a previously optimized window.The itera-

tions stop when all the remaining windows have an upper bound of

zero or they are within D/2 positions of a previously optimized

window.In our experiments,the second solution was slightly better.

The second solution converged to the ﬁnal result much faster than

the ﬁrst one.(data not shown.)

The time complexity of QOMA is O

2

K

W

K

K

2

ðN Wþ1Þ

D

.This is

because there are

ðN Wþ1Þ

D

positions for window.A dynamic pro-

gramming solution is computed for each such window.The cost of

each dynamic programming solution is O(2

K

W

K

K

2

).This algorithm

is much faster than the optimal dynamic programming when W is

much smaller than N.The space complexity is O(W

K

+ KN).This is

because dynamic programming for a windowrequires O(W

K

) space,

and only one window is maintained at a time.Also O(KN) space is

needed to store the sequences and the alignment.Note that the edges

of the complete K-partite graph are not stored at this step as we

already know that the graph is complete.

3.3 QOMA and optimality

This section analyzes QOMA.Let P

1

,P

2

,...,P

K

be the protein

sequences to be aligned.Let A

be an optimal alignment of P

1

,

P

2

,...,P

K

.Let S

denote the SP score of A

.Let A be an alignment

of P

1

,P

2

,...,P

K

.Let SP(A) be the SP score of A.We deﬁne the error

induced by Aas error (A) ¼S

SP(A).This expression,however,is

not computable for ﬁnding S

is NP-complete.Instead,we compute

the error of Aas eðAÞ ¼

SS SPðAÞ,where

SS is an upper bound to S

.

Hence,e(A) error(A).Here,

SS is computed as the sumof the scores

of all optimal pairwise alignments of P

1

,P

2

,...,P

K

.Let QOMA

(A,W) be the alignment obtained by QOMA starting from initial

alignment A using a window of size W.We deﬁne the percentage

of improvement provided by QOMA over A using a window size

of W as

improveðA‚WÞ ¼ f1 eðQOMAðA‚WÞÞ/eðAÞg · 100 ð1Þ

Our ﬁrst lemma shows that QOMAalways results in an alignment

at least as good as the initial alignment (The proof is shown in the

Supplementary material).

L

EMMA

1.improve(A,W) 0,8A,W.&

Corollary 1 follows from Lemma 1.

C

OROLLARY

1.S

¼ SP(QOMA(A

,W)),8W.&

Corollary 1 implies that QOMAalters an initial alignment A only

if A is not optimal.Next lemma discusses the impact of the window

size on QOMA (See Supplementary material for proof).

L

EMMA

2.SP(QOMA(A,W)) SP(QOMA(A,2W)).

Lemma 2 indicates that as W increases,the SP score of the

resulting alignment increases.In other words,as W increases,

SP(QOMA(A,W)) converges to S

.

4 IMPROVED ALGORITHM:SPARSE GRAPH

QOMA converges to optimal alignment as the window size (W)

grows.However,this happens at the expense of exponential time

complexity.In Section 3.1 we computed the time complexity of

QOMA using complete K-partite graph as O

2

K

W

K

ðN Wþ1ÞK

2

D

for

proteins P

1

,P

2

,...,P

K

.In this section,we reduce the time

complexity of QOMAby sacriﬁcing accuracy through use of sparse

K-partite graph.The goal is to run QOMA within a limited time

budget while using a larger window size.

The factor 2

K

in the complexity is incurred because each cell

of the DP matrix is computed by considering 2

K

1 conditions

(i.e.2

K

1 neighboring cells).This is because there are 2

K

1

possible non-empty subsets of K residues.Each subset,here

corresponds to a set of residues that align together,and thus to a

neighboring cell.We propose to reduce the complexity by reducing

the number of residues that can be aligned together.We do this

by keeping only the edges between node pairs with high-possibility

of matching.

Choosing the promising edges is crucial for the quality of the

resulting alignment.We propose to store an edge between two nodes

only if their corresponding letters are aligned in the optimal pairwise

alignment.Thus we produce at most K1 edges per node since each

node is aligned with at most one node from each of the K 1

sequences.We also introduce a deviation parameter d,where d is a

non-negative integer.Let p[i] and q[ j] be the nodes corresponding

to protein sequences p and q at positions i and j in the initial graph,

respectively.We draw an edge between p[i] and q[ j] only if one of

the following two conditions holds in the optimal pairwise align-

ment of p and q:(1) 9d,|d| d,such that p[i] is aligned with q[ j +

d];(2) 9d,|d| d,such that q[ j] is aligned with p[i + d].In other

words,we draw an edge between two nodes if their positions differ

by at most d in the optimal alignment of p and q.For example,in

Figure 1,p[2] aligns with q[2].Therefore,we draw an edge from

p[2] to q[1] and q[3] as well as q[2] since q[1] and q[3] are within

d-neighborhood of (d ¼ 1) of q[2].

DP is modiﬁed for sparse K-partite graph as follows:each cell,

[x

1

,x

2

,...,x

K

] in the K-dimensional DP matrix corresponds to

Fig.1.Sparse K-partite graph for two sequences for d ¼ 0 and d ¼ 1.

X.Zhang and T.Kahveci

164

nodes P

1

[x

1

],P

2

[x

2

],...,P

K

[x

K

].Here P

i

[ j] stands for the node at

position j in sequence i.The set contains one node from each

sequence,and can be either a residue or a gap.Thus,each cell

deﬁnes a subgraph induced by its node set.For example,during

the alignment of the sequences that have the K-partite graph as

shown in Figure 2(a),the cell [3,4,4] corresponds to nodes

P

1

[3],P

2

[4] and P

3

[4].Figure 2(b) shows the induced subgraph

of cell [3,4,4].

The induced subgraph for each cell yields a set of connected

components.The sparse graph algorithm uses connected compo-

nents to improve the running time of DP as follows.During the

computation of the value of a DP matrix cell,we allowtwo nodes to

align only if they belong to the same connected component of the

induced subgraph of that cell.For example,for cell [3,4,4],P

2

[4]

and P

3

[4] can be aligned together,but P

1

[3] can not be aligned with

P

2

[4] or P

3

[4] (see Fig.2b).A connected component with n nodes

produce 2

n

1 non-empty subsets.Thus,for a given cell,if there are

t connected components and the tth component has n

t

nodes,then

the cost of that cell becomes

P

t

i¼1

ð2

n

i

1Þ.This is a signiﬁcant

improvement as the cost of a single cell is 2

n

1

+n

2

+...+n

t

1 using the

complete K-partite graph.For example,in Figure 2,the cost for cell

[3,4,4] drops from 2

3

1 ¼7 to (2

1

1) +(2

2

1) ¼ 4.

The connected components of an induced subgraph can be

found in O(K

2

) time (i.e.the size of the induced subgraph) by

traversing the induced subgraph once.Thus,the total time

complexity of the sparse K-partite graph approach is

O

P

W

K

i¼1

P

j

ð2

n

j

1Þ

h i

ðN Wþ1ÞK

2

/D

.The space com-

plexity of using the sparse K-partite graph is O(W

K

+ KN +

N(K1)K(2d + 1)/2).The ﬁrst term denotes the space for the

dynamic programming alignment within a window.The second

termdenotes the number of letters.The last termdenotes the number

of edges.The space complexity for the last two terms can be reduced

by storing only the subgraph inside the window.

5 EXPERIMENTAL EVALUATION

Experimental setup:We used BAliBASE benchmarks (Thompson

et al.,1999) reference 1 from version 1 (www-igbmc.

u-strasbg.fr/BioInfo/BAliBASE/) and references 1,

2,...,8 from version 3 (www-bio3d-igbmc.u-strasbg.

fr/BAliBASE/) for evaluation of our method.We use V1 and

V3 to denote BAliBASE versions 1 and 3,respectively.We use R1

to R8 to denote reference 1–8.For example,we use V3-R4 to

represent the reference 4 dataset from version 3.We split the

V1-R1 dataset into three datasets (V1-R1-low,V1-R1-medium

and V1-R1-high) according to the similarity of the sequences in

the benchmarks as denoted in BAliBASE (low- and high-

similarities).Similarly,V3-R1 is split into two datasets V3-R1-

low and V3-R1-high containing low and high-similarity bench-

marks.The number of sequences in the benchmarks in version

3 were usually too large for QOMA and DCA.Therefore,we cre-

ated 1000 benchmarks from each reference by randomly selecting

ﬁve sequences from the existing benchmarks.Thus,each of the

benchmarks from version 3 contains ﬁve sequences.

We evaluated the SP score and the running time in our experi-

ments.We do not report the BAliBASE scores since the purpose of

QOMA is to maximize the SP score.

We implemented the complete and the sparse K-partite QOMA

algorithms as discussed in the paper,using standard C.We used

BLOSUM62 as a measure of similarity between amino acids.We

used gap cost ¼4 to penalize each indel.We used D ¼W/2 in our

experiments since we achieved best quality per time with this value.

We also downloaded ClustalW,Probcons,Muscle,T-coffee and

DCA for comparison.We did not compare QOMA with our pre-

vious work HSA (Zhang and Kahveci,2006) since HSA needs

secondary structure information of proteins for alignment.To

ensure a fair comparison,we ran ClustalW,Muscle,T-coffee,

DCA and QOMA using the same parameters (gap cost ¼ 4,

similarity matrix ¼ BLOSUM62).This was not possible for Prob-

cons.We also ran all the competing methods using their default

parameters.

We ran all our experiments on Intel Pentium 4,with 2.6 G Hz

speed,and 512 MB memory.The operating system was Windows

2000.

Quality evaluation:We ﬁrst evaluate the quality of QOMA.

Table 1 shows the average SP score of QOMA using two strategies

for constructing initial alignment and four values of W.Strategy 1

obtains the initial alignments fromClustalW.Strategy 2 obtains the

initial alignments from the algorithm provided in Section 3.1.The

table also shows the upper bound for the SP score and the SP score

of ClustalWfor comparison.The best results are shown in bold for

each test.QOMA achieves higher SP score compared to ClustalW

on average for all windowsizes and for all datasets.The SP score of

QOMA consistently increases as W increases.These results are

justiﬁed by Lemmas 1 and 1.The SP score of Strategy 2 is usually

higher than that of Strategy 1 for almost all cases of low- and

medium-similarity.Both strategies are almost identical for highly

similar sequences.There is a loose correlation between the initial SP

score and the ﬁnal SP score of QOMA.Higher initial SP scores

usually imply higher SP scores of the end result.There are however

exceptions especially for highly similar sequences.In the rest of the

experiments,we use Strategy 2 to construct the initial alignments by

default.

Table 2 shows us the SP scores of ﬁve existing tools,and QOMA

on all the datasets when the competing tools are run using the same

parameters as QOMA and using their default parameters.QOMA

has higher SP score than all the tools compared for all the datasets.

DCA always has second best score since it also targets on maxi-

mizing the SP score of alignments.The difference between the SP

scores of QOMA and the other tools are more signiﬁcant for low-

and medium- similarity sequences.This is an important achieve-

ment because the alignment of such sequences are usually harder

than highly similar sequences.

Table 3 shows the average percentage of improvement of QOMA

over alignments of ClustalW using the improvement formula as

Fig.2.(a) Asparse K-partite graph for three sequences froma windowof size

4 (b) The induced subgraph for the cell [3,4,4] of the dynamic programming

matrix for the K-partite graph in (a).

QOMA of protein sequences

165

given in Section 3.3,the dataset is V1-R1.As the window size

increases,the increase in improvement percentage reduces.This

indicates that QOMA converges to the optimal score at reasonably

small window sizes.In other words,using window size larger than

16 will not improve the SP score signiﬁcantly.

Table 4 shows the average and the SD of the error incurred for

each window due to using the sparse K-partite graph for QOMA.

Tabel 1.The average SP scores of QOMA using complete K-partite graph with D ¼ W/2 on BAliBASE benchmarks and upper bound score (

SS)

Dataset

SS Initialization strategy 1 Initialization strategy 2

Initial W¼2 W¼4 W¼8 W¼16 Initial W¼2 W¼4 W¼8 W¼16

V1-R1-low 565

839 780 637 401 243 797 586 429 273 182

V1-R1-medium

2880 1982 2037 2181 2347 2442 2041 2192 2338 2446 2508

V1-R1-high

5324 4883 4933 5008 5071 5092 4867 4965 5057 5110 5122

(Initialization strategy 1:initial alignments are obtained from ClustalW,Initialization strategy 2:initial alignments are obtained from optimal pairwise alignments as discussed in

Section 3.1.) The best result is shown in bold face.

Table 2.The average SP scores of QOMA (using complete K-partite graph with W ¼ 16) and five other tools on BAliBASE benchmarks

Dataset ClustalW Probcons T-coffee Muscle DCA QOMA

V1-R1-low 808 (839)

1303 (1303) 1499 (1486) 2029 (778) 440 (440) 182 (182)

V1-R1-medium

2156 (1982) 2128 (2068) 1955 (2007) 424 (2161) 2356 (2289) 2582 (2508)

V1-R1-high

4954 (4935) 4924 (4975) 4842 (4920) 3468 (5001) 5007 (5055) 5122 (5172)

V3-R1-low

1233 (1316) 1763 (1763) 2141 (2052) 2617 (1200) 760 (760) 421 (421)

V3-R1-high

2048 (1911) 2008 (2008) 1803 (1902) 263 (2101) 2288 (2288) 2507 (2507)

V3-R2

5984 (5953) 5862 (5892) 5793 (5847) 4564 (6029) 6126 (6153) 6287 (6313)

V3-R3

5838 (6005) 5741 (5976) N/A (5945) 4406 (6088) 5968 (6202) 6110 (6348)

V3-R4

251 (347) 280 (95) N/A (148) 935 (463) 714 (880) 926 (1107)

V3-R5

3899 (3778) 3658 (3658) N/A (3553) 2117 (3935) 4310 (4310) 4446 (4446)

V3-R6

1601 (1554) 1781 (1782) N/A (1859) 1710 (1570) 1335 (1336) 1151 (1152)

V3-R7

6977 (6832) 6555 (6555) N/A (6409) 4663 (6973) 7502 (7502) 8000 (8000)

V3-R8

8432 (8321) 8238 (8238) N/A (8190) 6927 (8494) 8831 (8831) 9248 (9248)

The numbers outside (inside) parenthesis show the SP scores when the tools are run with the same parameters as QOMA (with their default parameters).Some of the tools,namely

T-coffee andClustalW,didnot produce anyalignment for some benchmarks for eachparameter settings.The results of all the tools are ignoredfor suchbenchmarks.‘N/A’ indicates that

thecorrespondingtool failedtoproducealignment for most of thebenchmarks inadataset for that parameter setting.We ignoresuchtools (i.e.,T-coffee) for thosedatasets andparameter

setting.The best result for each dataset and parameter setting is shown in bold.

Table 3.The improvement (see Formula 1 in Section 3.3) of QOMA (using

complete K-partite graph) over ClustalW on the V1-R1 dataset

Length Window Size

2 4 8 16

Short 18.0 29.2 40.2 46.7

Medium 23.3 39.6 51.6 58.6

Long 18.6 39.4 51.5 54.1

The dataset is split into three subsets (short,mediumand long) according to the length of

the sequences.

Table 4.The average (m),SD (s) of the error,S

SP,for a window using

sparse version of QOMA on the V1-R1 dataset

W Error using sparse K-partite graph

d ¼ 0 d ¼ 1 d ¼ 2

m s e m s e m s e

8 1.127 3.820 7 0.358 2.089 1 0.200 1.591 1

16 2.159 5.909

11

0.751 3.249 1 0.433 2.434 1

Results are shown for window sizes W¼8 and 16,and deviation d ¼0,1 and 2.The «

value denotes the 95%confidence interval,i.e 95%of the expected improvement values

are in [m e,m + e] interval.

Fig.3.The SPscores of QOMAalignments usingthe complete andthe sparse

K-partite graphs for different values of d and W on the V1-R1 dataset.

X.Zhang and T.Kahveci

166

The average error decreases as d increases.Most of the reduction

in error is obtained when d increases from 0 to 1.For example,for

W ¼ 16,the error reduction is 1.408 and 0.318 as d increases from

0 to 1 and from 1 to 2,respectively.This implies that the average

improvement in the SP score degrades quickly for d > 1.The 95%

error interval also signiﬁcantly shrinks as d increases to 1.These

results suggest that the sparse K-partite graph algorithmis effective

even for small values of d.

Figure 3 shows the average SP scores of the resulting alignments

using sparse K-partite graph for different values of d and using

complete K-partite graph on the V1-R1 dataset.The complete

K-partite graph algorithm produces the best SP scores.However,

the SP scores of results from the sparse K-partite graph algorithm

are very close to that of the complete K-partite graph algorithm.The

quality of the sparse K-partite graph algorithm improves signiﬁ-

cantly when d increases from0 to 1.The improvement is less when

d increases from1 to 2.This implies when d becomes larger than 1,

it has small impact on the quality of alignment.

Performance evaluation:Our second experiment set evaluates

the running time of QOMA.Table 5 lists the running time of QOMA

for the complete and the sparse K-partite graph algorithms for

varying values of W.Experimental results show that QOMA

runs faster for small W.The sparse K-partite graph algorithm is

faster than the complete K-partite graph algorithm for all values

of d for large W.The running time of QOMA increases as d

increases.The results in this table agree with the time complexity

we computed in Sections 3.3 and 4.Referring to Tables 1,3 and

4,we conclude that when the window size is small,QOMA runs

fast and has high SP score.As the window size increases,its

performance drops but the SP score improves further.Thus,it is

better to increase the window size and use sparse K-partite graph

strategy to obtain high scoring results quickly.In the light of

Figure 3,and Tables 4 and 5 we recommend d ¼ 1 using sparse

K-partite graph strategy.This is because most of the improvement

in the SP score is obtained when d ¼ 1.For d > 1,the score

improves slightly while the running time increases at the same or

higher rate.

6 CONCLUSION AND FUTURE WORK

We considered the problem of multiple alignment with high-SP

score for protein sequences.We developed a graph-based algorithm

called QOMA.QOMA ﬁrst constructs an initial alignment of

multiple sequences.In order to create this initial alignment,we

developed a method based on the optimal alignment between

all pairs of sequences.QOMA represents this alignment using a

K-partite graph.It then improves the SP score of the initial

alignment by iteratively placing a window on it and optimizing

the alignment within this window.QOMAprovides two parameters

to permit ﬂexibility in time and accuracy:(1) The window size.(2)

The sparsity of the K-partite graph.Unlike traditional tools,QOMA

is independent of the order of sequences.

In our experiments QOMA achieved signiﬁcantly higher SP

scores than existing tools,including ClustalW,Probcons,Muscle,

T-Coffee and DCA.The quality improvement increased with the

window size.QOMA had slightly better SP score using the com-

plete K-partite graph algorithm than the sparse K-partite graph

algorithm.However,the sparse K-partite graph algorithm was

much faster when large window size is used.We found a good

balance point between quality of result and running time at

d ¼ 1 using sparse K-partite graph.

Currently,we are working on extending the QOMA algorithmto

align large number of sequences.

Conﬂict of Interest:none declared.

REFERENCES

Alkan,C.et al.(2005) Manipulating multiple sequence alignments via MaM and

WebMaM.Nucleic Acids Res.,33,W295–W298.

Bray,N.and Pachter,L.(2004) MAVID:constrained ancestral alignment of multiple

sequences.Genome Res.,14,693–699.

Do,C.B.et al.(2005) PROBCONS:Probabilistic consistency-based multiple sequence

alignment.Genome Res.,15,330–340.

Eddy,S.R.(1995) Multiple alignment using hidden Markov models.Proc Int Conf Intell

Syst Mol Biol.,3,114–120.

Edgar,R.(2004) MUSCLE:multiple sequence alignment with high accuracy and high

throughput.Nucleic Acids Res.,32,1792–1797.

Gotoh,O.(1982) An improved algorithm for matching biological sequences.

J.Mol.Biol.,162,705–708.

Gotoh,O.(1996) Signiﬁcant improvement in accuracy of multiple protein sequence

alignments by iterative reﬁnement as assessed by reference to structural align-

ments.J.Mol.Biol.,264,823–838.

Grundy,W.N.(1998) Family-based homology detection via pairwise sequence

comparison.In Proceedings of the Second Annual International Conference on

Computational Molecular Biology (RECOMB’98),ACM Press,New York,NY,

pp.94–100.

Gupta,S.et al.(1996) Improving the practical space and time efﬁciency of the shortest-

paths approach to sum-of-pairs multiple sequence alignment.J.Comput.Biol.

Gupta,S.K.et al.(1995) Improving the practical space and time efﬁciency of the

shortest-paths approach to sum-of-pairs multiple sequence alignment.J.Comput.

Biol.,2,459–462.

Hein,J.(1989) A new method that simultaneously aligns and reconstructs ancestral

sequences for any number of homologous sequences,when the phylogeny is given.

Mol.Biol.Evol.,6,649–668.

Table 5.The running time of QOMA (in seconds) using complete K-partite graph and spare graph for different value of d and W on the V1-R1 dataset

Window

Size

Short Medium Long

A B C D A B C D A B C D

W¼2 0.27

0.24 0.35 0.3 0.58 0.61 0.75 0.74 1.11 1.99 2.31 2.27

W¼4

1.76 1.01 1.53 1.78 3.62 2.19 3.2 3.74 8 5.48 7.83 8.93

W¼8

17 6 9 12 37 13 19 26 86 31 43 61

W¼16

232 51 63 81 535 110 132 166 1087 257 306 386

(A:complete K-partite graph.B:sparse K-partite graphwithd¼0.C:sparse K-partite graphwithd¼1.D:sparse K-partite graphwithd¼2.) The dataset is split intothree subsets (short,

mediumand long) according to the length of the sequences in the benchmarks.The running time of the competing methods,in seconds,are as follows.ClustalW:between 0.39 and 2.1,

Probcons:between 0.9 and 8.8,T-coffee:between 3.9 and 24.2,Muscle:between 0.4 and 1.6.

QOMA of protein sequences

167

Henikoff,S.and Henikoff,J.G.(1992) Amino acid substitution matrices from protein

blocks.Proc.Natl Acad.Sci.USA,89,10915–10919.

Huang,X.and Miller,W.(1991) A time-efﬁcient,linear-space local similarity algo-

rithm.Adv.Appl.Math.,12,337–357.

Jones,D.T.(1999) Protein secondary structure prediction based on position-speciﬁc

scoring matrices.J.Mol.Biol.,292,195–202.

Katoh,K.et al.(2002) MAFFT:a novel method for rapid multiple sequence alignment

based on fast Fourier transform.Nucleic Acids Res.,30,3059–3066.

Lee,C.et al.(2002) Multiple sequence alignment using partial order graphs.

Bioinformatics,18,452–464.

Lipman,D.et al.(1989) A tool for multiple sequence alignment.Proc.Natl Acad.Sci.

USA,86,4412–4415.

Morgenstern,B.et al.(1998) DIALIGN:ﬁnding local similarities by multiple sequence

alignment.Bioinformatics,14,290–294.

Needleman,S.B.and Wunsch,C.D.(1970) A general method applicable to the search

for similarities in the amino acid sequence of two proteins.J.Mol.Biol.,48,

443–453.

Notredame,C.et al.(2000) T-coffee:a novel method for fast and accurate multiple

sequence alignment.J.Mol.Biol.,302,205–217.

Phillips,A.et al.(2000) Multiple sequence alignment in phylogenetic analysis.

Mol.Phylogene.Evol.,16,317–330.

Smith,T.F.and Waterman,M.S.(1981) Identiﬁcation of common molecular subse-

quences.J.Mol.Biol.,147,195–197.

Stoye,J.(1998) Multiple sequence alignment with the divide-and-conquer method.

Gene,211,GC45–GC56.

Stoye,J.et al.(1997) DCA:an efﬁcient implementation of the divide-and-conquer

approach to simultaneous multiple sequence alignment.Comput.Appl.Biosci.,

13,625–626.

Sze,S.H.et al.(2005) A polynomial time solvable formulation multiple sequence

alignment.In:Miyano,S.,Mesirov,J.P.,Kasif,S.,Istrail,S.Pevzner,P.A.and

Waterman,M.S.(eds),Proceedings of the 9th Annual International Conference

on Research in Computational Molecular Biology (RECOMB),Lecture Notes in

Computer Science,Springer,pp.204–216.

Thompson,J.et al.(1994) CLUSTAL W:improving the sensitivity of progressive

multiple sequence alignment through sequence weighting,position-speciﬁc gap

penalties and weight matrix choice.Nucleic Acids Res.,22,4673–4680.

Thompson,J.et al.(1999) A comprehensive comparison of multiple sequence align-

ment programs.Nucleic Acids Res.,27,2682–2690.

Van Walle,I.et al.(2004) Align-m—a new algorithmfor multiple alignment of highly

divergent sequences.Bioinformatics,20,1428–1435.

Wang,L.and Jiang,T.(1994) On the complexity of multiple sequence alignment.

J.Comput.Biol.,1,337–348.

Zhang,X.and Kahveci,T.(2006) A new approach for alignment of multiple proteins.

In:Altman,R.B.,Murray,T.,Klein,T.E.,Dunker,A.K.and Hunter,L.(eds),

Proceedings of the Paciﬁc Symposium on Biocomputing (PSB),World Scientiﬁc

pp.339–350,

X.Zhang and T.Kahveci

168

## Σχόλια 0

Συνδεθείτε για να κοινοποιήσετε σχόλιο