1. What is the time complexity of the bottom-up string alignment algorithm?

Nov 25, 2013

OMWOMA VINCENT

P58/76972/2012

ASSIGNMENT III

DESIGN AND ANALYSIS OF ALGORITHMS



1. What is the time complexity of the bottom-up string alignment algorithm?



Complexity of CLUSTALW

CLUSTALW is one of the algorithms that implement the bottom-up method. Working out the complexity of the CLUSTALW algorithm therefore gives the complexity of bottom-up string alignment.

Robert C. Edgar (2004):

It is instructive to consider the complexity of CLUSTALW. This is of intrinsic interest as CLUSTALW is currently the most widely used MSA program and, to the best of our knowledge, its complexity has not previously been stated correctly in the literature. The similarity measure is the fractional identity computed from a global alignment, clustering is done by neighbor-joining. Global alignment of a pair of sequences or profiles is computed using the Myers-Miller linear space algorithm, which is $O(L)$ space and $O(L^2)$ time in the typical sequence length $L$. Given $N$ sequences and thus $N(N-1)/2 = O(N^2)$ pairs, it is therefore $O(N^2 L^2)$ time and $O(N^2 + L)$ space to construct the distance matrix. The neighbor-joining implementation is $O(N^2)$ space and $O(N^4)$ time, at least up to CLUSTALW 1.82, although $O(N^3)$ time is possible. A single iteration of progressive alignment computes a profile of each subtree from its multiple alignment, which is $O(N_P L_P)$ time and space in the number of sequences in the profile $N_P$ and the profile length $L_P$, then uses Myers-Miller to align the profiles in $O(L_P)$ space and $O(L_P^2)$ time. There are $N-1$ internal nodes in a rooted binary tree and hence $O(N)$ iterations. It is often assumed that $L_P$ is $O(L)$, i.e. that $O(0)$ gaps are introduced in each iteration. However, we often observe the alignment length to grow approximately linearly, i.e. that $O(1)$ gaps are added per iteration. It is therefore more realistic to assume that $L_P$ is $O(L + N)$, making one iteration of progressive alignment $O(NL + L^2)$ in both space and time. This analysis is summarized in Table 1.


| Step | O(Space) | O(Time) |
|---|---|---|
| Distance matrix | $N^2 + L$ | $N^2 L^2$ |
| Neighbor joining | $N^2$ | $N^4$ |
| Progressive (one iteration) | $N L_P + L_P = NL + L^2$ | $N L_P + L_P^2 = N^2 + L^2$ |
| Progressive (total) | $NL + L^2$ | $N^3 + NL^2$ |
| TOTAL | $N^2 + L^2$ | $N^4 + L^2$ |
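The time entries in the two progressive rows of the table can be checked by substituting the profile length. The following is only a sketch of that arithmetic, assuming $L_P = O(L + N)$ as in the quoted analysis and assuming that a profile never holds more than $N$ sequences ($N_P \le N$); the intermediate constants are illustrative.

```latex
\begin{align*}
N L_P + L_P^2 &= N(L + N) + (L + N)^2 \\
              &= 2N^2 + 3NL + L^2 \\
              &= O(N^2 + L^2)
                 \quad \text{since } NL \le \max(N, L)^2 \le N^2 + L^2, \\
(N - 1) \cdot O(N^2 + L^2) &= O(N^3 + N L^2).
\end{align*}
```

The last line multiplies the cost of one iteration by the $N - 1$ internal nodes of the guide tree, which gives the "Progressive (total)" time entry.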


2. Create a top-down algorithm for the string alignment problem


In this question I use the LGscore alignment method to demonstrate the top-down algorithm.

Arne Elofsson (2002):

while number of aligned residues > 25:
    Superimpose all residues in the model and in the correct structure.
    Calculate and store the p-value for this superposition.
    Delete the pair of residues that are furthest apart from each other in the model and the correct structure.
return the best p-value.
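The loop above can be sketched in Python as follows. This is a minimal illustration, not the LGscore implementation: the function name `best_p_value_by_pruning` is hypothetical, the superposition routine and the p-value formula are not given in this text and are therefore passed in as caller-supplied functions, and it is assumed that a smaller p-value is better.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def best_p_value_by_pruning(
    model: Sequence[Point],
    correct: Sequence[Point],
    superimpose: Callable[[List[Point], List[Point]], List[Point]],
    p_value: Callable[[List[float]], float],
    min_residues: int = 25,
) -> Optional[float]:
    """Greedy top-down search: start from all residue pairs and prune.

    `superimpose` (hypothetical) returns the model coordinates after
    fitting them onto the correct structure; `p_value` (hypothetical)
    scores a list of per-residue distances. Both are assumptions.
    """
    pairs = list(zip(model, correct))
    best: Optional[float] = None
    while len(pairs) > min_residues:
        model_part = [m for m, _ in pairs]
        correct_part = [c for _, c in pairs]
        # Superimpose the residues that are still aligned.
        fitted = superimpose(model_part, correct_part)
        distances = [math.dist(m, c) for m, c in zip(fitted, correct_part)]
        # Calculate and store the p-value (assumed: smaller is better).
        p = p_value(distances)
        if best is None or p < best:
            best = p
        # Delete the pair of residues that is furthest apart.
        worst = max(range(len(distances)), key=distances.__getitem__)
        del pairs[worst]
    return best
```

Each pass removes exactly one residue pair, so the loop runs at most once per residue above the threshold of 25. The function returns `None` when the input never exceeds that threshold.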



3. What is the time complexity of the algorithm?


The Clustering Method

The clustering method is one of the algorithms that implement the top-down method. Working out the complexity of the clustering algorithm therefore gives the complexity of top-down string alignment.



Kuen-Feng Huang et al. (2001): With the group alignment method introduced, the CMSA (Clustering Multiple Sequence Alignment) algorithm is presented next. The tree-based method uses a technique of "once a gap, always a gap". Our main idea is to reduce the number of gaps in each group. Thus, if we put the two sequences of the longest distance in two distinct groups, we can get a better multiple sequence alignment when the input sets of sequences are very similar. The details of our CMSA are as follows.

Algorithm: CMSA (Clustering Multiple Sequence Alignment)

Input: A set of sequences $S = \{S_1, S_2, \ldots, S_n\}$.

Output: A multiple sequence alignment of $S$.


Step 1: If $|S| \le 1$, then stop.

Step 2: Compute the optimal alignment on each pair of sequences in $S$. Then construct the distance matrix for $S$.

Step 3: Sort all entries in the distance matrix into nonincreasing order.

Step 4: Create a set of sequences $R = S$.

Step 5: In $R$, select a pair of sequences $S_i$ and $S_j$ such that $S_i$ and $S_j$ have the longest distance.

Step 6: Let $G_1 = \{S_i\}$ and $G_2 = \{S_j\}$. $R = R - \{S_i, S_j\}$. Perform the following substeps until $R$ becomes empty.

    Step 6.1: Select $S_k \in R$ such that $\min\{d(S_k, G_1), d(S_k, G_2)\}$ is the minimum.

    Step 6.2: If $d(S_k, G_1) \le d(S_k, G_2)$, then $G_1 = G_1 \cup \{S_k\}$; otherwise $G_2 = G_2 \cup \{S_k\}$.

    Step 6.3: $R = R - \{S_k\}$.

Step 7: Recursively apply this algorithm (Algorithm CMSA) by setting the input $S = G_1$. Recursively apply this algorithm (Algorithm CMSA) by setting the input $S = G_2$.

Step 8: Perform our group alignment method (Algorithm Group Alignment) on $G_1$ and $G_2$.
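The splitting part of the algorithm (Steps 4 to 6) can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the function name `split_into_two_groups` is hypothetical, sequences are referred to by their index in a precomputed distance matrix, ties are broken by the lowest index, and the distance $d(S_k, G)$ between a sequence and a group, which is not defined in this text, is assumed to be the smallest distance to any member of the group.

```python
from typing import List, Tuple


def split_into_two_groups(dist: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Divide sequence indices 0..n-1 into two groups (Steps 4 to 6)."""
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two sequences to split")

    # Step 5: pick the pair of sequences with the longest distance.
    i, j = max(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )

    # Step 6: seed the two groups and remove the seeds from R.
    g1, g2 = [i], [j]
    remaining = set(range(n)) - {i, j}

    def d(k: int, group: List[int]) -> int:
        # ASSUMPTION: sequence-to-group distance is the minimum over members.
        return min(dist[k][member] for member in group)

    while remaining:
        # Step 6.1: the sequence closest to either group goes next.
        k = min(sorted(remaining), key=lambda s: min(d(s, g1), d(s, g2)))
        # Step 6.2: add it to the nearer group (ties go to G1).
        if d(k, g1) <= d(k, g2):
            g1.append(k)
        else:
            g2.append(k)
        # Step 6.3: remove it from R.
        remaining.remove(k)
    return g1, g2
```

For four points on a line at positions 0, 1, 9 and 10, the seeds are the two end points and each inner point joins the end it is closer to.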

Let $L = \max\{|S_1|, |S_2|, \ldots, |S_n|\}$, where $|S_i|$ denotes the length of $S_i$. The complexity of each step is as follows.

Step 2: $O(n^2 L^2)$.
Step 3: $O(n^2 \log n)$.
Step 6: $O(n)$.
Step 8: $O(n^2 L^2)$.

The time required for one recursion is $O(n^2 L^2)$, since $L > n$ in almost all practical cases. Combining with the recursive work in Step 7, we obtain the time complexity of the algorithm as $O(n^2 L^2)$.
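One way to see how the recursive work in Step 7 can stay within the bound of a single recursion is sketched below. It assumes, which this text does not state, that every split divides the $n$ sequences into two groups of roughly equal size; $c$ is a hypothetical constant covering the cost of one recursion.

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + c\,n^2 L^2, \\
T(n) &\le c\,n^2 L^2 \sum_{k \ge 0} 2^k \left(\frac{1}{2^k}\right)^2
      = c\,n^2 L^2 \sum_{k \ge 0} \frac{1}{2^k}
      = 2c\,n^2 L^2 = O(n^2 L^2).
\end{align*}
```

At depth $k$ there are $2^k$ subproblems of $n/2^k$ sequences each, so the cost per level halves and the top level dominates. If the splits are very uneven this geometric decay is lost, and the bound would need a different justification.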

We now explain our algorithm step by step. Let us consider the following five sequences. Suppose that in the score function, = 0, the costs of a match, a mismatch and an indel are 0, 1 and 1, respectively.

$S_1$ = AAGGCCTT
$S_2$ = CGATT
$S_3$ = AGGGAT
$S_4$ = TCGA
$S_5$ = AGGGCTT
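The distance matrix that Step 2 builds for these five sequences can be sketched with a bottom-up dynamic program. This is a minimal illustration, assuming that the distance between two sequences is the cost of their optimal global alignment with match 0, mismatch 1 and indel 1; the function names `alignment_cost` and `distance_matrix` are hypothetical and the paper's own distance definition may differ.

```python
from typing import List


def alignment_cost(a: str, b: str, mismatch: int = 1, indel: int = 1) -> int:
    """Cost of an optimal global alignment, filled row by row (bottom-up).

    Uses O(len(a) * len(b)) time and keeps only two rows, so O(len(b)) space.
    """
    prev = [j * indel for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            delete = prev[j] + indel      # a[i-1] aligned to a gap
            insert = curr[j - 1] + indel  # b[j-1] aligned to a gap
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[-1]


def distance_matrix(seqs: List[str]) -> List[List[int]]:
    """Symmetric matrix of pairwise costs over all n(n-1)/2 pairs."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = alignment_cost(seqs[i], seqs[j])
    return dist


sequences = ["AAGGCCTT", "CGATT", "AGGGAT", "TCGA", "AGGGCTT"]
for row in distance_matrix(sequences):
    print(row)
```

The two nested loops in `alignment_cost` account for the $L^2$ factor and the pair loop in `distance_matrix` for the $n^2$ factor of the Step 2 bound.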

In Step 2, we obtain a distance matrix, as shown. Our job is to divide the five sequences into two groups in Steps 5 and 6. The dividing process is shown in Figure 3. The first sequence put in each group is represented by a gray node. The number associated with each node represents the order that the sequence



References


1. Robert C. Edgar. A multiple sequence alignment method with reduced time and space complexity. 2004.

2. Arne Elofsson. A Study on Protein Sequence Alignment Quality. 2002.

3. Kuen-Feng Huang, Chang-Biau Yang and Kuo-Tsung Tseng. An Efficient Algorithm for Multiple Sequence Alignment. 2001.

4. Van Walle I, Lasters I, Wyns L. Align-m: a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, 2004.

5. Chia Mao Huang and Chang Biau Yang. Approximation algorithms for constructing evolutionary trees. In Proc. of National Computer Symposium, Workshop on Algorithm and Computation Theory, pages A099-A109, 2001.