# Randomized approach to Distance Matrix calculation for Multiple Sequence Alignment

Vishal Thapar (Vishal.Thapar@uconn.edu)

BME 300 - Bioinformatics

Instructor: Prof. Richard Simon

(December 3rd, 2003)

Abstract:

Rigorous alignment of multiple sequences becomes impractical even with a modest number of sequences [1]. A solution to the multiple sequence alignment (MSA) problem is important for biological research. Because of the high time complexity of traditional MSA algorithms, even today's fast computers cannot solve the problem for a large number of sequences. In this paper we evaluate the possibility of using a randomized approach to calculate the distance matrix for a multiple sequence alignment algorithm. To reduce time complexity, we evaluate a small, randomly selected portion of each sequence and compare it with similar portions collected randomly from all other sequences. The initial idea of randomization was taken from [2].

## 1 Introduction

Sequence alignment is one of the fundamental operations performed in computational biology research [3]. It is often necessary to evaluate more than two sequences simultaneously in order to determine the functions, structure and evolution of different organisms. The Human Genome Project uses this technique to map and organize DNA and protein sequences into groups for later use. Significant research has been done in this area because of the need to perform multiple sequence alignment on many sequences of varying length. Algorithms dealing with this problem range from simple comparison and dynamic programming procedures to complex ones that rely on the underlying biological meaning of the sequences to align them more accurately. Since multiple sequence alignment is an NP-hard problem, practical solutions rely on clever heuristics to do the job. There is a constant balancing of accuracy against speed in these algorithms. Accurate algorithms need more processing time and are usually capable of comparing only a small number of sequences, whereas fast and less accurate ones can analyze many sequences in a reasonable amount of time.

The dynamic programming algorithm for sequence alignment was first introduced by Needleman and Wunsch [4]. This algorithm is designed for pairwise sequence alignment. Feng and Doolittle [5] developed an algorithm for multiple sequence alignment using a modified version of [4]. There are more complicated algorithms such as CLUSTAL W [6], which relies on a scoring system that is adjusted based on the local homology of the sequences.

Progressive algorithms suffer from a lack of computational speed because of their iterative approach. Accuracy is also compromised, because the greedy progressive strategy built on pairwise dynamic programming reaches a local minimum of the distance matrix score and not the global minimum. Algorithms that rely heavily on biological information may also be at a disadvantage in some domains. Often it is not necessary to find the most accurate alignment between sequences. In those cases, specialized algorithms such as CLUSTAL W might be overqualified. These algorithms also require some human intervention while they are optimizing results. This intervention has to be done by biologists who are very familiar with the data, and thus the user domain for such an algorithm is limited.

One of the more important uses of MSA is phylogenetic analysis [11]. Phylogenetic trees are the basis for understanding evolutionary relationships between species. To build a phylogenetic tree, orthologous sequences have to be entered into the database, the sequences have to be aligned, pairwise phylogenetic distances have to be calculated, and a hierarchical tree is computed using a clustering algorithm as shown in [8].

There are many algorithms that maximize accuracy and do not concern themselves with speed. Few improvements have successfully reduced CPU time since the proposal of the Feng and Doolittle [5] method [7]. Our approach reduces CPU time by randomizing part of multiple sequence alignment: it calculates the distance matrix for star alignment by randomly selecting small portions of the sequences and aligning them. Since the randomly selected portion of a sequence is significantly shorter than the full sequence, this results in a significant reduction of running time.

## 2 Survey of Literature

In this section we summarize the literature that was surveyed for this paper. We also list some competing algorithms and applications that are in use today.

### 2.1 CLUSTAL W

CLUSTAL W is an improvement on the progressive approach invented by Feng and Doolittle [5]. CLUSTAL W improves the sensitivity of multiple sequence alignment without sacrificing speed and efficiency [6]. Speed and efficiency in this context refer to those of the Feng and Doolittle [5] style of progressive algorithm. It will be shown that our algorithm is faster in theoretical running time than CLUSTAL W. CLUSTAL W differs from conventional algorithms in that it allows genetic information to be included in the distance matrix calculations. In other words, it does not limit the match/mismatch scores to constants but allows them to change based on a number of criteria set by the user [6].

CLUSTAL W takes into account different types of weight matrices at each comparison step, based on the homogeneity of the sequences being compared and their evolutionary distances. It is divided into three stages. (1) A fast approximation algorithm is used to evaluate alignment scores. The idea is that errors made in alignment during this step will be corrected in later stages by more accurate weights. (2) Unrooted trees are calculated using the neighbor-joining method [6]. Each sequence is a branch in this tree. Each sequence gets a weight proportional to its distance from the root. It also gets a proportion of the weight from another sequence with which it shares some similarities. (3) This step is called progressive alignment. Here the guide tree is used to combine sequences into larger and larger pairwise alignments. Sequences are selected from the tips of the tree, working towards the root. At each stage a full dynamic programming algorithm is used to calculate the weight matrix and introduce gaps [6].

Proper weighting is achieved by giving one sequence a weight of 1.0 and the rest less than that. Groups of closely related sequences receive lower weights and thus do not inappropriately "over-influence" the final alignment.

The results of CLUSTAL W are highly accurate. It gives near-optimal results for data sets with more than 35% identical pairs. For divergent sequences it is difficult to find a proper weighting scheme, and the result is not a good alignment.

### 2.2 MSA using Hierarchical Clustering

Hierarchical clustering is a very interesting heuristic for MSA. It is a rather old approach in the fast-changing field of bioinformatics. It uses a technique that appears in bioinformatics but is used mostly in the field of data mining [9, 10]. This approach uses hierarchical clustering along with pairwise alignment to align similar sequences. Hierarchical clustering of the sequences is done using a weight matrix. At each step, groups or clusters of sequences are aligned together into larger clusters until all of them form one group.

Distance matrix calculation is the central theme of this approach. First, a distance matrix is calculated for each possible pairwise alignment of sequences. This step could be carried out using a fast pairwise alignment algorithm such as [2]. The two sequences Si and Sj with the lowest alignment score are chosen from the matrix and aligned with each other in one cluster. The n × n matrix is then replaced with an (n-1) × (n-1) matrix by deleting row j and column j. Also, row i is replaced with the average score of i and j [8]. This process continues until all sequences are aligned and form one cluster.

This algorithm takes $O(N(N-1)M^2)$ time, where N is the number of sequences and M is the length of the sequences when aligned [8]. This solution is not nearly as fast as what we are trying to achieve. Since this algorithm also uses a distance matrix calculation, the algorithm proposed here could reduce its running time as well.
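The matrix reduction step described in this section can be sketched in Java as follows. This is a minimal illustration, not the implementation from [8]: the class and method names are hypothetical, and updating column i together with row i (to keep the matrix symmetric) is an assumption that the text does not spell out.

```java
import java.util.Arrays;

class ClusterMergeSketch {

    /** Returns {i, j} with i < j for the off-diagonal entry with the lowest score. */
    static int[] lowestPair(double[][] m) {
        int bi = 0, bj = 1;
        for (int i = 0; i < m.length; i++) {
            for (int j = i + 1; j < m.length; j++) {
                if (m[i][j] < m[bi][bj]) { bi = i; bj = j; }
            }
        }
        return new int[] {bi, bj};
    }

    /**
     * Merges cluster j into cluster i: row i (and, by assumption, column i)
     * becomes the average of the entries for i and j, then row j and
     * column j are deleted, giving an (n-1) x (n-1) matrix.
     */
    static double[][] merge(double[][] m, int i, int j) {
        int n = m.length;
        double[][] out = new double[n - 1][n - 1];
        int outRow = 0;
        for (int r = 0; r < n; r++) {
            if (r == j) continue;                 // drop row j
            int outCol = 0;
            for (int c = 0; c < n; c++) {
                if (c == j) continue;             // drop column j
                double v = m[r][c];
                if (r == i && c != i) v = (m[i][c] + m[j][c]) / 2.0;
                else if (c == i && r != i) v = (m[r][i] + m[r][j]) / 2.0;
                out[outRow][outCol++] = v;
            }
            outRow++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative scores only; not data from the paper.
        double[][] m = {{0, 2, 6}, {2, 0, 4}, {6, 4, 0}};
        while (m.length > 1) {
            int[] p = lowestPair(m);
            m = merge(m, p[0], p[1]);
            System.out.println(Arrays.deepToString(m));
        }
    }
}
```

Each pass removes one row and one column, so n sequences collapse into a single cluster after n-1 merges.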

### 2.3 MAFFT: Fast Fourier Transform based approach

The fast Fourier transform (FFT) is used to determine homologous regions rapidly. Amino acid sequences are first converted into sequences of volume and polarity values [7]. MAFFT implements two approaches on top of FFT: the progressive method and the iterative refinement method. The correlation between two amino acid sequences is calculated using FFT formulas. A high correlation value indicates that the sequences may have homologous regions [7]. The program also has a sophisticated scoring system for the similarity matrix and gap penalties. Just like CLUSTAL W, this approach uses guide trees and similarity matrices.

From the results presented in [7], we can determine that FFT-based algorithms perform significantly better than the CLUSTAL W and T-COFFEE algorithms. It is important to notice that all these algorithms are still polynomial-time algorithms and thus have similar behavior on a log-scaled graph. The only difference is that the FFT approach has a lower coefficient. Thus, from a complexity point of view, FFT is not significantly better than the other approaches.
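To see why a lower coefficient does not change the shape of the curve, consider a running time of the form $T(n) = c\,n^{k}$. The symbols c and k are illustrative and are not values taken from [7]; the sketch also assumes that both axes of the graph are logarithmic.

```latex
T(n) = c\,n^{k} \;\Longrightarrow\; \log T(n) = \log c + k \log n
```

On log-log axes this is a straight line with slope k and intercept log c. Two algorithms with the same exponent k but different coefficients therefore appear as parallel lines, shifted vertically by the difference of their log c values.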

### 2.4 Other approaches to MSA

There are many other innovative approaches to MSA. Stochastic processes are also used to perform MSA. Simulated annealing and genetic algorithms [11] are classic MSA algorithms based on stochastic processes. In these algorithms, two sequences are randomly aligned and their score is compared with the previous one [11]. If the score is better than the previous one, it is kept, and if not ...

Non-stochastic iterative algorithms are simple to understand. They rely on the logic that even a wrong alignment can be efficiently improved if it is realigned at a later stage. Berger and Munson's algorithm [1] is one such algorithm. It randomly aligns the sequences at first. Then it iteratively tries to find better results and updates the sequences until no further improvement can be achieved. Gotoh has described such an algorithm in [12]. It is a doubly nested iterative strategy with randomization that optimizes the weighted sum-of-pairs score with affine gap penalties [11].

There is also a relatively recent algorithm by Kececioglu, Lenhof, Mehlhorn, Mutzel, Reinert and Vingron [14], which treats the alignment problem as an integer linear program. With the polyhedral approach, variations of a basic problem can often be conveniently modeled through the addition of further constraints to the basic linear program [14]. This algorithm solves the MSA problem to optimality for non-trivial instances of 18 sequences or more.

## 3 Randomized Algorithm

The idea of randomized sampling for local alignment was proposed by Rajasekaran et al. [2]. As with any randomized algorithm, we try to show that instead of evaluating entire sequences of length N, we can achieve the same result by evaluating $N^{\epsilon}$ characters, where $0 < \epsilon < 1$. This procedure has the potential, in theory, to produce results that are significantly close to those of the full computation with an order-of-magnitude reduction in running time.

Creating a distance matrix takes $O(M^2 N^2)$ time, where M is the number of sequences and N is the length of the aligned sequences. This is supported by the fact that the traditional Needleman-Wunsch algorithm [4] requires $O(N^2)$ time to find the alignment score of any two sequences. There are M sequences, so computing all possible pairwise alignments takes $M^2$ alignment operations. Thus the total time taken by a Needleman-Wunsch type algorithm is $O(M^2 N^2)$.

Our heuristic reduces the time spent on pairwise alignment and thereby reduces the overall time of any algorithm that requires distance matrix calculations. It selects a subsequence of length $N^{\epsilon}$ from sequence S, starting at a randomly selected location between $S_1$ and $S_{N - N^{\epsilon}}$. A subsequence of the same length, starting at the same location, is chosen from sequence T. These subsequences are aligned and the score is recorded. Since the length of the subsequences is $N^{\epsilon}$, the time complexity of one pairwise alignment is $O(N^{2\epsilon})$. This results in an overall time of $O(M^2 N^{2\epsilon})$. This is a significant reduction if the resulting distance matrix returns reliable and accurate scores.
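The size of the reduction can be made explicit by comparing the two bounds. The following is a rough ratio of the dominant terms only; constant factors are ignored, so it should be read as an indication and not as a predicted speedup.

```latex
\frac{M^{2} N^{2}}{M^{2} \left(N^{\epsilon}\right)^{2}}
  = \frac{N^{2}}{N^{2\epsilon}}
  = N^{2(1-\epsilon)}, \qquad 0 < \epsilon < 1
```

For example, with $\epsilon = 0.5$ the ratio is $N^{2(1-0.5)} = N$, so the sampled computation is smaller than the full one by a factor equal to the sequence length.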

Algorithm

Input: A file containing DNA or protein sequences separated by newline characters, and a value of ε.

Output: The distance matrix calculated for all of the sequences T1 to Tn, and the total sum of distances for each sequence.

Algorithm:

(1) Read and store all sequences from the input file into an array.

(2) For all sequences T1 to Tn do

a. For all sequences P1 to Pn do

i. Select a random number R that serves as a starting point.

ii. Select $|P_j|^{\epsilon}$ characters from Pj starting at position R of Pj.

iii. Similarly, select the same number of characters from Ti starting at position R of Ti. Steps ii and iii result in two new sequences Pj' and Ti'.

iv. Use the Needleman-Wunsch algorithm to evaluate the pairwise alignment score of Pj' and Ti'.

b. Record the score from step a-iv in matrix M at M(Ti, Pj).

c. Increment j by 1.

(3) At the end of step 2, we have a complete matrix M with distance scores for each combination of sequences. Now sum the alignment scores in row order, where $\mathrm{Sum}_i = \sum_{j=1}^{n} M(T_i, P_j)$.

(4) Select the lowest score from $\mathrm{Sum}_i$ and use it as the center of the star alignment.

(5) Repeat the same process for different values of ε.
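The steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code used for the results in this paper: the class and method names are hypothetical, the sample length $N^{\epsilon}$ is rounded up (the rounding rule is an assumption), all sequences are assumed to be non-empty and of equal length, and the scoring values are the ones quoted in the implementation section. Because these scores are similarity scores, step 4 is interpreted here as choosing the row sum that is least negative (or highest if positive).

```java
import java.util.Arrays;
import java.util.Random;

class RandomizedDistanceMatrixSketch {
    // Scoring values quoted in the implementation section: match=+1, mismatch=-1, gap=-2.
    static final int MATCH = 1, MISMATCH = -1, GAP = -2;

    /** Needleman-Wunsch global alignment score, keeping only two DP rows. */
    static int needlemanWunsch(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j * GAP;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * GAP;
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                curr[j] = Math.max(prev[j - 1] + sub,
                          Math.max(prev[j] + GAP, curr[j - 1] + GAP));
            }
            int[] t = prev; prev = curr; curr = t;   // last filled row is now prev
        }
        return prev[b.length()];
    }

    /** Sample length ceil(n^eps), clamped to [1, n]; the rounding is an assumption. */
    static int sampleLength(int n, double eps) {
        return Math.max(1, Math.min(n, (int) Math.ceil(Math.pow(n, eps))));
    }

    /** Steps 2a-2b: score randomly placed windows for every pair of sequences. */
    static int[][] distanceMatrix(String[] seqs, double eps, Random rng) {
        int n = seqs.length;
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int len = sampleLength(seqs[j].length(), eps);
                int start = rng.nextInt(seqs[j].length() - len + 1); // step i
                String pj = seqs[j].substring(start, start + len);   // step ii
                String ti = seqs[i].substring(start, start + len);   // step iii
                m[i][j] = needlemanWunsch(ti, pj);                   // step iv and b
            }
        }
        return m;
    }

    /** Step 3: row sums of the score matrix. */
    static int[] rowSums(int[][] m) {
        int[] sums = new int[m.length];
        for (int i = 0; i < m.length; i++) sums[i] = Arrays.stream(m[i]).sum();
        return sums;
    }

    /** Step 4 (as interpreted above): index of the largest row sum. */
    static int center(int[] sums) {
        int best = 0;
        for (int i = 1; i < sums.length; i++) if (sums[i] > sums[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        // Illustrative sequences only; not the data from the appendices.
        String[] seqs = {"ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAA", "TTGTACGTCC"};
        for (double eps : new double[] {1.0, 0.7, 0.5}) {   // step 5
            int[] sums = rowSums(distanceMatrix(seqs, eps, new Random(42)));
            System.out.println("eps=" + eps + " sums=" + Arrays.toString(sums)
                    + " center=S" + (center(sums) + 1));
        }
    }
}
```

With ε = 1.0 the window covers the whole sequence, so the sketch reduces to the full Needleman-Wunsch distance matrix and can serve as the baseline for smaller values of ε.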

Analysis

This algorithm is closely related to the Needleman-Wunsch algorithm for pairwise alignment. It requires a value of ε from the user along with an input file containing sequences of the same length. Step 1 reads the input from the input file. Step 2 loops to exhaust all possible combinations of sequences. This step is repeated once for each of the N sequences, where N here denotes the number of sequences. Step 2a also iterates through each of the N sequences. Thus step 2 performs $O(N^2)$ pairwise comparisons. After selecting a random number as a starting position, we select a subsequence from both sequences and align them using Needleman-Wunsch or any other pairwise alignment algorithm. For our purpose, step 2a-iv takes $O(|P_j|^{2\epsilon})$ time. The score is recorded in the appropriate column of the distance matrix. Step 3 sums up all pairwise alignment scores for a given sequence. The sequence whose sum is the least negative, or the highest if the sums are positive, is selected. The running time of the algorithm is $O(N^2 |P_j|^{2\epsilon})$.

## 4 Implementation

In this section we explain the implementation details of this algorithm on the Java platform. The implementation uses a design from NeoBio [15]. The logic of the algorithm is simple and has been designed with future additions in mind. At present the algorithm uses a randomized form of the Needleman-Wunsch algorithm for alignment, but in future it can easily be extended to use any algorithm that can globally align two sequences.

The basic class framework has been adapted from the NeoBio package [15].

The main classes of the implementation are in the package TheMatrix. The classes are:

1. RandomMatrixCalculation.java: This class has the main method, which takes as input the file that contains all the sequences to be aligned. The file can be in FASTA format or it can be just a sequence of characters. The scoring scheme is specified in this class, and the penalties for gap, match and mismatch can be set as desired. We have used the standard convention of gap = -2, match = +1 and mismatch = -1 for our application. These values can be changed easily.

2. BasicScoringScheme.java: This class extends the abstract class ScoringScheme.java. It can be used to set the scoring scheme, and the scoring scheme can be made more sensitive by implementing the methods of the ScoringScheme class in whatever way the user requires. The use of abstract classes gives us the freedom to change the scoring scheme, like the choice of algorithm, dynamically based on user preference.

3. PairwiseAlignmentAlgorithm.java: This is another abstract class. A variable of this type, "algorithm", is used throughout the program, and at run time the object for the user's choice of algorithm (at present Needleman-Wunsch, but more can be added) is attached to it. The methods implemented by any class that extends this class are loadAllsequenceFile(), which loads all the sequences from a file into memory, and computePairwiseAlignmentAll(), which contains the details of aligning all sequences in pairs. The implementations vary depending on which algorithm class extends this class.

4. CharFile.java: This class is used to read the sequences from disk into memory and store them in the desired format. In our case each sequence is stored as a character array, and the arrays are stored in vectors (extendable arrays in Java).

5. IncompatibleScoringSchemeException.java and InvalidScoringMatrixException.java: These have been taken from the NeoBio package [15]. They extend Java's Exception class and are used to display meaningful messages in case of errors.

6. NeedlemanWunsch.java: This is the major class. It extends PairwiseAlignmentAlgorithm and implements the methods described above in its own way. At run time the variable of the abstract type PairwiseAlignmentAlgorithm is assigned an object of the NeedlemanWunsch class. Thus, even though the methods are called on the PairwiseAlignmentAlgorithm type throughout the program, the methods actually executed at run time are those of this class. When we later need to add a new algorithm, we can simply create one class that extends PairwiseAlignmentAlgorithm and implements the same methods in its own way, with no changes to the existing program. This is the basis for a flexible framework. The main methods implemented in this class are:

a. ComputePairwiseAlignmentAll()

b. ComputeScoreBetSeqIAndJ()

The first method reads the sequences one at a time and compares each to all the others by calling ComputeScoreBetSeqIAndJ() in a loop, recording the score of each comparison in the score matrix. The randomization step occurs in the second method, ComputeScoreBetSeqIAndJ(): based on a fixed value of ε between 0.0 and 1.0, the lengths of the two sequences to be compared are reduced, and then, starting from a random point, segments of length "n*ε" are taken from both sequences and compared using the standard Needleman-Wunsch algorithm.

The output is then recorded in a file, "Output.txt", along with the time elapsed for the computation of the matrix.
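The extension mechanism described in this section can be illustrated with a small Java sketch. All names below are hypothetical, simplified stand-ins: this is neither the NeoBio API nor the code of the TheMatrix package. To show how a second algorithm could be plugged in without touching the shared loop, the concrete subclass is a deliberately simple ungapped scorer and not Needleman-Wunsch.

```java
import java.util.Arrays;
import java.util.List;

class FrameworkSketch {

    /** Hypothetical stand-in for the abstract scoring class. */
    static abstract class ScoringSchemeSketch {
        abstract int scoreSubstitution(char a, char b);
        abstract int gapPenalty();
    }

    /** Constant match/mismatch/gap scores. */
    static class BasicScoringSketch extends ScoringSchemeSketch {
        private final int match, mismatch, gap;
        BasicScoringSketch(int match, int mismatch, int gap) {
            this.match = match; this.mismatch = mismatch; this.gap = gap;
        }
        @Override int scoreSubstitution(char a, char b) { return a == b ? match : mismatch; }
        @Override int gapPenalty() { return gap; }
    }

    /** Abstract algorithm: the all-pairs loop is shared, the pair score is pluggable. */
    static abstract class PairwiseAlgorithmSketch {
        protected final ScoringSchemeSketch scoring;
        PairwiseAlgorithmSketch(ScoringSchemeSketch scoring) { this.scoring = scoring; }

        abstract int computeScore(String a, String b);

        int[][] computeAll(List<String> seqs) {
            int n = seqs.size();
            int[][] m = new int[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    m[i][j] = computeScore(seqs.get(i), seqs.get(j));
            return m;
        }
    }

    /** Example plug-in: position-by-position scoring, trailing characters charged as gaps. */
    static class UngappedSketch extends PairwiseAlgorithmSketch {
        UngappedSketch(ScoringSchemeSketch scoring) { super(scoring); }

        @Override int computeScore(String a, String b) {
            int shared = Math.min(a.length(), b.length());
            int score = 0;
            for (int k = 0; k < shared; k++)
                score += scoring.scoreSubstitution(a.charAt(k), b.charAt(k));
            score += (Math.max(a.length(), b.length()) - shared) * scoring.gapPenalty();
            return score;
        }
    }

    public static void main(String[] args) {
        // The variable has the abstract type; the concrete class is chosen at run time.
        PairwiseAlgorithmSketch algorithm =
                new UngappedSketch(new BasicScoringSketch(1, -1, -2));
        int[][] m = algorithm.computeAll(Arrays.asList("ACGT", "ACGA", "AC"));
        System.out.println(Arrays.deepToString(m));
    }
}
```

Swapping in another algorithm then only requires a new subclass that overrides computeScore; the calling code that holds the abstract-typed variable stays unchanged.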

## 5 Results

We compare results from three different input files, given as Appendices A, B and C. For each input file we compare the lowest distance score sum for various values of ε. We also look at the time it took to evaluate the complete alignment (ε = 1.0) as opposed to ε < 1.0.

Table 1 shows the sums of the distance scores for various values of ε.

FIRST RUN: input in Appendix A, N = 9, |Si| = 600

| ε | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | Run Time |
|------|------|------|------|------|-------|-------|-------|-------|------|---------|
| 1.00 | -738 | **-656** | -980 | -914 | -1012 | -1194 | -1076 | -1032 | -976 | 3687 ms |
| 0.90 | -678 | **-592** | -898 | -862 | -913 | -1080 | -976 | -951 | -860 | 3203 ms |
| 0.80 | -635 | **-553** | -796 | -740 | -806 | -968 | -840 | -873 | -785 | 2360 ms |
| 0.70 | -583 | **-494** | -703 | -660 | -721 | -894 | -752 | -797 | -730 | 1953 ms |
| 0.60 | -486 | **-424** | -627 | -576 | -627 | -775 | -676 | -693 | -618 | 1516 ms |
| 0.50 | -362 | **-354** | -489 | -490 | -532 | -608 | -554 | -578 | -525 | 1281 ms |
| 0.40 | -304 | **-287** | -387 | -432 | -452 | -504 | -433 | -459 | -386 | 985 ms |
| 0.30 | -276 | **-223** | -303 | -323 | -302 | -382 | -350 | -367 | -286 | 1157 ms |
| 0.20 | -230 | **-206** | -219 | -225 | -246 | -287 | -304 | -284 | -231 | 609 ms |

The highlighted part of Table 1 shows that, for all values of ε, the lowest sum in magnitude (the least negative score) was consistently that of sequence S2. Even a value as low as ε = 0.2 accurately predicted which sequence has this sum. For ε = 0.2, the run time was only about 1/6th of what it was for ε = 1.0. This gives a rough idea of the magnitude of time that could be saved with the randomized approach.
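The run time ratio quoted here can be checked directly from the run times reported in Table 1 for ε = 0.20 and ε = 1.00; the second line gives the same ratio for ε = 0.60.

```latex
\frac{609\ \text{ms}}{3687\ \text{ms}} \approx 0.165 \approx \frac{1}{6}
\qquad
\frac{1516\ \text{ms}}{3687\ \text{ms}} \approx 0.411
```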

Table 2 shows the sums of the distance scores for the input in Appendix B.

The highlighted entries for this input fall in various columns. This shows the kind of inaccuracy that can arise with the randomized approach. However, the majority of the values of ε have given the right result. It is not safe to take ε to be very low. For ε = 0.60, the right sequence has been picked for the lowest sum. The run time reduction is a little more than ½ in this case.

Table 3 shows the distance matrix values for the input in Appendix C.

The highlighted part for this input is S7 for all values of ε. This shows consistent results across the different values of ε. For ε = 0.6, the run time reduction is more than ½.

## 6 Conclusion

It can be concluded from the implementation of the algorithm presented in this paper that, for ε = 0.6, we obtain a reduction of more than 50% in the running time of the algorithm while maintaining accuracy. The implementation also supports our hypothesis about the improvement that the randomized approach can bring to distance matrix calculation.

As can be expected, for very small values of ε the results lose their accuracy. Hence a proper choice of ε leads to a speedup while maintaining the accuracy of the algorithm.

## 7 Discussion

In this paper we have discussed various methods of multiple sequence alignment. We have also introduced a new approach that randomly samples sequences and aligns the samples to achieve the same result in terms of distance matrix calculation with a significant run time improvement. We have backed up our claims of speedup and accuracy with empirical data and examples. Since most algorithms currently used for MSA perform the distance matrix calculation as an initial step, this time reduction could be of importance.

## 8 Future Work

No significant work has been done in the area of randomized algorithms for MSA. This leaves many opportunities for future work. We plan to make certain critical improvements to our algorithm. First of all, we would like to prove the theoretical complexity of this algorithm and show that it is faster in practice. We would also like to show that randomization gives the same result with very high probability. At this time we have assumed that all sequences are of the same length. We would like to extend our work so that sequences of uneven lengths can also be aligned using the randomized approach. There is a possibility of taking this work further and implementing randomized portions for CLUSTAL W, MAFFT and other popular MSA packages in order to increase their speed. In our opinion, further speedup can be achieved by randomizing not just pairwise alignment but also sequence selection, but this hypothesis still needs further work.

## References

[1] M. P. Berger, P. J. Munson. A novel randomized iterative strategy for aligning multiple protein sequences. Computer Applications in Biosciences, 7(4), 1991, pp. 479-484.

[2] S. Rajasekaran, H. Nick, P. M. Pardalos, S. Sahni, G. Shaw. Efficient algorithms for local alignment search. Journal of Combinatorial Optimization, 5(1), 2001, pp. 117-124.

[3] K. Charter, J. Schaeffer, D. Szafron. Sequence alignment using FastLSA. International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, 2000.

[4] S. Needleman, C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 1970, pp. 443-453.

[5] D. Feng, R. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25, 1987, pp. 351-360.

[6] J. Thompson, D. Higgins, T. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, pp. 4673-4680.

[7] K. Katoh, K. Misawa, K. Kuma, T. Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), pp. 3059-3066.

[8] F. Corpet. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16, pp. 10881-10890. November 1988.

[9] G. Karypis, S. Han, V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Technical report TR-99. University of Minnesota, Minneapolis, 1999.

[10] A. Szymkowiak, J. Larsen, L. Hansen. Hierarchical clustering for datamining. Fifth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, 2001.

[11] C. Notredame. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3(1), 2002.

[12] O. Gotoh. Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Computer Applications in Biosciences, 10(4), 1994, pp. 379-387.

[13] O. Gotoh. Optimal alignment between groups of sequences and its application to multiple sequence alignment. Computer Applications in Biosciences, 9(3), 1993, pp. 361-370.

[14] J. Kececioglu, H. Lenhof, K. Mehlhorn, P. Mutzel, K. Reinert, M. Vingron. A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics, 104, 2000, pp. 143-186.

[15] S. Anibal de Carvalho. http://neobio.sourceforge.net/. Department of Computer Science, King's College, London, UK.

Appendix A

Input 1

S1:AGGCTATACTTAAGTGGTCGTT
ATGGCCGTACACCGACCAGCGAGGAACGCATAACAGCGACCTACAT
AAGTTTGTGGTGCATCAAGCTACCGCTTTGCTGATGGCGGACGAAACGCAATTGTTAGAAAGGGGGCGGCA
CAGTACCGAACACGCGTTTCCACGGTCATATTCAGAGGTGCTGTTTTTCTCGTGTAACGCGGCACCTTCCA
TGTCGCCGTTAGTGCGATGAGACTCCAGACCGTGCCCACACTTTGCTCATCGCGCACCAAGAGGAGAC
CCC
TGTTATCAGGCGTCGCAGTTCCTAGGGGCGCTATCCCACCGTCGCATAACGCCCGACCAAAGGACCACCAA
TCGTTCCGGCGCTGATTTGTCTGGCTCGAGGCGAGTGTCTGATCTGCACTGAGTAGCGGTCCCACTTGGTG
CGCTATTACGGGACGCATGAGCCCTGCGTTTTCTCTCTAATAGTTAGAGAGTATCCTTCTATGCGTCATGC
GAGAGGTTTCGCCTTAGACTAGGTTTTCGAGCTGCCCAGG
GTTCCAGTGTGCTTAAGCCGCCATTTATGGT
TTACTCAAGGGTAAAGGTGATCCCATGATTTGATA

S2:ACTCCCACACCACTACTACTAGCCGTTCTTTGCTGTAGAATTCGAAACACCTTTCAGACTGTACCCTG
CCTGCAACTTATAGGGTGCTCATACCGACTCCTAGCCTGAGTCTGACTTGTCGGAAAAATACTGCGCTCGT
ATGGAAAAGTACACCGAGATGCTGAGCCTGAGTTACAAATCAGGCAG
TTTTTGGGTCTTATTACTAGGCCC
ACGCTATCTTTGAACATATACTTCTCAGATAACGAGATTTATGTGCTAAGCGATACGTGGCTCAATCCCCG
CTAGGATCTGCCACAACACCACGACTGTCACTCCTTATCAATGACACTCAGTTTTCCAAACGCGGCTGTAG
GTGGTTATTGGTTACGAACGCGACGAACTTACTGTCTTACCTATTGTCAAAGGCCTATAATGCCACACTCT
AAAGCGAGCGGACAACTAC
CGTTTAAAGCGAATAATGTACCGACCCAAAAAGAACATTTCCCGGTCCCGTC
AGTAGAGCTGGTCAAGAAGGTAGTCTGAATAACTCACGGAGGTATCTTTAGGCTAGGAGCTGAACAAACTT
CAGAAATATAACGCCCCGCCGCCTGCACATGCGCA

S3:TGCTCTCAGTCTTTGTGTCGGCGTCTGAGTACCGTTGAGCGATCCGACAGTGGGGCCAGCCTGCGGAC
CGTCACGAACGTCGTTACCTTGATGC
GCATAGTTGCCGTTCTCGCCGAGGCTGGGTGTCCAAGGTGGTCTT
TAGCGCCTGCTTTTCAAAGGTAGTAACCTGGTATAATCTGGGGCGATAGTGTCGCCAGTTCAAGGCGTTCA
ACGAGTCGCGCACCTGCTATTACACTGGGAGTAACTATTCAATCAAGTATGAGGCTCAGAACCACAGGTAT
TATTGATGATAAGCCAGACCTTCGAGGATCGTCTCTAGCACATGATCGTTTGATAGAAAGTGTGCAGCT
GG
TGAAGTTTTTAACATCCCGTGAGGACGTACACTGGCCTCTCTTGTGCCGGGCGTTAAACAATACCTTAAAG
CATGCCACAATCGTACCGGGCATAGGATGCTGATTTATGCCTTCATAAAGGGACTCGGCCACGTTGTAAGG
TGTGAATGCTAGATCTACCACGAAAGGGCCTGTTAGCACACATGCCGCCCTTGTCGCTAAAGGTTTTATAA
TACGCGTACGCTCATGCCCCCGAAAGAAGACCATGAGTTGA
CATTCGCTCATAATACAGGTCAGGCATAGG
TGGAGCTCGTGGATTTCTTATCGTTACAAACCATCGCAGAGCACCGTTCGATATACAATAGAGCTTCGGGC
ACTACGCCTACGCGGGTGATTAGGAACCCGTTACAAGGCAAGGACTCAATGGTGTCCCGGAATTTACGCCA
ACAACGGTTGTGAAGGGGATGCGGCGGACTATTGTTTAATGTGGTTGGATCCCACCGTGTGCAATCAGCCT
AGGGGAAACGCAG
GAGTCAGAGGCAGTTGGAGTCAGATTGTGCATTAATCAGTTCGTAAGCCTTCCACGGA
GAGTAATCACAACGTCTCGGACAGAAGCTCCCTAGACGACTAGCTGAAAGTGCCCCCAAAGTGCTATGGCA
TCAATCCCT

S4:GCCTATTCGGATGTACTCTCTCCGCCCAGAAGTGAAGGAGTCAGATAGGTCCTTGCTATAACAGCCGC
AACACTCATCGTGCCGGCAGCCTAGCAGTTACCTGGATCCCAGATC
TACCTTACCATTTCAGGCTAAATTT
AGGCTCGGGTACAAAAAACATCGCCGGGCTTCAACCTTGCCGCCCTTAACACACGGTGTGACTTTATACAG
GGAGATGGAGCATGGGCTGGCCTAGTGGGGTGTGGCGCTAATTTCCTCGCTAATGCTATGCGGAGCCCTGA
AAGCTGACTGGAGGAGGCCGAGCCGACAATGTCTCGTGAGTGGCATTGCGTTTAAGGAAGACTTTTGTCCG
ATCTACACCTTCCTCGAG
TCTCCGCAGGGTTGTGCATAGTGGCTGTAGACAGAATCCAGCTGACAGGTCTG
CATTTAGAAATAGCTTAGCGTCCGCCGGACCACTGTCAACTTTACTGTGGCTCTCGTCTGCTGACTTTGAT
TATCTGAATGTGAGTCTCAGTAACTGACCTGGGCGTCTTCGGCGAAGGATCAATGAACGAATCAAAGAGGT
GAAGGGGCTTTCCTGCTAAGACCGTGCATCAGTACTAGCCGGTCGAGTCCTTTGCACGTCC
GCCGCAGCCG
TACAGTCGATTGATATAGTCTACCCTCGATCCTTTAGCAAGTGCATATGCAGCCGACCAACCTTGCGGCAT
ACTCCAATCAACACTACCCAGATCCTAAGGTGACGGTTTCAGAGGATATACGAAGCGTATTGCACCGCGTA
TGTATTTAAGAACGGTGGGTGTTATGTCAGACGCGTCCGGTTTTAACCCTTTATACAAATCGTCTCGACAC
ACTACATCAATATATTACATGAAGGTGCATCAC
AGCCGGTCCACACCGGTT

S5:TCGGCTGTATTGGCGACCCAGGCGTGGGCTTAATGAATCAGAGACTCTGCAGCCAGGGAGTATGTATA
GCAGTTCTTTAAACGGTCTGCGACGAGGAAGGTTTCGAGTGTGCAACGTGAGGCTATCGTAAAAGTGTTTC
AACAGATGGGGGGCTATGAGCCGCTCGAACGTTACACACTGCACGCGGGGTCGACTAATGGAAGCTAACCT
AAGCTAATTGCCCTATTCGTGAAG
AAACATCTAATTCCTTCCTTGTATGTGTTCTCCCTACAGCACATATC
GACAATAGGTTTTAGTGCTTTACCACAAGTAGCAAGTACAACTTGAATTGGGTAAGACTTGCACTTCATGT
ATTTGAAATCGCTATCCCACGACTTGGTGTCAACCCCCGGCTCTTTATCACCTTGCATACCCAGCGGCATC
AAGTGACCGACATATGATCTGGTAGTAGTTCAACCCTGAAGACTATCTTTAGCTCAGCGCGTTAAGT
CCTT
ATACACTCTAGCGAGTGGGAAGGATGGATCGGCCGGACATCGTACGTAATTTAGAACCCAGTACCGAGACG
CGTTCGACAGTCCTAAGGCTCCATCAGAGTAGCTTACTACGTCACGAGTCAGGTAAAGCCGAGAGCGTCCG
ATCCATCCTTGGTGGATCAGCGTTCTCTGTTGTTGAACGCGAGGTAAACGTTGGTAACTTTTTCAACAGCA
GTAGAGTAGCGTGTAGTTACTCGGAGATCGACGTAACTG
CGCGCCCTGCAACACTAAGCGCTGCGCTGTCT
GCTGCGCAGACTCTATGAGAGTCGCTCGTCTCCGTCTGCTTAGGGGGCGTTAGCACACTAATCACGGCTCA
AATATGTTAAAGAAGGAGCCCCATTTCCGTGACGTCAGTACGAGCAATTTACGATGGCAAAGAGAGCAAGA
CCTTCGCGCAGGGTACGGACCTGACAGCATGGGTTATCAAGGCCCTTTCCAGGTAATAAATTTCAGATTTA
GTACTTATCAT
GTAGATAAGTTGGAAACCTTGA

S6:GAAGACTCAGGGAGAGAAATTTTTCTTGATTCATTCTGCAGATTGGCTTACTACACATGCTCTTTTCC
ATGAAGTTGCAAAATTGGATGTGGTGAAATTATTATACAATGAGCAGTTTGCTGTTCAAGGGTTGTTGAGA
TACCATACATATGCAAGATTTGGCATTGAAATTCAAGTTCAGATAAACCCTACACCTTTCCAACAGGGGGG
ATTGATCTGTGCTATGGTTC
CTGGTGACCAGAGCTATGGTTCTATAGCATCATTGACTGTTTATCCTCATG
GTTTGTTAAATTGCAATATTAACAATGTGGTTAGAATAAAGGTTCCATTTATTTACACAAGAGGTGCTTAC
CACTTTAAAGATCCACAATACCCAGTTTGGGAATTGACAATTAGAGTTTGGTCAGAATTAAATATTGGGAC
AGGAACTTCAGCTTATACTTCACTCAATGTTTTAGCTAGATTTACAGATTTGGAGTTGCATGG
ATTAACTC
CTCTTTCTACACAAATGATGAGAAATGAATTTAGGGTCAGTACTACTGAGAATGTGGTGAATCTGTCAAAT
TATGAAGATGCAAGAGCAAAGATGTCTTTTGCTTTGGATCAGGAAGATTGGAAATCTGATCCGTCCCAGGG
TGGTGGGATCAAAATTACTCATTTTACTACTTGGACATCTATTCCAACTTTGGCTGCTCAGTTTCCATTTA
ATGCTTCAGACTCAGTTGGTCAACAAATTAAAGTT
ATTCCAGTTGACCCATATTTTTTCCAAATGACAAAT
ACGAATCCTGACCAAAAATGTATAACTGCTTTGGCTTCTATTTGTCAGATGTTTTGTTTTTGGAGAGGAGA
TCTTGTCTTTGATTTTCAAGTTTTTCCCACCAAATATCATTCAGGTAGATTACTGTTTTGTTTTGTTCCTG
GCAATGAGCTAATAGATGTTTCTGGAATCACATTAAAGCAAGCAACTACTGCTCCTTGTGCAGTAATGGAT
ATTACAG
GAGTGCAGTCAAC

S7:CAGTGGCGATGACCCTGGAAAAGAATATGCCGATCGGTTCGGGCTTAGGCTCCAGTGCCTGTTCGGTG
GTCGCGGCGCTGATGGCGATGAATGAACACTGCGGCAAGCCGCTTAATGACACTCGTTTGCTGGCTTTGAT
GGGCGAGCTGGAAGGCCGTATCTCCGGCAGCATTCATTACGACAACGTGGCACCGTGTTTTCTCGGTGGTA
TGCAGTTGATGATCGAAGAAAACGACATC
ATCAGCCAGCAAGTGCCAGGGTTTGATGAGTGGCTGTGGGTG
CTGGCGTATCCGGGGATTAAAGTCTCGACGGCAGAAGCCAGGGCTATTTTACCGGCGCAGTATCGCCGCCA
GGATTGCATTGCGCACGGGCGACATCTGGCAGGCTTCATTCACGCCTGCTATTCCCGTCAGCCTGAGCTTG
CCGCGAAGCTGATGAAAGATGTTATCGCTGAACCCTACCGTGAACGGTTACTGCCAGGCTTCCGGCAGGCG
C
GGCAGGCGGTCGCGGAAATCGGCGCGGTAGCGAGCGGTATCTCCGGCTCCGGCCCGACCTTGTTCGCTCT
GTGTGACAAGCCGGAAACCGCCCAGCGCGTTGCCGACTGGTTGGGTAAGAACTACCTGCAAAATCAGGAAG
GTTTTGTTCATATTTGCCGGCTGGATACGGCGGGCGCACGAGTACTGGAAAACTAAATGAAACTCTACAAT
CTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAAC
CCAGGGGTTGGGCAAAAATCAGGGGCT
GTTTTTTCCGCACGACCTGCCGGAATTCAGCCTGACTGAAATTGATGAGATGCTGAAGCTGGATTTTGTCA
CCCGCAGTGCGAAGATCCTCTCGGCGTTTATTGGTGATGAAATCCCACAGGAAATCCTGGAAGAGCGCGTG
CGCGCGGCGTTTGCCTTCCCGGCTCCGGTCGCCAATGTTGAAAGCGATGTCGGTTGTCTGGAATTGTTCCA
CGGGCCAACGCTGGCA
TTTAAAGATTTCGGCGG

S8:AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCA
GCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCA
ATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCA
CAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGC
CCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGT
TCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGG
CAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGA
AAA
AACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG
GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTG
CCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTA
TTAGAAGCGCGCGGTCACAACGTTACTGTTA
TCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACC
CGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGA
AAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTAC
GCGCCGATTGTT
GCGAGATTTGGACGGACGTTG

S9:ACCCATAACGGGCAATGATAAAAGGAGTAACCTGTGAAAAAGATGCAATCTATCGTACTCGCACTTTC
CCTGGTTCTGGTCGCTCCCATGGCAGCAGAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGA
TAGGCGATCGTGATAATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAA
CATTATGAATGGCGAGGCAAT
CGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCATAAGAAAGC
TCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAAATGACAAATGCCGGGTAACAAT
CCGGCATTCAGCGCCTGATGCGACGCTGGCGCGTCTTATCAGGCCTACGTTAATTCTGCAATATATTGAAT
CTGCATGCTTTTGTAGGCAGGATAAGGCGTTCACGCCGCATCCGGCATTGACTGCAAACTTAAC
GCTGCTC
GTAGCGTTTAAACACCAGTTCGCCATTGCTGGAGGAATCTTCATCAAAGAAGTAACCTTCGCTATTAAAAC
CAGTCAGTTGCTCTGGTTTGGTCAGCCGATTTTCAATAATGAAACGACTCATCAGACCGCGTGCTTTCTTA
GCGTAGAAGCTGATGATCTTAAATTTGCCGTTCTTCTCATCGAGGAACACCGGCTTGATAATCTCGGCATT
CAATTTCTTCGGCTTCACCGATTTAAAATACTCATC
TGACGCCAGATTAATCACCACATTATCGCCTTGTG
CTGCGAGCGCCTCGTTCAGCTTGTTGGTGATGATATCTCCCCAGAATTGATACAGATCTTTCCCTCGGGCA
TTCTCAAGACGGATCCCCATTTCCAGACGATAAGGCTGCATTAAATCGAGCGGGCGGAGTACGCCATACAA
GCCGGAAAGCATTCGCAAATGCTGTTGGGCAAAATCGAAATCGTCTTCGCTGAAGGTTTCGGCCTGCAAGC
CGGTGTAG
ACATCACCTTTAAACGCCAGAATCG

Appendix B

Input 2

Appendix C

Input 3