V. Thapar
BME 300

Bioinformatics
1
Randomized Approach to Distance Matrix Calculation for Multiple Sequence Alignment
Vishal Thapar (Vishal.Thapar@uconn.edu)
BME 300 - Bioinformatics
Instructor: Prof. Richard Simon
(December 3rd, 2003)
Abstract: Rigorous alignment of multiple sequences becomes impractical even with a modest number of sequences [1]. A solution to the multiple sequence alignment (MSA) problem is important for biological research. Because of the high time complexity of traditional MSA algorithms, even today's fast computers cannot solve the problem for a large number of sequences. In this paper we evaluate the possibility of using a randomized approach to calculate the distance matrix for a multiple sequence alignment algorithm. To reduce time complexity, we evaluate a small, randomly selected portion of a sequence and compare it with similar portions collected randomly from all other sequences. The initial idea of randomization was taken from [2].
1 Introduction
Sequence alignment is one of the fundamental operations in computational biology research [3]. It is often necessary to evaluate more than two sequences simultaneously in order to determine the function, structure and evolution of different organisms. The Human Genome Project uses this technique to map and organize DNA and protein sequences into groups for later use. Significant research has been done in this area because of the need to align many sequences of varying length. Algorithms for this problem range from simple comparison and dynamic programming procedures to complex ones that rely on the underlying biological meaning of the sequences to align them more accurately. Since multiple sequence alignment is an NP-hard problem, practical solutions rely on clever heuristics. These algorithms constantly balance accuracy against speed: accurate algorithms need more processing time and can usually compare only a small number of sequences, whereas fast, less accurate ones can analyze many sequences in a reasonable amount of time.
The dynamic programming algorithm for sequence alignment was first introduced by Needleman and Wunsch [4]. It is designed for pairwise sequence alignment. Feng and Doolittle [5] developed an algorithm for multiple sequence alignment using a modified version of [4]. There are more complicated algorithms such as CLUSTAL W [6], which relies on a scoring system that is adjusted based on the local homology of the sequences.
Progressive algorithms suffer from a lack of computational speed because of their iterative approach. Accuracy is also compromised: the progressive strategy is greedy, so pairwise dynamic programming decisions made early are never revisited, and the result can settle at a local rather than a global minimum of the distance score. Algorithms that rely significantly on biological information may also be at a disadvantage in some domains. Often it is not necessary to find the most accurate alignment between sequences, and in those cases specialized algorithms such as CLUSTAL W may be more than is needed. These algorithms also require some human intervention while they optimize results. This intervention has to be done by biologists who are very familiar with the data, which limits the user base of such an algorithm.
One of the more important uses of MSA is phylogenetic analysis [11]. Phylogenetic trees are the basis for understanding evolutionary relationships between species. To build a phylogenetic tree, orthologous sequences have to be entered into the database, the sequences have to be aligned, pairwise phylogenetic distances have to be calculated, and a hierarchical tree has to be built using a clustering algorithm, as shown in [8].
Many algorithms maximize accuracy without concern for speed. Few successful improvements in CPU time have been made since the Feng and Doolittle [5] method was proposed [7]. Our approach reduces CPU time by randomizing part of multiple sequence alignment: it calculates the distance matrix for star alignment by randomly selecting small portions of the sequences and aligning them. Since the randomly selected portion is significantly shorter than the full sequence, the running time is reduced significantly.
2 Survey of Literature
This section surveys the literature relevant to this paper and lists some competing algorithms and applications that are in use today.
2.1 CLUSTAL W
CLUSTAL W is an improvement on the progressive approach introduced by Feng and Doolittle [5]. It improves the sensitivity of multiple sequence alignment without sacrificing speed and efficiency [6], where speed and efficiency are measured relative to the Feng and Doolittle [5] style of progressive algorithm. It will be shown that our algorithm is faster in theoretical running time than CLUSTAL W. CLUSTAL W differs from conventional algorithms in that it allows genetic information to be included in the distance matrix calculations. In other words, it does not limit the match/mismatch scores to constants but allows them to change based on a number of criteria set by the user [6].
CLUSTAL W takes into account different weight matrices at each comparison step, based on the homogeneity of the sequences being compared and their evolutionary distances. It is divided into three stages. (1) A fast approximation algorithm is used to evaluate alignment scores. The idea is that errors made in alignment during this stage will be corrected in later stages by more accurate weights. (2) An unrooted tree is calculated using the neighbor-joining method [6]. Each sequence is a branch in this tree and gets a weight proportional to its distance from the root. It also receives a share of the weight of any other sequence with which it has some similarity. (3) In the final stage, called progressive alignment, the guide tree is used to combine sequences into larger and larger alignments. Sequences are selected from the tips of the tree, moving towards the root. At each stage a full dynamic programming algorithm is used to calculate the weight matrix and introduce gaps [6].
Proper weighting is achieved by giving one sequence a weight of 1.0 and the rest weights below that. Groups of closely related sequences receive lower weights and thus do not over-influence the final alignment.
The results of CLUSTAL W are highly accurate. It gives near-optimal results for a data set with more than 35% identical pairs. For divergent sequences it is difficult to find a proper weighting scheme, and the resulting alignment is correspondingly poor.
2.2 MSA using Hierarchical Clustering
Hierarchical clustering is an interesting heuristic for MSA and a rather old approach in the fast-changing field of bioinformatics. The technique is used in bioinformatics but most often in data mining [9, 10]. It combines hierarchical clustering with pairwise alignment to align similar sequences. The clustering is driven by a weight matrix: at each step, groups or clusters of sequences are aligned together into larger clusters until all of them form one group.
Distance matrix calculation is central to this approach. First, the distance matrix is calculated for each possible pairwise alignment of sequences; this could be done with a fast pairwise alignment algorithm such as [2]. The two sequences Si and Sj with the lowest alignment score are chosen from the matrix and aligned with each other in one cluster. The n x n matrix is then replaced with an (n-1) x (n-1) matrix by deleting row j and column j, and row i is replaced with the average score of i and j [8]. This process continues until all sequences are aligned and form one cluster.
This algorithm takes $O(N(N-1)M^2)$ time, where N is the number of sequences and M is the length of the aligned sequences [8]. This is not nearly as fast as what we are trying to achieve. Since this algorithm also uses a distance matrix calculation, the algorithm proposed here could reduce its running time further.
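The matrix reduction described above can be sketched in Java as follows. This is a minimal illustration under assumptions, not the implementation from [8]: the class and method names are hypothetical, the matrix is assumed symmetric with at least two rows, and column i is averaged along with row i so that the reduced matrix stays symmetric.

```java
/** One merge step of distance-matrix clustering (illustrative sketch). */
final class ClusterMergeStep {
    private ClusterMergeStep() {}

    /** Returns {i, j} with i < j for the smallest off-diagonal entry of d. */
    static int[] closestPair(double[][] d) {
        int bestI = 0, bestJ = 1;                 // assumes d has at least 2 rows
        for (int i = 0; i < d.length; i++) {
            for (int j = i + 1; j < d.length; j++) {
                if (d[i][j] < d[bestI][bestJ]) {
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        return new int[] {bestI, bestJ};
    }

    /**
     * Merges cluster j into cluster i: row and column i take the average of
     * the old rows/columns i and j, and row and column j are deleted, so an
     * n x n matrix becomes (n-1) x (n-1).
     */
    static double[][] merge(double[][] d, int i, int j) {
        int n = d.length;
        double[][] out = new double[n - 1][n - 1];
        int outRow = 0;
        for (int r = 0; r < n; r++) {
            if (r == j) continue;                 // drop row j
            int outCol = 0;
            for (int c = 0; c < n; c++) {
                if (c == j) continue;             // drop column j
                double v;
                if (r == i && c == i) {
                    v = d[i][i];                  // self-distance is unchanged
                } else if (r == i) {
                    v = (d[i][c] + d[j][c]) / 2.0;
                } else if (c == i) {
                    v = (d[r][i] + d[r][j]) / 2.0;
                } else {
                    v = d[r][c];
                }
                out[outRow][outCol++] = v;
            }
            outRow++;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] d = {{0, 2, 6}, {2, 0, 4}, {6, 4, 0}};
        int[] p = closestPair(d);                 // the pair (0, 1) has distance 2
        double[][] merged = merge(d, p[0], p[1]);
        System.out.println(java.util.Arrays.deepToString(merged));
        // prints [[0.0, 5.0], [5.0, 0.0]]
    }
}
```

Repeating closestPair and merge until a single row remains gives the full clustering order; the pairwise alignment of the merged clusters themselves is left out of this sketch.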
2.3 MAFFT: Fast Fourier Transform based approach
MAFFT uses the fast Fourier transform (FFT) to identify homologous regions rapidly. Amino acid sequences are first converted into sequences of volume and polarity values [7]. MAFFT implements two FFT-based approaches, a progressive method and an iterative refinement method. The correlation between two amino acid sequences is calculated using the FFT; a high correlation value indicates that the sequences may have homologous regions [7]. The program also has a sophisticated scoring system for the similarity matrix and gap penalties. Like CLUSTAL W, it uses guide trees and similarity matrices.
The results presented in [7] indicate that FFT-based algorithms perform significantly better than the CLUSTAL W and T-COFFEE algorithms. It is important to note that all these algorithms are still polynomial-time algorithms and thus behave similarly on a log-scaled graph; the only difference is that the FFT approach has a lower coefficient. Thus, from a complexity point of view, the FFT approach is not significantly better than the others.
2.4 Other approaches to MSA
There are many other innovative approaches to MSA. Some use stochastic processes: simulated annealing and genetic algorithms [11] are the classic stochastic MSA algorithms. In these algorithms, sequences are aligned randomly and the resulting score is compared with the previous one [11]. If the score is better than the previous one, the new alignment is kept; if not, it is discarded.
Non-stochastic iterative algorithms are simple to understand. They rely on the idea that even a wrong alignment can be improved efficiently if it is realigned at a later stage. Berger and Munson's algorithm [1] is one such algorithm. It first aligns the sequences randomly, then iteratively searches for better results and updates the alignment until no further improvement can be achieved. Gotoh has described such an algorithm in [12]: a doubly nested iterative strategy with randomization that optimizes the weighted sum-of-pairs score with affine gap penalties [11].
There is also a relatively recent algorithm by Kececioglu, Lenhof, Mehlhorn, Mutzel, Reinert and Vingron [14], which treats the alignment problem as an integer linear program. With the polyhedral approach, variations of a basic problem can often be modeled conveniently by adding further constraints to the basic linear program [14]. This algorithm solves the MSA problem to optimality for non-trivial instances of 18 sequences or more.
3 Randomized Algorithm
The idea of randomized sampling for local alignment was proposed by Rajasekaran et al. [2]. As with other randomized algorithms, we try to show that instead of evaluating entire sequences of length N, we can achieve the same result by evaluating $N^{\epsilon}$ characters, where $0 < \epsilon < 1$. In theory, this procedure has the potential to produce results that are close to those of the full computation while reducing the running time by an order of magnitude.
Traditional algorithms take $O(M^2 N^2)$ time to create a distance matrix, where M is the number of sequences and N is the length of the aligned sequences. The traditional Needleman-Wunsch algorithm [4] requires $O(N^2)$ time to find the alignment score of any two sequences. There are M sequences, so computing all possible pairwise alignments takes on the order of $M^2$ alignments. Thus, the total time taken by a Needleman-Wunsch type algorithm is $O(M^2 N^2)$.
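To make the $O(N^2)$ cost of one pairwise comparison concrete, the following is a minimal Java sketch of a Needleman-Wunsch score computation. It is an illustration under assumptions rather than the code used for this paper: the class name is hypothetical, the gap penalty is linear, and only the score (not the alignment itself) is computed, which allows two rows of the table to be kept instead of the whole table.

```java
/** Minimal Needleman-Wunsch global alignment score (linear gap penalty). */
final class NeedlemanWunschScore {
    private NeedlemanWunschScore() {}

    /**
     * Returns the optimal global alignment score of a and b.
     * The (|a|+1) x (|b|+1) table is filled row by row, so the work is
     * proportional to |a| * |b|; only two rows are held in memory.
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * gap;                    // prefix of b against gaps only
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * gap;                    // prefix of a against gaps only
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diag = prev[j - 1] + sub;     // align a[i-1] with b[j-1]
                int up = prev[j] + gap;           // gap in b
                int left = curr[j - 1] + gap;     // gap in a
                curr[j] = Math.max(diag, Math.max(up, left));
            }
            int[] tmp = prev;                     // reuse the two row buffers
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Identical sequences score one match per character: prints 7.
        System.out.println(score("GATTACA", "GATTACA", 1, -1, -2));
        System.out.println(score("GATTACA", "GCATGCT", 1, -1, -2));
    }
}
```

Calling score once for every ordered pair of the M sequences performs $M^2$ table fills of $N^2$ cells each, which is the $O(M^2 N^2)$ total stated above.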
Our heuristic reduces the time spent on each pairwise alignment and, in effect, the overall time of any algorithm that requires distance matrix calculations. It selects a substring of length $N^{\epsilon}$ from sequence S, starting at a randomly selected location between $S_1$ and $S_{N - N^{\epsilon}}$. A substring of the same length, starting at the same location, is chosen from sequence T. These substrings are aligned and the score is recorded. Since the length of the substrings is $N^{\epsilon}$, the time to find a pairwise alignment is $O(N^{2\epsilon})$, which gives an overall time of $O(M^2 N^{2\epsilon})$. This is a significant reduction, provided the resulting distance matrix returns reliable and accurate scores.
Algorithm
Input: A file containing DNA or protein sequences separated by newline characters, and a value of $\epsilon$.
Output: The distance matrix calculated for all of the sequences T1 to Tn, and the total sum of distances for each sequence.
Algorithm:
(1) Read and store all sequences from the input file into an array.
(2) For all sequences T1 to Tn do
    a. For all sequences P1 to Pn do
        i. Select a random number R that serves as the starting point.
        ii. Select $|P_j|^{\epsilon}$ characters from Pj starting at position R of Pj.
        iii. Similarly, select the same number of characters from Ti starting at position R of Ti. Steps ii and iii result in two new sequences Pj' and Ti'.
        iv. Use the Needleman-Wunsch algorithm to evaluate the pairwise alignment score of Pj' and Ti'.
    b. Record the score from step a-iv in matrix M at M(Ti, Pj).
    c. Increment j by 1.
(3) At the end of step 2, we have a complete matrix M with distance scores for each combination of sequences. Now sum the alignment scores in row order, where $Sum_i = \sum_{j=1}^{n} M(T_i, P_j)$.
(4) Select the lowest score from $Sum_i$ and use the corresponding sequence as the center of the star alignment.
(5) Repeat the same process for different values of $\epsilon$.
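Steps (2) to (4) can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation described in Section 4: the class and method names are hypothetical, the scores (match +1, mismatch -1, gap -2) are assumed, the sample length is taken as $N^{\epsilon}$ rounded up, all sequences are assumed non-empty, and "lowest" in step (4) is read as the sum that is least negative (equivalently the highest sum), following the wording of the Analysis section.

```java
/** Sketch of the randomized distance-matrix procedure, steps (2) to (4). */
final class RandomizedDistanceMatrix {
    // Assumed scoring constants; the algorithm steps do not fix them.
    static final int MATCH = 1, MISMATCH = -1, GAP = -2;

    private RandomizedDistanceMatrix() {}

    /** Sample length n^eps, rounded up and clamped to the range [1, n]. */
    static int sampleLength(int n, double eps) {
        int len = (int) Math.ceil(Math.pow(n, eps));
        return Math.max(1, Math.min(n, len));
    }

    /** Needleman-Wunsch global alignment score with a linear gap penalty. */
    static int nwScore(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j * GAP;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * GAP;
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                curr[j] = Math.max(prev[j - 1] + sub,
                        Math.max(prev[j] + GAP, curr[j - 1] + GAP));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Step (2): score one random window of every ordered pair (Ti, Pj). */
    static int[][] build(String[] seqs, double eps, java.util.Random rng) {
        int n = seqs.length;
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int shorter = Math.min(seqs[i].length(), seqs[j].length());
                int len = sampleLength(shorter, eps);
                int r = rng.nextInt(shorter - len + 1);   // random start R
                String ti = seqs[i].substring(r, r + len);
                String pj = seqs[j].substring(r, r + len);
                m[i][j] = nwScore(ti, pj);
            }
        }
        return m;
    }

    /** Step (3): Sum_i is the sum of row i of the matrix. */
    static int[] rowSums(int[][] m) {
        int[] sums = new int[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int v : m[i]) sums[i] += v;
        }
        return sums;
    }

    /** Step (4), under the assumption above: index of the highest row sum. */
    static int center(int[] sums) {
        int best = 0;
        for (int i = 1; i < sums.length; i++) {
            if (sums[i] > sums[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        String[] seqs = {"ACGTACGTAC", "ACGTTCGTAC", "TTGTACGAAC"};
        int[][] m = build(seqs, 0.7, new java.util.Random(42));
        System.out.println("center index: " + center(rowSums(m)));
    }
}
```

With $\epsilon$ = 1.0 the window covers the whole sequence, so the sketch reduces to the full Needleman-Wunsch distance matrix and can be used as a baseline for step (5).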
Analysis
This algorithm is closely related to the Needleman-Wunsch algorithm for pairwise alignment. It requires a value of $\epsilon$ from the user along with an input file containing sequences of the same length. Step 1 reads the input from the input file. Step 2 loops over all possible combinations of sequences: it is repeated once for each of the N sequences, and step 2a also iterates through each of the N sequences. (In this analysis N denotes the number of sequences and $|P_j|$ the length of a sequence.) Thus, step 2 performs $O(N^2)$ pairwise comparisons. After selecting a random number as the starting position, we select a substring from both sequences and align them using Needleman-Wunsch or any other pairwise alignment algorithm. For our purposes, step 2a-iv takes $O(|P_j|^{2\epsilon})$ time. The score is recorded in the appropriate cell of the distance matrix. Step 3 sums all pairwise alignment scores for a given sequence. The sequence with the lowest negative score or the highest positive score is selected. The running time of the algorithm is $O(N^2 |P_j|^{2\epsilon})$.
4 Implementation
This section explains the implementation of the algorithm, which was carried out in Java. The design and the basic class framework are taken from the NeoBio package [15]. The logic of the algorithm is simple and has been designed with future additions in mind. At present the algorithm uses a randomized form of the Needleman-Wunsch algorithm for alignment, but it can easily be extended to use any algorithm that globally aligns two sequences.
The main classes of the algorithm are in the package TheMatrix. The classes are:
1. RandomMatrixCalculation.java: This class has the main method, which takes as input the file containing all the sequences to be aligned. The file can be in FASTA format or can be just a sequence of characters. The scoring scheme is specified in this class, and the penalties for gap, match and mismatch can be set as desired. We have used the standard convention of gap = -2, match = +1 and mismatch = -1 for our application; these values can be changed easily.
2. BasicScoringScheme.java: This class extends the abstract class ScoringScheme.java. It is used to set the scoring scheme, and the scheme can be made more sensitive by implementing the methods of the ScoringScheme class in whatever way the user requires. The use of abstract classes gives us the freedom to change the scoring scheme, like the choice of algorithm, dynamically based on user preference.
3. PairwiseAlignmentAlgorithm.java: This is also an abstract class. A variable of this type, "algorithm", is used throughout the program; at run time, the object for the user's choice of algorithm (currently Needleman-Wunsch, but more can be added) is attached to this variable. Any class that extends this class implements the methods loadAllsequenceFile(), which loads all the sequences from a file into memory, and computePairwiseAlignmentAll(), which contains the details of aligning all sequences in pairs. The implementations vary depending on which algorithm class extends this class.
4. CharFile.java: This class reads the sequences from disk into memory and stores them in the desired format. In our case each sequence is stored as a character array, and the arrays are stored in vectors (extendable arrays in Java).
5. IncompatibleScoringSchemeException.java and InvalidScoringMatrixException.java: These are taken from the NeoBio package [15]. They extend Java's Exception class and are used to display meaningful messages in case of errors.
6. NeedlemanWunsch.java: This is the major class. It extends PairwiseAlignmentAlgorithm and thus implements the methods described above in its own way. At run time the variable of the abstract class PairwiseAlignmentAlgorithm is assigned an object of the NeedlemanWunsch class. Thus, even though the methods are called on the PairwiseAlignmentAlgorithm type throughout the program, the methods actually executed at run time are those of this class. When we later need to add a new algorithm, we can simply create one class that extends PairwiseAlignmentAlgorithm and implements the same methods in its own way, with no changes to the existing program. This is the basis for a flexible framework. The main methods implemented in this class are:
a. ComputePairwiseAlignmentAll()
b. ComputeScoreBetSeqIAndJ()
The first method reads the sequences one at a time and compares each to all the others by calling ComputeScoreBetSeqIAndJ() in a loop, recording the score of each comparison in the score matrix. The randomization step occurs in the second method, ComputeScoreBetSeqIAndJ(): based on a fixed value of $\epsilon$ between 0.0 and 1.0, the lengths of the two sequences to be compared are reduced, and then, starting from a random point, substrings of length "n*$\epsilon$" are taken from both sequences and compared using the standard Needleman-Wunsch algorithm.
The output is then recorded in a file, "Output.txt", along with the time elapsed for the computation of the matrix.
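The run-time binding described in this section can be sketched as follows. The sketch is hypothetical throughout: the class names carry a "Sketch" suffix to make clear that they are not the NeoBio classes or the classes of the TheMatrix package, and the method names and signatures are assumptions chosen for illustration only.

```java
/** Hypothetical scoring abstraction; not the NeoBio ScoringScheme API. */
abstract class ScoringSchemeSketch {
    abstract int substitution(char a, char b);
    abstract int gap();
}

/** Fixed match/mismatch/gap scores, in the spirit of BasicScoringScheme. */
final class BasicScoringSketch extends ScoringSchemeSketch {
    private final int match, mismatch, gap;

    BasicScoringSketch(int match, int mismatch, int gap) {
        this.match = match;
        this.mismatch = mismatch;
        this.gap = gap;
    }

    @Override int substitution(char a, char b) { return a == b ? match : mismatch; }
    @Override int gap() { return gap; }
}

/** Hypothetical abstract aligner; callers only ever see this type. */
abstract class PairwiseAlignerSketch {
    protected final ScoringSchemeSketch scoring;

    PairwiseAlignerSketch(ScoringSchemeSketch scoring) { this.scoring = scoring; }

    /** Each concrete algorithm supplies its own pairwise score. */
    abstract int score(String a, String b);

    /** Shared driver: scores every ordered pair with the plugged-in algorithm. */
    int[][] scoreAll(String[] seqs) {
        int[][] m = new int[seqs.length][seqs.length];
        for (int i = 0; i < seqs.length; i++) {
            for (int j = 0; j < seqs.length; j++) {
                m[i][j] = score(seqs[i], seqs[j]);
            }
        }
        return m;
    }
}

/** One concrete algorithm; others could be added without touching callers. */
final class NeedlemanWunschSketch extends PairwiseAlignerSketch {
    NeedlemanWunschSketch(ScoringSchemeSketch scoring) { super(scoring); }

    @Override int score(String a, String b) {
        int g = scoring.gap();
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j * g;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * g;
            for (int j = 1; j <= b.length(); j++) {
                int diag = prev[j - 1] + scoring.substitution(a.charAt(i - 1), b.charAt(j - 1));
                curr[j] = Math.max(diag, Math.max(prev[j] + g, curr[j - 1] + g));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}

final class FrameworkSketchDemo {
    public static void main(String[] args) {
        // The variable has the abstract type; the concrete class is chosen here.
        PairwiseAlignerSketch algorithm =
                new NeedlemanWunschSketch(new BasicScoringSketch(1, -1, -2));
        int[][] m = algorithm.scoreAll(new String[] {"ACGT", "ACGA"});
        System.out.println(m[0][1]);
    }
}
```

The design choice illustrated here is that the driver loop lives in the abstract class, so a new alignment algorithm only has to override the pairwise score method.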
5 Results
We compare results from three different input files, given as Appendices A, B and C. For each input file we compare the lowest distance score sum for various values of $\epsilon$. We also look at the time taken to evaluate the complete alignment ($\epsilon$ = 1.0) as opposed to $\epsilon$ < 1.0.
Table 1 shows the sums of the distance scores for various values of $\epsilon$.
Table 1. First run: input in Appendix A, N = 9, Si = 600

$\epsilon$     S1     S2     S3     S4     S5     S6     S7     S8     S9   Run time
1.00         -738   -656   -980   -914  -1012  -1194  -1076  -1032   -976   3687 ms
0.90         -678   -592   -898   -862   -913  -1080   -976   -951   -860   3203 ms
0.80         -635   -553   -796   -740   -806   -968   -840   -873   -785   2360 ms
0.70         -583   -494   -703   -660   -721   -894   -752   -797   -730   1953 ms
0.60         -486   -424   -627   -576   -627   -775   -676   -693   -618   1516 ms
0.50         -362   -354   -489   -490   -532   -608   -554   -578   -525   1281 ms
0.40         -304   -287   -387   -432   -452   -504   -433   -459   -386    985 ms
0.30         -276   -223   -303   -323   -302   -382   -350   -367   -286   1157 ms
0.20         -230   -206   -219   -225   -246   -287   -304   -284   -231    609 ms
The highlighted S2 column in Table 1 shows that for every value of $\epsilon$ the lowest sum (the one smallest in magnitude) was consistently that of sequence S2. Even going as low as $\epsilon$ = 0.2 gave an accurate prediction of which sequence has the lowest sum. For $\epsilon$ = 0.2, the run time was only about 1/6 of what it was for $\epsilon$ = 1.0. This gives a rough idea of the magnitude of time that could be saved with the randomized approach.
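The two observations above can be checked directly against the Table 1 data. The following Java sketch is only a convenience for the reader, with hypothetical class and method names; it stores the magnitudes of the sums exactly as listed in Table 1 and reports, for each value of $\epsilon$, which sequence has the sum of smallest magnitude and how the run time compares with the $\epsilon$ = 1.0 run.

```java
/** Recomputes the per-row selection and run-time ratios from Table 1. */
final class TableOneCheck {
    static final double[] EPS = {1.00, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20};

    // Magnitudes of the distance score sums for S1..S9, one row per epsilon.
    static final int[][] SUM_MAGNITUDES = {
        {738, 656, 980, 914, 1012, 1194, 1076, 1032, 976},
        {678, 592, 898, 862, 913, 1080, 976, 951, 860},
        {635, 553, 796, 740, 806, 968, 840, 873, 785},
        {583, 494, 703, 660, 721, 894, 752, 797, 730},
        {486, 424, 627, 576, 627, 775, 676, 693, 618},
        {362, 354, 489, 490, 532, 608, 554, 578, 525},
        {304, 287, 387, 432, 452, 504, 433, 459, 386},
        {276, 223, 303, 323, 302, 382, 350, 367, 286},
        {230, 206, 219, 225, 246, 287, 304, 284, 231},
    };

    // Run times in milliseconds, in the same row order.
    static final int[] RUN_TIME_MS = {3687, 3203, 2360, 1953, 1516, 1281, 985, 1157, 609};

    private TableOneCheck() {}

    /** Index of the smallest entry of a row (0 corresponds to S1). */
    static int indexOfSmallest(int[] row) {
        int best = 0;
        for (int i = 1; i < row.length; i++) {
            if (row[i] < row[best]) best = i;
        }
        return best;
    }

    /** Run time of row r as a fraction of the epsilon = 1.0 run time. */
    static double runTimeFraction(int r) {
        return (double) RUN_TIME_MS[r] / RUN_TIME_MS[0];
    }

    public static void main(String[] args) {
        for (int r = 0; r < EPS.length; r++) {
            int idx = indexOfSmallest(SUM_MAGNITUDES[r]);
            System.out.printf("eps=%.2f  selected=S%d  time fraction=%.3f%n",
                    EPS[r], idx + 1, runTimeFraction(r));
        }
    }
}
```

For every row the selected sequence is S2, and the run-time fraction for $\epsilon$ = 0.2 is 609/3687, roughly 1/6.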
Table 2 shows the sums of the distance scores for the input in Appendix B. The highlighted entries in this table fall in various columns, which shows the kind of inaccuracy that can arise with the randomized approach. Still, the majority of the values of $\epsilon$ gave the right result. It is not safe to take $\epsilon$ to be very low. For $\epsilon$ = 0.60, the right sequence was picked for the lowest sum, and the runtime reduction is a little more than ½ in this case.
Table 3 shows the distance matrix values for the input in Appendix C. The highlighted entries in this table are those of S7 for all values of $\epsilon$, which shows consistent results across the different values of $\epsilon$. For $\epsilon$ = 0.6, the runtime reduction is more than ½.
6 Conclusion
The implementation of the algorithm presented in this paper shows that for $\epsilon$ = 0.6 the running time of the algorithm is reduced by more than 50% while accuracy is maintained. The implementation also supports our hypothesis about the improvement that the randomized approach can bring to distance matrix calculation. As expected, the results lose their accuracy for very small values of $\epsilon$; a proper choice of $\epsilon$ therefore yields a speedup while maintaining the accuracy of the algorithm.
7 Discussion
In this paper we have discussed various methods of multiple sequence alignment. We have also introduced a new approach that randomly samples the sequences and aligns the samples to achieve the same result in terms of distance matrix calculation with a significant runtime improvement. We have backed up our claims of speedup and accuracy with empirical data and examples. Since most algorithms currently used for MSA compute a distance matrix as an initial step, this time reduction could be of importance.
8 Future Work
No significant work has been done in the area of randomized algorithms for MSA, which leaves many opportunities for future work. We plan to make certain critical improvements to our algorithm. First, we would like to prove the theoretical complexity of this algorithm and show that it is in practice a faster algorithm. We would also like to show that randomization gives the same result with very high probability. At this time we have assumed that all sequences are of the same length; we would like to extend our work so that sequences of uneven lengths can also be aligned using the randomized approach. This work could be taken further by implementing randomized portions for CLUSTAL W, MAFFT and other popular MSA packages in order to increase their speed. In our opinion, further speedup can be achieved by randomizing not just pairwise alignment but also sequence selection, but this hypothesis still needs further work.
References
[1] M. P. Berger, P. J. Munson. A novel randomized iterative strategy for aligning multiple protein sequences. Computer Applications in the Biosciences, 7(4), 1991, pp. 479-484.
[2] S. Rajasekaran, H. Nick, P. M. Pardalos, S. Sahni, G. Shaw. Efficient algorithms for local alignment search. Journal of Combinatorial Optimization, 5(1), 2001, pp. 117-124.
[3] K. Charter, J. Schaeffer, D. Szafron. Sequence alignment using FastLSA. International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, 2000.
[4] S. Needleman, C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 1970, pp. 443-453.
[5] D. Feng, R. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25, 1987, pp. 351-360.
[6] J. Thompson, D. Higgins, T. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, pp. 4673-4680.
[7] K. Katoh, K. Misawa, K. Kuma, T. Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), pp. 3059-3066.
[8] F. Corpet. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16, pp. 10881-10890, November 1988.
[9] G. Karypis, S. Han, V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Technical report TR-99, University of Minnesota, Minneapolis, 1999.
[10] A. Szymkowiak, J. Larsen, L. Hansen. Hierarchical clustering for datamining. Fifth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, 2001.
[11] C. Notredame. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3(1), 2002.
[12] O. Gotoh. Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Computer Applications in the Biosciences, 10(4), 1994, pp. 379-387.
[13] O. Gotoh. Optimal alignment between groups of sequences and its application to multiple sequence alignment. Computer Applications in the Biosciences, 9(3), 1993, pp. 361-370.
[14] J. Kececioglu, H. Lenhof, K. Mehlhorn, P. Mutzel, K. Reinert, M. Vingron. A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics, 104, 2000, pp. 143-186.
[15] S. Anibal de Carvalho. http://neobio.sourceforge.net/. Department of Computer Science, King's College, London, UK.
Appendix A
Input 1
S1:AGGCTATACTTAAGTGGTCGTT
ATGGCCGTACACCGACCAGCGAGGAACGCATAACAGCGACCTACAT
AAGTTTGTGGTGCATCAAGCTACCGCTTTGCTGATGGCGGACGAAACGCAATTGTTAGAAAGGGGGCGGCA
CAGTACCGAACACGCGTTTCCACGGTCATATTCAGAGGTGCTGTTTTTCTCGTGTAACGCGGCACCTTCCA
TGTCGCCGTTAGTGCGATGAGACTCCAGACCGTGCCCACACTTTGCTCATCGCGCACCAAGAGGAGAC
CCC
TGTTATCAGGCGTCGCAGTTCCTAGGGGCGCTATCCCACCGTCGCATAACGCCCGACCAAAGGACCACCAA
TCGTTCCGGCGCTGATTTGTCTGGCTCGAGGCGAGTGTCTGATCTGCACTGAGTAGCGGTCCCACTTGGTG
CGCTATTACGGGACGCATGAGCCCTGCGTTTTCTCTCTAATAGTTAGAGAGTATCCTTCTATGCGTCATGC
GAGAGGTTTCGCCTTAGACTAGGTTTTCGAGCTGCCCAGG
GTTCCAGTGTGCTTAAGCCGCCATTTATGGT
TTACTCAAGGGTAAAGGTGATCCCATGATTTGATA
S2:ACTCCCACACCACTACTACTAGCCGTTCTTTGCTGTAGAATTCGAAACACCTTTCAGACTGTACCCTG
CCTGCAACTTATAGGGTGCTCATACCGACTCCTAGCCTGAGTCTGACTTGTCGGAAAAATACTGCGCTCGT
ATGGAAAAGTACACCGAGATGCTGAGCCTGAGTTACAAATCAGGCAG
TTTTTGGGTCTTATTACTAGGCCC
ACGCTATCTTTGAACATATACTTCTCAGATAACGAGATTTATGTGCTAAGCGATACGTGGCTCAATCCCCG
CTAGGATCTGCCACAACACCACGACTGTCACTCCTTATCAATGACACTCAGTTTTCCAAACGCGGCTGTAG
GTGGTTATTGGTTACGAACGCGACGAACTTACTGTCTTACCTATTGTCAAAGGCCTATAATGCCACACTCT
AAAGCGAGCGGACAACTAC
CGTTTAAAGCGAATAATGTACCGACCCAAAAAGAACATTTCCCGGTCCCGTC
AGTAGAGCTGGTCAAGAAGGTAGTCTGAATAACTCACGGAGGTATCTTTAGGCTAGGAGCTGAACAAACTT
CAGAAATATAACGCCCCGCCGCCTGCACATGCGCA
S3:TGCTCTCAGTCTTTGTGTCGGCGTCTGAGTACCGTTGAGCGATCCGACAGTGGGGCCAGCCTGCGGAC
CGTCACGAACGTCGTTACCTTGATGC
GCATAGTTGCCGTTCTCGCCGAGGCTGGGTGTCCAAGGTGGTCTT
TAGCGCCTGCTTTTCAAAGGTAGTAACCTGGTATAATCTGGGGCGATAGTGTCGCCAGTTCAAGGCGTTCA
ACGAGTCGCGCACCTGCTATTACACTGGGAGTAACTATTCAATCAAGTATGAGGCTCAGAACCACAGGTAT
TATTGATGATAAGCCAGACCTTCGAGGATCGTCTCTAGCACATGATCGTTTGATAGAAAGTGTGCAGCT
GG
TGAAGTTTTTAACATCCCGTGAGGACGTACACTGGCCTCTCTTGTGCCGGGCGTTAAACAATACCTTAAAG
CATGCCACAATCGTACCGGGCATAGGATGCTGATTTATGCCTTCATAAAGGGACTCGGCCACGTTGTAAGG
TGTGAATGCTAGATCTACCACGAAAGGGCCTGTTAGCACACATGCCGCCCTTGTCGCTAAAGGTTTTATAA
TACGCGTACGCTCATGCCCCCGAAAGAAGACCATGAGTTGA
CATTCGCTCATAATACAGGTCAGGCATAGG
TGGAGCTCGTGGATTTCTTATCGTTACAAACCATCGCAGAGCACCGTTCGATATACAATAGAGCTTCGGGC
ACTACGCCTACGCGGGTGATTAGGAACCCGTTACAAGGCAAGGACTCAATGGTGTCCCGGAATTTACGCCA
ACAACGGTTGTGAAGGGGATGCGGCGGACTATTGTTTAATGTGGTTGGATCCCACCGTGTGCAATCAGCCT
AGGGGAAACGCAG
GAGTCAGAGGCAGTTGGAGTCAGATTGTGCATTAATCAGTTCGTAAGCCTTCCACGGA
GAGTAATCACAACGTCTCGGACAGAAGCTCCCTAGACGACTAGCTGAAAGTGCCCCCAAAGTGCTATGGCA
TCAATCCCT
S4:GCCTATTCGGATGTACTCTCTCCGCCCAGAAGTGAAGGAGTCAGATAGGTCCTTGCTATAACAGCCGC
AACACTCATCGTGCCGGCAGCCTAGCAGTTACCTGGATCCCAGATC
TACCTTACCATTTCAGGCTAAATTT
AGGCTCGGGTACAAAAAACATCGCCGGGCTTCAACCTTGCCGCCCTTAACACACGGTGTGACTTTATACAG
GGAGATGGAGCATGGGCTGGCCTAGTGGGGTGTGGCGCTAATTTCCTCGCTAATGCTATGCGGAGCCCTGA
AAGCTGACTGGAGGAGGCCGAGCCGACAATGTCTCGTGAGTGGCATTGCGTTTAAGGAAGACTTTTGTCCG
ATCTACACCTTCCTCGAG
TCTCCGCAGGGTTGTGCATAGTGGCTGTAGACAGAATCCAGCTGACAGGTCTG
CATTTAGAAATAGCTTAGCGTCCGCCGGACCACTGTCAACTTTACTGTGGCTCTCGTCTGCTGACTTTGAT
TATCTGAATGTGAGTCTCAGTAACTGACCTGGGCGTCTTCGGCGAAGGATCAATGAACGAATCAAAGAGGT
GAAGGGGCTTTCCTGCTAAGACCGTGCATCAGTACTAGCCGGTCGAGTCCTTTGCACGTCC
GCCGCAGCCG
TACAGTCGATTGATATAGTCTACCCTCGATCCTTTAGCAAGTGCATATGCAGCCGACCAACCTTGCGGCAT
ACTCCAATCAACACTACCCAGATCCTAAGGTGACGGTTTCAGAGGATATACGAAGCGTATTGCACCGCGTA
TGTATTTAAGAACGGTGGGTGTTATGTCAGACGCGTCCGGTTTTAACCCTTTATACAAATCGTCTCGACAC
ACTACATCAATATATTACATGAAGGTGCATCAC
AGCCGGTCCACACCGGTT
S5:TCGGCTGTATTGGCGACCCAGGCGTGGGCTTAATGAATCAGAGACTCTGCAGCCAGGGAGTATGTATA
GCAGTTCTTTAAACGGTCTGCGACGAGGAAGGTTTCGAGTGTGCAACGTGAGGCTATCGTAAAAGTGTTTC
AACAGATGGGGGGCTATGAGCCGCTCGAACGTTACACACTGCACGCGGGGTCGACTAATGGAAGCTAACCT
AAGCTAATTGCCCTATTCGTGAAG
AAACATCTAATTCCTTCCTTGTATGTGTTCTCCCTACAGCACATATC
GACAATAGGTTTTAGTGCTTTACCACAAGTAGCAAGTACAACTTGAATTGGGTAAGACTTGCACTTCATGT
ATTTGAAATCGCTATCCCACGACTTGGTGTCAACCCCCGGCTCTTTATCACCTTGCATACCCAGCGGCATC
AAGTGACCGACATATGATCTGGTAGTAGTTCAACCCTGAAGACTATCTTTAGCTCAGCGCGTTAAGT
CCTT
ATACACTCTAGCGAGTGGGAAGGATGGATCGGCCGGACATCGTACGTAATTTAGAACCCAGTACCGAGACG
CGTTCGACAGTCCTAAGGCTCCATCAGAGTAGCTTACTACGTCACGAGTCAGGTAAAGCCGAGAGCGTCCG
ATCCATCCTTGGTGGATCAGCGTTCTCTGTTGTTGAACGCGAGGTAAACGTTGGTAACTTTTTCAACAGCA
GTAGAGTAGCGTGTAGTTACTCGGAGATCGACGTAACTG
CGCGCCCTGCAACACTAAGCGCTGCGCTGTCT
GCTGCGCAGACTCTATGAGAGTCGCTCGTCTCCGTCTGCTTAGGGGGCGTTAGCACACTAATCACGGCTCA
AATATGTTAAAGAAGGAGCCCCATTTCCGTGACGTCAGTACGAGCAATTTACGATGGCAAAGAGAGCAAGA
CCTTCGCGCAGGGTACGGACCTGACAGCATGGGTTATCAAGGCCCTTTCCAGGTAATAAATTTCAGATTTA
GTACTTATCAT
GTAGATAAGTTGGAAACCTTGA
S6:GAAGACTCAGGGAGAGAAATTTTTCTTGATTCATTCTGCAGATTGGCTTACTACACATGCTCTTTTCC
ATGAAGTTGCAAAATTGGATGTGGTGAAATTATTATACAATGAGCAGTTTGCTGTTCAAGGGTTGTTGAGA
TACCATACATATGCAAGATTTGGCATTGAAATTCAAGTTCAGATAAACCCTACACCTTTCCAACAGGGGGG
ATTGATCTGTGCTATGGTTC
CTGGTGACCAGAGCTATGGTTCTATAGCATCATTGACTGTTTATCCTCATG
GTTTGTTAAATTGCAATATTAACAATGTGGTTAGAATAAAGGTTCCATTTATTTACACAAGAGGTGCTTAC
CACTTTAAAGATCCACAATACCCAGTTTGGGAATTGACAATTAGAGTTTGGTCAGAATTAAATATTGGGAC
AGGAACTTCAGCTTATACTTCACTCAATGTTTTAGCTAGATTTACAGATTTGGAGTTGCATGG
ATTAACTC
CTCTTTCTACACAAATGATGAGAAATGAATTTAGGGTCAGTACTACTGAGAATGTGGTGAATCTGTCAAAT
TATGAAGATGCAAGAGCAAAGATGTCTTTTGCTTTGGATCAGGAAGATTGGAAATCTGATCCGTCCCAGGG
TGGTGGGATCAAAATTACTCATTTTACTACTTGGACATCTATTCCAACTTTGGCTGCTCAGTTTCCATTTA
ATGCTTCAGACTCAGTTGGTCAACAAATTAAAGTT
ATTCCAGTTGACCCATATTTTTTCCAAATGACAAAT
ACGAATCCTGACCAAAAATGTATAACTGCTTTGGCTTCTATTTGTCAGATGTTTTGTTTTTGGAGAGGAGA
TCTTGTCTTTGATTTTCAAGTTTTTCCCACCAAATATCATTCAGGTAGATTACTGTTTTGTTTTGTTCCTG
GCAATGAGCTAATAGATGTTTCTGGAATCACATTAAAGCAAGCAACTACTGCTCCTTGTGCAGTAATGGAT
ATTACAG
GAGTGCAGTCAAC
S7:CAGTGGCGATGACCCTGGAAAAGAATATGCCGATCGGTTCGGGCTTAGGCTCCAGTGCCTGTTCGGTG
GTCGCGGCGCTGATGGCGATGAATGAACACTGCGGCAAGCCGCTTAATGACACTCGTTTGCTGGCTTTGAT
GGGCGAGCTGGAAGGCCGTATCTCCGGCAGCATTCATTACGACAACGTGGCACCGTGTTTTCTCGGTGGTA
TGCAGTTGATGATCGAAGAAAACGACATC
ATCAGCCAGCAAGTGCCAGGGTTTGATGAGTGGCTGTGGGTG
CTGGCGTATCCGGGGATTAAAGTCTCGACGGCAGAAGCCAGGGCTATTTTACCGGCGCAGTATCGCCGCCA
GGATTGCATTGCGCACGGGCGACATCTGGCAGGCTTCATTCACGCCTGCTATTCCCGTCAGCCTGAGCTTG
CCGCGAAGCTGATGAAAGATGTTATCGCTGAACCCTACCGTGAACGGTTACTGCCAGGCTTCCGGCAGGCG
C
GGCAGGCGGTCGCGGAAATCGGCGCGGTAGCGAGCGGTATCTCCGGCTCCGGCCCGACCTTGTTCGCTCT
GTGTGACAAGCCGGAAACCGCCCAGCGCGTTGCCGACTGGTTGGGTAAGAACTACCTGCAAAATCAGGAAG
GTTTTGTTCATATTTGCCGGCTGGATACGGCGGGCGCACGAGTACTGGAAAACTAAATGAAACTCTACAAT
CTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAAC
CCAGGGGTTGGGCAAAAATCAGGGGCT
GTTTTTTCCGCACGACCTGCCGGAATTCAGCCTGACTGAAATTGATGAGATGCTGAAGCTGGATTTTGTCA
CCCGCAGTGCGAAGATCCTCTCGGCGTTTATTGGTGATGAAATCCCACAGGAAATCCTGGAAGAGCGCGTG
CGCGCGGCGTTTGCCTTCCCGGCTCCGGTCGCCAATGTTGAAAGCGATGTCGGTTGTCTGGAATTGTTCCA
CGGGCCAACGCTGGCA
TTTAAAGATTTCGGCGG
S8:AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCA
GCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCA
ATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCA
CAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGC
CCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGT
TCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGG
CAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGA
AAA
AACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG
GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTG
CCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTA
TTAGAAGCGCGCGGTCACAACGTTACTGTTA
TCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACC
CGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGA
AAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTAC
GCGCCGATTGTT
GCGAGATTTGGACGGACGTTG
S9:ACCCATAACGGGCAATGATAAAAGGAGTAACCTGTGAAAAAGATGCAATCTATCGTACTCGCACTTTC
CCTGGTTCTGGTCGCTCCCATGGCAGCAGAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGA
TAGGCGATCGTGATAATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAA
CATTATGAATGGCGAGGCAAT
CGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCATAAGAAAGC
TCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAAATGACAAATGCCGGGTAACAAT
CCGGCATTCAGCGCCTGATGCGACGCTGGCGCGTCTTATCAGGCCTACGTTAATTCTGCAATATATTGAAT
CTGCATGCTTTTGTAGGCAGGATAAGGCGTTCACGCCGCATCCGGCATTGACTGCAAACTTAAC
GCTGCTC
GTAGCGTTTAAACACCAGTTCGCCATTGCTGGAGGAATCTTCATCAAAGAAGTAACCTTCGCTATTAAAAC
CAGTCAGTTGCTCTGGTTTGGTCAGCCGATTTTCAATAATGAAACGACTCATCAGACCGCGTGCTTTCTTA
GCGTAGAAGCTGATGATCTTAAATTTGCCGTTCTTCTCATCGAGGAACACCGGCTTGATAATCTCGGCATT
CAATTTCTTCGGCTTCACCGATTTAAAATACTCATC
TGACGCCAGATTAATCACCACATTATCGCCTTGTG
CTGCGAGCGCCTCGTTCAGCTTGTTGGTGATGATATCTCCCCAGAATTGATACAGATCTTTCCCTCGGGCA
TTCTCAAGACGGATCCCCATTTCCAGACGATAAGGCTGCATTAAATCGAGCGGGCGGAGTACGCCATACAA
GCCGGAAAGCATTCGCAAATGCTGTTGGGCAAAATCGAAATCGTCTTCGCTGAAGGTTTCGGCCTGCAAGC
CGGTGTAG
ACATCACCTTTAAACGCCAGAATCG
Appendix B
Input 2
Appendix C
Input 3