A DNA Toolkit for Bioinformatics

moredwarfΒιοτεχνολογία

1 Οκτ 2013 (πριν από 3 χρόνια και 8 μήνες)

81 εμφανίσεις






















“A DNA Toolkit for Bioinformatics”








































Sophia Banton












CSC 7351











Summer 2007

ABSTRACT


DNA sequence analysi
s is important to bioinformatics.
Processing a raw DNA sequence
is usually

the first step

in carrying out projects in bioinformatics. A total of four programs were
created: PairwiseAlignment.cpp, Trans.cpp, PairwiseProteinAlign.cpp, and MSA.cpp. The
PairwiseAlignment program generates an alignment of two DNA sequences. The
Pa
irwiseProteinAlign program does the same thing, with the exception that the input is a pair of
protein sequences. The MSA.cpp completes an alignment of three protein sequences. The
Trans.cpp translates a DNA sequence into a primary amino acid sequence.
T
he Needleman
-
Wunsch and Smith
-
Waterman algorithms are efficient in finding homologous gene sequences
and proteins
. Progressive alignments of multiple sequences can be carried out using the Feng
Doolittle algorithm.

The limitation with the algorithm is tha
t early mistakes cannot be corrected.

The genetic code and the proteinic codes continue to be both simple in structure
and

elusive in
nature.



INTRODUCTION




DNA sequence analysi
s is important to bioinformatic
s
.
Processing a raw DNA sequence
is usually
the first step

in carrying out projects in bioinformatics.
The goal was to write a
package of small programs that manipulate raw DNA sequences
.
A total of four programs were
created: PairwiseAlignment.cpp, Trans.cpp, PairwiseProteinAlign.cpp, and MSA.cpp
. The
PairwiseAlignment program reads in a pair of DNA sequen
ces in standard FASTA format
and
generates a
n

aligned sequence
. The PairwiseProteinAlign program does the same thing, with the
exception that the input is a pair of protein sequences. The MSA
.cpp program extends on these
two programs and completes an alignment of three
or more
pro
tein

sequences. The Trans.cpp
program is the most unique among the group and this program translates a DNA sequence into a
primary amino acid sequence.

In vivo DNA
is translated from its primary nucleotide sequence to a primary amino acid
sequence.
Within the nucleus of a cell, the DNA is fist replicated and then exported to the
cytoplasm where transcription occurs. Transcription involves the formation of messenger

RNA

(mRNA)

from a DNA template. The mRNA is then brought to a ribosome where translation
occurs. In translation the mRNA in converted to amino acids sequences. Each amino acid is
derived from three DNA nucleotides. For computational simplicity the tra
nscription step can be
ignored because RNA

only

differs

from DNA by
a single

nucleotide. Ignoring this fac
t, the
DNA can be directly translated to amino acid sequences.


Due to the nature of DNA, each sequence has a total of six frames from which it
can b
e read.

The code is read in triplets so the DNA can be read for the 1st, 2nd, or 3rd codon.

Similarly the DNA can be read for the opposite or 3’
-
5’ end. At this end the sequence can be
read from the nth, nth
-
1, or nth


2 position. The figure below
illustrates this concept for the 5’
to 3’ order.



Sequence alignment is a useful step in trying to determine the function of an unknown
sequence. A global alignment is most useful when applied to sequences that are similar in both
similar in their sequen
ces and size.

To align two sequences a score

must be found.

DNA
sequences are scored using a match
-
mismatch table of constants. Matches are assigned positive
scores while mismatches are assigned negative scores. The protein sequences are aligned using
5'



















































3'





atg
cccaagctgaatagcgtagaggggttttcatcatttgaggacgatgta
taa




1

a
tg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta
taa






M



P



K



L



N



S



V



E



G



F



S



S



F



E



D



D



V



*




2


t
gc cca agc
tga

ata gcg
tag

agg

ggt ttt cat cat ttg agg acg atg tat






C



P



S



*



I



A



*



R



G



F



H



H



L



R



T



M



Y




3



g
cc caa gct gaa
tag

cgt aga ggg gtt ttc atc att
tga

gga cga tgt ata







A



Q



A



E



*



R



R



G



V



F



I



I



*



G



R



C



I


the Block Substitution Matrix 62, which is set of values that are used specifically for protein
sequence alignment.

To make an alignment such as:






MRN
D
PCQ

M
-
N
E
P
C
-



Each column of the alignment is treated independently. Then we find the score of th
e total
alignment by summing all the columns. This process is called dynamic programming. To find the
optimal alignment, sub
-
alignments are found instead of finding all the possible alignments.


Aligned sequences contain
amino acid residues

or nucle
otide bases with

gaps. Gaps
represent evolutionary based insertions and deletions. The idea is that mutations between DNA
nucleotides or between amino acids with similar properties will be more tolerated than mutations
between pairs that are different.

Thus a

gap penalty can be viewed as a level of tolerance for
evolution of a residue in
a molecular
sequence
.
The alignment programs align the input
sequences, both DNA and protein using a gap penalty of
-
3. Gap penalties contribute to the final
score of
the alignments. The size of

the gap penalty relative to the entries in the
scoring table or
matrix affects the alignment that is finally selected. A higher gap penalty will cause less
favorable characters to be aligned, in an attempt not to create too many

gaps. A high number of
gaps signal a poor alignment.

A gap penalty of
-
3 is fairly reasonable for using the BLOSUM62
matrix.



Multiple sequence alignments are more powerful than pair
-
wise alignments when trying to
find the evolutionary history

of a protein.

Such alignments allow

the grouping of proteins and
genes into families
. Gene or protein families share similar structure, function, and properties.




AlGORITHMS



Section I :
DNA Translation

(Trans.cpp




This program converts a raw DNA

sequence into a protein primary structure.

DNA
sequence is read into an array of characters.

Then the
DNA is read from three frames in the
forward direction.


Primary amino acid sequences are generated for each frame.
The p
rotein
generated with the l
ongest length is chosen as best protein
.
The p
rotein is converted to
complementary strand,

reversed and the steps are repeated.

C
lass Declaration:

















Section II : Sequence Alignments


I
.

Setting up the Scoring Matri
x



For the DNA sequenc
e alignment program
the following scoring table was used. Each match
was rewarded with a positive score while, each

A


G


C


T


A

10

-
1

-
3

-
4


G

-
1


7

-
5

-
3

C

-
3

-
5


9


0


T

-
4

-
3


0


8

class trans

{


private:



DNA[]



trans_DNA[]


public:



trans( char (&input)[]){}



void Translate(){}



int get_prot_Length(){}



void get_Protein(){}



void print_best_naive_Prot(){}



void execute(){} //

};




Read in text file



Read sequence into an array of char DNA[i]



Frame 1 starts at first location in (DNA)



Frame 2 at nucleotide 2 (DNA + 1)



Frame 3 at nucleotide 3 (DNA + 2)



Read triplets and convert to amino acid



Store sequence into array [tempProtein]



Fi
nd the best start and stop nucleotide for the tempProtein



Generate a processed protein sequence using best stop



Find best stop by reading the DNA from the end
(3’)



The longest of the generated sequences is the best



Transpose(DNA[])


trans_DNA[]



REPEAT PRO
CESS

mismatch earned a negative mark.

Also a mismatch between pyrimidines (C
-
T) or purines (A
-
G) were assigned a score of zero. This is becaus
e from an evolutionary perspective changes
between those bases would be more favorable and more conserved
.




The Scoring matrix used for the protein pairwise alignment and multiple sequence
alignment programs is the BLOSUM62 matrix shown below.

















BLOSUM is an abbreviation for “Blocks of Amino Acid Substitution Matrix
”.
It is a
substit
ution matrix used for sequence alignment of proteins. It is used frequently in
bioinformatics to score alignments between related and non
-
related proteins. The matrix wa
s

created by Henikoff and Henikoff (1992; PNAS 89:10915
-
10919). They created the matr
ix by
scanning the BLOCKS database for much conserved regions of protein families. The regions
lacked gaps and so were truly “conserved”. To generate the matrix the scientists counted the

C

S

T

P

A

G

N

D

E

Q

H

R

K

M

I

L

V

F

Y

W

C

9

-
1

-
1

-
3

0

-
3

-
3

-
3

-
4

-
3

-
3

-
3

-
3

-
1

-
1

-
1

-
1

-
2

-
2

-
2

S

-
1

4

1

-
1

1

0

1

0

0

0

-
1

-
1

0

-
1

-
2

-
2

-
2

-
2

-
2

-
3

T

-
1

1

4

1

-
1

1

0

1

0

0

0

-
1

0

-
1

-
2

-
2

-
2

-
2

-
2

-
3

P

-
3

-
1

1

7

-
1

-
2

-
1

-
1

-
1

-
1

-
2

-
2

-
1

-
2

-
3

-
3

-
2

-
4

-
3

-
4

A

0

1

-
1

-
1

4

0

-
1

-
2

-
1

-
1

-
2

-
1

-
1

-
1

-
1

-
1

-
2

-
2

-
2

-
3

G

-
3

0

1

-
2

0

6

-
2

-
1

-
2

-
2

-
2

-
2

-
2

-
3

-
4

-
4

0

-
3

-
3

-
2

N

-
3

1

0

-
2

-
2

0

6

1

0

0

-
1

0

0

-
2

-
3

-
3

-
3

-
3

-
2

-
4

D

-
3

0

1

-
1

-
2

-
1

1

6

2

0

-
1

-
2

-
1

-
3

-
3

-
4

-
3

-
3

-
3

-
4

E

-
4

0

0

-
1

-
1

-
2

0

2

5

2

0

0

1

-
2

-
3

-
3

-
3

-
3

-
2

-
3

Q

-
3

0

0

-
1

-
1

-
2

0

0

2

5

0

1

1

0

-
3

-
2

-
2

-
3

-
1

-
2

H

-
3

-
1

0

-
2

-
2

-
2

1

1

0

0

8

0

-
1

-
2

-
3

-
3

-
2

-
1

2

-
2

R

-
3

-
1

-
1

-
2

-
1

-
2

0

-
2

0

1

0

5

2

-
1

-
3

-
2

-
3

-
3

-
2

-
3

K

-
3

0

0

-
1

-
1

-
2

0

-
1

1

1

-
1

2

5

-
1

-
3

-
2

-
3

-
3

-
2

-
3

M

-
1

-
1

-
1

-
2

-
1

-
3

-
2

-
3

-
2

0

-
2

-
1

-
1

5

1

2

-
2

0

-
1

-
1

I

-
1

-
2

-
2

-
3

-
1

-
4

-
3

-
3

-
3

-
3

-
3

-
3

-
3

1

4

2

1

0

-
1

-
3

L

-
1

-
2

-
2

-
3

-
1

-
4

-
3

-
4

-
3

-
2

-
3

-
2

-
2

2

2

4

3

0

-
1

-
2

V

-
1

-
2

-
2

-
2

0

-
3

-
3

-
3

-
2

-
2

-
3

-
3

-
2

1

3

1

4

-
1

-
1

-
3

F

-
2

-
2

-
2

-
4

-
2

-
3

-
3

-
3

-
3

-
3

-
1

-
3

-
3

0

0

0

-
1

6

3

1

Y

-
2

-
2

-
2

-
3

-
2

-
3

-
2

-
3

-
2

-
1

2

-
2

-
2

-
1

-
1

-
1

-
1

3

7

2

W

-
2

-
3

-
3

-
4

-
3

-
2

-
4

-
4

-
3

-
2

-
2

-
3

-
3

-
1

-
3

-
2

-
3

1

2

1
1

relative frequencies of amino acids and their substitution probabil
ities. A log
-
odds score for each
of the 210 possible substitutions of the 20 standard amino acids was derived and that is the basis
of the BLOSUM 62. So the matrix is not computer generated, but rather based on observed
alignments.

Each score within a BL
OSUM matrix is a measure of the likelihood of two amino acids
appearing/evolving by chance. The matrices are based on the minimum percentage identity of the
aligned protein sequence used in calculating them. Each possible identity or substitution is
assign
ed a score based on its observed frequencies in the alignment of related proteins. The more
likely substitutions are assigned positive scores, and those less likely to occur are give negative
scores. BLOSUM62 is the matrix calculated by using the observed
substitutions between
proteins which have 62% or more sequence identity. It is important to note that the likelihood of
substitution is based on the biological properties of the amino acid. Substitutions among the
members of each of the following groups o
f amino acids are more likely: aliphatic, polar, acidic,
basic, and aromatic. The aliphatic residues are G, A, V, L, I, and M. The polar residues are S, T,
and C. The Aromatic residues are F, W, and Y. The Basic residues are H, K, and R. Lastly; the
a
cidic residues are D and E.


II
.

Aligning

the Sequences



Global Alignment


This project implements the Needleman
-
Wunsch algorithm for finding a global alignment
in both the DNA sequence alignment (PairwiseAlignment.cpp) and the protein sequence
alignmen
t (ProteinPairwiseAlign.cpp).

The process begins by focusing on the last column of each alignment. These are the only
three possibilities for the last column of the alignment:

alignments in which the last character of

A is paired with the last character

in B, alignments in which the last character in A is paired with
a gap, and alignments in which the last character in B is paired with a gap.


Each of these alignments has effectively two parts, a prefix, and the last column.

The
prefix of the alignment
contains every column but the last.







MRNDPC
Q

M

NEPC
-

The score for each alignment is calculated by scoring the prefix and adding the score for the last
column. In each group all of the alignments in this group have the same last column, preceded
b
y prefixes that are different. With the last column held constant, the alignment with the highest
score is the alignment with the highest
-
scoring prefix.

Once the highest scoring alignments for
the prefixes of the sequences are found, the highest scoring

alignment over the entire lengths of
the sequences can be found.

The two sequences can be called S1 and S2.

For any number i, we'll refer to the first i
characters of S1 as S1[1...i].

Analogously, for any number j, we'll refer to the first j characters
of
S2 as S2[1...j].

For any value of i and j, we can calculate the optimal alignment between S1[1...i]
and S2[1...j] by finding the highest score among the three possible groups. The BLOSUM62
matrix is used to determine which of these three options is bes
t, and that yields the optimal
alignment between S1[1...i] and S2[1...j].

Let i and j range from 0 to the lengths of the
sequences (S1 and S2). The value of the optimal alignment for any particular i and j is then used
to find the next larger optimal alig
nment and so on.


The Needleman
-
Wunsch algorithm is shown below:



A two
-
dimensional array (or matrix) is allocated. This matrix is the F matrix, and its
(i,j)th entry is often denoted F
ij
. There is one column for each character in S1, and one row for
each

character in S2
.












The bottom right hand corner of the matrix is the maximum score for any alignments. To find the
alignment which generates this score, start from the bottom right cell, and compare the value
with the three possible
sources (
C
hoice1, Choice2, and Choice3 above) to see which it came
from. If Choice1, then S1

(i) and
S2 (
i)
are
aligned, if Choice2, then S1

(i) is
aligned with a gap, and if Choice3,
then S2(i) is aligned with a gap.

Embedded in the algorithm is a weight
Matrix whi
ch gives the alignment
between the sequences.


Local Alignment



The local alignment is very
similar to the global alignment. Local


for i=0 to length(A)
-
1




F(i,0) <
-

d*i



for j=0 to length(B)
-
1




F(0,j) <
-

d*j



for i=1 to length(A)




for j = 1 to length(B)




{





Choice1 <
-

F(i
-
1,j
-
1) + S(A(i
-
1), B(j
-
1))





Cho
ice2 <
-

F(i
-
1, j) + d





Choice3 <
-

F(i, j
-
1) + d





F(i,j) <
-

max(Choice1, Choice2, Choice3)




}



AlignmentA <
-

""


AlignmentB <
-

""


i <
-

length(A)


j <
-

length(B)


while (i > 0 AND j > 0)


{


Score <
-

F(i,j)


ScoreDiag <
-

F(i
-

1, j
-

1)


ScoreUp <
-

F(i, j
-

1)


ScoreLeft <
-

F(i
-

1, j)


if (Score == ScoreDiag + S(A(i
-
1), B(j
-
1)))


{


AlignmentA <
-

A(i
-
1) + AlignmentA


AlignmentB <
-

B(j
-
1) + AlignmentB


i <
-

i
-

1


j <
-

j
-

1


}


else
if (Score == ScoreLeft + d)


{


AlignmentA <
-

A(i
-
1) + AlignmentA


AlignmentB <
-

"
-
" + AlignmentB


i <
-

i
-

1


}


otherwise (Score == ScoreUp + d)


{


AlignmentA <
-

"
-
" + AlignmentA


AlignmentB <
-

B(j
-
1) + AlignmentB



j <
-

j
-

1


} }


while (i > 0)


{


AlignmentA <
-

A(i
-
1) + AlignmentA


AlignmentB <
-

"
-
" + AlignmentB


i <
-

i
-

1 }


while (j > 0)


{


AlignmentA <
-

"
-
" + AlignmentA


AlignmentB <
-

B(j
-
1) + AlignmentB


j <
-

j
-

1 }

alignments are best for non
-
similar sequences that contain regions of homology. This program
implements the Smith
-
Waterman

algorithm for finding local alignments.

The Smith
-
Waterman
method highlights only those regions of the alignment which have positive scores.

For each cell
of the matrix, the algorithm considers each path that leads to it. The paths can be of any length

and can contain gaps (insertions and deletions). Instead of looking at an entire sequence at once,
the S
-
W algorithm compares multi
-
lengthed segments, looking for whichever segment
maximizes the
sc
oring measure. The goal here is to align the sequences wh
ile ignoring the badly
aligned areas of the entire alignment
.

The only change in the algorithm as compared to the
global is during the initialization and iteration step. Here the matrix is set to zero at all cells.











When finding the maximum scor
e for a matrix entry, if the maximum score is negative, we
instead make it

0. This is done because we are searching for the optimal substring; if a part of the
alignment is negative it can be ignored.


III. Multiple sequence alignment (MSA)



The multipl
e sequence alignment method used was progressive alignment
. The algorithm
used was Feng
-
Doolittle algorithm. First global alignments are generated for all the input
sequence, then a distance tree is created to keep track of the alignments, and then all s
equences
are aligned to the best pair. Due to shortage of time, only the first step of the algorithm was
Initializ
ation
:


F(0, j) = F(i, 0) = 0










Iteration
:


0

F(i, j) = max


F(i


1, j)


d

F(i, j


1)


d

F(i


1, j


1) + s(xi, yj)


implemented. Global alignments were derived for all the sequences using
a for
-
loop
that aligned
each pair of sequences by generating instances of the

global alignment class. The global
alignment class is the same as the one used for the pairwise sequence alignments.



RESULTS


Th
is

section will display the results generated by each program.


DNA sequence alignment:


































D
NA Translation (Trans.cpp):












Protein sequence Alignement (ProteinPairWiseAlign.cpp):



Multiple Protein Sequence Alignment




FUTURE WORKS



The next logical step is to improve the multiple sequence alignment program.
Currently
the program
does not align the sequences as a group. From the output above it can be inferred
which sets of sequences are most similar, but similarities between all the proteins cannot be seen.
.
With the trans.cpp (
translation

program),
more efficient program would
do a better job of
trimming the excess
amino acids

from the “best” proteins.



CONCLUSION



It is necessary to process raw DNA sequences in order for biologists to obtain meaningful
data from the various genomes that have been sequenced
.
Predicting the pr
imary sequence can
assist in understanding the function of gene
.
Sequence alignment is a useful tool is finding gene
or protein analogs
. Hypothetical genes and protein don’t always amount to real genes and
proteins.
The Needleman
-
Wunsch and Smith
-
Waterma
n algorithms are efficient in finding
homologous gene sequences and proteins
. The genetic code and the proteinic codes continue to
be both simple in structure but elusive in nature.
























REFERENCES



Agarwal, Pankaj K., Orlando, David (2
003).
Lecture 15: Multiple Sequence Alignment.

CPS260/BGT204.1 Algorithms in Computational Biology.
h
ttp://www.cs.duke.edu/courses/cps260/fall03/notes/lecture15.pdf


Bhattacharya
, D
., Haque,R. and Singzh,U. (2005). Coding and Noncoding Genomic Regions of


Entamoeba histolytica Have Significantly Different Rates of Sequence Polymorphisms:
Implications for Epidemiological Studies. J. Clin. Microbiol. 43 (9), 4815
-
4819


Champe C., Pamela, Harvey, Richard A. and Ferrier, Denise R. (2005).
Lippincott's Illust
rated

Reviews: Biochemistry

(3rd ed.). Lippincott Williams & Wilkins


Needleman, S.B., Wunsch, C.D. A general method applicable to the search for similarities in the


amino acid sequences of two proteins.
1970
Journal of Molecular Biology
48
:443
-
453.


S
mith TF, Waterman MS (1981). "
Identification of Common Molecular Subsequences
".
Journal


of Molecular Biology

147
: 195
-
197.



Waterman, M.S., Eggert, M. A new algorithm for best subsequence alignments with applications
to tRNA
-
rRNA comparisons.
1987
Journ
al of Molecular Biology
197
:723
-
728.