Abstract - Knowledge Systems Institute


THE DESIGN OF COMPUTATIONAL JAVABEAN COMPONENT PACKAGE FOR PROTEIN SEQUENCE ANALYSIS

by

Jenhsun Lo

A thesis submitted in partial fulfillment of the
requirements for the degree of

Master of Science in Bioinformatics



Approved by

________________________________
_______________


Chairperson of Supervisory Committee


________________________________
_______________



________________________________
_______________



________________________________
_______________


Program Authorized to Offer Degree

________________________________
_____________


Date

________________________________
______________________



Knowledge Systems Institute

Skokie, Illinois

2003



Abstract

This thesis presents some of the major challenges facing computer scientists, mathematicians, molecular biologists, and epidemiologists at the frontiers of bioinformatics. Given the complexity of genome information analysis, it presents a custom-designed component package for analyzing genome sequence data.

Chapters Zero and One observe that the knowledge of living systems produced by extensive research is too complex for any one person to comprehend, and they discuss the computational logic of molecular sequence analysis based on the Needleman-Wunsch algorithm. Chapter Two covers the evolution and mutation of proteins (amino acids) and presents the mutation table that underlies the design of the custom package.

The package is also designed to analyze real-world genome problems. Chapter Three discusses protein sequence input from the NCBI (National Center for Biotechnology Information) web service, which is used to retrieve and manipulate the sequence data that serve as input to this custom computational package.

In view of the complexity of genetic sequence research, Chapter Four provides case studies of real-world problems in the clinical epidemiology of severe acute respiratory syndrome (SARS) and bovine parainfluenza virus (horse shipping fever), using this package and a supporting program suite (EMBOSS). Based on the final scores produced by both packages, these two viruses may not be very similar.

Lastly, Appendices A and B present the software component design pattern and implementation that support the design of bioinformatics components.

The package is designed as a collection of reusable components for the Sun Microsystems™ Java platform. It is intended to help bioinformatics software developers reuse this protein sequence alignment functionality and enhance their software designs in any demanding field of biotechnology.




Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables

Chapter 0: Computational Genome Sequence
    Introduction
    0.1 Life and Life-Form
    0.2 Molecular Genome Information and Computer Science
    0.3 Common Genome Sequence Analyses
    0.4 Algorithm Drive of Computational Molecular Biology

Chapter 1: Sequence Alignment
    Introduction
    1.1 Fundamentals
    1.2 Global and Local Alignment
    1.3 Dynamic Programming
    1.4 Needleman-Wunsch Algorithm
        1.4.1 Simple scoring scheme
            Initialization step
            Matrix fill step
            Traceback step
        1.4.2 Advanced scoring scheme
            Initialization step
            Matrix fill step
            Traceback step

Chapter 2: Protein and the Point Accept Mutation Matrix
    Introduction
    2.1 Protein
    2.2 Protein Sequence Comparison
    2.3 Dayhoff Point Accept Mutation Matrix

Chapter 3: Source Information Programming
    Introduction
    3.1 XML
        3.1.1 NCBI Tiny Sequence XML Format
        3.1.2 Document Type Definitions for Tiny Sequence XML
    3.2 XML Data Binding for Java Application
        3.2.1 BorlandXML Data Binding Architecture

Chapter 4: Verification and Implementation of Computational JavaBeans Package
    Introduction
    4.1 Comparison and Selection
    4.2 Horse Shipping Fever
        4.2.1 Symptoms of Pleuropneumonia
        4.2.2 Bovine Parainfluenza Virus Genome Information
    4.3 Severe Acute Respiratory Syndrome
    4.4 European Molecular Biology Open Software Suite (EMBOSS)
        4.4.1 The EMBOSS Package Program
        4.4.2 EMBOSS-EBI Web Service
    4.5 Verification of the Computational JavaBean Component Package
        4.5.1 Bovine Parainfluenza Virus and SARS Coronavirus Analysis

Summary
References

Appendix A: The Design and Programming for Computational JavaBean Components Package
    A.1 Sun Microsystems JavaBean
        A.1.1 JavaBean Features
    A.2 Optimal Global Alignment and PAM
    A.3 UML Definition
    A.4 UML for Computational JavaBean Component
        A.4.1 UML Use Case Diagram
        A.4.2 UML Sequence Diagram
        A.4.3 UML Activity Diagram
        A.4.4 UML Deployment Diagram
        A.4.5 UML Component Diagram
    A.5 PAM JavaBean Component
        A.5.1 UML Component Diagram (Substitution Matrix)
    A.6 XML Reader JavaBean Component
        A.6.2 UML Class Diagram Example: xmlreader.java
        A.6.3 UML Class Diagram Example: TSeqTaxid.java

Appendix B: How to Use This Package?
    B.1 Basic Concept of Reusable JavaBean Component
    B.2 How to Use the Components from the Package?

Appendix C: NCBI Documentation
Appendix D: EMBOSS Programs
Appendix E: Complete Source Code

List of Figures

Figure 1: DNA and RNA strand structure, Resource: Math of Life, Figure 3.17
Figure 2: Protein, Resource: Math of Life, Figure 3.5
Figure 3: 3-D protein structure example (X-Ray Crystal Structure of the SARS Coronavirus Main Protease, 29-Jul-2003 Release), Resource: The Protein Data Bank, RasMol graphics visualization
Figure 4: Basic dynamic programming algorithm for comparison of two sequences
Figure 5: Recursive algorithm for optimal alignment
Figure 6: The data processing architecture
Figure 7: XML Reader Class JavaDoc Example
Figure 8: EMBL-EBI Web Service Example
Figure 9: Sequence Alignment JavaBean Package Architecture
Figure 10: UML Use Case Model represents the Demonstration of Component Package User Interface
Figure 11: UML Sequence Web Service Diagram
Figure 12: UML Sequence XML Input Diagram
Figure 13: UML Sequence Alignment Process Diagram
Figure 14: UML Activity Diagram represents the program’s workflow
Figure 15: UML Deployment Diagram for Components’ System Architecture
Figure 16: UML Component Diagram represents the relationship of whole package components
Figure 17: UML Component Diagram represents the relationship of the alignment components package
Figure 18: UML Class Diagram represents the access of each component’s pattern
Figure 19: UML Component Diagram represents the relationship of each PAM component
Figure 20: UML Class Diagram represents the access of each PAM component’s pattern
Figure 21: UML Component Diagram represents the relationship of each class of the XML Reader
Figure 22: UML Class Diagram for the whole XML reader class
Figure 23: UML Class Diagram for one of the XML reader classes

List of Tables

Table 1: Amino acids and their properties, Resource: Math of Life
Table 2: Marginal Sums, Relative Frequencies and Mutabilities, Resource: 1989 Mutation Data Matrix, Geetha Y. Srinivasarao, David G. George, and Winona C. Barker, National Biomedical Research Foundation
Table 3: The probability of protein mutation rate, Resource: David W. Mount, Bioinformatics Sequence and Genome Analysis, Page 81
Table 4: M Matrix, Resource: [5]
Table 5: M(250) Matrix, Resource: [5]
Table 6: PAM (250) logarithm of odds matrix (S(250) Matrix), Resource: David W. Mount, Bioinformatics Sequence and Genome Analysis, Page 82
Table 7: Tiny Sequence XML Data Format
Table 8: The NCBI Tiny Sequence XML Tree View, Implementation by XMLSPY
Table 9: Some EMBOSS program lists

Chapter 0

COMPUTATIONAL GENOME SEQUENCE

Introduction

Bringing the complexity of life into computational science is a major challenge in biotechnology and the bio-industry. What is the relationship between life and machine? In this chapter, we describe the fundamental features of life and life-forms and the relationship between molecular genome information and computer science. Building on the relationship between these two major fields, the chapter introduces common research and analysis topics for genome sequence problems. Finally, it emphasizes genome algorithms as the key tool of current computational molecular science for computing and analyzing problems in molecular biology.

0.1 Life and life-form


Trying to define precisely what is living and what is non-living can be rather confusing at first. There is no simple definition of what it is to be alive and what is not. One might take the ability to reproduce itself as the fundamental feature of life, but independent reproduction turns out not to be essential. Take viruses for instance. Viruses can only multiply when they have entered a suitable host cell and taken over the cellular machinery. In fact, viruses are nearly pure genetic material wrapped in a protective coating. The host cell that a virus infects does all the synthetic work involved in creating new viruses. Although viruses are not living cells, they are life-forms consisting primarily of genetic information.

Another approach to defining “life” is to recognize its fundamental relationships. All living things are related to each other. Any pair of organisms, no matter how different, has a common ancestor sometime in the distant past. Organisms came to differ from each other, and to reach modern levels of complexity, through evolution. According to Darwin's theory of evolution, evolution has three driving factors behind it: inheritance, the passing of characteristics from parents to offspring; variation, the processes that make offspring other than exact copies of their parents; and selection, the process that differentially favors the reproduction of some organisms, and hence their characteristics, over others.

These three factors define an evolutionary process. Perhaps, in this context, the best definition of life is that it is the result of the evolutionary process taking place on Earth. Evolution is therefore the key not only to defining what counts as life but also to understanding how living systems function.

0.2 Molecular Genome Information and Computer Science

Given the complexity of life discussed above, the independent unit of life is the living cell. A cell carries genetic information in genes made of deoxyribonucleic acid (DNA) and possesses its own machinery to generate energy and to synthesize proteins, DNA, and ribonucleic acid (RNA). Figure 1 illustrates the structures of DNA and RNA.



Figure 1: DNA and RNA strand structure, Resource: Math of Life, Figure 3.17

Figure 2: Protein, Resource: Math of Life, Figure 3.5

Figure 3: 3-D protein structure example (X-Ray Crystal Structure of the SARS Coronavirus Main Protease, 29-Jul-2003 Release), Resource: The Protein Data Bank, RasMol graphics visualization

A. Kent and J. G. Williams (1994) described the relationship between molecular biology and computer science and wrote “... DNA as a one-dimensional character string, abstracting away the reality of DNA as a flexible three-dimensional molecule, interacting in a dynamic environment with protein and RNA, and repeating a life-cycle in which even the classic linear chromosome exists for only a fraction of the time. A similar, but stronger, assumption existed for protein, holding, for example, that all the information needed for correct three-dimensional folding is contained in the protein sequence itself, essentially independent of the biological environment the protein lives in. This assumption has recently been modified, but remains largely intact.” [1]

In addition, M. V. Olson (1995) emphasized the importance of sequence-level investigation and wrote “The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G’s, A’s, T’s and C’s. This string is the root data structure of an organism’s biology.” [2]

G. von Heijne discussed sequences in his treatise and wrote “In a very real sense, molecular biology is all about sequences. First, it tries to reduce complex biochemical phenomena to interactions between defined sequences. ...” [3]

J. Monod also pointed out that “The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequence of residues of nascent polypeptide chains... In a real sense it is at this level of organization that the secret of life (if there is one) is to be found.” [4]

Based on the conclusions of these pioneering researchers, both sequence and structure can be analyzed and defined by adopting the abstractions of “DNA-as-string” (that is, a mathematical string as a finite sequence of symbols) and “protein-as-three-dimensional-labeled-graph”. Computer science can clearly provide the much-needed abstractions for biomolecular systems. Furthermore, advanced concepts in computer science are being used to investigate the “molecule-as-computation” abstraction, in which a system of interacting molecular entities is captured and modeled by a system of interacting computational entities.

0.3 Common Genome Sequence Analyses


Because genome sequences are computable, computer science need not be much concerned with the more complex chemical and biological aspects of DNA and protein. Bioinformatics researchers are now empowered to tackle a new array of biologically important problems defined primarily on sequences, in much the same sense that strings are defined in computer science. Typical examples include, but are not limited to, research tasks such as:

• Reconstructing long sequences of DNA from overlapping sequence fragments

• Determining physical and genetic maps from probe data under various experimental protocols

• Storing, retrieving, and comparing DNA strings

• Comparing two or more sequences for similarities

• Searching databases for related strings and substrings

• Defining and exploring different notions of string relationship

• Exploring new or ill-defined patterns occurring frequently in DNA

• Looking for structural patterns in DNA and protein; determining secondary (two-dimensional) structure of RNA

• Finding conserved, but faint, patterns in many DNA and protein sequences

0.4 Algorithm Drive of Computational Molecular Biology


Algorithms that operate on molecular sequence data (represented as strings) are at the heart of computational molecular biology. The big-picture question in computational molecular biology is how to “do” as much “real biology” as possible by exploiting molecular sequence data (including DNA, RNA, and protein), now that obtaining sequence data is becoming faster than more traditional laboratory investigation. Algorithms that operate on strings and sequences will continue to be one of the major areas of closest intersection and interaction between computer science and molecular biology.

A central task in this field is therefore to understand the fundamentals of string algorithms and to implement them in a machine-readable programming language, in order to investigate problems involving sequences or target protein structures. The study of theoretical algorithms and the investigation of genome information will go hand in hand and continue to grow in the arena of genetic analysis.









Chapter 1

SEQUENCE ALIGNMENT

Introduction

In this chapter we consider the fundamentals of sequence data in molecular biology. We start with the classic algorithmic technique of dynamic programming, global and local alignment, and the Needleman-Wunsch algorithm for solving the matching problem. At the end, pseudo-code for both the similarity matrix and the sequence alignment illustrates how molecular genome problems can be reduced to mechanical computation.

1.1 Fundamentals

In computational biology, sequence data is the most abundant type of
biological data available electronically. From gene sequences to the proteins,
the importance of sequence databases to biology remains central.
Computational approaches to biological

questions are vital to help researchers
analyze data and formulate hypotheses.

The central dogma of molecular biology states that DNA acts as a template to replicate itself, that DNA is transcribed into RNA, and that RNA is translated into protein. DNA is a linear polymer made up of individual chemical units called nucleotides or bases. The four nucleotides that make up the DNA sequences of living things on Earth are adenine, thymine, cytosine, and guanine, designated A, T, C, and G, respectively. In reality, DNA and proteins are complicated 3-D molecules, composed of thousands or even millions of atoms bonded together.

To grasp such complicated systems, researchers recognized that it was convenient to represent them as strings (sequences) of single characters. Instead of representing each nucleotide in a DNA sequence as a detailed 3-D chemical entity, it can be represented simply as A, T, C, or G.
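The string abstraction described above can be sketched in a few lines of Java. This is only an illustration, not code from the thesis package: the class name `DnaString` and the method `isDna` are hypothetical, and the check simply confirms that every character belongs to the four-letter alphabet A, T, C, G.

```java
// Hypothetical helper (not part of the thesis package): treats a DNA
// sequence as a plain string over the four-letter alphabet A, T, C, G.
final class DnaString {
    private static final String ALPHABET = "ATCG";

    /** Returns true if the sequence is non-empty and uses only A, T, C, G. */
    static boolean isDna(String sequence) {
        if (sequence == null || sequence.isEmpty()) {
            return false;
        }
        for (int k = 0; k < sequence.length(); k++) {
            if (ALPHABET.indexOf(sequence.charAt(k)) < 0) {
                return false; // character outside the DNA alphabet
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isDna("GAATTCAGTTA")); // true
        System.out.println(isDna("GAUTC"));       // false: U is not in the alphabet
    }
}
```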

In DNA, four nucleotide monomers (A, T, C, and G) are used to build the polymer chain; in proteins, 20 amino acid monomers are used. Both DNA and proteins are polymers, that is, chains of repeating chemical units (monomers) with a common backbone holding them together.

By definition, sequence alignment is the procedure of comparing two or more sequences, which is accomplished by searching for a series of individual characters or character patterns that are in the same order in the sequences. In practice, two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and non-identical characters can either be placed in the same column as a mismatch or opposite a gap in the other sequence. In an optimal alignment, non-identical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register. Sequences that can be readily aligned in this manner are said to be similar.
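The two-row picture of an alignment can be made concrete with a minimal Java sketch. It assumes that the two rows have already been aligned to equal length and that a gap is written as `-`. The class name `AlignmentColumns`, the method `classify`, and the particular alignment in `main` are illustrative choices, not taken from the thesis package.

```java
// Hypothetical illustration: classify each column of a two-row alignment
// as a match, a mismatch, or a gap. A gap is assumed to be written as '-'.
final class AlignmentColumns {

    /** Returns {matches, mismatches, gaps} for two aligned rows of equal length. */
    static int[] classify(String row1, String row2) {
        if (row1.length() != row2.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        int matches = 0;
        int mismatches = 0;
        int gaps = 0;
        for (int k = 0; k < row1.length(); k++) {
            char a = row1.charAt(k);
            char b = row2.charAt(k);
            if (a == '-' || b == '-') {
                gaps++;        // character opposite a gap
            } else if (a == b) {
                matches++;     // identical characters in the same column
            } else {
                mismatches++;  // non-identical characters in the same column
            }
        }
        return new int[] {matches, mismatches, gaps};
    }

    public static void main(String[] args) {
        // One possible alignment of two short DNA strings, chosen for illustration.
        int[] counts = classify("GAATTCAGTTA", "GGA-TC-G--A");
        System.out.println(counts[0] + " matches, " + counts[1]
                + " mismatches, " + counts[2] + " gaps");
        // prints "6 matches, 1 mismatches, 4 gaps"
    }
}
```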

1.2 Global and local alignment

There are two types of sequence alignment: global and local. In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.

In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or sub-alignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.



Alignment of two sequences is performed using the following methods:

1. Dot matrix analysis

2. The dynamic programming (or DP) algorithm

3. Word or k-tuple methods, such as those used by the programs FASTA and BLAST




This thesis focuses on the dynamic programming algorithm for the design and implementation of two-sequence alignment (also known as pairwise alignment).

1.3 Dynamic programming

By definition, dynamic programming is an algorithmic technique in which an optimization problem is solved by storing the solutions to sub-problems and reusing them instead of re-computing them.

Dynamic programming was the brainchild of an American mathematician, Richard Bellman, who described a way of solving problems in which the best decisions must be found one after another. In the forty-odd years since its development, the number of uses and applications of dynamic programming has increased enormously.

In fact, the word “programming” in the name has nothing to do with writing computer programs. Mathematicians use the word to describe a set of rules that anyone can follow to solve a problem. The rules do not have to be written in a computer language.

1.4 Needleman-Wunsch Algorithm

Here is an example of global sequence alignment using the Needleman-Wunsch technique. Consider the following two sequences, which are to be globally aligned:

Sequence #1: G A A T T C A G T T A

Sequence #2: G G A T C G A

We set M = 11 and N = 7, where M denotes the length of Sequence #1 and N that of Sequence #2.

1.4.1 Simple scoring scheme

A simple scoring scheme is assumed, where

$S_{i,j} = 1$ if the residue at position $i$ of Sequence #1 is the same as the residue at position $j$ of Sequence #2 (match score); otherwise

$S_{i,j} = 0$ (mismatch score)

$w = 0$ (gap penalty)
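As a minimal sketch, the simple scoring scheme can be written as a single Java method. The class name `SimpleScoring`, the method `score`, and the constant names are hypothetical; positions are 1-based to agree with the subscripts of S above.

```java
// Hypothetical sketch of the simple scoring scheme: match = 1, mismatch = 0,
// gap penalty w = 0. Positions i and j are 1-based, matching S(i,j) in the text.
final class SimpleScoring {
    static final int MATCH = 1;
    static final int MISMATCH = 0;
    static final int GAP = 0; // gap penalty w

    /** S(i,j): compares residue i of sequence #1 with residue j of sequence #2. */
    static int score(String seq1, String seq2, int i, int j) {
        return seq1.charAt(i - 1) == seq2.charAt(j - 1) ? MATCH : MISMATCH;
    }

    public static void main(String[] args) {
        String seq1 = "GAATTCAGTTA";
        String seq2 = "GGATCGA";
        System.out.println(score(seq1, seq2, 1, 1)); // G vs G, prints 1
        System.out.println(score(seq1, seq2, 2, 2)); // A vs G, prints 0
    }
}
```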


In a DNA chain, the four nucleotides can occur in any order, and the order in which they occur determines what the DNA does. In a protein, amino acids can occur in any order, and their order determines the protein’s fold and function.

This chapter discusses simple sequence alignment over the four DNA characters (A, T, C, and G). The protein sequence alignment algorithm is the same, except that it uses the 20 amino acid characters (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) together with the ambiguity codes B, Z, and X.

In the next chapter, we describe a better way to decide the match and mismatch scores for proteins, based on their genetic evolutionary relationships.


Three steps in dynamic programming

1. Initialization

2. Matrix fill (scoring)

3. Traceback (alignment)


Initialization Step

The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows, where M and N are the lengths of the two sequences to be aligned. Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.
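The initialization step can be sketched in Java as follows. The names `AlignmentMatrixInit` and `initialize` are hypothetical. Filling the borders with multiples of the gap penalty is the usual convention for a linear gap penalty and is an assumption added here; with the penalty of 0 used in this example it reduces to filling the first row and first column with 0.

```java
// Hypothetical sketch of the initialization step: an (N + 1) x (M + 1) matrix
// whose first row and first column hold accumulated gap penalties.
final class AlignmentMatrixInit {

    /** m = length of sequence #1 (columns), n = length of sequence #2 (rows). */
    static int[][] initialize(int m, int n, int gapPenalty) {
        int[][] matrix = new int[n + 1][m + 1]; // N + 1 rows, M + 1 columns
        for (int col = 0; col <= m; col++) {
            matrix[0][col] = col * gapPenalty;  // first row
        }
        for (int row = 0; row <= n; row++) {
            matrix[row][0] = row * gapPenalty;  // first column
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[][] matrix = initialize(11, 7, 0);  // M = 11, N = 7, w = 0
        System.out.println(matrix.length + " rows x " + matrix[0].length + " columns");
        // prints "8 rows x 12 columns"
    }
}
```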



Matrix fill step

One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left-hand corner of the matrix and finding the maximal score $M_{i,j}$ for each position in the matrix.



In order to find $M_{i,j}$ for any $i,j$, it is sufficient to know the scores for the matrix positions to the left of, above, and diagonal to $i,j$. In terms of matrix positions, it is necessary to know $M_{i-1,j}$, $M_{i,j-1}$ and $M_{i-1,j-1}$.

For each position, $M_{i,j}$ is defined to be the maximum score at position $i,j$; that is,

$$
M_{i,j} = \max \begin{cases}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{cases}
$$
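The recurrence above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `SimpleScoreFill` and the array layout (`m[i][j]`, with `i` running over Sequence #1 and `j` over Sequence #2) are hypothetical choices, not part of the thesis package.

```java
final class SimpleScoreFill {
    /**
     * Fills the score matrix for the simple scheme (match 1, mismatch 0, gap 0).
     * m[i][j] is the best score for the first i residues of seq1
     * against the first j residues of seq2.
     */
    static int[][] fill(String seq1, String seq2) {
        final int w = 0; // gap penalty of the simple scheme
        // Java zero-initializes the array, which matches the initialization step.
        int[][] m = new int[seq1.length() + 1][seq2.length() + 1];
        for (int i = 1; i <= seq1.length(); i++) {
            for (int j = 1; j <= seq2.length(); j++) {
                int s = seq1.charAt(i - 1) == seq2.charAt(j - 1) ? 1 : 0; // S_{i,j}
                int diagonal = m[i - 1][j - 1] + s; // match/mismatch
                int gapInSeq1 = m[i][j - 1] + w;    // M_{i,j-1} + w
                int gapInSeq2 = m[i - 1][j] + w;    // M_{i-1,j} + w
                m[i][j] = Math.max(diagonal, Math.max(gapInSeq1, gapInSeq2));
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int[][] m = fill("GAATTCAGTTA", "GGATCGA");
        System.out.println("M[1][1] = " + m[1][1]);
        System.out.println("final score = " + m[11][7]);
    }
}
```

For the two example sequences, the bottom-right cell `m[11][7]` holds the global alignment score.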

Note that in this example, $M_{i-1,j-1}$ will be marked in red, $M_{i,j-1}$ will be in green and $M_{i-1,j}$ will be in blue.

Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, $S_{1,1} = 1$, and by the assumptions stated at the beginning, $w = 0$. Thus, $M_{1,1} = \max[M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0] = \max[1, 0, 0] = 1$.

A value of 1 is then placed in position 1,1 of the scoring matrix.






Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1. Take the example of row 1. At column 2, the value is the maximum of 0 (for a mismatch), 0 (for a vertical gap) or 1 (horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8. At this point, there is a G in both sequences (light blue). Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) or 1 (horizontal gap). Again, the value will be 1. The rest of row 1 and column 1 can be filled with 1 using the above reasoning.





Now let us examine column 2. The location at row 2 will be assigned the value of the maximum of 1 (mismatch), 1 (horizontal gap) or 1 (vertical gap). So its value is 1.

At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap), 1 (vertical gap). So, its value is 2.

Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap), 2 (vertical gap). So, its value is 2.

Note that for all of the remaining positions except the last one in column 2, the choices for the value will be exactly the same as in row 4 since there are no matches. The final row will contain the value 2 since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap).





Using the same techniques as described for column 2, we can fill in column 3.



After filling in all of the values, the score matrix appears as follows:






Traceback step

After the matrix fill step, the maximum alignment score for the two test sequences turns out to be 6. The traceback step determines the actual alignment that results in the maximum score. Note that with a simple scoring algorithm such as the one that is used here, there are likely to be multiple maximal alignments.

The traceback step begins in the M,N position in the matrix (the bottom right-hand corner), i.e. the position that leads to the maximal score. In this case, there is a value of 6 in that location.

Traceback takes the current cell and looks to the neighbor cells that could be direct ancestors. This means it looks to the neighbor to the left (gap in Sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in Sequence #1). The traceback algorithm chooses one of the possible ancestors as the next cell in the path. In this case, the neighbors are marked in red. They are all equal to 5.


Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible ancestor is the diagonal match/mismatch neighbor. If more than one possible ancestor exists, any can be chosen arbitrarily. This gives us a current alignment of

(Sequence #1) A
              |
(Sequence #2) A

So now we look at the current cell and determine its direct ancestor. In this case, it is the cell with the value of 5 in red.





The alignment as described in the above step adds a gap to Sequence #2, so the current alignment is

(Sequence #1) T A
                |
(Sequence #2) _ A


Once again, the direct ancestor produces a gap in Sequence #2.





After this step, the current alignment appears as follows:

(Sequence #1) T T A
                  |
(Sequence #2) _ _ A

Continuing with the traceback step, we eventually reach the position in column 0, row 0, which tells us that traceback is complete. One possible maximum alignment is:

Giving an alignment of:

G A A T T C A G T T A
|   |   | |   |     |
G G A _ T C _ G _ _ A


An alternative solution is:

Giving an alignment of:

G _ A A T T C A G T T A
|     |   | |   |     |
G G _ A _ T C _ G _ _ A

There are more alternative solutions, each of which results in a maximal global alignment score of 6. Since the number of optimal alignments can grow exponentially with sequence length, most dynamic programming implementations print out only a single solution.
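A traceback that reports a single maximal alignment can be sketched as follows. The class name `SimpleTraceback` and the tie-breaking order (diagonal first, then a gap in Sequence #2, then a gap in Sequence #1) are assumptions made for illustration; any other order also yields an alignment with the maximal score, though possibly a different one.

```java
final class SimpleTraceback {
    /** Returns {aligned seq1, aligned seq2} for the simple scheme (match 1, mismatch 0, gap 0). */
    static String[] align(String a, String b) {
        int[][] m = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? 1 : 0;
                m[i][j] = Math.max(m[i - 1][j - 1] + s, Math.max(m[i][j - 1], m[i - 1][j]));
            }
        }
        StringBuilder top = new StringBuilder();
        StringBuilder bottom = new StringBuilder();
        int i = a.length();
        int j = b.length();
        // Walk from the bottom-right cell back to (0,0), one ancestor at a time.
        while (i > 0 || j > 0) {
            int s = (i > 0 && j > 0 && a.charAt(i - 1) == b.charAt(j - 1)) ? 1 : 0;
            if (i > 0 && j > 0 && m[i][j] == m[i - 1][j - 1] + s) {
                top.append(a.charAt(i - 1));      // diagonal: residues are paired
                bottom.append(b.charAt(j - 1));
                i--;
                j--;
            } else if (i > 0 && m[i][j] == m[i - 1][j]) {
                top.append(a.charAt(i - 1));      // gap in Sequence #2
                bottom.append('_');
                i--;
            } else {
                top.append('_');                  // gap in Sequence #1
                bottom.append(b.charAt(j - 1));
                j--;
            }
        }
        // The alignment was built back to front, so reverse both rows.
        return new String[] { top.reverse().toString(), bottom.reverse().toString() };
    }

    public static void main(String[] args) {
        String[] r = align("GAATTCAGTTA", "GGATCGA");
        System.out.println(r[0]);
        System.out.println(r[1]);
    }
}
```

Because only one ancestor is followed at each cell, the sketch runs in time proportional to the alignment length and never enumerates the alternative paths.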


1.4.2 Advanced scoring scheme


The difference between the simple scoring scheme and the advanced scoring scheme is that, in the advanced scheme, a match, a mismatch and a gap each have a different score (weight).

An advanced scoring scheme is assumed, where

$S_{i,j} = 2$ if the residue at position $i$ of sequence #1 is the same as the residue at position $j$ of sequence #2 (match score); otherwise

$S_{i,j} = -1$ (mismatch score)

$w = -2$ (gap penalty)


Initialization step

The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows, where M and N correspond to the sizes of the sequences to be aligned, respectively.

The first row and first column of the matrix can be initially filled with 0.



Matrix fill step

One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left-hand corner of the matrix and finding the maximal score $M_{i,j}$ for each position in the matrix. In order to find $M_{i,j}$ for any $i,j$, it is sufficient to know the scores for the matrix positions to the left of, above, and diagonal to $i,j$. In terms of matrix positions, it is necessary to know $M_{i-1,j}$, $M_{i,j-1}$ and $M_{i-1,j-1}$.

For each position, $M_{i,j}$ is defined to be the maximum score at position $i,j$; i.e.

$$
M_{i,j} = \max \begin{cases}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{cases}
$$

Note that in the example, $M_{i-1,j-1}$ will be marked in red, $M_{i,j-1}$ will be in green and $M_{i-1,j}$ will be in blue.

Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, $S_{1,1} = 2$, and by the assumptions stated earlier, $w = -2$. Thus, $M_{1,1} = \max[M_{0,0} + 2,\ M_{1,0} - 2,\ M_{0,1} - 2] = \max[2, -2, -2] = 2$.

A value of 2 is then placed in position 1,1 of the scoring matrix. Note that there is also an arrow placed back into the cell that resulted in the maximum score, M[0,0].





Moving down the first column to row 2, we can see that there is once again a match in both sequences. Thus, $S_{1,2} = 2$. So $M_{1,2} = \max[M_{0,1} + 2,\ M_{1,1} - 2,\ M_{0,2} - 2] = \max[0 + 2,\ 2 - 2,\ 0 - 2] = \max[2, 0, -2] = 2$.

A value of 2 is then placed in position 1,2 of the scoring matrix and an arrow is placed to point back to M[0,1], which led to the maximum score.





Looking at column 1 row 3, there is not a match in the sequences, so $S_{1,3} = -1$. $M_{1,3} = \max[M_{0,2} - 1,\ M_{1,2} - 2,\ M_{0,3} - 2] = \max[0 - 1,\ 2 - 2,\ 0 - 2] = \max[-1, 0, -2] = 0$.

A value of 0 is then placed in position 1,3 of the scoring matrix and an arrow is placed to point back to M[1,2], which led to the maximum score.



We can continue filling in the cells of the scoring matrix using the same
reasoning.

Eventually, we get to column 3 row 2. Since there is not a match in the sequences at this position, $S_{3,2} = -1$. $M_{3,2} = \max[M_{2,1} - 1,\ M_{3,1} - 2,\ M_{2,2} - 2] = \max[0 - 1,\ -1 - 2,\ 1 - 2] = \max[-1, -3, -1] = -1$.





Note that in the above case, there are two different ways to get the maximum score. Here, pointers are placed back to all of the cells that can produce the maximum score.
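The hand calculations above can be checked with a small Java sketch that takes the three scores as parameters. The class name `AdvancedScoreFill` is hypothetical, and the sketch deliberately follows the worked example in keeping the first row and column at 0; the `Global_Pairwise` class shown later initializes the border with accumulated gap penalties instead.

```java
final class AdvancedScoreFill {
    /**
     * Fills the score matrix with explicit match, mismatch and gap scores.
     * m[i][j] uses i for positions in seq1 and j for positions in seq2.
     * The border stays at 0, as in the worked example of this section.
     */
    static int[][] fill(String seq1, String seq2, int match, int mismatch, int gap) {
        int[][] m = new int[seq1.length() + 1][seq2.length() + 1];
        for (int i = 1; i <= seq1.length(); i++) {
            for (int j = 1; j <= seq2.length(); j++) {
                int s = seq1.charAt(i - 1) == seq2.charAt(j - 1) ? match : mismatch;
                int diagonal = m[i - 1][j - 1] + s;
                int gapInSeq1 = m[i][j - 1] + gap;
                int gapInSeq2 = m[i - 1][j] + gap;
                m[i][j] = Math.max(diagonal, Math.max(gapInSeq1, gapInSeq2));
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int[][] m = fill("GAATTCAGTTA", "GGATCGA", 2, -1, -2);
        System.out.println("M[1][1] = " + m[1][1]);
        System.out.println("M[1][2] = " + m[1][2]);
        System.out.println("M[1][3] = " + m[1][3]);
        System.out.println("M[3][2] = " + m[3][2]);
    }
}
```

The four printed cells correspond to the positions 1,1, 1,2, 1,3 and 3,2 computed by hand in this section.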


The rest of the score matrix can then be filled in accordingly. The completed score matrix will be as follows:






Traceback step

After the matrix fill step, the maximum global alignment score for the two sequences is 3. The traceback step will determine the actual alignment that results in the maximum score.

The traceback step begins in the M,N position in the matrix (the bottom right-hand corner), i.e. the position where both sequences are globally aligned.

Since we have kept pointers back to all possible ancestors, the traceback step is simple. At each cell, we look to see where we move next according to the pointers. To begin, the only possible ancestor is the diagonal match.






This gives us an alignment of

A
|
A

Note that the blue-colored letters and gold arrows indicate the path leading to the maximum score.

We can continue to follow the path using a single pointer until we get to the following situation.





The alignment at this point appears as follows:

T C A G T T A
| |   |     |
T C _ G _ _ A

Note that there are now two possible neighbors that could result in the current score. In such a case, one of the neighbors is arbitrarily chosen.

Once the traceback is completed, we observe that there are only two possible paths leading to a maximal global alignment.

One possible path is as follows:





This gives an alignment of

G A A T T C A G T T A
|   |   | |   |     |
G G A _ T C _ G _ _ A

The other possible path is as follows:






This gives an alignment of

G A A T T C A G T T A
|   | |   |   |     |
G G A T _ C _ G _ _ A

Since the scoring scheme is +2 for a match, -1 for a mismatch, and -2 for a gap, both alignments can be tested to make sure that they result in a score of 3.



 G  A  A  T  T  C  A  G  T  T  A
 |     |     |  |     |        |
 G  G  A  _  T  C  _  G  _  _  A
 2 -1  2 -2  2  2 -2  2 -2 -2  2

Total score: 2 - 1 + 2 - 2 + 2 + 2 - 2 + 2 - 2 - 2 + 2 = 3




 G  A  A  T  T  C  A  G  T  T  A
 |     |  |     |     |        |
 G  G  A  T  _  C  _  G  _  _  A
 2 -1  2  2 -2  2 -2  2 -2 -2  2

Total score: 2 - 1 + 2 + 2 - 2 + 2 - 2 + 2 - 2 - 2 + 2 = 3


Both of these alignments do indeed result in the maximal alignment score.
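The same check can be automated. The following Java sketch scores a gapped alignment column by column; the class name `AlignmentScore` and the use of `_` as the gap character are assumptions taken from the notation of the examples above.

```java
final class AlignmentScore {
    /** Sums the per-column scores of two aligned rows of equal length ('_' marks a gap). */
    static int score(String top, String bottom, int match, int mismatch, int gap) {
        if (top.length() != bottom.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        int total = 0;
        for (int k = 0; k < top.length(); k++) {
            char x = top.charAt(k);
            char y = bottom.charAt(k);
            if (x == '_' || y == '_') {
                total += gap;        // one residue is paired with a gap
            } else if (x == y) {
                total += match;      // identical residues
            } else {
                total += mismatch;   // different residues
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("GAATTCAGTTA", "GGA_TC_G__A", 2, -1, -2));
        System.out.println(score("GAATTCAGTTA", "GGAT_C_G__A", 2, -1, -2));
    }
}
```

Passing the scores as parameters lets the same method verify the simple scheme (1, 0, 0) as well as the advanced one (2, -1, -2).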


Based on the discussion of the Needleman-Wunsch algorithm, developers can use the dynamic programming technique to design the appropriate tasks or machine instructions for computing the alignment of two sequences.

In computer science, algorithms are often presented with pseudo-code. The purpose of pseudo-code is to help readers or developers who wish to implement the algorithm as defined.



Pseudo-code (pronounced SOO-doh-kohd) is a detailed yet readable description of what a computer program or algorithm must do, expressed in a formally styled natural language rather than in a programming language. Pseudo-code is sometimes used as an intermediate step in the process of developing a program. It allows designers or lead programmers to express the design in great detail and provides the programming project team with a detailed template for implementing the code in a specific programming language.

The similarity matrix assignment for two sequences provides the result matrix of the comparison of the two sequences. The similarity pseudo-code takes the two sequences s and t as input and outputs the result matrix between s and t.

The following section contains the similarity pseudo-code. By using the dynamic programming concept, the pseudo-code reduces the mathematical complexity of the procedure to a relatively simple task.





Figure 4: Basic dynamic programming algorithm for comparison of two sequences


The alignment computation can use a recursive technique that works from the similarity matrix. However, a traceback stack is critical for recording the points at which there are options, and for backtracking through the matrix to explore all possibilities of reaching (0,0) in the traceback step discussed above.

Let us examine the following recursive code for optimal alignment, which takes the similarity matrix of the two sequences as input and outputs the two aligned sequences and their length. The traceback step is omitted for better readability.





Figure 5: Recursive algorithm for optimal alignment


Example source code for similarity, alignment and traceback is as follows; the rest of the source code is located in Appendix E:


class Global_Pairwise extends TrackAlign {

    public Global_Pairwise(SubMatrix matrix, int gap, String seq1, String seq2) {
        super(matrix, gap, seq1, seq2);
        int n = this.n, m = this.m;

        // Border cells: a prefix aligned against nothing costs one gap per residue.
        for (int i = 1; i <= n; i++) {
            Final_Matrix[i][0] = -gap * i;
            Trackback_Matrix[i][0] = new Traceback_Assign(i - 1, 0);
        }
        for (int j = 1; j <= m; j++) {
            Final_Matrix[0][j] = -gap * j;
            Trackback_Matrix[0][j] = new Traceback_Assign(0, j - 1);
        }

        // Matrix fill: each cell takes the best of the diagonal and the two gap moves.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diagonal = Final_Matrix[i - 1][j - 1]
                        + matrix.getScore(seq1.charAt(i - 1), seq2.charAt(j - 1));
                int val = max(diagonal, Final_Matrix[i - 1][j] - gap, Final_Matrix[i][j - 1] - gap);
                Final_Matrix[i][j] = val;

                // Record which neighbor produced the maximum, for the traceback step.
                if (val == diagonal) {
                    Trackback_Matrix[i][j] = new Traceback_Assign(i - 1, j - 1);
                } else if (val == Final_Matrix[i - 1][j] - gap) {
                    Trackback_Matrix[i][j] = new Traceback_Assign(i - 1, j);
                } else if (val == Final_Matrix[i][j - 1] - gap) {
                    Trackback_Matrix[i][j] = new Traceback_Assign(i, j - 1);
                } else {
                    throw new Error("Global alignment error");
                }
            }
        }

        // Traceback starts from the bottom-right cell.
        start_point = new Traceback_Assign(n, m);
    }
}









C h a p t e r 2

PROTEIN AND THE POINT ACCEPTED MUTATION MATRIX

Introduction

In the previous chapter, we discussed general sequence alignment. However, protein sequence analysis has to take into account the evolution of proteins through mutation. In this chapter, we briefly introduce the protein units inside the living cell and their structure. To address the problem of protein evolution, this chapter presents the mutation probabilities for all amino acids from Dr. Margaret Dayhoff's research analysis and calculation. In addition, for the problems of mutation and alignment between two protein sequences, Dayhoff's point accepted mutation matrix provides an appropriate solution for protein sequence alignment.

2.1 Protein

Proteins are the molecules that accomplish most of the functions of a living cell. Proteins are named after Proteus, an ancient Greek god of the sea, who could change himself into any shape he liked, usually to evade his obligations. The number of different structures and functions that proteins take on in a single organism is staggering. They make possible all of the chemical reactions in the cell by acting as enzymes that promote specific chemical reactions, which would otherwise occur so slowly as to be negligible. The action of promoting chemical reactions is called catalysis. Therefore, a more general term for enzymes is catalysts.





Proteins also provide a living cell with structural support, and are the keys to how the immune system distinguishes itself from its invaders. They provide the mechanism for acquiring and transforming energy. They underlie sensors and the transmission of information as well.

All proteins are constructed from linear sequences of smaller molecules called amino acids and are folded into a variety of complex 3-D shapes. There are twenty naturally occurring amino acids. Long proteins may contain as many as 4500 amino acids, so the number of possible proteins of that length is $20^{4500}$, which is roughly $10^{5855}$.
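The size of this sequence space can be checked with arbitrary-precision arithmetic. The Java sketch below (class and method names are hypothetical) counts the decimal digits of $20^{4500}$, the number of distinct sequences of 4500 residues over a 20-letter alphabet.

```java
import java.math.BigInteger;

final class ProteinSpace {
    /** Number of decimal digits of alphabet^length, i.e. of the count of possible sequences. */
    static int decimalDigits(int alphabet, int length) {
        return BigInteger.valueOf(alphabet).pow(length).toString().length();
    }

    public static void main(String[] args) {
        // 20 amino acids, chains of 4500 residues
        System.out.println(decimalDigits(20, 4500) + " decimal digits");
    }
}
```

A number with d decimal digits lies between $10^{d-1}$ and $10^{d}$, so the digit count gives the order of magnitude directly.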

Proteins also fold up to form particular 3-D shapes, which support their specific chemical functionality. Although it is easily demonstrable that the linear amino acid sequence completely specifies the 3-D structure of most proteins, the details of that mapping are one of the most important open questions in biology today. In addition, a protein's 3-D structure is not fixed; many proteins move and flex in constrained ways, and that flexibility can have a significant role in their biochemical function.

Proteins have a variety of roles that they must fulfill:

1. Enzymes: proteins that carry out a chemical reaction and rearrange chemical bonds.

2. Messengers: proteins that carry signals to and from the outside of the cell, and within the cell.

3. Transport proteins: proteins that usually carry other small molecules.

4. Structural proteins: proteins that make up sub-cellular structures.

5. Regulatory proteins: proteins that control the expression of a gene or the activity of other proteins.

Again, each protein is a linear sequence made of smaller constituent molecules called amino acids. There are 20 different types of amino acids. The following table lists the amino acids and their properties.


Table 1: Amino acids and their properties, Source: Math of Life


2.2 Protein sequence comparison



Protein sequence comparison is one of the most powerful approaches to characterizing protein sequences because of the enormous amount of information that is preserved throughout the evolutionary process.

For many protein sequences, an evolutionary history can be traced back 1-2 billion years. Sequence homology is a general term that indicates evolutionary relatedness among sequences. Proteins that share a common ancestor are homologous. Sequence comparison is informative when it detects homologous proteins. Homologous proteins always share a common 3-D folding structure, and they often share common active sites or binding domains.

Frequently homologous proteins share common functions, but sometimes they do not. The ability to characterize the biological properties of a protein based on sequence data alone stems almost exclusively from properties conserved through evolutionary time.

Protein chemists discovered early on that certain amino acid substitutions commonly occur in related proteins from different species. Because the protein still functions with these substitutions, the substituted amino acids are compatible with protein structure and function. Often, the substitution is to a chemically similar amino acid, but other changes also occur. Yet other substitutions are relatively rare. Knowing the types of changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences. If related protein sequences are quite similar, they are easy to align, and one can readily determine the single-step amino acid changes. If ancestor relationships among a group of proteins are assessed, the probable amino acid changes that occurred during evolution can be predicted. This type of analysis was pioneered by Margaret Dayhoff (1978) [5].

In deciding to perform a sequence alignment, it is important to keep the goal of the analysis in mind. We have several important questions here:

• whether or not two proteins have similar domains or structural features

• whether or not they are in the same family with a related biological function

• whether or not they share a common ancestor relationship

The desired objectives will influence the way the analysis is to be done. There are several decisions to be made along the way, including the type of program, whether to produce a global or local alignment, the type of scoring matrix, and the value of the gap penalties to be used.

There are a very large number of amino acid scoring matrices in use, and these scoring matrices are designed for different purposes. Some, such as the Dayhoff PAM matrices, are based on an evolutionary model of protein change, whereas others, such as the BLOSUM matrices, are designed to identify members of the same family. Alignments between DNA sequences require similar considerations. It is often worth the effort to try several approaches to find out which choice of scoring system and gap penalty gives the most reasonable result.

2.3 Dayhoff Point Accepted Mutation Matrix




Amino acids, the residues that make up protein sequences, have biochemical properties that influence their relative replaceability in an evolutionary scenario. For instance, it is more likely that amino acids of similar sizes get substituted for one another than those of different sizes.

The acronym PAM stands for "Point Accepted Mutation". A PAM is a substitution of one amino acid of a protein by another that is "accepted" by evolution, in the sense that within some given species, the mutation has not only arisen but has spread to essentially the entire species over time. A PAM 1 matrix applies to a time period over which we expect 1% of the amino acids to undergo accepted point mutations within the species of interest.

A definition is a useful place to start. An accepted mutation is a mutation that occurred and was positively selected by the environment; that is, it did not cause the demise of the particular organism where it occurred. It is important for the basic PAM 1 matrix that we consider only immediate mutations, $a \to b$, and not mediated ones like $a \to c \to b$.

The necessary ingredients to build the PAM 1 matrix M are as follows:

• A list of accepted mutations

• The probability of occurrence $p_a$ for each amino acid $a$

The probabilities of occurrence can be estimated simply by computing the relative frequency of occurrence of amino acids over a large, sufficiently varied protein sequence set. These numbers satisfy

$$\sum_a p_a = 1.$$

From the list of accepted mutations we can compute the quantities $f_{ab}$, the number of times the mutation $a \leftrightarrow b$ was observed to occur. Recall that we are dealing with undirected mutations here, so $f_{ab} = f_{ba}$. We will also need the sums

$$f_a = \sum_{b \neq a} f_{ab},$$

the total number of mutations in which $a$ was involved, and

$$f = \sum_a f_a,$$

the total number of amino acid occurrences involved in mutations. The number $f$ is also twice the total number of mutations.


PAM 1 is a 20 by 20 matrix, with $M_{ab}$ being the probability of amino acid $a$ changing into amino acid $b$. $M_{aa}$ is the probability that amino acid $a$ remains unchanged during the evolutionary interval. The relative mutability of amino acid $a$ is defined as follows:

$$m_a = \frac{f_a}{100\, f\, p_a}$$


ACID  MARGINAL SUM  RELATIVE FREQUENCY  MAR. SUM OVER REL. FREQ.  MUTABILITY  NORMALIZED REL. FREQ.
A     0.49400E+03   0.49804E+03         0.99189E+00               100.0000    0.78520E-01
R     0.27033E+03   0.29373E+03         0.92035E+00                92.7874    0.46309E-01
N     0.26533E+03   0.24265E+03         0.10935E+01               110.2410    0.38256E-01
D     0.38767E+03   0.35299E+03         0.10982E+01               110.7222    0.55652E-01
C     0.11700E+03   0.10710E+03         0.10925E+01               110.1414    0.16885E-01
Q     0.22667E+03   0.18904E+03         0.11991E+01               120.8875    0.29803E-01
E     0.49767E+03   0.42524E+03         0.11703E+01               117.9893    0.67042E-01
G     0.46400E+03   0.54661E+03         0.84887E+00                85.5809    0.86178E-01
H     0.15767E+03   0.16445E+03         0.95874E+00                96.6580    0.25927E-01
I     0.39767E+03   0.34095E+03         0.11664E+01               117.5898    0.53753E-01
L     0.44000E+03   0.53974E+03         0.81521E+00                82.1875    0.85094E-01
K     0.35833E+03   0.31098E+03         0.11523E+01               116.1698    0.49028E-01
M     0.19467E+03   0.18035E+03         0.10794E+01               108.8197    0.28434E-01
F     0.24133E+03   0.33601E+03         0.71824E+00                72.4110    0.52974E-01
P     0.24233E+03   0.30170E+03         0.80323E+00                80.9802    0.47565E-01
S     0.45167E+03   0.39817E+03         0.11344E+01               114.3638    0.62775E-01
T     0.40900E+03   0.36742E+03         0.11132E+01               112.2276    0.57927E-01
W     0.48667E+02   0.10631E+03         0.45780E+00                46.1545    0.16760E-01
Y     0.20900E+03   0.21854E+03         0.95636E+00                96.4186    0.34454E-01
V     0.46033E+03   0.42284E+03         0.10887E+01               109.7583    0.66664E-01
SUM   0.63333E+04   0.63428E+04

Table 2: Marginal Sums, Relative Frequencies and Mutabilities, Source: 1989 Mutation Data Matrix, Geetha Y. Srinivasarao, David G. George, and Winona C. Barker, National Biomedical Research Foundation


The mutability is scaled to the number of replacements per occurrence of the given amino acid per 100 residues in each alignment. The mutabilities for real homologous proteins are presented in Table 3 below.


Table 3: The probability of protein mutation rate, Source: David W. Mount, Bioinformatics Sequence and Genome Analysis, Page 81


On the one hand, the relative mutability $m_a$ is the probability that the given amino acid $a$ will change in the evolutionary period of interest. Hence, the probability of $a$ remaining unchanged is the complementary probability

$$M_{aa} = 1 - m_a.$$


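On the other hand, the probability of $a$ changing into $b$ can be computed as the product of the conditional probability that $a$ will change into $b$, given that $a$ changed, times the probability of $a$ changing (its relative mutability $m_a$):

$$M_{ab} = \frac{f_{ab}}{f_a}\, m_a$$

It is easy to verify that M has the following properties:

$$\sum_b M_{ab} = 1 \qquad (1)$$

$$\sum_a p_a M_{aa} = 0.99 \qquad (2)$$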

The first equation merely says that adding up the probability of $a$ staying the same and the probabilities of it changing to every other amino acid gives 1.

The transition probability matrix has been normalized to reflect the fact that the amount of evolution will change 1 out of 100 amino acids on average. We can see this fact from equation (2).
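The construction of M and its two properties can be illustrated on a toy alphabet. The Java sketch below is written under explicit assumptions: the class name `Pam1Sketch`, the three-letter alphabet and its counts are invented for illustration, and the relative mutability is taken to be $m_a = f_a / (100\, f\, p_a)$, the normalization under which 1 out of 100 residues changes on average.

```java
final class Pam1Sketch {
    /**
     * Builds a PAM-1 style transition matrix from symmetric mutation counts f[a][b]
     * (diagonal ignored) and occurrence probabilities p[a].
     * Assumes every letter takes part in at least one mutation (f_a > 0).
     */
    static double[][] build(double[][] f, double[] p) {
        int n = p.length;
        double[] fa = new double[n];
        double total = 0.0;                       // f = sum of all f_a
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                if (b != a) fa[a] += f[a][b];     // f_a = sum over b != a of f_ab
            }
            total += fa[a];
        }
        double[][] m = new double[n][n];
        for (int a = 0; a < n; a++) {
            double mutability = fa[a] / (100.0 * total * p[a]); // assumed definition of m_a
            for (int b = 0; b < n; b++) {
                m[a][b] = (a == b) ? 1.0 - mutability : f[a][b] / fa[a] * mutability;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] f = { { 0, 2, 1 }, { 2, 0, 3 }, { 1, 3, 0 } }; // invented counts
        double[] p = { 0.5, 0.3, 0.2 };                           // invented frequencies
        double[][] m = build(f, p);
        double unchanged = 0.0;
        for (int a = 0; a < p.length; a++) unchanged += p[a] * m[a][a];
        System.out.println("sum_a p_a * M_aa = " + unchanged);
    }
}
```

With any counts and frequencies of this shape, each row of the result sums to 1 (property 1) and the weighted diagonal sums to 0.99 (property 2), up to floating-point rounding.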

Sample matrix M is shown in Table 4 below.


Table 4: M Matrix, Source: [5]


Once we have the basic matrix M, we can derive transition probabilities for larger amounts of evolution. Raising M to the k-th power gives $M^{(k)} = M^k$, the transition probability matrix for a period of k units of evolution. Sample matrix $M^{(250)}$ is shown in Table 5 below.
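Deriving the matrix for k units of evolution is plain matrix exponentiation. The following Java sketch uses repeated multiplication; the class name `MatrixPower` and the 2 by 2 example matrix are illustrative assumptions, not Dayhoff data.

```java
final class MatrixPower {
    /** Multiplies two square matrices of the same size. */
    static double[][] multiply(double[][] x, double[][] y) {
        int n = x.length;
        double[][] out = new double[n][n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                for (int c = 0; c < n; c++) {
                    out[a][b] += x[a][c] * y[c][b];
                }
            }
        }
        return out;
    }

    /** Returns m raised to the k-th power (k >= 0); k = 0 gives the identity matrix. */
    static double[][] power(double[][] m, int k) {
        int n = m.length;
        double[][] result = new double[n][n];
        for (int a = 0; a < n; a++) result[a][a] = 1.0; // zero units of evolution
        for (int step = 0; step < k; step++) {
            result = multiply(result, m);               // one more unit of evolution
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] m = { { 0.9, 0.1 }, { 0.2, 0.8 } };  // toy transition matrix
        double[][] m250 = power(m, 250);
        System.out.println(m250[0][0] + " " + m250[0][1]);
    }
}
```

Repeated squaring would need fewer multiplications, but for a 20 by 20 matrix and k = 250 the simple loop is already fast.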





Table 5: M(250) Matrix, Source: [5]


We are now ready to define the scoring matrices. The entries in these matrices are related to the ratio between two probabilities, namely, the probability that a pair is a mutation as opposed to being a random occurrence. This is called the "likelihood" or "odds" ratio, $M_{ab}/p_b$. Each entry in the lod (logarithm of odds) matrix S is calculated as follows:

$$S_{ab} = 10 \log_{10} \frac{M_{ab}}{p_b}$$
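A log-odds entry can be computed directly from the odds ratio. The Java sketch below assumes the conventional PAM scaling, ten times the base-10 logarithm rounded to the nearest integer; the class name `LogOdds` and the sample probabilities are hypothetical.

```java
final class LogOdds {
    /**
     * Log-odds score for a pair (a, b): 10 * log10(M_ab / p_b), rounded to the nearest integer.
     * The factor 10 and the rounding are the assumed PAM convention.
     */
    static long score(double mab, double pb) {
        return Math.round(10.0 * Math.log10(mab / pb));
    }

    public static void main(String[] args) {
        System.out.println(score(0.1, 0.01));   // pair seen ten times more often than chance
        System.out.println(score(0.05, 0.05));  // pair seen exactly as often as chance
        System.out.println(score(0.005, 0.05)); // pair seen ten times less often than chance
    }
}
```

Positive scores mark substitutions that occur more often than chance, negative scores mark those that occur less often, and adding scores along an alignment corresponds to multiplying the underlying odds ratios.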


Sample S(250) matrix is shown in Table 6.


Table 6: PAM (250) logarithm of odds matrix (S(250) Matrix), Source: David W. Mount, Bioinformatics Sequence and Genome Analysis, Page 82


The PAM 250 scoring matrix was modified in an attempt to improve the alignment obtained. All scores for matching a particular amino acid were normalized to the same mean and standard deviation, and all amino acid identities were given the same score to provide an equal contribution for each amino acid in a sequence alignment.











C h a p t e r 3

SOURCE INFORMATION PROGRAMMING

Introduction

Based on the previous chapters on sequence alignment and the mutation rate table for each amino acid, we can now design a program for implementing protein sequence analysis. However, we need data from a global genome database. The National Center for Biotechnology Information (NCBI) protein database provides the sequence and related information for each entry. In this chapter, we concentrate on its data format and design the input technique for protein sequence alignment.

3.1 XML

Extensible Markup Language (XML) provides a foundation for creating documents and document systems. XML operates on two main levels. Firstly, it provides syntax for document markup. Secondly, it provides syntax for declaring the structures of documents. XML is clearly targeted at the Web, though it certainly has applications beyond it. Users who have worked with HTML before should be able to learn the basics of XML without too much difficulty.

From the www.w3.org document [6], "Extensible Markup Language (XML) 1.0 (Second Edition)", we learn two important definitions of XML.

Definition 1: A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

Definition 2: A textual object is a well-formed XML document if:

1. Taken as a whole, it matches the production labeled document.

2. It meets all the well-formedness constraints given in this specification.

3. Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.

3.1.1 NCBI Tiny Sequence XML Format

The National Center for Biotechnology Information (NCBI) web service [7] provides several file formats for analyzing and manipulating genome information, including ASN.1, FASTA, GenPept, XML, etc. Because XML's custom tags make it easier to develop applications outside NCBI, this thesis adopts the Tiny Sequence XML format (TinySeqXml) to build the XML reading component for the protein alignment software.

Data Format


Element Name     Element Description
TSeqSet          The root of the document
TSeq             A single sequence record in the document
TSeq_seqtype     Sequence type, for instance nucleotide or protein
TSeq_gi          Sequence GI (GenInfo Identifier) number
TSeq_accver      Sequence accession and version
TSeq_taxid       Sequence taxon id number for database
TSeq_orgname     Sequence organism name
TSeq_defline     Sequence definition
TSeq_length      Sequence length
TSeq_sequence    Sequence

Table 7: Tiny Sequence XML Data Format


Here is an example of the TinySeqXml data format from NCBI:

<?xml version="1.0"?>
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
  <TSeq>
    <TSeq_seqtype value="protein"/>
    <TSeq_gi>30314342</TSeq_gi>
    <TSeq_accver>AAP06763.1</TSeq_accver>
    <TSeq_taxid>227998</TSeq_taxid>
    <TSeq_orgname>SARS coronavirus Hong Kong/03/2003</TSeq_orgname>
    <TSeq_defline>RNA-directed RNA polymerase [SARS coronavirus Hong Kong/03/2003]</TSeq_defline>
    <TSeq_length>215</TSeq_length>
    <TSeq_sequence>QDAVASKILGLPTQTVDSSQGSEYDYVIFTQTTETAHSCNVNRFNVAITRAKIGILCIMSDRDLYDKLQFTSLEIPRRNVATLQAENVTGLFKDCSKIITGLHPTQAPTHLSVDIKFKTEGLCVDIPGIPKDMTYRRLISMMGFKMNYQVNGYPNMFITREEAIRHVRAWIGFDVEGCHATRDAVGTNLPLQLGFSTGVNLVAVPTGYVDTENNL</TSeq_sequence>
  </TSeq>
</TSeqSet>
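A document in this format can be read with the DOM parser that ships with the Java platform. The sketch below is a minimal illustration, not the thesis component itself: the class name `TinySeqReader` is hypothetical, and the entity resolver that replaces the external DTD with an empty stream is an assumption made so that parsing does not depend on network access.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TinySeqReader {
    /** Returns the text of every TSeq_sequence element, one entry per TSeq record. */
    static String[] readSequences(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer the DOCTYPE's external DTD with an empty stream instead of fetching it.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList records = doc.getElementsByTagName("TSeq");
            String[] sequences = new String[records.getLength()];
            for (int k = 0; k < records.getLength(); k++) {
                Element record = (Element) records.item(k);
                sequences[k] = record.getElementsByTagName("TSeq_sequence")
                        .item(0).getTextContent().trim();
            }
            return sequences;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse TinySeqXml input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<TSeqSet><TSeq><TSeq_seqtype value=\"protein\"/>"
                + "<TSeq_length>4</TSeq_length>"
                + "<TSeq_sequence>QDAV</TSeq_sequence></TSeq></TSeqSet>";
        for (String s : readSequences(xml)) {
            System.out.println(s);
        }
    }
}
```

The returned strings can be passed directly as the two sequence arguments of an alignment routine.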


The tree structure of the TinySeqXml data format as represented in an XML viewer:

Table 8: The NCBI Tiny Sequence XML Tree View, Implementation by XMLSPY