Bioinformatics

sparrowcowardBiotechnology

Oct 2, 2013 (3 years and 10 months ago)

Definitions

Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically meaningful.

Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.

Local alignment - Smith-Waterman (1981) gives the highest scoring local match between two sequences.

Pairwise Global Alignment


Global alignment - Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.

Reasons for making a global alignment:

Checking minor differences between two sequences

Analyzing polymorphisms (e.g. SNPs) between closely related sequences




Pairwise Global Alignment


Computationally:

Given: a pair of sequences (strings of characters)

Output: an alignment that maximizes the similarity

How can we find an optimal alignment?


ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT

How many possible alignments?

C(27,7) gap positions = ~888,000 possibilities
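The count above can be checked directly. This is a minimal sketch, assuming the figure counts the ways to choose which 7 of the 27 alignment columns hold a gap in the shorter sequence; the variable name is illustrative only.

```python
import math

# Ways to choose 7 gap columns out of 27 alignment columns.
gap_placements = math.comb(27, 7)
print(gap_placements)  # 888030
```

Even for this short pair, trying every placement by brute force is already unattractive, which is the motivation for dynamic programming.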


Dynamic programming: The Needleman & Wunsch algorithm

Time Complexity

Consider two sequences:

AAGT

AGTC

How many possible alignments do the two sequences have?

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} = \Theta\left(\frac{2^{2n}}{\sqrt{n}}\right) = \Omega(2^n)$$
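To get a feel for how fast this count grows, the sketch below compares the exact central binomial coefficient with the Stirling estimate 2^(2n) / sqrt(pi n). The function name is an assumption made for illustration, and the estimate is the standard approximation rather than something stated on the slide.

```python
import math

def count_alignment_paths(n: int) -> int:
    """Central binomial coefficient C(2n, n), the count used on this slide."""
    return math.comb(2 * n, n)

for n in (4, 10, 20):
    exact = count_alignment_paths(n)
    # Stirling estimate: 2^(2n) / sqrt(pi * n)
    approx = 4 ** n / math.sqrt(math.pi * n)
    print(n, exact, round(approx))
```

For the two length-4 sequences above (n = 4) the exact count is 70; by n = 20 it is already above 10^11.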

Scoring a sequence alignment


Match/mismatch score: +1/+0

Open/extension penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)

Mismatches: 2 × 0

Open: 2 × (-2)

Extension: 5 × (-1)

Score = +9
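The tally above can be reproduced with a few lines of code. This is a sketch under the slide's scoring scheme (match +1, mismatch 0, gap open -2, gap extension -1), assuming the first column of each gap run pays the open penalty and every further column of that run pays the extension penalty. The function name and parameters are illustrative, not taken from any library.

```python
def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = 0,
                    gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score two equal-length aligned strings that use '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First column of a run pays gap_open, later columns pay gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 9
```

The two gap runs have lengths 4 and 3, which gives 2 opens and 5 extensions, matching the counts on the slide.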

Needleman & Wunsch


Place each sequence along one axis

Place score 0 at the upper-left corner

Fill in 1st row & column with gap penalty multiples

Fill in the matrix with max value of 3 possible moves:

Vertical move: Score + gap penalty

Horizontal move: Score + gap penalty

Diagonal move: Score + match/mismatch score

The optimal alignment score is in the lower-right corner

To reconstruct the optimal alignment, trace back where the max at each step came from, and stop when the origin is reached.

Example



Let gap = -2, match = 1, mismatch = -1.

        empty     A     A     A     C
empty       0    -2    -4    -6    -8
A          -2     1    -1    -3    -5
G          -4    -1     0    -2    -4
C          -6    -3    -2    -1    -1

AAAC
A-GC

AAAC
-AGC
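The fill and traceback steps of Needleman & Wunsch can be sketched as follows, using the linear gap penalty of the example (gap = -2, match = 1, mismatch = -1). The function name and the tie-breaking order (diagonal first, then vertical, then horizontal) are assumptions of this sketch; a different order may return another equally scoring alignment.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    rows, cols = len(t) + 1, len(s) + 1  # t runs down the rows, s across the columns
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):             # first row: multiples of the gap penalty
        F[0][j] = j * gap
    for i in range(1, rows):             # first column: multiples of the gap penalty
        F[i][0] = i * gap

    def sub(i: int, j: int) -> int:
        return match if t[i - 1] == s[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),  # diagonal move
                          F[i - 1][j] + gap,            # vertical move
                          F[i][j - 1] + gap)            # horizontal move

    # Trace back from the lower-right corner to the origin.
    out_s, out_t = [], []
    i, j = len(t), len(s)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            out_s.append(s[j - 1]); out_t.append(t[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_s.append("-"); out_t.append(t[i - 1]); i -= 1
        else:
            out_s.append(s[j - 1]); out_t.append("-"); j -= 1
    return F[len(t)][len(s)], "".join(reversed(out_s)), "".join(reversed(out_t))

print(needleman_wunsch("AAAC", "AGC"))  # (-1, 'AAAC', '-AGC')
```

With these tie-breaking rules the sketch returns the second of the two alignments shown above; both score -1.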


Time Complexity Analysis


Initialize matrix values: O(n), O(m)


Filling in rest of matrix: O(nm)


Traceback: O(n+m)


If strings are same length, total time O(n^2)

Local Alignment


Problem first formulated: Smith and Waterman (1981)

Problem: Find an optimal alignment between a substring of s and a substring of t

Algorithm: a variant of the basic algorithm for global alignment

Motivation


Searching for unknown domains or motifs within proteins from different families

Proteins encoded from Homeobox genes (only conserved in 1 region called Homeo domain, 60 amino acids long)

Identifying active sites of enzymes

Comparing long stretches of anonymous DNA

Querying databases where the query word is much smaller than the sequences in the database

Analyzing repeated elements within a single sequence

Local Alignment



Let gap = -2, match = 1, mismatch = -1.

GATCACCT

GATACCC

        empty     G     A     T     A     C     C     C
empty       0     0     0     0     0     0     0     0
G           0     1     0     0     0     0     0     0
A           0     0     2     0     1     0     0     0
T           0     0     0     3     1     0     0     0
C           0     0     0     1     2     2     1     1
A           0     0     1     0     2     1     1     0
C           0     0     0     0     0     3     2     2
C           0     0     0     0     0     1     4     3
T           0     0     0     1     0     0     2     3

GATCACCT
GAT-ACCC

Smith & Waterman


Place each sequence along one axis

Place score 0 at the upper-left corner

Fill in 1st row & column with 0s

Fill in the matrix with max value of 4 possible values:

0

Vertical move: Score + gap penalty

Horizontal move: Score + gap penalty

Diagonal move: Score + match/mismatch score

The optimal alignment score is the max in the matrix

To reconstruct the optimal alignment, trace back where the MAX at each step came from, and stop when a zero is hit
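The steps above can be sketched in code as follows. This assumes the same linear gap penalty as the GATCACCT / GATACCC example (gap = -2, match = 1, mismatch = -1); the function name is illustrative, and when several cells share the maximum the sketch simply keeps the first one it meets.

```python
def smith_waterman(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (best_score, aligned_s, aligned_t) for one best local alignment."""
    rows, cols = len(s) + 1, len(t) + 1  # s runs down the rows, t across the columns
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0

    def sub(i: int, j: int) -> int:
        return match if s[i - 1] == t[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,                            # start a fresh local alignment
                          H[i - 1][j - 1] + sub(i, j),  # diagonal move
                          H[i - 1][j] + gap,            # vertical move
                          H[i][j - 1] + gap)            # horizontal move
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum cell until a zero is hit.
    out_s, out_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))

print(smith_waterman("GATCACCT", "GATACCC"))  # (4, 'GATCACC', 'GAT-ACC')
```

The only differences from the global version are the extra 0 in the max, the zero-filled first row and column, and the traceback that starts at the matrix maximum instead of the lower-right corner.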

Exercise

Let gap = -2, match = 1, mismatch = -1.

Find the best local alignment:

CGATG

AAATGGA

Semi-global Alignment

Example:

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG

We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.

Global Alignment

Example:

AAACCC
A--CCC

Prefer to see:

AAACCC
--ACCC

Do not want to penalize the end spaces

        empty     A     A     A     C     C     C
empty       0    -2    -4    -6    -8   -10   -12
A          -2     1    -1    -3    -5    -7    -9
C          -4    -1     0    -2    -2    -4    -6
C          -6    -3    -2    -1    -1    -1    -3
C          -8    -5    -4    -3     0     0     0

SemiGlobal Alignment

Example:

s = AAACCC
t = --ACCC

        empty     A     A     A     C     C     C
empty       0     0     0     0     0     0     0
A          -2     1     1     1    -1    -1    -1
C          -4    -1     0     0     2     0     0
C          -6    -3    -2    -1     1     3     1
C          -8    -5    -4    -3     0     2     4

SemiGlobal Alignment

Example:

s = AAACCCG
t = --ACCC-

        empty     A     A     A     C     C     C     G
empty       0     0     0     0     0     0     0     0
A          -2     1     1     1    -1    -1    -1    -1
C          -4    -1     0     0     2     0     0    -2
C          -6    -3    -2    -1     1     3     1    -1
C          -8    -5    -4    -3     0     2     4     2

SemiGlobal Alignment


Summary of end space charging procedures:

Place where spaces are not penalized for    Action
Beginning of 1st sequence                   Initialize 1st row with zeros
End of 1st sequence                         Look for max in last row
Beginning of 2nd sequence                   Initialize 1st column with zeros
End of 2nd sequence                         Look for max in last column
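The four rules in the table can be combined into one scoring routine with a switch per rule. This is a sketch only: the function name and the four flag names are hypothetical, and it assumes the table's convention that the 1st sequence indexes the rows of the matrix and the 2nd sequence the columns. Under that convention, the AAACCC / ACCC matrices above correspond to passing ACCC as the first argument.

```python
def semiglobal_score(first: str, second: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2, free_start_first: bool = False,
                     free_end_first: bool = False, free_start_second: bool = False,
                     free_end_second: bool = False) -> int:
    """Alignment score with optional free end spaces (first = rows, second = columns)."""
    rows, cols = len(first) + 1, len(second) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        # Free spaces at the beginning of the 1st sequence: first row of zeros.
        F[0][j] = 0 if free_start_first else j * gap
    for i in range(1, rows):
        # Free spaces at the beginning of the 2nd sequence: first column of zeros.
        F[i][0] = 0 if free_start_second else i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if first[i - 1] == second[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    candidates = [F[rows - 1][cols - 1]]
    if free_end_first:
        candidates.extend(F[rows - 1])                      # max in last row
    if free_end_second:
        candidates.extend(row[cols - 1] for row in F)       # max in last column
    return max(candidates)

print(semiglobal_score("ACCC", "AAACCC"))                         # 0 (plain global)
print(semiglobal_score("ACCC", "AAACCC", free_start_first=True))  # 4
print(semiglobal_score("ACCC", "AAACCCG", free_start_first=True,
                       free_end_first=True))                      # 4
```

With all four flags off the routine reduces to ordinary global alignment, which is why the first call reproduces the score 0 from the global example.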

Pairwise Sequence Comparison over Internet

Tool          URL                                                  Type
lalign        www.ch.embnet.org/software/LALIGN_form.html          Global/Local
lalign        fasta.bioch.virginia.edu/fasta_www/plalign.htm       Global/Local
USC           www-hto.usc.edu/software/seqaln/seqaln-query.html    Global/Local
alion         fold.stanford.edu/alion                              Global/Local
              genome.cs.mtu.edu/align.html                         Global/Local
align         www.ebi.ac.uk/emboss/align                           Global/Local
xenAliTwo     www.soe.ucsc.edu/~kent/xenoAli/xenAliTwo.html        Local for DNA
blast2seqs    www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html           Local BLAST
blast2seqs    web.umassmed.edu/cgi-bin/BLAST/blast2seqs            Local BLAST
lalnview      www.expasy.ch/tools/sim-prot.html                    Visualization
prss          www.ch.embnet.org/software/PRSS_form.html            Evaluation
prss          Fasta.bioch.virginia.edu/fasta/prss.htm              Evaluation
graph-align   Darwin.nmsu.edu/cgi-bin/graph_align.cgi              Evaluation

Bioinformatics for Dummies

Significance of Sequence Alignment


Consider randomly generated sequences. What distribution do you think the best local alignment score of two sequences of the same length should follow?

1. Uniform distribution

2. Normal distribution

3. Binomial distribution (n Bernoulli trials)

4. Poisson distribution (n → ∞, np = λ)

5. Others

Extreme Value Distribution


$$Y_{ev} = \exp\left(-x - e^{-x}\right)$$
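The density above is easy to evaluate numerically. The sketch below checks two properties under the formula as written on the slide: the curve peaks at x = 0 with height 1/e, and it integrates to about 1. The function names and the integration range [-10, 30] are choices made for this illustration.

```python
import math

def extreme_value_density(x: float) -> float:
    """Density exp(-x - e^(-x)) from the slide (standard extreme value form)."""
    return math.exp(-x - math.exp(-x))

def trapezoid(f, lo: float, hi: float, steps: int = 20000) -> float:
    """Plain trapezoid rule over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

print(extreme_value_density(0.0))                  # 1/e at the peak
print(trapezoid(extreme_value_density, -10, 30))   # close to 1
# The right tail is heavier than the left tail:
print(extreme_value_density(3.0), extreme_value_density(-3.0))
```

The asymmetry shown by the last line is the practical difference from a normal distribution: high scores are less surprising than a normal model would suggest.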

Extreme Value Distribution vs. Normal Distribution

“Twilight Zone”

Some proteins with less than 15% similarity have exactly the same 3-D structure while some proteins with 20% similarity have different structures. Homology/non-homology is never granted in the twilight zone.