Introduction to Bioinformatics 2. Biology Background

sparrowcoward, Biotechnology

2 Oct 2013

Sequence Alignments with Indels


Evolution produces insertions and deletions (indels)


In addition to substitutions


Good example:

MHHNALQRRTVWVNAY          MHHNALQRRTVWVNAY
MHHALQRRTVWVNAY-          MHH-ALQRRTVWVNAY
BLOSUM score = 2          Score = 79
(end gap = -6)            (gap = -6)



The two aligned sequences in an alignment must have equal length


So, we must add gaps, including at the start and the end


Combinatorially difficult problem to find best indel solution
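
The score of 79 for the gapped alignment in the example above (MHH-ALQRRTVWVNAY against MHHNALQRRTVWVNAY) can be checked column by column. The following is a minimal sketch, not a general scorer: the function name is hypothetical, the identity scores are BLOSUM62 diagonal values quoted from memory for the residues in this example only, and the gap penalty of -6 is the one stated on the slide.

```python
# Diagonal (identity) BLOSUM62 scores for the residues in the example only.
# Quoted from memory of the standard table; verify before relying on them.
BLOSUM62_IDENTITY = {
    "M": 5, "H": 8, "N": 6, "A": 4, "L": 4, "Q": 5,
    "R": 5, "T": 5, "V": 4, "W": 11, "Y": 7,
}
GAP = -6  # per-position gap penalty used in the example


def score_identity_alignment(top: str, bottom: str) -> int:
    """Score two equal-length aligned strings column by column.

    Only identical pairs and gap columns are handled in this sketch;
    a mismatch raises, because off-diagonal scores are not included.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += BLOSUM62_IDENTITY[x]
        else:
            raise ValueError(f"mismatch {x}/{y} is not covered by this sketch")
    return total


print(score_identity_alignment("MHHNALQRRTVWVNAY", "MHH-ALQRRTVWVNAY"))  # 79
```

Fifteen identical columns contribute 85 and the single gap column contributes -6, which gives 79.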

Gap


So far we ignored gaps


A gap corresponds to an insertion or a deletion
of a residue


Conventional wisdom dictates that the penalty for a gap must be several times greater than the penalty for a mutation. That is because a gap or extra residue


Interrupts the entire polymer chain


In DNA, shifts the reading frame

Gap Penalties


Gaps are penalised


Write w_x to indicate the penalty for a gap of length x


For example, each gap scores -6, so w_x = -6x


One common scheme is


Score -12 for opening a gap


And -2 for every subsequent gap


i.e., w_x = -12 - 2(x - 1)


Start and end gap penalties are often set to zero


But this can leave doubt about evolutionary conclusions
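
The two penalty schemes described above can be written as small functions. This is only a sketch: the function and parameter names are made up for illustration, and the default values are the -6, -12 and -2 quoted on the slide.

```python
def linear_gap(x: int, per_position: int = -6) -> int:
    """Penalty w_x for a gap of length x when every position costs the same."""
    return per_position * x


def affine_gap(x: int, opening: int = -12, extension: int = -2) -> int:
    """Penalty w_x = opening + extension * (x - 1) for a gap of length x >= 1."""
    if x < 1:
        return 0  # no gap, no penalty
    return opening + extension * (x - 1)


for length in (1, 2, 5):
    print(length, linear_gap(length), affine_gap(length))
# 1 -6 -12
# 2 -12 -14
# 5 -30 -20
```

Under the second scheme a single long gap is cheaper than several short ones of the same total length, which is the point of separating the opening cost from the extension cost.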

Dot Matrix Representations (Dotplots)

To help visualise best alignments


Plot where each pair is the same, then draw best line
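
A text version of such a plot takes only a few lines. The sketch below assumes the two sequences from the figure (MNALSQLN along the top, NALMSQNH down the side); the function name and the choice of `*` and `.` as markers are illustrative only.

```python
def dotplot(horizontal: str, vertical: str) -> list[str]:
    """Return one text row per vertical residue, '*' where the residues match."""
    rows = []
    for v in vertical:
        cells = "".join("*" if h == v else "." for h in horizontal)
        rows.append(f"{v} {cells}")
    return rows


print("  " + "MNALSQLN")
print("\n".join(dotplot("MNALSQLN", "NALMSQNH")))
#   MNALSQLN
# N .*.....*
# A ..*.....
# L ...*..*.
# M *.......
# S ....*...
# Q .....*..
# N .*.....*
# H ........
```

The diagonal runs of `*` (N-A-L and S-Q) are the stretches that the best line should pass through.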

[Figure, two panels: dotplot of MNALSQLN (horizontal) against NALMSQNH (vertical)]

Getting Alignments from Dotplot Paths

[Figure: dotplot of MNALSQLN (horizontal) against NALMSQNH (vertical) with a path drawn. One marker indicates that M matches with a gap; another indicates that L matches with a gap.]


Stage 1:


Align middle


Use triangles to indicate gaps


NAL-SQLN
NALMSQ-N


Stage 2:


Sort the ends out


MNAL-SQLN-
-NALMSQ-NH
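
Reading an alignment off a path is mechanical: a diagonal step pairs two residues, and a horizontal or vertical step pairs one residue with a gap. The sketch below uses a hypothetical encoding of the path as a string of `D`, `H` and `V` moves (not something the slides define) and reproduces the Stage 2 alignment of MNALSQLN (horizontal) with NALMSQNH (vertical).

```python
def alignment_from_path(horizontal: str, vertical: str, moves: str) -> tuple[str, str]:
    """Turn a dotplot path into two gapped strings.

    moves uses an encoding chosen for this sketch:
      'D' diagonal step: pair the next residue of each sequence
      'H' horizontal step: next horizontal residue against a gap
      'V' vertical step: next vertical residue against a gap
    """
    top, bottom = [], []
    i = j = 0
    for move in moves:
        if move == "D":
            top.append(horizontal[i])
            bottom.append(vertical[j])
            i += 1
            j += 1
        elif move == "H":
            top.append(horizontal[i])
            bottom.append("-")
            i += 1
        elif move == "V":
            top.append("-")
            bottom.append(vertical[j])
            j += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    if i != len(horizontal) or j != len(vertical):
        raise ValueError("path does not consume both sequences")
    return "".join(top), "".join(bottom)


print(alignment_from_path("MNALSQLN", "NALMSQNH", "HDDDVDDHDV"))
# ('MNAL-SQLN-', '-NALMSQ-NH')
```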




Dotplots for Real Proteins


Need a way to automatically find the best path(s)

Dynamic Programming Approach


BLAST is quick


But not guaranteed to find best alignment


Gapped BLAST allows indels, but still gives no guarantee…


Dynamic Programming:


Also known as: Needleman-Wunsch Algorithm


Can use it to draw the Dotplot paths


From that we can get the alignment


Mathematically guaranteed


To find the best scoring alignment


Given a substitution scheme (scoring scheme, e.g., BLOSUM)


And given a gap penalty

The Needleman-Wunsch algorithm


A smart way to reduce the massive number of possibilities that need to be considered, while still guaranteeing that the best solution will be found (Saul Needleman and Christian Wunsch, 1970).


The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.


The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953!

Dynamic Programming


A divide-and-conquer strategy:


Break the problem into smaller subproblems.


Solve the smaller problems optimally.


Use the subproblem solutions to construct an optimal solution for the original problem.


Dynamic programming can be applied only to problems exhibiting the properties of overlapping subproblems and optimal substructure.


Examples include


Travelling salesman problem


Finding the best chess move

Overview of Needleman-Wunsch


Four Stages

1. Initialise a matrix for the sequences

2. Fill in the entries of that matrix (call these S_{i,j})

   At the same time drawing arrows in the matrix

3. Use the arrows to find the best scoring path(s)

4. Interpret the paths as alignments as before


Illustrate with: MNALQM & NALMSQA

Stage 1: Initialising the Matrix


Draw the grid

Put in increasing gap penalties



Then put in BLOSUM scores
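
As a rough illustration of this stage, the sketch below builds the grid for MNALQM and NALMSQA and writes the increasing gap penalties along the first row and first column. It assumes MNALQM is the horizontal sequence and a penalty of -6 per gap position, as in the earlier examples; the function name is hypothetical, and the per-cell BLOSUM scores are left out here.

```python
GAP = -6  # per-position gap penalty, as in the earlier examples


def initialise_matrix(horizontal: str, vertical: str, gap: int = GAP):
    """Build the score grid with gap penalties along the top row and left column.

    Cells that stage 2 will fill are left as None.
    """
    n_cols = len(horizontal) + 1
    n_rows = len(vertical) + 1
    S = [[None] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        S[0][col] = gap * col
    for row in range(n_rows):
        S[row][0] = gap * row
    return S


S = initialise_matrix("MNALQM", "NALMSQA")
print(S[0])                   # [0, -6, -12, -18, -24, -30, -36]
print([row[0] for row in S])  # [0, -6, -12, -18, -24, -30, -36, -42]
```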

Stage 2: Putting Scores and Arrows in

Put the score in, then draw the arrow

Mathematically, we are calculating:


Where:


S_{i,j} is the matrix entry at (i,j) [the one we want to fill in]


S_{i-1,j-1} is above and to the left of this


s(a_i, b_j) is the BLOSUM score for the


i-th residue from the horizontal sequence and


j-th residue from the vertical sequence


(i.e., just the scores we have written in brackets)
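
With those definitions, the calculation is usually written as the recurrence below. This is the standard textbook form for a linear gap penalty and may differ in notation from the original slide; the symbol w for the per-position gap penalty (w = -6 in the running example) is an assumption introduced here.

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(align } a_i \text{ with } b_j\text{)} \\
S_{i-1,j} + w             & \text{(gap)} \\
S_{i,j-1} + w             & \text{(gap)}
\end{cases}
\qquad
S_{0,0} = 0,\quad S_{i,0} = i\,w,\quad S_{0,j} = j\,w
```

The arrow drawn in each cell records which of the three cases gave the maximum.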

This diagram might help:

Fill in the next row and column

A Close up View

Continue filling in the S_{i,j} entries

Stage 3: Finding the best path


The scores S_{i,j} in the matrix are the BLOSUM scores for alignments


However!


We must take into account final gap penalties


Look down the final column and along the final row


Find the highest scoring number


Remembering to take off the gap penalty the correct number of times


Finding the best path

So, the best path is:

Stage 4: Generating the Alignment

Firstly, draw the Dotplot

Secondly, Generate the Alignment


Using the technique previously mentioned


This path gives us an alignment with three gaps



     M  N  A  L  -  -  Q  M
     -  N  A  L  M  S  Q  A
S = -6  6  4  4 -6 -6  5 -1  = 0



You should check that you get the same score as on the diagram


Other Alignments

MNALQ-M-                  MNALQM--
-NALMSQA (score = -4)     -NALMSQA (score = -5)
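
The four stages, and the scores quoted for the alternative alignments, can be checked with a short sketch. Several things here are assumptions rather than part of the slides: the function names are hypothetical; the substitution scores are a partial BLOSUM62 table, quoted from memory and limited to the residues in this example; the gap penalty is linear at -6 per position; and end gaps are charged inside the matrix through the first row and column, so the best score is read from the bottom-right corner instead of by scanning the final row and column (which should give the same total).

```python
# Partial BLOSUM62 table covering only the residues in the example,
# quoted from memory; verify against an authoritative copy.
PAIRS = {
    "AA": 4, "AN": -2, "AL": -1, "AM": -1, "AQ": -1, "AS": 1,
    "NN": 6, "NL": -3, "NM": -2, "NQ": 0, "NS": 1,
    "LL": 4, "LM": 2, "LQ": -2, "LS": -2,
    "MM": 5, "MQ": 0, "MS": -1,
    "QQ": 5, "QS": 0, "SS": 4,
}
GAP = -6  # linear penalty per gap position


def s(x, y):
    """Symmetric substitution score lookup in the partial table."""
    return PAIRS[x + y] if x + y in PAIRS else PAIRS[y + x]


def needleman_wunsch(a, b, gap=GAP):
    """Return (best score, gapped a, gapped b) for a global alignment."""
    n, m = len(a), len(b)
    # Stage 1: grid with increasing gap penalties on the first row and column.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    arrow = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0], arrow[i][0] = gap * i, "up"
    for j in range(1, m + 1):
        S[0][j], arrow[0][j] = gap * j, "left"
    # Stage 2: fill each cell with the best of three moves and record the arrow.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (S[i - 1][j - 1] + s(a[i - 1], b[j - 1]), "diag"),
                (S[i - 1][j] + gap, "up"),
                (S[i][j - 1] + gap, "left"),
            ]
            S[i][j], arrow[i][j] = max(options, key=lambda t: t[0])
    # Stages 3 and 4: follow the arrows back from the bottom-right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = arrow[i][j]
        if move == "diag":
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def score_alignment(top, bottom, gap=GAP):
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(gap if "-" in (x, y) else s(x, y) for x, y in zip(top, bottom))


print(needleman_wunsch("MNALQM", "NALMSQA"))    # (0, 'MNAL--QM', '-NALMSQA')
print(score_alignment("MNALQ-M-", "-NALMSQA"))  # -4
print(score_alignment("MNALQM--", "-NALMSQA"))  # -5
```

When two moves tie, this sketch keeps only one arrow (diagonal first), so it reports a single best path even where several exist.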

Smith-Waterman Alterations


To make the algorithm find best local alignments


Adjustments only to the scoring scheme for S_{i,j}:


The scoring scheme must include:


Some negative scores for mismatches


When S_{i,j} becomes negative, set it to zero


So local paths are not penalised for earlier bad routes


To find best local alignment


Find highest scoring matrix position (anywhere)


And work backwards until a zero is reached
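
These alterations amount to two small changes to the global procedure, sketched below for the same pair of sequences. As before, this rests on assumptions: the function names are hypothetical, the partial BLOSUM62 table is quoted from memory for the residues in the example only, and the gap penalty is linear at -6 per position.

```python
# Partial BLOSUM62 table covering only the residues in the example,
# quoted from memory; verify against an authoritative copy.
PAIRS = {
    "AA": 4, "AN": -2, "AL": -1, "AM": -1, "AQ": -1, "AS": 1,
    "NN": 6, "NL": -3, "NM": -2, "NQ": 0, "NS": 1,
    "LL": 4, "LM": 2, "LQ": -2, "LS": -2,
    "MM": 5, "MQ": 0, "MS": -1,
    "QQ": 5, "QS": 0, "SS": 4,
}
GAP = -6  # linear penalty per gap position


def s(x, y):
    """Symmetric substitution score lookup in the partial table."""
    return PAIRS[x + y] if x + y in PAIRS else PAIRS[y + x]


def smith_waterman(a, b, gap=GAP):
    """Return (best local score, gapped piece of a, gapped piece of b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,  # a negative score is reset to zero
                H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:  # highest-scoring cell anywhere in the matrix
                best, best_pos = H[i][j], (i, j)
    # Work backwards from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("MNALQM", "NALMSQA"))  # (14, 'NAL', 'NAL')
```

The sketch keeps the first cell that reaches the maximum. For this pair, NALQ against NALM also scores 14 (Q against M scores 0), so a different tie-break would report that local alignment instead.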


Local and Global Alignments

Needleman & Wunsch: best global alignments



Smith & Waterman: best local alignments




For illustration purposes only


Calculations done slightly differently (don’t worry)