CS790 – Introduction to Bioinformatics

educationafflictedBiotechnology

Oct 4, 2013 (3 years and 6 months ago)

64 views

Sequence Alignments

and Dynamic Programming

BIO/CS 471


Algorithms for Bioinformatics

Sequence Alignments

2

Module II: Sequence Alignments


Problem: SequenceAlignment


Input: Two or more strings of characters


Output: The
optimal alignment

of the input strings,
possibly including gaps, and an
alignment score.


Example


Input:

The calico cat was very sleepy.


The cat was sleepy.


Output:

The calico cat was very sleepy.


The
------

cat was
----

sleepy.

Sequence Alignments

3

Biological Sequence Alignment


Input: DNA or Amino Acid sequence


Output: Optimal alignment


Example


Input:

ACTATAGCTATAGCTATAGCGTATCG



ACGATTACAGGTCGTGTCG


Output:

ACTATAGCTATAGCTATAGCGTATCG


|| || ||*|| | |||*|||



ACGAT
---
TACAGGT
----
CGTGTCG


Score: +8

Sequence Alignments

4

Why align sequences?


Why would we want to align two sequences?


Compare two genes/proteins, e.g. to infer function


Database searches


Protein structure prediction





Why would we want to align multiple
sequences?


Conservation analysis


Molecular evolution




Sequence Alignments

5

How do we decide on a scoring function?


Objective: Sequences that are
homologous

or
evolutionarily related, should align with high
scores. Less related sequences should score
lower.


Question: How do
homologous

sequences
arise?

Sequence Alignments

6

Transcription

The Central
Dogma

DNA

transcription



RNA

translation



Proteins

Sequence Alignments

7

DNA Replication


Prior to cell division, all the
genetic instructions must be
“copied” so that each new cell
will have a complete set


DNA polymerase is the enzyme
that copies DNA


Reads the old strand in the 3
´

to 5
´

direction

Over time, genes
accumulate
mutations


Environmental factors


Radiation


Oxidation


Mistakes in replication or repair


Deletions, Duplications


Insertions


Inversions


Point mutations

Sequence Alignments

9


Codon deletion:

ACG ATA GCG TAT GTA TAG CCG…


Effect depends on the protein, position, etc.


Almost always deleterious


Sometimes lethal


Frame shift mutation:


ACG ATA GCG TAT GTA TAG CCG…



ACG ATA GCG ATG TAT AGC CG?…


Almost always lethal

Deletions

Sequence Alignments

10

Indels


Comparing two genes it is generally impossible
to tell if an
indel

is an insertion in one gene, or
a deletion in another, unless ancestry is known:


ACGTCTGAT
ACG
CCGTATCGTCTATCT

ACGTCTGAT
---
CCGTATCGTCTATCT

Sequence Alignments

11

The Genetic Code

Substitutions

are
mutations
accepted by
natural selection.


Synonymous:


CG
C



CG
A


Non
-
synonymous:


GA
U



GA
A

Sequence Alignments

12

Two kinds of
homologs


Orthologs


Species A has a particular
gene


Species A diverges into
species B and C, each
with the same gene


The two genes
accumulate substitutions
independently

Sequence Alignments

13

Orthologs and Paralogs


Paralogs


A gene duplication event
within a species results in
two copies of a gene


One of the copies is now
under less selective
constraint


The copies can
accumulate substitutions
independently, before and
after speciation

Sequence Alignments

14

Scoring a sequence alignment


Match score:


+1


Mismatch score:

+0


Gap penalty:



1

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT


||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT


Matches: 18
×

(+1)


Mismatches: 2
×

0


Gaps: 7
×

(


1)

Score = +11

Sequence Alignments

15

Origination and length penalties


We want to find alignments that are
evolutionarily likely.


Which of the following alignments seems more
likely to you?


ACGTCTGATACGCCGTATAGTCTATCT

ACGTCTGAT
-------
ATAGTCTATCT


ACGTCTGATACGCCGTATAGTCTATCT

AC
-
T
-
TGA
--
CG
-
CGT
-
TA
-
TCTATCT


We can achieve this by penalizing more for a
new gap, than for extending an existing gap





Sequence Alignments

16

Scoring a sequence alignment (2)


Match/mismatch score:


+1/+0


Origination/length penalty:


2/

1

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT


||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT


Matches: 18
×

(+1)


Mismatches: 2
×

0


Origination: 2
×

(

2)


Length: 7
×

(

1)

Score = +7

Sequence Alignments

17

Optimal Substructure in Alignments


Consider the alignment:

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT


||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT


Is it true that the alignment in the boxed region
must be optimal?

Sequence Alignments

18

A Greedy Strategy


Consider this pair of sequences

GAGC

CAGC


Greedy Approach:

G or G or
-

C
-

G


Leads to

GAGC
---

Better: GACG

---
CAGC CACG

GAP =

1

Match = +1

Mismatch =

2

Sequence Alignments

19

Breaking apart the problem


Suppose we are aligning:

ACTCG

ACAGTAG


First position choices:

A

+1

CTCG

A


CAGTAG


A

-
1

CTCG

-


ACAGTAG


-

-
1

ACTCG

A


CAGTAG

Sequence Alignments

20

A Recursive Approach to Alignment


Choose the best alignment based on these three
possibilities:

align(seq1, seq2) {


if (both sequences empty) {return 0;}


if (one string empty) {



return(gapscore * num chars in nonempty seq);



else {




score1 = score(firstchar(seq1),firstchar(seq2))





+ align(tail(seq1), tail(seq2));




score2 = align(tail(seq1), seq2) + gapscore;




score3 = align(seq1, tail(seq2) + gapscore;




return(min(score1, score2, score3));



}


}

}

Sequence Alignments

21

Time Complexity of RecurseAlign


What is the recurrence equation for the time
needed by RecurseAlign?

3

3

3

3

3

3



n

3

9

27

3
n

Sequence Alignments

22

RecurseAlign repeats its work

Sequence Alignments

23

How can we find an optimal alignment?


Finding the alignment is computationally hard:

ACGTCTGATACGCCGTATAGTCTATCT

CTGAT
---
TCG

CATCGTC
--
T
-
ATCT


C(27,7) gap positions = ~888,000 possibilities


It’s possible, as long
as we don’t repeat our
work
!


Dynamic programming: The Needleman &
Wunsch algorithm

Sequence Alignments

24

What is the optimal alignment?


ACTCG

ACAGTAG


Match: +1


Mismatch: 0


Gap:

1

Sequence Alignments

25

Needleman
-
Wunsch: Step 1


Each sequence along one axis


Mismatch penalty multiples in first row/column


0 in [1,1] (or [0,0] for the CS
-
minded)

Sequence Alignments

26

Needleman
-
Wunsch: Step 2


Vertical/Horiz. move: Score + (simple) gap penalty


Diagonal move: Score + match/mismatch score


Take the
MAX

of the three possibilities

Sequence Alignments

27

Needleman
-
Wunsch: Step 2 (cont’d)


Fill out the rest of the table likewise…

Sequence Alignments

28

Needleman
-
Wunsch: Step 2 (cont’d)


Fill out the rest of the table likewise…


The optimal alignment score is calculated in the
lower
-
right corner

Sequence Alignments

29

But what
is

the optimal alignment


To reconstruct the optimal alignment, we must
determine of where the MAX at each step came
from…

Sequence Alignments

30

A path corresponds to an alignment



= GAP in top sequence



= GAP in left sequence



= ALIGN both positions


One path from the previous table:


Corresponding alignment (start at the end):



AC
--
TCG


ACAGTAG

Score = +2

Sequence Alignments

31

Practice Problem


Find an optimal alignment for these two
sequences:


GCGGTT


GCGT


Match: +1


Mismatch: 0


Gap:

1


Sequence Alignments

32

Practice Problem


Find an optimal alignment for these two
sequences:


GCGGTT


GCGT


GCGGTT

GCG
-
T
-

Score = +2

Sequence Alignments

33

Semi
-
global alignment


Suppose we are aligning:

GCG

GGCG


Which do you prefer?

G
-
CG


-
GCG

GGCG


GGCG


Semi
-
global alignment allows gaps at the ends
for free.

Sequence Alignments

34

Semi
-
global alignment


Semi
-
global alignment allows gaps at the ends
for free.






Initialize first row and column to all 0’s


Allow free horizontal/vertical moves in last
row and column

Sequence Alignments

35

Local alignment


Global alignments


score the entire alignment


Semi
-
global alignments


allow unscored gaps
at the beginning or end of either sequence


Local alignment


find the best matching
subsequence


CG
ATG

AA
ATG
GA


This is achieved by allowing a 4
th

alternative at
each position in the table: zero.

Sequence Alignments

36

Local alignment


Mismatch =

1 this time

CG
ATG

AA
ATG
GA

Sequence Alignments

37

Food for thought…


What is the asymptotic time complexity of
the Needleman & Wunsch algorithm?


Is this good or bad?


How about the
space

complexity?


Why might this be a problem?

Sequence Alignments

38

Saving Space


Note that we can throw away the previous rows
of the table as we fill it in:

This row
is based
only on
this one

Sequence Alignments

39

Saving Space (2)


Each row

of the table contains the scores for
aligning a
prefix

of the left
-
hand sequence with
all prefixes

of the top sequence:

Scores for
aligning
aca

with
all
prefixes of
actcg

Sequence Alignments

40

Divide and Conquer


By using a recursive approach, we can use only
two rows of the matrix at a time:


Choose the middle character of the top sequence,
i


Find out where
i

aligns to the bottom sequence


Needs two vectors of scores


Recursively align the sequences before and after the
fixed positions

ACGCTATGCTCATAG

CGACGCTCATCG

i

Sequence Alignments

41

Finding where
i

lines up


Find out where
i

aligns to the bottom sequence


Needs
two vectors

of scores






Assuming
i

lines up with a
character
:

alignscore = align(
ACGCTAT
, prefix(
t
)) + score(
G
, char from
t
)



+ align(
CTCATAG
, suffix(
t
))


Which character is best?


Can quickly find out the score for aligning
ACGCTAT

with
every prefix

of
t
.


s
:

ACGCTAT
G
CTCATAG

t
:

CGACGCTCATCG

i

Sequence Alignments

42

Finding where
i

lines up


But,
i

may also line up with a gap






Assuming
i

lines up with a
gap
:


alignscore = align(
ACGCTAT
, prefix(
t
)) + gapscore



+ align(
CTCATAG
, suffix(
t
))

s
:

ACGCTAT
G
CTCATAG

t
:

CGACGCTCATCG

i

Sequence Alignments

43

Recursive Call


Fix the best position for
I


Call
align

recursively for the prefixes and
suffixes:

s
:

ACGCTAT
G
CTCATAG

t
:


CGAC
G
CTCATCG

i

Sequence Alignments

44

Complexity


Let len(
s
) =
m

and len(
t
) =
n


Space: 2
m


Time:


Each call to build

similarity vector =
m
´
n
´


First call + recursive call:

s
:

ACGCTAT
G
CTCATAG

t
:


CGAC
G
CTCATCG

i

j

Sequence Alignments

45

General Gap Penalties


Suppose we are no longer using simple gap
penalties:


Origination =

2


Length =

1


Consider the last position of the alignment for
ACGTA

with
ACG


We can’t determine the score for



unless we know the previous positions!

G

-

-

G

or

Sequence Alignments

46

Scoring Blocks


Now we must score a
block

at a time





A
block

is a pair of characters, or a
maximal

group of gaps paired with characters


To score a position, we need to either start a
new block or add it to a previous block

A A C
---

A TATCCG A C T AC

A C T ACC T
------

C G C
--

Sequence Alignments

47

The Algorithm


Three tables


a



scores for alignments ending in char
-
char blocks


b



scores for alignments ending in gaps in the top
sequence (
s
)


c



scores for alignments ending in gaps in the left
sequence (
t
)


Scores no longer depend on only three
positions, because we can put
any number

of
gaps into the last block

Sequence Alignments

48

The Recurrences

Sequence Alignments

49

The Optimal Alignment


The optimal alignment is found by looking at
the maximal value in the lower right of all three
arrays


The algorithm runs in
O
(
n
3
) time


Uses

O
(
n
2
) space