# CS790 – Introduction to Bioinformatics

Biotechnology

Oct 4, 2013 (4 years and 8 months ago)

78 views

Sequence Alignments

and Dynamic Programming

BIO/CS 471

Algorithms for Bioinformatics

Sequence Alignments

2

Module II: Sequence Alignments

Problem: SequenceAlignment

Input: Two or more strings of characters

Output: The
optimal alignment

of the input strings,
possibly including gaps, and an
alignment score.

Example

Input:

The calico cat was very sleepy.

The cat was sleepy.

Output:

The calico cat was very sleepy.

The
------

cat was
----

sleepy.

Sequence Alignments

3

Biological Sequence Alignment

Input: DNA or Amino Acid sequence

Output: Optimal alignment

Example

Input:

ACTATAGCTATAGCTATAGCGTATCG

ACGATTACAGGTCGTGTCG

Output:

ACTATAGCTATAGCTATAGCGTATCG

|| || ||*|| | |||*|||

ACGAT
---
TACAGGT
----
CGTGTCG

Score: +8

Sequence Alignments

4

Why align sequences?

Why would we want to align two sequences?

Compare two genes/proteins, e.g. to infer function

Database searches

Protein structure prediction

Why would we want to align multiple
sequences?

Conservation analysis

Molecular evolution

Sequence Alignments

5

How do we decide on a scoring function?

Objective: Sequences that are
homologous

or
evolutionarily related, should align with high
scores. Less related sequences should score
lower.

Question: How do
homologous

sequences
arise?

Sequence Alignments

6

Transcription

The Central
Dogma

DNA

transcription

RNA

translation

Proteins

Sequence Alignments

7

DNA Replication

Prior to cell division, all the
genetic instructions must be
“copied” so that each new cell
will have a complete set

DNA polymerase is the enzyme
that copies DNA

Reads the old strand in the 3
´

to 5
´

direction

Over time, genes
accumulate
mutations

Environmental factors

Radiation

Oxidation

Mistakes in replication or repair

Deletions, Duplications

Insertions

Inversions

Point mutations

Sequence Alignments

9

Codon deletion:

ACG ATA GCG TAT GTA TAG CCG…

Effect depends on the protein, position, etc.

Almost always deleterious

Sometimes lethal

Frame shift mutation:

ACG ATA GCG TAT GTA TAG CCG…

ACG ATA GCG ATG TAT AGC CG?…

Almost always lethal

Deletions

Sequence Alignments

10

Indels

Comparing two genes it is generally impossible
to tell if an
indel

is an insertion in one gene, or
a deletion in another, unless ancestry is known:

ACGTCTGAT
ACG
CCGTATCGTCTATCT

ACGTCTGAT
---
CCGTATCGTCTATCT

Sequence Alignments

11

The Genetic Code

Substitutions

are
mutations
accepted by
natural selection.

Synonymous:

CG
C

CG
A

Non
-
synonymous:

GA
U

GA
A

Sequence Alignments

12

Two kinds of
homologs

Orthologs

Species A has a particular
gene

Species A diverges into
species B and C, each
with the same gene

The two genes
accumulate substitutions
independently

Sequence Alignments

13

Orthologs and Paralogs

Paralogs

A gene duplication event
within a species results in
two copies of a gene

One of the copies is now
under less selective
constraint

The copies can
accumulate substitutions
independently, before and
after speciation

Sequence Alignments

14

Scoring a sequence alignment

Match score:

+1

Mismatch score:

+0

Gap penalty:

1

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT

||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT

Matches: 18
×

(+1)

Mismatches: 2
×

0

Gaps: 7
×

(

1)

Score = +11

Sequence Alignments

15

Origination and length penalties

We want to find alignments that are
evolutionarily likely.

Which of the following alignments seems more
likely to you?

ACGTCTGATACGCCGTATAGTCTATCT

ACGTCTGAT
-------
ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT

AC
-
T
-
TGA
--
CG
-
CGT
-
TA
-
TCTATCT

We can achieve this by penalizing more for a
new gap, than for extending an existing gap

Sequence Alignments

16

Scoring a sequence alignment (2)

Match/mismatch score:

+1/+0

Origination/length penalty:

2/

1

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT

||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT

Matches: 18
×

(+1)

Mismatches: 2
×

0

Origination: 2
×

(

2)

Length: 7
×

(

1)

Score = +7

Sequence Alignments

17

Optimal Substructure in Alignments

Consider the alignment:

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT

||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT

Is it true that the alignment in the boxed region
must be optimal?

Sequence Alignments

18

A Greedy Strategy

Consider this pair of sequences

GAGC

CAGC

Greedy Approach:

G or G or
-

C
-

G

Leads to

GAGC
---

Better: GACG

---
CAGC CACG

GAP =

1

Match = +1

Mismatch =

2

Sequence Alignments

19

Breaking apart the problem

Suppose we are aligning:

ACTCG

ACAGTAG

First position choices:

A

+1

CTCG

A

CAGTAG

A

-
1

CTCG

-

ACAGTAG

-

-
1

ACTCG

A

CAGTAG

Sequence Alignments

20

A Recursive Approach to Alignment

Choose the best alignment based on these three
possibilities:

align(seq1, seq2) {

if (both sequences empty) {return 0;}

if (one string empty) {

return(gapscore * num chars in nonempty seq);

else {

score1 = score(firstchar(seq1),firstchar(seq2))

+ align(tail(seq1), tail(seq2));

score2 = align(tail(seq1), seq2) + gapscore;

score3 = align(seq1, tail(seq2) + gapscore;

return(min(score1, score2, score3));

}

}

}

Sequence Alignments

21

Time Complexity of RecurseAlign

What is the recurrence equation for the time
needed by RecurseAlign?

3

3

3

3

3

3

n

3

9

27

3
n

Sequence Alignments

22

RecurseAlign repeats its work

Sequence Alignments

23

How can we find an optimal alignment?

Finding the alignment is computationally hard:

ACGTCTGATACGCCGTATAGTCTATCT

CTGAT
---
TCG

CATCGTC
--
T
-
ATCT

C(27,7) gap positions = ~888,000 possibilities

It’s possible, as long
as we don’t repeat our
work
!

Dynamic programming: The Needleman &
Wunsch algorithm

Sequence Alignments

24

What is the optimal alignment?

ACTCG

ACAGTAG

Match: +1

Mismatch: 0

Gap:

1

Sequence Alignments

25

Needleman
-
Wunsch: Step 1

Each sequence along one axis

Mismatch penalty multiples in first row/column

0 in [1,1] (or [0,0] for the CS
-
minded)

Sequence Alignments

26

Needleman
-
Wunsch: Step 2

Vertical/Horiz. move: Score + (simple) gap penalty

Diagonal move: Score + match/mismatch score

Take the
MAX

of the three possibilities

Sequence Alignments

27

Needleman
-
Wunsch: Step 2 (cont’d)

Fill out the rest of the table likewise…

Sequence Alignments

28

Needleman
-
Wunsch: Step 2 (cont’d)

Fill out the rest of the table likewise…

The optimal alignment score is calculated in the
lower
-
right corner

Sequence Alignments

29

But what
is

the optimal alignment

To reconstruct the optimal alignment, we must
determine of where the MAX at each step came
from…

Sequence Alignments

30

A path corresponds to an alignment

= GAP in top sequence

= GAP in left sequence

= ALIGN both positions

One path from the previous table:

Corresponding alignment (start at the end):

AC
--
TCG

ACAGTAG

Score = +2

Sequence Alignments

31

Practice Problem

Find an optimal alignment for these two
sequences:

GCGGTT

GCGT

Match: +1

Mismatch: 0

Gap:

1

Sequence Alignments

32

Practice Problem

Find an optimal alignment for these two
sequences:

GCGGTT

GCGT

GCGGTT

GCG
-
T
-

Score = +2

Sequence Alignments

33

Semi
-
global alignment

Suppose we are aligning:

GCG

GGCG

Which do you prefer?

G
-
CG

-
GCG

GGCG

GGCG

Semi
-
global alignment allows gaps at the ends
for free.

Sequence Alignments

34

Semi
-
global alignment

Semi
-
global alignment allows gaps at the ends
for free.

Initialize first row and column to all 0’s

Allow free horizontal/vertical moves in last
row and column

Sequence Alignments

35

Local alignment

Global alignments

score the entire alignment

Semi
-
global alignments

allow unscored gaps
at the beginning or end of either sequence

Local alignment

find the best matching
subsequence

CG
ATG

AA
ATG
GA

This is achieved by allowing a 4
th

alternative at
each position in the table: zero.

Sequence Alignments

36

Local alignment

Mismatch =

1 this time

CG
ATG

AA
ATG
GA

Sequence Alignments

37

Food for thought…

What is the asymptotic time complexity of
the Needleman & Wunsch algorithm?

Is this good or bad?

How about the
space

complexity?

Why might this be a problem?

Sequence Alignments

38

Saving Space

Note that we can throw away the previous rows
of the table as we fill it in:

This row
is based
only on
this one

Sequence Alignments

39

Saving Space (2)

Each row

of the table contains the scores for
aligning a
prefix

of the left
-
hand sequence with
all prefixes

of the top sequence:

Scores for
aligning
aca

with
all
prefixes of
actcg

Sequence Alignments

40

Divide and Conquer

By using a recursive approach, we can use only
two rows of the matrix at a time:

Choose the middle character of the top sequence,
i

Find out where
i

aligns to the bottom sequence

Needs two vectors of scores

Recursively align the sequences before and after the
fixed positions

ACGCTATGCTCATAG

CGACGCTCATCG

i

Sequence Alignments

41

Finding where
i

lines up

Find out where
i

aligns to the bottom sequence

Needs
two vectors

of scores

Assuming
i

lines up with a
character
:

alignscore = align(
ACGCTAT
, prefix(
t
)) + score(
G
, char from
t
)

+ align(
CTCATAG
, suffix(
t
))

Which character is best?

Can quickly find out the score for aligning
ACGCTAT

with
every prefix

of
t
.

s
:

ACGCTAT
G
CTCATAG

t
:

CGACGCTCATCG

i

Sequence Alignments

42

Finding where
i

lines up

But,
i

may also line up with a gap

Assuming
i

lines up with a
gap
:

alignscore = align(
ACGCTAT
, prefix(
t
)) + gapscore

+ align(
CTCATAG
, suffix(
t
))

s
:

ACGCTAT
G
CTCATAG

t
:

CGACGCTCATCG

i

Sequence Alignments

43

Recursive Call

Fix the best position for
I

Call
align

recursively for the prefixes and
suffixes:

s
:

ACGCTAT
G
CTCATAG

t
:

CGAC
G
CTCATCG

i

Sequence Alignments

44

Complexity

Let len(
s
) =
m

and len(
t
) =
n

Space: 2
m

Time:

Each call to build

similarity vector =
m
´
n
´

First call + recursive call:

s
:

ACGCTAT
G
CTCATAG

t
:

CGAC
G
CTCATCG

i

j

Sequence Alignments

45

General Gap Penalties

Suppose we are no longer using simple gap
penalties:

Origination =

2

Length =

1

Consider the last position of the alignment for
ACGTA

with
ACG

We can’t determine the score for

unless we know the previous positions!

G

-

-

G

or

Sequence Alignments

46

Scoring Blocks

Now we must score a
block

at a time

A
block

is a pair of characters, or a
maximal

group of gaps paired with characters

To score a position, we need to either start a
new block or add it to a previous block

A A C
---

A TATCCG A C T AC

A C T ACC T
------

C G C
--

Sequence Alignments

47

The Algorithm

Three tables

a

scores for alignments ending in char
-
char blocks

b

scores for alignments ending in gaps in the top
sequence (
s
)

c

scores for alignments ending in gaps in the left
sequence (
t
)

Scores no longer depend on only three
positions, because we can put
any number

of
gaps into the last block

Sequence Alignments

48

The Recurrences

Sequence Alignments

49

The Optimal Alignment

The optimal alignment is found by looking at
the maximal value in the lower right of all three
arrays

The algorithm runs in
O
(
n
3
) time

Uses

O
(
n
2
) space