CS790 – Introduction to Bioinformatics - Winona State University

powerfultennesseeBiotechnology

Oct 2, 2013 (3 years and 9 months ago)

100 views

Sequence Alignments

Introduction to Bioinformatics

Intro to Bioinformatics


Sequence Alignment

2

Sequence Alignments


Cornerstone of bioinformatics


What is a sequence?


Nucleotide sequence


Amino acid sequence


Pairwise and multiple sequence alignments


We will focus on pairwise alignments


What alignments can help


Determine function of a newly discovered gene sequence


Determine evolutionary relationships among genes, proteins,
and species


Predicting structure and function of protein

Acknowledgement: This notes is adapted from lecture notes of both Wright State University’s Bioinformatics Program and Profes
sor

Laurie Heyer of Davidson College with permission.

Intro to Bioinformatics


Sequence Alignment

3

DNA Replication


Prior to cell division, all the
genetic instructions must be
“copied” so that each new cell
will have a complete set


DNA polymerase is the enzyme
that copies DNA


Reads the old strand in the 3
´

to 5
´

direction

Intro to Bioinformatics


Sequence Alignment

4

Over time, genes accumulate
mutations


Environmental factors


Radiation


Oxidation


Mistakes in replication or
repair


Deletions, Duplications


Insertions, Inversions


Translocations


Point mutations

Intro to Bioinformatics


Sequence Alignment

5


Codon deletion:

ACG ATA GCG TAT GTA TAG CCG…


Effect depends on the protein, position, etc.


Almost always deleterious


Sometimes lethal


Frame shift mutation:


ACG ATA GCG TAT GTA TAG CCG…



ACG ATA GCG ATG TAT AGC CG?…


Almost always lethal

Deletions

Intro to Bioinformatics


Sequence Alignment

6

Indels


Comparing two genes it is generally impossible
to tell if an
indel

is an insertion in one gene, or
a deletion in another, unless ancestry is known:


ACGTCTGAT
ACG
CCGTATCGTCTATCT

ACGTCTGAT
---
CCGTATCGTCTATCT

Intro to Bioinformatics


Sequence Alignment

7

The Genetic Code

Substitutions

are
mutations
accepted by
natural selection.


Synonymous:


CG
C



CG
A


Non
-
synonymous:


GA
U



GA
A

Intro to Bioinformatics


Sequence Alignment

8

Comparing Two Sequences


Point mutations, easy:

ACGTCTGAT
A
CGCC
G
TAT
A
GTCTATCT

ACGTCTGAT
T
CGCC
C
TAT
C
GTCTATCT


Indels are difficult, must
align

sequences:

AC
G
TC
T
GAT
A
CGCCG
TAT
AGTCTATCT

CT
G
AT
T
CGC
A
TCGTC
TAT
CT


ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT

----
CTGAT
T
CGC
---
AT
C
GTCTATCT


Intro to Bioinformatics


Sequence Alignment

9

Why Align Sequences?


The draft human genome is available


Automated gene finding is possible


Gene:
AGTACGTATCGTATAGCGTAA


What does it do?


One approach: Is there a similar gene in
another species?


Align sequences with known genes


Find the gene with the “best” match


Examples

Intro to Bioinformatics


Sequence Alignment

10

Gaps or No Gaps

Intro to Bioinformatics


Sequence Alignment

11

Scoring a Sequence Alignment

Given


Match score:


+1


Mismatch score:


+0


Gap penalty:



1


Sequences

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT


||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT




Matches: 18
×

(+1)


Mismatches: 2
×

0


Gaps: 7
×

(


1)

Score = +11

Intro to Bioinformatics


Sequence Alignment

12

Origination and Length Penalties


Note that a bioinformatics computational model or
algorithm must be “
biologically meaningful”
or
even “
biologically significant



We want to find alignments that are
evolutionarily
likely.


Which of the following alignments seems more
likely to you?


ACGTCTGATACGCCGTATAGTCTATCT

ACGTCTGAT
-------
ATAGTCTATCT


ACGTCTGATACGCCGTATAGTCTATCT

AC
-
T
-
TGA
--
CG
-
CGT
-
TA
-
TCTATCT


We can achieve this by penalizing more for a new
gap, than for extending an existing gap





Intro to Bioinformatics


Sequence Alignment

13

Scoring a Sequence Alignment (2)

Given


Match/mismatch score:


+1/+0


Gap origination/length penalty:


2/

1


Sequences

ACGTCTGAT
A
CGCCGTAT
A
GTCTATCT


||||| ||| || ||||||||

----
CTGAT
T
CGC
---
AT
C
GTCTATCT




Matches: 18
×

(+1)


Mismatches: 2
×

0


Origination: 2
×

(

2)


Length: 7
×

(

1)




Caution: Sometime “gap extension” used instead of “gap
length” …

Score = +7

Intro to Bioinformatics


Sequence Alignment

14

How can we find an optimal alignment?


Finding the alignment is computationally hard:

ACGTCTGATACGCCGTATAGTCTATCT

CTGAT
---
TCG
-
CATCGTC
--
T
-
ATCT


C(27,7) gap positions = ~888,000 possibilities


It’s possible, as long as

we don’t repeat our
work
!


Dynamic programming: The Needleman &
Wunsch algorithm

Intro to Bioinformatics


Sequence Alignment

15

Dynamic Programming


Technique of solving optimization problems


Find and
memorize

solutions for subproblems


Use those solutions to build solutions for larger
subproblems


Continue until the final solution is found


Recursive computation of cost function in a
non
-
recursive fashion

Intro to Bioinformatics


Sequence Alignment

16

Dynamic Programming


Example


Solving Fibonacci number f(n) = f(n
-
1) + f(n
-
2)
recursively takes exponential time


Because many numbers are recalculated


Solving it using dynamic programming only takes a
linear time


Use an array f[] to store the numbers


f[0] = 1;


f[1] = 1;


for (i = 2; i <= n; i++)



f[i] = f[i


1] + f[i


2];

Intro to Bioinformatics


Sequence Alignment

17

Global Sequence Alignment


Needleman
-
Wunsch algorithm






Suppose we are aligning:


a
with
a



D
i,j

= max{
D
i
-
1
,j

+
d
(
A
i
,

),
D
i,j
-
1

+
d
(

,
B
j
),
D
i
-
1
,j
-
1

+
d
(
A
i
, B
j
)

}

i
-
1

j
-
1

j

i

Seq1

Seq2

A

B

Intro to Bioinformatics


Sequence Alignment

18

Dynamic Programming (DP) Concept


Suppose we are aligning:


CACGA



CGA


Intro to Bioinformatics


Sequence Alignment

19

DP


Recursion Perspective


Suppose we are aligning:

ACTCG

ACAGTAG


Last position choices:

G

+1

ACTC

G


ACAGTA


G

-
1

ACTC

-


ACAGTAG


-

-
1

ACTCG

G


ACAGTA

Intro to Bioinformatics


Sequence Alignment

20

What is the optimal alignment?


ACTCG

ACAGTAG


Match: +1


Mismatch: 0


Gap:

1

Intro to Bioinformatics


Sequence Alignment

21

Needleman
-
Wunsch: Step 1


Each sequence along one axis


Gap penalty multiples in first row/column


0 in [1,1] (or [0,0] for the CS
-
minded)

Intro to Bioinformatics


Sequence Alignment

22

Needleman
-
Wunsch: Step 2


Vertical/Horiz. move: Score + (simple) gap penalty


Diagonal move: Score + match/mismatch score


Take the
MAX

of the three possibilities

Intro to Bioinformatics


Sequence Alignment

23

Needleman
-
Wunsch: Step 2 (cont’d)


Fill out the rest of the table likewise…

Intro to Bioinformatics


Sequence Alignment

24

Needleman
-
Wunsch: Step 2 (cont’d)


Fill out the rest of the table likewise…


The optimal alignment score is calculated in the
lower
-
right corner

Intro to Bioinformatics


Sequence Alignment

25

But what
is

the optimal alignment


To reconstruct the optimal alignment, we must
determine of where the MAX at each step came
from…

Intro to Bioinformatics


Sequence Alignment

26

A path corresponds to an alignment



= GAP in top sequence



= GAP in left sequence



= ALIGN both positions


One path from the previous table:


Corresponding alignment (start at the end):



AC
--
TCG


ACAGTAG


Score = +2

Intro to Bioinformatics


Sequence Alignment

27

Algorithm Analysis


Brute force approach



If the length of both sequences is
n
, number of
possibility = C(2
n, n
) = (2
n
)!/(
n
!)
2



2
2
n

/ (

n
)
1/2
,
using Sterling’s approximation of
n
! = (2

n
)
1/2
e
-
n
n
n
.


O(
4
n
)


Dynamic programming


O(
mn
), where the two sequence sizes are
m

and
n
,
respectively


O(
n
2
), if
m

is in the order of
n


Intro to Bioinformatics


Sequence Alignment

28

Practice Problem


Find an optimal alignment for these two
sequences:


GCGGTT


GCGT


Match: +1


Mismatch: 0


Gap:

1


Intro to Bioinformatics


Sequence Alignment

29

Practice Problem


Find an optimal alignment for these two
sequences:


GCGGTT


GCGT


GCGGTT

GCG
-
T
-

Score = +2

Intro to Bioinformatics


Sequence Alignment

30

Semi
-
global alignment


Suppose we are aligning:

GCG

GGCG


Which do you prefer?

G
-
CG


-
GCG

GGCG


GGCG


Semi
-
global alignment allows gaps at the ends
for free.


Terminal gaps are usually the result of incomplete
data acquisition


no biologically significant

Intro to Bioinformatics


Sequence Alignment

31

Semi
-
global alignment


Semi
-
global alignment allows gaps at the ends
for free.






Initialize first row and column to all 0’s


Allow free horizontal/vertical moves in last row
and column

Intro to Bioinformatics


Sequence Alignment

32

Local alignment


Global alignments


score the entire alignment


Semi
-
global alignments


allow unscored gaps
at the beginning or end of either sequence


Local alignment


find the best matching
subsequence


CG
ATG

AA
ATG
GA


This is achieved by allowing a 4
th

alternative at
each position in the table: zero.

Intro to Bioinformatics


Sequence Alignment

33

Local Sequence Alignment


Why local sequence alignment?


Subsequence comparison between a DNA sequence
and a genome


Protein function domains


Exons matching


Smith
-
Waterman algorithm

D
i,j

= max{


D
i
-
1
,j

+
d
(
A
i
,

),


D
i,j
-
1

+
d
(

,
B
j
),


D
i
-
1
,j
-
1

+
d
(
A
i
,B
j
),


0

}

Initialization:
D
1
,j

=

,
D
i,
1

=


Intro to Bioinformatics


Sequence Alignment

34

Local alignment


Score: Match = 1, Mismatch =
-
1, Gap =
-
1

CG
ATG

AA
ATG
GA

c

g

a

t

g

0

0

0

0

0

0

a

0

0

0

1

0

0

a

0

0

0

1

0

0

a

0

0

0

1

0

0

t

0

0

0

0

2

1

g

0

0

1

0

1

3

g

0

0

1

0

0

2

a

0

0

0

2

1

1

Intro to Bioinformatics


Sequence Alignment

35

Local alignment


Another example

Intro to Bioinformatics


Sequence Alignment

36

More Example


Align

ATGGCCTC

ACGGCTC

Mismatch



=
-
3

Gap


=
-
4


--

A

C

G

G

C

T

C

0

-
4

-
8

-
12

-
16

-
20

-
24

-
28

1

-
3

-
3

--

A

T

G

G

C

C

T

C

-
4

-
8

-
12

-
16

-
20

-
24

-
28

-
32

-
7

-
11

-
15

-
19

-
23

-
2

-
6

-
10

-
14

-
14

-
18

-
7

-
6

-
1

-
5

-
9

-
13

-
17

-
11

-
10

-
5

0

-
4

-
8

-
12

-
15

-
10

-
9

-
4

1

-
3

-
7

-
19

-
14

-
13

-
8

-
3

-
2

-
2

-
23

-
18

-
17

-
12

-
7

-
2

-
5

-
27

-
22

-
21

-
16

-
11

-
6

-
1

Global

Alignment:


A
T
GGC
C
TC

A
C
GGC
-
TC

Intro to Bioinformatics


Sequence Alignment

37

More Example





Local

Alignment:


ATGGC
CTC

ACGG
CTC


or


AT
GGC
CTC

AC
GGC

TC

--

A

C

G

G

C

T

C

0

0

0

0

0

0

0

0

1

0

0

--

A

T

G

G

C

C

T

C

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

0

0

1

1

0

0

0

0

0

1

2

0

0

0

0

1

0

0

3

0

1

0

1

0

0

1

0

1

0

0

0

0

0

2

0

0

1

0

0

1

0

3

Intro to Bioinformatics


Sequence Alignment

38

Scoring Matrices for DNA Sequences


Transition: A


G C


T


Transversion: a purine (A or G) is replaced by a
pyrimadine (C or T) or vice versa

Intro to Bioinformatics


Sequence Alignment

39

Scoring Matrices for Protein Sequence


PAM 250

Intro to Bioinformatics


Sequence Alignment

40

Scoring Matrices for Protein Sequence


PAM (Percent Accepted Mutations)


1 PAM unit can be thought of as the amount of
time

in which an
“average” protein mutates 1 out of every 100 amino acids.


Perform multiple sequence alignment on a “family” of proteins that are
at least 85% similar. Find the frequency of amino acid
i

and j are aligned
to each other and
normalize

it.


Entry
m
ij

in PAM1 represents the probability of amino acid
i

substituted by amino acid j in 1 PAM unit


PAM 2 = PAM 1
×

PAM 1


PAM
n

= (PAM 1)
n
, e.g., PAM 250 = (PAM 1)
250


Questions


Why are values in a PAM matrix integers? Shouldn’t a probability be
between 0 and 1?


Why is PAM
250

possible? Shouldn’t a probability be less than or equal
to 100%?

Intro to Bioinformatics


Sequence Alignment

41

Scoring Matrices for Protein Sequence


BLOSUM (BLOcks SUbstitution Matrix) 62

Intro to Bioinformatics


Sequence Alignment

42

Using Protein Scoring Matrices


Divergence

BLOSUM 80

BLOSUM 62


BLOSUM 45

PAM 1


PAM 120


PAM 250

Closely related




Distantly related

Less divergent




More divergent

Less sensitive




More sensitive


Looking for


Short similar sequences → use less sensitive matrix


Long dissimilar sequences → use more sensitive matrix


Unknown → use range of matrices


Comparison


PAM


designed to track evolutionary origin of proteins


BLOSUM


designed to find conserved regions of proteins

43

Multiple Sequence Alignment


Why multiple sequence alignment (MSA)


Two sequences might not look very similar.


But, some “similarity” emerges with more sequences,
however.


Is dynamic programming still applicable?


CLUSTALW


One of most popular tools for MSA


Heuristic
-
based approach


Basic

1) Calculate the distance matrix based on pairwise alignments

2) Construct a
guide tree


NJ (unrooted tree)


Mid
-
point (rooted tree)

3) Progressive alignment using the guide tree