Bioinformatics - Computer Science

weinerthreeforksBiotechnology

Oct 2, 2013 (4 years and 1 month ago)

Roadmap

The topics:


basic concepts of molecular biology


more on Perl


overview of the field


biological databases and database searching


sequence alignments


phylogenetics


structure prediction


microarray data analysis

Sequence alignments


Introduction


What is an alignment?


Why do alignments?


A bit of history


Dot matrix comparison


Scoring alignments


Alignment methods


Significance of alignments

What is sequence alignment?


Sequence alignment is an arrangement of
two or more sequences, highlighting their
similarity.

Why do alignments?

Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.

Over time, genes accumulate mutations


Environmental factors


Radiation


Oxidation


Mistakes in replication/repair


Deletions, Duplications


Insertions


Inversions


Point mutations

Comparing two sequences


Point mutations, easy:

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

(the two sequences differ at positions 10, 15, and 19)
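
When only point mutations separate two sequences, they already line up column by column, so comparing them is a position-wise scan. A minimal Python sketch of that comparison follows; the function name `mismatch_positions` is illustrative and not part of the original slides.

```python
def mismatch_positions(seq_a, seq_b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two sequences from the slide, written out in full.
original = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"

print(mismatch_positions(original, mutated))  # [9, 14, 18]
```

This only works because the sequences have the same length; once insertions or deletions are involved, the positions no longer correspond and an alignment is needed first.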



Insertions/deletions, must align:

ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT

Sequence Alignment

Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.

A sequence for platelet-derived growth factor (PDGF) from mammalian cells was virtually identical to the sequence for the retrovirus-encoded oncogene known as v-sis (a gene causing cancer in animals).



The retrovirus had acquired the gene from the host cell through some kind of genetic exchange event, and had then produced a mutant that could alter the function of the normal protein when it infected another animal.

Russell F. Doolittle

Dot Matrix Comparison

A: T C A G A G G T C T G

B: T C A G A G C T G


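(columns: sequence A; rows: sequence B; X marks identical residues)

   T C A G A G G T C T G
T  X . . . . . . X . X .
C  . X . . . . . . X . .
A  . . X . X . . . . . .
G  . . . X . X X . . . X
A  . . X . X . . . . . .
G  . . . X . X X . . . X
C  . X . . . . . . X . .
T  X . . . . . . X . X .
G  . . . X . X X . . . X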

Interpretation of dot matrix


Regions of similarity appear as diagonal runs of dots



Reverse diagonals (perpendicular to diagonal) indicate
inversions



Can link or "join" separate diagonals to form alignment
with "gaps"
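
The dot matrix for sequences A and B can be generated mechanically: place a mark wherever the residue in one sequence equals the residue in the other. The following Python sketch does this for the two example sequences; the function name `dot_matrix` and the choice of `X` and `.` as symbols are assumptions made for illustration.

```python
def dot_matrix(seq_rows, seq_cols):
    """Return one string per row; 'X' marks cells where the residues match."""
    return [
        "".join("X" if r == c else "." for c in seq_cols)
        for r in seq_rows
    ]


seq_a = "TCAGAGGTCTG"  # sequence A, drawn across the columns
seq_b = "TCAGAGCTG"    # sequence B, drawn down the rows

grid = dot_matrix(seq_b, seq_a)

# Print the matrix with the sequences as row and column labels.
print("  " + " ".join(seq_a))
for residue, row in zip(seq_b, grid):
    print(residue + " " + " ".join(row))
```

The long run of marks on the main diagonal corresponds to the shared prefix TCAGAG; the shorter off-diagonal runs are the candidates that would be joined across a gap.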

More on Dot Matrix


Detection of matching regions can be improved by filtering: use a sliding window to compare the two sequences. For example, print a dot at a matrix position only if

7 out of the next 11 positions in the sequence are identical, or

the similarity score of the next 11 positions in the sequence is greater than 5.
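
The identity-count filter described above can be sketched in Python as follows. The function name `windowed_dots` and its parameters are illustrative assumptions; the defaults mirror the "7 out of 11" rule, and the window is taken to run forward along the diagonal from each position.

```python
def windowed_dots(seq_rows, seq_cols, window=11, min_matches=7):
    """Return the set of (row, col) start positions whose next `window`
    diagonal positions contain at least `min_matches` identical residues."""
    dots = set()
    for i in range(len(seq_rows) - window + 1):
        for j in range(len(seq_cols) - window + 1):
            matches = sum(
                seq_rows[i + k] == seq_cols[j + k] for k in range(window)
            )
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# Small demonstration with a window of 3 and no mismatches allowed,
# using the short example sequences from the dot matrix slide.
seq_a = "TCAGAGGTCTG"
seq_b = "TCAGAGCTG"
print(sorted(windowed_dots(seq_b, seq_a, window=3, min_matches=3)))
# [(0, 0), (1, 1), (2, 2), (3, 3), (6, 8)]
```

Compared with the unfiltered matrix, only the diagonal run and the shared CTG at the end survive, which is the noise reduction the filter is meant to give.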

Sequence repeats


Many sequences contain repetitive regions.

Example: a retrovirus vector sequence compared against itself using a window size of 9 and a mismatch limit of 2 (http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)

More on Dot Matrix


Dot matrix graphically presents regions of identity or
similarity between two sequences



The use of windows and thresholds can reduce
“noise” in dot matrix



Inversions and duplications have unique “signatures”
in dot matrix

Software


Dotlet (Java applet): www.ch.embnet.org

Dnadot: arbl.cvmbs.colostate.edu/molkit/dnadot/

Dotter: www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html

Dottup: www.emboss.org

How to measure the similarity

Basically three kinds of changes can occur at any
given position within a sequence:


Mutation (substitution of one residue for another)


Insertion


Deletion


Insertion and deletion have been found to occur in
nature at a significantly lower frequency than
mutations.

Scoring Matrices for Aligning DNA Sequences

Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).

Transversions: substitutions in which a purine (A/G) is replaced by a pyrimidine (C/T), or vice versa.
Identity matrix

     A    T    C    G
A    1    0    0    0
T    0    1    0    0
C    0    0    1    0
G    0    0    0    1

BLAST matrix

     A    T    C    G
A    5   -4   -4   -4
T   -4    5   -4   -4
C   -4   -4    5   -4
G   -4   -4   -4    5

Transition-transversion matrix

     A    T    C    G
A    1   -5   -5   -1
T   -5    1   -1   -5
C   -5   -1    1   -5
G   -1   -5   -5    1
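
All three DNA matrices follow from one rule: classify the pair of bases as a match, a transition, or a transversion, then look up the score for that class. The sketch below builds them that way in Python; the names `substitution_type` and `build_matrix` are hypothetical helpers, and the scores are the ones shown on the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of DNA bases as 'match', 'transition' or 'transversion'."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def build_matrix(match, transition, transversion, alphabet="ATCG"):
    """Return a dict mapping every ordered base pair to its score."""
    scores = {
        "match": match,
        "transition": transition,
        "transversion": transversion,
    }
    return {
        (x, y): scores[substitution_type(x, y)]
        for x in alphabet
        for y in alphabet
    }


identity = build_matrix(1, 0, 0)
blast = build_matrix(5, -4, -4)
ts_tv = build_matrix(1, -1, -5)

print(ts_tv[("A", "G")], ts_tv[("A", "C")])  # -1 -5
```

The identity and BLAST matrices do not distinguish the two kinds of mismatch, so they pass the same value for both; only the transition-transversion matrix penalizes transversions more heavily.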

Scoring a sequence alignment


Match score: +1

Mismatch score: +0

Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)

Mismatches: 2 × 0

Gaps: 7 × (-1)

Score = +11
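
The tally above can be reproduced by walking the two aligned rows column by column. This is a minimal Python sketch under the slide's scoring scheme (match +1, mismatch 0, gap -1); the function name `score_alignment` is an assumption for illustration.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score two aligned rows of equal length, where '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # every gap position costs the same
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 11
```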

Gap opening and extension penalties


We want to find alignments that are
evolutionarily likely.


Which of the following alignments seems more likely to
you?



ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT



We can achieve this by penalizing more for a new gap,
than for extending an existing gap





Scoring a sequence alignment


Match/mismatch score: +1/+0

Open/extension penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)

Mismatches: 2 × 0

Open: 2 × (-2)

Extension: 5 × (-1)

Score = +9
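
With separate opening and extension penalties, the first position of each gap is charged the opening penalty and every further position of the same gap the extension penalty, which is the convention the tally above uses (2 openings, 5 extensions). The Python sketch below applies that convention; the function name `affine_score` is illustrative.

```python
def affine_score(row_a, row_b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows, charging gap_open for the first position
    of each gap and gap_extend for each additional position."""
    score = 0
    previous_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


reference = "ACGTCTGATACGCCGTATAGTCTATCT"

print(affine_score(reference, "----CTGATTCGC---ATCGTCTATCT"))  # 9

# The two candidate alignments from the gap-penalty slide:
print(affine_score(reference, "ACGTCTGAT-------ATAGTCTATCT"))  # 12
print(affine_score(reference, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # 7
```

Both candidate alignments contain 20 matches and 7 gap positions, so a single per-position gap penalty would score them equally; the opening penalty is what makes the single long deletion score higher than six scattered gaps.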

Amino Acid Substitution Matrices


PAM: point accepted mutation; based on global alignments [evolutionary model]

BLOSUM: blocks substitution matrix; based on local alignments [similarity among conserved sequences]

Part of PAM 250 Matrix

     C    S    T    P    A    G
C   12
S    0    2
T   -2    1    3
P   -3    1    0    6
A   -2    1    1    1    2
G   -3    1    0   -1    1    5

Log-odds = log( chance to see the pair in homologous proteins / chance to see the pair in unrelated proteins by chance )
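
As a small numerical illustration of the log-odds ratio, the Python sketch below divides an observed pair frequency by the frequency expected by chance and takes the logarithm. The function name `log_odds` and all frequencies are made up for illustration, and treating the chance frequency as the product of the two background frequencies is an assumption of this sketch.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, base=2.0):
    """Log of the observed pair frequency over the chance expectation."""
    expected_by_chance = freq_a * freq_b
    return math.log(pair_freq / expected_by_chance, base)


# Hypothetical numbers: the pair is seen four times more often in
# homologous proteins than chance predicts, so the score is close to +2.
print(log_odds(pair_freq=0.02, freq_a=0.1, freq_b=0.05))

# A pair seen four times less often than chance scores close to -2.
print(log_odds(pair_freq=0.00125, freq_a=0.1, freq_b=0.05))
```

Positive entries therefore mark substitutions that occur more often in related proteins than chance predicts, and negative entries mark substitutions that are avoided.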

PAM matrices

PAM 1: the matrix reflects an amount of evolution producing on average one mutation per hundred amino acids (1 unit of evolution).

PAM 250: 250 units of evolution.

Probability of amino acid change:

Amino acid change    PAM 1     PAM 250
Phe to Ala           0.0002    0.04
Phe to Arg           0.0001    0.01
Phe to Asn           0.0001    0.02
Phe to Asp           0.0000    0.01
...
Phe to Cys           0.0000    0.01
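
Under the Markov model behind PAM, the probabilities for 250 units of evolution are obtained by multiplying the PAM 1 probability matrix by itself 250 times. The Python sketch below shows the mechanics on a hypothetical three-letter alphabet; `toy_pam1` and its values are invented for illustration and are not real amino acid data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(matrix, n):
    """Raise a square matrix to the power n (n >= 1) by repeated squaring."""
    result = None
    base = matrix
    while n > 0:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical PAM 1-like matrix: each row sums to 1 and keeps 99% of
# the probability on the diagonal (about one change per hundred positions).
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

After 250 steps the rows still sum to 1, but the off-diagonal probabilities are far larger than in the one-step matrix, which matches the pattern in the Phe table above.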

Limitations of PAM Matrices


Constructed based on the phylogenetic
relationships prior to scoring mutations;


Difficulty of determining ancestral relationships
among sequences;


Based on a small set of closely related proteins;




BLOSUM Matrices


Based on the observed amino acid substitutions in a
large set of ~2000 conserved amino acid patterns
(blocks). The blocks are found in a database of protein
sequences representing more than 500 families of
related proteins and act as signatures of these protein
families.


The matrices are measured on the multiple alignment of
the blocks.


The entries of the matrices are computed based on the same principle used in PAM: the log-odds ratio.
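
The counting step can be sketched on a toy block. The Python code below counts residue pairs within each column of an ungapped block, derives background frequencies from those pairs, and converts the ratios to rounded log-odds scores in half-bit units (2 × log2). It is a simplified illustration: the block is invented, the function name `blosum_style_scores` is hypothetical, and the clustering of similar sequences used for the real BLOSUM matrices is skipped.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_style_scores(block):
    """Return rounded half-bit log-odds scores for the residue pairs
    observed in the columns of an ungapped block of aligned sequences."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # an unordered pair can arise in two orders
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented block: six sequences, three columns.
toy_block = ["ACS", "ACS", "ACS", "ACS", "ACS", "SCA"]
print(blosum_style_scores(toy_block))
```

In this toy block the fully conserved C column yields the highest score, while the A/S pair, which appears about as often as chance predicts, scores 0. Pairs that never occur in the block (such as A with C) receive no score here.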

Part of BLOSUM 62 Matrix

     C    S    T    P    A    G
C    9
S   -1    4
T   -1    1    5
P   -3   -1   -1    7
A    0    1    0   -1    4
G   -3    0   -2   -2    0    6


BLOSUM62 was derived from blocks in which sequences at least 62% identical were clustered together and counted as a single sequence.

Log-odds = log( chance to see the pair in homologous proteins / chance to see the pair in unrelated proteins by chance )

PAM vs. BLOSUM


PAM


Based on a mutational model of evolution (Markov process)


PAM1 is based on sequences that are at least 85% identical


Designed to track the evolutionary origins



BLOSUM


Based on the multiple alignment of blocks


Well suited to comparing distantly related sequences


Designed to find proteins’ conserved domains

Gap Penalty


Optimal penalties vary from sequence to sequence, and
finding the most adequate value is a matter of empirical
trial and error.


When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty often give better results.


When comparing closely related sequences, gaps should be penalized for both gap opening and gap extension.