Local alignment of 'QUEVIVALASVEGAS' - Multiple Choices


Oct 2, 2013 (4 years and 1 month ago)


Sequence Alignment


Biomedische Wetenschappen

BW204

2011



Han Rauwerda

Martijs Jonker

Microarray Department UvA / Integrative Bioinformatics
Unit (MAD/IBU)

Introduction

Databases:

Genbank, EMBL: DNA

SWISS-PROT, PIR, PRF: protein

KEGG Genes: genes

PDB: protein structure


Mind the scale on the Y-axis.

29777 articles in ~8 years: 3722 articles per year, > 10 articles per day

2003: 30364 articles

2011: 60141 articles

http://www.genome.jp/en/db_growth.html

Data, Information & Knowledge

The Anatomy Lesson of Dr. Nicolaes Tulp (Rembrandt)

Teatro Anatomico di Padova (1594)

William Harvey: blood circulation (1628)

Collecting data

Categorizing data: information

Experiments, theory formation from information

Data, Information & Knowledge

Darwin's mockingbirds on the Galápagos (1835)

Collecting data

Linnaeus 1756

Categorizing data: information

Experiments, theory formation from information

Data, Information & Knowledge

Information and theory formation: visualization

John Snow’s cholera map of September 1854: dots are deaths by cholera, crosses are water pumps

Florence Nightingale's polar diagrams, Crimean War (1855)

Crimean War: half a million deaths

Data, Information & Knowledge

Information and theory formation: computational methods

Ronald Fisher (1890-1962), Rothamsted, Harpenden, UK

Biologist, “Founder of Modern Statistical Science”

Analysis of Variance

Fisher's exact test

Fisher's z-distribution

...

Data, Information & Knowledge

In Genomics:

Generating data: Sanger sequencers

Categorizing data generates information: sequence databases

Experiments, theory formation from information: e.g. tyrosine kinase inhibitors in cancer therapy

Biological Databases



Scientific method has not changed:



Formulate a hypothesis



Perform Experiment



Test hypothesis



reject or not reject



Add new insights to theory




What has changed in biology:



amount of data



dynamics of data


how scientists access & share data, information and
knowledge



how scientists use data, information and knowledge.



downloads, ‘freezes’

High Performance Computing (Cloud, grid, Web services)

Biological Databases


Basic sequence data

Curated sequence data (information)


Organism specific
databases, e.g.


Flybase


Wormbase


TAIR


Topical databases e.g.


literature


diseases


experiments


Portals e.g.


Entrez


Ensembl

Biological Databases

How reliable are Biological Databases?


errors, unknown reliability


redundancy


repeated submission

different sequences with the same name (e.g. ey)

erroneous annotations

identical sequences with more than one name

“Biologists would rather share their toothbrush than share a gene name”



amplification of errors:

each new entry based on erroneous information amplifies that error.

Example: contamination of cell lines by HeLa cells (sometimes referred to as a persistent laboratory “weed”):

“As journals wrestle with the problems posed by cell line mix-ups
Reynolds goes so far as to estimate that journals would have to retract 35% to 40% of their previously published cell biology papers to weed out invalid data ….”

Chatterjee R. Cell biology. Cases of mistaken identity. Science. 2007 Feb 16;315(5814):928-31


Example: p53 on Entrez

Example: p53 in Entrez Gene:

>gi|189083686|ref|NP_112251.2| cellular tumor antigen p53
[Rattus norvegicus]
MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGSPN
SMEDLFLPQDVAELLEGPEEALQVSAPAAQEPGTEAPAPV
APASATPWPLSSSVPSQKTYQGNYGFHLGFLQSGTAKSV
MCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTRVRAMAI
YKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNP
YAEYLDDRQTFRHSVVVPYEPPEVGSDYTTIHYKYMCNSS
CMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRD
RRTEEENFRKKEEHCPELPPGSAKRALPTSTSSSPQQKK
KPLDGEYFTLKIRGRERFEMFRELNEALELKDARAAEESG
DSRAHSSYPKTKKGQSTSRHKKPMIKKVGPDSD

+

>gi|120407068|ref|NP_000537.3| cellular tumor antigen
p53 isoform a [Homo sapiens]
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQA
MDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPA
APTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSG
TAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTR
VRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLI
RVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYN
YMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVC
ACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSS
SPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALEKDAQA
GKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD



These sequences are apparently similar but not identical.

How do we know if stretches are similar?

How do we map sequences of unequal length onto each other?

Is it possible to associate a confidence level with the alignment of two sequences?


Visual approach: dot matrix method (1)


Make a matrix and draw dots


Suppose sequence ACGGGAACG has mutated by an insertion of 3 As: ACGAAAGGAACG. Align:

[Figure: dot matrix of ACGGGAACG (vertical axis) against ACGAAAGGAACG (horizontal axis); a dot marks every position where the two residues are identical.]


Diagonals show alignments

Image is very noisy: draw lines only for an alignment of 3 or more, i.e. choose a window size of 3 or more.

Insertion: horizontal spacing

Dot matrix method (2)

Suppose the same sequence mutated further and the 10th residue, an A, has become a G:

sequence 1: ACGGGAACG

sequence 2: ACGAAAGGAACG → ACGAAAGGAGCG

[Figure: dot matrix of ACGGGAACG (vertical axis) against ACGAAAGGAGCG (horizontal axis).]


Window size of 3: we miss the homology in the distal part

Apply a window size of 4 and draw a line if the sum of the matches is at least 3

What do you think?

Sensitivity? Specificity?
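The window-and-threshold idea can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any particular dot-plot program; the function name `dot_matrix` and the parameters `window` and `min_matches` are names chosen here for the example.

```python
def dot_matrix(s, t, window=1, min_matches=None):
    """Return the set of (i, j) start positions where a window of
    `window` residues of s and t contains at least `min_matches`
    identical positions (0-based indices; hypothetical helper)."""
    if min_matches is None:
        min_matches = window  # default: every position must match
    hits = set()
    for i in range(len(s) - window + 1):
        for j in range(len(t) - window + 1):
            matches = sum(s[i + k] == t[j + k] for k in range(window))
            if matches >= min_matches:
                hits.add((i, j))
    return hits


original = "ACGGGAACG"
inserted = "ACGAAAGGAACG"   # insertion of three As
mutated = "ACGAAAGGAGCG"    # 10th residue A -> G

strict = dot_matrix(original, mutated, window=3)                   # exact 3-mers
relaxed = dot_matrix(original, mutated, window=4, min_matches=3)   # 3 out of 4
```

With exact windows of 3 the distal stretch (ACG against GCG) is lost in the mutated sequence, while the relaxed setting of 3 matches in a window of 4 recovers it, at the price of admitting more spurious hits.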


Dot matrix method (3)



Convenient to quickly discover similarities



Does not provide a confidence metric

Does not yield a single optimal alignment

Repetitive stretches are quickly recognized

Rat vs. human p53 with dottup:

Sequence Alignment (1)


How to find the best alignment out of many possible alignments?


Scoring: reward matches, penalize mismatches and deletions


Can we motivate our choice with biological arguments?


Should all mismatches carry an equal penalty?


How to penalize deletions?


Take the entire alignment or just focus on stretches?


Comparison of human vs. rat P53 (stretches of similar length)


Comparison of one human P53 exon with rat chromosome 1?


Statistical metric


What is the likelihood that we would produce such an alignment by
chance?





Alignment Scoring: substitution matrices (1)


Which alignment is the best?


We need a scoring system:


Each element i of the alignment with length c adds to the score. The value of each element is determined by a function σ of match, mismatch and gap.


A higher score ~ better alignment



Matches, mismatch and gaps contribute according to a schema
(called the substitution matrix).



A simple schema for nucleotides (= substitution matrix):

      A    C    G    T    -
A     1   -1   -1   -1   -1
C    -1    1   -1   -1   -1
G    -1   -1    1   -1   -1
T    -1   -1   -1    1   -1
-    -1   -1   -1   -1


Score for the alignment



AC---CGAGGAACG
||   || ||| ||
ACGTTCG-GGAGCG

2 - 3 + 2 - 1 + 3 - 1 + 2 = 4
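The score of 4 can be checked with a short Python sketch that walks over the columns of the alignment and applies the simple schema (match +1, mismatch -1, gap -1). The function name `alignment_score` is an illustrative choice, not part of any library.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of two already-aligned strings.
    '-' denotes a gap; both strings must have the same length."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # gap in either sequence
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score


score = alignment_score("AC---CGAGGAACG", "ACGTTCG-GGAGCG")
```

Here `score` reproduces the column-by-column sum 2 - 3 + 2 - 1 + 3 - 1 + 2.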

Biological motivation on scoring alignments


Similarity of sequences can arise because they share a common
ancestor


Homology is an evolutionary concept: homologous sequences are sequences that share a common ancestor.

So homologues may have a similarity of 80% or 90%, but whether they share a common ancestor is boolean (true or false).

Similar sequences are not necessarily related.


Molecular evolution: substitution and insertions and deletions
(indels).


Two types of homology:


Orthologous genes: related genes after speciation


Paralogous genes: related genes after gene duplications (within
species) → gene families


Domains: stretches in the sequence that ‘do the job’


highly conserved


Alignment Scoring: substitution matrices (2)

Nucleotides:



Should replacement of one mismatching nucleotide with
another be penalized differently?


Much used: NUC2.2:





Should the starting of a gap be penalized differently from
extension of a gap?


gap start: -12, gap extension: -6

Score M of the alignment:

AC---CGAGGAACG
||   || ||| ||
ACGTTCG-GGAGCG

M = 2*5 - 12 + (2 * -6) + 2*5 - 12 + 3*5 - 4 + 2*5 = 5

Scoring matrix:

      A    T    G    C
A     5   -4   -4   -4
T    -4    5   -4   -4
G    -4   -4    5   -4
C    -4   -4   -4    5
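The same alignment can be scored programmatically with separate gap-start and gap-extension penalties. The sketch below assumes the values used on this slide (match 5, mismatch -4, gap start -12, gap extension -6); the function name `affine_score` is made up for the illustration.

```python
def affine_score(top, bottom, match=5, mismatch=-4,
                 gap_open=-12, gap_extend=-6):
    """Score two aligned strings; the first gap column of a run costs
    gap_open, every further column of the same run costs gap_extend."""
    score = 0
    gap_side = None  # which sequence currently carries a gap run
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            score += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += match if a == b else mismatch
            gap_side = None  # a residue pair ends any gap run
    return score


M = affine_score("AC---CGAGGAACG", "ACGTTCG-GGAGCG")
```

The three-column gap contributes -12 + 2 * -6 = -24, the single-column gap -12, which together with the matches and the one mismatch gives M = 5.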

Alignment Scoring: substitution matrices (3)


Do these alignments have an equal probability of occurring?

Each letter stands for an amino acid in this artificial sequence.






Alanine to Isoleucine: both hydrophobic

Glycine to Cysteine: both hydrophobic. Cysteine has a sulfhydryl group (essential in metal binding): gain of function

Glutamate (E), polar, non-hydrophobic and negatively charged, to Isoleucine

L A S V E G A S
| | | | | |   |
L A S V E G I S

L A S V E G A S
| | | | |   | |
L A S V E C A S

L A S V E G A S
| | | |   | | |
L A S V I G A S

Amino Acid Substitution Matrices


determine penalties and rewards in matrices by


Theoretical approach based on :



amino acid physico chemical properties



the redundancy in genetic code



Empirical / evolution-motivated models

PAM matrices (Dayhoff, 1978)

based on predictions of mutations when proteins diverge from a common ancestor

explicit evolutionary model

BLOSUM matrices (Henikoff & Henikoff, 1992)


BLOSUM based on common regions (BLOCKS) in protein families


Matrix represented as log odds:


log ratio of observed mutation frequency and mutation frequency by
chance.


Positive values: substitutions that occur more frequently than expected


Negative values: substitutions that occur less frequently than expected
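A log-odds entry can be illustrated with a small Python sketch. The frequencies below are invented numbers for the example only, and the expected frequency is simplified to the product of the two background frequencies; real PAM and BLOSUM derivations use counted data and additional scaling.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, base=2):
    """Log ratio of the observed substitution frequency to the
    frequency expected by chance (simplified: freq_a * freq_b)."""
    expected = freq_a * freq_b
    return math.log(observed_pair_freq / expected, base)


# Hypothetical numbers: both residues have background frequency 0.1,
# so a chance pairing is expected with frequency 0.01.
more_than_chance = log_odds(0.02, 0.1, 0.1)    # observed twice as often
less_than_chance = log_odds(0.005, 0.1, 0.1)   # observed half as often
```

A pair seen twice as often as expected gets a score of about +1 in log2 units, a pair seen half as often about -1, which is the sign convention described above.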









Amino Acid Substitution matrices (2)


PAM (Point Accepted Mutation ~ accepted point mutation)


Based on closely related protein sequences (85% identical)


Construct a phylogenetic tree of a sequence group.


Count the number of substitutions along each branch in time.


Take a set, 1 PAM apart (= 1% of the amino acid content has changed)


Calculate the log odds score: the log10 of the ratio between the observed mutation frequency with respect to the common ancestor and the probability of that mutation by chance (1/20).

This results in the PAM1 substitution matrix.

Extrapolate to matrices that are further apart, e.g.:

PAM250: 250 mutations per 100 residues ~ 20% sequence identity ~ 2500 million years of evolution.

For closely related sequences take a low PAM number, for distantly related sequences take a high PAM number.

matrices given in log10 odds








Amino Acid Substitution matrices (3)


BLOSUM (BLOcks amino acid SUbstitution Matrices)

Basis is > 2000 conserved amino acid patterns in 500 groups of protein sequences.

Blocks: ungapped alignments of less than 60 amino acids.

BLOSUM62: sequences are taken that share on average an identity of 62%

matrices given in log2 odds

Procedure to calculate the log odds score (the elements in the substitution matrix) is comparable to PAM (but with log2 instead of log10 values)


better suited than PAM for finding conserved domains


Much larger data set used than for the PAM matrix








Sequence Alignment (2)


Take the entire alignment or just focus on stretches?


Comparison of human vs. rat P53 (stretches of similar length)


Global Alignment


Comparison of one human P53 exon with rat chromosome 1?


Local Alignment



Global versus Local Alignment:



Global alignment of ‘VIVALASVEGAS’ and ‘VIVADAVIS’:

V I V A L A S V E G A S
| | | |   |   |       |
V I V A D A - V - - I S

Local alignment of ‘QUEVIVALASVEGAS’ and ‘VIVADAVIS’ (high gap penalty):

Q U E V I V A L A S V E G A S
      | | | |   |
      V I V A D A V I S

Global Alignment (1)


V I V A L A S V E G A S
| | | |   |   |       |
V I V A D A - V - - I S

Global alignment of ‘VIVALASVEGAS’ and ‘VIVADAVIS’

A global alignment of two sequences, s and t, is an assignment of gap symbols “-” into those sequences, or at their ends.

The optimal global alignment is the assignment that maximizes the alignment score.

What is the longest possible alignment of two sequences s and t? And the shortest?

A B C D E F G H - - - - -
- - - - - - - - I J K L M

Local Alignment (1)



Local alignment of ‘QUEVIVALASVEGAS’ on ‘VIVADAVIS’:

Q U E   V I V A L A   S V E G A S
        | | | |   |
        V I V A D A   V I S

The prefix (Q U E) and the suffixes (S V E G A S and V I S) are removed.

Definition: A local alignment of two sequences, s and t, is a global alignment of the subsequences s_{i:j} and t_{k:l}, for some choice of (i,j) and (k,l).

The optimal local alignment is given by the optimal choice of (i,j) and (k,l), so as to maximize the alignment score.

Dynamic programming (1)


How to optimize the alignment score?


Brute force: try all possible combinations.

For a group of k objects which are a subset of a group of n, the number of possible combinations (not counting the k! permutations of each combination) is:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

Aligning two sequences of 500 bp (k=500, n=1000): 2.702882e+299 possibilities.
This is infeasible, and a whole genome may be in the order of Gbases!
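The size of this number is easy to verify with Python's standard library, which computes binomial coefficients exactly with arbitrary-precision integers. This is only a sanity check of the count quoted above.

```python
import math

# Number of ways to choose k = 500 positions out of n = 1000.
n, k = 1000, 500
combinations = math.comb(n, k)

# The exact integer has 300 digits; print it in scientific notation.
print(f"{combinations:.6e}")
print(len(str(combinations)), "digits")
```

Even at a billion alignments per second, enumerating this many candidates is far out of reach, which is the motivation for dynamic programming.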


Needleman & Wunsch (1970) introduced an elegant solution:


Exact (finding the optimal alignment is guaranteed)

Like the dot matrix method, it works with a matrix, but is more quantitative.

Moderate computational expense (n*m, size of matrix)

Uses Dynamic Programming (DP): partition a problem into overlapping subproblems, followed by finding the best path.

Needleman Wunsch algorithm


At any position in the alignment, 2 options


Gap


Match or mismatch



Example:

Obvious alignment:

- B B A
  | |
A B B -

Matrix form: [figure: B B A along the top, A B B down the side, with X marking one cell in each of the rows A, B and B]

Formula: σ(s_i, t_j) is the cost of aligning s and t at the i-th and j-th position

Calculate for each position in the matrix the most cost-effective combination and remember this decision

After calculation of the entire matrix, trace back the optimal alignment

Needleman Wunsch algorithm (2)

1. Example: scoring matrix is:

      A    B    -
A     1   -1   -1
B    -1    1   -1
-    -1   -1

2. Start with the all-gapped alignment of length s + t, using the substitution matrix: the digits in the first row and first column (blue on the slide). All gaps would mean a score of -3 + (-3) = -6.

3. Proceed in the upper left corner and determine whether a gap, mismatch or match would result in the highest score.

A diagonal arrow is a match or a mismatch, a horizontal or vertical arrow is a gap.

4. Proceed with filling in the entire column (there should be digits above and to the left of each empty cell) and fill all other columns.

5. Start from the bottom right cell (this is the alignment score) and trace back.

6. This produces the alignment:

- B B A
A B B -

      -    B    B    A
-     0   -1   -2   -3
A    -1   -1   -2   -1
B    -2    0    0   -1
B    -3   -1    1    0
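The six steps can be sketched in Python as follows. This is a minimal teaching version with a single linear gap penalty, assuming the scoring of the example (match 1, mismatch -1, gap -1); the function name and the tie-breaking order in the trace-back (diagonal first, then vertical, then horizontal) are choices made for this sketch.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.
    Returns (matrix, aligned_s, aligned_t); rows follow s, columns t."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Step 2: first row and column hold the all-gap scores.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Steps 3-4: each cell takes the best of diagonal, up and left.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # match / mismatch
                          F[i - 1][j] + gap,       # gap in t
                          F[i][j - 1] + gap)       # gap in s
    # Steps 5-6: trace back from the bottom right cell.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F, "".join(reversed(a)), "".join(reversed(b))


F, aligned_s, aligned_t = needleman_wunsch("ABB", "BBA")
```

For s = ABB and t = BBA the filled matrix matches the one in the example, the bottom right cell holds the alignment score 0, and the trace-back yields ABB- over -BBA.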

Smith Waterman algorithm



Q U E   V I V A L A   S V E G A S
        | | | |   |
        V I V A D A   V I S

(prefix and suffixes removed)

Formula:

Very similar to Needleman-Wunsch, but:

Initialize with zeros.

Calculate for each position in the matrix the most cost-effective combination and remember this decision. Replace negative values with zeros.

After calculation of the entire matrix, trace back the optimal alignment, starting with the element with the highest score.


Local alignment of ‘QUEVIVALASVEGAS’
on ‘VIVADAVIS’
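The three changes relative to Needleman-Wunsch can be sketched in Python as below. The scores (match 1, mismatch -1, gap -2) are assumptions made for this illustration, since the slide only says that the gap penalty is high; stopping the trace-back at the first zero is the usual convention and is likewise an assumption here.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # initialize with zeros
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # negative values are replaced with zero
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] >= best:                  # remember the highest cell
                best, best_pos = H[i][j], (i, j)
    # trace back from the highest-scoring cell until a zero is reached
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


score, local_s, local_t = smith_waterman("QUEVIVALASVEGAS", "VIVADAVIS")
```

Under these assumed scores the best local score is 4, and the aligned region starts at the shared VIVA stretch; the unaligned prefix and suffixes are simply never visited by the trace-back.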

Statistical analysis of alignments


What would be the H0?


H0: the similarity between two sequences occurs by chance (or: the sequences are unrelated)

Must we reject the null hypothesis?

Determine the probability of a score at least this high under the null hypothesis

What is the distribution of the nucleotides, amino acids?

Probably not normally distributed!

Multinomial sequence model. For nucleotides: p = (p_A, p_C, p_G, p_T)

Markov models: take short-distance dependencies in the sequence into account.

e.g. occurrences of CG tend to have higher CG content in their vicinity


Models that take knowledge into account (e.g. for nucleotide
sequences that are thought to be protein coding, use the genetic
code as blocks)




Statistical analysis of alignments (2)


Approach:


Construct a distribution by a method of choice

Example (multinomial model):


Permute n times a sequence that reflects the actual distribution of nucleotides or amino acids in the population

Calculate the alignment for each permuted sequence

Determine which fraction has an equal or higher score than the original alignment.

If n is large and the sequence is not very short (the number of possible permutations depends on the sequence length) this can be interpreted as a p-value.

Example:

original score ~ 2

sequence length = 12 (12! = 479,001,600 permutations possible)

2000 permutations

5 permuted sequences score at least 2: (2,2,3,3,4).

p-value = 5/2000 = 0.0025
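The permutation approach can be sketched in Python as follows. The local scoring function and its parameters (match 1, mismatch -1, gap -2), the fixed random seed and the choice to shuffle the second sequence are all assumptions of this sketch, chosen to keep the example short and repeatable.

```python
import random


def local_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman score, computed row by row (no trace-back)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            value = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best


def permutation_p_value(s, t, n_perm=2000, seed=0):
    """Fraction of shuffled versions of t that score at least as high
    against s as the original t does."""
    rng = random.Random(seed)        # fixed seed for repeatability
    observed = local_score(s, t)
    letters = list(t)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)         # keeps the residue composition
        if local_score(s, "".join(letters)) >= observed:
            hits += 1
    return observed, hits / n_perm


observed, p = permutation_p_value("QUEVIVALASVEGAS", "VIVADAVIS", n_perm=200)
```

Shuffling preserves the residue composition, so this corresponds to the multinomial model; a Markov model would require a shuffling scheme that also preserves neighbouring-residue frequencies.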






Sequence Alignment

Now we know


what type of alignment to use (global or local)


how to score an alignment


how to discover the optimal alignment


how to attach a confidence level to an alignment






BLAST (1)


With statistics: checking 1 alignment implies making 1001 or more alignments.

Often the question is: can we find any domains, boxes or similarity of a sequence of interest against all known sequences of a species or a number of species?

Instead of 1 alignment, make thousands of alignments

Database searching

Smith-Waterman is too slow.

Basic Local Alignment Search Tool: BLAST

Heuristics: ‘rule of thumb’, common sense, used to rapidly come to a solution that is reasonably close to the best possible answer.

Reduce search space

BLAST: approx. 50-100 times faster than dynamic programming





BLAST (2)

Steps:

1. Create all possible words from the query sequence (default word length: proteins 3, nucleotides 11)

E.g. VIVADAVIS: VIV, IVA, VAD, ADA, DAV, AVI, VIS

Words may also occur more than once in the query sequence.

2. Seeding step: seek words of length W (3 for proteins, 11 for nucleotides) that score at least T when aligned with the query (scored with a chosen substitution matrix).

This is the step that reduces the search space considerably.

More in detail: there are 20^3 possible 3-mers for a protein and 4^11 possible 11-mers for nucleotides. Set a threshold T and, using an appropriate similarity matrix, only take into account words that score at least T.

3. Extend the alignment as long as the score increases. The resulting stretches of aligned query and database sequence are called High-scoring Segment Pairs (HSPs).

More in detail: this always would produce a non-gapped alignment. Nowadays HSPs that are in the same region are subjected to dynamic programming to produce a gapped alignment.
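Steps 1 and 2 can be illustrated with a simplified Python sketch. It only looks for exact word matches, whereas BLAST also accepts similar words that score at least T under the substitution matrix; the function names `query_words` and `exact_seeds` are invented for the example.

```python
def query_words(query, w=3):
    """All overlapping words of length w in the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def exact_seeds(query, subject, w=3):
    """(query_position, subject_position) pairs where a query word occurs
    verbatim in the subject. Simplification: no threshold T and no
    neighbourhood of similar words."""
    index = {}
    for i, word in enumerate(query_words(query, w)):
        index.setdefault(word, []).append(i)   # a word may occur repeatedly
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


words = query_words("VIVADAVIS")
seeds = exact_seeds("VIVADAVIS", "QUEVIVALASVEGAS")
```

Only the positions in `seeds` are handed to the extension step, which is where the reduction of the search space comes from.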





BLAST (3) - an example

BLAST (3)

Example BLAST output (NCBI):





BLAST (4)

Flavors of BLAST:











If you know a nucleotide sequence is protein coding, BLASTX will
increase the sensitivity of the search.


TBLASTX: search for distant relatedness.




No.  BLAST type  Description
1    BLASTN      Nucleotide sequence against nucleotide database
2    BLASTP      Protein sequence against protein database
3    BLASTX      Six-frame translation of nucleotide sequence against protein database
4    TBLASTN     Protein sequence against six-frame translation of nucleotide database
5    TBLASTX     Six-frame translation of nucleotide sequence against six-frame translation of nucleotide database

Statistical analysis of alignments (3)


BLAST: align against many sequences: adapt statistical metrics

It makes a difference if we search for an alignment in a database of 5 sequences or in a database of 1,000,000 sequences.

Question is: given the size of the database, what is the likelihood that a resulting alignment is caused by random chance?

Expect value (or E-value):

$$E = K m n \, e^{-\lambda S}$$

m ~ number of residues in database

n ~ number of residues in query

K and lambda (λ) ~ parameters; S ~ score

What does an E-value of 1 mean? Is there a maximum E-value?

E-value depends on the size of the database:

suppose your database grows because of your new sequencing machine: your E-values will change

bit score: score of the alignment normalized for the substitution matrix.

Use the bit score when you want to compare BLAST results from different databases.

Blast(5) formal definitions


In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda (λ). Most simply, the expected number of HSPs with score at least S is given by the formula

$$E = K m n \, e^{-\lambda S}$$

We call this the E-value for the score S. The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.

By normalizing a raw score using the formula

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

one attains a "bit score" S', which has a standard set of units.

Rewritten:

$$E = m n \, 2^{-S'}$$

The chance of finding zero HSPs with score >= S is e^{-E}, so the probability of finding at least one such HSP is:

$$P = 1 - e^{-E}$$

For example, if one expects to find three HSPs with score >= S, the probability of finding at least one is 0.95. The BLAST programs report E-values rather than P-values because it is easier to understand the difference between, for example, E-values of 5 and 10 than P-values of 0.993 and 0.99995. However, when E < 0.01, P-values and E-values are nearly identical.
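The relations between raw score, bit score, E-value and P-value can be tried out with a short Python sketch. The values chosen for K, lambda, m, n and S below are illustrative assumptions only; in practice K and lambda depend on the scoring system and are reported by the BLAST program.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of HSPs with score >= raw_score: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def p_value(e):
    """Probability of at least one HSP with that score: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e)


# Illustrative (assumed) parameters, not taken from a real search.
K, lam = 0.041, 0.267
m, n = 1_000_000, 250      # residues in database and in query
S = 80                     # raw alignment score

E = e_value(S, m, n, K, lam)
S_bits = bit_score(S, K, lam)
E_from_bits = m * n * 2 ** (-S_bits)   # same E, computed from the bit score
```

Doubling the database size m doubles E for the same raw score, which is why E-values from different databases are not directly comparable while bit scores are.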

Remarks on database alignment


Most programs offer possibility to mask Low Complexity
Regions:


Short segments of repeats


Segments with a small number of residues



Programs like BLAST:


FASTA:

uses shorter words (k-tuples ~ words)

approx. the same speed as BLAST

BLAT: BLAST-Like Alignment Tool

makes an index of all non-overlapping k-mers and stores this in memory.

Much faster than BLAST

very common stretches are not found because BLAT stores only unique or relatively rare fragments in its index.


Today: many new aligners appear to facilitate short read
alignment from Next Generation Sequence data.

Multiple Sequence Alignment


Question: how related is a set of sequences from human,
mouse and cow that are thought to represent similar
domains? Which ones are most similar?


Definition: a multiple (global) alignment of k sequences is an
assignment of gap symbols into those sequences or at their ends. The
k resulting strings are placed one above the other to maximize the
score of the entire alignment.


How to design the algorithm


How to formulate the scoring system.


Algorithm: extending dynamic programming to k sequences is costly: the matrix grows with the product of all k sequence lengths (n*m for two sequences, n^k for k sequences of length n)

Hence a number of heuristic approaches:

cluster sequences, start with a pairwise alignment of the two most similar sequences and extend the alignment stepwise with less similar sequences (greedy algorithms).

Scoring: sum of pairs

Calculate the score, using an appropriate substitution matrix, of each pair of aligned sequences (example: k=4, number of pairs = k(k-1)/2 = 6)
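Sum-of-pairs scoring can be sketched in Python as below. The simple scores (match 1, mismatch -1, gap -1, gap against gap 0) and the four short example rows are assumptions made for the illustration; a real multiple alignment would use a substitution matrix such as BLOSUM62.

```python
from itertools import combinations


def column_pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one column of one pair of rows."""
    if a == "-" and b == "-":
        return 0                 # two gaps facing each other cost nothing here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Add the pairwise scores of every pair of rows in the alignment."""
    return sum(column_pair_score(a, b)
               for row1, row2 in combinations(rows, 2)
               for a, b in zip(row1, row2))


alignment = ["AC-G", "ACTG", "A-TG", "ACTG"]   # k = 4 aligned sequences
n_pairs = len(list(combinations(alignment, 2)))
sp_score = sum_of_pairs(alignment)
```

With k = 4 rows there are 6 pairs to score, and the number of pairs grows quadratically with k.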








The Practical


This afternoon:


Assignments 1, 2, 3c, 3d, 4 and 5d

Assignment 1d: choose an E-value of, for example, 1

higher E-values return very many results, which makes the set hard to analyze.

Assignment 5d:

A PSSM is a Position Specific Scoring Matrix. It is a matrix that describes a motif of a given length and gives, for each position, the relative frequency of a nucleotide or amino acid.

A PSSM is thus a way to look at a degenerate sequence quantitatively, for example:
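As a small illustration of such a matrix, the Python sketch below computes the relative frequency of each nucleotide per position from a handful of motif instances. The four example sequences are invented for this sketch and are not taken from the practical; real PSSMs usually also apply pseudocounts and a log-odds transformation.

```python
from collections import Counter


def position_frequency_matrix(motifs, alphabet="ACGT"):
    """For each position of equally long motif instances, return a dict
    with the relative frequency of every letter of the alphabet."""
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motif instances must have the same length")
    matrix = []
    for pos in range(length):
        counts = Counter(m[pos] for m in motifs)
        matrix.append({letter: counts[letter] / len(motifs)
                       for letter in alphabet})
    return matrix


# Hypothetical instances of a degenerate 6-nucleotide motif.
instances = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pfm = position_frequency_matrix(instances)
for pos, column in enumerate(pfm, start=1):
    print(pos, column)
```

Positions where one letter has frequency 1.0 are fully conserved; positions with several non-zero frequencies are the degenerate ones.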


END