Last lecture summary
•
identity vs. similarity
•
homology vs. similarity
•
gap penalty
•
affine gap penalty
•
gap penalty high
•
fewer gaps, if investigating related sequences
•
low
•
more gaps, larger gaps, distantly related sequences
BLOSUM
•
blocks
•
focuse
on substitution patterns only
in blocks
•
BLOSUM62
–
62, what does it mean?
•
BLOSUM vs. PAM
•
BLOSUM matrices are based on observed
alignments
•
BLOSUM numbering system goes in reversing order as the PAM
numbering
system
Selecting an Appropriate Matrix
Matrix
Best use
Similarity
(%)
Pam40
Short highly similar alignments
70

90
PAM160
Detecting members of a protein family
50

60
PAM250
Longer alingments of more divergent sequences
~30
BLOSUM90
Short highly similar alignments
70

90
BLOSUM80
Detecting members of a protein family
50

60
BLOSUM62
Most effective in finding all potential similarities
30

40
BLOSUM30
Longer alingments of more divergent sequences
<30
Similarity column gives range of similarities that the matrix is able to best detect.
Dynamic programming (DP)
•
Recursive approach, sequential dependency.
•
4
th
piece can be solved using solution of the 3
rd
piece, the 3
rd
piece can be solved by using solution of
the 2
nd
piece and so on…
Sequence B
Sequence A
Best previous alignment
New best alignment = previous best + local best
...
...
...
...
If
you already have the optimal solution to:
X…Y
A…B
then you know the
next
pair of characters will either be
:
X…Y
Z
or
X…Y

or
X…Y
Z
A…B
C
A…B
C
A…B

You
can extend the match by determining which
of these
has the highest score.
New stuff
Dot plot
•
Graphical method that allows the comparison of two
biological sequences and identify regions of close
similarity between them.
•
Also used for finding direct or inverted repeats in
sequences.
•
Or for prediction regions in RNA that are self

complementary and therefore have potential to form
secondary structures.
Self

similarity dot plot I
The DNA sequence
EU127468.1 compared
against itself.
Introduction to dot

plots, Jan Schulz
http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction

to

dot

plots&catid=52:cat_coding_algorithms_d
ot

plots&Itemid=76
runs of
matched
residues
gap
background
noise
Self

similarity dot plot II
Introduction to dot

plots, Jan Schulz
http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction

to

dot

plots&catid=52:cat_coding_algorithms_d
ot

plots&Itemid=76
The DNA sequence
EU127468.1 compared
against itself.
Window size = 16.
Linear color mapping
Improving dot plot
•
Sliding window
–
window size (lets say 11)
•
Stringency (lets say 7)
–
a dot is printed only if 7 out of the
next 11 positions in the sequence are identical
•
Color mapping
•
Scoring matrices can be used to assign a score to each
substitution. These numbers then can be converted to gray/color.
Interpretation of dot plot I
1.
Plot two homologous sequences of interest. If they they
similar
–
diagonal line will occur (
matches
).
2.
frame shifts
a)
mutations
gaps in diagonal
b)
insertions
shift of main diagonal
c)
deletions
shift of main diagonal
http://ugene.unipro.ru/documentation/manual/plugins/dotplot/interpret_a_dotplot.html
Interpretation of dot plot II
•
Identify repeat regions (
direct repeats
,
inverted repeats
)
–
lines parallel to the diagonal line in self

similarity plot
•
Microsattelites and minisattelites (these are also called
low

complexity regions
) can be identified as “squares”.
•
Palindromatic sequences are shown as lines
perpendicular to the main diagonal.
•
Plaindromatic sequence: V ELIPSE SPI LEV
Bioinformatics explained: Dot plots, http://www.clcbio.com/index.php?id=1330&manual=BE_Dot_plots.html
Repeats in dot plot
from the book Bioinformatics, David. M. Mount,
direct repeats
minisattelites
inverted repeats
self

similarity dot plot of
NA sequence ofhuman
LDL receptor
window 23, stringency 7
Interpretation of dot plot
–
summary
http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction

to

dot

plots&catid=52:cat_coding_algorithms_d
ot

plots&Itemid=76
Dot plot of the human genome
A. M. Campbell, L. J. Heyer, Discovering genomics, proteomics and bioinformatics
Dot plot rules
•
Larger windows size is used for DNA sequences because
the number of random matches is much greater due to
the presence of only four characters in the alphabet.
•
A typical window size for DNA is 15, with stringency 10.
For proteins the matrix has not to be filtered at all, or
windows 2/3 with stringency 2 can be used.
•
If two proteins are expected to be related but to have long
regions of dissimilar sequence with only a small
proportion of identities, such as similar active sites, a
large window, e.g., 20, and small stringency, e.g., 5,
should be useful for seeing any similarity.
Dot plot advantages/disadvantages
•
Advantages
:
•
All possible matches of residues between two sequences are
found. It’s just up to you to choose the most significant ones.
•
Readily reveals the presence of
insertions/deletions
and direct
and inverted
repeats
that are more difficult to find by the other,
more automated methods.
•
Disadvantages
:
Most dot matrix computer programs
do not show an
actual alignmen
t. Does not return a
score
to indicate
how ‘optimal’ a given alignment is (no statistical
significance that could be tested).
Homology vs. similarity again
•
Just a reminder of the important concept in sequence
analysis
–
homology
. It is a conclusion about a common
ancestral relationship drawn from sequence similarity.
•
Sequence
similarity
is a direct result of observation from
the sequence alignment. It can be quantified using
percentages, but homology can not!
•
It is important to understand this difference between
homology and similarity.
•
If the similarity is high enough, a common evolutionary
relationship can be inferred.
Limits of detection of alignment
•
However, what is enough? What are the detection limits of
pairwise alignments? How many mutations can occur
before the differences make two sequences
unrecognizable?
•
Intuitively, at some point are two homologous sequences
too divergent for their alignment to be recognized as
significant.
•
The best way to determine detection limits of pairwise
alignment is to use statistical hypothesis testing. See
later.
Twilight zone
•
However, the level one can infer homologous relationship
depends on type of sequence (proteins, NA) and on the
length of the alignment.
•
Unrelated sequences of DNA have at least 25% chance to be
identical. For proteins it is 5%. If gaps are allowed, this percentage
can increase up to 10

20%.
•
The shorter the sequence, the higher the chance that some
alignment is attributable to random chance.
•
This suggest that shorter sequences require higher cuttof
for inferring homology than longer sequences.
Essential bioinformatics, Xiong
Statistical significance
•
Key question
–
Constitutes a given alignment evidence for
homology? Or did it occur just by chance?
•
The statistical significance of the alignment (i.e. its score)
can be tested by statistical hypotheses testing.
•
The matched sequence reported e.g. by the search
program can be classified as TP (true positive, i.e. two
sequences are homologous) or as FP (false positive, i.e.
genuinely unrelated, aligned only by chance).
Significance of global alignment I
•
We align two proteins: human beta globin and myoglobin.
We obtain score
S
.
•
And we want to know if such score is significant or if it
appeared just by chance. How to proceed?
•
State H
0
•
two sequences are not related, score
S
represents a chance
occurrence
•
State H
a
•
Choose a significance level
•
What else do we have to know?
•
s
tatistics of distribution. i.e. what?
•
sample mean, sample standard deviation
Significance of global alignment
II
•
How
to
determine the parameters of distribution?
•
Compare S to scores of beta globin/myoglobin relative to a large
number of sequences of non

homologous proteins
•
Compare with a set of
randomly
generated sequences.
•
Keep the beta globin and randomly scramble the sequence of
myoglobin.
•
Performing any of the previous, we obtain the sample
mean and sample standard deviation.
•
A Z

score can be calculated. How?
𝑍
=
𝑆
−
𝜎
Significance
of global alignment
II
I
•
For
normal
distribution
,
if
Z=3 99.74% of the scores are
within how many stdev of the mean?
•
three
•
And the fraction of scores greater is?
1
.
0
−
0
.
9974
2
=
0
.
26
2
=
0
.
13%
•
We can expect to see this particular high score by chance
about 1 time in 750 (1/750 ≈ 0.13%)
•
0.26% is represented as confidence level
=
0
.
0026
.
•
In hypotheses testing, commonly used is
=
0
.
05
.
Significance of global alignment
IV
•
The problem with this approach is if the distribution is not
Gaussian.
•
Then the estimated significance level will be wrong.
•
Bad news
–
distribution of global alignments is generally
not Gaussian and no theory exists.
•
Another consideration
–
problem of multiple comparisons
•
If we compare query sequence to 1 million sequences in database,
we have a million chances to find a high scoring match. In such
case it is appropriate to adjust
to more stringent level
.
•
Bonferroni correction
–
𝛼
10
6
=
0
.
05
10
6
=
5
×
10
−
8
Significance of
local alignment
•
In contrast to global alignment there is a thorough
understanding of the distribution of scores.
•
Key role play
Extreme value distributions
(EVD)
•
generate N data sets from the same distribution,
create
a
new data set that includes the
maximum/minimum
values
from these N data sets, the resulting data set can
only
be
described by one of the
three distributions
•
Gumbel
, Fréchet,
Weibull
•
applications
•
extreme floods, large wildfires
•
large insurance losses
•
size of freak waves
•
sequence alignment
Gumbel distribution
𝜌
𝑥
=
1
𝜎
exp
−
𝑥
−
𝜎
−
exp
𝑥
−
𝜎
… location parameter
𝜎
,
,
… scale parameter
𝑃
𝑋
>
𝑥
=
1
−
exp
−
exp
−
𝑥
−
𝜎
wikipedia.org
Statistical distribution of alignments
•
local alignment
•
analytical theory
•
gapless
–
Gumbel, parameters can be evaluated analytically
•
gapped
–
Gumbel, paramaters must be obtained from simulations,
no analytical formulas
•
global alignment
•
no thorough theory, however empirical simultions show that the
distribution is also Gumbel
Enter the password to open this PDF file:
File name:

File size:

Title:

Author:

Subject:

Keywords:

Creation Date:

Modification Date:

Creator:

PDF Producer:

PDF Version:

Page Count:

Preparing document for printing…
0%
Comments 0
Log in to post a comment