Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx Pabellón Tec ...

Oct 2, 2013

BIOINFORMATICS

DR. VÍCTOR TREVIÑO

VTREVINO@ITESM.MX

A7-421

EXT-4536+103

BT4007



Blast and Alignments


PAPER PRESENTATIONS IN MARCH

Find a research article related to your project that has a strong bioinformatics component. For example:

Generation of a database

Development of a program or service

Discovery of genes/metabolic pathways/etc. by means of, or with the help of, bioinformatics methods

Propose the paper to the professor and get confirmation

Study the paper

Prepare the presentation

Present it in class: 15 minutes, 10 minutes of presentation + 5 of questions

Presentations are evaluated by the professor and the students, using a rubric that scores elements such as: Topic, Intro, Methods, Results, Discussion, Critique, Voice, Clarity, Confidence, Knowledge, Answers, Time


PAPERS FOR NEXT SESSION


SEQUENCE SIMILARITY

Sequences are similar because they are derived from a common ancestor

Similarity will most often be the result of duplication events.

Similarity will then depend on divergence times.

General Rule: 25% identity in a 100 aa sequence is good evidence of common ancestry

Bioinformatics - Methods and Applications: Genomics, Proteomics and Drug Discovery. Rastogi, Mendiratta. PHI
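
The 25% rule of thumb needs a concrete way to count identity. A minimal sketch, assuming the two sequences are already aligned, that gaps are written as "-", and that gap columns are left out of the denominator (conventions differ between tools). The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared (non-gap) columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned pair (hypothetical, not real proteins)
identity = percent_identity("MKT-AYIAKQR", "MKTQAYLAK-R")
print(f"{identity:.1f}% identity")
```

Under the slide's rule, a value of 25% or more over roughly 100 residues would be read as evidence of common ancestry.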

SEQUENCE SIMILARITY

REASONS TO PERFORM SEQUENCE SIMILARITY SEARCHES

Within a protein sequence, some regions will be more conserved than others. The more conserved a region is, the more important it is likely to be:

for function

for 3D structure

for localization

for modification

for interaction

for regulation/control

for transcriptional regulation (in DNA)

SEQUENCE SIMILARITY - TERMS

Homologous: similar due to common ancestry

Analogous: similar due to convergent evolution

Orthologous: homologous genes separated by a speciation event (found in different species), usually with conserved function

Paralogous: homologous genes separated by a duplication event (commonly within the same species), often with different function

Bioinformatics - Methods and Applications: Genomics, Proteomics and Drug Discovery. Rastogi, Mendiratta. PHI

SEQUENCE SIMILARITY - TERMS

Xenologous: homologous due to horizontal gene transfer

HGT (horizontal gene transfer): transfer of genetic material to an organism that is not its offspring

VGT (vertical gene transfer): an organism receives genetic material from its ancestor (e.g. mitosis) [VGT is not related to xenologous]

Ohnologous: paralogous genes that originated by whole genome duplication

Gametologous: homologous genes on non-recombining opposite sex chromosomes.

Bioinformatics - Methods and Applications: Genomics, Proteomics and Drug Discovery. Rastogi, Mendiratta. PHI

Wikipedia

SEQUENCE SIMILARITY

EVOLUTIONARY RELATIONSHIP

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

SEQUENCE SIMILARITY

ORIGINS OF GENES

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

a1 & a2 are Paralogous

a1-S1 and a1-S2 are Orthologous

a2-S1 and a2-S2 are Orthologous

Analogous Genes: Same Function, Different Origin

Xenologous

SEQUENCE SIMILARITY

TYPES OF MODIFICATION

…ACCAGTGTGCCGTACA…

Mutations occur during evolution by

Insertions: ACCAGTaGTGCCGTACA

Deletions: …ACCAGTCCGTACA… (GTG removed)

Substitutions: ACCAGTGCGCCGTACA (GTG becomes GCG)
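
The three mutation types on this slide can be reproduced as plain string operations. A minimal sketch using the slide's own sequence; the helper names and the 0-based positions are assumptions made for illustration.

```python
original = "ACCAGTGTGCCGTACA"


def insert(seq: str, pos: int, bases: str) -> str:
    """Insert `bases` before position `pos` (0-based)."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq: str, pos: int, length: int) -> str:
    """Remove `length` bases starting at position `pos`."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the single base at position `pos`."""
    return seq[:pos] + base + seq[pos + 1:]


inserted = insert(original, 6, "a")         # an extra base after ACCAGT
deleted = delete(original, 6, 3)            # drops the GTG block
substituted = substitute(original, 7, "C")  # GTG becomes GCG

for name, seq in [("insertion", inserted), ("deletion", deleted),
                  ("substitution", substituted)]:
    print(f"{name:13s}{seq}")
```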

SIMILARITY AND DISTANCE BETWEEN SEQUENCES

SIMILARITY is the maximal SUM of WEIGHTS for the conserved residues

More useful for database searching

DISTANCE is the minimal SUM of WEIGHTS for a set of mutations transforming one sequence into the other

More useful for phylogenetic tree reconstruction

Both are opposite and interconvertible concepts

WEIGHT accounts for the different roles of mutation events, AA residue similarity, etc.

e.g. synonymous mutations are weighted differently from nonsense mutations

Bioinformatics - Methods and Applications: Genomics, Proteomics and Drug Discovery. Rastogi, Mendiratta. PHI
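
Distance as "the minimal sum of weights for a set of mutations" can be computed with a small dynamic-programming routine. A minimal sketch, assuming unit weights for substitutions and for single-residue insertions or deletions (the classic edit distance). Real weights would depend on the mutation type, as the slide notes. The function and parameter names are hypothetical.

```python
def edit_distance(a: str, b: str, sub_cost: int = 1, indel_cost: int = 1) -> int:
    """Minimal total weight of substitutions and indels turning `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j - 1] + (0 if x == y else sub_cost),  # match or substitution
                prev[j] + indel_cost,                       # delete x from a
                cur[j - 1] + indel_cost,                    # insert y into a
            ))
        prev = cur
    return prev[-1]


print(edit_distance("ACCAGTGTGCCGTACA", "ACCAGTCCGTACA"))     # deletion of GTG
print(edit_distance("ACCAGTGTGCCGTACA", "ACCAGTGCGCCGTACA"))  # one substitution
```

Changing `sub_cost` and `indel_cost` is the simplest way to see how the chosen weights change which set of mutations is considered minimal.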

SEQUENCE ALIGNMENT

Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that are in the same order in the sequences

Identical residues (nt or aa) are placed in the same column

Non-identical residues can be placed in the same column or indicated as gaps

Wikipedia, http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

Overall similarity

SEQUENCE ALIGNMENT

GLOBAL - Procedure applied to the entire sequence to include as many matches as possible up to the end of the sequence

Methods

Brute Force: impractical

Dot Matrix: graphical, easy to understand

Dynamic Programming: the most accurate

Heuristic Methods: fast, not so accurate

Word k-tuple

Database Searching

BLAST

Bioinformatics - Methods and Applications: Genomics, Proteomics and Drug Discovery. Rastogi, Mendiratta. PHI

Wikipedia

GLOBAL AND LOCAL ALIGNMENTS

Proteins are MODULAR

Patterns formed by exchange of whole EXONS

Example:

F12: Coagulation Factor XII

PLAT: Tissue-type plasminogen activator

A practical guide to the analysis of genes and proteins. Baxevanis, Ouellette. Wiley 2Ed.

F1/2 - Fibronectins

E - Epidermal Growth Factors

K - "Kringle" domain

GLOBAL ALIGNMENT METHODS DO NOT CONSIDER THESE ISSUES

LOCAL ALIGNMENT

GLOBAL AND LOCAL ALIGNMENTS

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

LOCAL ALIGNMENT

Alignment stops at the end of regions of identity or strong similarity

Much higher priority is given to finding these local regions than to extending the alignment

A practical guide to the analysis of genes and proteins. Baxevanis, Ouellette. Wiley 2Ed.

DOT-MATRIX METHOD

Primary method for comparing sequences

Provides a global and local overview of similarity

Useful for finding direct or inverted repeats

Useful for finding self-complementary RNA regions

Programs: DNA Strider, DOTTER, GCG-DOTPLOT, DOTLET

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

http://myhits.isb-sib.ch/cgi-bin/dotlet

DOT-MATRIX METHOD

Align the aa sequence "DOROTHYHODGKIN" vs "DOROTHYCROWFOOTHODGKIN"

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

DOT-MATRIX METHOD

EX 1

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

WINDOW SIZE = 11

STRINGENCY = 7 (how many identical)

…ACCAGTGTGCCGTACA…

window
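
The window and stringency parameters can be illustrated with a small dot-matrix routine. A minimal sketch, assuming a dot is placed at the start of each window pair (programs often place it at the window centre) and using the DOROTHY example with a smaller window than the slide's 11/7 so the short sequences still produce a plot. The function names are hypothetical.

```python
def dot_matrix(a: str, b: str, window: int = 1, stringency: int = 1):
    """dots[i][j] is True when the windows starting at a[i] and b[j]
    share at least `stringency` identical positions."""
    dots = []
    for i in range(len(a) - window + 1):
        row = []
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row.append(matches >= stringency)
        dots.append(row)
    return dots


def render(a: str, b: str, dots) -> str:
    """Text rendering: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + b[:len(dots[0])]]
    for i, row in enumerate(dots):
        lines.append(a[i] + " " + "".join("*" if d else "." for d in row))
    return "\n".join(lines)


seq_a = "DOROTHYHODGKIN"
seq_b = "DOROTHYCROWFOOTHODGKIN"
dots = dot_matrix(seq_a, seq_b, window=3, stringency=3)
print(render(seq_a, seq_b, dots))
```

With `window=1, stringency=1` every single identical residue produces a dot, which shows why a window filter is needed to suppress background noise.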

DOT-MATRIX METHOD

EX 2

A practical guide to the analysis of genes and proteins. Baxevanis, Ouellette. Wiley 2Ed.

DOT-MATRIX METHOD

EX 3 - REPEATS

Figure 3.6. Dot matrix analysis of the human LDL receptor against itself using DNA Strider, vers. 1.3, on a Macintosh

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

DOT-MATRIX METHOD

PROGRAMS

(you could use PubMed also)

Bioinformatics for Dummies. Claverie, Notredame. Wiley - 2nd Ed. 2007

DOT-MATRIX EXAMPLES

http://hits.isb-sib.ch/util/dotlet/doc/dotlet_examples.html

http://myhits.isb-sib.ch/cgi-bin/dotlet



DYNAMIC PROGRAMMING METHOD

Provides the very best (optimal) alignment in a very reasonable amount of time

Several parameters though

Global: Needleman-Wunsch

Local: Smith-Waterman

A p-value can be attached: the probability of obtaining the alignment by chance from unrelated sequences

There is a method for statistical significance

Results depend on the scoring system


DYN.PROG.METHOD - SCORING

Results depend on the scoring system

SCORING MATRICES

Scores depend on the pair of residues being compared

Gap Penalties

DNA alignments require a similar scoring system

DYNAMIC PROGRAMMING METHOD

[Figure: scoring-matrix cell update, indexed by i and j]

x, y are the "radius"

Gap penalties from the scoring matrix

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press


DYNAMIC PROGRAMMING METHOD

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

DYNAMIC PROGRAMMING EXAMPLE

       gap     A       C       G       G       A       T       A       T
gap    0       -1      -1      -1      -1      -1      -1      -1      -1
G      -1      0       -1      (d)+1   (d)+1   (l)0    (ld)-1  (d)-1   (d)-1
G      -1      -1      0       (d)+1   (d)+3   (l)+2   (l)+1   (l)0    (ld)-1
C      -1      -1      (d)+1   (ldu)0  (u)+2   (d)+3   (ld)+2  (ld)+1  (ld)0
T      -1      (d)-1   (u)0    (d)+1   (u)+1   (ud)+2  (d)+5   (l)+4   (ld)+3
A      -1      (d)+1   (l)0    (d)0    (d)+1   (d)+3   (u)+4   (d)+7   (l)+6

(d), (l), (u): the value came from the diagonal, left, or upper cell

Worked cells:
First G row, column A: Max(0, -2, -2) = 0
First G row, column C: Max(-1, -2, -1) = -1
Second G row, column A: Max(-1, -1, -2) = -1

X=1

Y=1

Gap W(x=1) = 1, W(x=2) = 1

Gap W(y=1) = 1

s(a,b) = 2, if a = b

s(a,b) = 0, if a <> b

ACGGATAT
--GGCTA-
DYN.PROG.METHOD - SCORING

Results depend on the scoring system

SCORING MATRICES

Scores depend on the pair of residues being compared

Gap Penalties

The Dayhoff PAM (point accepted mutations) matrix is based on an evolutionary model for proteins

One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed, estimated from very similar sequences

BLOSUM matrices are designed to identify members of the same family

Derived from the BLOCKS database (for distant sequences; BLOcks SUbstitution Matrix)

DYNAMIC PROGRAMMING - SCORING

Remember "SUM OF WEIGHTS" for similarity/distance

PAM250 is the PAM1 matrix multiplied by itself 250 times

BLOSUM62: sequences at least 62% identical are merged into one.

BLOSUM90 for comparing more similar sequences.

BLOSUM30 for very different sequences.

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press
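
The relation between PAM1 and PAM250 is a matrix power, not a scalar multiplication. A minimal sketch on a toy two-letter alphabet where each residue changes with probability 1% per PAM unit. The toy matrix is hypothetical and is not the Dayhoff data; the real PAM250 scoring matrix is further converted into log-odds scores, a step omitted here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 1-PAM transition matrix: 1% chance of change per step
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the probability of still seeing the original residue is far below 0.99 to the power 1 but well above zero, which is why PAM250 remains usable for distant relatives.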

DYNAMIC PROGRAMMING METHOD

Some programs provide alternative alignments, depending on the goal

domains

structural

same family

biological function

common ancestor

There are several variations on the original Needleman-Wunsch and Smith-Waterman methods, improving memory usage, CPU time, and other features

DYNAMIC PROGRAMMING METHOD - OUTPUT

Bioinformatics: Sequence and Genome Analysis. Mount. CSH Lab Press

DYNAMIC PROGRAMMING

STATISTICAL SIGNIFICANCE

To assign a p-value, we could "shuffle" both sequences 100,000 times.

The proportion of times we obtain SCORES larger than the score of the real alignment represents the p-value

Another, quicker method is converting the alignment to BINARY sequences (match or not match)

e.g. probability of obtaining HTHTHHHH in a coin toss experiment
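
The shuffling procedure can be sketched in a few lines. A minimal sketch, assuming a deliberately simple score (identical residues at the same position, no gaps) in place of a full dynamic-programming alignment, and far fewer shuffles than the 100,000 mentioned above. It counts scores greater than or equal to the observed one and adds 1 to numerator and denominator so the estimate is never exactly zero; both choices are common conventions, not something the slide specifies. All names are hypothetical.

```python
import random


def identity_score(a: str, b: str) -> int:
    """Toy score: number of positions holding the same residue."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_p_value(a: str, b: str, score_fn, n_shuffles: int = 1000,
                    seed: int = 0) -> float:
    """Estimate how often shuffled sequences score at least as well as the real pair."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = score_fn(a, b)
    a_list, b_list = list(a), list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(a_list)
        rng.shuffle(b_list)
        if score_fn("".join(a_list), "".join(b_list)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


p_value = shuffle_p_value("ACGTACGTACGT", "ACGTACGTACGT", identity_score, 500)
print(f"estimated p-value: {p_value:.4f}")
```

Shuffling keeps the residue composition of each sequence, so the null distribution reflects sequences of the same length and composition but with no shared order.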

DYNAMIC PROGRAMMING

STATISTICAL SIGNIFICANCE

Two random sequences of length m and n and p=prob. of match

Expected length of the longest match = log_(1/p)(mn)

DNA seq. length=100, p=0.25 (equal nt)

the longest match = 2 x log_4(100) = 6.64

More precise formula
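
The worked example for the longest expected match is a one-line calculation. A minimal sketch of the simple formula log base 1/p of (m times n); the function name is hypothetical, and the more precise formula mentioned on the slide is not reproduced here.

```python
import math


def expected_longest_match(m: int, n: int, p: float) -> float:
    """Simple estimate: log base 1/p of (m * n)."""
    return math.log(m * n, 1 / p)


# Two DNA sequences of length 100 with equal nucleotide frequencies
estimate = expected_longest_match(100, 100, 0.25)
print(round(estimate, 2))
```

Because m = n = 100, the value equals 2 x log_4(100), which is the form used on the slide.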


DYNAMIC PROGRAMMING

STATISTICAL SIGNIFICANCE

Simplifying

k=mismatches, m and n are sequence lengths

Effective length = n

E(m) (used in BLAST)

(mean of the highest possible local alignment score)

ALIGNMENT PROCEDURE OVERVIEW

WORD K-TUPLE METHOD - BLAST

Search a database for sequences that share at least W identical residues

For a sequence of length L, the number of "internal searches" is L-W+1

All "potential" sequences are then "extended" using the Dynamic Programming Method

A statistical significance score is estimated, representing the number of expected similar sequences in the database (E value, -equivalent- to a p-value for the entire database)
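
The word-lookup step can be sketched as follows. A minimal sketch, assuming exact word matches only; the function names are hypothetical and this is not BLAST's actual implementation, which for proteins also considers high-scoring "neighbourhood" words and then extends each hit.

```python
def words(sequence: str, w: int):
    """All overlapping words of length w: there are L - W + 1 of them."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def find_seeds(query: str, subject: str, w: int):
    """Return (query_pos, subject_pos) pairs where a word of length w is shared."""
    index = {}
    for j, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(j)
    return [(i, j)
            for i, word in enumerate(words(query, w))
            for j in index.get(word, [])]


query_words = words("ACCAGTGTGCCGTACA", 11)
print(len(query_words))  # L - W + 1 with L = 16, W = 11

seeds = find_seeds("GGCTA", "ACGGATAT", 2)
print(seeds)
```

Each seed is a candidate position from which an extension step would start.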

BLAST

Pi: random residue probability

Sij: from score matrix

Score: S = sum(Pi Pj Sij)

Transformation: for statistical comparisons, expressed in bits

S' = (λS - ln K) / ln 2

Expected number of matches of at least S'

E = m * n * 2^(-S')

Lengths: query=m, database=n

Example:

m=250, n=50,000,000, to achieve E=0.05

S' = 38 bits

S = [(38 * ln 2) + ln K] / λ

S = 76.6

(for ungapped version: λu = 0.3176 and Ku = 0.134)
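
The numbers in this example can be checked directly. A minimal sketch, assuming the standard Karlin-Altschul relations S' = (λS - ln K) / ln 2 and E = m * n * 2^(-S'), which are consistent with the slide's inverse formula for S. The function names are hypothetical.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def raw_score(bits: float, lam: float, k: float) -> float:
    """Inverse conversion: the raw score S needed for a given bit score S'."""
    return (bits * math.log(2) + math.log(k)) / lam


def e_value(m: int, n: int, bits: float) -> float:
    """Expected number of chance matches scoring at least S' bits."""
    return m * n * 2 ** (-bits)


def bits_for_e(m: int, n: int, e: float) -> float:
    """Bit score needed to reach a target E value."""
    return math.log2(m * n / e)


m, n = 250, 50_000_000
needed_bits = bits_for_e(m, n, 0.05)
needed_raw = raw_score(38, lam=0.3176, k=0.134)
print(f"bits needed for E=0.05: {needed_bits:.2f}")
print(f"raw score for 38 bits:  {needed_raw:.1f}")
print(f"E at 38 bits:           {e_value(m, n, 38):.4f}")
```

The bit score needed for E = 0.05 comes out just under 38, so rounding up to 38 bits gives an E value slightly below the target.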