Slides pptx - Bioinformatics and Research Computing - MIT

Introduction to NCBI & Ensembl tools including BLAST and database searching

Incorporating Bioinformatics into the High School Biology Curriculum


Fran Lewitter, Ph.D.

Director, Bioinformatics & Research Computing

Whitehead Institute for Biomedical Research

Lewitter AT wi.mit.edu

http://jura.wi.mit.edu/bio


Lewitter - Whitehead Institute, August 15, 2012

What I hope you’ll learn


What do we learn from database searching and sequence alignments

What tools are available at NCBI

What tools are available through Ensembl

First some background info



Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.

Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor.

Why do alignments


Use sequence similarity to infer homology and/or structural similarity between 2 or more genes/proteins

Identify more conserved regions of a protein, potentially identifying regions of most functional importance

Compare and contrast homologs (perhaps into groups) based on shared positions or regions

Infer evolutionary distance from sequence dissimilarity

Evolutionary Basis of Sequence Alignment

Similarity - observable quantity, such as percent identity

Homology - conclusion drawn from data that two genes share a common evolutionary history; no metric is associated with this

Paralog - genes related by duplication

Ortholog - genes related by speciation

Local vs Global Alignment

From Mount, Bioinformatics, 2004, pg 71

[Figure: GLOBAL vs LOCAL alignment]
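As a rough illustration of the GLOBAL vs LOCAL distinction, the sketch below scores the same pair of sequences both ways with one dynamic-programming recurrence. The function name `alignment_score`, the two example sequences, and the scoring values (match +1, mismatch -1, one linear gap penalty of -2) are assumptions made for this sketch; they are not taken from Mount's figure.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b.

    local=False: global alignment (both sequences aligned end to end).
    local=True:  local alignment (best-scoring pair of subsequences).
    """
    # First row: all gaps for global, zeros for local.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                score = max(score, 0)
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


# Hypothetical sequences that share only the core "ACGT".
seq_a = "GGGACGTCCC"
seq_b = "AAAACGTAAA"
print("global:", alignment_score(seq_a, seq_b))
print("local: ", alignment_score(seq_a, seq_b, local=True))
```

With these inputs the global score is pulled down by the unrelated flanks, while the local score reflects only the shared core.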

Nucleotide vs Protein

If comparing protein-coding genes, use protein sequences because they are less noisy

If protein sequences are very similar, it might be more instructive to use DNA sequences

Example of simple scoring system for nucleic acids

Match = +1 (ex. A-A, T-T, C-C, G-G)

Mismatch = -1 (ex. A-T, A-C, etc.)

Gap opening = -5

Gap extension = -2

 T  C  A  G  A  C  G  A  G  T  G
 T  C  G  G  A  -  -  G  C  T  G
+1 +1 -1 +1 +1 -5 -2 -1 -1 +1 +1 = -4


WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Possible Alignments

A: T C A G A C G A G T G

B: T C G G A G C T G

I.   T C A G A C G A G T G
     T C G G A - - G C T G     score = -4

II.  T C A G A C G A G T G
     T C G G A - G C - T G     score = -5

III. T C A G A C G A G T G
     T C G G A - G - C T G     score = -5
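The three scores can be checked with a few lines of code. This is a minimal sketch: the function name `score_alignment` is made up for illustration, and it assumes the convention used in the worked example, where the first position of a gap costs the gap-opening penalty and each further position of the same gap costs the gap-extension penalty.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score a gapped pairwise alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap_row = None  # row (0 = top, 1 = bottom) holding the previous gap, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            # Continuing a gap in the same row is an extension; otherwise it opens.
            total += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            total += match if x == y else mismatch
            prev_gap_row = None
    return total


seq_a = "TCAGACGAGTG"
alignments = {
    "I": "TCGGA--GCTG",
    "II": "TCGGA-GC-TG",
    "III": "TCGGA-G-CTG",
}
for name, row_b in alignments.items():
    print(name, score_alignment(seq_a, row_b))
```

Alignment I wins because one gap of length two (-5 - 2) is cheaper than two separate gaps (-5 - 5), even though it has one more mismatch.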

Scoring for Protein Alignments - Amino Acid Substitution Matrices

PAM - point accepted mutation based on global alignment [evolutionary model]

BLOSUM - block substitutions based on local alignments [similarity among conserved sequences]

AA Scoring Matrices

PAM - point accepted mutation based on global alignment [evolutionary model]

BLOSUM - block substitutions based on local alignments [similarity among conserved sequences]

Log-odds = $\log \frac{\text{pair in homologous proteins}}{\text{pair in unrelated proteins by chance}}$

Log-odds = $\log \frac{\text{observed frequency of amino acid substitutions}}{\text{frequency expected by chance}}$
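To make the log-odds idea concrete, here is a small sketch that turns an observed and an expected pair frequency into an integer matrix entry. The function name `log_odds_score`, the frequencies, and the scaling (log base 2 multiplied by 2, i.e. half-bit units, a common convention for published matrices) are assumptions for illustration, not values from the slides.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded, scaled log2 odds score for one amino acid pair."""
    if observed_freq <= 0 or expected_freq <= 0:
        raise ValueError("frequencies must be positive")
    return round(scale * math.log2(observed_freq / expected_freq))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.04, 0.01))   # pair seen 4x more often than chance
print(log_odds_score(0.01, 0.01))   # pair seen exactly as often as chance
print(log_odds_score(0.005, 0.01))  # pair seen half as often as chance
```

Pairs seen more often than chance get positive scores, pairs seen as often as chance get zero, and pairs seen less often get negative scores.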

Substitution Matrices

BLOSUM (% identity): BLOSUM 30, BLOSUM 62, BLOSUM 80

PAM (% change): PAM 250 (80), PAM 120 (66), PAM 90 (50)

Both series are listed in order of increasing similarity

Scoring for BLAST Alignments

Score = 94.0 bits (230), Expect = 6e-19

Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)

Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256
           Y+  FC      +  + CY G G +YRG    T SGA C PW S      V   Q   A+
Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257

Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297
             GLG H +CRNPD D +PWC VL   RL+WEYCD+  C T
Sbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298

Position 1: Y - Y = 7

Position 2: T - S = 1

Position 3: G - S = 0

Position 4: P - E = -1

. . .

Position 9: - - P = -11

Position 10: - - A = -1

. . .

Sum 230

Based on BLOSUM62

Statistical Significance

Raw Scores - score of an alignment equal to the sum of substitution and gap scores.

Bit scores - scaled version of an alignment's raw score that accounts for the statistical properties of the scoring system used.

E-value - expected number of distinct alignments that would achieve a given score by chance. Lower E-value => more significant.
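The raw-to-bit conversion can be sketched with the standard rescaling, bit score = (lambda * S - ln K) / ln 2. The function name `bit_score` and the two parameter values below are assumptions for illustration; the actual lambda and K depend on the substitution matrix and gap costs and are reported at the end of a BLAST run.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed parameters, for illustration only; they are not given on the slides.
ASSUMED_LAMBDA = 0.270
ASSUMED_K = 0.047

print(round(bit_score(230, ASSUMED_LAMBDA, ASSUMED_K), 1))
```

With these assumed parameters, the raw score of 230 from the BLAST example comes out close to the 94.0 bits reported there.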

A formula

$E = K m n \, e^{-\lambda S}$

This is the Expected number of high-scoring segment pairs (HSPs) with score at least S for sequences of length m and n.

This is the E value for the score S.

The parameters K and $\lambda$ can be thought of simply as natural scales for the search space size and the scoring system respectively.
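A quick numerical sketch shows how the formula behaves. The function name `expect_value` and all inputs below (K, lambda, sequence lengths, scores) are hypothetical values chosen only to show the trend.

```python
import math


def expect_value(raw_score, m, n, lam, k):
    """Expected number of chance HSPs scoring at least raw_score: K*m*n*exp(-lam*S)."""
    return k * m * n * math.exp(-lam * raw_score)


# Hypothetical inputs, for illustration only.
LAM, K = 0.3, 0.1               # assumed scoring-system parameters
M_LEN, N_LEN = 300, 1_000_000   # assumed query and database lengths

e_low = expect_value(40, M_LEN, N_LEN, LAM, K)
e_high = expect_value(80, M_LEN, N_LEN, LAM, K)
e_bigger_db = expect_value(80, M_LEN, 2 * N_LEN, LAM, K)

print(f"S=40: E={e_low:.3g}")
print(f"S=80: E={e_high:.3g}")
print(f"S=80, database twice as long: E={e_bigger_db:.3g}")
```

E falls exponentially as the score rises and grows in proportion to the search space, so the same score is less significant against a larger database.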

What’s significant?


High confidence - >40% identity for long alignments (Rost, 1999 found that sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity is >40%)

“Twilight zone” - blurry - 20-35% identity

“Midnight zone” - <20% identity
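These thresholds can be applied to any pairwise alignment. The sketch below uses made-up function names (`percent_identity`, `identity_zone`) and counts identity over all alignment columns, gaps included, as in the "Identities = 45/101" line of the BLAST example. It ignores alignment length, although the thresholds are stated for long alignments, and it returns a neutral label for the 35-40% range, which the list above does not name.

```python
def percent_identity(row_a, row_b):
    """Percent of alignment columns with identical residues (gap columns never count)."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    same = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * same / len(row_a)


def identity_zone(pct):
    """Map a percent identity onto the zones listed above."""
    if pct > 40:
        return "high confidence"
    if 20 <= pct <= 35:
        return "twilight zone"
    if pct < 20:
        return "midnight zone"
    return "between the listed ranges"


# 45 identities over 101 columns, as in the BLAST example.
print(identity_zone(100.0 * 45 / 101))
```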

Tools of Interest


NCBI (US) - http://www.ncbi.nlm.nih.gov/

BLAST

Entrez

EBI/EMBL/WTSI (Europe)

Ensembl (http://www.ensembl.org/)


NCBI Home Page

Ensembl Home Page

Hands-On

Entrez - finding sequences

BLAST - look for similar sequences

Ensembl - finding sequences

How is BLAST used @ WIBR?


Looking for homologs


Identifying domains in proteins


Peptide identification


Genome annotation



Remember

1. Don’t be afraid to look at help pages for each website

   Click, click, click

2. Any analysis tool will give you results

3. Interpret results you find

Sequence of Note


   1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC
  31 CCCCCTGACG AGCATCACAA AAATCGACGC
  61 GGTGGCGAAA CCCGACAGGA CTATAAAGAT
     …………..
1371 GTAAAGTCTG GAAACGCGGA AGTCAGCGCC

“Here you see the actual structure of a small fragment of dinosaur DNA,” Wu said. “Notice the sequence is made up of four compounds - adenine, guanine, thymine and cytosine. This amount of DNA probably contains instructions to make a single protein - say, a hormone or an enzyme. The full DNA molecule contains three billion of these bases. If we looked at a screen like this once a second, for eight hours a day, it’d still take more than two years to look at the entire DNA strand. It’s that big.” (page 103)

(Published in 1990)

Biotechniques, 1992

DinoDNA

"Dinosaur DNA" from Crichton's THE LOST WORLD p. 135

GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGC
ATCAGATACAGTTGGAGATAAGGACGACGTGTGG
CAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCT
GGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTC
CCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGG
GGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGC
CTCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTG
CCGTGGCAGACACGGGTACTTTGGGGACCCCCCA
GTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCC
CACTACCTGGAGCTGCTGCAACCCCCCCGGGGCA
GCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCC
ACTCAGCAGCGCCTGCGGCCTCTACTACAAAC



Published in 1995

>Erythroid transcription factor (NF-E1 DNA-binding protein)

Query: 121  MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300
            MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
Sbjct: 1    MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60

Query: 301  TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480
            TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV    NCGAT
Sbjct: 61   TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116

Query: 481  ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660
            ATPLWRRDGTGHYLCN   ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS    NCQT
Sbjct: 117  ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169

Query: 661  STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840
            STTTLWRRSPMGDPVCN   ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
Sbjct: 170  STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226

Query: 841  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020
            GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF
Sbjct: 227  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286

Query: 1021 FGGGAGGYTAPPGLSPQI 1074
            FGGGAGGYTAPPGLSPQI
Sbjct: 287  FGGGAGGYTAPPGLSPQI 304

>Erythroid transcription factor (NF-E1 DNA-binding protein)

Query: 121  MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300
            MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
Sbjct: 1    MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60

Query: 301  TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480
            TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV    NCGAT
Sbjct: 61   TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116

Query: 481  ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660
            ATPLWRRDGTGHYLCN   ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS    NCQT
Sbjct: 117  ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169

Query: 661  STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840
            STTTLWRRSPMGDPVCN   ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
Sbjct: 170  STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226

Query: 841  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020
            GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF
Sbjct: 227  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286

Query: 1021 FGGGAGGYTAPPGLSPQI 1074
            FGGGAGGYTAPPGLSPQI
Sbjct: 287  FGGGAGGYTAPPGLSPQI 304

Highlighted in the query: MARK, WAS, HERE, NIH
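The highlighted residues are exactly the query positions that face gaps in the subject, so they can be pulled out programmatically. This is a minimal sketch: `inserted_segments` is a made-up helper name, and the pairs below are short fragments excerpted from the alignment rows above rather than the full rows.

```python
def inserted_segments(query_row, subject_row):
    """Collect runs of query residues aligned against gaps ('-') in the subject."""
    segments, current = [], []
    for q, s in zip(query_row, subject_row):
        if s == "-" and q != "-":
            current.append(q)
        elif current:
            # The gap run has ended; store the residues collected so far.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


# (query fragment, subject fragment) excerpts around each gap in the alignment.
fragments = [
    ("CEARECVMARKNCGAT", "CEARECV----NCGAT"),
    ("HYLCNWASACGLY", "HYLCN---ACGLY"),
    ("TVCSHERENCQT", "TVCS----NCQT"),
    ("PVCNNIHACGLY", "PVCN---ACGLY"),
]
words = [seg for q, s in fragments for seg in inserted_segments(q, s)]
print(" ".join(words))
```

Read in order, the inserted residues spell out the message hidden in the "dinosaur" sequence.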