Bioinformatics databases and sequence retrieval






Bioinformatics databases & sequence retrieval

Content of lecture

I. Introduction
II. Bioinformatics data & databases
III. Sequence Retrieval with MRS





Celia van Gelder
CMBI, UMC Radboud
September 2011

©CMBI 2009



I. Bioinformatics questions

Lookup
  Is the gene known for my protein (or vice versa)?
  What sequence patterns are present in my protein?
  To what class or family does my protein belong?

Compare
  Are there sequences in the database which resemble the protein I cloned?
  How can I optimally align the members of this protein family?

Predict
  Can I predict the active site residues of this enzyme?
  Can I predict a (better) drug for this target?
  How can I predict the genes located on this genome?



Sequence similarity

MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE

WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK

VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW

GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG

DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG

Imagine you sequenced this human protein.

You know it is a serine protease.

Which residues belong to the active site?

Is its sequence similar to the mouse serine protease?



Sequence Alignment

MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE

MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ

*::* .**** **. :. : *:**:*** : .** * *.* *********: ****** *::



WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK

WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK

******* ** *:******** *.***:**** ***.*::** *********: **.**.****



VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW

VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW

**:*** *** ******: * ********:* ******:*** ****:*::** *:*.***:**



GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG

GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG

********** ********** ******:*. ******** . ***.****** **********



DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG

DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----

********** *. ***:*** *******: : ***** ** * *****::*** ******


=> Transfer of information
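
Transferring information from one sequence to the other is only reasonable when the aligned sequences are sufficiently similar. The following is a minimal sketch of how identity could be counted over an alignment, using the first three blocks of the last alignment row above (upper row and lower row). The function name compare_aligned and the choice to skip gap columns are assumptions made for illustration; this is not how Clustal computes its conservation line.

```python
def compare_aligned(seq_a: str, seq_b: str, gap: str = "-"):
    """Count identical positions in two aligned sequences.

    Returns (identical, compared, marks), where marks has '*' under
    identical columns and ' ' elsewhere. Columns containing a gap in
    either sequence are not counted as compared.
    """
    # The slides print sequences in blocks of 10; drop the spaces.
    a = seq_a.replace(" ", "")
    b = seq_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")

    identical = 0
    compared = 0
    marks = []
    for x, y in zip(a, b):
        if x == gap or y == gap:
            marks.append(" ")
            continue
        compared += 1
        if x == y:
            identical += 1
            marks.append("*")
        else:
            marks.append(" ")
    return identical, compared, "".join(marks)


# First three blocks of the last alignment row shown above.
upper = "DSGGPLMCQV DGAWLLAGII SWGEGCAERN"
lower = "DSGGPLMCQV DDHWLLTGII SWGEGCAD-D"

identical, compared, marks = compare_aligned(upper, lower)
print(f"{identical}/{compared} identical "
      f"({100 * identical / compared:.1f}%)")
```

The same function can be applied to the full alignment by concatenating all rows of each sequence.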


II. Bioinformatics data and databases



mRNA expression profiles
MS data

Large amount of data
Growing very, very fast
Heterogeneous data types



EMBL DNA database

Total nucleotides (current: 301,588,430,608)
Number of entries (current: 199,575,971)




Biological databases (1)

Primary databases contain biomolecular sequences or structures (experimental data!) and associated annotation information

Sequences
  Nucleic acid sequences: EMBL, GenBank, DDBJ
  Protein sequences: SwissProt, TrEMBL, UniProt

Structures
  Protein structures: PDB
  Structures of small compounds: CSD

Genomes
  Ensembl
  UCSC



Biological databases (2)

Secondary databases contain data derived from primary database(s)

Patterns, motifs, domains
  PROSITE, PFAM, PRINTS, INTERPRO, ...

Disease mutations
  OMIM / MIM

SNPs
  dbSNP

Pathways
  KEGG



Databases

Data must be in a defined format for software to recognize it.


Every database can have its own format, but some data elements are essential for every database:


1. Unique identifier, or accession code

2. Name of depositor

3. Literature references

4. Deposition date

5. The real data
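
The five essential elements listed above can be pictured as the fields of a simple record type. This is only a sketch: the class name DatabaseEntry, the field names, and every value in the example entry are hypothetical and do not correspond to any real database schema or entry.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DatabaseEntry:
    """Hypothetical record holding the five essential elements."""
    accession: str                 # 1. unique identifier / accession code
    depositor: str                 # 2. name of depositor
    deposition_date: date          # 4. deposition date
    data: str                      # 5. the real data (here: a sequence)
    references: List[str] = field(default_factory=list)  # 3. literature

    def is_complete(self) -> bool:
        """True when none of the essential elements is empty."""
        return bool(self.accession and self.depositor
                    and self.references and self.data)


# Made-up example values, for illustration only.
entry = DatabaseEntry(
    accession="EXAMPLE0001",
    depositor="A. Depositor",
    deposition_date=date(2011, 9, 1),
    data="ATGGCCAAG",
    references=["Hypothetical reference (2011)"],
)
print(entry.accession, entry.is_complete())
```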




Quality of Data

SwissProt
  Data is only entered by annotation experts

EMBL, PDB
  “Everybody” can submit data
  No human intervention when submitted; some automatic checks


SwissProt database

Database of protein sequences
531,473 sequence entries (July 2011)
SwissProt is curated and annotated manually!
Obligatory deposit in SwissProt before publication
SwissProt is part of UniProt
The other main part of UniProt is TrEMBL. TrEMBL is automatically annotated and is not reviewed.



Important records in SwissProt (1)

ID   HBA_HUMAN Reviewed; 142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
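
Because every line of the entry starts with a two-letter line code (ID, AC, DT, DE), the records above are easy to group by code. The sketch below does this for the HBA_HUMAN lines shown on this slide; the function name parse_records is an assumption for illustration, and the parser handles only the simple line layout shown here, not the complete flat-file format.

```python
from collections import defaultdict

# The records from the slide, one per line, line code first.
ENTRY = """\
ID   HBA_HUMAN Reviewed; 142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
"""


def parse_records(text):
    """Group the lines of an entry by their two-letter line code."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Everything before the first space is the line code.
        code, _, value = line.partition(" ")
        records[code].append(value.strip())
    return dict(records)


records = parse_records(ENTRY)

# Entry name is the first word of the ID line.
entry_name = records["ID"][0].split()[0]

# Accession numbers are separated by semicolons on the AC line.
accessions = [a.strip() for a in records["AC"][0].split(";") if a.strip()]

print(entry_name, accessions[0], len(records["DT"]), "DT lines")
```

The first accession number on the AC line (here P69905) is the one normally used to cite the entry; the others are secondary accession numbers.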





Important records in SwissProt (2)

Cross references section:

Hyperlinks to all entries in other databases which are relevant for the protein sequence HBA_HUMAN:

  genes & mRNA
  protein domains
  structures
  diseases



Important records in SwissProt (3)

Features section:

post-translational modifications, signal peptides, binding sites, enzyme active sites, domains, disulfide bridges, local secondary structure, sequence conflicts between references, etc.



And finally, the amino acid sequence!



EMBL database

Nucleotide database
EMBL: 199 million sequence entries comprising 301 billion nucleotides (Aug 2011)
EMBL records follow roughly the same scheme as SwissProt
Obligatory deposit of sequence in EMBL before publication
Most EMBL sequences are never seen by a human




Protein Data Bank (PDB)

Databank for 3-dimensional structures of biomolecules (by X-ray & NMR):
  Protein
  DNA
  RNA
  Ligands

Obligatory deposit of coordinates in the PDB before publication

~75 000 entries (Aug 2011) (~6000 “unique” structures)

PDB file is a keyword-organised flat-file (80 column)
1) human readable
2) every line starts with a keyword (3-6 letters)
3) platform independent



PDB important records (1)

PDB nomenclature
Filename = accession number = PDB Code
Filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)

HEADER: describes molecule & gives deposition date
HEADER PLANT SEED PROTEIN 30-APR-81 1CRN

COMPND: name of molecule
COMPND CRAMBIN

SOURCE: organism
SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED



PDB important records (2)

SEQRES: sequence of protein. Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!

SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51

SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52

SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53

SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54


SSBOND: disulfide bridges

SSBOND 1 CYS 3 CYS 40

SSBOND 2 CYS 4 CYS 32
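
The SEQRES and SSBOND records above can be turned into a one-letter sequence and a list of bridged positions with a few lines of code. This is a sketch under a stated assumption: it splits lines on whitespace, which matches the records as printed on this slide, whereas real PDB files use fixed columns (and may carry a chain identifier). The function names seqres_to_sequence and ssbond_pairs are made up for illustration.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Records exactly as shown on the slide (whitespace-collapsed).
PDB_LINES = """\
SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54
SSBOND 1 CYS 3 CYS 40
SSBOND 2 CYS 4 CYS 32
""".splitlines()


def seqres_to_sequence(lines):
    """Concatenate the residues of all SEQRES lines as one-letter codes."""
    residues = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "SEQRES":
            # Skip keyword, line number and residue count; trailing
            # tokens such as '1CRN 51' are not residue names and drop out.
            residues.extend(THREE_TO_ONE[t] for t in tokens[3:]
                            if t in THREE_TO_ONE)
    return "".join(residues)


def ssbond_pairs(lines):
    """Return (position, position) pairs from SSBOND lines."""
    pairs = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "SSBOND":
            pairs.append((int(tokens[3]), int(tokens[5])))
    return pairs


sequence = seqres_to_sequence(PDB_LINES)
bridges = ssbond_pairs(PDB_LINES)
print(sequence, len(sequence))
for first, second in bridges:
    # Positions in PDB records are 1-based.
    print(first, sequence[first - 1], second, sequence[second - 1])
```

Checking that both ends of every SSBOND record point at a cysteine is a cheap sanity check on the numbering.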






PDB important records (3)

and at the end of the PDB file the “real” data:


ATOM: one line for each atom with its unique name and its x,y,z coordinates

ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70

ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80 1CRN 71

ATOM 3 C THR 1 15.685 12.755 5.133 1.00 9.19 1CRN 72

ATOM 4 O THR 1 15.268 13.825 5.594 1.00 9.85 1CRN 73

ATOM 5 CB THR 1 18.170 12.703 5.337 1.00 13.02 1CRN 74

ATOM 6 OG1 THR 1 19.334 12.829 4.463 1.00 15.06 1CRN 75

ATOM 7 CG2 THR 1 18.150 11.546 6.304 1.00 14.23 1CRN 76

ATOM 8 N THR 2 15.115 11.555 5.265 1.00 7.81 1CRN 77

ATOM 9 CA THR 2 13.856 11.469 6.066 1.00 8.31 1CRN 78

ATOM 10 C THR 2 14.164 10.785 7.379 1.00 5.80 1CRN 79

ATOM 11 O THR 2 14.993 9.862 7.443 1.00 6.94 1CRN 80
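
Once the coordinates are read from the ATOM lines, simple geometry follows directly, for example the distance between two atoms. The sketch below parses the first three ATOM lines shown above and measures the distance between the N and CA atoms of THR 1. It assumes the whitespace-separated layout printed on this slide; real PDB files are fixed-column and should be read by column position. The names Atom, parse_atom and distance are hypothetical helpers.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Parse one whitespace-separated ATOM line as printed on the slide."""
    t = line.split()
    if t[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return Atom(int(t[1]), t[2], t[3], int(t[4]),
                float(t[5]), float(t[6]), float(t[7]))


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the file."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


LINES = [
    "ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70",
    "ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80 1CRN 71",
    "ATOM 3 C THR 1 15.685 12.755 5.133 1.00 9.19 1CRN 72",
]

atoms = [parse_atom(line) for line in LINES]
n_atom, ca_atom, c_atom = atoms
print(f"N-CA distance in THR 1: {distance(n_atom, ca_atom):.3f}")
```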





Structure Visualization

Structures from PDB can be visualized with:

1. Yasara (www.yasara.org)
2. SwissPDBViewer (http://spdbv.vital-it.ch/)
3. Protein Explorer (http://www.umass.edu/microbio/rasmol/)
4. Cn3D (http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml)



Part III: Sequence Retrieval with MRS

Google
  The best generic search and retrieval system
  Google searches everywhere for everything

MRS
  Maarten’s Retrieval System (http://mrs.cmbi.ru.nl)
  MRS searches in selected data environments

MRS is the Google of the biological database world
  Search engine (like Google)
  Input/Query = word(s)
  Output = entry/entries from database

Other programs exist: Entrez, SRS, ...



MRS

MRS is mainly used for (but not restricted to) protein/nucleic acid and related databases:
  DNA and protein sequences
  Sequence related information (e.g. alignments, protein domains, enzymes, metabolic pathways, structural information)
  Hereditary information



MRS Search Steps

1. Select database(s) of choice
2. Formulate your query
3. Hit “Search”
4. The result is a “query set” or “hitlist”
5. Analyze the results



http://mrs.cmbi.ru.nl




MRS Database Selection

You can choose between selecting all databases or just one of them.

But think about your query first!!



MRS Search options

Simply type your keywords in the keyword field and choose SEARCH.

If you know the fields of the database you are searching in, you can specify your query further.

But think about your query first!!



MRS Hitlist (1)

MRS Hitlist (2)



MRS Options

MRS creates a result, or a “query set”, or “hitlist”.

With the result you can do different things in MRS:
  View the hits
  Blast single hit sequences
  Clustal multiple hit sequences



MRS - View Hits



Combine in MRS

AND or &

AND is implicit

OR or |

NOT or !
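
What these operators do to a hitlist can be pictured with set operations: AND keeps entries that match both words, OR keeps entries that match either, and NOT removes entries that match the second word. The sketch below uses a tiny made-up index; the dictionary INDEX, its contents, and the helper hits are hypothetical and say nothing about how MRS is implemented internally.

```python
# Toy index: search word -> set of entry names containing that word.
# The assignments are made up for illustration.
INDEX = {
    "hemoglobin": {"HBA_HUMAN", "HBB_HUMAN", "HBA_MOUSE"},
    "human": {"HBA_HUMAN", "HBB_HUMAN", "INS_HUMAN"},
    "alpha": {"HBA_HUMAN", "HBA_MOUSE"},
}


def hits(word):
    """Return the set of entries matching a single search word."""
    return INDEX.get(word.lower(), set())


# hemoglobin AND human (also what two words without operator would mean,
# since AND is implicit): intersection of the two hit sets.
and_hits = hits("hemoglobin") & hits("human")

# alpha OR human: union of the two hit sets.
or_hits = hits("alpha") | hits("human")

# hemoglobin NOT human: hits for the first word minus hits for the second.
not_hits = hits("hemoglobin") - hits("human")

print(sorted(and_hits))
print(sorted(or_hits))
print(sorted(not_hits))
```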




MRS - Options

Home
  brings you back to the start page of MRS. That is the page from which you can do keyword searches.

Blast
  brings you to the MRS page from which you can do Blast searches.

Status
  gives you all the currently indexed databases.

Align
  brings you to the MRS page from which you can do Clustal alignments.

Databank: uniprot
  lists the database you selected.

Help
  provides some help.



Try it yourself with the exercises!

Ground rules for bioinformatics


Don't always believe what programs tell you - they're often misleading & sometimes wrong!

Don't always believe what databases tell you - they're often misleading & sometimes wrong!

Don't always believe what lecturers tell you - they're sometimes wrong!

Don't be a naive user: computers don’t do biology & bioinformatics, you do!

(freely adapted from Terri Attwood)