Biomolecular databases

Bioinformatics

Jacques van Helden

FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgique
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/

NEW ADDRESS (since Nov 1st, 2011)
Jacques.van-Helden@univ-amu.fr
Université d’Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics
(TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/



Contents
- Examples of biological databases
  - Nucleic sequences: GenBank, EMBL, and DDBJ
  - Protein sequences: UniProt
  - The Gene Ontology (GO) project
- Issues and perspectives for biological databases

Examples of biomolecular databases

Biomolecular Databases

Jacques.van.Helden@ulb.ac.be
Université Libre de Bruxelles, Belgique
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)
http://www.bigre.ulb.ac.be/

Examples of biomolecular databases

- Sequence and structure databases
  - Protein sequences (UniProt)
  - DNA sequences (EMBL, GenBank, DDBJ)
  - 3D structures (PDB)
  - Structural motifs (CATH)
  - Sequence motifs (PROSITE, PRODOM)
- Genome sequences and annotations
  - Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, ...)
  - Multiple genomes (Integr8, NCBI, KEGG, TIGR, ...)
- Molecular functions
  - Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
  - Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
  - Transport (YTPdb)
- Biological processes
  - Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
  - Signal transduction pathways (CSNdb, Transpath)
  - Protein-protein interactions (DIP, BIND, MINT)
  - Gene networks (GeneNet, FlyNets)

Databases of databases

- There are hundreds of databases related to molecular biology and biochemistry. New databases are created every year.
- Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
  - http://nar.oupjournals.org/
  - 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
- The same journal maintains a database of databases: the Molecular Biology Database Collection.
  - http://www.oxfordjournals.org/nar/database/c/
- Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
  - http://srs.ebi.ac.uk/

Nucleic sequence databases: GenBank, EMBL, and DDBJ

Okubo et al. (2006) NAR 34: D6-D9

Nucleic sequence databases
- Before publishing an article that deals with a sequence, scientific journals require the sequence to be deposited in a reference database.
- There are 3 main repositories for nucleic acid sequences.
- Sequences deposited in any of these 3 databases are automatically synchronized to the other 2.
Adapted from Didier Gonze

The sequencing pace
- Nucleic sequences
  - GenBank (April 2011): http://www.ncbi.nlm.nih.gov/genbank/
    - 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
    - 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
- Entire genomes
  - GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
  - http://www.genomesonline.org/gold_statistics.htm
- Protein sequences
  - Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
  - UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
  - http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

Size of the nucleotide database

EMBL Nucleotide Sequence Database: Release Notes - Release 113 September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                        entries       nucleotides
----------------------------------------------------------------------
CON: Constructed                           7,236,371   359,112,791,043
EST: Expressed Sequence Tag               73,715,376    40,997,082,803
GSS: Genome Sequence Scan                 34,528,104    21,985,922,905
HTC: High Throughput cDNA sequencing         491,770       594,229,662
HTG: High Throughput Genome sequencing       152,599    25,159,746,658
PAT: Patents                              24,364,832    12,117,896,594
STD: Standard                             13,920,617    37,665,112,606
STS: Sequence Tagged Site                  1,322,570       636,037,867
TSA: Transcriptome Shotgun Assembly        8,085,693     5,663,938,279
WGS: Whole Genome Shotgun                 88,288,431   305,661,696,545
----------------------------------------------------------------------
Total                                    252,106,363   450,481,663,919

Division                                     entries       nucleotides
----------------------------------------------------------------------
ENV: Environmental Samples                30,908,230    14,420,391,278
FUN: Fungi                                 6,522,586    11,614,472,226
HUM: Human                                32,094,500    38,072,362,804
INV: Invertebrates                        31,907,138    52,527,673,643
MAM: Other Mammals                        40,012,731   145,678,620,711
MUS: Mus musculus                         11,745,671    19,701,637,499
PHG: Bacteriophage                             8,511        85,549,111
PLN: Plants                               52,428,994    55,570,452,118
PRO: Prokaryotes                           2,808,489    28,807,572,238
ROD: Rodents                               6,554,012    33,326,106,733
SYN: Synthetic                             4,045,013       782,174,055
TGN: Transgenic                              285,307       849,743,891
UNC: Unclassified                          8,617,225     4,957,442,673
VRL: Viruses                               1,358,528     1,518,575,082
VRT: Other Vertebrates                    22,809,428    42,568,889,857
----------------------------------------------------------------------
Total                                    252,106,363   450,481,663,919

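As a quick consistency check on the class table, the per-class counts can be summed and compared with the printed totals. The sketch below copies the class rows into a Python dictionary; the names `CLASSES` and `share_of_total` are illustrative and not part of any EMBL tool. With these numbers the nucleotide total equals the sum of all classes except CON, a reading inferred from the arithmetic alone rather than stated in the excerpt.

```python
# (entries, nucleotides) per data class, copied from the Release 113 table.
CLASSES = {
    "CON": (7_236_371, 359_112_791_043),
    "EST": (73_715_376, 40_997_082_803),
    "GSS": (34_528_104, 21_985_922_905),
    "HTC": (491_770, 594_229_662),
    "HTG": (152_599, 25_159_746_658),
    "PAT": (24_364_832, 12_117_896_594),
    "STD": (13_920_617, 37_665_112_606),
    "STS": (1_322_570, 636_037_867),
    "TSA": (8_085_693, 5_663_938_279),
    "WGS": (88_288_431, 305_661_696_545),
}
TOTAL_ENTRIES = 252_106_363
TOTAL_NUCLEOTIDES = 450_481_663_919


def share_of_total(class_code):
    """Fraction of the printed nucleotide total contributed by one class."""
    return CLASSES[class_code][1] / TOTAL_NUCLEOTIDES


# Entries: every class, CON included, is counted in the printed total.
entries_sum = sum(entries for entries, _ in CLASSES.values())

# Nucleotides: the printed total matches the sum over all classes except CON
# (an observation from the numbers, not a documented rule).
nucleotides_sum_without_con = sum(
    nucleotides for code, (_, nucleotides) in CLASSES.items() if code != "CON"
)

print(entries_sum == TOTAL_ENTRIES)
print(nucleotides_sum_without_con == TOTAL_NUCLEOTIDES)
print(f"WGS share of nucleotides: {share_of_total('WGS'):.1%}")
```
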
Genbank (NCBI - USA)
http://www.ncbi.nlm.nih.gov/Genbank/

The EMBL Nucleotide Sequence Database (EBI - UK)
http://www.ebi.ac.uk/embl/

DDBJ - DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/

Size of the nucleic sequence databases
- Summary of database contents for the 3 main databases of nucleic sequences.
- Source: NAR database issue January 2006.

Database  URL                           Sequences  Bases (without shotgun)  Bases (including shotgun)  Organisms
DDBJ      http://www.ddbj.nig.ac.jp/    2.0E+06    1.7E+09                  -                          -
EMBL      http://www.ebi.ac.uk/embl/    -          -                        1.0E+11                    2.0E+05
GenBank   http://www.ncbi.nlm.nih.gov/  4.6E+07    5.1E+10                  1.0E+11                    2.1E+05

UniProt: protein sequences and functional annotations

UniProt - the Universal Protein Resource
http://www.uniprot.org/
- Database content (Sept 2012)
  - UniProtKB: 24,532,088 entries
    - Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
  - UniProtKB/Swiss-Prot section (reviewed): 537,505 entries
    - annotation by experts
    - high information content
    - many references to the literature
    - good reliability of the information

  - The rest (about 98% of the entries, given the counts above)
    - Automatic annotation by sequence similarity.
- Features
  - The most comprehensive protein database in the world.
  - A huge team: >100 annotators + developers.
  - Annotation by experts: annotators are specialized for different types of proteins or organisms.
  - Recognized worldwide as an essential resource.
- References
  - Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
  - The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.

Number of entries (polypeptides) in Swiss-Prot
http://www.expasy.org/sprot/relnotes/relstat.html

Taxonomic distribution of the sequences
Within Eukaryotes
UniProt example - Human Pax-6 protein
- Header: name and synonyms
- Human-based annotation by specialists
- Structured annotation: keywords and Gene Ontology terms
- Protein interactions; alternative products
- Detailed description of regions, variations, and secondary structure
- Peptide sequence
- References to original publications
- Cross-references to many databases (fragment shown)

3D Structure of macromolecules

PDB - The Protein Data Bank
http://www.rcsb.org/pdb/

Genome browsers

EnsEMBL Genome Browser (Sanger Institute + EBI)
http://www.ensembl.org/
UCSC Genome Browser (University of California, Santa Cruz - USA)
http://genome.ucsc.edu/
- Human gene Pax6 aligned with vertebrate genomes
- Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
- Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex

ECR Browser
http://ecrbrowser.dcode.org/

EnsEMBL - Example: Drosophila gene Pax6
http://www.ensembl.org/

Comparative genomics

Integr8 - access to complete genomes and proteomes
http://www.ebi.ac.uk/integr8/
- Genome summaries
- Clusters of orthologous genes (COGs)
- Clusters of paralogous genes

Databases of protein domains

Prosite - protein domains, families and functional sites
http://www.expasy.ch/prosite/

Prosite - aligned sequences and logo
http://www.expasy.ch/prosite/
- Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
- The sequence logo (below) indicates the level of conservation of each residue in each column of the alignment.
- Note the 6 cysteines, characteristic of this domain.

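Conservation per alignment column is commonly quantified as information content: the maximum possible entropy minus the observed entropy of the column. The sketch below is a minimal version under stated assumptions: a 20-letter amino-acid alphabet, gaps ignored, and no small-sample correction. The toy alignment is made up for illustration and is not taken from Prosite.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the observed
    residue frequencies. Gaps ('-') are ignored and no small-sample correction
    is applied (simplifying assumptions).
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(sequences):
    """Information content of every column of equal-length aligned sequences."""
    return [column_information(column) for column in zip(*sequences)]


# Made-up alignment: columns 1 and 4 are invariant cysteines.
toy_alignment = ["CAKC", "CSRC", "CTKC", "CGRC"]
profile = alignment_information(toy_alignment)
for position, bits in enumerate(profile, start=1):
    print(position, round(bits, 2))
```

Fully conserved columns reach the maximum of log2(20) bits, while a column with four equally frequent residues loses 2 bits, which is why conserved cysteines would stand out as the tallest stacks in a logo.
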
Prosite - Example of profile matrix
http://www.expasy.ch/prosite/

Prosite - Example of sequence logo
http://www.expasy.ch/prosite/

Prosite - Example of domain signature
http://www.expasy.ch/prosite/

- The domain signature is a string-based pattern representing the residues that are characteristic of a domain.

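Because a signature is a string pattern, it can be turned into a regular expression and scanned against protein sequences. The sketch below covers only the basic PROSITE pattern elements (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends). The function name `prosite_to_regex`, the example pattern and the test sequence are illustrative assumptions, not an actual Prosite entry.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]

        # Optional repetition such as x(2) or x(6,8).
        repeat = ""
        match = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if match:
            element = match.group(1)
            upper = "," + match.group(3) if match.group(3) else ""
            repeat = "{" + match.group(2) + upper + "}"

        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{"):        # excluded residues
            core = "[^" + element[1:-1] + "]"
        else:                                # single residue or [allowed set]
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Illustrative pattern and sequence (not a real Prosite signature).
regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
hit = re.search(regex, "MACVKLMNPQ")
print(regex, hit.group(0) if hit else None)
```
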
PFAM (Sanger Institute - UK)
http://pfam.sanger.ac.uk/

Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)

CATH - Protein Structure Classification
http://www.cathdb.info/

- CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
  - Class (C)
  - Architecture (A)
  - Topology (T)
  - Homologous superfamily (H)
- The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures, which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
- References
  - Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
  - Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
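The four levels are usually written as a dotted numeric code, one number per level (C.A.T.H). A minimal sketch of how such codes could be parsed and compared follows; the class and function names are hypothetical, and the codes in the usage lines are placeholders of the right shape, not a claim about specific superfamilies.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    cath_class: int     # C: Class
    architecture: int   # A: Architecture
    topology: int       # T: Topology
    superfamily: int    # H: Homologous superfamily


def parse_cath(code):
    """Split a dotted C.A.T.H code into its four numeric levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def shared_levels(code_a, code_b):
    """Number of leading levels (0 to 4) that two domains have in common."""
    shared = 0
    for level_a, level_b in zip(parse_cath(code_a), parse_cath(code_b)):
        if level_a != level_b:
            break
        shared += 1
    return shared


# Placeholder codes: same class, architecture and topology, different superfamily.
print(shared_levels("3.40.50.300", "3.40.50.720"))
```

Counting shared leading levels reflects the hierarchy: two domains that differ at the architecture level cannot belong to the same topology or superfamily.
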

InterPro (EBI - UK)
http://www.ebi.ac.uk/interpro/

InterPro (EBI - UK)
Antennapedia-like Homeobox (entry IPR001827)
The Gene Ontology (GO) database


Ontology definition

- Ontology: the part of metaphysics concerned with being as being, independently of its particular determinations.
  Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993

The "bio-ontologies"
- Answer to the problem of inconsistencies in the annotations
  - Controlled vocabulary
  - Hierarchical classification of the terms of the controlled vocabulary
- E.g.: The Gene Ontology
  - molecular function ontology
  - process ontology
  - cellular component ontology

Gene ontology: processes

Gene ontology: molecular functions

Gene ontology: cellular components

Gene Ontology Database
http://www.geneontology.org/
Example: methionine biosynthetic process

Status of GO annotations (NAR DB issue 2006)
- Term definitions
  - Biological process terms: 9,805
  - Molecular function terms: 7,076
  - Cellular component terms: 1,574
  - Sequence Ontology terms: 963
- Genomes with annotation: 30
  - Excludes annotations from UniProt, which represent 261 annotated proteomes.
- Annotated gene products
  - Total: 1,618,739
  - Electronic only: 1,460,632
  - Manually curated: 158,107

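The annotation counts above imply that only about one gene product in ten had been manually curated. A short check of that arithmetic, using only the three numbers from the slide (variable names are illustrative):

```python
# Counts copied from the status summary above.
total_annotated = 1_618_739
electronic_only = 1_460_632
manually_curated = 158_107

# The two categories add up to the printed total.
categories_sum = electronic_only + manually_curated

# Share of gene products backed by manual curation.
manual_fraction = manually_curated / total_annotated

print(categories_sum == total_annotated)
print(f"manually curated: {manual_fraction:.1%}")
```
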
QuickGO (http://www.ebi.ac.uk/QuickGO/)
- Web site: http://www.ebi.ac.uk/QuickGO/
- A user-friendly Web interface to the Gene Ontology.
- Graphical display of the hierarchical relationships between terms.
- Convenient browsing between classes.

Remarks on "bio-ontologies"
- Improvement compared to free text
  - controlled vocabulary (choice among synonyms)
  - hierarchical relationships between the concepts
- Nothing to do with the philosophical concept of ontology
  - A "bio-ontology" is usually nothing more than a taxonomic classification of the terms of a controlled vocabulary
- Multiple possible classification criteria
  - e.g. compartment subtypes (plasma membrane is a membrane)
  - e.g. compartment locations (nucleus is inside cytoplasm is inside plasma membrane)
- To be useful, an ontology should remain purpose-based
  - each biologist might wish to define his/her own classification based on his/her needs and scope of interest
  - impossible to define a unifying standard for all biologists
- No representation of molecular interactions
  - relationships between objects are only hierarchical, not horizontal or cyclic
  - e.g. does not describe which genes are the targets of a given transcription factor
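The two classification criteria given as examples can be stored as separate relation types over the same vocabulary. The sketch below is deliberately simplified: each term has at most one parent per relation and the relations are assumed to be acyclic, whereas a real ontology allows several parents. The names `IS_A`, `IS_INSIDE` and `ancestors` are illustrative.

```python
# child -> parent, one dictionary per classification criterion.
IS_A = {"plasma membrane": "membrane"}
IS_INSIDE = {"nucleus": "cytoplasm", "cytoplasm": "plasma membrane"}


def ancestors(term, relation):
    """Follow a single relation type upward and list every term reached.

    Assumes the relation is acyclic; a cycle would make this loop forever.
    """
    reached = []
    while term in relation:
        term = relation[term]
        reached.append(term)
    return reached


print(ancestors("nucleus", IS_INSIDE))
print(ancestors("plasma membrane", IS_A))
print(ancestors("nucleus", IS_A))
```

The same term gets different ancestors depending on the criterion chosen, and neither dictionary can express a non-hierarchical link such as a transcription factor regulating a target gene.
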

What is biological function?
- A general definition
  - Function: characteristic action (role) of an element (organ) within a whole (often opposed to structure). Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.
- Function and gene ontology
  - Understanding function requires establishing the link between a molecular activity and the context in which it takes place (the process).
  - Multifunctionality
    - The same activity can play different roles in different processes.
      - Example: the scute gene in Drosophila melanogaster encodes a transcription factor (activity) involved in sex determination, determination of neural precursors and Malpighian tubules (3 processes).
    - Multiple activities of the same protein in a given process
      - Example: the PutA protein in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
Small compounds, reactions and metabolic pathways

LIGAND - Small compounds and metabolic reactions
KEGG - Kyoto Encyclopedia of Genes and Genomes
EcoCyc, BioCyc and MetaCyc - Metabolic pathways

Protein interaction networks and transduction pathways

Microarray databases

Human genome resources

HapMap
http://www.hapmap.org/

- The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
- Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.

Issues for biomolecular databases

Issues for biological databases
- Dealing with biological complexity
- Data content
  - Coverage
  - Information content
- Data quality
  - Data structure
  - Consistency
- Query capabilities
- Interfaces
  - User interfaces
  - Programmatic interfaces
- Annotation
- Funding

Towards biological complexity
- The main databases currently available focus on one type of molecular entity: nucleic sequences, proteins, compounds, ...
- This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
- It becomes more difficult if we want to represent
  - the interactions between biological objects,
  - the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, ...),
  - complex concepts such as "biological function".

Data content
- Scope of the database
  - types of biological objects represented
- Number of entries
  - coverage of the current knowledge
- Information content
  - Level of detail in the description of the biological objects
- References to the source of information

Data quality
- Data consistency
  - always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
  - even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
  - spelling mistakes
- Data structure
  - distinct fields for distinct attributes of the biological objects
- Reliability
  - Evidence? Level of confidence?
  - Assignment of function by similarity
    - recursive process -> propagation of errors
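The recommendation to give each object one ID and make it retrievable by any synonym can be sketched as a small index that maps every normalized label to a single identifier. Everything in the example is hypothetical: the class name `Registry`, the identifier `OBJ0001` and the chosen synonyms are illustrations, not the scheme of an existing database.

```python
class Registry:
    """Maps every name or synonym of an object to one stable identifier."""

    def __init__(self):
        self._names = {}   # identifier -> primary name
        self._index = {}   # normalized label -> identifier

    @staticmethod
    def _normalize(label):
        return label.strip().lower()

    def add(self, object_id, name, synonyms=()):
        keys = [self._normalize(label) for label in (object_id, name, *synonyms)]
        # Refuse labels that already point to a different object, so one
        # name can never silently designate two objects.
        for key in keys:
            owner = self._index.get(key, object_id)
            if owner != object_id:
                raise ValueError(f"{key!r} already refers to {owner}")
        self._names[object_id] = name
        for key in keys:
            self._index[key] = object_id

    def lookup(self, label):
        """Return the identifier for any known label, or None."""
        return self._index.get(self._normalize(label))


registry = Registry()
registry.add("OBJ0001", "Pax-6", synonyms=["PAX6", "Paired box protein Pax-6"])
print(registry.lookup("pax6"), registry.lookup(" PAX-6 "), registry.lookup("eyeless"))
```

Rejecting a synonym that is already taken is one possible policy; a database could instead return every matching object and let the user disambiguate.
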

Query capabilities
- Browsing (click and read)
- Simple search
  - select records with some constraints
- More elaborate search
  - select specific fields of some records with constraints on some fields (~SQL SELECT)
- Complex querying
  - ability to return an answer that results from a "live" computation, and was not part of any record of the database
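The three levels of search can be illustrated with Python's built-in `sqlite3` module and a toy in-memory table. The table layout and the three records are invented for the example. The aggregate in the last query is only a modest stand-in for a "live" computation: it returns values that are stored in no record, but real services may run far heavier analyses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("P1", "toy kinase", "Escherichia coli", 310),
        ("P2", "toy transporter", "Saccharomyces cerevisiae", 552),
        ("P3", "toy regulator", "Escherichia coli", 198),
    ],
)

# Simple search: whole records that satisfy a constraint.
simple = conn.execute(
    "SELECT * FROM protein WHERE organism = ? ORDER BY id",
    ("Escherichia coli",),
).fetchall()

# More elaborate search: selected fields, with constraints on other fields.
elaborate = conn.execute(
    "SELECT name, length FROM protein WHERE organism = ? AND length > ? ORDER BY length",
    ("Escherichia coli", 200),
).fetchall()

# Computed answer: the mean length per organism is not part of any record.
computed = conn.execute(
    "SELECT organism, AVG(length) FROM protein GROUP BY organism ORDER BY organism"
).fetchall()

print(simple)
print(elaborate)
print(computed)
```
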

Interfaces
- User interfaces
  - user-friendly
  - convenient browsing
  - intuitive query forms
  - visualization (graphical output)
- Programmatic interfaces
  - communication with external programs:
    - other databases (concept of distributed database)
    - analysis tools
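A programmatic interface lets an analysis script pull records without a browser. The sketch below separates the network call from the parsing so that the parser can be used and tested offline. The URL template is an assumption about the present-day UniProt REST service and does not come from the slides; the FASTA text in the usage lines is a made-up record.

```python
from urllib.request import urlopen

# Assumed URL pattern (not from the slides); adjust it to the service in use.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_fasta(accession, timeout=10):
    """Download one entry over HTTP and parse it (requires network access)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Offline usage with a made-up record.
toy_text = ">toy1 made-up record\nMKT\nAAL\n>toy2\nGG\n"
toy_records = parse_fasta(toy_text)
print(toy_records)
```

Keeping the download in its own function means the same parser also works on files exported from other databases, which is the kind of tool-to-database communication the list above describes.
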

Annotation
- Problem
  - The flow of available data is increasing exponentially
- Strategies
  - internal curators
  - selected external experts
  - public submission
  - computer-based extraction of information from biological texts

Funding
- Public funding
  - Problem: it is easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
- Private funding
  - Industrial companies are
    - ready to invest in good data and good query capabilities
    - interested in academic expertise
- Solutions
  - All users pay (per query for example)
    - Note: academic users are funded by public money anyway
  - Hybrid solution
    - access is free for academic users, not for companies
    - companies can buy the whole database and install it in-house (+ add their own private data)
    - the academia-industry interface is often ensured by a spin-off company