Databases in bioinformatics - ICGEB


Oct 2, 2013


10/3/2013 1:47 AM

Introduction to Bioinformatics databases: Nucleic Acid Databases





Dinesh Gupta



ICGEB


Biological databases: why?


The need to store and communicate large datasets has grown.

To make biological data available to scientists.

To make biological data available in computer-readable form.




Different classifications of
databases


Type of data


nucleotide sequences


protein sequences


protein sequence patterns or motifs


macromolecular 3D structure


gene expression data


metabolic pathways



Different classifications of databases….


Primary or derived databases


Primary databases: experimental results are deposited directly into the database


Secondary databases: results of analysis of
primary databases


Aggregate of many databases


Links to other data items


Combination of data


Consolidation of data



Different classifications of databases….


Technical design


Flat files


Relational database (SQL)


Exchange/publication technologies (FTP,
HTML, CORBA, XML,...)
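
To make the difference between the first two designs concrete, here is a minimal sketch in Python. It assumes a simplified flat-file layout (a two-letter line code, the value from column 6 onward, and `//` closing each record) that only loosely resembles the real EMBL or GenBank formats. The records, the table name `entry`, and its columns are hypothetical and carry no real data.

```python
import sqlite3

# Hypothetical flat file: two-letter line codes, '//' ends a record.
FLAT_FILE = """\
ID   DEMO0001
DE   Illustrative record, not real data
SQ   acgtacgtac
//
ID   DEMO0002
DE   Second illustrative record
SQ   ttgacc
//
"""


def parse_flat_file(text):
    """Split flat-file text into records, one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record: store it and start a new one.
            if current:
                records.append(current)
            current = {}
        elif line.strip():
            # Line code in columns 1-2, value from column 6 onward.
            current[line[:2]] = line[5:].strip()
    return records


def load_into_sqlite(records):
    """Store the parsed records in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?)",
        [(r["ID"], r["DE"], r["SQ"]) for r in records],
    )
    return conn


records = parse_flat_file(FLAT_FILE)
conn = load_into_sqlite(records)

# Once the data is relational, questions become SQL queries.
rows = conn.execute(
    "SELECT id, length(sequence) FROM entry ORDER BY id"
).fetchall()
print(rows)  # [('DEMO0001', 10), ('DEMO0002', 6)]
```

The flat file is easy to read and exchange but has to be re-parsed for every question, whereas the relational table answers new questions with a query.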



Different classifications of databases….


Availability


Publicly available, no restrictions


Available, but with copyright


Accessible, but not downloadable


Academic, but not freely available


Proprietary, commercial; possibly free for
academics



Where do I find a database of interest?



http://www3.oup.co.uk/nar/database/c/



Nucleotide sequence databases



EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases



EMBL
www.ebi.ac.uk/embl/



GenBank
www.ncbi.nlm.nih.gov/Genbank/



DDBJ
www.ddbj.nig.ac.jp



GenBank


An annotated collection of all publicly available nucleotide and protein sequences



Set up in 1979 at LANL (Los Alamos).



Maintained since 1992 by NCBI (Bethesda).



http://www.ncbi.nlm.nih.gov



EMBL Nucleotide Sequence Database



An annotated collection of all publicly available nucleotide and protein sequences



Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.



Maintained since 1994 by EBI, Cambridge.



http://www.ebi.ac.uk/embl.html



http://www3.ebi.ac.uk/Services/DBStats/



DDBJ

DNA Data Bank of Japan


An annotated collection of all publicly available
nucleotide and protein sequences



Started in 1984 at the National Institute of Genetics (NIG) in Mishima.



Still maintained at this institute by a team led by Takashi Gojobori.



http://www.ddbj.nig.ac.jp



Other NCBI nucleic acids DBs


EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).


GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.


HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.


HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.


SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.


RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support data-gathering efforts.


STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.


UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).


UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene, annotated with mapping and expression information and cross-references to other sources.


Sequence submission


Data come mainly from direct submissions by the authors.


Submissions through the Internet:


Web forms.


Email.


Sequences shared/exchanged between
the 3 centers on a daily basis:


The sequence content of the banks is
identical.
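
One simple way to check that claim for a single entry is to compare the sequences after stripping layout differences. The sketch below assumes that the three copies differ only in case, spacing, and position numbers. The three strings are hypothetical stand-ins for the same entry as retrieved from each bank, not real records.

```python
import hashlib


def normalize(sequence):
    """Keep only the residue letters, lowercased; drop spaces and digits."""
    return "".join(ch for ch in sequence.lower() if ch.isalpha())


def sequence_digest(sequence):
    """Return a checksum of the normalized sequence."""
    return hashlib.sha256(normalize(sequence).encode("utf-8")).hexdigest()


def same_sequence(*sequences):
    """True if every copy has the same checksum after normalization."""
    return len({sequence_digest(s) for s in sequences}) == 1


# Hypothetical copies of one entry, laid out three different ways.
copy_a = "acgtacgtac gtacgt"
copy_b = "ACGTACGTACGTACGT"
copy_c = "        1 acgtacgtac gtacgt"

print(same_sequence(copy_a, copy_b, copy_c))  # True
```

Only the sequence is compared here; annotation and record layout can still differ between the banks, which is what the exercises at the end ask you to look at.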


Derived databases


CUTG: Codon usage tabulated from GenBank
http://www.kazusa.or.jp/codon/


Genetic Codes: Deviations from the standard genetic code in various organisms and organelles
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c


TIGR Gene Indices: Organism-specific databases of EST and gene sequences
http://www.tigr.org/tdb/tgi.shtml


UniGene: Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/


ASAP: Alternative spliced isoforms
http://www.bioinformatics.ucla.edu/ASAP


Intronerator: Introns and alternative splicing in C. elegans and C. briggsae
http://www.cse.ucsc.edu/~kent/intronerator/
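
To illustrate what "derived" means, here is a minimal sketch of the kind of tabulation behind a codon usage table such as CUTG. It is not CUTG's actual pipeline. It assumes the input is a coding sequence already in frame from its first base, and the sequence `demo_cds` is made up for the example. Reporting frequencies per thousand codons is one common convention and is used here as an assumption.

```python
from collections import Counter


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA letters
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def per_thousand(counts):
    """Convert raw codon counts to frequencies per thousand codons."""
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


# Made-up coding sequence: ATG GCT GCT AAA TAA
demo_cds = "ATGGCTGCTAAATAA"

counts = codon_usage(demo_cds)
freqs = per_thousand(counts)
print(counts["GCT"])  # 2
print(freqs["GCT"])   # 400.0
```

A derived database applies this kind of analysis to every coding sequence in a primary database and publishes the totals per organism.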


Nucleic acid structure databases


NDB: Nucleic acid-containing structures
http://ndbserver.rutgers.edu/


NTDB: Thermodynamic data for nucleic acids
http://ntdb.chem.cuhk.edu.hk/


RNABase: RNA-containing structures from PDB and NDB
http://www.rnabase.org/


SCOR: Structural classification of RNA: RNA motifs by structure, function and tertiary interactions
http://scor.lbl.gov/



Database searching tips


Look for links to Help or Examples


Try Boolean searches


Be careful with UK/US spelling differences


leukaemia vs leukemia


haemoglobin vs hemoglobin


colour vs color
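
The last two tips can be combined: OR together the spelling variants of a term, then AND the terms. The sketch below builds such a query string. The `AND`/`OR` syntax with parentheses is a common convention but an assumption here, since each database defines its own query syntax (check its Help page). The variant table holds only the three pairs listed above.

```python
# UK spelling -> US spelling, from the examples above.
SPELLING_VARIANTS = {
    "leukaemia": "leukemia",
    "haemoglobin": "hemoglobin",
    "colour": "color",
}


def expand_term(term):
    """Return the term plus its UK/US counterpart, if one is known."""
    lower = term.lower()
    variants = [lower]
    for uk, us in SPELLING_VARIANTS.items():
        if lower == uk:
            variants.append(us)
        elif lower == us:
            variants.append(uk)
    return variants


def build_query(terms):
    """AND the terms together; OR the spelling variants of each term."""
    groups = []
    for term in terms:
        variants = expand_term(term)
        if len(variants) > 1:
            groups.append("(" + " OR ".join(variants) + ")")
        else:
            groups.append(variants[0])
    return " AND ".join(groups)


query = build_query(["haemoglobin", "human"])
print(query)  # (haemoglobin OR hemoglobin) AND human
```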



Exercises


Study the statistics of the three primary nucleic acid databases: do they match?


Look for a gene of your interest in the three primary nucleic acid databases: compare the information given in each one of them.


Read the NAR DB paper and the NAR DB index site: search for different nucleic acid databases based on different search terms.


Self study:


http://www3.oup.co.uk/nar/database/c/


Download NAR database paper (NARDB2004) from:
ftp://cbag.sc.mahidol.ac.th/pub/Course_Materials/dinesh