Slide 1 - BiGCaT - Data Server

disturbedtonganese, Biotechnology

Oct 2, 2013

Genome Related Biological Databases

Content

  DNA Sequence databases
  Protein databases
  Gene prediction
  Accession numbers
  NCBI website
  Ensembl website


Nucleotide databases

  GenBank: housed at NCBI (National Center for Biotechnology Information), www.ncbi.nlm.nih.gov/Genbank/
  EMBL: housed at EBI (European Bioinformatics Institute), www.ebi.ac.uk/embl/
  DDBJ: housed in Japan, www.ddbj.nig.ac.jp/Welcome-e.html

The underlying raw DNA sequences are identical

>100,000 species are represented in GenBank

  all species   196,538
  viruses         5,214
  bacteria       14,258
  archaea           500
  eukaryota     171,843

NCBI nucleotide databases

GenBank
  Individual submissions
  Bulk submissions (Genome centers)
  High throughput sequencing (DNA)
  Expressed Sequence Tags (mRNA)

RefSeq
  Curated subset of GenBank
  “Reference” sequence
  Single sequence per locus / molecule

Protein databases

NCBI
  RefSeq and Protein

EBI
  Swiss-Prot, PIR and TrEMBL → UniProt

Slide labels: Translated from nucleotide sequence; Curated; Combined
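The "translated from nucleotide sequence" label can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a coding sequence that already starts in frame; the function name `translate` is hypothetical, and this is only an illustration of the idea, not how any database pipeline actually produces its protein records.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate an in-frame DNA coding sequence; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, usable, 3)
    )

print(translate("ATGGCCTAA"))  # prints MA*
```

A curated entry would additionally be checked by a person; the sketch covers only the mechanical translation step.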

UniProt versus GenBank and RefSeq

UniProt
  Produced by SIB, EBI & Georgetown U.
  Protein data only
  Curated in Swiss-Prot, not in TrEMBL

GenBank / RefSeq
  Produced by INSDC and NCBI
  Protein and nucleotide data
  Curated in RefSeq, not in GenBank

Accession numbers

Label to unambiguously identify a sequence

Examples (all for retinol-binding protein, RBP4), grouped on the slide as DNA, RNA and protein:

  X02775       GenBank genomic DNA sequence
  NT_030059    Genomic contig
  rs7079946    dbSNP (single nucleotide polymorphism)
  RBP4         HUGO gene name
  N91759.1     An expressed sequence tag (1 of 170)
  NM_006744    RefSeq DNA sequence (from a transcript)
  NP_007635    RefSeq protein
  AAC02945     GenBank protein
  Q28369       UniProt protein
  1KT7         Protein Data Bank structure record
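The shape of an accession number often hints at the database it comes from. The following Python sketch guesses the source of the example identifiers above from their format alone. The patterns are deliberately simplified assumptions (real accession formats have more variants), the function name `classify_accession` is hypothetical, and a gene symbol such as RBP4 is not an accession, so it falls through to "unknown".

```python
import re

# Simplified, illustrative patterns; checked in order, first match wins.
PATTERNS = [
    ("RefSeq transcript", re.compile(r"NM_\d+(\.\d+)?")),
    ("RefSeq protein", re.compile(r"NP_\d+(\.\d+)?")),
    ("RefSeq genomic contig", re.compile(r"NT_\d+(\.\d+)?")),
    ("dbSNP", re.compile(r"rs\d+")),
    ("PDB structure", re.compile(r"\d[A-Za-z0-9]{3}")),
    # UniProt is tested before GenBank nucleotide because both can be
    # one letter followed by five characters.
    ("UniProt protein", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("GenBank protein", re.compile(r"[A-Z]{3}\d{5}(\.\d+)?")),
    ("GenBank nucleotide", re.compile(r"([A-Z]\d{5}|[A-Z]{2}\d{6})(\.\d+)?")),
]

def classify_accession(identifier):
    """Return a best-guess source database for an identifier, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"

for acc in ["X02775", "NT_030059", "rs7079946", "RBP4", "N91759.1",
            "NM_006744", "NP_007635", "AAC02945", "Q28369", "1KT7"]:
    print(acc, "->", classify_accession(acc))
```

Format-based guessing is only a heuristic; the authoritative answer is always a lookup in the database itself.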

From Sequence to Genes

Gene prediction

Extrinsic
  Search for genes based on observed mRNA / protein sequences
  UniGene

Ab initio
  Predict genes based on genomic sequence alone
  Promoter sequence
  Poly(A) tail binding sites, CpG islands, splicing sites
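As one example of an ab initio signal, a CpG island can be scored from the genomic sequence alone. The Python sketch below computes the GC fraction and the observed/expected CpG ratio for a window of sequence. The thresholds (length at least 200 bp, GC fraction above 0.5, ratio above 0.6) are the commonly cited Gardiner-Garden and Frommer criteria and are an assumption here, since the slides give no numbers; the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0  # expected CpGs if C and G were independent
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio

def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, GC and observed/expected thresholds."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_length and gc_fraction > min_gc and ratio > min_ratio

print(looks_like_cpg_island("CCGG" * 50))
print(looks_like_cpg_island("AT" * 100))
```

A real gene finder combines several such signals (promoters, splice sites, coding statistics) rather than relying on any single one.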

UniGene

Predict genes based on ESTs

EST (expressed sequence tag):
  DNA sequence corresponding to mRNA from an expressed gene
  ~500 base pairs long
  Sequenced from a cDNA library

Cluster ESTs from many cDNA libraries to predict distinct genes
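The clustering idea can be sketched in Python as follows. This is a toy stand-in, not the UniGene procedure: it links two ESTs whenever they share an exact k-mer and merges linked ESTs with a union-find structure, whereas a real pipeline compares sequences by alignment and also handles reverse complements and sequencing errors. The names `cluster_ests` and `k` are hypothetical.

```python
from collections import defaultdict

def cluster_ests(ests, k=12):
    """Group ESTs that share at least one exact k-mer (single linkage)."""
    parent = {name: name for name in ests}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # k-mer -> name of the first EST that contained it
    for name, seq in ests.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], name)
            else:
                first_seen[kmer] = name

    clusters = defaultdict(list)
    for name in ests:
        clusters[find(name)].append(name)
    return sorted(clusters.values(), key=len, reverse=True)

toy_ests = {
    "est1": "ACGTACGGTTCA",
    "est2": "GGTTCAGGATCC",  # overlaps est1 on GGTTCA
    "est3": "TTTTTTTTAAAA",  # shares nothing with the others
}
for cluster in cluster_ests(toy_ests, k=6):
    print(len(cluster), cluster)
```

Each resulting cluster stands for one predicted gene, and the number of ESTs in it is the cluster size.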




EST clusters

A gene with 1 EST associated has cluster size 1; a gene with 10 ESTs associated has cluster size 10.

[Figure: UniGene clusters, number of clusters plotted against cluster size. Callout: "Likely to be real genes". Bar values in slide order: 40986, 18424, 17855, 13411, 8288, 5332, 4607, 4075, 4052, 3958, 1902, 710, 210, 57, 17, 6, 1.]
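A cluster-size distribution of this kind can be tallied with a few lines of Python. The sketch assumes clusters are given as lists of EST names; the function name `cluster_size_histogram` and the example data are hypothetical and are not the UniGene figures.

```python
from collections import Counter

def cluster_size_histogram(clusters):
    """Map each cluster size to the number of clusters of that size."""
    counts = Counter(len(cluster) for cluster in clusters)
    return dict(sorted(counts.items()))

example_clusters = [["a"], ["b"], ["c", "d"], ["e", "f", "g"]]
print(cluster_size_histogram(example_clusters))  # prints {1: 2, 2: 1, 3: 1}
```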

Gene databases

Ensembl (EBI)
  Automatic annotation: mRNA and protein sequence
  Curated annotation: Vega project

Entrez Gene (NCBI)
  Links RefSeq sequences to external annotations

Web sites for biological databases

  NCBI: www.ncbi.nlm.nih.gov
  EBI: www.ebi.ac.uk
  ENSEMBL: www.ensembl.org (= at EBI)
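Besides browsing these sites, records can be requested programmatically. As a hedged illustration, the Python sketch below only builds a request URL for the NCBI E-utilities `efetch` service, which is not mentioned in the slides; the helper name `efetch_fasta_url` is hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch endpoint.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_fasta_url(accession, db="nuccore"):
    """Build an efetch URL asking for one record in FASTA text format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"

print(efetch_fasta_url("NM_006744"))
print(efetch_fasta_url("NP_007635", db="protein"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; NCBI's usage guidelines apply to automated access.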

NCBI website

PubMed


Ensembl website

Ensembl structure


Gene: ENSG…


Transcript: ENST…


Protein: ENSP…
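The prefix convention can be checked mechanically. The Python sketch below reads the feature type from a stable identifier; it assumes human-style identifiers, that is "ENS", one feature letter, 11 digits and an optional version suffix, so identifiers for other species (which carry an extra species code) are not recognised. The function name `ensembl_feature_type` is hypothetical.

```python
import re

FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}

# Assumption: human-style stable IDs, "ENS" + feature letter + 11 digits,
# optionally followed by ".version".
ENSEMBL_ID = re.compile(r"ENS([GTP])(\d{11})(?:\.\d+)?")

def ensembl_feature_type(stable_id):
    """Return 'gene', 'transcript' or 'protein', or None if the ID does not match."""
    match = ENSEMBL_ID.fullmatch(stable_id)
    return FEATURE_TYPES[match.group(1)] if match else None

print(ensembl_feature_type("ENSG00000139618"))  # prints gene
print(ensembl_feature_type("NM_006744"))        # prints None
```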

Ensembl search

OTTHUMGXXX (Curated)

ENSGXXXX (Predicted)

Vega gene page

Ensembl gene page

Ensembl transcript page

Ensembl protein page