mscslides - European Bioinformatics Institute (EBI) Home Page


Oct 2, 2013


EMBL Outstation


The European Bioinformatics Institute

The EMBL Database

Helen Parkinson

Nottingham University
2001


EBI, Wellcome Genome Campus, Hinxton, Cambridge, UK

EMBL Nucleotide Sequence Database


The European Bioinformatics Institute

Databases

www.ebi.ac.uk



EMBL Nucleotide Sequence Database
Protein Databases (SWISS-PROT & TREMBL)
Molecular Structure Database (EBI-MSD)
Radiation Hybrid Database (RhDB)
Immunogenetics Database (IMGT)
Ensembl
plus >70 additional specialized databases on EBI’s FTP server:
ftp://ftp.ebi.ac.uk/pub/databases




The European Bioinformatics Institute

Services and Research

www.ebi.ac.uk


SRS, sequence retrieval system


Research into complete genomes


Research into protein sequence analysis and structure prediction


Microarray research and new database


Industry Programme


http://www.ebi.ac.uk/Groups/index.html


The European Bioinformatics Institute

EMBL Database Curation Activities

www.ebi.ac.uk


Biological and bioinformatic support for users


Biological annotation and provision of accession numbers


Development of features and qualifiers for new datatypes


Updating and correction of entries - Cleanup

Development and testing of new database tools

Liaison with collaborating databases to maintain synchrony

International Nucleotide Sequence Databases

[Diagram: the three collaborating databases, DDBJ/EMBL/GenBank. GenBank is maintained at NCBI, the EMBL Nucleotide Sequence Database at EBI and the DNA Databank of Japan at NIG. Data sources shown: genome data, direct submissions, patent literature.]


Sequence Data from Patent Literature

[Diagram: patent offices (EPO, USPTO, JPO) and the databases receiving their sequence data (EMBL Nucleotide Sequence Database, GenBank, DNA Databank of Japan).]

Release 64 (Sep 2000)
entries: 207,677
bases: 67,411,887


Direct Submissions



mandatory submission policy
60-70% to be held confidential
3009-4000 direct subs/month
exponential growth in submissions

[Diagram: Researcher, Journal and Database Curator. Arrows: sequence submission, accession number, manuscript with accession number, publication.]


Direct Submissions Dataflow

[Diagram. Parties: Researcher, Journal, EMBL Curator. Components: Webin, EMBL-NEW, Database. Arrows: sequence submission, data submission, ID, Acc. #, manuscript, publication.]



WWW Submission System

www.ebi.ac.uk/submissions/webin.html

Vector Screening Service
Context-sensitive ‘Help’
Bulk
Alignments

1. Submitter details
2. Sequence and description
3. Source information
4. Citation information
5. Feature information (coding regions, signals, etc.)
6. Final validation and submission


Webin

Sequence Features


Central Page


SEQUIN Submission System


multi-platform (Mac/PC/Unix) stand-alone software tool
allows submissions to EMBL, GenBank and DDBJ

Available from EBI:
SEQUIN program
detailed downloading and installation instructions
plus general information
in ftp://ftp.ebi.ac.uk/pub/software/sequin/


Direct Submissions Dataflow

[Diagram: flow between the research institute and EMBL-EBI. Labels: submission (WWW, email, post); rejection (e.g. additional information required); accession number notification (e-mail); annotation; preview; correction; update; error report form; update request; EMBL database. Example accession numbers shown: Y98000 to Y98005. Example annotation shown: Gene=abcD, Product=enzyme X, Author(s)=A. Smith, Publication=Nature, Status=in press.]


Genome data acquisition




Submission through genome project accounts




Retrieval of unfinished sequence from ftp server




Exchange with DDBJ and GenBank

Peter Sterk, EMBL Hinxton Outstation, the European Bioinformatics Institute, March 1999


Genome Projects Dataflow

Peter Sterk, EMBL Hinxton Outstation, the European Bioinformatics Institute, October 1998


Human Draft Genome


Ensembl http://www.ensembl.org
provides automatic annotation of the human draft genome data
includes confirmed peptides & cDNAs, predicted peptides & repeats, map & SNPs

Genome MOT http://www.ebi.ac.uk/Databases/Genome_MOT/genome_mot.html
presents status of a number of large eukaryotic genome sequencing projects
provides access to individual EMBL database entries
updated daily

EMBL Release ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
draft sequence data included in EMBL Database HTG and HUM divisions




Monitoring the progress of major genome projects: the Genome MOT

Collaboration with Sanger Centre
Updated weekly
Data source: EMBL database
http://www.ebi.ac.uk/Databases/Genome_MOT/genome_mot.html
Curr. Opin. Biotechnol. 9:116-120 (1998)


Calculation of the Genome MOT


Finished sequences, present in EMBL database
Genomic DNA (no RNAs, cDNAs, ESTs, STSs)
H. sapiens, C. elegans, A. thaliana, M. musculus, S. pombe, broken down by chromosome
Redundancies taken into account
Cut-off: 1000 bp
Redundancies estimated with CLEANUP, Grillo et al., CABIOS 12:1-8 (1996)
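
The selection rules on this slide can be sketched in Python. This is only an illustration, not the Genome MOT implementation: the `Entry` record, its field names and the sample values are hypothetical, the 1000 bp cut-off is assumed to be a minimum entry length, and the redundancy estimate made with CLEANUP is not reproduced.

```python
from collections import defaultdict
from dataclasses import dataclass

CUTOFF_BP = 1000  # assumption: entries shorter than this are ignored


@dataclass
class Entry:
    """Hypothetical summary of one EMBL entry; field names are illustrative."""
    organism: str
    chromosome: str
    molecule: str      # e.g. "genomic DNA", "cDNA", "EST"
    finished: bool
    length_bp: int


def finished_bp_per_chromosome(entries):
    """Sum finished genomic DNA per (organism, chromosome).

    Overlaps between entries are NOT removed here; the slide says
    redundancies were estimated separately with CLEANUP.
    """
    totals = defaultdict(int)
    for e in entries:
        if e.finished and e.molecule == "genomic DNA" and e.length_bp >= CUTOFF_BP:
            totals[(e.organism, e.chromosome)] += e.length_bp
    return dict(totals)


# Made-up sample data, for illustration only.
sample = [
    Entry("H. sapiens", "22", "genomic DNA", True, 120_000),
    Entry("H. sapiens", "22", "genomic DNA", True, 800),      # below cut-off
    Entry("H. sapiens", "22", "cDNA", True, 2_500),           # not genomic DNA
    Entry("H. sapiens", "X", "genomic DNA", False, 90_000),   # unfinished
    Entry("C. elegans", "III", "genomic DNA", True, 40_000),
]
totals = finished_bp_per_chromosome(sample)
```

Only the first and last sample entries pass all three filters, so the totals contain one figure for H. sapiens chromosome 22 and one for C. elegans chromosome III.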




Sanger Centre Sequencing Projects



Human Genome Project: Chromosome 1, 6, 9, 10, 11, 13, 20, 22, X
Worm: Caenorhabditis elegans
Yeast: Schizosaccharomyces pombe, Candida albicans
Microbial Genomes: Bordetella parapertussis, Bordetella pertussis, Burkholderia pseudomallei, Campylobacter jejuni, Clostridium difficile, Corynebacterium diphtheriae, Mycobacterium bovis, Mycobacterium leprae, Mycobacterium tuberculosis, Neisseria meningitidis, Salmonella typhi, Staphylococcus aureus, Streptococcus pyogenes, Streptomyces coelicolor, Yersinia pestis
Protozoa: Dictyostelium discoideum, Leishmania major, Plasmodium falciparum, Trypanosoma brucei
Fly: Drosophila melanogaster
non-human Vertebrates: Mus musculus (mouse), Gallus gallus (chicken)










Selected other Genome Projects


Mouse chr. X, Oxford/HGMP-RC, UK


Arabidopsis thaliana (ESSA), European consortium/MIPS, Germany


Homo sapiens, GBF, Germany


European Drosophila Mapping Consortium, UK


Anopheles gambiae, Pasteur, France


Miscell. microbial genomes, Pasteur, France


Homo sapiens EST, MIPS, Germany


Mouse EST Project, GENOSCOPE, France


Homo sapiens EST, Padova, Italy


in progress: prokaryotic >100, eukaryotic >80

Homo sapiens


Data Management and Curation


Accession Numbers
X46455
AJ343321

Sequence Identifiers
nucleotide sequence identifier
Example: SV AJ400848.1
protein sequence identifier
Example: /protein_id="CAB88705.1"



Data confidentiality and release dates




Integration with external databases
Database cross-references (/db_xref) to:
TrEMBL, SWISS-PROT, MaizeDB, FLYBASE, IMGT, MENDEL, MGD, TRANSFAC, SGD, EPD
Total # of links: > 2.8 million
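
The identifier formats on this slide can be handled with a few lines of Python. The sketch below is illustrative only: the patterns are inferred from the examples in this deck (X46455, AJ343321, AJ400848.1, CAB88705.1 and /db_xref values of the form database:identifier), and the function names are hypothetical. Real accession formats may cover more cases.

```python
import re

# Patterns inferred only from the examples on the slides (assumption).
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")
VERSIONED_RE = re.compile(r"^([A-Z]+\d+)\.(\d+)$")


def is_accession(text):
    """True for 1 letter + 5 digits or 2 letters + 6 digits."""
    return ACCESSION_RE.match(text) is not None


def split_version(identifier):
    """Split 'AJ400848.1' into ('AJ400848', 1)."""
    m = VERSIONED_RE.match(identifier)
    if m is None:
        raise ValueError(f"not a versioned identifier: {identifier!r}")
    return m.group(1), int(m.group(2))


def split_db_xref(value):
    """Split a /db_xref value such as 'taxon:1404' at the first colon."""
    database, _, identifier = value.partition(":")
    return database, identifier


nucleotide_id = split_version("AJ400848.1")
protein_id = split_version("CAB88705.1")
xref = split_db_xref("taxon:1404")
```

Keeping the version as a separate integer makes it easy to tell an updated sequence from an unchanged one while the accession number stays stable.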













Interoperability

[Diagram: databases linked to the central sequence databases]

Central databases:
EMBL Nucleotide Sequence Database
SWISS-PROT + TrEMBL

Linked databases:
EPD: Eukaryotic Promoters
Flybase: D. melanogaster
SubtiList: B. subtilis
MaizeDB: Zea mays
WormPep: C. elegans
REBASE: Restriction enzymes
StyGene: Salmonella typhimurium
Transfac: Transcription factors
MSD: 3D Structures
ECDC: E. coli map
GCRDb: G-coupled Receptors
EcoGene: E. coli
SGD: Yeast
DictyDB: Dictyostelium discoideum
ENZYME: Enzyme Nomenclature
OMIM: Human
ECO2DBASE (2D)
Maize-2DPAGE (2D)
Aarhus/Ghent (2D)
YPD: Yeast
HSSP: 3D Similarities
Harefield (2D)
Prosite: Patterns & Profiles


Data Management and Curation



Hardware


VMS
UNIX: Digital UNIX, 2 Alphaservers 8400 (12/4 CPUs)
network of PCs

Relational Database Management System (ORACLE)

Database Schema facilitating integration and interoperability with other databases





Biological Data Curation



annotation of new submission data
quality control: sequence, FT, proofreading
creation of database entries
updates / corrections
curation of Genome Project Data
curation of data classes (e.g. immunoglobulins, TCR etc.)
classification of species in collaboration with taxonomy @ NCBI
production of annotation examples
development and testing of submission tools
writing submitter documentation
liaison with linked databases
liaison with genome projects



Database Entry Structure










description
taxonomic source information
submitter info
reference citation
biological features
sequence



Description / Source

ID   BMGLUCKIN  standard; DNA; PRO; 1362 BP.
XX
AC   AJ000005;
XX
SV   AJ000005.1
XX
DT   22-JUL-1997 (Rel. 52, Created)
DT   17-JAN-1998 (Rel. 54, Last updated, Version 2)
XX
DE   Bacillus megaterium glk gene
XX
KW   glk gene; glucose kinase.
XX
OS   Bacillus megaterium
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillaceae; Bacillus.
XX
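
Each line of the entry starts with a two-letter line code (ID, AC, SV, DT, DE, KW, OS, OC), with XX as a spacer. A minimal Python sketch of reading those codes follows. It assumes the column spacing shown in the embedded sample, which was lost in the slide text and is reconstructed here, and the function name is hypothetical.

```python
from collections import defaultdict

# Header of the example entry from the slide; spacing is an assumption.
ENTRY_HEADER = """\
ID   BMGLUCKIN  standard; DNA; PRO; 1362 BP.
XX
AC   AJ000005;
XX
SV   AJ000005.1
XX
DT   22-JUL-1997 (Rel. 52, Created)
DT   17-JAN-1998 (Rel. 54, Last updated, Version 2)
XX
DE   Bacillus megaterium glk gene
XX
KW   glk gene; glucose kinase.
XX
OS   Bacillus megaterium
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillaceae; Bacillus.
XX
"""


def group_by_line_code(text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not content:
            continue  # spacer or empty line
        fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(ENTRY_HEADER)
accession = fields["AC"][0].rstrip(";")
lineage = " ".join(fields["OC"])  # OC continues over several lines
```

Codes that repeat, such as DT and OC, are kept as lists so that continuation lines can be joined in order.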


Submitter Reference / Citation

XX
RN   [1]
RP   1-1362
RA   Spaeth C.;
RT   ;
RL   Submitted (01-JUL-1997) to the EMBL/GenBank/DDBJ
RL   databases.
RL   Spaeth C., Institut fuer Mikrobiologie, Biochemie und
RL   Genetik, Lehrstuhl fuer Mikrobiologie, Staudtstr. 5,
RL   91058 Erlangen, GERMANY.
XX
RN   [2]
RA   Spaeth C., Kraus A., Hillen W.;
RT   "Contribution of glucose kinase to glucose repression
RT   of xylose utilization in Bacillus megaterium";
RL   J. Bacteriol. 179:7603-7605(1997).
XX


Feature Table

FH   Key             Location/Qualifiers
FH
FT   source          1..1362
FT                   /organism="Bacillus megaterium"
FT                   /db_xref="taxon:1404"
FT                   /sequenced_mol="DNA"
FT   CDS             270..1244
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:O31392"
FT                   /evidence=EXPERIMENTAL
FT                   /transl_table=11
FT                   /gene="glk"
FT                   /product="glucose kinase"
FT                   /EC_number="2.7.1.2"
FT                   /protein_id="CAA03848.1"
FT                   /translation="MNMDDKWLVGVDLGGTTIKMAF...
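
Feature locations such as 1..1362 and 270..1244 are 1-based and inclusive. As a small worked check, the Python sketch below computes the span of the CDS and the number of codons it holds. It handles only the simple start..end form seen on this slide, and the function name is hypothetical.

```python
import re

LOCATION_RE = re.compile(r"^(\d+)\.\.(\d+)$")


def span_length(location):
    """Length in bases of a simple 'start..end' location (1-based, inclusive)."""
    m = LOCATION_RE.match(location)
    if m is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(m.group(1)), int(m.group(2))
    return end - start + 1


cds_bases = span_length("270..1244")
codons, remainder = divmod(cds_bases, 3)
```

The CDS spans 975 bases, i.e. 325 complete codons, which is consistent with the 324 AA protein in the TREMBL entry plus one stop codon.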

TREMBL entry

ID   O31392 PRELIMINARY; PRT; 324 AA.
AC   O31392;
DT   01-JAN-1998 (TREMBLREL. 05, CREATED)
DT   01-JAN-1998 (TREMBLREL. 05, LAST SEQUENCE UPDATE)
DT   01-NOV-1998 (TREMBLREL. 08, LAST ANNOTATION UPDATE)
DE   GLUCOSE KINASE (EC 2.7.1.2).
GN   GLK.
OS   BACILLUS MEGATERIUM.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
RN   [1]
RX   MEDLINE; 98053881.
RA   SPAETH C., KRAUS A., HILLEN W.;
RT   "Contribution of glucose kinase to glucose repression of xylose
RT   utilization in Bacillus megaterium.";
RL   J. Bacteriol. 179:7603-7605(1997).
DR   EMBL; AJ000005; CAA03848.1; -.
DR   PFAM; PF00480; ROK; 1.
DR   PROSITE; PS01125; ROK; 1.
KW   Transferase.
SQ   SEQUENCE 324 AA; 33899 MW; 665E9F19 CRC32;
     MNMDDKWLVG VDLGGTTIKM AFINHYGEII HKWEINTDVS EQGRKIPTDI AKAIDKKLND
     LGEVKSRLVG IGIGAPGPVN FANGSIEVAV NLGWEKFPIK DILEVETSLP VVVDNDANIA
     AIGEMWKGAG DGAKDLLCVT LGTGVGGGVI ANGEIVQGVN GAAGEIGHIT SIPEGGAPCN
     CGKTGCLETI ASATGIVRLT MEELTETDKP SELRTVLEQN GQVTSKDVFD AARSKDGLAM
     HVVDKVAFHL GLALANSANA LNPEKIVLGG GVSRAGEVLL APVRDYFKRF AFPRVAQGAE
     LAIATLGNDA GIIGGAWLVK SYFE
//
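
The DR lines are what link this protein entry back to the EMBL nucleotide entry and on to PFAM and PROSITE. A minimal Python sketch of splitting them into fields follows. The three lines are taken from the entry above with assumed spacing, and the dictionary keys are illustrative names rather than official field names.

```python
# DR lines from the TREMBL entry above; spacing is an assumption.
DR_LINES = """\
DR   EMBL; AJ000005; CAA03848.1; -.
DR   PFAM; PF00480; ROK; 1.
DR   PROSITE; PS01125; ROK; 1.
"""


def parse_dr_line(line):
    """Split one DR line into database, primary identifier and the rest."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip().rstrip(".")  # drop the line code and final full stop
    parts = [p.strip() for p in body.split(";")]
    return {"database": parts[0], "primary": parts[1], "extra": parts[2:]}


xrefs = [parse_dr_line(line) for line in DR_LINES.splitlines()]
databases = [x["database"] for x in xrefs]
```

For the EMBL cross-reference the primary identifier is the accession number AJ000005 and the first extra field is the protein identifier CAA03848.1, matching the /protein_id qualifier in the feature table.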


Sequence

SQ   Sequence 1362 BP; 446 A; 211 C; 325 G; 380 T; 0 other;
     TGACACTTTG AGTTATCTCA AAATAAATGA ATCATATCAC CTAAAAAATA GGATAAAGGT        60
     GACAGAATGA ATACGTTATA TGATGTACAA CAATTATTAA AGTCCTTCGG CATTTTTATA       120
     TACGTGGGCG ATCGTATTGC TGATTTAGAG CTGATGGAAG CGGAAGTAAA AGAGTTATAT       180
     CAGTCTAACT TGATTGATGT ACGTGATTAC CAAATGGCAA TTCTTTTGCT TCGTCGAGAG       240
     TTAAAACAAC AAAAAGAGAA AAAGGATGAA TGAACATGGA TGACAAATGG TTAGTTGGAG       300
     TTGATTTAGG CGGTACAACA ATAAAAATGG CCTTTATTAA TCATTATGGA GAAATCATTC       360
     ACAAGTGGGA AATTAATACG GATGTGAGCG AGCAAGGCCG TAAAATTCCA ACGGATATTG       420
     CAAAAGCAAT TGATAAAAAG TTAAACGATC TTGGAGAAGT AAAATCAAGG TTAGTAGGAA       480
     TTGGCATTGG TGCACCGGGG CCTGTCAACT TTGCAAACGG TTCGATTGAA GTAGCTGTCA       540
     ATTTAGGTTG GGAAAAATTC CCTATAAAAG ATATCTTGGA AGTAGAAACT TCTCTTCCTG       600
     TTGTAGTAGA CAATGATGCA AACATTGCAG CGATTGGAGA AATGTGGAAG GGTGCTGGAG       660
     ACGGAGCAAA AGATTTACTT TGCGTTACGC TTGGCACAGG CGTTGGCGGT GGCGTCATTG       720
     CAAACGGTGA AATTGTACAA GGCGTAAATG GAGCCGCTGG TGAGATCGGG CACATTACTT       780
     CTATTCCTGA AGGCGGGGCA CCGTGTAACT GCGGTAAAAC CGGCTGTTTA GAAACCATTG       840
     CTTCAGCAAC TGGAATTGTA CGTTTAACAA TGGAAGAATT AACGGAAACG GACAAACCAA       900
     GTGAGCTTCG CACAGTGTTA GAACAAAATG GACAAGTTAC ATCTAAAGAT GTATTTGATG       960
     CAGCTCGTTC AAAAGACGGG TTAGCTATGC ATGTTGTAGA TAAAGTTGCT TTTCATTTAG      1020
     GTCTAGCACT AGCAAACTCT GCTAATGCAT TAAACCCTGA GAAGATCGTT CTAGGCGGCG      1080
     GTGTGTCTCG TGCAGGCGAG GTATTACTTG CACCGGTAAG AGATTATTTC AAACGTTTTG      1140
     CATTTCCTCG CGTAGCGCAA GGTGCTGAAC TAGCAATCGC TACTTTAGGA AACGATGCGG      1200
     GAATTATTGG AGGAGCTTGG TTAGTTAAAT CTTATTTTGA ATAATAAGCA AGAATCTAAC      1260
     TGAGATAAAA AAGCGCTTTG ACATTTAGTC AAAGCGCTTT TTTATCATGC ATCTTTTCAA      1320
     TCTTTACATA TACATAGTGT AAAGGAGTGA AGATTATGCA AA                         1362
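
The SQ line declares the total length and the count of each base, which gives a cheap consistency check during quality control. The Python sketch below parses the SQ line from this slide and shows how bases on a sequence line can be counted while ignoring spaces and position numbers. Function names are hypothetical and the line spacing is assumed.

```python
import re
from collections import Counter

SQ_LINE = "SQ   Sequence 1362 BP; 446 A; 211 C; 325 G; 380 T; 0 other;"
LAST_SEQUENCE_LINE = "     TCTTTACATA TACATAGTGT AAAGGAGTGA AGATTATGCA AA      1362"


def parse_sq_line(line):
    """Return (declared length, declared per-base counts) from an SQ line."""
    total = int(re.search(r"(\d+) BP;", line).group(1))
    counts = {label: int(n)
              for n, label in re.findall(r"(\d+) (A|C|G|T|other);", line)}
    return total, counts


def count_bases(sequence_lines):
    """Count letters on sequence lines, ignoring spaces and position numbers."""
    counts = Counter()
    for line in sequence_lines:
        counts.update(re.sub(r"[^A-Za-z]", "", line).upper())
    return counts


total, declared = parse_sq_line(SQ_LINE)
header_is_consistent = sum(declared.values()) == total
bases_on_last_line = sum(count_bases([LAST_SEQUENCE_LINE]).values())
```

Passing all sequence lines of an entry to `count_bases` and comparing the result with `declared` would extend the check from the header to the sequence itself.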


Database Divisions


Data Distribution



GenBank, DDBJ



EMBnet nodes



other mirrors



Data Access

Network services
www, e-mail, ftp
Access to the most up-to-date data collection via Internet and WWW

Sequence Retrieval System (SRS)
Network Browser for Databanks in Molecular Biology integrating and linking the main nucleotide and protein databases plus many specialized databases

CD-ROM
Database releases are produced quarterly and distributed on CD-ROM.




Accessing Genome Data


Completed Genomes Webserver: http://www.ebi.ac.uk/genomes/
High-Throughput Genome Sequences (HTG phases 0-3): ftp://ftp.ebi.ac.uk/pub/databases/embl/release
Ensembl: http://www.ensembl.org/
Genome MOT: http://www.ebi.ac.uk/Databases/Genome_MOT/
CON(struct) division: ftp://ftp.ebi.ac.uk/pub/databases/genomes




Database searching




fasta, blast, blitz
For sequence similarity searching, a variety of tools are available for external users to compare their own sequences against the most current data in the EMBL Nucleotide Sequence Database and SWISS-PROT.

Sequence Analysis
clustalw: multiple sequence alignment and inference of phylogenies
genemark: gene prediction





Uses of nucleotide and derivative protein databases



discovery of novel genes


identification of homologous genes and additional members of gene families


analysis of alternative splicing


chromosomal localisation of genes


detection of polymorphisms (SNPs)



comparative genomics


molecular evolution


comparisons of human/mouse DNA to identify genes unique to one or more complex organisms


humans/fruit fly/nematodes to identify genes essential for all multicellular organisms


human/yeast DNA to identify genes related to functions essential for all eukaryotic cells




regulation of gene expression


role of vast majority of DNA ("junk" DNA)

e.g. introns with role in the function of transfer RNA critical to protein synthesis

twintrons etc.


multiple genes (exact number of genes/interaction ?)



Release Production


Database Growth


status: 12-OCT-2000
EMBL nucleotides: 10,290,670,274
EMBL entries: 9,113,333
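
As a quick worked figure from these two totals, the mean entry length can be computed directly. The Python lines below only restate the arithmetic. The variable names are illustrative.

```python
# Totals from the slide (status: 12-OCT-2000).
nucleotides = 10_290_670_274
entries = 9_113_333

# Mean number of nucleotides per database entry.
mean_entry_length = nucleotides / entries
```

The mean comes out at a little over 1.1 kb per entry, a reminder that the database at this date held very many short records alongside the long genomic ones.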



Acknowledgements

EBI: EMBL database staff, biologists and programmers

Collaborators: DDBJ, GenBank, Sanger Centre, EPO, plus many other projects and databases




Peter Sterk, EMBL Hinxton Outstation, the European Bioinformatics Institute, October 1998