2nd Korea-Japan Bioinformatics Training Course Bio-Databases ...


APAN e-Science Workshop

e-Bio System for Bio-Knowledge Discovery

2003.8.27

Sangsoo Kim

National Genome Information Center

Korea Research Institute of Bioscience and Biotechnology

Bio-Databases & Servers


Contents


Bibliographic (journal abstracts such as Medline)


Experimental data (Sequences or structures)


Results from annotation and analyses


Bioinformatic analysis tools


Purpose


Storing & managing raw data


Querying for knowledge discovery


Sharing information with others


Serving others with online analysis

New Role of Databases


New discoveries of biological knowledge are published in scientific journals


But journal space is limited and not suited to publishing large amounts of high-throughput data


The supplementary information is provided on an accompanying website


Readers can download the supplementary information and analyze it from different angles


Combining it with other information may yield unexpected results


Journal publishers require supplementary information to be deposited in public archives

Example: Nucleotide Sequence Repositories


Nucleotide sequences discovered by sequencing experiments are deposited in any one of the public archives, and the journal paper lists only the accession numbers (without deposition, you cannot publish a sequence discovery in a journal)


Public archives are


DDBJ operated by CIB, NIG in Japan


EMBL operated by EMBL-EBI in UK


GenBank operated by NCBI, NIH in USA


The contents of these archives are exchanged daily and are freely accessible to everybody


Now extended to archive DNA chip data as well
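The role of the accession number as the key to a deposited record can be sketched as follows. This is only an illustration: it assumes NCBI's E-utilities "efetch" endpoint and its `db`, `id`, `rettype` and `retmode` parameters, none of which are named in the slides, and the helper name `efetch_url` is hypothetical. It builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities "efetch" endpoint (not mentioned in the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta"):
    """Build a URL requesting one nucleotide record by accession number.

    rettype="fasta" asks for FASTA text, rettype="gb" for the GenBank flat file.
    """
    params = {
        "db": "nuccore",      # nucleotide collection
        "id": accession,      # accession number as printed in the paper
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # M10154 is one of the accession numbers that appears later in this deck.
    print(efetch_url("M10154"))
    print(efetch_url("M10154", rettype="gb"))
```

Because the archives exchange their contents daily, the same accession number should resolve at any of the three sites; only the access URL differs.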

Growth of GenBank

A Nucleotide Sequence Repository

Human Genome Project

RTFM

Entrez: Home Page

GenBank as HTML

Entrez: Display

FASTA as HTML

Example: BLAST Servers


Originally developed to compare my sequence with those in the repository, to check whether mine is novel


Extended to detect distantly related sequences, serving as the major sequence annotation tool


Servers accept various kinds of queries and return alignment results over the WWW


The most widely used bioinformatic tool


For the analysis of many sequences, it is better to use a local installation

http://www.ncbi.nlm.nih.gov/BLAST

program    query       database

blastn     dna         dna
blastp     protein     protein
blastx     dna (6x)    protein
tblastn    protein     dna (6x)
tblastx    dna (6x)    dna (6x)
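The "(6x)" entries in the table mean that the DNA is translated in all six reading frames (three on each strand) before being compared at the protein level. A minimal sketch of that translation step, assuming the standard genetic code and using hypothetical function names, might look like this; it is not how BLAST itself is implemented.

```python
# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translate("ATGGCCTAA").items()):
        print(frame, protein)
```

In these terms, blastx translates the query six ways, tblastn translates the database, and tblastx translates both, which is why tblastx is the most expensive of the five programs.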

RTFM

BLAST (Basic Local Alignment Search Tool)

Descriptions

Alignments

BLASTN (Cont'd)

Example: Derived Databases


Swiss-Prot & PIR


Proteins are predicted from deposited nucleotide sequences, either mRNA or genomic DNA


Functions and features of the proteins are annotated manually by experts


Protein motifs


PROSITE, Pfam, BLOCKS, InterPro


Keyword querying and motif detection in the user's sequence


Gene Ontology


Hierarchical organization of biological terms


Cataloging associated gene products

Expert Protein Analysis System

ExPASy (http://www.expasy.ch)

NiceProt View

Gene Ontology


Systematic classification of biological terminology


Molecular function


Biological process


Cellular component


Controlled vocabulary


Associated GENE list
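The combination of a term hierarchy and per-term gene lists can be sketched with a small directed graph. The term names below are a simplified toy hierarchy and the gene names (`geneA`, `geneB`, `geneC`) are hypothetical; real GO identifiers and relationships are not reproduced here.

```python
# Toy "is_a" hierarchy: child term -> list of parent terms.
# A term may have more than one parent, so this is a graph, not a tree.
IS_A = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "DNA binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical gene products annotated with their most specific terms.
ANNOTATIONS = {
    "geneA": ["kinase activity"],
    "geneB": ["DNA binding"],
    "geneC": ["catalytic activity"],
}


def ancestors(term):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_for(term):
    """List genes annotated with `term` itself or with any of its descendants."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


if __name__ == "__main__":
    print(genes_for("catalytic activity"))
    print(genes_for("molecular function"))
```

Querying a general term returns the genes annotated with any more specific term beneath it, which is the practical benefit of a controlled, hierarchical vocabulary over free-text keywords.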

Data Mining


Objective:


Discovery of (biological) knowledge by querying information in the databases and comprehending it


Problems:


Too many databases


Different protocols for access


Lack of standards


Poor quality or propagation of errors


Solutions:


Data warehousing or federated databases

Catalog of Bio-DBs

arranged by Data Domain

Database of Databases


Data warehousing


Collect all databases by mirroring


Store in a unified format


Entrez (NCBI) or SRS (EBI)


Powerful but heavy maintenance load


Federated databases


Maintained by participating members


Accessed by common protocols


Bio-DAS or Web Services via SOAP/XML


Next-generation technology, but dependent on both the cooperation of members and Internet bandwidth

www.ngic.re.kr

www.ncbi.nih.gov/LocusLink

New Data Types


Textual


Nucleotide or amino acid sequences


Associated feature annotation


Bibliographical texts


Numeric


Gene expression profiles


Results from statistical analysis


Graphical


Protein-protein interaction network


Genetic network


Biochemical reaction pathways

Building a Nation from a Land of City States

Lincoln D. Stein

Cold Spring Harbor Laboratory

Italy in the Middle Ages

Bioinformatics, ca. 2002

Bioinformatics

In the XXI Century

Making Easy Things Hard

Give me all human sequences submitted to GenBank/EMBL last week.

Lots of ways to do it


Download weekly update of GenBank/EMBL from FTP site


Use official network-based interfaces to data:


NCBI toolkit


EBI CORBA & XEMBL servers


Use friendly web interfaces at NCBI, EBI

Perl/Java/Python to the Rescue


One script to do the web fetch


Another to parse the file format


A third to move into private database


A fourth to repeat this weekly


Result:


6,719 scripts that do the same thing


None of them work together

What's Wrong with This?


My EMBL fetcher is poorly documented, so you write your own


Your fetcher won't work with my parser


My parser won't work with your fetcher


We've now wasted 20 hours rather than 10


Multiply this by 6,719


What Else Is Wrong?


NCBI/EBI tweaks something


6,719 scripts fail at once


6,719 bioinformaticists tear their hair


21,261 biologists curse the bioinformaticists


6,719 bioinformaticists curse their own existence

Unifying Bioinformatics Services

MIMBD: Meetings on the Interconnection of Molecular Biology Databases

Federated models: Gaea, Kleisli

Data warehouses: GUS, MODs, Ensembl, UCSC

Ad hoc web services

Formal web services

Ad hoc services

[Diagram labels: BioXXX, Your Script, Conf file]

Formal Web Services

[Diagram labels: SeqFetch Service, BLAT Service, Microarray Service, BLAST Service, SeqFetch Service, GO Service]

Formal Web Services

[Diagram labels: Service Registry; SeqFetch Service, BLAT Service, Microarray Service, BLAST Service, SeqFetch Service, GO Service]

Formal Web Services

[Diagram labels: Your Script, Service Registry, BioXXX, Microarray Service; SeqFetch Service, BLAT Service, Microarray Service, BLAST Service, SeqFetch Service, GO Service]

Technical Infrastructure is Here*


Common vocabulary: GO


Transport format: XML


Data definition language: XSD


Wire protocol: SOAP


Service definition language: WSDL


Service registry: UDDI

*(almost)
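The registry idea, where a script asks a registry for a service by name instead of hard-coding each provider, can be sketched in a few lines. This is an in-process toy, not SOAP, WSDL or UDDI: the class `ServiceRegistry`, the service names and the placeholder sequence `demo1` are all hypothetical.

```python
class ServiceRegistry:
    """Toy stand-in for a service registry: maps a service name to a callable."""

    def __init__(self):
        self._services = {}

    def register(self, name, func):
        """Publish a service under a well-known name."""
        self._services[name] = func

    def lookup(self, name):
        """Return the service registered under `name`, or raise LookupError."""
        try:
            return self._services[name]
        except KeyError:
            raise LookupError("no service registered as %r" % name)


# Two stand-in services; real ones would run on remote servers.
_SEQUENCES = {"demo1": "ATGGCC"}  # placeholder data, not a real record


def seq_fetch(identifier):
    """Return the sequence stored under `identifier`."""
    return _SEQUENCES[identifier]


def gc_content(sequence):
    """Return the fraction of G and C bases in `sequence`."""
    return sum(base in "GC" for base in sequence) / len(sequence)


registry = ServiceRegistry()
registry.register("SeqFetch", seq_fetch)
registry.register("GCContent", gc_content)

if __name__ == "__main__":
    # The script knows only the service names, not who provides them.
    fetch = registry.lookup("SeqFetch")
    analyse = registry.lookup("GCContent")
    print(analyse(fetch("demo1")))
```

If a provider changes how it delivers sequences, only the function registered under `SeqFetch` needs to change; scripts that look the service up by name keep working.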

Distributed Annotation System

http://www.biodas.org

[Diagram labels: Reference Server (AC003027, AC005122, M10154); three Annotation Servers (AC003027, M10154, WI1029, AFM820, AFM1126, WI443, AC005122)]
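The division of labor in the Distributed Annotation System, with one reference server defining the sequences and coordinates while independent annotation servers contribute features against them, can be sketched as a client-side merge. Everything below is an in-memory illustration: the identifiers `REF_A` and `REF_B`, the lengths, the feature coordinates and the function name are invented, and the actual DAS request format is not shown.

```python
from collections import defaultdict

# Reference server: sequence id -> length (invented values).
REFERENCE = {"REF_A": 5000, "REF_B": 3000}

# Each annotation server returns (reference id, start, end, label) tuples.
SERVER_1 = [
    ("REF_A", 100, 900, "gene prediction"),
    ("REF_B", 50, 400, "gene prediction"),
]
SERVER_2 = [
    ("REF_A", 40, 60, "marker"),
    ("REF_A", 1200, 1250, "marker"),
    ("REF_C", 10, 20, "marker"),  # unknown reference: ignored by the merge
]


def merge_annotations(reference, *servers):
    """Combine features from several servers, grouped by reference sequence.

    Features on unknown references or outside the sequence bounds are dropped;
    the rest are sorted by start coordinate.
    """
    merged = defaultdict(list)
    for features in servers:
        for ref, start, end, label in features:
            if ref in reference and 1 <= start <= end <= reference[ref]:
                merged[ref].append((start, end, label))
    return {ref: sorted(feats) for ref, feats in merged.items()}


if __name__ == "__main__":
    for ref, feats in sorted(merge_annotations(REFERENCE, SERVER_1, SERVER_2).items()):
        print(ref, feats)
```

The shared coordinate system is what lets annotation providers work independently: none of them needs to know about the others, only about the reference.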

Thursday 10:30 AM

Canyon IV

Europe, ca 2000

Bioinformatics, ca 2010?

[Diagram labels: NGIC; KNIH, Human, Proteome, Animal, Ag-Bio, Crop, Plant, Microbial; Universities, Research Institutes, Industry]

Collection and Sharing of National Genome Information

[Diagram labels: NGIC; KNIH, Human, Proteome, Animal, Ag-Bio, Crop, Plant, Microbial; Data Grid, KISTI, ETRI, Application Grid]

National Genome Information Network