Slide 1 - Metabolomics Fiehn Lab


1

Guest Lecture @ Graduate level course

MCB221b - Mechanistic Enzymology & Metabolic Networks

Tobias Kind


March 2010

Bio-Chemical databases

Database concepts - what is a “good” database (DB)




How is data stored and queried and curated




Enzyme DBs, Protein and peptide DBs, small molecule DBs

This document is hyperlinked (pictures and green text).

To use WWW links in this PPT switch to slide show mode.

2

Databases


very short primer

(*)

Database interface: what you see

Database queries: what you ask the database

Database objects: where the data is stored (index and tables)

Database types: relational databases, object-oriented databases, flat-file DBs

Database brands: Oracle, MySQL, Apache, IBM DB2, PostgreSQL, MS SQL

Database query language: how a database can be programmed (SQL)

Database dump file: the whole database in a single (*.dmp) file

Database ontology: database vocabulary and the relationships used

Database semantics: capture meaning by grammar or logical analysis


(*) you can study this for several years and get a PhD in computer and database sciences.
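The primer terms above (objects, query, dump) can be tried out in a few lines. This is only a minimal sketch using Python's built-in sqlite3 module; SQLite is not one of the brands named on the slide and is chosen here solely because it needs no server. The table name `compound`, the index name and the column names are hypothetical.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Database objects: one table plus one index (hypothetical names).
conn.execute("CREATE TABLE compound (name TEXT PRIMARY KEY, smiles TEXT)")
conn.execute("CREATE INDEX idx_compound_smiles ON compound (smiles)")

# Store two structures that appear elsewhere in this lecture.
conn.executemany(
    "INSERT INTO compound (name, smiles) VALUES (?, ?)",
    [("caffeine", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"), ("ethanol", "CCO")],
)

# Database query: ask for one record using SQL.
rows = conn.execute(
    "SELECT name, smiles FROM compound WHERE name = ?", ("ethanol",)
).fetchall()
print(rows)

# Database dump: the whole database as one block of SQL text.
dump_sql = "\n".join(conn.iterdump())
print(dump_sql)
```

Note that the SQLite dump is plain SQL text, which is only analogous to the binary or vendor-specific (*.dmp) files of larger database systems.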

[Logos: DB2, Oracle, MySQL]

3

How are structures stored? (same for spectra)

DB

Database Interface

or

DB Cartridge

[Figure: 2D structure drawing of caffeine]
Storage

Conversion

View

A) In databases


for millions of structures

B) In structure files (text files)


for few structures

[Figure: 2D structure drawing of caffeine]
SDF/CML


4

How are structures stored?


…here cometh the (true) tower of Babel again

…more than 100 different file formats in use




Structure formats can store 1D, 2D and 3D coordinate information and metadata

Tower of Babel


Source: Brueghel/WIKI

[Figure: ethanol shown in 1D (SMILES: CCO), 2D (structure drawing) and 3D (molecular model)]

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYAB

InChIKey source: ChemSpider

InChI=1/C8H8/c1-2-5-3(1)7-4(1)6(2)8(5)7/h1-8H

InChIKey=TXWRERCHRDBNLG-UHFFFAOYAL


5

Chemical Structure Handling


Most common structure formats you need to know:


SMILES/SMARTS - Simplified Molecular Input Line Entry Specification

SDF/MOL - Structure Data File

InChI/InChIKey - IUPAC International Chemical Identifier

PDB - Protein Data Bank

CML - Chemical Markup Language


Some problems:




Data format needs to be based on Open Standard (problem with SMILES, ok with CML)



Stereo and aromatic bond information needs to be saved (ok with SDF)



Format needs to be small in space for millions of compounds (ok with SMILES)



SMILES notation needs to be unique (problem with SMILES)



Structure representation should be portable and based on Open Standard (ok with CML)

[Figure: structure of moronic acid, PubChem CID: 489941]

6

Chemical Structure Identifiers


Structure Identifiers are needed for uniquely identifying structures

Important for searching chemical structures in text and databases





Structure Name: IUPAC name or common name

CAS RN: Chemical Abstracts identifier

PubChem ID: PubChem Compound ID

InChIKey: short representation of InChI

InChI: IUPAC International Chemical Identifier




[Figure: 2D structure drawing of caffeine]

Structure Name: 1,3,7-trimethylpurine-2,6-dione

CAS RN: 58-08-2

PubChem ID: CID 2519

InChIKey=RYYVLZVUVIJVGH-UHFFFAOYAW

InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
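Some identifiers carry a built-in consistency check that helps catch typing errors before a search. As a small sketch: the last digit of a CAS RN is a check digit computed from the other digits (rightmost digit times 1, next digit times 2, and so on, sum modulo 10). The function name `cas_check_digit_ok` is made up for this example.

```python
import re


def cas_check_digit_ok(cas_rn: str) -> bool:
    """Return True if a CAS RN such as '58-08-2' has a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if match is None:
        return False  # not in the NNNNNNN-NN-N layout at all
    digits = match.group(1) + match.group(2)
    # Rightmost digit gets weight 1, the next one weight 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))


print(cas_check_digit_ok("58-08-2"))  # caffeine CAS RN from this slide
print(cas_check_digit_ok("58-08-3"))  # same number with a wrong last digit
```

A passing check only says the number is well formed; it does not prove that the number belongs to the compound you had in mind.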



7

SMILES structure format


Positive:

Good for storing structures in single line


Fast text based search possible; human readable

Negative:

Many different SMILES codes exist


SMILES for same structure can be different (canonical or unique SMILES needed)




C


CC


CCC


CCCC


CCCCO


CCCCN


All of these SMILES codes represent caffeine:

[c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-]

CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12

Cn1cnc2n(C)c(=O)n(C)c(=O)c12

Cn1cnc2c1c(=O)n(C)c(=O)n2C

N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2

O=C1C2=C(N=CN2C)N(C(=O)N1C)C

CN1C=NC2=C1C(=O)N(C)C(=O)N2C


[Figure: 2D structure drawing of caffeine]

InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

Caffeine SMILES source: InChI FAQ
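The uniqueness problem can be made visible with plain string handling. The sketch below is not a SMILES parser and does not canonicalize anything; it only counts the atom symbols written in each string, which is a necessary (not sufficient) condition for two SMILES to describe the same molecule. The helper name `atom_counts` and the simplified regular expression are assumptions for this example; real work would use a cheminformatics toolkit to generate canonical SMILES.

```python
import re
from collections import Counter

# Bracket atoms like [CH3] or [n+], or atoms of the organic subset (C, n, Cl ...).
ATOM = re.compile(r"\[([A-Za-z][a-z]?)[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])")


def atom_counts(smiles: str) -> Counter:
    """Count explicitly written atoms; aromatic lower case is folded to upper case."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        counts[symbol.capitalize()] += 1
    return counts


caffeine_smiles = [
    "[c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-]",
    "CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12",
    "Cn1cnc2n(C)c(=O)n(C)c(=O)c12",
    "O=C1C2=C(N=CN2C)N(C(=O)N1C)C",
]

# As text, every string is different, so a naive text search would miss matches.
print(len(set(caffeine_smiles)))

# The heavy-atom composition is nevertheless identical for all of them.
for smiles in caffeine_smiles:
    print(dict(atom_counts(smiles)))
```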

8

SDF/MOL structure format


Positive:

established standard format; good for storing structures safely


can store 3D structure; can store metadata (boiling points, toxicity, mass spectra)

Negative:

large file size, need compression



 OpenBabel02240823422D

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
M  END
$$$$

 OpenBabel02240823422D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
  1  2  1  0  0  0
M  END
$$$$

 OpenBabel02240823422D

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
  1  2  1  0  0  0
  2  3  1  0  0  0
M  END
$$$$




Creator

Coordinates for 3D

Connection of atoms
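The three parts labelled on this slide (creator line, coordinates, connections) can be read back with very little code. The following is a rough sketch that assumes V2000 records with a three-line header, as in the example records on this slide; the function name `parse_sdf` is hypothetical and metadata fields after "M  END" are ignored.

```python
SDF_TEXT = """
 OpenBabel02240823422D

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
M  END
$$$$

 OpenBabel02240823422D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
  1  2  1  0  0  0
M  END
$$$$
"""


def parse_sdf(sdf_text):
    """Return one dict per record with its element symbols and bonds."""
    molecules, record = [], []
    for line in sdf_text.splitlines():
        if line.strip() != "$$$$":
            record.append(line)
            continue
        # "$$$$" closes a record; line 4 of the record is the counts line.
        counts = record[3]
        n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
        atom_block = record[4:4 + n_atoms]
        bond_block = record[4 + n_atoms:4 + n_atoms + n_bonds]
        molecules.append({
            # Atom line: x, y, z, then the element symbol.
            "elements": [atom.split()[3] for atom in atom_block],
            # Bond line: first atom, second atom, bond order.
            # split() is fine for small molecules; large ones need fixed columns.
            "bonds": [tuple(int(v) for v in bond.split()[:3]) for bond in bond_block],
        })
        record = []
    return molecules


for molecule in parse_sdf(SDF_TEXT):
    print(molecule)
```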

9

Source: http://wwmm.ch.cam.ac.uk/inchifaq/

InChI and InChIKey


UWPWWENWLZPQGU-CQNODWLUHW

(a short hashed key for text-based search in Google, Bing, Yahoo)

InChI

InChIKey
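An InChI string is built from layers separated by "/", which makes it easy to take apart. The sketch below splits the caffeine InChI shown earlier in this lecture into its version, formula, connection (c) and hydrogen (h) layers. The function name `split_inchi` is an assumption, and the code presumes a simple InChI in which each layer letter occurs only once.

```python
def split_inchi(inchi: str) -> dict:
    """Split a simple InChI into named layers (version, formula, c, h, ...)."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string: " + inchi)
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        # The first character names the layer, the rest is its content.
        layers[layer[0]] = layer[1:]
    return layers


caffeine = "InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
for name, content in split_inchi(caffeine).items():
    print(name, content)
```

The InChIKey, in contrast, is a hash of this string and cannot be split back into chemical information; it is only useful for exact-match lookups.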

10

CML structure format


Positive:

Open Standard format; good for storing structures safely


machine readable

Negative:

huge files; redundant information; needs compression

<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="2.6673582436560714" y2="0.3080000000000006" />
    <atom id="a2" elementType="C" x2="1.3336791218280362" y2="-0.46199999999999997" />
    <atom id="a3" elementType="C" x2="4.440892098500626E-16" y2="0.30800000000000016" />
    <atom id="a4" elementType="C" x2="-1.3336791218280348" y2="-0.4620000000000002" />
    <atom id="a5" elementType="O" x2="-2.6673582436560705" y2="0.3079999999999997" />
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1" />
    <bond atomRefs2="a2 a3" order="1" />
    <bond atomRefs2="a3 a4" order="1" />
    <bond atomRefs2="a4 a5" order="1" />
  </bondArray>
</molecule>

[Figure: 2D structure drawing of the molecule encoded above (CH3 ... OH)]
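"Machine readable" means that a generic XML parser is enough to get at the atoms and bonds. A minimal sketch with Python's standard xml.etree.ElementTree module follows; it uses the same molecule as on this slide, with coordinates rounded to three decimals to keep the listing short and without an XML namespace (real CML files usually declare one).

```python
import xml.etree.ElementTree as ET

CML = """<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="2.667" y2="0.308" />
    <atom id="a2" elementType="C" x2="1.334" y2="-0.462" />
    <atom id="a3" elementType="C" x2="0.000" y2="0.308" />
    <atom id="a4" elementType="C" x2="-1.334" y2="-0.462" />
    <atom id="a5" elementType="O" x2="-2.667" y2="0.308" />
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1" />
    <bond atomRefs2="a2 a3" order="1" />
    <bond atomRefs2="a3 a4" order="1" />
    <bond atomRefs2="a4 a5" order="1" />
  </bondArray>
</molecule>"""

root = ET.fromstring(CML)

# Map atom id -> element symbol.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Each bond references two atom ids in its atomRefs2 attribute.
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# Build a neighbour list, i.e. the connection table of the molecule.
neighbours = {atom_id: [] for atom_id in atoms}
for first, second in bonds:
    neighbours[first].append(second)
    neighbours[second].append(first)

print(atoms)
print(neighbours)
```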
11

Oooh the meta data!

SBML (Systems Biology Markup Language)

Source: Akira Funahashi


Cell Designer Tutorial



List of supported SBML programs (around 181) from sbml.org

List of curated and published SBML models (> 241) from biomodels DB

BioPax format


used for representing pathway data (data exchange format)

SBML format


representing models of biochemical reaction networks


12

SQL - (Structured Query Language)

used for programming relational databases

database query tools and SQL editors available (Aqua Data Studio)


yr    subject    winner
1901  Chemistry  Jacobus H. van 't Hoff
1902  Chemistry  Emil Fischer
1903  Chemistry  Svante Arrhenius
1904  Chemistry  Sir William Ramsay
1905  Chemistry  Adolf von Baeyer
1906  Chemistry  Henri Moissan
1907  Chemistry  Eduard Buchner
1908  Chemistry  Ernest Rutherford
1909  Chemistry  Wilhelm Ostwald
1910  Chemistry  Otto Wallach
1913


SELECT yr, subject, winner
FROM nobel
WHERE yr = 1909 AND subject = 'Chemistry'

yr subject winner

1909 Chemistry Wilhelm Ostwald

Large Database Table


SQL query



Result


Visit the SQL Zoo
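The query on this slide can be reproduced locally without installing a database server. This is a sketch using Python's built-in sqlite3 module with a few rows of the table shown here; SQLite is an assumption made for convenience, not the system behind the SQL Zoo. Note that SQLite compares text case-sensitively by default, so the subject must be spelled exactly as it is stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
conn.executemany(
    "INSERT INTO nobel (yr, subject, winner) VALUES (?, ?, ?)",
    [
        (1907, "Chemistry", "Eduard Buchner"),
        (1908, "Chemistry", "Ernest Rutherford"),
        (1909, "Chemistry", "Wilhelm Ostwald"),
        (1910, "Chemistry", "Otto Wallach"),
    ],
)

query = """
SELECT yr, subject, winner
FROM nobel
WHERE yr = ? AND subject = ?
"""

# Spelling matches the stored value: one row comes back.
result = conn.execute(query, (1909, "Chemistry")).fetchall()
print(result)

# Lower-case spelling: no row in SQLite, because '=' is case-sensitive here.
lowercase_result = conn.execute(query, (1909, "chemistry")).fetchall()
print(lowercase_result)
```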

13


Application programming interfaces (APIs) are important to connect and automate data exchange between local programs and databases.

Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to download certain data via the web to another service or to a local program.

Resource Description Framework (RDF): defines and captures metadata; used for the semantic web (triples: subject-predicate-object).

Mashups and integration services use new web technology (RDF, Yahoo Pipes) to combine data sources and create new knowledge or enhance usage.


The semantic web and databases

(API, Mashups, RDF)
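The subject-predicate-object idea is simple enough to imitate with plain tuples. The sketch below is not RDF proper (real RDF uses URIs and dedicated libraries and query languages); it only illustrates how facts stored as triples can be queried by pattern. The predicate names such as `hasCasRn` and the helper `match` are invented for this example; the identifier values are the ones shown elsewhere in this lecture.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Each fact is one (subject, predicate, object) triple.
TRIPLES: List[Triple] = [
    ("caffeine", "hasCasRn", "58-08-2"),
    ("caffeine", "hasPubChemCid", "2519"),
    ("caffeine", "hasInChIKey", "RYYVLZVUVIJVGH-UHFFFAOYAW"),
    ("ATP", "hasKeggId", "C00002"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Everything known about caffeine.
print(match(TRIPLES, subject="caffeine"))

# Which subject carries the KEGG ID C00002?
print(match(TRIPLES, predicate="hasKeggId", obj="C00002"))
```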

14

What is a good database?

As in normal life, it's important to distinguish between good and evil

Good DB:



allows multiple input queries



exports in multiple output formats



connects to other DBs



is curated (means checked for errors by humans or machines)



is regularly updated (daily, yearly)



costs money (your money or taxpayers' money) or time



allows bulk download (millions of data sets can be downloaded)



has open interfaces (APIs) for query requests


Bad DBs:



allow only single requests (which have to be typed manually)



are not databases but just lists or tables



have no link-out and no link-in



allow no bulk download



are not curated






Source: wikimedia.org


15

Clogged vs. Clean interfaces

Search engine problem:

What if internet is empty?

What if you don't know what to search?

Portal problem:

What if you have attention disorder?

What if you are overwhelmed?

16

Database front-ends (a good one)

Enhanced NCI Database Browser Release 2

(CACTVS DB)




Allows you to query/search everything

NCI DB with revolutionary web front-end (2001)

Multiple input and output (export) methods, batch-wise matching

Links to other services

Visualization modes (2D, 3D)

20 different molecular output formats (SDF, CML, SMILES)

Export to other (computational) services

30 different query modes

17

Database front-ends (a good one)

Chemical Identifier Resolver beta 2

74 million molecule entries from more than 100 databases, representing more than 46 million unique chemical structures


18

Database visualization

Source:
Cytoscape.org

Start Cytoscape via JAVA webstart



Visualize complex networks; uses plug-in technology from different sources



Map your own compound data (proteins, genes, molecules) onto networks



Perform literature search with enzymes, genes, small molecules



19

Uber-portals (NCBI ENTREZ)

20

Database and tools integration

Gaggle

Source:
http://gaggle.systemsbiology.org/docs/geese/

Source: WIKIMEDIA



Frameworks



Portals



Mashups

21

Gaggle

Integration of tools and database services

The Gaggle: an open-source software system for integrating bioinformatics software and data sources.

Shannon PT, Reiss DJ, Bonneau R, Baliga NS.

BMC Bioinformatics. 2006 Mar 28;7:176.

ListLink

Source: WIKIMEDIA

Use Gaggle

22

Use or build your own local database

Example: LipidMaps DB with Instant-JChem

Download the whole LipidMaps DB (10,000 lipids) as an SDF file [LINK]

Use Instant-JChem as data DB, molecule DB, reaction DB [LINK]

Perform data and molecule queries on your laptop (PC, Linux, Mac)

(…also works with KEGG/Biometa DB)

23

Welcome to the (database) jungle !

Pathguide.org



collection of pathway, enzyme, metabolite DBs



current number ~ 317



Protein-Protein Interactions

Metabolic Pathways

Signaling Pathways

Pathway Diagrams

Transcription Factors / Gene Regulatory Networks

Protein-Compound Interactions

Genetic Interaction Networks

Protein Sequence Focused


24

Chemistry related (big players):



PubChem, CAS (subscription), ChemSpider (fast growing)

Beilstein (subscription)


Important for chemistry/metabolomics:



Spectral databases (NMR, mass spectral databases),



compound property DBs


Pathway, Enzyme related:



KEGG, Brenda, Reactome, Expasy, MetaCyc

ChemBioGrid



collection of most chemistry databases



current number ~ 200

See you in the (database) jungle !

25

Enzyme and kinetics related databases

KDBI - Kinetic Data of Bio-molecular Interactions database
http://bidd.nus.edu.sg/group/kdbi/

SABIO-RK - SABIO-Reaction Kinetics Database
http://sabio.villa-bosch.de/SABIORK/

BRENDA - Comprehensive Enzyme Information System
http://www.brenda.uni-koeln.de/

EMP - Enzymes and Metabolic Pathways Database
http://www.empproject.com/

ENZYME - Enzyme nomenclature database (EXPASY)
http://www.expasy.ch/enzyme/

IntEnz - Integrated relational Enzyme database
http://www.ebi.ac.uk/intenz/index.html

TECR - Thermodynamics of Enzyme-Catalyzed Reactions
http://xpdb.nist.gov/enzyme_thermodynamics/

REBASE - Restriction Enzyme Database
http://rebase.neb.com/

Precise - Predicted and Consensus Interaction Sites in Enzymes
http://precise.bu.edu/





Source: Pathguide; Own search

26

PubChem



Most important small molecule DB

There was no large open chemistry DB until 10 years ago (!)

All records can be downloaded via FTP

All other small molecule DBs link to PubChem

PubChem Compound ID (true chemicals)

PubChem Substance ID (formulations, mixtures)

Substructure search and multiple other options

Go to PubChem

27

Chemspider

A masterpiece of disruptive technology

Example of how a good chemistry database should be (gold standard)

Has only one flaw: it's free but not open (cannot be downloaded)

Links to spectra, patents, commercial availability data, publications

User centric: cares about users

Crowd sourcing: registered users can submit data, spectra, molecules


InChIKey Resolver

28

Chemspider

User centric database that utilizes crowd-sourcing

29

CAS SciFinder



52 million molecules and 61 million sequences



Largest reaction DB (18 million reactions) and literature DB



A must for chemists and biochemists/biologists



no bulk download, no good import/export, no link-outs



Standalone + web frontend



no text mining (requires ANAVIST)

Download Scifinder

30

BRENDA - Comprehensive Enzyme Information System

31

Brenda 3D model output with JMOL

Example: Brenda connection to the RCSB Protein Data Bank

Visit Brenda

32

KEGG


Pathway DB

KEGG ID: C00002 (ATP)

KEGG pathway map ID: map00195 (Photosynthesis)

KEGG reaction ID: R05668 (ATP + NAD reaction)

Visit KEGG

33

Reactome


curated pathway maps

Visit Reactome

Example: Skypainter, map your given KEGG IDs to pathways

34

Outlook for the database lesson



Curation, Curation, Curation (costs money)

Inhale the good DB and bad DB scheme and apply it when you enter a DB portal

Learn some basic database programming (Ruby on Rails, Java, SQL); using bioinformatics and cheminformatics approaches is crucial for research

Learn how to import, store and handle database search results on your local computer (simple: parse important data with regular expressions)

Don’t be overwhelmed by the database jungle, take some time to play around; in the end, automation and clever use of DB tools will innovate your research

The multiple unique identifier problem (KEGG ID, PubChem ID, CAS number) and the biological naming problem still exist

The systems biology and chemistry database worlds still differ in terms of re-use. Most of the chemistry data published (including molecules) is not machine readable, hence can’t be automatically harvested by software robots.
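As a small illustration of the "parse important data with regular expressions" advice, the sketch below pulls several kinds of identifiers out of a piece of free text. The sample sentence is made up (it reuses identifiers from earlier slides), and the patterns are deliberately simple assumptions: they describe the visible shape of each identifier and will also match look-alike strings, so results still need checking.

```python
import re

# Simple shape-based patterns; the dictionary keys are names chosen for this example.
PATTERNS = {
    "kegg_compound": r"\bC\d{5}\b",        # e.g. C00002
    "kegg_map": r"\bmap\d{5}\b",           # e.g. map00195
    "cas_rn": r"\b\d{2,7}-\d{2}-\d\b",     # e.g. 58-08-2
    "pubchem_cid": r"\bCID:?\s*(\d+)",     # e.g. CID: 2519 (captures the number)
}


def harvest_identifiers(text: str) -> dict:
    """Return every match of every pattern found in the text."""
    return {name: re.findall(pattern, text) for name, pattern in PATTERNS.items()}


sample = (
    "ATP (KEGG C00002) takes part in map00195; "
    "caffeine has CAS RN 58-08-2 and PubChem CID: 2519."
)

found = harvest_identifiers(sample)
for name, values in found.items():
    print(name, values)
```

Collecting the different identifiers of one compound side by side like this is also a first step towards dealing with the multiple-identifier problem mentioned above.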

35

Reading List databases

The Gaggle: An open-source software system for integrating bioinformatics software and data sources

Correcting ligands, metabolites, and pathways

Large-Scale Annotation of Small-Molecule Libraries Using Public Databases




36

Homework will be graded (Time to solve 10 min)

1) What is the InChIKey for Tobias acid from PubChem?

Help: Go to PubChem

2) How many US patents are listed in ChemSpider for InChIKey: UMYJVVZWBKIXQQ-QALSDZMNSA-N

Help: Go to ChemSpider, select US patents

3) What is the PubChem Compound CID for grasshopper ketone?

Help: Go to PubChem and enter name, find CID

4) How many extinct mammals are represented with sequences in GenBank?

Help: Go to NCBI Taxonomy, statistics

5) How many base pairs are in the locus GQ324610 for Equus hydruntinus? Write down the first 20 amino acids.

Help: Go to NCBI ENTREZ, enter taxonomy name, select nucleotide DB

6) Go to Brenda and find out how many enzymes are listed as resistant against perchloric acid. Report name and Gene Ontology number of the enzyme(s).

Help: Go to Brenda, Advanced search, go to EC number and GO number


Source: MS Office

37

Pathways and enzymes

http://www.biocarta.com/pathfiles/h_etcPathway.asp#

SQL learning

http://sqlzoo.net/

Databases

http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search

SQL biologists

I’m a biologist Jim, not a programmer

SQL biologists

SciView part 5: interview with Alexei Drummond






Thank you!

Thanks to all Wikimedia.org contributors for pictures!

Thanks to Dinesh Kumar (FiehnLab) for discussions.