The Pathway Tools Ontology and Inferencing Layer

Peter D. Karp, Ph.D.

SRI International

SRI International

Bioinformatics

Overview


Definitions



Ontologies are ultimately exciting because of the inferences/computations they enable:

Where are the ontology killer apps?

Adding more facets to an ontology increases the inferences that can be made with it

Pathway Tools ontology and associated applications


Terminology


Model Organism Database (MOD)

DB describing genome and other information about an organism

Pathway/Genome Database (PGDB)

MOD that combines information about:

Pathways, reactions, substrates

Enzymes, transporters

Genes, replicons

Transcription factors, promoters, operons, DNA binding sites

BioCyc

Collection of 15 PGDBs at BioCyc.org

EcoCyc, AgroCyc, YeastCyc



Terminology


Pathway Tools Software


PathoLogic


Prediction of metabolic network from genome


Computational creation of new Pathway/Genome Databases



Pathway/Genome Editors


Distributed curation of PGDBs


Distributed object database system, interactive editing tools



Pathway/Genome Navigator


WWW publishing of PGDBs


Querying, visualization of pathways, chromosomes, operons


Analysis operations


Pathway visualization of gene-expression data


Global comparisons of metabolic networks



Bioinformatics 18:S225 2002


Ontology




Ontology = Terms + Taxonomy + Slots + Constraints


Pathway Tools Ontology: Terms and Taxonomy


Pathway Tools ontology contains 916 classes


Define datatypes


Replicons, Genes, Operons, Promoters, Transcription Factor Binding Sites


Proteins: Enzymes, Transporters, Transcription Factors


Small molecule compounds


Reactions, pathways


Define taxonomies


Taxonomy of chemical compounds


Riley’s gene ontology


Taxonomy of metabolic pathways


EC system



Bioinformatics 16:269 2000


Operations Enabled by Controlled Vocabulary

Equality testing:

Is the function of gene X in organism A the same as the function of gene Y in organism B?

Is location L1 in organism A the same as location L2 in organism B?



Operations Enabled by Taxonomy

Counting / Pie charts

How many genes of category “small molecule metabolism” are in organism A?

Intersecting sets

How many of these up-regulated genes are in class “cell cycle”?

User search via drill down

Applying rules

If the substrate of X is an amino acid, then XXX
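
The counting and set-intersection operations can be sketched in a few lines of Python. This is a toy illustration, not Pathway Tools code: the class names, gene assignments, and the helpers `is_a` and `genes_in_category` are all hypothetical.

```python
# Toy taxonomy: child class -> parent class (all names are illustrative).
PARENT = {
    "amino-acid-biosynthesis": "small-molecule-metabolism",
    "carbon-utilization": "small-molecule-metabolism",
    "small-molecule-metabolism": "gene-function",
    "cell-cycle": "gene-function",
}

# Hypothetical gene -> functional class assignments for one organism.
GENE_CLASS = {
    "trpA": "amino-acid-biosynthesis",
    "lacZ": "carbon-utilization",
    "ftsZ": "cell-cycle",
    "minC": "cell-cycle",
}


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or lies below it in the taxonomy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def genes_in_category(category):
    """All genes whose assigned class falls under the given category."""
    return {g for g, c in GENE_CLASS.items() if is_a(c, category)}


# Counting: how many genes fall under "small-molecule-metabolism"?
smm_count = len(genes_in_category("small-molecule-metabolism"))

# Intersecting sets: which up-regulated genes are in class "cell-cycle"?
up_regulated = {"ftsZ", "lacZ"}
up_cell_cycle = up_regulated & genes_in_category("cell-cycle")
```

The point of the sketch is that a gene annotated with a specific class is counted under every ancestor class, which is what makes pie charts and drill-down searches possible at any level of the taxonomy.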



Ontology




Ontology = Terms + Taxonomy + Slots + Constraints


Pathway Tools Ontology: Slots

Pathway Tools ontology contains 199 slots

Categories of slots:

Meta-data: Creator, Creation-Date

Textual data: Common-Name, Synonyms, Comment, Citations

Attributes: Molecular-Weight, pI

Relationships: Gene, Catalyzes, In-Reaction

Give stats on how many slots in each of these classes


Pathway Tools Ontology: Slots

Slots introduced at appropriate place in taxonomy

Child classes inherit the slot; parent classes do not

Examples:

Proteins: pI, MolWt, Component-Of

Polypeptides: Gene

Protein-Complexes: Components

Reactions: Left, Right, Keq, In-Pathway

Pathways: Reaction-List, Predecessor-List

Transcription Units: Components

Genes: Product, Component-Of
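
Slot inheritance can be illustrated with a minimal Python sketch. The slot names come from the examples above; the assumption that Polypeptides and Protein-Complexes are child classes of Proteins, and the helper `slots_of`, are illustrative and not taken from the Pathway Tools schema itself.

```python
# Toy class taxonomy (child -> parent). The parent/child links are assumed.
PARENT = {
    "Polypeptides": "Proteins",
    "Protein-Complexes": "Proteins",
    "Proteins": None,
}

# Slots introduced at each class, as listed in the examples.
SLOTS_INTRODUCED = {
    "Proteins": {"pI", "MolWt", "Component-Of"},
    "Polypeptides": {"Gene"},
    "Protein-Complexes": {"Components"},
}


def slots_of(cls):
    """Slots available on cls: its own plus everything inherited from ancestors."""
    slots = set()
    while cls is not None:
        slots |= SLOTS_INTRODUCED.get(cls, set())
        cls = PARENT.get(cls)
    return slots


polypeptide_slots = slots_of("Polypeptides")  # includes pI, MolWt, Component-Of, Gene
protein_slots = slots_of("Proteins")          # does not include Gene
```

Introducing `Gene` at Polypeptides rather than at Proteins keeps the slot off protein complexes, which have no single encoding gene.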



Operations Enabled by Slots


Store/retrieve attributes of an entity


Get pI of protein


Get citations associated with pathway



Traverse network of semantic relationships


Find all substrates of all reactions in pathway X


Find all genes that encode an enzyme that catalyzes a reaction in pathway X


Find all regulons encoding multiple metabolic pathways
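
As a rough sketch of what such a traversal looks like, the Python below answers "find all substrates of all reactions in pathway X" over a toy frame store. The dictionary layout, the frame identifiers, and the helper `get_slot_values` are assumptions made for illustration; only the slot names (Reaction-List, Left, Right) follow the slides.

```python
# Toy KB: each frame is a dict of slot -> list of values (made-up data).
KB = {
    "pathway-X": {"Reaction-List": ["rxn-1", "rxn-2"]},
    "rxn-1": {"Left": ["A", "ATP"], "Right": ["B", "ADP"]},
    "rxn-2": {"Left": ["B"], "Right": ["C"]},
}


def get_slot_values(frame, slot):
    """Return the list of values stored in a slot (empty if unset)."""
    return KB.get(frame, {}).get(slot, [])


def substrates_of_pathway(pathway):
    """All substrates (left and right sides) of all reactions in a pathway."""
    found = set()
    for rxn in get_slot_values(pathway, "Reaction-List"):
        found.update(get_slot_values(rxn, "Left"))
        found.update(get_slot_values(rxn, "Right"))
    return found


substrates = substrates_of_pathway("pathway-X")
```

Each query is just a chain of slot lookups, so more elaborate questions (genes of a pathway, regulons spanning pathways) differ only in which slots are followed.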


Ontology




Ontology = Terms + Taxonomy + Slots + Constraints


Pathway Tools Ontology: Constraints

Every Pathway Tools slot has associated metadata:

Class(es) to which it pertains

Keq pertains to Reactions

Data type (number, string, frame, etc.)

Keq data type is number

Collection type (list, bag)

Keq is not a collection

Documentation string

Cardinality constraints: at most one Keq value

Range constraints

Taxonomy constraints

Values of Left slot of Reactions must be Chemicals
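
A minimal sketch of how slot metadata can drive a consistency check, written in Python. The `SlotSpec` fields and the `check_frame` function are hypothetical stand-ins, and a Python type is used in place of the ontology's data types; the taxonomy constraint on Left is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotSpec:
    """Metadata attached to a slot (field names are assumptions)."""
    domain: str                       # class the slot pertains to
    datatype: type                    # Python type standing in for the data type
    max_values: Optional[int] = None  # cardinality constraint (None = unbounded)
    doc: str = ""                     # documentation string


SLOT_SPECS = {
    "Keq": SlotSpec("Reactions", float, max_values=1,
                    doc="Equilibrium constant of the reaction"),
    "Left": SlotSpec("Reactions", str, doc="Compounds on the left side"),
}


def check_frame(frame_class, slots):
    """Return a list of human-readable constraint violations for one frame."""
    problems = []
    for slot, values in slots.items():
        spec = SLOT_SPECS.get(slot)
        if spec is None:
            problems.append(f"unknown slot {slot}")
            continue
        if spec.domain != frame_class:
            problems.append(f"{slot} does not pertain to {frame_class}")
        if spec.max_values is not None and len(values) > spec.max_values:
            problems.append(f"{slot} allows at most {spec.max_values} value(s)")
        for v in values:
            if not isinstance(v, spec.datatype):
                problems.append(f"{slot} value {v!r} is not {spec.datatype.__name__}")
    return problems


ok = check_frame("Reactions", {"Keq": [0.5], "Left": ["succinate", "FAD"]})
bad = check_frame("Reactions", {"Keq": [0.5, "high"]})
```

Here `ok` is empty, while `bad` reports both a cardinality violation and a data-type violation. The same metadata could equally drive an input form, which is the idea behind schema-driven editing tools.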


Operations Enabled by Constraints



Constraints make a system “intelligent” because they encode definitions in a machine-understandable fashion

Automated DB consistency checkers (batch or interactive)

Schema-driven data input tools


Subsumption


Compare two concept definitions



Pathway Tools Inference Layer


Commonly used queries implemented as stored procedures



Infer what is implicitly recorded in the KB


Compute Transitive Relationships

[Figure: relationship chain for succinate dehydrogenase.
Genes on Chrom: sdhA, sdhB, sdhC, sdhD
Polypeptides: Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2
Complex: Succinate dehydrogenase
Enzymatic-reaction
Reaction: succinate + FAD = fumarate + FADH2 (left: succinate, FAD; right: fumarate, FADH2)
Pathway: TCA Cycle
Edge labels: product, component-of, catalyzes, reaction, in-pathway, left, right]
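
A transitive relationship such as "genes of a pathway" can be computed by chaining the single-step relationships named in the figure. The Python sketch below is one reading of that figure: the pairing of each gene with a polypeptide, the direction of each edge, and the frame identifiers are assumptions, and the helpers `follow`, `pathways_of_gene`, and `genes_of_pathway` are hypothetical names. It handles only one level of complex nesting.

```python
# Edges read off the succinate dehydrogenase figure: (frame, slot) -> values.
# Pairings and identifiers are assumed for illustration.
EDGES = {
    ("sdhA", "product"): ["Sdh-flavo"],
    ("sdhB", "product"): ["Sdh-Fe-S"],
    ("sdhC", "product"): ["Sdh-membrane-1"],
    ("sdhD", "product"): ["Sdh-membrane-2"],
    ("Sdh-flavo", "component-of"): ["Succinate-dehydrogenase"],
    ("Sdh-Fe-S", "component-of"): ["Succinate-dehydrogenase"],
    ("Sdh-membrane-1", "component-of"): ["Succinate-dehydrogenase"],
    ("Sdh-membrane-2", "component-of"): ["Succinate-dehydrogenase"],
    ("Succinate-dehydrogenase", "catalyzes"): ["Enzymatic-reaction-1"],
    ("Enzymatic-reaction-1", "reaction"): ["succinate-FAD-rxn"],
    ("succinate-FAD-rxn", "in-pathway"): ["TCA-Cycle"],
}


def follow(frames, slot):
    """Follow one slot from every frame in a set; return the set reached."""
    return {v for f in frames for v in EDGES.get((f, slot), [])}


def pathways_of_gene(gene):
    """Chain gene -> product -> complex -> enzymatic reaction -> reaction -> pathway."""
    products = follow({gene}, "product")
    # A product may act alone or as part of a complex; keep both.
    enzymes = products | follow(products, "component-of")
    enz_rxns = follow(enzymes, "catalyzes")
    reactions = follow(enz_rxns, "reaction")
    return follow(reactions, "in-pathway")


def genes_of_pathway(pathway, candidate_genes):
    """Invert the chain by testing each candidate gene."""
    return {g for g in candidate_genes if pathway in pathways_of_gene(g)}


tca_genes = genes_of_pathway("TCA-Cycle", ["sdhA", "sdhB", "sdhC", "sdhD", "lacZ"])
```

None of the four sdh genes is linked directly to the TCA cycle in the data; the association is implicit and is recovered only by composing five slots.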


Pathway Tools Inference Layer


Enumerate reactions given alternative definitions of a reaction: all, enzyme, transport, small-mol, smm

All substrates, all cofactors, all transported chemicals

Protein tests: Is X a transcription factor, enzyme, transporter

Rather than force user to manually assign physiological roles, compute when possible from biochemical function

Transcription-unit-binding-sites

Compute in parts hierarchy: monomers-of-protein, components-of-protein, genes-of-protein, modified-forms

Complex: regulon-of-protein, regulator-proteins-of-transcription-unit
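
Two of these inferences, a protein test and a parts-hierarchy computation, might look roughly as follows in Python. The function names echo the slide (monomers-of-protein, genes-of-protein), but the implementations, the frame layout, and the data are assumptions for illustration, not the Pathway Tools stored procedures.

```python
# Toy protein frames (made-up data); slot names are modeled on the slides.
PROTEINS = {
    "complex-AB": {"Components": ["complex-A2", "mono-B"],
                   "Catalyzes": ["enzrxn-1"]},
    "complex-A2": {"Components": ["mono-A"]},
    "mono-A": {"Gene": "geneA"},
    "mono-B": {"Gene": "geneB"},
}


def is_enzyme(protein):
    """Compute the role from function: an enzyme catalyzes at least one reaction."""
    return bool(PROTEINS[protein].get("Catalyzes"))


def monomers_of_protein(protein):
    """Recursively expand a complex into its monomer subunits."""
    components = PROTEINS[protein].get("Components")
    if not components:  # no components: the protein is itself a monomer
        return {protein}
    monomers = set()
    for part in components:
        monomers |= monomers_of_protein(part)
    return monomers


def genes_of_protein(protein):
    """Genes encoding every monomer of a protein."""
    return {PROTEINS[m]["Gene"] for m in monomers_of_protein(protein)}


genes = genes_of_protein("complex-AB")
```

Because `is_enzyme` is derived from the Catalyzes slot rather than stored as a flag, no one has to assign the role by hand, and it cannot drift out of sync with the reaction data.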



What Killer Apps have Ontologies Enabled?

What comes after pie charts and drill-down interfaces?


BioCyc Collection of Pathway/Genome DBs

Literature-based Datasets:

MetaCyc

Escherichia coli (EcoCyc)

Computationally Derived Datasets:

Agrobacterium tumefaciens

Caulobacter crescentus

Chlamydia trachomatis

Bacillus subtilis

Helicobacter pylori

Haemophilus influenzae

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis CDC1551

Mycoplasma pneumoniae

Pseudomonas aeruginosa

Saccharomyces cerevisiae

Treponema pallidum

Vibrio cholerae

Yellow Underlined = Open Database

http://BioCyc.org/


Pathway/Genome DBs Created by External Users

Plasmodium falciparum, Stanford University

plasmocyc.stanford.edu

Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington

Arabidopsis.org:1555

Methanococcus jannaschii, EBI

Maine.ebi.ac.uk:1555

Other PGDBs in progress by 20 other users

Software freely available

Each PGDB owned by its creator


Ontology Reuse


A holy grail in AI since “ontology” became a buzzword

Decrease knowledge acquisition bottleneck

GO qualifies as a large success in ontology reuse

Pathway Tools ontology reused across 18 PGDBs

Pathway Tools algorithms portable across all PGDBs



Pathway Tools Algorithms


Visualization and editing tools for the following datatypes

Full Metabolic Map

Paint gene expression data on metabolic network; compare metabolic networks

Pathways

Pathway prediction

Reactions

Balance checker

Compounds

Chemical substructure comparison

Enzymes, Transporters, Transcription Factors


Genes


Chromosomes


Operons


Operon prediction; visualize genetic network



Inference of Metabolic Pathways

[Figure: PathoLogic data flow.
Input: Annotated Genomic Sequence (Genes/ORFs, Gene Products, DNA Sequences)
Input: Multi-organism Pathway Database (MetaCyc) (Reactions, Pathways, Compounds)
PathoLogic Software: integrates genome and pathway data to identify putative metabolic networks
Output: Pathway/Genome Database (Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds)]


PathoLogic Analysis Phases


Trial parsing of input data files [few days]


Initialize schema of new PGDB [3 min]


Create DB objects for replicons, genes, proteins [5 min]


Assign enzymes to reactions they catalyze [10 min / 1 week]

[Figure: example enzyme names (ferrochelatase; glutamate 1-semialdehyde 2,1-aminomutase; porphobilinogen deaminase) being matched to reactions in a schematic network with nodes A to G and labels E1, E2]


PathoLogic Analysis Phases


From assigned reactions, infer what pathways are present [5 min / few days]



Define metabolic overview diagram [1 day]



Define protein complexes [few days]



Killer App: Global Consistency Checking of Biochemical Network

Given:

A PGDB for an organism

A set of initial metabolites

Infer:

What set of products can be synthesized by the small-molecule metabolism of the organism

Can known growth medium yield known essential compounds?

Pacific Symposium on Biocomputing p471 2001


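Algorithm: Forward Propagation

[Figure: forward propagation loop. The Nutrient set seeds the Metabolite set; reactions from the PGDB reaction pool whose Reactants are all in the Metabolite set “Fire”, and their Products are added to the Metabolite set]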
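
Forward propagation can be sketched as a fixed-point loop in Python. This is a simplified illustration under stated assumptions, not the published implementation: reactions are treated as directed from reactants to products, every reactant must be present for a reaction to fire, and the function name `forward_propagate` and the toy compounds are made up.

```python
def forward_propagate(nutrients, reactions):
    """Fire reactions until no new metabolite appears.

    nutrients: iterable of compound names available at the start.
    reactions: list of (reactants, products) pairs, each a set of names.
    Returns the set of all compounds reachable from the nutrients.
    """
    metabolites = set(nutrients)
    pending = list(reactions)
    changed = True
    while changed:
        changed = False
        still_pending = []
        for reactants, products in pending:
            if reactants <= metabolites:          # every reactant is available
                if not products <= metabolites:   # the reaction adds something new
                    metabolites |= products
                    changed = True
                # a fired reaction is dropped from the pool
            else:
                still_pending.append((reactants, products))
        pending = still_pending
    return metabolites


# Toy reaction pool: A + B -> C, C -> D, E -> F.
toy_reactions = [
    ({"A", "B"}, {"C"}),
    ({"C"}, {"D"}),
    ({"E"}, {"F"}),
]
reachable = forward_propagate({"A", "B"}, toy_reactions)

# Consistency check: which essential compounds cannot be produced?
essential = {"D", "F"}
missing = essential - reachable
```

In the toy run, F is reported missing because E is neither a nutrient nor a product of any fired reaction. A gap like this points to a missing nutrient, a missing reaction, or a database error.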


Results


Phase I: Forward propagation

21 initial compounds yielded only half of 38 essential compounds for E. coli

Phase II: Manually identify

Bugs in EcoCyc (e.g., two objects for tryptophan)

Missing initial protein substrates (e.g., ACP)

Missing pathways in EcoCyc

Phase III: Forward propagation with 11 more initial metabolites

Yielded all 38 essential compounds


How to Characterize the Metabolic Network of a Cell?


Aggregate Properties of the E. coli Metabolic Network

EcoCyc is not a complete picture of E. coli metabolism

30% of E. coli genes remain unidentified

Analysis pertains to pathways of small-molecule metabolism

Computed with respect to EcoCyc v4.5 (Sep-1998)

Joint work with Christos Ouzounis of EBI


Genome Research 10:568 2000


Enzymes


4391 genes in E. coli genome

4288 code for proteins

676 (15%) gene products form 607 enzymes

Of the 607 enzymes, 296 are monomers, 311 are multimers



90% of genes for heteromultimers are linked




Reactions



744 reactions of small-molecule metabolism


582 assigned to at least one pathway



Compounds


791 substrates in the 744 reactions



Each reaction contains 4.0 substrates on average



Each substrate appears in 2.1 reactions


Enzyme Modulation


805 enzymatic-reaction objects in EcoCyc



80 have physiological inhibitors


22 have physiological activators


17 have both


43% have a modulator



327 require a cofactor or prosthetic group


Enzyme-Reaction Associations


585 reactions catalyzed by 1 enzyme



55 reactions catalyzed by 2 enzymes



12 reactions catalyzed by 3 enzymes



1 reaction catalyzed by 4 enzymes



483 reactions belong to a single pathway


99 reactions belong to multiple pathways



100 of the 607 E. coli enzymes are multifunctional


Pathway Tools Implementation


Allegro Common Lisp


Sun and PC platforms


Run as window application or WWW server



Ocelot object database



250,000 lines of code



Lisp-based WWW server at BioCyc.org

Lisp process reads URLs from the network and generates GIF+HTML from PGDBs


Manages 15 PGDBs



Ocelot Knowledge Server Architecture

Frame data model

Classes, instances, inheritance

Persistent storage via disk files, Oracle DBMS

Concurrent development: Oracle

Single-user development: disk files

Read-only delivery: bundle data into binary program

Transaction logging facility

Schema evolution

Local disk cache to improve Internet performance

J. Intelligent Information Systems 13:155-94 1999



GKB Editor


Browser and editor for KBs and ontologies



Three editing tools:


Taxonomy editor


Frame editor


Relationships editor



All operations are schema driven



http://www.ai.sri.com/~gkb/user-man.html


The Common Lisp Programming Environment

Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)



Peter Norvig’s Solution


“I wrote my version in Lisp. It took me about 2 hours (compared to a range of 2-8.5 hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of 51-182 for Lisp, and 107-614 for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)”

http://www.norvig.com/java-lisp.html


Common Lisp Programming Environment

Interpreted and/or compiled execution

Fabulous debugging environment

High-level language

Interactive data exploration

Extensive built-in libraries

Dynamic redefinition

Find out more!

ALU.org: Association of Lisp Users


BioLisp.org


Pathway Exchange Ontology


BioPathways group developing ontology and format for exchange of pathway data

Metabolic pathways

Signaling pathways

Protein interactions

Moving upwards from chemicals, proteins, to reactions and pathways

Working to extend CML

Draft ontology at http://www.ai.sri.com/pkarp/misc/interactions.html


Summary


Pathway Tools apps:


Predict pathways and generate PGDBs


Visualization and editing tools


Paint gene expression data; compare entire pathway maps


Global consistency checking of metabolic network


Characterize metabolic and genetic networks


New killer apps:


Interoperability


Text mining


Bake-off for genome annotation pipelines


BioCyc and Pathway Tools Availability

WWW BioCyc freely available to all

BioCyc.org

Six BioCyc DBs openly available to all

BioCyc DBs freely available to non-profits

Flatfiles downloadable from BioCyc.org

Binary executable:

Sun UltraSparc-170 w/ 64MB memory

PC, 400MHz CPU, 64MB memory, Windows-98 or newer

PerlCyc API

Pathway Tools freely available to non-profits


Acknowledgements


SRI

Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud

EcoCyc Project

Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier

MetaCyc Project

Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville

Stanford

Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh

Funding sources:

NIH National Center for Research Resources

NIH National Institute of General Medical Sciences

NIH National Human Genome Research Institute

Department of Energy Microbial Cell Project

DARPA BioSpice, UPC

BioCyc.org