HyBrow: a prototype system for computer-aided hypothesis evaluation


BIOINFORMATICS
Vol. 20 Suppl. 1 2004, pages i257–i264
DOI: 10.1093/bioinformatics/bth905
S.A. Racunas, N.H. Shah, I. Albert and N.V. Fedoroff
The Huck Institute for Life Sciences, Penn State University, University Park,
PA 16801, USA
Received on January 15, 2004; accepted on March 1, 2004
ABSTRACT
Motivation: Experimental design, hypothesis-testing and model-building in the current data-rich environment require biologists to collect, evaluate and integrate large amounts of information of many disparate kinds. Developing a unified framework for the representation and conceptual integration of biological data and processes is a major challenge in bioinformatics because of the variety of available data and the different levels of detail at which biological processes can be considered.
Results: We have developed the HyBrow (Hypothesis Browser) system as a prototype bioinformatics tool for designing hypotheses and evaluating them for consistency with existing knowledge. HyBrow consists of a modeling framework with the ability to accommodate diverse biological information sources, an event-based ontology for representing biological processes at different levels of detail, a database to query information in the ontology and programs to perform hypothesis design and evaluation. We demonstrate the HyBrow prototype using the galactose gene network in Saccharomyces cerevisiae as our test system, and evaluate alternative hypotheses for consistency with stored information.
Availability: www.hybrow.org
Contact: nigam@psu.edu
1 INTRODUCTION
To expand understanding of a biological system, an experimentalist (1) formulates hypotheses about relationships that exist within that system, (2) gathers information from various repositories about the components of the system, (3) evaluates the hypotheses to assess whether they are supported or contradicted by this information, (4) revises hypotheses as needed, (5) perturbs the system in informative ways and (6) integrates all available information to deepen the understanding of how the system works. Understanding grows as hypotheses are accumulated.

To whom correspondence should be addressed.

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint first authors.
Current investigations of signal transduction and gene regulation generate large volumes of data, making it increasingly difficult to assemble and organize all the information needed to test hypotheses. To complicate matters, the various kinds of data reside in a wide array of repositories and are stored in different formats. To help biologists make effective use of increasing amounts of diverse data, we have developed the HyBrow (Hypothesis Browser) system to aid in the hypothesis formulation and evaluation cycle.
Most bioinformatics tools are designed to perform specific analytical functions. These tools carry out tasks, such as identifying patterns, categorizing information and probing data sources for similarities. However, information synthesis has remained solely the purview of the biologist (Kuchinsky et al., 2002). To design a system that will support the tasks of formulating and evaluating hypotheses for consistency with prior knowledge, we must address several issues. We must specify a representation for hypotheses that is both machine-understandable and accessible to the experimental biologist. We must choose a conceptual model and methods for storing existing information about the biological system in that model. Finally, we must develop a framework that supports the evaluation of hypotheses with respect to the stored information.
Organisms are (and contain) complex systems which are incompletely understood. Such systems are more readily represented by specifying the events that occur in them than by writing differential equations for the system's constituent reactions because sufficient detailed kinetic information is generally not available (Ho, 1989). Biological events are changes in a biological system for which we can obtain experimental evidence. In a previous publication, we described the development of a framework for conceptualizing biological processes in terms of events, formulating hypotheses about them and evaluating the hypotheses (Racunas et al., 2003).
In order to represent hypotheses about a biological process in a machine-understandable format, it is necessary to create a vocabulary of objects (agents) and processes, and define the relationships in which these entities can participate. We refer to this vocabulary as the hypothesis ontology and in our current work we construct and populate such an ontology for a simple test system. We describe biological events by naming the agents from the ontology (such as proteins and nucleic acids) and the processes (such as "binds") that connect them. We use the term hypothesis event to represent an abstract biological event. Thus, an hypothesis event consists of an acting agent (a subject, such as a protein), a relationship (a verb, such as induce, repress, ...), a target agent (an object, a gene, protein, ...), the experimental and cellular contexts in which the event takes place, and a set of associated conditions (such as the presence or absence of other agents), which can accompany the event. This event-based framework, together with our hypothesis ontology, allows us to represent hypotheses in a formal language that specifies the time and context-dependent relationships among the system's objects and processes (Sudkamp, 1988; Racunas et al., 2003).
Bioinformatics 20(Suppl. 1) © Oxford University Press 2004; all rights reserved.
HyBrows event-based framework includes methods to
evaluate such formal language hypotheses for internal con-
sistency and agreement with existing knowledge (Racunas
et al.,2003).Consistency of an hypothesis with observed data
and prior knowledge is evaluated by applying constraints and
rules.Constraints specify classes of forbidden events.Rules
are the operations performed upon available information in
order to enforce the constraints.Rules generate judgments
of support or conict,depending upon whether or not an
assertion is supported by existing knowledge.The framework
also includes neighborhood functions to establish similar-
ity between hypotheses.These facilitate hypothesis revision
through the automatic generation of neighboring hypotheses
that are variants of an original hypothesis.Neighborhood
functions use biologically acceptable notions to generate sets
of variant events for events that conict with existing data or
prior knowledge.We examine these variants to nd more t-
tingevents andreplace conictedevents withsuperior variants
to produce hypotheses that better t the stored information.
HyBrows event-based framework makes it possible for the
biologists to deal strictly with experimental evidence and
to avoid the unintended assertions that are common artifacts
of statistical and equation-based approaches.In HyBrows
framework,hypotheses and evaluation methods are directly
compatible with the way information is conceptualized by
biologists,making it easier to tap the expertise of experienced
biologists.Finally,and most importantly,HyBrows frame-
work makes it possible to bring together many kinds of data
and information in a unied formal language.The inability
to combine information sources has been a stumbling block
for computational models of biological systems,leading cur-
rent information integration efforts to focus on only one or
two categories of information (Hartemink et al.,2001;Segal
et al.,2003).
In this paper, we demonstrate HyBrow's information synthesis capability using the galactose metabolic and regulatory network, which we chose because abundant data and information of many different types are publicly available for this system (Ideker et al., 2001). We designed a small hypothesis ontology appropriate for the GAL system. We specified the formal grammar that describes how to combine terms from the ontology into hypotheses. We designed a database to store yeast GAL data structured in the ontology and developed hypothesis composition, visualization and evaluation software.
To test the HyBrow prototype, we evaluated and ranked hypotheses about the GAL system. During evaluation, HyBrow assayed all stored data for conflicting or supporting evidence for each statement in each hypothesis. HyBrow modified hypotheses that contained errors to generate variants with fewer flaws. Finally, HyBrow combined the resulting determinations of conflict and support to generate evaluations and rankings for all the original and variant hypotheses.
2 RESULTS
2.1 Hypothesis ontology
Common ontologies for biological objects and processes (Schulze-Kremer, 1998; Ashburner et al., 2000) are being developed to support the intercommunication of diverse databases as well as enable automated annotation and extraction of information from the literature (Fleischmann et al., 1999; Novichkova et al., 2003). Ontologies also provide a foundation for the construction of higher level models of biological systems (Rzhetsky et al., 2000; Peleg et al., 2002). Models vary from abstract Boolean (Akutsu et al., 2000) and Bayesian networks (Hartemink et al., 2001) to highly specific (McAdams and Arkin, 1998) and quantitative models (Sveiczer et al., 2000). Currently, most databases do not store information in an explicit ontology that facilitates modeling, and groups that design ontologies (Rzhetsky et al., 2000; Peleg et al., 2002) do not store all relevant data structured in these ontologies. Hence, efforts aimed at integrating diverse information sources need to choose or design a representation scheme and convert existing data into that representation. Although specialized ontologies exist (Karp, 2000; Rzhetsky et al., 2000), there is a need for an ontology that allows users to represent biological processes in an event-based manner. Such an ontology should be compatible with existing ones so that hierarchical relationships can be made between terms in existing ontologies and the hypothesis ontology (Karp, 2000; Rzhetsky et al., 2000).
We used Protégé (Crubézy et al., 2003) to design a small hypothesis ontology for representing GAL system information in HyBrow's event-based conceptual framework (Racunas et al., 2003). For guidance, we relied upon the principles used to design the Rzhetsky and the Bioprocess ontologies (Rzhetsky et al., 2000; Peleg et al., 2002). Our ontology (Fig. 1) accommodates currently available literature data, extracted primarily from Yeast Proteome Database (YPD) (Proteome, 2001, http://www.proteome.com/YPDhome.html) at a coarse level of resolution. An event consists of an acting agent (the subject, such as gene, RNA, protein), a target agent (the object, such as a gene, protein, complex), a relationship (the verb, such as induce, repress, bind), a context in which the event takes place and an optional set of associated conditions (such as the presence or absence of other agents) that accompany the event. The construction of events from elements of the ontology, event sets from events and hypotheses from event sets is governed by a context-free grammar. Events that occur in the same context are combined to form event sets and an hypothesis consists of event sets linked by logical and temporal operators. An hypothesis must contain at least one event set and each event set must contain at least one event. Please refer to www.hybrow.org for a formal specification of this grammar.

Fig. 1. An overview of the ontology used to represent data in an event-centered way. Operators are the relationships that can exist between agents.
Contexts specify where events occur in the cell and under what genetic conditions they occur. Our contexts are derived from established ontologies. For example, terms for specifying physical locations in the cell come from the cellular component division of the Gene Ontology (GO). We currently support genes, proteins, mRNA, small molecules, and complexes of proteins, small molecules and mRNA as agents in our prototype. We define three main categories of relationships: logical (e.g. induce), biochemical (e.g. phosphorylate) and physical (e.g. bind). The key design principle is that the ontology describes a regulatory system in an event-based way consistent with our evaluation framework. Our current hypothesis ontology allows representation of events such as "Gal4p binds to the promoter of the gal1 gene in the presence of galactose in wild type Saccharomyces cerevisiae". Depending on the resolution of the ontology, this approach can represent anything from simple protein phosphorylation to the entire cell cycle (GKB, 2003, http://www.genomeknowledge.org/). Formal presentation of the complete ontology is available at www.hybrow.org.
2.2 Inference rules and constraints
We have dened constraints and the rules that determine
whether or not a constraint is satised for each relationship
expressible with terms from our ontology.We dene sev-
eral categories of constraints.Ontology constraints determine
what agents can participate in which types of biological rela-
tionships.For example,a gene cannot transport a gene,but
a protein can transport a small molecule or another protein.
Data constraints determine what data values are valid for a
particular relationship.For example,for the relationship pro-
tein A binds to the promoter of gene B,it is acceptable for
protein A to be annotated as localized in the nucleus or cyto-
plasm but not on the cell membrane.Existence constraints
require an agents presence before it can enter a relation-
ship.For example,a protein cannot performits function when
its gene has been deleted.Temporal constraints govern the
transmission of modications made to an agent by previous
events.For example,event X phosphorylates Y implies
that in all subsequent events Y is phosphorylated (unless a
dephosphorylation event occurs).
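
Two of these constraint categories lend themselves to a short sketch: an ontology constraint expressed as a lookup of permitted type combinations, and a temporal constraint that carries phosphorylation state forward through an ordered list of events. The table contents, function names and tuple layout below are assumptions chosen for illustration, not HyBrow's actual rule library (which is written in Perl).

```python
# Hypothetical table of permitted (agent type, relationship, target type) triples.
ALLOWED = {
    ("protein", "transport", "small molecule"),
    ("protein", "transport", "protein"),
    ("protein", "phosphorylate", "protein"),
}


def satisfies_ontology_constraint(agent_type, relationship, target_type):
    """True if this combination of agent types is permitted for the relationship."""
    return (agent_type, relationship, target_type) in ALLOWED


def propagate_modifications(events):
    """Temporal constraint: track which agents are phosphorylated over time.

    `events` is a list of (agent, relationship, target) tuples in time order.
    Returns, for each event, the set of agents phosphorylated before it occurs.
    """
    phosphorylated = set()
    states = []
    for agent, relationship, target in events:
        states.append(frozenset(phosphorylated))
        if relationship == "phosphorylate":
            phosphorylated.add(target)
        elif relationship == "dephosphorylate":
            phosphorylated.discard(target)
    return states


timeline = [("X", "phosphorylate", "Y"),
            ("Y", "bind", "Z"),
            ("P", "dephosphorylate", "Y"),
            ("Y", "bind", "Z")]
states = propagate_modifications(timeline)
```

In this toy timeline, Y counts as phosphorylated for the second and third events and no longer for the fourth, matching the "unless a dephosphorylation event occurs" clause.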
Each rule has sections that correspond to the different constraints that exist in HyBrow. The first section deals with ontology constraints, the second, with constraints on annotation data in GO (Ashburner et al., 2000) format, the third deals with literature-extracted information structured in the ontology and the fourth with constraints on the specific data type(s) for a relationship, such as promoter sequence in the case of the "binds to promoter" relationship. For each constraint that is violated in any section, the event is assigned a conflict. For each constraint that is satisfied, the event is assigned a support. If a constraint is neither violated nor supported, a "cannot comment" is assigned. Sections 1–3 can be generalized because they have a common structure for different relationships, and the operations to be performed on the data are very similar. Section 4 is very specific because of the different ways in which different data types for each relationship must be used. There are additional general sections that enforce existence and temporal constraints. For example, the rule for "protein A binds to promoter of gene B" has the following sections: (1) check if A is a protein or a protein complex and if B is a gene; (2) check whether protein A is annotated (a) to have the molecular function of a transcriptional activator or repressor, (b) to be involved in the biological process of transcriptional regulation and (c) to have a nuclear localization; (3) determine whether the literature reports the postulated event; (4) search the promoter of gene B for a binding site for protein A; (5) ensure that the event is not postulated in a genetic context where the gene for protein A is knocked out.
Rules are coded in Perl as hierarchical function libraries to keep the rule set extensible and flexible. Most of the constraints enforced by the generalized sections are stored in database tables, which are queried at run time, allowing flexibility for changing the stringency of the constraints. A more detailed description of the rule library is provided at www.hybrow.org.
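
The sectioned rule for "protein A binds to promoter of gene B" can be approximated as follows. This is a Python sketch rather than the Perl library itself; the in-memory knowledge base `kb`, its keys, the toy motif and promoter strings, and the reduction of the annotation section to a single nuclear-localization check are all assumptions made for the example. Binding-site search is simplified to substring matching.

```python
SUPPORT, CONFLICT, CANNOT_COMMENT = "support", "conflict", "cannot comment"


def judge(value):
    """Map True / False / None (no data available) to a judgment."""
    if value is None:
        return CANNOT_COMMENT
    return SUPPORT if value else CONFLICT


def binds_to_promoter_rule(protein, gene, kb, knocked_out=()):
    """Evaluate 'protein binds to promoter of gene', one judgment per section."""
    results = {}
    # (1) ontology constraint: are the agent types acceptable?
    results["ontology"] = judge(
        kb["types"].get(protein) in {"protein", "protein complex"}
        and kb["types"].get(gene) == "gene")
    # (2) annotation: simplified here to a nuclear-localization check
    loc = kb["localization"].get(protein)
    results["annotation"] = judge(None if loc is None else loc == "nucleus")
    # (3) literature: does a stored report assert (or deny) the event?
    results["literature"] = judge(
        kb["literature"].get((protein, "binds to promoter", gene)))
    # (4) data: is a binding-site motif present in the promoter sequence?
    motif = kb["motifs"].get(protein)
    promoter = kb["promoters"].get(gene)
    found = None if motif is None or promoter is None else motif in promoter
    results["data"] = judge(found)
    # (5) existence: the acting protein must not be knocked out
    results["existence"] = CONFLICT if protein in knocked_out else SUPPORT
    return results


# Toy knowledge base; the sequences are placeholders, not real yeast data.
kb = {
    "types": {"Gal4p": "protein", "Gal3p": "protein", "gal1": "gene"},
    "localization": {"Gal4p": "nucleus", "Gal3p": "cytoplasm"},
    "literature": {("Gal4p", "binds to promoter", "gal1"): True},
    "motifs": {"Gal4p": "CGGTTT"},
    "promoters": {"gal1": "AAACGGTTTAAA"},
}
gal4_result = binds_to_promoter_rule("Gal4p", "gal1", kb)
gal3_result = binds_to_promoter_rule("Gal3p", "gal1", kb)
```

Keeping one judgment per section, instead of a single pass/fail value, preserves the information needed later to report which data source triggered a conflict.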
2.3 Database and information gathering
At the heart of HyBrow is the idea that disparate kinds of information can be represented in a unified formal language. Biological information residing in the published literature and electronic databases is expanding at an accelerating rate. Retrieving information and translating it into our ontology presents several problems because the information is in different repositories and storage formats. Further, only a fraction of the published literature is available electronically. The problem of automating extraction of information from the literature is being addressed by a number of research groups, but is far from solved (Andrade and Bork, 2000; Rzhetsky et al., 2000; IBM, 2002). The most promising approach appears to be MedScan (Novichkova et al., 2003), which can parse literature abstracts to identify biological events. But information extraction is still largely manual, practiced by annotators who read papers for relevant concepts and information.
In this work, we adopted different approaches to gather and structure data in our ontology. For data with standardized representation formats, we designed user agents to access the existing public repositories and retrieve desired information. For example, we designed a user-agent to retrieve promoter sequences from the S. cerevisiae Promoter Database (Zhu and Zhang, 1999). In most cases, we were able to access well-annotated information from the Saccharomyces Genome Database (SGD, http://genome-www.stanford.edu/Saccharomyces/) (Cherry et al., 1998) directly. We used YPD (Proteome, 2001) to access curated literature information about S. cerevisiae genes and proteins. We designed a form-based layout for gathering biological information from YPD reports and filled in predefined table fields compatible with the ontology from specific fields of the YPD report. This process is easily automated if direct access to database tables is obtained and can be extended to frame-based loading forms like the EcoCyc database (Karp and Riley, 1999). For quantitative data, such as that from microarray expression profiling experiments, we converted the data into our table format using custom Perl scripts. If microarray data are structured in the MAGE object model, this task is more straightforward (Spellman et al., 2002).

Fig. 2. Screen shots of the visual and widget interface for constructing hypotheses.
We designed a MySQL database and mapped our ontology onto the database for easy extension as our ontology evolved. We created a table in the database for each class in the ontology, at the finest level of resolution. The table has fields for properties, called "slots" and "facets" in the ontology, of the relevant class. This creates more tables than would be present in a well-normalized relational database. However, a prototyping effort requires the backend to be easily modified in response to changes in the ontology. The backend also contains tables to store constraints used during evaluation. A detailed database schema is found at www.hybrow.org.
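
The one-table-per-class mapping can be illustrated with a few lines of Python. The sketch uses the standard-library `sqlite3` module as a stand-in for MySQL, and the class names and slot names in `ONTOLOGY` are invented for the example; the actual schema is the one documented at www.hybrow.org.

```python
import sqlite3

# Hypothetical ontology classes and their slots (not the published schema).
ONTOLOGY = {
    "protein": ["name", "localization", "molecular_function"],
    "gene": ["name", "promoter_sequence"],
}


def create_tables(conn, ontology):
    """Create one table per ontology class, with one TEXT column per slot."""
    for cls, slots in ontology.items():
        columns = ", ".join(f"{slot} TEXT" for slot in slots)
        # Table and column names come from the trusted ONTOLOGY dict above.
        conn.execute(f"CREATE TABLE {cls} ({columns})")


conn = sqlite3.connect(":memory:")
create_tables(conn, ONTOLOGY)
conn.execute("INSERT INTO protein VALUES (?, ?, ?)",
             ("Gal4p", "nucleus", "transcriptional activator"))
rows = conn.execute(
    "SELECT name FROM protein WHERE localization = ?", ("nucleus",)).fetchall()
```

Because the tables are generated from the ontology description, adding a class or a slot only requires editing `ONTOLOGY` and rebuilding, which is the kind of easy modification the prototype trades normalization for.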
2.4 User interfaces
An important feature of HyBrow is that it is easy for the user to construct a machine-readable formal language hypothesis. We have created two interfaces for this purpose, a visual interface (Fig. 2, upper panel) and a widget interface (Fig. 2, lower panel). Our visual interface allows users to construct hypotheses using a visual notation designed in accordance with the proposed conventions (Cook et al., 2001). This interface allows users to draw diagrams that are then automatically translated into hypotheses. The widget interface allows the user to write hypotheses in English-like sentences constructed using subject/verb/object selection menus. A user can construct portions of an hypothesis using different interfaces and then combine them. Details on how to use the tools are provided at www.hybrow.org. Hypotheses are saved to local files and then submitted for evaluation via the Web.

Fig. 3. [Diagram labels: User; Browser; Visual; Widget; Hypothesis file; Server; Hypothesis parser and ranking rules; Event Handler; Inference rules; Justification routines; Neighboring events generator; Result formatter; Database.] The visual or the widget interface is used to design hypotheses, which are sent to the server via a browser. The hypothesis parser is the entry point for the system and it uses the event handler, which manages the event library, ranking, justification and event neighborhood generation. The database stores the different data structured into events.
2.5 The hypothesis evaluation process
The hypothesis evaluation process is illustrated diagrammatically in Figure 3. When HyBrow receives an hypothesis, it checks the connections between events and event sets for conformity with the hypothesis grammar. If the hypothesis passes these tests for syntax, each event is then checked for validity using the appropriate rule for the relationship proposed in the event. For each event, a support, conflict or "cannot comment" result corresponding to each of the four sections of the inference rules is returned. Finally, the support and conflict calls are tallied based upon the logical structure of the hypothesis. Each "and" between event sets leads to the inclusion of results from both sets in the final tally. For each "or" connection, the better set is chosen using a hierarchical set of rules. [Sample rules: (1) an event set with conflicts is better than an event set with more conflicts and worse than one with fewer conflicts. (2) An event set for which all events have at least some support is better than an event set for which at least one event is not supported. (3) If one event set's support is a strict superset of another event set's support, the superset is superior...]. We apply these rules sequentially until one of the rules returns a clear decision.
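
The three sample rules and the "and"/"or" tallying can be sketched as below. Several details are assumptions made for this illustration: each evaluated event is a dict with a set of supporting sources and a conflict count, the support tally counts sources, and an undecided "or" falls back to the left-hand set. The full rule hierarchy used by HyBrow is longer than the three rules shown.

```python
def compare_event_sets(a, b):
    """Return 'a', 'b' or None (undecided), applying the sample rules in order.

    Each argument is a list of evaluated events, each a dict with
    'support' (set of supporting sources) and 'conflicts' (int).
    """
    # Rule 1: fewer conflicts is better.
    conflicts_a = sum(e["conflicts"] for e in a)
    conflicts_b = sum(e["conflicts"] for e in b)
    if conflicts_a != conflicts_b:
        return "a" if conflicts_a < conflicts_b else "b"
    # Rule 2: a set in which every event has some support is better.
    full_a = all(e["support"] for e in a)
    full_b = all(e["support"] for e in b)
    if full_a != full_b:
        return "a" if full_a else "b"
    # Rule 3: a strict superset of support is superior.
    support_a = set().union(*(e["support"] for e in a))
    support_b = set().union(*(e["support"] for e in b))
    if support_a > support_b:
        return "a"
    if support_b > support_a:
        return "b"
    return None


def tally(left, operator, right):
    """Return (support, conflicts) for two event sets joined by 'and' or 'or'."""
    if operator == "and":
        chosen = left + right            # results from both sets are included
    elif operator == "or":
        chosen = right if compare_event_sets(left, right) == "b" else left
    else:
        raise ValueError(f"unknown operator: {operator}")
    return (sum(len(e["support"]) for e in chosen),
            sum(e["conflicts"] for e in chosen))


set1 = [{"support": {"literature"}, "conflicts": 1}]
set2 = [{"support": {"literature", "annotation"}, "conflicts": 0}]
set3 = [{"support": {"literature"}, "conflicts": 0}]
set4 = [{"support": set(), "conflicts": 0}]
```

With these toy sets, `tally(set1, "and", set2)` counts both sets, while `tally(set1, "or", set2)` keeps only `set2` because rule 1 already decides in its favor.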
For each event, the hypothesis evaluation process finds all conflicts with existing knowledge and indexes them, along with their sources. These are reported to the user to allow them to identify specific problems with the hypothesis and the conflicting data source. For each event that has a conflict, a set of variant events is generated using biologically motivated heuristics, such as replacing the acting agent with agents that share a sequence similarity or share a similar cellular localization and sequence similarity with the original agent. Neighboring hypotheses that share the logical structure of the original are generated by replacing conflicting events with the best variant events. These neighboring hypotheses are then evaluated, and if a better (more supported, less conflicted) hypothesis is found, it is presented to the user.
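
A minimal sketch of this variant-substitution step follows. The `similar_agents` table stands in for whatever similarity heuristic is used and is purely hypothetical (it makes no claim about real sequence similarity), the `EVIDENCE` numbers are invented, and scoring an event as (fewer conflicts first, then more support) is one plausible reading of "more supported, less conflicted", not a documented formula.

```python
def best_variant(event, candidates, score):
    """Swap the acting agent for each candidate; keep the best-scoring event."""
    agent, relationship, target = event
    variants = [(c, relationship, target) for c in candidates if c != agent]
    if not variants:
        return event
    best = max(variants, key=score)
    return best if score(best) > score(event) else event


def neighboring_hypothesis(events, conflicted, similar_agents, score):
    """Replace each conflicted event with its best variant, preserving order."""
    return [
        best_variant(e, similar_agents.get(e[0], []), score) if e in conflicted else e
        for e in events
    ]


# Hypothetical evaluation results: event -> (support count, conflict count).
EVIDENCE = {
    ("Gal3p", "binds to promoter", "gal1"): (3, 1),
    ("Gal4p", "binds to promoter", "gal1"): (3, 0),
}


def score(event):
    """Fewer conflicts ranks first, then more support (assumed ordering)."""
    support, conflicts = EVIDENCE.get(event, (0, 0))
    return (-conflicts, support)


events = [("Gal2p", "transports", "galactose"),
          ("Gal3p", "binds to promoter", "gal1")]
conflicted = {events[1]}
similar_agents = {"Gal3p": ["Gal4p"]}   # toy stand-in for a similarity search
neighbor = neighboring_hypothesis(events, conflicted, similar_agents, score)
```

Only events flagged as conflicted are touched, so the neighboring hypothesis keeps the same number and order of events as the original.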
After evaluation, the user is shown (1) the support and conflict totals; (2) the least conflicted, most supported event sets that fit the logical structure of the hypothesis; (3) a support-conflict scatter plot of neighboring hypotheses automatically generated from the user submitted hypothesis; and (4) a list of all events that had conflicts, the data that triggered the conflicts, an explanation of why the rules interpret that data as a conflict, and a reference to the original article or data source. The results pages (Fig. 4) allow a user to gauge the fitness of his/her hypothesis in the light of all stored data. Iterative refinement of the hypothesis allows the user to reconcile all stored data into a single coherent representation whose level of detail depends on the resolution of the ontology used for constructing hypotheses.
2.6 Test runs with sample hypotheses
In order to test the prototype, we composed hypotheses about the GAL system and ranked them. The GAL system consists of genes that transport and metabolize galactose and the regulatory network that controls whether the pathway is on or off (Lohr et al., 1995). The process involves three types of proteins as follows: (a) a permease (Gal2p) that transports galactose (encoded by the gal2 gene); (b) proteins that utilize intracellular galactose, galactokinase (encoded by gal1), uridylyltransferase (encoded by gal7), epimerase (encoded by gal10) and phosphoglucomutase (encoded by gal5); and (c) the regulatory proteins Gal3p, Gal4p and Gal80p, which exert transcriptional control over the genes encoding the transporter, the enzymes and to some extent, their own genes (Ideker et al., 2001). HyBrow successfully identified the hypothesis that best explained the current understanding about GAL system regulation (Lohr et al., 1995; Ideker et al., 2001). For six of the seven events that had conflicts, HyBrow was also able to suggest corrections successfully that increased agreement with stored information. All hypotheses used and explanations of their evaluations can be found at www.hybrow.org.
Here, we describe the evaluation of a simple illustrative hypothesis as follows: "Gal2p transports galactose into the cell at the cell membrane. In the cytoplasm, galactose activates Gal3p. Gal3p binds to the promoter of the gal1 gene and induces its transcription in the presence of galactose." This hypothesis was decomposed into events as shown in Figure 4A. On evaluation, HyBrow reported support from literature and GO annotation for event number 0 (ev0), support from literature for ev1, support from ontology constraints and annotation for ev2 and support from the ontology, literature and data sections for ev3. It reported a conflict for ev3 (which is marked in red) from the annotation rule section because Gal3p is annotated to be primarily in the cytoplasm (SGD, 2003). HyBrow then searched for variant events. For ev3, it found an event (Gal4p binds to promoter of gal1) with higher support and for ev4 it found the more meaningful event [Gal4p induces gal1 in nucleus in wild-type (wt) in presence of galactose] with the same support but no conflict. These events were inserted in place of the original events to create a neighboring hypothesis that is better than the original hypothesis (Fig. 4B and C).

Fig. 4. Screen shot of the results page: see text for a more detailed description.
When a submitted event contains a perturbation, such as the deletion of a gene, HyBrow identifies the agents disabled because of the perturbation and infers a conflict with events that depend on these agents. For example, if the submitted event is "Gal3p induce gal1 in nucleus in gal3-K/O", HyBrow reports a conflict. (The event "Gal3p not induces gal1 in nucleus in gal3-K/O" gets support.) Some of the inferences suggested by HyBrow are obvious for the small GAL system, but HyBrow's ability to automate the process offers a substantial advantage for systems containing large numbers of genes and proteins.
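
The knockout example can be expressed as a small check. Everything here is a simplification for illustration: the `gene_for` naming convention (protein "Gal3p" is encoded by gene "gal3") is an assumption that only works for names of this form, and returning "cannot comment" when no acting agent is disabled reflects only that this check, taken alone, has nothing to say.

```python
def gene_for(protein):
    """Hypothetical naming convention for this sketch: 'Gal3p' -> 'gal3'."""
    return protein[:-1].lower()


def judge_under_knockouts(agent, negated, knocked_out_genes):
    """Judge an event from the perturbation alone.

    `negated` is True for events of the form 'agent NOT relationship target'.
    """
    disabled = gene_for(agent) in knocked_out_genes
    if not disabled:
        return "cannot comment"      # the perturbation does not affect the agent
    # A disabled agent cannot act, so a positive event conflicts with the
    # perturbation and a negated event is supported by it.
    return "support" if negated else "conflict"


knockouts = {"gal3"}
positive = judge_under_knockouts("Gal3p", negated=False, knocked_out_genes=knockouts)
negative = judge_under_knockouts("Gal3p", negated=True, knocked_out_genes=knockouts)
unaffected = judge_under_knockouts("Gal4p", negated=False, knocked_out_genes=knockouts)
```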
3 DISCUSSION AND FUTURE WORK
HyBrow supports the construction, proofreading and evaluation of hypotheses expressed in familiar diagram or intuitive text-based formats to aid in synthesizing data into working models. HyBrow's methodology is evaluation-based. Thus, unlike systems that construct statistical or equation-based models, HyBrow is able to provide explicit reasons (and references) for its output. However, HyBrow neither forces the user to accept all the stored data nor judges the validity of stored information. Rather, it gives the user links to the exact source of each conflict, leaving it up to the user to judge the relative merits of information sources. The user can choose to ignore conflicts from data sources deemed unreliable.
HyBrow differs fundamentally from existing efforts such as EcoCyc, modeling biological processes as workflows and Genome Knowledgebase (GKB). EcoCyc is designed using an explicit ontology for biological function and facilitates functional querying. However, it lacks the notion of an hypothesis or a formal framework to evaluate and rank alternative statements about a biological process (Karp et al., 2002). Modeling of biological processes as workflows by Altman's group includes some of HyBrow's features, but the underlying conceptual model (which uses hybrid Petri nets) does not support hypothesis neighborhoods and the analysis of models has to be done manually (Peleg et al., 2002). GKB (2003) is an effort to structure biological knowledge in an event-centered data model. It is not a modeling framework by itself, but serves as a public source of structured data which efforts like ours can use.
In our test runs, HyBrow identified the least conflicted hypothesis accurately and suggested valid corrections for events with conflicts. We believe that we can build upon this success and plan to extend and strengthen HyBrow in several ways. Currently, improvements to hypotheses are suggested using neighboring events generated using simple heuristics, while our conceptual framework supports neighborhood functions that create similar event sets from a given event set (Racunas et al., 2003). Extending HyBrow to use neighborhoods of event sets as well as of events requires new evaluation routines to track all the biochemical and other modifications that an event set generates and to ensure that the neighboring event sets satisfy them. In future work, we will explore biological notions of similarity between event sets and modify our neighborhood functions and evaluation routines accordingly. The current rule library contains rules for extrapolations in the presence of perturbations such as gene knock-outs and constitutive over-expression. In the future, we will be able to include extrapolations for more categories of perturbations. Our current implementation can only propagate temporal constraints about the presence or absence of biochemical modifications. We intend to propagate constraints about activation/inhibition and induction/repression in an attempt to model how an event affects downstream agents. To this end, we will extend the current ontology to include "modification state" and "activation state" descriptors for agents; events will then be able to modify these state descriptors. During the ontology extension, we will also define more operators (such as ubiquitinate and methylate) and introduce temporal operators, a new category of relationship that allows for stating relations such as "precedes", "after" and "until" between events.
As our ontology becomes more complex and rened,
reorganization and reloading of the underlying database is
unavoidable.The manual gathering and loading of informa-
tion can become a major bottleneck and hence we need
automatic ways to access and structure information.This is
possible when the exchange format for the data type under
question is standardized or special semantic ontologies like
TAMBIS are used (Stevens et al.,2000).Three alternative
approaches have been proposed in the community to address
the task of integrating frequently changing databases independent
of the data sources' storage format. The first is the
BioMoby project, whose objective is to develop a web service
infrastructure for biological data sources (Wilkinson and
Links, 2002). The second approach is to design database
transformers or wrappers and a set of drivers for extracting
a particular data type from a database for view integration
or constructing a data warehouse. The third alternative uses
Grid architecture to avoid local storage and provide distributed
access to biological data. However, the myGrid system, which
takes this approach, is at a conceptual stage at present and is
itself under development (Stevens et al., 2003). Solutions such
as K2, a view integration system, and federated information
integration schemes such as Genomics Unified Schema are needed
to automate data access to maintain and periodically update all
the information structured in the ontology (Davidson et al., 2001). Finally,
HyBrow will be extended to identify events that are frequently
specified, but for which evaluation was not possible. Identifying
such events will allow HyBrow to aid experiment design.
For instance, if many users include an event in their hypotheses
and there is no experimental evidence for it, HyBrow
can indicate a need to obtain such data.
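The idea of flagging frequently specified but unevaluable events can
be sketched in a few lines of Python. This is only an illustration
under stated assumptions, not HyBrow's implementation: the function
name events_needing_data, the representation of a hypothesis as a list
of event labels, the set of evaluable events and the min_count
threshold are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def events_needing_data(
    hypotheses: Iterable[Iterable[str]],
    evaluable: Set[str],
    min_count: int = 2,
) -> List[Tuple[str, int]]:
    """Return (event, count) pairs for events that appear in at least
    min_count hypotheses but have no stored evidence to evaluate them.

    Results are sorted by descending count, then by event label.
    """
    counts: Counter = Counter()
    for events in hypotheses:
        # Count each event at most once per hypothesis.
        counts.update(set(events))
    flagged = [
        (event, n)
        for event, n in counts.items()
        if n >= min_count and event not in evaluable
    ]
    return sorted(flagged, key=lambda pair: (-pair[1], pair[0]))


# Made-up event labels, for illustration only.
submitted = [["A", "B"], ["A", "C"], ["A", "B", "B"], ["C"]]
print(events_needing_data(submitted, evaluable={"B"}))
```

Counting each event once per hypothesis keeps a single hypothesis that
repeats an event from inflating its apparent demand; whether that is
the right choice would depend on how hypotheses are actually stored.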
4 CONCLUSIONS AND SUMMARY
Our implementation of HyBrow for the GAL system demonstrates
that ontology-driven, event-based modeling of biological
processes is feasible and that structuring data in
HyBrow's event-based framework facilitates computer-aided
hypothesis evaluation. HyBrow can accommodate both more
data and more types of data as they become available. Moreover, its
constraints can be elaborated as understanding about the biological
system grows. We believe that the approach we have
developed for this HyBrow prototype can significantly inform
experimentation by integrating large amounts of information
for the evaluation of hypotheses.
REFERENCES
Akutsu,T., Miyano,S. and Kuhara,S. (2000) Algorithms for identifying
Boolean networks and related biological networks based on
matrix multiplication and fingerprint function. J. Comput. Biol.,
7, 331–343.
Andrade,M.A. and Bork,P. (2000) Automated extraction of
information in molecular biology. FEBS Lett., 476, 12–17.
Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H.,
Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T.
et al. (2000) Gene ontology: tool for the unification of biology.
The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S.,
Hester,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M., Weng,S.
and Botstein,D. (1998) SGD: Saccharomyces Genome Database.
Nucleic Acids Res., 26, 73–79.
Cook,D.L., Farley,J.F. and Tapscott,S.J. (2001) A basis for a
visual language for describing, archiving and analyzing functional
models of complex biological systems. Genome Biol., 2,
RESEARCH0012.
Crubézy,M., Fergerson,R., Knublauch,H., Musen,M., Noy,N., Tu,S.
and Vendetti,J. (2003) Protege 2000. Stanford University, Palo
Alto, CA.
Davidson,S.B., Crabtree,J., Brunk,B.P., Schug,J., Tannen,V.,
Overton,G.C. and Stoeckert,C. (2001) K2/Kleisli and GUS: experiments
in integrated access to genomic data sources. IBM Syst. J.,
40, 512–531.
Fleischmann,W., Moller,S., Gateau,A. and Apweiler,R. (1999) A
novel method for automatic functional annotation of proteins.
Bioinformatics, 15, 228–233.
GKB (2003) Genome Knowledge Base.
Hartemink,A.J., Gifford,D.K., Jaakkola,T.S. and Young,R.A. (2001)
Using graphical models and genomic expression data to statistically
validate models of genetic regulatory networks. Pac. Symp.
Biocomput., 422–433.
Ho,Y.C. (1989) Special issue on discrete event dynamical systems:
editorial. Proc. IEEE, 77, 24–38.
IBM (2002) Discovery Link, IBM Corporation.
Ideker,T., Thorsson,V., Ranish,J.A., Christmas,R., Buhler,J.,
Eng,J.K., Bumgarner,R., Goodlett,D.R., Aebersold,R. and
Hood,L. (2001) Integrated genomic and proteomic analyses of
a systematically perturbed metabolic network. Science, 292,
929–934.
Karp,P.D. (2000) An ontology for biological function based on
molecular interactions. Bioinformatics, 16, 269–285.
Karp,P.D. and Riley,M. (1999) EcoCyc: the resource and the
lessons learned. In Letovsky,S. (ed.), Bioinformatics Databases
and Systems. Kluwer Academic Publishers, New York,
pp. 47–62.
Karp,P.D., Riley,M., Saier,M., Paulsen,I.T., Collado-Vides,J.,
Paley,S.M., Pellegrini-Toole,A., Bonavides,C. and Gama-Castro,S.
(2002) The EcoCyc Database. Nucleic Acids Res., 30, 56–58.
Kuchinsky,A., Graham,K., Moh,D., Creech,M., Babaria,K. and
Adler,A. (2002) Biological Storytelling: a software tool for biological
information organization based upon narrative structure.
6th International Working Conference on Advanced Visual Interfaces
(AVI'02), Trento, Italy, 22–24 May. pp. 331–341.
Lohr,D., Venkov,P. and Zlatanova,J. (1995) Transcriptional regulation
in the yeast GAL gene family: a complex genetic network.
FASEB J., 9, 777–787.
McAdams,H.H. and Arkin,A. (1998) Simulation of prokaryotic
genetic circuits. Annu. Rev. Biophys. Biomol. Struct., 27,
199–224.
Novichkova,S., Egorov,S. and Daraselia,N. (2003) MedScan, a
natural language processing engine for MEDLINE abstracts.
Bioinformatics, 19, 1699–1706.
Peleg,M., Yeh,I. and Altman,R.B. (2002) Modelling biological processes
using workflow and Petri Net models. Bioinformatics, 18,
825–837.
Proteome (2001) Yeast Proteome Database.
Racunas,S.A., Shah,N. and Fedoroff,N.V. (2003) A contradiction-based
framework for testing gene regulation hypotheses. IEEE
Computer Society CSB Conference, Stanford University, Palo
Alto, CA. IEEE Computer Society.
Rzhetsky,A., Koike,T., Kalachikov,S., Gomez,S.M.,
Krauthammer,M., Kaplan,S.H., Kra,P., Russo,J.J. and
Friedman,C. (2000) A knowledge model for analysis and
simulation of regulatory networks. Bioinformatics, 16,
1120–1128.
Schulze-Kremer,S. (1998) Ontologies for molecular biology.
Pac. Symp. Biocomput., 695–706.
Segal,E., Wang,H. and Koller,D. (2003) Discovering molecular
pathways from protein interaction and gene expression data.
Bioinformatics, 19, I264–I272.
SGD (2003) Saccharomyces Genome Database.
Spellman,P.T., Miller,M., Stewart,J., Troup,C., Sarkans,U.,
Chervitz,S., Bernhart,D., Sherlock,G., Ball,C., Lepage,M.
et al. (2002) Design and implementation of microarray gene
expression markup language (MAGE-ML). Genome Biol., 3,
RESEARCH0046.
Stevens,R.D., Robinson,A.J. and Goble,C.A. (2003) myGrid: personalised
bioinformatics on the information grid. Bioinformatics,
19, I302–I304.
Stevens,R., Baker,P., Bechhofer,S., Ng,G., Jacoby,A., Paton,N.W.,
Goble,C.A. and Brass,A. (2000) TAMBIS: transparent access to
multiple bioinformatics information sources. Bioinformatics, 16,
184–185.
Sudkamp,T.A. (1988) Languages and Machines. Addison-Wesley,
Reading.
Sveiczer,A., Csikasz-Nagy,A., Gyorffy,B., Tyson,J.J. and Novak,B.
(2000) Modeling the fission yeast cell cycle: quantized cycle times
in wee1-cdc25Delta mutant cells. Proc. Natl Acad. Sci., USA, 97,
7865–7870.
Wilkinson,M.D. and Links,M. (2002) BioMOBY: an open source
biological web services proposal. Brief Bioinform., 3, 331–341.
Zhu,J. and Zhang,M.Q. (1999) SCPD: a promoter database
of the yeast Saccharomyces cerevisiae. Bioinformatics, 15,
607–611.