A. SPECIFIC AIMS


The Gene Ontology Consortium Blake, Judith A.



The Gene Ontology Consortium (GOC) provides the scientific community with a consistent and robust
infrastructure for describing, integrating, and comparing the structures of genetic elements and the
functional roles of gene products within and between organisms. In just six years, its constituent
ontologies have become the de facto community standard for expressing, in a machine-usable form,
the biological domains of genome features, molecular function, biological process, and cellular
localization. The Gene Ontology (GO) provides a set of well-defined terms organized into specialization
and part-of hierarchies that are technology and data format neutral. This technical adaptability has led
to its adoption by a wide range of databases, and the GO has been integrated in a wide variety of
technical environments. Hence, the breadth and diversity of organisms annotated with the GO and
its associated Sequence Ontology (SO) continue to increase. This adaptability has also
encouraged its use for many unforeseen purposes, e.g., Natural Language Processing (NLP) and
Information Retrieval of the biomedical literature. The GOC will now increase the depth and taxonomic
breadth of the ontologies and associated annotations while keeping quality high, so that they may be
reliably used to draw inferences and translate knowledge across organisms. We will advance the
understanding of the molecular basis of human health and disease by focusing on the following key
aims to integrate and standardize biomedical and genomics information:

Aim 1: We will maintain comprehensive, logically rigorous and biologically accurate ontologies.
We will work closely with biological experts to ensure that the ontologies accurately reflect biological
reality. We will incorporate new relationship types into the ontologies as needed and we will recast
compound terms as explicit cross-products with orthogonal ontologies. We will keep the ontologies
logically rigorous so that, when they are used to query for terms associated with gene products, the
query will neither omit relevant annotations nor return incorrect annotations.

Aim 2: We will comprehensively annotate reference genomes in as complete detail as possible.
Genomes that are fully and reliably annotated empower scientific research and are essential for use in
automatic inference. We will annotate reference organisms selected according to the following criteria:
a large body of scientific literature exists; a reasonably sized community of researchers studies the
organism; the organism is important as an experimental model in the study of human disease;
and the organism has a high impact on discovery in the scientific community.

Aim 3: We will support annotation across all organisms.
Emerging genomes, or any collection of gene products (e.g. EST libraries or proteomic data), are best
understood in a comparative context; and inferring function from highly reliable sets of annotation on
related organisms, such as those provided by Aim 2, is the only practicable method to annotate the
less well-known genomes. We will provide a standardized, structured methodology for functional
annotation of emerging genomes. We will support and encourage the submission of functional
annotations to the central GOC repository from the broadest possible spectrum of organisms.

Aim 4: We will provide our annotations and tools to the research community.
Sharing the cumulative knowledge of the functional roles of each protein and non-coding RNA is the
primary goal of the GOC. Therefore, we will support the use of the GO by all researchers in functional
genomics, comparative biology, and other related fields. We will continue to provide all GO resources
publicly and freely to the research community.




B. BACKGROUND AND SIGNIFICANCE

B.1. THE ORIGIN AND DEVELOPMENT OF THE GENE ONTOLOGY CONSORTIUM 1998-2005.

The Gene Ontology (GO) was founded, in the fall of 1998, by researchers from three community Model
Organism Databases (MODs): Mouse Genome Informatics (MGI), Saccharomyces Genome Database
(SGD), and FlyBase. The objectives of the founders were very simple: to provide a common vocabulary
for the description of what gene products 'do', and to apply this vocabulary to annotate gene products in
these three databases. The motivation of the founders of the GO was two-fold. First, recording the
function of every gene product was an essential responsibility of the MODs. Second, this task would be
best done collaboratively, as it would then be more efficient, accurate and comparable. If these MODs
were to use a shared structured vocabulary for annotation then one could foresee a common query
interface and the de facto integration of these databases within this domain, a result greatly facilitating
comparative biology research. These considerations remain major motivations for the work of the GOC
[GOC]¹. Indeed the needs are even more urgent now: in 1998 only two eukaryote genomes (S.
cerevisiae, C. elegans) and 18 bacterial genomes had been published. As of December 2005 the
numbers are 53 (although not all of these are closed or finished) and over 200, respectively.

B.1.a. Growth of the Gene Ontology Consortium.

A series of small informal meetings in late 1998 and early 1999 between members of the three founding
MODs, generously supported by funding from AstraZeneca (through the good offices of Dr. Ken
Fasman), established the backbone of the Gene Ontology. The key decisions taken at these meetings
were: (a) to limit initially the scope of the GO to three independent sub-domains: the molecular function,
biological role and cellular location of gene products (be they protein or RNA); (b) to structure
vocabularies as directed acyclic graphs [DAGs] using just two relationships between terms, is_a
and part_of; (c) to provide each term with a dictionary style definition; (d) to provide a common database
and interface to the GO and annotations supplied by the MODs; (e) to maintain and track the history of
changes to each term; and (f) to provide all of the work of the GOC to the public without any constraint
whatsoever. Finally, we determined that immediate testing and usage would drive the work of the GOC,
solving immediate problems simply, without precluding increasing sophistication.
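
Decision (b) can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions, not GOC software: the term names and gene product names are invented placeholders, and the helpers `ancestors` and `gene_products_for` are hypothetical. It shows how a DAG built from is_a and part_of edges (where a term may have more than one parent) lets a query for a general term also retrieve gene products annotated to more specific terms.

```python
# Hypothetical mini-ontology: child term -> list of (relationship, parent term).
# Term names are illustrative placeholders, not taken from the GO files.
EDGES = {
    "nucleolus": [("part_of", "nucleus")],
    "nuclear_membrane": [("part_of", "nucleus"), ("is_a", "membrane")],
    "nucleus": [("is_a", "organelle")],
    "mitochondrion": [("is_a", "organelle")],
    "membrane": [("part_of", "cell")],
    "organelle": [("part_of", "cell")],
    "cell": [],
}

# Hypothetical annotations: gene product -> the term it is directly annotated to.
ANNOTATIONS = {
    "geneA": "nucleolus",
    "geneB": "mitochondrion",
    "geneC": "nuclear_membrane",
}


def ancestors(term, edges=EDGES):
    """Return every term reachable from `term` by following is_a or part_of edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def gene_products_for(term, annotations=ANNOTATIONS):
    """Gene products annotated to `term` itself or to any more specific term."""
    return sorted(
        gene
        for gene, annotated in annotations.items()
        if annotated == term or term in ancestors(annotated)
    )
```

In this sketch a query for "nucleus" returns the gene products annotated to "nucleolus" and "nuclear_membrane" as well, because both terms have "nucleus" among their ancestors.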

The first substantive application of the GO was made during the first annotation of the then newly
completed genome of Drosophila melanogaster during the Celera/Berkeley Drosophila Genome
Project (BDGP) annotation jamboree in November 1999 (Adams et al. 2000). Encouraged by the
success of this project and by the informal interest in the GO shown by others, we published, in May
2000, the first formal account of the GO in Nature Genetics (The Gene Ontology Consortium, 2000)
and, in March 2000, made our first application to the NIH for funding. This was awarded in full with a
start date of January 2001, and renewed in 2003.

This funding enabled the development of the GO (Appendix 11: Progress Report year 5). We are
delighted that all of the major model organism databases for eukaryotic organisms now annotate with
the GO. The phylogenetic range is about as wide as it could be, from protozoa such as Plasmodium
and Tetrahymena (in progress) to human and rice. We are disappointed that, for a variety of reasons
(primarily tradition and logistics), there has been less universal use of the GO for the annotation of
bacterial genomes, with the notable exception of the work done at The Institute for Genomic Research
[TIGR]. However, we are encouraged that increasingly many outside TIGR are now using GO, although
these projects have yet to deposit their data with the GOC (see, e.g., the annotations at [Pseudomonas]).
We have been in extensive discussion with many of the major players in this field, e.g., the Joint
Genome Institute's (JGI) Integrated Microbial Genomes database [IMG], the Sanger Institute's Pathogen
Sequencing Unit [PSU] (who do use GO for their eukaryotic genomes), the E. coli community and the
NIAID's Bioinformatics Resource Centers [NIAID]. We address these issues in more detail below (Aim
3).

¹ References within square brackets are listed in the "Acronyms and Web References" table provided
just before section D: Research Plan.

The use of the Gene Ontology by many major cross-species databases has grown. These include not
only the major GO annotation project (GOA) for UniProt at the European Bioinformatics Institute
([UniProt], Wu et al. 2006), the TIGR Comprehensive Microbial Resource [CMR] and the Sanger
Institute's GeneDB [GeneDB], but also the NCBI, where GO annotations are incorporated into
Entrez-Gene [NCBI], and the Protein Data Bank [PDB], which has recently released annotations of its
structures with GO terms. The GO is also incorporated into other open bioinformatics standards, e.g.
the BioPax Level 2 specification ([BioPax]), and support for GO is now provided in Cytoscape
(Ver. 2.2) ([Cytoscape]).

We are also encouraged by the very extensive use of the GO in industry, not only by most of the major
pharmaceutical companies and many small biotech companies, but also by companies offering
information services to the pharmaceutical industry (Table 1). This use of the GO has been paralleled
by the extraordinary development of 'third-party' tools to manipulate the GO or GO annotations: we are
now aware of over 70 different tools. Some are commercial (e.g. the DecisionSite Ontology Browser of
Spotfire [SPOTFIRE]), but most are open source [GO.tools]. These include ontology browsers, tools for
the annotation of proteins with GO terms, tools for the analysis of high throughput expression data and
tools for the use of GO in text mining.

Products that Use GO

NLP & Ontology products
  Biowisdom                http://www.biowisdom.com
  ReelTwo                  http://www.go-kds.com/go/index.html
  IBM, Japan               http://www.research.ibm.com/trl/projects/textmining/takmi/takmi_e.htm

Array products and data analysis
  Affymetrix               http://www.affymetrix.com/support/technical/manual/go_manual.affx
  BioMind                  http://www.biomind.com/arraygenius.htm
  Spotfire                 http://www.spotfire.com/developercentral/blog/index.cfm?commentID=20
  GenePilot                http://www.genepilot.com/examples/geneOntology.html
  DNA Array Systems        http://www.dnaarray.com/
  Molecular Station        http://www.molecularstation.com/bioinformatics/link/Genomics_Gene_Ontology_Tools/
  Proteome Software        http://www.proteomesoftware.com/Proteome_software_ed_bioinformatics.html
  Medicel                  http://www.medicel.com/?pid=products&second=methods&third=workflow&fourth=annotation_workflow
  Sapio Sciences           http://www.sapiosciences.com/go.html
  Avadis                   http://avadis.strandgenomics.com/productWalkthroughPage.html
  BioSieve                 http://www.biosieve.com/product.htm
  Molmine                  http://www.molmine.com/help/go/go.html
  MWG Genome Information   http://www.mwg-biotech.com/docs/upload_doc_literature/FLY011_EST.pdf
  Persistent               http://www.persistentsys.com/solutions/lifesciences/funcexpr.htm
  VizXlabs                 http://www.vizxlabs.com/docs/vizxlabs_gs_update_022503.pdf
  Seqexpress               http://www.seqexpress.com/
  Gene Pilot               http://www.genepilot.com/examples/geneOntology.html
  Macrogen                 http://www.macrogen.com/eng/biochip/express_summary.jsp
  Biomind                  http://www.biomind.com/
  Clontech                 http://www.clontech.com/clontech/products/literature/pdf/brochures/AbArray500.pdf
  Inpharmatica             http://www.inpharmatica.com/pdfs/Annotation_Express.pdf
  Ocimum Biosolutions      http://www.ocimumbio.com/docs/web/Genowiz.pdf
  Bioin4matrix             http://www.bioinformatrix.com/net/modules.php?name=News&file=article&sid=782
  Operon                   http://omad.operon.com/GO/go.php
  Elashoff Consulting      http://www.elashoffconsulting.com/

Reagents & services
  Actigenics               http://www.actigenics.com/site/articles_29.html
  Macrogen USA             http://www.macrogenusa.com/company/s_endncdna.jsp
  Sigma                    http://www.sigmaaldrich.com/Area_of_Interest/The_Americas/Canada/Sigma_Genosys_Canada/Gene_Expression.html
  ExactAntigen             http://www.exactantigen.com/gene/apoe/acpp.html
  Abnova Corporation       http://www.abnova.com.tw/product_search/PS_detail_partial.asp?catalog_id=H00000182-Q01&geneid=182&family_id=
  Mirus                    http://www.mirusbio.com/index.asp
  Admetis                  http://www.admetis.com/
  Treenomix                http://treenomix-com.canadawebhosting.com/cDNA-sequencing/default.aspx

Table 1. Some commercial products that incorporate or use the GO.


It was no coincidence that the idea of the GO came at just the time that Stanford researchers had first
shown the power of microarray analysis of gene expression. Indeed, those charged with the analysis of
microarray data are among the most intensive users of the GO, and many major manufacturers of both
gene expression and protein arrays include GO annotation of their probes (e.g., [AFFY], [CLONTECH]).
GO is even being used to describe commercial reagents (e.g., Abnova, a company that boasts over
10,000 monoclonal antibodies to human proteins [ABNOVA], and oligonucleotide libraries from
Sigma-Genosys [GENOSYS]) (Table 1). However, it has been surprising to us that the GO has found so
much use in the evolving NLP field (see, for example, Jenssen et al. 2001; Raychaudhuri et al. 2002;
Hirschman et al. 2005). Commercial tools for literature mining that incorporate the GO are now
beginning to be released (e.g. [AGILENT], [GO-KDS]), as has a public tool that classifies PubMed
abstracts with GO terms ([GOPUBMED]).

The idea of structured controlled vocabularies or ontologies was not new in 1998, even in biology,
as witnessed by the development of SNOMED ([SNOMED]) and the UMLS ([UMLS]). A very important
step was taken in 1993 by Monica Riley's development of a hierarchical controlled vocabulary for the
description of gene 'function' in E. coli (Riley, 1993). This has been developed as MultiFun [MultiFun],
to which the GO has been mapped in collaboration with Greta Serres at Woods Hole. At about the same
time, Overbeek, Maltsev and Gaasterland in the Argonne Group did pioneering work by developing the
PUMA (see [PUMA2]) and WIT resources ([WIT]). The FunCat project of the Munich Information Center
for Protein Sequences (MIPS) [FunCat] has also been mapped to the GO, in collaboration with MIPS
staff. Both MultiFun and FunCat are relatively small (505 and 1307 terms respectively); both are strict
subsumption hierarchies and both are very stable, not being regularly updated as biological knowledge
changes. Although derivatives of Riley's 1993 classification and of MultiFun have been widely used for
the annotation of bacterial genomes, Riley herself recognizes their limitations and has stated that the
GO is needed for this task (Serres et al. 2001). Significantly, the most recent analyses of the E. coli
K-12 genome use the GO (Riley et al. 2006; [EcoCyc]).

The accurate annotation of gene products with the GO depends on the availability of high quality
genome annotation, and a robust mechanism for exchanging annotations between multiple groups and
databases. The latter, GFF, was developed by Durbin and Haussler in 1997 [GFF] and has been
enhanced by Stein and colleagues as GFF3 [GFF3]. A major difference between GFF and GFF3 is that
GFF3 incorporates an ontology that constrains the description of annotation feature types. This is the
Sequence Ontology, developed by the GOC in collaboration with Richard Durbin, Lincoln Stein and
Mark Yandell (Eilbeck et al. 2005, [SO]). The reason for the GOC's investment in this project is simply
the GO's reliance on high quality genome annotations. Annotations will be of higher quality if the
annotators all agree on the definitions of the objects they annotate. Moreover, as shown by a small
example in Eilbeck et al. (2005), we discovered, having already built the SO, the unexpected benefit of
using tools from the discipline of extensional mereology (see Simons, 1987, and [Mereology]). These
tools promise novel methods for the analysis of genomes (see [CGL]). The SO adds a fourth
domain to the efforts of the GOC.

There has been a major change in the bioinformatics community with respect to ontologies in the last
six or seven years. Prior to 1999 only a few were advocating ontology development (e.g.
Schulze-Kremer, 1997; 1998; see also Karp, 1995; Karp and Paley, 1996). Now, we see not only the
development of ontologies for many different domains within biomedicine, but also their very extensive
use by biologists and bioinformaticians. These changes have been driven, we think, by three
considerations. First, the ever-growing number of completed genomes, and the increasing amounts of
'post-genomic' data that follow as a consequence, have opened the eyes of the community to the need
to bring semantic order to biomedical data. Second, the concept of the Semantic Web (Berners-Lee et
al. 2001), whose success is predicated on the development, availability and use of domain ontologies,
has influenced biomedical informatics (although for the counter case see [Shirky, 2005]). Finally, we
like to think that the success of the GO project has proved the benefits that accrue to a community
from the adoption of an ontology. We also believe that the GO is an example of how an open ontology
can be developed with widespread community participation.

The development of several new ontologies in the biomedical research domain is to be welcomed, but
presents the community with several problems. The first of these is access: finding out just what is
available. To solve this problem the GOC established the OBO site [OBO] as a 'single-stop shopping
site' for biomedical research ontologies. As of December 2005, there were nearly 60 contributed
ontologies accessible from this site (the majority maintained in the OBO Concurrent Versions System
(CVS) repository). Many of these are of central importance to the future development of the GO (see
Aim 1), for example the EBI's ChEBI ontology for chemicals of biological interest [CHEBI], the Cell
Ontology (Bard et al. 2005 and [CELL]), and many anatomical ontologies. Early in 2006 responsibility
for the OBO site will move from the GOC to the newly established National Center for Biomedical
Ontology [NCBO]. NCBO is an NIH funded consortium of biologists, clinicians, informaticians and
ontologists who are developing novel methods for the creation, dissemination and management of
biomedical information. NCBO is not responsible for the content of ontologies, but will, inter alia,
provide services for their maintenance, evaluation, distribution and usage.

B.1.b. The Sequence Ontology.

The Sequence Ontology project was initiated by the GOC, in collaboration with Drs. R. Durbin and L.
Stein, to provide a structured controlled vocabulary for the description of features used for genome
annotation. Traditionally, the Feature Table descriptors of the International Sequence Data Library
(GenBank/EMBL/DDBJ) ([FT]) have been used. While this has served the community well for many
years, it suffers from certain disadvantages. First, it is quite restricted in its scope, providing
only 65 terms for the description of sequence features. Second, it is very cumbersome to change: any
alteration must be agreed by the international collaborators of the three data libraries and then only
implemented after 6 months' notice to the community. Most seriously, the groupings of terms are not
formalized. Just as the MODs needed to express formally the attributes of gene products, so they also
need to express formally the attributes of sequence. The computational analysis of annotated sequence
would be enormously helped if the MODs expressed these attributes using the same terms, used with
accepted formal definitions. A second justification for the development of the SO was that there was
no easy, and certainly no rigorous, way to retrieve sequences based on some biological property from
the sequence data libraries. Example queries are: retrieve all of the genes from the mouse genome that
are 'maternally imprinted'; retrieve all of the genes from mouse, worm, fly and yeast whose transcripts
are translated with a +1 ribosome frameshift.

The development of the SO has been very closely associated with two other projects. The first is GFF3
([GFF3]), a project to create a standardized file format for the exchange of annotations. The second is
the Generic Model Organism Database's (GMOD) chado database project ([CHADO]), whose goal is to
construct a generic database schema so that the annotations from different genomes can be archived,
searched and managed using a single set of database tools. The Sequence Ontology will provide the
terms and specify the relationships used to describe the contents of GFF3 files and GMOD databases.
Thus SO is a necessary and planned adjunct to both of these projects. The SO provides a small subset
of locatable features, specifically for use by GFF3; this subset, SOFA (Sequence Ontology for Feature
Annotation, see [SO.SOFA]), is only changed once a year, so as to afford stability to GFF3 files.
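
The way an ontology can constrain feature types in an annotation file can be sketched as follows. This is a minimal illustration, assuming the standard nine-column, tab-delimited layout of a GFF3 feature line; the set `ALLOWED_TYPES` is a tiny hand-picked stand-in for the real SOFA term list, and the function name, the example line and its identifiers are hypothetical.

```python
# A tiny illustrative stand-in for the SOFA term list; the real list is much longer.
ALLOWED_TYPES = {"gene", "mRNA", "exon", "CDS"}

# The nine tab-separated columns of a GFF3 feature line.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line, allowed_types=ALLOWED_TYPES):
    """Split one GFF3 feature line into a dict and reject unknown feature types."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    if record["type"] not in allowed_types:
        raise ValueError(f"feature type {record['type']!r} is not an allowed term")
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated tag=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record


# Hypothetical feature line, for illustration only.
example = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=demo"
record = parse_gff3_line(example)
```

Because every producer of such files draws the type column from the same controlled list, a consumer can reject or flag a line whose feature type it cannot interpret instead of guessing at its meaning.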

B.1.c. OBO and OBO-Edit.

The OBO site has encouraged developers of biomedical ontologies to use the file format developed by
the GOC. This OBO format has been enhanced considerably over the last few years and now has the
advantages of great expressivity, computability and human readability [OBO-format]. A single common
format promotes community (re-)use of tools, e.g., tools for ontology editing [OBO-edit] and browsing
[AMIGO]. OBO-Edit, the ontology editing tool developed and used by the GOC community, is now
widely used, and the AmiGO browser has also been used for the Plant Ontology [PO] and the
Drosophila anatomy ontology [IMAGO].
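
To give a sense of why a simple shared format encourages tool reuse, here is a minimal sketch of reading term stanzas written in the tag-value style of OBO flat files. It is an illustration under stated assumptions: the sample text is hand-written, the `EX:` identifiers are made up, the function `parse_terms` is hypothetical, and only the `id`, `name`, `is_a` and `relationship` tags are handled.

```python
# Hand-written sample in the tag-value stanza style; identifiers are made up.
SAMPLE_OBO = """\
[Term]
id: EX:0000001
name: cell part

[Term]
id: EX:0000002
name: nucleus
is_a: EX:0000001 ! cell part

[Term]
id: EX:0000003
name: nucleolus
relationship: part_of EX:0000002 ! nucleus
"""


def parse_terms(text):
    """Parse [Term] stanzas into {id: {"name": str, "parents": [(relation, id)]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop a trailing "! comment"
        if not line:
            continue
        if line.startswith("["):
            # Start a new record for [Term]; ignore any other stanza type.
            current = {"name": None, "parents": []} if line == "[Term]" else None
            continue
        if current is None:
            continue  # header lines or a stanza type we do not handle
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["parents"].append(("is_a", value))
        elif tag == "relationship":
            relation, target = value.split()
            current["parents"].append((relation, target))
    return terms


terms = parse_terms(SAMPLE_OBO)
```

The result is a plain dictionary of terms and typed parent links, which is the same graph view of an ontology that browsers and editors built on the format present to their users.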

The GOC is often asked: "What is the difference between Protégé-OWL and OBO-Edit?" Protégé and
OBO-Edit are tools with many similarities, but fundamental design differences. Protégé is a
frame-based editing tool, while OBO-Edit is a graph-based editing tool. Both tools have optional
extensions that allow for successively richer and more expressive ontological modeling (namely the
Protégé OWL plug-in, and a series of optional plug-ins in OBO-Edit, such as the cross product plug-in).
Even when these extensions are used, the initial design philosophies of the two respective tools are
apparent.

OBO-Edit's graph-based approach is ideal for the rapid generation of large ontologies focusing on
relationships between relatively simple classes. In its default view, OBO-Edit hides the complex details
of the ontology to allow the user to focus on the overall graph structure of the ontology. GO curators
are not aware of (and have no use for) "slots", for example; they only see a graph of labeled relations
between ontology terms. Since OBO-Edit's user community consists largely of biologists with little
computer science background, this simpler, high-level editing approach is ideal for the target
community. Further, by hiding the low-level complexity of an ontology, OBO-Edit can be a more usable
tool for editing very large ontologies.

OBO-Edit and Protégé are interoperable tools, inasmuch as they both contain support for
description-logic languages such as OWL (see section D.1.c.iii for more on interoperability). Most
editing operations that are possible in one tool are possible in the other. Yet the two tools are highly
complementary. Each is specialized for a different user community and editing paradigm. It is likely
that many users will choose one tool or the other; it is possible that some advanced users will install
both tools, and use each in different circumstances.

B.2. RATIONALE FOR THE CONTINUED FUNDING OF THE GOC.

"The growth of scientific data, and of scientific databases, in the biomedical field (a growth not only of
size but also of complexity) has been remarked upon so often as to become a cliché. The urge for
database 'integration' has been a mantra of both the bioinformatics community and of the funding
agencies for decades." So we wrote in our 2003 proposal to the NIH and, frankly, we can do no better
now. The 2006 Database Issue of Nucleic Acids Research [NAR] includes 162 papers each describing
at least one database in the general domain of genomics and molecular biology. The Molecular Biology
Database Collection, maintained by M. Galperin (Galperin, 2006) in association with NAR, records 858
databases². Other than within the Generic Model Organism Database community [GMOD], there is very
little agreement or collaboration between the providers of these databases with respect to semantic
standards. Within the GMOD community, the chado database schema [CHADO] (designed by C.
Mungall and D. Emmert) has ontologies at its heart, and these are all OBO ontologies. Chado has been
adopted by several MODs, including FlyBase, dictyBase, ParameciumDB and TIGR.

The first uses of the GO for annotation were for the Drosophila genome (Adams et al. 2000) and for
yeast, both in 2000. This was followed, in 2001, by its use for the annotation of mouse cDNAs by the
FANTOM project (Kawai et al. 2001). Since then the GO has been used in over 166 published genome,
EST or protein annotation projects, mostly by groups outside the GOC (data from [GO.pubs]). The
extraordinary utility of GO annotations is shown by their very extensive use for the analysis of gene
expression data. The first such papers were published in 2002, and we are now aware of over 230
studies to date (data from [GO.pubs]).

The nature of high-throughput gene expression data challenges the ability of biologists and
bioinformaticists to extract useful knowledge from it. It is very clear that the GO is an essential tool, as
indicated by the number of tools that have been developed for this purpose (39 to date) [GO.tools].
Many companies, academics and government institutions have developed products for specialized GO
applications, e.g. GOFFA for toxicogenomics developed by the FDA [GOFFA]. We could quote scores
of examples of the use of the GO for microarray analysis, but will restrict ourselves to just three. In the
first, the authors of a very recent paper on gene expression in the invasive front of colorectal liver
metastases write: "Using the gene ontology (GO) classification, we were able to determine patterns of
up- and down-regulated genes in the liver part of the invasive front. We observed a pronounced
overrepresentation, e.g., of the GO terms 'extracellular matrix,' 'cell communication,' 'response to
biotic stimulus,' 'structural molecule activity' and 'cell growth,' indicating a very pronounced host cell
response to tumor invasion." (Bandapalli et al. 2006). Similarly, Zindy et al. (2005) discovered a "new
set of genes involved in DNA repair and damage" using the GO for a study of gene expression in
cirrhosis associated with liver cancer. Our final example is of a study of the effects of the EGFR
inhibitor drug erlotinib on gene expression in metastatic breast cancer, from Swain's group at the
National Cancer Institute: "Gene ontology comparison analysis pre-treatment and post-treatment in
EGFR-negative tumors revealed biological process categories that have more genes differentially
expressed than expected by chance. Among 495 gene ontology categories, the significant differed
gene ontology groups include G-protein-coupled receptor protein signaling (34 genes, P = 0.002) and
cell surface receptor-linked signal transduction (74 genes, P = 0.007)." (Yang et al. 2005).

² There is now agreement for the GOC to provide NAR with a GO classification of many of these database
projects for the 2007 issue. This will be most useful for specialty databases on particular processes and
functions. For example, NESbase with "nuclear export", NMPdb with "nuclear matrix".
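
The phrase "more genes differentially expressed than expected by chance" can be made concrete with a small worked example. The studies quoted above do not state here which test they used; the sketch below assumes a one-sided hypergeometric test, a common choice for GO over-representation analysis. The function name and the numbers in the usage note are hypothetical.

```python
from math import comb


def overrepresentation_p(population, annotated, selected, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    population: number of genes measured in the experiment
    annotated:  number of those genes annotated to the GO term
    selected:   number of differentially expressed genes
    overlap:    number of differentially expressed genes annotated to the term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total
```

For instance, with 20 genes measured, 5 of them annotated to a term, and 5 genes differentially expressed of which 3 carry the term, `overrepresentation_p(20, 5, 5, 3)` evaluates to 1126/15504, roughly 0.073. A real analysis would also correct for testing many GO terms at once.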

The growing reliance of the biomedical community on the GO can also be seen by the sheer growth in
the number of publications that cite the GO in their abstracts, as indexed by PUBMED [PUBMED]: from
7 in 2001, to 322 in 2005. A search for 'gene ontology' on Google Scholar reveals over 9,300 links and
that our 2000 Nature Genetics paper has been cited 1291 times (01/16/06). Enthusiasm for the GO is
also seen by the fact that 1470 people took the time to respond to our recent web survey (Appendix
13). An analysis of just 77 primary research papers showed that research supported by 17 of the 24
Institutes of the NIH has used the GO. Its importance has also been recognized by its incorporation into
the National Library of Medicine's Unified Medical Language System [UMLS], see Lomax and McCray
(2004), and the National Cancer Institute's Enterprise Vocabulary System [NCI].

Our case for the GO has been, and remains, that semantic integration of biomedical research data is
both achievable and essential if these data are to be used by the communities of experimental and
computational biologists to their greatest effect. We also argue that, while not inexpensive, the
investment required to bring about this semantic integration is but a small fraction of what these
biomedical data cost to discover and but a small fraction of what they can yield in long-term benefits to
society. The relevance of this work for public health is that comprehensive integration and
standardization of biomedical and genomics information is an essential component of advancing the
understanding of the molecular basis underlying human health and disease outcomes.

The GO was invented out of necessity: the necessity of the MODs having a rigorous method for their
users to query their databases for gene products by their 'functional' attributes. We have argued that
this necessity has not gone away, but rather has become even greater with the dramatic increase in the
number of 'completed' genomes in the last three years. Annotations of gene products with GO terms
have now become absolutely central to the analysis of most high-throughput genomics data. We are
firmly of the view that the GO can be one solution to the problem of biomedical data management and
analysis, and that our work since 1998 has shown that this solution is both achievable and
maintainable.


C. PROGRESS REPORT.

The GO includes four controlled vocabularies for describing biology: the Molecular Function ontology
describing catalytic activity and other molecular properties of a gene product; the Biological Process
ontology describing the role that a gene product plays in a higher biological pathway; the Cellular
Component ontology describing the sub-cellular compartment in which a gene product can be found;
and the Sequence Ontology describing features that are located on, or are attributes of, biological
sequences, thus providing a systematic way to annotate genomes.



The GOC is responsible for the content and structure of the four GO ontologies, for the software to edit
and display the ontologies, for the structure of gene association data files, for supporting the application
of the GO to classify data through tools and training, for the GO database, and for the project’s web
presence.

C.1. GO CURATION INFRASTRUCTURE.


A primary objective of the GO is to provide robust and biologically accurate ontologies for the research
community. During the current granting period, we continued to improve the GO and to provide the
mechanisms for community input into the GO development process.

C.1.a. Gene Ontology development.

The GO Editorial Office at the EBI
manages the distributed tasks of developing and maintaining the GO
vocabularies for molecular function, biological process and cellular component. The office consists of
the GO Editor (Dr. Midori Harris) and three GO curators who have primary responsibility for the
biological content (terms and definitions) of the GO database. The editor and curators facilitate
internal consistency by coordinating all additions and changes to the GO suggested by contributors.
They work closely with model organism database annotators to identify areas of the GO that require
expansion or revision, respond to requests from the larger biological research community, and initiate
recruitment of experts to refine specific sub-domains. The Editorial Office staff also maintains and
coordinates the GO project's documentation, develops the GOC web pages, presents GO at
scientific meetings, and answers the many questions about GO and its resources received from
community members.


              January 2004   January 2005   January 2006
Component             1269           1397           1681
Process               7867           8924          10291
Function              6907           6929           7384
Total                16043          17250          19356
Defined              86.1%          93.0%          95.4%
Obsolete               716            968            992

The totals exclude obsolete terms.

Table 2. The content of the Gene Ontology, January 2004 to January 2006, numbers of terms per ontology.


Development continues on the GO ontologies and the GO vocabularies now comprise nearly 20,000
terms (Table 2). The editorial group performs regular integrity checks on the ontologies, and provides a
summary of changes to GO structures monthly ([GO.reports]). We use three strategies to keep the GO
current: content meetings, special interest groups, and on-line request tracking. Before a new term is
committed to the database a broad consensus must be reached as to the wording of the term, its
placement and relationships, and its definition. The discussions around new terms and their definitions
involve a broad group of people, to ensure the integrity of the ontologies. Only members of the GO
Editorial Office and a few senior annotators associated with model organism databases have
permissions to modify the GO master file directly.

Interest groups: Interest groups work together to develop the terms needed to describe a specific
topic, e.g. development, cell cycle, plant biology, or metabolism. They include domain experts, GO
curators and representatives from the model organism databases. When GO moves into an unfamiliar
biological area we actively recruit external experts to form the core of a new interest group. Interest
groups have also formed spontaneously when biologists volunteer their services to improve the terms
in their domain of interest. The interest groups communicate through their own mailing lists (Table 3)
and at meetings. Anyone may join an interest group. There are now 29 ontology development interest
groups providing the GO with biological domain expertise in such areas as ‘cell cycle’ and
‘development’. Major changes this year included new high-level component terms including ‘organelle’,
‘protein complex’ and ‘receptor complex’. The current interest groups are described at [GO.interests].

Tracking Requests and Results: Individual users can submit requests for change to the ontologies
via the GO request tracker system hosted by SourceForge. Over 700 tracker entries were processed in
2005 ([GO.sf]). There are approximately 70 new requests per month. A log of all requests, the
discussion, and the status of each is available at the site. Suggestions are submitted by a wide range
of groups, including the various model organism databases, UniProt, TIGR, BRENDA ([BRENDA]) and
Incyte (now BioBase), as well as by individual researchers.

Mailing List Name   Description                                                        Number of Subscribers
GOFRIENDS           general announcements and discussions on GO                        481
GO                  Consortium discussions (closed list)                               103
ANNOTATION          discussions of functional annotation                                86
PATHOGENESIS        special topic discussions on pathogenesis, cell killing
                    and immunology                                                      41
FARMANIMALS         special topic discussions on Farm and Work Animals
                    genome annotations to GO terms                                      44
METABOLISM          special topic discussions on metabolism and GO annotations          24
DEVELOPMENT         special topic discussions on Developmental Biology                  22
NEUROBIOLOGY        special topic discussions on Neurobiology                           14

Table 3. Major GO mailing lists.


Ontology Content Meetings: We organized three ontology content meetings in 2004-5 to bring
together specialists to develop specific branches of the ontology. Each was focused on one or a few
major ontology development issues, such as pathogenesis, metabolism, cell cycle, and immunology.
The Editorial staff invites experts from relevant fields to participate in each meeting and initiates
discussions with them to assess the requirements. Materials discussing the issues and the alternative
approaches that may be taken are prepared prior to these meetings, so that the attendees
arrive fully prepared for the intense discussions needed to reach resolution. The materials and
minutes for such meetings are available on the GO website, see [GO.meetings]. The most recent GO
Ontology Content meeting, on the representation of immunology in the GO, was held at TIGR in
November 2005. In addition to core participants from the GOC, the attendees included immunologists
working on ontological representation from three academic institutions as well as NIH intramural
researchers. Major revisions of the immunology sections of the GO were proposed and discussed. The
result of the meeting was agreement on the representation and definitions for new terms and changes
to the organization of the Biological Process ontology structure in the area of immunology.

Ontology Quality Control: This year, we created software ([OBOL]) for parsing GO term names to
infer missing relationships in GO. As missing relationships are determined, the GO editors correct and
update the ontologies. A major project has been to curate logical definitions for GO terms involving cell
types and to integrate the GO and OBO cell type ([CELL]) ontologies. These logical definitions are
machine-interpretable, and thus can be used to detect inconsistencies between the ontologies, keeping
the GO and cell ontologies in synchrony and permitting cross-ontology queries (see Mungall, 2004).
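
The idea of parsing term names to propose missing relationships can be illustrated with a minimal
Python sketch. This is not the Obol implementation: the term identifiers, the example names, and the
simple "name ends with another term's name" rule are assumptions chosen for illustration, and any
suggestion produced this way would still need review by a GO editor.

```python
from typing import Dict, List, Set, Tuple


def suggest_missing_links(
    names: Dict[str, str],
    known: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Propose (child, parent) pairs where the child's name ends with the parent's name.

    names maps a term ID to its name; known holds (child, parent) pairs already asserted.
    """
    suggestions = []
    for child_id, child_name in names.items():
        for parent_id, parent_name in names.items():
            if child_id == parent_id:
                continue
            # e.g. "meiotic cell cycle" ends with " cell cycle": a candidate link
            if child_name.endswith(" " + parent_name) and (child_id, parent_id) not in known:
                suggestions.append((child_id, parent_id))
    return sorted(suggestions)


# Hypothetical identifiers, for illustration only.
example_names = {
    "T:1": "cell cycle",
    "T:2": "mitotic cell cycle",
    "T:3": "meiotic cell cycle",
    "T:4": "cell cycle arrest",
}
example_known = {("T:2", "T:1")}  # relationship already present in the ontology
candidates = suggest_missing_links(example_names, example_known)
```

Only the pair not already asserted is reported, which mirrors the workflow described above: the
software flags candidates and the editors decide whether to add them.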

C.1.b. Sequence Ontology curation.

The Sequence Ontology is managed by Dr. K Eilbeck at LBNL, and is hosted via SourceForge.net
[SO]. SourceForge is a web-based resource for managing open source projects. The SO is continually
updated in response to the growing needs of the community. The feedback from the community is
predominantly expressed via an Internet mailing list. In this way the discussion reaches a wide variety
of users and developers and the response is immediate. The mailing list is open to all and prior
discussions are available from an archive. The Sequence Ontology held two content meetings in
2004-5 where particular issues were tackled in greater detail. The developers also solicit input from
domain experts, and have worked with other groups for the development of common vocabularies, for
example the MGED Ontology group [MGED], the alternate splicing terminology group [SANBI], and the
RNA Ontology group [RNA].

Architecture of the Sequence Ontology. SO has kept to its aim of capturing the concepts necessary
to describe four different orthogonal aspects of biological sequence: located_sequence_features
are terms that describe concepts that can be located on the sequence in base coordinates;
sequence_attributes describe the properties a feature may have (for example, a gene may have
the attribute maternally_expressed, but this attribute cannot be located on a sequence);
chromosome_variation catalogues the large-scale changes to the chromosomes such as ploidy
and rearrangements; consequences_of_mutation details the terms necessary to outline what the
mutation does, such as causing a frameshift mutation.

Development of the vocabularies. SO now contains 980 terms, including 180 SOFA terms, those
terms specifically used by genomic annotation projects (see Figure 1). There has been considerable
improvement in the number of terms that have a referenced biological definition. All of the terms in the
current SOFA release are now defined, and over half of SO terms (550) are defined. A defined term
has a free text description and a cross-reference to the resource that provided the definition. Mappings
of SO to the GenBank/EMBL/DDBJ Feature Table and to the MGED Ontology are now provided (see
[SO.mappings]).




Figure 1. The number of defined and undefined terms in the SO and SOFA in the Spring of 2003 and in
the Fall of 2005.

C.2. ANNOTATION INFRASTRUCTURE.

The second objective of the GOC is to use the ontologies to describe biological data. From the outset,
GO was intended as a practical means to achieve comparability across organisms and across sites.
Each member of the GOC applies the ontologies to the gene products of its organism or discipline of
interest, and deposits these annotations in the central GO repository where they can be accessed on
the Web via the AmiGO interface ([AMIGO]). All GO annotations submitted to the central repository are
backed up by associated evidence using a series of 12 evidence codes that describe the type and
reliability of the underlying evidence (Table 4), along with the appropriate cross-references. For
example, when citing published literature, the evidence includes supporting PubMed IDs.


IC    Inferred by curator
IDA   Inferred from direct assay
IEA   Inferred from electronic annotation
IEP   Inferred from expression pattern
IGI   Inferred from genetic interaction
IMP   Inferred from mutant phenotype
IPI   Inferred from physical interaction
ISS   Inferred from sequence or structural similarity
ND    No biological data available
RCA   Reviewed computational analysis
TAS   Traceable author statement
NR    Not Recorded


The success of the work done by the GOC is evidenced by the fact that all of the major model organism
databases now use GO for annotation, and most of the new databases that will be developed in the
near future have announced plans to use GO (e.g. Tetrahymena, Xenopus and Chlamydomonas). For
a summary of all annotations that GO users have contributed see [GO.annotations]. The GO is integral
to the UniProt database of proteins being developed by the SwissProt groups at the European
Bioinformatics Institute (EBI), the Swiss Institute for Bioinformatics and the Protein Information
Resource (PIR). GO is used by a number of other protein resource databases, including PDB and
BRENDA, and has recently been adopted by the National Center for Biotechnology Information (NCBI)
for use in the Entrez Gene and RefSeq projects (see Letter of Support from D. Maglott).

There is also extensive use of GO within the private sector for the annotation of in-house databases.
AstraZeneca, Bayer HealthCare, Caprion, Celera, Eli Lilly, Genentech, GlaxoSmithKline, Hoffmann-La
Roche, Incyte, Johnson & Johnson, Mendel Biotechnology, Merck, Millennium, and Unilever are just a
few of the industrial users that we are aware of (see Letters of Support).

Six of the eight GOC groups are funded under our previous grant to curate and provide GO annotations
to their user communities and to the common GO database resource. GO annotations are also
regularly contributed to the GO resource from other groups such as the TIGR microbial and eukaryotic
annotation groups, and from the GeneDB at Sanger Institute. In addition, Drosophila, Danio, Oryza
and Candida model organism database groups participate in the work of the GOC and contribute
annotation sets, but these efforts are not funded under the existing grant (nor are we now proposing
funding). GO annotations are now regularly submitted by ten Model Organism Databases (CGD,
dictyBase, FlyBase, MGI, RGD, SGD, TAIR, WormBase, ZFIN, Gramene) and three multi-organism
databases (TIGR, UniProt and GeneDB).


Table 4. Gene Ontology Evidence Code Abbreviations. The codes reflect the type of assay used to
infer the annotation of the gene product.

Annotation Consistency and Quality Control 1: We have enhanced the quality control checks
applied to the annotation sets that are provided by the GO resource. These checks include filters to
ensure proper syntax and currency of identifiers. A major change implemented this year is daily
automatic filtering of the contributed datasets of any annotations that do not meet a stringent check of
the data. Any 'bounced' records are reported to the contributing project (see C.3). In addition to the
checks of syntax and data currency, the daily checks also eliminate duplication of annotations between
datasets. This major enhancement utilizes the NCBI Taxonomy IDs provided for each annotation and
states that a particular model organism project is the authority for providing annotations for a particular
species. Within these annotation sets attribution is given to the source of the annotations, whether the
model organism database or another contributor.

Annotation Consistency and Quality Control 2: To ensure consistency between mouse, rat and
human we have developed an additional measure (described in Dolan et al. 2005) that allows curators
to examine the consistency of GO annotations for orthologous gene sets. Every month, MGI generates
reports for mouse, rat and human genes, and MGI, RGD and GOA curators jointly review them. These
comparative annotation reports are graphical, making it easy for curators to quickly see differences in
the annotations, both in consistency and in granularity. The software creates an individual graph for
every mouse-human-rat ortholog triple (Figure 2, [MGI-GOGRAPH]). Each node of the graph is a term,
and the nodes are color-coded to indicate which organism-combination is annotated to that term.
Examination of these graphs shows that annotations are often complementary, reflecting the fact that
the different model organisms are used to study different aspects of biology. While the GOgraph
comparative tool is now only being used for these three organisms, it can obviously be expanded to any
combination of organism orthology sets (see, for example, the mouse-human-rat-fly-chicken
HomoloGene set: [GO.compare]).
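
The color-coding step described above amounts to computing, for each term, the combination of
organisms annotated to it. The following Python sketch shows one way this could be done; it is not
the MGI software, and the term labels and annotation sets are hypothetical placeholders rather than
real annotations.

```python
from typing import Dict, FrozenSet, Set


def organism_combinations(annotations: Dict[str, Set[str]]) -> Dict[str, FrozenSet[str]]:
    """Map each term to the set of organisms whose ortholog is annotated to it."""
    combos: Dict[str, Set[str]] = {}
    for organism, terms in annotations.items():
        for term in terms:
            combos.setdefault(term, set()).add(organism)
    return {term: frozenset(orgs) for term, orgs in combos.items()}


def differences(combos: Dict[str, FrozenSet[str]], organisms: Set[str]) -> Dict[str, Set[str]]:
    """For each term, list the organisms that lack an annotation to it."""
    return {term: organisms - orgs for term, orgs in combos.items() if orgs != organisms}


# Hypothetical annotation sets for one mouse-human-rat ortholog triple.
triple = {
    "mouse": {"term A", "term B"},
    "human": {"term B", "term C"},
    "rat": {"term B"},
}
combos = organism_combinations(triple)
gaps = differences(combos, set(triple))
```

Each distinct frozenset could then be assigned one node color, and the `gaps` mapping gives
curators a direct list of terms where one species' annotations might be reviewed against the others.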


Figure 2. Comparative GO graph for Shc3. This comparative graph for Shc3 and its orthologs illustrates
how combined knowledge leads to a more comprehensive understanding of the biological role of a gene.
The combination of annotations from human, rat and mouse databases gives us a more complete picture
of this gene than any of the individual species' annotations alone. The picture illustrates what we know
and generates testable hypotheses concerning what we don't know about this gene. As annotations from
each species become more and more complete, an even finer understanding of this gene will emerge.

GO Annotation Consistency Workshops: The focus of the first Annotation Workshop (Cambridge,
UK, June 2004) was to evaluate the annotation methods used by the different Consortium members
and to define a set of common annotation practices which should be used by GO annotation projects.
The second Annotation Workshop (Stanford, June 2005) focused on educating non-GOC members on
these annotation standards and working to facilitate the use of these standards. A total of 54 people,
from 38 institutions representing five countries, attended the second Annotation Workshop. Workshop
minutes are available online ([GO.camp]). Roughly half of the workshop was devoted to lectures and
discussions on the use of the GO ontologies and evidence codes. For the other half, the attendees
divided into small groups (3-5 people, including one person experienced with GO annotation from a
GOC member group) to do annotations. These small groups read papers and then discussed the GO
terms and evidence codes that best represented the information reported in the publication.

Sequence Ontology Annotation. Unlike the three other ontologies of the GO that characterize gene
products, the SO is focused on the actual parts of sequence that comprise the gene, and the attributes
that describe them. The Sequence Ontology provides a platform-independent approach to describe the
contents of a genome annotation. This flexibility means that there are many ways to utilize
SO-compliant annotation information, ranging from database schemas to ad hoc flat file formats.
Currently there are several public data models that are using SO or SOFA to describe the features in
annotations. These include the flat file specification GFF3 ([GFF3]) and the chado ([CHADO]) relational
model from the Generic Model Organism database. Two related kinds of XML specifications, chado
XML and Chaos XML, are derived from the chado relational model. The model organism communities
using the SO in their annotations are FlyBase, WormBase, SGD, MGI, and dictyBase. The SO website
does not currently provide a repository of annotations. Other groups that use the SO include MGED
([MGED]), FlyMine ([FLYMINE]) and Atlas ([ATLAS]).
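
To illustrate how a flat file format can carry SO terms, here is a minimal Python sketch that parses
one GFF3 feature line (nine tab-separated columns, with the feature type in the third column) and
checks the type against a vocabulary. The vocabulary shown is a small illustrative subset rather than
the full SOFA release, and the example record and helper names are hypothetical.

```python
from typing import Dict, NamedTuple, Set

# Small illustrative subset of feature types; a real check would load the SOFA release.
SOFA_SUBSET: Set[str] = {"gene", "mRNA", "exon", "CDS"}


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Feature:
    """Parse a single GFF3 feature line into a Feature record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes: Dict[str, str] = {}
    if cols[8] != ".":  # "." marks an empty attributes column
        for pair in cols[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), attributes)


def uses_known_term(feature: Feature, vocabulary: Set[str] = SOFA_SUBSET) -> bool:
    """Return True when the feature type is drawn from the given vocabulary."""
    return feature.type in vocabulary


# Hypothetical record, for illustration only.
example = parse_gff3_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo")
```

A check of this kind is what lets independent tools agree on what a "gene" or "exon" row means
without sharing a database schema.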

C.3. GO DATABASE AND AMIGO.

The primary community access to GO and GOC annotations is via downloads from the GO database.
The AmiGO Browser provides web-based access that supports both semantic and sequence-based
queries on the GO database.

The GO Database: There are three sources for the information provided by the GO database: the
gene association files provided by the annotation projects, the GO ontologies in OBO format, and
protein sequences obtained from UniProt and NCBI. Each annotation project provides its gene
associations in a specific format [GO.ga] for all gene product annotations defined by that group. The
projects are responsible for updating their file(s) within the CVS repository. The repository has been
very effective in allowing this international collaboration to share responsibility for maintaining input
files. A Perl script is provided as a quality control check to validate the format and to partially check
the data provided within the gene association files. This script is used on all gene association files
before they are loaded into the GO Database. This script is intended to be generic and to enforce the
standards defined by the GOC.

Each night any updated files are validated. Once a week all files are validated to maintain their
currency. Any annotation found not to comply with the standards is removed, and the results of this
validation step are reported back to the submitting group. The validation script is available from the GO
servers and maintained in CVS. Thus, groups can use the script to validate their file before it is
submitted. However, with ongoing revision of the ontologies some annotations that once passed will
later fail validation; hence the weekly reprocessing of all available annotations. The checks provided
define the minimum standard format for the repository:



• GO identifiers must be current and not secondary or obsolete.
• Abbreviations used with any identifier must be defined within the GO cross-reference
  abbreviations file [GO.xref].
• IEA annotations must have been determined within the past year. Transitive annotations must be
  regularly updated to maintain a minimal level of quality.
• The WITH column is not allowed for annotations using the TAS, NAS or ND evidence codes.
• The cardinality of all columns must strictly match the file specification.
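
The checks listed above can be sketched in a few lines of Python. This is an illustration only, not
the Perl script distributed by the GOC: the 15-column tab-delimited layout, the column positions, the
YYYYMMDD date format and the 365-day cutoff are assumptions (the authoritative layout is the one
documented at [GO.ga]), and all identifiers in the example are placeholders.

```python
from datetime import date, datetime, timedelta
from typing import List, Set

EXPECTED_COLUMNS = 15  # assumption: 15-column tab-delimited gene association layout
DB, GO_ID, EVIDENCE, WITH, DATE = 0, 4, 6, 7, 13  # assumed column positions
NO_WITH_CODES = {"TAS", "NAS", "ND"}


def validate_line(line: str, current_ids: Set[str], abbreviations: Set[str],
                  today: date) -> List[str]:
    """Return a list of problems found in one gene association line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(cols)}"]
    problems = []
    if cols[GO_ID] not in current_ids:
        problems.append("GO identifier is not current")
    if cols[DB] not in abbreviations:
        problems.append("database abbreviation is not defined")
    if cols[EVIDENCE] in NO_WITH_CODES and cols[WITH]:
        problems.append("WITH column is not allowed for this evidence code")
    if cols[EVIDENCE] == "IEA":
        annotated = datetime.strptime(cols[DATE], "%Y%m%d").date()
        if today - annotated > timedelta(days=365):
            problems.append("IEA annotation is more than one year old")
    return problems


def make_line(evidence: str, with_value: str, when: str, go_id: str = "GO:0000001") -> str:
    """Build a placeholder line; every value here is hypothetical."""
    return "\t".join(["DEMO", "D0001", "sym", "", go_id, "DEMO_REF:1", evidence,
                      with_value, "P", "", "", "gene", "taxon:0", when, "DEMO"])


CURRENT = {"GO:0000001"}
ABBREVS = {"DEMO"}
TODAY = date(2006, 1, 16)
ok = validate_line(make_line("IDA", "", "20051201"), CURRENT, ABBREVS, TODAY)
stale = validate_line(make_line("IEA", "", "20040101"), CURRENT, ABBREVS, TODAY)
bad_with = validate_line(make_line("TAS", "DEMO:X1", "20051201"), CURRENT, ABBREVS, TODAY)
```

Returning a list of problems rather than a single pass/fail flag makes it straightforward to report
every reason a record was rejected back to the submitting group.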





Figure 3. Flow of information provided by GO.



Annotations for the major model organisms are limited to a defined set of annotation projects. This
removes redundancy and motivates the annotation projects to collaborate. The GOA project includes
the annotations from the MODs within their submitted file, but these are filtered (based on NCBI taxon
IDs) to remove taxa covered by the MODs. The GOA is the authority for human (taxon:9606)
annotations.


Organism                          Project                                  Taxonomy ID
Leishmania major                  GeneDB                                   5664
Plasmodium falciparum             GeneDB                                   5833
Schizosaccharomyces pombe         GeneDB                                   4896
Trypanosoma brucei TREU927        GeneDB                                   185431
Glossina morsitans morsitans      GeneDB                                   37546
Candida albicans                  Candida Genome Database                  5476
Dictyostelium discoideum & sp.    dictyBase                                5782 & 44689
Drosophila melanogaster           FlyBase                                  7227
Gallus gallus                     GOA at UniProt                           9031
Homo sapiens                      GOA at UniProt                           9606
Oryza sp.                         Gramene                                  4528, 4530, 4532 …
Mus musculus                      Mouse Genome Informatics                 10090
Rattus norvegicus                 Rat Genome Database                      10116
Saccharomyces cerevisiae          Saccharomyces Genome Database            4932
Arabidopsis thaliana              The Arabidopsis Information Resource     3702
Bacillus anthracis                The Institute for Genomic Research       198094
Coxiella burnetii                 The Institute for Genomic Research       227377
Campylobacter jejuni              The Institute for Genomic Research       195099
Dehalococcoides ethenogenes       The Institute for Genomic Research       243164
Geobacter sulfurreducens          The Institute for Genomic Research       243231
Listeria monocytogenes            The Institute for Genomic Research       265669
Methylococcus capsulatus          The Institute for Genomic Research       243233
Pseudomonas syringae              The Institute for Genomic Research       223283
Shewanella oneidensis             The Institute for Genomic Research       211586
Silicibacter pomeroyi             The Institute for Genomic Research       246200
Trypanosoma brucei                The Institute for Genomic Research       5691
Vibrio cholerae O1 biovar eltor   The Institute for Genomic Research       686
Caenorhabditis elegans            WormBase                                 6239
Danio rerio                       Zebrafish Information Network            7955

Table 5. Project responsibility for taxon annotation file. Annotations for the listed species are limited
to the gene association files from the stated project. The NCBI Taxonomy identifier is used to filter the
annotations.


The GOC has defined particular projects as solely responsible for all annotations for specific organisms
using the NCBI Taxonomy identifiers. For example, SGD is responsible for all S. cerevisiae annotations
(taxid:4932). Any annotations specifying that taxon identifier in gene association files other than SGD's
are removed. A complete list of the responsible groups and the associated taxon IDs is available at
[GO.QC]. The filtered gene association files are provided via HTTP, FTP and CVS. The ontology file is
maintained as an OBO format file. A variety of other formats are also provided: the obsolete GO flat file
is generated nightly from the OBO file, and the OBO file is also used to generate OBO XML, RDF XML
and OWL formatted files nightly. These files are provided via HTTP and FTP.
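
The taxon-based filtering rule can be expressed compactly. The Python sketch below assumes each
annotation is reduced to a (submitting project, taxon ID, payload) triple and uses three entries from
Table 5 as the authority map; the record layout, the short project codes and the payload strings are
assumptions for illustration, not the format used by the production filter.

```python
from typing import Dict, Iterable, List, Tuple

Annotation = Tuple[str, str, str]  # (submitting project, NCBI taxon ID, payload)

# Three authorities taken from Table 5; the short project codes are assumed labels.
AUTHORITY: Dict[str, str] = {
    "4932": "SGD",    # Saccharomyces cerevisiae
    "10090": "MGI",   # Mus musculus
    "9606": "GOA",    # Homo sapiens
}


def filter_by_authority(annotations: Iterable[Annotation],
                        authority: Dict[str, str] = AUTHORITY) -> List[Annotation]:
    """Keep an annotation unless its taxon belongs to a different project's authority."""
    kept = []
    for project, taxon, payload in annotations:
        owner = authority.get(taxon)
        if owner is None or owner == project:
            kept.append((project, taxon, payload))
    return kept


# Hypothetical submissions, for illustration only.
submitted = [
    ("SGD", "4932", "yeast annotation from SGD"),
    ("GOA", "4932", "yeast annotation from GOA"),    # removed: SGD is the authority
    ("GOA", "9606", "human annotation from GOA"),
    ("GOA", "0000", "annotation for an unlisted taxon"),  # kept: no authority defined
]
filtered = filter_by_authority(submitted)
```

Because the rule is keyed on the taxon identifier alone, the same annotation submitted by two
projects appears only once in the filtered output, which is how the duplication between datasets is
avoided.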

There are three flavors of the GO Database: GO full, GO lite and GO TermDB. The GO full and GO lite
forms differ by the inclusion of IEA annotations in GO full and their exclusion from GO lite. They also
differ in their frequency of creation. GO full is recreated once a month and archived. GO lite is the
backend database that drives the AmiGO interface. GO lite is built three times a week, thus AmiGO is
at most three days out of sync with the ontologies or gene associations. Note that this changed
dramatically during 2005: before May 2005 the AmiGO back end was updated only once a month. The
GO lite database is recreated three times a week and archived once a week on the FTP archive. The
GO full and GO lite databases include the protein sequences, identified by the submitting MOD, that are
associated with non-IEA annotations. A FASTA formatted file of the sequences is also provided with the
MySQL data dumps. The TermDB flavor is created nightly and only includes those tables defined by
the ontologies; that is, no annotations or sequences are included. The MySQL data dumps of the
TermDB are useful for those wishing to only maintain a database of the ontology content.
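
The difference between the three flavors comes down to which annotations are loaded. As a minimal
sketch (not the actual database build scripts), assuming each annotation is a (gene product, term,
evidence code) triple with hypothetical values:

```python
from typing import List, Tuple

Annotation = Tuple[str, str, str]  # (gene product, term, evidence code)


def select_annotations(flavor: str, annotations: List[Annotation]) -> List[Annotation]:
    """Return the annotations that would be loaded for a given database flavor."""
    if flavor == "full":
        return list(annotations)                         # everything, including IEA
    if flavor == "lite":
        return [a for a in annotations if a[2] != "IEA"]  # IEA annotations excluded
    if flavor == "termdb":
        return []                                        # ontology tables only
    raise ValueError(f"unknown flavor: {flavor}")


# Hypothetical annotations, for illustration only.
sample = [
    ("product1", "term A", "IDA"),
    ("product2", "term A", "IEA"),
    ("product3", "term B", "TAS"),
]
lite = select_annotations("lite", sample)
```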

AmiGO: AmiGO is the GOC's GO browser ([AmiGO]). It allows users to browse the ontologies in an
intuitive manner and determine which gene products have been associated with any one term. This
past year the interface was updated to enhance the querying and display of annotations. The
development procedures have also been refined to allow effective prioritization of new features.


The AmiGO team, or working group, consists of annotators from SGD, dictyBase and GeneDB, an editor
from the GO editorial office, and a developer. All AmiGO development is tracked using the
SourceForge.net AmiGO request tracker, including requests from both the GOC and the general public.
The AmiGO working group regularly collects current items from the request tracker and from these
formulates the next release plan, which is announced to the GOC using the group mailing lists. The
plan may be revised based on the feedback following this notification. Once the plan is finalized,
mockups and specifications are generated and distributed to the AmiGO team for input. Mockups and
specifications are iterated upon until the group has reached consensus. In the next phase, the GOC
iterates upon these mockups and specifications to reach mutual approval. Based on the approved
specifications, the developer implements the new features for the release, and the AmiGO team jointly
tests and debugs the software until it is ready for the GOC to test. Once accepted by the GOC, the
fully tested new AmiGO is released to the general public. A typical release cycle is about 2
months and each release tackles anywhere from 5 to 10 problems or features.

The GO repository and FTP: All information provided by the GOC is maintained within a CVS
repository, with archival access to information available via FTP. The CVS repository is a standard
software environment used by collaborative software development projects and has served the GOC
very well. The security of the information is good: SSH (secure shell) software is used by consortium
members to log in to the CVS server at Stanford. Each consortium site has at least one person with an
account on the CVS server. Updates to files occur from the remote site via a CVS client application, a
standard tool included with Unix computers including Linux and Mac OS X. CVS handles the file
creation and modification issues created with a distributed project. Explicit file versions are maintained
on the server and an update to an older version is not allowed. A remote user is informed when they
have an old version of a file and the CVS client assists them in updating or merging the versions. The
CVS repository contains the ontologies, documentation, gene association files and mapping files, as
well as meeting minutes and presentations. The file system that contains the web site is a “checked
out” version of the CVS repository, also called a sandbox within the CVS environment. Every 30 minutes
automatic processes update the web site from the CVS. Thus any group anywhere in the world can
update a web page by updating the CVS repository, and in 30 minutes or less that page will be live on
the Internet. The FTP site contains more than the contents of the CVS: it includes large archival files.
As these files are archival, they will not change, and thus it is not necessary to maintain them within
CVS. The FTP-specific files include the ontologies, email list discussion archives for all lists, and
the MySQL data dump files that define the GO database. The Stanford group maintains the CVS and
FTP servers.

For the convenience of users we archive a freeze of the ontologies on the first of every month. The
GOC's FTP site archives these monthly releases from January 1, 2001. The GO Database archive
([GO.database-arch]) provides downloads of the GO Database (MySQL), as well as an archive of the
monthly freezes (from Dec 1, 2002). It also provides, since March 2005, a weekly copy of the "lite"
version of the database (from which IEA annotations have been removed), schema documentation, and
a Perl library of modules for parsing and navigating OBO files and annotation data. We will continue to
provide these resources.

C.4. GO EDUCATION, DOCUMENTATION, AND OUTREACH.


The GOC presents talks, posters and tutorials at scientific meetings and is constantly striving to
develop new educational and visualization tools to make the organization of the GO ontologies and
services more accessible to users. A complete list of outreach efforts, including tutorials and
workshops, is presented in Appendix 12. Recent innovations have included a self-guided tutorial
prepared by SGD [SGD.tutorial], an AmiGO visualization module that allows the relationships among
terms to be visualized on the fly as interactive diagrams, and comparative GO annotation graphs for
mouse, rat and human annotation available at MGI [MGI-GOGRAPH] (see C.2).

The success of GO is demonstrated by the increasing numbers of research projects that cite it. A
search for “Gene Ontology” in the abstracts at PubMed brings up hundreds of papers, as shown in
Figure 4. This is an underestimate because there are many other papers that only refer to the GO in the
text, citations, or in table headings. Likewise, papers written by the GOC members are frequently cited
in the literature, the most cited paper (The Gene Ontology Consortium 2000) being referenced 1291
times (data from Google Scholar, 16 Jan 2006).

Figure 4. Annual numbers of published papers that contain the key phrase “Gene Ontology” in their title or
abstract (excluding papers published by members of the GO Consortium itself) from EntrezGene PubMed.
This number does not reflect the many papers that use the GO for data analysis and representation but do
not include ‘Gene Ontology’ in the abstract.



The GOC maintains two Internet domains for access to the GO ontologies, databases, software and
community outreach. The main site is GeneOntology.org, and the number of links from external sites to
this site is on a par with other major genomic sites such as UC Santa Cruz [UCSC], Protein Data Bank
(PDB), and EBI [EBI]. Figure 5 shows growth in use of the geneontology.org site. Between May 1 2005
and Jan 8 2006 the GO sites served an average of 68,768 high quality hits per week (excluding all
robots, hits to images, style sheet files, etc). During this period there was an average of 18,462 visits,
9,000 unique hosts and 2,407 trails per week. A visit represents a set of hits within a short period of
time from a single user. A trail is a unique list of URLs observed in at least one visit; in effect, trails are
paths through the web pages provided. The top five GO terms that researchers search for are
“transport” (72020), “ATP binding” (56862), “virion” (53622), “immune response” (47773), and “DNA
binding” (46943). Two of these five are biological process terms, indicating the community’s interest in
these areas.
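
The definitions of visits and trails given above can be made concrete with a short Python sketch. It
is not the log-analysis software used for the reported figures: the 30-minute inactivity threshold used
to separate visits, the record layout, and the example hits are all assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple

Hit = Tuple[str, datetime, str]  # (host, timestamp, URL)


def split_into_visits(hits: List[Hit],
                      max_gap: timedelta = timedelta(minutes=30)) -> List[List[str]]:
    """Group each host's hits into visits, starting a new visit after a long gap."""
    by_host: Dict[str, List[Tuple[datetime, str]]] = {}
    for host, when, url in sorted(hits, key=lambda h: (h[0], h[1])):
        by_host.setdefault(host, []).append((when, url))
    visits: List[List[str]] = []
    for host_hits in by_host.values():
        current = [host_hits[0][1]]
        for (previous, _), (when, url) in zip(host_hits, host_hits[1:]):
            if when - previous > max_gap:  # gap too long: close the current visit
                visits.append(current)
                current = []
            current.append(url)
        visits.append(current)
    return visits


def unique_trails(visits: List[List[str]]) -> Set[Tuple[str, ...]]:
    """A trail is a distinct ordered list of URLs seen in at least one visit."""
    return {tuple(visit) for visit in visits}


# Hypothetical log entries, for illustration only.
t0 = datetime(2005, 5, 1, 9, 0)
log = [
    ("host1", t0, "/a"),
    ("host1", t0 + timedelta(minutes=5), "/b"),
    ("host1", t0 + timedelta(hours=2), "/a"),
    ("host1", t0 + timedelta(hours=2, minutes=5), "/b"),
    ("host2", t0, "/a"),
]
visits = split_into_visits(log)
trails = unique_trails(visits)
```

In this example one host makes two visits that follow the same path, so three visits yield only two
trails, which is why the weekly trail count reported above is much smaller than the visit count.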


The Sequence Ontology website is hosted via SourceForge.net [SO]. This is a web-based resource for
managing open source projects. Through this platform SO provides the following tools: a file release
system, a CVS repository, a mailing list for developers, a bug and feature tracking service, and a
collection of project pages that house documentation and other information. The documentation includes
a FAQ, poster and PowerPoint presentations, publications, a style guide, and the minutes of content
meetings. We provide information such as mapping tables to other vocabularies, and maintain lists of
groups annotating or using SO, software that incorporates SO, and information about using
SO-compliant annotation formats.

We have also seen growth in the use of the web-based SO resources. The homepage and
SourceForge-based tools are served about 4000 times a month, and the SO project pages have peak
usage of 180MB per month. There have been two releases of SOFA, in May 2004 and May 2005
(and one will follow in May 2006). Notice is given prior to the release of a new version of SOFA to
prepare software developers for the changes. The download history of the file release system shows
that SOFA has been downloaded 652 times, and that usage peaks after each yearly release. The
mailing list has proved a very successful tool in the development of SO. There are over 80 subscribers
who actively discuss the ontology, and the list is also open to non-subscribers to post questions or
comments. It reaches a wide variety of scientists, both those involved in software and those in the lab,
from at least 8 countries, and from academia, hospitals, institutes and industry.

D. RESEARCH PLAN.

The GOC faces a challenging task. Our vision is to ensure that all possible functional descriptions of
every protein and non-coding RNA, spanning the spectrum of organisms, are accurate, detailed, and,
most importantly, semantically compatible. Semantic compatibility is the bedrock for meaningful
discussions, comparisons, and contrasts of the annotations. Our goal is to ensure the integrity of the
ontologies and completeness of the annotations supported by the GOC community.

Ontological content: Biological data is now largely maintained on computer systems, and
consequently, biology has become an information science. The implication is that the biological
community must agree on standard, computationally tractable definitions to communicate their results
with one another via these computer systems. Working closely with the biological community to create
this standard semantic framework for describing gene products is our first aim.

Figure 5. GO Web Use. Growth of the weekly use of geneontology.org since 10/99, combined with use of
godatabase.org since 5/05.



Reference genomes: Insights into the roles of unknown gene products are largely based upon
transference from well-characterized gene products, making it critical that dependable,
well-characterized descriptions are available. Providing a core set of accurate, detailed annotations of the
genomic sequence and the gene products for nine reference genomes is our second aim.

Annotation outreach: Currently, there are over 180 eukaryotic genomes in the sequencing and
assembly pipeline (see Aim 3). For most of the 40 or so published eukaryote genome sequences, the
standard of annotation is relatively poor with respect to the standards of S. cerevisiae, M. musculus,
A. thaliana, and D. melanogaster. The extrapolated annotation forecast for genomes in production
is more of the same: that is to say, unwieldy, difficult to compare, unreliable, and in some cases
non-existent annotation. This has serious consequences, because it diminishes the utility of these
sequences for functional genomics, for improving the standard of annotation of the human genome
(essential for its maximal exploitation), and for comparative genomics. Therefore, our third aim is to
provide a basic, standardized methodology for functional annotation of emerging genomes.

Research community support: Our final aim is dissemination of this resource to the research and
education community. We will work to educate end users about how to use the GO to facilitate their
research. We will also develop and implement ways to obtain feedback to ensure that GO is relevant
and useful. The use of the ontologies, and of annotations to these ontologies, is their enduring legacy.

D.1. AIM 1: WE WILL MAINTAIN COMPREHENSIVE, LOGICALLY RIGOROUS AND BIOLOGICALLY ACCURATE ONTOLOGIES.

Ontologies have thus far been used primarily for logically undemanding tasks such as search and
retrieval. Increasingly, however, three key measures for evaluating an ontology need to be applied:
Does it cover the domain sufficiently to allow the precise and exhaustive classification of all relevant
entities and relations? Can it reliably be used for automatic reasoning? Is it made available with
high-quality documentation and with software tools that meet the needs of the users? The last point is
particularly important: ontologies must be shared, and sharing demands high-quality documentation. In
this section we present our plans for each of these three sub-topics: the expansion of the ontology into
new biological areas to increase its descriptiveness; the continued logical development of the ontology;
and the development of the software tools and documentation that are essential to accomplish these
tasks.

D.1.a. Comprehensive biological domain coverage.

Ontological content cannot be developed without experienced, qualified biologists doing the real work;
thus it is absolutely essential that we enlist biologists who are aware of the possibilities of high-quality
shared ontologies that can support genuine logical reasoning. Within GO, ontology content is curated
and controlled by the EBI editorial office. However, content development is a collaborative effort, and
the GO curators depend upon other biologists, both within and external to the GO consortium, to
develop content. We use three methods to collaborate on content development:



• Request tracking, using the SourceForge system
• Special Interest Groups (SIGs)
• Content Meetings


Of these three, the content meetings are most critical for new domains and for comprehensively
specifying existing domains. The more significant changes occur at these meetings, where domain
experts are actively engaged in developing entire branches of the ontology. As discussed in the
progress report, there are four steps to the process.



1. We initiate discussions with invited experts from relevant fields to assess the
requirements and identify fundamental issues. We prepare material to summarize the
most challenging questions to be resolved. These discussions are held via interest group
e-mail lists, where the discussion is recorded and archived.

2. We organize a face-to-face meeting between biological experts and members of the
GOC. We will have at least two ontology development meetings each year. The agenda
is established according to the key ontological issues identified in the first step. The
choices and alternative approaches are vigorously debated and ultimately resolved (or a
date for a second meeting is set).

3. The changes are implemented in the ontologies, using the Consortium's normal
procedures for change. More minor questions are settled by using the interest group e-mail.

4. The ontology is updated and released into production.


Quite often, as annotation begins, it becomes obvious that further changes to the ontologies are
needed. This does not present a difficulty, since the GOC has a well-established mechanism for
change. The entire process for developing ontology content is designed for agility and frequent
reiteration (whilst ensuring that an electronic audit trail is recorded, to help automate change management
in annotations). It ensures that new terms are efficiently and reliably added to the GO, as appropriate to
represent biological knowledge. It allows the GOC to respond rapidly to the needs of the biological
curators.

This rapid response is important, because advances in biological knowledge, as deduced from new
experimental results, drive ontology development. Aims 2 and 3 deal directly with annotation and
therefore will be the primary drivers for new terms in the GO. The nine reference genomes will need
new terms as they increase the precision of their annotation. The reference genome annotations are
driven by current research results, largely collected from the literature. Review articles in particular are
useful, and review authors are among the best candidates to enlist for ontology development efforts, as
they have the necessary knowledge of the domain. Annotation projects for new organisms will also
certainly require new terms. A wide variety of organisms are used in research because each offers
unique leverage to explore certain aspects of life. The unique biology of individual organisms will
broaden the biological content of the ontology, just as the reference genomes will deepen the biological
content.

We are commonly asked how or whether we can enlist these experts, who are generally very busy
individuals. Thus far this has not been an issue, for two reasons. First, many domain experts are
actively managing data from high-throughput research projects; they are already highly motivated
because they appreciate the challenge of, and need for, organizing and classifying large data sets. In some
cases there are legacy issues, but we resolve these by providing mappings between their existing
terms and GO terms. The second factor is intellectual. Because the development of the ontology itself
requires original, careful thought, it results in papers describing the biology that underlies the
development of a branch of the ontology, which provides both additional motivation and education
for others.

D.1.b. Logical development of the GO.

A logically robust ontology requires unambiguous formal definitions of the relational expressions used,
coupled with their consistent application throughout the ontology. These formal definitions should
enable software to draw inferences from the ontology and from data annotated with its terms. If
software tools and databases are to interoperate and provide consistent answers to biological
questions posed by end-users, then it is essential for relations to be shared and precisely defined. For
example, a biologist querying for gene products localized to the nucleus [GO:0005634] should
receive amongst their query results gene products annotated to GO:0016592; Srb-mediator
complex, as this cellular component is part_of the nucleus.
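
The query behaviour described above can be sketched as a transitive closure over the ontology's relations. This is a minimal illustration, not the GOC's software. The dictionaries, function names and gene product names are hypothetical, and the direct part_of edge between the two terms is a simplification of whatever path the ontology actually asserts.

```python
# Toy fragment of the cellular component ontology: child -> list of (relation, parent).
# Only the two GO IDs come from the text; everything else is illustrative.
PARENTS = {
    "GO:0016592": [("part_of", "GO:0005634")],  # Srb-mediator complex part_of nucleus
    "GO:0005634": [],                           # nucleus
}

# Hypothetical annotations: gene product -> set of GO terms it is annotated to.
ANNOTATIONS = {
    "gene_product_A": {"GO:0016592"},
    "gene_product_B": {"GO:0005634"},
    "gene_product_C": set(),
}


def ancestors(term, parents=PARENTS, relations=("is_a", "part_of")):
    """Return the term itself plus every term reachable via the allowed relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for relation, parent in parents.get(current, []):
            if relation in relations:
                stack.append(parent)
    return seen


def products_localized_to(query_term, annotations=ANNOTATIONS):
    """Gene products annotated to the query term or to any of its descendants."""
    return sorted(
        product
        for product, terms in annotations.items()
        if any(query_term in ancestors(term) for term in terms)
    )


result = products_localized_to("GO:0005634")
```

Here `result` contains both the product annotated directly to the nucleus and the one annotated to the Srb-mediator complex. If the part_of edge were missing or ill-defined, the second product would silently drop out of the query, which is the inconsistency that shared, precisely defined relations are meant to prevent.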



D.1.b.i. Improving Biological Relations.

Biological ontologies are concerned with the kinds of biological entities that exist, and the relationships
between these entities. Currently the GO only admits two relations: is_a and part_of. SO extends
this core set with other relations such as homologous_to. Historically, other OBO ontologies have
also taken these relations and extended them, often in an ad hoc manner.

This was the primary motivation for the creation of the OBO Relations Ontology (Smith et al., 2005;
[RO]), a collection of definitions for relations to be used by GO and other biological ontologies. The
development of this ontology was a collaboration between members of the GO consortium and formal
ontologists, including members of the National Center for Biomedical Ontologies (NCBO). This ontology
also includes additional relations not yet used by GO, but anticipated to be useful for creating
relationships between GO and other OBO ontologies, as well as within GO and SO.

We will use these relations, and work with the OBO Relations Ontology content developers to extend
and clarify these relations to enhance GO and SO. We will make five types of modifications to the
GO term relationships: completion, replacement, additions within a single GO ontology, additions across
current GO categories, and additions between GO ontologies and other external OBO ontologies³. The
following set of modifications to the relationships in the GO is approximately ordered by implementation
priority.

D.1.b.ii. Providing comprehensive is_a relationships for all GO terms.

We will include appropriate is_a relationships for every GO term. The GO is currently incomplete in
that not all terms have an is_a parent (terms without an is_a parent will always have a part_of
parent, except for the 3 root terms). This has negative consequences at both an ontological level and
a practical software level. Therefore filling in the missing is_a relationships is a critical goal. This
task is currently underway.
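
The completeness check implied above is simple to state in code. The following is a hedged sketch, not the procedure the GOC uses: the toy ontology, its edges and the function name are assumptions for illustration, and only the names of the three root terms are taken from the GO itself.

```python
# The three GO root terms; these are exempt from needing an is_a parent.
ROOTS = {"molecular_function", "biological_process", "cellular_component"}

# Hypothetical toy ontology: child -> list of (relation, parent).
PARENTS = {
    "cellular_component": [],
    "cell": [("is_a", "cellular_component")],
    "organelle": [("is_a", "cellular_component")],
    "nucleus": [("part_of", "cell")],  # has a part_of parent but no is_a parent
}


def terms_missing_is_a(parents, roots=ROOTS):
    """List non-root terms that have no is_a parent."""
    return sorted(
        term
        for term, edges in parents.items()
        if term not in roots
        and not any(relation == "is_a" for relation, _ in edges)
    )


missing = terms_missing_is_a(PARENTS)
```

In this toy data only `nucleus` is reported, matching the case described in the text: a term that is anchored in the ontology solely through part_of.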

On the ontological level, the missing is_a relationships create holes in the ontology, making inference
unreliable. The formal OBO definition of is_a takes both time and contextual contingencies into
account. The definition of is_a states: If, at all possible time points, all instances of Class_A are also
instances of Class_B, then Class_A is_a Class_B. For example, when the GO affirms that
glucose metabolism is_a carbohydrate metabolism, then we are stating that all instances of
the former are ipso facto also instances of the latter. In common language, every class in GO must be a
subtype of one of the three upper level classes. If a class is not a subtype of one of these three, and yet
still falls within the province of the GO ontologies, then there is no way to determine what class of
entity it actually is. Moreover, there is no way to carry out inferences on the basis of data annotated to
that term.
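
The definition of is_a quoted above can be written compactly in first-order notation. This is a sketch only: the predicate inst(x, C, t), read as "x is an instance of class C at time t", and the variable names are notational assumptions introduced here, not symbols taken from this proposal.

```latex
\[
  \forall t \,\forall x \,
  \bigl(\mathrm{inst}(x, A, t) \rightarrow \mathrm{inst}(x, B, t)\bigr)
  \;\Longrightarrow\;
  A \;\mathit{is\_a}\; B
\]
```

With A as glucose metabolism and B as carbohydrate metabolism, the left-hand side says that anything that is an instance of glucose metabolism at a given time is also an instance of carbohydrate metabolism at that time.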

On a practical level, the missing is_a relationships can lead to non-intuitive displays, and hinder
interoperability. The current visualization paradigm for the GO is to show all possible lineages to the
root term, combining a mixture of is_a and part_of relationships. For example, the term AP-1
adaptor complex has 164 distinct lineages to the root traversing both is_a and part_of
relationships. This leads to ‘tangled’ DAG displays that are confusing to users. It would be better to
show distinct subsumption hierarchies and partonomies (is_a and part_of DAGs), but, because of
missing is_a relationships, the filtered display would contain hundreds of orphan terms. An additional
problem is the difficulty these is_a gaps create when importing GO into alternative ontology tools and
software such as Protégé. These tools expect all terms (bar the root term) to have an is_a parent;
unexpected things happen when this is not the case. We could circumvent this problem during import



3