The Gene Ontology Categorizer

BIOINFORMATICS
Vol. 20 Suppl. 1 2004, pages i169–i177
DOI:10.1093/bioinformatics/bth921
The Gene Ontology Categorizer
Cliff A. Joslyn¹,*, Susan M. Mniszewski¹, Andy Fulmer² and Gary Heaton³
¹Computer and Computational Sciences, Mail Stop B265, Los Alamos National Laboratory, Los Alamos, NM 87545, USA, ²Corporate Biotechnology, Miami Valley Labs and ³Corporate Functions-IT, Procter & Gamble, Cincinnati, OH 45239-8707, USA
Received on January 15, 2004; accepted on March 1, 2004
ABSTRACT
Summary: The Gene Ontology Categorizer, developed jointly by the Los Alamos National Laboratory and Procter & Gamble Corp., provides a capability for the categorization task in the Gene Ontology (GO): given a list of genes of interest, what are the best nodes of the GO to summarize or categorize that list? The motivating question is from a drug discovery process, where after some gene expression analysis experiment, we wish to understand the overall effect of some cell treatment or condition by identifying ‘where’ in the GO the differentially expressed genes fall: ‘clustered’ together in one place? in two places? uniformly spread throughout the GO? ‘high’, or ‘low’? In order to address this need, we view bio-ontologies more as combinatorially structured databases than facilities for logical inference, and draw on the discrete mathematics of finite partially ordered sets (posets) to develop data representation and algorithms appropriate for the GO. In doing so, we have laid the foundations for a general set of methods to address not just the categorization task, but also other tasks (e.g. distances in ontologies and ontology merger and exchange) in both the GO and other bio-ontologies (such as the Enzyme Commission database or the MEdical Subject Headings) cast as hierarchically structured taxonomic knowledge systems.
Contact: joslyn@lanl.gov
1 INTRODUCTION
The computational biology revolution has produced many large databases of genomic information, including the Gene Ontology (GO) (GO Consortium, 2000, http://www.geneontology.org). This explosion of information has substantially changed the processes which biological researchers use in such tasks as drug discovery, now increasingly involving the dedication of substantial resources to navigating these databases.
We can identify gene list categorization as one of the new tasks required of computational biologists. Following a gene expression experiment involving high-throughput microarrays or Affymetrix gene chips, a biomedical researcher is confronted with a list of a few hundred to a thousand genes, from which the researcher will need to extract useful information on the types of biological processes affected in the experiment. Of these, some have their function described in published papers; others have additionally been annotated into specialized databases of proteins with known function; and still others may not be known at all.

*To whom correspondence should be addressed.
The GO is one such database, a large, standardized knowledge structure consisting of three branches: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC). Each branch is organized as a taxonomy of nodes representing different categories of genomic characteristics, connected by either is-a (subsumptive) or has-part (compositional) links. Once a gene is characterized sufficiently, it can be attached to the appropriate node, as shown in Figure 1 (GO Consortium, 2000).
The categorization task arises from our researcher wanting to take the names of these genes and gain an understanding of their overall function by examining their distribution through the GO: are they localized, grouped in distinct areas or spread uniformly? Manual approaches and existing software are inadequate to answer this question over hundreds of proteins and more than 16 000 GO nodes, and thus an algorithmic approach is necessary.
While modern bio-ontologies take many forms, an adequate overall description is of a taxonomically organized data object over which automated inference and reasoning (e.g. using description logics) is performed. Leading research in ontologies tends to focus on logical properties, inference and search. Our view is that their nature as hierarchical, taxonomic categorizations of biological objects is what has made existing bio-ontologies successful, and that they are thus first best seen as specially structured databases. It is thus important that appropriate mathematical and combinatorial techniques be brought to bear on their representation, measurement and manipulation.
The Gene Ontology Categorizer (GOC) applies novel research in the discrete mathematics of finite partially ordered sets (posets) for semantic hierarchies (C.A. Joslyn, submitted for publication; Joslyn and Mniszewski, 2004) to GO analysis.
Bioinformatics 20(Suppl. 1) © Oxford University Press 2004; all rights reserved.
Fig. 1. A portion of the BP branch of the GO (GO Consortium, 2000). GO nodes in the hierarchy have genes from three species annotated below them (Reprinted with permission from Nature).
Specifically, we represent the GO as a poset ontology, then use pseudo-distances between comparable nodes to develop scoring functions that rank-order the GO nodes with respect to a query. Finally, we cluster¹ the resulting rank-ordered list to produce a ranked list of appropriate summarizing nodes within the GO, which act as functional hypotheses about the characteristics of the genes expressed.
While GO analysis is an increasingly important area, existing techniques suffer from some weaknesses. Many researchers consider the GO simply as a list of categories, ignoring any structural relationships among the categories (Zeeberg et al., 2003). Even those researchers with a treatment closest in spirit to ours (Lee et al., 2003, 2004) consider the GO primarily as a tree, or even cast it as a graph for determining distances between nodes. And while methods which emphasize external statistical information and validation (Lord et al., 2003; Zeeberg et al., 2003) are welcome and necessary, it is always necessary to proceed from a proper combinatorial perspective.
GOC has been developed over the past year (Joslyn et al., 2003a,b) by researchers at the Los Alamos National Laboratory (LANL) and Procter & Gamble Corp. (P&G), and is currently in use by staff scientists at P&G. In addition, extensions of GOC to handle textually based queries have been used by LANL in its submission for the BioCreative challenge (http://www.mitre.org/public/biocreative) for automated annotation (Verspoor et al., 2004).

¹Note that we are not using the term ‘cluster’ here in the same sense as used in other clustering applications in data mining, for example k-means.
2 METHODOLOGY
A finite partially ordered set (poset) (Schröder, 2003) is a mathematical structure $\mathcal{P} = \langle P, \leq \rangle$, where $P$ is a finite set and $\leq\ \subseteq P^2$ is a reflexive, anti-symmetric, transitive binary relation on $P$. Posets are the most general combinatorial objects decomposable into levels, in our case, of semantic specificity. While more specific than directed graphs or networks (every poset is a digraph with no cycles), they are more general than trees or lattices (every tree and lattice is a poset), in that collections of nodes can have multiple parents.
The GO is a pair of directed acyclic graphs (DAGs), one each for the is-a and has-part links. Every DAG determines a unique poset, which is evident in Figure 1, so that $P_{GO}$ is the set of nodes such as ‘DNA unwinding’ and ‘DNA replication’, and the ordering $\leq$ in ‘DNA repair $\leq$ DNA metabolism’ represents that DNA repair is a kind of DNA metabolism. Thus the GO, cast as a pair of posets $\mathcal{P}_{is} = \langle P_{GO}, \leq_{is} \rangle$ and $\mathcal{P}_{has} = \langle P_{GO}, \leq_{has} \rangle$ for the two kinds of relations, is a large, taxonomically organized semantic hierarchy. Throughout the paper below we actually consider the two kinds of links to be equivalent, and thus model the GO as the poset $\mathcal{P}_{GO} = \langle P_{GO}, \leq_{GO} \rangle$, where $\leq_{GO}\ =\ \leq_{is} \cup \leq_{has}$.
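As a rough illustration of how a poset can be derived from the two kinds of links, the Python sketch below merges is-a and has-part links and computes, for each node, the set of nodes at or above it. It is not GOC code: the function names and the tiny link lists are hypothetical, and taking the reflexive-transitive closure of the merged links is an assumption about how the combined ordering should be read (the plain union of two orders need not be transitive on its own).

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (child, parent)

def poset_from_links(*link_sets: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Map each node to its up-set: every node p2 with node <= p2 (including itself)."""
    parents: Dict[str, Set[str]] = {}
    for links in link_sets:
        for child, parent in links:
            parents.setdefault(child, set()).add(parent)
            parents.setdefault(parent, set())

    up: Dict[str, Set[str]] = {}

    def up_set(node: str) -> Set[str]:
        # Memoized walk towards the top; assumes the merged links are acyclic.
        if node not in up:
            result = {node}
            for par in parents[node]:
                result |= up_set(par)
            up[node] = result
        return up[node]

    for node in parents:
        up_set(node)
    return up

def leq(up: Dict[str, Set[str]], p1: str, p2: str) -> bool:
    """True when p1 <= p2 in the derived order."""
    return p2 in up[p1]

# Illustrative links only; the has-part link is an assumed example.
is_a = [("DNA repair", "DNA metabolism"), ("DNA replication", "DNA metabolism")]
has_part = [("DNA unwinding", "DNA replication")]

up = poset_from_links(is_a, has_part)
print(leq(up, "DNA repair", "DNA metabolism"))     # True
print(leq(up, "DNA unwinding", "DNA metabolism"))  # True, via the closure
print(leq(up, "DNA repair", "DNA replication"))    # False
```

Storing up-sets makes each order test a single set lookup, at the cost of memory that grows with the number of comparable pairs.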
Fig. 2. An example of a labeled poset (nodes 1 and A–K, annotated with gene labels a–j).
We now introduce concepts from poset theory (Schröder, 2003), and a simple example. Two nodes $p_1, p_2 \in P$ are comparable, denoted $p_1 \sim p_2$, if either $p_1 \leq p_2$ or $p_2 \leq p_1$; a chain $C \subseteq P$ is a collection of pairwise comparable nodes; and the height $H(\mathcal{P})$ is the size of the largest chain. Similarly, two nodes $p_1, p_2 \in P$ are non-comparable if $p_1 \not\sim p_2$, an anti-chain is a collection of pairwise non-comparable nodes, and the width $W(\mathcal{P})$ is the size of the largest anti-chain.
Given two comparable nodes $p_1 \leq p_2$, the set of all nodes ‘between’ them is the interval $[p_1, p_2] = \{p : p_1 \leq p \leq p_2\}$, which is equivalent to the set of all chains between $p_1$ and $p_2$, denoted $\mathcal{C}(p_1, p_2)$. The vector of chain lengths $\vec{h}(p_1, p_2) = |\mathcal{C}(p_1, p_2)|$ is the collection of the lengths of all these chains, and finally the minimum and maximum chain lengths between $p_1$ and $p_2$ are $h_*(p_1, p_2) = \min_{C \in \mathcal{C}(p_1, p_2)} |C|$ and $h^*(p_1, p_2) = \max_{C \in \mathcal{C}(p_1, p_2)} |C|$, respectively.
An example of a poset on a set of nodes $P = \{1, A, B, \ldots, K\}$ is shown in Figure 2. We have that B and J are non-comparable, while A ≤ B are comparable, and the interval $[A, B] = \{A, F, G, H, I, B\}$ consists of the three chains $\mathcal{C}(A, B) = \{A \leq F \leq B,\ A \leq G \leq B,\ A \leq H \leq I \leq B\}$, so that the vector of chain lengths between them is $\vec{h}(A, B) = \langle 2, 2, 3 \rangle$ with $h_*(A, B) = 2$, $h^*(A, B) = 3$. Finally, $\mathcal{P}$ has height $H(\mathcal{P}) = 5$ (a maximal chain is D ≤ E ≤ I ≤ C ≤ 1) and width $W(\mathcal{P}) = 5$ (the largest anti-chain is {F, G, H, E, J}).
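The chain lengths in this example can be checked mechanically. The following Python sketch is illustrative only: it encodes just the links between A and B that the three listed chains imply (not the whole of Figure 2), counts length in links as the vector 2, 2, 3 suggests, and uses the hypothetical names `covers` and `chain_lengths`.

```python
from typing import Dict, List

# Links (child -> parents) inside the interval [A, B], read off the three
# chains listed in the text. Partial data, assumed for illustration.
covers: Dict[str, List[str]] = {
    "A": ["F", "G", "H"],
    "F": ["B"],
    "G": ["B"],
    "H": ["I"],
    "I": ["B"],
    "B": [],
}

def chain_lengths(lo: str, hi: str) -> List[int]:
    """Lengths (number of links) of every maximal chain from lo up to hi."""
    if lo == hi:
        return [0]
    lengths: List[int] = []
    for parent in covers.get(lo, []):
        # Extend every chain that reaches hi through this parent by one link.
        lengths.extend(1 + n for n in chain_lengths(parent, hi))
    return lengths

h = sorted(chain_lengths("A", "B"))
print(h, min(h), max(h))  # [2, 2, 3] 2 3
```

An empty result means the first node is not below the second, so no chain length is defined for that pair.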
Note how a poset is not a tree: both Figures 1 and 2 show nodes with more than one parent. Note also the inherently two-dimensional structure displayed by division into levels: while nodes can be re-drawn left to right (width) as convenient, vertically it is crucial that higher nodes be placed above lower ones (height).
The GO, modeled as $\mathcal{P}_{GO}$, has measurable poset properties, as shown in Table 1 and Figure 3 (GO for September 2003). The height parameter shows that the GO is properly seen as a structure divided into levels, 15 for BP and 13 for MF and CC. It branches out quickly and broadly, with twice as many nodes (10.6K) being ‘terminal’ leaves compared with interior nodes (only 5.4K). Calculating the width of a poset is still a daunting task algorithmically, so the width parameter is only a lower bound estimate. Even so, the structure is more than two orders of magnitude wider than it is high (at least 5.9K against 16). Figure 3 shows the distribution (on a log scale) of the number of parents and children per node. Note that a few nodes have hundreds of children, and a substantial quantity have at least two parents, some as many as four or five.

Table 1. Poset statistics of the GO

      Nodes   Leaves  Interior  Edges   H    W
MF    7.0K    5.6K    1.3K      8.1K    13   ≥3.5K
BP    7.7K    4.1K    3.6K      11.8K   15   ≥2.9K
CC    1.3K    0.9K    0.4K      1.7K    13   ≥0.4K
GO    16.0K   10.6K   5.4K      21.5K   16   ≥5.9K
We can then define a POSet Ontology (POSO) as $\mathcal{O} = \langle \mathcal{P}, X, F \rangle$, where $X$ is a finite, non-empty set of labels, and $F: X \to 2^P$ is an annotation function mapping each label $x \in X$ to a collection of nodes $F(x) \subseteq P$. In Figure 2, we have $X = \{a, b, \ldots, j\}$, and e.g. $F(b) = \{A, E, F\}$. In GOC, we have $\mathcal{O}_{GO} = \langle \mathcal{P}_{GO}, X_{GO}, F_{GO} \rangle$, where the gene products $X_{GO}$ and annotations $F_{GO}$ are provided by the GO’s native XML file supplemented by translators provided by the GOA project (http://www.ebi.ac.uk/GOA). Current annotation files include Affymetrix, enzymes, yeast, and UniProt SWISS-PROT and TrEMBL.
We can now pose the categorization problem in the context of the example in Figure 2: given a particular set of genes of interest cast as a query, say $Y = \{c, e, i\} \subseteq X$, what node(s) in $P$ best summarize that set? One answer is C, since it ‘covers’ all three genes, and does so in the most specific way. The node 1 also covers the genes, but would not be favored since it is a more general category. But it can also be argued that H is a good answer, since, while it only covers c and e, it does so more specifically than C does. Along with Lee et al. (2003), we note that this interplay between ‘coverage’ and ‘specificity’ is central to this class of methodologies.
We now need the concept of a pseudo-distance as a function $\delta: P^2 \to \mathbb{R}$, where $\forall p_1 \leq p_2 \in P$, $h_*(p_1, p_2) \leq \delta(p_1, p_2) \leq h^*(p_1, p_2)$; and a normalized distance as $\bar{\delta} = \delta / H(\mathcal{P})$. Current pseudo-distances implemented in GOC include: the minimum path length $\delta_m = h_*$; the maximum path length $\delta_x = h^*$; the average of extreme path lengths

$$\delta_{ax}(p_1, p_2) = \frac{h_*(p_1, p_2) + h^*(p_1, p_2)}{2};$$
Fig. 3. Distribution of number of children (left) and parents (right) per node.
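Table 2. Scoring functions

Unnormalized distance, unnormalized score:
$$S_Y(p) = \sum_{x \in Y} \sum_{p' \in F(x):\, p' \leq p} \left( \delta^r(p', p) + 1 \right)^{-1}$$

Unnormalized distance, normalized score:
$$\hat{S}_Y(p) = S_Y(p) \Big/ \sum_{x \in Y} |F(x)|$$

Normalized distance, unnormalized score:
$$\bar{S}_Y(p) = \sum_{x \in Y} \sum_{p' \in F(x):\, p' \leq p} \left( 1 - \bar{\delta}(p', p) \right)^r$$

Normalized distance, normalized score:
$$\hat{\bar{S}}_Y(p) = \bar{S}_Y(p) \Big/ \sum_{x \in Y} |F(x)|$$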
and the average of all path lengths

$$\delta_{ap}(p_1, p_2) = \frac{\sum_{h \in \vec{h}(p_1, p_2)} h}{|\vec{h}(p_1, p_2)|}.$$

Other pseudo-distances are in exploration.
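To make the four pseudo-distances concrete, the short Python sketch below evaluates them on the chain-length vector 2, 2, 3 from the Figure 2 example (nodes A and B). The function names are hypothetical and the code is a reading of the definitions above, not GOC itself.

```python
from statistics import mean
from typing import Dict, List

def pseudo_distances(h: List[int]) -> Dict[str, float]:
    """The four pseudo-distances computed from a vector of chain lengths h."""
    if not h:
        raise ValueError("no chains: the two nodes are not comparable")
    h_min, h_max = min(h), max(h)
    return {
        "delta_m": float(h_min),          # minimum path length
        "delta_x": float(h_max),          # maximum path length
        "delta_ax": (h_min + h_max) / 2,  # average of the two extremes
        "delta_ap": float(mean(h)),       # average over all chains
    }

def normalize(distance: float, height: int) -> float:
    """Normalized distance: the raw distance divided by the poset height."""
    return distance / height

d = pseudo_distances([2, 2, 3])
print(d["delta_m"], d["delta_x"], d["delta_ax"])  # 2.0 3.0 2.5
print(normalize(d["delta_m"], 5))                 # 0.4, using H = 5 from the example
```

Every value lies between the minimum and maximum chain length, which is exactly the bound required of a pseudo-distance.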
Given a pseudo-distance and a set of labels (genes) of interest $Y \subseteq X$, we then want to develop a scoring function $S_Y(p)$ that returns the weighted rank of a node $p \in P$ based on the requested labels $Y$. We actually use two kinds of scores, an unnormalized score $S_Y: P \to \mathbb{R}^+$ which returns an ‘absolute’ number, and a normalized score $\hat{S}_Y: P \to [0, 1]$ which returns a ‘relative’ number.
We allow the user to choose the relative value placed on coverage versus specificity by introducing a parameter $s \in \{\ldots, -1, 0, 1, 2, 3, \ldots\}$, where low $s$ emphasizes coverage, and high $s$ emphasizes specificity. The scoring function can use either the unnormalized distance $\delta$, or the normalized $\bar{\delta}$. Letting $r = 2^s$, we have the four scoring functions shown in Table 2.
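The effect of the specificity parameter can be seen on a small case. The Python sketch below implements the four scoring functions of Table 2 with the minimum path length as the pseudo-distance. The five-node ontology, its annotations and all identifiers are hypothetical; it is loosely inspired by the query {c, e, i} but is not a reconstruction of Figure 2, so the scores will not match Table 3.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy ontology (child -> parents) and annotation function F.
PARENTS: Dict[str, List[str]] = {
    "A": ["H"], "H": ["C"], "J": ["C"], "C": ["1"], "1": [],
}
F: Dict[str, Set[str]] = {"c": {"A"}, "e": {"H"}, "i": {"J"}}
HEIGHT = 4  # size of the largest chain in the toy poset: A <= H <= C <= 1

def delta_m(lo: str, hi: str) -> Optional[int]:
    """Minimum number of links from lo up to hi; None if lo is not below hi."""
    queue, seen = deque([(lo, 0)]), {lo}
    while queue:
        node, dist = queue.popleft()
        if node == hi:
            return dist
        for par in PARENTS[node]:
            if par not in seen:
                seen.add(par)
                queue.append((par, dist + 1))
    return None

def score(p: str, query: Iterable[str], s: int,
          normalized_distance: bool = True,
          normalized_score: bool = True) -> float:
    """Score node p for a query of labels, with r = 2**s."""
    query = list(query)
    r = 2.0 ** s
    total = 0.0
    for x in query:
        for q in F[x]:
            d = delta_m(q, p)
            if d is None:  # annotation q is not below p, so p does not cover it
                continue
            if normalized_distance:
                total += (1.0 - d / HEIGHT) ** r
            else:
                total += 1.0 / (d ** r + 1.0)
    if normalized_score:
        total /= sum(len(F[x]) for x in query)
    return total

Y = ["c", "e", "i"]
for s in (-1, 3):
    best = max(PARENTS, key=lambda p: score(p, Y, s))
    print(s, best)  # prints "-1 C" then "3 H"
```

In this toy case the low setting favors C, which covers all three genes, while the high setting favors H, which covers only two but more specifically.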
We then want to find collections of scored nodes that break into groups by identifying non-comparable nodes within the ranked list as ‘cluster heads’. The resulting clusters are at different depths in $P$, and while headed by non-comparable nodes, their contents can overlap. Cluster heads which are non-comparable to all others of lower rank are called ‘primary’, and those above some previously identified cluster head ‘secondary’.

Table 3. GOC output for the example in Figure 2 for query Y = {c, e, i}. Each column pair gives the score $\hat{\bar{S}}_Y(p)$ and the node(s) p.

Rank   s = −1              s = 1               s = 3
1      0.7672  C+          0.5467  H+          0.3893  H+
2      0.6798  1−          0.3867  C−          0.3333  A;J+
3      0.6315  H           0.3333  A;I;J
4      0.5563  I                               0.0617  C−
5      0.5164  B                               0.0615  I
6      0.3333  A;J         0.2400  B−          0.0559  F;G;K
7                          0.2267  1−
8      0.2981  F;G;K       0.2133  F;G;K
9                                              0.0112  B
10                                             0.0060  1−
Output for the example in Figure 2 is shown in Table 3, for query $Y = \{c, e, i\}$, specificity values $s = -1$, 1 and 3, doubly normalized score $\hat{\bar{S}}$ and pseudo-distance $\delta_m$. Cluster heads are marked with +, and secondaries with −. Desirable results are: for low specificity, C preferred and primary, with 1 as a secondary; for high specificity, H and J preferred (J specifically covers i), with C as a secondary.
GOC was implemented for Linux in Java j2sdk1.4.0 using the OpenJGraph classes (http://openjgraph.sourceforge.net), an open source Java library (LGPL) used to create and manipulate graphs. We created a class UBPoset to support algorithms for scoring and distances, which implements a Floyd-Warshall algorithm for calculating all shortest paths, and is $O(N^3)$. We also use this algorithm to calculate all the maximum chain lengths, the number of chains and sum of chain lengths.
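For readers unfamiliar with the technique, the Python sketch below shows a Floyd-Warshall style triple loop that yields both the shortest and the longest chain length between every ordered pair of nodes. It is not the UBPoset class; the function name and the data (the A to B interval of Figure 2) are assumptions for illustration. The longest-path variant is only valid because the links are acyclic.

```python
import math
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]

def chain_extremes(nodes: List[str], links: Iterable[Pair]
                   ) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """All-pairs minimum and maximum upward path lengths over (child, parent) links."""
    shortest = {(u, v): (0.0 if u == v else math.inf) for u in nodes for v in nodes}
    longest = {(u, v): (0.0 if u == v else -math.inf) for u in nodes for v in nodes}
    for child, parent in links:
        shortest[(child, parent)] = 1.0
        longest[(child, parent)] = 1.0
    # Allow each node k in turn as an intermediate stop: O(N^3) overall.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                short_via_k = shortest[(i, k)] + shortest[(k, j)]
                if short_via_k < shortest[(i, j)]:
                    shortest[(i, j)] = short_via_k
                long_via_k = longest[(i, k)] + longest[(k, j)]
                if long_via_k > longest[(i, j)]:
                    longest[(i, j)] = long_via_k
    return shortest, longest

nodes = ["A", "B", "F", "G", "H", "I"]
links = [("A", "F"), ("A", "G"), ("A", "H"), ("F", "B"),
         ("G", "B"), ("H", "I"), ("I", "B")]
shortest, longest = chain_extremes(nodes, links)
print(shortest[("A", "B")], longest[("A", "B")])  # 2.0 3.0
```

Pairs that are not related upward keep an infinite shortest length and a negative-infinite longest length, which doubles as a comparability test.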
3 EXPERT VALIDATION
We enlisted an experienced molecular immunologist who had no prior knowledge of the GOC to assess its utility and accuracy. Not consulting the GO, he constructed two non-overlapping lists of genes widely known to be involved in particular functions: KT 1, a list of 242 genes involved in immune processes; and KT 4, a list of 147 genes involved in cell–cell/cell–matrix interactions.
KT 1, KT 4 and KT 1 ∪ KT 4 provided three queries for GOC into the BP branch of the GO using $\delta_m$, $s = 7$ and scoring function $\bar{S}$. For each returned GOC cluster, the expert assessed the utility (did the cluster terms provide a useful description of a specific biological process?) from 1 = low to 5 = high; and the expectation (was the identified biological process expected for the genes in the query?) from 1 = high to 5 = low. Thus higher scores are better. The expert was also asked to identify any expected biological processes that were not represented in the clusters.
The results are shown in Table 4, where U is the assessed utility; depth $\bar{\delta}(1, p)$ is the relative distance of the cluster head $p$ from the top node 1 of the GO; rank is the rank of the cluster for each query; genes is the number of genes in the cluster; and exp is the assessed expectation. The results are shown first ordered by rank in KT 1, with the corresponding rank (if any) within KT 4 and KT 1 ∪ KT 4 also shown.
About one-third of the clusters in each set contain at least 10% of the genes in the query, and are thus the major descriptors of the gene list, recalling that within a query, a gene may be a member of more than one cluster. These larger clusters tended to have higher expectation scores, while some of the smaller clusters surprised the expert, and likely represent knowledge developed about the genes in areas of biology outside immunology, and thus provide new insights for the expert’s home field.
Utility here is weakly correlated with depth, in that clusters that are ‘high’ in the GO (e.g. GO:0008150 ‘biological_process’, depth 0.06) tend to be too general to allow the user to learn much. More useful clusters include GO:0007031 ‘peroxisome organization and biogenesis’ (depth 0.39) or GO:0007606 ‘chemosensory perception’ (depth 0.33).
Users can ‘drill down’ into clusters to generate subclusters. Partial results for the top six subclusters of the top scoring clusters for KT 1 (GO:0006955 ‘immune response’) and KT 4 (GO:0007155 ‘cell adhesion’) are shown in Table 5. The resulting subclusters are both more specific and more useful.
GOC found considerable overlap in the clusters for KT 1 and KT 4, which is consistent with the importance of genes involved in cell–cell/cell–matrix interactions to general immune processes. It is also consistent with the increasingly popular notion that one gene can participate in many different biological processes. Combining the two gene lists neither generated any ‘new’ clusters nor lost any of the clusters identified by the separate queries.
Finally, we considered the biological processes which the expert expected to find, but were missing from the cluster solutions, e.g. ‘mucosal immunity’, ‘dendritic cell activation’ and ‘regulatory Th3 T-cells’. Upon inspection of the GO and its annotations, it was found that these missing clusters are due to incomplete coverage of the GO in these areas of biology, and the annotation of some genes only at levels that are too ‘high’ in the GO to be informative to the biologist.
Researchers at P&G are finding GOC to be useful in interpreting large-scale gene expression datasets, and its utility is expected to increase as the GO’s annotation increases. As a reviewer noted, immunology is a notoriously difficult area for gene annotation efforts, and the GO is not especially well developed there. Thus GOC’s success in this test is even more noteworthy.
4 FORMAL VALIDATION
We seek a more formal approach to complement our expert validation. To accomplish this, we need an independent source of annotations of collections of GO nodes (corresponding to our lists of target genes) to other ‘summarizing’ GO nodes. This is available through the InterPro project, which catalogs assignments of protein families, domains and functional sites to GO IDs. As an example, the family ‘phosphofructokinase’ is InterPro ID IPR000023, and is annotated to GO:0006096 = ‘glycolysis’, GO:0003872 = ‘6-phosphofructokinase activity’, and GO:0005945 = ‘6-phosphofructokinase complex’. It also maps to 175 proteins. Thus our validation task is to make these 175 proteins a GOC query, and see how well our cluster heads match against the set of GO IDs {GO:0006096, GO:0003872, GO:0005945}.
Regrettably these validation data are not ideal: InterPro GO nodes are determined by hand curation, based on the curator’s biological knowledge and the description of the family or domain, but also by examining the GO nodes of the constituent proteins. Thus there is some inherent bias and circularity in this task. However, we were not aware of any other dataset that is available, and furthermore, we can at least validate our ability to capture the cognitive processes of the InterPro curators.
For a given query $Y$, let GOC return the nodes $T \subseteq P$, while the ‘correct’ answers are a different set $U \subseteq P$. If $T = U$, then GOC did as well as it could. Otherwise, quantifying GOC’s degree of success is not simple. In our example in Figure 2, let us say that GOC returns the answers $T = \{I, E, J\}$, and that the correct answers are $U = \{C, E, K\}$. So E is a ‘hit’, and
Table 4. Top clusters for GOC runs KT 1 (immune processes, 242 genes), KT 4 (cell–cell/cell–matrix interactions, 147 genes) and KT 1 ∪ KT 4
Columns: Cluster head; U; Depth; KT 1 (Rank, Genes, Exp); KT 4 (Rank, Genes, Exp); KT 1 ∪ KT 4 (Rank, Genes)
GO:0006955 immune response 3 0.33 1 117 1 2 124
GO:0006935 chemotaxis 3 0.28 2 59 2 23 3 2 4 61
GO:0007165 signal transduction 3 0.22 3 88 2 4 61 2 3 148
GO:0007267 cell–cell signaling 2 0.22 4 37 2 13 6 3 5 43
GO:0006952 defense response 3 0.28 5 121 1 11 11 3 6 131
GO:0006810 transport 2 0.22 6 42 3 9 53
GO:0007155 cell adhesion 3 0.22 7 32 2 1 89 1 1 119
GO:0006355 regulation of transcription,DNA-dependent 4 0.39 8 18 3 7 9 3 8 27
GO:0007031 peroxisome organization and biogenesis 5 0.39 9 15 2 14 15
GO:0006508 proteolysis and peptidolysis 2 0.33 10 15 4 6 9 4 10 24
GO:0006874 calcium ion homeostasis 2 0.44 11 12 2 16 13
GO:0008152 metabolism 1 0.17 12 102 3 10 40 2 13 141
GO:0008283 cell proliferation 2 0.22 13 29 2 8 14 2 12 43
GO:0000004 biological_process unknown 1 0.11 14 11 5 5 11 5 11 22
GO:0006928 cell motility 2 0.17 15 18 3 12 8 2 15 25
GO:0007275 development 1 0.11 16 35 2 3 39 2 7 74
GO:0019835 cytolysis 3 0.22 17 9 3 20 9
GO:0006915 apoptosis 5 0.28 18 11 2 18 15
GO:0008151 cell growth and/or maintenance 2 0.17 19 90 3 9 32 2 17 122
GO:0009618 response to pathogenic bacteria 4 0.28 20 9 2 3 22 9
GO:0009615 response to virus 4 0.28 21 8 2 20 2 3 19 10
GO:0009611 response to wounding 3 0.22 22 70 2 24 76
GO:0007596 blood coagulation 2 0.22 23 5 4 16 3 4 21 8
GO:0008015 circulation 1 0.17 24 5 2 26 5
GO:0009406 virulence 2 0.17 25 4 3 29 4
GO:0007203 phosphatidylinositol-4,5-bisphosphate hydrolysis 3 0.17 26 4 3 30 4
GO:0007154 cell communication 2 0.17 27 108 4 31 108 2 31 214
GO:0006950 response to stress 3 0.17 28 115 2 32 126
GO:0007605 hearing 1 0.33 29 3 5 26 1 4 28 4
GO:0008166 viral replication 2 0.17 30 3 3 34 3
GO:0007397 histogenesis and organogenesis 1 0.17 31 3 3 17 3 2 23 6
GO:0009306 protein secretion 3 0.22 32 2 2 24 1 3 33 3
GO:0019233 perception of pain 3 0.33 33 2 4 36 2
GO:0009607 response to biotic stimulus 2 0.22 34 123 2 40 134
GO:0008219 cell death 5 0.17 35 20 2 15 7 2 27 27
GO:0016032 viral life cycle 4 0.11 36 2 4 43 2
GO:0009314 response to radiation 4 0.28 37 2 2 42
GO:0009636 response to toxin 4 0.33 38 1 2 44 1
GO:0007606 chemosensory perception 5 0.33 39 1 5 45 1
GO:0007566 embryo implantation 4 0.22 40 1 4 46 1
GO:0009405 pathogenesis 4 0.17 41 1 2 27 1 3 37 2
GO:0007586 digestion 4 0.17 42 1 5 49 1
GO:0006291 pyrimidine-dimer repair,DNA damage excision 3 0.17 43 1 4 50 1
GO:0006944 membrane fusion 4 0.17 44 1 3 41 2
GO:0007622 rhythmic behavior 4 0.17 45 1 5 51 1
GO:0042221 response to chemical substance 3 0.28 46 67 2 53 69
GO:0042330 taxis 5 0.22 47 59 2 54 61
GO:0007582 physiological processes 1 0.11 48 206 2
GO:0009987 cellular process 1 0.11 49 158 2
GO:0008150 biological_process 1 0.06 50 216 1 32 132 1 55 346
GO:0007229 integrin-mediated signaling pathway 4 0.33 2 2 34 1
GO:0007601 vision 4 0.33 5 14 4 4 25 5
GO:0007613 memory 4 0.22 5 18 3 5 35 3
GO:0042060 wound healing 4 0.28 4 19 2 4
GO:0009619 resistance to pathogenic bacteria 5 0.17 3 21 2 38 2
GO:0007048 oncogenesis 4 0.17 4 22 2 3 39 2
GO:0009408 response to heat 4 0.22 2 25 1 4
GO:0030104 water homeostasis 4 0.22 5 28 1 4 47 1
GO:0007588 excretion 3 0.17 4 29 1 3 48 1
GO:0007626 locomotory behavior 3 0.17 5 30 1 4 52 1
Average 3.00 0.22 2.93 2.88
SD 1.24 0.08 1.19 1.10
Table 5. Subclusters for GOC runs conducted on GO:0006955 ‘immune response’ and GO:0007155 ‘cell adhesion’ using the combined gene list KT 1 ∪ KT 4

Subcluster                                                         U     Depth  Rank  No. of genes
Cluster: GO:0006955 immune response
GO:0006954 inflammatory response                                   3     0.28   1     63
GO:0006956 complement activation                                   3     0.39   2     27
GO:0006968 cellular defense response                               4     0.28   3     23
GO:0019735 antimicrobial humoral response (sensu Vertebrata)       4     0.39   4     20
GO:0006960 antimicrobial humoral response (sensu Invertebrata)     3     0.39   5     20
GO:0045087 innate immune response                                  4     0.39   6     64
Average                                                            4.06  0.35
SD                                                                 0.78  0.05
Cluster: GO:0007155 cell adhesion
GO:0007160 cell–matrix adhesion                                    3     0.28   1     32
GO:0007156 homophilic cell adhesion                                4     0.33   2     31
GO:0016337 cell–cell adhesion                                      3     0.28   3     55
GO:0007162 negative regulation of cell adhesion                    5     0.33   4     4
GO:0008037 cell recognition                                        4     0.28   5     3
GO:0030155 regulation of cell adhesion                             5     0.28   6     4
Average                                                            4.00  0.30
SD                                                                 0.82  0.02
then we want to compare each GOC node {I, J} against the nodes {C, K}: J is a ‘child’ of both C and K, and while I is also a child of C, I is only distantly related to K. In addition, GOC does not return a simple set of answers $T$, but rather a ranked list of indefinite length, and so it would be desirable to account for this rank ordering, in that a hit on a high-ranked node is more ‘valuable’ than a hit on a lower-ranked node.
Mathematically, comparing $T$ and $U$ in the context of the poset $\mathcal{P}$ can be described as a matching problem of measuring the overlap or lack thereof of the two sub-posets of $\mathcal{P}$ induced by $T$ and $U$. To date, we have not been able to identify an obvious solution to this problem in the mathematical literature, and instead are beginning to address it ourselves (Joslyn and Mniszewski, 2004). Here, we use some simple methods which lead in the right direction, and which we believe adequate to measure GOC’s success.
For this run we used the November 2003 version of GO, the 6 December 2003 version of the InterPro to GO translation file, the 12 December 2003 version of InterPro, the 8 December 2003 list of proteins in each InterPro group and InterPro types, and the 16 December 2003 SWISS-PROT and TrEMBL GO Translators (go_200311-assocdb.xml and interpro2go from http://www.geneontology.org; interpro.xml and protein2interpro.dat from http://www.ebi.ac.uk/interpro; and gene_association.goa_sptr from http://www.ebi.ac.uk/GOA).
Let $R = \{r\}$ be the set of InterPro IDs, each annotated to GO nodes $U(r)$ and proteins $Y(r)$. InterPro mapped each ID $r$ to an average of $\overline{|U(r)|} = 2.34$ GO nodes and $\overline{|Y(r)|} = 162$ proteins per InterPro ID. In our run, there were 4866 InterPro IDs with GO annotations, with 11 370 mappings to GO nodes and 787 760 mappings to proteins in total. Of these proteins, we were able to locate 778 494, or 98.8%, with GO annotations.
For each InterPro ID $r$, we ran GOC on the annotated proteins $Y(r)$ with score $\hat{\bar{S}}$ and distance $\delta_m$ to convergence for increasing specificity $s$, which averaged $\bar{s} = 7.65$ over all the runs. For each $r$, let $T(r) = \langle t_1, t_2, \ldots, t_i, \ldots, t_n \rangle$ be the list of the top $n$ GO nodes identified by GOC as cluster heads, in rank order, when run on the proteins $Y(r)$. We capped $n$ at 25, although for some $r$, GOC returned fewer clusters than that, yielding an average of $\bar{n} = 22.7$ GOC clusters per InterPro ID. Our run time was ∼4 days on a dual-processor Dell Precision 530.
For a particular InterPro ID $r$, let $U(r) = \{u_1, u_2, \ldots, u_j, \ldots, u_m\}$ be the set of $m$ GO nodes which InterPro has annotated to $r$. The first thing we are interested in is what portion of $U(r)$ we find among our top 25 cluster heads $T(r)$. In fact, over all the runs, we find 11 269 of the 11 370 InterPro GO IDs, or >99%, directly in this way. Another 93 are ‘near misses’, in the sense that GOC finds a node which is comparable. Finally, eight are not found by GOC (Table 6).
Beyond saying that we find on average 2.34 InterPro nodes $U(r)$ somewhere in our 25 GOC nodes, we would like to know where in the rank-ordered list one will find them: the nearer the top of our list they appear, the better our scoring algorithm. One important quantity is the minimum rank of our direct hits. If this is 1, that means that GOC’s top cluster head $t_1$ was also one of those listed by InterPro: $t_1 \in U(r)$. If it is 2, that means that GOC’s top choice $t_1$ was not listed by InterPro, but our second choice was: $t_1 \notin U(r)$, $t_2 \in U(r)$. Similarly, the maximum rank of our direct hits can also be defined. In general, we wish both quantities to be low, and in fact, the minimum rank of direct hits averaged 1.22 with SD 0.69 over all the GOC runs; and the maximum rank averaged 2.01 with SD 1.56.

Table 6. Validation statistics for GOC: (top) InterPro GO nodes found by GOC; (bottom) GOC nodes found in InterPro

                    No. of InterPro nodes  % InterPro nodes  Minimum rank   Maximum rank
                                                             Avg.   SD      Avg.   SD
Found directly      11 269                 99.11             1.22   0.69    2.01   1.56
Found indirectly    93                     0.82
Not found           8                      0.07

                    No. of GOC nodes       % GOC nodes       Minimum rank   Maximum rank
                                                             Avg.   SD      Avg.   SD
Direct hits         11.3K                  10.2              1.22   0.69    2.01   1.56
Immediate family    18.9K                  17.2              2.80   1.98    4.99   3.51
Extended family     17.1K                  15.5              3.12   2.41    3.95   2.86
Comparable          44.4K                  40.2              2.99   2.10    13.21  4.88
Non-comparable      55.3K                  50.0              3.42   3.00    13.12  5.68
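The minimum and maximum rank of direct hits, and the split into nodes found directly, found indirectly and not found, can be sketched in a few lines of Python. The example reuses the illustrative answer list I, E, J and correct set C, E, K from the Figure 2 discussion. The order relations in `UP` are partial and read from the text, and all names are hypothetical rather than taken from GOC.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

def direct_hit_ranks(ranked: List[str], correct: Set[str]) -> Optional[Tuple[int, int]]:
    """(minimum rank, maximum rank) of the direct hits, or None if there are none."""
    ranks = [i for i, node in enumerate(ranked, start=1) if node in correct]
    return (min(ranks), max(ranks)) if ranks else None

def coverage(ranked: List[str], correct: Set[str],
             comparable: Callable[[str, str], bool]) -> Dict[str, str]:
    """Classify each correct node as found directly, found indirectly or not found."""
    status: Dict[str, str] = {}
    for u in sorted(correct):
        if u in ranked:
            status[u] = "found directly"
        elif any(comparable(t, u) for t in ranked):
            status[u] = "found indirectly"  # a 'near miss': some returned node is comparable
        else:
            status[u] = "not found"
    return status

# Partial up-sets (node -> nodes at or above it); assumed for illustration.
UP: Dict[str, Set[str]] = {
    "I": {"I", "C"}, "J": {"J", "C", "K"}, "E": {"E", "I", "C"},
    "C": {"C"}, "K": {"K"},
}

def comparable(a: str, b: str) -> bool:
    return b in UP[a] or a in UP[b]

T = ["I", "E", "J"]    # ranked answers returned for the query
U = {"C", "E", "K"}    # 'correct' answers
print(direct_hit_ranks(T, U))  # (2, 2): the only direct hit, E, sits at rank 2
print(coverage(T, U, comparable))
```

Averaging the two ranks over all queries would give figures of the kind reported for the minimum and maximum rank.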
Hence, GOC found virtually all the GO nodes annotated to InterPro IDs, and very high in its rankings. But each $r$ is left with an average 22.70 − 2.01 = 20.69 GOC nodes $t_i$ remaining, which should be ‘near’ some InterPro GO node $u_j$. This sense of distance in a poset is what we are working to define (C.A. Joslyn, submitted for publication; Joslyn and Mniszewski, 2004). For now, considering two nodes $p, p' \in P$, the status of $p'$ relative to $p$ can be:
• $p'$ is a direct hit on $p$: $p' = p$.
• $p'$ is in the nuclear family of $p$: $p'$ is a child (immediate successor), parent (immediate ancestor) or sibling (child of a parent or parent of a child) of $p$.
• $p'$ is in the extended family of $p$: $p'$ is a grandparent (parent of a parent), grandchild (child of a child), cousin (grandchild of a grandparent or grandparent of a grandchild), aunt/uncle (child of a grandparent) or a niece/nephew (grandchild of a parent) of $p$.
• $p'$ is comparable to $p$: $p \sim p'$.
• $p'$ is non-comparable to $p$: $p \not\sim p'$.
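The family relations listed above can be expressed as short compositions of parent and child steps. The Python sketch below is one reading of those definitions on a hypothetical eight-node hierarchy. All names are illustrative, and two choices are assumptions rather than statements from the text: the categories are tested in the order listed so that each node receives a single status, and extended family excludes the node itself and its nuclear family.

```python
from typing import Dict, Set

def invert(parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build the child map from the parent map (every node must be a key)."""
    children: Dict[str, Set[str]] = {n: set() for n in parents}
    for child, pars in parents.items():
        for par in pars:
            children[par].add(child)
    return children

def step(nodes: Set[str], relation: Dict[str, Set[str]]) -> Set[str]:
    """Apply one parent (or child) step to a set of nodes."""
    out: Set[str] = set()
    for n in nodes:
        out |= relation[n]
    return out

def ancestors(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    seen: Set[str] = set()
    frontier = {node}
    while frontier:
        frontier = step(frontier, parents) - seen
        seen |= frontier
    return seen

def status(q: str, p: str, parents: Dict[str, Set[str]]) -> str:
    """Status of node q relative to node p."""
    if q == p:
        return "direct hit"
    children = invert(parents)
    def up(s: Set[str]) -> Set[str]: return step(s, parents)
    def down(s: Set[str]) -> Set[str]: return step(s, children)
    me = {p}
    # children, parents, and siblings (child of a parent or parent of a child)
    nuclear = (up(me) | down(me) | down(up(me)) | up(down(me))) - me
    if q in nuclear:
        return "nuclear family"
    grandparents, grandchildren = up(up(me)), down(down(me))
    extended = (grandparents | grandchildren
                | down(down(grandparents)) | up(up(grandchildren))  # cousins
                | down(grandparents)                                # aunts/uncles
                | down(down(up(me)))) - nuclear - me                # nieces/nephews
    if q in extended:
        return "extended family"
    if p in ancestors(q, parents) or q in ancestors(p, parents):
        return "comparable"
    return "non-comparable"

# Hypothetical hierarchy (child -> parents).
PARENTS: Dict[str, Set[str]] = {
    "root": set(), "a": {"root"}, "b": {"root"}, "a1": {"a"}, "a2": {"a"},
    "b1": {"b"}, "a1x": {"a1"}, "a1xy": {"a1x"},
}
print(status("a2", "a1", PARENTS))    # nuclear family (sibling)
print(status("b1", "a1", PARENTS))    # extended family (cousin)
print(status("a1xy", "a", PARENTS))   # comparable
print(status("a1xy", "b1", PARENTS))  # non-comparable
```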
These relations are summarized in Table 6. For example, 17.1K GOC nodes, or 15.5% of the total, are in the extended family of at least one InterPro GO node, at average minimum rank 3.12 and maximum rank 3.95. Note that just being comparable or non-comparable does not necessarily mean ‘close’ or ‘far’: siblings are non-comparable, but are very close, while comparable nodes can be separated by a large ‘vertical’ distance. Also, the numbers in the first column are not additive: both the immediate and extended families include both comparable and non-comparable nodes.
5 CONCLUSIONS AND FURTHER WORK
We have demonstrated that the GOC methodology provides a valid and useful approach to categorization in the GO, and are confident that it will prove to be a solid basis for development of further methods and in other poset-based ontologies. Future work includes:
• Methodological development in combinatorial approaches to data analysis, including distances between non-comparable nodes, interval-valued measures of ‘level’ in posets, algorithms for poset width calculation and poset matching (C.A. Joslyn, submitted for publication; Joslyn and Mniszewski, 2004).
• Expansion to other ontologies such as the EC (http://www.biochem.ucl.ac.uk/bsm/enzymes) and MeSH (http://www.nlm.nih.gov/mesh/meshhome.html).
• Continuation of our work in textual approaches, mapping back and forth from semantic relations among GO nodes to those among its lexical components (Verspoor et al., 2003, 2004).
• Interaction with quantitative methods, including distances in the GO from external statistics (Lord et al., 2003), and weightings of the GO to account for different amounts of ‘sparseness’ in coverage of biological domains.
ACKNOWLEDGEMENTS
We would like to thank Phillip Lord of the University of Manchester for suggesting InterPro for validation data. Thanks also to Andreas Rechtsteiner (Computer Science) and Michael Altherr (Biosciences) at LANL; Jun Xu, Tim Smith, Angela Qu, Joe Feeley and Ker-Sang Chen at P&G; and Alex Pothen at Old Dominion University. This work was supported by the Department of Energy, and by a Cooperative Research and Development Agreement between the Los Alamos National Laboratory and Procter & Gamble Corp.
REFERENCES
Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
Joslyn, C.A., Mniszewski, S.M., Fulmer, A.W. and Heaton, G.G. (2003a) Measures on ontological spaces of biological function. Poster. PSB’03, Kauai, HI, January 2003.
Joslyn, C.A., Mniszewski, S.M., Fulmer, A.W. and Heaton, G.G. (2003b) Structural classification in the Gene Ontology. Proceedings of the 6th Bio-Ontologies Workshop on International Society for Computational Biology (ISMB’03). Brisbane, QLD, 20 June.
Joslyn, C.A. (2004) Poset ontologies and concept lattices as semantic hierarchies. Lecture Notes in Artificial Intelligence, Springer-Verlag, in press.
Joslyn, C.A. and Mniszewski, S.M. (2004) Combinatorial approaches to bio-ontology management with large partially ordered sets. SIAM Workshop on Combinatorial Scientific Computing, San Francisco, CA, April 2004. Society for Industrial and Applied Mathematics.
Lee, I.Y., Ho, J.M. and Lin, W.C. (2003) An algorithm for generating representative functional annotations based on gene ontology. 14th IEEE International Workshop on Database and Expert System Applications. pp. 10–15.
Lee, S.G., Hur, J.U. and Kim, Y.S. (2004) A graph-theoretic modeling of GO space for biological interpretation of gene clusters. Bioinformatics, 20, 381–388.
Lord, P.W., Stevens, R., Brass, A. and Goble, C. (2003) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 19, 1275–1283.
Schröder, B.S.W. (2003) Ordered Sets. Birkhäuser, Boston.
Verspoor, K., Joslyn, C.A. and Papcun, G. (2003) Gene Ontology as a source of lexical semantic knowledge for a biological natural language processing application. In Workshop on Text Analysis and Search for Bioinformatics (SIGIR 03), Toronto, Canada, August 2003.
Verspoor, K., Cohn, J., Joslyn, C.A., Mniszewski, S.M., Rechtsteiner, A., Rocha, L.M. and Simas, T. (2004) Protein annotation as term categorization in the Gene Ontology. In Proceedings BioCreative Workshop, Granada, 2004.
Zeeberg, B.R., Feng, W., Wang, G., Wang, M.D., Fojo, A.T., Sunshine, M., Narasimhan, S., Kane, D.W., Reinhold, W.C., Lababidi, S. et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol., 4, R28.