Multiple testing on the directed acyclic graph of gene ontology


Oct 1, 2013 (4 years and 9 months ago)


Vol.24 no.4 2008,pages 537–544
Gene expression
Multiple testing on the directed acyclic graph of gene ontology
Jelle J.Goeman
and Ulrich Mansmann
Department of Medical Statistics,Postzone S5-P,P.O.Box 9600,2300 RC Leiden,The Netherlands and
IBE,Medical School,University of Munich and Department of Statistics,University of Munich,Germany
Received on July 26,2007;revised on November 28,2007;accepted on December 16,2007
Advance Access publication January 18,2008
Associate Editor:David Rocke
Motivation:Current methods for multiplicity adjustment do not
make use of the graph structure of Gene Ontology (GO) when testing
for association of expression profiles of GO terms with a response
Results:We propose a multiple testing method,called the focus
level procedure,that preserves the graph structure of Gene Ontology
(GO).The procedure is constructed as a combination of a Closed
Testing procedure with Holm’s method.It requires a user to choose
a ‘focus level’ in the GO graph,which reflects the level of specificity
of terms in which the user is most interested.This choice also
determines the level in the GO graph at which the procedure has
most power.We prove that the procedure strongly controls the
family-wise error rate without any additional assumptions on the joint
distribution of the test statistics used.We also present an algorithm
to calculate multiplicity-adjusted P-values.Because the focus level
procedure preserves the structure of the GO graph,it does not
generally preserve the ordering of the raw P-values in the adjusted
Availability:The focus level procedure has been implemented in the
globaltest and GlobalAncova packages,both of which are available
Supplementary information:Supplementary data are available at
Bioinformatics online.
In microarray data analysis,there has been a shift of interest
towards methods that analyze genes in the context of their
annotation.Annotation allows genes with similar function
or location to be grouped and analyzed together based on
information fromsources such as KEGG(Ogata et al.,1999) or
Gene Ontology (Ashburner et al.,2000).
In recent years several statistical methods have been
proposed for analyzing sets of genes with similar annotation.
Most of these methods start from univariate measures of
differential expression (usually P-values) obtained from single
gene analyses,using these to test for over-representation
(‘enrichment’) of differentially expressed genes in the gene sets
of interest (Khatri and Dra˘ghici,2005;Mootha et al.,2003).
Other methods start from the raw data and test for differential
gene expression profiles over the gene set in a multivariate
way without first going through single gene testing (Dinu et al.,
2007;Goeman et al.,2004;Mansmann and Meister,2005;
Tomfohr et al.,2005).
The most popular source of annotation used in this context is
the Gene Ontology (GO) database.GO is popular because it is
not only large and detailed,but also highly structured:the GO
terms are ordered in a directed acyclic graph,in which the set of
genes annotated to a certain term (node) is a subset of those
annotated to its parent nodes.Some authors (Alexa et al.,2006;
Grossmann et al.,2006) have proposed to take into account
the GO graph structure when testing for enrichment.
In this article we propose a way to exploit the GO graph
structure for multiple testing correction.The structure of
the GO graph complicates the application of multiple testing
procedures.On the one hand,overlap between sets of genes and
correlations between genes lead to highly complicated and
unknown dependency structures between test statistics.This
means that methods that make assumptions on the joint distri-
bution of the test statistics,such as Westfall and Young (1993)
and others (Dudoit et al.,2003),should be used with caution,
while methods for which the validity is assured under any
dependency structure,such as the procedures of Bonferroni or
Holm (1979),can be overly conservative.
The structure of the GO graph also presents special opportu-
nities,however,as this structure induces logical relationships
between the null hypotheses,which can be exploited (Shaffer,
1986).However,at this moment no currently available
method makes use of the GO structure for multiple testing.
Meinshausen (2007) and Yekutieli (2007) have suggested useful
methods,but both of these are for trees only and do not work
in a directed acyclic graph.
In this article we present a sequentially rejective multiple
testing method that strongly controls the family-wise error rate
(Dudoit et al.,2003) while explicitly making use of the directed
acyclic graph structure of GO.This focus level procedure
merges Holm’s (1979) procedure with the closed testing
procedure of Marcus et al.(1976).Like the two procedures
it combines,the new procedure makes no assumptions on the
joint distribution of the test statistics.
A prerequisite for the focus level method is that the tested
null hypotheses have a logical structure that mimics the
structure of the GO graph (detailed in Section 2).The primary
application we have in mind is with global testing methods such
*To whom correspondence should be addressed.
 The Author 2008.Published by Oxford University Press.All rights reserved.For Permissions,please
by guest on October 1, 2013 from
by guest on October 1, 2013 from
by guest on October 1, 2013 from
by guest on October 1, 2013 from
by guest on October 1, 2013 from
by guest on October 1, 2013 from
by guest on October 1, 2013 from
by guest on October 1, 2013 from
as the Global Test (Goeman et al.,2004,2005),Global Ancova
(Hummel et al.,2007;Mansmann and Meister,2005),PLAGE
(Tomfohr et al.,2005) and SAM-GS (Dinu et al.,2007).These
methods test a null hypothesis H
for a gene set A,of the form
:no gene in is associated with the response ð1Þ
The procedure we propose is not valid for all gene set testing
methods proposed for the GO graph.In particular it is not
valid for enrichment methods such as described by Khatri and
ghici (2005) and Mootha et al.(2003),which have a more
complicated logical structure (Goeman and Bu
Because the focus level procedure follows the structure of the
GO graph,the collection of rejected null hypotheses is always
a coherent subgraph of the GO graph,growing from the
root node up to the most specific nodes that can be called
significant.This coherence greatly facilitates interpretation.
However,a consequence of the preservation of the graph
structure is that the ordering of the raw P-values is not
preserved in the multiplicity-adjusted P-values.It may even
occur that some multiplicity-adjusted P-values are smaller than
their corresponding unadjusted P-values.
The new procedure will be illustrated with two applications
using the biological process GO graph.One uses the Global
Test method on a microarray data set of 295 breast cancer
patients from the Netherlands Cancer Institute (Van de Vijver
et al.,2002).A second example,available in the Supplemental
information,uses the Global Ancova method on a data set
of 238 AML patients from the University Medical Center
Grosshadern in Munich (Schnittger et al.,2005).
The focus level procedure has been implemented in the R
packages globaltest and GlobalAncova.Both packages are
available on
GO (Ashburner et al.,2000) is a controlled vocabulary
that describes gene and gene product attributes.It can be
represented as three disjoint directed acyclic graphs describing
biological process,molecular function and cellular component.
Each has a single root and thousands of dependent nodes.
The three graphs are sometimes joined together by an overall
root node called gene ontology,but in this article we consider
GO as three separate graphs and speak of a GO graph.We do
not make a distinction between the two types of relationships
in the GO graph (‘is a’ and ‘part of’).When talking of the
GO graph we visualize the root node of the graph as the ‘top’
of the graph and the end nodes as the ‘bottom‘.
For the purposes of this article,we will look at a GO graph
as a structured collection of sets of genes.Genes are annotated
to GO terms,so that each node in the gene ontology graph
can be associated with a set of genes.The structure of the
graph dictates that each gene annotated to a certain node is
automatically also annotated to all ancestor nodes,so that,
viewed in terms of sets,every edge in the graph represents
a subset relationship:each child node is a subset of all its parent
nodes,and the root node in each GO graph includes the
superset of all nodes.Note,however,that not all subset
relationships necessarily correspond to an edge in the GO
graph,because genes may have multiple functions.
Every gene set has a corresponding null hypothesis that can
be tested.We assume that these null hypotheses are formulated
in such a way that they reflect the relationships in the GO
graph.In general,we let any subset relationship between gene
sets imply a logical relationship between the corresponding
null hypotheses.Let A,B and C be gene sets,and H
the corresponding null hypotheses,then we assume that
1.If B4A,then truth of H
implies truth of H
2.If A¼B[C,then simultaneous truth of H
and H
implies truth of H
Note that the null hypothesis corresponding to the set of
all genes on the chip is false in this setup whenever any null
hypothesis is false.
The assumptions hold for the multivariate gene set testing
methods of Global Test (Goeman et al.,2004,2005),Global
Ancova (Hummel et al.,2007;Mansmann and Meister,2005),
PLAGE (Tomfohr et al.,2005) and SAM-GS (Dinu et al.,2007)
which test hypotheses of the form (1).Methods that test
enrichment of GO nodes (Khatri and Dra
et al.,2003) are typically not consistent with these assumptions.
See Tian et al.(2005) and Goeman and Bu
hlmann (2007) for an
overview of different null hypotheses used in gene set testing.
It should be remarked that the logical structure that holds for
the null hypotheses generally should not be expected to hold
for the raw,non-multiplicity adjusted test results.As the power
of the tests depends on the gene set,it may,for example,
easily occur in practice that H
is significant but H
is not,
while B4A.
The focus level procedure we propose for the GO graph is
a combination of the procedure of Holm (1979) and the closed
testing procedure of Marcus et al.(1976).It is actually not
a single procedure,but a sequence of procedures that depends
on a user-chosen focus level.At the two extreme choices of the
focus level,the procedure is either bottom-up,and purely based
on Holm,or top-down,and purely based on closed testing.
We’ll discuss these two extremes first.Let  be the chosen
significance level.
Abottom-up procedure based on Holmwould first look only
at the collection of all end node null hypotheses of the GO
graph.If this collection has m elements,it compares the
smallest P-value to /m.If the P-value is smaller,the corres-
ponding null hypothesis is declared significant,and the second
smallest P-value is compared to /(m1).The procedure con-
tinues until the kth smallest P-value is larger than /(mkþ1).
At this point,all mkþ1 largest P-values are declared
non-significant.At this point,significance can be propagated
to all ancestors in the GO-graph:a higher level GOnode can be
declared significant whenever it has any significant offspring.
It is easy to show that this procedure keeps strong control of
the family-wise error rate at level .By the proof of Holm’s
procedure (Holm,1979) the family-wise error rate is controlled
for the collection of end node null hypotheses.Further,by the
J.J.Goeman and U.Mansmann
logical structure of the null hypotheses,the upward propa-
gation of significance through the GO-graph can only lead to
a false rejection if there was already a false rejection in one of
the end nodes.Hence,it can never introduce a first false
rejection and so does not increase the probability of making
a first false rejection.For a formal proof,see Section 5.
An advantage of this bottom-up procedure is that it is quick,
as the number of tests that has to be preformed is smaller than
the number of nodes in the GO graph.A disadvantage is that
it is not very efficient.The GO graph fans out like a tree and
has a great number of end nodes,so that the multiple testing
correction can still be quite severe.The bottom-up procedure,
being focused at the leaves of the graph,may easily find
a strongly significant effect in a few genes in one of the end
nodes,but it may just as easily miss a weaker but more
consistent effect of a large number of genes in a gene set at
a higher level in the graph.
An alternative to the bottom-up procedure is a top-down
procedure based on the closed testing procedure of Marcus,
Peritz and Gabriel.To be able to use this procedure we must
enlarge the graph to be tested so that the collection of gene sets
in the nodes becomes closed to the union operation.That is,for
any gene sets A and B corresponding to nodes in the graph,
a node corresponding to A[B must also be included,and this
process must be continued recursively until all possible
(multiple) unions are included.New edges are added in
the expanded graph for every direct subset relationship.This
closing of the graph can dramatically increase the number of
hypotheses to be tested if the original number of hypotheses
was already large,as the size of the expanded graph increases
exponentially with the size of the original graph.We illustrate
the closing of the GO graph in Figure 1.
In the first step the top down closed testing procedure tests
only the top node,which is the union of all GO gene sets,
at level .If that test is not significant,the procedure stops.
Otherwise,at every next step it tests those gene sets of which all
supersets (i.e.all ancestors in the expanded graph) have been
declared significant,and when there are no such sets,the
procedure stops.At this point all gene sets which have not been
tested yet are declared non-significant.All tests in the closed
testing procedure can be performed at level ,but the
procedure still keeps strong control of the family-wise error
rate at level ,as has been very elegantly proved by Marcus
et al.(1976).
The property that all tests are done at level  makes the
closed testing procedure very attractive.It is a very efficient
multiple testing procedure if the test statistics are highly
correlated,which makes it especially useful for the GO graph.
However,a closed testing procedure can easily miss a strong
but isolated effect in one of the end nodes in a data set where
very few genes show an effect:to be called significant,the effect
of such a small end node must remain significant when it is
watered down in all possible unions of the significant set with
all possible other sets.Moreover,unless the initial number of
nodes to be tested is very small,the computational burden
can be crippling due to the huge size of the expanded graph.
For example,in the breast cancer data set used in Section 7 the
biological process GOgraph with 2865 non-empty nodes would
expand to a closed graph with around 8.5 10
The bottom-up and top-down multiple testing procedures
are mirror images of each other in many ways,and they have
opposite strengths.The bottom-up method is good at finding
local ‘hotspot’ effects,where only a single end node is highly
significant,even when most of the other nodes are non-
significant.Conversely,the top-down method is good at finding
weak global effects,where many genes show a small effect.
In the top-down method significance of GO terms low down
is greatly influenced by the (lack of) significance of large gene
sets near the top of the graph:a small node with a highly
significant raw P-value may fail to be significant in the presence
of a large non-significant node.Conversely,in the bottom-up
method significance of GO terms near the top of the graph is
greatly influenced by the presence of highly significant end
nodes:a large node with a significant rawP-value may fail to be
significant if none of its offspring terms individually reaches
the Bonferroni threshold.
The focus level method strikes a balance between the two
extremes described above.The basic idea is very simple:instead
of starting at the top or at the bottom of the graph,we start
somewhere the middle,at a pre-chosen level in the GO graph
called the focus level.We then use the bottom-up method to
extend significance to higher levels and the top-down method
to extend it to lower ones.
In the most general form a focus level is simply a collection
of GO terms.However,for an efficient procedure it is
recommended that the focus level truly represents a level in
the graph,meaning that no term in the focus level has either
ancestors or offspring which are also in the focus level.It is also
recommended that all other GO nodes of interest are either
ancestors or offspring (or at least ancestors of offspring) of
focus level nodes,because otherwise the focus level procedure
will not reach all nodes of interest.A third consideration is
computation time,as discussed in Section 6.
The full GOgraph G can nowbe split at the focus level into m
subgraphs F
,each with a single focus level node as
root node and consisting of all its offspring nodes.Note that
the subgraphs may partly overlap.Each of these subgraphs
will be expanded separately in the same way as in the closed
Fig.1.Illustration of the expansion (‘closing’) of a graph for use in a
Closed Testing procedure.Left:original graph.Right:expanded graph.
Multiple testing on the GO graph
testing procedure (Section 3).Call these expanded subgraphs


We also define an expanded scope graph

G.Its nodes are
given by the union of all nodes in the expanded graphs
G [



.It gets a newset of edges which correspond to all
subset relationships:r 2

G is a parent of s 2

G if the
corresponding gene sets R of r and S of s fulfil RS.
The focus level multiple testing procedure revolves around
a collection H

G of testable hypotheses.At the start of
the procedure H is the collection of focus level hypotheses.
The focus level procedure also uses a multiple testing correction
factor h,which we call Holm’s factor.It starts as m,the number
of elements in H.
The steps of the focus level procedure are outlined in Table 1.
This procedure keeps the family-wise error rate at level ,
as we shall prove in the Section 5.The algorithm is presented
in its most simple form here.Several adjustments can be made
to increase computation speed;we consider these in Section 6.
It is easy to adapt the algorithmto let it produce multiplicity-
adjusted P-values.A multiplicity adjusted P-value (Dudoit
et al.,2003) for a hypothesis is defined as the smallest -level at
which that hypothesis will be called significant.For the focus
level procedure these can be calculated as follows.Simply start
at ¼0 and run the algorithm.Amend the algorithm so that
every time no (more) rejection can be made,it does not stop,
but increases  to the next value that allows a new rejection,
and continues.All rejected hypotheses get an adjusted P-value
equal to the -level at which they were rejected.Stop the
algorithm only when  reaches 1,at which point all unrejected
hypotheses get adjusted P-value 1.
The focus level procedure is a true generalization of the
top-down and bottom-up procedures described in Section 3.
If the root node is taken as the focus level,the focus level
procedure is exactly the top-down closed testing procedure
(nothing happens in steps 2 and 4).Similarly,if the collection of
all end nodes is taken as the focus level,the focus level procedure
is exactly the bottom-up procedure (nothing happens in step 3).
The focus level procedure presents a generalized procedure
whose properties are between the bottom-up and top-down
procedures.Just as the top-down procedure has good power for
detecting weak global effects and the bottom-up procedure has
good power for detecting strong local effects,the focus level
procedure has good power for detecting intermediate effects
near the focus level.In the top-down procedure the significance
of end nodes is influenced by significance of nodes near the
root of the graph,and in the bottom-up procedure the signifi-
cance of nodes near the root is influenced by significance of
end nodes.Similarly,in the focus level procedure significance of
nodes far from the focus level is influenced by the significance
of nodes at the focus level.The choice of the focus level should
therefore be considered carefully:the focus level should ideally
be chosen at the level of specificity that is of most interest to the
researcher,taking into account the restrictions of computability
(see Section 6).
The significant subgraph returned by the focus level
procedure is always a coherent graph,because step 2 of the
algorithm guarantees that there is a path from any significant
node through the significant subgraph to the root node.In this
sense the result of the procedure reflects Assumption 1 of
Section 2:that assumption on truth and falsehood of null
hypotheses also holds for rejection and non-rejection.Similarly,
reflecting Assumption 2,a significant node will often have
at least one offspring node that is significant.This does not
always happen,however.In the first place,it may be that the
significance of the higher level node is due to genes which are
annotated to the higher level GO term but not to any of its
offspring terms.Alternatively,it may be that the significant
signal of the higher level node is due to the combination of
weak signals from its offspring nodes,none of which is strong
enough to be detected by itself.In both of these cases the
interpretation of the result should be that there is a clear signal
in the general GO term,which cannot be safely attributed
to any specific offspring term.
It is interesting to note that the focus level algorithm (just as
the bottom-up procedure) can produce multiplicity-adjusted
P-values which are smaller than the corresponding raw
P-values.This can only happen to terms above the focus level
terms,and occurs in step 2,where hypotheses are rejected based
on the P-values of their offspring,without reference to the
P-value of the hypothesis itself.An example of this effect will be
shown in Section 7.If this property of the procedure is felt to
be undesirable for interpretational reasons,the algorithm can
be easily amended to only reject in step 2 if the raw P-value of
the ancestor hypotheses is smaller than .However,this may
result in a significant graph that lacks coherence.
As the focus level procedure merges Holm’s procedure with
the closed testing procedure,the proof that the new procedure
controls the Family-wise error rate essentially merges the two
elegant proofs of Marcus et al.(1976) and Holm (1979).
For each subgraph F
,i ¼1,...,n,let T
be the collection
of nodes in F
that correspond to true null hypotheses.As each
node corresponds to a set of genes,we can represent T
Table 1.The focus level algorithm
1.Test phase:Reject all hypotheses in H with raw
P-value /h.
2.Upward phase:For every new hypothesis rejected in
step 1,reject all ancestors in

3.Downward phase:In each of


any hypotheses for which all parent hypotheses in
that subgraph have been rejected.Add these hypotheses
to the testable set H.
4.Holm’s phase:Recalculate Holm’s factor h as the
number of subgraphs f


g which have
non-empty intersection with H.
Until no new rejections occur
J.J.Goeman and U.Mansmann
g,a collection of n
gene sets corresponding to true
null hypotheses in F
Let k¼#{i:T
¼;},the number of subgraphs containing true
null hypotheses,and let i
be the indices i such that
6¼;.Let H
be the null hypotheses corresponding to
.Note that by assumption 2 in Section 2 H
are themselves true null hypotheses.
We prove that the focus level procedure keeps the family-wise
error at level  by showing that
1.The first true null hypothesis that is rejected is one
of H
.This must happen during step 1 of the
2.If all H
have raw P-values greater than /k,
none of them will be rejected.
If 1 and 2 are true,and p
are the P-values of H
Pðno false rejectionÞ  Pðminðp
so that
FWER  Pðat least one of p


 =kÞ  :
which proves the family-wise error control of the method.
To prove 1,first remark that by Assumption 1 in Section 2
any offspring in

G of a true hypothesis is itself a true hypothesis.
Therefore,step 2 can never lead to a first false rejection:if the
rejected ancestor in step 2 was a true null,its offspring rejected
in step 1 was already a true null itself.
The first false rejection must therefore take place in step 1.
To show that the first false rejection in step 1 is always from
among H
,remark that by the construction of the closed

the null hypothesis H
is contained in the graph

that by the construction of T
every other true null hypothesis in
that graph is offspring of H
in that graph.Therefore no true null
other than those in H
can be added to Hin step 3 before
at least one H
has been rejected.Therefore,as long
as no false rejection has been made,Hwill only contain true null
hypotheses from among H
To prove 2,remark that if all p
are larger than /k,
none of H
will be rejected before the Holm’s factor h
drops below k.But the Holm’s factor will not drop below
k before all but k1 subtrees

have been fully rejected,which
cannot happen as long as H
remain unrejected.
In a practical implementation of the focus level algorithm
in a large graph,efforts must be made to reduce the
computational burden.In this section we describe some
amendments to the algorithm that can be used to reduce this
burden,although sometimes at a small loss of power.
The greatest bottleneck of the algorithm is the potentially
enormous size of the expanded scope graph

G,which can
become especially large if some of the focus level nodes
have a large number of offspring nodes.To keep

G small,
it is,therefore,important to keep the expanded subgraphs


small.This can be done for each subgraph by
finding a small number of atom sets that allow construction of
all offspring sets of the focus level node as unions of these
atoms.Using these atoms it is possible to avoid redundant sets
when making the closed graphs as in Figure 1,while still
including all the original gene sets in the graph.For example,
in the data set of Section 7,a closed graph of all 2865 biological
process GO terms would contain 2
nodes,while reducing
the 2865 sets to 900 atoms leads to a closed graph of 2
which still includes all 2865 original sets.
We construct atoms for each subgraph as follows.Start with
the a collection C
¼ fC
g of gene sets.the algorithm
is given in Table 2.Note that the atom construction is not
necessarily unique and that a more sophisticated algorithm
may be able to find a smaller closed graph.
It is important to keep the number of atoms small,because the
number of nodes in the final closed graph

is 2
recommend not choosing any focus level nodes whose subgraphs
contain more than 9–12 atoms (500–4000 subgraph nodes).
Computation speed can also be gained by avoiding the
laborious construction of the whole graph

G,most of which will
never be reached by the algorithm.We can avoid construction

G by creating each expanded graph

only when it is needed,
that is when the ith focus term is declared significant.Avoiding
construction of

G may entail some loss of power in step 2 of the
algorithm,because in that step only known subset relationships
from the original GO graph G and from already expanded
subgraphs can be exploited for rejection.
These measures allow a good reduction of computation time.
Note that due to the stepwise nature of the algorithm,finding
only the smallest few adjusted P-values is much quicker than
finding all adjusted P-values.In the current R implementation
for Global Test,calculating all 2865 adjusted P-values of the
Biological process GO graph for the Netherlands Cancer
Institute data (see Section 7) takes about 42min on a 3.2GHz
PC.Calculating only the 100 smallest multiplicity-adjusted
P-values takes just 5min.
We have applied the focus level procedure to the Netherlands
Cancer Institute Breast Cancer Data (Van deVijver et al.,2002).
This is a data set of 295 breast cancer patients followed up to
18 years.Microarray data (Rosetta technology) are available for
Table 2.Finding atoms
For i in 1,...,k
1.Choose the ith atom A
from among the elements of C
such that A
is minimal.
2.Let C
¼ fC
3.Terminate if C
(i þ1)
contains only empty sets.
Multiple testing on the GO graph
these patients.We used the same quality preselection of
4919 genes as the original authors.Survival time and censoring
indicators are available for each patient.We are interested in
finding GO terms associated with survival of patients.GO
information was obtained using the AnnBuilder package (Zhang
et al.,2003).Of all GO terms,2865 from the biological process
ontology corresponded to non-empty gene sets in this data set.
The focus level for each of these three graphs was calculated
in an automated way.We first found the ‘feasible terms’ as all
GO terms whose offspring could all be written in terms of
unions of no more than 9 atom sets.Then we defined the focus
level as all feasible terms which had no parents among the
feasible terms.This gave us a focus level of maximumgenerality
given a reasonable computation time.For the biological
process graph the focus level calculated in this way consisted
of 589 terms;262 terms were above the focus level and 2002
below.For calculating raw P-values we used the global test for
survival data of Goeman et al.(2005).
Figure 2 compares the multiplicity-adjusted P-values from
the focus level method to the raw P-values.The figure clearly
shows that the focus level multiplicity adjustment does not
preserve the ordering of the raw P-values.The dominant
diagonal line around y ¼xþlog
(589) gives the GO terms at
or close to the focus level,while the spread of points around the
line is of GO terms far from the focus level.The figure also
shows examples of multiplicity-adjusted P-values which are
smaller than the corresponding raw P-values.
Next we compared the number of gene sets found with the
focus level method to Holm’s procedure,which also controls
the family-wise error rate without any assumptions on the joint
distribution of the test statistics.Table 3 gives the number of
significant GO terms for Holm’s (1979) procedure,and for
Holm’s procedure incorporating the logical relationships in
the graph (‘Holm plus’).The latter procedure tests all nodes as
Holm’s procedure,but rejects not only the term itself but also
all its ancestor terms.The table also gives the number of
rejected hypothesis for the focus level procedure for different
choices of the focus level.The focus levels are defined in
relation to the maximum number of atoms below each focus
level term,as explained above.Table 3 shows that the focus
level procedure rejects more gene sets than Holm’s procedure.
In this data set it seems preferable to choose a very general
focus level:the number of gene sets found when choosing a high
focus level is greater than the number found by the extended
Holm procedure,while much fewer gene sets are found when
choosing a low focus level.Table 4 gives the overlap between
the GO terms found by the focus level procedure at different
Table 3.Number of rejected biological process gene ontology terms in
the Netherlands Cancer Institute breast cancer data set for Holm’s
procedure and the focus level procedure at different choices of the focus
Procedure Number of
focus level
0.1 0.01 0.001 1e04 1e05 1e06
Holm 2865 439 247 98 38 9 0
Holm plus 2865 475 312 161 86 44 0
Bottom up 840 423 252 135 80 25 0
Focus level 1 960 466 283 142 73 36 0
Focus level 2 918 506 305 162 80 37 0
Focus level 3 851 508 313 164 89 51 26
Focus level 4 770 513 307 166 89 51 26
Focus level 5 711 518 317 169 85 53 32
Focus level 6 668 525 320 172 90 50 26
Focus level 7 641 536 318 165 87 40 26
Focus level 8 607 535 333 191 97 49 26
Focus level 9 589 539 336 197 99 52 26
Focus level 10 572 543 334 196 98 59 25
Focus level 11 559 544 342 193 99 63 26
Focus level 12 534 545 342 191 99 68 26
Table 4.Overlap between the significant GOgraphs found by the focus
level procedure at different heights of the focus level at ¼0.01
(numbers of GO terms)
1 2 3 4 5 6 7 8 9 10 11 12
1 283
2 268 305
3 252 287 313
4 246 281 304 307
5 250 275 298 300 317
6 250 275 297 299 316 320
7 250 275 294 295 311 315 318
8 254 278 300 297 312 316 315 333
9 253 277 300 295 310 313 312 330 336
10 252 276 299 294 309 312 311 329 334 334
11 253 277 299 294 309 312 311 329 334 333 342
12 242 265 287 282 297 300 299 317 322 321 330 342
Focus level height  Focus level height.
Fig.2.Base 10 logarithmof raw P-values for the biological process GO
graph versus base 10 logarithm of multiplicity-adjusted P-values from
the focus level procedure.The dotted line is the line of equality.
J.J.Goeman and U.Mansmann
choices of the focus level at ¼0.01.The numbers suggest good
agreement between the GO graphs found even at widely
different choices of the focus level.
Visualization of the significant GO graph obtained by the
focus level method can be done using the R package Rgraphviz
(Gentry,2005).Figure 3 shows the significant subgraph of the
biological process graph at ¼10
.This graph gives a succinct
picture of the processes most clearly associated with survival of
breast cancer patients.It has prominent clusters of processes
involved in cell cycle and related processes such as DNA
Asecondapplicationusing Global Ancova (Hummel et al.,2007;
Mansmann and Meister,2005) on an AML data set (Schnittger
et al.,2005) is presented in the Supplemental information.
Two things are important when adjusting for multiple testing
in the Gene Ontology graph.In the first place,the GO graph is
structured,and it is wasteful not to make use of that structure.
Secondly,due to the unknown and possibly highly irregular
dependency structures between test statistics,it is important not
to make any assumptions on the joint distribution of the test
statistics.In this article we have presented the focus level
method:a multiple testing method that controls the family-wise
error rate on the GO graph that makes use of the graph
structure and makes no further assumptions.Because the
procedure makes use of the logical structure of the GO graph,
the procedure has more power than traditional procedures such
as Holm’s.To our knowledge the focus level method is the first
method that allows the structure of Gene Ontology to be used
when correcting for multiple testing.
The significant gene sets that come out of the focus level
methods can be presented as a subgraph of the GO graph.
Table 5.Legend of Figure 3
1 biological process 53 DNA replication
2 response to stimulus 54 DNA recombination
3 metabolic process 55 protein modification
4 cellular process 56 dephosphorylation
5 biological regulation 57 second-messenger-mediated
6 developmental process signaling
7 response to stress 58 M phase
8 response to endogenous 59 interphase
stimulus 60 mitotic cell cycle
9 primary metabolic process 61 regulation of progression
10 macromolecule metabolic through cell cycle
process 62 microtubule-based process
11 cellular metabolic process 63 chromosome organization and
12 cell communication biogenesis (sensu Eukaryota)
13 cell division 64 chromosome localization
14 cell cycle 65 establishment of organelle
15 chromosome segregation localization
16 regulation of biological process 66 cell death
17 cell organization and 67 DNA-dependent DNA
biogenesis replication
18 localization 68 protein amino acid
19 cellular developmental process dephosphorylation
20 death 69 phosphoinositide-mediated
21 response to DNA damage signaling
stimulus 70 M phase of meiotic cell cycle
22 protein metabolic process 71 M phase of meiotic cell cycle
23 nucleobase,nucleoside,72 interphase of mitotic cell cycle
nucleotide and nucleic acid 73 regulation of progression through
metabolic process mitotic cell cycle
24 biopolymer metabolic process 74 cell cycle checkpoint
25 cellular macromolecule 75 positive regulation of
metabolic process progression through cell cycle
26 phosphorus metabolic process 76 microtubule cytoskeleton
27 signal transduction organization and biogenesis
28 cytokinesis 77 sister chromatid segregation
29 cell cycle process 78 establishment of chromosome
30 regulation of cellular process localization
31 positive regulation of biological 79 programmed cell death
process 80 regulation of a molecular function
32 organelle organization and 81 mitosis
biogenesis 82 meiosis
33 cellular localization 83 G2/M transition of mitotic cell
34 establishment of localization cycle
35 protein localization 84 positive regulation of progression
36 cell differentiation through mitotic cell cycle
37 DNA metabolic process 85 spindle checkpoint
38 cellular protein metabolic 86 spindle organization and
process biogenesis
39 biopolymer modification 87 regulation of programmed cell
40 phosphate metabolic process death
41 intracellular signaling cascade 88 apoptosis
42 meiotic cell cycle 89 regulation of catalytic activity
43 cell cycle phase 90 mitotic sister chromatid
44 regulation of cell cycle segregation
45 positive regulation of cellular 91 regulation of mitosis
process 92 chromosome condensation
46 cytoskeleton organization and 93 regulation of apoptosis
biogenesis 94 regulation of hydrolase activity
47 chromosome organization and 95 mitotic chromosome
biogenesis condensation
48 organelle localization 96 regulation of exit from mitosis
49 establishment of cellular 97 regulation of caspase activity
localization 98 positive regulation of exit from
50 protein complex localization mitosis
51 cell development 99 negative regulation of caspase
52 DNA repair activity
Fig.3.The significant subgraph of the biological process GO graph
according to the focus level procedure at ¼0.0001.See Table 5 for
a legend to this figure.The nodes colored grey are focus level nodes.
Multiple testing on the GO graph
This makes it easy to interpret the results of the experiment,
as the significant gene sets are always presented in context.The
focus level method tends to produce coherent graphs because
the algorithm favors finding significant gene sets close to other
significant sets.
The focus level method requires the user to specify a focus
level,which is the initial level of the graph in which the algorithm
searches for significant sets;it only delves deeper into the graph
at places where significance is found.The choice of a focus level
gives the researcher some control over the multiple testing
procedure,specifying at which level of detail in the GO graph
it should have most power.This flexibility of the procedure is an
important asset of the focus level method,as it allows for more
biological input into the statistical analysis.There are some
computational constraints that exclude some of the general
nodes as focus level nodes,but in our experience these con-
straints are flexible enough to fit the needs of most researchers.
Because the focus level method preserves the structure of the
GO graph when correcting for multiple testing,it does not
preserve the ordering of the raw P-values.Therefore,adjusted
P-values from the focus level method should never be
interpreted individually,but only in the context of the
significant GO graph,because the interpretation of an
individual adjusted P-value should depend on the location in
the graph where it occurs.The primary result of the focus level
method is a significant GO subgraph,not a list of adjusted
P-values.If the primary interest is in individual GO nodes
instead of in the significant graph as a whole,other multiple
testing correction methods may be more appropriate.
The focus level procedure controls the family-wise error rate
rather than the false discovery rate (Benjamini and Hochberg,
1995).This is because the concept of the false discovery
rate was designed for an unstructured collection of potential
discoveries,in which discoveries are exchangeable and it makes
sense to talk about a rate of false discoveries.This concept
cannot be immediately transferred to a hierarchical graph-like
structure with logical relationships.Control of the false
discovery rate in the closed graph (Figure 1),for example,
does not imply control of the false discovery rate in the original
graph.The concept of the family-wise error rate,on the other
hand,is easily applicable to graphs and much more consistent
with the idea that the graph that results from the focus level
procedure in fact constitutes a single ‘discovery’.
It is a frequently voiced general criticism of GO that it does
not represent an objective view of biology.The genes annotated
are mostly the result of disease related research and the
construction of GO is strongly biased in this direction.
However,this criticism does not reduce the need for statistical
procedures on acyclic graphs derived from ontologies,because
ontologies will remain an essential tool for organizing complex
knowledge in genomic research.The methods presented in this
article for GO can easily be applied to more reliable ontologies
that will be available in the future.
The authors thank Prof.Laura van ’t Veer (Netherlands Cancer
Institute) and Prof.Stephan Bohlander (Internal Medicine III,
University Hospital,University of Munich) for kindly
providing their data.The research of J.J.G.was supported by
the Netherlands Bioinformatics Centre,Biorange project 1.3.2.
The research of UM was supported by the German National
Genome Research Network,project 01 GR 0459.
Conflict of Interest:none declared.
Alexa, al.(2006) Improved scoring of functional groups from gene
expression data by decorrelating GO graph structure.Bioinformatics,22,
Ashburner, al.(2000) Gene Ontology:tool for the unification of biology.
Benjamini,Y.and Hochberg,Y.(1995) Controlling the false discovery rate:
a practical and powerful approach to multiple testing.J.Roy.Stat.Soc.
B Met.,57,289–300.
Dinu, al.(2007) Improving gene set analysis of microarray data by SAM-GS.
BMC Bioinformatics,8,242.
Dudoit, al.(2003) Multiple hypothesis testing in microarray experiments.
Gentry,J.(2005) Rgraphviz:Provides Plotting Capabilities for R Graph Objects,
R package version 1.10.0.
Goeman,J.J.and Bu
hlmann,P.(2007) Analyzing gene expression data in terms
of gene sets:methodological issues.Bioinformatics,23,980–987.
Goeman, al.(2004) A global test for groups of genes:testing association
with a clinical outcome.Bioinformatics,20,93–99.
Goeman, al.(2005) Testing association of a pathway with survival using
gene expression data.Bioinformatics,21,1950–1957.
Grossmann, al.(2006) An improved statistic for detecting over-representated
Gene Ontology annotations in gene sets.In A.Apostolico et al.(eds.) Lecture
Notes in Computer Science,Springer,New York,vol.3909,pp.85–98.
Holm,S.(1979) A simple sequentially rejective multiple test procedure.
Scandinavian J.Stat.,6,65–70.
Hummel, al.(2007) GlobalANCOVA:exploration and assessment of gene
group effects.Bioinformatics,doi:10.1093/bioinformatics/btm531.
Khatri,P.and Dra
ghici,S.(2005) Ontological analysis of gene expression
data:current tools,limitations,and open problems.Bioinformatics,21,
Mansmann,U.and Meister,R.(2005) Testing differential gene expression in
functional groups:Goeman’s global test versus an ANCOVA approach. Med.,44 (3),449–453.
Marcus, al.(1976) On closed testing procedures with special reference to
ordered analysis of variance.Biometrika,63,655–660.
Meinshausen,N.(2007) Hierarchical testing of variable importance.Unpublished
preprint available at

Mootha, al.(2003) PGC-1 alpha-responsive genes involved in oxidative
phosphorylation are coordinately downregulated in human diabetes.
Ogata, al.(1999) KEGG:Kyoto Encyclopedia of Genes and Genomes.
Nucleic Acids Res.,27,29–34.
Schnittger, al.(2005) Nucleophosmin gene mutations are predictors of
favorable prognosis in acute myelogenous leukemia with a normal karyotype.
Shaffer,J.P.(1986) Modified sequentially rejective multiple test procedures.
Tian, al.(2005) Discovering statistically significant pathways in expression
profiling studies.Proc.Natl Acad.of Sci.USA,102,13544–13549.
Tomfohr, al.(2005) Pathway level analysis of gene expression using singular
value decomposition.BMC Bioinformatics,6,225.
Van de Vijver, al.(2002) A gene-expression signature as a predictor of
survival in breast cancer.New Engl.J.Med.,347,1999–2009.
Westfall,P.H.and Young,S.S.(1993) Resampling-Based Multiple Testing:
Examples and Methods for p-value Adjustment.Wiley,New York.
Yekutieli,D.(2007) Unpublished preprint available at
Zhang, al.(2003) An extensible application for assembling annotation for
genomic data.Bioinformatics,19,155–156.
J.J.Goeman and U.Mansmann