Multiple testing adjustments
European Molecular Biology Laboratory
Predoc
Bioinformatics Course
17th Nov 2009
Tim Massingham, tim.massingham@ebi.ac.uk
Motivation
We have already come across several cases where we need to correct p-values.

Pairwise gene expression data (p-values for each pair of experiments):

        Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Exp 1   0.027   0.033   0.409   0.330   0.784
Exp 2           0.117   0.841   0.985   0.004
Exp 3                   0.869   0.927   0.001
Exp 4                           0.245   0.021
Exp 5                                   0.004
What happens if we perform several vaccine trials?
Motivation
10 new vaccines are trialled. Declare a vaccine a success if its test has a p-value of less than 0.05.

If none of the vaccines work, what is our chance of a "success"?

Each trial has probability 0.05 of "success" (false positive)
Each trial has probability 0.95 of "failure" (true negative)

Probability of at least one "success"
  = 1 - Probability of none
  = 1 - (Probability a trial is unsuccessful)^10
  = 1 - 0.95^10
  ≈ 0.40
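The calculation above can be checked in a few lines of Python (a sketch, not from the original slides):

```python
# Chance of at least one false positive ("success") across 10 independent
# trials, each performed at size 0.05 when every null hypothesis is true.
n_trials = 10
size = 0.05

p_any_false_positive = 1 - (1 - size) ** n_trials
print(round(p_any_false_positive, 3))  # roughly 0.401
```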
Rule of Thumb
Multiply the size of the test by the number of tests.
Motivation
A more extreme example: test an entire population for a disease.

The population is a mixture: some have the disease, some don't. We want to find the individuals with the disease.

                         True status
                     Healthy          Diseased
Test report
  Healthy        True negative    False negative
  Diseased       False positive   True positive

Family-Wise Error Rate: control the probability that any false positive occurs.
False Discovery Rate: control the proportion of false positives among discoveries.

FDR = # false positives / # positives
    = # false positives / (# true positives + # false positives)
Cumulative distribution
A simple examination by eye: the cumulative distribution of the p-values should be approximately linear.

• Rank the data
• Plot rank against p-value

The plot starts at (0, 1), ends at (1, n), and never decreases.
N.B. The ranks are often scaled to (0, 1] by dividing by the largest rank.
Cumulative distribution
[Figure: rank plots for 910 p-values. Five sets of uniformly distributed p-values give approximately straight lines; non-uniformly distributed data show an excess of extreme (small) p-values.]

A one-sided Kolmogorov test could be used if desired.
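The by-eye check can be made quantitative. A minimal Python sketch (not from the slides; the function name is ours) of the one-sided Kolmogorov statistic, which measures how far the empirical distribution of p-values rises above the uniform diagonal:

```python
def ks_plus(pvalues):
    """One-sided Kolmogorov statistic D+ = max_i (i/n - p_(i)).

    Large values indicate an excess of small p-values relative to the
    uniform distribution expected when all null hypotheses are true.
    """
    n = len(pvalues)
    return max(i / n - p for i, p in enumerate(sorted(pvalues), start=1))

print(ks_plus([0.25, 0.50, 0.75, 1.00]))   # uniform-looking: 0.0
print(ks_plus([0.001, 0.002, 0.003, 0.9])) # clear excess of small p-values
```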
A little set theory
Represent all possible outcomes of three tests in a Venn diagram; areas are probabilities of events happening. The regions include: test 1 gives a false positive, test 2 gives a false positive, no test gives a false positive, all tests give a false positive.

The area of the union is at most the sum of the areas of the circles:

P(any test gives a false positive) ≤ P(test 1 FP) + P(test 2 FP) + P(test 3 FP)
Bonferroni adjustment
We want to control P(any test gives a false positive), and we know how to control each P(test i gives a false positive): the size of each test. Keep things simple: do all tests at the same size.

If we have n tests, each at size α/n, then

P(any test gives a false positive) ≤ n × (α/n) = α

Family-Wise Error Rate: for a FWER of less than α, perform all tests at size α/n.

Equivalently: multiply the p-values of all tests by n (to a maximum of 1) to give adjusted p-values.
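In R this is p.adjust(p, method="bonferroni"); the same adjustment is a one-liner in any language. A sketch in Python (the helper name is ours):

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of
    tests, capping the result at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

# With 10 tests, an unadjusted p-value of 0.01 becomes 0.1
print(bonferroni([0.01] * 10))
```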
Example 1
Look at deviations from Chargaff's 2nd parity rule: A and T content of genomes for 910 bugs. Many show significant deviations.

First 9 p-values:
3.581291e-66 3.072432e-12 1.675474e-01 6.687600e-01 1.272040e-05
1.493775e-23 2.009552e-26 1.024890e-14 1.519195e-24

Unadjusted p-values:
  p-value < 0.05   764
  p-value < 0.01   717
  p-value < 1e-5   559

Bonferroni adjusted p-values:
  p-value < 0.05   582
  p-value < 0.01   560
  p-value < 1e-5   461

First 9 adjusted p-values:
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02
1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21
Aside: p-values measure evidence
We have shown that many bugs deviate significantly from Chargaff's 2nd rule. The p-values tell us that there is significant evidence for a deviation, not that the deviation is large.

With lots of bases we have the ability to detect small deviations from 50%: a powerful test.

[Figure: box plot (lower quantile, median, upper quantile) of base proportions across the 910 bugs]
  1st Qu.  Median  3rd Qu.
  0.4989   0.4999  0.5012
Bonferroni is conservative
Conservative: the actual size of the test is less than the bound.

Not too bad for independent tests; worst when the tests are positively correlated:
• Applying the same test to subsets of the data
• Applying similar tests to the same data

A more subtle problem: consider a mixture of blue and red circles, with the null hypothesis "is blue". Red circles are never false positives, so only the truly-null tests contribute to

P(any test gives a false positive) ≤ sum over the truly-null tests only

If an experiment really is different from the null, its term drops out of the bound, and we have over-adjusted the p-values: the number of potential false positives may be less than the number of tests.
Holm's method
Holm (1979) suggests repeatedly applying Bonferroni.

The initial Bonferroni adjustment splits the tests into a significant set and an insignificant set.
• If no false positive occurred, we have been overly strict: apply Bonferroni again, but only to the insignificant set.
• If a false positive occurred, more tests won't hurt, so we may as well test again.

Repeat (step 2, step 3, ...) and stop when the "insignificant" set does not shrink further.
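This step-down procedure is equivalent to multiplying the smallest p-value by n, the next by n-1, and so on, then enforcing monotonicity. A Python sketch (our own code, written to mirror R's p.adjust(method="holm")):

```python
def holm(pvalues):
    """Holm step-down adjustment of a list of p-values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * n
    biggest_so_far = 0.0
    for step, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (n - k); a running
        # maximum keeps the adjusted values monotone.
        biggest_so_far = max(biggest_so_far, min(1.0, (n - step) * pvalues[i]))
        adjusted[i] = biggest_so_far
    return adjusted

print(holm([0.01, 0.04, 0.03]))
```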
Example 2
Return to the Chargaff data: 910 bugs, but more than half are significantly different after Bonferroni adjustment. There is strong evidence that we've over-corrected.

Bonferroni adjusted p-values:
  p-value < 0.05   582
  p-value < 0.01   560
  p-value < 1e-5   461

First 9 Bonferroni adjusted p-values:
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02
1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

First 9 Holm adjusted p-values:
2.915171e-63 1.591520e-09 1.000000e+00 1.000000e+00 4.452139e-03
9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Holm adjusted p-values:
  p-value < 0.05   606 (+24)
  p-value < 0.01   574 (+14)
  p-value < 1e-5   472 (+12)

We gained a couple of percent more, but notice that the gains tail off.
Hochberg's method
Consider a pathological case: apply the same test to the same data multiple times.

# Ten identical p-values
pvalues <- rep(0.01, 10)

# None are significant with Bonferroni
p.adjust(pvalues, method="bonferroni")
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

# None are significant with Holm
p.adjust(pvalues, method="holm")
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

# Hochberg recovers the correctly adjusted p-values
p.adjust(pvalues, method="hochberg")
0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
First 9 Hochberg adjusted p-values:
2.915171e-63 1.591520e-09 9.972469e-01 9.972469e-01 4.452139e-03
9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Hochberg adjusted p-values:
  p-value < 0.05   606
  p-value < 0.01   574
  p-value < 1e-5   472

The Hochberg adjustment is identical to Holm for the Chargaff data
… but requires additional assumptions
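Hochberg's step-up procedure works from the largest p-value downwards. A Python sketch (our own code, written to mirror R's p.adjust(method="hochberg")):

```python
def hochberg(pvalues):
    """Hochberg step-up adjustment of a list of p-values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)  # largest first
    adjusted = [0.0] * n
    smallest_so_far = 1.0
    for step, i in enumerate(order):
        # The k-th largest p-value is multiplied by (k + 1); a running
        # minimum keeps the adjusted values monotone.
        smallest_so_far = min(smallest_so_far, min(1.0, (step + 1) * pvalues[i]))
        adjusted[i] = smallest_so_far
    return adjusted

# The pathological case from the slide: ten identical p-values of 0.01
print(hochberg([0.01] * 10))  # all stay at 0.01, matching p.adjust
```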
False Discovery Rates
Newer methods, dating back to 1995. Gaining popularity in the literature, but mainly used for large data sets; useful for enriching data sets for further analysis.

Recap
  FWER: control the probability of any false positive occurring
  FDR:  control the proportion of false positives that occur

The "q-value" is the proportion of significant tests expected to be false positives:
  q-value × number significant = expected number of false positives

Methods
• Benjamini & Hochberg (1995)
• Benjamini & Yekutieli (2001)
• Storey (2002, 2003), aka the "positive false discovery rate"
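The Benjamini & Hochberg adjustment (p.adjust(method="BH") in R) can be sketched in a few lines of Python (the helper name is ours):

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)  # largest first
    qvalues = [0.0] * n
    smallest_so_far = 1.0
    for step, i in enumerate(order):
        rank = n - step  # rank of this p-value when sorted smallest first
        # Each p-value is scaled by n / rank; a running minimum keeps
        # the q-values monotone in the original ranks.
        smallest_so_far = min(smallest_so_far, min(1.0, pvalues[i] * n / rank))
        qvalues[i] = smallest_so_far
    return qvalues
```

Declaring significant everything with a q-value below 0.05 controls the expected proportion of false discoveries at 5%, assuming independent (or positively dependent) tests.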
Example 3
Returning once more to the Chargaff data.

First 9 FDR q-values:
3.359768e-65 7.114283e-12 1.891664e-01 6.931340e-01 2.063380e-05
5.481191e-23 8.350193e-26 2.569283e-14 5.760281e-24

FDR q-values:
  q-value < 0.05   759
  q-value < 0.01   713
  q-value < 1e-5   547

Q-values have a different interpretation from p-values: use q-values to get the expected number of false positives.

  q-value < 0.05   expect 38 false positives (759 × 0.05)
  q-value < 0.01   expect 7 false positives (713 × 0.01)
  q-value < 1e-5   expect roughly 1/200 of a false positive
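The expected-false-positive arithmetic, spelled out (counts taken from the Chargaff example on the slide):

```python
# q-value threshold × number declared significant = expected false positives
print(0.05 * 759)   # about 38 false positives among 759 discoveries
print(0.01 * 713)   # about 7 false positives among 713 discoveries
print(1e-5 * 547)   # about 0.005, i.e. roughly 1/200 of a false positive
```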
Summary
Holm is always better than Bonferroni.
Hochberg can be better still, but has additional assumptions.
FDR is a more powerful approach:
• finds more things significant
• controls a different criterion
• more useful for exploratory analyses than publications

A little question: suppose results are published if the p-value is less than 0.01. What proportion of the scientific literature is wrong?