
Multiple testing adjustments

European Molecular Biology Laboratory

Predoc Bioinformatics Course

17th Nov 2009


Tim Massingham, tim.massingham@ebi.ac.uk


Motivation

We have already come across several cases where we need to correct p-values.

Pairwise gene expression data:

         Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Exp 1    0.027   0.033   0.409   0.330   0.784
Exp 2            0.117   0.841   0.985   0.004
Exp 3                    0.869   0.927   0.001
Exp 4                            0.245   0.021
Exp 5                                    0.004

What happens if we perform several vaccine trials?

Motivation

10 new vaccines are trialled. Declare a vaccine a success if its test has a p-value of less than 0.05.

If none of the vaccines work, what is our chance of a “success”?

Each trial has probability 0.05 of “success” (a false positive) and probability 0.95 of “failure” (a true negative).


Probability of at least one “success” = 1 − Probability of none
                                      = 1 − (Probability a trial is unsuccessful)^10
                                      = 1 − 0.95^10
                                      ≈ 0.40

So even if none of the vaccines work, there is a 40% chance of at least one apparent “success”.
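The arithmetic is quick to verify in R (10 trials and size 0.05, as in the example above):

# Chance of at least one false positive in 10 independent null trials
1 - 0.95^10
0.4012631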

Rule of Thumb

Multiply the size of the test by the number of tests (here 10 × 0.05 = 0.5, close to the exact 0.40).

Motivation

More extreme example: test an entire population for a disease

The population is a mixture: some people have the disease, some don’t. The aim is to find the individuals with the disease.

Each test result is one of four outcomes: true negative, false positive, false negative, true positive.

Family-Wise Error Rate: control the probability that any false positive occurs

False Discovery Rate: control the proportion of false positives among the discoveries

                      True status
                  Healthy          Diseased
Test report
  Healthy         true negative    false negative
  Diseased        false positive   true positive

FDR = (# false positives) / (# positives)
    = (# false positives) / (# true positives + # false positives)
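As a toy numeric illustration in R, with hypothetical counts:

# Hypothetical: 100 tests report “diseased”, 10 of them wrongly
false_pos <- 10
true_pos <- 90
false_pos / (true_pos + false_pos)   # FDR = 0.1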

Cumulative distribution

Simple examination by eye: the cumulative distribution of the p-values should be approximately linear.

Rank the data, then plot rank against p-value. The curve starts near (0, 1), ends at (1, n), and never decreases.

N.B. Ranks are often scaled to (0, 1] by dividing by the largest rank.

[Plot: p-value on the x-axis, from 0 to 1, against rank on the y-axis, from 1 to n]
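A minimal R sketch of this plot; pvalues here is a placeholder vector of simulated uniform nulls:

# Plot scaled rank against sorted p-value; under the null the
# curve should hug the diagonal
pvalues <- runif(910)   # placeholder: 910 uniform null p-values
n <- length(pvalues)
plot(sort(pvalues), (1:n)/n, type = "s", xlab = "p-value", ylab = "scaled rank")
abline(0, 1, lty = 2)   # expected line for uniform p-values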

Cumulative distribution

Examples, for 910 p-values:

[Plot: five sets of uniformly distributed p-values]
[Plot: non-uniformly distributed data, with an excess of extreme (small) p-values]

Could use a one-sided Kolmogorov test if desired.
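In R this is one call, reusing the pvalues vector from the sketch above:

# One-sided Kolmogorov-Smirnov test against Uniform(0,1);
# alternative = "greater" looks for an excess of small p-values
ks.test(pvalues, "punif", alternative = "greater")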

A little set theory

Represent all possible outcomes of three tests in a Venn diagram; areas are the probabilities of events happening.

[Venn diagram: one circle per test’s “false positive” event (test 1, test 2, test 3); the region outside all circles is “no test gives a false positive”, the central overlap is “all tests give a false positive”]

A little set theory

The union is covered by the three circles, and their overlaps are counted more than once, so

P(any test gives a false positive) ≤ P(test 1 FP) + P(test 2 FP) + P(test 3 FP)
Bonferroni adjustment

We want to control the left-hand side; we know how to control each term on the right (the size of each test). Keep things simple: do all tests at the same size.

If we have n tests, each at size α/n, then

P(any test gives a false positive) ≤ n × (α/n) = α
Family-Wise Error Rate

For a FWER of at most α, perform all tests at size α/n.

Equivalently: multiply the p-values of all tests by n (capping at 1) to give adjusted p-values.
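A minimal sketch in R, with made-up p-values:

# Bonferroni: multiply each p-value by the number of tests, cap at 1
pvalues <- c(0.003, 0.02, 0.4)            # hypothetical values
pmin(1, length(pvalues) * pvalues)        # manual adjustment
p.adjust(pvalues, method = "bonferroni")  # built-in equivalent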

Example 1

Look at deviations from Chargaff’s 2nd parity rule: the A and T content of genomes for 910 bugs. Many show significant deviations.

First 9 p-values:
3.581291e-66 3.072432e-12 1.675474e-01 6.687600e-01 1.272040e-05 1.493775e-23 2.009552e-26 1.024890e-14 1.519195e-24

Unadjusted p-values

p-value < 0.05   764
p-value < 0.01   717
p-value < 1e-5   559

Bonferroni adjusted p-values

p-value < 0.05   582
p-value < 0.01   560
p-value < 1e-5   461

First 9 Bonferroni adjusted p-values:
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02 1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21
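The counts can be reproduced along these lines, assuming pvalues holds the 910 Chargaff p-values:

# Bonferroni-adjust, then count how many remain significant
adjusted <- p.adjust(pvalues, method = "bonferroni")
sum(adjusted < 0.05)   # 582 for the Chargaff data
sum(adjusted < 0.01)   # 560
sum(adjusted < 1e-5)   # 461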

Aside: p-values measure evidence

We have shown that many bugs deviate significantly from Chargaff’s 2nd rule. The p-values tell us there is significant evidence for a deviation, not that the deviation is large.

Genomes contain lots of bases, so the test is powerful and can detect small deviations from 50%. The quartiles of the observed proportions sit very close to 0.5:

1st Qu.  Median  3rd Qu.
0.4989   0.4999  0.5012


Bonferroni is conservative

Conservative: the actual size of the test is less than the bound.

Not too bad for independent tests; worst when tests are positively correlated, e.g.

    applying the same test to subsets of the data
    applying similar tests to the same data

There is a more subtle problem. Picture a mixture of blue and red circles, with the null hypothesis “is blue”: red circles can never be false positives.

Bonferroni is conservative

If an experiment really is different from the null, it can never yield a false positive, so its term contributes nothing to the sum. The number of potential false positives may be less than the number of tests, and the p-values are over-adjusted.

Holm’s method

Holm (1979) suggests repeatedly applying Bonferroni.

Initial Bonferroni: split the tests into an insignificant set and a significant set.

    No false positive in the significant set? Then we have been overly strict: apply Bonferroni to the insignificant set alone.
    A false positive? More tests won’t hurt, so we may as well test again.

Step 2: adjust the insignificant set by its own size; some tests may become significant.
Step 3: repeat. Stop when the “insignificant” set does not shrink further.
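A sketch of the resulting step-down calculation; it is equivalent to p.adjust(pvalues, method = "holm"):

# Holm: multiply the i-th smallest p-value by (n - i + 1) and
# enforce monotonicity with a running maximum
holm_adjust <- function(p) {
    n <- length(p)
    o <- order(p)
    adj <- pmin(1, cummax((n - seq_len(n) + 1) * p[o]))
    adj[order(o)]   # back to the original order
}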

Example 2

Return to the Chargaff data: 910 bugs, but more than half are significantly different after adjustment. There is strong evidence that we’ve over-corrected.

Recall the Bonferroni adjusted p-values

p-value < 0.05   582
p-value < 0.01   560
p-value < 1e-5   461

First 9 Bonferroni adjusted p-values:
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02 1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

First 9 Holm adjusted p-values:
2.915171e-63 1.591520e-09 1.000000e+00 1.000000e+00 4.452139e-03 9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Holm adjusted p-values

p-value < 0.05   606 (+24)
p-value < 0.01   574 (+14)
p-value < 1e-5   472 (+12)

We gain a couple of percent more, but notice that the gains tail off.

Hochberg’s method

Consider a pathological case: apply the same test to the same data multiple times.

# Ten identical p-values
pvalues <- rep(0.01, 10)

# None are significant with Bonferroni
p.adjust(pvalues, method = "bonferroni")
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

# None are significant with Holm
p.adjust(pvalues, method = "holm")
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

# Hochberg recovers the correctly adjusted p-values
p.adjust(pvalues, method = "hochberg")
0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01


First 9 Hochberg adjusted p-values:
2.915171e-63 1.591520e-09 9.972469e-01 9.972469e-01 4.452139e-03 9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Hochberg adjusted p-values

p-value < 0.05   606
p-value < 0.01   574
p-value < 1e-5   472

The Hochberg adjustment is identical to Holm for the Chargaff data, but requires additional assumptions (roughly, independence or positive dependence among the tests).

False Discovery Rates

Newer methods, dating back to 1995. Gaining popularity in the literature, but mainly used for large data sets; useful for enriching data sets for further analysis.

Recap

FWER: control the probability of any false positive occurring
FDR: control the proportion of false positives that occur

The “q-value” is the proportion of significant tests expected to be false positives:

q-value × number significant = expected number of false positives

Methods

Benjamini & Hochberg (1995)
Benjamini & Yekutieli (2001)
Storey (2002, 2003), aka the “positive false discovery rate”
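All three are available in R: the first two through p.adjust, Storey’s through the Bioconductor qvalue package (assumed installed):

p.adjust(pvalues, method = "BH")     # Benjamini & Hochberg (1995)
p.adjust(pvalues, method = "BY")     # Benjamini & Yekutieli (2001)
# qvalue::qvalue(pvalues)$qvalues    # Storey's q-values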

Example 3

Returning once more to the Chargaff data.

First 9 FDR q-values:
3.359768e-65 7.114283e-12 1.891664e-01 6.931340e-01 2.063380e-05 5.481191e-23 8.350193e-26 2.569283e-14 5.760281e-24

FDR q-values

q-value < 0.05   759
q-value < 0.01   713
q-value < 1e-5   547

Q-values have a different interpretation from p-values: use q-values to get the expected number of false positives.

q-value = 0.05  →  expect 38 false positives (759 × 0.05)
q-value = 0.01  →  expect 7 false positives (713 × 0.01)
q-value = 1e-5  →  expect about 1/200 of a false positive (547 × 1e-5)
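In R, assuming BH-style q-values for the Chargaff p-values:

# Expected number of false positives at a q-value cutoff
q <- p.adjust(pvalues, method = "BH")
sum(q < 0.05) * 0.05   # e.g. 759 x 0.05 ≈ 38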



Summary

Holm is always better than Bonferroni.

Hochberg can be better still, but requires additional assumptions.

FDR is a more powerful approach: it finds more things significant, but

    it controls a different criterion
    it is more useful for exploratory analyses than for publications

A little question

Suppose results are published if the p-value is less than 0.01: what proportion of the scientific literature is wrong?