# Multiple Testing


European Molecular Biology Laboratory

Predoc Bioinformatics Course

17th Nov 2009

Tim Massingham, tim.massingham@ebi.ac.uk

Motivation

We have already come across several cases where p-values need to be corrected.

Pairwise gene expression data (a p-value for each pair of experiments):

|       | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 |
|-------|-------|-------|-------|-------|-------|
| Exp 1 | 0.027 | 0.033 | 0.409 | 0.330 | 0.784 |
| Exp 2 |       | 0.117 | 0.841 | 0.985 | 0.004 |
| Exp 3 |       |       | 0.869 | 0.927 | 0.001 |
| Exp 4 |       |       |       | 0.245 | 0.021 |
| Exp 5 |       |       |       |       | 0.004 |

What happens if we perform several vaccine trials?

Motivation

10 new vaccines are trialled.

Declare a vaccine a success if its test has a p-value of less than 0.05.

If none of the vaccines work, what is our chance of a "success"?

Motivation

10 new vaccines are trialled; declare a vaccine a success if its test has a p-value of less than 0.05.

Each trial has probability 0.05 of "success" (a false positive) and probability 0.95 of "failure" (a true negative). If none of the vaccines work:

P(at least one "success") = 1 - P(none)
                          = 1 - (P(a trial is unsuccessful))^10
                          = 1 - 0.95^10
                          ≈ 0.4
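The arithmetic above can be checked with a short sketch (Python here; the function name is my own):

```python
# Probability that at least one of n independent tests, each run at
# size alpha, gives a false positive when every null is true.
def prob_any_false_positive(n, alpha):
    return 1 - (1 - alpha) ** n

print(round(prob_any_false_positive(10, 0.05), 3))  # 0.401
```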

Rule of Thumb

Multiply the size of each test by the number of tests.

Motivation

A more extreme example: test an entire population for a disease.

The population is a mixture: some individuals have the disease, some don't. We want to find the individuals with the disease.

| True status | Test report: Healthy | Test report: Diseased |
|-------------|----------------------|-----------------------|
| Healthy     | True negative        | False positive        |
| Diseased    | False negative       | True positive         |

Family-Wise Error Rate: control the probability that any false positive occurs.

False Discovery Rate: control the proportion of false positives among the discoveries:

FDR = # false positives / # positives = # false positives / (# true positives + # false positives)

Cumulative distribution

A simple examination by eye: if the p-values are uniform, their cumulative distribution should be approximately linear.

Rank the data and plot each p-value against its rank. The curve starts near (0, 1), ends at (1, n), and never decreases.

N.B. Ranks are often scaled to (0, 1] by dividing by the largest rank.
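The recipe above can be sketched in a few lines (Python; the plotting itself is omitted, only the points are computed):

```python
# Pair each p-value with its rank scaled to (0, 1]; plotting p-value
# against scaled rank gives the empirical cumulative distribution.
def cumulative_points(pvalues):
    n = len(pvalues)
    return [(p, (i + 1) / n) for i, p in enumerate(sorted(pvalues))]

# Uniform p-values should lie near the diagonal y = x.
print(cumulative_points([0.027, 0.784, 0.409, 0.033, 0.330]))
```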

Cumulative distribution

[Figure: five sets of uniformly distributed p-values give approximately straight cumulative distributions; non-uniformly distributed data show an excess of extreme (small) p-values. Examples drawn from the 910 p-values.]

A one-sided Kolmogorov test could be used if a formal test is desired.
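The one-sided statistic is simple enough to write by hand (a Python sketch of the D+ statistic against Uniform(0, 1); `scipy.stats.kstest` offers the same comparison via its `alternative` parameter):

```python
# One-sided Kolmogorov statistic D+ against Uniform(0, 1): the largest
# amount by which the empirical CDF exceeds the uniform CDF. An excess
# of small p-values pushes D+ up.
def ks_dplus_uniform(pvalues):
    n = len(pvalues)
    return max(i / n - p for i, p in enumerate(sorted(pvalues), start=1))

# Evenly spread p-values are close to uniform, so D+ is small.
print(round(ks_dplus_uniform([(i - 0.5) / 10 for i in range(1, 11)]), 6))  # 0.05
```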

A little set theory

Represent all possible outcomes of three tests in a Venn diagram, with one region per event "test i gives a false positive". Areas are probabilities of events happening: the region outside all circles is "no test gives a false positive", and the central overlap is "all tests give a false positive".

A little set theory

P(any test gives a false positive) <= P(test 1 false positive) + P(test 2 false positive) + P(test 3 false positive)

We want to control the left-hand side, and we know how to control each term on the right (the size of each test). Keep things simple: do all tests at the same size.

If we have n tests, each at size a/n, then

P(any test gives a false positive) <= n x (a/n) = a

Family-Wise Error Rate

For a FWER of less than a, perform all tests at size a/n.

Equivalently: multiply the p-values of all tests by n (capping at 1) to give adjusted p-values.
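The adjustment reads as a one-liner (a Python sketch; R's `p.adjust(p, method="bonferroni")` does the same):

```python
# Bonferroni: multiply each p-value by the number of tests, capping at 1.
def bonferroni(pvalues):
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

print(bonferroni([0.125, 0.3, 0.05, 0.6]))  # [0.5, 1.0, 0.2, 1.0]
```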

Example 1

Look at deviations from Chargaff's 2nd parity rule: the A and T content of genomes for 910 bugs. Many show significant deviations.

First 9 p-values:

3.581291e-66 3.072432e-12 1.675474e-01 6.687600e-01 1.272040e-05
1.493775e-23 2.009552e-26 1.024890e-14 1.519195e-24

Raw p-values:

p-value < 0.05   764
p-value < 0.01   717
p-value < 1e-5   559

Bonferroni-adjusted p-values:

p-value < 0.05   582
p-value < 0.01   560
p-value < 1e-5   461

First 9 adjusted p-values:

3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02
1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

Aside: p-values measure evidence

We have shown that many bugs deviate substantially from Chargaff's 2nd rule. The p-values tell us that there is significant evidence for a deviation, not that the deviation is large: genomes contain many bases, so the test is powerful and can detect small deviations from 50%. The observed proportions are in fact tightly clustered around a half:

1st Qu.  Median  3rd Qu.
0.4989   0.4999  0.5012

Bonferroni is conservative

Conservative: the actual size of the test is less than the bound. This is not too bad for independent tests, but worst when tests are positively correlated:

applying the same test to subsets of the data

applying similar tests to the same data

A more subtle problem: consider a mixture of blue and red circles, with null hypothesis "is blue". The red circles are never false positives. If an experiment really is different from the null, it cannot contribute a false positive, so the number of potential false positives may be less than the number of tests.

Holm's method

Holm (1979) suggests repeatedly applying Bonferroni.

Initial Bonferroni splits the tests into a significant set and an insignificant set.

No false positive? We have been overly strict, so apply Bonferroni again, to the insignificant set only.

False positive? More tests won't hurt, so we may as well test again anyway.

Step 2: apply Bonferroni to the insignificant set; some tests may move into the significant set.

Step 3: repeat, stopping when the "insignificant" set does not shrink further.
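The stepwise procedure collapses into a single adjustment: the i-th smallest p-value is multiplied by the number of tests still in the insignificant set, with a running maximum keeping the adjusted values monotone. A Python sketch (R's `p.adjust(p, method="holm")` is equivalent):

```python
# Holm's step-down method as a p-value adjustment: the smallest p-value
# is multiplied by n, the next smallest by n-1, and so on, taking a
# running maximum so adjusted values never decrease.
def holm(pvalues):
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for step, i in enumerate(order):
        running_max = max(running_max, (n - step) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

print(holm([0.01, 0.04, 0.03]))
```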

Example 2

Bonferroni-adjusted p-values:

p-value < 0.05   582
p-value < 0.01   560
p-value < 1e-5   461

First 9 Bonferroni-adjusted p-values:

3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02
1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

There are 910 bugs, but more than half are significantly different even after adjustment, so the number of potential false positives is far smaller than 910. There is strong evidence that we've over-corrected.

First 9 Holm-adjusted p-values:

2.915171e-63 1.591520e-09 1.000000e+00 1.000000e+00 4.452139e-03
9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Holm-adjusted p-values:

p-value < 0.05   606 (+24)
p-value < 0.01   574 (+14)
p-value < 1e-5   472 (+12)

We gained a couple of percent more significant results, but notice that the gains tail off.

Hochberg's method

Consider a pathological case: apply the same test to the same data multiple times.

# Ten identical p-values
pvalues <- rep(0.01, 10)

# None are significant with Bonferroni
p.adjust(pvalues, method="bonferroni")
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

# None are significant with Holm
p.adjust(pvalues, method="holm")
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

# All are significant with Hochberg
p.adjust(pvalues, method="hochberg")
0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01

Hochberg's method is a step-up version of Holm, working from the largest p-value downwards; it is more powerful but assumes the tests are independent or positively correlated.

First 9 Hochberg-adjusted p-values:

2.915171e-63 1.591520e-09 9.972469e-01 9.972469e-01 4.452139e-03
9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Hochberg-adjusted p-values:

p-value < 0.05   606
p-value < 0.01   574
p-value < 1e-5   472

The Hochberg adjustment gives the same counts as Holm for the Chargaff data.

False Discovery Rates

Newer methods, dating back to 1995. Gaining popularity in the literature, but mainly used for large data sets; useful for enriching data sets for further analysis.

Recap:

FWER: control the probability of any false positive occurring.

FDR: control the proportion of false positives that occur.

The "q-value" is the proportion of significant tests expected to be false positives:

q-value x number significant = expected number of false positives

Methods

Benjamini & Hochberg (1995)

Benjamini & Yekutieli (2001)

Storey (2002, 2003), aka the "positive false discovery rate"
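Of these, Benjamini & Hochberg (1995) is the most widely used; it can be sketched as a p-value-to-q-value adjustment (Python; R's `p.adjust(p, method="BH")` is equivalent):

```python
# Benjamini-Hochberg: scale the i-th smallest p-value by n/i, then take
# a running minimum from the largest p-value downwards so the q-values
# stay monotone.
def benjamini_hochberg(pvalues):
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):           # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues

print(benjamini_hochberg([0.01, 0.02, 0.04, 0.8]))
```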

Example 3

Returning once more to the Chargaff data.

First 9 FDR q-values:

3.359768e-65 7.114283e-12 1.891664e-01 6.931340e-01 2.063380e-05
5.481191e-23 8.350193e-26 2.569283e-14 5.760281e-24

FDR q-values:

q-value < 0.05   759
q-value < 0.01   713
q-value < 1e-5   547

Q-values have a different interpretation from p-values: use q-values to get the expected number of false positives.

q-value < 0.05: expect 38 false positives (759 x 0.05)

q-value < 0.01: expect 7 false positives (713 x 0.01)

q-value < 1e-5: expect 1/200 of a false positive
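The arithmetic behind these expectations (the function name is my own):

```python
# Expected number of false positives among the tests called significant
# at a given q-value threshold.
def expected_false_positives(n_significant, q_threshold):
    return n_significant * q_threshold

print(round(expected_false_positives(759, 0.05)))  # 38
print(round(expected_false_positives(713, 0.01)))  # 7
```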

Summary

Holm is always better than Bonferroni.

Hochberg can be better still, but makes additional assumptions.

FDR is a more powerful approach:

finds more things significant

controls a different criterion

more useful for exploratory analyses than publications

A little question

Suppose results are published if the p-value is less than 0.01. What proportion of the scientific literature is wrong?