
Oct 22, 2013 (3 years and 7 months ago)


1

Size matters: the value of large-scale epidemiology



Paul Burton


Professor of Genetic Epidemiology

University of Leicester

P³G Consortium

PHOEBE


2

A daunting task!


Need for extensive, valid information


Developments in biotechnology, IT


Pre-morbid and longitudinal life-style/environment relevant


Bioclinical complexity


low statistical power!!




[Diagram: G, E, Intermediate Pathways, Disease, Natural History]

3

Large scale genetic epidemiology


Focus on the aetiology of complex diseases


Common disease common variant hypothesis



a shift in paradigm from linkage to association


BUT: serious failure to identify associations that can consistently be replicated



4

Hattersley AT, McCarthy MI. Lancet 2005;366:1315-1323

Examples of some polymorphisms or haplotypes that have shown consistent association with complex disease


5

Why has replication proved to be so difficult?

Poorly designed studies

e.g. wrong controls, family vs. non-family designs

Poorly conducted analyses and meta-analyses

e.g. use of inefficient or inconsistent methods; failure to take proper account of extreme “multiple testing”; publication and/or reporting bias

Inconsistent definitions of outcome or exposure

e.g. what do we mean by “asthma”?

Poor methods of assessment

e.g. bad choice of SNP genotyping platform




6


Why has replication proved to be so difficult?

Heterogeneity

e.g. “stroke” encompasses important subcategories; phenocopies; pleiotropy

Population substructure

Latent stratification and admixture pertaining to population of origin

7

Why has replication proved to be so difficult?

LOW STATISTICAL POWER!!

A key feature of almost all proffered explanations, and/or of the approaches needed to correct for them

If we need 5,000 cases to test for a given aetiological effect with a power of 80%, and with a critical p-value of 0.0001, how much power would there be for a study with 500 cases?


0.008!!
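
The 0.008 figure can be reproduced, at least approximately, with a normal approximation. The sketch below is not taken from the slides: it assumes a two-sided Wald-type test whose non-centrality grows with the square root of the number of cases, and the function name `power_at_smaller_n` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_at_smaller_n(n_ref, power_ref, alpha, n_new):
    """Approximate power at n_new cases, given that n_ref cases give power_ref.

    Assumes a two-sided z-test whose non-centrality scales with sqrt(n);
    the negligible contribution of the opposite tail is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # critical value for the test
    ncp_ref = z_crit + z.inv_cdf(power_ref)    # non-centrality at n_ref cases
    ncp_new = ncp_ref * sqrt(n_new / n_ref)    # shrink it for the smaller study
    return z.cdf(ncp_new - z_crit)

power_500 = power_at_smaller_n(n_ref=5000, power_ref=0.80, alpha=0.0001, n_new=500)
print(f"approximate power with 500 cases: {power_500:.4f}")
```

Under these assumptions, cutting the number of cases tenfold shrinks the non-centrality by a factor of about 3.2, which is what pushes the power from 80% to below 1%.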

9

How should we respond?


Increase the quality of individual studies


Limit measurement/assessment error


Increase the size of individual studies


Promote harmonization to enable data pooling and integration












MAJOR international investment in “biobanks” and “biobank harmonization”

11

What is a biobank?


“An organised collection of human biological material and associated information stored for one or more research purposes”

Population Biobanks Lexicon (P³G, PHOEBE)


Types


Disease-specific

Exposure-focused

Population-based


12

Justification for large-scale genetic-epidemiology programs


13

BIG per se


No argument about:


Need to increase statistical power


Benefit of constructing biobanks containing extensive case-series for case-control studies

Benefit of constructing large “acceptably representative” series of controls for each nation

14

BIG cohort studies


Studies of the joint effects of genes and environment/life-style

Genotype-based studies***

The genetics of disease progression

Direct association of genes with disease***

Population-based replication studies


Universal controls

15

BUT: how big is “big”?


With Anna Hansell, Imperial College

16


The statistical power of case-control studies

Contemporary pre-eminence of genetic association studies rather than genetic linkage studies

Covers both stand-alone case-control studies and nested case-control studies in large cohorts. Main issue is the number of cases.

Sample size is the determining factor in both settings

17

Simulation-based power calculations

Work with the least powerful (common) setting

Disease outcome and exposures all binary

Logistic regression; interactions = departure from a multiplicative model

Complexity (arbitrary but realistic).

Four controls per case
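
A minimal sketch of what a simulation-based power calculation of this kind can look like, for a single binary genotype and four controls per case. It is not the simulation used for these slides: it leaves out misclassification, heterogeneity and the other "complexity" terms, so it will report higher power than a realistic model. It also relies on the fact that, for one binary covariate, the logistic-regression Wald test is the same as the log odds ratio test on the 2x2 table. The function name `simulate_power` and the default settings are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def simulate_power(n_cases, odds_ratio, prevalence, alpha,
                   controls_per_case=4, n_sims=200, seed=1):
    """Estimate power by simulation for a binary genotype in a case-control study."""
    rng = random.Random(seed)
    n_controls = controls_per_case * n_cases
    # Exposure probability in controls (rare-disease assumption) and in cases.
    p_control = prevalence
    odds_case = odds_ratio * prevalence / (1 - prevalence)
    p_case = odds_case / (1 + odds_case)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    hits = 0
    for _ in range(n_sims):
        a = sum(rng.random() < p_case for _ in range(n_cases))        # exposed cases
        c = sum(rng.random() < p_control for _ in range(n_controls))  # exposed controls
        b, d = n_cases - a, n_controls - c                            # unexposed counts
        if min(a, b, c, d) == 0:
            continue  # log odds ratio undefined; counted as non-significant
        log_or = math.log(a * d / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Wald standard error
        if abs(log_or / se) > z_crit:
            hits += 1
    return hits / n_sims

power = simulate_power(n_cases=2000, odds_ratio=1.5, prevalence=0.1, alpha=1e-4)
print(f"estimated power (idealised, no measurement error): {power:.2f}")
```

Adding exposure misclassification or an interaction term to the data-generating step is where a sketch like this would start to approach the "arbitrary but realistic" complexity referred to above.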





18

Diabetes mellitus defined by HbA1C > 97.5th percentile


19

Genetic main effects

Prevalence of ‘at-risk’ genotype = 0.1, 0.5

20

Lifestyle main effects

Prevalence of ‘at-risk’ life-style determinant = 0.5

Reliability

1.0: measured height

0.9: self-reported weight

0.7: office BP, measured serum cholesterol

0.5: dietary recall of many components (4 × 24 hr recalls)


21

Gene-lifestyle interactions

Prevalence of ‘at-risk’ life-style determinant = 0.5

Prevalence of ‘at-risk’ genotype = 0.1

22

[Figure legend: power bands 0.2-0.39; 0.4-0.59; 0.6-0.79; 0.8-1.0]

Mean power 55%

23

What is needed?


Genetic main effects

2,500-10,000 cases

Life-style main effects

5,000-20,000 cases

Gene-lifestyle interactions

Probably need at least 20,000 cases

24

How can this be achieved?


Large disease-based biobanks

Very large cohort-based biobanks


But how large do these need to be?

25

Expected event rates in UK Biobank


With Anna Hansell, Imperial College

26

Taking account of


Age range at recruitment 40-69 years


Recruitment over 5 years


All cause mortality


Disease incidence (“healthy cohort effect”)


Migration overseas


Withdrawal from the study

27

28

Conclusions


Having taken account of realistic bioclinical complexity, a cohort-based biobank needs to be very large if it is to provide a stand-alone infrastructure

Anything much less than 500,000 recruits severely curtails the number of diseases that can be studied using that biobank alone

The value of any biobank will be greatly augmented if it proves possible to set up a coherent and scientifically harmonized international network of biobanks


29

What is biobank harmonization?

30

Biobank harmonization


A set of procedures that promote, both now and in the future, the effective interchange of valid information and samples between a number of studies or biobanks, accepting that there may be important differences between those studies


With thanks to Alastair Kent

31

Biobank harmonization


Prospective harmonization


Aims to modify study design and conduct, ahead of time, in order to render subsequent data and sample pooling more efficient and more straightforward

Retrospective harmonization

Aims to optimize the pooling of data, samples and phenotypes that have already been collected, between studies with inevitably heterogeneous designs.


32

Why harmonize?


Investigate less common (but not rare!!!) conditions

UKBB: Ca stomach 2,500 cases in 29 years

6 UKBB equivalents: 10,000 cases in 20 years

Investigate smaller ORs

Genetic main effect (GME) 1.5 → 1.2 requires 2,000 → 12,600 cases

6.3 UKBB equivalents

Analysis based on subsets

homogeneous classes of phenotype, or e.g. by sex

33

Why harmonize?


Earlier analyses

UKBB: Alzheimer's disease, 10,000 cases in 18 yrs

5 UKBB equivalents → 9 years

Events at younger ages

Broad range of environmental exposures

Aim for 5-6 UKBB equivalents → 2.5M-3M recruits
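
The "UKBB equivalents" arithmetic on the two "Why harmonize?" slides can be checked in a few lines. This sketch assumes a UK Biobank-sized cohort of 500,000 recruits (the size quoted elsewhere in the deck) and that the number of cases accrued scales linearly with the number of equivalent cohorts over a fixed follow-up. The helper name `ukbb_equivalents` is invented for illustration.

```python
UKBB_RECRUITS = 500_000  # assumed size of one "UKBB equivalent"

def ukbb_equivalents(cases_needed, cases_from_one_ukbb):
    """Number of UKBB-sized cohorts needed, assuming cases scale linearly with cohorts."""
    return cases_needed / cases_from_one_ukbb

# Moving from an odds ratio of 1.5 to 1.2 raises the requirement
# from 2,000 to 12,600 cases (figures from the slides).
ratio = ukbb_equivalents(cases_needed=12_600, cases_from_one_ukbb=2_000)
print(f"UKBB equivalents for OR 1.2: {ratio:.1f}")

# Total recruits implied by aiming for 5-6 UKBB equivalents.
recruits = [n * UKBB_RECRUITS for n in (5, 6)]
print("recruits needed:", recruits)
```

The time-to-target figures on the slides (for example 10,000 Alzheimer's cases in 18 years versus 9 years) depend on age-specific incidence and follow-up, so they do not follow from this linear scaling alone.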

34

Some key issues


Scientifically and politically VERY challenging


Laboratory science, clinical science, population science, IT challenges, ethico-legal issues

A need for REAL collaboration and tools that are ACCESSIBLE and USABLE

Case-control and cohort studies

35

International biobank harmonization programs

Public Population Project in Genomics (P³G)

Tom Hudson, Bartha Knoppers, Isabel Fortier…

Population Biobanks FP6 Co-ordination Action (PHOEBE: Promoting Harmonization Of Epidemiological Biobanks in Europe)

Jennifer Harris, Leena Peltonen, Paul Burton …

Human Genome Epidemiology Network (HuGENet)

Muin Khoury, Julian Little …

ESSENTIAL THAT ALL INITIATIVES WORK TOGETHER!!



36

Extra slides

37

Rarer genotypes


Genetic main effects

38

Proposed assessment visit model

Stage                        Minutes
Welcome                      5
Consent                      5
Blood/Urine                  10
Touchscreen Questionnaire    25
Interviewer Questionnaire    5
Physical Measures            15
Exit                         5
TOTAL                        70

39

Taking account of


Age range at recruitment 40-69 years


Recruitment over 5 years


All cause mortality


Disease incidence (“healthy cohort effect”)


Migration overseas


Comprehensive withdrawal (max 1/500 p.a.)


Partial withdrawal (c.f. 1958 Birth Cohort)

40

41

Necessary to contact subjects

42

Issues that are often ignored in standard power calculations


Multiple testing/low prior probability of association*


Interactions*


Unobserved frailty


Misclassification*


Genotype


Environmental determinant


Case-control status


Subgroup analyses*


Population substructure

43

Harmonisation


Prospective


Retrospective


Description


Comparison


Harmonised synthesis

44

45

46

Recruitment and assessment


Recruitment via centrally held list of individuals registered with Primary Care Practitioners (GPs)

Assessment in large centres (100 subjects per day)

Assessment: 70 minutes

Questionnaire, physical examination, bloods

47

Assessment visit model

Stage                        Minutes
Welcome                      5
Consent                      5
Blood/Urine                  10
Touchscreen Questionnaire    25
Interviewer Questionnaire    5
Physical Measures            15
Exit                         5
TOTAL                        70

48

Summary


80% power for genotype frequency = 0.1

Effect                 Odds ratio   Critical p   Cases needed
Genetic main effect    1.5          10^-4        2,000
Genetic main effect    1.3          10^-4        5,500
Genetic main effect    1.2          10^-4        12,600
Genetic main effect    1.7          10^-7        2,000
Genetic main effect    1.5          10^-7        3,400
Genetic main effect    1.3          10^-7        9,500
Genetic main effect    1.2          10^-7        21,500
G:E interaction*       2.0          10^-4        10,000 to 30,000

* with environmental exposure prevalence = 0.5


49


UK Biobank

A prospective cohort study

500,000 adults (40-69 years) across UK

A population-based biobank

Not disease or exposure based

Recruitment via electronic GP lists

“Broad spectrum” not “fully representative”

Individuals not families

MRC, Wellcome Trust, DH, Scottish Executive

£61M

50


UK Biobank

Initial data/sample collection and subsequent longitudinal health tracking

Nested case-control studies

Long time-horizon

Owned by the Nation

Central Administration

Manchester

PI: Prof Rory Collins (Oxford)

6 collaborating groups (RCCs) of university scientists

51

Smaller sample sizes