genetically homogeneous


Dec 16, 2012 (4 years and 9 months ago)



Review of main points from last week


Medical costs escalating largely due to new technology


This is an ethical/social problem with major consequences.


Many new technologies provide only marginal benefits


Cost-effectiveness frequently not evaluated


FDA: is it “safe and effective”?

CMS: is it “necessary and reasonable”?


Consider genetic testing and “personalized” medicine as an example of a new technology needing evaluation


What benefits does/will it provide?


At what cost? Are there potential harms?


What ethical/legal/social issues does it raise?

Some concepts needed to understand this technology


What does DNA “do”? (genes, proteins)

What is its structure?

What is DNA sequence?

What is a SNP?

How is DNA passed from parents to offspring?

What are mutations, genetic variants?

How can they be associated with traits, diseases, disease risks, sensitivity to particular drugs?

Examples of tests offered “direct-to-consumer”




Today’s subject: gene-disease risk associations & GWAS


Understand how GWAS studies have been done in order to better evaluate disease risk predictions from companies like 23andMe


Understand strengths and limitations of GWAS


Go over some basic ideas in statistics needed to evaluate GWAS (and other applications in engineering!)

Think about how technical complexity affects your ability to evaluate the utility of this (and, by example, other) new technologies

Genome-wide association studies (GWAS) = source of data for SNP associations with particular diseases

Basic idea: search for chromosomal regions (SNPs) with different allele frequencies in cases vs. controls


If such SNPs are found, it could be that:

  the SNP allele causes (or contributes to) the disease

  the SNP allele is close enough on a chromosome to the disease-causing mutation that they have been inherited together in most people since the mutation arose (founder effect)

  the SNP allele and the disease both occur at higher frequency in some ethnic group, but not for genetic reasons (e.g. malaria and skin pigment variants)

The last possibility would be a false positive result for GWAS!


So GWAS studies first go to great lengths to select genetically homogeneous cases and controls and exclude genetically heterogeneous individuals



How can you do this?

Use multi-dimensional scaling: a data visualization tool to group similar objects in complex data sets


Idea: imagine an n-dimensional space where each axis represents a SNP locus, with AA = 0, Aa = 0.5, aa = 1 along the axis

represent each individual as a point in this space

genetic distance between people = Euclidean distance between their points

Hard to “see” data in n-dimensional space when n is large

So make a 2-d plot of individuals so that the distance between pairs in 2-d “best” reflects the Euclidean distance in n dimensions

(imagine moving each point in the 2-d map randomly to minimize the discrepancy between 2-d and n-d distances, summed over all other individuals, then repeating for each individual until map positions converge)
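A minimal sketch of the random-move procedure described above, assuming made-up genotypes for six people at six SNP loci. The function names, step size, and number of sweeps are illustrative choices, not taken from any GWAS software. A move is kept only if it reduces the mismatch between 2-d and n-d distances for that individual.

```python
import math
import random

# Hypothetical genotype codes (AA=0, Aa=0.5, aa=1) at 6 SNP loci for 6 people.
# Rows 0-2 and rows 3-5 are two made-up groups, for illustration only.
GENOTYPES = [
    [0, 0, 0.5, 0, 0, 0],
    [0, 0.5, 0, 0, 0, 0.5],
    [0.5, 0, 0, 0, 0.5, 0],
    [1, 1, 0.5, 1, 1, 1],
    [1, 0.5, 1, 1, 1, 0.5],
    [0.5, 1, 1, 1, 0.5, 1],
]

def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def point_stress(i, pos, target):
    """Squared mismatch between 2-d and n-d distances from point i to all others."""
    return sum((euclid(pos[i], pos[j]) - target[i][j]) ** 2
               for j in range(len(pos)) if j != i)

def total_stress(pos, target):
    """Mismatch summed over all pairs (each pair counted once)."""
    return sum(point_stress(i, pos, target) for i in range(len(pos))) / 2

def mds_random_moves(genotypes, sweeps=500, step=0.1, seed=0):
    rng = random.Random(seed)
    n = len(genotypes)
    # n-d genetic distances between every pair of individuals
    target = [[euclid(genotypes[i], genotypes[j]) for j in range(n)] for i in range(n)]
    # random starting positions in the 2-d map
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    initial = total_stress(pos, target)
    for _ in range(sweeps):
        for i in range(n):
            old = pos[i]
            before = point_stress(i, pos, target)
            pos[i] = [old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step)]
            if point_stress(i, pos, target) >= before:
                pos[i] = old  # reject moves that do not reduce the discrepancy
    return pos, initial, total_stress(pos, target)

positions, initial_stress, final_stress = mds_random_moves(GENOTYPES)
print(f"stress before: {initial_stress:.3f}, after: {final_stress:.3f}")
```

Because only improving moves are accepted, the total mismatch can never increase. Plotting `positions` would show whether the two made-up groups separate in the map.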

Genetically closely related people cluster in such a map

Eliminate all outliers from GWAS study population

Implication: any positive GWAS findings are initially only “true” for a particular homogeneous group (e.g. CEU = N. Europeans) and must be retested in other populations before they can be accepted generally


Next problem: if genotype (pattern of alleles at some locus, e.g. AA vs Aa vs aa) frequencies differ between cases and controls, how much do they have to differ to be statistically significant?


Basic idea in statistics: see if data are reasonably likely given the “null” hypothesis (H0) that groups (e.g., cases and controls) do not differ (in genotype frequencies)

If groups are not really different, you could pool the data and calculate the mean and st. dev. for the pool, then ask: if you randomly chose 2 groups (of the size of the cases and controls) from this one population, how likely would the means of the 2 groups be to differ by as much as you observe? A “t” test gives you this probability. If it is very low, you may have reason to reject the null hypothesis.
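The pooling thought experiment above can be simulated directly. The sketch below is a permutation (randomization) test, not the “t” test formula itself: it pools made-up measurements, repeatedly draws two random groups of the original sizes, and counts how often their means differ by at least the observed amount. The data, function name, and number of shuffles are assumptions for illustration.

```python
import random
from statistics import mean

def permutation_p_value(cases, controls, n_shuffles=5000, seed=0):
    """Estimate how often two groups drawn at random from the pooled data
    differ in mean by at least as much as the observed groups do."""
    rng = random.Random(seed)
    observed = abs(mean(cases) - mean(controls))
    pool = list(cases) + list(controls)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pool)
        group_a = pool[:len(cases)]
        group_b = pool[len(cases):]
        if abs(mean(group_a) - mean(group_b)) >= observed:
            hits += 1
    return hits / n_shuffles

# Made-up measurements, not data from the lecture.
cases = [5.1, 5.4, 5.0, 5.6, 5.3, 5.5, 5.2, 5.7]
controls = [4.1, 4.4, 4.0, 4.6, 4.3, 4.5, 4.2, 4.7]
p = permutation_p_value(cases, controls)
print(f"estimated p = {p}")
```

A small estimated p means random regrouping of the pool rarely produces a difference as large as the one observed, which is the same logic the “t” test expresses with a formula.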

The chi sq test is very like the “t” test

$\chi^2 = \sum (\mathrm{Exp} - \mathrm{Obs})^2 / \mathrm{Exp}$

Its probability distribution is known for randomly selected groups from a single population.

If p(chi sq) < some small number $\alpha$, e.g. $\alpha = 0.05$, you might want to conclude the groups are different

Traditionally, and completely arbitrarily, $\alpha = 0.05$ is often taken as a cut-off. This means that if the groups are really not different, you’ll make a mistake and call them different 5% of the time.

You pick the cut-off for whatever error rate you feel appropriate



Complication: if one tests for association with 20 (or n) independent things, expect ~1 to have p(chi sq) < 0.05 (< 1/n) even when no association exists (false positive, FP).

Testing for association with any of ~$10^6$ SNPs, one needs a much stricter criterion than $\alpha = 0.05$ in order to avoid lots of FPs

Simplest correction: Bonferroni: divide $\alpha$ by n = # of SNPs tested; e.g. require p(chi sq) < $0.05/10^6 \sim 10^{-8}$ in order that the probability of any FP be < 0.05
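A small sketch of the arithmetic behind the multiple-testing problem, assuming the tests are independent as the slide does. The function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Average number of false positives when no association exists."""
    return n_tests * alpha

def prob_any_false_positive(n_tests, alpha):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cut-off that keeps the chance of any false positive below alpha."""
    return alpha / n_tests

# 20 independent tests at alpha = 0.05
print(expected_false_positives(20, 0.05))
print(prob_any_false_positive(20, 0.05))

# ~10^6 SNPs with the Bonferroni cut-off
cutoff = bonferroni_threshold(10**6)
print(cutoff)
print(prob_any_false_positive(10**6, cutoff))
```

With the corrected cut-off, the chance of any false positive across all 10^6 tests stays just under 0.05.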






Example chi sq calculation: hypothetical #'s with each genotype

              aa     aA     AA    sum
dis. cases    45    510   1445   2000
controls     120    960   1920   3000
totals       165   1470   3365   5000




If H0 true, can pool groups for best est. of probabilities

p(aa) = 165/5000; p(aA) = 1470/5000; p(AA) = 3365/5000

Then expected # aa among dis. cases = p(aa)*2000 = 66

Expected # of aA among dis. cases = p(aA)*2000 = 588


Compute remaining expected #’s same way or from totals ->

Expected #    aa     aA     AA    sum
dis. cases    66    588   1346   2000
controls      99    882   2019   3000
totals       165   1470   3365   5000



$\chi^2 = \sum (\mathrm{Exp} - \mathrm{Obs})^2 / \mathrm{Exp} = (66 - 45)^2/66 + \dots = 40.52$

p(chi sq, 2 df) = $1.59 \times 10^{-9}$ (from table, or web) < $\alpha = 10^{-8}$

so H0 (no association) rejected, assoc. is likely




For confirmation, repeat study in independent groups
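The worked example can be checked with a short script. This is a sketch using the hypothetical counts from the table. The tail probability uses the fact that, for exactly 2 degrees of freedom, the chi sq distribution's upper tail is exp(-chi sq / 2), so no statistics library is needed. The function names are illustrative.

```python
import math

# Hypothetical genotype counts from the example table
observed = {
    "cases":    {"aa": 45,  "aA": 510, "AA": 1445},
    "controls": {"aa": 120, "aA": 960, "AA": 1920},
}

def chi_square(table):
    """Sum of (Exp - Obs)^2 / Exp over all cells, with Exp from pooled frequencies."""
    groups = list(table)
    genotypes = list(table[groups[0]])
    row_totals = {g: sum(table[g].values()) for g in groups}
    col_totals = {t: sum(table[g][t] for g in groups) for t in genotypes}
    grand_total = sum(row_totals.values())
    chi2 = 0.0
    for g in groups:
        for t in genotypes:
            expected = row_totals[g] * col_totals[t] / grand_total
            chi2 += (expected - table[g][t]) ** 2 / expected
    return chi2

def p_value_2df(chi2):
    """Upper-tail probability of chi sq with exactly 2 degrees of freedom."""
    return math.exp(-chi2 / 2)

chi2 = chi_square(observed)
p = p_value_2df(chi2)
print(f"chi sq = {chi2:.2f}, p = {p:.3g}")
```

A 2 x 3 table has (2 - 1) x (3 - 1) = 2 degrees of freedom, which is why the closed-form tail applies here; other table shapes would need a table or a statistics library.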

Next problem: not really interested in p(data|hypothesis); want p(hypothesis|data)

E.g., you observe that the frequency of some SNP allele is higher in the disease group vs controls; you want to know p(disease|genotype), not p(genotype|disease)

Bayesian statistics allows you to infer p(disease|genotype) from p(genotype|disease)


Basic idea: 2 ways to calculate p of disease and genotype AA

$p(D|AA)\,p(AA) = p(AA|D)\,p(D)$

$\Rightarrow p(D|AA) = p(AA|D)\,p(D) / p(AA)$

have to know p(D), p(AA), and p(AA|D) to get p(D|AA)
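A minimal sketch of this inversion. The genotype frequencies reuse the hypothetical chi sq table (1445/2000 AA among cases, 1920/3000 among controls). The disease prevalence p(D) = 1% is purely an assumption for illustration, and p(AA) is obtained by averaging over diseased and non-diseased people.

```python
def p_disease_given_genotype(p_geno_given_d, p_geno_given_not_d, p_d):
    """Bayes: p(D|AA) = p(AA|D) p(D) / p(AA)."""
    # p(AA) in the whole population, weighting diseased and non-diseased people
    p_geno = p_geno_given_d * p_d + p_geno_given_not_d * (1 - p_d)
    return p_geno_given_d * p_d / p_geno

p_aa_given_d = 1445 / 2000      # from the hypothetical cases row
p_aa_given_not_d = 1920 / 3000  # from the hypothetical controls row
p_d = 0.01                      # assumed prevalence, not from the lecture

posterior = p_disease_given_genotype(p_aa_given_d, p_aa_given_not_d, p_d)
print(f"p(D|AA) = {posterior:.4f} vs p(D) = {p_d}")
```

Even with a highly significant association, p(D|AA) here is only slightly above the assumed baseline p(D).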

Relative risk might be measured as p(D|AA)/p(D)

but frequently expressed in terms of “odds ratio”

Odds = p(event)/[1 - p(event)], e.g. “2:1” if p(event) = 0.67

Odds ratio = odds(D|AA)/odds(D) (assume A is the high-risk allele)

$= \{p(D|AA)/[1 - p(D|AA)]\} / \{p(D)/[1 - p(D)]\}$

note odds ratio > relative risk since $[1 - p(D)]/[1 - p(D|AA)] > 1$
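A short sketch comparing the two measures, using made-up probabilities (p(D) = 0.10, p(D|AA) = 0.15) chosen only to illustrate the definitions above.

```python
def odds(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

def relative_risk(p_d_given_aa, p_d):
    return p_d_given_aa / p_d

def odds_ratio(p_d_given_aa, p_d):
    return odds(p_d_given_aa) / odds(p_d)

p_d = 0.10           # assumed baseline risk
p_d_given_aa = 0.15  # assumed risk for genotype AA

rr = relative_risk(p_d_given_aa, p_d)
or_ = odds_ratio(p_d_given_aa, p_d)
print(f"relative risk = {rr:.3f}, odds ratio = {or_:.3f}")
```

The two measures are close when the disease is rare and diverge as the probabilities grow, so an odds ratio should not be read as a relative risk for common conditions.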


Look at data in GWAS paper, Nature 447:661 (2007): appreciate the magnitude, expense, complexity and limitations

~100 authors, 10^6 SNPs tested in each of 17,000 samples (@ $1000)

could the study have been done if each test cost $1?
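The question can be approached with back-of-envelope arithmetic. The sketch below assumes “@ $1000” means the cost per sample and that “each test” means one SNP genotyped in one sample; both readings are assumptions, not stated on the slide.

```python
samples = 17_000
snps_per_sample = 10**6

cost_per_sample = 1_000  # assumed reading of "@ $1000"
cost_per_snp_test = 1    # the hypothetical $1 per individual SNP test

total_at_array_price = samples * cost_per_sample
total_at_one_dollar_per_test = samples * snps_per_sample * cost_per_snp_test

print(f"${total_at_array_price:,} vs ${total_at_one_dollar_per_test:,}")
```

Under these assumptions the per-test price would raise the total by a factor of 1,000.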

Note most disease risks measured by OR only ~1.2-2

Do you understand most of the columns in this table?

[Table from the paper; labelled columns: raw p (chi sq), ORs, dis.]

Note many SNPs in region are associated with disease

[Figure: example of hit region]

[Figure: summary of all hits, all diseases, all chromosomes; axis: log10(p)]

Limitations

Do most SNP associations identify causative mutations?

No, because there are many SNPs in each region; they can’t all be causative

If not causative, why the association?

Likely explanation: a causative mutation arose sometime, not very long ago, on some chromosome in a “founder” individual; he/she passed on the mutation to offspring along with adjacent chromosomal regions. Recombination between the causative mutation and these regions has not yet occurred on most chromosomes bearing the mutation, so SNPs near the mutation in the founder remain associated in offspring = linkage disequilibrium (LD)

Implications:

associated SNPs reflect fairly recent mutations, therefore may be restricted to particular ethnic groups (not enough time to spread throughout the world by migration, interbreeding); in other groups the same SNPs may be unassociated with disease

hard to find very old mutations causing disease (no LD)

hits provide locational clues to causative mutations; the latter could provide leads for new treatments (rx), reduce imprecision in risk assessments

most SNP associations now confirmed in independent disease group studies (see 23andMe white paper on “vetting” disease associations)

Note most relative risks (odds ratios) are small, < 2-fold

Does this make most results practically insignificant?

Odds ratios are much smaller than expected from estimates of heritability from family studies

Example: height said to be 80% inherited, but max combined effect of all associated SNPs only ~5%


How is “% heritability” estimated?

Old way: for height, plot children’s height vs mean height of their parents; if children with tall parents tend to be tall, height could be genetic

More mathematically, fraction of variance explained by parents’ height

$= 1 - \sum [h_c - (a h_p + b)]^2 / \sum (h_c - \langle h_c \rangle)^2$

[Plot: $h_{children}$ vs $\langle h_{parents} \rangle$]

find (least sq.) best fit line: $a h_p + b$

compare variance from best fit line to variance from global mean line $\langle h_c \rangle$
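A minimal sketch of this calculation: fit the least-squares line, then compare scatter about the line with scatter about the global mean. The heights are made-up numbers and the function name is illustrative.

```python
from statistics import mean

def fraction_variance_explained(h_parents, h_children):
    """Return slope a, intercept b, and 1 - SS(fit line) / SS(mean line)."""
    mp, mc = mean(h_parents), mean(h_children)
    # least-squares slope and intercept for h_c = a * h_p + b
    a = (sum((x - mp) * (y - mc) for x, y in zip(h_parents, h_children))
         / sum((x - mp) ** 2 for x in h_parents))
    b = mc - a * mp
    ss_fit = sum((y - (a * x + b)) ** 2 for x, y in zip(h_parents, h_children))
    ss_mean = sum((y - mc) ** 2 for y in h_children)
    return a, b, 1 - ss_fit / ss_mean

# Made-up mid-parent and child heights in cm, for illustration only
h_parents = [160, 165, 170, 175, 180]
h_children = [162, 166, 169, 176, 179]

a, b, explained = fraction_variance_explained(h_parents, h_children)
print(f"a = {a:.2f}, b = {b:.1f}, fraction explained = {explained:.3f}")
```

A value near 1 says the line accounts for almost all the spread in children's heights; on its own it says nothing about whether genes or shared environment produce the correlation.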

Does child-parent height correlation prove height is genetic?

No: it may confound environment and gene effects (tall parents may eat better and provide better diet)


Clever way to tease out genetic from environmental effects within families: use SNP genotypes to measure genetic relatedness between siblings and plot height differences between sib pairs vs. genetic relatedness

Genetic relatedness = % genes that are identical in siblings due to inheritance from the same grandparent (e.g. they both get their mother’s maternal (or paternal) alleles vs one gets the maternal and the other the paternal allele); call this % identity by descent (IBD)

Plot height difference between sibs vs % IBD

[Plot: h diff vs % IBD; axis ticks at 45, 50, 55]

Now variance from red line / variance from blue line provides estimate of effect explained by genes, controlled for environment (sib pairs expected to share environments to same extent, unaffected by their % IBD)

Can generalize to disease incidence … (don’t worry about details) to say what fraction of disease incidence is “explained” by SNPs

[Plot: disease status (disease = 1, no disease = 0) vs genotype (hh = 0, Hh = 0.5, HH = 1)]

Find least sq. best fit line

Fraction explained by H = 1 - (var from red line / var from blue)

“Missing heritability” = big embarrassment for GWAS

Possible explanations:

  family studies overestimate heritability by confounding environmental effects

  disease caused by changes in gene expression not detectable by SNPs (epistasis)

  disease caused by very old mutation (assoc. lost due to genetic recombination over time)

  disease caused by rare alleles (SNPs analyzed chosen to have minor allele frequencies > 5%)

Some want to push on w/ GWAS, testing rarer SNPs or sequencing genomes to look for alleles assoc. w/ disease

Reward may be in understanding how particular genes contribute to diseases, not in utility of risk prediction

Next problem: how to combine risks from unlinked SNPs

23andMe multiplies relative risks (see 23andMe white paper). This assumes effects are independent, i.e. no gene interactions. Is this accurate?

Counterexample: a gene that raises expression of fetal hemoglobin (hgb) decreases severity of sickle cell disease => some genes interact “non-linearly”
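A sketch of the multiplicative combination described above, assuming independent effects and relative risks defined relative to the population average, p(D|genotype)/p(D). The relative risks and average risk are made-up numbers, and this is not 23andMe's actual implementation.

```python
from math import prod

def combined_relative_risk(relative_risks):
    """Multiply per-SNP relative risks; valid only if effects are independent."""
    return prod(relative_risks)

def predicted_risk(average_risk, relative_risks):
    """Scale the population-average risk by the combined relative risk."""
    return average_risk * combined_relative_risk(relative_risks)

# Made-up relative risks for three unlinked SNPs and an assumed average risk
rrs = [1.2, 1.3, 0.8]
combined = combined_relative_risk(rrs)
risk = predicted_risk(0.08, rrs)
print(f"combined RR = {combined:.3f}, predicted risk = {risk:.4f}")
```

If genes interact, as in the fetal hemoglobin example, the true combined risk need not equal this product; with enough high-risk factors the product can even exceed 1, which signals the model has broken down.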


How could one verify if predicted dis. risks are accurate?

1. Prospective studies: think about feasibility: how many subjects needed, how much time, etc.

2. Compare different companies’ risk predictions

Venter et al. compared risk predictions of Navigenics and 23andMe for 13 diseases for 5 individuals

Results: qualitative discrepancies for half of people in half of tests

Explanation: companies used different sets of SNPs.

Does this restore confidence in clinical validity of test?

Smallness of effects limits clinical utility: most effects comparable to risks conferred by positive family history

But for some diseases, where dis. mutation identified, predicted risk increase can be large

e.g. CF ~100%, though severity can vary

BRCA1: some mutations elevate lifetime risk from ~8% to 80% (> 20x risk for early onset)


Next problem: when relative risk increase is large, is there something one can do about it?

Will come back to this for BRCA in unit on screening for breast cancer

At what point, if any, should FDA regulation be required?

Should it depend on magnitude of relative risk, absolute risk?

On whether test results are likely to result in life-altering action?

  surgery
  drug treatment
  life-long screening
  abortion

Different measures of test validity and utility:

Scientific validity: does it detect the SNPs it says it does, with what error rate?

Clinical validity: does it produce valid diagnoses?

Clinical utility: is the information useful in a medical setting?

How do tests for BRCA mutation, CF carrier status, warfarin sensitivity rate by these criteria?

Main points

basic idea of GWAS: different allele frequencies in disease vs control groups

many DNA regions found to affect chance of getting several common diseases

most effects small, possibly limited to specific ethnic groups

provide leads for finding causative genes -> understanding disease mechanism

possible applications in drug therapy (“personalized medicine”)





Homework:

look over GWAS paper, try to get big ideas, don’t worry about unintelligible jargon

divide and conquer papers/topics (pick one):

  Math Exercise on odds ratios and chi sq (2 items)

  Venter on comparing 23andMe and Navigenics results: what are ethics of his conclusions?

  NYT: on behavioral effect of DTC genetic testing

  NEJM: on risk prediction from GWAS (2 items)

  2 views of utility of warfarin genetic test (pick one):

    Am Coll Cardiol.: it reduces hosp.

    Ann Int. Med.: it is not cost-eff.