R

powerfultennesseeBiotechnology

Oct 2, 2013 (4 years and 1 month ago)

131 views

1

Day 2 Section 8

Introduction to
statistics for Genome
-
Wide Association
Studies (GWAS)

2

Day 2 Section 8

Outline


Background on GWAS


Presentation of GenABEL


Data checking with GenABEL


Data analysis with GenABEL


Display of results



3

Day 2 Section 8

R

Packages for GWAS


Plink

developped by the M.I.T. but only
available for linux platform only.
(
http://pngu.mgh.harvard.edu/~purcell/plink/
).


SNPassoc

(Juan R. González 1, et al.
Bioinformatics, 2007 23(5):654
-
655)


GenABEL

(Aulchenko Y.S., Ripke S., Isaacs A.,
van Duijn C.M. Bioinformatics. 2007,
23(10):1294
-
6.)

4

Day 2 Section 8

What is a GWAS?


A genome
-
wide association study is an approach that
involves rapidly scanning
markers across genome
(

0.5M or 1M) of
many people
(

2K) to find genetic
variations associated with a particular disease.


A large number of subjects are needed because

(1)
associations between SNPs and causal variants are expected
to show
low odds ratios
, typically below 1.5

(2)
In order to obtain a reliable signal, given the very large
number of tests that are required, associations must show a
high level of significance to survive the multiple testing
correction


Such studies are particularly useful in finding genetic
variations that contribute to common, complex
diseases

5

Day 2 Section 8

What is a GWAS?

6

Day 2 Section 8

Why are such studies possible now?

The completion of the Human Genome
Project in 2003 and the International
HapMap Project in 2005, researchers
now have a set of research tools that
make it possible to find the genetic
contributions to common diseases

7

Day 2 Section 8

GWAS for complex diseases

8

Day 2 Section 8

Overview of the general design and workflow of a genome
-
wide association
(GWA) study


9

Day 2 Section 8

What have GWAS found?


In 2005, it was learned through GWAS that age
-
related macular
degeneration is associated with variation in the gene for
complement factor H, which produces a protein that regulates
inflammation (Klein et al. (2005) Science, 308, 385

389)


In 2007, the Wellcome Trust Case
-
Control Consortium (WTCCC)
carried out GWAS for the diseases coronary heart disease, type
1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's
disease, bipolar disorder and hypertension. This study was
successful in uncovering many new disease genes underlying these
diseases.


See next page for more publications in GWAS

10

Day 2 Section 8


Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity
variants. Nat Genet. 2007


Genome
-
wide association study of 14,000 cases of seven common diseases and 3,000 shared
controls.
Wellcome Trust Case Control Consortium
Nature. 2007;447;661
-
78


Genomewide association analysis of coronary artery disease.

Samani et al.
N Engl J Med. 2007;357;443
-
53


Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to
Crohn's disease susceptibility.
Parkes et al.
Nat Genet. 2007;39;830
-
2


Robust associations of four new chromosome regions from genome
-
wide analyses of type 1
diabetes.
Todd et al.
Nat Genet. 2007;39;857
-
64


A common variant in the FTO gene is associated with body mass index and predisposes to
childhood and adult obesity.
Frayling et al.
Science. 2007;316;889
-
94


Replication of genome
-
wide association signals in UK samples reveals risk loci for type 2 diabetes.
Zeggini et al.
Science. 2007;316;1336
-
41


Scott et al. (2007) A genome
-
wide association study of type 2 diabetes in Finns detects multiple
susceptibility variants. Science, 316, 1341

1345.

Examples of GWAS

11

Day 2 Section 8

Example: Data & Results

12

Day 2 Section 8

Problem(s)

How to make inference
about SNP
-
Disease
associations ?


Which computational tools
to use?

13

Day 2 Section 8

Features of GenABEL


Specifically designed for GWAS


Provides specific facilities for storage and
manipulation of large data


Very fast tests for GWAS


Specific functions to analyze and display
the results


More efficient than the library “genetics”

14

Day 2 Section 8

GeneABEL: GWAS.data class

15

Day 2 Section 8

Exploring GWAS.data class objects

library("GenABEL")

data(ge03d2ex)

# phenotype data

summary(ge03d2ex@phdata)

R

code:

R

output

16

Day 2 Section 8

Exploring GWAS.data class objects

library("GenABEL")

data(ge03d2ex)

# phenotype data

summary(ge03d2ex@phdata)

# number of people in study

ge03d2ex@gtdata@nids

# number of SNPs

ge03d2ex@gtdata@nsnps

# SNP names

ge03d2ex@gtdata@snpnames[1:10]

# Chromosome labels

ge03d2ex@gtdata@chromosome[1:10]

# SNPs map positions

ge03d2ex@gtdata@map[1:10]


R

code:

17

Day 2 Section 8

Descriptive statistics: phenotypes

descriptive.trait(ge03d2ex)

R

code:

R

output

type 2 diabetes status

descriptives.trait(ge03d2ex, by=ge03d2ex@phdata$dm2))

R

code:

= by case
-
control status

18

Day 2 Section 8

Descriptive statistics: markers

descriptives.marker(ge03d2ex)

R

code:

R

output

19

Day 2 Section 8

Test of Hardy
-
Weinberg equilibrium

# Test of Hardy
-
Weinberg equilibrium in control group

s<
-
summary(ge03d2ex@gtdata[(ge03d2ex@phdata$dm2 == 0),])

pexcas<
-
s[,"Pexact"]

estlambda(pexcas)

# Test of Hardy
-
Weinberg equilibrium in case group

s<
-
summary(ge03d2ex@gtdata[(ge03d2ex@phdata$dm2 == 1),])

pexcas<
-
s[,"Pexact"]

estlambda(pexcas)

R

code:

Controls

Cases

R

output

20

Day 2 Section 8

Data checking: procedure

RUN 1

3993 markers and 134 people in total

304 (7.613323%) markers excluded as having low (<1.865672%) minor allele
frequency

36 (0.9015778%) markers excluded because of low (<95%) call rate

0 (0%) markers excluded because they are out of HWE (P <0)

1 (0.7462687%) people excluded because of low (<95%) call rate

3 (2.238806%) people excluded because too high autosomal heterozygosity
(FDR <1%)

Mean autosomal HET was 0.2747262 (s.e. 0.03721277), people excluded had
HET >= 0.5041617

1 (0.7462687%) people excluded because of too high IBS (>=0.95)

Mean IBS was 0.785972 (s.e. 0.02000698), as based on 2000 autosomal
markers

In total, 3653 (91.4851%) markers passed all criteria

In total, 129 (96.26866%) people passed all criteria


R

output

qc1<
-
check.marker(ge03d2ex, p.level=0)

R

code:

21

Day 2 Section 8

Data checking: summary table

R

output

summary(qc1)

R

code:

$`Per
-
SNP fails statistics`


NoCall NoMAF NoHWE Redundant Xsnpfail

NoCall 42 0 0 0 0

NoMAF NA 376 0 0 0

NoHWE NA NA 0 0 0

Redundant NA NA NA 0 0

Xsnpfail NA NA NA NA 1



$`Per
-
person fails statistics`


IDnoCall HetFail IBSFail isfemale ismale

IDnoCall 1 0 0 0 0

HetFail NA 3 0 0 0

IBSFail NA NA 1 0 0

isfemale NA NA NA 2 0

ismale NA NA NA NA 0


22

Day 2 Section 8

Data checking: output


The procedure provides the list of
individuals
(idok)

and SNPs
(snpok)

who
passed all QC criteria.


It is then possible to obtain a “clean”
dataset:

data1<
-
ge03d2ex[qc1$idok, qc1$snpok]

R

code:

23

Day 2 Section 8

Data checking:
HW plots after cleaning

s1<
-
summary(data1@gtdata[(data1@phdata$dm2 == 1),])

pexcas1<
-
s1[,"Pexact"]

estlambda(pexcas1)

R

code:

R

output

After

Before

24

Day 2 Section 8

Finding genetic sub
-
structure

# matrix of genomic kindship between all pairs of individuals

data1.gkin <
-
ibs(data1[,data1@gtdata@chromosome != "X"], weight="freq")

# distance matrix

data1.dist<
-
as.dist(0.5
-
data1.gkin)

#use classical multidimensional scaling

data1.mds<
-
cmdscale(data1.dist)

#plot the two first components

plot(data1.mds)

R

code:

Exclude these

individuals

25

Day 2 Section 8

Remove outliers

km<
-
kmeans(data1.mds, centers=2, nstart=1000)

cl1<
-
names(which(km$cluster==1))

cl2<
-
names(which(km$cluster==2))

data2<
-
data1[cl1,]

R

code:

Then, repeat the QC analysis allowing for HWE checks
(using controls and exclude markers with FDR

0.2)

qc2<
-
check.marker(data2, hweids=(data2@phdata$dm2 ==0), fdr=0.2)

summary(qc2)

R

code:

R

output

26

Day 2 Section 8

GWA scan: raw data

Scan of the raw data (before quality control) using a
score test, as implemented in the qtscore() function.

an0<
-
qtscore(dm2, ge03d2ex, trait="binomial")

plot(an0)

# add corrected p
-
values

in green

add.plot(an0, df="Pc1df", col="green")

R

code:

R

output

interesting

results?

27

Day 2 Section 8

GWA scan: raw data

Scan of the raw data (before quality control) using a
score test, as implemented in the qtscore() function.

#descriptive table

descriptives.scan(an0)

R

code:

R

output: Top 10 results

28

Day 2 Section 8

GWA scan: cleaned data

data2<
-
data2[qc2$idok, qc2$snpok]

# plot

an1<
-
qtscore(dm2, data2, trait="binomial")

plot(an1)

# add corrected p
-
values

add.plot(an1, df="Pc1df", col="green")

R

code:

interesting

results

R

output

29

Day 2 Section 8

Comparison of the two scans

Raw data

Clean data

false

signal?

#compare with previous results

plot(an1,, col="green")

# add corrected p
-
values

add.plot(an0, col="red")

R

code:

30

Day 2 Section 8

GWA scan: cleaned data

#descriptive table

descriptives.scan(an1)

R

code:

Clean data

Raw data

31

Day 2 Section 8

GWA in presence of genetic stratification


Assess population structure



Account for pop. structure in the analysis

pop<
-
as.numeric(data1@phdata$id %in% cl1)

pop

R

code:

# Assess pop. structure

pop<
-
as.numeric(data1@phdata$id %in%
cl1)

pop

# Stratified association

data1.sa<
-
qtscore(dm2, data=data1,
strata=pop)

# plots results and compare with
analysis removing the outliers

plot(an1, cex=0.5, pch=19, ylim=c(1, 5))

add.plot(data1.sa, col="green", cex=1.2)

R

code:

32

Day 2 Section 8

GWA in presence of genetic stratification


Adjust both phenotypes and genotypes for
possible stratification using principal
component analysis (Price’s method)

data1.eg<
-
egscore(dm2,
data=data1, kin=data1.gkin)

plot(an1, cex=0.5, pch=19,
ylim=c(1, 5))

add.plot(data1.sa, col="green",
cex=1.2)

add.plot(data1.eg, col="red",
cex=1.3)

R

code:

33

Day 2 Section 8

Other interesting features


Genetic data imputations


Meta
-
analysis of GWA scans


Analysis of selected regions


Conversion of plink files

34

Day 2 Section 8

Conclusion


GWAS is becoming of major area of
research


New computational tools and stat methods
are needed


GenABEL is an interesting program,
especially for easy data cleaning and
display of results


Plink has more features for stat analysis
but not yet available in
R

for Windows!


35

Day 2 Section 8

Thank you!