Week 5 - Population structure

weinerthreeforksBiotechnology

Oct 2, 2013 (4 years and 1 month ago)


GMS6181 Genomics and Bioinformatics

Week 5

Population structure


Simple case-control test for association


Effect of population structure in creating spurious associations


Measuring population structure











Identify unrelated affected (cases) and unaffected (control)
individuals.


Genotype everyone for the markers of interest.


Compare groups.


If markers are at different frequencies in the “case” and
“control” groups of individuals, then there is an indication of
association.

Case-Control Tests




A mutation that increases disease susceptibility is expected to be
at a higher frequency among affected individuals (cases) than
among unaffected individuals (controls).


This test can be carried out using a simple chi-square test.


Case-Control Tests: Principle



Consider two populations…




Population affected:






Population not affected:



Mix the two populations…





Mixed population



Their progeny



After a few generations…





After many, many generations…





In many different individuals…





Resolution…



Gene

Marker A

Marker B

Marker C

Marker D

Marker E

Marker F

Marker G

Marker H




SNP effect

Disease susceptibility


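Testing for association: Case-Control

Observed counts (columns 1 and 2 are the two marker classes):

              1               2               Total
Affected      n_{1|aff}       n_{2|aff}       n_{aff}
Unaffected    n_{1|unaff}     n_{2|unaff}     n_{unaff}
Total         n_1             n_2             N

Expected counts:

              1                       2
Affected      n_1 × n_{aff} / N       n_2 × n_{aff} / N
Unaffected    n_1 × n_{unaff} / N     n_2 × n_{unaff} / N

$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$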
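The observed and expected counts and the chi-square sum above can be sketched in a few lines of Python. This is a minimal illustration rather than a replacement for a statistics package; the function name `chi_square_2x2` and the counts in the usage line are hypothetical, and no continuity correction is applied.

```python
def chi_square_2x2(n1_aff, n2_aff, n1_unaff, n2_unaff):
    """Pearson chi-square for a 2x2 case-control table (no continuity correction)."""
    observed = [[n1_aff, n2_aff], [n1_unaff, n2_unaff]]
    row_totals = [sum(row) for row in observed]         # n_aff, n_unaff
    col_totals = [sum(col) for col in zip(*observed)]   # n_1, n_2
    n_total = sum(row_totals)                           # N

    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            # Expected count under no association: column total x row total / N
            expected = col_totals[c] * row_totals[r] / n_total
            chi2 += (observed[r][c] - expected) ** 2 / expected
    return chi2


# Hypothetical counts, chosen only to exercise the function.
print(round(chi_square_2x2(10, 20, 20, 10), 3))
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05.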


Case control: Pima-Papago Indians

                      Gm(3;5,13,14) locus
Diabetes     Present     Absent     Totals
Present           23       1343       1366
Absent           270       3284       3554
Totals           293       4627       4920

Knowler et al. [1988] Am J Hum Genet. 43:520-6


$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$

$X^2 = \;?$

What does the evidence tell us?




Original Pima-Papago Population

Chi-square = 55.27

Gm frequency in diseased / healthy individuals:

Gm          Diseased     Healthy
Present           2%          8%
Absent           98%         92%
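The Gm frequencies among diabetic and non-diabetic individuals can be recomputed from the counts in the unstratified Pima-Papago table. The sketch below assumes the table is read as diabetes status by Gm status (23 diabetic individuals with Gm present, 1343 diabetic with Gm absent, and so on); the dictionary layout and the function name are illustrative only.

```python
# Counts from the unstratified table (Knowler et al. 1988),
# keyed by (diabetes status, Gm status). The orientation is an assumption.
counts = {
    ("diabetic", "present"): 23,
    ("diabetic", "absent"): 1343,
    ("non-diabetic", "present"): 270,
    ("non-diabetic", "absent"): 3284,
}


def gm_present_frequency(status):
    """Fraction of individuals with the given diabetes status who carry Gm."""
    present = counts[(status, "present")]
    absent = counts[(status, "absent")]
    return present / (present + absent)


for status in ("diabetic", "non-diabetic"):
    print(f"{status}: Gm present in {gm_present_frequency(status):.1%}")
```

Rounded to whole percentages, these reproduce the 2% and 8% shown on the slide.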



Problem occurs when both disease frequencies and allele
frequencies differ among subpopulations


Randomly select samples of cases and controls from a
population with underlying subpopulations


If the disease of interest is at a higher frequency in one subpopulation,
then marker alleles that are at a higher frequency in that
subpopulation, even if unlinked to the trait, will be detected as
associated with it

Case-control and population structure
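A small numerical sketch can make this concrete. It builds two hypothetical subpopulations in which marker and disease are independent, pools them, and applies the chi-square test to each table. All rates, frequencies and sample sizes are invented for illustration, and expected counts are used instead of random sampling so the result is deterministic.

```python
def chi_square_2x2(table):
    """Pearson chi-square for [[a, b], [c, d]] (rows: cases, controls)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def stratum_table(n_people, disease_rate, marker_freq):
    """Expected counts when marker and disease are independent within a subpopulation."""
    cases = n_people * disease_rate
    controls = n_people - cases
    return [
        [cases * marker_freq, cases * (1 - marker_freq)],
        [controls * marker_freq, controls * (1 - marker_freq)],
    ]


# Hypothetical subpopulations: both disease rate and marker frequency differ.
subpop_a = stratum_table(1000, disease_rate=0.40, marker_freq=0.10)
subpop_b = stratum_table(1000, disease_rate=0.10, marker_freq=0.50)

# Pooling ignores subpopulation membership, as a naive case-control sample would.
pooled = [
    [subpop_a[r][c] + subpop_b[r][c] for c in range(2)]
    for r in range(2)
]

print("subpopulation A:", round(chi_square_2x2(subpop_a), 3))
print("subpopulation B:", round(chi_square_2x2(subpop_b), 3))
print("pooled sample:  ", round(chi_square_2x2(pooled), 3))
```

Within each subpopulation the statistic is essentially zero, while the pooled table gives a large value even though the marker has no effect on the disease.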


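Observed counts to fill in (columns 1 and 2 are the two marker classes):

              1       2       Total
Affected      ?       ?       n_{aff}
Unaffected    ?       ?       n_{unaff}
Total         n_1     n_2     N

The observed cells are n_{1|aff}, n_{2|aff}, n_{1|unaff} and n_{2|unaff}.

Expected counts:

              1                       2
Affected      n_1 × n_{aff} / N       n_2 × n_{aff} / N
Unaffected    n_1 × n_{unaff} / N     n_2 × n_{unaff} / N

$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$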


Case control, European descendants excluded: Pima-Papago Indians

                      Gm(3;5,13,14) locus
Diabetes     Present     Absent     Totals
Present           10       1058       1068
Absent             7        706        713
Totals            17       1764       1781

Knowler et al. [1988] Am J Hum Genet. 43:520-6


$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$

$X^2 = \;?$

What does the evidence tell us?




Full-blooded Pima-Papago Indians

Chi-square = 0.001

Gm frequency in diseased / healthy individuals:

Gm          Diseased     Healthy
Present           1%          1%
Absent           99%         99%


Random genetic drift, mutation, migration, selection and other
forces that change gene frequencies within subpopulations can lead to
population structure, i.e. unequal gene and genotype frequencies
among subpopulations, with variable levels of similarity between
them.

Knowing whether there is population structure, and how similar the
subpopulations are, is relevant in several fields. For instance,
any type of genetic test has to take structure into account, because
it may limit the scope of the test's conclusions or simply
invalidate them (as in the example we saw about the Pima-Papago
Indians).

Population structure


How can we tell that there is
population structure?


One measure is $F_{ST}$.


Population subdivision causes a reduction in heterozygosity.


When measuring population structure we are usually measuring
the differences between subpopulations that originated from an
ancestral population that gave rise to all individuals of one
species.


The reduction in heterozygosity can be used as one measure of
the population structure.

From Hardy-Weinberg equilibrium we know that:

$f(A_1A_1) = p^2$

$f(A_1A_2) = H = 2pq$

$f(A_2A_2) = q^2$
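As a quick numerical check of the Hardy-Weinberg proportions, the sketch below computes the three genotype frequencies for a given allele frequency. The function name and the value p = 0.7 are illustrative assumptions.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies at a biallelic locus with allele A1 at frequency p."""
    q = 1.0 - p
    return {"A1A1": p * p, "A1A2": 2 * p * q, "A2A2": q * q}


freqs = hardy_weinberg(0.7)   # p = 0.7 is an arbitrary illustrative value
print(freqs)
print(sum(freqs.values()))    # the three frequencies sum to 1, up to rounding
```

The heterozygote frequency 2pq is the quantity H used in the heterozygosity comparisons on the following slides.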


[Figure: a founder population with heterozygosity $H_T$ gives rise to subpopulations Pop. 1 to Pop. 8, each with $H_S < H_T$.]


The proportion of heterozygotes in the subpopulations ($H_S$), relative
to the proportion of heterozygotes expected in the total population
($H_T$), is quantified by $F_{ST}$, the fixation index. The fixation
index compares heterozygosity in the Subpopulations with that in the
Total population, and is estimated by:

$F_{ST} = (H_T - H_S) / H_T$

The heterozygosity of subpopulations is smaller than it would be if the
subpopulations were combined into a single, large, randomly mating
population. $F_{ST}$ measures the extent of this reduction in
heterozygosity.

Another useful interpretation is that $F_{ST}$ measures the amount of
variation in the whole population that is due to genetic differentiation
among subpopulations.
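The estimator can be written as a short function. This is a minimal sketch that assumes a biallelic locus, equally sized subpopulations and Hardy-Weinberg proportions within each one, so that heterozygosity is 2q(1 - q); the function name `fst` and the allele frequencies in the usage line are hypothetical.

```python
def fst(subpop_q):
    """F_ST = (H_T - H_S) / H_T from per-subpopulation allele frequencies q.

    Assumes equally sized subpopulations and Hardy-Weinberg proportions within each.
    """
    mean_q = sum(subpop_q) / len(subpop_q)
    # H_S: mean heterozygosity within subpopulations
    h_s = sum(2 * q * (1 - q) for q in subpop_q) / len(subpop_q)
    # H_T: heterozygosity expected if all subpopulations were one random-mating population
    h_t = 2 * mean_q * (1 - mean_q)
    if h_t == 0:
        raise ValueError("locus is monomorphic in the total population; F_ST is undefined")
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in three subpopulations.
print(round(fst([0.2, 0.5, 0.8]), 3))
```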


Let’s look at some hypothetical and extreme data:

             Pop. 1   Pop. 2   Pop. 3   Pop. 4   Mean
q              0.50     0.50     0.50     0.50   0.50 = $\bar{q}$
2q(1-q)        0.50     0.50     0.50     0.50   0.50 = $H_S$

$H_T = 2\bar{q}(1 - \bar{q}) = 0.50$

$F_{ST} = (H_T - H_S) / H_T = (0.50 - 0.50) / 0.50 = 0$

The entire population (Pop. 1, 2, 3 and 4 combined) is genetically
variable for the locus above. However, there is no genetic variation
among populations ($F_{ST} = 0$).



Let’s look at some hypothetical and extreme data:

             Pop. 1   Pop. 2   Pop. 3   Pop. 4   Mean
q              1.00     0.00     0.00     0.00   0.25 = $\bar{q}$
2q(1-q)        0.00     0.00     0.00     0.00   0.00 = $H_S$

$H_T = 2\bar{q}(1 - \bar{q}) = 0.375$

$F_{ST} = (H_T - H_S) / H_T = (0.375 - 0.00) / 0.375 = 1$

Populations (Pop. 1, 2, 3 and 4 combined) are genetically variable for
the locus above. They differ completely in gene frequency ($F_{ST} = 1$).
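The arithmetic of the two extreme examples can be checked with a short loop. The allele frequencies are the ones given on the slides; the dictionary labels and variable names are illustrative.

```python
examples = {
    "identical subpopulations": [0.50, 0.50, 0.50, 0.50],
    "fixed differences": [1.00, 0.00, 0.00, 0.00],
}

results = {}
for name, qs in examples.items():
    mean_q = sum(qs) / len(qs)
    h_s = sum(2 * q * (1 - q) for q in qs) / len(qs)   # mean of 2q(1-q) across populations
    h_t = 2 * mean_q * (1 - mean_q)                    # 2q(1-q) at the mean allele frequency
    results[name] = (h_s, h_t, (h_t - h_s) / h_t)
    print(f"{name}: H_S={h_s:.3f}  H_T={h_t:.3f}  F_ST={results[name][2]:.2f}")
```

The first case gives an F_ST of 0 and the second an F_ST of 1, matching the values on the slides.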


A given value of $F_{ST}$ has the following interpretation. If a
set of subpopulations has, for instance, an $F_{ST}$ of 0.1, this means that
of all the genetic variation that is observed in all populations
combined:

10% is due to variation among the subpopulations, and

90% is due to variation within subpopulations.

In practice, population structure studies generally sample many loci to
obtain an averaged estimate. A single locus may be influenced by
sampling effects, by selection when it is linked to genes associated with
fitness, and by other factors. For the same reason, loci that may
themselves be under any type of selection should be avoided.





Great! But…


Prior knowledge is required to define the subset of individuals that
compose a hypothetical subpopulation


The fixation index only tells you whether there is population structure;
the status of each individual remains unknown.


Doesn’t allow for admixture, i.e. each individual is assumed to come
from either population A or population B; if an individual is recently
admixed, it is impossible to determine how much of its genome is
derived from each of the two populations


No option for including prior information in the analysis


And many other issues that don’t make it ideal for our purposes.


Correction for population structure - Structured Association

Concept:

1. Estimate the population structure from random markers

2. Use these estimates to control for population structure
statistically in the association analyses (more on this on Thursday
and next week).



Pritchard JK, Stephens M, Donnelly P. 2000. Inference of population structure using multilocus genotype data. Genetics 155: 945-59.

Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. 2000. Association mapping in structured populations. American Journal of Human Genetics 67: 170-81.

Falush D, Stephens M, Pritchard JK. 2003. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164: 1567-87.



Accounting for population structure


Model-based clustering methods

- assume that the data represent a random sample drawn from a
parametric model; inferences about the parameters are made by
modeling the cluster membership of each sample

- each cluster is defined by a set of individuals that minimizes the
deviation from HWE

- allow for inclusion of prior information


By far the most popular approach has been developed by Pritchard:


http://pritch.bsd.uchicago.edu/structure.html



Inferring Population Structure


Assume admixed or independent population with K contributing
populations.


Each of the K founding populations was in equilibrium (HWE,
no LD).


In this generation, each allele copy originated in one of the
founding populations.


Want to figure out the probability that the alleles in each individual
originated in population $k$: the $Q$ vector.


Variables


$q_k^{(i)}$ = proportion of individual $i$’s genome that originated in population $k$ (collected in $Q$)

$x_l^{(i,a)}$ = genotype for individual $i$ at locus $l$ ($a = 1, 2$) (collected in $X$)

$z_l^{(i,a)}$ = population of origin of allele copy $x_l^{(i,a)}$ (collected in $Z$)

$p_{klj}$ = frequency of allele $j$ at locus $l$ in population $k$ (collected in $P$)
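One way to picture these four quantities is as nested Python containers. The layout below is only an illustrative sketch with invented values for two individuals, two loci and K = 2; it is not how STRUCTURE stores its data internally.

```python
K = 2            # number of assumed populations (hypothetical)
n_individuals = 2
n_loci = 2

# X[i][l] = (allele copy 1, allele copy 2) observed for individual i at locus l
X = [
    [(1, 1), (3, 2)],
    [(1, 3), (2, 2)],
]

# Z[i][l] = population of origin of each allele copy in X[i][l] (unobserved)
Z = [
    [(0, 0), (0, 1)],
    [(1, 1), (0, 1)],
]

# Q[i][k] = proportion of individual i's genome from population k; each row sums to 1
Q = [
    [0.75, 0.25],
    [0.25, 0.75],
]

# P[k][l][j] = frequency of allele j at locus l in population k; sums to 1 over j
P = [
    [{1: 0.9, 3: 0.1}, {2: 0.5, 3: 0.5}],
    [{1: 0.2, 3: 0.8}, {2: 0.7, 3: 0.3}],
]

# Basic consistency checks on the shapes and constraints described above.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Q)
assert all(abs(sum(freqs.values()) - 1.0) < 1e-9 for pop in P for freqs in pop)
assert len(X) == len(Z) == len(Q) == n_individuals
```

Only X is observed; Z, Q and P are the unknowns that the clustering method has to infer.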


Inferring Population Structure


Bayesian approach


Posterior: $\Pr(Z, P, Q \mid X)$


Priors: $\Pr(Z)$, $\Pr(P)$, $\Pr(Q)$
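One ingredient of the Bayesian calculation is the probability of the observed genotypes given the populations of origin Z and the allele frequencies P. Under the model of Pritchard et al. (2000), each allele copy is drawn from its population of origin with probability $p_{klj}$, so the log-probability is a sum of logs. The sketch below uses hypothetical toy data and an assumed nested-list layout for X, Z and P.

```python
import math


def log_likelihood(X, Z, P):
    """log Pr(X | Z, P): each allele copy is drawn from its population of origin.

    X[i][l] and Z[i][l] are pairs (one entry per allele copy);
    P[k][l] maps allele -> frequency in population k at locus l.
    """
    total = 0.0
    for x_ind, z_ind in zip(X, Z):
        for locus, (alleles, origins) in enumerate(zip(x_ind, z_ind)):
            for allele, k in zip(alleles, origins):
                total += math.log(P[k][locus][allele])
    return total


# Hypothetical toy data: one individual, one locus, two populations.
X = [[(1, 3)]]
Z = [[(0, 1)]]
P = [[{1: 0.9, 3: 0.1}], [{1: 0.2, 3: 0.8}]]
print(round(log_likelihood(X, Z, P), 4))
```

Here the result is log(0.9) + log(0.8): allele 1 comes from population 0 and allele 3 from population 1.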



Estimating K

Problem: Difficult to estimate allele frequencies, admixture
proportions and number of groups simultaneously


Suggested by Pritchard:

Phase 1: Define the number of populations $K$ to test.

Phase 2: Examine the clustering of individuals to evaluate the
appropriateness of the selected number of populations.


For his simulated data, Pritchard suggested running > $10^6$ iterations,
with a burnin of 30,000.



Example 1:


Eucalyptus (6 species)


Nine microsatellites


Example 2:


Loblolly pine


~ 500 genotypes


19 microsatellites


With $K$ = 5, spatial patterns by cluster look like this:

[Figure: one panel each for Cluster 1, Cluster 2, Cluster 3, Cluster 4 and Cluster 5]


Week 5 - Practical exercise

Population structure analysis - STRUCTURE







Input data file:

Open file "SNP genotype data.txt"

Components of the file:

Rows:
- marker name (optional)
- inter-marker distance (optional)
- individual data (required)
- phase information (optional)

Columns:
- label (individual information - optional)
- PopData (prior population information - optional)
- PopFlag (use/not use pop. data - optional)
- Phenotype (optional)
- extra columns (optional)
- genotype data (individual genotypes - required)

NA06985   1   1   1   3   3   2
NA06991   1   1   1   3   3   2
NA06993   1   1   1   3   3   2
NA06994   1   1   1   3   3   2
NA07000   1   1   1   3   3   2
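Rows of this kind are easy to read programmatically. The sketch below assumes the simplest layout shown here, an individual label followed by whitespace-separated integer allele codes, with -9 as the missing-data code used later in the practical. The function name is hypothetical, and a real file may contain the optional rows and columns listed above.

```python
MISSING = -9  # missing-data code used in the practical


def parse_structure_rows(text):
    """Parse one-row-per-individual data: a label followed by integer allele codes."""
    individuals = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        label, alleles = fields[0], [int(value) for value in fields[1:]]
        # Represent missing genotypes as None so they cannot be mistaken for alleles.
        individuals[label] = [None if a == MISSING else a for a in alleles]
    return individuals


sample = """\
NA06985 1 1 1 3 3 2
NA06991 1 1 1 3 3 2
NA06993 1 1 1 3 3 2
"""
data = parse_structure_rows(sample)
print(len(data), data["NA06985"])
```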


Modeling decisions:


How long to run the program

- burnin length (how long to run before starting to collect data):
10,000-100,000 is adequate, but verify summary statistics

- run length: do several runs at each K, at different lengths, and
verify that the answers are consistent across runs

  10,000-100,000 is adequate for estimating the number of sub-populations

  1,000,000 or more for good estimates of Pr(X|K)
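Once estimates of ln Pr(X|K) are available for several values of K, they can be compared by converting them to approximate posterior probabilities under a uniform prior on K, the ad hoc rule proposed by Pritchard et al. (2000). The sketch below is illustrative only: the function name is hypothetical and the ln Pr(X|K) values are invented, not taken from any run.

```python
import math


def posterior_over_k(ln_pr_x_given_k):
    """Pr(K | X) under a uniform prior on K, from a dict {K: ln Pr(X | K)}."""
    best = max(ln_pr_x_given_k.values())
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    weights = {k: math.exp(v - best) for k, v in ln_pr_x_given_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Invented ln Pr(X | K) values for illustration.
estimates = {1: -5120.0, 2: -4870.0, 3: -4868.5, 4: -4871.0, 5: -4890.0}
posterior = posterior_over_k(estimates)
for k, prob in posterior.items():
    print(k, round(prob, 3))
```

Because small differences in ln Pr(X|K) translate into large differences in probability, it is worth checking that the estimates are stable across replicate runs before interpreting them.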




Modeling decisions:


Ancestry model

- No admixture, i.e. each individual comes from one discrete
population. The output is the posterior probability that individual i
belongs to population K.

- Admixture, i.e. mixed ancestry may be present. The output is the
posterior mean estimate of the proportion of the individual’s
genome that originated from population K. A good starting point.

- Using prior information, i.e. information from geographic
sampling is available, or samples of known sub-population origin
are used to determine where the others came from.



Data input:


Running STRUCTURE:


File → New Project: enter name of project, directory where the output
should be saved, and data file → Next


Enter number of individuals (180), number of loci (500 SNPs) and value of
missing data (-9) → Next


Data file: show data for individuals in single line


Check box if data contains following rows (leave all blank) → Next


Check box if data contains following columns (check: individual id for each
individual) → Finish → Proceed



Parameter set:


Running STRUCTURE:


Parameter set → New


Run length: burnin (10,000) and number of MCMC steps after burnin
(10,000)


Ancestry model: Admixture


Allele frequency model: Allele frequency independent


Please name parameter set: Enter name for set of parameters


Parameter set → Run


Set number of assumed populations



Each person runs the analysis for one number of populations (K):


Running STRUCTURE:


Run length: burnin (10,000) and number of MCMC steps after burnin
(10,000)


Ancestry model: Admixture


Allele frequency model: Allele frequency independent


Run model for K = 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10


Half of the class runs a burnin/MCMC length of 1,000/1,000 (twice)


Other half runs a burnin/MCMC of 10,000/10,000 (once)
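The practical uses the graphical front end, but the same set of runs could in principle be scripted. The sketch below only builds command lines and does not execute them. It assumes a command-line build of STRUCTURE installed as `structure` that accepts `-K`, `-i` and `-o` to override the number of populations, input file and output file; the file names and the helper function are hypothetical, and the burnin and MCMC lengths are assumed to be set in the program's parameter file.

```python
def structure_commands(k_values, replicates, input_file, output_prefix):
    """Build one command line per (K, replicate) without running anything."""
    commands = []
    for k in k_values:
        for rep in range(1, replicates + 1):
            commands.append([
                "structure",
                "-K", str(k),                             # number of assumed populations
                "-i", input_file,                         # genotype data file
                "-o", f"{output_prefix}_K{k}_rep{rep}",   # output file for this run
            ])
    return commands


# Hypothetical file names; two replicates per K, for K = 1 to 10.
for cmd in structure_commands(range(1, 11), replicates=2,
                              input_file="snp_genotype_data.txt",
                              output_prefix="results/run"):
    print(" ".join(cmd))
```

Keeping K and the replicate number in the output name makes it easier to compare runs for consistency afterwards.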