GMS6181 Genomics and Bioinformatics
Week 5
Population structure
Simple case-control test for association
Effect of population structure in creating spurious associations
Measuring population structure
Case-Control Tests
• Identify unrelated affected (case) and unaffected (control) individuals.
• Genotype everyone for the markers of interest.
• Compare groups.
• If marker alleles are at different frequencies in the “case” and “control” groups, there is an indication of association.
Case-Control Tests: Principle
• A mutation that increases disease susceptibility is expected to be at a higher frequency among affected individuals (cases) than among unaffected individuals (controls).
• This can be tested with a simple chi-square test.
Consider two populations…
•
Population affected:
•
Population not affected:
Mix the two populations…
•
Mixed population
•
Their progeny
After a few generations…
After many, many generations…
In many different individuals…
Resolution…
[Figure: a gene flanked by Markers A through H]
[Figure: SNP effect on disease susceptibility]
Testing for association: Case-Control

Observed counts
              1           2           Total
Affected      n_1aff      n_2aff      n_aff
Unaffected    n_1unaff    n_2unaff    n_unaff
Total         n_1         n_2         N

Expected counts
              1                    2
Affected      n_1 × n_aff / N      n_2 × n_aff / N
Unaffected    n_1 × n_unaff / N    n_2 × n_unaff / N

$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$
Case control: Pima-Papago Indians

                   Gm(3;5,13,14) locus
Diabetes      Present     Absent     Totals
Present            23       1343       1366
Absent            270       3284       3554
Totals            293       4627       4920

Knowler et al. [1988] Am J Hum Genet. 43:520-6
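As a rough illustration, the test statistic for the table above can be computed directly from the observed counts. The function name `chi_square_2x2` is an assumption made for this sketch, not part of the course material. It computes the uncorrected Pearson statistic, so the result can differ somewhat from values obtained with a continuity correction or another variant of the test.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a table of observed counts.

    Expected count for each cell = row total * column total / N.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            x2 += (obs - exp) ** 2 / exp
    return x2


# Rows: diabetes present / absent; columns: Gm present / absent.
pima_all = [[23, 1343], [270, 3284]]
x2_all = chi_square_2x2(pima_all)
print(f"X^2 = {x2_all:.2f}")
```

With 1 degree of freedom, a statistic above 3.84 is significant at the 0.05 level.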
$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$
$X^2 = ?$
• What does the evidence tell us?
Original Pima-Papago Population
Chi-square = 55.27
Gm frequency in diseased/healthy individuals
              Diseased    Healthy
Gm present    2%          8%
Gm absent     98%         92%
Case-control and population structure
• A problem occurs when both disease frequencies and allele frequencies differ among subpopulations.
• Suppose cases and controls are sampled at random from a population with underlying subpopulations.
• If the disease of interest is at higher frequency in one subpopulation, then marker alleles that are at higher frequency in that subpopulation, even if unlinked to the trait, will be detected as associated with it.
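The effect described in these bullets can be reproduced with a small numerical sketch. All counts below are hypothetical and chosen so that, within each subpopulation, the marker is exactly independent of disease status. Pooling the two subpopulations still produces a large chi-square value, because both disease prevalence and marker frequency differ between them.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


def add_tables(t1, t2):
    """Cell-by-cell sum of two tables (pooling two samples)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(t1, t2)]


# Hypothetical counts. Rows: affected / unaffected; columns: marker present / absent.
subpop_a = [[40, 360], [60, 540]]   # 40% affected, marker in 10%, no association
subpop_b = [[60, 40], [540, 360]]   # 10% affected, marker in 60%, no association
pooled = add_tables(subpop_a, subpop_b)

for name, table in [("A", subpop_a), ("B", subpop_b), ("pooled", pooled)]:
    print(name, round(chi_square_2x2(table), 2))
```

Analysing each subpopulation separately gives a statistic of zero, while the pooled sample suggests a strong (spurious) association.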
Observed counts (cells to fill in)
              1           2           Total
Affected      n_1aff      n_2aff      n_aff
Unaffected    n_1unaff    n_2unaff    n_unaff
Total         n_1         n_2         N

Expected counts (cells to fill in)
              1                    2
Affected      n_1 × n_aff / N      n_2 × n_aff / N
Unaffected    n_1 × n_unaff / N    n_2 × n_unaff / N

$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$
Case control – European descendants excluded: Pima-Papago Indians

                   Gm(3;5,13,14) locus
Diabetes      Present     Absent     Totals
Present            10       1058       1068
Absent              7        706        713
Totals             17       1764       1781

Knowler et al. [1988] Am J Hum Genet. 43:520-6
$X^2 = \sum_{\text{cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$
$X^2 = ?$
• What does the evidence tell us?
Full-blooded Pima-Papago Indians
Chi-square = 0.001
Gm frequency in diseased/healthy individuals
              Diseased    Healthy
Gm present    1%          1%
Gm absent     99%         99%
Population structure
Random genetic drift, mutation, migration, selection, and other forces that change gene frequencies within subpopulations can lead to population structure, i.e. unequal gene and genotype frequencies among subpopulations, with variable levels of similarity between them.
Knowing whether there is population structure, and how similar the subpopulations are, is relevant in several fields. For instance, when doing any type of genetic test you need to be aware of any structure, as it may limit the scope of the test's conclusions or simply invalidate them (as in the Pima-Papago example above).
How can we tell that there is population structure?
One measure is $F_{ST}$.
• Population subdivision causes a reduction in heterozygosity.
• When measuring population structure we are usually measuring the differences between subpopulations that originated from an ancestral population that gave rise to all individuals of one species.
• The reduction in heterozygosity can be used as one measure of the population structure.
From Hardy-Weinberg equilibrium we know that:
$f(A_1A_1) = p^2$
$f(A_1A_2) = H = 2pq$
$f(A_2A_2) = q^2$
[Figure: a founder population with heterozygosity $H_T$ gives rise to subpopulations Pop. 1 through Pop. 8, each with $H_S < H_T$]
The proportion of heterozygotes observed in the subpopulations ($H_S$), relative to the proportion of heterozygotes expected in the total population ($H_T$), is quantified by $F_{ST}$, the fixation index.
The fixation index measures the reduction in the proportion of heterozygotes in the Subpopulations relative to the Total population, and is estimated by:
$F_{ST} = (H_T - H_S) / H_T$
The heterozygosity of subpopulations is smaller than if the subpopulations were combined into a single, large, randomly mating population. $F_{ST}$ measures the extent of the reduction in heterozygosity.
Another useful interpretation is that $F_{ST}$ measures the amount of variation in the whole population that is due to genetic differentiation among subpopulations.
Let’s look at some hypothetical and extreme data:

              Pop. 1    Pop. 2    Pop. 3    Pop. 4    Mean
q             0.50      0.50      0.50      0.50      0.50 = $\bar{q}$
2q(1 - q)     0.50      0.50      0.50      0.50      0.50 = $H_S$

$H_T = 2\bar{q}(1 - \bar{q}) = 0.50$
$F_{ST} = (H_T - H_S) / H_T = (0.50 - 0.50) / 0.50 = 0$
The entire population (Pop. 1, 2, 3 and 4 combined) is genetically variable for the locus above. However, there is no genetic variation among populations ($F_{ST} = 0$).
Let’s look at some hypothetical and extreme data:

              Pop. 1    Pop. 2    Pop. 3    Pop. 4    Mean
q             1.00      0.00      0.00      0.00      0.25 = $\bar{q}$
2q(1 - q)     0.00      0.00      0.00      0.00      0.00 = $H_S$

$H_T = 2\bar{q}(1 - \bar{q}) = 0.375$
$F_{ST} = (H_T - H_S) / H_T = (0.375 - 0.00) / 0.375 = 1$
Populations (Pop. 1, 2, 3 and 4 combined) are genetically variable for the locus above. They differ completely for gene frequency ($F_{ST} = 1$).
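The two extreme cases above can be checked with a few lines of code. This is a minimal sketch for a single biallelic locus, assuming equally weighted subpopulations; the function name `fst` is chosen here for illustration only.

```python
def fst(q_values):
    """F_ST for one biallelic locus, given the allele frequency q in each
    subpopulation (subpopulations weighted equally)."""
    k = len(q_values)
    q_bar = sum(q_values) / k                           # mean allele frequency
    h_s = sum(2 * q * (1 - q) for q in q_values) / k    # mean within-subpopulation heterozygosity
    h_t = 2 * q_bar * (1 - q_bar)                       # expected heterozygosity of the total population
    if h_t == 0:
        raise ValueError("locus is fixed in the total population; F_ST is undefined")
    return (h_t - h_s) / h_t


print(fst([0.50, 0.50, 0.50, 0.50]))  # 0.0: no differentiation among populations
print(fst([1.00, 0.00, 0.00, 0.00]))  # 1.0: complete differentiation
print(fst([0.30, 0.70]))              # an intermediate case
```

Real data rarely sit at either extreme; the third call shows an intermediate value for two subpopulations with moderately different allele frequencies.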
A given $F_{ST}$ estimate has the following interpretation. If a set of subpopulations has, for instance, an $F_{ST}$ of 0.1, this means that of all the genetic variation observed in all populations combined:
• 10% is due to variation among the subpopulations, and
• 90% is due to variation within subpopulations.
In practice, population structure studies generally sample many loci to obtain an averaged estimate. Single loci may be influenced by sampling effects, by selection when linked to genes associated with fitness, and by other factors. Loci that may be under any type of selection should also be avoided.
Great! But…
• Prior knowledge is required to define the subset of individuals that compose a hypothetical subpopulation.
• The fixation index only tells you whether there is population structure; the status of each individual remains unknown.
• It doesn’t allow for admixture, i.e. an individual is assumed to come from either population A or B; if it is recently admixed, it is impossible to define how much of its genome is derived from each of the two populations.
• There is no option for including prior information in the analysis.
• And many other issues make it less than ideal for our purposes.
Correction for population structure – Structured Association
Concept:
1. Estimate the population structure from random markers
2. Use these estimates to control for population structure statistically in the association analyses (more on this on Thursday and next week).
Pritchard JK, Stephens M, Donnelly P. 2000. Inference of population structure using multilocus genotype data. Genetics 155: 945-59.
Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. 2000. Association mapping in structured populations. American Journal of Human Genetics 67: 170-81.
Falush D, Stephens M, Pritchard JK. 2003. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164: 1567-87.
Accounting for population structure
• Model-based clustering methods
  – assume that the data represent a random sample drawn from a parametric model; inferences for the parameters are made by trying to model the ideal cluster membership of each sample
  – each cluster is defined by a set of individuals that minimizes the deviation from HWE
  – allow for inclusion of prior information
By far the most popular approach has been developed by Pritchard:
http://pritch.bsd.uchicago.edu/structure.html
Inferring Population Structure
• Assume an admixed or independent population with K contributing populations.
• Each of the K founding populations was in equilibrium (HWE, no LD).
• In this generation, each allele copy originated in one of the founding populations.
• We want to estimate the probability that the alleles in each individual originated in population k: the Q vector.
Variables
• $q^{(i)}_k$ = proportion of individual i’s genome that originated in population k – Q
• $x^{(i,a)}_l$ = genotype for individual i at locus l (a = 1, 2) – X
• $z^{(i,a)}_l$ = population of origin of allele copy $x^{(i,a)}_l$ – Z
• $p_{klj}$ = frequency of allele j at locus l in population k – P
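To make the roles of Q, P, Z and X concrete, the sketch below simulates data forward from the model: each allele copy first draws its population of origin from the individual's Q vector, and then draws an allele from that population's frequencies. This is only an illustration of the variables under the assumptions stated on the slides (independent loci, two allele copies per locus). The function name and the toy values are hypothetical, and this is not the inference algorithm used by STRUCTURE.

```python
import random


def simulate_genotypes(Q, P, rng):
    """Draw Z (allele origins) and X (genotypes) under the admixture model.

    Q[i][k]    : proportion of individual i's genome from population k
    P[k][l][j] : frequency of allele j at locus l in population k
    Returns (Z, X), indexed as Z[i][l][a] and X[i][l][a] for copies a = 0, 1.
    """
    n_pops = len(P)
    n_loci = len(P[0])
    Z, X = [], []
    for q_i in Q:
        z_i, x_i = [], []
        for l in range(n_loci):
            # Population of origin for each of the two allele copies.
            z_l = rng.choices(range(n_pops), weights=q_i, k=2)
            # Allele drawn from the frequencies of the population of origin.
            x_l = [
                rng.choices(range(len(P[k][l])), weights=P[k][l], k=1)[0]
                for k in z_l
            ]
            z_i.append(z_l)
            x_i.append(x_l)
        Z.append(z_i)
        X.append(x_i)
    return Z, X


# Toy example (hypothetical values): 2 populations, 3 loci, 2 alleles per locus.
# Population 0 is fixed for allele 0 and population 1 for allele 1 at every locus.
P = [[[1.0, 0.0]] * 3, [[0.0, 1.0]] * 3]
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # two "pure" individuals and one admixed
Z, X = simulate_genotypes(Q, P, random.Random(1))
for q_i, x_i in zip(Q, X):
    print(q_i, x_i)
```

Inference runs in the opposite direction: X is observed, and Z, P and Q are estimated from it.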
Inferring Population Structure
• Bayesian approach
• Posterior distribution to be inferred:
  – $\Pr(Z, P, Q \mid X)$
• Priors:
  – $\Pr(Z)$, $\Pr(P)$, $\Pr(Q)$
Estimating K
Problem: it is difficult to estimate allele frequencies, admixture proportions, and the number of groups simultaneously.
Suggested by Pritchard:
Phase 1: Define the number of populations (K) to test.
Phase 2: Examine the clustering of individuals to evaluate the appropriateness of the selected number of populations.
For his simulated data, Pritchard suggested running > $10^6$ iterations, with a burnin of 30,000.
Example 1:
• Eucalyptus (6 species)
• Nine microsatellites
Example 2:
• Loblolly pine
• ~500 genotypes
• 19 microsatellites
With K = 5, spatial patterns by cluster look like this:
[Figure: spatial maps for Cluster 1 through Cluster 5]
Week 5 – Practical exercise
Population structure analysis – STRUCTURE
Input data file:
• Open file SNP genotype data.txt;
• Components of the file:
Rows:
  marker names (optional)
  inter-marker distances (optional)
  individual data (required)
  phase information (optional)
Columns:
  Label (individual information – optional)
  PopData (prior population information – optional)
  PopFlag (use/not use pop. data – optional), Phenotype (optional), extra columns (optional)
  Genotype data (individual genotypes – required)
Example rows:
NA06985 1 1 1 3 3 2
NA06991 1 1 1 3 3 2
NA06993 1 1 1 3 3 2
NA06994 1 1 1 3 3 2
NA07000 1 1 1 3 3 2
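Before loading a file into STRUCTURE it can help to check that every individual has the same number of genotype fields. The sketch below reads rows like the ones shown above, assuming the simplest layout used in this exercise: one line per individual, a label in the first column, then whitespace-separated integer allele codes, with -9 for missing data. The function name is hypothetical, and files with the optional rows or columns would need extra handling.

```python
def parse_genotype_line(line, missing=-9):
    """Split one data row into (label, allele codes).

    Missing values (coded as `missing`) are returned as None.
    """
    fields = line.split()
    label = fields[0]
    alleles = [None if int(v) == missing else int(v) for v in fields[1:]]
    return label, alleles


rows = [
    "NA06985 1 1 1 3 3 2",
    "NA06991 1 1 1 3 3 2",
    "SAMPLE_X 1 -9 1 3 -9 2",   # hypothetical row with missing data
]
parsed = [parse_genotype_line(r) for r in rows]

# Every individual should have the same number of allele fields.
field_counts = {len(alleles) for _, alleles in parsed}
print(parsed)
print("consistent:", len(field_counts) == 1)
```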
Modeling decisions:
• How long to run the program
  – burnin length (how long to run before starting to collect data): 10,000-100,000 is adequate, but verify summary statistics
  – run length: do several runs at each K, at different lengths, and verify that the answers are consistent across runs
    – 10,000-100,000 is adequate for estimating the number of subpopulations
    – 1,000,000 or more for good estimates of Pr(X|K)
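Once ln Pr(X|K) has been estimated for several values of K, one simple way to compare them is to convert the estimates into approximate posterior probabilities of K, assuming a uniform prior over the values tested. This follows the spirit of the ad hoc approximation discussed by Pritchard et al. (2000) and should be treated as a rough guide only. The numbers below are illustrative, not results from the class data set.

```python
import math


def posterior_over_k(ln_prob_by_k):
    """Approximate Pr(K | X) from estimates of ln Pr(X | K),
    assuming a uniform prior over the values of K supplied."""
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    m = max(ln_prob_by_k.values())
    weights = {k: math.exp(v - m) for k, v in ln_prob_by_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Illustrative ln Pr(X|K) estimates (hypothetical values).
ln_prob = {1: -4356.0, 2: -3983.0, 3: -3982.0, 4: -3983.0, 5: -4006.0}
posterior = posterior_over_k(ln_prob)
for k in sorted(posterior):
    print(k, round(posterior[k], 3))
```

Because small differences in ln Pr(X|K) translate into large differences in probability, the estimates need to be stable across runs before this comparison means much.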
Modeling decisions:
• Ancestry model
  – No admixture: each individual comes from one discrete population; the output is the posterior probability that individual i belongs to population K.
  – Admixture: individuals may have mixed ancestry; the output is the posterior mean estimate of the proportion of the individual’s genome that originated from population K. A good starting point.
  – Using prior information: information from geographic sampling is available, or you may want to use samples of known subpopulation origin to define where others came from.
Running STRUCTURE:
• Data input:
File → New Project: enter the name of the project, the directory where the output should be saved, and the data file → Next
Enter the number of individuals (180), the number of loci (500 SNPs), and the value for missing data (-9) → Next
Data file: show data for individuals in a single line
Check box if data contains following rows (leave all blank) → Next
Check box if data contains following columns (check: individual ID for each individual) → Finish → Proceed
Running STRUCTURE:
• Parameter set:
Parameter set → New
Run length: burnin (10,000) and number of MCMC steps after burnin (10,000)
Ancestry model: Admixture
Allele frequency model: Allele frequency independent
Please name parameter set: enter a name for the set of parameters
Parameter set → Run
Set the number of assumed populations
Running STRUCTURE:
• Each person runs the analysis for one number of populations (K):
Run length: burnin (10,000) and number of MCMC steps after burnin (10,000)
Ancestry model: Admixture
Allele frequency model: Allele frequency independent
Run the model for K = 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10
Half of the class runs a burnin/MCMC length of 1,000/1,000 (twice)
The other half runs a burnin/MCMC length of 10,000/10,000 (once)