JM

http://folding.chmcc.org
Machine Learning for Studies of Genotype-Phenotype Correlations
Jarek Meller
Division of Biomedical Informatics,
Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC
Outline
Motivating story: correlating inputs and outputs
Learning with a teacher (supervised learning)
Model selection, feature selection and generalization
k-Nearest Neighbors, Least Squares regression, Support Vector Machines and some other machine learning approaches
Genotype-phenotype correlations and predictive fingerprints of phenotypes
Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001
Early results for JRA SNP data (D. Glass et al.)
Of statistics and machine learning
t-Test vs. regression or decision trees
Assessment vs. predictive models
[Figure: treatment group mean vs. control group mean]
Continuous variables
Discrete (categorical) variables
Choice of the model, problem representation and feature
selection: another simple example
[Figure: axis labels height, weight, estrogen, testosterone; class labels F, M, adults, children]
Three phases in supervised learning protocols
Training data: examples with class assignment are given
Learning:
i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type;
ii) adaptive parameters in the model need to be optimized to provide correct classification of training examples (e.g. minimizing the number of misclassified training vectors)
Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial)
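The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classifier is a plain k-Nearest Neighbors vote with squared Euclidean distance, the data are synthetic two-dimensional points, and the function names `knn_predict` and `cross_validate` are hypothetical, not taken from any particular library.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Assumed distance measure: squared Euclidean distance.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cross_validate(xs, ys, n_folds=5, k=3, seed=0):
    """Return the mean accuracy on held-out folds (validation phase)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        correct = sum(knn_predict(tx, ty, xs[i], k) == ys[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n_folds


# Training data: synthetic examples with class assignment (two separated clusters).
rng = random.Random(1)
xs = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(20)]
xs += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(20)]
ys = [0] * 20 + [1] * 20

cv_accuracy = cross_validate(xs, ys)
print(f"5-fold cross-validated accuracy: {cv_accuracy:.2f}")
```

Each example is classified only by a model that never saw it during training, so the reported number estimates generalization rather than training accuracy.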
Model complexity, training set size and generalization
Examples of machine learning algorithms for
classification and regression problems
Linear perceptron, Least Squares
LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)
SVM (Support Vector Machines) (optimal, wide-margin linear cuts, kernel non-linear generalizations)
Decision trees (logical rules)
k-NN (k-Nearest Neighbors) (simple non-parametric)
Neural networks (general non-linear models, adaptivity, “artificial brain”)
Decision trees provide a piecewise linear solution
[Figure: regions labeled with classes 0 and 1]
Support Vector Machines provide a wide-margin solution (separating hyperplane)
$\mathbf{w} \cdot \mathbf{x} + b = 0$
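To make the hyperplane concrete, the sketch below evaluates the decision rule sign(w·x + b) and the margin width 2/||w|| for a hand-picked hyperplane. The weights `w` and offset `b` are hypothetical values chosen for illustration; the training step that an SVM uses to find the widest margin is not shown.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; zero exactly on the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane, not the result of any training run.
w, b = (1.0, 1.0), -3.0

print(classify(w, b, (3.0, 2.0)))   # point on the positive side
print(classify(w, b, (0.0, 1.0)))   # point on the negative side
print(round(margin_width(w), 4))
```

A smaller norm of w means a wider margin, which is why SVM training minimizes ||w|| subject to all training points lying outside the margin.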
Optimizing adaptable parameters in the model
Find a model $y(\mathbf{x}; \mathbf{w})$ that describes the objects of each class as a function of the features and adaptive parameters (weights) $\mathbf{w}$.
Prediction: given $\mathbf{x}$ (e.g. LDL=240, age=52, sex=male), assign the class C=? (e.g. if $y(\mathbf{x}; \mathbf{w}) > 0.5$ then C=1, i.e. likely to suffer from a stroke or heart attack in the next 5 years)
[Figure: $y(\mathbf{x}; \mathbf{w})$ on a scale from 0 to 1]
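One possible form of such a model is a logistic function of a weighted sum of the features. The sketch below assumes that form (the slide does not name a specific model), and the weights are made-up numbers with no clinical meaning; in practice they would be fitted to training data.

```python
import math


def y(x, w):
    """Logistic model: w[0] is the bias, w[1:] are the feature weights."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, w, threshold=0.5):
    """Assign C=1 when the model output exceeds the threshold, else C=0."""
    return 1 if y(x, w) > threshold else 0


# Hypothetical weights (bias, LDL, age, sex with male=1); illustration only.
w = (-7.5, 0.02, 0.05, 0.5)

x = (240, 52, 1)  # LDL=240, age=52, sex=male
print(round(y(x, w), 3), predict_class(x, w))
```

Optimizing the adaptable parameters then means choosing w so that the predicted classes agree with the known classes of the training examples as often as possible.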
Training accuracy vs. generalization
Case Study: Sporadic Breast Cancer
Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001
Study based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96
Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses
Analysis focused on the genes: COMT (MIM 116790), 22q11.2; CYP1A1 (MIM 108330), 15q22-qter; CYP1B1 (MIM 601771), 2p21-22; GSTM1 (MIM 138350), 1p13.3; and GSTT1 (MIM 600436), 22q11.2
Case-control study (machine learning to the rescue)
Polymorphisms in the genes of interest
Genes involved in oxidative metabolism of estrogens
Genotype representation and identification
of predictive loci (fingerprints): MDR
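A simplified sketch of the MDR idea may help here. It assumes genotypes coded as 0, 1, 2 per locus and labels coded as 1 (case) or 0 (control): each multilocus genotype cell is labelled high risk when its case/control ratio meets a threshold, which reduces the multilocus genotype to a single high/low attribute. The function names are hypothetical, the data are synthetic, and the cross-validation and permutation testing used in the actual method are omitted.

```python
from collections import defaultdict
from itertools import combinations, product


def fit_mdr(genotypes, labels, loci, threshold=1.0):
    """Return the multilocus genotype cells labelled high risk."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for g, label in zip(genotypes, labels):
        counts[tuple(g[i] for i in loci)][label] += 1
    return {
        cell
        for cell, (controls, cases) in counts.items()
        if cases >= threshold * controls
    }


def mdr_error(genotypes, labels, loci, high_risk):
    """Fraction of individuals misclassified by the high/low-risk rule."""
    wrong = sum(
        (tuple(g[i] for i in loci) in high_risk) != bool(label)
        for g, label in zip(genotypes, labels)
    )
    return wrong / len(labels)


def best_combination(genotypes, labels, k):
    """Evaluate every k-locus combination; return (error, loci) of the best."""
    n_loci = len(genotypes[0])
    scored = []
    for loci in combinations(range(n_loci), k):
        high_risk = fit_mdr(genotypes, labels, loci)
        scored.append((mdr_error(genotypes, labels, loci, high_risk), loci))
    return min(scored)


# Synthetic data: risk depends on an interaction between loci 0 and 1 only
# (exactly one of the two is heterozygous); locus 2 is irrelevant.
genotypes = list(product(range(3), repeat=3))
labels = [int((g[0] == 1) != (g[1] == 1)) for g in genotypes]

print(best_combination(genotypes, labels, 2))
```

The error reported here is a training error; selecting among combinations by training error alone would overfit, which is why the method relies on cross-validation.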
Main effects (individual SNPs and $\chi^2$ test)
For the simulated data shown before:
Genotype   High Risk   Low Risk   Total
AA         27          24         51
Aa         36          38         74
aa         21          24         45
Total      84          86         170
$\chi^2 = \sum (O - E)^2 / E$
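As a worked example, the statistic can be computed directly from the table above. The sketch uses only the standard library; the p-value line relies on the fact that for 2 degrees of freedom the chi-squared survival function is exactly exp(-x/2), so it is not a general-purpose p-value routine.

```python
import math

# Observed counts per genotype: (high risk, low risk), from the table.
observed = {
    "AA": (27, 24),
    "Aa": (36, 38),
    "aa": (21, 24),
}


def chi_square(table):
    """Return the chi-squared statistic and degrees of freedom."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


stat, dof = chi_square(observed)
# Closed form valid only for dof == 2.
p_value = math.exp(-stat / 2) if dof == 2 else None
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
```

The statistic comes out at about 0.41 with 2 degrees of freedom (p of roughly 0.82), so this single locus shows no significant main effect on its own.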
Genotype/haplotype representations
AABBCC, AaBBCC, AABbCC, aaBBCC, aabbcc, AAbbCC
In general, $3^n$ genotypes for $n$ biallelic loci.
Vector representation: $(x, y, z)$ with $x, y, z \in \{0, 1, 2\}$; class label in $\{0, 1\}$
In general, highly dimensional representations …
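One way to obtain such vectors is to count, per locus, the copies of one allele. The sketch below assumes the lowercase letter marks the allele being counted and that each locus occupies two consecutive characters; `encode_genotype` is a hypothetical helper name.

```python
def encode_genotype(genotype):
    """Map e.g. 'AaBBCC' to (1, 0, 0): lowercase-allele count per locus."""
    if len(genotype) % 2 != 0:
        raise ValueError("expected two allele characters per locus")
    return tuple(
        sum(allele.islower() for allele in genotype[i:i + 2])
        for i in range(0, len(genotype), 2)
    )


def n_genotypes(n_loci):
    """Number of distinct multilocus genotypes for n biallelic loci."""
    return 3 ** n_loci


for g in ["AABBCC", "AaBBCC", "AABbCC", "aaBBCC", "aabbcc", "AAbbCC"]:
    print(g, encode_genotype(g))

print(n_genotypes(3), n_genotypes(10))
```

With 10 loci there are already 59,049 possible genotype vectors, far more than the few hundred individuals in a typical case-control study.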
Multiple loci and more complex fingerprints
Cross-validation results
The role of gene-gene interactions in multifactorial disease: towards even more complex traits …
CYP1A1, GSTM1, and GSTT1 polymorphisms were examined before in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis (Bailey et al. 1998b). None of the enzyme genotypes individually or combined were associated with an increased risk for breast cancer. However, COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated.
Here, the influence of each genotype on disease risk appears to be dependent on the genotypes at each of the other loci: gene-gene interactions.
Complexity of the model and power calculations: as before, adapted from Ritchie et al.
In logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially.
On the other hand, simulation studies by Peduzzi et al. (1996) suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients.
Hosmer and Lemeshow (2000) suggest that logistic-regression models should contain no more than $P < \min(n_1, n_0)/10$ parameters, where $n_1$ is the number of events of type 1 and $n_0$ is the number of events of type 0.
For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
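The bound is easy to check numerically. The helper below is a hypothetical one-liner that returns the largest integer P satisfying P < min(n1, n0)/10, using the strict inequality as stated above.

```python
import math


def max_parameters(n1, n0, events_per_parameter=10):
    """Largest integer P with P < min(n1, n0) / events_per_parameter."""
    bound = min(n1, n0) / events_per_parameter
    return math.ceil(bound) - 1


print(max_parameters(200, 200))  # 200 cases, 200 controls
```

With 200 events of each type the bound is 20, and the strict inequality leaves 19 parameters.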
Complexity of the model and power calculations: as before, adapted from Ritchie et al.
The number of regression terms needed to describe the interactions among a subset, $k$, of $n$ biallelic loci is $\binom{n}{k} \times 2^k$ (Wade 2000).
Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth. The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.
At the same time, MDR involves sampling (evaluation) of different combinations of loci, so the scaling is exponential anyway …
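Both growth rates can be tabulated with the standard library (Python 3.8+ for `math.comb`). The sketch evaluates the (n choose k) × 2^k term count for n = 10 and, separately, the number of k-locus combinations an exhaustive search would have to evaluate; the function names are hypothetical.

```python
from math import comb


def interaction_terms(n_loci, k):
    """Regression terms for all k-way interactions among n biallelic loci."""
    return comb(n_loci, k) * 2 ** k


def locus_combinations(n_loci, k):
    """Number of k-locus subsets an exhaustive search must evaluate."""
    return comb(n_loci, k)


# k-way term counts for 10 genes: k = 1 gives the main effects.
counts = {k: interaction_terms(10, k) for k in range(1, 5)}
print(counts)

# Exhaustive three-locus search over a hypothetical scan of 1,000 SNPs.
print(locus_combinations(1000, 3))
```

For 10 genes the formula gives 20, 180, 960 and 3,360 terms for k = 1 to 4, while a three-locus search over 1,000 SNPs already involves more than 166 million combinations.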
Some conclusions from Ritchie et al.
“If MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, then it will be necessary to develop machine learning strategies to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. We are currently exploring the use of parallel genetic algorithms (Cantú-Paz 2000) as a robust machine learning approach.”
Feature selection and aggregation, inferring a classifier (approximator), validating prediction using cross-validation and independent new data, i.e., applying machine learning approaches …
Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs
Muse and Gibson, 2004
Merging bottom-up and top-down approaches
Main effects and interactions (for limited k-tuples): “statistics-based” approach, collaboration with Jack Collins and his group (NCI)
Selection of loci/SNPs (feature selection) based on the initial (limited) statistical analysis: use haplotype-based Tag SNPs
Combining promising features into a complex pattern (predictive fingerprint): machine learning
Some early results for JRA (joint work with the
Rheumatology and Human Genetics Divisions)
771 SNPs from chromosome 2 and 765 from chromosome 7, respectively (regions around previously implicated loci with high LOD scores for associations with JRA subtypes)
Haplotype blocks identified and representative SNPs derived
Feature selection based on $\chi^2$ statistics and other measures
Training and assessment using cross-validation on a set of about 200 data points (in several classes), case-control type of study, multiple machine learning methods applied
No significant correlation of individual SNPs with clinical classes observed
Top 20 SNPs, when combined into a classifier, yield classification accuracy of about 70% for the problem of distinguishing between joint erosion and lack thereof (for affected individuals, baseline 62%)
Much less success for the classification into JRA subtypes, i.e., it
appears that SNPs included in the study cannot be used to predict if a
person is likely to have a specific (clinically defined) disease subtype
(e.g., poly vs. pauci)
Haplotype-based tag SNPs on chr2 vs. joint erosion …
Next steps …
Use larger data sets with careful selection of
informative SNPs using prior knowledge and feature
selection algorithms
Use expression profiling to define “molecular”
phenotypes to define classes and find predictive
patterns in SNPs
Validate, validate, validate …