Machine_Learning_for_Genotype_Phenotype_Correlations


16 Oct 2013


JM
-

http://folding.chmcc.org

1

Machine Learning for Studies of Genotype-Phenotype Correlations


Jarek Meller


Division of Biomedical Informatics,

Children’s Hospital Research Foundation

& Department of Biomedical Engineering, UC


Outline




Motivating story: correlating inputs and outputs

Learning with a teacher (supervised learning)

Model selection, feature selection and generalization

k-Nearest Neighbors, Least Squares regression, Support Vector Machines and some other machine learning approaches

Genotype-phenotype correlations and predictive fingerprints of phenotypes

Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001

Early results for JRA SNP data (D. Glass et al.)


Of statistics and machine learning

t-Test vs. regression or decision trees

Assessment vs. predictive models

[Figure: treatment group mean vs. control group mean; continuous variables vs. discrete (categorical) variables]


Choice of the model, problem representation and feature selection: another simple example

[Figure: labels heights, weight, estrogen, testosterone; groups F vs. M and adults vs. children]


Three phases in supervised learning protocols



Training data: examples with class assignment are given

Learning:

i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type;
ii) adaptive parameters in the model need to be optimized to provide correct classification of training examples (e.g. minimizing the number of misclassified training vectors)

Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial)
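The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-centroid classifier, the function names, and the synthetic two-class data are not from the slides; they stand in for whatever model and data a real study would use.

```python
import random

def nearest_centroid_fit(X, y):
    """Learning phase: the adaptive parameters are one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def nearest_centroid_predict(model, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, model[c]))
    return min(model, key=dist2)

def cross_validate(X, y, k=5, seed=0):
    """Validation phase: mean accuracy over k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(nearest_centroid_predict(model, X[i]) == y[i] for i in fold)
        accuracies.append(hits / len(fold))
    return sum(accuracies) / k

# Training data phase: synthetic examples with class assignment (illustration only).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)] + \
    [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50
cv_accuracy = cross_validate(X, y, k=5)
```

The accuracy reported by `cross_validate` is computed only on examples the model did not see during fitting, which is what separates it from training accuracy.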


Model complexity, training set size and generalization


Examples of machine learning algorithms for classification and regression problems

Linear perceptron, Least Squares

LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)

SVM (Support Vector Machines) (optimal, wide margin linear cuts, kernel non-linear generalizations)

Decision trees (logical rules)

k-NN (k-Nearest Neighbors) (simple non-parametric)

Neural networks (general non-linear models, adaptivity, “artificial brain”)
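As one concrete instance from the list above, k-Nearest Neighbors can be sketched as follows. The toy points, labels, and the name `knn_predict` are illustrative assumptions; squared Euclidean distance is one possible choice of distance measure, not the only one.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    by_distance = sorted(
        ((sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
         for xi, yi in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical): two small clusters labelled "A" and "B".
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["A", "A", "A", "B", "B", "B"]

near_origin = knn_predict(train_X, train_y, [0.5, 0.5])
near_far_cluster = knn_predict(train_X, train_y, [5.5, 5.5])
```

There is no explicit training step: the method is non-parametric in the sense that the stored examples themselves act as the model.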


Decision trees provide a piecewise linear solution

[Figure: decision regions labelled 0, 1, 1, 0]
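A decision tree with axis-parallel cuts can be written as nested logical rules. The sketch below assumes a two-feature problem with an XOR-like layout of classes and hypothetical thresholds `t1` and `t2`; none of these specifics come from the slide.

```python
def tree_predict(x1, x2, t1=0.5, t2=0.5):
    """Two-level tree: first cut on x1, then cut on x2 within each branch.

    Each leaf is a rectangle, so the overall decision boundary is
    piecewise linear (made of axis-parallel segments).
    """
    if x1 < t1:
        return 0 if x2 < t2 else 1
    return 1 if x2 < t2 else 0

# The four quadrants of the unit square receive the labels 0, 1, 1, 0.
quadrant_labels = [
    tree_predict(0.2, 0.2),
    tree_predict(0.2, 0.8),
    tree_predict(0.8, 0.2),
    tree_predict(0.8, 0.8),
]
```

No single straight line separates these classes, but two levels of simple cuts do.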


Support Vector Machines provide a wide margin solution (separating hyperplane)

$w \cdot x + b = 0$
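To make the "wide margin" idea concrete, the following sketch compares two separating hyperplanes by their geometric margin. It does not train an SVM; the toy points and both candidate hyperplanes are assumptions chosen by hand for illustration.

```python
import math

def decision_value(w, b, x):
    """Value of w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any training point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(yi * decision_value(w, b, xi) / norm for xi, yi in zip(X, y))

# Toy, linearly separable data with labels -1 / +1 (hypothetical).
X = [[0, 0], [1, 0], [3, 3], [4, 3]]
y = [-1, -1, 1, 1]

margin_a = geometric_margin([1, 1], -3.5, X, y)  # diagonal cut
margin_b = geometric_margin([1, 0], -2.0, X, y)  # vertical cut
```

Both hyperplanes classify the training points correctly, but the diagonal one leaves more room on either side; an SVM searches for the hyperplane that maximizes this quantity.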


Optimizing adaptable parameters in the model

Find a model $y(x; w)$ that describes the objects of each class as a function of the features $x$ and adaptive parameters (weights) $w$.

Prediction: given $x$ (e.g. LDL=240, age=52, sex=male), assign the class C=? (e.g. if $y(x; w) > 0.5$ then C=1, i.e. likely to suffer from a stroke or heart attack in the next 5 years).

[Figure: $y(x; w)$ plotted between 0 and 1]
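One possible choice of y(x; w) is a logistic function of a weighted sum of the features, with the weights fitted by gradient descent. The sketch below is an assumption, not the model used in the slides; the single rescaled feature and its class labels are synthetic.

```python
import math

def y_model(x, w):
    """y(x; w): logistic function of a weighted sum; w[0] is the bias term."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, labels, lr=0.5, epochs=2000):
    """Optimize w by batch gradient descent on the logistic (cross-entropy) loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, c in zip(X, labels):
            err = y_model(x, w) - c
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_class(x, w, threshold=0.5):
    """Assign C=1 if y(x; w) exceeds the threshold, otherwise C=0."""
    return 1 if y_model(x, w) > threshold else 0

# Synthetic data: one risk factor rescaled to [0, 1] (hypothetical values).
X_toy = [[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]]
c_toy = [0, 0, 0, 0, 1, 1, 1, 1]
w_fit = fit(X_toy, c_toy)
```

In practice features such as LDL and age would be rescaled before fitting, and the threshold of 0.5 could be moved to trade sensitivity against specificity.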


Training accuracy vs. generalization


Case Study: Sporadic Breast Cancer


Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001

Study based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96

Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses

Analysis focused on the genes: COMT (MIM 116790), 22q11.2; CYP1A1 (MIM 108330), 15q22-qter; CYP1B1 (MIM 601771), 2p21-22; GSTM1 (MIM 138350), 1p13.3; and GSTT1 (MIM 600436), 22q11.2

Case-control study (machine learning to the rescue)


Polymorphisms in the genes of interest

Genes involved in oxidative metabolism of estrogens


Genotype representation and identification
of predictive loci (fingerprints): MDR
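The core step of MDR, pooling multilocus genotype cells into "high risk" and "low risk" groups by their case:control ratio, can be sketched roughly as follows. This is a simplified illustration: the function names, the ratio threshold of 1.0, the genotype coding 0/1/2, and the toy data are assumptions, and the published method additionally estimates prediction error by cross-validation rather than on the training data.

```python
from collections import defaultdict
from itertools import combinations, product

def mdr_fit(genotypes, labels, loci, threshold=1.0):
    """Return the set of genotype cells (restricted to `loci`) labelled high risk."""
    cases, controls = defaultdict(int), defaultdict(int)
    for g, c in zip(genotypes, labels):
        cell = tuple(g[i] for i in loci)
        if c == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    cells = set(cases) | set(controls)
    # High risk when the case:control ratio reaches the threshold.
    return {cell for cell in cells if cases[cell] >= threshold * controls[cell]}

def mdr_error(genotypes, labels, loci):
    """Fraction of individuals misclassified by the high/low risk labelling."""
    high_risk = mdr_fit(genotypes, labels, loci)
    wrong = sum(
        (tuple(g[i] for i in loci) in high_risk) != (c == 1)
        for g, c in zip(genotypes, labels)
    )
    return wrong / len(labels)

def best_loci(genotypes, labels, k):
    """Evaluate every k-locus combination and keep the one with lowest error."""
    n = len(genotypes[0])
    return min(combinations(range(n), k),
               key=lambda loci: mdr_error(genotypes, labels, loci))

# Toy data: 3 loci, every genotype once; status depends only on an
# interaction between loci 0 and 2 (purely synthetic).
genotypes_toy = [list(g) for g in product((0, 1, 2), repeat=3)]
labels_toy = [1 if (g[0] == 1) != (g[2] == 1) else 0 for g in genotypes_toy]
selected = best_loci(genotypes_toy, labels_toy, k=2)
```

In this toy set neither locus 0 nor locus 2 is perfectly predictive on its own, while the pair (0, 2) separates cases from controls without error.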


Main effects (individual SNPs and $\chi^2$ test)

For the simulated data shown before:

Genotype   High Risk   Low Risk   Total
AA         27          24         51
Aa         36          38         74
aa         21          24         45
Total      84          86         170

$\chi^2 = \sum (O - E)^2 / E$
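The statistic can be evaluated directly for the genotype counts listed on this slide. The sketch below assumes the usual expected counts E = (row total × column total) / grand total; the function and variable names are illustrative.

```python
# Observed counts from the slide: genotype -> (high risk, low risk).
observed = {"AA": (27, 24), "Aa": (36, 38), "aa": (21, 24)}

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for r, rt in zip(rows, row_totals):
        for o, ct in zip(r, col_totals):
            e = rt * ct / total          # expected count under independence
            stat += (o - e) ** 2 / e
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof

chi2, dof = chi_square(observed)
# chi2 is about 0.41 with 2 degrees of freedom, far below the 5% critical
# value of 5.99, so this single locus shows no detectable main effect.
```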


Genotype/haplotype representations

AABBCC, AaBBCC, AABbCC, aaBBCC, aabbcc, AAbbCC

In general, $3^n$ genotypes for n biallelic loci.

Vector representation: (x, y, z), with entries 0, 1; or x, y, z = 0, 1, 2

In general, highly dimensional representations …
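A small sketch of one such vector representation follows. It assumes that each locus is written as two letters and that the code 0, 1, 2 counts the lowercase alleles at that locus; this coding and the function names are illustrative choices, not taken from the slides.

```python
from itertools import product

def encode(genotype):
    """Map e.g. 'AaBBCC' to [1, 0, 0]: number of lowercase alleles per locus."""
    pairs = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    return [sum(ch.islower() for ch in pair) for pair in pairs]

def all_genotypes(n_loci):
    """Enumerate all 3**n coded genotypes for n biallelic loci."""
    return list(product((0, 1, 2), repeat=n_loci))

encoded = {g: encode(g) for g in
           ["AABBCC", "AaBBCC", "AABbCC", "aaBBCC", "aabbcc", "AAbbCC"]}
n_three_loci = len(all_genotypes(3))   # 3**3 = 27
```

The number of possible genotypes triples with every added locus, which is why the representation quickly becomes high dimensional.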


Multiple loci and more complex fingerprints


Cross-validation results


The role of gene-gene interactions in multifactorial disease: towards even more complex traits …

CYP1A1, GSTM1, and GSTT1 polymorphisms were examined before in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis (Bailey et al. 1998b). None of the enzyme genotypes individually or combined were associated with an increased risk for breast cancer. However, COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated.

Here, the influence of each genotype on disease risk appears to be dependent on the genotypes at each of the other loci: gene-gene interactions.



Complexity of the model and power calculations: as before, adapted from Ritchie et al.

In logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially.

On the other hand, simulation studies by Peduzzi et al. (1996) suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients.

Hosmer and Lemeshow (2000) suggest that the number of parameters $P$ in a logistic-regression model should satisfy $P < \min(n_1, n_0)/10$, where $n_1$ is the number of events of type 1 and $n_0$ is the number of events of type 0.

For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
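The bound is easy to evaluate for any study size. The helper below is a minimal sketch; the function name and the `events_per_parameter` argument are illustrative.

```python
import math

def max_parameters(n1, n0, events_per_parameter=10):
    """Largest integer P satisfying P < min(n1, n0) / events_per_parameter."""
    bound = min(n1, n0) / events_per_parameter
    return math.ceil(bound) - 1

# 200 cases and 200 controls, as in the study discussed above.
p_max = max_parameters(200, 200)   # 19, since P must be strictly below 20
```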


Complexity of the model and power calculations: as before, adapted from Ritchie et al.

The number of regression terms needed to describe the interactions among a subset, k, of n biallelic loci is $\binom{n}{k} \times 2^k$ (Wade 2000).

Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth. The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.

At the same time, MDR involves sampling (evaluation) of different combinations of loci: exponential scaling anyway …
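Both growth rates can be checked numerically. The sketch below evaluates the formula (n choose k) × 2^k for n = 10 and counts the locus combinations an exhaustive MDR search would visit; the value of 1000 SNPs is a hypothetical example, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def interaction_terms(n, k):
    """Regression terms for k-way interactions among n biallelic loci."""
    return comb(n, k) * 2 ** k

def locus_combinations(n, k):
    """Number of k-locus subsets an exhaustive search has to evaluate."""
    return comb(n, k)

# Terms for 10 loci, k = 1..4: 20, 180, 960 and 3360.
terms_10 = [interaction_terms(10, k) for k in range(1, 5)]

# Exhaustive search over a hypothetical panel of 1000 SNPs.
pairs_1000 = locus_combinations(1000, 2)     # 499,500 pairs
triples_1000 = locus_combinations(1000, 3)   # 166,167,000 triples
```

The non-parametric approach removes the need to estimate thousands of coefficients, but the number of combinations to evaluate still grows combinatorially with the number of loci.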


Some conclusions from Ritchie et al.

“If MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, then it will be necessary to develop machine learning strategies to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. We are currently exploring the use of parallel genetic algorithms (Cantú-Paz 2000) as a robust machine learning approach.”

Feature selection and aggregation, inferring a classifier (approximator), validating prediction using cross-validation and independent new data, i.e., applying machine learning approaches …


Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs

Muse and Gibson, 2004


Merging bottom-up and top-down approaches

Main effects and interactions (for limited k-tuples): “statistics-based” approach, collaboration with Jack Collins and his group (NCI)

Selection of loci/SNPs (feature selection) based on the initial (limited) statistical analysis: use haplotype-based Tag SNPs

Combining promising features into a complex pattern (predictive fingerprint): machine learning


Some early results for JRA (joint work with the Rheumatology and Human Genetics Divisions)

771 SNPs from chromosome 2 and 765 from chromosome 7 (regions around previously implicated loci with high LOD scores for associations with JRA subtypes)

Haplotype blocks identified and representative SNPs derived

Feature selection based on $\chi^2$ statistics and other measures

Training and assessment using cross-validation on a set of about 200 data points (in several classes), case-control type of study, multiple machine learning methods applied

No significant correlation of individual SNPs with clinical classes observed

Top 20 SNPs, when combined into a classifier, yield classification accuracy of about 70% for the problem of distinguishing between joint erosion and lack thereof (for affected individuals, baseline 62%)

Much less success for the classification into JRA subtypes, i.e., it appears that SNPs included in the study cannot be used to predict if a person is likely to have a specific (clinically defined) disease subtype (e.g., poly vs. pauci)
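The feature-selection step can be illustrated with a chi-square ranking of SNPs against class labels. This is only a sketch under assumptions: the 0/1/2 genotype coding, the function names, and the tiny synthetic data set are hypothetical and unrelated to the JRA data, and the baseline is taken here to mean the majority-class rate.

```python
from collections import Counter

def chi2_score(feature, labels):
    """Chi-square statistic of the table (feature value) x (class label)."""
    n = len(labels)
    f_counts, l_counts = Counter(feature), Counter(labels)
    cell = Counter(zip(feature, labels))
    score = 0.0
    for f, fc in f_counts.items():
        for l, lc in l_counts.items():
            expected = fc * lc / n
            score += (cell[(f, l)] - expected) ** 2 / expected
    return score

def top_k_features(X, labels, k):
    """Indices of the k columns of X with the highest chi-square score."""
    n_features = len(X[0])
    scores = [chi2_score([row[j] for row in X], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Synthetic example: 8 individuals, 3 SNPs coded 0/1/2; only SNP 1 tracks the class.
X_snps = [[0, 0, 1], [1, 0, 0], [2, 0, 1], [0, 0, 0],
          [1, 2, 1], [2, 2, 0], [0, 2, 1], [1, 2, 0]]
y_class = [0, 0, 0, 0, 1, 1, 1, 1]
ranked = top_k_features(X_snps, y_class, k=2)
baseline = baseline_accuracy(y_class)
```

When such a ranking is combined with cross-validation, it should be recomputed inside each training fold; selecting features on the full data set first tends to inflate the estimated accuracy.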


Haplotype-based tag SNPs on chr2 vs. joint erosion …


Next steps …


Use larger data sets with careful selection of
informative SNPs using prior knowledge and feature
selection algorithms


Use expression profiling to define “molecular” phenotypes that can serve as classes, and find predictive patterns in SNPs


Validate, validate, validate …