
Use Case: Using Random Forests on Gordon for an Exploration of Variable Selection vs. Statistical Inference in GWAS

Paul Rodriguez, PhD, SDSC

2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Outline


Classification Tree to Random Forests


Variable Selection and Variable Importance


Penalized Regression


Parallelizing data/tasks


GWAS explorations



Recall Classification Tree


Trees use flexible combinations of features (a form of non-linearity)

Decision node: split on a feature

Leaf node: label the data (or estimate a value)

Each step chooses the best variable for splitting the data



Recall Idea: Dividing and conquering

[Figure: teams plotted by Number of Team HRs (2012) vs. ERA, recursively split into regions]

After all splits, classification or regression can be used.



Recursive Partition in R


Essential R:

  install.packages("rpart")
  library("rpart")
  D = as.data.frame(cbind(Y, X))

  rp_res = rpart(Y ~ HR + ERA, D, method = "class",
                 control = rpart.control(minbucket = 2, cp = 0.01, maxcompete = 0))

  # recursive partition;
  # options control tree building (number of leaf-node members,
  #   complexity parameter that limits splits)

See also:
  predict(rp_res, …)
  plot(rp_res, …)
  print(rp_res)
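
A minimal usage sketch of those follow-up calls (assuming the D and rp_res objects built above; the type argument and the confusion table are illustrative):

  # a minimal sketch, assuming D and rp_res from above
  pred = predict(rp_res, newdata = D, type = "class")   # predicted class for each row
  table(pred, D$Y)                                      # confusion table on the training data
  plot(rp_res); text(rp_res)                            # draw the tree and label its splits
  print(rp_res)                                         # text summary of the splits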



A Classification Tree on WL%

[Figure: classification tree regions over Number of Team HRs (2012) vs. ERA; leaf counts shown, e.g. a 10/1 case of the 1/-1 classes]



Decision Tree


Non-linear decision boundary, but with sharp corners

blue => point classified as WL% < .50
red  => point classified as WL% > .50

[Figure: decision regions over Number of Team HRs (2012) vs. ERA]



Classification Tree w/ Bootstrapping

10 samples of data used for 10 trees

Get the average classification at each grid point

Make a contour plot of the average

[Figure: averaged decision boundary over Number of Team HRs (2012) vs. ERA]



Classification w/ Bootstraps

100 samples => softer boundary

Get the average classification at each grid point

Make a contour plot of the average

[Figure: averaged decision boundary over Number of Team HRs (2012) vs. ERA]



Random Forest


“Bagging” = bootstrap & aggregate samples; gives a soft and highly flexible boundary

[Figure: random-forest decision boundary over Number of Team HRs (2012) vs. ERA]



randomForest in R

Essential R:

  install.packages("randomForest")
  library("randomForest")

  rt_res = randomForest(X, as.factor(Y), ntree = 100, mtry = 3, nodesize = 1)

  # ntree: number of trees; 500 or more, depending on the task
  # mtry: number of variables to try at each split;
  #   default is the square root of the total number (for classification)
  # nodesize: 1 is the default, but can be slow

See also:
  predict()
  combine()
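
A minimal usage sketch for the fitted forest (assuming the X, Y, and rt_res objects above; the accuracy check is illustrative):

  # a minimal sketch, assuming X, Y and rt_res from above
  rt_res$confusion                  # out-of-bag confusion matrix
  p = predict(rt_res, X)            # predicted classes for the training rows
  mean(p == as.factor(Y))           # training accuracy; OOB error is in rt_res$err.rate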



Variable selection


Idea: some features only add noise, but how to select?

PCA reduces to k factors, but is less interpretable

Stepwise regression incrementally adds variables:
  select the variable with the highest correlation to Y
  only allow a small β increment
  add/delete variable(s) to/from the set that improve performance
  iterate



Variable selection


Recall: logistic linear regression

  Transform observation O_i to [0,1]:  Y_i = log( (1 - O_i) / O_i )

  Solve:  β = argmin_β || y - Xβ ||

“Elastic Net” adds penalties on the size of β

  Solve:  β = argmin_β || y - Xβ || + λ1||β||1 + λ2||β||2

  λ1||β||1 => selects fewer variables
    (λ is the penalty weight; the ||β||1 norm is the sum of absolute values)

  λ2||β||2 => groups variables; p >> N is possible
    (the ||β||2 norm here is the sum of squared values)
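
A minimal sketch of an elastic-net fit in R (the glmnet package and the alpha/lambda choices are assumptions; the slides do not name the implementation):

  # a minimal sketch using the glmnet package (an assumption, not named in the slides)
  library(glmnet)

  # alpha mixes the two penalties: alpha = 1 is pure L1 (lasso), alpha = 0 is pure L2 (ridge)
  cv_fit = cv.glmnet(as.matrix(X), Y, family = "binomial", alpha = 0.5)

  coef(cv_fit, s = "lambda.min")    # nonzero rows are the selected variables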





Variable selection w/ Random Forest

Random Forest construction:

  Bootstrap samples of observations (ntree)

  Fit a classification tree to each sample

  During construction, choose the best split only from a subset of features (mtry)

  Result: an ensemble of diverse trees

Aggregating over an ensemble reduces prediction variance



Variable selection w/ Random Forest

mtry (size of the variable subsets) controls diversity

  As mtry decreases -> trees are less correlated
    (different splits, but not all node interactions)

  As mtry increases -> trees are more correlated
    (similar splits, more possible interactions)

Essentially, a trade-off between bias and variance

Defaults:
  mtry = √P for classification
  mtry = P/3 for regression



Variable Importance


Bootstrap samples leave out some data points

Use these OOB (out-of-bag) points for testing

For each variable, in each tree:
  record the OOB predictions
  permute that variable's values and record the OOB predictions again

VI_i = AVE_trees( %correct before permuting - %correct after permuting )

Note: not the same as the node importance calculated during tree construction

R: use the importance=TRUE option in the function call; use …$importance in the results object
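
A minimal sketch of pulling variable importance out of a fit (assuming X and Y as before; ntree is illustrative):

  # a minimal sketch, assuming X and Y as before
  rf = randomForest(X, as.factor(Y), ntree = 500, importance = TRUE)
  head(rf$importance)      # per-variable importance, including mean decrease in accuracy
  varImpPlot(rf)           # built-in plot of variable importance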



RandomForest Computations

Parameters:

  mtry (try the defaults and tune)

  ntree (more is better, until performance gains drop off)

  nodesize = number of cases at a leaf
    Ensembles alleviate the need to prune, so nodesize = 1 is OK (but can be compute-intensive)

Highly parallelizable across trees



Simple Random Forest Parallelization

Use the “Serial Packing” batch script to pack jobs onto cores:
  On Gordon, see /opt/sdsc/packing_serial_jobs

Essentially,

  mpirun -np number_cores my_program

Note:
  use #nodes x #cores-per-node = number_cores
  my_program should perform its own housekeeping




Simple Random Forest Parallelization

1. Distribute data to np*16 files (np = number of nodes)

2. In the PBS script, use mpirun to execute a perl script:

   #PBS -l nodes=np:ppn=16
   module load R
   mpirun -np 16*np ./perl_bundler.pl   # each core runs the perl script

3. In the perl script …

   $r = `./getid`    # current processor rank
   call R with the r-th file as an argument

Note: for > np*16 files, use mod(r, np*16)
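
A minimal sketch of the R side that each core could run (the rank argument, file names, and objects are illustrative assumptions):

  # a minimal sketch of the per-core R script (file and object names are assumptions)
  args = commandArgs(trailingOnly = TRUE)
  rank = as.integer(args[1])                       # rank passed in by the perl wrapper

  load(sprintf("chunk_%d.RData", rank))            # assumed to provide X and Y for this chunk

  library(randomForest)
  rf = randomForest(X, as.factor(Y), ntree = 50, importance = TRUE)
  save(rf, file = sprintf("rf_%d.RData", rank))    # one forest per core, merged later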


A GWAS Case Study


Genome-Wide Association Studies (GWAS):
  (typically, after sequencing, alignment, and consensus)

  100,000s of SNPs (single nucleotide polymorphisms)
  1000s of controls (fewer cases)

Individual SNPs often account for only a small amount of the risk
  (e.g. often only a 1% increase in log odds, Cecile et al., 2008)

Diseases likely depend on complex interactions between SNPs
  (as well as metagenomic factors)



A GWAS Case Study


Machine learning algorithms:

  find non-linear interactions for classification or prediction

  select variables as part of the learning process

  trade-offs: computational feasibility and interpretability

Objective: implement and compare variable selection via machine learning techniques with statistical significance through univariate procedures.




A GWAS Case Study


Data preprocessing:

  Genomic data from (Liu et al., 2013):

  SNPs: minor allele frequency > 10%, missingness < 10%, imputed with the mean

  Subjects: > 90% of SNPs present, with gender given

  Leaving 23K controls and 3.5K cases, 80K SNPs

  SNPs coded as 0, 1, 2 => number of minor alleles


GWAS study


Methods:

  Statistical inference from logistic regression
    (1 SNP at a time, p-value)

  Variable importance from randomForest
    (subsets of SNPs)

  Variable selection from penalized regression
    (subsets of SNP interactions, tallied when selected)

Compare p-value, VI, and selection tally with rank correlations



Subsetting for RandomForest

randomForest sampling in stages:

  1. Take 1000 samples of 3000 variables

  2. Run 50 trees on each sample => 50K total trees

  3. Run 1 instance of R on each core => on 4 compute nodes (64 cores), < 16 hours

Runs ‘better’ than foreach randomForest():
  less total memory across nodes

Better options in R than in Matlab

Two different samples had ~99% correlation, so 1000 may be enough (needs more tests)
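
For comparison, a minimal sketch of the foreach-based alternative mentioned above (the doParallel backend, the worker count, and the ntree split are assumptions):

  # a minimal sketch of the foreach alternative (backend and counts are assumptions)
  library(randomForest)
  library(foreach)
  library(doParallel)
  registerDoParallel(cores = 16)                     # one worker per core on a node

  rf_all = foreach(i = 1:16, .combine = randomForest::combine,
                   .packages = "randomForest") %dopar% {
    randomForest(X, as.factor(Y), ntree = 50)        # 50 trees per worker, merged by combine()
  }
  rf_all$ntree                                       # 16 * 50 = 800 trees in the merged forest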



Penalized Reg. Implementation


Sampling 2-way interactions in stages:

  1. Take 1000 samples of 3000 variables

  2. Choose 80 variables in Pen. Reg.

  3. Take the 80x80 interactions and choose interactions in sequence as long as they improve the fit

  4. Run 1 instance of R on each core => on 4 compute nodes (64 cores), < 4 hours

Note: ‘Least Angle Regression’ implementation
  adds the best variable, weighted by its angle with the next-best variable
  faster than gradient descent, but uses more memory
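
A minimal sketch of stages 2-3 for a single sample (the glmnet/model.matrix route is an assumption, and it simplifies step 3 by letting the penalty pick among all pairwise terms rather than adding them strictly in sequence):

  # a minimal sketch of stages 2-3 for one sample (package choice is an assumption)
  library(glmnet)

  Xs   = as.matrix(X[, sample(ncol(X), 3000)])                # stage 1: one sample of 3000 SNPs
  fit  = cv.glmnet(Xs, Y, family = "binomial", alpha = 0.5)
  b    = as.numeric(coef(fit, s = "lambda.min"))[-1]          # drop the intercept
  top  = order(abs(b), decreasing = TRUE)[1:80]               # stage 2: keep 80 variables

  X2   = model.matrix(~ .^2 - 1, data = as.data.frame(Xs[, top]))  # stage 3: 80x80 pairwise terms
  fit2 = cv.glmnet(X2, Y, family = "binomial", alpha = 0.5)        # tally which interactions are selected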


GWAS Results

All SNPs, chromosomes 1-22

[Figure: Random Forest variable importance across SNPs]


GWAS Results


VI is more highly correlated with the univariate p-values:

  i.e. most, but not all, SNPs that individually have statistically significant estimated
  effects are also important in the samples of decision trees for classification

Rank correlations:

                            Pen. Reg. 2-way counts    Univar. Log. Reg. p-value
  Rand. Forest VI                    0.38                       -0.64
  Pen. Reg. 2-way counts              --                        -0.37


GWAS Results


Conversely: some SNPs have high importance, and are often selected in 2-way interactions, but are not among those selected by univariate methods

[Figure: variable importance from Random Forest vs. 2-way interaction count from sampling (red = SNPs with low p-value); some SNPs are important and selected for interactions, but not necessarily significant]


Summary


An ensemble of variable selection methods might improve the process of identifying SNPs and SNP interactions

Data-intensive sampling for regression trees with large P can be parallelized

Ongoing


References


ACKNOWLEDGMENTS

Thanks to A.M. Dale and J.C. Roddey at the MMIL at UCSD, Dept. of Neuroscience, for advice and consultation on GWAS methods and for sharing their preprocessed PSC dataset. The author is part of the Predictive Analytics Center of Excellence at SDSC. This research was supported in part by the National Science Foundation through the San Diego Supercomputer Center under [#20091748 OCI-0910847] and utilized the Gordon system.

REFERENCES

Breiman, L. 2001. Random Forests. Machine Learning 45: 5-32.

Cecile, A., Janssens, J.W., van Duijn, C.M. 2008. Genome-based prediction of common diseases: advances and prospects. Human Molecular Genetics 17, R166-R173.

Liu, J.Z., et al. 2013. Dense genotyping of immune-related disease regions identifies nine new risk loci for primary sclerosing cholangitis. Nature Genetics 45, 670-675.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J., Sklar, P., de Bakker, P.I.W., Daly, M.J., Sham, P.C. 2007. PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.

Zou, H., Hastie, T. 2005. Regularization and variable selection via the elastic net. J.R. Statist. Soc. B 67, pp. 301-320.



End





Big ANOVA

ANOVA: regression with factor variables

  if x has 2 levels, use 2 indicator variables in X
  that are 1/0 if the level is present/absent

  Null hypothesis: factor effects are 0

More generally: solve Y = X*B s.t. C*B = 0,
  where C encodes constraints on B
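
A minimal sketch of this factor-coded regression view in R (the factors, levels, and data here are made up for illustration):

  # a minimal sketch: ANOVA as regression on factor indicators (data are made up)
  f1 = factor(sample(c("a", "b"), 100, replace = TRUE))
  f2 = factor(sample(1:3, 100, replace = TRUE))
  y  = rnorm(100)

  X   = model.matrix(~ f1 * f2)     # indicator columns for levels and their interactions
  fit = lm(y ~ f1 * f2)
  anova(fit)                        # F-tests of the null that the factor effects are 0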


Big ANOVA

Matlab computations:
  create a design matrix of all factor levels and interactions
  use QR
  work with the Y projections outside C*B (the null space)


Big ANOVA

Problem: 2 factors, with 9 and 19,000 levels (!)
  interaction terms have 19K*9 combos

  70K observations => not all combos have a data point

  Matlab: create all combos in X, use QR
    too much memory, too much for QR

Solution: create the design matrix 1 column at a time, shuffle matrices in/out of RAM to flash drive


A > 1 week job => hours, using ~300 GB on vSMP Gordon nodes