Tutorial for the randomGLM R package: prediction with UCI machine learning benchmark data

Lin Song, Steve Horvath


1. Data preparation

Here we illustrate how to make predictions with the UCI machine learning benchmark data. We start by loading the required libraries and preparing the data. The Ionosphere data set (from the R package mlbench) contains 34 features across 351 samples and has one binary outcome. In our paper [1], we averaged the 3-fold cross-validation estimates of the accuracy across 100 random partitions into 3 folds. Here, however, we do not use cross-validation. Instead, we split the data into training data (comprised of 2/3 of the observations) and test data (the remaining 1/3 of the observations).

# load required libraries
library(randomGLM)
library(mlbench)
options(stringsAsFactors = FALSE)

# load data
data(Ionosphere)

# check data
head(Ionosphere)

# check outcome variable
y0 = as.factor(Ionosphere[, 35])
table(y0)
# y0
#  bad good
#  126  225

# define features
x0 = as.matrix(Ionosphere[, -35])
mode(x0) = "numeric"

# m is the total number of samples
m = nrow(x0)

# split data into 2/3 training and 1/3 test
set.seed(1)
indx = sample(1:m, ceiling(m/3))
x = x0[-indx, ]
xtest = x0[indx, ]
y = y0[-indx]
ytest = y0[indx]
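
As a quick check, the dimensions of the resulting data sets should reflect the 2/3 vs. 1/3 split of the 351 samples:

# check split sizes: 234 training and 117 test samples
dim(x)      # [1] 234  34
dim(xtest)  # [1] 117  34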


2. Define accuracy measures

# define the accuracy measure
library(WGCNA)
accuracyM = function(predicted, y) {
  accuracyMeasures(table(predicted, y))[2, 2]
}
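
The [2,2] entry of the accuracyMeasures output is the accuracy, i.e. the proportion of correctly classified observations. As a minimal sanity check (using hypothetical toy labels, not part of the Ionosphere analysis), the result should equal the simple proportion of matching entries:

# toy sanity check with hypothetical labels
pred.toy  = factor(c(1, 1, 2, 2), levels=1:2)
truth.toy = factor(c(1, 2, 2, 2), levels=1:2)
accuracyM(pred.toy, truth.toy)   # 0.75
mean(pred.toy == truth.toy)      # 0.75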

3. RGLM prediction

Now we have training data x (234 samples x 34 features), training outcome y, test data xtest (117 samples x 34 features), and test outcome ytest. Prediction is done as follows. Make sure to set "classify=TRUE" for binary outcome prediction.

# RGLM prediction
RGLM = randomGLM(x, y, xtest,
                 classify=TRUE,
                 keepModels=TRUE,
                 nThreads=1)

# test set prediction
predictedTest = RGLM$predictedTest

# accuracy
accuracyM(predictedTest, ytest)
# [1] 0.8376068
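
Since we set keepModels=TRUE, the returned object also retains the individual GLMs fitted in each bag; assuming they are stored in the models component of the output (one model per bag), they can be inspected directly:

# inspect the per-bag models retained via keepModels=TRUE
# (assumed to live in the "models" component of the output)
length(RGLM$models)       # 100 with the default nBags
class(RGLM$models[[1]])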


Note that the parameter nFeaturesInBag of the randomGLM function controls the number of features randomly selected into each bag. Here we use the default value, which for a total of N features is

ceiling(ifelse(N <= 10, N, ifelse(N <= 300, (1.0276 - 0.00276*N)*N, N/5)))

We encourage users to optimize this parameter, for example by using OOB estimates of the accuracy. How to do this is described in another tutorial (see RGLMparameterTuningTutorial.docx) posted on our webpage:

http://labs.genetics.ucla.edu/horvath/RGLM
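
For concreteness, a small sketch: the default above can be evaluated for this data set, and the OOB predictions that randomGLM returns (assumed here to be stored in the predictedOOB component of the output) give an accuracy estimate that does not touch the test set:

# default nFeaturesInBag for the N = 34 Ionosphere features
N = ncol(x)
ceiling(ifelse(N <= 10, N, ifelse(N <= 300, (1.0276 - 0.00276*N)*N, N/5)))
# [1] 32

# OOB estimate of the accuracy (no test set needed); assumes OOB
# class predictions are stored in RGLM$predictedOOB
accuracyM(RGLM$predictedOOB, y)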

4. RGLM prediction with interaction terms between features

When the total number of features is small (as in most UCI benchmark data sets), we recommend adding pairwise interaction terms between features [1]. This can be done by setting the parameter "maxInteractionOrder=2". The following calculation takes about 6 minutes.

# RGLM prediction with pairwise feature interactions
RGLM.inter2 = randomGLM(x, y, xtest,
                        classify=TRUE, maxInteractionOrder=2,
                        keepModels=TRUE, nThreads=1)

# test set prediction
predictedTest.inter2 = RGLM.inter2$predictedTest
accuracyM(predictedTest.inter2, ytest)
# [1] 0.9145299


With pairwise interaction terms, the prediction accuracy increases from 0.84 to 0.91. Users can consider higher-order interaction terms by increasing maxInteractionOrder (e.g. to 3), but the computation may be very intensive and the performance will probably not improve [1].
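
For completeness, a third-order run would look as follows; this is only a sketch, and a substantially longer run time than the pairwise case should be expected:

# RGLM with third-order interaction terms (computationally intensive)
RGLM.inter3 = randomGLM(x, y, xtest,
                        classify=TRUE, maxInteractionOrder=3,
                        keepModels=TRUE, nThreads=1)
accuracyM(RGLM.inter3$predictedTest, ytest)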

5. Prediction with other common predictors

In this section, we provide the code for making predictions using other common predictors, including random forest, recursive partitioning, linear discriminant analysis, K-nearest neighbors, support vector machine, shrunken centroid, and penalized regression models.

# load required libraries
library(randomForest)
library(class)
library(rpart)
library(tree)
library(e1071)
library(pamr)
library(supclust)
library(glmnet)
library(MASS)   # for lda()

# change outcome levels from bad, good to 1 and 2,
# so that all methods can be applied
levels(y) = levels(ytest) = 1:2


# define a matrix to save the predicted values of the different methods
method = c("RF", "RFbigmtry", "Rpart", "LDA", "DLDA", "KNN", "SVM",
           "SC", "GLMNET0", "GLMNET0.5", "GLMNET1")
predictedMat = matrix(NA, length(ytest), length(method))


# random forest
RF = randomForest(x, y, xtest, importance=F)
predictedMat[, 1] = RF$test$predicted

# random forest with mtry set to the total number of features,
# i.e. plain bagging of trees without random feature selection
RFbigmtry = randomForest(x, y, xtest, mtry=ncol(x), importance=F)
predictedMat[, 2] = RFbigmtry$test$predicted


# recursive partitioning
rp1 = rpart(y~., data=data.frame(x))
predictedMat[, 3] = predict(rp1, newdata=data.frame(xtest), type="class")



# linear discriminant analysis
# feature 2 (V2) is constant in the Ionosphere data and is removed,
# since constant within-group variables make lda() fail
lda1 = lda(y~., data=data.frame(x[, -2]), CV=F, method="moment")
predictedMat[, 4] = predict(lda1, newdata=data.frame(xtest))$class


# diagonal linear discriminant analysis
# dlda() only takes numeric 0/1 outcome coding, so recode the outcome
# and map the 0/1 predictions back to levels 1/2
dlda1 = dlda(x, xtest, as.numeric(y) - 1)
predictedMat[, 5] = dlda1 + 1


# use cross-validation to determine k for KNN;
# odd k values avoid ties in binary classification
fold = 3
k_knn = seq(1, 21, 2)
acc_knn_cv = matrix(NA, fold, length(k_knn))
for (i in 1:fold) {
  cvIndx = sample(1:nrow(x), round(nrow(x)/fold))
  for (j in 1:length(k_knn)) {
    acc_knn_cv[i, j] = accuracyM(
      knn(train=x[-cvIndx, ], test=x[cvIndx, ], cl=y[-cvIndx], k=k_knn[j]),
      y[cvIndx])
  }
}
k_knn_best = k_knn[which.max(apply(acc_knn_cv, 2, mean))]
predictedMat[, 6] = knn(train=x, test=xtest, cl=y, k=k_knn_best)



# support vector machine
svm1 = svm(y~., data=data.frame(x))
predictedMat[, 7] = predict(svm1, newdata=data.frame(xtest), type="class")


# shrunken centroids
dat1_sc = list(x=t(x), y=y)
sc1 = pamr.train(dat1_sc)
sc_cv = pamr.cv(sc1, dat1_sc)
# choose the threshold minimizing the CV error; if tie, take the first
threshold_sc = sc_cv$threshold[which.min(sc_cv$error)]
predictedMat[, 8] = pamr.predict(sc1, newx=t(xtest),
                                 threshold=threshold_sc, type="class")


# penalized regression models
# alpha values: 0 = ridge regression, 0.5 = elastic net, 1 = lasso
alpha = c(0, 0.5, 1)
# choose the penalty parameter lambda by cross-validation
cv = cv.glmnet(x, y, family="binomial")
lambda = cv$lambda.min
for (i in 1:length(alpha)) {
  model = glmnet(x, y, family="binomial", alpha=alpha[i])
  predictedMat[, 8+i] = predict(model, xtest, type="class", s=lambda)
  rm(model)
}


# modify the accuracy measure: force both levels so the confusion
# table is always 2x2, even if a method never predicts one class
accuracyM = function(predicted, y) {
  accuracyMeasures(table(factor(predicted, levels=1:2), y))[2, 2]
}

# accuracy
accOther = apply(predictedMat, 2, accuracyM, ytest)
print(data.frame(method, accOther))

#       method  accOther
# 1         RF 0.8974359
# 2  RFbigmtry 0.8888889
# 3      Rpart 0.8205128
# 4        LDA 0.8119658
# 5       DLDA 0.4529915
# 6        KNN 0.7777778
# 7        SVM 0.8888889
# 8         SC 0.7863248
# 9    GLMNET0 0.8717949
# 10 GLMNET0.5 0.8547009
# 11   GLMNET1 0.8290598


Comparing these accuracies with that of RGLM.inter2 (0.91), we see that RGLM.inter2 outperforms all of the other methods on this machine learning benchmark data set.

References

1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760. DOI: 10.1186/1471-2105-14-5.