STAT 425 – Modern Methods of Data Analysis (23 + 54 pts.)
Assignment 9 – Gradient Boosting, Treed Regression, and a Mini-Midterm
PROBLEM 1 – PREDICTING THE AGE OF AN ABALONE
The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Other measurements, which are easier to obtain, are often used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.
Attribute Information:
Given are the attribute name, data type, measurement unit, a brief description, and the variable name in the data frame. The number of rings is the value to predict. These data are contained in the data frame Abalone.
Name / Data Type / Measurement Unit / Description / Variable Name
Length / continuous / mm / Longest shell measurement / length
Diameter / continuous / mm / perpendicular to length / diam
Height / continuous / mm / with meat in shell / height
Whole weight / continuous / grams / whole abalone / whole.weight
Shucked weight / continuous / grams / weight of meat / shucked.weight
Viscera weight / continuous / grams / gut weight (after bleeding) / visc.weight
Shell weight / continuous / grams / after being dried / shell.weight
Rings / integer / - / +1.5 gives the age in years / Rings
a) Develop a gradient boosting model (gbm) to predict the number of rings. Show or explain how you choose your tuning parameters (the shrinkage parameter, $J_m$, and $M$). (5 pts.)
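One common way to choose the number of trees for a given shrinkage and tree size is the built-in cross-validation in the gbm package. The following is a minimal sketch under stated assumptions, not a prescribed solution: the data frame `sim` and its columns are simulated stand-ins (not the Abalone data), the grid values are arbitrary, and gbm's `interaction.depth` is used here as the tree-size control that plays the role of $J_m$.

```r
library(gbm)

set.seed(425)
# Simulated stand-in data (hypothetical columns x1, x2, x3; not Abalone)
n   <- 400
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 10 * sin(pi * sim$x1) + 5 * sim$x2^2 + rnorm(n)

# Small illustrative grid over shrinkage and tree size (values are assumptions)
grid <- expand.grid(shrinkage = c(0.01, 0.05), depth = c(1, 3))
grid$best.M  <- NA
grid$cv.rmse <- NA

for (i in seq_len(nrow(grid))) {
  fit <- gbm(y ~ ., data = sim, distribution = "gaussian",
             n.trees = 1000,
             shrinkage = grid$shrinkage[i],
             interaction.depth = grid$depth[i],
             n.minobsinnode = 10,
             cv.folds = 5, n.cores = 1, verbose = FALSE)
  # gbm.perf() returns the number of trees that minimizes the CV error
  M <- gbm.perf(fit, method = "cv", plot.it = FALSE)
  grid$best.M[i]  <- M
  # cv.error holds the cross-validated squared-error loss per iteration
  grid$cv.rmse[i] <- sqrt(fit$cv.error[M])
}

best <- grid[which.min(grid$cv.rmse), ]
print(grid)
print(best)
```

Reading the table row by row shows how the chosen $M$ shifts as shrinkage gets smaller, which is the kind of evidence the question asks you to show or explain.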
b) Find the $R^2$ for your final model and estimate RMSEP using the residuals from the fit. Plot $y$ vs. $\hat{y}$ and $\hat{e}$ vs. $\hat{y}$. Also look at partial plots for the predictors using the gbm.plot() command. Discuss all. (6 pts.)
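The quantities in this part depend only on the observed and fitted values, so they can be computed the same way for any model. Below is a minimal base-R sketch; the helper name `fit_stats` is hypothetical, and a simple `lm` fit on simulated data stands in for the final boosted model. Note that an RMSEP based on training residuals tends to be optimistic.

```r
# R^2 and a residual-based RMSEP estimate from observed and fitted values
fit_stats <- function(y, yhat) {
  e <- y - yhat
  c(R2    = 1 - sum(e^2) / sum((y - mean(y))^2),
    RMSEP = sqrt(mean(e^2)))
}

set.seed(1)
x <- runif(200)
y <- 3 + 2 * x + rnorm(200, sd = 0.5)

fit   <- lm(y ~ x)        # stand-in for the final model
yhat  <- fitted(fit)
stats <- fit_stats(y, yhat)
print(round(stats, 3))

# Diagnostic plots: observed vs. fitted and residuals vs. fitted
if (interactive()) {
  par(mfrow = c(1, 2))
  plot(yhat, y, xlab = "Fitted", ylab = "Observed")
  abline(0, 1, lty = 2)
  plot(yhat, y - yhat, xlab = "Fitted", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```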
c) Use treed regression (Cubist) to build a model to predict the number of rings. Use cross-validation to determine if boosting (i.e. committees > 1) helps prediction and choose an optimal value for the number of committees, $M$. Also plot $y$ vs. $\hat{y}$ and $\hat{e}$ vs. $\hat{y}$ for your final treed regression model. Discuss all. (6 pts.)
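A k-fold comparison across committee counts could be organized roughly as follows. This is only a sketch under stated assumptions: the predictors `X` and response `y` are simulated stand-ins, the candidate committee counts are arbitrary, and 5 folds is an illustrative choice.

```r
library(Cubist)

set.seed(425)
# Simulated stand-in data (hypothetical; not the Abalone data)
n <- 300
X <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
y <- 10 * sin(pi * X$x1) + 5 * X$x2 + rnorm(n)

committees <- c(1, 5, 10, 20)                    # candidate values of M
K    <- 5
fold <- sample(rep(seq_len(K), length.out = n))  # same folds for every M

cv_rmsep <- sapply(committees, function(M) {
  sq_err <- numeric(n)
  for (k in seq_len(K)) {
    test <- fold == k
    fit  <- cubist(x = X[!test, ], y = y[!test], committees = M)
    pred <- predict(fit, newdata = X[test, ])
    sq_err[test] <- (y[test] - pred)^2
  }
  sqrt(mean(sq_err))
})

result <- data.frame(committees = committees, RMSEP = cv_rmsep)
print(result)
best_M <- result$committees[which.min(result$RMSEP)]
```

If the row for one committee has an RMSEP close to the minimum, that suggests boosting adds little for the data at hand.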
d) How does the RMSEP for the gradient boosting and treed regression models compare to that for MARS, RPART, bagged RPART, and Random Forests? (6 pts.)
PROBLEM 2 – PREDICTING THE STRENGTH OF CONCRETE (MINI-MIDTERM)
Given below are the variable name, variable type, measurement unit, and a brief description. The regression problem is to predict the concrete compressive strength.
Name / Data Type / Measurement Unit / Description
Cement / quantitative / kg in a m³ mixture / Input Variable
Blast Furnace Slag / quantitative / kg in a m³ mixture / Input Variable
Fly Ash / quantitative / kg in a m³ mixture / Input Variable
Water / quantitative / kg in a m³ mixture / Input Variable
Superplasticizer / quantitative / kg in a m³ mixture / Input Variable
Coarse Aggregate / quantitative / kg in a m³ mixture / Input Variable
Fine Aggregate / quantitative / kg in a m³ mixture / Input Variable
Age / quantitative / Day (1~365) / Input Variable
Concrete compressive strength / quantitative / MPa / Output Variable (Y)
These data can be obtained from the UCI Machine Learning Repository under Concrete.
http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/
Read it into Excel first and save it in comma-delimited (.CSV) format after shortening the variable names. Use the command below to read the dataset into R.
> Concrete = read.table(file.choose(), header=TRUE, sep=",")
a)
Develop models to predict concrete compressive strength. Use the following modeling
approaches:
OLS – possibly using ACE/AVAS to help find appropriate transformations
Projection Pursuit
MARS
Neural networks
RPART
Bagged RPART
Random Forests
Gradient Boosted Trees
Treed Regression
Be sure to include some discussion for each method on how you “tuned” the fit using that modeling approach. Be sure to use the same response for each! (36 points – 4 pts. each)
b) Identify which of the predictors are most important on the basis of the models you fit. Also give at least one visualization of the predictor “effects” from the models fit in part (a). Discuss all of this in practical terms. (6 pts.)
c) Using MC cross-validation, decide which modeling approach would be best to use to predict the compressive strength of concrete. Be sure that all MCCV functions have been fixed to perform similarly and correctly. Also make sure that you use the same response for each method, so the RMSEP values can be fairly compared. Put your results in a table and discuss. (12 pts.)
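One way to make sure every method is judged on the same random splits and the same response is to route all of them through a single MCCV wrapper. The sketch below is a base-R illustration under stated assumptions: the function `mccv`, its arguments, the split fraction, and the simulated data frame `sim` are all hypothetical, and two `lm` fits stand in for the modeling approaches being compared.

```r
# Monte Carlo cross-validation: repeated random train/test splits.
# fit_fun(train) returns a model; pred_fun(model, test) returns predictions
# on the scale of the response column named by `response`.
mccv <- function(data, response, fit_fun, pred_fun,
                 B = 100, p = 0.667, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  # Draw all splits up front so every method sees identical splits
  splits <- replicate(B, sample(n, floor(p * n)), simplify = FALSE)
  rmsep <- numeric(B)
  for (b in seq_len(B)) {
    train <- splits[[b]]
    model <- fit_fun(data[train, ])
    pred  <- pred_fun(model, data[-train, ])
    rmsep[b] <- sqrt(mean((data[-train, response] - pred)^2))
  }
  rmsep
}

set.seed(2)
sim <- data.frame(x1 = runif(300), x2 = runif(300))
sim$y <- 2 + 3 * sim$x1 + 4 * sim$x2^2 + rnorm(300, sd = 0.5)

res_lin  <- mccv(sim, "y",
                 function(d) lm(y ~ x1 + x2, data = d),
                 function(m, d) predict(m, newdata = d))
res_quad <- mccv(sim, "y",
                 function(d) lm(y ~ x1 + poly(x2, 2), data = d),
                 function(m, d) predict(m, newdata = d))

results <- data.frame(method     = c("linear", "quadratic"),
                      mean.RMSEP = c(mean(res_lin), mean(res_quad)),
                      sd.RMSEP   = c(sd(res_lin), sd(res_quad)))
print(results)
```

If a method is fit to a transformed response, its `pred_fun` would need to back-transform the predictions before the RMSEP is computed; otherwise the values in the table are not comparable.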