Assignment #9 and Mini-Midterm

Oct 15, 2013

STAT 425: Modern Methods of Data Analysis (23 + 54 pts.)

Assignment 9: Gradient Boosting, Treed Regression, and a Mini-Midterm


PROBLEM 1 – PREDICTING THE AGE OF AN ABALONE

The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Other measurements, which are easier to obtain, are often used to predict the age.

Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.



Attribute Information:

Given are the attribute name, attribute type, measurement unit, and a brief description. The number of rings is the value to predict. These data are contained in the data frame Abalone.


Name / Data Type / Measurement Unit / Description / Name in data frame

Length / continuous / mm / Longest shell measurement / length
Diameter / continuous / mm / perpendicular to length / diam
Height / continuous / mm / with meat in shell / height
Whole weight / continuous / grams / whole abalone / whole.weight
Shucked weight / continuous / grams / weight of meat / shucked.weight
Viscera weight / continuous / grams / gut weight (after bleeding) / visc.weight
Shell weight / continuous / grams / after being dried / shell.weight
Rings / integer / -- / +1.5 gives the age in years / Rings

a) Develop a gradient boosting model (gbm) to predict the number of rings. Show or explain how you chose your tuning parameters (the shrinkage parameter, $J_m$, and $M$). (5 pts.)


b) Find the $R^2$ for your final model and estimate RMSEP using the residuals from the fit. Plot $\hat{y}$ vs. $y$ and $\hat{e}$ vs. $\hat{y}$. Also look at partial plots for the predictors using the gbm.plot() command. Discuss all. (6 pts.)


c) Use treed regression (Cubist) to build a model to predict the number of rings. Use cross-validation to determine whether boosting (i.e. committees > 1) helps prediction, and choose an optimal value for the number of committees, $M$. Also plot $\hat{y}$ vs. $y$ and $\hat{e}$ vs. $\hat{y}$ for your final treed regression model. Discuss all. (6 pts.)


d) How do the RMSEP values for gradient boosting and treed regression compare to those for MARS, RPART, bagged RPART, and Random Forests? (6 pts.)
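The $R^2$ and residual-based RMSEP quantities referred to in parts (b) and (d) can be sketched in base R as below. This is only an illustration under stated assumptions: the helper name fit_summary and the toy vectors y and yhat are hypothetical and are not abalone data. RMSEP is taken here as the square root of the mean squared residual, which tends to be optimistic when the residuals come from the same data used to fit the model.

```r
# Hypothetical helper: R^2 and a residual-based RMSEP estimate
# from observed responses y and fitted values yhat.
fit_summary <- function(y, yhat) {
  e <- y - yhat                                    # residuals
  c(R2    = 1 - sum(e^2) / sum((y - mean(y))^2),   # 1 - SSE/SST
    RMSEP = sqrt(mean(e^2)))                       # root mean squared residual
}

# Toy values made up for illustration (not taken from the Abalone data frame)
y    <- c(9, 7, 10, 8, 11)
yhat <- c(8.5, 7.5, 9.5, 8.5, 11)

out <- fit_summary(y, yhat)
print(out)

# Diagnostic plots of the kind asked for in parts (b) and (c):
# plot(yhat, y); abline(0, 1)            # fitted vs. actual
# plot(yhat, y - yhat); abline(h = 0)    # residuals vs. fitted
```

For these toy values SSE = 1 and SST = 10, so $R^2 = 0.9$ and RMSEP $= \sqrt{0.2} \approx 0.447$. With a real fit, yhat would come from predict() applied to the chosen model.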


PROBLEM 2 – PREDICTING THE STRENGTH OF CONCRETE (MINI-MIDTERM)

Given below are the variable name, variable type, measurement unit, and a brief description. The regression problem is to predict the concrete compressive strength.




Cement -- quantitative -- kg in a m³ mixture -- Input Variable
Blast Furnace Slag -- quantitative -- kg in a m³ mixture -- Input Variable
Fly Ash -- quantitative -- kg in a m³ mixture -- Input Variable
Water -- quantitative -- kg in a m³ mixture -- Input Variable
Superplasticizer -- quantitative -- kg in a m³ mixture -- Input Variable
Coarse Aggregate -- quantitative -- kg in a m³ mixture -- Input Variable
Fine Aggregate -- quantitative -- kg in a m³ mixture -- Input Variable
Age -- quantitative -- Day (1~365) -- Input Variable
Concrete compressive strength -- quantitative -- MPa -- Output Variable (Y)


These data can be obtained from the UCI Machine Learning Repository under Concrete.


http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/


Read the file into Excel first, shorten the variable names, and save it in comma-delimited (.csv) format. Use the command below to read the dataset into R.


> Concrete = read.table(file.choose(), header=TRUE, sep=",")



a) Develop models to predict concrete compressive strength. Use the following modeling approaches:

- OLS, possibly using ACE/AVAS to help find appropriate transformations
- Projection Pursuit
- MARS
- Neural networks
- RPART
- Bagged RPART
- Random Forests
- Gradient Boosted Trees
- Treed Regression

Be sure to include some discussion for each method on how you “tuned” the fit using that modeling approach. Be sure to use the same response for each!!!! (36 points, 4 pts. each)


b) Identify which of the predictors are most important on the basis of the models you fit. Also give at least one visualization of the predictor “effects” from the models fit in part (a). Discuss all of this in practical terms. (6 pts.)



c) Using MC cross-validation, decide which modeling approach would be best to use to predict the compressive strength of concrete. Be sure that all MCCV functions have been fixed to perform similarly and correctly. Also make sure that you use the same response for each method, so the RMSEP values can be fairly compared. Put your results in a table and discuss. (12 pts.)
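The Monte Carlo cross-validation comparison in part (c) could be organized along the following lines. This is a minimal base-R sketch, not the course's MCCV functions. The names mccv, fitters, and dat are hypothetical, the data are simulated, and the two lm() fits merely stand in for the modeling approaches listed in part (a). The number of splits (B = 100) and the training fraction (p = 0.75) are likewise assumptions. Every method sees the same random splits and is scored on the same untransformed response, so the RMSEP values are directly comparable.

```r
set.seed(425)

# Simulated stand-in data (hypothetical; not the Concrete data)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 2 + 3 * dat$x1 + 5 * dat$x2^2 + rnorm(n, sd = 0.5)

# Each fitter takes training data and returns a prediction function
fitters <- list(
  linear = function(train) {
    fit <- lm(y ~ x1 + x2, data = train)
    function(new) predict(fit, newdata = new)
  },
  quadratic = function(train) {
    fit <- lm(y ~ x1 + x2 + I(x2^2), data = train)
    function(new) predict(fit, newdata = new)
  }
)

# Monte Carlo CV: B random train/test splits, training fraction p
mccv <- function(data, fitters, B = 100, p = 0.75) {
  n <- nrow(data)
  rmsep <- matrix(NA_real_, B, length(fitters),
                  dimnames = list(NULL, names(fitters)))
  for (b in seq_len(B)) {
    train.id <- sample(n, floor(p * n))   # one split shared by all methods
    train <- data[train.id, ]
    test  <- data[-train.id, ]
    for (m in names(fitters)) {
      pred <- fitters[[m]](train)(test)
      rmsep[b, m] <- sqrt(mean((test$y - pred)^2))  # original response scale
    }
  }
  colMeans(rmsep)                         # average RMSEP per method
}

results <- mccv(dat, fitters)
print(round(results, 3))
```

If a method is fit to a transformed response, its predictions would need to be back-transformed inside the fitter before the squared errors are computed. Otherwise its RMSEP is on a different scale from the others.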