Modern Methods of Data Analysis
Naïve Bayes Classifiers, Logistic Regression, LDA, QDA, and RDA
CLEVELAND HEART DISEASE
The goal here is to predict heart disease status, and possibly severity of heart disease, using
demographic and diagnostic information on the patients. The variable descriptions are given below.
Variable   Description
age        age in years
gender     gender (male or female)
cp         chest pain type
trestbps   resting blood pressure (in mm Hg on admission to the hospital)
chol       cholesterol level in mg/dl
fbs        fasting blood sugar > 120 mg/dl (true = true, fal = false)
restecg    resting electrocardiographic results: normal (norm), ST-T wave abnormality (abn),
           or showing probable or definite left ventricular hypertrophy by Estes' criteria (hyp)
thatach    maximum heart rate achieved
exang      exercise induced angina (true or false = fal)
oldpeak    ST depression induced by exercise relative to rest
slope      the slope of the peak exercise ST segment
ca         number of major vessels (0-3) colored by fluoroscopy
thal       normal (norm), fixed defect (fix), reversible defect (rev)
diag       sick or buff (healthy)
grp        H (healthy), S1, S2, S3, S4 (higher number = more sick)
   age  gender  cp      trestbps  chol  fbs   restecg  thatach  exang  oldpeak  slope  ca  thal  diag  grp
1  63   male    angina  145       233   true  hyp      150      fal    2.3      down   0   fix   buff  H
2  67   male    asympt  160       286   fal   hyp      108      true   1.5      flat   3   norm  sick  S2
3  67   male    asympt  120       229   fal   hyp      129      true   2.6      flat   2   rev   sick  S1
4  37   male    notang  130       250   fal   norm     187      fal    3.5      down   0   norm  buff  H
5  41   fem     abnang  130       204   fal   hyp      172      fal    1.4      up     0   norm  buff  H
6  56   male    abnang  120       236   fal   norm     178      fal    0.8      up     0   norm  buff  H
a) The file in the Share folder contains the raw data in comma-delimited format. Read these data into
a data frame in R. Next, form one data frame called Cleve.diag, which drops the variable grp from the
original database, and a second data frame which drops the variable diag from the original database.
The data frame Cleve.diag will be used to predict heart disease status (sick or buff (healthy)), while
the second data frame can be used to predict heart disease status and severity (H or S1, S2, S3, S4).
(2 pts.)
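One way to drop a single variable from a data frame is sketched below. The toy data frame raw, the helper drop_var, and the names toy.diag and toy.grp are hypothetical stand-ins for illustration only; the values are not taken from the Cleveland file.

```r
# Toy stand-in for the raw data (hypothetical values, not the Cleveland file).
raw <- data.frame(
  age  = c(63, 67, 37, 41),
  chol = c(233, 286, 250, 204),
  diag = factor(c("buff", "sick", "buff", "buff")),
  grp  = factor(c("H", "S2", "H", "H"))
)

# Drop one column by name; setdiff() keeps every other column in its original order.
drop_var <- function(df, var) df[, setdiff(names(df), var), drop = FALSE]

toy.diag <- drop_var(raw, "grp")   # keeps diag as the response
toy.grp  <- drop_var(raw, "diag")  # keeps grp as the response

names(toy.diag)
names(toy.grp)
```

Dropping by name rather than by column position keeps the code correct even if the column order in the file changes.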
b) Use logistic regression to predict heart disease status (buff or sick), working with the Cleve.diag
data frame. What is the APER (apparent error rate) based on your final model? You do not need to use
cross-validation for this estimate.
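As a rough sketch of how an apparent error rate can be computed from a fitted logistic regression, the example below uses simulated data rather than the heart disease data. The data frame toy, the predictors x1 and x2, the response name status (chosen to avoid clashing with the base function diag()), and the 0.5 cutoff are all assumptions made for illustration.

```r
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Simulate a two-level response whose log-odds depend on x1 and x2.
p <- plogis(1.5 * toy$x1 - 1 * toy$x2)
toy$status <- factor(ifelse(runif(n) < p, "sick", "buff"))

fit <- glm(status ~ ., data = toy, family = binomial)

# glm() models the probability of the second factor level ("sick" here).
phat <- fitted(fit)
pred <- factor(ifelse(phat > 0.5, "sick", "buff"), levels = levels(toy$status))

# Confusion matrix and apparent error rate (resubstitution, no validation).
conf <- table(actual = toy$status, predicted = pred)
aper <- 1 - sum(diag(conf)) / sum(conf)
conf
aper
```

Because the same observations are used to fit and to evaluate the model, this estimate tends to be optimistic.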
c) From your model in part (b), what are the most important factors in determining the heart
disease status of a patient? Justify your answer. (3 pts.)
d) Using the last 50 observations in the Cleve.diag data set as a test set, estimate the APER
using logistic regression by fitting a model to the first 246 observations. Code to create the
test and training sets is given below.
> Cleve.test = Cleve.diag[247:296,]
> Cleve.train = Cleve.diag[1:246,]
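The fit-on-training, predict-on-test pattern might look like the sketch below. It uses a simulated data frame with 296 rows so the same row ranges apply; the names toy, toy.train, toy.test, and status, and the 0.5 cutoff, are hypothetical.

```r
set.seed(2)
n <- 296
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$status <- factor(ifelse(runif(n) < plogis(2 * toy$x1 - toy$x2), "sick", "buff"))

toy.train <- toy[1:246, ]
toy.test  <- toy[247:296, ]

# Fit on the training rows only.
fit <- glm(status ~ ., data = toy.train, family = binomial)

# Predicted probability of the second level ("sick") for the held-out rows.
phat <- predict(fit, newdata = toy.test, type = "response")
pred <- ifelse(phat > 0.5, "sick", "buff")

# Proportion of test cases that are misclassified.
test.error <- mean(pred != toy.test$status)
test.error
```

Note that type = "response" is needed to get probabilities; the default for predict() on a glm fit is the linear predictor (log-odds) scale.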
e) Again using the last 50 observations as a test set, develop a neural network model using the
training set. Try different size neural networks and choose what you think is best. What is
the APER for predicting the test cases for your final model? (3 pts.)
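One possible way to try several network sizes is sketched here with the nnet package on simulated data. The candidate sizes, the decay and maxit settings, and all object names are assumptions for illustration, not recommended values.

```r
library(nnet)

set.seed(3)
n <- 296
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$status <- factor(ifelse(runif(n) < plogis(2 * toy$x1 - toy$x2), "sick", "buff"))
toy.train <- toy[1:246, ]
toy.test  <- toy[247:296, ]

# Fit one single-hidden-layer network per candidate size and record its test error.
sizes <- c(1, 3, 5)
errs <- sapply(sizes, function(s) {
  fit <- nnet(status ~ ., data = toy.train, size = s,
              decay = 0.01, maxit = 200, trace = FALSE)
  pred <- predict(fit, newdata = toy.test, type = "class")
  mean(pred != toy.test$status)
})
names(errs) <- paste0("size=", sizes)
errs
```

Since nnet starts from random weights, results vary from run to run unless the seed is fixed, and an error rate used to pick the size is no longer an unbiased estimate for the chosen network.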
f) Use a naïve Bayes classifier to develop a prediction rule using the training
data set and predict the test cases. What is the APER based on your test case predictions?
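To show what a naïve Bayes rule computes, here is a minimal hand-written Gaussian version in base R, run on the built-in iris data as a stand-in. The functions nb_fit and nb_predict are hypothetical helpers written for this sketch, and they handle numeric predictors only; the heart disease data also has categorical predictors, for which a package implementation (for example naiveBayes in e1071) would be the more practical choice.

```r
# Fit: per-class prior, plus per-class mean and sd of each numeric predictor.
nb_fit <- function(X, y) {
  y <- factor(y)
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),
    mean   = sapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE])),
    sd     = sapply(levels(y), function(k) apply(X[y == k, , drop = FALSE], 2, sd))
  )
}

# Predict: sum the log densities across predictors (the "naive" independence
# assumption), add the log prior, and pick the class with the largest total.
nb_predict <- function(model, X) {
  logpost <- sapply(seq_along(model$levels), function(j) {
    ll <- sapply(seq_len(ncol(X)), function(i)
      dnorm(X[, i], model$mean[i, j], model$sd[i, j], log = TRUE))
    log(model$prior[j]) + rowSums(ll)
  })
  model$levels[max.col(logpost, ties.method = "first")]
}

set.seed(5)
test.id <- sample(1:nrow(iris), 50)
train <- iris[-test.id, ]
test  <- iris[test.id, ]

model <- nb_fit(train[, 1:4], train$Species)
pred  <- nb_predict(model, test[, 1:4])
err   <- mean(pred != test$Species)
err
```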
SATELLITE IMAGE DATA
The goal here is to predict the type of ground cover from a satellite image broken up into pixels.
Description from UCI Machine Learning database:
The database consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image, and
the classification associated with the central pixel in each neighborhood. The aim is to predict this
classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a
number.
The Landsat satellite data is one of the many sources of information available for a scene. The
interpretation of a scene by integrating spatial data of diverse types and resolutions including
multispectral and radar data, maps indicating topography, land use etc. is expected to assume significant
importance with the onset of an era characterized by integrative approaches to remote sensing (for
example, NASA's Earth Observing System commencing this decade). Existing statistical methods are
ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered
in isolation (as in this sample database). This data satisfies the important requirements of being numerical
and at a single resolution, and standard maximum-likelihood classification performs very well.
Consequently, for this data, it should be interesting to compare the performance of other methods against
the statistical approach.
One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral
bands. Two of these are in the visible region (corresponding approximately to green and red regions of
the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0
corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each
image contains 2340 x 3380 such pixels.
The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds
to a 3x3 square neighborhood of pixels completely contained within the 82x100 sub-area. Each line
contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3
neighborhood and a number indicating the classification label of the central pixel. The number is a code
for the following classes:
1 red soil
2 cotton crop
3 grey soil
4 damp grey soil
5 soil with vegetation stubble
6 mixture class (all types present)
7 very damp grey soil
Note: There are no examples with class 6 in this dataset.
The data is given in random order and certain lines of data have been removed so you cannot reconstruct
the original image from this dataset.
In each line of data the four spectral values for the top-left pixel are given first followed by the four
spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels
read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are
given by attributes 17,18,19 and 20.
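The attribute numbering described above can be checked with a little arithmetic. The helper attr_index below is a hypothetical function written for this illustration: for the pixel in row r and column k of the 3x3 neighborhood, there are (r - 1) * 3 + (k - 1) pixels before it, each contributing four attributes.

```r
# Attribute number of spectral band b (1-4) for the pixel in row `r`, column `k`
# (each 1-3) of the 3x3 neighborhood, read left-to-right and top-to-bottom.
attr_index <- function(r, k, b) ((r - 1) * 3 + (k - 1)) * 4 + b

attr_index(2, 2, 1:4)   # central pixel:      17 18 19 20
attr_index(1, 1, 1:4)   # top-left pixel:      1  2  3  4
attr_index(3, 3, 1:4)   # bottom-right pixel: 33 34 35 36
```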
You can read the data into R from the file satimage.txt in the Shared folder on Class Storage
using the command below (be sure to put a space between the quotes!):
> SATimage = read.table(file.choose(),header=T,sep=" ")
> SATimage = data.frame(SATimage[,1:36],class=as.factor(SATimage$class))
The second command makes sure that the response is interpreted as a factor (categorical) rather than as a
number. Use SATimage as the data frame throughout.
Create a test and training set using the code below:
this ensures you all have the sa
> testcases = sample(1:nrow(SATimage),1000,replace=F)
> SATtest = SATimage[testcases,]
> SATtrain = SATimage[-testcases,]
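The sketch below illustrates, on a small toy data frame, how fixing the random seed makes the sampled test cases reproducible and how negative indexing gives the complementary training set. The seed value 123 and the object names are hypothetical choices for this example.

```r
toy <- data.frame(id = 1:20, x = (1:20)^2)

set.seed(123)  # hypothetical seed value, chosen only for this illustration
testcases <- sample(1:nrow(toy), 5, replace = FALSE)
toy.test  <- toy[testcases, ]
toy.train <- toy[-testcases, ]   # negative index = every row that was not sampled

# Resetting the same seed reproduces exactly the same draw.
set.seed(123)
again <- sample(1:nrow(toy), 5, replace = FALSE)
identical(testcases, again)
```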
a) Compare sknn, naïve Bayes, neural network, lda, qda, and rda classification of the test cases.
Which method performs best for these data?
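A comparison of two of these methods on a common test set might be organized as in the sketch below, which uses lda and qda from the MASS package on the built-in iris data as a stand-in for the satellite data. The split size and object names are assumptions; sknn and rda are not shown here (they are available in the klaR package), but their test errors could be added to the same vector.

```r
library(MASS)

set.seed(4)
test.id <- sample(1:nrow(iris), 50)
train <- iris[-test.id, ]
test  <- iris[test.id, ]

fit.lda <- lda(Species ~ ., data = train)
fit.qda <- qda(Species ~ ., data = train)

# predict() on lda/qda fits returns a list; $class holds the predicted labels.
errs <- c(
  lda = mean(predict(fit.lda, newdata = test)$class != test$Species),
  qda = mean(predict(fit.qda, newdata = test)$class != test$Species)
)
errs
```

Evaluating every method on the same test cases keeps the comparison fair, since differences in error rate then reflect the classifiers and not the particular split.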
b) Write your own MCCV cross-validation routines for lda, qda, and rda classification.
Demonstrate their use with the full SAT image dataset. Which of these methods performs