Assignment #11



STAT 425 Modern Methods of Data Analysis (37 pts.)

Assignment 11: Naïve Bayes Classifiers, Logistic Regression, Neural Networks, LDA, QDA, and RDA


PROBLEM 1 - CLEVELAND HEART DISEASE STUDY

The goal here is to predict heart disease status and possibly severity of heart disease using demographic and diagnostic information on the patients. The variable descriptions are given below.

Variable Name   Description
-------------   -----------
age             age (yrs.)
gender          gender (male or female)
cp              chest pain type
                  -- typical angina (angina)
                  -- atypical angina (abnang)
                  -- non-anginal pain (notang)
                  -- asymptomatic (asympt)
trestbps        resting blood pressure (in mm Hg on admission to the hospital)
chol            serum cholesterol level in mg/dl
fbs             fasting blood sugar > 120 mg/dl (true = true or fal = false)
restecg         resting electrocardiographic results
                  -- normal (norm)
                  -- having ST-T wave abnormality (abn)
                  -- showing probable or definite left ventricular hypertrophy by Estes' criteria (hyp)
thalach         maximum heart rate achieved
exang           exercise induced angina (true or fal = false)
oldpeak         ST depression induced by exercise relative to rest
slope           the slope of the peak exercise ST segment
                  -- upsloping (up)
                  -- flat (flat)
                  -- downsloping (down)
ca              number of major vessels (0-3) colored by fluoroscopy
thal            normal (norm), fixed defect (fix), reversible defect (rev)

Responses:

diag = sick or buff (healthy)
grp = H (healthy), S1, S2, S3, S4 (higher number = more sick)





> head(Cleveland)
  age gender     cp trestbps chol  fbs restecg thatach exang oldpeak slope ca thal diag grp
1  63   male angina      145  233 true     hyp     150   fal     2.3  down  0  fix buff   H
2  67   male asympt      160  286  fal     hyp     108  true     1.5  flat  3 norm sick  S2
3  67   male asympt      120  229  fal     hyp     129  true     2.6  flat  2  rev sick  S1
4  37   male notang      130  250  fal    norm     187   fal     3.5  down  0 norm buff   H
5  41    fem abnang      130  204  fal     hyp     172   fal     1.4    up  0 norm buff   H
6  56   male abnang      120  236  fal    norm     178   fal     0.8    up  0 norm buff   H


a) The file cleveland.txt in the Share folder contains the raw data in comma-delimited format. Read these data into a data frame called Cleveland in R. Next, form one data frame called Cleve.diag which drops the variable grp from the original database and one data frame called Cleve.grp which drops the variable diag from the original database. The data frame Cleve.diag will be used to predict heart disease status (sick or buff (healthy)), while the data frame Cleve.grp can be used to predict heart disease status and severity (H or S1, S2, S3, S4). (2 pts.)
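The read-and-drop step in part (a) can be sketched as follows. This is only an illustration: it assumes the file has a header row with the column names printed in the head(Cleveland) output, and to stay self-contained it reads those six printed rows from an inline string rather than from cleveland.txt. The name cleve_text is hypothetical.

```r
# Six rows copied from the head(Cleveland) output, written as comma-delimited text.
# Column names are copied exactly as printed there.
# With the real file, something like read.csv("cleveland.txt") would be used
# instead (the exact path and header layout are assumptions).
cleve_text <- "age,gender,cp,trestbps,chol,fbs,restecg,thatach,exang,oldpeak,slope,ca,thal,diag,grp
63,male,angina,145,233,true,hyp,150,fal,2.3,down,0,fix,buff,H
67,male,asympt,160,286,fal,hyp,108,true,1.5,flat,3,norm,sick,S2
67,male,asympt,120,229,fal,hyp,129,true,2.6,flat,2,rev,sick,S1
37,male,notang,130,250,fal,norm,187,fal,3.5,down,0,norm,buff,H
41,fem,abnang,130,204,fal,hyp,172,fal,1.4,up,0,norm,buff,H
56,male,abnang,120,236,fal,norm,178,fal,0.8,up,0,norm,buff,H"

# Read the text as a data frame; character columns become factors.
Cleveland <- read.csv(text = cleve_text, stringsAsFactors = TRUE)

# Drop one response from each copy so that the other is the only response left.
Cleve.diag <- Cleveland[, names(Cleveland) != "grp"]
Cleve.grp  <- Cleveland[, names(Cleveland) != "diag"]

names(Cleve.diag)
names(Cleve.grp)
```

Selecting columns by name rather than by position keeps the code valid even if the column order in the file differs from the printed output.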


b) Use logistic regression to predict heart disease status (buff or sick) working with the Cleve.diag data. What is the APER (apparent error rate) based on your final model? You do not need to use cross-validation for this estimate. (3 pts.)
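As a rough sketch of how an APER can be computed from a fitted logistic regression, the example below uses the mtcars data that ships with R as a stand-in, with the 0/1 variable am playing the role of diag. The helper name aper, the single predictor, and the 0.5 cutoff are assumptions for illustration, not part of the assignment.

```r
# APER (apparent error rate): the fraction of cases that the fitted rule
# misclassifies. 'aper' is a hypothetical helper name.
aper <- function(actual, predicted) {
  mean(as.character(actual) != as.character(predicted))
}

# Stand-in data: mtcars ships with R; 'am' (0/1) plays the role of diag.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Classify as 1 when the fitted probability exceeds 0.5 (cutoff is an assumption).
pred <- ifelse(fitted(fit) > 0.5, 1, 0)

table(actual = mtcars$am, predicted = pred)  # confusion matrix
aper(mtcars$am, pred)                        # error rate on the fitting data
```

Because the same observations are used for fitting and for evaluation, this estimate tends to be optimistic.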


c) From your model in part (b) what are the most important factors in determining the heart disease status of a patient? Justify your answer. (3 pts.)


d) Using the last 50 observations in the Cleve.diag data set as a test set, estimate the APER using logistic regression by fitting a model to the first 246 observations. Code to create the test and training sets is given below.


> Cleve.test = Cleve.diag[247:296,]

> Cleve.train = Cleve.diag[1:246,]
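The same train/test pattern can be sketched on stand-in data. The example below again uses mtcars (shipped with R) with am in place of diag, holding out the last 10 rows. The predictor, the 0.5 cutoff, and the variable names are assumptions for illustration only.

```r
# Stand-in data: mtcars (ships with R); 'am' plays the role of diag.
n     <- nrow(mtcars)
train <- mtcars[1:(n - 10), ]   # first rows are used for fitting
test  <- mtcars[(n - 9):n, ]    # last 10 rows are held out

# Fit on the training rows only.
fit <- glm(am ~ wt, data = train, family = binomial)

# Predicted probabilities for the held-out rows, then a 0.5 cutoff (assumed).
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

table(actual = test$am, predicted = pred)
test_error <- mean(pred != test$am)  # misclassification rate on the test rows
test_error
```

Note that type = "response" is needed to get probabilities; the default for predict on a glm object is the linear predictor on the logit scale.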


e) Again using the last 50 observations as a test set, develop a neural network model using the training set. Try different size neural networks and choose what you think is best. What is the APER for predicting the test cases for your final neural network model? (3 pts.)
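One way to try several network sizes is sketched below with the nnet package, assuming that is the neural network routine in use (the assignment does not name one). The iris data stands in for the Cleveland data, and the candidate sizes, weight decay, and iteration limit are arbitrary illustrative choices.

```r
library(nnet)  # recommended package distributed with R (assumed to be the one used)

# Stand-in data: iris ships with R; Species plays the role of the response.
set.seed(1)
test_rows <- sample(nrow(iris), 50)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

# Fit one single-hidden-layer network per candidate size and record
# the misclassification rate on the held-out rows.
sizes <- c(2, 4, 8)
errs <- sapply(sizes, function(s) {
  fit  <- nnet(Species ~ ., data = train, size = s, decay = 0.01,
               maxit = 500, trace = FALSE)
  pred <- predict(fit, newdata = test, type = "class")
  mean(pred != test$Species)
})
names(errs) <- sizes
errs
```

Network fits depend on random starting weights, so results vary between runs unless the seed is fixed. Picking the size by test-set error also makes the reported error somewhat optimistic.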


f) Use a naïve Bayes classifier (the e1071 one) to develop a prediction rule using the training data set and predict the test cases. What is the APER based on your test case predictions? (3 pts.)
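A minimal sketch of the e1071 naïve Bayes workflow follows, using iris as stand-in data since the Cleveland file is not reproduced here. The split size and variable names are assumptions.

```r
library(e1071)  # provides naiveBayes()

# Stand-in data: iris ships with R; Species plays the role of the response.
set.seed(1)
test_rows <- sample(nrow(iris), 50)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

# Fit on the training rows, then predict class labels for the test rows.
nb   <- naiveBayes(Species ~ ., data = train)
pred <- predict(nb, newdata = test)

table(actual = test$Species, predicted = pred)
nb_error <- mean(as.character(pred) != as.character(test$Species))
nb_error
```

By default predict returns class labels; type = "raw" returns the posterior probabilities instead.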





PROBLEM 2 - SATELLITE IMAGE DATA

The goal here is to predict the type of ground cover from a satellite image broken up into pixels.

Description from UCI Machine Learning database:

The database consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image, and the classification associated with the central pixel in each neighborhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number.



The Landsat satellite data is one of the many sources of information available for a scene. The interpretation of a scene by integrating spatial data of diverse types and resolutions including multispectral and radar data, maps indicating topography, land use etc. is expected to assume significant importance with the onset of an era characterized by integrative approaches to remote sensing (for example, NASA's Earth Observing System commencing this decade). Existing statistical methods are ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered in isolation (as in this sample database). This data satisfies the important requirements of being numerical and at a single resolution, and standard maximum-likelihood classification performs very well. Consequently, for this data, it should be interesting to compare the performance of other methods against the statistical approach.



One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels.



The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighborhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighborhood and a number indicating the classification label of the central pixel. The number is a code for the following classes:



Number  Class
------  -----
1       red soil
2       cotton crop
3       grey soil
4       damp grey soil
5       soil with vegetation stubble
6       mixture class (all types present)
7       very damp grey soil



Note: There are no examples with class 6 in this dataset.



The data is given in random order and certain lines of data have been removed so you cannot reconstruct the original image from this dataset.



In each line of data the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17, 18, 19 and 20.
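The attribute ordering described above can be written as a small index calculation. The helper name attr_index is hypothetical; it assumes rows, columns, and bands are all counted from 1.

```r
# Map (row, column, band) within the 3x3 neighborhood to an attribute number.
# Pixels are read left-to-right, top-to-bottom, with four band values per pixel.
attr_index <- function(row, col, band) {
  pixel <- 3 * (row - 1) + col   # pixel position 1..9
  4 * (pixel - 1) + band         # attribute position 1..36
}

attr_index(1, 1, 1:4)  # top-left pixel: 1 2 3 4
attr_index(2, 2, 1:4)  # central pixel: 17 18 19 20
attr_index(3, 3, 1:4)  # bottom-right pixel: 33 34 35 36
```

The central pixel is row 2, column 2, which is pixel 5, so its bands occupy attributes 4 * 4 + 1 through 4 * 4 + 4, matching the 17 to 20 stated in the description.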

You can read the data into R from the file satimage.txt in the Shared folder on Class Storage using the command below:

> SATimage = read.table(file.choose(),header=T,sep=" ")    # be sure to put a space between the quotes!

> SATimage = data.frame(SATimage[,1:36],class=as.factor(SATimage$class))

This command makes sure that the response is interpreted as a factor (categorical) rather than as a number. Use SATimage as the data frame throughout.

Create a test and training set using the code below:

> set.seed(888)    # this ensures you all have the same data!!!

> testcases = sample(1:dim(SATimage)[1],1000,replace=F)

> SATtest = SATimage[testcases,]

> SATtrain = SATimage[-testcases,]

a) Compare sknn, naïve Bayes, neural network, lda, qda, and rda classification of the test cases. Which method performs best for these data?

b) Write your own Monte Carlo cross-validation (MCCV) routines for lda, qda, and rda classification. Demonstrate their use with the full SAT image dataset. Which of these methods performs best?
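The general shape of an MCCV routine can be sketched as below. This is one possible design under stated assumptions, not the required solution: the function name mccv, its arguments, the 25% test fraction, and the number of splits are all hypothetical, and iris stands in for the satellite data. It relies on lda and qda from the MASS package, whose predict methods return a list with a class component.

```r
library(MASS)  # provides lda() and qda()

# Monte Carlo cross-validation: repeatedly draw a random test set, fit on the
# remaining rows, and record the misclassification rate on the test rows.
# fit_fun must accept (formula, data = ...) and its predict() result must
# contain a $class component, as lda and qda fits do.
mccv <- function(fit_fun, formula, data, p = 0.25, B = 100) {
  response <- all.vars(formula)[1]   # name of the response variable
  n        <- nrow(data)
  n_test   <- floor(p * n)
  errs     <- numeric(B)
  for (b in seq_len(B)) {
    test_rows <- sample(n, n_test)
    fit  <- fit_fun(formula, data = data[-test_rows, ])
    pred <- predict(fit, newdata = data[test_rows, ])$class
    errs[b] <- mean(as.character(pred) !=
                    as.character(data[test_rows, response]))
  }
  errs
}

# Stand-in data: iris ships with R; Species plays the role of class.
set.seed(1)
lda_errs <- mccv(lda, Species ~ ., iris, B = 50)
qda_errs <- mccv(qda, Species ~ ., iris, B = 50)
c(lda = mean(lda_errs), qda = mean(qda_errs))
```

Returning the full vector of per-split errors, rather than only the mean, makes it possible to look at the spread across splits when comparing methods. An rda fit could be passed in the same way provided its predict result also exposes a class component, which should be checked against the package documentation.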