Diabetes Predictor Variable Evaluation Homework


R: 3-7-13


Purpose

The purpose of this assignment is to practice using data visualization and SVM in VisMiner to evaluate which predictor variables to keep in the model and to evaluate the quality of the results. Although we will use SVM in this exercise, this overall process of evaluating variables for inclusion/exclusion in the final model generalizes to other problems and data mining (DM) models. In your group research project, for example, you will look at other DM algorithms in addition to SVM to see which algorithm produces the best results.


1. Problem Background

This data was taken from the UCI Machine Learning Repository. It contains data on females of Pima Indian heritage who are at least 21 years old. Pima Indians live in the state of Arizona. They are an interesting population to study for diabetes because they have a higher incidence of diabetes than the general U.S. population.

Number of records: 392

Variables

Var#  Name       Description
1     Diabetes   Class variable (Y or N)
2     Pregnant   Number of times pregnant
3     Glucose    Plasma glucose concentration 2 hours after ingesting glucose
6     Insulin    2-hour serum insulin (mu U/ml)
4     Diastolic  Diastolic blood pressure (mm Hg)
7     BMI        Body mass index (weight in kg/(height in m)^2)
5     Skinfold   Triceps skin fold thickness (mm)
8     FamHist    Diabetes pedigree function
9     Age        Age (years)


Glucose. Oral glucose tests are used to diagnose pre-diabetes and diabetes. After a person ingests oral glucose, blood glucose is measured. Blood glucose tests have ranges for people who are normal, people who are somewhat impaired in their ability to absorb glucose, and diabetics, who are severely limited in their ability to absorb glucose.

Normal    Impaired   Diabetes
65-139    140-200    > 200
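
The ranges above can be turned into a small lookup, which is handy when you want to label points on a scatter plot by glucose category. This is only a sketch: the function name is hypothetical, and the label for readings below 65 is an assumption because the table does not define one.

```python
def glucose_category(reading):
    """Map a 2-hour oral glucose reading to the category in the table above."""
    if reading > 200:
        return "Diabetes"
    if reading >= 140:
        return "Impaired"
    if reading >= 65:
        return "Normal"
    # The table gives no label below 65; this label is an assumption.
    return "Below tabulated range"


for value in (90, 139, 140, 200, 201):
    print(value, glucose_category(value))
```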




Insulin. Insulin is the chemical that signals the cells within the body to absorb glucose. Past research has found that the insulin level in the blood depends on age, weight, gender, and many other things, so it is difficult to create a common insulin standard for too much, just the right amount, and too little insulin.

BMI. The Body Mass Index is a standardized measure of whether a person is overweight. It takes into account weight and height so that the index works for people of all heights.

Normal      Overweight   Obese
18.5-24.9   25-29.9      ≥ 30
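
As a worked illustration of the BMI formula from the variable list (weight in kg divided by height in m squared) and the categories above, here is a minimal sketch. The function names and the example weights and heights are made up for illustration, and the label for values under 18.5 is an assumption since the table does not name that range.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg / (height in m)^2."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to the category in the table above."""
    if value >= 30:
        return "Obese"
    if value >= 25:
        return "Overweight"
    if value >= 18.5:
        return "Normal"
    # The table does not label values under 18.5; this label is an assumption.
    return "Below normal range"


# Illustrative (made-up) people, not records from the dataset.
for weight, height in ((70, 1.75), (80, 1.75), (95, 1.75)):
    value = bmi(weight, height)
    print(f"{weight} kg, {height} m -> BMI {value:.1f} ({bmi_category(value)})")
```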


Skinfold assessments are taken with calipers to measure loose flesh. Obese people have much more fat and therefore larger skinfold measures.

A conceptual causal network may be created to give some structure that can be used to help guide analysis. The question underlying this process is: what might cause what?



It is reasonable to assume that diabetes does not cause number of times pregnant, age, and
diabetes pedigree function (heredity).


This dataset has two measures related to whether someone is of normal weight or overweight: triceps skin fold thickness and BMI (body mass index). These measurements do not cause a person to be overweight; rather, being overweight causes these measurements to be high. Unfortunately, this makes "overweight" a hidden variable in the network. Nevertheless, we must understand these measures of being overweight to see how they might relate to predicting diabetes. Plasma glucose concentration and the serum insulin measurements are both tests for diabetes, so I have diabetes causing these.

In the diagram, dashed boxes surround the variables that may measure the same thing (possible collinearity problems). Part of your task will be to determine if both of the variables in each dashed box should be retained in the model.




Here are the steps you should follow:

1. Understand the problem by reading or researching the background of the problem. Understand what each variable means and consider what variables might cause other variables. In this assignment, I have provided background on the problem and the variables are explained.

2. Create a correlation matrix in VisMiner to see which variables are related to the outcome variable and which variables are highly related to each other. I have done this for you in this assignment.

3. Perform data visualization on highly correlated predictor variables to get a preliminary idea of whether both are necessary or if one will suffice. Use a scatter plot that includes one predictor variable on the X-axis and the other predictor on the Y-axis and that shows category (diabetes = Y or N) as a third dimension in the scatter plot. Save a screen capture of each visualization in this document. These are just preliminary assessments because without the SVM model it will be impossible to control for the effect of multiple variables at the same time.


4. Perform data visualization of other predictor variables. One at a time, put each predictor variable on both the X and Y axis of the scatter plot. Then show the outcome variable category (diabetes = Y or N) to make preliminary judgments about which variables are likely to be eliminated and which will be kept. As in Step 3, these are just preliminary assessments. Save a screen capture of each visualization in this document.

5. Create a baseline SVM model in VisMiner using all potential predictor variables. I have done this step. This creates a baseline model that you can use to see if eliminating one or more variables will make a difference in the model's overall % error. To control for overfitting, use a training and validation set: partition the data as 292 training and 100 validation records. Record the % error from the training and validation datasets for the overall baseline model:

6. Use a process of elimination with SVM models to determine whether each potential predictor variable should be retained in the model. Use training and validation datasets as in Step 5 to control for overfitting. Each time you create a model, record the % error from the training and validation datasets. State your decision as to whether to keep or exclude this variable in the final model.

7. Create and evaluate the model with just the retained variables. After you have eliminated non-contributing or low-contributing variables, run the model with the remaining (keeper) predictor variables. Interpret the results of the model with a confusion matrix in terms of overall accuracy, overall error, sensitivity, and specificity.


This document has some of the work done to reduce the overall amount of work required and to provide examples that you may use as reference points.


2. Preliminary Predictor Variable Evaluation with Correlation Matrix

To start determining which variables we might want to include and exclude, run a correlation matrix. Focus on variables that are highly related to one another and on how each variable relates to the outcome variable (diabetes). I have done this for you in this homework problem.


When you do this for other problems in the future, I recommend that you print out a picture of the correlation table. Move the cursor over each cell to get the correlation value. I have added them to the diagram for highly correlated variables and each predictor with diabetes. See the correlation matrix below.
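
For readers who want to check a cell of the correlation matrix by hand, each value is a Pearson correlation between two columns. The sketch below computes it with the Python standard library only. The function names are hypothetical, and the toy columns are made-up numbers for illustration, not values from the diabetes dataset or from VisMiner.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every column with every other column (dict of dicts)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns (made up for illustration only).
toy = {
    "BMI": [22.0, 27.5, 31.0, 35.2, 40.1],
    "Skinfold": [15, 25, 28, 36, 41],
    "Age": [45, 23, 31, 52, 28],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Pairs of predictors with a correlation close to 1 or -1 are the ones to examine for possible collinearity.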




3. Perform data visualization of highly correlated predictor variables


Look at the highly correlated predictor variables. Pregnancy is highly related to Age, but this is not necessarily helpful for predicting diabetes because it is only logical that the number of pregnancies increases with age up to a certain point. Let’s look at the remaining highly correlated variables: (BMI and Skinfold) and (Glucose and Insulin).


Example: BMI and Skinfold measurements are both indicators of whether someone is unusually thin, normal weight, or overweight. The linear relationship between the variables is high (correlation is .664).





Here is how I conducted a preliminary analysis to make an educated guess as to whether both predictor variables are likely to be included in the model.

I created an X, Y, Category scatter plot. I put BMI on the X axis and Skinfold on the Y axis. I then added the category outcome variable, Diabetes, to get a preliminary view of how the two variables relate to the Y/N outcomes on the outcome variable. This scatterplot is provided below.



Preliminary Results: The scatterplot shows visually that there does seem to be a noticeable linear relationship between Skinfold and BMI. As both increase, the occurrence of diabetes seems to increase. It is possible only one will be necessary. Furthermore, in light of the BMI standards for normal weight, overweight, and obesity, the results are interesting. It appears that diabetes is rare among people with a low BMI. For example, there is virtually no diabetes when BMI is below 25 (normal weight).
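
A visual impression such as "virtually no diabetes when BMI is below 25" can be backed up with a simple count. The sketch below computes the share of diabetic records on each side of a BMI threshold. The function name is hypothetical and the records are made-up toy values chosen only to show the mechanics; they are not taken from the dataset.

```python
def diabetes_rate_by_band(records, threshold=25.0):
    """Share of records with diabetes == "Y" below and at/above a BMI threshold."""
    counts = {"below": [0, 0], "at_or_above": [0, 0]}  # [diabetic, total]
    for bmi_value, diabetes in records:
        band = "below" if bmi_value < threshold else "at_or_above"
        counts[band][1] += 1
        if diabetes == "Y":
            counts[band][0] += 1
    # None marks a band with no records, to avoid dividing by zero.
    return {band: (d / t if t else None) for band, (d, t) in counts.items()}


# Toy (BMI, Diabetes) pairs, made up for illustration only.
toy_records = [
    (21.0, "N"), (23.5, "N"), (24.8, "N"), (27.0, "N"),
    (29.1, "Y"), (33.4, "Y"), (36.0, "N"), (41.2, "Y"),
]
rates = diabetes_rate_by_band(toy_records)
print(rates)
```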


Student Homework Requirement: Perform a similar visual analysis to determine whether Glucose and Insulin appear to have a linear relationship with each other.

[Insert your visualization screen capture and your conclusion from your evaluation here]











4. Perform data visualization of other predictor variables.

Example: Pregnancies.

Pregnancies appear to matter. As they increase, the proportion of the population that has diabetes relative to those who do not have diabetes increases. We are not sure whether we will keep this variable in the model in the long run. We will have to apply the controls of adding other variables at the same time and the process of elimination to tell if pregnancy actually matters when combined with other predictors. Here we are getting an initial sense of patterns.


Preliminary decision: keep the Pregnancies variable.

Student Homework Requirement: Perform a similar visual analysis to draw preliminary conclusions on the other potential predictor variables. Include your screen captures of their visualizations and your preliminary conclusions here in the document.


5. Create a baseline SVM model in VisMiner using all potential predictor variables.


Now that we have done the preliminary scatterplot visualizations for each potential predictor variable, I have built a data-mining (DM) model in VisMiner with the SVM Classifier to see if the DM models support our decision to include these variables. Because we want to control for overfitting from here on out, we will create a training set (292 records) and a validation set (100 records). We will use both training and validation partitions for each and every model test.
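
To make the 292/100 partition concrete, here is a minimal sketch of a random split using only the Python standard library. How VisMiner itself draws the partition is not described in this handout, so the shuffle-and-slice approach, the function name, and the fixed seed are all assumptions made for illustration.

```python
import random


def partition(records, n_train=292, n_valid=100, seed=0):
    """Shuffle a copy of the records and slice off training and validation sets."""
    if n_train + n_valid > len(records):
        raise ValueError("not enough records for the requested partition sizes")
    shuffled = list(records)
    # A seeded generator makes the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_valid]


# Stand-in for the 392 records: just their row numbers.
record_ids = list(range(392))
train, valid = partition(record_ids)
print(len(train), len(valid))
```

Changing the seed gives a different split, which is one reason error rates can differ by a point or two between runs.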


Below are the results of creating an SVM model with all variables before excluding any variables.

                                          Error
Model Name        Variable(s) Included    Training   Validation
Initial Baseline  All Variables           4.8%       28%





6. Use a process of elimination with SVM models to determine whether each potential predictor variable should be retained in the model

Let’s deal first with the predictor variables that might cause a collinearity problem: (BMI and Skinfold) and (Glucose and Insulin). First, I will examine the effect of using Glucose and Insulin as an example. Afterwards, you will do a similar analysis on BMI and Skinfold.

Example: SVM with possible inclusion of Glucose and Insulin

Model Name        Variable(s) Included    Training   Validation
Initial Baseline  All Variables           4.8%       28.0%
Baseline 2        Glucose                 23.3%      21.0%
                  Insulin                 30.1%      37.0%
                  Glucose and Insulin     24.7%      22.0%



The model with Glucose alone has the lowest validation error (21.0%). It is slightly better than the model with both variables (22.0%) and much better than the model with Insulin alone (37.0%), so we keep Glucose and drop Insulin. In addition, the Glucose-only model outperforms the original baseline model (lower validation error), so it becomes the new baseline model (Baseline 2). In other words, it becomes the new model to beat. We will compare all future models to this model. A new model will replace this model only if the new model does a better job of classification with validation data.
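
The comparison can be made mechanical by ranking the models on validation error. The sketch below uses the error rates from the Glucose and Insulin table above; the function names are hypothetical. It also reports the gap between validation and training error, which is a rough signal of overfitting (the all-variables model fits the training data far better than the validation data).

```python
# (training % error, validation % error) from the Glucose/Insulin table above.
results = {
    "All Variables": (4.8, 28.0),
    "Glucose": (23.3, 21.0),
    "Insulin": (30.1, 37.0),
    "Glucose and Insulin": (24.7, 22.0),
}


def best_by_validation(models):
    """Name of the model with the lowest validation error."""
    return min(models, key=lambda name: models[name][1])


def overfit_gap(models):
    """Validation error minus training error for each model."""
    return {name: round(valid - train, 1) for name, (train, valid) in models.items()}


best = best_by_validation(results)
gaps = overfit_gap(results)
print(best)
print(gaps)
```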


Student Homework Requirement: Run SVM with possible combinations of BMI and Skinfold.

Now we will evaluate adding other variables to see if they produce a better model.

Example: Test Family History

I tested whether adding Family History to the variables in the Baseline 2 model improved the classification results.




                                          Error
Model Name        Variable(s) Included    Training   Validation
Initial Baseline  All Variables           4.8%       28.0%
Baseline 2        Glucose                 23.3%      21.0%
                  Glucose + FamHistory    21.9%      23.0%


Conclusion: This is worse than the Baseline 2 model, so drop the Family History variable.

Student Homework Requirement: Add additional predictor variables to the Baseline 2 model to decide which variables to drop and which to keep.

There is no one right way to test all of the possibilities. I like to add other predictors, one at a time, to model Baseline 2. If adding a variable reduces error, keep the variable. Otherwise, discard the variable. If you create a model with lower error, that becomes the new baseline. Test the addition of variables to the new baseline to see if you can find a model better than the new baseline.
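
The one-at-a-time procedure described above can be sketched as a short loop. This is an illustration, not a VisMiner feature: the function name is hypothetical, and `validation_error` stands in for whatever you do to obtain a validation error for a set of variables (in this assignment, running the SVM in VisMiner and reading off the result). The demonstration looks up the validation errors already reported in the tables above.

```python
def forward_select(baseline, candidates, validation_error):
    """Try adding each candidate to the current baseline; keep it only if error drops."""
    best_vars = frozenset(baseline)
    best_err = validation_error(best_vars)
    for var in candidates:
        trial = best_vars | {var}
        err = validation_error(trial)
        if err < best_err:
            # The trial model becomes the new baseline (the new model to beat).
            best_vars, best_err = trial, err
    return best_vars, best_err


# Validation errors (%) reported in the tables above.
documented = {
    frozenset({"Glucose"}): 21.0,
    frozenset({"Glucose", "Insulin"}): 22.0,
    frozenset({"Glucose", "FamHist"}): 23.0,
}

kept, err = forward_select({"Glucose"}, ["Insulin", "FamHist"], documented.__getitem__)
print(sorted(kept), err)
```

Because the result depends on the order in which candidates are tried, running the procedure more than once with different orders is a reasonable check.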

Test BMI and Skinfold separately by adding each to the new baseline model.

Note: I went through the overall process of elimination twice and found multiple models that produced validation error rates of 18%, so your final model should have a similar validation error. It may be off by one or two percent because of sampling differences, but not more than that.

I tested 13 combinations of variables to be confident that I had “a best solution.” You may use fewer or more overall tests. Fortunately, VisMiner makes it fast to test models.

If you have multiple “best” models as measured by validation error, use the model with the fewest variables and/or the highest AUC to pick the best model.

7. Create and Evaluate the SVM Model with just the Keeper Predictor Variables and the Outcome Variable.


Now that you have identified the variables to keep in the SVM model, run the model and view the classification matrix in the VisMiner viewer so you can see the various cells of the matrix. Interpret that classification matrix in terms of overall accuracy, overall error, sensitivity, and specificity.

Note that the confusion matrix provided by VisMiner has the cells in a different order than that produced by XLMiner, and it reports percents rather than counts. But you can still derive the needed information to conduct your evaluations.
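
The four measures can be computed from the four cells of the confusion matrix as sketched below. The sketch assumes that diabetes = Y is the positive class; the function name and the example cell values are made up for illustration and are not results from this assignment. Because every measure is a ratio, the same function should also work if each cell is given as a percent of all records rather than a count, assuming that is how the percents are expressed.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, error, sensitivity, and specificity from confusion-matrix cells.

    tp: actual Y predicted Y    fn: actual Y predicted N
    fp: actual N predicted Y    tn: actual N predicted N
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # share of actual Y cases correctly found
        "specificity": tn / (tn + fp),  # share of actual N cases correctly found
    }


# Hypothetical cell values for 100 validation records (illustration only).
metrics = classification_metrics(tp=25, fn=8, fp=10, tn=57)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```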