Diabetes Predictor Variable Evaluation Homework
R: 3-7-13
Purpose
The purpose of this assignment is to practice using data visualization and SVM in VisMiner to evaluate which predictor variables to keep in the model and to evaluate the quality of the results. Although we will use SVM in this exercise, this overall process of evaluating variables for inclusion/exclusion in the final model generalizes to other problems and data mining (DM) models. In your group research project, for example, you will look at other DM algorithms in addition to SVM to see which algorithm produces the best results.
1. Problem Background
This data was taken from the UCI Data Mining Repository. It contains data on females of Pima Indian heritage who are at least 21 years old. Pima Indians live in the state of Arizona. Pima Indians are an interesting population to study for diabetes because they have a higher incidence of diabetes than the general U.S. population.

Number of records: 392
Variables

Var#   Name        Description
1      Diabetes    Class variable (Y or N)
2      Pregnant    Number of times pregnant
3      Glucose     Plasma glucose concentration 2 hours after ingesting glucose
6      Insulin     2-hour serum insulin (mu U/ml)
4      Diastolic   Diastolic blood pressure (mm Hg)
7      BMI         Body mass index (weight in kg/(height in m)^2)
5      Skinfold    Triceps skin fold thickness (mm)
8      FamHist     Diabetes pedigree function
9      Age         Age (years)
Glucose. Oral glucose tests are used to diagnose pre-diabetes and diabetes. After the patient ingests oral glucose, blood glucose is measured. Blood glucose tests have ranges for people who are normal, people who are somewhat impaired in their ability to absorb glucose, and diabetics, who are severely limited in their ability to absorb glucose.

Normal     Impaired    Diabetes
65-139     140-200     > 200
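The glucose ranges above can be expressed as a small lookup, which may help when reading the Glucose axis of a scatter plot. This is a minimal Python sketch, not part of VisMiner; the function name `glucose_category` and the label used for readings below 65 are assumptions made for illustration.

```python
def glucose_category(glucose):
    """Map a 2-hour plasma glucose reading to the ranges in the table above."""
    if glucose > 200:
        return "Diabetes"
    if glucose >= 140:
        return "Impaired"
    if glucose >= 65:
        return "Normal"
    # The table does not cover readings below 65; this label is an assumption.
    return "Below listed ranges"


for reading in (90, 155, 230):
    print(reading, glucose_category(reading))
```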
Insulin. Insulin is the chemical that signals the cells within the body to absorb glucose. Past research has found that the insulin level in the blood depends on age, weight, gender, and many other things, so it is difficult to create a common insulin standard for too much, just the right amount, and too little insulin.
BMI. The Body Mass Index is a standardized measure of whether a person is overweight. It takes into account weight and height so that the index works for people of all heights.

Normal        Overweight    Obese
18.5-24.9     25-29.9       ≥ 30
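As a worked illustration of the BMI formula from the variable list (weight in kg divided by height in m squared) and the bands above, here is a minimal Python sketch. The function names and the "Below normal" label for values under 18.5 are assumptions, and values that fall in the small gaps between the listed ranges (for example 24.95) are assigned to the lower band.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in m, squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to the bands in the table above."""
    if value >= 30:
        return "Obese"
    if value >= 25:
        return "Overweight"
    if value >= 18.5:
        return "Normal"
    # The table lists no band under 18.5; this label is an assumption.
    return "Below normal"


example = bmi(70, 1.75)  # a hypothetical person: 70 kg, 1.75 m tall
print(round(example, 1), bmi_category(example))
```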
Skinfold assessments are taken with calipers to measure loose flesh. Obese people have much more fat and therefore larger skinfold measures.
A conceptual causal network may be created to give some structure that can be used to help guide analysis. The question underlying this process is: what might cause what?

It is reasonable to assume that diabetes does not cause number of times pregnant, age, or diabetes pedigree function (heredity).

This dataset has two measures related to whether someone is of normal weight or overweight: triceps skin fold thickness and BMI (body mass index). These measurements do not cause a person to be overweight; rather, being overweight causes these measurements to be high. Unfortunately, this makes "overweight" a hidden variable in the network. Nevertheless, we must understand these measures of being overweight to understand how they might relate to predicting diabetes. Plasma glucose concentration and the serum insulin measurements are both tests for diabetes, so I have diabetes causing these. In the diagram, dashed boxes surround the variables that may measure the same thing (possible collinearity problems). Part of your task will be to determine if both of the variables in each dashed box should be retained in the model.
Here are the steps you should follow:
1. Understand the problem by reading or researching the background of the problem. Understand what each variable means and consider what variables might cause other variables. In this assignment, I have provided background on the problem and the variables are explained.
2. Create a correlation matrix in VisMiner to see which variables are related to the outcome variable and which variables are highly related to each other. I have done this for you in this assignment.
3. Perform data visualization on highly correlated predictor variables to get a preliminary idea of whether both are necessary or if one will suffice. Use a scatter plot that includes one predictor variable on the X-axis and the other predictor on the Y-axis and that shows category (diabetes = Y or N) as a third dimension in the scatter plot. Save a screen capture of each visualization in this document. These are just preliminary assessments because without the SVM model it will be impossible to control for the effect of multiple variables at the same time.
4. Perform data visualization of other predictor variables. One at a time, put each predictor variable on both the X and Y axis of the scatter plot. Then show the outcome variable category (diabetes = Y or N) to make preliminary judgments about which variables are likely to be eliminated and which will be kept. As in Step 3, these are just preliminary assessments. Save a screen capture of each visualization in this document.
5. Create a baseline SVM model in VisMiner using all potential predictor variables. I have done this step. This creates a baseline model that you can use to see if eliminating one or more variables will make a difference in the model's overall % error. To control for overfitting, use a training and validation set: partition the data as 292 training and 100 validation records. Record the % error from the training and validation datasets for the overall baseline model:
6. Use a process of elimination with SVM models to determine whether each potential predictor variable should be retained in the model. Use training and validation datasets as in Step 5 to control for overfitting. Each time you create a model, record the % error from the training and validation datasets. State your decision as to whether to keep or exclude this variable in the final model.
7. Create and Evaluate the Model with just the retained Variables. After you have eliminated non-contributing or low-contributing variables, run the model with the remaining (keeper) predictor variables. Interpret the results of the model with a confusion matrix in terms of overall accuracy, overall error, sensitivity, and specificity.
This document has some of the work done to reduce the overall amount of work required and to provide examples that you may use as reference points.
2. Preliminary Predictor Variable Evaluation with Correlation Matrix
To start determining which variables we might want to include and exclude, run a correlation matrix. Focus on variables that are highly related to one another and focus on variables as they relate to the outcome variable (diabetes). I have done this for you in this homework problem.
When you do this for other problems in the future, I recommend that you print out a picture of the correlation table. Move the cursor over each cell to get the correlation value. I have added the values to the diagram for highly correlated variables and for each predictor with diabetes. See the correlation matrix below.
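For readers who want to see what each cell of a correlation matrix holds, the following is a minimal Python sketch of the Pearson correlation computed pairwise over named columns. It is not how VisMiner is implemented; the function names are assumptions, and the `toy` columns are made-up numbers for illustration only, not values from the diabetes dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Made-up illustrative values, NOT taken from the diabetes dataset.
toy = {"BMI": [22.0, 27.5, 31.0, 35.5], "Skinfold": [18, 25, 30, 41]}
matrix = correlation_matrix(toy)
print(round(matrix[("BMI", "Skinfold")], 3))
```

Values near +1 or -1 between two predictors flag the kind of possible collinearity the dashed boxes in the diagram point to.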
3. Perform data visualization of highly correlated predictor variables
Look at the highly correlated predictor variables. Pregnancy is highly related to Age, but this is not necessarily helpful for predicting diabetes because it is only logical that the number of pregnancies increases with age up to a certain point. Let's look at the remaining highly correlated variables: (BMI and Skinfold) and (Glucose and Insulin).
Example: BMI and Skinfold measurements are both indicators of whether someone is unusually thin, normal weight, or overweight. The linear relationship between the variables is high (correlation is .664).
Here is how I conducted a preliminary analysis to make an educated guess as to whether both predictor variables are likely to be included in the model. I created an X, Y, Category scatter plot. I put BMI on the X axis and Skinfold on the Y axis. I then added the category outcome variable, Diabetes, to get a preliminary view of how the two variables relate to the Y/N outcomes on the outcome variable. This scatterplot is provided below.
Preliminary Results: The scatterplot shows visually that there does seem to be a noticeable linear relationship between Skinfold and BMI. As both increase, the occurrence of diabetes seems to increase. It is possible only one will be necessary. Furthermore, in light of the BMI standards for normal weight, overweight, and obesity, the results are interesting. It appears that diabetes is rare among people with a low BMI. For example, there is virtually no diabetes when BMI is below 25 (normal weight).
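The visual impression that diabetes becomes more common as BMI rises can also be checked numerically by computing the share of diabetics within each BMI band. The sketch below is plain Python rather than anything VisMiner provides; the function names are assumptions and `toy_records` is made-up data for illustration, not the homework dataset.

```python
from collections import defaultdict


def diabetes_rate_by_band(records, band_of):
    """Share of records with diabetes == 'Y' inside each band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [diabetic count, total count]
    for value, has_diabetes in records:
        band = band_of(value)
        counts[band][1] += 1
        if has_diabetes == "Y":
            counts[band][0] += 1
    return {band: yes / total for band, (yes, total) in counts.items()}


def bmi_band(value):
    """Coarse bands based on the BMI standards quoted earlier."""
    if value >= 30:
        return "Obese"
    if value >= 25:
        return "Overweight"
    return "Normal or below"


# Made-up (BMI, Diabetes) pairs, NOT taken from the diabetes dataset.
toy_records = [
    (21.0, "N"), (23.5, "N"),
    (26.0, "N"), (28.0, "Y"),
    (32.0, "Y"), (36.5, "Y"), (33.0, "N"),
]
rates = diabetes_rate_by_band(toy_records, bmi_band)
for band, rate in rates.items():
    print(band, round(rate, 2))
```

The same helper works for any other predictor by passing a different banding function.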
Student Homework Requirement: Perform a similar visual analysis to determine whether Glucose and Insulin appear to have a linear relationship.

[Insert your visualization screen capture and your conclusion from your evaluation here]
4. Perform data visualization of other predictor variables.
Example: Pregnancies. Pregnancies appear to matter. As they increase, the proportion of the population that has diabetes relative to those who do not have diabetes increases. We are not sure whether we will keep this variable in the model in the long run. We will have to apply the controls of adding other variables at the same time and the process of elimination to tell if pregnancy actually matters when combined with other predictors. Here we are getting an initial sense of patterns.

Preliminary decision: keep the Pregnancies variable.
Student Homework Requirement: Perform a similar visual analysis to draw preliminary conclusions on the other potential predictor variables. Include your screen captures of their visualizations and your preliminary conclusions here in the document.
5. Create a baseline SVM model in VisMiner using all potential predictor variables.
Now that we have done the preliminary scatterplot visualizations for each potential predictor variable, I have built a data-mining (DM) model in VisMiner with the SVM Classifier to see if the DM models support our decision to include these variables. Because we want to control for overfitting from here on out, we will create a training set (292 records) and a validation set (100 records). We will use both training and validation partitions for each and every model test.

Below are the results of creating an SVM model with all variables before excluding any variables.
Model Name         Variable(s) Included   Training Error   Validation Error
Initial Baseline   All Variables          4.8%             28%
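The 292/100 partition described above can be sketched outside VisMiner as a seeded random shuffle followed by a cut. This is an illustrative Python sketch only; the `partition` function, its `seed` parameter, and the use of integer record IDs are assumptions, and VisMiner's own partitioning may select records differently.

```python
import random


def partition(records, n_train, seed=0):
    """Shuffle a copy of the records and split it into training and validation sets."""
    if n_train > len(records):
        raise ValueError("n_train exceeds the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    return shuffled[:n_train], shuffled[n_train:]


record_ids = list(range(392))  # stand-ins for the 392 records
train, valid = partition(record_ids, 292)
print(len(train), len(valid))
```

Because the split is random, a different seed gives a different validation set, which is one reason error rates can vary by a point or two between runs.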
6. Use a process of elimination with SVM models to determine whether each potential predictor variable should be retained in the model
Let's deal first with the predictor variables that might cause a collinearity problem: (BMI and Skinfold) and (Glucose and Insulin). First, I will examine the effect of using Glucose and Insulin as an example. Afterwards, you will do a similar analysis on BMI and Skinfold.
Example: SVM with possible inclusion of Glucose and Insulin

Model Name         Variable(s) Included   Training Error   Validation Error
Initial Baseline   All Variables          4.8%             28.0%
Baseline 2         Glucose                23.3%            21.0%
                   Insulin                30.1%            37.0%
                   Glucose and Insulin    24.7%            22.0%
The model with Glucose alone has the lowest validation error (21.0%): it outperforms the model with Insulin alone (37.0%) and the model with both variables (22.0%), so we keep Glucose and drop Insulin. In addition, the Glucose model outperforms the original baseline model (lower validation error), so it becomes the new baseline model (Baseline 2). In other words, it becomes the new model to beat. We will compare all future models to this model. A new model will replace this model only if the new model does a better job of classification with validation data.
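One way to read the error table with overfitting in mind is to look at the gap between validation and training error for each model: a large positive gap suggests the model fits the training records much better than unseen ones. The Python sketch below uses the error percentages from the table above; the `overfit_gap` helper and the idea of summarizing by this gap are illustrative assumptions, not a VisMiner feature.

```python
# (training error %, validation error %) as reported in the table above.
errors = {
    "All Variables": (4.8, 28.0),
    "Glucose": (23.3, 21.0),
    "Insulin": (30.1, 37.0),
    "Glucose and Insulin": (24.7, 22.0),
}


def overfit_gap(training_error, validation_error):
    """Validation error minus training error, in percentage points."""
    return validation_error - training_error


gaps = {name: round(overfit_gap(t, v), 1) for name, (t, v) in errors.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
```

The all-variables model shows by far the largest gap, which is consistent with its very low training error not carrying over to the validation records.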
Student Homework Requirement: Run SVM with possible combinations of BMI and Skinfold.

Now we will evaluate adding other variables to see if they produce a better model.
Example: Test Family History

I tested whether adding Family History to the variables in the Baseline 2 model improved the classification results.

Model Name         Variable(s) Included   Training Error   Validation Error
Initial Baseline   All Variables          4.8%             28.0%
Baseline 2         Glucose                23.3%            21.0%
                   Glucose + FamHistory   21.9%            23.0%

Conclusion: This is worse than the Baseline 2 model, so drop the Family History variable.
Student Homework Requirement: Add additional predictor variables to the Baseline 2 model to decide which variables to drop and which to keep. There is no one right way to test all of the possibilities. I like to add other predictors, one at a time, to model Baseline 2. If adding a variable reduces error, keep the variable. Otherwise, discard the variable. If you create a model with lower error, that becomes the new baseline. Test the addition of variables to the new baseline to see if you can find a model better than the new baseline. Test BMI and Skinfold separately by adding each to the new baseline model.
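The add-one-at-a-time procedure described above can be written as a short greedy loop. The Python sketch below is one possible reading of it, not the only valid search order; `forward_select` and `lookup_error` are hypothetical names, and the `evaluate` callback stands in for building an SVM model in VisMiner and reading off its validation error. The `reported` dictionary contains only validation errors that appear in the tables in this document.

```python
def forward_select(baseline_vars, baseline_error, candidates, evaluate):
    """Try adding each candidate in turn; keep it only if validation error drops."""
    best_vars = list(baseline_vars)
    best_error = baseline_error
    for candidate in candidates:
        trial_vars = best_vars + [candidate]
        trial_error = evaluate(trial_vars)
        if trial_error < best_error:
            # The improved model becomes the new baseline to beat.
            best_vars, best_error = trial_vars, trial_error
    return best_vars, best_error


# Validation errors (%) reported in the tables above, keyed by variable set.
reported = {
    frozenset(["Glucose"]): 21.0,
    frozenset(["Glucose", "Insulin"]): 22.0,
    frozenset(["Glucose", "FamHist"]): 23.0,
}


def lookup_error(variables):
    """Stand-in for running a model: return the reported validation error."""
    return reported[frozenset(variables)]


kept, error = forward_select(["Glucose"], 21.0, ["Insulin", "FamHist"], lookup_error)
print(kept, error)
```

With only these reported numbers, neither Insulin nor Family History lowers the validation error, so the baseline stays at Glucose alone; the remaining predictors would be tested the same way with your own VisMiner results.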
Note: I went through the overall process of elimination twice and found multiple models that produced validation error rates of 18%, so your final model should find a similar validation error. It may be off by one or two percent because of sampling differences, but not more than that.
I tested 13 combinations of variables to be confident that I had "a best solution." You may use fewer or more overall tests. Fortunately, VisMiner makes it fast to test models.
If you have multiple "best" models as measured by validation error, use the model with the fewest variables and/or the highest AUC to pick the best model.
7. Create and Evaluate the SVM Model with just the Keeper Predictor Variables and the Outcome Variable.
Now that you have identified the variables to keep in the SVM model, run the model and view the classification matrix in the VisMiner viewer so you can see the various cells of the matrix. Interpret that classification matrix in terms of overall accuracy, overall error, sensitivity, and specificity.
Note that the confusion matrix provided by VisMiner has the cells in a different order than that produced by XLMiner, and it produces answers in percents rather than counts. But you can still derive the needed information to conduct your evaluations.
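The four measures can be derived from the cells of the confusion matrix as in the Python sketch below, treating Diabetes = Y as the positive class. The function name and the example cell values are hypothetical, not results from this dataset. The same formulas apply to percents as long as each cell is a percent of all records, which is an assumption about how VisMiner reports them.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, error, sensitivity, and specificity from confusion-matrix cells.

    tp: actual Y predicted Y    fn: actual Y predicted N
    fp: actual N predicted Y    tn: actual N predicted N
    """
    total = tp + fn + fp + tn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("each actual class needs at least one record")
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # share of actual Y classified as Y
        "specificity": tn / (tn + fp),  # share of actual N classified as N
    }


# Hypothetical cell values for a 100-record validation set.
metrics = classification_metrics(tp=20, fn=10, fp=5, tn=65)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```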