STAT 6601
Classification:
Neural Networks
V&R 12.2
By Gary Gongwer, Madhu Iyer, Mable Kong
Classification
Classification is a multivariate technique concerned with assigning data cases (i.e. observations) to one of a fixed number of possible classes (represented by nominal output variables).
The goal of classification is to sort observations into two or more labeled classes. The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes.
In short, the aim of classification is to assign input cases to one of a number of classes.
Simple Pattern Classification Example
Let us consider a simple problem of
distinguishing handwritten versions of
the characters ‘a’ and ‘b’.
We seek an algorithm which can
distinguish as reliably as possible
between the two characters.
Therefore, the goal in this classification problem is to develop an algorithm which will assign any image, represented by a vector x, to one of two classes, which we shall denote by Ck, where k = 1, 2, so that class C1 corresponds to the character ‘a’ and class C2 corresponds to ‘b’.
Example
A large number of input variables can present severe problems for pattern recognition systems. One technique to alleviate such problems is to combine input variables together to make a smaller number of new variables called features.
In the present example we could evaluate the ratio of the height of the character to its width (x1), and we might expect that characters from class C2 (corresponding to ‘b’) will typically have larger values of x1 than characters from class C1 (corresponding to ‘a’).
How can we make the best use of x1 to classify a
new image so as to minimize the number of
misclassifications?
One approach would be to build a classifier
system which uses a threshold for the value of x1
and which classifies as C2 any image for which x1
exceeds the threshold, and which classifies all
other images as C1.
The number of misclassifications will be
minimized if we choose the threshold to be at the
point where the two histograms cross.
This classification procedure is based on the
evaluation of x1 followed by its comparison with a
threshold.
A problem with this classification procedure: there is still significant overlap of the histograms, and many of the new characters we test will be misclassified.
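The threshold rule described above can be sketched in a few lines of R. The feature values below are made up for illustration, and the threshold is chosen here by minimizing training misclassifications over the observed values rather than by locating the histogram crossing:

```r
# Hypothetical height/width ratios (x1) for a few training characters
x1  <- c(0.8, 0.9, 1.0, 1.1, 1.4, 1.5, 1.6, 1.8)
cls <- c("a", "a", "a", "a", "b", "b", "b", "b")

# For each candidate threshold t, classify as 'b' (C2) when x1 > t
# and count the resulting misclassifications on the training data
errs   <- sapply(x1, function(t) sum((x1 > t) != (cls == "b")))
thresh <- x1[which.min(errs)]

# Classify a new image by comparing its x1 to the chosen threshold
classify <- function(x) ifelse(x > thresh, "b", "a")
classify(c(0.9, 1.7))  # -> "a" "b"
```

With these made-up values the best threshold falls at 1.1, which separates the two groups exactly; real handwritten characters overlap, which is why the slides go on to add a second feature.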
Now consider another feature x2.
We try to classify new images on
the basis of the values of x1 and
x2.
We see examples of patterns from
two classes plotted in the (x1,x2)
space. It is possible to draw a line
in this space, known as the
decision boundary which gives
good separation of the two classes.
New patterns which lie above the
decision boundary are classified as
belonging to C1 while patterns
falling below the decision boundary
are classified as C2.
We could continue to consider a larger number of independent features in the hope of improving the performance.
Instead, we could aim to build a classifier which has the smallest probability of making a mistake.
Classification Theory
In the terminology of pattern recognition, the given
examples together with their classifications are known
as the training set and future cases form the test set.
Our primary measure of success is the error or
(misclassification) rate.
The confusion matrix gives the number of cases with true class i classified as of class j.
Assign costs Lij to allocating a case of class i to class j.
Therefore we are interested in the average error cost rather than the error rate.
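As an illustrative sketch (the labels below are made up, not from the Cushings data), a confusion matrix and error rate can be computed with R's `table()`:

```r
# Made-up true and predicted labels for six cases
true <- factor(c("a", "a", "b", "b", "b", "c"))
pred <- factor(c("a", "b", "b", "b", "c", "c"), levels = levels(true))

# Confusion matrix: rows = true class i, columns = assigned class j
cm <- table(true, pred)
cm

# Error (misclassification) rate: proportion of off-diagonal cases
err <- 1 - sum(diag(cm)) / sum(cm)
err  # -> 1/3: two of the six cases are misclassified
```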
Average Error Cost
The average error cost is minimized by the Bayes rule, which is to allocate to the class c minimizing ∑i Lic p(i | x), where p(i | x) is the posterior distribution of the classes after observing x.
If the costs of all errors are the same, this rule amounts to choosing the class c with the largest posterior probability p(c | x).
The minimum average cost is known as the Bayes risk.
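A minimal sketch of the Bayes rule, with made-up posterior probabilities and a made-up loss matrix chosen so that unequal costs change the allocation:

```r
# Hypothetical posterior probabilities p(i | x) for three classes
post <- c(a = 0.5, b = 0.3, c = 0.2)

# Loss matrix: L[i, j] is the cost of allocating a class-i case to class j.
# Missing class c is made very expensive here.
L <- matrix(c( 0,  1, 1,
               1,  0, 1,
              10, 10, 0), nrow = 3, byrow = TRUE,
            dimnames = list(names(post), names(post)))

# Expected cost of allocating to class j: sum_i L[i, j] * p(i | x)
avg.cost <- colSums(L * post)
names(which.min(avg.cost))  # Bayes allocation under these costs -> "c"

# With equal error costs the rule reduces to the largest posterior
names(which.max(post))      # -> "a"
```

Note how the costly class c is chosen even though a has the largest posterior; with equal costs the two rules coincide.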
Classification and Regression
We can represent the outcome of the classification in terms of a variable y which takes the value 1 if the image is classified as C1, and the value 0 if it is classified as C2.
yk = yk(x; w)
where w denotes the vector of parameters, often called weights.
The importance of neural networks in this context is that they offer a very powerful and very general framework for representing non-linear mappings from several input variables to several output variables, where the form of the mapping is governed by a number of adjustable parameters.
Objective: Simulate the Behavior of a
Human Nerve
Inputs are accumulated by a weighted sum.
This sum is the input for the output function φ.
A single neuron is not very flexible.
The input layer contains the value of each variable.
The hidden layer allows approximations by combining multiple logistic functions.
The output neuron with the highest probability determines the class.
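A minimal sketch of this weighted-sum-plus-activation unit in R (the weights and inputs below are arbitrary; the logistic function plays the role of φ):

```r
# Logistic activation function, playing the role of phi
phi <- function(z) 1 / (1 + exp(-z))

# One neuron: weighted sum of the inputs plus a bias, passed through phi
neuron <- function(x, w, b) phi(sum(w * x) + b)

x <- c(1.2, -0.7)                       # input variables
neuron(x, w = c(0.5, 1.0), b = -0.1)    # a single unit's output

# A hidden layer is just several such units in parallel;
# an output unit then combines the hidden-layer outputs
hidden <- c(neuron(x, c(0.5, 1.0), -0.1),
            neuron(x, c(-1.0, 0.3), 0.2))
neuron(hidden, w = c(1.0, -1.0), b = 0)
```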
Regression = Learning
The weights are adjusted iteratively (batch or on-line).
Initially, they are random and small.
Weight decay (λ) keeps weights from becoming too large.
Backpropagation
Adjusts weights “back to front”
Uses partial derivatives and chain rule
Avoiding Local Maxima
Make weights initially random
Use multiple runs and take the average
An Example: Cushing’s Syndrome
Cushing’s syndrome is a hypertensive disorder associated with over-secretion of cortisol by the adrenal gland.
There are three recognized types of the syndrome:
a: adenoma
b: bilateral hyperplasia
c: carcinoma
(u: unknown type)
The observations are urinary excretion rates (mg/24hr) of the steroid metabolites tetrahydrocortisone (T) and pregnanetriol (P), and are considered on the log scale.
Cushing’s Syndrome Data
Tetrahydrocortisone Pregnanetriol Type
a1 3.1 11.70 a
a2 3.0 1.30 a
a3 1.9 0.10 a
a4 3.8 0.04 a
a5 4.1 1.10 a
a6 1.9 0.40 a
b1 8.3 1.00 b
b2 3.8 0.20 b
b3 3.9 0.60 b
b4 7.8 1.20 b
b5 9.1 0.60 b
b6 15.4 3.60 b
b7 7.7 1.60 b
b8 6.5 0.40 b
b9 5.7 0.40 b
b10 13.6 1.60 b
c1 10.2 6.40 c
c2 9.2 7.90 c
c3 9.6 3.10 c
c4 53.8 2.50 c
c5 15.8 7.60 c
u1 5.1 0.40 u
u2 12.9 5.00 u
u3 13.0 0.80 u
u4 2.6 0.10 u
u5 30.0 0.10 u
u6 20.5 0.80 u
R Code
library(MASS); library(class); library(nnet)
cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
tpi <- class.ind(Cushings$Type[1:21, drop = T])
xp <- seq(0.6, 4.0, length = 100); np <- length(xp)
yp <- seq(-3.25, 2.45, length = 100)
cushT <- expand.grid(Tetrahydrocortisone = xp, Pregnanetriol = yp)
pltnn <- function(main, ...) {
  plot(Cushings[, 1], Cushings[, 2], log = "xy", type = "n",
       xlab = "Tetrahydrocortisone", ylab = "Pregnanetriol", main = main, ...)
  for(il in 1:4) {
    set <- Cushings$Type == levels(Cushings$Type)[il]
    text(Cushings[set, 1], Cushings[set, 2],
         as.character(Cushings$Type[set]), col = 2 + il) }}
#pltnn plots T and P against each other by type (a, b, c, u)
> cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
> cush
    Tetrahydrocortisone Pregnanetriol
a1            1.1314021    2.45958884
a2            1.0986123    0.26236426
a3            0.6418539   -2.30258509
a4            1.3350011   -3.21887582
a5            1.4109870    0.09531018
a6            0.6418539   -0.91629073
b1            2.1162555    0.00000000
b2            1.3350011   -1.60943791
b3            1.3609766   -0.51082562
b4            2.0541237    0.18232156
b5            2.2082744   -0.51082562
b6            2.7343675    1.28093385
b7            2.0412203    0.47000363
b8            1.8718022   -0.91629073
b9            1.7404662   -0.91629073
b10           2.6100698    0.47000363
c1            2.3223877    1.85629799
c2            2.2192035    2.06686276
c3            2.2617631    1.13140211
c4            3.9852735    0.91629073
c5            2.7600099    2.02814825
> tpi <- class.ind(Cushings$Type[1:21, drop = T])
> tpi
a b c
[1,] 1 0 0
[2,] 1 0 0
[3,] 1 0 0
[4,] 1 0 0
[5,] 1 0 0
[6,] 1 0 0
[7,] 0 1 0
[8,] 0 1 0
[9,] 0 1 0
[10,] 0 1 0
[11,] 0 1 0
[12,] 0 1 0
[13,] 0 1 0
[14,] 0 1 0
[15,] 0 1 0
[16,] 0 1 0
[17,] 0 0 1
[18,] 0 0 1
[19,] 0 0 1
[20,] 0 0 1
[21,] 0 0 1
plt.bndry <- function(size = 0, decay = 0, ...) {
  cush.nn <- nnet(cush, tpi, skip = T, softmax = T, size = size,
                  decay = decay, maxit = 1000)
  invisible(b1(predict(cush.nn, cushT), ...)) }
cush – data frame of x values of examples.
tpi – data frame of target values of examples.
skip – switch to add skip-layer connections from input to output.
softmax – switch for softmax (log-linear model) and maximum conditional likelihood fitting.
size – number of units in the hidden layer.
decay – parameter for weight decay.
maxit – maximum number of iterations.
invisible – return a (temporarily) invisible copy of an object.
predict – generic function for predictions from the results of various model fitting functions. The function invokes particular _methods_ which depend on the 'class' of the first argument. Here: using cush.nn to predict cushT.
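Putting these arguments together, a usage sketch: fit nnet to the 21 labelled Cushings cases and tabulate a training-set confusion matrix from the predicted classes. The confusion-matrix step is our addition, not from the slides, and results vary with the random starting weights, so a seed is set:

```r
library(MASS); library(class); library(nnet)

# The 21 cases of known type, on the log scale, with indicator targets
cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
tpi  <- class.ind(Cushings$Type[1:21, drop = TRUE])

set.seed(1)
fit <- nnet(cush, tpi, skip = TRUE, softmax = TRUE, size = 2,
            decay = 0.01, maxit = 1000, trace = FALSE)

# Predicted class = column with the largest softmax output
pred <- colnames(tpi)[max.col(predict(fit, cush))]
table(true = Cushings$Type[1:21, drop = TRUE], predicted = pred)
```

A training-set confusion matrix is optimistic, of course; the slides' emphasis on weight decay and multiple random starts is aimed at how the fitted boundary behaves off the training data.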
b1 <- function(Z, ...) {
  zp <- Z[, 3] - pmax(Z[, 2], Z[, 1])
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = T, levels = 0, labex = 0, ...)
  zp <- Z[, 1] - pmax(Z[, 3], Z[, 2])
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = T, levels = 0, labex = 0, ...)
}
par(mfrow = c(2, 2))
pltnn("Size = 2")
set.seed(1); plt.bndry(size = 2, col = 2)
set.seed(3); plt.bndry(size = 2, col = 3)
plt.bndry(size = 2, col = 4)
pltnn("Size = 2, lambda = 0.001")
set.seed(1); plt.bndry(size = 2, decay = 0.001, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.001, col = 4)
pltnn("Size = 2, lambda = 0.01")
set.seed(1); plt.bndry(size = 2, decay = 0.01, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.01, col = 4)
pltnn("Size = 5, 20 lambda = 0.01")
set.seed(2); plt.bndry(size = 5, decay = 0.01, col = 1)
set.seed(2); plt.bndry(size = 20, decay = 0.01, col = 2)
# functions pltnn and b1 are in the scripts
pltnn("Many local maxima")
Z <- matrix(0, nrow(cushT), ncol(tpi))
for(iter in 1:20) {
  set.seed(iter)
  cush.nn <- nnet(cush, tpi, skip = T, softmax = T, size = 3,
                  decay = 0.01, maxit = 1000, trace = F)
  Z <- Z + predict(cush.nn, cushT)
  cat("final value", format(round(cush.nn$value, 3)), "\n")
b1(predict(cush.nn, cushT), col = 2, lwd = 0.5)
}
pltnn("Averaged")
b1(Z, lwd = 3)
References
Bishop, C.M. (1995) Neural Networks for Pattern
Recognition. Oxford: Clarendon Press.
Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.