# STAT 6601 Classification: Neural Network

19 Oct 2013

V&R 12.2

By Gary Gongwer,

Classification

Classification is a multivariate technique concerned with
assigning data cases (i.e. observations) to one of a fixed number
of possible classes (represented by nominal output variables).

The goal of classification is to sort observations into two or more labeled classes. The emphasis
is on deriving a rule that can be used to optimally assign new objects to the labeled classes.

In short, the aim of classification is to assign input cases to one
of a number of classes.

Simple Pattern Classification Example

Let us consider a simple problem of
distinguishing handwritten versions of
the characters ‘a’ and ‘b’.

We seek an algorithm which can
distinguish as reliably as possible
between the two characters.

Therefore, the goal in this classification
problem is to develop an algorithm
which will assign any image,
represented by a vector x, to one of
two classes, which we shall denote by
Ck, where k = 1, 2, so that class C1
corresponds to the character ‘a’ and
class C2 corresponds to ‘b’.

Example

A large number of input variables can present severe
problems for pattern recognition systems. One
technique to alleviate such problems is to combine
input variables together to make a smaller number of
new variables called features.

In the present example we could evaluate the ratio of
the height of the character to its width (x1), and we
might expect that characters from class C2
(corresponding to ‘b’) will typically have larger values of
x1 than the characters from class C1 (corresponding to ‘a’).

How can we make the best use of x1 to classify a
new image so as to minimize the number of
misclassifications?

One approach would be to build a classifier
system which uses a threshold for the value of x1
and which classifies as C2 any image for which x1
exceeds the threshold, and which classifies all
other images as C1.

The number of misclassifications will be
minimized if we choose the threshold to be at the
point where the two histograms cross.

This classification procedure is based on the
evaluation of x1 followed by its comparison with a
threshold.

The problem with this classification procedure: there is
still significant overlap of the histograms, so
many of the new characters we test will be
misclassified.
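The threshold rule above can be sketched in R. This is only an illustration under stated assumptions: the height/width ratios are simulated from two normal distributions (the slides give no data for this example), and the names `classify`, `thresholds` and `best` are hypothetical.

```r
# Simulated height/width ratios (assumed values, not taken from the slides)
set.seed(1)
x1_a  <- rnorm(200, mean = 1.0, sd = 0.3)   # class C1 ('a')
x1_b  <- rnorm(200, mean = 1.8, sd = 0.3)   # class C2 ('b')
x1    <- c(x1_a, x1_b)
truth <- rep(c("C1", "C2"), each = 200)

# Classify as C2 when x1 exceeds the threshold, otherwise as C1
classify <- function(x, threshold) ifelse(x > threshold, "C2", "C1")

# Count misclassifications for a grid of candidate thresholds
thresholds <- seq(min(x1), max(x1), length.out = 201)
n_errors   <- sapply(thresholds, function(t) sum(classify(x1, t) != truth))
best       <- thresholds[which.min(n_errors)]

cat("best threshold:", round(best, 2),
    " training error rate:", min(n_errors) / length(x1), "\n")
```

Because the two simulated distributions overlap, even the best threshold leaves some cases misclassified, which is the limitation the slide describes.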

Now consider another feature x2.
We try to classify new images on
the basis of the values of x1 and
x2.

We see examples of patterns from
two classes plotted in the (x1,x2)
space. It is possible to draw a line
in this space, known as the
decision boundary which gives
good separation of the two classes.

New patterns which lie above the
decision boundary are classified as
belonging to C1 while patterns
falling below the decision boundary
are classified as C2.
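A linear decision boundary in the (x1, x2) space can be written as a one-line rule. The sketch below assumes a hypothetical line x2 = slope * x1 + intercept; the slope, the intercept and the function name `classify2` are made up for illustration and do not come from the figure.

```r
# Hypothetical boundary x2 = slope * x1 + intercept (illustrative values)
slope     <- -1
intercept <- 3

classify2 <- function(x1, x2) {
  # Above the line -> C1, on or below the line -> C2
  ifelse(x2 > slope * x1 + intercept, "C1", "C2")
}

# Two example patterns: the first lies above the line, the second below it
classify2(x1 = c(1, 2.5), x2 = c(2.5, 0.2))
```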

We could continue to consider a larger number of
independent features in the hope of improving
the performance.

Instead we could aim to build a classifier which
has the smallest probability of making a mistake.

Classification Theory

In the terminology of pattern recognition, the given
examples together with their classifications are known
as the training set and future cases form the test set.

Our primary measure of success is the error (or
misclassification) rate.

The confusion matrix gives the number of cases with true
class i classified as class j.

Assign costs $L_{ij}$ to allocating a case of class i to class j.
We are then interested in the average error cost
rather than the error rate.
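As a small illustration of these quantities, the sketch below builds a confusion matrix with `table()` and computes the error rate and the average error cost. The labels and the cost matrix `L` are invented for the example and are not part of the Cushing's analysis.

```r
# Toy labels (assumed for illustration)
truth     <- factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c"))
predicted <- factor(c("a", "a", "b", "b", "b", "b", "c", "c", "c", "a"),
                    levels = levels(truth))

# Confusion matrix: rows = true class i, columns = assigned class j
confusion <- table(true = truth, assigned = predicted)

# Error rate = proportion of cases off the diagonal
error_rate <- 1 - sum(diag(confusion)) / sum(confusion)

# Hypothetical cost matrix L[i, j]; correct decisions cost 0,
# and misclassifying a true 'c' is made five times as costly
L <- matrix(c(0, 1, 1,
              1, 0, 1,
              5, 5, 0), nrow = 3, byrow = TRUE,
            dimnames = dimnames(confusion))

# Average error cost = sum over i, j of L[i, j] * n[i, j], divided by n
avg_cost <- sum(L * confusion) / sum(confusion)

print(confusion)
cat("error rate:", error_rate, " average cost:", avg_cost, "\n")
```

With unequal costs the average cost (0.7 here) can differ markedly from the error rate (0.3), which is why the cost is the quantity of interest.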

Average Error Cost

The average error cost is minimized by the Bayes rule,
which is to allocate to the class c minimizing

$$\sum_i L_{ic}\, p(i \mid x)$$

where p(i|x) is the posterior distribution of the classes
after observing x.

If the costs of all errors are the same, this rule amounts
to choosing the class c with the largest posterior
probability p(c|x).

The minimum average cost is known as the Bayes risk.
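A worked instance of the Bayes rule, as a sketch only: the posterior probabilities and the cost matrix below are hypothetical numbers chosen so that the minimum-cost class differs from the class with the largest posterior.

```r
# Hypothetical posterior probabilities p(i | x) for one observation x
posterior <- c(a = 0.5, b = 0.3, c = 0.2)

# Hypothetical costs L[i, j] of allocating a true class i case to class j
L <- matrix(c(0, 1, 1,
              1, 0, 1,
              5, 5, 0), nrow = 3, byrow = TRUE,
            dimnames = list(true = names(posterior),
                            assigned = names(posterior)))

# Expected cost of allocating to each class: sum over i of L[i, c] * p(i | x)
expected_cost <- drop(posterior %*% L)
bayes_class   <- names(which.min(expected_cost))

# With equal costs the rule reduces to the largest posterior probability
map_class <- names(which.max(posterior))

print(expected_cost)
cat("minimum-cost class:", bayes_class, " largest posterior:", map_class, "\n")
```

Here the expected costs are 1.3, 1.5 and 0.8 for a, b and c, so the rule allocates to c even though a has the largest posterior.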

Classification and Regression

We can represent the outcome of the classification in terms of a
variable y which takes the value 1 if the image is classified as C1,
and the value 0 if it is classified as C2.

$y_k = y_k(x; w)$

where w denotes the vector of parameters, often called weights.

The importance of neural networks in this context is that they
offer a very powerful and very general framework for
representing non-linear mappings from several input variables to
several output variables where the form of the mapping is
governed by a number of adjustable parameters.

Objective: Simulate the Behavior of a Human Nerve

Inputs are accumulated by a weighted sum.

This sum is the input for the output function φ.
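A single neuron of this kind can be sketched in a few lines of R. The logistic function is assumed here as the output function φ (one common choice; the slide does not name it), and the inputs, weights and the name `neuron` are hypothetical.

```r
# Logistic output function phi(z) = 1 / (1 + exp(-z)) (an assumed choice)
phi <- function(z) 1 / (1 + exp(-z))

# A neuron: weighted sum of the inputs plus a bias, passed through phi
neuron <- function(x, w, bias) phi(sum(w * x) + bias)

x <- c(0.5, -1.0)        # hypothetical inputs
w <- c(2.0, 1.0)         # hypothetical weights
neuron(x, w, bias = 0)   # the weighted sum is 0 here, so the output is 0.5
```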

A single neuron is not very flexible

Input layer contains the value of each variable

Hidden layer allows approximations by combining
multiple logistic functions

Output neuron with highest probability determines
class
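The layer structure described above can be illustrated with a hand-written forward pass. This is a sketch under assumptions: logistic hidden units, a softmax output layer, randomly drawn weights, and hypothetical names (`forward`, `W1`, `W2`, `bias1`, `bias2`). It is not how `nnet` is implemented internally.

```r
logistic <- function(z) 1 / (1 + exp(-z))
softmax  <- function(z) { e <- exp(z - max(z)); e / sum(e) }

# Forward pass: inputs -> hidden layer (logistic units) -> outputs (softmax)
forward <- function(x, W1, bias1, W2, bias2) {
  hidden <- logistic(W1 %*% x + bias1)       # one value per hidden unit
  drop(softmax(W2 %*% hidden + bias2))       # class probabilities, summing to 1
}

set.seed(1)
n_in <- 2; n_hidden <- 3; n_out <- 3
# Small random weights, purely illustrative (an untrained network)
W1    <- matrix(runif(n_hidden * n_in, -0.5, 0.5), n_hidden, n_in)
W2    <- matrix(runif(n_out * n_hidden, -0.5, 0.5), n_out, n_hidden)
bias1 <- rep(0, n_hidden)
bias2 <- rep(0, n_out)

p <- forward(c(1.13, 2.46), W1, bias1, W2, bias2)
names(p) <- c("a", "b", "c")
predicted_class <- names(which.max(p))   # output unit with the highest probability
print(p)
```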

Regression = Learning

The weights are adjusted iteratively (batch or on-line)

Initially, they are random and small

Weight decay (λ) keeps weights from becoming
too large

Backpropagation

Uses partial derivatives and chain rule
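To make the chain rule concrete, here is a minimal sketch for a single logistic unit with squared error and a weight-decay penalty λ Σ w². The training case, step size and function names (`loss`, `grad`) are assumptions for illustration, and plain gradient descent is used only for simplicity; it is not claimed to be the optimiser used in the analysis that follows.

```r
phi <- function(z) 1 / (1 + exp(-z))

# Error for one case: squared error plus weight decay lambda * sum(w^2)
loss <- function(w, x, y, lambda) (phi(sum(w * x)) - y)^2 + lambda * sum(w^2)

# Gradient by the chain rule:
#   dE/dw = dE/dout * dout/dz * dz/dw  +  d(decay)/dw
grad <- function(w, x, y, lambda) {
  out <- phi(sum(w * x))
  2 * (out - y) * out * (1 - out) * x + 2 * lambda * w
}

# Hypothetical single training case
x <- c(1.0, 0.5); y <- 1; lambda <- 0.01
set.seed(1)
w <- runif(2, -0.1, 0.1)                 # small random starting weights

loss_start <- loss(w, x, y, lambda)
for (step in 1:200) w <- w - 0.5 * grad(w, x, y, lambda)   # gradient descent
loss_end <- loss(w, x, y, lambda)

cat("loss before:", loss_start, " loss after:", loss_end, "\n")
```

The decay term adds 2λw to the gradient, which pulls every weight toward zero at each step and so keeps the weights from growing without bound.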

Avoiding Local Maxima

Make weights initially random

Use multiple runs and take the average

An Example: Cushing’s Syndrome

Cushing’s syndrome is a
hypertensive disorder
associated with over-secretion of cortisol by
the adrenal gland.

Three recognized types of the
syndrome, plus cases of unknown type:

a: adenoma

b: bilateral hyperplasia

c: carcinoma

u: unknown type

The observations are
urinary excretion rates
(mg/24hr) of the
steroid metabolites
tetrahydrocortisone (T) and
pregnanetriol (P), and are considered on
a log scale.

Cushing’s Syndrome Data

    Tetrahydrocortisone Pregnanetriol Type
a1                  3.1         11.70    a
a2                  3.0          1.30    a
a3                  1.9          0.10    a
a4                  3.8          0.04    a
a5                  4.1          1.10    a
a6                  1.9          0.40    a
b1                  8.3          1.00    b
b2                  3.8          0.20    b
b3                  3.9          0.60    b
b4                  7.8          1.20    b
b5                  9.1          0.60    b
b6                 15.4          3.60    b
b7                  7.7          1.60    b
b8                  6.5          0.40    b
b9                  5.7          0.40    b
b10                13.6          1.60    b
c1                 10.2          6.40    c
c2                  9.2          7.90    c
c3                  9.6          3.10    c
c4                 53.8          2.50    c
c5                 15.8          7.60    c
u1                  5.1          0.40    u
u2                 12.9          5.00    u
u3                 13.0          0.80    u
u4                  2.6          0.10    u
u5                 30.0          0.10    u
u6                 20.5          0.80    u

R Code

library(MASS); library(class); library(nnet)

# log-transformed predictors for the 21 cases of known type (a, b, c)
cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
# indicator matrix of the known types
tpi <- class.ind(Cushings$Type[1:21, drop = TRUE])
# grid on the log scale for drawing decision boundaries
xp <- seq(0.6, 4.0, length = 100); np <- length(xp)
yp <- seq(-3.25, 2.45, length = 100)
cushT <- expand.grid(Tetrahydrocortisone = xp, Pregnanetriol = yp)

pltnn <- function(main, ...) {
  plot(Cushings[, 1], Cushings[, 2], log = "xy", type = "n",
       xlab = "Tetrahydrocortisone", ylab = "Pregnanetriol", main = main, ...)
  for (il in 1:4) {
    set <- Cushings$Type == levels(Cushings$Type)[il]
    text(Cushings[set, 1], Cushings[set, 2],
         as.character(Cushings$Type[set]), col = 2 + il)
  }
}
# pltnn plots T and P against each other by type (a, b, c, u)

> cush <- log(as.matrix(Cushings[, -3]))[1:21, ]
> cush
    Tetrahydrocortisone Pregnanetriol
a1            1.1314021    2.45958884
a2            1.0986123    0.26236426
a3            0.6418539   -2.30258509
a4            1.3350011   -3.21887582
a5            1.4109870    0.09531018
a6            0.6418539   -0.91629073
b1            2.1162555    0.00000000
b2            1.3350011   -1.60943791
b3            1.3609766   -0.51082562
b4            2.0541237    0.18232156
b5            2.2082744   -0.51082562
b6            2.7343675    1.28093385
b7            2.0412203    0.47000363
b8            1.8718022   -0.91629073
b9            1.7404662   -0.91629073
b10           2.6100698    0.47000363
c1            2.3223877    1.85629799
c2            2.2192035    2.06686276
c3            2.2617631    1.13140211
c4            3.9852735    0.91629073
c5            2.7600099    2.02814825

> tpi <- class.ind(Cushings$Type[1:21, drop = TRUE])
> tpi
      a b c
 [1,] 1 0 0
 [2,] 1 0 0
 [3,] 1 0 0
 [4,] 1 0 0
 [5,] 1 0 0
 [6,] 1 0 0
 [7,] 0 1 0
 [8,] 0 1 0
 [9,] 0 1 0
[10,] 0 1 0
[11,] 0 1 0
[12,] 0 1 0
[13,] 0 1 0
[14,] 0 1 0
[15,] 0 1 0
[16,] 0 1 0
[17,] 0 0 1
[18,] 0 0 1
[19,] 0 0 1
[20,] 0 0 1
[21,] 0 0 1

# fit a network and draw its decision boundaries on the current plot
plt.bndry <- function(size = 0, decay = 0, ...) {
  cush.nn <- nnet(cush, tpi, skip = TRUE, softmax = TRUE, size = size,
                  decay = decay, maxit = 1000)
  invisible(b1(predict(cush.nn, cushT), ...))
}

cush: data frame of x values of examples.

tpi: data frame of target values of examples.

skip: switch to add skip-layer connections from input to output.

softmax: switch for softmax (log-linear model) and maximum conditional likelihood
fitting.

size: number of units in the hidden layer.

decay: parameter for weight decay.

maxit: maximum number of iterations.

invisible: return a (temporarily) invisible copy of an object.

predict: generic function for predictions from the results of various model fitting
functions. The function invokes particular methods which depend on the
'class' of the first argument. Here: using cush.nn to predict cushT.

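# Draw the class boundaries implied by a matrix Z of class probabilities.
# The continuation lines of both contour() calls are cut off in the slide;
# the arguments add, levels and drawlabels below are an assumed completion
# that overlays only the zero contour on the existing plot.
b1 <- function(Z, ...) {
  # boundary of class c: where p(c) equals the larger of p(a) and p(b)
  zp <- Z[, 3] - pmax(Z[, 2], Z[, 1])
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = TRUE, levels = 0, drawlabels = FALSE, ...)
  # boundary of class a: where p(a) equals the larger of p(b) and p(c)
  zp <- Z[, 1] - pmax(Z[, 3], Z[, 2])
  contour(exp(xp), exp(yp), matrix(zp, np),
          add = TRUE, levels = 0, drawlabels = FALSE, ...)
}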

par(mfrow = c(2, 2))

pltnn("Size = 2")
set.seed(1); plt.bndry(size = 2, col = 2)
set.seed(3); plt.bndry(size = 2, col = 3)
plt.bndry(size = 2, col = 4)

pltnn("Size = 2, lambda = 0.001")
set.seed(1); plt.bndry(size = 2, decay = 0.001, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.001, col = 4)

pltnn("Size = 2, lambda = 0.01")
set.seed(1); plt.bndry(size = 2, decay = 0.01, col = 2)
set.seed(2); plt.bndry(size = 2, decay = 0.01, col = 4)

pltnn("Size = 5, 20 lambda = 0.01")
set.seed(2); plt.bndry(size = 5, decay = 0.01, col = 1)
set.seed(2); plt.bndry(size = 20, decay = 0.01, col = 2)

# functions pltnn and b1 are in the scripts

pltnn("Many local maxima")
Z <- matrix(0, nrow(cushT), ncol(tpi))
# fit 20 networks from different random starts and accumulate the predictions
for (iter in 1:20) {
  set.seed(iter)
  cush.nn <- nnet(cush, tpi, skip = TRUE, softmax = TRUE, size = 3,
                  decay = 0.01, maxit = 1000, trace = FALSE)
  Z <- Z + predict(cush.nn, cushT)
  cat("final value", format(round(cush.nn$value, 3)), "\n")
  b1(predict(cush.nn, cushT), col = 2, lwd = 0.5)
}

# boundaries from the summed (equivalently, averaged) predictions
pltnn("Averaged")
b1(Z, lwd = 3)

References

Bishop, C.M. (1995) Neural Networks for Pattern Recognition. Oxford: Clarendon Press.

Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.