Regression and Classification with Neural Networks

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Sep 25th, 2001
Copyright © 2001, 2003, Andrew W. Moore

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 2

Linear Regression

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear.

Simplest case: Out(x) = wx for some unknown w.

Given the data, we can estimate w.

DATASET (inputs and outputs):

    x1 = 1     y1 = 1
    x2 = 3     y2 = 2.2
    x3 = 2     y3 = 2
    x4 = 1.5   y4 = 1.9
    x5 = 4     y5 = 3.1

[Figure: the datapoints plotted with a fitted line; moving 1 unit along x changes the prediction by w.]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 3

1-parameter linear regression

Assume that the data is formed by

    y_i = w x_i + noise_i

where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²

So P(y|w,x) has a normal distribution with
• mean wx
• variance σ²

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 4

Bayesian Linear Regression

P(y|w,x) = Normal(mean wx, var σ²)

We have a set of datapoints (x1,y1), (x2,y2), … (xn,yn) which are EVIDENCE about w.

We want to infer w from the data:

    P(w | x1, x2, x3, … xn, y1, y2, … yn)

• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 5

Maximum likelihood estimation of w

Asks the question:

“For which value of w is this data most likely to have happened?”

<=>

For what w is  P(y1, y2, … yn | x1, x2, x3, … xn, w)  maximized?

<=>

For what w is  ∏_{i=1}^{n} P(y_i | w, x_i)  maximized?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 6

For what w is  ∏_{i=1}^{n} P(y_i | w, x_i)  maximized?

<=>  For what w is  ∏_{i=1}^{n} exp( -½ ((y_i - w x_i)/σ)² )  maximized?

<=>  For what w is  ∑_{i=1}^{n} -½ ((y_i - w x_i)/σ)²  maximized?

<=>  For what w is  ∑_{i=1}^{n} (y_i - w x_i)²  minimized?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 7

Linear Regression

The maximum likelihood w is the one that minimizes the sum-of-squares of residuals:

    Ε(w) = ∑_i (y_i - w x_i)²  =  (∑_i y_i²) - 2w (∑_i x_i y_i) + w² (∑_i x_i²)

We want to minimize a quadratic function of w.

[Figure: Ε(w) is a parabola in w.]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 8

Linear Regression

Easy to show the sum of squares is minimized when

    w = (∑ x_i y_i) / (∑ x_i²)

The maximum likelihood model is Out(x) = wx.

We can use it for prediction.
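As a concrete illustration (not from the slides), here is a minimal numerical sketch of this one-parameter estimator, using the example dataset from Slide 2:

```python
import numpy as np

# Example data from Slide 2: inputs x_i and outputs y_i.
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum-likelihood slope for the no-intercept model Out(x) = w*x:
#   w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)

def out(x_new, w=w):
    """Predict with the maximum-likelihood model Out(x) = w*x."""
    return w * x_new

print(w, out(2.5))
```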

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 9

Linear Regression

Note: in Bayesian stats you’d have ended up with a prob dist of w, p(w), and predictions would have given a prob dist of expected output. Often useful to know your confidence. Max likelihood can give some kinds of confidence too.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 10

Multivariate Regression

What if the inputs are vectors?

Dataset has the form:

    x_1   y_1
    x_2   y_2
    x_3   y_3
    :     :
    x_R   y_R

[Figure: 2-d input example — points in the (x1, x2) plane labeled with their output values.]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 11

Multivariate Regression

Write the matrices X and Y thus:

    X = [ x_11  x_12  …  x_1m ]        Y = [ y_1 ]
        [ x_21  x_22  …  x_2m ]            [ y_2 ]
        [  :     :         :  ]            [  :  ]
        [ x_R1  x_R2  …  x_Rm ]            [ y_R ]

(there are R datapoints; each input has m components — the k’th row of X is x_kᵀ)

The linear regression model assumes a vector w such that

    Out(x) = wᵀx = w_1 x[1] + w_2 x[2] + … + w_m x[m]

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 12


IMPORTANT EXERCISE: PROVE IT !!!!!
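A minimal sketch of the closed-form solution in code (the data here are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical example: R datapoints, each input with m components.
rng = np.random.default_rng(0)
R, m = 100, 3
X = rng.normal(size=(R, m))                       # row k is the input vector x_k
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + rng.normal(scale=0.1, size=R)    # y = w.x + Gaussian noise

# Maximum-likelihood weights: w = (X^T X)^{-1} (X^T Y).
# Solving the linear system is preferable to forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ Y)

def out(x_new, w=w):
    """Out(x) = w^T x"""
    return x_new @ w
```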

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 13

Multivariate Regression (con’t)

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)

XᵀX is an m × m matrix: its (i,j)’th element is  ∑_{k=1}^{R} x_ki x_kj

XᵀY is an m-element vector: its i’th element is  ∑_{k=1}^{R} x_ki y_k

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 14

What about a constant term?

We may expect linear data that does not go through the origin.

Statisticians and Neural Net Folks all agree on a simple obvious hack.

Can you guess??

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 15

The constant term

• The trick is to create a fake input “X0” that always takes the value 1.

Before:                        After:
  X1  X2  Y                      X0  X1  X2  Y
  2   4   16                     1   2   4   16
  3   4   17                     1   3   4   17
  5   5   20                     1   5   5   20

Before:  Y = w1 X1 + w2 X2  … has to be a poor model

After:   Y = w0 X0 + w1 X1 + w2 X2  =  w0 + w1 X1 + w2 X2  … has a fine constant term

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
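In code, the same hack is just a column of ones joined onto X before solving the normal equations (a sketch, continuing the hypothetical example above):

```python
import numpy as np

# X, Y, R as in the earlier multivariate sketch.
X_aug = np.hstack([np.ones((R, 1)), X])               # fake input X0 = 1 for every datapoint
w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
w0, w_rest = w_aug[0], w_aug[1:]                      # w0 is the constant term
```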

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 16

Regression with varying noise

• Suppose you know the variance of the noise that was added to each datapoint.

    x_i   y_i   σ_i²
    ½     ½     4
    1     1     1
    2     1     ¼
    2     3     4
    3     2     ¼

Assume  y_i ~ N(w x_i, σ_i²)

[Figure: the five datapoints plotted with error bars of width σ_i.]

What’s the MLE estimate of w?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 17

MLE estimation with varying noise

argmax_w  log p(y_1, y_2, …, y_R | x_1, …, x_R, σ_1², …, σ_R², w)

  = argmin_w  ∑_{i=1}^{R} (y_i - w x_i)² / σ_i²                       (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

  = the w such that  ∑_{i=1}^{R} x_i (y_i - w x_i) / σ_i² = 0          (setting dLL/dw equal to zero)

  = ( ∑_{i=1}^{R} x_i y_i / σ_i² ) / ( ∑_{i=1}^{R} x_i² / σ_i² )       (trivial algebra)
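A quick numerical sketch of this closed-form weighted estimate, using the table above (as reconstructed):

```python
import numpy as np

# Datapoints and per-point noise variances from the table above.
x = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])      # sigma_i^2

# MLE slope for y_i ~ N(w*x_i, sigma_i^2):
#   w = sum(x_i*y_i / sigma_i^2) / sum(x_i^2 / sigma_i^2)
w = np.sum(x * y / var) / np.sum(x ** 2 / var)
```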

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 18

This is Weighted Regression

• We are asking to minimize the weighted sum of squares

    argmin_w  ∑_{i=1}^{R} (y_i - w x_i)² / σ_i²

where the weight for the i’th datapoint is 1/σ_i².

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 19

Weighted Multivariate Regression

The max. likelihood w is  w = (XᵀWX)⁻¹(XᵀWY),  where W = diag(1/σ_1², …, 1/σ_R²).

XᵀWX is an m × m matrix: its (i,j)’th element is  ∑_{k=1}^{R} x_ki x_kj / σ_k²

XᵀWY is an m-element vector: its i’th element is  ∑_{k=1}^{R} x_ki y_k / σ_k²

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 20

Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

    x_i   y_i
    ½     ½
    1     2.5
    2     3
    3     2
    3     3

Assume  y_i ~ N(√(w + x_i), σ²)

What’s the MLE estimate of w?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 21

Non-linear MLE estimation

argmax_w  log p(y_1, y_2, …, y_R | x_1, …, x_R, σ, w)

  = argmin_w  ∑_{i=1}^{R} ( y_i - √(w + x_i) )²                        (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)

  = the w such that  ∑_{i=1}^{R} ( y_i - √(w + x_i) ) / √(w + x_i) = 0  (setting dLL/dw equal to zero)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 22

Non-linear MLE estimation

We’re down the algebraic toilet.

So guess what we do?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 23

Non-linear MLE estimation

Common (but not only) approach:

Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton’s Method

Also, special purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 24

GRADIENT DESCENT

Suppose we have a scalar function  f(w): ℜ → ℜ

We want to find a local minimum. Assume our current weight is w.

GRADIENT DESCENT RULE:

    w ← w - η ∂f/∂w (w)

η is called the LEARNING RATE: a small positive number, e.g. η = 0.05.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 25

GRADIENT DESCENT

QUESTION: Justify the Gradient Descent Rule.

(Recall Andrew’s favorite default value for anything: η = 0.05.)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 26

Gradient Descent in “m” Dimensions

Given  f(w): ℜᵐ → ℜ,

    ∇f(w) = [ ∂f/∂w_1, …, ∂f/∂w_m ]ᵀ

points in the direction of steepest ascent, and |∇f(w)| is the gradient in that direction.

GRADIENT DESCENT RULE:   w ← w - η ∇f(w)

Equivalently   w_j ← w_j - η ∂f/∂w_j (w)   … where w_j is the jth weight

“just like a linear feedback system”
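A minimal sketch of this rule in code (the function and its gradient here are hypothetical examples, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.05, n_steps=1000):
    """Minimize f by repeatedly stepping against its gradient.

    grad_f: function returning the gradient vector of f at w.
    w0:     initial weight vector.
    eta:    learning rate.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_f(w)
    return w

# Example: minimize f(w) = (w1 - 3)^2 + (w2 + 1)^2, whose gradient is 2(w - [3, -1]).
w_min = gradient_descent(lambda w: 2 * (w - np.array([3.0, -1.0])), w0=[0.0, 0.0])
```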

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 27

What’s all this got to do with Neural Nets, then, eh??

For supervised learning, neural nets are also models with vectors of w parameters in them. They are now called weights.

As before, we want to compute the weights to minimize sum-of-squared residuals — which turns out, under the “Gaussian i.i.d. noise” assumption, to be max. likelihood.

Instead of explicitly solving for the max. likelihood weights, we use GRADIENT DESCENT to SEARCH for them.

“Why?” you ask, a querulous expression in your eyes. “Aha!!” I reply: “We’ll see later.”

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 28

Linear Perceptrons

They are multivariate linear models:

    Out(x) = wᵀx

And “training” consists of minimizing the sum-of-squared residuals by gradient descent:

    Ε = ∑_k ( Out(x_k) - y_k )²  =  ∑_k ( wᵀx_k - y_k )²

QUESTION: Derive the perceptron training rule.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 29

Linear Perceptron Training Rule

    E = ∑_{k=1}^{R} ( y_k - wᵀx_k )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

    w_j ← w_j - η ∂E/∂w_j

So what’s ∂E/∂w_j ?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 30

Linear Perceptron Training Rule

    E = ∑_{k=1}^{R} ( y_k - wᵀx_k )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

    w_j ← w_j - η ∂E/∂w_j

So what’s ∂E/∂w_j ?

    ∂E/∂w_j = ∑_{k=1}^{R} ∂/∂w_j ( y_k - wᵀx_k )²

            = -2 ∑_{k=1}^{R} ( y_k - wᵀx_k ) ∂/∂w_j ( wᵀx_k )

            = -2 ∑_{k=1}^{R} δ_k ∂/∂w_j ( wᵀx_k )          …where…  δ_k = y_k - wᵀx_k

            = -2 ∑_{k=1}^{R} δ_k ∂/∂w_j ( ∑_{i=1}^{m} w_i x_ki )

            = -2 ∑_{k=1}^{R} δ_k x_kj

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 31

Linear Perceptron Training Rule

    E = ∑_{k=1}^{R} ( y_k - wᵀx_k )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

    w_j ← w_j - η ∂E/∂w_j        …where…   ∂E/∂w_j = -2 ∑_{k=1}^{R} δ_k x_kj

So:

    w_j ← w_j + 2η ∑_{k=1}^{R} δ_k x_kj

We frequently neglect the 2 (meaning we halve the learning rate).

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 32

The “Batch” perceptron algorithm

1) Randomly initialize weights w_1, w_2, … w_m

2) Get your dataset (append 1’s to the inputs if you don’t want to go through the origin).

3) for i = 1 to R:    δ_i := y_i - wᵀx_i

4) for j = 1 to m:    w_j ← w_j + η ∑_{i=1}^{R} δ_i x_ij

5) if ∑_i δ_i² stops improving then stop. Else loop back to 3.
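A runnable sketch of this batch algorithm (the stopping tolerance and the initialization scale are assumptions, not from the slides):

```python
import numpy as np

def batch_perceptron(X, y, eta=0.05, tol=1e-9, max_epochs=10000):
    """Batch training of a linear perceptron Out(x) = w.x by gradient descent.

    X is R x m (append a column of ones yourself if you want a constant term);
    y has length R.
    """
    R, m = X.shape
    w = np.random.default_rng(0).normal(scale=0.01, size=m)   # step 1
    prev_sse = np.inf
    for _ in range(max_epochs):
        delta = y - X @ w                 # step 3: residuals for all R datapoints
        w = w + eta * X.T @ delta         # step 4: batch weight update
        sse = np.sum(delta ** 2)          # step 5: stop when the SSE stops improving
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return w
```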

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 33

    δ_i ← y_i - wᵀx_i
    w_j ← w_j + η δ_i x_ij

A RULE KNOWN BY MANY NAMES:
• The LMS rule
• The delta rule
• The Widrow-Hoff rule
• Classical conditioning
• The adaline rule

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 34

If data is voluminous and arrives fast

Input-output pairs (x, y) come streaming in very quickly. THEN

Don’t bother remembering old ones. Just keep using new ones.

    observe (x, y)
    δ ← y - wᵀx
    ∀j:  w_j ← w_j + η δ x_j
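A minimal sketch of the streaming (online) update above:

```python
import numpy as np

def lms_update(w, x, y, eta=0.05):
    """One online (LMS / delta-rule) update from a single streaming pair (x, y)."""
    delta = y - w @ x
    return w + eta * delta * x
```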

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 35

Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If fewer than m iterations are needed we’ve beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):
• It’s moronic
• It’s essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations
• If m is small it’s especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can’t be sure when gradient descent will stop.


Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 37

But we’ll soon see that GD has an important extra trick up its sleeve.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 38

Perceptrons for Classification

What if all outputs are 0’s or 1’s? (two example datasets shown)

We can do a linear fit. Our prediction is 0 if out(x) ≤ ½, and 1 if out(x) > ½.

WHAT’S THE BIG PROBLEM WITH THIS???

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 39

[Figure: the linear fit; blue = Out(x).]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 40

[Figure: blue = Out(x); green = classification after thresholding at ½.]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 41

Classification with Perceptrons I

Don’t minimize   ∑_i ( y_i - wᵀx_i )².

Minimize the number of misclassifications instead:   ∑_i ( y_i - Round(wᵀx_i) )

[Assume outputs are +1 & -1, not +1 & 0]   where Round(x) = -1 if x < 0, and 1 if x ≥ 0.

The gradient descent rule can be changed to:

    if (x_i, y_i) correctly classed:   don’t change
    if wrongly predicted as 1:         w ← w - x_i
    if wrongly predicted as -1:        w ← w + x_i

NOTE: CUTE & NON OBVIOUS WHY THIS WORKS!!

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 42

Classification with Perceptrons II: Sigmoid Functions

[Figure captions:] Least squares fit useless. This fit would classify much better. But not a least squares fit.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 43

SOLUTION: Instead of  Out(x) = wᵀx  we’ll use  Out(x) = g(wᵀx),  where  g(x): ℜ → (0,1)  is a squashing function.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 44

The Sigmoid

    g(h) = 1 / (1 + exp(-h))

Note that if you rotate this curve through 180° centered on (0, ½) you get the same curve, i.e. g(h) = 1 - g(-h).

Can you prove this?
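For reference (not on the slide), a one-line verification of that symmetry:

```latex
1 - g(-h) \;=\; 1 - \frac{1}{1+e^{h}} \;=\; \frac{e^{h}}{1+e^{h}} \;=\; \frac{1}{1+e^{-h}} \;=\; g(h).
```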

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 45

The Sigmoid

Now we choose w to minimize

    ∑_{i=1}^{R} ( y_i - Out(x_i) )²  =  ∑_{i=1}^{R} ( y_i - g(wᵀx_i) )²

where  g(h) = 1 / (1 + exp(-h)).

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 46

Linear Perceptron Classification Regions

[Figure: points labeled 0 and 1 in the (X1, X2) plane.]

We’ll use the model  Out(x) = g( wᵀ(x,1) ) = g( w_1 x_1 + w_2 x_2 + w_0 )

Which region of the above diagram is classified with +1, and which with 0??

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 47

Gradient descent with sigmoid on a perceptron

First, notice that  g'(x) = g(x) (1 - g(x)).

Because:   g(x) = 1 / (1 + e⁻ˣ),   so

    g'(x) = e⁻ˣ / (1 + e⁻ˣ)²  =  (1 / (1 + e⁻ˣ)) · (e⁻ˣ / (1 + e⁻ˣ))  =  (1 / (1 + e⁻ˣ)) · (1 - 1 / (1 + e⁻ˣ))  =  g(x) (1 - g(x))

The error is

    Ε = ∑_i ( y_i - Out(x_i) )²    where  Out(x_i) = g(net_i)  and  net_i = ∑_k w_k x_ik

so

    ∂Ε/∂w_j = ∑_i -2 ( y_i - g(net_i) ) ∂/∂w_j g(net_i)
            = -2 ∑_i ( y_i - g(net_i) ) g'(net_i) x_ij
            = -2 ∑_i δ_i g(net_i) (1 - g(net_i)) x_ij

The sigmoid perceptron update rule:

    w_j ← w_j + η ∑_{i=1}^{R} δ_i g_i (1 - g_i) x_ij

where   g_i = g( ∑_{j=1}^{m} w_j x_ij )   and   δ_i = y_i - g_i
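A minimal sketch of this update rule in code:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def sigmoid_perceptron_epoch(w, X, y, eta=0.05):
    """One batch update of the sigmoid-perceptron rule: Out(x) = g(w.x).

    X is R x m, y has length R.
    """
    g = sigmoid(X @ w)                                  # g_i = g(sum_j w_j x_ij)
    delta = y - g                                       # delta_i = y_i - g_i
    return w + eta * X.T @ (delta * g * (1.0 - g))      # w_j += eta * sum_i delta_i g_i (1-g_i) x_ij
```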

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 48

Other Things about Perceptrons

• Invented and popularized by Rosenblatt (1962)
• Even with the sigmoid nonlinearity, correct convergence is guaranteed
• Stable behavior for overconstrained and underconstrained problems

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 49

Perceptrons and Boolean Functions

If inputs are all 0’s and 1’s and outputs are all 0’s and 1’s…

• Can learn the function x1 ∧ x2
• Can learn the function x1 ∨ x2
• Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5

QUESTION: WHY?

[Figure: two small perceptrons with inputs X1 and X2.]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 50

Perceptrons and Boolean Functions

• Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5

• Can learn the majority function
      f(x1, x2, … xn) = 1 if n/2 or more of the xi’s are 1
                        0 if fewer than n/2 of the xi’s are 1

• What about the exclusive or function?
      f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 51

Multilayer Networks

The class of functions representable by perceptrons is limited:

    Out(x) = g(wᵀx) = g( ∑_j w_j x_j )

Use a wider representation!

    Out(x) = g( ∑_j W_j g( ∑_k w_jk x_k ) )

This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 52

A 1-HIDDEN LAYER NET

    N_INPUTS = 2        N_HIDDEN = 3

    v_1 = g( ∑_{k=1}^{N_INS} w_1k x_k )
    v_2 = g( ∑_{k=1}^{N_INS} w_2k x_k )
    v_3 = g( ∑_{k=1}^{N_INS} w_3k x_k )

    Out = g( ∑_{k=1}^{N_HID} W_k v_k )

[Figure: network diagram — inputs x1, x2 feed the hidden units v1, v2, v3 through weights w_11 … w_32, and the hidden units feed the output through weights W_1, W_2, W_3.]
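A small sketch of the forward pass of this 1-hidden-layer net:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def one_hidden_layer_out(x, w, W):
    """Out(x) for the net above.

    x: inputs (length N_INS), w: hidden weights (N_HID x N_INS),
    W: output weights (length N_HID).
    """
    v = sigmoid(w @ x)        # v_j = g(sum_k w_jk x_k)
    return sigmoid(W @ v)     # Out = g(sum_k W_k v_k)
```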

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 53

OTHER NEURAL NETS

2-Hidden layers + Constant Term

[Figure: a net with inputs x1, x2, x3, a constant input 1, and two hidden layers.]

“JUMP” CONNECTIONS:

    Out = g( ∑_{k=0}^{N_INS} w_k x_k  +  ∑_{k=1}^{N_HID} W_k v_k )

[Figure: a net whose inputs x1, x2 connect both to the hidden layer and directly to the output.]

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 54

Backpropagation

Find a set of weights {W_j}, {w_jk} to minimize

    ∑_i ( y_i - Out(x_i) )²      where  Out(x) = g( ∑_j W_j g( ∑_k w_jk x_k ) )

by gradient descent.

That’s it! That’s the backpropagation algorithm.
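A sketch of one such gradient-descent step for the 1-hidden-layer net (the variable names and the absorbed factor of 2 are my choices, not from the slides):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def backprop_step(w, W, X, y, eta=0.05):
    """One gradient-descent step on sum_i (y_i - Out(x_i))^2.

    w: N_HID x N_INS hidden weights, W: length-N_HID output weights,
    X: R x N_INS inputs, y: length-R targets.
    """
    V = sigmoid(X @ w.T)                           # hidden activations, R x N_HID
    out = sigmoid(V @ W)                           # network outputs, length R
    err = y - out
    d_out = err * out * (1.0 - out)                # error signal at the output unit
    grad_W = V.T @ d_out                           # descent direction for W
    d_hid = np.outer(d_out, W) * V * (1.0 - V)     # error signal back-propagated to hidden units
    grad_w = d_hid.T @ X                           # descent direction for w
    return w + eta * grad_w, W + eta * grad_W
```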

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 55

Backpropagation Convergence

Convergence to a global minimum is not guaranteed.

• In practice, this is not a problem, apparently.
• Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.

IMPLEMENTING BACKPROP: Differentiate the monster sum-square residual and write down the Gradient Descent Rule. It turns out to be easier and computationally efficient to use lots of local variables with names like h_j, o_k, v_j, net_i etc.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 56

Choosing the learning rate

• This is a subtle art.
• Too small: can take days instead of minutes to converge.
• Too large: diverges (MSE gets larger and larger while the weights increase and usually oscillate).
• Sometimes the “just right” value is hard to find.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 57

Learning-rate problems

From J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1994.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 58

Improving Simple Gradient Descent

Momentum

Don’t just change weights according to the current datapoint. Re-use changes from earlier iterations.

Let ∆w(t) = weight changes at time t.

Let  -η ∂Ε/∂w  be the change we would make with regular gradient descent.

Instead we use

    ∆w(t+1) = -η ∂Ε/∂w + α ∆w(t)        (α is the momentum parameter)

    w(t+1) = w(t) + ∆w(t+1)

Momentum damps oscillations. A hack? Well, maybe.
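A minimal sketch of the momentum update (the value α = 0.9 is just a typical choice, not from the slides):

```python
def momentum_step(w, grad, prev_dw, eta=0.05, alpha=0.9):
    """One momentum update: dw(t+1) = -eta*grad + alpha*dw(t); w(t+1) = w(t) + dw(t+1)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw
```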

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 59

Momentum illustration

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 60

Improving Simple Gradient Descent

Newton’s method

    E(w + h) = E(w) + (∂E/∂w)ᵀ h + ½ hᵀ (∂²E/∂w²) h + O(|h|³)

If we neglect the O(|h|³) terms, this is a quadratic form.

Quadratic form fun facts:

    If  y = c + bᵀx - ½ xᵀA x  and A is SPD, then  x_opt = A⁻¹b  is the value of x that maximizes y.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 61

This suggests the update

    w ← w - (∂²E/∂w²)⁻¹ (∂E/∂w)

This should send us directly to the global minimum if the function is truly quadratic. And it might get us close if it’s locally quadraticish.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 62

BUT (and it’s a big but)…

That second derivative matrix can be expensive and fiddly to compute. If we’re not already in the quadratic bowl, we’ll go nuts.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 63

Improving Simple Gradient Descent

Conjugate Gradient

Another method which attempts to exploit the “local quadratic bowl” assumption, but does so while only needing to use  ∂E/∂w  and not  ∂²E/∂w².

It is also more stable than Newton’s method if the local quadratic bowl assumption is violated.

It’s complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 64

BEST GENERALIZATION

Intuitively, you want to use the smallest, simplest net that seems to fit the data.

HOW TO FORMALIZE THIS INTUITION?

1. Don’t. Just use intuition.
2. Bayesian methods get it right.
3. Statistical analysis explains what’s going on.
4. Cross-validation.

(Discussed in the next lecture.)

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 65

What You Should Know

• How to implement multivariate least-squares linear regression.
• Derivation of least squares as the max. likelihood estimator of linear coefficients.
• The general gradient descent rule.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 66

What You Should Know

• Perceptrons
    Linear output, least squares
    Sigmoid output, least squares
• Multilayer nets
    The idea behind back prop
    Awareness of better minimization methods
• Generalization. What it means.

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 67

APPLICATIONS

To Discuss:
• What can non-linear regression be useful for?
• What can neural nets (used as non-linear regressors) be useful for?
• What are the advantages of neural nets for nonlinear regression?
• What are the disadvantages?

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 68

Other Uses of Neural Nets…

• Time series with recurrent nets
• Unsupervised learning (clustering, principal components and non-linear versions thereof)
• Combinatorial optimization with Hopfield nets, Boltzmann Machines
• Evaluation function learning (in reinforcement learning)
