# Learning: Perceptrons & Neural Networks


Artificial Intelligence

CMSC 25000

February 8, 2007

- Perceptrons: Single-layer networks
  - Perceptron training
  - Perceptron convergence theorem
  - Perceptron limitations
- Neural Networks
  - Motivation: Overcoming perceptron limitations
  - Motivation: ALVINN
  - Heuristic training
  - Avoiding overfitting
  - Avoiding local minima
- Conclusion: Teaching a net to talk

Perceptron Structure

[Figure: perceptron with inputs x0 = 1, x1, x2, x3, ..., xn, weights w0, w1, w2, w3, ..., wn, and a single threshold unit producing output y]

Output: y = 1 if w0x0 + w1x1 + ... + wnxn > 0, else y = 0

The fixed input x0 = 1, weighted by w0, compensates for the threshold

Perceptron Convergence Procedure

Straightforward training procedure (sketched in code below):

- Learns linearly separable functions
- Until the perceptron yields the correct output for all training samples:
  - If the perceptron is correct, do nothing
  - If the perceptron is wrong:
    - If it incorrectly says "yes", subtract the input vector from the weight vector
    - Otherwise, add the input vector to the weight vector
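
A minimal runnable sketch of this loop, assuming {0,1} inputs and labels and a fixed bias input x0 = 1 (so w[0] plays the role of the threshold); the function names and the OR example are illustrative, not from the slides:

```python
# Sketch of the perceptron convergence procedure above.
# Labels: 1 = "yes", 0 = "no". The bias input x0 = 1 makes w[0]
# compensate for the threshold, as in the structure slide.

def predict(w, x):
    """Fire (1) iff the weighted input sum exceeds 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(samples, n_inputs, max_epochs=100):
    w = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            x = [1] + list(x)                # prepend bias input x0 = 1
            y = predict(w, x)
            if y == label:
                continue                     # correct: do nothing
            mistakes += 1
            if y == 1:                       # incorrectly said "yes":
                w = [wi - xi for wi, xi in zip(w, x)]   # subtract input vector
            else:                            # incorrectly said "no":
                w = [wi + xi for wi, xi in zip(w, x)]   # add input vector
        if mistakes == 0:                    # correct on all samples
            return w
    return w

# Logical OR is linearly separable, so training converges:
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data, n_inputs=2)
print([predict(w, [1] + list(x)) for x, _ in data])   # -> [0, 1, 1, 1]
```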

Perceptron Convergence Theorem

If there exists a weight vector w* (take ||w*|| = 1) s.t. w*.x >= delta > 0 for all positive examples x, perceptron training will find a separating weight vector:

- Each update adds at least delta to w*.w, so after k updates, w*.w >= k*delta
- ||w||^2 increases by at most ||x||^2 in each iteration:
  ||w+x||^2 <= ||w||^2 + ||x||^2 <= k*R^2, where R = max ||x||
- Since w*.w/||w|| <= 1, we get k*delta <= sqrt(k)*R
- Converges in k <= O(R^2/delta^2) steps
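
The bound, written out (the standard perceptron convergence argument; delta, R, and k as above):

```latex
% After k mistake-driven updates w <- w + x, with ||w*|| = 1,
% w* . x >= delta for every positive example, and R = max ||x||:
\begin{align}
  w^* \cdot w &\ge k\delta && \text{each update adds } w^* \cdot x \ge \delta \\
  \|w\|^2 &\le k R^2 && \text{each update adds at most } \|x\|^2 \\
  1 \;\ge\; \frac{w^* \cdot w}{\|w\|} &\ge \frac{k\delta}{\sqrt{k}\,R}
    && \Rightarrow\; k \le \frac{R^2}{\delta^2}
\end{align}
```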

Perceptron Learning

Perceptrons learn linear decision boundaries

E.g.:

[Figure: points in the (x1, x2) plane: +'s on one side of a line, 0's on the other]

But not XOR:

[Figure: XOR in the (x1, x2) plane: +'s and 0's at opposite corners; no line through the origin separates them]

x1  x2   required: w1x1 + w2x2
-1  -1   w1x1 + w2x2 < 0
 1  -1   w1x1 + w2x2 > 0  => with row 1, implies w1 > 0
-1   1   w1x1 + w2x2 > 0  => with row 1, implies w2 > 0
 1   1   w1x1 + w2x2 > 0  => but XOR(1,1) is false, so we need < 0: contradiction
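
A quick brute-force check of this contradiction; the grid of candidate weights is an arbitrary choice:

```python
# No (w1, w2) classifies XOR with sign(w1*x1 + w2*x2).

def separates_xor(w1, w2):
    cases = [(-1, -1, False), (1, -1, True), (-1, 1, True), (1, 1, False)]
    return all((w1 * x1 + w2 * x2 > 0) == label for x1, x2, label in cases)

found = any(separates_xor(w1 / 10, w2 / 10)
            for w1 in range(-50, 51) for w2 in range(-50, 51))
print(found)   # -> False: no linear boundary through the origin solves XOR
```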

Perceptron Example

Digit recognition:

- Assume the display = 8 lightable bars
- Inputs: bars on/off + threshold
- 65 steps to recognize "8"

Perceptron Summary

- Motivated by neuron activation
- Simple training procedure
- Guaranteed to converge, IF linearly separable

Neural Nets

- Multi-layer perceptrons
- Inputs: real-valued
- Intermediate "hidden" nodes
- Output(s): one (or more) discrete-valued

[Figure: feedforward network: inputs X1-X4 feed two hidden layers, which feed outputs Y1 and Y2]

Neural Nets

- Pro: More general than perceptrons
  - Not restricted to linear discriminants
  - Multiple outputs: one classification each
- Con: No simple, guaranteed training procedure
  - Use a greedy, hill-climbing procedure to train

Solving the XOR Problem

[Figure: 2-2-1 network: inputs x1, x2 feed hidden nodes o1 (weights w11, w21) and o2 (weights w12, w22); o1 and o2 feed output y (weights w13, w23); each node also has a bias input -1 with weight w01, w02, or w03]

Network topology: 2 hidden nodes, 1 output

Desired behavior:

x1 x2 | o1 o2 | y
 0  0 |  0  0 | 0
 1  0 |  0  1 | 1
 0  1 |  0  1 | 1
 1  1 |  1  1 | 0

Weights (verified in code below):

- w11 = w12 = 1; w21 = w22 = 1
- w01 = 3/2; w02 = 1/2; w03 = 1/2
- w13 = -1; w23 = 1

With these weights, o1 acts as AND, o2 acts as OR, and y fires iff o2 fires and o1 does not: exactly XOR.
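
A quick check (in Python) that these weights implement the table above; each unit fires iff its weighted input sum exceeds its threshold, with the bias input -1 carrying w0:

```python
# Verify the XOR network above: o1 acts as AND, o2 as OR,
# and y = (o2 AND NOT o1) = XOR.

def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    o1 = step(1 * x1 + 1 * x2 - 3 / 2)     # w11 = w21 = 1, w01 = 3/2: AND
    o2 = step(1 * x1 + 1 * x2 - 1 / 2)     # w12 = w22 = 1, w02 = 1/2: OR
    return step(-1 * o1 + 1 * o2 - 1 / 2)  # w13 = -1, w23 = 1, w03 = 1/2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # -> 0, 1, 1, 0
```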

Neural Net Applications

- Speech recognition
- Handwriting recognition
- NETtalk: Letter-to-sound rules
- ALVINN: Autonomous driving

ALVINN

Driving as a neural network:

- Inputs: image pixel intensities (capturing, e.g., lane lines)
- 5 hidden nodes
- Outputs: steering actions (e.g., turn left/right, and how far)
- Training: observe human behavior: sample images + steering

Backpropagation

- A greedy, hill-climbing procedure
  - The weights are the parameters to change
- The original hill-climbing changes one parameter per step
  - Slow
- If the function is smooth, we can change all parameters per step
- Backpropagation: computes the current output, then works backward to correct the error

Producing a Smooth Function

Key problem:

- A pure step threshold is discontinuous
- Not differentiable

Solution:

- Sigmoid (squashed 's' function): the logistic function s(z) = 1/(1 + e^-z)
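
The logistic function, plus the derivative identity that backpropagation uses later, with a quick numerical check (the test point z = 0.7 is arbitrary):

```python
import math

# Logistic (sigmoid) function: a smooth, differentiable replacement
# for the step threshold.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Backprop will use the closed form of its derivative: s(z) * (1 - s(z)).
# Quick numerical check against a central difference:
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
closed = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - closed) < 1e-9)   # -> True
```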

Neural Net Training

Goal:

- Determine how to change the weights to get the correct output
- A large change in a weight should produce a large reduction in error

Approach:

- Compute the actual output: o
- Compare to the desired output: d
- Determine the effect of each weight w on the error = d - o

Neural Net Example

[Figure: 2-2-1 network: inputs x1, x2 (plus bias -1) feed sums z1, z2 through weights w11, w21, w12, w22, w01, w02, giving hidden outputs y1, y2; these (plus bias -1) feed sum z3 through w13, w23, w03, giving output y3]

- xi: ith sample input vector
- w: weight vector
- yi*: desired output for the ith sample

Error: sum of squares error over the training samples:

E(w) = sum over i of (yi* - y3(xi, w))^2

Full expression of the output in terms of inputs and weights:

y3 = s(-w03 + w13*s(-w01 + w11*x1 + w21*x2) + w23*s(-w02 + w12*x1 + w22*x2))

- From 6.034 notes, Lozano-Perez

- Error: sum of squares error of the inputs with the current weights
- Compute the rate of change of the error w.r.t. each weight
  - Which weights have the greatest effect on the error?
  - Effectively, the partial derivatives of the error w.r.t. the weights
  - These in turn depend on other weights => chain rule

E = G(w)

- Error as a function of the weights
- Find the rate of change of the error: dG/dw
- Change the weights s.t. the error is minimized

[Figure: error curve G(w) over weights w0, w1, showing the slope dG/dw and local minima]

MIT AI lecture notes, Lozano-Perez 2000
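
A minimal gradient-descent sketch; the toy error surface G, the numerical gradient, and the rate r are illustrative stand-ins, not the network's actual error function:

```python
# Gradient descent on E = G(w): step each weight against dG/dw.
# G here is a toy error surface with its minimum at w = (1, -2).

def G(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def numeric_grad(f, w, h=1e-6):
    """Estimate dG/dw by central differences."""
    grad = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h
        wm[i] -= h
        grad.append((f(wp) - f(wm)) / (2 * h))
    return grad

w, r = [0.0, 0.0], 0.1                      # start point and rate parameter
for _ in range(200):
    w = [wi - r * gi for wi, gi in zip(w, numeric_grad(G, w))]
print(w)   # -> approximately [1.0, -2.0]
```

On a surface with local minima, as in the figure, this procedure can stall at a local minimum rather than the global one.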

[Figure: the same 2-2-1 network, labeled with sums z1, z2, z3, outputs y1, y2, y3, weights w01-w23, and bias inputs -1]

Note: derivative of the sigmoid:

ds(z)/dz = s(z)(1 - s(z))

- From 6.034 notes, Lozano-Perez

From Effect to Update

- The derivatives say how each weight contributes to performance
- To train, we need to determine:
  - How to CHANGE each weight based on its contribution to performance
  - How MUCH change to make per iteration
- Rate parameter 'r':
  - Large enough to learn quickly
  - Small enough to reach, but not overshoot, the target values

Backpropagation Procedure

- Pick rate parameter 'r'
- Until performance is good enough:
  - Do the forward computation to calculate each node's output o
  - Compute Beta in the output node: Beta_z = d* - o_z
  - Compute Beta in all other nodes j: Beta_j = sum over k of w_j->k * o_k * (1 - o_k) * Beta_k
  - Compute the change for all weights i->j: delta w_i->j = r * o_i * o_j * (1 - o_j) * Beta_j
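
A runnable sketch of the whole procedure on the 2-2-1 XOR network, assuming sigmoid units with a bias input -1 at every node; the random initialization, r = 0.5, and the epoch count are illustrative choices, not from the slides:

```python
import math, random

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# One weight list per node: [w0 (threshold), w_in1, w_in2].
w1 = [random.uniform(-1, 1) for _ in range(3)]   # hidden node o1
w2 = [random.uniform(-1, 1) for _ in range(3)]   # hidden node o2
w3 = [random.uniform(-1, 1) for _ in range(3)]   # output node y
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR
r = 0.5                                           # rate parameter

for epoch in range(20000):
    for (x1, x2), d in data:
        # Forward computation.
        o1 = s(-w1[0] + w1[1] * x1 + w1[2] * x2)
        o2 = s(-w2[0] + w2[1] * x1 + w2[2] * x2)
        y = s(-w3[0] + w3[1] * o1 + w3[2] * o2)
        # Beta in the output node: d* - o.
        beta_y = d - y
        # Beta in the hidden nodes: w_j->k * o_k * (1 - o_k) * Beta_k.
        beta1 = w3[1] * y * (1 - y) * beta_y
        beta2 = w3[2] * y * (1 - y) * beta_y
        # Weight changes: delta w_i->j = r * o_i * o_j * (1 - o_j) * Beta_j.
        for w, o, beta, (i1, i2) in ((w3, y, beta_y, (o1, o2)),
                                     (w1, o1, beta1, (x1, x2)),
                                     (w2, o2, beta2, (x1, x2))):
            w[0] += r * -1 * o * (1 - o) * beta   # bias input is -1
            w[1] += r * i1 * o * (1 - o) * beta
            w[2] += r * i2 * o * (1 - o) * beta

# Outputs should approach the 0/1 targets (within ~0.1, per the slides);
# if training lands in a local minimum, a different seed may be needed.
for (x1, x2), d in data:
    o1 = s(-w1[0] + w1[1] * x1 + w1[2] * x2)
    o2 = s(-w2[0] + w2[1] * x1 + w2[2] * x2)
    print(x1, x2, "->", round(s(-w3[0] + w3[1] * o1 + w3[2] * o2), 2))
```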

Backprop Example

[Figure: the 2-2-1 network once more: inputs x1, x2, sums z1, z2, z3, outputs y1, y2, y3, weights w01-w23, bias inputs -1]

Forward prop: compute the z_i and y_i, given the x_k and w_l

Backpropagation Observations

- The procedure is (relatively) efficient
  - All computations are local
  - Each node uses only its own inputs and outputs
- What is "good enough"?
  - We rarely reach the target (0 or 1) outputs exactly
  - Typically, train until within 0.1 of target

Neural Net Summary

- Training: backpropagation procedure
- Prediction: compute outputs based on the input vector & weights
- Pros: very general; fast prediction
- Cons: training can be VERY slow (1000's of epochs); overfitting

Training Strategies

- Online training: update the weights after each sample
- Offline (batch) training: compute the error over all samples, then update the weights (both schedules are sketched below)
- Online training is "noisy":
  - Sensitive to individual instances
  - However, it may escape local minima
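
The two schedules, sketched; `grad(w, sample)` is a hypothetical stand-in for the per-sample error gradient (e.g., from backprop):

```python
# Online: update the weights after EACH sample ("noisy", but the noise
# may bounce the weights out of a local minimum).
def online_epoch(w, samples, r, grad):
    for sample in samples:
        g = grad(w, sample)
        w = [wi - r * gi for wi, gi in zip(w, g)]
    return w

# Offline (batch): accumulate the error gradient over ALL samples,
# then update the weights once.
def batch_epoch(w, samples, r, grad):
    total = [0.0] * len(w)
    for sample in samples:
        g = grad(w, sample)
        total = [ti + gi for ti, gi in zip(total, g)]
    return [wi - r * gi for wi, gi in zip(w, total)]
```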

Training Strategy

To avoid overfitting:

- Split the data into training, validation, & test sets
- Also, avoid excess weights (keep fewer weights than samples)
- Initialize with small random weights
  - Small changes then have a noticeable effect
- Use offline training, until the validation-set error reaches its minimum (sketched below)
- Then evaluate on the test set: no more weight changes
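
One way to realize this in code (a hedged sketch: `train_epoch` and `error` are hypothetical stand-ins for the batch update and error computation, and `patience` is an illustrative stopping rule):

```python
# Train offline, tracking validation error; keep the weights from the
# validation minimum, then make no more weight changes (evaluate on test).
def train_until_validation_minimum(w, train_set, val_set,
                                   train_epoch, error, patience=10):
    best_w, best_err, since_best = list(w), error(w, val_set), 0
    while since_best < patience:
        w = train_epoch(w, train_set)
        val_err = error(w, val_set)
        if val_err < best_err:
            best_w, best_err, since_best = list(w), val_err, 0
        else:
            since_best += 1
    return best_w
```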

Classification

- Neural networks are best for classification
  - Single output -> binary classifier
  - Multiple outputs -> multiway classification
  - Applied successfully to learning pronunciation
- The sigmoid pushes outputs toward binary classification
  - Not good for regression

Neural Net Example

NETtalk: letter-to-sound by net

- Inputs:
  - Need context to pronounce a letter
  - 7-letter window: predict the sound of the middle letter
  - 29 possible characters (alphabet + space + ',' + '.')
  - 7 * 29 = 203 inputs
- 80 hidden nodes
- Output:
  - Generates 60 phones
  - Nodes map to 26 units: 21 articulatory, 5 stress/syllable
  - Vector quantization of acoustic space

Neural Net Example: NETtalk

Learning to talk:

- 5 iterations / 1024 training words: word boundaries/stress
- 10 iterations: intelligible
- 400 new test words: 80% correct
- Not as good as DECtalk, but automatic

Neural Net Conclusions

- Simulation based on neurons in the brain
- Perceptrons (single neuron)
  - Guaranteed to find a linear discriminant, IF one exists -> problem: XOR
- Neural nets (multi-layer perceptrons)
  - Very general
  - Backpropagation training procedure