
Learning: Perceptrons & Neural Networks

Artificial Intelligence

CMSC 25000

February 8, 2007

Roadmap

- Perceptrons: Single-layer networks
  - Perceptron training
  - Perceptron convergence theorem
  - Perceptron limitations
- Neural Networks
  - Motivation: Overcoming perceptron limitations
  - Motivation: ALVINN
  - Heuristic Training
  - Backpropagation; Gradient descent
  - Avoiding overfitting
  - Avoiding local minima
  - Conclusion: Teaching a Net to talk


Perceptron Structure

[Figure: a single perceptron. Inputs x0 = 1, x1, x2, x3, ..., xn feed the unit through weights w0, w1, w2, w3, ..., wn; the unit produces output y.]

- The fixed input x0 = 1 with weight w0 compensates for the threshold.

Perceptron Convergence Procedure

- Straightforward training procedure
  - Learns linearly separable functions
- Until the perceptron yields the correct output for all training examples (see the sketch below):
  - If the perceptron is correct, do nothing
  - If the perceptron is wrong:
    - If it incorrectly says "yes", subtract the input vector from the weight vector
    - Otherwise, add the input vector to the weight vector
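A minimal sketch of this loop in Python (illustrative only; the function names are hypothetical, inputs are assumed to carry a fixed bias component x[0] = 1, and labels are 1/0):

```python
# Minimal perceptron training sketch following the procedure above.
def predict(w, x):
    """Threshold unit: fire iff the weighted sum is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(samples, n_inputs, max_epochs=100):
    """samples: list of (x, label) pairs, where x[0] == 1 is the bias input."""
    w = [0.0] * n_inputs
    for _ in range(max_epochs):
        errors = 0
        for x, label in samples:
            y = predict(w, x)
            if y == label:                      # correct: do nothing
                continue
            errors += 1
            if y == 1:                          # incorrectly said "yes": subtract input
                w = [wi - xi for wi, xi in zip(w, x)]
            else:                               # incorrectly said "no": add input
                w = [wi + xi for wi, xi in zip(w, x)]
        if errors == 0:                         # correct on all training examples
            break
    return w

# Example: learn AND (linearly separable), with bias input x0 = 1.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
weights = train_perceptron(data, n_inputs=3)
```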

Perceptron Convergence Theorem

- If there exists a weight vector w* that correctly classifies all of the training data, perceptron training will find such a vector.
- Proof sketch: assume ||w*|| = 1 and w*.x >= delta for all positive examples x (and symmetrically for negative examples), for some margin delta > 0.
- ||w||^2 increases by at most ||x||^2 in each iteration:
  - ||w + x||^2 <= ||w||^2 + ||x||^2, so after k mistakes ||w||^2 <= k ||x||^2
- Each mistake increases w*.w by at least delta, while the cosine w*.w / ||w|| can never exceed 1
- Converges in k <= (R / delta)^2 steps, where R = max ||x|| (a worked version of this bound appears below)
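For reference, a compact version of the margin argument sketched above (my reconstruction, not verbatim from the slides; it assumes a unit separator w*, margin delta, and R = max ||x|| over the training set):

```latex
\begin{align*}
  w^{*} \cdot w_k &\ge k\,\delta && \text{(each mistake adds at least } \delta\text{)} \\
  \lVert w_k \rVert^{2} &\le k\,R^{2} && \text{(each mistake adds at most } R^{2}\text{)} \\
  \frac{w^{*} \cdot w_k}{\lVert w_k \rVert} &\le 1 && \text{(cosine bound, since } \lVert w^{*} \rVert = 1\text{)} \\
  \frac{k\,\delta}{\sqrt{k}\,R} &\le 1 && \Longrightarrow\; k \le \left(\frac{R}{\delta}\right)^{2}
\end{align*}
```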

Perceptron Learning

- Perceptrons learn linear decision boundaries
- E.g.: [Figure: '+' examples separated from '0' examples by a straight line in the (x1, x2) plane]
- But not XOR: [Figure: the four XOR points in the (x1, x2) plane, with '+' on one diagonal and '0' on the other; no single line separates them]
- Why no perceptron can compute XOR (inputs coded as -1/1):
  x1   x2   required sign of w1*x1 + w2*x2
  -1   -1   < 0   (i.e., w1 + w2 > 0)
   1   -1   > 0   => with the first row, implies w1 > 0
  -1    1   > 0   => with the first row, implies w2 > 0
   1    1   then w1 + w2 > 0, so the output is "yes", but XOR(1, 1) should be false: contradiction

Perceptron Example

- Digit recognition
  - Assume the display = 8 lightable bars
  - Inputs: on/off for each bar + threshold
  - 65 steps to recognize "8"

Perceptron Summary

- Motivated by neuron activation
- Simple training procedure
- Guaranteed to converge IF the data are linearly separable

Neural Nets

- Multi-layer perceptrons
- Inputs: real-valued
- Intermediate "hidden" nodes
- Output(s): one (or more) discrete-valued

[Figure: feed-forward network with inputs X1-X4, two hidden layers, and outputs Y1, Y2]

Neural Nets

- Pro: More general than perceptrons
  - Not restricted to linear discriminants
  - Multiple outputs: one classification each
- Con: No simple, guaranteed training procedure
  - Use a greedy, hill-climbing procedure to train
  - "Gradient descent", "Backpropagation"

Solving the XOR Problem

[Figure: a 2-2-1 network. Inputs x1, x2 feed hidden nodes o1 and o2 through weights w11, w21, w12, w22; the hidden nodes feed output y through weights w13, w23; each node also has a bias input of -1 weighted by w01, w02, w03.]

Network topology: 2 hidden nodes, 1 output

Desired behavior:

  x1  x2  o1  o2  y
   0   0   0   0  0
   1   0   0   1  1
   0   1   0   1  1
   1   1   1   1  0

Weights (see the verification sketch below):

  w11 = w12 = 1
  w21 = w22 = 1
  w01 = 3/2; w02 = 1/2; w03 = 1/2
  w13 = -1; w23 = 1
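A quick check of these weights with step-threshold units (illustrative Python; the step function and the -1 bias convention follow the diagram above):

```python
# Verify that the slide's weights compute XOR with step-threshold units.
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden nodes (bias input is -1, weighted by w01 / w02).
    o1 = step(1*x1 + 1*x2 - 3/2)   # w11 = w21 = 1, w01 = 3/2  -> acts like AND
    o2 = step(1*x1 + 1*x2 - 1/2)   # w12 = w22 = 1, w02 = 1/2  -> acts like OR
    # Output node: w13 = -1, w23 = 1, w03 = 1/2  -> "OR but not AND" = XOR
    y = step(-1*o1 + 1*o2 - 1/2)
    return o1, o2, y

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # matches the desired-behavior table
```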

Neural Net Applications

- Speech recognition
- Handwriting recognition
- NETtalk: Letter-to-sound rules
- ALVINN: Autonomous driving

ALVINN

- Driving as a neural network
- Inputs:
  - Image pixel intensities (capturing, e.g., the lane lines ahead)
- 5 hidden nodes
- Outputs:
  - Steering actions
  - E.g., turn left/right, and how far
- Training:
  - Observe human behavior: sample images and the corresponding steering

Backpropagation

- Greedy, hill-climbing procedure
  - Weights are the parameters to change
- Original hill-climbing changes one parameter per step
  - Slow
- If the function is smooth, change all parameters per step
  - Gradient descent
- Backpropagation: computes the current output, then works backward to correct the error

Producing a Smooth Function

- Key problem:
  - A pure step threshold is discontinuous
  - Not differentiable
- Solution:
  - Sigmoid (squashed 's' function): the logistic function (see the sketch below)
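A small illustration of the logistic sigmoid and its derivative (the identity s'(z) = s(z)(1 - s(z)) is used later in the gradient computation; the Python names here are just for illustration):

```python
import math

def s(z):
    """Logistic sigmoid: a smooth, differentiable replacement for the step threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def ds(z):
    """Derivative of the sigmoid: s'(z) = s(z) * (1 - s(z))."""
    return s(z) * (1.0 - s(z))

# The sigmoid squashes any real input into (0, 1), approaching a step for large |z|.
print([round(s(z), 3) for z in (-5, -1, 0, 1, 5)])   # [0.007, 0.269, 0.5, 0.731, 0.993]
```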

Neural Net Training

- Goal:
  - Determine how to change the weights to get the correct output
  - Make large changes to the weights that produce large reductions in error
- Approach:
  - Compute the actual output: o
  - Compare it to the desired output: d
  - Determine the effect of each weight w on the error = d - o
  - Adjust the weights


Neural Net Example

[Figure: a 2-2-1 sigmoid network. Inputs x1, x2 feed hidden nodes z1, z2 (outputs y1, y2) through weights w11, w21, w12, w22; the hidden outputs feed node z3 (output y3) through weights w13, w23; each node has a bias input of -1 weighted by w01, w02, w03.]

- x^i: the i-th sample input vector
- w: the weight vector
- y^i*: the desired output for the i-th sample
- Error: sum of squares error over the training samples
- The output y3 can be written as a full expression in terms of the inputs and weights (see below)

- From MIT 6.034 notes, Lozano-Perez
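Written out (my reconstruction, which follows directly from the 2-2-1 wiring above, with s the sigmoid and -1 bias inputs):

```latex
\begin{align*}
  y_1 &= s(z_1) = s(w_{11} x_1 + w_{21} x_2 - w_{01}) \\
  y_2 &= s(z_2) = s(w_{12} x_1 + w_{22} x_2 - w_{02}) \\
  y_3 &= s(z_3) = s(w_{13}\, y_1 + w_{23}\, y_2 - w_{03}) \\
  E(w) &= \sum_i \bigl( y_3(x^{i}, w) - y^{i*} \bigr)^{2}
\end{align*}
```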

Gradient Descent

- Error: sum of squares error of the inputs with the current weights
- Compute the rate of change of the error with respect to each weight
  - Which weights have the greatest effect on the error?
  - Effectively, partial derivatives of the error with respect to the weights
  - These in turn depend on other weights => chain rule

Gradient Descent

- E = G(w): error as a function of the weights
- Find the rate of change of the error
- Follow the steepest rate of change (a toy example follows below)
- Change the weights so that the error is minimized

[Figure: error E = G(w) plotted against weight w, showing the gradient dG/dw and local minima; points w0 and w1 are marked on the weight axis.]

- MIT AI lecture notes, Lozano-Perez 2000
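A minimal one-dimensional gradient descent sketch (illustrative only; G and its derivative are made-up stand-ins for the error surface in the figure):

```python
# Toy 1-D gradient descent on an error curve G(w) with more than one minimum.
def G(w):                       # example error surface (not from the slides)
    return w**4 - 3*w**2 + w

def dG(w):                      # its derivative dG/dw
    return 4*w**3 - 6*w + 1

def descend(w, rate=0.01, steps=1000):
    for _ in range(steps):
        w -= rate * dG(w)       # follow the steepest rate of change downhill
    return w

# Different starting weights can settle in different (possibly local) minima.
print(descend(w=2.0), descend(w=-2.0))
```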

Gradient of Error

[Figure: the same 2-2-1 example network (inputs x1, x2; hidden nodes z1, z2 with outputs y1, y2; output node z3 with output y3; weights w11, w21, w01, w12, w22, w02, w13, w23, w03; bias inputs of -1), used to expand the error gradient node by node.]

- Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))

- From MIT 6.034 notes, Lozano-Perez

From Effect to Update

- Gradient computation:
  - How each weight contributes to performance
- To train:
  - Need to determine how to CHANGE each weight based on its contribution to performance
  - Need to determine how MUCH change to make per iteration
    - Rate parameter 'r'
    - Large enough to learn quickly
    - Small enough to reach, but not overshoot, the target values

Backpropagation Procedure

- Pick the rate parameter 'r'
- Until performance is good enough:
  - Do a forward computation to calculate the output of every node
  - Compute beta in the output node:
    beta_z = d - o_z (desired output minus actual output)
  - Compute beta in all other nodes:
    beta_j = sum over downstream nodes k of ( w_j->k * o_k * (1 - o_k) * beta_k )
  - Compute the change for all weights (see the code sketch below):
    delta w_i->j = r * o_i * o_j * (1 - o_j) * beta_j
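A sketch of one online backpropagation update for the 2-2-1 sigmoid network used on these slides (my own Python rendering of the beta rules above; the dictionary keys mirror the weight names in the diagrams, and the bias input is the constant -1):

```python
import math

def s(z):                      # sigmoid (logistic) activation
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x1, x2, d, w, r=0.5):
    """One online update. w holds w11, w21, w01, w12, w22, w02, w13, w23, w03."""
    # Forward pass (bias input is a constant -1, as in the slide diagrams).
    o1 = s(w['w11']*x1 + w['w21']*x2 - w['w01'])
    o2 = s(w['w12']*x1 + w['w22']*x2 - w['w02'])
    y  = s(w['w13']*o1 + w['w23']*o2 - w['w03'])

    # Betas: output node uses desired - actual; hidden nodes sum the
    # weighted, sigmoid-scaled betas coming back from the layer above.
    beta_y  = d - y
    beta_o1 = w['w13'] * y*(1-y) * beta_y
    beta_o2 = w['w23'] * y*(1-y) * beta_y

    # Weight changes: delta w_i->j = r * o_i * o_j * (1 - o_j) * beta_j
    w['w13'] += r * o1   * y*(1-y)   * beta_y
    w['w23'] += r * o2   * y*(1-y)   * beta_y
    w['w03'] += r * (-1) * y*(1-y)   * beta_y
    w['w11'] += r * x1   * o1*(1-o1) * beta_o1
    w['w21'] += r * x2   * o1*(1-o1) * beta_o1
    w['w01'] += r * (-1) * o1*(1-o1) * beta_o1
    w['w12'] += r * x1   * o2*(1-o2) * beta_o2
    w['w22'] += r * x2   * o2*(1-o2) * beta_o2
    w['w02'] += r * (-1) * o2*(1-o2) * beta_o2
    return y
```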

Backprop Example

[Figure: the same 2-2-1 network (inputs x1, x2; hidden nodes z1, z2 with outputs y1, y2; output node z3 with output y3; weights w11, w21, w01, w12, w22, w02, w13, w23, w03; bias inputs of -1).]

- Forward prop: compute z_i and y_i given the inputs x_k and weights w_l

Backpropagation Observations

- The procedure is (relatively) efficient
  - All computations are local
  - Each uses only the inputs and outputs of the current node
- What is "good enough"?
  - The net rarely reaches the target (0 or 1) outputs exactly
  - Typically, train until within 0.1 of the target

Neural Net Summary

- Training:
  - Backpropagation procedure
  - Gradient descent strategy (with the usual problems)
- Prediction:
  - Compute the outputs based on the input vector and weights
- Pros: very general; fast prediction
- Cons: training can be VERY slow (1000s of epochs); overfitting

Training Strategies

- Online training (see the sketch below):
  - Update the weights after each sample
- Offline (batch) training:
  - Compute the error over all samples
  - Then update the weights
- Online training is "noisy"
  - Sensitive to individual instances
  - However, it may escape local minima
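A schematic contrast of the two update schedules (illustrative Python; `gradient(w, sample)` stands in for whatever per-sample gradient backpropagation produces and is not defined on the slides):

```python
# Online vs. offline (batch) weight updates, schematically.
def online_epoch(w, samples, gradient, r=0.1):
    """Update the weights after each sample (noisier, may escape local minima)."""
    for sample in samples:
        g = gradient(w, sample)
        w = [wi - r * gi for wi, gi in zip(w, g)]
    return w

def batch_epoch(w, samples, gradient, r=0.1):
    """Accumulate the error gradient over all samples, then update once."""
    total = [0.0] * len(w)
    for sample in samples:
        g = gradient(w, sample)
        total = [ti + gi for ti, gi in zip(total, g)]
    return [wi - r * ti for wi, ti in zip(w, total)]
```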

Training Strategy

- To avoid overfitting:
  - Split the data into training, validation, and test sets
  - Also, avoid excess weights (use fewer weights than samples)
- Initialize with small random weights
  - So that small changes have a noticeable effect
- Use offline training, until the validation-set error reaches its minimum (see the sketch below)
- Then evaluate on the test set
  - No more weight changes after this point
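One way to read that recipe as code (an illustrative early-stopping loop; `train_epoch` and `error_on` are hypothetical helpers):

```python
# Early stopping on the validation set, schematically.
def train_with_early_stopping(w, train_set, valid_set, train_epoch, error_on,
                              max_epochs=1000):
    """train_epoch(w, samples) -> new weights; error_on(w, samples) -> error."""
    best_w, best_err = list(w), error_on(w, valid_set)
    for _ in range(max_epochs):
        w = train_epoch(w, train_set)          # offline (batch) update over all samples
        err = error_on(w, valid_set)           # monitor validation error
        if err < best_err:
            best_w, best_err = list(w), err    # keep the best weights seen so far
        else:
            break                              # validation error stopped improving: stop
    return best_w                              # evaluate these weights once on the test set
```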


Classification

- Neural networks are best suited to classification tasks
  - Single output -> binary classifier
  - Multiple outputs -> multiway classification
  - Applied successfully to learning pronunciation
- The sigmoid pushes outputs toward binary classification
  - Not good for regression

Neural Net Example

- NETtalk: letter-to-sound conversion by a net
- Inputs:
  - Need context to pronounce a letter
  - 7-letter window: predict the sound of the middle letter
  - 29 possible characters (alphabet + space + ',' + '.')
  - 7 * 29 = 203 inputs
- 80 hidden nodes
- Output:
  - Generate 60 phones
  - Output nodes map to 26 units: 21 articulatory features, 5 for stress/syllable structure
  - Vector quantization of acoustic space

Neural Net Example: NETtalk

- Learning to talk:
  - 5 iterations over 1024 training words: word boundaries/stress emerge
  - 10 iterations: intelligible speech
  - 400 new test words: 80% correct
- Not as good as DECtalk, but learned automatically

Neural Net Conclusions

- Simulation based on neurons in the brain
- Perceptrons (single neuron)
  - Guaranteed to find a linear discriminant
  - IF one exists (problem: XOR)
- Neural nets (multi-layer perceptrons)
  - Very general
  - Backpropagation training procedure
  - Gradient descent: local minima and overfitting issues