lec12-dec11-09

cartcletchAI and Robotics

Oct 19, 2013



Artificial Neural Networks






Notes based on Nilsson and Mitchell’s Machine Learning



Outline


Perceptrons (LTU)


Gradient descent


Multi-layer networks


Backpropagation

Biological Neural Systems


Neuron switching time: > $10^{-3}$ secs

Number of neurons in the human brain: ~$10^{10}$

Connections (synapses) per neuron: ~$10^{4}$ to $10^{5}$


Face recognition : 0.1 secs


High degree of parallel computation


Distributed representations


Properties of Artificial Neural Nets (ANNs)


Many simple neuron-like threshold switching units


Many weighted interconnections among units


Highly parallel, distributed processing


Learning by adaptation of the connection weights

Appropriate Problem Domains for Neural Network Learning


Input is high-dimensional discrete or real-valued (e.g. raw sensor input)


Output is discrete or real valued


Output is a vector of values


Form of target function is unknown


Humans do not need to interpret the results (black box model)

General Idea




A network of neurons. Each neuron is characterized by:

- number of input/output wires
- weights on each wire
- threshold value

These values are not explicitly programmed; they evolve through a training process.

During the training phase, labeled samples are presented. If the network classifies a sample correctly, the weights are left unchanged. Otherwise, the weights are adjusted.

The backpropagation algorithm is used to adjust the weights.

ALVINN (Carnegie Mellon Univ)

Automated driving at 70 mph on a public highway


[Figure: ALVINN network. A 30x32-pixel camera image serves as input, 4 hidden units feed 30 outputs for steering. A second panel shows the 30x32 weights into one of the four hidden units.]

Another Example


NETtalk


Program that learns to pronounce English text (Sejnowski and Rosenberg 1987).

- A difficult task using conventional programming models.
- Rule-based approaches are too complex since pronunciations are very irregular.
- NETtalk takes as input a sentence and produces a sequence of phonemes and an associated stress for each letter.

NETtalk



A phoneme is a basic unit of sound in a language.

Stress: relative loudness of that sound.

Because the pronunciation of a single letter depends upon its context and the letters around it, NETtalk is given a seven-character window.

Each position is encoded by one of 29 symbols (26 letters and 3 punctuation marks).

The letter in each position activates the corresponding unit.



NETtalk


The output units encode phonemes using 21 different features of human articulation.

The remaining five units encode stress and syllable boundaries.

NETtalk also has a middle layer (hidden layer) with 80 hidden units and nearly 18,000 connections (edges).

NETtalk is trained by giving it a seven-character window so that it learns to pronounce the middle character.

It learns by comparing the computed pronunciation to the correct pronunciation.


Handwritten character recognition


This is another area in which neural networks have been successful.

In fact, all the successful programs have a neural network component.

Threshold Logic Unit (TLU)



Inputs: $x_1, x_2, \dots, x_n$

Weights: $w_1, w_2, \dots, w_n$

Activation: $a = \sum_{i=1}^{n} w_i x_i$

Output: $y = 1$ if $a \ge \theta$, $y = 0$ if $a < \theta$

Here $\theta$ is the threshold.
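As a minimal sketch of the unit defined above, the activation and thresholding can be written in a few lines. The function name `tlu` and the example weights and threshold are illustrative assumptions, not values from the slides.

```python
from typing import Sequence


def tlu(x: Sequence[float], w: Sequence[float], theta: float) -> int:
    """Threshold logic unit: 1 if the weighted sum reaches theta, else 0."""
    a = sum(wi * xi for wi, xi in zip(w, x))  # activation a = sum_i w_i * x_i
    return 1 if a >= theta else 0


# Hypothetical example: two inputs, weights (2, -1), threshold 1.
outputs = [tlu(x, w=(2.0, -1.0), theta=1.0) for x in [(1, 0), (0, 1), (1, 1)]]
```

Note that an activation exactly equal to the threshold yields output 1, matching the "greater than or equal" case in the definition.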

Activation Functions

[Figure: four plots of output y against activation a, one per activation function: threshold, linear, piece-wise linear, sigmoid]
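The four activation functions named above can be sketched as follows. The slides only name them, so the exact shapes used here are assumptions: in particular the piece-wise linear function is taken to be a ramp clipped to [0, 1], and the threshold defaults to 0.

```python
import math


def threshold(a: float, theta: float = 0.0) -> float:
    """Step function: 1 once the activation reaches theta."""
    return 1.0 if a >= theta else 0.0


def linear(a: float) -> float:
    """Identity: the output equals the activation."""
    return a


def piecewise_linear(a: float) -> float:
    """Assumed form: linear between 0 and 1, saturating outside that range."""
    return min(1.0, max(0.0, a))


def sigmoid(a: float) -> float:
    """Smooth, differentiable squashing function 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))
```

Only the sigmoid is differentiable everywhere with a non-zero slope, which is what later makes gradient-based training of multi-layer networks possible.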

Decision Surface of a TLU

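[Figure: the $(x_1, x_2)$ plane with the decision line $w_1 x_1 + w_2 x_2 = \theta$ and the weight vector $\mathbf{w}$; points on one side of the line are labeled 1, points on the other side 0]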

Scalar Products & Projections

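$\mathbf{w} \cdot \mathbf{v} = |\mathbf{w}||\mathbf{v}| \cos\varphi$, where $\varphi$ is the angle between $\mathbf{w}$ and $\mathbf{v}$

[Figure: three cases for the vectors $\mathbf{w}$ and $\mathbf{v}$: $\mathbf{w} \cdot \mathbf{v} > 0$, $\mathbf{w} \cdot \mathbf{v} = 0$, and $\mathbf{w} \cdot \mathbf{v} < 0$]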

Geometric Interpretation

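The relation $\mathbf{w} \cdot \mathbf{x} = \theta$ defines the decision line.

[Figure: decision line in the $(x_1, x_2)$ plane with weight vector $\mathbf{w}$, $y = 1$ on one side and $y = 0$ on the other. $\mathbf{x}_w$ is the projection of $\mathbf{x}$ onto $\mathbf{w}$, with $|\mathbf{x}_w| = \theta / |\mathbf{w}|$ for points on the line.]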

Geometric Interpretation


In n dimensions the relation $\mathbf{w} \cdot \mathbf{x} = \theta$ defines an (n-1)-dimensional hyper-plane, which is perpendicular to the weight vector $\mathbf{w}$.

On one side of the hyper-plane ($\mathbf{w} \cdot \mathbf{x} > \theta$) all patterns are classified by the TLU as “1”, while those that get classified as “0” lie on the other side of the hyper-plane.

If patterns cannot be separated by a hyper-plane, then they cannot be correctly classified with a TLU.

Linear Separability

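Logical AND

x1  x2  a  y
0   0   0  0
0   1   1  0
1   0   1  0
1   1   2  1

$w_1 = 1$, $w_2 = 1$, $\theta = 1.5$

[Figure: the four input points in the $(x_1, x_2)$ plane, labeled with their AND outputs]

Logical XOR

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

$w_1 = ?$, $w_2 = ?$, $\theta = ?$

[Figure: the four input points in the $(x_1, x_2)$ plane, labeled with their XOR outputs]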
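The contrast between the two tables can be checked numerically. The sketch below confirms that the AND weights from the slide work, then searches a coarse grid of weights and thresholds for XOR. The grid range and step are arbitrary assumptions, and an empty search result illustrates (but does not prove) that no single TLU computes XOR.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]


def tlu_outputs(w1, w2, theta):
    """Outputs of a two-input TLU on all four binary input patterns."""
    return [1 if w1 * x1 + w2 * x2 >= theta else 0 for x1, x2 in INPUTS]


# The weights given on the slide realize AND.
and_ok = tlu_outputs(1.0, 1.0, 1.5) == AND

# Try every (w1, w2, theta) on a grid from -5.0 to 5.0 in steps of 0.5.
grid = [k / 2 for k in range(-10, 11)]
xor_solutions = [p for p in product(grid, repeat=3) if tlu_outputs(*p) == XOR]
```

The search comes back empty because XOR would need $\theta > 0$, $w_1 \ge \theta$, $w_2 \ge \theta$ and $w_1 + w_2 < \theta$ at the same time, which is contradictory.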

Threshold as Weight



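The threshold is treated as an extra weight $w_{n+1} = \theta$ on a constant input $x_{n+1} = -1$.

Inputs: $x_1, x_2, \dots, x_n, x_{n+1} = -1$

Weights: $w_1, w_2, \dots, w_n, w_{n+1}$

Activation: $a = \sum_{i=1}^{n+1} w_i x_i$

Output: $y = 1$ if $a \ge 0$, $y = 0$ if $a < 0$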
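A short sketch of this reformulation: appending the constant input -1 and the weight $\theta$ turns the comparison against $\theta$ into a comparison against 0. The function names are illustrative, and the AND weights are reused from the earlier slide.

```python
def tlu(x, w, theta):
    """Original form: compare the weighted sum against theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0


def tlu_augmented(x, w_aug):
    """Augmented form: extra input x_{n+1} = -1, extra weight w_{n+1} = theta."""
    x_aug = list(x) + [-1.0]
    a = sum(wi * xi for wi, xi in zip(w_aug, x_aug))
    return 1 if a >= 0 else 0


w, theta = (1.0, 1.0), 1.5
w_aug = (*w, theta)  # threshold folded into the weight vector
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
same = all(tlu(x, w, theta) == tlu_augmented(x, w_aug) for x in inputs)
```

With this convention a learning rule only needs to update one weight vector, since the threshold is learned like any other weight.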

Geometric Interpretation

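The relation $\mathbf{w} \cdot \mathbf{x} = 0$ defines the decision line.

[Figure: decision line $\mathbf{w} \cdot \mathbf{x} = 0$ in the $(x_1, x_2)$ plane with weight vector $\mathbf{w}$ and a pattern $\mathbf{x}$; $y = 1$ on one side and $y = 0$ on the other]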

Training ANNs


Training set S of examples {x, t}

x is an input vector and t the desired target vector

Example: Logical AND

S = {(0,0),0}, {(0,1),0}, {(1,0),0}, {(1,1),1}

Iterative process

Present a training example x, compute network output y, compare output y with target t, adjust weights and thresholds

Learning rule

Specifies how to change the weights w and thresholds $\theta$ of the network as a function of the inputs x, output y and target t.



Adjusting the Weight Vector

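Target t=1, output y=0 (angle $\varphi > 90°$ between $\mathbf{w}$ and $\mathbf{x}$): move w in the direction of x

$\mathbf{w}' = \mathbf{w} + \alpha \mathbf{x}$

Target t=0, output y=1 (angle $\varphi < 90°$ between $\mathbf{w}$ and $\mathbf{x}$): move w away from the direction of x

$\mathbf{w}' = \mathbf{w} - \alpha \mathbf{x}$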

Perceptron Learning Rule


$\mathbf{w}' = \mathbf{w} + \alpha (t - y) \mathbf{x}$

Or in components

$w'_i = w_i + \Delta w_i = w_i + \alpha (t - y) x_i$   (i = 1..n+1)

with $w_{n+1} = \theta$ and $x_{n+1} = -1$

The parameter $\alpha$ is called the learning rate. It determines the magnitude of weight updates $\Delta w_i$.

If the output is correct (t = y) the weights are not changed ($\Delta w_i = 0$).

If the output is incorrect (t ≠ y) the weights $w_i$ are changed such that the new weight vector $\mathbf{w}'$ moves closer to the input $\mathbf{x}$ (when t = 1) or further from it (when t = 0).

Perceptron Training Algorithm

repeat
    for each training vector pair (x, t)
        evaluate the output y when x is the input
        if y ≠ t then
            form a new weight vector w’ according to
                w’ = w + α (t - y) x
        else
            do nothing
        end if
    end for
until y = t for all training vector pairs
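The loop above can be sketched directly in Python. This is an illustrative version under a few assumptions not stated on the slide: weights start at zero, the threshold is folded in as a last weight on a constant input of -1, and a `max_epochs` bound is added so the loop cannot run forever on data that is not linearly separable.

```python
def predict(w, x_aug):
    """TLU output for an augmented input (threshold is the last weight)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x_aug)) >= 0 else 0


def train_perceptron(samples, alpha=0.5, max_epochs=100):
    """Repeat w' = w + alpha * (t - y) * x until every sample is correct."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                   # assumed start: all weights zero
    for _ in range(max_epochs):           # safeguard, not part of the slide
        errors = 0
        for x, t in samples:
            x_aug = list(x) + [-1.0]      # constant input x_{n+1} = -1
            y = predict(w, x_aug)
            if y != t:
                w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, x_aug)]
                errors += 1
        if errors == 0:                   # y = t for all training pairs
            break
    return w


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(AND_SAMPLES)
and_predictions = [predict(w_and, list(x) + [-1.0]) for x, _ in AND_SAMPLES]
```

The learned weights need not equal the hand-picked (1, 1, 1.5) from the AND slide; any weight vector whose decision line separates the two classes ends the loop.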

Perceptron Convergence Theorem




The algorithm converges to the correct classification

- if the training data is linearly separable
- and $\alpha$ is sufficiently small

If two classes of vectors $X_1$ and $X_2$ are linearly separable, the application of the perceptron training algorithm will eventually result in a weight vector $\mathbf{w}_0$ such that $\mathbf{w}_0$ defines a TLU whose decision hyper-plane separates $X_1$ and $X_2$ (Rosenblatt 1962).

The solution $\mathbf{w}_0$ is not unique, since if $\mathbf{w}_0 \cdot \mathbf{x} = 0$ defines a hyper-plane, so does $\mathbf{w}'_0 = k \mathbf{w}_0$.

Example



x1    x2    output
1.0   1.0    1
9.4   6.4   -1
2.5   2.1    1
8.0   7.7   -1
0.5   2.2    1
7.9   8.4   -1
7.0   7.0   -1
2.8   0.8    1
1.2   3.0    1
7.8   6.1   -1

Initial weights: (0.75, -0.5, -0.6)
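The example data can be run through the perceptron rule as a rough sketch. Several details are assumptions because the slide does not state them: the third weight is treated as the threshold on a constant input of -1, the output is +1 when the activation is non-negative and -1 otherwise, the learning rate is 0.2, and `max_epochs` is a safeguard.

```python
DATA = [
    ((1.0, 1.0), 1), ((9.4, 6.4), -1), ((2.5, 2.1), 1), ((8.0, 7.7), -1),
    ((0.5, 2.2), 1), ((7.9, 8.4), -1), ((7.0, 7.0), -1), ((2.8, 0.8), 1),
    ((1.2, 3.0), 1), ((7.8, 6.1), -1),
]


def sign_output(w, x_aug):
    """Bipolar threshold unit: +1 if the activation is >= 0, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x_aug)) >= 0 else -1


def train(data, w_init, alpha=0.2, max_epochs=10000):
    """Perceptron rule w' = w + alpha * (t - y) * x on bipolar targets."""
    w = list(w_init)
    for epoch in range(max_epochs):
        changed = False
        for (x1, x2), t in data:
            x_aug = (x1, x2, -1.0)        # assumed constant input for the threshold
            y = sign_output(w, x_aug)
            if y != t:
                w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, x_aug)]
                changed = True
        if not changed:                   # a full pass without any update
            return w, epoch
    return w, max_epochs


w_final, epochs_used = train(DATA, (0.75, -0.5, -0.6))
all_correct = all(sign_output(w_final, (x1, x2, -1.0)) == t for (x1, x2), t in DATA)
```

Because the two classes in the table are linearly separable, the convergence theorem guarantees that the loop stops after finitely many updates regardless of the initial weights.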

Linear Unit




Inputs: $x_1, x_2, \dots, x_n$

Weights: $w_1, w_2, \dots, w_n$

Activation: $a = \sum_{i=1}^{n} w_i x_i$

Output: $y = a = \sum_{i=1}^{n} w_i x_i$

Gradient Descent Learning Rule


Consider a linear unit without threshold and with continuous output o (not just -1, 1)

$o = w_0 + w_1 x_1 + \dots + w_n x_n$

Train the $w_i$’s such that they minimize the squared error

$\varepsilon = \sum_{a \in D} (f_a - d_a)^2$

where D is the set of training examples.

Here $f_a$ is the actual output and $d_a$ is the desired output for example a.



Gradient Descent rule:

We want to choose the weights $w_i$ so that $\varepsilon$ is minimized. Recall that

$\varepsilon = \sum_{a \in D} (f_a - d_a)^2$

Since our goal is to work with this error function for one input at a time, let us consider a fixed input x in D, and define

$\varepsilon = (f_x - d_x)^2$

We will drop the subscript and just write this as:

$\varepsilon = (f - d)^2$

Our goal is to find the weights that will minimize this expression.













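$\partial\varepsilon/\partial W = [\partial\varepsilon/\partial w_1, \dots, \partial\varepsilon/\partial w_{n+1}]$

Since s, the input to the threshold function, is given by $s = X \cdot W$, we have:

$\partial\varepsilon/\partial W = \partial\varepsilon/\partial s \cdot \partial s/\partial W$

However, $\partial s/\partial W = X$. Thus,

$\partial\varepsilon/\partial W = \partial\varepsilon/\partial s \cdot X$

Recall from the previous slide that $\varepsilon = (f - d)^2$. So, we have:

$\partial\varepsilon/\partial s = 2(f - d) \cdot \partial f/\partial s$   (note: d is constant)

This gives the expression:

$\partial\varepsilon/\partial W = 2(f - d) \cdot \partial f/\partial s \cdot X$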

A problem arises when dealing with a TLU, namely f is not a continuous function of s.

For a fixed input x, suppose the desired output is d and the actual output is f. Ignoring the threshold and taking f = s, so that $\partial f/\partial s = 1$, the above expression becomes:

$\partial\varepsilon/\partial W = -2(d - f) x$

Moving against this gradient gives the weight update $W \leftarrow W + c(d - f) x$. This is what is known as the Widrow-Hoff procedure, with 2 replaced by c.

The key idea is to move the weight vector along the negative gradient.

When will this converge to the correct weights?

We are assuming that the data is linearly separable.

We are also assuming that the desired output from the linear threshold gate is available for the training set.

Under these conditions, the perceptron convergence theorem shows that the above procedure will converge to the correct weights after a finite number of iterations.
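A small sketch of the Widrow-Hoff update on a linear unit (f = s) follows. The training data is synthetic and hypothetical: targets are generated from a made-up weight vector `TRUE_W` so that recovery of those weights can be checked. The step size `c`, the epoch count, and the constant input -1 are likewise assumptions.

```python
def widrow_hoff(samples, c=0.1, epochs=500):
    """Apply W <- W + c * (d - f) * X once per sample, for a fixed number of epochs."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            f = sum(wi * xi for wi, xi in zip(w, x))          # linear output f = X . W
            w = [wi + c * (d - f) * xi for wi, xi in zip(w, x)]
    return w


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


TRUE_W = (2.0, -1.0, 0.5)                 # hypothetical target weights
GRID = [0.0, 0.5, 1.0]
# Augmented inputs (x1, x2, -1) on a 3x3 grid, with noise-free linear targets.
SAMPLES = [((x1, x2, -1.0), dot(TRUE_W, (x1, x2, -1.0))) for x1 in GRID for x2 in GRID]

w_fit = widrow_hoff(SAMPLES)
max_weight_error = max(abs(a - b) for a, b in zip(w_fit, TRUE_W))
```

Unlike the perceptron rule, this update changes the weights on every sample, even correctly fitted ones, by an amount proportional to the remaining error d - f.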


Neuron with Sigmoid-Function

Inputs: $x_1, x_2, \dots, x_n$

Weights: $w_1, w_2, \dots, w_n$

Activation: $a = \sum_{i=1}^{n} w_i x_i$

Output: $y = \sigma(a) = 1/(1 + e^{-a})$

Sigmoid Unit



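Inputs: $x_1, x_2, \dots, x_n$, plus a constant input $x_0 = -1$

Weights: $w_1, w_2, \dots, w_n$, plus $w_0$

Activation: $a = \sum_{i=0}^{n} w_i x_i$

Output: $y = \sigma(a) = 1/(1 + e^{-a})$

$\sigma(x)$ is the sigmoid function: $1/(1 + e^{-x})$

$d\sigma(x)/dx = \sigma(x)(1 - \sigma(x))$

Derive gradient descent rules to train a unit with the sigmoid function.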
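The derivative identity for the sigmoid can be checked numerically. The sketch below compares $\sigma(x)(1 - \sigma(x))$ with a central finite difference; the step size and the sample points are arbitrary choices.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x: float) -> float:
    """Closed form of the derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(f, x: float, h: float = 1e-5) -> float:
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


points = (-3.0, -1.0, 0.0, 0.5, 2.0)
max_gap = max(abs(sigmoid_prime(x) - numeric_derivative(sigmoid, x)) for x in points)
```

The practical benefit is that the derivative can be computed from the unit's output alone, without evaluating the exponential again.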

Gradient Descent Rule for Sigmoid Output Function

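[Figure: the activation a passed through the sigmoid $\sigma$]

$E^p[w_1, \dots, w_n] = \tfrac{1}{2}(t^p - y^p)^2$

$\partial E^p/\partial w_i = \partial/\partial w_i \, \tfrac{1}{2}(t^p - y^p)^2$

$= \partial/\partial w_i \, \tfrac{1}{2}(t^p - \sigma(\sum_i w_i x_i^p))^2$

$= (t^p - y^p) \, \sigma'(\sum_i w_i x_i^p) \, (-x_i^p)$

for $y = \sigma(a) = 1/(1 + e^{-a})$:

$\sigma'(a) = e^{-a}/(1 + e^{-a})^2 = \sigma(a)(1 - \sigma(a))$

$w'_i = w_i + \Delta w_i = w_i + \alpha \, y^p (1 - y^p)(t^p - y^p) \, x_i^p$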

Presentation of Training Examples


Presenting all training examples once to the ANN is called an epoch.

In incremental stochastic gradient descent training examples can be presented in

- Fixed order (1,2,3,…,M)
- Randomly permuted order (5,2,7,…,3)
- Completely random order (4,1,7,1,5,4,…) (repetitions allowed arbitrarily)
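The three presentation schemes can be sketched as index generators over M training examples. The function names are illustrative, and a seeded `random.Random` is used only to make the sketch reproducible.

```python
import random


def fixed_order(m: int) -> list:
    """1, 2, ..., M in the same order every epoch."""
    return list(range(1, m + 1))


def permuted_order(m: int, rng: random.Random) -> list:
    """Every example exactly once per epoch, in a shuffled order."""
    order = list(range(1, m + 1))
    rng.shuffle(order)
    return order


def random_with_repetition(m: int, rng: random.Random) -> list:
    """M independent draws; some examples may repeat and others may be skipped."""
    return [rng.randint(1, m) for _ in range(m)]


rng = random.Random(0)
epoch_fixed = fixed_order(7)
epoch_permuted = permuted_order(7, rng)
epoch_random = random_with_repetition(7, rng)
```

Only the first two schemes guarantee that an epoch in the sense defined above (each example presented once) is actually completed.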

Capabilities of Threshold Neurons


The threshold neuron can realize any linearly separable function $R^n \to \{0, 1\}$.

Although we only looked at two-dimensional input, our findings apply to any dimensionality n.

For example, for n = 3, our neuron can realize any function that divides the three-dimensional input space along a two-dimensional plane.


Capabilities of Threshold Neurons



What do we do if we need a more complex function?



We can combine multiple artificial neurons to form networks with increased capabilities.

For example, we can build a two-layer network with any number of neurons in the first layer giving input to a single neuron in the second layer.

The neuron in the second layer could, for example, implement an AND function.
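As one hedged illustration of such a two-layer network, XOR (which a single threshold neuron cannot realize) can be built from two first-layer neurons and a second-layer AND. The particular weights and thresholds below are hand-picked assumptions; many other choices work.

```python
def tlu(x, w, theta):
    """Threshold neuron: 1 if the weighted sum reaches theta, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0


def xor_net(x1, x2):
    # First layer: two dividing lines in the input plane.
    h_or = tlu((x1, x2), (1, 1), 0.5)        # fires unless both inputs are 0
    h_nand = tlu((x1, x2), (-1, -1), -1.5)   # fires unless both inputs are 1
    # Second layer: AND of the two first-layer outputs.
    return tlu((h_or, h_nand), (1, 1), 1.5)


xor_table = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Geometrically, the two first-layer neurons cut out the strip between two parallel lines, and the AND neuron fires only for inputs inside that strip.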

Capabilities of Threshold Neurons



What kind of function can such a network realize?


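[Figure: two-layer network. Each first-layer neuron receives the inputs $o_1$ and $o_2$; the first-layer outputs feed a single second-layer neuron with output $o_i$.]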

Capabilities of Threshold Neurons



Assume that the dotted lines in the diagram represent the input-dividing lines implemented by the neurons in the first layer:

[Figure: input plane with axes "1st comp." and "2nd comp.", crossed by dotted dividing lines]

Then, for example, the second-layer neuron could output 1 if the input is within a polygon, and 0 otherwise.

Capabilities of Threshold Neurons



However, we still may want to implement functions that are more complex than that.

An obvious idea is to extend our network even further.

Let us build a network that has three layers, with arbitrary numbers of neurons in the first and second layers and one neuron in the third layer.

The first and second layers are completely connected, that is, each neuron in the first layer sends its output to every neuron in the second layer.

Capabilities of Threshold Neurons



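What type of function can a three-layer network realize?

[Figure: three-layer network. Each first-layer neuron receives the inputs $o_1$ and $o_2$; the first layer is completely connected to the second layer, which feeds a single third-layer neuron with output $o_i$.]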

Capabilities of Threshold Neurons



Assume that the polygons in the diagram indicate the input regions for which each of the second-layer neurons yields output 1:

[Figure: input plane with axes "1st comp." and "2nd comp.", containing several polygons]

Then, for example, the third-layer neuron could output 1 if the input is within any of the polygons, and 0 otherwise.
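The three-layer construction can be sketched with threshold neurons as follows. The two boxes, their coordinates, and the helper names are hypothetical; the first layer holds the dividing lines, the second layer ANDs the lines of each polygon, and the third layer ORs the polygons.

```python
def tlu(x, w, theta):
    """Threshold neuron: 1 if the weighted sum reaches theta, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0


def box_halfplanes(x_min, x_max, y_min, y_max):
    """Four (weights, threshold) pairs whose intersection is an axis-aligned box."""
    return [((1, 0), x_min), ((-1, 0), -x_max), ((0, 1), y_min), ((0, -1), -y_max)]


# First layer: eight dividing lines, four per box (two boxes with a gap between them).
LAYER1 = box_halfplanes(0, 1, 0, 1) + box_halfplanes(2, 3, 0, 1)
# Second layer: each neuron ANDs the four lines of one box (threshold 3.5).
LAYER2 = [((1, 1, 1, 1, 0, 0, 0, 0), 3.5), ((0, 0, 0, 0, 1, 1, 1, 1), 3.5)]
# Third layer: OR of the second-layer outputs (threshold 0.5).
LAYER3 = ((1, 1), 0.5)


def network(point):
    h1 = [tlu(point, w, th) for w, th in LAYER1]
    h2 = [tlu(h1, w, th) for w, th in LAYER2]
    return tlu(h2, *LAYER3)


results = [network(p) for p in [(0.5, 0.5), (2.5, 0.5), (1.5, 0.5), (0.5, 2.0)]]
```

The first two test points lie inside one of the boxes and give output 1; the point in the gap and the point above the boxes give 0.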

Capabilities of Threshold Neurons



The more neurons there are in the first layer, the more vertices the polygons can have.

With a sufficient number of first-layer neurons, the polygons can approximate any given shape.

The more neurons there are in the second layer, the more of these polygons can be combined to form the output function of the network.

With a sufficient number of neurons and appropriate weight vectors $\mathbf{w}_i$, a three-layer network of threshold neurons can realize any function $R^n \to \{0, 1\}$.

Terminology



Usually, we draw neural networks in such a way that the input enters at the bottom and the output is generated at the top.

Arrows indicate the direction of data flow.

The first layer, termed input layer, just contains the input vector and does not perform any computations.

The second layer, termed hidden layer, receives input from the input layer and sends its output to the output layer.

After applying their activation function, the neurons in the output layer contain the output vector.

Terminology



Example: Network function f: $R^3 \to \{0, 1\}^2$

[Figure: network drawn from bottom to top: input vector, input layer, hidden layer, output layer, output vector]

Multi-Layer Networks

[Figure: network with an input layer, a hidden layer, and an output layer]

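Training-Rule for Weights to the Output Layer

[Figure: input $x_i$ connected to output unit $y_j$ through weight $w_{ji}$]

$E^p[w_{ij}] = \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2$

$\partial E^p/\partial w_{ij} = \partial/\partial w_{ij} \, \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2$

$= \dots$

$= -\, y_j^p (1 - y_j^p)(t_j^p - y_j^p) \, x_i^p$

$\Delta w_{ij} = \alpha \, y_j^p (1 - y_j^p)(t_j^p - y_j^p) \, x_i^p = \alpha \, \delta_j^p \, x_i^p$

with $\delta_j^p := y_j^p (1 - y_j^p)(t_j^p - y_j^p)$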

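Training-Rule for Weights to the Hidden Layer

[Figure: input $x_i$, hidden unit $x_k$ and output unit $y_j$, connected by weights $w_{ki}$ and $w_{jk}$; the output error $\delta_j$ is passed back to give the hidden error $\delta_k$]

Credit assignment problem: there are no target values t for hidden layer units.

Error for hidden units?

$\delta_k = \sum_j w_{jk} \, \delta_j$, where $\delta_j = y_j (1 - y_j)(t_j - y_j)$ is the output-layer error

$\Delta w_{ki} = \alpha \, x_k^p (1 - x_k^p) \, \delta_k^p \, x_i^p$

Here the factor $x_k (1 - x_k)$ is kept outside $\delta_k$.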

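Training-Rule for Weights to the Hidden Layer

$E^p[w_{ki}] = \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2$

$\partial E^p/\partial w_{ki} = \partial/\partial w_{ki} \, \tfrac{1}{2} \sum_j (t_j^p - y_j^p)^2$

$= \partial/\partial w_{ki} \, \tfrac{1}{2} \sum_j (t_j^p - \sigma(\sum_k w_{jk} x_k^p))^2$

$= \partial/\partial w_{ki} \, \tfrac{1}{2} \sum_j (t_j^p - \sigma(\sum_k w_{jk} \, \sigma(\sum_i w_{ki} x_i^p)))^2$

$= -\sum_j (t_j^p - y_j^p) \, \sigma'_j(a) \, w_{jk} \, \sigma'_k(a) \, x_i^p$

$= -\sum_j \delta_j \, w_{jk} \, \sigma'_k(a) \, x_i^p$

$= -\sum_j \delta_j \, w_{jk} \, x_k (1 - x_k) \, x_i^p$

[Figure: input $x_i$, hidden unit $x_k$ and output unit $y_j$, with weights $w_{ki}$ and $w_{jk}$ and errors $\delta_k$ and $\delta_j$]

$\Delta w_{ki} = \alpha \, \delta_k \, x_i^p$

with $\delta_k = \sum_j \delta_j \, w_{jk} \, x_k (1 - x_k)$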

Backpropagation


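[Figure: network with input $x_i$, hidden unit $x_k$ and output $y_j$, weights $w_{ki}$ and $w_{jk}$, and errors $\delta_j$ and $\delta_k$]

Forward step: propagate activation from input to output layer

Backward step: propagate errors from output to hidden layer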

Backpropagation Algorithm


Initialize weights $w_{ij}$ with a small random value

repeat
    for each training pair {$(x_1, \dots, x_n)^p$, $(t_1, \dots, t_m)^p$} do
        Present $(x_1, \dots, x_n)^p$ to the network and compute the outputs $y_j$ (forward step)
        Compute the errors $\delta_j$ in the output layer and propagate them to the hidden layer (backward step)
        Update the weights in both layers according to
            $\Delta w_{ki} = \alpha \, \delta_k \, x_i$
    end for loop
until overall error E becomes acceptably low



Backpropagation Algorithm


Initialize each $w_i$ to some small random value

Until the termination condition is met, Do
    For each training example <$(x_1, \dots, x_n)$, t> Do
        Input the instance $(x_1, \dots, x_n)$ to the network and compute the network outputs $y_k$
        For each output unit k
            $\delta_k = y_k (1 - y_k)(t_k - y_k)$
        For each hidden unit h
            $\delta_h = y_h (1 - y_h) \sum_k w_{h,k} \, \delta_k$
        For each network weight $w_{i,j}$ Do
            $w_{i,j} = w_{i,j} + \Delta w_{i,j}$ where $\Delta w_{i,j} = \eta \, \delta_j \, x_{i,j}$
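The algorithm above can be sketched for a network with one hidden layer of sigmoid units. This is an illustrative pure-Python version, not the slides' own code: the layer sizes, the constant -1 bias inputs, the weight range, the learning rate, and the XOR training set are all assumptions. Depending on the random start, training may or may not reach a low error.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_hid, w_out):
    """Forward step. Returns hidden outputs (with a -1 bias input appended) and outputs."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    hidden_b = hidden + [-1.0]               # assumed bias input to the output layer
    y = [sigmoid(sum(w * h for w, h in zip(row, hidden_b))) for row in w_out]
    return hidden_b, y


def backprop_step(x, t, w_hid, w_out, eta):
    """One forward and backward step for a single example; updates weights in place."""
    hidden_b, y = forward(x, w_hid, w_out)
    # Output units: delta_k = y_k (1 - y_k)(t_k - y_k)
    delta_out = [yk * (1 - yk) * (tk - yk) for yk, tk in zip(y, t)]
    # Hidden units: delta_h = y_h (1 - y_h) sum_k w_{h,k} delta_k (using the old weights)
    delta_hid = [
        hidden_b[h] * (1 - hidden_b[h]) * sum(w_out[k][h] * delta_out[k] for k in range(len(y)))
        for h in range(len(w_hid))
    ]
    for k, row in enumerate(w_out):          # weights into the output layer
        for h in range(len(row)):
            row[h] += eta * delta_out[k] * hidden_b[h]
    for h, row in enumerate(w_hid):          # weights into the hidden layer
        for i in range(len(row)):
            row[i] += eta * delta_hid[h] * x[i]


rng = random.Random(0)
W_HID = [[rng.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(2)]  # 2 hidden units
W_OUT = [[rng.uniform(-0.05, 0.05) for _ in range(3)]]                    # 1 output unit
XOR_SET = [([0.0, 0.0, -1.0], [0.0]), ([0.0, 1.0, -1.0], [1.0]),
           ([1.0, 0.0, -1.0], [1.0]), ([1.0, 1.0, -1.0], [0.0])]
for _ in range(2000):                        # fixed epoch count instead of an error test
    for x_p, t_p in XOR_SET:
        backprop_step(x_p, t_p, W_HID, W_OUT, eta=0.5)
```

The hidden errors are computed before any weight is changed, so both layers are updated from the same forward pass, as in the pseudocode.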

Backpropagation


Gradient descent over entire network weight vector

Easily generalized to arbitrary directed graphs

Will find a local, not necessarily global error minimum

- in practice often works well (can be invoked multiple times with different initial weights)

Often include weight momentum term

$\Delta w_{i,j}(n) = \eta \, \delta_j \, x_{i,j} + \alpha \, \Delta w_{i,j}(n-1)$

Minimizes error over training examples

Will it generalize well to unseen instances (over-fitting)?

Training can be slow: typically 1000-10000 iterations

(Using network after training is fast)


Backpropagation


Easily generalized to arbitrary directed graphs without clear layers.

BP finds a local, not necessarily global error minimum

- in practice often works well (can be invoked multiple times with different initial weights)

Minimizes error over training examples

How does it generalize to unseen instances?

Training can be slow: typically 1000-10000 iterations

(use more efficient optimization methods than gradient descent)

Using network after training is fast


Convergence of Backprop

Gradient descent to some local minimum, perhaps not global minimum

Add momentum term:

$\Delta w_{ki}(n) = \alpha \, \delta_k(n) \, x_i(n) + \lambda \, \Delta w_{ki}(n-1)$

with $\lambda \in [0, 1)$

Stochastic gradient descent

Train multiple nets with different initial weights
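The effect of the momentum term can be sketched on its own. In the toy loop below the gradient term $\delta_k x_i$ is held at a constant, made-up value of 1.0, so the only thing that changes from step to step is the accumulated momentum; `alpha` and `lam` are example values.

```python
def momentum_step(grad_term, prev_delta, alpha, lam):
    """Delta w(n) = alpha * grad_term + lam * Delta w(n-1)."""
    return alpha * grad_term + lam * prev_delta


delta = 0.0
history = []
for n in range(200):
    # grad_term stands for delta_k(n) * x_i(n); held constant here for illustration.
    delta = momentum_step(grad_term=1.0, prev_delta=delta, alpha=0.1, lam=0.9)
    history.append(delta)
```

With a constant gradient term the step size grows from alpha toward alpha / (1 - lam), here from 0.1 toward 1.0, which is why momentum speeds up travel along consistently sloped directions of the error surface.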

Nature of convergence


Initialize weights near zero


Therefore, initial networks near-linear

Increasingly non-linear functions possible as training progresses

Expressive Capabilities of ANN

Boolean functions


Every boolean function can be represented by a network with a single hidden layer

But might require exponential (in number of inputs) hidden units

Continuous functions

Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]

Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]