Machine Learning Artificial Neural Networks (ANN)-cont.

2013/10/20

Shanghai Jiao Tong University

4.5 Multilayer Networks & Backpropagation

Multilayer networks can express highly nonlinear decision surfaces; see, for example, Figure 4.5.

4.5.1 Differentiable Threshold Function

To build a multilayer network, what kind of unit should be used?

A multilayer network of linear units still computes only a linear function.

Perceptron units can be used to build nonlinear functions. However, the discontinuous threshold function is not differentiable, so the gradient descent algorithm cannot be used for training.

The unit should have the following features:

Nonlinear function;

Differentiable (has a derivative).

Sigmoid Unit

Similar to a perceptron unit, but based on a smooth, differentiable function.

Smooth function: it has derivatives of all orders.

A sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. The thresholded output is a continuous function of its input:

$o = \sigma(\vec{w} \cdot \vec{x})$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$

Sigmoid function

Also called the logistic function or squashing function.

Output ranges between 0 and 1.

Increases monotonically with its input.

Its derivative is easily expressed in terms of its output: $\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))$.

Sigmoid variants

Other differentiable functions are sometimes used in place of $\sigma$:

The term $e^{-y}$ is sometimes replaced by $e^{-ky}$, where $k > 0$ determines the steepness of the threshold.

The function tanh: $\tanh x = \sinh x / \cosh x$.
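The properties listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`sigmoid`, `sigmoid_derivative_from_output`, `tanh_via_sinh_cosh`) are chosen here for illustration and do not come from the slides.

```python
import math

def sigmoid(y, k=1.0):
    """Logistic function 1 / (1 + e^(-k*y)); k > 0 sets the steepness."""
    return 1.0 / (1.0 + math.exp(-k * y))

def sigmoid_derivative_from_output(o):
    """Derivative of the k = 1 sigmoid, written in terms of its output o."""
    return o * (1.0 - o)

def tanh_via_sinh_cosh(x):
    """tanh x = sinh x / cosh x, as stated above."""
    return math.sinh(x) / math.cosh(x)

# Print input, output, and derivative at a few points.
for y in (-2.0, 0.0, 2.0):
    o = sigmoid(y)
    print(y, round(o, 4), round(sigmoid_derivative_from_output(o), 4))
```

Because the derivative only needs the unit's output, a training procedure can reuse the value already computed in the forward pass.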

[Figure: sigmoid curve plotted against $y$, with the output axis marked at 0, 0.5, and 1]

4.5.2 Backpropagation

Learns the weights of a multilayer network.

Tries to minimize the difference (training error) between the target output and the output of the network.

The difference (training error) is defined as the sum of the squared errors over all outputs of the network.

The algorithm searches a huge hypothesis space, defined by all possible weight values of the network.

It employs gradient descent to attempt to minimize the squared error between the network output values and the target values.

The error surface may have multiple minima, so the algorithm is not guaranteed to find the global error minimum. In practice it often works well (it can be invoked multiple times with different initial weights).

Squared error function:

$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$

where $D$ is the set of training examples, and $t_{kd}$ and $o_{kd}$ are the target and output values of the kth output unit for training example d.

Table 4.2. Backpropagation
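The body of the table did not survive in these notes. As a reference point, the per-example update rules for a network with one hidden layer of sigmoid units are commonly stated as below. The labels (T4.3) to (T4.5) follow the usual textbook numbering and are an assumption here, as is the symbol $\eta$ for the learning rate.

```latex
\begin{align*}
\text{For each output unit } k:\quad
  & \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k) && \text{(T4.3)} \\
\text{For each hidden unit } h:\quad
  & \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh}\, \delta_k && \text{(T4.4)} \\
\text{For each weight } w_{ji}:\quad
  & w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \quad \Delta w_{ji} = \eta\, \delta_j\, x_{ji} && \text{(T4.5)}
\end{align*}
```

These steps are applied after propagating each training example forward through the network, and the loop over examples repeats until a termination condition is met.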

The algorithm in Table 4.2 applies to layered feedforward networks containing two layers of sigmoid units, with the units at each layer connected to all units from the preceding layer.

This is the stochastic, or incremental, gradient descent version of Backpropagation.

The notation used here:

An index is assigned to each node, where a "node" is either an input to the network or the output of some unit.

$x_{ji}$ denotes the input from node i to unit j, and $w_{ji}$ denotes the corresponding weight.

$\delta_n$ denotes the error term associated with unit n.

4.5.2.2 LEARNING IN ARBITRARY ACYCLIC NETWORKS

The algorithm easily generalizes to feedforward networks of arbitrary depth. The weight update rule seen in Equation (T4.5) is retained, and the only change is to the procedure for computing $\delta$ values.

The $\delta_r$ value for a unit r in layer m is computed from the $\delta$ values at the next deeper layer m + 1 according to

$\delta_r = o_r (1 - o_r) \sum_{s \in \text{layer } m+1} w_{sr}\, \delta_s$

It is equally simple to generalize the algorithm to any directed acyclic graph, regardless of whether the network units are arranged in uniform layers as we have assumed up to now. In the case that they are not, the rule for calculating $\delta$ for any internal unit (i.e., any unit that is not an output) is

$\delta_r = o_r (1 - o_r) \sum_{s \in Downstream(r)} w_{sr}\, \delta_s$

where Downstream(r) is the set of units immediately downstream from unit r in the network: that is, all units whose inputs include the output of unit r.
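The Downstream rule can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides: the dictionary-based network representation, the function name `compute_deltas`, and the three-unit example (including its weights and outputs) are all assumptions made for the demonstration. It also assumes that output units have no downstream units.

```python
def compute_deltas(outputs, targets, weights, downstream, order):
    """Compute delta for every unit of an acyclic sigmoid network.

    outputs:    dict unit -> o_r (already computed by a forward pass)
    targets:    dict output unit -> t_k
    weights:    dict (s, r) -> w_sr, the weight on the connection from r into s
    downstream: dict unit -> list of units fed directly by it
    order:      units listed so that each unit comes after all units that feed it
    """
    deltas = {}
    for r in reversed(order):      # visit downstream units first
        o = outputs[r]
        if r in targets:           # output unit
            deltas[r] = o * (1 - o) * (targets[r] - o)
        else:                      # internal unit: sum over Downstream(r)
            back = sum(weights[(s, r)] * deltas[s] for s in downstream[r])
            deltas[r] = o * (1 - o) * back
    return deltas

# Hypothetical non-layered network: h1 feeds both h2 and out; h2 feeds out.
outputs = {"h1": 0.5, "h2": 0.5, "out": 0.5}
targets = {"out": 1.0}
weights = {("h2", "h1"): 2.0, ("out", "h2"): 1.0, ("out", "h1"): -1.0}
downstream = {"h1": ["h2", "out"], "h2": ["out"], "out": []}
deltas = compute_deltas(outputs, targets, weights, downstream, ["h1", "h2", "out"])
print(deltas)
```

Visiting the units in reverse topological order guarantees that every $\delta_s$ needed for a unit has been computed before that unit is reached.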

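Derivation of the BP Rule

Derive the stochastic gradient descent rule used by the algorithm in Table 4.2, where the error on a training example d is

$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$

The weight update rule is:

$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$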

Notation:

$x_{ji}$ = the ith input to unit j

$w_{ji}$ = the weight associated with the ith input to unit j

$net_j = \sum_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit j)

$o_j$ = the output computed by unit j

$t_j$ = the target output for unit j

outputs = the set of units in the final layer

Downstream(j) = the set of units whose immediate inputs include the output of unit j

[Figure: a hidden unit j with inputs $x_{j1}, \ldots, x_{ji}$, weights $w_{ji}$, and output $o_j$]

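Derive $\frac{\partial E_d}{\partial w_{ji}}$.

Notice that $w_{ji}$ influences the rest of the network only through $net_j$, so

$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji}$

Case 1: for units in the output layer

$net_j$ influences the network only through $o_j$, so

$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j} = -(t_j - o_j)\, o_j (1 - o_j)$

$\Delta w_{ji} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$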

[Figure: network diagram with an output layer and a hidden layer; $w_{ji}$ is the weight from unit i to unit j, and $o_j$ is the output of unit j]

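Derivation of the BP Rule (2)

Case 2: for units in the hidden layer

$net_j$ influences the network outputs only through the units in Downstream(j). Writing $\delta_k = -\frac{\partial E_d}{\partial net_k}$,

$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)$

so that $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$ and $\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$.

[Figure: hidden unit j feeding output units $k_1, \ldots, k_m$, with outputs $o_{k_1}, \ldots, o_{k_m}$ and net inputs $net_k$]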

More on Backpropagation

Gradient descent over the entire network weight vector

Will find a local, not necessarily global, error minimum

In practice, often works well (can run multiple times)

Often includes a weight momentum term $\alpha$

Minimizes error over the training examples

Will it generalize well to subsequent examples?

Training can take thousands of iterations (slow!)

Using the network after training is very fast
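The slide only names the momentum constant $\alpha$. A commonly used form of the rule, assumed here, makes the update at step n depend partly on the previous update: $\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1)$ with $0 \le \alpha < 1$. The sketch below applies that form to a single weight; the function name and the numeric values are illustrative only.

```python
def momentum_step(weight, gradient_step, previous_delta, alpha):
    """Return (new_weight, delta) with delta = gradient_step + alpha * previous_delta.

    gradient_step stands for the plain update eta * delta_j * x_ji.
    """
    delta = gradient_step + alpha * previous_delta
    return weight + delta, delta

# Three updates with a constant gradient step: the effective step size grows.
w, prev = 0.0, 0.0
history = []
for _ in range(3):
    w, prev = momentum_step(w, gradient_step=0.1, previous_delta=prev, alpha=0.5)
    history.append(prev)
print(history, w)
```

When successive gradient steps point the same way, the accumulated term lengthens the step, which can carry the search across flat regions of the error surface.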

Hidden layer representations

Introduction and application

Perceptron

Multilayer networks and Backpropagation

Hidden layer representations

Example: Face Recognition

Learning Hidden Layer Representations

A target function: [Table: each of the eight one-hot 8-bit inputs is mapped to an identical output, e.g. 10000000 to 10000000]

Can this be learned?

One intriguing property of BACKPROPAGATION is its ability to discover useful intermediate representations at the hidden unit layers inside the network.

A network: [Figure: network with eight inputs, three hidden units, and eight outputs]

Learned hidden layer representation: [Table: hidden unit values learned for each of the eight inputs]

ANNs automatically discover useful representations at the hidden layer.

The hidden unit encoding shown in the figure was obtained after 5000 training iterations through the outer loop of the algorithm (i.e., 5000 iterations through each of the eight training examples).
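The experiment described here can be reproduced in outline with a small pure-Python network. This is a sketch under stated assumptions rather than the original setup: the learning rate (0.3), the initial weight range, the random seed, and the shorter run of 1000 epochs are all chosen for illustration, and the target is assumed to be the identity mapping over the eight one-hot inputs.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_out):
    """Return (hidden values, output values); index 0 of each weight row is the bias."""
    h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hidden]
    o = [sigmoid(w[0] + sum(wi * hi for wi, hi in zip(w[1:], h))) for w in w_out]
    return h, o

def total_error(examples, w_hidden, w_out):
    """Sum of squared errors over all examples and outputs (with the 1/2 factor)."""
    err = 0.0
    for x, t in examples:
        _, o = forward(x, w_hidden, w_out)
        err += 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))
    return err

def train(examples, n_hidden, eta, epochs, rng):
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    initial = total_error(examples, w_hidden, w_out)
    for _ in range(epochs):
        for x, t in examples:  # stochastic: update after each example
            h, o = forward(x, w_hidden, w_out)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            d_hid = [hj * (1 - hj) * sum(w_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j, hj in enumerate(h)]
            for k in range(n_out):
                w_out[k][0] += eta * d_out[k]
                for j in range(n_hidden):
                    w_out[k][j + 1] += eta * d_out[k] * h[j]
            for j in range(n_hidden):
                w_hidden[j][0] += eta * d_hid[j]
                for i in range(n_in):
                    w_hidden[j][i + 1] += eta * d_hid[j] * x[i]
    return w_hidden, w_out, initial

# Eight one-hot patterns, each mapped to itself (assumed target function).
patterns = [[1.0 if i == j else 0.0 for i in range(8)] for j in range(8)]
examples = [(p, p) for p in patterns]
w_hidden, w_out, initial_error = train(examples, n_hidden=3, eta=0.3, epochs=1000,
                                       rng=random.Random(0))
final_error = total_error(examples, w_hidden, w_out)
hidden_for_second_input, _ = forward(patterns[1], w_hidden, w_out)
print(initial_error, final_error, [round(v, 2) for v in hidden_for_second_input])
```

Printing the three hidden values for each of the eight inputs after a longer run is one way to inspect the encoding the network has found; the exact values depend on the seed and the settings above.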

Training

We can directly observe the effect of BACKPROPAGATION's gradient descent search by plotting the squared output error as a function of the number of gradient descent search steps.

[Figure: sum of squared errors versus number of iterations. Each line shows the sum of squared errors over all training examples for one of the eight network outputs.]

[Figure: hidden unit encoding for input 01000000 versus number of iterations]

This plot shows the three hidden unit values computed by the learned network for one of the possible inputs (in particular, 01000000). The horizontal axis indicates the number of training iterations.

[Figure: the evolution of the weights connecting the 8 inputs to one hidden unit]

This plot displays the evolution of the weights connecting the eight input units (and the constant 1 bias input) to one of the three hidden units.

Convergence of Backpropagation

Gradient descent to some local minimum

Perhaps not the global minimum

Train multiple nets with different initial weights

Nature of convergence

Initialize weights near zero

Therefore, initial networks are near-linear

Increasingly nonlinear functions become possible as training progresses

Expressive Capabilities of ANNs

Boolean functions:

Every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs.

Continuous functions:

Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer.

Arbitrary functions:

Any function can be approximated to arbitrary accuracy by a network with two hidden layers.
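The claim about Boolean functions can be illustrated with a direct construction. The sketch below is one simple construction, not necessarily the one intended in the lecture: it uses perceptron-style threshold units rather than sigmoid units, allocates one hidden unit per input pattern on which the function is true, and lets the output unit OR the hidden units together. The function names are illustrative.

```python
from itertools import product

def step(z):
    """Perceptron-style threshold unit output."""
    return 1 if z > 0 else 0

def build_network(f, n):
    """One hidden unit per input pattern on which f is true (up to 2**n units)."""
    hidden = []
    for pattern in product((0, 1), repeat=n):
        if f(*pattern):
            weights = [1 if bit else -1 for bit in pattern]
            bias = 0.5 - sum(pattern)   # the unit fires only on an exact match
            hidden.append((weights, bias))
    return hidden

def run_network(hidden, x):
    h = [step(b + sum(w * xi for w, xi in zip(ws, x))) for ws, b in hidden]
    return step(sum(h) - 0.5)           # output unit: OR of the hidden units

xor = lambda a, b: a != b
net = build_network(xor, 2)
table = {x: run_network(net, x) for x in product((0, 1), repeat=2)}
print(len(net), table)
```

Because a function that is true on many patterns needs one hidden unit for each of them, this construction also shows where the exponential number of hidden units can come from.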

Overfitting in ANNs

[Figures: example 1 and example 2]

Overfitting Prevention

Keep a hold-out validation set and measure accuracy on it during training.

Use 10-fold cross-validation to determine the average number of iterations that optimizes validation performance.

Weight decay: all weights are multiplied by some fraction between 0 and 1 after each iteration.

This encourages smaller weights and less complex hypotheses.

It is equivalent to adding a penalty term to the error function that is proportional to the sum of the squares of the weights of the network.
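To see why the two descriptions of weight decay coincide, consider one gradient descent step on a penalized error. The symbols $\gamma$ (penalty strength), $\eta$ (learning rate), and $E_0$ (the unpenalized error) are notation assumed for this sketch.

```latex
\begin{align*}
E(\vec{w}) &= E_0(\vec{w}) + \gamma \sum_{i,j} w_{ji}^2 \\
w_{ji} &\leftarrow w_{ji} - \eta \frac{\partial E}{\partial w_{ji}}
        = (1 - 2\eta\gamma)\, w_{ji} - \eta \frac{\partial E_0}{\partial w_{ji}}
\end{align*}
```

Each step therefore shrinks every weight by the factor $(1 - 2\eta\gamma)$, which lies between 0 and 1 whenever $0 < 2\eta\gamma < 1$, and then applies the usual update.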

Example: Face Recognition

Neural Nets for Face Recognition

90% accurate at learning head pose, and at recognizing 1-of-20 faces

[Figure: Learned Hidden Unit Weights]

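Alternative Error Functions

Penalize large weights:

$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$

Train on target slopes as well as values:

$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$

Tie together weights:

e.g., in a phoneme recognition network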

4.8.3 Recurrent Networks

Up to this point we have considered only network topologies that correspond to acyclic directed graphs.

Recurrent networks are artificial neural networks that apply to time series data and that use the outputs of network units at time t as the input to other units at time t + 1. In this way, they support a form of directed cycles in the network.

Consider a time series prediction task: predicting the next day's stock market average y(t + 1) based on the current day's economic indicators x(t).

One approach is to train a feedforward network to predict y(t + 1) as its output, based on the input values x(t).

[Figure: (a) Feedforward network]

One limitation of such a network is that the prediction of y(t + 1) depends only on x(t) and cannot capture possible dependencies of y(t + 1) on earlier values of x. This might be necessary, for example, if tomorrow's stock market average y(t + 1) depends on the difference between today's economic indicator values x(t) and yesterday's values x(t - 1).

Of course we could remedy this difficulty by making both x(t) and x(t - 1) inputs to the feedforward network. However, if we wish the network to consider an arbitrary window of time in the past when predicting y(t + 1), then a different solution is required.

The recurrent network shown in Figure 4.11(b) provides one such solution. Here, we have added a new unit b to the hidden layer, and a new input unit c(t).

The value of c(t) is defined as the value of unit b at time t - 1; that is, the input value c(t) to the network at one time step is simply copied from the value of unit b on the previous time step.

Notice that this implements a recurrence relation, in which b represents information about the history of network inputs. Because b depends on both x(t) and c(t), it is possible for b to summarize information from earlier values of x that are arbitrarily distant in time.
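The copy-back mechanism can be sketched with a single recurrent hidden unit. This is a minimal illustration, not the network from the figure: it assumes scalar inputs, one hidden unit b, a sigmoid output, and hand-picked weights (`w_x`, `w_c`, `w_y`) that are not trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_recurrent(xs, w_x, w_c, w_y, c0=0.0):
    """Feed a sequence through a one-hidden-unit recurrent network.

    b(t) = sigmoid(w_x * x(t) + w_c * c(t)), with c(t) = b(t - 1).
    Returns the predictions y(t + 1) = sigmoid(w_y * b(t)) for each t.
    """
    c = c0
    predictions = []
    for x in xs:
        b = sigmoid(w_x * x + w_c * c)
        predictions.append(sigmoid(w_y * b))
        c = b  # copy b into the context input for the next time step
    return predictions

# Same final input x(t) = 1.0, but different histories.
after_zeros = run_recurrent([0.0, 0.0, 1.0], w_x=1.0, w_c=2.0, w_y=1.0)[-1]
after_ones = run_recurrent([1.0, 1.0, 1.0], w_x=1.0, w_c=2.0, w_y=1.0)[-1]
print(after_zeros, after_ones)
```

The two predictions differ even though the last input is the same, because b carries information about earlier inputs. Setting `w_c` to 0 removes the feedback and makes the prediction depend on x(t) alone, as in the feedforward case.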

How can recurrent networks be trained?

One approach is to unfold the recurrent network in time: we make several copies of the recurrent network, replacing the feedback loop by connections between the various copies. Notice that this large unfolded network contains no cycles. Therefore, the weights in the unfolded network can be trained directly using BACKPROPAGATION.

We wish to keep only one copy of the recurrent network and one set of weights. Therefore, after training the unfolded network, the final weight $w_{ji}$ in the recurrent network can be taken to be the mean value of the corresponding weights $w_{ji}$ in the various copies.

4.8.4 Dynamically Modifying Network Structure

Up to this point we have considered neural network learning as a problem of adjusting weights within a fixed graph structure.

A variety of methods have been proposed to dynamically grow or shrink the number of network units and interconnections in an attempt to improve generalization accuracy.

One idea is to begin with a network containing no hidden units, then grow the network as needed by adding hidden units until the training error is reduced to some acceptable level.

The CASCADE-CORRELATION algorithm (Fahlman and Lebiere 1990) is one such algorithm. It begins by constructing a network with no hidden units.

It then adds a hidden unit, choosing its weight values to maximize the correlation between the hidden unit value and the residual error of the overall network.

The new unit is then installed into the network, with its weight values held fixed, and a new connection from this new unit to the output unit is added.

The process is now repeated. The original weights are retrained (holding the hidden unit weights fixed), the residual error is checked, and a second hidden unit is added if the residual error is still above threshold. Whenever a new hidden unit is added, its inputs include all of the original network inputs plus the outputs of any existing hidden units. The network is grown in this fashion, accumulating hidden units until the network residual error is reduced to some acceptable level.

CASCADE-CORRELATION significantly reduces training times, because only a single layer of units is trained at each step.

One practical difficulty is that because the algorithm can add units indefinitely, it is quite easy for it to overfit the training data, so precautions to avoid overfitting must be taken.

4.8.4 Another idea

A second idea for dynamically altering network structure is to take the opposite approach. Instead of beginning with the simplest possible network and adding complexity, we begin with a complex network and prune it as we find that certain connections are inessential.

One way to decide whether a particular weight is inessential is to see whether its value is close to zero.

A second way, which appears to be more successful in practice, is to consider the effect that a small variation in the weight has on the error E.

The effect on E of varying w (i.e., the partial derivative $\frac{\partial E}{\partial w}$) can be taken as a measure of the salience of the connection.

LeCun et al. (1990) describe a process in which the network is trained, the least salient connections are removed, and this process is iterated until some termination condition is met. They refer to this as the "optimal brain damage" approach, because at each step the algorithm attempts to remove the least useful connections.
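The pruning loop can be sketched as follows. This is a simplified illustration that follows the description above (salience measured by the effect on E of a small change in a weight), estimated here with a finite difference; it is not the published optimal brain damage procedure. The toy error function, the function names, and the perturbation size `eps` are assumptions.

```python
def salience(error_fn, weights, eps=1e-4):
    """Estimate |dE/dw_i| for each weight by a central finite difference."""
    scores = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += eps
        down = list(weights)
        down[i] -= eps
        scores.append(abs(error_fn(up) - error_fn(down)) / (2 * eps))
    return scores

def prune_least_salient(error_fn, weights, active):
    """Zero out the active weight with the smallest salience; return its index."""
    scores = salience(error_fn, weights)
    i = min(active, key=lambda j: scores[j])
    weights[i] = 0.0
    active.remove(i)
    return i

# Toy linear unit: the second input is always 0, so its weight cannot affect E.
data = [((1.0, 0.0), 1.0), ((2.0, 0.0), 2.0)]

def error(w):
    return 0.5 * sum((t - (w[0] * x[0] + w[1] * x[1])) ** 2 for x, t in data)

weights = [0.5, 0.8]
active = [0, 1]
removed = prune_least_salient(error, weights, active)
print(removed, weights)
```

In a full procedure the network would be retrained between pruning steps, and the loop would stop when a termination condition is met, for example when validation error starts to rise.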

Homework

4.2, 4.3, 4.5, 4.7, 4.11 (do not need to submit)