2013/10/20
1
Machine Learning
Artificial Neural Networks
(ANN)

cont.
Shanghai Jiao Tong University
4.5 Multilayer Networks & Backpropagation
• Multilayer networks can express highly nonlinear decision surfaces.
• For example, see Figure 4.5.
4.5.1 Differentiable Threshold Function: Sigmoid Unit
• To build a multilayer network, what kind of unit should be used?
– A multilayer network of linear units still computes only a linear function.
– Perceptron units can be used to build nonlinear functions. However, the discontinuous threshold function is not differentiable, so the gradient descent algorithm cannot be used for training.
• The unit should have the following features:
– It computes a nonlinear function;
– It is differentiable (has a derivative).
• Sigmoid unit
– Similar to a perceptron unit, but based on a smooth, differentiable threshold function.
– Smooth function: it has derivatives of all orders.
4.5.1 Differentiable Threshold Function: Sigmoid Unit
• A sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. The thresholded output is a continuous function of its input:
$o = \sigma(\vec{w} \cdot \vec{x})$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$
4.5.1 Sigmoid Unit
• Sigmoid function
– Also called the logistic function or squashing function
– Output ranges between 0 and 1
– Increases monotonically with its input
– Its derivative is easily expressed in terms of its output
• Sigmoid variants
– Other differentiable functions are sometimes used
– The term $e^{-y}$ is sometimes replaced by $e^{-ky}$, where $k > 0$ determines the steepness of the threshold
– The function tanh: $\tanh x = \sinh x / \cosh x$
[Figure: plot of the sigmoid function against y, with the values 0, 0.5, and 1 marked]
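The claim that the sigmoid's derivative is easily expressed in terms of its output can be checked with a short sketch. The function names and the default `k=1.0` are illustrative assumptions, not part of the slides; the identity used is $\frac{d}{dy}\sigma(ky) = k\,\sigma(ky)(1 - \sigma(ky))$.

```python
import math

def sigmoid(y, k=1.0):
    """Logistic function; k > 0 sets the steepness of the threshold."""
    return 1.0 / (1.0 + math.exp(-k * y))

def sigmoid_derivative_from_output(o, k=1.0):
    """Derivative with respect to y, written only in terms of the output o."""
    return k * o * (1.0 - o)

if __name__ == "__main__":
    for y in (-2.0, 0.0, 2.0):
        o = sigmoid(y)
        print(y, o, sigmoid_derivative_from_output(o))
```

Because the derivative needs only the unit's output, a training loop can reuse the value already computed in the forward pass.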
4.5.2 The Backpropagation Algorithm
• Learns the weights of a multilayer network
• Uses gradient descent
– Tries to minimize the difference (training error) between the target output and the output of the network
– The training error is defined as the sum of the squared errors over all outputs of the network
4.5.2 Backpropagation
• Task
– Learn the weights for a multilayer network.
– Search a huge hypothesis space, defined by all possible weight values of the network.
– Backpropagation employs gradient descent to attempt to minimize the squared error between the network output values and the target values.
– The error surface may have multiple local minima, so the algorithm cannot guarantee finding the global error minimum. In practice it often works well (it can be invoked multiple times with different initial weights).
• Squared error function:
$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$
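A minimal sketch of the squared error, assuming the usual form with a factor of 1/2, summed over every training example and every output unit. The function name and the list-of-lists layout are illustrative assumptions.

```python
def squared_error(targets, outputs):
    """Half the sum of squared differences over all examples and output units.

    targets, outputs: lists of equal shape, one inner list per training example.
    """
    return 0.5 * sum(
        (t - o) ** 2
        for t_vec, o_vec in zip(targets, outputs)
        for t, o in zip(t_vec, o_vec)
    )

if __name__ == "__main__":
    print(squared_error([[1.0, 0.0]], [[0.8, 0.1]]))
```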
4.5.2 Backpropagation
Table 4.2. The Backpropagation algorithm [table not reproduced]
4.5.2 Backpropagation
• The algorithm in Table 4.2 applies to layered feedforward networks containing two layers of sigmoid units, with units at each layer connected to all units from the preceding layer.
• This is the stochastic, or incremental, gradient descent version of Backpropagation.
• The notation used here:
– An index is assigned to each node, where a "node" is either an input to the network or the output of some unit.
– $x_{ji}$ denotes the input from node i to unit j, and $w_{ji}$ denotes the corresponding weight.
– $\delta_n$ denotes the error term associated with unit n.
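The stochastic version described above can be sketched for a network with one hidden layer and one output layer of sigmoid units. This is a sketch under stated assumptions, not a transcription of Table 4.2. The update rules are the standard ones for sigmoid units with squared error: $\delta_k = o_k(1-o_k)(t_k-o_k)$ for output units, $\delta_h = o_h(1-o_h)\sum_k w_{kh}\delta_k$ for hidden units, and $w_{ji} \leftarrow w_{ji} + \eta\,\delta_j x_{ji}$. The function names, the learning rate `eta=0.3`, and the initial weight range are illustrative choices.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def init_weights(n_in, n_hidden, n_out, rng, scale=0.05):
    """Small random weights; index 0 of every row is the bias weight."""
    w_hidden = [[rng.uniform(-scale, scale) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [[rng.uniform(-scale, scale) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    """Propagate input x forward; return (hidden outputs, network outputs)."""
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return h, o

def train_step(x, t, w_hidden, w_out, eta=0.3):
    """One stochastic update on example (x, t); returns the error before the update."""
    h, o = forward(x, w_hidden, w_out)
    xb = [1.0] + list(x)
    hb = [1.0] + h
    # Error terms for output units, then for hidden units (using the old weights).
    delta_out = [ok * (1.0 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    delta_hid = [
        hj * (1.0 - hj) * sum(w_out[k][j + 1] * delta_out[k] for k in range(len(o)))
        for j, hj in enumerate(h)
    ]
    # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji.
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * delta_out[k] * hb[i]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hid[j] * xb[i]
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))

if __name__ == "__main__":
    rng = random.Random(0)
    w_h, w_o = init_weights(2, 2, 1, rng)
    for step in range(5):
        print(step, train_step([1.0, 0.0], [1.0], w_h, w_o))
```

The hidden-unit error terms are computed before any weight is changed, so both layers are updated from the same forward pass.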
4.5.2.2 LEARNING IN ARBITRARY ACYCLIC NETWORKS
• The algorithm easily generalizes to feedforward networks of arbitrary depth. The weight update rule seen in Equation (T4.5) is retained, and the only change is to the procedure for computing $\delta$ values.
• The $\delta_r$ value for a unit r in layer m is computed from the $\delta$ values at the next deeper layer m + 1 according to
$\delta_r = o_r (1 - o_r) \sum_{s \in layer\ m+1} w_{sr} \delta_s$
• It is equally simple to generalize the algorithm to any directed acyclic graph, regardless of whether the network units are arranged in uniform layers as we have assumed up to now. In the case that they are not, the rule for calculating $\delta$ for any internal unit (i.e., any unit that is not an output) is:
$\delta_r = o_r (1 - o_r) \sum_{s \in Downstream(r)} w_{sr} \delta_s$
• Downstream(r) is the set of units immediately downstream from unit r in the network: that is, all units whose inputs include the output of unit r.
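The rule for an internal unit in an arbitrary acyclic network can be sketched as follows, assuming sigmoid units so that the local derivative is $o_r(1 - o_r)$. The dictionary layout (`outputs`, `weights` keyed by `(s, r)`, `deltas`, `downstream`) is a hypothetical representation chosen for illustration.

```python
def internal_delta(r, outputs, weights, deltas, downstream):
    """Error term of internal unit r from the error terms of Downstream(r).

    outputs[r]        : output o_r of unit r
    weights[(s, r)]   : weight w_sr on the connection from unit r into unit s
    deltas[s]         : already computed error term of downstream unit s
    downstream[r]     : units whose inputs include the output of r
    """
    o_r = outputs[r]
    return o_r * (1.0 - o_r) * sum(weights[(s, r)] * deltas[s] for s in downstream[r])

if __name__ == "__main__":
    outputs = {"r": 0.5}
    weights = {("s1", "r"): 2.0, ("s2", "r"): -1.0}
    deltas = {"s1": 0.1, "s2": 0.4}
    print(internal_delta("r", outputs, weights, deltas, {"r": ["s1", "s2"]}))
```

Units must be processed so that every member of Downstream(r) already has its error term, for example by visiting units in reverse topological order.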
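Derivation of the BP Rule
• Derive the stochastic gradient descent rule used by the algorithm in Table 4.2, where the error on a training example d is
$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$
• The weight update rule is:
$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$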
Derivation of the BP Rule
• $x_{ji}$ = the ith input to unit j
• $w_{ji}$ = the weight associated with the ith input to unit j
• $net_j = \sum_i w_{ji} x_{ji}$ (the weighted sum of inputs for unit j)
• $o_j$ = the output computed by unit j
• $t_j$ = the target output for unit j
• Outputs = the set of units in the final layer
• Downstream(j) = the set of units whose immediate inputs include the output of unit j
[Figure: hidden unit j with inputs $x_{j1}, \ldots, x_{ji}$, weights $w_{ji}$, and output $o_j$]
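Derivation of the BP Rule
• Derive $\frac{\partial E_d}{\partial w_{ji}}$.
[Figure: unit i feeds unit j through weight $w_{ji}$; the output $o_j$ of unit j feeds unit m]
• Notice: $w_{ji}$ influences the network only through $net_j$, so
$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} x_{ji}$
• Case 1: for output units, $net_j$ influences the network only through $o_j$, so
$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}$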
Derivation of the BP Rule
[Figure: unit i in the hidden layer feeds unit j in the output layer through weight $w_{ji}$; $o_j$ is the output of unit j]
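For an output unit j, the derivation can be completed as the following worked steps. They assume sigmoid units, the per-example error $E_d = \frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2$, and the gradient descent update $\Delta w_{ji} = -\eta\,\partial E_d / \partial w_{ji}$.

```latex
\begin{align*}
\frac{\partial E_d}{\partial o_j}
  &= \frac{\partial}{\partial o_j}\,\frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2
   = -(t_j - o_j) \\
\frac{\partial o_j}{\partial net_j}
  &= \frac{\partial \sigma(net_j)}{\partial net_j}
   = o_j (1 - o_j) \\
\frac{\partial E_d}{\partial net_j}
  &= -(t_j - o_j)\, o_j (1 - o_j) \\
\Delta w_{ji}
  &= -\eta\,\frac{\partial E_d}{\partial net_j}\, x_{ji}
   = \eta\,(t_j - o_j)\, o_j (1 - o_j)\, x_{ji}
\end{align*}
```

Writing $\delta_j = -\partial E_d / \partial net_j = (t_j - o_j)\,o_j(1 - o_j)$ gives the compact form $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$.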
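Derivation of the BP Rule
• Case 2: for units in the hidden layer
[Figure: hidden unit j, fed by unit i through weight $w_{ji}$, feeds the output units k1, k2, …, km, which have net inputs $net_k$ and outputs $o_{k1}, o_{k2}, \ldots, o_{km}$]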
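For a hidden unit j, $net_j$ can influence the error only through the units in Downstream(j). Under the same assumptions as for output units (sigmoid units, $\delta_k = -\partial E_d / \partial net_k$), the steps can be worked out as:

```latex
\begin{align*}
\frac{\partial E_d}{\partial net_j}
  &= \sum_{k \in Downstream(j)}
     \frac{\partial E_d}{\partial net_k}\,
     \frac{\partial net_k}{\partial o_j}\,
     \frac{\partial o_j}{\partial net_j} \\
  &= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j) \\
\delta_j
  &= -\frac{\partial E_d}{\partial net_j}
   = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj} \\
\Delta w_{ji}
  &= \eta\, \delta_j\, x_{ji}
\end{align*}
```

The weight update has the same form as for output units; only the way $\delta_j$ is computed differs.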
More on Backpropagation
• Gradient descent over the entire network weight vector
• Will find a local, not necessarily global, error minimum
– In practice, it often works well (can be run multiple times)
• Often includes a weight momentum term α
• Minimizes error over the training examples
– Will it generalize well to subsequent examples?
• Training can take thousands of iterations: slow!
• Using the network after training is very fast
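The momentum term can be sketched as follows, assuming the common form in which the update on iteration n adds a fraction α of the previous update: $\Delta w_{ji}(n) = \eta\,\delta_j x_{ji} + \alpha\,\Delta w_{ji}(n-1)$. The function name and the default values of `eta` and `alpha` are illustrative assumptions.

```python
def momentum_update(w, delta_j, x, prev_dw, eta=0.3, alpha=0.9):
    """Update the weights into unit j, carrying over part of the previous step.

    w        : current weights w_ji into unit j
    delta_j  : error term of unit j
    x        : inputs x_ji to unit j
    prev_dw  : weight changes applied on the previous iteration
    Returns (new weights, weight changes applied now).
    """
    dw = [eta * delta_j * xi + alpha * pdw for xi, pdw in zip(x, prev_dw)]
    new_w = [wi + dwi for wi, dwi in zip(w, dw)]
    return new_w, dw

if __name__ == "__main__":
    w, dw = [0.0, 0.0], [0.0, 0.0]
    for _ in range(3):
        w, dw = momentum_update(w, 1.0, [1.0, 2.0], dw, eta=0.1, alpha=0.5)
        print(w, dw)
```

With a constant gradient the step size grows from one iteration to the next, which is how momentum speeds up movement along flat regions of the error surface.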
Hidden layer representations
•
Introduction and application
•
Perceptron
•
Gradient descent
•
Multilayer networks and Backpropagation
•
Hidden layer representations
•
Example: Face Recognition
•
Advanced topics
Learning Hidden Layer Representations
• A target function: [table not reproduced]
Can this be learned?
One intriguing property of BACKPROPAGATION is its ability to discover useful intermediate representations at the hidden unit layers inside the network.
Learning Hidden Layer Representations
A network: [figure not reproduced]
Learned hidden layer representation: [figure not reproduced]
ANN: automatically discovers useful representations at the hidden layer.
The hidden unit encoding shown in the figure was obtained after 5000 training iterations through the outer loop of the algorithm (i.e., 5000 iterations through each of the eight training examples).
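The experiment can be approximated with a small sketch. It assumes the target function maps each of eight one-hot inputs (such as 01000000) to itself through three hidden units; the function name, learning rate, initial weight range, and random seed are all illustrative assumptions, and the update rules are the standard stochastic backpropagation rules for sigmoid units.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def train_identity(epochs=5000, n=8, n_hidden=3, eta=0.3, seed=0):
    """Train an n x n_hidden x n network to reproduce one-hot inputs.

    Returns (hidden-layer weights, per-epoch sum of squared errors).
    """
    rng = random.Random(seed)
    # Row j holds the weights into unit j; index 0 is the bias weight.
    w_h = [[rng.uniform(-0.1, 0.1) for _ in range(n + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)] for _ in range(n)]
    examples = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x in examples:
            t = x  # assumed target: the input itself
            xb = [1.0] + x
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = [1.0] + h
            o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_o]
            total += sum((tk - ok) ** 2 for tk, ok in zip(t, o))
            d_o = [ok * (1.0 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            d_h = [h[j] * (1.0 - h[j]) * sum(w_o[k][j + 1] * d_o[k] for k in range(n))
                   for j in range(n_hidden)]
            for k in range(n):
                for i in range(n_hidden + 1):
                    w_o[k][i] += eta * d_o[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n + 1):
                    w_h[j][i] += eta * d_h[j] * xb[i]
        history.append(total)
    return w_h, history

if __name__ == "__main__":
    w_h, history = train_identity(epochs=1000)
    print(history[0], history[-1])
    # Hidden unit values for input 01000000 (bias plus the weight from input 2).
    print([round(sigmoid(row[0] + row[2]), 2) for row in w_h])
```

Printing the hidden unit values for each of the eight inputs after training shows the encoding the network has settled on; the exact values depend on the seed and on the number of epochs.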
Training
[Figure: sum of squared errors versus number of iterations. Each line shows the sum of squared errors over all training examples for one of the eight network outputs.]
We can directly observe the effect of BACKPROPAGATION's gradient descent search by plotting the squared output error as a function of the number of gradient descent search steps.
Training
[Figure: hidden unit encoding for input 01000000 versus number of iterations]
This plot shows the three hidden unit values computed by the learned network for one of the possible inputs (in particular, 01000000). The horizontal axis indicates the number of training iterations.
Training
[Figure: the evolution of the weights connecting the 8 inputs to one hidden unit]
This plot displays the evolution of the weights connecting the eight input units (and the constant 1 bias input) to one of the three hidden units.
Convergence of Backpropagation
• Gradient descent to some local minimum
– Perhaps not the global minimum...
– Add momentum
– Use stochastic gradient descent
– Train multiple nets with different initial weights
• Nature of convergence
– Initialize weights near zero
– Therefore, initial networks are near-linear
– Increasingly nonlinear functions become possible as training progresses
Expressive Capabilities of ANNs
• Boolean functions:
– Every Boolean function can be represented by a network with a single hidden layer
– but it might require a number of hidden units exponential in the number of inputs
• Continuous functions:
– Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
• Arbitrary functions:
– Any function can be approximated to arbitrary accuracy by a network with two hidden layers
Overfitting in ANNs
[Figure: two plots (example 1 and example 2) of error versus number of weight updates]
Overfitting Prevention
• Keep a hold-out validation set and test accuracy on it while training.
• Use 10-fold cross-validation to determine the average number of iterations that optimizes validation performance.
• Weight decay: all weights are multiplied by some fraction between 0 and 1 after each iteration.
– Encourages smaller weights and less complex hypotheses.
– Equivalent to adding a penalty term, proportional to the sum of the squares of the network weights, to the error function.
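The equivalence between weight decay and a squared-weight penalty can be checked with a small sketch. It assumes the penalized error has the form $E + \gamma \sum_i w_i^2$, so that one gradient step multiplies each weight by $1 - 2\eta\gamma$ before the usual update. The names `eta` and `gamma` and their default values are illustrative assumptions.

```python
def decayed_update(w, grad, eta=0.1, gamma=0.01):
    """Shrink every weight by a fixed fraction, then take the plain gradient step."""
    shrink = 1.0 - 2.0 * eta * gamma  # the fraction between 0 and 1
    return [shrink * wi - eta * gi for wi, gi in zip(w, grad)]

def penalized_update(w, grad, eta=0.1, gamma=0.01):
    """Gradient step on E + gamma * sum(w_i ** 2); grad holds dE/dw_i."""
    return [wi - eta * (gi + 2.0 * gamma * wi) for wi, gi in zip(w, grad)]

if __name__ == "__main__":
    w, grad = [1.0, -2.0, 0.5], [0.3, -0.1, 0.0]
    print(decayed_update(w, grad))
    print(penalized_update(w, grad))
```

The two functions return the same weights up to floating-point rounding.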
Example: Face Recognition
Neural Nets for Face Recognition
90% accurate learning head pose, and recognizing 1-of-20 faces
Learned Hidden Unit Weights
Advanced topics
Alternative Error Functions
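• Penalize large weights:
$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$
• Train on target slopes as well as values
• Tie together weights
– e.g., in a phoneme recognition network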
4.8.3 Recurrent Networks
• Up to this point we have considered only network topologies that correspond to acyclic directed graphs.
• Recurrent networks are artificial neural networks that apply to time series data and that use the outputs of network units at time t as the inputs to other units at time t + 1. They support a form of directed cycles in the network.
• Consider a time series prediction task:
– predict the next day's stock market average y(t + 1) based on the current day's economic indicators x(t);
– one approach is to train a feedforward network to predict y(t + 1) as its output, based on the input values x(t).
[Figure (a): feedforward network]
Recurrent Networks
One limitation of such a network is that the prediction of y(t + 1) depends only on x(t) and cannot capture possible dependencies of y(t + 1) on earlier values of x.
This might be necessary, for example, if tomorrow's stock market average y(t + 1) depends on the difference between today's economic indicator values x(t) and yesterday's values x(t - 1).
[Figure (a): feedforward network]
Recurrent Networks
• Of course we could remedy this difficulty by making both x(t) and x(t - 1) inputs to the feedforward network. However, if we wish the network to consider an arbitrary window of time in the past when predicting y(t + 1), then a different solution is required.
• The recurrent network shown in Figure 4.11(b) provides one such solution. Here, we have added a new unit b to the hidden layer, and a new input unit c(t).
Recurrent Networks
• The value of c(t) is defined as the value of unit b at time t - 1; that is, the input value c(t) to the network at one time step is simply copied from the value of unit b on the previous time step.
• Notice this implements a recurrence relation, in which b represents information about the history of network inputs. Because b depends on both x(t) and c(t), it is possible for b to summarize information from earlier values of x that are arbitrarily distant in time.
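The recurrence can be sketched with a single scalar input, one hidden unit b, and one output unit. The weight names, the sigmoid activations, and the initial context value of 0 are illustrative assumptions; the point is only that c(t) is copied from b at the previous time step.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def run_recurrent(xs, w_bx, w_bc, w_b0, w_yb, w_y0):
    """Run the network over the sequence xs; return one prediction per time step."""
    c = 0.0  # assumed context value before the first time step
    predictions = []
    for x in xs:
        b = sigmoid(w_b0 + w_bx * x + w_bc * c)  # b depends on x(t) and c(t)
        predictions.append(sigmoid(w_y0 + w_yb * b))
        c = b  # the next c is copied from the current b
    return predictions

if __name__ == "__main__":
    print(run_recurrent([0.0, 1.0], 1.0, 2.0, 0.0, 1.0, 0.0))
    print(run_recurrent([5.0, 1.0], 1.0, 2.0, 0.0, 1.0, 0.0))
```

The two runs end with the same input, yet the final predictions differ because b carries information about the earlier input. Setting `w_bc` to 0 removes that dependence.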
How can Recurrent Networks be trained?
• We have made several copies of the recurrent network, replacing the feedback loop by connections between the various copies. Notice that this large unfolded network contains no cycles. Therefore, the weights in the unfolded network can be trained directly using BACKPROPAGATION.
• We wish to keep only one copy of the recurrent network and one set of weights.
• Therefore, after training the unfolded network, the final weights in the recurrent network can be taken to be the mean value of the corresponding weights in the various copies.
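The final averaging step can be sketched as follows. The representation of each copy as a dictionary from a weight name to its trained value is a hypothetical choice for illustration.

```python
def average_copies(copies):
    """Mean of each corresponding weight across the copies of the unfolded network.

    copies: list of dicts, one per copy, all with the same weight names.
    """
    names = copies[0].keys()
    return {name: sum(copy[name] for copy in copies) / len(copies) for name in names}

if __name__ == "__main__":
    trained = [{"w_bx": 1.0, "w_bc": 0.2}, {"w_bx": 3.0, "w_bc": 0.4}]
    print(average_copies(trained))
```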
4.8.4 Dynamically Modifying Network Structure
• Up to this point we have considered neural network learning as a problem of adjusting weights within a fixed graph structure.
• A variety of methods have been proposed to dynamically grow or shrink the number of network units and interconnections in an attempt to improve generalization accuracy.
• One idea is to begin with a network containing no hidden units, then grow the network as needed by adding hidden units until the training error is reduced to some acceptable level.
• The CASCADE-CORRELATION algorithm (Fahlman and Lebiere 1990) is one such algorithm. It begins by constructing a network with no hidden units.
4.8.4 Dynamically Modifying Network Structure
• CASCADE-CORRELATION begins by constructing a network with no hidden units. The algorithm grows the network as needed by adding hidden units until the training error is reduced to some acceptable level.
• It then adds a hidden unit, choosing its weight values to maximize the correlation between the hidden unit value and the residual error of the overall network.
• The new unit is now installed into the network, with its weight values held fixed, and a new connection from this new unit is added to each output unit.
4.8.4 Dynamically Modifying Network Structure
• The process is now repeated. The original weights are retrained (holding the hidden unit weights fixed), the residual error is checked, and a second hidden unit is added if the residual error is still above threshold. Whenever a new hidden unit is added, its inputs include all of the original network inputs plus the outputs of any existing hidden units. The network is grown in this fashion, accumulating hidden units until the network residual error is reduced to some acceptable level.
• CASCADE-CORRELATION significantly reduces training times, due to the fact that only a single layer of units is trained at each step.
• One practical difficulty is that because the algorithm can add units indefinitely, it is quite easy for it to overfit the training data, and precautions to avoid overfitting must be taken.
4.8.4 Another idea
• A second idea for dynamically altering network structure is to take the opposite approach. Instead of beginning with the simplest possible network and adding complexity, we begin with a complex network and prune it as we find that certain connections are inessential.
– One way to decide whether a particular weight is inessential is to see whether its value is close to zero.
– A second way, which appears to be more successful in practice, is to consider the effect that a small variation in the weight has on the error E.
– The effect on E of varying w (i.e., the partial derivative $\partial E / \partial w$) can be taken as a measure of the salience of the connection.
– The least salient connections are removed, and this process is iterated until some termination condition is met. LeCun et al. (1990) refer to this as the "optimal brain damage" approach, because at each step the algorithm attempts to remove the least useful connections.
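One pruning step can be sketched as follows, taking the magnitude of $\partial E / \partial w$ as the salience measure described above. This is an illustrative simplification, not the published optimal brain damage procedure; the function name and the dictionary layout are assumptions.

```python
def prune_least_salient(weights, grads, n_remove=1):
    """Return a copy of weights without the n_remove least salient connections.

    weights: dict mapping a connection (j, i) to its weight w_ji
    grads:   dict mapping the same connections to dE/dw_ji
    """
    ranked = sorted(weights, key=lambda conn: abs(grads[conn]))
    removed = set(ranked[:n_remove])
    return {conn: w for conn, w in weights.items() if conn not in removed}

if __name__ == "__main__":
    weights = {("j", "i1"): 0.5, ("j", "i2"): -2.0, ("j", "i3"): 0.01}
    grads = {("j", "i1"): 0.3, ("j", "i2"): 0.001, ("j", "i3"): 0.8}
    print(prune_least_salient(weights, grads))
```

In the example the connection from i2 is removed even though its weight is the largest in magnitude, because varying it has the least effect on E. The weight from i3 is close to zero but is kept.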
Homework
4.2
4.3
4.5
4.7
4.11 (do not need to submit)