[Concept map: Practical Design Issues : Topology, Initial Weights, Learning Algorithm, Training Data, Network Size, Generalization, Fast Learning.
Generalization : Occam's Razor; Cross-validation & Early stopping; Noise; Weight sharing; Small size; Increase Training Data; Network Growing; Network Pruning (Brain Damage, Weight Decay).
Fast Learning : Training Data (Normalize, Scale, Present at Random); Cost Function; Activation Function (Adaptive slope); Architecture (Modular, Committee); BP variants (No weight learning for correctly classified patterns, η, Chen & Mars, Momentum, Fahlmann's); Other Minimization Method (Conjugate Gradient).]
1. Practical Issues
Performance = f (training data, topology, initial weights, learning algorithm, …)
= Training Error, Net Size, Generalization. (1)
How to prepare training data, test data ?
• The training set must contain enough information to learn the task.
• Eliminate redundancy, e.g. by data clustering.
• Training set size : N > W/ε, where N = number of training samples, W = number of weights, and ε = classification (generalization) error permitted on the test data.
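As a quick sanity check, the rule of thumb N > W/ε can be turned into a few lines of code; the 4-10-2 network and the 12.5 % permitted error below are made-up illustration values, not from the slides:

```python
import math

def min_training_samples(num_weights, eps):
    """Smallest integer N satisfying N > W / eps."""
    return math.floor(num_weights / eps) + 1

# A hypothetical 4-10-2 MLP with biases has W = 5*10 + 11*2 = 72 weights.
W = (4 + 1) * 10 + (10 + 1) * 2
N = min_training_samples(W, 0.125)   # permit 12.5 % generalization error
print(W, N)                          # 72 weights -> at least 577 samples
```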
Chapter 4. Designing & Training MLPs
Ex. Modes of Preparing Training Data for Robot Control
The importance of the training data for tracking performance cannot be overemphasized. Basically, three modes of training data selection are considered here. In the regular mode, the training data are obtained by tessellating the robot's workspace and taking the grid points as shown in the next page. However, for better generalization, a sufficient amount of random training data might be obtained by observing the light positions in response to uniformly random Cartesian commands to the robot. This is the random mode. The best generalization power is achieved by the semi-random mode, which evenly tessellates the workspace into many cubes and chooses a randomly selected training point within each cube. This mode is essentially a blend of the regular and the random modes.
[Figure: Training data acquisition modes : regular mode, random mode, and semi-random mode.]
Fig.10. Comparison of training errors and generalization errors for random and semi-random training methods.
[Figure: RMS error (mm, 0-50) vs. iteration (0-400) for (a) training error and (b) test error, comparing the Random and SemiRandom modes.]
(2) Optimal Implementation
A. Network Size
Occam's Razor : Any learning machine should be sufficiently large to solve a given problem, but not larger. A scientific model should favor simplicity, or shave off the fat in the model. [Occam = 14th century British monk]
a. Network Growing : Start with a few hidden nodes and add more.
(Ref. Kim, Modified Error BP Adding Neurons to Hidden Layer, J. of KIEE 92/4)
If E > ε1 and ΔE < ε2, add a hidden node.
Use the current weights for existing weights, and small random values for newly added weights, as initial weights for the new learning.
b. Network Pruning
① Remove unimportant connections. After this "brain damage", retrain the network. Improves generalization.
② Weight decay : after each epoch, w' = (1 − ε) w.
c. Size Reduction by Dimensionality Reduction or Sparse Connectivity in the Input Layer [e.g. use 4 random connections instead of 8]
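A minimal sketch of the decay-then-prune idea, assuming a decay factor ε = 0.01 and a pruning threshold of 0.1 (both illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=20)    # stand-in weight vector
eps = 0.01                           # decay factor (assumed value)

for epoch in range(100):
    # ... the usual gradient update would go here ...
    w = (1.0 - eps) * w              # weight decay: w' = (1 - eps) w

# "Brain damage": zero out connections whose magnitude stayed small;
# the network would then be retrained.
pruned = np.where(np.abs(w) < 0.1, 0.0, w)
print(np.count_nonzero(pruned), "of", w.size, "connections survive")
```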
[Figure: Error E vs. number of epochs. A net with good generalization does well on both training (X) and test (O) data; an overfitted net (due to too few training samples, too many weights, or noise) fits the training data but tests poorly.
T : Training Data, X : Test Data, R : NN with Good Generalization, R' : NN with Poor Generalization]
B. Generalization : Train (memorize) and Apply to an Actual problem (generalize)
[Figure: The Training Set is divided into a Learning Subset and a Validation Subset; a Test Set is kept separate. Mean-square error vs. number of epochs: the training-sample error keeps decreasing, while the validation-sample error reaches its minimum at the early-stopping point.]
For good generalization, train with the Learning Subset and check on the Validation Subset. Determine the best structure based on the Validation Subset [about 10 % of the data, checked every 5~10 iterations]. Then train further with the full Training Set and evaluate on the Test Set.
The statistics of the training (validation) data must be similar to those of the test (actual problem) data.
Tradeoff between training error and generalization !
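The learning/validation split and the early-stopping point can be sketched as follows; the toy sine data, the degree-7 polynomial standing in for a network, and the patience of 50 epochs are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=60)
xl, yl, xv, yv = x[:40], y[:40], x[40:], y[40:]   # learning / validation subsets

def mse(w, xs, ys):
    return np.mean((np.polyval(w, xs) - ys) ** 2)

w = np.zeros(8)                       # degree-7 polynomial as the "network"
X = np.vander(xl, 8)                  # design matrix on the learning subset
lr, best_w, best_val, patience = 0.05, None, np.inf, 0
for epoch in range(2000):
    w -= lr * 2 * X.T @ (X @ w - yl) / len(yl)    # gradient step on learning set
    v = mse(w, xv, yv)
    if v < best_val:                  # validation error still falling
        best_val, best_w, patience = v, w.copy(), 0
    else:
        patience += 1
        if patience >= 50:            # early-stopping point reached
            break
print(best_val < mse(np.zeros(8), xv, yv))   # improved over the initial weights
```

Keeping the best weights seen on the validation subset, rather than the final ones, is what makes the stopping point robust to noisy validation error.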
Stopping Criterion. Classification : stop upon no error. Function Approximation : check E and ΔE.
[Figure: an example showing how to prepare the various data sets to learn an unknown function from data samples.]
Other measures to improve generalization.
• Add Noise (1~5 %) to the Training Data or Weights.
• Hard (Soft) Weight Sharing (Using Equal Values for Groups of Weights) Can Improve Generalization.
• For fixed training data, the smaller the net, the better the generalization.
• Increase the training set to improve generalization.
• For insufficient training data, use the leave-one(some)-out method : select an example, train the net without this example, and evaluate with this unused example.
• If the net still does not generalize well, retrain with the new problem data.
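The leave-one-out method in the bullet above can be sketched with a 1-nearest-neighbor classifier standing in for the network; the two-cluster data set is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),    # class 0 cluster
               rng.normal(2, 0.5, (10, 2))])   # class 1 cluster
labels = np.array([0] * 10 + [1] * 10)

errors = 0
for i in range(len(X)):                        # leave example i out ...
    keep = np.arange(len(X)) != i
    d = np.linalg.norm(X[keep] - X[i], axis=1)
    pred = labels[keep][np.argmin(d)]          # ... "train" without it ...
    errors += int(pred != labels[i])           # ... and evaluate on it
loo_error = errors / len(X)
print("leave-one-out error:", loo_error)
```

Every example gets used for testing exactly once, so all of the scarce data contributes to both training and evaluation.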
C. Speeding Up [Accelerating] Convergence
(Ref. Book by Hertz, AI Expert Magazine 91/7)
To speed up the calculation itself : reduce the number of floating-point operations by using fixed-point arithmetic, and use a piecewise-linear approximation for the sigmoid.
What will happen if more than 5~10 % validation data are used ?
Consider 2 industrial assembly robots for precision jobs made by the same company with an identical spec. If the same NN is used for both, the robots will act differently. Do we need better generalization methods to compensate for this difference ?
Large N may increase noisy data. However, wouldn't large N offset the problem by yielding more reliability ?
How big an influence would noise have upon misguided learning ?
Wonder what measures can prevent the local minimum traps.
Students' Questions from 2005
Is there any mathematical validation for the existence of a stopping point in validation samples ?
The number of hidden nodes is adjusted by a human. An NN is supposed to self-learn, and therefore there must be a way to automatically adjust the number of hidden nodes.
① Normalize Inputs, Scale Outputs. Zero mean, decorrelate (PCA), and equalize covariances.
② Start with small uniform random initial weights [for tanh] : −r < w(0) < r.
③ Present training patterns in random (shuffled) order (or mix different classes).
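Tips ① and ② can be sketched directly; the data, the layer sizes, and the choice r = 2.4/fan-in (LeCun's rule of thumb) are assumptions here, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(5.0, 3.0, size=(100, 4))        # raw inputs at the wrong scale

# (1) Normalize inputs: zero mean, unit variance per component.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# (2) Small uniform random initial weights in (-r, r) for a tanh layer.
fan_in = Xn.shape[1]
r = 2.4 / fan_in                               # assumed rule of thumb
W = rng.uniform(-r, r, size=(fan_in, 10))

print(np.allclose(Xn.mean(axis=0), 0.0), np.all(np.abs(W) <= r))
```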
④ Alternative Cost or Activation Functions
Ex. Activation : f(s) = 1.716 tanh(2s/3) with ±1 as targets (or place the targets at the points of maximum curvature); alternatives include the logistic 1/(1 + e^−s), sinh^−1, and tan^−1; keep |w| < 2.4/fan-in.
Cost : E = (1/2) Σk (dk − yk)^2 vs. E = Σk |dk − yk|^P.
⑤ Fahlman's Bias to Ensure Nonzero δ : δk = (f′(sk) + 0.1)(tk − yk), for output units only or for all units.
⑥ Chen & Mars Differential step size : η_outer = 0.1 η_inner, with δk = (tk − yk).
⑦ (Accelerating BP Algorithm through Omitting Redundant Learning, J. of KIEE 92/9)
If Ep < ε, do not update the weights on the p-th training pattern : no BP for pattern p.
Cf. Principe's Book recommends 2 ~ 5. Best to try diff. values.
For output units only : drop f′, i.e. δ′k = tk − yk.
⑧ Ahalt : Modular Net
[Figure: the input x feeds two separate networks, MLP 1 and MLP 2, producing outputs y1 and y2 in parallel.]
⑨ Ahalt : Adapt Slope (Sharpness) Parameters
With f(s) = 1/(1 + e^−λs), adapt the slope λ by gradient descent on ∂J/∂λ, just as the weights use ∂J/∂w.
⑩ Plaut Rule : η_pq ∝ 1/fan-in, i.e. scale each weight's learning rate by the inverse fan-in of its destination unit.
⑪ Jacobs : Learning Rate Adaptation [Ref. Neural Networks, Vol. 1, No. 4, 88] + Reason for Slow Convergence
a. Momentum :
Δw(t) = −η ∇J(t) + α Δw(t−1) = −η Σ_{i=0}^{t} α^{t−i} ∇J(i)
In a plateau (∇J roughly constant), Δw → −(η/(1 − α)) ∇J, where η/(1 − α) is the effective learning rate.
[Figure: weight trajectories on the error surface without momentum and with momentum.]
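The momentum update can be sketched on a toy quadratic with very different curvatures along its two axes; the curvatures and the values η = 0.05, α = 0.9 are illustrative assumptions:

```python
import numpy as np

def grad(w):                              # J(w) = 0.5*(10*w0^2 + 0.1*w1^2)
    return np.array([10.0, 0.1]) * w

w = np.array([1.0, 1.0])
eta, alpha = 0.05, 0.9                    # learning rate and momentum
dw = np.zeros(2)
for _ in range(500):
    dw = -eta * grad(w) + alpha * dw      # dw(t) = -eta*gradJ(t) + alpha*dw(t-1)
    w = w + dw
print(np.linalg.norm(w) < 1e-3)           # reached the minimum at the origin
```

Along the flat w1 direction the accumulated term behaves like the effective rate η/(1 − α) = 0.5, ten times the plain learning rate.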
b. Delta-bar-delta rule : Δwi(t) = −ηi(t) ∂J(t)/∂wi, where each weight has its own learning rate ηi(t) adapted by
Δηi(t) = κ, if δ̄i(t−1) δi(t) > 0
Δηi(t) = −φ ηi(t−1), if δ̄i(t−1) δi(t) < 0
Δηi(t) = 0, otherwise
with δi(t) = ∂J(t)/∂wi and δ̄i(t) = (1 − θ) δi(t) + θ δ̄i(t−1).
For the actual parameters to be used, consult Jacobs' paper and also "Getting a fast break with Backprop", Tveter, AI Expert Magazine, excerpt from the pdf files that I provided.
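A minimal delta-bar-delta sketch on an ill-conditioned quadratic; the values of κ, φ, θ below are plausible small settings, not the ones Jacobs recommends:

```python
import numpy as np

def grad(w):                               # J = 0.5*(10*w0^2 + 0.1*w1^2)
    return np.array([10.0, 0.1]) * w

w = np.array([1.0, 1.0])
eta = np.full(2, 0.01)                     # one learning rate per weight
kappa, phi, theta = 0.01, 0.1, 0.7         # assumed parameter values
delta_bar = np.zeros(2)
for _ in range(800):
    delta = grad(w)
    same_sign = delta_bar * delta
    eta = np.where(same_sign > 0, eta + kappa,           # steady sign: grow eta
          np.where(same_sign < 0, eta * (1 - phi), eta)) # sign flip: shrink eta
    w = w - eta * delta
    delta_bar = (1 - theta) * delta + theta * delta_bar  # smoothed gradient
print(np.linalg.norm(w) < 1e-2, np.all(eta > 0))
```

The additive increase / multiplicative decrease lets the flat direction build up a large rate while the steep direction's rate is held down by sign flips.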
Students' Questions from 2005
Is there any way to design a spherical error surface for faster convergence ?
Momentum provides inertia to jump over a small peak.
Parameter Optimization techniques seem to be a good help to NN design.
I am afraid that optimizing even the sigmoid slope and the learning rate may expedite overfitting.
In what aspect is it more manageable to remove the mean, decorrelate, etc. ?
How does using a bigger learning rate for the output layer help learning ?
Does the solution always converge if we use gradient descent ?
Are there any shortcomings in using fast learning algorithms ?
In Ahalt's modular net, is it faster for a single output only, or for all the outputs, than a plain MLP ?
Various fast learning methods have been proposed. Which is the best one ? Is it problem-dependent ?
The Jacobs method cannot find the global min. for an error surface like: [figure]
⑫ Conjugate Gradient : Fletcher & Reeves
Line Search : w(n+1) = w(n) + η(n) s(n), where s(n) is the search direction and η(n) = arg min_η E[w(n) + η s(n)].
If η is fixed and s(n) = −g(n) ≡ −∇E[w(n)] : Gradient Descent.
If η*(n) minimizes E[w(n) − η g(n)] : Steepest Descent. At the line minimum, [∇E[w(n+1)]]ᵀ g(n) = 0, so with s(n+1) = −g(n+1) successive steepest-descent directions are orthogonal : g(n+1)ᵀ g(n) = 0.
[Figure: trajectories of Gradient Descent, Steepest Descent, and Conjugate Gradient starting from w(n), with search direction s(n) and gradient ∇E[w(n)].]
Gradient Descent + Line Search = Steepest Descent ; Steepest Descent + Momentum ≈ Conjugate Gradient.
Choose β(n) such that the new direction is conjugate to the old : if s(n+1)ᵀ H s(n) = 0, this is the Conjugate Gradient method.
1) Line Search : find η(n) minimizing E along s(n).
2) New direction : s(n+1) = −g(n+1) + β(n) s(n), with the conjugacy condition [s(n+1)]ᵀ H s(n) = 0, where H = [∂²E/∂wi ∂wj] is the Hessian of the locally quadratic model E(w) ≈ E(w0) + ∇E(w0)ᵀ (w − w0) + (1/2)(w − w0)ᵀ H (w − w0).
From the conjugacy condition, Polak-Ribiere Rule : β(n) = (g(n+1) − g(n))ᵀ g(n+1) / ‖g(n)‖².
The line search gives ∂E[w(n) + η s(n)]/∂η = 0, i.e. s(n)ᵀ ∇E[w(n) + η s(n)] = 0 and hence s(n)ᵀ g(n+1) = 0.
[Flowchart: START → Initialize s(0) = −g(0) = −∇E[w(0)] → Line Search : η*(n) = arg min_η E(w(n) + η s(n)) → w(n+1) = w(n) + η*(n) s(n) → g(n+1) = ∇E[w(n+1)] → if ‖∇E(w(n))‖ < ε1 or |ΔE(w(n))| < ε2 then End, else s(n+1) = −g(n+1) + β(n) s(n) and repeat.]
Comparison of SD and CG
Steepest Descent vs. Conjugate Gradient : each step takes a line search; for N-variable quadratic functions, CG converges in at most N steps.
Recommended : Steepest Descent + n steps of Conjugate Gradient + Steepest Descent + n steps of Conjugate Gradient + …
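The conjugate-gradient procedure, with an exact line search and the Polak-Ribiere β, can be sketched on a 2-variable quadratic E(w) = ½ wᵀAw − bᵀw (A and b below are an arbitrary example); as stated, it converges in at most N = 2 line searches:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])     # positive-definite Hessian
b = np.array([1.0, 2.0])
grad = lambda w: A @ w - b                 # gradient of E(w) = 0.5 w'Aw - b'w

w = np.zeros(2)
g = grad(w)
s = -g                                     # s(0) = -g(0)
for n in range(2):                         # N = 2 variables -> at most 2 steps
    eta = -(g @ s) / (s @ A @ s)           # exact line search along s(n)
    w = w + eta * s                        # w(n+1) = w(n) + eta* s(n)
    g_new = grad(w)
    beta = ((g_new - g) @ g_new) / (g @ g) # Polak-Ribiere rule
    s = -g_new + beta * s                  # s(n+1) = -g(n+1) + beta s(n)
    g = g_new
print(np.allclose(A @ w, b))               # minimum reached: gradient ~ 0
```

For a quadratic the exact line search has the closed form used above; a general E needs a numerical line search, as in the flowchart.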
X. Swarm Intelligence
What is "swarm intelligence" and why is it interesting?
Two kinds of swarm intelligence
particle swarm optimization
ant colony optimization
Some applications
Discussion
What is "Swarm intelligence" ?
"Swarm Intelligence is a property of systems of non-intelligent agents exhibiting collectively intelligent behavior."
Characteristics of a swarm :
distributed, no central control or data source
no (explicit) model of the environment
perception of environment
ability to change environment
"I can't do …" / "We can do …"
A group of friends, each having a metal detector, are on a treasure-finding mission. Each can communicate the signal and current position to the n nearest neighbors. If your neighbor is closer to the treasure than you are, you can move closer to that neighbor, thereby improving your own chance of finding the treasure. Also, the treasure may be found more easily than if you were on your own.
Individuals in a swarm interact to solve a global objective in a more efficient manner than a single individual could. A swarm is defined as a structured collection of interacting organisms [ants, bees, wasps, termites, fish in schools and birds in flocks] or agents. Within swarms, individuals are simple in structure, but their collective behaviors can be quite complex. Hence, the global behavior of a swarm emerges in a nonlinear manner from the behavior of the individuals in that swarm.
The interaction among individuals plays a vital role in shaping the swarm's behavior. Interaction aids in refining experiential knowledge about the environment and enhances the progress of the swarm toward optimality. The interaction is determined genetically or through social interaction.
Applications : function optimization, optimal route finding, scheduling, image and data analysis.
Why is it interesting?
Robust nature of animal problem-solving :
simple creatures exhibit complex behavior
behavior modified by dynamic environment
e.g. ants, bees, birds, fishes, etc.
Two kinds of Swarm intelligence
Particle swarm optimization : proposed in 1995 by J. Kennedy and R. C. Eberhart, based on the behavior of bird flocks and fish schools.
Ant colony optimization : defined in 1999 by Dorigo, Di Caro and Gambardella, based on the behavior of ant colonies.
1. Particle Swarm Optimization
Population-based method
Has three main principles :
a particle has a movement
the particle wants to go back to the best previously visited position
the particle tries to get to the position of the best-positioned particles
Four types of neighborhood :
star (global) : all particles are neighbors of all particles
ring (circle) : particles have a fixed number of neighbors K (usually 2)
wheel : only one particle is connected to all particles and acts as a "hub"
random : N random connections are made between the particles
Algorithm :
Initialization : x_id(0) = random value, v_id(0) = 0
Calculate performance : F(x_id(t)) = ? (F : performance)
Update best particle : if F(x_id(t)) is better than the pbest, then pbest = F(x_id(t)), p_id = x_id(t); same for the gbest
Move each particle : see next slide
Repeat until the system converges
Particle Dynamics : for convergence, c1 + c2 < 4 [Kennedy 1998]
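A minimal global-best PSO sketch minimizing f(x) = ‖x‖²; the swarm size, the inertia weight 0.7, and c1 = c2 = 1.5 (so that c1 + c2 < 4) are common textbook choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sum(x ** 2, axis=-1)       # function to minimize

n, dim = 20, 2
x = rng.uniform(-5, 5, (n, dim))            # particle positions
v = np.zeros((n, dim))                      # velocities
pbest, pbest_f = x.copy(), f(x)             # personal bests
gbest = pbest[np.argmin(pbest_f)].copy()    # global best

w_in, c1, c2 = 0.7, 1.5, 1.5                # inertia and attraction constants
for t in range(200):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    v = w_in * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v                               # each particle moves
    fx = f(x)
    better = fx < pbest_f                   # update personal bests ...
    pbest[better], pbest_f[better] = x[better], fx[better]
    gbest = pbest[np.argmin(pbest_f)].copy()  # ... and the global best
print(f(gbest) < 1e-3)                      # swarm has converged near 0
```

Each velocity update mixes the particle's own memory (pbest) with the swarm's memory (gbest), which is exactly the "three main principles" above.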
Examples
http://uk.geocities.com/markcsinclair/pso.html
http://www.engr.iupui.edu/~shi/PSO/AppletGUI.html
⑭ Local Minimum Problem
• Restart with different initial weights, learning rates, and numbers of hidden nodes.
• Add (and anneal) a little noise (zero-mean white Gaussian) to the weights or training data [desired output, or input (for better generalization)].
• Use {Simulated Annealing} or {Genetic Algorithm Optimization, then BP}.
⑮ Design aided by a Graphic User Interface : NN Oscilloscope. Look at internal weights / node activities with color coding.
⑬ Fuzzy control of Learning rate, Slope (Principe's, Chap. 4.16)
Students' Questions from 2005
When the learning rate is optimized and initialized, there must be a rough boundary for it. Is there just an empirical way to do it ?
In Conjugate Gradient, s(n) = − g(n+1) …
The learning rate annealing just keeps on decreasing η as n grows, without looking at where in the error surface the current weights are. Is this OK ?
Conjugate Gradient is similar to Momentum in that the old search direction is utilized in determining the new search direction. It is also similar to the delta-bar-delta rule in using the past trend.
Is CG always faster converging than SD ?
Do different initial values of the weights affect the output results ? How can we choose them ?