# Concept Map Practical Design Issues

AI and Robotics

Nov 8, 2013


Concept Map: Practical Design Issues

- Training Data
- Topology
- Initial Weights
- Learning Algorithm
- Fast Learning
- Network Size
  - Occam's Razor
  - Small size
  - Network Growing
  - Network Pruning (Brain Damage, Weight Decay)
- Generalization
  - Cross-validation & Early stopping
  - Noise
  - Weight sharing
  - Increase Training Data

Concept Map: Fast Learning

- Training Data: Normalize, Scale, Present at Random
- Cost Function
- Activation Function
- Architecture: Modular, Committee
- BP variants
  - No weight learning for correctly classified patterns
  - η (learning rate): Chen & Mars
  - Momentum
  - Fahlman's
- Other Minimization Methods: Conjugate Gradient

1. Practical Issues

Performance = f(training data, topology, initial weights, learning algorithm, ...)
            = {Training Error, Net Size, Generalization}.

(1) How to prepare training data and test data?

- The training set must contain enough information to learn the task.
- Eliminate redundancy, maybe by data clustering.
- Training set size: N > W/ε
  (N = # of training data, W = # of weights,
  ε = classification error permitted on test data = generalization error)
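The rule of thumb N > W/ε above can be checked numerically. A minimal sketch, with hypothetical layer sizes; `num_weights` counts the weights (including biases) of a fully connected MLP:

```python
# Rule-of-thumb check N > W/eps for the training-set size of an MLP.
# The layer sizes below are illustrative, not from the text.

def num_weights(layers):
    """Total number of weights (including biases) of a fully connected MLP."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

def min_training_samples(layers, eps):
    """Smallest N satisfying N > W/eps for permitted test error eps."""
    return num_weights(layers) / eps

layers = [8, 16, 4]                 # 8 inputs, 16 hidden units, 4 outputs
W = num_weights(layers)             # (8+1)*16 + (16+1)*4 = 212
N = min_training_samples(layers, eps=0.1)
print(W, N)                         # 212 2120.0
```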

Chapter 4. Designing & Training MLPs

Ex. Modes of Preparing Training Data for Robot Control

The importance of the training data for tracking performance cannot
be overemphasized. Basically, three modes of training data selection
are considered here. In the regular mode, the training data are
obtained by tessellating the robot's workspace and taking the grid
points as shown in the next page. However, for better generalization,
a sufficient amount of random training data might be obtained by
observing the light positions in response to uniformly random
Cartesian commands to the robot. This is the random mode. The best
generalization power is achieved by the semi-random mode, which
evenly tessellates the workspace into many cubes and chooses a
randomly selected training point within each cube. This mode is
essentially a blend of the regular and the random modes.

[Figure: Training data acquisition modes: regular, random, and semi-random]

Fig. 10. Comparison of (a) training errors and (b) test errors for the
random and semi-random training methods [RMS error (mm) vs. iteration,
0-400 iterations].
(2) Optimal Implementation

A. Network Size

Occam's Razor: any learning machine should be sufficiently large to
solve a given problem, but not larger. A scientific model should
favor simplicity, or "shave off the fat" in the model.
[Occam = 14th-century British monk]

a. Network Growing
(Ref. Kim, Modified Error BP Adding Neurons to Hidden Layer, J. of KIEE 92/4)

If E > ε_1 and ΔE < ε_2, add a hidden node.

Use the current weights for the existing weights, and small random
values for the newly added weights, as the initial weights for the
new round of learning.

b. Network Pruning

Remove unimportant connections. After this "brain damage", retrain
the network. This improves generalization.

Weight decay: after each epoch, w' = (1 - ε) w.

c. Size Reduction by Dimensionality Reduction or Sparse Connectivity
in the Input Layer [e.g. use 4 random connections instead of 8]
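The per-epoch decay step w' = (1 - ε)w can be sketched as follows; the decay constant here is an illustrative value:

```python
# Weight-decay sketch: after each epoch, shrink every weight by
# w' = (1 - eps) * w. eps = 0.01 is an illustrative choice.

def decay_weights(weights, eps=0.01):
    return [(1.0 - eps) * w for w in weights]

w = [0.5, -1.0, 2.0]
for _ in range(3):              # three epochs of pure decay (gradient step omitted)
    w = decay_weights(w)
print(w)                        # each weight shrunk by a factor of 0.99**3
```

Weights that receive no reinforcing gradient updates are driven toward zero, which is what makes decay a soft pruning mechanism.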
[Figure: Error E vs. number of epochs for training data (X) and test
data (O). T: training data, X: test data, R: NN with good
generalization, R': NN with poor generalization. Good case: training
and test errors stay close. Poor case (overfitting, due to too many
weights relative to the training samples, and noise): the test error
rises while the training error keeps falling.]

B. Generalization: Train (memorize) and Apply to an Actual Problem (generalize)

[Figure: The Training Set is divided into a Learning Subset and a
Validation Subset; a separate Test Set is kept aside. Mean-square
error vs. number of epochs is plotted for the training sample and the
validation sample; the early stopping point is where the validation
error reaches its minimum.]

For good generalization, train with the Learning Subset and check on
the Validation Subset. Determine the best structure based on the
Validation Subset [about 10% of the data, checked every 5-10
iterations]. Then train further with the full Training Set and
evaluate on the Test Set.

The statistics of the training (and validation) data must be similar
to those of the test (actual problem) data.

There is a tradeoff between training error and generalization!

Stopping criterion:
- Classification: stop upon no error.
- Function approximation: check E and ΔE.
An example showing how to prepare the various data sets to learn an
unknown function from data samples.
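The early-stopping procedure can be sketched as a simple monitor on the validation error; the error curve and the patience value below are illustrative:

```python
# Early-stopping sketch: stop when the validation error has not
# improved for `patience` consecutive checks, and report the epoch of
# the validation minimum. The curve below stands in for real training.

def early_stop(val_errors, patience=3):
    """Return the epoch index of the best validation error seen before stopping."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                    # validation error keeps rising: stop
    return best_epoch

curve = [0.9, 0.5, 0.3, 0.25, 0.27, 0.31, 0.4, 0.5]   # typical U-shaped validation error
print(early_stop(curve))                # 3  (validation minimum at epoch 3)
```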

Other measures to improve generalization:

- Add a small amount of noise (about 5%) to the training data or weights.
- Hard (soft) weight sharing (using equal values for groups of
  weights) can improve generalization.
- For fixed training data, the smaller the net, the better the generalization.
- Increase the training set to improve generalization.
- For insufficient training data, use the leave-one(-some)-out method:
  select an example and train the net without this example, then
  evaluate with this unused example.
- If the net still does not generalize well, retrain with the new
  problem data.

C. Speeding Up [Accelerating] Convergence

- Ref. Book by Hertz; AI Expert Magazine 91/7

To speed up the calculation itself: reduce the number of
floating-point operations by using fixed-point arithmetic, and use a
piecewise-linear approximation for the sigmoid.
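A piecewise-linear sigmoid replacement might look like the sketch below; the breakpoints at ±4 are an illustrative choice, not from the text:

```python
# Piecewise-linear approximation of the logistic sigmoid: clamp the
# tails and use a straight line through (-4, 0), (0, 0.5), (4, 1).
# Avoids exp() in the forward pass, at the cost of some accuracy.

def sigmoid_pwl(s):
    if s <= -4.0:
        return 0.0
    if s >= 4.0:
        return 1.0
    return 0.5 + s / 8.0

print(sigmoid_pwl(0.0))    # 0.5, same as the exact sigmoid at 0
print(sigmoid_pwl(8.0))    # 1.0 (saturated tail)
```

In fixed-point hardware, the division by 8 becomes a cheap bit shift, which is why this pairing with fixed-point arithmetic is attractive.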

Students' Questions from 2005

- What will happen if more than 5-10% validation data are used?
- Consider 2 industrial assembly robots for precision jobs made by
  the same company with an identical spec. If the same NN is used for
  both, then the robots will act differently. Do we need better
  generalization methods to compensate for this difference?
- Large N may increase noisy data. However, wouldn't large N offset
  the problem by yielding more reliability? How big an influence
  would noise have upon misguided learning?
- I wonder what measures can prevent the local minimum traps.
- Is there any mathematical validation for the existence of a
  stopping point in validation samples?
- The number of hidden nodes is adjusted by a human. An NN is
  supposed to self-learn, and therefore there must be a way to
  automatically adjust the number of hidden nodes.

Normalize Inputs, Scale Outputs

- Make the inputs zero-mean, decorrelate them (PCA), and equalize the
  covariance.
- Initialize the weights w(0) to small random values (e.g., uniform
  in [-r, r]).
- Present training patterns in random (shuffled) order (or mix the
  different classes).
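The zero-mean / decorrelate / covariance-equalization preprocessing can be sketched as PCA whitening; the mixing matrix below is invented test data:

```python
import numpy as np

# Input preprocessing sketch: remove the mean, decorrelate with PCA,
# and equalize the covariance (whitening). X holds one pattern per row.

def whiten(X, eps=1e-8):
    Xc = X - X.mean(axis=0)               # zero mean
    cov = Xc.T @ Xc / len(Xc)             # sample covariance
    eigval, eigvec = np.linalg.eigh(cov)  # PCA: eigen-decomposition
    Z = Xc @ eigvec                       # decorrelate (rotate to principal axes)
    return Z / np.sqrt(eigval + eps)      # equalize the variances

rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.5]])  # correlating mix
X = rng.normal(size=(500, 3)) @ mix
Z = whiten(X)
print(np.round(np.cov(Z.T), 2))           # close to the 3x3 identity
```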

Alternative Cost or Activation Functions

Ex. Cost: E = Σ_p Σ_k |d_k - y_k|^r  vs.  E = Σ_p Σ_k (d_k - y_k)^2.

Activation: f(s) = 1.716 tanh(2s/3), used with -1 and +1 as targets
(or a 0/1 sigmoid with targets 0 and 1, at f_max).

Initial weights: small random values, e.g., uniform in
[-2.4/fan-in, +2.4/fan-in].

Fahlman's bias to ensure a nonzero derivative:

δ_k = (f'(s_k) + 0.1)(t_k - y_k)

for output units only, or for all units.

Chen & Mars differential step size: use different step sizes for the
inner and outer layers (η_outer = 0.1 η_inner), with the output-unit
error taken as (t_k - y_k).

(Accelerating BP Algorithm through Omitting Redundant Learning,
J. of KIEE 92/9)

If E_p < θ, do not update the weights on the p-th training pattern
(no BP for that pattern).
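Skipping backprop for already-learned patterns can be sketched as below; the function names, threshold, and error values are illustrative:

```python
# "Omit redundant learning" sketch: skip the BP update for patterns
# whose error E_p is already below a threshold theta.

def train_epoch(patterns, pattern_error, backprop_update, theta=0.01):
    """Run one epoch, returning how many patterns were skipped."""
    skipped = 0
    for p in patterns:
        if pattern_error(p) < theta:
            skipped += 1             # pattern already learned: no BP
            continue
        backprop_update(p)
    return skipped

errors = {1: 0.001, 2: 0.5, 3: 0.0001, 4: 0.2}   # toy per-pattern errors
updated = []
n_skipped = train_epoch(errors, errors.get, updated.append)
print(n_skipped, updated)            # 2 [2, 4]
```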

Cf. Principe's book recommends a value of 2-5; it is best to try
different values.

For output units only: drop f'.

Ahalt's Modular Net: use one MLP per output (input x feeds MLP 1 for
y_1 and MLP 2 for y_2), and vary η in each module.

Plaut Rule: η_pq ∝ 1/fan-in.

Jacobs' heuristics [Ref. Neural Networks, Vol. 1, No. 4, 1988]:

Reasons for Slow Convergence

a. Momentum:

Δw(t) = -η ∇J(t) + α Δw(t-1)
      = -η Σ_{i=0}^{t} α^{t-i} ∇J(i)

In a plateau (∇J approximately constant), Δw → -(η/(1-α)) ∇J, where
η/(1-α) is the effective learning rate: η without momentum, η/(1-α)
with momentum.
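The plateau behavior above can be verified numerically; η and α below are illustrative values:

```python
# Momentum sketch: Delta_w(t) = -eta * grad + alpha * Delta_w(t-1).
# With a constant gradient (a plateau), the step settles to
# -eta/(1-alpha) * grad, i.e. the "effective learning rate" eta/(1-alpha).

def momentum_steps(grad, eta=0.1, alpha=0.9, n=200):
    dw = 0.0
    for _ in range(n):
        dw = -eta * grad + alpha * dw
    return dw

grad = 1.0
print(momentum_steps(grad))        # settles near -eta/(1-alpha)*grad = -1.0
print(-0.1 / (1 - 0.9) * grad)     # the predicted limit, for comparison
```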

b. Delta-bar-delta rule: Δw_i(t) = -η_i(t) ∇_i J(t), where each
weight has its own learning rate η_i(t). With δ_i(t) = ∇_i J(t):

η_i(t) = η_i(t-1) + κ,          if δ̄_i(t-1) δ_i(t) > 0
η_i(t) = η_i(t-1)(1 - φ),       if δ̄_i(t-1) δ_i(t) < 0
η_i(t) = η_i(t-1),              otherwise,

where the running average is δ̄_i(t) = (1 - θ) δ_i(t) + θ δ̄_i(t-1).

For the actual parameters to be used, consult Jacobs' paper and also
"Getting a fast break with Backprop", Tveter, AI Expert Magazine
(excerpt from the pdf files that I provided).
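One update of the delta-bar-delta rule can be sketched as follows; the constants κ, φ, θ and the gradient sequence are illustrative, not Jacobs' recommended values:

```python
# Delta-bar-delta sketch: per-weight learning rate eta, increased
# additively when the gradient agrees in sign with its running average
# delta_bar, decreased multiplicatively when it disagrees.

def dbd_update(w, eta, delta_bar, grad, kappa=0.01, phi=0.1, theta=0.7):
    if delta_bar * grad > 0:
        eta = eta + kappa              # same trend: additive increase
    elif delta_bar * grad < 0:
        eta = eta * (1.0 - phi)        # sign flip: multiplicative decrease
    w = w - eta * grad                 # gradient step with the per-weight rate
    delta_bar = (1 - theta) * grad + theta * delta_bar
    return w, eta, delta_bar

w, eta, dbar = 1.0, 0.05, 0.0
for g in [0.5, 0.4, -0.3]:             # gradients from three successive steps
    w, eta, dbar = dbd_update(w, eta, dbar, g)
print(round(eta, 6))                   # rate grew once, then shrank on the sign flip
```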

Students' Questions from 2005

- Is there any way to design a spherical error surface for faster
  convergence?
- Momentum provides inertia to jump over a small peak.
- Parameter optimization techniques seem to be a good help for NN
  design.
- I am afraid that optimizing even the sigmoid slope and the learning
  rate may expedite overfitting.
- In what aspect is it more manageable to remove the mean,
  decorrelate, etc.?
- How does using a bigger learning rate for the output layer help
  learning?
- Does the solution always converge if we use the
- Are there any shortcomings in using fast learning algorithms?
- In Ahalt's modular net, is it faster for a single output only, or
  for all the outputs, than an MLP?
- Various fast learning methods have been proposed. Which is the best
  one? Is it problem-dependent?
- The Jacobs method cannot find the global minimum for an error
  surface like:

Conjugate Gradient: Fletcher & Reeves

1) Line Search: w(n+1) = w(n) + η(n) s(n), where s(n) is the search
direction from w(n) and g(n) = ∇E[w(n)].

If η is fixed and s(n) = -g(n), this is Steepest Descent.

With a line search, choose the step to minimize the error along s(n):

η*(n) = argmin_η E[w(n) + η s(n)],
so that E[w(n) + η*(n) s(n)] = Min_η E[w(n) + η s(n)].

At that minimum, ∇E[w(n+1)]ᵀ s(n) = g(n+1)ᵀ s(n) = 0.

New direction: s(n+1) = -g(n+1) + β(n+1) s(n). Steepest Descent is
the special case β = 0, i.e. s(n+1) = -g(n+1).
[Figure: Weight-space trajectories of GD, SD (gradient descent + line
search), steepest descent + momentum, and CG, showing the successive
points w(n), w(n+1), w(n+2) and the search directions s(n), s*(n),
-∇E[w(n)].]
2) Choose β such that successive directions are conjugate:

s(n+1)ᵀ H s(n) = 0, where H = [∂²E/∂w_i ∂w_j] is the Hessian.

For a quadratic error E(w) = E(w_0) + ∇E(w_0)ᵀ(w - w_0)
+ ½ (w - w_0)ᵀ H (w - w_0), this conjugacy condition, together with
the line-search results g(n+1)ᵀ s(n) = 0 and the descent condition
∇E[w(n)]ᵀ s(n) < 0, leads to the

Polak-Ribiere rule:
β(n+1) = (g(n+1) - g(n))ᵀ g(n+1) / ||g(n)||²

(the Fletcher-Reeves rule uses β(n+1) = g(n+1)ᵀg(n+1) / g(n)ᵀg(n)).

CG Algorithm (flowchart):

START
1. Initialize: s(0) = -g(0) = -∇E[w(0)].
2. Line search: η*(n) = argmin_η E(w(n) + η s(n)).
3. Update: w(n+1) = w(n) + η*(n) s(n).
4. New gradient: g(n+1) = ∇E[w(n+1)].
5. If E(w(n+1)) < ε_1, or n > n_max: End.
6. New direction: s(n+1) = -g(n+1) + β(n+1) s(n); n ← n + 1; go to 2.
Compared with Steepest Descent:

- Each CG step takes a line search.
- For N-variable quadratic functions, CG converges in at most N steps.

Recommended: Steepest Descent + n steps of Conjugate Gradient
+ Steepest Descent + n steps of Conjugate Gradient + ...

Comparison of SD and CG
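The CG loop above can be sketched on a quadratic error, where the line search has the closed form η* = -gᵀs / (sᵀHs); the matrix and vector below are invented test data:

```python
import numpy as np

# Fletcher-Reeves conjugate gradient on a quadratic E(w) = 1/2 w'Aw - b'w,
# whose gradient is g = Aw - b and whose Hessian is A. For an N-variable
# quadratic this converges in at most N steps, as stated above.

def conjugate_gradient(A, b, w, n_steps):
    g = A @ w - b                         # gradient at the start point
    s = -g                                # initial direction: steepest descent
    for _ in range(n_steps):
        eta = -(g @ s) / (s @ (A @ s))    # exact line search along s
        w = w + eta * s
        g_new = A @ w - b
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves beta
        s = -g_new + beta * s             # conjugate new direction
        g = g_new
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite (toy)
b = np.array([1.0, 2.0])
w = conjugate_gradient(A, b, np.zeros(2), n_steps=2)
print(np.round(w, 6))                     # minimizer of E, i.e. solution of Aw = b
```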

X. Swarm Intelligence

What is "swarm intelligence" and why is it interesting?

Two kinds of swarm intelligence

particle swarm optimization

ant colony optimization

Some applications

Discussion

What is "swarm intelligence"?

Swarm Intelligence is a property of systems of non-intelligent agents
exhibiting collectively intelligent behavior.

Characteristics of a swarm:

- distributed; no central control or data source
- no (explicit) model of the environment
- perception of the environment
- ability to change the environment

"I can't do it" vs. "We can do it"

A group of friends, each having a metal detector, is on a
treasure-finding mission. Each can communicate the signal and the
current position to the n nearest neighbors. If your neighbor is
closer to the treasure than you are, you can move closer to that
neighbor, thereby improving your own chance of finding the treasure.
Also, the treasure may be found more easily than if you were on your
own.

Individuals in a swarm interact to solve a global objective in a more
efficient manner than a single individual could. A swarm is defined
as a structured collection of interacting organisms [ants, bees,
wasps, termites, fish in schools, and birds in flocks] or agents.
Within swarms, individuals are simple in structure, but their
collective behaviors can be quite complex. Hence, the global behavior
of a swarm emerges in a nonlinear manner from the behavior of the
individuals in that swarm.

The interaction among individuals plays a vital role in shaping the
swarm's behavior. Interaction aids in refining experiential knowledge
about the environment and enhances the progress of the swarm toward
optimality. The interaction is determined genetically or through
social interaction.

Applications: function optimization, optimal route finding,
scheduling, image and data analysis.

Why is it interesting?

- Robust nature of animal problem-solving
- Simple creatures exhibit complex behavior
- Behavior is modified by a dynamic environment
  (e.g., ants, bees, birds, fish, etc.)

Two kinds of Swarm Intelligence

Particle swarm optimization
- proposed in 1995 by J. Kennedy and R. C. Eberhart
- based on the behavior of bird flocks and fish schools

Ant colony optimization
- defined in 1999 by Dorigo, Di Caro and Gambardella
- based on the behavior of ant colonies

1. Particle Swarm Optimization

Population-based method with three main principles:

- a particle has a movement
- the particle wants to go back to the best previously visited position
- the particle tries to get to the position of the best-positioned particles

Four types of neighborhood:

- star (global): all particles are neighbors of all particles
- ring (circle): particles have a fixed number of neighbors K (usually 2)
- wheel: only one particle is connected to all particles and acts as the hub
- random: N random connections among the particles

The algorithm:

- Initialization: x_id(0) = random value, v_id(0) = 0
- Calculate performance: F(x_id(t)) (F: performance measure)
- Update the best particle: if F(x_id(t)) is better than pbest, then
  pbest = F(x_id(t)) and p_id = x_id(t); same for the gbest
- Move each particle (see next slide)
- Repeat until the system converges

Particle Dynamics: for convergence, c_1 + c_2 < 4 [Kennedy 1998].

Examples

http://uk.geocities.com/markcsinclair/pso.html

http://www.engr.iupui.edu/~shi/PSO/AppletGUI.html
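The loop above can be sketched as a global-best ("star") PSO. The move step is not spelled out in these notes, so the sketch assumes the standard Kennedy-Eberhart velocity update with the commonly used inertia weight; all constants are illustrative, chosen to satisfy c1 + c2 < 4:

```python
import random

# Global-best PSO sketch: v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x),
# then x <- x + v. The inertia weight w_in is the later Shi-Eberhart
# stabilization; c1 = c2 = 1.5 keeps c1 + c2 < 4.

def pso(f, dim=2, n_particles=20, n_iter=200, w_in=0.7, c1=1.5, c2=1.5, seed=1):
    rng = random.Random(seed)
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]     # v_id(0) = 0
    pbest = [xi[:] for xi in x]                       # best position per particle
    pbest_val = [f(xi) for xi in x]
    gbest = min(zip(pbest_val, pbest))[1][:]          # global best position
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w_in * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]                    # move the particle
            val = f(x[i])
            if val < pbest_val[i]:                    # update pbest
                pbest_val[i], pbest[i] = val, x[i][:]
        gbest = min(zip(pbest_val, pbest))[1][:]      # update gbest
    return gbest

best = pso(lambda p: sum(c * c for c in p))           # minimize the sphere function
print(best)                                           # best position found
```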

Local Minimum Problem

- Vary the number of hidden nodes.
- Add a little noise (zero-mean white Gaussian) to the weights or the
  training data [desired output or input (for better generalization)].
- Use {Simulated Annealing} or {Genetic Algorithm optimization, then BP}.

Design aided by a Graphic User Interface:

- NN Oscilloscope: look at the internal weights/node activities with
  color coding
- Fuzzy control of the learning rate and slope (Principe's, Chap. 4.16)

Students' Questions from 2005

- When the learning rate is optimized and initialized, there must be
  a rough boundary for it. Is there just an empirical way to do it?
- The learning-rate annealing just keeps on decreasing the rate with
  n, without looking at where in the error surface the current
  weights are. Is this OK?
- Conjugate Gradient is similar to Momentum in that the old search
  direction is utilized in determining the new search direction. It
  is also similar to the delta-bar-delta rule in using the past trend.
- Is CG always faster converging than SD?
- Do different initial values of the weights affect the output
  results? How can we choose them?