Concept Map Practical Design Issues

Artificial Intelligence and Robotics, 8 Nov 2013

Concept map (Practical Design Issues):

- Training Data
- Topology
- Initial Weights
- Learning Algorithm
- Fast Learning
- Network Size: Occam's Razor; Small size; Network Growing; Network Pruning (Brain Damage, Weight Decay)
- Generalization: Cross-validation & Early stopping; Noise; Weight sharing; Increase Training Data

Concept map (Fast Learning):

- Training Data: Normalize, Scale, Present at Random
- Cost Function / Activation Function: Adaptive slope
- Architecture: Modular, Committee
- BP variants: No weight learning for correctly classified patterns; η adaptation; Chen & Mars; Momentum; Fahlman's
- Other Minimization Methods: Conjugate Gradient
1. Practical Issues

Performance = f(training data, topology, initial weights, learning algorithm, ...)
            = (Training Error, Net Size, Generalization).

(1) How to prepare training data and test data?


- The training set must contain enough information to learn the task.
- Eliminate redundancy, possibly by data clustering.



- Training set size: N > W/ε
  (N = # of training data, W = # of weights, ε = classification error permitted on test data ≈ generalization error)
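As a quick sanity check, the N > W/ε rule above can be turned into a few lines of Python; the 8-5-3 network shape and ε = 0.1 below are illustrative assumptions, not values from the notes:

```python
# Rough training-set-size check based on the N > W / eps rule.

def num_weights(layer_sizes):
    """Count weights (incl. biases) of a fully connected MLP."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

def min_training_samples(layer_sizes, eps):
    """Smallest integer N satisfying N > W / eps."""
    return int(num_weights(layer_sizes) / eps) + 1

W = num_weights([8, 5, 3])       # 8 inputs, 5 hidden, 3 outputs -> 63 weights
N = min_training_samples([8, 5, 3], eps=0.1)
print(W, N)
```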

Chapter 4. Designing & Training MLPs

Ex. Modes of Preparing Training Data for Robot Control



The importance of the training data for tracking performance cannot be overemphasized. Basically, three modes of training data selection are considered here. In the regular mode, the training data are obtained by tessellating the robot's workspace and taking the grid points as shown in the next page. However, for better generalization, a sufficient amount of random training data might be obtained by observing the light positions in response to uniformly random Cartesian commands to the robot. This is the random mode. The best generalization power is achieved by the semi-random mode, which evenly tessellates the workspace into many cubes and chooses a randomly selected training point within each cube. This mode is essentially a blend of the regular and the random modes.

[Figure: Training data acquisition modes: regular mode, random mode, semi-random mode.]

Fig. 10. Comparison of training errors and generalization errors for random and semi-random training methods.

[Plots: RMS Error (mm) vs. Iteration; (a) training error, (b) test error; curves for Random and Semi-Random.]
(2)
Optimal Implementation



A. Network Size


Occam's Razor: Any learning machine should be sufficiently large to solve a given problem, but not larger. A scientific model should favor simplicity, i.e. shave off the fat in the model. [Occam = 14th-century British monk]

a. Network Growing: Start with a few / add more


(Ref. Kim, Modified Error BP Adding Neurons to Hidden Layer, J. of KIEE 92/4)





If E > ε₁ and ΔE < ε₂ (error still large but no longer decreasing), add a hidden node.




Use the current weights for existing weights and small random
values for newly added weights as initial weights for new
learning.

b. Network Pruning




Remove unimportant connections


After brain damage, retrain the network.




Improves generalization.




Weight decay: after each epoch

c. Size Reduction by Dim. Reduction or Sparse Connectivity in


Input Layer [e.g. Use 4 random instead of 8 connections]

w' = (1 − ε)w (weight decay, applied after each epoch)
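A minimal sketch of the weight-decay step (shrink every weight to (1 − ε)w after each epoch); the decay rate and weight values are illustrative:

```python
# After each epoch every weight is shrunk: w' = (1 - eps) * w.
# Weights that learning does not keep reinforcing decay toward zero.

def decay_weights(weights, eps=0.01):
    """Apply one epoch's multiplicative weight decay."""
    return [(1.0 - eps) * w for w in weights]

w = [1.0, -2.0, 0.5]
for _ in range(100):             # 100 epochs of decay with no reinforcement
    w = decay_weights(w)
print(w[0])                      # 0.99**100, about 0.366
```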
[Figure: Training (X) and test (O) error vs. number of epochs. T = training data, X = test data, R = NN with good generalization, R' = NN with poor generalization. R' overfits (due to too many weights relative to the training samples, and noise): its training error keeps falling while its test error rises.]

B. Generalization: Train (memorize) and apply to an actual problem (generalize).

[Figure: The training set splits into a learning subset and a validation subset, with a separate test set. Mean-square error vs. number of epochs: the training-sample error keeps decreasing, while the validation-sample error reaches a minimum at the early-stopping point.]


- For good generalization, train with the Learning Subset and check on the Validation Subset.
- Determine the best structure based on the Validation Subset [10% of the data, checked every 5-10 iterations].
- Train further with the full Training Set; evaluate on the Test Set.

Statistics of the training (validation) data must be similar to those of the test (actual problem) data.

Tradeoff between training error and generalization!
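The early-stopping idea can be sketched as follows; `train_step` and `val_error` are hypothetical stand-ins for a real training loop and validation pass, and the toy validation curve is made up for illustration:

```python
# Track validation error each epoch; remember the epoch with the lowest
# validation error and stop once it has failed to improve for `patience` epochs.

def early_stopping(train_step, val_error, max_epochs=100, patience=5):
    best_err, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step(epoch)
        err = val_error(epoch)
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:      # validation error stopped improving
                break
    return best_epoch, best_err

# Toy validation curve: falls, then rises (overfitting sets in at epoch 10).
curve = [abs(e - 10) + 1 for e in range(100)]
best = early_stopping(lambda e: None, lambda e: curve[e])
print(best)
```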

Stopping criterion. Classification: stop upon no error. Function approximation: check E and ΔE.

An example showing how to prepare the various data sets to learn an unknown function from data samples.


Other measures to improve generalization:

- Add noise (1-5%) to the training data or weights.
- Hard (soft) weight sharing (using equal values for groups of weights) can improve generalization.
- For fixed training data, the smaller the net, the better the generalization.
- Increase the training set to improve generalization.
- For insufficient training data, use the leave-one(some)-out method: select an example, train the net without this example, and evaluate with this unused example.
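A sketch of the leave-one-out procedure; to stay self-contained, the "model" here is just the mean of the training targets, which is obviously not an MLP:

```python
# Leave-one-out: train on all examples but one, evaluate on the held-out
# example, and average the errors over all choices of held-out example.

def leave_one_out_error(targets):
    errors = []
    for i, held_out in enumerate(targets):
        train = targets[:i] + targets[i + 1:]    # train without example i
        prediction = sum(train) / len(train)     # trivial stand-in model
        errors.append((prediction - held_out) ** 2)
    return sum(errors) / len(errors)

loo = leave_one_out_error([1.0, 2.0, 3.0])
print(loo)                                       # (2.25 + 0 + 2.25) / 3 = 1.5
```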


If the net still does not generalize well, retrain with the new problem data.

C. Speeding Up [Accelerating] Convergence

- Ref.: book by Hertz; AI Expert Magazine 91/7.

To speed up the calculation itself: reduce the number of floating-point operations by using fixed-point arithmetic, and use a piecewise-linear approximation for the sigmoid.

Students' Questions from 2005

- What will happen if more than 5-10% validation data are used?
- Consider 2 industrial assembly robots for precision jobs made by the same company with an identical spec. If the same NN is used for both, then the robots will act differently. Do we need better generalization methods to compensate for this difference?
- Large N may increase noisy data. However, wouldn't large N offset the problem by yielding more reliability? How big an influence would noise have upon misguided learning?
- What measures can prevent local minimum traps?
- Is there any mathematical validation for the existence of a stopping point in validation samples?
- The number of hidden nodes is adjusted by a human. An NN is supposed to self-learn, and therefore there must be a way to automatically adjust the number of hidden nodes.



- Normalize inputs, scale outputs.
- Zero mean, decorrelate (PCA), and covariance equalization.
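The zero-mean, unit-variance part of this preprocessing can be sketched directly (PCA decorrelation would additionally need an eigensolver, omitted here); the data values are illustrative:

```python
# Shift each input dimension to zero mean and scale it to unit variance.

def normalize(data):
    n = len(data)
    dims = len(data[0])
    out = [row[:] for row in data]
    for d in range(dims):
        col = [row[d] for row in data]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = var ** 0.5 or 1.0              # guard against constant inputs
        for row in out:
            row[d] = (row[d] - mean) / std
    return out

X = normalize([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
print(X)
```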







- Start with small uniform random initial weights [for tanh]: −r ≤ w(0) ≤ r.
- Present training patterns in random (shuffled) order (or mix different classes).

- Alternative cost or activation functions. Ex.:
  - Cost: E = Σ_P Σ_k (d_k − y_k)² vs. E = Σ_P Σ_k |d_k − y_k|^r.
  - Activation: f(s) = 1.716 tanh(2s/3) (use ±1 as targets), or f(s) = tan⁻¹(sinh s).
  - Initial weight range: |w| < 2.4/fan-in.

- Fahlman's bias to ensure a nonzero derivative: δ_k = (f' + 0.1)(t_k − y_k), for output units only or for all units.




- Chen & Mars differential step size: use different learning rates for the inner and outer layers (e.g. one set to 0.1 times the other), and for output units only drop f', i.e. δ_k = t_k − y_k.
- Omitting redundant learning (Accelerating BP Algorithm through Omitting Redundant Learning, J. of KIEE 92/9): if E_p < θ, do not update weights on the p-th training pattern (no BP for that pattern). Cf. Principe's book recommends θ = 2~5; best to try different values.
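The omit-redundant-learning rule (skip backprop when E_p < θ) can be sketched as follows; `update` is a stand-in for one backprop pass, and the per-pattern errors are made up:

```python
# Skip the backward pass for training patterns whose error is already
# below a threshold theta; only poorly-fit patterns trigger an update.

def train_epoch(patterns, pattern_error, update, theta=0.05):
    updates = 0
    for p in patterns:
        if pattern_error(p) < theta:     # pattern already learned: no BP
            continue
        update(p)
        updates += 1
    return updates

errors = {0: 0.01, 1: 0.20, 2: 0.03, 3: 0.40}
n = train_epoch([0, 1, 2, 3], errors.get, lambda p: None)
print(n)                                 # only patterns 1 and 3 get updates
```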



- Ahalt: Modular Net. Split one large MLP into modules, e.g. MLP 1: x → y1 and MLP 2: x → y2, and vary η per module.

- Ahalt: Adapt slope (sharpness) parameters. With f(s) = 1/(1 + e^{−λs}), update the slope λ by gradient descent using ∂J/∂λ, alongside the weight update using ∂J/∂w.
- Plaut rule: η_pq = 1/fan-in, i.e. the learning rate of each weight is inversely proportional to the fan-in of its destination unit.

- Jacobs: Learning rate adaptation [Ref. Neural Networks, Vol. 1, No. 4, 88]. The reasons for slow convergence motivate the following:

a. Momentum:

Δw(t) = −η ∂J/∂w(t) + α Δw(t−1),

which unrolls to Δw(t) = −η Σ_{i=0}^{t} α^{t−i} ∂J/∂w(i).

In a plateau (nearly constant gradient), Δw → −η′ ∂J/∂w, where η′ = η/(1−α) is the effective learning rate (η′ > η for 0 < α < 1).

[Figure: error trajectories with and without momentum.]
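The plateau argument above can be checked numerically; with a constant gradient, the momentum step settles at −ηg/(1 − α):

```python
# Momentum update dw(t) = -eta * grad + alpha * dw(t-1). On a constant-slope
# plateau the step converges to -eta * grad / (1 - alpha), i.e. an effective
# learning rate of eta / (1 - alpha).

def momentum_steps(grad, eta=0.1, alpha=0.9, steps=200):
    dw = 0.0
    for _ in range(steps):
        dw = -eta * grad + alpha * dw
    return dw

step = momentum_steps(grad=1.0)
print(step)                      # approaches -0.1 / (1 - 0.9) = -1.0
```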

b. Delta-bar-delta rule: Δw_i(t) = −η_i(t) ∂J/∂w_i(t), where each weight has its own learning rate η_i(t), adapted as

η_i(t+1) = η_i(t) + κ,        if δ̄_i(t−1) δ_i(t) > 0
η_i(t+1) = (1 − φ) η_i(t),    if δ̄_i(t−1) δ_i(t) < 0
η_i(t+1) = η_i(t),            otherwise,

with δ_i(t) = ∂J/∂w_i(t) and δ̄_i(t) = (1 − θ) δ_i(t) + θ δ̄_i(t−1).

For actual parameters to be used, consult Jacobs' paper and also "Getting a fast break with Backprop", Tveter, AI Expert Magazine, excerpt from the pdf files that I provided.
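A sketch of the delta-bar-delta update for a single weight; the gradient sequence and the κ, φ, θ values are illustrative, not recommendations:

```python
# Grow eta additively while the gradient keeps its sign; shrink it
# multiplicatively on a sign flip, using the exponential trace delta_bar.

def delta_bar_delta(grads, eta=0.1, kappa=0.01, phi=0.5, theta=0.7):
    delta_bar = 0.0
    etas = []
    for delta in grads:
        if delta_bar * delta > 0:        # same sign: increase eta additively
            eta += kappa
        elif delta_bar * delta < 0:      # sign flip: decrease multiplicatively
            eta *= (1.0 - phi)
        delta_bar = (1.0 - theta) * delta + theta * delta_bar
        etas.append(eta)
    return etas

etas = delta_bar_delta([1.0, 1.0, 1.0, -1.0])
print(etas)                              # grows twice, then halves on the flip
```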

Students' Questions from 2005

- Is there any way to design a spherical error surface for faster convergence?
- Momentum provides inertia to jump over a small peak.
- Parameter optimization techniques seem to be a good help to NN design.
- I am afraid that optimizing even the sigmoid slope and the learning rate may expedite overfitting.
- In what aspect is it more manageable to remove the mean, decorrelate, etc.?
- How does using a bigger learning rate for the output layer help learning?
- Does the solution always converge if we use gradient descent?
- Are there any shortcomings in using fast learning algorithms?
- In Ahalt's modular net, is it faster for a single output only or all the outputs than an MLP?
- Various fast learning methods have been proposed. Which is the best one? Is it problem-dependent?
- The Jacobs method cannot find the global min. for an error surface like: [figure omitted]

Conjugate Gradient: Fletcher & Reeves

1) Line search: w(n+1) = w(n) + η(n) s(n), where s(n) is the search direction and η(n) solves Min_η E[w(n) + η s(n)].

- If η is fixed and s(n) = −g(n) = −∇E[w(n)]: Gradient Descent.
- If s(n) = −g(n) and η*(n) solves Min_η E[w(n) − η g(n)]: Steepest Descent. The exact line search gives ∇E[w(n) − η* g(n)]ᵀ g(n) = 0, i.e. g(n+1) ⊥ g(n), with s(n+1) = −g(n+1).
[Figure: weight-space trajectories w(n) → w(n+1) → w(n+2) for Gradient Descent, Steepest Descent, Momentum, and Conjugate Gradient. Gradient Descent + Line Search gives Steepest Descent; Steepest Descent + Momentum behaves like Conjugate Gradient.]
2) Choose β(n) such that the new direction

s(n+1) = −g(n+1) + β(n) s(n)

is conjugate to the old one: s(n+1)ᵀ H s(n) = 0, where H = [∂²E/∂w_i ∂w_j] is the Hessian (exact for a quadratic E expanded about w₀).

With exact line searches, ∇E[w(n) + η s(n)]ᵀ s(n) = 0, i.e. g(n+1)ᵀ s(n) = 0 (and likewise at the next step).

Fletcher-Reeves rule: β(n) = ‖g(n+1)‖² / ‖g(n)‖²
Polak-Ribiere rule: β(n) = (g(n+1) − g(n))ᵀ g(n+1) / ‖g(n)‖²

Conjugate Gradient algorithm (flowchart):

1. START: initialize s(0) = −g(0) = −∇E[w(0)].
2. Line search: η*(n) = argmin_η E(w(n) + η s(n)).
3. Update: w(n+1) = w(n) + η*(n) s(n); g(n+1) = ∇E[w(n+1)].
4. If E(w(n+1)) < ε₁ or n > n_max: End.
5. Otherwise set s(n+1) = −g(n+1) + β(n) s(n) and go to step 2.
Comparison of SD and CG:

- Steepest Descent: each step takes a line search.
- Conjugate Gradient: for N-variable quadratic functions, converges in at most N steps.
- Recommended: Steepest Descent + n steps of Conjugate Gradient + Steepest Descent + n steps of Conjugate Gradient + ...
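The CG loop can be sketched on a small quadratic, where the exact line-search η has a closed form (a property of quadratics, not of real MLP error surfaces); the matrix A and vector b are illustrative:

```python
# Fletcher-Reeves conjugate gradient on E(w) = 1/2 w^T A w - b^T w.
# For N variables it converges in at most N exact line searches.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, w, steps):
    g = [gv - bv for gv, bv in zip(matvec(A, w), b)]   # gradient A w - b
    s = [-gv for gv in g]                              # s(0) = -g(0)
    for _ in range(steps):
        eta = -dot(g, s) / dot(s, matvec(A, s))        # exact line search
        w = [wv + eta * sv for wv, sv in zip(w, s)]
        g_new = [gv - bv for gv, bv in zip(matvec(A, w), b)]
        beta = dot(g_new, g_new) / dot(g, g)           # Fletcher-Reeves
        s = [-gv + beta * sv for gv, sv in zip(g_new, s)]
        g = g_new
    return w

A = [[3.0, 1.0], [1.0, 2.0]]                           # symmetric pos. def.
b = [1.0, 1.0]
w = conjugate_gradient(A, b, [0.0, 0.0], steps=2)      # N = 2 steps suffice
print(w)                                               # solves A w = b: [0.2, 0.4]
```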

X. Swarm Intelligence

- What is "swarm intelligence" and why is it interesting?
- Two kinds of swarm intelligence: particle swarm optimization, ant colony optimization
- Some applications
- Discussion

What is "swarm intelligence"?



Swarm intelligence is a property of systems of non-intelligent agents exhibiting collectively intelligent behavior.

Characteristics of a swarm:

- distributed, no central control or data source
- no (explicit) model of the environment
- perception of the environment
- ability to change the environment

("I can't do" vs. "We can do")

A group of friends, each having a metal detector, is on a treasure-finding mission. Each can communicate the signal and current position to the n nearest neighbors. If a neighbor is closer to the treasure than you, you can move closer to that neighbor, thereby improving your own chance of finding the treasure. Also, the treasure may be found more easily than if you were on your own.

Individuals in a swarm interact to solve a global objective in a more efficient manner than a single individual could. A swarm is defined as a structured collection of interacting organisms [ants, bees, wasps, termites, fish in schools and birds in flocks] or agents. Within swarms, individuals are simple in structure, but their collective behaviors can be quite complex. Hence, the global behavior of a swarm emerges in a nonlinear manner from the behavior of the individuals in that swarm.

The interaction among individuals plays a vital role in shaping the swarm's behavior. Interaction aids in refining experiential knowledge about the environment and enhances the progress of the swarm toward optimality. The interaction is determined genetically or through social interaction.

Applications: function optimization, optimal route finding, scheduling, image and data analysis.

Why is it interesting?

- Robust nature of animal problem-solving
- Simple creatures exhibit complex behavior
- Behavior modified by a dynamic environment

e.g., ants, bees, birds, fish, etc.

Two kinds of swarm intelligence:

- Particle swarm optimization: proposed in 1995 by J. Kennedy and R. C. Eberhart, based on the behavior of bird flocks and fish schools.
- Ant colony optimization: defined in 1999 by Dorigo, Di Caro and Gambardella, based on the behavior of ant colonies.

1. Particle Swarm Optimization

- Population-based method.
- Three main principles:
  - a particle has a movement
  - the particle wants to go back to the best previously visited position
  - the particle tries to get to the position of the best-positioned particles
- Four types of neighborhood:
  - star (global): all particles are neighbors of all particles
  - ring (circle): particles have a fixed number of neighbors K (usually 2)
  - wheel: only one particle is connected to all particles and acts as a hub
  - random: N random connections are made between the particles


Algorithm:

- Initialization: x_id(0) = random value, v_id(0) = 0.
- Calculate performance: F(x_id(t)) (F: performance measure).
- Update best particle: if F(x_id(t)) is better than pbest, then pbest = F(x_id(t)) and p_id = x_id(t); same for gbest.
- Move each particle: see next slide.
- Repeat until the system converges.



Particle Dynamics: for convergence, c1 + c2 < 4 [Kennedy 1998].
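The PSO update with a star (global) neighborhood can be sketched as follows; the 1-D objective x², the swarm size, and the constants are illustrative, with c1 + c2 = 3 satisfying the c1 + c2 < 4 condition above:

```python
# Particle swarm optimization: each particle is pulled toward its own best
# position (pbest) and the swarm's best position (gbest).

import random

def pso(f, n_particles=20, iters=100, c1=1.5, c2=1.5, inertia=0.7, seed=0):
    rng = random.Random(seed)
    x = [rng.uniform(-10, 10) for _ in range(n_particles)]   # positions
    v = [0.0] * n_particles                                  # velocities
    pbest = x[:]                                             # personal bests
    gbest = min(x, key=f)                                    # global best
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            v[i] = (inertia * v[i]
                    + c1 * r1 * (pbest[i] - x[i])            # pull to own best
                    + c2 * r2 * (gbest - x[i]))              # pull to swarm best
            x[i] += v[i]
            if f(x[i]) < f(pbest[i]):
                pbest[i] = x[i]
            if f(x[i]) < f(gbest):
                gbest = x[i]
    return gbest

best = pso(lambda x: x * x)
print(best)                       # best position found; near 0 for this problem
```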


Examples:

- http://uk.geocities.com/markcsinclair/pso.html
- http://www.engr.iupui.edu/~shi/PSO/AppletGUI.html




Local Minimum Problem

- Restart with different initial weights, learning rates, and number of hidden nodes.
- Add (and anneal) a little noise (zero-mean white Gaussian) to the weights or training data [desired output or input (for better generalization)].
- Use {Simulated Annealing} or {Genetic Algorithm optimization, then BP}.

Design aided by a Graphic User Interface

- NN oscilloscope: look at internal weights / node activities with color coding.
- Fuzzy control of learning rate and slope (Principe's, Chap. 4.16).


Students' Questions from 2005

- When the learning rate is optimized and initialized, there must be a rough boundary for it. Is there just an empirical way to do it?
- In Conjugate Gradient, s(n) = −g(n+1)?
- The learning rate annealing just keeps on decreasing with n, without looking at where in the error surface the current weights are. Is this OK?
- Conjugate Gradient is similar to Momentum in that the old search direction is utilized in determining the new search direction. It is also similar to the delta-bar-delta rule using the past trend.
- Is CG always faster converging than SD?
- Do different initial values of the weights affect the output results? How can we choose them?