Part2Report - Canturk Isci




EE-443 NEURAL NETWORKS


















































MINI PROJECT 1, PART-2







This project aims to design a multilayer neural network to approximate the sinc function for $-4 \le x \le 4$ using the backpropagation algorithm. A two-layer structure is used, with one hidden layer and one output layer. After the neural network is trained with several inputs from $-4$ to $4$, it is tested with inputs that are the arithmetic means of the training inputs.





I. NEURAL NETWORK:


A two-layer neural network is used, with one input, representing the x value whose sinc is to be calculated, and one output, representing the calculated sinc(x) value. There will be n (≥ 10) neurons in the hidden layer, each having two inputs: the input x and a bias value (threshold). All the neurons will have the bipolar sigmoidal function

$\varphi(v) = \dfrac{1 - e^{-v}}{1 + e^{-v}}$

as the activation function (nonlinearity). The structure of the hidden layer will be as follows:


$vh_i = Wh_{i,1}\,X + Wh_{i,2}$ and $Oh_i = \varphi(vh_i)$, for $i = 1, \dots, n$.

The h subscripts indicate these are 'hidden' layer expressions.


The output layer will consist of a single neuron with n+1 inputs, n of which come from the outputs of the hidden layer and the last one from the bias. The nonlinearity will be the bipolar sigmoidal function, and the output will be the calculated sinc(x) value.

The output O is the corresponding sinc(x) value for X(i).


With the above neural network topology, the sinc(x) value will be calculated as follows:



$\mathrm{sinc}(x) = O = \varphi\big[\, W_{1,1}\,\varphi(Wh_{1,1}X + Wh_{1,2}) + W_{1,2}\,\varphi(Wh_{2,1}X + Wh_{2,2}) + \dots + W_{1,n}\,\varphi(Wh_{n,1}X + Wh_{n,2}) + W_{1,n+1} \,\big]$



If we write the above relation in the matrix form:



$O = \varphi\big(W\,[Oh;\,1]\big)$

where

$\begin{bmatrix} vh_1 \\ \vdots \\ vh_i \\ \vdots \\ vh_n \end{bmatrix} = \begin{bmatrix} Wh_{1,1} & Wh_{1,2} \\ \vdots & \vdots \\ Wh_{i,1} & Wh_{i,2} \\ \vdots & \vdots \\ Wh_{n,1} & Wh_{n,2} \end{bmatrix} \begin{bmatrix} X(i) \\ 1 \end{bmatrix}, \qquad Oh_i = \varphi(vh_i)$

$v = \begin{bmatrix} W_{1,1} & W_{1,2} & \dots & W_{1,n} & W_{1,n+1} \end{bmatrix} \begin{bmatrix} Oh_1 \\ Oh_2 \\ \vdots \\ Oh_n \\ 1 \end{bmatrix}, \qquad O = \varphi(v)$

and $X = [X(i)\ \ 1]^T$.
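As a concrete illustration of this forward pass, a minimal Matlab sketch is given below. It assumes Wh (n-by-2), W (1-by-(n+1)) and a scalar input x_i already exist in the workspace; the names other than W, Wh, Oh, vh, v and O are my own.

X  = [x_i; 1];                         % augmented input: x value plus bias entry
vh = Wh*X;                             % hidden layer induced fields (n-by-1)
Oh = (1 - exp(-vh))./(1 + exp(-vh));   % bipolar sigmoids of the hidden neurons
v  = W*[Oh; 1];                        % output neuron induced field (scalar)
O  = (1 - exp(-v))/(1 + exp(-v));      % approximated sinc(x_i)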


II. TRAINING ALGORITHM:



For the training, first the input training set is defined as:

$x(i) = -4 + \dfrac{i-1}{N}, \quad i = 1, 2, \dots, 8N+1$

x(i) ranges from $-4$ to $4$ with $8N+1$ equally spaced elements. I call N the sampling frequency.



The corresponding desired output set is defined as:

$d(i) = \mathrm{sinc}(x(i))$

Then, the backpropagation algorithm is used to achieve an average error $\le 10^{-4}$ on the training set. To use the backpropagation algorithm, the x vector is first extended to an augmented x matrix with a second row of 1s for the bias values.
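For illustration, one possible Matlab construction of this training set and augmented input matrix is sketched below (the choice N = 15 and the name xrow are my own; x and d follow the report's notation):

N = 15;                       % sampling frequency
xrow = -4:1/N:4;              % 8N+1 equally spaced samples in [-4, 4]
x = [xrow; ones(1,8*N+1)];    % augmented x matrix: second row of 1s for the bias
d = sinc(xrow);               % desired outputs d(i) = sinc(x(i))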



As the first step of the backpropagation algorithm, the initial weights for each neuron are set. In this specific network, the weight matrix Wh for the hidden layer and the weight matrix W for the output layer are initialized. The number of neurons in the hidden layer, n, is defined at this step.


As the weights are updated according to the error gradient in the backpropagation algorithm, a weight $W_{1,j}$ at the output layer will be updated as follows:

$W_{1,j}(\mathrm{new}) = W_{1,j}(\mathrm{old}) - \eta\,\dfrac{\partial E}{\partial W_{1,j}}$    (*)



Here,

$\eta$: learning constant, a defined constant;

$E$: total error at the output, calculated as below:

$E = \dfrac{1}{2}\sum_{k=1}^{\#o} e_k^2$

where

$\#o$: number of output neurons (= number of outputs);

$e_k$: error made at the $k$-th output neuron, calculated as below:

$e_k = d_k - o_k$




where

$d_k$: desired output for the $k$-th output neuron;

$o_k$: acquired output for the $k$-th output neuron.


The idea behind the gradient term is simple: if the gradient is positive, E increases as $W_{1,j}$ increases, so decreasing $W_{1,j}$ decreases E; if the gradient is negative, increasing $W_{1,j}$ decreases E.

For the output layer, some elaboration of the equation reveals:

$W_{1,j}(\mathrm{new}) = W_{1,j}(\mathrm{old}) + \eta\,e_1\,\varphi'(v)\,Oh_j$    (**)


Some additional rearranging, using

$\varphi'(v) = \tfrac{1}{2}(1 - o^2)$

and defining the local gradient

$\delta = e\,\varphi'(v)$,

reveals the final equation as:

$W_{1,j}(\mathrm{new}) = W_{1,j}(\mathrm{old}) + \eta\,\delta\,Oh_j$    (***)
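For completeness, the chain-rule step behind (**) and (***) is the following (with a single output neuron, so $E = \tfrac{1}{2}e_1^2$, $e_1 = d_1 - o$, $o = \varphi(v)$ and $v = \sum_j W_{1,j}\,Oh_j + W_{1,n+1}$):

$\dfrac{\partial E}{\partial W_{1,j}} = \dfrac{\partial E}{\partial e_1}\,\dfrac{\partial e_1}{\partial o}\,\dfrac{\partial o}{\partial v}\,\dfrac{\partial v}{\partial W_{1,j}} = e_1\,(-1)\,\varphi'(v)\,Oh_j = -\,\delta\,Oh_j$

so that $-\eta\,\partial E/\partial W_{1,j} = +\eta\,\delta\,Oh_j$, which is exactly the correction term in (***).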



The next step in the backpropagation algorithm is to formulate the above relation in a more compact, matrix form. To accomplish this, first the local gradient matrix is formed as:

$[\delta_o] = [\varphi'(v)]\,[e]$

and

$[W(\mathrm{new})] = [W(\mathrm{old})] + \eta\,[\delta_o]\,[Oh{:}1]^T$



Hence, in our specific neural network design, the exact matrix equations are as follows:

$[W(\mathrm{new})] = [W(\mathrm{old})] + \eta\,\delta_o\,[Oh_1\ \ Oh_2\ \dots\ Oh_n\ \ 1]$, with $\varphi'(v) = \tfrac{1}{2}(1 - o^2)$ and $e = d - o$

(o is calculated as $O = \varphi(W\,[Oh;1])$, and $[Oh] = \varphi[Wh\,X]$).

This matrix equation is computed for each input training pattern to update the output weight matrix.
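A Matlab sketch of this per-pattern output-layer update is given below (it assumes Oh, W, the learning constant eta and the desired value d_i of the current pattern are already available; d_i and W_old are my own names):

v       = W*[Oh; 1];                   % output neuron induced field
o       = (1 - exp(-v))/(1 + exp(-v)); % network output for this pattern
e       = d_i - o;                     % output error e = d - o
delta_o = 0.5*(1 - o^2)*e;             % local gradient: phi'(v)*e
W_old   = W;                           % keep a copy for the hidden-layer gradients
W       = W + eta*delta_o*[Oh; 1]';    % [W(new)] = [W(old)] + eta*[delta_o][Oh:1]^T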



In the backpropagation algorithm, the next step is to initialize the cumulative error to zero. Then, one of the input training patterns is submitted and the error for that input pattern is added to the cumulative error.

Then, the local gradient at the output layer, $\delta_o$, is calculated, and the previous layers' local gradients are calculated by propagating backwards, each layer using the weight matrix W of its successor layer.

In order to gain insight into the hidden-layer weight matrix update strategy, we again start from the general gradient-descent relation:

$Wh_{i,j}(\mathrm{new}) = Wh_{i,j}(\mathrm{old}) - \eta\,\dfrac{\partial E}{\partial Wh_{i,j}}$    (i: neuron #; j: input #)    (*)


Again, some elaboration on the parameters and the matrix formulation reveals the following matrix expressions:

$[\delta_h] = [\varphi'(vh)]\,[W^T(\mathrm{old})]\,[\delta_o]$

and

$[Wh(\mathrm{new})] = [Wh(\mathrm{old})] + \eta\,[\delta_h]\,[\mathrm{currentX}]^T$


In our specific neural network, the above equations turn out to be:

$\delta_{h_i} = \varphi'(vh_i)\,W_{1,i}\,\delta_o$ and $[Wh(\mathrm{new})] = [Wh(\mathrm{old})] + \eta\,[\delta_h]\,[X(i)\ \ 1]$, with $\varphi'(vh_i) = \tfrac{1}{2}(1 - oh_i^2)$

This matrix operation is performed for each training input.
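A corresponding Matlab sketch of the hidden-layer update (currentX = [x(i); 1]; W_old is the output weight matrix saved before its own update, since the bias weight W(1,n+1) does not feed back to the hidden neurons):

delta_h = 0.5*(1 - Oh.^2).*(W_old(1:end-1)'*delta_o);  % hidden local gradients, n-by-1
Wh      = Wh + eta*delta_h*currentX';                  % [Wh(new)] = [Wh(old)] + eta*[delta_h][currentX]^T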



After the local gradients have been calculated, the following step is to update the weight matrices accordingly: first the output weight matrix W is updated, and then the hidden-layer weight matrix Wh is updated, as described above.



Then the algorithm loops back to submit a new input training pattern until one epoch is completed, i.e. all the input patterns have been submitted.

After one full epoch is completed, the average error for the epoch is calculated and the algorithm stops if Eave ≤ the desired value. Otherwise, the algorithm starts a new epoch.
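Putting the pieces together, a compact Matlab sketch of this training loop is given below (x, d, W, Wh, eta and the target Edes are assumed to be set up as above; this is only an outline of the plain algorithm, not the code of Appendix.III):

Eave = inf;
while Eave > Edes                               % stop when the average error is small enough
    Ecum = 0;                                   % cumulative error initialized to zero
    for i = 1:size(x,2)                         % one epoch: submit every training pattern
        vh = Wh*x(:,i);
        Oh = (1 - exp(-vh))./(1 + exp(-vh));
        v  = W*[Oh; 1];
        o  = (1 - exp(-v))/(1 + exp(-v));
        e  = d(i) - o;
        Ecum = Ecum + 0.5*e^2;                  % add this pattern's error
        delta_o = 0.5*(1 - o^2)*e;              % output-layer local gradient
        delta_h = 0.5*(1 - Oh.^2).*(W(1:end-1)'*delta_o);  % hidden local gradients
        W  = W  + eta*delta_o*[Oh; 1]';         % update output weights first
        Wh = Wh + eta*delta_h*x(:,i)';          % then update hidden weights
    end
    Eave = Ecum/size(x,2);                      % average error for the epoch
end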



III. TRAINING PROCESS:



When the pure training algorithm is run several times, it is seen not to converge to a sufficiently small average error, reaching values only as low as 25/10000 ($2.5\times10^{-3}$). The observed plot of the approximated sinc function for the acquired values is in Appendix.I. Two important observations are that the average error sometimes begins to rise and continues its ascent for several epochs, and that the algorithm sometimes stops at a NaN error. The reason for the unexpected rise in the average error is probably the individual update of weights for each input. For example, while the error gradient may look like curve 1 below for one input set, it might look like curve 2 for a previously trained set, and thus an update might cause a result opposite to what is desired.














[Figure: error change for inputs 1 and 2 around points A and B]

Hence, the application of update 1 at point A causes curve 2 to skip a high-gradient region and move in the reverse direction, and since curve 2 has a low gradient at B, it cannot compensate for the error increase, resulting in a net error increase for one full epoch. The obvious solution to this is to apply the input sequences out of order. Another expected observation is the slowing down of the algorithm as the gradients become smaller. The simplistic solution to this is the application of the momentum method to accelerate the convergence.

Before delving into the above considerations, I first chose to work on the costly method of using different W and Wh matrices as the starting condition. With the trial of several randomized initial values, I achieved an error less than $10^{-4}$ with the initial weights:

W = [0.4319 0.4361 0.2243 0.0132 0.3760 0.8386 0.2862 0.2841, 0.3271 0.4762 0.2260 0.9620 0.3065 0.7789 0.8953 0.2537]








Wh = [ 0.4645 0.0745
       0.5966 0.1910
       0.9348 0.3075
       0.4619 0.8298
       0.3001 0.7096
       0.0746 0.7502
       0.4689 0.9449
       0.0140 0.0542
       0.7384 0.6356
       0.9310 0.4610
       0.5129 0.4245
       0.4159 0.3882
       0.8555 0.8230
       0.9139 0.8360
       0.8681 0.7756 ]


With these initial weights, the algorithm converged to an error of Eave = 9.8577e-005 for 15 hidden-layer neurons (n) and a sampling frequency (N) of 15 (corresponding to 121 samples). The final weight matrices obtained are as below:

W = [2.4284 3.6194 -2.6403 -2.6197 0.2284 2.4010 1.7328, 1.4152 0.0029 0.5574 0.2820 2.7330 -4.1617 -0.0007, 0.0157 -0.4158]



Wh = [ 1.6913 -1.8743
       2.9433 5.2362
       2.9718 3.9844
       2.2926 5.2213
      -0.5589 2.0451
      -3.1517 5.3942
       0.7042 1.8311
       1.4700 -1.7420
      -0.0515 0.3130
       0.5576 0.3225
       0.3774 0.1139
       6.6328 -1.2134
       4.6365 0.9140
      -0.0176 0.0810
      -0.2096 0.6064 ]



The approximated sinc plot is given in Appendix.II. As seen in the plot, one interesting observation is that the approximated sinc function is much better aligned with the actual sinc function for x ≈ -4 than for x ≈ 4. Nevertheless, I have no proposal to explain this situation for now. One immediate possibility is that the input patterns are always submitted starting from the first one, and are therefore somehow better approximated. This will be verified later when pattern shuffling is applied. Up to this point, the plain backpropagation algorithm is used, which is attached in Appendix.III as Matlab code.





IV. IMPROVING TRAINING ALGORITHM:


After having satisfied the specification for the error with the plain backpropagation algorithm, I will go on to improve the algorithm with the momentum method, pattern shuffling, possibly the addition of white noise to the weight updates, and some additional ideas.




First of all, in order to speed up the algorithm, the momentum method is applied. The advantage of the momentum method is that it keeps a history of the previous gradient for each weight and strengthens the displacement if the gradient continues in the same direction (+ or -), or inhibits the update amount if overshooting occurs. As a reminder, the algorithm calculates W and Wh as:


$[W(\mathrm{new})] = [W(\mathrm{old})] + \eta\,[\delta_o]\,[Oh{:}1]^T$

$[Wh(\mathrm{new})] = [Wh(\mathrm{old})] + \eta\,[\delta_h]\,[\mathrm{currentX}]^T$


Therefore, I define the $\Delta W$ and $\Delta Wh$ matrices as:

$\Delta W_o = \eta\,[\delta_o]\,[Oh{:}1]^T$  and  $\Delta Wh_o = \eta\,[\delta_h]\,[\mathrm{currentX}]^T$

and the momentum method applies as:

$\Delta W_{NEW} = \alpha\,\Delta W_{OLD} + \Delta W_o$  and  $\Delta Wh_{NEW} = \alpha\,\Delta Wh_{OLD} + \Delta Wh_o$

where $\alpha$ is the momentum constant.
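A minimal Matlab sketch of this momentum update is given below (alpha is assumed to be the momentum constant; dW_prev and dWh_prev are my own names for the previous increments, starting as zero matrices of the same sizes as W and Wh):

dW_o  = eta*delta_o*[Oh; 1]';      % plain gradient-descent increment for W
dWh_o = eta*delta_h*currentX';     % plain gradient-descent increment for Wh (currentX = [x(i); 1])
dW    = alpha*dW_prev  + dW_o;     % momentum-smoothed increments
dWh   = alpha*dWh_prev + dWh_o;
W  = W  + dW;
Wh = Wh + dWh;
dW_prev  = dW;                     % keep the history for the next pattern
dWh_prev = dWh;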



After the addition of the momentum factor, the algorithm speeded up, but the convergence is still seen to require a long time. So, as a second speed improvement, I decided to play with $\eta$ internally, such that if the gradient begins to roll down in one direction, I speed up the rolling down with increasing acceleration by increasing $\eta$. In order to do that, I keep track of the signs of the previous $\Delta w$s and multiply $\eta$ by the number of persistent signs. Hence, as the gradients are separate for each weight entry, I need to keep a history for each weight $W_{i,j}$. The idea is to increment the history matrix element when the current gradient has the same direction (i.e. sign) as the previous $\Delta w$, rushing the roll down. However, this will obviously increase overshoot when a local minimum is reached, and then the algorithm must slow down to reach the value if it cannot skip it. Therefore, the matrix entry is directly reduced to 1 when such a situation is met. This is actually similar to the idea of congestion control in computer networking, where a similar algorithm, additive-increase multiplicative-decrease (AIMD), is used alongside the slow-start mechanism. The applied algorithm to accelerate the convergence is observed to get stuck in saddle points because of faster jumps in the w values: the v value is seen to reach very high (in magnitude) values, causing $\varphi'(v)$ to be very small (~0). The used algorithm, though given up, is attached in Appendix.IV.
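Although this scheme was eventually abandoned, the idea can be sketched as follows (all names are my own; gain and prevSign are matrices of the same size as W, initialized to ones and zeros): gain counts how many consecutive increments kept their sign, so the effective learning constant becomes gain times eta, and a sign flip drops it straight back to 1.

newSign     = sign(dW_o);             % direction of the current increment for each weight
same        = (newSign == prevSign);  % entries still rolling in the same direction
gain(same)  = gain(same) + 1;         % additive increase while the sign persists
gain(~same) = 1;                      % direct drop back to 1 when the direction flips
W           = W + gain.*dW_o;         % per-weight accelerated update (dW_o already contains eta)
prevSign    = newSign;                % keep the sign history for the next step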


The next idea to be implemented follows from the observation explained in Section III: the average error is seen to increase for a while during the convergence process. The most plausible explanation for this is that, as I apply the inputs in a fixed order, the first training input, for example, may push the gradient to a point which is better for its individual error term but causes the errors of the others to increase significantly, as illustrated in the figures in Section III. Therefore, pattern shuffling might be of significant help in reducing these reverse movements. My proposed shuffling algorithm works as follows: for an epoch, the Ecum values for each input are calculated and stored, and in the succeeding epoch the new Ecum values are calculated and subtracted from the former ones. If the new Eave turns out to be larger than the previous Eave, the previous weight matrices are restored and the new epoch is executed beginning from the input with the largest Ecum difference.

The shuffling works as follows: when the input number with the worst error change is determined, it is set as the first input to go in the next epoch, and the other inputs are interchanged around this number.

The function shuffle, which performs this operation, is copied below. Note that the function does not shuffle the second-row entries of x, as they are all 1s.


function [shuffledx,shuffledd] = shuffle(xin,din,centerindex)
% gets x and d matrices, interchanges their columns around centerindex
% puts the centerindex'th column to the first entry
shuffledx = xin;
shuffledx(1,1) = xin(1,centerindex);
shuffledx(1,centerindex) = xin(1,1);
shuffledd = din;
shuffledd(1,1) = din(1,centerindex);
shuffledd(1,centerindex) = din(1,1);

a = 1;
while (centerindex-a > 1) & (centerindex+a < length(xin))
    shuffledx(1,centerindex-a) = xin(1,centerindex+a);
    shuffledx(1,centerindex+a) = xin(1,centerindex-a);
    shuffledd(1,centerindex-a) = din(1,centerindex+a);
    shuffledd(1,centerindex+a) = din(1,centerindex-a);
    a = a+1;
end




The shuffling algorithm is seen to bring the error down quite fast to around $2\times10^{-4}$, but beyond that point shuffling is seen to make convergence harder, as it pushes the algorithm away from the destined minimum. Therefore, I put a limit of $2.5\times10^{-4}$, below which shuffling is avoided.


The last idea to be considered is to add zero-mean white noise (i.e. uniformly distributed between -0.1 and 0.1) to the W and Wh values, which might help avoid local minima. The effect of the white noise is seen to be rather indeterministic, and it badly inhibited convergence. Therefore, instead of 0.1, a value of 0.001 is tried as the range limit. This is observed to be acceptable for large values of the error, but as the error decreases, the effect of the white noise is rather instability in the algorithm. Therefore, the white noise is removed for Eave values below $2.5\times10^{-4}$.
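For reference, adding such zero-mean uniform noise to the weights can be done as sketched below (peak = 0.001 and the 2.5e-4 gate follow the text; the rest of the names are my own):

peak = 0.001;
if Eave > 2.5e-4                              % noise only while the error is still large
    W  = W  + peak*(2*rand(size(W))  - 1);    % uniform noise in [-peak, peak]
    Wh = Wh + peak*(2*rand(size(Wh)) - 1);
end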



Finally, the algorithm is complete with the additions of the momentum method, shuffling, avoidance of Eave increases, and white noise. The effects of all but the momentum method are removed when convergence is close, specifically when Eave < $2.5\times10^{-4}$. After that point, convergence continues with gradient descent and the momentum method.

The last step in training is to run the final program, in Appendix.V, for new input values. The tried input values are:


Number of hidden layer neurons (n)    Sampling frequency (N)    Epochs to converge
15                                    25                        1342
25                                    15                        17509
25                                    25                        3770


The final weight matrices for the above cases are found to be as below.

For n=15 and N=25, Eave is found to be $9.9985\times10^{-5}$. The approximated sinc drawing is plotted in Appendix.VI. The final weight matrices obtained are:

W = [-0.5944 0.1306 0.4646 1.1843 0.0437 -0.8697 0.1790 0.2814, 2.3172 0.0068 1.8697 -1.4612 1.2223 4.1612 -0.1520 -0.5057]


Wh = [ 1.2264 1.1349
       0.2149 0.5596
       1.6179 0.3237
       1.9340 0.8976
       0.0913 0.3219
       1.4200 1.8082
       0.2676 0.6210
       0.3732 0.8017
       5.1247 1.0315
       0.0311 0.1772
       2.0476 1.3510
       1.6766 2.7510
       0.8003 2.3245
      -4.3207 1.3903
      -0.2462 -0.6319 ]



For n=25 and N=15, Eave is found to be $9.9995\times10^{-5}$. The approximated sinc drawing is plotted in Appendix.VII. The final weight matrices obtained are:

W = [0.0001 0.0001 0.0001 0.0000 6.2288 0.0001 2.7104 -2.0495, 0.0003 0.0001 2.0363 0.0003 0.0000 3.0301 0.0006 0.0000, -0.0001 0.0001 0.0022 0.5608 0.0006 0.0008 0.0001 0.0000, 0.0003 -0.0587]


Wh = [  0.0000 0.0077
        0.0001 0.0125
        0.0000 0.0072
        0.0000 0.0013
       -2.1260 -0.1874
        0.0001 0.0140
        2.3886 4.3122
        4.0038 7.6581
        0.0002 0.0329
        0.0000 0.0086
       10.6243 2.6063
        0.0001 0.0290
        0.0000 0.0047
        4.5712 2.8857
        0.0003 0.0630
        0.0000 0.0033
       -0.0001 -0.0103
        0.0000 0.0064
        0.0016 0.2170
        6.3223 -10.7409
        0.0003 0.0631
        0.0005 0.0905
        0.0000 0.0081
        0.0000 0.0056
        0.0002 0.0400 ]



For n=25 and N=25, Eave is found to be $9.9991\times10^{-5}$. The approximated sinc drawing is plotted in Appendix.VIII. The final weight matrices obtained are:

W = [2.8590 -0.2645 -0.2652 -0.2680 3.4227 -0.2648 -0.2718, -2.2038 -0.2650 -0.2649 2.9006 0.9818 -1.7850 3.2988, -2.0260 -0.2665 1.9149 -0.2647 -0.2878 1.5601 -0.2646, -0.2681 -0.2646 -0.2653 -0.2678 0.4979]



Wh = [ -3.9853 3.0137
        0.2660 0.3280
        0.2618 0.2837
        0.2579 0.2234
       -7.7010 1.3578
        0.2692 0.3561
        0.2949 0.5256
        2.7731 3.6868
        0.2711 0.3714
        0.2625 0.2921
        4.3772 5.2806
        3.0823 0.2300
       -1.9336 0.5836
        4.7922 0.9061
        2.6896 3.4817
        0.2595 0.2511
        1.6655 -1.7334
        0.2639 0.3070
        0.3438 0.7706
        2.1159 4.9104
        0.2680 0.3459
        0.2831 0.4549
        0.2640 0.3088
        0.2614 0.2786
        0.2821 0.4483 ]






V. TEST PROCESS:


After having acquired the weight matrices for different numbers of hidden-layer neurons and different sampling frequencies, the obtained weights are tested with test points taken as the midpoints of the training points:

$x_{test}(i) = \dfrac{x(i) + x(i+1)}{2}, \quad i = 1, 2, \dots, 8N$

The output values are calculated as:

$O = \varphi\big(W\,[Oh;\,1]\big)$, with $Oh = \varphi(Wh\,X_{test})$


The test program used is as below:

N=input('N(sample freq.) : >>');
Ecum=0;
for i=1:1:8*N
    xtest(1,i)=(x(1,i)+x(1,i+1))/2;
    xtest(2,i)=1;
    dtest(1,i)=sinc(xtest(1,i));   %desired output vector
end
plot(xtest(1,:),dtest(1,:),'b:')
hold on
for i=1:1:8*N                      %for all the test patterns
    %1)find hidden layer v vector(vh):
    vh=Wh*xtest(:,i);
    %2)Pass vh through threshold to obtain hidden layer outputs(oh)
    oh=(1-(exp(-vh)))./(1+(exp(-vh)));   %oh is a vector of n outputs
    %3)extend oh to ohAug for output layer's inputs
    ohAug=[oh; 1];                 %this is a vector of n+1 having 1 as the last element
    %4)find output layer v vector(v):
    v=W*ohAug;
    %5)Pass v through threshold to obtain output layer output(o)
    o(i)=-(1-(exp(-v)))./(1+(exp(-v)));
    e=dtest(1,i)-o(i);
    Ecum=Ecum+(1/2)*e^2;
end
plot(xtest(1,:),o,'r-')
Eave=Ecum/((8*N))




The test program calculates the outputs for the given input test set. The calculated Eave values are tabulated below and the plots of the approximated sinc functions are in Appendices IX to XII.

Neural network    Eave
N=15 & n=15       3.0128e-004
N=25 & n=15       0.0032
N=15 & n=25       0.0010
N=25 & n=25       0.0010






VI. SUCCESS OF TRAINING ALGORITHM:


Although not measured in epochs, the computation of the weight matrices for number of hidden-layer neurons (n)=15 and sampling frequency (N)=15 took a longer time than the latter three computations, which used the improved algorithm. In terms of sampling frequencies and hidden-layer neurons, as shown in the table in Section IV, as the sampling frequency increases, the epochs required for convergence decrease, and as the number of hidden-layer neurons increases, the epochs required for convergence increase. The reasoning behind the first observation is that, as the number of samples increases, the computational cost of each epoch increases but the overall number of epochs decreases: more inputs mean they are closer in magnitude, and this helps the smoother movement of the gradients due to their high interdependence. The second observation may be explained simply by asserting that it is more difficult to coordinate many updates that affect each other's current status than few: as there are more neurons in the hidden layer, there are more weights to update and more interactions.



VII. SUCCESS OF TEST RESULTS:


There are two evident observations on the test results: one is that the average error for the first weight set is much lower than all the others; the second is that all the errors are higher than $10^{-4}$. The first point is really interesting, because it implies the pure backpropagation algorithm better approximates the function at intermediate points. It might be true that the more epochs the algorithm spends on the training traces, the better view it gets of the function; however, this is not easily justified by the backpropagation concept. The second observation points to an important and very probable misunderstanding. The Eave calculated in training is the average over the sum of errors computed with the intermediate weight values in effect as each training input is presented, whereas the Eave calculated in the test is the average error computed with the final weights. Therefore, the two values are expected to differ. The actual Eave is the one calculated on the test set, but as the backpropagation algorithm uses the other Eave value as the stopping criterion, there is nothing disputable in the propriety of the results.









In this mini project, the backpropagation algorithm is used to approximate the sinc function between -4 and 4. The backpropagation training algorithm is used first in its plain definition and the weight matrices are calculated. Then, the backpropagation algorithm is improved with additional functions and the weight matrices for three more different cases are computed.

When the pure training algorithm is run for a fixed input set, it is seen not to converge to a sufficiently small average error, reaching values only as low as 25/10000. However, when the algorithm is restarted several times with randomized initial weights, it is observed to converge below $10^{-4}$. It is observed that the average error sometimes begins to rise and continues its ascent for several epochs, due to one input causing the others' error contributions to increase along its local gradient.


The pure training algorithm is improved with several additions. The momentum method is applied in order to speed up the algorithm, and it helped, but the convergence is still seen to require a high amount of computation. An AIMD-like algorithm, applied for further speedup, is observed to get stuck in saddle points because of faster jumps in the w values: the v value is seen to reach very high (in magnitude) values, causing $\varphi'(v)$ to be very small (~0). In order to prevent the undesired ascent behaviour of Eave, a selective pattern shuffling scheme is tried and is seen to prevent undesired Eave movements. However, for small values of Eave the shuffling mechanism is seen to divert the algorithm from the minimum, and an experience-based limit of $2.5\times10^{-4}$ is introduced to block further shuffling. To avoid local minima, zero-mean white noise is added to the W and Wh values. The effect of the white noise is seen to be rather indeterministic, and it badly inhibited convergence for relatively large values; a peak of 0.001 is determined by experience. The distorting effect of the white noise is still observed for low error values, and therefore the white noise is removed for Eave values below $2.5\times10^{-4}$.


The improved algorithm is used for 3 separate n & N pairs, and it is seen that as the sampling frequency (N) increases, the epochs required for convergence decrease, and as the number of hidden-layer neurons (n) increases, the epochs required for convergence increase.

The computed weights are used to approximate the sinc values for the test set. The average error for the first weight set is, interestingly, seen to be much lower than all the others, hinting at the possibility that slower algorithms keep better track of the approximation at intermediate test points.




Canturk ISCI


05.08.2000