NEURAL NETWORK IMPROVEMENTS

Multi-Disciplinary Design and Optimization Tools
for Large Multihull Ships


Submitted to:

Office of Naval Research
875 North Randolph Street, Room 273
Arlington, VA 22203-1995

Dr. Paul Rispin, Program Manager
ONR Code 331
703.696.0339
rispinp@onr.navy.mil

In fulfillment of the requirements for:
Cooperative Agreement No. N00014-04-2-0003
Agile Port and High Speed Ship Technologies
FY06 Project 06-4

Classification: Unclassified


Prepared and submitted by:

Center for the Commercial Deployment of Transportation Technologies
California State University, Long Beach Foundation
6300 State University Drive, Suite 220 • Long Beach, CA 90815 • 562.985.7394


June 25, 2008

Multi-Disciplinary Design and Optimization Tools
for Large Multihull Ships

Improvements in Neural Networks and Optimization Process


Prepared for

Stanley Wheatley, Principal Investigator
Center for the Commercial Deployment of Transportation Technologies

California State University Long Beach Foundation
6300 State University Drive, Suite 332
Long Beach, CA 90815


CSULB MOU No: 07-328306
Fiscal Year: FY06
ONR Project No. 06-4; CCDoTT Program Element: 2.37
Task 4.5 - Improvements in Neural Networks and Optimization Process
Deliverable 4.5 – Improvements in Neural Networks and Optimization Process Report



Prepared by
Adeline Schmitz
Hamid Hefazi
Mechanical and Aerospace Engineering Department
California State University, Long Beach


This material is based upon work supported by the Office of Naval Research, under Cooperative Agreement No.
N00014-04-2-0003
with the California State University, Long Beach Foundation, Center for the Commercial
Deployment of Transportation Technologies (CCDoTT). Any opinions, findings and conclusions or recommendations
expressed in this material are those of the author(s) and do not necessarily reflect the views of the Center for the
Commercial Deployment of Transportation Technologies (CCDoTT) at California State University, Long Beach.
TABLE of CONTENTS

Nomenclature
    Subscripts
    Definitions
Introduction
Cascade Correlation algorithm
Proposed Improvements to Neural Network
    Cross-validation
    Weighted Averaging
Results
    Test Function
    Weighted Averaging
    Ten-fold Cross-validation on Original Training Sets
    Five-fold and Ten-fold Cross-validation on Data Sets composed of the Original Training Sets and Validation Set
Conclusions
References


Nomenclature
DV = design variable
f = exact function
f_NN = Neural Network approximation of the function f
f_ENSEMBLE NN = approximation of the function f by a Neural Network Ensemble
GS = generalization set
HU = hidden unit
MDO = multidisciplinary design optimization
n = number of inputs/design variables
NN = Neural Network
N_p = number of points in set
RSM = response surface method
TS = training set
VS = validation set

Subscripts

( )_p = value at a point of the training set

Definitions

$\langle A \rangle = \frac{1}{N_p} \sum_{p=1}^{N_p} A_p$ = average of A over set

$E(x) = f(x) - f_{NN}(x)$ = error at x

$E_M = \max_{1 \le p \le N_p} E_p = \max_{1 \le p \le N_p} E(x_p)$ = maximum error over set

$E^2 = \frac{1}{2 N_p} \sum_{p=1}^{N_p} E_p^2$ = modified squared error over set

$Std(E) = \sqrt{\frac{1}{N_p - 1} \sum_{p=1}^{N_p} \left( E_p - \langle A \rangle \right)^2}\Big|_{A=E}$ = standard deviation of error over set
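
For concreteness, a minimal numpy sketch of these error measures is given below. It assumes, as the report's tables suggest, that the averaged and maximized errors are taken in absolute value; this is an interpretation, not something stated in the definitions.

```python
import numpy as np

def error_stats(f_exact, f_nn):
    """Error measures from the Definitions above, over the N_p points of a set.
    Assumption: <E> and E_M are taken over absolute errors."""
    E = f_exact - f_nn                                   # E(x) = f(x) - f_NN(x)
    n_p = E.size
    return {
        "E2":     np.sum(E ** 2) / (2.0 * n_p),          # modified squared error
        "<E>":    np.mean(np.abs(E)),                    # average error over set
        "E_M":    np.max(np.abs(E)),                     # maximum error over set
        "Std(E)": np.sqrt(np.sum((E - E.mean()) ** 2) / (n_p - 1)),
    }
```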

Abstract

This report summarizes the results of Task 4.5, Improvements in Neural Networks and
Optimization Process, within the scope of the FY06 CCDoTT program. Building upon several
previous successful applications, our optimization process utilizes a neural network-based
response surface method for reducing the cost of computer-intensive optimizations for
applications in ship design. Complex or costly analyses are replaced by a neural network
which is used to instantaneously estimate the value of the function(s) of interest. The cost
of the optimization is thereby shifted to the generation of (smaller) data sets used for
training the network. The focus of this report is on the use and analysis of constructive
networks for treating problems with a large number of variables, say around 30. A
mathematical function is used to systematically evaluate the network performance over a
range of design spaces. A committee network using simple ensemble averaging and a single
training set for all members of the committee was developed and tested in the FY05 program.
That work showed that a significant improvement in network performance could be obtained
compared to training a series of networks and selecting only the best one. The FY06 program
investigated committee networks further; this includes using a weighted average instead of
a simple average, and applying a cross-validation re-sampling technique to different
training set sizes and design space sizes. Results are compared to the basic ensemble
method, which uses a single training set for all members.

Introduction

This report describes the improvements to the constructive NN algorithm developed for
regression tasks in large-dimensional spaces. It presents an alternative NN structure
based on a constructive NN topology and a corresponding training algorithm suitable for
large numbers of inputs/outputs, to address problems where the number of design
parameters is fairly large, say up to 30 or more. This work is performed as a continuation
of a similar task in the FY05 CCDoTT program (Schmitz, 2006).

The constructive algorithm is based on cascade correlation. This supervised learning
algorithm was first introduced by Fahlman and Lebiere (1990). Instead of just adjusting
the weights in a network of fixed topology, Cascade-Correlation begins with a minimal
network, then automatically trains and adds new hidden units one by one in a cascading
manner. This architecture has several advantages over other algorithms: it learns very
quickly; the network determines its own size and topology; it retains the structure it has
built even if the training set changes; and it requires no back-propagation of error signals
through the connections of the network. In addition, for a large number of inputs (design
variables), the most widely used learning algorithm, back-propagation, is known to be
very slow. Cascade-Correlation does not exhibit this limitation (Fahlman, 1990).

The cascade correlation algorithm has been substantially modified during the previous
studies in order to make it a robust and accurate method for function approximation
(Schmitz et al. 2002; Schmitz, 2006). Following a brief review of the method, this report
describes two additional enhancements made to the algorithm during the FY06 program.

Cascade Correlation algorithm

The training algorithm involves the steps described below and leads to a network with
the topology of Fig. 1.
Start with the required input and output units; both layers are fully connected. The
number of inputs (design variables) and outputs (objective function, constraints) is
dictated by the problem.
1. Train all connections ending at an output unit with a typical learning algorithm until
the squared error, $E^2$, of the NN no longer decreases:

$$E^2 = \frac{1}{2 N_p} \sum_{p=1}^{N_p} E_p^2$$

where $N_p$ is the number of points in the training set and $E_p$ is the error at each point
of the training set.

2. Generate a large pool of candidate units that receive trainable input connections
from all of the network's external inputs and from all pre-existing hidden units. The
output of each candidate unit is not yet connected to the active network (output).
Each candidate unit has a different set of random initial weights. All receive the same
input signals and see the same residual error for each training pattern. Among the
candidate hidden units in the pool, select the one which has the largest correlation,
$C$, defined by

$$C = \sum_{p=1}^{N_p} \left( z_{0,p} - \langle z_0 \rangle \right) \left( E_p - \langle E \rangle \right) \qquad (1)$$

where $z_{0,p}$ is the output of the candidate hidden unit and $E_p$ is the error at each point
of the training set, calculated at Step 1. The weights of the selected hidden unit are
then modified by maximizing the correlation $C$ with an ordinary learning algorithm.
Only the candidate with the best correlation score is installed. The other units in the
pool are discarded.

3. The best unit is then connected to the outputs and its input weights are frozen. The
candidate unit now acts as an additional input unit. Training of the input-output
connections by minimizing the squared error $E^2$ as defined in Step 1 is then
repeated. The use of several candidates greatly reduces the chance that a useless
unit will be permanently installed because an individual candidate was trapped in a
local maximum during training.

4. Repeat until the stopping criterion is met. Instead of stopping the training based
on the error measured on the Training Set (TS), the stopping criterion makes use of
a Validation Set (VS), which is much smaller than the Training Set and distributed
throughout the design space. Such an approach is used because, although the error on
the TS decreases as hidden units are added (the network grows), the error at points
elsewhere in the design space (such as those of the VS) may increase slightly. This
phenomenon is known as overfitting of the network, and employing a VS circumvents
this issue (Prechelt, 1998). This important phenomenon is discussed below in more
detail in the section illustrating the approach for a mathematical function.
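
To make Steps 1-4 concrete, the following is a compact Python sketch of a cascade-correlation regressor. It is a simplified stand-in, not the report's implementation: output weights are fit by linear least squares rather than iterative training, and the candidate pool is generated randomly with the best-correlated unit kept, instead of training the pool by gradient ascent on C. The function and parameter names are illustrative only.

```python
import numpy as np

def train_cascade(X, t, X_val, t_val, n_candidates=8, max_hidden=30, seed=0):
    """Schematic cascade-correlation regressor for a single output."""
    rng = np.random.default_rng(seed)

    def with_bias(Z):
        return np.c_[Z, np.ones(len(Z))]

    def fit_out(Z, y):                       # Steps 1 and 3: train output connections
        w, *_ = np.linalg.lstsq(with_bias(Z), y, rcond=None)
        return w

    Z, Zv = X.copy(), X_val.copy()           # activations feeding the output unit
    hidden = []                              # frozen input weights of hidden units
    w = fit_out(Z, t)
    best_mse = np.mean((with_bias(Zv) @ w - t_val) ** 2)
    best = (w, 0)

    for _ in range(max_hidden):
        E = t - with_bias(Z) @ w             # residual error at each training point
        # Step 2: candidate pool, each unit seeing all inputs and prior hidden units
        V = rng.normal(size=(n_candidates, Z.shape[1] + 1))
        outs = np.tanh(with_bias(Z) @ V.T)   # candidate outputs, one column each
        C = np.abs((outs - outs.mean(0)).T @ (E - E.mean()))   # correlation, Eq. (1)
        v = V[int(np.argmax(C))]             # winner; the rest are discarded
        # Step 3: freeze the winner, cascade it in as a new input, retrain outputs
        hidden.append(v)
        Z = np.c_[Z, np.tanh(with_bias(Z) @ v)]
        Zv = np.c_[Zv, np.tanh(with_bias(Zv) @ v)]
        w = fit_out(Z, t)
        # Step 4: keep the network size that minimizes the validation-set error
        mse = np.mean((with_bias(Zv) @ w - t_val) ** 2)
        if mse < best_mse:
            best_mse, best = mse, (w, len(hidden))

    w, n_h = best
    return hidden[:n_h], w

def predict_net(X, hidden, w):
    """Rebuild the cascaded activations for new inputs and apply output weights."""
    Z = X.copy()
    for v in hidden:
        Z = np.c_[Z, np.tanh(np.c_[Z, np.ones(len(Z))] @ v)]
    return np.c_[Z, np.ones(len(Z))] @ w
```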


[Figure: inputs and a bias unit feed cascaded hidden units HU 1 through HU h; their outputs, together with the inputs, connect to the output units.]

Fig. 1. NN Topology with Cascade Correlation Algorithm
Proposed Improvements to Neural Network
Cross-validation

The critical issue in developing a neural network is generalization: how well will the
network make predictions for cases that are not in the training set? Neural networks, like
other nonlinear estimation methods such as kernel regression and smoothing splines,
can suffer from either underfitting or overfitting. In last year's report (Schmitz, 2006),
ensemble-averaged committee networks showed a significant improvement over single
networks: the ensemble averaging method led to a 29.5% improvement in the squared
error on a large unseen dataset (GS) compared to the method of choosing only the
network with the smallest error on the validation set.

Indeed, it is common practice in the application of neural networks to train many different
candidate networks, select the best on the basis of performance on an independent
validation set, keep only this network, and discard the others. There are two disadvantages
to this approach. First, the effort involved in training the remaining networks is wasted.
Second, the generalization performance on the validation set has a random component due
to noise on the data, and so the network which had the best performance on the validation
set might not be the one with the best performance on new test data. These drawbacks can
be overcome by combining the networks into an ensemble. The NNs in the ensemble are
essentially trained on the same input data and then the outputs of the NNs are combined.
The basic idea underlying the ensemble network is to find ways of exploiting the
information contained in these redundant NNs.

However, there is clearly no advantage in combining a set of NNs which generalize in
exactly the same way. There are many methods to create NNs which generalize
differently. Because of the stochastic nature of building NNs with the cascade correlation
algorithm, two NNs trained on the same data exhibit different numbers of hidden units
and/or different weights, and thus generalize differently; that is, their output values
will differ on an unseen dataset. This makes cascade correlation an excellent candidate
for ensemble NNs.

Besides this, there are a number of parameters which can be varied to promote diversity
in the NNs trained: varying the set of initial random weights, varying the topology of the
NN, and varying the data. CC already includes different initial weights and a varying
topology, so this research focuses on altering the training data. There are several
methods to vary the training data from one NN to the next: sampling the data, using
disjoint training sets, or boosting and adaptive re-sampling techniques (Sharkey, 1996).

1/ Sampling data: This method consists of using a sampling technique so that each NN
is trained on a different subsample of the training data. Of the statistical sampling
methods, cross-validation (Krogh and Vedelsby, 1995) and bootstrapping (Efron, 1983)
have been shown to give good results.
K-fold cross-validation consists of randomly partitioning the training data into k
subsamples. Of the k subsamples, a single subsample is retained as the validation set for
testing the model and the remaining k-1 subsamples are used as training data. The
cross-validation process is repeated k times, with each of the k subsamples used exactly
once for validation; a code sketch of this splitting is given at the end of this section.
Ten-fold and five-fold cross-validation appear to work best (Sarle, 2006).
Bootstrapping, on the other hand, consists of sampling the dataset with replacement to
form a training set and using the data not sampled as a validation set. In particular, there
is a method called the .632 bootstrap, where a dataset of n instances is sampled n times
with replacement to give the training set. Since some elements in the training set are
repeated, there must be some data that has not been used, and this can be used for the
validation set. For a reasonably large dataset, it can be shown that the training set will
contain about 63.2% of the original dataset and the validation set about 36.8%, hence the
name .632 bootstrap.

2/ Disjoint training sets: This is a method similar to the above, but the subsamples do not
share common points; there is thus no overlap between the data used to train each NN.
For regression, this method requires a larger training data sample than the previous
method, so that each NN is trained on a dataset large enough to "represent" the
complexity of the function.

3/ Boosting and adaptive re-sampling: The idea here is to train each new NN with data
that has been filtered by the previously trained members of the ensemble. Each data
sample is given a different chance (probability) to appear in a new training set by
prioritizing data poorly learnt by the previous members of the ensemble. However, this
method usually requires a large amount of data and might not be applicable to our
problem. It is also more computationally intensive than simple bootstrapping, as it
requires calculating the ensemble performance and the probability for each data point to
appear in the new training set after each network is added to the ensemble.

4/ Preprocessing: There exist many ways to vary the data each NN sees, for example,
using different nonlinear transformations or injecting artificial noise into the data.
Choosing a transformation that would improve the ensemble is probably case-dependent,
so this approach is not pursued here, since no knowledge of the function is available
a priori and the method is developed for general regression.

Of these methods, cross-validation and bootstrapping seem the most promising for our
problem since, in general, the amount of data available to train the NNs is relatively small
compared to the number of inputs. Cross-validation is investigated first in this report;
bootstrapping will be studied in further research.
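
As a concrete illustration of the k-fold splitting described under method 1/, here is a minimal Python sketch; the fold count and seed below are arbitrary illustrative choices, not values from the report.

```python
import numpy as np

def kfold_splits(n_points, k, seed=0):
    """Randomly partition point indices into k folds and yield
    (train_idx, val_idx) pairs; each fold serves exactly once as the VS."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n_points), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

# A 500-point TS under ten-fold cross-validation gives ten (450-point training,
# 50-point validation) index pairs, one pair per committee member.
for train_idx, val_idx in kfold_splits(500, 10):
    print(len(train_idx), len(val_idx))
```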
Weighted Averaging

Once a set of networks has been created, one must find an effective way of combining
them. Last year's study has already shown that taking all networks built and performing
simple averaging on the ensemble significantly improves the results.
The function constructed by simple averaging (SAvg) can be written as:

$$f_{SAvg}(x) = \frac{1}{N} \sum_{i=1}^{N} f_{NN_i}(x)$$

where $f_{NN_i}$ is the NN approximation of the function $f$ by the i-th ensemble member.

However, could there be a better way to combine those networks? If the networks of the
ensemble were uncorrelated, then the mean squared error of a simple average
ensemble could be reduced by a factor of N (the number of NNs in the ensemble)
compared to the average mean squared error of the individual networks (Perrone, 1993).
The problem is that, in general, the individual networks are correlated, and thus the
reduction in error is significantly less than a factor of N. One idea is to use the information
given by the error on the validation set. This set is not seen during training and as such
gives a measure of how each NN performs on an unseen dataset. The idea of Perrone
(1993) is to minimize the mean squared error of the ensemble using the VS. He defines
a generalized ensemble (weighted average, or WAvg) as:

$$f_{WAvg}(x) = \sum_{i=1}^{N} \alpha_i f_{NN_i}(x)$$

where the $\alpha_i$'s are real and satisfy the constraint $\sum_i \alpha_i = 1$. To minimize the mean
squared error of the ensemble, he shows that the $\alpha_i$'s must be chosen as follows:

$$\alpha_i = \frac{\sum_j C_{ij}^{-1}}{\sum_k \sum_j C_{kj}^{-1}}$$

where

$$C_{ij} = \frac{1}{N_p} \sum_{p=1}^{N_p} \left( f(x_p) - f_{NN_i}(x_p) \right) \left( f(x_p) - f_{NN_j}(x_p) \right)$$

is the ij-th element of the symmetric correlation matrix and $N_p$ is the number of points in
the VS. This method should exceed the simple average if the individual networks are mostly
uncorrelated.
This report compares simple and weighted averaging on a mathematical function, as
described in the next section.
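
A minimal numpy sketch of this weighting is shown below, assuming each network's predictions on the VS are available as rows of an array. The small ridge term is an added safeguard against a near-singular correlation matrix when the networks are strongly correlated; it is not part of the original formulation.

```python
import numpy as np

def gem_weights(preds_vs, f_vs, ridge=1e-8):
    """Perrone-style generalized-ensemble weights alpha_i from VS errors.

    preds_vs: (N_nets, N_p) array of f_NN_i(x_p) on the validation set
    f_vs:     (N_p,) exact values f(x_p) on the validation set
    """
    err = preds_vs - f_vs                       # misfit of each network at each point
    C = err @ err.T / f_vs.size                 # correlation matrix C_ij defined above
    C_inv = np.linalg.inv(C + ridge * np.eye(len(C)))
    alpha = C_inv.sum(axis=1) / C_inv.sum()     # alpha_i = sum_j C^-1_ij / sum_kj C^-1_kj
    return alpha                                # the alphas sum to 1 by construction

# Usage: f_WAvg(x) = sum_i alpha_i * f_NN_i(x), e.g.
# alpha = gem_weights(preds_vs, f_vs); f_wavg = alpha @ preds_at_new_points
```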

Results

In this section, the capabilities of the enhanced network are analyzed by testing it on a
mathematical function. This test function and its attributes are described in detail in
Hefazi (2006) and Besnard (2007). Results are then presented and analyzed as the
design space dimension is increased from 5 to 30, in order to gain insight into the
behavior of the NN characteristics as the number of design variables increases. It must
be pointed out that the errors are characterized not only on the training set, as is
customary, but also at points not used in the training.
Test Function
The following mathematical test function is selected because it has many minima and
maxima and is easily extended to higher-dimensional spaces. The function has been
modified such that, as the size of the design space, n, increases, the magnitude of the
function remains between 0 and 1.

It is defined over the n-dimensional compact set $[-\pi, \pi]^n$ by

$$f(x) = \frac{1}{n^{3/4}} \sum_{i=1}^{n} \sin\!\left( \frac{x_i^2 + y_i^2}{4} \right) \left( A_i - B_i \right) \qquad (2)$$

where the scalars $A_i$ and $B_i$ are defined by

$$A_i = \frac{1}{n \left( \|\mathbf{a}\| + \|\mathbf{b}\| \right)} \sum_{j=1}^{n} \left( a_{i,j} \sin(x_j) + b_{i,j} \cos(x_j) \right) \qquad (3)$$

and

$$B_i = \frac{1}{n \left( \|\mathbf{a}\| + \|\mathbf{b}\| \right)} \sum_{j=1}^{n} \left( a_{i,j} \sin(y_j) + b_{i,j} \cos(y_j) \right) \qquad (4)$$

using the 2-norms of the matrices $\mathbf{a}$ and $\mathbf{b}$ given by

$$\mathbf{a} = \left[ a_{i,j} \right]_{1 \le i,j \le n} = \left[ \frac{j + (i-1)n}{2} \right] = \begin{bmatrix} 0.5 & 1 & 1.5 & \cdots \\ \frac{1+n}{2} & \frac{2+n}{2} & \frac{3+n}{2} & \cdots \\ \frac{1+2n}{2} & \frac{2+2n}{2} & \frac{3+2n}{2} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix} \qquad (5)$$

and

$$\mathbf{b} = \left[ b_{i,j} \right]_{1 \le i,j \le n} = \left[ \frac{j + (i-1)n - 5}{2} \right] = \begin{bmatrix} -2 & -1.5 & -1 & \cdots \\ \frac{-4+n}{2} & \frac{-3+n}{2} & \frac{-2+n}{2} & \cdots \\ \frac{-4+2n}{2} & \frac{-3+2n}{2} & \frac{-2+2n}{2} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix} \qquad (6)$$
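
The sketch below evaluates Eqs. (2)-(6) in numpy. One caveat: this excerpt does not restate how the companion vector y in Eqs. (2) and (4) is derived from the design variables (the full definition is in Hefazi 2006 and Besnard 2007), so the sketch simply takes x and y as two explicit n-vectors supplied by the caller.

```python
import numpy as np

def build_ab(n):
    # a[i][j] = (j + (i-1)n)/2 and b[i][j] = (j + (i-1)n - 5)/2, per Eqs. (5)-(6),
    # written here with 0-based indices i, j.
    i, j = np.indices((n, n))
    a = (j + 1 + i * n) / 2.0
    b = (j + 1 + i * n - 5) / 2.0
    return a, b

def test_function(x, y):
    """Evaluate Eq. (2) for n-vectors x and y in [-pi, pi]^n."""
    n = x.size
    a, b = build_ab(n)
    scale = n * (np.linalg.norm(a, 2) + np.linalg.norm(b, 2))   # matrix 2-norms
    A = (a @ np.sin(x) + b @ np.cos(x)) / scale                 # Eq. (3)
    B = (a @ np.sin(y) + b @ np.cos(y)) / scale                 # Eq. (4)
    return np.sum(np.sin((x ** 2 + y ** 2) / 4.0) * (A - B)) / n ** 0.75  # Eq. (2)

# Example evaluation at a random point of a 5-dimensional design space:
rng = np.random.default_rng(1)
x, y = rng.uniform(-np.pi, np.pi, size=(2, 5))
print(test_function(x, y))
```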

Weighted Averaging

In this section, network results using simple ensemble averaging are compared to
weighted ensemble averaging (using the $\alpha_i$'s defined by Perrone, 1993) by training NNs
for varying numbers of inputs (n) and various training set (TS) sizes, using the
mathematical function discussed in the previous section. Last year's study included
the cases n = 5, 10, 15 and 30 for TS sizes of 500, 750 and 1,500 points. For the 30-
dimension case, a TS of 5,000 points is also considered. All networks are built with a
validation set (VS) of 200 points and a generalization set (GS) of 15,000 points used to
evaluate the generalization error. All TS, VS and GS are generated using a Latin
Hypercube. Ten networks are built for each case and averaged as one ensemble. For each
network, the number of hidden units is determined by the minimum squared error on the
validation set.

Table 1 presents the squared error, average error and standard deviation on the TS, VS
and GS for all networks built, using simple ensemble averaging. These results were
obtained in FY05 (Schmitz, 2006) and showed an overall 29.5% improvement in the
squared error, 17.4% improvement in the average error and 14.9% improvement in
the standard deviation on the GS. Results were similar on the VS and about twice as
good on the TS. The table is reproduced here to give the reader a general idea of the
order of magnitude of the errors.

Table 1: Squared Error (E2), Average Error (<E>) and Standard Deviation (Std(E)) for the
simple-average ensemble NN (results from FY05)

                        Training Set               Validation Set             Generalization Set
Simple Average          E2      <E>     Std(E)     E2      <E>     Std(E)     E2      <E>     Std(E)
5 inputs,  TS=500       0.0003  0.0193  0.0167     0.0018  0.0408  0.0433     0.0017  0.0423  0.0396
5 inputs,  TS=750       0.0003  0.0184  0.0150     0.0010  0.0311  0.0320     0.0008  0.0303  0.0271
5 inputs,  TS=1500      0.0002  0.0148  0.0135     0.0006  0.0249  0.0234     0.0005  0.0232  0.0214
10 inputs, TS=500       0.0013  0.0413  0.0307     0.0027  0.0563  0.0465     0.0033  0.0643  0.0490
10 inputs, TS=750       0.0004  0.0223  0.0171     0.0020  0.0496  0.0382     0.0022  0.0522  0.0419
10 inputs, TS=1500      0.0006  0.0271  0.0207     0.0013  0.0390  0.0323     0.0016  0.0445  0.0362
15 inputs, TS=500       0.0014  0.0414  0.0321     0.0028  0.0588  0.0464     0.0031  0.0629  0.0471
15 inputs, TS=750       0.0009  0.0342  0.0254     0.0022  0.0519  0.0419     0.0026  0.0577  0.0434
15 inputs, TS=1500      0.0004  0.0222  0.0166     0.0016  0.0433  0.0357     0.0017  0.0469  0.0360
30 inputs, TS=500       0.0015  0.0440  0.0327     0.0020  0.0498  0.0394     0.0022  0.0526  0.0395
30 inputs, TS=750       0.0016  0.0448  0.0339     0.0020  0.0494  0.0385     0.0022  0.0528  0.0393
30 inputs, TS=1500      0.0015  0.0442  0.0327     0.0018  0.0477  0.0373     0.0020  0.0510  0.0384
30 inputs, TS=5000      0.0005  0.0248  0.0187     0.0012  0.0385  0.0293     0.0013  0.0406  0.0310

Table 2 presents the percentage improvement found in the squared error (E2), average
error (<E>) and standard deviation (Std(E)) by using weighted ensemble averaging
instead of simple averaging. Results are shown for the training set, the validation set and
the generalization set, with the number of inputs varying from 5 to 30 and for the different
training set sizes. A positive value means that the weighted average ensemble performed
better; a negative value means that the simple average ensemble performed better. As
expected, the results show that the weighted average ensemble always improves the error
on the VS. Indeed, the weight factors $\alpha_i$ are determined precisely to minimize the mean
squared error on the VS. On the TS, errors are reduced except for the cases n=5 (all TS
sizes) and n=15, TS=1500. On the GS, the most important of the sets, the errors are
almost always improved; the exceptions are n=10, TS=1500; n=15, TS=1500; and n=30,
TS=750, where the simple average performed slightly better. The worst result is the case
n=15, TS=1500, where E2 on the GS is increased by 6.6% by using the weighted
ensemble instead of the simple ensemble.

Averaging the results over all cases (all n and all TS), E2 for the weighted average
ensemble was reduced by 16% on the TS, 13.4% on the VS and 4.2% on the GS
compared to the simple average ensemble. The average error and standard deviation
were also reduced on the TS, VS and GS with the weighted average. This shows that
the weighted ensemble performs better on average than the simple average ensemble.

Table 2: Percentage improvement in squared error (E2), average error (<E>) and standard
deviation (Std(E)) by using weighted ensemble averaging instead of simple averaging

Weighted Average        Training Set               Validation Set             Generalization Set
(% improvement)         E2      <E>     Std(E)     E2      <E>     Std(E)     E2      <E>     Std(E)
n=5, TS=500             -30.98  -17.29  -10.54     24.60   9.15    16.91      10.81   2.31    9.42
n=5, TS=750             -10.09  -6.83   -2.00      23.71   11.04   14.22      13.09   4.32    9.92
n=5, TS=1500            -10.91  -8.03   -1.94      24.48   11.68   14.74      9.27    3.08    6.74
n=10, TS=500            65.39   42.24   39.27      21.87   11.89   11.19      19.31   11.94   7.21
n=10, TS=750            18.42   9.79    9.48       7.63    4.58    2.73       0.17    0.61    -0.72
n=10, TS=1500           5.38    2.71    2.76       3.71    2.33    1.21       -0.26   0.00    -0.33
n=15, TS=500            49.51   30.85   25.86      14.15   9.07    4.61       0.84    0.74    -0.15
n=15, TS=750            54.52   33.54   30.82      15.26   6.66    9.96       8.95    5.22    3.45
n=15, TS=1500           -14.40  -6.81   -7.23      8.78    3.50    5.98       -6.58   -3.15   -3.39
n=30, TS=500            21.48   11.85   10.53      0.87    0.63    0.12       9.38    4.94    4.57
n=30, TS=750            2.51    1.15    1.47       2.38    1.78    0.24       -1.85   -0.73   -1.26
n=30, TS=1500           34.05   19.32   17.83      12.94   6.21    7.49       0.59    0.33    0.23
n=30, TS=5000           47.15   27.47   27.00      9.03    4.82    4.28       4.89    2.65    2.17
Average improvement
(%) over all n and TS   16.09   9.82    10.12      13.37   6.50    7.51       4.24    1.95    2.41

Ten-fold Cross-validation on Original Training Sets

This section investigates the use of cross-validation in committee networks. In order to
evaluate the variations in NN performance for varying numbers of inputs, the dimension
of the search space, n, is again varied from 5 to 30 and the training is performed for
various sizes of training set (TS). Last year's study included the cases n = 5, 10, 15 and
30 for TS sizes of 500, 750 and 1,500 points. All TS are generated using a Latin
Hypercube. For the 30-dimension case, a TS of 5,000 points is also considered.

The first study takes the existing TS of 500, 750, 1,500 and 5,000 points (referred to as
the "original training sets") and splits them using a 10-fold cross-validation. For example,
a TS of 500 points is split randomly into ten subsets. The first NN of the ensemble is
trained using the first subset for validation (50 points) and the other nine subsets (450
points) for training. The second NN uses subset number 2 for validation and the other
nine subsets for training, and so on, so that each network uses a different training and
validation set. The ten networks built are then combined to create a committee network
using the simple average and weighted average methods described previously. However,
the correlation matrix cannot be calculated on the VS, since it is different for each network
of the ensemble. Instead, the correlation matrix is calculated on the original training set,
which comprises the TS and VS of each NN of the ensemble and is thus the same for the
ten NNs constructed.
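
Putting the earlier sketches together, the committee described in this paragraph could be assembled as follows. This reuses the hypothetical train_cascade, predict_net, kfold_splits and gem_weights helpers sketched above; it is an illustration of the protocol, not the report's code.

```python
import numpy as np

def crossval_committee(X, f, k=10):
    """k-member committee from one k-fold split of the original training set.
    X, f are the original training-set inputs and exact function values."""
    members = []
    for tr, va in kfold_splits(len(X), k):
        hidden, w = train_cascade(X[tr], f[tr], X[va], f[va])   # fold va acts as VS
        members.append((hidden, w))
    # Per the text above, the correlation matrix (and hence the GEM weights) is
    # computed on the full original training set, which is common to all members.
    preds = np.array([predict_net(X, h, w) for h, w in members])
    alpha = gem_weights(preds, f)

    def simple_avg(X_new):
        return np.mean([predict_net(X_new, h, w) for h, w in members], axis=0)

    def weighted_avg(X_new):
        return alpha @ np.array([predict_net(X_new, h, w) for h, w in members])

    return simple_avg, weighted_avg
```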

The results are compared with the ensemble NN created without cross-validation, that is,
using the same TS (500, 750, 1,500 or 5,000 points) for all members of the ensemble NN
and a VS of 200 points created with a Latin Hypercube distribution. A generalization set
is used to compare the NNs with and without cross-validation over the design space at
points not used during the training. The GS contains 15,000 points (also generated by
Latin Hypercube). Table 3 shows the percentage improvement in the squared error (E2),
the average error (<E>), and the standard deviation (Std(E)) on the GS obtained by using
cross-validation. A negative value means that the ensemble NN created with a single TS
for all members of the ensemble performed better than the cross-validated case. Results
are presented for the ensemble network with simple averaging (Simple Average) and the
ensemble network with weighted averaging (Weighted Average), as described previously.
The results do not present a clear general trend as to whether cross-validating the data is
better; it may even appear slightly worse, especially for the ensemble using the weighted
average.

There are two possible explanations for these results. First, it must be pointed out that
the NNs generated without cross-validation make use of 200 additional points (the
"original validation set") to determine when to stop training, whereas for the cases
with cross-validation this set is not used, as the validation set is part of the original
training set. Second, especially for the small training sets, the validation set is relatively
small (50 and 75 points for TS = 500 and 750). The VS is used to decide when to stop
training, and with a very small set the NNs might stop too early (leading to underfitting)
or too late (leading to overfitting).

Two solutions are proposed. The first is to incorporate the additional information
contained in those two hundred points into the NN training by simply concatenating the
original TS and the original VS to create a larger dataset, and then performing
cross-validation on this dataset, so that the information contained in these additional 200
points is used during training and/or validation. The second solution is to increase the
size of the validation set by performing a five-fold cross-validation on the data instead of
a ten-fold. This doubles the size of the VS, but it creates only five separate TS and VS. To
generate an ensemble of ten networks, the five-fold cross-validation is performed twice on
the same data. Since the process of splitting the data into five subsets is performed
randomly, ten different TS and VS are obtained. The following section presents these
results.
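
For the second solution, two independent five-fold passes yield the ten distinct TS/VS pairs. A tiny sketch reusing the kfold_splits helper above, where the 700-point dataset size corresponds to a 500-point TS concatenated with the 200-point VS:

```python
# Two 5-fold passes with different random seeds give ten distinct (TS, VS)
# index pairs over the combined dataset, one per committee member.
pairs = [split for seed in (0, 1) for split in kfold_splits(700, 5, seed=seed)]
assert len(pairs) == 10
```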
Table 3: Percentage improvement in generalization errors by using ten-fold cross-validation
on the original training sets, compared with using a single TS and a separate VS of 200 points

10-Fold CrossV on original TS        % Improvement on GS
Case                Ensemble NN          E2       <E>      Std(E)
n=5, TS=500         Simple Average       8.07     3.73     4.57
                    Weighted Average     -10.23   -3.42    -7.05
n=5, TS=750         Simple Average       -5.61    -1.30    -4.56
                    Weighted Average     -48.73   -19.57   -25.22
n=5, TS=1500        Simple Average       19.19    9.04     11.37
                    Weighted Average     15.11    7.71     8.06
n=10, TS=500        Simple Average       11.07    5.82     5.50
                    Weighted Average     -2.83    -1.47    -1.30
n=10, TS=750        Simple Average       -9.64    -5.68    -3.18
                    Weighted Average     -2.29    -0.71    -1.79
n=10, TS=1500       Simple Average       -2.85    -1.76    -0.88
                    Weighted Average     -8.35    -3.99    -4.24
n=15, TS=500        Simple Average       0.40     0.32     0.00
                    Weighted Average     -13.83   -6.05    -7.80
n=15, TS=750        Simple Average       -3.78    -1.59    -2.37
                    Weighted Average     -32.34   -14.40   -16.12
n=15, TS=1500       Simple Average       -3.52    -1.47    -2.20
                    Weighted Average     -15.98   -7.44    -8.12
n=30, TS=500        Simple Average       7.13     3.65     3.59
                    Weighted Average     -2.21    -0.53    -2.10
n=30, TS=750        Simple Average       3.83     2.04     1.76
                    Weighted Average     -26.15   -11.93   -13.00
n=30, TS=1500       Simple Average       1.07     0.68     0.28
                    Weighted Average     -8.93    -3.61    -5.69
n=30, TS=5000       Simple Average       -8.44    -3.98    -4.39
                    Weighted Average     -1.77    -0.63    -1.31
Average % improvement
for all n and TS    Simple Average       1.30     0.73     0.73
                    Weighted Average     -12.19   -5.08    -6.59

Five-fold and Ten-fold Cross-validation on Data Sets composed of the Original
Training Sets and Validation Set

In light of the previous section's inconclusive results, it was decided to use the additional
information contained in the validation set when training the NNs with cross-validation.
Last year's method (without cross-validation) uses a fixed training set (500, 750, 1,500 or
5,000 points) to train all members of the ensemble and a separate 200-point VS to
decide when to stop training. Here, the original TS (500, 750, 1,500 and 5,000 points) are
concatenated with the VS (200 points) to form data sets (DS) of 700, 950, 1,750 and
5,200 points respectively, and a cross-validation method is applied to create ten distinct
TS and VS, one for each member of the ensemble.
Table 4 presents the squared error (E2) on the GS and the relative improvement
obtained using three methods. Method 1 corresponds to training without cross-validation,
using a separate VS of 200 points. Method 2 corresponds to training with ten-fold
cross-validation on the original training set to which the 200-point original validation set
was added. Method 3 again uses the original TS with the VS added, but performs
five-fold cross-validation twice. As before, improvements are presented for the ensemble
NNs with simple and weighted averaging. Results show a considerable improvement for
the smaller numbers of inputs (n) and for the relatively smaller training sets (TS = 500
and 750). Ten-fold cross-validation works better on average than five-fold
cross-validation, so it appears more advantageous to have a larger percentage of the data
for training each individual NN: 90% for the ten-fold compared to 80% for the five-fold.
Both cross-validation methods perform better on average than the method without
cross-validation. Method 2 shows an average improvement of 9.3% using the simple
average and 5.3% using the weighted average when compared to method 1 (without
cross-validation). Method 3 shows an average improvement of 4.2% using the simple
average and 3.9% using the weighted average.

Table 4: Comparison of squared errors (E2) on the generalization set using five-fold and
ten-fold cross-validation on a dataset combining the original training and validation sets

                                      Method 1      Method 2      Method 3      % Improvement on E2
                                      No CrossV,    10-Fold CrossV 2x5-Fold CrossV
Case             Ensemble NN          VS=200        TS+VS         TS+VS         2 vs 1    3 vs 1
n=5, TS=500      Simple Average       1.68E-03      1.16E-03      1.12E-03      30.7      33.2
                 Weighted Average     1.50E-03      9.71E-04      1.02E-03      35.1      31.9
n=5, TS=750      Simple Average       8.26E-04      6.13E-04      8.78E-04      25.7      -6.3
                 Weighted Average     7.18E-04      5.72E-04      8.32E-04      20.4      -15.9
n=5, TS=1500     Simple Average       4.99E-04      4.78E-04      4.76E-04      4.2       4.5
                 Weighted Average     4.53E-04      4.45E-04      4.14E-04      1.7       8.6
n=10, TS=500     Simple Average       3.27E-03      2.47E-03      2.54E-03      24.6      22.3
                 Weighted Average     2.64E-03      2.52E-03      2.53E-03      4.5       4.2
n=10, TS=750     Simple Average       2.24E-03      2.14E-03      2.30E-03      4.4       -2.5
                 Weighted Average     2.24E-03      2.07E-03      2.39E-03      7.4       -6.8
n=10, TS=1500    Simple Average       1.64E-03      1.66E-03      1.71E-03      -1.1      -4.0
                 Weighted Average     1.65E-03      1.69E-03      1.73E-03      -2.7      -5.2
n=15, TS=500     Simple Average       3.08E-03      2.77E-03      2.67E-03      10.2      13.6
                 Weighted Average     3.06E-03      2.64E-03      2.60E-03      13.6      15.0
n=15, TS=750     Simple Average       2.61E-03      2.43E-03      2.48E-03      7.0       5.0
                 Weighted Average     2.38E-03      2.27E-03      2.19E-03      4.4       7.9
n=15, TS=1500    Simple Average       1.75E-03      1.74E-03      1.88E-03      0.2       -7.4
                 Weighted Average     1.86E-03      1.77E-03      1.60E-03      4.9       14.0
n=30, TS=500     Simple Average       2.16E-03      2.09E-03      2.15E-03      3.2       0.4
                 Weighted Average     2.25E-03      2.34E-03      2.24E-03      -4.0      0.4
n=30, TS=750     Simple Average       2.16E-03      2.08E-03      2.13E-03      4.0       1.7
                 Weighted Average     2.20E-03      2.27E-03      2.34E-03      -3.1      -6.4
n=30, TS=1500    Simple Average       2.04E-03      2.05E-03      2.05E-03      -0.8      -0.9
                 Weighted Average     2.02E-03      2.29E-03      2.14E-03      -13.4     -5.8
n=30, TS=5000    Simple Average       1.30E-03      1.19E-03      1.36E-03      8.4       -4.2
                 Weighted Average     1.24E-03      1.24E-03      1.13E-03      0.1       9.1
% Average        Simple Average                                                 9.3       4.2
improvement      Weighted Average                                               5.3       3.9


Conclusions
This report describes recent enhancements of neural networks for application in
numerical optimization processes. It presents a neural network-based response surface
method for reducing the cost of computer-intensive optimizations. A constructive network
based on the cascade correlation algorithm has been developed. It allows for efficient
neural network determination when dealing with function representation over large
design spaces. During training, the network grows until the error on a small set (VS),
different from that used in the training itself (TS), starts increasing. The method is
characterized for a mathematical function for dimensions ranging from 5 to 30.

Improvements to the method using ensemble averaging are described and documented.
The ensemble averaging method using a weighted average led to an average 4.2%
improvement in the squared error on a large unseen dataset (GS) compared to the
simple average method. The simple average method had already improved the results
by 29.5% over the method of choosing the best out of ten networks constructed
(Schmitz, 2006). Ensemble averaging is also computationally more efficient, as it makes
use of all the networks trained. Another enhancement to the method consists of
cross-validating the training and validation sets to improve the diversity of the NNs
created and thereby improve the ensemble. This study shows that a ten-fold
cross-validation improves the squared error on the generalization set by 9.3% for the
simple average and 5.3% for the weighted average over a committee network using the
same training and validation set for all ten networks comprising the ensemble. Overall,
we have shown that our enhanced algorithm can approximate functions with large
dimensions (up to 30) with average errors around 5%. In practical applications such as
optimization loops, this approximation is much better than resorting to empirical or
highly idealized approximations of complex function evaluations such as the powering or
seakeeping of multihull ships. The NN approach allows the optimization process to utilize
the results of highly sophisticated CFD or experimental analyses without the limitations
imposed by computational costs. Applications of the approach in such optimization
processes are reported in the Task 4.4 report.

References

Besnard, E., Schmitz, A., Hefazi, H. and Shinde, R. (2007), Constructive Neural Networks
and their Application to Ship Multidisciplinary Design Optimization, Journal of Ship
Research, Vol. 51, No. 4, pp. 297-312.
Efron, B. (1983), Estimating the Error Rate of a Prediction Rule: Improvement on
Cross-validation, Journal of the American Statistical Association, 78, 316-331.
Fahlman, S.E. and Lebiere, C. (1990), The Cascade-Correlation Learning Architecture,
Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon
University, Pittsburgh, PA, USA.
Hefazi, H., Schmitz, A., Shinde, R. and Mizine, I. (2006), Automated Multidisciplinary
Design Optimization Method for Multi-Hull Vessels, CCDoTT Report FY05.
Krogh, A. and Vedelsby, J. (1995), Neural Network Ensembles, Cross Validation and
Active Learning, in Tesauro, G., Touretzky, D.S. and Leen, T.K. (eds.), Advances in
Neural Information Processing Systems 7, MIT Press.
Perrone, M. and Cooper, L.N. (1993), When Networks Disagree: Ensemble Methods for
Hybrid Neural Networks, in Mammone, R.J. (ed.), Neural Networks for Speech and
Image Processing, Chapman Hall.
Prechelt, L. (1998), Early Stopping - But When?, in Orr, G.B. and Mueller, K.-R. (eds.),
Neural Networks: Tricks of the Trade, Springer.
Sarle, W. (2006), Artificial Intelligence FAQ/Neural Networks,
http://www.faqs.org/faqs/ai-faq/neural-nets/
Schmitz, A. and Hefazi, H. (2006), Task 4.5 Improvement of Neural Networks for
Numerical Optimization, CCDoTT Report FY05, available at http://www.ccdott.org
Schmitz, A., Besnard, E. and Vives, E. (2002), Reducing the Cost of Computational Fluid
Dynamics Optimization Using Multi Layer Perceptrons, 2002 World Congress on
Computational Intelligence, Honolulu, HI.
Sharkey, A. (1996), On Combining Artificial Neural Nets, Connection Science, Vol. 8,
pp. 299-313.