NEURAL NETWORK IMPROVEMENTS

Multi-Disciplinary Design and Optimization Tools

for Large Multihull Ships

Submitted to:

Office of Naval Research

875 North Randolph Street, Room 273

Arlington, VA 22203-1995

Dr. Paul Rispin, Program Manager

ONR Code 331

703.696.0339

rispinp@onr.navy.mil

In fulfillment of the requirements for:

Cooperative Agreement No. N00014-04-2-0003

Agile Port and High Speed Ship Technologies

FY06 Project 06-4

Classification: Unclassified

Prepared and submitted by:

Center for the Commercial Deployment of Transportation Technologies

California State University, Long Beach Foundation

6300 State University Drive, Suite 220 • Long Beach, CA 90815 • 562.985.7394

June 25, 2008

Multi-Disciplinary Design and Optimization Tools

for Large Multihull Ships

Improvements in Neural Networks and Optimization Process

Prepared for

Stanley Wheatley, Principal Investigator

Center for the Commercial Deployment of Transportation Technologies

California State University Long Beach Foundation

6300 State University Drive, Suite 332

Long Beach, CA 90815

CSULB MOU No: 07-328306

Fiscal Year: FY06

ONR Project No.06-4; CCDoTT Program Element: 2.37

Task 4.5- Improvements in Neural Networks and Optimization Process

Deliverable 4.5 – Improvements in Neural Networks and Optimization Process Report

Prepared by

Adeline Schmitz

Hamid Hefazi

Mechanical and Aerospace Engineering Department

California State University, Long Beach

This material is based upon work supported by the Office of Naval Research, under Cooperative Agreement No.

N00014-04-2-0003

with the California State University, Long Beach Foundation, Center for the Commercial

Deployment of Transportation Technologies (CCDoTT). Any opinions, findings and conclusions or recommendations

expressed in this material are those of the author(s) and do not necessarily reflect the views of the Center for the

Commercial Deployment of Transportation Technologies (CCDoTT) at California State University, Long Beach.

Improvements in Neural Networks and Optimization Process

I

TABLE of CONTENTS

Nomenclature....................................................................................................................II

Subscripts......................................................................................................................II

Definitions......................................................................................................................II

Introduction.......................................................................................................................2

Cascade Correlation algorithm..........................................................................................2

Proposed Improvements to Neural Network.....................................................................4

Cross-validation.............................................................................................................4

Weighted Averaging......................................................................................................5

Results..............................................................................................................................6

Test Function.................................................................................................................6

Weighted Averaging......................................................................................................7

Ten-fold Cross-validation on Original Training Sets......................................................9

Five-fold and Ten-fold Cross-validation on Data Sets composed of the Original

Training Sets and Validation Set.................................................................................11

Conclusions.....................................................................................................................13

References......................................................................................................................14


Nomenclature

DV = design variable

f = exact function

f_NN = Neural Network approximation of the function f

f_ENSEMBLE NN = approximation of the function f by a Neural Network Ensemble

GS = generalization set

HU = hidden unit

MDO = multidisciplinary design optimization

n = number of inputs/design variables

NN = Neural network

N_p = number of points in set

RSM = response surface method

TS = training set

VS = validation set

Subscripts

( )_p = value at a point of the training set

Definitions

$\langle A \rangle = \frac{1}{N_p} \sum_{p=1}^{N_p} A_p$ = average of A over set

$E(x) = f(x) - f_{NN}(x)$ = error at x

$E_M = \max_{1 \le p \le N_p} E_p = \max_{1 \le p \le N_p} E(x_p)$ = maximum error over set

$E^2 = \frac{1}{2 N_p} \sum_{p=1}^{N_p} E_p^2$ = modified squared error over set

$Std(E) = \sqrt{\frac{1}{N_p - 1} \sum_{p=1}^{N_p} \left( E_p - \langle E \rangle \right)^2}$ = standard deviation of error over set
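As a concrete reading of these definitions, the error measures can be computed as follows (a minimal sketch in plain Python; the helper name is hypothetical and not part of this report):

```python
import math

def error_metrics(f_exact, f_nn):
    """Compute the error measures defined above for paired samples of the
    exact function f and its NN approximation f_NN over Np points."""
    E = [fe - fn for fe, fn in zip(f_exact, f_nn)]   # E(x_p) = f(x_p) - f_NN(x_p)
    Np = len(E)
    avg = sum(E) / Np                                 # <E>: average of E over the set
    e_max = max(E)                                    # E_M: maximum error over the set
    e2 = sum(e * e for e in E) / (2.0 * Np)           # E^2: modified squared error
    std = math.sqrt(sum((e - avg) ** 2 for e in E) / (Np - 1))  # Std(E)
    return {"avg": avg, "max": e_max, "E2": e2, "std": std}
```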


Abstract

This report summarizes the results of Task 4.5, Improvements in Neural Networks and

Optimization Process in the scope of the FY06 CCDoTT program. Building upon several

previous successful applications, our optimization process utilizes a neural network-

based response surface method for reducing the cost of computer intensive

optimizations for applications in ship design. Complex or costly analyses are replaced by

a neural network which is used to instantaneously estimate the value of the function(s) of

interest. The cost of the optimization is shifted to the generation of (smaller) data sets

used for training the network. The focus of this report is on the use and analysis of

constructive networks for treating problems with a large number of variables, say around

30. A mathematical function is used to systematically evaluate the network performance

over a range of design spaces. A committee network using simple ensemble averaging

and a single training set for all members of the committee was developed and tested in

the FY 05 program. This approach showed that significant improvement of the network

performance could be obtained when compared to training a series of networks and

selecting only the best network. The FY 06 program, investigated further committee

networks which includes using a weighted average instead of a simple average method

and, a cross-validation re-sampling technique applied to different sizes training sets and

design space sizes. Results are compared to the basic ensemble method which makes

use of a single training set for all members.


Introduction

This report describes the improvements on the constructive NN algorithm developed for

regression tasks in large dimensional space. It presents an alternative NN structure

based on a constructive NN topology and a corresponding training algorithm suitable for a large number of inputs/outputs, to address problems where the number of design parameters is fairly large, say up to 30 or more. This work is performed as a continuation of a similar task in the FY 05 CCDoTT program (Schmitz, 2006).

The constructive algorithm is based on cascade correlation. This supervised learning

algorithm was first introduced by Fahlman and Lebiere (1990). Instead of just adjusting

the weights in a network of fixed topology, Cascade-Correlation begins with a minimal

network, then automatically trains and adds new hidden units one-by-one in a cascading

manner. This architecture has several advantages over other algorithms: it learns very

quickly; the network determines its own size and topology; it retains the structure it has

built even if the training set changes; and it requires no back-propagation of error signals

through the connections of the network. In addition, for a large number of inputs (design

variables), the most widely used learning algorithm, back-propagation, is known to be

very slow. Cascade-Correlation does not exhibit this limitation (Fahlman, 1990).

The cascade correlation algorithm has been substantially modified during the previous

studies in order to make it a robust and accurate method for function approximation

(Schmitz et al 2002, Schmitz, 2006). Following a brief review of the method, this report

describes two additional enhancements made to the algorithm during the FY 06 program.

Cascade Correlation algorithm

The training algorithm involves the steps described below and leads to a network with

the topology of Fig. 1.

Start with the required input and output units; both layers are fully connected. The

number of inputs (design variables) and outputs (objective function, constraints) is

dictated by the problem.

1. Train all connections ending at an output unit with a typical learning algorithm until the squared error, $E^2$, of the NN no longer decreases:

$$E^2 = \frac{1}{2 N_p} \sum_{p=1}^{N_p} E_p^2$$

where $N_p$ is the number of points in the training set and $E_p$ is the error at each point of the training set.

2. Generate a large pool of candidate units that receive trainable input connections

from all of the network’s external inputs and from all pre-existing hidden units. The

output of each candidate unit is not yet connected to the active network (output).

Each candidate unit has a different set of random initial weights. All receive the same

input signals and see the same residual error for each training pattern. Among

candidate hidden units in the pool, select the one which has the largest correlation,

C,

defined by


$$C = \sum_{p=1}^{N_p} \left( z_{0,p} - \langle z_0 \rangle \right) \left( E_p - \langle E \rangle \right) \qquad (1)$$

where $z_{0,p}$ is the output of the candidate hidden unit and $E_p$ is the error at each point

of the training set, calculated at Step 1. The weights of the hidden unit selected are

then modified by maximizing the correlation $C$ with an ordinary learning algorithm.

Only the candidate with the best correlation score is installed. The other units in the

pool are discarded.

3. The best unit is then connected to the outputs and its input weights are frozen. The

candidate unit acts now as an additional input unit. Training of the input-outputs

connections by minimizing the squared error

2

E

as defined in Step 1 is then

repeated. The use of several candidates greatly reduces the chances that a useless

unit will be permanently installed because an individual candidate was trapped in a

local maximum during training.

4. Repeat until the stopping criterion is verified. Instead of stopping the training based

on the error measured on the Training Set (TS), the stopping criterion makes use of

a Validation Set (VS) which is much smaller than the Training Set and distributed

throughout the design space. Such approach is used because although the error on

the TS decreases as hidden units are added (the network grows), the error on points

elsewhere in the design space (such as those of the VS) may increase slightly. This

phenomenon is known as overfitting of the network and employing a VS circumvents

this issue (Prechelt 1998). This important phenomenon is discussed below in more

detail in the section illustrating the approach for a mathematical function.
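The candidate-selection mechanism of Steps 2 and 3 can be sketched in code. In this minimal illustration (plain Python; the tanh candidate unit, the helper names, and the pool contents are assumptions for illustration, not the report's implementation), every candidate sees the same inputs and residual errors, and only the best-correlated one survives:

```python
import math

def correlation(z, E):
    # C = sum_p (z_p - <z>)(E_p - <E>), Eq. (1); the magnitude is used for
    # ranking, as in Fahlman and Lebiere (1990)
    z_avg, e_avg = sum(z) / len(z), sum(E) / len(E)
    return abs(sum((zp - z_avg) * (ep - e_avg) for zp, ep in zip(z, E)))

def unit_output(weights, pattern):
    # hypothetical tanh candidate unit fed by the external inputs (in a full
    # implementation, also by all pre-existing hidden units)
    return math.tanh(sum(w * x for w, x in zip(weights, pattern)))

def select_best_candidate(pool, patterns, residuals):
    """Score each candidate's outputs against the residual errors and keep
    only the candidate with the largest correlation; the rest are discarded."""
    def score(weights):
        z = [unit_output(weights, p) for p in patterns]
        return correlation(z, residuals)
    return max(pool, key=score)
```

In the full algorithm, the selected unit's input weights are then frozen, the unit is connected to the outputs, and output training resumes.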

[Figure: network diagram showing the external inputs and bias feeding cascaded hidden units HU 1, HU 2, …, HU h through input weights w, with all inputs and hidden units connected to the outputs through weights v.]

Fig. 1. NN Topology with Cascade Correlation Algorithm


Proposed Improvements to Neural Network

Cross-validation

The critical issue in developing a neural network is generalization: how well will the

network make predictions for cases that are not in the training set? Neural Networks, like

other nonlinear estimation methods such as kernel regression and smoothing splines,

can suffer from either underfitting or overfitting. In last year's report (Schmitz, 2006), ensemble-averaged committee networks showed a significant improvement over single-network results: the ensemble averaging method led to a 29.5% improvement in the squared error on a large unseen dataset (GS) compared to the method of choosing only the network with the smallest error on the validation set.

Indeed, it is common practice in the application of neural networks to train many different

candidate networks and then to select the best, on the basis of performance on an

independent validation set and to keep only this network and to discard the others.

There are two disadvantages in this approach. First, the effort involved in training the

remaining networks is wasted. Second, the generalization performance on the validation

set has a random component due to noise on the data, and so the network which had

best performance on the validation set might not be the one with the best performance

on new test data. These drawbacks can be overcome by combining the networks

together by forming an ensemble. The NNs in the ensemble are essentially trained on

the same input data and then the outputs of the NNs are combined. The basic idea

underlying the ensemble network is to find ways of exploiting the information contained

in these redundant NNs.

However, there is clearly no advantage of combining a set of NNs which generalize

exactly in the same way. There are many methods to create NNs which generalize

differently. Because of the stochastic nature of building NNs with the cascade correlation algorithm, two NNs trained on the same data exhibit different numbers of hidden units and/or different weights and thus will generalize differently; that is, their output values will differ on an unseen dataset. This makes cascade correlation an excellent

candidate for ensemble NNs.

Besides this, there are a number of parameters which can be varied to promote diversity

in the NNs trained; varying the set of initial random weights, varying the topology of the

NN and varying the data. CC already includes different initial weights and a varying

topology. So this research focuses on altering the training data. There are many

methods to vary the training data from one NN to the next: sampling data, using disjoint

training sets or boosting and adaptive re-sampling techniques. (Sharkey, 1996)

1/ Sampling data: This method consists of using a sampling technique so that each NN

is trained on a different subsample of the training data. Of the statistical sampling

methods, cross-validation (Krogh and Vedelsby, 1995) and bootstrapping (Efron, 1983)

have shown to give good results.

K-fold cross-validation consists of randomly partitioning the training data into k subsamples. Of the k subsamples, a single subsample is retained as the validation set for testing the model, and the remaining k−1 subsamples are used as training data. The cross-validation process is repeated k times, with each of the k subsamples used only

once. Ten-fold and five-fold cross-validations appear to work best (Sarle, 2006).
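The k-fold partitioning described above can be sketched as follows (plain Python; the function name is hypothetical):

```python
import random

def k_fold_splits(data, k, seed=0):
    """Randomly partition `data` into k subsamples; yield k (training, validation)
    pairs in which each subsample serves exactly once as the validation set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation
```

With 500 data points and k = 10, this yields ten pairs of 450 training and 50 validation points, matching the splits used later in this report.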

Bootstrap, on the other hand, consists in sampling the dataset with replacement to form

a training set and to use the data not sampled as a validation set. In particular there is a

method called the .632 bootstrap where a dataset of n instances is sampled n times with

replacement to give the training set. Since some elements in the training set are

repeated, there must be some data that has not been used and that can be used for the

validation set. For a reasonably large dataset, it can be shown that the training set will

contain about 63.2% of the original dataset and the validation set about 36.8 %, thus the

name .632 bootstrap.
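A minimal sketch of the .632 bootstrap split (plain Python; the helper name is an assumption):

```python
import random

def bootstrap_632_split(data, seed=0):
    """Sample the dataset with replacement n times to form the training set,
    and keep the never-sampled points as the validation set (.632 bootstrap)."""
    rng = random.Random(seed)
    n = len(data)
    training = [data[rng.randrange(n)] for _ in range(n)]
    sampled = set(training)
    validation = [x for x in data if x not in sampled]
    return training, validation
```

For a reasonably large n, the chance that a given point is never drawn is (1 − 1/n)^n ≈ e⁻¹ ≈ 0.368, so the training set contains about 63.2% of the distinct points, as stated above.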

2/ Disjoint training sets: This method is similar to the above, but the subsamples do not share common points; there is thus no overlap between the data used to train each NN. For regression, this method requires a larger training data sample than the previous method, so that there is a large enough dataset to "represent" the complexity of the function for each NN trained.

3/ Boosting and adaptive re-sampling: the idea here is to train each new NN with data

that has been filtered by the previously trained members of the ensemble. Each data

sample is given a different chance (probability) to appear in a new training set by

prioritizing data poorly learnt by the previous members of the ensemble. However, this method usually requires a large amount of data and might not be applicable to our problem. It is also more computationally intensive than simple bootstrapping, as it requires calculating the ensemble performance and the probability for each data point to appear in the new training set after adding each network to the ensemble.

4/ Preprocessing: there exist many ways to vary the data each NN views; for example, using different nonlinear transformations or injecting artificial noise into the data. Choosing the transformation that would improve the ensemble is probably case-dependent, so this approach is not pursued here, since the function is not known a priori and the method is developed for general regression.

Of these methods, cross-validation and bootstrapping seem most promising for our

problem as in general the amount of data available to train the NNs is relatively small

compared to the number of inputs. Cross-validation is researched first in this report, and

bootstrapping will be studied in further research.

Weighted Averaging

Once a set of networks has been created, one must find an effective way of combining

them. Last year’s study has already proven that taking all networks built and performing

simple averaging on the ensemble network significantly improves the results.

The function constructed by simple averaging (SAvg) can be written as:

$$f_{SAvg}(x) \equiv \frac{1}{N} \sum_{i=1}^{N} f_{NN_i}(x)$$

where $f_{NN_i}$ is the NN approximation of the function $f$ by the $i$th ensemble member.

However, could there be a better way to combine those networks? If the networks of the

ensemble were uncorrelated, then the mean squared error on a simple average


ensemble could be reduced by a factor of N (the number of NNs in the ensemble)

compared to the average mean squared error of the individual networks (Perrone, 1993).

The problem is that, in general, the individual networks are correlated, and thus the

reduction in error is significantly less than a factor of N. An idea is to use the information

given by the error on the validation set. This set is not seen during training and as such

gives a measure on how each NN performs on an unseen dataset. The idea of Perrone

(1993) is to minimize the mean squared error on the ensemble using the VS. He defines

a generalized ensemble (weighted average, or WAvg) as:

$$f_{WAvg}(x) \equiv \sum_{i=1}^{N} \alpha_i \, f_{NN_i}(x)$$

where the $\alpha_i$'s are real and satisfy the constraint $\sum_i \alpha_i = 1$. To minimize the mean squared error on the ensemble, he shows that the $\alpha_i$'s must be chosen as follows:

$$\alpha_i = \frac{\sum_j C_{ij}^{-1}}{\sum_k \sum_j C_{kj}^{-1}}$$

where

$$C_{ij} = \frac{1}{N_p} \sum_{p=1}^{N_p} \left( f(x_p) - f_{NN_i}(x_p) \right) \left( f(x_p) - f_{NN_j}(x_p) \right)$$

is the ij-th element of the symmetric correlation matrix and $N_p$ is the number of points in the VS. This method should exceed the simple average if the individual networks are mostly uncorrelated.

This report compares simple and weighted average on a mathematical function as

described in the next section.
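Perrone's weights can be computed without forming the full inverse of C: since the row sums of C⁻¹ equal the solution of Cβ = 1 (a vector of ones), solving that linear system and normalizing β yields the αᵢ's. A minimal sketch (plain Python; the helper names and the elimination routine are illustrative assumptions, and C is assumed non-singular):

```python
def correlation_matrix(targets, member_predictions):
    """C_ij over the validation set, as defined above, from each member's
    errors e_i(x_p) = f(x_p) - f_NN_i(x_p)."""
    n_pts = len(targets)
    errors = [[t - y for t, y in zip(targets, preds)] for preds in member_predictions]
    n = len(member_predictions)
    return [[sum(errors[i][p] * errors[j][p] for p in range(n_pts)) / n_pts
             for j in range(n)] for i in range(n)]

def ensemble_alphas(C):
    """Solve C b = 1 by Gauss-Jordan elimination (no pivoting; assumes C is
    well-conditioned), then normalize so that the alphas sum to 1."""
    n = len(C)
    A = [row[:] + [1.0] for row in C]          # augmented system [C | 1]
    for col in range(n):
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    b = [A[i][n] for i in range(n)]
    total = sum(b)
    return [bi / total for bi in b]
```

As expected from the formula, a member with small validation error that is uncorrelated with the others receives a large weight.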

Results

In this section, the capabilities of the enhanced network are analyzed by testing it on a

mathematical function. This test function and its attributes are described in detail in

Hefazi (2006) and Besnard (2007). Results are then presented and analyzed when the

design space dimension is increased from 5 to 30 in order to gain an insight into the

behavior of the NN characteristics as the number of design variables increases. It must

be pointed out that the errors are characterized, not only on the training set as is

customary, but also at points not used in the training.

Test Function

The following mathematical test function is selected because it has many minima and

maxima and is easily extended to larger dimension spaces. The function has been

modified such that, as the size of the design space, $n$, increases, the magnitude of the function remains between 0 and 1.

It is defined over the n-dimensional compact set $[-\pi, \pi]^n$ by

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} \left( A_i \sin\!\left( \frac{x_i^2 - y_i^2}{4} \right) + B_i \right)^{3/4} \qquad (2)$$

where the scalars $A_i$ and $B_i$ are defined by


$$A_i = \frac{1}{n \left( \|a\| + \|b\| \right)} \sum_{j=1}^{n} \left( a_{i,j} \sin(x_j) + b_{i,j} \cos(x_j) \right) \qquad (3)$$

and

$$B_i = \frac{1}{n \left( \|a\| + \|b\| \right)} \sum_{j=1}^{n} \left( a_{i,j} \sin(y_j) + b_{i,j} \cos(y_j) \right) \qquad (4)$$

using the 2-norms of the matrices $a$ and $b$, which are given by

$$a = \left[ a_{i,j} \right]_{1 \le i,j \le n} = \left[ \frac{j + (i-1)n}{2} \right] = \begin{bmatrix} 0.5 & 1 & 1.5 & \cdots \\ \frac{1+n}{2} & \frac{2+n}{2} & \frac{3+n}{2} & \cdots \\ \frac{1+2n}{2} & \frac{2+2n}{2} & \frac{3+2n}{2} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix} \qquad (5)$$

and

$$b = \left[ b_{i,j} \right]_{1 \le i,j \le n} = \left[ \frac{j + (i-1)n - 5}{2} \right] = \begin{bmatrix} -2 & -1.5 & -1 & \cdots \\ \frac{-4+n}{2} & \frac{-3+n}{2} & \frac{-2+n}{2} & \cdots \\ \frac{-4+2n}{2} & \frac{-3+2n}{2} & \frac{-2+2n}{2} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix} \qquad (6)$$

Weighted Averaging

In this section, network results using simple ensemble averaging are compared to weighted ensemble averaging (using the $\alpha_i$'s defined by Perrone, 1993) by training NNs for varying numbers of inputs ($n$) and various sizes of training set (TS) using the

mathematical function as discussed in the previous section. Last year’s study included

the cases

n

= 5, 10, 15 and 30 for TS sizes of 500, 750 and 1,500 points. For the 30-

dimension case, a TS of 5,000 points is also considered. All networks are built with a

validation set (VS) of 200 points and a generalization set of 15000 points to evaluate the

generalization error. All TS, VS and GS are generated using a Latin Hypercube. Ten

Networks are built for each case and averaged as one ensemble. For each network, the

number of hidden units is determined by the minimum squared error on the validation set.
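All of the sets above are generated with Latin Hypercube sampling. A minimal sketch of the idea (plain Python on the unit cube; the report's actual generator and the scaling to the $[-\pi, \pi]^n$ design space are not specified here):

```python
import random

def latin_hypercube(n_points, n_dims, seed=0):
    """Latin Hypercube sample on [0, 1]^n_dims: each axis is divided into
    n_points equal bins, and every bin is used exactly once per axis."""
    rng = random.Random(seed)
    # one independent random permutation of the bins per dimension
    permutations = [rng.sample(range(n_points), n_points) for _ in range(n_dims)]
    # place each point uniformly at random inside its assigned bin
    return [[(permutations[d][p] + rng.random()) / n_points
             for d in range(n_dims)]
            for p in range(n_points)]
```

Unlike plain random sampling, every one-dimensional projection of the sample is evenly stratified, which is why it is favored for filling high-dimensional design spaces with few points.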

Table 1 presents the squared error, average error and standard deviation on the TS, VS

and GS for all networks built, using simple ensemble averaging. These results were

obtained in FY’05 (Schmitz, 2006) and presented an overall 29.5 % improvement for the

squared Error, 17.4 % improvement for the average error and 14.9 % improvement on

the standard deviation for the GS. Results were similar on the VS and about twice as

good on the TS. The table is presented here to give the reader a general idea of the order of magnitude of the errors.


Table 1: Squared Error (E2), Average Error (<E>) and Standard Deviation (Std(E)) for the simple-average ensemble NN (results from FY'05)

Simple Average      |       Training Set       |      Validation Set      |   Generalization Set
Case                | E2     <E>    Std(E)     | E2     <E>    Std(E)     | E2     <E>    Std(E)
5 inputs,  TS=500   | 0.0003 0.0193 0.0167     | 0.0018 0.0408 0.0433     | 0.0017 0.0423 0.0396
5 inputs,  TS=750   | 0.0003 0.0184 0.0150     | 0.0010 0.0311 0.0320     | 0.0008 0.0303 0.0271
5 inputs,  TS=1500  | 0.0002 0.0148 0.0135     | 0.0006 0.0249 0.0234     | 0.0005 0.0232 0.0214
10 inputs, TS=500   | 0.0013 0.0413 0.0307     | 0.0027 0.0563 0.0465     | 0.0033 0.0643 0.0490
10 inputs, TS=750   | 0.0004 0.0223 0.0171     | 0.0020 0.0496 0.0382     | 0.0022 0.0522 0.0419
10 inputs, TS=1500  | 0.0006 0.0271 0.0207     | 0.0013 0.0390 0.0323     | 0.0016 0.0445 0.0362
15 inputs, TS=500   | 0.0014 0.0414 0.0321     | 0.0028 0.0588 0.0464     | 0.0031 0.0629 0.0471
15 inputs, TS=750   | 0.0009 0.0342 0.0254     | 0.0022 0.0519 0.0419     | 0.0026 0.0577 0.0434
15 inputs, TS=1500  | 0.0004 0.0222 0.0166     | 0.0016 0.0433 0.0357     | 0.0017 0.0469 0.0360
30 inputs, TS=500   | 0.0015 0.0440 0.0327     | 0.0020 0.0498 0.0394     | 0.0022 0.0526 0.0395
30 inputs, TS=750   | 0.0016 0.0448 0.0339     | 0.0020 0.0494 0.0385     | 0.0022 0.0528 0.0393
30 inputs, TS=1500  | 0.0015 0.0442 0.0327     | 0.0018 0.0477 0.0373     | 0.0020 0.0510 0.0384
30 inputs, TS=5000  | 0.0005 0.0248 0.0187     | 0.0012 0.0385 0.0293     | 0.0013 0.0406 0.0310

Table 2 presents the percentage error improvement found on the squared error (E2 or $\hat{E}^2$), average error (<E>) and standard deviation (Std(E)) by using weighted ensemble

averaging instead of simple averaging. Results are shown for the training set, the

validation set and the generalization set for the number of inputs varying from 5 to 30

and the different training sets. A positive value means that the weighted average

ensemble performed better and a negative value means that the simple average

ensemble performed better. As expected, results show that the weighted average

ensemble always improves the error on the VS. Indeed, the weight factors $\alpha_i$'s are determined to minimize the mean squared error on the VS. For the results on the TS,

errors are reduced except for the cases n=5 (all TS sizes) and n=15, TS=1500. For the GS, the most important of the sets, the errors are almost always improved, except for n=10, TS=1500; n=15, TS=1500; and n=30, TS=750, where the simple average performed slightly better. The worst result is for the case n=15, TS=1500, where E2 on the GS is increased by 6.6% by using the weighted ensemble instead of the simple ensemble.

Averaging the results over all cases (all n and all TS), E2 for the weighted average ensemble was reduced by 16% on the TS, 13.4% on the VS and 4.2% on the GS compared to the simple average ensemble. The average error and standard deviation were also reduced on the TS, VS and GS with the weighted average. This shows that the weighted ensemble performs on average better than the simple average.


Table 2: Percentage improvement on squared error (E2), average error (<E>) and standard

deviation (Std(E)) by using weighted ensemble averaging instead of simple averaging.

Weighted Average    |       TRAINING SET       |      VALIDATION SET      |  GENERALIZATION SET
% Improvement       | E2     <E>    std(E)     | E2     <E>    std(E)     | E2     <E>    std(E)
n=5,  TS=500        | -30.98 -17.29 -10.54     |  24.60   9.15  16.91     |  10.81   2.31   9.42
n=5,  TS=750        | -10.09  -6.83  -2.00     |  23.71  11.04  14.22     |  13.09   4.32   9.92
n=5,  TS=1500       | -10.91  -8.03  -1.94     |  24.48  11.68  14.74     |   9.27   3.08   6.74
n=10, TS=500        |  65.39  42.24  39.27     |  21.87  11.89  11.19     |  19.31  11.94   7.21
n=10, TS=750        |  18.42   9.79   9.48     |   7.63   4.58   2.73     |   0.17   0.61  -0.72
n=10, TS=1500       |   5.38   2.71   2.76     |   3.71   2.33   1.21     |  -0.26   0.00  -0.33
n=15, TS=500        |  49.51  30.85  25.86     |  14.15   9.07   4.61     |   0.84   0.74  -0.15
n=15, TS=750        |  54.52  33.54  30.82     |  15.26   6.66   9.96     |   8.95   5.22   3.45
n=15, TS=1500       | -14.40  -6.81  -7.23     |   8.78   3.50   5.98     |  -6.58  -3.15  -3.39
n=30, TS=500        |  21.48  11.85  10.53     |   0.87   0.63   0.12     |   9.38   4.94   4.57
n=30, TS=750        |   2.51   1.15   1.47     |   2.38   1.78   0.24     |  -1.85  -0.73  -1.26
n=30, TS=1500       |  34.05  19.32  17.83     |  12.94   6.21   7.49     |   0.59   0.33   0.23
n=30, TS=5000       |  47.15  27.47  27.00     |   9.03   4.82   4.28     |   4.89   2.65   2.17
Average improvement (%) over all n and TS
                    |  16.09   9.82  10.12     |  13.37   6.50   7.51     |   4.24   1.95   2.41

Ten-fold Cross-validation on Original Training Sets

This section investigates the use of cross-validation in committee networks. In order to

evaluate the variations in NN performance for varying numbers of inputs, the dimension

of the search space, $n$, is again varied from 5 to 30 and the training is performed for various sizes of training set (TS). Last year's study included the cases $n$ = 5, 10, 15 and

30 for TS sizes of 500, 750 and 1,500 points. All TS are generated using a Latin

Hypercube. For the 30-dimension case, a TS of 5,000 points is also considered.

The first study takes the existing TS of 500, 750, 1500 and 5000 points (referred to as

“original training sets”) and splits them using a 10-fold cross-validation. For example, a

TS of 500 is split randomly into ten sets. The first NN of the ensemble is trained using

the first set for validation (50 points) and the other nine sets (450 points) for training. The

second NN uses set number 2 for validation and the other nine sets for training, and so on, so that each network uses a different training and validation set. The ten networks

built are then combined to create a committee network using the simple average and

weighted average method described previously. However, the correlation matrix cannot

be calculated on the VS since it is different for each network of the ensemble. Instead

the correlation matrix is calculated on the original training set which comprises the TS

and VS for each NN of the ensemble and is thus the same for the ten NNs constructed.

The results are compared with the ensemble NN created without cross-validation thus

using the same TS (500, 750, 1500 or 5000 points) for all members of the ensemble NN

and using a VS of 200 points created with a Latin Hypercube distribution. The

generalization set used to compare NNs with and without cross-validation over the

design space at points not used during the training. The GS contains 15000 points (also

generated by Latin Hypercube). Table 3 shows the percentage improvement on the

squared error (E2), the average error (<E>), and the standard deviation (Std(E)) on the


GS by using cross-validation. A negative value means that the ensemble NN created

with the single TS for all members of the ensemble performed better than the cross-

validated case. Results are presented for the ensemble network with simple averaging

(Simple Average) and the ensemble network with weighted averaging (Weighted

Average), as described previously. The results do not seem to present a general trend

as to whether cross-validating data is better. It might even appear a bit worse,

especially for the ensemble using weighted average.

There are two possible explanations for these results. First, it must be pointed out that the NNs generated without cross-validation make use of 200 additional points to determine when to stop training (the "original validation set"), whereas for the cases with cross-validation this set is not used, as the validation set is part of the original training set. Second, especially for the small training sets, the validation set is relatively

small (50 and 75 points for TS = 500 and 750). The VS is used to decide when to stop

training, and thus with a very small set the NNs might stop too early (leading to

underfitting) or too late (leading to overfitting).

Two solutions are proposed. The first one is to incorporate the additional information

contained in those two hundred points into the NN training by simply concatenating the original TS and the original VS to create a larger dataset, and then performing cross-validation on this dataset so that the information contained in these additional 200

points is used during training and/or validation. The second solution is to increase the

size of the validation set by performing a five-fold cross-validation on the data instead of

a ten-fold. This doubles the size of the VS, but it creates only 5 separate TS and VS. To

generate an ensemble of ten networks, the 5-fold cross-validation is performed twice on

the same data. Since the process of splitting the sets into five subsets is performed

randomly, ten different TS and VS will be obtained. The following section presents these

results.
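The split generation just described can be sketched as below: running 5-fold cross-validation twice with a fresh random shuffle yields the ten distinct TS/VS pairs, while k=10 with a single repeat gives ordinary ten-fold cross-validation. This is a minimal sketch, not the report's implementation; the function name and seed handling are assumptions.

```python
import numpy as np

def cv_splits(n_points, k, repeats=1, seed=0):
    """Generate k*repeats (train, val) index pairs by repeated k-fold CV.

    k=5, repeats=2 yields the ten TS/VS pairs described above;
    k=10, repeats=1 gives standard ten-fold cross-validation.
    """
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(repeats):
        perm = rng.permutation(n_points)   # fresh random shuffle per repeat
        folds = np.array_split(perm, k)
        for i in range(k):
            vs = folds[i]                  # one fold becomes the VS
            ts = np.concatenate([f for j, f in enumerate(folds) if j != i])
            splits.append((ts, vs))
    return splits

# Twice five-fold on a 750-point set: ten splits, each VS holds 150 points.
splits = cv_splits(750, k=5, repeats=2)
print(len(splits), len(splits[0][1]))      # -> 10 150
```

Because each repeat reshuffles the data, the ten validation sets overlap across repeats but no two TS/VS pairs are identical, which is what provides the ensemble diversity sought here.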


Table 3: Percentage improvement in generalization (GS) errors from using ten-fold cross-validation on the original training sets, compared with using a single TS and a separate VS of 200 points. Ensemble NN % improvement on GS.

Case             Method              E2       <E>     std(E)
---------------  ----------------  -------  -------  -------
n=5,  TS=500     Simple Average       8.07     3.73     4.57
                 Weighted Average   -10.23    -3.42    -7.05
n=5,  TS=750     Simple Average      -5.61    -1.30    -4.56
                 Weighted Average   -48.73   -19.57   -25.22
n=5,  TS=1500    Simple Average      19.19     9.04    11.37
                 Weighted Average    15.11     7.71     8.06
n=10, TS=500     Simple Average      11.07     5.82     5.50
                 Weighted Average    -2.83    -1.47    -1.30
n=10, TS=750     Simple Average      -9.64    -5.68    -3.18
                 Weighted Average    -2.29    -0.71    -1.79
n=10, TS=1500    Simple Average      -2.85    -1.76    -0.88
                 Weighted Average    -8.35    -3.99    -4.24
n=15, TS=500     Simple Average       0.40     0.32     0.00
                 Weighted Average   -13.83    -6.05    -7.80
n=15, TS=750     Simple Average      -3.78    -1.59    -2.37
                 Weighted Average   -32.34   -14.40   -16.12
n=15, TS=1500    Simple Average      -3.52    -1.47    -2.20
                 Weighted Average   -15.98    -7.44    -8.12
n=30, TS=500     Simple Average       7.13     3.65     3.59
                 Weighted Average    -2.21    -0.53    -2.10
n=30, TS=750     Simple Average       3.83     2.04     1.76
                 Weighted Average   -26.15   -11.93   -13.00
n=30, TS=1500    Simple Average       1.07     0.68     0.28
                 Weighted Average    -8.93    -3.61    -5.69
n=30, TS=5000    Simple Average      -8.44    -3.98    -4.39
                 Weighted Average    -1.77    -0.63    -1.31
Average, all     Simple Average       1.30     0.73     0.73
n and TS         Weighted Average   -12.19    -5.08    -6.59

Five-fold and Ten-fold Cross-validation on Data Sets Composed of the Original Training and Validation Sets

In light of the previous section's inconclusive data, it was decided to use the additional information contained in the validation set when training the NNs with cross-validation. Last year's method (without cross-validation) uses a fixed training set (500, 750, 1500, or 5000 points) to train all members of the ensemble and a separate 200-point VS to decide when to stop training. The original TS (500, 750, 1500, and 5000 points) are concatenated with the VS (200 points) to form data sets (DS) of 700, 950, 1750, and 5200 points, respectively, and a cross-validation method is applied to create ten distinct TS and VS pairs, one for each member of the ensemble.


Table 4 presents the squared error (E2) and the relative improvement on the GS obtained using three methods. Method 1 corresponds to training without cross-validation, using a separate VS of 200 points. Method 2 corresponds to training with ten-fold cross-validation on the original training set to which the 200-point original validation set was added, and Method 3 corresponds to using the same combined data set and performing five-fold cross-validation twice. Again, improvements are presented for the ensemble NNs with simple and weighted averaging. Results show a considerable improvement for the smaller numbers of inputs (n) and for the relatively smaller training sets (TS = 500 and 750). Ten-fold cross-validation seems to work better on average than five-fold cross-validation, so it appears more advantageous to have a larger percentage of the data for training each individual NN: 90% for ten-fold compared with 80% for five-fold. Both cross-validation methods perform better on average than the method without cross-validation. Method 2 shows an average improvement of 9.3% with the simple average and 5.3% with the weighted average when compared to Method 1 (without cross-validation). Method 3 shows an average improvement of 4.2% with the simple average and 3.9% with the weighted average.
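The "% Improvement" columns are presumably the relative reduction in squared generalization error of one method over another. As a sanity check on the rounded Table 4 entries (the small discrepancy with the printed 30.7 comes from the E2 values being rounded to three significant figures):

```python
# Relative reduction in squared generalization error, as a percentage.
def pct_improvement(e2_baseline, e2_new):
    return 100.0 * (e2_baseline - e2_new) / e2_baseline

# n=5, TS=500, Simple Average: Method 1 (1.68E-03) vs Method 2 (1.16E-03).
# Table 4 prints 30.7; the rounded inputs here give ~31.0.
print(round(pct_improvement(1.68e-3, 1.16e-3), 1))   # -> 31.0
```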


Table 4: Comparison of squared errors (E2) on the generalization set using five-fold and ten-fold cross-validation on a dataset combining the original training and validation sets. Method 1: no cross-validation, separate VS of 200 points; Method 2: ten-fold cross-validation on TS+VS; Method 3: twice five-fold cross-validation on TS+VS.

Case             Method            Method 1   Method 2   Method 3   % Impr.  % Impr.
                                   E2         E2         E2         2 vs 1   3 vs 1
---------------  ----------------  ---------  ---------  ---------  -------  -------
n=5,  TS=500     Simple Average    1.68E-03   1.16E-03   1.12E-03     30.7     33.2
                 Weighted Average  1.50E-03   9.71E-04   1.02E-03     35.1     31.9
n=5,  TS=750     Simple Average    8.26E-04   6.13E-04   8.78E-04     25.7     -6.3
                 Weighted Average  7.18E-04   5.72E-04   8.32E-04     20.4    -15.9
n=5,  TS=1500    Simple Average    4.99E-04   4.78E-04   4.76E-04      4.2      4.5
                 Weighted Average  4.53E-04   4.45E-04   4.14E-04      1.7      8.6
n=10, TS=500     Simple Average    3.27E-03   2.47E-03   2.54E-03     24.6     22.3
                 Weighted Average  2.64E-03   2.52E-03   2.53E-03      4.5      4.2
n=10, TS=750     Simple Average    2.24E-03   2.14E-03   2.30E-03      4.4     -2.5
                 Weighted Average  2.24E-03   2.07E-03   2.39E-03      7.4     -6.8
n=10, TS=1500    Simple Average    1.64E-03   1.66E-03   1.71E-03     -1.1     -4.0
                 Weighted Average  1.65E-03   1.69E-03   1.73E-03     -2.7     -5.2
n=15, TS=500     Simple Average    3.08E-03   2.77E-03   2.67E-03     10.2     13.6
                 Weighted Average  3.06E-03   2.64E-03   2.60E-03     13.6     15.0
n=15, TS=750     Simple Average    2.61E-03   2.43E-03   2.48E-03      7.0      5.0
                 Weighted Average  2.38E-03   2.27E-03   2.19E-03      4.4      7.9
n=15, TS=1500    Simple Average    1.75E-03   1.74E-03   1.88E-03      0.2     -7.4
                 Weighted Average  1.86E-03   1.77E-03   1.60E-03      4.9     14.0
n=30, TS=500     Simple Average    2.16E-03   2.09E-03   2.15E-03      3.2      0.4
                 Weighted Average  2.25E-03   2.34E-03   2.24E-03     -4.0      0.4
n=30, TS=750     Simple Average    2.16E-03   2.08E-03   2.13E-03      4.0      1.7
                 Weighted Average  2.20E-03   2.27E-03   2.34E-03     -3.1     -6.4
n=30, TS=1500    Simple Average    2.04E-03   2.05E-03   2.05E-03     -0.8     -0.9
                 Weighted Average  2.02E-03   2.29E-03   2.14E-03    -13.4     -5.8
n=30, TS=5000    Simple Average    1.30E-03   1.19E-03   1.36E-03      8.4     -4.2
                 Weighted Average  1.24E-03   1.24E-03   1.13E-03      0.1      9.1
% Average        Simple Average                                        9.3      4.2
improvement      Weighted Average                                      5.3      3.9

Conclusions

This report describes recent enhancements of neural networks for application in a numerical optimization process. It presents a neural network-based response surface method for reducing the cost of computationally intensive optimizations. A constructive network based on the cascade correlation algorithm has been developed. It allows efficient neural network determination when dealing with function representation over large design spaces. During training, the network grows until the error on a small set (VS), different from that used in the training itself (TS), starts increasing. The method is characterized on a mathematical function for dimensions ranging from 5 to 30.


Improvements to the method using ensemble averaging are described and documented. The ensemble averaging method using a weighted average led to an average 4.2% improvement in the squared error on a large unseen dataset (GS) compared to the simple average method. The simple average method had already improved the results by 29.5% over the method of choosing the best out of ten networks constructed (Schmitz, 2006). Ensemble averaging is also computationally more efficient, as it makes use of all the networks constructed. Another enhancement to the method consists of cross-validating the training and validation sets to improve the diversity of the NNs created and thereby improve the ensemble. This study shows that ten-fold cross-validation improves the squared error on the generalization set by 9.3% for the simple average and 5.3% for the weighted average over a committee network using the same training and validation sets for all ten networks comprising the ensemble. Overall, we have shown that our enhanced algorithm can approximate functions of large dimension (up to 30) with average errors around 5%. In practical applications such as optimization loops, this approximation is much better than resorting to empirical or highly idealized approximations of complex function evaluations, such as the powering or seakeeping of multihull ships. The NN approach allows the optimization process to utilize the results of highly sophisticated CFD or experimental analyses without the limitations imposed by computational costs. Applications of the approach in such optimization processes are reported in the Task 4.4 report.

References

Besnard, E., Schmitz, A., Hefazi, H., and Shinde, R. (2007), Constructive Neural Networks and their Application to Ship Multi-disciplinary Design Optimization, Journal of Ship Research, Vol. 51, No. 4, pp. 297-312.

Efron, B. (1983), Estimating the error rate of a prediction rule: Improvement on cross-validation, Journal of the American Statistical Association, 78, 316-331.

Fahlman, S.E., and Lebiere, C. (1990), The Cascade-Correlation Learning Architecture, Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.

Hefazi, H., Schmitz, A., Shinde, R., and Mizine, I. (2006), Automated Multidisciplinary Design Optimization Method for Multi-Hull Vessels, CCDoTT Report FY05.

Krogh, A., and Vedelsby, J. (1995), Neural Network Ensembles, Cross-validation and Active Learning, in Tesauro, G., Touretzky, D.S., and Leen, T.K. (eds.), Advances in Neural Information Processing Systems 7, MIT Press.

Sarle, W. (2006), Artificial Intelligence FAQ/Neural Networks, http://www.faqs.org/faqs/ai-faq/neural-nets/

Schmitz, A., and Hefazi, H. (2006), Task 4.5 Improvement of Neural Networks for Numerical Optimization, CCDoTT Report FY05, available at http://www.ccdott.org

Schmitz, A., Besnard, E., and Vives, E. (2002), Reducing the Cost of Computational Fluid Dynamics Optimization Using Multi Layer Perceptrons, 2002 World Congress on Computational Intelligence, Honolulu, HI.

Sharkey, A. (1996), On Combining Artificial Neural Nets, Connection Science, Vol. 8, pp. 299-313.

Perrone, M., and Cooper, L.N. (1993), When Networks Disagree: Ensemble Methods for Hybrid Neural Networks, in Mammone, R.J. (ed.), Neural Networks for Speech and Image Processing, Chapman Hall.
