International Research Journal of Finance and Economics

ISSN 1450-2887 Issue 39 (2010)

© EuroJournals Publishing, Inc. 2010

http://www.eurojournals.com/finance.htm

Artificial Intelligence: Neural Networks Simplified

Indranarain Ramlall

1

University of Technology, Mauritius

E-mail: ramindra0001@yahoo.co.uk

Abstract

Now standing as a major paradigm for data mining applications, Neural Networks have been widely used in many fields due to their ability to capture complex patterns present in the data. Through their feature-extraction capability, Neural Networks extrapolate past patterns into future ones and thereby relieve the burden of having recourse to complex detection algorithms in pattern recognition tasks such as face detection or fingerprint detection. The development of Neural Networks has been so rapid that they are now referred to as the sixth generation of computing. While the main strength of Neural Networks lies in their non-linearity and data-driven nature, their main shortcoming is the lack of explanatory power of the trained networks, owing to their complex structure. This paper explains the distinct mechanisms embodied in Neural Networks, together with their strengths, weaknesses and applications.

Keywords: Neural Networks, Artificial Intelligence, Econometrics, Overfitting, Black box

JEL Classification Codes: C45, B23, C01

1. Introduction

Considered a subset of Artificial Intelligence, a Neural Network (NN) is basically a computer program designed to learn in a manner similar to the human brain. In the brain, axons, which carry an electrical signal and are found only on output cells, terminate at synapses that connect them to the dendrites of other neurons. Throughout this paper, NN refers to artificial neural networks (ANN) and not to biological ones. NN resemble the human brain on two grounds: first, they acquire knowledge through a learning process and, secondly, interneuron connection strengths are used to store the acquired knowledge. Perhaps the most widely agreed definition of NN is that they constitute nonparametric regression models which do not require a priori assumptions about the problem, so that the data are allowed to speak for themselves. Ironically, the origin of NN is not directly linked to any estimation or forecasting exercise, since the researchers who pioneered this field were basically attempting to gain insight into the learning abilities of the human brain. However, as NN displayed an interesting learning capacity, interest spread beyond science to fields such as finance, economics and econometrics. In a nutshell, NN try to sieve out patterns present in past data and to extrapolate them into the future. Prior to gaining insight into NN, it is important to understand their basic components.

The aim of this paper is to provide a simplified approach to the mechanisms embodied in Neural Networks. Section 2.0 focuses on the stylized facts of the brain, followed by section 3.0 which deals with the network architecture. While section 4.0 describes the choice of network topology, section 5.0 deals with data preprocessing. Sections 6.0 and 7.0 discuss the strengths and drawbacks of Neural Networks, respectively. Section 8.0 focuses on the backpropagation algorithm, with section 9.0 dealing with applications of Neural Networks in distinct fields. Section 10.0 discusses the differences between NN and econometrics. Finally, section 11.0 concludes.

1 The views expressed in this paper are solely those of the author and do not necessarily reflect those of the University of Technology, Mauritius.

2. Features of the Brain-Stylized facts: Neuron and Perceptron

Prior to gaining insight into NN, it is useful to consider the basic components of the brain, known as neurons. A neuron is a minute processor that receives, processes and sends data to the next layer of the model. The brain is composed of about 100 billion neurons of many distinct types. Neurons are grouped together into an intricate network and work by transmitting electrical impulses. The reaction of a neuron on receipt of an impulse hinges on the intensity of the impulses received together with the neuron’s degree of sensitivity towards the neurons that dispatched the impulses. Two operations are performed inside a neuron: the first computes the weighted sum of all the inputs, while the second converts the output of that summation with respect to a certain threshold. A neuron has four main parts: the cell body, which is the site for processing and generating impulses; the dendrites, which act as signal receivers that accept signals from outside or from other neurons; the axon, which is the avenue that sends the message triggered by the neuron to the next neurons; and the synaptic terminals, which produce excitatory or inhibitory reactions in the receiving neuron. Brain power, in the end, is simply a function of this intricately complex network of connections between the neurons.

A perceptron is a model of simple learning, developed in 1962 by Rosenblatt. It is an element that weighs and sums up the inputs and compares the result with a predefined threshold value. The perceptron emits 1 if the weighted sum of inputs is greater than the threshold value, and 0 otherwise. Bishop (1995) points out that a perceptron may work as a linear discriminant in classifying elements that belong to different sets. The higher the number of neurons in a single hidden layer, the higher the complexity of the represented function. Unfortunately, when it comes to classification problems that are not linearly separable, the perceptron algorithm developed by Rosenblatt does not terminate. The architecture of a perceptron is shown in figure 1.

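Figure 1: A perceptron

[Diagram: the input vector X is weighted by W^T and summed (∑) together with b to give a, which is passed through f to give the output y.]

W^T = matrix of weights; X = vector of inputs; a = b + W^T X; y = f(a)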

The perceptron reminds us of the logit and probit models in econometrics. While a positive weight is considered excitatory, a negative weight reflects an inhibitory effect. The transformation involves two steps: the first computes the weighted sum of inputs (the activation), and the second transforms this activation into a response by using a transfer function. The synaptic weights pertain to interneuron connection strengths, which are used to store the acquired knowledge.
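The two-step transformation described above can be sketched in a few lines of Python. This is only an illustration: the function name, the default threshold of zero and the AND weights are assumptions chosen for the example, not values taken from the paper.

```python
def perceptron_output(inputs, weights, bias, threshold=0.0):
    """Step 1: activation a = b + W^T X. Step 2: compare a with the threshold."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0


# Hypothetical weights and bias that make the perceptron behave like a
# logical AND of two binary inputs (a linearly separable problem).
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

for pattern in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(pattern, perceptron_output(pattern, AND_WEIGHTS, AND_BIAS))
```

Both weights are positive (excitatory), while the negative bias keeps the unit silent unless both inputs are active.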

3. Network Architecture

The simplest form of NN consists of only two layers, the input and output layers (no hidden layer is present). This is sometimes referred to as the skip layer, which basically amounts to conventional linear regression modeling in a NN design whereby the input layer is directly connected to the output layer, hence bypassing the hidden layer. Like any other network, this simplest form of NN relies on weights as the connections between the inputs and the output, each weight representing the relative significance of a specific input in the computation of the output. It is important to bear in mind that the output generated depends heavily on the type of activation function used. However, because the hidden layer confers strong learning ability on the NN, practical applications use an architecture with three or more layers. This is shown in figure 1. It is vital to distinguish between two classes of weights in NN: those that connect the inputs to the hidden layer and those that connect the hidden layer to the output layer. In a parallel manner, there are two classes of activation functions, one found in the hidden layer and one in the output layer.

There are infinitely many ways to construct a NN. Two terms are used to describe the way in which a NN is organized: neurodynamics, which spells out the properties of an individual neuron such as its transfer function and how the inputs are combined, and architecture, which defines the structure of the NN including the number of neurons in each layer and the number and types of interconnections. Any network designer must factor in the following elements when building up a network:

1. Best starting values (weight initialisation)

2. Number of hidden layers

3. Number of neurons in each hidden layer

4. Number of input variables or combination of input variables (usually emanating from

regression analysis)

5. Learning rate

6. Momentum rate

7. Training time or amount of training (i.e., the number of iterations to employ)

8. Type of activation function to use in the hidden and output layers

9. Data partitioning and evaluation metrics

Ideally, a NN for a particular task has to be optimised over the entire parameter space of the learning rate, momentum rate, number of hidden layers and nodes, combination of input variables and activation functions. Attaining such an objective is computationally burdensome. There is widespread consensus in the empirical literature that the final NN model rests largely on a process of trial and error, as there is little theoretical guidance for building a NN. In that respect, the design of a network is considered an art rather than a science, which is why it is a time-consuming process. Nevertheless, the main criterion used for the design of a NN is to end up with the specification that minimises the errors, i.e., the optimal network topology.

3.1. Weight Initialisation

Each connection has an associated parameter indicating the strength of the connection, the so-called weight. By changing the weights in a specific way, the network can learn to map patterns presented at the input layer to target values at the output layer. To appreciate the need for weight initialization, it is useful to focus on the standard gradient descent algorithm, which can be succinctly described as follows:

1. Define an error function.

2. Randomly pick an initial set of weights.

3. Evaluate the vector representing the gradient of the error function.

4. Update the weights until a flat minimum has been reached; otherwise, the weights should again be modified until convergence is ensured.
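The steps of the gradient descent algorithm can be sketched as follows for a single weight. This is a minimal illustration under stated assumptions: the quadratic error function E(w) = (w - 3)^2, the learning rate of 0.1, the tolerance and the random seed are all hypothetical choices made for the example.

```python
import random


def error(w):
    """Hypothetical quadratic error function with a unique minimum at w = 3."""
    return (w - 3.0) ** 2


def error_gradient(w):
    """Derivative of the error function with respect to the weight."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, tolerance=1e-8, max_steps=10_000):
    """Move against the gradient until the error surface is (almost) flat."""
    for _ in range(max_steps):
        gradient = error_gradient(w)
        if abs(gradient) < tolerance:  # flat minimum reached
            break
        w -= learning_rate * gradient  # step in the downhill direction
    return w


random.seed(42)                               # fixed seed, for reproducibility only
initial_weight = random.uniform(-10.0, 10.0)  # random starting weight
final_weight = gradient_descent(initial_weight)
print(initial_weight, final_weight, error(final_weight))
```

With a single minimum, the starting weight affects only the number of steps needed, not the final answer.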

Figure 2: Quadratic error function

[Plot: error function (vertical axis) against weight (horizontal axis); gradient descent moves in the direction of the negative derivative, always moving downhill.]

Source: Own illustration

However, the above process becomes problematic when the error function is not well behaved, i.e., there is not a unique minimum point but rather several distinct local minima, as depicted in figure 3; the objective should then be geared towards attaining the global minimum. Indeed, complex error surfaces exhibiting valleys and ravines may appear when nonlinear activation functions are employed. Such an objective requires experimentation using random restarts from multiple places in weight space. The rationale for using diverse initial weights is to increase the chance of ending up near the global minimum. Under the standard gradient descent approach, the network output is derived for all the data points in order to estimate the error gradient from the differences between targets and outputs. Such error surfaces have properties that are unfavourable for gradient descent. A complex error surface can also arise from the use of more than one hidden layer. Consequently, it is not feasible to arrive at the global minimum of the error function analytically, so that, under such a scenario, NN training is synonymous with exploration of the error surface.
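The idea of random restarts can be illustrated with a one-dimensional error function that has two minima. The quartic function below is a hypothetical example (it is not taken from the paper), chosen because it has a local minimum near w = 1.13 and a lower, global minimum near w = -1.30; the learning rate, step count and seed are likewise assumptions.

```python
import random


def error(w):
    """Hypothetical non-convex error function with two minima."""
    return w ** 4 - 3.0 * w ** 2 + w


def error_gradient(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def gradient_descent(w, learning_rate=0.01, steps=2000):
    """Plain gradient descent: ends in whichever minimum the start leads to."""
    for _ in range(steps):
        w -= learning_rate * error_gradient(w)
    return w


def best_of_restarts(starting_weights):
    """Run gradient descent from every start and keep the lowest-error result."""
    candidates = [gradient_descent(w0) for w0 in starting_weights]
    return min(candidates, key=error)


random.seed(0)  # fixed seed, for reproducibility only
starts = [random.uniform(-2.0, 2.0) for _ in range(10)]
best = best_of_restarts(starts)
print(best, error(best))
```

A single run started at w = 1.5 settles in the local minimum, whereas comparing several starts makes it much more likely that at least one run reaches the global one.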

Figure 3: Complex error function

[Plot: error function (vertical axis) against weight (horizontal axis), showing several local minima.]

Source: Own illustration


3.2. Number of Hidden Layers

The number of hidden layers determines the number of layers in the NN architecture. For instance, one hidden layer signifies a three-layer NN model, two hidden layers signify a four-layer NN model and so forth. Though hidden layers do not represent any real concept and have no direct interpretation, they are extremely powerful in sieving out distinct pattern structures present in the data. In that respect, there is no parallel in econometrics to the hidden layer, and this is the real power of NN: it is what allows them to capture any nonlinearity present in the data. The hidden layer gives the network the ability to generalize; in practice, NN with one and sometimes two hidden layers are widely used and have performed very well. Raising the number of hidden layers scales up not only the computation time but also the probability of overfitting, which then leads to poor out-of-sample forecasting performance. The greater the number of weights relative to the size of the training set, the greater the ability of the network to memorise idiosyncrasies of individual observations, so that generalization to the validation set is lost. The usual recommendation is to begin with one or at most two hidden layers; additional layers are not recommended since it has been found that NN with four hidden layers do not improve results.

3.3. Number of Units or Neurons in the Hidden Layer

One of the greatest issues in designing a NN model lies in deciding on the number of hidden units in the hidden layer, as this is important to ensure good generalization; the best approach is to fit the model sequentially, adding one hidden unit each time. Generalization reflects the notion that a model based on a sample of the data is suitable for forecasting the general population. The fewer the neurons in a network, the fewer the operations required and the less time-consuming the task. In the same vein, introducing additional neurons (and layers) in the network entails more weights to be estimated. Indeed, the learning capability of a NN hinges on the number of neurons in its hidden layer. If there are too few hidden neurons, the model will not be flexible enough to model the data well. Conversely, with too many hidden neurons, the model will overfit the data. As the number of hidden nodes cannot be determined in advance, empirical experimentation is vital to determine this parameter. Technically speaking, the best number of hidden units hinges on a plethora of elements such as the number of input and output units, the number of training cases, the amount of noise in the targets, the complexity of the error function, the network architecture and the training algorithm.

Some specific rules of thumb have been established regarding the choice of the number of units in the hidden layers, such as half the input layer size and output layer size, two-thirds of the input layer size plus the output layer size, less than twice the input layer size and so forth. However, these rules are a mere reference when choosing a hidden layer size, as they focus only on the input and output layer sizes and overlook other factors. Consequently, there is no easy way to determine the optimal number of hidden units without training networks with distinct numbers of hidden units and estimating the generalization error of each. In that respect, no magic formula exists for arriving at the optimum number of hidden neurons, so that experimentation constitutes the only avenue. This implies incrementing the number of hidden units until there is no major improvement in performance. Often, it is better to start with a large number of epochs and a small number of hidden units, with the network that performs best on the testing set with the fewest hidden neurons being selected as the final one. Care must be exercised when testing a range of hidden neurons to keep all other parameters constant, since modifying any parameter creates a new NN with a distinct error surface, which would complicate the choice of the optimum number of hidden neurons.
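The rules of thumb and the incremental search described above can be sketched as follows. Several things here are assumptions: the first rule is read as half of the sum of the input and output layer sizes, the improvement threshold of 0.01 is arbitrary, and the test-set errors are made-up numbers standing in for the results of actually training networks of each size.

```python
def hidden_unit_rules_of_thumb(n_inputs, n_outputs):
    """Starting points only; the wording of each rule is interpreted, not exact."""
    return {
        "half_of_inputs_plus_outputs": (n_inputs + n_outputs) / 2,
        "two_thirds_of_inputs_plus_outputs": (2 * n_inputs) / 3 + n_outputs,
        "upper_bound_twice_inputs": 2 * n_inputs,
    }


def select_hidden_units(test_errors, min_improvement=0.01):
    """Add hidden units one at a time until the test error stops improving.

    test_errors: list of (n_hidden, test_set_error) pairs sorted by n_hidden,
    obtained with all other network parameters held constant.
    """
    best_size, best_error = test_errors[0]
    for size, err in test_errors[1:]:
        if best_error - err < min_improvement:
            break  # no major improvement: keep the smaller network
        best_size, best_error = size, err
    return best_size


# Hypothetical test-set errors for networks with 1 to 5 hidden units.
observed = [(1, 0.40), (2, 0.25), (3, 0.18), (4, 0.175), (5, 0.174)]
print(hidden_unit_rules_of_thumb(6, 1))
print(select_hidden_units(observed))
```

In this made-up example the search stops at three hidden units, because moving to four reduces the test error by less than the chosen threshold.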


3.4. Number of Inputs

The inputs should have a temporal structure and should not be too numerous. With n independent variables, there should be n input nodes in the proposed model. Usually, the practice is to go for simple models which have few input variables. The reason is that, as the number of model inputs scales up, the degrees of freedom of the governing equation also increase. While equations with high degrees of freedom have the capability to model the training data effectively, they fail miserably when given test data because they trace the random scatter of the training data, whereas models with fewer degrees of freedom follow only the general trend. Wasserman (1994) states that, though clustering of the input vectors scales down the size of the network and thereby assists in speeding up the learning process, there is no guarantee that the clustering algorithm employed is the most efficient one. To end up with an efficient clustering, it becomes necessary to experiment with distinct starting locations.

3.5. Learning Rate

NN models can be classified by whether the learning, or iterative adjustment, is supervised or unsupervised. Under supervised learning, the network is trained by providing it with inputs along with the targeted output values. The error, as determined by the difference between the target output values and the actual output values, is then backpropagated using the training algorithm until the desired error level is reached.

Under unsupervised learning, the iterative adjustment is guided only by the input vector, so that there is no feedback from the environment to show whether the outputs of the network are correct. Consequently, the network must discover features present in the data on its own. In fact, unsupervised learning does not use target values; it merely uses the input information to categorise the input patterns.

The learning parameter, or learning rate, controls the rate at which the weights are modified as learning occurs, and thus determines the speed of learning. As a matter of fact, gradient descent merely provides information about the direction in which to move but not about the step size. In that respect, the learning rate determines the size of the step taken towards the minimum of the error function. Too high a learning rate leads to oscillations around the minimum, while too low a learning rate can lead to slow convergence of the NN. The larger the learning rate, the greater the changes in the weights and, consequently, the quicker the network learns. However, on the other side of the coin, a larger learning rate triggers oscillations of the weights among suboptimal solutions. Alternatively stated, larger steps may engender quick convergence but may overstep the solution or go in the wrong direction on a peculiar error surface. Conversely, small steps may lead in the proper direction but at the cost of a large number of iterations, a truly time-consuming task. The aim is therefore to use the largest learning rate feasible without triggering oscillations, so that the network learns as efficiently as possible. The proper setting of the learning rate depends on the specific application under consideration.
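The effect of the step size can be seen on a deliberately simple example. The sketch below assumes the error function E(w) = w^2 (gradient 2w), a starting weight of 1 and four illustrative learning rates; none of these values come from the paper.

```python
def descend(learning_rate, w0=1.0, steps=20):
    """Gradient descent on E(w) = w^2; returns the whole path of weights."""
    path = [w0]
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w  # gradient of w^2 is 2w
        path.append(w)
    return path


for rate in (0.01, 0.4, 0.95, 1.1):
    final = descend(rate)[-1]
    print(f"learning rate {rate}: weight after 20 steps = {final:.6f}")
```

Each update multiplies the weight by (1 - 2 x learning rate). A rate of 0.01 therefore shrinks the weight by only 2% per step (slow convergence), 0.4 converges quickly, 0.95 overshoots the minimum on every step and oscillates around it while still converging, and 1.1 oscillates with growing amplitude and diverges.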

3.6. Momentum Rate

The momentum rate is incorporated to modify the backpropagation algorithm. The objective behind its use is basically to reach the minimum faster, because without such a momentum term it takes a long time to attain the minimum. The momentum term is also used with a view to improving the chances of ending up at the global minimum, since finding the global minimum is not guaranteed under the gradient descent algorithm: the error surface can contain many local minima in which the algorithm can become stuck. Technically speaking, the momentum term determines how past weight changes affect current weight changes, so that it in effect restrains side-to-side oscillations by filtering out high-frequency variations. Momentum encourages the weights to keep changing in the same direction, thus damping oscillations, and helps the network to escape local minima. It does so by making the current weight change depend not only on the current error but also on the previous weight change.
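A common way of writing the momentum update is: weight change = -(learning rate) x (current gradient) + (momentum rate) x (previous weight change). The sketch below applies this form to the hypothetical error function E(w) = w^2; the learning rate of 0.01, the momentum rate of 0.9 and the 50 steps are assumptions for illustration, and the paper itself does not give an explicit formula.

```python
def descend_with_momentum(gradient, w, learning_rate, momentum, steps):
    """Gradient descent in which each change carries over part of the last one."""
    previous_change = 0.0
    for _ in range(steps):
        change = -learning_rate * gradient(w) + momentum * previous_change
        w += change
        previous_change = change
    return w


def quadratic_gradient(w):
    """Gradient of the hypothetical error function E(w) = w^2."""
    return 2.0 * w


plain = descend_with_momentum(quadratic_gradient, 1.0, 0.01, 0.0, 50)
with_momentum = descend_with_momentum(quadratic_gradient, 1.0, 0.01, 0.9, 50)
print(plain, with_momentum)
```

Setting the momentum rate to zero recovers plain gradient descent. With the same small learning rate and the same number of steps, the run with momentum ends closer to the minimum at zero.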

3.7. Training

Training pertains to the process during which data are input into the network along with their corresponding output values, so that the network can adjust the weights in such a way that it reproduces the given input and output vectors with as low an error as possible. Training continues until the point where the test set error starts to rise, and the network weights are then restored to those of the iteration at which the test set error was at its minimum. Training should only cease when there is no further improvement in the error function based on a sensible number of randomly selected starting weights. Convergence is the point at which the network no longer improves. The purpose of training to convergence is to end up at a global minimum. Training is affected by many parameters such as the learning rate, the momentum value and the backpropagation algorithm. Goodness of fit is the criterion most widely applied by researchers to train their models. An epoch constitutes one loop through the training set. The relationship between error and training (learning) and testing can be illustrated as follows:

Figure 4: Training time

[Plot: error (vertical axis) against training time (horizontal axis, starting at 0), with two curves: learning/training and testing/generalisation.]

As depicted above, the training error falls gradually as the learning time scales up. The error on the testing sample, however, is U-shaped: it falls, attains a minimum and then begins to rise. There are two main types of network training. The first is the sequential mode, also called on-line or stochastic training, where weights are updated following the presentation of each pattern. The second is the batch mode, also called off-line or epochwise training, where the total weight changes are made by summing all the individual changes computed for each pattern in the training set.
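The difference between the two training modes can be sketched with the smallest possible model, a single weight w in y = w * x trained on squared error. The model, the three training patterns, the learning rate of 0.05 and the 100 epochs are all assumptions made for the illustration.

```python
def train_online(data, w, learning_rate, epochs):
    """Sequential (on-line) mode: update the weight after every pattern."""
    for _ in range(epochs):
        for x, target in data:
            error = w * x - target
            w -= learning_rate * error * x
    return w


def train_batch(data, w, learning_rate, epochs):
    """Batch (epochwise) mode: sum the changes over all patterns, then update."""
    for _ in range(epochs):
        total_change = sum(-learning_rate * (w * x - target) * x
                           for x, target in data)
        w += total_change
    return w


# Hypothetical training set generated by the relationship target = 2 * x.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w_online = train_online(DATA, w=0.0, learning_rate=0.05, epochs=100)
w_batch = train_batch(DATA, w=0.0, learning_rate=0.05, epochs=100)
print(w_online, w_batch)
```

On this noise-free example both modes recover the weight 2; they differ in the path taken, since the on-line mode already uses the updated weight when it processes the next pattern.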

3.8. Activation Function (squashing function)

The selection of the squashing or activation function has an important effect on the NN results. The activation function is a mathematical transformation of the summed weighted inputs that generates the output of the neuron. Usually, studies use a sigmoid activation function for the hidden layer and a linear or sigmoid one for the output layer. Technically speaking, an activation function comprises two parts: a combination function that merges all the input units into a single value (the weighted sum of inputs) and a transfer function that applies a nonlinear transformation to this value to produce the output of the unit.

Usually, the data pattern determines the form of the transfer function. Most studies resort to sigmoid transfer functions. The benefit of logistic or sigmoid functions is that they are continuous functions that rise or fall monotonically, saturate towards the minimum and maximum values, approximate the step function very well and are differentiable on the whole domain, and can thereby dramatically reduce the computational burden of training. It is important to note that NN where the hidden neurons have a sigmoidal activation function and the output neurons a sigmoidal or identity function are called Multi Layer Perceptrons (MLP). The sigmoid activation function can be stated as:

$$g(x) = \frac{1}{1 + e^{-cx}}$$

However, because the return value of the sigmoid function lies in the interval (0,1), this function cannot be used in NN that approximate functions that can also take negative values. To fill this gap, the empirical literature has recourse to the hyperbolic tangent function, which is also called the bipolar sigmoid function. Such a function can be stated as follows:

$$g(x) = \frac{1 - e^{-cx}}{1 + e^{-cx}}$$
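The two activation functions can be written directly from their formulas. In the sketch below the function names are illustrative and the slope parameter c defaults to 1, which is an assumption. The bipolar form as stated equals tanh(cx/2), so it coincides with the standard hyperbolic tangent tanh(x) when c = 2.

```python
import math


def logistic(x, c=1.0):
    """Sigmoid activation g(x) = 1 / (1 + e^(-cx)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-c * x))


def bipolar_sigmoid(x, c=1.0):
    """Bipolar sigmoid g(x) = (1 - e^(-cx)) / (1 + e^(-cx)); output in (-1, 1)."""
    return (1.0 - math.exp(-c * x)) / (1.0 + math.exp(-c * x))


for x in (-2.0, 0.0, 2.0):
    print(x, logistic(x), bipolar_sigmoid(x), math.tanh(x / 2.0))
```

The bipolar sigmoid is simply the logistic function rescaled from (0, 1) to (-1, 1), i.e. 2 * logistic(x, c) - 1. For very large negative values of c * x, math.exp overflows, so a production version would need to guard against that.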

3.9. Data Partitioning and Evaluation Metrics

Data should be divided to strike the best trade-off between the in-sample and out-of-sample performance of the NN model. While training the network, the researcher must be careful to avoid overtraining, whereby the network adjusts its weights in such a manner that it does too well on the training sample but fares poorly on values not included in the training sample. This requires cross-validation, which usually takes one of two forms. In the first, there are only two sets: the training set and the testing set used for out-of-sample analysis. In the second, there are three sets: the training set, the testing set and finally the validation set, the latter being employed to validate the NN architecture. Alternatively stated, the parameters of the network (i.e., the values of the synaptic weights) are calculated using the training set. After that, learning ceases and the network is evaluated with the data from the testing set. How well the network generalises is deduced by analysing its performance on the validation set and not on the test set, as the purpose of the test set is basically to decide when to cease training.
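The three-way partition and the use of the testing set to decide when to stop can be sketched as follows, keeping the paper's terminology (training set for the weights, testing set for the stopping decision, validation set for judging generalisation). The 60/20/20 shares, the chronological (unshuffled) split and the sequence of test-set errors are assumptions made for the illustration.

```python
def partition(records, train_share=0.6, test_share=0.2):
    """Split records, in their given order, into training, testing and validation sets."""
    n_train = round(len(records) * train_share)
    n_test = round(len(records) * test_share)
    training = records[:n_train]
    testing = records[n_train:n_train + n_test]
    validation = records[n_train + n_test:]
    return training, testing, validation


def best_stopping_epoch(test_errors):
    """Index of the epoch at which the testing-set error was lowest."""
    return min(range(len(test_errors)), key=test_errors.__getitem__)


observations = list(range(10))  # stand-in for ten data records
training, testing, validation = partition(observations)
print(training, testing, validation)

# Hypothetical testing-set errors recorded after each epoch (U-shaped).
errors_by_epoch = [0.90, 0.60, 0.40, 0.35, 0.38, 0.45]
print(best_stopping_epoch(errors_by_epoch))
```

The weights saved at the selected epoch would then be assessed on the validation set, which played no part in either fitting or stopping.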

Adya and Collopy (1998) state that the NN model should be assessed in terms of how well its architecture has been implemented, and such an analysis is undertaken under the theme of effectiveness of implementation. The guidelines used in that case comprise convergence (determining the network’s convergence capability in sample), generalization (testing the ability of the network to recognize patterns outside the training sample) and stability (how far results are consistent during the validation phase under distinct samples of data). Consequently, they point out that the results of NN studies should be viewed with care, as only those studies that cater for both effectiveness of validation and effectiveness of implementation are reliable. Studies that fall short on either of these two points have little contributive value and should be deemed inconclusive. For instance, studies that fail to report the in-sample performance of the NN are hard to accept, as the NN configuration has not been analysed carefully.

Distinct measures of accuracy have been employed in the empirical literature. However, none of them is perfect, so it is better to use a number of distinct performance measures. The rationale is that opting for more than one performance measure reduces the bias that arises when the best model is selected on the basis of just one performance measure.

4. Choice of Network Topology

Network topologies that are too simple, with too few connections between their elements, have little capacity to store knowledge about the regularities in the training data. On the other hand, more complex network topologies tend to have more weights to be adjusted in the training process of the network. The rationale for the overwhelming use of the multilayer perceptron (MLP) is that such a network architecture is akin to a multivariate non-linear regression model. The most common neural network model is the MLP, also known as a supervised network by virtue of its need for a targeted output in order to learn. The aim of such a network is to build a NN model that properly maps the input to the output. The diagram below depicts a three-layer NN model. At the outset, the inputs are channelled into the input layer (first layer). These inputs are then processed (as shown by the arrows) by simply multiplying them by weights. Inside the hidden layer, these weighted inputs are summed and then passed through a stated activation function. The resulting data are multiplied by weights and processed one last time in the output layer (third layer) to generate the NN output. It is vital to bear in mind, however, that theory does not provide a complete guide for building an MLP. Alternatively stated, no foolproof, fully fledged theoretical explanation exists for obtaining the optimal architecture of the MLP. The main source behind the use of a one-hidden-layer MLP is the work of Hornik, Stinchcombe and White (1989), who point out that an MLP with only one hidden layer can approximate any function. This underlies the rationale as to why, in practice, many studies using MLP do not go beyond one hidden layer.
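The flow of data through a three-layer MLP can be sketched as a single forward pass. The sketch assumes a logistic activation in the hidden layer and an identity (linear) output unit, and the weights shown are arbitrary illustrative numbers rather than trained values.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input layer -> hidden layer (weighted sum + logistic) -> linear output."""
    hidden = [
        logistic(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return output_bias + sum(v * h for v, h in zip(output_weights, hidden))


# Two inputs, two hidden neurons, one output; all numbers are hypothetical.
HIDDEN_WEIGHTS = [[0.5, -0.4], [0.3, 0.8]]  # one row of weights per hidden neuron
HIDDEN_BIASES = [0.1, -0.2]
OUTPUT_WEIGHTS = [1.2, -0.7]
OUTPUT_BIAS = 0.05

print(forward([1.0, 2.0], HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS))
```

Training would consist of adjusting the two sets of weights (input to hidden, and hidden to output) so that this output moves closer to the target.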

5. Data Preprocessing

Rescaling the data is considered beneficial for forecasting accuracy because the estimation algorithms tend to perform better when the input values are small and centered around zero. An important step in building a NN is to select the proper data pre- and postprocessing. This is of major significance by virtue of the impact of the curse of dimensionality. One way to represent the training data is to specify intervals for the input variables and then to classify the data records by stating in which interval the values of a record lie. Contrary to the intuitive assumption that additional data should improve the performance of NN, this is not necessarily the case. As a matter of fact, reducing the dimensionality of the training data is vital for the proper functioning of NN. In shape recognition, for instance, it makes more sense to preprocess the picture to extract the features vital for recognition of the shape and then to apply the NN to these features. The quality and quantity of data are widely recognized as important issues in the development of neural network models. It is imperative that the input data be free from noise to ensure that the network is given the best possible training and can generalize better later. Meade (1995) states that NN are data-dependent, so that the learning algorithms are only as good as the data shown to them. Usually, data are transformed prior to submission to the NN model. To achieve good prediction performance when applying NN, at the very least the raw data must be scaled between the lower and upper bounds of the transfer function (usually between zero and one, or between minus one and one). First differencing (to remove a linear trend from the data) and taking logarithms (useful for data that can take both small and large values) constitute the two most widely used data transformations in NN. Another technique is to use ratios. Sampling or filtering the data refers to removing observations from the training and testing sets to generate a more uniform distribution. The benefit of filtering is a fall in the number of training facts, which enables the testing of more input variables, random starting weights or hidden neurons rather than the training of large data sets. In practice, data preprocessing involves much trial and error.
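The three transformations mentioned above (scaling to the bounds of the transfer function, first differencing and taking logarithms) can be sketched as small helper functions. The function names and the sample series are hypothetical, and min-max scaling is one common way of rescaling that the paper does not name explicitly.

```python
import math


def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly to [lower, upper]; needs max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]


def first_difference(values):
    """Change between consecutive observations; removes a linear trend."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def log_transform(values):
    """Natural logarithm; compresses series spanning small and large values."""
    return [math.log(v) for v in values]


series = [10.0, 20.0, 30.0]  # hypothetical raw data
print(min_max_scale(series))             # scaled for a logistic transfer function
print(min_max_scale(series, -1.0, 1.0))  # scaled for a bipolar transfer function
print(first_difference([1, 3, 6, 10]))
print(log_transform([1.0, 100.0, 10000.0]))
```

The scaling bounds should be taken from the training data only and then reused for the testing and validation sets, so that no information from those sets leaks into training.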

6. Strengths of NN

The following are deemed as the strengths of NN:

1. The greatest power of Neural Networks is that a network endowed with a finite number of hidden units

can approximate any continuous function to any desired degree of accuracy. This has been

commonly referred to as the universal approximator property. A universal approximator

signifies that, given an ample number of hidden layer units, the NN model can approximate any

functional form to any degree of accuracy. In fact, many authors have shown that a three-layer

NN equipped with a logistic activation function in the hidden units, constitutes a universal

approximator. This result holds, however, only on condition that

an ample number of hidden units is included, so that the network can approximate

nearly any linear or nonlinear function to a desired level of precision.

2. No prior knowledge of the data generating process is needed for implementing NN, i.e., NN are

free from statistical assumptions. ANN is more robust to missing and inaccurate data. NN should,

in theory, be able to detect and duplicate any complex nonlinear pattern in the data. In a parallel

manner, NN is not rigid and hence it can be customised to any architecture as per the fancy of the

forecaster. More specifically, NN can encompass many models such as linear regression, binary

probit model and others by simply tweaking the activation functions and the network

architecture.

3. Since regression analysis requires the functional form of the model to be stated, model

misspecification may arise. However, such a problem of model misspecification does not occur

in the case of NN since no specification is imposed; the network merely learns the hidden

relationship in the data. Although nonlinear functions can be linearised by using specific

mathematical transformations in economics and finance, the problem lies in

knowing the proper transformation to be applied and this may not be easy in practice. But, when

it comes to using NN, there is no real need to know about the functional form to be applied.

Why? Because the universal approximator property under NN ensures that NN can mimic almost

any functional form.

4. One problem related to Neural Networks should be regarded not as a weakness peculiar to them but as

something common to other optimisation techniques. Indeed, many studies admit

that their estimates may suffer from bias because the coefficients correspond to a local rather than

a global optimum. But such a feature is also present in many nonlinear optimisation tasks and

there is no “silver bullet” to obviate such a problem.
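The three-layer network with logistic hidden units mentioned in item 1 can be written out explicitly. The notation below is assumed for illustration and is not taken from the paper: k inputs x_i, H hidden units, input-to-hidden weights \gamma, hidden-to-output weights \beta, and a linear output unit.

```latex
\[
  \hat{y}(x) \;=\; \beta_0 + \sum_{j=1}^{H} \beta_j \,
  \sigma\!\Big(\gamma_{j0} + \sum_{i=1}^{k} \gamma_{ji}\, x_i\Big),
  \qquad
  \sigma(z) = \frac{1}{1 + e^{-z}} .
\]
```

Under this notation, the universal approximator property says that the approximation error can be made arbitrarily small by letting H grow. The nesting of models noted in item 2 also becomes visible: removing the hidden layer and keeping an identity output gives \hat{y}(x) = \beta_0 + \sum_i \beta_i x_i, i.e. a linear regression, while passing that linear index through the standard normal distribution function gives a probit-type specification.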

7. Drawbacks of NN

Assessing the other side of the coin, the following drawbacks were noted for NN:

1. The chief pitfall related to the application of NN has been termed the black box problem. One

avenue employed has been to generate rules from NN that are easy for a human user to

understand; these rules must be sufficiently simple yet accurate. The conditions of the rules

describe a subregion of the input space. Based on the fact that a single rule is not powerful

enough to approximate the nonlinear mapping of the network well, the remedy is basically to

split the input space of the data into distinct subregions. Nevertheless, there is still a ray of hope

looming on the NN horizon following the proposition made by Refenes, Zapranis and Francis

(1994), namely that the black box problem can be alleviated by resorting to sensitivity

analysis. Such a technique involves plotting the value of the output for a range of values of a

given input, with all other inputs remaining fixed at their sample mean. Consequently, if the

value of the output stays stable for distinct values of the input under inspection, then, the

researcher can presume that this input contributes little to the predictive power of the NN

model. Such a process is successively applied to all inputs until the researcher ends up with all

the relevant inputs in the model. Hence, the NN model is said to be pruned via the elimination of

superfluous inputs.

2. The addition of too many hidden units gives rise to the problem of overfitting the data, meaning that

the network fits the training data too closely but generates inferior results out of

sample. In a forecasting exercise, the effect of overfitting manifests itself in the form of

poor out of sample forecasts. Alternatively stated, overfitting signifies that rather than learning

the fundamental structure of the training set that would enable a satisfactory and sufficient

generalization, the network learns insignificant details of individual cases. Overfitting can be

caused either by too small a sample or by too complex an NN model, so that the NN tends to

memorize rather than generalize from the data. To avoid overfitting in that case, the NN

model should be kept small, i.e., parsimonious.

There are two ways to deal with the overfitting problem. The first is to train the network

model on the training set and then to analyse the performance on the test set. This technique is

usually referred to as early stopping, whereby the data are split into three parts: the training

set, the test set and the validation set. In that context, the training set is used by the algorithm to

estimate the network weights; the training set basically represents the in-sample period of the

regression model, while the test set is dedicated to out-of-sample analysis. Training is halted once the error

on the validation set stops improving. The early stopping

strategy is, however, not without caveats. The reason is that in the case of small samples, the

three-pronged decomposition of the data set becomes somewhat problematic. Results can also be

sensitive to the observations contained in each of the three specific data sets. To curb the

overfitting problem, Refenes (1995) recommends cross-validation to be implemented during

learning. The second approach is to have recourse to one of the distinct network pruning

algorithms.

3. NN depends so much on the quality of the data that the algorithms employed are only as good as

the data used to apply them. This is why NN are often called weakly deterministic systems.

Similarly, NN need large samples to work. For instance, if a simple NN model is equipped with

a large number of weights, then it is most likely that this leads to a limited number of

degrees of freedom.

4. The construction of the NN model can be a time-consuming process since building up the NN

architecture is a strenuous activity involving trial and error. Moreover,

subjectivity in the construction of NN has led researchers to consider the results

doubtful. Learning results in NN may not be stable because of the random initialization of

weights coupled with the complexity of the error surface. So, while building up the NN, it becomes

imperative to keep such subjectivity as low as possible. In the same vein, the early

stopping strategy used to fight against the overfitting problem may be subject to arbitrary

judgements made by the researcher. For instance, the researcher must make judgement about

dividing the sample into the training, validation and testing sets. Usually, researchers cling to

some rule of thumb for dividing the data sets. However, such a shortcoming is not a

major concern, bearing in mind the common out-of-sample forecasting practice whereby 30% of the sample

tends to be reserved as the holdout sample.

5. NN should never be viewed as a panacea. For instance, in extreme cases such as a major

crisis that could substantially alter the movement in prices, NN prediction is unlikely to be

satisfactory. Consequently, expert judgments should always be present. NN models are prone to

problems of local optima, a conspicuous feature noted in case of non-linear optimization.
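The sensitivity analysis described in item 1 can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained network (it depends on the first input and ignores the second), and the helper names and sample values are invented for the example.

```python
import math

def sensitivity_profile(model, sample, input_index, grid):
    """Vary one input over `grid`, holding all other inputs at their sample means."""
    n_inputs = len(sample[0])
    means = [sum(row[i] for row in sample) / len(sample) for i in range(n_inputs)]
    profile = []
    for value in grid:
        point = list(means)          # start from the vector of sample means
        point[input_index] = value   # move only the input under inspection
        profile.append(model(point))
    return profile

def output_range(profile):
    """Spread of the output over the grid; a value near zero suggests a superfluous input."""
    return max(profile) - min(profile)

def toy_model(x):
    # Hypothetical "trained network": logistic in x[0], independent of x[1].
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 1.0)))

sample = [[0.0, 0.2], [0.5, 0.9], [1.0, 0.4]]   # hypothetical observations
grid = [i / 10 for i in range(11)]              # values 0.0, 0.1, ..., 1.0

range_x0 = output_range(sensitivity_profile(toy_model, sample, 0, grid))
range_x1 = output_range(sensitivity_profile(toy_model, sample, 1, grid))
```

Here the output moves noticeably when the first input is varied and not at all when the second is varied, so the second input would be a candidate for pruning. In practice the profile would normally be plotted rather than summarised by its range.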

8. Backpropagation Algorithm

Backpropagation was developed independently by two researchers and provided a learning rule for

NN. By default, the backpropagation algorithm, which constitutes the most popular method for training

a network, is an optimization technique that uses the gradient descent philosophy. It represents a local

search algorithm that minimizes the error function. In that respect, the norm is to cease training when

the algorithm stops improving the evaluation metric applied, such as the sum of squared errors. The

backpropagation algorithm is deemed to be more efficient, with satisfactory performance noted on

unseen data, relative to conventional optimization techniques. However, the hitch attached to this

algorithm is that it usually entails a time-consuming convergence process. In addition,

another main pitfall is that it may end up at a local minimum rather than the global minimum,

depending on the random initialization of weights. The local minimum problem manifests when the network

converges or stabilizes on an inadequate solution (i.e., a local minimum) which is not the optimal

solution, i.e., the global minimum. An NN must be estimated hundreds or thousands of times with

distinct sets of starting values to avoid the local minimum.
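The dependence on starting values can be illustrated with a deliberately simple sketch. The one-parameter error surface below is a toy function chosen for the example, not a network error function, and a small fixed grid of starting values is assumed in place of random initialization so that the outcome is reproducible.

```python
def error(w):
    """Toy non-convex error surface with one local and one global minimum."""
    return w ** 4 - 3.0 * w ** 2 + w

def error_gradient(w):
    """Derivative of the toy error surface."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def descend(w, learning_rate=0.01, tolerance=1e-10, max_steps=10_000):
    """Plain gradient descent; stops once the gradient is numerically zero."""
    for _ in range(max_steps):
        grad = error_gradient(w)
        if abs(grad) < tolerance:
            break
        w -= learning_rate * grad
    return w

# Re-estimate from several starting values and keep the solution with the lowest error.
starting_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
solutions = [descend(w0) for w0 in starting_values]
best_w = min(solutions, key=error)
```

Starting values on the right of the surface settle at the local minimum near w = 1.13, whereas the others reach the global minimum near w = -1.30; only by comparing the error across restarts is the better solution identified.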

With backpropagation, the input data are repeatedly presented to the neural network. With each

iteration, a comparison is drawn between the output of the neural network and the targeted output, the

difference between which represents an error. This error is then fed back (propagated backwards) through the

network to modify the weights in such a way that the error steadily falls with each iteration and

the neural model gradually approaches the targeted output. The training process

can be stopped once the network produces the desired output or no longer appears to be

learning. The gradient descent philosophy, which is embedded inside the backpropagation algorithm,

works as follows:

1. Select some random initial values for the model parameters.

2. Calculate the gradient of the error function with respect to each model parameter.

3. Change the model parameters so that we move a short distance in the direction of the

greatest rate of decrease of the error, i.e., in the direction of the negative gradient. To reduce the

value of the error function, we always move in the reverse direction of the gradient.

4. Repeat steps 2 and 3 until the gradient gets close to zero.
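The four steps above can be sketched directly in code. The example is a minimal illustration, not the backpropagation algorithm itself: a toy two-parameter error function with a known minimum is assumed in place of a network's error function, and the learning rate, tolerance and seed are arbitrary choices.

```python
import random

def error(params):
    """Toy convex error function with its minimum at (3, -1)."""
    w0, w1 = params
    return (w0 - 3.0) ** 2 + 2.0 * (w1 + 1.0) ** 2

def gradient(params):
    """Gradient of the toy error function with respect to each parameter."""
    w0, w1 = params
    return [2.0 * (w0 - 3.0), 4.0 * (w1 + 1.0)]

def gradient_descent(params, learning_rate=0.1, tolerance=1e-8, max_steps=10_000):
    for _ in range(max_steps):
        grad = gradient(params)                      # step 2: compute the gradient
        if max(abs(g) for g in grad) < tolerance:    # step 4: stop when it is close to zero
            break
        # step 3: move a short distance against the gradient
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params

rng = random.Random(42)                              # fixed seed, for reproducibility only
initial = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]   # step 1: random initial values
fitted = gradient_descent(initial)
```

Because this error function has a single minimum, the starting values do not matter here; on a network's error surface they do, which is why the local minimum problem arises.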

A backpropagation network differs from a Radial Basis Function Network (RBFN) in two main

respects. First, in backpropagation, both the hidden and the output layers are nonlinear sigmoid

functions, whereas in RBFN the output layer is linear and the hidden layer is a Gaussian

function. Second, while the outputs of the hidden layer in backpropagation are functions of the

products of weights and inputs, in RBFN, the outputs of the hidden layer are functions of distances

between input vectors and centre vectors. In fact, RBFN selects centres of the input vectors rather than

random weights in the first step of its algorithm. Consequently, the presence of the linear output layer

under RBFN renders its learning process very fast relative to the backpropagation model.
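The second difference can be made concrete by comparing how a single hidden unit is computed in each case. This is a sketch under assumptions: the function names, the `width` parameter of the Gaussian and the numerical values are chosen for illustration only.

```python
import math

def sigmoid_hidden_unit(x, weights, bias):
    """Backpropagation-style unit: logistic function of the weighted sum of inputs."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_hidden_unit(x, centre, width):
    """RBFN-style unit: Gaussian function of the distance between input and centre."""
    squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-squared_distance / (2.0 * width ** 2))

x = [1.0, 2.0]
h_sigmoid = sigmoid_hidden_unit(x, weights=[0.5, -0.25], bias=0.0)    # weighted sum is 0
h_at_centre = gaussian_hidden_unit(x, centre=[1.0, 2.0], width=1.0)   # distance is 0
h_far_away = gaussian_hidden_unit(x, centre=[4.0, 6.0], width=1.0)    # distance is 5
```

The Gaussian unit responds most strongly when the input coincides with its centre and fades as the input moves away, whereas the sigmoid unit responds to the size of the weighted sum regardless of distance from any reference point.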

9. Applications of NN in Distinct Fields

NN have been widely used in diverse fields such as portfolio management, credit rating and

predicting bankruptcy, forecasting exchange rates, predicting stock values, inflation and cash

forecasting, forecasting electricity consumption and others. The power of NN has been so strong that

most of the major investment banks such as Goldman Sachs and Morgan Stanley have dedicated

departments for implementing NN. However, NN has not only been applied in the field of finance and

economics but has also been extended to other fields such as computer science, marketing, medical

science, speech and voice recognition and many other applications. Many studies have found that the

forecasting accuracy of NN tends to exceed that of a well-established linear regression model. For

instance, Gonzalez (2000) found error reductions in the range of 13 to 40% when comparing

the performance of NN vis-à-vis linear regression models. Such superior performance tends to manifest

not only in in-sample but also in out-of-sample forecasting accuracy. Moreover, some studies have

demonstrated the superiority of ANN over multiple regression (Spangler et al. 1999; Uysal & Roubi

1999; Fadlalla & Lin 2001; Nguyen & Cripps 2001). For the purpose of the current study, the coverage

will not go beyond the field of economics and finance.

In case of exchange rates, the most commonly cited example is the work of Verkooijen (1996)

who attempted to analyse the forecasting power of distinct models in case of predicting the monthly

US Dollar-DM exchange rate. He found that NN were powerful not only in case of out of sample

forecasts but also in case of predicting the direction of change of the exchange rate. In case of stock

market, Refenes, Zapranis and Francis (1994) found that NN outperformed regression models both in sample and

out of sample under the arbitrage pricing theory framework. Besides, some works have

focused on forecasting competitions. In that dimension, Stock and Watson (1998) had recourse to forty-nine

forecasting methods. Intriguingly, they found that NN underperformed compared to the naïve

AR(4) forecast and also to most other methods. However, their results are considered inconclusive on

the grounds that NN were not properly applied with the objective of dealing with the overfitting

problem that usually gnaws at an effective application of NN. Swanson and White (1997) undertook

another forecasting competition for nine US macroeconomic variables and concluded

that NN merely generate ordinary performance. However, the result of their work is likewise considered

inconclusive because of the way the overfitting problem was curbed. Ironically, the authors acknowledged that

the Schwarz Information Criterion (SIC) did not do justice to the true potential of the NN

methodology. In the case of forecasting bankruptcy, Salchenberger et al. (1992) show the superiority of

neural networks over Logit analysis, while Tan (1996) concludes that neural networks are superior to

Probit analysis. McNelis and Chan (2004) find that nonlinear models such as Neural Networks best

describe the inflationary/deflationary dynamics in Hong Kong on both the in sample and out of sample

diagnostics. Moshiri, Cameron and Scuse (1999) compare the performance of three distinct types of

Neural Networks models, namely the backpropagation neural network (BPN), the radial basis function

network (RBFN) and the recurrent neural network (RNN) in the case of predicting inflation. They find

that RNN performs better in forecasting inflation over longer horizons relative to RBFN and BPN,

with BPN in turn outperforming RBFN. Kumar and Walia (2006) find that NN models perform better

relative to conventional models for predicting both daily and weekly cash flows for a bank branch.

Planning for electricity consumption has become a vital task in many countries with a view to

optimising the use of scarce resources.

It is vital to note that in case of time series forecasting, Sharda and Patil (1992) conclude that

NNs are powerful enough to directly capture seasonality to the effect that there is no need to adjust the

data for seasonality. In contrast, Nelson et al. (1999) find that NNs tend to do better in the case of

deseasonalised data relative to unprocessed data. Heravi et al. (2004) applied NN models to seasonally

unadjusted data on the ground that seasonal adjustment may induce nonlinearity and they find that NNs

generate inferior results relative to linear models in terms of RMSE but in case of the direction of

change, NNs maintain an edge over their counterpart. Recently, Zhang and Kline (2007) conduct an

extensive analysis of the effectiveness of distinct data preprocessing (basically a large set of data from

the M3 competition without adjusting for trends or seasonality) and modeling approaches (many

models included) using NN for forecasting seasonal time series. Using both parametric and

nonparametric diagnostics, they find that simpler models, both in form of set of inputs and number of

hidden nodes, do better than more complex models. Above all, they point out that data preparation and

the selection of inputs are vital steps to ensure proper performance of the NN models.
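Several of the studies above judge forecasts on two different criteria, RMSE and the direction of change, and a model can do well on one while doing poorly on the other. The sketch below shows one way of computing both. The definition of a directional "hit" used here (the forecast change from the last observed value has the same sign as the actual change) is an assumption, since the cited studies may define it differently, and the data are hypothetical.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error of the forecasts."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def directional_accuracy(actual, forecast):
    """Share of periods in which the forecast gets the sign of the change right."""
    hits = 0
    for t in range(1, len(actual)):
        actual_change = actual[t] - actual[t - 1]
        forecast_change = forecast[t] - actual[t - 1]   # change implied by the forecast
        if actual_change * forecast_change > 0:
            hits += 1
    return hits / (len(actual) - 1)

actual = [1.0, 2.0, 1.5, 2.5]      # hypothetical observed series
forecast = [1.0, 1.5, 2.5, 2.0]    # hypothetical forecasts for the same periods

rmse_value = rmse(actual, forecast)
hit_rate = directional_accuracy(actual, forecast)
```

In this example the forecasts get the direction right in two of the three periods even though the errors in levels are sizeable.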

10. NN and Econometrics

It is important to bear in mind that NN do not deviate much from

econometrics, on the ground that the simplest types of NN are closely related to standard econometric

techniques; many papers have attempted to establish parallels between NN and econometric methods to

enhance insight into these models for macroeconomic and financial forecasting. For example, a

two-layer feedforward NN with an identity activation function is synonymous with a linear regression

model. Besides, the weights in NN represent the regression coefficients with the counterpart of the bias

term being the intercept term in econometrics. Alternatively stated, there is substantial overlap between

these two techniques. While NN focuses on altering the weights to attain the minimum point of the cost

function, econometrics relies on different assumptions such as homoscedasticity, no autocorrelation

and so forth. Based on the fact that NN does not constitute a full-fledged tool, many studies have

considered it as a complementary tool to econometric analysis. The power of NN lies in its ability to

generate insight in cases where theory provides limited guidance in specifying the parameters of the

estimating equation, its functional form along with assumptions for the underlying data employed.
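The equivalence between a two-layer network with an identity activation function and a linear regression can be checked numerically. The sketch below is illustrative only: the data are hypothetical, a single input is assumed, and the learning rate and number of epochs are arbitrary choices that happen to give convergence on these data.

```python
def ols_fit(xs, ys):
    """Closed-form simple regression: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def train_linear_neuron(xs, ys, learning_rate=0.05, epochs=5000):
    """One output unit with identity activation, trained by gradient descent on the MSE."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(bias + weight * x) - y for x, y in zip(xs, ys)]
        grad_weight = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_bias = 2.0 * sum(errors) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return bias, weight

xs = [0.0, 1.0, 2.0, 3.0, 4.0]     # hypothetical regressor
ys = [1.1, 2.9, 5.2, 7.1, 8.8]     # hypothetical dependent variable

intercept, slope = ols_fit(xs, ys)
bias, weight = train_linear_neuron(xs, ys)
```

After training, the weight coincides with the regression coefficient and the bias term with the intercept, up to numerical precision; the difference lies only in how the values are reached (iterative learning versus closed-form estimation).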

NN and econometrics basically represent two sides of the same coin. Alternatively stated, NN

is not very different from econometrics, since the concepts are the same while the

terminologies differ. The following table summarises the differences

between NN and econometrics:

| Model | NN | Econometrics |
|---|---|---|
| Regression model | Non-parametric | Parametric |
| Linear/Nonlinear | Non-linear | Mostly linear; though non-linearity may be captured by other specific models |
| Assumptions | No need to make assumptions about the underlying data | BLUE properties do not usually hold on a perfect basis |
| Weights | Weights represent the synaptic strengths; the intensity and strength of the connections of the neurons | Weights represent the regression coefficients |
| Philosophy | Learning and training | Estimation |
| Paradigm | Diagnostic | Sampling |
| Data set | Training, testing and validation | In sample and out of sample analyses |
| Classification tool | Perceptron | Logit/Probit models |
| Method | Least mean squares | Ordinary least squares |
| Operation | Simultaneous updating and optimization of network weights | Regression coefficients determined in groups |
| Deterministic system | Weak as NN is data-dependent | Relatively stronger than NN based on assumptions made |
| Implementation time | Usually time-consuming process; training may take hours or even days | Easily implemented with econometric software |
| Specificities | No specific nature, silent and can handle either time series or cross-sectional data | Three-pronged nature in terms of cross-sectional, time-series and panel data analyses |
| Theoretical background | Atheoretical like VAR | Usually based on theory with laying out of the functional form of the model |
| Upgrades | Genetic Algorithms, Simulated Annealing and Fuzzy Logic | Nothing new; conventional models |
| Variables | Inputs and Outputs | Independent variables and dependent variables |
| Intercept | Bias term | Intercept term |
| Error | Embedded inside, iteratively used to approach targeted output | Residuals |

Source: Own computation

11. Conclusion

This paper provided a simplified approach to Neural Networks along with explanations of their different

components. Neural Networks have gained so much ground that they are now termed the sixth

generation of computing. As a matter of fact, Neural Networks have been applied in many fields such

as science, finance, credit risk, economics and econometrics. Their ability to learn and their flexibility

render them a powerful tool, though the black box problem reduces their usefulness. Nonetheless, the

predictive power of NN cannot be denied, and this still makes it one of the best forecasting tools not

only among practitioners but also among central bankers around the world.

References

[1] Adya, M. and Collopy, F. (1998), “How effective are neural networks at forecasting and

prediction? A review and Evaluation”. Journal of Forecasting 17 481-495.

[2] Bishop, C. M. (1995), Neural networks for pattern recognition. Oxford, UK: Oxford University

Press.

[3] Salchenberger, L. M., Cinar, E. M., and Lash, N. A. "Neural Networks: A New Tool for

Predicting Thrift Failures," Decision Sciences (23:4), July/August 1992, pp. 899-916.

[4] Fadlalla, A. & Lin, C.-H. (2001), “An analysis of the applications of neural networks in

finance”. Interfaces, 31, 4 (July/August), pp. 112–122.

[5] Gonzalez, S. (2000): “Neural Network for Macroeconomic Forecasting: A Complementary

Approach to Linear Regression Models”, Bank of Canada Working Paper 2000-07.

[6] Heravi, Saeed & Osborn, Denise R. & Birchenhall, C. R., (2004). "Linear versus neural

network forecasts for European industrial production series," International Journal of

Forecasting, Elsevier, vol. 20(3), pages 435-446.

[7] Hornik, K., Stinchcombe, M. and White, H. (1989), “Multilayer feedforward networks are

universal approximators”, Neural Networks, vol. 2, no. 5, pp. 359–366.

[8] Kumar, P. and Walia, E., (2006), “Cash Forecasting: An Application of Artificial Neural

Networks in Finance”, International Journal of Computer Science and Applications 3 (1): 61-

77.

[9] McNelis, Paul D. and Chan, Carrie K.C., (2004), "Deflationary Dynamics in Hong Kong:

Evidence from Linear and Neural Network Regime Switching Models," Working Papers

212004, Hong Kong Institute for Monetary Research.

[10] Meade, N. (1995), “Neural network time series forecasting of financial markets - Azoff, E.M.”,

International Journal of Forecasting, Vol. 11, pp. 601–602.

[11] Moshiri, S., Cameron, N. E. and Scuse, D. (1999), "Static, Dynamic, and Hybrid Neural

Networks in Forecasting Inflation", Computational Economics, Vol. 14 (3). p 219-35.

[12] Nelson, C.A. (1999). Neural plasticity and human development. Current Directions in

Psychological Science, 8, 42-45.

[13] Nguyen, N. & Cripps, A. (2001) Predicting housing value: a comparison of multiple regression

analysis and artificial neural networks. Journal of Real Estate Research, 22, 3, pp. 313–336.

[14] Refenes, A.P. (ed.) (1995), “Neural network in the capital markets”, John Wiley & Sons Ltd.

[15] Refenes, A.N., Zapranis, A. and Francis, G., (1994), “Stock performance modeling using neural

network: A comparative study with regression models”, Neural Network (1994) 5: 961–970.

[16] Sharda, R. and Patil, R.B. (1992), “Connectionist approach to time series prediction: An

empirical test”, Journal of Intelligent Manufacturing, 3 (1992), 317-23.

[17] Spangler, W.E., May, J.H. & Vargas, L.G. (1999) Choosing data-mining methods for multiple

classification: representational and performance measurement implications for decision

support. Journal of Management Information Systems, 16, 1 (Summer), pp. 37–62.

[18] Stock, J.H. and Watson, M. W. (1998): "A Comparison of Linear and Nonlinear univariate

models for Forecasting Macroeconomic Time Series," National Bureau of Economic Research

Working Paper 6607, June 1998.

[19] Swanson, Norman R. and White, Halbert (1997): "A Model Selection Approach to Real-Time

Macroeconomic Forecasting Using Linear Models and Artificial Neural Networks", Review of

Economics and Statistics, 79, November 1997, p.540-550.

[20] Tan, S.S., Smeins, F.E., 1996. Predicting grassland community changes with an artificial neural

network model. Ecol. Model. 84, 91–97.

[21] Uysal, M. and S. El Roubi. (1999), “Artificial Neural Networks vs Multiple Regression in

Tourism Demand Analysis”. Journal of Travel Research, 38(2): 111-118.

[22] Verkooijen, W. (1996), “A Neural Network Approach to Long-Run Exchange Rate Prediction,”

Computational Economics, 9, p.51-65

[23] Wasserman, P. D. (1994), “Advanced Methods in Neural Computing”, Van Nostrand Reinhold,

1994.

[24] Zhang, G.P. and Kline, D.M. (2007), “Quarterly Time-Series Forecasting With Neural

Networks”, IEEE Transactions on Neural Networks, Vol. 18, No. 6, November 2007.

[25] Zhang, G.-P., & Qi, M. (2005), “Neural network forecasting for seasonal and trend time series”.

European Journal of Operational Research, 160 (2), 501-514.
