International Research Journal of Finance and Economics
ISSN 1450-2887 Issue 39 (2010)
© EuroJournals Publishing, Inc. 2010
http://www.eurojournals.com/finance.htm

Artificial Intelligence: Neural Networks Simplified


Indranarain Ramlall*

University of Technology, Mauritius
E-mail: ramindra0001@yahoo.co.uk

* The views expressed in this paper are solely those of the author and do not necessarily reflect those of the University of Technology, Mauritius.


Abstract

Now standing as a major paradigm for data mining applications, Neural Networks
have been widely used in many fields owing to their ability to capture complex patterns present
in the data. Acting as feature extractors, Neural Networks extrapolate past patterns
into future ones and thereby relieve the burden of designing complex detection
algorithms for pattern recognition tasks such as face or fingerprint detection. The
development of Neural Networks has been so rapid that they are now referred to as the sixth
generation of computing. While the main strength of Neural Networks lies in their
non-linearity and data-driven nature, their main shortcoming is the lack of
explanatory power of the trained networks, owing to their complex structure.
This paper explains the distinct mechanisms embodied in Neural Networks, together with their strengths,
weaknesses and applications.


Keywords: Neural Networks, Artificial Intelligence, Econometrics, Overfitting, Black
box
JEL Classification Codes: C45, B23, C01

1. Introduction
Considered as a subset of Artificial Intelligence, a Neural Network (NN) is basically a computer program
designed to learn in a manner similar to the human brain. Axons, which carry an electrical signal and are
found only on output cells, terminate at synapses that connect them to the dendrites of other neurons. Throughout
this analysis, NN refers to artificial Neural Networks (ANN) and not to biological ones. NN take after the
human brain on two grounds: first, they acquire knowledge through a
learning process and, secondly, interneuron connection strengths are used to store the acquired
knowledge. Perhaps the most widely agreed definition of NN is that they constitute nonparametric
regression models which do not require a priori assumptions about the problem, the data being
allowed to speak for itself. Ironically, the origin of NN is not directly linked to any estimation or
forecasting exercise, since the researchers who pioneered the field were basically attempting to
gain insight into the learning abilities of the human brain. However, as NN displayed an interesting learning
capacity, interest spread beyond the sciences to fields such as finance, economics and
econometrics. In a nutshell, NN tries to sieve out patterns present in past data and to extrapolate
them into the future. Prior to gaining an insight into NN, it is important to understand their basic
components.
The aim of this paper is to provide a simplified approach to the mechanisms embodied in Neural
Networks. Section 2.0 focuses on stylized features of the brain, followed by section 3.0, which
deals with the network architecture. While section 4.0 describes the choice of network topology,
section 5.0 deals with data preprocessing. Sections 6.0 and 7.0 discuss the strengths and
drawbacks of Neural Networks, respectively. Section 8.0 focuses on the backpropagation algorithm,
while section 9.0 deals with applications of Neural Networks in distinct fields. Section 10.0 discusses
the differences between NN and econometrics. Finally, section 11.0 concludes.


2. Features of the Brain-Stylized facts: Neuron and Perceptron
Prior to gaining an insight into NN, it is proper to have recourse to the basic components of the
brain, labelled as neurons. A neuron is a minute processor that receives, processes and sends data
to the next layer of the model. The brain is composed of about 100 billion neurons of many distinct
types. Neurons are grouped together into an intricate network and they work by transmitting electrical
impulses. The reaction of a neuron following receipt of an impulse will hinge on the intensity of the
impulses received together with the neuron's degree of sensitivity towards the neurons that dispatched
them. Two operations are performed inside a neuron: the first computes
the weighted sum of all the inputs, while the second converts the output of the summation
relative to a certain threshold. A neuron has basically four main parts: the cell body, which constitutes the
spot for processing and generating impulses; the dendrites, which act as signal
receivers that accept signals from outside or from other neurons; the axon, which represents the
avenue that sends the message triggered by the neuron to the next neurons; and the synaptic terminals,
which entail excitatory or inhibitory reactions of the receiving neuron. Brain power, in the end, is simply a
function of this intricately complex network of connections subsisting between the neurons.
A perceptron constitutes a model of simple learning, developed in 1962 by Rosenblatt. The
perceptron weighs and sums up the inputs and compares the result with a
predefined threshold value: it emits one if the weighted sum of inputs is
greater than the threshold value, and zero otherwise. Bishop (1995) points out that a perceptron may work as a
linear discriminant in classifying elements that belong to different sets. The higher the number of
neurons in a single hidden layer, the higher the complexity of the represented function. Unfortunately,
when it comes to solving classification problems that are not linearly separable, the perceptron
algorithm developed by Rosenblatt is not able to terminate. The architecture of a perceptron is
shown in figure 1.

Figure 1: A perceptron


X = vector of inputs, W = vector of weights, b = bias
a = b + W^T X (weighted sum of the inputs plus the bias)
y = f(a) (the activation passed through the transfer function f)

The perceptron reminds us of the logit and probit models in econometrics. While a positive
weight is considered excitatory, a negative weight reflects an inhibitory effect. The transformation
involves two steps: the first computes the weighted sum of the inputs, and the
second transforms this activation into a response by using a transfer function. The synaptic weights
pertain to interneuron connection strengths which are used to store the acquired knowledge.
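A minimal sketch of the perceptron described above may help fix ideas. It is written in Python with illustrative weights and a step threshold; the numerical values are hypothetical and not taken from the paper.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through a step (threshold) function."""
    a = b + w @ x                   # activation: a = b + w'x
    return 1 if a > 0 else 0        # emit 1 if the weighted sum exceeds the threshold, else 0

# Illustrative weights and inputs (hypothetical values)
w = np.array([0.4, -0.7])           # positive weight = excitatory, negative weight = inhibitory
x = np.array([1.0, 0.5])
print(perceptron(x, w, b=0.1))      # prints 1 here, since the weighted sum is positive
```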


3. Network Architecture
The simplest form of NN consists of only two layers, the input and the output layer (no hidden layer is
present). This is sometimes referred to as the skip layer, which basically constitutes conventional
linear regression modelling in a NN design whereby the input layer is directly connected to the output
layer, hence bypassing the hidden layer. Like any other network, this simplest form of NN relies on
weights as the connections between the inputs and the output, each weight representing the relative
significance of a specific input in the computation of the output. It is important to bear in mind that the
output generated will heavily depend on the type of activation function used. However, because the
hidden layer confers strong learning ability on the NN, an architecture of three or more layers is used
in practical applications. It is vital to distinguish between two
classes of weights in NN: those that connect the inputs to the hidden layer
and those that connect the hidden layer to the output layer. In a parallel manner, there are
two classes of activation functions, one found in the hidden layer and one in the output layer.
An infinite number of ways prevail for constructing a NN; neurodynamics (which basically
spells out the properties of an individual neuron, such as its transfer function and how the inputs are
combined) and architecture (which defines the structure of the NN, including the number of neurons in each layer
and the number and type of interconnections) are the two terms used to describe the way in which a NN is
organized. Any network designer must factor in the following elements when building up a network:
1. Best starting values (weight initialisation)
2. Number of hidden layers
3. Number of neurons in each hidden layer
4. Number of input variables or combination of input variables (usually emanating from
regression analysis)
5. Learning rate
6. Momentum rate
7. Training time or amount of training (i.e., the number of iterations to employ)
8. Type of activation function to use in the hidden and output layers
9. Data partitioning and evaluation metrics
Ideally, a NN for a particular task has to be optimised over the entire parameter space of the
learning rate, momentum rate, number of hidden layers and nodes, combination of input variables and
activation functions. Attaining such an objective is computationally burdensome. There is widespread
consensus in the empirical literature that the final NN model rests purely on a process of trial
and error, as there is poor theoretical guidance for creating a NN. In that respect, the design of a
network is considered an art rather than a science, which is why it is a time-consuming process.
Nevertheless, the main criterion used for the design of NN is to end up with the
specification that minimises the errors, i.e., the optimal network topology.

3.1. Weight Initialisation
Each connection has an associated parameter indicating the strength of the connection, the so-called
weight. By changing the weights in a specific way, the network can learn to map patterns presented at the input
layer to target values at the output layer. Prior to gaining an insight into the need for weight
initialisation, it is important to focus on the standard gradient descent algorithm, which can be
succinctly described as follows:
1. Define an error function.
2. Randomly pick an initial set of weights.
3. Evaluate the vector representing the gradient of the error function.
4. Update the weights until a flat minimum has been reached; otherwise, the weight space
should again be modified until convergence is ensured.
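The following is a minimal Python sketch of the gradient descent steps listed above, assuming, purely for illustration, a one-dimensional quadratic error function with a single minimum; any differentiable error function could be substituted.

```python
import numpy as np

def error(w):
    """Illustrative quadratic error function with a single minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Analytical gradient of the error function above."""
    return 2.0 * (w - 3.0)

rng = np.random.default_rng(0)
w = rng.uniform(-10, 10)        # step 2: random initial weight
learning_rate = 0.1

for _ in range(200):            # steps 3-4: evaluate the gradient and update until (near) convergence
    g = gradient(w)
    if abs(g) < 1e-6:           # a flat region has been reached
        break
    w -= learning_rate * g      # move downhill, i.e. in the direction of the negative gradient

print(round(w, 4), round(error(w), 8))
```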

Figure 2: Quadratic error function (error plotted against a single weight; the algorithm moves in the direction of the negative derivative, always moving downhill)

Source: Own illustration

However, the above process becomes problematic in case the error function is not well behaved,
i.e., there is not a unique minimum point but rather distinct local minima, as depicted in figure 3;
the objective should then be geared towards attaining the global minimum. Indeed, complex error
surfaces exhibiting valleys and ravines may appear in case nonlinear activation functions are
employed. Such an objective requires experimentation using random restarts from multiple places in
weight space. The rationale for using diverse starting values of the weights is to increase the chance of ending up
near the global minimum. Under the standard gradient descent approach, the objective is to derive the
network output for all the data points in order to estimate the error gradient (difference between target and
output). The error surface can have properties that are unfavourable for gradient descent; such
complex error surfaces can occur due to the use of more than one hidden layer. Consequently, it is not feasible
to analytically locate the global minimum of the error function, to the effect that, under such a
scenario, NN training is synonymous with exploration of the error surface.

Figure 3: Complex error function (error plotted against a weight, exhibiting several local minima in addition to the global minimum)

Source: Own illustration
3.2. Number of Hidden Layers
The number of hidden layers determines the number of layers in the NN architecture: a one-hidden-layer
network is a three-layer NN model, a two-hidden-layer network a four-layer NN model, and so forth.
Though hidden layers do not represent any real concept and have no direct interpretation, they are
extremely powerful in sieving out the distinct pattern structures present in the data. In that respect,
there is no parallel in econometrics to the hidden layer, and this is the real power of NN: it captures
any nonlinearity present in the data. The hidden layer provides the network with the ability to generalize; in
practice, NN with one and sometimes two hidden layers are widely used and have performed very well.
Raising the number of hidden layers not only scales up the computation time but also the probability of
overfitting, which then paves the way towards poor out-of-sample forecasting performance. The greater
the number of weights relative to the size of the training set, the greater the ability of the network to
memorise idiosyncrasies of individual observations, so that there is a loss of generalization for the
validation set. The usual recommendation is to begin with one or at most two hidden
layers, additional layers not being recommended since it has been found that NN with four hidden layers do
not improve results.

3.3. Number of Units or Neurons in the Hidden Layer
One of the greatest issues in designing a NN model lies in deciding on the number of hidden units in the
hidden layer, as this is important to ensure good generalization; the best approach is to sequentially fit
the model by adding one hidden unit at a time. Generalization reflects the notion that a model based
on a sample of the data is suitable for forecasting the general population. The fewer the neurons in a
network, the fewer the number of operations required, making the model less time-consuming to
implement. In the same vein, in case additional neurons (and layers) are introduced in the network, this
entails more weights to be used. Indeed, the learning capability of NN hinges on the number of neurons
in its hidden layer. In case the number of hidden neurons is too small, the model will not be flexible
enough to model the data well. Conversely, in case of too many hidden neurons, the model will overfit
the data. As the number of hidden nodes cannot be determined in advance, empirical
experimentation is vital to determine this parameter. Technically speaking, the best number of
hidden units hinges on a plethora of elements such as the number of input and output units, the number of
training cases, the amount of noise in the targets, the complexity of the error function, the network
architecture and the training algorithm.
Some specific rules of thumb have been established regarding the choice of the number of units
in the hidden layers, such as half the input layer size plus the output layer size, two-thirds of the input layer
size plus the output layer size, less than twice the input layer size and so forth. However, these rules are
mere references when choosing a hidden layer size, as they focus on the input and output layer sizes and
overlook other factors. Subsequently, there is no easy way to determine the optimal number of
hidden units without training networks with distinct numbers of hidden units and estimating the generalization
error of each. In that respect, no magic formula prevails for arriving at the optimum number of hidden
neurons, to the effect that experimentation constitutes the only avenue. This implies incrementing the
number of hidden units until there is no major improvement in performance. Often, it is better to start
with a large number of epochs and a small number of hidden units, the network that performs
best on the testing set with the least number of hidden neurons being selected as the final one. Care
must be exercised when testing a range of hidden neurons by keeping all other parameters constant,
since modifying any parameter creates a new NN with a distinct error surface that would
thereby complicate the choice of the optimum number of hidden neurons.
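As an illustration of the experimentation described above, the sketch below trains candidate networks with an increasing number of hidden units while keeping all other parameters fixed and retains the one with the lowest test error. It assumes scikit-learn's MLPRegressor and a synthetic data set; neither is prescribed by the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data purely for illustration: a noisy nonlinear pattern.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

results = {}
for n_hidden in (1, 2, 4, 8, 16):               # all other parameters held constant
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), solver="lbfgs",
                       max_iter=5000, random_state=0)
    net.fit(X_train, y_train)
    results[n_hidden] = mean_squared_error(y_test, net.predict(X_test))

# Retain the network that performs best on the test set.
best = min(results, key=results.get)
print(results, "selected hidden units:", best)
```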
3.4. Number of Inputs
The inputs should have a temporal structure and should not be too numerous. In case of n independent
variables, there should be n input nodes in the proposed model. Usually, the practice
is to go for simple models which have few input variables. The reason is that, as the number of model
inputs scales up, the degrees of freedom of the governing equation also increase. While equations with
high degrees of freedom have the capability to model the training data effectively, they fail miserably
when given test data. This is because models with fewer degrees of freedom do not try to trace the
data's random scattering but only follow the general trend. Wasserman (1994) states that, though
clustering of the input vectors scales down the size of the network and thereby assists in speeding up
the learning process, there is no guarantee that the clustering algorithm employed is the most
efficient one. To end up with an efficient algorithm, it becomes necessary to experiment with distinct
starting locations.

3.5. Learning Rate
NN models can be classified by whether the learning, or iterative adjustment, is supervised or
unsupervised. Under supervised learning, the network is trained by providing it with inputs along with
the targeted output values. The error, as determined by the difference between the target
output values and the actual output values, is then backpropagated using the training algorithm until the
desired error level is reached.
Under unsupervised learning, the iterative adjustment is guided only by the input vector, so that
there is no feedback from the environment to show whether the outputs of the network are correct.
Consequently, the network must discover features present in the data on its own. In fact, unsupervised
learning does not use target values; it merely uses the input information to categorise the input patterns.
The learning parameter, or learning rate, controls the rate at which the weights are modified as
learning occurs; it determines the speed of learning. As a matter of fact, gradient
descent merely provides information about the direction in which to move, not about the step size. In that
respect, the learning rate determines the size of the step taken towards the minimum
of the error function. A learning rate that is too high leads to oscillations around the minimum, while one that is too low
can lead to slow convergence of the NN. The larger the learning rate, the greater the changes
in the weights and, consequently, the quicker the network learns. However, assessing the other side of
the coin, a larger learning rate can trigger oscillations of the weights between inadequate solutions.
Alternatively stated, larger steps may engender quick convergence, but they may overstep the solution or
go in the wrong direction on a peculiar error surface. Conversely, small steps may lead in the
proper direction but at the cost of a large number of iterations, a real time-consuming task. In
that respect, the aim is to use the largest learning rate feasible without triggering
oscillations, to ensure that the network learns as well as possible. The proper setting of the learning
rate depends on the specific application under consideration.

3.6. Momentum Rate
The momentum rate is incorporated to modify the backpropagation algorithm. The objective behind the
use of the momentum rate is basically to reach the minimum faster, because if such a momentum term
is excluded, it takes a long time to attain the minimum. The use of the momentum term is also
motivated by the aim of improving the chances of ending up at the global minimum, chiefly
because finding the global minimum is not guaranteed under the gradient descent algorithm, the
error surface possibly containing many local minima in which the algorithm can become stuck.
Technically speaking, the momentum term determines how past weight changes impact on current
weight changes and so, in effect, restrains side-to-side oscillations by filtering out high-frequency
variations. Momentum encourages the weights to keep changing in the same direction, thus avoiding
oscillations. Above all, the momentum rate helps the network escape the local minima problem.
Technically speaking, momentum stimulates weight changes to continue in the same direction by
making the current weight change depend not only on the current error but also on the
previous weight change.
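A minimal sketch of a momentum-augmented weight update follows, assuming illustrative learning and momentum rates and the same toy quadratic error used earlier; the point is only to show how the previous weight change is folded into the current one.

```python
def update_with_momentum(w, grad, prev_delta, learning_rate=0.1, momentum=0.9):
    """One update: current change = momentum * previous change - learning_rate * gradient."""
    delta = momentum * prev_delta - learning_rate * grad
    return w + delta, delta          # new weight plus the change, reused at the next iteration

# Illustrative usage on a quadratic error (w - 3)^2, whose gradient is 2 * (w - 3)
w, prev_delta = -5.0, 0.0
for _ in range(200):
    w, prev_delta = update_with_momentum(w, grad=2.0 * (w - 3.0), prev_delta=prev_delta)
print(round(w, 3))                   # converges towards the minimum at w = 3
```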

3.7. Training
Training pertains to the process during which data are fed into the network along with their
corresponding output values, so that the network can adjust the weights in such a way that it
reproduces the given input-output mapping with as low an error as possible. Training is carried on until
convergence manifests at the point where the test set error starts to rise, the network
weights then being restored to those of the iteration cycle at which the test set error was at a minimum. Training should only
be ceased when there is no improvement in the error function over a sensible number of randomly
selected starting weights. Convergence constitutes the point at which the network no longer improves.
The purpose of convergence training is to end up at a global minimum. Training is affected by many
parameters, such as the selection of the learning rate, the momentum value and the backpropagation
algorithm. Goodness of fit constitutes the criterion most widely applied by researchers to train their
models. An epoch constitutes one loop through the training set. The relationship between error and
training (learning) and testing can be illustrated as follows:

Figure 4: Training time (error plotted against training time for the learning/training curve and the testing/generalisation curve)

As depicted above, as the training time scales up, the training error gradually falls. In the case of
the testing sample, however, the error curve is U-shaped: it falls, attains a minimum and then begins to rise.
There are two main types of network training. The first is the sequential mode, also called on-line or stochastic
training, where the weights are updated following the presentation of each pattern. The second is the batch mode,
called off-line or epochwise training, where the total weight change is made by summing all the
individual changes computed for each pattern in the training set.
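The sketch below illustrates the training procedure described in this section, restoring the weights from the epoch with the lowest test-set error, using batch (epochwise) gradient descent on a synthetic linear problem; a real application would of course use a NN rather than a linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=120)

X_train, y_train = X[:80], y[:80]     # training set: used to adjust the weights
X_test, y_test = X[80:], y[80:]       # test set: decides when to cease training

w = np.zeros(3)
best_w, best_err, lr = w.copy(), np.inf, 0.01

for epoch in range(500):                              # one epoch = one loop through the training set
    grad = -2 * X_train.T @ (y_train - X_train @ w) / len(y_train)
    w -= lr * grad                                    # batch update: sum of all per-pattern changes
    test_err = np.mean((y_test - X_test @ w) ** 2)
    if test_err < best_err:                           # keep the weights from the epoch with the lowest
        best_err, best_w = test_err, w.copy()         # test error (with a flexible NN it would eventually rise)

print(best_err.round(4), best_w.round(2))
```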

3.8. Activation Function (squashing function)
The selection of the squashing or activation function has an important effect on the NN results. The
activation function constitutes a mathematical transformation of the summed, weighted input units
to generate the output of the neuron. Usually, studies use a sigmoid activation function for the hidden
layer and a linear or sigmoid one for the output layer. Technically speaking, an activation function
comprises two parts: a combination function that collapses all the input units into a single value
(the weighted sum of inputs) and a transfer function which applies a nonlinear transformation to this
summarized value to produce the output unit.
Usually, the data pattern determines the form of the transfer function. Most studies resort
to sigmoid transfer functions. The benefit of logistic or sigmoid functions is that they are
continuous functions that rise or fall monotonically, saturate towards their minimum and maximum
values, approximate the step function very well and are differentiable on the whole domain, and can
thereby dramatically reduce the computation burden for training. It is important to note that NN where
the hidden neurons have a sigmoidal activation function and the output neurons a sigmoidal or identity
function are called Multi-Layer Perceptrons (MLP). The sigmoid activation function can be stated as:

g(x) = 1 / (1 + e^(-cx))

However, because the return value of the sigmoid function lies in the interval
[0,1], this function cannot be used in NN that approximate functions which can also take negative values.
To fill such a vacuum, the empirical literature has recourse to the hyperbolic tangent function,
which is also called the bipolar sigmoid function. Such a function can be stated as follows:

g(x) = (1 - e^(-cx)) / (1 + e^(-cx))
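A small Python sketch of the two activation functions just stated, with the slope parameter c defaulting to one (an illustrative choice):

```python
import numpy as np

def sigmoid(x, c=1.0):
    """Logistic activation: output in [0, 1], saturating at the extremes."""
    return 1.0 / (1.0 + np.exp(-c * x))

def bipolar_sigmoid(x, c=1.0):
    """Bipolar sigmoid: output in [-1, 1], suitable when targets can be negative."""
    return (1.0 - np.exp(-c * x)) / (1.0 + np.exp(-c * x))

x = np.linspace(-4, 4, 5)
print(sigmoid(x).round(3))           # [0.018 0.119 0.5 0.881 0.982]
print(bipolar_sigmoid(x).round(3))   # equals np.tanh(c * x / 2)
```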


3.9. Data Partitioning and Evaluation Metrics
Data should be divided so as to strike the best trade-off between the in-sample and out-of-sample performance of
the NN model. While training the network, the researcher must be careful to shun overtraining, whereby
the network adjusts its weights in such a manner that it does very well in the training sample
but fares poorly on values not included in the training sample. This calls for
cross-validation, which usually takes one of two forms. In the first, there are only two sets: the
training set and the testing set used for out-of-sample analysis. In the second, there are three sets:
the training set, the testing set and finally the validation set, the latter being employed to validate the
NN architecture. Alternatively stated, the parameters (i.e., the values of the synaptic weights) of the
network are calculated using the training set. After that, learning is ceased and the network is
evaluated with the data from the testing set. The question of how well the network generalises is
deduced by analysing its performance on the validation set and not on the test set, as the purpose of the
test set is basically to decide when to cease training.
Adya and Collopy (1998) state that the NN model should be assessed in terms of how well its
architecture has been implemented, and such an analysis is undertaken under the theme of effectiveness
of implementation. The guidelines used in that case comprise convergence (determining the
network's in-sample convergence capability), generalization (testing the ability of the network to
recognize patterns outside the training sample) and stability (how far results are consistent during the
validation phase across distinct samples of data). They point out that the results of NN studies
should be viewed with care, as only those studies that cater for both effectiveness of
validation and effectiveness of implementation are authentic ones. Studies that suffer from
either of these two shortcomings are not really helpful, contribute little and should be
deemed inconclusive. For instance, studies that fail to report the in-sample performance of the NN are
hard to digest, as the NN configuration has not been analysed carefully.
Distinct measures of accuracy have been employed in the empirical literature. However, none
of them is perfect, so it is better to use a number of distinct performance measures. The rationale
is that, by opting for more than one performance measure, there is less susceptibility to bias whereby
the best model is selected simply on the basis of just one performance measure.
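A minimal sketch of the three-way split and the use of more than one performance measure is given below; the 60/20/20 proportions and the choice of RMSE and MAE are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def three_way_split(X, y, train=0.6, test=0.2, seed=0):
    """Shuffle and split into training, testing and validation sets (proportions are illustrative)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    n_train, n_test = int(train * len(y)), int(test * len(y))
    tr, te, va = idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]
    return (X[tr], y[tr]), (X[te], y[te]), (X[va], y[va])

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

# Usage: train on the training set, stop training with the test set,
# and judge generalisation on the validation set with more than one metric.
X = np.random.default_rng(1).normal(size=(100, 4))
y = np.random.default_rng(2).normal(size=100)
train, test, valid = three_way_split(X, y)
print(len(train[1]), len(test[1]), len(valid[1]))   # 60 / 20 / 20 observations
```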


4. Choice of Network Topology
Too simple network topologies, which have too few connections between their elements, have little
capacity to store knowledge about the regularities in the training data. On the other side, more complex
network topologies will tend to have more weights to be adjusted in the training process of the
network. The rationale for the overwhelming use of the multilayer perceptron (MLP) is that
such a network architecture is akin to a multivariate non-linear regression model. The most common
neural network model is the MLP, also known as a supervised network by virtue of its need for a
targeted output in order to learn. The aim of such a network is to build a NN
model that properly maps the inputs to the output. A three-layer NN model works as follows.
At the outset, the inputs are channelled into the input layer (first layer). These inputs are then
processed by multiplying them by weights. Inside the hidden layer,
the processed inputs are summed and then passed through a stated activation function. The result is
multiplied by the weights of the output layer (third layer) and processed one last time to
generate the NN output. It is vital to bear in mind, however, that theory does not provide a complete guide
for building up an MLP. Alternatively stated, no foolproof, fully fledged theoretical explanation exists for
obtaining the optimal architecture of the MLP. The main source contributing towards the use of a one-hidden-layer
MLP construction is the conclusion of Hornik, Stinchcombe and White (1989), who point
out that an MLP embedded with only one hidden layer can approximate any function. This underlies the
rationale as to why, in practice, many studies using MLP do not go beyond one hidden layer.
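The forward pass just described can be sketched as follows, assuming one hidden layer with a sigmoid activation and a linear output layer; the sizes and the random weights are purely illustrative.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: sigmoid hidden layer, linear (identity) output layer."""
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # weighted sum of inputs, then squashing
    return W2 @ hidden + b2                          # weighted sum of hidden units -> network output

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                   # 3 inputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # input -> hidden weights (4 hidden units)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)     # hidden -> output weights
print(mlp_forward(x, W1, b1, W2, b2))
```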


5. Data Preprocessing
Rescaling the data is considered beneficial for forecasting accuracy because the estimation
algorithms tend to perform better when the input values are small and centered around zero. An
important step in building up a NN is therefore to select the proper data pre- and postprocessing. This is of major
significance by virtue of the impact of the curse of dimensionality. One avenue for representing the training data
is to specify intervals for the input variables and then to classify the data records by stating in
which interval the values of a record lie. Contrary to the intuitive assumption that additional data
should improve the performance of NN, this is not necessarily the case. As a matter of fact, reducing
the dimensionality of the training data is vital for the proper functioning of NN. In pattern recognition,
for instance, it makes more sense to preprocess the picture to extract the features vital for recognising
the shape and then apply the NN to these features. As a matter of fact, the quality and quantity of data are widely recognized as important
issues in the development of neural network models. It is imperative that the input data be free from
noise to ensure that the network is given the best possible training and is thereby able to generalize better
later. Meade (1995) states that NN are data-dependent, so that the learning algorithms are only
as good as the data shown to them. Usually, data are transformed prior to submission to the NN model.
To accomplish a good prediction performance when applying NN, at the very least the raw data must
be scaled between the upper and lower bounds of the transfer function (usually between zero and one,
or between minus one and one). First differencing (to remove a linear trend from the data) and taking logs (useful for data
that can take both small and large values) constitute the two most widely used data transformations in NN.
Another technique is to use ratios. Sampling or filtering the data refers to removing observations from
the training and testing sets to generate a more uniform distribution. The benefit of filtering is a fall in the
number of training cases, which enables more input variables, random starting weights or
hidden neurons to be tested rather than training on large data sets. In practice, data preprocessing involves much trial
and error.
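The transformations mentioned above can be sketched as follows, on a short synthetic series; the scaling bounds correspond to a sigmoid output layer and are an illustrative choice.

```python
import numpy as np

def minmax_scale(x, lower=0.0, upper=1.0):
    """Scale the raw data between the lower and upper bounds of the transfer function."""
    return lower + (upper - lower) * (x - x.min()) / (x.max() - x.min())

series = np.array([100.0, 104.0, 103.0, 110.0, 125.0, 160.0])
scaled = minmax_scale(series)                 # e.g. to [0, 1] for a sigmoid output layer
differenced = np.diff(series)                 # first differencing removes a linear trend
logged = np.log(series)                       # useful when values span small and large magnitudes
print(scaled.round(3), differenced, logged.round(3))
```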


6. Strengths of NN
The following are deemed as the strengths of NN:
1. The greatest power of Neural Networks is that, endowed with a finite number of hidden units,
they can nevertheless approximate any continuous function to any desired degree of accuracy. This is
commonly referred to as the universal approximator property. A universal approximator
signifies that, given an ample number of hidden layer units, the NN model can approximate any
functional form to any degree of accuracy. In fact, many authors have shown that a three-layer
NN equipped with a logistic activation function in the hidden units constitutes a universal
approximator. But this is a mere sufficiency result, as it requires that an ample number of hidden
units be included to ensure that the network can approximate nearly any linear or nonlinear function
to the desired level of precision.
2. No prior knowledge of the data generating process is needed to implement NN, i.e., NN are
free from statistical assumptions. ANN are also relatively robust to missing and inaccurate data. NN should,
in theory, be able to detect and duplicate any complex nonlinear pattern in the data. In a parallel
manner, NN are not rigid and can hence be customised to any architecture the forecaster sees fit.
More specifically, NN can encompass many models, such as linear regression, the binary
probit model and others, by simply tweaking the activation functions and the network
architecture.
3. Since regression analysis requires the functional form of the model to be stated, model
misspecification may manifest. Such a problem of model misspecification does not occur
with NN, since no specification is imposed and the network merely learns the hidden
relationship in the data. Though nonlinear functions can be linearised by using specific
mathematical transformations in economics and finance, the problem lies in
knowing the proper transformation to apply, and this may not be easy in practice. But, when
it comes to using NN, there is no real need to know the functional form to be applied.
Why? Because the universal approximator property ensures that NN can mimic almost
any functional form.
4. One problem related to Neural Networks could yet be considered not as a real weakness but as
something common among other optimisation techniques. Indeed, many studies admit
that their estimates may suffer from bias emanating from coefficients that correspond to a local rather
than a global optimum. But such a feature is also present in many nonlinear optimisation tasks and
there is no "silver bullet" to obviate the problem.


7. Drawbacks of NN
Assessing the other side of the coin, the following drawbacks were noted for NN:
1. The chief pitfall related to the application of NN has been coined the black box problem. One
avenue employed has been to generate rules from the NN that are easy for a human user to
understand; these rules must be sufficiently simple yet accurate. The conditions of the rules
describe a subregion of the input space. Because a single rule is not powerful
enough to approximate the nonlinear mapping of the network well, the remedy is basically to
split the input space of the data into distinct subregions. Nevertheless, there is still a ray of hope
looming on the NN horizon following the proposition made by Refenes, Zapranis and Francis
(1994), namely that the black box problem can be alleviated by resorting to sensitivity
analysis. Such a technique involves plotting the value of the output for a range of values of a
given input, with all other inputs remaining fixed at their sample means (a minimal sketch of such
an analysis is given after this list). If the value of the output stays stable for distinct values of the
input under inspection, the researcher can presume that this input does not carry much weight in the
predictive power of the NN model. Such a process is successively applied to all inputs until the
researcher ends up with only the relevant inputs in the model. Hence, the NN model is said to be
pruned via the elimination of superfluous inputs.
2. The addition of too many hidden units incites the problem of overfitting the data, meaning that
the network learns the training data too well but generates inferior results out of sample. In a
forecasting exercise, the effect of overfitting manifests itself in the form of
poor out-of-sample forecasts. Alternatively stated, overfitting signifies that, rather than learning
the fundamental structure of the training set that would enable a satisfactory and sufficient
generalization, the network learns insignificant details of individual cases. Overfitting can be
caused either by too short a sample or by a too complex NN model, so that the NN tends to
memorize rather than generalize from the data. In that case, to avoid overfitting, the NN
model should be kept small or parsimonious.
There are two ways to deal with the overfitting problem. The first is to train the network
model on the training set and then to analyse the performance on the test set. This technique is
usually referred to as the early stopping criterion, whereby the data is split into three parts: the training
set, the test set and the validation set. In that context, the training set is used by the algorithm to
estimate the network weights; basically, the training set represents the in-sample period of the
regression model while the test set is dedicated to out-of-sample analysis. The early stopping
strategy is, however, not without caveats. The reason is that, in the case of small samples, the
three-pronged decomposition of the data set becomes somewhat problematic. Results can also be
sensitive to the observations contained in each of the three specific data sets. To curb the
overfitting problem, Refenes (1995) recommends that cross-validation be implemented during
learning. The second approach is to have recourse to one of the various network pruning
algorithms.
3. NN depend so much on the quality of the data that the algorithms employed are only as good as
the data used to apply them. This is why NN are often called weakly deterministic systems.
Similarly, NN need large samples to work. For instance, if a simple NN model is equipped with
a large number of weights, then it is most likely that this leads to a limited number of
degrees of freedom.
4. The construction of the NN model can be a time-consuming process, since building up the NN
architecture is synonymous with a strenuous activity involving trial and error. For instance,
subjectivity in the construction of NN has led researchers to consider the results
doubtful. Learning results in NN may not be stable, on the back of the random initialization of
weights coupled with the complexity of the error surface. So, while building up the NN, it becomes
imperative to keep such subjectivity as low as possible. In the same vein, the early
stopping strategy used to fight the overfitting problem may be subject to arbitrary
judgements made by the researcher. For instance, the researcher must make a judgement about
dividing the sample into the training, validation and testing sets. Usually, researchers cling to
some rule of thumb for dividing the data sets. However, such a shortcoming does not pose a
major issue of concern, given that around 30% of the sample typically represents the holdout
sample for out-of-sample forecasting.
5. NN should never be viewed as a panacea. For instance, in extreme cases such as a major
crisis that substantially alters the movement in prices, NN prediction is unlikely to be
satisfactory. Consequently, expert judgment should always be present. NN models are also prone to
problems of local optima, a conspicuous feature of non-linear optimization.
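As promised under point 1, here is a minimal sketch of such a sensitivity analysis. The trained network is stood in for by a hypothetical predict function in which the second input is nearly irrelevant; each input in turn is varied over its observed range while the other inputs are held at their sample means.

```python
import numpy as np

def predict(X):
    """Stand-in for a trained network: the second input has almost no influence (hypothetical)."""
    return 2.0 * X[:, 0] + 0.001 * X[:, 1] + np.sin(X[:, 2])

rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 3))             # the data the network was trained on
means = sample.mean(axis=0)

for j in range(sample.shape[1]):               # sensitivity analysis, one input at a time
    grid = np.linspace(sample[:, j].min(), sample[:, j].max(), 50)
    X = np.tile(means, (50, 1))                # all other inputs fixed at their sample means
    X[:, j] = grid
    swing = np.ptp(predict(X))                 # how much the output moves over this input's range
    print(f"input {j}: output range {swing:.4f}")   # a near-zero range flags a superfluous input
```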


8. Backpropagation Algorithm
Backpropagation was developed independently by several researchers and provided a learning rule for
NN. The backpropagation algorithm, which constitutes the most popular method for training
a network, is an optimization technique that uses the gradient descent philosophy. It represents a local
search algorithm that minimizes the error function. In that respect, the norm is to cease training when
the algorithm stops improving the evaluation metric applied, such as the sum of squared errors. The
backpropagation algorithm is deemed to be efficient, with satisfactory performance noted on
unseen data relative to conventional optimization techniques. However, one hitch attached to the
algorithm is that it usually involves a time-consuming convergence process. Coupled with the latter,
another main pitfall is that it may end up in a local minimum rather than the global minimum, by
virtue of the random initialization of weights. The local minima problem manifests itself when the network
converges or stabilizes on an inadequate solution (i.e., a local minimum) which is not the optimal
solution, i.e., the global minimum. A NN must be estimated hundreds or thousands of times with
distinct sets of starting values to avoid the local minimum.
With backpropagation, the input data are presented to the neural network. With each
iteration, a comparison is drawn between the output of the neural network and the targeted output, the
difference between which represents an error. This error is then fed backwards recursively through the
network to modify the weights in such a way that the error steadily scales down with each iteration and
the neural model gradually approaches the targeted output. Alternatively stated, the training process
can be ceased once the network produces the desired output or no longer appears to be
learning. The gradient descent philosophy embedded inside the backpropagation algorithm
works as follows:
1. Select some random initial values for the model parameters.
2. Calculate the gradient of the error function with respect to each model parameter.
3. Change the model parameters so that we move a short distance in the direction of the
greatest rate of decrease of the error; i.e., in direction of negative gradient. To reduce the
value of the error function, we always move in the reverse direction of the gradient.
4. Repeat steps 2 and 3 until the gradient gets close to zero.
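The steps above can be sketched for a one-hidden-layer network with sigmoid hidden units and a linear output; the data, network size and learning rate are illustrative, and the derivatives follow from a squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)          # a nonlinear target, purely for illustration

n_hidden, lr = 6, 0.1
W1, b1 = rng.normal(scale=0.5, size=(2, n_hidden)), np.zeros(n_hidden)   # step 1: random start
W2, b2 = rng.normal(scale=0.5, size=(n_hidden, 1)), np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = h @ W2 + b2
    err = out - y                               # difference between output and target
    # Backward pass: step 2, gradient of the squared error w.r.t. each parameter
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = err @ W2.T * h * (1 - h)               # error propagated back through the sigmoid
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)
    # Step 3: move a short distance in the direction of the negative gradient
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(float(np.mean(err ** 2)))                 # mean squared error after training
```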
Backpropagation differs from the Radial Basis Function Network (RBFN) on two main
dimensions. First, in backpropagation, both the hidden and the output layers are nonlinear sigmoid
functions, while in the RBFN the output layer is linear and the hidden layer is a Gaussian
function. Second, while the outputs of the hidden layer in backpropagation are functions of the
products of weights and inputs, in the RBFN the outputs of the hidden layer are functions of the distances
between input vectors and centre vectors. In fact, the RBFN determines centres from the input vectors, rather than
random weights, in the first step of its algorithm. Consequently, the presence of the linear output layer
renders the learning process of the RBFN very fast relative to the backpropagation model.


9. Applications of NN in Distinct Fields
NN have been overwhelmingly used in diverse fields such as portfolio management, credit rating and
bankruptcy prediction, forecasting exchange rates, predicting stock values, inflation and cash
forecasting, forecasting electricity consumption and others. The power of NN has been so strong that
most of the major investment banks, such as Goldman Sachs and Morgan Stanley, have dedicated
departments for implementing NN. However, NN have not only been applied in the fields of finance and
economics but have also been extended to other fields such as computer science, marketing, medical
science, speech and voice recognition and many other applications. Much research has shown that the
forecasting accuracy of NN tends to excel over that of a well-established linear regression model. For
instance, Gonzalez (2000) found error reductions in the range of 13 to 40% when comparing
the performance of NN vis-à-vis linear regression models. Such superior performance tends to manifest itself
not only in in-sample but also in out-of-sample forecasting accuracy. Moreover, several studies have
demonstrated the superiority of ANN over multiple regression (Spangler et al. 1999; Uysal & Roubi
1999; Fadlalla & Lin 2001; Nguyen & Cripps 2001). For the purpose of the current study, the coverage
will not go beyond the field of economics and finance.
In the case of exchange rates, the most commonly cited example is the work of Verkooijen (1996),
who analysed the forecasting power of distinct models in predicting the monthly
US Dollar-DM exchange rate. He found that NN were powerful not only for out-of-sample
forecasts but also in predicting the direction of change of the exchange rate. In the case of the stock
market, Refenes, Zapranis and Francis (1994) found that NN outperformed in both in-sample and
out-of-sample forecasting under the arbitrage pricing theory framework. Besides, some works have
focused on forecasting competitions. In that dimension, Stock and Watson (1998) had recourse to forty-
nine forecasting methods. Intriguingly, they found that NN underperform compared to the naïve
AR(4) forecast and also to most other methods. However, their results are considered inconclusive, as
the NN were arguably not applied in a way that properly dealt with the overfitting
problem that usually gnaws at an effective application of NN. Swanson and White (1997) undertook
another forecasting competition for nine US macroeconomic variables and concluded
that NN merely generate ordinary performance. However, the result of their work is also considered
inconclusive with respect to curbing the overfitting problem. Indeed, the authors acknowledged that
the Schwarz Information Criterion (SIC) did not do justice to the true potential of the NN
methodology. In the case of forecasting bankruptcy, Salchenberger et al. (1992) show the superiority of
neural networks over Logit analysis, while Tan (1996) concludes in favour of the superiority of neural networks over
Probit analysis. McNelis and Chan (2004) find that nonlinear models such as Neural Networks best
describe the inflationary/deflationary dynamics in Hong Kong on both in-sample and out-of-sample
diagnostics. Moshiri, Cameron and Scuse (1999) compare the performance of three distinct types of
Neural Network models, namely the backpropagation neural network (BPN), the radial basis function
network (RBFN) and the recurrent neural network (RNN), in predicting inflation. They find
that the RNN is better at forecasting inflation over longer horizons than the RBFN and BPN,
with the latter outperforming the former. Kumar and Walia (2006) find that NN models perform better
than conventional models for predicting both daily and weekly cash flows for a bank branch.
Planning for electricity consumption has also become a vital task in many countries with a view to
optimising the use of scarce resources.
It is vital to note that, in the case of time series forecasting, Sharda and Patil (1992) conclude that
NN are powerful enough to capture seasonality directly, to the effect that there is no need to adjust the
data for seasonality. In the same vein, Nelson et al. (1999) find that NN tend to do better on
deseasonalised data than on unprocessed data. Heravi et al. (2004) applied NN models to seasonally
unadjusted data on the ground that seasonal adjustment may induce nonlinearity; they find that NN
generate inferior results relative to linear models in terms of RMSE, but that NN maintain an edge over
their counterpart in predicting the direction of change. More recently, Zhang and Kline (2007) conducted an
extensive analysis of the effectiveness of distinct data preprocessing approaches (basically a large set of data from
the M3 competition without adjusting for trends or seasonality) and modeling approaches (many
models included) using NN for forecasting seasonal time series. Using both parametric and
nonparametric diagnostics, they find that simpler models, both in terms of the set of inputs and the number of
hidden nodes, do better than more complex models. Above all, they point out that data preparation and
the selection of inputs are vital steps to ensure proper performance of NN models.


10. NN and Econometrics
It is of paramount significance to bear in mind that NN do not, in effect, deviate too much from
econometrics, on the ground that the simplest types of NN are closely related to standard econometric
techniques; many papers have attempted to establish parallels between NN and econometric methods to
enhance insight into these models for macroeconomic and financial forecasting. For example, a two-
layer feedforward NN with an identity activation function is synonymous with a linear regression
model. Besides, the weights in NN represent the regression coefficients, with the counterpart of the bias
term being the intercept term in econometrics. Alternatively stated, there is substantial overlap between
the two techniques. While NN focus on altering the weights to attain the minimum point of the cost
function, econometrics relies on assumptions such as homoscedasticity, no autocorrelation
and so forth. Because NN do not constitute a full-fledged tool, many studies have
considered them a complementary tool to econometric analysis. The power of NN lies in their ability to
generate insight in cases where theory provides limited guidance in specifying the parameters of the
estimating equation, its functional form and the assumptions about the underlying data employed.
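The equivalence mentioned above, a feedforward network with an identity activation collapsing to a linear regression, can be illustrated with the following sketch, in which a network trained by least mean squares recovers (approximately) the ordinary least squares coefficients; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Two-layer (input -> output) network with identity activation, trained by least mean squares.
w, b, lr = np.zeros(2), 0.0, 0.01
for _ in range(5000):
    err = (X @ w + b) - y
    w -= lr * X.T @ err / len(y)     # the weights play the role of regression coefficients
    b -= lr * err.mean()             # the bias plays the role of the intercept

# OLS intercept and coefficients for comparison.
beta = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0]
print(np.round([b, *w], 3), np.round(beta, 3))   # the two sets of estimates coincide
```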
NN and econometrics basically represent two sides of the same coin. Alternatively stated, NN
are not very different from econometrics, in that the concepts are the same and only the
terminologies differ. The following table summarises the differences subsisting
between NN and econometrics:

Aspect | NN | Econometrics
Regression model | Non-parametric | Parametric
Linear/Nonlinear | Non-linear | Mostly linear, though non-linearity may be captured by specific models
Assumptions | No need to make assumptions about the underlying data | BLUE properties do not usually hold on a perfect basis
Weights | Synaptic strengths: the intensity and strength of the connections between neurons | Regression coefficients
Philosophy | Learning and training | Estimation
Paradigm | Diagnostic | Sampling
Data set | Training, testing and validation sets | In-sample and out-of-sample analyses
Classification tool | Perceptron | Logit/Probit models
Method | Least mean squares | Ordinary least squares
Operation | Simultaneous updating and optimisation of network weights | Regression coefficients determined in groups
Deterministic system | Weak, as NN is data-dependent | Relatively stronger than NN, based on the assumptions made
Implementation time | Usually time-consuming; training may take hours or even days | Easily implemented with econometric software
Specificities | No specific nature; can handle either time-series or cross-sectional data | Three-pronged nature in terms of cross-sectional, time-series and panel data analyses
Theoretical background | Atheoretical, like VAR | Usually based on theory, with the functional form of the model laid out
Upgrades | Genetic Algorithms, Simulated Annealing and Fuzzy Logic | Nothing new; conventional models
Variables | Inputs and outputs | Independent and dependent variables
Intercept | Bias term | Intercept term
Error | Embedded inside; iteratively used to approach the targeted output | Residuals

Source: Own computation


11. Conclusion
This paper has provided a simplified approach to Neural Networks along with explanations of their different
components. Neural Networks have gained so much ground that they are now termed the sixth
generation of computing. As a matter of fact, Neural Networks have been applied in many fields such
as science, finance, credit risk, economics and econometrics. Their ability to learn and their flexibility
render them a powerful tool, though the black box problem reduces their usefulness. Nonetheless, the
predictive power of NN cannot be denied, and this keeps them among the best forecasting tools, not
only among practitioners but also among central bankers around the world.
References
[1] Adya, M. and Collopy, F. (1998), “How effective are neural networks at forecasting and
prediction? A review and Evaluation”. Journal of Forecasting 17 481-495.
[2] Bishop, C. M. (1995), Neural networks for pattern recognition. Oxford, UK: Oxford University
Press.
[3] Salchenberger, L. M., Cinar, E. M. and Lash, N. A. (1992), "Neural Networks: A New Tool for
Predicting Thrift Failures", Decision Sciences, 23(4), July/August, pp. 899-916.
[4] Fadlalla, A. & Lin, C.-H. (2001), “An analysis of the applications of neural networks in
finance”. Interfaces, 31, 4 (July/August), pp. 112–122.
[5] Gonzalez, S. (2000): “Neural Network for Macroeconomic Forecasting: A Complementary
Approach to Linear Regression Models”, Bank of Canada Working Paper 2000-07.
[6] Heravi, Saeed & Osborn, Denise R. & Birchenhall, C. R., (2004). "Linear versus neural
network forecasts for European industrial production series," International Journal of
Forecasting, Elsevier, vol. 20(3), pages 435-446.
[7] Hornik, K., Stinchcombe, M. and White, H. (1989), "Multilayer feedforward networks are
universal approximators", Neural Networks, vol. 2, no. 5, pp. 359-366.
[8] Kumar, P. and Walia, E., (2006), “Cash Forecasting: An Application of Artificial Neural
Networks in Finance”, International Journal of Computer Science and Applications 3 (1): 61-
77.
[9] McNelis, Paul D. and Chan, Carrie K.C., (2004), "Deflationary Dynamics in Hong Kong:
Evidence from Linear and Neural Network Regime Switching Models," Working Papers
212004, Hong Kong Institute for Monetary Research.
[10] Meade, N. (1995), "Neural network time series forecasting of financial markets - Azoff, E.M.",
International Journal of Forecasting, Vol. 11, pp. 601-602.
[11] Moshiri, S., Cameron, N. E. and Scuse, D. (1999), "Static, Dynamic, and Hybrid Neural
Networks in Forecasting Inflation", Computational Economics, Vol. 14 (3). p 219-35.
[12] Nelson, C.A. (1999). Neural plasticity and human development. Current Directions in
Psychological Science, 8, 42-45.
[13] Nguyen, N. & Cripps, A. (2001) Predicting housing value: a comparison of multiple regression
analysis and artificial neural networks. Journal of Real Estate Research, 22, 3, pp. 313–336.
[14] Refenes A.P,(ed.), “Neural network in the capital markets”, John Wiley& Sons Ltd (1995)
[15] Refenes, A.N., Zapranis, A. and Francis, G., (1994), “Stock performance modeling using neural
network: A comparative study with regression models”, Neural Network (1994) 5: 961–970.
[16] Sharda, R. and Patil, R.B. (1992), “Connectionist approach to time series prediction: An
empirical test”, Journal of Intelligent Manufacturing, 3 (1992), 317-23.
[17] Spangler, W.E., May, J.H. & Vargas, L.G. (1999) Choosing data-mining methods for multiple
classification: representational and performance measurement implications for decision
support. Journal of Management Information Systems, 16, 1 (Summer), pp. 37–62.
[18] Stock, J.H. and Watson, M. W. (1998): "A Comparison of Linear and Nonlinear univariate
models for Forecasting Macroeconomic Time Series," National Bureau of Economic Research
Working Paper 6607, June 1998.
[19] Swanson, Norman R. and White, Halbert (1997): "A Model Selection Approach to Real-Time
Macroeconomic Forecasting Using Linear Models and Artificial Neural Networks", Review of
Economics and Statistics, 79, November 1997, p.540-550.
[20] Tan, S.S., Smeins, F.E., 1996. Predicting grassland community changes with an artificial neural
network model. Ecol. Model. 84, 91–97.
[21] Uysal, M. and El Roubi, S. (1999), "Artificial Neural Networks vs Multiple Regression in
Tourism Demand Analysis", Journal of Travel Research, 38(2): 111-118.
[22] Verkooijen, W. (1996), “A Neural Network Approach to Long-Run Exchange Rate Prediction,”
Computational Economics, 9, p.51-65
[23] Wasserman, P. D. (1994), “Advanced Methods in Neural Computing”, Van Nostrand Reinhold,
1994.
[24] Zhang, G.P. and Kline, D.M. (2007), "Quarterly Time-Series Forecasting With Neural
Networks", IEEE Transactions on Neural Networks, Vol. 18, No. 6, November 2007.
[25] Zhang, G.-P. and Qi, M. (2005), "Neural network forecasting for seasonal and trend time series",
European Journal of Operational Research, 160(2), 501-514.