
COURSE ANNOUNCEMENT -- Spring 2004
DSES-4810-01  Intro to COMPUTATIONAL INTELLIGENCE & SOFT COMPUTING


With ever-increasing computer power readily available, novel engineering methods based on "soft computing" are emerging at a rapid rate. This course provides the students with a working knowledge of computational intelligence (CI), covering the basics of fuzzy logic, neural networks, genetic algorithms and evolutionary computing, simulated annealing, wavelet analysis, artificial life, and chaos. Applications in control, forecasting, optimization, data mining, fractal image compression, and time series analysis are illustrated with engineering case studies.


This course provides a hands-on introduction to the fascinating discipline of computational intelligence (i.e., the synergistic interplay of fuzzy logic, genetic algorithms, neural networks, and other soft computing techniques). The students will develop the skills to solve engineering problems with computational intelligence paradigms. The course requires a CI-related project in the student's area of interest.


Instructor:      Prof. Mark J. Embrechts (x 4009, embrem@rpi.edu)

Office Hours:    Thursday 10-11 am (CII 5217), or by appointment.

Class Time:      Monday/Thursday 8:30-9:50 am (Amos Eaton Hall 216)

Text (optional): J. S. Jang, C. T. Sun, E. Mizutani, "Neuro-Fuzzy and Soft Computing," Prentice Hall, 1996 (1998). ISBN 0-13-261066-3

Course is open to graduate students and seniors of all disciplines.

GRADING:
    Tests                   10%
    5 Homework Projects     35%
    Course Project          40%
    Presentation            15%

ATTENDANCE POLICY
Course attendance is mandatory; a make-up project is required for each missed class. A missed class without a make-up results in the loss of half a grade point.

ACADEMIC HONESTY
Homework projects are individual exercises. You may discuss assignments with your peers, but you may not copy. The course project may be done in groups of two.



COMPUTATIONAL INTELLIGENCE -- COURSE OUTLINE

1.  INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS (ANNs)
    1.1 History
    1.2 Philosophy of neural nets
    1.3 Overview of neural nets

2.  INTRODUCTION TO FUZZY LOGIC
    2.1 History
    2.2 Philosophy of Fuzzy Logic
    2.3 Terminology and definitions

3.  INTRODUCTION TO EVOLUTIONARY COMPUTING
    3.1 Introduction to Genetic Algorithms
    3.2 Evolutionary Computing / Evolutionary Programming / Genetic Programming
    3.3 Terminology and definitions

4.  NEURAL NETWORK APPLICATIONS / DATA MINING WITH ANNs
    4.1 Case study: time series forecasting (population forecasting)
    4.2 Case study: automated discovery of novel pharmaceuticals (Part I)
    4.3 Data mining with neural networks

5.  FUZZY LOGIC APPLICATIONS / FUZZY EXPERT SYSTEMS
    5.1 Fuzzy logic case study: tipping
    5.2 Fuzzy expert systems

6.  SIMULATED ANNEALING / GENETIC ALGORITHM APPLICATIONS
    6.1 Simulated annealing
    6.2 Supervised clustering with GAs
    6.3 Case study: automated discovery of novel pharmaceuticals (Part II)

7.  DATA VISUALIZATION WITH SELF-ORGANIZING MAPS
    7.1 The Kohonen feature map
    7.2 Case study: visual explorations for novel pharmaceuticals (Part III)

8.  ARTIFICIAL LIFE
    8.1 Cellular automata
    8.2 Self-organized criticality
    8.3 Case study: highway traffic jam simulation

9.  FRACTALS AND CHAOS
    9.1 Fractal Dimension
    9.2 Introduction to Chaos
    9.3 Iterated Function Systems

10. WAVELETS





Monday, January 12, 2004

DSES-4810-01 Intro to COMPUTATIONAL INTELLIGENCE & SOFT COMPUTING

Instructor:      Prof. Mark J. Embrechts (x 4009 or 371-4562) (embrem@rpi.edu)

Office Hours:    Tuesday 10 am - 12 pm (CII 5217), or by appointment.

Class Time:      Monday/Thursday 8:30-9:50 (Amos Eaton Hall 216)

TEXT (optional): J. S. Jang, C. T. Sun, E. Mizutani, "Neuro-Fuzzy and Soft Computing," Prentice Hall, 1996 (1998). ISBN 0-13-261066-3



LECTURES #1-3: INTRO to Neural Networks

The purpose of the first two lectures is to give an overview of the philosophy of artificial neural networks. Today's lecture will provide a brief history of neural network development and introduce the idea of training a neural network. We will introduce a neural network as a framework to generate a map from an input space to an output space. Three basic premises will be discussed to explain artificial neural networks:


(1) A problem can be formulated and represented as a map from an m-dimensional space R^m to an n-dimensional space R^n, or R^m -> R^n.

(2) Such a map can be realized by setting up an equivalent artificial framework of basic building blocks of McCulloch-Pitts artificial neurons. This collection of artificial neurons forms an artificial neural network or ANN.

(3) The neural net can be trained to conform to the map based on samples of the map and will reasonably generalize to new cases it has not encountered before.


Handouts:

1. Mark J. Embrechts, "Problem Solving with Artificial Neural Networks."
2. Course outline and policies.

Tasks:

Start thinking about a project topic; meet with me during office hours or by appointment.




PROJECT DEADLINES:

    January 22    Homework Project #0 (web page summary)
    January 29    Project proposal (2 typed pages: title, references, motivation, deliverable, evaluation criteria)



WHAT IS EXPECTED FROM THE CLASS PROJECT?

- Prepare a monograph about a course-related subject (15 to 20 written pages and supporting material in appendices).

- Prepare a 20-minute lecture about your project and give a presentation. Hand in a hard copy of your slides.

- A project starts in the library. Prepare to spend at least a full day in the library over the course of the project. Meticulously write down all the relevant references, and attach a copy of the most important references to your report.

- The idea for the lecture and the monograph is that you spend the maximum amount of effort to allow a third party to present that same material, based on your preparation, with a minimal amount of effort.

- The project should be a finished and self-consistent document in which you meticulously digest the prerequisite material, give a brief introduction to your work, and motivate the relevance of the material. Hands-on program development and personal expansions of and reflections on the literature are strongly encouraged. If your project involves programming, hand in a working version of the program (with source code) and document the program with a user's manual and sample problems.

- It is expected that you spend on average 6 hours/week on the class project.




PROJECT PROPOSAL

- A project proposal should be a fluent text of at least 2 full pages, in which you try to sell the idea for a research project in a professional way. The proposal should therefore contain a clear background and motivation.

- The proposal should define a clear set of goals, deliverables, and a timetable.

- Identify how you would consider your project successful and address the evaluation criteria.

- Make sure you select a title (acronyms and logos are suggested as well), and add a list of references to your proposal.



PROBLEM SOLVING WITH ARTIFICIAL NEURAL NETWORKS

Mark J. Embrechts


1. INTRODUCTION TO NEURAL NETWORKS

1.1 Artificial neural networks in a nutshell


This introduction to artificial neural networks explains as briefly as possible what is commonly understood by an artificial neural network and how such networks can be applied to solve data mining problems. Only the most popular type of neural network will be discussed here: the feedforward neural network (usually trained with the popular backpropagation algorithm). Neural nets emerged from psychology as a learning paradigm that mimics how the brain learns. There are many different types of neural networks, training algorithms, and different ways to interpret how and why a neural network operates. A neural network problem is viewed in this write-up as a parameter-free implementation of a map, and it is silently assumed that most data mining problems can be framed as a map. This is a very limited view, which does not fully cover the power of artificial neural networks. However, this view leads to an intuitive basic understanding of the neural network approach for problem solving with a minimum of otherwise necessary introductory material.


Three basic premises will be discussed in order to explain artificial neural networks:

(1) A problem can be formulated and represented as a map from an m-dimensional space R^m to an n-dimensional space R^n, or R^m -> R^n.

(2) Such a map can be implemented by constructing an artificial framework of basic building blocks of McCulloch-Pitts artificial neurons. This collection of artificial neurons forms an artificial neural network (ANN).

(3) The neural net can be trained to conform to the map based on samples of the map and will reasonably generalize to new cases it has not encountered before.

The next sections expand on these premises and explain maps, the McCulloch-Pitts neuron, artificial neural networks (ANNs), training, and generalization.



1.2 Framing an equivalent map for a problem

Let us start by considering a token problem and reformulate this problem as a map. The token problem involves deciding whether a seven-bit binary number is odd or even. To restate this problem as a map, two spaces are considered: a seven-dimensional input space containing all the seven-bit binary numbers, and a one-dimensional output space with just two elements (or classes): odd or even, which will be symbolically represented by a one or a zero. Such a map can be interpreted as a transformation from R^m to R^n, or R^m -> R^n (with m = 7 and n = 1). A map for the seven-bit parity problem is illustrated in figure 1.1.






Figure 1.1  The seven-bit parity problem posed as a mapping problem.


The seven-bit parity problem was just framed as a formal mapping problem. The specific details of the map are yet to be determined: all we have so far is the hope that a precise function can be formulated that transfers the seven-bit binary input space to a 1-dimensional, 1-bit output space and thereby solves the seven-bit parity problem. We hope that eventually we can specify a green box that formally could be implemented as a subroutine in C code, where the subroutine would have a header of the type:



    void Parity_Mapping(VECTOR sample, int *decision) {
        code line 1;
        ...
        line of code;
        *decision = ... ;
    } // end of subroutine



In other words: given a seven-bit binary vector as an input to this subroutine (e.g., {1, 0, 1, 1, 0, 0, 1}), we expect the subroutine to return an integer nicknamed "decision." The value of decision will turn out to be unity or zero, depending on whether the seven-bit input vector is odd or even.


We call this methodology a green-box approach to problem solving to imply that we only hope that such a function can eventually be realized, but that so far, we are clueless about how exactly we are going to fill the body of that green box. Of course, you probably guessed by now that somehow artificial neural networks will be applied to do this job for us. Before elaborating on neural networks we still have to discuss a subtle but important point related to our way of solving the seven-bit parity problem. Implicitly it is assumed for this problem that all seven-bit binary numbers are available and that the parity of each seven-bit binary number is known.
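For this token problem the body of the green box could of course be written down directly. As a minimal illustrative sketch, the following C fragment does so conventionally; it assumes that VECTOR is simply an array of seven 0/1 integers and interprets "odd or even" as the parity of the number of one-bits.

    #include <stdio.h>

    #define N_BITS 7
    typedef int VECTOR[N_BITS];   /* assumed: seven 0/1 integers */

    void Parity_Mapping(VECTOR sample, int *decision) {
        int i, sum = 0;
        for (i = 0; i < N_BITS; i++)   /* count the one-bits */
            sum += sample[i];
        *decision = sum % 2;           /* 1 = odd, 0 = even  */
    }

    int main(void) {
        VECTOR x = {1, 0, 1, 1, 0, 0, 1};
        int decision;
        Parity_Mapping(x, &decision);
        printf("parity = %d\n", decision);  /* prints 0: four one-bits */
        return 0;
    }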


Let us complicate the seven-bit parity problem by specifying that, for the time being, we know the correct parity for only 120 of the 128 possible seven-bit binary numbers. We want to specify a map for these 120 seven-bit binary numbers such that the map will correctly identify the eight remaining binary numbers. This is a much more difficult problem than mapping the seven-bit parity problem based on all the possible samples, and whether an answer exists and can be found for this type of problem is often not clear at all from the onset. The methodology for learning what has to go in the green box for this problem will divide the available samples for this map into a training set -- a subset of the known samples -- and a test set. The test set will be used only for evaluating the goodness of the green-box implementation of the map.


Let us introduce a second example to illustrate how a regression problem can be reformulated as a mapping problem. Consider a collection of images of circles: all 64x64 black-and-white (B&W) pixel images. The problem here is to infer the radii of these circles based on the pixel values. Figure 1.2 illustrates how to formulate this problem as a formal map. A 64x64 image could be scanned row by row and be represented by a string of zeros and ones, depending on whether each pixel is white or black. This input space has 64x64 or 4096 binary elements and can therefore be considered as a space with 4096 dimensions. The output space is a one-dimensional number, being the radius of the circle in the appropriate units.

We generally would not expect for this problem to have access to all possible 64x64 B&W images of circles to determine the mapping function. We would therefore only consider a representative sample of circle images, somehow use a neural network to fill out the green box to specify the map, and hope that it will give the correct circle radius within a certain tolerance for future out-of-sample 64x64 B&W images of circles. It actually turns out that the formal mapping procedure as described so far would yield lousy estimates for the radius. Some ingenious form of preprocessing on the image data (e.g., considering selected frequencies of a 2-D Fourier transform) will be necessary to reduce the dimensionality of the input space.


Most problems can be formulated in multiple ways as a map of the type R^m -> R^n. However, not all problems can be elegantly transformed into a map, and some formal mapping representations might be better than others for a particular problem. Often ingenuity, experimentation, and common sense are called for to frame an appropriate map that can adequately be represented by artificial neural networks.


Figure 1.2  Determining the radius of a 64x64 B&W image of a circle, posed as a formal mapping problem.




1.3 The McCulloch-Pitts neuron and artificial neural networks

The first neural network premise states that most problems can be formulated as an equivalent formal mapping problem. The second premise states that such a map can be represented by an artificial neural network (ANN): i.e., a framework of basic building blocks, the so-called McCulloch-Pitts artificial neurons.


The McCulloch-Pitts neuron was first proposed in 1943 by Warren McCulloch and Walter Pitts, a psychologist and a mathematician, in a paper illustrating how simple artificial representations of neurons could in principle represent any arithmetic function. How to actually implement such a function was first addressed by the psychologist Donald Hebb in 1949 in his book "The Organization of Behavior." The McCulloch-Pitts neuron can easily be understood as a simple mathematical operator. This operator has several inputs and one output and performs two elementary operations on the inputs: first it makes a weighted sum of all the inputs, and then it applies a functional transform to that sum which will be sent to the output. Assume that there are N inputs {x_1, x_2, ..., x_N}, or an input vector x, and consider the output y. The output y can be expressed as a function of its inputs according to the following equations:

    z = x_1 + x_2 + ... + x_N                         (1)

and

    y = f(z)                                          (2)

So far we have not yet specified the transfer function f(.). In its most simple form it is just a threshold function giving an output of unity when the sum exceeds a certain value, and zero when the sum is below this value. It is common practice in neural networks to use as transfer function the sigmoid function, which can be expressed as:

    f(z) = 1 / (1 + exp(-z))                          (3)

Figure 1.3 illustrates the basic operations of a McCulloch-Pitts neuron. It is common practice to apply an appropriate scaling to the inputs (usually such that either 0 < x_i < 1, or -1 < x_i < 1).


Figure 1.3  The McCulloch-Pitts artificial neuron as a mathematical operator.




One more enhancement has to be clarified for the basics of the McCulloch-Pitts neuron: before summing the inputs, they actually have to be modified by multiplying them with a weight vector {w_1, w_2, ..., w_N}, so that instead of using equation (1) and summing the inputs we will make a weighted sum of the inputs according to equation (4):

    z = w_1 x_1 + w_2 x_2 + ... + w_N x_N             (4)
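As a concrete illustration of equations (2)-(4), the minimal C sketch below evaluates a single McCulloch-Pitts style neuron with a sigmoid transfer function; the inputs and weights are arbitrary illustrative numbers.

    #include <stdio.h>
    #include <math.h>

    /* Sigmoid transfer function, equation (3). */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + exp(-z));
    }

    /* One neuron: weighted sum, equation (4), followed by the transfer
       function, equation (2).  n is the number of inputs. */
    static double neuron(const double *x, const double *w, int n) {
        double z = 0.0;
        for (int i = 0; i < n; i++)
            z += w[i] * x[i];
        return sigmoid(z);
    }

    int main(void) {
        double x[3] = {0.2, 0.7, 0.1};    /* illustrative inputs  */
        double w[3] = {1.5, -0.8, 0.3};   /* illustrative weights */
        printf("y = %f\n", neuron(x, w, 3));
        return 0;
    }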


A collection of these basic operators can be stacked in a structure -- an artificial neural network -- that can have any number of inputs and any number of outputs. The neural network shown in figure 1.4 represents a map with two inputs and one output. There are two fan-out input elements and a total of six neurons. There are three layers of neurons: the first layer is called the first hidden layer, the second layer is the second hidden layer, and the output layer consists of one neuron. There are 14 weights. The layers are fully connected. In this example there are no backward connections, and this type of neural net is therefore called a feedforward network. The type of neural net of figure 1.4 is the most commonly encountered type of artificial neural network, the feedforward net:

(1) There are no connections skipping layers.

(2) The layers are fully connected.

(3) There is usually at least one hidden layer.

It is not hard to envision now that any map can be translated into an artificial neural network structure -- at least formally. How to determine the right weight set and how many neurons to locate in the hidden layers we have not yet addressed. This is a subject for the next section.





Figure 1.4 Typical artificial feedforward neural network.



1.4 Artificial neural networks

An artificial neural network is a collection of connected McCulloch-Pitts neurons. Neural networks can formally represent almost any functional map provided that:

(1) a proper number of basic neurons are appropriately connected;

(2) appropriate weights are selected.

Specifying an artificial neural network to conform with a particular map means determining the neural network structure and its weights. How to connect the neurons and how to select the weights is the subject of the discipline of artificial neural networks. Even when a neural network can in principle represent any function or map, it is not necessarily clear that one can ever specify such a neural network with the existing algorithms. This section will briefly address how to set up a neural network, and give at least a conceptual idea about determining an appropriate weight set.


The feedforward neural network of figure 1.4 is the most commonly encountered type of artificial neural net. For most functional maps at least one hidden layer of neurons, and sometimes two hidden layers of neurons, are required. The structural layout of a feedforward neural network can now be determined. For a feedforward layered neural network two points have to be addressed to determine the layout:

(1) How many hidden layers to use?

(2) How many neurons to choose in each hidden layer?

Different experts in the field often have different answers to these questions. A general guideline that works surprisingly well is to try one hidden layer first, and to choose as few neurons in the hidden layer(s) as one can get away with.

The most intriguing question still remains and addresses the third premise of neural networks: is it actually possible to come up with algorithms that allow us to specify a good weight set? How do we determine the weights of the network from samples of the map? Can we expect a reasonable answer for new cases that were not encountered before from such a network?


It is straightforward to devise algorithms that will determine a weight set for neural networks that contain just an input layer and an output layer -- and no hidden layer(s) of neurons. However, such networks do not generalize well at all. Neural networks with good generalization capabilities require at least one hidden layer of neurons. For many applications such neural nets generalize surprisingly well. The need for hidden layers in artificial neural networks was already realized in the late fifties. However, in his 1969 book "Perceptrons" the MIT professor Marvin Minsky argued that it might not be possible at all to come up with any algorithm to determine a suitable weight set if hidden layers are present in the network structure. Such an algorithm emerged only in 1986: the backpropagation algorithm, popularized by Rumelhart and McClelland in a very clearly written chapter in their book "Parallel Distributed Processing." The backpropagation algorithm was actually invented and reinvented several times and its original formulation is generally credited to Paul Werbos. He described the backpropagation algorithm in his Harvard Ph.D. dissertation in 1974, but this algorithm was not widely noted at that time. The majority of today's neural network applications relies in one form or another on the backpropagation algorithm.


1.5 Training neural networks

The result of a neural network is its weight set. Determining an appropriate weight set is called training or learning, based on the metaphor that learning takes place in the human brain, which can be viewed as a collection of connected biological neurons. The learning rule proposed by Hebb was the first mechanism for determining the weights of a neural network. The Canadian psychologist Donald Hebb postulated this learning strategy in the late forties as one of the basic mechanisms by which humans and animals can learn. Later on it turned out that he hit the nail on the head with his formulation. Hebb's rule is surprisingly simple, and while in principle Hebb's rule can be used to train multi-layered neural networks, we will not elaborate further on this rule. Let us just point out here that there are now many different neural network paradigms and many algorithms for determining the weights of a neural network. Most of these algorithms work iteratively: i.e., one starts out with a randomly selected weight set, applies one or more samples of the mapping, and gradually upgrades the weights. This iterative search for a proper weight set is called the learning or training phase.


Before explaining the workings of the backpropagation algorithm we will present a simple alternative, the random search. The most naive way to determine a weight set -- which rather surprisingly in hindsight did not emerge before the backpropagation principle was formulated -- is just to try randomly generated weight sets, and to keep trying with new randomly generated weight sets until one hits it just right. The random search is at least in principle a way to determine a suitable weight set, if it weren't for its excessive demands on computing time. While this method sounds too naive to give it even serious thought, smart random search paradigms (such as genetic algorithms and simulated annealing) are nowadays actually legitimate and widely used training mechanisms for neural networks. However, random search methods come with many bells and whistles and are extremely demanding on computing time. Only the wide availability of ever faster computers has allowed this method to be practical at all.
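A minimal sketch of such a random-perturbation search is given below: perturb one randomly chosen weight, and keep the change only if the error goes down. The error function here is only a stand-in toy function; in a real application it would tally the network error over all training samples.

    #include <stdio.h>
    #include <stdlib.h>

    #define N_WEIGHTS 7

    /* Toy stand-in for "show all training samples and tally the error". */
    static double error(const double *w) {
        double e = 0.0;
        for (int i = 0; i < N_WEIGHTS; i++)
            e += (w[i] - 0.5 * i) * (w[i] - 0.5 * i);
        return e;
    }

    static double uniform(double lo, double hi) {
        return lo + (hi - lo) * rand() / (double)RAND_MAX;
    }

    int main(void) {
        double w[N_WEIGHTS], best;
        for (int i = 0; i < N_WEIGHTS; i++)
            w[i] = uniform(-1.0, 1.0);           /* random initial weights */
        best = error(w);
        for (long iter = 0; iter < 100000; iter++) {
            int i = rand() % N_WEIGHTS;          /* pick one weight        */
            double old = w[i];
            w[i] += uniform(-0.1, 0.1);          /* small random change    */
            double e = error(w);
            if (e < best) best = e;              /* keep the improvement   */
            else          w[i] = old;            /* otherwise undo it      */
        }
        printf("final error = %g\n", best);
        return 0;
    }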


The process for determining the weights of a neural net proceeds in two separate stages. In the first stage, the training phase, one applies an algorithm to determine a -- hopefully good -- weight set with about 2/3 of the available mapping samples. The generalization performance of the just-trained neural net is subsequently evaluated in the testing phase based on the remaining samples of the map.


1.6 The backpropagation algorithm

An error measure can be defined to quantify the performance of a neural net. This error function depends on the weight values and the mapping samples. Determining the weights of a neural network can therefore be interpreted as an optimization problem, where the performance error of the network structure is minimized for a representative sample of the mappings. All paradigms applicable to general optimization problems therefore apply to neural nets as well.




The backpropagation algorithm is elegant and simple, and is used in eighty percent of the
neural network applications. It consistently gives at least reasonably acceptable answers for the
weight set. The backp
ropagation algorithm can not be applied to just any optimization
problem, but it is specifically tailored to multi
-
layer feedforward neural network.


There are many ways to define the performance error of a neural network. The most commonly
applied error m
easure is the mean square error. This error, E, is determined by showing every
sample to the net and to tally the differences between the actual output, o, minus the desired
target output, t, according to equation (5).



(5)
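As a small illustration, the C sketch below tallies this sum-of-squares error over a handful of made-up single-output samples; the factor 1/2 is the usual bookkeeping convention.

    #include <stdio.h>

    /* Error of equation (5) over n_samples single-output samples. */
    static double sse(const double *o, const double *t, int n_samples) {
        double e = 0.0;
        for (int i = 0; i < n_samples; i++) {
            double d = t[i] - o[i];
            e += 0.5 * d * d;
        }
        return e;
    }

    int main(void) {
        double o[3] = {0.1, 0.8, 0.4};   /* actual network outputs */
        double t[3] = {0.0, 1.0, 0.5};   /* desired targets        */
        printf("E = %f\n", sse(o, t, 3));
        return 0;
    }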


Training a neural network starts out with a randomly selected weight set. A batch of samples is shown to the network, and an improved weight set is obtained by iteration following equations (6) and (7). The new weight for a particular connection (labeled ij) at iteration (n+1) is an improvement on the weight from iteration (n), obtained by moving a small amount along the gradient of the error surface towards the direction of the minimum:

    w_ij(n+1) = w_ij(n) + Δw_ij(n)                    (6)

    Δw_ij(n) = -a * ∂E/∂w_ij                          (7)

Equations (6) and (7) represent an iterative steepest descent algorithm, which will always converge to a local minimum of the error function provided that the learning parameter, a, is small. The ingenuity of the backpropagation algorithm was to come up with a simple analytical expression for the gradient of the error in multi-layered nets by a clever application of the chain rule. While it was for a while commonly believed that the backpropagation algorithm was the only practical algorithm to implement equation (7), it is worth pointing out that the derivative of E with respect to the weights can easily be estimated numerically by tweaking the weights a little bit. This approach is perfectly valid, but is significantly slower than the elegant backpropagation formulation. The details for deriving the backpropagation algorithm can be found in the literature.
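The numerical alternative just mentioned is easy to sketch: tweak each weight by a small amount, estimate ∂E/∂w by a finite difference, and update the weights according to equations (6) and (7). In the minimal C sketch below the error function is again only a stand-in toy function rather than a real network error.

    #include <stdio.h>

    #define N_WEIGHTS 4
    #define EPS       1.0e-6   /* size of the weight tweak             */
    #define ALPHA     0.1      /* learning parameter a in equation (7) */

    /* Toy stand-in for the network error E(w) of equation (5). */
    static double error(const double *w) {
        double e = 0.0;
        for (int i = 0; i < N_WEIGHTS; i++)
            e += 0.5 * (w[i] - 1.0) * (w[i] - 1.0);
        return e;
    }

    int main(void) {
        double w[N_WEIGHTS] = {0.3, -0.2, 0.9, 2.0};   /* initial weights */
        for (int iter = 0; iter < 200; iter++) {
            double grad[N_WEIGHTS];
            for (int i = 0; i < N_WEIGHTS; i++) {      /* estimate dE/dw_i */
                double saved = w[i];
                w[i] = saved + EPS;
                double e_plus = error(w);
                w[i] = saved - EPS;
                double e_minus = error(w);
                w[i] = saved;
                grad[i] = (e_plus - e_minus) / (2.0 * EPS);
            }
            for (int i = 0; i < N_WEIGHTS; i++)        /* equations (6), (7) */
                w[i] += -ALPHA * grad[i];
        }
        printf("final error = %g\n", error(w));
        return 0;
    }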


1.7 More neural network paradigms

So far, we have briefly described how feedforward neural nets can solve problems by recasting the problem as a formal map. The workings of the backpropagation algorithm to train a neural network were formally explained. While the views and algorithms presented here conform with the mainstream approach to neural network problem solving, there are literally hundreds of different neural network types and training algorithms. Recasting the problem as a formal map is just one part and one view of neural nets. For a broader view on neural networks we refer to the literature.




At least two more paradigms revolutionized and popularized neural networks in the eighties: the Hopfield net and the Kohonen net. The physicist John Hopfield gained attention for neural networks in 1982 when he wrote a paper in the Proceedings of the National Academy of Sciences indicating how neural networks form an ideal framework to simulate and explain the statistical mechanics of phase transitions. The Hopfield net can also be viewed as a recurrent content-addressable memory that can be applied to image recognition and traveling-salesman-type optimization problems. For several specialized applications, this type of network is far superior to any other neural network approach. The Kohonen network, proposed by the Finnish professor Teuvo Kohonen, on the other hand is a one-layer feedforward network that can be viewed as a self-learning implementation of the K-means clustering algorithm for vector quantization, with powerful self-organizing properties and biological relevance.


Other popular, powerful and clever neural network paradigms are the radial basis function network, the Boltzmann machine, the counterpropagation network and the ART (adaptive resonance theory) networks. Radial basis function networks can be viewed as a powerful general regression technique for multi-dimensional function approximation which employs Gaussian transfer functions with different standard deviations. The Boltzmann machine is a recursive simulated-annealing type of network with arbitrary network configuration. Hecht-Nielsen's counterpropagation network cleverly combines a feedforward neural network structure with a Kohonen layer. Grossberg's ART networks use a similar idea but can be elegantly implemented in hardware and retain a high level of biological plausibility.


There is room as well for more specialized networks such as Oja's rules for principal component analysis, wavelet networks, cellular automata networks and Fukushima's neocognitron. Wavelet networks utilize the powerful wavelet transform and generally combine elements of the Kohonen layer with radial basis function techniques. Cellular automata networks are a neural network implementation of the cellular automata paradigm, popularized by Mathematica's inventor, Stephen Wolfram. Fukushima's neocognitron is a multi-layered network with weight sharing and feature extraction properties that has shown the best performance for handwriting and OCR recognition applications.


A variety of higher-order methods improve the speed of the backpropagation approach. Most widely applied are conjugate gradient networks and the Levenberg-Marquardt algorithm. Recursive networks with feedback connections are more and more applied, especially in neuro-control problems. For control applications specialized and powerful neural network paradigms have been developed, and it is worthwhile noting that a one-to-one equivalence can be derived between feedforward neural nets of the backpropagation type and Kalman filters. Fuzzy logic and neural networks are often combined for control problems.

There is no shortage of neural network tools and most paradigms can be applied to a wide range of problems. Most neural network implementations rely on the backpropagation algorithm. However, which neural network paradigm to use is often a secondary question, and whatever the user feels comfortable with is fair game.






1.8 Literature

The domain of artificial neural networks is vast and its literature is expanding at a fast rate. Without claiming to be complete, let me briefly discuss my favorite neural network references in this section. Note also that an excellent comprehensive introduction to neural networks can be found in the frequently asked questions files on neural networks at various WWW sites (i.e., search "FAQ neural networks" in Alta Vista).


Jose Principe
Probably the standard textbook now for teaching neural networks. Comes with a demo version of NeuroSolutions.

Neural and Adaptive Systems: Fundamentals Through Simulations, Jose Principe, Neil R. Euliano, and W. Curt Lefebvre, John Wiley (2000).

Hagan, Demuth, and Beale
An excellent book for basic comprehensive undergraduate teaching, going back to basics with lots of linear algebra as well, and with good MATLAB illustration files.

Neural Network Design, Hagan, Demuth, and Beale, PWS Publishing Company (1996).


Joseph P. Bigus
Bigus wrote an excellent introduction to neural networks for data mining for the non-technical reader. The book makes a good case for why neural networks are an important data mining tool and discusses the power and limitations of neural networks for data mining. Some conceptual case studies are discussed. The book does not really discuss the theory of neural networks, or how exactly to apply neural networks to a data mining problem, but it nevertheless gives many practical hints and tips.

Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support, McGraw-Hill (1997).


Maureen Caudill
Maureen Caudill has published several books that aim at the beginners' market and provide valuable insight into the workings of neural nets. More than her books, I would recommend a series of articles that appeared in the popular monthly magazine AI EXPERT. Collections of Caudill's articles are bundled as separate special editions of AI EXPERT.


Philip D. Wasserman
Wasserman published two very readable books explaining neural networks. He has a knack for explaining difficult paradigms efficiently and understandably with a minimum of mathematical diversions.

Neural Computing, Van Nostrand Reinhold (1990).
Advanced Methods in Neural Computing, Van Nostrand Reinhold (1993).




Jacek M. Zurada
Zurada published one of the first books on neural networks that can be considered a textbook. It is an introductory-level graduate engineering course with an electrical engineering bias and comes with a wealth of homework problems and software.

Introduction to Artificial Neural Systems, West Publishing Company (1992).

Laurene Fausett
An excellent introductory textbook at the advanced undergraduate level with a wealth of homework problems.

Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, Prentice Hall (1994).


Simon Haykin
Nicknamed the bible of neural networks by my students, this 700-page work can be considered both a desktop reference and an advanced graduate-level text on neural networks with challenging homework problems.

Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company (1995).

Mohammed H. Hassoun
Excellent graduate-level textbook with clear explanations and a collection of very appropriate homework problems.

Fundamentals of Artificial Neural Networks, MIT Press (1995).


John Hertz, Anders Krogh, and Richard G. Palmer
This is one of the earlier and better books on neural networks and provides a thorough understanding of the various neural paradigms and how and why neural networks work. This book is excellent for its references and has an extremely high information density. Even though this book is heavy on the Hopfield network and the statistical mechanics interpretation, I probably consult this book more than any other. It does not lend itself well as a textbook, but for a while it was one of the few good books available. Highly recommended.

Introduction to the Theory of Neural Computation, Addison-Wesley Publishing Company (1991).



Timothy Masters
Masters wrote a series of three books in short succession, and I would call his collection the user's guide to neural networks. If you program your own networks, the wealth of information is invaluable. If you use neural networks, the wealth of information is invaluable. The books come with software and all source code is included. The software is very powerful, but is geared toward the serious C++ user and lacks a decent user's interface for the non-C++ initiated. A must for the beginner and the advanced user.

Practical Neural Network Recipes in C++, Academic Press (1993).
Signal and Image Processing with Neural Networks, John Wiley (1994).
Advanced Algorithms for Neural Networks: A C++ Sourcebook, John Wiley (1995).


Bart Kosko
Advanced electrical engineering graduate-level textbook. Excellent for fuzzy logic and neural network control applications. Not recommended as a general introduction or advanced reference.

Neural Networks and Fuzzy Systems, Prentice Hall (1992).

Guido J. DeBoeck
If you are serious about applying neural networks to stock market speculation, this book is a good starting point. No theory, just applications.

Trading on the Edge: Neural, Genetic and Fuzzy Systems for Chaotic Financial Markets, John Wiley & Sons (1994).



2. NEURAL NETWORK CASE STUDY: POPULATION FORECASTING

2.1 Introduction

The purpose of this case study is to give an overview of the philosophy of artificial neural networks. This case study will inspire the view of neural networks as a model-free regression technique. The study presented here describes how to estimate the world's population for the year 2025 based on traditional regression techniques and based on an artificial neural network.


In the previous section an artificial neural network was explained as a biologically inspired model that can implement a map. This model is based on an interconnection of elementary McCulloch-Pitts neurons. It was postulated that:

(a) Most real-world problems can be formulated as a map.

(b) Such a map can be formally represented by an artificial neural network, where the so-called "weights" are the free parameters to be determined.

(c) Neural networks can "train" their weights to conform with a map using powerful computational algorithms. This model for the map does not only represent the "training samples" quite reasonably, but generally extrapolates well to "test samples" that were not used to train the neural network.

The most popular algorithm for training a neural network is the backpropagation algorithm, which has been rediscovered in various fields over and over again and is generally credited to Dr. Paul Werbos [1]. The backpropagation algorithm was widely popularized in 1986 by Rumelhart and McClelland [2], explaining why the surge in popularity of artificial neural networks is a relatively recent phenomenon. For the derivation and implementation details of the backpropagation algorithm the reader is referred to the literature.


2.2 Population forecasting

The Reverend Thomas Malthus identified in 1798 in his seminal work "An Essay on the Principle of Population" [3] that the world's population grows exponentially while agricultural output grows linearly, predicting gloom and doom for future generations. Indeed, the rapidly expanding population on our planet reminds us daily that the resources on our planet have to be carefully tended to survive gracefully during the next few decades. The data for the world's population from 1650 through 1996 are summarized in Table I and figure 2.1 [4].


TABLE I. Estimates of the world population (1650-1996)

    YEAR    POPULATION (in millions)
    1650     470
    1750     694
    1850    1091
    1900    1571
    1950    2513
    1960    3027
    1970    3678
    1980    4478
    1990    5292
    1995    5734
    1996    5772


In order to build a model for population forecasting we will normalize the data points (Table II). The year 1650 is re-scaled as 0.0 and 2025 as 1.0, and we interpolate linearly in between for all the other years. The reason for doing such a normalization is that it is customary (and often required) for neural networks to scale the data between zero and unity. Since our largest considered year will be 2025, it will be re-scaled as unity. The reader can easily verify that a linear re-normalization of a variable x between a maximum value (max) and a minimum value (min) leads to a re-normalized value x_nor according to:

    x_nor = (x - min) / (max - min)

Because the population increases so rapidly with time we will work with the natural logarithm of the population (in millions) and then re-normalize these data according to the above formula, where (anticipating the possibility of a large forecast for the world's population in 2025) we used 12 as the maximum possible value for the re-normalized logarithm of the population in 2025 and 6.153 as the minimum value. In other words: max in the above formula was arbitrarily assigned a value of 12 to assure that the neural net predictions can accommodate large values. Table II illustrates these transforms for the world population data.




Figure 2.1  Estimates of the world population between 1650 and 1996.



TABLE II. Estimates of the world population and corresponding normalizations

    YEAR    POP     YEAR_nor    ln(POP)    POP_nor
    1650     470     0.000       6.153      0.000
    1750     694     0.267       6.542      0.067
    1850    1091     0.533       6.995      0.144
    1900    1571     0.667       7.359      0.206
    1950    2513     0.800       7.829      0.287
    1960    3027     0.827       8.015      0.318
    1970    3678     0.853       8.210      0.352
    1980    4478     0.880       8.407      0.385
    1990    5292     0.907       8.574      0.414
    1995    5734     0.920       8.654      0.428
    1996    5772     0.923       8.661      0.429
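The normalizations in Table II are easy to reproduce. The short C sketch below applies the re-normalization formula of this section to the raw data of Table I and prints the YEAR_nor, ln(POP) and POP_nor columns.

    #include <stdio.h>
    #include <math.h>

    /* Linear re-normalization to [0, 1]: x_nor = (x - min) / (max - min). */
    static double normalize(double x, double min, double max) {
        return (x - min) / (max - min);
    }

    int main(void) {
        double year[11] = {1650, 1750, 1850, 1900, 1950, 1960,
                           1970, 1980, 1990, 1995, 1996};
        double pop[11]  = {470, 694, 1091, 1571, 2513, 3027,
                           3678, 4478, 5292, 5734, 5772};   /* millions */
        for (int i = 0; i < 11; i++) {
            double year_nor = normalize(year[i], 1650.0, 2025.0);
            double pop_nor  = normalize(log(pop[i]), 6.153, 12.0);
            printf("%4.0f  %5.0f  %6.3f  %6.3f  %6.3f\n",
                   year[i], pop[i], year_nor, log(pop[i]), pop_nor);
        }
        return 0;
    }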


2.3 Traditional regression model for population forecasting

First we will apply traditional regression techniques to population forecasting. The classical Malthusian model assumes that the population grows as an exponential curve. This is equivalent to stating that the natural logarithm of the population will grow linearly with time. Because the re-normalization in the previous paragraph re-scaled the population numbers first into their natural logarithms, we should be able to get by with a linear regression model for the re-scaled values. In other words, we are trying to determine the unknown coefficients a and b in the following population model:

    POP_nor = a * YEAR_nor + b

or, using the traditional symbols Y and X for the dependent and the independent variables,

    Y = a*X + b

It is customary in regression analysis to determine the coefficients a and b such that the sum of the squares of the errors (E) between the modeled values and the actual values is minimized. In other words, the following function needs to be minimized:

    E = sum_{i=1}^{N} (Yhat_i - a*X_i - b)^2

There are N data points, the Yhat_i are the actual data points, and the values a*X_i + b are the estimates according to the model. The values of the coefficients a and b for which this error is minimal can be found by setting the partial derivatives of the error with respect to the unknown coefficients a and b equal to zero and solving this set of two equations for these unknown coefficients. This leads to the following:

    dE/da = -2 * sum_i X_i*(Yhat_i - a*X_i - b) = 0
    dE/db = -2 * sum_i (Yhat_i - a*X_i - b) = 0

or

    a * sum_i X_i^2 + b * sum_i X_i = sum_i X_i*Yhat_i
    a * sum_i X_i   + b * N         = sum_i Yhat_i

It is left as an exercise to the reader to verify that this yields for a and b

    a = (N * sum_i X_i*Yhat_i - sum_i X_i * sum_i Yhat_i) / (N * sum_i X_i^2 - (sum_i X_i)^2)

    b = (sum_i Yhat_i - a * sum_i X_i) / N

where all sums run over the N data points.

Table III illustrates the numerical calculation of a and b, where the first ten data entries were used (in other words, we do not consider the 1996 data point).




TABLE III. Normalized data and intermediate sums used in the regression calculation

    X_nor    Y_nor    x*y      x^2
    0.000    0.000    0.000    0.000
    0.267    0.067    0.018    0.071
    0.533    0.144    0.077    0.284
    0.667    0.206    0.137    0.445
    0.800    0.287    0.230    0.640
    0.827    0.318    0.263    0.684
    0.853    0.352    0.300    0.728
    0.880    0.385    0.339    0.774
    0.907    0.414    0.375    0.823
    0.920    0.428    0.394    0.846
    -----    -----    -----    -----
    6.654    2.601    2.133    5.295    (column sums)



Expressions for a and b can now be evaluated based on the data in Table III:

    a = (10 x 2.133 - 6.654 x 2.601) / (10 x 5.295 - 6.654^2) = 0.464
    b = (2.601 - 0.464 x 6.654) / 10 = -0.049

Forecasting for the year 2025 according to the regression model (X_nor = 1.0) yields the following for the normalized value of the population:

    Y_2025 = a x 1.0 + b = 0.415

When re-scaling back into the natural logarithm of the actual population we obtain:

    ln(POP_2025) = 0.415 x (12 - 6.153) + 6.153 = 8.58

The actual population estimate for the year 2025 is the exponent of this value, leading to an estimate of 5321 million people. Obviously this value is not what we would expect or accept as a forecast. What actually happened is that over the considered time period (1650-1996) the population has been exploding faster than exponentially, and the postulated exponential model is not a very good one. The flaws in this simple regression approach become obvious when we plot the data and their approximations in the re-normalized frame according to figure 2.2. Our model has an obvious flaw, but the approach we took here is a typical regression implementation. Only by plotting our data and predictions, and often after the fact, does the reason for the poor or invalid estimate become obvious. More seasoned statisticians would suggest that we try a higher-order approximation, or use ARMA models and/or other state-of-the-art time series forecasting tools. All these methods are fair game for forecasting and can yield reliable estimates in the hands of the experienced analyst. Nevertheless, from this simple case study we can conclude so far that forecasting the world's population seems to be a challenging forecasting problem indeed.
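The least-squares fit is equally easy to check numerically. The short C sketch below computes a, b and the 2025 forecast directly from the ten normalized data points and should reproduce the values above up to rounding.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Normalized (X, Y) pairs from Table III (1650-1995 data). */
        double x[10] = {0.000, 0.267, 0.533, 0.667, 0.800,
                        0.827, 0.853, 0.880, 0.907, 0.920};
        double y[10] = {0.000, 0.067, 0.144, 0.206, 0.287,
                        0.318, 0.352, 0.385, 0.414, 0.428};
        int n = 10;
        double sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];
            sy  += y[i];
            sxy += x[i] * y[i];
            sxx += x[i] * x[i];
        }
        /* Closed-form least-squares coefficients. */
        double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double b = (sy - a * sx) / n;

        /* Forecast for 2025: X_nor = 1, then undo both normalizations. */
        double y2025 = a * 1.0 + b;
        double lnpop = y2025 * (12.0 - 6.153) + 6.153;
        printf("a = %.3f, b = %.3f\n", a, b);
        printf("2025 forecast: about %.0f million people\n", exp(lnpop));
        return 0;
    }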


2.4 Simple neural network model for population forecasting

In this section we will develop the neural network approach for building a population forecasting model. We will define a very simple network with one input element, two neurons in the hidden layer and one output neuron. We will, however, include two bias nodes (dummy nodes with input unity), which is standard practice for most neural network applications. The network has common sigmoid transfer functions, and the bias is just an elegant way to allow some shifts in the transfer functions as well. The sigmoid transfer function can be viewed as a crude approximation of the threshold function. Remember that an artificial neuron can be viewed as a mathematical operator with the following functions:
viewed as a mathematical operator with following functions:







Figure 2.2  Results from regression analysis on logarithmically normalized data entries.

Figure 2.3  The sigmoid function as a crude approximation to the threshold function. Note that the introduction of bias nodes (i.e., dummy nodes with input unity, as shown in figure 2.4) allows horizontal shifts of the sigmoid (and/or threshold function), allowing more powerful and more flexible approximation.



a) Make a weighted sum of the input signals, resulting in a signal z.

b) Apply a transfer function f(z) to the signal z, which in the case of a sigmoid corresponds to

    f(z) = 1 / (1 + exp(-z))

as illustrated in figure 2.3.


Figure 2.4 is a representation of our simple neural network. Note that there are three neurons and two bias nodes. There are three layers: an input layer, one hidden layer and an output layer. Only the hidden layer and the output layer contain neurons: such a network is referred to as a 1x2x1 net. The two operations of a neuron (weighted sum and transfer function) are symbolically represented on the figure for each neuron (by the symbols Σ and f). In order for a neural network to be a robust function approximator, at least one hidden layer of neurons and generally at most two hidden layers of neurons are required. The neural network represented in figure 2.4 is the most common neural network of the feedforward type and is fully connected. The unknown weights are indicated on the figure by the symbols w_1 through w_7; a short sketch of evaluating such a net with the weights of Table IV follows the table below.

The weights can be considered the neural network equivalent of the unknown regression coefficients from our regression model. The algorithm for finding these coefficients that was applied here is the standard backpropagation algorithm, which minimizes the sum of the squares of the errors similar to the way it was done for regression analysis. However, contrary to regression analysis, an iterative numerical minimization procedure rather than an analytical derivation was applied to estimate the weights in order to minimize the least-squares error measure. The backpropagation algorithm uses a clever trick to solve this problem when a hidden layer of neurons is present in the model. By all means think of a neural network as a more sophisticated regression model. It is different from a regression model in the sense that we do not specify linear or higher-order models for the regression analysis. We specify only a neural network frame (number of layers of neurons, and number of neurons in each layer) and let the neural network algorithm work out what the proper choice for the weights will be. This approach is often referred to as a model-free approximation method, because we really do not specify whether we are dealing with a linear, quadratic or exponential model. The neural network was trained with MetaNeural™, a general-purpose neural network program that uses the backpropagation algorithm and runs on most computer platforms. The neural network was trained on the same 10 patterns that were used for the regression analysis and the screen response is illustrated in figure 2.5.








Figure 2.4  Neural network approximation for the population forecasting problem.

Figure 2.5  Screen response from MetaNeural™ for training and testing the population forecasting model.


Hands-on details for the network training will be left for lecture 3, where we will gain hands-on exposure to artificial neural network programs. The files that were used for the MetaNeural™ program are reproduced in the appendix. The program gave 0.48118 as the prediction for the normalized population forecast in 2025. After re-scaling this would correspond to 7836 million people. Probably a rather underestimated forecast, but definitely better than the regression model. The weights corresponding to this forecast model are reproduced in Table IV. The problem with the neural network model is that a 1-2-1 net is a rather simplistic network and that, the way we represented the patterns, too much emphasis is placed on the earlier years (1650-1850), which are really not all that relevant. By over-sampling (i.e., presenting the data from 1950 onward, let's say, three times as often as the other data) and choosing a 1-3-4-1 network, the way a more seasoned practitioner might approach this problem, we actually obtained a forecast of 11.02 billion people for the world's population in 2025. This answer seems to be a lot more reasonable than the one obtained from the 1-2-1 network. Changing to the 1-3-4-1 model is just a matter of changing a few numbers in the input file for MetaNeural™ and can be done in a matter of seconds. The results for the predictions with the 1-3-4-1 network with over-sampling are shown in figure 2.6.



Figure 2.6  World population prediction with a 1-3-4-1 artificial neural network with over-sampling.


TABLE IV. Weight values corresponding to the neural network in figure 2.4

    WEIGHT    VALUE
    w1       -2.6378
    w2        2.4415
    w3        1.6161
    w4       -1.3550
    w5       -3.6308
    w6        3.0321
    w7       -1.3795
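As announced above, here is a minimal C sketch of evaluating a 1x2x1 feedforward net with bias nodes and the weights of Table IV. Since figure 2.4 is not reproduced here, the assignment of w1-w7 to particular connections (w1, w2: input-to-hidden; w3, w4: hidden biases; w5, w6: hidden-to-output; w7: output bias) is only an assumed layout, so the printed value need not match the 0.48118 quoted in the text.

    #include <stdio.h>
    #include <math.h>

    static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

    /* Forward pass of a 1x2x1 net with bias nodes.  The weight-to-connection
       assignment below is assumed for illustration, not taken from figure 2.4. */
    static double net_1_2_1(double x, const double *w) {
        double h1 = sigmoid(w[0] * x + w[2]);          /* hidden neuron 1 */
        double h2 = sigmoid(w[1] * x + w[3]);          /* hidden neuron 2 */
        return sigmoid(w[4] * h1 + w[5] * h2 + w[6]);  /* output neuron   */
    }

    int main(void) {
        /* Table IV weights, in the assumed order w1 ... w7. */
        double w[7] = {-2.6378, 2.4415, 1.6161, -1.3550, -3.6308, 3.0321, -1.3795};
        double y2025 = net_1_2_1(1.0, w);              /* YEAR_nor = 1.0 for 2025 */
        double pop   = exp(y2025 * (12.0 - 6.153) + 6.153);
        printf("normalized output = %.5f  (about %.0f million people)\n", y2025, pop);
        return 0;
    }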





2.6 Conclusions

A neural network can be viewed as a least-squares, model-free, regression-like approximator that can implement almost any map. Building a forecasting model for the world's population with a simple neural network proceeds similarly to regression analysis and is relatively straightforward. The fact that neural networks are model-free approximators is often an advantage over traditional statistical forecasting methods and standard time series analysis techniques. Where neural networks differ from standard regression techniques is in the way the least-squares error minimization procedure is implemented: while regression techniques rely on closed, one-step analytical formulas, the neural network approach employs a numerical iterative backpropagation algorithm.


2.7 Exercises for the brave

1. Derive the expressions for the parameters a, b, c, d, and e for the following regression model:

   and forecast the world's population for the year 2025 based on this model.

2. Write a MATLAB program that implements the evaluation of the network shown in figure 2.4 and verify the population forecast for the year 2025 based on this 1-2-1 neural network model and the weights shown in Table IV.

3. Expand the MATLAB program of exercise 2 into a program that can train the weights of a neural network based on a random search model. I.e., start with an initial random collection of weights (let's say all chosen from a uniform random distribution between -1.0 and +1.0). Then iteratively adjust the weights by making small random perturbations (one weight at a time), evaluate the new error after showing all the training samples, and retain the perturbed weight if the error is smaller. Repeat this process until the network has a reasonably small error.


2.8 References

[1] P. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," Ph.D. thesis, Harvard University (1974).

[2] D. E. Rumelhart, G. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, Eds., Chapter 8, pp. 318-362, MIT Press, Cambridge, MA (1986).

[3] T. Malthus, "An Essay on the Principle of Population," 1798. Republished in the Pelican Classics series, Penguin Books, England (1976).

[4] Otto Johnson, Ed., "1997 Information Please Almanac," Houghton Mifflin Company, Boston & New York (1996).




APPENDIX: INPUT FILES FOR THE 1-2-1 NETWORK FOR MetaNeural™

ANNOTATED MetaNeural™ INPUT FILE: POP

    3          Three-layered network
    1          One input node
    2          2 neurons in the hidden layer
    1          One output neuron
    1          Show all samples and then update weights
    0.1        Learning parameter, first layer of weights
    0.1        Learning parameter, second layer of weights
    0.5        Momentum, first layer of weights
    0.5        Momentum, second layer of weights
    1000       Do a thousand iterations (for all patterns)
    500        Show intermediate results every 500 iterations on the screen
    1          Standard [0, 1] sigmoid transfer function
    1          Temperature one for sigmoid (i.e., standard sigmoid)
    pop.pat    Name of training pattern file
    0          Ignored
    100        Ignored
    0.01       Stop training when error is less than 0.01
    1          Initial weights are drawn from a uniform random distribution
    0.6        between [-0.6, 0.6]


POP.PAT: The pattern file

    10                  10 training patterns
    0.000 0.000 0       first training pattern
    0.267 0.067 1       second training pattern
    0.533 0.144 2
    0.667 0.206 3
    0.800 0.287 4
    0.827 0.318 5
    0.853 0.352 6
    0.880 0.385 7
    0.907 0.414 8
    0.920 0.428 9