Maria Papaiwannou (periektiki)x - ThesisUCY2012



UNIVERSITY OF CYPRUS

DEPARTMENT OF COMPUTER SCIENCE

Computational Intelligence

Author: Maria Papaioannou

Supervisor: Prof. Christos N. Schizas

Date: 6/7/2011

Table of Contents

Chapter 1 - Introduction
Chapter 2 - Neural Networks
    2.1 - Neurons
        2.1.1 - Biological Neurons
        2.1.2 - Artificial Neuron
    2.2 - Artificial Neural Networks and Architecture
    2.3 - Artificial Neural Networks and Learning
    2.4 - Feed-Forward Neural Network
        2.4.1 - The perceptron algorithm
        2.4.2 - Functions used by Feed-forward Network
        2.4.3 - Back propagation algorithm
    2.5 - Self Organizing Maps
        2.5.1 - Biology
        2.5.2 - Self Organizing Maps Introduction
        2.5.3 - Algorithm
    2.6 - Kernel Functions
    2.7 - Support Vector Machines
        2.7.1 - SVM for two-class classification
        2.7.2 - Multi-class problems
Chapter 3 - Fuzzy Logic
    3.1 - Basic fuzzy set operators
    3.2 - Fuzzy Relations
    3.3 - Imprecise Reasoning
    3.4 - Generalized Modus Ponens
    3.5 - Fuzzy System Development
Chapter 4 - Evolutionary Algorithms
    4.1 - Biology
    4.2 - EC Terminology and History
        4.2.1 - Genetic Algorithms (GA)
        4.2.2 - Genetic Programming (GP)
        4.2.3 - Evolutionary Strategies (ES)
        4.2.4 - Evolutionary Programming (EP)
    4.3 - Genetic Algorithms
        4.3.1 - Search space
        4.3.2 - Simple Genetic Algorithm
        4.3.3 - Overall
Chapter 5 - Fuzzy Cognitive Maps
    5.1 - Introduction
    5.2 - FCM technical background
        5.2.1 - FCM structure
        5.2.3 - Static Analysis
        5.2.4 - Dynamical Analysis of FCM (Inference)
        5.2.5 - Building a FCM
        5.2.6 - Learning and Optimization of FCM
        5.2.7 - Time in FCM
        5.2.8 - Modified FCM
    5.3 - Issues about FCMs
Chapter 6 - Discussion
References


Chapter 1 - Introduction



During the last century the world arrived at an important finding: organized, structured thinking arises from a collection of individual processing units that co-function and interact in harmony without the organized supervision of any super unit. In other words, humanity shed light on how the brain works. The appearance of computers and the outstanding evolution of their computational capabilities further strengthened the human need to explore how the brain works, by modeling single neurons or groups of interconnected neurons and running simulations over them. This gave rise to the first generation of the well-known Artificial Neural Networks (ANN). Over time scientists have made more observations on how humans and nature work. Behavioral concepts of individuals or organizations of individuals, psychological aspects, and biological and genetic findings have contributed to computerized methods which at first fell into the class of “Artificial Intelligence” algorithms but later formed a novel category of algorithms, “Computational Intelligence” (CI).

Computer Science is a field in which algorithmic processes, methods and techniques for describing, transforming and generally utilizing information are created to satisfy different human needs. Humans have to deal with a variety of complex problems in different domains such as Medicine, Engineering, Ecology, Biology and so on. In parallel, humans have accumulated over the years vast databases holding a plethora of kinds of information. The need to search this information, identify structures and distributions in it and, overall, take advantage of it, along with the simultaneous development of computers (at the hardware level), led humans to invent different ways of handling information for their own benefit.

Computational Intelligence lies under the general umbrella of Computer Science and hence comprises algorithms, methods and techniques for utilizing and manipulating data and information. However, as the name of the area indicates, CI methods aim to incorporate some features of human and natural intelligence in handling information through computations. Although there is no strict definition of intelligence, one could roughly say that intelligence is the ability of an individual to adapt, efficiently and effectively, to a new environment, altering its present state to something new; in other words, the ability to take the correct decision about the next action under given circumstances. To accomplish this, certain features of intelligence must exist, such as learning through experience, using language in communication, spatial and environmental cognition, invention, generalization, inference, applying calculations based on common logic and so on. It is clear that humans are the only beings in this world that combine all these features of intelligence.

However, computers are better able to put a large amount of effort into solving a specific problem that humans have explicitly defined, taking advantage of a vast amount of available data. CI technologies aim at mimicking intelligence properties found in humans and nature and at using these properties effectively and efficiently within a method or algorithm, eventually producing better results faster than conventional mathematical or CS methods in the same problem domain.

The main objective of CI research is to understand the fashion in which natural systems function and work. CI combines different elements of adaptation, learning, evolution, fuzzy logic, and distributed and parallel processing, and in general offers models which employ functions that are intelligent to a degree. CI research does not compete with the classic mathematical methods which act in the same domains; rather, it constitutes an alternative technique for problem solving where the employment of conventional methods is not feasible.


The three main sub-areas of CI are Artificial Neural Networks, Fuzzy Logic and Evolutionary Computation. Some of the application areas of CI are medical diagnostics, economics, physics, time scheduling, robotics, control systems, natural language processing, and many others. The growing needs in all these areas and their application in humans’ everyday life intensified CI research. This need guided researchers to investigate further the behavior of nature, which led to the definition of evolutionary systems.

The second chapter is about Artificial Neural Network models, which are inspired by brain processes and structures. A description of how ANNs were first inspired and modeled is given, followed by a basic description of three ANN models. Among the various configurations of ANN topologies and training algorithms, the Multilayer Perceptron (MLP), Kohonen Self Organizing Maps and Support Vector Machines are described. ANNs are applicable in a plethora of areas such as general classification, regression, pattern recognition, optimisation, control, function approximation, time series prediction, data mining and so on.

The third chapter addresses Fuzzy Logic elements and aspects. According to the inventor of Fuzzy Logic himself, Dr. Zadeh, the word “fuzzy” indicates blurred, unclear or confused states or situations. He even stated that the choice of the word “fuzzy” for describing his theory could have been wrong. However, what Fuzzy Logic essentially tries to give is a theoretical schema of how humans think and reason. Fuzzy Logic allows computers to handle information as humans do, described in linguistic terms and not precisely defined, while still being able to apply inference on data and reach correct conclusions. Fuzzy Logic has been applied in a wide range of fields such as control, decision support and information systems, and in many industrial products such as cameras, washing machines, air-conditioners, optimized timetables etc.

The fourth chapter is about Evolutionary Computation (EC), which essentially draws ideas from natural evolution as first stated by Darwin. Evolutionary Computation methods focus on optimization problems. Potential solutions to such problems are created and evolved through discrete iterations (named generations) under selection, crossover and mutation operators. The solutions are encoded into strings (or trees) which act analogously to biological chromosomes. In these algorithms, the parameters comprising a problem are represented either as genotypes (which are evolved based on inheritance) or as phenotypes (which are evolved based on aspects of interaction with the environment). The EC area can be divided into four directions: Evolutionary Strategies, Evolutionary Programming, Genetic Programming and finally Genetic Algorithms. These EC subcategories differ in the solution representation and in the genetic operators they use. Fields where EC has been successfully applied include telecommunications, scheduling, engineering design, ANN architectures and structural optimisation, control optimization, classification and clustering, function approximation and time series modeling, regression and so on.

The fifth chapter describes the model of Fuzzy Cognitive Maps (FCMs), a soft computing methodology for modeling systems that contain causal relationships amongst their parameters. FCMs manage to represent human knowledge and experience in a certain system’s domain as a weighted directed graph model and to apply inference, presenting potential behaviours of the model under specific circumstances.

FCMs can be used to adequately describe and handle fuzzy and not precisely defined knowledge. Their construction is based on knowledge extracted from experts on the modeled system. The experts must define the number of the concepts and linguistically describe their type. Additionally, they have to define the causal relationships between the concepts and describe their type (direct or inverse) as well as their strength. The main features of FCMs are their simplicity of use and their flexibility in designing, modeling and inferring complex and big-sized systems. The basic structure of FCMs and their inference process are described in this chapter. Some proposed algorithms for the adaptation and optimization of FCM weights are also presented. Such algorithms are needed in cases where data describing the system behavior is available and the experts are not in a position to describe the system adequately or completely. Moreover, some attempts at introducing time dependencies are given, along with some hybridized FCMs (mostly with ANN) which are mainly used in classification problems. Finally, some open issues about FCMs are discussed at the end of this chapter.

The last chapter is an open discussion of the CI area and more specifically of Evolutionary Computation and Fuzzy Cognitive Maps, which attracted the greatest interest of the writer of this study.




Chapter 2 - Neural Networks


Neural Networks comprise an efficient mathematical/computational tool which is widely used for pattern recognition. As indicated by the term «neural», an ANN consists of a set of artificial neurons which are interconnected, resembling the architecture and the functionality of the biological neurons of the brain.

2.1 - Neurons

2.1.1 - Biological Neurons


At the biological level, neurons are the structural units of the brain and of the nervous system, and they are characterized by a fundamental property: their cell membrane allows the exchange of signals with interconnected neurons. A biological neuron consists of three basic elements: the body, the dendrites and the axon. A neuron accepts input signals from neighbouring neurons through its dendrites and sends output signals to other neurons through its axon. Each axon ends in a synapse. A synapse is the connection between an axon and the dendrites. Chemical processes take place in synapses which determine the excitation or inhibition of the electrical signal travelling to the neuron’s body. The arrival of an electrical pulse at a synapse causes the release of neurotransmitting chemicals which travel across to the postsynaptic area. The postsynaptic area is a part of the receiving neuron’s dendrite. Since a neuron has many dendrites, it may receive several electrical pulses (input signals) at the same time. The neuron is then able to integrate these input signals and, if the strength of the total input exceeds a threshold, transform the input signals into an output signal sent to its neighbouring neurons through its axon.

Synapses are believed to play a significant role in learning, since the strength of a synaptic connection between neurons is adjusted by chemical processes reflecting the favourable and unfavourable incoming stimuli, helping the organism to optimize its response to the particular task and environment.

The processing power of the brain depends heavily on the parallel, interconnected function of neurons. Neurons are a very important piece in the puzzle of processing complex behaviours and features of a living organism such as learning, language, memory, attention, problem solving, consciousness, etc. The very first neural networks were developed to mimic the biological information processing mechanism (McCulloch and Pitts in 1943, and later others: Widrow and Hoff with Adaline in 1960, Rosenblatt with the Perceptron in 1962).

This study, however, focuses on presenting the ANN as a computational model for regression and classification problems. More specifically, the feed-forward neural network will be introduced, since it appears to have the greatest practical value.

2.1.2 - Artificial Neuron

Following the functionality and the structure of a biological neuron, an artificial neuron:

1. Receives multiple inputs, in the form of real valued variables $x_n = [x_1, \dots, x_D]$. A neuron with $D$ inputs defines a $D$-dimensional space. The artificial neuron will create a $(D-1)$-dimensional hyperplane (decision boundary) which divides the input space into two areas.

2. Processes the inputs by integrating the input values with a set of synaptic parameters $w = [w_1, \dots, w_D]$ to produce a single numerical output signal. The output signal is then transformed using a differentiable, nonlinear function called the activation function.

3. Transmits the output signal to its interconnected artificial neurons.
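The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `logistic` and `neuron_output` are hypothetical, the logistic sigmoid is assumed as one possible differentiable nonlinear activation, and the input and weight values are made up.

```python
import math

def logistic(s):
    """Logistic sigmoid, one differentiable nonlinear activation."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron_output(x, w, activation=logistic):
    """Integrate the inputs with the synaptic weights, then apply the activation."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length D")
    s = sum(w_i * x_i for w_i, x_i in zip(w, x))  # weighted sum of the inputs
    return activation(s)

x = [1.0, 2.0]     # hypothetical input with D = 2 features
w = [0.5, -0.25]   # hypothetical synaptic weights
y = neuron_output(x, w)   # s = 0.5*1.0 - 0.25*2.0 = 0, so y = logistic(0) = 0.5
```

The value `y` is what the neuron would pass on to the neurons connected to it (step 3).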

2.2 - Artificial Neural Networks and Architecture



Most ANNs are divided into layers. Each layer is a set of artificial neurons called nodes. The first layer of the network is the input layer. Each node of the input layer corresponds to a feature variable of the input vector. The last layer of the network is the output layer. Each node of the output layer denotes the final decision/answer of the network to the problem being modelled. The remaining layers between the input and output layers are called hidden layers and their neurons hidden units. The connections between the nodes are characterized by a real value called the synaptic weight, whose role is analogous to that of the biological synapses. A connection with a positive weight can be seen as an excitatory synapse, while a negative weight resembles an inhibitory synapse. There is a variety of ANN categories, depending on their architectures and connection topologies. In this study feed-forward networks will be presented, which constitute one of the simplest forms of ANN.

2.3 - Artificial Neural Networks and Learning


Artificial Neural Networks are data processing systems. The main goal of an ANN is to predict/classify correctly given a specific input instance. Similarly to biological neurons, learning in an ANN is achieved by adapting the synaptic weights to the input data. There is a variety of learning methods for ANNs, drawn from the fields of Supervised Learning, Unsupervised Learning and Reinforcement Learning. In any case, the general goal of an ANN is to recover the mapping function between the input data and the correct/wanted/“good” classification/clustering/prediction of the output data. A basic description of the supervised and unsupervised learning classes is given in Table 2.2.


| Element | Biological neuron | Artificial neuron (AN) |
|---|---|---|
| Stimuli | Action potentials expressed as electrical impulses (electrical signals). | A vector of real valued variables. |
| Synapse | Electrical signals are “filtered” by the strength of the synapse. | Each input variable is weighted by the corresponding synaptic weight. |
| Dendrites | The dendrites receive the electrical signals through their synapse. | The updated, weighted input signal is received in the AN. |
| Soma | Input signals integration. | Integration of the input variables. The result is transformed using an activation function $h(\cdot)$. |
| Axon | Transmits output signals to neighbouring neurons. | A single real variable which is available to pass as input into the neighbouring artificial neurons. |
| Graphically | [Figure: biological neuron] | [Figure: artificial neuron. Inputs $x_1, x_2, \dots, x_D$ are multiplied by the weights $w_1, w_2, \dots, w_D$; the products $w_1 x_1, w_2 x_2, \dots, w_D x_D$ are summed ($\Sigma$) into $s$, and $h(s)$ gives the output $y$.] |

Table 2.1




| | Supervised | Unsupervised |
|---|---|---|
| Dataset | Input vector: a vector containing $D$ input variables/features, so that $X_n = x_1, x_2, \dots, x_D$. Target vector: a vector containing the values of the output variables that the ANN should answer, given an input sample: $T_n = t_1, \dots, t_P$. For $1 \le n \le N$. | Input vector: a vector containing $D$ input variables/features, so that $X_n = x_1, x_2, \dots, x_D$. |
| Network response | Output vector: the answer of the ANN given the instance $X_n$, in a vector containing the values for each output variable: $O_n = o_1, \dots, o_P$. | Assign a neuron or a set of neurons to respond to specific input instances, recovering their underlying relations and projecting them into clusters/groups. |
| Goal | Adjust the parameters $w$ so that the difference $(T_n - O_n)$, called the training error, is minimized. | Adjust the synaptic parameters $w$ in such a way that for each input class/group a specific artificial neuron (or neurons) responds intensively while the remaining neurons do not respond. |
| Methods | Least-mean-square algorithm (for a single neuron) and its generalization, widely known as the back-propagation algorithm (for multilayer interconnection of neurons). | Unsupervised learning using Hebbian learning (e.g. Self Organizing Maps). |

Table 2.2: Supervised and Unsupervised learning overview.

The choice of learning method depends on many parameters such as the data set structure, the ANN architecture, the type of the problem, etc. An ANN learns by running, for a number of iterations, the steps defined by the chosen learning algorithm. The process of applying learning in an ANN is called the training period/phase. The stopping criterion of the training period may be one of the following (or a combination of them):

1. The maximum number of iterations is reached.

2. The cost-error function is minimized/maximized adequately.

3. The difference of all synaptic weight values between 2 sequential runs is substantially small.
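The three stopping criteria above can be combined in a single check that a training loop calls after every iteration. The sketch below is only an illustration: the function name `should_stop`, its parameters and all threshold values are assumptions, and it covers the minimization case of criterion 2.

```python
def should_stop(iteration, error, w_old, w_new,
                max_iterations=1000, error_tol=1e-3, weight_tol=1e-6):
    """Return the name of the first stopping criterion met, or None.

    All thresholds are illustrative assumptions, not values from the text.
    """
    if iteration >= max_iterations:                  # criterion 1
        return "max_iterations"
    if error <= error_tol:                           # criterion 2 (minimization)
        return "error_small"
    largest_change = max(abs(a - b) for a, b in zip(w_new, w_old))
    if largest_change <= weight_tol:                 # criterion 3
        return "weights_converged"
    return None
```

Returning the name of the criterion rather than a plain boolean makes it easy to log why training ended, which helps when judging whether a run stopped too early or too late.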

Generalization is one of the most important properties of a trained computational model. Generalization is achieved if the ANN (or any other computational model) is able to categorize/predict/decide correctly given input instances that were not presented to the network during the training phase (the test dataset). A measure of the generalization of the network is the testing error (an evaluation of the difference between the output of the network, given an input instance taken from the test dataset, and its target value). If the ANN fails to generalize (meaning the testing error is large, considerably bigger than the training error), then phenomena such as overfitting and underfitting may appear. Overfitting happens when the ANN is overtrained, so that the model captures the exact relationship between the specific input-output pairs used during the training phase. Eventually the network becomes just a “look-up table”, memorizing every input instance (even noise points) and connecting it with the corresponding output. As a result, the network is incapable of recognizing a new testing input instance and responds fairly randomly. Underfitting appears when the network is not able to capture the underlying function mapping input to output data, either due to the small size of the training dataset or due to the poor architecture of the model. That is why designing an ANN requires special attention to choosing a model suited to the problem to be solved or modeled. Such considerations are:

1. If the training dataset is a sample of the actual dataset, then the chosen input instances should be representative, covering all the possible classes/clusters/categories of the actual dataset. For example, if the problem is to diagnose whether a patient has a particular disease or not, then the sample dataset should include instances of both diseased and non-diseased patients.

2. The number of input instances is also important. If the number of input instances is fairly small, then there is a danger of underfitting, since the network will not be given the amount of information needed to shape the input-output mapping function. Note that adequate instances should be given for all the classes of the input data. Additionally, the size of the training dataset should be defined with respect to the dimension of the input data space. The larger the data dimensionality, the larger the dataset size should be, allowing the network to adapt the synaptic weights (whose number increases/decreases according to the increase/decrease of the input feature variables).

3. The network architecture, meaning the number of hidden layers and the number of their hidden units, the network topology, etc.

4. If the stopping criterion is defined exclusively by the maximum number of epochs and the network is left to run for a long time, there is an increased risk of overfitting. On the other hand, if the network is trained only for a small number of epochs, underfitting is possible.


Figure 2.1: How an ANN responds to a regression problem when underfitting, good fitting and overfitting occur.
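The three situations named in Figure 2.1 can be imitated with a small experiment that compares training and testing error. The sketch below does not train a neural network: it uses three stand-in models chosen for illustration (a constant predictor for underfitting, a least-squares line for a good fit, and a nearest-neighbour look-up table for overfitting) on hypothetical noisy data generated from $t = 2x$. All function names, the noise level and the sample sizes are assumptions.

```python
import random

def mse(model, data):
    """Mean squared error of a model (a callable x -> y) over (x, t) pairs."""
    return sum((model(x) - t) ** 2 for x, t in data) / len(data)

def fit_constant(train):
    """Underfitting candidate: always answers the mean training target."""
    mean_t = sum(t for _, t in train) / len(train)
    return lambda x: mean_t

def fit_line(train):
    """Least-squares straight line, matching the assumed linear trend."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_t = sum(t for _, t in train) / n
    slope = (sum((x - mean_x) * (t - mean_t) for x, t in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    return lambda x: mean_t + slope * (x - mean_x)

def fit_lookup(train):
    """Overfitting candidate: a look-up table that memorizes every
    training pair (noise included) and answers with the nearest one."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

random.seed(0)  # fixed seed so repeated runs give the same numbers

def sample(n):
    """Hypothetical noisy data: t = 2x + Gaussian noise."""
    points = []
    for _ in range(n):
        x = random.uniform(0.0, 1.0)
        points.append((x, 2.0 * x + random.gauss(0.0, 0.3)))
    return points

train, test = sample(30), sample(30)
models = {"constant": fit_constant(train),
          "line": fit_line(train),
          "lookup": fit_lookup(train)}
for name, model in models.items():
    print(f"{name:8s} train={mse(model, train):.3f} test={mse(model, test):.3f}")
```

The look-up table reaches zero training error because it reproduces every training target exactly, noise included, so its training error says nothing about how it behaves on the test set. Comparing the two printed columns for each model is the check described in the text.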

2.4 - Feed-Forward Neural Network


Artificial neural networks have become, over the years, a big family of different models combined with a variety of learning methods. One of the most popular and prevalent forms of neural network is the feed-forward network combined with backpropagation learning. A feed-forward network consists of an input layer, an output layer and optionally one or more hidden layers. The name “feed-forward” denotes the direction of the information processing, which is guided by the connectivity between the network layers. The nodes (artificial neurons) of each layer are connected with all the nodes of the next layer, whereas the neurons belonging to the same layer are independent.

Feed-forward networks are also called “multilayer perceptrons”, as they follow the basic idea of the elementary perceptron model in a multilayered form. For that reason, a description of the elementary perceptron algorithm is given first, paving the way for a better understanding of the “multilayer perceptron” neural network.

2.4.1 - The perceptron algorithm

The perceptron algorithm (Rosenblatt, 1962) belongs to the family of linear discriminant models. The model is divided into two phases:

1. The nonlinear transformation of the input vector $x$ to a feature vector $\phi(x)$ (e.g. by using fixed basis functions).

2. The construction of the generalized linear model as:

$$y(x_n) = f\left(w^T \phi(x_n)\right) \qquad \text{(Eq. 2.1)}$$

where for $N$ input instances $1 \le n \le N$, $w$ is the weight vector and $f(\cdot)$ is a step function of the form:

$$f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases} \qquad \text{(Eq. 2.2)}$$

The perceptron is a two-class model, meaning that if the output $y(x_n)$ is $+1$, the input pattern $x_n$ belongs to class $C_1$, whereas if $y(x_n)$ is $-1$ then $x_n$ belongs to class $C_2$. This model can be rendered by a single artificial neuron, as described previously in this study. The output of the perceptron depends on the weighted feature vector $\phi(x_n)$. Therefore the intention of the perceptron algorithm is to find a weight vector $w^*$ such that patterns $x_n$ which are members of class $C_1$ will satisfy $w^T \phi(x_n) > 0$, while patterns $x_n$ of class $C_2$ will satisfy $w^T \phi(x_n) < 0$. Setting the target output $t_n \in \{+1, -1\}$ for class $C_1$ and class $C_2$ respectively, all the correctly classified patterns should satisfy:

$$w^T \phi(x_n)\, t_n > 0 \qquad \text{(Eq. 2.3)}$$

The error function (named the perceptron criterion) is zero for correctly classified patterns ($y(x_n)$ is equal to $t_n$), while for the set of misclassified patterns, denoted by $M$, it gives the quantity:

$$E(w) = -\sum_{m \in M} w^T \phi(x_m)\, t_m \qquad \text{(Eq. 2.4)}$$

The goal is the minimization of the error function and, to do so, the weight vector should be updated every time the error function is not zero. The weight update uses the gradient descent algorithm, so that:

$$w(i+1) = w(i) - \eta \nabla E(w) = w(i) + \eta\, \phi(x_m)\, t_m \qquad \text{(Eq. 2.5)}$$

where $i$ is the counter of the algorithm iterations and $\eta$ is the learning rate of the algorithm, defining the rate at which the weight vector changes.
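The step from Eq. 2.4 to Eq. 2.5 can be made explicit with a short worked derivation. It is a sketch that assumes the update is applied to one misclassified pattern $x_m$ at a time (stochastic gradient descent), which the single-pattern form of Eq. 2.5 suggests; the symbol $E_m$ for the contribution of that one pattern is introduced here only for illustration.

```latex
\begin{align*}
E_m(w) &= -\,w^T \phi(x_m)\, t_m
  && \text{contribution of one misclassified pattern to Eq. 2.4} \\
\nabla_w E_m(w) &= -\,\phi(x_m)\, t_m
  && \text{since } \nabla_w \left( w^T \phi \right) = \phi \\
w(i+1) &= w(i) - \eta\, \nabla_w E_m(w) = w(i) + \eta\, \phi(x_m)\, t_m
  && \text{Eq. 2.5}
\end{align*}
```

With $t_m = +1$ the feature vector is added to the weights, and with $t_m = -1$ it is subtracted.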

The overall perceptron algorithm is divided into the following steps:

For each input instance $x_n$ at step $i$ apply:

1. $y(x_n) = f\left(w(i)^T \phi(x_n)\right)$

2. if not(equal($y(x_n)$, $t_n$))
       if equal($t_n$, $+1$)
           $w(i+1) = w(i) + \eta\, \phi(x_n)$
       else
           $w(i+1) = w(i) - \eta\, \phi(x_n)$
   else move forward

3. increase the step counter $i$ and move to the next input instance
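The steps above can be sketched as a short Python routine. This is a minimal illustration under stated assumptions: the names `step` and `train_perceptron` are hypothetical, the feature vectors $\phi(x_n)$ are assumed to be precomputed, an input of exactly zero is mapped to $+1$, passes over the data are repeated until no pattern is misclassified or a maximum number of passes (`max_epochs`, an assumed parameter) is reached, and the toy dataset (logical AND) is made up.

```python
def step(a):
    """Step function of Eq. 2.2; mapping a == 0 to +1 is an assumption."""
    return 1 if a >= 0 else -1

def train_perceptron(features, targets, eta=1.0, max_epochs=100):
    """Perceptron learning on precomputed feature vectors phi(x_n).

    features: list of equal-length lists; targets: list of +1 / -1.
    Returns the learned weight vector.
    """
    w = [0.0] * len(features[0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, t in zip(features, targets):
            y = step(sum(w_k * p_k for w_k, p_k in zip(w, phi)))
            if y != t:
                # t = +1 adds eta * phi, t = -1 subtracts it (Eq. 2.5)
                w = [w_k + eta * t * p_k for w_k, p_k in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:   # every pattern classified correctly
            break
    return w

# Hypothetical toy problem: logical AND with targets in {+1, -1}.
# phi(x) = [1, x1, x2]; the leading 1 acts as a bias feature.
features = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
targets = [-1, -1, -1, 1]
w = train_perceptron(features, targets)
predictions = [step(sum(w_k * p_k for w_k, p_k in zip(w, phi)))
               for phi in features]
```

Because the AND problem is linearly separable, the loop ends with every pattern classified correctly; for data that is not linearly separable the `max_epochs` limit is what stops the loop.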

2.4.2 - Functions used by Feed-forward Network

A feed-forward network is also called a multilayered perceptron, since it can be decomposed into $HL + 1$ stages of processing, each one containing multiple neurons behaving like elementary perceptrons, where $HL$ is the number of hidden layers of the network. However, attention should be given to the fact that ANN hidden neurons use continuous sigmoidal nonlinearities, whereas the perceptron uses step-function nonlinearities.


Figure 2.2: A feed-forward network with 2 hidden layers

An example of a feed-forward network is shown in Figure 2.2. The particular network consists of one input layer, two hidden layers and an output layer. The features of an input pattern are assigned to the nodes of the input layer. In Figure 2.2 the input pattern has two features, $x_1$ and $x_2$. Hence the input layer consists of two nodes. The data is then processed by the hidden layer nodes, ending in one output node which gives the answer to the problem modelled by the input pattern $x_n$ and the synaptic weight vector $w$.



As can be seen, there is an extra node in every layer except the output layer, marked with a zero. It is called the bias parameter, and the synaptic weights beginning from the bias nodes are deliberately drawn with thinner lines, since adding a bias node is optional for every layer. The bias parameter is generally added to a layer to allow a fixed offset in the data. For that reason the value of a bias node is always 1, so that its weights are simply added to the weighted values of the other nodes.

The connectivity of the network reflects the direction of information propagation: it begins at the input layer and proceeds sequentially to the output layer. First, an input pattern $X_n = x_1, \dots, x_D$ is introduced to the network (that means the input layer consists of $D$ nodes¹). $M$ linear combinations between the input nodes and the hidden units of the first hidden layer are created as:

$$p_j = h\left(\sum_{i=1}^{D} w_{ij}\,x_i + w_{0j}\right)$$ Eq. 2.6

where $j = 1, \dots, M$ ($M$ is the number of hidden units of the first hidden layer, excluding the bias node) and $h(\cdot)$ is a differentiable, nonlinear activation function, generally chosen from sigmoidal functions such as the logistic sigmoid or the tanh. The output values of the first hidden layer neurons are linearly combined in the form of:

$$q_k = h\left(\sum_{j=1}^{M} w_{jk}\,p_j + w_{0k}\right)$$ Eq. 2.7

to give the output of the second hidden layer units. The same process is repeated for the units of every additional hidden layer. The last step is the evaluation of the output units (belonging to the output layer), where again linear combinations of the $K$ last hidden layer variables are formed:

$$o_r = f\left(\sum_{k=1}^{K} w_{kr}\,q_k + w_{0r}\right)$$ Eq. 2.8

where $r = 1, \dots, R$, $R$ is the total number of outputs, and $f$ might be an activation function (e.g. the same as $h(\cdot)$) or the identity.

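Taking the network presented in Figure 2.2, the functioning of the whole model is described as:

$$y_r(x, w) = f\left(\sum_{k=1}^{K} w_{kr}\,h\left(\sum_{j=1}^{M} w_{jk}\,h\left(\sum_{i=1}^{D} w_{ij}\,x_i + w_{0j}\right) + w_{0k}\right) + w_{0r}\right)$$ Eq. 2.9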
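The layer-by-layer evaluation of Eq. 2.6 to Eq. 2.8 can be illustrated with a short Python sketch. It assumes that every layer computes an activation of a weighted sum plus a bias weight, that $h$ is the logistic sigmoid and that $f$ is the identity; the function names and the nested-list layout of the weights are illustrative choices, not part of the text.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, one possible choice for h."""
    return 1.0 / (1.0 + math.exp(-a))


def layer(inputs, weights, biases, act):
    """weights[i][j] connects input unit i to unit j; biases[j] is w_0j."""
    return [
        act(sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j])
        for j in range(len(biases))
    ]


def forward(x, params, h=sigmoid, f=lambda a: a):
    """Forward pass through two hidden layers and one output layer."""
    (w_ij, w_0j), (w_jk, w_0k), (w_kr, w_0r) = params
    p = layer(x, w_ij, w_0j, h)  # first hidden layer, Eq. 2.6
    q = layer(p, w_jk, w_0k, h)  # second hidden layer, Eq. 2.7
    o = layer(q, w_kr, w_0r, f)  # output layer, Eq. 2.8
    return p, q, o


# Hypothetical network with D = 2, M = 2, K = 2, R = 1.
params = (
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),
    ([[1.0], [1.0]], [0.25]),
)
p, q, o = forward([0.3, -0.7], params)
```

With all hidden weights set to zero every hidden unit outputs $h(0) = 0.5$, which makes the result easy to check by hand.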

The final output of the network, given an input pattern $X_n$, is the vector $O_n = o_1, \dots, o_R$. Each element of the vector is the network's answer/decision for the corresponding element of the target vector $t_n$. As in the perceptron model, if the target vector differs from the network output then there is an error (given by an error function). What is left is to train the network by adapting the synaptic interconnections, guided by the minimization of an error function. One of the most well-known learning algorithms for feed-forward neural networks is back propagation learning.

¹ The word “nodes” is used, but for the input layer it is better understood as incoming signals.

2.4.3 - Back propagation algorithm

The back propagation algorithm tries to compute the weight vector values so that, after the training phase, the network will be able to produce the right output values given a new input pattern taken from the test dataset.

Initially all the synaptic weights are randomized. Then the algorithm repeats the following steps:

Firstly, an input pattern $X_n$ is presented to the network and the feed-forward network functions are calculated in every layer, as described in the above section, to get the output values in the output layer.

The next step is to measure the distance (error) between the network output $y(X_n, w)$ and the target output $t_n$. To do so, an error function must be defined. The most common error function applied in the backpropagation algorithm for feed-forward networks is the sum-of-squares, defined by:

$$E_n = \frac{1}{2}\sum_{r=1}^{R}\left(t_{nr} - o_{nr}\right)^2$$ Eq. 2.10

where $R$ is the number of output variables (output units), $t$ is the target vector, $o$ is the actual network output vector and $n$ is the index over input patterns (instances/observations).

Since the target is to find the weight values which minimize the error function, the change of the error function with respect to the change of the weight vector must be calculated. However, there might be several output nodes (each one corresponding to an element of the target vector), so the gradient of the error function for each output node with respect to its synaptic weights should be calculated as:

$$\nabla E_{nr} = \frac{\partial E_{nr}}{\partial w_r} = \left(\frac{\partial E_{nr}}{\partial w_{kr}}\right)$$ for $k = 1, \dots, K$, $r = 1, \dots, R$ Eq. 2.11

where:

$E_{nr}$ is the error function for the output node $r$ given the input pattern $n$.

$w_r$ is the synaptic weight vector including all the weight values of the connections between the output node $r$ and the last hidden layer.

$K$ is the number of units of the last hidden layer and $R$ is the number of output units.

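Using Equation 2.10, the gradient with respect to each weight value is:

$$\frac{\partial E_{nr}}{\partial w_{kr}} = \frac{\partial}{\partial w_{kr}}\left[\frac{1}{2}\left(t_{nr} - o_{nr}\right)^2\right]$$ Eq. 2.12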

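By using the chain rule:

$$\frac{\partial E_{nr}}{\partial w_{kr}} = \frac{\partial E_{nr}}{\partial o_{nr}}\,\frac{\partial o_{nr}}{\partial w_{kr}}$$ Eq. 2.13

where:

$$\frac{\partial E_{nr}}{\partial o_{nr}} = -\left(t_{nr} - o_{nr}\right)$$ Eq. 2.14

and (recall Eq. 2.8)

$$\frac{\partial o_{nr}}{\partial w_{kr}} = f'(\mathit{inp}_r)\,q_k$$ Eq. 2.15

where $f(\cdot)$ is the output activation function and $\mathit{inp}_r$ is the linear combination of the weights and the corresponding hidden unit values (the input of the function $f$ in Eq. 2.8).

Therefore,

$$\frac{\partial E_{nr}}{\partial w_{kr}} = -\left(t_{nr} - o_{nr}\right) f'(\mathit{inp}_r)\,q_k$$ Eq. 2.16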


The general goal is to find the combination of weights $w_{kr}$ which leads to the global minimum of the error function. Hence, gradient descent is applied to every synaptic weight as:

$$w_{kr} = w_{kr} - \eta\,\frac{\partial E_{nr}}{\partial w_{kr}} = w_{kr} + \eta\left(t_{nr} - o_{nr}\right) f'(\mathit{inp}_r)\,q_k$$ Eq. 2.17

where $\eta$ is the learning rate.

The process so far resembles perceptron learning (the perceptron criterion). However, backpropagation differs, since the next step is to calculate each hidden unit's contribution to the final output error and update the corresponding weights accordingly. The idea is that every hidden unit affects the final output indirectly (since each hidden unit is fully connected with all the units of the next layer). Therefore, every hidden unit's interconnection weights should be adjusted to minimize the output error. The difficulty is that there is no target vector for the hidden unit values, which means that the difference (error) between the calculated value and the correct one cannot be computed directly as it was at the previous step.

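So, the next synaptic weights to be adjusted are the ones connecting the last hidden layer (with $K$ nodes) with the layer before it (with $M$ nodes), $w_{jk}$ for $j = 1, \dots, M$ and $k = 1, \dots, K$. The change of the error function with respect to the change of these weights is given by:

$$\frac{\partial E_n}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\left[\frac{1}{2}\sum_{r=1}^{R}\left(t_{nr} - o_{nr}\right)^2\right]$$ Eq. 2.18

$$= -\sum_{r=1}^{R}\left(t_{nr} - o_{nr}\right)\frac{\partial o_{nr}}{\partial w_{jk}}$$ Eq. 2.19

$$= -\sum_{r=1}^{R}\left(t_{nr} - o_{nr}\right) f'(\mathit{inp}_r)\,\frac{\partial\,\mathit{inp}_r}{\partial w_{jk}}$$ Eq. 2.20

$$= -\sum_{r=1}^{R}\left(t_{nr} - o_{nr}\right) f'(\mathit{inp}_r)\,w_{kr}\,\frac{\partial q_k}{\partial w_{jk}}$$ Eq. 2.21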


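At this point let

$$\delta_r = \left(t_{nr} - o_{nr}\right) f'(\mathit{inp}_r)$$ Eq. 2.22

so that:

$$\frac{\partial E_n}{\partial w_{jk}} = -\sum_{r=1}^{R}\delta_r\,w_{kr}\,\frac{\partial q_k}{\partial w_{jk}}$$ Eq. 2.23

$$= -\frac{\partial q_k}{\partial w_{jk}}\sum_{r=1}^{R}\delta_r\,w_{kr}$$ Eq. 2.24

$$= -\frac{\partial h(\mathit{inp}_k)}{\partial w_{jk}}\sum_{r=1}^{R}\delta_r\,w_{kr}$$ Eq. 2.25

$$= -h'(\mathit{inp}_k)\,\frac{\partial\,\mathit{inp}_k}{\partial w_{jk}}\sum_{r=1}^{R}\delta_r\,w_{kr}$$ Eq. 2.26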

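where $\mathit{inp}_k$ is the input of the function $h$ as given in Eq. 2.7, and hence:

$$\frac{\partial E_n}{\partial w_{jk}} = -h'(\mathit{inp}_k)\,\frac{\partial}{\partial w_{jk}}\left(\sum_{j'=1}^{M} w_{j'k}\,p_{j'} + w_{0k}\right)\sum_{r=1}^{R}\delta_r\,w_{kr}$$ Eq. 2.27

$$= -h'(\mathit{inp}_k)\,p_j\sum_{r=1}^{R}\delta_r\,w_{kr}$$ Eq. 2.28

$$= -p_j\left[h'(\mathit{inp}_k)\sum_{r=1}^{R}\delta_r\,w_{kr}\right]$$ Eq. 2.29

Let

$$\delta_k = h'(\mathit{inp}_k)\sum_{r=1}^{R}\delta_r\,w_{kr}$$ Eq. 2.30

Then

$$\frac{\partial E_n}{\partial w_{jk}} = -p_j\,\delta_k$$ Eq. 2.31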



Given the change of the error function with respect to the change of the weights into the last hidden layer, the corresponding weight update will be:

$$w_{jk} = w_{jk} - \eta\,\frac{\partial E_n}{\partial w_{jk}} = w_{jk} + \eta\,p_j\,\delta_k$$ Eq. 2.32

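Following the same method of derivation as above, the updates for the synaptic weights of the remaining hidden layers are obtained. In general, each hidden unit weight is updated by the learning rate times the value entering the connection times the $\delta$ of the unit the connection feeds, where that $\delta$ is the derivative of the unit's activation multiplied by the weighted sum of the $\delta$ values of the next layer.

For example, for the next hidden layer weights, the update would be:

$$w_{ij} = w_{ij} - \eta\,\frac{\partial E_n}{\partial w_{ij}} = w_{ij} + \eta\,x_i\,\delta_j \quad \text{where} \quad \delta_j = h'(\mathit{inp}_j)\sum_{k=1}^{K}\delta_k\,w_{jk}$$ Eq. 2.33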


Regarding the amount of the dataset needed to evaluate the gradient of an error function, learning methods can be split into two categories, batch and incremental. Batch methods require the whole training dataset to be processed before evaluating the gradient and updating the weights. Incremental methods, on the other hand, update the weights using the gradient of the error function after each input pattern is processed.


Consequently, backpropagation can be applied in batch or in incremental mode, as follows:

Batch Backpropagation

• Initialize the weights W randomly

• Repeat:

   o For every input pattern $X_n$:

      a. Present $X_n$ to the network and apply the feed-forward propagation functions to all layers.

      b. Find the gradient of the error function with respect to every synaptic weight ($\partial E_n / \partial w$).

      c. If $X_n$ was the last input pattern of the training dataset move forward, otherwise go back to step a.

   o For every synaptic weight:

      a. Sum up all the gradients of the error function with respect to the synaptic weight: $\frac{\partial E}{\partial w} = \sum_{n=1}^{N}\frac{\partial E_n}{\partial w}$

   o Update the weight using $w = w - \eta\,\frac{\partial E}{\partial w}$

• Until the stopping criterion is true


Incremental Backpropagation

• Initialize the weights W randomly

• Repeat:

   o For every input pattern $X_n$:

      a. Present an input pattern $X_n$ to the network and apply the feed-forward propagation functions to all layers.

      b. Find the gradient of the error function with respect to every synaptic weight ($\partial E_n / \partial w$).

      c. Update the weights using $w = w - \eta\,\frac{\partial E_n}{\partial w}$

• Until the stopping criterion is true
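The difference between the two modes can be made concrete with a small Python sketch. To keep it short it makes several assumptions that are not in the text: a single hidden layer with sigmoid units, an identity output function, weights stored per receiving unit with the bias weight as the last entry of each row, and a hypothetical XOR-style dataset. The function `gradients` follows the $\delta$ recursion derived above.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W1, W2):
    # W1[k] holds the weights into hidden unit k, W2[r] those into output r;
    # the last entry of each row is the bias weight (bias input fixed to 1).
    xb = list(x) + [1.0]
    q = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
    qb = q + [1.0]
    o = [sum(w * v for w, v in zip(row, qb)) for row in W2]  # identity f
    return xb, qb, o


def error(x, t, W1, W2):
    o = forward(x, W1, W2)[2]
    return 0.5 * sum((t_r - o_r) ** 2 for t_r, o_r in zip(t, o))


def gradients(x, t, W1, W2):
    xb, qb, o = forward(x, W1, W2)
    delta_r = [t_r - o_r for t_r, o_r in zip(t, o)]  # f' = 1 for the identity
    delta_k = [
        qb[k] * (1.0 - qb[k]) * sum(W2[r][k] * delta_r[r] for r in range(len(W2)))
        for k in range(len(W1))
    ]  # sigmoid derivative: h' = q (1 - q)
    g1 = [[-dk * v for v in xb] for dk in delta_k]
    g2 = [[-dr * v for v in qb] for dr in delta_r]
    return g1, g2


def apply_update(W, g, eta):
    for row, grow in zip(W, g):
        for c in range(len(row)):
            row[c] -= eta * grow[c]


def incremental_epoch(data, W1, W2, eta):
    for x, t in data:  # weights change after every single pattern
        g1, g2 = gradients(x, t, W1, W2)
        apply_update(W1, g1, eta)
        apply_update(W2, g2, eta)


def batch_epoch(data, W1, W2, eta):
    s1 = [[0.0] * len(row) for row in W1]
    s2 = [[0.0] * len(row) for row in W2]
    for x, t in data:  # accumulate gradients over the whole dataset
        g1, g2 = gradients(x, t, W1, W2)
        apply_update(s1, g1, -1.0)  # s1 += g1
        apply_update(s2, g2, -1.0)  # s2 += g2
    apply_update(W1, s1, eta)  # one update per pass over the data
    apply_update(W2, s2, eta)


random.seed(0)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
```

Calling `incremental_epoch(data, W1, W2, eta)` or `batch_epoch(data, W1, W2, eta)` repeatedly, until a stopping criterion of one's choice holds, corresponds to the "Repeat ... Until" loops of the two listings.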



As stated previously, a trained network should be able to generalize over new input patterns that were not presented during the training period. The most common generalization assessment method splits the total available dataset into a training and a testing dataset. The two have no patterns in common, but each has an adequate number of patterns from all classes. Reasonably, the training dataset will be larger than the testing one. The training dataset is used to train the network and update the weights. After all training input patterns have been presented to the network and its weights have been adjusted using backpropagation (either batch or incremental mode), the network applies feed-forward propagation to the testing input instances. For each testing input pattern, the error between the network output and the testing target output is calculated. Finally, the average testing error can be interpreted as a simple measure of the network's generalization: the smaller the testing error, the better the generalization of the network. If the testing error meets the standards defined by the ANN user, the network training stops and the final weights are used to give any further answers to the modelled problem. On the contrary, if the testing error is larger than a predefined constant, then the training of the network must continue.

A method called k-fold cross validation is an alternative generalization evaluation method, and it appeals especially to cases with a scarce dataset. In such cases, where the total dataset has few input instances, sacrificing a number of them to create a testing dataset might lead to deficient learning, since there is a high risk that the synaptic interconnections of the network will not get adequate information about the input space to adjust to. Applying k-fold cross validation requires randomly splitting the initial dataset into $k$ equally sized, disjoint datasets. The training period is also divided into $k$ phases. During each training phase, the network is trained using $k - 1$ datasets, leaving one out, called the validation dataset. After each training phase, the validation dataset is presented to the network, applying feed-forward propagation and calculating the error between the network output and the validation target vector. The process is repeated $k$ times, each time leaving a different part of the total dataset out as the validation set. At the end, the average of the $k$ validation scores is an indicator of the network's generalization ability. The goal of this method is to exploit the information given by the whole dataset to train the network while at the same time having a generalization estimator for the network's learning progress.
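As a rough illustration of the procedure, the following Python sketch splits pattern indices into $k$ disjoint folds and averages the validation errors. The callback `train_and_score` is hypothetical: it stands for whatever routine trains a fresh network on the training indices and returns the error on the validation indices.

```python
import random


def k_fold_indices(n_patterns, k, seed=0):
    """Randomly split the indices 0 .. n_patterns-1 into k disjoint folds."""
    idx = list(range(n_patterns))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validate(n_patterns, k, train_and_score, seed=0):
    """Average validation error over k learning phases."""
    folds = k_fold_indices(n_patterns, k, seed)
    errors = []
    for v in range(k):
        validation = folds[v]  # the fold left out in this phase
        training = [i for f in range(k) if f != v for i in folds[f]]
        errors.append(train_and_score(training, validation))
    return sum(errors) / k


folds = k_fold_indices(20, 10)
# Dummy scorer used only to show the call shape: it returns the fold size.
average = cross_validate(20, 10, lambda training, validation: float(len(validation)))
```

When the number of patterns is not a multiple of $k$, this slicing scheme produces folds whose sizes differ by at most one.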


[Figure 2.3 diagram: the initial total dataset is divided into 10 folds. In learning phase #1 the 1st fold is the validation set, giving $E_1$, and the remaining folds form the training dataset; in learning phase #2 the 2nd fold is the validation set, giving $E_2$; and so on up to learning phase #10, where the 10th fold is the validation set, giving $E_{10}$.]

Figure 2.3: 10-fold cross validation, where orange parts represent the validation set at each learning phase and the purple ones the training set. $E_n$, where $n = 1, \dots, 10$, is the corresponding validation error. By the completion of all 10 learning phases the total validation (or testing) error will be the average $\frac{1}{10}\sum_{n=1}^{10} E_n$ or any other combination of the validation errors (e.g. just their sum).



2.5 - Self Organizing Maps


There are some systems for which the available data is not classified/clustered/categorized in any way, so the correct output/response of the model to an input data pattern is not provided. However, certain underlying relations can often be extracted in the sense of clustering the data. The identification of such relations is the task of unsupervised learning methods. As mentioned before in this chapter, neural networks can be trained using unsupervised learning. The Self Organizing Map (SOM) is an unsupervised learning neural network paradigm.


2.5.1 - Biology

The SOM model is neurobiologically inspired by the way the cerebral cortex of the brain handles signals from various sources of input (motor, somatosensory, visual, auditory, etc.). The cerebral cortex is organised in regions. Each physical region is assigned to the processing of a different sensory input.



Figure 2.4 Mapping of the visual field (right side) on the cortex (left side)




The visual cortex is the part of the cerebral cortex which processes visual information. More specifically, the primary visual cortex, symbolized as V1, is mainly responsible for processing the spatial information in vision.

The left part of Figure 2.4 shows the left V1. The right side represents the visual field of the right eye, where the inner circle of the visual field represents everything that is captured by the fovea. Figure 2.4 shows the correspondence between the subsets of the visual field and the corresponding visual processing regions on V1. Figure 2.4 makes it clear that neighbouring regions of the visual field are mapped to neighbouring regions in the cortex. An additional interesting observation is that the visual cortex surface which is responsible for processing the information coming from the fovea (labelled as “1”) is disproportionately large compared to the inner circle surface of the visual field (fovea). This leads to the conclusion that the signals arriving from the foveal visual field need more processing power than the signals from the surrounding regions of the visual field. That is because signals coming from the fovea have high resolution and hence must be processed in more detail. Consequently, it seems that the brain cortex is topologically organized in such a way that it resembles the topologically ordered representation of the visual field.


2.5.2 - Self Organizing Maps Introduction

A Self Organizing Map, as introduced by Kohonen in 1982, is a neural network consisting of an output layer fully connected, through synaptic weights, to the feature variables of an input instance (Figure 2.5). The name of this model concisely explains its data processing mechanism. The neurons of the output layer are most often organized in a two-dimensional lattice. During the training phase, output neurons adapt their synaptic weights based on the features of a presented input pattern, without being supervised, by applying competitive learning. Eventually, when an adequate number of input patterns has been shown to the model, the synaptic weights will be organized in such a way that each output neuron responds to a cluster/class of the input space, and the final topological ordering of the synaptic weights will resemble the input distribution.

Drawing an analogy between Kohonen's network and the biological model, the input layer can be thought of as the input signals and the output layer as the cerebral cortex. Hence, this kind of network must be trained in such a way that eventually each neuron, or group of neighbouring neurons, will respond to different input signals.

Learning in Kohonen's model can be divided into two processes: competition and cooperation. Competition takes place right after an input pattern is introduced to the network. A discriminant function is calculated by every artificial neuron and the resulting value provides the basis of the competition; the neuron with the highest discriminant function score is considered the winner.

Consequently, the synaptic weights between the presented input vector (stimulus) and the winning neuron will be strengthened, with the aim of minimizing the Euclidean distance between them. Hence, the goal is to move the winning neuron towards the corresponding input stimulus. The increase of the winning neuron's synaptic weights can also be thought of as inhibition.




Figure 2.5: A SOM example. The dataset includes $N$ input instances $X = X_1, \dots, X_N$. Each synaptic weight $W_j$ is a vector of $D$ elements (corresponding to the $D$ features of the input instance vector $X_n$) connecting each input feature with the $j$th neuron. The output layer is a 4 x 4 lattice of neurons $O_j$. Each neuron is indexed according to its position on the lattice (row, column).


The cooperation phase follows, where the index position of the winning neuron on the lattice gives the topological information about the neighbourhood of surrounding neurons to be excited. So, eventually, the neuron which was declared the winner through competition cooperates with some of its neighbouring neurons by acting as an excitation stimulus for them. Essentially, this means that the winning neuron allows its neighbours to increase their discriminant function in response to the specific input pattern, to a degree which is assuredly lower than the strengthening of its own synaptic weights.

Ultimately, the trained network is able to transform a multidimensional input signal into a discrete map of one or two dimensions (the output layer). The transformation is driven by the synaptic weight vector of each discrete output neuron, which basically resembles the topological order of some input vectors (forming a cluster) in the input space.




2.5.3 - Algorithm

The whole training process is based on adapting, primarily, the synaptic weight vector of the winning neuron, and thereafter those of the neighbouring ones, to the features of the introduced input vector. To that end, a sufficient number of input patterns should be presented to the network iteratively, yet one at a time, each one initiating the competition among the output neurons followed by the cooperation phase.

For the examples given in the rest of the description of the algorithm, refer to the network given in Figure 2.5. First, a $D$-dimensional input vector $X_n$ is presented to the network at each time step.

$X_n = x_1, \dots, x_D$ where $n \in [1, \dots, N]$ and $N, D \in \mathbb{N}_1$

The output layer is discrete, since each neuron of the output layer is characterized by a discrete value $j$ which can be thought of as its unique identity or name. Output neurons are indexed, meaning that their position is defined by the coordinates of the lattice. So, for a 2-dimensional lattice, the index of each neuron is given by a two-dimensional vector $p_j$ containing the row and the column number of the neuron's position on the lattice. For example, the index of neuron $o_9$ is $p_9 = (0,2)$ on the map. This can be thought of as the address of the neuron.

Each output neuron, symbolized $o_j$, is assigned a $D$-dimensional synaptic weight vector $W_j$ so that:

$W_j = w_{1j}, \dots, w_{Dj}$ where $j \in [1, \dots, L]$ and $L \in \mathbb{N}_1$

$L$ is the number of output neurons, so that $L = K \times Q$ for $K, Q \in \mathbb{N}_1$, defining the number of rows and the number of columns of the map respectively.

Output neurons compete with each other by comparing their synaptic weight vectors $W_j$ with the input pattern $X_n$, trying to find the best match between the two vectors. Hence, the dot product $W_j X_n$ for $j = 1, \dots, L$ is calculated for each neuron and the highest value is chosen. The long-term objective of the algorithm is to maximize the quantity $W_j X_n$ for $n = 1, \dots, N$ (the whole dataset), which could be interpreted as minimizing the distance given by the Euclidean norm $\|X_n - W_j\|$ (the two criteria coincide when the weight vectors have equal norm, since $\|X_n - W_j\|^2 = \|X_n\|^2 - 2\,W_j X_n + \|W_j\|^2$). Therefore,

$$i(X_n) = \arg\min_{j}\,\|X_n - W_j\|, \qquad \min_{j}\,\|X_n - W_j\| = \min\{\|X_n - W_1\|, \dots, \|X_n - W_L\|\}$$ Eq. 2.34

where $i(X_n)$ is the index of the winning output neuron $i \in [1, L]$, whose synaptic weight vector $W_i$ is the closest to the input vector $X_n$ among all neurons $j = 1, \dots, L$. Note that the input pattern is a $D$-dimensional vector of continuous elements which is mapped into a two-dimensional discrete space of $L$ outputs using a competition process. So, for the example of Figure 2.5, the winning neuron of the competition initiated by the input vector $X_n$ is the output neuron $j = 15$, so $i(X_n)$ corresponds to the lattice position (2,3).
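The competition step can be sketched in Python as follows. The helper names are illustrative, and the numbering of neurons down the columns of the lattice is an assumption inferred from the two examples in the text ($p_9 = (0,2)$ and neuron 15 at position (2,3) on a lattice with 4 rows).

```python
import math


def winner(x, weights):
    """0-based position in `weights` of the vector closest to x (Eq. 2.34)."""
    dists = [
        math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w))) for w in weights
    ]
    return dists.index(min(dists))


def lattice_position(j, rows):
    """(row, column) of neuron j (1-based), assuming column-wise numbering."""
    return ((j - 1) % rows, (j - 1) // rows)


# Hypothetical weight vectors for three neurons and one input pattern.
W = [[0.0, 0.0], [0.9, 1.2], [3.0, 3.0]]
best = winner([1.0, 1.0], W)
```

If two neurons are equally close, `list.index` returns the first of them; the text does not specify a tie-breaking rule, so this is just one possible choice.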

By the end of the competition process, the winning neuron “excites” the neurons which lie in its neighbourhood. Two concepts must be clarified at this point: how a neuron's neighbourhood is defined, and how a neuron is excited by another neuron. A neighbourhood of neurons refers, basically, to the topological arrangement of the neurons surrounding the winning neuron. The neurons positioned close to the winning neuron will be more excited than others positioned at a larger lateral distance. Thus, the excitation effect is graded with respect to the lateral distance $d_{ij}$ between the winning neuron $i$ and any other neuron $j$, for $j = 1, \dots, L$. Given that constraint, the neighbourhood is given by a function (called the neighbourhood function) $h_{ij}$ which should satisfy two basic properties:

(1) It has to be symmetric about $d_{ij} = 0$ (where of course $d_{ij} = d_{ii}$, i.e. at the winning neuron itself)

(2) It should be a monotonically decaying function with respect to the distance $d_{ij}$.

A Gaussian function exhibits both properties, so the Gaussian neighbourhood function is given as:

$$h_{ij} = \exp\left(-\frac{d_{ij}^2}{2\sigma^2}\right)$$ Eq. 2.35

where:

$$d_{ij}^2 = \|p_j - p_i\|^2$$ Eq. 2.36

Note that, besides the Gaussian, there are other functions which can be chosen to implement the neighbourhood function. In any case, the size of the neighbourhood is not fixed, since it decreases with time (discrete time, iterations). When the training begins the neighbourhood range is quite large, so the function $h_{ij}$ returns large values for a big set of neighbouring neurons. As iterations proceed, fewer neurons get large values from $h_{ij}$, meaning that fewer neurons will be excited by the winning neuron. Eventually the whole neighbourhood may consist of only the winning neuron. Since the neighbourhood function is given in Equation 2.35, the shrinking of the neighbourhood can be embedded by decreasing $\sigma$ over time using the function:

$$\sigma(c) = \sigma_0\exp\left(-\frac{c}{T_1}\right)$$ for $c = 0, 1, 2, 3, \dots$ Eq. 2.37

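where $c$ is the iteration counter, $\sigma_0$ is the initial value of $\sigma$ and $T_1$ is a time constant. Therefore Eq. 2.35 can be rewritten as:

$$h_{ij}(c) = \exp\left(-\frac{d_{ij}^2}{2\left(\sigma_0\exp\left(-\frac{c}{T_1}\right)\right)^2}\right)$$ Eq. 2.38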
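A minimal Python sketch of the time-dependent neighbourhood function is given below. It assumes the Gaussian form with a squared lattice distance and an exponentially decaying width, as described for Eq. 2.35 to Eq. 2.38; the function and parameter names are illustrative.

```python
import math


def sigma(c, sigma0, T1):
    """Neighbourhood width after c iterations (exponential decay)."""
    return sigma0 * math.exp(-c / T1)


def neighbourhood(p_i, p_j, c, sigma0, T1):
    """Gaussian neighbourhood value between lattice positions p_i and p_j."""
    d2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j))  # squared lattice distance
    return math.exp(-d2 / (2.0 * sigma(c, sigma0, T1) ** 2))


# Hypothetical settings: winner at (2, 3), initial width 2, time constant 100.
h_self = neighbourhood((2, 3), (2, 3), 0, 2.0, 100.0)
h_near = neighbourhood((2, 3), (2, 2), 0, 2.0, 100.0)
h_far = neighbourhood((2, 3), (0, 0), 0, 2.0, 100.0)
h_near_later = neighbourhood((2, 3), (2, 2), 200, 2.0, 100.0)
```

The four values show the intended behaviour: the winner itself gets the maximum value, closer neurons get larger values than distant ones, and the same neighbour gets a smaller value at a later iteration.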


So when two similar, but not identical, input vectors are presented to the network, there will be two different winning neurons. However, besides the winning neuron's synaptic weights, the neighbours will cooperatively update their own weights to a smaller degree. Hence clusters of input vectors that are close to each other, but still different, will be projected onto topologically close neurons on the lattice. That is why cooperation is a necessary process for the correct training of this model. A form of how this process is biologically motivated is shown in Figure 2.5.

The final step is the weight adaptation, which employs Hebbian learning. Hebbian learning states that when there is simultaneous activation in the presynaptic and postsynaptic areas, the synaptic connection between them is strengthened; otherwise it is weakened. In the case of Kohonen's model, the presynaptic area is the input signal $X_n$ and the postsynaptic area is the output activity of the neuron, denoted $y_j$. Output activity is maximized when the neuron wins the competition, is zero when the neuron does not belong to the winner's neighbourhood, and is somewhere in between otherwise. So $y_j$ could be expressed as the neighbourhood function $h_{ij}$.

Thus:

$$\Delta W_j \propto y_j X_n \propto h_{ij} X_n$$ Eq. 2.39

However, by using the Hebbian learning mechanism there is a risk that the synaptic weights might reach saturation. This happens because the weight adaptation occurs only in one direction (towards the input pattern), causing all weights to saturate. To overcome this difficulty, a forgetting term $-g(y_j)W_j$ is introduced, giving:

$$\Delta W_j = \eta\,y_j X_n - g(y_j)\,W_j$$ Eq. 2.40

where $\eta$ is the learning rate parameter. The term $\eta\,y_j X_n$ drives the weight update towards the direction of the input signal, making the particular $j$th neuron more sensitive when this input pattern is presented to the network. On the other hand, the forgetting term $-g(y_j)W_j$ keeps the new synaptic weight within reasonable levels. The function $g(y_j)$ must return a positive scalar value, so for simplicity it could be tied to $y_j$, so that $g(y_j) = \eta\,y_j$. Hence the weight update could be expressed as:

$$\Delta W_j = \eta\,h_{ij} X_n - \eta\,h_{ij} W_j = \eta\,h_{ij}\left(X_n - W_j\right)$$ Eq. 2.41
In discrete time simulation, the updated weight vector is given by:

$$W_j(c+1) = W_j(c) + \eta(c)\,h_{ij}(c)\left(X_n(c) - W_j(c)\right)$$ for $c = 0, 1, 2, 3, \dots$ Eq. 2.42

with

$$\eta(c) = \eta_0\exp\left(-\frac{c}{T_2}\right)$$ Eq. 2.43

where $\eta_0$ is the initial value of $\eta$ and $T_2$ is a time constant.

Essentially, the term $(X_n(c) - W_j(c))$ drives the alignment of $W_j(c+1)$ with the $X_n$ vector at each iteration $c$. The winning weight vector will make a full step ($h_{ij}(c) = 1$) towards the input signal, while the neighbouring excited synaptic weights will move to a smaller degree defined by the neighbourhood function ($h_{ij}(c) < 1$). Additionally, the update of all weights is limited by the learning rate $\eta(c)$. The learning rate changes with time. In the first iterations it is quite large, allowing fast learning, which means the weight vectors move more aggressively towards the input pattern and hence form the rough boundaries of the potential input clusters. As the iteration counter increases the learning rate decreases, forcing the weight vectors to move more carefully in the space in order to capture all the necessary information given by the input patterns and gently converge to the correct topology. In the end, the topological arrangement of the synaptic weights in the weight space will resemble the statistical distribution of the input.

The algorithm in steps:

1. Randomly initialize the weight vectors for all output neurons $j = 1, \dots, L$, where $L$ is the number of output neurons lying on the lattice.

2. Present the input vector $X_n = (x_1, \dots, x_D)$ to the network, where $n = 1, \dots, N$ and $N$ is the number of all input patterns included in the dataset.

3. Calculate $W_j X_n$ for all output neurons and find the winner using Eq. 2.34.

4. Update all synaptic weights using Equation 2.42.

5. Move to step 2 until there are no significant changes to the weight vectors or the maximum number of iterations is reached.
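The five steps can be put together in a compact Python sketch. Several simplifications are assumptions made for brevity and are not prescribed by the text: patterns are presented in a fixed cyclic order, the winner is found directly by smallest Euclidean distance, training stops after a fixed number of iterations, and the neighbourhood width is floored at a small positive value to avoid division by zero. All default parameter values are illustrative.

```python
import math
import random


def train_som(data, rows, cols, iterations,
              eta0=0.5, sigma0=2.0, T1=50.0, T2=50.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Lattice positions (row, column), numbered down the columns.
    positions = [(r, q) for q in range(cols) for r in range(rows)]
    # Step 1: random initial weight vectors.
    weights = [[rng.random() for _ in range(dim)] for _ in positions]
    for c in range(iterations):
        x = data[c % len(data)]                        # step 2
        dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
        i = dists.index(min(dists))                    # step 3: winner
        eta = eta0 * math.exp(-c / T2)                 # decaying learning rate
        sig = max(sigma0 * math.exp(-c / T1), 1e-3)    # decaying width, floored
        for j, w in enumerate(weights):                # step 4: update all
            d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            h = math.exp(-d2 / (2.0 * sig ** 2))
            for d in range(dim):
                w[d] += eta * h * (x[d] - w[d])
    return positions, weights                          # step 5: fixed budget


# Hypothetical two-cluster dataset in the unit square.
data = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [0.9, 1.0]]
positions, weights = train_som(data, rows=4, cols=4, iterations=200)
```

Because each update moves a weight vector part of the way towards the presented pattern, weights that start inside the unit square stay inside it for this dataset.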

The learning mechanism could be divided in two phases:

(A) The self organizing topological ordering phase

(B) Convergence phase

During phase (A), the spatial locations of the output neurons are roughly defined (in the weight space). Each neuron corresponds to a specific domain of input patterns. The winning neuron's neighbourhood is large during this phase, allowing neurons which respond to similar input signals to be brought together. As a result, input clusters which have common or similar features will be projected onto neurons which lie next to each other on the weight map. However, the neighbourhood shrinks with time, and eventually it could be the case that the winner alone makes up its neighbourhood.

The convergence phase begins when the neighbourhood has become very small. Having a small neighbourhood allows the winning neurons to fine-tune the positions which were obtained in the topological ordering phase. During this phase the learning rate is also fairly small, so the weights move very slowly towards the input vectors.

2.6 - Kernel Functions


A challenge for the computational intelligence area is to provide good estimators for non-linearly separable data. As indicated by their name, linear functions (e.g. the perceptron algorithm) are incapable of successfully processing non-linear data, and so they are eliminated from the pool of possible solutions to non-linear problems. However, non-linear data can become linearly separable if they are transformed to a suitable higher-dimensional space (called feature space), and thus a linear model can be used to solve the problem there. So the previous statement should be rephrased as: linear functions are incapable of successfully processing non-linear data in the input space (the space where the input training examples are located).

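Hence, given a set of data $[X, T]$:

$X = x_1, x_2, \dots, x_N$ where $x \in \mathbb{R}^D$

$T = t_1, t_2, \dots, t_N$ where $t \in \mathbb{R}$

so that:

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1D} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{ND} \end{bmatrix}, \quad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}$$ Eq. 2.44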

and a dataset mapping via:

$\Phi : x \rightarrow \Phi(x)$ where $\Phi(x) := \left(\varphi_1(x), \varphi_2(x), \dots, \varphi_K(x)\right)$ Eq. 2.45

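the resulting feature data set, with $\varphi_k(x)$ denoting the $k$th feature of $\Phi(x)$, is:

$$\Phi = \begin{bmatrix} \Phi(x_1) \\ \vdots \\ \Phi(x_N) \end{bmatrix} = \begin{bmatrix} \varphi_1(x_1) & \cdots & \varphi_K(x_1) \\ \vdots & \ddots & \vdots \\ \varphi_1(x_N) & \cdots & \varphi_K(x_N) \end{bmatrix}, \quad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}$$ Eq. 2.46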

As shown above, the raw dataset is represented by an N x D matrix while the mapped dataset is an N x K matrix where $K > D$. The transformed dataset is now separable in the $K$-dimensional feature space by a linear hyperplane.

The key to solving the inseparability problem lies in the appropriate choice of a feature space of sufficient dimensionality. This is a task that must be accomplished with great care, since a wrong feature mapping will give misleading results. In addition, the whole data transformation process can be very computationally expensive, since the cost of data mapping in time and memory depends highly on the feature space dimensionality. That is why constructing features from a given dataset is a rather difficult task which often demands expert knowledge in the domain of the problem.

For example, consider constructing quadratic features for $x \in \mathbb{R}^2$:

$$\Phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$$ Eq. 2.47

In the case that the data point $x$ is a two-dimensional vector, the new feature vector is three-dimensional, so it does not seem very demanding in terms of time or memory. However, imagine creating quadratic features for a hundred-dimensional dataset. Counting the distinct products $x_i x_j$ with $i \le j$, each new feature data point would have $100 \cdot 101 / 2 = 5050$ dimensions. Hence, for every data point of the raw dataset, 5050 calculations should be made and saved in the transformed data matrix.

However, most linear learning algorithms depend only on the dot product between two input training vectors. Therefore, when applying such algorithms in the feature space, the dot product between the transformed data points (features) must be calculated and substituted for every dot product between the raw input vectors in the algorithm's formulation:

$$x^T x' \rightarrow \Phi(x)^T\,\Phi(x')$$ Eq. 2.48

Returning to the example presented in Eq. 2.47, the dot product between the quadratic features results in the quantity:

$$\Phi(x)^T\Phi(x') = x_1^2\,x_1'^2 + 2\,x_1 x_2\,x_1' x_2' + x_2^2\,x_2'^2 = \left(x^T x'\right)^2$$ Eq. 2.49
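This identity is easy to check numerically. The short Python sketch below assumes the quadratic feature map $(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ for two-dimensional inputs; the function names and the two sample vectors are illustrative.

```python
import math


def quadratic_features(x):
    """Explicit quadratic feature map for a two-dimensional input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(quadratic_features(x), quadratic_features(z))  # in feature space
implicit = dot(x, z) ** 2                                     # in input space
```

Both quantities agree up to floating-point rounding, while the second one never builds the feature vectors.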

As observed in Eq. 2.49, the dot product of two quadratic features in $\mathbb{R}^2$ is equal to the square of the dot product between the corresponding raw data patterns. So an alternative to feature construction and computation is to implicitly compute the dot product between two features without defining in any way the exact feature mapping routine. A function can be defined instead, which will accept two input vectors and