ARTIFICIAL NEURAL NETWORKS

METHODOLOGICAL ADVANCES AND BIOMEDICAL APPLICATIONS

Edited by Kenji Suzuki

Artificial Neural Networks - Methodological Advances and Biomedical Applications

Edited by Kenji Suzuki

Published by InTech

Janeza Trdine 9, 51000 Rijeka, Croatia

Copyright © 2011 InTech

All chapters are Open Access articles distributed under the Creative Commons

Non Commercial Share Alike Attribution 3.0 license, which permits users to copy,

distribute, transmit, and adapt the work in any medium, so long as the original

work is properly cited. After this work has been published by InTech, authors

have the right to republish it, in whole or part, in any publication of which they

are the author, and to make other personal use of the work. Any republication,

referencing or personal use of the work must explicitly identify the original source.

Statements and opinions expressed in the chapters are those of the individual contributors

and not necessarily those of the editors or publisher. No responsibility is accepted

for the accuracy of information contained in the published articles. The publisher

assumes no responsibility for any damage or injury to persons or property arising out

of the use of any materials, instructions, methods or ideas contained in the book.

Publishing Process Manager Ivana Lorkovic

Technical Editor Teodora Smiljanic

Cover Designer Martina Sirotic

Image Copyright Bruce Rolff, 2010. Used under license from Shutterstock.com

First published March, 2011

Printed in India

A free online edition of this book is available at www.intechopen.com

Additional hard copies can be obtained from orders@intechweb.org

Artificial Neural Networks - Methodological Advances and Biomedical Applications

Edited by Kenji Suzuki

p. cm.

ISBN 978-953-307-243-2

Free online editions of InTech books and journals can be found at www.intechopen.com

Contents

Preface IX

Part 1 Fundamentals 1

Chapter 1 Introduction to the Artificial Neural Networks 3
Andrej Krenker, Janez Bešter and Andrej Kos

Chapter 2 Review of Input Variable Selection Methods for Artificial Neural Networks 19
Robert May, Graeme Dandy and Holger Maier

Chapter 3 Artificial Neural Networks and Efficient Optimization Techniques for Applications in Engineering 45
Rossana M. S. Cruz, Helton M. Peixoto and Rafael M. Magalhães

Part 2 Advanced Architectures for Biomedical Applications 69

Chapter 4 Pixel-Based Artificial Neural Networks in Computer-Aided Diagnosis 71
Kenji Suzuki

Chapter 5 Applied Artificial Neural Networks: from Associative Memories to Biomedical Applications 93
Mahmood Amiri and Katayoun Derakhshandeh

Chapter 6 Medical Image Segmentation Using Artificial Neural Networks 121
Mostafa Jabarouti Moghaddam and Hamid Soltanian-Zadeh

Chapter 7 Artificial Neural Networks and Predictive Medicine: a Revolutionary Paradigm Shift 139
Enzo Grossi

Chapter 8 Reputation-Based Neural Network Combinations 151
Mohammad Nikjoo, Azadeh Kushki, Joon Lee, Catriona Steele and Tom Chau

Part 3 Biological Applications 171

Chapter 9 Prioritising Genes with an Artificial Neural Network Comprising Medical Documents to Accelerate Positional Cloning in Biological Research 173
Norio Kobayashi and Tetsuro Toyoda

Chapter 10 Artificial Neural Networks Technology to Model and Predict Plant Biology Process 197
Pedro P. Gallego, Jorge Gago and Mariana Landín

Chapter 11 The Usefulness of Artificial Neural Networks in Predicting the Outcome of Hematopoietic Stem Cell Transplantation 217
Giovanni Caocci, Roberto Baccoli and Giorgio La Nasa

Chapter 12 Artificial Neural Networks and Retinal Ganglion Cell Responses 233
María P. Bonomini, José M. Ferrández and Eduardo Fernández

Part 4 Medical Applications 251

Chapter 13 Diagnosing Skin Diseases Using an Artificial Neural Network 253
Bakpo, F. S. and Kabari, L. G.

Chapter 14 Artificial Neural Networks Used to Study the Evolution of the Multiple Sclerosis 271
Tabares Ospina and Hector Anibal

Chapter 15 Estimation the Depth of Anesthesia by the Use of Artificial Neural Network 283
Hossein Rabbani, Alireza Mehri Dehnavi and Mehrab Ghanatbari

Chapter 16 Artificial Neural Networks (ANN) Applied for Gait Classification and Physiotherapy Monitoring in Post Stroke Patients 303
Katarzyna Kaczmarczyk, Andrzej Wit, Maciej Krawczyk, Jacek Zaborski and Józef Piłsudski

Part 5 Clinical and Other Applications 329

Chapter 17 Forecasting the Clinical Outcome: Artificial Neural Networks or Multivariate Statistical Models? 331
Ahmed Akl and Mohamed A Ghoneim

Chapter 18 Telecare Adoption Model Based on Artificial Neural Networks 343
Jui-Chen Huang

Chapter 19 Effectiveness of Artificial Neural Networks in Forecasting Failure Risk for Pre-Medical Students 355
Jawaher K. Alenezi, Mohammed M. Awny and Maged M. M. Fahmy

Preface

Artificial neural networks are probably the single most successful technology of the last two decades, and have been widely used in a large variety of applications in various areas. An artificial neural network, often just called a neural network, is a mathematical (or computational) model that is inspired by the structure and function of biological neural networks in the brain. An artificial neural network consists of a number of artificial neurons (i.e., nonlinear processing units) which are connected to each other via synaptic weights (or simply just weights). An artificial neural network can “learn” a task by adjusting weights. There are supervised and unsupervised models. A supervised model requires a “teacher” or desired (ideal) output to learn a task. An unsupervised model does not require a “teacher,” but it learns a task based on a cost function associated with the task. An artificial neural network is a powerful, versatile tool. Artificial neural networks have been used successfully in various applications such as biological, medical, industrial, control engineering, software engineering, environmental, economic, and social applications. The high versatility of artificial neural networks comes from their high learning capability. It has been proved theoretically that an artificial neural network can approximate any continuous mapping to arbitrary precision. A desired continuous mapping or a desired task is acquired in an artificial neural network by learning.

The purpose of this book series is to provide recent advances of artificial neural network applications in a wide range of areas. The series consists of two volumes: the first volume contains methodological advances and biomedical applications of artificial neural networks; the second volume contains artificial neural network applications in industrial and control engineering.

This first volume begins with a section on the fundamentals of artificial neural networks, which covers the introduction, design, and optimization of artificial neural networks. The fundamental concepts, principles, and theory in this section help the reader understand and use an artificial neural network in a specific application properly as well as effectively. A section on advanced architectures for biomedical applications follows. Researchers have developed advanced architectures for artificial neural networks specifically for biomedical applications. Such advanced architectures offer improved performance and desirable properties. Sections continue with biological applications such as gene, plant biology, and stem cell applications; medical applications such as skin diseases, sclerosis, anesthesia, and physiotherapy; and clinical and other applications such as clinical outcome, telecare, and pre-med student failure prediction.


Thus, this book will be a fundamental source of recent advances and applications of artificial neural networks in biomedical areas. The target audience of this book includes professors, college students, and graduate students in engineering and medical schools, engineers in biomedical companies, researchers in biomedical and health sciences, medical doctors such as radiologists, cardiologists, pathologists, and surgeons, and healthcare professionals such as radiology technologists and medical physicists. I hope this book will be a useful source for readers and will inspire them.

Kenji Suzuki, Ph.D.

University of Chicago

Chicago, Illinois, USA

Part 1

Fundamentals

1

Introduction to the Artificial Neural Networks

Andrej Krenker^1, Janez Bešter^2 and Andrej Kos^2
^1 Consalta d.o.o.
^2 Faculty of Electrical Engineering, University of Ljubljana
Slovenia

1. Introduction

An Artificial Neural Network (ANN) is a mathematical model that tries to simulate the

structure and functionalities of biological neural networks. Basic building block of every

artificial neural network is artificial neuron, that is, a simple mathematical model (function).

Such a model has three simple sets of rules: multiplication, summation and activation. At

the entrance of artificial neuron the inputs are weighted what means that every input value

is multiplied with individual weight. In the middle section of artificial neuron is sum

function that sums all weighted inputs and bias. At the exit of artificial neuron the sum of

previously weighted inputs and bias is passing trough activation function that is also called

transfer function (Fig. 1.).

Fig. 1. Working principle of an artificial neuron.


Although the working principles and simple set of rules of the artificial neuron look like nothing special, the full potential and calculation power of these models come to life when we start to interconnect them into artificial neural networks (Fig. 2.). These artificial neural networks exploit the simple fact that complexity can grow out of merely a few basic and simple rules.

Fig. 2. Example of simple artificial neural network.

In order to fully harvest the benefits of the mathematical complexity that can be achieved through the interconnection of individual artificial neurons, and not just make the system complex and unmanageable, we usually do not interconnect these artificial neurons randomly. In the past, researchers have come up with several “standardised” topologies of artificial neural networks. These predefined topologies can help us with easier, faster and more efficient problem solving. Different types of artificial neural network topologies are suited for solving different types of problems. After determining the type of the given problem, we need to decide on the topology of the artificial neural network we are going to use and then fine-tune it. We need to fine-tune the topology itself and its parameters.

A fine-tuned topology of an artificial neural network does not mean that we can start using it; it is only a precondition. Before we can use our artificial neural network, we need to teach it to solve the given type of problem. Just as biological neural networks can learn their behaviour/responses on the basis of inputs that they get from their environment, artificial neural networks can do the same. There are three major learning paradigms: supervised learning, unsupervised learning and reinforcement learning. We choose a learning paradigm in the same way we chose the artificial neural network topology: based on the problem we are trying to solve. Although the learning paradigms differ in their principles, they all have one thing in common: on the basis of “learning data” and “learning rules” (a chosen cost function), the artificial neural network tries to achieve the proper output response to the input signals.

After choosing the topology of an artificial neural network, fine-tuning the topology, and once the artificial neural network has learned the proper behaviour, we can start using it for solving the given problem. Artificial neural networks have been in use for some time now, and we can find them working in areas such as process control, chemistry, gaming, radar systems, the automotive industry, the space industry, astronomy, genetics, banking and fraud detection, and solving problems like function approximation, regression analysis, time series prediction, classification, pattern recognition, decision making, data processing, filtering and clustering, to name a few.


As the topic of artificial neural networks is complex and this chapter is only of an informative nature, we encourage the novice reader to find more detailed information on artificial neural networks in (Gurney, 1997; Kröse & Smagt, 1996; Pavešić, 2000; Rojas, 1996).

2. Artificial neuron

The artificial neuron is the basic building block of every artificial neural network. Its design and functionalities are derived from observation of the biological neuron, the basic building block of biological neural networks (systems), which include the brain, spinal cord and peripheral ganglia. The similarities in design and functionality can be seen in Fig. 3, where the left side of the figure represents a biological neuron with its soma, dendrites and axon, and the right side represents an artificial neuron with its inputs, weights, transfer function, bias and outputs.

Fig. 3. Biological and artificial neuron design.

In the case of the biological neuron, information comes into the neuron via the dendrites, the soma processes the information and passes it on via the axon. In the case of the artificial neuron, the information comes into the body of the artificial neuron via inputs that are weighted (each input can be individually multiplied by a weight). The body of the artificial neuron then sums the weighted inputs and the bias, and “processes” the sum with a transfer function. At the end, the artificial neuron passes the processed information on via its output(s). The benefit of the artificial neuron model's simplicity can be seen in its mathematical description below:

y(k) = F( Σ_{i=0..m} w_i(k) · x_i(k) + b )        (1)

Where:

• x_i(k) is the input value in discrete time k, where i goes from 0 to m,

• w_i(k) is the weight value in discrete time k, where i goes from 0 to m,

• b is the bias,

• F is a transfer function,

• y(k) is the output value in discrete time k.

As seen from the model of an artificial neuron and its equation (1), the major unknown variable of our model is its transfer function. The transfer function defines the properties of the artificial neuron and can be any mathematical function. We choose it on the basis of the problem that the artificial neuron (artificial neural network) needs to solve, and in most cases we choose it from the following set of functions: step function, linear function and non-linear (sigmoid) function.


The step function is a binary function that has only two possible output values (e.g. zero and one). That means that if the input value meets a specific threshold, the output results in one value, and if the specific threshold is not met, it results in a different output value. The situation can be described with equation (2):

y = { 1 if Σ_i w_i(k) · x_i(k) ≥ threshold
      0 if Σ_i w_i(k) · x_i(k) < threshold }        (2)

When this type of transfer function is used in an artificial neuron, we call the artificial neuron a perceptron. The perceptron is used for solving classification problems and as such can most commonly be found in the last layer of artificial neural networks. In the case of a linear transfer function, the artificial neuron performs a simple linear transformation over the sum of the weighted inputs and the bias. Such an artificial neuron is, in contrast to the perceptron, most commonly used in the input layer of artificial neural networks. When we use a non-linear function, the sigmoid function is the most common choice. The sigmoid function has an easily calculated derivative, which can be important when calculating the weight updates in the artificial neural network.
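As an illustration, a minimal Python sketch of the artificial neuron of equation (1), with the step function of equation (2) and a sigmoid as interchangeable transfer functions; the input, weight, bias and threshold values are arbitrary examples:

import numpy as np

def step(v, threshold=0.0):
    # step transfer function of equation (2): binary output
    return 1.0 if v >= threshold else 0.0

def linear(v):
    # linear transfer function: identity transformation
    return v

def sigmoid(v):
    # non-linear (sigmoid) transfer function; its derivative,
    # sigmoid(v) * (1 - sigmoid(v)), is easy to calculate
    return 1.0 / (1.0 + np.exp(-v))

def neuron(x, w, b, F):
    # equation (1): weighted sum of inputs plus bias, passed through F
    return F(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # example input values x_i(k)
w = np.array([0.4, 0.1, -0.2])   # example weight values w_i(k)
print(neuron(x, w, 0.1, step))     # perceptron-style binary output
print(neuron(x, w, 0.1, sigmoid))  # smooth output in (0, 1)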

3. Artificial Neural Networks

When we combine two or more artificial neurons, we get an artificial neural network. While a single artificial neuron has almost no usefulness in solving real-life problems, artificial neural networks do. In fact, artificial neural networks are capable of solving complex real-life problems by processing information in their basic building blocks (artificial neurons) in a non-linear, distributed, parallel and local way.

The way the individual artificial neurons are interconnected is called the topology, architecture or graph of the artificial neural network. Interconnections can be made in numerous ways, which results in numerous possible topologies that are divided into two basic classes. Fig. 4 shows these two topologies: the left side of the figure represents a simple feed-forward topology (acyclic graph), where information flows from inputs to outputs in only one direction, and the right side represents a simple recurrent topology (semi-cyclic graph), where some of the information flows not only from input to output but also in the opposite direction. While observing Fig. 4 we need to mention that, for easier handling and mathematical description of an artificial neural network, we group individual neurons into layers. In Fig. 4 we can see the input, hidden and output layers.

Fig. 4. Feed-forward (FNN) and recurrent (RNN) topology of an artificial neural network.


When we choose and build the topology of our artificial neural network, we have only finished half of the task before we can use it for solving the given problem. Just as biological neural networks need to learn their proper responses to the given inputs from the environment, artificial neural networks need to do the same. The next step is therefore to learn the proper response of the artificial neural network, and this can be achieved through learning (supervised, unsupervised or reinforcement learning). No matter which method we use, the task of learning is to set the values of the weights and biases on the basis of the learning data so as to minimize the chosen cost function.

3.1 Feed-forward Artificial Neural Networks

An artificial neural network with a feed-forward topology is called a feed-forward artificial neural network and as such has only one condition: information must flow from input to output in only one direction with no back-loops. There are no limitations on the number of layers, the type of transfer function used in individual artificial neurons or the number of connections between individual artificial neurons. The simplest feed-forward artificial neural network is a single perceptron, which is only capable of learning linearly separable problems. A simple multi-layer feed-forward artificial neural network, used for the purpose of analytical description (sets of equations (3), (4) and (5)), is shown in Fig. 5.

n_1 = F_1(w_1 · x_1 + b_1)
n_2 = F_2(w_2 · x_2 + b_2)
n_3 = F_2(w_2 · x_2 + b_2)
n_4 = F_3(w_3 · x_3 + b_3)        (3)

m_1 = F_4(q_1 · n_1 + q_2 · n_2 + b_4)
m_2 = F_5(q_3 · n_3 + q_4 · n_4 + b_5)
y = F_6(r_1 · m_1 + r_2 · m_2 + b_6)        (4)

y = F_6( r_1 · F_4[ q_1 · F_1(w_1 · x_1 + b_1) + q_2 · F_2(w_2 · x_2 + b_2) + b_4 ] +
         r_2 · F_5[ q_3 · F_2(w_2 · x_2 + b_2) + q_4 · F_3(w_3 · x_3 + b_3) + b_5 ] + b_6 )        (5)

Fig. 5. Feed-forward artificial neural network.


As seen in Fig. 5 and the corresponding analytical description with the sets of equations (3), (4) and (5), even a simple feed-forward artificial neural network can lead to relatively long mathematical descriptions, for which solving the network's parameter optimization problem by hand is impractical. Although the analytical description can be used for any complex artificial neural network, in practice we use computers and specialised software that can help us build, mathematically describe and optimise any type of artificial neural network.
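For illustration, a minimal Python sketch of the network of Fig. 5, evaluating equations (3) and (4) layer by layer; it assumes, for simplicity, the same sigmoid transfer function for every F_1 ... F_6, and the weights and biases are arbitrary example values:

import numpy as np

def F(v):
    # one common choice: the same sigmoid transfer function for F_1..F_6
    return 1.0 / (1.0 + np.exp(-v))

def feed_forward(x1, x2, x3, w, b, q, r):
    # first layer, equation (3)
    n1 = F(w[1] * x1 + b[1])
    n2 = F(w[2] * x2 + b[2])
    n3 = F(w[2] * x2 + b[2])
    n4 = F(w[3] * x3 + b[3])
    # second (hidden) layer and output, equation (4); substituting the
    # n's and m's into one expression reproduces equation (5)
    m1 = F(q[1] * n1 + q[2] * n2 + b[4])
    m2 = F(q[3] * n3 + q[4] * n4 + b[5])
    return F(r[1] * m1 + r[2] * m2 + b[6])

w = {1: 0.5, 2: -0.3, 3: 0.8}
b = {1: 0.1, 2: 0.0, 3: -0.1, 4: 0.05, 5: -0.05, 6: 0.3}
q = {1: 0.6, 2: -0.2, 3: 0.4, 4: 0.9}
r = {1: 1.2, 2: -0.7}
print(feed_forward(0.2, 0.7, -0.4, w, b, q, r))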

3.2 Recurrent Artificial Neural Networks

An artificial neural network with a recurrent topology is called a recurrent artificial neural network. It is similar to a feed-forward neural network but with no limitations regarding back-loops. In this case, information is no longer transmitted only in one direction but is also transmitted backwards. This creates an internal state of the network, which allows it to exhibit dynamic temporal behaviour. Recurrent artificial neural networks can use their internal memory to process any sequence of inputs. Fig. 6 shows a small fully recurrent artificial neural network and the complexity of its artificial neuron interconnections.

The most basic topology of a recurrent artificial neural network is the fully recurrent artificial network, in which every basic building block (artificial neuron) is directly connected to every other basic building block in all directions. Other recurrent artificial neural networks, such as the Hopfield, Elman, Jordan and bi-directional networks, are just special cases of recurrent artificial neural networks.

Fig. 6. Fully recurrent artificial neural network.

3.3 Hopfield Artificial Neural Network

A Hopfield artificial neural network is a type of recurrent artificial neural network that is used to store one or more stable target vectors. These stable vectors can be viewed as memories that the network recalls when provided with similar vectors, which act as a cue to the network memory. Its binary units take only two different values for their states, determined by whether or not the units' input exceeds their threshold. Binary units can take either the values 1 or -1, or the values 1 or 0. Consequently, there are two possible definitions for the binary unit activation a_i (equations (6) and (7)):


a_i = { -1 if Σ_j w_ij · s_j < θ_i,
         1 otherwise.        (6)

a_i = { 0 if Σ_j w_ij · s_j < θ_i,
        1 otherwise.        (7)

Where:

• w_ij is the strength of the connection weight from unit j to unit i,

• s_j is the state of unit j,

• θ_i is the threshold of unit i.

While talking about the connections w_ij, we need to mention two typical restrictions: no unit has a connection with itself (w_ii = 0), and the connections are symmetric (w_ij = w_ji).

The requirement that the weights be symmetric is typically imposed because it guarantees that the energy function decreases monotonically while following the activation rules. If non-symmetric weights are used, the network may exhibit periodic or chaotic behaviour. Training a Hopfield artificial neural network (Fig. 7.) involves lowering the energy of the states that the artificial neural network should remember.

Fig. 7. Simple “one neuron” Hopfield artificial neural network.
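A minimal Python sketch of a Hopfield network with states in {-1, 1}, assuming the common Hebbian storage rule for lowering the energy of the remembered patterns and the asynchronous update of equation (6); patterns and cue are arbitrary examples:

import numpy as np

# Patterns (stable target vectors) with states in {-1, 1}
patterns = np.array([[1, -1, 1, -1],
                     [1, 1, -1, -1]])
n = patterns.shape[1]

# Hebbian storage: symmetric weights (w_ij = w_ji) with w_ii = 0
W = np.zeros((n, n))
for p in patterns:
    W += np.outer(p, p)
np.fill_diagonal(W, 0)

def recall(s, theta=0.0, cycles=5):
    # asynchronous updates following the activation rule of equation (6)
    s = s.copy()
    for _ in range(cycles):
        for i in range(n):
            s[i] = -1 if np.dot(W[i], s) < theta else 1
    return s

cue = np.array([1, 1, 1, -1])  # first pattern with its second bit flipped
print(recall(cue))             # recovers the stored pattern [1, -1, 1, -1]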

3.4 Elman and Jordan Artificial Neural Networks

The Elman network, also referred to as the Simple Recurrent Network, is a special case of recurrent artificial neural networks. It differs from conventional two-layer networks in that the first layer has a recurrent connection. It is a simple three-layer artificial neural network that has a back-loop from the hidden layer to the input layer through a so-called context unit (Fig. 8.). This type of artificial neural network has memory, allowing it to both detect and generate time-varying patterns.

The Elman artificial neural network typically has sigmoid artificial neurons in its hidden layer and linear artificial neurons in its output layer. This combination of transfer functions can approximate any function with arbitrary accuracy, provided only that there are enough artificial neurons in the hidden layer. Being able to store information, the Elman artificial neural network is capable of generating temporal patterns as well as spatial patterns and of responding to them. The Jordan network (Fig. 9.) is similar to the Elman network; the only difference is that the context units are fed from the output layer instead of the hidden layer.

Fig. 8. Elman artificial neural network.

Fig. 9. Jordan artificial neural network.
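A minimal Python sketch of one time step of an Elman network, assuming a sigmoid hidden layer and a linear output layer as described above; the weight matrices are random placeholders:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def elman_step(x, context, W_in, W_ctx, W_out):
    # the hidden layer sees the current input plus the context
    # (the previous hidden state fed back through the back-loop)
    h = sigmoid(W_in @ x + W_ctx @ context)
    y = W_out @ h          # linear output layer
    return y, h            # h becomes the next context

rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 2))
W_ctx = rng.normal(size=(4, 4))
W_out = rng.normal(size=(1, 4))
context = np.zeros(4)
for x in [np.array([0.1, 0.9]), np.array([0.8, 0.2])]:
    y, context = elman_step(x, context, W_in, W_ctx, W_out)
    print(y)
# A Jordan network would instead feed y (the output) back as the context.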

3.5 Long Short Term Memory

Long Short Term Memory is one of the recurrent artificial neural network topologies. In contrast with basic recurrent artificial neural networks, it can learn from its experience to process, classify and predict time series with very long time lags of unknown size between important events. This makes Long Short Term Memory outperform other recurrent artificial neural networks, Hidden Markov Models and other sequence learning methods. A Long Short Term Memory artificial neural network is built from Long Short Term Memory blocks that are capable of remembering a value for any length of time. This is achieved with gates that determine when the input is significant enough to remember, when to continue remembering or to forget it, and when to output the value.

The architecture of a Long Short Term Memory block is shown in Fig. 10, where the input layer consists of sigmoid units. The top neuron in the input layer processes the input value that might be sent to a memory unit, depending on the computed value of the second neuron from the top in the input layer. The third neuron from the top in the input layer decides how long the memory unit will hold (remember) its value, and the bottom-most neuron determines when the value from memory should be released to the output. The neurons in the first hidden layer and in the output layer perform a simple multiplication of their inputs, and a neuron in the second hidden layer computes a simple linear function of its inputs. The output of the second hidden layer is fed back into the input and first hidden layers in order to help with making decisions.

Fig. 10. Simple Long Short Term Memory artificial neural network (block).

3.6 Bi-directional Artificial Neural Networks (Bi-ANN)

Bi-directional artificial neural networks (Fig. 11.) are designed to predict complex time series. They consist of two individual interconnected artificial neural (sub)networks that perform direct and inverse (bidirectional) transformations. The interconnection of the artificial neural subnetworks is done through two dynamic artificial neurons that are capable of remembering their internal states. This type of interconnection between future and past values of the processed signals increases the time series prediction capabilities. As such, these artificial neural networks not only predict future values of the input data but also past values. This brings the need for two-phase learning: in the first phase we teach one artificial neural subnetwork to predict the future, and in the second phase we teach the second artificial neural subnetwork to predict the past.

3.7 Self-Organizing Map (SOM)

The self-organizing map is an artificial neural network that is related to feed-forward networks, but it needs to be said that this type of architecture is fundamentally different in its arrangement of neurons and its motivation. A common arrangement of the neurons is in a hexagonal or rectangular grid (Fig. 12.). Self-organizing maps differ from other artificial neural networks in the sense that they use a neighbourhood function to preserve the topological properties of the input space. They use an unsupervised learning paradigm to produce a low-dimensional, discrete representation of the input space of the training samples, called a map, which makes them especially useful for visualizing low-dimensional views of high-dimensional data. Such networks can learn to detect regularities and correlations in their input and adapt their future responses to that input accordingly.

Fig. 11. Bi-directional artificial neural network.

Fig. 12. Self-organizing Map in rectangular (left) and hexagonal (right) grid.


Just as other artificial neural networks need learning before they can be used, the same goes for the self-organizing map, where the goal of learning is to cause different parts of the artificial neural network to respond similarly to certain input patterns. Before adjusting the weights of the neurons in the process of learning, they are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors. After initialization, the artificial neural network needs to be fed with a large number of example vectors. For each of them, the Euclidean distance to all weight vectors is computed, and the neuron with the weight vector most similar to the input is called the best matching unit. The weights of the best matching unit and of the neurons close to it are adjusted towards the input vector. This process is repeated for each input vector for a number of cycles. After the learning phase we do the so-called mapping (usage of the artificial neural network), and during this phase the one neuron whose weight vector lies closest to the input vector is the winning neuron. The distance between the input and a weight vector is again determined by calculating the Euclidean distance between them.
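A minimal Python sketch of the self-organizing map learning step just described, assuming a Gaussian neighbourhood function and random initialization of a small rectangular grid; the learning rate and radius are arbitrary fixed values (in a full SOM they would decay over the cycles):

import numpy as np

rng = np.random.default_rng(0)
grid = rng.random((5, 5, 3))        # 5x5 rectangular map, 3-D weight vectors

def som_step(grid, x, lr=0.5, radius=1.5):
    # best matching unit: neuron whose weight vector is closest (Euclidean)
    d = np.linalg.norm(grid - x, axis=2)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            # neighbourhood function: the BMU and nearby neurons are
            # adjusted towards the input vector x
            g = np.exp(-((i - bmu[0])**2 + (j - bmu[1])**2) / (2 * radius**2))
            grid[i, j] += lr * g * (x - grid[i, j])

for x in rng.random((100, 3)):      # repeated over many example vectors
    som_step(grid, x)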

3.8 Stochastic Artificial Neural Network

Stochastic artificial neural networks are a type of artificial intelligence tool. They are built by introducing random variations into the network, either by giving the network's neurons stochastic transfer functions or by giving them stochastic weights. This makes them useful tools for optimization problems, since the random fluctuations help them escape from local minima. Stochastic neural networks that are built by using stochastic transfer functions are often called Boltzmann machines.
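A minimal Python sketch of a Boltzmann-style stochastic unit, assuming a sigmoid firing probability with a temperature parameter T (an illustrative choice, not part of the text above); repeated evaluations of the same input give different outputs:

import numpy as np

rng = np.random.default_rng(0)

def stochastic_neuron(x, w, b, T=1.0):
    # stochastic transfer function: the unit fires with probability given
    # by the sigmoid of its weighted input; these random fluctuations are
    # what help the network escape from local minima
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b) / T))
    return 1 if rng.random() < p else 0

x, w = np.array([1.0, 0.5]), np.array([0.3, -0.8])
print([stochastic_neuron(x, w, 0.1) for _ in range(10)])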

3.9 Physical Artificial Neural Network

Most artificial neural networks today are software-based, but that does not exclude the possibility of creating them with physical elements based on materials with adjustable electrical resistance. The history of physical artificial neural networks goes back to the 1960s, when the first physical artificial neural networks were created with memory transistors called memistors. Memistors emulate the synapses of artificial neurons. Although these artificial neural networks were commercialized, they did not last long due to their lack of scalability. After this attempt, several others followed, such as attempts to create physical artificial neural networks based on nanotechnology or phase-change materials.

4. Learning

There are three major learning paradigms: supervised learning, unsupervised learning and reinforcement learning. Usually each of them can be employed with any given type of artificial neural network architecture, and each learning paradigm has many training algorithms.

4.1 Supervised learning

Supervised learning is a machine learning technique that sets the parameters of an artificial neural network from training data. The task of the learning artificial neural network is to predict the output value for any valid input value after having seen a number of training examples. The training data consist of pairs of input and desired output values that are traditionally represented as data vectors. Supervised learning can also be referred to as classification, where we have a wide range of classifiers, each with its strengths and weaknesses. Choosing a suitable classifier (multilayer perceptron, support vector machine, k-nearest neighbour algorithm, Gaussian mixture model, Gaussian, naive Bayes, decision tree, radial basis function classifiers, ...) for a given problem is, however, still more an art than a science.

In order to solve a given problem with supervised learning, various steps have to be considered. In the first step we have to determine the type of training examples. In the second step we need to gather a training data set that satisfactorily describes the given problem. In the third step we need to describe the gathered training data set in a form understandable to the chosen artificial neural network. In the fourth step we do the learning, and after the learning we can test the performance of the learned artificial neural network with a test (validation) data set. The test data set consists of data that has not been introduced to the artificial neural network during learning.
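A minimal Python sketch of the supervised learning steps above, assuming a synthetic data set and a single sigmoid neuron trained by gradient descent; the split sizes, learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
# steps 1-2: gather training examples that describe the problem
# (here: synthetic input vectors with desired "teacher" outputs)
X = rng.random((200, 3))
y = (X.sum(axis=1) > 1.5).astype(float)

# step 3: describe the data in a form usable by the network, and hold out
# a test (validation) set never shown to the network during learning
split = 150
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# step 4: learn (gradient descent on a single sigmoid neuron)
w, b = np.zeros(3), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))
    err = p - y_train
    w -= 0.1 * X_train.T @ err / split
    b -= 0.1 * err.mean()

# final step: test the performance on the unseen data
p_test = 1.0 / (1.0 + np.exp(-(X_test @ w + b)))
print("test accuracy:", ((p_test > 0.5) == y_test).mean())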

4.2 Unsupervised learning

Unsupervised learning is a machine learning technique that sets the parameters of an artificial neural network based on given data and a cost function that is to be minimized. The cost function can be any function, and it is determined by the task formulation. Unsupervised learning is mostly used in applications that fall within the domain of estimation problems, such as statistical modelling, compression, filtering, blind source separation and clustering. In unsupervised learning we seek to determine how the data are organized. It differs from supervised learning and reinforcement learning in that the artificial neural network is given only unlabeled examples. One common form of unsupervised learning is clustering, where we try to categorize data into different clusters by their similarity. Among the artificial neural network models described above, the self-organizing maps are the ones that most commonly use unsupervised learning algorithms.

4.3 Reinforcement learning

Reinforcement learning is a machine learning technique that sets the parameters of an artificial neural network where data are usually not given but generated by interactions with the environment. Reinforcement learning is concerned with how an artificial neural network ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning is frequently used as a part of an artificial neural network's overall learning algorithm.

After the return function that needs to be maximized is defined, reinforcement learning uses several algorithms to find the policy that produces the maximum return. The naive brute force algorithm first calculates the return function for each possible policy and then chooses the policy with the largest return; an obvious weakness of this algorithm shows in the case of an extremely large or even infinite number of possible policies. This weakness can be overcome by value function approaches or by direct policy estimation. Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for one policy, usually either the current or the optimal one. These methods converge to the correct estimates for a fixed policy and can also be used to find the optimal policy. Like the value function approaches, direct policy estimation can also find the optimal policy; it finds it by searching directly in policy space, which greatly increases the computational cost.
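A minimal Python sketch of the naive brute force algorithm, assuming a toy environment with three states, two actions and a known reward table (all names and values here are hypothetical examples):

import itertools

# tiny deterministic environment: 3 states, 2 actions, known rewards
rewards = {(0, 'a'): 1, (0, 'b'): 0,
           (1, 'a'): 0, (1, 'b'): 2,
           (2, 'a'): 1, (2, 'b'): 1}

def total_return(policy):
    # return of a policy: summed reward of the action it picks in each state
    return sum(rewards[(s, policy[s])] for s in range(3))

# naive brute force: evaluate the return of every possible policy,
# then choose the policy with the largest return
best = max(itertools.product('ab', repeat=3), key=total_return)
print(best, total_return(best))  # ('a', 'b', 'a') with return 4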

Reinforcement learning is particularly suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, telecommunications, and games such as chess and other sequential decision making tasks.

5. Usage of Artificial Neural Networks

One of the greatest advantages of artificial neural networks is their capability to learn from their environment. Learning from the environment is useful in applications where the complexity of the environment (data or task) makes implementations of other types of solutions impractical. As such, artificial neural networks can be used for a variety of tasks such as classification, function approximation, data processing, filtering, clustering, compression, robotics, regulation and decision making. Choosing the right artificial neural network topology depends on the type of application and the data representation of the given problem. When choosing and using artificial neural networks we need to be familiar with the theory of artificial neural network models and learning algorithms. The complexity of the chosen model is crucial: using too simple a model for a specific task usually results in poor or wrong results, and an overly complex model for a specific task can lead to problems in the process of learning. A complex model and a simple task result in memorizing rather than learning. There are many learning algorithms with numerous trade-offs between them, and almost all are suitable for any type of artificial neural network model. Choosing the right learning algorithm for a given task takes a lot of experience and experimentation with the given problem and data set. When the artificial neural network model and learning algorithm are properly selected, we get a robust tool for solving the given problem.

5.1 Example: Using bi-directional artificial neural network for ICT fraud detection

The spread of Information and Communication Technologies results not only in benefits for individuals and society but also in threats and an increase in Information and Communication Technology fraud. One of the main tasks for Information and Communication Technology developers is to prevent potential fraudulent misuse of new products and services. If protection against fraud fails, there is a vital need to detect fraud as soon as possible. Detection of Information and Communication Technology fraud is based on numerous principles. One such principle is the use of artificial neural networks in the detection algorithms. Below is an example of how to use a bi-directional artificial neural network for detecting mobile-phone fraud.

The first task is to represent the problem of detecting our fraud in a way that can be easily understood by humans and machines (computers). Each individual user or group of users behaves in a specific way while using a mobile phone. By learning their behaviour we can teach our system to recognize and predict users' future behaviour to a certain degree of accuracy. A later comparison between predicted and real-life behaviour, and any discrepancy between them, can indicate potentially fraudulent behaviour. It was shown that mobile-phone usage behaviour can be represented in the form of a time series suitable for further analysis with artificial neural networks (Krenker et al., 2009). With this representation we transform the behaviour prediction task into a time series prediction task. The time series prediction task can be realized with several different types of artificial neural networks but, as mentioned in earlier sections, some are more suitable than others. Because we expect long and short time periods between important events in our data representation of users' behaviour, the most obvious artificial neural networks to use are Long Short Term Memory and bi-directional artificial neural networks. On the basis of other researchers' favourable results in time series prediction with bi-directional artificial neural networks (Wakuya & Shida, 2001), we decided to use this artificial neural network topology for predicting our time series.

After choosing the artificial neural network architecture, we chose the type of learning paradigm: supervised learning, for which we gathered real-life data from a telecommunication system. The gathered data was divided into two sub-sets: a training sub-set and a validation sub-set. With the training data sub-set the artificial neural network learns to predict future and past time series, and with the validation data sub-set we simulate and validate the prediction capabilities of the designed and fine-tuned bi-directional artificial neural network. Validation was done by calculating the Average Relative Variance, which represents a measure of similarity between the predicted and expected time series.
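A minimal Python sketch of one common definition of the Average Relative Variance (the mean squared prediction error normalised by the variance of the expected series); the exact formulation used by Krenker et al. (2009) may differ:

import numpy as np

def average_relative_variance(expected, predicted):
    # ARV = 0 means a perfect prediction; ARV = 1 means the prediction
    # is no better than always predicting the mean of the expected series
    expected, predicted = np.asarray(expected), np.asarray(predicted)
    return np.mean((expected - predicted) ** 2) / np.var(expected)

expected = [1.0, 2.0, 3.0, 4.0]     # placeholder time series values
predicted = [1.1, 1.9, 3.2, 3.8]
print(average_relative_variance(expected, predicted))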

Only after we had gathered information about mobile-phone fraud, and after choosing the representation of our problem and the basic approaches for solving it, could we start building the overall model for detecting mobile-phone fraud (Fig. 13.).

In Fig. 13 we can see that the mobile-phone fraud detection model is built out of three modules: an input module, an artificial neural network module and a comparison module. The Input Module gathers users' information about mobile-phone usage from the telecommunication system in three parts. In the first part it is used for gathering the learning data from which the Artificial Neural Network Module learns. In the second part the Input Module gathers users' data for the purpose of validating the Artificial Neural Network Module, and in the third part it collects users' data in real time for the purpose of running the deployed mobile-phone fraud detection system. The Artificial Neural Network Module is a bi-directional artificial neural network that learns from the gathered data and, later, when the mobile-phone fraud detection system is deployed, continuously predicts the time series that represents users' behaviour. The Comparison Module is used for validating the Artificial Neural Network Module in the process of learning and, later, when the mobile-phone fraud detection system is deployed, for triggering alarms in case of discrepancies between the predicted and the real-life gathered information about users' behaviour.

Fig. 13. Mobile-phone fraud detection model.


Although the mobile-phone fraud detection system described above is simple and straightforward, the reader needs to realize that the majority of the work lies not in creating and later implementing the desired system, but in fine-tuning the data representation and the artificial neural network architecture and its parameters, which is strongly dependent on the type of input data.

6. Conclusions

Artificial neural networks are widespread and used in everyday services, products and applications. Although modern software products enable relatively easy handling of artificial neural networks, for their creation, optimisation and usage in real-life situations it is necessary to understand the theory that stands behind them. This chapter of the book introduces artificial neural networks to the novice reader and serves as a stepping stone for all of those who would like to get more involved in the area of artificial neural networks.

In the Introduction, in order to illuminate the area of artificial neural networks, we briefly described the basic building block (the artificial neuron) of artificial neural networks and its "transformation" from a single artificial neuron to a complete artificial neural network. In the section Artificial Neuron we present basic and important information about the artificial neuron and where researchers borrowed the idea to create one. We show the similarities between the biological and the artificial neuron, their composition and inner workings. In the section Artificial Neural Networks we present basic information about the different, most commonly used artificial neural network topologies: the Feed-forward, Recurrent, Hopfield, Elman, Jordan, Long Short Term Memory, Bi-directional, Self-Organizing Map, Stochastic and Physical artificial neural networks. After describing the various types of artificial neural network architectures, we describe how to make them useful through learning. We describe the different learning paradigms (supervised, unsupervised and reinforcement learning) in the section Learning. In the last section, Usage of Artificial Neural Networks, we describe how to handle artificial neural networks in order to make them capable of solving certain problems. To show what artificial neural networks are capable of, we gave a short example of how to use a bi-directional artificial neural network in a mobile-phone fraud detection system.

7. References

Gurney, K. (1997). An Introduction to Neural Networks, Routledge, ISBN 1-85728-673-1, London.

Krenker, A.; Volk, M.; Sedlar, U.; Bešter, J.; Kos, A. (2009). Bidirectional artificial neural networks for mobile-phone fraud detection. ETRI Journal, vol. 31, no. 1, Feb. 2009, pp. 92-94, COBISS.SI-ID 6951764.

Kröse, B.; Smagt, P. (1996). An Introduction to Neural Networks, The University of Amsterdam, Amsterdam.

Pavešić, N. (2000). Razpoznavanje vzorcev: uvod v analizo in razumevanje vidnih in slušnih signalov, Fakulteta za elektrotehniko, ISBN 961-6210-81-5, Ljubljana.

Rojas, R. (1996). Neural Networks: A Systematic Introduction, Springer, ISBN 3-540-60505-3, Germany.

Wakuya, H.; Shida, K. (2001). Bi-directionalization of neural computing architecture for time series prediction. III. Application to laser intensity time record "Data Set A". Proceedings of International Joint Conference on Neural Networks, pp. 2098-2103, ISBN 0-7803-7044-9, Washington DC, 2001.

2

Review of Input Variable Selection Methods for Artificial Neural Networks

Robert May^1, Graeme Dandy^2 and Holger Maier^3
^1 Veolia Water, University of Adelaide
^2,3 University of Adelaide
Australia

1. Introduction

The choice of input variables is a fundamental, and yet crucial, consideration in identifying the optimal functional form of statistical models. The task of selecting input variables is common to the development of all statistical models, and is largely dependent on the discovery of relationships within the available data to identify suitable predictors of the model output. In the case of parametric, or semi-parametric empirical models, the difficulty of the input variable selection task is somewhat alleviated by the a priori assumption of the functional form of the model, which is based on some physical interpretation of the underlying system or process being modelled. However, in the case of artificial neural networks (ANNs), and other similarly data-driven statistical modelling approaches, there is no such assumption made regarding the structure of the model. Instead, the input variables are selected from the available data, and the model is developed subsequently. The difficulty of selecting input variables arises due to (i) the number of available variables, which may be very large; (ii) correlations between potential input variables, which creates redundancy; and (iii) variables that have little or no predictive power.

Variable subset selection has been a longstanding issue in fields of applied statistics dealing with inference and linear regression (Miller, 1984), and the advent of ANN models has only served to create new challenges in this field. The non-linearity, inherent complexity and non-parametric nature of ANN regression make it difficult to apply many existing analytical variable selection methods. The difficulty of selecting input variables is further exacerbated during ANN development, since the task of selecting inputs is often delegated to the ANN during the learning phase of development. A popular notion is that an ANN is adequately capable of identifying redundant and noise variables during training, and that the trained network will use only the salient input variables. ANN architectures can be built with arbitrary flexibility and can be successfully trained using any combination of input variables (assuming they are good predictors). Consequently, allowances are often made for a large number of input variables, with the belief that the ability to incorporate such flexibility and redundancy creates a more robust model. Such pragmatism is perhaps symptomatic of the popularisation of ANN models through machine learning, rather than statistical learning theory. ANN models are too often developed without due consideration given to the effect that the choice of input variables has on model complexity, learning difficulty, and performance of the subsequently trained ANN.


Recently, ANN modellers have become increasingly aware of the need to undertake input variable selection (IVS), and a myriad of methods employed to undertake the IVS task are described within reported ANN applications—some more suited to ANN development than others. This review therefore serves to provide some guidance to ANN modellers by highlighting some of the key issues surrounding variable selection within the context of ANN development; to survey some of the alternative strategies that can be adopted within a general framework; and to provide some examples with discussion of the benefits and disadvantages in each case.

2. The input variable selection problem

Recall that for an unknown, steady-state input-output process, the development of an ANN provides the non-linear transfer function

Y = F(X) + ε,        (1)

where the model output Y is some variable of interest, X is a k-dimensional input vector, whose component variables are denoted by X_i (i = 1, ..., k), and ε is some small random noise. Let C denote the set of d variables that are available to construct the ANN model. The problem of input variable selection (IVS) is to choose a set of k variables from the d variables in C to form X (Battiti, 1994; Kwak & Choi, 2002) that leads to the optimal form of the model, F.

Dynamic processes will require the development of an ANN to provide a time-series model of the general form

Y(t + k) = F(Y(t), ..., Y(t − p), X(t), ..., X(t − p)) + ε(t).        (2)

Here, the output variable is predicted at some future time t + k, as a function of past values of both the input X and the output Y. Past observations of each variable are referred to as lags, and the model order p defines the maximum lag of the model. The model order reflects the persistence of dynamics within the system. In comparison to the steady-state model formulation, the number of variables in the candidate set C is now multiplied by the model order. Consequently, for systems with strong persistence, the number of candidate variables is often quite large.
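To make the growth of the candidate set concrete, a minimal Python sketch that assembles the lagged candidate variables of equation (2) for given X, Y and model order p; the data and variable names here are random placeholders:

import numpy as np

def build_candidates(X, Y, p):
    """Candidate set for the time-series model of equation (2): lags
    0..p of each input variable in X plus lags 1..p of the output Y,
    aligned with the target Y(t) for t = p .. T-1."""
    T, d = X.shape
    cols, names = [], []
    for lag in range(p + 1):
        for j in range(d):
            cols.append(X[p - lag:T - lag, j])
            names.append(f"X{j}(t-{lag})")
    for lag in range(1, p + 1):
        cols.append(Y[p - lag:T - lag])
        names.append(f"Y(t-{lag})")
    return np.column_stack(cols), names

rng = np.random.default_rng(0)
X, Y = rng.random((100, 3)), rng.random(100)
C, names = build_candidates(X, Y, p=4)
# d = 3 inputs and p = 4 already give d*(p+1) + p = 19 candidates
print(C.shape, len(names))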

ANN models may be specified with insufficient, or uninformative, input variables (under-specified); or with more inputs than is strictly necessary (over-specified), due to the inclusion of superfluous variables that are uninformative, weakly informative, or redundant. Defining what constitutes an optimal set of ANN input variables first requires some consideration of the impact that the choice of input variables has on model performance. The following arguments summarise the key considerations:

• Relevance. Arguably the most obvious concern is that too few variables are selected, or that the selected set of input variables is not sufficiently informative. In this case, the outcome is a poorly performing model, since some of the behaviour of the output remains unexplained by the selected input variables. In most cases, it is reasonable to assume that a modeller will have some expert knowledge of the system under consideration; will have surveyed the available data; and will have arrived at a reasonable set of candidate input variables. The a priori assumption of model development is that at least one or more of the available candidate variables is capable of describing some, if not all, of the output behaviour, and that it is the nature and relative strength of these relationships that is unknown (which is, of course, the motivation behind the development of non-parametric models). Should it happen that none of the available candidates are good predictors, then the problem of model development is intractable, and it may be necessary to reconsider the available data and the choice of model output, and to undertake further measurements or observations before revisiting the task of model development.

• Computational Effort. The immediately obvious effect of including a greater number of input variables is that the size of an ANN increases, which increases the computational burden associated with querying the network—a significant influence in determining the speed of training. In the case of the multilayer perceptron (MLP), the input layer will have an increased number of incoming connection weights. In the case of kernel-based generalised regression neural network (GRNN) and radial basis function (RBF) networks, the computation of the distance to prototype vectors is more expensive due to the higher dimensionality. Furthermore, additional variables place an increased burden on any data pre-processing steps that may be undertaken during ANN development.

• Training difficulty. The task of training an ANN becomes more difficult due to the inclusion of redundant and irrelevant input variables. The effect of redundant variables is to increase the number of local optima in the error function that is projected over the parameter space of the model, since there are more combinations of parameters that can yield locally optimal error values. Algorithms such as the back-propagation algorithm, which are based on gradient descent, are therefore more likely to converge to a local optimum, resulting in poor generalisation performance. Training of the network is also slower because the relationship between redundant parameters and the error is more difficult to map. Irrelevant variables add noise into the model, which also hinders the learning process. The training algorithm may expend resources adjusting weights that have no bearing on the output variable, or the noise may mask the important input-output relationships. Consequently, many more iterations of the training algorithm may be required to determine a near-global optimum error, which adds to the computational burden of model development.

• Dimensionality. The so-called curse of dimensionality (Bellman, 1961) is that, as the dimensionality of a model increases linearly, the total volume of the modelling problem domain increases exponentially. Hence, in order to map a given function over the model parameter space with sufficient confidence, an exponentially increasing number of samples is required (Scott, 1992). Alternatively, where a finite number of data are available (as is generally the case in real-world applications), it can be said that the confidence or certainty that the true mapping has been found will diminish. ANN architectures like the MLP are particularly susceptible to the curse due to the rapid growth in the number of connection weights as input variables are added. Table 1 illustrates the growth in the sample size required to maintain a constant error associated with estimates of the input probability, as determined by the pattern layer of a GRNN. Some ANN architectures can also circumvent the curse of dimensionality through their handling of redundancy and their ability to simply ignore irrelevant variables (Sarle, 1997). Others, such as RBF networks and GRNN architectures, are unable to achieve this without significant modifications to the behaviour of their kernel functions, and are particularly sensitive to increasing dimensionality (Specht, 1991).

• Comprehensibility. In many applications, such as in the case of ANN transfer functions for process modelling, it will often suffice to regard an ANN as a “black-box” model. However, ANN modellers are increasingly concerned with the development of ANN models for knowledge discovery from data (KDD) and data mining (Craven & Shavlik, 1998). The goal of KDD is to train an ANN based on observations of a process, and then interrogate the ANN to gain further understanding of the process behaviour it has learned. Rule-extraction from ANN models can be useful for a number of purposes, including: (i) defining input domains that produce certain ANN outputs, which can be useful knowledge in itself; (ii) validation of the ANN behaviour (e.g. verifying that input-output response trends make sense), which increases confidence in the ANN predictions; and (iii) the discovery of new relationships, which reveals previously unknown insights into the underlying physical process (Craven & Shavlik, 1998; Darbari, 2000). Reducing the complexity of the ANN architecture, by minimising redundancy and the size of the network, can significantly improve the performance of data mining and rule-extraction algorithms.

Dimension, d    Sample size, N
     1                    4
     2                   19
     3                   67
     4                  223
     5                  768
     6                2 790
     7               10 700
     8               43 700
     9              180 700
    10              842 000

Table 1. Growth of sample size with increasing dimensionality required to maintain a constant standard error of the probability of an input estimated in the GRNN pattern layer (Silverman, 1986).

Based on the arguments presented, a desirable input variable is a highly informative explanatory variable (i.e. a good predictor) that is dissimilar to the other input variables (i.e. independent). Consequently, the optimal input variable set will contain the fewest input variables required to describe the behaviour of the output variable, with a minimum degree of redundancy and with no uninformative (noise) variables. Identification of an optimal set of input variables will lead to a more accurate, efficient, cost-effective and more easily interpretable ANN model.

The fundamental importance of the IVS issue is evident from the depth of literature surrounding the development and discussion of IVS algorithms in classification, machine learning, statistical learning theory, and the many other fields where ANN models are applied. In a broad context, reviews of IVS approaches have been presented by Kohavi & John (1997), Blum & Langley (1997) and, more recently, by Guyon & Elisseeff (2003). However, in many examples of the application of ANNs to modelling and data analysis, the importance of IVS is often understated. In other cases, the task is given only marginal consideration, which often results in the application of ad hoc or inappropriate methods. Reviews by Maier & Dandy (2000) and Bowden (2003) examined the IVS methods that have been applied to ANN applications in engineering and concluded that there was a need for a more considered approach to the IVS task. Certainly, no consensus has been reached regarding


suitable methods for undertaking the IVS task in the development of ANN regression or time-series forecasting models (Bowden, 2003).

3. Taxonomy of algorithms

Figure 1 presents a taxonomy, which provides some examples of the various approaches that have been proposed within the ANN literature. IVS algorithms can be broadly classified into three main classes: wrapper, embedded or filter algorithms (Blum & Langley, 1997; Guyon & Elisseeff, 2003; Kohavi & John, 1997), as shown in Figure 1. These different conceptual approaches to IVS algorithm design are illustrated in Figure 2. Wrapper algorithms, as shown in Figure 2(a), approach the IVS task as part of the optimisation of the model architecture. The optimisation searches through the set, or a subset, of all possible combinations of input variables, and selects the set that yields the optimal generalisation performance of the trained ANN. As the name indicates, embedded algorithms (Figure 2(b)) for IVS are directly incorporated into the ANN training algorithm, such that the adjustment of input weights considers the impact of each input on the performance of the model, with irrelevant and/or redundant weights progressively removed as training proceeds. In contrast, IVS filters (Figure 2(c)) distinctly separate the IVS task from ANN training and instead adopt an auxiliary statistical analysis technique to measure the relevance of individual, or combinations of, input variables.

Given the general basis for the formulation of both IVS wrapper and filter designs, the diversity of implementations that can possibly be conceived is immediately apparent. However, designs for wrappers and filters share the same overall components, in that, in addition to a measure of the informativeness of input variables, each class of selection algorithm requires:

1. a criterion or test to determine the influence of the selected input variable(s), and
2. a strategy for searching among the combinations of candidate input variables.
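
To make these two components concrete, the following minimal Python sketch (not from the chapter; `train_ann`, `criterion` and `search` are assumed placeholders for the modeller's own routines) shows how a wrapper composes a search strategy with an optimality test:

```python
# Illustrative skeleton of a wrapper IVS algorithm: a search strategy
# proposes input subsets, an ANN is trained on each, and an optimality
# criterion scores the trained model. All three callables are assumed.

def wrapper_ivs(candidates, train_ann, criterion, search):
    best_subset, best_score = None, float("inf")
    for subset in search(candidates):      # e.g. forward, exhaustive, GA
        model = train_ann(subset)          # fit an ANN using this input set
        score = criterion(model, subset)   # e.g. MSE, AIC, Mallows' Cp
        if score < best_score:             # lower score = better model
            best_subset, best_score = subset, score
    return best_subset
```

A filter follows the same skeleton, except that the scoring step applies a statistical measure to the candidate variables directly, without training an ANN.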

3.1 Optimality Criteria

The optimality criterion translates the arguments presented in Section 2 into an expression for the optimal size k and composition of the input vector, X. Optimality criteria for wrapper selection algorithms are derived from, or are exactly the same as, criteria that are ultimately used to assess the predictive performance of the trained ANN. Essentially, the wrapper approach treats the IVS task as a model selection exercise, where each model corresponds to a unique combination of input variables. Recall that the most commonly adopted measure of predictive performance for ANNs is the mean squared error (MSE), which is given by

\[
\mathrm{MSE} = \frac{1}{n}\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2 \qquad (3)
\]

where $y_j$ and $\hat{y}_j$ are the actual and predicted outputs, which correspond to a set of test data. Following the development of m models, a simple strategy is to select the model that corresponds to the minimum MSE. However, the drawback of this criterion is that the “best” performing model, in terms of the MSE, is not necessarily the “optimal” model, since models with a large number of input variables tend to be biased as a result of over-fitting.

Consequently, it is more common to adopt an optimality criterion such as Mallows' $C_p$ (Mallows, 1973) or the Akaike information criterion (AIC) (Akaike, 1974), which penalise over-fitting.

Dimension reduction
    Rotation
        Linear
            Principal component analysis (PCA)
            Partial least-squares (PLS) (Wold, 1966)
        Non-linear
            Independent component analysis (ICA)
            Non-linear PCA (NLPCA)
    Clustering
        Learning vector quantisation (LVQ)
        Self-organizing map (SOM) (Bowden et al., 2002)
Variable selection
    Model-based
        Wrapper
            Nested
                Forward selection (constructive ANNs)
                Backward elimination
                Nested subset (e.g. increasing delay order)
            Global search
                Exhaustive search
                Heuristic search (e.g. GA-ANN)
            Ranking
                Single-variable ranking (SVR)
                GRNN input determination algorithm (GRIDA)
        Embedded
            Optimisation
                Direct optimisation (e.g. Lasso)
                Evolutionary ANNs
            Weight-based
                Stepwise regression
                Pruning (e.g. OBD (Le Cun et al., 1990))
                Recursive feature elimination
    Filter (model-free)
        Correlation (linear)
            Rank (maximum) Pearson correlation
            Ranked (maximum) Spearman correlation
            Forward partial correlation selection
            Time-series analysis (Box & Jenkins, 1976)
        Information-theoretic (non-linear)
            Entropy
                Entropy (minimum) ranking
                Minimum entropy
            Mutual information (MI)
                Rank (maximum) MI
                MI feature selection (MIFS) (Battiti, 1994)
                MI w/ ICA (ICAIVS) (Back & Trappenberg, 2001)
                Partial mutual information (PMI) (Sharma, 2000)
                Joint MI (JMI) (Bonnlander & Weigend, 1994)

Fig. 1. Taxonomy of IVS strategies and algorithms.


[Figure 2: block diagrams of the three IVS designs, showing how the candidate set, search algorithm, ANN training and optimality test are connected in each case.]

Fig. 2. Conceptual IVS approach using (a) a wrapper, (b) an embedded, or (c) a filter algorithm.

Both Mallows' $C_p$ and the AIC determine the optimal number of input variables by defining the optimal trade-off between model size and accuracy, penalising models with an increasing number of parameters. In fact, the $C_p$ criterion is considered to be a special case of the AIC. Mallows' $C_p$ is defined as
\[
C_p = \frac{\sum_{j=1}^{n}\left(y_j - \hat{y}_j(k)\right)^2}{\sigma_d^2} - n + 2p, \qquad (4)
\]

where $\hat{y}_j(k)$ are the outputs generated by a model using p parameters, and $\sigma_d^2$ is the residual variance of a full model trained using all d possible input variables. $C_p$ measures the relative bias and variance of a model with p variables. The theoretical value of $C_p$ for an unbiased (optimal) model will be p, and in model selection the model with the $C_p$ value closest to p is selected.


[Figure 3: AIC(p) plotted against p, together with the error term -log(MSE(p)) and the complexity penalty 2p + 1; the minimum of AIC(p) marks the optimal number of parameters.]

Fig. 3. The Akaike Information Criterion determines the optimum trade-off between model error and size.

The AIC is defined as
\[
\mathrm{AIC} = n \log\!\left(\frac{\sum_{j=1}^{n}\left(y_j - \hat{y}_j(k)\right)^2}{n}\right) + 2(p+1). \qquad (5)
\]

Here, the accuracy is determined by the log-likelihood, which is a function of the MSE. The complexity of the model is determined by the term p + 1, where p is the number of model parameters. Typically, the regression error decreases with increasing p, but since the model is more likely to be over-fit for a fixed sample size, the increasing complexity is penalised. At some point an optimal AIC is reached, which represents the optimal trade-off between model accuracy and model complexity. The optimum model is determined by minimising the AIC with respect to the number of model parameters, p. This is illustrated in Figure 3.
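
As an illustration, the criteria in Eqs (3)-(5) can be computed directly from model residuals. The following Python sketch is illustrative only; the arrays `y` and `y_hat` and the full-model residual variance `sigma2_full` are assumed to be supplied by the modeller, and the conventional n·log(MSE) form of the AIC is used:

```python
import numpy as np

def mse(y, y_hat):                          # Eq (3)
    return np.mean((y - y_hat) ** 2)

def mallows_cp(y, y_hat, sigma2_full, p):   # Eq (4)
    n = len(y)
    return np.sum((y - y_hat) ** 2) / sigma2_full - n + 2 * p

def aic(y, y_hat, p):                       # Eq (5): n*log(MSE) + 2(p+1)
    n = len(y)
    return n * np.log(np.sum((y - y_hat) ** 2) / n) + 2 * (p + 1)

# Model selection over a family of fitted models: keep the minimum-AIC
# model, or the one whose Cp is closest to p. `preds` is an assumed dict
# mapping p (number of parameters) to that model's test predictions:
# best_p = min(preds, key=lambda p: aic(y, preds[p], p))
```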

Other model selection criteria have also been similarly derived, such as the Bayesian information criterion (BIC) (Schwarz, 1978), which is similar to the AIC, although it applies a more severe penalty of $(k \ln n)$ to the number of model parameters. The expression for the AIC in (5) assumes a linear regression model, but can be extended to non-linear regression. However, it should be noted that in this case $p + 1$ no longer sufficiently describes the complexity of the model, and other measures are required. Such measures include the effective number of parameters and the Vapnik-Chervonenkis (VC) dimension. The values of these measures are a function of the class of regression model that is estimated and of the training data. The effective number of parameters, d, can be determined as $\mathrm{trace}(S)$, where S is the matrix defined by the expression
\[
\hat{y} = S y. \qquad (6)
\]
For kernel regression, the hat matrix S is equal to $K^{\mathsf{T}}K$, where the elements of K correspond to each kernel $K_j(x, h)$, and the complexity is therefore given by $\mathrm{trace}(K^{\mathsf{T}}K)$. Factors affecting


complexity include the number of data, the dimension of the data, and the number of basis functions. The VC-dimension is similarly defined as the number of data points that can be shattered by the model (i.e. how many points in space can be uniquely separated by the regression function). However, calculating the VC-dimension of complex regression functions can be difficult (Hastie et al., 2001). For MLP architectures, the VC-dimension is related to the number of connection weights, and for RBF networks the VC-dimension depends on the number of basis functions and their respective bandwidths, if different values are used for each basis function. Both the effective number of parameters and the VC-dimension revert to the value of $p + 1$ for linear models.
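
A minimal sketch of the effective-number-of-parameters calculation for a kernel (GRNN-style) regression follows. The chapter gives $S = K^{\mathsf{T}}K$ for kernel regression; this sketch instead forms the row-normalised (Nadaraya-Watson) smoother matrix, which satisfies $\hat{y} = Sy$ of Eq. (6) directly, and assumes a Gaussian kernel of bandwidth h:

```python
import numpy as np

def smoother_matrix(X, h):
    # Pairwise squared distances between training points
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * h ** 2))         # kernel weights K_j(x, h)
    S = K / K.sum(axis=1, keepdims=True)     # normalise rows so y_hat = S y
    return S

X = np.random.rand(50, 3)                    # 50 samples, 3 input variables
S = smoother_matrix(X, h=0.2)
print("effective number of parameters:", np.trace(S))
```

As the input dimension grows for a fixed sample and bandwidth, the kernel weights concentrate on each point itself and trace(S) approaches n, reflecting the sensitivity of kernel architectures to dimensionality noted earlier.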

In filter algorithm designs, the optimality criterion is embedded in the statistical analysis of candidate variables, which defines the interpretation of “good” input variables. In general, selection filters search amongst the candidate variables and identify suitable input variables according to the following criteria:

1. maximum relevance (MR),
2. minimum redundancy (mR), and
3. minimum redundancy–maximum relevance (mRMR).

The criterion of maximum relevance ensures that the selected input variables are highly informative, by searching for variables that have a high degree of correlation with the output variable. Input ranking schemes are a prime example of MR techniques, in which the relevance of each input variable is determined with respect to the output variable. Greedy selection can be applied to select the k most relevant variables, or a threshold value can be applied to select inputs that are relevant and reject those which are not.
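
A maximum-relevance ranking filter of this kind reduces to a few lines. The sketch below is illustrative, using the absolute Pearson correlation as the relevance measure; it ranks the candidates and applies either greedy top-k selection or a threshold:

```python
import numpy as np

def mr_ranking(X, y, k=None, threshold=0.0):
    # Relevance of each candidate column of X with the output y
    r = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    order = np.argsort(-np.abs(r))           # most relevant first
    if k is not None:
        return order[:k]                     # greedy top-k selection
    return [i for i in order if abs(r[i]) >= threshold]
```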

The issue with MR criteria is that the selection of the k most relevant candidate variables does not strictly yield an optimal ANN. Here, Kohavi & John (1997) make the distinction between relevance and usefulness by observing that redundancy between variables can render highly relevant variables useless as predictors. Consequently, a criterion of minimum redundancy aims to find inputs that are maximally dissimilar from one another, in order to select the most useful set of relevant variables. Applying an additional mR criterion alongside the existing MR criterion leads to mRMR selection criteria, where input variables are evaluated with the dual consideration of relevance, with respect to the output variable, and independence (dissimilarity), with respect to the other candidate variables (Ding & Peng, 2005).
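
The following sketch illustrates greedy mRMR selection in the spirit of Ding & Peng (2005); it is not the chapter's own algorithm, and it assumes scikit-learn's mutual_info_regression as the relevance and redundancy estimator:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr(X, y, k):
    d = X.shape[1]
    relevance = mutual_info_regression(X, y)   # MI of each input with y
    selected, remaining = [], list(range(d))
    while len(selected) < k and remaining:
        def score(j):
            if not selected:
                return relevance[j]
            # Mean MI between candidate j and the already-selected inputs
            red = np.mean([mutual_info_regression(X[:, [s]], X[:, j])[0]
                           for s in selected])
            return relevance[j] - red           # relevance minus redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```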

Embedded IVS considers regularisation (reducing the size or number) of the weights of a regression in order to minimise the complexity, while maintaining predictive performance. This involves the formulation of a training algorithm that simultaneously finds the minimum model error and model complexity, somewhat analogous to finding the optimum of the AIC. If the model architecture is linear-in-the-parameters, the resulting expression can be solved directly. Depending on the model complexity term, this approach gives rise to various embedded selection algorithms, such as the Lasso (Tibshirani, 1996). However, the non-linear and non-parametric nature of ANN regression does not lend itself to this approach (Guyon & Elisseeff, 2003; Tikka, 2008). Instead, embedded selection is typically applied in ANN model development in the form of a pruning strategy, where the connection weights of a network are assessed, and insignificant weights are removed from the network. Pruning algorithms were originally developed to address the computational burden associated with training fully connected networks, given that many of the weights may be only marginally important due to redundancy within the ANN architecture. However, the strategy also offers a means of selectively removing inputs, since an input variable is eliminated by removing all connection


weights between an input and the first hidden layer (Tikka, 2008). A criterion is required to identify which connection weights should be pruned, and several different approaches can be used to determine how weights are removed (Guyon & Elisseeff, 2003):

1. analysis of the sensitivity of the training error to the elimination of weights, or
2. elimination of variables based on weight magnitude.

Where the first approach has been used, different expressions for the sensitivity of the error to the weights have led to various different algorithms. The use of derivatives of error functions, or of transfer functions at hidden nodes, with respect to the weights are common strategies, and lead to examples of pruning algorithms such as the optimal brain damage (OBD) algorithm (Le Cun et al., 1990).
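
The second, magnitude-based approach can be sketched as follows. This is an illustrative implementation only (the relative 10% cut-off is an assumption, not from the chapter), in which an input is pruned when all of its input-to-hidden connection weights are small:

```python
import numpy as np

def prune_inputs_by_magnitude(W, rel_threshold=0.1):
    # W is the (n_inputs, n_hidden) weight matrix of a trained MLP.
    strength = np.abs(W).max(axis=1)          # strongest outgoing weight
    cutoff = rel_threshold * strength.max()   # relative to the largest weight
    keep = strength > cutoff
    return np.where(keep)[0], W[keep, :]      # surviving inputs and weights

# Toy example: inputs 3 and 5 carry only tiny weights and are typically pruned.
W = np.random.randn(8, 5) * np.array([[1, 1, 1, 0.01, 1, 0.02, 1, 1]]).T
kept, W_pruned = prune_inputs_by_magnitude(W)
print("inputs retained:", kept)
```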

3.2 Search strategies

Search strategies applied to IVS algorithms seek to provide an efficient method for searching through the many possible combinations of input variables and determining an optimal, or near-optimal, set while working within computational constraints. Searches may be global, considering many combinations; or local, beginning at a start location and moving through the search space incrementally. The latter are also commonly referred to as nested subset techniques, since the region they explore comprises overlapping (i.e. nested) sets built by incrementally adding variables.

3.2.1 Exhaustive search

Exhaustive search simply evaluates all of the possible combinations of input variables and selects the best set according to a predetermined optimality criterion. The method is the only selection technique that is guaranteed to determine the optimal set of input variables for a given ANN model (Bonnlander & Weigend, 1994). Given the combinatorial nature of the IVS problem, the number of possible subsets that form the search space is equal to $2^d$, with subsets ranging in size from a single input variable to the set of all available input variables. Exhaustive evaluation of all of these possible combinations may be feasible when the dimensionality of the candidate set is low, but quickly becomes infeasible as dimensionality increases.
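
A sketch of the exhaustive wrapper search follows; `evaluate` is an assumed placeholder that trains an ANN on a given subset and returns the optimality criterion, and the nested loops make the O(2^d) cost explicit:

```python
from itertools import combinations

def exhaustive_search(candidates, evaluate):
    # All 2^d - 1 non-empty subsets are tried, so this is feasible
    # only for small d.
    best_subset, best_score = None, float("inf")
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            score = evaluate(subset)     # e.g. AIC of the trained ANN
            if score < best_score:
                best_subset, best_score = subset, score
    return best_subset                   # optimal with respect to the criterion
```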

3.2.2 Forward selection

Forward selection is a linear incremental search strategy that selects individual candidate variables one at a time. In the case of wrappers, the method starts by training d single-variable ANN models and selecting the input variable that maximises the model performance-based optimality criterion. Selection then continues by iteratively training d − 1 bivariate ANN models, in each case adding a remaining candidate to the previously selected input variable. Selection is terminated when the addition of another input variable fails to improve the performance of the ANN model. In filter designs, the single most relevant candidate variable is selected first, and forward selection then proceeds by iteratively identifying the next most relevant candidate and evaluating whether that variable should be selected, until the optimality criterion is satisfied.
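
A minimal sketch of the wrapper variant described above, with `evaluate` again an assumed stand-in for training an ANN on the given input set and returning its error on test data:

```python
def forward_selection(candidates, evaluate):
    selected, best_score = [], float("inf")
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining candidate to the current set
        trials = {v: evaluate(selected + [v]) for v in remaining}
        best_var = min(trials, key=trials.get)
        if trials[best_var] >= best_score:   # no improvement: terminate
            break
        selected.append(best_var)
        remaining.remove(best_var)
        best_score = trials[best_var]
    return selected
```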

The approach is computationally efficient overall, and tends to result in the selection of relatively small input variable sets, since it considers the smallest possible models, and trials increasingly larger input variable sets until the optimal set is reached. However, because forward selection does not consider all of the possible combinations, and only searches a
