U2127
THE UNIVERSITY OF BIRMINGHAM
Degree of B.Sc. with Honours
Artificial Intelligence and Computer Science. Second Examination
Computer Science/Software Engineering. Second Examination
Joint Honours Degree of B.Sc. with Honours
Mathematics and Artifici
al Intelligence. Second Examination
Psychology and Artificial Intelligence. Second Examination
06 02360
(SEM 2A2)
Introduction to Neural Networks
August 2000 (RESIT) 2 hours
[Answer ALL Questions]
U2127
1.
Answer the following questions
:
(a)
Among various non

linear activation functions, why do we prefer to use the
logistic function in back

propagation networks.
[10%]
ANSWER: The primary reason is the ease of calculating derivatives and efficiency. BP
is in essence a gradient desce
nt optimisation algorithm. It requires computation of
derivatives of the error function with respect to weights. As BP training involved
thousands of epochs through a training set and each presentation of a training
pattern will lead to calculation of deri
vatives unless the error is zero, it is of great
practical importance to be able to compute derivatives of an activation function
efficiently. Because derivatives of a logistic function can easily be computed from
its function values, we prefer to use it i
n neural networks.
(b)
One of the useful “tricks” in training a back

propagation network with a sigmoidal
activation function is to initialise all weights at random in a small range, e.g., in [

0.5, 0.5] or [

1.0, 1.0]. Please explain why this is benefic
ial to network learning.
[10%]
ANSWER: It is important to initialise weights in a small range because large weights
tend to make a neuron’s output saturate and thus unable to learn. When inputs are
large negatives or positives, the output of a sigmoida
l function tends to have a
constant output. In other words, the function output is very insensitive to different
inputs. When inputs to a sigmoidal function is small (in terms of absolute values),
its output is most responsive to input changes. In this cas
e, a neural network is
able to adapt its weights and learn faster.
(c)
Most data sets obtained from real

world problems contain noise. Why isn’t a
neural network that achieves zero training error the best neural network? Describe
one method that you woul
d use to find a better neural network.
[10%]
ANSWER: A neural network that achieves zero training error would have learned all
the noise in the data set, i.e., over

fitting to the training data. Such over

fitting
leads to poor generalisation of the learn
ed network, which would perform poorly
on unseen data. The ultimate goal of learning is to achieve best generalisation not
the minimum training error. One method that can be used to improve
generalisation is to stop training early using a validation set. W
e first divide all the
training data into a training set and a validation set. The training set will be used
in training as was done before. The validation set is used to decide when to stop
training. In the initial stages of training, neural network’s err
ors on both training
and validation sets are expected to decrease. However, at certain point, the error
on the validation set will start increasing again. This is when the training should
be stopped. In essence, we use neural network’s performance on the v
alidation set
to estimate its generalisation ability.
(d)
Can a back

propagation training algorithm always find a global minimum of the
error function for a feed

forward neural network? Why or why not?
[5%]
U2127
ANSWER: Not always, because (1) the BP algor
ithm may get stuck in a local minimum
due to its gradient descent nature; (2) the NN may be inappropriately designed
(i.e., too small) to perform the task; and/or (3) the parameters of the BP algorithm
may be inappropriately set so that the algorithm never
converges.
2.
The following figure shows a back

propagation network with all weights given. There
are no biases (thresholds) in this network. Given a training pattern with input (0.2, 0.3,
0.4). Assume that the logistic function with gain parameter 1 is us
ed for all hidden and
output nodes.
(a)
What are the outputs from hidden units H1 and H2?
[5%]
(b)
Assume that the outputs from O1 and O2 are the same as target outputs, but the
output from O3 is different from the target output. List all the weights
which will
be
unchanged
after the presentation of this training pattern.
[10%]
(Note: You only need to give correct expressions for questions (a) and (b). Calculation
of exponential functions is
not
required.)
ANSWER:
(a)
=
, where u=0.2*1.1+0.3*1.3+0.4*1.5
=
, where u=0.2*1.2+0.3*1.4+0.4*1.6
(b)
,
,
,
,
,
,
. They do not change because there is no error
signal back

propagated.
U2127
3.
What are similarities and differences between a back

propagation network with
Gaussian
activation functions and a radial basis function network?
[10%]
ANSWER:
First, both of them are feed

forward NNs. Second, training algorithms are very different.
A BP network uses hyperplanes to form decision boundaries while a RBF network uses
hyperell
ipses. BP training relies on error back

propagation while RBF training does not.
BP training applies to multilayers while RBF training applies only to the hidden to
output layer. The input to hidden layer in a RBF network is not trained. It is determined
b
y the centre of Gaussian functions. BP usually has a nonlinear output layer while RBF
has a linear one. In practice, BP training can be very slow while RBF training is
relatively fast.
4.
The Kohonen learning algorithm is an unsupervised learning algorit
hm.
(a) What does the algorithm learn in unsupervised learning as there are no target outputs
that can be used to calculate an error?
[7%]
(b) Why do we need to define a neighbourhood for each output node in Kohonen
networks? How does it help learning?
[8%]
ANSWER:
(a)
In unsupervised learning, the algorithm learns the relationship among data according
to inputs only. A similarity measurement, e.g., the Euclidean distance, is often
defined and used in learning. The algorithm learns to cluster similar data
together
into groups/clusters. Learning here is not based on minimisation of an error between
the target and actual outputs, but based on maximisation of similarity of data points
in a cluster and maximisation of dissimilarity of data points between cluste
rs.
Unsupervised learning algorithms are closely related to clustering algorithms in
statistical pattern recognition.
(b)
Winner

takes

all is simple and works well for some problems. However, when a
winner is only slightly better than its closest competitors,
winner

takes

all does not
work well. The introduction of neighbourhoods can remedy this problem. In
addition, the introduction of neighbourhoods encourages formation of feature maps
in the output layer of Kohonen networks, i.e., similar units are close to
each other
spatially. The introduction of neighbourhoods also makes learning faster although
the neighbourhood size decreases as learning progresses.
5.
The post code recognition problem can be described as follows.
(a) Each post code consists of six o
r seven hand

written letters/digits. The letters are 26
upper

case English letters, i.e., 'A' .. 'Z'. The digits are '0' .. '9'.
(b) Each letter/digit is represented by a 32x32 binary image.
U2127
(c) The total available data consist of 10000 images. They are
in the format of
0111011 ... 0100000 B
0100000 ... 1111110 H
1100111 ... 0011010 7
... ...
where the first 1024 bits represent 32x32 pixel values and the last column indicates the
correct class.
You are required to design a feed

forward artificial neura
l network (or several networks)
to recognise letters/digits. Your answer should include
(1) Input representation, output representation, and network (or networks) architecture
(including the number of hidden layers, hidden nodes and connectivity pattern).
(2) Any pre

processing of the input data and/or post

processing of the output data that
should be carried out.
(3) The training algorithm that should be used.
You must justify all your design choices.
[25%]
ANSWER:
There is no fixed answer to this q
uestion. However, the answer (including the design
and its justification) should include the following for marks to be awarded. (The answer
may be illustrated by a number of graphs which show the network structure and
connectivity.)
The best design for th
is problem is to classify letters from digits first using one neural
network (NN1), and then classify different letters and digits using separate neural
networks (NNL and NND), respectively.
The raw 32x32 images should not be used as inputs to neural netw
orks (NNs) because
32*32=1024 is too large in comparison with the number of training patterns available
(10000). Pre

processing or weight

sharing must be used to reduce the number of inputs
to the NN. For example, 32x32 raw images can be reduced to 8x8 inp
uts by averaging
pixel values in 4x4 windows.
For NN1, there are 8x8 inputs and one output. The network must have a hidden layer
since the problem is nonlinear. The number of hidden nodes is determined empirically.
For this design, the number of hidden no
des is set as 4. The network is strictly

layered
and fully

connected.
Pre

processing is necessary for NN1: first, original 32x32 binary data will be
transformed into 64 real values. The target output values should not be 0 (for letters) and
U2127
1 (for digits)
. They should be greater than 0 and smaller than 1, e.g., 0.3 for letters and
0.7 for digits due to the use of logistic functions.
The output of NN1 is used to determine whether NNL or NND should be used for the
next step.
NNL also has 8x8 inputs. It has
26 outputs, one for each letter. NNL must have a hidden
layer since the problem is non

linear. The number of hidden nodes should be greater
than the number of hidden nodes used in NN1 because NNL performs a more difficult
task. For this design, the number
of hidden nodes is set (empirically) as 5. The target
output of NNL should not be a vector of 26 binary values. It should be a vector of 26
real values, one of which is close to 1 (e.g., 0.7) and the rest close to 0 (e.g., 0.3). The
network is strictly la
yered and fully

connected.
As 64*5*26=8320 is rather close to 10000, NNL has a high risk of over

fitting the
training data if nothing is done to further reduce the network size or having more
training data. For this design, we will add small Gaussian nois
e to the existing training
data to generate more training data. Each existing pattern will generate three new
patterns, making the total number of training patterns 40000. 37000 will be used in
training while 3000 will be used as a validation set for early
stopping in order to
improve generalisation.
NND is similar to NNL, but with 10 outputs.
Any algorithm which works with continuous variables can be used as the training
algorithm for the above three NNs, but not the Perceptron learning algorithm.
Σχόλια 0
Συνδεθείτε για να κοινοποιήσετε σχόλιο