# MS Word document - Computer Science - University of Birmingham


U2127

THE UNIVERSITY OF BIRMINGHAM

Degree of B.Sc. with Honours

Artificial Intelligence and Computer Science. Second Examination

Computer Science/Software Engineering. Second Examination

Joint Honours Degree of B.Sc. with Honours

Mathematics and Artificial Intelligence. Second Examination

Psychology and Artificial Intelligence. Second Examination

06 02360

(SEM 2A2)

Introduction to Neural Networks

August 2000 (RESIT) 2 hours


1.

(a) Among various non-linear activation functions, why do we prefer to use the logistic function in back-propagation networks?

[10%]

ANSWER: The primary reason is the ease and efficiency of calculating derivatives. BP is in essence a gradient descent optimisation algorithm. It requires computation of derivatives of the error function with respect to weights. As BP training involves thousands of epochs through a training set, and each presentation of a training pattern leads to calculation of derivatives unless the error is zero, it is of great practical importance to be able to compute derivatives of an activation function efficiently. Because the derivatives of a logistic function can easily be computed from its function values, we prefer to use it in neural networks.
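For illustration, a minimal Python sketch (not part of the original model answer) of the property being described: once the forward pass has produced the function value y = f(u), the derivative needs no further exponential.

```python
import math

def logistic(u):
    """Logistic (sigmoid) activation: f(u) = 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def logistic_derivative_from_output(y):
    """f'(u) = f(u) * (1 - f(u)): the derivative is obtained from the
    already-computed function value, with no extra exponential."""
    return y * (1.0 - y)

# During BP, the forward pass already yields y = f(u) at every node,
# so the backward pass obtains f'(u) almost for free.
y = logistic(1.21)
slope = logistic_derivative_from_output(y)
```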

(b) One of the useful “tricks” in training a back-propagation network with a sigmoidal activation function is to initialise all weights at random in a small range, e.g., in [-0.5, 0.5] or [-1.0, 1.0]. Please explain why this is beneficial to network learning.

[10%]

ANSWER: It is important to initialise weights in a small range because large weights tend to make a neuron’s output saturate and thus unable to learn. When inputs are large negative or positive values, the output of a sigmoidal function is nearly constant. In other words, the function output is very insensitive to different inputs. When the inputs to a sigmoidal function are small (in terms of absolute value), its output is most responsive to input changes. In this case, a neural network is able to adapt its weights and learn faster.
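The saturation effect can be seen numerically in a short Python sketch (illustrative only): the gradient factor f'(u) = f(u)(1 - f(u)) that scales every weight update is largest when the net input u is near zero and collapses once |u| is large.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def gradient_factor(u):
    # The BP weight update is proportional to f'(u) = f(u) * (1 - f(u)).
    y = logistic(u)
    return y * (1.0 - y)

# Small net input (small random weights): derivative near its maximum of 0.25.
# Large net input (large weights): derivative collapses, so learning stalls.
print(gradient_factor(0.1))   # approximately 0.2494
print(gradient_factor(10.0))  # approximately 4.5e-05
```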

(c) Most data sets obtained from real-world problems contain noise. Why isn’t a neural network that achieves zero training error the best neural network? Describe one method that you would use to find a better neural network.

[10%]

ANSWER: A neural network that achieves zero training error would have learned all the noise in the data set, i.e., over-fitting to the training data. Such over-fitting leads to poor generalisation of the learned network, which would perform poorly on unseen data. The ultimate goal of learning is to achieve the best generalisation, not the minimum training error. One method that can be used to improve generalisation is to stop training early using a validation set. We first divide all the training data into a training set and a validation set. The training set will be used in training as was done before. The validation set is used to decide when to stop training. In the initial stages of training, the neural network’s errors on both the training and validation sets are expected to decrease. However, at a certain point, the error on the validation set will start increasing again. This is when the training should be stopped. In essence, we use the neural network’s performance on the validation set to estimate its generalisation ability.
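The stopping rule described above can be sketched in Python (illustrative only; the `patience` parameter, i.e., how many consecutive non-improving epochs to tolerate before stopping, is an assumption not fixed by the answer).

```python
def early_stopping_epoch(val_errors, patience=10):
    """Given the validation error recorded after each epoch, return the
    index of the epoch whose weights should be kept: the epoch with the
    lowest validation error before `patience` consecutive non-improving
    epochs are observed."""
    best_err, best_epoch, bad = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, bad = err, epoch, 0
        else:
            bad += 1
            if bad >= patience:   # validation error keeps rising: stop
                break
    return best_epoch

# Validation error falls, then rises again (over-fitting sets in after epoch 3).
errs = [0.9, 0.5, 0.3, 0.2, 0.25, 0.3, 0.4, 0.5]
print(early_stopping_epoch(errs, patience=3))  # -> 3
```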

(d) Can a back-propagation training algorithm always find a global minimum of the error function for a feed-forward neural network? Why or why not?

[5%]


ANSWER: Not always, because (1) the BP algorithm may get stuck in a local minimum due to its gradient descent nature; (2) the NN may be inappropriately designed (i.e., too small) to perform the task; and/or (3) the parameters of the BP algorithm may be inappropriately set so that the algorithm never converges.

2.

The following figure shows a back-propagation network with all weights given. There are no biases (thresholds) in this network. Consider a training pattern with input (0.2, 0.3, 0.4). Assume that the logistic function with gain parameter 1 is used for all hidden and output nodes.

(a) What are the outputs from hidden units H1 and H2?

[5%]

(b) Assume that the outputs from O1 and O2 are the same as the target outputs, but the output from O3 is different from the target output. List all the weights which will be unchanged after the presentation of this training pattern.

[10%]

(Note: You only need to give correct expressions for questions (a) and (b). Calculation of exponential functions is not required.)

(a)

Output of H1 = 1/(1 + e^(-u)), where u = 0.2*1.1 + 0.3*1.3 + 0.4*1.5

Output of H2 = 1/(1 + e^(-u)), where u = 0.2*1.2 + 0.3*1.4 + 0.4*1.6

(b)

All weights connecting the hidden units H1 and H2 to the output units O1 and O2. They do not change because there is no error signal back-propagated from O1 and O2.


3.

What are the similarities and differences between a back-propagation network with Gaussian activation functions and a radial basis function network?

[10%]

First, both of them are feed-forward NNs. Second, they differ in several ways. A BP network uses hyperplanes to form decision boundaries while an RBF network uses hyperellipses. BP training relies on error back-propagation while RBF training does not. BP training applies to multiple layers while RBF training applies only to the hidden-to-output layer; the input-to-hidden layer in an RBF network is not trained, but is determined by the centres of the Gaussian functions. BP usually has a nonlinear output layer while RBF has a linear one. In practice, BP training can be very slow while RBF training is relatively fast.
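The contrast between the two hidden-unit types can be sketched in Python (illustrative only; the Gaussian width parameter and the example weights are assumptions).

```python
import math

def sigmoid_unit(x, w):
    """BP hidden unit: a weighted sum through a squashing function,
    so its decision boundary is a hyperplane."""
    return 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))

def rbf_unit(x, centre, width):
    """RBF hidden unit: a Gaussian of the distance to a centre, so its
    response contours are hyperelliptical; the centre is not trained by BP."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-d2 / (2.0 * width ** 2))
```

Note that the sigmoid unit's response depends on which side of its hyperplane the input lies, while the RBF unit's response depends only on how far the input is from its centre.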

4.

The Kohonen learning algorithm is an unsupervised learning algorithm.

(a) What does the algorithm learn in unsupervised learning, as there are no target outputs that can be used to calculate an error?

[7%]

(b) Why do we need to define a neighbourhood for each output node in Kohonen
networks? How does it help learning?

[8%]

(a)

In unsupervised learning, the algorithm learns relationships among the data according to the inputs only. A similarity measure, e.g., the Euclidean distance, is often defined and used in learning. The algorithm learns to cluster similar data together into groups/clusters. Learning here is not based on minimisation of an error between the target and actual outputs, but on maximisation of the similarity of data points within a cluster and maximisation of the dissimilarity of data points between clusters. Unsupervised learning algorithms are closely related to clustering algorithms in statistical pattern recognition.
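As an illustration, the core step of such learning, selecting the output node whose weight vector is most similar to the input under the Euclidean distance, can be sketched as follows (illustrative Python; the example vectors are assumptions).

```python
import math

def euclidean(x, w):
    """Euclidean distance between an input vector and a weight vector."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))

def winner(x, weight_vectors):
    """Index of the output node whose weight vector is closest to x."""
    dists = [euclidean(x, w) for w in weight_vectors]
    return dists.index(min(dists))

# Two output nodes; the input is much closer to the second node's weights.
nodes = [[0.0, 0.0], [1.0, 1.0]]
print(winner([0.9, 1.1], nodes))  # -> 1
```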

(b)

Winner-takes-all is simple and works well for some problems. However, when a winner is only slightly better than its closest competitors, winner-takes-all does not work well. The introduction of neighbourhoods can remedy this problem. In addition, the introduction of neighbourhoods encourages the formation of feature maps in the output layer of Kohonen networks, i.e., similar units are close to each other spatially. The introduction of neighbourhoods also makes learning faster, although the neighbourhood size decreases as learning progresses.
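A minimal sketch of a neighbourhood update on a one-dimensional output grid (illustrative Python; the learning rate and radius values are assumptions, and in practice both shrink as learning progresses).

```python
def kohonen_update(weights, x, win, lr, radius):
    """Move the winner and its neighbours (nodes within `radius` grid
    positions of the winner on a 1-D output grid) towards the input x;
    nodes outside the neighbourhood are left unchanged."""
    for j, w in enumerate(weights):
        if abs(j - win) <= radius:
            weights[j] = [wj + lr * (xj - wj) for wj, xj in zip(w, x)]
    return weights

# Node 0 wins; with radius 1, nodes 0 and 1 move towards x, node 2 does not.
nodes = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
print(kohonen_update(nodes, [1.0, 1.0], win=0, lr=0.5, radius=1))
```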

5.

The post code recognition problem can be described as follows.

(a) Each post code consists of six or seven hand-written letters/digits. The letters are 26 upper-case English letters, i.e., 'A' .. 'Z'. The digits are '0' .. '9'.

(b) Each letter/digit is represented by a 32x32 binary image.


(c) The total available data consist of 10000 images. They are
in the format of

0111011 ... 0100000 B

0100000 ... 1111110 H

1100111 ... 0011010 7

... ...

where the first 1024 bits represent 32x32 pixel values and the last column indicates the
correct class.

You are required to design a feed-forward artificial neural network (or several networks) for this problem. Your design should specify:

(1) Input representation, output representation, and network (or networks) architecture (including the number of hidden layers, hidden nodes and connectivity pattern).

(2) Any pre-processing of the input data and/or post-processing of the output data that should be carried out.

(3) The training algorithm that should be used.

You must justify all your design choices.

[25%]

There is no fixed answer to this question. However, the answer (including the design and its justification) should include the following for marks to be awarded. (The answer may be illustrated by a number of graphs which show the network structure and connectivity.)

The best design for this problem is to separate letters from digits first using one neural network (NN1), and then classify the different letters and digits using separate neural networks (NNL and NND), respectively.

The raw 32x32 images should not be used as inputs to neural networks (NNs) because 32*32 = 1024 is too large in comparison with the number of training patterns available (10000). Pre-processing or weight-sharing must be used to reduce the number of inputs to the NN. For example, 32x32 raw images can be reduced to 8x8 inputs by averaging pixel values in 4x4 windows.
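The suggested 4x4 averaging can be sketched in Python (illustrative only): each non-overlapping 4x4 window of binary pixels is replaced by its mean, turning the 32x32 binary image into an 8x8 grid of real values.

```python
def downsample(image, block=4):
    """Reduce an NxN binary image (list of lists of 0/1) to an
    (N/block)x(N/block) grid of real values by averaging pixel values
    in non-overlapping block x block windows."""
    n = len(image)
    small = []
    for r in range(0, n, block):
        row = []
        for c in range(0, n, block):
            total = sum(image[r + i][c + j]
                        for i in range(block) for j in range(block))
            row.append(total / (block * block))
        small.append(row)
    return small

img = [[1] * 32 for _ in range(32)]   # an all-ones 32x32 image
small = downsample(img)               # an 8x8 grid of 1.0 values
```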

For NN1, there are 8x8 inputs and one output. The network must have a hidden layer since the problem is nonlinear. The number of hidden nodes is determined empirically. For this design, the number of hidden nodes is set as 4. The network is strictly-layered and fully-connected.

Pre-processing is necessary for NN1: first, the original 32x32 binary data will be transformed into 64 real values. The target output values should not be 0 (for letters) and 1 (for digits). They should be greater than 0 and smaller than 1, e.g., 0.3 for letters and 0.7 for digits, due to the use of logistic functions.

The output of NN1 is used to determine whether NNL or NND should be used for the
next step.

NNL also has 8x8 inputs. It has 26 outputs, one for each letter. NNL must have a hidden layer since the problem is non-linear. The number of hidden nodes should be greater than the number of hidden nodes used in NN1 because NNL performs a more difficult task. For this design, the number of hidden nodes is set (empirically) as 5. The target output of NNL should not be a vector of 26 binary values. It should be a vector of 26 real values, one of which is close to 1 (e.g., 0.7) and the rest close to 0 (e.g., 0.3). The network is strictly layered and fully-connected.

As 64*5*26 = 8320 is rather close to 10000, NNL has a high risk of over-fitting the training data if nothing is done to further reduce the network size or obtain more training data. For this design, we will add small Gaussian noise to the existing training data to generate more training data. Each existing pattern will generate three new patterns, making the total number of training patterns 40000. 37000 will be used in training while 3000 will be used as a validation set for early stopping in order to improve generalisation.
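The noise-injection step can be sketched in Python (illustrative only; the noise standard deviation of 0.05 and the fixed seed are assumptions, as the answer does not fix them): each pattern yields three noisy copies with the class label unchanged.

```python
import random

def augment(patterns, copies=3, sigma=0.05, seed=0):
    """Add `copies` noisy versions of every (inputs, label) pattern:
    each input value gets small Gaussian noise with standard deviation
    `sigma`; the class label is kept unchanged."""
    rng = random.Random(seed)
    out = list(patterns)
    for inputs, label in patterns:
        for _ in range(copies):
            noisy = [v + rng.gauss(0.0, sigma) for v in inputs]
            out.append((noisy, label))
    return out

# One original pattern becomes 1 + 3 = 4 patterns, as in the design
# (10000 originals -> 40000 training patterns in total).
data = [([0.5, 0.25], "B")]
print(len(augment(data)))  # -> 4
```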

NND is similar to NNL, but with 10 outputs.

Any algorithm which works with continuous variables can be used as the training
algorithm for the above three NNs, but not the Perceptron learning algorithm.