Irene Moreno González 100039158
José Manuel Camacho Camacho 100038938
DSP08-G95
DIGITAL SIGNAL PROCESSING
Final Project 2007/08
IMG/JMCC DSP08-G95
Table of Contents
Visualisation of Data
Linear Classifiers
Single Perceptron with ADALINE Learning Rule
Sequential Gradient Rule & Soft Activation
Non-Linear, Non-Parametric Classification
Reducing Costs via Clustering
VISUALISATION OF DATA
To get a general view of the problem, the cloud of points is plotted first, with the colour of each point indicating the class it belongs to.
LINEAR CLASSIFIERS
Single Perceptron with ADALINE Learning Rule
This machine has the same structure as the hard SLP (single layer perceptron). The
structure that will be used is as follows:
[Figure: perceptron structure with inputs 1, x1, x2 and weights w0, w1, w2]
Given a group of already classified samples, the perceptron can be trained with the Widrow-Hoff algorithm, applied to the error at the input of the decider. In each iteration the weights are recalculated according to this rule, using the training samples as input. The step size controls the speed of convergence of the algorithm.
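As a minimal sketch of the Widrow-Hoff update described above, assuming its usual form (each weight moves by step × error × input, with the error measured on the linear output before the hard decision). The function name `adaline_update`, the parameter `step` and the label convention d in {+1, -1} are illustrative assumptions, not taken from the original implementation.

```python
def adaline_update(w, x, d, step):
    """One Widrow-Hoff update; w and x are lists laid out as [bias term, x1, x2]."""
    z = sum(wi * xi for wi, xi in zip(w, x))   # linear output, i.e. the input of the decider
    err = d - z                                 # error measured before the hard decision
    return [wi + step * err * xi for wi, xi in zip(w, x)]

# Usage: x = [1, x1, x2], where the constant 1 feeds the bias weight w0.
w = adaline_update([0.0, 0.0, 0.0], [1.0, 2.0, -1.0], d=1.0, step=0.1)
```

Because the error is taken on the linear output, the update stays non-zero even for samples that are already on the correct side of the border.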
There are some design decisions that were taken during the implementation:

Weights are initialized randomly; for this reason, even if the training samples and the number of iterations do not change, different results are obtained each time the algorithm is run.

A pocket algorithm with ratchet is used to obtain the optimum weights. Although the weights are calculated with the training samples as inputs, it is the validation set that is used to select the optimum weights to keep. In this way, better generalization is reached, since the perceptron receives more information about the distribution of the data. The drawback is a higher computational complexity, since for each recalculation of the weights the error over the whole validation set has to be obtained.

In each iteration both the training and validation sets are randomly reordered, but the starting weights are no longer random, since the ones from the previous iteration are used. A total of 100 iterations is chosen, although the algorithm converges much earlier, as will be shown later.

Although it is important to know how many samples are correctly classified, the MSE rather than the percentage error is considered during the training phase, since the goal is not only to optimize the success rate but also to place the border at an intermediate position between the clouds of patterns.

The value of the step was set by trial and error: different steps were tried until the best results were reached. For the ADALINE algorithm, a final step of 0.0001 was chosen.
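The design decisions above can be put together roughly as follows. This is only a sketch under stated assumptions: the names `mse` and `train_adaline_pocket`, the initial weight range [-1, 1] and the data layout (lists of (x, d) pairs with x = [1, x1, x2]) are hypothetical, and the validation set is not reshuffled here because its ordering does not change the MSE.

```python
import random

def mse(w, samples):
    """Mean squared error between the targets d and the linear outputs z = w.x."""
    return sum((d - sum(wi * xi for wi, xi in zip(w, x))) ** 2
               for x, d in samples) / len(samples)

def train_adaline_pocket(train, val, step=0.0001, epochs=100, seed=0):
    """train/val: lists of (x, d) with x = [1, x1, x2] and d in {+1, -1}."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in train[0][0]]     # random initial weights
    pocket_w, pocket_mse = list(w), mse(w, val)
    train = list(train)
    for _ in range(epochs):
        rng.shuffle(train)                             # reorder the samples every iteration
        for x, d in train:
            z = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + step * (d - z) * xi for wi, xi in zip(w, x)]
            current = mse(w, val)                      # validation MSE after every update
            if current < pocket_mse:                   # pocket: store weights only if they improve
                pocket_w, pocket_mse = list(w), current
    return pocket_w, pocket_mse
```

Evaluating the validation MSE after every weight update is what makes this scheme expensive, as noted in the list above.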
Weights: w0 = 0.196, w1 = 0.519, w2 = 0.074
Error rate (TRN) = 3.23 %
Error rate (VAL) = 3.31 %
Error rate (TEST) = 3.15 %
Step: 0.0001
To obtain the error rate for each set, every point is multiplied by the weights and the decision is based on the sign of the result. The decision border and the error rates that were obtained are shown above.
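The error-rate computation just described can be sketched as follows. The function name `error_rate`, the (x, d) data layout and the handling of a result that is exactly zero are assumptions made for illustration.

```python
def error_rate(w, samples):
    """Fraction of samples whose sign(w.x) differs from the label d in {+1, -1}."""
    errors = 0
    for x, d in samples:
        z = sum(wi * xi for wi, xi in zip(w, x))
        decision = 1 if z >= 0 else -1     # a result of exactly 0 is assigned to +1 (assumption)
        errors += decision != d
    return errors / len(samples)
```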
The next plot shows the evolution of the error rate of the validation set through the iterations of the algorithm. The fact that the test error rate is lower than the training and validation ones is not relevant: linear classification is performed, so depending on the position of the clouds of data in each set, the problem can be more or less separable. Nevertheless, as pointed out before, these results change from run to run, since the initialization follows a random procedure.
As can be seen from the figures, the algorithm reaches its minimum before the 15th iteration, so the weights stored in the pocket are no longer renewed from that moment (right-hand figure). The MSE of the validation set starts increasing after that minimum (left-hand figure).
With this algorithm, the number of errors is not minimized, but convergence to a local minimum is ensured.
Sequential Gradient Rule & Soft Activation
The advantage of the sequential gradient rule is that the hard decision is replaced by a differentiable approximation. In this case, a soft hyperbolic tangent activation function was used. In this way, a Least Mean Squares (LMS) algorithm can be implemented for the recalculation of the weights.
Parameters and design decisions remain the same as in the ADALINE case, but better performance and better results are obtained with this learning rule, as shown in the following graphs and percentages. A step of 0.05 was used.
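A minimal sketch of one sequential gradient update with a tanh activation, assuming the squared error between the target d and the soft output tanh(z) is the quantity being minimized. The function name `tanh_gradient_update` and the label convention d in {+1, -1} are illustrative assumptions.

```python
import math

def tanh_gradient_update(w, x, d, step):
    """One gradient step on (d - tanh(w.x))^2 / 2 for a single sample."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    o = math.tanh(z)                        # soft activation replaces the hard decision
    grad_scale = (d - o) * (1.0 - o * o)    # error times the derivative of tanh at z
    return [wi + step * grad_scale * xi for wi, xi in zip(w, x)]
```

The factor (1 - o²) shrinks the update for samples whose output is already saturated near +1 or -1, which is one plausible reason for the smoother weight evolution reported for this rule.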
As can be seen in the figures, convergence is fast and optimal for this rule. The algorithm reaches its minimum around the 20th iteration, but the recalculation of the weights is more stable than in the ADALINE case.
Weights: w0 = 3.262, w1 = 8.834, w2 = -1.292
Error rate (TRN) = 3.23 %
Error rate (VAL) = 3.31 %
Error rate (TEST) = 3.15 %
Step: 0.05
NON-LINEAR, NON-PARAMETRIC CLASSIFICATION
A k-NN classifier was implemented as required in the guide. The following sections describe the classifier's properties in terms of error rates, expressiveness and computational cost. A comparison with the linear classifier results is provided as well.
a. Plot the training and validation classification error rates of this classifier as a function of k, for k = 1, 3, 5, 7, …, 25.
At first glance, it can be clearly seen that the error rates are slightly lower for training than for validation, for any value of k. This is because the set of k neighbours used when classifying a training sample always contains at least one training point belonging to the true class, namely the sample itself. This does not hold for validation. Moreover, if the training points are not representative enough, this method may generalise badly and the training error rates may not be reliable.
The validation set was used to estimate the optimum value of k, which turned out to be k = 9. This value is employed to compute the results for the test set later on.
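The classifier and the selection of k on the validation set can be sketched as below. This is an illustration under assumptions rather than the implementation used for the report: the function names, the squared Euclidean distance and the data layout (lists of ((x1, x2), label) pairs) are hypothetical.

```python
from collections import Counter

def knn_predict(train, point, k):
    """train: list of ((x1, x2), label); returns the majority label of the k nearest points."""
    by_distance = sorted(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def knn_error_rate(train, samples, k):
    """Fraction of samples misclassified by the k-NN rule built on train."""
    wrong = sum(knn_predict(train, p, k) != label for p, label in samples)
    return wrong / len(samples)

def best_k(train, val, ks=range(1, 26, 2)):
    """Odd k from 1 to 25; ties are resolved in favour of the smallest k."""
    return min(ks, key=lambda k: knn_error_rate(train, val, k))
```

Calling `knn_error_rate(train, train, 1)` on a training set without duplicated points gives zero, which illustrates why training error is optimistic: every sample is its own nearest neighbour.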
b. Depict the decision borders for the classifiers with k = 1, 5, 25.
Fig. 3, Fig. 2 and Fig. 4 show the classification border for three different values of k. In terms of expressiveness, this method has better properties than linear classification: error rates are reduced thanks to a better fit of the classification borders to the data spread over the plane, especially the points on the right, which were always missed by the linear border.
Fig. 1 Training and Validation Error Rates (as percentage) for each value of k, for k = 1, 3, …, 25
Given a training set from which to build a classification border, the effect of the choice of k is discussed here. The 1-NN method has a null training error rate, so its border fits the training set flawlessly and, as can be seen, creates very expressive shapes. Its validation error rate, however, is not the best; hence it can be concluded that the 1-NN classifier over-fits the training set, giving rather bad generalization in validation (and presumably in testing as well).
The 25-NN classifier has exactly the opposite behaviour. For each classification it takes into account so many training points that wrong information is introduced into the classifier. In fact, this classifier fits neither the training set nor the validation set well.
In between is the 5-NN classifier. Its training error rate is not as good, but it establishes a less intricate classification border than 1-NN. In principle, that can lead to better generalization and less sensitivity to outliers.
Furthermore, it seems clear that the choice of k depends on how close the training points of different classes are to each other.
To obtain the results for this part of the assignment, several simulations with the implemented k-NN classifier were carried out. Time consumption was much higher than in linear classification. If a fixed number of operations is needed to calculate the distance between one sample and a training point, and k operations are needed to select the k closest points and count their labels, then the cost of classifying the n samples of the training set against the n training points grows quadratically with n. The same computational cost is required for validation and test if those sets have a similar size to the training set.
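The cost argument above can be written out as follows. The symbols c_d (operations per distance computation) and n_T (number of training points) are assumed names introduced here, since the original symbols are not available.

```latex
C_{k\text{-NN}}(n) \;=\; n\,\bigl(n_T\,c_d + k\bigr)
\;\approx\; c_d\,n^{2} + k\,n
\qquad \text{when } n_T \approx n
```

The quadratic term dominates for large sets, which matches the long simulation times observed.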
Fig. 3 Classification Border for the 1-NN Classifier
Fig. 2 Classification Border for the 5-NN Classifier
Fig. 4 Classification Border for the 25-NN Classifier
c. Obtain the optimum value for k according to the validation set, and give the test classification error rate that would be obtained in that case.
As mentioned, the value of k that minimizes the validation error rate is k = 9. The error rate obtained with 9-NN classification is lower than the one for the linear classifier, in line with the improved expressiveness. The test error for this classifier is slightly higher than the training and validation errors, which may be due to the dependence of the classification on the training set.
REDUCING COSTS VIA CLUSTERING
Implement the classifiers corresponding to k_C = 1, 2, 3, …, 50, and compute their training and validation classification error rates.
The algorithm is run once for each possible number of centroids and the results are plotted. As can be seen from the figure, error rates have a reasonable value when more than 3 centroids per class are used. Due to the shape of the cloud of the given points, three or fewer centroids are not enough to represent the data. In addition, training error rates are always lower than validation ones, since the centroids seek their optimum position with respect to the training set.
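A sketch of a centroid-based classifier of this kind is given below, assuming a k-means style procedure run separately on each class, followed by a nearest-centroid decision. The function names, the fixed number of iterations and the initialization from randomly drawn training points are assumptions, and k_c must not exceed the number of points in any class.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a list of (x1, x2) tuples; an empty cluster keeps its old centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # random initialization from the data
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            groups[j].append(p)
        centroids = [
            (sum(q[0] for q in g) / len(g), sum(q[1] for q in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def fit_centroids(train, k_c, seed=0):
    """train: list of ((x1, x2), label); returns a list of (centroid, label) pairs."""
    model = []
    for label in sorted({lab for _, lab in train}):
        pts = [p for p, lab in train if lab == label]
        model += [(c, label) for c in kmeans(pts, k_c, seed=seed)]
    return model

def classify(model, point):
    """Label of the closest centroid, i.e. a 1-NN rule over the centroids."""
    return min(model, key=lambda m: (m[0][0] - point[0]) ** 2
                                    + (m[0][1] - point[1]) ** 2)[1]
```

Classification only compares a point against the stored centroids, so its cost no longer depends on the size of the training set.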
Fig. 5 Training (orange) and Validation (blue) error rates, expressed as missed samples out of set size on the vertical axis, for k_C = 1, 2, …, 50
Select the value k_C* that minimizes the following objective function:
J(k_C) = Te + log10(k_C)
where Te is the validation error rate. Obtain the test classification error rate when using k_C*, and plot the classification border of this classifier.
Fig. 6 Objective Function J(k_C) for k_C = 1, 2, …, 50.
To obtain the optimum number of centroids, as a compromise between the error rate and the complexity of the problem, the objective function J is computed for each value of k_C. Its minimum is found at k_C* = 20, and the test error is computed for this case. Using this target function, large numbers of centroids are avoided. If Te alone were used as the figure of merit, k_C would tend to the number of training points, because the error rates would converge to the 1-NN error, and therefore the computational cost would not be reduced. Adding the log10(k_C) term to the error percentage instead ties the optimum k_C to the error rate while keeping the computational cost low.
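The selection of k_C* can be sketched as follows. The function name `select_k_c` and the input layout are hypothetical, and Te is assumed to be expressed as a percentage; the scale of Te matters, since it sets how strongly the log10(k_C) term weighs against the error.

```python
import math

def select_k_c(val_error_by_k):
    """val_error_by_k: dict {k_C: validation error rate Te}; returns the k_C minimizing J."""
    def objective(k_c):
        return val_error_by_k[k_c] + math.log10(k_c)   # J(k_C) = Te + log10(k_C)
    return min(sorted(val_error_by_k), key=objective)

# Made-up error rates, for illustration only (not results from this report).
example = {1: 30.0, 2: 10.0, 5: 4.0, 10: 3.5, 50: 3.4}
chosen = select_k_c(example)
```

In this made-up example the small gain from 10 to 50 centroids does not compensate for the increase of the logarithmic penalty, so the smaller value is preferred.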
The following picture depicts the test points and the classification border for k_C = 20.
The border of the classifier adapts quite well to the samples. In fact, this border is the set of points that are equidistant from centroids representing different classes.
An issue found during the development of the algorithm is that, with a relatively high k_C, after some iterations some of the centroids did not get any sample assigned to them, since all the nearby samples found a closer centroid to join.
Our implementation does not re-allocate these centroids. However, different adjustments can be made to the algorithm to handle this problem: those "dummy" centroids can be assigned a new random value or nudged, or the clusters with the greatest dispersion can be split into two groups, producing two new clusters.
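The first of these adjustments could look like the sketch below, which is not part of the implementation described in this report. The function name `reseed_empty` is hypothetical, and drawing the new value from the data points (rather than from anywhere in the plane) is an assumption.

```python
import random

def reseed_empty(centroids, groups, points, rng):
    """Replace every centroid whose cluster is empty with a randomly drawn data point."""
    return [c if g else rng.choice(points) for c, g in zip(centroids, groups)]

# Usage inside a clustering loop: groups[i] holds the samples assigned to centroids[i].
pts = [(0.0, 0.0), (1.0, 1.0)]
updated = reseed_empty([(5.0, 5.0), (0.5, 0.5)], [[], pts], pts, random.Random(0))
```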
Fig. 7
Repeat now the experiment 100 times, and give the following results:
Plot the average value and the standard deviation of the training and validation error rates as a function of k_C.
Since the centroids are initialized randomly, different error rates and performance are obtained from different runs of the algorithm. For this reason, a good way to get reliable results is to repeat the experiment a given number of times and extract the average value and the standard deviation of the error rates. In this case, 100 repetitions were performed, and the figures below show the results:
From the figures it can be concluded that:

There is a strong dependency of the error rates on the initialization when a small number of centroids (fewer than 10) is used. This is because the shape of the data spread over the plane makes those few centroids unrepresentative of the data set. A bad initialization may cause the error to grow even more (large confidence interval).

For higher values of k_C, both the mean and the standard deviation of the error rate decrease steadily towards the 1-NN behaviour. As discussed later, these results show that the relatively small improvement of the error rate for large k_C may not be enough to justify the increased cost of employing so many centroids.

When there is one centroid per class, the algorithm has the worst performance. Convergence is reached at the first iteration, since the centroid is simply the centre of mass of all the samples of its class. Hence, the error is the same in every repetition and its variance is null: the centroid is always the same, so the error rate is deterministic.
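The mean and standard deviation discussed above can be obtained from the repeated runs roughly as follows. The function name `summarize_runs`, the input layout and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics

def summarize_runs(error_rates_per_run):
    """error_rates_per_run: list of dicts {k_C: error rate}, one dict per repetition.

    Returns {k_C: (mean, standard deviation)} across the repetitions.
    """
    summary = {}
    for k_c in sorted(error_rates_per_run[0]):
        values = [run[k_c] for run in error_rates_per_run]
        summary[k_c] = (statistics.mean(values), statistics.pstdev(values))
    return summary
```

With one centroid per class every run returns the same error rate, so the second element of the pair is zero, in agreement with the last observation in the list.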
Fig. 8 Mean and Standard Deviation of Error Rates (not as percentages) for different values of k_C. Although the letter v is used in the right-hand graph, it denotes the standard deviation (sigma), not the variance.
Using a histogram, give an approximate representation of the distribution of the 100 values obtained for k_C*. Obtain the average value for k_C*, and for its corresponding test error rate.
To carry out this section, we have assumed that, for each initialization of the centroids, k_C* can be viewed as a different discrete random variable. By adding 100 independent random variables, the central limit theorem indicates that, regardless of the distribution of each variable, their sum (and hence their average) can be approximated by a Gaussian distribution.
The histogram shows the results obtained through the 100 repetitions: in each of them, one value of k_C turned out to be the optimum, and the histogram represents the frequency of those selected values.
A Gaussian p.d.f. is fitted to the displayed histogram to obtain the expectation of k_C*.
These results are consistent with the previous analysis. The average error is higher than the 1-NN error. Furthermore, the optimum choice for k_C is at the beginning of the flat region of Fig. 8, where, as mentioned, good properties in terms of errors can be achieved without increasing the complexity of the algorithm.
Regarding computational cost, as in the previous section, take a fixed cost for computing the Euclidean distance between two points, and a cost of k for finding the closest centroid among k centroids. The total cost of classifying an entire n-sample set is then proportional to n for a fixed number of centroids. Thus, once the centroids have been calculated (there may be no reduction of computational cost during training), this clustering approach reduces the computational cost of classification to a cost that is linear in n, instead of the quadratic cost of k-NN, providing more scalability while keeping good performance in terms of errors.
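Written out under the same assumptions, with c_d an assumed name for the cost of one Euclidean distance and k the total number of centroids, the classification cost of the clustering approach is:

```latex
C_{\text{clust}}(n) \;=\; n\,\bigl(k\,c_d + k\bigr) \;=\; n\,k\,(c_d + 1)
```

For a fixed k this grows linearly with n, whereas comparing every sample against every training point makes the k-NN cost grow with the square of n.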
As mentioned, k_C also affects the complexity.
The classification border of one classifier, chosen at random among the 100 repetitions that use the optimum k_C estimated earlier, is shown below.
CONCLUSIONS TO THIS REPORT
Linear Classifiers: they provide a simple tool for data classification, with reasonable performance which can be enough for certain applications. Their main drawbacks are the lack of expressiveness (linear borders) and the sensitivity to step size. Soft activation has shown better performance (almost the same error rates but fewer iterations).
The k-NN classifier gave the best performance for this classification problem. Its ability to depict complex classification borders provided the best classification pattern. A good choice of k is critical for good generalisation. The main shortcoming of this algorithm is its very high computational cost, because it performs a global sweep over the training set for each sample in order to choose the most suitable class for that sample.
Clustering classification provides, in this particular case, the chance to set up a trade-off between computational cost and accuracy. The number of centroids calculated during training for each class determines the accuracy of the results. A small number of centroids means that low computational cost is required but performance turns out to be poor, whereas if every training point becomes a centroid (extreme case) the same performance as the k-NN algorithm is achieved, at the price of the computational effort already mentioned.