Lab 7: Clustering
Pieter Spronck
p.spronck@cs.unimaas.nl
June 30, 2005
1. Introduction
Clustering is concerned with the task of discovering natural groupings (“clusters”) in a
collection of instances. The discovered clusters presumably reflect some mechanism at
work in the domain from which instances are drawn, a mechanism that causes some
instances to bear a stronger resemblance to one another than they do to the remaining
instances.
In this lab session you will experiment with three clustering algorithms. You will use Weka
3-4 for these experiments. You can get the data files you need from the “Materials” page on
the Data-mining summer course website. Place these in the Weka “data” directory.
2. The k-means algorithm
The k-means algorithm is a simple, straightforward algorithm to assign instances to clusters.
Each cluster is defined by a cluster centroid, and instances belong to the cluster for which
their Euclidean distance to the centroid is the smallest. For each cluster a new centroid is
found by taking the average over the cluster instances, which may lead to shifts of instances
between clusters. This iterative process ends when the centroids stop changing.
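The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Weka's SimpleKMeans implementation: the function names, the way the initial centroids are drawn from the data, and the handling of empty clusters are all assumptions of the sketch. It uses the twelve instances of the first experiment below.

```python
import math
import random

# The twelve (x, y) instances from "simpleclustering.arff".
POINTS = [(x, y) for x in (0, 10) for y in (0, 2, 4, 6, 8, 10)]


def nearest(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids):
    """Alternate assignment and centroid update until the centroids stop changing."""
    while True:
        assignment = [nearest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # assumption: an empty cluster keeps its centroid
        if updated == centroids:
            return centroids, assignment
        centroids = updated


rng = random.Random(10)  # the seed only decides the initial centroids
start = [tuple(map(float, p)) for p in rng.sample(POINTS, 2)]
centroids, assignment = kmeans(POINTS, start)
print("initial:", start)
print("final:  ", centroids)
```

Everything after the choice of `start` is deterministic, so rerunning with another seed changes the outcome only through the initial centroids.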
Start the Weka Explorer and on the Preprocess page load the file “simpleclustering.arff”.
The contents of the file are as follows:
% Simple clustering experiment
@relation location
@attribute x numeric
@attribute y numeric
@data
0,0
0,2
0,4
0,6
0,8
0,10
10,0
10,2
10,4
10,6
10,8
10,10
The data contains twelve instances, with attributes “x” and “y” (you will also see an
attribute “class” when you have loaded the file, but I have left that attribute out of the
description above).
2.1 Study the file contents. How would you cluster this data?
The file contains two natural clusters. These are labelled “left” and “right” in the file. We
will now see whether or not the k-means algorithm is able to find these clusters.
Go to the Cluster page. In the Clusterer box, click the “Choose” button. You can now select
the “SimpleKMeans” clusterer. For “Cluster mode” select “Classes to clusters evaluation”.
This makes the clusterer ignore the “class” attribute while clustering, but still use it for
comparison in the generated report. Click “Start” to activate the clustering.
2.2 Study the results. Are they as expected?
Click on the “Clusterer” box (the text right next to the “Choose” button) and change the
seed, but leave the parameter k (which is the number of clusters) at 2. Changing the seed
will make the clusterer use different random numbers. Rerun the clustering.
2.3 Study the results. Are they equal to the results you got previously? If so, try a couple
more times until the results change. Note that, while Weka does not report which
instance is placed in which cluster, in this example you can easily determine that
yourself by drawing the instances and cluster centroids on graph paper. Estimate
(very roughly) the percentage of experiments that gives the natural clusters as a
result.
You can repeat this experiment a couple of times to see what different clusters the k-means
algorithm generates.
2.4 Can you explain the basic reason why the k-means algorithm does not always find the
natural clusters?
When you have finished experimenting with this data set, load the file “marble.arff”. This
file contains the fifteen marble instances that were used as examples during the lecture (also
shown on page 6). After loading the file, remove the “colour” attribute from the data set by
checking it and clicking the “Remove” button. Note that the “class” attribute contains the
name of the “natural” groups that were discussed during the lecture. We will now see if the
k-means algorithm is also able to find these groups.
Go to the “Cluster” page. Since presumably there are four natural clusters, the k parameter
(available when you click on the “Clusterer” box) should be set to at least 4, but you might
get better results if you choose 5. Make sure the “Cluster mode” is set to “Classes to clusters
evaluation”.
2.5 Because k-means finds local optima, it is often advised to run the algorithm several
times and stick with the best results you get. Do this now and study the results. How
close can you come to dividing the marbles into the “ideal” classes?
If you like, you can apply the k-means algorithm to some of the other datasets. Be warned
that if you use a dataset with a great many instances and ask the algorithm to split it into a
dozen or more clusters, this task can take a very, very long time (so, stay away from the
“Soybean” data set).
3. The Cobweb algorithm
The Cobweb algorithm builds clusters by incrementally adding instances to a tree, merging
an instance with an existing cluster if this leads to a higher “Category Utility” (CU) value than
giving the instance its own cluster. If the need arises, an existing cluster may also
be split up into two new clusters, if this is beneficial to the CU value. The resulting set of
clusters is called a “dendrogram” (a tree-form).
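To make the CU value concrete, here is a small Python sketch of the usual category utility definition for nominal attributes: for each cluster, the gain in the summed squared value probabilities over the whole data set, weighted by the cluster's share of the instances and divided by the number of clusters. The function names are hypothetical and Weka's Cobweb may compute the value differently in detail. The marbles are written as the four-letter codes explained on page 6.

```python
from collections import Counter

# The fifteen marbles as four-letter codes (size, colouring, shininess,
# transparency), grouped by the "class" attribute of "marble.arff".
MARBLES = {
    "vanilla": ["BMSO"] + ["SMSO"] * 5,
    "classic": ["SMST"] * 5,
    "tiger": ["BPST", "SPST"],
    "stone": ["BPDO", "SPDO"],
}


def squared_probs(instances):
    """Sum over attributes and values of P(attribute = value)^2 within instances."""
    total = 0.0
    for position in range(4):
        counts = Counter(code[position] for code in instances)
        total += sum((n / len(instances)) ** 2 for n in counts.values())
    return total


def category_utility(clusters):
    """CU of a partition, averaged over the number of clusters."""
    everything = [code for cluster in clusters for code in cluster]
    baseline = squared_probs(everything)
    gain = sum(
        len(cluster) / len(everything) * (squared_probs(cluster) - baseline)
        for cluster in clusters
    )
    return gain / len(clusters)


by_class = list(MARBLES.values())
one_cluster = [[code for cluster in by_class for code in cluster]]
print(category_utility(one_cluster))  # 0.0: a single cluster adds no information
print(category_utility(by_class))
```

Cobweb compares values like these for the alternative placements of each new instance and keeps the placement with the highest CU.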
The following experiments will be done with the marble set. On the Preprocess page, load
the file “marblespecific.arff”. When the file is loaded, remove the “colour” attribute.
The “marblespecific” data set is similar to the set found in the “marble.arff” file, except for
the fact that the “class” attribute is now unique for each marble, which allows you to
recognize the marbles on the report generated by Weka. The code for each marble consists
of four letters, followed by the marble colour. The four letters respectively encode the size
(Big or Small), colouring (Monochrome or Polychrome), shininess (Shiny or Dull) and
transparency (Transparent or Opaque). See page 6 for details.
Go to the Cluster page. In the Clusterer box, click the “Choose” button. You can now
select the “Cobweb” clusterer. For “Cluster mode” select “Classes to clusters evaluation”.
This makes the clusterer ignore the “class” attribute while clustering, but still use it for
comparison in the generated report. Click “Start” to activate the clustering.
3.1 Study the results. These form a tree, and from the report you can derive which
marble ends up in which leaf (and therefore in which cluster). Are you satisfied with
this clustering? Why or why not?
When you click on the “Clusterer” box you can change the “Cutoff” value. A higher cutoff
value will encourage lumping similar marbles into the same cluster. Note that changing the
“Acuity” won’t have an effect, because the acuity only matters for numerical attributes,
which do not exist in this data set.
3.2 Experiment with changing the “Cutoff” value and rerunning the experiment until
you are satisfied with the generated clusters.
Because the result is a dendrogram, it can be translated straightforwardly into a logical
procedure, which can be used to place new marbles in the correct clusters (or tree leaves).
3.3 From the dendrogram derive a procedure (for instance in pseudo-code) that can be
used to find the correct cluster for new marbles. Note that because of the limited
samples in the training set, you may be able to produce marbles for which there is
currently more than one logical cluster to reside in. In that case, it doesn’t matter
which of the possible clusters you choose for such a marble, as long as the procedure
manages to assign to each marble a good home.
If you like, you can apply the Cobweb algorithm to some of the other datasets. While the
Cobweb algorithm is not iterative, the “Soybean” data set still takes too long to cluster with
the Cobweb algorithm, so you should pick one of the others.
4. The Expectation-Maximisation (EM) algorithm
The EM algorithm is a probabilistic clustering algorithm. Each cluster is defined by
probabilities for instances to have certain values for their attributes, and a probability for
instances to reside in the cluster. For a numerical attribute the definition consists of a mean
value and a standard deviation; for a discrete attribute it consists of a probability for
each attribute value.
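For discrete attributes, the alternation between estimating cluster definitions and re-assigning instances can be sketched as follows. This is a simplified illustration and not Weka's EM clusterer: the random soft start, the fixed number of iterations, the add-one smoothing of the counts and all function names are assumptions of the sketch, so its numbers will not match a Weka report. The marbles are written as the four-letter codes explained on page 6.

```python
import random

# The fifteen marbles as four-letter codes (size, colouring, shininess, transparency).
MARBLES = ["BMSO", "BPST", "BPDO", "SPST", "SPDO"] + ["SMSO"] * 5 + ["SMST"] * 5
VALUES = ["BS", "MP", "SD", "TO"]  # possible letters per attribute position


def m_step(resp, k):
    """Re-estimate priors and per-attribute value probabilities from soft memberships."""
    priors, tables = [], []
    for c in range(k):
        weight = sum(r[c] for r in resp)
        priors.append(weight / len(MARBLES))
        table = []
        for pos, letters in enumerate(VALUES):
            # Add-one smoothing (an assumption of this sketch) avoids zero probabilities.
            counts = {v: 1.0 for v in letters}
            for code, r in zip(MARBLES, resp):
                counts[code[pos]] += r[c]
            total = sum(counts.values())
            table.append({v: n / total for v, n in counts.items()})
        tables.append(table)
    return priors, tables


def e_step(priors, tables):
    """For every marble, the normalised probability of belonging to each cluster."""
    resp = []
    for code in MARBLES:
        scores = []
        for prior, table in zip(priors, tables):
            score = prior
            for pos, letter in enumerate(code):
                score *= table[pos][letter]
            scores.append(score)
        total = sum(scores)
        resp.append([s / total for s in scores])
    return resp


def em(k, seed, iterations=50):
    """Run a fixed number of EM iterations from random soft memberships."""
    rng = random.Random(seed)
    resp = []
    for _ in MARBLES:
        raw = [rng.random() + 1e-9 for _ in range(k)]
        resp.append([x / sum(raw) for x in raw])
    for _ in range(iterations):
        priors, tables = m_step(resp, k)
        resp = e_step(priors, tables)
    return priors, tables, resp


priors, tables, resp = em(k=2, seed=1)
print([round(p, 2) for p in priors])
```

Unlike k-means, every marble keeps a probability for each cluster instead of a single hard assignment.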
Because discrete values are easier to evaluate in this respect, and also to facilitate
comparison with the previous two algorithms, we will apply the EM algorithm to the
“marble” data set as well. On the Preprocess page, load the file “marble.arff”. When the file is
loaded, remove the “colour” attribute.
Go to the Cluster page. In the Clusterer box, click the “Choose” button. You can now
select the “EM” clusterer. For “Cluster mode” select “Classes to clusters evaluation”. This
makes the clusterer ignore the “class” attribute while clustering, but still use it for comparison
in the generated report. In the default setup, the EM algorithm will determine the number of
clusters automatically. Click “Start” to activate the clustering.
4.1 Study the results. How many clusters are generated? Why is this? Can you get a
different result with a different seed?
When you click on the “Clusterer” box you can change the “numClusters” value. This value
is -1 by default, which allows the algorithm to determine by itself the needed number of
clusters. If you set it to a specific value, the algorithm will try to derive that number of
clusters.
4.2 Change the number of clusters to the number of clusters you would like to get (or a bit
higher) and rerun the experiment. Also rerun the experiment with different seeds
(and a desired number of clusters). How does this impact the results?
If you study the report generated, you can see, for each cluster, the cluster’s “prior
probability” and for each of the attributes a “discrete estimator”. The estimators consist of a
number for each possible attribute value, and the attribute values are treated in order. For
“size” the order is {big, small}, for “colouring” the order is {monochrome, polychrome},
for “shininess” the order is {shiny, dull} and for “transparency” the order is
{transparent, opaque}. The estimators for each attribute value do not add up to 1, but to the
“total” which is displayed to the right of them. The probability of an attribute value for this
cluster is therefore the estimator divided by the total. For instance:
Attribute: size
Discrete Estimator. Counts = 2.41 11.86 (Total = 14.26)
means that for the attribute “size” the probability for the value “big” is 2.41/14.26 = 0.169.
To determine what the “best” cluster for a marble is, for each cluster one should multiply
the “prior probability” with the probabilities for the marble’s attribute values. This gives a
number for the marble for each cluster. If these are normalized, they indicate the
“expectation” probabilities for the marble to be located in each of the clusters.
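Written as a formula, the procedure looks as follows. The symbols are notation introduced here for illustration and do not appear in the Weka report: C_l is cluster l, P(C_l) its prior probability, and P(a_i = v_i | C_l) the estimator for value v_i of attribute a_i divided by the total.

```latex
P(C_l \mid v_1, \dots, v_n) =
  \frac{P(C_l) \prod_{i=1}^{n} P(a_i = v_i \mid C_l)}
       {\sum_{m} P(C_m) \prod_{i=1}^{n} P(a_i = v_i \mid C_m)}
```

The numerator is the number computed per cluster; the denominator is the sum of those numbers over all clusters, which performs the normalisation.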
For the next experiment, set the seed to 14 and the number of clusters to 2. Rerun the
clustering.
4.3 Suppose you have a small, monochrome, dull, opaque marble. According to these
results, in which cluster would you expect to find this marble? (You may want to use
a calculator for this; there should be one among the standard “Accessories”
programs).
Suppose that, when performing physical experiments with the marbles in the training set,
you find that when you break the marbles open, in some of the marbles you find a small
centre made of pure gold, with a value five times that of the original marble (which is
destroyed in the process). You find that this is the case for all the marbles that were
assigned to the smallest cluster, and for none of the other marbles. You suspect that the
fundamental property these clusters represent is the presence or absence of a gold centre.
You now have the opportunity to buy a bag of small, monochrome, dull, opaque marbles
(similar to the one in the previous question) for a normal price. Recall that there was no
such marble in the training set. While you have no interest in the marbles themselves, you
are interested in selling the gold that they might or might not contain.
4.4 Assuming that the cost of processing the marbles is negligible, would you consider it
a wise choice to buy the marbles? Why or why not?
4.5 Would you be willing to substantially increase the mortgage on your house to
acquire a huge truckload of marbles of a type you tested in the training set, which
you found contains a gold centre?
If you like, you can apply the EM algorithm to some of the other datasets. Again, stay away
from the “Soybean” data set.
5. Conclusion
You have now experimented with three different clustering algorithms: the k-means
algorithm, the Cobweb algorithm and the EM algorithm. You have tried them all out on the
same “marble” data set.
5.1 Think of (at least) one advantage and one disadvantage of each algorithm.
5.2 Which of the three clustering algorithms would you prefer to use for the marble data
set?
Have a nice day!
6
Marbles
The “marble” data set contains fifteen marbles with the following attributes:

Attribute      Possible values                                          Abbreviations
size           big, small                                               B, S
colouring      monochrome, polychrome                                   M, P
shininess      shiny, dull                                              S, D
transparency   transparent, opaque                                      T, O
colour         blue, green, red, yellow, white, black, multicoloured
class          classic, vanilla, tiger, stone
The “colour” attribute should be unchecked for all experiments. The file
“marblespecific.arff” contains the same data set with the same attributes, except that the
“class” attribute is unique for each marble, namely a letter for each of the first four attribute
values followed by the colour name.
The fifteen marbles in the data set are the following:
Code   Colour   Class
BMSO   blue     vanilla
BPST   multi    tiger
BPDO   multi    stone
SMSO   blue     vanilla
SPST   multi    tiger
SPDO   multi    stone
SMSO   red      vanilla
SMSO   white    vanilla
SMST   green    classic
SMSO   yellow   vanilla
SMST   white    classic
SMST   black    classic
SMSO   green    vanilla
SMST   blue1    classic
SMST   blue2    classic
On the “Materials” page of the summer course website you’ll find a PDF file of this page,
which is in colour.
Answers to Lab 7: Clustering
2.1 The first half “left”, the second half “right”. They are dots on two opposite sides of a square.
2.2 With the default values, they usually are.
2.3 After a few tries, you should find different clusters to be generated. About 50% of the
experiments will give the ideal clusters.
2.4 The basic cause is the fact that the first centroids are selected randomly (the rest of the
k-means algorithm is deterministic, so the cause is found in the first centroid placement). If
they are placed in bad spots, the results might be unsatisfying. For instance, if they are
placed at (5,0) and (5,10), the clusters will consist of the upper half and the lower half of the
data set, instead of the left and right.
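The example with starting centroids (5,0) and (5,10) is easy to check in code. The sketch below performs single k-means iterations on the twelve instances of “simpleclustering.arff”; the helper name `step` is hypothetical.

```python
import math

# The twelve (x, y) instances from "simpleclustering.arff".
POINTS = [(x, y) for x in (0, 10) for y in (0, 2, 4, 6, 8, 10)]


def step(centroids):
    """One k-means iteration: assign every point to its nearest centroid, then average."""
    clusters = [[] for _ in centroids]
    for p in POINTS:
        distances = [math.dist(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    means = [tuple(sum(v) / len(cl) for v in zip(*cl)) for cl in clusters]
    return means, clusters


centroids, clusters = step([(5, 0), (5, 10)])
print(centroids)    # [(5.0, 2.0), (5.0, 8.0)]
print(clusters[0])  # the six instances with y < 5
again, _ = step(centroids)
print(again == centroids)  # True: the centroids no longer move
```

After one iteration the centroids sit at (5,2) and (5,8), and a second iteration changes nothing, so the algorithm stops with the lower/upper split.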
2.5 Many different clusterings are generated. I myself was able to find one that has only one
marble outside its ideal set (seed=21, k=5).
3.1 You should not be satisfied, because a separate cluster is generated for every
distinguishable marble. Only marbles that are equal in all attributes end up in the same cluster.
3.2 With a cutoff value of 0.2 five clusters are derived. This is an OK answer. With cutoff value
0.2505 four clusters can be found. At 0.26 only one cluster is found, so that value is certainly too
high.
3.3 One possibility (for cutoff value 0.2) is:
if opaque then leaf 1
else if (small and shiny) then leaf 2
else
    if dull then
        if small then leaf 5 else leaf 6
    else leaf 7
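The same procedure as a runnable Python function, as a direct transcription of the pseudo-code. The function name and the attribute value strings are assumptions; the leaf numbers are those given in the answer.

```python
def leaf_for(size, colouring, shininess, transparency):
    """Return the leaf number for a marble, following the rules for cutoff 0.2."""
    if transparency == "opaque":
        return 1
    if size == "small" and shininess == "shiny":
        return 2
    if shininess == "dull":
        return 5 if size == "small" else 6
    return 7


print(leaf_for("small", "monochrome", "shiny", "opaque"))       # 1
print(leaf_for("small", "monochrome", "shiny", "transparent"))  # 2
print(leaf_for("big", "polychrome", "shiny", "transparent"))    # 7
```

Note that “colouring” is accepted but never tested, just as in the pseudo-code.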
4.1 You will get only one cluster. The main reason is that EM is a statistical division, and for a
statistical division four clusters (which you would like to find) is a lot for only 15 samples
with only four attributes. The groups would have to be VERY distinct for a statistical division
to be able to keep them apart. In this case, they are not (in fact, there are basically only
7 different marbles, which are equal in many of their attributes). Changing the seed won’t
change the results.
4.2 If you set the number of clusters to 4, you get three or four clusters, depending on the seed.
If you set it to 5, you get three or four clusters most of the time, and in rare cases five,
depending on the seed. Usually, the tiger class is lumped in the same cluster as the classic
class. Sometimes one tiger or one stone marble gets its own cluster. The results are quite
consistent.
4.3 For the small cluster, the calculation gives:
0.18*(2.14/4.74)*(1.32/4.74)*(2.78/4.74)*(3.03/4.74) = 0.00848
For the big cluster, the calculation gives:
0.82*(11.86/14.26)*(11.68/14.26)*(1.22/14.26)*(6.97/14.26) = 0.02336
Normalised, these values are 0.26633 for the small cluster and 0.73367 for the big cluster.
The marble therefore is best assigned to the big cluster.
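If you prefer a script to a calculator, the calculation can be reproduced as below. The prior probabilities, counts and totals are copied from the two products above; the dictionary layout and names are only for illustration.

```python
# Prior probability and estimator counts for the values small, monochrome,
# dull and opaque, copied from the calculation above.
CLUSTERS = {
    "small cluster": {"prior": 0.18, "counts": [2.14, 1.32, 2.78, 3.03], "total": 4.74},
    "big cluster": {"prior": 0.82, "counts": [11.86, 11.68, 1.22, 6.97], "total": 14.26},
}


def score(cluster):
    """Prior multiplied by count/total for each of the marble's attribute values."""
    result = cluster["prior"]
    for count in cluster["counts"]:
        result *= count / cluster["total"]
    return result


scores = {name: score(c) for name, c in CLUSTERS.items()}
norm = sum(scores.values())
for name, s in scores.items():
    print(f"{name}: score {s:.5f}, normalised {s / norm:.3f}")
```

Without intermediate rounding, the normalised values agree with the figures above to three decimals (about 0.266 and 0.734).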
4.4 It is a wise choice. Normalised, the probability that such a marble belongs to the “gold”
cluster is 0.26633 (as was shown in the previous answer). The expected profit is therefore:
(0.26633 * 5P) - P = (0.26633 * 5 - 1) * P = 0.33165P
where P is the price you paid for the marbles. This is the expected value of selling a marble
minus the price you originally paid for the marble. Therefore the gold will give an expected
profit of 33%.
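The same arithmetic as a short Python sketch, with the profit expressed as a fraction of the price P. The constant and function names are hypothetical.

```python
P_GOLD = 0.26633      # normalised probability of the "gold" cluster (answer 4.3)
GOLD_MULTIPLIER = 5   # the gold is worth five times the price of a marble


def expected_profit(p_gold, multiplier=GOLD_MULTIPLIER):
    """Expected profit per marble, as a fraction of the purchase price."""
    return p_gold * multiplier - 1


print(round(expected_profit(P_GOLD), 5))  # 0.33165
print(1 / GOLD_MULTIPLIER)                # 0.2, the break-even probability
```

Any probability above 0.2 gives a positive expected profit under these assumptions, so the conclusion does not hinge on the exact value 0.26633.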
4.5 Would you be willing to increase the mortgage on your house? That would not be wise. It
was stated that you suspected the clusters to represent the presence or absence of gold. But
you have no way to know if you really have taken all relevant attributes into account or if
your suspicion is based on hard facts. The training set is just too small. Your interpretation
of the detected pattern might be off. Wagering a bit of money is perhaps OK, but as stated
during the lecture: “clustering is NOT the last word”. There certainly will be information
you won’t be able to derive from your cluster results.
5.1 k-Means: Advantage: Fast, easily implemented. Disadvantage: Often suboptimal results,
lots of experiments are needed and even then you can’t be sure that you have found the best
answer.
Cobweb: Advantage: Usually, tree-like results are easily translated into a procedure.
Disadvantage: Results are highly dependent on sample order and experimenting with the
cutoff value is necessary to find acceptable results.
EM: Advantage: Cluster boundaries are not so strict and results are quite consistent.
Disadvantage: Predictions are not solid and may vary over experiments. Furthermore,
attributes must be independent, which in practice they often are not (without the
experimenter knowing this).
5.2 None are really bad, and it depends on what you want to achieve with clustering which
method you should prefer. In practice, for data-mining usually the EM algorithm is applied,
because it acknowledges that not all information is known about the instances. The question
about the gold centre is a typical data-mining application.