Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
1
Cluster Analysis
Chapter 12
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
2
Cluster analysis
•
It is a class of techniques used to classify
cases into groups that are
•
relatively homogeneous within themselves and
•
heterogeneous between each other
•
Homogeneity (similarity)
and
heterogeneity
(dissimilarity)
are measured on the basis of a
defined set of variables
•
These groups are called
clusters
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
3
Market segmentation
•
Cluster analysis is especially useful for market
segmentation
•
Segmenting a market means dividing its potential
consumers into separate sub

sets where
•
Consumers in the same group are similar with respect to
a given set of characteristics
•
Consumers belonging to different groups are dissimilar
with respect to the same set of characteristics
•
This allows one to calibrate the marketing mix
differently according to the target consumer group
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
4
Other uses of cluster analysis
•
Product characteristics and the identification of new
product opportunities.
•
Clustering of similar brands or products according to their
characteristics allow one to identify competitors, potential
market opportunities and available niches
•
Data reduction
•
Factor analysis and principal component analysis allow to reduce the
number of variables.
•
Cluster analysis allows to reduce the number of observations, by
grouping them into homogeneous clusters.
•
Maps profiling simultaneously consumers and products,
market opportunities and preferences as in
preference
or
perceptual mappings
(lecture 14)
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
5
Steps to conduct a cluster analysis
•
Select a
distance measure
•
Select a
clustering algorithm
•
Define the
distance between two clusters
•
Determine the
number of clusters
•
Validate
the analysis
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
6
Distance measures for individual
observations
•
To measure
similarity
between two observations a distance
measure is needed
•
With a single variable,
similarity
is straightforward
•
Example: income
–
two individuals are similar if their income level is
similar and the level of dissimilarity increases as the income gap
increases
•
Multiple variables require an
aggregate distance measure
•
Many characteristics (e.g. income, age, consumption habits, family
composition, owning a car, education level, job…), it becomes more
difficult to define similarity with a single value
•
The most known measure of distance is the
Euclidean
distance
, which is the concept we use in everyday life for
spatial coordinates.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
7
Examples of distances
D
ij
distance between cases
i
and
j
x
kj
value of variable
x
k
for case
j
Problems
•
Different measures = different weights
•
Correlation between variables (double counting)
Solution:
Standardization, rescaling, principal
component analysis
2
1
n
ij ki kj
k
D x x
1
n
ij ki kj
k
D x x
Euclidean distance
City

block (Manhattan) distance
A
B
A
B
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
8
Other distance measures
•
Other distance measures
:
Chebychev, Minkowski,
Mahalanobis
•
An alternative approach:
use
correlation
measures
, where correlations are not between
variables, but between observations.
•
Each observation is characterized by a set of
measurements (one for each variable) and bi

variate correlations can be computed between two
observations.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
9
Clustering procedures
•
Hierarchical procedures
•
Agglomerative
(start from
n
clusters to get to
1
cluster)
•
Divisive
(start from
1
cluster to get to
n
clusters)
•
Non hierarchical procedures
•
K

means clustering
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
10
Hierarchical clustering
•
Agglomerative:
•
Each of the
n
observations constitutes a separate cluster
•
The two clusters that are more similar according to same distance rule are
aggregated, so that in step 1 there are
n

1
clusters
•
In the second step another cluster is formed (
n

2
clusters), by nesting the
two clusters that are more similar, and so on
•
There is a merging in each step until all observations end up in a single
cluster in the final step.
•
Divisive
•
All observations are initially assumed to belong to a single cluster
•
The most dissimilar observation is extracted to form a separate cluster
•
In step 1 there will be 2 clusters, in the second step three clusters and so
on, until the final step will produce as many clusters as the number of
observations.
•
The number of clusters determines the stopping rule for the
algorithms
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
11
Non

hierarchical clustering
•
These algorithms do not follow a hierarchy and produce a
single partition
•
Knowledge of the
number of clusters
(
c
) is required
•
In the first step, initial cluster centres (the
seeds
) are
determined for each of the
c
clusters, either by the
researcher or by the software (usually the first
c
observation or observations are chosen randomly)
•
Each iteration allocates observations to each of the
c
clusters, based on their distance from the cluster centres
•
Cluster centres are computed again and observations may
be reallocated to the nearest cluster in the next iteration
•
When no observations can be reallocated or a stopping rule
is met, the process stops
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
12
Distance between clusters
•
Algorithms vary according to the way the
distance between two clusters is
defined.
•
The most common algorithm for
hierarchical methods include
•
single linkage method
•
complete linkage method
•
average linkage method
•
Ward algorithm
(see slide 14)
•
centroid method
(see slide 15)
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
13
Linkage methods
•
Single linkage method (nearest neighbour):
distance between two clusters is the
minimum
distance among all possible distances between
observations belonging to the two clusters.
•
C
omplete linkage method (furthest neighbour):
nests two cluster using as a basis the
maximum
distance between observations belonging to
separate clusters.
•
Average linkage method:
the distance between
two clusters is the
average
of all distances
between observations in the two clusters
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
14
Ward algorithm
1.
The sum of squared distances is computed
within
each of the cluster, considering all distances
between observation within the same cluster
2.
The algorithm proceeds by choosing the
aggregation between two clusters which
generates the smallest increase in the total sum
of squared distances.
•
It is a computationally intensive method, because
at each step all the sum of squared distances
need to be computed, together with all potential
increases in the total sum of squared distances for
each possible aggregation of clusters.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
15
Centroid method
•
The distance between two clusters is the distance
between the two centroids,
•
Centroids
are the cluster averages for each of the
variables
•
each cluster is defined by a single set of coordinates,
the averages of the coordinates of all individual
observations belonging to that cluster
•
Difference between the
centroid
and the
average
linkage method
•
Centroid
: computes the average of the co

ordinates of
the observations belonging to an individual cluster
•
Average linkage
: computes the average of the distances
between two separate clusters.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
16
Non

hierarchical clustering:
K

means method
1.
The number
k
of clusters is fixed
2.
An initial set of
k
“seeds” (aggregation centres)
is
provided
•
First
k
elements
•
Other seeds (randomly selected or explicitly defined)
3.
Given a certain fixed threshold, all units are assigned
to the nearest cluster seed
4.
New seeds are computed
5.
Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (
optimising
partioning
)
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
17
Non

hierarchical threshold methods
•
Sequential threshold methods
•
a prior threshold is fixed and units within that distance
are allocated to the first seed
•
a second seed is selected and the remaining units are
allocated, etc.
•
Parallel threshold methods
•
more than one seed are considered simultaneously
•
When reallocation is possible after each stage, the
methods are termed
optimizing procedures
.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
18
Hierarchical vs. non

hierarchical
methods
Hierarchical Methods
Non

hierarchical methods
No decision about the number of
clusters
Problems when data contain a high
level of error
Can be very slow, preferable with
small data

sets
Initial decisions are more influential
(one

step only)
At each step they require computation
of the full proximity matrix
Faster, more reliable, works with
large data sets
Need to specify the number of
clusters
Need to set the initial seeds
Only cluster distances to seeds need
to be computed in each iteration
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
19
The number of clusters
c
•
Two alternatives
•
Determined by the analysis
•
Fixed by the researchers
•
In segmentation studies, the
c
represents the number of
potential separate segments.
•
Preferable approach:
“let the data speak”
•
Hierarchical approach and optimal partition identified through
statistical tests (
stopping rule
for the algorithm)
•
However, the detection of the optimal number of clusters is subject
to a high degree of uncertainty
•
If the research objectives allow a choice rather than
estimating the number of clusters, non

hierarchical
methods are the way to go.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
20
Example: fixed number of clusters
•
A retailer wants to identify several shopping
profiles in order to activate new and targeted
retail outlets
•
The budget only allows him to open three types of
outlets
•
A partition into three clusters follows naturally,
although it is not necessarily the optimal one.
•
Fixed number of clusters and (
k

means) non
hierarchical approach
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
21
Example:
c
determined from the data
•
Clustering of shopping profiles is expected to detect a new
market niche.
•
For market segmentation purposes, it is less advisable to
constrain the analysis to a fixed number of clusters
•
A hierarchical procedure allows to explore all potentially valid numbers of
clusters
•
For each of them there are some statistical diagnostics to pinpoint the best
partition.
•
What is needed is a
stopping rule
for the hierarchical algorithm, which
determines the number of clusters at which the algorithm should stop.
•
Statistical tests are not always univocal, leaving some room
to the researcher’s experience and arbitrariness
•
Statistical rigidities should be balanced with the knowledge
gained from and interpretability of the final classification.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
22
Determining the optimal number of
cluster from hierarchical methods
•
Graphical
•
dendrogram
•
scree diagram
•
Statistical
•
Arnold’s criterion
•
pseudo F statistic
•
pseudo t
2
statistic
•
cubic clustering criterion (CCC)
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
23
Dendrogram
Rescaled Distance Cluster Combine
C A S E 0 5 10 15 20 25
Label Num +

+

+

+

+

+
231
275
145
181
333
117
336
337
209
431
178
This dotted line represents the
distance between clusters
These
are the
individual
cases
Case 231 and case 275 are merged
And the merging
distance is
relatively small
As the algorithm proceeds, the
merging distances become larger
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
24
Scree diagram
0
2
4
6
8
10
12
11
10
9
8
7
6
5
4
3
2
1
Number of clusters
Distance
Merging
distance on
the y

axis
When one moves from
7 to 6 clusters, the
merging distance
increases noticeably
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
25
Statistical tests
•
The rationale is that in optimal partition,
variability within clusters should be as small as
possible, while variability between clusters should
be maximized
•
This principle is similar to the ANOVA

F test
•
However, since hierarchical algorithms proceed
sequentially, the probability distribution of
statistics relating variability within and variability
between is unknown and differs from the
F
distribution
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
26
Statistical criteria to detect the optimal
partition
•
Arnold’s criterion
: find the
minimum
of the determinant of the
within
cluster sum of squares matrix
W
•
Pseudo F, CCC and Pseudo t
2
: the ideal number of clusters should
correspond to
•
a local maximum for the Pseudo

F and CCC, and
•
a small value of the pseudo t
2
which increases in the next step (preferably a
local minimum).
•
These criteria are rarely consistent among them, so that the researcher
should also rely on meaningful (interpretable) criteria.
•
Non

parametric methods
(SAS) also allow one to determine the number
of clusters
•
k

th nearest neighbour method:
•
the researcher sets a parameter (
k
)
•
for each
k
the method returns the optimal number of clusters.
•
if this optimal number is the same for several values of
k
, then the
determination of the number of clusters is relatively robust
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
27
Suggested approach:
2

steps procedures
1.
First perform a hierarchical method to define
the number of clusters
2.
Then use the
k

means procedure to actually
form the clusters
The
reallocation problem
•
Rigidity of hierarchical methods: once a unit is classified into a
cluster, it cannot be moved to other clusters in subsequent steps
•
The
k

means method allows a reclassification of all units in each
iteration.
•
If some uncertainty about the number of clusters remains after
running the hierarchical method, one may also run several
k

means clustering procedures and apply the previously discussed
statistical tests to choose the best partition.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
28
The SPSS two

step procedure
•
The observations are preliminarily aggregated into clusters
using an hybrid hierarchical procedure named
cluster
feature tree.
•
This first step produces a number of
pre

clusters
, which is
higher than the final number of clusters, but much smaller
than the number of observations
.
•
In the second step, a hierarchical method is used to classify
the pre

clusters, obtaining the final classification.
•
During this second clustering step, it is possible to
determine the number of clusters.
The user can either fix the number of clusters or let the
algorithm search for the best one according to
information
criteria
which are also based on
goodness

of

fit measures
.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
29
Evaluation and validation
•
goodness

of

fit
of a cluster analysis
•
ratio between the sum of squared errors and the total sum of
squared errors (similar to R
2
)
•
root mean standard deviation
within clusters.
•
Validation: if the identified cluster structure
(number of clusters and cluster characteristics) is
real, it should not be c
•
Validation approaches
•
use of different samples to check whether the final output is
similar
•
Split the sample into two groups when no other samples are
available
•
Check for the impact of initial seeds / order of cases
(hierarchical
approach)
on the final partition
•
Check for the impact of the selected clustering method
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
30
Cluster analysis in SPSS
Three types of cluster
analysis are available in
SPSS
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
31
Hierarchical cluster analysis
Variables selected
for the analysis
Statistics required
in the analysis
Graphs (dendrogram)
Advice: no plots
Clustering method
and options
Create a new variable
with cluster membership
for each case
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
32
Statistics
The agglomeration
schedule is a table
which shows the
steps of the clustering
procedure, indicating
which cases (clusters)
are merged and the
merging distance
The proximity matrix
contains all distances
between cases (it may
be huge)
Shows the cluster
membership of
individual cases only
for a sub

set of
solutions
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
33
Plots
Shows the
clustering process,
indicating which
cases are
aggregated and the
merging distance
With many cases,
the dendrogram is
hardly readable
The icicle plot (which can
be restricted to cover a
small range of clusters),
shows at what stage
cases are clustered. The
plot is cumbersome and
slows down the analysis
(advice: no icicle)
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
34
Method
Choose a
hierarchical
algorithm
Choose the type of data
(interval, counts binary) and
the appropriate measure
Specify whether the variables (values)
should be standardized before analysis.
Z

scores return variables with zero mean
and unity variance. Other standardizations
are possible. Distance measures can also
be transformed
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
35
Cluster memberships
If the number of clusters has been decided (or at least a
range of solutions), it is possible to save the cluster
membership for each case into new variables
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
36
The example:
agglomeration schedule
Cluster Combined
Stage
Number of
clusters
Cluster 1
Cluster 2
Distance
Diff. Dist
490
10
8
12
544.4
491
9
8
11
559.3
14.9
492
8
3
7
575.0
15.7
493
7
3
366
591.6
16.6
494
6
3
6
610.6
19.0
495
5
3
37
636.6
26.0
496
4
13
23
663.7
27.1
497
3
3
13
700.8
37.1
498
2
1
8
754.1
53.3
499
1
1
3
864.2
110.2
Last 10 stages
of the process
(10 to 1 clusters)
As the
algorithms
proceeds
towards the
end, the
distance
increases
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
37
Scree diagram
Scree diagram
590
640
690
740
790
840
7
6
5
4
3
2
1
Number of clusters
Distance
The scree diagram (not provided by
SPSS but created from the
agglomeration schedule) shows a
larger distance increase when the
cluster number goes below 4
Elbow?
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
38
Non

hierarchical solution
with 4 clusters
26.6%
20.2%
23.8%
29.4%
100.0%
1.4
3.2
1.9
3.1
2.4
238.0
1158.9
333.8
680.3
576.9
72
44
40
48
52
28.8
64.4
29.2
60.6
45.4
8.8
64.3
9.2
19.0
23.1
25.1
77.7
33.5
39.1
41.8
17.7
147.8
24.6
57.1
57.2
29.6
146.2
39.4
63.0
65.3
N %
Case Number
Mean
Househol d si ze
Mean
Gross current income of
househol d
Mean
Age of Househol d
Reference Person
Mean
EFS: Total Food &
nonalcohol ic beverage
Mean
EFS: Total Cl othi ng and
Footwear
Mean
EFS: Total Housi ng,
Water, El ectri ci ty
Mean
EFS: Total Transport
costs
Mean
EFS: Total Recreati on
1
2
3
4
Ward Method
Total
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
39
K

means solution (4 clusters)
Variables
Number of clusters (fixed)
Ask for one (classify only) or more
iterations before stopping the
algorithm
It is possible to read a file with
initial seeds or write final seeds on
a file
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
40
K

means options
Improve the
algorithm by
allowing for
more iterations
and running
means (seeds
are recomputed
at each stage)
Creates a new
variable with
cluster
membership
for each case
More options
including an
ANOVA table
with statistics
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
41
Results from k

means
(initial seeds chosen by SPSS)
Final Cluster Centers
2.0
2.0
2.8
3.2
264.5
241.1
791.2
1698.1
56
75
46
45
37.3
22.2
54.1
66.2
14.0
28.0
31.7
48.4
34.7
100.3
47.3
64.5
28.4
10.4
78.3
156.8
39.6
3013.1
74.4
125.9
Househol d si ze
Gross current income of
househol d
Age of Househol d
Reference Person
EFS: Total Food &
nonalcohol i c beverage
EFS: Total Cl othi ng and
Footwear
EFS: Total Housing,
Water, El ectri ci ty
EFS: Total Transport
costs
EFS: Total Recreati on
1
2
3
4
Cl uster
Number of Cases in each Cluster
292.000
1.000
155.000
52.000
500.000
.000
1
2
3
4
Cl uster
Vali d
Missing
The k

means algorithm is
sensible to outliers and SPSS
chose an improbable amount for
recreation expenditure as an
initial seed for cluster 2
(probably an outlier due to
misrecording or an exceptional
expenditure)
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
42
Results from k

means:initial seeds from
hierarchical clustering
32.6%
10.2%
33.6%
23.6%
100.0%
1.7
3.1
2.5
2.9
2.4
163.5
1707.3
431.8
865.9
576.9
60
45
50
46
52
31.3
65.5
45.1
56.8
45.4
12.3
48.4
19.1
32.7
23.1
29.8
65.3
41.9
48.1
41.8
24.6
156.8
37.4
87.5
57.2
30.3
126.8
67.9
83.4
65.3
N %
Case Number
Mean
Househol d si ze
Mean
Gross current income of
househol d
Mean
Age of Househol d
Reference Person
Mean
EFS: Total Food &
nonalcohol ic beverage
Mean
EFS: Total Cl othi ng and
Footwear
Mean
EFS: Total Housi ng,
Water, El ectri ci ty
Mean
EFS: Total Transport
costs
Mean
EFS: Total Recreati on
1
2
3
4
Cl uster Number of Case
Total
The first cluster is now larger, but it still represents older and poorer households. The
other clusters are not very different from the ones obtained with the Ward algorithm,
indicating a certain robustness of the results.
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
43
2

step clustering
it is possible to
make a distinction
between categorical
and continuous
variables
The search for
the optimal
number of
clusters may be
constrained
This is the
information
criterion to
choose the
optimal partition
One may also
asks for plots and
descriptive stats
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
44
Options
It is advisable
to control for
outliers (OLs)
because the
analysis is
usually
sensitive to
OLs
It is possible to
choose which
variable should
be standardized
prior to run the
analysis
More advanced
options are
available for a
better control on
the procedure
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
45
Output
Cluster Distribution
2
.4%
.4%
5
1.0%
1.0%
490
98.2%
98.2%
2
.4%
.4%
499
100.0%
100.0%
499
100.0%
1
2
3
4
Combi ned
Cl uster
Total
N
% of
Combi ned
% of Total
•
Results are not satisfactory
•
With no prior decision on the number of clusters, two
clusters are found, one with a single observations and the
other with the remaining 499 observations.
•
Allowing for outlier treatment does not improve results
•
Setting the number of clusters to four produces these
results
It seems that the two

step clustering
is biased towards finding a macro

cluster.
This might be due to the fact that the
number of observations is relatively
small, but the combination of the
Ward algorithm with the k

means
algorithm is more effective
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
46
SAS cluster analysis
•
Compared to SPSS, SAS provides more
diagnostics and the option of non

parametric
clustering through three SAS/STAT
procedures
•
the procedure CLUSTER and VARCLUS (for
hierarchical and the
k

th neighbour methods)
•
the procedure FASTCLUS (for non

hierarchical
methods)
•
and the procedure MODECLUS (for non

parametric methods)
Statistics for Marketing & Consumer Research
Copyright © 2008

Mario Mazzocchi
47
Discussion
•
It might seem that cluster analysis is too sensitive to the
researcher’s choice.s
•
This is partly due to the relatively small data

set and
possibly to correlation between variables
•
However, all outputs point out to a segment with older and
poorer household and another with younger and larger
households, with high expenditures.
•
By intensifying the search and adjusting some of the
properties, cluster analysis does help identifying
homogeneous groups.
•
“Moral”
: cluster analysis needs to be adequately validated
and it may be risky to run a single cluster analysis and take
the results as truly informative, especially in presence of
outliers.
Comments 0
Log in to post a comment