# Cluster Analysis - Statistics for Marketing & Consumer Research

Nov 25, 2013

Statistics for Marketing & Consumer Research - Mario Mazzocchi

Cluster Analysis

Chapter 12


## Cluster analysis

- It is a class of techniques used to classify cases into groups that are
  - relatively homogeneous within themselves and
  - heterogeneous between each other
- Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on the basis of a defined set of variables
- These groups are called clusters


## Market segmentation

- Cluster analysis is especially useful for market segmentation
- Segmenting a market means dividing its potential consumers into separate subsets where
  - consumers in the same group are similar with respect to a given set of characteristics
  - consumers belonging to different groups are dissimilar with respect to the same set of characteristics
- This allows one to calibrate the marketing mix differently according to the target consumer group


## Other uses of cluster analysis

- Product characteristics and the identification of new product opportunities
  - Clustering similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches
- Data reduction
  - Factor analysis and principal component analysis allow one to reduce the number of variables
  - Cluster analysis allows one to reduce the number of observations, by grouping them into homogeneous clusters
- Maps that simultaneously profile consumers and products, market opportunities and preferences, as in preference or perceptual mappings (lecture 14)


## Steps to conduct a cluster analysis

1. Select a distance measure
2. Select a clustering algorithm
3. Define the distance between two clusters
4. Determine the number of clusters
5. Validate the analysis


## Distance measures for individual observations

- To measure similarity between two observations, a distance measure is needed
- With a single variable, similarity is straightforward
  - Example: income. Two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases
- Multiple variables require an aggregate distance measure
  - With many characteristics (e.g. income, age, consumption habits, family composition, owning a car, education level, job…), it becomes more difficult to define similarity with a single value
  - The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates


## Examples of distances

Euclidean distance:

$$D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$$

City-block (Manhattan) distance:

$$D_{ij} = \sum_{k=1}^{n} \left| x_{ki} - x_{kj} \right|$$

where $D_{ij}$ is the distance between cases $i$ and $j$, and $x_{kj}$ is the value of variable $x_k$ for case $j$.

[Figure: two points A and B, shown once for each distance]

- Problems
  - Different measures = different weights
  - Correlation between variables (double counting)
- Solution: standardization, rescaling, principal component analysis

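As a minimal sketch of the two distances just defined, the functions below compute them for two cases described by the same variables. The function names and the example values are illustrative assumptions, not part of the original slides.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    """Sum of the absolute differences across variables (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two hypothetical cases measured on two variables (made-up values).
case_a = [3.0, 4.0]
case_b = [0.0, 0.0]

d_euclid = euclidean(case_a, case_b)   # straight-line distance
d_block = city_block(case_a, case_b)   # distance along the axes
```

With these values the city-block distance is larger than the Euclidean one, because it adds the gaps on each variable instead of taking the straight line.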

## Other distance measures

- Other distance measures: Chebychev, Minkowski, Mahalanobis
- An alternative approach: use correlation measures, where correlations are not between variables, but between observations
- Each observation is characterized by a set of measurements (one for each variable), and bivariate correlations can be computed between two observations

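The sketch below illustrates two of the measures named above (Chebychev and Minkowski) and the correlation between two observations. It assumes the standard textbook definitions; the function names and data are hypothetical, and the Mahalanobis distance is left out because it needs the inverse covariance matrix of the variables.

```python
import math

def chebychev(x, y):
    """Largest absolute difference on any single variable."""
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """Generalised distance: p=1 gives city-block, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def profile_correlation(x, y):
    """Pearson correlation between two observations, computed across variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Two hypothetical observations with the same profile shape but different levels.
obs_1 = [1.0, 2.0, 3.0, 4.0]
obs_2 = [2.0, 4.0, 6.0, 8.0]

d_cheb = chebychev(obs_1, obs_2)
d_mink1 = minkowski(obs_1, obs_2, 1)
r_profiles = profile_correlation(obs_1, obs_2)
```

The two observations are perfectly correlated even though their distance is not zero: correlation measures compare the shape of the profiles, while distance measures also react to differences in level.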

## Clustering procedures

- Hierarchical procedures
  - Agglomerative (start from n clusters to get to 1 cluster)
  - Divisive (start from 1 cluster to get to n clusters)
- Non-hierarchical procedures
  - K-means clustering


## Hierarchical clustering

- Agglomerative
  - Each of the n observations constitutes a separate cluster
  - The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
  - In the second step another cluster is formed (n-2 clusters) by nesting the two clusters that are most similar, and so on
  - There is a merging in each step until all observations end up in a single cluster in the final step
- Divisive
  - All observations are initially assumed to belong to a single cluster
  - The most dissimilar observation is extracted to form a separate cluster
  - In step 1 there will be 2 clusters, in the second step 3 clusters, and so on, until the final step produces as many clusters as the number of observations
- The number of clusters determines the stopping rule for the algorithms

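The agglomerative steps described above can be sketched with a deliberately naive loop. This is only an illustration: the distance rule used here (the smallest distance between members of two clusters) is one possible choice made for the example, and the function names and data are hypothetical.

```python
import math

def cluster_distance(c1, c2):
    """Assumed distance rule: smallest distance between members of the two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points):
    """Merge the two closest clusters at each step until one cluster remains.

    Returns a schedule of (number of clusters after the merge, merging distance).
    """
    clusters = [[p] for p in points]      # start: every observation is a cluster
    schedule = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # nest the two closest clusters
        del clusters[j]
        schedule.append((len(clusters), d))
    return schedule

# Four hypothetical observations on two variables.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
schedule = agglomerate(points)
```

With n = 4 observations the schedule has n-1 = 3 entries, going from 3 clusters down to 1, and the merging distance grows at each step.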

## Non-hierarchical clustering

- These algorithms do not follow a hierarchy and produce a single partition
- Knowledge of the number of clusters (c) is required
- In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software (usually the first c observations are used, or observations are chosen randomly)
- Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres
- Cluster centres are computed again and observations may be reallocated to the nearest cluster in the next iteration
- When no observations can be reallocated or a stopping rule is met, the process stops


## Distance between clusters

- Algorithms vary according to the way the distance between two clusters is defined
- The most common algorithms for hierarchical methods include
  - Ward algorithm (see the "Ward algorithm" slide below)
  - centroid method (see the "Centroid method" slide below)


- Single linkage: the distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters
- Complete linkage: nests two clusters using as a basis the maximum distance between observations belonging to separate clusters
- Average linkage: the distance between two clusters is the average of all distances between observations in the two clusters

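The three rules above (minimum, maximum and average of the distances between observations in two clusters, usually called single, complete and average linkage) can be compared on a small made-up example. The function names and data below are illustrative assumptions.

```python
import math

def pairwise_distances(c1, c2):
    """All Euclidean distances between a member of c1 and a member of c2."""
    return [math.dist(p, q) for p in c1 for q in c2]

def single_linkage(c1, c2):
    return min(pairwise_distances(c1, c2))      # closest pair

def complete_linkage(c1, c2):
    return max(pairwise_distances(c1, c2))      # farthest pair

def average_linkage(c1, c2):
    d = pairwise_distances(c1, c2)
    return sum(d) / len(d)                      # mean over all pairs

# Two hypothetical clusters of two observations each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
```

The average always lies between the minimum and the maximum, so the three rules can rank candidate merges differently.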

## Ward algorithm

1. The sum of squared distances is computed within each of the clusters, considering all distances between observations within the same cluster
2. The algorithm proceeds by choosing the aggregation between two clusters which generates the smallest increase in the total sum of squared distances

It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.

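A rough sketch of one Ward step follows. It assumes the usual formulation in which within-cluster variability is the sum of squared distances of each observation from its cluster centroid; the slides describe it in terms of distances between observations, so treat this as an interpretation. All names and values are hypothetical.

```python
def centroid(cluster):
    """Average of the coordinates of the observations in the cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def sse(cluster):
    """Sum of squared distances of the observations from the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((p[k] - c[k]) ** 2 for k in range(len(c))) for p in cluster)

def ward_increase(c1, c2):
    """Increase in the total within-cluster sum of squares if c1 and c2 merge."""
    return sse(c1 + c2) - sse(c1) - sse(c2)

def best_ward_merge(clusters):
    """Evaluate every pair and return (increase, i, j) for the cheapest merge."""
    candidates = [
        (ward_increase(clusters[i], clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    return min(candidates)

# Three hypothetical clusters; the third is far away from the other two.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(1.0, 1.0)], [(10.0, 10.0)]]
increase, i, j = best_ward_merge(clusters)
```

Every candidate pair is evaluated before a merge is chosen, which is why the method becomes expensive as the number of clusters grows.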

## Centroid method

- The distance between two clusters is the distance between the two centroids
- Centroids are the cluster averages for each of the variables
  - Each cluster is defined by a single set of coordinates, the averages of the coordinates of all individual observations belonging to that cluster
- Difference between the centroid and the average linkage methods
  - Centroid: computes the average of the coordinates of the observations belonging to an individual cluster
  - Average linkage: computes the average of the distances between two separate clusters

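To make the contrast concrete, the sketch below computes both quantities for the same pair of clusters: the distance between the centroids, and the average of the distances between observations. Names and data are hypothetical.

```python
import math

def centroid(cluster):
    """Average the coordinates first..."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def centroid_distance(c1, c2):
    """...then take one distance between the two averaged points."""
    return math.dist(centroid(c1), centroid(c2))

def average_of_distances(c1, c2):
    """Take all distances between observations first, then average them."""
    d = [math.dist(p, q) for p in c1 for q in c2]
    return sum(d) / len(d)

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

d_centroid = centroid_distance(cluster_a, cluster_b)
d_average = average_of_distances(cluster_a, cluster_b)
```

In this example the two values differ (the average of the distances is larger), which shows that averaging coordinates and averaging distances are not interchangeable.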

## Non-hierarchical clustering: K-means method

1. The number k of clusters is fixed
2. An initial set of k “seeds” (aggregation centres) is provided
   - First k elements
   - Other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary

Units can be reassigned in successive steps (optimising partitioning).

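The k-means steps can be sketched in a few lines. This is a simplified illustration: it takes the first k observations as seeds, ignores the threshold mentioned in step 3, and keeps the old centre if a cluster becomes empty. These choices, the function names and the data are all assumptions made for the example.

```python
import math

def mean_point(points):
    """Average of the coordinates of a list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def k_means(points, seeds, max_iter=100):
    centres = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every unit to the nearest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Step 5: stop when no unit is reclassified.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute the centres from the current members.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old centre if empty
                centres[c] = mean_point(members)
    return assignment, centres

# Six hypothetical observations forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
seeds = points[:2]                           # "first k elements" with k = 2
assignment, centres = k_means(points, seeds)
```

Both seeds start inside the same group, so the first pass puts the second observation in the wrong cluster; it is reallocated in the next iteration once the centres have been recomputed.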

## Non-hierarchical threshold methods

- Sequential threshold methods
  - A prior threshold is fixed and units within that distance are allocated to the first seed
  - A second seed is selected and the remaining units are allocated, etc.
- Parallel threshold methods
  - More than one seed is considered simultaneously
- When reallocation is possible after each stage, the methods are termed optimizing procedures

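A sequential threshold method might look like the sketch below. The slides do not say how each seed is picked, so the rule used here (the first unit not yet allocated becomes the next seed) is an assumption, as are the function name and the data.

```python
import math

def sequential_threshold(points, threshold):
    """Allocate units to one seed at a time, using a fixed distance threshold."""
    remaining = list(points)
    clusters = []
    while remaining:
        seed = remaining[0]                  # assumed rule for choosing the seed
        members = [p for p in remaining if math.dist(p, seed) <= threshold]
        clusters.append(members)
        # Units beyond the threshold wait for a later seed; no reallocation.
        remaining = [p for p in remaining if math.dist(p, seed) > threshold]
    return clusters

# Five hypothetical observations and a made-up threshold.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0), (0.0, 1.0)]
clusters = sequential_threshold(points, threshold=1.5)
```

Because units are never moved once allocated, this sketch is not an optimizing procedure in the sense used above.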

## Hierarchical vs. non-hierarchical methods

Hierarchical methods

- No decision about the number of clusters
- Problems when data contain a high level of error
- Can be very slow, preferable with small data sets
- Initial decisions are more influential (one-step only)
- At each step they require computation of the full proximity matrix

Non-hierarchical methods

- Faster, more reliable, work with large data sets
- Need to specify the number of clusters
- Need to set the initial seeds
- Only cluster distances to seeds need to be computed in each iteration


## The number of clusters c

- Two alternatives
  - Determined by the analysis
  - Fixed by the researchers
- In segmentation studies, c represents the number of potential separate segments
- Preferable approach: “let the data speak”
  - Hierarchical approach and optimal partition identified through statistical tests (stopping rule for the algorithm)
  - However, the detection of the optimal number of clusters is subject to a high degree of uncertainty
- If the research objectives allow a choice rather than estimating the number of clusters, non-hierarchical methods are the way to go


## Example: fixed number of clusters

- A retailer wants to identify several shopping profiles in order to activate new and targeted retail outlets
- The budget only allows him to open three types of outlets
- A partition into three clusters follows naturally, although it is not necessarily the optimal one
- Fixed number of clusters and (k-means) non-hierarchical approach


## Example: c determined from the data

- Clustering of shopping profiles is expected to detect a new market niche
- For market segmentation purposes, it is less advisable to constrain the analysis to a fixed number of clusters
  - A hierarchical procedure allows one to explore all potentially valid numbers of clusters
  - For each of them there are some statistical diagnostics to pinpoint the best partition
- What is needed is a stopping rule for the hierarchical algorithm, which determines the number of clusters at which the algorithm should stop
- Statistical tests are not always unambiguous, leaving some room for the researcher’s experience and discretion
- Statistical rigour should be balanced with the knowledge gained from, and the interpretability of, the final classification


## Determining the optimal number of clusters from hierarchical methods

- Graphical
  - dendrogram
  - scree diagram
- Statistical
  - Arnold’s criterion
  - pseudo F statistic
  - pseudo t² statistic
  - cubic clustering criterion (CCC)


## Dendrogram

[Figure: SPSS dendrogram, "Rescaled Distance Cluster Combine" (scale 0 to 25), for cases 231, 275, 145, 181, 333, 117, 336, 337, 209, 431 and 178]

- The listed case numbers are the individual cases
- The dotted line represents the distance between clusters
- Case 231 and case 275 are merged, and the merging distance is relatively small
- As the algorithm proceeds, the merging distances become larger


## Scree diagram

[Figure: scree diagram with the number of clusters (11 down to 1) on the x-axis and the merging distance (0 to 12) on the y-axis]

- When one moves from 7 to 6 clusters, the merging distance increases noticeably


## Statistical tests

- The rationale is that in the optimal partition, variability within clusters should be as small as possible, while variability between clusters should be maximized
- This principle is similar to the ANOVA F test
- However, since hierarchical algorithms proceed sequentially, the probability distribution of statistics relating variability within and variability between is unknown and differs from the F distribution


## Statistical criteria to detect the optimal partition

- Arnold’s criterion: find the minimum of the determinant of the within-cluster sum of squares matrix W
- Pseudo F, CCC and pseudo t²: the ideal number of clusters should correspond to
  - a local maximum for the pseudo F and CCC, and
  - a small value of the pseudo t² which increases in the next step (preferably a local minimum)
- These criteria are rarely consistent with each other, so the researcher should also rely on meaningful (interpretable) criteria
- Non-parametric methods (SAS) also allow one to determine the number of clusters
  - k-th nearest neighbour method:
    - the researcher sets a parameter (k)
    - for each k the method returns the optimal number of clusters
    - if this optimal number is the same for several values of k, then the determination of the number of clusters is relatively robust

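The slides do not give a formula for the pseudo F. A common definition, assumed here, divides the between-cluster sum of squares (with c-1 degrees of freedom) by the within-cluster sum of squares (with n-c degrees of freedom), in analogy with the ANOVA F ratio. The sketch below applies it to two hypothetical partitions of the same four points; the function names and data are illustrative.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def pseudo_f(clusters):
    """(between SS / (c - 1)) / (within SS / (n - c)) for a given partition."""
    all_points = [p for cluster in clusters for p in cluster]
    n, c = len(all_points), len(clusters)
    grand = centroid(all_points)
    within = sum(squared_distance(p, centroid(cl)) for cl in clusters for p in cl)
    between = sum(len(cl) * squared_distance(centroid(cl), grand) for cl in clusters)
    return (between / (c - 1)) / (within / (n - c))

# Same four hypothetical points, grouped sensibly and grouped badly.
good_partition = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
poor_partition = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

f_good = pseudo_f(good_partition)
f_poor = pseudo_f(poor_partition)
```

The partition that separates the two distant groups gets a much larger value, which is the behaviour the criterion relies on when comparing candidate numbers of clusters.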

## Suggested approach: 2-step procedure

1. First perform a hierarchical method to define the number of clusters
2. Then use the k-means procedure to actually form the clusters

- The reallocation problem
  - Rigidity of hierarchical methods: once a unit is classified into a cluster, it cannot be moved to other clusters in subsequent steps
  - The k-means method allows a reclassification of all units in each iteration
- If some uncertainty about the number of clusters remains after running the hierarchical method, one may also run several k-means clustering procedures and apply the previously discussed statistical tests to choose the best partition


## The SPSS two-step procedure

- The observations are preliminarily aggregated into clusters using a hybrid hierarchical procedure named cluster feature tree
- This first step produces a number of pre-clusters, which is higher than the final number of clusters, but much smaller than the number of observations
- In the second step, a hierarchical method is used to classify the pre-clusters, obtaining the final classification
- During this second clustering step, it is possible to determine the number of clusters
- The user can either fix the number of clusters or let the algorithm search for the best one according to information criteria, which are also based on goodness-of-fit measures


## Evaluation and validation

- Goodness-of-fit of a cluster analysis
  - ratio between the sum of squared errors and the total sum of squared errors (similar to R²)
  - root mean standard deviation within clusters
- Validation: if the identified cluster structure (number of clusters and cluster characteristics) is real, it should not be c
- Validation approaches
  - Use different samples to check whether the final output is similar
  - Split the sample into two groups when no other samples are available
  - Check for the impact of initial seeds / order of cases (hierarchical approach) on the final partition
  - Check for the impact of the selected clustering method

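One hedged way to write out the R²-like goodness-of-fit ratio is given below. It assumes the ratio is read as the share of total variability explained by the partition; the symbols $SS_W$, $SS_B$ and $SS_T$ (within-cluster, between-cluster and total sums of squares) are notation introduced here, not taken from the slides.

```latex
R^2 = \frac{SS_B}{SS_T} = 1 - \frac{SS_W}{SS_T},
\qquad
SS_T = \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2,
\qquad
SS_W = \sum_{g=1}^{c} \sum_{i \in g} \lVert x_i - \bar{x}_g \rVert^2
```

Here $\bar{x}$ is the overall centroid, $\bar{x}_g$ is the centroid of cluster $g$, $n$ is the number of observations and $c$ the number of clusters. Values close to 1 indicate that most of the variability lies between clusters.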

## Cluster analysis in SPSS

- Three types of cluster analysis are available in SPSS


## Hierarchical cluster analysis

- Variables selected for the analysis
- Statistics required in the analysis
- Graphs (dendrogram)
- Clustering method and options
- Create a new variable with cluster membership for each case


## Statistics

- The agglomeration schedule is a table which shows the steps of the clustering procedure, indicating which cases (clusters) are merged and the merging distance
- The proximity matrix contains all distances between cases (it may be huge)
- Shows the cluster membership of individual cases only for a subset of solutions


## Plots

- The dendrogram shows the clustering process, indicating which cases are aggregated and the merging distance
- With many cases, the dendrogram is
- The icicle plot (which can be restricted to cover a small range of clusters) shows at what stage cases are clustered. The plot is cumbersome and slows down the analysis


## Method

- Choose a hierarchical algorithm
- Choose the type of data (interval, counts, binary) and the appropriate measure
- Specify whether the variables (values) should be standardized before analysis. Z-scores return variables with zero mean and unit variance. Other standardizations are possible. Distance measures can also be transformed

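Z-score standardization can be sketched as follows. The example assumes the sample standard deviation (dividing by n-1); the variable names and values are made up.

```python
import math

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Two hypothetical variables measured on very different scales.
income = [200.0, 400.0, 600.0]
age = [30.0, 40.0, 50.0]

z_income = z_scores(income)
z_age = z_scores(age)
```

After standardization the two variables are on the same scale, so neither dominates a distance computed across both.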

## Cluster memberships

- If the number of clusters has been decided (or at least a range of solutions), it is possible to save the cluster membership for each case into new variables


## The example: agglomeration schedule

| Stage | Number of clusters | Cluster combined: Cluster 1 | Cluster combined: Cluster 2 | Distance | Diff. Dist |
|---|---|---|---|---|---|
| 490 | 10 | 8 | 12 | 544.4 | |
| 491 | 9 | 8 | 11 | 559.3 | 14.9 |
| 492 | 8 | 3 | 7 | 575.0 | 15.7 |
| 493 | 7 | 3 | 366 | 591.6 | 16.6 |
| 494 | 6 | 3 | 6 | 610.6 | 19.0 |
| 495 | 5 | 3 | 37 | 636.6 | 26.0 |
| 496 | 4 | 13 | 23 | 663.7 | 27.1 |
| 497 | 3 | 3 | 13 | 700.8 | 37.1 |
| 498 | 2 | 1 | 8 | 754.1 | 53.3 |
| 499 | 1 | 1 | 3 | 864.2 | 110.2 |

- Last 10 stages of the process (10 to 1 clusters)
- As the algorithm proceeds towards the end, the distance increases


## Scree diagram

[Figure: scree diagram with the number of clusters (7 down to 1) on the x-axis and the merging distance (590 to 840) on the y-axis; a possible elbow is marked "Elbow?"]

- The scree diagram (not provided by SPSS but created from the agglomeration schedule) shows a larger distance increase when the cluster number goes below 4

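Since SPSS does not draw the scree diagram, its data have to be derived from the agglomeration schedule. The sketch below recomputes the increase in merging distance at each stage from the distances reported in the schedule; the variable names are illustrative.

```python
# (number of clusters, merging distance) for the last 10 stages of the schedule.
schedule = [
    (10, 544.4), (9, 559.3), (8, 575.0), (7, 591.6), (6, 610.6),
    (5, 636.6), (4, 663.7), (3, 700.8), (2, 754.1), (1, 864.2),
]

# Increase in merging distance when moving to each number of clusters.
increases = [
    (n_after, round(d_after - d_before, 1))
    for (_, d_before), (n_after, d_after) in zip(schedule, schedule[1:])
]

# The number of clusters reached with the largest jump in merging distance.
largest_jump = max(increases, key=lambda item: item[1])
```

Plotting the distances against the number of clusters gives the scree diagram, and the `increases` list reproduces the "Diff. Dist" column. The last recomputed difference is 110.1 rather than the 110.2 shown in the table, presumably because the printed distances are rounded.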

## Hierarchical solution with 4 clusters (Ward method)

| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Total |
|---|---|---|---|---|---|
| Case Number (N %) | 26.6% | 20.2% | 23.8% | 29.4% | 100.0% |
| Household size (mean) | 1.4 | 3.2 | 1.9 | 3.1 | 2.4 |
| Gross current income of household (mean) | 238.0 | 1158.9 | 333.8 | 680.3 | 576.9 |
| Age of Household Reference Person (mean) | 72 | 44 | 40 | 48 | 52 |
| EFS: Total Food & non-alcoholic beverage (mean) | 28.8 | 64.4 | 29.2 | 60.6 | 45.4 |
| EFS: Total Clothing and Footwear (mean) | 8.8 | 64.3 | 9.2 | 19.0 | 23.1 |
| EFS: Total Housing, Water, Electricity (mean) | 25.1 | 77.7 | 33.5 | 39.1 | 41.8 |
| EFS: Total Transport costs (mean) | 17.7 | 147.8 | 24.6 | 57.1 | 57.2 |
| EFS: Total Recreation (mean) | 29.6 | 146.2 | 39.4 | 63.0 | 65.3 |

## K-means solution (4 clusters)

- Variables
- Number of clusters (fixed)
- Ask for one (classify only) or more iterations before stopping the algorithm
- It is possible to read a file with initial seeds or write final seeds to a file


## K-means options

- Improve the algorithm by allowing for more iterations and running means (seeds are recomputed at each stage)
- Creates a new variable with cluster membership for each case
- More options, including an ANOVA table with statistics


## Results from k-means (initial seeds chosen by SPSS)

Final Cluster Centers

| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| Household size | 2.0 | 2.0 | 2.8 | 3.2 |
| Gross current income of household | 264.5 | 241.1 | 791.2 | 1698.1 |
| Age of Household Reference Person | 56 | 75 | 46 | 45 |
| EFS: Total Food & non-alcoholic beverage | 37.3 | 22.2 | 54.1 | 66.2 |
| EFS: Total Clothing and Footwear | 14.0 | 28.0 | 31.7 | 48.4 |
| EFS: Total Housing, Water, Electricity | 34.7 | 100.3 | 47.3 | 64.5 |
| EFS: Total Transport costs | 28.4 | 10.4 | 78.3 | 156.8 |
| EFS: Total Recreation | 39.6 | 3013.1 | 74.4 | 125.9 |

Number of Cases in each Cluster

| Cluster | Number of cases |
|---|---|
| 1 | 292 |
| 2 | 1 |
| 3 | 155 |
| 4 | 52 |
| Valid | 500 |
| Missing | 0 |

The k-means algorithm is sensitive to outliers, and SPSS chose an improbable amount for recreation expenditure as an initial seed for cluster 2 (probably an outlier due to misrecording or an exceptional expenditure).


## Results from k-means: initial seeds from hierarchical clustering

| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Total |
|---|---|---|---|---|---|
| Case Number (N %) | 32.6% | 10.2% | 33.6% | 23.6% | 100.0% |
| Household size (mean) | 1.7 | 3.1 | 2.5 | 2.9 | 2.4 |
| Gross current income of household (mean) | 163.5 | 1707.3 | 431.8 | 865.9 | 576.9 |
| Age of Household Reference Person (mean) | 60 | 45 | 50 | 46 | 52 |
| EFS: Total Food & non-alcoholic beverage (mean) | 31.3 | 65.5 | 45.1 | 56.8 | 45.4 |
| EFS: Total Clothing and Footwear (mean) | 12.3 | 48.4 | 19.1 | 32.7 | 23.1 |
| EFS: Total Housing, Water, Electricity (mean) | 29.8 | 65.3 | 41.9 | 48.1 | 41.8 |
| EFS: Total Transport costs (mean) | 24.6 | 156.8 | 37.4 | 87.5 | 57.2 |
| EFS: Total Recreation (mean) | 30.3 | 126.8 | 67.9 | 83.4 | 65.3 |

The first cluster is now larger, but it still represents older and poorer households. The other clusters are not very different from the ones obtained with the Ward algorithm, indicating a certain robustness of the results.


## 2-step clustering

- It is possible to make a distinction between categorical and continuous variables
- The search for the optimal number of clusters may be constrained
- This is the information criterion to choose the optimal partition
- One may also request descriptive stats


## Options

- An option to control for outliers (OLs), because the analysis is usually sensitive to OLs
- It is possible to choose which variables should be standardized prior to running the analysis
- Options are available for better control of the procedure


## Output

Results are not satisfactory:

- With no prior decision on the number of clusters, two clusters are found, one with a single observation and the other with the remaining 499 observations
- Allowing for outlier treatment does not improve results
- Setting the number of clusters to four produces these results

Cluster Distribution

| Cluster | N | % of Combined | % of Total |
|---|---|---|---|
| 1 | 2 | 0.4% | 0.4% |
| 2 | 5 | 1.0% | 1.0% |
| 3 | 490 | 98.2% | 98.2% |
| 4 | 2 | 0.4% | 0.4% |
| Combined | 499 | 100.0% | 100.0% |
| Total | 499 | | 100.0% |

It seems that two-step clustering is biased towards finding a macro-cluster. This might be due to the fact that the number of observations is relatively small, but the combination of the Ward algorithm with the k-means algorithm is more effective.


## SAS cluster analysis

- Compared to SPSS, SAS provides more diagnostics and the option of non-parametric clustering through three SAS/STAT procedures
  - the procedures CLUSTER and VARCLUS (for hierarchical and the k-th neighbour methods)
  - the procedure FASTCLUS (for non-hierarchical methods)
  - the procedure MODECLUS (for non-parametric methods)


## Discussion

- It might seem that cluster analysis is too sensitive to the researcher’s choices
  - This is partly due to the relatively small data set and possibly to correlation between variables
- However, all outputs point to a segment with older and poorer households and another with younger and larger households, with high expenditures
- By intensifying the search and adjusting some of the properties, cluster analysis does help to identify homogeneous groups
- “Moral”: cluster analysis needs to be adequately validated, and it may be risky to run a single cluster analysis and take the results as truly informative, especially in the presence of outliers