Cluster analysis and categorical data
Hana Řezanková
Vysoká škola ekonomická v Praze, Praha
1. Introduction
Methods of cluster analysis are placed between statistics and informatics. They play an
important role in the area of data mining. The main aim of cluster analysis is to assign
objects into groups (clusters) in such a way that two objects from the same cluster are more
similar than two objects from different clusters. We can consider respondents in market
research, firms, states or products as the objects. The similarity is investigated on the basis
of certain features (variables), which can be quantitative (length of a certain activity) or
qualitative (evaluation of respondent relationships to the employer, qualitative features of
the products). These variables are often denoted as categorical, see below.
The aim of this paper is to present some approaches to clustering in categorical data.
Whereas methods for cluster analysis of quantitative data are currently implemented in all
software packages for statistical analysis and data mining, and the differences among them
in this area are small, the differences in implementation of methods for clustering in qualitative
data are substantial. Both special methods designed for clustering such a type of data
and advantages of some statistical software packages (S-PLUS, SPSS, STATISTICA,
SYSTAT) in this area are presented.
2. Methods of cluster analysis
For the methods of multivariate statistical analysis, vectors of observations (vectors
of values of individual variables) form the base. Clustering of observation vectors (objects)
is the most frequently used application. However, clusters of variables can also be created, or
objects and variables can be clustered simultaneously. Moreover, categories of a
qualitative variable can be clustered on the basis of a contingency table.
There are different ways of classifying the cluster analysis methods. We can distinguish partitioning methods (denoted as flat), which optimize the assignment of objects into a certain number of clusters, and methods of hierarchical cluster analysis with graphical outputs, which make assignment of objects into different numbers of clusters possible. In the first group, k-centroids and k-medoids methods are used for disjunctive clustering.
The former is based on an initial assignment of the objects into k clusters. For this purpose, k initial centroids are selected, which are the centers of the clusters. Different approaches are applied for selection of the initial centroids; for example, the first k objects can be used. After that, the distances of each object from all centers are calculated. The object is assigned to the closest centroid. Further, the elements of the new centroids are computed; usually they are the average values of individual variables. Then the distances of each object from all centroids are calculated again. If an object is closer to the centroid of any other cluster, it is moved to that cluster. This process is repeated as long as any object can be moved. If the centroid is created from average values of individual variables, the method is called k-means. If the centroid is created from medians, the method is called k-medians. In the first case, the Euclidean distance (see [13]) is used. However, some software systems (SYSTAT) also offer further measures (see below). In the k-medoids method, a certain vector of observations is taken for the center of the cluster.
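The k-means procedure described above can be sketched as follows. This is a minimal illustration, not the implementation of any of the mentioned packages; the function name `k_means`, the `max_iter` safeguard and the sample data are assumptions made for the example, and the first k objects are taken as initial centroids.

```python
from math import dist

def k_means(objects, k, max_iter=100):
    """Assign objects to k clusters; the first k objects are the initial centroids."""
    centroids = [list(obj) for obj in objects[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every object to its closest centroid (Euclidean distance).
        new_labels = [min(range(k), key=lambda h: dist(obj, centroids[h]))
                      for obj in objects]
        if new_labels == labels:  # no object moved, so the process stops
            break
        labels = new_labels
        # Recompute each centroid as the vector of variable means.
        for h in range(k):
            members = [obj for obj, lab in zip(objects, labels) if lab == h]
            if members:  # an empty cluster keeps its previous centroid
                centroids[h] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical two-dimensional data, used only for illustration.
data = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

Replacing the mean by the median in the centroid update would give a k-medians variant; the result in both cases depends on the initial centroids.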
Methods of hierarchical cluster analysis can be agglomerative (step-by-step clustering of objects and groups into larger groups) or divisive (step-by-step splitting of the whole set of objects into smaller subsets and individual objects). Further, we can distinguish monothetic clustering (only one variable is considered in individual steps) and polythetic clustering (all variables are considered in individual steps).
Methods of hierarchical cluster analysis (as well as some other methods) are based on the proximity matrix. This matrix is symmetric with zeros on the diagonal. The off-diagonal values express dissimilarities for the corresponding pairs of objects, variables or categories. These dissimilarities are the values of certain coefficients, or they are derived from the values of similarity coefficients. For example, if we consider a similarity measure $S$, then the dissimilarity measure is obtained by subtracting this value from one, i.e. $D = 1 - S$.
More information and examples of the methods of cluster analysis can be found in
books [2], [5], [9] and [13]. Implementation in the SAS system is described in [14].
3. Categorical data
Categorical variables are characterized by values which are categories. Two main types of these variables can be distinguished: dichotomous, for which there are only two categories, and multi-categorical.
Dichotomous variables are often coded by the values zero and one. For similarity measuring it is necessary to take into account whether the variables are symmetric or asymmetric. In the first case, both categories have the same importance (male, female). In the second case, one category is more important (presence of a word in a textual document is more important than its absence).
Multi-categorical variables can be classified into three types: nominal, ordinal and quantitative. Unlike the other types, categories of nominal variables cannot be ordered (from the point of view of intensity etc.). Categories of ordinal variables can be ordered, but we usually cannot do arithmetic operations with them (it depends on the relations among categories, see below). We can do arithmetic operations with quantitative variables (number of children). We can apply traditional distance measures in this case and so this type will not be considered in the paper.
For this reason, we will further denote nominal, ordinal and dichotomous variables as categorical. These variables are also called qualitative. We will suppose that dichotomous variables are binary with categories zero and one. The same similarity measures are usually used for clustering of both objects and variables in the case of binary data.
3/2OO9
If binary variables are symmetric, one can apply the same measures as for quantitative data. Moreover, many specific coefficients have been proposed for this kind of data, as well as for data files with asymmetric binary variables.
If there are no special means for clustering multi-categorical data in a software package, then transformation of the data file to a file with binary data is usually needed. The distinction between nominal and ordinal types is necessary.
First, we will mention the data file with nominal variables. In contrast to classification tasks involving a target variable (regression and discriminant analyses, decision trees), the number of dummy variables must be equal to the number of categories, see Table 1. In this way it is guaranteed that one can obtain only two possible values of similarity: one for matched categories and the other for unmatched categories.
Table 1
Recoding of the nominal variable School for three binary variables P1 to P3
There are two ways of transforming ordinal data. The first one consists of transforming the data file to a binary data file. In contrast to the case with nominal variables, k possible values of similarity should be considered, where k is the number of categories. This is guaranteed by the coding shown in Table 2.
Table 2
Recoding of the ordinal variable Reaction for three binary variables P1 to P3
The second way makes use of the fact that values of an ordinal variable can be ordered. Under the assumption of equal distances between categories, arithmetic operations can be done. It is recommended to code categories from 1 to k and divide these codes by the maximum value. In this way, the values will be in the interval from 0 to 1. Then we can apply the techniques designed for quantitative data.
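The recoding options discussed in this section can be sketched in a few lines. The category names below are hypothetical, and the cumulative coding in `cumulative_code` is an assumption about one coding that yields k distinct similarity values for k ordered categories; it is not taken from Table 2.

```python
def dummy_code(value, categories):
    """Nominal coding: one binary variable per category."""
    return [1 if value == c else 0 for c in categories]

def cumulative_code(value, ordered_categories):
    """Assumed ordinal coding: the r-th category gets r ones followed by zeros."""
    rank = ordered_categories.index(value) + 1
    return [1 if pos < rank else 0 for pos in range(len(ordered_categories))]

def scaled_code(value, ordered_categories):
    """Codes 1..k divided by the maximum code k."""
    return (ordered_categories.index(value) + 1) / len(ordered_categories)

def matching(u, v):
    """Share of binary variables with equal values (simple matching)."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

schools = ["basic", "secondary", "university"]      # hypothetical nominal categories
reactions = ["negative", "neutral", "positive"]     # hypothetical ordered categories

nominal_sims = {matching(dummy_code(a, schools), dummy_code(b, schools))
                for a in schools for b in schools}
ordinal_sims = {matching(cumulative_code(a, reactions), cumulative_code(b, reactions))
                for a in reactions for b in reactions}
print(sorted(nominal_sims))   # two distinct values: matched and unmatched
print(sorted(ordinal_sims))   # three distinct values for three ordered categories
print(scaled_code("neutral", reactions))
```

With the dummy coding, every pair of different categories gets the same similarity, whereas the cumulative coding makes neighboring categories more similar than distant ones.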
4. Object clustering
In the following text, we will consider a simplified case in which all variables are of the
same type (for other cases see [13]).
Binary variables
If objects are only characterized by binary variables, then the usual process consists of creating the proximity matrix, followed by application of hierarchical cluster analysis.
Some software systems offer special measures (the SPSS system offers 26 measures including general ones; the SYSTAT system offers 5 measures). Formulas of these measures are usually expressed by means of frequencies from the contingency table. Let the symbols from Table 3 be given.
Table 3
Two-way frequency table for objects $x_i$ and $x_j$

                 x_j = 1    x_j = 0
    x_i = 1         a          b
    x_i = 0         c          d

In the case of symmetric variables, Sokal and Michener's simple matching coefficient is used, for example. For two objects, it is the ratio of the number of variables with the same values (0 or 1) in both objects to the total number of variables:
$$S_{SM} = \frac{a + d}{a + b + c + d}. \tag{1}$$
The similarity between two objects characterized by asymmetric variables can be measured by Jaccard's coefficient. Its value is expressed as the ratio of the number of variables which have the value 1 for both objects to the number of variables with at least one value equal to 1:
$$S_J = \frac{a}{a + b + c}. \tag{2}$$
Further, we can apply Yule's Q, which is calculated by the formula
$$Q = \frac{ad - bc}{ad + bc}. \tag{3}$$
The publications [11] and [13] provide a more detailed treatment of these measures for
binary variables.
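As a small worked illustration of formulas (1) to (3), the following sketch counts the four frequencies for two binary objects and evaluates the three coefficients. It assumes the usual convention that a counts variables equal to 1 in both objects, d counts variables equal to 0 in both, and b and c count the two kinds of mismatch; the function names and the data are made up for the example.

```python
def binary_table(x, y):
    """Frequencies (a, b, c, d) of the value pairs (1,1), (1,0), (0,1), (0,0)."""
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
    return a, b, c, d

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c, d):
    # Undefined when a + b + c == 0 (no variable equals 1 in either object).
    return a / (a + b + c)

def yule_q(a, b, c, d):
    # Undefined when a*d + b*c == 0.
    return (a * d - b * c) / (a * d + b * c)

x_i = [1, 1, 0, 0, 1, 0]   # hypothetical object
x_j = [1, 0, 0, 1, 1, 0]   # hypothetical object
table = binary_table(x_i, x_j)
print(table, simple_matching(*table), jaccard(*table), yule_q(*table))
```

Jaccard's coefficient ignores the d joint absences, which is why it suits asymmetric variables, while the simple matching coefficient treats both categories alike.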
However, general measures can also be applied. For example, Euclidean distance and the coefficient of disagreement (designed for data files with nominal variables, see below) can be used. The latter is the complement of the simple matching coefficient to the value 1. Further, the gamma coefficient (designed for clustering ordinal variables, see below) is a measure suitable for this purpose. In the case of binary variable analysis, it is called Yule's Q, see formula (3).
The coefficient of disagreement is provided by the SYSTAT and STATISTICA systems; the gamma coefficient is provided by SYSTAT. In addition, a proximity matrix created by other means can serve as an input for hierarchical cluster analysis. The SYSTAT system provides a possibility to create such matrices on the basis of 13 measures applicable to binary variables.
Monothetic divisive cluster analysis can be applied to objects characterized by symmetric binary variables. It starts from one cluster which is split into two clusters. Any variable can serve for this purpose (one group will contain the objects with ones in this variable, the second group the objects with zeros). If we denote the number of variables as m, then m possibilities exist for splitting a data file into two groups of objects. For the next splitting, m – 1 possibilities exist, etc. The criterion for splitting is based on measurement of dependency of two variables. This method is called MONA (MONothetic Analysis) in [8] and in the S-PLUS system. In this algorithm, the measure
$$A_{kl} = |a_{kl} d_{kl} - b_{kl} c_{kl}|$$
is used for evaluation of dependency between the k-th and l-th variables, where $a_{kl}$, $b_{kl}$, $c_{kl}$ and $d_{kl}$ are frequencies in the contingency table created for these variables. For each l-th variable the value
$$A_l = \sum_{k \neq l} A_{kl}$$
is calculated. The objects are split according to the variable for which the maximum of these values is achieved.
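One splitting step of this kind can be sketched as follows. The sketch assumes the association measure $|ad - bc|$ between two binary variables, summed over all other variables, as the splitting criterion; the function names and the small data set are hypothetical, and ties are resolved in favor of the first variable.

```python
def association(col_k, col_l):
    """|a*d - b*c| from the 2x2 table of two binary variables."""
    pairs = list(zip(col_k, col_l))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return abs(a * d - b * c)

def best_split_variable(objects):
    """Index of the variable with the largest total association, plus all totals."""
    columns = list(zip(*objects))
    m = len(columns)
    totals = [sum(association(columns[k], columns[l]) for k in range(m) if k != l)
              for l in range(m)]
    best = max(range(m), key=totals.__getitem__)
    return best, totals

def split(objects, var):
    """Objects with a one in the chosen variable, and objects with a zero."""
    ones = [o for o in objects if o[var] == 1]
    zeros = [o for o in objects if o[var] == 0]
    return ones, zeros

# Hypothetical data: six objects described by three binary variables.
objects = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0), (0, 0, 1)]
best, totals = best_split_variable(objects)
ones, zeros = split(objects, best)
print(best, totals, ones, zeros)
```

Applying the same step recursively to each resulting group, with the used variable excluded, would produce the full divisive hierarchy.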
Further, the k-means and k-medians methods with Yule's Q, see formula (3), can be applied to data files with asymmetric dichotomous variables in SYSTAT.
Nominal variables
A typical process for data files with nominal variables is creation of the proximity matrix on the basis of the simple matching coefficient and application of hierarchical cluster analysis. The simple matching coefficient is the ratio of the number of pairs with the same values in both elements to the total number of variables (when objects are clustered). The Sokal-Michener coefficient (1) is a special case of it. For the i-th and j-th objects it can be written as
$$S_{ij} = \frac{\sum_{l=1}^{m} S_{ijl}}{m},$$
where m is the number of variables, $S_{ijl} = 1 \Leftrightarrow x_{il} = x_{jl}$ (the values of the l-th variable are equal for the i-th and j-th objects), and $S_{ijl} = 0$ otherwise. Dissimilarity is the complement of the simple matching coefficient to the value 1, i.e. $D_{ij} = 1 - S_{ij}$. This coefficient of disagreement expresses the ratio of the number of pairs with distinct values to the total number of variables (it is implemented in the STATISTICA and SYSTAT systems).
Another measure of the relationship between two objects (and also between two clusters) is the log-likelihood distance measure. Its implementation in software systems is linked with two-step cluster analysis in SPSS. This method has been designed for clustering a large number of objects and it is based on the BIRCH method, which uses the principle of trees, see [15] and [16]. The log-likelihood distance measure is determined for data files with combinations of quantitative and qualitative variables. Dissimilarity is expressed on the basis of variability, whereas the entropy is applied to categorical variables. For the l-th variable in the g-th cluster, the formula for the entropy can be written as
$$H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g},$$
where $K_l$ is the number of categories of the l-th variable, $n_{glu}$ represents the frequency of the u-th category of the l-th variable in the g-th cluster, and $n_g$ is the number of objects in the g-th cluster. Two objects are the most similar if the cluster composed of them has the smallest entropy.
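The entropy of one categorical variable within a cluster can be computed directly from the category frequencies. The sketch below only illustrates this entropy term; it is not the full log-likelihood distance used by two-step cluster analysis, and the helper `pair_entropy` (the entropy of a two-object cluster summed over variables) is an assumption made for illustration.

```python
from collections import Counter
from math import log

def cluster_entropy(values):
    """Entropy of one categorical variable within a cluster.

    Categories with zero frequency are absent from the counter and contribute nothing.
    """
    n_g = len(values)
    return -sum((n / n_g) * log(n / n_g) for n in Counter(values).values())

def pair_entropy(x, y):
    """Entropy of the cluster {x, y}, summed over all variables."""
    return sum(cluster_entropy([a, b]) for a, b in zip(x, y))

mixed = cluster_entropy(["a", "a", "b", "b"])   # two equally frequent categories
pure = cluster_entropy(["a", "a", "a", "a"])    # a single category
print(mixed, pure)
print(pair_entropy(("red", "small"), ("red", "large")))
```

For a pair of objects, each variable with differing categories adds ln 2 to the entropy, so the pair with the fewest mismatches has the smallest entropy.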
In addition to the mentioned techniques, other specific methods exist for clustering objects characterized by nominal variables. There are both k-clustering methods and modifications of the hierarchical approaches. The k-means and k-medians methods are the basis for the former. They assume that each variable has values $v_{lu}$ ($u = 1, 2, ..., K_l$). Each cluster is represented by an m-dimensional vector which contains either the categories with the highest frequencies (in the k-modes method, see [6] and [7]), or the frequencies of all categories of the individual variables (in the k-histograms method, see [4]). These vectors are special types of centroids. Specific dissimilarity measures are applied. In the case of the k-modes algorithm, a measure based on the simple matching coefficient is used. However, as in the case of clustering by the k-means algorithm, we obtain only a locally optimal solution, which depends on the order of objects in the data file.
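The k-modes idea can be sketched by replacing means with modes and Euclidean distance with the number of mismatching categories. This is a simplified illustration rather than the algorithm exactly as published in [6] and [7]; taking the first k objects as initial modes and the sample data are assumptions.

```python
from collections import Counter

def mismatches(x, mode):
    """Number of variables in which the object differs from the mode."""
    return sum(1 for a, b in zip(x, mode) if a != b)

def k_modes(objects, k, max_iter=100):
    modes = [tuple(o) for o in objects[:k]]   # assumption: first k objects as initial modes
    labels = None
    for _ in range(max_iter):
        # Assign each object to the mode with the fewest mismatches.
        new_labels = [min(range(k), key=lambda h: mismatches(o, modes[h]))
                      for o in objects]
        if new_labels == labels:
            break
        labels = new_labels
        # Update each mode with the most frequent category of every variable.
        for h in range(k):
            members = [o for o, lab in zip(objects, labels) if lab == h]
            if members:
                modes[h] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels, modes

# Hypothetical objects described by three nominal variables.
data = [("red", "small", "round"), ("blue", "large", "square"),
        ("red", "small", "square"), ("blue", "large", "round"),
        ("red", "large", "round")]
labels, modes = k_modes(data, k=2)
print(labels, modes)
```

Reordering `data` changes the initial modes and may change the result, which illustrates the dependence on the order of objects mentioned above.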
ROCK and CACTUS are additional special methods. The ROCK (RObust Clustering using linKs) algorithm, see [3], is based on the principle of hierarchical clustering. First, a random sample of objects is chosen. These objects are clustered into the desired number of clusters, and then the remaining objects are assigned to the created clusters. The method uses a graph concept, whose main terms are neighbors and links. A neighbor of a certain object is an object whose similarity with the investigated object is equal to or greater than a predefined threshold. A link between two objects is the number of common neighbors of these objects. The principle of the ROCK method lies in maximization of a function which takes into account both maximization of sums of links for objects from the same cluster and minimization of sums of links for objects from different clusters.
Let us denote by $S(x_i, x_j)$ the similarity measure between objects $x_i$ and $x_j$; this measure can achieve values between 0 and 1. If we define the threshold T in such a way that $T \in \langle 0; 1 \rangle$, then the objects $x_i$ and $x_j$ are neighbors if the condition $S(x_i, x_j) \geq T$ is satisfied. For binary data, Jaccard's similarity coefficient, see formula (2), is used in the algorithm. The similarity in the case of data files with multi-categorical variables is investigated on the same principle. If a value is missing, the corresponding variable is omitted from the comparison.
The second notion is a link, i.e., the number of common neighbors of objects $x_i$ and $x_j$. It will be denoted as link($x_i$, $x_j$) in the text that follows. The greater the value of the link, the greater the probability that objects $x_i$ and $x_j$ belong to the same cluster. The resulting clusters are determined by maximization of the function
$$E = \sum_{h=1}^{k} n_h \sum_{x_i, x_j \in C_h} \frac{\mathrm{link}(x_i, x_j)}{n_h^{1 + 2 f(T)}},$$
where k is the number of clusters and $n_h$ is the size of cluster $C_h$. Each object belonging to the h-th cluster has approximately $n_h^{f(T)}$ neighbors in this cluster; for binary data, the $f(T)$ function is determined by the formula
$$f(T) = \frac{1 - T}{1 + T}, \quad T \in \langle 0; 1 \rangle.$$
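The value $n_h^{1 + 2 f(T)}$ is the expected number of links between pairs of objects in the h-th cluster. The merging of clusters $C_h$ and $C_{h'}$ is realized by means of the measure
$$g(C_h, C_{h'}) = \frac{\mathrm{link}(C_h, C_{h'})}{(n_h + n_{h'})^{1 + 2 f(T)} - n_h^{1 + 2 f(T)} - n_{h'}^{1 + 2 f(T)}},$$
where link($C_h$, $C_{h'}$) is the sum of links over all pairs of objects with one object from each of the two clusters.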
The pair most suitable for merging is the pair of clusters for which this measure attains its maximum value. In the final phase, the remaining objects are assigned to the created clusters. From each h-th cluster, a set of objects is selected according to which the remaining objects should be assigned (this set will be denoted as $L_h$ and the number of objects in this set as $|L_h|$). Each remaining object is assigned to the cluster in which it has the most neighbors from this set after normalization. If we denote the number of neighbors in the $L_h$ set as $N_h$, then the object is assigned to the cluster for which the value of the expression
$$\frac{N_h}{(|L_h| + 1)^{f(T)}}$$
is maximal, where $(|L_h| + 1)^{f(T)}$ is the expected number of neighbors of the compared object in the $L_h$ set.
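The notions of neighbors, links and the merging measure can be illustrated on a few binary objects. The sketch below is not a complete ROCK implementation (it omits sampling, the merging loop and the final assignment); it assumes Jaccard's coefficient as the similarity, $f(T) = (1 - T)/(1 + T)$, and that an object is not counted as its own neighbor. All names and data are hypothetical.

```python
def jaccard(x, y):
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0

def neighbor_sets(objects, T):
    """For each object, the indices of other objects with similarity >= T."""
    n = len(objects)
    return [{j for j in range(n)
             if j != i and jaccard(objects[i], objects[j]) >= T}
            for i in range(n)]

def link(nbrs, i, j):
    """Number of common neighbors of objects i and j."""
    return len(nbrs[i] & nbrs[j])

def f(T):
    return (1 - T) / (1 + T)

def goodness(nbrs, cluster_a, cluster_b, T):
    """Cross-cluster links divided by their expected number."""
    cross = sum(link(nbrs, i, j) for i in cluster_a for j in cluster_b)
    e = 1 + 2 * f(T)
    n_a, n_b = len(cluster_a), len(cluster_b)
    return cross / ((n_a + n_b) ** e - n_a ** e - n_b ** e)

objects = [(1, 1, 1, 0, 0, 0), (1, 1, 0, 0, 0, 0), (1, 1, 1, 1, 0, 0),
           (0, 0, 0, 0, 1, 1), (0, 0, 0, 1, 1, 1)]
T = 0.5
nbrs = neighbor_sets(objects, T)
print(nbrs)
print(goodness(nbrs, [0], [1], T), goodness(nbrs, [0], [3], T))
```

In an agglomerative loop, the pair of clusters with the largest `goodness` value would be merged first; here objects 0 and 1 share a neighbor, while objects 0 and 3 share none.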
The CACTUS (CAtegorical ClusTering Using Summaries) algorithm, see [1], is based on the idea of common occurrences of certain categories of different variables. If the difference between the number of common occurrences of the categories $v_{kt}$ and $v_{lu}$ of the k-th and l-th variables and the expected frequency (on the assumption of uniform distribution within the categories of the remaining variables, and the assumption of independence) is greater than a user-defined threshold, the categories are strongly connected. The algorithm has three phases: summarization, clustering and verification. During clustering, the candidates for clusters are chosen, from which the final clusters are determined in the verification phase.
Ordinal variables
Among the specialized methods we can recall the k-medians method, in which the vectors of medians of the individual variables are used as centroids. Application of the Manhattan distance (city block distance) is recommended, which is defined as
$$D_{ij} = \sum_{l=1}^{m} |x_{il} - x_{jl}|$$
for vectors $x_i$ and $x_j$. In the SYSTAT system, the gamma coefficient can be used. It will be described in the following section in connection with measurement of ordinal variable similarities.
5. Variable clustering
Clustering of categorical variables is usually realized by application of hierarchical cluster analysis to a proximity matrix created on the basis of suitable similarity measures.
Binary variables
If variables are binary and symmetric, then one can use both the simple matching coefficient, see formula (1), and Pearson's correlation coefficient, which can be expressed (with symbols from Table 3) as
$$r = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}. \tag{4}$$
For asymmetric variables, the gamma coefficient can be applied, for example. It is called Yule's Q in binary data analysis, see formula (3). Moreover, some other specific coefficients and proximity matrices created by different means can be used.
Nominal variables
For determination of dissimilarity of nominal variables, the coefficient of disagreement is offered in some software packages (STATISTICA, SYSTAT). It expresses the ratio of the number of pairs of different values to the total number of objects. It can be calculated from the simple matching coefficient by subtracting it from one. For the k-th and l-th variables, the simple matching coefficient can be expressed as
$$S_{kl} = \frac{\sum_{i=1}^{n} S_{kli}}{n},$$
where n is the number of objects and $S_{kli} = 1 \Leftrightarrow x_{ik} = x_{il}$ (the values of the k-th and l-th variables are the same for the i-th object). The disagreement coefficient is then calculated according to the formula $D_{kl} = 1 - S_{kl}$.
Theoretically, there is a wider range of possibilities because symmetric measures of dependency can be used. They do not occur in the procedures for cluster analysis, but a proximity matrix created by different means can serve as a basis for the analysis. The well-known measures are derived from Pearson's chi-square statistic, which is calculated according to the formula
$$\chi^2_{kl} = \sum_{r=1}^{K_k} \sum_{s=1}^{K_l} \frac{(n_{rs} - M_{rs})^2}{M_{rs}}, \tag{5}$$
where $K_k$ is the number of categories of the k-th variable, $K_l$ is the number of categories of the l-th variable, $n_{rs}$ is the frequency in the contingency table (in the r-th row and s-th column), and $M_{rs}$ is the expected frequency under the assumption of independence, i.e.
$$M_{rs} = \frac{n_{r+} n_{+s}}{n},$$
where $n_{r+}$ and $n_{+s}$ are marginal frequencies, expressed as
$$n_{r+} = \sum_{s=1}^{K_l} n_{rs}, \qquad n_{+s} = \sum_{r=1}^{K_k} n_{rs}.$$
This statistic is the basis for the phi coefficient, which is calculated by the formula
$$\Phi_{kl} = \sqrt{\frac{\chi^2_{kl}}{n}}.$$
Further, we can mention Pearson's coefficient of contingency, calculated as
$$C_{kl} = \sqrt{\frac{\chi^2_{kl}}{\chi^2_{kl} + n}}.$$
Cramér's V is another example of this type of similarity coefficient. It is expressed as
$$V_{kl} = \sqrt{\frac{\chi^2_{kl}}{n (q - 1)}},$$
where $q = \min\{K_k, K_l\}$. For two binary variables, the value is the same as the value of the phi coefficient.
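The chi-square statistic and the three coefficients derived from it can be computed from a contingency table as in the following sketch. The table values and function names are made up for illustration, and the code assumes that no row or column of the table is empty.

```python
from math import sqrt

def chi_square(table):
    """Pearson's chi-square statistic and the total frequency n."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for r, row in enumerate(table):
        for s, n_rs in enumerate(row):
            m_rs = row_totals[r] * col_totals[s] / n   # expected frequency
            chi2 += (n_rs - m_rs) ** 2 / m_rs
    return chi2, n

def phi(chi2, n):
    return sqrt(chi2 / n)

def contingency_c(chi2, n):
    return sqrt(chi2 / (chi2 + n))

def cramers_v(chi2, n, q):
    """q is the smaller of the numbers of rows and columns."""
    return sqrt(chi2 / (n * (q - 1)))

table = [[10, 20], [30, 40]]          # hypothetical 2 x 2 contingency table
chi2, n = chi_square(table)
q = min(len(table), len(table[0]))
print(chi2, phi(chi2, n), contingency_c(chi2, n), cramers_v(chi2, n, q))
```

For this 2 x 2 table, q equals 2, so Cramér's V reduces to the phi coefficient, as stated above.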
More symmetric dependency measures can be derived from pairs of asymmetric measures. The contingency table is again the basis. We can distinguish the row variable $X_k$ and the column variable $X_l$. If we investigate the dependency of the column variable $X_l$ on the row variable $X_k$, two situations can occur:
{1} the columns are independent of the rows,
{2} the columns depend on the values of variable $X_k$.
Let us have a new object for which we know the value of variable $X_k$ but do not know the value of variable $X_l$. If we suppose situation {1}, we will estimate the value of variable $X_l$ according to the category with $p_{+Mo} = \max_s (p_{+s})$, where $p_{+s}$ is the column subtotal of relative frequencies ($p_{+s} = n_{+s} / n$). The probability of error can be expressed as $P\{1\} = 1 - p_{+Mo}$. When we suppose situation {2}, we will estimate the value of variable $X_l$ according to the row maximum corresponding to the known value of variable $X_k$. Let us denote this maximum as $p_{rMo} = \max_s (p_{rs})$, where $p_{rs}$ is the relative frequency in the r-th row and s-th column ($p_{rs} = n_{rs} / n$). Then the probability of error equals $P\{2\} = 1 - \sum_r p_{rMo}$. The proportional reduction in error can be calculated according to the scheme
$$\frac{P\{1\} - P\{2\}}{P\{1\}}.$$
Goodman and Kruskal's lambda coefficient is based on this formula. The asymmetric coefficient can be written as
$$\lambda_{l|k} = \frac{\sum_{r=1}^{K_k} p_{rMo} - p_{+Mo}}{1 - p_{+Mo}}.$$
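For symmetric coefficients, the following probabilities are considered:
$$P\{1\} = 1 - \frac{p_{+Mo} + p_{Mo+}}{2}, \qquad P\{2\} = 1 - \frac{\sum_r p_{rMo} + \sum_s p_{Mos}}{2},$$
where $p_{Mo+} = \max_r (p_{r+})$ and $p_{Mos} = \max_r (p_{rs})$. The final formula is either
$$\lambda = \frac{\sum_r p_{rMo} + \sum_s p_{Mos} - p_{+Mo} - p_{Mo+}}{2 - p_{+Mo} - p_{Mo+}}$$
or, expressed in absolute frequencies,
$$\lambda = \frac{\sum_r n_{rMo} + \sum_s n_{Mos} - n_{+Mo} - n_{Mo+}}{2n - n_{+Mo} - n_{Mo+}}.$$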
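A small numerical sketch may help to follow the proportional-reduction-in-error reasoning. It computes the asymmetric lambda for predicting the column variable from the row variable, and the symmetric lambda in its standard form based on absolute frequencies; the table and the function names are hypothetical.

```python
def lambda_asymmetric(table):
    """Lambda for predicting the column variable from the row variable."""
    n = sum(map(sum, table))
    p_plus_mo = max(sum(col) for col in zip(*table)) / n    # largest column share
    sum_row_max = sum(max(row) for row in table) / n        # sum of row maxima
    return (sum_row_max - p_plus_mo) / (1 - p_plus_mo)

def lambda_symmetric(table):
    """Standard symmetric lambda, written with absolute frequencies."""
    n = sum(map(sum, table))
    row_max_sum = sum(max(row) for row in table)
    col_max_sum = sum(max(col) for col in zip(*table))
    max_col_total = max(sum(col) for col in zip(*table))
    max_row_total = max(sum(row) for row in table)
    return ((row_max_sum + col_max_sum - max_col_total - max_row_total)
            / (2 * n - max_col_total - max_row_total))

table = [[30, 10], [5, 55]]   # hypothetical contingency table, n = 100
lam_asym = lambda_asymmetric(table)
lam_sym = lambda_symmetric(table)
print(lam_asym, lam_sym)
```

In this table, guessing the modal column without knowing the row is wrong in 35 % of cases, and knowing the row lowers the error to 15 %, which gives a reduction of 20/35.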
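The uncertainty coefficient investigates dependency in more detail. It is based on the principle of analysis of variance. If variability is expressed by characteristics other than variance, then the measure of dependency of variable $X_l$ on variable $X_k$ can be written as the following ratio:
$$\frac{\mathrm{var}(X_l) - \mathrm{var}(X_l \mid X_k)}{\mathrm{var}(X_l)}, \qquad \mathrm{var}(X_l \mid X_k) = \sum_{r=1}^{K_k} p_{r+} \, \mathrm{var}(X_l \mid X_k = v_{kr}),$$
where $\mathrm{var}(X_l)$ is the variability of the dependent variable, $\mathrm{var}(X_l \mid X_k)$ is the variability within a group and $v_{kr}$ is the r-th category of the independent (explanatory) variable $X_k$. Variability of a nominal variable can be measured by different means. The uncertainty coefficient is based on the entropy, which can be written as
$$H(X_l) = -\sum_{u=1}^{K_l} p_{lu} \ln p_{lu},$$
where $p_{lu}$ is the relative frequency of the u-th category of the l-th variable. The symmetric measure is calculated as the harmonic mean of both asymmetric measures. The final formula is usually written in the simplified form as
$$U = \frac{2 \left[ \sum_r p_{r+} \ln p_{r+} + \sum_s p_{+s} \ln p_{+s} - \sum_r \sum_s p_{rs} \ln p_{rs} \right]}{\sum_r p_{r+} \ln p_{r+} + \sum_s p_{+s} \ln p_{+s}}.$$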
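The entropy-based coefficients can be checked numerically. The sketch below computes both asymmetric uncertainty coefficients and the symmetric one from a contingency table, using the entropies of the row variable, the column variable and their joint distribution; the tables and names are hypothetical, and the code assumes that both variables take at least two categories.

```python
from math import log

def entropy(probs):
    """Entropy of a distribution; zero probabilities contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)

def uncertainty_coefficients(table):
    n = sum(map(sum, table))
    h_k = entropy([sum(row) / n for row in table])            # row variable
    h_l = entropy([sum(col) / n for col in zip(*table)])      # column variable
    h_kl = entropy([x / n for row in table for x in row])     # joint distribution
    shared = h_k + h_l - h_kl            # reduction of uncertainty
    u_l_given_k = shared / h_l           # dependency of columns on rows
    u_k_given_l = shared / h_k           # dependency of rows on columns
    u_symmetric = 2 * shared / (h_k + h_l)
    return u_l_given_k, u_k_given_l, u_symmetric

dependent = uncertainty_coefficients([[50, 0], [0, 50]])      # perfect dependency
independent = uncertainty_coefficients([[10, 20], [20, 40]])  # independence
u1, u2, u_sym = uncertainty_coefficients([[30, 10], [5, 55]])
print(dependent, independent, (u1, u2, u_sym))
```

The symmetric value coincides with the harmonic mean of the two asymmetric values, which the last tuple can be used to verify.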
Ordinal variables
Dependency of ordinal variables is denoted as rank correlation and its intensity is expressed by correlation coefficients. The best known among them is Spearman's correlation coefficient. If the investigated ordinal variables express unambiguous ranks, the following formula can be used:
$$r_S = 1 - \frac{6 \sum_{i=1}^{n} (x_{ik} - x_{il})^2}{n (n^2 - 1)}.$$
If this assumption is not satisfied, the process described in [12] must be applied.
Further measures investigate pairs of objects. If, in a pair of objects, the values of both investigated variables are greater (less) for one of these objects, this pair is denoted as concordant. If for one variable the value is greater and for the second one it is less, then the pair is denoted as discordant. In other cases (the same values for both objects exist for at least one variable), the pairs are tied. For the sake of simplification, we will use the following symbols:
$\Gamma$ – number of concordant pairs,
$\Delta$ – number of discordant pairs,
$\Psi_k$ – number of pairs with the same values of variable $X_k$ but distinct values of variable $X_l$,
$\Psi_l$ – number of pairs with the same values of variable $X_l$ but distinct values of variable $X_k$.
Goodman and Kruskal's gamma is a symmetric measure. It is expressed as
$$\gamma = \frac{\Gamma - \Delta}{\Gamma + \Delta}.$$
For two binary variables, it can be written as
$$\gamma = \frac{ad - bc}{ad + bc}$$
and it coincides with Yule's Q, see formula (3).
Another symmetric measure is Kendall's tau-b (Kendall's coefficient of rank correlation). It is expressed as
$$\tau_b = \frac{\Gamma - \Delta}{\sqrt{(\Gamma + \Delta + \Psi_k)(\Gamma + \Delta + \Psi_l)}}.$$
For two binary variables, the value of this coefficient is the same as the value of Pearson's correlation coefficient, see formula (4).
Another correlation coefficient is the tau-c coefficient, which is denoted either Kendall's tau-c (SPSS) or Stuart's tau-c (SYSTAT, SAS). The formula is as follows:
$$\tau_c = \frac{2q (\Gamma - \Delta)}{n^2 (q - 1)},$$
where $q = \min\{K_k, K_l\}$.
Further, Somers' d is used. Both symmetric and asymmetric types of this measure exist. The asymmetric one, for the dependency of $X_l$ on $X_k$, is expressed as
$$d_{l|k} = \frac{\Gamma - \Delta}{\Gamma + \Delta + \Psi_l}.$$
The symmetric measure is calculated as the harmonic mean of both asymmetric measures, i.e., the final formula is
$$d = \frac{2 (\Gamma - \Delta)}{2 (\Gamma + \Delta) + \Psi_k + \Psi_l}.$$
Features of the measures mentioned in this chapter are described in [10] and [12].
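The pair-based coefficients of this section can be illustrated by counting concordant, discordant and tied pairs directly. This is a quadratic-time sketch on hypothetical data; the function names are assumptions, and the tie counts follow the definitions given above (pairs tied in both variables are not counted anywhere).

```python
from math import sqrt

def pair_counts(x_k, x_l):
    """Concordant, discordant, tied-only-in-X_k and tied-only-in-X_l pair counts."""
    conc = disc = tied_k = tied_l = 0
    n = len(x_k)
    for i in range(n):
        for j in range(i + 1, n):
            d_k = x_k[i] - x_k[j]
            d_l = x_l[i] - x_l[j]
            if d_k * d_l > 0:
                conc += 1
            elif d_k * d_l < 0:
                disc += 1
            elif d_k == 0 and d_l != 0:
                tied_k += 1
            elif d_l == 0 and d_k != 0:
                tied_l += 1
    return conc, disc, tied_k, tied_l

def gamma(conc, disc):
    return (conc - disc) / (conc + disc)

def tau_b(conc, disc, tied_k, tied_l):
    return (conc - disc) / sqrt((conc + disc + tied_k) * (conc + disc + tied_l))

def tau_c(conc, disc, n, q):
    return 2 * q * (conc - disc) / (n ** 2 * (q - 1))

def somers_d(conc, disc, tied_dependent):
    """Asymmetric Somers' d; tied_dependent counts pairs tied only in the dependent variable."""
    return (conc - disc) / (conc + disc + tied_dependent)

x_k = [1, 2, 2, 3, 4]   # hypothetical ordinal codes of variable X_k
x_l = [1, 1, 2, 3, 3]   # hypothetical ordinal codes of variable X_l
counts = pair_counts(x_k, x_l)
conc, disc, tied_k, tied_l = counts
q = min(len(set(x_k)), len(set(x_l)))
print(counts, gamma(conc, disc), tau_b(*counts), tau_c(conc, disc, len(x_k), q))
print(somers_d(conc, disc, tied_l), somers_d(conc, disc, tied_k))
```

Because gamma ignores tied pairs, it reaches 1 here, while tau-b and Somers' d stay below 1.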
As concerns the possibilities of software packages in the area of creation of dependency matrices that can be used as input matrices for cluster analysis, the SPSS system offers Pearson's and Spearman's coefficients and Kendall's tau-b. The offer of the SYSTAT system is larger. It includes the phi coefficient, Goodman and Kruskal's lambda, the uncertainty coefficient, Pearson's and Spearman's correlation coefficients, Kendall's tau-b, Stuart's tau-c, and Goodman and Kruskal's gamma.
6. Category clustering
In the case of clustering categories of a nominal variable, hierarchical cluster analysis is usually applied to the proximity matrix, which is created on the basis of a suitable measure. The contingency table for two categorical variables is an input for the corresponding procedure in a software package. These processes can be applied in SPSS and SYSTAT. Moreover, in the SYSTAT system the suitable similarity measures can also be used in the k-means and k-medians methods.
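The relationships between categories can be measured by means of special coefficients based on the chi-square statistic, see formula (5). For the determination of dissimilarity between categories $v_{ki}$ and $v_{kj}$ of the k-th (row) variable, we consider the contingency table of the dimension $2 \times K_l$, where $K_l$ is the number of categories of the l-th (column) variable. We can use the chi-square dissimilarity measure, which is written as
$$D_{\chi^2}(v_{ki}, v_{kj}) = \sqrt{\sum_{s=1}^{K_l} \frac{(n_{is} - M_{is})^2}{M_{is}} + \sum_{s=1}^{K_l} \frac{(n_{js} - M_{js})^2}{M_{js}}},$$
where
$$M_{is} = \frac{n_{i+} n_{+s}}{n}$$
and the marginal frequencies and n are taken from the $2 \times K_l$ table. Further, the phi coefficient can be used. It is calculated according to the formula
$$D_{\Phi}(v_{ki}, v_{kj}) = \frac{D_{\chi^2}(v_{ki}, v_{kj})}{\sqrt{n}}.$$
In both cases, the coefficients measure dissimilarity and $D_{rr} = 0$.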
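The chi-square dissimilarity between two categories can be sketched as a chi-square statistic computed on the 2 x K table formed by the two corresponding rows of the contingency table. The sketch assumes the square-root form of the measure and the normalization of the phi variant by the square root of the combined frequency; the rows used below are invented.

```python
from math import sqrt

def chi_square_dissimilarity(row_i, row_j):
    """Chi-square dissimilarity of two rows (categories) of a contingency table."""
    n_i, n_j = sum(row_i), sum(row_j)
    n = n_i + n_j
    total = 0.0
    for a, b in zip(row_i, row_j):
        col = a + b
        if col == 0:          # a column empty in both rows carries no information
            continue
        m_i = n_i * col / n   # expected frequencies in the 2 x K table
        m_j = n_j * col / n
        total += (a - m_i) ** 2 / m_i + (b - m_j) ** 2 / m_j
    return sqrt(total)

def phi_dissimilarity(row_i, row_j):
    """Chi-square dissimilarity normalized by the square root of the combined frequency."""
    return chi_square_dissimilarity(row_i, row_j) / sqrt(sum(row_i) + sum(row_j))

row_a = [10, 20, 30]   # hypothetical frequencies of one category across columns
row_b = [30, 20, 10]   # hypothetical frequencies of another category
d_ab = chi_square_dissimilarity(row_a, row_b)
print(d_ab, phi_dissimilarity(row_a, row_b))
print(chi_square_dissimilarity(row_a, [20, 40, 60]))   # proportional rows
```

Rows with proportional frequencies have zero dissimilarity, so the measure compares the profiles of the categories rather than their sizes.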
7. Examples of applications
In this chapter, two examples will be presented: variable clustering and category clustering. The data file is from the research “Male and female with university diploma”, No. 0136, Institute of Sociology of the Academy of Sciences of the Czech Republic. The author of this research is the Gender in sociology team; the data collection was performed by Sofres-Factum (Praha, 1998).
Example 1 – variable clustering
For this purpose, 13 variables expressing satisfaction concerning a job of the respondent from different points of view were analyzed. Respondents evaluated their satisfaction on the scale from 1 (very satisfied) to 4 (very dissatisfied). A similarity matrix based on Kendall's tau-b was created in the SPSS system. This matrix was transformed to a dissimilarity matrix by subtracting the values from 1 in Microsoft Excel. The transformed matrix was analyzed by the complete linkage method of hierarchical cluster analysis (the distance between two clusters is determined by the greatest distance between two objects from these clusters) in the STATISTICA system (because of the better quality of its graphs). The resulting dendrogram is shown in Figure 1.
If we cut the dendrogram at the distance 0.6, we obtain 6 clusters. The first cluster represents satisfaction with salary, remuneration and evaluation of working enforcement. The further clusters represent the following groups of variables: satisfaction with perspective with the company and possibility of promotion; satisfaction with relationships in the company and relationships between males and females; satisfaction with scope of employment and use of the respondent's degree of education; satisfaction with management of the company, the respondent's supervisor and possibility to express own opinion; and satisfaction with working burden.
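The workflow of this example (similarity matrix, conversion to dissimilarities, complete linkage, cut of the dendrogram) can be sketched without any statistical package. The similarity values below are invented and do not come from the survey, and the function `complete_linkage` is a simple illustration rather than the procedure used by STATISTICA.

```python
def complete_linkage(dissim, cut):
    """Merge clusters while the smallest complete-linkage distance does not exceed `cut`."""
    clusters = [[i] for i in range(len(dissim))]

    def distance(a, b):
        # Complete linkage: the greatest distance between members of the two clusters.
        return max(dissim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        candidates = [(distance(a, b), ia, ib)
                      for ia, a in enumerate(clusters)
                      for ib, b in enumerate(clusters) if ia < ib]
        d, ia, ib = min(candidates)
        if d > cut:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters

# Hypothetical similarity matrix for four variables (e.g., values of a rank correlation).
similarity = [[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]]
dissimilarity = [[1 - s for s in row] for row in similarity]
clusters = complete_linkage(dissimilarity, cut=0.6)
print(clusters)
```

Stopping the merging at a given distance corresponds to cutting the dendrogram at that height; a lower cut yields more clusters.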
Figure 1
Dendrogram of relationships among variables
Example 2 – category clustering
In this case, the categories of the variable expressing specialization in university studies were clustered on the basis of the categories of the variable containing information about the following university diplomas: magister – Mgr., engineer – Ing., doctor – Dr. (RNDr., MUDr., JUDr. etc.). The respondents with the bachelor diploma (Bc.) were omitted from the analysis. The contingency table for these two variables is in Table 4.
Table 4
Contingency table for variables Diploma and Specialization
In Table 5, the proximity matrix based on the chi-square dissimilarity measure is displayed. It was created using the SPSS system.
Table 5
Proximity matrix for categories of variable Specialization
The proximity matrix was analyzed by the complete linkage method of hierarchical cluster analysis in the STATISTICA system. The resulting dendrogram is shown in Figure 2.
Figure 2
Dendrogram of relationships among categories
This example is only illustrative. It is well known that graduates from certain faculties and universities have a specific diploma. Graduates from faculties specializing in natural and social sciences (including law) usually have the Mgr. diploma first, but some of them continue their studies for a doctoral diploma (the RNDr. diploma for natural sciences and the JUDr. diploma for law). Physicians have the MUDr. diploma. Graduates from faculties specializing in pedagogy and art usually have the Mgr. diploma. Graduates from universities specializing in technical sciences, economy and agricultural sciences usually have the Ing. diploma. We obtain these three clusters if we cut the dendrogram at the distance 10.
8. Further directions of development
Although a lot of approaches and methods for clustering of categorical data have been proposed in the literature, capabilities of statistical software packages are limited. One expected direction of development is implementation of more algorithms in software products. Besides clustering, the programs should offer other processing and analyses: missing value imputation, choice of variables for object clustering and dimensionality reduction, identification of outliers, and determination of the optimal number of clusters.
Researchers are presently focusing on two areas: clustering of large data files, and online clustering when additional objects arise during the analysis (web pages). Another area which should be addressed is clustering of data files with different types of variables. In commercial software packages, only two-step cluster analysis in the SPSS system makes such clustering possible.
References
[1] Ganti, V., Gehrke, J., Ramakrishnan, R. CACTUS – Clustering categorical data using summaries. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego: ACM Press, 1999, 73–83.
[2] Gordon, A. D. Classification, 2nd ed. Boca Raton: Chapman & Hall/CRC, 1999.
[3] Guha, S., Rastogi, R., Shim, K. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25 (5), 2000, 345–366.
[4] He, X., Ding, C. H. Q., Zha, H., Simon, H. D. Automatic topic identification using webpage clustering. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 01), 2001, 195–203.
[5] Hebák, P., Hustopecký, J., Pecáková, I., Plašil, M., Průša, M., Řezanková, H., Svobodová, A., Vlach, P. Vícerozměrné statistické metody (3). 2nd ed. Praha: Informatorium, 2007.
[6] Huang, Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proc. of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, University of British Columbia, 1997, 1–8.
[7] Huang, Z. Extensions to the k-means algorithm to clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2, 1998, 283–304.
[8] Kaufman, L., Rousseeuw, P. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken: Wiley, 2005.
[9] Mirkin, B. Clustering for Data Mining: A Data Recovery Approach. Boca Raton: Chapman & Hall/CRC, 2005.
[10] Pecáková, I. Statistika v terénních průzkumech. Praha: Professional Publishing, 2008.
[11] Řezanková, H. Measurement of binary variables similarities. Acta Oeconomica Pragensia, 9 (3), 2001, 129–136.
[12] Řezanková, H. Analýza dat z dotazníkových šetření. Praha: Professional Publishing, 2007.
[13] Řezanková, H., Húsek, D., Snášel, V. Shluková analýza dat. 2nd ed. Praha: Professional Publishing, 2009.
[14] Stankovičová, I., Vojtková, M. Viacrozmerné štatistické metódy s aplikáciami. Bratislava: Iura Edition, 2007.
[15] Zhang, T., Ramakrishnan, R., Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record, 25 (2), 1996, 103–114.
[16] Žambochová, M. Algoritmus BIRCH a jeho varianty pro shlukování velkých souborů dat. Mezinárodní statisticko-ekonomické dny [CD-ROM]. Praha: VŠE, 2008.
Hana Řezanková, Fakulta informatiky a statistiky VŠE v Praze, nám. W. Churchilla 4, 130 67 Praha 3 – Žižkov,
e-mail: hana.rezankova@vse.cz
Abstract
This paper deals with specific techniques proposed for cluster analysis if a data file includes categorical variables. Nominal, ordinal and dichotomous variables are considered as categorical. Three types of clustering are described: object clustering, variable clustering and category clustering. Both specific coefficients for measurement of similarity and specific methods are mentioned. Two illustrative examples are included in the paper. One of them shows variable clustering (variables express satisfaction concerning a job of the respondent from different points of view) and the second one concerns category clustering (specializations of respondents are clustered according to the type of the university diploma); a combination of the SPSS and STATISTICA software systems is applied in both examples.
Key words: Cluster analysis, categorical data analysis, similarity measures, dissimilarity measures.