Proceedings of the National Conference “Computational Systems and Information Security”, Jan. 4, 2008, by CSE Department, P.B. College of Engineering, Chennai 602105
Copyright @CSE, PBCE, 2008
Fuzzy Clustering For Categorical Data
Latha Parthiban, M. Babu, N. Ramaraj
Department of Computer Science & Engineering
GKM College of Engineering, Chennai
Abstract: Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. Clustering categorical data is an important and practically relevant problem in relational databases. Categorical data clustering mostly relies on two fundamental algorithms, k-means and k-modes. Huang (1998) proposed the k-modes algorithm to tackle the problem of clustering large categorical data sets in data mining. It extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process so as to minimize the clustering cost function. A drawback is that the k-modes algorithm is unstable because the modes are not unique. San, Huynh, and Nakamori (2004) proposed an alternative algorithm, called k-representatives, that reduces this drawback while still mimicking the k-means method in clustering categorical data. They left the extension of the technique to fuzzy clustering of categorical data as future work. The fuzzy clustering algorithm for categorical data proposed here applies the optimization technique of fuzzy logic to that setting. Instead of calculating means or modes, it calculates the relative frequencies of the categories of the objects within each cluster. These changes to the similarity measure of the k-means algorithm lead to better results. The proposed work implements the fuzzy c-means (FCM) clustering algorithm to help doctors find new patterns in medical diagnostic systems.
Keywords: cluster analysis, categorical data, data mining
1. Introduction
During the last decade, data mining has emerged as a rapidly growing interdisciplinary field that merges databases, statistics, machine learning, and related areas in order to extract useful knowledge from data. Clustering is one of the fundamental operations in data mining. It can be defined as the process of organizing objects in a database into clusters (groups) such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity.
Most of the earlier work on clustering has focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. However, data mining applications frequently involve datasets that also contain categorical attributes, on which distance functions are not naturally defined. Recently, clustering data with categorical attributes has drawn some attention. As is well known, k-means clustering has been a very popular technique for partitioning large data sets with numerical attributes.
The k-modes algorithm was introduced to tackle the problem of clustering large categorical data sets in data mining. It extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process so as to minimize the clustering cost function. However, the k-modes algorithm is unstable due to the non-uniqueness of the modes. That is, the clustering results depend strongly on the selection of modes during the clustering process.
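The non-uniqueness of modes can be illustrated with a small sketch. The function names and the toy cluster below are hypothetical and chosen only for illustration; the sketch assumes that a mode takes, for each attribute, a most frequent category, and that simple matching counts the attributes on which two objects disagree.

```python
from collections import Counter

def simple_matching(x, y):
    """Number of attributes on which two categorical tuples disagree."""
    return sum(a != b for a, b in zip(x, y))

def attribute_modes(cluster, j):
    """All most-frequent categories of attribute j (there may be several)."""
    counts = Counter(obj[j] for obj in cluster)
    top = max(counts.values())
    return sorted(c for c, n in counts.items() if n == top)

# Toy cluster (illustrative data): every category occurs equally often.
cluster = [("red", "small"), ("blue", "small"), ("red", "large"), ("blue", "large")]
modes_per_attribute = [attribute_modes(cluster, j) for j in range(2)]
print(modes_per_attribute)  # [['blue', 'red'], ['large', 'small']]

# Two equally valid modes place the same object at different distances.
x = ("red", "small")
d_first = simple_matching(x, ("red", "small"))
d_second = simple_matching(x, ("blue", "large"))
print(d_first, d_second)  # 0 2
```

Since either tuple is a legitimate mode of this cluster, the assignment of the object depends on which one happens to be selected.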
This paper aims at eliminating the above drawback of the k-modes algorithm by introducing a new notion of “cluster centers”, called representatives, for categorical objects. As arithmetic operations are completely absent in the setting of categorical objects, we apply the notion of fuzziness in defining representatives, instead of means, for clusters. With this notion we can also formulate the clustering of categorical objects as a partitioning problem in a fashion similar to k-means clustering.
2. Notation
We assume that the set of objects to be clustered is stored in a dataset $D$ defined by a set of attributes $A_1, \ldots, A_m$ with domains $D_1, \ldots, D_m$, respectively. Each object in $D$ is represented by a tuple $t \in D_1 \times \cdots \times D_m$. A domain $D_i$ is defined as categorical if it is finite and unordered.
Logically, each data object $X$ in the dataset is also represented as a conjunction of attribute-value pairs
$$[A_1 = x_1] \wedge \cdots \wedge [A_m = x_m],$$
where $x_i \in D_i$ for $1 \le i \le m$. For simplicity, we represent $X$ as a tuple
$$(x_1, \ldots, x_m) \in D_1 \times \cdots \times D_m.$$
If all $D_i$'s are categorical domains, then the objects in $D$ are called categorical objects.
3. Proposed Algorithm
In this section we discuss how to avoid the drawback of the k-modes algorithm and propose an alternative algorithm that also mimics the k-means method in clustering categorical data. As we have seen, applying the k-means method to categorical objects raises two main problems: the formation of cluster centers and the calculation of dissimilarity between objects and cluster centers. The k-modes algorithm solves both by using the simple matching dissimilarity measure for categorical data instead of the Euclidean distance measure, and by replacing the means of clusters with modes. In the following, we address these two problems before introducing the proposed algorithm.
4.1. Formation of “Cluster Centers”
As arithmetic operations are completely absent in the setting of categorical objects, we use the Cartesian product and union operations to form “cluster centers”, based on the notion of means in the numerical setting. In particular, we replace addition and multiplication by the union and the Cartesian product, respectively, in defining the notion of representatives for clusters of categorical data.
Given a cluster $C = \{X_1, \ldots, X_p\}$ of categorical objects, with $X_i = (x_{i,1}, \ldots, x_{i,m})$, $1 \le i \le p$, denote by $D_j$ the set formed from the categorical values $x_{1,j}, \ldots, x_{p,j}$. Then the representative of $C$ is defined by $Q = (q_1, \ldots, q_m)$, with
$$q_j = \{(c_j, f_{c_j}) \mid c_j \in D_j\}, \qquad (1)$$
where $f_{c_j}$ is the relative frequency of category $c_j$ within $C$, i.e., $f_{c_j} = n_{c_j}/p$, where $n_{c_j}$ is the number of objects in $C$ having category $c_j$ at attribute $A_j$. Formally, each $q_j$ can be seen as a fuzzy set on $D_j$, with the membership grades of its elements defined
by their relative frequencies within the cluster.
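As a minimal sketch of definition (1), the representative of a cluster can be computed by counting categories per attribute. The function name `representative`, the use of one dictionary per attribute, and the toy data are assumptions made here for illustration, not part of the original formulation.

```python
from collections import Counter

def representative(cluster):
    """Return Q = (q_1, ..., q_m); each q_j maps a category to its relative frequency."""
    p = len(cluster)      # number of objects in the cluster (assumed non-empty)
    m = len(cluster[0])   # number of attributes
    return [
        {c: n / p for c, n in Counter(obj[j] for obj in cluster).items()}
        for j in range(m)
    ]

# Toy cluster with two attributes (illustrative data only).
cluster = [("fever", "yes"), ("fever", "no"), ("cough", "yes"), ("fever", "yes")]
Q = representative(cluster)
print(Q)
```

Each dictionary plays the role of the fuzzy set $q_j$: its keys are the categories observed at attribute $A_j$ and its values are the membership grades $f_{c_j}$, which sum to 1.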
4.2. Dissimilarity Measure
Given this way of forming representatives for clusters of categorical objects, the dissimilarity between a categorical object and the representative of a cluster is defined on the basis of simple matching, as follows.
Let $C = \{X_1, \ldots, X_p\}$ be a cluster of categorical objects, with $X_i = (x_{i,1}, \ldots, x_{i,m})$, $1 \le i \le p$, and let $X = (x_1, \ldots, x_m)$ be a categorical object. Note that $X$ may or may not belong to $C$. Assume that $Q = (q_1, \ldots, q_m)$, with $q_j = \{(c_j, f_{c_j}) \mid c_j \in D_j\}$, is the representative of cluster $C$. We define the dissimilarity between object $X$ and representative $Q$ by
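$$d(X, Q) = \sum_{j=1}^{m} \sum_{c_j \in D_j} f_{c_j} \cdot \delta(x_j, c_j), \qquad (2)$$
where $\delta$ is the simple matching measure: $\delta(x_j, c_j) = 0$ if $x_j = c_j$, and $\delta(x_j, c_j) = 1$ otherwise.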
Under this definition, the dissimilarity $d(X, Q)$ depends mainly on the relative frequencies of the categorical values within the cluster and on simple matching between categorical values. It is also worth noting that the simple matching dissimilarity measure between categorical objects can be regarded as a categorical counterpart of the squared Euclidean distance measure.
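It is easily seen that
$$\begin{aligned}
d(X, Q) &= \sum_{j=1}^{m} \sum_{c_j \in D_j} f_{c_j} \cdot \delta(x_j, c_j) \\
&= \sum_{j=1}^{m} \sum_{c_j \in D_j,\ c_j \ne x_j} f_{c_j} \\
&= \sum_{j=1}^{m} (1 - f_{x_j}), \qquad (3)
\end{aligned}$$
where $f_{x_j}$ is the relative frequency of category $x_j$ within $C$.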
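The equivalence of forms (2) and (3) can be checked numerically with a short sketch. The function names and the example representative are hypothetical; the sketch also assumes that a category not observed in the cluster has relative frequency 0.

```python
def dissimilarity_eq2(x, Q):
    """Form (2): add f_c for every category c of attribute j that differs from x_j."""
    return sum(f for j, q in enumerate(Q) for c, f in q.items() if c != x[j])

def dissimilarity_eq3(x, Q):
    """Form (3): sum of 1 - f_{x_j}; a category unseen in the cluster counts as 0."""
    return sum(1.0 - q.get(x[j], 0.0) for j, q in enumerate(Q))

# Representative of a toy cluster: one {category: relative frequency} map per attribute.
Q = [{"fever": 0.75, "cough": 0.25}, {"yes": 0.75, "no": 0.25}]

x = ("cough", "yes")
d2, d3 = dissimilarity_eq2(x, Q), dissimilarity_eq3(x, Q)
print(d2, d3)  # 1.0 1.0

y = ("rash", "no")  # "rash" does not occur in the cluster
e2, e3 = dissimilarity_eq2(y, Q), dissimilarity_eq3(y, Q)
print(e2, e3)  # 1.75 1.75
```

Form (3) needs only one lookup per attribute, which is why it is the more convenient one to implement.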
5. Implementation of Fuzzy Clustering Algorithm
With the modifications made above, we are now ready to formulate the problem of clustering categorical data as a partitioning problem in a fashion similar to k-means clustering. Assume that we have a data set $D = \{X_1, \ldots, X_n\}$ of categorical objects to be clustered, where each object $X_i = (x_{i,1}, \ldots, x_{i,m})$, $1 \le i \le n$, is described by $m$ categorical attributes. Then the problem can be stated mathematically as follows:
Minimize
$$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l}\, d(X_i, Q_l), \qquad (4)$$
subject to
$$\sum_{l=1}^{k} w_{i,l} = 1, \quad 1 \le i \le n, \qquad (5)$$
$$w_{i,l} \in \{0, 1\}, \quad 1 \le i \le n, \ 1 \le l \le k,$$
where $W = [w_{i,l}]_{n \times k}$ is a partition matrix, $Q = \{Q_1, \ldots, Q_k\}$ is the set of representatives, and $d(X_i, Q_l)$ is the dissimilarity between object $X_i$ and representative $Q_l$ defined by (2).
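The cost function (4) can be evaluated for a given hard partition as in the following sketch. Here the partition matrix $W$ is encoded as a list `labels`, with `labels[i] = l` meaning $w_{i,l} = 1$, so constraint (5) holds by construction. This encoding, the function names, and the toy data are assumptions made for illustration, and every cluster is assumed to be non-empty.

```python
from collections import Counter

def representative(cluster):
    """One {category: relative frequency} map per attribute, as in (1)."""
    p = len(cluster)
    return [{c: n / p for c, n in Counter(col).items()} for col in zip(*cluster)]

def dissimilarity(x, Q):
    """d(X, Q) = sum over attributes of 1 - f_{x_j}; unseen categories count as 0."""
    return sum(1.0 - q.get(v, 0.0) for v, q in zip(x, Q))

def cost(data, labels, k):
    """P(W, Q) for the hard partition labels[i] = index of the cluster holding X_i."""
    clusters = [[x for x, lab in zip(data, labels) if lab == l] for l in range(k)]
    reps = [representative(c) for c in clusters]  # assumes no cluster is empty
    return sum(dissimilarity(x, reps[lab]) for x, lab in zip(data, labels))

data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good = cost(data, [0, 0, 1, 1], k=2)   # identical objects grouped together
mixed = cost(data, [0, 1, 0, 1], k=2)  # each cluster mixes both kinds of object
print(good, mixed)  # 0.0 4.0
```

A partition that groups identical objects reaches cost 0, while the mixed partition is penalized on every attribute of every object.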
In much the same way as in the k-modes algorithm, we introduce the following algorithm for clustering categorical data:
1. Initialize a k-partition of $D$ randomly.
2. Calculate $k$ representatives, one for each cluster.
3. For each $X_i$, calculate the dissimilarities $d(X_i, Q_l)$, $l = 1, \ldots, k$. Reassign $X_i$ to the cluster $C_l$ (from cluster $C_{l'}$, say) for which the dissimilarity between $X_i$ and $Q_l$ is least. Update both $Q_l$ and $Q_{l'}$.
4. Repeat Step 3 until no object has changed clusters after a full cycle through the whole data set.
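The four steps above can be sketched as follows. This is an illustrative sketch rather than a definitive implementation: the function names, the `seed` and `max_iter` parameters, the rule that an object moves only when the new cluster is strictly closer, and the guard that never empties a cluster are all assumptions introduced here, since the steps do not specify how ties or empty clusters are handled.

```python
import random
from collections import Counter

def representative(cluster):
    """One {category: relative frequency} map per attribute, as in (1)."""
    p = len(cluster)
    return [{c: n / p for c, n in Counter(col).items()} for col in zip(*cluster)]

def dissimilarity(x, Q):
    """d(X, Q) = sum over attributes of 1 - f_{x_j}; unseen categories count as 0."""
    return sum(1.0 - q.get(v, 0.0) for v, q in zip(x, Q))

def k_representatives(data, k, seed=0, max_iter=100):
    """Cluster categorical tuples; requires len(data) >= k."""
    rng = random.Random(seed)
    labels = [i % k for i in range(len(data))]
    rng.shuffle(labels)  # Step 1: random k-partition with no empty cluster

    def members(l):
        return [x for x, lab in zip(data, labels) if lab == l]

    reps = [representative(members(l)) for l in range(k)]  # Step 2
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(data):  # Step 3: one pass over the data set
            old = labels[i]
            d = [dissimilarity(x, reps[l]) for l in range(k)]
            new = min(range(k), key=d.__getitem__)
            # Assumptions: move only on strict improvement, never empty a cluster.
            if d[new] < d[old] and labels.count(old) > 1:
                labels[i] = new
                reps[old] = representative(members(old))
                reps[new] = representative(members(new))
                changed = True
        if not changed:  # Step 4: stop after a full pass without changes
            break
    return labels, reps

data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels, reps = k_representatives(data, k=2)
print(labels)
```

The resulting labels depend on the random initial partition, so different seeds may give different clusterings.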
We should also note that the definition of representatives of clusters in the proposed technique is based on the notion of means, i.e., the optimal solution of the corresponding numerical problem. Thus the optimization problem (4) reduces to a partial optimization problem for $W$, as specified in the above algorithm.
According to (3), we have
$$d(X_i, Q_l) = \sum_{j=1}^{m} (1 - f_{x_{i,j}}), \qquad (6)$$
where $f_{x_{i,j}}$ is the relative frequency of category $x_{i,j}$ within cluster $C_l$. Thus, in the proposed algorithm, object $X_i$ is allocated to the cluster $C_l$ for which the categories of $X_i$ are most likely to constitute a mode of $C_l$, relative to the other clusters. Note that, by definition, all possible modes of each cluster are taken into account in the proposed algorithm.
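To introduce fuzziness, the following two equations, for the cluster centers and the fuzzy membership degrees, are used to obtain more accurate cluster boundaries:
$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{m}},$$
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{d_{ij}}{d_{kj}} \right)^{2/(m-1)}}.$$
Here, following the usual fuzzy c-means notation, $u_{ij}$ is the degree of membership of object $x_j$ in cluster $i$, $c_i$ is the center of cluster $i$, $d_{ij}$ is the distance between $x_j$ and $c_i$, $c$ is the number of clusters, and $m > 1$ is the fuzziness parameter (not to be confused with the number of attributes $m$ used above).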
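A minimal sketch of one round of these two updates is given below for numeric points with Euclidean distance, which is the standard fuzzy c-means setting. The text does not specify how the updates are combined with the categorical dissimilarity, so this choice, the function names, and the toy data are assumptions. All distances are assumed to be strictly positive, i.e., no point coincides with a center.

```python
import math

def fcm_memberships(dist, m=2.0):
    """dist[i][j]: distance from center i to point j (assumed > 0). Returns u[i][j]."""
    c, n = len(dist), len(dist[0])
    power = 2.0 / (m - 1.0)
    return [
        [1.0 / sum((dist[i][j] / dist[k][j]) ** power for k in range(c))
         for j in range(n)]
        for i in range(c)
    ]

def fcm_centers(u, points, m=2.0):
    """Each center is the mean of the points weighted by u[i][j] ** m."""
    dim = len(points[0])
    centers = []
    for row in u:
        w = [uij ** m for uij in row]
        total = sum(w)
        centers.append(
            [sum(wj * p[d] for wj, p in zip(w, points)) / total for d in range(dim)]
        )
    return centers

# Toy one-dimensional data with two initial centers (illustrative values).
points = [[0.0], [1.0], [9.0], [10.0]]
centers = [[0.5], [9.5]]
dist = [[math.dist(p, c) for p in points] for c in centers]
u = fcm_memberships(dist)
new_centers = fcm_centers(u, points)
print(u)
print(new_centers)
```

For every point the memberships over all clusters sum to 1, and a point close to a center receives a membership close to 1 for that cluster. In practice the two updates are alternated until the memberships change by less than a chosen convergence value.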
6. Conclusions
This work used fuzzy clustering and hard k-means algorithms to cluster the given medical data. In our medical diagnostic application, the fuzzy clustering algorithm gave better results than the hard k-means algorithm. Another important feature of the fuzzy clustering algorithm is the membership function: an object can belong to several classes at the same time, but with different degrees. This is a useful feature for a medical diagnostic system. As a result, fuzzy clustering methods can be an important supportive tool for medical experts in diagnosis.
Although the fuzzy clustering algorithm is more accurate than the existing techniques, it does not give 100% accuracy.
Higher accuracy might be achieved either by altering the parameters used in the algorithm or by changing the flow of computation that the algorithm currently adopts. Any alteration that reduces the memory needed to store and cluster the data points could also be suggested and implemented.
A formula for determining the fuzziness parameter and the convergence value from the number of clusters could be formulated. This would allow these parameters to be set automatically while still giving an optimal output, and would relieve the user of having to supply them as inputs.
References
1. Earl Cox (2005), “Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration”, pp. 208-247.
2. Jiawei Han and Micheline Kamber (2000), “Data Mining Concepts and Techniques”, Morgan Kaufmann Publishers, pp. 1-20, 335-387.
3. Timothy Ross (1997), “Fuzzy Logic and its Implementation”, pp. 379-396.
4. Jonathan C. Prather, David F. Lobach, Linda K. Goodwin, R.N, Joseph W. Hales, Marvin L. Hage, and W. Edward Hammond, “Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse”.
5. Georg Berks, Diedrich Graf v. Keyserlingk, Jan Jantzen, Mariagrazia Dotoli, and Hubertus Axer (2000), “Fuzzy Clustering - A Versatile Mean to Explore Medical Databases”.
6. Songul Albayrak and Fatih Amasyalı (2003), “Fuzzy C-Means Clustering on Medical Diagnostic Systems”.
7. Herbert Schildt and Patrick Naughton (2000), “The Complete Reference - JAVA 2”, Third Edition, pp. 871-899.
8. Carlos Ordonez, “Integrating K-Means Clustering with a Relational DBMS Using SQL”, IEEE Trans., Feb. 2006.
9. Carlos Ordonez, “Efficient Disk-Based K-Means Clustering for Relational Databases”, IEEE Trans., April 2004.
10. Carlos Ordonez, “Clustering Binary Data Streams with K-Means”, IEEE Trans., June 2003.
11. Zhexue Huang, “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining”, 2003.
12. Anne Denton, Qiang Ding, Qin Ding, and William Perrizo, “A Fast Clustering Algorithm to Cluster Very Large Categorical Data in Data Mining”, 2003.
13. Li Jie, Gao Xinbo, and Jiao Li-Cheng, “A CSA-based clustering algorithm for large data sets with mixed numeric and categorical values”, IEEE Trans., 15-19 June 2004.
14. Li Jie, Gao Xinbo, and Jiao Li-Cheng, “A novel clustering method with network structure based on clonal algorithm”, IEEE Trans., 17-21 May 2004.
15. Ohn Mar San, Van-Nam Huynh, and Yoshiteru Nakamori, “An alternative extension of the K-Means algorithm for Clustering Categorical Data”, 2004.