Fuzzy Clustering For Categorical Data


Proceedings of the National Conference, Computational Systems and Information Security, Jan. 4, 2008, by CSE Department, P.B. College of Engineering, Chennai-602105
Copyright @ CSE-PBCE-2008




Latha Parthiban, M. Babu, N. Ramaraj

Department of Computer Science & Engineering

GKM College of Engineering, Chennai



Abstract

Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. Clustering categorical data is an interesting and essential task in relational databases. Categorical data clustering mostly relies on two fundamental algorithms, the k-means algorithm and the k-modes algorithm. Huang (1998) proposed the k-modes algorithm to tackle the problem of clustering large categorical data sets in data mining. It extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process to minimize the clustering cost function. The drawback of this approach is that the k-modes algorithm is unstable due to the non-uniqueness of the modes. Van-Nam Huynh and colleagues (2004) proposed an alternative algorithm, called k-representatives, that reduces this drawback while still mimicking the k-means method in clustering categorical data. They left the extension of the technique to fuzzy clustering of categorical data as future work. The fuzzy clustering algorithm for categorical data proposed here applies the optimization technique of fuzzy logic to that idea. Instead of calculating means or modes, it calculates the relative frequencies of the categories of the objects within a single cluster. These changes to the similarity calculations of the k-means algorithm lead to better results. The proposed work implements the fuzzy c-means (FCM) clustering algorithm to help doctors find new patterns in medical diagnostic systems.


Keywords: cluster analysis, categorical data, data mining



1. Introduction

During the last decade, data mining has emerged as a rapidly growing interdisciplinary field that merges databases, statistics, machine learning and related areas in order to extract useful knowledge from data. Clustering is one of the fundamental operations in data mining. Clustering can be defined as the process of organizing objects in a database into clusters (groups) such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity.


Most of the earlier work on clustering has focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. However, data mining applications frequently involve datasets that also contain categorical attributes, on which distance functions are not naturally defined. Recently, clustering data with categorical attributes has drawn some attention. As is well known, k-means clustering has been a very popular technique for partitioning large data sets with numerical attributes. The k-modes algorithm was introduced to tackle the problem of clustering large categorical data sets in data mining. It extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process to minimize the clustering cost function. However, the k-modes algorithm is unstable due to the non-uniqueness of the modes. That is, the clustering results depend strongly on the selection of modes during the clustering process.


This paper aims at eliminating the above drawback of the k-modes algorithm by introducing a new notion of “cluster centers”, called representatives, for categorical objects. As arithmetic operations are completely absent in the setting of categorical objects, we apply the notion of fuzziness in defining representatives instead of means for clusters. With this notion we can also formulate the clustering problem for categorical objects as a partitioning problem in a fashion similar to k-means clustering.


2. Notation

We assume that the set of objects to be clustered is stored in a dataset $D$ defined by a set of attributes $A_1, \ldots, A_m$ with domains $D_1, \ldots, D_m$, respectively. Each object in $D$ is represented by a tuple $t \in D_1 \times \cdots \times D_m$. A domain $D_i$ is defined as categorical if it is finite and unordered.

Logically, each data object $X$ in the dataset is also represented as a conjunction of attribute-value pairs

$$[A_1 = x_1] \wedge \cdots \wedge [A_m = x_m],$$

where $x_i \in D_i$ for $1 \le i \le m$. For simplicity, we represent $X$ as a tuple

$$(x_1, \ldots, x_m) \in D_1 \times \cdots \times D_m.$$

If all $D_i$ are categorical domains, then the objects in $D$ are called categorical objects.


3. Proposed Algorithm

In this section we discuss how to avoid the drawback of the k-modes algorithm, and propose an alternative algorithm that also mimics the k-means method in clustering categorical data. As we have seen, two main problems are encountered in applying the k-means method to categorical objects, namely the formation of cluster centers and the calculation of dissimilarity between objects and cluster centers. The k-modes algorithm solves these problems by using the simple matching dissimilarity measure for categorical data instead of the Euclidean distance measure, and by replacing the means of clusters with modes. In the following, we address the two problems before introducing the proposed algorithm.
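To illustrate the two ingredients of k-modes mentioned above, the following minimal Python sketch computes the simple matching dissimilarity and lists every tied mode of an attribute. The function names and the toy data are hypothetical and are not taken from the original work; the tie in the example is one way to see why the choice of mode is not unique.

```python
from collections import Counter


def simple_matching(x, y):
    """Number of attributes on which two categorical tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def attribute_modes(cluster, j):
    """All most-frequent categories of attribute j in the cluster (ties kept)."""
    counts = Counter(obj[j] for obj in cluster)
    top = max(counts.values())
    return sorted(c for c, n in counts.items() if n == top)


# Hypothetical toy cluster with two attributes (colour, shape).
cluster = [
    ("red", "round"),
    ("blue", "round"),
    ("red", "square"),
    ("blue", "square"),
]

modes_per_attribute = [attribute_modes(cluster, j) for j in range(2)]
# Both attributes are tied, so any of the 2 x 2 combinations is a valid mode.
print(modes_per_attribute)
print(simple_matching(("red", "round"), ("blue", "round")))
```

Because every combination of tied categories is an equally valid mode here, two runs of k-modes that break the tie differently may end with different partitions.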



4.1. Formation of “Cluster Centers”

As arithmetic operations are completely absent in the setting of categorical objects, we use the Cartesian product and union operations to form “cluster centers”, following the notion of means in the numerical setting. In particular, we replace addition and multiplication by the union and the Cartesian product, respectively, in defining the notion of representatives for clusters of categorical data.


Given a cluster $C = \{X_1, \ldots, X_p\}$ of categorical objects, with

$$X_i = (x_{i,1}, \ldots, x_{i,m}), \quad 1 \le i \le p,$$

denote by $D_j$ the set formed from the categorical values $x_{1,j}, \ldots, x_{p,j}$. Then the representative of $C$ is defined by $Q = (q_1, \ldots, q_m)$, with

$$q_j = \{(c_j, f_{c_j}) \mid c_j \in D_j\}, \tag{1}$$

where $f_{c_j}$ is the relative frequency of category $c_j$ within $C$, i.e., $f_{c_j} = n_{c_j}/p$, where $n_{c_j}$ is the number of objects in $C$ having category $c_j$ at attribute $A_j$. Formally, each $q_j$ can be seen as a fuzzy set on $D_j$ with the membership grades of its elements defined by their relative frequencies within the cluster.
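The definition in (1) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: objects are tuples of strings, each $q_j$ is stored as a dictionary from category to relative frequency, and the function name and the toy cluster are hypothetical.

```python
from collections import Counter


def representative(cluster):
    """Return Q = (q_1, ..., q_m); each q_j maps category -> relative frequency."""
    p = len(cluster)        # number of objects in the cluster
    m = len(cluster[0])     # number of attributes
    return [
        {c: n / p for c, n in Counter(obj[j] for obj in cluster).items()}
        for j in range(m)
    ]


# Hypothetical cluster of four objects with attributes (colour, shape).
cluster = [
    ("red", "round"),
    ("red", "square"),
    ("blue", "round"),
    ("red", "round"),
]
Q = representative(cluster)
print(Q)
```

Each dictionary sums to 1, which is what allows $q_j$ to be read as a fuzzy set whose membership grades are relative frequencies.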


4.2. Dissimilarity Measure

Given the modification proposed in forming representatives for clusters of categorical objects, the dissimilarity between a categorical object and the representative of a cluster is defined, based on simple matching, as follows.



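Let $C = \{X_1, \ldots, X_p\}$ be a cluster of categorical objects, with

$$X_i = (x_{i,1}, \ldots, x_{i,m}), \quad 1 \le i \le p,$$

and let $X = (x_1, \ldots, x_m)$ be a categorical object. Note that $X$ may or may not belong to $C$. Assume that $Q = (q_1, \ldots, q_m)$, with

$$q_j = \{(c_j, f_{c_j}) \mid c_j \in D_j\},$$

is the representative of cluster $C$. We define the dissimilarity between object $X$ and representative $Q$ by

$$d(X, Q) = \sum_{j=1}^{m} \sum_{c_j \in D_j} f_{c_j} \cdot \delta(x_j, c_j), \tag{2}$$

where $\delta(x_j, c_j)$ is the simple matching dissimilarity between two categorical values: it equals $0$ if $x_j = c_j$ and $1$ otherwise.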



Under this definition, the dissimilarity $d(X, Q)$ depends mainly on the relative frequencies of categorical values within the cluster and on simple matching between categorical values. It is also of interest to note that the simple matching dissimilarity measure between categorical objects can be considered a categorical counterpart of the squared Euclidean distance measure.


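It is easily seen that

$$\begin{aligned}
d(X, Q) &= \sum_{j=1}^{m} \sum_{c_j \in D_j} f_{c_j} \cdot \delta(x_j, c_j) \\
&= \sum_{j=1}^{m} \sum_{\substack{c_j \in D_j \\ c_j \neq x_j}} f_{c_j} \\
&= \sum_{j=1}^{m} (1 - f_{x_j}),
\end{aligned} \tag{3}$$

where $f_{x_j}$ is the relative frequency of category $x_j$ within $C$. The last step uses the fact that the relative frequencies at each attribute sum to 1.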
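As a quick numerical check of (2) and (3), the sketch below evaluates both forms on a small hypothetical cluster and confirms that they agree. The helper names are illustrative, and a category that never occurs in the cluster is assumed to have relative frequency 0.

```python
from collections import Counter


def representative(cluster):
    """Per-attribute relative frequencies of the categories in a cluster."""
    p = len(cluster)
    return [
        {c: n / p for c, n in Counter(obj[j] for obj in cluster).items()}
        for j in range(len(cluster[0]))
    ]


def dissimilarity_full(x, Q):
    """Double sum of Eq. (2): add f_c for every category c that differs from x_j."""
    return sum(f for j, q in enumerate(Q) for c, f in q.items() if c != x[j])


def dissimilarity(x, Q):
    """Closed form of Eq. (3): sum over attributes of 1 - f_{x_j}."""
    return sum(1.0 - q.get(x[j], 0.0) for j, q in enumerate(Q))


cluster = [
    ("red", "round"),
    ("red", "square"),
    ("blue", "round"),
    ("red", "round"),
]
Q = representative(cluster)

inside = ("red", "square")     # an object that belongs to the cluster
outside = ("green", "round")   # "green" never occurs in the cluster
for x in (inside, outside):
    print(x, dissimilarity_full(x, Q), dissimilarity(x, Q))
```

The closed form needs only one dictionary lookup per attribute, which is why it is the natural choice inside the clustering loop.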


5. Implementation of Fuzzy Clustering Algorithm

With the modifications made above, we are now ready to formulate the problem of clustering categorical data as a partitioning problem in a fashion similar to k-means clustering.



Assume that we have a data set $D = \{X_1, \ldots, X_n\}$ of categorical objects to be clustered, where each object $X_i = (x_{i,1}, \ldots, x_{i,m})$, $1 \le i \le n$, is described by $m$ categorical attributes. Then the problem can be stated mathematically as follows: minimize

$$P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l}\, d(X_i, Q_l), \tag{4}$$

subject to

$$\sum_{l=1}^{k} w_{i,l} = 1, \quad 1 \le i \le n, \tag{5}$$

$$w_{i,l} \in \{0, 1\}, \quad 1 \le i \le n, \; 1 \le l \le k,$$

where $W = [w_{i,l}]_{n \times k}$ is a partition matrix, $Q = \{Q_1, \ldots, Q_k\}$ is the set of representatives, and $d(X_i, Q_l)$ is the dissimilarity between object $X_i$ and representative $Q_l$ defined by (2).






In much the same way as in the k-modes algorithm, we introduce the following algorithm for clustering categorical data:

1. Initialize a k-partition of $D$ randomly.

2. Calculate $k$ representatives, one for each cluster.

3. For each $X_i$, calculate the dissimilarities $d(X_i, Q_l)$, $l = 1, \ldots, k$. Reassign $X_i$ to cluster $C_l$ (from cluster $C_{l'}$, say) such that the dissimilarity between $X_i$ and $Q_l$ is least. Update both $Q_l$ and $Q_{l'}$.

4. Repeat Step 3 until no object has changed clusters after a full cycle test of the whole data set.
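The four steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the optional `init_labels` argument, the tie-breaking rule (an object stays where it is on a tie), the guard that keeps clusters non-empty, and the `max_passes` cap are all assumptions introduced for the example.

```python
import random
from collections import Counter


def representative(cluster):
    """Per-attribute relative frequencies of the categories in a cluster."""
    p = len(cluster)
    return [
        {c: n / p for c, n in Counter(obj[j] for obj in cluster).items()}
        for j in range(len(cluster[0]))
    ]


def dissimilarity(x, Q):
    """Sum over attributes of 1 - f_{x_j}, as in Eq. (3)."""
    return sum(1.0 - q.get(v, 0.0) for v, q in zip(x, Q))


def k_representatives(data, k, init_labels=None, seed=0, max_passes=100):
    """Hard clustering with relative-frequency representatives (illustrative)."""
    n = len(data)
    if init_labels is None:
        # Step 1: random k-partition; round-robin over a shuffled order
        # keeps every cluster non-empty (requires n >= k).
        order = list(range(n))
        random.Random(seed).shuffle(order)
        labels = [0] * n
        for pos, i in enumerate(order):
            labels[i] = pos % k
    else:
        labels = list(init_labels)  # assumed to leave no cluster empty

    def members(l):
        return [data[i] for i in range(n) if labels[i] == l]

    # Step 2: one representative per cluster.
    Q = [representative(members(l)) for l in range(k)]

    for _ in range(max_passes):
        changed = False
        for i, x in enumerate(data):
            old = labels[i]
            # Step 3: nearest representative; ties keep the current cluster.
            new = min(range(k), key=lambda l: (dissimilarity(x, Q[l]), l != old))
            # Assumption: never move the last member out of a cluster.
            if new != old and labels.count(old) > 1:
                labels[i] = new
                Q[old] = representative(members(old))
                Q[new] = representative(members(new))
                changed = True
        # Step 4: stop after a full pass with no reassignment.
        if not changed:
            break
    return labels, Q


def objective(data, labels, Q):
    """Value of P(W, Q) in Eq. (4) for the hard partition given by labels."""
    return sum(dissimilarity(x, Q[l]) for x, l in zip(data, labels))


A, B = ("a", "x"), ("b", "y")
data = [A, A, A, B, B, B]
labels, Q = k_representatives(data, k=2, init_labels=[0, 0, 1, 1, 1, 0])
print(labels, objective(data, labels, Q))
```

Passing `init_labels` makes the run reproducible; leaving it out falls back to the random initial partition of Step 1, in which case the result may depend on the seed.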

We should also note that the definition of representatives of clusters in the proposed technique is based on the notion of means, i.e., the optimal solution of the corresponding numerical problem. Thus the optimization problem (4) is reduced to a partial optimization problem for $W$, as specified in the above algorithm.





According to (3), we have

$$d(X_i, Q_l) = \sum_{j=1}^{m} (1 - f_{x_{i,j}}), \tag{6}$$

where $f_{x_{i,j}}$ is the relative frequency of category $x_{i,j}$ within cluster $C_l$. Thus, in the proposed algorithm, object $X_i$ is allocated to the cluster $C_l$ for which the categories of $X_i$ are most likely to constitute a mode of $C_l$ relative to the other clusters. Note that, by definition, all possible modes of each cluster are taken into account in the proposed algorithm.

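To introduce the fuzziness parameter, the following two equations, for the cluster centers and for the fuzzy memberships, are used to obtain more accurate cluster boundaries:

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{m}},$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{d_{ij}}{d_{kj}} \right)^{2/(m-1)}}.$$

Here $u_{ij}$ is the membership degree of object $x_j$ in cluster $i$, $d_{ij}$ is the dissimilarity between object $x_j$ and the center of cluster $i$, $c$ is the number of clusters, and $m > 1$ in these two equations denotes the fuzziness parameter rather than the number of attributes.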
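The two update equations above can be sketched as plain Python functions. This is an illustration under stated assumptions rather than the implementation used in this work: `dist[i][j]` holds the dissimilarity between cluster $i$ and object $j$, the parameter `m` is the fuzziness exponent, an object at distance 0 from one or more clusters shares its whole membership among those clusters, and the center update is shown for numeric points because the weighted mean requires arithmetic on $x_j$. All names are hypothetical.

```python
def fuzzy_memberships(dist, m=2.0):
    """Return u[i][j] = 1 / sum_k (d_ij / d_kj) ** (2 / (m - 1))."""
    c, n = len(dist), len(dist[0])
    power = 2.0 / (m - 1.0)
    u = [[0.0] * n for _ in range(c)]
    for j in range(n):
        zeros = [i for i in range(c) if dist[i][j] == 0.0]
        if zeros:
            # Assumption: split the membership evenly among exact matches.
            for i in zeros:
                u[i][j] = 1.0 / len(zeros)
            continue
        for i in range(c):
            u[i][j] = 1.0 / sum(
                (dist[i][j] / dist[k][j]) ** power for k in range(c)
            )
    return u


def fuzzy_centers(u, points, m=2.0):
    """Return c_i = sum_j u_ij^m x_j / sum_j u_ij^m for numeric points."""
    dim = len(points[0])
    centers = []
    for row in u:
        weights = [uij ** m for uij in row]
        total = sum(weights)  # assumed non-zero for every cluster
        centers.append(
            [sum(w * p[d] for w, p in zip(weights, points)) / total
             for d in range(dim)]
        )
    return centers


# Hypothetical dissimilarities: 2 clusters (rows) by 3 objects (columns).
dist = [
    [1.0, 3.0, 0.0],
    [3.0, 1.0, 2.0],
]
u = fuzzy_memberships(dist, m=2.0)
print(u)
print(fuzzy_centers([[0.5, 0.5]], [[0.0], [2.0]], m=2.0))
```

For categorical objects, the entries of `dist` could be supplied by the dissimilarity in (6); how the centers are then updated is not spelled out in the text, so the numeric center update here should be read only as the standard fuzzy c-means form.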



6. Conclusions

This work used fuzzy clustering and hard k-means algorithms to cluster the given medical data. In medical diagnostic systems, the fuzzy clustering algorithm gives better results than the hard k-means algorithm in our application. Another important feature of the fuzzy clustering algorithm is its membership function: an object can belong to several classes at the same time, but with different degrees. This is a useful feature for a medical diagnostic system. As a result, fuzzy clustering methods can be an important supportive tool for medical experts in diagnosis.

Although the fuzzy clustering algorithm is more accurate than the existing techniques, it does not give 100% accuracy. Higher accuracy may be achieved either by altering the parameters used in the algorithm or by changing the flow of computation that the algorithm currently adopts. Any alteration that reduces the memory needed to store and cluster the data points could also be proposed and implemented.


A formula for the fuzziness parameter and the convergence value, given the number of clusters, could also be formulated. This would allow these parameters to be set automatically while still giving an optimal output, and would relieve the user of having to supply them as inputs.

References

1. Earl Cox (2005) “Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration”, pp. 208-247.

2. Jiawei Han and Micheline Kamber (2000) “Data Mining Concepts and Techniques”, Morgan Kaufmann Publishers, pp. 1-20, 335-387.

3. Timothy Ross (1997) “Fuzzy Logic and its Implementation”, pp. 379-396.

4. Jonathan C. Prather, David F. Lobach, Linda K. Goodwin, R.N, Joseph W. Hales, Marvin L. Hage, and W. Edward Hammond, “Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse”.

5. Georg Berks, Diedrich Graf v. Keyserlingk, Jan Jantzen, Mariagrazia Dotoli, Hubertus Axer (2000) “Fuzzy Clustering - A Versatile Mean to Explore Medical Databases”.

6. Songul Albayrak, Fatih Amasyalı (2003) “Fuzzy C-Means Clustering on Medical Diagnostic Systems”.

7. Herbert Schildt and Patrick Naughton (2000) “The Complete Reference JAVA-2”, Third Edition, pp. 871-899.

8. Carlos Ordonez (2006) “Integrating K-Means Clustering with a Relational DBMS Using SQL”, IEEE Trans., Feb 2006.

9. Carlos Ordonez (2004) “Efficient Disk-Based K-Means Clustering for Relational Databases”, IEEE Trans., April 2004.

10. Carlos Ordonez (2003) “Clustering Binary Data Streams with K-Means”, IEEE Trans., June 2003.

11. Zhexue Huang (2003) “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining”.

12. Anne Denton, Qiang Ding, Qin Ding and William Perrizo (2003) “A Fast Clustering Algorithm to Cluster Very Large Categorical Data in Data Mining”.

13. Li Jie, Gao Xinbo and Jiao Li-Cheng (2004) “A CSA-based clustering algorithm for large data sets with mixed numeric and categorical values”, IEEE Trans., 15-19 June 2004.

14. Li Jie, Gao Xinbo and Jiao Li-Cheng (2004) “A novel clustering method with network structure based on clonal algorithm”, IEEE Trans., 17-21 May 2004.

15. Ohn Mar San, Van-Nam Huynh, Yoshiteru Nakamori (2004) “An alternative extension of the K-Means algorithm for Clustering Categorical Data”.