A Unified Metric for Categorical and Numerical Attributes in Data Clustering

Yiu-ming Cheung and Hong Jia
Department of Computer Science and Institute of Computational and Theoretical Studies
Hong Kong Baptist University, Hong Kong SAR, China
Yiu-ming Cheung and Hong Jia (HKBU), Unified Metric for Mixed Data Clustering, 2013
Outline
1 Introduction
    Motivation
    Previous Work
    Objective
2 Object-cluster Similarity Metric
    Clustering Task
    Similarity Metric for Mixed Data
3 Iterative Clustering Algorithm
4 Experiments
    Evaluation Criteria
    Performance on Mixed Data Sets
    Performance on Categorical Data Sets
5 Conclusion
6 Acknowledgment
Motivation
Clustering and Attribute
Clustering:
- A technique widely used in various scientific areas;
- The main task is to discover the natural group structure of objects represented by numerical or categorical attributes (Michalski et al., 1998).
Attribute:
- An attribute is a property or characteristic of an object;
- Each object is described by a collection of attributes;
- There exist two different types of attributes:
    - Numerical attributes: can be ordered by numbers;
    - Categorical attributes: cannot be ordered by their values, but can be separated into groups.
An Example: Diagnostic Records of Patients
UCI Heart Disease data set: contains 7 categorical and 6 numerical attributes.

Attribute                              Descriptor                                     Property     Type
Age                                    -                                              continuous   numerical
Sex                                    {F, M}                                         discrete     categorical
Chest pain type                        {typical angina, atypical angina, ...}         discrete     categorical
Resting blood pressure                 -                                              continuous   numerical
Serum cholesterol                      -                                              continuous   numerical
Fasting blood sugar                    {> 120 mg/dl, ≤ 120 mg/dl}                     discrete     categorical
Resting electrocardiographic results   {type I, type II, type III}                    discrete     categorical
Maximum heart rate                     -                                              continuous   numerical
Exercise induced angina                {yes, no}                                      discrete     categorical
ST depression                          -                                              continuous   numerical
Slope of ST segment                    {upsloping, flat, downsloping}                 discrete     categorical
CA                                     -                                              continuous   numerical
THAL                                   {normal, fixed defect, reversible defect}      discrete     categorical
Problem
- Traditional clustering methods often concentrate on purely numerical data.
- There is an awkward gap between the similarity metrics for categorical and numerical data.
- Transforming categorical values into numerical ones ignores the similarity information embedded in the categorical values and cannot faithfully reveal the similarity structure of the data sets (Hsu, TNN'2006).
It is desirable to solve this problem by finding a unified similarity metric for categorical and numerical attributes.
Previous Work
Roughly, the existing approaches to handling categorical attributes in clustering analysis fall into four categories:
- Methods based on the perspective of similarity
    - Similarity Based Agglomerative Clustering (SBAC) algorithm (Li and Biswas, TKDE'02)
- Methods based on graph partitioning
    - CLICKS algorithm (Zaki and Peters, ICDE'2005)
- Entropy-based methods
    - COOLCAT algorithm (Barbara et al., CIKM'2002)
- Approaches that attempt to give a distance metric for categorical values
    - K-prototype algorithm (Huang, PAKDD'97)
Objective
- Give a unified similarity metric that can be applied directly to data with categorical, numerical, and mixed attributes;
- Design an efficient clustering algorithm that is applicable to all three types of data: numerical, categorical, and mixed.
Clustering Task
Clustering a set of $N$ objects, $\{x_1, x_2, \ldots, x_N\}$, into $k$ different clusters, denoted as $C_1, C_2, \ldots, C_k$, can be formulated as finding the optimal $Q^*$ via
$$Q^* = \arg\max_{Q} F(Q) = \arg\max_{Q} \left[ \sum_{j=1}^{k} \sum_{i=1}^{N} q_{ij}\, s(x_i, C_j) \right], \tag{1}$$
where $s(x_i, C_j)$ is the similarity between object $x_i$ and cluster $C_j$, and $Q = (q_{ij})$ is an $N \times k$ partition matrix satisfying
$$\sum_{j=1}^{k} q_{ij} = 1, \quad 0 < \sum_{i=1}^{N} q_{ij} < N, \quad \text{and} \quad q_{ij} \in [0, 1]. \tag{2}$$
Evidently, the desired clusters can be obtained once the metric of object-cluster similarity is determined.
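As a minimal sketch of Eqs. (1) and (2), the snippet below evaluates F(Q) for a given partition matrix and checks the partition constraints. The names `is_valid_partition` and `objective`, and the similarity values in `S`, are illustrative assumptions and do not come from the slides.

```python
from typing import List

def is_valid_partition(Q: List[List[float]]) -> bool:
    """Check the constraints of Eq. (2) on an N x k partition matrix."""
    n, k = len(Q), len(Q[0])
    # Each row sums to 1 and every entry lies in [0, 1].
    rows_ok = all(
        abs(sum(row) - 1.0) < 1e-9 and all(0.0 <= q <= 1.0 for q in row)
        for row in Q
    )
    # Each column sum lies strictly between 0 and N (no empty or all-absorbing cluster).
    cols_ok = all(0.0 < sum(Q[i][j] for i in range(n)) < n for j in range(k))
    return rows_ok and cols_ok

def objective(Q: List[List[float]], S: List[List[float]]) -> float:
    """F(Q) = sum_j sum_i q_ij * s(x_i, C_j), where S[i][j] holds s(x_i, C_j)."""
    return sum(q * s for q_row, s_row in zip(Q, S) for q, s in zip(q_row, s_row))

# Made-up similarities for N = 3 objects and k = 2 clusters.
S = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
Q = [[1, 0], [0, 1], [1, 0]]  # each object goes to its most similar cluster
F = objective(Q, S)           # 0.9 + 0.8 + 0.6
```

Assigning every object to a single cluster, as in `[[1, 0], [1, 0], [1, 0]]`, violates the column constraint of Eq. (2) and is rejected by `is_valid_partition`.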
Similarity Metric for Mixed Data
Representation of Mixed Data
Suppose the mixed data object $x_i$ with $d$ different attributes consists of $d_c$ categorical attributes and $d_u$ numerical attributes ($d_c + d_u = d$). Then $x_i$ can be denoted as $[x_i^{cT}, x_i^{uT}]^T$ with $x_i^c = (x_{i1}^c, x_{i2}^c, \ldots, x_{id_c}^c)^T$ and $x_i^u = (x_{i1}^u, x_{i2}^u, \ldots, x_{id_u}^u)^T$.
Here, we have:
- $x_{ir}^u$ ($r = 1, 2, \ldots, d_u$) belonging to $\mathbb{R}$;
- $x_{ir}^c$ ($r = 1, 2, \ldots, d_c$) belonging to $\mathrm{dom}(A_r)$, where $\mathrm{dom}(A_r)$ contains all possible values that can be taken by categorical attribute $A_r$.
Specifically, $\mathrm{dom}(A_r)$ with $m_r$ elements can be represented as $\mathrm{dom}(A_r) = \{a_{r1}, a_{r2}, \ldots, a_{rm_r}\}$.
Definition of $s(x_i, C_j)$ (I)
Observation: In clustering analysis, numerical attributes are usually treated as a whole vector, while categorical attributes are investigated individually.
Definition: Let the object-cluster similarity $s(x_i, C_j)$ be the average of the similarities calculated on each attribute. We then have
$$s(x_i, C_j) = \frac{1}{d} s(x_{i1}^c, C_j) + \frac{1}{d} s(x_{i2}^c, C_j) + \cdots + \frac{1}{d} s(x_{id_c}^c, C_j) + \frac{d_u}{d} s(x_i^u, C_j) = \frac{1}{d} \sum_{r=1}^{d_c} s(x_{ir}^c, C_j) + \frac{d_u}{d} s(x_i^u, C_j). \tag{3}$$
Here, the similarity between each numerical attribute and the cluster $C_j$ is replaced with the similarity between the cluster and the whole numerical vector $x_i^u$.
Definition of $s(x_i, C_j)$ (II)
If we denote the similarity between $x_i^c$ and $C_j$ as $s(x_i^c, C_j)$, we get
$$s(x_i^c, C_j) = \frac{1}{d_c} \sum_{r=1}^{d_c} s(x_{ir}^c, C_j) = \sum_{r=1}^{d_c} \frac{1}{d_c}\, s(x_{ir}^c, C_j). \tag{4}$$
Then Eq. (3) can be rewritten as
$$s(x_i, C_j) = \frac{d_c}{d}\, s(x_i^c, C_j) + \frac{d_u}{d}\, s(x_i^u, C_j). \tag{5}$$
Subsequently, the object-cluster similarity metric can be obtained from the definitions of $s(x_i^c, C_j)$ and $s(x_i^u, C_j)$.
Similarity Metric for Categorical Attributes (I)
Taking into account the unequal importance of different categorical attributes for clustering analysis, the computation of $s(x_i^c, C_j)$ should be further modified as
$$s(x_i^c, C_j) = \sum_{r=1}^{d_c} w_r\, s(x_{ir}^c, C_j), \tag{6}$$
where $w_r$ is the weight of categorical attribute $A_r$, satisfying $0 \le w_r \le 1$ and $\sum_{r=1}^{d_c} w_r = 1$.
That is, the object-cluster similarity for the categorical part is the weighted sum of the similarities between the cluster and each attribute value.
Similarity Metric for Categorical Attributes (II)
Definition 1
The similarity between a categorical attribute value $x_{ir}^c$ and cluster $C_j$ is defined as:
$$s(x_{ir}^c, C_j) = \frac{\sigma_{A_r = x_{ir}^c}(C_j)}{\sigma_{A_r \neq \mathrm{NULL}}(C_j)}, \tag{7}$$
where $\sigma_{A_r = x_{ir}^c}(C_j)$ counts the number of objects in cluster $C_j$ that have the value $x_{ir}^c$ for attribute $A_r$, NULL refers to an empty value, and $\sigma_{A_r \neq \mathrm{NULL}}(C_j)$ accordingly counts the objects in $C_j$ whose value for $A_r$ is not empty.
Therefore, the object-cluster similarity for the categorical part is calculated by
$$s(x_i^c, C_j) = \sum_{r=1}^{d_c} w_r\, s(x_{ir}^c, C_j) = \sum_{r=1}^{d_c} w_r\, \frac{\sigma_{A_r = x_{ir}^c}(C_j)}{\sigma_{A_r \neq \mathrm{NULL}}(C_j)}. \tag{8}$$
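The counting in Eqs. (7) and (8) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: NULL is represented by `None`, an attribute whose column is entirely empty in the cluster is assumed to contribute zero, and the function name, toy cluster, and weights are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Value = Optional[str]

def categorical_similarity(
    x_c: Sequence[Value],
    cluster: List[Tuple[Value, ...]],
    weights: Sequence[float],
) -> float:
    """Eq. (8): weighted sum over attributes of
    count(A_r == x_ir in C_j) / count(A_r != NULL in C_j)."""
    total = 0.0
    for r, (value, w) in enumerate(zip(x_c, weights)):
        # Keep only the non-NULL values of attribute r inside the cluster.
        column = [obj[r] for obj in cluster if obj[r] is not None]
        if not column or value is None:
            continue  # assumption: nothing to compare, so this attribute adds zero
        total += w * Counter(column)[value] / len(column)
    return total

# Toy cluster with two categorical attributes; the last object misses attribute 2.
cluster = [("F", "typical"), ("F", "atypical"), ("M", None)]
sim = categorical_similarity(("F", "typical"), cluster, [0.5, 0.5])
# Attribute 1: 2 of 3 objects are "F". Attribute 2: 1 of 2 non-NULL values is "typical".
```

With equal weights the result is 0.5 * 2/3 + 0.5 * 1/2, showing that the missing value shrinks the denominator of the second attribute rather than counting as a mismatch.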
Calculation of Categorical Attribute Weights
From the viewpoint of information theory, the importance of any categorical attribute $A_r$ can be estimated by
$$H_{A_r} = -\frac{1}{m_r} \sum_{t=1}^{m_r} p(a_{rt}) \log p(a_{rt}) \quad \text{with} \quad p(a_{rt}) = \frac{\sigma_{A_r = a_{rt}}(X)}{\sigma_{A_r \neq \mathrm{NULL}}(X)}, \tag{9}$$
where $a_{rt} \in \mathrm{dom}(A_r)$, $X$ is the whole data set, and $m_r$ is the number of values that can be taken by $A_r$.
The weight of each attribute is then computed as
$$w_r = \frac{H_{A_r}}{\sum_{t=1}^{d_c} H_{A_t}}. \tag{10}$$
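A short sketch of Eqs. (9) and (10) follows. It assumes that dom(A_r) is the set of values actually observed in the data, that NULL is encoded as `None`, and that uniform weights are used when every attribute has zero entropy; these choices and the function names are illustrative and not taken from the slides.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

def attribute_importance(column: Sequence[Optional[str]]) -> float:
    """Eq. (9): H = -(1/m_r) * sum_t p(a_rt) * log p(a_rt), ignoring NULL entries."""
    present = [v for v in column if v is not None]
    counts = Counter(present)
    m_r = len(counts)  # assumption: dom(A_r) is the set of observed values
    if m_r == 0:
        return 0.0
    n = len(present)
    return -sum((c / n) * math.log(c / n) for c in counts.values()) / m_r

def attribute_weights(data: List[Tuple[Optional[str], ...]]) -> List[float]:
    """Eq. (10): w_r = H_r / sum_t H_t over the d_c categorical attributes."""
    d_c = len(data[0])
    H = [attribute_importance([row[r] for row in data]) for r in range(d_c)]
    total = sum(H)
    if total == 0:
        return [1.0 / d_c] * d_c  # assumption: fall back to uniform weights
    return [h / total for h in H]

# Toy data: the third attribute is constant, so it carries no information.
data = [("F", "a", "x"), ("M", "a", "x"), ("F", "b", "x"), ("M", "c", "x")]
weights = attribute_weights(data)
```

In this toy data the first two attributes happen to have the same normalised entropy (0.5 log 2 each), so they share the weight equally, while the constant third attribute receives weight zero.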
Similarity Metric for Numerical Attributes (I)
It is a universal law that the distance and perceived similarity between numerical vectors are related via an exponential function as follows:
$$s(x_A, x_B) = \exp(-Dis(x_A, x_B)), \tag{11}$$
where $Dis$ stands for a distance measure.
Moreover, to avoid the influence of different magnitudes of distances, we can further use proportional distance instead of absolute distance.
Similarity Metric for Numerical Attributes (II)
Definition 2
The object-cluster similarity between numerical vector $x_i^u$ and cluster $C_j$ is given by
$$s(x_i^u, C_j) = \exp\left( -\frac{Dis(x_i^u, c_j)}{\sum_{t=1}^{k} Dis(x_i^u, c_t)} \right), \tag{12}$$
where $c_j$ is the center of all numerical vectors in cluster $C_j$.
In practice, different distance metrics can be utilized to calculate $Dis(x_i^u, c_j)$.
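To make Eq. (12) concrete, the sketch below uses the Euclidean distance as one possible choice of Dis; the slides leave the distance open, so this choice, the function names, and the handling of the degenerate case where all distances are zero are assumptions.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

def euclidean(a: Vector, b: Vector) -> float:
    """One possible Dis; any other distance function can be passed instead."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def numerical_similarity(
    x_u: Vector,
    centers: Sequence[Vector],
    j: int,
    dis: Callable[[Vector, Vector], float] = euclidean,
) -> float:
    """Eq. (12): exp(-Dis(x, c_j) / sum_t Dis(x, c_t))."""
    distances = [dis(x_u, c) for c in centers]
    total = sum(distances)
    if total == 0.0:
        return 1.0  # assumption: x coincides with every center
    return math.exp(-distances[j] / total)

centers = [(0.0, 0.0), (3.0, 4.0)]
x = (0.0, 0.0)
s_near = numerical_similarity(x, centers, 0)  # distance 0 to its own center
s_far = numerical_similarity(x, centers, 1)   # carries the whole distance share
```

Because the distance is divided by the sum over all centers, the similarity always lies between exp(-1) and 1, regardless of the scale of the numerical attributes.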
Calculation of Object-cluster Similarity
According to the previous descriptions, the object-cluster similarity metric for mixed data is given by
$$s(x_i, C_j) = \frac{d_c}{d} \sum_{r=1}^{d_c} \left( \frac{H_{A_r}}{\sum_{t=1}^{d_c} H_{A_t}} \cdot \frac{\sigma_{A_r = x_{ir}^c}(C_j)}{\sigma_{A_r \neq \mathrm{NULL}}(C_j)} \right) + \frac{d_u}{d} \exp\left( -\frac{Dis(x_i^u, c_j)}{\sum_{t=1}^{k} Dis(x_i^u, c_t)} \right), \tag{13}$$
where $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, k$.
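The combination step of Eq. (13) is a convex mix of the categorical and numerical parts, with weights d_c/d and d_u/d. The following sketch takes the two partial similarities as inputs; the function name and the numbers in the example are illustrative assumptions, not experimental values.

```python
def mixed_similarity(s_c: float, s_u: float, d_c: int, d_u: int) -> float:
    """Eq. (13) in the form of Eq. (5): (d_c/d) * s_c + (d_u/d) * s_u with d = d_c + d_u."""
    d = d_c + d_u
    if d == 0:
        raise ValueError("an object needs at least one attribute")
    return (d_c * s_c + d_u * s_u) / d

# Illustrative values for a data set with 7 categorical and 6 numerical attributes.
s_mixed = mixed_similarity(s_c=0.5, s_u=1.0, d_c=7, d_u=6)

# With d_u = 0 or d_c = 0 the metric reduces to the purely categorical
# or purely numerical similarity, respectively.
s_cat_only = mixed_similarity(s_c=0.5, s_u=0.0, d_c=7, d_u=0)
s_num_only = mixed_similarity(s_c=0.0, s_u=1.0, d_c=0, d_u=6)
```

The two limiting cases are what allow a single metric to serve categorical, numerical, and mixed data alike.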
Clustering Criterion
We concentrate on hard partition only, i.e., $q_{ij} \in \{0, 1\}$.
Given a set of $N$ objects, the optimal $Q^* = \{q_{ij}^*\}$ in Eq. (1) can be given by
$$q_{ij}^* = \begin{cases} 1, & \text{if } s(x_i, C_j) \ge s(x_i, C_r),\ 1 \le r \le k, \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$
Similar to the learning procedure of k-means, an iterative algorithm can be conducted to implement the clustering analysis.
OCIL Algorithm
Iterative clustering learning based on the object-cluster similarity metric:
Require: data set X = {x_1, x_2, ..., x_N}, number of clusters k
Ensure: cluster labels Y = {y_1, y_2, ..., y_N}
Calculate the importance of each categorical attribute, if applicable
Set Y = {0, 0, ..., 0} and randomly select k initial objects, one for each cluster
repeat
    Initialize noChange = true
    for i = 1 to N do
        y_i^(new) = arg max over j in {1, ..., k} of s(x_i, C_j)
        if y_i^(new) ≠ y_i^(old) then
            noChange = false
            Update the information of clusters C_{y_i^(new)} and C_{y_i^(old)}, including the frequency of each categorical value and the centroid of numerical vectors
        end if
    end for
until noChange is true
return Y
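The procedure above can be sketched in Python as follows. This is a simplified illustration, not the authors' code: cluster statistics are recomputed from scratch instead of being updated incrementally, missing values are not handled, the Euclidean distance is assumed for Dis, a cluster is never allowed to lose its last member, and the `init`, `seed`, and `max_iter` parameters are hypothetical conveniences.

```python
import math
import random
from collections import Counter

def entropy_weights(cats):
    """Eqs. (9)-(10): entropy-based weights of the categorical attributes."""
    d_c, n = len(cats[0]), len(cats)
    H = []
    for r in range(d_c):
        counts = Counter(row[r] for row in cats)
        H.append(-sum(c / n * math.log(c / n) for c in counts.values()) / len(counts))
    total = sum(H)
    return [h / total for h in H] if total > 0 else [1.0 / d_c] * d_c

def similarity(i, j, cats, nums, members, w):
    """Eq. (13) for object i and cluster j, recomputed from the member lists."""
    d_c, d_u = len(cats[i]), len(nums[i])
    s_c = 0.0
    for r in range(d_c):
        match = sum(1 for m in members[j] if cats[m][r] == cats[i][r])
        s_c += w[r] * match / len(members[j])
    s_u = 0.0
    if d_u:
        centers = [[sum(nums[m][t] for m in ms) / len(ms) for t in range(d_u)]
                   for ms in members]
        dist = [math.dist(nums[i], c) for c in centers]
        s_u = math.exp(-dist[j] / sum(dist)) if sum(dist) > 0 else 1.0
    return (d_c * s_c + d_u * s_u) / (d_c + d_u)

def ocil(cats, nums, k, init=None, seed=0, max_iter=100):
    """Iterative clustering with the object-cluster similarity; returns labels 0..k-1."""
    n = len(cats)
    w = entropy_weights(cats) if cats[0] else []
    seeds = list(init) if init is not None else random.Random(seed).sample(range(n), k)
    y = [-1] * n                       # -1 marks a not yet assigned object
    members = [[s] for s in seeds]     # one initial object per cluster
    for j, s in enumerate(seeds):
        y[s] = j
    for _ in range(max_iter):          # safeguard added here, not part of the slides
        no_change = True
        for i in range(n):
            if y[i] >= 0 and len(members[y[i]]) == 1:
                continue               # assumption: never empty a cluster
            best = max(range(k), key=lambda j: similarity(i, j, cats, nums, members, w))
            if best != y[i]:
                no_change = False
                if y[i] >= 0:
                    members[y[i]].remove(i)
                members[best].append(i)
                y[i] = best
        if no_change:
            break
    return y

# Toy mixed data: two categorical attributes and one numerical attribute per object.
cats = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("b", "w")]
nums = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
labels = ocil(cats, nums, k=2, init=(0, 3))
```

Passing `init=(0, 3)` fixes the initial objects so the toy run is reproducible; leaving it as `None` selects the k initial objects at random, as in the procedure above.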
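Evaluation Criteria
Clustering Accuracy (ACC):
$$ACC = \frac{\sum_{i=1}^{N} \delta(c_i, \mathrm{map}(r_i))}{N},$$
where $c_i$ is the label of object $x_i$ provided by the data corpus, $\mathrm{map}(r_i)$ maps the obtained cluster label $r_i$ to the equivalent label from the data corpus by using the Kuhn-Munkres algorithm, and $\delta(a, b)$ equals 1 if $a = b$ and 0 otherwise.
Clustering Error Rate:
$$e = 1 - ACC.$$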
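A small sketch of the ACC computation is given below. Instead of the Kuhn-Munkres algorithm it searches all one-to-one label maps by brute force, which gives the same optimum but is only practical for a handful of clusters; this substitution, the function name, and the toy labels are assumptions made for illustration.

```python
from itertools import permutations
from typing import Hashable, Sequence

def clustering_accuracy(
    true_labels: Sequence[Hashable], cluster_labels: Sequence[Hashable]
) -> float:
    """ACC = (1/N) * sum_i delta(c_i, map(r_i)), maximised over one-to-one maps.

    Brute-force stand-in for the Kuhn-Munkres algorithm (factorial cost in k).
    """
    classes = list(dict.fromkeys(true_labels))      # distinct labels, order kept
    clusters = list(dict.fromkeys(cluster_labels))
    # Pad with dummy targets so a one-to-one map exists when clusters outnumber classes.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for c, r in zip(true_labels, cluster_labels) if mapping[r] == c)
        best = max(best, hits)
    return best / len(true_labels)

true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]   # cluster ids are arbitrary; one object is misplaced
acc = clustering_accuracy(true, pred)
error = 1.0 - acc
```

Here the best map sends cluster 1 to class 0 and cluster 0 to class 1, so five of the six objects are counted as correct.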
Performance on Mixed Data Sets
Mixed Data Sets
Table 1: Statistics of mixed data sets

Data set          Instances   Attributes ($d_c + d_u$)   Classes
Statlog Heart     270         7 + 6                      2
Heart Disease     303         7 + 6                      2
Credit Approval   653         9 + 6                      2
German Credit     1000        13 + 7                     2
Dermatology       366         33 + 1                     6
Adult             30162       8 + 6                      2
Clustering Errors on Mixed Data Sets
Table 2: Clustering errors of OCIL on mixed data sets in comparison with k-prototype and k-means

Data set      K-means           K-prototype       OCIL
Statlog       0.4047 ± 0.0071   0.2306 ± 0.0821   0.1716 ± 0.0065
Heart         0.4224 ± 0.0131   0.2280 ± 0.0903   0.1644 ± 0.0030
Credit        0.4487 ± 0.0016   0.2619 ± 0.0976   0.2519 ± 0.0966
German        0.3290 ± 0.0014   0.3289 ± 0.0006   0.3057 ± 0.0007
Dermatology   0.7006 ± 0.0216   0.6903 ± 0.0255   0.3051 ± 0.0896
Adult         0.3869 ± 0.0067   0.3855 ± 0.0143   0.3079 ± 0.0305
Comparison of Convergence Rate
Table 3: Comparison of average convergence time and number of iterations between k-prototype and OCIL

Data set      Time (s), K-prototype   Time (s), OCIL   Iterations, K-prototype   Iterations, OCIL
Statlog       0.0519                  0.0516           3.09                      3.07
Heart         0.0639                  0.0576           3.54                      3.02
Credit        0.1323                  0.1625           3.18                      4.26
German        0.2999                  0.2023           5.29                      3.15
Dermatology   0.3674                  0.1888           7.27                      4.32
Adult         15.2795                 9.6774           10.93                     6.78
Performance on Categorical Data Sets
Categorical Data Sets
Table 4: Statistics of categorical data sets

Data set   Instances   Attributes   Classes
Soybean    47          35           4
Breast     699         9            2
Vote       435         16           2
Zoo        101         16           7
Clustering Errors on Categorical Data Sets
Table 5: Comparison of clustering errors obtained by three different methods on categorical data sets

Data set   H's k-modes       N's k-modes       OCIL
Soybean    0.1691 ± 0.1521   0.0964 ± 0.1404   0.1017 ± 0.1380
Breast     0.1655 ± 0.1528   0.1356 ± 0.0016   0.0934 ± 0.0009
Vote       0.1387 ± 0.0066   0.1345 ± 0.0031   0.1213 ± 0.0010
Zoo        0.2873 ± 0.1083   0.2730 ± 0.0818   0.2681 ± 0.0906

H's k-modes: original k-modes algorithm (Huang, SIGMOD'97);
N's k-modes: k-modes algorithm with Ng's dissimilarity metric (Ng et al., TPAMI'07).
Conclusion
- A general clustering framework based on object-cluster similarity has been proposed.
- A unified similarity metric for both categorical and numerical attributes has been presented.
- An iterative algorithm which is applicable to clustering analysis on various data types has been introduced.
- The advantages of the proposed method have been experimentally demonstrated in comparison with the existing counterparts.
Acknowledgment
- Collaborative Graduate Program in Design, Kyoto University;
- Department of Computer Science, Hong Kong Baptist University.
References
1. Michalski, R.S., Bratko, I., Kubat, M.: Machine learning and data mining: methods and applications. Wiley, New York (1998)
2. Hsu, C.C.: Generalizing self-organizing map for categorical data. IEEE Transactions on Neural Networks 17(2) (March 2006) 294–304
3. Li, C., Biswas, G.: Unsupervised learning with mixed numeric and nominal data. IEEE Transactions on Knowledge and Data Engineering 14(4) (July/August 2002) 673–690
4. Zaki, M.J., Peters, M.: Click: Mining subspace clusters in categorical data via k-partite maximal cliques. In: Proceedings of the 21st International Conference on Data Engineering. (2005) 355–356
5. Barbara, D., Couto, J., Li, Y.: Coolcat: An entropy-based algorithm for categorical clustering. In: Proceedings of the 11th ACM Conference on Information and Knowledge Management. (2002) 582–589
6. Huang, Z.: Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining. (1997) 21–24
7. Huang, Z.: A fast clustering algorithm to cluster very large categorical data sets in data mining. In: Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. (1997) 1–8
8. Ng, M.K., Li, M.J., Huang, J.Z., He, Z.: On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3) (2007) 503–507
Thank You!