Nov 8, 2013

Literature review

A.K. Jain, M.N. Murty, and P.J. Flynn: ‘Data Clustering: A Review’, ACM Computing Surveys,
vol. 31 (3), Sept. 1999.

This paper presents an overview of pattern clustering methods, with the goal of providing
useful advice and references to fundamental concepts accessible to the broad community
of clustering practitioners. It presents a taxonomy of clustering techniques and recent
advances. It also describes some important applications of clustering algorithms, such as
image segmentation, object recognition, and information retrieval.

Components of a Clustering Task

Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) Pattern representation,

(2) Definition of a pattern proximity measure appropriate to the data domain,

(3) Clustering or grouping,

(4) Data abstraction (if needed), and

(5) Assessment of output (if needed).

Pattern representation refers to the number of classes, the number of available patterns,
and the number, type, and scale of the features available to the clustering algorithm.
Some of this information may not be controllable.


Pattern proximity determines how the similarity of two elements is calculated. It is
usually measured by a distance function defined on pairs of patterns. A variety of
distance measures are in use in the various communities.
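
As an illustration of a distance function defined on pairs of patterns, the sketch below computes two common measures on numeric feature vectors. The choice of Euclidean and Manhattan distance, and the function names `euclidean`, `manhattan`, and `proximity_matrix`, are assumptions made for this example; the review does not single out one measure.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of features")
    return sum(abs(x - y) for x, y in zip(a, b))


def proximity_matrix(patterns, dist=euclidean):
    """Pairwise distances between all patterns (hypothetical helper)."""
    return [[dist(p, q) for q in patterns] for p in patterns]


patterns = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = proximity_matrix(patterns)
```

Smaller values mean more similar patterns. A different domain (for example text or categorical data) would likely call for a different measure.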


The grouping step can be performed in a number of ways. The output clustering (or
clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern
has a variable degree of membership in each of the output clusters).
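
The difference between hard and fuzzy output can be sketched in a few lines. The inverse-squared-distance weighting below is an assumption chosen for illustration (it resembles the membership rule of fuzzy c-means with fuzzifier 2), and both function names are hypothetical; the review itself does not prescribe this formula.

```python
def inverse_distance_memberships(pattern, centers):
    """Fuzzy memberships of one pattern in each cluster, summing to 1.

    Assumption: membership is proportional to 1 / squared distance
    to each cluster center.
    """
    d2 = [sum((x - c) ** 2 for x, c in zip(pattern, center))
          for center in centers]
    # A pattern lying exactly on a center belongs fully to that center.
    if any(d == 0 for d in d2):
        hits = [1.0 if d == 0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    weights = [1.0 / d for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def harden(memberships):
    """Turn a fuzzy membership vector into a hard label (largest membership)."""
    return max(range(len(memberships)), key=memberships.__getitem__)


centers = [[0.0, 0.0], [3.0, 0.0]]
fuzzy = inverse_distance_memberships([1.0, 0.0], centers)
hard = harden(fuzzy)
```

Here `fuzzy` holds one degree of membership per cluster, while `hard` is a single cluster index, that is, one cell of a partition.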


Data abstraction is the process of extracting a simple and compact representation of a
data set; in the clustering context, this is typically a compact description of each cluster.
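
One common compact description of a cluster is its centroid, the per-feature mean of its members. The sketch below assumes numeric features and hard labels; the function name `centroids` is hypothetical, and other prototypes (for example a representative pattern) would serve equally well.

```python
from collections import defaultdict


def centroids(patterns, labels):
    """Map each cluster label to the mean feature vector of its members."""
    groups = defaultdict(list)
    for pattern, label in zip(patterns, labels):
        groups[label].append(pattern)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }


patterns = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]]
labels = [0, 0, 1]
summary = centroids(patterns, labels)
```

Three patterns are reduced to two prototype vectors, one per cluster.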


How is the output of a clustering algorithm evaluated? What characterizes a ‘good’
clustering result and a ‘poor’ one? All clustering algorithms will, when presented with
data, produce clusters, regardless of whether the data contain clusters or not. If the data
do contain clusters, some clustering algorithms may obtain ‘better’ clusters than others.
The assessment of a clustering procedure’s output, then, has several facets. One is
actually an assessment of the data domain rather than the clustering algorithm itself:
data which do not contain clusters should not be processed by a clustering algorithm.
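
The point that an algorithm produces clusters whether or not the data contain any can be seen with a toy example. The sketch below is a minimal 2-means procedure on one-dimensional data, written for this illustration only; the function name `two_means_1d` and the deterministic initialization at the minimum and maximum are assumptions, not a method taken from the review.

```python
def two_means_1d(values, iterations=20):
    """Lloyd-style 2-means on 1-D data; returns a 0/1 label per value."""
    # Assumption: start the two centers at the extremes of the data.
    centers = [min(values), max(values)]
    labels = []
    for _ in range(iterations):
        # Assign each value to the nearer center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


# Evenly spaced points: there is no cluster structure to find.
uniform = [i / 10 for i in range(10)]
labels = two_means_1d(uniform)
```

The procedure still splits the evenly spaced points into two groups, which is why the data domain has to be assessed separately from the algorithm's output.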