Literature review

A.K. Jain, M.N. Murty and P.J. Flynn: ‘Data Clustering: A Review’, ACM Computing Surveys, vol. 31 (3), Sept. 1999.

This paper presents an overview of pattern clustering methods, with the goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. It presents a taxonomy of clustering techniques and describes recent advances. It also describes some important applications of clustering algorithms, such as image segmentation, object recognition, and information retrieval.

Components of a Clustering Task

A typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) Pattern representation,
(2) Definition of a pattern proximity measure appropriate to the data domain,
(3) Clustering or grouping,
(4) Data abstraction (if needed), and
(5) Assessment of output (if needed).

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable.

Pattern proximity determines how the similarity of two elements is calculated. It is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities.
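As a minimal illustration of a distance function defined on pairs of patterns, the sketch below computes the Minkowski distance, of which the Euclidean (p = 2) and Manhattan (p = 1) distances are special cases. The function name and the sample vectors are hypothetical and chosen only for illustration; the summary above does not prescribe any particular measure.

```python
def minkowski_distance(x, y, p=2):
    """Distance between two equal-length numeric feature vectors.

    p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    """
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of features")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Two hypothetical patterns, each described by two numeric features.
a = (0.0, 0.0)
b = (3.0, 4.0)

euclidean = minkowski_distance(a, b, p=2)
manhattan = minkowski_distance(a, b, p=1)
```

Which measure is appropriate depends on the data domain, for example on whether the features share a common scale.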

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters).
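The difference between the two kinds of output can be made concrete with a small sketch. The patterns, the number of clusters, and the membership values below are made up for illustration. In a hard clustering each pattern carries a single label, while in a fuzzy clustering each pattern carries a membership degree per cluster (assumed here to sum to 1 for each pattern, as in fuzzy c-means).

```python
# Hard clustering: one cluster label per pattern (a partition).
hard_labels = [0, 0, 1, 1]

# Fuzzy clustering: one membership degree per (pattern, cluster) pair.
# Rows are patterns, columns are clusters; values are illustrative only.
fuzzy_memberships = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.4, 0.6],
    [0.1, 0.9],
]


def harden(memberships):
    """Turn a fuzzy clustering into a hard one by assigning each
    pattern to the cluster with its largest membership degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


derived_labels = harden(fuzzy_memberships)
```

Hardening discards information: the third pattern is only weakly attached to its cluster, which the label alone no longer shows.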

Data abstraction is the process of extracting a simple and compact representation of a data set, that is, a compact description of each cluster.
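One common compact description of a cluster is its centroid, the per-feature mean of its members. The sketch below assumes numeric features and a hard clustering given as a list of labels; the data and the function name are hypothetical.

```python
from collections import defaultdict


def centroids(patterns, labels):
    """Return {cluster_label: centroid}, where each centroid is the
    per-feature mean of the patterns assigned to that cluster."""
    groups = defaultdict(list)
    for pattern, label in zip(patterns, labels):
        groups[label].append(pattern)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }


# Four hypothetical 2-D patterns in two clusters.
patterns = [(1.0, 1.0), (3.0, 1.0), (10.0, 8.0), (12.0, 10.0)]
labels = [0, 0, 1, 1]

summary = centroids(patterns, labels)
```

Here four patterns are reduced to two representative points, one per cluster.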

How is the output of a clustering algorithm evaluated? What characterizes a ‘good’ clustering result and a ‘poor’ one? All clustering algorithms will, when presented with data, produce clusters, regardless of whether the data contain clusters or not. If the data do contain clusters, some clustering algorithms may obtain ‘better’ clusters than others. The assessment of a clustering procedure’s output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself: data which do not contain clusters should not be processed by a clustering algorithm.
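The observation that a clustering algorithm produces clusters whether or not the data contain any can be seen in a small experiment. The sketch below is a bare-bones k-means, used here only as one familiar example of a clustering algorithm, run on points drawn uniformly at random from the unit square. All names, seeds, and parameters are illustrative assumptions.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                                        + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels


# Uniform random points: no cluster structure by construction.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(200)]

labels = kmeans(points, k=3)
```

Although the points have no group structure, every point still receives one of the k labels, so the output alone cannot show that clusters exist in the data.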
