Data Mining
Comp. Sc. and Inf. Mgmt.
Asian Institute of Technology
Instructor: Dr. Sumanta Guha
Slide sources: Han & Kamber, “Data Mining: Concepts and Techniques” book; slides by Han and by Han & Kamber, adapted and supplemented by Guha
Chapter 7: Cluster Analysis
What is Cluster Analysis?
Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
Cluster analysis
- Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
Pattern recognition
Spatial data mining
- Create thematic maps in GIS by clustering feature spaces
- Detect spatial clusters, or support other spatial mining tasks
Image processing
Economic science
- Market research
WWW
- Document classification
- Cluster weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost. Fraud detection: outliers!
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions are usually very different for numeric, boolean, categorical, and ordinal variables:
- Numeric: income, temperature, price, etc.
- Boolean: yes/no, e.g., student? citizen?
- Categorical: color (red, blue, green, …), nationality, etc.
- Ordinal: excellent/very good/…, high/medium/low (i.e., with order)
It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
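To make the distinction concrete, here is a minimal sketch (illustrative helper names, not from the slides) of two common distance functions: Euclidean distance for numeric attributes and simple matching distance for boolean/categorical attributes:

```python
import math

def euclidean(x, y):
    # Numeric attributes: straight-line distance between two tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def simple_matching(x, y):
    # Boolean/categorical attributes: fraction of attributes that differ.
    return sum(a != b for a, b in zip(x, y)) / len(x)

print(euclidean((0, 0), (3, 4)))                         # 5.0
print(simple_matching(("red", "yes"), ("blue", "yes")))  # 0.5
```

Ordinal variables are usually mapped to ranks first and then treated numerically.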
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noise and outliers
Insensitivity to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Major Clustering Approaches
Partitioning approach:
- Given n objects in the database, a partitioning approach splits it into k groups
- Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
- Create a hierarchical decomposition of the set of data (or objects) using one of two methods:
  - Agglomerative (bottom-up): start with each object as a separate group; successively merge groups that are close until a termination condition holds
  - Divisive (top-down): start with all objects in one group; successively split groups that are not “tight” until a termination condition holds
- Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
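As a rough illustration of the agglomerative (bottom-up) idea, the following single-linkage sketch (1-D points; illustrative code, not taken from any of the named methods) repeatedly merges the two closest groups until k groups remain:

```python
def single_linkage(points, k):
    # Bottom-up: start with each object as its own singleton cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest inter-cluster
        # distance (single linkage: distance between closest members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

print(single_linkage([1, 2, 3, 8, 9, 11], 2))  # [[1, 2, 3], [8, 9, 11]]
```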
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters K_m, so as to minimize the sum of squared distances

  E = Σ_{m=1..k} Σ_{t_mi ∈ K_m} (C_m − t_mi)²

where C_m is the cluster leader or representative (which itself may or may not belong to the database D).
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimal: exhaustively enumerate all partitions
- Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen’67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
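A sketch of the criterion E, assuming 1-D objects and a given assignment of objects to clusters with leaders C_m:

```python
def sse(clusters, leaders):
    # E = sum over clusters K_m, and over objects t_mi in K_m,
    # of the squared distance (C_m - t_mi)^2 to the cluster leader C_m.
    return sum((c - t) ** 2
               for c, cluster in zip(leaders, clusters)
               for t in cluster)

print(sse([[1, 2, 3], [8, 10]], [2.0, 9.0]))  # (1+0+1) + (1+1) = 4.0
```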
1. The K-Means Clustering Method
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input:
- k: the number of clusters
- D: a data set containing n objects
Output: a set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) assign each object to the cluster whose center is closest to the object;
(4) update the cluster centers as the mean value of the objects in each cluster;
(5) until no change;
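The method above can be sketched in Python (a minimal 1-D version; to keep the run deterministic, the arbitrarily chosen initial centers of step (1) are passed in explicitly):

```python
def k_means(points, centers):
    # centers: the k initial cluster centers chosen in step (1).
    k = len(centers)
    while True:
        # (3) assign each object to the cluster whose center is closest
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # (4) update the centers as the mean value of each cluster
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        # (5) until no change
        if new == centers:
            return clusters, centers
        centers = new

clusters, centers = k_means([1, 3, 2, 8, 4, 9, 11, 10], [9, 11])
print(clusters, centers)  # [[1, 3, 2, 4], [8, 9, 11, 10]] [2.5, 9.5]
```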
The K-Means Clustering Method
Example
[Figure: K = 2 on a 10×10 scatter plot. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again; reassign until no object moves.]
K-Means Clustering Method: Example with 8 points on a line
Points: 1, 3, 2, 8, 4, 9, 11, 10
- Randomly choose 2 objects as cluster leaders.
- Assign each point to the nearest cluster leader.
- Compute new cluster leaders = cluster means.
- No change = exit!
E.g., starting from leaders 9 and 11, the successive cluster means are 5.3 and 11, then 3.6 and 10, then 2.5 and 9.5; a further pass changes nothing, so exit.
Comments on the K-Means Method
Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
- Comparing: PAM: O(k(n − k)²), CLARA: O(ks² + k(n − k))
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing.
Weaknesses:
- Applicable only when the mean is defined; then what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
A few variants of k-means differ in:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: the k-prototype method
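A sketch of the k-modes center update (an illustrative helper, assuming objects are tuples of categorical attributes): the numeric mean is replaced, per attribute, by the mode, i.e., the most frequent category in the cluster:

```python
from collections import Counter

def mode_center(cluster):
    # Per attribute, take the most frequent category among the
    # cluster's objects -- the "mode" replacing the numeric mean.
    return tuple(Counter(values).most_common(1)[0][0]
                 for values in zip(*cluster))

print(mode_center([("red", "yes"), ("red", "no"), ("blue", "no")]))
# → ('red', 'no')
```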
What Is the Problem with the K-Means Method?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster.
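A minimal sketch contrasting the two reference points on 1-D data (illustrative, not the full PAM algorithm): the medoid is the cluster member minimizing the summed distance to all other members, and it barely moves under an outlier that drags the mean far away:

```python
def medoid(cluster):
    # The most centrally located *object*: the member minimizing
    # the sum of distances to all other members.
    return min(cluster, key=lambda c: sum(abs(c - t) for t in cluster))

pts = [1, 2, 3, 4, 100]     # 100 is an extreme outlier
print(sum(pts) / len(pts))  # mean: 22.0 -- badly distorted
print(medoid(pts))          # medoid: 3 -- still central
```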
More Partitioning Clustering Algorithms
2. K-medoids (PAM = Partitioning Around Medoids)
3. CLARA (Clustering LARge Applications)
4. CLARANS (Clustering Large Applications based on RANdomized Search)
Read the above three clustering methods from the paper “Efficient and Effective Clustering Methods for Spatial Data Mining”, by Ng and Han, Intnl. Conf. on Very Large Data Bases (VLDB’94), 1994, which proposes CLARANS but has a good presentation of PAM and CLARA as well.
Hierarchical Clustering Algorithms
5. ROCK
- “ROCK: A Robust Clustering Algorithm for Categorical Data”, by Guha, Rastogi and Shim, Information Systems, 2000.
6. DBSCAN
- “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, by Ester, Kriegel, Sander and Xu, Intnl. Conf. on Knowledge Discovery and Data Mining (KDD’96), 1996.
7. CLIQUE
- “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications”, by Agrawal, Gehrke, Gunopulos and Raghavan, ACM-SIGMOD Intnl. Conf. on Management of Data (SIGMOD’98), 1998.