Data Mining
Comp. Sc. and Inf. Mgmt.
Asian Institute of Technology

Instructor: Dr. Sumanta Guha

Slide Sources: Han & Kamber, "Data Mining: Concepts and Techniques" book; slides by Han & Kamber, adapted and supplemented by Guha

Chapter 7: Cluster Analysis

What is Cluster Analysis?

Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters

Cluster analysis
- Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Typical applications
- As a stand-alone tool, to get insight into data distribution
- As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts

Pattern Recognition

Spatial Data Mining
- Create thematic maps in GIS by clustering feature spaces
- Detect spatial clusters, or use clustering for other spatial mining tasks

Image Processing

Economic Science
- Market research

WWW
- Document classification
- Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; fraud detection (outliers!)

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric: d(i, j)

There is a separate "quality" function that measures the "goodness" of a cluster.

The definitions of distance functions are usually very different for numeric, boolean, categorical and ordinal variables.
- Numeric: income, temperature, price, etc.
- Boolean: yes/no, e.g., student? citizen?
- Categorical: color (red, blue, green, ...), nationality, etc.
- Ordinal: excellent/very good/..., high/medium/low (i.e., with order)

It is hard to define "similar enough" or "good enough": the answer is typically highly subjective.
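
Since these per-type distance functions drive every clustering method that follows, a concrete illustration may help. Below is a minimal Python sketch, not from the Han & Kamber text: the function names and the range/rank normalizations are illustrative assumptions, showing one common dissimilarity for each variable type.

```python
# Minimal, illustrative per-type dissimilarities (names and the
# normalizations are assumptions, not from the slides).

def d_numeric(x, y, value_range):
    # Normalize by the attribute's range so the result lies in [0, 1].
    return abs(x - y) / value_range

def d_boolean(x, y):
    # Simple matching: 0 if the two objects agree, 1 otherwise.
    return 0.0 if x == y else 1.0

def d_categorical(x, y):
    # No order among categories: any mismatch counts the same.
    return 0.0 if x == y else 1.0

def d_ordinal(rank_x, rank_y, num_levels):
    # Map ranks 1..num_levels onto [0, 1], then compare as numbers.
    zx = (rank_x - 1) / (num_levels - 1)
    zy = (rank_y - 1) / (num_levels - 1)
    return abs(zx - zy)

# One attribute of each type:
print(d_numeric(30000, 45000, value_range=100000))  # income -> 0.15
print(d_boolean(True, False))                       # student? -> 1.0
print(d_categorical("red", "green"))                # color -> 1.0
print(d_ordinal(3, 1, num_levels=3))                # high vs. low -> 1.0
```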

Requirements of Clustering in Data Mining

- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Major Clustering Approaches

Partitioning approach:
- Given n objects in the database, a partitioning approach splits it into k groups.
- Typical methods: k-means, k-medoids, CLARANS

Hierarchical approach:
- Create a hierarchical decomposition of the set of data (or objects) using one of two methods (see the sketch after this list):
  - Agglomerative (bottom-up): start with each object as a separate group; successively merge groups that are close until a termination condition holds.
  - Divisive (top-down): start with all objects in one group; successively split groups that are not "tight" until a termination condition holds.
- Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters K_m, so as to minimize the sum of squared distances

$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$

where $C_m$ is the cluster leader or representative (which itself may or may not belong to the database D).

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimal: exhaustively enumerate all partitions
- Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67): Each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
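
A minimal Python rendering of this criterion for 1-D objects may make it concrete; the list-of-lists cluster representation and the function name are assumptions for illustration.

```python
# Sum of squared distances from each object t to its cluster's
# representative C_m; clusters and leaders are parallel lists.

def sse(clusters, leaders):
    return sum((c - t) ** 2
               for cluster, c in zip(clusters, leaders)
               for t in cluster)

# Example: two clusters with their means as leaders.
print(sse([[1, 2, 3, 4], [8, 9, 10, 11]], [2.5, 9.5]))  # 10.0
```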
1. The K-Means Clustering Method

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

Input:
- k: the number of clusters,
- D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3)   assign each object to the cluster whose center is closest to the object;
(4)   update the cluster centers as the mean value of the objects in each cluster;
(5) until no change;
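
A minimal NumPy sketch of steps (1) through (5) follows; the function name, the seeded random choice of initial centers, and the stopping test on unchanged assignments are illustrative choices consistent with the algorithm above.

```python
import numpy as np

def k_means(D, k, seed=0):
    """Minimal sketch of the k-means method above; D is an (n, d) array."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    # (1) arbitrarily choose k objects from D as the initial centers
    centers = D[rng.choice(len(D), size=k, replace=False)].copy()
    labels = np.full(len(D), -1)
    while True:                                        # (2) repeat
        # (3) assign each object to the cluster whose center is closest
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):         # (5) until no change
            return centers, labels
        labels = new_labels
        # (4) update each center to the mean of the objects assigned to it
        for m in range(k):
            if np.any(labels == m):
                centers[m] = D[labels == m].mean(axis=0)
```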

The K-Means Clustering Method

Example

[Figure: k-means iterations on 2-D points with K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again; reassign, until no object changes cluster.]

K-Means Clustering Method: Example with 8 points on a line

Points: 1, 2, 3, 4, 8, 9, 10, 11.

- Randomly choose 2 objects as cluster leaders.
- Cluster each point to the nearest cluster leader.
- Compute new cluster leaders = cluster means (in the figure, starting from leaders 10 and 11, the successive means are 5.3 and 11, then 3.6 and 10, then 2.5 and 9.5).
- No change = Exit!
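
Running the k_means sketch from the previous slide on these eight points reproduces the final means; for data this well separated, every choice of the two initial leaders converges to the same two clusters.

```python
import numpy as np

# The eight points, as an (8, 1) array for the sketch above.
points = np.array([[1.], [3.], [2.], [8.], [4.], [9.], [11.], [10.]])
centers, labels = k_means(points, k=2)
print(sorted(c[0] for c in centers))  # [2.5, 9.5]
```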

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
- Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))

Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing.

Weakness
- Applicable only when the mean is defined; then what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method

A few variants of the k-means, which differ in
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means

Handling categorical data: k-modes (Huang'98); see the sketch below
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: k-prototype method
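
To make the k-modes idea concrete, here is a minimal sketch of the simple matching dissimilarity and a mode-based cluster "center"; the attribute values are made up for illustration and are not from the Huang'98 paper.

```python
from collections import Counter

def matching_dissim(x, y):
    # Count the attributes on which two objects disagree.
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(cluster):
    # The "center" is the most frequent value of each attribute.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "yes"), ("red", "no"), ("blue", "yes")]
print(cluster_mode(cluster))                           # ('red', 'yes')
print(matching_dissim(("red", "no"), ("blue", "no")))  # 1
```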

What Is the Problem with the K-Means Method?

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

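A small numeric illustration of why the medoid is more robust than the mean; the data values are made up, and 1-D distance is used for simplicity.

```python
# The medoid is the object minimizing total distance to all other
# objects in the cluster; an outlier drags the mean but not the medoid.

def medoid(cluster):
    return min(cluster, key=lambda c: sum(abs(c - t) for t in cluster))

cluster = [1, 2, 3, 4, 100]          # 100 is an outlier
mean = sum(cluster) / len(cluster)
print(mean)             # 22.0 -- dragged far from the bulk of the data
print(medoid(cluster))  # 3    -- stays centrally located
```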

More Partitioning Clustering Algorithms

2. K-medoids (PAM = Partitioning Around Medoids)

3. CLARA (Clustering LARge Applications)

4. CLARANS (Clustering Large Applications based on RANdomized Search)

Read the above three clustering methods from the paper "Efficient and Effective Clustering Methods for Spatial Data Mining", by Ng and Han, Intnl. Conf. on Very Large Data Bases (VLDB'94), 1994, which proposes CLARANS but has a good presentation of PAM and CLARA as well.


Hierarchical Clustering Algorithms

5. ROCK
"ROCK: A Robust Clustering Algorithm for Categorical Data", by Guha, Rastogi and Shim, Information Systems, 2000.

6. DBSCAN
"A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", by Ester, Kriegel, Sander and Xu, Intnl. Conf. Knowledge Discovery and Data Mining (KDD'96), 1996.

7. CLIQUE
"Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications", by Agrawal, Gehrke, Gunopulos and Raghavan, ACM-SIGMOD Intnl. Conf. on Management of Data (SIGMOD'98), 1998.