What Is Good Clustering?

naivenorthAI and Robotics

Nov 8, 2013

A good clustering method will produce high-quality clusters with

  high intra-class similarity

  low inter-class similarity

The quality of a clustering result depends on the similarity measure used by the method.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
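
The intra-class versus inter-class idea above can be checked numerically. The sketch below is only an illustration: it assumes Euclidean distance as the (dis)similarity measure, and the function name `mean_intra_inter` and the toy points are hypothetical. A good clustering should give a small mean intra-cluster distance and a large mean inter-cluster distance.

```python
from itertools import combinations
from math import dist

def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])  # Euclidean distance (an assumption)
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical toy data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

With a different similarity measure the same labels could score differently, which is the point of the slide: the result depends on the measure.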

Vocabulary of Clustering


Records, data points, samples, items, objects, patterns…



Attributes, features, variables…



Similarity, dissimilarity, distances.



Centre, Centroid, Prototype.



Hard Clustering (Crisp Clustering)

Requirements of Clustering


Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Ability to deal with noise and outliers

Insensitivity to the order of input records

Insensitivity to the initial conditions

Ability to handle high dimensionality

Clustering Algorithms

Data Representation



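Data matrix (two mode)

n objects with p attributes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode)

d(i,j): dissimilarity between objects i and j, computed over the p attributes

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$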
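
As a small illustration of how the two representations relate, the sketch below builds the lower-triangular dissimilarity matrix from a data matrix. It assumes Euclidean distance for d(i,j), which the slide does not prescribe; `dissimilarity_matrix` and the sample rows are hypothetical names and values.

```python
from math import dist

def dissimilarity_matrix(data):
    """Lower-triangular matrix: row i holds d(i, j) for j = 0..i."""
    return [[dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]

# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i,j) = d(j,i) and d(i,i) = 0.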

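How to deal with missing values?

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$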

Types of Clusters: Well-Separated

Well-separated clusters

A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster

[Figure: 3 well-separated clusters]
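
The well-separated definition can be tested directly on a labelled point set. This is a minimal sketch that assumes Euclidean distance; `is_well_separated` and the sample points are hypothetical.

```python
from math import dist

def is_well_separated(points, labels):
    """True if every point is closer to all points of its own cluster
    than to any point of another cluster."""
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other = [dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        # Compare the farthest same-cluster point with the nearest outsider.
        if same and other and max(same) >= min(other):
            return False
    return True

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
good = is_well_separated(points, [0, 0, 1, 1])
bad = is_well_separated(points, [0, 1, 0, 1])
print(good, bad)
```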

Types of Clusters: Center-Based

Center-based

A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster

The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

[Figure: 4 center-based clusters]
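
The difference between the two kinds of center can be seen in a few lines. The sketch below assumes Euclidean distance and takes the medoid to be the member with the smallest total distance to the other members, which is one common reading of "most representative"; the function names and data are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise average of the points (need not be a data point)."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def medoid(points):
    """Member with the smallest total distance to all other members."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Hypothetical cluster with one far-away member.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
c = centroid(cluster)
m = medoid(cluster)
print(c, m)
```

The far-away member pulls the centroid toward it, while the medoid stays on an actual data point near the bulk of the cluster.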

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

[Figure: 8 contiguous clusters]

Types of Clusters: Density-Based

Density-based

A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.

Used when the clusters are irregular or intertwined, and when noise and outliers are present.

[Figure: 6 density-based clusters]

Types of Clusters: Conceptual Clusters

Shared Property or Conceptual Clusters

Finds clusters that share some common property or represent a particular concept.

[Figure: 2 overlapping circles]

Types of Clusters: Objective Function

Clusters Defined by an Objective Function

Finds clusters that minimize or maximize an objective function.

Enumerate all possible ways of dividing the points into clusters and evaluate the `goodness' of each potential set of clusters by using the given objective function.

November 8, 2013
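
The enumeration idea can be sketched for a tiny input. The example assumes one-dimensional points and uses the sum of squared errors (SSE) to the cluster mean as the objective function; both choices, and the names `sse` and `best_partition`, are assumptions made for illustration. The search tries k^n label assignments, so it is only feasible for very small n.

```python
from itertools import product

def sse(points, labels, k):
    """Sum of squared distances of 1-D points to their cluster mean."""
    total = 0.0
    for c in range(k):
        members = [x for x, lab in zip(points, labels) if lab == c]
        if not members:
            return float("inf")  # reject assignments with an empty cluster
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total

def best_partition(points, k):
    """Exhaustively evaluate every label assignment and keep the best one."""
    candidates = product(range(k), repeat=len(points))
    return min(candidates, key=lambda labels: sse(points, labels, k))

points = [1.0, 2.0, 9.0, 10.0]
labels = best_partition(points, 2)
print(labels, sse(points, labels, 2))
```

Assignments that differ only by renaming the clusters score the same; `min` keeps the first one it meets.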

Type of data in clustering analysis

Symbol Table

Frequency Table

Type of data in clustering analysis


Binary variables

Nominal variables

Ordinal variables

Interval-scaled variables

Ratio variables

Variables of mixed types

Binary variables

                   Object j
                   1        0        sum
Object i    1      a        b        a + b
            0      c        d        c + d
            sum    a + c    b + d    p

The binary variable is symmetric (simple matching coefficient)

$$ d(i,j) = \frac{b + c}{a + b + c + d} $$

The binary variable is asymmetric (Jaccard coefficient)

$$ d(i,j) = \frac{b + c}{a + b + c} $$

Dissimilarity between Binary Variables

Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

gender is a symmetric attribute

the remaining attributes are asymmetric binary

let the values Y and P be set to 1, and the value N be set to 0

$$ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$

$$ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$

$$ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
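
The three values in the example can be reproduced with a short script. It encodes only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) with Y and P as 1 and N as 0, and applies the Jaccard form d = (b + c) / (a + b + c). The function names are hypothetical, and the symmetric attribute gender is left out, as in the slide's calculation.

```python
def binary_counts(x, y):
    """Contingency counts a (1,1), b (1,0), c (0,1), d (0,0)."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_dissimilarity(x, y):
    """Dissimilarity for asymmetric binary variables (0-0 matches ignored).

    Undefined (division by zero) when both objects are all zeros.
    """
    a, b, c, _ = binary_counts(x, y)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))
print(round(jaccard_dissimilarity(jack, jim), 2))
print(round(jaccard_dissimilarity(jim, mary), 2))
```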

Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

  m: # of matches, p: total # of variables

  $$ d(i,j) = \frac{p - m}{p} $$

Method 2: use a large number of binary variables

  creating a new binary variable for each of the M nominal states
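
Both methods are easy to sketch. The first function applies d(i,j) = (p - m) / p; the second creates one binary variable per nominal state. The function names and the sample objects are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Method 1: simple matching, d = (p - m) / p."""
    p = len(x)                                  # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)  # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary variable for each of the M nominal states."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three nominal variables.
obj_i = ["red", "Monday", "engineer"]
obj_j = ["red", "Friday", "teacher"]
d = nominal_dissimilarity(obj_i, obj_j)

colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
print(d, encoded)
```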

Nominal Variables


Examples


Eye Color


Days of the week


Religion


Seasons


Job title


Nominal Variables



Find the Proximity Matrix?

Ordinal Variables

Order is important, e.g., rank

Can be treated like interval-scaled

  replacing $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$

  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

  $$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

  compute the dissimilarity using methods for interval-scaled variables
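
The rank mapping can be sketched as follows. The ordered state list and the function name `ordinal_to_unit_interval` are hypothetical; the formula is z = (r - 1) / (M - 1) with ranks starting at 1.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each value by its rank r in 1..M, then map to (r - 1) / (M - 1)."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    M = len(ordered_states)  # assumes M >= 2
    return [(rank[v] - 1) / (M - 1) for v in values]

# Hypothetical ordinal variable with M = 3 ordered states.
grades = ["fair", "good", "excellent", "good"]
z = ordinal_to_unit_interval(grades, ["fair", "good", "excellent"])
print(z)

# Once mapped, an interval-scaled distance such as |z_i - z_j| applies.
d_01 = abs(z[0] - z[1])
print(d_01)
```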

Ordinal Variables



Find the Proximity Matrix?

Interval-valued variables

Examples

  Temperature

  Weight

  Time

  Age

  Length

Interval-valued variables

Standardize data

  Calculate the mean absolute deviation:

  $$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

  where

  $$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

  Calculate the standardized measurement (z-score)

  $$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

Using mean absolute deviation is more robust than using standard deviation
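
The standardization steps above translate into a few lines. The function name `standardize` and the sample column are hypothetical; the code follows the slide's formulas for m_f, s_f and z_if.

```python
def standardize(column):
    """z-scores of one variable f using the mean absolute deviation s_f."""
    n = len(column)
    m = sum(column) / n                      # mean m_f
    s = sum(abs(x - m) for x in column) / n  # mean absolute deviation s_f
    # s is zero for a constant column; such a variable carries no information.
    return [(x - m) / s for x in column]

# Hypothetical measurements of one interval-valued variable.
heights = [150.0, 160.0, 170.0, 180.0]
z = standardize(heights)
print(z)
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so outliers remain more visible after standardization.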

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$

Methods:

  treat them like interval-scaled variables: not a good choice! (why?)

  apply logarithmic transformation

  $$ y_{if} = \log(x_{if}) $$

  treat them as continuous ordinal data and treat their rank as interval-scaled.
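
A quick check shows why the logarithmic transformation helps. The constants A and B below are hypothetical; the point is that values growing like A e^{Bt} become evenly spaced after taking logs, so interval-scaled methods can then be applied to y.

```python
from math import exp, log

A, B = 2.0, 0.5  # hypothetical constants of the exponential scale
x = [A * exp(B * t) for t in range(5)]  # ratio-scaled measurements
y = [log(v) for v in x]                 # y_if = log(x_if)

raw_steps = [x[k + 1] - x[k] for k in range(4)]  # grow with t
log_steps = [y[k + 1] - y[k] for k in range(4)]  # all close to B
print(raw_steps)
print(log_steps)
```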

Ratio-Scaled Variables



Find the Proximity Matrix?

Variables of Mixed Types

A database may contain all the six types of variables

  symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

One may use a weighted formula to combine their effects.

  $$ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

  where $\delta_{ij}^{(f)}$ is an indicator weight: 0 when the comparison on variable f is not applicable (for example, a missing value), 1 otherwise

  f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise

  f is interval-based: use the normalized distance

  f is ordinal or ratio-scaled

    compute ranks $r_{if}$ and

    $$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

    and treat $z_{if}$ as interval-scaled
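
The weighted formula can be sketched as below. Several details are assumptions rather than something the slide states: the indicator is set to 0 for a missing value (written as `None`) and for a 0-0 match on an asymmetric binary variable, and the "normalized distance" for interval variables is taken as the absolute difference divided by the variable's range. Ordinal or ratio-scaled variables are expected to be converted to z values in [0, 1] beforehand and passed as interval variables with range 1. All names are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """d(i,j) = sum(delta_f * d_f) / sum(delta_f) over the variables f."""
    num = 0.0
    den = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue  # assumed rule: delta = 0 when a value is missing
        if kind == "asymmetric" and x[f] == 0 and y[f] == 0:
            continue  # assumed rule: delta = 0 for a 0-0 asymmetric match
        if kind == "interval":
            d_f = abs(x[f] - y[f]) / ranges[f]  # assumed range normalization
        else:  # "symmetric", "asymmetric" or "nominal"
            d_f = 0.0 if x[f] == y[f] else 1.0
        num += d_f
        den += 1.0
    if den == 0.0:
        raise ValueError("no comparable variables")
    return num / den

# Variables: colour (nominal), temperature (interval, range 40),
# grade (ordinal, already mapped to z in [0, 1]).
kinds = ["nominal", "interval", "interval"]
ranges = [None, 40.0, 1.0]
obj_i = ["red", 20.0, 0.0]
obj_j = ["blue", 30.0, 0.5]
obj_k = ["red", None, 0.0]  # temperature missing

d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
d_kj = mixed_dissimilarity(obj_k, obj_j, kinds, ranges)
print(d_ij, d_kj)
```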

Variables of Mixed Types



Find the Proximity Matrix?