# What Is Good Clustering?

Artificial Intelligence and Robotics

8 Nov 2013

A good clustering method will produce high-quality clusters with

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on the similarity measure used by the method.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Vocabulary of Clustering

Records, data points, samples, items, objects, patterns…

Attributes, features, variables…

Similarity, dissimilarity, distances.

Centre, Centroid, Prototype.

Hard Clustering (Crisp Clustering)

Requirements of Clustering

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Ability to deal with noise and outliers

Insensitivity to the order of input records

Insensitivity to the initial conditions

Ability to handle high dimensionality

Clustering Algorithms

Data Representation

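Data matrix (two mode)

n objects with p attributes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode)

d(i,j): dissimilarity between objects i and j, each described by p attributes

$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$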
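
The two representations can be related with a short sketch that turns a data matrix (one row per object) into a lower-triangular dissimilarity matrix. Euclidean distance is an assumption here, since the slide leaves d(i,j) unspecified, and `dissimilarity_matrix` is a name introduced only for this example.

```python
from math import dist


def dissimilarity_matrix(data):
    """Lower-triangular matrix of pairwise distances; row i holds d(i, 0..i)."""
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = dissimilarity_matrix(data)
# d[i][i] is always 0, and d[i][j] with j < i is the distance between objects i and j.
```

Only the lower triangle is stored because d(i,j) = d(j,i) for a symmetric measure.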
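
How to deal with missing values?

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$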

Types of Clusters: Well-Separated

Well-separated clusters

A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster

Figure: 3 well-separated clusters

Types of Clusters: Center-Based

Center-based

A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster than to the center of any other cluster

The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

Figure: 4 center-based clusters
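
The difference between the two kinds of center can be sketched as follows. The medoid is taken here to be the cluster member with the smallest total distance to all other members, which is one common reading of "most representative" and an assumption of this example; Euclidean distance and the toy cluster are also assumptions.

```python
from math import dist


def centroid(points):
    """Coordinate-wise average; it need not coincide with any actual point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """The member whose summed distance to all members is smallest."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


# Hypothetical cluster with one far-away member.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
c = centroid(cluster)  # pulled toward the far point
m = medoid(cluster)    # stays on an actual, central member
```

In this toy case the centroid is dragged toward the distant point while the medoid remains one of the original objects, which is why medoids are often preferred when outliers are present.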

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Figure: 8 contiguous clusters

Types of Clusters: Density-Based

Density-based

A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.

Used when the clusters are irregular or intertwined, and when noise and outliers are present.

Figure: 6 density-based clusters

Types of Clusters: Conceptual Clusters

Shared Property or Conceptual Clusters

Finds clusters that share some common property or represent a particular concept.

Figure: 2 Overlapping Circles

Types of Clusters: Objective Function

Clusters Defined by an Objective Function

Finds clusters that minimize or maximize an objective function.

Enumerate all possible ways of dividing the points into clusters and evaluate the "goodness" of each potential set of clusters by using the given objective function.
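
The enumeration idea can be sketched on a tiny data set. The objective used here, the sum of squared distances of each point to its cluster mean (SSE), is an assumption chosen for illustration, as are the names `sse` and `best_partition`. Trying every labelling takes k^n evaluations, so this is only feasible for a handful of points.

```python
from itertools import product


def sse(points, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        center = [sum(xs) / len(members) for xs in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in members)
    return total


def best_partition(points, k):
    """Try every assignment of points to k non-empty clusters; keep the lowest SSE."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # skip labellings that leave a cluster empty
            continue
        score = sse(points, labels, k)
        if best is None or score < best[0]:
            best = (score, labels)
    return best


# Hypothetical one-dimensional data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
score, labels = best_partition(points, 2)
```

Practical algorithms avoid this exhaustive search and instead optimize the objective heuristically.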

November 8, 2013

Type of data in clustering analysis

Symbol Table

Frequency Table

Type of data in clustering analysis

Binary variables

Nominal variables

Ordinal variables

Interval-scaled variables

Ratio variables

Variables of mixed types

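Binary variables

| | Object j = 1 | Object j = 0 | sum |
|---|---|---|---|
| Object i = 1 | a | b | a + b |
| Object i = 0 | c | d | c + d |
| sum | a + c | b + d | p |

The binary variable is symmetric (Simple match coefficient)

$$
d(i,j) = \frac{b + c}{a + b + c + d}
$$

The binary variable is asymmetric (Jaccard coefficient)

$$
d(i,j) = \frac{b + c}{a + b + c}
$$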

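Dissimilarity between Binary Variables

Example

gender is a symmetric attribute

the remaining attributes are asymmetric binary

let the values Y and P be set to 1, and the value N be set to 0

| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|---|---|---|---|---|---|---|---|
| Jack | M | Y | N | P | N | N | N |
| Mary | F | Y | N | P | N | P | N |
| Jim | M | Y | P | N | N | N | N |

$$
\begin{aligned}
d(\text{jack}, \text{mary}) &= \frac{0 + 1}{2 + 0 + 1} = 0.33 \\
d(\text{jack}, \text{jim}) &= \frac{1 + 1}{1 + 1 + 1} = 0.67 \\
d(\text{jim}, \text{mary}) &= \frac{1 + 2}{1 + 1 + 2} = 0.75
\end{aligned}
$$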
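
The patient example can be checked with a short sketch. Each patient is encoded over the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) with Y and P mapped to 1 and N mapped to 0, as the example prescribes; the function names are introduced here for illustration only.

```python
def contingency(x, y):
    """Counts a (1,1), b (1,0), c (0,1), d (0,0) over two binary vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c).

    Undefined (division by zero) when both vectors are all zeros.
    """
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = jaccard(jack, mary)
d_jack_jim = jaccard(jack, jim)
d_jim_mary = jaccard(jim, mary)
```

Rounded to two decimals the three values are 0.33, 0.67 and 0.75, so Jack and Mary are the most similar pair under the asymmetric measure.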

Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

$$
d(i,j) = \frac{p - m}{p}
$$

m: # of matches, p: total # of variables

Method 2: use a large number of binary variables

creating a new binary variable for each of the M nominal states
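
Both methods for nominal variables can be sketched briefly. `nominal_dissimilarity` implements the simple matching ratio (p - m) / p, and `one_hot` shows how a single nominal value could be expanded into M binary variables; both names and the sample records are assumptions made for this example.

```python
def nominal_dissimilarity(x, y):
    """Simple matching: (p - m) / p, with m matches among p nominal variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


# Hypothetical records: (color, day of week, season)
obj_i = ["red", "Monday", "summer"]
obj_j = ["red", "Friday", "winter"]
d_ij = nominal_dissimilarity(obj_i, obj_j)  # one match out of three variables

encoded = one_hot("blue", ["red", "yellow", "blue", "green"])
```

After one-hot encoding, the binary dissimilarity coefficients can be applied to the expanded vectors.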

Nominal Variables

Examples

Eye Color

Days of the week

Religion

Seasons

Job title

Nominal Variables

Find the Proximity Matrix?

Ordinal Variables

Order is important, e.g., rank

Can be treated like interval-scaled

replacing $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$

map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

compute the dissimilarity using methods for interval-scaled variables
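
The rank mapping can be sketched as below, assuming the ordered list of states is known in advance and contains at least two states. The function name and the grade labels are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each value by its rank r in 1..M, then map to (r - 1) / (M - 1)."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)  # M_f, the number of ordered states
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal variable with three states, lowest first.
grades = ["fair", "excellent", "good", "fair"]
z = ordinal_to_unit_interval(grades, ["fair", "good", "excellent"])
```

The resulting values lie in [0, 1] and can be fed to any interval-scaled distance.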

Ordinal Variables

Find the Proximity Matrix?

Interval-valued variables

Examples

Temperature

Weight

Time

Age

Length

Interval-valued variables

Standardize data

Calculate the mean absolute deviation:

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$

where

$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

Calculate the standardized measurement (z-score)

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

Using mean absolute deviation is more robust than using standard deviation
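
A minimal sketch of this standardization for a single variable f follows. The function name and the sample column are assumptions; the code also assumes the column is not constant, since s_f would then be zero.

```python
def standardize(column):
    """z-scores using the mean absolute deviation s_f instead of the standard deviation."""
    n = len(column)
    m_f = sum(column) / n                       # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n  # mean absolute deviation
    return [(x - m_f) / s_f for x in column]


# Hypothetical measurements of one interval-scaled variable.
heights = [160.0, 170.0, 180.0]
z = standardize(heights)  # roughly [-1.5, 0.0, 1.5]
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so outliers remain visible in the z-scores.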

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$

Methods:

treat them like interval-scaled variables: not a good choice! (why?)

apply logarithmic transformation: $y_{if} = \log(x_{if})$

treat them as continuous ordinal data and treat their rank as interval-scaled.
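
The effect of the logarithmic transformation can be seen on a small, made-up series that grows by a constant factor. After taking logs, the gaps between consecutive values become equal, so an interval-scaled distance treats each growth step alike. The function name and the data are assumptions for this example.

```python
from math import log


def log_transform(column):
    """y_if = log(x_if); requires strictly positive measurements."""
    return [log(x) for x in column]


# Hypothetical ratio-scaled measurements growing tenfold per step.
counts = [1.0, 10.0, 100.0, 1000.0]
y = log_transform(counts)

raw_gaps = [b - a for a, b in zip(counts, counts[1:])]  # 9, 90, 900
log_gaps = [b - a for a, b in zip(y, y[1:])]            # all close to log(10)
```

On the raw scale the last step dominates any distance computation, which is one reason treating such variables directly as interval-scaled tends to distort the result.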

Ratio-Scaled Variables

Find the Proximity Matrix?

Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

One may use a weighted formula to combine their effects.

$$
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

Here the weight $\delta_{ij}^{(f)}$ is 0 when variable f cannot be compared for objects i and j (for example, when a value is missing) and 1 otherwise.

f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise

f is interval-based: use the normalized distance

f is ordinal or ratio-scaled: compute ranks $r_{if}$ and

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

and treat $z_{if}$ as interval-scaled
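
A hedged sketch of the weighted combination follows. Several details are assumptions not spelled out on the slide: missing values are encoded as `None`, a pair is skipped when a value is missing or when both values are 0 on an asymmetric binary variable, the "normalized distance" is taken to be the absolute difference divided by the variable's range, and ordinal or ratio-scaled variables are assumed to have been converted to [0, 1] beforehand and passed in as interval variables with range 1. All names are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average the per-variable dissimilarities over the comparable variables."""
    numerator = denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # weight 0: a value is missing
        if kind == "asymmetric" and xf == 0 and yf == 0:
            continue  # weight 0: joint absence carries no information
        if kind in ("nominal", "symmetric", "asymmetric"):
            d = 0.0 if xf == yf else 1.0
        else:  # "interval": normalized absolute difference
            d = abs(xf - yf) / ranges[f]
        numerator += d
        denominator += 1.0
    # Raises ZeroDivisionError if no variable is comparable for this pair.
    return numerator / denominator


# Hypothetical schema: (color, age, test result)
kinds = ["nominal", "interval", "asymmetric"]
ranges = [None, 50.0, None]  # range (max - min) only matters for interval variables

obj_i = ["red", 20.0, 1]
obj_j = ["blue", 45.0, 0]
obj_k = ["red", None, 0]  # age is missing

d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)  # (1 + 0.5 + 1) / 3
d_ik = mixed_dissimilarity(obj_i, obj_k, kinds, ranges)  # (0 + 1) / 2
```

Because every per-variable term lies in [0, 1], the combined value also lies in [0, 1] regardless of how many variables contribute.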

Variables of Mixed Types

Find the Proximity Matrix?