What Is Good Clustering?
A good clustering method will produce high-quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on the similarity measure used by the method.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
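As a rough illustration of the intra-class versus inter-class idea, the sketch below compares the mean distance between points in the same cluster with the mean distance between points in different clusters. It assumes Euclidean distance as the (dis)similarity measure, so high similarity corresponds to a small distance; the function name and the toy points are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_inter_distance(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance (an assumed choice)
        (intra if lp == lq else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter_distance(points, labels)
print(intra, inter)
```

A good clustering in the sense above would show a small first value and a large second value.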
Vocabulary of Clustering
Records, data points, samples, items, objects, patterns…
Attributes, features, variables…
Similarity, dissimilarity, distances.
Centre, Centroid, Prototype.
Hard Clustering (Crisp Clustering)
Requirements of Clustering
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
Insensitive to the initial conditions
High dimensionality
Clustering Algorithms
Data Representation
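Data matrix (two mode)
n objects with p attributes
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (one mode)
d(i,j): dissimilarity between i and j
\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]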
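The step from a data matrix to a dissimilarity matrix can be sketched as follows. This is a minimal example that assumes numeric attributes and Euclidean distance for d(i,j); the function name and the sample rows are illustrative only.

```python
import math

def dissimilarity_matrix(data):
    """Lower-triangular matrix of pairwise distances d(i, j) for j <= i."""
    n = len(data)
    return [[math.dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(n)]

# Data matrix: 3 objects (rows) with 2 attributes (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i,j) = d(j,i) and d(i,i) = 0.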
How to deal with missing values?
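\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]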
Types of Clusters: Well-Separated
Well-separated clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters]
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster.
The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster.
[Figure: 4 center-based clusters]
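The difference between a centroid and a medoid can be seen in a small sketch. It assumes Euclidean distance and takes "most representative" to mean the cluster member with the smallest total distance to all other members, which is one common reading rather than the only one; the sample cluster is made up.

```python
import math

def centroid(points):
    """Attribute-wise average of all points; need not be an actual data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """Member of the cluster with the smallest total distance to the others."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# One outlier at x = 14 pulls the centroid but not the medoid.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
print(centroid(cluster))  # (4.0, 0.0)
print(medoid(cluster))    # (2.0, 0.0)
```

The centroid lands at a position where no object exists, while the medoid is always one of the objects.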
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters]
Types of Clusters: Density-Based
Density-based
A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters]
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept.
[Figure: 2 Overlapping Circles]
Types of Clusters: Objective Function
Clusters Defined by an Objective Function
Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and evaluate the `goodness' of each potential set of clusters by using the given objective function.
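The exhaustive approach described above can be sketched for a tiny data set. The example assumes one-dimensional points and uses the sum of squared errors (SSE) to each cluster mean as the objective function; both choices, and the function names, are illustrative. The number of assignments grows as k^n, so this is only feasible for very small inputs.

```python
from itertools import product

def sse(points, labels, k):
    """Sum of squared distances of 1-D points to their cluster mean."""
    total = 0.0
    for c in range(k):
        members = [x for x, l in zip(points, labels) if l == c]
        if not members:
            return float("inf")  # reject assignments with an empty cluster
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total

def best_partition(points, k):
    """Try every assignment of points to k labels and keep the lowest SSE."""
    return min(product(range(k), repeat=len(points)),
               key=lambda labels: sse(points, labels, k))

points = [1.0, 2.0, 9.0, 10.0]
labels = best_partition(points, 2)
print(labels, sse(points, labels, 2))
```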
November 8, 2013
Type of data in clustering analysis
Symbol Table
Frequency Table
Type of data in clustering analysis
Binary variables
Nominal variables
Ordinal variables
Interval-scaled variables
Ratio variables
Variables of mixed types
Binary variables
                  Object j
                  1      0      sum
Object i   1      a      b      a+b
           0      c      d      c+d
           sum    a+c    b+d    p

The binary variable is symmetric (simple matching coefficient)
\[ d(i,j) = \frac{b + c}{a + b + c + d} \]
The binary variable is asymmetric (Jaccard coefficient)
\[ d(i,j) = \frac{b + c}{a + b + c} \]
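A minimal sketch of the two coefficients follows, assuming each object is a list of 0/1 values of equal length. The counts a, b, c, d are the cells of the contingency table for objects i and j; the function names and the toy vectors are hypothetical.

```python
def contingency(x, y):
    """Return (a, b, c, d): counts of 1/1, 1/0, 0/1 and 0/0 attribute pairs."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching_dissimilarity(x, y):
    """Symmetric binary variables: 0/0 matches count as agreement."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_dissimilarity(x, y):
    """Asymmetric binary variables: 0/0 matches are ignored.

    Undefined (division by zero) when both objects are all zeros.
    """
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)

x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
print(contingency(x, y))                    # (1, 1, 1, 1)
print(simple_matching_dissimilarity(x, y))  # 0.5
print(jaccard_dissimilarity(x, y))
```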
Dissimilarity between Binary Variables
Example
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

Name   Gender   Fever   Cough   Test1   Test2   Test3   Test4
Jack   M        Y       N       P       N       N       N
Mary   F        Y       N       P       N       P       N
Jim    M        Y       P       N       N       N       N

\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
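The three values in this example can be checked with a short script. It encodes only the six asymmetric attributes (Fever, Cough, Test1 to Test4), maps Y and P to 1 and N to 0 as stated, and applies the asymmetric (Jaccard-style) dissimilarity; the string encoding of each row and the function names are conveniences assumed here.

```python
from itertools import combinations

# Fever, Cough, Test1, Test2, Test3, Test4 for each patient (gender excluded).
patients = {
    "Jack": "YNPNNN",
    "Mary": "YNPNPN",
    "Jim":  "YPNNNN",
}

def encode(values):
    """Map Y and P to 1, N to 0."""
    return [0 if v == "N" else 1 for v in values]

def asymmetric_binary_dissimilarity(x, y):
    """(b + c) / (a + b + c): mismatches over attributes not jointly 0."""
    mismatches = sum(u != v for u, v in zip(x, y))           # b + c
    both_one = sum(u == 1 and v == 1 for u, v in zip(x, y))  # a
    return mismatches / (both_one + mismatches)

distances = {
    (p, q): round(asymmetric_binary_dissimilarity(encode(patients[p]),
                                                  encode(patients[q])), 2)
    for p, q in combinations(patients, 2)
}
print(distances)
```

Jim and Mary come out as the least similar pair and Jack and Mary as the most similar pair.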
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
\[ d(i,j) = \frac{p - m}{p} \]
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
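Both methods can be sketched in a few lines. The first function counts matches m over p nominal variables; the second builds one binary variable per nominal state, as Method 2 suggests. The attribute values and function names are made up for illustration.

```python
def nominal_dissimilarity(x, y):
    """Method 1, simple matching: d(i, j) = (p - m) / p."""
    p = len(x)                              # total number of variables
    m = sum(u == v for u, v in zip(x, y))   # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary variable for each of the M nominal states."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three nominal variables.
obj_i = ["red", "single", "teacher"]
obj_j = ["red", "married", "teacher"]
print(nominal_dissimilarity(obj_i, obj_j))

colors = ["red", "yellow", "blue", "green"]
print(one_hot("blue", colors))  # [0, 0, 1, 0]
```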
Nominal Variables
Examples
Eye Color
Days of the week
Religion
Seasons
Job title
Nominal Variables
Find the Proximity Matrix?
Ordinal Variables
Order is important, e.g., rank
Can be treated like interval-scaled
replacing $x_{if}$ by their rank
\[ r_{if} \in \{1, \ldots, M_f\} \]
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
compute the dissimilarity using methods for interval-scaled variables
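The rank mapping can be sketched as follows, assuming the ordered list of states for the variable is known in advance. The state names and the function name are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each value by z = (r - 1) / (M - 1), where r is its rank in 1..M."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)  # number of ordered states, M_f
    return [(rank[v] - 1) / (m_f - 1) for v in values]

states = ["fair", "good", "excellent"]  # lowest to highest
z = ordinal_to_unit_interval(["fair", "excellent", "good"], states)
print(z)  # [0.0, 1.0, 0.5]
```

The resulting z values can then be fed to any distance used for interval-scaled variables.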
Ordinal Variables
Find the Proximity Matrix?
Interval-valued variables
Examples
Temperature
Weight
Time
Age
Length
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
where
\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]
Calculate the standardized measurement (z-score)
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
Using mean absolute deviation is more robust than using standard deviation.
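The standardization step can be sketched for a single variable f, given as a list of its n measurements. The function name and sample values are illustrative, and the sketch assumes the values are not all equal (otherwise the deviation is zero).

```python
def standardize(column):
    """z-scores using the mean absolute deviation instead of the standard deviation."""
    n = len(column)
    m_f = sum(column) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in column) / n  # mean absolute deviation
    return [(x - m_f) / s_f for x in column]

z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)  # [-1.5, -0.5, 0.5, 1.5]
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation.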
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
treat them like interval-scaled variables: not a good choice! (why?)
apply logarithmic transformation $y_{if} = \log(x_{if})$
treat them as continuous ordinal data and treat their rank as interval-scaled.
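A small sketch of the logarithmic transformation shows why it helps: measurements that grow exponentially become evenly spaced after taking logs, so interval-scaled distances behave sensibly. The constants A = 1 and B = 0.5 are arbitrary choices for the illustration.

```python
import math

def log_transform(column):
    """y = log(x) for each positive ratio-scaled measurement x."""
    return [math.log(x) for x in column]

# Measurements following A * exp(B * t) with A = 1, B = 0.5, t = 0..3.
x = [math.exp(0.5 * t) for t in range(4)]
y = log_transform(x)

raw_gaps = [x[i + 1] - x[i] for i in range(3)]  # gaps keep growing
log_gaps = [y[i + 1] - y[i] for i in range(3)]  # gaps are all about 0.5
print(raw_gaps)
print(log_gaps)
```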
Ratio-Scaled Variables
Find the Proximity Matrix?
Variables of Mixed Types
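A database may contain all the six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
One may use a weighted formula to combine their effects.
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
Here $\delta_{ij}^{(f)}$ is an indicator weight: 0 if $x_{if}$ or $x_{jf}$ is missing (or if both are 0 and f is asymmetric binary), and 1 otherwise.
f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled: compute ranks $r_{if}$ and
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
and treat $z_{if}$ as interval-scaled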
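The weighted combination can be sketched for two attribute kinds. This is a simplified illustration under stated assumptions: an attribute is skipped (weight 0) when either value is missing, nominal attributes contribute 0 or 1, and interval attributes contribute the absolute difference divided by the attribute's range, which is one common way to normalize. Ordinal values are assumed to have been mapped to [0, 1] beforehand and passed in as interval attributes with range 1. All names and sample values are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average of per-attribute dissimilarities over attributes with weight 1."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue  # weight 0: a value is missing for this attribute
        if kind == "nominal":
            d_f = 0.0 if x[f] == y[f] else 1.0
        elif kind == "interval":
            d_f = abs(x[f] - y[f]) / ranges[f]  # assumed normalization by range
        else:
            raise ValueError(f"unsupported attribute kind: {kind}")
        numerator += d_f
        denominator += 1.0
    return numerator / denominator

kinds = ("nominal", "interval", "interval")
ranges = (None, 40.0, 10.0)      # max - min per interval attribute
obj_i = ("red", 20.0, None)      # third attribute missing for object i
obj_j = ("blue", 30.0, 5.0)
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))  # 0.625
```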
Variables of Mixed Types
Find the Proximity Matrix?