Data Mining Techniques and Applications
The University of Nottingham

Clustering

Alvaro Garcia-Piquer
Research Group in Intelligent Systems (GRSI)
La Salle – Ramon Llull University
alvarog@salle.url.edu
Outline
1. Introduction
2. Clustering Taxonomy
3. Some Algorithms
4. Validation of Clustering Solutions
5. Summary
Grouping Data
Data mining, and clustering in particular, groups data according to a set of criteria, providing experts with a possible classification or categorization of the elements [Kaufman, 2005].
We have rich data, but poor information [Han, 2006]. Data mining is the search for knowledge (interesting patterns) in your data.
Clustering example
The same set of elements can be grouped in different ways, e.g. by family or by age.
Applications
• Marketing: finding groups of customers with similar behaviour, given a large database of customer data containing their properties and past buying records
• Biology: classification of living organisms according to their DNA
• Image segmentation: identifying objects in images according to the features of each pixel (position, colour, …)
Clustering steps
1. Choose the number of clusters
2. Choose the type of clustering: relationship of the clusters, cases distribution into the clusters, search typology
3. Run the clustering algorithm (the clustering process) until convergence, i.e. the optimization of the clusters
4. Validate the clustering solution
9
2
The
determination
of
the
number
of
clusters
to
find
can
be
:
•
Manual
The
search
space
of
the
algorithm
is
reduced
•
Automatic
The
search
space
is
not
delimited
and
is
more
difficult
to
the
algorithm
to
converge
Can you group these data according
to the colour?
How many groups can you identify?
Can you group these data in two
clusters according to the colour?

Cluster
1
:
blue
data

Cluster
2
:
green
data
Relationships of the clusters [Gan, 2007; Duda, 2000]
• Partitional: there are no relationships between the clusters
• Hierarchical: all the clusters have some relationship between them; two types, agglomerative and divisive
(Illustrations: partitional, hierarchical agglomerative, and hierarchical divisive clustering.)
Cases distribution into the clusters [Gan, 2007; Duda, 2000]
• Hard: each data element belongs to exactly one cluster
• Fuzzy (soft): data elements can belong to more than one cluster. Associated with each object are membership grades which indicate the degree to which it belongs to the different clusters. The sum of all the membership grades of each object has to be the same (normally 1).
Hard clustering: every point is assigned to exactly one cluster.
Fuzzy clustering: each point carries a membership grade per cluster, e.g. red cluster 0.7 / green cluster 0.3; red 0.6 / green 0.4; red 1 / green 0; red 0.2 / green 0.8.
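A minimal sketch of how membership grades summing to 1 can be obtained: here they are derived from inverse distances to two cluster centres. The centres, points, and the `memberships` helper are illustrative assumptions, not part of the slides.

```python
# Sketch: fuzzy membership grades derived from distances to two cluster
# centres, so that each point's grades sum to 1 (the usual convention).
# Centres and points are made-up illustrations.

def memberships(point, centres):
    """Return one membership grade per centre, summing to 1.

    Uses inverse-distance weighting; a point sitting exactly on a
    centre gets membership 1 for that centre."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centre)) ** 0.5
             for centre in centres]
    for i, d in enumerate(dists):
        if d == 0.0:                      # point coincides with a centre
            return [1.0 if j == i else 0.0 for j in range(len(centres))]
    inv = [1.0 / d for d in dists]
    total = sum(inv)
    return [w / total for w in inv]

red, green = (0.0, 0.0), (10.0, 0.0)
grades = memberships((2.0, 0.0), [red, green])  # closer to the red centre
```

A point two units from the red centre and eight from the green one gets grades 0.8 and 0.2, mirroring the partial memberships shown above.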
Search typology (1)
Centre-based algorithms [Gan, 2007]
• Each cluster is defined by a prototype, and the instances are assigned to the closest prototype
• The clusters have convex shapes and each cluster is represented by a centre
• They cannot find clusters of arbitrary shapes
• They are sensitive to the initialization and may fall into a locally optimal solution
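The assignment step at the heart of a centre-based method can be sketched as follows; the prototypes and points are made-up examples, not data from the slides.

```python
# Sketch: the core of a centre-based method — each instance is assigned
# to the index of its closest prototype (squared Euclidean distance).

def assign(instances, prototypes):
    """Return, for each instance, the index of the nearest prototype."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(prototypes)),
                key=lambda k: dist2(x, prototypes[k]))
            for x in instances]

prototypes = [(0.0, 0.0), (5.0, 5.0)]
points = [(0.5, 0.2), (4.8, 5.1), (1.0, 0.0)]
labels = assign(points, prototypes)  # → [0, 1, 0]
```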
Search typology (2)
Graph-based algorithms [Gan, 2007]
• They construct a graph or hypergraph and then apply some heuristic to partition it
• They can find arbitrarily shaped clusters
• They are sensitive to the initialization and may fall into a locally optimal solution
Graph construction: each instance is linked to its nearest neighbour not yet visited. Edge elimination: the edges that are longer than a threshold are removed.
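A small sketch of this idea: link each instance to a nearest neighbour, drop edges longer than a threshold, and read the clusters off as connected components. The data and threshold are illustrative assumptions.

```python
# Sketch of the graph-based idea: nearest-neighbour edges, then edges
# longer than a threshold are eliminated; clusters = connected components.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def graph_clusters(points, threshold):
    n = len(points)
    parent = list(range(n))            # union-find over the instances
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        # Graph construction: join each instance to its nearest neighbour.
        j = min((k for k in range(n) if k != i),
                key=lambda k: dist(points[i], points[k]))
        # Edge elimination: edges longer than the threshold are dropped.
        if dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    return [sorted(i for i in range(n) if roots[i] == r)
            for r in sorted(set(roots))]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = graph_clusters(pts, threshold=2.0)  # → [[0, 1, 2], [3, 4]]
```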
Search typology (3)
Model-based algorithms [Gan, 2007]
• It is assumed that the data are generated by a mixture of probability distributions, each of which represents a different cluster
• The distributions are estimated from the data and each data instance is assigned to one of them
• They are sensitive to the initialization and may fall into a locally optimal solution
(Illustration: a mixture of two Gaussian distributions with parameters µ1, σ1 and µ2, σ2.)
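A brief sketch of the assignment step under this view: given two 1-D Gaussians (parameter values chosen for illustration), each point's posterior membership in each component is computed and normalized.

```python
# Sketch: in a model-based view each cluster is a probability
# distribution. Responsibilities = posterior membership of a point
# under a two-component 1-D Gaussian mixture.
import math

def gaussian_pdf(x, mu, sigma):
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def responsibilities(x, params, weights):
    """params: list of (mu, sigma); weights: mixing proportions."""
    likes = [w * gaussian_pdf(x, mu, s)
             for w, (mu, s) in zip(weights, params)]
    total = sum(likes)
    return [l / total for l in likes]

# A point near the first component's mean is assigned almost entirely to it.
r = responsibilities(0.5, params=[(0.0, 1.0), (5.0, 1.0)],
                     weights=[0.5, 0.5])
```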
Search typology (4)
Search-based algorithms [Gan, 2007]
• They complement the previous strategies, which may not be able to find the globally optimal clustering that fits the data set
• This strategy searches the overall solution space to find a globally optimal clustering that fits the data set: genetic algorithms, ant colony optimization, simulated annealing
• They are very time-expensive
Search typology (5)
Density-based algorithms [Gan, 2007]
• Clusters are defined as dense regions separated by low-density regions; points in sparse regions are treated as noise
• They need only one scan of the original data set and can handle noise
• The number of clusters is not required
• They can find arbitrarily shaped clusters
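A minimal density-based sketch in the spirit of DBSCAN (not an algorithm named on the slides): points with enough neighbours within `eps` seed clusters, and sparse points are labelled noise. `eps` and `min_pts` are illustrative choices.

```python
# Minimal density-based sketch: dense points (>= min_pts neighbours
# within eps, counting themselves) grow clusters; sparse points are
# noise (label -1).

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def density_cluster(points, eps, min_pts):
    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(points))
                 if dist(points[i], points[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1             # noise (may be claimed later)
            continue
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = [k for k in range(len(points))
                  if dist(points[j], points[k]) <= eps]
            if len(nj) >= min_pts:     # dense point: keep expanding
                queue.extend(k for k in nj if labels[k] is None)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
labels = density_cluster(pts, eps=1.5, min_pts=3)  # → [0, 0, 0, 0, -1]
```

The isolated point `(8, 8)` ends up as noise, and no cluster count had to be supplied, matching the two properties listed above.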
Search typology (6)
Subspace-based algorithms [Gan, 2007]
• They are applied to high-dimensional data sets
• They consist of finding clusters in each dimension by identifying dense units
• The final clusters are found by overlapping the clusters of each dimension
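A sketch of the dense-unit idea in 2-D: find dense 1-D grid cells in each dimension separately, then keep the points whose cells are dense in both. The grid width and density threshold are illustrative assumptions.

```python
# Sketch of the subspace idea: dense 1-D units per dimension, then the
# final cluster is where the per-dimension dense units overlap.

def dense_units(values, width, min_count):
    """Return the set of dense grid cells along one dimension."""
    counts = {}
    for v in values:
        cell = int(v // width)
        counts[cell] = counts.get(cell, 0) + 1
    return {c for c, n in counts.items() if n >= min_count}

def subspace_clusters(points, width=1.0, min_count=2):
    xs = dense_units([p[0] for p in points], width, min_count)
    ys = dense_units([p[1] for p in points], width, min_count)
    # Overlap the per-dimension units: keep points whose cell is dense
    # in both x and y.
    return [p for p in points
            if int(p[0] // width) in xs and int(p[1] // width) in ys]

pts = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.9), (7.5, 7.7), (9.9, 3.3)]
kept = subspace_clusters(pts)
```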
Optimization of the clusters [Law, 2004]
Several clustering algorithms are iterative and consist of optimizing the evaluation of the clusters according to one or several objectives.
• Single objective: the clustering process consists of optimizing a single objective
• Several objectives: the clustering process consists of optimizing several objectives, obtaining a trade-off between them
Single objective (1)
The clusters are obtained taking into account the attributes ‘x’ and ‘y’.
Criterion to optimize: 1) each cluster has to contain elements of the same shape.
Criteria to optimize: 1) each cluster has to contain elements of the same shape; 2) the number of clusters has to be minimized.
These two criteria are considered a single objective because optimizing one criterion doesn’t affect the other.
Single objective (2)
The clusters are obtained taking into account the attributes ‘x’ and ‘y’.
Criteria to optimize: 1) minimize intra-cluster variance; 2) maximize inter-cluster variance.
(Illustrations: one solution with the intra-cluster variance optimized, another with the inter-cluster variance optimized.)
It is impossible to optimize both criteria at the same time.
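The two competing criteria can be sketched numerically: intra-cluster variance measures compactness (lower is better), inter-cluster variance measures separation of the centres (higher is better). The helper names and example data are assumptions for illustration.

```python
# Sketch: the two criteria from this slide, computed from scratch.

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n
                 for d in range(len(points[0])))

def intra_variance(clusters):
    """Average squared distance of points to their own cluster centre."""
    total, n = 0.0, 0
    for pts in clusters:
        c = mean(pts)
        total += sum(sum((x - cx) ** 2 for x, cx in zip(p, c))
                     for p in pts)
        n += len(pts)
    return total / n

def inter_variance(clusters):
    """Spread of the cluster centres around their (unweighted) grand mean."""
    centres = [mean(pts) for pts in clusters]
    g = mean(centres)
    return sum(sum((cx - gx) ** 2 for cx, gx in zip(c, g))
               for c in centres) / len(centres)

# Two tight, well-separated clusters: low intra, high inter variance.
tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```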
Single objective (3)
Validation indexes [Halkidi, 2002]
• They evaluate a clustering solution according to the quality (shape) of the clusters, using the inter-cluster and intra-cluster variance simultaneously.
• Some indexes: Davies–Bouldin index, Dunn’s index, Silhouette index, ...
• Example: Davies–Bouldin index [Dunn, 1974]
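The Davies–Bouldin index can be sketched from scratch: for each cluster, take the worst ratio of summed scatters to the distance between centres; lower values indicate compact, well-separated clusters. The example partitions are made up.

```python
# Sketch of the Davies–Bouldin index: lower = more compact and
# better-separated clusters.

def centre(pts):
    return tuple(sum(p[d] for p in pts) / len(pts)
                 for d in range(len(pts[0])))

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def davies_bouldin(clusters):
    centres = [centre(pts) for pts in clusters]
    # Scatter: mean distance of each cluster's points to its centre.
    scatter = [sum(dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, centres)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / dist(centres[i], centres[j])
                     for j in range(k) if j != i)
    return total / k

good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # tight, far apart
bad = [[(0, 0), (4, 4)], [(5, 5), (9, 9)]]        # loose, overlapping
```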
Several objectives (1)
Ensemble clustering [Law, 2004]
Criteria to optimize: 1) minimize intra-cluster variance; 2) maximize inter-cluster variance.
Several clustering results, each optimizing one of the criteria, are produced and then combined into a final solution.
Several objectives (2)
Multi-objective clustering
Criteria to optimize: 1) minimize intra-cluster variance; 2) maximize inter-cluster variance.
(Illustration: candidate solutions plotted against the two objectives; the non-dominated solutions form a front, and a dominated solution lies behind it.)
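The notion of a dominated solution from the plot can be sketched as follows. Both objectives are treated as minimized (e.g. intra-cluster variance and an inverse measure of inter-cluster variance); the candidate values are illustrative assumptions.

```python
# Sketch: Pareto dominance in multi-objective clustering. A solution is
# dominated if another is at least as good on every objective and
# strictly better on one (all objectives minimized here).

def dominates(a, b):
    """True if solution a dominates solution b."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]

# Each tuple: (intra-cluster variance, inverse inter-cluster variance)
candidates = [(0.2, 0.9), (0.5, 0.5), (0.9, 0.2), (0.8, 0.8)]
front = pareto_front(candidates)  # (0.8, 0.8) is dominated by (0.5, 0.5)
```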
Taxonomy Summary
• Number of clusters: manual, automatic
• Relationships of the clusters: partitional, hierarchical
• Cases distribution into the clusters: hard, fuzzy (soft)
• Search typology: centre-based, graph-based, model-based, search-based, density-based, subspace-based, ...
• Optimization of the clusters: single objective; several objectives (ensemble clustering, multi-objective clustering)
k-means [MacQueen, 1967]
• Partitional • Centre-based • Hard clustering • Number of clusters: manual • Single objective
It consists of grouping the instances into k circular clusters according to the distance between them and the centre of the cluster, updating the centres with the new assignments. This process is repeated until convergence has been reached.
Similar algorithms: x-means (automatic number of clusters), fuzzy c-means (fuzzy clustering).
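The loop described above can be sketched as follows; the naive initialization (first k points) and the example data are assumptions to keep the sketch short.

```python
# Sketch of the k-means loop: assign each instance to the nearest
# centre, recompute the centres, repeat until the assignments stop
# changing (convergence).

def kmeans(points, k, max_iter=100):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centres = [points[i] for i in range(k)]        # naive initialization
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist2(p, centres[c]))
                      for p in points]
        if new_labels == labels:                   # convergence reached
            break
        labels = new_labels
        for c in range(k):                         # update the centres
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = tuple(sum(v) / len(members)
                                   for v in zip(*members))
    return labels, centres

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centres = kmeans(pts, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other.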
Single-link [Johnson, 1967]
• Hierarchical agglomerative • Centre-based • Hard clustering • Number of clusters: automatic • Single objective
In each step, the two clusters whose two closest members have the smallest distance are merged.
Similar algorithms: complete-link, average-link (they follow other heuristics to merge the instances).
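The merge rule above can be sketched directly: start with singleton clusters and repeatedly merge the pair whose closest members are nearest. The slides build a full hierarchy; the `stop_at` parameter is an assumption added to keep the example short.

```python
# Sketch of single-link agglomerative clustering: merge the two
# clusters whose closest members have the smallest distance.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(points, stop_at=1):
    clusters = [[i] for i in range(len(points))]   # one cluster per point
    while len(clusters) > stop_at:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link distance: closest pair across the clusters.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
two = single_link(pts, stop_at=2)  # → [[0, 1], [2, 3]]
```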
Clustering validation (1)
How to validate a clustering solution? The data is not labelled.
External criteria [Halkidi, 2002]
• An expert in the domain of the problem acts as judge
• Comparing with an intuitive solution: F-measure, Rand index, Adjusted Rand index, ...
• Explanations of each cluster to justify them: main features (attributes) of the elements of each cluster
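One of the external criteria named above, the Rand index, can be sketched from scratch: it is the fraction of instance pairs on which two labelings agree (grouped together in both, or separated in both). The example labelings are made up.

```python
# Sketch of the Rand index: pairwise agreement between a clustering
# result and a reference (e.g. an expert's intuitive solution).

def rand_index(labels_a, labels_b):
    n = len(labels_a)
    agree, pairs = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b   # pair treated the same way twice
            pairs += 1
    return agree / pairs

expert = [0, 0, 0, 1, 1]   # intuitive solution from the expert
found = [0, 0, 1, 1, 1]    # clustering result to validate
score = rand_index(expert, found)  # → 0.6
```

A perfect match scores 1.0; cluster labels themselves don't matter, only which pairs are grouped together.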
Clustering validation (2)
Relative criteria [Halkidi, 2002]
• Comparing the clustering results according to a validation index (or a combination of them); validation indexes use only the information of the data set
• Normally used to select the best solution from several clustering results obtained with different clustering algorithms; this does not mean that the solution is a good solution to the problem
• The selected solution depends on the validation index used
36
Summary
37
How
to
solve
a
clustering
problem?
•
Data
analysis
Pre

process
the
data
if
it
is
necessary
(noise,
unknown
values
...
)
•
Selection
of
the
clustering
algorithm
Is
important
to
know
the
domain
of
the
problem
Is
there
a
known
number
of
clusters?
Can
be
overlapping
between
clusters?
Is
necessary
a
hierarchical
relationship
between
clusters?
Is
important
to
detect
arbitrary
shapes?
What
are
the
clustering
criteria?
...
5
References
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
G. Corral, A. Garcia-Piquer, A. Orriols-Puig, A. Fornells, and E. Golobardes. Analysis of Vulnerability Assessment Results based on CAOS. Applied Soft Computing Journal, in press, 2010.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2000.
J. C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 95–104, 1974.
G. Gan, M. Chaoqun, and J. Wu. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM, 2007.
M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part I. ACM SIGMOD Record, 31(2):40–45, 2002.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
S. C. Johnson. Hierarchical Clustering Schemes. Psychometrika, 2:241–254, 1967.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., 2005.
M. Law, A. Topchy, and A. Jain. Multiobjective data clustering. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:424–430, 2004.
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281–297, 1967.
M. Matteucci. A Tutorial on Clustering Algorithms, Politecnico di Milano. <http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html>
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2005.