Computacion inteligente
Fuzzy Clustering
Nov 2005
Agenda
Basic concepts
Types of Clustering
Types of Clusters
Distance functions
Clustering Algorithms
Basic concepts
Classification
Historically, objects are classified into groups
periodic table of the elements (chemistry)
taxonomy (zoology, botany)
Why classify?
Understanding
Prediction
Organizational convenience, convenient summary
Summarization
Reduce the size of large data sets
These aims do not necessarily lead to the same classification; e.g. SIZE of object vs. TYPE/USE of object
Classification
Classification divides objects into groups based on a set of values
Unlike a theory, a classification is neither true nor false, and should be judged largely on the usefulness of results
However, a classification (clustering) may be useful for suggesting a theory, which could then be tested
What is clustering?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
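The two criteria above can be made concrete with a small sketch. It assumes Euclidean distance and uses the mean pairwise distance as the summary statistic; the function name `mean_intra_inter` and the toy points are illustrative assumptions, not part of the slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # A pair counts as "intra" when both points carry the same label.
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two visually obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
good = [0, 0, 0, 1, 1, 1]  # labels that follow the two groups
bad = [0, 1, 0, 1, 0, 1]   # labels that mix the two groups

good_intra, good_inter = mean_intra_inter(points, good)
bad_intra, bad_inter = mean_intra_inter(points, bad)
```

With the labelling that follows the visible groups, the intra-cluster mean is small relative to the inter-cluster mean; the mixed labelling loses that gap.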
Simple example
Composition of mammalian milk
Species     Fat (%)   Proteins (%)
Horse       1.0       2.6
Donkey      1.4       1.7
Mule        1.8       2.0
Camel       3.4       3.5
Llama       3.2       3.9
Zebra       4.8       3.0
Sheep       6.4       5.6
Buffalo     7.9       5.9
Fox         5.9       7.4
Pig         5.1       6.6
Rabbit      13.1      7.1
Rat         12.6      12.3
Deer        19.7      9.2
Reindeer    20.3      10.4
Whale       21.2      11.1
Simple example
[Figure: composition of mammalian milk, with axes Fat (%) and Proteins (%), showing the clustering of the species]
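As a rough illustration of how such a grouping could be computed, here is a minimal sketch using plain k-means. The slides do not say which algorithm produced the figure, so k-means, the choice of three clusters, and the three starting centers are all assumptions; the data values are the fat and protein percentages listed for the milk example.

```python
from math import dist

# (fat %, protein %) per species, as listed in the milk example.
MILK = {
    "Horse": (1.0, 2.6), "Donkey": (1.4, 1.7), "Mule": (1.8, 2.0),
    "Camel": (3.4, 3.5), "Llama": (3.2, 3.9), "Zebra": (4.8, 3.0),
    "Sheep": (6.4, 5.6), "Buffalo": (7.9, 5.9), "Fox": (5.9, 7.4),
    "Pig": (5.1, 6.6), "Rabbit": (13.1, 7.1), "Rat": (12.6, 12.3),
    "Deer": (19.7, 9.2), "Reindeer": (20.3, 10.4), "Whale": (21.2, 11.1),
}


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means (assumed here, not named in the slides)."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # no change, so the procedure has converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


names = list(MILK)
points = [MILK[name] for name in names]
# Hypothetical starting centers: one low-fat, one mid-fat, one high-fat species.
start = [MILK["Horse"], MILK["Sheep"], MILK["Whale"]]
labels, centers = kmeans(points, start)
groups = dict(zip(names, labels))
```

The result depends on the starting centers and on the number of clusters, which is one reason a clustering is judged by usefulness rather than by being true or false.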
What is clustering?
No class values denoting an a priori grouping of the data instances are given.
So, it’s a method of data exploration: a way of looking for patterns or structure in the data that are of interest
[Figure: patterns plotted in feature space]
What is clustering?
A form of unsupervised learning
You generally don’t have examples demonstrating how the data should be grouped together
Clustering is often called an unsupervised learning task
For historical reasons, clustering is often considered synonymous with unsupervised learning.
Clustering vs. class prediction
Clustering:
No learning set, no given classes
Goal: discover the “best” classes or groupings
Class prediction:
A learning set of objects with known classes
Goal: put new objects into existing classes
Also called: supervised learning, or classification
Components of Clustering Task
Pattern Representation
Number of classes and available patterns
Number, type, and scale of features available to the algorithm
Feature selection/extraction
Definition of Pattern Proximity measure
Defined on pairs of patterns
Distance measures and conceptual similarities
And…
Components of Clustering Task
Clustering / Grouping
Data abstraction (optional)
Extraction of a simple and compact data representation
Output Assessment (optional)
How good is it?
The quality of a clustering result depends on the algorithm, the distance function, and the application.
Pattern Representation
Which features do we use?
Currently, there are no theoretical guidelines suggesting appropriate patterns and features to use in a specific situation
The user generally must provide insight
Careful analysis of available features can yield improved clustering results
Pattern Representation
Example: balls of the same colour are clustered into a group:
[Figure: balls grouped by colour]
Thus, clustering means grouping data, or dividing a large data set into smaller data sets whose members share some similarity.
Notion of a Cluster can be Ambiguous
How many clusters?
Four Clusters
Two Clusters
Six Clusters
Types of Clustering
Types of Clusterings
Important distinction between hierarchical and partitional sets of clusters
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
Partitional Clustering
Original Points
A Partitional Clustering
Hierarchical Clustering
[Figure: hierarchical clusterings of the points p1, p2, p3, p4]
Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
In non-exclusive clusterings, points may belong to multiple clusters.
Fuzzy versus non-fuzzy
In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
Weights must sum to 1
Partial versus complete
In some cases, we only want to cluster some of the data
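The fuzzy case can be sketched in a few lines. The slide only states that the weights lie between 0 and 1 and sum to 1; the inverse-distance formula used here is the membership rule of fuzzy c-means and is an assumption, as are the function name `fuzzy_memberships` and the fuzzifier parameter `m`.

```python
from math import dist


def fuzzy_memberships(point, centers, m=2.0):
    """Membership weight of one point in every cluster (requires m > 1).

    Assumed rule (fuzzy c-means style):
        u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    where d_i is the distance from the point to center i.
    """
    d = [dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that cluster
    # (shared equally if several centers coincide with it).
    zero = [i for i, di in enumerate(d) if di == 0.0]
    if zero:
        return [1.0 / len(zero) if i in zero else 0.0 for i in range(len(d))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in d) for di in d]


# Hypothetical example: two cluster centers and a point closer to the first.
centers = [(0.0, 0.0), (4.0, 0.0)]
weights = fuzzy_memberships((1.0, 0.0), centers)
on_center = fuzzy_memberships((4.0, 0.0), centers)
```

Larger values of `m` push the weights toward equal shares, while values close to 1 approach an exclusive (non-fuzzy) assignment.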
Types of Clusters
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Described by an Objective Function
Types of Clusters: Well-Separated
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters]
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster
The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
[Figure: 4 center-based clusters]
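The difference between the two kinds of center can be seen on a tiny example. This is a sketch under one assumption: "most representative" is taken to mean the member with the smallest total Euclidean distance to the other members, which is a common definition of the medoid but is not spelled out in the slides.

```python
from math import dist


def centroid(points):
    """Coordinate-wise average of all points; need not be a member."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def medoid(points):
    """Member with the smallest total distance to all members (assumed definition)."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


# Hypothetical cluster with one far-away member.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (9.0, 9.0)]
c = centroid(cluster)
m = medoid(cluster)
```

The centroid is pulled toward the far-away member and coincides with no actual point, whereas the medoid is always one of the cluster's own members.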
Types of Clusters: Center-Based
Center-based
The centroid representation alone works well if the clusters are of hyper-spherical shape.
If clusters are elongated or are of other shapes, centroids are not sufficient
[Figure: 4 center-based clusters]
Common ways to represent clusters
Use the centroid of each cluster to represent the cluster.
Compute the radius and standard deviation of the cluster to determine its spread in each dimension
The centroid representation alone works well if the clusters are of hyper-spherical shape.
If clusters are elongated or are of other shapes, centroids are not sufficient
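A minimal sketch of this representation follows. Two choices here are assumptions: the radius is taken as the largest distance from the centroid to a member, and the standard deviation is the population form (dividing by n). The function name `summarize_cluster` is illustrative.

```python
from math import dist, sqrt


def summarize_cluster(points):
    """Return (centroid, radius, per-dimension standard deviation)."""
    n = len(points)
    center = tuple(sum(col) / n for col in zip(*points))
    # Radius: distance from the centroid to the farthest member (assumed definition).
    radius = max(dist(p, center) for p in points)
    # Population standard deviation along each dimension.
    spread = tuple(
        sqrt(sum((x - mu) ** 2 for x in col) / n)
        for col, mu in zip(zip(*points), center)
    )
    return center, radius, spread


# Hypothetical elongated cluster: wider along the first dimension.
center, radius, spread = summarize_cluster(
    [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)]
)
```

Unequal values in `spread` indicate an elongated cluster, which is the situation where the centroid alone says too little about the cluster's shape.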
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters]
Types of Clusters: Density-Based
Density-based
A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters]
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept.
[Figure: 2 overlapping circles]
Types of Clusters: Objective Function
Clusters Defined by an Objective Function
Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and evaluate the ‘goodness’ of each potential set of clusters by using the given objective function.
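The enumeration idea can be written down directly for a handful of points. This sketch assumes one particular objective, the sum of squared distances from each point to its cluster mean (SSE), to be minimized; the slides do not fix an objective, and the function names are illustrative.

```python
from itertools import product


def sse(points, labels, k):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in range(k):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        if not members:
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, mean)) for p in members)
    return total


def best_partition(points, k):
    """Try every labelling of the points with k labels; keep the lowest SSE."""
    best_score, best_labels = None, None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # require all k clusters to be non-empty
            continue
        score = sse(points, labels, k)
        if best_score is None or score < best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels


# Hypothetical toy data: two pairs of nearby points.
score, labels = best_partition([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)], k=2)
```

The loop visits k^n labellings for n points, so exhaustive search is only workable for very small data sets.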
Types of Clusters: Objective Function
Clusters Defined by an Objective Function
Can have global or local objectives.
Hierarchical clustering algorithms typically have
local objectives
Partitional algorithms typically have global
objectives
Distance functions
Clustering Task
Consists in introducing D, a distance measure (or a measure of similarity or proximity) between sample patterns.
Distance functions
The similarity measure is often more important than the clustering algorithm used
Instead of talking about similarity measures, we often equivalently refer to dissimilarity measures
Quality in Clustering
A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation
Distance functions
There are numerous distance functions for
Different types of data
Numeric data
Nominal data
Different specific applications
Weights should be associated with different variables based on applications and data semantics.
Distance functions for numeric data
We denote distance with: $d(\mathbf{x}_i, \mathbf{x}_j)$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are data points (vectors)
Most commonly used functions are Euclidean distance and Manhattan (city block) distance
They are special cases of Minkowski distance
Metric Spaces
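Metric Space: A pair $(X, d)$ where $X$ is a set and $d: X \times X \to \mathbb{R}$ is a distance function such that for $x, y, z$ in $X$:
Separation: $d(x, y) \ge 0$, with $d(x, y) = 0$ if and only if $x = y$
Triangular inequality: $d(x, z) \le d(x, y) + d(y, z)$
Symmetry: $d(x, y) = d(y, x)$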
Minkowski distance, $L_p$
[Figure: two points Q and C and the distance d(Q, C) between them]
$$d(Q, C) = \sqrt[p]{\sum_{i=1}^{n} |q_i - c_i|^p}$$
$p = 1$: Manhattan (Rectilinear, City Block)
$p = 2$: Euclidean
$p = \infty$: Max (Supremum, “sup”)
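The three special cases can be checked with one small function. This is a sketch assuming the usual form of the Minkowski distance, the p-th root of the sum of |q_i - c_i|^p, with p = infinity handled as the maximum coordinate difference; the function name `minkowski` is illustrative.

```python
from math import inf


def minkowski(q, c, p):
    """Minkowski (L_p) distance between two equal-length vectors, for p >= 1."""
    diffs = [abs(a - b) for a, b in zip(q, c)]
    if p == inf:
        # Limit case: the largest coordinate difference (max / supremum).
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Hypothetical points Q and C.
Q, C = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(Q, C, 1)    # |3| + |4|
euclidean = minkowski(Q, C, 2)    # sqrt(3^2 + 4^2)
chebyshev = minkowski(Q, C, inf)  # max(|3|, |4|)
```

For the same pair of points the value never increases as p grows, so the Manhattan distance is the largest of the three and the max distance the smallest.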
Euclidean distance, $L_2$
$$d_{euc}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Here $n$ is the number of dimensions in the data vector.
Euclidean distance
These examples of Euclidean distance match our intuition of dissimilarity pretty well…
[Figure: three example pairs, with $d_{euc} = 2.6115$, $d_{euc} = 1.1345$ and $d_{euc} = 0.5846$]
Euclidean distance
…But what about these?
What might be going on with the expression profiles on the left? On the right?
[Figure: two pairs of expression profiles, with $d_{euc} = 1.41$ and $d_{euc} = 1.22$]
Weighted Euclidean distance
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i} w_i (x_i - y_i)^2}$$
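A short sketch shows how the weights change the result. It assumes the form sqrt(sum_i w_i (x_i - y_i)^2) with non-negative weights; the function name `weighted_euclidean` and the example weights are illustrative.

```python
from math import sqrt


def weighted_euclidean(x, y, w):
    """Euclidean distance with a non-negative weight per dimension."""
    return sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))


# Hypothetical vectors and weightings.
x, y = (0.0, 0.0), (3.0, 4.0)
plain = weighted_euclidean(x, y, (1.0, 1.0))       # all weights 1: ordinary Euclidean
first_only = weighted_euclidean(x, y, (1.0, 0.0))  # second feature ignored
```

Setting a weight to zero removes that feature from the comparison, while a larger weight makes differences in that feature count for more.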
Mahalanobis distance
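$$d_M(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T \, \Sigma^{-1} \, (\mathbf{x}_i - \mathbf{x}_j)}$$
where $\Sigma$ is the covariance matrix of the data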
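For two-dimensional data the Mahalanobis distance can be computed without any matrix library. This sketch assumes the usual form, the square root of (x_i - x_j)^T S^{-1} (x_i - x_j), with S the sample covariance matrix of the data; the function names and the restriction to two dimensions are illustrative choices.

```python
from math import sqrt


def covariance_2d(points):
    """Sample covariance matrix of 2-D points, as ((sxx, sxy), (sxy, syy))."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(xi, xj, cov):
    """Mahalanobis distance between two 2-D points for a given covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = xi[0] - xj[0], xi[1] - xj[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return sqrt(quad)


identity = ((1.0, 0.0), (0.0, 1.0))
stretched = ((4.0, 0.0), (0.0, 1.0))  # hypothetical: variance 4 along the first axis
d_identity = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), identity)
d_stretched = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), stretched)
cov = covariance_2d([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
```

With the identity matrix the result equals the Euclidean distance, and with a diagonal covariance it reduces to a weighted Euclidean distance whose weights are the inverse variances.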
More Metrics
Manhattan distance, $L_1$:
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i} |x_i - y_i|$$
$L_\infty$ (Chessboard):
$$d(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$$