savagelizard, AI and Robotics

Nov 25, 2013

Computational Intelligence

Fuzzy Clustering

Nov 2005

Agenda


Basic concepts


Types of Clustering


Types of Clusters


Distance functions


Clustering Algorithms

Basic concepts


Classification


Historically, objects are classified into groups



periodic table of the elements (chemistry)


taxonomy (zoology, botany)



Why classify?



Understanding


prediction


organizational convenience, convenient summary


Summarization


Reduce the size of large data sets

These aims do not necessarily lead to the same classification; e.g., SIZE of object vs. TYPE/USE of object

Classification


Classification divides objects into groups based on a set of values



Unlike a theory, a classification is neither true nor
false, and should be judged largely on the
usefulness of results



However, a classification (clustering) may be useful for suggesting a theory, which could then be tested

What is clustering?


Finding groups of objects such that the objects in
a group will be similar (or related) to one another
and different from (or unrelated to) the objects in
other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized
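The two criteria above can be made concrete with a small sketch. The toy points, the labels, and the helper name `mean_intra_inter` are made up for illustration, and Euclidean distance is assumed; the slides do not prescribe a particular measure here.

```python
import numpy as np

def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # same[i, j] is True when points i and j carry the same label.
    same = labels[:, None] == labels[None, :]
    # Upper triangle without the diagonal, so each pair is counted once.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    intra = dist[same & upper].mean()
    inter = dist[~same & upper].mean()
    return intra, inter

# Hypothetical toy data: two tight groups that lie far apart.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
```

For a grouping that matches the structure of the data, `intra` should come out much smaller than `inter`.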

Simple example

Composition of mammalian milk

Species     Fat (%)   Proteins (%)
Horse          1.0        2.6
Donkey         1.4        1.7
Mule           1.8        2.0
Camel          3.4        3.5
Llama          3.2        3.9
Zebra          4.8        3.0
Sheep          6.4        5.6
Buffalo        7.9        5.9
Fox            5.9        7.4
Pig            5.1        6.6
Rabbit        13.1        7.1
Rat           12.6       12.3
Deer          19.7        9.2
Reindeer      20.3       10.4
Whale         21.2       11.1

Simple example

[Figure: Composition of mammalian milk, plotted by Fat (%) and Proteins (%), with the clustering marked]

What is clustering?


No class values denoting an a priori
grouping of the data instances are given.



So, it’s a method of data exploration: a way of looking for patterns or structure in the data that are of interest

[Figure labels: Pattern, Feature space]

What is clustering?


A form of unsupervised learning




You generally don’t have examples
demonstrating how the data should be
grouped together



Clustering is often called an unsupervised learning task

Due to historical reasons, clustering is often considered synonymous with unsupervised learning.

Clustering vs. class prediction


Clustering:



No learning set, no given classes


Goal: discover the "best" classes or groupings



Class prediction:



A learning set of objects with known classes


Goal: put new objects into existing classes


Also called: Supervised learning, or classification

Components of Clustering Task


Pattern Representation



Number of classes and available patterns


Number, type, and scale of features available
to algorithm


Feature selection/extraction



Definition of Pattern Proximity measure



Defined on pairs of patterns


Distance measures and conceptual similarities

And…


Components of Clustering Task


Clustering / Grouping



Data abstraction (optional)



Extraction of a simple and compact data representation



Output Assessment (optional)



How good is it?



The quality of a clustering result depends on the algorithm, the distance function, and the application.

Pattern Representation


Which features do we use?



Currently, there are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation

The user generally must provide insight



Careful analysis of available features can yield improved
clustering results


Pattern Representation


Example: balls of the same colour are clustered into a group.

[Figure: balls grouped by colour]

Thus, clustering means grouping data, or dividing a large data set into smaller data sets whose members share some similarity.

Notion of a Cluster can be Ambiguous

[Figure: the same set of points grouped in different ways. Panels: How many clusters?, Two Clusters, Four Clusters, Six Clusters]

Types of Clustering


Types of Clusterings


Important distinction between hierarchical and partitional sets of clusters



Partitional Clustering


A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset



Hierarchical clustering


A set of nested clusters organized as a hierarchical
tree



Partitional Clustering

[Figure: two panels, "Original Points" and "A Partitional Clustering"]

Hierarchical Clustering

[Figure: hierarchical clusterings of the points p1, p2, p3, p4]

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

In non-exclusive clusterings, points may belong to multiple clusters.

Fuzzy versus non-fuzzy

In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1

For each point, the weights must sum to 1

Partial versus complete

In some cases, we only want to cluster some of the data
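The fuzzy case can be sketched with a membership matrix. The matrix `U` and the helper `is_valid_fuzzy_partition` below are hypothetical and only illustrate the two constraints stated above (weights between 0 and 1, and weights of each point summing to 1).

```python
import numpy as np

# Hypothetical membership matrix: one row per point, one column per cluster.
U = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],
])

def is_valid_fuzzy_partition(U, tol=1e-9):
    """Check that every weight lies in [0, 1] and every row sums to 1."""
    U = np.asarray(U, dtype=float)
    in_range = np.all((U >= 0.0) & (U <= 1.0))
    rows_sum_to_one = np.allclose(U.sum(axis=1), 1.0, atol=tol)
    return bool(in_range and rows_sum_to_one)

# A non-fuzzy (exclusive) clustering is the special case in which each row
# holds a single weight of 1. Here one is derived by taking the largest weight.
hard_labels = U.argmax(axis=1)
```

Taking the largest weight per row is only one way to turn a fuzzy result into an exclusive one; it discards the information that the second point sits between clusters.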

Types of Clusters


Types of Clusters



Well-separated clusters

Center-based clusters

Contiguous clusters

Density-based clusters

Property or Conceptual

Described by an Objective Function

Types of Clusters: Well-Separated

Well-Separated Clusters:

A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

[Figure: 3 well-separated clusters]
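The well-separated definition can be checked directly on a labelled data set. This is a minimal sketch assuming Euclidean distance; the function name `is_well_separated` and the toy points are illustrative, not part of the original material.

```python
import numpy as np

def is_well_separated(points, labels):
    """True if every point is closer to all points of its own cluster
    than to any point outside it (Euclidean distance assumed)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                  # ignore the point itself
        other = labels != labels[i]
        if not same.any() or not other.any():
            continue                     # nothing to compare against
        # Farthest cluster mate must still be nearer than the nearest outsider.
        if dist[i, same].max() >= dist[i, other].min():
            return False
    return True

# Hypothetical toy data: two tight pairs far apart.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = is_well_separated(points, [0, 0, 1, 1])
bad = is_well_separated(points, [0, 1, 0, 1])
```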

Types of Clusters: Center-Based

Center-based

A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its cluster than to the center of any other cluster

The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of a cluster

[Figure: 4 center-based clusters]
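A short sketch of the two kinds of center. The centroid follows the definition above (the average of the points). For the medoid, "most representative" is taken here to mean the member with the smallest total distance to the other members, which is a common reading but an assumption of this sketch. All function names and data are illustrative.

```python
import numpy as np

def centroid(cluster):
    """Average of all points in the cluster (need not be a data point)."""
    return np.asarray(cluster, dtype=float).mean(axis=0)

def medoid(cluster):
    """Member with the smallest summed Euclidean distance to the others."""
    cluster = np.asarray(cluster, dtype=float)
    diff = cluster[:, None, :] - cluster[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return cluster[dist.sum(axis=1).argmin()]

def assign_to_nearest_center(points, centers):
    """Index of the closest center for each point."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = points[:, None, :] - centers[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).argmin(axis=1)

# Hypothetical cluster with one far-away member.
cluster = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [14.0, 0.0]]
c = centroid(cluster)   # pulled toward the far-away member
m = medoid(cluster)     # always one of the actual data points

assigned = assign_to_nearest_center(
    [[1.0, 0.0], [9.0, 0.0], [4.0, 0.0]],
    [[0.0, 0.0], [10.0, 0.0]],
)
```

In this toy cluster the centroid lands between data points, while the medoid stays on a member in the dense part of the cluster.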

Types of Clusters: Center-Based

Center-based

The centroid representation alone works well if the clusters are hyper-spherical.

If clusters are elongated or have other shapes, centroids are not sufficient

[Figure: 4 center-based clusters]

Common ways to represent clusters


Use the centroid of each cluster to represent the cluster.

Compute the radius and standard deviation of the cluster to determine its spread in each dimension

The centroid representation alone works well if the clusters are hyper-spherical.

If clusters are elongated or have other shapes, centroids are not sufficient
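The summary statistics mentioned above can be computed as follows. This is a sketch under two assumptions that the slides do not spell out: the radius is taken as the largest Euclidean distance from the centroid to a member, and the standard deviation is the population standard deviation per dimension. The name `summarize_cluster` and the data are illustrative.

```python
import numpy as np

def summarize_cluster(cluster):
    """Return (centroid, radius, per-dimension standard deviation)."""
    cluster = np.asarray(cluster, dtype=float)
    center = cluster.mean(axis=0)
    # Radius: distance from the centroid to the farthest member (assumed definition).
    radius = np.sqrt(((cluster - center) ** 2).sum(axis=1)).max()
    # Spread: population standard deviation along each dimension.
    spread = cluster.std(axis=0)
    return center, radius, spread

# Hypothetical elongated cluster stretched along the first dimension.
elongated = [[-4.0, 0.0], [-2.0, 0.0], [0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
center, radius, spread = summarize_cluster(elongated)
```

Very different values in `spread` signal an elongated cluster, which is the situation in which the centroid alone is an insufficient description.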

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

[Figure: 8 contiguous clusters]

Types of Clusters: Density-Based

Density-based

A cluster is a dense region of points that is separated from other regions of high density by low-density regions.

Used when the clusters are irregular or intertwined, and when noise and outliers are present.

[Figure: 6 density-based clusters]


Types of Clusters: Conceptual Clusters


Shared Property or Conceptual Clusters

Finds clusters that share some common property or represent a particular concept.

[Figure: 2 Overlapping Circles]

Types of Clusters: Objective Function


Clusters Defined by an Objective Function

Finds clusters that minimize or maximize an objective function.

Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters by using the given objective function.
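The enumeration described above can be sketched for a tiny data set. The objective used here, the sum of squared distances from each point to its cluster centroid, is an assumption chosen for illustration; the slides leave the objective function open. The names `sse` and `best_partition` are hypothetical.

```python
from itertools import product

import numpy as np

def sse(points, labels, k):
    """Sum of squared distances of each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = points[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def best_partition(points, k):
    """Try every assignment of points to k non-empty clusters; keep the best."""
    points = np.asarray(points, dtype=float)
    best_labels, best_score = None, np.inf
    for assignment in product(range(k), repeat=len(points)):
        if len(set(assignment)) < k:
            continue  # skip assignments that leave a cluster empty
        labels = np.array(assignment)
        score = sse(points, labels, k)
        if score < best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score

# Hypothetical toy data: two pairs of nearby points.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels, score = best_partition(points, k=2)
```

The loop visits k^n assignments for n points, so exhaustive enumeration is only workable for very small inputs.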

Types of Clusters: Objective Function


Clusters Defined by an Objective Function



Can have global or local objectives.



Hierarchical clustering algorithms typically have
local objectives




Partitional algorithms typically have global
objectives

Distance functions


Clustering Task


Consists in introducing D, a distance measure (or a measure of similarity or proximity) between sample patterns.

Distance functions


The similarity measure is often more important than the clustering algorithm used

Instead of talking about similarity measures, we often equivalently refer to dissimilarity measures

Quality in Clustering


A good clustering method will produce high quality clusters with

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

Distance functions


There are numerous distance functions for



Different types of data



Numeric data


Nominal data



Different specific applications


Weights should be associated with different variables
based on applications and data semantics.


Distance functions for numeric data


We denote distance with $d(\mathbf{x}, \mathbf{y})$, where $\mathbf{x}$ and $\mathbf{y}$ are data points (vectors)

The most commonly used functions are

Euclidean distance and

Manhattan (city block) distance

They are special cases of the Minkowski distance

Metric Spaces


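Metric Space: A pair (X, d) where X is a set and d is a distance function such that for x, y, z in X:

Separation: $d(x, y) \ge 0$, and $d(x, y) = 0$ if and only if $x = y$

Triangular inequality: $d(x, z) \le d(x, y) + d(y, z)$

Symmetry: $d(x, y) = d(y, x)$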

Minkowski distance, $L_p$

For $Q = (q_1, \dots, q_n)$ and $C = (c_1, \dots, c_n)$:

$$d(Q, C) = \sqrt[p]{\sum_{i=1}^{n} |q_i - c_i|^{p}}$$

p = 1 Manhattan (Rectilinear, City Block)

p = 2 Euclidean

p = ∞ Max (Supremum, “sup”)
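A minimal sketch of the Minkowski family, showing how the three listed cases come out of one function. The function name `minkowski` and the sample vectors are illustrative; the p = ∞ case is handled separately as the maximum absolute difference.

```python
import math

def minkowski(q, c, p):
    """Minkowski distance between equal-length sequences q and c."""
    if len(q) != len(c):
        raise ValueError("q and c must have the same number of dimensions")
    diffs = [abs(qi - ci) for qi, ci in zip(q, c)]
    if math.isinf(p):
        return max(diffs)                       # p = infinity: max (supremum)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical sample vectors.
q, c = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski(q, c, 1)
d_euclidean = minkowski(q, c, 2)
d_max = minkowski(q, c, math.inf)
```

For the same pair of points the value can only stay the same or shrink as p grows, which is why the choice of p changes which points count as "close".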

Euclidean distance, $L_2$

$$d_{euc}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Here n is the number of dimensions in the data vector.
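As an illustration, the Euclidean distance can be applied to three species from the mammalian milk example, using their (fat %, protein %) values as listed there. The dictionary and function names are made up for this sketch.

```python
import math

# (fat %, protein %) for three species, as listed in the milk example.
milk = {
    "Horse": (1.0, 2.6),
    "Donkey": (1.4, 1.7),
    "Whale": (21.2, 11.1),
}

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d_horse_donkey = euclidean(milk["Horse"], milk["Donkey"])
d_horse_whale = euclidean(milk["Horse"], milk["Whale"])
```

Here n = 2, and the horse ends up far closer to the donkey than to the whale, which agrees with the intuition that their milk is similar.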

Euclidean distance

These examples of Euclidean distance match our intuition of dissimilarity pretty well…

[Figure: three example pairs, with d_euc = 2.6115, d_euc = 1.1345, and d_euc = 0.5846]

Euclidean distance

…But what about these?

What might be going on with the expression profiles on the left? On the right?

[Figure: two pairs of expression profiles, with d_euc = 1.41 and d_euc = 1.22]

Weighted Euclidean distance

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$$
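A small sketch of the weighted form, assuming non-negative weights, one per dimension. The function name and the numbers are illustrative; the weights below were picked only to show how a large-scale feature can be damped.

```python
import math

def weighted_euclidean(x, y, w):
    """Euclidean distance with a non-negative weight w_i per dimension."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

# Hypothetical points whose second feature lives on a much larger scale.
x, y = (1.0, 100.0), (2.0, 300.0)
d_unweighted = weighted_euclidean(x, y, (1.0, 1.0))
d_weighted = weighted_euclidean(x, y, (1.0, 0.0001))
```

With all weights equal to 1 the function reduces to the plain Euclidean distance, so the second feature dominates `d_unweighted`; the small weight brings both features to a comparable contribution in `d_weighted`.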

Mahalanobis distance

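$$d_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)\, \Sigma^{-1} (\mathbf{x}_i - \mathbf{x}_j)^{T}$$

where $\Sigma$ is the covariance matrix of the data.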
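A sketch of the quadratic form with NumPy, assuming Σ is a covariance matrix supplied by the caller (it could, for example, be estimated from data with `np.cov(data, rowvar=False)`). The function name `mahalanobis_sq` is hypothetical. Many texts define the Mahalanobis distance as the square root of this quantity, so the sketch returns the quadratic form and leaves the root to the caller.

```python
import numpy as np

def mahalanobis_sq(xi, xj, cov):
    """Quadratic form (xi - xj) Sigma^{-1} (xi - xj)^T."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Solve Sigma z = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(cov, diff))

# Hypothetical covariance: the first feature varies more than the second.
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
origin = [0.0, 0.0]
along_wide_axis = mahalanobis_sq([2.0, 0.0], origin, cov)
along_narrow_axis = mahalanobis_sq([0.0, 2.0], origin, cov)
```

The two test points are equally far from the origin in Euclidean terms, but the one displaced along the low-variance feature gets the larger value. With the identity matrix as Σ, the result is the squared Euclidean distance.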

More Metrics


Manhattan distance, $L_1$:

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$

$L_\infty$ (Chessboard):

$$d(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$$