# 4 center-based clusters

Nov 25, 2013 (4 years and 7 months ago)

Computacion inteligente

Fuzzy Clustering

Nov 2005

2

Agenda

Basic concepts

Types of Clustering

Types of Clusters

Distance functions

Clustering Algorithms

Basic concepts

Classification

Historically, objects are classified into groups

periodic table of the elements (chemistry)

taxonomy (zoology, botany)

Why classify?

Understanding

Prediction

Organizational convenience, convenient summary

Summarization

Reduce the size of large data sets

These aims do not necessarily lead to the same classification; e.g. SIZE of object vs. TYPE/USE of object

Classification

Classification divides objects into groups based on a set of values

Unlike a theory, a classification is neither true nor false, and should be judged largely on the usefulness of results

However, a classification (clustering) may be useful for suggesting a theory, which could then be tested

What is clustering?

Finding groups of objects such that the objects in
a group will be similar (or related) to one another
and different from (or unrelated to) the objects in
other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized
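The two criteria above can be made concrete with a small sketch. It assumes Euclidean distance and plain averages over point pairs; the helper names `mean_intra_distance` and `mean_inter_distance` and the toy data are illustrative only and do not come from the slides.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster.

    Assumes at least one cluster holds two or more points.
    """
    dists = [math.dist(a, b)
             for cluster in clusters
             for a, b in combinations(cluster, 2)]
    return sum(dists) / len(dists)

def mean_inter_distance(clusters):
    """Average distance between pairs of points taken from different clusters."""
    dists = [math.dist(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1
             for b in c2]
    return sum(dists) / len(dists)

# Toy data: two tight groups that lie far apart.
clusters = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(4.0, 0.0), (4.0, 1.0)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(intra, inter)
```

For a good grouping the intra-cluster average should be small relative to the inter-cluster average, which is the case for this toy data.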

Simple example

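Composition of mammalian milk

| Species  | Fat (%) | Proteins (%) |
|----------|---------|--------------|
| Horse    | 1.0     | 2.6          |
| Donkey   | 1.4     | 1.7          |
| Mule     | 1.8     | 2.0          |
| Camel    | 3.4     | 3.5          |
| Llama    | 3.2     | 3.9          |
| Zebra    | 4.8     | 3.0          |
| Sheep    | 6.4     | 5.6          |
| Buffalo  | 7.9     | 5.9          |
| Fox      | 5.9     | 7.4          |
| Pig      | 5.1     | 6.6          |
| Rabbit   | 13.1    | 7.1          |
| Rat      | 12.6    | 12.3         |
| Deer     | 19.7    | 9.2          |
| Reindeer | 20.3    | 10.4         |
| Whale    | 21.2    | 11.1         |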

Simple example

[Figure: scatter plot of the composition of mammalian milk, Fat (%) against Proteins (%), with the clustering marked]

What is clustering?

No class values denoting an a priori grouping of the data instances are given.

So, it’s a method of data exploration: a way of looking for patterns or structure in the data that are of interest

[Figure: patterns plotted as points in feature space]

What is clustering?

A form of unsupervised learning

You generally don’t have examples demonstrating how the data should be grouped together

Clustering is often called an unsupervised learning task

For historical reasons, clustering is often considered synonymous with unsupervised learning.

Clustering vs. class prediction

Clustering:

No learning set, no given classes

Goal: discover the “best” classes or groupings

Class prediction:

A learning set of objects with known classes

Goal: put new objects into existing classes

Also called: supervised learning, or classification

Pattern Representation

Number of classes and available patterns

Number, type, and scale of features available to the algorithm

Feature selection/extraction

Definition of Pattern Proximity measure

Defined on pairs of patterns

Distance measures and conceptual similarities

And…

Clustering / Grouping

Data abstraction (optional)

Extraction of a simple and compact data representation

Output Assessment (optional)

How good is it?

The quality of a clustering result depends on the algorithm, the distance function, and the application.

Pattern Representation

Which features do we use?

Currently, there are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation

User generally must provide insight

Careful analysis of available features can yield improved
clustering results

Pattern Representation

Example: balls of the same colour are clustered into a group.

[Figure: balls grouped by colour]

Thus, we see that clustering means grouping data, or dividing a large data set into smaller data sets of similar items.

Notion of a Cluster can be Ambiguous

How many clusters?

Four Clusters

Two Clusters

Six Clusters

Types of Clustering

Types of Clusterings

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering

A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

Hierarchical clustering

A set of nested clusters organized as a hierarchical tree

Partitional Clustering

Original Points

A Partitional Clustering

Hierarchical Clustering

[Figure: hierarchical clusterings of the points p1, p2, p3 and p4]

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive

In non-exclusive clusterings, points may belong to multiple clusters.

Fuzzy versus non-fuzzy

In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1

Weights must sum to 1

Partial versus complete

In some cases, we only want to cluster some of the data
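To illustrate the fuzzy case, here is a minimal sketch of membership weights that lie between 0 and 1 and sum to 1 for each point. It assumes the membership rule commonly used in fuzzy c-means, with Euclidean distance and a fuzzifier `m`; the slides do not state which rule they use, so the function `fuzzy_memberships` and its parameters are assumptions for illustration.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Weights of `point` in each cluster, fuzzy c-means style (assumed rule).

    Each weight is proportional to d ** (-2 / (m - 1)), where d is the
    Euclidean distance to the cluster center; the weights are then
    normalized so that they sum to 1.
    """
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers:
        # share the whole weight among those centers.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

centers = [(0.0, 0.0), (3.0, 0.0)]
weights = fuzzy_memberships((1.0, 0.0), centers)
print(weights)
```

A non-fuzzy (crisp) clustering is the special case in which exactly one weight per point is 1 and all the others are 0.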

Types of Clusters

Types of Clusters

Well-separated clusters

Center-based clusters

Contiguous clusters

Density-based clusters

Property or Conceptual

Described by an Objective Function

Types of Clusters: Well-Separated

Well-Separated Clusters:

A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

3 well-separated clusters

Types of Clusters: Center-Based

Center-based

A cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its own cluster than to the center of any other cluster

The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of a cluster
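The difference between the two kinds of center can be sketched as follows. The medoid is taken here to be the member point with the smallest total Euclidean distance to the other members, which is one common reading of "most representative"; the slides do not fix a definition, so this choice and the function names are assumptions.

```python
import math

def centroid(points):
    """Coordinate-wise average of the points; need not be a member of the cluster."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def medoid(points):
    """Member point with the smallest total distance to all members (assumed definition)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# One far-away point pulls the centroid, but the medoid stays a real data point.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
center_c = centroid(cluster)
center_m = medoid(cluster)
print(center_c, center_m)
```

The medoid is always one of the data points, so it can be used even when averaging the objects makes no sense.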

4 center-based clusters

Types of Clusters: Center-Based

Center-based

The centroid representation alone works well if the clusters are hyper-spherical in shape.

If clusters are elongated or are of other shapes, centroids are not sufficient

4 center-based clusters

Common ways to represent clusters

Use the centroid of each cluster to represent the cluster.

standard deviation of the cluster to determine

The centroid representation alone works well if the clusters are hyper-spherical in shape.

If clusters are elongated or are of other shapes, centroids are not sufficient

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)

A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

8 contiguous clusters

Types of Clusters: Density-Based

Density-based

A cluster is a dense region of points that is separated from other regions of high density by low-density regions.

Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters

Types of Clusters: Conceptual Clusters

Shared Property or Conceptual Clusters

Finds clusters that share some common property or represent a particular concept.

2 Overlapping Circles

Types of Clusters: Objective Function

Clusters Defined by an Objective Function

Finds clusters that minimize or maximize an objective function.

Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters by using the given objective function.
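The enumeration idea can be sketched for a tiny data set. The objective used here, the sum of squared distances of each point to its cluster centroid (SSE), is an assumption chosen for illustration; the slides do not name a particular objective function. Exhaustive enumeration is only feasible for a handful of points, since the number of labelings grows as k to the power n.

```python
from itertools import product

def sse(points, labels, k):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        dims = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - center[d]) ** 2 for p in members for d in range(dims))
    return total

def best_partition(points, k):
    """Try every labeling of the points with k non-empty clusters; keep the lowest SSE."""
    best_labels, best_score = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:
            continue  # skip labelings that leave a cluster empty
        score = sse(points, labels, k)
        if score < best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, score = best_partition(points, k=2)
print(labels, score)
```

Practical algorithms avoid the full enumeration and search only part of this space, so they may stop at a local optimum of the objective.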

Types of Clusters: Objective Function

Clusters Defined by an Objective Function

Can have global or local objectives.

Hierarchical clustering algorithms typically have local objectives

Partitional algorithms typically have global objectives

Distance functions

Defining pattern proximity consists in introducing D, a distance measure (or a measure of similarity or proximity) between sample patterns.

Distance functions

The similarity measure is often more important than the clustering algorithm used

Instead of similarity measures, we often equivalently refer to dissimilarity measures

Quality in Clustering

A good clustering method will produce high quality clusters with

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

Distance functions

There are numerous distance functions for

Different types of data

Numeric data

Nominal data

Different specific applications

Weights should be associated with different variables
based on applications and data semantics.

Distance functions for numeric data

We denote distance with $d(x_i, x_j)$, where $x_i$ and $x_j$ are data points (vectors)

Most commonly used functions are Euclidean distance and Manhattan (city block) distance

They are special cases of Minkowski distance

Metric Spaces

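Metric Space: A pair $(X, d)$ where $X$ is a set and $d$ is a distance function such that for $x$, $y$, $z$ in $X$:

Separation: $d(x, y) \ge 0$, with $d(x, y) = 0$ if and only if $x = y$

Triangular inequality: $d(x, z) \le d(x, y) + d(y, z)$

Symmetry: $d(x, y) = d(y, x)$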

Minkowski distance, $L_p$

$$d(Q, C) = \sqrt[p]{\sum_{i=1}^{n} |q_i - c_i|^p}$$

[Figure: Q and C, with the distance $d(Q, C)$ between them]

$p = 1$: Manhattan (Rectilinear, City Block)

$p = 2$: Euclidean

$p = \infty$: Max (Supremum, “sup”)
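The three special cases can be checked with a short sketch of the Minkowski distance. The function name `minkowski` and the example points are illustrative; passing `math.inf` for `p` is the convention assumed here for the max (supremum) case.

```python
import math

def minkowski(q, c, p):
    """Minkowski distance L_p between two equal-length vectors q and c."""
    if len(q) != len(c):
        raise ValueError("vectors must have the same number of dimensions")
    diffs = [abs(qi - ci) for qi, ci in zip(q, c)]
    if math.isinf(p):
        # Limit p -> infinity: the largest coordinate difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

q, c = (0.0, 0.0), (3.0, 4.0)
d_manhattan = minkowski(q, c, 1)
d_euclidean = minkowski(q, c, 2)
d_max = minkowski(q, c, math.inf)
print(d_manhattan, d_euclidean, d_max)
```

For these two points the Manhattan distance is 7, the Euclidean distance is 5 and the max distance is 4, so the value shrinks as p grows.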

Euclidean distance, $L_2$

$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Here $n$ is the number of dimensions in the data vector.

Euclidean distance

These examples of Euclidean distance match our intuition of dissimilarity pretty well…

[Figure: three example pairs, with $d_{euc} = 2.6115$, $d_{euc} = 1.1345$ and $d_{euc} = 0.5846$]

Euclidean distance

What might be going on with the expression profiles
on the left? On the right?

[Figure: two pairs of expression profiles, with $d_{euc} = 1.41$ and $d_{euc} = 1.22$]

Weighted Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$$
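A minimal sketch of the weighted Euclidean distance follows. The weights and the example vectors are made up for illustration; in practice the weights would come from the application and the data semantics.

```python
import math

def weighted_euclidean(x, y, w):
    """Euclidean distance in which dimension i contributes with weight w[i]."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

x, y = (1.0, 10.0), (2.0, 30.0)
d_plain = weighted_euclidean(x, y, (1.0, 1.0))      # ordinary Euclidean distance
d_weighted = weighted_euclidean(x, y, (1.0, 0.01))  # second feature down-weighted
print(d_plain, d_weighted)
```

With equal weights the second feature dominates because of its larger scale; the small weight brings both features to a comparable influence.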

Mahalanobis distance

$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}$$

where $\Sigma$ is the covariance matrix of the data
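A small sketch of the Mahalanobis distance, restricted to two dimensions so that the covariance matrix can be inverted by hand with the standard library only. The function name `mahalanobis_2d` and the example covariance matrix are assumptions for illustration; the covariance matrix would normally be estimated from the data.

```python
import math

def mahalanobis_2d(xi, xj, cov):
    """Mahalanobis distance between two 2-D points for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = xi[0] - xj[0], xi[1] - xj[1]
    # Quadratic form (xi - xj)^T * inv * (xi - xj).
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quad)

# Illustrative covariance: variance 4 along the first axis, 1 along the second.
cov = ((4.0, 0.0), (0.0, 1.0))
origin = (0.0, 0.0)
d_along_wide_axis = mahalanobis_2d((2.0, 0.0), origin, cov)
d_along_narrow_axis = mahalanobis_2d((0.0, 2.0), origin, cov)
print(d_along_wide_axis, d_along_narrow_axis)
```

Both points lie at Euclidean distance 2 from the origin, but the point along the high-variance axis is the closer one in Mahalanobis terms. With the identity matrix as covariance, the measure reduces to the Euclidean distance.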

More Metrics

Manhattan distance, $L_1$:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

$L_\infty$ (Chessboard):

$$d(x, y) = \max_{i} |x_i - y_i|$$

Clustering Algorithms