# 4. Data Clustering

*Artificial Intelligence and Robotics, 8 Nov 2013*

## Clustering: A Naïve Example

| Object | A | B | C |
|--------|---|---|---|
| 0      | 3 | 1 | 0 |
| 1      | 2 | 1 | 1 |
| 2      | 3 | 0 | 0 |
| 3      | 2 | 1 | 1 |

Clustering 1: {0}, {1,3}, {2}

Depending on how the similarity of data objects is defined, the result could also be:

Clustering 2: {0,2}, {1,3}

## Clustering

"Assignment of a set of observations into groups called clusters so that observations in the same cluster are similar in some sense."

- Some data contains natural clusters, which are easy to deal with.
- In most cases there is no ideal solution, only 'best guesses'.

## Supervised and Unsupervised Learning

So, what is the difference between, say, classification and clustering?

- Clustering belongs to a set of problems known as unsupervised learning.
- Classification involves the use of data 'labels' or 'classes'.
- In clustering there are no labels.

## So, how is clustering performed?

In the most common techniques a distance measure is employed; the most naïve of these use simple distance measures. Referring to the earlier example, the pairwise distances between objects (here, the fraction of attributes on which two objects differ) are:

| Distance | Object 0 | Object 1 | Object 2 | Object 3 |
|----------|----------|----------|----------|----------|
| Object 0 | -        |          |          |          |
| Object 1 | 0.6667   | -        |          |          |
| Object 2 | 0.333    | 1.0      | -        |          |
| Object 3 | 0.6667   | 0        | 1.0      | -        |
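The values in this table can be reproduced with a short sketch; it assumes the distance is the fraction of attributes on which two objects differ (a normalised Hamming distance), which matches every entry above. Names are illustrative, not from the slides:

```python
# Normalised Hamming distance: the fraction of attributes on which
# two objects differ. Data taken from the example table.
objects = [
    (3, 1, 0),  # Object 0
    (2, 1, 1),  # Object 1
    (3, 0, 0),  # Object 2
    (2, 1, 1),  # Object 3
]

def distance(x, y):
    """Fraction of positions at which x and y disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Print the lower triangle of the distance matrix.
for i in range(len(objects)):
    for j in range(i):
        print(f"d(Object {i}, Object {j}) = {distance(objects[i], objects[j]):.4f}")
```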

How we use these distances determines the cluster assignments for each of the objects in the dataset. Other measures include the Hamming distance, the number of positions at which two sequences differ:

    C A T
    M A T

One single change, so distance = 1.
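The plain (unnormalised) Hamming distance is a one-liner; the function name here is illustrative:

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming("CAT", "MAT"))  # one single change, so distance = 1
```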

Other aspects which affect clustering (consider again the example table above):

- How many clusters?
- Predefined or natural clusterings?
- How are objects similar, and how is this defined?

## How can we tell if our clustering is any good?

Cluster validity: there are as many measures as there are clustering approaches! Many attempts have been made, but there are no correct answers, only best guesses. Few standardised measures exist, but two common criteria are:

- Compactness: members of the same cluster should be 'tightly-packed'
- Separation: clusters should be widely separated

### Dunn Index

The ratio of the minimal intercluster distance to the maximal intracluster distance; larger values therefore indicate better clustering.
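In symbols (a standard formulation, not taken from the slides; δ denotes the distance between two clusters and Δ the diameter of a cluster):

```latex
D = \frac{\min_{1 \le i < j \le n} \delta(C_i, C_j)}
         {\max_{1 \le k \le n} \Delta(C_k)}
```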

### Davies-Bouldin Index

A function of the ratio of the sum of within-cluster scatter to between-cluster separation, where n is the number of clusters, S_i is the average distance of all objects in cluster i from the cluster centre, and S(Q_i, Q_j) is the distance between the centres of clusters i and j. Small values are therefore indicative of good clustering.
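A standard formulation consistent with the symbols above (an assumption on my part, since the slide's equation did not survive extraction):

```latex
DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \ne i}
     \frac{S_i + S_j}{S(Q_i, Q_j)}
```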

### C-index

Let S be the sum of distances over all pairs of objects in the same cluster, and l the number of such pairs. S_min is then the sum of the l smallest distances when all pairs of objects are considered (i.e. when the two objects may belong to different clusters), and S_max is the sum of the l largest distances over all pairs. Again, a small value of C indicates a good clustering.
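With these definitions the index (whose equation was lost in extraction; this is the standard form) is:

```latex
C = \frac{S - S_{\min}}{S_{\max} - S_{\min}}, \qquad 0 \le C \le 1
```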

### Class Index

A special case where the object labels are known and we would like to assess how well the clustering method predicts them:

- Solution 1: {0}, {1,3}, {2} = 1.0 (all objects clustered correctly)
- Solution 2: {0,2}, {1,3} = 0.75 (3 of 4 objects clustered correctly)

| Object | A | B | C | Class |
|--------|---|---|---|-------|
| 0      | 3 | 1 | 0 | A     |
| 1      | 2 | 1 | 1 | B     |
| 2      | 3 | 0 | 0 | C     |
| 3      | 2 | 1 | 1 | B     |
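One plausible way to compute such a score (the slides do not give a formula; this sketch assumes each cluster predicts its majority class, which reproduces both values above):

```python
from collections import Counter

def class_index(clusters, classes):
    """Fraction of objects falling in their cluster's majority class."""
    correct = sum(
        Counter(classes[i] for i in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return correct / len(classes)

classes = ["A", "B", "C", "B"]  # class column of the table above
print(class_index([{0}, {1, 3}, {2}], classes))  # 1.0
print(class_index([{0, 2}, {1, 3}], classes))    # 0.75
```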

## What if we cluster the columns as well as the rows?

Yes, this can be done also; it is known as 'biclustering', and generates subsets of rows which exhibit similar behaviour across a subset of columns.

## Clustering Algorithm Types

Four basic types:

- Non-hierarchical or partitional: k-Means, C-Means, etc.
- Hierarchical: agglomerative, divisive, etc.
- GA-based (genetic algorithms)
- Others: data density, etc.

## Non-Hierarchical / Partitional Clustering

An iterative process:

    Partitional Algorithm
        create an initial partition of k clusters
        perform a random assignment of data objects to clusters
        do
            (1) calculate new cluster centres from the membership assignments
            (2) calculate membership assignments from the cluster centres
        until the cluster membership assignment stabilises
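The loop above can be sketched in a few lines; a minimal k-means, assuming squared Euclidean distance (all names are illustrative, and real code would normally use a library such as scikit-learn):

```python
import random

def kmeans(data, k, seed=0):
    """Minimal k-means following the partitional loop above."""
    rng = random.Random(seed)
    # Initial partition: random assignment of data objects to k clusters.
    labels = [rng.randrange(k) for _ in data]
    while True:
        # (1) Calculate new cluster centres from membership assignments.
        centres = []
        for c in range(k):
            members = [x for x, l in zip(data, labels) if l == c]
            if members:
                centres.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                centres.append(rng.choice(data))  # re-seed an empty cluster
        # (2) Calculate membership assignments from cluster centres.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centres[c])))
            for x in data
        ]
        # Until cluster membership assignment stabilises.
        if new_labels == labels:
            return labels, centres
        labels = new_labels
```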

*(Figure: unclustered data → initial random centres → 1st iteration → 2nd iteration → final clustering.)*

## Hierarchical Clustering

*(Figure: a tree relating all data objects at the root to Cluster 1, Cluster 2 and Cluster 3 at the leaves; agglomerative methods work bottom-up, divisive methods top-down.)*

Clusters are formed by either:

- Dividing the bag or collection of data into clusters, or
- Agglomerating similar clusters

A metric is required to decide when smaller clusters of objects should be merged (or split).
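An agglomerative (bottom-up) pass can be sketched as repeated merging of the closest pair of clusters; this sketch assumes single linkage and Euclidean distance, and the names are illustrative:

```python
def single_linkage(data, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(data))]  # start: every object alone

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(data[i], data[j]) for i in c1 for j in c2)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with minimal linkage and merge it.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters
```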

## Fuzzy Sets

An extension of conventional sets that deals with vagueness: elements of a fuzzy set are allowed partial membership, described by a membership function. The idea itself is very old.

## Fuzzy Techniques

Fuzzy C-Means (FCM) (Dunn 1973 / Bezdek 1981):

- A non-hierarchical approach
- Data objects are allowed to belong to multiple 'fuzzy' clusters
- Works by iterative optimisation of an objective function
- Numerous termination criteria exist
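For reference, the standard FCM equations (not shown on the slides): with fuzzifier m > 1 and u_{ij} the membership of object x_i in the cluster with centre c_j, the objective J_m is minimised subject to the memberships of each object summing to 1, by alternating the two update rules:

```latex
J_m = \sum_{i}\sum_{j} u_{ij}^{m} \lVert x_i - c_j \rVert^2, \qquad
c_j = \frac{\sum_i u_{ij}^m x_i}{\sum_i u_{ij}^m}, \qquad
u_{ij} = \left( \sum_{k} \left( \frac{\lVert x_i - c_j \rVert}
                                     {\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}} \right)^{-1}
```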

## FCM

Problems:

- Requires the specification of up to 4 different parameters
- Relies on simple distance metrics
- Outliers are still a problem

*(Figure: a simple FCM example; memberships calculated from the initial means/centres, then after the 1st and 2nd iterations.)*

## Fuzzy Agglomerative Hierarchical Clustering (FLAME)

A fuzzy approach that employs a data density measure. Basic concepts:

- Cluster Supporting Object (CSO): an object with density higher than all its neighbours
- Outlier: an object with density lower than all its neighbours, and lower than a predefined threshold
- The rest: all other objects

The algorithm uses these concepts to form clusters of data objects.

The FLAME algorithm comprises three steps:

1. For each data object:
   - Find its k-nearest neighbours
   - Use the proximity measurements to calculate its density
   - Use the density to define the object type (CSO / outlier / other)
2. Local approximation of fuzzy memberships:
   - Initialization: each CSO is assigned fixed and full membership to itself, to represent one cluster; all outliers are assigned fixed and full membership to the outlier group; the rest are assigned equal memberships to all clusters and the outlier group.
   - The fuzzy memberships of all the "rest" objects are then updated by a converging iterative procedure called Local/Neighbourhood Approximation of Fuzzy Memberships: the fuzzy membership of each object is updated by a linear combination of the fuzzy memberships of its nearest neighbours.
3. Construct clusters from the fuzzy memberships.

*(Figure: a 2D example of FLAME clustering.)*

FLAME still requires parameters (k, the neighbourhood density threshold, etc.). But it produces good clusters, deals well with non-linear structures, and has an effective mechanism for handling outliers.

## "Rough Set" Clustering

Rough k-Means was proposed by Lingras and West. It is not strictly rough set clustering: it borrows some of the properties of Rough Set Theory (RST) but not all its core concepts, and can in reality be viewed as binary interval-based clustering.

It utilises the following properties of RST to cluster data objects:

- A data object can only be a member of one lower approximation.
- A data object that is a member of the lower approximation of a cluster is also a member of the upper approximation of the same cluster.
- A data object that does not belong to any lower approximation is a member of at least two upper approximations.

The two main concepts borrowed from RST are the upper and lower approximations, which are used to generate cluster centres: the mean (centre) of each cluster is calculated from its lower approximation and its boundary region, and the weighting given to the lower approximation versus the boundary affects the value of each mean/centre.
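A sketch of that centre computation in the style of Lingras and West (the exact notation is an assumption, since the slide's formula was lost): with lower approximation A̲_j, boundary A̅_j \ A̲_j, and weights w_low + w_bnd = 1,

```latex
c_j = w_{\mathrm{low}} \, \frac{\sum_{x \in \underline{A}_j} x}{|\underline{A}_j|}
    + w_{\mathrm{bnd}} \, \frac{\sum_{x \in \overline{A}_j \setminus \underline{A}_j} x}
                              {|\overline{A}_j \setminus \underline{A}_j|}
```

When the boundary region is empty, the centre is simply the mean of the lower approximation.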

*(Figure: a simple 2D example of rough set clustering.)*

## Real-World Applications

(Apart from grouping data points, what is clustering actually used for?)

- Data clustering is one of the core AI methods used by search engines.
- Social networks also use data clustering techniques to identify communities within large groups.
- Amazon uses recommender systems to predict a user's preferences based on the preferences of other users in the same cluster.
- Other applications: mineralogy, seismic analysis and survey, medical imaging, etc.

## Summary

- Clustering is a complex problem.
- There are no 'right' or 'wrong' solutions.
- Validity measures are generally only as good as the 'desired' solution.