Ch. 15: Data Clustering


Ch. 15.1: Motivation

Organisms with similar genomes

[Figure: evolutionary chains linking organisms A, B, C, with |A,B| < |A,C|]

Human genome ~ 1 GB

Assumption: C stems from its closest neighbor; the probability
that C stems from a distant organism F is extremely small.

[Figure: evolutionary tree over the organisms A, B, C, D, E, F]


More Examples

Stocks with similar behavior in the market

Market basket discovery (buying behavior)

Documents on similar topics

How do we find nearest neighbors?

Which similarity or distance measures should we use?



Supervised vs. Unsupervised Classification

Supervised: the set of classes (clusters) is given; assign each new
pattern (point) to the proper cluster and label it with the label of its
cluster.

Examples: classify bacteria to select the proper antibiotic,
assign a signature to a book and place it on the proper shelf.

Unsupervised: for a given set of patterns (training set), discover a
set of clusters and assign additional patterns to the proper cluster.

Examples: buying behavior, stock groups, closed groups
of researchers citing each other, more?


Components of a Clustering Task

Pattern representation (feature extraction), e.g.
key words, image features, genes in DNA sequences

Pattern proximity: similarity, distance

Clustering (grouping) algorithm

Data abstraction: representation of a cluster,
label for a class, prototype, properties of a class

Assessment of the quality of the output


Examples

Clustering of documents or research groups by
citation index, evolution of papers. Problem: find a
minimal set of papers describing the essential ideas

Clustering of items from excavations according to
cultural epochs

Clustering of tooth fragments in anthropology

Carl von Linné: Systema Naturae, 1735, botany,
later zoology

etc.


Ch. 15.2: Formal Definitions

Pattern: (feature vector, measurements, observations, data
points)

X = (x_1, ..., x_d)

d dimensions (measurements); often d is not fixed, e.g. for key words

Attribute, Feature: x_i

Dimensionality: d

Pattern Set: H = { X_1, X_2, ..., X_n }

H is often represented as an n × d pattern matrix

Class: set of similar patterns; a pattern-generating process in
nature, e.g. growth of plants


More Definitions

Hard Clustering: classes are disjoint; every pattern gets a
unique label from

L = { l_1, l_2, ..., l_n } with l_i ∈ { 1, ..., k }

see Fig. 1

Fuzzy Clustering: pattern X_i gets a fractional degree of
membership f_ij for each output cluster j

Distance Measure: metric or proximity function or
similarity function in feature space to quantify the similarity of
patterns


Ch. 15.3: Pattern Representation and
Feature Selection

Human creativity: select few, but the most relevant, features!

Cartesian or polar coordinates? See Fig. 3

Document retrieval: key words (which?) or citations? What
is a good similarity function? Use of a thesaurus?

Zoology: skeleton and lungs instead of body shape or living
habits: dolphins, penguins, ostrich!


Types of Features

Quantitative: continuous, discrete,
intervals, fuzzy

Qualitative: enumeration types (colors),
ordinals (military ranks), general features
like (hot, cold), (quiet, loud), (humid, dry)

Structural features: OO hierarchies, like

vehicle -> car -> Benz -> S400
vehicle -> boat -> submarine


Ch. 15.4: Similarity Measures

Similar ~ small distance

Similarity function: not necessarily a metric; the triangle
inequality

dist(A,B) + dist(B,C) >= dist(A,C)

may be missing (quasi-metric).

Euclidean Distance

dist_2(X_i, X_j) = ( Σ_{k=1..d} |x_{i,k} - x_{j,k}|² )^(1/2) = ||X_i - X_j||_2

Special case of the Minkowski metric:

dist_p(X_i, X_j) = ( Σ_{k=1..d} |x_{i,k} - x_{j,k}|^p )^(1/p) = ||X_i - X_j||_p
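A minimal Python sketch of these two distance functions (illustrative only; patterns are assumed to be equal-length numeric tuples):

```python
from math import fsum

def minkowski_dist(x_i, x_j, p=2):
    """dist_p(X_i, X_j) = (sum_k |x_ik - x_jk|^p)^(1/p); p=2 is Euclidean."""
    assert len(x_i) == len(x_j), "patterns must share the dimensionality d"
    return fsum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)

print(minkowski_dist((0, 0), (3, 4)))       # Euclidean: 5.0
print(minkowski_dist((0, 0), (3, 4), p=1))  # Manhattan: 7.0
```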


Proximity Matrix: for n patterns with a symmetric similarity:

n * (n-1)/2 similarity values
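A sketch of such a condensed proximity matrix, storing each unordered pair once (assumes the minkowski_dist sketch above):

```python
def proximity_matrix(patterns, dist=minkowski_dist):
    """The n*(n-1)/2 pairwise distances of a symmetric measure,
    keyed by index pairs (i, j) with i < j."""
    n = len(patterns)
    return {(i, j): dist(patterns[i], patterns[j])
            for i in range(n) for j in range(i + 1, n)}

pm = proximity_matrix([(0, 0), (3, 4), (6, 8)])
print(len(pm))  # 3 = n*(n-1)/2 for n = 3
```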

Representation problem:

mixture of continuous and discrete (categorical) attributes, e.g.

dist( (white, -17), (green, 25) )

Use the wavelength as the value for colors and then the Euclidean
distance???


Other Similarity Functions

Context Similarity:

s(X_i, X_k) = f(X_i, X_k, E)   for environment E

e.g.

2 cars on a country road

2 climbers on 2 different towers of the 3 Zinnen mountain

Neighborhood distance of X_k with respect to X_i
= nearest neighbor number = NN(X_i, X_k)

Mutual neighborhood distance:

MND(X_i, X_k) = NN(X_i, X_k) + NN(X_k, X_i)


Lemma:

MND(X_i, X_k) = MND(X_k, X_i)
MND(X_i, X_i) = 0

Note: MND is not a metric; the triangle inequality is missing!

see Fig. 4:
NN(A,B) = NN(B,A) = 1        MND(A,B) = 2
NN(B,C) = 2; NN(C,B) = 1     MND(B,C) = 3

see Fig. 5:
NN(A,B) = 4; NN(B,A) = 1     MND(A,B) = 5
NN(B,C) = 2; NN(C,B) = 1     MND(B,C) = 3
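A sketch of both quantities (illustrative; NN(X_i, X_k) is read as the rank of X_k among X_i's neighbors, 1 = nearest, and the minkowski_dist sketch above is assumed):

```python
def nn_number(i, k, patterns):
    """NN(X_i, X_k): the rank of X_k among the neighbors of X_i (1 = nearest)."""
    others = sorted((j for j in range(len(patterns)) if j != i),
                    key=lambda j: minkowski_dist(patterns[i], patterns[j]))
    return others.index(k) + 1

def mnd(i, k, patterns):
    """MND(X_i, X_k) = NN(X_i, X_k) + NN(X_k, X_i); symmetric by construction."""
    return nn_number(i, k, patterns) + nn_number(k, i, patterns)
```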


(Semantic) Concept Similarity

s(X_i, X_k) = f(X_i, X_k, C, E)

for concept C, environment E

Examples for C: ellipse, rectangle, house, car, tree ...

See Fig. 6


Structural Similarity of Patterns

(5 cyl, Diesel, 4000 ccm)
(6 cyl, gasoline, 2800 ccm)

dist( ... ) ???

dist(car, boat) ???

How to cluster car engines?

Dynamic, user-defined clusters via query boxes?


Ch. 15.5: Clustering Methods

agglomerative (merge clusters)
    versus divisive (split clusters)

monothetic (one feature at a time)
    versus polythetic (all features at once)

hard clusters (pattern in a single class)
    versus fuzzy clusters (pattern in several classes)

incremental (add one pattern at a time)
    versus non-incremental (add all patterns at once)
    (important for large data sets!!)




Ch. 15.5.1: Hierarchical Clustering
by Agglomeration

Single Link:

dist(C1, C2) = min { dist(X_1, X_2) : X_1 ∈ C1, X_2 ∈ C2 }


Complete Link:

dist(C1, C2) = max { dist(X_1, X_2) : X_1 ∈ C1, X_2 ∈ C2 }

Merge the clusters with the smallest distance in both cases!

Examples: Figures 9 to 13


Hierarchical agglomerative clustering algorithm
for single link and complete link clustering

1. Compute the proximity matrix between all pairs of patterns;
initialize each pattern as a cluster

2. Find the closest pair of clusters, i.e. sort the n² distances in
O(n²*log n); merge the clusters; update the proximity matrix
(how, with which complexity?)

3. if all patterns are in one cluster then stop
   else goto step 2

Note: the proximity matrix requires O(n²) space, and the
distance computation alone takes at least O(n²) time for n
patterns (even without the clustering process): not feasible for
large datasets!
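A naive sketch of steps 1 to 3 (illustrative only: it recomputes cluster distances on demand instead of updating a proximity matrix; link=min gives single link, link=max complete link; the minkowski_dist sketch above is assumed):

```python
def agglomerate(patterns, target_k=1, link=min):
    """Merge the closest pair of clusters until target_k clusters remain."""
    clusters = [[i] for i in range(len(patterns))]   # 1. each pattern is a cluster

    def cluster_dist(c1, c2):                        # linkage distance
        return link(minkowski_dist(patterns[i], patterns[j])
                    for i in c1 for j in c2)

    while len(clusters) > target_k:                  # 3. stop at target_k clusters
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] += clusters.pop(b)               # 2. merge the closest pair
    return clusters
```

Recording each merge together with its distance would yield the dendrogram referred to in Ch. 15.5.2.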


Ch. 15.5.2: Partitioning Algorithms

Remember: agglomerative algorithms compute a sequence
of partitions from finest (1 pattern per cluster) to coarsest (all
patterns in a single cluster). The number of desired clusters is
chosen at the end by cutting the dendrogram at a certain level.

Partitioning algorithms fix the number k of desired clusters
first, choose k starting points (e.g. randomly or by sampling),
and assign the patterns to the closest cluster.


Def.: A k-partition has k (disjoint) clusters.

There are k^n different k-partitions (label assignments) for a set
of n patterns.

What is a good k-partition?

Squared Error Criterion: Assume pattern set H is divided
into k clusters labeled by L with l_i ∈ {1, ..., k}.

Let c_j be the centroid of cluster j; then the squared error is:

e²(H, L) = Σ_{j=1..k} Σ_{i=1..n_j} || X_i^(j) - c_j ||²

where X_i^(j) is the i-th pattern of cluster j


Squared Error Clustering Algorithm (k-means)

1. Choose k cluster centers somehow

2. Assign each pattern to its closest cluster center: O(n*k)

3. Recompute the new cluster centers c_j' and e²

4. if convergence criterion not satisfied
   then goto step 2 else exit

Convergence criteria:

few reassignments

little decrease of e²
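A minimal k-means sketch following steps 1 to 4 (illustrative; plain Python over numeric tuples, random initial centers, convergence = no reassignments):

```python
import random

def kmeans(patterns, k, max_iter=100, seed=0):
    centers = random.Random(seed).sample(patterns, k)  # 1. choose k centers somehow
    labels = None
    for _ in range(max_iter):
        # 2. assign each pattern to its closest cluster center, O(n*k)
        new_labels = [min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(x, centers[j])))
                      for x in patterns]
        if new_labels == labels:                       # 4. no reassignments: stop
            break
        labels = new_labels
        # 3. recompute the new cluster centers c_j' as cluster means
        for j in range(k):
            members = [x for x, l in zip(patterns, labels) if l == j]
            if members:
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers
```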



Problems

convergence, speed of convergence?

local minimum of e² instead of the global minimum?
This is just a hill-climbing algorithm.

therefore: several tries with different sets of
starting centroids

Complexities

time: O(n*k*l), where l is the number of iterations

space: O(n) disk and O(k) main storage space


Note: the simple k-means algorithm is very sensitive
to the initial choice of clusters

-> several runs with different initial choices

-> split and merge clusters, e.g. merge the 2 clusters
with the closest centers and
split the cluster with the largest e²

The modified algorithm is the ISODATA algorithm:


ISODATA Algorithm:

1. Choose k cluster centers somehow (heuristics)

2. Assign each pattern to its closest cluster center: O(n*k)

3. Recompute the new cluster centers c_j' and e²

4. if convergence criterion not satisfied
   then goto step 2 else exit

5. Merge and split clusters according to some heuristics

Yields more stable results than k-means in practical cases!
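A sketch of one possible step-5 heuristic (an assumption for illustration, not the original ISODATA rules): merge the two clusters whose centers are closest, then split the cluster with the largest squared error. Assumes the minkowski_dist sketch above.

```python
def merge_and_split(patterns, labels, centers):
    k = len(centers)
    # merge the two clusters whose centers are closest: relabel b as a
    a, b = min(((a, b) for a in range(k) for b in range(a + 1, k)),
               key=lambda ab: minkowski_dist(centers[ab[0]], centers[ab[1]]))
    labels = [a if l == b else l for l in labels]
    # split the cluster with the largest e²: move half its members to label b
    e2 = [sum(minkowski_dist(x, centers[l]) ** 2
              for x, l in zip(patterns, labels) if l == j) for j in range(k)]
    worst = max(range(k), key=lambda j: e2[j])
    members = [i for i, l in enumerate(labels) if l == worst]
    for i in members[:len(members) // 2]:
        labels[i] = b
    return labels  # then recompute centers and e² (step 3)
```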


Minimal Spanning Tree Clustering

1. Compute the minimal spanning tree in O(m*log m),
where m is the number of edges in the graph, i.e. m = n²

2. Break the tree into k clusters by removing the
k-1 most expensive edges from the tree

Ex: see Fig. 15
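A sketch using Kruskal's algorithm on the complete graph (illustrative; stopping after n-k union steps is equivalent to deleting the k-1 most expensive MST edges; the minkowski_dist sketch above is assumed):

```python
def mst_clusters(patterns, k):
    """Kruskal on the complete graph, stopped after n-k merges:
    equivalent to building the MST and deleting its k-1 heaviest edges."""
    n = len(patterns)
    parent = list(range(n))                      # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    edges = sorted((minkowski_dist(patterns[i], patterns[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    merges = 0
    for _, i, j in edges:                        # cheapest edges first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges += 1
            if merges == n - k:                  # k components remain
                break
    return [find(i) for i in range(n)]           # cluster id per pattern
```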


Representation of Clusters

Hard clusters ~ equivalence classes

1. Take one point as representative
2. Set of members
3. Centroid (Fig. 17)
4. Some boundary points
5. Bounding polygon
6. Convex hull (fuzzy)
7. Decision tree or predicate (Fig. 18)


Genetic Algorithms with k-means Clustering

Pattern set H = { X_1, X_2, ..., X_n }

Labeling L = { l_1, l_2, ..., l_n } with l_i ∈ { 1, ..., k }

Generate one or several labelings L_i to start.

solution ~ genome or chromosome

B = b_1 b_2 ... b_n

B is the binary encoding of L with
b_i = fixed-length binary representation of l_i

Note: there are 2^(n*ld(k)) = k^n points in the search space for
solutions: gigantic! Interesting cases have n >> 100. Removal of
symmetries and redundancies does not help much.


Fitness function: inverse of the squared error function.
What is the optimum?

Genetic operations: crossover of two chromosomes

b_1 b_2 ... | b_i ... b_n
c_1 c_2 ... | c_i ... c_n

results in two new solutions; decide by the fitness function:

b_1 b_2 ... | c_i ... c_n
c_1 c_2 ... | b_i ... b_n


Genetic Operations Continued

Mutation: invert 1 bit; this guarantees completeness of the
search procedure. Distance of 2 solutions = number of
differing bits.

Selection: probabilistic choice from a set of solutions, e.g.
seeds that grow into plants and replicate (natural
selection). Probabilistic choice of centroids for clustering.

Exchange of genes

Replication of genes at another place

Integrity constraints for survival (e.g. chromosomal disorders) and
fitness functions for quality. Not all genetic
modifications found in nature are used in genetic algorithms!


Example for Crossover

S1 = 01000
S2 = 11111

S1 = 01|000
S2 = 11|111

crossover yields

S3 = 01111
S4 = 11000

For global search see Fig. 21
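A sketch of the two genetic operations on bit-string chromosomes (illustrative; crossover point and mutation bit are chosen at random when not given):

```python
import random

def crossover(s1, s2, point=None):
    """Single-point crossover: swap the tails of two chromosomes."""
    point = random.randrange(1, len(s1)) if point is None else point
    return s1[:point] + s2[point:], s2[:point] + s1[point:]

def mutate(s):
    """Invert one random bit (keeps the search procedure complete)."""
    i = random.randrange(len(s))
    return s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]

print(crossover("01000", "11111", point=2))  # ('01111', '11000') = S3, S4
```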



K-clustering and Voronoi Diagrams

[Figure: Voronoi diagram partitioning the plane into cells around the centroids C1, ..., C6]


Difficulties

how to choose k?

how to choose the starting centroids?

after clustering: recomputation of the centroids

the centroids move and the Voronoi partitioning
changes, i.e. reassignment of patterns to
clusters

etc.


Example 1: Relay Stations for Mobile Phones

Optimal placement of relay stations ~ optimal k-clustering!

Complications:

points correspond to phones

positions are not fixed

the number of patterns is not fixed

how to choose k?

the distance function is complicated: 3D geographic model
with mountains and buildings, shadowing, ...


Example 2: Placement of Warehouses for Goods

points correspond to customer locations

centroids correspond to the locations of warehouses

the distance function is the delivery time from the warehouse
multiplied by the number of trips, i.e. related to the
volume of delivered goods

multilevel clustering, e.g. for the post office, train companies,
airlines (which airports to choose as hubs), etc.



Ch. 15.6: Clustering of Large Datasets

Experiments reported in the literature: clustering of
60 patterns into 5 clusters and comparison of
various algorithms, using the encoding of genetic algorithms:

length of one chromosome: 60 * ceiling(ld(5)) = 180 bits

each chromosome ~ 1 solution, i.e.

2^(60*ceiling(ld(5))) = 2^180 ~ 1000^18 ~ 10^54

points in the search space for the optimal solution

Isolated experiments with 200 points to cluster


Examples

clustering the pixels of a 500 × 500 image = 250,000 points

documents in Elektra: > 1 million documents

see Table 1

The above problems are prohibitive for most algorithms;
the only candidate so far is k-means


Properties of the k-means Algorithm

Time: O(n*k*l)

Space: O(k + n) to represent H, L, and
dist(X_i, centroid(l_i))

the solution is independent of the order in which the centroids
are chosen and the order in which the points are processed

high potential for parallelism


Parallel Version 1

Use p processors Proc_i; each Proc_i knows the centroids
C1, C2, ..., Ck

The points are partitioned (round robin or hashing) into p groups
G1, G2, ..., Gp

Processor Proc_i processes group G_i

parallel k-means has time complexity (1/p) * O(n*k*l)
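A sketch of the data-parallel assignment step with Python's multiprocessing (illustrative; only the O(n*k) assignment is distributed, each of the p workers sees all centroids; run under an `if __name__ == "__main__":` guard):

```python
from multiprocessing import Pool

def closest_center(args):
    x, centers = args                      # every worker knows all centroids
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))

def parallel_assign(patterns, centers, p=4):
    """Partition the points over p workers (Pool chunks them for us)."""
    with Pool(p) as pool:
        return pool.map(closest_center, [(x, centers) for x in patterns])
```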


Parallel Version 2

Use a GA to generate p different initial clusterings

C1_i, C2_i, ..., Ck_i   for i = 1, 2, ..., p

Proc_i computes the solution for the seed C1_i, C2_i, ..., Ck_i
and the fitness function for its own solution, which determines
the winning clustering


Large Experiments (main memory)

Classification of < 40,000 points; basic ideas:

use random sampling to find good initial centroids
for the clustering

keep summary information in balanced tree structures


Algorithms for Large Datasets (on disk)


Divide and conquer: cluster subsets of data
separately, then combine clusters


Store points on disk, use compact cluster
representations in main memory, read
points from disk, assign to cluster, write
back with label


Parallel implementations


Ch. 15.7: Nearest Neighbor Clustering

Idea: every point belongs to the same cluster as its
nearest neighbor

Advantage: instead of computing n² distances in O(n²),
compute the nearest neighbors in O(n*log(n))

Note: this works in 2-dimensional space with the
sweep line paradigm and Voronoi diagrams;
the complexity in multidimensional space is unknown


Nearest Neighbor Clustering Algorithm

1. Initialize: every point is its own cluster

2. Compute the nearest neighbor Y of each X, represented as
(X, Y, dist)
   O(n*log(n)) for 2 dimensions
   expected O(n*log(n)) for d dimensions???
   I/O complexity O(n)

3. Sort the triples (X, Y, dist) by dist: O(n*log(n))

4. Assign each point to the cluster of its nearest neighbor (i.e. merge
each cluster with its nearest cluster and compute: new centroid,
diameter, cardinality of the cluster; count the number of clusters)

5. if not done then goto step 2
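A sketch of one iteration with brute-force nearest neighbors (illustrative; O(n²) instead of the O(n*log(n)) sweep line approach; cluster_of[i] is the cluster label of point i, initially i itself):

```python
def nn_round(points, cluster_of):
    """Every point joins the cluster of its nearest neighbor (one iteration)."""
    n = len(points)
    nn = [min((j for j in range(n) if j != i),                 # step 2, brute force
              key=lambda j: minkowski_dist(points[i], points[j]))
          for i in range(n)]
    for i in range(n):                                         # step 4: merge clusters
        a, b = cluster_of[i], cluster_of[nn[i]]
        if a != b:
            cluster_of = [a if c == b else c for c in cluster_of]
    return cluster_of  # per the analysis below, at least halves the cluster count
```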


Termination Criteria

distance between clusters

size of clusters

diameter of a cluster

squared error function

Analysis: with every iteration the number of clusters is
decreased to at most 1/2 of the previous number,
i.e. at most O(log(n)) iterations,

total complexity: O(n*log²(n)) resp. O(n*log(n))

We could compute the complete dendrogram for nearest
neighbor clustering!


Efficient Computation of Nearest Neighbors

2-dimensional: use the sweep line paradigm and
Voronoi diagrams

d-dimensional: so far just an idea, try as a DA (diploma thesis)?
Use a generalization of the sweep line to a sweep zone in
combination with the UB-tree and caching, similar to the
Tetris algorithm