8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
1
Ch. 15: Data Clustering
•
Organisms with similar genomes
A
†
䈠B
†
䌠†縠†C
evolutionary chains
A,B < A,C
Ch.15.1: Motivation
Human genome ~ 1GB
probability for F is extremely
small
Assumption:
C stems from
closest neighbor
A
B
D
C
E
F
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
2
More Examples
•
Stocks with similar behavior in market
•
Buying basket discovery
•
Documents on similar topics
nearest neighbors?
similarity or distance measures?
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
3
Supervised vs. Unsupervised Classification
Supervised:
set of classes (clusters) is given, assign new
pattern (point) to proper cluster, label it with label of its
cluster
Examples:
classify bacteria to select proper antibiotics,
assign signature to book and place in proper shelf
Unsupervised:
for given set of patterns, discover a set of
clusters (training set) and assign addtional patterns to
proper cluster
Examples:
buying behavior, stock groups, closed groups
of researchers citing each other, more?
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
4
Components of a Clustering Task
•
Pattern representation
(feature extraction), e.g.
key words, image features, genes in DNS
sequences
•
Pattern proximity:
similarity, distance
•
Clustering
(grouping) algorithm
•
Data abstraction:
representation of a cluster,
label for class, prototype, properties of a class
•
Assessment
of quality of output
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
5
Examples
•
Clustering of documents or research groups by
citation index, evolution of papers. Problem: find
minimal set of papers describing essential ideas
•
Clustering of items from excavations according to
cultural epoques
•
Clustering of tooth fragments in anthropology
•
Carl von Linne: Systema Naturae, 1735, botanics,
later zoology
•
etc ...
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
6
Ch. 15.2: Formal Definitions
Pattern:
(feature vector, measurements, observations, data
points)
X
= ( x
1
, ... , x
d
)
d dimensions,
measurements, often d not fixed, e.g. for key words
Attribute, Feature:
x
i
Dimensionality:
d
Pattern Set:
H
= { X
1
, X
2
, ... , X
n
}
H
often represented as n
搠灡瑴敲e m慴a楸
Class:
set of similar patterns, pattern generating process in
nature, e.g. growth of plants
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
7
More Definitions
Hard Clustering:
classes are disjoint, every pattern gets a
unique label from
L =
{ l
1
, l
2
, ... , l
n
}
with l
i
{ 1, ... , k }
see Fig. 1
Fuzzy Clustering:
pattern
X
i
gets a fractional degree of
membership
f
ij
for each output cluster
j
Distance Measure:
metric or proximity function or
similarity function in feature space to quantify similarity of
patterns
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
8
Ch. 15.3: Pattern Representation and
Feature Selection
Human creativity:
Select few, but most relevant features !!!
Cartesian or polar Coordinates?
See Fig. 3
Document retrieval:
key words (which) or citations? What
is a good similarity function? Use of thesaurus?
Zoology:
Skeleton, lungs instead of body shape or living
habits: dolphins, penguins, ostrich!!
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
9
Types of Features
•
Quantitative
: continuous, discrete,
intervals, fuzzy
•
Qualitative
: enumeration types (colors),
ordinals (military ranks), general features
like (hot,cold), (quiet, loud) (humid, dry)
•
Structure Features:
oo hierarchies
like
vehicle
car
Benz
S400
vehicle
boat
submarine
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
10
Ch. 15.4: Similarity Measures
Similar ~ small distance
Similarity function:
not necessarily a metric, triangle
inequality
dist (A,B) + dist (B,C)
摩獴 ⡁ⱃ⤠
may be missing, quasi metric.
Euclidean Distance
dist
2
(
X
i
,
X
j
) = (
k=1
d
x
i,k

x
j,k
²)
1/2
= 
X
i

X
j

2
Special case of
Minkowski metric
dist
p
(
X
i
,
X
j
) = (
k=1
d
x
i,k

x
j,k

p
)
1/p
= 
X
i

X
j

p
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
11
Proximity Matrix:
for n patterns with symmetric similarity:
n * (n

1)/2
similarity values
Representation problem:
mixture of continuous and discontinuous attributes, e.g.
dist
(
(white,

17), (green, 25)
)
use wavelength as value for colors and then Euclidean
distance???
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
12
Other Similarity Functions
Context Similarity:
s(X
i
, X
k
) = f(X
i
, X
k,
E
)
for environment
E
e.g.
2 cars on a country road
2 climbers on 2 different towers of 3 Zinnen mountain
neighborhood distance
of
X
k
w.r. to
X
i
= nearest neighbor number =
NN(X
i
, X
k
)
mutual neighborhood distance
MND(X
i
, X
k
) = NN(X
i
, X
k
) + NN(X
k
, X
i
)
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
13
Lemma:
MND(X
i
, X
k
) = MND(X
k
, X
i
)
MND(X
i
, X
i
) = 0
Note:
MND is not a metric, triangle inequality is missing!
see Fig. 4:
NN(A,B) = NN(B,A) = 1
MND(A,B) = 2
NN(B,C) = 2; NN(C,B) = 1
MND(B,C) = 3
see Fig. 5:
NN(A,B) = 4; NN(B,A) = 1
MND(A,B) = 5
NN(B,C) = 2; NN(C,B) = 1
MND(B,C) = 3
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
14
(Semantic) Concept Similarity
s (X
i
, X
k
) = f (X
i
, X
k
,
C, E
)
for concept
C,
environment
E
Examples
for
C: ellipse, rectangle, house, car, tree ...
See Fig. 6
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
15
Structural Similarity of Patterns
(5 cyl, Diesel, 4000 ccm)
(6 cyl, gasoline, 2800 ccm)
dist ( ... )
???
dist (car, boat) ???
how to cluster car engines?
Dynamic, user defined clusters via query boxes?
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
16
Ch. 15.5: Clustering Methods
•
agglomerative
merge clusters
versus
divisive
split clusters
•
monothetic
all features at once
versus
polythetic
one feature at a time
•
hard clusters
pattern in
single class
versus
fuzzy clusters
pattern in several classes
•
incremental
add pattern at a time
versus
non

increm.
add patterns at once
(important for large data sets!!)
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
17
Ch. 15.5.1: Hierarchical Clustering
by Agglomeration
Single Link:
C1
C2
dist (C1, C2) =
min { dist (X
1
, X
2
) : X
1
C1, X
2
C2 }
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
18
Complete Link:
C1
C2
dist (C1, C2) =
max { dist (X
1
, X
2
) : X
1
C1, X
2
C2 }
Merge clusters with smallest distance in both cases!
Examples:
Figures 9 to 13
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
19
Hierarchical algglomerative clustering algorithm
for single link and complete link clustering
1. Compute proximity matrix between pairs of patterns,
initialize each pattern as a cluster
2. Find closest pair of clusters, i.e. sort n
2
distances with
O(n
2
*log n)
, merge clusters, update proximity matrix
(how, complexity?)
3.
if
all patterns are in one cluster
then stop
else goto
step 2
Note:
proximity matrix requires
O(n
2
) space
and for
distance computation at least
O(n
2
) time
for n patterns (even
without clustering process), not feasible for large datasets!
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
20
Ch. 15.5.2 Partitioning Algorithms
Remember: agglomerative algorithms
compute a sequence
of partitions from finest (1 pattern per cluster) to coarsest (all
patterns in a single cluster). The number of desired clusters is
chosen at the end by cutting the dendogram at a certain level.
Partitioning algorithms
fix the number k of desired clusters
first, choose k starting points (e.g. randomly or by sampling)
and assign the patterns to the closest cluster.
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
21
Def.:
A k

partition has k (disjoint) clusters.
There are n
k
different k

partitions for a set of n patterns.
What is a good k

partition?
Squared Error Criterion:
Assume pattern set
H
is divided
into k clusters labeled by
L
and l
i
{1, ... ,k}
Let c
j
be the centroid of cluster j, then the squared error is:
e
2
(
H,L
) =
j=1
k
i=1
nj

X
i
j

c
j

2
X
i
j
= j
th
pattern of cluster j
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
22
Sqared Error Clustering Algorithm (k

means)
1. Choose k cluster centers somehow
2. Assign each pattern to ist closest cluster center
O(n*k)
3. Recompute new cluster centers c
j
’ and e
2
4.
If
convergence criterion not satisfied
then goto
step 2
else exit
Convergence Criteria:
•
few reassignments
•
little decrease of e
2
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
23
Problems
•
convergence, speed of convergence?
•
local minimum of e
2
instead of global minimum?
This is just a hill climbing algorithm.
•
therefore several tries with different sets of
starting centroids
Complexities
time:
O(n*k*l)
l is number of iterations
space:
O (n)
disk and
O(k)
main storage space
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
24
Note:
simple k

means algorithm is very sensitive
to initial choice of clusters

> several runs with different initial choices

> split and merge clusters, e.g. merge 2 clusters
with closest centers and
split cluster with largest e
2
Modified algorithm is
ISODATA
algorithm:
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
25
ISODATA Algorithm:
1. choose k cluster centers somehow, heuristics
2. assign each pattern to its closest cluster center
O(n*k)
3. recompute new cluster centers c
j
’ and e
2
4.
if
convergence criterion not satisfied
then goto
step 2
else exit
5. merge and split clusters according to some heuristics
Yields more stable results than k

means in practical cases!
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
26
Minimal Spanning Tree clustering
1. Compute minimal spanning tree in
O (m*log m)
where m is the number of edges in graph, i.e.
m = n
2
2. Break tree into k clusters by removing the
k

1 most expensive edges from tree
Ex:
see Fig. 15
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
27
Representation of Clusters
hard clusters ~ equivalence classes
1. Take one point as representative
2. Set of members
3. Centroid (Fig. 17)
4. Some boundary points
5. Bounding polygon
6. Convex hull (fuzzy)
7. Decision tree or predicate (Fig. 18)
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
28
Genetic Algorithms with k

means Clustering
Pattern set
H
= { X
1
, X
2
, ... , X
n
}
labeling
L =
{ l
1
, l
2
, ... , l
n
}
with l
i
{ 1, ... , k }
generate one or several labelings
L
i
to start
solution ~ genome or
chromosome
B
=
b
1
†
b
2
⸮⸠
b
n
B
is binary encoding of
L
with
b
i
= fixed length binary representation of l
i
Note:
there are 2
n ld k
points in the search space for
solutions, gigantic! Interesting cases n >> 100. Removal of
symmetries and redundancies does not help much.
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
29
Fitness
function
:
inverse of squared error function.
What is the optimum?
Genetic operations:
crossover
of two chromosomes
b
1
†
b
2
⸮.

b
i
⸮⸠
b
n
c
1
†
c
2
⸮.

c
i
⸮⸠.
c
n
results in two new solutions, decide by fitness function
b
1
†
b
2
⸮.

c
i
⸮⸠.
c
n
c
1
†
c
2
⸮.

b
i
⸮⸠
b
n
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
30
Genetic operations continued
mutation:
invert 1 bit, this guarantees completeness of
search procedure. Distance of 2 solutions = number of
different bits
selection:
probabilistic choice from a set of solutions, e.g.
seeds that grow into plants and replicate (natural
selection). Probabilistic choice of centroids for clustering.
exchange of genes
replication of genes
at another place
Integrity constraints for survival (mongolism) and
fitness functions for quality. Not all genetic
modifications of nature are used in genetic algorithms!
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
31
Example for Crossover
S1
= 01000
S2
= 11111
S1
= 01

000
S2
= 11

111
crossover yields
S3
= 01111
S4
= 11000
for global search see Fig. 21
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
32
K

clustering and Voronoi Diagrams
C1
C2
C3
C6
C4
C5
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
33
Difficulties
•
how to choose k?
•
how to choose starting centroids?
•
after clustering, recomputation of centroids
•
centroids move and Voronoi partitioning
changes, e.i. reassignment of patterns to
clusters
•
etc.
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
34
Example 1: Relay stations for mobile phones
Optimal placement of relay stations ~
optimal k

clustering!
Complications:
•
points correspond to phones
•
positions are not fixed
•
number of patterns is not fixed
•
how to choose k ?
•
distance function complicated: 3D geographic model
with mountains and buildings, shadowing, ...
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
35
Example 2: Placement of Warehouses for Goods
•
points correspond to customer locations
•
centroids correspond to locations of warehouses
•
distance function is delivery time from warehouse
multiplied by number of trips, i.e. related to
volume of delivered goods
•
multilevel clustering, e.g. for post office, train companies,
airlines (which airports to choose as hubs), etc.
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
36
Ch. 15.6: Clustering of Large Datasets
Experiments reported in Literature
clustering of
60 patterns into 5 clusters
and comparison of
various algorithms: using the encoding of genetic algorithms:
length of one chromosome: 60 * ceiling(ld(5)) = 180 bits
each chromosome ~ 1 solution, i.e.
2
60*ceiling(ld(5))
= 2
180
~ 1000
18
~
10
54
points in the search space for optimal solution
Isolated experiments with 200 points to cluster
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
37
Examples
clustering pixels of 500
500
image
=
250.000
points
documents in Elektra:
> 1 Mio
documents
see Table 1
above problems prohibitive for most algorithms,
only candidate so far is k

means
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
38
Properties of k

means Algorithm
•
Time
O (n*k*l)
•
Space
O (k + n )
to represent
H, L,
dist (
X
i
, centroid(l
i
))
•
solution is independent of order in which centroids
are chosen, order in which points are processed
•
high potential for parallelism
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
39
Parallel Version 1
use p processors Proc
i
, each Proc
i
knows centroids
C1, C2, ...Ck
points are partitioned (round robin or hashing) into p groups
G1, G2, ... Gp
processor
Proc
i
processes group
Gi
parallel k

means has time complexity 1/p O(n*k*l)
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
40
Parallel Version 2
Use GA to generate p different initial clusterings
C1
i
, C2
i
, ... , Ck
i
for i = 1, 2, ... , p
Proc
i
computes solution for the seed
C1
i
, C2
i
, ... , Ck
i
and fitness function for its own solution, which determines
the winning clustering
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
41
Large Experiments (main memory)
classification of < 40.000 points, basic ideas:
•
use random sampling to find good initial centroids
for clustering
•
keep summary information in balanced tree structures
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
42
Algorithms for Large Datasets (on disk)
•
Divide and conquer: cluster subsets of data
separately, then combine clusters
•
Store points on disk, use compact cluster
representations in main memory, read
points from disk, assign to cluster, write
back with label
•
Parallel implementations
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
43
Ch. 15.7: Nearest Neighbor Clustering
Idea:
every point belongs to the same cluster as its
nearest neighbor
Advantage:
instead of computing n
2
distances in
O(n
2
)
compute nearest neighbor in
O (n*log(n))
Note:
this works in 2

dimensional space with
sweep line paradigm and Voronoi diagrams,
unknown complexity in multidimensional space
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
44
Nearest Neighbor Clustering Algorithm
1. Initialize: point = its own cluster
2. Compute nearest neighbor Y of X, represent as
(X, Y, dist)
O(n*log(n))
for 2
dim
expected
O(n*log(n))
for d dim???
I/O complexity
O(n)
3. Sort
(X, Y, dist)
by dist
O (n*log(n))
4. Assign point to cluster of nearest neighbor ( i.g. merge
cluster with nearest cluster and compute: new centroid,
diameter, cardinality of cluster, count number of clusters)
5.
if not
done
then goto
2
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
45
Termination Criteria
•
distance between clusters
•
size of clusters
•
diameter of cluster
•
squared error function
Analysis:
with every iteration the number of clusters is
decreased to at most 1/2 of previous number,
i.e. at most O( log(n)) iterations,
total complexity:
O(n*log
2
(n))
resp.
O(n*log (n))
we could compute complete dendrogram for nearest
neighbor clustering!
8.6.00
Prof. Bayer, DWH

SS2000,
Clustering
46
Efficient Computation of Nearest Neighbor
2

dimensional:
use sweep line paradigm and
Voronoi diagrams
d

dimensional:
so far just an idea, try as DA? Use
generalization of sweep line to
sweep zone
in
combination with UB

tree and caching similar to
Tetris algorithm
Comments 0
Log in to post a comment