# BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies


By Tian Zhang and Raghu Ramakrishnan

What is Data Clustering?

A cluster is a closely-packed group.

A collection of data objects that are similar to one another and treated collectively as a group.

Data clustering is the partitioning of a dataset into clusters.


Data Clustering

Helps us understand the natural grouping or structure in a dataset

Given a large set of multidimensional data:

The data space is usually not uniformly occupied

Clustering identifies the sparse and the crowded regions

Helps visualization


Some Clustering Applications

Biology: building groups of genes with related patterns

Marketing: partitioning the population of consumers into market segments

Division of WWW pages into genres

Image segmentation for object recognition

Land use: identification of areas of similar land use from satellite images


Clustering Problems

Today many datasets are too large to fit into main memory

The dominant cost of any clustering algorithm is I/O, because disk seek times are orders of magnitude higher than RAM access times


Previous Work

Two classes of clustering algorithms:

Probability-based. Examples: COBWEB and CLASSIT

Distance-based. Examples: KMEANS, KMEDOIDS, and CLARANS


Previous Work: COBWEB

Takes a probabilistic approach to making decisions

Clusters are represented by probabilistic descriptions

Probabilistic representations of clusters are expensive to maintain

Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data


Previous Work: KMeans

A distance-based approach, so there must be a distance measure between any two instances

Sensitive to instance order

Instances must be stored in memory

All instances must be available up front

May have exponential run time


Previous Work: CLARANS

Also a distance-based approach, so there must be a distance measure between any two instances

Computational complexity is O(n²)

Sensitive to instance order

Ignores the fact that not all data points in the dataset are equally important


Contributions of BIRCH

Each clustering decision is made without scanning all data points

BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes

BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)


Background Knowledge (1)

Given a cluster of $N$ instances $\{\vec{x}_i\}$, $i = 1, \dots, N$, we define:


Centroid:

$$\vec{x}_0 = \frac{\sum_{i=1}^{N} \vec{x}_i}{N}$$

Radius (average distance from member points to the centroid):

$$R = \left( \frac{\sum_{i=1}^{N} (\vec{x}_i - \vec{x}_0)^2}{N} \right)^{1/2}$$

Diameter (average pairwise distance within the cluster):

$$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{x}_i - \vec{x}_j)^2}{N(N-1)} \right)^{1/2}$$
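A minimal sketch of these three statistics in Python with NumPy; the five toy points are illustrative, not from the paper:

```python
import numpy as np

# Five 2-d points treated as one cluster (illustrative values).
X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
N = len(X)

# Centroid: component-wise mean of the member points.
x0 = X.sum(axis=0) / N

# Radius: root mean squared distance of the members from the centroid.
R = np.sqrt(((X - x0) ** 2).sum() / N)

# Diameter: root mean squared pairwise distance between members.
pairwise = X[:, None, :] - X[None, :, :]
D = np.sqrt((pairwise ** 2).sum() / (N * (N - 1)))

print(x0, R, D)
```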

Background Knowledge (2)


Given two clusters, one with $N_1$ instances $\{\vec{x}_i\}$, $i = 1, \dots, N_1$, and one with $N_2$ instances $\{\vec{x}_j\}$, $j = N_1+1, \dots, N_1+N_2$, with centroids $\vec{x}_{01}$ and $\vec{x}_{02}$:

Centroid Euclidean distance:

$$D0 = \left( (\vec{x}_{01} - \vec{x}_{02})^2 \right)^{1/2}$$

Centroid Manhattan distance:

$$D1 = |\vec{x}_{01} - \vec{x}_{02}| = \sum_{k=1}^{d} \left| x_{01}^{(k)} - x_{02}^{(k)} \right|$$

Average inter-cluster distance:

$$D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{N_1 N_2} \right)^{1/2}$$

Average intra-cluster distance (the diameter of the merged cluster):

$$D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$$

Variance increase distance:

$$D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{x}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{x}_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{x}_i - \frac{\sum_{l=1}^{N_1} \vec{x}_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{x}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{x}_l}{N_2} \right)^2$$
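Sketches of the first three metrics in NumPy; the two clusters are made up for illustration, and D3 and D4 follow the same pattern:

```python
import numpy as np

# Two small disjoint clusters (illustrative values).
X1 = np.array([[1.0, 2.0], [2.0, 2.0], [1.5, 3.0]])
X2 = np.array([[8.0, 9.0], [9.0, 8.5]])
c1, c2 = X1.mean(axis=0), X2.mean(axis=0)

D0 = np.sqrt(((c1 - c2) ** 2).sum())   # centroid Euclidean distance
D1 = np.abs(c1 - c2).sum()             # centroid Manhattan distance

# D2: root mean squared distance over all cross-cluster pairs.
cross = X1[:, None, :] - X2[None, :, :]
D2 = np.sqrt((cross ** 2).sum() / (len(X1) * len(X2)))

print(D0, D1, D2)
```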

Clustering Features (CF)

The BIRCH algorithm builds a dendrogram, called a clustering feature tree (CF tree), while scanning the data set.

Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined below.


Clustering Feature (CF)

Given N d-dimensional data points in a cluster $\{X_i\}$, where i = 1, 2, …, N:

CF = (N, LS, SS)

N is the number of data points in the cluster,

LS is the linear sum of the N data points,

SS is the square sum of the N data points.


If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:

CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)



Example: for the five 2-d points (3,4), (2,6), (4,5), (4,7), (3,8), the clustering feature is CF = (5, (16,30), (54,190)).
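A short Python sketch that reproduces this CF and checks the additivity theorem from the previous slide; the split of the five points into two sub-clusters is arbitrary:

```python
import numpy as np

def cf(points):
    """Clustering Feature of a point set: (N, linear sum LS, square sum SS)."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(points))  # (5, array([16., 30.]), array([ 54., 190.]))

# Additivity: merging two disjoint sub-clusters just adds their CF entries.
n1, ls1, ss1 = cf(points[:2])
n2, ls2, ss2 = cf(points[2:])
assert n1 + n2 == 5
assert np.array_equal(ls1 + ls2, [16, 30])
assert np.array_equal(ss1 + ss2, [54, 190])
```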

Properties of the CF-Tree

Each non-leaf node has at most B entries

Each leaf node has at most L CF entries, each of which satisfies threshold T

Node size is determined by the dimensionality of the data space and the input parameter P (page size)


CF Tree Insertion

Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric

Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold T; if there is no room, split the node

Modifying the path: update the CF information up the path.
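A simplified leaf-level sketch of the absorb-or-split test in Python; the class and function names are hypothetical, and it relies on the fact that a cluster's radius can be computed from (N, LS, SS) alone:

```python
import numpy as np

class CFEntry:
    """One leaf entry: a sub-cluster summarized by (N, LS, SS)."""
    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.n, self.ls, self.ss = 1, x.copy(), x * x

    def centroid(self):
        return self.ls / self.n

    def radius_if_absorbed(self, x):
        # Radius after absorbing x, from the CF alone:
        # R^2 = SS/N - (LS/N)^2, summed over dimensions.
        n, ls, ss = self.n + 1, self.ls + x, self.ss + x * x
        return np.sqrt(max((ss / n - (ls / n) ** 2).sum(), 0.0))

    def absorb(self, x):
        self.n, self.ls, self.ss = self.n + 1, self.ls + x, self.ss + x * x

def insert_point(leaf, x, T, L):
    """Absorb x into the closest entry of `leaf` if threshold T allows it;
    otherwise start a new entry. Returns True if the leaf must be split."""
    x = np.asarray(x, dtype=float)
    if leaf:
        closest = min(leaf, key=lambda e: np.linalg.norm(e.centroid() - x))
        if closest.radius_if_absorbed(x) <= T:
            closest.absorb(x)
            return False
    leaf.append(CFEntry(x))
    return len(leaf) > L   # too many entries: the caller splits this leaf
```

In the full algorithm the CF vectors on the path from the root to this leaf are then updated, and a split propagates upward, possibly growing the tree by one level.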


Example of the BIRCH Algorithm

(Figure: a CF tree whose root points to leaf nodes LN1, LN2, and LN3 holding subclusters sc1–sc7; a new subcluster sc8 arrives and is inserted into the tree.)


Merge Operation in BIRCH

Since the branching factor of a leaf node cannot exceed 3 in this example, inserting sc8 causes LN1 to split into LN1’ and LN1”.

(Figure: the root now points to LN1’, LN1”, LN2, and LN3.)


Merge Operation in BIRCH

Since the branching factor of a non-leaf node also cannot exceed 3, the root is split into NLN1 and NLN2, and the height of the CF tree increases by one.

(Figure: a new root points to non-leaf nodes NLN1 and NLN2, which in turn point to LN1’, LN1”, LN2, and LN3.)


Merge Operation in BIRCH

root

LN1

LN2

sc1

sc2

sc3

sc4

sc5

sc6

LN1

Assume that the subclusters are numbered according to the order of
formation

sc3

sc4

sc5

sc6

sc2

sc1

LN2


Merge Operation in BIRCH

Since the branching factor of a leaf node cannot exceed 3, LN2 is split into LN2’ and LN2”.

(Figure: the root now points to LN1, LN2’, and LN2”.)


Merge Operation in BIRCH

LN2’ and LN1 will be merged, and the newly formed node will be split immediately into LN3’ and LN3”.

(Figure: the root now points to LN2”, LN3’, and LN3”.)

Birch Clustering Algorithm (1)

Phase 1: scan all data and build an initial in-memory CF tree

Phase 2: condense the tree to a desirable size by building a smaller CF tree

Phase 3: global clustering

Phase 4: cluster refinement (optional; requires more passes over the data to refine the results)
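For a sense of how these phases surface in practice, scikit-learn's Birch implementation exposes the threshold, the branching factor, and the global clustering step as parameters; a minimal usage sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs (synthetic data for illustration).
X = np.vstack([rng.normal(c, 0.3, size=(100, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# threshold plays the role of T; n_clusters drives the global phase.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), np.bincount(labels))
```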



Birch: Phase 1

If the algorithm runs out of memory, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, then the remaining data

A good initial threshold is important but hard to figure out

Outlier removal: when rebuilding the tree, remove outliers


Birch: Phase 2

Optional

The global algorithm used in Phase 3 performs well only within a certain input size range, so Phase 2 condenses the tree to prepare it for Phase 3.

BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.


Birch: Phase 3

Problems after Phase 1:

Input order affects results

Splitting is triggered by node size

Phase 3: cluster all leaf nodes on their CF values using an existing algorithm

Algorithm used here: agglomerative hierarchical clustering
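In scikit-learn this corresponds to passing a hierarchical clusterer as the global step, which is then applied to the leaf subcluster centroids rather than to the raw points; a sketch on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# The global phase runs agglomerative clustering over the CF-tree leaves.
model = Birch(threshold=0.5,
              n_clusters=AgglomerativeClustering(n_clusters=3, linkage="average"))
labels = model.fit_predict(X)
print(np.bincount(labels))
```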


Birch: Phase 4

Optional

Do additional passes over the dataset and reassign data points to the closest centroid from Phase 3

Recalculate the centroids and redistribute the items

Always converges (no matter how many times Phase 4 is repeated)
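A minimal sketch of one such refinement pass in NumPy; `refine` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def refine(X, centroids, passes=1):
    """Phase-4-style refinement: reassign each point to its nearest centroid,
    then recompute the centroids from the new assignment."""
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(passes):
        # Distance of every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(labels == k):   # keep empty clusters' centroids as-is
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```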


Conclusions (1)

BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets

Scans the whole dataset only once

Handles outliers better

Superior to other algorithms in stability and scalability


Conclusions (2)

Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.


References

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD '96.

Jan Oberst: Efficient Data Clustering and How to Groom Fast-Growing Trees.

Tan, Steinbach, Kumar: Introduction to Data Mining.