Birch: Balanced Iterative Reducing and Clustering using Hierarchies
By Tian Zhang, Raghu Ramakrishnan
Presented by Vladimir Jelić 3218/10
e-mail: jelicvladimir5@gmail.com
What is Data Clustering?
A cluster is a closely packed group: a collection of data objects that are similar to one another and treated collectively as a group.
Data Clustering is the partitioning of a dataset into clusters.
Vladimir Jelić (jelicvladimir5@gmail.com)
Data Clustering
Helps understand the natural grouping or structure in a dataset
Given a large set of multidimensional data:
– Data space is usually not uniformly occupied
– Identify the sparse and crowded places
– Helps visualization
Some Clustering Applications
Biology – building groups of genes with related patterns
Marketing – partitioning the population of consumers into market segments
Division of WWW pages into genres
Image segmentation – for object recognition
Land use – identification of areas of similar land use from satellite images
Clustering Problems
Today many datasets are too large to fit into main memory.
The dominating cost of any clustering algorithm is I/O, because disk seek times are orders of magnitude higher than RAM access times.
Previous Work
Two classes of clustering algorithms:
Probability-based – examples: COBWEB and CLASSIT
Distance-based – examples: KMEANS, KMEDOIDS, and CLARANS
Previous Work: COBWEB
Probabilistic approach to making decisions
Clusters are represented with a probabilistic description
Probability representations of clusters are expensive
Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data
Previous Work: KMeans
Distance-based approach, so there must be a distance measure between any two instances
Sensitive to instance order
Instances must be stored in memory
All instances must be initially available
May have exponential run time
Previous Work: CLARANS
Also a distance-based approach, so there must be a distance measure between any two instances
Computational complexity of CLARANS is about O(n²)
Sensitive to instance order
Ignores the fact that not all data points in the dataset are equally important
Contributions of BIRCH
Each clustering decision is made without scanning all data points
BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes
BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)
Background Knowledge (1)
Given a cluster of instances $\{x_i\}$, $i = 1, 2, \dots, N$, we define:

Centroid: $x_0 = \frac{\sum_{i=1}^{N} x_i}{N}$

Radius: $R = \left(\frac{\sum_{i=1}^{N} (x_i - x_0)^2}{N}\right)^{1/2}$

Diameter: $D = \left(\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (x_i - x_j)^2}{N(N-1)}\right)^{1/2}$

The radius is the average distance from member points to the centroid; the diameter is the average pairwise distance within the cluster.
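The three definitions above can be sketched directly in code. This is a minimal illustration, assuming 1-D points for readability; for d-dimensional data the same formulas apply coordinate-wise under the squared Euclidean distance.

```python
import math

def centroid(xs):
    # x0 = (sum of points) / N
    return sum(xs) / len(xs)

def radius(xs):
    # R: average distance from member points to the centroid.
    x0 = centroid(xs)
    return math.sqrt(sum((x - x0) ** 2 for x in xs) / len(xs))

def diameter(xs):
    # D: average pairwise distance within the cluster (N(N-1) ordered pairs).
    n = len(xs)
    return math.sqrt(sum((xi - xj) ** 2 for xi in xs for xj in xs) / (n * (n - 1)))

xs = [1.0, 2.0, 3.0]
print(centroid(xs))  # 2.0
print(radius(xs))    # sqrt(2/3)
print(diameter(xs))  # sqrt(2)
```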
Background Knowledge (2)
Given the centroids $x_{01}$ and $x_{02}$ of two clusters, one with $N_1$ points $\{x_i\}$, $i = 1, \dots, N_1$, and one with $N_2$ points $\{x_j\}$, $j = N_1+1, \dots, N_1+N_2$:

Centroid Euclidean distance: $D_0 = \left((x_{01} - x_{02})^2\right)^{1/2}$

Centroid Manhattan distance: $D_1 = |x_{01} - x_{02}| = \sum_{i=1}^{d} |x_{01}^{(i)} - x_{02}^{(i)}|$

Average inter-cluster distance: $D_2 = \left(\frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (x_i - x_j)^2}{N_1 N_2}\right)^{1/2}$

Average intra-cluster distance: $D_3 = \left(\frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (x_i - x_j)^2}{(N_1+N_2)(N_1+N_2-1)}\right)^{1/2}$

Variance increase: $D_4 = \sum_{k=1}^{N_1+N_2} \left(x_k - \frac{\sum_{l=1}^{N_1+N_2} x_l}{N_1+N_2}\right)^2 - \sum_{i=1}^{N_1} \left(x_i - \frac{\sum_{l=1}^{N_1} x_l}{N_1}\right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left(x_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} x_l}{N_2}\right)^2$
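The first three of these metrics are easy to sketch. A minimal illustration, assuming clusters are given as lists of d-dimensional tuples:

```python
import math

def centroid(c):
    # Coordinate-wise mean of the cluster's points.
    n, d = len(c), len(c[0])
    return tuple(sum(p[i] for p in c) / n for i in range(d))

def d0(c1, c2):
    # Centroid Euclidean distance.
    a, b = centroid(c1), centroid(c2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def d1(c1, c2):
    # Centroid Manhattan distance.
    a, b = centroid(c1), centroid(c2)
    return sum(abs(x - y) for x, y in zip(a, b))

def d2(c1, c2):
    # Average inter-cluster distance over all N1*N2 cross pairs.
    s = sum(sum((x - y) ** 2 for x, y in zip(p, q)) for p in c1 for q in c2)
    return math.sqrt(s / (len(c1) * len(c2)))

c1 = [(0.0, 0.0), (2.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
print(d0(c1, c2))  # 4.0
print(d1(c1, c2))  # 4.0
print(d2(c1, c2))  # sqrt(18)
```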
Clustering Features (CF)
The Birch algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined in the following.
Clustering Feature (CF)
Given N d-dimensional data points in a cluster {X_i}, where i = 1, 2, …, N:
CF = (N, LS, SS)
N is the number of data points in the cluster,
LS is the linear sum of the N data points,
SS is the square sum of the N data points.
CF Additivity Theorem (1)
If CF1 = (N1, LS1, SS1), and
CF2 = (N2 ,LS2, SS2) are the CF entries of two
disjoint sub

clusters.
The CF entry of the sub

cluster formed by
merging the two disjoin sub

clusters is:
CF1 + CF2 = (N1 + N2 , LS1 + LS2, SS1 +
SS2)
CF Additivity Theorem (2)
Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
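The example and the additivity theorem can be checked with a short script. A minimal sketch, assuming 2-D points as tuples and interpreting LS and SS as per-dimension sums:

```python
def cf_from_points(points):
    # Build the CF triple (N, LS, SS) from raw points.
    n = len(points)
    d = len(points[0])
    ls = tuple(sum(p[i] for p in points) for i in range(d))
    ss = tuple(sum(p[i] ** 2 for p in points) for i in range(d))
    return (n, ls, ss)

def cf_merge(cf1, cf2):
    # CF Additivity Theorem: merging disjoint sub-clusters is
    # component-wise addition of their CF entries.
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf_from_points(points))  # (5, (16, 30), (54, 190))
# Splitting the points arbitrarily and merging the two CFs gives the same result.
print(cf_merge(cf_from_points(points[:2]), cf_from_points(points[2:])))
```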
Properties of CF-Tree
Each non-leaf node has at most B entries
Each leaf node has at most L CF entries, each of which satisfies threshold T
Node size is determined by the dimensionality of the data space and the input parameter P (page size)
CF Tree Insertion
Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
Modifying the leaf: test whether the leaf can absorb the new entry without violating the threshold; if there is no room, split the node
Modifying the path: update CF information up the path
Example of the BIRCH Algorithm
[Figure: a CF tree with a root pointing to leaf nodes LN1, LN2, LN3 holding subclusters sc1–sc7; a new subcluster sc8 is inserted into LN1]
Merge Operation in BIRCH
[Figure: after inserting sc8, leaf node LN1 splits into LN1’ and LN1”]
If the branching factor of a leaf node cannot exceed 3, then LN1 is split.
Merge Operation in BIRCH
[Figure: the root splits into non-leaf nodes NLN1 and NLN2]
If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.
Merge Operation in BIRCH
[Figure: a CF tree with leaf nodes LN1 and LN2 holding subclusters sc1–sc6]
Assume that the subclusters are numbered according to the order of formation.
Merge Operation in BIRCH
[Figure: LN2 splits into LN2’ and LN2”]
If the branching factor of a leaf node cannot exceed 3, then LN2 is split.
Merge Operation in BIRCH
[Figure: LN2’ and LN1 are merged into a new node, which immediately splits into LN3’ and LN3”]
LN2’ and LN1 will be merged, and the newly formed node will be split immediately.
Birch Clustering Algorithm (1)
Phase 1: Scan all data and build an initial in-memory CF tree.
Phase 2: Condense into a desirable size by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4: Cluster refining – optional, and requires more passes over the data to refine the results.
Birch Clustering Algorithm (2)
[Figure: overview diagram of the BIRCH phases]
Birch – Phase 1
Start with an initial threshold and insert points into the tree.
If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting entries from the old tree, then continue with the remaining data.
A good initial threshold is important but hard to figure out.
Outlier removal – outliers can be removed when the tree is rebuilt.
Birch – Phase 2
Optional.
The global clustering in Phase 3 has an input size range in which it performs well, so Phase 2 condenses the tree to prepare for Phase 3.
BIRCH applies a (selected) clustering algorithm to the leaf entries of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.
Birch – Phase 3
Problems after Phase 1:
– Input order affects results
– Splitting is triggered by node size
Phase 3:
– Cluster all leaf entries on their CF values according to an existing algorithm
– Algorithm used here: agglomerative hierarchical clustering
Birch – Phase 4
Optional.
Do additional passes over the dataset and reassign data points to the closest centroid from Phase 3.
Recalculate the centroids and redistribute the items.
Always converges (no matter how many times Phase 4 is repeated).
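The Phase 4 step above amounts to a k-means-style refinement pass. A minimal sketch, assuming 1-D points and hypothetical seed centroids standing in for the Phase 3 output:

```python
def refine(points, centroids, passes=2):
    # Repeatedly reassign each point to its closest centroid,
    # then recompute each centroid as the mean of its group.
    for _ in range(passes):
        groups = {i: [] for i in range(len(centroids))}
        for x in points:
            i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            groups[i].append(x)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in groups.items()]
    return centroids

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
print(refine(points, [0.0, 10.0]))  # [1.5, 8.5]
```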
Conclusions (1)
Birch performs faster than existing algorithms (CLARANS and KMEANS) on large datasets
Scans the whole dataset only once
Handles outliers better
Superior to other algorithms in stability and scalability
Conclusions (2)
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster.
Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.
References
T. Zhang, R. Ramakrishnan, and M. Livny: BIRCH: An efficient data clustering method for very large databases. SIGMOD '96.
Jan Oberst: Efficient Data Clustering and How to Groom Fast-Growing Trees.
Tan, Steinbach, Kumar: Introduction to Data Mining.