

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

By Tian Zhang and Raghu Ramakrishnan

Presented by Vladimir Jelić 3218/10
e-mail: jelicvladimir5@gmail.com

What is Data Clustering?


A cluster is a closely-packed group.


A collection of data objects that are similar to
one another and treated collectively as a
group.


Data Clustering is the partitioning of a
dataset into clusters


Data Clustering



Helps understand the natural grouping or structure in a dataset

Given a large set of multidimensional data:

The data space is usually not uniformly occupied

Clustering identifies the sparse and the crowded regions

Helps visualization



Some Clustering Applications


Biology


building groups of genes with
related patterns


Marketing

partitioning the population of consumers into market segments


Division of WWW pages into genres.


Image segmentation

for object recognition


Land use


Identification of areas of similar
land use from satellite images



Clustering Problems


Today many datasets are too large to fit into
main memory


The dominating cost of any clustering algorithm is I/O, because disk seek times are orders of magnitude higher than RAM access times



Previous Work


Two classes of clustering algorithms:


Probability-Based


Examples: COBWEB and CLASSIT



Distance-Based


Examples: KMEANS, KMEDOIDS, and CLARANS



Previous Work: COBWEB


Probabilistic approach to make decisions


Clusters are represented with probabilistic
description


Probabilistic representation of clusters is expensive

Every instance (data point) translates into a terminal node in the hierarchy, so large hierarchies tend to overfit the data



Previous Work: KMeans


Distance-based approach, so there must be a distance measure between any two instances


Sensitive to instance order


Instances must be stored in memory


All instances must be initially available


May have exponential run time




Previous Work: CLARANS


Also a distance-based approach, so there must be a distance measure between any two instances

Computational complexity of CLARANS is about O(n²)


Sensitive to instance order


Ignores the fact that not all data points in the dataset are equally important



Contributions of BIRCH


Each clustering decision is made without scanning
all data points


BIRCH exploits the observation that the data space
is usually not uniformly occupied, and hence not
every data point is equally important for clustering
purposes


BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency)



Background Knowledge (1)


Given a cluster of $N$ instances $\{\vec{x}_i\}$, $i = 1, \ldots, N$, we define:



Centroid:
$$\vec{x}_0 = \frac{\sum_{i=1}^{N} \vec{x}_i}{N}$$

Radius (average distance from member points to the centroid):
$$R = \left( \frac{\sum_{i=1}^{N} (\vec{x}_i - \vec{x}_0)^2}{N} \right)^{1/2}$$

Diameter (average pairwise distance within the cluster):
$$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{x}_i - \vec{x}_j)^2}{N(N-1)} \right)^{1/2}$$
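As a quick illustration (not part of the original slides), a minimal NumPy sketch computing these three statistics directly from the points:

```python
import numpy as np

def cluster_stats(points):
    """Centroid, radius, and diameter of a cluster of d-dimensional points."""
    X = np.asarray(points, dtype=float)
    N = len(X)
    centroid = X.sum(axis=0) / N
    # Radius: root-mean-square distance from the points to the centroid
    radius = np.sqrt(((X - centroid) ** 2).sum() / N)
    # Diameter: average pairwise squared distance, then square root;
    # the i == j terms are zero, so summing over all pairs is safe
    diffs = X[:, None, :] - X[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum() / (N * (N - 1)))
    return centroid, radius, diameter

print(cluster_stats([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
```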

Background Knowledge (2)


Given two disjoint clusters, one with $N_1$ points $\{\vec{x}_i\}$, $i = 1, \ldots, N_1$, and one with $N_2$ points $\{\vec{x}_j\}$, $j = N_1+1, \ldots, N_1+N_2$, with centroids $\vec{x}_{01}$ and $\vec{x}_{02}$:

Centroid Euclidean distance:
$$D0 = \left( (\vec{x}_{01} - \vec{x}_{02})^2 \right)^{1/2}$$

Centroid Manhattan distance:
$$D1 = |\vec{x}_{01} - \vec{x}_{02}| = \sum_{i=1}^{d} |x_{01}^{(i)} - x_{02}^{(i)}|$$

Average inter-cluster distance:
$$D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{N_1 N_2} \right)^{1/2}$$

Average intra-cluster distance (of the merged cluster):
$$D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{x}_i - \vec{x}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$$

Variance increase distance:
$$D4 = \sum_{k=1}^{N_1+N_2} \left( \vec{x}_k - \frac{\sum_{l=1}^{N_1+N_2} \vec{x}_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{x}_i - \frac{\sum_{l=1}^{N_1} \vec{x}_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{x}_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{x}_l}{N_2} \right)^2$$
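To make the first three concrete, a small NumPy sketch (illustrative, not from the slides) computing D0, D1, and D2 between two point sets:

```python
import numpy as np

def d0_d1_d2(A, B):
    """Centroid Euclidean (D0), centroid Manhattan (D1), and
    average inter-cluster (D2) distances between clusters A and B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    c1, c2 = A.mean(axis=0), B.mean(axis=0)
    d0 = np.sqrt(((c1 - c2) ** 2).sum())   # Euclidean distance of centroids
    d1 = np.abs(c1 - c2).sum()             # Manhattan distance of centroids
    diffs = A[:, None, :] - B[None, :, :]  # all cross-cluster point pairs
    d2 = np.sqrt((diffs ** 2).sum() / (len(A) * len(B)))
    return d0, d1, d2

print(d0_d1_d2([(3, 4), (2, 6)], [(4, 7), (3, 8)]))
```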

Clustering Features (CF)


The BIRCH algorithm builds a dendrogram, called a clustering feature tree (CF tree), while scanning the data set.

Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined in the following.



Clustering Feature (CF)


Given N d-dimensional data points in a cluster {Xi}, where i = 1, 2, …, N:

CF = (N, LS, SS)

N is the number of data points in the cluster,

LS is the linear sum of the N data points,

SS is the square sum of the N data points.
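A minimal sketch of a CF entry in Python (names and the per-dimension SS convention follow this deck's example rather than any particular implementation); note that the centroid and radius can be derived from the CF entry alone, without the raw points:

```python
import numpy as np
from typing import NamedTuple

class CF(NamedTuple):
    """Clustering Feature: the (N, LS, SS) summary of a cluster."""
    n: int            # N: number of points in the cluster
    ls: np.ndarray    # LS: per-dimension linear sum of the points
    ss: np.ndarray    # SS: per-dimension square sum of the points

def cf_from_points(points):
    X = np.asarray(points, dtype=float)
    return CF(len(X), X.sum(axis=0), (X ** 2).sum(axis=0))

def centroid(cf):
    return cf.ls / cf.n

def radius(cf):
    # R^2 = SS/N - (LS/N)^2, so R needs only the CF entry itself;
    # clamp at zero to guard against tiny negative floating-point error
    return np.sqrt(max(cf.ss.sum() / cf.n - (centroid(cf) ** 2).sum(), 0.0))
```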



CF Additivity Theorem (1)


If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:

CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
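Continuing the CF sketch above, the theorem makes merging a one-liner, which is what lets BIRCH maintain CF entries incrementally:

```python
def cf_merge(cf1, cf2):
    # CF additivity: the merged cluster's CF is the component-wise sum
    return CF(cf1.n + cf2.n, cf1.ls + cf2.ls, cf1.ss + cf2.ss)
```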



CF Additivity Theorem (2)




Example: the five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) form a cluster with

CF = (5, (16,30), (54,190)),

since N = 5, LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30), and SS = (3²+2²+4²+4²+3², 4²+6²+5²+7²+8²) = (54, 190).

Properties of CF-Tree

Each non-leaf node has at most B entries

Each leaf node has at most L CF entries, each of which satisfies threshold T

Node size is determined by the dimensionality of the data space and the input parameter P (page size)


CF Tree Insertion


Identify the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to the chosen distance metric

Modify the leaf: test whether the leaf can absorb the new entry without violating the threshold; if there is no room, split the node (see the sketch below)

Modify the path: update CF information up the path
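A minimal sketch of the leaf-level "absorb or add" decision, reusing the CF helpers defined earlier (no tree structure, node splits, or path updates; the names and the choice of D0 as the metric are illustrative):

```python
import numpy as np

L = 4  # illustrative leaf capacity

def insert_point(entries, point, T):
    """Try to absorb `point` into the closest CF entry of a leaf;
    the entry may absorb it only if its radius stays within threshold T."""
    x = np.asarray(point, dtype=float)
    if entries:
        # closest entry by centroid (D0) distance
        i = min(range(len(entries)),
                key=lambda k: np.linalg.norm(centroid(entries[k]) - x))
        merged = cf_merge(entries[i], cf_from_points([x]))
        if radius(merged) <= T:       # threshold test passes: absorb
            entries[i] = merged
            return
    if len(entries) < L:              # room left: start a new CF entry
        entries.append(cf_from_points([x]))
    else:
        raise NotImplementedError("a full implementation splits the leaf here")
```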


Example of the BIRCH Algorithm

[Figure: a CF tree with a root node over leaf nodes LN1, LN2, LN3 holding subclusters sc1–sc7; a new subcluster sc8 arrives and is routed to a leaf.]

Merge Operation in BIRCH

[Figure: if the branching factor of a leaf node cannot exceed 3, LN1 is split into LN1' and LN1''.]

Merge Operation in BIRCH

[Figure: if the branching factor of a non-leaf node cannot exceed 3, the root is split into non-leaf nodes NLN1 and NLN2, and the height of the CF tree increases by one.]

Merge Operation in BIRCH

[Figure: assume the subclusters are numbered according to their order of formation; leaf nodes LN1 and LN2 hold subclusters sc1–sc6.]

Merge Operation in BIRCH

[Figure: if the branching factor of a leaf node cannot exceed 3, LN2 is split into LN2' and LN2''.]

Merge Operation in BIRCH

[Figure: LN2' and LN1 will be merged, and the newly formed node will be split immediately into LN3' and LN3''.]

Birch Clustering Algorithm (1)


Phase 1: Scan all data and build an initial in-memory CF tree.

Phase 2: Condense the tree to a desirable size by building a smaller CF tree.

Phase 3: Global clustering.

Phase 4: Cluster refining; this is optional, and requires more passes over the data to refine the results (see the sketch below).
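These four phases are what library implementations run end to end; as a hedged illustration (assuming scikit-learn is installed), its Birch estimator exposes the CF-tree parameters directly:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))   # toy 2-D dataset

# threshold plays the role of T, branching_factor of B/L; n_clusters
# drives the global-clustering step applied to the leaf entries (Phase 3)
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)
print(birch.subcluster_centers_.shape, np.unique(labels))
```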


Birch Clustering Algorithm (2)

[Figure: overview diagram of the BIRCH phases.]

Birch - Phase 1

Start with an initial threshold and insert points into the tree

If the algorithm runs out of memory, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, then the remaining values

A good initial threshold is important but hard to figure out

Outlier removal: when rebuilding the tree, remove outliers


Birch - Phase 2

Optional

The algorithm used in Phase 3 performs well within a certain input size range, so Phase 2 condenses the tree to prepare for Phase 3

BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones



Birch - Phase 3

Problems after Phase 1:

Input order affects results

Splitting is triggered by node size

Phase 3:

cluster all leaf entries by their CF values according to an existing algorithm

Algorithm used here: agglomerative hierarchical clustering


Birch - Phase 4

Optional

Do additional passes over the dataset and reassign data points to the closest centroid from Phase 3

Recalculate the centroids and redistribute the items (see the sketch below)

Always converges (no matter how many times Phase 4 is repeated)
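Phase 4 is essentially an iterated k-means-style refinement seeded with the Phase 3 centroids; a minimal sketch (function and parameter names are illustrative):

```python
import numpy as np

def refine(X, centroids, passes=2):
    """Phase-4-style refinement: reassign each point in the (N, d) array X
    to its nearest centroid, then recompute centroids; repeat `passes` times."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(centroids, dtype=float)
    for _ in range(passes):
        # distance from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(C)):            # recompute each centroid
            if (labels == k).any():
                C[k] = X[labels == k].mean(axis=0)
    return labels, C
```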


Conclusions (1)


BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets

Scans the whole dataset only once

Handles outliers better

Superior to other algorithms in stability and scalability


Conclusions (2)


Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.


References


T. Zhang, R. Ramakrishnan, and M. Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD '96

Jan Oberst: Efficient Data Clustering and How to Groom Fast-Growing Trees

Tan, Steinbach, Kumar: Introduction to Data Mining

