BIRCH: An Efficient Data Clustering Method for Very Large Databases


Nov 8, 2013


Tian Zhang, Raghu Ramakrishnan, Miron Livny

University of Wisconsin-Madison

Presented by Zhirong Tao

Outline of the Paper

Background

Clustering Feature and CF Tree

The BIRCH Clustering Algorithm

Performance Studies

Background

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

Background (Contd)

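Given N d-dimensional data points in a cluster, $\{X_i\}$ where $i = 1, 2, \ldots, N$, the centroid $X_0$, radius $R$ and diameter $D$ of the cluster are defined as:

$$X_0 = \frac{\sum_{i=1}^{N} X_i}{N}$$

$$R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$$

$$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$$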

Background (Contd)

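Given the centroids of two clusters, $X_{01}$ and $X_{02}$:

The centroid Euclidean distance D0:

$$D0 = \left( (X_{01} - X_{02})^2 \right)^{1/2}$$

The centroid Manhattan distance D1:

$$D1 = |X_{01} - X_{02}| = \sum_{i=1}^{d} \left| X_{01}^{(i)} - X_{02}^{(i)} \right|$$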
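
As a rough illustration, the centroid, radius, diameter and the two centroid distances named above can be computed directly from raw points. This is a minimal sketch using only the standard library; the function names and the two sample clusters are made up for the example and do not come from the slides.

```python
import math

def centroid(points):
    """X0: component-wise mean of the points."""
    n = len(points)
    d = len(points[0])
    return [sum(p[k] for p in points) / n for k in range(d)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def radius(points):
    """R: root of the average squared distance from the points to the centroid."""
    x0 = centroid(points)
    return math.sqrt(sum(sq_dist(p, x0) for p in points) / len(points))

def diameter(points):
    """D: root of the average squared pairwise distance (needs N >= 2)."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

def d0(c1, c2):
    """Centroid Euclidean distance."""
    return math.sqrt(sq_dist(c1, c2))

def d1(c1, c2):
    """Centroid Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(c1, c2))

# Hypothetical sample data: a unit-offset square and a two-point cluster.
cluster_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cluster_b = [(4.0, 5.0), (6.0, 5.0)]

print(centroid(cluster_a), radius(cluster_a), diameter(cluster_a))
print(d0(centroid(cluster_a), centroid(cluster_b)))
print(d1(centroid(cluster_a), centroid(cluster_b)))
```

Note that computing the diameter this way takes time quadratic in N, which is one reason a compact summary of each cluster is attractive for very large datasets.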

BIRCH: Hierarchical Method

A distance-based approach:

Assume there is a distance measurement between any two instances.

Represent clusters by some kind of "center" measure.

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.

Clustering Feature Definition

Given N d-dimensional data points in a cluster, $\{X_i\}$ where $i = 1, 2, \ldots, N$,

CF = (N, LS, SS)

N is the number of data points in the cluster,

LS is the linear sum of the N data points, $\sum_{i=1}^{N} X_i$,

SS is the square sum of the N data points, $\sum_{i=1}^{N} X_i^2$.

Assume that CF1 = ($N_1$, $LS_1$, $SS_1$) and CF2 = ($N_2$, $LS_2$, $SS_2$) are the CF entries of two disjoint subclusters.

The CF entry of the subcluster formed by merging the two disjoint subclusters is:

CF1 + CF2 = ($N_1 + N_2$, $LS_1 + LS_2$, $SS_1 + SS_2$)

The CF entries can be stored and calculated
incrementally and consistently as subclusters are
merged or new data points are inserted.
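
The additivity property can be sketched in a few lines of Python. The `CF` class below is an illustrative assumption, not code from the paper: it stores LS as a tuple and SS as a scalar sum of squared norms, and it derives the centroid and radius from the summary alone using the identity R² = SS/N − |LS/N|².

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class CF:
    """Clustering Feature (N, LS, SS); names are illustrative."""
    n: int        # number of points
    ls: tuple     # linear sum, one component per dimension
    ss: float     # sum of squared norms of the points

    @staticmethod
    def from_point(p):
        return CF(1, tuple(p), sum(x * x for x in p))

    def __add__(self, other):
        # Merging two disjoint subclusters adds the summaries component-wise.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # R^2 = SS/N - |LS/N|^2, clamped at 0 to absorb rounding error.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

# Hypothetical points: build one subcluster incrementally, then merge another.
cf_a = CF.from_point((1.0, 2.0)) + CF.from_point((3.0, 4.0))
cf_b = CF.from_point((5.0, 0.0))
merged = cf_a + cf_b

print(merged)             # summary of all three points
print(merged.centroid())  # centroid recovered without the raw points
print(merged.radius())
```

Because only three quantities are kept per subcluster, the raw points never need to be revisited when subclusters are merged.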

CF-Tree

A CF-tree is a height-balanced tree with two parameters: branching factor (B for nonleaf nodes and L for leaf nodes) and threshold T.

The entry in each nonleaf node has the form [$CF_i$, $child_i$].

The entry in each leaf node is a CF; each leaf node has two pointers: `prev' and `next'.

Threshold value T: the diameter (alternatively,
the radius) of each leaf entry has to be less
than T.
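
One possible in-memory layout for these nodes is sketched below. The class and helper names are assumptions made for illustration; CF entries are plain `(N, LS, SS)` tuples, and the diameter is derived from the summary via D² = (2·N·SS − 2·|LS|²) / (N(N−1)), which follows from expanding the pairwise squared distances.

```python
import math

class LeafNode:
    def __init__(self):
        self.entries = []   # CF tuples (N, LS, SS); at most L of them
        self.prev = None    # previous leaf in the chain
        self.next = None    # next leaf in the chain

class NonLeafNode:
    def __init__(self):
        self.entries = []   # (CF_i, child_i) pairs; at most B of them

def cf_add(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def summarize(leaf):
    """CF of everything stored in a leaf, as kept in the parent's entry."""
    total = leaf.entries[0]
    for cf in leaf.entries[1:]:
        total = cf_add(total, cf)
    return total

def diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    sq = (2 * n * ss - 2 * sum(x * x for x in ls)) / (n * (n - 1))
    return math.sqrt(max(sq, 0.0))

def satisfies_threshold(cf, T):
    """A leaf entry is valid only while its diameter stays below T."""
    return diameter(cf) < T

def scan_leaves(first):
    """Follow the `next' pointers to collect every leaf entry."""
    out, node = [], first
    while node is not None:
        out.extend(node.entries)
        node = node.next
    return out

# Hypothetical two-leaf tree under a single root.
a, b = LeafNode(), LeafNode()
a.entries = [(2, (2.0, 0.0), 4.0)]                          # points (0,0) and (2,0)
b.entries = [(1, (9.0, 9.0), 162.0), (1, (9.0, 8.0), 145.0)]
a.next, b.prev = b, a
root = NonLeafNode()
root.entries = [(summarize(a), a), (summarize(b), b)]

print(len(scan_leaves(a)), diameter(a.entries[0]))
```

The leaf chain makes it possible to visit all leaf entries without walking the nonleaf levels.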

BIRCH Algorithm Overview

Phase 1

Insertion Algorithm

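Identifying the appropriate leaf: starting from the root, recursively descend the CF-tree by choosing the closest child node.

Modifying the leaf: find the closest leaf entry, say Li, for the new entry `Ent'. One of three cases applies:

Li can `absorb' `Ent' without violating the threshold T.

Otherwise, if the leaf has space, add a new entry for `Ent' to the leaf.

Otherwise, split the leaf node.

Modifying the path to the leaf: update the CF entries on the path from the leaf to the root. After a leaf split, one of two cases applies:

The parent has space for the new entry.

Otherwise, split the parent, and so on up to the root.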
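
The three leaf-level cases can be sketched for a single leaf as follows. This is a simplified illustration under stated assumptions, not the full tree insertion: closeness is measured with the centroid Euclidean distance D0, the split seeds are the farthest pair of entries, and the function names, `T` and `L` values are chosen for the example. Propagating a split to the parent is left out.

```python
import math

def cf_of(point):
    return (1, tuple(point), sum(x * x for x in point))

def cf_add(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def centroid(cf):
    return tuple(x / cf[0] for x in cf[1])

def diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    sq = (2 * n * ss - 2 * sum(x * x for x in ls)) / (n * (n - 1))
    return math.sqrt(max(sq, 0.0))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def insert_into_leaf(leaf, point, T, L):
    """Insert `point` into `leaf` (a list of CF entries).

    Returns (status, leaves): status is "absorbed", "added" or "split";
    leaves holds the one leaf, or the two leaves produced by a split.
    """
    ent = cf_of(point)
    if leaf:
        # Closest existing entry by centroid distance (assumed metric: D0).
        i = min(range(len(leaf)), key=lambda k: dist(centroid(leaf[k]), point))
        merged = cf_add(leaf[i], ent)
        if diameter(merged) < T:          # case 1: Li absorbs Ent
            leaf[i] = merged
            return "absorbed", [leaf]
    if len(leaf) < L:                     # case 2: room for a new entry
        leaf.append(ent)
        return "added", [leaf]
    # Case 3: split. Farthest pair become seeds; others go to the closer seed.
    entries = leaf + [ent]
    pairs = [(dist(centroid(a), centroid(b)), i, j)
             for i, a in enumerate(entries)
             for j, b in enumerate(entries) if i < j]
    _, s1, s2 = max(pairs)
    c1, c2 = centroid(entries[s1]), centroid(entries[s2])
    left, right = [], []
    for e in entries:
        c = centroid(e)
        (left if dist(c, c1) <= dist(c, c2) else right).append(e)
    return "split", [left, right]

# Hypothetical run with threshold T = 1.5 and leaf capacity L = 2.
leaf, statuses = [], []
for p in [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (20.0, 0.0)]:
    status, leaves = insert_into_leaf(leaf, p, T=1.5, L=2)
    statuses.append(status)
print(statuses)
print(leaves)
```

In a full CF-tree the caller would then replace the old leaf with the two new ones in the parent, which is where the "parent has space" versus "split the parent" cases come in.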

Phase 3: Global Clustering

Use an existing global or semi-global algorithm to cluster all the leaf entries across the boundaries of different nodes.

This way we can overcome Anomaly 1: depending upon the order of data input and the degree of skew, it is possible that two subclusters that should not be in one cluster are kept in the same node.
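
To make the idea concrete, the sketch below applies a simple agglomerative merge directly to leaf entries, ignoring which node each entry came from. It is a stand-in for whichever global algorithm is chosen, not the specific procedure of the paper: it repeatedly merges the two subclusters with the smallest centroid Euclidean distance D0 until `k` clusters remain. The leaf entries and `k` are invented for the example.

```python
import math

def cf_add(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def centroid(cf):
    return tuple(x / cf[0] for x in cf[1])

def d0(a, b):
    """Centroid Euclidean distance between two CF entries."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b))))

def global_cluster(leaf_entries, k):
    """Merge CF entries pairwise (closest first) until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = list(leaf_entries)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = d0(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = cf_add(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical leaf entries (N, LS, SS) held in two different leaf nodes.
# Each node mixes one subcluster near the origin with one near (10, 10).
node_a = [(3, (0.0, 3.0), 5.0), (2, (20.0, 20.0), 401.0)]
node_b = [(2, (1.0, 1.0), 1.5), (4, (42.0, 40.0), 850.0)]

result = global_cluster(node_a + node_b, k=2)
for cf in result:
    print(cf[0], centroid(cf))
```

In this toy input, entries that sit in the same node end up in different final clusters, and entries from different nodes are joined, which is the effect Phase 3 is meant to achieve.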

Comparison of BIRCH and CLARANS

With a synthetically generated dataset:

Summary

Compared with previous distance-based approaches (e.g., K-Means and CLARANS), BIRCH is appropriate for very large datasets.

BIRCH can work with any given amount
of memory, and the I/O complexity is a
little more than one scan of data.