CURE: Efficient Clustering Algorithm for Large Databases


CHAN Siu Lung, Daniel

CHAN Wai Kin, Ken

CHOW Chin Hung, Victor

KOON Ping Yin, Bob


Content

I. Different problems in traditional clustering methods
II. Basic idea of CURE clustering
III. Improved CURE
IV. Summary
V. References

I. Different problems in traditional clustering methods


Partitional Clustering

This category of clustering methods tries to partition the data set into k clusters based on some criterion function. The most common criterion is the square-error criterion:

E = Σ_{i=1..k} Σ_{p ∈ C_i} ||p − m_i||²,  where m_i is the centroid (mean) of cluster C_i.

This method favors clusters whose data points are as compact and as well separated as possible.
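The square-error criterion can be computed directly from a labeled data set; here is a minimal Python sketch (the function name `square_error` is ours, not from any library):

```python
def square_error(points, labels, k):
    # E = sum over clusters i of sum over p in C_i of ||p - m_i||^2,
    # where m_i is the centroid (mean) of cluster C_i.
    total = 0.0
    for i in range(k):
        cluster = [p for p, lab in zip(points, labels) if lab == i]
        if not cluster:
            continue
        dim = len(cluster[0])
        centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in cluster for d in range(dim))
    return total

# Two compact, well-separated clusters give a small error:
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
print(square_error(pts, labels, k=2))  # 1.0
```

Minimizing E is exactly what drives the splitting behavior described next: the criterion can prefer breaking a large cluster into pieces.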





You may find errors in cases where the square error is reduced by splitting a large cluster in order to favor some other group.

Figure: Splitting occurs in a large cluster under a partitional method


Hierarchical Clustering


This category of clustering methods merges sequences of disjoint clusters into the target k clusters, based on the minimum distance between two clusters.


The distance between clusters can be measured as:

d_mean: the distance between the cluster means
d_ave: the average distance between points of the two clusters
d_min: the distance between the two nearest points, one from each cluster
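The three inter-cluster distances above can be sketched in a few lines of Python (clusters are plain lists of coordinate tuples; `math.dist` computes Euclidean distance):

```python
from itertools import product
from math import dist

def d_mean(a, b):
    # Distance between the two cluster means.
    ma = [sum(xs) / len(a) for xs in zip(*a)]
    mb = [sum(xs) / len(b) for xs in zip(*b)]
    return dist(ma, mb)

def d_ave(a, b):
    # Average pairwise distance between points of the two clusters.
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def d_min(a, b):
    # Distance between the two nearest points, one from each cluster.
    return min(dist(p, q) for p, q in product(a, b))

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
print(d_mean(a, b))  # 4.0
print(d_min(a, b))   # 4.0
```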




This method favors hyper-spherical shapes and uniform data.

Let's take some elongated data as an example:

Result of d_mean: (figure)

Result of d_min: (figure)

Problems summary

1. Traditional clustering mainly favors spherical shapes.
2. Data within a cluster must be compact.
3. Clusters must be separated far enough from each other.
4. Cluster sizes must be uniform.
5. Outliers greatly disturb the clustering result.

II. Basic idea of CURE clustering

General CURE clustering procedure

1. It is similar to the hierarchical clustering approach, but it uses a set of sample points as the cluster representative rather than every point in the cluster.

2. First set a target sample number c. Then try to select c well-scattered sample points from the cluster.

3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1.


4. These points are used as the representatives of the clusters and serve as the points in the d_min cluster-merging approach.

5. After each merge, c sample points are selected from the original representatives of the previous clusters to represent the new cluster.

6. Cluster merging stops when the target k clusters are found.

Figure: the nearest pair of clusters is repeatedly merged

Pseudo function of CURE
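The pseudocode figure did not survive extraction, so here is a hedged, in-memory Python sketch of steps 1-6. It is a naive version for clarity only: the real algorithm uses a k-d tree and a heap to find the closest pair, and step 5 re-selects representatives from the merged clusters' previous representatives, whereas this sketch recomputes them from all points of the merged cluster. All function names here are ours.

```python
from itertools import combinations
from math import dist

def scatter_points(points, c):
    # Pick up to c well-scattered points: start with the point farthest
    # from the centroid, then repeatedly add the point farthest from the
    # points already chosen (farthest-first traversal).
    centroid = [sum(xs) / len(points) for xs in zip(*points)]
    chosen = [max(points, key=lambda p: dist(p, centroid))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max(points, key=lambda p: min(dist(p, q) for q in chosen)))
    return chosen

def representatives(points, c, alpha):
    # Shrink the scattered points toward the centroid by a fraction alpha.
    centroid = [sum(xs) / len(points) for xs in zip(*points)]
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centroid))
            for p in scatter_points(points, c)]

def cure(points, k, c=4, alpha=0.5):
    # Start with every point as a singleton cluster, then repeatedly merge
    # the pair of clusters whose representatives are closest (d_min).
    clusters = [[p] for p in points]
    reps = [representatives(cl, c, alpha) for cl in clusters]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: min(dist(p, q)
                                      for p in reps[ij[0]]
                                      for q in reps[ij[1]]))
        clusters[i] += clusters[j]      # merge cluster j into cluster i
        del clusters[j], reps[j]
        reps[i] = representatives(clusters[i], c, alpha)
    return clusters

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(cure(pts, k=2, c=2, alpha=0.5))
```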


CURE efficiency

The worst-case time complexity is O(n² log n).

The space complexity is O(n), due to the use of a k-d tree and a heap.

III. Improved CURE

Random sampling


In case of dealing with large database, we can’t store every data point
to the memory.


Handle of data merge in large database require very long time.


We use random sampling to both reduce the time complexity and
memory usage.


Assume if we need to detect a cluster
u
present, we need to at least
capture
f
fraction of data from this cluster
f|u|


The the required sampling data s

to capture can be present as follow:




You can refer to proof from the reference (i). Here we just want to
show that we can determine a sample size
s
min

such that the probability
of get enough sample from every cluster u is
1
-





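The sample-size bound above (stated in reference (i)) is easy to evaluate numerically; a small sketch, with our own function name:

```python
from math import log, sqrt

def sample_size(n, u, f, delta):
    # s_min = f*n + (n/|u|) * ( log(1/delta)
    #         + sqrt( log(1/delta)^2 + 2*f*|u|*log(1/delta) ) )
    t = log(1.0 / delta)
    return f * n + (n / u) * (t + sqrt(t * t + 2.0 * f * u * t))

# e.g. 100,000 points, smallest cluster of 1,000 points, and we want to
# capture at least 1% of that cluster with probability 99.9%:
print(sample_size(100_000, 1000, 0.01, 0.001))  # about 3054 points
```

Note how the required sample grows as the smallest cluster |u| shrinks relative to n, which is the practical cost of detecting small clusters by sampling.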

Partitioning and two-pass clustering

In addition, we use a two-pass approach to reduce the computation time.

First, we divide the n data points into p partitions, each containing n/p data points.

We then pre-cluster each partition until the number of clusters in it reaches n/(pq), for some q > 1.

Then each cluster from the first-pass result is used as input to the second-pass clustering to form the final clusters.

The time complexity for one partition is:

O((n/p)² log(n/p))

Therefore, the first-pass complexity over the p partitions is:

O((n²/p) log(n/p))

And the second pass, which clusters the remaining n/q pre-clusters, costs:

O((n/q)² log(n/q))

Overall, the time complexity becomes:

O((n²/p) log(n/p) + (n²/q²) log(n/q))
The overall improvement relative to the single-pass O(n² log n) cost is therefore, ignoring the log factors, roughly a factor of 1 / (1/p + 1/q²).

Also, to maintain the quality of the clustering, we must make sure that n/(pq) is at least 2 to 3 times k.
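As a rough back-of-the-envelope check of these complexity formulas, we can compare operation counts (not measured running time) for the single-pass and two-pass variants:

```python
from math import log2

def single_pass_ops(n):
    # O(n^2 log n) for plain hierarchical clustering of all n points.
    return n ** 2 * log2(n)

def first_pass_ops(n, p):
    # p partitions, each clustered in O((n/p)^2 log(n/p)).
    return p * (n / p) ** 2 * log2(n / p)

def second_pass_ops(n, q):
    # The second pass clusters the n/q pre-clusters from the first pass.
    return (n / q) ** 2 * log2(n / q)

def speedup(n, p, q):
    return single_pass_ops(n) / (first_pass_ops(n, p) + second_pass_ops(n, q))

# Ignoring the log factors, the speedup is roughly 1 / (1/p + 1/q^2):
print(speedup(100_000, p=10, q=5))
```

For n = 100,000, p = 10, q = 5 the crude estimate 1/(1/p + 1/q²) predicts about 7x, and the operation-count ratio lands in the same ballpark.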

Outlier elimination

We can introduce outlier elimination by two methods:

1. Random sampling: with random sampling, most outlier points are filtered out.

2. Elimination during merging: since outliers do not form compact groups, their clusters grow very slowly during the merge stage. We therefore run an elimination procedure during merging that removes clusters with only 1 to 2 data points from the cluster list.

To prevent these outliers from merging into proper clusters, we must trigger the procedure at the proper stage. In general, we trigger it when the number of clusters has dropped to about 1/3 of its initial value.


Data labeling

Due to the use of random sampling, we need to label every remaining data point back to the proper cluster group.

Each data point is assigned to the cluster that has a representative point nearest to the data point.
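The labeling rule is a nearest-representative lookup; a minimal sketch (our own function name):

```python
from math import dist

def label_point(point, rep_sets):
    # Assign the point to the cluster whose nearest representative is
    # closest to it; rep_sets[i] holds cluster i's representative points.
    return min(range(len(rep_sets)),
               key=lambda i: min(dist(point, r) for r in rep_sets[i]))

reps = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
print(label_point((2.0, 0.0), reps))  # 0  (nearest rep is (1, 0))
print(label_point((8.0, 0.0), reps))  # 1  (nearest rep is (10, 0))
```

Using multiple scattered representatives per cluster, rather than a single centroid, is what lets this step assign points correctly near the boundary of non-spherical clusters.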

Final overview of the CURE flow

Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk

Sample results with different parameters

Different shrinking factors α (figure)

Different numbers of representatives c (figure)

Relation of execution time to the partition number p and the number of sample points s (figure)

IV. Summary

CURE can effectively detect the proper shape of a cluster, with the help of scattered representative points and centroid shrinking.

CURE reduces computation time and memory load with random sampling and two-pass clustering.

CURE can effectively remove outliers.

The quality and effectiveness of CURE can be tuned by varying s, p, c, and α to adapt to different input data sets.

V. References

i. [GRS97] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.

ii. [ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record, 25(2):103-114, June 1996.

iii. Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.