CHAN Siu Lung, Daniel
CHAN Wai Kin, Ken
CHOW Chin Hung, Victor
KOON Ping Yin, Bob
CURE: Efficient Clustering Algorithm for Large Databases
Contents
I. Problems with traditional clustering methods
II. Basic idea of CURE clustering
III. Improved CURE
IV. Summary
V. References
I. Problems with traditional clustering methods
Partitional Clustering
• This category of clustering methods tries to reduce the data set into k clusters based on some criterion function.
• The most common criterion is the squared-error criterion.
• These methods favor clusters whose data points are as compact and as well separated as possible.
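As a concrete illustration (not from the original slides), a minimal Python sketch of the squared-error criterion over a given partition might look like this; the inputs points and labels are hypothetical:

```python
import numpy as np

def squared_error(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for j in range(k):
        members = points[labels == j]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total
```

Partitional methods such as k-means search for the assignment of points to k clusters that minimizes this quantity.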
Partitional Clustering
• You may get erroneous results when the squared error is reduced by splitting a large cluster to favor some other group.
Figure: Splitting occurs in a large cluster under a partitional method
Hierarchical Clustering
• This category of clustering methods tries to merge sequences of disjoint clusters into the target k clusters based on the minimum distance between two clusters.
• The distance between clusters can be measured as:
– Distance between means: d_mean(C_i, C_j) = ||m_i - m_j||
– Average distance between points: d_ave(C_i, C_j) = (1 / (|C_i| |C_j|)) * sum over p in C_i, q in C_j of ||p - q||
– Distance between the two nearest points, one from each cluster: d_min(C_i, C_j) = min over p in C_i, q in C_j of ||p - q||
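A small sketch of these three distance measures (illustrative, not from the original slides), assuming each cluster is given as a NumPy array of points:

```python
import numpy as np

def d_mean(a, b):
    """Distance between the cluster means."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

def d_ave(a, b):
    """Average distance over all cross-cluster point pairs."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()

def d_min(a, b):
    """Distance between the two nearest points, one from each cluster."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=2).min()
```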
Hierarchical Clustering
• This method favors hyper-spherical shapes and uniform data.
• Let's take some elongated data as an example.
• Result of d_mean:
Hierarchical Clustering
• Result of d_min:
Problems summary
1. Traditional clustering mainly favors spherical shapes.
2. Data points within a cluster must be compact.
3. Clusters must be separated far enough from each other.
4. Cluster sizes must be uniform.
5. Outliers greatly disturb the clustering result.
II. Basic idea of CURE clustering
General CURE clustering procedure
1. It is similar to the hierarchical clustering approach, but it uses a set of sample points as the cluster representative rather than every point in the cluster.
2. First set a target sample number c. Then we try to select c well-scattered sample points from the cluster.
3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1.
General CURE clustering procedure (continued)
4. These points are used as the representatives of the clusters and serve as the points in the d_min cluster-merging approach.
5. After each merge, c sample points are selected from the original representatives of the previous clusters to represent the new cluster (see the sketch after this list).
6. Cluster merging stops once the target number of clusters k is reached.
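A minimal sketch of steps 2 and 3: selecting c well-scattered points and shrinking them toward the centroid by α. Assumptions not in the original slides: clusters are NumPy arrays, scattering uses a farthest-point heuristic, and the helper name select_representatives is ours.

```python
import numpy as np

def select_representatives(points, c, alpha):
    """Pick c well-scattered points, then shrink them toward the centroid."""
    centroid = points.mean(axis=0)
    # Start from the point farthest from the centroid.
    scattered = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(scattered) < min(c, len(points)):
        # Next representative: the point farthest from all chosen ones.
        dists = np.min(
            [np.linalg.norm(points - s, axis=1) for s in scattered], axis=0
        )
        scattered.append(points[np.argmax(dists)])
    # Shrink each representative toward the centroid by fraction alpha.
    return np.array([s + alpha * (centroid - s) for s in scattered])
```

Shrinking the representatives toward the centroid dampens the influence of points on the cluster boundary, which is what makes CURE less sensitive to outliers than plain d_min merging.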
Figure: at each step, the pair of nearest clusters is merged
Pseudo function of CURE
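The pseudo-function appears as an image in the original slides. Below is a minimal Python sketch of the same merge loop, reusing the hypothetical select_representatives helper above; it omits the k-d tree and heap, so it runs slower than the complexity quoted below, and for simplicity it recomputes representatives from all merged points rather than from the old representatives as step 5 describes.

```python
import numpy as np

def cure(points, k, c, alpha):
    """Naive sketch of the CURE merge loop (no k-d tree / heap speed-ups)."""
    # Start with every point as its own cluster.
    clusters = [points[i:i + 1] for i in range(len(points))]
    reps = [select_representatives(cl, c, alpha) for cl in clusters]

    def rep_dist(i, j):
        # d_min over the two clusters' shrunken representatives.
        diffs = reps[i][:, None, :] - reps[j][None, :, :]
        return np.linalg.norm(diffs, axis=2).min()

    while len(clusters) > k:
        # Find the closest pair of clusters by representative distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ij: rep_dist(*ij),
        )
        merged = np.vstack([clusters[i], clusters[j]])
        for idx in sorted((i, j), reverse=True):
            del clusters[idx], reps[idx]
        clusters.append(merged)
        reps.append(select_representatives(merged, c, alpha))
    return clusters
```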
CURE efficiency
• The worst-case time complexity is O(n² log n).
• The space complexity is O(n), due to the use of a k-d tree and a heap.
III. Improved CURE
Random Sampling
• When dealing with a large database, we cannot store every data point in memory.
• Handling data merging over a large database requires a very long time.
• We use random sampling to reduce both the time complexity and the memory usage.
• Assume that to detect a cluster u, we need to capture at least a fraction f of the data points from this cluster, i.e. at least f·|u| points.
• The required sample size s can then be expressed as follows (n is the total number of points, |u| the size of cluster u):
s_min = f·n + (n/|u|)·log(1/δ) + (n/|u|)·sqrt( log²(1/δ) + 2·f·|u|·log(1/δ) )
• You can refer to the proof in reference (i). Here we just want to show that we can determine a sample size s_min such that the probability of getting enough samples from every cluster u is 1 − δ.
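A direct transcription of this bound into Python (a sketch; the parameter names n, u_size, f, and delta simply mirror the symbols above):

```python
import math

def sample_size(n, u_size, f, delta):
    """Chernoff-bound sample size s_min: drawing at least this many points
    captures >= f*|u| points of cluster u with probability >= 1 - delta."""
    log_term = math.log(1.0 / delta)
    return (f * n
            + (n / u_size) * log_term
            + (n / u_size) * math.sqrt(log_term ** 2
                                       + 2.0 * f * u_size * log_term))
```

For example, sample_size(100_000, 5_000, 0.5, 0.001) gives the minimum sample needed to see half of a 5,000-point cluster out of 100,000 points with probability 0.999.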
Partitioning and two-pass clustering
• In addition, we use a two-pass approach to reduce the computation time.
• First, we divide the n data points into p partitions, each containing n/p data points.
• We then pre-cluster each partition until the number of clusters in it reaches n/(pq), for some q > 1.
• Each cluster from the first pass is then used as input to the second-pass clustering to form the final clusters.
• The time complexity for one partition is O((n/p)² log(n/p)).
• Therefore, the first-pass complexity is O((n²/p) log(n/p)).
• And the second-pass complexity is O((n/q)² log(n/q)).
• Overall, the time complexity becomes O((n²/p) log(n/p) + (n/q)² log(n/q)).
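A sketch of the two-pass driver (illustrative only; it reuses the hypothetical cure() sketch above, and for simplicity the second pass re-clusters the raw points of the partial clusters rather than continuing from their representatives as the real algorithm does):

```python
import numpy as np

def two_pass_cure(points, k, p, q, c, alpha):
    """First pass: pre-cluster each of p partitions down to n/(p*q) clusters.
    Second pass: cluster the combined partial clusters down to the final k."""
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    shuffled = points[rng.permutation(len(points))]
    partial = []
    for part in np.array_split(shuffled, p):
        # Stop this partition's merging at n/(p*q) clusters, i.e. len(part)//q.
        partial.extend(cure(part, max(1, len(part) // q), c, alpha))
    # Second pass over all points of the partial clusters.
    return cure(np.vstack(partial), k, c, alpha)
```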
Partitioning and two-pass clustering
• The overall improvement over the single-pass O(n² log n) is therefore roughly a factor of 1/(1/p + 1/q²), ignoring the logarithmic terms.
• Also, to maintain the quality of clustering, we must make sure n/(pq) is 2 to 3 times k.
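For instance, with the illustrative values p = 5 and q = 3 (not from the original slides), the speed-up is roughly 1/(1/5 + 1/9) = 1/0.311 ≈ 3.2 times, ignoring logarithmic factors.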
Outlier elimination
• We can introduce outlier elimination by two methods:
1. Random sampling: with random sampling, most outlier points are filtered out.
2. Outlier elimination during merging: since outliers do not form compact groups, their clusters grow very slowly during the merge stage. We therefore kick in an elimination procedure during the merging stage that removes clusters with only 1 to 2 data points from the cluster list (a sketch follows below).
• To prevent these outliers from merging into proper clusters, we must trigger the procedure at the proper stage so that the outliers are removed correctly. In general, we trigger this procedure when the number of clusters reduces to 1/3 of the initial number of clusters.
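A one-line sketch of that elimination step (min_size = 3 is an assumed threshold matching the "1 to 2 points" rule above):

```python
def eliminate_outliers(clusters, min_size=3):
    """Drop slow-growing clusters (fewer than min_size points) as outliers."""
    return [cl for cl in clusters if len(cl) >= min_size]
```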
Data labeling
• Because clustering was done on a random sample, we need to label every remaining data point back to the proper cluster group.
• Each data point is assigned to the cluster group that has a representative point nearest to the data point (see the sketch below).
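A sketch of this labeling pass (illustrative; representatives is a hypothetical list holding one array of shrunken representative points per cluster):

```python
import numpy as np

def label_data(points, representatives):
    """Assign each point to the cluster with the nearest representative."""
    labels = np.empty(len(points), dtype=int)
    for i, x in enumerate(points):
        labels[i] = min(
            range(len(representatives)),
            key=lambda j: np.linalg.norm(representatives[j] - x, axis=1).min(),
        )
    return labels
```

Using only the representative points, rather than all cluster members, keeps this pass cheap, which matters because it runs over the full data set on disk.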
Final overview of the CURE flow
Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk
Sample results with different parameters
Figure: results with different shrinking factors α
Figure: results with different numbers of representatives c
Figure: relation of execution time to the number of partitions p and the number of sample points s
IV. Summary
• CURE can effectively detect the proper shape of a cluster with the help of scattered representative points and centroid shrinking.
• CURE can reduce the computation time and memory load with random sampling and two-pass clustering.
• CURE can effectively remove outliers.
• The quality and effectiveness of CURE can be tuned by varying s, p, c (and the shrinking factor α) to adapt to different input data sets.
V. References
i. [GRS97] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.
ii. [ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record, v.25 n.2, p.103-114, June 1996.
iii. Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.