RAIN: Data Clustering using Randomized Interactions between Data Points


Jonatan Gomez
Computer and Systems Eng., School of Engineering, Universidad Nacional de Colombia
jgomezpe@unal.edu.co

Olfa Nasraoui
Computer Eng. & Computer Science Department, Speed School of Engineering, University of Louisville, Louisville, KY 40292
olfa.nasraoui@louisville.edu

Elizabeth Leon
Dept. of Electrical & Computer Science Department, Speed School of Engineering, University of Louisville, Louisville, KY 40292
eli.leon@louisville.edu
Abstract—This paper introduces a generalization of the Gravitational Clustering Algorithm proposed by Gomez et al. in [1]. First, the algorithm is extended so that not only the Gravitational Law but any decreasing function of the distance between points can be used. An estimate of the maximum distance between closest points is calculated in order to reduce the sensitivity of the clustering process to the size of the data set. Finally, a heuristic for setting the interaction strength (gravitational constant) is introduced in order to reduce the number of parameters of the algorithm. Experiments with well-known synthetic data sets are performed to show the applicability of the proposed approach.
I. INTRODUCTION
Clustering is an unsupervised learning technique that takes unlabeled data points (data records) and classifies them into different groups or clusters. This is done in such a way that points assigned to the same cluster have high similarity, while the similarity between points assigned to different clusters is low [2]. Although different clustering techniques have been developed and applied with relative success [2], [3], [4], [5], the data clustering problem remains a challenging task. In particular, two issues have attracted a large amount of research effort: robustness to noise and automatic determination of the number of clusters.
Robust Clustering
Clustering techniques such as k-means [3] and fuzzy k-means [4] rely on the assumption that a data set is free of noise and follows a certain distribution [3], [4], [5]. In fact, if noise is introduced, several techniques based on the least squares estimate are spoiled [5]. Several approaches have tried to tackle this problem, some based on robust statistics [6], [5], and others by modifying elements of well-known clustering algorithms to make them more robust to noise [7], [8], [9].
Number of Clusters
Determining the number of clusters in a data set is a very hard task since clusters can vary in shape, size and density. Because of this difficulty, many clustering techniques require the number of clusters in advance [3], [4]. Some techniques try to determine the number of clusters based on estimates of the data density, considering regions of high data density as candidate clusters [10], [11], [12].
Another group of clustering techniques tries to determine the number of clusters by moving data points toward the cluster centers, either by using kernel density functions or by considering each data point as a particle in a universe exposed to a gravitational field [13], [14], and then defining a cluster as the set of data points that converge to the same region. The main disadvantage of these techniques is that their time complexity is at least quadratic (O(N^2)) with respect to the size of the data set. Moreover, Kundu shows in [14] that gravitational clustering has cubic time complexity (O(N^3)).
In [1] we developed a new gravitational clustering algorithm that reduces the time complexity of the original (to less than quadratic) and is able to deal with noise. Instead of considering all data points when moving a data point, another data point is randomly selected. Then, both data points are moved according to an oversimplification of the Universal Gravitational Law and Newton's second law of motion. Points that are close enough are merged into virtual clusters. Finally, the big crunch effect (one single big cluster at the end) is eliminated by introducing a cooling mechanism similar to the one in simulated annealing. Our experiments showed that the proposed clustering algorithm works well on a variety of data sets (synthetic and real). However, the performance reached by the new gravitational clustering can be highly affected by the selection of the four parameters it requires.
This paper introduces a generalization of our gravitational clustering algorithm by allowing different moving functions and setting the initial gravitational constant automatically. The paper is divided into 5 sections. Section 2 gives an overview of the gravitational clustering algorithm that we proposed in [1]; Section 3 introduces the new approach, called RAIN (Clustering based on RAndomized INteractions of data points); Section 4 presents the experiments performed and the analysis of the results; Section 5 draws some conclusions and outlines future work.
II. RANDOMIZED GRAVITATIONAL CLUSTERING (RGC)
In [1], we developed a robust clustering technique based on the gravitational law and Newton's second law of motion. For an n-dimensional data set with N data points, each data point is considered an object in the n-dimensional space with mass equal to 1. Each point in the data set is moved according to a simplified version of the Gravitational Law using Newton's second law of motion. The basic ideas behind applying the gravitational law are:
1) A data point in some cluster exerts a higher gravitational force on a data point in the same cluster than on a data point that is not in the cluster. Thus, points in a cluster move in the direction of the center of the cluster. In this way, the proposed technique will automatically determine the clusters in the data set.
2) If some point is a noise point, i.e., does not belong to any cluster, then the gravitational force exerted on it by other points is so small that the point is almost immobile. Therefore, noise points will not be assigned to any cluster.
In order to reduce the amount of memory and time spent in moving a data point according to the gravitational field generated by another point (y), we use the following simplified equation:

x(t+1) = x(t) + G * d⃗ / ‖d⃗‖³    (1)

where d⃗ = y⃗ − x⃗ and G is the gravitational constant. We consider the velocity at any time, v(t), to be the zero vector, and Δ(t) = 1. Since the distance between points is reduced at each iteration, all the points will be moved to a single position after a huge (possibly infinite) number of iterations (big crunch). The gravitational clustering algorithm would then define a single cluster.
In order to eliminate this limit effect, the gravitational constant G is reduced at each iteration by a constant proportion (the decay term Δ(G)). Algorithm 1 shows the randomized gravitational clustering algorithm.
The function MOVE (line 6) moves both points x_j and x_k using Equation (1), taking into consideration that neither point can move further than half of the distance between them. In each iteration, RGC creates a set of clusters by using an optimal disjoint-set union-find structure¹ and the distance between objects (after moving the data points according to the gravitational force). When two points are merged, both of them are kept in the system while the associated set structure is modified. In order to determine the new position of each data point, the proposed algorithm simply selects another data point at random and moves both of them according to Equation (1) (the MOVE function). RGC returns the sets stored in the disjoint-set union-find structure.
Because RGC assigns every point in the data set (noisy or normal) to one cluster, it is necessary to extract the valid clusters. We used an extra parameter (α) to determine the minimum number of points (as a percentage of the training data set) that a cluster should include in order to be considered a valid cluster. In this way, we used an additional function GETCLUSTERS that takes the disjoint sets generated by RGC and returns the collection of clusters that have at least the minimum number of points defined; see Algorithm 2.
¹A disjoint-set union-find structure is a structure that supports the following three operations [15]:
- MAKESET(x): create a new set containing the single element x.
- UNION(x, y): replace the two sets containing x and y by their union.
- FIND(x): return the name of the set containing the element x.
In the optimal disjoint-set union-find structure, each set is represented by a tree where the root of the tree is the canonical element of the set, and each child node has a pointer to its parent node (the root node points to itself) [15].
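The union-find structure described in the footnote can be sketched as follows. This is a generic illustration of the standard optimal structure (with path compression and union by rank), not the authors' exact implementation; the class and method names are ours.

```python
class DisjointSet:
    """Disjoint-set union-find with path compression and union by rank [15]."""

    def __init__(self, n):
        # MAKESET for elements 0..n-1: each element starts as its own root.
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # FIND: follow parent pointers to the root (the canonical element),
        # compressing the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        # UNION: attach the shallower tree under the deeper one.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```

With this structure, merging point 0 with 1 and then 1 with 2 leaves all three with the same canonical element, which is exactly how RGC accumulates clusters.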
III. RAIN: CLUSTERING BASED ON RANDOMIZED INTERACTIONS OF DATA POINTS
RAIN extends the RGC algorithm so that different decreasing functions can be used instead of the one based on the Gravitational Law. Basically, three elements should be considered: reducing the effect of the data set size on the dynamics of the system, defining an interaction function, and setting the initial interaction strength automatically.
A. Maximum Distance between Closest Points
In order to reduce the sensitivity of RAIN to the size of the data set, we calculate a rough estimate of the maximum distance between closest points in the data set. This distance gives RAIN a reference value for merging and moving data points.
Given a collection of N data points in the n-dimensional [0,1] Euclidean space, the maximum distance between closest points can be roughly approximated using Equation (2):

d̂ = 2√n / (√3 · N^(1/n))    (2)
The conjecture behind this equation is that, in order to have a maximum distance between closest points, such points should be arranged in a grid defining equilateral triangles (pyramids). The side of such a triangle is 2/√3 times its height, and the maximum number of points per dimension of such a grid should be bounded by N^(1/n). √n is a correction factor for data sets where the number of points is considerably low compared to the number of vertices of the hypercube [0,1]^n, i.e., 2^n.
B. Moving Functions
Although motivated by the Gravitational Law and Newton's second law of motion, RGC can be seen as an algorithm that moves interacting data points according to a decreasing function of the distance between them. In RAIN, the final position of a data point x that is interacting with another data point y is defined by Equation (3):

x(t+1) = x(t) + G · d⃗ · f(‖d⃗‖ / d̂)    (3)

where d⃗ = y⃗ − x⃗, f is a decreasing function, d̂ is the rough estimate of the maximum distance between closest points, and G is the initial strength of the data point interaction. Although many decreasing functions can be used, we only consider f(x) = 1/x³ and f(x) = e^(−x²).
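The two interaction functions and the update of Equation (3) can be sketched as follows. This is a simplified illustration (the names are ours, and the half-distance cap applied by MOVE is omitted here for clarity):

```python
import numpy as np

def f_gravitational(r):
    # f(x) = 1/x^3, the Gravitational-Law-like moving function.
    return 1.0 / r ** 3

def f_gaussian(r):
    # f(x) = exp(-x^2), the alternative decreasing moving function.
    return np.exp(-r ** 2)

def interact(x, y, G, d_hat, f):
    """Eq. (3): move x toward y according to the decreasing function f
    of the distance normalized by the estimate d_hat."""
    d = y - x
    return x + G * d * f(np.linalg.norm(d) / d_hat)
```

Because both functions are decreasing in the normalized distance, nearby points (likely cluster mates) attract each other strongly while distant points (likely noise) barely move, which is the behavior motivating ideas 1) and 2) of Section II.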
Algorithm 1 Randomized Gravitational Clustering
RGC(x, G, Δ(G), M, ε)
1  for i = 1 to N do
2    MAKESET(i)
3  for i = 1 to M do
4    for j = 1 to N do
5      k = random point index such that k ≠ j
6      MOVE(x_j, x_k) (see Eq. (1))  // move both points
7      if dist(x_j, x_k) ≤ ε then UNION(j, k)
8    G = (1 − Δ(G)) · G
9  for i = 1 to N do
10   FIND(i)
11 return disjoint-sets
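A compact sketch of Algorithm 1 in Python, using the gravitational move of Equation (1) with the half-distance cap applied by MOVE. The helper names and the simplified union-find are ours; this illustrates the control flow rather than reproducing the authors' implementation.

```python
import numpy as np

def rgc(x, G, decay, M, eps):
    """Randomized Gravitational Clustering (Algorithm 1), sketched.
    x: (N, n) array of points in [0,1]^n; returns a cluster label per point."""
    N = len(x)
    parent = list(range(N))          # disjoint-set union-find, simplified

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def move(j, k):
        # Eq. (1): both points move along d = x_k - x_j, but neither
        # may travel further than half the distance between them.
        d = x[k] - x[j]
        dist = np.linalg.norm(d)
        if dist < 1e-12:             # already coincident, nothing to do
            return
        step = G * d / dist ** 3
        cap = 0.5 * d
        if np.linalg.norm(step) > np.linalg.norm(cap):
            step = cap
        x[j] += step
        x[k] -= step

    for _ in range(M):
        for j in range(N):
            k = np.random.randint(N - 1)
            if k >= j:
                k += 1               # random partner with k != j
            move(j, k)
            if np.linalg.norm(x[j] - x[k]) <= eps:
                parent[find(j)] = find(k)   # UNION(j, k)
        G *= (1.0 - decay)           # cooling: G = (1 - Δ(G)) * G
    return [find(i) for i in range(N)]
```

Note that each iteration performs only N randomized pairwise interactions, which is the source of the sub-quadratic behavior compared with the classic O(N²)-per-step gravitational clustering.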
Algorithm 2 Cluster Extraction
GETCLUSTERS(clusters, α, N)
1  newClusters = ∅
2  MIN_POINTS = α · N
3  for i = 0 to number of clusters do
4    if size(cluster_i) ≥ MIN_POINTS then
5      newClusters = newClusters ∪ {cluster_i}
6  return newClusters
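Algorithm 2 amounts to a size filter over the disjoint sets; a direct translation (the function name is ours):

```python
def get_clusters(clusters, alpha, N):
    """Keep only the clusters holding at least alpha * N points (Algorithm 2).
    clusters: list of clusters, each a list of point indices."""
    min_points = alpha * N
    return [c for c in clusters if len(c) >= min_points]
```

Clusters smaller than the threshold, typically singletons formed by noise points, are simply discarded rather than being forced into a valid cluster.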
C. Setting the Initial Interaction Strength (G)
Since RAIN creates the clusters while it moves the data points, it is possible to use the number of merged points after some checking iterations to determine whether RAIN is using an appropriate value of G. A point is considered merged if it has been assigned to a cluster of size greater than one. A high number of merged points (more than half) can indicate that the initial strength is so high that it will define a single big cluster. If the number of merged points is low (less than half), it can indicate that the initial strength is so low that no cluster is going to be defined at all. Therefore, a value that merges close to half of the data points indicates an appropriate initial strength. Algorithm 3 shows the process for determining the initial interaction strength.
D. RAIN Time Complexity
Although it looks like RAIN has linear time complexity, our experiments indicate that the number of iterations required is 10√N, where N is the number of data points. Also, our experimental results indicate that for checking each candidate initial strength, the number of iterations required is close to √N. Therefore, based on the experimental evidence, the time complexity of RAIN is O(N√N).
IV. EXPERIMENTATION
In order to evaluate the performance of RAIN, experiments were performed on four synthetic data sets; three of them are included in the CLUTO toolkit. Figure 1 shows the four synthetic data sets.
1) Experimental Settings: RAIN was run with a merging distance of ε = 1e−4, a decay term of Δ(G) = 0.001, and a minimum cluster size of α = 50 points. We ran RAIN with both of the defined interaction (moving) functions. The reported results are obtained after allowing RAIN to determine the initial strength, using 5√N iterations.
2) Analysis of Results: Figure 2 presents the extracted clusters and the configuration of the data points after 100, 300 and 500 iterations of a typical run of RAIN on the chameleon 10000-9 data set using e^(−x²) as interaction function.
Notice that after 100 iterations (√10000), almost all the points belonging to the natural clusters have been merged with another point in their cluster, while not many noisy points have been merged; see Figure 2.a. Moreover, after 500 iterations, noisy points have practically not moved from their original positions; see Figure 2.f. Clearly, RAIN is able to deal with noise in the data set.
As the number of iterations increases, the size of the formed clusters increases too. After 300 iterations, RAIN is able to detect all the natural clusters, but has some of them divided into two or three clusters; see Figure 2.b. When the maximum number of iterations is reached (5√10000), RAIN has detected all the natural clusters. Similar behavior was observed in several different trials.
Figure 3 shows the results obtained by RAIN on the other three data sets.
In general, RAIN using e^(−x²) as interaction function was able to find the natural clusters in the data sets. Sometimes RAIN merges two clusters, and sometimes it generates two clusters instead of one. Only for the Chameleon 8000-8 data set was one cluster missed. It is possible that, since this cluster has low density, RAIN is not able to form it due to the random interaction process.
Algorithm 3 Initial Strength Estimation
GETINITIALSTRENGTH(x, Δ(G), M, ε)
1  G = 1
2  RGC(x, G, Δ(G), M, ε)  // test the given strength
3  K = number of merged points
4  while (N/2 − K) > √N do
5    G = 2 · G
6    RGC(x, G, Δ(G), M, ε)  // test the given strength
7    K = number of merged points
8  a = G/2
9  b = G
10 G = (a + b)/2
11 RGC(x, G, Δ(G), M, ε)  // test the given strength
12 K = number of merged points
13 while |N/2 − K| > √N do
14   if N/2 > K then a = G else b = G
15   G = (a + b)/2
16 return G
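The doubling-then-bisection search of Algorithm 3 can be sketched generically. Here `merged_count(G)` stands in for running RGC with strength G and counting the merged points; it is a placeholder for the authors' procedure, and the sketch assumes it is non-decreasing in G (more strength never merges fewer points):

```python
import math

def initial_strength(merged_count, N):
    """Find G such that roughly N/2 points merge (Algorithm 3).
    merged_count(G): number of merged points after a test run with strength G."""
    G = 1.0
    K = merged_count(G)
    # Phase 1: double G until at least N/2 - sqrt(N) points merge.
    while N / 2 - K > math.sqrt(N):
        G *= 2.0
        K = merged_count(G)
    # Phase 2: bisect between the last two strengths tried.
    a, b = G / 2.0, G
    G = (a + b) / 2.0
    K = merged_count(G)
    while abs(N / 2 - K) > math.sqrt(N):
        if N / 2 > K:
            a = G   # too few merges: raise the lower bound
        else:
            b = G   # too many merges: lower the upper bound
        G = (a + b) / 2.0
        K = merged_count(G)
    return G
```

The √N tolerance keeps the number of test runs small, which matches the observation in Section III-D that checking each candidate strength takes close to √N iterations.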
Fig. 1. Synthetic Data Sets. (a) Chameleon with 10000 points and 9 natural clusters, (b) Chameleon with 8000 points and 6 natural clusters, (c) Chameleon with 8000 points and 8 natural clusters, (d) Rain with 12000 points and 7 natural clusters.
Fig. 2. RAIN evolution for the Chameleon 10000-9 data set using e^(−x²) (best seen in color). (a) Clusters with size 3 or higher after 100 iterations, (b) Clusters with size 100 or higher after 300 iterations, (c) Clusters with size 100 or higher after 500 iterations, (d) Data point configuration after 100 iterations, (e) Data point configuration after 300 iterations, (f) Data point configuration after 500 iterations.
Fig. 3. RAIN clustering for synthetic data sets using e^(−x²) (best seen in color). (a) Chameleon 8000-6, (b) Chameleon 8000-8, (c) Rain 12000-7.
Figure 4 presents the extracted clusters and the configuration of the data points after 100, 300 and 500 iterations of a typical run of RAIN on the chameleon 10000-9 data set using 1/x³ as interaction function.
As expected, this configuration of RAIN has similar behavior to the one previously discussed. However, some differences can be noticed in the dynamics of the system. When RAIN uses e^(−x²), it creates some holes in the clusters, see Figure 2.f, but when using 1/x³ the clusters are well defined after 500 iterations, see Figure 4.f. Clearly, RAIN has detected all the natural clusters but has divided two of them in halves. Similar behavior was observed in several different trials.
Figure 5 shows the results obtained by RAIN on the other three data sets.
In general, when RAIN uses 1/x³, it merges some close natural clusters. It is possible that this function requires some additional considerations for setting the initial interaction strength. However, the performance is not bad, as no more than three clusters are merged together.
V. CONCLUSIONS AND FUTURE WORK
We have developed a heuristic mechanism for determining the initial value of the interaction strength parameter. It has been shown that the heuristic works well on a variety of data sets. Also, the gravitational clustering algorithm proposed by Gomez et al. in [1] was extended by allowing different interaction functions between points. Our results showed that RAIN works well on a variety of data sets.
Our future research will concentrate on carrying out a more detailed analysis of the heuristic mechanism for setting the initial interaction strength. Also, we will try to develop a mechanism for determining the minimum cluster size in order to eliminate this parameter. Finally, additional interaction functions should be analyzed in order to determine which function is more suitable for data clustering using RAIN.
REFERENCES
[1] J. Gomez, D. Dasgupta, and O. Nasraoui, "A new gravitational clustering algorithm," in Proceedings of the Third SIAM International Conference on Data Mining, 2003.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Fifth Berkeley Symposium on Mathematics, Statistics, and Probabilities, 1967.
[4] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.
[5] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. John Wiley & Sons, 1987.
[6] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986.
[7] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, no. 1(2), pp. 98–110, 1993.
[8] G. Beni and X. Liu, "A least biased fuzzy clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 16(9), pp. 954–960, 1994.
[9] H. Frigui and R. Krishnapuram, "A robust clustering algorithm based on the m-estimator," Neural, Parallel and Scientific Computations, 1995.
[10] J. M. Jolion, P. Meer, and S. Bataouche, "Robust clustering with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 13(8), pp. 791–802, 1991.
[11] H. Frigui and R. Krishnapuram, "Clustering by competitive agglomeration," Pattern Recognition, no. 30(7), pp. 1109–1119, 1997.
[12] O. Nasraoui and R. Krishnapuram, "A novel approach to unsupervised robust clustering using genetic niching," in Proceedings of the Ninth IEEE International Conference on Fuzzy Systems, pp. 170–175, 2000.
[13] W. E. Wright, "Gravitational clustering," Pattern Recognition, no. 9, pp. 151–166, 1977.
[14] S. Kundu, "Gravitational clustering: a new approach based on the spatial distribution of the points," Pattern Recognition, no. 32, pp. 1149–1160, 1999.
[15] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms. McGraw Hill, 1990.
Fig. 4. RAIN evolution for the Chameleon 10000-9 data set using 1/x³ (best seen in color). (a) Clusters with size 3 or higher after 100 iterations, (b) Clusters with size 100 or higher after 300 iterations, (c) Clusters with size 100 or higher after 500 iterations, (d) Data point configuration after 100 iterations, (e) Data point configuration after 300 iterations, (f) Data point configuration after 500 iterations.
Fig. 5. RAIN clustering for synthetic data sets using 1/x³ (best seen in color). (a) Chameleon 8000-6, (b) Chameleon 8000-8, (c) Rain 12000-7.