RAIN: Data Clustering using Randomized Interactions between Data Points

Jonatan Gomez
Computer and Systems Eng., School of Engineering
Universidad Nacional de Colombia
jgomezpe@unal.edu.co

Olfa Nasraoui
Computer Eng. & Computer Science Department
Speed School of Engineering, University of Louisville
Louisville, KY 40292
olfa.nasraoui@louisville.edu

Elizabeth Leon
Dept. of Electrical & Computer Science Department
Speed School of Engineering, University of Louisville
Louisville, KY 40292
eli.leon@louisville.edu

Abstract—This paper introduces a generalization of the Gravitational Clustering Algorithm proposed by Gomez et al. in [1]. First, it is extended so that not only the Gravitational Law but any decreasing function of the distance between points can be used. An estimate of the maximum distance between closest points is calculated in order to reduce the sensitivity of the clustering process to the size of the data set. Finally, a heuristic for setting the interaction strength (gravitational constant) is introduced in order to reduce the number of parameters of the algorithm. Experiments with well-known synthetic data sets are performed in order to show the applicability of the proposed approach.

I. INTRODUCTION

Clustering is an unsupervised learning technique that takes unlabeled data points (data records) and classifies them into different groups or clusters. This is done in such a way that points assigned to the same cluster have high similarity, while the similarity between points assigned to different clusters is low [2]. Although different clustering techniques have been developed and applied with relative success [2], [3], [4], [5], the data clustering problem remains a challenging task. In particular, two issues have concentrated a huge amount of research effort: robustness to noise and automatic determination of the number of clusters.

Robust Clustering

Clustering techniques such as k-means [3] and fuzzy k-means [4] rely on the assumption that a data set is free of noise and follows a certain distribution [3], [4], [5]. In fact, if noise is introduced, several techniques based on the least squares estimate are spoiled [5]. Several approaches have tried to tackle this problem, some of them based on robust statistics [6], [5], and others by modifying some elements of well-known clustering algorithms to make them more robust to noise [7], [8], [9].

Number of Clusters

Determining the number of clusters in a data set is a very hard task since clusters can vary in shape, size, and density. In order to tackle this issue, many clustering techniques require the number of clusters in advance [3], [4]. Some techniques try to determine the number of clusters based on estimations of the data density, considering regions of high data density as candidate clusters [10], [11], [12].

Another group of clustering techniques tries to determine the number of clusters by moving data points toward the cluster centers using kernel density functions, or by considering each data point as a particle in a universe exposed to a gravitational field [13], [14], and then defining a cluster as the set of data points that converge to the same region. The main disadvantage of these techniques is that their time complexity is at least quadratic ($O(N^2)$) with respect to the size of the data set. Moreover, Kundu shows in [14] that gravitational clustering has cubic time complexity ($O(N^3)$).

In [1] we developed a new gravitational clustering algorithm that reduces the time complexity of the original (to less than quadratic) and is able to deal with noise. Instead of considering all data points when moving a data point, another data point is randomly selected. Then, both data points are moved according to an oversimplification of the Universal Gravitational Law and Newton's second law of motion. Points that are close enough are merged into virtual clusters. Finally, the big crunch effect (one single big cluster at the end) is eliminated by introducing a cooling mechanism similar to the one in simulated annealing. Our experiments showed that the proposed clustering algorithm works well on a variety of data sets (synthetic and real). However, the performance of the new gravitational clustering can be highly affected by the selection of its four required parameters.

This paper introduces a generalization of our gravitational clustering algorithm, allowing different moving functions and setting the initial gravitational constant automatically. The paper is divided into five sections. Section 2 gives an overview of the gravitational clustering algorithm that we proposed in [1]; Section 3 introduces the new approach, called RAIN (Clustering based on RAndomized Interactions of data points); Section 4 presents the experiments performed and the analysis of the results; Section 5 draws some conclusions and outlines future work.

II. RANDOMIZED GRAVITATIONAL CLUSTERING (RGC)

In [1], we developed a robust clustering technique based on the gravitational law and Newton's second law of motion. For an n-dimensional data set with N data points, each data point is considered as an object in the n-dimensional space with mass equal to 1. Each point in the data set is moved according to a simplified version of the Gravitational Law using Newton's second law of motion. The basic ideas behind applying the gravitational law are:

1) A data point in some cluster exerts a higher gravitational force on a data point in the same cluster than on a data point that is not in the cluster. Thus, points in a cluster move toward the center of the cluster. In this way, the proposed technique will automatically determine the clusters in the data set.

2) If some point is a noise point, i.e., does not belong to any cluster, then the gravitational force exerted on it by other points is so small that the point is almost immobile. Therefore, noise points will not be assigned to any cluster.

In order to reduce the amount of memory and time spent moving a data point according to the gravitational field generated by another point (y), we use the following simplified equation:

$$\vec{x}(t+1) = \vec{x}(t) + G\,\frac{\vec{d}}{\|\vec{d}\|^{3}} \qquad (1)$$

where $\vec{d} = \vec{y} - \vec{x}$ and G is the gravitational constant.

We considered the velocity at any time, v(t), to be the zero vector and $\Delta t = 1$. Since the distance between points is reduced at each iteration, all the points would be moved to a single position after a huge (possibly infinite) number of iterations (the big crunch). The gravitational clustering algorithm would then define a single cluster. In order to eliminate this limit effect, the gravitational constant G is reduced at each iteration by a constant proportion (the decay term $\Delta(G)$). Algorithm 1 shows the randomized gravitational clustering algorithm.

The function MOVE (line 6) moves both points $x_j$ and $x_k$ using equation (1), taking into consideration that neither point can move farther than half of the distance between them. In each iteration, RGC creates a set of clusters by using an optimal disjoint-set union-find structure¹ and the distance between objects (after moving data points according to the gravitational force). When two points are merged, both of them are kept in the system while the associated set structure is modified. In order to determine the new position of each data point, the proposed algorithm only selects another data point at random and moves both of them according to equation (1) (the MOVE function). RGC returns the sets stored in the disjoint-set union-find structure.

Because RGC assigns every point in the data set (noisy or normal) to one cluster, it is necessary to extract the valid clusters. We used an extra parameter (α) to determine the minimum number of points (as a percentage of the training data set) that a cluster should include in order to be considered a valid cluster. In this way, we used an additional function GETCLUSTERS that takes the disjoint sets generated by RGC and returns the collection of clusters that have at least the minimum number of points defined; see Algorithm 2.

¹A disjoint-set union-find structure is a structure that supports the following three operators [15]:
MAKESET(x): create a new set containing the single element x.
UNION(x, y): replace the two sets containing x and y by their union.
FIND(x): return the name of the set containing the element x.
In the optimal disjoint-set union-find structure, each set is represented by a tree where the root of the tree is the canonical element of the set, and each child node has a pointer to its parent node (the root node points to itself) [15].
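The union-find structure referenced from [15] can be sketched as a generic textbook implementation with union by rank and path compression; this is an illustration, not the authors' code:

```python
class DisjointSets:
    """Minimal disjoint-set union-find with path compression and
    union by rank, as described in [15]. A generic sketch."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def makeset(self, x):
        self.parent[x] = x  # the root points to itself
        self.rank[x] = 0

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx  # attach the shallower tree under the deeper root
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```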

III. RAIN: CLUSTERING BASED ON RANDOMIZED INTERACTIONS OF DATA POINTS

RAIN extends the RGC algorithm so that different decreasing functions can be used instead of the one based on the Gravitational Law. Basically, three elements should be considered: reducing the effect of the data set size on the dynamics of the system, defining an interaction function, and setting the initial interaction strength automatically.

A. Maximum Distance between Closest Points

In order to reduce the sensitivity of RAIN to the size of the data set, we calculate a rough estimate of the maximum distance between closest points in the data set. This distance gives RAIN a reference value for merging and moving data points.

Given a collection of N data points in the n-dimensional [0,1] Euclidean space, the maximum distance between closest points can be roughly approximated using equation (2):

$$\hat{d} = \frac{2\sqrt{n}}{\sqrt{3}\,N^{1/n}} \qquad (2)$$

The conjecture behind this equation is that in order to have a maximum distance between closest points, such points should be arranged in a grid defining isosceles triangles (pyramids). The height of such a triangle is $\frac{\sqrt{3}}{2}$ times its side, and the maximum number of points per dimension of such a grid should be bounded by $N^{1/n}$. The $\sqrt{n}$ term is a correction factor for data sets where the number of points is considerably low compared to the number of vertices of the hypercube $[0,1]^{n}$, i.e., $2^{n}$.
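Equation (2) translates directly into code; the function name is an illustrative assumption:

```python
import math

def max_closest_distance(N, n):
    """Rough estimate of the maximum distance between closest points
    for N points in [0,1]^n, Eq. (2): 2*sqrt(n) / (sqrt(3) * N**(1/n))."""
    return 2.0 * math.sqrt(n) / (math.sqrt(3.0) * N ** (1.0 / n))
```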

B. Moving Functions

Although motivated by the Gravitational Law and Newton's second law of motion, RGC can be seen as an algorithm that moves interacting data points according to a decreasing function of the distance between them. In RAIN, the final position of a data point x that is interacting with another data point y is defined by equation (3):

$$\vec{x}(t+1) = \vec{x}(t) + G\,\vec{d}\,f\!\left(\frac{\|\vec{d}\|}{\hat{d}}\right) \qquad (3)$$

where $\vec{d} = \vec{y} - \vec{x}$, f is a decreasing function, $\hat{d}$ is the rough estimate of the maximum distance between closest points, and G is the initial strength of the data point interaction. Although many decreasing functions can be used, we only consider $f(x) = \frac{1}{x^{3}}$ and $f(x) = e^{-x^{2}}$.
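Equation (3) and the two interaction functions can be sketched as follows; the names are ours, and the half-distance cap applied by MOVE is omitted for clarity:

```python
import numpy as np

def rain_step(x, y, G, d_hat, f):
    """Eq. (3): x(t+1) = x(t) + G * d * f(||d|| / d_hat), with d = y - x.
    A sketch of the generalized update, not the authors' implementation."""
    d = y - x
    r = np.linalg.norm(d) / d_hat
    return x + G * d * f(r)

# the two interaction functions considered in the paper
f_grav = lambda r: 1.0 / r**3      # gravitational-style kernel
f_gauss = lambda r: np.exp(-r**2)  # Gaussian kernel
```

With $f(x) = 1/x^3$ and $\hat{d} = 1$, this reduces to the same form as Eq. (1), which is why RGC is a special case of RAIN.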

Algorithm 1 Randomized Gravitational Clustering

RGC(x, G, Δ(G), M, ε)
1  for i = 1 to N do
2      MAKESET(i)
3  for i = 1 to M do
4      for j = 1 to N do
5          k = random point index such that k ≠ j
6          MOVE(x_j, x_k) (see Eq. (1)) // move both points
7          if dist(x_j, x_k) ≤ ε then UNION(j, k)
8      G = (1 − Δ(G)) · G
9  for i = 1 to N do
10     FIND(i)
11 return disjoint-sets
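A minimal sketch of Algorithm 1, assuming the points are stored in a NumPy array; the inlined union-find and the half-distance cap are our simplifications, not a transcription of the original code:

```python
import numpy as np

def rgc(x, G, decay, M, eps, rng=None):
    """Sketch of Algorithm 1 (RGC). x: (N, n) array, modified in place.
    Moves random pairs with a simplified Eq. (1), merges pairs closer
    than eps, and decays G each sweep. Returns a cluster label per point."""
    rng = rng or np.random.default_rng(0)
    N = len(x)
    parent = list(range(N))  # MAKESET for every point

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for _ in range(M):
        for j in range(N):
            k = int(rng.integers(N - 1))
            if k >= j:
                k += 1  # ensure k != j
            d = x[k] - x[j]
            dist = np.linalg.norm(d)
            if dist > 0:
                # Eq. (1), with the move capped at half the distance
                step = min(G / dist**3, 0.5) * d
                x[j] += step
                x[k] -= step
            if dist <= eps:
                parent[find(j)] = find(k)  # UNION(j, k)
        G = (1.0 - decay) * G
    return [find(i) for i in range(N)]
```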

Algorithm 2 Cluster Extraction

GETCLUSTERS(clusters, α, N)
1  newClusters = ∅
2  MIN_POINTS = α · N
3  for i = 0 to number of clusters do
4      if size(cluster_i) ≥ MIN_POINTS then
5          newClusters = newClusters ∪ {cluster_i}
6  return newClusters

C. Setting the Initial Interaction Strength (G)

Since RAIN creates the clusters while it moves the data points, it is possible to use the number of merged points after some checking iterations to determine whether RAIN is using an appropriate value of G. A point is considered merged if it has been assigned to a cluster of size greater than one. A high number of merged points (more than half) can indicate that the initial strength is so high that it will define a single big cluster. If the number of merged points is low (less than half), it can indicate that the initial strength is so low that no cluster will be defined at all. Therefore, a value close to half of the number of data points indicates an appropriate value for the initial strength. Algorithm 3 shows the process for determining the initial interaction strength.

D. RAIN Time Complexity

Although it looks like RAIN has linear time complexity, our experiments indicate that the number of iterations required is $10\sqrt{N}$, where N is the number of data points. Also, our experimental results indicate that for checking each candidate initial strength, the number of iterations required is close to $\sqrt{N}$. Therefore, based on experimental evidence, the time complexity of RAIN is $O(N\sqrt{N})$.

IV. EXPERIMENTATION

In order to evaluate the performance of RAIN, experiments were performed on four synthetic data sets; three of them are included in the CLUTO toolkit. Figure 1 shows the four synthetic data sets.

1) Experimental Settings: RAIN was run with a merging distance of ε = 1e−4, a decay term of Δ(G) = 0.001, and a minimum cluster size of 50 points. We ran RAIN with both of the defined interaction (moving) functions. The reported results were obtained after allowing RAIN to determine the initial strength and run for $5\sqrt{N}$ iterations.

2) Analysis of Results: Figure 2 presents the extracted clusters and the configuration of data points after 100, 300, and 500 iterations of a typical run of RAIN on the Chameleon 10000-9 data set using $e^{-x^{2}}$ as the interaction function.

Notice that after 100 iterations ($\sqrt{10000}$), almost every point belonging to a natural cluster has been merged with another point in its cluster, while not many noisy points have been merged; see Figure 2.a. Moreover, after 500 iterations, noisy points have practically not moved from their original positions; see Figure 2.f. Clearly, RAIN is able to deal with noise in the data set.

As the number of iterations increases, the size of the formed clusters increases too. After 300 iterations, RAIN is able to detect all the natural clusters, but has some of them divided into two or three clusters; see Figure 2.b. When the maximum number of iterations is reached ($5\sqrt{10000}$), RAIN has detected all the natural clusters. Similar behavior was observed in several different trials.

Figure 3 shows the results obtained by RAIN on the other three data sets.

In general, RAIN using $e^{-x^{2}}$ as the interaction function was able to find the natural clusters in the data set. Sometimes RAIN merges two clusters, and sometimes it generates two clusters instead of one. Only for the Chameleon 8000-8 data set was one cluster missed. Since that cluster has low density, RAIN may not be able to form it due to the random interaction process.

Algorithm 3 Initial Strength Estimation

GETINITIALSTRENGTH(x, Δ(G), M, ε)
1  G = 1
2  RGC(x, G, Δ(G), M, ε) // test the given strength
3  K = number of merged points
4  while N/2 − K > √N do
5      G = 2 · G
6      RGC(x, G, Δ(G), M, ε) // test the given strength
7      K = number of merged points
8  a = G/2
9  b = G
10 G = (a + b)/2
11 RGC(x, G, Δ(G), M, ε) // test the given strength
12 K = number of merged points
13 while |N/2 − K| > √N do
14     if N/2 > K then a = G else b = G
15     G = (a + b)/2
16     RGC(x, G, Δ(G), M, ε) // test the given strength
17     K = number of merged points
18 return G
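The doubling-then-bisection search of Algorithm 3 can be sketched generically; `count_merged` is a hypothetical callback standing in for the RGC test runs in the pseudocode, and the branch directions assume that the number of merged points grows with G:

```python
import math

def initial_strength(count_merged, N):
    """Sketch of Algorithm 3: double G until roughly half the points
    merge, then bisect until the merged count K is within sqrt(N) of N/2.
    count_merged(G) runs the clustering once and returns K."""
    target, tol = N / 2.0, math.sqrt(N)
    G = 1.0
    K = count_merged(G)
    while target - K > tol:       # strength too weak: double it
        G *= 2.0
        K = count_merged(G)
    a, b = G / 2.0, G             # bracket around the target strength
    G = (a + b) / 2.0
    K = count_merged(G)
    while abs(target - K) > tol:  # bisect until K is near N/2
        if target > K:
            a = G                 # too few merged: raise the lower bound
        else:
            b = G                 # too many merged: lower the upper bound
        G = (a + b) / 2.0
        K = count_merged(G)
    return G
```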

Fig. 1. Synthetic Data Sets. (a) Chameleon with 10000 points and 9 natural clusters, (b) Chameleon with 8000 points and 6 natural clusters, (c) Chameleon with 8000 points and 8 natural clusters, (d) Rain with 12000 points and 7 natural clusters.


Fig. 2. RAIN evolution for the Chameleon 10000-9 data set using $e^{-x^{2}}$ (best seen in color). (a) Clusters with size 3 or higher after 100 iterations, (b) clusters with size 100 or higher after 300 iterations, (c) clusters with size 100 or higher after 500 iterations, (d) data point configuration after 100 iterations, (e) data point configuration after 300 iterations, (f) data point configuration after 500 iterations.


Fig. 3. RAIN clustering for synthetic data sets using $e^{-x^{2}}$ (best seen in color). (a) Chameleon 8000-6, (b) Chameleon 8000-8, (c) Rain 12000-7.

Figure 4 presents the extracted clusters and the configuration of data points after 100, 300, and 500 iterations of a typical run of RAIN on the Chameleon 10000-9 data set using $\frac{1}{x^{3}}$ as the interaction function.

As expected, this configuration of RAIN behaves similarly to the one previously discussed. However, some differences can be noticed in the dynamics of the system. When RAIN uses $e^{-x^{2}}$, it creates some holes in the clusters (see Figure 2.f), but when using $\frac{1}{x^{3}}$ the clusters are well defined after 500 iterations (see Figure 4.f). Clearly, RAIN has detected all the natural clusters but has divided two of them in halves. Similar behavior was observed in several different trials.

Figure 5 shows the results obtained by RAIN on the other three data sets.

In general, when RAIN uses $\frac{1}{x^{3}}$, it merges some close natural clusters. It is possible that this function requires some additional considerations for setting the initial interaction strength. However, the performance is not bad, as no more than three clusters are merged together.

V. CONCLUSIONS AND FUTURE WORK

We have developed a heuristic mechanism for determining the initial value of the interaction strength parameter. It has been shown that the heuristic works well on a variety of data sets. Also, the gravitational clustering algorithm proposed by Gomez et al. in [1] was extended by allowing different interaction functions between points. Our results showed that RAIN works well on a variety of data sets.

Our future research will concentrate on a more detailed analysis of the heuristic mechanism for setting the initial interaction strength. Also, we will try to develop a mechanism for determining the minimum cluster size in order to eliminate this parameter. Finally, additional interaction functions should be analyzed in order to determine which function is most suitable for data clustering using RAIN.

REFERENCES

[1] J. Gomez, D. Dasgupta, and O. Nasraoui, "A new gravitational clustering algorithm," in Proceedings of the Third SIAM International Conference on Data Mining, 2003.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Fifth Berkeley Symposium on Mathematics, Statistics, and Probabilities, 1967.
[4] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.
[5] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. John Wiley & Sons, 1987.
[6] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986.
[7] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, no. 1(2), pp. 98–110, 1993.
[8] G. Beni and X. Liu, "A least biased fuzzy clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 16(9), pp. 954–960, 1994.
[9] H. Frigui and R. Krishnapuram, "A robust clustering algorithm based on the m-estimator," Neural, Parallel and Scientific Computations, 1995.
[10] J. M. Jolion, P. Meer, and S. Bataouche, "Robust clustering with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 13(8), pp. 791–802, 1991.
[11] H. Frigui and R. Krishnapuram, "Clustering by competitive agglomeration," Pattern Recognition, no. 30(7), pp. 1109–1119, 1997.
[12] O. Nasraoui and R. Krishnapuram, "A novel approach to unsupervised robust clustering using genetic niching," in Proceedings of the Ninth IEEE International Conference on Fuzzy Systems, pp. 170–175, 2000.
[13] W. E. Wright, "Gravitational clustering," Pattern Recognition, no. 9, pp. 151–166, 1977.
[14] S. Kundu, "Gravitational clustering: a new approach based on the spatial distribution of the points," Pattern Recognition, no. 32, pp. 1149–1160, 1999.
[15] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms. McGraw Hill, 1990.


Fig. 4. RAIN evolution for the Chameleon 10000-9 data set using $\frac{1}{x^{3}}$ (best seen in color). (a) Clusters with size 3 or higher after 100 iterations, (b) clusters with size 100 or higher after 300 iterations, (c) clusters with size 100 or higher after 500 iterations, (d) data point configuration after 100 iterations, (e) data point configuration after 300 iterations, (f) data point configuration after 500 iterations.


Fig. 5. RAIN clustering for synthetic data sets using $\frac{1}{x^{3}}$ (best seen in color). (a) Chameleon 8000-6, (b) Chameleon 8000-8, (c) Rain 12000-7.
