Clustering Algorithms Implementation on ATLaS


CS240B Project Report









Richard Luo

Prof. Carlo Zaniolo

2002/6

Abstract


In this project, we discuss clustering algorithms for spatial data mining, such as the
partitioning algorithm PAM and the density-based algorithm DBSCAN. Some of their
implementations on ATLaS, a database system that supports User-Defined Aggregates (UDAs),
are illustrated. With UDAs, such clustering algorithms are convenient to implement. A spatial
index structure called the R-tree would significantly improve performance. Experiments with
real data from SEQUOIA 2000 show that the implementation of these algorithms on ATLaS is
satisfactory even in the absence of an R-tree index. However, some improvements to ATLaS
would benefit the development of these clustering algorithms as well as other general data
mining algorithms. A proposal for improving the ATLaS system is presented at the end.

Introduction

Knowledge discovery is becoming more and more important in spatial databases, since
increasingly large amounts of data obtained from satellite images, X-ray crystallography or
other automatic equipment are stored in them. Several types of clustering algorithms have been
proposed in the last few years, such as:

1) Partitioning algorithms: construct various partitions, then evaluate them by some criterion.

2) Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects)
using some criterion.

3) Density-based algorithms: based on local connectivity and density functions.


In this report, we discuss the partitioning algorithm PAM and the density-based algorithm
DBSCAN. Their implementation on ATLaS, a database system that supports User-Defined
Aggregates (UDAs), is illustrated. With UDAs, such clustering algorithms are convenient to
implement. A spatial index structure called the R-tree would significantly improve
performance. Experiments with real data from SEQUOIA 2000 show that the implementation of
these algorithms on ATLaS is satisfactory even in the absence of an R-tree index.


Finally, we discuss some improvements to ATLaS that may benefit the development of these
clustering algorithms as well as other general data mining algorithms. A proposal for
improving the ATLaS system is presented at the end.

Clustering Algorithms

DBSCAN

The key idea of a density-based cluster is that, for each point of a cluster, its
Eps-neighborhood for some given Eps > 0 has to contain at least a minimum number of points,
i.e. the “density” in the Eps-neighborhood of points has to exceed some threshold.
Furthermore, the density within the areas of noise is lower than the density in any of the
clusters.
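The density condition can be illustrated with a small Python sketch. It assumes 2-D points and Euclidean distance; the function names `eps_neighborhood` and `is_core_point` and the sample data are hypothetical and only serve the illustration.

```python
from math import hypot

def eps_neighborhood(points, q, eps):
    """Return all points within distance eps of q (q itself is included)."""
    qx, qy = q
    return [(x, y) for (x, y) in points if hypot(x - qx, y - qy) <= eps]

def is_core_point(points, q, eps, min_pts):
    """The density condition: the Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

# Four points in a dense corner plus one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(is_core_point(points, (0, 0), eps=1.5, min_pts=4))
print(is_core_point(points, (10, 10), eps=1.5, min_pts=4))
```

With these values the corner point satisfies the condition, while the isolated point has only itself in its neighborhood.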


This idea of “density-based clusters” can be generalized in two important ways. First, we can
use any notion of a neighborhood instead of an Eps-neighborhood if the definition of the
neighborhood is based on a binary predicate which is symmetric and reflexive. Second, instead
of simply counting the objects in a neighborhood of an object, we can as well use other
measures to define the “cardinality” of that neighborhood.


A naive approach could require, for each object in a density-connected set, that the weighted
cardinality (wCard) of the NPred-neighborhood of that object has at least a value MinCard.
However, this approach fails because there may be two kinds of objects in a density-connected
set: objects inside the set (core objects) and objects “on the border” of the set (border
objects). In general, an NPred-neighborhood of a border object has a significantly lower wCard
than an NPred-neighborhood of a core object. Therefore, we would have to set MinCard to a
relatively low value in order to include all objects belonging to the same density-connected
set. This value, however, would not be characteristic for the respective density-connected
set, particularly in the presence of noise objects. Therefore, for every object p in a
density-connected set C there must be an object q in C such that p is inside the
NPred-neighborhood of q and the weighted cardinality wCard of NPred(q) is at least MinCard.
We also require the objects of the set C to be somehow “connected” to each other.
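As a rough illustration of the generalized condition, the following Python sketch takes the neighborhood predicate and the weighted cardinality as plain functions. The names `npred_neighborhood`, `is_core_object` and `may_belong`, and the example of weighted objects on a line, are assumptions made for this sketch. The connectivity requirement is not modelled here.

```python
def npred_neighborhood(objects, q, npred):
    """All objects o for which the binary predicate npred(q, o) holds."""
    return [o for o in objects if npred(q, o)]

def is_core_object(objects, q, npred, wcard, min_card):
    """q is a core object if the weighted cardinality of its neighborhood reaches min_card."""
    return wcard(npred_neighborhood(objects, q, npred)) >= min_card

def may_belong(objects, p, cluster, npred, wcard, min_card):
    """p qualifies if it lies in the neighborhood of some core object q of the set."""
    return any(npred(q, p) and is_core_object(objects, q, npred, wcard, min_card)
               for q in cluster)

# Hypothetical example: objects are (position, weight) pairs on a line.
objects = [(0.0, 1), (0.5, 2), (1.0, 1), (1.8, 1), (5.0, 1)]
close = lambda a, b: abs(a[0] - b[0]) <= 1.0          # symmetric and reflexive
total_weight = lambda objs: sum(w for _, w in objs)   # plays the role of wCard

cluster = objects[:4]
print(is_core_object(objects, (1.8, 1), close, total_weight, 4))   # a border object
print(may_belong(objects, (1.8, 1), cluster, close, total_weight, 4))
print(may_belong(objects, (5.0, 1), cluster, close, total_weight, 4))
```

The object at position 1.8 is not a core object itself, but it lies in the neighborhood of the core object at position 1.0, which is the situation of a border object described above.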

PAM

PAM (Partitioning Around Medoids) was developed by Kaufman and Rousseeuw. To find k clusters,
PAM's approach is to determine a representative object for each cluster. This representative
object, called a medoid, is meant to be the most centrally located object within the cluster.
Once the medoids have been selected, each non-selected object is grouped with the medoid to
which it is the most similar. More precisely, if Oj is a non-selected object and Oi is a
(selected) medoid, we say that Oj belongs to the cluster represented by Oi if
d(Oj, Oi) = min_Oe d(Oj, Oe), where min_Oe denotes the minimum over all medoids Oe, and
d(Oa, Ob) denotes the dissimilarity or distance between objects Oa and Ob. All the
dissimilarity values are given as inputs to PAM. Finally, the quality of a clustering (i.e.
the combined quality of the chosen medoids) is measured by the average dissimilarity between
an object and the medoid of its cluster.
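The assignment rule and the quality measure can be sketched in Python as follows. This is a minimal sketch that assumes 2-D points and uses the Euclidean distance as the dissimilarity d; the function names are hypothetical.

```python
from math import dist

def assign_to_medoids(objects, medoids):
    """Group every object with the medoid it is most similar (closest) to."""
    clusters = {m: [] for m in medoids}
    for o in objects:
        nearest = min(medoids, key=lambda m: dist(o, m))
        clusters[nearest].append(o)
    return clusters

def average_dissimilarity(objects, medoids):
    """Quality of a clustering: mean distance between an object and its medoid."""
    return sum(min(dist(o, m) for m in medoids) for o in objects) / len(objects)

objects = [(0, 0), (0, 2), (10, 0), (10, 4)]
medoids = [(0, 0), (10, 0)]
print(assign_to_medoids(objects, medoids))
print(average_dissimilarity(objects, medoids))  # (0 + 2 + 0 + 4) / 4 = 1.5
```

A lower average dissimilarity indicates a better choice of medoids.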


To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step,
a swap between a selected object Oi and a non-selected object Oh is made, as long as such a
swap would result in an improvement of the quality of the clustering. In particular, to
calculate the effect of such a swap between Oi and Oh, PAM computes costs Cjih for all
non-selected objects Oj. Depending on which of the following cases Oj is in, Cjih is defined
by one of the equations below:


First Case: suppose Oj currently belongs to the cluster represented by Oi. Furthermore, let Oj
be more similar to Oj2 than to Oh, i.e. d(Oj, Oh) >= d(Oj, Oj2), where Oj2 is the second most
similar medoid to Oj. Thus, if Oi is replaced by Oh as a medoid, Oj would belong to the
cluster represented by Oj2. Hence, the cost of the swap as far as Oj is concerned is:

Cjih = d(Oj, Oj2) - d(Oj, Oi)

This equation always gives a non-negative Cjih, indicating that there is a non-negative cost
incurred in replacing Oi with Oh.


Second Case: Oj currently belongs to the cluster represented by Oi. But this time, Oj is less
similar to Oj2 than to Oh, i.e. d(Oj, Oh) < d(Oj, Oj2). Then, if Oi is replaced by Oh, Oj
would belong to the cluster represented by Oh. Thus, the cost for Oj is given by:

Cjih = d(Oj, Oh) - d(Oj, Oi)

Cjih here can be positive or negative, depending on whether Oj is more similar to Oi or to Oh.


Third Case: suppose that Oj currently belongs to a cluster other than the one represented by
Oi. Let Oj2 be the representative object of that cluster. Furthermore, let Oj be more similar
to Oj2 than to Oh. Then even if Oi is replaced by Oh, Oj would stay in the cluster represented
by Oj2. Thus, the cost is:

Cjih = 0


Fourth Case: Oj currently belongs to the cluster represented by Oj2. But Oj is less similar to
Oj2 than to Oh. Then replacing Oi with Oh would cause Oj to jump to the cluster of Oh from
that of Oj2. Thus, the cost is:

Cjih = d(Oj, Oh) - d(Oj, Oj2)

and is always negative.


Combining the four cases above, the total cost of replacing Oi with Oh is given by:

TCih = sum of Cjih over all non-selected objects Oj
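The four cases can be written down directly. The Python sketch below assumes 2-D points, the Euclidean distance as d, and at least two medoids (so that a second most similar medoid exists); `swap_cost` and `total_swap_cost` are hypothetical names for Cjih and TCih.

```python
from math import dist

def swap_cost(oj, oi, oh, medoids):
    """Cost Cjih for object oj if medoid oi were replaced by non-medoid oh."""
    current = min(medoids, key=lambda m: dist(oj, m))   # medoid of oj's cluster
    if current == oi:
        # Oj2 is the second most similar medoid to oj.
        d_j2 = min(dist(oj, m) for m in medoids if m != oi)
        if dist(oj, oh) >= d_j2:
            return d_j2 - dist(oj, oi)           # first case
        return dist(oj, oh) - dist(oj, oi)       # second case
    # oj belongs to another cluster; Oj2 is that cluster's medoid.
    d_j2 = dist(oj, current)
    if dist(oj, oh) >= d_j2:
        return 0.0                               # third case
    return dist(oj, oh) - d_j2                   # fourth case

def total_swap_cost(objects, oi, oh, medoids):
    """TCih: sum of Cjih over all non-selected objects oj."""
    return sum(swap_cost(oj, oi, oh, medoids) for oj in objects if oj not in medoids)

objects = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
medoids = [(0, 0), (10, 0)]
print(total_swap_cost(objects, (0, 0), (1, 0), medoids))
```

In this example the objects (1, 0) and (2, 0) fall under the second case with a cost of -1 each, and (11, 0) falls under the third case, so the total is negative and the swap counts as an improvement.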


We now present Algorithm PAM.

1. Select k representative objects arbitrarily.

2. Compute TCih for all pairs of objects Oi, Oh where Oi is currently selected and Oh is not.

3. Select the pair Oi, Oh which corresponds to the minimum TCih over all such pairs. If the
minimum TCih is negative, replace Oi with Oh, and go back to Step (2).

4. Otherwise, for each non-selected object, find the most similar representative object. Halt.
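The loop structure of these steps can be sketched in Python. Two simplifications are assumed here and are not taken from the report: the "arbitrary" initial selection is simply the first k objects unless `initial` is given, and a swap is evaluated by recomputing the total dissimilarity of the clustering rather than by summing the per-object costs Cjih. The function names are hypothetical.

```python
from itertools import product
from math import dist

def total_dissimilarity(objects, medoids):
    """Sum of distances between every object and its closest medoid."""
    return sum(min(dist(o, m) for m in medoids) for o in objects)

def pam(objects, k, initial=None):
    # Step 1: arbitrary selection (assumption: the first k objects).
    medoids = list(initial) if initial is not None else list(objects[:k])
    while True:
        current = total_dissimilarity(objects, medoids)
        best_delta, best_swap = 0.0, None
        non_selected = [o for o in objects if o not in medoids]
        # Step 2: evaluate every (selected, non-selected) pair.
        for oi, oh in product(medoids, non_selected):
            candidate = [oh if m == oi else m for m in medoids]
            delta = total_dissimilarity(objects, candidate) - current
            if delta < best_delta:
                best_delta, best_swap = delta, (oi, oh)
        # Step 3: perform the best swap only if it lowers the cost.
        if best_swap is None:
            break
        oi, oh = best_swap
        medoids = [oh if m == oi else m for m in medoids]
    # Step 4: assign each object to its most similar medoid.
    clusters = {m: [] for m in medoids}
    for o in objects:
        clusters[min(medoids, key=lambda m: dist(o, m))].append(o)
    return medoids, clusters

objects = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, clusters = pam(objects, k=2)
print(sorted(medoids))
```

Each accepted swap strictly lowers the total dissimilarity, so the loop stops after finitely many swaps.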


R*-tree spatial index

In the following, we introduce a typical spatial index, the R*-tree. The R*-tree generalizes
the 1-dimensional B-tree to d-dimensional data spaces; specifically, an R*-tree manages
d-dimensional hyperrectangles instead of 1-dimensional keys. An R*-tree may organize extended
objects such as polygons using minimum bounding rectangles (MBRs) as approximations, as well as
point objects as a special case of rectangles. The leaves store the MBRs of the data objects
and a pointer to the exact geometry of the polygons.

Internal nodes store a sequence of pairs consisting of a rectangle and a pointer to a child
node. These rectangles are the MBRs of all data or directory rectangles stored in the subtree
having the referenced child node as its root. To answer a region query, starting from the
root, the set of rectangles intersecting the query region is determined and then their
referenced child nodes are searched until the data pages are reached.


Fig 1. R*-tree


The height of an R*-tree is O(log n) for a database of n objects in the worst case, and a query
with a “small” query region has to traverse only a limited number of paths in the R*-tree.
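The region query traversal can be sketched in Python on a small hand-built tree. This sketch covers only the search; the insertion and node-splitting strategies of the R*-tree are not shown. The `Node` class, the rectangle layout `(xmin, ymin, xmax, ymax)` and the object ids are assumptions made for the illustration.

```python
from dataclasses import dataclass, field

def intersects(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Each entry pairs an MBR with a child Node (internal) or an object id (leaf).
    entries: list = field(default_factory=list)

def region_query(node, query):
    """Collect the ids of all data objects whose MBR intersects the query region."""
    hits = []
    for mbr, ref in node.entries:
        if not intersects(mbr, query):
            continue                      # prune: nothing below can intersect
        if node.is_leaf:
            hits.append(ref)
        else:
            hits.extend(region_query(ref, query))
    return hits

leaf_a = Node(True, [((0, 0, 1, 1), "a1"), ((2, 2, 3, 3), "a2")])
leaf_b = Node(True, [((10, 10, 11, 11), "b1")])
root = Node(False, [((0, 0, 3, 3), leaf_a), ((10, 10, 11, 11), leaf_b)])
print(region_query(root, (0.5, 0.5, 2.5, 2.5)))
```

Here the subtree under `leaf_b` is never visited, because its MBR does not intersect the query region. This pruning is why a small query region touches only a few paths.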


Implementation


Although ATLaS is still in its alpha stage and provides only basic functionality, we still find
it convenient for implementing these clustering algorithms. User-Defined Aggregates (UDAs)
provide a one-scan approach and flexible access to the database. In this section, we describe
the implementation of two clustering algorithms on ATLaS: DBSCAN and PAM.

DBSCAN

    table setofpoints (x real, y real, ClId real);
    /* meaning of ClId: -1: unclassified, 0: noise, 1,2,3...: cluster */
    table nextid (ClusterId real);
    table seeds (sx real, sy real);

    insert into nextid values (1);

    load from dbscan.input into temp;

    insert into setofpoints
      select x, y, -1
      from temp;

    select ExpandCluster(x, y, ClusterId, 1000, 4)
    from setofpoints, nextid
    where ClId<=0;


The table setofpoints stores the coordinates and cluster ids of all points read from the input
file dbscan.input. After initializing the cluster ids to -1, the program calls the main
aggregate of this algorithm, ExpandCluster(), to expand a cluster from any point (x,y). We use
global values of 4 for MinPoints and 1000 for Eps.


The regionQuery() aggregate returns the Eps-neighborhood of point (qx,qy):


    aggregate regionQuery(qx real, qy real, eps real):(r1 real, r2 real)
    {
      INITIALIZE: ITERATE:
      {
        INSERT INTO return
          select x, y from setofpoints
          where (x-qx)*(x-qx) + (y-qy)*(y-qy) <= eps*eps;
      }
    }



In changeClId(), points that have been marked as NOISE may be changed later if they are
density-reachable from some other point of the database. This happens for border points of a
cluster. Those points are not added to the seeds because we already know that a point with a
ClId of NOISE is not a core point. Adding those points to the seeds would only result in
additional region queries which would yield no new answers.


    aggregate changeClId (sx real, sy real, ClusterId real, Eps real, MinPts real):real
    {
      table result (rx real, ry real);
      table resultsize (size real);
      initialize:
      iterate:
      {
        insert into result select regionQuery(sx, sy, Eps);
        insert into resultsize select count(rx) from result;
        insert into seeds select rx, ry from result
          where (select size from resultsize)>=MinPts
            and (select ClId from setofpoints where x=result.rx and y=result.ry)=-1;
        update setofpoints set ClId=ClusterId where SQLCODE=1
          and exists (select rx,ry from result) and (ClId=-1 or ClId=0);
        delete from seeds where seeds.sx=sx and seeds.sy=sy;
        delete from resultsize where 1=1;
      }
    }



    AGGREGATE ExpandCluster (ex real, ey real, ClusterId real, Eps real, MinPts real):real
    {
      table seedssize (size real);
      initialize:
      iterate:
      {
        insert into seeds select regionQuery (ex, ey, Eps);
        insert into seedssize select count(sx) from seeds;
        /* insert into stdout select ex, ey, size from seedssize; */
        update setofpoints set ClId=0
          where exists (select sx from seeds where sx=setofpoints.x and sy=setofpoints.y)
            and (select size from seedssize)<MinPts;
        update setofpoints set ClId=ClusterId
          where exists (select sx from seeds where sx=setofpoints.x and sy=setofpoints.y)
            and SQLCODE=0;
        update nextid set ClusterId=ClusterId+1 where SQLCODE=1;
        delete from seeds where sx=ex and sy=ey and SQLCODE=1;
        select changeClId (sx, sy, ClusterId, Eps, MinPts) from seeds
          where SQLCODE=1;
        delete from seedssize where 1=1;
        delete from seeds where 1=1;
      }
    }
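For comparison, the overall control flow of DBSCAN can be sketched in a few lines of Python. This is not a translation of the ATLaS aggregates above; it is a common in-memory formulation that reuses the same ClId convention (-1 unclassified, 0 noise, 1,2,3... cluster) and assumes that all points are distinct 2-D tuples. The function names are hypothetical.

```python
UNCLASSIFIED, NOISE = -1, 0

def region_query(points, q, eps):
    """Eps-neighborhood of q, using squared distances as in regionQuery()."""
    qx, qy = q
    return [p for p in points
            if (p[0] - qx) ** 2 + (p[1] - qy) ** 2 <= eps * eps]

def dbscan(points, eps, min_pts):
    cl_id = {p: UNCLASSIFIED for p in points}
    next_id = 1
    for p in points:
        if cl_id[p] != UNCLASSIFIED:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            cl_id[p] = NOISE              # may later become a border point
            continue
        cl_id[p] = next_id                # p is a core point: start a new cluster
        seeds = [s for s in neighbors if s != p]
        while seeds:
            current = seeds.pop()
            if cl_id[current] == NOISE:
                cl_id[current] = next_id  # border point: relabel, do not expand
                continue
            if cl_id[current] != UNCLASSIFIED:
                continue
            cl_id[current] = next_id
            result = region_query(points, current, eps)
            if len(result) >= min_pts:    # current is a core point: keep expanding
                seeds.extend(result)
        next_id += 1
    return cl_id

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)
```

Points previously marked as noise are relabelled but never queued for a new region query, which mirrors the reasoning given for changeClId().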



PAM

    table setofpoints (id int, x real, y real);
    table pointSize (psize int);
    table temp (x real, y real, name char(30));
    table temp1 (x real, y real);
    table mediod (mx real, my real);
    table i (i int);


    /* pick a random point as a medoid: the point whose id satisfies id-1 < no <= id */
    aggregate randSel(size int):int
    {
      table randNo(no real);
      initialize:iterate:
      {
        insert into randNo values(rand()*size);
        insert into mediod
          select x, y
          from setofpoints, randNo
          where id-1 < no and no <= id;
        delete from randNo where 1=1;
      }
    }


    /* copy the points into setofpoints, assigning sequential ids 1, 2, 3, ... */
    AGGREGATE addid(ax real, ay real) : int
    {
      TABLE tmp(i int);
      INITIALIZE :
      {
        INSERT INTO tmp VALUES(1);
        INSERT INTO setofpoints values(1, ax, ay);
      }
      ITERATE :
      {
        UPDATE tmp SET i=i+1;
        INSERT INTO setofpoints
          SELECT i, ax, ay FROM tmp;
      }
    }


    /* return the input row with the smallest cost c */
    aggregate mymin(c real, mx real, my real, x real, y real):(r1 real, r2 real, r3 real, r4 real, r5 real)
    {
      table minCost(cc real, cmx real, cmy real, cx real, cy real);
      initialize:
      {
        insert into minCost values(c, mx, my, x, y);
      }
      iterate:
      {
        update minCost
          set cc=c, cmx=mx, cmy=my, cx=x, cy=y
          where c<cc;
      }
      terminate:
      {
        insert into return
          select cc, cmx, cmy, cx, cy
          from minCost;
      }
    }


    aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real, r2 real, r3 real, r4 real, r5 real)
    {
      table cost(cost real);
      initialize:
      {
      }
      iterate:
      {
        update cost
          set cost = cost + sqrt((jx-hx)*(jx-hx)+(jy-hy)*(jy-hy))
                          - sqrt((jx-ix)*(jx-ix)+(jy-iy)*(jy-iy));
      }
      terminate:
      {
        insert into return
          select cost, ix, iy, hx, hy from cost;
      }
    }


    aggregate updMediod(ix real, iy real, hx real, hy real):int
    {
      table cost(cc real, cmx real, cmy real, cx real, cy real);
      table minCost(cc real, cmx real, cmy real, cx real, cy real);
      /*
        (cmx, cmy) (ix, iy)  selected mediod   -- Oi in the paper
        (cx, cy)   (hx, hy)  unselected object -- Oh in the paper
      */
      initialize:iterate:
      {
        insert into cost
          select allCost(x, y, ix, iy, hx, hy)
          from setofpoints;
      }
      terminate:
      {
        insert into minCost
          select mymin(cc, cmx, cmy, cx, cy) from cost;
        delete from cost where 1=1;
        update mediod
          set mx = (select cx from minCost where cc<0),
              my = (select cy from minCost where cc<0);
        select updMediod(mx, my, x, y)
          from mediod, setofpoints
          where SQLCODE = 1 and ((mx <> x) or (my <> y));
      }
    }





    load from pam.input into temp;

    select addid(x,y)
    from temp1;

    insert into pointSize
      select count(x)
      from setofpoints;

    insert into stdout select id, x, y from setofpoints;

    insert into i values(0),(0),(0);

    select randSel(psize)
    from i, pointSize;

    select updMediod(mx, my, x, y)
    from mediod, setofpoints
    where mx <> x or my <> y;

    insert into stdout select mx,my from mediod;


Experiment


To test the efficiency of the DBSCAN implementation on ATLaS, we use the SEQUOIA 2000
benchmark data. The SEQUOIA 2000 benchmark database uses real data sets that are typical for
Earth Science tasks. There are four types of data in the database: raster data, point data,
polygon data and directed graph data. The point data set contains 62,584 Californian names of
landmarks, extracted from the US Geological Survey’s Geographic Names Information System,
together with their location.


The data set looks like this:

    -1651760,-833648,Corral Creek Campground
    -1853558,-861151,Corral De Piedra
    -1828216,-922899,Corral De Quati
    -1956635,-565741,Corral De Tierra (Palomares)
    -1953782,-569635,Corral De Tierra (Vasquez)
    -1920767,-690536,Corral Del Tierra (McCobb)
    ......




Even though we are not using an R-tree index in our current experiment, the result is still
satisfactory. Currently, since ATLaS doesn't support a large integer data type, we use the real
data type to store the coordinates, which is another source of latency that could be improved.


    Points        3910    5213    6256    62584
    In paper        11      16      18      233
    On ATLaS       180     300     400      107

Fig. 2 Comparison of DBSCAN running time



It’s interesting to note that the last experiment, which has the most points, is the fastest
in our system. The reason is that we use global values of MinPoints and Eps. If the number of
points is large enough, there are fewer clusters, so fewer calls of ExpandCluster() may be
involved.

ATLaS Improvement Proposal

In the sections above, we described the application of the ATLaS system to clustering
algorithms and the experimental results. UDAs benefit developers a lot. However, during our
implementation, we found that the following suggestion might improve the system’s flexibility
and power.

Embedded C Standard

The idea of embedding SQL in a host language such as C is neither new nor exciting. But
consider it the other way around!


ATLaS currently conforms to the SQL syntax standard. SQL syntax is easy to write and
understand, but sometimes it is not flexible and powerful enough, especially for algorithms
that involve iteration or other C-language concepts, which is very common.


The reasons for embedding standard C into ATLaS are:

1) ATLaS would become more powerful and flexible with embedded C.

Recall the implementation example for PAM. If we need to store a variable within an aggregate
implementation, we have to create a table with an attribute:



    aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real, r2 real, r3 real, r4 real, r5 real)
    {
      table cost(cost real);
      ......
    }



Even if this is considered acceptable, think about how to implement an iteration. In our
example, we use an indirect way, recursion:

    aggregate updMediod(ix real, iy real, hx real, hy real):int
    {
      ......
      initialize:iterate:
      {
        ......
      }
      terminate:
      {
        ......
        select updMediod(mx, my, x, y)
          from mediod, setofpoints
          where SQLCODE = 1 and ((mx <> x) or (my <> y));
      }
    }


This is obviously neither a straightforward nor a good approach, but SQL doesn't provide any
means to do iteration. If we could embed C code into ATLaS code, however, all these problems
could be solved by single C statements. Thus, the developer could write ATLaS programs in a
more powerful and efficient way.


2) Not much overhead will be added.

ATLaS is built on BerkeleyDB. ATLaS code is first compiled into a C code object file, which
then makes use of BerkeleyDB's API. Therefore, every ATLaS file has a related C code object
file. If we embed C code into ATLaS code, it is not hard for the system to "move" it from the
ATLaS file to the C object file. The overhead for this will be small.


Conclusion

In this report, we discussed two clustering algorithms, the partitioning algorithm PAM and the
density-based algorithm DBSCAN, and their implementation on ATLaS. Using the user-defined
aggregates provided by the ATLaS system, we found it convenient to implement these clustering
algorithms. A spatial index structure called the R-tree would significantly improve
performance. Even though we are not using an R-tree index in our current experiment, the
result is still satisfactory.


Our future work will focus on improving the ATLaS system. During our implementation, we found
that embedding C might improve the system for the following reasons:

1) ATLaS would become more powerful and flexible with embedded C;

2) Not much overhead will be added.


Reference

1. Ester M., Kriegel H.-P., Sander J. and Xu X. 1996. “A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise”. Proc. 2nd Int. Conf. on Knowledge
Discovery and Data Mining. Portland, OR, 226-231.

2. Raghu Ramakrishnan, Johannes Gehrke, “Database Management Systems (Second Edition)”,
McGraw-Hill Companies, Inc.

3. Beckmann N., Kriegel H.-P., Schneider R. and Seeger B. 1990. “The R*-tree: An Efficient and
Robust Access Method for Points and Rectangles”. Proc. ACM SIGMOD Int. Conf. on Management of
Data. Atlantic City, NJ, 322-331.

4. Jain A.K. and Dubes R.C. 1988. “Algorithms for Clustering Data”. New Jersey: Prentice Hall.

5. Sander J., Ester M., Kriegel H.-P., Xu X.: Density-Based Clustering in Spatial Databases:
The Algorithm GDBSCAN and its Applications, in: Data Mining and Knowledge Discovery, an Int.
Journal, Kluwer Academic Publishers, Vol. 2, No. 2, 1998, pp. 169-194.

6. Haixun Wang, Carlo Zaniolo: Database System Extensions for Decision Support: the AXL
Approach. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000:
11-20.

7. Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data
Mining. VLDB 1994: pp. 144-155.