Clustering Algorithms Implementation on ATLaS

CS240B Project Report
Richard Luo
Prof. Carlo Zaniolo
2002/6
Abstract
In this project, we discuss clustering algorithms in spatial data mining, such as the
partitioning algorithm PAM and the density-based algorithm DBSCAN. We illustrate implementations
of both on ATLaS, a database system built around User-Defined Aggregates (UDAs). With UDAs, it is
convenient to implement such clustering algorithms. A spatial index structure called the R-tree
would significantly improve the performance. Experiments with real data from SEQUOIA 2000 show
that the implementation of these algorithms on ATLaS is satisfactory even in the absence of an
R-tree index. However, some improvements to ATLaS would benefit the development of these
clustering algorithms as well as other general data mining algorithms. An ATLaS system
improvement proposal is presented at the end.
Introduction
Knowledge discovery is becoming more and more important in spatial databases, since increasingly
large amounts of data obtained from satellite images, X-ray crystallography or other automatic
equipment are stored in spatial databases. Several types of clustering algorithms have been
proposed in the last few years, such as:
1) Partitioning Algorithm: construct various partitions, then evaluate them by some criterion
2) Hierarchy Algorithm: create a hierarchical decomposition of the set of data (or objects)
using some criterion
3) Density-based Algorithm: based on local connectivity and density functions
In this report, we discuss the partitioning algorithm PAM and the density-based algorithm DBSCAN.
Their implementation on ATLaS, a database system built around User-Defined Aggregates (UDAs), is
illustrated. With UDAs, it is convenient to implement such clustering algorithms. A spatial
index structure called the R-tree would significantly improve the performance. Experiments with
real data from SEQUOIA 2000 show that the implementation of these algorithms on ATLaS is
satisfactory even in the absence of an R-tree index.
Finally, we discuss some improvements to ATLaS which may benefit the development of
these clustering algorithms as well as other general data mining algorithms. An ATLaS system
improvement proposal is presented at the end.
Clustering Algorithms
DBSCAN
The key idea of a density-based cluster is that for each point of a cluster its Eps-neighborhood
for some given Eps > 0 has to contain at least a minimum number of points, i.e. the “density” in
the Eps-neighborhood of points has to exceed some threshold. Furthermore, the density within
the areas of noise is lower than the density in any of the clusters.
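The density condition above can be sketched in a few lines of Python. This is a minimal illustration, assuming 2-D points and Euclidean distance; the function names eps_neighborhood and is_core_point, and the parameter min_pts, are hypothetical and do not come from the ATLaS implementation.

```python
from math import hypot

def eps_neighborhood(points, p, eps):
    """Return every point within distance eps of p (p itself included)."""
    return [q for q in points if hypot(q[0] - p[0], q[1] - p[1]) <= eps]

def is_core_point(points, p, eps, min_pts):
    """p satisfies the density condition if its Eps-neighborhood
    contains at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Four points packed together and one isolated point (made-up data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
dense = is_core_point(points, (0.0, 0.0), eps=1.5, min_pts=4)
sparse = is_core_point(points, (10.0, 10.0), eps=1.5, min_pts=4)
```

With these made-up points, the point in the dense group meets the threshold while the isolated point does not.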
This idea of “density-based clusters” can be generalized in two important ways. First, we can use
any notion of a neighborhood instead of an Eps-neighborhood if the definition of the
neighborhood is based on a binary predicate which is symmetric and reflexive. Second, instead
of simply counting the objects in a neighborhood of an object we can as well use other measures
to define the “cardinality” of that neighborhood.
A naive approach could require for each object in a density-connected set that the weighted
cardinality of the NPred-neighborhood of that object has at least a value MinCard. However, this
approach fails because there may be two kinds of objects in a density-connected set: objects
inside (core objects) and objects “on the border” of the density-connected set (border objects).
In general, an NPred-neighborhood of a border object has a significantly lower wCard than an
NPred-neighborhood of a core object. Therefore, we would have to set the value MinCard to a
relatively low value in order to include all objects belonging to the same density-connected set.
This value, however, will not be characteristic for the respective density-connected set,
particularly in the presence of noise objects. Therefore, for every object p in a
density-connected set C there must be an object q in C so that p is inside of the
NPred-neighborhood of q and the weighted cardinality wCard of NPred(q) is at least MinCard. We
also require the objects of the set C to be somehow “connected” to each other.
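The distinction between core objects and border objects can be sketched as follows. This is only an illustration under stated assumptions: n_pred is any symmetric, reflexive binary predicate, w_card is any function that maps a neighborhood to a number, and the helper names and the example data are hypothetical.

```python
def n_pred_neighborhood(objects, q, n_pred):
    """All objects o for which the neighborhood predicate n_pred(q, o) holds."""
    return [o for o in objects if n_pred(q, o)]

def is_core_object(objects, q, n_pred, w_card, min_card):
    """q is a core object if the weighted cardinality of NPred(q) reaches min_card."""
    return w_card(n_pred_neighborhood(objects, q, n_pred)) >= min_card

def classify(objects, p, n_pred, w_card, min_card):
    """Label p as 'core', 'border' (inside the neighborhood of some core object),
    or 'noise' (neither)."""
    if is_core_object(objects, p, n_pred, w_card, min_card):
        return "core"
    for q in n_pred_neighborhood(objects, p, n_pred):
        if is_core_object(objects, q, n_pred, w_card, min_card):
            return "border"
    return "noise"

# Hypothetical 1-D example: neighbors are values at distance <= 1,
# and the cardinality is a plain count.
objects = [0, 1, 2, 3, 10]
near = lambda a, b: abs(a - b) <= 1
labels = {o: classify(objects, o, near, len, 3) for o in objects}
```

Here 0 and 3 have only two objects in their own neighborhoods, so they fail the MinCard test themselves, but each lies in the neighborhood of a core object and is therefore kept as a border object.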
PAM
PAM (Partitioning Around Medoids) was developed by Kaufman and Rousseeuw. To find k clusters,
PAM's approach is to determine a representative object for each cluster. This
representative object, called a medoid, is meant to be the most centrally located object within the
cluster. Once the medoids have been selected, each nonselected object is grouped with the
medoid to which it is the most similar. More precisely, if Oj is a nonselected object, and Oi is a
(selected) medoid, we say that Oj belongs to the cluster represented by Oi if
d(Oj; Oi) = minOe d(Oj; Oe), where the notation minOe denotes the minimum over all
medoids Oe, and the notation d(Oa; Ob) denotes the dissimilarity or distance between objects
Oa and Ob. All the dissimilarity values are given as inputs to PAM. Finally, the quality of
a clustering (i.e. the combined quality of the chosen medoids) is measured by the average
dissimilarity between an object and the medoid of its cluster.
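The assignment rule and the quality measure can be sketched as follows. This is a minimal illustration, assuming 2-D points with Euclidean distance as the dissimilarity d, and assuming the average is taken over all objects (medoids contribute a distance of 0); the function names are hypothetical.

```python
from math import hypot

def dist(a, b):
    """Euclidean distance, standing in for the dissimilarity d(Oa; Ob)."""
    return hypot(a[0] - b[0], a[1] - b[1])

def assign(objects, medoids):
    """Map each object to the medoid it is most similar to."""
    return {o: min(medoids, key=lambda m: dist(o, m)) for o in objects}

def average_dissimilarity(objects, medoids):
    """Quality of a clustering: mean distance between an object and its medoid."""
    clusters = assign(objects, medoids)
    return sum(dist(o, m) for o, m in clusters.items()) / len(objects)

# Made-up data: two groups on a line, one medoid chosen in each.
objects = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
medoids = [(0.0, 0.0), (10.0, 0.0)]
quality = average_dissimilarity(objects, medoids)
```

A lower value means the chosen medoids represent their clusters better.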
To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step, a
swap between a selected object Oi and a nonselected object Oh is made, as long as such a swap
would result in an improvement of the quality of the clustering. In particular, to calculate the
effect of such a swap between Oi and Oh, PAM computes costs Cjih for all nonselected objects
Oj. Depending on which of the following cases Oj is in, Cjih is defined by one of the equations
below:
First Case: suppose Oj currently belongs to the cluster represented by Oi. Furthermore, let Oj be
more similar to Oj2 than to Oh, i.e. d(Oj; Oh) >= d(Oj; Oj2), where Oj2 is the second most
similar medoid to Oj. Thus, if Oi is replaced by Oh as a medoid, Oj would belong to the cluster
represented by Oj2. Hence, the cost of the swap as far as Oj is concerned is:
Cjih = d(Oj; Oj2) - d(Oj; Oi)
This equation always gives a nonnegative Cjih, indicating that there is a nonnegative cost
incurred in replacing Oi with Oh.
Second Case: Oj currently belongs to the cluster represented by Oi. But this time, Oj is less
similar to Oj2 than to Oh, i.e. d(Oj; Oh) < d(Oj; Oj2). Then, if Oi is replaced by Oh, Oj would
belong to the cluster represented by Oh. Thus, the cost for Oj is given by:
Cjih = d(Oj; Oh) - d(Oj; Oi)
Cjih here can be positive or negative, depending on whether Oj is more similar to Oi or to Oh.
Third Case: suppose that Oj currently belongs to a cluster other than the one represented by Oi.
Let Oj2 be the representative object of that cluster. Furthermore, let Oj be more similar to Oj2
than to Oh. Then even if Oi is replaced by Oh, Oj would stay in the cluster represented by Oj2.
Thus, the cost is:
Cjih = 0
Fourth Case: Oj currently belongs to the cluster represented by Oj2. But Oj is less similar to Oj2
than to Oh. Then replacing Oi with Oh would cause Oj to jump to the cluster of Oh from that of
Oj2. Thus, the cost is:
Cjih = d(Oj; Oh) - d(Oj; Oj2)
and is always negative.
Combining the four cases above, the total cost of replacing Oi with Oh is given by:
TCih = sum of Cjih over all nonselected objects Oj
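The four cases and their sum can be sketched as follows. This is an illustration under stated assumptions: points are 2-D tuples, the dissimilarity d is Euclidean distance, and the names swap_cost and total_swap_cost are hypothetical. As in the definition above, the sum runs over the currently nonselected objects only.

```python
from math import hypot

def dist(a, b):
    """Euclidean distance, standing in for the dissimilarity d(Oa; Ob)."""
    return hypot(a[0] - b[0], a[1] - b[1])

def swap_cost(o_j, o_i, o_h, medoids):
    """Cjih: change in o_j's distance to its medoid if o_i is swapped for o_h."""
    current = min(medoids, key=lambda m: dist(o_j, m))
    d_h = dist(o_j, o_h)
    if current == o_i:
        # Oj2 is the second most similar medoid (infinity if o_i is the only one).
        d_j2 = min((dist(o_j, m) for m in medoids if m != o_i),
                   default=float("inf"))
        if d_h >= d_j2:
            return d_j2 - dist(o_j, o_i)   # first case: o_j moves to Oj2
        return d_h - dist(o_j, o_i)        # second case: o_j moves to Oh
    d_j2 = dist(o_j, current)              # here Oj2 is o_j's current medoid
    if d_h >= d_j2:
        return 0.0                         # third case: o_j stays with Oj2
    return d_h - d_j2                      # fourth case: o_j jumps to Oh

def total_swap_cost(objects, medoids, o_i, o_h):
    """TCih: sum of Cjih over all nonselected objects o_j."""
    return sum(swap_cost(o_j, o_i, o_h, medoids)
               for o_j in objects if o_j not in medoids)

# Made-up data: swapping medoid (0, 0) for (1, 0) brings two objects closer.
objects = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
medoids = [(0.0, 0.0), (10.0, 0.0)]
tc = total_swap_cost(objects, medoids, o_i=(0.0, 0.0), o_h=(1.0, 0.0))
```

A negative total indicates that the swap improves the clustering under this measure.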
We now present Algorithm PAM.
1. Select k representative objects arbitrarily.
2. Compute TCih for all pairs of objects (Oi, Oh) where Oi is currently selected, and Oh is
not.
3. Select the pair (Oi, Oh) with the minimum TCih over all such pairs. If the minimum TCih is
negative, replace Oi with Oh, and go back to Step (2).
4. Otherwise, for each nonselected object, find the most similar representative object.
Halt.
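The four steps can be sketched as a short Python loop. This is a simplified illustration, not the ATLaS implementation: for brevity it evaluates each candidate swap by recomputing the total dissimilarity of the whole clustering instead of summing the per-object costs Cjih, it takes the first k objects as the arbitrary initial selection, and it assumes 2-D points with Euclidean distance. All names are hypothetical.

```python
from math import hypot

def dist(a, b):
    return hypot(a[0] - b[0], a[1] - b[1])

def total_dissimilarity(objects, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist(o, m) for m in medoids) for o in objects)

def pam(objects, k):
    medoids = list(objects[:k])                 # step 1: arbitrary selection
    while True:
        base = total_dissimilarity(objects, medoids)
        best_delta, best_swap = 0.0, None
        for i in range(len(medoids)):           # step 2: evaluate every swap
            for o_h in objects:
                if o_h in medoids:
                    continue
                candidate = medoids[:i] + [o_h] + medoids[i + 1:]
                delta = total_dissimilarity(objects, candidate) - base
                if delta < best_delta:
                    best_delta, best_swap = delta, (i, o_h)
        if best_swap is None:                   # no swap has negative cost
            break
        i, o_h = best_swap                      # step 3: apply the best swap
        medoids[i] = o_h
    # Step 4: attach every object to its most similar medoid.
    clusters = {o: min(medoids, key=lambda m: dist(o, m)) for o in objects}
    return medoids, clusters

# Made-up data: two groups of three points on a line.
objects = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
           (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
medoids, clusters = pam(objects, 2)
```

Because every accepted swap strictly lowers the total dissimilarity and there are finitely many medoid sets, the loop terminates.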
R*-tree spatial index
In the following, we introduce a typical spatial index, the R*-tree. The R*-tree generalizes
the 1-dimensional B-tree to d-dimensional data spaces; specifically, an R*-tree manages
d-dimensional hyperrectangles instead of 1-dimensional keys. An R*-tree may organize
extended objects such as polygons using minimum bounding rectangles (MBRs) as
approximations, as well as point objects as a special case of rectangles. The leaves store the
MBRs of the data objects and a pointer to the exact geometry of the polygons.
Internal nodes store a sequence of pairs consisting of a rectangle and a pointer to a child node.
These rectangles are the MBRs of all data or directory rectangles stored in the subtree having the
referenced child node as its root. To answer a region query, starting from the root, the set of
rectangles intersecting the query region is determined and then their referenced child nodes are
searched until the data pages are reached.
Fig 1. R*-tree
The height of an R*-tree is O(log n) for a database of n objects in the worst case, and a query
with a “small” query region has to traverse only a limited number of paths in the R*-tree.
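The region-query traversal described above can be sketched with a hand-built two-level tree. This is only an illustration of the search, assuming rectangular query regions and point data; it does not implement R*-tree insertion, splitting or reinsertion, and the Node class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                      # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)    # child nodes (internal node)
    points: list = field(default_factory=list)      # data points (leaf node)

def intersects(a, b):
    """True if rectangles a and b overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def region_query(node, query):
    """Collect all points inside the query rectangle, descending only into
    subtrees whose MBR intersects the query region."""
    if not intersects(node.mbr, query):
        return []
    if not node.children:                           # leaf: test the points
        return [p for p in node.points
                if query[0] <= p[0] <= query[2] and query[1] <= p[1] <= query[3]]
    found = []
    for child in node.children:
        found.extend(region_query(child, query))
    return found

# Hand-built example tree with two leaves under one root (made-up data).
left = Node((0, 0, 2, 2), points=[(0, 0), (1, 1), (2, 2)])
right = Node((10, 10, 12, 12), points=[(10, 10), (12, 12)])
root = Node((0, 0, 12, 12), children=[left, right])
hits = region_query(root, (0.5, 0.5, 2.5, 2.5))
```

For this small query region only the left leaf is visited; the right subtree is pruned because its MBR does not intersect the query.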
Implementation
Although ATLaS is still in its alpha stage and provides only basic functionality, we still
find it convenient for implementing these clustering algorithms. User-Defined Aggregates (UDAs)
provide a one-scan approach and flexible access to the database. In this section, we
describe the implementation of two clustering algorithms on ATLaS: DBSCAN and PAM.
DBSCAN
table setofpoints (x real, y real, ClId real);
/* meaning of ClId: -1: unclassified, 0: noise, 1,2,3...: cluster */
table nextid (ClusterId real);
table seeds (sx real, sy real);

insert into nextid values (1);
load from dbscan.input into temp;
insert into setofpoints
select x, y, -1
from temp;

select ExpandCluster(x, y, ClusterId, 1000, 4)
from setofpoints, nextid
where ClId<=0;
The table setofpoints stores the coordinates and cluster ids of all points read from the input file
dbscan.input. After initializing the cluster id of every point to -1, the program calls the major
aggregate of this algorithm, ExpandCluster(), to expand the cluster from any point (x,y). We use
the global parameters MinPoints of 4 and Eps of 1000.
The regionQuery() aggregate returns the Eps-neighborhood of point (qx,qy):
aggregate regionQuery(qx real, qy real, eps real):(r1 real, r2 real)
{
  INITIALIZE: ITERATE:
  {
    INSERT INTO return
    select x, y from setofpoints
    where (x-qx)*(x-qx) + (y-qy)*(y-qy) <= eps*eps;
  }
}
In changeClId(), points which have been marked as NOISE may be changed later, if they
are density-reachable from some other point of the database. This happens for border points of a
cluster. Those points are not added to the seeds because we already know that a point with a ClId
of NOISE is not a core point. Adding those points to the seeds would only result in additional
region queries which would yield no new answers.
aggregate changeClId (sx real, sy real, ClusterId real, Eps real, MinPts real):real
{
  table result (rx real, ry real);
  table resultsize (size real);
  initialize: iterate:
  {
    insert into result select regionQuery(sx, sy, Eps);
    insert into resultsize select count(rx) from result;
    insert into seeds select rx, ry from result
    where (select size from resultsize)>=MinPts
    and (select ClId from setofpoints where x=result.rx and y=result.ry)=-1;
    update setofpoints set ClId=ClusterId where SQLCODE=1
    and exists (select rx,ry from result) and (ClId=-1 or ClId=0);
    delete from seeds where seeds.sx=sx and seeds.sy=sy;
    delete from resultsize where 1=1;
  }
}
AGGREGATE ExpandCluster (ex real, ey real, ClusterId real, Eps real, MinPts real):real
{
  table seedssize (size real);
  initialize: iterate:
  {
    insert into seeds select regionQuery (ex, ey, Eps);
    insert into seedssize select count(sx) from seeds;
    /* insert into stdout select ex, ey, size from seedssize; */
    update setofpoints set ClId=0
    where exists (select sx from seeds where sx=setofpoints.x and sy=setofpoints.y)
    and (select size from seedssize)<MinPts;
    update setofpoints set ClId=ClusterId
    where exists (select sx from seeds where sx=setofpoints.x and sy=setofpoints.y)
    and SQLCODE=0;
    update nextid set ClusterId=ClusterId+1 where SQLCODE=1;
    delete from seeds where sx=ex and sy=ey and SQLCODE=1;
    select changeClId (sx, sy, ClusterId, Eps, MinPts) from seeds
    where SQLCODE=1;
    delete from seedssize where 1=1;
    delete from seeds where 1=1;
  }
}
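For readers unfamiliar with ATLaS, the control flow that the aggregates above are meant to express can be sketched in Python. This follows the textbook DBSCAN loop rather than translating the ATLaS code line by line; it assumes distinct 2-D points, Euclidean distance, and a linear scan in place of an index, and all names are hypothetical. The ClId convention (-1 unclassified, 0 noise, 1, 2, 3... cluster) matches the listing above.

```python
from math import hypot

UNCLASSIFIED, NOISE = -1, 0

def region_query(points, p, eps):
    """Eps-neighborhood of p by linear scan (no spatial index)."""
    return [q for q in points if hypot(q[0] - p[0], q[1] - p[1]) <= eps]

def dbscan(points, eps, min_pts):
    cl_id = {p: UNCLASSIFIED for p in points}
    cluster_id = 1
    for p in points:
        if cl_id[p] != UNCLASSIFIED:
            continue
        seeds = region_query(points, p, eps)
        if len(seeds) < min_pts:            # p is not a core point
            cl_id[p] = NOISE
            continue
        for q in seeds:                     # start a new cluster from p
            cl_id[q] = cluster_id
        seeds = [q for q in seeds if q != p]
        while seeds:                        # expand the cluster
            current = seeds.pop(0)
            result = region_query(points, current, eps)
            if len(result) < min_pts:
                continue                    # border point: do not expand
            for q in result:
                if cl_id[q] == UNCLASSIFIED:
                    seeds.append(q)         # may itself be a core point
                if cl_id[q] in (UNCLASSIFIED, NOISE):
                    cl_id[q] = cluster_id   # noise can become a border point
        cluster_id += 1
    return cl_id

# Made-up data: two dense squares and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (20.0, 20.0), (21.0, 20.0), (20.0, 21.0), (21.0, 21.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_pts=4)
```

As in changeClId(), points already marked as noise are relabeled but never added to the seeds, since they are known not to be core points.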
PAM
table setofpoints (id int, x real, y real);
table pointSize (psize int);
table temp (x real, y real, name char(30));
table temp1 (x real, y real);
table mediod (mx real, my real);
table i (i int);
aggregate randSel(size int):int
{
  table randNo(no real);
  initialize: iterate:
  {
    insert into randNo values(rand()*size);
    insert into mediod
    select x, y
    from setofpoints, randNo
    where id-1 < no and no <= id;
    delete from randNo where 1=1;
  }
}
AGGREGATE addid(ax real, ay real) : int
{
  TABLE tmp(i int);
  INITIALIZE :
  {
    INSERT INTO tmp VALUES(1);
    INSERT INTO setofpoints values(1, ax, ay);
  }
  ITERATE :
  {
    UPDATE tmp SET i=i+1;
    INSERT INTO setofpoints
    SELECT i, ax, ay FROM tmp;
  }
}
aggregate mymin(c real, mx real, my real, x real, y real):(r1 real, r2 real, r3 real, r4 real, r5 real)
{
  table minCost(cc real, cmx real, cmy real, cx real, cy real);
  initialize:
  {
    insert into minCost values(c, mx, my, x, y);
  }
  iterate:
  {
    update minCost
    set cc=c, cmx=mx, cmy=my, cx=x, cy=y
    where c<cc;
  }
  terminate:
  {
    insert into return
    select cc, cmx, cmy, cx, cy
    from minCost;
  }
}
aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real, r2 real, r3 real, r4 real, r5 real)
{
  table cost(cost real);
  initialize:
  {
  }
  iterate:
  {
    update cost
    set cost = cost +
      sqrt((jx-hx)*(jx-hx)+(jy-hy)*(jy-hy)) -
      sqrt((jx-ix)*(jx-ix)+(jy-iy)*(jy-iy));
  }
  terminate:
  {
    insert into return
    select cost, ix, iy, hx, hy from cost;
  }
}
aggregate updMediod(ix real, iy real, hx real, hy real):int
{
  table cost(cc real, cmx real, cmy real, cx real, cy real);
  table minCost(cc real, cmx real, cmy real, cx real, cy real);
  /*
    (cmx, cmy) (ix, iy): selected medoid, Oi in the paper
    (cx, cy) (hx, hy): unselected object, Oh in the paper
  */
  initialize: iterate:
  {
    insert into cost
    select allCost(x, y, ix, iy, hx, hy)
    from setofpoints;
  }
  terminate:
  {
    insert into minCost
    select mymin(cc, cmx, cmy, cx, cy) from cost;
    delete from cost where 1=1;
    update mediod
    set mx = (select cx from minCost where cc<0),
        my = (select cy from minCost where cc<0);
    select updMediod(mx, my, x, y)
    from mediod, setofpoints
    where SQLCODE = 1 and ((mx <> x) or (my <> y));
  }
}
load from pam.input into temp;
select addid(x,y)
from temp1;
insert into pointSize
select count(x)
from setofpoints;
insert into stdout select id, x, y from setofpoints;
insert into i values(0),(0),(0);
select randSel(psize)
from i, pointSize;
select updMediod(mx, my, x, y)
from mediod, setofpoints
where mx <> x or my <> y;
insert into stdout select mx, my from mediod;
Experiment
To test the efficiency of the DBSCAN implementation on ATLaS, we use the SEQUOIA
2000 benchmark data. The SEQUOIA 2000 benchmark database uses real data sets that
are typical for Earth Science tasks. There are four types of data in the database: raster data,
point data, polygon data and directed graph data. The point data set contains 62,584 Californian
names of landmarks, extracted from the US Geological Survey’s Geographic Names Information
System, together with their location.
The data set looks like this:
-1651760,-833648,Corral Creek Campground
-1853558,-861151,Corral De Piedra
-1828216,-922899,Corral De Quati
-1956635,-565741,Corral De Tierra (Palomares)
-1953782,-569635,Corral De Tierra (Vasquez)
-1920767,-690536,Corral Del Tierra (McCobb)
......
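As a small illustration of this record layout, the sketch below parses such lines into (x, y, name) tuples. It assumes the coordinates are signed and comma-separated and that only the name field may contain further commas; the function name parse_points is hypothetical. Coordinates are converted to floating point, mirroring the use of the real data type in the ATLaS tables.

```python
sample = """-1651760,-833648,Corral Creek Campground
-1853558,-861151,Corral De Piedra
-1828216,-922899,Corral De Quati
"""

def parse_points(text):
    """Parse 'x,y,name' lines into (float, float, str) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue                          # skip blank lines
        x, y, name = line.split(",", 2)       # split on the first two commas only
        rows.append((float(x), float(y), name.strip()))
    return rows

rows = parse_points(sample)
```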
Even though we are not using an R-tree index in our current experiment, the result is still
satisfactory. Currently, since ATLaS doesn't support a large integer data type, we use the real
data type to store the data, which is another source of latency that could be improved.
Points      3910    5213    6256    62584
In paper      11      16      18      233
On ATLaS     180     300     400      107
Fig. 2 Comparison of DBSCAN running time
It’s interesting to note that the last experiment, which has the most points, is the fastest in
our system. The reason is that we use global values of MinPoints and Eps. If the number of points
is large enough, there are fewer clusters, so fewer calls of ExpandCluster() may be involved.
ATLaS Improvement Proposal
In the sections above, we described the application of the ATLaS system to clustering algorithms
and the experiment results. UDAs benefit developers a lot. However, during
our implementation process, we found that the following suggestion might improve the
system’s flexibility and power.
Embedded C Standard
The idea of embedded SQL called by a host language such as C is neither new nor exciting. But
think it over the other way!
ATLaS currently conforms to the SQL syntax standard. SQL syntax is easy to write and
understand. But sometimes it's not flexible and powerful enough, especially for
algorithms containing iteration or other C-language concepts, which are very common!
The reasons for embedding C into ATLaS are:
1) ATLaS would become more powerful and flexible with embedded C;
Recall the implementation example for PAM. If we need to store a variable within an
aggregate implementation, we have to create a table and an attribute:
aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real, r2 real, r3 real, r4 real, r5 real)
{
  table cost(cost real);
  ......
}
Well, if you think this is still acceptable, think about how you would implement an iteration.
In our example, we use an indirect way, recursion:
aggregate updMediod(ix real, iy real, hx real, hy real):int
{
  ......
  initialize: iterate:
  {
    ......
  }
  terminate:
  {
    ......
    select updMediod(mx, my, x, y)
    from mediod, setofpoints
    where SQLCODE = 1 and ((mx <> x) or (my <> y));
  }
}
This is obviously not a straightforward or good approach. But SQL doesn't provide
any means to do iteration. However, if we embed C code into ATLaS code, all these
problems can be solved by single C statements. Thus, the developer can write ATLaS
programs in a more powerful and efficient way.
2) Not much overhead will be added.
ATLaS is built on BerkeleyDB. ATLaS code is first compiled into a C object
file, which then makes use of BerkeleyDB's API. Therefore, every ATLaS file has a related C
object file. If we embed C code into ATLaS code, it's not hard for the system to "move"
it from the ATLaS file to the C object file. The overhead for this will be small.
Conclusion
In this report, we discussed two clustering algorithms, the partitioning algorithm PAM and the
density-based algorithm DBSCAN, and their implementation on ATLaS. By using the user-defined
aggregates provided by the ATLaS system, we find it convenient to implement these clustering
algorithms. A spatial index structure called the R-tree would significantly improve the
performance. Even though we are not using an R-tree index in our current experiment, the result
is still satisfactory.
Our future work will focus on improving the ATLaS system. During our implementation
process, we found that embedding C might improve the system for the following
reasons:
1) ATLaS would become more powerful and flexible with embedded C;
2) Not much overhead will be added.
Reference
1. Ester M., Kriegel H.-P., Sander J. and Xu X. 1996. “A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise”. Proc. 2nd Int. Conf. on Knowledge
Discovery and Data Mining. Portland, OR, 226-231.
2. Ramakrishnan R. and Gehrke J. “Database Management Systems (Second Edition)”.
McGraw-Hill Companies, Inc.
3. Beckmann N., Kriegel H.-P., Schneider R. and Seeger B. 1990. “The R*-tree: An Efficient
and Robust Access Method for Points and Rectangles”. Proc. ACM SIGMOD Int. Conf. on
Management of Data. Atlantic City, NJ, 322-331.
4. Jain A.K. and Dubes R.C. 1988. “Algorithms for Clustering Data”. New Jersey: Prentice Hall.
5. Sander J., Ester M., Kriegel H.-P. and Xu X. 1998. “Density-Based Clustering in Spatial
Databases: The Algorithm GDBSCAN and its Applications”. Data Mining and Knowledge Discovery, an
Int. Journal, Kluwer Academic Publishers, Vol. 2, No. 2, 169-194.
6. Wang H. and Zaniolo C. 2000. “Database System Extensions for Decision Support: the AXL
Approach”. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery, 11-20.
7. Ng R.T. and Han J. 1994. “Efficient and Effective Clustering Methods for
Spatial Data Mining”. VLDB, 144-155.