Probabilistic Analysis of the RNN-CLINK Clustering Algorithm


S.D. Lang, L.-J. Mao, W.-L. Hsu
School of Computer Science
University of Central Florida
Orlando, FL 32816

ABSTRACT


Clustering is among the oldest techniques used in data mining applications. Typical implementations of the hierarchical agglomerative clustering methods (HACM) require O(N^2) space when there are N data objects, making such algorithms impractical for problems involving large datasets. The well-known clustering algorithm RNN-CLINK requires only O(N) space but O(N^3) time in the worst case, although the average time appears to be O(N^2 log N). We provide a probabilistic interpretation of the average-time complexity of the algorithm. We also report experimental results, using randomly generated bit vectors and NETNEWS articles as the input, to support our theoretical analysis.


Keywords: clustering, nearest neighbor, reciprocal nearest neighbor, complete link, probabilistic analysis.



1. INTRODUCTION


Cluster analysis, or clustering, is a multivariate statistical technique which identifies groupings of the data objects based on the inter-object similarities computed by a chosen distance metric [3, 10, 13]. Clustering methods can be divided into two types: partitional and hierarchical. The partitional clustering methods start with a single group that contains all the objects, then proceed to divide the objects into a fixed number of clusters. Many early clustering applications used partitional clustering due to its computational efficiency when large datasets need to be processed. However, partitional clustering methods require prior specification of the number of clusters as an input parameter. Also, the partitions obtained are often strongly dependent on the order in which the objects are processed [7].


The hierarchical agglomerative clustering methods (HACM) cluster the objects into a hierarchy of nested clusters. Starting with the individual objects as separate clusters, the HACMs successively merge the most similar pair of clusters into a higher-level cluster, until a single cluster is left as the root of the hierarchy. HACMs are also known as unsupervised classification, because of their ability to automatically discover the inter-object similarity patterns. Clustering techniques have been applied in a variety of engineering and scientific fields such as biology, psychology, computer vision, data compression, information retrieval, and more recently, data mining [3, 4, 8, 10].


In order to cluster the objects in a data set, a distance metric is used to quantify the degree of association (i.e., similarity) between objects and clusters. Once a distance metric is defined between the objects in the data set, variations of the clustering algorithms exist depending on how the inter-cluster distance is computed. The most commonly used HACMs are the single link, complete link, group average link, and Ward's methods, where the single link and complete link methods represent the two extremes [7]. In the single link method, the distance between two clusters is the shortest distance between two objects in different clusters; in the complete link method, the distance between two clusters is the longest distance between two objects in different clusters. Several studies reported in the literature [7, 18] indicate superior clustering accuracy of the complete link method, which is the focus of the present paper.
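To make the two linkage definitions concrete, here is a minimal Python sketch (our illustration, not code from the paper) that computes both inter-cluster distances for an arbitrary object-level metric:

```python
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2, d=euclidean):
    """Single link: shortest distance between objects in different clusters."""
    return min(d(x, y) for x, y in product(c1, c2))

def complete_link(c1, c2, d=euclidean):
    """Complete link: longest distance between objects in different clusters."""
    return max(d(x, y) for x, y in product(c1, c2))

# Example: two small 2-D clusters.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
print(single_link(a, b))    # 3.0
print(complete_link(a, b))  # 9.0
```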


Typical implementations of the complete link clustering method require O(N^2) space with N data objects, making the algorithm infeasible for large data mining applications. One exception is the RNN-CLINK algorithm, which uses only O(N) space but O(N^3) time in the worst case, although it has been noticed that the algorithm's average time complexity appears to be O(N^2 log N) [1]. In this paper, we present a probabilistic interpretation of this behavior, supported by experimental results.

The remainder of the paper is organized as follows: Section 2 describes the most relevant work. In Section 3, we briefly describe the implementation of the RNN-CLINK algorithm, and present a probabilistic analysis of the algorithm's average-time complexity. Section 4 reports the experimental results, using randomly generated bit vectors and NETNEWS articles as the input, to support our theoretical analysis. Section 5 offers the conclusion and points out directions for further research.


2. RELATED WORK


The first complete link algorithm reported in the literature is the CLINK algorithm by Defays [6]. The algorithm requires O(N^2) time and O(N) space, so it is quite efficient. However, its output depends on a certain node insertion order to generate the exact complete link hierarchy. Also, according to the experiments performed by El-Hamdouchi and Willett [7], Defays's CLINK algorithm gave unsatisfactory results in some information retrieval applications. Voorhees proposed an O(N^2)-space, O(N^3)-time algorithm [18] in which she first computed the distances between the N given objects, then sorted them by applying the bucket-sort algorithm. The result is an ordered list of triplets (i, j, dist) in which i and j are object ids, and dist is the distance between objects i and j. The algorithm then proceeds to construct the clusters using these sorted distance triplets. Since the distances between all pairs of objects are processed in ascending order, two clusters of sizes m_i and m_j can be merged as soon as the (m_i * m_j)-th distance triplet between objects in the respective clusters is reached, because at that point every pairwise distance between the two clusters, including the largest, has been seen.
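To illustrate the merge rule, the following Python code is our sketch of this idea, not Voorhees's original implementation; Python's built-in sort stands in for bucket sort, and a union-find structure tracks cluster membership:

```python
from itertools import combinations

def voorhees_complete_link(points, d):
    """Sketch of the sorted-triplet merge rule (our reading of [18]).

    Processes object-pair distances in ascending order and merges two
    clusters when all size_a * size_b pairs between them have been seen.
    Returns the merge history as (root_a, root_b, distance) triples.
    """
    n = len(points)
    parent = list(range(n))            # union-find over cluster ids
    size = [1] * n
    seen = {}                          # (root_a, root_b) -> pairs seen so far

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Sorted distance triplets (dist, i, j); built-in sort replaces bucket sort.
    triplets = sorted((d(points[i], points[j]), i, j)
                      for i, j in combinations(range(n), 2))
    merges = []
    for dist, i, j in triplets:
        a, b = find(i), find(j)
        if a == b:
            continue
        key = (min(a, b), max(a, b))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] == size[a] * size[b]:   # the (m_i * m_j)-th triplet
            seen.pop(key)
            parent[b] = a                    # merge cluster b into a
            size[a] += size[b]
            for old in [k for k in seen if b in k]:
                c = old[0] if old[1] == b else old[1]
                new = (min(a, c), max(a, c))
                seen[new] = seen.get(new, 0) + seen.pop(old)
            merges.append((a, b, dist))
    return merges
```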


Using priority queues with O(N^2) work space, Day and Edelsbrunner [5] and Aichholzer [2] showed that the complete link hierarchy can be obtained in O(N^2 log N) time. By applying the idea of the reciprocal nearest neighbors (RNN) with the Lance-Williams updating formula [12], an O(N^2)-space and O(N^2)-time algorithm was described by Murtagh [14, 15]. In this algorithm, the initial distance computation between all clusters takes O(N^2) time, and the distance update after each merge operation takes O(N) time. Since there are a total of N - 1 such cluster-merge operations, the total time complexity is O(N^2). On the other hand, for the special case when the objects are points in the 2-D plane and the distance is the Euclidean metric, Krznaric and Levcopoulos developed an O(N log^2 N)-time clustering algorithm [11]. There is still no general algorithm for the complete link clustering method that takes less than O(N^3) time while using only O(N) space.


3. THE RNN-CLINK ALGORITHM


In this section, we first explain the definition and properties of the reciprocal nearest neighbors (RNN); we then present an implementation of the RNN-CLINK algorithm for the complete link method. Several ideas similar to nearest neighbors have been studied in the literature. These include the reciprocal nearest neighbor (RNN) technique used in agglomerative clustering [14, 15], and the mutual nearest neighbor (MNN) [9] and pairwise nearest neighbor (PNN) [16] techniques in clustering and other types of applications.


For any given connected graph, a nearest neighbor chain (NN-chain, see Figure 1) can be obtained by starting from an arbitrary node c_1, finding the nearest neighbor of c_1, call it c_2 = NN(c_1), then finding the nearest neighbor of c_2, call it c_3 = NN(c_2), etc. It is easy to see that any NN-chain is a subset of a minimum spanning tree (MST). Since the inter-node distances are monotonically decreasing along the chain, an NN-chain must end in a pair of nodes that are nearest neighbors of each other.






c_1 -> c_2 -> c_3 -> c_4 -> ... -> c_{k-2} -> c_{k-1} <-> c_k

Figure 1. A nearest neighbor chain and reciprocal nearest neighbors (RNN); the symbol -> represents NN, and the symbol <-> represents RNN.



The following property of the nearest neighbor chains (NN-chains) in a connected graph with a distance metric is known [14, 15]:

Let d(c_x, c_y) denote the distance between node c_x and node c_y. The distances along an NN-chain consisting of nodes c_1, c_2, ..., c_p and c_q are monotonically decreasing, i.e., d(c_1, c_2) > d(c_2, c_3) > d(c_3, c_4) > ... > d(c_p, c_q) = d(c_q, c_p). An NN-chain must end at some point, and the last pair of nodes on this chain are nearest neighbors of each other, called Reciprocal Nearest Neighbors (RNN).
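The chain construction is straightforward to implement. The following Python sketch is ours; it assumes a symmetric, tie-free distance function, under which the distances strictly decrease along the chain and termination is guaranteed:

```python
def nn_chain(points, start, d):
    """Follow nearest-neighbor links from points[start] until an RNN pair.

    Assumes d is symmetric and, for simplicity, free of ties.
    Returns the chain of indices; its last two entries are an RNN pair.
    """
    def nearest(i):
        # Nearest neighbor of point i among all other points.
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: d(points[i], points[j]))

    chain = [start]
    while True:
        nxt = nearest(chain[-1])
        if len(chain) >= 2 and nxt == chain[-2]:
            return chain            # chain[-2] and chain[-1] are RNNs
        chain.append(nxt)
```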






3.1. Implementation of the RNN-CLINK Algorithm


In the following, we briefly describe our implementation of the complete link method based on the RNN technique (Figures 2 and 3). We also explain two speed-up measures of the implementation. The RNN-CLINK algorithm (Figure 2) first initializes each node as a single-node cluster. In each iteration of the while-loop, every cluster c_i computes its nearest neighbor NN(c_i) (Steps 3-6), and, based on these nearest neighbor results, every cluster creates an NN-chain which terminates with an RNN pair. The algorithm then merges the two clusters of each RNN pair and updates the cluster structure (Steps 7-12). The while-loop terminates when there is only one cluster left.



Figure 2. Algorithm RNN-CLINK

Figure 3. Algorithm COMPUTE-INTER-CLUSTER-DISTANCE


Algorithm RNN-CLINK

Input: G = (V, E), a distance metric d(v, w) defined on V x V
Output: Cluster structure C

1.  Initialize: C = { c_i | 1 <= i <= N } = { v_i | 1 <= i <= N } = V
2.  While (|C| > 1) {
3.      For all c_i in C {
4.          COMPUTE-INTER-CLUSTER-DISTANCE(c_i)
5.          FIND-NN(c_i)
6.      }
7.      For all c_i in C {
8.          if clusters c_i and NN(c_i) form an RNN pair {
9.              MERGE-CLUSTER(c_i, NN(c_i))
10.             SORT-CLUSTER(C, c_i, NN(c_i))
11.         }
12.     }
13. }



Algorithm COMPUTE-INTER-CLUSTER-DISTANCE(c_i)

Input: G = (V, E), a distance metric d(v, w) defined on V x V, cluster structure C
Output: The distances between cluster c_i and each of the other clusters

1.  For all c_j in C - { c_i } {
2.      delta(c_i, c_j) = 0
3.      For all v_x in c_i, v_y in c_j {
4.          if d(v_x, v_y) > delta(c_i, c_j) then delta(c_i, c_j) = d(v_x, v_y)
5.      }
6.  }
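For concreteness, the following compact Python rendering of Figures 2 and 3 is our own sketch of the pseudocode; it keeps clusters as plain lists of object indices and omits the node-index array and the speed-ups discussed below:

```python
def rnn_clink(objects, d):
    """O(N)-space complete link clustering via RNN pairs (Figures 2 and 3).

    Clusters are lists of object indices; returns the merge history.
    """
    clusters = [[i] for i in range(len(objects))]       # Step 1: singletons

    def delta(ci, cj):
        # Figure 3: complete link distance = longest inter-object distance.
        return max(d(objects[x], objects[y]) for x in ci for y in cj)

    merges = []
    while len(clusters) > 1:                            # Step 2
        # Steps 3-6: nearest neighbor (a cluster index) for every cluster.
        nn = [min((j for j in range(len(clusters)) if j != i),
                  key=lambda j: delta(clusters[i], clusters[j]))
              for i in range(len(clusters))]
        # Steps 7-12: merge every RNN pair; at least one always exists.
        merged, survivors = set(), []
        for i in range(len(clusters)):
            j = nn[i]
            if nn[j] == i and i < j:                    # an RNN pair, once
                survivors.append(clusters[i] + clusters[j])  # MERGE-CLUSTER
                merges.append((clusters[i], clusters[j]))
                merged.update((i, j))
        survivors.extend(c for k, c in enumerate(clusters) if k not in merged)
        clusters = survivors
    return merges
```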

We now analyze the time complexity of the RNN-CLINK algorithm. First, we consider Step 4 of the algorithm, which computes the inter-cluster distances. Assuming the distance between two objects can be computed in O(1) time, the initial computation of the distances between N objects takes O(N^2) time, i.e., N(N - 1)/2 distances. When the objects form clusters, the distance between two clusters c_q and c_r is defined as: delta(c_q, c_r) = MAX{ d(x_i, x_j) | x_i in c_q and x_j in c_r }. Since RNN-CLINK computes the minimum delta(c_q, c_r) for each cluster c_q, i.e., finding c_q's nearest neighbor, and RNN-CLINK operates under an O(N) space constraint, the distance computation between clusters must be done in a cluster-by-cluster fashion. An easy way to accomplish this is to maintain an array of node indexes to keep track of the node-cluster relationship. After each cluster-merge operation, we move the node indexes in this array so that the nodes of the same cluster have their indexes next to each other. As a result, the distance computation between clusters can be done in O(N^2) time in each iteration of the loop.
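A sketch of this bookkeeping (our illustration; the function and variable names are ours) repacks the node-index array with a counting pass, so each cluster occupies a contiguous slice in O(N) time:

```python
def repack(index, membership, num_clusters):
    """Repack the node-index array so each cluster is a contiguous slice.

    index: length-N array of node ids, rewritten in place
    membership: node id -> cluster number in [0, num_clusters)
    Returns a dict: cluster number -> start of its slice in index.
    Runs in O(N + num_clusters) time using a counting pass.
    """
    n = len(membership)
    counts = [0] * num_clusters
    for v in range(n):
        counts[membership[v]] += 1
    starts, pos = {}, 0
    for c in range(num_clusters):
        starts[c] = pos
        pos += counts[c]
    cursor = [starts[c] for c in range(num_clusters)]
    for v in range(n):                 # counting-sort placement of node ids
        c = membership[v]
        index[cursor[c]] = v
        cursor[c] += 1
    return starts
```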


The purpose of the MERGE-CLUSTER function (Step 9 of Figure 2) is to re-label and update the cluster hierarchy, while the purpose of SORT-CLUSTER (Step 10) is to sort the node indexes based on their cluster number. Thus, the computation time of each iteration of the while-loop is O(N^2). Since the algorithm finds at least one RNN pair in each iteration, the number of iterations is O(N), proving that the overall time complexity of RNN-CLINK is O(N^3). The space complexity of RNN-CLINK is O(N), because there are only a constant number of O(N) arrays used in the algorithm.


Two speed-up measures can be incorporated in the implementation of the RNN-CLINK algorithm. First, if cluster c_i and NN(c_i) are not merged in the current iteration of the while-loop, NN(c_i) would still be the nearest neighbor of c_i for the next iteration, thus saving the time to recompute NN(c_i). Second, the computation of NN(c_i) involves keeping track of the value of a variable min-clust-dist (the minimum cluster distance found so far) and computing the distance between all pairs of the clusters, i.e., delta(c_i, c_k) for 1 <= k != i <= n. However, while computing an inter-cluster distance delta(c_i, c_k) can take up to |c_i| * |c_k| inter-object distance computations, this process can safely stop once a distance value between two objects, x in c_i and y in c_k, is found to be greater than min-clust-dist. Although the savings in inter-object computations are not known in general, our experimental results show that this speed-up measure is a key factor in reducing the execution time of the RNN-CLINK algorithm from the worst-case O(N^3) time to close to O(N^2 log N) on average.
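A hedged Python sketch of this second measure (the function name and the None convention are ours):

```python
def delta_bounded(ci, cj, objects, d, min_clust_dist):
    """Complete link distance delta(c_i, c_j) with early termination.

    Returns None as soon as any inter-object distance exceeds
    min_clust_dist: delta(c_i, c_j) would then be at least that large,
    so c_j can no longer be c_i's nearest neighbor. Otherwise returns
    the exact complete link distance.
    """
    worst = 0.0
    for x in ci:
        for y in cj:
            dist = d(objects[x], objects[y])
            if dist > min_clust_dist:
                return None        # prune the rest of the |c_i|*|c_j| pairs
            worst = max(worst, dist)
    return worst
```

In FIND-NN(c_i), min-clust-dist would start at infinity and be lowered to each successfully computed distance, so candidate clusters examined later are pruned after fewer inter-object computations.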


3.2. Probabilistic Analysis of the RNN Structure


We assume there is a (finite or infinite) universe of points, from which we draw a finite sample S = {x_1, x_2, ..., x_n}. These sample points are used to model the objects and the clusters in the RNN-CLINK algorithm. To make the problem more tractable, we assume the sample points are chosen with uniform probability from the universe, and the points are chosen with replacement, i.e., after point x_i is selected, it is returned to the universe before the next point x_{i+1} is selected. Thus, it is possible to have x_i = x_j for i != j in our sample, although the probability of this occurrence will be negligible for the ranges of sample sizes we will be considering. The main reason that this replacement model is used is its simplicity in performing probabilistic analysis. We assume a distance metric is defined between the points of the universe.


Definition 1. A distance metric d(x_i, x_j) is defined for any points x_i and x_j, satisfying the following properties:

1. d(x_i, x_j) >= 0, and d(x_i, x_j) = 0 iff x_i = x_j.
2. d(x_i, x_j) = d(x_j, x_i).


Definition 2. A point x_j is called a nearest neighbor of point x_i, denoted x_j = NN(x_i), in a set of n points x_1, x_2, ..., x_n, if d(x_i, x_j) <= d(x_i, x_k) for all k != i and 1 <= k <= n.

Note that a point x_i may have multiple nearest neighbors when equal distances occur. In that case, we would still use NN(x_i) to denote any of these nearest neighbors of x_i, by abuse of notation.


Definition 3. Two points x_i and x_j are called reciprocal nearest neighbors (RNNs) if x_i = NN(x_j) and x_j = NN(x_i); that is, if d(x_i, x_j) <= d(x_i, x_k) and d(x_i, x_j) <= d(x_j, x_k), for all k, 1 <= k <= n, k != i and k != j. Such a pair of points x_i and x_j is called an RNN pair.


Definition 4. A nearest-neighbor chain (NN chain) is a sequence of points x_{i_1}, x_{i_2}, ..., x_{i_m} such that x_{i_{k+1}} = NN(x_{i_k}) for 1 <= k <= (m - 1), and x_{i_{m-1}} = NN(x_{i_m}). Thus, an NN chain consists of a sequence of points that ends with an RNN pair.


We now state the following fact precisely in terms of our notation and terminology [14, 15].



Lemma 1. Let S = {x_1, x_2, ..., x_n} be a set of n points chosen from an arbitrary space such that a distance function is defined between any pair of the points. An NN chain exists starting with any point x_i, provided that when forming an NN chain x_{i_1}, x_{i_2}, ..., if x_{i_{j-1}} is one of the nearest neighbors of x_{i_j}, due to equal distances, the NN chain chooses x_{i_{j-1}} as NN(x_{i_j}) and terminates the chain.


Our key result is the following theorem, which relates the RNN structure to certain conditional probability measures.


Theorem 1. For 1 <= k <= n, let S_k = {x_1, x_2, ..., x_k} denote a set of k random points chosen from a universe with replacement. Let P_k denote the probability that an arbitrary point x_i is involved in an RNN pair in the set S_k; that is, P_k = Pr[x_i = NN(x_j) | x_j = NN(x_i)]. Then the following properties hold:

(1) The expected number of RNN pairs in S_n = {x_1, x_2, ..., x_n} is n P_n / 2.

(2) If P_k >= P for 1 <= k <= n, where P, 0 < P < 1, is a constant independent of n, then the expected length of an NN chain in S_n = {x_1, x_2, ..., x_n} is at most 1/P^2.


Proof: (1) Let V_i denote the following random variable:

V_i = 1, if point x_i is involved in an RNN pair in S_n; V_i = 0, otherwise.

Then the expected number of RNN pairs in S_n is (1/2) Σ_{i=1}^{n} E(V_i), because each RNN pair is counted twice in the summation Σ_{i=1}^{n} E(V_i). Notice that

E(V_i) = 1 * Pr[V_i = 1] + 0 * Pr[V_i = 0], by definition
       = Pr[x_i is involved in an RNN pair]
       = P_n, by definition of P_n.

Substituting E(V_i) = P_n into the preceding summation yields the formula of (1).


(2) Let L denote the length of an NN chain. By Lemma 1, L <= n - 1.

Thus, E[L] = Σ_{k=1}^{n-1} k * Pr[L = k], by definition of E[L].

Let x_i be the starting node of an NN chain. Notice that Pr[L = 1] = Pr[x_i is involved in an RNN pair in S_n] = P_n.

Similarly, Pr[L = 2] = (1 - P_n) P_{n-1}, because L = 2 means x_i is not involved in an RNN pair, which has probability 1 - P_n. Also, L = 2 means the node NN(x_i), call it x_j, is involved in an RNN pair in the node set S_n - {x_i}. The probability for x_j to be involved in an RNN pair is P_{n-1}, because the node set S_n - {x_i} contains (n - 1) random points with replacement. Thus, Pr[L = 2] = (1 - P_n) P_{n-1}, because the two events {x_i is not involved in an RNN pair} and {x_j = NN(x_i) is involved in an RNN pair} are independent.

More generally, Pr[L = k] = (1 - P_n)(1 - P_{n-1})(1 - P_{n-2}) ... (1 - P_{n-k+2}) P_{n-k+1}, for 2 <= k <= n - 1, by the same argument.

Notice that Pr[L = k] <= (1 - P)^{k-1}, because each 1 - P_j <= 1 - P for n - k + 2 <= j <= n, and P_{n-k+1} <= 1.

Therefore,

E[L] = Σ_{k=1}^{n-1} k * Pr[L = k] <= Σ_{k=1}^{n-1} k (1 - P)^{k-1} <= Σ_{k=1}^{∞} k (1 - P)^{k-1} = 1/P^2,

using the formula Σ_{k=1}^{∞} k x^{k-1} = 1/(1 - x)^2 if 0 < |x| < 1.

Notice that the proof of the theorem strongly depends on the randomness property of the points in the set S_n = {x_1, x_2, ..., x_n}. That is, starting with an arbitrary point x_i, the random sample assumption implies that its nearest neighbor x_j can be any of the remaining points with equal probability; similarly, the nearest neighbor of x_j is a random point among the points in S_n - {x_i, x_j}. In general, after forming a sequence of k nearest neighbors starting from point x_i, the random sample assumption implies that the remaining set of n - k points still forms a random set of n - k points.


We want to apply this random sample model to the analysis of the RNN-CLINK algorithm. Specifically, we want to prove that the average number of while-loop iterations is O(log N), which would imply the O(N^2 log N) average-time complexity of the algorithm. To this end, it suffices to prove that the average number of RNN pairs in a set of m clusters is >= Cm, for some constant C, throughout the entire looping process of the algorithm; since each merged RNN pair reduces the number of clusters by one, the cluster count then shrinks geometrically, and O(log N) iterations follow. This result follows from Theorem 1, if the conditional probability of an RNN is >= C, and if the randomness property holds among the clusters, throughout the entire looping process of the algorithm. Our empirical results reported in the next section support this claim, using the randomly generated bit vectors, and using the news articles available from NETNEWS, as the input objects for clustering. However, we are unable to prove either the validity of the randomness assumption or a constant lower bound for the conditional probabilities at the present time.
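Although a proof eludes us, the claim is easy to probe empirically. The following Python sketch is ours; it uses uniform random points in the unit square purely for illustration, and estimates the fraction of points involved in RNN pairs and the mean NN-chain length L of Theorem 1:

```python
import random

def estimate_rnn_stats(n, trials=20):
    """Estimate the fraction of points in RNN pairs and the mean
    NN-chain length L (in edges) for n uniform points in the unit square."""
    def d(p, q):
        # Squared Euclidean distance; monotone in the true metric,
        # so it yields the same nearest neighbors.
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    frac_sum, len_sum = 0.0, 0.0
    for _ in range(trials):
        pts = [(random.random(), random.random()) for _ in range(n)]
        nn = [min((j for j in range(n) if j != i),
                  key=lambda j: d(pts[i], pts[j])) for i in range(n)]
        frac_sum += sum(nn[nn[i]] == i for i in range(n)) / n
        chain = [random.randrange(n)]        # NN chain from a random start
        while len(chain) < 2 or nn[chain[-1]] != chain[-2]:
            chain.append(nn[chain[-1]])
        len_sum += len(chain) - 1            # length in edges, as in Theorem 1
    return frac_sum / trials, len_sum / trials

print(estimate_rnn_stats(500))
```

If the estimated fraction stays bounded below by some constant P, Theorem 1 predicts NN chains of expected length at most 1/P^2.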


4. EXPERIMENTAL RESULTS


In this section, we present the experimental results reporting the time complexity of the RNN-CLINK algorithm. We use the number of inter-object distance computations (NODC) as a measure of the execution time, mainly because this measure is independent of the computing platform and provides a good indication of the actual execution time. In the experiments, we implemented both RNN-CLINK and a well-known single-link clustering algorithm, SLINK [17]. The NODC of the single-link algorithm for clustering N objects is simply N(N - 1)/2, because the algorithm computes all inter-object distances once and uses O(N^2) space to save them.


Our first experiment is based on randomly selected news articles available on NETNEWS. We created 6 sets of test data, with the number of news articles ranging from 100 to 600, and the number of news groups ranging from 2 to 10. The news articles are first filtered through a stoplist and a stemmer, in order to eliminate the insignificant words (e.g., the, on, up) and to reduce the words to their stems. The distance formula used for measuring the similarity between two articles is the Dice coefficient with binary term weights [8]. For articles D_q and D_r with T_q and T_r terms, respectively, and t_qr common terms, the similarity between the two articles is given by Sim(D_q, D_r) = 2 * t_qr / (T_q + T_r). Table 1 presents the statistics comparing the NODC values of RNN-CLINK and SLINK. The column MOE (measure of effect) reports Hubert's Γ statistic, which measures the clustering accuracy using the news group classification as the basis. We notice that for RNN-CLINK, the ratio of the number of RNN pairs to the number of objects (articles) seems to have a lower bound of 0.25, at least for the first iteration of the clustering loop. Also, the total number of iterations appears to grow logarithmically.
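A small Python illustration of this similarity measure (our sketch; sets of stemmed terms stand in for the preprocessed articles, and the complementary distance 1 - Sim is our convention):

```python
def dice_similarity(terms_q, terms_r):
    """Dice coefficient with binary term weights: 2 * t_qr / (T_q + T_r)."""
    common = len(terms_q & terms_r)          # t_qr: shared terms
    return 2.0 * common / (len(terms_q) + len(terms_r))

def dice_distance(terms_q, terms_r):
    """Dissimilarity used for clustering: complement of the similarity."""
    return 1.0 - dice_similarity(terms_q, terms_r)

# Example with two tiny stemmed "articles".
print(dice_similarity({"cluster", "algorithm", "data"},
                      {"cluster", "data", "mine"}))   # 2*2/(3+3) ~ 0.667
```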



| No. of Articles | No. of News Groups | No. of Terms | While-loop Iterations | RNN Pairs in 1st Round | RNN-CLINK NODC's | RNN-CLINK MOE (Γ) | SLINK NODC's | SLINK MOE (Γ) |
|---|---|---|---|---|---|---|---|---|
| 100 | 2 | 2975 | 20 | 32 | 43516 | 0.34 | 4950 | 0.07 |
| 200 | 3 | 4879 | 21 | 65 | 157926 | 0.65 | 19900 | 0.49 |
| 300 | 5 | 6541 | 23 | 86 | 396724 | 0.51 | 44850 | 0.35 |
| 400 | 7 | 7587 | 25 | 110 | 737937 | 0.41 | 79800 | 0.14 |
| 500 | 8 | 8809 | 27 | 125 | 1214330 | 0.44 | 124750 | 0.12 |
| 600 | 10 | 9706 | 31 | 151 | 1712507 | 0.41 | 179700 | 0.10 |

* NODC: Number of Distance Computations.
* MOE (Γ): larger absolute values of Γ suggest better clustering performance.

Table 1. Experimental results of clustering news articles

Our second experiment uses randomly generated bit-vectors as the input data, where the Hamming distance is used to measure the dissimilarity between vectors. The dimension of the vectors is fixed at 200, and the number of objects (vectors) varies from 100 to 1000. In each case, we ran the RNN-CLINK clustering 100 times; Table 2 reports the average NODC's, and the average number of clustering loop iterations, for each of the cases. For the same reason given previously, the NODC value for SLINK is set to N(N - 1)/2. Similar to the results reported in Table 1, the average number of clustering loop iterations seems to grow logarithmically. Figure 4 plots the NODC values of the RNN-CLINK and SLINK algorithms, along with the curves of N^2 and N^2 log N. The figure clearly shows a growth pattern of O(N^2 log N) for the average-time complexity of RNN-CLINK. Therefore, our experimental results using two types of input data suggest that the conclusion of Theorem 1 concerning the average number of RNN pairs seems valid, even when the assumption on the random sample points is not necessarily satisfied.
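Generating such test data is straightforward; a minimal Python sketch (ours) of the bit-vector setup:

```python
import random

DIM = 200  # vector dimension used in the experiment

def random_bit_vector(dim=DIM):
    """A uniformly random bit vector, stored as a Python int."""
    return random.getrandbits(dim)

def hamming(u, v):
    """Hamming distance: number of positions where the bits differ."""
    return bin(u ^ v).count("1")

vectors = [random_bit_vector() for _ in range(100)]
print(hamming(vectors[0], vectors[1]))  # typically near DIM/2 = 100
```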



| No. of Nodes | RNN-CLINK Avg. No. of Iterations | RNN-CLINK Avg. NODC's | SLINK Avg. NODC's |
|---|---|---|---|
| 100 | 18 | 40021 | 4950 |
| 200 | 23 | 180678 | 19900 |
| 300 | 26 | 431559 | 44850 |
| 400 | 28 | 798095 | 79800 |
| 500 | 30 | 1281078 | 124750 |
| 600 | 31 | 1889095 | 179700 |
| 700 | 33 | 2622060 | 244650 |
| 800 | 34 | 3478650 | 319600 |
| 900 | 35 | 4423560 | 404550 |
| 1000 | 36 | 5537472 | 499500 |

* NODC: Number of Distance Computations.

Table 2. Experimental results of clustering random bit-vectors


[Figure: line chart titled "Comparison of Number of Distance Computations"; x-axis: number of bit-vectors (100 to 1000), y-axis: number of distance computations (0 to 8,000,000); curves shown for SLINK, RNN-CLINK, N^2, and N^2 log N.]

Figure 4. Experimental results of clustering random bit-vectors



5. CONCLUSION AND FURTHER RESEARCH


In this paper, we presented an implementation of the O(N)-space clustering algorithm RNN-CLINK using the reciprocal nearest neighbor technique. We also described two speed-up measures that improve the algorithm's execution time. We gave a probabilistic model using random sample points and assumptions about the RNN conditional probabilities to explain the algorithm's O(N^2 log N) average-time complexity. Our experimental results suggested the validity of these probabilistic assumptions. For further research, we would like to provide a mathematical proof of these assumptions.



REFERENCES

1. O. Aichholzer, "Clustering the hypercubes," SFB Report Series 93, TU-Graz, 1996.
2. O. Aichholzer and F. Aurenhammer, "Classifying hyperplanes in hypercubes," SIAM J. Discrete Math., 9 (2), pp. 225-232, May 1996.
3. M. R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
4. A. Berson and S. J. Smith, Data Warehousing, Data Mining, and OLAP, McGraw-Hill, New York, 1997.
5. W. H. E. Day and H. Edelsbrunner, "Efficient algorithms for agglomerative hierarchical clustering methods," Journal of Classification, 1 (1), pp. 7-24, 1984.
6. D. Defays, "An efficient algorithm for a complete link method," The Computer Journal, 20 (4), pp. 364-366, 1977.
7. A. El-Hamdouchi and P. Willett, "Comparison of hierarchical agglomerative clustering methods for document retrieval," The Computer Journal, 32 (3), 1989.
8. W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
9. K. C. Gowda and G. Krishna, "Agglomerative clustering using the concept of mutual nearest neighborhood," Pattern Recognition, 10 (2), pp. 105-112, 1978.
10. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
11. D. Krznaric and C. Levcopoulos, "Fast algorithms for complete linkage clustering," Discrete Comput. Geom., 19, pp. 131-145, 1998.
12. G. N. Lance and W. T. Williams, "A general theory of classificatory sorting strategies: I. Hierarchical systems," The Computer Journal, 9, pp. 373-380, 1967.
13. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, Inc., New York, 1990.
14. F. Murtagh, "A survey of recent advances in hierarchical clustering algorithms," The Computer Journal, 26 (4), pp. 354-359, 1983.
15. F. Murtagh, "Complexities of hierarchic clustering algorithms: State of the art," Computational Statistics Quarterly, 1, pp. 101-113, 1984.
16. J. Shanbehzadeh and P. O. Ogunbona, "On the computational complexity of the LBG and PNN algorithms," IEEE Trans. on Image Proc., 6 (4), pp. 614-617, 1997.
17. R. Sibson, "SLINK: An optimally efficient algorithm for the single-link cluster method," The Computer Journal, 16 (1), pp. 30-45, 1973.
18. E. M. Voorhees, "The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval," Ph.D. thesis, Cornell University, 1985.