Probabilistic Analysis of the RNN-CLINK Clustering Algorithm
S. D. Lang, L.-J. Mao, W.-L. Hsu
School of Computer Science
University of Central Florida
Orlando, FL 32816
ABSTRACT
Clustering is among the oldest techniques used in data mining applications. Typical implementations of the hierarchical agglomerative clustering methods (HACM) require $O(N^2)$ space when there are $N$ data objects, making such algorithms impractical for problems involving large datasets. The well-known clustering algorithm RNN-CLINK requires only $O(N)$ space but $O(N^3)$ time in the worst case, although the average time appears to be $O(N^2 \log N)$. We provide a probabilistic interpretation of the average-time complexity of the algorithm. We also report experimental results, using randomly generated bit vectors and NETNEWS articles as input, to support our theoretical analysis.
Keywords: Clustering, nearest neighbor, reciprocal nearest neighbor, complete link, probabilistic analysis.
1. INTRODUCTION
Cluster analysis, or clustering, is a multivariate statistical technique that identifies groupings of data objects based on the inter-object similarities computed by a chosen distance metric [3, 10, 13]. Clustering methods can be divided into two types: partitional and hierarchical. The partitional clustering methods start with a single group that contains all the objects, then proceed to divide the objects into a fixed number of clusters. Many early clustering applications used partitional clustering because of its computational efficiency when large datasets need to be processed. However, partitional clustering methods require the number of clusters to be specified in advance as an input parameter. Also, the partitions obtained are often strongly dependent on the order in which the objects are processed [7]. The hierarchical agglomerative clustering methods (HACM) attempt to cluster the objects into a hierarchy of nested clusters. Starting with the individual objects as separate clusters, the HACMs successively merge the most similar pair of clusters into a higher-level cluster, until a single cluster is left as the root of the hierarchy. HACMs are also known as unsupervised classification, because of their ability to discover the inter-object similarity patterns automatically. Clustering techniques have been applied in a variety of engineering and scientific fields such as biology, psychology, computer vision, data compression, information retrieval, and, more recently, data mining [3, 4, 8, 10].
In order to cluster the objects in a data set, a distance metric is used to quantify the degree of association (i.e., similarity) between objects and clusters. Once a distance metric is defined between the objects in the data set, variations of the clustering algorithms exist depending on how the inter-cluster distance is computed. The most commonly used HACMs are the single link, complete link, group average link, and Ward's methods, where the single link and complete link methods represent the two extremes [7]. In the single link method, the distance between two clusters is the shortest distance between two objects in different clusters; in the complete link method, the distance between two clusters is the longest distance between two objects in different clusters. Several studies reported in the literature [7, 18] indicate superior clustering accuracy of the complete link method, which is the focus of the present paper.
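The two inter-cluster distances just defined can be illustrated with a minimal Python sketch. The function names and the toy one-dimensional data are illustrative assumptions, not part of the original implementation; `d` stands for any object-level distance function.

```python
from itertools import product

def single_link(cluster_a, cluster_b, d):
    """Shortest distance between an object of cluster_a and an object of cluster_b."""
    return min(d(x, y) for x, y in product(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b, d):
    """Longest distance between an object of cluster_a and an object of cluster_b."""
    return max(d(x, y) for x, y in product(cluster_a, cluster_b))

def abs_diff(x, y):
    """Toy distance on numbers (an assumption made for this example only)."""
    return abs(x - y)

a, b = [0, 1], [4, 7]
print(single_link(a, b, abs_diff))    # 3, from the pair (1, 4)
print(complete_link(a, b, abs_diff))  # 7, from the pair (0, 7)
```

Both functions examine every cross-cluster pair, which is why computing one inter-cluster distance can cost as many object-level distance computations as the product of the two cluster sizes.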
Typical implementations of the complete link clustering method require $O(N^2)$ space with $N$ data objects, making the algorithm infeasible for large data mining applications. One exception is the RNN-CLINK algorithm, which uses only $O(N)$ space but $O(N^3)$ time in the worst case, although it has been noticed that the algorithm's average-time complexity appears to be $O(N^2 \log N)$ [1]. In this paper, we present a probabilistic interpretation of this behavior, supported by experimental results.
The remainder of the paper is organized as follows. Section 2 describes the most relevant work. In Section 3, we briefly describe the implementation of the RNN-CLINK algorithm and present a probabilistic analysis of the algorithm's average-time complexity. Section 4 reports the experimental results, using randomly generated bit vectors and NETNEWS articles as input, to support our theoretical analysis. Section 5 offers the conclusion and points out directions for further research.
2. RELATED WORK
The first complete link algorithm reported in the literature is the CLINK algorithm by Defays [6]. The algorithm requires $O(N^2)$ time and $O(N)$ space, so it is quite efficient. However, its output depends on a certain node insertion order to generate the exact complete link hierarchy. Also, according to the experiments performed by El-Hamdouchi and Willett [7], Defays's CLINK algorithm gave unsatisfactory results in some information retrieval applications. Voorhees proposed an $O(N^2)$-space, $O(N^3)$-time algorithm [18] in which she first computed the distances between the $N$ given objects, then sorted them by applying the bucket-sort algorithm. The result is an ordered list of triplets $(i, j, dist)$ in which $i$ and $j$ are object ids, and $dist$ is the distance between objects $i$ and $j$. The algorithm then proceeds to construct the clusters using these sorted distance triplets. Since the distances between all pairs of the clusters are processed in descending order, two clusters of size $m_i$ and $m_j$ can be merged as soon as the $(m_i m_j)$-th distance triplet of the objects in the respective clusters is reached.
Using priority queues with an $O(N^2)$ work space, Day and Edelsbrunner [5] and Aichholzer [2] showed that the complete link hierarchy can be obtained in $O(N^2 \log N)$ time. By applying the idea of the reciprocal nearest neighbors (RNN) with the Lance-Williams updating formula [12], an $O(N^2)$-space and $O(N^2)$-time algorithm was described by Murtagh [14, 15]. In this algorithm, the initial distance computation between all clusters takes $O(N^2)$ time, and the distance update after each merge operation takes $O(N)$ time. Since there are a total of $N - 1$ such cluster-merge operations, the total time complexity is $O(N^2)$.
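For reference, the Lance-Williams updating formula [12] expresses the distance from any other cluster $c_k$ to a newly merged cluster $c_i \cup c_j$ in terms of distances that are already stored. The general form and the standard coefficient choice for the complete link method are sketched below; the coefficient names $\alpha_i$, $\alpha_j$, $\beta$, $\gamma$ are the customary ones and are not used elsewhere in this paper.

```latex
\begin{align*}
d(c_k, c_i \cup c_j) &= \alpha_i\, d(c_k, c_i) + \alpha_j\, d(c_k, c_j)
                        + \beta\, d(c_i, c_j)
                        + \gamma\, \lvert d(c_k, c_i) - d(c_k, c_j) \rvert \\
\text{complete link: } & \alpha_i = \alpha_j = \tfrac{1}{2},\quad \beta = 0,\quad \gamma = \tfrac{1}{2} \\
\Rightarrow\quad d(c_k, c_i \cup c_j) &= \max\{\, d(c_k, c_i),\ d(c_k, c_j) \,\}
\end{align*}
```

Each merge therefore updates at most $N - 2$ stored distances at constant cost each, which is the $O(N)$ update step counted above. The update relies on the $O(N^2)$ table of stored inter-cluster distances, which is exactly the space cost that RNN-CLINK avoids.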
On the other hand, for special cases when the objects are points in the 2-D plane and the distance is the Euclidean metric, Krznaric and Levcopoulos developed an $O(N \log^2 N)$-time clustering algorithm [11]. There is still no general algorithm for the complete link clustering method that takes less than $O(N^3)$ time while using only $O(N)$ space.
3. THE RNN-CLINK ALGORITHM
In this section, we first explain the definition and properties of reciprocal nearest neighbors (RNN); we then present an implementation of the RNN-CLINK algorithm for the complete link method. Several ideas similar to nearest neighbors have been studied in the literature. These include the reciprocal nearest neighbor (RNN) technique used in agglomerative clustering [14, 15], and the mutual nearest neighbor (MNN) [9] and pairwise nearest neighbor (PNN) [16] techniques in clustering and other types of applications.
For any given connected graph, a nearest neighbor chain (NN-chain, see Figure 1) can be obtained by starting from an arbitrary node $c_1$, finding the nearest neighbor of $c_1$, call it $c_2$ = NN($c_1$), then finding the nearest neighbor of $c_2$, call it $c_3$ = NN($c_2$), etc. It is easy to see that any NN-chain is a subset of a minimum spanning tree (MST). Since the inter-node distances are monotonically decreasing along the chain, an NN-chain must end in a pair of nodes that are nearest neighbors of each other.

C_1 --NN--> C_2 --NN--> C_3 --NN--> C_4 --NN--> ... C_{k-2} --NN--> C_{k-1} <--RNN--> C_k
Figure 1. A Nearest Neighbor Chain and Reciprocal Nearest Neighbors (RNN). An arrow labeled NN points from a node to its nearest neighbor; the arrow labeled RNN joins a reciprocal nearest neighbor pair.

The following property of nearest neighbor chains (NN-chains) in a connected graph with a distance metric is known [14, 15]. Let $d(c_x, c_y)$ denote the distance between node $c_x$ and node $c_y$. The distances along an NN-chain consisting of nodes $c_1, c_2, \ldots, c_p$ and $c_q$ are monotonically decreasing, i.e., $d(c_1, c_2) > d(c_2, c_3) > d(c_3, c_4) > \cdots > d(c_p, c_q) = d(c_q, c_p)$. An NN-chain must end at some point, and the last pair of nodes on this chain are nearest neighbors of each other, called Reciprocal Nearest Neighbors (RNN).
3.1. Implementation of the RNN-CLINK Algorithm
In the following, we briefly describe our implementation of the complete link method based on the RNN technique (Figures 2 and 3). We also explain two speed-up measures of the implementation. The RNN-CLINK algorithm (Figure 2) first initializes each node as a single-node cluster. In each iteration of the while-loop, every cluster $c_i$ computes its nearest neighbor NN($c_i$) (Steps 3-6), and, based on these nearest neighbor results, every cluster creates an NN-chain which terminates with an RNN pair. The algorithm then merges the two clusters of each RNN pair and updates the cluster structure (Steps 7-12). The while-loop terminates when there is only one cluster left.
Algorithm RNN-CLINK
Input: G = (V, E), a distance metric d(v, w) defined on V × V
Output: Cluster structure C
1.  Initialize: C = {c_i | 1 ≤ i ≤ N} = {v_i | 1 ≤ i ≤ N} = V
2.  While (|C| > 1) {
3.      For all c_i ∈ C {
4.          COMPUTE-INTER-CLUSTER-DISTANCE(c_i)
5.          FIND-NN(c_i)
6.      }
7.      For all c_i ∈ C {
8.          if clusters c_i and NN(c_i) form an RNN pair {
9.              MERGE-CLUSTER(c_i, NN(c_i))
10.             SORT-CLUSTER(C, c_i, NN(c_i))
11.         }
12.     }
13. }
Figure 2. Algorithm RNN-CLINK

Algorithm COMPUTE-INTER-CLUSTER-DISTANCE(c_i)
Input: G = (V, E), a distance metric d(v, w) defined on V × V, cluster structure C
Output: The distances δ(c_i, c_j) between cluster c_i and each of the other clusters
1.  For all c_j ∈ C − {c_i} {
2.      δ(c_i, c_j) = 0
3.      For all v_x ∈ c_i, v_y ∈ c_j {
4.          if d(v_x, v_y) > δ(c_i, c_j) then δ(c_i, c_j) = d(v_x, v_y)
5.      }
6.  }
Figure 3. Algorithm COMPUTE-INTER-CLUSTER-DISTANCE
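The control flow of Figures 2 and 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in this paper: clusters are kept as tuples of object indexes instead of a regrouped index array, MERGE-CLUSTER and SORT-CLUSTER are replaced by rebuilding the cluster list, ties are broken toward the lowest cluster index, and neither speed-up measure is applied. The name `rnn_clink` and the toy data are hypothetical.

```python
def rnn_clink(points, d):
    """Return the merges as (cluster_a, cluster_b, distance) in merge order.

    Clusters are tuples of indexes into `points`. Only O(N) extra state is
    kept; every inter-cluster distance is recomputed from object distances.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def delta(a, b):
        # Complete-link distance: the longest object-to-object distance.
        return max(d(points[x], points[y]) for x in a for y in b)

    while len(clusters) > 1:
        # Steps 3-6: find the nearest neighbor of every cluster.
        nn = []
        for i, ci in enumerate(clusters):
            others = (j for j in range(len(clusters)) if j != i)
            nn.append(min(others, key=lambda j: delta(ci, clusters[j])))

        # Steps 7-12: merge every reciprocal nearest neighbor pair once.
        merged, used = [], set()
        for i, j in enumerate(nn):
            if i < j and nn[j] == i:
                merges.append((clusters[i], clusters[j],
                               delta(clusters[i], clusters[j])))
                merged.append(clusters[i] + clusters[j])
                used.update((i, j))
        clusters = [c for k, c in enumerate(clusters) if k not in used] + merged
    return merges


def abs_diff(x, y):
    """Toy distance on numbers, used only for this example."""
    return abs(x - y)

for a, b, dist in rnn_clink([0, 1, 5, 6, 20], abs_diff):
    print(a, b, dist)
```

With lowest-index tie-breaking, the cluster of smallest index among those realizing the globally smallest inter-cluster distance always forms an RNN pair with its nearest neighbor, so every pass of the loop performs at least one merge and the sketch terminates after $N - 1$ merges.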
We now analyze the time complexity of the RNN-CLINK algorithm. First, we consider Step 4 of the algorithm, which computes the inter-cluster distances. Assuming the distance between two objects can be computed in $O(1)$ time, the initial computation of the distances between $N$ objects takes $O(N^2)$ time, i.e., $N(N-1)/2$ distances. When the objects form clusters, the distance between two clusters $c_q$ and $c_r$ is defined as $\delta(c_q, c_r) = \max\{d(x_i, x_j) \mid x_i \in c_q \text{ and } x_j \in c_r\}$. Since RNN-CLINK computes the minimum $\delta(c_q, c_r)$ for each cluster $c_q$, i.e., finds $c_q$'s nearest neighbor, and RNN-CLINK operates under an $O(N)$ space constraint, the distance computation between clusters must be done in a cluster-by-cluster fashion. An easy way to accomplish this is to maintain an array of node indexes to keep track of the node-cluster relationship. After each cluster-merge operation, we move the node indexes in this array so that the nodes of the same cluster have their indexes next to each other. As a result, the distance computation between clusters can be done in $O(N^2)$ time in each iteration of the loop. The purpose of the MERGE-CLUSTER function (Step 9 of Figure 2) is to re-label and update the cluster hierarchy, while the purpose of SORT-CLUSTER (Step 10) is to sort the node indexes based on their cluster number. Thus, the computation time of each iteration of the while-loop is $O(N^2)$. Since the algorithm finds at least one RNN pair in each iteration, the number of iterations is $O(N)$, proving that the overall time complexity of RNN-CLINK is $O(N^3)$. The space complexity of RNN-CLINK is $O(N)$, because there are only a constant number of $O(N)$ arrays used in the algorithm.
Two speed-up measures can be incorporated in the implementation of the RNN-CLINK algorithm. First, if cluster $c_i$ and NN($c_i$) are not merged in the current iteration of the while-loop, NN($c_i$) would still be the nearest neighbor of $c_i$ for the next iteration, thus saving the time to recompute NN($c_i$). Second, the computation of NN($c_i$) involves keeping track of the value of a variable min-clust-dist (the minimum cluster-distance found so far) and computing the distance between all pairs of the clusters, i.e., $\delta(c_i, c_k)$ for $1 \le k \le n$, $k \ne i$. However, while computing an inter-cluster distance $\delta(c_i, c_k)$ can take up to $|c_i| \cdot |c_k|$ inter-object distance computations, this process can safely stop once a distance value between two objects, $x \in c_i$ and $y \in c_k$, is found to be greater than min-clust-dist. Although the savings in inter-object computations is not known in general, our experimental results show that this speed-up measure is a key factor in reducing the execution time of the RNN-CLINK algorithm from the worst-case $O(N^3)$ time to close to $O(N^2 \log N)$ on average.
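The second speed-up measure can be sketched as follows. The function names, the use of `None` as the "abandoned" signal, and the representation of clusters as lists of objects are assumptions made for illustration; the variable `min_clust_dist` plays the role of min-clust-dist in the text.

```python
def bounded_complete_link(cluster_a, cluster_b, d, bound):
    """Complete-link distance between two clusters, or None as soon as
    some object pair is farther apart than `bound` (early termination)."""
    longest = 0
    for x in cluster_a:
        for y in cluster_b:
            dist = d(x, y)
            if dist > bound:
                return None          # this cluster cannot be the nearest neighbor
            if dist > longest:
                longest = dist
    return longest

def find_nn(i, clusters, d):
    """Index and distance of the nearest neighbor of clusters[i]."""
    min_clust_dist, nn = float("inf"), None
    for k, ck in enumerate(clusters):
        if k == i:
            continue
        dist = bounded_complete_link(clusters[i], ck, d, min_clust_dist)
        if dist is not None and dist < min_clust_dist:
            min_clust_dist, nn = dist, k
    return nn, min_clust_dist

def abs_diff(x, y):
    """Toy distance on numbers, used only for this example."""
    return abs(x - y)

clusters = [[0, 1], [5, 6], [3]]
print(find_nn(0, clusters, abs_diff))  # (2, 3)
```

The earlier a close neighbor is found, the smaller the bound becomes and the sooner the remaining candidates are abandoned, so the savings depend on the data and on the order in which candidate clusters are visited.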
3.2. Probabilistic Analysis of the RNN Structure
We assume there is a (finite or infinite) universe of points, from which we draw a finite sample $S = \{x_1, x_2, \ldots, x_n\}$. These sample points are used to model the objects and the clusters in the RNN-CLINK algorithm. To make the problem more tractable, we assume the sample points are chosen with uniform probability from the universe, and the points are chosen with replacement, i.e., after point $x_i$ is selected, it is returned to the universe before the next point $x_{i+1}$ is selected. Thus, it is possible to have $x_i = x_j$ for $i \ne j$ in our sample, although the probability of its occurrence will be negligible for the ranges of the sample sizes we will be considering. The main reason that this replacement model is used is its simplicity in performing probabilistic analysis. We assume a distance metric is defined between the points of the universe.
Definition 1. A distance metric $d(x_i, x_j)$ is defined for any points $x_i$ and $x_j$, satisfying the following properties:
1. $d(x_i, x_j) \ge 0$, and $d(x_i, x_j) = 0$ iff $x_i = x_j$.
2. $d(x_i, x_j) = d(x_j, x_i)$.
Definition 2. A point $x_j$ is called a nearest neighbor of point $x_i$, denoted $x_j$ = NN($x_i$), in a set of $n$ points $x_1, x_2, \ldots, x_n$, if $d(x_i, x_j) \le d(x_i, x_k)$ for all $k \ne i$ and $1 \le k \le n$.
Note that a point $x_i$ may have multiple nearest neighbors when equal distances occur. In that case, we would still use NN($x_i$) to denote any of these nearest neighbors of $x_i$, by abuse of notation.
Definition 3. Two points $x_i$ and $x_j$ are called reciprocal nearest neighbors (RNNs) if $x_i$ = NN($x_j$) and $x_j$ = NN($x_i$). That is, if $d(x_i, x_j) \le d(x_i, x_k)$ and $d(x_i, x_j) \le d(x_j, x_k)$, for all $k$, $1 \le k \le n$, $k \ne i$ and $k \ne j$. Such a pair of points $x_i$ and $x_j$ is called an RNN pair.
Definition 4. A nearest-neighbor chain (NN chain) is a sequence of points $x_{i_1}, x_{i_2}, \ldots, x_{i_m}$ such that $x_{i_{k+1}}$ = NN($x_{i_k}$) for $1 \le k \le (m-1)$, and $x_{i_{m-1}}$ = NN($x_{i_m}$). Thus, an NN chain consists of a sequence of points that ends with an RNN pair.
We now state the following fact precisely in terms of our notation and terminology [14, 15].
Lemma 1. Let $S = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ points chosen from an arbitrary space such that a distance function is defined between any pair of the points. An NN chain exists starting with any point $x_i$, provided that when forming an NN chain $x_{i_1}, x_{i_2}, \ldots$, if $x_{i_{j-1}}$ is one of the nearest neighbors of $x_{i_j}$, due to equal distances, the NN chain chooses $x_{i_{j-1}}$ as NN($x_{i_j}$) and terminates the chain.
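The construction in Lemma 1, including its tie-breaking rule, can be sketched as follows. The function names and the toy one-dimensional data are assumptions made for illustration; points are referred to by their index in `points`.

```python
def nearest_neighbor(i, points, d, prefer=None):
    """Index of a nearest neighbor of points[i]. If `prefer` is among the
    nearest neighbors (equal distances), it is the one returned."""
    best, best_dist = None, float("inf")
    for k in range(len(points)):
        if k == i:
            continue
        dist = d(points[i], points[k])
        if dist < best_dist or (dist == best_dist and k == prefer):
            best, best_dist = k, dist
    return best

def nn_chain(points, d, start):
    """Follow nearest neighbors from `start` until an RNN pair is reached.
    The last two indexes of the returned chain form the RNN pair."""
    chain, prev = [start], None
    while True:
        cur = chain[-1]
        nxt = nearest_neighbor(cur, points, d, prefer=prev)
        if nxt == prev:              # cur and prev are nearest to each other
            return chain
        chain.append(nxt)
        prev = cur

def abs_diff(x, y):
    """Toy distance on numbers, used only for this example."""
    return abs(x - y)

points = [0, 10, 13, 14]
print(nn_chain(points, abs_diff, 0))  # [0, 1, 2, 3]: distances 10 > 3 > 1
print(nn_chain(points, abs_diff, 3))  # [3, 2]
```

Preferring the previous point on ties is what guarantees termination: whenever the chain is extended, the new distance is strictly smaller than the previous one, so no point can be visited twice.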
Our key result is the following theorem, which relates the RNN structure to certain conditional probability measures.
Theorem 1. For $1 \le k \le n$, let $S_k = \{x_1, x_2, \ldots, x_k\}$ denote a set of $k$ random points chosen from a universe with replacement. Let $P_k$ denote the probability that an arbitrary point $x_i$ is involved in an RNN pair in the set $S_k$; that is, $P_k = \Pr[x_i = NN(x_j) \mid x_j = NN(x_i)]$. Then the following properties hold:
(1) The expected number of RNN pairs in $S_n = \{x_1, x_2, \ldots, x_n\}$ is $\frac{P_n\, n}{2}$.
(2) If $P_k \ge P$ for $1 \le k \le n$, where $P$, $0 < P < 1$, is a constant independent of $n$, then the expected length of an NN chain in $S_n = \{x_1, x_2, \ldots, x_n\}$ is at most $\frac{1}{P^2}$.
Proof: (1) Let $V_i$ denote the following random variable:
$$V_i = \begin{cases} 1, & \text{if point } x_i \text{ is involved in an RNN pair in } S_n \\ 0, & \text{otherwise.} \end{cases}$$
Then the expected number of RNN pairs in $S_n$ is $\frac{1}{2} \sum_{i=1}^{n} E(V_i)$, because each RNN pair will be counted twice in the summation $\sum_{i=1}^{n} E(V_i)$. Notice that
$$E(V_i) = 1 \cdot \Pr[V_i = 1] + 0 \cdot \Pr[V_i = 0] = \Pr[x_i \text{ is involved in an RNN pair}] = P_n,$$
where the first equality holds by the definition of expectation and the last by the definition of $P_n$. Substituting $E(V_i) = P_n$ into the preceding summation yields the formula of (1).
(2) Let $L$ denote the length of an NN chain. By Lemma 1, $L \le n - 1$. Thus,
$$E[L] = \sum_{k=1}^{n-1} k \Pr[L = k],$$
by definition of $E[L]$.
Let $x_i$ be the starting node of an NN chain. Notice that $\Pr[L = 1] = \Pr[x_i \text{ is involved in an RNN pair in } S_n] = P_n$.
Similarly, $\Pr[L = 2] = (1 - P_n) P_{n-1}$, because $L = 2$ means $x_i$ is not involved in an RNN pair, which has probability $1 - P_n$. Also, $L = 2$ means the node NN($x_i$), call it $x_j$, is involved in an RNN pair in the node set $S_n - \{x_i\}$. The probability for $x_j$ to be involved in an RNN pair is $P_{n-1}$, because the node set $S_n - \{x_i\}$ contains $(n - 1)$ random points with replacement. Thus, $\Pr[L = 2] = (1 - P_n) P_{n-1}$ because the two events {$x_i$ is not involved in an RNN pair} and {$x_j$ = NN($x_i$) is involved in an RNN pair} are two independent events.
More generally,
$$\Pr[L = k] = (1 - P_n)(1 - P_{n-1})(1 - P_{n-2}) \cdots (1 - P_{n-k+2})\, P_{n-k+1}, \quad \text{for } 2 \le k \le n - 1,$$
by the same argument.
Notice that $\Pr[L = k] \le (1 - P)^{k-1}$ because each $1 - P_j \le 1 - P$ for $n - k + 1 < j \le n$, and $P_{n-k+1} \le 1$.
Therefore,
$$E[L] = \sum_{k=1}^{n-1} k \Pr[L = k] \le \sum_{k=1}^{n-1} k (1 - P)^{k-1} \le \sum_{k=1}^{\infty} k (1 - P)^{k-1} = \frac{1}{P^2},$$
using the formula $\sum_{k=1}^{\infty} k (1 - x)^{k-1} = \frac{1}{x^2}$ if $0 < |x| < 1$.
Notice that the proof of the theorem strongly depends on the randomness property of the points in the set $S_n = \{x_1, x_2, \ldots, x_n\}$. That is, starting with an arbitrary point $x_i$, the random sample assumption implies that its nearest neighbor $x_j$ can be any of the remaining points with an equal probability; similarly, the nearest neighbor of $x_j$ is a random point among the points in $S_n - \{x_i, x_j\}$. In general, after forming a sequence of $k$ nearest neighbors starting from point $x_i$, the random sample assumption implies that the remaining set of $n - k$ points still forms a random set of $n - k$ points.
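The closed form used in the last step of the proof of Theorem 1 (2) follows from the geometric series. The derivation below is a standard one, written with an auxiliary variable $y$ that does not appear in the proof itself.

```latex
\begin{align*}
\sum_{k=0}^{\infty} y^{k} &= \frac{1}{1-y}, \qquad |y| < 1 \\
\sum_{k=1}^{\infty} k\, y^{k-1} &= \frac{1}{(1-y)^{2}}
    \qquad \text{(differentiate both sides with respect to } y\text{)} \\
\sum_{k=1}^{\infty} k\,(1-x)^{k-1} &= \frac{1}{x^{2}}
    \qquad \text{(substitute } y = 1 - x,\ 0 < x < 1\text{)}
\end{align*}
```

As a purely arithmetic illustration, a constant $P = 1/2$ would bound the expected chain length by $1/P^2 = 4$, independently of $n$.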
We want to apply this random sample model to the analysis of the RNN-CLINK algorithm. Specifically, we want to prove that the average number of while-loop iterations is $O(\log N)$, which would imply the $O(N^2 \log N)$ average-time complexity of the algorithm, since each iteration takes $O(N^2)$ time. To this end, it suffices to prove that the average number of RNN pairs in a set of $m$ clusters is at least $Cm$, for some constant $C$, throughout the entire looping process of the algorithm. This result follows from Theorem 1, if the conditional probability of an RNN is bounded below by a constant, and if the randomness property holds among the clusters, throughout the entire looping process of the algorithm. Our empirical results reported in the next section support this claim, using randomly generated bit vectors and news articles available from NETNEWS as the input objects for clustering. However, we are unable to prove either the validity of the randomness assumption or a constant lower bound for the conditional probabilities at the present time.
4. EXPERIMENTAL RESULTS
In this section, we present the experimental results reporting the time complexity of the RNN-CLINK algorithm. We use the number of inter-object distance computations (NODC) as a measure of the execution time, mainly because this measure is independent of the computing platform and it provides a good indication of the actual execution time. In the experiments, we implemented both RNN-CLINK and a well-known single-link clustering algorithm, SLINK [17]. The NODC of the single-link algorithm for clustering $N$ objects is simply $N(N-1)/2$, because the algorithm computes all inter-object distances once and uses $O(N^2)$ space to save them.
Our first experiment is based on randomly selected news articles available on NETNEWS. We created 6 sets of test data, with the number of news articles ranging from 100 to 600, and the number of news groups ranging from 2 to 10. The news articles are first filtered through a stoplist and a stemmer, in order to eliminate the insignificant words (e.g., the, on, up) and to reduce the words to their stems. The distance formula used for measuring the similarity between two articles is the Dice coefficient with binary term weights [8]. For articles $D_q$ and $D_r$ with $T_q$ and $T_r$ terms, respectively, and $qr$ common terms, the similarity between the two articles is given by $Sim(D_q, D_r) = 2 \cdot qr / (T_q + T_r)$. Table 1 presents the statistics comparing the NODC values of RNN-CLINK and SLINK. The column MOE (measure of effect) reports Hubert's $\Gamma$ statistic, which measures the clustering accuracy using the news group classification as the basis. We notice that for RNN-CLINK, the ratio of the number of RNN pairs to the number of objects (articles) seems to have a lower bound of 0.25, at least for the first iteration of the clustering loop. Also, the total number of iterations appears to grow logarithmically.
                                 RNN-CLINK                                      SLINK
No. of    No. of News  No. of   While-loop  RNN pairs in                       
Articles  Groups       Terms    Iterations  1st round     NODC     MOE (Γ)     NODC     MOE (Γ)
100       2            2975     20          32            43516    0.34        4950     0.07
200       3            4879     21          65            157926   0.65        19900    0.49
300       5            6541     23          86            396724   0.51        44850    0.35
400       7            7587     25          110           737937   0.41        79800    0.14
500       8            8809     27          125           1214330  0.44        124750   0.12
600       10           9706     31          151           1712507  0.41        179700   0.10
* NODC: Number of Distance Computations.
* MOE (Γ): larger absolute values of Γ suggest better clustering performance.
Table 1. Experimental results of clustering news articles
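The Dice coefficient with binary term weights used in the first experiment can be sketched as below. Articles are modeled as sets of stemmed terms; converting the similarity to a distance as one minus the similarity, and returning 0 for two empty articles, are assumptions of this sketch that the text does not specify.

```python
def dice_similarity(terms_q, terms_r):
    """Dice coefficient with binary term weights: 2 * common / (Tq + Tr)."""
    terms_q, terms_r = set(terms_q), set(terms_r)
    if not terms_q and not terms_r:
        return 0.0  # assumption: two empty articles are treated as dissimilar
    common = len(terms_q & terms_r)
    return 2 * common / (len(terms_q) + len(terms_r))

def dice_distance(terms_q, terms_r):
    """Assumed conversion from similarity to distance (not given in the text)."""
    return 1.0 - dice_similarity(terms_q, terms_r)

article_q = {"cluster", "link", "chain"}
article_r = {"link", "chain", "graph"}
print(dice_similarity(article_q, article_r))  # 2 * 2 / (3 + 3)
```

With binary weights a term either occurs in an article or does not, so the similarity depends only on the sizes of the two term sets and of their intersection.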
Our second experiment uses randomly generated bit-vectors as the input data, where the Hamming distance is used to measure the dissimilarity between vectors. The dimension of the vectors is fixed at 200, and the number of objects (vectors) varies from 100 to 1000. In each case, we ran the RNN-CLINK clustering 100 times; Table 2 reports the average NODC and the average number of clustering loop iterations for each of the cases. For the same reason given previously, the NODC value for SLINK is set to $N(N-1)/2$. Similar to the results reported in Table 1, the average number of clustering loop iterations seems to grow logarithmically. Figure 4 plots the NODC values of the RNN-CLINK and SLINK algorithms, along with the curves of $N^2$ and $N^2 \log N$. The figure clearly shows a growth pattern of $O(N^2 \log N)$ for the average-time complexity of RNN-CLINK. Therefore, our experimental results using two types of input data suggest that the conclusion of Theorem 1 concerning the average number of RNN pairs seems valid, even when the assumption on the random sample points is not necessarily satisfied.
               RNN-CLINK                     SLINK
No. of Nodes   Avg. No. of    Avg. NODC's    Avg. NODC's
               Iterations
100            18             40021          4950
200            23             180678         19900
300            26             431559         44850
400            28             798095         79800
500            30             1281078        124750
600            31             1889095        179700
700            33             2622060        244650
800            34             3478650        319600
900            35             4423560        404550
1000           36             5537472        499500
* NODC: Number of Distance Computations.
Table 2. Experimental results of clustering random bit-vectors
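The first round of the second experiment can be imitated with a small simulation: generate random 200-bit vectors, find each vector's nearest neighbor under the Hamming distance, and count the reciprocal nearest neighbor pairs. This is an illustrative sketch only; the fixed seed, the lowest-index tie-breaking, and the storage of bit vectors as Python integers are assumptions, and the numbers it prints are not the ones reported in Table 2.

```python
import random

def hamming(u, v):
    """Hamming distance between two bit vectors stored as Python ints."""
    return bin(u ^ v).count("1")

def first_round_rnn_pairs(vectors):
    """Number of RNN pairs among single-vector clusters (first iteration)."""
    n = len(vectors)
    nn = [min((k for k in range(n) if k != i),
              key=lambda k: hamming(vectors[i], vectors[k]))
          for i in range(n)]
    # Count each reciprocal pair once by requiring i < j.
    return sum(1 for i, j in enumerate(nn) if i < j and nn[j] == i)

rng = random.Random(0)                      # fixed seed for repeatability
vectors = [rng.getrandbits(200) for _ in range(100)]
pairs = first_round_rnn_pairs(vectors)
print(pairs, pairs / len(vectors))          # RNN pairs and pairs-per-object ratio
```

Repeating the run with different seeds and sizes gives an empirical estimate of the ratio of RNN pairs to objects that Theorem 1 (1) relates to $P_n / 2$.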
[Plot: "Comparison of Number of Distance Computations". Horizontal axis: Number of Bit Vectors (100 to 1000). Vertical axis: No. of Distance Computations (0 to 8,000,000). Curves: $N^2$, SLINK, $N^2 \log N$, RNN-CLINK.]
Figure 4. Experimental results of clustering random bit-vectors
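One simple way to read the growth pattern off Table 2 is to divide each average NODC by $N^2$ and by $N^2 \log N$. The sketch below does this arithmetic on the tabulated values; the base-2 logarithm is an assumption, since the base used for the curve in Figure 4 is not stated.

```python
import math

# Average NODC of RNN-CLINK, copied from Table 2.
table2 = {100: 40021, 200: 180678, 300: 431559, 400: 798095, 500: 1281078,
          600: 1889095, 700: 2622060, 800: 3478650, 900: 4423560, 1000: 5537472}

per_n2 = {n: nodc / n**2 for n, nodc in table2.items()}
per_n2_log = {n: nodc / (n**2 * math.log2(n)) for n, nodc in table2.items()}

for n in table2:
    print(n, round(per_n2[n], 2), round(per_n2_log[n], 3))
```

The first ratio keeps increasing with $N$, while the second stays within a narrow band, which is consistent with the $N^2 \log N$ growth pattern described above.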
5. CONCLUSION AND FURTHER RESEARCH
In this paper, we presented an implementation of the $O(N)$-space clustering algorithm RNN-CLINK using the reciprocal nearest neighbor technique. We also described two speed-up measures that improve the algorithm's execution time. We gave a probabilistic model using random sample points and assumptions about the RNN conditional probabilities to explain the algorithm's $O(N^2 \log N)$ average-time complexity. Our experimental results suggested the validity of these probabilistic assumptions. For further research, we would like to provide a mathematical proof of these assumptions.
REFERENCES
1. O. Aichholzer, “Clustering the hypercubes,” SFB Report Series 93, TU-Graz, 1996.
2. O. Aichholzer and F. Aurenhammer, “Classifying hyperplanes in hypercubes,” SIAM J. Discrete Math, 9 (2), pp. 225-232, May 1996.
3. M. R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
4. A. Berson and S. J. Smith, Data Warehousing, Data Mining, and OLAP, McGraw-Hill, New York, 1997.
5. W. H. E. Day and H. Edelsbrunner, “Efficient algorithms for agglomerative hierarchical clustering methods,” Journal of Classification, 1 (1), pp. 7-24, 1984.
6. D. Defays, “An efficient algorithm for a complete link method,” The Computer Journal, 20 (4), pp. 364-366, 1977.
7. A. El-Hamdouchi and P. Willett, “Comparison of hierarchical agglomerative clustering methods for document retrieval,” The Computer Journal, 32 (3), 1989.
8. W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
9. K. C. Gowda and G. Krishna, “Agglomerative clustering using the concept of mutual nearest neighborhood,” Pattern Recognition, 10 (2), pp. 105-112, 1978.
10. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
11. D. Krznaric and C. Levcopoulos, “Fast algorithms for complete linkage clustering,” Discrete Comput. Geom., 19, pp. 131-145, 1998.
12. G. H. Lance and W. T. Williams, “A general theory of classificatory sorting strategies: I. Hierarchical systems,” The Computer Journal, 9, pp. 373-380, 1967.
13. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, Inc., New York, 1990.
14. F. Murtagh, “A survey of recent advances in hierarchical clustering algorithms,” The Computer Journal, 26 (4), pp. 354-359, 1983.
15. F. Murtagh, “Complexities of hierarchic clustering algorithms: State of the art,” Computational Statistics Quarterly, 1, pp. 101-113, 1984.
16. J. Shanbehzadeh and P. O. Ogunbona, “On the computational complexity of the LBG and PNN algorithms,” IEEE Trans. on Image Proc., 6 (4), pp. 614-617, 1997.
17. R. Sibson, “SLINK: An optimally efficient algorithm for the single-link cluster method,” The Computer Journal, 16 (1), pp. 30-45, 1973.
18. E. M. Voorhees, “The effectiveness and efficiency of agglomerative hierarchic clustering in Document Retrieval,” Ph.D. thesis, Cornell University, 1985.