Hierarchical Agglomerative Clustering (AGNES)








Nanam Kumar Sunil Kumar


Department of Computer Science
North Dakota State University, Fargo
Sunil855@yahoo.com




Abstract :- Clustering is the grouping of related objects, placing logically related items physically near one another. The agglomerative nesting algorithm is a hierarchical clustering method that starts with the two closest clusters, merges them, and continues merging until only one cluster remains. The time complexity of the hierarchical agglomerative algorithm is O(n^2 log n), so there is a chance of reducing this complexity by reducing the number of priority queue updates. Hence I propose a new hierarchical agglomerative algorithm whose time complexity is lower than that of the standard hierarchical agglomerative algorithm. The following steps give an overview of the new algorithm.



1. For each cluster, generate a priority queue of the distances to all the other clusters.

2. Determine the closest two clusters.

3. Agglomerate the clusters.

4. Find the nearest neighbor of the newly agglomerated cluster.

5. Update the priority queue of only that nearest neighbor.

6. Scan the nearest neighbor's priority queue to determine the cluster nearest to it.

7. If the newly agglomerated cluster is closest to the nearest neighbor, the two are agglomerated; otherwise the nearest neighbor is agglomerated with the cluster closest to it.

8. If more than one cluster remains, go to step 4.






Introduction :- The problem of automatically classifying data in an unsupervised way arises in many areas of pattern recognition, and hierarchical classification offers a flexible, nonparametric approach. Hierarchical clustering is considered a convenient approach among clustering algorithms mainly because it presupposes very little about the characteristics of the data and requires little a priori knowledge on the part of the analyst.


Clustering algorithms are divided into two types.

Partitioning Clustering :- Partitions the N objects of a database D into K clusters such that objects in a cluster are more similar to each other than to objects in different clusters.

Hierarchical Clustering :- Hierarchical clustering algorithms decompose a database D of n objects into several levels of nested partitions (clusters) represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object.
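As an illustration, the following minimal Python sketch builds such a dendrogram for a handful of two-dimensional points. It assumes the NumPy and SciPy libraries, which are not part of this paper; the sample coordinates are invented for illustration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Five 2-D points standing in for the database D (illustrative values).
    D = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [8.0, 0.0]])

    # Agglomerative clustering with the single link metric; Z encodes the
    # sequence of merges, i.e. the nested partitions of D.
    Z = linkage(D, method="single")

    # The dendrogram is the tree of merges down to single-object subsets
    # (no_plot=True returns its structure instead of drawing it).
    tree = dendrogram(Z, no_plot=True)
    print(tree["ivl"])  # leaf order of the original points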


AGNES (Agglomerative Nesting Algorithm) :- AGNES proceeds by a series of fusions. Initially all objects are apart: each object forms a small cluster by itself. At the first step, the two closest, or minimally dissimilar, objects are merged to form a cluster. Then the algorithm finds the next pair of objects with minimal dissimilarity. If there are several pairs with minimal dissimilarity, the algorithm picks a pair of objects at random. Picking pairs of objects which are minimally dissimilar is quite straightforward, but how do we select pairs of clusters with minimal dissimilarity? It is necessary to define a measure of dissimilarity between the clusters. The average pair group method is one way to compute the dissimilarity between a pair of clusters.


There are several methods for determining the distances between clusters. The most common metrics can be divided into the following two general classes.

A. Graph methods :- These methods determine intercluster distances using the graph of the points in the two clusters (a code sketch of the three metrics follows this list). Examples include:

Single Link :- The distance between two clusters is the minimum distance between two points such that there is one point in each cluster.

Average Link :- The distance between two clusters is the average distance over all pairs of points such that there is one point in each cluster.

Complete Link :- The distance between two clusters is the maximum distance between two points such that there is one point in each cluster.
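A minimal Python sketch of these three graph metrics, assuming a cluster is represented as a list of coordinate tuples (the function names are mine, not established terminology):

    import math
    from itertools import product

    def single_link(c1, c2):
        # Minimum distance over all cross-cluster point pairs.
        return min(math.dist(p, q) for p, q in product(c1, c2))

    def average_link(c1, c2):
        # Average distance over all cross-cluster point pairs.
        return sum(math.dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

    def complete_link(c1, c2):
        # Maximum distance over all cross-cluster point pairs.
        return max(math.dist(p, q) for p, q in product(c1, c2))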


B. Geometric Methods :- These methods define a cluster centre for each cluster and use these cluster centres to determine the distances between clusters (see the sketch after this list). For example:

Centroid :- The cluster centre is the centroid of the points in the cluster.

Median :- The cluster centre is the average of the centres of the two clusters agglomerated to form it.

Minimum Variance :- The cluster centre is the centroid of the points in the cluster. The distance between two clusters is the increase in the sum of squared distances from each point to the centre of its cluster caused by merging the clusters.
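A sketch of the centroid and minimum variance distances under the same representation as above (clusters as lists of coordinate tuples; the helper names are mine):

    import math

    def centroid(cluster):
        # Component-wise mean of the points in the cluster.
        n = len(cluster)
        return tuple(sum(xs) / n for xs in zip(*cluster))

    def centroid_distance(c1, c2):
        # Distance between the two cluster centres.
        return math.dist(centroid(c1), centroid(c2))

    def sum_sq(cluster):
        # Sum of squared distances from each point to the cluster centre.
        c = centroid(cluster)
        return sum(math.dist(p, c) ** 2 for p in cluster)

    def min_variance_distance(c1, c2):
        # Increase in total within-cluster squared distance caused by the merge.
        return sum_sq(c1 + c2) - sum_sq(c1) - sum_sq(c2)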


Hierarchical clustering :- On the other hand, very different algorithms can be given for the same hierarchical clustering method. However, a general agglomerative algorithm for hierarchical clustering may be described informally as follows.

1. For each cluster, generate a priority queue of the distances to all the other clusters.

2. Determine the closest two clusters.

3. Agglomerate the clusters.

4. Update the distances in the priority queues.

5. If more than one cluster remains, go to step 2.


For any metric where the intercluster distances can be determined in O(1) time, clustering can be performed in O(n^2 log n) time, where n is the number of clusters.

Step 1 can be performed in O(n^2) time.

Steps 2-5 are performed n-1 times each.

Step 4 is performed in O(n log n) time.

So the total time required is O(n^2 log n).
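As a concrete illustration of steps 1-5, here is a runnable Python sketch of the general algorithm, using binary heaps from the standard library as the priority queues with lazy deletion of stale entries. It fixes the single link metric and 2-D points for brevity; all names are mine, and this is one possible realization, not the paper's code.

    import heapq
    import math

    def agglomerative(points):
        # Each cluster is a frozenset of point indices, keyed by an integer id.
        clusters = {i: frozenset([i]) for i in range(len(points))}

        def dist(a, b):
            # Single link distance between two clusters of point indices.
            return min(math.dist(points[p], points[q]) for p in a for q in b)

        # Step 1: one priority queue per cluster, holding distances to all others.
        queues = {i: [(dist(clusters[i], clusters[j]), j)
                      for j in clusters if j != i] for i in clusters}
        for q in queues.values():
            heapq.heapify(q)

        merges = []
        while len(clusters) > 1:
            # Step 2: find the globally closest pair, discarding stale heap tops.
            best = None
            for i, q in queues.items():
                while q and q[0][1] not in clusters:
                    heapq.heappop(q)
                if q and (best is None or q[0][0] < best[0]):
                    best = (q[0][0], i, q[0][1])
            _, i, j = best

            # Step 3: agglomerate clusters i and j into a new cluster.
            merges.append((i, j))
            new_id = max(clusters) + 1
            merged = clusters.pop(i) | clusters.pop(j)
            del queues[i], queues[j]
            clusters[new_id] = merged

            # Step 4: update every remaining queue with the new distances.
            queues[new_id] = []
            for k in clusters:
                if k != new_id:
                    d = dist(clusters[k], merged)
                    heapq.heappush(queues[k], (d, new_id))
                    heapq.heappush(queues[new_id], (d, k))
        return merges

For example, agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)]) merges the two left points, then the two right points, then the two resulting pairs.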

Problems with hierarchical clustering :-

1. Creating a new priority queue for the agglomerated cluster and updating O(n) priority queues every time a merge is performed.

2. The total time required is O(n^2 log n).



KILLER IDEA :- The following is my new algorithm for hierarchical agglomerative clustering. The algorithm is described in the following 8 steps, and a code sketch is given after the list.

New Algorithm (My approach) :-


1. For each cluster, generate a priority queue of the distances to all the other clusters.

2. Determine the closest two clusters.

3. Agglomerate the clusters.

4. Find the nearest neighbor of the newly agglomerated cluster.

5. Update the priority queue of only that nearest neighbor.

6. Scan the nearest neighbor's priority queue to determine the cluster nearest to it.

7. If the newly agglomerated cluster is closest to the nearest neighbor, the two are agglomerated; otherwise the nearest neighbor is agglomerated with the cluster closest to it.

8. If more than one cluster remains, go to step 4.
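Here is a minimal Python sketch of how these steps might be realized. To keep it short, the per-cluster priority queues are replaced by direct nearest-neighbor scans (the queue bookkeeping does not change which merges happen, only their cost), so this is my reading of the steps rather than the author's implementation. The single link metric and 2-D points are assumed as before.

    import math

    def agnes_new(points):
        clusters = {i: frozenset([i]) for i in range(len(points))}

        def dist(a, b):
            # Single link distance between two clusters of point indices.
            return min(math.dist(points[p], points[q]) for p in a for q in b)

        def nearest(cid):
            # Stand-in for scanning a cluster's priority queue (steps 4 and 6).
            return min((k for k in clusters if k != cid),
                       key=lambda k: dist(clusters[cid], clusters[k]))

        def merge(i, j):
            new_id = max(clusters) + 1
            clusters[new_id] = clusters.pop(i) | clusters.pop(j)
            return new_id

        # Steps 1-3: find and merge the globally closest pair once.
        i, j = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        merges = [(i, j)]
        new = merge(i, j)

        while len(clusters) > 1:          # step 8
            nn = nearest(new)             # step 4
            nn_closest = nearest(nn)      # steps 5-6
            if nn_closest == new:         # step 7, first case
                merges.append((new, nn))
                new = merge(new, nn)
            else:                         # step 7, second case
                merges.append((nn, nn_closest))
                new = merge(nn, nn_closest)
        return merges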



Example :- Consider the following four clusters, A, B, C, and D. I used the single link distance metric to find the two nearest clusters.













1. In the example, the first two nearest clusters A and B are selected by the single link distance metric and are merged to form cluster AB.















2. Then find the nearest neighbor of the newly agglomerated cluster and update its priority queue. In the example it is cluster D.


3. Then the nearest neighbor's priority queue is scanned to find its nearest neighbor; in this example it is cluster C.





4. Then the nearest neighbor cluster is agglomerated with the cluster nearest to it, so C and D merge to form cluster CD.















[Figures: the four initial clusters A, B, C, D; the merge of A and B into AB; the merge of C and D into CD.]

5. In our example we are left with two clusters, AB and CD. Cluster AB is nearest to CD, so they are merged to form cluster ABCD.















Proof :- In this section I support the newly proposed algorithm with a comparison and an example.

Comparison of the new algorithm with the agglomerative algorithm:

The agglomerative algorithm updates the distances in all priority queues, so the time complexity of this step is O(n log n).

The new algorithm updates only the nearest neighbor's queue, so the time complexity for updating the priority queues is less than O(n log n).

So the overall time complexity is also less than O(n^2 log n).

This shows that the new algorithm performs better than the agglomerative algorithm.



Example :- In this example I consider four clusters A, B, C, and D which are merged using my newly proposed algorithm. We continue the algorithm until we are left with a single cluster. In this example I used the single link distance metric for finding the two nearest clusters.

Consider the following four clusters: A, B, C, D.



















1. We find the closest two clusters using the single link distance metric. This step is the same in my algorithm and in the agglomerative algorithm. In the example, clusters A and B are closest, so clusters A and B are merged to form cluster AB.


















2. The agglomerative algorithm updates the priority queues of all the other clusters; in the above example it updates the priority queues of C and D. My algorithm updates the priority queue of only cluster D, which is nearest to the newly merged cluster AB. Up to this step, the number of priority queue updates for the agglomerative algorithm is 2 (it updates C and D), whereas for my algorithm it is only 1 (it updates only D).



3. In this example, for both algorithms, the newly merged cluster AB and cluster D are the nearest clusters, so we merge them together to form cluster ABD. After merging, both algorithms update the priority queue of C, as C is the only cluster left.











[Figures: the merge of AB and D into ABD; the remaining cluster C.]





4. Finally we are left with two clusters, ABD and C. We merge them, as these two clusters are the nearest clusters. We are then left with a single cluster, ABCD, so we stop clustering here.












In this example the number of priority queue updates made by the agglomerative hierarchical algorithm is 3, while for my algorithm the number of priority queue updates is only 2. This shows that the number of priority queue updates decreases with the new algorithm. As the number of priority queue updates decreases, the overall time complexity also decreases. Thus the new algorithm performs better than the hierarchical agglomerative algorithm.
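The update counts in this walkthrough follow a simple pattern: after each merge the agglomerative algorithm refreshes one queue per surviving cluster, while the new algorithm refreshes exactly one. A tiny sketch of that bookkeeping, based on my reading of the example (the function name is mine):

    def update_counts(n):
        # After each merge, k clusters survive and the agglomerative
        # algorithm updates the other k - 1 queues; the new algorithm
        # updates one queue per merge, none after the last merge.
        agglomerative = sum(k - 1 for k in range(n - 1, 1, -1))
        new = n - 2
        return agglomerative, new

    # For the four-cluster example: agglomerative = 2 + 1 = 3, new = 2.
    print(update_counts(4))  # -> (3, 2)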




Advantages of the new algorithm :-

1. Each time an agglomerated cluster is generated, instead of updating all the priority queues, the algorithm updates only the priority queue of the nearest neighbor. This reduces the number of priority queue updates.

2. The complexity is less than O(n^2 log n).






Conclusion :- In this paper I proposed a new algorithm for hierarchical clustering which reduces the number of priority queue updates by updating only the nearest neighbor. As the number of priority queue updates decreases, the time complexity of the algorithm also decreases, so the newly proposed algorithm performs better than the agglomerative hierarchical algorithm. In the future there is also a chance of extending this algorithm to large data sets.

