Measurement and comparison of clustering algorithms - DiVA




School of Mathematics and Systems Engineering

Reports from MSI  Rapporter från MSI




Title
Measurement and comparison of clustering algorithms

Author
Shima Javar

Dec 2007

MSI Report 07149
Växjö University ISSN 1650-2647
SE-351 95 VÄXJÖ ISRN VXU/MSI/DA/E/07149/SE



Abstract

In this project, a number of different clustering algorithms are described and their
workings explained. They are compared with each other by applying them to a
number of graphs with a known architecture.
These clustering algorithms, in the order they are implemented, are as follows:
Nearest neighbour hill-climbing, Nearest neighbour big step hill-climbing, Best
neighbour hill-climbing, Best neighbour big step hill-climbing, Gem 3D, K-means
simple, K-means Gem 3D, One cluster and One cluster per node.
The graphs are Unconnected, Directed KX, Directed Cycle KX and Directed
Cycle.
The results of these clusterings are compared with each other according to three
criteria: Time, Quality and Extremity of node distribution. This enables us to find
out which algorithm is most suitable for which graph. The results on these artificial
graphs are then compared with those on the reference architecture graph to reach
the conclusions.

Keywords: Clustering algorithm, Module dependency graph, Artificial graph,
Reference graph, Cluster, Node, Edge, Implementation time, Quality, Extremity,
Precision, Recall.

Table of contents
ABSTRACT
1. INTRODUCTION
2. CLUSTERING ANALYSIS
3. CLUSTERING ALGORITHMS
3.1 HILL-CLIMBING ALGORITHM
3.1.1 Nearest neighbour
3.1.2 Nearest neighbour big step
3.1.3 Best neighbour
3.1.4 Best neighbour big step
3.2 GEM 3D
3.3 K-MEANS
3.3.1 K-means simple
3.3.2 K-means Gem 3D
3.4 ONE CLUSTER
3.5 ONE CLUSTER PER NODE
4. ENVIRONMENT OF IMPLEMENTATION ALGORITHMS AND TESTING
4.1 WHAT IS ECLIPSE AND ECLIPSE FOUNDATION?
4.2 HISTORY OF ECLIPSE
4.3 DOWNLOADING ECLIPSE
4.4 PROJECT CODES
5. TABLES AND DIAGRAMS
5.1 TIME TABLES AND DIAGRAMS
5.1.1 Nearest neighbour
5.1.2 Nearest neighbour big step
5.1.3 Best neighbour
5.1.4 Best neighbour big step
5.1.5 Gem 3D
5.1.6 K-means simple
5.1.7 K-means Gem 3D
5.1.8 One cluster
5.1.9 One cluster per node
5.2 QUALITY TABLE AND DIAGRAMS
5.2.1 Nearest neighbour
5.2.2 Nearest neighbour big step
5.2.3 Best neighbour
5.2.4 Best neighbour big step
5.2.5 Gem 3D
5.2.6 K-means simple
5.2.7 K-means Gem 3D
5.2.8 One cluster
5.2.9 One cluster per node
5.3 EXTREMITY
5.3.1 Nearest neighbour
5.3.2 Nearest neighbour big step
5.3.3 Best neighbour
5.3.4 Best neighbour big step
5.3.5 Gem 3D
5.3.6 K-means simple
5.3.7 K-means Gem 3D
6. REFERENCE GRAPH
6.1 REFERENCE GRAPH TIME, PRECISION, RECALL AND QUALITY
6.2 REFERENCE GRAPH EXTREMITY
7. COMPARING REFERENCE GRAPH WITH ARTIFICIAL GRAPHS
7.1 TIME
7.1.1 Nearest neighbour
7.1.2 Nearest neighbour big step
7.1.3 Best neighbour
7.1.4 Best neighbour big step
7.1.5 Gem 3D
7.1.6 K-means simple
7.1.7 K-means Gem 3D
7.1.8 One cluster
7.2 QUALITY
7.2.1 Nearest neighbour
7.2.2 Nearest neighbour big step
7.2.3 Best neighbour
7.2.4 Best neighbour big step
7.2.5 Gem 3D
7.2.6 K-means simple
7.2.7 K-means Gem 3D
7.2.8 One cluster
7.2.9 One cluster per node
7.3 EXTREMITY
7.3.1 Nearest neighbour big step
7.3.2 Gem 3D
7.3.3 K-means simple
7.3.4 K-means Gem 3D
8. CONCLUSION
9. REFERENCES
APPENDICES
APPENDIX A GAUSSIAN DISTRIBUTION
APPENDIX B EUCLIDEAN DISTANCE
APPENDIX C MANHATTAN/CITY-BLOCK DISTANCE

1. Introduction
A well-documented architecture can improve the quality and maintainability of a
software system. Many software systems, however, do not have their architecture
documented. In addition, documented architectures become obsolete, and the
system’s structure degrades as changes are made to the system to meet market
requirements. High turnover among developers makes the situation even more
complex. Maintaining the architectural documentation of large software systems is
a further problem. These points have made software clustering one of the important
subjects in computer science. Software clustering is a useful tool that can help with
these problems.
It helps in understanding the inherited architectural documentation of a software
system and supports its remodularization by decomposing the system into
meaningful subsystems.
In recent years a large number of books, articles and studies have been published
that analyze different kinds of clustering algorithms (Everitt, B.S., 1979). They help
in selecting the most suitable algorithm for different problems in a wide variety of
sciences. What is the definition of a ‘suitable algorithm’? It is an algorithm that
takes little time and produces a high-quality result.
In the first part of this project, I compare nine clustering algorithms with each
other by applying them to four artificial graphs and a reference graph according to
three criteria: Implementation time, Quality and Extremity.
Depending on time and quality, I conclude which algorithm or algorithms are the
most appropriate for each graph.
When the algorithms are applied to the Directed cycle KX graph, their Extremity
is compared in addition to time and quality. Extremity is defined as the distribution
of nodes in a clustering.
These algorithms are presented in this thesis with a detailed description. Tables
and diagrams show the time and quality and lead to a conclusion about the best
algorithm.
In the second part, these algorithms are applied to the ‘Reference graph’, which
is not an artificial graph but a real one. Finally, the results of the algorithms on this
graph are compared with the results on the artificial ones.


2. Clustering analysis
The practice of classifying objects and organizing data into sensible groupings
according to perceived similarities is one of the most fundamental modes of
understanding and learning, and it is the basis of many branches of science.

Cluster analysis is the formal study of algorithms and methods for
grouping, or classifying, objects. It offers several advantages over a manual grouping
process. First, a clustering program can apply a specified objective criterion
consistently to form the groups. Second, it can form the groups in a fraction of the
time required by manual grouping, particularly if a long list of descriptors or
features is associated with each object.
An object is described either by a set of measurements or by relationships between
the object and other objects. The objective of clustering is simply to find a
convenient and valid organization of the data.
A cluster comprises a number of similar objects collected or grouped
together. Everitt (1974, cited by Jain and Dubes, 1988) documents some of the
following definitions of a cluster:
1. A cluster is a set of entities that are alike, and entities from different clusters
are not alike.
2. A cluster is an aggregation of points in the test space such that the distance
between any two points in the cluster is less than the distance between any point in
the cluster and any point not in it.
3. Clusters may be described as connected regions of a multidimensional space
containing a relatively high density of points, separated from other such regions by a
region containing a relatively low density of points.
The last two definitions assume that the objects to be clustered are represented as
points in the measurement space. While it is easy to give a functional definition of a
cluster, it is difficult to give an operational one, because objects can be grouped
into clusters with different purposes in mind.
Depending on the kind of problem, several clustering algorithms can be used to
form clusters. The question ‘Which algorithm is the best one?’ leads to ‘How
can we evaluate a clustering algorithm?’. One of the most important factors in
software is ‘Time’. Compared with creating software, optimizing it sometimes takes
more time and budget. Therefore, time is a factor that is used to assess clustering
algorithms. The second factor is the ‘Quality’ of the algorithm.
What is the definition of quality for a clustering algorithm? In computer
programming, cohesion is a measure of how well the lines of source code within a
module work together to provide a specific piece of functionality. Cohesion is an
ordinal type of measurement and is usually expressed as ‘high cohesion’ or ‘low
cohesion’. Modules with high cohesion are preferable because high cohesion is
associated with several desirable traits of software, including robustness,
reliability, reusability and understandability, whereas low cohesion is associated
with undesirable traits such as being difficult to maintain, test, reuse and even
understand. A clustering has good quality when the objects in each cluster have
high interdependence and when independent objects are assigned to separate
clusters. Well-designed software systems are organized into cohesive clusters whose
members are highly interdependent. (Constantine and Yourdon, 1979)


3. Clustering algorithms
A clustering algorithm can be hierarchical or partitional. Hierarchical algorithms
find successive clusters using previously established clusters, whereas partitional
algorithms determine all clusters at once. Jain and Dubes (1988, pp. 55-90) conclude
that hierarchical algorithms can be agglomerative (bottom-up) or divisive
(top-down). Agglomerative algorithms begin with each element as a separate cluster
and merge them into successively larger clusters. Divisive algorithms begin with the
whole set and proceed to divide it into successively smaller clusters. Both clustering
strategies have their appropriate domains of application. Anquetil and Lethbridge
(1999, pp. 235-255) state that hierarchical techniques are popular in the biological,
social and behavioural sciences because of the need to construct taxonomies, whereas
partitional techniques are used frequently in engineering applications where single
partitions are important. In addition, partitional algorithms are appropriate for
the representation and compression of large databases, where the dendrograms of
hierarchical algorithms are impractical with more than a hundred patterns.
There are several clustering methods for each strategy. Single link (Nearest
neighbour) and Complete link (Best neighbour) are two well-known methods of
hierarchical clustering. Square-error, K-means and genetic algorithms belong to
partitional clustering.
The clustering algorithms presented in this project are: Nearest neighbour
hill-climbing, Nearest neighbour big step hill-climbing, Best neighbour hill-climbing,
Best neighbour big step hill-climbing, Gem 3D, K-means simple, K-means Gem
3D, One cluster and One cluster per node.
To apply the algorithms, a number of graphs are first extracted; these graphs are:
Unconnected graph: In a directed graph G, two vertices u and v are called
unconnected if G does not contain any path from u to v.
Directed KX graph: A fully connected graph with x nodes; e.g., K5 is a graph
with 5 nodes, each connected to every other.
Directed cycle KX graph: A cycle of y KX subgraphs; e.g., "Cycle 4 of K5" is
a graph consisting of four "K5" graphs where one and only one node of each "K5"
is connected to nodes of two other "K5" graphs such that they form a cycle.
Directed cycle graph: A graph that consists of a single cycle. In this graph a
number of vertices connected by directed edges form a closed chain.
Reference graph: This is a real graph. Its size cannot be varied; it is fixed at
420.
The clustering algorithms are then run on these extracted graphs at different
sizes; the size of the reference graph cannot be chosen because it is fixed. The time
and quality of the algorithms are compared with each other and with the reference
graph. The extremity of the artificial graphs is also compared.
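
The artificial graphs described above can be generated with a few lines of code. The following is a minimal sketch, not the generator used in the project: the function names are hypothetical, nodes are numbered from 0, an edge is a pair (source, target), an "Unconnected" graph is assumed to have no edges at all, a Directed KX graph is assumed to contain both directions between every pair of nodes, and the cycle of KX subgraphs is assumed to be closed through the first node of each subgraph.

```python
from itertools import permutations

def unconnected(n):
    """n isolated nodes: no edges at all (assumed reading of 'Unconnected')."""
    return []

def directed_kx(x, offset=0):
    """Every ordered pair of distinct nodes among x nodes is an edge (assumption)."""
    nodes = range(offset, offset + x)
    return list(permutations(nodes, 2))

def directed_cycle(n):
    """A single directed cycle 0 -> 1 -> ... -> n-1 -> 0."""
    return [(i, (i + 1) % n) for i in range(n)]

def directed_cycle_kx(y, x):
    """y copies of K_x; the first node of each copy points to the next copy."""
    edges = []
    for block in range(y):
        edges += directed_kx(x, offset=block * x)
        # One node per subgraph closes the cycle between subgraphs.
        edges.append((block * x, ((block + 1) % y) * x))
    return edges

# "Cycle 4 of K5": four K5 subgraphs joined into a cycle.
cycle_4_of_k5 = directed_cycle_kx(4, 5)
```

With these assumptions, a K5 has 20 directed edges and "Cycle 4 of K5" has 84 edges (four times 20 plus the four edges that close the cycle).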

3.1 Hillclimbing algorithm
The hill-climbing algorithm quickly discovers a suboptimal clustering result by using
neighbouring partitions. This involves moving objects between clusters so as to
improve the quality.
The hill-climbing algorithm has three steps:
1. Initialize the clusters.
2. Evaluate the quality of the clustering.
3. Find the next neighbour with a better quality.

Figure 3.1: Hill-climbing algorithm (flowchart: Start, Initial clustering, Measure quality, Find neighbour of clustering, Measure quality, "Quality is improved?" with Yes and No branches, End)

In the first step, the graph of the target system is extracted. This graph is called
the MDG (Module Dependency Graph). There are many possible ways to cluster the
nodes of the dependency graph. The number of ways grows very quickly as the
number of nodes (modules in the system) increases. With S_{n,k} the number of
possible ways of clustering n nodes into k clusters:

$$S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \text{ or } k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}$$

Summing S_{n,k} over all k gives the total number of possible clusterings of n nodes:

 n  clusterings     n  clusterings     n  clusterings
 1            1     6          203    11       678570
 2            2     7          877    12      4213597
 3            5     8         4140    13     27644437
 4           15     9        21147    14    190899322
 5           52    10       115975    15   1382958545

A 15-module system (a dependency graph with 15 nodes) can be clustered in
1382958545 ways, which is about the limit for exhaustive analysis. Therefore,
exhaustive clustering is not appropriate for big systems.
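
The recurrence and the totals listed above can be checked with a short program. This is an illustrative sketch; the function names `stirling` and `count_clusterings` are hypothetical and not taken from the project code.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling(n, k):
    """Number of ways to cluster n nodes into exactly k non-empty clusters."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:
        return 1
    return stirling(n - 1, k - 1) + k * stirling(n - 1, k)

def count_clusterings(n):
    """Total number of clusterings of n nodes, summed over every k."""
    return sum(stirling(n, k) for k in range(1, n + 1))

for n in (5, 10, 15):
    print(n, count_clusterings(n))
```

The memoization via `lru_cache` keeps the recursion cheap; without it the same subproblems would be recomputed many times.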
Obviously, a dependency graph can be clustered in several ways. To start, the
simplest way to initialize is to assign each object to one cluster.
A good clustering is a cohesive clustering with loose dependencies between
clusters. Two parameters can therefore be defined for measuring the quality of a
clustering; this definition is used for measuring the quality of most of the
clustering algorithms:

μ edges (intra-edges): edges that start and end within the same cluster.
ε edges (inter-edges): edges that start and end in different clusters.
According to these definitions, the Clustering Factor (CF) and Module Quality
(MQ) can be defined as: (4)

$$CF_i = \begin{cases} 0 & \text{if } \mu_i = 0 \\ \dfrac{2\mu_i}{2\mu_i + \sum_{j=1,\, j \neq i}^{k} (\varepsilon_{i,j} + \varepsilon_{j,i})} & \text{otherwise} \end{cases}$$

$$MQ = \sum_{i=1}^{k} CF_i$$

Here μ_i is the number of intra-edges of cluster i, ε_{i,j} is the number of
inter-edges from cluster i to cluster j, and k is the number of clusters.

This definition of MQ is used for evaluating the quality of most of the clustering
algorithms in this project. It expresses that the clustering with the most
intra-edges and the fewest inter-edges has the best quality.
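
As an illustration of the quality measure, the sketch below computes MQ for a directed graph given as a list of edges and a clustering given as a mapping from node to cluster label. The function name `mq` and the data layout are assumptions made for this example; the project's own implementation may differ.

```python
def mq(edges, cluster_of):
    """Module Quality: the sum of the Clustering Factors of all clusters."""
    intra = {}  # intra-edges per cluster
    inter = {}  # inter-edges touching each cluster, in either direction
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    # Clusters without intra-edges have CF = 0 and contribute nothing.
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())

# Two directed triangles joined by the single edge 2 -> 3.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(mq(edges, two_clusters))
```

In this example each triangle has three intra-edges and shares one inter-edge, so each Clustering Factor is 6/7 and MQ is 12/7.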
A neighbour partition is created by altering the current partition slightly. The
ways of obtaining the next neighbour are as follows:
1. All objects in one cluster that have a connection with objects in another
cluster are transferred. Nearest neighbour and Best neighbour use this method.
2. Clusters are merged two by two until a clustering with better quality is found.
Nearest neighbour big step and Best neighbour big step work this way.
According to these ways, there are four kinds of hill-climbing clustering
algorithms in this project: Nearest neighbour, Nearest neighbour big step, Best
neighbour and Best neighbour big step.

3.1.1 Nearest neighbour
Nearest neighbour is one of the hill-climbing algorithms. It is similar to its best
neighbour counterpart but often faster.
In nearest neighbour, the hill-climbing improvement technique is based on
finding a better neighbour of the current clustering. A partition is a nearest
neighbour of partition P if and only if it is the same as P except that a single
element is in a different cluster.
A better neighbour of the current clustering is found by enumerating the
neighbours of the clustering. The nearest neighbour algorithm stops when the first
better neighbour, one with a higher quality, is found. For Nearest neighbour the
algorithm is:
1. Generate a random initial decomposition of the dependency graph.
2. Measure the quality.
3. Find a neighbour of the clustering.
4. Measure the quality.
5. While the quality is not improved, go to 3.
6. Return the clustering.
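
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: a neighbour is produced by moving one node into another existing cluster, quality is measured with the MQ of Section 3.1, and the names `mq`, `first_better_neighbour` and `hill_climb` are hypothetical. Repeating the step until no better neighbour exists is also an assumption; the text only describes a single step.

```python
def mq(edges, cluster_of):
    """Module Quality as defined in Section 3.1."""
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())

def first_better_neighbour(edges, cluster_of):
    """Return the first single-node move that raises MQ, or None."""
    base = mq(edges, cluster_of)
    labels = sorted(set(cluster_of.values()))
    for node in sorted(cluster_of):
        for label in labels:
            if label == cluster_of[node]:
                continue
            candidate = dict(cluster_of)
            candidate[node] = label
            if mq(edges, candidate) > base:
                return candidate  # stop at the first improvement
    return None

def hill_climb(edges, cluster_of):
    """Repeat the nearest-neighbour step until no neighbour is better."""
    current = dict(cluster_of)
    while True:
        better = first_better_neighbour(edges, current)
        if better is None:
            return current
        current = better
```

Because every accepted move strictly increases MQ and the number of partitions is finite, the loop terminates in a local optimum, which is not necessarily the best clustering overall.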

3.1.2 Nearest neighbour big step
The nearest neighbour big step hill-climbing algorithm is another hill-climbing
algorithm that is very similar to the nearest neighbour algorithm.
It differs from the nearest neighbour algorithm in the way it systematically finds
the better neighbour.
In the nearest neighbour big step algorithm, the first two clusters are merged; if
the quality is not improved, the next two clusters are merged, and so on until a
clustering with a better quality is found.
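
A minimal sketch of one big step is given below. The order in which pairs of clusters are tried (sorted labels, all pairs in turn) is an assumption, as are the names `mq` and `first_better_merge`; quality is measured with the MQ of Section 3.1.

```python
from itertools import combinations

def mq(edges, cluster_of):
    """Module Quality as defined in Section 3.1."""
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())

def first_better_merge(edges, cluster_of):
    """Merge pairs of clusters in turn; return the first merge that raises MQ."""
    base = mq(edges, cluster_of)
    labels = sorted(set(cluster_of.values()))
    for a, b in combinations(labels, 2):
        # Relabel every node of cluster b so that it belongs to cluster a.
        candidate = {n: (a if c == b else c) for n, c in cluster_of.items()}
        if mq(edges, candidate) > base:
            return candidate
    return None

# Two directed triangles joined by 2 -> 3, starting with one cluster per node.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
merged = first_better_merge(edges, {n: n for n in range(6)})
```

In this example the very first merge (clusters 0 and 1) already raises MQ from 0 to 0.5, so the step returns immediately without looking at the remaining pairs.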
3.1.3 Best neighbour
The best neighbour hill-climbing algorithm is based on traditional hill-climbing
techniques. The goal of this algorithm is to create a new partition from the current
partition of the dependency graph such that the quality of the new partition is higher
than the quality of the original partition. Each iteration of the algorithm attempts to
improve the quality by finding the best neighbour of the current partition.
The best neighbour is determined by examining all nearest neighbours of the current
partition and selecting the one that has the highest quality. The best neighbour
algorithm converges when all neighbours of the current clustering have been
evaluated for their quality. Accordingly, the algorithm is:
1. Generate a random initial decomposition of the dependency graph.
2. Measure the quality.
3. Set the first clustering as the current clustering.
4. If there is a neighbour, measure its quality.
5. Return the clustering with the best quality.
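
The difference from nearest neighbour is that every neighbour is evaluated before one is chosen. The sketch below illustrates a single best-neighbour step; as before, a neighbour is assumed to be a single-node move into another existing cluster, and the names `mq` and `best_neighbour` are hypothetical.

```python
def mq(edges, cluster_of):
    """Module Quality as defined in Section 3.1."""
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())

def best_neighbour(edges, cluster_of):
    """Evaluate every single-node move and return the one with the highest MQ.

    Returns None when no neighbour is better than the current clustering.
    """
    best, best_quality = None, mq(edges, cluster_of)
    labels = sorted(set(cluster_of.values()))
    for node in sorted(cluster_of):
        for label in labels:
            if label == cluster_of[node]:
                continue
            candidate = dict(cluster_of)
            candidate[node] = label
            quality = mq(edges, candidate)
            if quality > best_quality:
                best, best_quality = candidate, quality
    return best

# Two directed triangles joined by 2 -> 3; node 2 starts in the wrong cluster.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
start = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B", 5: "B"}
improved = best_neighbour(edges, start)
```

Here the best move puts node 2 back with nodes 0 and 1, which raises MQ from 1.3 to 12/7. Evaluating all neighbours costs more time per step than stopping at the first improvement, which is consistent with the remark that nearest neighbour is often faster.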
3.1.4 Best neighbour big step
This clustering algorithm is very similar to best neighbour hill-climbing. The
difference is the way the best neighbour is found.
In the best neighbour big step clustering algorithm, all pairs of clusters are
merged in turn, and the clustering with the highest quality among the results of
these merges is selected.
The algorithm stops when it has found the best neighbour, the one with the highest
quality, after merging all clusters two by two.
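
A single best-neighbour big step can be sketched as below: every pair of clusters is merged tentatively and the merge with the highest MQ is kept. The names `mq` and `best_merge` are hypothetical, and returning None when no merge improves the quality is an assumption of this example.

```python
from itertools import combinations

def mq(edges, cluster_of):
    """Module Quality as defined in Section 3.1."""
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())

def best_merge(edges, cluster_of):
    """Try every pairwise merge and return the one with the highest MQ."""
    best, best_quality = None, mq(edges, cluster_of)
    labels = sorted(set(cluster_of.values()))
    for a, b in combinations(labels, 2):
        candidate = {n: (a if c == b else c) for n, c in cluster_of.items()}
        quality = mq(edges, candidate)
        if quality > best_quality:
            best, best_quality = candidate, quality
    return best

# Two directed triangles joined by 2 -> 3, with node 2 alone in cluster "B".
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
start = {0: "A", 1: "A", 2: "B", 3: "C", 4: "C", 5: "C"}
merged = best_merge(edges, start)
```

Of the three possible merges in this example, joining "A" and "B" gives the highest MQ (12/7), so node 2 ends up with the other nodes of its triangle.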
3.2 Gem 3D
This clustering algorithm works with three-dimensional concepts. When it receives a
node, it assigns the node a random position in three-dimensional space. Each node
therefore has a certain position (x, y, z) in 3D space. Then, for each pair of nodes
connected by an edge, the algorithm calculates the distance.
For two nodes n1(x1, y1, z1) and n2(x2, y2, z2), the distance is defined as:
(x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2. If the distance is less than a defined value,
the two nodes are placed in the same cluster; otherwise the edge between them is cut
and they are placed in two different clusters.
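
The description above leaves some details open, so the following is only one possible reading: positions are drawn uniformly from the unit cube, the threshold is a parameter, and nodes joined by a chain of kept edges end up in the same cluster (tracked with a union-find structure). The name `gem3d` and all parameters are hypothetical.

```python
import random

def gem3d(nodes, edges, threshold, seed=0):
    """Cluster nodes by keeping only edges shorter than the threshold."""
    rng = random.Random(seed)
    # Random position (x, y, z) for every node, assumed to lie in the unit cube.
    pos = {n: (rng.random(), rng.random(), rng.random()) for n in nodes}
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the cluster representative (path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        # Distance as defined in the text (no square root is taken).
        dist = sum((a - b) ** 2 for a, b in zip(pos[u], pos[v]))
        if dist < threshold:
            parent[find(u)] = find(v)  # keep the edge: same cluster
        # Otherwise the edge is cut and the clusters stay separate.
    return {n: find(n) for n in nodes}

edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
clusters = gem3d(list(range(6)), edges, threshold=0.5)
```

With positions in the unit cube the distance never exceeds 3, so a threshold above 3 keeps every edge and a threshold of 0 cuts every edge; the interesting range lies in between and depends on the random positions.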
3.3 K-means
The K-means procedure follows a simple and easy way to classify a given data set.
The K-means algorithm moves the objects between groups until a stable situation is
reached and no object moves to any other group. (Shokoufandeh et al, 2002)
It is a variant of the expectation-maximization algorithm in which the goal is to
determine the k means of data generated from Gaussian distributions (see Appendix A
for more information). The K-means algorithm is as follows:
1. Randomly generate k clusters and determine the cluster centres, or directly
generate k seed points as cluster centres.
2. Assign each point to the nearest cluster centre.
3. Compute the new cluster centres.
4. Repeat steps 2 and 3 until some convergence criterion is met (usually that the
assignment has not changed).
K-means clustering accepts as input values the number of clusters to group the
data into (K) and the Module Dependency Graph to be clustered. It then creates the K
initial clusters from the Module Dependency Graph by assigning the nodes of the
dependency graph to K clusters randomly. (Bradley and Fayyad, 1998)
The K-means algorithm then assigns the centroid of each cluster. The centroid of
a cluster is selected randomly among the nodes of that cluster.
K-means simple clustering assigns each node to the nearest cluster based on its
distance to the centroids, using a measure of distance or similarity such as the
Euclidean Distance Measure or the Manhattan/City-Block Distance Measure (see
Appendices B and C for more information). The preceding steps are repeated until
stable clusters are formed and the K-means clustering procedure is completed. Stable
clusters are formed when new iterations of the K-means clustering algorithm do not
create new clusters, as the cluster centre of each cluster formed is the same as the
old cluster centre. (Likas et al, 2002)
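
The general K-means loop described above can be sketched as follows. This example uses the textbook variant in which each new centre is the mean of its members and the objects are points with coordinates; the project's variant, which picks a centroid among the nodes of a cluster, may differ. The function name `kmeans`, the sample points and the initial centres are made up for illustration.

```python
import math

def kmeans(points, centres, max_iter=100):
    """Assign points to the nearest centre until the assignment is stable."""
    centres = list(centres)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centre (Euclidean distance).
        new_assignment = [
            min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Step 4: stop when no point has changed its group.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centre as the mean of its members.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centres

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centres = kmeans(points, centres=[(0, 0), (10, 10)])
```

Passing the initial centres explicitly keeps the example deterministic; in practice they would be chosen randomly, as in step 1.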



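Figure 3.2: K-means algorithm (flowchart: Start, Number of clusters K, Centroids, Distance of objects to centroids, Grouping based on minimum distance, "No object moved group?" with Yes and No branches, End)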
3.3.1 K-means simple
The structure of the K-means simple algorithm is exactly the same as that of the
algorithm defined and clarified in the previous section.
3.3.2 K-means Gem3D
The steps of K-means Gem 3D are the same as those of the K-means simple algorithm.
The difference between the two algorithms is how the distance between two nodes is
calculated.
In K-means Gem 3D, instead of the Euclidean Distance Measure or the
Manhattan/City-Block Distance Measure, the Gem 3D algorithm’s method for
calculating the distance is used. The structure of the algorithm is the same as that
of the K-means simple algorithm.
3.4 One cluster
The One cluster algorithm is a hierarchical (agglomerative) clustering that builds a
hierarchy of clusters. It produces a sequence of partitions in which each partition
is nested into the next partition in the sequence. When the algorithm is applied to a
graph, it places each node into a cluster until all nodes are in one cluster. In the
end, a single cluster containing all the nodes of the graph remains.
3.5 One cluster per node
This (divisive) algorithm takes the nodes of a graph one by one and adds each node
to a new cluster. Therefore, in each step of this clustering one cluster is created
that contains just one node, and at the end of the clustering there are n clusters,
where n is the number of nodes of the graph.
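
The end results of these two algorithms are the two extreme clusterings, which makes them useful baselines. The sketch below shows only the final clusterings, not the step-by-step construction, and evaluates them with the MQ of Section 3.1; the function names are hypothetical.

```python
def one_cluster(nodes):
    """Final result of 'One cluster': every node carries the same label."""
    return {n: 0 for n in nodes}

def one_cluster_per_node(nodes):
    """Final result of 'One cluster per node': every node has its own label."""
    return {n: n for n in nodes}

def mq(edges, cluster_of):
    """Module Quality as defined in Section 3.1."""
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())

nodes = range(6)
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(mq(edges, one_cluster(nodes)), mq(edges, one_cluster_per_node(nodes)))
```

For any graph with at least one edge and no self-loops, the single cluster has only intra-edges and an MQ of 1, while one cluster per node has no intra-edges and an MQ of 0.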


4. Environment of implementation algorithms and testing
This chapter describes the platform and environment in which the code was written.
The environment of this project is Eclipse.
4.1 What is Eclipse and Eclipse Foundation?
Eclipse is an open source community whose projects are focused on building an
open development platform comprised of extensible frameworks, tools and runtimes
for building, developing and managing software across the lifecycle. The Eclipse
Foundation is a not-for-profit, member-supported corporation that helps cultivate
both an open source community and an ecosystem of complementary products and
services.
The Eclipse Project was originally created by IBM in November 2001 and
supported by a consortium of software vendors. The Eclipse Foundation was created
in January 2004 as an independent not-for-profit corporation to act as the steward of
the Eclipse community. It was created to allow a vendor-neutral, open and
transparent community to be established around Eclipse. Today, the Eclipse
community consists of individuals and organizations from a cross section of the
software industry.
The Eclipse Foundation has been established to serve the Eclipse open source
projects and the Eclipse community. As an independent not-for-profit corporation,
the Foundation and the Eclipse governance model ensure that no single entity is able
to control the strategy, policies or operations of the Eclipse community.
The Foundation is focused on creating an environment for successful open source
projects and on promoting the adoption of Eclipse technology in commercial and
open source solutions. Through services like IP Due Diligence, annual release trains,
development community support and ecosystem development, the Eclipse model of
open source development is a unique and proven model for open source
development.
4.2 History of Eclipse
Industry leaders Borland, IBM, MERANT, QNX Software Systems, Rational
Software, Red Hat, SuSE, TogetherSoft and Webgain formed the initial eclipse.org
Board of Stewards in November 2001. By the end of 2003, this initial consortium
had grown to over 80 members.
On Feb 2, 2004 the Eclipse Board of Stewards announced Eclipse’s
reorganization into a not-for-profit corporation. Originally a consortium that formed
when IBM released the Eclipse Platform into open source, Eclipse became an
independent body that drives the platform’s evolution to benefit the providers of
software development offerings and end users. All technology and source code
provided to and developed by this fast-growing community is made available
royalty-free via the Eclipse Public License.
The founding Strategic Developers and Strategic Consumers were Ericsson, HP,
IBM, Intel, MontaVista Software, QNX, SAP and Serena Software.
More information about the structure and mission of the Eclipse Foundation is
available in the formal documents that establish how the foundation operates, and in
the press release announcing the creation of the independent organization.


4.3 Downloading Eclipse
To download Eclipse, open the Eclipse home page and select the appropriate version (it
is available via References, Web links [4]). Using Eclipse requires a Java runtime
environment (JRE); the Java 5 JRE is recommended.
4.4 Project codes
All code was written in Eclipse. Eclipse was chosen because the code is integrated
with the Vizz analyzer project (developed in the MSI department of Växjö University)
and uses some classes and methods of that project, which is itself written in Eclipse.
This project comprises a number of classes that implement the different
clustering algorithms and several classes for testing the clustering algorithms over
the artificial graphs and the Reference graph. These implementation and test classes
were written by me.
In addition to this project, Grail helped me to create and test the clustering
algorithms. Grail is an open source library which has been developed in the MSI
department of Växjö University.
It has several classes that contain basic definitions of different clusterings, which I
used when creating the clustering classes.
In addition, for testing the different clusterings I used a .gml file from the Grail
library as input.




5. Tables and diagrams
In this section, tables and diagrams show the time and quality of each algorithm when
it is implemented over four graphs.
The algorithms are: Nearest neighbour, Nearest neighbour big step, Best neighbour, Best
neighbour big step, Gem3D, Kmeans simple, Kmeans Gem3D, One cluster and One
cluster per node. These algorithms are implemented over the Unconnected graph, Directed
KX graph, Directed cycle KX graph and Directed cycle graph with different sizes: 2, 4,
8, ..., 1024.
As the size of some graphs grows, clustering them takes more than 300 seconds
(5 minutes) or stops because of lack of memory. These cases are stated in the tables
and are not drawn in the diagrams.
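A measurement loop of this kind could look roughly as follows. This is a hedged sketch, not the harness used in the project: the function names are hypothetical, the times are assumed to be wall-clock milliseconds, and the five-minute limit is checked after the run rather than by interrupting it. Python's `MemoryError` stands in for the out-of-memory condition.

```python
import time

TIME_LIMIT_MS = 300_000  # 5 minutes, the cut-off stated in the text

# Graph sizes used in the tables: 2, 4, 8, ..., 1024.
sizes = [2 ** k for k in range(1, 11)]


def time_clustering(cluster, graph):
    """Run cluster(graph) once and return the elapsed milliseconds."""
    start = time.perf_counter()
    cluster(graph)
    return (time.perf_counter() - start) * 1000.0


def format_time(elapsed_ms):
    """Render a measurement in the style of the timetables."""
    if elapsed_ms > TIME_LIMIT_MS:
        return "More than 5 min"
    return str(round(elapsed_ms))


def run(cluster, graph):
    """Time one clustering and report memory exhaustion as a table entry."""
    try:
        return format_time(time_clustering(cluster, graph))
    except MemoryError:
        return "Out of memory"
```

A single timed run per cell is sensitive to timer resolution and system load, so small values in such a table are best read as "negligible" rather than as exact measurements.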
5.1 Time tables and diagrams
The clustering time changes with the size of a graph and with the algorithm that is
implemented. The following timetables and diagrams show how the time changes
under these variations.
5.1.1 Nearest neighbour


Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0              -                 0
4     0              15             0                 0
8     16             47             31                16
16    0              234            172               46
32    32             3250           9282              110
64    47             187781         170532            656
128   94             > 5 min        > 5 min           6266
256   219            > 5 min        > 5 min           274484
512   7156           > 5 min        > 5 min           > 5 min
1024  27719          > 5 min        > 5 min           > 5 min

Figure 5.1: Timetable of Nearest neighbour

[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.2: Time diagram of Nearest neighbour

As the size of the graphs increases, the clustering time increases. Clustering is very
time consuming for large sizes such as 512 and 1024. The Unconnected graph clearly
takes the least time, and the Directed cycle KX and Directed KX graphs take the most.


5.1.2 Nearest neighbour big step


Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0              -                 0
4     0              0              0                 0
8     0              0              0                 0
16    0              16             0                 16
32    0              47             406               0
64    0              1125           2578              31
128   31             18578          12031             110
256   78             98563          120328            3469
512   3422           Out of memory  > 5 min           11610
1024  13610          Out of memory  > 5 min           85641

Figure 5.3: Timetable of Nearest neighbour big step

[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.4: Time diagram of Nearest neighbour big step

On average, Nearest neighbour big step takes less time than Nearest neighbour. Its time
also grows with the size of the graphs, but not as much as for Nearest neighbour. The
Unconnected graph and the Directed cycle graph take less time than the Directed KX and
Directed cycle KX graphs.

5.1.3 Best neighbour

Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0              -                 0
4     0              16             0                 0
8     0              31             15                31
16    16             187            94                31
32    0              3625           8188              109
64    31             224750         131046            734
128   62             > 5 min        > 5 min           10641
256   172            > 5 min        > 5 min           > 5 min
512   6813           > 5 min        > 5 min           > 5 min
1024  27469          > 5 min        > 5 min           > 5 min

Figure 5.5: Timetable of Best neighbour


[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.6: Time diagram of Best neighbour


In this algorithm, time increases very fast as the size grows. Except for the
Unconnected graph, clustering takes more than 5 minutes already at medium sizes.
The Unconnected graph and the Directed cycle graph take less time than the
Directed KX and Directed cycle KX graphs.
5.1.4 Best neighbour big step


Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0              -                 0
4     0              0              0                 16
8     0              15             0                 0
16    0              47             47                16
32    0              1250           3031              78
64    0              68171          49985             641
128   31             > 5 min        > 5 min           32578
256   109            > 5 min        > 5 min           > 5 min
512   3406           > 5 min        > 5 min           > 5 min
1024  13515          > 5 min        > 5 min           > 5 min

Figure 5.7: Timetable of Best neighbour big step

[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.8: Time diagram of Best neighbour big step

On average this algorithm takes less time than Best neighbour does; however, it is
still one of the most time consuming algorithms. It takes the least time over the
Unconnected graph, followed by the Directed cycle graph; the most time consuming
graphs are the Directed KX and Directed cycle KX graphs.
5.1.5 Gem 3D

Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              15             -                 16
4     16             15             15                0
8     47             47             31                31
16    62             79             78                47
32    109            188            1125              125
64    219            594            3047              218
128   813            2344           22438             688
256   2797           1026           104688            2485
512   107516         Out of memory  275562            1766
1024  > 5 min        Out of memory  Out of memory     175328

Figure 5.9: Timetable of Gem3D

[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.10: Time diagram of Gem3D

This algorithm takes quite little time compared with the previous ones. Its performance
over the Directed cycle graph is notable: for every size below 1024, clustering the
Directed cycle graph takes at most about 2.5 seconds.
Ranked from most to least time consuming, the graphs are the Directed cycle KX graph,
the Directed KX graph, the Unconnected graph and the Directed cycle graph.
















5.1.6 K-means simple

Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0              -                 0
4     0              0              0                 0
8     0              0              0                 0
16    0              0              16                0
32    0              0              78                0
64    0              16             94                0
128   31             47             484               16
256   32             156            1953              31
512   390            Out of memory  3391              47
1024  1281           Out of memory  Out of memory     109

Figure 5.11: Timetable of Kmeans simple

[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.12: Time diagram of Kmeans simple

The timetable and diagram show that this algorithm clearly reduces the clustering
time. The Directed cycle graph is the least time consuming graph; at size 1024 it
takes only about 0.1 second to be clustered. The Unconnected graph and the Directed
KX graph come second, and the Directed cycle KX graph is last. It takes the most time
to be clustered, but considerably less than under the other algorithms.

















5.1.7 K-means Gem3D

Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     16             31             -                 16
4     16             31             16                15
8     0              63             16                47
16    31             94             31                63
32    63             187            672               125
64    172            609            2578              234
128   734            2359           20344             735
256   3046           11016          122328            2578
512   117360         Out of memory  282140            12016
1024  > 5 min        Out of memory  Out of memory     198016

Figure 5.13: Timetable of Kmeans Gem3D

[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.14: Time diagram of Kmeans Gem3D

The time taken has clearly increased compared with the previous algorithm, but it is
still one of the less time consuming algorithms. Ranked from most to least time
consuming, the graphs are Directed cycle KX, Directed KX, Unconnected and Directed
cycle, the same order as for the Kmeans simple algorithm.
5.1.8 One cluster

Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0              -                 0
4     0              0              0                 0
8     0              0              0                 0
16    0              0              0                 0
32    0              16             0                 0
64    0              0              0                 0
128   0              16             62                0
256   0              62             547               16
512   16             Out of memory  734               16
1024  16             Out of memory  Out of memory     15

Figure 5.15: Timetable of One cluster


[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.16: Time diagram of One cluster

This algorithm, which puts all the nodes in one cluster, is, together with ‘One cluster
per node’, the least time consuming algorithm. There is no big difference between
small and large graph sizes; all of them take almost the same time. Among the four
graphs, the Unconnected and Directed cycle graphs take less time than the Directed KX
and Directed cycle KX graphs.

5.1.9 One cluster per node

Clustering time in milliseconds.

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0              -                 0
4     0              0              0                 0
8     0              0              0                 0
16    0              0              0                 0
32    0              0              16                0
64    0              0              16                0
128   0              15             203               0
256   0              62             484               0
512   0              Out of memory  625               16
1024  0              Out of memory  Out of memory     16

Figure 5.17: Timetable of One cluster per node

[Diagram: clustering time against graph size (2, 4, 8, ..., 1024), one series each for
the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.18: Time diagram of One cluster per node

One cluster per node, which puts each node in a cluster of its own, takes very little
time for the different graphs and sizes, the same as the ‘One cluster’ algorithm.
5.2 Quality table and diagrams
Increasing the size affects the quality of the clustering. Quality can go up or down
depending on the type of graph, and in some cases there is no considerable change in
quality between different sizes of a graph. The following tables and diagrams show the
effect of size on quality.
5.2.1 Nearest neighbour


Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.666          -                 1
4     0              0.8            0.428             0.333
8     0              0.888          0.499             0.333
16    0              0.941          0.727             0.333
32    0              0.969          0.864             0.355
64    0              0.984          0.934             0.355
128   0              > 5 min        > 5 min           0.344
256   0              > 5 min        > 5 min           0.335
512   0              > 5 min        > 5 min           > 5 min
1024  0              > 5 min        > 5 min           > 5 min

Figure 5.19: Quality table of Nearest neighbour

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.20: Quality diagram of Nearest neighbour

Over the Unconnected graph the quality does not change; it is zero for all sizes of
the graph. Over the Directed KX graph, quality increases as the size becomes bigger.
The algorithm reaches its highest quality over this graph, almost one. The quality
over the Directed cycle KX graph also increases with the size of the graph and comes
close to one. The opposite holds for the Directed cycle graph: its quality is highest,
one, at size 2. For larger sizes the quality drops to about 0.33 and then differs
little between sizes.




5.2.2 Nearest neighbour big step


Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.666          -                 1
4     0              0.8            0.309             1
8     0              0.888          0.295             1
16    0              0.941          0.329             1
32    0              0.969          0.391             0.999
64    0              0.984          0.355             0.999
128   0              0.992          0.243             0.999
256   0              0.996          0.405             0.999
512   0              Out of memory  > 5 min           1
1024  0              Out of memory  > 5 min           0.999

Figure 5.21: Quality table of Nearest neighbour big step

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.22: Quality diagram of Nearest neighbour big step

This algorithm behaves differently for the different graphs. For the Unconnected
graph, quality is zero for all sizes. For the Directed KX graph, quality increases
steadily with size, although the changes are not large. For the Directed cycle KX
graph, quality stays low and fluctuates between about 0.24 and 0.41. Nearest
neighbour big step has its best quality over the Directed cycle graph, where the
changes are negligible: quality is almost one for all sizes of this graph.



5.2.3 Best neighbour


Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.666          -                 1
4     0              0.8            0.428             0.333
8     0              0.888          0.499             0.333
16    0              0.941          0.727             0.333
32    0              0.969          0.864             0.355
64    0              0.984          0.934             0.333
128   0              > 5 min        > 5 min           0.344
256   0              > 5 min        > 5 min           > 5 min
512   0              > 5 min        > 5 min           > 5 min
1024  0              > 5 min        > 5 min           > 5 min

Figure 5.23: Quality table of Best neighbour

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.24: Quality diagram of Best neighbour

This algorithm behaves just like the Nearest neighbour algorithm; however, the
quality of some graphs is slightly lower than for Nearest neighbour. The quality of
the Unconnected graph is zero for all sizes. The quality of the Directed KX and
Directed cycle KX graphs increases as the size rises, and the quality of the Directed
cycle graph goes down as the size becomes bigger, although it is one when the size of
the graph is two.
5.2.4 Best neighbour big step

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.666          -                 1
4     0              0.8            0.309             1
8     0              0.888          0.499             1
16    0              0.941          0.727             1
32    0              0.969          0.864             0.999
64    0              0.984          0.934             0.999
128   0              > 5 min        > 5 min           0.999
256   0              > 5 min        > 5 min           > 5 min
512   0              > 5 min        > 5 min           > 5 min
1024  0              > 5 min        > 5 min           > 5 min

Figure 5.25: Quality table of Best neighbour big step

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.26: Quality diagram of Best neighbour big step

Best neighbour big step shows the same changes as the previous algorithm. Only the
quality of the Directed cycle graph improves; it is almost one for all sizes of the graph.
5.2.5 Gem3D

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.25           -                 1
4     0              0.312          0.428             0.25
8     0              0.888          0.343             0.444
16    0              0.941          0.551             0.292
32    0              0.969          0.864             0.438
64    0              0.984          0.934             0.27
128   0              0.992          0.968             0.447
256   0              0.495          0.984             0.404
512   0              Out of memory  0.496             0.402
1024  > 5 min        Out of memory  Out of memory     0.406

Figure 5.27: Quality table of Gem3D

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.28: Quality diagram of Gem3D



The quality of the Unconnected graph is zero, just as it was for the other algorithms.
For the Directed KX graph, quality increases as the size grows up to 128, but
decreases when the graph becomes bigger; the algorithm does not perform well over
large graphs. The Directed cycle KX graph has the highest quality, rising towards one
as its size becomes bigger, but at size 512 the quality drops sharply. The quality of
the Directed cycle graph is one at size two and becomes lower as the graph grows.
5.2.6 K-means simple

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.25           -                 0
4     0              0.125          0.25              0
8     0              0.062          0.308             0
16    0              0.031          0.497             0
32    0              0.015          0.864             0
64    0              0.007          0.557             0
128   0              0.003          0.968             0
256   0              0.001          0.984             0
512   0              Out of memory  0.992             0
1024  0              Out of memory  Out of memory     0

Figure 5.29: Quality table of Kmeans simple

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.30: Quality diagram of Kmeans simple

Under the Kmeans simple algorithm, quality is clearly poor for the Unconnected and
Directed cycle graphs: it is zero for all sizes. The quality of the Directed KX graph
goes down as the size rises and is almost zero for big sizes. The Directed cycle KX
graph has the best quality. Its quality generally grows with size (with a dip at size
64) and is almost one at size 512.


5.2.7 K-means Gem3D

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.25           -                 0
4     0              0.125          0.25              0
8     0              0.625          0.5               0
16    0              0.031          0.727             0
32    0              0.015          0.864             0
64    0              0.007          0.934             0
128   0              0.003          0.967             0
256   0              0.001          0.669             0
512   0              Out of memory  0.2               0
1024  > 5 min        Out of memory  Out of memory     0

Figure 5.31: Quality table of Kmeans Gem3D


[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.32: Quality diagram of Kmeans Gem3D


The Unconnected graph and the Directed cycle graph have the same quality as under the
Kmeans simple implementation, zero; but the quality of the Directed KX graph and the
Directed cycle KX graph changes as the size becomes bigger.
For the Directed KX graph, quality reaches its peak at size eight and then drops
almost to zero as the size grows. The quality of the Directed cycle KX graph increases
to almost one up to size 128 and then goes down.


5.2.8 One cluster

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.666          -                 1
4     0              0.8            0.666             1
8     0              0.888          0.714             1
16    0              0.941          0.809             1
32    0              0.969          0.89              1
64    0              0.984          0.941             1
128   0              0.992          0.969             1
256   0              0.996          0.984             1
512   0              Out of memory  0.992             1
1024  0              Out of memory  Out of memory     1

Figure 5.33: Quality table of One cluster

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.34: Quality diagram of One cluster

This algorithm, which puts all the nodes in one cluster, has the lowest quality when
it is implemented over the Unconnected graph and the best quality when it is
implemented over the Directed cycle graph.
For the Directed KX graph and the Directed cycle KX graph, the quality rises almost
to one as the size increases.
5.2.9 One cluster per node

Size  Unconnected    Directed KX    DirectedCycle KX  DirectedCycle
2     0              0.25           -                 0
4     0              0.125          0.25              0
8     0              0.625          0.208             0
16    0              0.312          0.118             0
32    0              0.015          0.061             0
64    0              0.007          0.031             0
128   0              0.003          0.015             0
256   0              0.001          0.007             0
512   0              Out of memory  0.003             0
1024  0              Out of memory  Out of memory     0

Figure 5.35: Quality table of One cluster per node

[Diagram: clustering quality against graph size (2, 4, 8, ..., 1024), one series each
for the Unconnected, Directed KX, DirectedCycle KX and DirectedCycle graphs.]


Figure 5.36: Quality diagram of One cluster per node

This clustering algorithm has the worst quality. Its quality is zero for both the
Unconnected graph and the Directed cycle graph at all sizes. The quality of the
Directed KX graph and the Directed cycle KX graph decreases almost to zero as the size
of the graphs rises.
5.3 Extremity
As mentioned in the introduction, the extremity of the algorithms is calculated
by implementing them over the Directed cycle KX graph. The nodes in the clusters
of a clustering should not show extremity, because extremity influences the quality
factor: a clustering with an extreme node distribution would have a very low quality.
In particular, a clustering algorithm should avoid the following two situations:
a) The majority of nodes are grouped into one or a few clusters.
The One cluster algorithm partitions nodes in exactly this way: it puts all of the
nodes in one cluster. With respect to extremity this clustering is not acceptable,
even though its clustering time is lower than that of the other algorithms.
b) The majority of clusters are singletons (contain just one node).
The One cluster per node algorithm clusters in this way. At the end of the clustering
the number of clusters is equal to the number of nodes, because it puts each node in
a cluster of its own.
In this part, the node distributions of the algorithms are shown for the different
sizes of the Directed cycle KX graph. The number of clusters is set to four, so there
are four clusters in each clustering.
The algorithms are implemented over different sizes of the Directed cycle KX graph.
The smallest size of the graph is four and it grows up to 512.
For the larger sizes, the running time of some algorithms is more than five minutes.
These sizes are omitted from the following tables.
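The two situations above can be made concrete with two simple indicators computed from the cluster sizes. These indicators are an illustrative assumption, not the extremity measure used in this project: the share of all nodes held by the largest cluster corresponds to situation (a), and the share of non-empty clusters that are singletons corresponds to situation (b). The function name is hypothetical, and at least one non-empty cluster is assumed.

```python
def extremity_report(cluster_sizes):
    """Summarise how unevenly nodes are spread over clusters.

    cluster_sizes is a list with the number of nodes in each cluster.
    """
    total = sum(cluster_sizes)
    non_empty = [s for s in cluster_sizes if s > 0]
    # Situation (a): most nodes sit in one cluster.
    largest_share = max(cluster_sizes) / total
    # Situation (b): most clusters contain a single node.
    singleton_share = sum(1 for s in non_empty if s == 1) / len(non_empty)
    return {"largest_share": largest_share, "singleton_share": singleton_share}
```

For an even distribution such as `[16, 16, 16, 16]` the largest cluster holds a quarter of the nodes and there are no singletons, whereas a distribution such as `[125, 1, 1, 1]` puts almost all nodes in one cluster and leaves three of the four clusters as singletons.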







5.3.1 Nearest neighbour

4 8 16 32 64
Cluster1 2 2 4 8 16
Cluster2 2 2 4 8 16
Cluster3 0 2 4 8 16
Cluster4 0 2 4 8 16

Figure 5.37: Nodes distribution for Nearest neighbour

The node distribution is quite good for this clustering algorithm. For all sizes of the
graph every cluster contains the same number of nodes; only at size four are there
two clusters that do not contain any nodes.
5.3.2 Nearest neighbour big step

4 8 16 32 64 128 256
Cluster1 2 4 9 11 23 125 215
Cluster2 1 2 5 9 39 1 39
Cluster3 1 1 1 11 1 1 1
Cluster4 0 1 1 1 1 1 1

Figure 5.38: Nodes distribution for Nearest neighbour big step

This algorithm shows extremity when it is implemented over the different sizes of the
graph. As can be seen in the table, the majority of nodes are grouped in one or two
clusters and the other clusters contain only a few nodes.
5.3.3 Best neighbour

4 8 16 32 64
Cluster1 2 2 4 8 16
Cluster2 2 2 4 8 16
Cluster3 0 2 4 8 16
Cluster4 0 2 4 8 16

Figure 5.39: Nodes distribution for Best neighbour

The Best neighbour algorithm shows extremity only when it is implemented over the
graph of size four, for which two clusters contain no node. For the other sizes, all
clusters contain the same number of nodes.
5.3.4 Best neighbour big step

4 8 16 32 64
Cluster1 2 2 4 8 16
Cluster2 1 2 4 8 16
Cluster3 1 2 4 8 16
Cluster4 0 2 4 8 16

Figure 5.40: Nodes distribution for Best neighbour big step


The Best neighbour big step algorithm does not show extremity. The number of nodes is
almost the same for all clusters for the different sizes of the graph.



5.3.5 Gem 3D

4 8 16 32 64 128 256 512
Cluster1 2 1 4 8 16 64 64 511
Cluster2 2 2 4 8 16 32 64 1
Cluster3 0 4 3 8 16 32 64 0
Cluster4 0 1 5 8 16 0 64 0

Figure 5.41: Nodes distribution for Gem 3D

This algorithm does not show extremity, and the node distribution is quite good, for
all sizes of the graph except 512. For that size the extremity is obvious: almost
all of the nodes are grouped in one cluster and the other clusters have one or no node.
5.3.6 K-means simple

4 8 16 32 64 128 256 512
Cluster1 1 2 4 8 20 32 64 128
Cluster2 1 3 2 8 20 32 64 128
Cluster3 1 2 8 8 20 64 64 256
Cluster4 1 1 2 8 4 0 64 0

Figure 5.42: Nodes distribution for Kmeans simple

The node distribution for Kmeans simple is quite good and shows no extremity. Only for
the two sizes 128 and 512 is there one cluster that contains no node.
5.3.7 K-means Gem 3D

4 8 16 32 64 128 256 512
Cluster1 1 2 4 8 16 32 19 423
Cluster2 1 2 4 8 16 32 45 31
Cluster3 1 2 4 8 16 32 64 33
Cluster4 1 2 4 8 16 32 128 25

Figure 5.43: Nodes distribution for Kmeans Gem 3D

Kmeans Gem 3D shows slight extremity. Up to size 256, the node distribution is
very good. For size 512, however, the majority of nodes are grouped in one
cluster and the other clusters contain few nodes in comparison.


6. Reference graph
The Reference graph is a real graph with a fixed size. For this graph, in addition to
‘Time’ and ‘Quality’, we can calculate ‘Precision’ and ‘Recall’.
To explain ‘Precision’ and ‘Recall’, we first define some basic terms. Given a
reference clustering and a current clustering:
‘True positives’ (tp): pairs of nodes in the same cluster in both clusterings.
‘False positives’ (fp): pairs of nodes in the same cluster in the current clustering,
but in different clusters in the reference clustering.
‘False negatives’ (fn): pairs of nodes in different clusters in the current clustering,
but in the same cluster in the reference clustering.
Precision = tp / (tp + fp) and Recall = tp / (tp + fn). The larger Precision and
Recall are, the more similar the reference clustering and the current clustering are
to each other, and vice versa.
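The pair-based definitions translate directly into a short computation. The sketch below is only an illustration under stated assumptions: clusterings are represented as dictionaries from node to cluster label, pairs are unordered pairs of distinct nodes, and a ratio with a zero denominator is reported as 1.0. None of these choices is stated in this report, and the function names are hypothetical.

```python
from itertools import combinations


def pair_counts(reference, current):
    """Count tp, fp and fn over unordered pairs of distinct nodes.

    reference and current map every node to its cluster label.
    """
    tp = fp = fn = 0
    for a, b in combinations(list(reference), 2):
        same_ref = reference[a] == reference[b]
        same_cur = current[a] == current[b]
        if same_cur and same_ref:
            tp += 1  # together in both clusterings
        elif same_cur:
            fp += 1  # together only in the current clustering
        elif same_ref:
            fn += 1  # together only in the reference clustering
    return tp, fp, fn


def precision_recall(reference, current):
    """Return (precision, recall) of current measured against reference."""
    tp, fp, fn = pair_counts(reference, current)
    # Assumption: an empty denominator is treated as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Under these assumptions, a current clustering that puts all nodes into one cluster always has recall 1, because no pair can be separated, while its precision equals the fraction of all pairs that the reference clustering keeps together.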
6.1 Reference graph Time, Precision, Recall and Quality
The following table shows the time, precision, recall and quality of the clustering
algorithms when they are implemented over the reference graph.

Algorithm                   Time (ms)        Precision        Recall           Quality
Nearest neighbour           More than 5 min  More than 5 min  More than 5 min  More than 5 min
Nearest neighbour big step  1313             0.199            0.936            0.137
Best neighbour              More than 5 min  More than 5 min  More than 5 min  More than 5 min
Best neighbour big step     More than 5 min  More than 5 min  More than 5 min  More than 5 min
Gem 3D                      7500             0.199            0.926            0.061
K-means simple              47               0.335            0.309            0.247
K-means Gem 3D              8609             0.326            0.433            0.35
One cluster                 16               0.195            1                1
One cluster per node        15               1                0.115            0

Figure 6.1: Time, Precision, Recall and Quality table for Reference graph

The size of the reference graph is 420. Because this size is fixed, time and quality
cannot be examined for different sizes of the reference graph.

[Diagram: time per algorithm over the reference graph, for Nearest neighbour big step,
Gem 3D, K-means simple, K-means Gem 3D, One cluster and One cluster per node.]


Figure 6.2: Time diagram of Reference graph

The execution time of the Nearest neighbour, Best neighbour and Best neighbour big
step algorithms is extremely large (more than 5 minutes). After these three
algorithms, Kmeans Gem 3D and Gem 3D take more time than the others. The Nearest
neighbour big step algorithm needs about one second, which is clearly little time.
Kmeans simple, One cluster and One cluster per node take the least time.

[Diagram: precision per algorithm over the reference graph, for Nearest neighbour big
step, Gem 3D, K-means simple, K-means Gem 3D, One cluster and One cluster per node.]


Figure 6.3: Precision diagram of Reference graph

The One cluster per node algorithm has the highest precision. This means that no two
nodes are in the same cluster in the current clustering while being in different
clusters in the reference clustering; in other words, whenever two nodes are in the
same cluster in the current clustering, they are also in the same cluster in the
reference clustering. The Gem 3D, Nearest neighbour big step and One cluster
algorithms have the lowest precision. It means there are many pairs of nodes in the
same cluster in the current clustering that are not in the same cluster in the
reference clustering.

[Diagram: recall per algorithm over the reference graph, for Nearest neighbour big
step, Gem 3D, K-means simple, K-means Gem 3D, One cluster and One cluster per node.]


Figure 6.4: Recall diagram of Reference graph

One cluster, Nearest neighbour big step and Gem 3D have the highest recall, which is
one or almost one. This means that few nodes are in different clusters in the current
clustering while being in the same cluster in the reference clustering. One cluster
per node has the lowest recall.
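
The pair-based reading of precision and recall described above can be sketched as follows. This is a minimal illustration that assumes both measures are computed over unordered pairs of nodes that share a cluster; the function names, the toy data and the convention for empty pair sets are assumptions, and the exact formula used in this report may differ in details (for example, the report lists a non-zero recall for One cluster per node, whereas this sketch gives zero).

```python
from itertools import combinations


def intra_pairs(clustering):
    """Return the set of unordered node pairs that share a cluster.

    `clustering` maps each node to a cluster label.
    """
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(clustering), 2)
        if clustering[a] == clustering[b]
    }


def precision_recall(current, reference):
    """Pair-counting precision and recall of `current` against `reference`."""
    cur = intra_pairs(current)
    ref = intra_pairs(reference)
    shared = len(cur & ref)
    # Assumed convention: an empty pair set scores 1.0, because no pair
    # contradicts the other clustering.
    precision = shared / len(cur) if cur else 1.0
    recall = shared / len(ref) if ref else 1.0
    return precision, recall


# Hypothetical toy data (not from the report): six nodes, two reference clusters.
reference = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in reference}
one_per_node = {node: i for i, node in enumerate(reference)}

print(precision_recall(one_cluster, reference))
print(precision_recall(one_per_node, reference))
```

On this toy data the two trivial clusterings show the same pattern as in Figure 6.1: putting everything in one cluster gives full recall but low precision, while one cluster per node gives full precision but poor recall.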




[Diagram: Quality per algorithm. Nearest neighbour big step 0,137; Gem 3D 0,061; K-means simple 0,247; K-means Gem 3D 0,35; One cluster 1; One cluster per node 0]


Figure 6.5: Quality diagram of Reference graph

The One cluster algorithm has the best quality, one. K-means Gem 3D and K-means
simple come next; however, their qualities (0,35 and 0,247) are small in
comparison with the first one. Nearest neighbour big step, Gem 3D and One
cluster per node have the lowest quality.
6.2 Reference graph Extremity
The node distribution of the Reference graph differs depending on which algorithm
is applied. The Nearest neighbour big step and Gem 3D algorithms show total
extremity. The K-means Gem 3D algorithm also shows extremity, although it is not as
large as for the first two algorithms. The node distribution of the graph under the
K-means simple algorithm is very good. The node distribution tables are as follows:

Nearest neighbour big step

Cluster0 414
Cluster1 1
Cluster2 1
Cluster3 1
Cluster4 1
Cluster5 1
Cluster6 1

Figure 6.6: Node distribution of Nearest neighbour big step

When Nearest neighbour big step is applied, almost all of the nodes (414 of 420) are
grouped in cluster 0 and each of the other clusters contains just one node.




Gem 3D
Cluster0 1
Cluster1 1
Cluster2 1
Cluster3 1
Cluster4 1
Cluster5 1
Cluster6 4
Cluster7 1
Cluster8 1
Cluster9 4
Cluster10 1
Cluster11 1
Cluster12 1
Cluster13 1
Cluster14 1
Cluster15 1
Cluster16 1
Cluster17 1
Cluster18 1
Cluster19 1
Cluster20 1
Cluster21 2
Cluster22 1
Cluster23 1
Cluster24 1
Cluster25 1
Cluster26 1
Cluster27 1
Cluster28 1
Cluster29 384

Figure 6.7: Node distribution of Gem 3D

The Gem 3D clustering algorithm shows enormous extremity. A large number of nodes
(384) are in cluster 29, while the other clusters contain only a few nodes each.

K-means simple
Cluster0 89
Cluster1 127
Cluster2 40
Cluster3 51
Cluster4 40
Cluster5 17
Cluster6 56

Figure 6.8: Node distribution of Kmeans simple

The K-means simple algorithm shows no extremity at all, and the node distribution
is completely acceptable.




K-means Gem 3D

Cluster0 5
Cluster1 97
Cluster2 4
Cluster3 135
Cluster4 127
Cluster5 3
Cluster6 49

Figure 6.9: Node distribution of Kmeans Gem 3D

Although the extremity of the K-means Gem 3D algorithm is not as large as that of
the first two algorithms, it still shows extremity.
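
The node distributions in Figures 6.6 to 6.9 can be summarised numerically. The sketch below is only an illustration: the two indicators (share of nodes in the largest cluster, and number of "tiny" clusters) and the 2% threshold are assumptions chosen for this example, not the extremity measure defined in this report. The cluster sizes are copied from the figures.

```python
# Cluster sizes copied from Figures 6.6 to 6.9 (Reference graph, 420 nodes).
SIZES = {
    "Nearest neighbour big step": [414, 1, 1, 1, 1, 1, 1],
    "Gem 3D": [1] * 6 + [4] + [1] * 2 + [4] + [1] * 11 + [2] + [1] * 7 + [384],
    "K-means simple": [89, 127, 40, 51, 40, 17, 56],
    "K-means Gem 3D": [5, 97, 4, 135, 127, 3, 49],
}

# Hypothetical threshold: a cluster holding under 2% of the nodes counts as tiny.
TINY_SHARE = 0.02


def extremity_summary(sizes, tiny_share=TINY_SHARE):
    """Return (share of nodes in the largest cluster, number of tiny clusters)."""
    total = sum(sizes)
    largest_share = max(sizes) / total
    tiny_clusters = sum(1 for s in sizes if s / total < tiny_share)
    return largest_share, tiny_clusters


for name, sizes in SIZES.items():
    share, tiny = extremity_summary(sizes)
    print(f"{name}: largest cluster holds {share:.1%}, {tiny} tiny clusters")
```

Under these assumed indicators, Nearest neighbour big step and Gem 3D each combine one dominant cluster with many tiny ones, K-means Gem 3D has no dominant cluster but three tiny ones, and K-means simple has neither, which agrees with the descriptions above.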


7. Comparing Reference graph with artificial graphs
In this part, the artificial graphs are compared with the Reference graph. The
artificial graphs can be given different sizes before the clustering algorithms are
applied to them, but the Reference graph, which is a real graph, has a fixed size of
420 nodes. Therefore, in this part, the Reference graph is compared with the
artificial graphs at sizes 256 and 512, which lie on either side of 420.

7.1 Time
7.1.1 Nearest neighbour

Graph Size Time
Unconnected 256 219
Reference 420 More than 5 min
Unconnected 512 7156

Graph Size Time
Directed KX 256 More than 5 min
Reference 420 More than 5 min
Directed KX 512 More than 5 min

Graph Size Time
Directed cycle KX 256 More than 5 min
Reference 420 More than 5 min
Directed cycle KX 512 More than 5 min

Graph Size Time
Directed cycle 256 274484
Reference 420 More than 5 min
Directed cycle 512 More than 5 min

Figure 7.1: Time tables of Reference & artificial graphs (NN)

When this algorithm is applied to the Reference graph, it takes a long time (more
than 5 minutes). This is longer than the time the Unconnected graph takes to be
clustered, both at size 256 and at size 512.
The Directed KX, Directed cycle KX and Directed cycle graphs take about as long as
the Reference graph. As the tables show, at both sizes, 256 and 512, they need
several minutes to be clustered.
7.1.2 Nearest neighbour big step

Graph Size Time
Unconnected 256 78
Reference 420 1313
Unconnected 512 3422

Graph Size Time
Directed KX 256 98563
Reference 420 1313
Directed KX 512 Out of memory

Graph Size Time
Directed cycle KX 256 120328
Reference 420 1313
Directed cycle KX 512 More than 5 min

Graph Size Time
Directed cycle 256 3469
Reference 420 1313
Directed cycle 512 11610

Figure 7.2: Time tables of Reference & artificial graphs (NNBS)

The execution time of the Nearest neighbour big step algorithm on the Reference
graph is very small. Compared with the artificial graphs, it surprisingly lies
between the times of the Unconnected graph at sizes 256 and 512. Directed KX,
Directed cycle and Directed cycle KX take more time than the Reference graph at both
the small and the large size, except that Directed KX at size 512 ran out of memory,
so no time is available for it.
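
The comparison made in this section can be sketched as a small helper that places the Reference graph's time relative to an artificial graph's times at sizes 256 and 512. The values are copied from Figure 7.2. The function name, the result labels, and the handling of "More than 5 min" (treated as larger than any measured time) and "Out of memory" (treated as no measurement) are assumptions made for this illustration.

```python
import math

TIMEOUT = math.inf  # "More than 5 min": treated as larger than any measured time
NO_RESULT = None    # "Out of memory": no time was measured

REFERENCE_TIME = 1313  # Reference graph, 420 nodes (Figure 7.2)

# (time at size 256, time at size 512) for each artificial graph, from Figure 7.2.
ARTIFICIAL = {
    "Unconnected": (78, 3422),
    "Directed KX": (98563, NO_RESULT),
    "Directed cycle KX": (120328, TIMEOUT),
    "Directed cycle": (3469, 11610),
}


def position(ref, t256, t512):
    """Describe where the reference time falls relative to the two artificial runs."""
    measured = [t for t in (t256, t512) if t is not NO_RESULT]
    if not measured:
        return "no comparison possible"
    if all(ref < t for t in measured):
        return "reference is faster"
    if all(ref > t for t in measured):
        return "reference is slower"
    return "reference is in between"


for graph, (t256, t512) in ARTIFICIAL.items():
    print(f"{graph}: {position(REFERENCE_TIME, t256, t512)}")
```
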
7.1.3 Best neighbour

Graph Size Time
Unconnected 256 172
Reference 420 More than 5 min
Unconnected 512 6813

Graph Size Time
Directed KX 256 More than 5 min
Reference 420 More than 5 min
Directed KX 512 More than 5 min

Graph Size Time
Directed cycle KX 256 More than 5 min
Reference 420 More than 5 min
Directed cycle KX 512 More than 5 min

Graph Size Time
Directed cycle 256 More than 5 min
Reference 420 More than 5 min
Directed cycle 512 More than 5 min

Figure 7.3: Time tables of Reference & artificial graphs (BN)

The Best neighbour algorithm takes a very long time on the Reference graph (more
than 5 minutes). The Unconnected graph takes quite little time, both when it is
smaller and when it is bigger than the Reference graph.
Surprisingly, the Directed KX, Directed cycle KX and Directed cycle graphs also take
more than 5 minutes, like the Reference graph, at size 256 as well as at size 512.



7.1.4 Best neighbour big step

Graph Size Time
Unconnected 256 109
Reference 420 More than 5 min
Unconnected 512 3406

Graph Size Time
Directed KX 256 More than 5 min
Reference 420 More than 5 min
Directed KX 512 More than 5 min

Graph Size Time
Directed cycle KX 256 More than 5 min
Reference 420 More than 5 min
Directed cycle KX 512 More than 5 min

Graph Size Time
Directed cycle 256 More than 5 min
Reference 420 More than 5 min
Directed cycle 512 More than 5 min

Figure 7.4: Time tables of Reference & artificial graphs (BNBS)

The Reference graph takes a very long time to be clustered (more than 5 minutes). In
comparison, the Unconnected graph needs very little time at both sizes, 256 and 512.
The other artificial graphs also take more than 5 minutes, like the Reference graph.

7.1.5 Gem 3D

Graph Size Time
Unconnected 256 2797
Reference 420 7500
Unconnected 512 107516

Graph Size Time
Directed KX 256 1026
Reference 420 7500
Directed KX 512 Out of memory

Graph Size Time
Directed cycle KX 256 104688
Reference 420 7500
Directed cycle KX 512 275562

Graph Size Time
Directed cycle 256 2485
Reference 420 7500
Directed cycle 512 1766

Figure 7.5: Time tables of Reference & artificial graphs (Gem3D)

When Gem 3D is applied to the Reference graph, it takes an acceptable time. This
time lies between the times the Unconnected graph needs to be clustered at sizes 256
and 512.
The Directed cycle KX graph needs much more time than the Reference graph, even when
it is smaller than the Reference graph.
Directed KX with size 256 needs less time than the Reference graph; at size 512 it
ran out of memory.
The Directed cycle graph takes less time than the Reference graph at both sizes, 256
and 512.
7.1.6 K-means simple

Graph Size Time
Unconnected 256 32
Reference 420 47
Unconnected 512 390

Graph Size Time
Directed KX 256 156
Reference 420 47
Directed KX 512 Out of memory

Graph Size Time
Directed cycle KX 256 1953
Reference 420 47
Directed cycle KX 512 3391





Graph Size Time
Directed cycle 256 31
Reference 420 47
Directed cycle 512 47

Figure 7.6: Time tables of Reference & artificial graphs (Kmeans simple)

The execution time of K-means simple on the Reference graph is very small. It lies
between the two times the Unconnected graph needs to be clustered at sizes 256 and
512.
Two other artificial graphs, Directed KX and Directed cycle KX, need more