School of Mathematics and Systems Engineering

Reports from MSI / Rapporter från MSI

Title

Measurement and comparison of clustering algorithms

Author

Shima Javar

Dec 2007

MSI Report 07149
Växjö University, ISSN 1650-2647
SE-351 95 VÄXJÖ, ISRN VXU/MSI/DA/E/07149/SE


Abstract

In this project, a number of different clustering algorithms are described and their workings explained. They are compared to each other by implementing them on a number of graphs with a known architecture.

These clustering algorithms, in the order they are implemented, are as follows: Nearest neighbour hill-climbing, Nearest neighbour big step hill-climbing, Best neighbour hill-climbing, Best neighbour big step hill-climbing, Gem 3D, K-means simple, K-means Gem 3D, One cluster and One cluster per node.

The graphs are Unconnected, Directed KX, Directed Cycle KX and Directed Cycle.

The results of these clusterings are compared with each other according to three criteria: time, quality and extremity of the node distribution. This enables us to find out which algorithm is most suitable for which graph. These artificial graphs are then compared with the reference architecture graph to reach the conclusions.

Keywords: clustering algorithm, module dependency graph, artificial graph, reference graph, cluster, node, edge, implementation time, quality, extremity, precision, recall.


Table of contents

ABSTRACT
1. INTRODUCTION
2. CLUSTERING ANALYSIS
3. CLUSTERING ALGORITHMS
3.1 Hill-climbing algorithm
3.1.1 Nearest neighbour
3.1.2 Nearest neighbour big step
3.1.3 Best neighbour
3.1.4 Best neighbour big step
3.2 Gem 3D
3.3 K-means
3.3.1 K-means simple
3.3.2 K-means Gem 3D
3.4 One cluster
3.5 One cluster per node
4. ENVIRONMENT OF ALGORITHM IMPLEMENTATION AND TESTING
4.1 What is Eclipse and the Eclipse Foundation?
4.2 History of Eclipse
4.3 Downloading Eclipse
4.4 Project codes
5. TABLES AND DIAGRAMS
5.1 Time tables and diagrams
5.1.1 Nearest neighbour
5.1.2 Nearest neighbour big step
5.1.3 Best neighbour
5.1.4 Best neighbour big step
5.1.5 Gem 3D
5.1.6 K-means simple
5.1.7 K-means Gem 3D
5.1.8 One cluster
5.1.9 One cluster per node
5.2 Quality tables and diagrams
5.2.1 Nearest neighbour
5.2.2 Nearest neighbour big step
5.2.3 Best neighbour
5.2.4 Best neighbour big step
5.2.5 Gem 3D
5.2.6 K-means simple
5.2.7 K-means Gem 3D
5.2.8 One cluster
5.2.9 One cluster per node
5.3 Extremity
5.3.1 Nearest neighbour
5.3.2 Nearest neighbour big step
5.3.3 Best neighbour
5.3.4 Best neighbour big step
5.3.5 Gem 3D
5.3.6 K-means simple
5.3.7 K-means Gem 3D
6. REFERENCE GRAPH
6.1 Reference graph time, precision, recall and quality
6.2 Reference graph extremity
7. COMPARING REFERENCE GRAPH WITH ARTIFICIAL GRAPHS
7.1 Time
7.1.1 Nearest neighbour
7.1.2 Nearest neighbour big step
7.1.3 Best neighbour
7.1.4 Best neighbour big step
7.1.5 Gem 3D
7.1.6 K-means simple
7.1.7 K-means Gem 3D
7.1.8 One cluster
7.2 Quality
7.2.1 Nearest neighbour
7.2.2 Nearest neighbour big step
7.2.3 Best neighbour
7.2.4 Best neighbour big step
7.2.5 Gem 3D
7.2.6 K-means simple
7.2.7 K-means Gem 3D
7.2.8 One cluster
7.2.9 One cluster per node
7.3 Extremity
7.3.1 Nearest neighbour big step
7.3.2 Gem 3D
7.3.3 K-means simple
7.3.4 K-means Gem 3D
8. CONCLUSION
9. REFERENCES
APPENDICES
Appendix A Gaussian distribution
Appendix B Euclidean distance
Appendix C Manhattan/city-block distance


1. Introduction

A well-documented architecture can improve the quality and maintainability of software. Many software systems, however, do not have their architecture documented. In addition, documented architectures become obsolete and outdated, and the system's structure degrades as changes are made to the system to meet market requirements. High turnover among developers makes the situation even more complex. Another problem of large software systems is the maintenance of the architectural documentation. These are the points which have made software clustering one of the most important subjects in computer science. Software clustering is a useful tool that can help with these problems. It helps in understanding a software system's architecture and its documentation, and supports remodularization by decomposing the software system into meaningful subsystems.

In recent years, a large number of books, articles and research papers have been published which analyze different kinds of clustering algorithms (Everitt, 1979). They help in selecting the most suitable algorithm for different problems in a wide variety of sciences. What is the definition of a 'suitable algorithm'? It is an algorithm which takes little time and produces a high-quality result.

In the first part of this project, I compare nine clustering algorithms to each other by implementing them over four artificial graphs and a reference graph according to three criteria: implementation time, quality and extremity. Depending on time and quality, I conclude which algorithm(s) is the most appropriate for each graph.

By implementing the algorithms over the Directed cycle KX graph, their extremity is compared in addition to time and quality. Extremity is defined as the distribution of nodes over the clusters in a clustering.

These algorithms are presented in this thesis and made clear by detailed descriptions. Tables and diagrams show the time and quality and lead us to a conclusion for selecting the best algorithm.

In the second part, these algorithms are implemented over the 'Reference graph', which is not an artificial graph but a real one. Finally, the results of the algorithms over this graph are compared with the results over the artificial ones.


2. Clustering analysis

The practice of classifying objects and organizing data into sensible groupings according to perceived similarities is one of the most fundamental modes of understanding and learning in several branches of science, and the basis for many of them.

Clustering analysis is the formal study of algorithms and methods for grouping, or classifying, objects. It offers several advantages over a manual grouping process. First, a clustering program can apply a specified objective criterion consistently to form the groups. Second, it can form the groups in a fraction of the time required by a manual grouping, particularly if a long list of descriptors or features is associated with each object.

An object is described either by a set of measurements or by relationships between the object and other objects. The objective of clustering is simply to find a convenient and valid organization of the data.

A cluster is comprised of a number of similar objects collected or grouped together. Everitt (1974, cited by Jain and Dubes, 1988) documents the following definitions of a cluster:

1. A cluster is a set of entities which are alike, and entities from different clusters are not alike.

2. A cluster is an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it.

3. Clusters may be described as connected regions of a multidimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.

The last two definitions assume that the objects to be clustered are represented as points in the measurement space. While it is easy to give a functional definition of a cluster, it is difficult to give an operational one, because objects can be grouped into clusters with different purposes in mind.

Depending on the kind of problem, we can use several clustering algorithms to make clusters. The question is 'Which algorithm is the best one?', and it leads to 'How can we evaluate a clustering algorithm?'. One of the most important factors in the software world is time. Compared with creating software, optimizing it sometimes takes more time and budget. Therefore, time is one factor used to assess clustering algorithms. The second factor is the quality of the algorithm. What is the definition of quality for a clustering algorithm? In computer programming, cohesion is a measure of how well the lines of source code within a module work together to provide a specific piece of functionality. Cohesion is an ordinal type of measurement and is usually expressed as 'high cohesion' or 'low cohesion'. Modules with high cohesion are preferable because high cohesion is associated with several desirable traits of software, including robustness, reliability, reusability and understandability, whereas low cohesion is associated with undesirable traits such as being difficult to maintain, test, reuse and even understand. A clustering has good quality when the objects in each cluster have high interdependence and when independent objects are assigned to separate clusters. Well-designed software systems are organized into cohesive clusters whose modules are highly interdependent. (Constantine and Yourdon, 1979)


3. Clustering algorithms

Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Jain and Dubes (1988, pp. 55-90) conclude that hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Both clustering strategies have their appropriate domains of application. Anquetil and Lethbridge (1999, pp. 235-255) state that hierarchical techniques are popular in the biological, social and behavioural sciences because of the need to construct taxonomies, whereas partitional techniques are used frequently in engineering applications where single partitions are important. In addition, partitional algorithms are appropriate for the representation and compression of large databases, where the dendrograms of hierarchical algorithms become impractical with more than a hundred patterns.

There are several clustering methods for each strategy. Single link (Nearest neighbour) and Complete link (Best neighbour) are two well-known methods of hierarchical clustering. Square-error, K-means and genetic algorithms belong to partitional clustering.

The clustering algorithms presented in this project are: Nearest neighbour hill-climbing, Nearest neighbour big step hill-climbing, Best neighbour hill-climbing, Best neighbour big step hill-climbing, Gem 3D, K-means simple, K-means Gem 3D, One cluster and One cluster per node.

For applying the algorithms, a number of graphs are first generated; these graphs are:

Unconnected graph: in a directed graph G, two vertices u and v are called unconnected if G does not contain any path from u to v.

Directed KX graph: a fully connected graph with x nodes; e.g., K5 is a graph with 5 nodes, all connected with one another.

Directed cycle KX graph: a cycle of y KX subgraphs; e.g., 'Cycle 4 of K5' is a graph consisting of four K5 graphs where one and only one node of each K5 is connected to two nodes of two other K5s, such that they form a cycle.

Directed cycle graph: a graph consisting of a single cycle, in which some number of vertices connected by directed edges form a closed chain.

Reference graph: this graph is a real graph. It cannot be generated at arbitrary sizes; its size is fixed at 420.

The clustering algorithms are then run over these generated graphs of different sizes (except for the reference graph, whose size is fixed). The time and quality of the algorithms are compared with each other and with the reference graph. The extremity values of the artificial graphs are compared with each other too.
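The artificial graphs above can be generated as simple node and edge lists. The following Python sketch is illustrative only; the generator names are my own and are not taken from the project code:

```python
def directed_kx(x, offset=0):
    """Directed K_x: every ordered pair of distinct nodes is an edge."""
    nodes = list(range(offset, offset + x))
    edges = [(u, v) for u in nodes for v in nodes if u != v]
    return nodes, edges

def directed_cycle(n):
    """Directed cycle: n nodes forming a single closed chain."""
    return list(range(n)), [(i, (i + 1) % n) for i in range(n)]

def directed_cycle_kx(y, x):
    """Cycle of y K_x subgraphs: one node of each K_x is linked to the
    next subgraph's linking node so that the subgraphs form a cycle."""
    nodes, edges = [], []
    for i in range(y):
        ns, es = directed_kx(x, offset=i * x)
        nodes += ns
        edges += es
        edges.append((i * x, ((i + 1) % y) * x))  # link to the next K_x
    return nodes, edges

ns, es = directed_kx(5)
print(len(es))           # K5 has 5*4 = 20 directed edges
ns, es = directed_cycle_kx(4, 5)
print(len(ns), len(es))  # 'Cycle 4 of K5': 20 nodes, 4*20 + 4 = 84 edges
```

The unconnected graph can be obtained from any of these generators by simply omitting the linking edges between subgraphs.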

3.1 Hill-climbing algorithm

The hill-climbing algorithm quickly discovers a suboptimal clustering result by moving between neighbouring partitions. This involves moving objects between clusters so as to improve the quality.

The hill-climbing algorithm has three steps:

1. Initialize the clusters.
2. Evaluate the quality of the clustering.
3. Find the next neighbour with better quality.


Figure 3.1: Hill-climbing algorithm

In the first step, we extract a graph of the target system, called the MDG (Module Dependency Graph). There are many possible ways to cluster the nodes of the dependency graph, and their number grows very quickly as the number of nodes (modules in the system) increases. Let S(n, k) be the number of possible ways to cluster n nodes into k clusters:

S(n, k) = 1                                   if k = 1 or k = n
S(n, k) = S(n-1, k-1) + k * S(n-1, k)         otherwise

Summing S(n, k) over all k gives the total number of possible clusterings of n nodes:

n = 1:  1          n = 9:  21147
n = 2:  2          n = 10: 115975
n = 3:  5          n = 11: 678570
n = 4:  15         n = 12: 4213597
n = 5:  52         n = 13: 27644437
n = 6:  203        n = 14: 190899322
n = 7:  877        n = 15: 1382958545
n = 8:  4140

A 15-module system (a dependency graph with 15 nodes) can be clustered in 1,382,958,545 ways, which is about the limit for exhaustive analysis. Therefore, exhaustive clustering is not appropriate for big systems.
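The recurrence and the table can be checked in a few lines of Python. This is an illustrative sketch: S(n, k) is the Stirling number of the second kind, and summing over k yields the Bell numbers listed above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def S(n, k):
    """Number of ways to partition n nodes into exactly k clusters."""
    if k == 1 or k == n:
        return 1
    return S(n - 1, k - 1) + k * S(n - 1, k)

def clusterings(n):
    """Total number of possible clusterings of n nodes (Bell number)."""
    return sum(S(n, k) for k in range(1, n + 1))

print(clusterings(15))  # 1382958545, matching the table above
```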

Obviously, there are several ways for a dependency graph to be clustered. The simplest way to initialize is to assign each object to its own cluster.

Considering that a good clustering is a cohesive clustering with loose dependencies between clusters, two parameters can be defined for measuring the quality of a clustering; this definition is used for measuring the quality of most of the clustering algorithms:

μ_i edges (intra-edges): edges that start and end within the same cluster i.

ε_{i,j} edges (inter-edges): edges that start in cluster i and end in a different cluster j.

According to these definitions, the Clustering Factor (CF) and Module Quality (MQ) can be defined as in equation (4). This definition of MQ is used for evaluating the quality of most kinds of clustering algorithms in this project. It expresses that the clustering with the most intra-edges and the fewest inter-edges has the best quality.
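As an illustration, the MQ of a given clustering can be computed directly from these definitions. The following Python sketch uses my own names and is not the project code:

```python
def mq(clusters, edges):
    """Module Quality: the sum of the clustering factors CF_i.

    clusters: list of sets of nodes; edges: iterable of directed (u, v) pairs.
    CF_i = 2*mu_i / (2*mu_i + inter-edges touching cluster i),
    or 0 when cluster i has no intra-edges (mu_i = 0).
    """
    where = {n: i for i, c in enumerate(clusters) for n in c}
    intra = [0] * len(clusters)   # mu_i per cluster
    inter = [0] * len(clusters)   # sum of eps_ij + eps_ji per cluster
    for u, v in edges:
        i, j = where[u], where[v]
        if i == j:
            intra[i] += 1
        else:
            inter[i] += 1
            inter[j] += 1
    return sum(2 * m / (2 * m + e) for m, e in zip(intra, inter) if m > 0)

# Two tight clusters joined by a single inter-edge:
print(mq([{0, 1}, {2, 3}], [(0, 1), (1, 0), (2, 3), (3, 2), (1, 2)]))  # 1.6
```

Cutting the inter-edge (1, 2) would raise each CF_i to 1 and the MQ to 2, the maximum for two clusters, which matches the intuition above.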

A neighbouring partition is created by altering the current partition slightly. The ways of obtaining the next neighbour are as follows:

1. All objects in one cluster which have a connection with the objects in another cluster are transferred. Nearest neighbour and Best neighbour use this method.

2. The clusters are merged two by two until a clustering with better quality is found. Nearest neighbour big step and Best neighbour big step work like that.

According to these methods, there are four kinds of hill-climbing clustering algorithms in this project: Nearest neighbour, Nearest neighbour big step, Best neighbour and Best neighbour big step.

3.1.1 Nearest neighbour

Nearest neighbour is a hill-climbing algorithm that is similar to, but often faster than, its Best neighbour counterpart.

In Nearest neighbour, the hill-climbing improvement technique is based on finding a better neighbour of the current clustering. A clustering NN is defined to be a nearest neighbour of partition P if and only if NN is the same as P except that a single element of P is in a different cluster in NN.

A better neighbour of the current clustering is found by enumerating the neighbours of the clustering. The Nearest neighbour algorithm stops as soon as the first better neighbour, i.e. one with larger quality, is found. The Nearest neighbour algorithm is:

1. Generate a random initial decomposition of the dependency graph.
2. Measure the quality.
3. Find a neighbour of the clustering.
4. Measure the quality.
5. While the quality is not improved, go to 3.
6. Return the clustering.
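The steps above can be sketched as follows, assuming a quality function (such as MQ) is supplied. This is an illustrative Python sketch, not the project's implementation:

```python
import itertools
import random

def nearest_neighbour_hillclimb(nodes, k, quality, seed=0):
    """Nearest neighbour hill-climbing sketch: a neighbour differs from
    the current clustering in the cluster of exactly one node; the FIRST
    neighbour that improves the quality is accepted."""
    rng = random.Random(seed)
    current = {n: rng.randrange(k) for n in nodes}   # step 1: random start
    best_q = quality(current)                        # step 2
    improved = True
    while improved:                                  # steps 3-5
        improved = False
        for n, c in itertools.product(nodes, range(k)):
            if c == current[n]:
                continue
            neighbour = {**current, n: c}            # move one node
            q = quality(neighbour)
            if q > best_q:                           # first better neighbour
                current, best_q, improved = neighbour, q, True
                break
    return current                                   # step 6

# Toy quality: nodes 0 and 1 want the same cluster, node 2 a different one.
edges = [(0, 1), (1, 0)]
q = lambda a: sum(a[u] == a[v] for u, v in edges) - (a[2] == a[0])
result = nearest_neighbour_hillclimb([0, 1, 2], 2, q)
print(result[0] == result[1], result[2] != result[0])  # True True
```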

CF_i = 0,                                                        if μ_i = 0
CF_i = 2μ_i / (2μ_i + Σ_{j=1, j≠i}^{k} (ε_{i,j} + ε_{j,i})),     otherwise     (4)

MQ = Σ_{i=1}^{k} CF_i

where k is the number of clusters, μ_i is the number of intra-edges of cluster i, and ε_{i,j} is the number of inter-edges that start in cluster i and end in cluster j.

3.1.2 Nearest neighbour big step

The Nearest neighbour big step hill-climbing algorithm is another hill-climbing algorithm that is very similar to the Nearest neighbour algorithm. It differs from Nearest neighbour in the way it systematically finds the better neighbour.

In the Nearest neighbour big step algorithm, the first two clusters are merged; if the quality is not improved, the second two clusters are merged, and so on, until a clustering with better quality is found.

3.1.3 Best neighbour

The Best neighbour hill-climbing algorithm is based on traditional hill-climbing techniques. The goal of this algorithm is to create a new partition from the current partition of the dependency graph such that the quality of the new partition is larger than the quality of the original partition. Each iteration of the algorithm attempts to improve the quality by finding the best neighbour of the current partition.

The best neighbour is determined by examining all nearest neighbours of the current partition and selecting the one with the largest quality. The Best neighbour algorithm converges once all neighbours of the current clustering have been evaluated for their quality. Accordingly, the algorithm is:

1. Generate a random initial decomposition of the dependency graph.
2. Measure the quality.
3. Set the first clustering as the current clustering.
4. While there is a further neighbour, measure its quality.
5. Return the clustering with the best quality.
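The difference from Nearest neighbour, i.e. scanning every neighbour and keeping the best rather than stopping at the first improvement, can be isolated in a small sketch (illustrative Python with my own names, not the project code):

```python
def best_single_move(clustering, nodes, k, quality):
    """Best-neighbour selection: evaluate EVERY single-node move and
    return the neighbour with the largest quality."""
    best, best_q = None, float("-inf")
    for n in nodes:
        for c in range(k):
            if c == clustering[n]:
                continue
            neighbour = {**clustering, n: c}   # move node n to cluster c
            q = quality(neighbour)
            if q > best_q:
                best, best_q = neighbour, q
    return best, best_q

# Toy quality: count edges whose endpoints share a cluster.
edges = [(0, 1), (1, 2)]
q = lambda a: sum(a[u] == a[v] for u, v in edges)
nb, nq = best_single_move({0: 0, 1: 1, 2: 0}, [0, 1, 2], 2, q)
print(nq)  # 2: moving node 1 into cluster 0 satisfies both edges
```

Plugging this selection step into the hill-climbing loop of section 3.1.1, in place of the first-improvement step, yields the Best neighbour algorithm.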

3.1.4 Best neighbour big step

This clustering algorithm is very similar to Best neighbour hill-climbing. The difference is the way of finding the best neighbour.

In the Best neighbour big step clustering algorithm, all of the clusters are first merged two by two, and the clustering with the highest quality among the results of these merges is selected.

The algorithm stops when, after merging all clusters two by two, it has found the best neighbour, i.e. the one with the highest quality.

3.2 Gem 3D

This clustering algorithm uses three-dimensional concepts. When it gets a node, it assigns it a random position in three-dimensional space, so each node has a certain position (x, y, z) in 3D space. Then, for each two nodes connected by an edge, the algorithm calculates their distance.

For two nodes n1(x1, y1, z1) and n2(x2, y2, z2), the distance is defined as:

sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)

If the distance is less than a defined value, the two nodes are placed in the same cluster; otherwise the edge between them is cut and they are placed in two different clusters.
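One possible reading of this procedure in Python, as an illustrative sketch under my own assumptions (random positions in the unit cube, and the resulting clusters taken as the connected components of the surviving edges):

```python
import math
import random

def gem3d_clusters(nodes, edges, threshold, seed=0):
    """Gem 3D sketch: place each node at a random 3-D position, keep an
    edge only if the Euclidean distance between its endpoints is below
    `threshold`, then treat the connected components as clusters."""
    rng = random.Random(seed)
    pos = {n: (rng.random(), rng.random(), rng.random()) for n in nodes}

    def dist(a, b):
        (x1, y1, z1), (x2, y2, z2) = pos[a], pos[b]
        return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)

    kept = [(u, v) for u, v in edges if dist(u, v) < threshold]

    # Connected components of the surviving edges.
    adj = {n: set() for n in nodes}
    for u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    clusters, seen = [], set()
    for n in nodes:
        if n in seen:
            continue
        comp, stack = set(), [n]
        while stack:
            m = stack.pop()
            if m in comp:
                continue
            comp.add(m)
            seen.add(m)
            stack.extend(adj[m])
        clusters.append(comp)
    return clusters

# A huge threshold keeps every edge; a zero threshold cuts them all.
print(len(gem3d_clusters([0, 1, 2], [(0, 1)], 10.0)))  # 2 clusters
print(len(gem3d_clusters([0, 1, 2], [(0, 1)], 0.0)))   # 3 clusters
```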

3.3 K-means

The K-means procedure follows a simple and easy way to classify a given data set. The K-means algorithm moves objects between groups until a stable situation is reached and no object moves to any other group. (Shokoufandeh et al., 2002)

It is a variant of the expectation-maximization algorithm in which the goal is to determine the k means of data generated from Gaussian distributions (see Appendix A for more information). The K-means algorithm is:

1. Randomly generate k clusters and determine the cluster centres, or directly generate k seed points as cluster centres.
2. Assign each point to the nearest cluster centre.
3. Compute the new cluster centres.
4. Repeat until some convergence criterion is met (usually that the assignment has not changed).

K-means clustering accepts the number of clusters to group the data into (K) and the Module Dependency Graph to be clustered as input values. It then creates the first K initial clusters by assigning the nodes of the dependency graph to K clusters randomly. (Bradley and Fayyad, 1998)

The K-means algorithm assigns a centroid to each cluster. The centroid of a cluster is selected randomly among the nodes of the cluster.

K-means simple clustering assigns each node to the nearest cluster based on the distance to the centroids, using a measure of distance or similarity such as the Euclidean distance or the Manhattan/city-block distance (see Appendices B and C for more information). The preceding steps are repeated until stable clusters are formed and the K-means clustering procedure is complete. Stable clusters are formed when new iterations of the K-means clustering algorithm do not create new clusters, i.e. the cluster centre of each cluster formed is the same as the old cluster centre. (Likas et al., 2002)

Figure 3.2: K-means algorithm

3.3.1 K-means simple

The structure of the K-means simple algorithm is exactly the same as the algorithm defined and clarified in the previous section.
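The plain K-means loop described above can be sketched as follows. This is an illustrative Python version on geometric points rather than MDG nodes, with my own names; it is not the project's implementation:

```python
import random

def kmeans(points, k, dist, seed=0, max_iter=100):
    """K-means sketch: pick k random points as centres, assign every
    point to its nearest centre, recompute the centres as cluster means,
    and stop when the assignment no longer changes."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)           # step 1: random seed centres
    assign = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centre.
        new = [min(range(k), key=lambda i: dist(p, centres[i]))
               for p in points]
        if new == assign:                     # step 4: nothing moved -> stop
            break
        assign = new
        # Step 3: recompute each centre as the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:
                centres[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return assign, centres

euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
assign, centres = kmeans(points, 2, euclid)
print(assign[0] == assign[1], assign[2] == assign[3])  # True True
```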

3.3.2 K-means Gem 3D

The steps of K-means Gem 3D are just like those of the K-means simple algorithm. The difference between these algorithms is how the distance between two nodes is calculated.


In K-means Gem 3D, instead of the Euclidean or Manhattan/city-block distance measure, the Gem 3D algorithm's method for calculating the distance is used. The structure of the algorithm is the same as the K-means simple algorithm.

3.4 One cluster

The One cluster algorithm is an agglomerative hierarchical clustering which builds a hierarchy of clusters. It produces a sequence of partitions in which each partition is nested into the next partition in the sequence. When the algorithm is implemented over a graph, it merges nodes into a cluster until all nodes are in one cluster. In the end, a single cluster containing all the nodes of the graph remains.

3.5 One cluster per node

This (divisive) algorithm takes the nodes of a graph one by one and adds each node to a new cluster. Therefore, in each step of this clustering one cluster is created which contains just one node, and at the end of the clustering there are n clusters, n being the number of nodes of the graph.


4. Environment of algorithm implementation and testing

This chapter gives some explanation about the platform and environment in which the code was written. The environment of this project is Eclipse.

4.1 What is Eclipse and Eclipse Foundation?

Eclipse is an open source community whose projects are focused on building an open development platform comprised of extensible frameworks, tools and runtimes for building, developing and managing software across the lifecycle. The Eclipse Foundation is a not-for-profit, member-supported corporation that helps cultivate both an open source community and an ecosystem of complementary products and services.

The Eclipse Project was originally created by IBM in November 2001 and supported by a consortium of software vendors. The Eclipse Foundation was created in January 2004 as an independent not-for-profit corporation to act as the steward of the Eclipse community. It was created to allow a vendor-neutral, open and transparent community to be established around Eclipse. Today, the Eclipse community consists of individuals and organizations from a cross-section of the software industry.

The Eclipse Foundation has been established to serve the Eclipse open source projects and the Eclipse community. As an independent not-for-profit corporation, the Foundation and the Eclipse governance model ensure that no single entity is able to control the strategy, policies or operations of the Eclipse community.

The Foundation focuses on creating an environment for successful open source projects and on promoting the adoption of Eclipse technology in commercial and open source solutions. Through services such as IP due diligence, annual release trains, development community support and ecosystem development, the Eclipse model of open source development is a unique and proven one.

4.2 History of Eclipse

Industry leaders Borland, IBM, MERANT, QNX Software Systems, Rational

Software, Red Hat, SuSE, TogetherSoft and Webgain formed the initial eclipse.org

Board of Stewards in November 2001. By the end of 2003, this initial consortium

had grown to over 80 members.

On February 2, 2004 the Eclipse Board of Stewards announced Eclipse's reorganization into a not-for-profit corporation. Originally a consortium formed when IBM released the Eclipse Platform into open source, Eclipse became an independent body that drives the platform's evolution to benefit the providers of software development offerings and end-users. All technology and source code provided to and developed by this fast-growing community is made available royalty-free via the Eclipse Public License.

The founding Strategic Developers and Strategic Consumers were Ericsson, HP,

IBM, Intel, MontaVista Software, QNX, SAP and Serena Software.

More information about the structure and mission of the Eclipse Foundation is available in the formal documents that establish how the Foundation operates and in the press release announcing the creation of the independent organization.

4.3 Downloading Eclipse

To download Eclipse, open the Eclipse home page and select the proper version (available via References, Web links [4]). To use Eclipse, a Java runtime environment (JRE) is needed (Java 5 JRE recommended).

4.4 Project codes

All code is written in Eclipse. Eclipse was chosen because this code is integrated with the Vizz analyzer project (developed at the MSI department of Växjö University) and uses some classes and methods of that project, which is written in Eclipse.

This project comprises a number of classes that implement the different clustering algorithms, plus several classes for testing the clustering algorithms over the artificial graphs and the reference graph. I wrote these implementation and testing classes.

In addition, the Grail library helped me to create and test the clustering algorithms. Grail is an open source library developed at the MSI department of Växjö University. It has several classes containing basic definitions for different clusterings, which I used when creating the clustering classes. For testing the different clusterings I also used a .gml file from the Grail library as input.

5. Tables and diagrams

In this section, tables and diagrams show the time and quality of each algorithm when implemented over four graphs.

The algorithms are: Nearest neighbour, Nearest neighbour big step, Best neighbour, Best neighbour big step, Gem3D, Kmeans simple, Kmeans Gem3D, One cluster and One cluster per node. They are implemented over the Unconnected graph, Directed KX graph, Directed cycle KX graph and Directed cycle graph with sizes 2, 4, 8, ..., 1024.

As the size of some graphs grows, clustering them takes more than 300 seconds (5 minutes) or aborts for lack of memory. These cases are stated in the tables but are not drawn in the diagrams.

5.1 Time tables and diagrams

The time changes with the size of the graph and with the algorithm implemented. The following timetables and diagrams show how.

5.1.1 Nearest neighbour

Size   Unconnected  Directed KX  DirectedCycle KX  DirectedCycle
2      0            0            -                 0
4      0            15           0                 0
8      16           47           31                16
16     0            234          172               46
32     32           3250         9282              110
64     47           187781       170532            656
128    94           >5 min       >5 min            6266
256    219          >5 min       >5 min            274484
512    7156         >5 min       >5 min            >5 min
1024   27719        >5 min       >5 min            >5 min

Figure 5.1: Timetable of Nearest neighbour (times in ms; ">5 min" marks runs that exceeded 5 minutes)

[Figure 5.2: Time diagram of Nearest neighbour - line chart of the timetable values above, one series per graph type]

As graph size grows, clustering time increases. Clustering is very time-consuming at large sizes such as 512 and 1024. The Unconnected graph takes the least time to cluster, and the Directed cycle KX and Directed KX graphs take the most.

5.1.2 Nearest neighbour big step

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0              -                 0
4      0            0              0                 0
8      0            0              0                 0
16     0            16             0                 16
32     0            47             406               0
64     0            1125           2578              31
128    31           18578          12031             110
256    78           98563          120328            3469
512    3422         Out of memory  >5 min            11610
1024   13610        Out of memory  >5 min            85641

Figure 5.3: Timetable of Nearest neighbour big step (times in ms; ">5 min" marks runs that exceeded 5 minutes)

[Figure 5.4: Time diagram of Nearest neighbour big step - line chart of the timetable values above, one series per graph type]

On average, Nearest neighbour big step takes less time than Nearest neighbour. Its time still grows with graph size, but not as fast. The Unconnected and Directed cycle graphs take less time than the Directed KX and Directed cycle KX graphs.

5.1.3 Best neighbour

Size   Unconnected  Directed KX  DirectedCycle KX  DirectedCycle
2      0            0            -                 0
4      0            16           0                 0
8      0            31           15                31
16     16           187          94                31
32     0            3625         8188              109
64     31           224750       131046            734
128    62           >5 min       >5 min            10641
256    172          >5 min       >5 min            >5 min
512    6813         >5 min       >5 min            >5 min
1024   27469        >5 min       >5 min            >5 min

Figure 5.5: Timetable of Best neighbour (times in ms; ">5 min" marks runs that exceeded 5 minutes)

[Figure 5.6: Time diagram of Best neighbour - line chart of the timetable values above, one series per graph type]

In this algorithm, time increases very fast with size. Except for the Unconnected graph, clustering already takes more than 5 minutes at moderate sizes. The Unconnected and Directed cycle graphs take less time than the Directed KX and Directed cycle KX graphs.

5.1.4 Best neighbour big step

Size   Unconnected  Directed KX  DirectedCycle KX  DirectedCycle
2      0            0            -                 0
4      0            0            0                 16
8      0            15           0                 0
16     0            47           47                16
32     0            1250         3031              78
64     0            68171        49985             641
128    31           >5 min       >5 min            32578
256    109          >5 min       >5 min            >5 min
512    3406         >5 min       >5 min            >5 min
1024   13515        >5 min       >5 min            >5 min

Figure 5.7: Timetable of Best neighbour big step (times in ms; ">5 min" marks runs that exceeded 5 minutes)

[Figure 5.8: Time diagram of Best neighbour big step - line chart of the timetable values above, one series per graph type]

On average, this algorithm takes less time than Best neighbour, yet it is still one of the most time-consuming algorithms. It takes the least time over the Unconnected graph, followed by the Directed cycle graph; the most time-consuming are the Directed KX and Directed cycle KX graphs.

5.1.5 Gem 3D

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            15             -                 16
4      16           15             15                0
8      47           47             31                31
16     62           79             78                47
32     109          188            1125              125
64     219          594            3047              218
128    813          2344           22438             688
256    2797         1026           104688            2485
512    107516       Out of memory  275562            1766
1024   >5 min       Out of memory  Out of memory     175328

Figure 5.9: Timetable of Gem3D (times in ms; ">5 min" marks runs that exceeded 5 minutes)

[Figure 5.10: Time diagram of Gem3D - line chart of the timetable values above, one series per graph type]

This algorithm's times are quite good compared with the previous ones. Its performance over the Directed cycle graph is notable: before the size reaches 1024, clustering takes at most about 2 seconds. Ranked from most to least time-consuming, the graphs are Directed cycle KX, Directed KX, Unconnected and Directed cycle.

5.1.6 K-means simple

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0              -                 0
4      0            0              0                 0
8      0            0              0                 0
16     0            0              16                0
32     0            0              78                0
64     0            16             94                0
128    31           47             484               16
256    32           156            1953              31
512    390          Out of memory  3391              47
1024   1281         Out of memory  Out of memory     109

Figure 5.11: Timetable of Kmeans simple (times in ms)

[Figure 5.12: Time diagram of Kmeans simple - line chart of the timetable values above, one series per graph type]

The timetable and diagram show that this algorithm clearly decreases the time of clustering. The Directed cycle graph is the least time-consuming: at size 1024 it takes just 0.1 second to cluster. The Unconnected and Directed KX graphs come second, and the Directed cycle KX graph is last. It takes the most time to cluster, although compared with the other algorithms its time has decreased considerably.

5.1.7 K-means Gem3D

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      16           31             -                 16
4      16           31             16                15
8      0            63             16                47
16     31           94             31                63
32     63           187            672               125
64     172          609            2578              234
128    734          2359           20344             735
256    3046         11016          122328            2578
512    117360       Out of memory  282140            12016
1024   >5 min       Out of memory  Out of memory     198016

Figure 5.13: Timetable of Kmeans Gem3D (times in ms; ">5 min" marks runs that exceeded 5 minutes)

[Figure 5.14: Time diagram of Kmeans Gem3D - line chart of the timetable values above, one series per graph type]

The time has clearly increased compared with the previous algorithm, but this is still one of the less time-consuming algorithms. Ranked from most to least time-consuming, the graphs are Directed cycle KX, Directed KX, Unconnected and Directed cycle, the same order as for the K-means simple algorithm.

5.1.8 One cluster

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0              -                 0
4      0            0              0                 0
8      0            0              0                 0
16     0            0              0                 0
32     0            16             0                 0
64     0            0              0                 0
128    0            16             62                0
256    0            62             547               16
512    16           Out of memory  734               16
1024   16           Out of memory  Out of memory     15

Figure 5.15: Timetable of One cluster (times in ms)

[Figure 5.16: Time diagram of One cluster - line chart of the timetable values above, one series per graph type]

This algorithm, which puts all the nodes in one cluster, is among the least time-consuming, together with the 'One cluster per node' algorithm. There is no big difference between small and large graph sizes; all take almost the same time. Among the four graphs, the Unconnected and Directed cycle graphs take less time than the Directed KX and Directed cycle KX graphs.

5.1.9 One cluster per node

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0              -                 0
4      0            0              0                 0
8      0            0              0                 0
16     0            0              0                 0
32     0            0              16                0
64     0            0              16                0
128    0            15             203               0
256    0            62             484               0
512    0            Out of memory  625               16
1024   0            Out of memory  Out of memory     16

Figure 5.17: Timetable of One cluster per node (times in ms)

[Figure 5.18: Time diagram of One cluster per node - line chart of the timetable values above, one series per graph type]

One cluster per node, which puts each node in its own cluster, takes very little time for all graphs and sizes, the same as the 'One cluster' algorithm.

5.2 Quality tables and diagrams

Increasing the size affects the quality of the clustering: it can rise or fall depending on the type of graph, and in some cases there is no considerable quality change between different sizes of a graph. The following tables and diagrams show how size affects quality.

5.2.1 Nearest neighbour

Size   Unconnected  Directed KX  DirectedCycle KX  DirectedCycle
2      0            0.666        -                 1
4      0            0.8          0.428             0.333
8      0            0.888        0.499             0.333
16     0            0.941        0.727             0.333
32     0            0.969        0.864             0.355
64     0            0.984        0.934             0.355
128    0            >5 min       >5 min            0.344
256    0            >5 min       >5 min            0.335
512    0            >5 min       >5 min            >5 min
1024   0            >5 min       >5 min            >5 min

Figure 5.19: Quality table of Nearest neighbour (">5 min" marks runs that exceeded 5 minutes, so no quality value was obtained)

[Figure 5.20: Quality diagram of Nearest neighbour - line chart of the quality values above, one series per graph type]

Over the Unconnected graph the quality does not change; it is zero for all sizes. Over the Directed KX graph, quality increases with size; the algorithm reaches its highest quality on this graph, almost one. The quality for the Directed cycle KX graph also increases with size and comes close to one. For the Directed cycle graph it is the opposite: quality is highest (one) at size 2 and decreases as the size grows, although the differences between sizes are not large.

5.2.2 Nearest neighbour big step

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0.666          -                 1
4      0            0.8            0.309             1
8      0            0.888          0.295             1
16     0            0.941          0.329             1
32     0            0.969          0.391             0.999
64     0            0.984          0.355             0.999
128    0            0.992          0.243             0.999
256    0            0.996          0.405             0.999
512    0            Out of memory  >5 min            1
1024   0            Out of memory  >5 min            0.999

Figure 5.21: Quality table of Nearest neighbour big step (">5 min" marks runs that exceeded 5 minutes)

[Figure 5.22: Quality diagram of Nearest neighbour big step - line chart of the quality values above, one series per graph type]

This algorithm behaves differently for the different graphs. The Unconnected graph's quality is zero for all sizes. Quality increases with size for the Directed KX and Directed cycle KX graphs, although the changes are not large. Nearest neighbour big step achieves its best quality over the Directed cycle graph, where it is almost one for all sizes.

5.2.3 Best neighbour

Size   Unconnected  Directed KX  DirectedCycle KX  DirectedCycle
2      0            0.666        -                 1
4      0            0.8          0.428             0.333
8      0            0.888        0.499             0.333
16     0            0.941        0.727             0.333
32     0            0.969        0.864             0.355
64     0            0.984        0.934             0.333
128    0            >5 min       >5 min            0.344
256    0            >5 min       >5 min            >5 min
512    0            >5 min       >5 min            >5 min
1024   0            >5 min       >5 min            >5 min

Figure 5.23: Quality table of Best neighbour (">5 min" marks runs that exceeded 5 minutes)

[Figure 5.24: Quality diagram of Best neighbour - line chart of the quality values above, one series per graph type]

This algorithm behaves just like the Nearest neighbour algorithm, although for some graphs the quality is slightly lower. The Unconnected graph's quality is zero for all sizes. For the Directed KX and Directed cycle KX graphs, quality increases with size, while the Directed cycle graph's quality goes down as the size grows, although it is one at size 2.

5.2.4 Best neighbour big step

Size   Unconnected  Directed KX  DirectedCycle KX  DirectedCycle
2      0            0.666        -                 1
4      0            0.8          0.309             1
8      0            0.888        0.499             1
16     0            0.941        0.727             1
32     0            0.969        0.864             0.999
64     0            0.984        0.934             0.999
128    0            >5 min       >5 min            0.999
256    0            >5 min       >5 min            >5 min
512    0            >5 min       >5 min            >5 min
1024   0            >5 min       >5 min            >5 min

Figure 5.25: Quality table of Best neighbour big step (">5 min" marks runs that exceeded 5 minutes)

[Figure 5.26: Quality diagram of Best neighbour big step - line chart of the quality values above, one series per graph type]

Best neighbour big step shows the same pattern as the previous algorithm; only the Directed cycle graph's quality improves, being almost one for all sizes.

5.2.5 Gem3D

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0.25           -                 1
4      0            0.312          0.428             0.25
8      0            0.888          0.343             0.444
16     0            0.941          0.551             0.292
32     0            0.969          0.864             0.438
64     0            0.984          0.934             0.27
128    0            0.992          0.968             0.447
256    0            0.495          0.984             0.404
512    0            Out of memory  0.496             0.402
1024   >5 min       Out of memory  Out of memory     0.406

Figure 5.27: Quality table of Gem3D (">5 min" marks runs that exceeded 5 minutes)

[Figure 5.28: Quality diagram of Gem3D - line chart of the quality values above, one series per graph type]

The Unconnected graph's quality is zero, just as with the other algorithms. For the Directed KX graph, quality increases up to size 128 and then decreases; the algorithm does not perform well at large sizes. The Directed cycle KX graph has the highest quality, rising towards one as the size grows, but at size 512 the quality drops sharply. The Directed cycle graph's quality is one at size 2 but declines as the size grows.

5.2.6 K-means simple

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0.25           -                 0
4      0            0.125          0.25              0
8      0            0.062          0.308             0
16     0            0.031          0.497             0
32     0            0.015          0.864             0
64     0            0.007          0.557             0
128    0            0.003          0.968             0
256    0            0.001          0.984             0
512    0            Out of memory  0.992             0
1024   0            Out of memory  Out of memory     0

Figure 5.29: Quality table of Kmeans simple

[Figure 5.30: Quality diagram of Kmeans simple - line chart of the quality values above, one series per graph type]

Under the K-means simple algorithm, quality clearly drops for the Unconnected and Directed cycle graphs: it is zero for all sizes. The Directed KX graph's quality goes down as the size rises and is almost zero for large sizes. The Directed cycle KX graph has the best quality; it grows with size and is almost one at size 512.

5.2.7 K-means Gem3D

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0.25           -                 0
4      0            0.125          0.25              0
8      0            0.625          0.5               0
16     0            0.031          0.727             0
32     0            0.015          0.864             0
64     0            0.007          0.934             0
128    0            0.003          0.967             0
256    0            0.001          0.669             0
512    0            Out of memory  0.2               0
1024   >5 min       Out of memory  Out of memory     0

Figure 5.31: Quality table of Kmeans Gem3D (">5 min" marks runs that exceeded 5 minutes)

[Figure 5.32: Quality diagram of Kmeans Gem3D - line chart of the quality values above, one series per graph type]

The Unconnected and Directed cycle graphs have the same quality as under K-means simple: zero. The quality of the Directed KX and Directed cycle KX graphs changes with size. For the Directed KX graph, quality rises as the size increases up to eight, then descends almost to zero. The Directed cycle KX graph's quality increases towards one until the size reaches 128, then goes down.

5.2.8 One cluster

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0.666          -                 1
4      0            0.8            0.666             1
8      0            0.888          0.714             1
16     0            0.941          0.809             1
32     0            0.969          0.89              1
64     0            0.984          0.941             1
128    0            0.992          0.969             1
256    0            0.996          0.984             1
512    0            Out of memory  0.992             1
1024   0            Out of memory  Out of memory     1

Figure 5.33: Quality table of One cluster

[Figure 5.34: Quality diagram of One cluster - line chart of the quality values above, one series per graph type]

This algorithm, which puts all the nodes in one cluster, has the lowest quality over the Unconnected graph and the best quality over the Directed cycle graph. For the Directed KX and Directed cycle KX graphs, quality rises almost to one as the size increases.

5.2.9 One cluster per node

Size   Unconnected  Directed KX    DirectedCycle KX  DirectedCycle
2      0            0.25           -                 0
4      0            0.125          0.25              0
8      0            0.625          0.208             0
16     0            0.312          0.118             0
32     0            0.015          0.061             0
64     0            0.007          0.031             0
128    0            0.003          0.015             0
256    0            0.001          0.007             0
512    0            Out of memory  0.003             0
1024   0            Out of memory  Out of memory     0

Figure 5.35: Quality table of One cluster per node

[Figure 5.36: Quality diagram of One cluster per node - line chart of the quality values above, one series per graph type]

This clustering algorithm has the worst quality. It is zero for the Unconnected and Directed cycle graphs at all sizes, and for the Directed KX and Directed cycle KX graphs it decreases almost to zero as the size rises.

5.3 Extremity

As mentioned in the introduction, the extremity of the algorithms is assessed by implementing them over the Directed cycle KX graph. The node distribution of a clustering should not show extremity, because extremity influences the quality factor: a clustering with extremity would have very low quality.

In particular, a clustering algorithm should avoid the following two situations:

a) The majority of nodes are grouped into one or a few clusters. The One cluster algorithm partitions nodes exactly like this: it puts all of the nodes in one cluster. With respect to extremity this clustering is not acceptable, although its clustering time is lower than that of the other algorithms.

b) The majority of clusters are singletons (contain just one node). The One cluster per node algorithm clusters like this: it puts each node in its own cluster, so at the end the number of clusters equals the number of nodes.
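The two situations can be checked mechanically. A sketch follows; the thresholds are illustrative choices of mine, not values taken from the thesis:

```python
def is_extreme(clusters, dominance=0.9, singleton_share=0.5):
    """Flag a clustering that matches either extremity situation:
    (a) one cluster holds almost all nodes, or
    (b) most non-empty clusters are singletons.
    Thresholds are illustrative, not from the thesis."""
    sizes = [len(c) for c in clusters if c]
    total = sum(sizes)
    if total == 0:
        return False
    majority_in_one = max(sizes) / total >= dominance
    mostly_singletons = sizes.count(1) / len(sizes) >= singleton_share
    return majority_in_one or mostly_singletons
```

Under these thresholds, the One cluster result trips condition (a) and the One cluster per node result trips condition (b).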

In this part the node distributions of the algorithms are shown for different sizes of the Directed cycle KX graph. The number of clusters is fixed at four, so each clustering has four clusters. The smallest graph size is four, growing up to 512.

For some algorithms the larger sizes take more than five minutes to cluster; these sizes are omitted from the following tables.

5.3.1 Nearest neighbour

Size       4   8   16   32   64
Cluster1   2   2    4    8   16
Cluster2   2   2    4    8   16
Cluster3   0   2    4    8   16
Cluster4   0   2    4    8   16

Figure 5.37: Nodes distribution for Nearest neighbour

Node distribution is quite good for this clustering algorithm. Each cluster contains the same number of nodes for all graph sizes; only at size four are there two clusters that contain no nodes.

5.3.2 Nearest neighbour big step

Size       4   8   16   32   64   128   256
Cluster1   2   4    9   11   23   125   215
Cluster2   1   2    5    9   39     1    39
Cluster3   1   1    1   11    1     1     1
Cluster4   0   1    1    1    1     1     1

Figure 5.38: Nodes distribution for Nearest neighbour big step

This algorithm shows extremity over the different graph sizes. As the table shows, the majority of nodes are grouped in one or two clusters, while the other clusters contain only a few nodes.

5.3.3 Best neighbour

Size       4   8   16   32   64
Cluster1   2   2    4    8   16
Cluster2   2   2    4    8   16
Cluster3   0   2    4    8   16
Cluster4   0   2    4    8   16

Figure 5.39: Nodes distribution for Best neighbour

The Best neighbour algorithm shows extremity only when implemented over the graph of size four, where two clusters contain no nodes. For other sizes, all clusters contain the same number of nodes.

5.3.4 Best neighbour big step

Size       4   8   16   32   64
Cluster1   2   2    4    8   16
Cluster2   1   2    4    8   16
Cluster3   1   2    4    8   16
Cluster4   0   2    4    8   16

Figure 5.40: Nodes distribution for Best neighbour big step

The Best neighbour big step algorithm does not show extremity. The number of nodes is almost the same in all clusters for the different graph sizes.

5.3.5 Gem 3D

Size       4   8   16   32   64   128   256   512
Cluster1   2   1    4    8   16    64    64   511
Cluster2   2   2    4    8   16    32    64     1
Cluster3   0   4    3    8   16    32    64     0
Cluster4   0   1    5    8   16     0    64     0

Figure 5.41: Nodes distribution for Gem 3D

This algorithm does not show extremity, and the node distribution is quite good for all graph sizes except 512. At that size, extremity is obvious: almost all nodes are grouped in one cluster and the other clusters have one node or none.

5.3.6 K-means simple

Size       4   8   16   32   64   128   256   512
Cluster1   1   2    4    8   20    32    64   128
Cluster2   1   3    2    8   20    32    64   128
Cluster3   1   2    8    8   20    64    64   256
Cluster4   1   1    2    8    4     0    64     0

Figure 5.42: Nodes distribution for Kmeans simple

Node distribution for K-means simple is quite good and shows no extremity. Only for the two sizes 128 and 512 is there one cluster that contains no nodes.

5.3.7 K-means Gem 3D

Size       4   8   16   32   64   128   256   512
Cluster1   1   2    4    8   16    32    19   423
Cluster2   1   2    4    8   16    32    45    31
Cluster3   1   2    4    8   16    32    64    33
Cluster4   1   2    4    8   16    32   128    25

Figure 5.43: Nodes distribution for Kmeans Gem 3D

K-means Gem 3D shows slight extremity. Up to size 256 the node distribution is very good; however, at size 512 the majority of nodes are grouped in one cluster, and the other clusters contain few nodes in comparison.

6. Reference graph

The reference graph is a real graph with a fixed size. For this graph, in addition to 'Time' and 'Quality', we can calculate 'Precision' and 'Recall'.

What are 'Precision' and 'Recall'? To clarify these terms we first need some basic definitions. Given a reference clustering and a current clustering:

'True positives' (tp): pairs of nodes in the same cluster in both clusterings.

'False positives' (fp): pairs of nodes in the same cluster in the current clustering, but in different clusters in the reference clustering.

'False negatives' (fn): pairs of nodes in different clusters in the current clustering, but in the same cluster in the reference clustering.

Precision = tp / (tp + fp) and Recall = tp / (tp + fn). Therefore, the larger precision and recall are, the more similar the reference and current clusterings are, and vice versa.
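These pair-counting definitions can be computed directly. A sketch, assuming each clustering is given as a list of node sets (my representation, not the project's):

```python
from itertools import combinations

def same_cluster_pairs(clustering):
    """All unordered pairs of nodes that share a cluster in the clustering."""
    pairs = set()
    for cluster in clustering:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def precision_recall(reference, current):
    """Pair-counting precision and recall of `current` against `reference`."""
    ref = same_cluster_pairs(reference)
    cur = same_cluster_pairs(current)
    tp = len(ref & cur)   # together in both clusterings
    fp = len(cur - ref)   # together in current, apart in reference
    fn = len(ref - cur)   # apart in current, together in reference
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

For a clustering that puts everything in one cluster, every pair shares a cluster, so fn = 0 and recall is 1, consistent with the One cluster row later in this chapter. The 1.0 fallback for an empty denominator is my convention, not the thesis's.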

6.1 Reference graph Time, Precision, Recall and Quality

The following table shows the time, precision, recall and quality of the clustering algorithms when implemented over the reference graph.

Algorithm                   Time    Precision  Recall  Quality
Nearest neighbour           >5 min  -          -       -
Nearest neighbour big step  1313    0.199      0.936   0.137
Best neighbour              >5 min  -          -       -
Best neighbour big step     >5 min  -          -       -
Gem 3D                      7500    0.199      0.926   0.061
K-means simple              47      0.335      0.309   0.247
K-means Gem 3D              8609    0.326      0.433   0.35
One cluster                 16      0.195      1       1
One cluster per node        15      1          0.115   0

Figure 6.1: Time, Precision, Recall and Quality table for the Reference graph (time in ms; ">5 min" marks runs that exceeded 5 minutes and yielded no values)

The size of the reference graph is 420. Since it is fixed, we cannot check the time and quality for different sizes of the reference graph.

[Figure 6.2: Time diagram of Reference graph - bar chart of the times in Figure 6.1 for the six algorithms that finished]

The execution time of the Nearest neighbour, Best neighbour and Best neighbour big step algorithms is extremely large. After these three, K-means Gem 3D and Gem 3D take more time than the others. The Nearest neighbour big step algorithm needs about one second, which is clearly little time. K-means simple, One cluster and One cluster per node take the least time.

Figure 6.3: Precision diagram of Reference graph (Nearest neighbour big step 0,199; Gem 3D 0,199; K-means simple 0,335; K-means Gem 3D 0,326; One cluster 0,195; One cluster per node 1)

The One cluster per node algorithm has the highest precision. This means there are no two nodes in the same cluster in the current clustering that are in different clusters in the reference clustering; in other words, whenever two nodes share a cluster in the current clustering, they also share a cluster in the reference clustering. Gem 3D, Nearest neighbour big step and One cluster have the lowest precision, which means many pairs of nodes are in the same cluster in the current clustering but not in the same cluster in the reference clustering.

Figure 6.4: Recall diagram of Reference graph (Nearest neighbour big step 0,936; Gem 3D 0,926; K-means simple 0,309; K-means Gem 3D 0,433; One cluster 1; One cluster per node 0,115)

One cluster, Nearest neighbour big step and Gem 3D have the highest recall, close to one. This means there are few pairs of nodes in different clusters in the current clustering that are in the same cluster in the reference clustering. One cluster per node has the lowest recall.


Figure 6.5: Quality diagram of Reference graph (Nearest neighbour big step 0,137; Gem 3D 0,061; K-means simple 0,247; K-means Gem 3D 0,35; One cluster 1; One cluster per node 0)

The One cluster algorithm has the best quality, one. K-means Gem 3D and K-means simple come next, although their quality is not nearly as high as the first one's. Nearest neighbour big step, Gem 3D and One cluster per node have the lowest quality.

6.2 Reference graph Extremity

The node distribution of the Reference graph differs depending on which algorithm is applied. Nearest neighbour big step and Gem 3D show total extremity. K-means Gem 3D also shows extremity, although not as strongly as the first two. Under K-means simple the node distribution of the graph is very good. The node distribution tables are as follows:

Nearest neighbour big step
Cluster0 414
Cluster1 1
Cluster2 1
Cluster3 1
Cluster4 1
Cluster5 1
Cluster6 1

Figure 6.6: Node distribution of Nearest neighbour big step

When Nearest neighbour big step is implemented, almost all of the nodes are grouped into cluster 0 and each of the other clusters contains just one node.


Gem 3D
Cluster0 1
Cluster1 1
Cluster2 1
Cluster3 1
Cluster4 1
Cluster5 1
Cluster6 4
Cluster7 1
Cluster8 1
Cluster9 4
Cluster10 1
Cluster11 1
Cluster12 1
Cluster13 1
Cluster14 1
Cluster15 1
Cluster16 1
Cluster17 1
Cluster18 1
Cluster19 1
Cluster20 1
Cluster21 2
Cluster22 1
Cluster23 1
Cluster24 1
Cluster25 1
Cluster26 1
Cluster27 1
Cluster28 1
Cluster29 384

Figure 6.7: Node distribution of Gem 3D

Obviously, the Gem 3D clustering algorithm shows enormous extremity. A large number of nodes are in cluster 29, while the other clusters contain only a few nodes each.

K-means simple
Cluster0 89
Cluster1 127
Cluster2 40
Cluster3 51
Cluster4 40
Cluster5 17
Cluster6 56

Figure 6.8: Node distribution of Kmeans simple

The K-means simple algorithm shows no extremity at all; the node distribution is completely acceptable.


K-means Gem 3D
Cluster0 5
Cluster1 97
Cluster2 4
Cluster3 135
Cluster4 127
Cluster5 3
Cluster6 49

Figure 6.9: Node distribution of Kmeans Gem 3D

Although the extremity of the K-means Gem 3D algorithm is not as large as that of the first two algorithms, it still shows extremity.
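The thesis gives no numeric definition of extremity, but a simple, hypothetical indicator is the share of all nodes that falls into the single largest cluster: it is near 1 for the extreme distributions above and near 1/k for an even split over k clusters. A minimal sketch, again assuming a node-to-cluster mapping:

```python
from collections import Counter

def cluster_sizes(clustering):
    """Cluster sizes, largest first, from a node -> cluster mapping."""
    return sorted(Counter(clustering.values()).values(), reverse=True)

def largest_cluster_share(clustering):
    """Hypothetical extremity indicator: the fraction of all nodes
    that ends up in the single largest cluster."""
    sizes = cluster_sizes(clustering)
    return sizes[0] / sum(sizes)
```

For the Nearest neighbour big step distribution in Figure 6.6 (414 of 420 nodes in cluster 0) this share is about 0,99, while for K-means simple in Figure 6.8 (largest cluster 127) it is about 0,30.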


7. Comparing Reference graph with artificial graphs

In this part, the artificial graphs are compared with the Reference graph. The artificial graphs can obviously be generated at different sizes for the clustering algorithms to be implemented on, but the Reference graph, which is a real graph, has a fixed size: it is 420, which means the Reference graph has 420 nodes. Therefore, in this part, the Reference graph is compared with the artificial graphs when their size is 256 and 512.

7.1 Time

7.1.1 Nearest neighbour

Graph Size Time

Unconnected 256 219

Reference 420 More than 5 min

Unconnected 512 7156

Graph Size Time

Directed KX 256 More than 5 min

Reference 420 More than 5 min

Directed KX 512 More than 5 min

Graph Size Time

Directed cycle KX 256 More than 5 min

Reference 420 More than 5 min

Directed cycle KX 512 More than 5 min

Graph Size Time

Directed cycle 256 274484

Reference 420 More than 5 min

Directed cycle 512 More than 5 min

Figure 7.1: Time tables of Reference & artificial graphs (NN)

When this algorithm is implemented on the Reference graph it takes a long time, longer than the time the Unconnected graph needs to be clustered at size 256 and even at size 512. The Directed KX, Directed cycle KX and Directed cycle graphs take as much time as the Reference graph: as the tables show, at both sizes 256 and 512 they need more than five minutes to be clustered.

7.1.2 Nearest neighbour big step

Graph Size Time

Unconnected 256 78

Reference 420 1313

Unconnected 512 3422


Graph Size Time

Directed KX 256 98563

Reference 420 1313

Directed KX 512 Out of memory

Graph Size Time

Directed cycle KX 256 120328

Reference 420 1313

Directed cycle KX 512 More than 5 min

Graph Size Time

Directed cycle 256 3469

Reference 420 1313

Directed cycle 512 11610

Figure 7.2: Time tables of Reference & artificial graphs (NNBS)

The implementation time of the Nearest neighbour big step algorithm on the Reference graph is very small. Compared with the artificial graphs, its implementation time surprisingly lies between the Unconnected graph's times at sizes 256 and 512. Directed KX, Directed cycle and Directed cycle KX take more time than the Reference graph at both the small and the large size.

7.1.3 Best neighbour

Graph Size Time
Unconnected 256 172
Reference 420 More than 5 min
Unconnected 512 6813

Graph Size Time
Directed KX 256 More than 5 min
Reference 420 More than 5 min
Directed KX 512 More than 5 min

Graph Size Time
Directed cycle KX 256 More than 5 min
Reference 420 More than 5 min
Directed cycle KX 512 More than 5 min

Graph Size Time
Directed cycle 256 More than 5 min
Reference 420 More than 5 min
Directed cycle 512 More than 5 min

Figure 7.3: Time tables of Reference & artificial graphs (BN)

The Best neighbour algorithm takes an enormously long time on the Reference graph. The Unconnected graph takes quite little time, both when it is smaller and when it is larger than the Reference graph. Surprisingly, the Directed KX, Directed cycle KX and Directed cycle graphs take as much time as the Reference graph, whether their size is 256 or 512.


7.1.4 Best neighbour big step

Graph Size Time

Unconnected 256 109

Reference 420 More than 5 min

Unconnected 512 3406

Graph Size Time

Directed KX 256 More than 5 min

Reference 420 More than 5 min

Directed KX 512 More than 5 min

Graph Size Time

Directed cycle KX 256 More than 5 min

Reference 420 More than 5 min

Directed cycle KX 512 More than 5 min

Graph Size Time

Directed cycle 256 More than 5 min

Reference 420 More than 5 min

Directed cycle 512 More than 5 min

Figure 7.4: Time tables of Reference & artificial graphs (BNBS)

The Reference graph takes a huge amount of time to be clustered. In comparison, the Unconnected graph needs very little time at both sizes, 256 and 512. The other artificial graphs certainly take as much time as the Reference graph.


7.1.5 Gem 3D

Graph Size Time

Unconnected 256 2797

Reference 420 7500

Unconnected 512 107516

Graph Size Time

Directed KX 256 1026

Reference 420 7500

Directed KX 512 Out of memory

Graph Size Time

Directed cycle KX 256 104688

Reference 420 7500

Directed cycle KX 512 275562

Graph Size Time

Directed cycle 256 2485

Reference 420 7500

Directed cycle 512 1766

Figure 7.5: Time tables of Reference & artificial graphs (Gem3D)

When Gem 3D is implemented on the Reference graph, it takes an acceptable time. This time lies between the times the Unconnected graph needs to be clustered at sizes 256 and 512. The Directed cycle KX graph needs much more time than the Reference graph, even when it is smaller than the Reference graph. Directed KX with size 256 needs less time than the Reference graph. The Directed cycle graph takes less time than the Reference graph at both sizes, 256 and 512.

7.1.6 K-means simple

Graph Size Time

Unconnected 256 32

Reference 420 47

Unconnected 512 390

Graph Size Time

Directed KX 256 156

Reference 420 47

Directed KX 512 Out of memory

Graph Size Time

Directed cycle KX 256 1953

Reference 420 47

Directed cycle KX 512 3391


Graph Size Time

Directed cycle 256 31

Reference 420 47

Directed cycle 512 47

Figure 7.6: Time tables of Reference & artificial graphs (Kmeans simple)

The implementation time of K-means simple on the Reference graph is very small. This time lies exactly between the two times the Unconnected graph needs to be clustered at sizes 256 and 512. Two other artificial graphs, Directed KX and Directed cycle KX, need more time than the Reference graph.
