International Journal of Computer Science and Application Issue 2010

ISSN 0974-0767


A New Ant Approach for Unraveling Data-Clustering and Data-Classification Setback

Mohd. Husain, Raj Gaurang Tiwari, Anil Agrawal, Bineet Gupta

Abstract--Data mining is a process that uses technology to bridge the gap between data and logical decision making. The terminology itself provides a promising view of systematic data manipulation for extracting useful information and knowledge from high volumes of data. Numerous techniques have been developed to fulfill this goal. This paper describes the data mining terminology and outlines the ant colony optimization algorithm, which has recently been applied in data mining, mostly to solve data-clustering and data-classification problems, and which was developed by imitating the way real ants find the shortest path between their nest and a food source. The paper presents an application that clusters a data set with the ant colony optimization algorithm, proposes two new techniques to increase the working performance of the algorithm on the data-clustering problem, and shows the performance gain obtained with these techniques.

Keywords-Data Mining, Knowledge Discovery in Databases, Clustering, Colony Optimization.

I. INTRODUCTION

Data Mining (DM), also known as Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1]. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies. Clustering is the task of identifying groups in a data set based upon some criteria of similarity [2]. Clustering aims to discover a sensible organization of objects in a given dataset by identifying and quantifying similarities or dissimilarities between the objects [3]. In data mining, clustering is used especially as a preprocessing step for another data mining application. We implement a clustering method using ant colony optimization for clustering a data set into a pre-determined number of clusters and propose two new techniques added to the algorithm.

Real ants have the ability to find the shortest path from their nests to the food source without using visual cues [4]. Ant colony optimization was developed by modeling this behavior of real ants [2].

This paper is organized as follows: Section 2 describes ant colony optimization. Section 3 describes the ant colony optimization algorithm developed for data clustering and our two proposed techniques. Section 4 reports the results of verifying the ACO algorithm and the proposed techniques in an application program using a dataset. Finally, Section 5 reports the conclusions of the current work.

II. ANT COLONY OPTIMIZATION

Ant colony optimization (ACO) [5] mimics the way real ants find the shortest route between a food source and their nest. As shown in Figure 1-a, ants start from their nest and go along a linear path to the food source.

If an obstacle appears on the path to the food source (Figure 1-b), the ant in front of the obstacle cannot continue and has to choose a new outgoing path. In this case, the new direction alternatives are selected with equal probability. In other words, if an ant can select either the right or the left direction, both directions are equally likely to be chosen (Figure 1-c). Suppose that two ants start from their nest in search of the food source at the same time, one in each of these two directions. One of them chooses the path that turns out to be shorter while the other takes the longer path. It is observed that the following ants mostly select the shorter path, because more pheromone is deposited on the shorter one.

Figure 1. Behavior of ants between their nest and food source

The ant moving along the shorter path returns to the nest earlier, and the pheromone deposited on this path is therefore more than what is deposited on the longer path. Other ants in the nest thus have a high probability of following the shorter route. These ants also deposit their own pheromone on this path. More and more ants are soon attracted to this path, and hence the optimal route from the nest to the food source and back is very quickly established. Such a pheromone-mediated cooperative search process leads to intelligent swarm behavior.

The instrument ants use to find the shortest path is pheromone. Pheromone is a chemical secretion used by some animals to affect other members of their own species. Ants deposit some amount of pheromone while moving, and they prefer, with a probability-based rule, the path on which more pheromone has been deposited. Ants leave pheromone on the selected path while going to the food source, so they help the following ants with the selection of the path (Figure 1-d).
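The two-path behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model taken from the paper: the function name `simulate_two_paths`, the path lengths, the evaporation value, and the deposit rule (one unit of pheromone divided by path length) are all hypothetical choices made for illustration.

```python
import random


def simulate_two_paths(n_ants=200, short_len=1.0, long_len=2.0,
                       evaporation=0.02, seed=0):
    """Toy model of ants choosing between a short and a long path.

    Assumption: each ant picks a path with probability proportional to
    the pheromone on it, then deposits 1/length on the chosen path.
    """
    rng = random.Random(seed)
    pheromone = {"short": 1.0, "long": 1.0}   # equal at the start
    lengths = {"short": short_len, "long": long_len}
    choices = []
    for _ in range(n_ants):
        total = pheromone["short"] + pheromone["long"]
        p_short = pheromone["short"] / total
        path = "short" if rng.random() < p_short else "long"
        choices.append(path)
        # Evaporate on both paths, then deposit on the chosen one.
        for key in pheromone:
            pheromone[key] *= (1.0 - evaporation)
        pheromone[path] += 1.0 / lengths[path]
    return pheromone, choices


pheromone, choices = simulate_two_paths()
print(pheromone, choices.count("short"), choices.count("long"))
```

Because the shorter path receives a larger deposit per traversal, its selection probability tends to grow over time, which is the positive feedback the text describes. The outcome of a single run is random.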

III. CLUSTERING WITH ANT COLONY OPTIMIZATION

In this section, the ant colony optimization algorithm used to solve the data-clustering problem and the two proposed techniques are explained in detail, and their solutions are compared.

We use an ACO algorithm for data clustering, in which a set of concurrent distributed agents collectively discover a sensible organization of objects for a given dataset [3]. In the algorithm, each agent discovers a possible partition of objects in a given dataset, and the quality of the partitioning is measured with respect to some metric such as Euclidean distance. Information associated with an agent about the clustering of objects is accumulated in the global information hub (pheromone trail matrix) and is used by the other agents to construct possible clustering solutions and iteratively improve them. The algorithm runs for a given maximum number of iterations, and the best solution found with respect to a given metric represents an optimal or near-optimal partitioning of objects into subsets in a given dataset.

The aim of data clustering is to obtain an optimal assignment of N objects to one of K clusters, where N is the number of objects and K is the number of clusters [7-8]. The artificial ants used in the algorithm are called software ants or agents, and the number of agents is denoted by R. Ants start with empty solution strings, and in the first iteration the elements of the pheromone matrix are initialized to the same value. As the iterations progress, the pheromone matrix is updated depending upon the quality of the solutions produced.

To describe the algorithm in detail, a test data set with 10 samples is formed. The samples are taken from UCI's machine learning repository [6]. The test data are shown in Table 1; in the real data set the data are divided into 3 subsets, so K=3.

TABLE 1. ILLUSTRATIVE DATASET TO EXPLAIN ACO ALGORITHM FOR CLUSTERING WITH N=10 AND n=4 (N: NUMBER OF SAMPLES, n: NUMBER OF ATTRIBUTES)

To construct a solution, the agent uses the pheromone trail information to allocate each element of string S to an appropriate cluster label. At the start of the algorithm, each agent or software ant starts with an empty solution string, and the pheromone matrix τ, which records the association of each element with each cluster, is initialized to some small value τ0. Hence, at the first iteration each element of the solution string S of each agent is assigned randomly to one of the K clusters.

The trail value τij at location (i,j) represents the pheromone concentration of sample i associated with cluster j. So, for the problem of separating N samples into K clusters, the size of the pheromone matrix is NxK. Thus, each sample is associated with K pheromone concentrations. The pheromone trail matrix evolves as we iterate. At any iteration level, each agent or software ant develops a solution using this pheromone matrix, which indicates how likely each sample is to belong to each cluster. After the solutions of the R agents are generated, a local search is performed to further improve the fitness of these solutions. The pheromone matrix is then updated depending on the quality of the solutions produced by the agents. Then the agents build improved solutions depending on the pheromone matrix, and the above steps are repeated for a certain number of iterations.
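How an agent might turn the N x K pheromone matrix into a solution string can be sketched as follows. The paper does not spell out the exact selection rule, so this sketch assumes the simplest one: each sample's cluster label is drawn with probability proportional to its pheromone values. The function name `construct_solution` is hypothetical.

```python
import random


def construct_solution(pheromone, rng):
    """Build one solution string S from an N x K pheromone matrix.

    Assumption: the label of sample i is sampled with probability
    tau[i][j] / sum_j tau[i][j]. Cluster labels are 1-based.
    """
    solution = []
    for row in pheromone:
        clusters = list(range(1, len(row) + 1))
        label = rng.choices(clusters, weights=row, k=1)[0]
        solution.append(label)
    return solution


# With a uniform matrix (all entries equal to tau0) the assignment is
# purely random, as the text states for the first iteration.
tau0 = 0.01
uniform = [[tau0] * 3 for _ in range(10)]
print(construct_solution(uniform, random.Random(0)))
```

As the matrix becomes less uniform over the iterations, the sampled strings concentrate on the cluster labels that earlier good solutions used.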

At the end of any iteration level, each agent generates its solution using the information derived from the updated pheromone matrix. The pheromone matrix at any iteration level for the test dataset is shown in Table 2.

Data for Table 1:

Sample Number  Sepal length  Sepal width  Petal length  Petal width  Cluster
1              5.1           3.5          1.4           0.2          1
2              7             3.2          4.7           1.4          2
3              6.3           3.3          6             2.5          3
4              4.9           3            1.4           0.2          1
5              4.6           3.1          1.5           0.2          1
6              6.4           3.2          4.5           1.5          2
7              6.2           2.9          4.3           1.3          2
8              5.8           2.7          5.1           1.9          3
9              7.1           3            5.9           2.1          3
10             6.3           2.9          5.6           1.8          3


TABLE 3. SOLUTIONS GENERATED FOR THE DATA-CLUSTERING PROBLEM, SORTED BY INCREASING OBJECTIVE FUNCTION VALUE

The pheromone concentrations for the first sample, as shown in Table 2, are τ11=0.014756, τ12=0.015274 and τ13=0.009900. This indicates that at the current iteration, sample number 1 has the highest probability of belonging to cluster number 2, because τ12 is the highest.

Each agent selects, with a probability value, a cluster number for each element of the string S to form its own solution string. The quality of a constructed solution string S is measured in terms of the value of the objective function for the given data-clustering problem. This objective function is defined as the sum of squared Euclidean distances between each object and the center of the cluster it belongs to. Then the elements of the population, namely the agents, are sorted in increasing order of objective function value, because the lower the objective function value, the higher the fitness, that is, the closer the solution is to the real solution. Table 3 shows the solution strings of the ten agents for the test data set together with the value of each agent, sorted from best to worst fitness.
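The objective function described above (sum of squared Euclidean distances between each object and its cluster center) can be written out as a short sketch. The function name `clustering_objective` is hypothetical, and the cluster center is assumed to be the arithmetic mean of the cluster's members. The numbers it produces need not match the F column of Table 3, since the paper does not state any scaling or normalization applied there.

```python
def clustering_objective(samples, labels):
    """Sum of squared Euclidean distances to the assigned cluster centers.

    Assumption: each center is the mean of the samples carrying that label.
    """
    clusters = {}
    for point, label in zip(samples, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for points in clusters.values():
        n_dims = len(points[0])
        center = [sum(p[d] for p in points) / len(points)
                  for d in range(n_dims)]
        for p in points:
            total += sum((p[d] - center[d]) ** 2 for d in range(n_dims))
    return total


# The ten samples and true cluster labels of Table 1.
SAMPLES = [
    (5.1, 3.5, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.3, 3.3, 6.0, 2.5),
    (4.9, 3.0, 1.4, 0.2), (4.6, 3.1, 1.5, 0.2), (6.4, 3.2, 4.5, 1.5),
    (6.2, 2.9, 4.3, 1.3), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
    (6.3, 2.9, 5.6, 1.8),
]
TRUE_LABELS = [1, 2, 3, 1, 1, 2, 2, 3, 3, 3]

print(clustering_objective(SAMPLES, TRUE_LABELS))
```

Sorting the agents then amounts to sorting their solution strings by this value in increasing order.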

Most existing ant colony optimization algorithms use some local search procedure to improve the solutions discovered by the software ants. Local search helps to generate better solutions if the heuristic information cannot be discovered easily. Local search is applied to all generated solutions or to a small percentage of the R solutions. In this work, local search is performed on 20% of the total solutions. So, for the test data set, with 10 solutions, local search is applied to the top 2 solutions in Table 3. In the local search procedure, the objective function values of the top 2 agents are computed again. A new solution is accepted only if it improves the fitness; namely, if the newly computed objective function value is lower than the first computed value, the newly generated solution replaces the old one.
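The accept-only-if-better rule of the local search can be sketched as below. The paper does not say how the new candidate solution is generated, so the perturbation used here (each label is changed to a different cluster with a small probability `p_change`) is an assumption, as are the names `local_search` and `p_change`. The objective is passed in as a callable so that the sketch stays independent of any particular distance measure.

```python
import random


def local_search(solution, objective, n_clusters, p_change=0.1, rng=None):
    """Perturb a solution string and keep the result only if it improves.

    Assumption: the candidate is made by reassigning each element, with
    probability p_change, to a different cluster (needs n_clusters >= 2).
    Returns (solution, objective_value).
    """
    rng = rng or random.Random()
    current_value = objective(solution)
    candidate = list(solution)
    for i, label in enumerate(candidate):
        if rng.random() < p_change:
            others = [c for c in range(1, n_clusters + 1) if c != label]
            candidate[i] = rng.choice(others)
    candidate_value = objective(candidate)
    if candidate_value < current_value:      # lower objective = fitter
        return candidate, candidate_value
    return list(solution), current_value


# Toy objective for illustration only: count the labels different from 1.
def toy_objective(s):
    return sum(1 for x in s if x != 1)


print(local_search([2, 3, 2, 1], toy_objective, n_clusters=3,
                   p_change=0.5, rng=random.Random(0)))
```

In the setting of the text, this would be applied only to the top 20% of the sorted solutions.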

After the local search procedure, the pheromone trail matrix is updated. This pheromone updating process reflects the usefulness of the dynamic information provided by the software ants. The pheromone matrix used in the ant colony optimization algorithm is a kind of adaptive memory that contains information provided by the previously found superior solutions and is updated at the end of each iteration. The pheromone updating process used in this algorithm uses the best L solutions discovered by the R agents at iteration level t. These L agents mimic the real ants' pheromone deposition by assigning the values of their solutions. The trail information is updated using the following rule:

$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \sum_{l=1}^{L} \Delta\tau_{ij}^{l}, \qquad i = 1,\dots,N, \quad j = 1,\dots,K$$

where ρ is the evaporation rate of the trail, which lies in [0,1], and (1-ρ) is the trail persistence. A higher value of ρ means that the information gathered in past iterations is forgotten faster.

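The amount $\Delta\tau_{ij}^{l}$ is equal to $1/F_l$ if cluster j is assigned to the ith element of the solution constructed by ant l, and zero otherwise, where $F_l$ denotes the objective function value of the solution constructed by ant l.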
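The update rule can be sketched in a few lines. This assumes the form τij(t+1) = (1-ρ)τij(t) plus the deposits of the best L solutions, with each deposit taken as the reciprocal of the ant's objective value, 1/F_l, on the entries (i, j) that the ant's solution uses. The function name `update_pheromone` and the data layout (a list of `(solution, F)` pairs) are hypothetical.

```python
def update_pheromone(pheromone, best_solutions, rho):
    """Return the updated N x K pheromone matrix.

    pheromone      : list of N rows, each with K trail values
    best_solutions : list of (solution_string, F) pairs for the best L ants;
                     labels in solution_string are 1-based
    rho            : evaporation rate in [0, 1]
    Assumption: ant l deposits 1/F on entry (i, j) when its solution
    assigns cluster j to sample i, and nothing elsewhere.
    """
    updated = [[(1.0 - rho) * tau for tau in row] for row in pheromone]
    for solution, f_value in best_solutions:
        deposit = 1.0 / f_value
        for i, label in enumerate(solution):
            updated[i][label - 1] += deposit
    return updated


tau = [[0.01, 0.01], [0.01, 0.01]]
print(update_pheromone(tau, [([1, 2], 2.0)], rho=0.5))
```

Entries chosen by low-objective (good) solutions receive larger deposits, so they become more likely to be selected in the next iteration.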

An optimal solution is the solution that minimizes the objective function value. The best solution in memory is replaced by the best solution of the current iteration if the latter has a lower objective function value; otherwise the best solution in memory is kept. This completes one iteration of the algorithm. The algorithm repeats these steps for a given number of iterations, and the solution with the lowest objective function value represents the optimal partitioning of the objects of a given dataset into several groups.

The flow chart of the ant colony optimization algorithm developed for solving the data-clustering problem, explained in detail above, is shown in Figure 2. The flow charts of the first and second techniques proposed to increase the performance of the ACO algorithm are shown in Figures 3 and 4, respectively.


TABLE 2. PHEROMONE TRAIL MATRIX GENERATED AT ANY ITERATION LEVEL OF THE ACO ALGORITHM FOR TEST DATASET

N (Sample No)  K (Cluster No)
               1         2         3
1              0.014756  0.015274  0.009900
2              0.015274  0.009900  0.014756
3              0.015274  0.014756  0.009900
4              0.009900  0.015274  0.014756
5              0.014756  0.015274  0.009900
6              0.009900  0.014756  0.015274
7              0.009900  0.020131  0.009900
8              0.015274  0.014756  0.009900
9              0.009900  0.015274  0.014756
10             0.014756  0.015274  0.009900

Data for Table 3:

S (Solution String)  N (Sample No)                  F (Fitness)
                     1  2  3  4  5  6  7  8  9  10
1                    2  1  1  2  2  3  3  1  2  2   4.003931
2                    2  3  1  2  2  3  2  3  2  2   7.172357
3                    2  1  1  2  2  3  2  1  2  3   7.864054
4                    2  1  3  2  2  3  2  1  2  3   8.455329
5                    2  2  1  2  2  3  2  1  2  2   10.36714
6                    2  1  1  2  3  3  2  1  1  3   10.92255
7                    1  1  1  2  2  3  2  1  2  3   11.94087
8                    2  1  1  2  1  3  2  1  1  1   12.00959
9                    1  1  2  2  2  3  1  1  2  2   13.26286
10                   1  1  2  2  2  3  3  1  2  3   13.33634


Figure 2. The flow chart of the ACO algorithm developed for solving the data-clustering problem [3]

Figure 3. The flow chart of the first technique proposed to increase the performance of ACO


Figure 4. The flow chart of the second technique proposed to increase the performance of ACO

Ants follow the path between their nest and the food source according to the amount of pheromone deposited on the path. Following ants decide which path to take depending on the pheromone concentrations on the paths. After a number of iterations, ants come to follow the same path continuously, because its pheromone concentration is far higher than that of the disused paths. This behavior of ants is called stagnation behavior. To avoid this disadvantage, the reference algorithm is improved with the addition of two new techniques, and the solutions are compared with each other. The first proposed technique (Figure 3) resets the pheromone amounts to their initial values every 50 iterations to avoid stagnation behavior. Aiming to minimize the stagnation behavior of ants, the second proposed technique (Figure 4) monitors the pheromone amounts and, if the pheromone concentration of every path has not changed over the last 10 iterations, resets the pheromone amounts to their initial values.
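The two reset techniques can be sketched as follows. The thresholds 50 and 10 come from the text; everything else is an assumption made for illustration: the names `reset_pheromone`, `should_reset_periodic` and `StagnationMonitor`, the tolerance `tol` used to decide that a value has "not changed", and the choice to restart the counter after a reset is signaled.

```python
def reset_pheromone(n_samples, n_clusters, tau0):
    """Return a fresh N x K matrix with every entry equal to tau0."""
    return [[tau0] * n_clusters for _ in range(n_samples)]


def should_reset_periodic(iteration, period=50):
    """First technique: reset every `period` iterations (1-based count)."""
    return iteration > 0 and iteration % period == 0


class StagnationMonitor:
    """Second technique: signal a reset when the pheromone matrix has
    stayed unchanged for `patience` consecutive iterations."""

    def __init__(self, patience=10, tol=0.0):
        self.patience = patience
        self.tol = tol            # assumed tolerance for "no change"
        self._last = None
        self._unchanged = 0

    def update(self, pheromone):
        """Record the current matrix; return True if a reset is due."""
        if self._last is not None and not any(
            abs(a - b) > self.tol
            for old_row, new_row in zip(self._last, pheromone)
            for a, b in zip(old_row, new_row)
        ):
            self._unchanged += 1
        else:
            self._unchanged = 0
        self._last = [list(row) for row in pheromone]
        if self._unchanged >= self.patience:
            self._unchanged = 0   # start counting again after the reset
            return True
        return False
```

In a main loop, one would call `should_reset_periodic(t)` (first technique) or `monitor.update(tau)` (second technique) once per iteration and, when it returns true, replace the matrix with `reset_pheromone(N, K, tau0)`.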

IV. EXPERIMENTAL EVALUATION

To generate the optimal solutions of the presented ACO algorithm for the data-clustering problem and of the two added techniques, an application program was written in “Microsoft Visual Basic 6.0” and applied to the iris database available in the UCI repository [6]. The iris database consists of 150 samples and is stored in a text file. The main screen of the application program is shown in Figure 5. The number of iterations, clusters, agents and local search agents, the initial pheromone value, the evaporation rate of pheromone and some other values needed by the algorithm are specified on this screen. The program runs the algorithm for the specified number of iterations.

Figure 5. The main screen of the application program

Figure 6 shows the statistical results of these three methods (the reference algorithm and the two new techniques) run in the application program with 1000 iterations. In Figure 6, '1. Solution' represents our main ant colony optimization algorithm, whose performance compared with the real cluster values of the iris database is 4%; '2. Solution' represents our first proposed technique, whose performance is 52%; and '3. Solution' represents our second proposed technique, whose performance is 80%.

Figure 6. Statistical result values of the ACO methods run with the criteria specified in Figure 5.


Figure 7. Graph screen showing the result values of the ACO methods run with the criteria specified in Figure 5 and the real solution.

Figure 7 shows the graph screen of these three methods (the reference algorithm and the two new techniques) run in the application program with 1000 iterations and the given criteria (see Figure 6). The straight line on the graph marks the fitness value of the real cluster values of the iris database. The curve labeled '1. Solution' shows the results of the ACO algorithm; its working performance relative to the real cluster values is only 4% (see Figure 7), because the algorithm exhibited stagnation behavior after the 615th iteration (see Figure 6). The curve labeled '2. Solution' shows the results of the first proposed technique, whose working performance is 52%, and the curve labeled '3. Solution' shows the results of the second proposed technique, whose working performance is 80% (see Figure 7).

V. CONCLUSION

In this paper we proposed two new techniques to increase the working performance of the ant colony optimization algorithm. We also verified the ACO algorithm and the proposed techniques in an application program. The comparison of these three methods shows that the proposed techniques increase the performance of the reference ACO algorithm and that the best results are obtained with the second proposed technique. Consequently, our two proposed techniques markedly increased the success of the ACO algorithm developed for solving the data-clustering problem.

REFERENCES

[1] FRAWLEY, W.J., PIATETSKY-SHAPIRO, G., MATHEUS, C.J., "Knowledge Discovery in Databases: An Overview", AI Magazine, 13(3):57-70, 1992

[2] DORIGO, M., MANIEZZO, V., COLORNI, A., "The Ant System: Optimization by a colony of cooperating agents", IEEE Transactions on Systems, Man, and Cybernetics-Part B, Vol. 26, No. 1, pp. 1-13, 1996

[3] SHELOKAR, V.K., JAYARAMAN, et al., "An Ant Colony Approach for Clustering", Analytica Chimica Acta 509, 187-195, 2004

[4] DALKILIÇ, G., et al., "Karınca Kolonisi Optimizasyonu", YPBS2002 - Yüksek Performanslı Bilişim Sempozyumu, Kocaeli, Ekim 2002

[5] DI CARO, G., DORIGO, M., "Extending AntNet for Best-effort Quality-of-Services Routing", Ant Workshop on Ant Colony Optimization, http://iridia.ulb.ac.be/ants98/ants98.html, 15-16, 1998

[6] UCI Repository for Machine Learning Databases, retrieved from the World Wide Web: http://www.ics.uci.edu/~mlearn/MLRepository.htm

[7] TSAI, C.F., TSAI, C.W., WU, H.C., YANG, T., "ACODF: a novel data clustering approach for data mining in large databases", The Journal of Systems and Software 73, p. 133-145, 2004

[8] KUO, R.J., WANG, H.S., HU, T., CHOU, S.H., "Application of Ant K-Means on Clustering Analysis", Computers and Mathematics with Applications 50, p. 1709-1724, 2005

[9] MANIEZZO, V., et al., "An Ant Approach To Membership Overlay Design", ANTS 2004 - Fourth International Workshop On Ant Colony Optimization and Swarm Intelligence, p. 37-48, Berlin, 2004

[10] PARPINELLI, R.S., LOPES, H.S., FREITAS, A.A., "Classification-Rule Discovery with an Ant Colony Algorithm", Encyclopedia of Information Science and Technology, Idea Group Inc., 2005

Prof. (Dr.) Mohd Husain

Professor, Deptt. of Computer Sc. & Engineering

AZAD Institute of Engineering and Technology,

Lucknow, India

mohd.husain90@gmail.com

Mr. Raj Gaurang Tiwari

Assistant Professor,

Deptt. of Computer Applications

AZAD Institute of Engineering and Technology,

Lucknow, India

rajgaurang@gmail.com

Mr. Anil Agrawal

Assistant Professor,

Deptt. of Computer Sc. and Engineering

Ambalika Institute of Management and

Technology, Lucknow, India

anil19974@gmail.com

Mr. Bineet Gupta

Lecturer, Deptt. of Computer Sc. & Engineering

Mizan Tepi University, Ethiopia

bineet777@gmail.com
