Clustering the Mixed Numerical and Categorical Datasets using Similarity Weight and Filter Method



Srinivasulu Asadi (1), Dr. Ch. D. V. Subba Rao (2), C. Kishore (3), Shreyash Raju (4)

(1) Associate Professor, Department of IT, Sree Vidyanikethan Engineering College, A. Rangampet, Chittoor (dt), A.P., INDIA - 51710, srinu_asadi@yahoo.com
(2) Professor of Computer Science Engg., Department of CSE, S.V. University Engg. College, Tirupathi, Chittoor (dt), A.P., INDIA - 51710, subbarao_chdv1@hotmail.com
(3) M.Tech Student, Department of IT, Sree Vidyanikethan Engineering College, A. Rangampet, Chittoor (dt), A.P., INDIA - 51710, kishore.btech16@gmail.com


ABSTRACT: Clustering is a challenging task in data mining. The aim of clustering is to group similar data into a number of clusters, and various clustering algorithms have been developed for this purpose. However, these algorithms work effectively either on pure numeric data or on pure categorical data; most of them perform poorly on mixed categorical and numerical data. Previously, the k-means algorithm was used for such data, but it is not accurate for large datasets. In this paper we cluster mixed numeric and categorical data sets in an efficient manner. We present a clustering algorithm, based on the similarity weight and filter method paradigm, that works well for data with mixed numeric and categorical features. We propose a modified description of the cluster center to overcome the numeric-data-only limitation and to provide a better characterization of clusters. The performance of this algorithm has been studied on benchmark data sets.


Keywords: Data Mining, Clustering, Numerical Data, Categorical Data, Similarity Weight, Filter Method.
1. INTRODUCTION


Clustering produces collections of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters. Many data mining applications require partitioning of data into homogeneous clusters from which interesting groups may be discovered, such as a group of motor insurance policy holders with a high average claim cost, or a group of clients in a banking database showing a heavy investment in real estate. To perform such analyses, at least the following two problems have to be solved: (1) efficient partitioning of a large data set into homogeneous groups or clusters, and (2) effective interpretation of clusters. This paper proposes a solution to the first problem and suggests a solution to the second. With the amazing progress of both computer hardware and software, a vast amount of data is generated and collected daily. There is no doubt that data are meaningful only when one can extract the hidden information inside them. However, "the major barrier for obtaining high quality knowledge from data is due to the limitations of the data itself". These barriers come from the growing size and versatile domains of collected data. Thus, data mining, which aims to discover interesting patterns from large amounts of data within limited resources (i.e., computer memory and execution time), has become popular in recent years.

Clustering is considered an important tool for data mining. The goal of data clustering is to divide the data set into several groups such that objects in the same group have a high degree of similarity to each other and a high degree of dissimilarity to objects in different groups. Each formed group is called a cluster. Useful patterns may be extracted by analyzing each cluster. For example, grouping customers with similar characteristics based on their purchasing behaviors in transaction data may reveal previously unknown patterns. The extracted information is helpful for decision making in marketing.


Various clustering applications have emerged in diverse domains. However, most of the traditional clustering algorithms are designed to focus either on numeric data or on categorical data, while the data collected in the real world often contain both numeric and categorical attributes. It is difficult to apply traditional clustering algorithms directly to these kinds of data. Typically, when people need to apply traditional distance-based clustering algorithms (e.g., k-means) to group such data, a numeric value is assigned to each category in the categorical attributes. Some categorical values, for example "low", "medium" and "high", can easily be transferred into numeric values. But if categorical attributes contain values like "red", "white" and "blue", they cannot be ordered naturally. How to assign numeric values to these kinds of categorical attributes is a challenging task.


In this paper we first divide the original data set into pure numerical and pure categorical data sets. Next, existing well-established clustering algorithms designed for the different types of datasets are employed to produce corresponding clusters. Last, the clustering results on the categorical and numeric datasets are combined into a categorical dataset, on which a categorical data clustering algorithm is employed to get the final output.

The remainder of this paper is organized as follows. Next we present the background and related work, then the proposed method for clustering mixed categorical and numerical data, and finally the conclusion of our work.


2. RELATED WORK


2.1 Cluster Ensemble approach for mixed data


Datasets with mixed data types are common in real life. Cluster Ensemble [3] is a method that combines several runs of different clustering algorithms to get a common partition of the original dataset. In that paper a divide-and-conquer technique is formulated. Existing algorithms use similarity measures like Euclidean distance, which give good results for numeric attributes but do not work well for categorical attributes. In the cluster ensemble approach, numeric data and categorical data are handled separately, and both results are then treated in a categorical manner. Algorithms used for categorical data include K-Modes, K-Prototype, ROCK [4] and the Squeezer algorithm. In K-Modes, the total mismatch of the categorical attributes of two data records is projected. The Squeezer algorithm yields good clustering results and good scalability, and it handles high-dimensional data sets efficiently.


2.2 Methodology

1. Split the given data set into two parts: one for numerical data and another for categorical data.
2. Apply any one of the existing clustering algorithms to the numerical data set.
3. Apply any one of the existing clustering algorithms to the categorical data set.
4. Combine the outputs of step 2 and step 3.
5. Cluster the combined results using the Squeezer algorithm (see the sketch below).
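For illustration, the following Python sketch follows steps 1-5 under stated assumptions: scikit-learn's k-means stands in for the numeric clusterer, and one-hot encoding plus k-means stands in for a dedicated categorical algorithm such as k-modes or Squeezer, neither of which ships with the standard libraries. The function name ensemble_cluster is ours.

```python
# A minimal sketch of the split / cluster / combine methodology (steps 1-5).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def ensemble_cluster(df: pd.DataFrame, k: int) -> np.ndarray:
    # Step 1: split the mixed data set into numeric and categorical parts.
    num = df.select_dtypes(include=[np.number])
    cat = df.select_dtypes(exclude=[np.number])

    # Step 2: cluster the numeric subset (k-means as one established choice).
    num_labels = KMeans(n_clusters=k, n_init=10).fit_predict(num)

    # Step 3: cluster the categorical subset (one-hot codes + k-means, standing
    # in for a true categorical algorithm such as k-modes or Squeezer).
    cat_labels = KMeans(n_clusters=k, n_init=10).fit_predict(pd.get_dummies(cat))

    # Step 4: combine both label vectors into a new, purely categorical data set.
    combined = pd.DataFrame({"num": num_labels, "cat": cat_labels}).astype(str)

    # Step 5: cluster the combined categorical data (again a stand-in for Squeezer).
    return KMeans(n_clusters=k, n_init=10).fit_predict(pd.get_dummies(combined))
```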

The credit approval and cleve (heart diseases) datasets are used, and the cluster accuracy and cluster error rate are measured. The cluster accuracy r is defined by

r = (1/n) Σ_{i=1}^{K} a_i

where K represents the number of clusters, a_i represents the number of instances occurring in both cluster i and its corresponding class, and n represents the number of instances in the dataset. Finally, the cluster error rate e is defined by

e = 1 - r

where r represents the cluster accuracy.
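Read literally, these two definitions translate into a few lines of Python. The sketch below assumes integer-coded class labels and takes a_i to be the majority-class count within cluster i.

```python
import numpy as np

def cluster_accuracy(labels, classes, k):
    # r = (1/n) * sum_i a_i, with a_i the number of instances in cluster i
    # that carry the cluster's majority class (classes assumed integer-coded).
    n = len(labels)
    a_total = 0
    for i in range(k):
        members = classes[labels == i]
        if members.size:
            a_total += np.bincount(members).max()
    return a_total / n

def cluster_error(labels, classes, k):
    return 1.0 - cluster_accuracy(labels, classes, k)  # e = 1 - r
```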


Fig 1: Overview of the Cluster Ensemble algorithm framework

Fig 1 shows that the original dataset is split into a categorical dataset and a numerical dataset, each of which is clustered. The outputs of these clusterings are then clustered using the cluster ensemble algorithm.


The algorithm is compared with the k-prototype algorithm. Fig 2 shows the error rates of the k-prototype and Cluster Ensemble algorithms.




Fig 2: Comparing K-prototype and CEBMC


3. REVIEW OF K-MEANS ALGORITHM


The k-means clustering algorithm requires the user to specify from the beginning the number of clusters to be produced, and the algorithm builds and refines that number of clusters. The variant reviewed here (the k-modes extension) deals with categorical data only: each cluster has a mode associated with it. Assuming that the objects in the data set are described by m categorical attributes, the mode of a cluster is a vector Q = {q1, q2, ..., qm}, where qi is the most frequent value for the i-th attribute in the cluster of objects.

Given a data set and the number of clusters k, the algorithm clusters the set as follows:

1. Select initial modes for the k clusters.
2. For each object X:
   a. Calculate the similarity between object X and the modes of all clusters.
   b. Insert object X into the cluster c whose mode is the most similar to object X.
   c. Update the mode of cluster c.
3. Retest the similarity of objects against the current modes. If an object is found to be closer to the mode of another cluster than to that of its own cluster, reallocate the object to that cluster and update the modes of both clusters.
4. Repeat step 3 until no or few objects change clusters after a full cycle test of all the objects.
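A compact sketch of these steps, assuming the data set is a NumPy array of categorical (string) values and using the number of matching attributes as the similarity, might look as follows.

```python
import numpy as np
from collections import Counter

def k_modes(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 1
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # steps 2a/2b: assign each object to the cluster whose mode matches it best
        new = np.array([np.argmax([(x == m).sum() for m in modes]) for x in X])
        if np.array_equal(new, labels):
            break  # step 4: assignments have settled
        labels = new
        for c in range(k):  # steps 2c/3: refresh each cluster's mode
            members = X[labels == c]
            if len(members):
                # most frequent value of every attribute within the cluster
                modes[c] = [Counter(col).most_common(1)[0][0] for col in members.T]
    return labels
```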


In most clustering algorithms, an object is usually viewed as a point in a multidimensional space. It can be represented as a vector (x1, ..., xd), a collection of values of d selected attributes, where xi is the value of the i-th selected attribute. The value of xi may be numerical or categorical.

Most pioneering approaches to clustering mixed numeric and categorical values redefine the distance measure and apply it to existing clustering algorithms.

K-prototype:

K-prototype is one of the most famous such methods. K-prototype inherits the ideas of k-means: it applies Euclidean distance to numeric attributes, and a distance function for categorical attributes is added into the measure of closeness between two objects. Object pairs with different categorical values enlarge the distance between them. The main shortcomings of k-prototype are the following:

(1) A binary distance is employed for categorical values: if a pair of objects has the same categorical value, the distance between them is zero; otherwise it is one. However, this does not properly reflect the real situation, since categorical values may have degrees of difference. For example, the difference between "high" and "low" should not equal the one between "high" and "medium".

(2) Only one attribute value is chosen to represent the whole attribute in the cluster center. Therefore, categorical values with fewer appearances seldom get the chance to be shown in the cluster center, even though these items may play an important role during the clustering process. Additionally, since k-prototype inherits the ideas of k-means, it retains the same weaknesses as k-means.


4. PROPOSED ALGORITHM

In this paper we propose a new algorithm for clustering mixed numerical and categorical data. In this algorithm we do the following. First we read the large data set D and split it into numerical data and categorical data, storing both data sets. We then cluster the numerical and categorical data sets using the similarity weight. Finally, we combine the clustered categorical dataset and the clustered numerical dataset into a categorical dataset using the filter method. In this algorithm we thus cluster the numerical data, the categorical data and the mixed data. The above process is shown in fig 3.

4.1 SIMILARITY WEIGHT METHOD

Cluster validity functions are often used to evaluate the performance of clustering under different indexes, and even across two different clustering methods. Many cluster validity criteria were proposed during the last 10 years, most of them arising from different studies dealing with the number of clusters. To test validity indices, we use the Iris and Wine data sets. The Iris data set is perhaps the best-known database in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of Iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other. The Wine data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determines the quantity of 13 constituents found in each of the three types of wines.

Essentially, for a given data set and value of K, the DHCA starts with K randomly selected centres and assigns each data point to its closest centre, creating K partitions. At each stage in the iteration, for each of these K partitions, DHCA recursively selects K random centres and continues the clustering process within each partition, forming at most K^N partitions at the N-th stage [1]. In our implementation, the procedure continues until the number of elements in a partition is below K+2, at which point the distance of each data item to the other data items in that partition can be updated with a smaller value by a brute-force nearest neighbour search. The Divisive Hierarchical Clustering Algorithm [2] partitions the data set into smaller partitions so that the number of data items in each partition is less than the maximum partition size, i.e., K+2. In the first iteration the entire data set is stored as the initial partition. After that, at each stage all the partitions are stored irrespective of their K+2 condition.
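The following recursive sketch shows one way to realize this procedure in Python; the arrays dist and nn_idx play the role of the distance and index arrays described below, and the degenerate-split guard is our own addition to keep the recursion finite.

```python
import numpy as np

def brute_force(points, idx, dist, nn_idx):
    # Small partition: refine every member's nearest-neighbour estimate directly.
    for i in idx:
        for j in idx:
            if i != j:
                d = np.linalg.norm(points[i] - points[j])
                if d < dist[i]:
                    dist[i], nn_idx[i] = d, j

def dhca(points, idx, K, dist, nn_idx, rng):
    if len(idx) < K + 2:
        brute_force(points, idx, dist, nn_idx)
        return
    # Select K random centres and split this partition around them.
    centres = points[rng.choice(idx, size=K, replace=False)]
    assign = np.array([np.argmin([np.linalg.norm(points[i] - c) for c in centres])
                       for i in idx])
    for c in range(K):
        part = idx[assign == c]
        if len(part) == len(idx):
            brute_force(points, part, dist, nn_idx)  # degenerate split: stop recursing
        elif len(part):
            dhca(points, part, K, dist, nn_idx, rng)

# Typical call: dist = np.full(n, np.inf); nn_idx = np.full(n, -1)
# dhca(points, np.arange(n), K, dist, nn_idx, np.random.default_rng(0))
```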

For a set of s-dimensional data, i.e., where each data item is a point in s-dimensional space, there exists a distance between every pair of data items. In this sequential initialization, all the pairwise distances are calculated after reading their details from the database. The threshold value calculation uses a distance array and an index array. The distance array [1] is used to record the distance of each data point to some other data point in the sequentially stored data set. The index array [1] records the index of the data item at the other end of the distance stored in the distance array. The number of partition centers at each stage of the DHCA [2] determines the performance of the algorithm with and without a given number of clusters. K varies from a minimum to a maximum value.






The similarity weight between two objects a and b over n attributes is computed as

Sim(a, b) = Σ_{i=1}^{n} (a_i - b_i)^2     (4)


For the maximum K, the algorithm performs a larger number of distance computations in the DHCA [2] processes, and the running time increases with K. As K increases, the change in running time increases along with it. For the minimum value of K, the changes in running time are small, because when K is minimal a small increase in K decreases the total number of nodes constructed by DHCA, which yields better performance than using a small K. As K gets larger and larger, more distance computations are made to the partition centers, and performance eventually improves.


Fig 3: Overview of the Similarity and Filtered Algorithm framework


Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized and the similarity of objects from different clusters is minimized. In this paper we present a clustering algorithm based on the MST [14] paradigm that works well for data with mixed numeric and categorical features. We propose a modified description of the cluster center to overcome the numeric-data-only limitation of the MST [14] algorithm and to provide a better characterization of clusters. The performance of this algorithm has been studied on benchmark data sets. Real-life databases contain mixed numeric and categorical data sets. In this work, we have proposed a modified MST algorithm for finding a globally optimal partition of given mixed numeric and categorical data into a specified number of clusters. This incorporates the MST into the k-means algorithm with an enhanced cost function to handle the categorical data, and our experimental results show that it is effective in recovering the underlying cluster structures from categorical data, if such structures exist. A modified MST representation for the cluster centre is used. This representation can capture cluster characteristics very effectively because it contains the distribution of all categorical values in the cluster.


4.2 FILTER ALGORITHM

For clustering mixed numerical and categorical datasets, we propose an algorithm called the filter method. First, the original dataset is divided into two sub-datasets, i.e., a pure categorical dataset and a pure numerical dataset. Next, we apply clustering algorithms to the sub-datasets according to their type to get the corresponding clusters. Last, the clustering results of the numerical and categorical datasets are combined into a categorical dataset, on which the categorical data clustering algorithm is exploited to get the final clusters.

We now discuss the last step of the above process. With the clustering results on the categorical and numerical datasets, we obtain the final clustering results by exploiting the filter method.

Collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data, including sensing and monitoring data, such as in mineral exploration or environmental sensing over large areas or with multiple sensors; financial data, such as financial service institutions that integrate many financial sources; and electronic commerce and web 2.0 applications, where the focus is on user data. The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.


Step (1): Start with a tree built by the sequential initialization.
Step (2): Calculate the mean and standard deviation of the edge weights in the distance array.
Step (3): Use their sum as the threshold.
Step (4): Perform multiple runs of the Similarity Algorithm.
Step (5): Identify the longest edge using the Similarity.
Step (6): Remove this longest edge.
Step (7): Check the terminating condition and continue.
Step (8): Put the resulting number of clusters into the Filter Method (see the sketch below).
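Under one plausible reading of these steps, the spanning tree can be built with SciPy and the longest edges removed while they exceed the mean-plus-standard-deviation threshold; the sketch below is illustrative, not the exact Similarity Algorithm.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(points):
    # Step (1): build a minimum spanning tree over the pairwise distances.
    mst = minimum_spanning_tree(squareform(pdist(points))).toarray()
    weights = mst[mst > 0]
    # Steps (2)-(3): threshold = mean + standard deviation of the edge weights.
    threshold = weights.mean() + weights.std()
    # Steps (5)-(7): repeatedly remove the longest edge while it exceeds the threshold.
    while mst.max() > threshold:
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    # Step (8): the surviving connected components are the clusters.
    return connected_components(mst, directed=False)[1]
```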


To summarize, the parameters the algorithm needs from the user include the data set, the loosely estimated minimum and maximum numbers of data points in each cluster, the input K for the filter method and the similarity weight method, and the number of nearest neighbours to keep for each data item in the auxiliary arrays. The outputs are the final distance and index arrays, and a labelling array that records the cluster label each data item belongs to.

Filter Algorithm

Input: Categorical data set DS
Output: Set of clusters

D = {X1, X2, ..., Xn}
C = number of clusters
m = number of attributes
n = number of records
1. Read the data set D.


2. Compute the dissimilarity
   d(x, y) = (x · y) / (‖x‖₂ · ‖y‖₂)

3. Compute
   F(D, C) = Σ_{i=1}^{n} Σ_{t=1}^{m} w_{i,t} · d(X_{i,t}, C_t)
4. Cluster the data according to F.
5. Submit F to the post-clustering technique, i.e., the Filtered Algorithm.
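Since the two formulas above are reconstructed from a garbled source, the sketch below should be read as one consistent interpretation rather than the authors' exact computation: a cosine-style record dissimilarity for step 2, and a doubly indexed weighted sum for the objective F in step 3, with absolute difference as the per-attribute distance d.

```python
import numpy as np

def record_dissimilarity(x, y):
    # Step 2 (reconstructed): d(x, y) = (x . y) / (||x||_2 * ||y||_2)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def objective_F(D, centers, w):
    # Step 3 (reconstructed): F(D, C) = sum_{i=1..n} sum_{t=1..m} w[i, t] * d(X[i, t], C[t]),
    # here with |x - c| as the per-attribute scalar distance d.
    n, m = D.shape
    return sum(w[i, t] * abs(D[i, t] - centers[t])
               for i in range(n) for t in range(m))
```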




Fig 4: Applying the clustering techniques Similarity Weight and Filter Method

Fig 5: Results of clustering showing groups divided into clusters

Fig 6: Initialization and input


4.3 Advantages of the Proposed System

- Efficient use of the cut and cycle properties by our fast Filtered-inspired clustering algorithm.
- The shape of a cluster has very little impact on the performance of this Filtered clustering algorithm.
- Efficient for dimensionality greater than 5, with reduced time complexity.
- Nearest neighbour search is used to construct the Filtered structure efficiently.
- Works efficiently even if the boundaries of clusters are irregular.


Fig 7: Final filtered and EMST


5. CLUSTERING RESULTS

We used the filter and k-prototypes algorithms to cluster the credit approval dataset and the cleve dataset into different numbers of clusters, varying from 1 to 10. For each fixed number of clusters, the clustering errors of the different algorithms were compared. For both datasets, all numeric attributes were rescaled to the range [0, 1], just as has been done for the k-prototypes algorithm.


Fig 8: Clustering error vs. number of clusters

Figure 8 shows the results of the different clustering algorithms on the credit approval dataset. From figure 8, we can summarize the relative performance of our algorithms as follows.


Table 4.1: Relative performance of different clustering algorithms (credit approval dataset)

Method                  Average clustering error
k-prototype             0.211
Similarity & Filter     0.181

That is, compared with the k-prototypes algorithm, our algorithm performed the best in all cases; it never performed the worst. Furthermore, the average clustering error of our algorithm is significantly smaller than that of the k-prototypes algorithm. The above experimental results demonstrate the effectiveness of the filter method for clustering datasets with mixed attributes. In addition, it outperforms the k-prototypes algorithm with respect to clustering accuracy.


6. CONCLUSION

The method of comprehensive assessment using similarity weights in the attribute synthetic evaluation system appears to be objective and rational. It not only embodies the weights of the variables involved, but also exploits the information presented by the sample. Our system is efficient for any number of dimensions and reduces time complexity. Irregular boundaries can also be handled efficiently using our filtered algorithm with the divide-and-conquer technique. As future work, we plan to mix different clustering datasets (labeled, unlabeled, nominal and ordinal) with different algorithms.


7. REFERENCES



[1] A. Ahmad and L. Dey (2007), "A k-mean clustering algorithm for mixed numeric and categorical data," Data and Knowledge Engineering, Elsevier, vol. 63, pp. 503-527.

[2] Xiaochun Wang, Xiali Wang, and D. Mitchell Wilkes, "A Divide-and-Conquer Approach for Minimum Spanning Tree-Based Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 21, July 2009.

[3] S. Deng, Z. He, and X. Xu (2005), "Clustering mixed numeric and categorical data: A cluster ensemble approach," arXiv preprint cs/0509011.

[4] S. Guha, R. Rastogi, and K. Shim (2000), "ROCK: A Robust Clustering Algorithm for Categorical Attributes," Information Systems, vol. 25, no. 5, pp. 345-366.

[5] V.V. Cross and T.A. Sudkamp, Similarity and Compatibility in Fuzzy Set Theory: Assessment and Applications, Physica-Verlag, New York, 2002.

[6] M. Kalina, "Derivatives of fuzzy functions and fuzzy derivatives," Tatra

[7] C.T. Zahn, "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters," IEEE Trans. Computers, vol. 20, no. 1, pp. 68-86, Jan. 1971.

[8] D.J. States, N.L. Harris, and L. Hunter, "Computationally Efficient Cluster Representation in Molecular Sequence Megaclassification," ISMB, vol. 1, pp. 387-394, 1993.

[9] P. Raghu Rama Krishnan, "Accessing Databases with JDBC and ODBC," Pearson Education, Fourth Edition.

[10] Henry F. Korth and S. Sudharshan, Database System Concepts, McGraw-Hill, International Editions, Fourth Edition, 2002.

[11] I. Katriel, P. Sanders, and J.L. Traff, "A Practical Minimum Spanning Tree Algorithm Using the Cycle Property,"

[12] Sartaj Sahni, Advanced Data Structures and Algorithms, Second Edition.

[13] Grady Booch, James Rumbaugh, and Ivar Jacobson, Unified Modeling Language User Guide, Pearson Education, 2006 edition.

[14] I. Katriel, P. Sanders, and J.L. Traff, "A Practical Minimum Spanning Tree Algorithm Using the Cycle Property," Proc. 11th European Symp. Algorithms (ESA '03), vol. 2832, pp. 679-690, 2003.

[15] C.T. Zahn, "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters," IEEE Trans. Computers, vol. 20, no. 1, pp. 68-86, Jan. 1971.

[16] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, 1973.

[17] N. Chowdhury and C.A. Murthy, "Minimum Spanning Tree-Based Clustering Technique: Relationship with Bayes Classifier," Pattern Recognition, vol. 30, no. 11, pp. 1919-1929, 1997.

[18] A. Vathy-Fogarassy, A. Kiss, and J. Abonyi, "Hybrid Minimal Spanning Tree and Mixture of Gaussians Based Clustering Algorithm," Foundations of Information and Knowledge Systems, pp. 313-330, Springer, 2006.

[19] O. Grygorash, Y. Zhou, and Z. Jorgensen, "Minimum Spanning Tree-Based Clustering Algorithms," Proc. IEEE Int'l Conf. Tools with Artificial Intelligence, pp. 73-81, 2006.

[20] R.C. Gonzalez and P. Wintz, Digital Image Processing, second ed., Addison-Wesley, 1987.

[21] Y. Xu, V. Olman, and E.C. Uberbacher, "A Segmentation Algorithm for Noisy Images: Design and Evaluation," Pattern Recognition Letters, vol. 19, pp. 1213-1224, 1998.

[22] Y. Xu and E.C. Uberbacher, "2D Image Segmentation Using Minimum Spanning Trees," Image and Vision Computing, vol. 15, pp. 47-57, 1997.

[23] D.J. States, N.L. Harris, and L. Hunter, "Computationally Efficient Cluster Representation in Molecular Sequence Megaclassification," ISMB, vol. 1, pp. 387-394, 1993.

[24] Y. Xu, V. Olman, and D. Xu, "Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees," Bioinformatics, vol. 18, no. 4, pp. 536-545, 2002.

[25] M. Laszlo and S. Mukherjee, "Minimum Spanning Tree Partitioning Algorithm for Microaggregation," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 7, pp. 902-911, July 2005.

[26] M. Forina, M.C.C. Oliveros, C. Casolino, and M. Casale, "Minimum Spanning Tree: Ordering Edges to Identify Clustering Structure," Analytica Chimica Acta, vol. 515, pp. 43-53, 2004.


[27] Chaturvedi, P. Green, and J. Carroll (2001), "K-modes clustering," Journal of Classification, vol. 18, pp. 35-55.

[28] Jain, M. Murty, and P. Flynn (1999), "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323.

[29] G. Gan, Z. Yang, and J. Wu (2005), "A Genetic k-Modes Algorithm for Clustering Categorical Data," ADMA, LNAI 3584, pp. 195-202.

[30] J.Z. Huang, M.K. Ng, H. Rong, and Z. Li (2005), "Automated variable weighting in k-means type clustering," IEEE Transactions on PAMI, 27(5).

[31] K. Krishna and M. Murty (1999), "Genetic K-Means Algorithm," IEEE Transactions on Systems, Man, and Cybernetics, vol. 29, no. 3, pp. 433-439.

[32] S. Bandyopadhyay and U. Maulik (2001), "Nonparametric genetic clustering: Comparison of validity indices," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 31, no. 1, pp. 120-125.

[33] S. Guha, R. Rastogi, and K. Shim (2000), "ROCK: A robust clustering algorithm for categorical attributes," Information Systems, vol. 25, no. 5, pp. 345-366.

[34] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown (2004), "Incremental genetic K-means algorithm and its application in gene expression data analysis," BMC Bioinformatics 5:172.

[35] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown (2004), "FGKA: A Fast Genetic K-means Clustering Algorithm," ACM 1-58113-812-1.

[36] Z. He, X. Xu, and S. Deng (2005), "Scalable algorithms for clustering categorical data," Journal of Computer Science and Intelligence Systems, 20, 1077-1089.

[37] A. Ahmad and L. Dey, "A feature selection technique for classificatory analysis," Pattern Recognition Letters, 26(1) (2005), 43-56.

[38] M. Mahdavi and H. Abolhassani (2009), "Harmony K-means algorithm for document clustering," Data Min Knowl Disc (2009) 18:370-391.

[39] H. Yan, K. Chen, L. Liu, and Z. Yi (2010), "SCALE: a scalable framework for efficiently clustering transactional data," Data Min Knowl Disc (2010) 20:1-27.

[40] Jiawei Han and Micheline Kamber, "Data Warehousing and Data Mining: Concepts and Techniques," Third Edition, 2007.

[41] N. Chowdbury and C.A. Murthy, "Minimum Spanning Tree-Based Clustering Technique: Relationship with Bayes Classifier," Pattern Recognition, vol. 30.


AUTHORS' BIOGRAPHY

Author 1: Asadi Srinivasulu received the B.Tech in Computer Science Engineering from Sri Venkateswara University, Tirupati, India in 2000 and the M.Tech in IT with Intelligent Systems from the Indian Institute of Information Technology, Allahabad (IIIT) in 2004, and he is pursuing a Ph.D in CSE from J.N.T.U.A, Anantapur, India. He has 10 years of teaching and industrial experience. He served as the Head, Dept. of Information Technology, S V College of Engineering, Karakambadi, Tirupati, India during 2007-2009. His areas of interest include Data Mining and Data Warehousing, Intelligent Systems, Image Processing, Pattern Recognition, Machine Vision Processing and Cloud Computing. He is a member of IAENG and IACSIT. He has published more than 25 papers in international journals and conferences; some of his publications appear in the IJCSET, IJCA and IJCSIT digital libraries. He has visited Malaysia and Singapore. srinu_asadi@yahoo.com


Author 2: Dr. Ch. D. V. Subba Rao received the B.Tech (CSE) from S V University College of Engineering, Tirupati, India in 1991 and the M.E. (CSE) from M K University, Madurai in 1998, and he was the first Ph.D awardee in CSE from S V University, Tirupati in 2008. He has 19 years of teaching experience. He served as the Head, Dept. of Computer Science and Engineering, S V University College of Engineering, Tirupati, India during 2008-11. His areas of interest include Distributed Systems, Advanced Operating Systems and Advanced Computing. He is a member of IETE, IAENG, CSI and ISTE. He has chaired and served as a reviewer for IAENG and IASTED international conferences. He has published more than 25 papers in international journals and conferences; some of his publications appear in the IEEE and ACM digital libraries. He has visited Austria, the Netherlands, Belgium, Hong Kong, Thailand and Germany. subbarao_chdv1@hotmail.com


Author 3: C. Kishore is studying for the II M.Tech in Information Technology with Software Engineering from JNTUA, Anantapur in 2011-12, and his research areas are Software Engineering, Data Warehousing and Data Mining, Database Management Systems and Cloud Computing. kishore.btech16@gmail.com

Author 4: Shreyash Raju is studying for the IV B.Tech degree in Information Technology from JNTUA, Anantapur in 2011-12, and his research areas are Image Processing, Data Warehousing and Data Mining, Database Management Systems and Software Engineering. shreyashraju18@gmail.com