Clustering the Mixed Numerical and Categorical Datasets using Similarity Weight and Filter Method
Srinivasulu Asadi 1, Dr. Ch. D. V. Subba Rao 2, C. Kishore 3, Shreyash Raju 4
1 Associate Professor, Department of IT, Sree Vidyanikethan Engineering College, A. Rangampet, Chittoor (dt), A.P., INDIA - 51710, srinu_asadi@yahoo.com
2 Professor of Computer Science Engg, Department of CSE, S.V. University Engg. College, Tirupathi, Chittoor (dt), A.P., INDIA - 51710, subbarao_chdv1@hotmail.com
3 M.Tech Student, Department of IT, Sree Vidyanikethan Engineering College, A. Rangampet, Chittoor (dt), A.P., INDIA - 51710, kishore.btech16@gmail.com
ABSTRACT:
Clustering is a challenging task in data mining. The aim of clustering is to group similar data into a number of clusters. Various clustering algorithms have been developed for this purpose. However, these algorithms work effectively either on pure numeric data or on pure categorical data; most of them perform poorly on mixed categorical and numerical data types. Previously, the k-means algorithm was used, but it is not accurate for large datasets. In this paper we cluster mixed numeric and categorical data sets in an efficient manner. We present a clustering algorithm based on the similarity weight and filter method paradigm that works well for data with mixed numeric and categorical features. We propose a modified description of the cluster center to overcome the numeric-data-only limitation and provide a better characterization of clusters. The performance of this algorithm has been studied on benchmark data sets.
Keywords: Data Mining, Clustering, Numerical Data, Categorical Data, similarity weight, filter method.
1. INTRODUCTION
A cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Many data mining applications require partitioning of data into homogeneous clusters from which interesting groups may be discovered, such as a group of motor insurance policy holders with a high average claim cost, or a group of clients in a banking database showing a heavy investment in real estate. To perform such analyses, at least the following two problems have to be solved: (1) efficient partitioning of a large data set into homogeneous groups or clusters, and (2) effective interpretation of clusters. This paper proposes a solution to the first problem and suggests a solution to the second.
With the amazing progress of both computer hardware and software, a vast amount of data is generated and collected daily. There is no doubt that data are meaningful only when one can extract the hidden information inside them. However, “the major barrier for obtaining high quality knowledge from data is due to the limitations of the data itself”. These major barriers come from the growing size and versatile domains of the collected data. Thus, data mining, which aims to discover interesting patterns from large amounts of data within limited resources (i.e., computer memory and execution time), has become popular in recent years.
Clustering is considered an important tool for data mining. The goal of data clustering is to divide the data set into several groups such that objects in the same group have a high degree of similarity to each other and a high degree of dissimilarity to the ones in different groups. Each formed group is called a cluster. Useful patterns may be extracted by analyzing each cluster. For example, grouping customers with similar characteristics based on their purchasing behaviors in transaction data may reveal previously unknown patterns. The extracted information is helpful for decision making in marketing.
Various clustering applications have emerged in diverse domains. However, most traditional clustering algorithms are designed to focus either on numeric data or on categorical data. Data collected in the real world often contain both numeric and categorical attributes, and it is difficult to apply traditional clustering algorithms directly to such data. Typically, when people need to apply traditional distance-based clustering algorithms (e.g., k-means) to group these types of data, a numeric value is assigned to each category of the categorical attributes. Some categorical values, for example “low”, “medium” and “high”, can easily be transformed into numeric values. But if a categorical attribute contains values like “red”, “white” and “blue”, these cannot be ordered naturally. How to assign numeric values to these kinds of categorical attributes is a challenging task.
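The difficulty can be illustrated with a minimal Python sketch. The integer codes below are arbitrary, hypothetical assignments (they are not taken from any dataset used in this paper); the point is only that such a coding imposes an order on nominal values, whereas a simple matching measure treats all distinct values alike.

```python
# Hypothetical integer codes for a nominal attribute; the numbers are arbitrary.
color_code = {"red": 0, "white": 1, "blue": 2}


def numeric_gap(a, b):
    """Distance implied by the arbitrary integer codes."""
    return abs(color_code[a] - color_code[b])


def simple_matching(a, b):
    """Return 0 if the two categorical values are equal, 1 otherwise."""
    return 0 if a == b else 1


# The integer coding claims "red" is twice as far from "blue" as from "white".
print(numeric_gap("red", "white"), numeric_gap("red", "blue"))
# Simple matching treats every pair of distinct colours the same way.
print(simple_matching("red", "white"), simple_matching("red", "blue"))
```

Changing the code table (for example swapping the codes of “white” and “blue”) changes the numeric gaps, which shows that any distance computed from them is an artifact of the encoding.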
In this paper we first divide the original data set into a pure numerical data set and a pure categorical data set. Next, existing well-established clustering algorithms designed for the different types of datasets are employed to produce the corresponding clusters. Last, the clustering results on the categorical and numeric datasets are combined as a categorical dataset, on which a categorical data clustering algorithm is employed to get the final output.
The remainder of this paper is organized as follows. We first present the background and related work, then the proposed method for clustering mixed categorical and numerical data, and finally the conclusion of our work.
2. RELATED WORK
2.1 Cluster Ensemble approach for mixed data
Datasets with mixed data types are common in real life. Cluster Ensemble [3] is a method that combines several runs of different clustering algorithms to get a common partition of the original dataset. In that paper a divide-and-conquer technique is formulated. Existing algorithms use similarity measures such as Euclidean distance, which give good results for numeric attributes but do not work well for categorical attributes. In the cluster ensemble approach, numeric data and categorical data are handled separately, and both results are then treated in a categorical manner. The algorithms used for categorical data include K-Modes, K-Prototype, ROCK [4] and the Squeezer algorithm. In K-Modes, the dissimilarity between two data records is the total number of mismatches between their categorical attributes. The Squeezer algorithm yields good clustering results and good scalability, and it handles high-dimensional data sets efficiently.
2.2 Methodology
1. Split the given data set into two parts, one for numerical data and another for categorical data.
2. Apply any one of the existing clustering algorithms to the numerical data set.
3. Apply any one of the existing clustering algorithms to the categorical data set.
4. Combine the outputs of step 2 and step 3.
5. Cluster the combined results using the Squeezer algorithm.
The credit approval and cleve (heart disease) datasets are used, and the cluster accuracy and cluster error rate are measured. The cluster accuracy ‘r’ is defined by
$$r = \frac{1}{n} \sum_{i=1}^{K} a_i$$
where K represents the number of clusters, $a_i$ represents the number of instances occurring in both cluster i and its corresponding class, and n represents the number of instances in the dataset.
Finally, the cluster error rate ‘e’ is defined by
$$e = 1 - r$$
where r represents the cluster accuracy.
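The two measures can be computed as in the following minimal sketch. It assumes that the class associated with a cluster is the majority class among its members (the text does not spell this out), and the function names are hypothetical.

```python
from collections import Counter


def cluster_accuracy(cluster_labels, class_labels):
    """Cluster accuracy r: share of instances that fall in the majority class
    of their cluster (assumption: a cluster's class is its majority class)."""
    n = len(cluster_labels)
    members = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        members.setdefault(cluster, []).append(cls)
    # a_i = size of the majority class inside cluster i
    matched = sum(Counter(classes).most_common(1)[0][1]
                  for classes in members.values())
    return matched / n


def cluster_error(cluster_labels, class_labels):
    """Cluster error rate e = 1 - r."""
    return 1.0 - cluster_accuracy(cluster_labels, class_labels)


clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(cluster_accuracy(clusters, classes), cluster_error(clusters, classes))
```

In this toy example, cluster 0 contributes two matching instances and cluster 1 contributes three, so five of the six instances are counted as correctly clustered.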
Fig 1: Overview of Cluster Ensemble algorithm framework
Fig 1 shows that the original dataset is split into a categorical dataset and a numerical dataset, which are clustered separately. The outputs of these clusterings are then clustered using the cluster ensemble algorithm.
The algorithm is compared with the k-prototype algorithm. Fig 2 shows the error rates of the k-prototype and Cluster Ensemble algorithms.
Fig 2: Comparing K-prototype and CEBMC
3. REVIEW OF K-MEANS ALGORITHM
K-means in its standard form deals with numeric data only; the variant reviewed here, often called k-modes, handles categorical data by representing each cluster with a mode instead of a mean. The algorithm requires the user to specify from the beginning the number of clusters to be produced, and it builds and refines the specified number of clusters. Each cluster has a mode associated with it. Assuming that the objects in the data set are described by m categorical attributes, the mode of a cluster is a vector Q = {q1, q2, …, qm} where qi is the most frequent value for the i-th attribute in the cluster of objects. Given a data set and the number of clusters k, the algorithm clusters the set as follows:
1. Select k initial modes, one for each of the k clusters.
2. For each object X:
a. Calculate the similarity between object X and the modes of all clusters.
b. Insert object X into the cluster c whose mode is the most similar to object X.
c. Update the mode of cluster c.
3. Retest the similarity of objects against the current modes. If an object is found to be closer to the mode of another cluster than to that of its own cluster, reallocate the object to that cluster and update the modes of both clusters.
4. Repeat step 3 until no or few objects change clusters after a full cycle test of all the objects.
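The steps above can be sketched in Python as follows. This is a simplified batch variant, not the exact incremental procedure: modes are recomputed once per pass rather than after every single reallocation, the first k objects are taken as initial modes (an assumption), and dissimilarity is the number of mismatching attributes. All function names are hypothetical.

```python
from collections import Counter


def mismatches(x, mode):
    """Number of attributes on which object x differs from a mode."""
    return sum(1 for a, b in zip(x, mode) if a != b)


def compute_mode(objects):
    """Most frequent value of each attribute over a non-empty list of objects."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def k_modes(data, k, max_iter=100):
    """Batch k-modes sketch: assign to nearest mode, recompute modes, repeat."""
    modes = [tuple(row) for row in data[:k]]  # assumption: first k objects
    labels = [None] * len(data)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: mismatches(x, modes[c]))
                      for x in data]
        if new_labels == labels:  # no object changed cluster: stop
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # keep the old mode if a cluster becomes empty
                modes[c] = compute_mode(members)
    return labels, modes
```

For example, `k_modes([("a", "x"), ("b", "y"), ("a", "x"), ("b", "y"), ("a", "y")], 2)` groups the three objects whose first attribute is "a" into one cluster; the tie for the last object is broken in favour of the lower cluster index.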
In most clustering algorithms, an object is usually viewed as a point in a multidimensional space. It can be represented as a vector (x1, ..., xd), a collection of values of selected attributes with d dimensions, where xi is the value of the i-th selected attribute. The value of xi may be numerical or categorical.
Most pioneering approaches to clustering mixed numeric and categorical values redefine the distance measure and apply it to existing clustering algorithms.
K-prototype: K-prototype is one of the most famous methods. K-prototype inherits the ideas of k-means: it applies Euclidean distance to numeric attributes, and a distance function for categorical attributes is defined and added into the measure of closeness between two objects. Object pairs with different categorical values enlarge the distance between them. The main shortcomings of k-prototype are the following:
(1) Binary distance is employed for categorical values. If two objects have the same categorical value, the distance between them is zero; otherwise it is one. However, this does not properly reflect the real situation, since categorical values may differ by degree. For example, the difference between “high” and “low” should not be equal to the one between “high” and “medium”.
(2) Only one attribute value is chosen to represent the whole attribute in the cluster center. Therefore, categorical values that appear less often seldom get the chance to be shown in the cluster center, though these items may play an important role during the clustering process.
Additionally, since k-prototype inherits the ideas of k-means, it retains the same weaknesses as k-means.
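Shortcoming (1) can be seen in a minimal sketch of a k-prototype-style distance. The weight `gamma` that balances the numeric and categorical parts follows the usual k-prototypes formulation and is an assumption here, as are the function and parameter names.

```python
def k_prototype_distance(x_num, x_cat, y_num, y_cat, gamma=1.0):
    """Squared Euclidean distance on the numeric attributes plus gamma times
    the number of mismatching categorical attributes (binary distance)."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    categorical_part = sum(1 for a, b in zip(x_cat, y_cat) if a != b)
    return numeric_part + gamma * categorical_part


# Same numeric value, different categorical value: both pairs get distance 1.0,
# although "high" is intuitively closer to "medium" than to "low".
print(k_prototype_distance([1.0], ["high"], [1.0], ["low"]))
print(k_prototype_distance([1.0], ["high"], [1.0], ["medium"]))
```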
4. PROPOSED ALGORITHM
In this paper we propose a new algorithm for clustering mixed numerical and categorical data. The algorithm proceeds as follows. First, we read the large data set D. We split D into a numerical data set and a categorical data set and store both. We cluster the numerical data set and the categorical data set using the similarity weight method. We then combine the clustered categorical dataset and the clustered numerical dataset as a categorical dataset using the filter method. In this way the algorithm clusters numerical data, categorical data and mixed data. The above process is shown in Fig 3.
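The splitting step can be sketched as follows. This minimal example assumes that records are tuples and that an attribute is numerical when its value in the first record is an `int` or `float`; the function name `split_mixed` is hypothetical.

```python
def split_mixed(records):
    """Split mixed records into a numerical part and a categorical part.

    Assumption: the type of each attribute is decided from the first record.
    """
    first = records[0]
    numeric_idx = [j for j, v in enumerate(first)
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
    categorical_idx = [j for j in range(len(first)) if j not in numeric_idx]
    numeric = [tuple(row[j] for j in numeric_idx) for row in records]
    categorical = [tuple(row[j] for j in categorical_idx) for row in records]
    return numeric, categorical


numeric, categorical = split_mixed([(1.5, "red", 3), (2.0, "blue", 4)])
print(numeric)
print(categorical)
```

Both parts keep the original record order, so the i-th numerical record and the i-th categorical record describe the same object when the two clustering results are combined later.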
4.1 SIMILARITY WEIGHT METHOD
Cluster validity functions are often used to evaluate the performance of clustering under different indexes and even to compare two different clustering methods. Many cluster validity criteria were proposed during the last 10 years. Most of them came from different studies dealing with the number of clusters. To test validity indices, we use the Iris and Wine data sets. The Iris data set is perhaps the best known database in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of Iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other. The Wine data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determines the quantities of 13 constituents found in each of the three types of wines.
Essentially, for a given data set and a value “K”, the Divisive Hierarchical Clustering Algorithm (DHCA) starts with K randomly selected centres and then assigns each data point to its closest centre, creating K partitions. At each stage in the iteration, for each of these K partitions, DHCA recursively selects K random centres and continues the clustering process within each partition to form at most K^N partitions at the N-th stage [1]. In our implementation, the procedure continues until the number of elements in a partition is below K+2, at which time the distance of each data item to other data items in that partition can be updated with a smaller value by a brute-force nearest neighbour search.
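The recursive partitioning described above can be sketched as follows. This is an illustrative outline only, not the implementation of [2]: it assumes points are numeric tuples compared with squared Euclidean distance, it omits the distance-array updates, and the names `dhca_partition` and `rng` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def dhca_partition(points, k, rng):
    """Recursively split points around k random centres until each partition
    holds fewer than k + 2 items (a degenerate split is returned unsplit)."""
    if len(points) < k + 2:
        return [points]
    centres = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda c: squared_distance(p, centres[c]))
        groups[nearest].append(p)
    non_empty = [g for g in groups if g]
    if len(non_empty) == 1:
        # All points fell into one group (e.g. duplicates): stop recursing.
        return [points]
    leaves = []
    for group in non_empty:
        leaves.extend(dhca_partition(group, k, rng))
    return leaves


points = [(float(i), float((i * i) % 7)) for i in range(30)]
leaves = dhca_partition(points, 3, random.Random(0))
print(len(leaves), max(len(leaf) for leaf in leaves))
```

Passing a seeded `random.Random` instance keeps the random choice of centres reproducible between runs.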
The Divisive Hierarchical Clustering Algorithm [2] partitions the data set into smaller partitions so that the number of data items in each partition is less than the maximum partition size, i.e., “K+2”. In the first iteration the entire data set is stored as the initial partition. After that, at each stage all the partitions are stored irrespective of whether they satisfy the “K+2” condition.
For a set of s-dimensional data, i.e., where each data item is a point in s-dimensional space, there exists a distance between every pair of data items. In this sequential initialization, all the pairwise distances are calculated after reading the data items from the database. The threshold value calculation uses a distance array and an index array. The distance array [1] records the distance of each data point to some other data point in the sequentially stored data set. The index array [1] records the index of the data item at the other end of the distance in the distance array.
[Equation: the similarity weight Sim(a, b), a sum over i = 1, ..., n involving a_i, b_i, t and a squared term; the operators are lost in the source.]
The number of partition centres K used at each stage of the DHCA [2] affects the performance of the algorithm, with and without a given number of clusters. K varies from a minimum to a maximum value. For large K, the algorithm performs a larger number of distance computations in the DHCA [2] process, and the running time increases with K. For small K, the changes in running time are small, because a small increase in K decreases the total number of nodes in the DHCA, which offsets the extra distance computations. As K grows larger, more distance computations to the partition centres are needed, and this increase eventually dominates the running time.
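The distance and index arrays described above can be illustrated with a brute-force sketch. It assumes Euclidean distance and keeps, for each data item, the single nearest other item; the function name is hypothetical, and the quadratic loop is meant only for small partitions.

```python
import math


def nearest_neighbour_arrays(points):
    """Return (distance, index): for each point, the distance to its nearest
    other point and the position of that point in the data set."""
    n = len(points)
    distance = [math.inf] * n
    index = [-1] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])  # Euclidean distance
            if d < distance[i]:
                distance[i] = d
                index[i] = j
    return distance, index


distance, index = nearest_neighbour_arrays([(0.0, 0.0), (3.0, 4.0), (3.0, 5.0)])
print(distance, index)
```

`math.dist` requires Python 3.8 or later.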
Fig 3: Overview of Similarity and Filtered Algorithm Framework
Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized and the similarity of objects from different clusters is minimized. In this paper we present a clustering algorithm based on the minimum spanning tree (MST) [14] paradigm that works well for data with mixed numeric and categorical features. We propose a modified description of the cluster center to overcome the numeric-data-only limitation of the MST [14] algorithm and provide a better characterization of clusters. The performance of this algorithm has been studied on benchmark data sets.
Real-life databases contain mixed numeric and categorical data. In this work, we have proposed a modified MST algorithm for finding a globally optimal partition of a given mixed numeric and categorical data set into a specified number of clusters. This incorporates the MST into the k-means algorithm with an enhanced cost function to handle the categorical data, and our experimental results show that it is effective in recovering the underlying cluster structures from categorical data if such structures exist. A modified MST representation for the cluster centre is used. This representation can capture cluster characteristics very effectively because it contains the distribution of all categorical values in the cluster.
4.2 FILTER ALGORITHM
To cluster mixed numerical and categorical datasets, we propose an algorithm called the filter method. First, the original dataset is divided into two sub-datasets, i.e., the pure categorical dataset and the pure numerical dataset. Next, we apply clustering algorithms to the sub-datasets according to their type to get the corresponding clusters. Last, the clustering results of the numerical and categorical datasets are combined as a categorical dataset, on which a categorical data clustering algorithm is applied to get the final clusters.
We now discuss the last step of the above process. Given the clustering results on the categorical and numerical datasets, we obtain the final clustering results by applying the filter method.
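The combination step can be pictured with a small sketch: each object's two cluster labels become the two attributes of a new categorical record. The label prefixes and the function name are illustrative assumptions, and the clustering of the combined records is left to whichever categorical algorithm is chosen.

```python
def combine_cluster_labels(numeric_labels, categorical_labels):
    """Build one two-attribute categorical record per object from the cluster
    label it received on the numerical part and on the categorical part."""
    if len(numeric_labels) != len(categorical_labels):
        raise ValueError("both label lists must describe the same objects")
    return [(f"num_{a}", f"cat_{b}")
            for a, b in zip(numeric_labels, categorical_labels)]


combined = combine_cluster_labels([0, 0, 1, 1], [2, 2, 2, 0])
print(combined)
```

The prefixes keep the two label sets distinct, so that numerical cluster 0 is never confused with categorical cluster 0 when the combined records are treated as categorical data.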
Collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data, including sensing and monitoring data (such as in mineral exploration, or environmental sensing over large areas or with multiple sensors), financial data (such as financial service institutions that integrate many financial sources), and electronic commerce and Web 2.0 applications where the focus is on user data. The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.
Step (1): Start with a tree built by the sequential initialization.
Step (2): Calculate the mean and standard deviation of the edge weights in the distance array.
Step (3): Use their sum as the threshold.
Step (4): Perform multiple runs of the Similarity Algorithm.
Step (5): Identify the longest edge using Similarity.
Step (6): Remove this longest edge.
Step (7): Check the terminating condition and continue.
Step (8): Put that number of clusters into the Filter Method.
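Steps (2), (3), (5) and (6) can be sketched as follows. This is a simplification under stated assumptions: the tree is given as a list of `(u, v, weight)` edges, the population standard deviation is used, and every edge longer than the threshold is removed in one pass rather than one longest edge per iteration. The function name is hypothetical.

```python
import statistics


def split_tree(num_nodes, edges):
    """Cut every tree edge whose weight exceeds mean + standard deviation and
    return the threshold together with a cluster label for each node."""
    weights = [w for _, _, w in edges]
    threshold = statistics.mean(weights) + statistics.pstdev(weights)
    kept = [(u, v) for u, v, w in edges if w <= threshold]

    parent = list(range(num_nodes))  # union-find over the remaining edges

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in kept:
        parent[find(u)] = find(v)

    roots = {}
    labels = [roots.setdefault(find(node), len(roots)) for node in range(num_nodes)]
    return threshold, labels


edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 10.0), (3, 4, 1.0)]
threshold, labels = split_tree(5, edges)
print(threshold, labels)
```

In this toy tree only the edge of weight 10.0 exceeds the threshold, so removing it leaves two connected components, which become the two clusters.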
To summarize, the parameters the algorithm needs from the user include the data set, the loosely estimated minimum and maximum numbers of data points in each cluster, the input “K” to the filter method and the similarity weight method, and the number of nearest neighbours to keep for each data item in the auxiliary arrays. The outputs are the final distance and index arrays, and a labelling array that records the cluster label each data item belongs to.
Filter Algorithm
Input: Categorical data set DS
Output: Set of clusters
D = {X_1, X_2, …, X_n}
C = number of clusters
m = number of attributes
n = number of records
$$F(D, C) = \sum_{i=1}^{n} \sum_{t=1}^{m} w_{i,t}\, d(X_i, C_t)$$
1. Read data set D.
2. Compute the dissimilarity d(x, y) [the formula is garbled in the source; only the symbols x, y and two squared terms survive].
3. Compute F:
$$F(D, C) = \sum_{i=1}^{n} \sum_{t=1}^{m} w_{i,t}\, d(X_i, C_t)$$
4. Cluster the data according to F.
5. Submit F to the post-clustering technique, i.e. the Filtered Algorithm.
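As an illustration of the objective F, the sketch below evaluates the double sum for a small categorical data set. Two points are assumptions, because the source does not define them legibly: the dissimilarity d is taken to be the number of mismatching attributes, and the weights w are taken to be hard memberships (1 for the closest centre, 0 otherwise). All names are hypothetical.

```python
def matching_dissimilarity(x, c):
    """Assumed d(X, C): number of attributes on which x and centre c differ."""
    return sum(1 for a, b in zip(x, c) if a != b)


def hard_weights(data, centres):
    """Assumed w[i][t]: 1 if centre t is the closest centre to object i, else 0."""
    weights = []
    for x in data:
        best = min(range(len(centres)),
                   key=lambda t: matching_dissimilarity(x, centres[t]))
        weights.append([1 if t == best else 0 for t in range(len(centres))])
    return weights


def objective_f(data, centres, weights):
    """F(D, C): weighted sum of dissimilarities over all objects and centres."""
    return sum(weights[i][t] * matching_dissimilarity(x, c)
               for i, x in enumerate(data)
               for t, c in enumerate(centres))


data = [("c1", "k1"), ("c1", "k2"), ("c2", "k2")]
centres = [("c1", "k1"), ("c2", "k2")]
weights = hard_weights(data, centres)
print(objective_f(data, centres, weights))
```

With hard memberships, F reduces to the total dissimilarity between each object and the centre it is assigned to, so a lower value indicates a tighter clustering.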
Fig 4: Applying clustering technique Similarity Weight and Filter Method
Fig 5: Results of clustering showing groups divided into clusters
Fig 6: Initialization and Input
4.3 Advantages of Proposed System
Efficient use of the cut and cycle properties by our Fast Filtered-Inspired clustering algorithm.
The shape of a cluster has very little impact on the performance of this Filtered clustering algorithm.
Efficient for dimensionality greater than 5, with reduced time complexity.
Nearest neighbor search is used to construct an efficient Filtered.
Works efficiently even if the boundaries of clusters are irregular.
Fig 7: Final filtered and EMST
4. CLUSTERING RESULTS
We used the filter and k-prototypes algorithms to cluster the credit approval dataset and the cleve dataset into different numbers of clusters, varying from 1 to 10. For each fixed number of clusters, the clustering errors of the different algorithms were compared. For both datasets, as in earlier work with the k-prototypes algorithm, all numeric attributes are rescaled to the range [0, 1].
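The rescaling of numeric attributes can be done with a standard min-max transformation, sketched below. The function name is hypothetical, and mapping a constant column to 0.0 is a choice made here to avoid division by zero.

```python
def rescale_columns(rows):
    """Min-max rescale every numeric column of a list of rows to [0, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        # A constant column carries no information; map it to 0.0.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in column])
    return [tuple(values) for values in zip(*scaled_columns)]


print(rescale_columns([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]))
```

Rescaling keeps an attribute with a large numeric range from dominating the Euclidean part of the distance.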
Fig 8: Clustering error vs. number of clusters on the credit approval dataset
Figure 8 shows the results of the different clustering algorithms on the credit approval dataset. From Figure 8, we can summarize the relative performance of the algorithms as follows.
Table 4.1: Relative performance of different clustering algorithms (Credit approval dataset)
Method                 Average clustering error
k-prototype            0.211
Similarity & Filter    0.181
That is, compared with the k-prototypes algorithm, our algorithm performed best in all cases; it never performed worst. Furthermore, the average clustering error of our algorithm (0.181) is smaller than that of the k-prototypes algorithm (0.211).
The above experimental results demonstrate the effectiveness of the filter method for clustering datasets with mixed attributes. In addition, it outperforms the k-prototypes algorithm with respect to clustering accuracy.
5. CONCLUSION
The method of comprehensive assessment using similarity weight in the attribute synthetic evaluation system appears to be objective and rational. It not only embodies the weights of the variables involved, but also exploits the information presented by the sample. Our system is efficient for any number of dimensions and reduces time complexity. Irregular cluster boundaries can also be handled efficiently using our filtered algorithm with the divide-and-conquer technique.
In future work, we plan to combine different types of datasets (labeled, unlabeled, nominal and ordinal) with different clustering algorithms.
6. REFERENCES
[1] A. Ahmad and L. Dey (2007), ‘A k-mean clustering algorithm for mixed numeric and categorical data’, Data and Knowledge Engineering, Elsevier, vol. 63, pp. 503-527.
[2] Xiaochun Wang, Xiali Wang and D. Mitchell Wilkes, “A Divide-and-Conquer Approach for Minimum Spanning Tree-Based Clustering”, IEEE Transactions on Knowledge and Data Engineering, vol. 21, July 2009.
[3] S. Deng, Z. He, X. Xu (2005), Clustering mixed numeric and categorical data: A cluster ensemble approach. arXiv preprint cs/0509011.
[4] S. Guha, R. Rastogi, and K. Shim (2000), ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems, vol. 25, no. 5, pp. 345-366.
[5] V.V. Cross and T.A. Sudkamp, Similarity and Compatibility in Fuzzy Set Theory: Assessment and Applications, Physica-Verlag, New York, 2002.
[6] M. Kalina, Derivatives of fuzzy functions and fuzzy derivatives, Tatra
[7] C.T. Zahn, “Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters,” IEEE Trans. Computers, vol. 20, no. 1, pp. 68-86, Jan. 1971.
[8] D.J. States, N.L. Harris, and L. Hunter, “Computationally Efficient Cluster Representation in Molecular Sequence Megaclassification,” ISMB, vol. 1, pp. 387-394, 1993.
[9] P. Raghu Rama Krishnan, “Accessing Databases with JDBC and ODBC”, Pearson Education, Fourth Edition.
[10] Henry F Korth, S. Sudharshan, “Database System Concepts”, McGraw-Hill, International Editions, Fourth Edition, 2002.
[11] I. Katriel, P. Sanders, and J.L. Traff, “A Practical Minimum Spanning Tree Algorithm Using the Cycle Property,”
[12] Sartaj Sahni, Advanced Data Structures and Algorithms, Second Edition.
[13] Grady Booch, James Rumbaugh, Ivar Jacobson. Unified Modeling Language User Guide. Pearson Education, 2006 edition.
[14] I. Katriel, P. Sanders, and J.L. Traff, “A Practical Minimum Spanning Tree Algorithm Using the Cycle Property,” Proc. 11th European Symp. Algorithms (ESA ’03), vol. 2832, pp. 679-690, 2003.
[15] C.T. Zahn, “Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters,” IEEE Trans. Computers, vol. 20, no. 1, pp. 68-86, Jan. 1971.
[16] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. Wiley-Interscience, 1973.
[17] N. Chowdhury and C.A. Murthy, “Minimum Spanning Tree-Based Clustering Technique: Relationship with Bayes Classifier,” Pattern Recognition, vol. 30, no. 11, pp. 1919-1929, 1997.
[18] A. Vathy-Fogarassy, A. Kiss, and J. Abonyi, “Hybrid Minimal Spanning Tree and Mixture of Gaussians Based Clustering Algorithm,” Foundations of Information and Knowledge Systems, pp. 313-330, Springer, 2006.
[19] O. Grygorash, Y. Zhou, and Z. Jorgensen, “Minimum Spanning Tree-Based Clustering Algorithms,” Proc. IEEE Int’l Conf. Tools with Artificial Intelligence, pp. 73-81, 2006.
[20] R.C. Gonzalez and P. Wintz, Digital Image Processing, second ed. Addison-Wesley, 1987.
[21] Y. Xu, V. Olman, and E.C. Uberbacher, “A Segmentation Algorithm for Noisy Images: Design and Evaluation,” Pattern Recognition Letters, vol. 19, pp. 1213-1224, 1998.
[22] Y. Xu and E.C. Uberbacher, “2D Image Segmentation Using Minimum Spanning Trees,” Image and Vision Computing, vol. 15, pp. 47-57, 1997.
[23] D.J. States, N.L. Harris, and L. Hunter, “Computationally Efficient Cluster Representation in Molecular Sequence Megaclassification,” ISMB, vol. 1, pp. 387-394, 1993.
[24] Y. Xu, V. Olman, and D. Xu, “Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees,” Bioinformatics, vol. 18, no. 4, pp. 536-545, 2002.
[25] M. Laszlo and S. Mukherjee, “Minimum Spanning Tree Partitioning Algorithm for Microaggregation,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 7, pp. 902-911, July 2005.
[26] M. Forina, M.C.C. Oliveros, C. Casolino, and M. Casale, “Minimum Spanning Tree: Ordering Edges to Identify Clustering Structure,” Analytica Chimica Acta, vol. 515, pp. 43-53, 2004.
[27] Chaturvedi, P. Green and J. Carroll (2001), k-modes clustering. Journal of Classification, vol. 18, pp. 35-55.
[28] Jain, M. Murty and P. Flynn (1999), ‘Data clustering: A review’, ACM Computing Surveys, vol. 31, no. 3, pp. 264-323.
[29] G. Gan, Z. Yang, and J. Wu (2005), A Genetic k-Modes Algorithm for Clustering for Categorical Data, ADMA, LNAI 3584, pp. 195-202.
[30] J. Z. Huang, M. K. Ng, H. Rong, Z. Li (2005), Automated variable weighting in k-mean type clustering, IEEE Transaction on PAMI 27(5).
[31] K. Krishna and M. Murty (1999), ‘Genetic K-Means Algorithm’, IEEE Transactions on Systems, Man, and Cybernetics, vol. 29, no. 3, pp. 433-439.
[32] S. Bandyopadhyay and U. Maulik (2001), ‘Nonparametric genetic clustering: Comparison of validity indices’, IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 31, no. 1, pp. 120-125.
[33] S. Guha, R. Rastogi, and K. Shim (2000), ‘ROCK: A robust clustering algorithm for categorical attributes’, Information Systems, vol. 25, no. 5, pp. 345-366.
[34] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown (2004), ‘Incremental genetic K-means algorithm and its application in gene expression data analysis’, BMC Bioinformatics 5:172.
[35] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown (2004), ‘FGKA: A Fast Genetic K-means Clustering Algorithm’, ACM 1-58113-812-1.
[36] Z. He, X. Xu, and S. Deng (2005), Scalable algorithms for clustering categorical data, Journal of Computer Science and Intelligence Systems 20, 1077-1089.
[37] A. Ahmad, L. Dey, A feature selection technique for classificatory analysis, Pattern Recognition Letters 26 (1) (2005) 43-56.
[38] M. Mahdavi and H. Abolhassani (2009), Harmony K-means algorithm for document clustering, Data Min Knowl Disc 18:370-391.
[39] H. Yan, K. Chen, L. Liu, and Z. Yi (2010), ‘SCALE: a scalable framework for efficiently clustering transactional data’, Data Min Knowl Disc 20:1-27.
[40] Jiawei Han and Micheline Kamber, “Data Ware Housing and Data Mining. Concepts and Techniques”, Third Edition 2007.
[41] N. Chowdhury and C.A. Murthy, “Minimum Spanning Tree-Based Clustering Technique: Relationship with Bayes Classifier,” Pattern Recognition, vol. 30.
AUTHORS BIOGRAPHY
Author 1: Asadi Srinivasulu received the B.Tech in Computer Science Engineering from Sri Venkateswara University, Tirupati, India in 2000 and the M.Tech with Intelligent Systems in IT from the Indian Institute of Information Technology, Allahabad (IIIT) in 2004, and he is pursuing a Ph.D in CSE from J.N.T.U.A, Anantapur, India. He has 10 years of teaching and industrial experience. He served as the Head, Dept of Information Technology, S V College of Engineering, Karakambadi, Tirupati, India during 2007-2009. His areas of interest include Data Mining and Data Warehousing, Intelligent Systems, Image Processing, Pattern Recognition, Machine Vision Processing and Cloud Computing. He is a member of IAENG and IACSIT. He has published more than 25 papers in international journals and conferences. Some of his publications appear in the IJCSET, IJCA and IJCSIT digital libraries. He has visited Malaysia and Singapore. srinu_asadi@yahoo.com.
Author 2: Dr Ch D V Subba Rao received the B.Tech (CSE) from S V University College of Engineering, Tirupati, India in 1991 and the M.E. (CSE) from M K University, Madurai in 1998, and he was the first Ph.D awardee in CSE from S V University, Tirupati in 2008. He has 19 years of teaching experience. He served as the Head, Dept of Computer Science and Engineering, S V University College of Engineering, Tirupati, India during 2008-11. His areas of interest include Distributed Systems, Advanced Operating Systems and Advanced Computing. He is a member of IETE, IAENG, CSI and ISTE. He has chaired and served as a reviewer for IAENG and IASTED international conferences. He has published more than 25 papers in international journals and conferences. Some of his publications appear in the IEEE and ACM digital libraries. He has visited Austria, the Netherlands, Belgium, Hong Kong, Thailand and Germany. subbarao_chdv1@hotmail.com
Author 3: C. Kishore is studying II M.Tech in Information Technology with Software Engineering at JNTUA, Anantapur in 2011-12. His research areas are Software Engineering, Data Warehousing and Data Mining, Database Management Systems and Cloud Computing. kishore.btech16@gmail.com.
Author 4: Shreyash Raju is studying for the IV B.Tech degree in Information Technology at JNTUA, Anantapur in 2011-12. His research areas are Image Processing, Data Warehousing and Data Mining, Database Management Systems and Software Engineering. shreyashraju18@gmail.com.