University of Houston Clear Lake
This paper discusses types of clustering from a data
mining perspective, and business examples for clustering.
While the price of data storage has dropped over the past
several years, businesses have been storing data at an
increasing rate, and the question asked by many businesses is
“what do we do with the data?” The
data may have been “cheap” to store in the beginning, but
the cost of data storage exceeds the purchase price of the
disks. For example, most businesses back up their data, so
the more data that is collected, the higher the cost of
archival. For data retained in a transactional database,
performance becomes an issue as queries and updates
take longer to process.
As the data collected
by businesses grows, it becomes increasingly important to
make sense of large amounts of data. This data must be
grouped in ways that can be interpreted and put to use.
An obvious example of clustering is the common activity
of sorting laundry. Dark colors go in one group, light
colors in another. The clothes that are dark in color are
alike, and they are different from the light-colored clothes.
Simply put, clustering is the activity of grouping
objects that are like one another, and not like the objects
in other groups. In a data warehouse, clustering “can be
viewed as one of finding groupings in a set of events by
extremizing some criterion function.”
Clustering is a
technique that has its roots in statistical analysis.
In statistics, one of the first calculations used
when analyzing data is variance, which is the average of
the squared deviations about the arithmetic mean for a set
of numbers. It has been noted that one of the
most widely used methods of clustering analysis is to use
the sum of squares method and apply it to the Euclidean
distances from the cluster center.
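The two calculations described above can be illustrated with a minimal Python sketch. The function names and the sample values are assumptions made for illustration; they do not come from any cited source.

```python
def variance(values):
    """Population variance: average squared deviation about the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def centroid(points):
    """Coordinate-wise arithmetic mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the cluster center."""
    center = centroid(points)
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for point in points
    )


# Illustrative data only.
print(variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))   # 4.0
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
print(sum_of_squares(cluster))                               # 2.0
```

A smaller sum of squares indicates that the points sit closer to their cluster center, which is why the measure is commonly used as the criterion to minimize.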
In data mining,
clustering is used to give a user a high-level view of what
is going on in their database. Clustering can also be
performed to make it easier to identify outliers
(Smith, p. 409).
While the clustering algorithms themselves can be fairly
complex, the general approach is relatively simple. There
are five basic steps in clustering [Jain, 1988]:
The analyst identifies the number, type, and scale of features available to the clustering algorithm.
Identify the pattern proximity relative to the data domain. This is usually performed using a distance function.
Clustering of the data.
Assessment of output.
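The first two steps listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the min-max rescaling, the Euclidean distance, and the sample customer records are choices made here, not something prescribed by [Jain, 1988].

```python
from math import sqrt


def min_max_scale(rows):
    """Rescale every feature (column) to [0, 1] so no single feature dominates."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid division by zero
    return [
        tuple((value - low) / span for value, low, span in zip(row, lows, spans))
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def proximity_matrix(rows):
    """Pairwise distances: entry [i][j] is the distance between rows i and j."""
    return [[euclidean(a, b) for b in rows] for a in rows]


# Hypothetical customers: (age in years, annual spend in dollars).
customers = [(25, 1000.0), (35, 1000.0), (45, 5000.0)]
matrix = proximity_matrix(min_max_scale(customers))
for row in matrix:
    print([round(d, 3) for d in row])
```

Without the rescaling step, the spend feature (in the thousands) would swamp the age feature (in the tens), which is why the scale of each feature has to be identified before proximity is measured.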
TYPES OF CLUSTERING
There are many clustering methods available. Four of the
major methods identified by Han & Kamber (p. 346) are
partitioning, hierarchical, density-based, and grid-based.
One of the most popular methods of performing clustering
is k-means, or partitioning. Unlike the other methods of
clustering, k-means requires the analyst to know
something about the underlying data. The analyst then
tells the system how many clusters to create
when analyzing the data. Partitioning consists of the
classification of the data into k groups, which meet two
requirements: each group must contain at least one object,
and each object must belong to exactly one group. There
are two steps to creating the clusters:
“For each data item, assign it to the closest center,
resolving ties arbitrarily. A proof can be found in [...] that
this phase gives the optimal partition for the given
centers.”
“Recalculate all the centers. Each center is moved to
the geometric centroid of the points assigned to it. A
proof can be found in [...] that this phase gives the
optimal center.” [Foreman, 2000]
In simple terms, one decides how many clusters there
should be, then creates the best fit of points to a cluster. In
the figure below, the analyst decided to create two
clusters. The algorithm then analyzes the data and creates
a logical center for each cluster. Once a center, or
centroid, has been created, the algorithm identifies the
distance between each centroid and each individual data
point and assigns that point to the cluster with the
nearest centroid.
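The two alternating steps described above can be written as a short Python sketch. It is a minimal illustration, not the implementation from [Foreman, 2000]: the starting centers, the sample points, the tie-breaking rule (lowest index wins), and the handling of an empty cluster are all assumptions made here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Step 1: give each point the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def recalculate(points, labels, centers):
    """Step 2: move each center to the centroid of the points assigned to it."""
    new_centers = []
    for i, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:  # assumption: an empty cluster keeps its old center
            new_centers.append(old_center)
        else:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def k_means(points, centers, max_iterations=100):
    """Alternate the two steps until the assignments stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iterations):
        centers = recalculate(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# The analyst chooses two clusters by supplying two starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

The number of clusters is fixed by the number of starting centers, which is the point at which the analyst's knowledge of the underlying data enters the method.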
Unlike the k-means method above, the analyst does not
have to identify the number of clusters, as the algorithm
determines them. While k-means clustering is the most
prevalent method of clustering, hierarchical clustering is
designed primarily for clusters in large data sets. Such
clusters are used when the data is so similar that the
difference between the points is statistically
insignificant.
The hierarchical method is either agglomerative (bottom-up)
or divisive (top-down). The agglomerative approach
works with data points individually at first, and then merges
them into clusters until all of the data is in one cluster. The
divisive approach works in the opposite way, starting with
all of the data in a single cluster and breaking the data
away until each piece of data is isolated.
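The agglomerative (bottom-up) approach described above can be sketched in Python. The single-linkage rule used here, which measures the distance between two clusters by their closest pair of members, is an assumption chosen for brevity; other linkage rules are equally valid. The sample points are illustrative.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters: distance between their closest members."""
    return min(euclidean(a, b) for a in cluster_a for b in cluster_b)


def agglomerate(points):
    """Start with one cluster per point and merge the closest pair until one remains.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs, key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]])
        )
        distance = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
merges = agglomerate(points)
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, "+", cluster_b, "at distance", round(distance, 3))
```

Reading the list of merges from last to first gives the divisive (top-down) view of the same hierarchy, and cutting the list at any point yields a clustering without the number of clusters having been fixed in advance.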
The density-based algorithms define the data by the
density of the data distribution. “Clusters are formed by
connecting neighboring ’core’ objects and those ’non-core’
objects either serve as the boundaries of clusters or
become outliers.” [Jiang, 2004] Density-based clustering
also does not require the user to identify the number of
clusters before beginning the data analysis. The
clustering method will continue to grow a given cluster as
long as the number of objects (density) exceeds a given
threshold. This method is useful for dealing with outliers.
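The growth of a cluster from its core objects can be sketched in Python. The sketch follows the general style of the well-known DBSCAN algorithm, which the text above does not name; the parameter names `eps` and `min_points`, the `NOISE` label, and the sample points are assumptions made for illustration.

```python
from math import sqrt

NOISE = -1  # label given to outliers


def neighbors(points, index, eps):
    """Indices of all points within eps of points[index], including itself."""
    p = points[index]
    return [
        j for j, q in enumerate(points)
        if sqrt(sum((x - y) ** 2 for x, y in zip(p, q))) <= eps
    ]


def density_cluster(points, eps, min_points):
    """Grow clusters outward from core objects; leftover points are outliers."""
    labels = [None] * len(points)  # None means not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_points:  # not a core object: provisionally an outlier
            labels[i] = NOISE
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:  # non-core object on the boundary of this cluster
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nearby = neighbors(points, j, eps)
            if len(nearby) >= min_points:  # j is also a core object: keep growing
                queue.extend(nearby)
    return labels


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (50.0, 50.0),
]
labels = density_cluster(points, eps=1.5, min_points=3)
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two clusters are discovered without the number of clusters being supplied, and the isolated point at (50, 50) is reported as an outlier rather than being forced into a cluster.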
Grid-based clustering is an adaptation of density-based
clustering. In grid-based clustering, the data points are
placed in a data grid. Each data grid is of equal size and
can be decomposed into smaller and smaller data grids
depending on the business need and the level of data
detail. These grids can be either fixed or
adaptive. Fixed data grids are calculated by partitioning the
data into discrete, non-overlapping clusters. Then the
algorithm creates a histogram for all possible data grids.
An adaptive data grid does not require the user to identify
the number of grids, grid size, or data density.
Instead, the grids are populated, then histograms are
created based on the data in each individual dimension
instead of all of the data. This results in increased
performance and data accuracy.
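The fixed-grid case described above can be sketched in Python. This is a minimal illustration only: the cell size, the density threshold, and the sample points are assumptions, and the adaptive variant from the literature is not attempted here.

```python
from collections import Counter


def grid_histogram(points, cell_size):
    """Count the points falling into each equal-sized, non-overlapping grid cell."""
    return Counter(
        tuple(int(coord // cell_size) for coord in point) for point in points
    )


def dense_cells(histogram, threshold):
    """Grid cells whose point count meets the density threshold."""
    return sorted(cell for cell, count in histogram.items() if count >= threshold)


points = [(0.5, 0.5), (1.2, 0.3), (1.9, 1.9), (8.1, 8.4), (9.0, 9.9), (5.0, 0.1)]
histogram = grid_histogram(points, cell_size=2.0)
print(dense_cells(histogram, threshold=2))   # [(0, 0), (4, 4)]

# Decomposing into smaller cells gives a finer-grained view of the same data.
finer = grid_histogram(points, cell_size=1.0)
print(dense_cells(finer, threshold=2))
```

Because each point is visited once and only the cell counts are kept, the cost of the histogram step depends on the number of points and occupied cells rather than on pairwise distances.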
REQUIREMENTS FOR CLUSTERING
There are requirements for clustering algorithms in data
mining (Han & Kamber, p. 337). The algorithms must be:
Able to deal with different types of attributes
Able to deal with clusters of varying shapes
Insensitive to input parameters
Able to deal with noisy data
Insensitive to the order of input records
Able to handle data that is highly dimensional
Able to work within constraints
Interpretable and usable
BUSINESS USES OF CLUSTERING
There are a number of ways in which data can be
clustered and then used to support decision making.
Marketing is a classic example of clustering, where
potential customers are grouped into categories by
common characteristics. Clustering enables companies to
identify outliers, which can be useful to determine which
customers are in jeopardy of discontinuing services, or
even in the detection of credit card fraud. The use of
clustering is in no way limited to business support
decisions. A review of ACM journal titles shows
clustering being used in fields such as astronomy,
biology, chemistry, medicine, psychology, and sociology.
Clustering techniques are being used to answer
questions regarding the very makeup of a human being.
“In recent years, clustering analysis has even become a
valuable and useful tool for
gene expression data.”
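The outlier-detection use mentioned earlier in this section, such as spotting possible credit card fraud, can be sketched in Python. The rule used here, flagging any point whose distance from the cluster centroid exceeds a multiple of the median distance, is an assumption chosen for simplicity, as are the `factor` parameter and the sample transactions.

```python
from math import sqrt
from statistics import median


def centroid(points):
    """Coordinate-wise arithmetic mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def flag_outliers(points, factor=3.0):
    """Return points lying more than factor x the median distance from the centroid."""
    center = centroid(points)
    distances = [
        sqrt(sum((x - c) ** 2 for x, c in zip(point, center))) for point in points
    ]
    cutoff = factor * median(distances)
    return [point for point, d in zip(points, distances) if d > cutoff]


# Hypothetical card transactions: (amount in dollars, hour of day).
transactions = [(20.0, 12.0), (25.0, 13.0), (22.0, 11.0), (24.0, 14.0), (950.0, 3.0)]
print(flag_outliers(transactions))   # [(950.0, 3.0)]
```

In practice the features would be rescaled first and the cutoff tuned to the business's tolerance for false alarms; the sketch only shows why an object far from every cluster stands out.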
Clustering provides businesses a method of making sense
of their data.
While early clustering techniques were labor-intensive
and required analysts to pore over mounds of
data, today clustering tools are readily available. Clustering
analysis tools are built into statistical packages
such as S-Plus and SPSS [Han & Kamber, p. 336].
If the future of computing is in artificial
intelligence, then that
future of computing relies on clustering.
For an artificial
intelligence engine to work properly, it must be capable of
unsupervised learning from pattern recognition.
S. J. Wan, S. K. M. Wong, and P. Prusinkiewicz,
An Algorithm for Multidimensional Data Clustering, ACM
Transactions on Mathematical
Software, Vol. 14, No. 2, June 1988, Pages 153
[Black, 2006] Black, Ken, Business Statistics for
Contemporary Decision Making, John Wiley & Sons,
[Figueiredo, 2002] M. Figueiredo, A.K. Jain,
"Unsupervised Learning of Finite Mixture Models",
IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 24, No. 3, March 2002.
[Tseng, 2005] Tseng, Vincent S. and Kao, Ching-Pin,
Efficiently Mining Gene Expression Data via a
Novel Parameterless Clustering Method, IEEE/ACM
Transactions on Computational
Biology and Bioinformatics, Vol. 2, No.
[Jain, 1988] Jain, A. K. and Dubes, R. C. 1988.
“Algorithms for Clustering Data”. Prentice-Hall,
Upper Saddle River, NJ.
[Foreman, 2000] Forman, G. and Zhang, B., “Distributed
Data Clustering Can Be Efficient and Exact”, ACM
SIGKDD Explorations, December 2000, Volume 2,
[Yu, 2003] Yu, Hwanjo, Yang, Jiong, and Han, Jiawei,
“Classifying Large Data Sets Using SVMs with
Hierarchical Clusters”, SIGKDD ’03, August 24
Nagesh, Harsha, Goil, Sanjay, and Choudhary,
Alok. "Adaptive Grids for Clustering Massive Data
Sets" Society for Industrial and Applied Mathematics,
On line at
[Jiang, 2004] Daxin Jiang, Chun Tang, Aidong Zhang,
"Cluster Analysis for Gene Expression Data: A
Survey," IEEE Transactions on Knowledge and Data
Engineering, vol. 16, no. 11, pp. 1370