Clustering
John Owen
Sarah Smith
University of Houston Clear Lake
ABSTRACT
This paper discusses types of clustering from a data
mining perspective, and business examples for clustering.
1 INTRODUCTION
While the price of data storage has dropped over the last
several years, businesses have been storing data at an
increasing rate. The question asked by many businesses is
“what do we do with the data we have collected?” The data
may have been “cheap” to store in the beginning, but the
cost of data storage exceeds the purchase price of the
disks. For example, most businesses back up their data, so
the more data that is collected, the higher the cost of
archival. For data retained in a transactional database,
performance becomes an issue as queries and update
transactions take longer to process. As the data collected
by businesses grows, it becomes increasingly important to
make sense of large amounts of data. This data must be
grouped in ways that can be interpreted and put to use for
the business.
An obvious example of clustering is the common activity of
sorting laundry. Dark colors go in one group, light colors
in another. The clothes that are dark in color are alike,
and they are different from the light-colored clothes.
Simply put, clustering is the activity of grouping together
objects that are like one another and not like the objects
in other clusters. In a data warehouse, clustering “can be
viewed as one of finding groupings in a set of events by
extremizing some criterion function.” [Wan, 1988]
Clustering is a technique that has its roots in statistical
analysis. In statistics, one of the first calculations used
when analyzing data is variance, which is the average of
the squared deviations about the arithmetic mean for a set
of numbers. [Black, 2006] Wan identified that one of the
most widely used methods of clustering analysis is to use
the sum of squares method and apply it to the Euclidean
distances from the cluster center. In data mining,
clustering is used to give a user a high-level view of what
is going on in their database. Clustering can also be
performed to make it easier to identify outliers.
[Berson & Smith, p. 409]
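The variance and sum of squares calculations described
above can be sketched in a few lines of Python. This is an
illustrative sketch only: the function names are
hypothetical, and the variance shown is the population
form (dividing by the number of values), which is an
assumption about the definition intended here.

```python
from math import fsum

def centroid(points):
    """Arithmetic mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(fsum(coords) / n for coords in zip(*points))

def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    center = centroid(points)
    return fsum((x - c) ** 2 for p in points for x, c in zip(p, center))

def variance(values):
    """Population variance: the average squared deviation about the mean."""
    # Each value is treated as a one-dimensional point.
    return sum_of_squares([(v,) for v in values]) / len(values)
```

A tight cluster has a small sum of squares and a scattered
one has a large sum, which is why the measure can serve as
the criterion a clustering method tries to minimize.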
While the clustering algorithms themselves can be fairly
complex, the general approach is relatively simple. There
are five basic steps in a clustering task [Jain, 1988]:
1. Pattern representation. The analyst identifies the
number, type, and scale of features available to the
clustering algorithm.
2. Identify the pattern proximity relative to the data
domain. Usually performed using the Euclidean distances.
3. Grouping or clustering of the data.
4. Data abstraction.
5. Assessment of output.
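As an illustration of step 2, pattern proximity can be
expressed as a matrix of pairwise Euclidean distances. The
sketch below is a minimal example, assuming each pattern is
a tuple of numeric features; the name proximity_matrix is
hypothetical and not taken from the cited work.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def proximity_matrix(patterns):
    """Return the matrix of pairwise Euclidean distances between patterns."""
    return [[dist(a, b) for b in patterns] for a in patterns]

# Three patterns with two features each (made-up values).
patterns = [(0, 0), (3, 4), (6, 8)]
matrix = proximity_matrix(patterns)
```

The grouping step (step 3) can then work from this matrix
alone, without referring back to the raw features.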
2 TYPES OF CLUSTERING METHODS
There are many clustering methods available. Four of the
methods described by Han & Kamber (p. 346-348) are
partitioning, hierarchical, density-based, and grid-based.
Partitioning (k-means clustering)
One of the most popular methods of performing clustering
is k-means, or partitioning. Unlike the other methods of
clustering, k-means requires the analyst to know
something about the underlying data. The analyst then
tells the system how many clusters to create when
analyzing the data. Partitioning consists of the
classification of the data into k groups, which meet two
requirements: each group must contain at least one object,
and each object must belong to exactly one group. There
are two steps to creating the clusters:
1. “For each data item, assign it to the closest center,
resolving ties arbitrarily. A proof can be found in that
this phase gives the optimal partition for the given
centers.”
2. “Recalculate all the centers. Each center is moved to
the geometric centroid of the points assigned to it. A
proof can be found in that this phase gives the
optimal center.” [Foreman, 2000]
In simple terms, one decides how many clusters there
should be, then creates the best fit of points to a cluster.
In the figure below the analyst decided to create two
clusters. The algorithm then analyzes the data and creates
a logical center for each cluster. Once a center, or
centroid, has been created, the algorithm identifies the
distance between the centroid and each individual data
point and assigns that point to the cluster with the
nearest centroid.
(Source: k-means clustering,
http://www.togaware.com/datamining/survivor/K_Means.html)
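The two steps quoted above, repeated until the centers stop
moving, can be sketched as follows. This is a minimal
illustration rather than a production implementation. It
assumes the analyst supplies the starting centers, breaks
ties in favor of the lowest-numbered center, and leaves an
empty cluster's center where it was. The function names are
hypothetical.

```python
from math import dist, fsum

def assign(points, centers):
    """Step 1: index of the closest center for each point (ties: lowest index)."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

def recalculate(points, labels, centers):
    """Step 2: move each center to the centroid of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # assumption: empty cluster keeps its center
            continue
        new_centers.append(tuple(fsum(c) / len(members) for c in zip(*members)))
    return new_centers

def k_means(points, centers, max_iter=100):
    """Alternate the two steps until the centers no longer change."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        updated = recalculate(points, assign(points, centers), centers)
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers

# Two clusters, as in the example above (made-up data and starting centers).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, [(0, 0), (10, 10)])
```

The number of clusters is fixed by the number of starting
centers, which is the prior knowledge the analyst must
supply.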
Hierarchical
Unlike k-means clustering above, hierarchical clustering
does not require the analyst to identify the number of
clusters in advance. While k-means clustering is the most
prevalent method of clustering, hierarchical clustering is
designed primarily for creating micro-clusters in large
data sets. Micro-clusters are used when the data is so
similar that the difference between the points is
statistically insignificant. [Yu, 2003]
The hierarchical method is either agglomerative
(bottom-up) or divisive (top-down). The agglomerative
approach works with data individually at first, and then
merges it into clusters until all of the data is in one
cluster. The divisive approach is performed in the
opposite way, beginning with a single cluster and breaking
the data apart until each piece of data is isolated.
(Source: http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html#hierarchy)
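The agglomerative (bottom-up) approach can be sketched as
below. The text above does not say how the distance between
two clusters is measured, so this sketch assumes single
linkage (the distance between the closest pair of points),
which is only one of several common choices. The function
names are hypothetical.

```python
from math import dist

def single_link(a, b):
    """Distance between two clusters: their closest pair of points."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points):
    """Merge the two closest clusters repeatedly; return every level built."""
    clusters = [[p] for p in points]          # start: one cluster per point
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        levels.append([list(c) for c in clusters])
    return levels

# One-dimensional made-up data: two tight pairs and one distant point.
levels = agglomerate([(0,), (1,), (5,), (6,), (20,)])
```

Each entry in levels is one cut of the hierarchy, so the
analyst can choose the number of clusters after the fact.
Read from last to first, the same levels show the order in
which a divisive method would split the data.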
Density-Based
Density-based algorithms define clusters by the density of
the data distribution. “Clusters are formed by connecting
neighboring ’core’ objects and those ’non-core’ objects
either serve as the boundaries of clusters or become
outliers.” [Jaing, 2004] Density-based clustering also
does not require the user to identify the number of
clusters before beginning the data analysis. This
clustering method will continue to grow a given cluster as
long as the number of objects in the neighborhood
(density) exceeds a given threshold. This method is useful
for dealing with outliers.
(Source: http://klimt.iwr.uni-heidelberg.de/mip/research/hader_clust/)
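One way to make the core, boundary, and outlier distinction
concrete is a DBSCAN-style procedure. The sources cited
above do not name a specific algorithm, so the sketch below
is an assumption chosen for illustration. The parameters
eps (neighborhood radius) and min_pts (density threshold)
and the function name are hypothetical.

```python
from math import dist

NOISE = -1  # label for outliers

def density_cluster(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE (DBSCAN-style sketch)."""
    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may be claimed later as a boundary point
            continue
        labels[i] = cluster_id         # i is a core object: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # boundary point: joined but not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:    # j is also core, so keep growing
                queue.extend(nbrs)
        cluster_id += 1
    return labels

# Two dense squares and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = density_cluster(points, eps=1.5, min_pts=3)
```

The number of clusters is never supplied. It falls out of
the density threshold, and points that no dense
neighborhood reaches are reported as outliers.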
Grid-Based
Grid-based clustering is an adaptation of density-based
clustering. In grid-based clustering the data points are
placed in a data grid. Each data grid is of equal size and
can be decomposed into smaller and smaller data grids
depending on the business need and level of data
abstraction required. These grids can be either fixed or
adaptive. Fixed data grids are calculated by partitioning
the data into discrete non-overlapping clusters. Then the
algorithm creates a histogram for all possible data grids.
An adaptive data grid does not require the user to identify
the number of grids, grid size or data density. [Nagesh]
Instead the grids are populated, then histograms are
created based on the data in each individual dimension
instead of all of the data. This results in increased
performance and data accuracy.
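The fixed-grid case can be sketched as a histogram over
equal-sized cells. This is a simplified illustration, not
the adaptive method of [Nagesh]. The cell size and density
threshold are assumed to be supplied by the analyst, and
the function names are hypothetical.

```python
from collections import Counter
from math import floor

def grid_histogram(points, cell_size):
    """Count the points that fall in each equal-sized, non-overlapping cell."""
    return Counter(tuple(floor(x / cell_size) for x in p) for p in points)

def dense_cells(points, cell_size, threshold):
    """Return the cells whose point count meets the density threshold."""
    histogram = grid_histogram(points, cell_size)
    return {cell for cell, count in histogram.items() if count >= threshold}

# Made-up two-dimensional data on a grid of 1.0 x 1.0 cells.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9),
          (3.1, 3.2), (3.9, 3.4), (7.5, 0.5)]
cells = dense_cells(points, cell_size=1.0, threshold=2)
```

Because the work is done on cell counts rather than on
individual points, the cost of later steps depends on the
number of occupied cells and not on the number of records.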
3 REQUIREMENTS FOR CLUSTERING
There are requirements for clustering algorithms in data
mining (Han & Kamber, p. 337). The algorithms must be:
Scalable
Able to deal with different types of attributes
Able to deal with clusters of varying shapes
Insensitive to input parameters
Able to deal with noisy data
Insensitive to the order of input records
Able to handle data that is highly dimensional
Able to work within constraints
Interpretable and usable
4 BUSINESS USES OF CLUSTERING
There are a number of ways in which data can be
clustered and then used to support decision making.
Marketing is a classic example of clustering, where
potential customers are grouped into categories by
common characteristics. Clustering enables companies to
identify outliers, which can be useful to determine which
customers are in jeopardy of discontinuing services, or
even in the detection of credit card fraud. The use of
clustering is in no way limited to business decision
support. A review of ACM journal titles shows
clustering being used in fields such as astrophysics,
biology, chemistry, medicine, psychology, and sociology.
Clustering techniques are being used to answer
questions regarding the very makeup of a human being.
“In recent years, clustering analysis has even become a
valuable and useful tool for gene expression data.”
[Tseng, 2005]
Conclusion
Clustering provides businesses a method of making sense
of their data. While early clustering techniques were labor
intensive and required analysts to pore over mounds of
data, today clustering tools are readily available. Cluster
analysis tools are built into statistical packages such as
SAS, S-Plus, and SPSS [Han & Kamber, p. 336]. If the
future of computing is in artificial intelligence, then that
future relies on clustering. For an artificial
intelligence engine to work properly it must be capable of
unsupervised learning from pattern recognition.
[Figueiredo, 2002]
REFERENCES
[Wan, 1988] Wan, S. J., Wong, S. K. M., and
Prusinkiewicz, P., “An Algorithm for Multidimensional
Data Clustering”, ACM Transactions on Mathematical
Software, Vol. 14, No. 2, June 1988, pp. 153-162.
[Black, 2006] Black, Ken, Business Statistics for
Contemporary Decision Making, John Wiley & Sons,
Hoboken, NJ.
[Figueiredo, 2002] Figueiredo, M. and Jain, A. K.,
“Unsupervised Learning of Finite Mixture Models”,
IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 24, No. 3, March 2002.
[Tseng, 2005] Tseng, Vincent S. and Kao, Ching-Pin,
“Efficiently Mining Gene Expression Data via a
Novel Parameterless Clustering Method”, IEEE/ACM
Transactions on Computational Biology and
Bioinformatics, Vol. 2, No. 4, October-December 2005.
[Jain, 1988] Jain, A. K. and Dubes, R. C.,
“Algorithms for Clustering Data”, Prentice-Hall,
Upper Saddle River, NJ, 1988.
[Foreman, 2000] Forman, G. and Zhang, B., “Distributed
Data Clustering Can Be Efficient and Exact”, ACM
SIGKDD Explorations, Volume 2, Issue 2, December 2000.
[Yu, 2003] Yu, Hwanjo, Yang, Jiong, and Han, Jiawei,
“Classifying Large Data Sets Using SVMs with
Hierarchical Clusters”, SIGKDD ’03, August 24-27, 2003.
[Nagesh] Nagesh, Harsha, Goil, Sanjay, and Choudhary,
Alok, “Adaptive Grids for Clustering Massive Data
Sets”, Society for Industrial and Applied Mathematics.
Online at
http://www.siam.org/meetings/sdm01/pdf/sdm01_07.pdf
[Jaing, 2004] Jiang, Daxin, Tang, Chun, and Zhang, Aidong,
“Cluster Analysis for Gene Expression Data: A
Survey”, IEEE Transactions on Knowledge and Data
Engineering, Vol. 16, No. 11, pp. 1370-1386, Nov. 2004.