Clustering Analysis and Algorithms
Keegan Myers
Department of Computer Science
University of Wisconsin Platteville
Platteville, WI 53818
myerske@uwplatt.edu
March 27, 2011
Abstract
Most fields, from botany to law enforcement, are plagued with an abundance of raw data. But data has little meaning without a method of interpreting it. This is where clustering becomes an invaluable asset. By utilizing a range of clustering methods, professionals in many fields can more accurately interpret data. The most significant of these methods will be discussed and evaluated. Along with explanations of the main clustering methods, some of the major issues that can impair accurate interpretation will be considered. Solutions to missing data, masking variables, and comparing variables measured in different units will be provided. Additionally, several applications of clustering will be highlighted. While this will not serve as an exhaustive reference, it is written in a stepwise manner, giving the reader a foundational understanding of clustering.
Introduction
When posed with a large variety of diverse information, one of the primary methods a human mind uses to make sense of the chaos is to classify the information being processed. By classifying information, a person can more effectively utilize the information presented to them. This is a common event that occurs on a daily basis. You may walk into a room and identify an object as a chair, having never seen that particular chair before. You can then interact with it appropriately. Similarly, the process known as clustering allows systems to evaluate large, diverse datasets and organize the data into groups called clusters so that they can be more easily understood. This practice began with the least computationally intensive methods in order to accommodate the limited hardware capabilities of the time. Many of these algorithms were then revisited as systems became more robust, leading to the clustering algorithms currently in use.
Overview of the clustering process
While clustering algorithms can be applied to many fields and many types of data, the basic steps remain the same.
Select the data to cluster
Select the variables to use
Identify missing data
Variable standardization
Proximity measurements
Number of clusters
Clustering method
Validation
There are several methods to employ for nearly all of the aforementioned steps. The key factors in deciding which methods are the most effective are the size of the dataset, its complexity, and the most likely clusters to be identified.
Variable Selection
Variables should be selected based on the likelihood that they will define a cluster. Those variables that are not likely to define a cluster are considered masking variables and should be either removed or ignored. Masking variables can pose a large problem in properly defining clusters [1]. There are two solutions to this issue. One is to use weighted variables. The numeric weight associated with each variable is based on the importance the user places on the variable within the context of the cluster definition. The weights can be defined either directly by the user or indirectly. With the indirect method, variables are compared and weighted based on their dissimilarity. The most similar variables will be considered the cluster-defining variables.
The other option to account for masking variables is to utilize model-based variable selection. This method is most effective when the number of variables is much greater than the number of entries in the dataset. In this method, a secondary dataset is created that contains items with a far smaller number of variables. All variables that do not vary by item, or vary minimally, are removed. The model-based approach can decrease computational intensity. However, it may also fail to recognize statistically significant clusters.
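The removal of variables that vary little or not at all can be sketched as a simple variance filter. This is only an illustration under stated assumptions: the function name `drop_low_variance`, the variable names, and the threshold value are hypothetical, and population variance is used as the measure of how much a variable varies.

```python
from statistics import pvariance


def drop_low_variance(rows, names, threshold=0.0):
    """Keep only the variables (columns) whose variance exceeds the threshold.

    rows is a list of items, each a list of numeric values; names labels the columns.
    The threshold is an illustrative assumption, not a recommended value.
    """
    columns = list(zip(*rows))  # transpose: one tuple per variable
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[i] for i in keep]
    reduced_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, reduced_rows


# Hypothetical dataset: the second variable never varies, the third barely varies.
rows = [
    [1.0, 5.0, 10.0],
    [2.0, 5.0, 10.1],
    [9.0, 5.0, 9.9],
]
names = ["height", "constant", "nearly_constant"]
kept_names, reduced_rows = drop_low_variance(rows, names, threshold=0.05)
```

Only the first variable survives here. A threshold that is too aggressive could also discard a variable that defines a small but real cluster, which is the risk noted above.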
Missing values
Once the cluster-defining variables have been identified, the next issue that may arise in creating meaningful clusters is items that are missing the predefined variables. Missing values can have a significant effect on the conclusions that can be drawn from clustering. Missing data is perceived by the clustering algorithm as a nonresponse [3]. Depending on the weight of the variable in question, the nonresponse will proportionally affect the result. To alleviate this issue, nonresponses can be avoided at the time the data is collected. However, this may often prove impossible. In such cases, there are five options available, listed in order of viability: imputation, partial imputation, partial deletion, full analysis, and interpolation.
Imputation, or unit imputation, can be conducted using either the hot-deck or the cold-deck method. In the hot-deck method, the missing value is replaced with a value drawn at random from another item in the dataset. In the cold-deck method, the missing value is replaced by a value from a different but similar dataset. Both of these methods are more traditional and are often replaced with newer, less standard derivatives such as the hot-deck closest neighbor imputation method. Regardless of the implementation, multiple imputations should be run for the sake of validity [3]. Many scholars recommend 20 to 100 imputations per missing variable.
Partial imputation works similarly, except that imputation is not conducted on every missing value but only on key values identified by a pattern. Partial deletion can be conducted using listwise deletion. Listwise deletion disregards all items that are missing the predefined variables. This can potentially cause invalid results if a significant number of the entries are missing values. Full analysis utilizes the entire dataset to evaluate the probability of a missing variable. This method is conducted for every missing value, resulting in a potentially slow and inefficient approach depending on the size of the dataset and the number of missing values. The final method available to account for missing values is interpolation, which uses the values surrounding the missing value to calculate it. This method may also be somewhat slow on large datasets.
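Two of these options, deleting every item that is missing a value and hot-deck imputation, can be sketched in a few lines. This is a minimal illustration, assuming that missing values are stored as `None` and that a donor value is drawn uniformly at random from the same variable of the other items; the function names are hypothetical.

```python
import random


def delete_incomplete_items(rows):
    """Partial deletion: keep only the items that have no missing value."""
    return [row for row in rows if None not in row]


def hot_deck_impute(rows, rng):
    """Hot-deck imputation: fill each missing value with a value drawn at
    random from the same variable of another item in the dataset.

    Assumes every variable has at least one observed value; otherwise
    rng.choice raises IndexError.
    """
    completed = []
    for row in rows:
        new_row = list(row)
        for col, value in enumerate(row):
            if value is None:
                donors = [r[col] for r in rows if r[col] is not None]
                new_row[col] = rng.choice(donors)
        completed.append(new_row)
    return completed


# Hypothetical dataset with two missing values.
rows = [[1.0, 2.0], [None, 3.0], [4.0, None], [5.0, 6.0]]
complete_only = delete_incomplete_items(rows)
imputed = hot_deck_impute(rows, random.Random(0))
```

Running `hot_deck_impute` repeatedly with different random generators would produce the multiple imputations recommended above.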
Variable Standardization
In some cases the cluster-defining variables may not be measured in the same units or may be of different types. For example, in a dataset utilizing height as a variable, the height may be measured in inches or feet. In such cases the variables can be standardized. The formula to calculate the standardization of an item, or z-score, is $Z = \frac{X - U}{\sigma}$, in which X represents the value, U represents the mean of the population, and σ represents the standard deviation of the population. (Note that U and σ are for the entire population, not a sample of the population.) A more universally applicable approach is to utilize a clustering method that is invariant under scaling [2], that is, a method that yields grouping solutions unaffected by changes in the variable's unit of measurement.
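The z-score calculation can be sketched as follows. The example assumes the listed heights are the entire population, so the population standard deviation is used; the function name `z_scores` and the sample heights are illustrative.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values as (X - U) / sigma using the population mean and
    population standard deviation."""
    u = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a variable that never varies cannot be standardized")
    return [(x - u) / sigma for x in values]


# The same hypothetical heights expressed in inches and in feet.
heights_in = [60.0, 66.0, 72.0]
heights_ft = [h / 12.0 for h in heights_in]
z_in = z_scores(heights_in)
z_ft = z_scores(heights_ft)
```

Both lists standardize to the same z-scores (up to floating-point rounding), which is why the unit of measurement no longer matters once the variable has been standardized.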
Proximity Measurements
Distance measurements can be computed in a number of ways, but they are all used to evaluate the amount that a value varies from other values currently in a cluster. The methods to calculate proximity include Euclidean distance, Manhattan distance, Chebyshev distance, and Mahalanobis distance. Euclidean, or ordinary, distance starts from the difference between two items on each variable. Each difference is squared in order to weigh values further apart more heavily, and the square root of the sum is taken, giving the formula
$$d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}$$
The Manhattan distance, or taxicab distance, is the sum of the absolute differences between the values:
$$d(p,q) = \sum_{i=1}^{n} |p_i - q_i|$$
Incidentally, this distance measure was devised by the 19th-century mathematician Hermann Minkowski, and its colloquial name was given because it measures the shortest path a car could take between two intersections on a grid of streets [4].
Colloquial definitions aside, the next common measurement of distance is the Chebyshev distance. In this measure, the distance between any two vectors, or in the case of clustering analysis any two items, is the largest difference between them along any single variable, expressed in the formula
$$d(p,q) = \max_i |p_i - q_i|$$
The final common measurement of distance is known as the Mahalanobis distance [4]. This measurement, created in 1936, is scale-invariant, making it a preferred method as it does not require variable standardization. It is a measurement of similarity between unknown sample sets. Unlike Euclidean distance, it also takes into account correlations within the dataset. It can also be considered the dissimilarity between two random vectors x and y and is expressed as
$$d(x,y) = \sqrt{(x - y)^T S^{-1} (x - y)}$$
where S is the covariance matrix of the dataset.
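
The first three distance measures can be sketched directly, treating each item as a tuple of numeric values with one entry per variable. The function names are illustrative. Mahalanobis distance is left out of this sketch because it additionally requires the inverse of the dataset's covariance matrix.

```python
def euclidean(p, q):
    """Square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute difference along any single variable."""
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (0.0, 0.0), (3.0, 4.0)
distances = (euclidean(p, q), manhattan(p, q), chebyshev(p, q))
```

For these two points the three measures give 5.0, 7.0, and 4.0 respectively, which shows how much the choice of measure alone can change what counts as "close".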
Number of Clusters
The number of clusters initially chosen plays a major role in the k-means clustering method and the fuzzy c-means method. However, the number is largely dependent upon the user's desired output. As such, there are few standardized methods to calculate how many clusters should be used. A larger K, or cluster number, will usually result in denser or more interrelated clusters [6], and it will also yield less error. One could in fact remove all error by setting K to the number of items in the dataset, effectively making each item in the dataset a cluster.
One method of analyzing the optimal number with respect to k-means is the elbow method. This method requires that multiple k-means runs be conducted with different values of K. For each run, the percentage of variance explained by the clusters is calculated. If these values are graphed against K, the diminishing returns can be seen. The optimal K is then selected to be the point before returns begin to diminish. This is illustrated by the following figure.
Figure 1: Elbow Method diagram
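The elbow method can be sketched on a small one-dimensional dataset. This is only an illustration under several assumptions: the helper `k_means_1d` is a deliberately simple k-means with deterministic starting centroids, "percent variance" is taken to mean one minus the within-cluster sum of squares divided by the total sum of squares, and the data values are made up.

```python
def k_means_1d(values, k, iterations=25):
    """A very small k-means for one-dimensional data.

    Starting centroids are spread evenly over the sorted values, which is an
    assumption made here so that the example is deterministic.
    """
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


def explained_variance(values, k):
    """Fraction of the total variance explained by a k-cluster solution.

    Assumes the values are not all identical (the total would be zero).
    """
    overall_mean = sum(values) / len(values)
    total = sum((v - overall_mean) ** 2 for v in values)
    centroids, clusters = k_means_1d(values, k)
    within = sum((v - c) ** 2
                 for c, members in zip(centroids, clusters) for v in members)
    return 1.0 - within / total


# Hypothetical data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
explained = {k: explained_variance(values, k) for k in range(1, 6)}
```

For this data the explained variance climbs steeply up to K = 3 and has almost nothing left to gain afterwards, so the elbow sits at K = 3.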
For a less resource-intensive alternative to running k-means multiple times, a heuristic algorithm can be used. The goal of this algorithm is to produce a high cluster quality with a low K, high intra-cluster similarity, and low inter-cluster similarity. This heuristic algorithm is expressed in terms of Q, which represents the cluster quality. If the quality is 0 or lower, then two items of the same cluster are, on average, more dissimilar than a pair of items from two different clusters. If the quality rating is 1, it means that two items from different clusters are entirely dissimilar, and items from the same cluster are more similar to each other. This will also most likely result in a denser k-means clustering.
Clustering Methods
K-Means
K-means, or non-hierarchical clustering, is one of the oldest and simplest clustering methods. It was originally proposed by James MacQueen in 1967. In essence, it identifies a previously set number of centroids based on their dissimilarity. It then iterates through the dataset and compares each item to the centroids using heuristic algorithms [5]. (It should be noted that heuristic algorithms by their nature are designed to find a good solution as quickly as possible, but may not find the best possible solution. They are greedy algorithms in that they find the locally optimal solution for each set of items rather than a globally optimal solution for all items.)
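The algorithm for assignment, also known as Lloyd's algorithm, assigns each item $x_p$ to the cluster $S_i$ whose centroid $m_i$ is nearest:
$$S_i = \{ x_p : \lVert x_p - m_i \rVert^2 \le \lVert x_p - m_j \rVert^2 \ \text{for all } j,\ 1 \le j \le k \}$$
As items are associated with a cluster, the centroid is recalculated to more accurately reflect the similarity within the cluster using the algorithm:
$$m_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$$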
The densities of the clusters are highly dependent on the initial centroids created and the distance algorithm selected. The overall validity of a cluster can be evaluated based on its density. Usually the distance algorithm used is the Euclidean distance measure, though the Manhattan measurement is also valid. Since k-means is one of the oldest clustering methods, many derivative forms also exist that enhance its speed and efficiency. They include fuzzy c-means clustering, Gaussian mixture models, spherical k-means, and k-means++.
The advantage of this method is that, when using a large number of variables, it may be faster than hierarchical clustering if the number of centroids, or K, is low, and the clusters produced may be denser. However, the quality of clusters may prove difficult to evaluate. It is also difficult to ascertain the optimal number of initial centroids. This method is most efficient with mid-sized datasets with a large number of variables. Below is an illustration of this method being used on a randomly generated dataset using Euclidean distance. The circles represent items in the dataset and the squares represent the centroids as they are adjusted throughout the process.
Figure 2: k-means clustering
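The assign-then-recalculate loop can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes squared Euclidean distance, takes the starting centroids as an argument instead of choosing them, and stops when no item changes cluster. All names and data points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning items to the nearest centroid and
    recalculating each centroid as the mean of its items."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no item changed cluster, so the centroids are stable
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, assignment


# Hypothetical dataset with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

With these starting centroids the first three points end up in one cluster and the last three in the other. A poor choice of starting centroids can lead the same loop to a different, locally optimal result, which is the dependence on initial centroids described above.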
Hierarchical clustering
The more current and resource-intensive alternative to k-means is hierarchical clustering [8]. As the name suggests, hierarchical clustering is utilized to form a complete hierarchy for the entire dataset. This approach is far more resource intensive. The complexity of agglomerative clustering is $O(n^3)$, and the divisive clustering approach is even more complex at $O(2^n)$. Though there have been improvements and additional methods developed that are less complex, hierarchical clustering will usually be too slow for large or continuous datasets. However, the structure that is created can be easier to interpret.
The above-mentioned methods, agglomerative and divisive, are the most common methods utilized. Agglomerative clustering is a bottom-up approach [8]. All items in the dataset are initially put into their own clusters. So, for N items in the dataset, there would be N clusters produced by the initial step. The two clusters with the smallest distance measurement (or those that are most similar) are then merged together. Then the distance from the new cluster to all of the other clusters is calculated using one of the linkage algorithms discussed later. The previous two steps are then repeated until the distance between the clusters exceeds a predetermined maximum distance (known as the distance criterion), or the minimum number of clusters is achieved (known as the number criterion). The results of this can be depicted in a graph known as a dendrogram.
Figure 3: Initial dataset before clustering
Figure 4: Two items have been clustered
Figure 5: Complete dendrogram using single linkage
A similar but less used method is divisive clustering. It uses a top-down approach and yields results similar to those of agglomerative clustering [8]. In the divisive method, a single cluster is initially created. The distances between all objects are then compared. If the distance between objects is greater than a preset threshold, then the cluster is split. This is repeated until the desired number of clusters is reached or there is little dissimilarity in the objects being examined.
In either agglomerative or divisive clustering, a linkage algorithm is used to evaluate the distance between clusters. There are three linkage algorithms typically used: single linkage, complete linkage, and average or mean linkage. In single linkage, the distance between any two clusters A and B is computed as the distance between the closest objects in the clusters. It can be expressed by the formula
$$D(A,B) = \min_{a \in A,\ b \in B} d(a,b)$$
This algorithm was later modified in 1973 to increase efficiency; the modified version is known as the SLINK algorithm. The opposite approach is used in the complete linkage algorithm. Complete linkage computes the distance between any two clusters as the maximum distance between objects in the clusters. It can be expressed by the following formula:
$$D(A,B) = \max_{a \in A,\ b \in B} d(a,b)$$
The third algorithm is UPGMA (unweighted pair group method with arithmetic mean), or average linkage. The distance between any two clusters computed using average linkage is the mean of the distances between all pairs of objects, one taken from each cluster. It was created by Sokal and Michener. The formula for it is:
$$D(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)$$
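
The three linkage rules and the agglomerative merging loop can be sketched together. This is a naive illustration that assumes Euclidean distance between items and stops only on the number criterion (the distance criterion is omitted for brevity). The function names and the sample points are hypothetical.

```python
def euclidean(p, q):
    """Euclidean distance between two items."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def single_linkage(a, b):
    """Distance between the closest pair of items, one from each cluster."""
    return min(euclidean(x, y) for x in a for y in b)


def complete_linkage(a, b):
    """Distance between the farthest pair of items, one from each cluster."""
    return max(euclidean(x, y) for x in a for y in b)


def average_linkage(a, b):
    """Mean distance over all pairs of items, one from each cluster."""
    return sum(euclidean(x, y) for x in a for y in b) / (len(a) * len(b))


def agglomerate(points, linkage, target_clusters):
    """Start with one cluster per item and repeatedly merge the two closest
    clusters until only target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters


# Hypothetical one-dimensional items.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
result = agglomerate(points, single_linkage, target_clusters=2)
```

With single linkage the four nearby items are merged step by step and the outlying item is left as its own cluster. Recording the distance at which each merge happens would give the information needed to draw a dendrogram.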
Fuzzy c-means
The previous two methods of clustering identify each item as being a part of a single cluster. While this does yield descriptive data, it may limit pattern identification. However, utilizing fuzzy c-means, a single item can be associated with multiple clusters [4]. It was created by Dunn in 1973 and was later modified in 1981 by Bezdek. Using this method, items that appear in the center of a cluster are considered more related to the cluster than items at its edges. The algorithm begins by creating C random centroids, where C is the number of clusters specified. Then it calculates the fuzzy membership $\mu_{ij}$ of each item i to each cluster j using the formula:
$$\mu_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}}$$
where $d_{ij}$ is the distance from item i to the centroid of cluster j. Each time items are added to a cluster, the center $c_j$ must then be recalculated, much as it is in k-means, this time using the formula:
$$c_j = \frac{\sum_{i} \mu_{ij}^{m} x_i}{\sum_{i} \mu_{ij}^{m}}$$
This process is then repeated until all items in the dataset have been placed into one or more clusters. This method relies on a degree of "fuzziness", or the degree to which items can be related. That degree is expressed as m in the formula and can range from 1 to ∞. The closer to 1 the degree, the fewer items will be related to multiple clusters, and the result will begin to resemble k-means. As the degree approaches ∞, interpretation of the result may be less meaningful, as all items will be related to all clusters. A study by Hathaway and Bezdek in 2001 suggested that the ideal degree for m was 2. The end result of clustering a relatively small dataset using 3 clusters and a degree of 2 looks as follows.
Figure 6: Complete fuzzy c-means with C=3
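One membership calculation and one centroid recalculation can be sketched as follows. The sketch assumes the standard fuzzy c-means rules: the membership of an item in a cluster is based on the ratio of its distance to that centroid over its distances to all centroids, raised to the power 2/(m-1), and each centroid is the average of the items weighted by membership raised to the power m. It also assumes Euclidean distance and m greater than 1. The function names and sample values are hypothetical.

```python
def euclidean(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def memberships(point, centroids, m=2.0):
    """Degree to which one item belongs to each cluster (the degrees sum to 1)."""
    distances = [euclidean(point, c) for c in centroids]
    zero_hits = distances.count(0.0)
    if zero_hits:
        # An item lying exactly on a centroid belongs fully to that cluster.
        return [1.0 / zero_hits if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
            for d_j in distances]


def update_centroid(points, weights, m=2.0):
    """Recalculate a centroid as the mean of all items weighted by membership ** m."""
    powered = [w ** m for w in weights]
    total = sum(powered)
    dims = len(points[0])
    return tuple(sum(w * p[d] for w, p in zip(powered, points)) / total
                 for d in range(dims))


# An item one unit from the first centroid and three units from the second.
mu = memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)], m=2.0)
center = update_centroid([(0.0, 0.0), (2.0, 0.0)], [1.0, 1.0], m=2.0)
```

With m = 2 the item belongs to the nearer cluster with degree 0.9 and to the farther one with degree 0.1, up to floating-point rounding. Choosing m closer to 1 pushes these values toward 1 and 0, which is the k-means-like behavior described above.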
Validation
Prior to interpreting a clustered dataset, it must be determined whether the method and parameters used were effective. This can be done either internally or externally. Internal evaluation utilizes the clustered dataset itself [10]. Internal evaluation is highly biased toward methods such as k-means that optimize item distances, while a method such as fuzzy c-means will receive a very low score, due largely to the fact that the distance between clusters is not as clearly defined. There are two prominent algorithms used to test clusterings using internal evaluation: the Davies-Bouldin index and the Dunn index. The Davies-Bouldin index is expressed in the formula:
$$DB = \frac{1}{N} \sum_{i=1}^{N} \max_{j \ne i} \left( \frac{\sigma_i + \sigma_j}{d(C_i, C_j)} \right)$$
N represents the number of clusters, $C_x$ is the centroid of cluster x, and $\sigma_x$ is the average distance of all items in cluster x to its centroid. $d(C_i, C_j)$ is the distance from the centroid of cluster i to the centroid of cluster j. A smaller value is considered a better clustering. This algorithm attempts to test the validity of a clustering based on low intra-cluster distances and high inter-cluster distances. Another approach is to test the validity of clusters based on their density. Dense, well-separated clusters may also be a sign that they are valid. The Dunn index tests for this type of cluster. It is calculated using the following formula:
$$D = \min_{1 \le i \le N} \left\{ \min_{j \ne i} \left\{ \frac{d(i,j)}{\max_{1 \le k \le N} d(k)} \right\} \right\}$$
In this formula, d(i,j) represents the distance between clusters i and j, and d(k) measures the intra-cluster distance of cluster k, which reflects its density. The distance between two clusters can also be calculated as the distance between their centroids. A high Dunn index is considered a better clustering.
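Both internal indexes can be sketched for a clustering given as a list of clusters, each a list of points. Several choices here are assumptions made for illustration: Euclidean distance is used throughout, the Dunn index takes the inter-cluster distance to be the distance between centroids and the intra-cluster distance to be the largest distance between two items of the same cluster, and the function names and sample clusterings are hypothetical.

```python
def euclidean(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def centroid(cluster):
    """Coordinate-wise mean of the items in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def davies_bouldin(clusters):
    """Average, over clusters, of the worst ratio of combined within-cluster
    spread to centroid separation. Smaller is better."""
    centers = [centroid(c) for c in clusters]
    sigmas = [sum(euclidean(p, ctr) for p in c) / len(c)
              for c, ctr in zip(clusters, centers)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max((sigmas[i] + sigmas[j]) / euclidean(centers[i], centers[j])
                     for j in range(n) if j != i)
    return total / n


def dunn(clusters):
    """Smallest centroid-to-centroid distance divided by the largest
    within-cluster distance. Larger is better.

    Assumes at least one cluster holds two distinct items; otherwise the
    largest within-cluster distance is zero.
    """
    centers = [centroid(c) for c in clusters]
    n = len(clusters)
    smallest_gap = min(euclidean(centers[i], centers[j])
                       for i in range(n) for j in range(i + 1, n))
    largest_spread = max(euclidean(p, q) for c in clusters for p in c for q in c)
    return smallest_gap / largest_spread


# Two hypothetical clusterings: the first is tighter than the second.
tight = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
loose = [[(0.0,), (6.0,)], [(10.0,), (12.0,)]]
db_tight, db_loose = davies_bouldin(tight), davies_bouldin(loose)
dunn_tight, dunn_loose = dunn(tight), dunn(loose)
```

The tighter clustering scores lower on the first index and higher on the Dunn index, matching the way each index is read.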
Alternatively, external evaluation can be used to validate clusters [10]. There are a number of methods to evaluate clusters externally; however, they are variations on the same theme, that is, using data not in the clustered dataset to test it. Normally the data used takes the form of benchmarks. These benchmarks are small datasets created by users. External evaluation methods then test how close the clustered dataset is to the benchmark. This method is controversial, as experts wonder how applicable it is to real datasets that are prohibitively complex or those for which relatively accurate benchmarks are difficult to create. An example of one such method would be the Jaccard index. It is expressed in the following formula:
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
Its results range from 0 (meaning the two datasets have nothing in common) to 1 (meaning they are identical). It is the number of unique elements common to both datasets divided by the total number of unique elements found in either dataset. But, given the controversy related to external evaluation and the bias related to internal evaluation, some clusterings, such as those using fuzzy c-means, may remain difficult to validate.
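The Jaccard index itself is a short set calculation. To compare a clustering against a benchmark, the sketch below makes an additional assumption that is not spelled out in the text: each clustering is represented by the set of item pairs placed in the same cluster, and the index is computed on those two sets. The function names and labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Number of elements common to both sets divided by the number of
    elements found in either set."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def co_clustered_pairs(labels):
    """Set of index pairs (i, j) whose items carry the same cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


# Cluster labels for four items: the clustering under test and a benchmark.
clustered = [0, 0, 1, 1]
benchmark = [0, 0, 0, 1]
score = jaccard(co_clustered_pairs(clustered), co_clustered_pairs(benchmark))
```

Here the two labelings agree on only one of the four pairs that either of them groups together, giving a score of 0.25.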
Applications of Clustering
Since its creation, clustering has become a standard tool used by many fields. In biology it is used for analyzing similarities in communities, grouping genes with related patterns, and creating genotypes. The field of medicine uses it in conjunction with PET scans to identify certain types of tissue and blood. Marketing uses clustering techniques constantly on the results of surveys and shopping records to aid in targeting demographics. Also, marketers that rely less on brick-and-mortar stores, such as eBay and Amazon, use clustering analysis to organize their products into similar groups so that they can make suggestions to customers [7]. Law enforcement officials use clustering to identify patterns in crime, allowing them to more effectively manage resources around their predicted need.
Conclusion
Clustering is currently an integral part of many fields. It is an older concept that has evolved over the course of time. This has resulted in the field becoming complex, with a variety of options for each step of the clustering process. In its simplest form, it consists of identifying a dataset, selecting variables, normalizing the data, conducting a clustering method, then validating it. But, given the variety of options available and its implementation in many software packages, the results of clustering can be meaningfully interpreted across a wide variety of datasets.
References
[1] Anderberg, M. R. (1973) "Cluster Analysis for Applications." Academic Press, New York.
[2] Basu, S., Davidson, I. and Wagstaff, K. (2008) "Constrained Clustering: Advances in Algorithms, Theory, and Applications." Chapman and Hall/CRC, London.
[3] Birant, D. and Kut, A. (2007) "ST-DBSCAN: An algorithm for clustering spatial-temporal data." Data & Knowledge Engineering.
[4] Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011) "Cluster Analysis, 5th edition." Wiley & Sons, New York.
[5] Bowman, A. W. and Azzalini, A. (1997) "Applied Smoothing Techniques for Data Analysis." Oxford University Press, Oxford.
[6] Chakrapani, C. (2004) "Statistics in Market Research." Arnold, London.
[7] Chen, H., Schuffels, C. and Orwig, R. (1996) "Internet categorization and search: a self-organizing approach." Journal of Visual Communication and Image Representation.
[8] De Boeck, P. and Rosenberg, S. (1988) "Hierarchical classes: model and data analysis." Psychometrika.
[9] Dunson, D. B. (2009) "Bayesian nonparametric hierarchical modeling." Biometrical Journal.
[10] Everitt, B. S. and Hothorn, T. (2009) "A Handbook of Statistical Analyses Using R (2nd edition)." Chapman and Hall, Boca Raton.
[11] Fitzmaurice, G. M., Laird, N. M. and Ware, J. H. (2004) "Applied Longitudinal Analysis." John Wiley & Sons, Inc., Hoboken, NJ.
[12] Gordon, A. D. (1987) "A review of hierarchical classification." Journal of the Royal Statistical Society A.