Clustering Analysis and Algorithms


Keegan Myers
Department of Computer Science
University of Wisconsin-Platteville
Platteville, WI 53818
myerske@uwplatt.edu
March 27, 2011



Abstract

Most fields, from botany to law enforcement, are plagued with an abundance of raw data. But data has little meaning without a method of interpreting it. This is where clustering becomes an invaluable asset. By utilizing a range of clustering methods, professionals in many fields can more accurately interpret data. The most significant of these methods will be discussed and evaluated. Along with explanations of the main clustering methods, some of the major issues that can impair accurate interpretation will be considered. Solutions to missing data, masking variables, and comparing variables measured in different units will be provided. Additionally, several applications of clustering will be highlighted. While this will not serve as an exhaustive reference, it is written in a stepwise manner, giving the reader a foundational understanding of clustering.






Introduction

When posed with a large variety of diverse information, one of the primary methods a human mind uses to make sense of the chaos is to classify the information being processed. By classifying information, a person can more effectively utilize the information presented to them. This is a common event that occurs on a daily basis. You may walk into a room and identify an object as a chair, having never seen that particular chair before. You can then interact with it appropriately. Similarly, the process known as clustering allows systems to evaluate large, diverse datasets and organize the data into groups called clusters so that they can be more easily understood. This practice began with the least computationally intensive methods in order to accommodate the limited hardware capabilities of the time. Many of these algorithms were then revisited as systems became more robust, leading to the clustering algorithms currently in use.




Overview of the clustering process

While clustering algorithms can be applied to many fields and many types of data, the basic steps remain the same:

- Select the data to cluster
- Select the variables to use
- Identify missing data
- Variable standardization
- Proximity measurements
- Number of clusters
- Clustering method
- Validation

There are several methods to employ for nearly all of the aforementioned steps. The key factors in deciding which methods are the most effective are the size of the dataset, its complexity, and the clusters most likely to be identified.



Variable Selection

Variables should be selected based on the likelihood that they will define a cluster. Those variables that are not likely to define a cluster are considered masking variables and should be either removed or ignored. Masking variables can pose a large problem in properly defining clusters [1]. There are two solutions to this issue. One is to use weighted variables. The numeric weight associated with each variable is based on the importance the user places on the variable within the context of the cluster definition. The weights can be defined either directly by the user or indirectly. Utilizing the indirect method, variables are compared and weighted based on their dissimilarity; the most similar variables are considered the cluster-defining variables.

The other option to account for masking variables is to utilize model-based variable selection. This method is most effective when the number of variables is much greater than the number of entries in the dataset. In this method, a secondary dataset is created that contains items with a far smaller number of variables. All variables that do not vary by item, or that vary only minimally, are removed. The model-based approach can decrease computational intensity; however, it may also fail to recognize statistically significant clusters.
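As a rough illustration of the model-based idea of dropping variables that barely vary across items, the sketch below uses scikit-learn's VarianceThreshold on a made-up dataset; the data, the threshold, and the library choice are assumptions for illustration rather than part of the original discussion.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical items described by three candidate variables; the second
# column is constant, so it cannot help define any cluster.
X = np.array([
    [1.70, 0.0, 12.0],
    [1.82, 0.0, 47.0],
    [1.65, 0.0, 33.0],
    [1.78, 0.0, 21.0],
])

# Keep only variables whose variance exceeds a small threshold.
selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)

print("kept variable indices:", selector.get_support(indices=True))  # [0 2]
print(X_reduced)
```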




Missing values

Once the cluster-defining variables have been identified, the next issue that may arise in creating meaningful clusters is items that are missing the predefined variables. Missing variables can have a significant effect on the conclusions that can be drawn from clustering. Missing data is perceived by the clustering algorithm as a nonresponse [3]. Depending on the weight of the variable in question, the nonresponse will proportionally affect the result. To alleviate this issue, non-responses can be avoided at the time the data is collected. However, this may often prove impossible. In such cases, there are five options available, listed here in order of viability: imputation, partial imputation, partial deletion, full analysis, and interpolation.

Imputation, or unit imputation, can be conducted using either the hot-deck or the cold-deck method. In the hot-deck method the missing value is replaced at random with values from another item in the dataset. In the cold-deck method the missing value is replaced by values from a different but similar dataset. Both of these methods are more traditional and are often replaced with newer, less standard derivatives such as the hot-deck closest-neighbor imputation method. Regardless of the implementation, multiple imputations should be run for the sake of validity [3]. Many scholars recommend 20 to 100 imputations per missing variable. Partial imputation works similarly, except imputation is not conducted on every missing value but only on key values identified by a pattern.

Partial deletion can be conducted using listwise deletion, which disregards all items that are missing the predefined variables. This can potentially cause invalid results if a significant number of the entries are missing values. Full analysis utilizes the entire dataset to evaluate the probability of a missing variable. This is done for every missing value, resulting in a potentially slow and inefficient approach depending on the size of the dataset and the number of missing values. The final method available to account for missing values is interpolation, which uses the values surrounding the missing value to calculate it. This method may also be somewhat slow on large datasets.
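For the hot-deck case, a minimal sketch (pandas and NumPy assumed; the column names and values are hypothetical) that replaces each missing value with a randomly drawn observed value from the same variable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical dataset with missing entries marked as NaN.
df = pd.DataFrame({
    "height": [1.70, np.nan, 1.65, 1.78],
    "weight": [68.0, 75.0, np.nan, 80.0],
})

def hot_deck_impute(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill each missing value with a randomly chosen observed value
    from the same column (a simple hot-deck imputation)."""
    filled = frame.copy()
    for col in filled.columns:
        observed = filled[col].dropna().to_numpy()
        mask = filled[col].isna()
        filled.loc[mask, col] = rng.choice(observed, size=mask.sum())
    return filled

print(hot_deck_impute(df))
```

In line with the recommendation above, this draw would be repeated many times (multiple imputations) and the clustering results compared across the imputed datasets.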




Variable Standardization

In some cases the cluster-defining variables may not be measured in the same units or may be of different types. For example, in a dataset utilizing height as a variable, the height may be measured in inches or in feet. The formula to calculate the standardization of an item, or z-score, is

Z = (X - μ) / σ

in which X represents the value, μ represents the mean of the population, and σ represents the standard deviation of the population. (Note that μ and σ are for the entire population, not a sample of the population.) A more universally applicable approach is to utilize a clustering method that is invariant under scaling [2], that is, a method whose grouping solutions are unaffected by differences in the variables' units of measurement.
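A short sketch of the z-score in code (NumPy assumed; the height values are made up) also shows why standardization makes mixed units comparable:

```python
import numpy as np

# Hypothetical heights recorded in inches.
heights_in = np.array([65.0, 70.0, 72.0, 68.0, 74.0])

# Population z-score: subtract the population mean and divide by the
# population standard deviation (ddof=0, i.e. not the sample estimate).
z = (heights_in - heights_in.mean()) / heights_in.std(ddof=0)
print(z)

# The same observations expressed in feet give identical z-scores,
# which is exactly what makes variables in different units comparable.
heights_ft = heights_in / 12.0
z_ft = (heights_ft - heights_ft.mean()) / heights_ft.std(ddof=0)
assert np.allclose(z, z_ft)
```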



Proximity Measurements

Distance measurements can be computed in a number of ways, but they are all used to evaluate the amount by which a value varies from other values currently in a cluster. The methods to calculate proximity include Euclidean distance, Manhattan distance, Chebyshev distance, and Mahalanobis distance. Euclidean, or ordinary, distance is based on the differences between corresponding variables; each difference is squared so that values further apart are weighted more heavily, giving the formula

d(p, q) = √( (p_1 - q_1)² + (p_2 - q_2)² + ... + (p_n - q_n)² )

The Manhattan distance, or taxicab distance, is the sum of the absolute differences between the values:

d(p, q) = Σ_{i=1..n} |p_i - q_i|

Incidentally, this distance metric was introduced by the 19th-century mathematician Hermann Minkowski, and its colloquial name was given because it was once used to calculate the shortest path a car could take between two intersections [4]. Colloquial definitions aside, the next common measurement of distance is the Chebyshev distance. In this algorithm the distance between any two vectors, or in the case of clustering analysis any two items, is the maximum absolute difference along any single variable:

d(p, q) = max_i |p_i - q_i|

The final common measurement of distance is known as the Mahalanobis distance [4]. This measurement, created in 1936, is scale-invariant, making it a preferred method as it does not require variable standardization. It is a measurement of similarity between an unknown sample set and a known one. Unlike Euclidean distance, it also takes into account correlations within the dataset. It can also be considered the dissimilarity between two random vectors x and y with covariance matrix S, and is expressed as

d(x, y) = √( (x - y)ᵀ S⁻¹ (x - y) )
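The distance measures above translate directly into code; below is a small NumPy sketch with made-up points p and q (the Mahalanobis example additionally assumes a small sample X from which the covariance matrix is estimated):

```python
import numpy as np

p = np.array([1.0, 4.0, 2.5])
q = np.array([3.0, 1.0, 2.0])

# Euclidean: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan: sum of absolute differences.
manhattan = np.sum(np.abs(p - q))

# Chebyshev: largest absolute difference in any single variable.
chebyshev = np.max(np.abs(p - q))

# Mahalanobis: accounts for correlations via the covariance matrix S
# estimated from a (hypothetical) sample of items.
X = np.array([[1.0, 4.0, 2.5],
              [3.0, 1.0, 2.0],
              [2.0, 2.0, 3.0],
              [0.5, 3.5, 1.0]])
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = p - q
mahalanobis = np.sqrt(diff @ S_inv @ diff)

print(euclidean, manhattan, chebyshev, mahalanobis)
```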







Number of Clusters

The number of clusters initially chosen plays a major role in the k-means clustering method and the fuzzy c-means method. However, the number depends largely upon the user's desired output. As such, there are few standardized methods to calculate how many clusters should be used. A larger K, or cluster number, will usually result in denser, more interrelated clusters [6], and it will also yield fewer errors. In fact, one could remove all error by setting K equal to the number of items, effectively making each item in the dataset its own cluster. One method of analyzing the optimal number with respect to k-means is the elbow method. This method requires that multiple k-means runs be conducted. The percentage of variance explained by the clusters is computed for each run. If these values are graphed, the diminishing return can be seen. The optimal K is then selected to be the point before returns begin to diminish. This is illustrated by the following figure.
















Figure 1: Elbow Method diagram
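A sketch of the elbow procedure using scikit-learn's KMeans follows; the synthetic data and the range of K values are arbitrary choices, and any k-means implementation would serve:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical 2-D dataset with a few natural groups.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# Run k-means for a range of K and record the within-cluster sum of
# squares (inertia). The "elbow" is the K after which inertia stops
# dropping sharply, i.e. where returns begin to diminish.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, w in zip(range(1, 9), inertias):
    print(k, round(w, 1))
```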


For a less resource-intensive alternative to running multiple k-means, a heuristic algorithm can be used. The goal of this algorithm is to produce high cluster quality with a low K, high intra-cluster similarity, and low inter-cluster similarity. The heuristic assigns a cluster quality score Q. If the quality is 0 or lower, then two items of the same cluster are, on average, more dissimilar than a pair of items from two different clusters. If the quality rating is 1, then two items from different clusters are entirely dissimilar, and items from the same cluster are more similar to each other. This will also most likely result in a denser k-means solution.



Clustering Methods

K-Means

K-means, or non-hierarchical clustering, is one of the oldest and simplest clustering methods. It was originally proposed by James MacQueen in 1967. Semantically, it identifies a previously set number of centroids based on their dissimilarity. It then iterates through the dataset and compares each item to the centroids using heuristic algorithms [5]. (It should be noted that heuristic algorithms by their nature are designed to find a near-optimal solution as quickly as possible, but may not find the best possible solution. They are greedy algorithms in that they find the locally optimal solution for each set of items rather than a globally optimal solution for all items.) The algorithm for assignment, also known as Lloyd's algorithm, is

S_i = { x_p : ||x_p - m_i||² ≤ ||x_p - m_j||² for all j, 1 ≤ j ≤ k }

where m_i is the current centroid of cluster i.
As items are associated with a cluster, the centroid is recalculated to more accurately reflect the similarity within the cluster, using the update

m_i = (1 / |S_i|) Σ_{x_j ∈ S_i} x_j


The densities of the clusters are highly dependent on the initial centroids created and the distance algorithm selected. The overall validity of a cluster can be evaluated based on its density. Usually the distance algorithm used is the Euclidean distance measure, though the Manhattan measurement is also valid. Since k-means is the oldest of all clustering methods, many derivative forms also exist that enhance its speed and efficiency. They include fuzzy c-means clustering, Gaussian mixture models, spherical k-means, and k-means++. The advantage of this method is that, when using a large number of variables, it may be faster than hierarchical clustering if the number of centroids (k) is low, and the clusters produced may be denser. However, the quality of the clusters may prove difficult to evaluate, and it is also difficult to ascertain the optimal number of initial centroids. This method is most efficient with mid-sized datasets with a large number of variables. Below is an illustration of this method being used on a randomly generated dataset using Euclidean distance. The circles represent items in the dataset and the squares represent the centroids as they are adjusted throughout the process.
















Figure 2: k-means clustering
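A compact NumPy sketch of the Lloyd iteration described above is given below (the random data, the fixed k, and the simple random initialization are assumptions for illustration; in practice a library routine such as scikit-learn's KMeans would normally be used):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct items at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each item joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned items
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in [(0, 0), (4, 4), (0, 4)]])
labels, centroids = kmeans(X, k=3)
print(centroids)
```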



Hierarchical clustering

The more current and resource-intensive alternative to k-means is hierarchical clustering [8]. As the name suggests, hierarchical clustering is used to form a complete hierarchy over the entire dataset. This approach is far more resource intensive: the complexity of agglomerative clustering is O(n³), and the divisive approach is even more complex at O(2ⁿ). Though improvements and additional, less complex methods have been developed, hierarchical clustering will usually be too slow for large or continuous datasets. However, the structure that is created can be easier to interpret.

The above-mentioned methods, agglomerative and divisive, are the most commonly used. Agglomerative clustering is a bottom-up approach [8]. All items in the dataset are initially put into their own clusters, so for N items in the dataset there would be N clusters produced by the initial step. The two clusters with the smallest distance measurement (or those that are most similar) are then merged together. Then the distance from the new cluster to all of the other clusters is calculated using one of the linkage algorithms discussed later. The previous two steps are then repeated until the distance between the clusters exceeds a pre-determined maximum distance (known as the distance criterion), or the minimum number of clusters is reached (known as the number criterion). The results of this can be depicted in a graph known as a dendrogram.








Figure 3: Initial dataset before clustering









Figure 4: Two items have been clustered













Figure 5: Complete dendrogram using single linkage




A similar but less used method is divisive clustering. It uses a top-down approach and yields results similar to those of agglomerative clustering [8]. In the divisive method a single cluster is created initially. The distances between all objects are then compared; if the distance between objects is greater than a preset threshold, the cluster is split. This is repeated until the desired number of clusters is reached or there is little dissimilarity left in the objects being examined.


In either agglomerative or divisive clustering, a linkage algorithm is used to evaluate the distance between clusters. Three linkage algorithms are typically used: single linkage, complete linkage, and average (or mean) linkage. In single linkage, the distance between any two clusters is the distance between the closest objects in the clusters. It can be expressed by the formula

D(A, B) = min { d(a, b) : a ∈ A, b ∈ B }

This algorithm was later modified in 1973 to increase efficiency; the modified version is known as the SLINK algorithm. The opposite approach is used in the complete linkage algorithm. Complete linkage computes the distance between any two clusters as the maximum distance between objects in the clusters:

D(A, B) = max { d(a, b) : a ∈ A, b ∈ B }

The third algorithm is UPGMA (unweighted pair group method with arithmetic mean), or average linkage, created by Sokal and Michener. The distance between any two clusters is the mean of the distances between all pairs of objects, one from each cluster:

D(A, B) = (1 / (|A| · |B|)) Σ_{a ∈ A} Σ_{b ∈ B} d(a, b)
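A brief sketch of agglomerative clustering with SciPy is shown below (the synthetic data and the distance threshold of 1.0 are arbitrary choices); the three linkage rules above correspond to the 'single', 'complete', and 'average' methods:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Hypothetical small dataset; hierarchical methods suit modest sizes.
X = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in [(0, 0), (3, 3)]])

# Agglomerative clustering with single linkage ('complete' and 'average'
# select the other two linkage rules).
Z = linkage(X, method="single")

# Cut the tree with a distance criterion to obtain flat cluster labels;
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree as in Figure 5.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```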











Fuzzy c-means

The previous two methods of clustering identify each item as belonging to a single cluster. While this does yield descriptive data, it may limit pattern identification. Utilizing fuzzy c-means, however, a single item can be associated with multiple clusters [4]. It was created by Dunn in 1973 and was later modified in 1981 by Bezdek. Using this method, items that appear in the center of a cluster are considered more related to the cluster than items at its edges. The algorithm begins by creating C random centroids, where C is the number of clusters specified. It then calculates the fuzzy membership µ_ij of each item i to each cluster j using the formula

µ_ij = 1 / Σ_{k=1..C} ( ||x_i - c_j|| / ||x_i - c_k|| )^(2/(m-1))

Each time items are added to a cluster, the center must be recalculated, much as it is in k-means, this time using the formula

c_j = Σ_i (µ_ij^m · x_i) / Σ_i µ_ij^m

This process is then repeated until all items in the dataset have been placed into one or more clusters. This method relies on a degree of "fuzziness", m, or the degree to which items can be related, which can range from 1 to ∞. The closer the degree is to 1, the fewer items will be related to multiple clusters and the more the result will resemble a k-means solution. As the degree approaches ∞, interpretation of the result may be less meaningful, as all items will be related to all clusters. A study by Hathaway and Bezdek in 2001 suggested that the ideal degree was 2. The end result of clustering a relatively small dataset using 3 clusters and a degree of 2 looks as follows.














Figure 6: Complete fuzzy c-means with C = 3
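A minimal NumPy sketch of one fuzzy c-means update, i.e. one pass through the two formulas above, follows (the random data, C = 3, and m = 2 are illustrative choices; this is not a full implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 2))           # hypothetical items
C, m = 3, 2.0                          # number of clusters, fuzziness degree
centers = X[rng.choice(len(X), C, replace=False)]

# Membership update: u[i, j] is how strongly item i belongs to cluster j.
dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
u = 1.0 / ratio.sum(axis=2)            # each row sums to 1 across the C clusters

# Center update weighted by memberships raised to the power m.
centers = (u.T ** m @ X) / (u.T ** m).sum(axis=1, keepdims=True)
print(u.sum(axis=1)[:5])               # ~1.0 for each item
```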



Validation

Prior to interpreting a clustered dataset, it must be determined whether the method and parameters used were effective. This can be done either internally or externally. Internal evaluation utilizes the clustered dataset itself [10]. Internal evaluation is highly biased toward methods such as k-means that optimize item distances, while a method such as fuzzy c-means will receive a very low score, due largely to the fact that the distance between clusters is not as clearly defined. There are two prominent algorithms used to test clusterings using internal evaluation: the Davies-Bouldin index and the Dunn index. The Davies-Bouldin index is expressed by the formula

DB = (1/N) Σ_{i=1..N} max_{j ≠ i} ( (σ_i + σ_j) / d(C_i, C_j) )

N represents the number of clusters, C_x is the centroid of a cluster, and σ_x is the average distance of all items in the cluster to the centroid. d(C_i, C_j) is the distance from the centroid of cluster i to the centroid of cluster j. A smaller value indicates a better clustering. This algorithm attempts to test the validity of a clustering based on low intra-cluster distances and high inter-cluster distances. Another approach is to test the validity of clusters based on their density. Dense, well-separated clusters may also be a sign that they are valid. The Dunn index tests for this type of cluster. It is calculated using the following formula:

D = min_{i ≠ j} d(i, j) / max_k d′(k)

In this formula, d(i, j) represents the distance between clusters i and j, and d′(k) measures the intra-cluster distance (density) of cluster k. The distance between two clusters can also be calculated as the distance between their centroids. A higher Dunn index indicates a better clustering.

Alternatively, external evaluation can be used to validate clusters [10]. There are a number of methods to evaluate clusters externally; however, they are derivations on the same theme: using data not in the clustered dataset to test it. Normally the data used takes the form of benchmarks, small datasets created by users. External evaluation methods then test how close the clustered dataset is to the benchmark. This method is controversial, as experts question how applicable it is to real datasets that are prohibitively complex or for which relatively accurate benchmarks are difficult to create. An example of one such method is the Jaccard index, expressed by the following formula:

J(A, B) = |A ∩ B| / |A ∪ B|

Its results range from 0 (meaning the two datasets have nothing in common) to 1 (meaning they are identical). It is the number of elements shared by both datasets divided by the total number of unique elements across the two datasets. But, given the controversy related to external evaluation and the bias related to internal evaluation, some clusterings, such as those using fuzzy c-means, may remain difficult to validate.
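As a sketch of both kinds of check (scikit-learn assumed; the synthetic data, the benchmark labelling, and the pair-set formulation of the Jaccard index are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

# Internal evaluation: Davies-Bouldin index (lower is better).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Davies-Bouldin:", davies_bouldin_score(X, labels))

# External evaluation: Jaccard index against a hypothetical benchmark,
# comparing the sets of item pairs placed in the same cluster.
def same_cluster_pairs(lbls):
    return {(i, j) for i in range(len(lbls)) for j in range(i + 1, len(lbls))
            if lbls[i] == lbls[j]}

benchmark = np.repeat([0, 1, 2], 40)          # assumed "true" grouping
a, b = same_cluster_pairs(labels), same_cluster_pairs(benchmark)
print("Jaccard:", len(a & b) / len(a | b))
```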



Applications of Clustering

Since its creation, clustering has become a standard tool in many fields. In biology it is used for analyzing similarities in communities, grouping genes with related patterns, and creating genotypes. The field of medicine uses it in conjunction with PET scans to identify certain types of tissue and blood. Marketing uses clustering techniques constantly on the results of surveys and shopping records to aid in targeting demographics. Also, marketers that rely less on brick-and-mortar stores, such as eBay and Amazon, use clustering analysis to organize their products into similar groups so that they can make suggestions to customers [7]. Law enforcement officials use clustering to identify patterns in crime, allowing them to more effectively manage resources around their predicted need.



Conclusion

Clustering is currently an integral part of many fields. It is an older concept that has evolved over the course of time. This has resulted in the field becoming complex, with a variety of options for each step of the clustering process. In its simplest form, it consists of identifying a dataset, selecting variables, normalizing the data, conducting a clustering method, and then validating the result. But, given the variety of options available and its implementation in many software packages, the results of clustering can be meaningfully interpreted across a wide variety of datasets.





References

[1] Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press, New York.

[2] Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman and Hall/CRC, London.

[3] Birant, D. and Kut, A. (2007). ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering.

[4] Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis, 5th edition. Wiley & Sons, New York.

[5] Bowman, A. W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis. Oxford University Press, Oxford.

[6] Chakrapani, C. (2004). Statistics in Market Research. Arnold, London.

[7] Chen, H., Schuffels, C., and Orwig, R. (1996). Internet categorization and search: a self-organizing approach. Journal of Visual Communication and Image Representation.

[8] De Boeck, P. and Rosenberg, S. (1988). Hierarchical classes: model and data analysis. Psychometrika.

[9] Dunson, D. B. (2009). Bayesian nonparametric hierarchical modeling. Biometrical Journal.

[10] Everitt, B. S. and Hothorn, T. (2009). A Handbook of Statistical Analyses Using R (2nd edition). Chapman and Hall, Boca Raton.

[11] Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2004). Applied Longitudinal Analysis. John Wiley & Sons, Inc., Hoboken, NJ.

[12] Gordon, A. D. (1987). A review of hierarchical classification. Journal of the Royal Statistical Society A.