1. What is Cluster Analysis? (Introduction)
Cluster analysis is a technique used for classification of data in which data elements are partitioned into groups called clusters. Each cluster represents a collection of data elements that are proximate according to a distance or dissimilarity function. [1] The cluster analysis approach is an important tool in decision making and an effective creativity technique for generating ideas and obtaining solutions.
The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of a similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. [2]
Cluster analysis is mainly a discovery tool: it often surfaces perceived problem areas, concerns, or items that naturally belong together.
Cluster analysis aims at [3]:
- classifying data into natural groupings on the basis of similar or related characteristics,
- identifying the most important characteristics to be considered in developing a problem specification,
- developing a more homogeneous group of items from a large list of dissimilar items,
- identifying differences among customer, employee, or supplier groups with regard to quality perception and performance issues.
2. How is it implemented?
Types of clustering
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once.
For example, when designing an effective menu system for an automotive application in in-vehicle mobile multimedia systems, one alternative is a hierarchical menu implemented on an integrated display/control unit such as a multi-function display (MFD). In a top-down approach, the designer identifies first-order (macro) categories that are repeatedly divided into progressively smaller subcategories until menu items are represented at their lowest level. This approach is conceptually driven in that items are discriminated along categorical boundaries and conceptual dimensions. The top-down approach emphasizes the differences between functions rather than their similarities. [4]
Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. [5]
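The agglomerative ("bottom-up") procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the distance between two clusters is measured, so the sketch assumes single linkage (the distance between the closest pair of points), and the function names and sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points (an assumption)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    # Bottom-up start: every element is a separate cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (i < j, so deleting j keeps i valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-D data with two visibly separate groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
result = agglomerate(points, n_clusters=2)
print(result)
```

A divisive algorithm would run in the opposite direction, starting from one cluster containing all points and splitting it step by step.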
Distance Measures
The joining (tree clustering) method described above uses the dissimilarities (or similarities), that is, the distances between objects, when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items. An important step in any clustering is to select a distance measure, which determines how the similarity of two elements is calculated. These distances can be based on a single dimension or on multiple dimensions, with each dimension representing a rule or condition for grouping objects. The choice of measure influences the shape of the clusters, as some elements may be close to one another according to one distance and farther apart according to another. [2]
Common distance functions:
- the Euclidean distance
- the Mahalanobis distance [8]
- the Manhattan distance [9]
- the Hamming distance [10]
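To make the effect of the distance measure concrete, the following sketch implements three of the functions listed above in plain Python. The function names and the sample points are illustrative assumptions. The Mahalanobis distance is left out because it additionally requires the inverse covariance matrix of the data.

```python
from math import sqrt


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


# Hypothetical points: which of b and c is closer to a depends on the measure.
a, b, c = (0, 0), (3, 3), (0, 5)
print(euclidean(a, b) < euclidean(a, c))  # b is closer under the Euclidean distance
print(manhattan(a, b) > manhattan(a, c))  # c is closer under the Manhattan distance
print(hamming("karolin", "kathrin"))      # differing positions in two strings
```

With these points, b is the nearer neighbour of a under the Euclidean distance while c is nearer under the Manhattan distance, which is one way the choice of measure can change the resulting clusters.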
Picture 1: The Euclidean distance
Source: http://www1.uni-hamburg.de/RRZ/Software/Statistica/Handbuch/stcluan.html#d
Types of data in cluster analysis: interval-scaled variables; binary variables; nominal, ordinal, and ratio variables; and variables of mixed types.
The cluster analysis tool is best utilized after a brainstorming session to organize data by subdividing different ideas, items, or characteristics into relatively similar groups, each under a topical heading.
Consider a horizontal hierarchical tree plot (see graph below). On the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster. The following tree diagram classifies 22 different car models and shows their linkage (connection) using the Euclidean distance, which compares the cars according to certain characteristics (e.g., fuel consumption, cost, accessories).
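The "relaxing" of the criterion can be illustrated with a small sketch. It assumes that two objects join the same cluster whenever a chain of pairwise Euclidean distances no larger than a chosen threshold connects them (single linkage). Lowering the bar for membership then corresponds to raising this maximum linking distance. The function name and the five sample objects are hypothetical and are not the car data from the diagram.

```python
from itertools import combinations
from math import dist


def clusters_at_threshold(points, threshold):
    """Group points connected by chains of distances <= threshold."""
    parent = list(range(len(points)))  # union-find: each point starts alone

    def find(i):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of points that is close enough.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)

    # Collect the points under their representative.
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


# Hypothetical objects described by two numeric characteristics.
objects = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]
thresholds = (0.0, 0.5, 1.0, 5.5, 6.0)
counts = [len(clusters_at_threshold(objects, t)) for t in thresholds]
print(counts)
```

Reading the counts from left to right mirrors reading the tree plot from left to right: every object starts alone, and clusters merge as the criterion is relaxed until a single cluster remains.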
3. What are the success factors? (Do/Do not)
Cluster analysis is not as much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters according to well defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." [2]
What Is Good Clustering?
High quality:
- high intra-class similarity (similarity between objects belonging to the same class)
- low inter-class similarity (similarity between objects belonging to different classes)
Depends on:
- the similarity measure (how the similarity of two or more objects is calculated)
- the algorithm used for searching
- the ability to discover hidden patterns
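One rough way to check these two criteria numerically is to compare average distances, treating a small distance as high similarity. The sketch below is an assumption for illustration rather than a measure taken from the cited sources. The function names and the two example clusterings are hypothetical.

```python
from itertools import combinations, product
from math import dist
from statistics import mean


def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    return mean(dist(p, q) for c in clusters for p, q in combinations(c, 2))


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    return mean(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    )


# Two hypothetical groupings of the same five points.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11)]]
poor = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11)]]

for name, clustering in (("good", good), ("poor", poor)):
    print(name, mean_intra_distance(clustering), mean_inter_distance(clustering))
```

For the first grouping the within-cluster distances are much smaller than the between-cluster distances, which is the pattern described above. For the second grouping the relationship is reversed.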
Clustering may not be the best way to discover interesting groups in a data set. Often visualisation methods work well, allowing the human expert to identify useful groups. However, as data set sizes increase to millions of entities, this becomes impractical, and clusters help to partition the data so that we can deal with smaller groups. Different algorithms deliver different clusterings. [6]
4. Case study
Title: An Analysis of Industrial Clusters in Burnaby [7]
This case study presents the findings of an analysis of industrial clusters in the City of Burnaby, in the Greater Vancouver region of British Columbia, Canada, and draws conclusions based on this analysis.
Specifically, Figure 1 shows the ratio of GVRD (Greater Vancouver Regional District) employment to employment in Canada as a whole, together with the corresponding ratio for Burnaby. The Y-axis value in Figure 1 is the percentage of employment within a specific industrial category in the GVRD or Burnaby, divided by the national average percentage for that category.
Not all industrial codes are shown. The figures show, as expected, that the GVRD and Burnaby have much smaller than average labour forces in agriculture and mining, and smaller than average employment in manufacturing and public administration. If either the GVRD or Burnaby has a ratio greater than one, it indicates some competitive advantage compared to Canada as a whole.
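The ratio plotted on the Y-axis (often called a location quotient) can be computed as follows. This is a minimal sketch: the function name is hypothetical and the employee counts are made up for illustration. They are not figures from the Burnaby study.

```python
def employment_ratio(region_jobs, nation_jobs):
    """Ratio of each industry's employment share in a region to its national share."""
    region_total = sum(region_jobs.values())
    nation_total = sum(nation_jobs.values())
    return {
        industry: (region_jobs[industry] / region_total)
        / (nation_jobs[industry] / nation_total)
        for industry in region_jobs
    }


# Made-up employee counts per industry, NOT data from the case study.
region = {"utilities": 30, "agriculture": 10, "construction": 60}
nation = {"utilities": 200, "agriculture": 400, "construction": 400}

ratios = employment_ratio(region, nation)
for industry, ratio in ratios.items():
    status = "above" if ratio > 1 else "at or below"
    print(f"{industry}: {ratio:.2f} ({status} the national share)")
```

A ratio above one means the industry accounts for a larger share of employment in the region than it does nationally, which is how the figure is read in the text above.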
FIGURE 1
What is perhaps more important is to look at those areas where Burnaby has a distinct advantage over the rest of the GVRD. There are four such areas: utilities, construction, wholesale, and information and culture. A similar type of analysis can be done by occupational codes.
Figure 2 shows the ratios of the percentages of total employment in the GVRD and Burnaby in a particular aggregation of occupational codes compared to Canada.
Neither Burnaby nor the GVRD has higher than average numbers of workers in primary and manufacturing occupations. The GVRD overall has average numbers of individuals in health occupations and education, but Burnaby falls behind the GVRD (probably due to the preponderance of health and educational facilities outside Burnaby).
Burnaby has slight advantages in management and business. While it does not have as great an advantage in the arts as does the GVRD as a whole, its advantage over the rest of Canada is still significant, and bears further study.
FIGURE 2
5. List of References
Web sites:
[1] http://mathworld.wolfram.com/ClusterAnalysis.html
[2] http://www.statsoft.com/textbook/stcluan.html
[3] http://datamining.anu.edu.au/student/math3346_2006/algintro-2x3.pdf
[4] Using Cluster Analysis for Deriving Menu Structures for Automotive Mobile Multimedia Applications. Mona L. Toms, Mark A. Cummings-Hill and David G. Curry (Delphi Delco Electronics Systems); Scott M. Cone (Veridian Engineering). SAE 2001 World Congress, Detroit, Michigan, March 5-8, 2001.
[5] en.wikipedia.org/wiki/Data_clustering
[6] datamining.anu.edu.au/student/math3346_2006/clusters-2x3.pdf
[7] www.sfu.ca/cprost/publications.htm
[8] http://mrw.interscience.wiley.com/emrw/9780470011812/eob/article/b2a13038/current/abstract
[9] http://www.camo.com/rt/Resources/Clustering.html
[10] http://en.wikipedia.org/wiki/Hamming_distance
6. Glossary
Algorithm: As opposed to heuristics (which contain general recommendations based on statistical evidence or theoretical reasoning), algorithms are completely defined, finite sets of steps, operations, or procedures that will produce a particular outcome. For example, with a few exceptions, all computer programs, mathematical formulas, and (ideally) medical and food recipes are algorithms.
Cluster: A collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Cluster analysis: Grouping a set of data objects into clusters. Clustering is unsupervised classification (there are no predefined classes) and a form of descriptive data mining.
Heuristics: General recommendations or guides based on statistical evidence or theoretical reasoning.
7. Keywords
Cluster analysis, Types of clustering, Cluster algorithms, Cluster distances, Creativity technique, Creative.
8. Questions
1) What is cluster analysis and what is it used for?
2) Which factors are used in a tree clustering method?
3) Which methods (codes) are used in clustering the industries in Burnaby?
4) The high-quality similarities in good clustering are (please insert T for true and F for false):
a. high intra-class similarity
b. high inter-class similarity
c. low intra-class similarity
d. low inter-class similarity
Answer:
a. T
b. F
c. F
d. T