1. What is Cluster Analysis? (Introduction)

tribecagamosisAI and Robotics

Nov 8, 2013

Cluster analysis is a technique used for classification of data in which data elements are partitioned into groups called clusters that represent collections of data elements that are proximate based on a distance or dissimilarity function. [1]

The cluster analysis approach is an important tool in decision making and an effective creativity technique in generating ideas and obtaining solutions.

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. [2]

Cluster analysis is mainly a discovery tool: it often surfaces perceived problem areas, concerns or items that naturally belong together.

Cluster analysis aims at [3]:

- classifying data into natural groupings on the basis of similar or related characteristics,
- identifying the most important characteristics to be considered in developing a problem specification,
- developing a more homogeneous group of items from a large list of dissimilar items,
- identifying differences among customer, employee or supplier groups in regard to quality perception and performance issues.


2. How is it implemented?

Types of clustering

Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once.

For example, when designing an effective menu system for in-vehicle mobile multimedia systems in automotive applications, one alternative is a hierarchical menu implemented on an integrated display/control unit such as a multi-function display (MFD). In a top-down approach, the designer first identifies first-order (macro) categories that are repeatedly divided into progressively smaller subcategories until menu items are represented at their lowest level. This approach is conceptually driven in that items are discriminated along categorical boundaries and conceptual dimensions. The top-down approach emphasizes the differences between functions rather than their similarities. [4]

Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. [5]
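The bottom-up procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance and single linkage (the distance between two clusters is taken as the distance between their closest members), and the function names and sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the smallest point-to-point distance."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    # Bottom-up start: every element is a cluster of its own.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # i < j always holds here, so removing j leaves index i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


# Hypothetical two-dimensional data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, 3)
print(result)
```

With these sample points the two tight pairs are merged first and the isolated point stays in a cluster of its own. A divisive algorithm would run in the opposite direction, starting from one cluster that contains everything.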

Distance Measures

The joining or tree clustering method above uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items. These distances (similarities) can be based on a single dimension or multiple dimensions, with each dimension representing a rule or condition for grouping objects.

An important step in any clustering is to select a distance measure, which determines how the similarity of two elements is calculated. The choice influences the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. [2]


Common distance functions:

- The Euclidean distance
- The Mahalanobis distance [8]
- The Manhattan distance [9]
- The Hamming distance [10]
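Three of the distance functions listed above can be written directly from their usual definitions. The sketch below is illustrative only: the function names and sample points are hypothetical, and the Mahalanobis distance is left out because it additionally needs the covariance matrix of the data.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute differences per dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical points, chosen so that the two numeric metrics disagree.
origin = (0, 0)
p = (3, 3)
q = (0, 5)

print(euclidean(origin, p), euclidean(origin, q))
print(manhattan(origin, p), manhattan(origin, q))
print(hamming("karolin", "kathrin"))
```

Under the Euclidean distance p is closer to the origin than q, while under the Manhattan distance q is closer than p. This is the effect described above: the same elements can be near one another according to one distance and further apart according to another.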


Picture 1: The Euclidean distance

Source: http://www1.uni-hamburg.de/RRZ/Software/Statistica/Handbuch/stcluan.html#d

Types of data in cluster analysis: interval-scaled variables, binary variables, nominal, ordinal, and ratio variables, and variables of mixed types.

The cluster analysis tool is best utilized after a brainstorming session to organize data by subdividing different ideas, items or characteristics into relatively similar groups, each under a topical heading.

Consider a Horizontal Hierarchical Tree Plot (see graph below). On the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster.

The following tree diagram classifies 22 different car models and their linkage (connection) using "Euclidean distance", which compares car categories according to certain characteristics (e.g. fuel consumption, cost, accessories, etc.).
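The effect of relaxing the criterion step by step can be sketched as follows. The example assumes Euclidean distance and treats two objects as members of the same cluster whenever a chain of neighbours, each no further apart than the current linkage distance, connects them (a single-linkage reading of the tree plot). Relaxing the criterion corresponds here to allowing a larger linkage distance. The function name and the data are hypothetical and are not the car data from the diagram.

```python
from itertools import combinations
from math import dist


def clusters_at_threshold(points, threshold):
    """Group points that are connected by steps no longer than `threshold`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links up to the representative of i's group.
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            # Join the two groups by linking one representative to the other.
            parent[find(j)] = find(i)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


# Hypothetical one-dimensional measurements.
points = [(1.0,), (2.0,), (4.0,), (8.0,), (16.0,)]
for threshold in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(threshold, len(clusters_at_threshold(points, threshold)))
```

At the strictest setting every object is in a class by itself (five clusters). Each relaxation merges one more object, until a single cluster remains, which mirrors reading the tree plot from left to right.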








3. What are the success factors? (Do/Do not)

Cluster analysis is not as much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters according to well defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." [2]

What Is Good Clustering?

High Quality:

- high intra-class similarity (similarity between attributes belonging to the same class)
- low inter-class similarity (similarity between attributes belonging to different classes)

Depends on:

- similarity measure (how similar two or more attributes are)
- algorithm for searching
- ability to discover hidden patterns
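One simple way to put numbers on the two quality criteria above is to compare average distances, where a small distance stands for a high similarity. The sketch below is an assumption-laden illustration, not a standard quality index: it uses Euclidean distance, and the function names and sample data are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean


def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    return mean(
        dist(p, q) for cluster in clusters for p, q in combinations(cluster, 2)
    )


def mean_inter_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    return mean(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )


# Hypothetical data: the same four points grouped in two different ways.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
poor = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]

print(mean_intra_distance(good), mean_inter_distance(good))
print(mean_intra_distance(poor), mean_inter_distance(poor))
```

The first grouping has a small intra-class distance and a large inter-class distance, which is the pattern described above as high quality. The second grouping puts distant points together and scores worse on both counts. Note that `mean` raises an error if every cluster holds a single point, since there are then no intra-class pairs to average.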

Clustering may not be the best way to discover interesting groups in a data set. Often visualisation methods work well, allowing the human expert to identify useful groups. However, as data set sizes increase to millions of entities, this becomes impractical, and clusters help to partition the data so that we can deal with smaller groups. Different algorithms deliver different clusterings [6].
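As one illustration of a partitional algorithm, the kind that determines all clusters at once, the sketch below uses k-means. The source does not name this algorithm; it is chosen here only because it is a widely used example. The starting centroids, the function name and the data are hypothetical, and Euclidean distance is assumed.

```python
from math import dist


def k_means(points, centroids, max_iterations=100):
    """Assign points to the nearest centroid, then re-centre, until stable."""
    for _ in range(max_iterations):
        # Assignment step: every point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroid  # an empty cluster keeps its previous centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # nothing moved, so the partition is stable
        centroids = updated
    return clusters, centroids


# Hypothetical data: two loose groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
clusters, centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(clusters)
print(centroids)
```

Unlike the hierarchical methods, the number of clusters is fixed in advance by the number of starting centroids. The result can depend on that choice and on the starting positions, which is one reason different algorithms and settings deliver different clusterings.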

4. Case study

Title: An Analysis of Industrial Clusters in Burnaby [7].

This case study presents the findings of the analysis of industrial clusters in the City of Burnaby, in the Greater Vancouver area of British Columbia, Canada, and draws conclusions based on this analysis.










Specifically, Figure 1 shows both the ratio of GVRD (Greater Vancouver Regional District) employment compared to Canada as a whole and the ratio of Burnaby employment compared to Canada. The Y-axis value on Figure 1 is the ratio of the percentage of employment within a specific industrial category for the GVRD or Burnaby compared to the national average.

Not all industrial codes are shown: the figures show, as expected, that the GVRD and Burnaby have much smaller than average labour forces in agriculture and mining, and smaller than average employment in manufacturing and public administration.

If either the GVRD or Burnaby has a ratio greater than one, it indicates that it has
some competitive advantage compared to Canada as a whole.
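The ratio plotted on the Y-axis can be worked through with a small calculation. A ratio of this kind is often called a location quotient in regional economics. The head counts below are hypothetical and are not taken from the Burnaby study; the function and parameter names are likewise only illustrative.

```python
def employment_ratio(region_industry, region_total, nation_industry, nation_total):
    """Regional share of employment in an industry divided by the national share."""
    region_share = region_industry / region_total
    nation_share = nation_industry / nation_total
    return region_share / nation_share


# Hypothetical head counts, not taken from the Burnaby study.
ratio = employment_ratio(
    region_industry=6_000,
    region_total=100_000,
    nation_industry=400_000,
    nation_total=16_000_000,
)
print(ratio)
```

Here 6% of the region's workers are in the industry against 2.5% nationally, so the ratio is about 2.4. A value above one means the industry is over-represented in the region compared with the country as a whole.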

FIGURE 1










What is perhaps more important is to look at those areas where Burnaby has a distinct advantage over the rest of the GVRD. There are four such areas: utilities, construction, wholesale, and information and culture.

A similar type of analysis can be done by occupational codes.

Figure 2 shows the ratios of the percentages of total employment in the GVRD and Burnaby in a particular aggregation of occupational codes compared to Canada.

Neither Burnaby nor the GVRD has higher than average numbers of workers in primary and manufacturing occupations. The GVRD overall has average numbers of individuals in health occupations and education, but Burnaby falls behind the GVRD (probably due to the preponderance of health and educational facilities outside Burnaby).

Burnaby has slight advantages in management and business. While it does not have as great an advantage in the arts as does the GVRD as a whole, its advantage over the rest of Canada is still significant, and bears further study.



FIGURE 2






















5. List of References

Web sites:

[1] http://mathworld.wolfram.com/ClusterAnalysis.html

[2] http://www.statsoft.com/textbook/stcluan.html

[3] http://datamining.anu.edu.au/student/math3346_2006/algintro-2x3.pdf

[4] Using Cluster Analysis for Deriving Menu Structures for Automotive Mobile Multimedia Applications. Mona L. Toms, Mark A. Cummings-Hill and David G. Curry, Delphi Delco Electronics Systems; Scott M. Cone, Veridian Engineering. SAE 2001 World Congress, Detroit, Michigan, March 5-8, 2001

[5] en.wikipedia.org/wiki/Data_clustering

[6] datamining.anu.edu.au/student/math3346_2006/clusters-2x3.pdf

[7] www.sfu.ca/cprost/publications.htm

[8] http://mrw.interscience.wiley.com/emrw/9780470011812/eob/article/b2a13038/current/abstract

[9] http://www.camo.com/rt/Resources/Clustering.html

[10] http://en.wikipedia.org/wiki/Hamming_distance

6. Glossary

Algorithm: As opposed to heuristics (which contain general recommendations based on statistical evidence or theoretical reasoning), algorithms are completely defined, finite sets of steps, operations, or procedures that will produce a particular outcome. For example, with a few exceptions, all computer programs, mathematical formulas, and (ideally) medical and food recipes are algorithms.

Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis: Grouping a set of data objects into clusters. Clustering is unsupervised classification (there are no predefined classes) and a form of descriptive data mining.


Heuristics: general recommendations or guides based on statistical evidence or theoretical reasoning.


7. Keywords

Cluster analysis, Types of clustering, Cluster algorithms, Cluster distances, Creativity technique, Creative.

8. Questions

1) What is cluster analysis and what is it used for?

2) Which are the factors used in a tree clustering method?

3) Which are the methods (codes) used in clustering the industries in Burnaby?

4) High Quality similarities in good Clustering are:

Please insert T for true & F for false:

a. high intra-class similarity
b. high inter-class similarity
c. low intra-class similarity
d. low inter-class similarity

Answer:

a. T
b. F
c. F
d. T