6 Cluster Analysis (6 hrs)


6.1 What is Cluster Analysis?

6.2 Types of Data in Cluster Analysis

6.3 A Categorization of Major Clustering Methods

6.4 Partitioning Methods

6.5 Grid-Based Methods

6.6 Model-Based Methods

6.7 Clustering High-Dimensional Data

6.8 Outlier Analysis

6.9 Summary


Key Points

Clustering, Partitioning methods, Hierarchical methods, Outlier Analysis


Reading

Chapter 8

Q&A:

1. Briefly outline how to compute the dissimilarity between objects described by categorical variables.

Answer:

A categorical variable is a generalization of the binary variable in that it can
take on more than two states.

The dissimilarity between two objects i and j can be computed based on the ratio of mismatches (the simple matching approach): d(i, j) = (p - m)/p, where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.

Alternatively, we can use a large number of binary variables by creating a new binary variable for each of the M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0.
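
A minimal Python sketch of the simple-matching computation described above (the function name and the example attribute values are illustrative assumptions, not from the text):

# Simple-matching dissimilarity for categorical variables:
# d(i, j) = (p - m) / p, where m counts matching variables and p is
# the total number of variables.
def categorical_dissimilarity(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Two objects described by three categorical variables; one mismatch.
print(categorical_dissimilarity(["red", "round", "small"],
                                ["red", "square", "small"]))  # 1/3 ≈ 0.33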

2. Briefly outline how to compute the dissimilarity between objects described by ratio-scaled variables.

Answer:

Three methods include:

• Treat ratio-scaled variables as interval-scaled variables, so that the Minkowski, Manhattan, or Euclidean distance can be used to compute the dissimilarity.

• Apply a logarithmic transformation to a ratio-scaled variable f having value x_if for object i, by using the formula y_if = log(x_if). The y_if values can then be treated as interval-valued (see the sketch after this list).

• Treat x_if as continuous ordinal data, and treat their ranks as interval-scaled variables.
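
A small sketch of the second option (the helper names and values below are assumed for illustration): apply y_if = log(x_if), then compute an ordinary Euclidean distance on the transformed values.

import math

# Log-transform ratio-scaled values, then treat them as interval-scaled
# and compute the Euclidean distance on the transformed values.
def log_transform(values):
    return [math.log(v) for v in values]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

x_i = [1.0, 10.0, 100.0]   # hypothetical ratio-scaled measurements
x_j = [2.0, 20.0, 200.0]
print(euclidean(log_transform(x_i), log_transform(x_j)))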

3. Given the following measurements for the variable age:

18, 22, 25, 42, 28, 43, 33, 35, 56, 28

Standardize the variable by the following:

(a) Compute the mean absolute deviation of age.

(b) Compute the z-score for the first four measurements.


Answer:

(a) Compute the mean absolute deviation of age.

The mean absolute deviation of age is 8.8, which is derived as follows. The mean of the measurements is (18 + 22 + 25 + 42 + 28 + 43 + 33 + 35 + 56 + 28)/10 = 330/10 = 33, so the mean absolute deviation is (15 + 11 + 8 + 9 + 5 + 10 + 0 + 2 + 23 + 5)/10 = 88/10 = 8.8.

(b) Compute the z-score for the first four measurements.

According to the z-score computation formula, z_if = (x_if - mean_f)/s_f with s_f the mean absolute deviation, the z-scores of the first four measurements are (18 - 33)/8.8 ≈ -1.70, (22 - 33)/8.8 = -1.25, (25 - 33)/8.8 ≈ -0.91, and (42 - 33)/8.8 ≈ 1.02.
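
A quick Python check (not part of the original answer) that reproduces the mean absolute deviation and the z-scores above:

ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]

mean = sum(ages) / len(ages)                        # 33.0
mad = sum(abs(x - mean) for x in ages) / len(ages)  # 8.8 (mean absolute deviation)

z_scores = [(x - mean) / mad for x in ages[:4]]     # standardize the first four
print(mad, [round(z, 2) for z in z_scores])         # 8.8 [-1.7, -1.25, -0.91, 1.02]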


4. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):


(a) Compute the Euclidean distance between the two objects.

(b) Compute the Manhattan distance between the two objects.

(c) Compute the Minkowski distance between the two objects, using p = 3.


Answer:

(a) Euclidean distance = sqrt((22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2) = sqrt(4 + 1 + 36 + 4) = sqrt(45) ≈ 6.71.

(b) Manhattan distance = |22-20| + |1-0| + |42-36| + |10-8| = 2 + 1 + 6 + 2 = 11.

(c) Minkowski distance with p = 3: (|22-20|^3 + |1-0|^3 + |42-36|^3 + |10-8|^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15.
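
A short sketch (not part of the original answer) of a generic Minkowski distance that reproduces all three values; p = 1 gives Manhattan and p = 2 gives Euclidean:

def minkowski(a, b, p):
    # Minkowski distance of order p between two numeric tuples.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

obj1 = (22, 1, 42, 10)
obj2 = (20, 0, 36, 8)

print(minkowski(obj1, obj2, 2))   # Euclidean      ≈ 6.708
print(minkowski(obj1, obj2, 1))   # Manhattan      = 11.0
print(minkowski(obj1, obj2, 3))   # Minkowski p=3  ≈ 6.153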

5. Briefly describe the concepts of clustering and list several approaches to clustering.

Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. There are several approaches to clustering: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and constraint-based methods.

6. What is the main concept of model-based methods for clustering?

Model-based methods: This approach hypothesizes a model for each of the clusters and finds the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics. It takes “noise” or outliers into account, thereby contributing to the robustness of the approach. COBWEB and self-organizing feature maps are examples of model-based clustering.
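
The text names COBWEB and self-organizing feature maps; another standard model-based approach is a Gaussian mixture fit by EM. A minimal sketch using scikit-learn (the toy data below is made up):

import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
               rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))])

# Hypothesize a 2-component Gaussian model and fit it to the data by EM.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)   # cluster assignment for each point
print(gm.means_)         # the fitted cluster centers (the model's parameters)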


7. Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into three clusters.

A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9).


The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only


(a) The three cluster centers after the first round of execution and

(b) The final three clusters


Answer:


(a) After the first round, the three new clusters are: (1) {A1}, (2) {B1, A3, B2, B3, C2}, (3) {C1, A2}, and their centers are (1) (2, 10), (2) (6, 6), (3) (1.5, 3.5).

(b) The final three clusters are: (1) {A1, C2, B1}, (2) {A3, B2, B3}, (3) {C1, A2}.
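
A small Python sketch (not part of the original answer) that reruns this k-means computation with A1, B1, and C1 as the initial centers; it reproduces the round-one clusters and centers in (a) and the final clusters in (b):

import math

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centers = [points["A1"], points["B1"], points["C1"]]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

for rnd in range(10):                                # a few rounds suffice here
    clusters = [[] for _ in centers]
    for name, p in points.items():                   # assignment step
        k = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[k].append(name)
    new_centers = [(sum(points[n][0] for n in c) / len(c),   # update step:
                    sum(points[n][1] for n in c) / len(c))   # mean of each cluster
                   for c in clusters]
    print("round", rnd + 1, clusters, new_centers)
    if new_centers == centers:                       # stop when centers stop moving
        break
    centers = new_centers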


8. Both k-means and k-medoids algorithms can perform effective clustering. Illustrate the strength and weakness of k-means in comparison with the k-medoids algorithm. Also, illustrate the strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES).

Answer:


(a) Illustrate the strength and weakness of k-means in comparison with the k-medoids algorithm.

The k-medoids algorithm is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
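
A tiny numeric illustration of this robustness point (the values below are made up): with one extreme value present, the mean is dragged toward it, while the medoid, being an actual data point chosen to minimize total distance, barely moves.

# Mean vs. medoid of a small 1-D set containing an outlier.
def medoid(vals):
    # The medoid is the data point minimizing total distance to all others.
    return min(vals, key=lambda c: sum(abs(c - v) for v in vals))

values = [1, 2, 3, 4, 100]          # 100 is an outlier
print(sum(values) / len(values))    # mean = 22.0, pulled toward the outlier
print(medoid(values))               # medoid = 3, barely affected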

(b) Illustrate the strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES).

Both k-means and k-medoids perform partitioning-based clustering. An advantage of such partitioning approaches is that they can undo previous clustering steps (by iterative relocation), unlike hierarchical methods, which cannot make adjustments once a split or merge has been executed. This weakness of hierarchical methods can cause the quality of their resulting clustering to suffer. Partitioning-based methods, on the other hand, require the number of clusters k to be specified in advance, which hierarchical methods do not.

9. Clustering has been popularly recognized as an important data mining task with broad applications. In the context of business, give an application example that takes clustering as a major data mining function.

An example that takes clustering as a major data mining function could be a system that identifies groups of houses in a city according to house type, value, and geographical location. More specifically, a clustering algorithm like CLARANS can be used to discover that, say, the most expensive housing units in Vancouver can be grouped into just a few clusters.

10. Why is outlier mining important? Briefly describe the different approaches behind statistical-based outlier detection.

Answer:

Data objects that are grossly different from, or inconsistent with, the remaining set of data are called “outliers”. Outlier mining is useful for detecting fraudulent activity (such as credit card or telecom fraud), as well as customer segmentation and medical analysis. Computer-based outlier analysis may be statistical-based, distance-based, or deviation-based.

The statistical-based approach assumes a distribution or probability model for the given data set and then identifies outliers with respect to the model using a discordancy test. The discordancy test is based on data distribution, distribution parameters (e.g., mean, variance), and the number of expected outliers. The drawbacks of this method are that most tests are for single attributes, and in many cases, the data distribution may not be known.
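
A minimal sketch of the statistical-based idea for a single attribute (assumed for illustration, not taken from the text): assume a normal model and flag values whose z-score exceeds a chosen threshold.

import statistics

def discordant(values, threshold=2.0):
    # Assume a normal model; flag values more than `threshold` standard
    # deviations away from the mean (a simple discordancy-style test).
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

data = [12, 13, 11, 12, 14, 13, 12, 95]   # 95 is grossly different
print(discordant(data))                    # [95]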