Clustering: An Overview



1. Introduction
Clustering algorithms map data items into clusters, where a cluster is a natural grouping
of data items based on some measure of similarity. Unlike classification and prediction,
which analyse class-labelled data objects, clustering analyses data objects without class
labels and tries to generate such labels. Clustering has many applications. In business and
marketing, clustering can help identify different customer groups so that an appropriate
marketing campaign can be targeted at each group. In agriculture, it can be used to derive
plant and animal taxonomies and to characterize diseases and varieties; in bioinformatics,
it is used to categorize genes with similar functionality. It can also be used to group similar
documents on the web for faster discovery of content, or to group geographical locations
based on crime, amenities, weather, and so on. As a data mining function, cluster analysis
is used to gain insight into the distribution of the data, to observe the characteristics of each
cluster, and to focus on a particular set of clusters for further analysis.

2. Similarity Measures
Similarity is fundamental to the majority of clustering algorithms. Similarity is a quantity
that reflects the strength of the relationship between two objects or two features. This
quantity usually ranges either from -1 to +1 or is normalized into 0 to 1. If the similarity
between feature $i$ and feature $j$ is denoted by $s_{ij}$, we can measure this quantity in
several ways depending on the scale of measurement (or data type) at hand. Dissimilarity
is the opposite of similarity, and there are many types of distance and similarity measures.
Similarity and dissimilarity can be measured for two objects based on several features
(variables). After the distance or similarity for each variable is determined, we can
aggregate all variables into a single similarity (or dissimilarity) index between the two
objects.

2.1 Distance for binary variables
We often face variables that take only binary values, such as Yes and No, Agree and
Disagree, True and False, Success and Failure, 0 and 1, Absent and Present, or Positive and
Negative. The similarity or dissimilarity (distance) of two objects represented by binary
variables can be measured in terms of the number of occurrences (frequency) of positive
and negative values in each object.
For example, consider two fruits described by four binary features:

Feature of fruit  | Sphere shape | Sweet | Sour | Crunchy
Object = Apple    | Yes          | Yes   | Yes  | Yes
Object = Banana   | No           | Yes   | No   | No


The coordinate of Apple is (1, 1, 1, 1) and the coordinate of Banana is (0, 1, 0, 0). Because each
object is represented by 4 variables, we say that these objects have 4 dimensions.
Let p = number of variables that are positive for both objects,
q = number of variables that are positive for the i-th object and negative for the j-th object,
r = number of variables that are negative for the i-th object and positive for the j-th object,
s = number of variables that are negative for both objects,
t = p + q + r + s = total number of variables.

These counts can be arranged in a 2 x 2 contingency table:

                  object j = Yes | object j = No
object i = Yes          p        |       q
object i = No           r        |       s
For our example above, Apple and Banana have p = 1, q = 3, r = 0 and s = 0. Thus,
t = p + q + r + s = 4.
The most commonly used binary dissimilarity (distance) measures are:
• Simple matching distance: $d = (q + r)/t$
• Jaccard's distance: $d = (q + r)/(p + q + r)$
• Hamming distance: $d = q + r$
Example: The simple matching distance between Apple and Banana is 3/4, Jaccard's
distance between Apple and Banana is 3/4, and the Hamming distance between Apple and
Banana is 3.
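
The counts p, q, r and s, and the three distances above, can be computed directly from the two binary feature vectors. Below is a minimal Python sketch (not part of the original notes; the function name is illustrative) that reproduces the Apple/Banana numbers.

```python
def binary_distances(x, y):
    """Compute p, q, r, s and three binary distances for two 0/1 vectors."""
    p = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # positive in both
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)   # positive in x, negative in y
    r = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)   # negative in x, positive in y
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)   # negative in both
    t = p + q + r + s
    return {
        "simple_matching": (q + r) / t,
        "jaccard": (q + r) / (p + q + r),
        "hamming": q + r,
    }

apple = [1, 1, 1, 1]    # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]
print(binary_distances(apple, banana))
# -> {'simple_matching': 0.75, 'jaccard': 0.75, 'hamming': 3}
```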

2.2 Distance for quantitative variables
Quantitative variables take numeric values. For example, consider two objects measured on
four features:

Features  | cost | time | weight | incentive
Object A  |  0   |  3   |   4    |    5
Object B  |  7   |  6   |   3    |   -1

We can represent the two objects as points in a 4-dimensional space. Point A has coordinates
(0, 3, 4, 5) and point B has coordinates (7, 6, 3, -1). The dissimilarity (or similarity) between
the two objects is computed from these coordinates.

Euclidean Distance: Euclidean distance is the most commonly used distance; in most cases,
when people talk about distance, they mean Euclidean distance. Euclidean distance, or
simply 'distance', examines the root of the sum of squared differences between the
coordinates of a pair of objects:

$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

Euclidean distance is a special case of Minkowski distance with $\lambda = 2$.
City Block (Manhattan) Distance: It is also known as Manhattan distance, boxcar distance
or absolute value distance. It examines the sum of the absolute differences between the
coordinates of a pair of objects:

$d(A, B) = \sum_{i=1}^{n} |a_i - b_i|$

City block distance is a special case of Minkowski distance with $\lambda = 1$. The city
block distance between point A and point B above is $|0-7| + |3-6| + |4-3| + |5-(-1)| = 17$.
Chebyshev Distance: Chebyshev distance is also called maximum value distance. It
examines the absolute magnitude of the differences between the coordinates of a pair of
objects; this distance can be used for both ordinal and quantitative variables:

$d(A, B) = \max_{i} |a_i - b_i|$

The Chebyshev distance between point A and point B above is $\max(7, 3, 1, 6) = 7$.
Minkowski Distance: This is the generalized metric distance. When $\lambda = 1$ it becomes
city block distance, and when $\lambda = 2$ it becomes Euclidean distance. Chebyshev
distance is a special case of Minkowski distance with $\lambda \to \infty$ (taking the limit).
This distance can be used for both ordinal and quantitative variables:

$d(A, B) = \left( \sum_{i=1}^{n} |a_i - b_i|^{\lambda} \right)^{1/\lambda}$
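
As a quick check of the formulas above, here is a small Python sketch (not from the original notes; the helper names are illustrative) that evaluates the four distances for points A = (0, 3, 4, 5) and B = (7, 6, 3, -1).

```python
def minkowski(a, b, lam):
    """Minkowski distance of order lam between two equal-length sequences."""
    return sum(abs(x - y) ** lam for x, y in zip(a, b)) ** (1.0 / lam)

def chebyshev(a, b):
    """Chebyshev (maximum value) distance: the limit of Minkowski distance as lam grows."""
    return max(abs(x - y) for x, y in zip(a, b))

A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
print("Euclidean :", minkowski(A, B, 2))   # sqrt(95) ~ 9.75
print("City block:", minkowski(A, B, 1))   # 17.0
print("Chebyshev :", chebyshev(A, B))      # 7
```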



3. Clustering Algorithms
There are many clustering algorithms available in the literature; the choice of an appropriate
algorithm depends on the data type and the desired results. We focus here on commonly
used clustering algorithms.

3.1 Hierarchical Algorithms
A hierarchical method creates a hierarchical decomposition of the data objects in the form of
a tree-like diagram called a dendrogram. There are two approaches to building a cluster
hierarchy. The agglomerative (bottom-up) approach starts with each object forming a
separate group and successively merges the objects or groups closest to one another, until
all the groups are merged into one. The divisive (top-down) approach starts with all the
objects in a single cluster and successively splits clusters until each object is in a cluster of
its own.


The process flow of the agglomerative hierarchical clustering method is given below:
• Convert object features to a distance matrix.
• Set each object as a cluster (thus, if we have 6 objects, we will have 6 clusters in the
beginning).
• Iterate until the number of clusters is 1:
1. Merge the two closest clusters.
2. Update the distance matrix.
First, the distance matrix is computed using any valid distance measure between pairs of
objects. The choice of which clusters to merge is determined by a linkage criterion, which
is a function of the pairwise distances between observations. Commonly used linkage
criteria are listed below (a short code sketch of the whole procedure follows the list):
• Complete Linkage: the maximum distance between elements of the two clusters

• Single Linkage: the minimum distance between elements of the two clusters

• Average Linkage (UPGMA): the mean distance between elements of the two clusters
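
The steps above can be translated almost directly into code. Below is a minimal sketch, assuming plain Python lists of numeric feature vectors and single linkage; it is illustrative rather than an optimized implementation, and the function names are not from the original notes.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    """Single linkage: the minimum distance between members of the two clusters."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points):
    """Agglomerative clustering with single linkage; returns the merge history."""
    clusters = [[i] for i in range(len(points))]   # start: every object is its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        candidates = [(single_linkage(a, b, points), a, b)
                      for idx, a in enumerate(clusters) for b in clusters[idx + 1:]]
        d, a, b = min(candidates, key=lambda t: t[0])
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
        merges.append((a, b, round(d, 2)))         # which clusters merged, and at what height
    return merges
```

Called on the feature vectors of the example in the next subsection, agglomerative should record the same sequence of merges that is worked out by hand there, starting with D and F at height 0.5.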




3.1.1 Example:
Consider 6 objects (named A, B, C, D, E and F), each with two measured features (X1 and
X2). We can plot the features in a scatter plot to visualize the proximity between the objects.
For example, the distance between object A = (1, 1) and object B = (1.5, 1.5) is computed as

$d(A, B) = \sqrt{(1 - 1.5)^2 + (1 - 1.5)^2} = 0.71$

Another example: the distance between object D = (3, 4) and object F = (3, 3.5) is calculated as

$d(D, F) = \sqrt{(3 - 3)^2 + (4 - 3.5)^2} = 0.5$

In the same way, we can compute all pairwise distances between the objects and arrange
them in a matrix. Since distance is symmetric (i.e. the distance between A and B equals the
distance between B and A), only one triangle of the matrix is needed, and the diagonal
elements of the distance matrix are zero, representing the distance from an object to itself.

Clearly the minimum distance is 0.5 (between objects D and F). Thus, we group D and F
into cluster (D, F) and update the distance matrix. Distances between the remaining
ungrouped clusters do not change from the original distance matrix. The question now is
how to calculate the distance between the newly formed cluster (D, F) and the other clusters.


That is exactly where the linkage rule comes into effect. Using single linkage, we take the
minimum distance between the original objects of the two clusters.

Using the input distance matrix, the distance between cluster (D, F) and cluster A is
computed as

$d((D, F), A) = \min\{d(D, A), d(F, A)\} = \min\{3.61, 3.20\} = 3.20$
Then the updated distance matrix is formed.

Looking at the lower triangle of the updated distance matrix, we find that the closest
distance, 0.71, is now between cluster A and cluster B. Thus, we group cluster A and cluster
B into a single cluster named (A, B).

Now we update the distance matrix again. Aside from the first row and first column, all the
other elements of the new distance matrix are unchanged.

Using the input distance matrix (size 6 by 6), the distance between cluster C and cluster
(D, F) is computed as

$d(C, (D, F)) = \min\{d(C, D), d(C, F)\}$

The distance between cluster (D, F) and cluster (A, B) is the minimum distance over all
objects involved in the two clusters:

$d((D, F), (A, B)) = \min\{d(D, A), d(D, B), d(F, A), d(F, B)\}$


Then the updated distance matrix is formed.

Observing the lower triangle of the updated distance matrix, we can see that the closest
distance between clusters, 1.00, now occurs between cluster E and cluster (D, F). Thus, we
merge them into cluster ((D, F), E) and update the distance matrix again.

The distance between cluster ((D, F), E) and cluster (A, B) is calculated as

$d(((D, F), E), (A, B)) = \min\{d(x, y) : x \in \{D, F, E\},\ y \in \{A, B\}\}$

After this final update of the distance matrix, all the merge heights are known and the final
result can be drawn as a dendrogram.
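
For completeness, the same hierarchy can be produced with SciPy's hierarchical clustering routines. The coordinates of C and E are not legible in these notes, so the values used for them below are placeholders (marked as such); with the original coordinates, this sketch would reproduce the dendrogram described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["A", "B", "C", "D", "E", "F"]
X = np.array([
    [1.0, 1.0],   # A (from the example)
    [1.5, 1.5],   # B (from the example)
    [5.0, 5.0],   # C -- placeholder, not given in the notes
    [3.0, 4.0],   # D (from the example)
    [4.0, 4.0],   # E -- placeholder, not given in the notes
    [3.0, 3.5],   # F (from the example)
])

Z = linkage(X, method="single", metric="euclidean")  # single-linkage agglomerative clustering
dendrogram(Z, labels=labels)
plt.show()
```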



3.2 Partitional Algorithms
Partitional methods segment the data objects into k partitions, optimizing some criterion
function over t iterations. These methods are popularly known as iterative relocation methods.

3.2.1 K-means Algorithm
K-means is the most popularly used algorithm in this category. It randomly selects k
objects as the initial cluster means (centers) and works towards optimizing the squared-error
criterion function, defined as:


$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2$, where $m_i$ is the mean of cluster $C_i$.
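
This criterion translates directly into code. The helper below is an illustrative sketch, not part of the original notes.

```python
def squared_error(points, assignment, centroids):
    """Sum of squared Euclidean distances from each object to its cluster mean."""
    return sum(sum((x - m) ** 2 for x, m in zip(p, centroids[c]))
               for p, c in zip(points, assignment))
```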

Main steps of the k-means algorithm are:

1) Assign initial means $m_i$.
2) Assign each data object $x$ to the cluster $C_i$ with the closest mean.
3) Compute the new mean of each cluster.
4) Iterate until the criterion function converges, that is, until there are no more new
assignments.

The k-means algorithm is sensitive to outliers since an object with an extremely large value
may substantially distort the distribution of data.
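
The steps above translate into a short implementation. Below is a minimal sketch in plain Python (not from the original notes), assuming numeric feature vectors and Euclidean distance; the function and variable names are illustrative.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return tuple(sum(xs) / len(xs) for xs in zip(*points))

def kmeans(points, k, max_iter=100, seed=None):
    """Plain k-means following steps 1-4 above; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))        # step 1: pick k objects as the initial means
    assignments = None
    for _ in range(max_iter):
        # step 2: assign each object to the cluster whose mean is closest
        new_assignments = [min(range(k), key=lambda i: euclidean(p, centroids[i]))
                           for p in points]
        if new_assignments == assignments:         # step 4: stop when nothing changes
            break
        assignments = new_assignments
        # step 3: recompute the mean of every cluster (keep the old mean if a cluster is empty)
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = mean(members)
    return centroids, assignments
```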

Example:
Suppose we have several objects (4 types of medicines), each with two attributes or
features, as shown in the table below. Our goal is to group these objects into K = 2 groups
of medicines based on the two features (weight index and pH).
Object     | attribute 1 (X): weight index | attribute 2 (Y): pH
Medicine A |               1               |          1
Medicine B |               2               |          1
Medicine C |               4               |          3
Medicine D |               5               |          4

1. Initial value of centroids: Suppose we use medicine A and medicine B as the first
centroids. Let $c_1$ and $c_2$ denote the coordinates of the centroids; then $c_1 = (1, 1)$
and $c_2 = (2, 1)$.

2. Objects-centroids distances: We calculate the distance from each cluster centroid to each
object. Using Euclidean distance, the distance matrix at iteration 0 is

$D^0 = \begin{pmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{pmatrix}$


Each column of the distance matrix corresponds to one object. The first row of the distance
matrix holds the distance of each object to the first centroid, and the second row the distance
of each object to the second centroid. For example, the distance from medicine C = (4, 3) to
the first centroid is $\sqrt{(4-1)^2 + (3-1)^2} = 3.61$, and its distance to the second centroid
is $\sqrt{(4-2)^2 + (3-1)^2} = 2.83$.

3. Objects clustering: We assign each object to the group with the minimum distance. Thus,
medicine A is assigned to group 1, and medicines B, C and D to group 2. The element of the
group matrix below is 1 if and only if the object is assigned to that group:

$G^0 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}$
4. Iteration 1, determine centroids: Knowing the members of each group, we now compute
the new centroid of each group based on these new memberships. Group 1 has only one
member, so its centroid remains $c_1 = (1, 1)$. Group 2 now has three members, so its
centroid is the average coordinate of its three members:
$c_2 = \left(\tfrac{2+4+5}{3}, \tfrac{1+3+4}{3}\right) = (11/3, 8/3)$.
5. Iteration 1, objects-centroids distances: The next step is to compute the distance of all
objects to the new centroids. As in step 2, the distance matrix at iteration 1 is

$D^1 = \begin{pmatrix} 0 & 1 & 3.61 & 5 \\ 3.14 & 2.36 & 0.47 & 1.89 \end{pmatrix}$

6. Iteration 1, objects clustering: As in step 3, we assign each object based on the minimum
distance. Based on the new distance matrix, we move medicine B to group 1 while all the
other objects remain where they were. The group matrix becomes

$G^1 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$



7. Iteration 2, determine centroids: We repeat step 4 to calculate the new centroid
coordinates based on the clustering of the previous iteration. Groups 1 and 2 both have two
members, so the new centroids are
$c_1 = \left(\tfrac{1+2}{2}, \tfrac{1+1}{2}\right) = (1.5, 1)$ and
$c_2 = \left(\tfrac{4+5}{2}, \tfrac{3+4}{2}\right) = (4.5, 3.5)$.

8. Iteration 2, objects-centroids distances: Repeating step 2 with the new centroids, the
distance matrix at iteration 2 is

$D^2 = \begin{pmatrix} 0.5 & 0.5 & 3.20 & 4.61 \\ 4.30 & 3.54 & 0.71 & 0.71 \end{pmatrix}$

9. Iteration 2, objects clustering: Again, we assign each object based on the minimum
distance and obtain the same grouping as before ($G^2 = G^1$). Comparing the grouping of
the last iteration with this iteration reveals that no object moves to another group. The k-
means clustering has therefore reached stability and no more iterations are needed.
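
For verification, here is a short, self-contained trace of the medicine example (not part of the original notes), with medicines A and B fixed as the initial centroids as in step 1; it prints the grouping and centroids at each iteration and stops when nothing changes.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(pts):
    return tuple(sum(xs) / len(xs) for xs in zip(*pts))

points = [(1, 1), (2, 1), (4, 3), (5, 4)]   # medicines A, B, C, D: (weight index, pH)
centroids = [(1.0, 1.0), (2.0, 1.0)]        # step 1: medicines A and B as the initial centroids

for iteration in range(10):
    # steps 2-3: assign each medicine to its nearest centroid, then recompute the centroids
    groups = [min(range(2), key=lambda i: euclidean(p, centroids[i])) for p in points]
    new_centroids = [mean([p for p, g in zip(points, groups) if g == i]) for i in range(2)]
    print(iteration, groups, new_centroids)
    if new_centroids == centroids:          # step 4: stop when the centroids no longer change
        break
    centroids = new_centroids
# final grouping: A, B -> group 1 and C, D -> group 2, with centroids (1.5, 1) and (4.5, 3.5)
```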
4. Case Study in Agriculture
4.1 Soybean dataset
The soybean disease dataset contains 47 objects and 35 multi-valued variables characterizing
the diseases diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot and phytophthora-rot.
The variables are broadly categorized into environmental descriptors, condition of leaves,
condition of stem, condition of fruit pods and condition of root. It was observed that some
variables take only a single value across the dataset; such variables are irrelevant and were
removed. The reduced dataset has 20 variables characterizing the soybean diseases. The
dataset also contains an instance number and a class variable, which were not considered
while clustering.


4.2 Soybean dataset: Clustering
The k-means clustering algorithm was applied to the soybean disease dataset with the
number of clusters set to four. Four disease clusters were obtained, corresponding to the
diseases diaporthe-stem-canker (cluster 1), charcoal-rot (cluster 2), rhizoctonia-root-rot
(cluster 3) and phytophthora-rot (cluster 4). The table below presents the result of the
clustering algorithm on the soybean disease dataset.
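
A minimal sketch of this experiment with scikit-learn is shown below, assuming the reduced 20-variable dataset has been exported to a CSV file (the file name and column names are assumptions, not part of the original notes); note that k-means treats the coded categorical values as numbers, as was done in this case study.

```python
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical file: one row per instance, the retained variables plus "sno" and "class"
data = pd.read_csv("soybean_reduced.csv")
X = data.drop(columns=["sno", "class"])      # instance number and class label are not clustered

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
data["cluster"] = kmeans.fit_predict(X)

# compare the discovered clusters with the known disease classes
print(pd.crosstab(data["class"], data["cluster"]))
```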






Table: Soybean dataset with clustering results

sno | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v12 | v20 | v21 | v22 | v23 | v24 | v25 | v26 | v27 | v28 | v35 | Cluster
0 | 4 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
1 | 5 | 0 | 2 | 1 | 0 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
2 | 3 | 0 | 2 | 1 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
3 | 6 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
4 | 4 | 0 | 2 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
5 | 5 | 0 | 2 | 1 | 0 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
6 | 3 | 0 | 2 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 1 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
7 | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
8 | 6 | 0 | 2 | 1 | 0 | 3 | 0 | 1 | 1 | 1 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
9 | 6 | 0 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | cluster1
10 | 6 | 0 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
11 | 4 | 0 | 0 | 1 | 0 | 2 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
12 | 5 | 0 | 0 | 2 | 0 | 3 | 2 | 1 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
13 | 6 | 0 | 0 | 1 | 1 | 3 | 3 | 1 | 1 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
14 | 3 | 0 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
15 | 4 | 0 | 0 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
16 | 3 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
17 | 5 | 0 | 0 | 2 | 1 | 2 | 2 | 1 | 0 | 2 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
18 | 6 | 0 | 0 | 2 | 0 | 1 | 3 | 1 | 1 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
19 | 5 | 0 | 0 | 2 | 1 | 3 | 3 | 1 | 1 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | cluster2
20 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | cluster3
21 | 2 | 1 | 2 | 0 | 0 | 3 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | cluster3
22 | 2 | 1 | 2 | 0 | 0 | 2 | 1 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | cluster3
23 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | cluster3
24 | 0 | 1 | 2 | 0 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | cluster3
25 | 4 | 0 | 2 | 0 | 1 | 0 | 1 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | cluster3
26 | 2 | 1 | 2 | 0 | 0 | 3 | 1 | 2 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | cluster3
27 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | cluster3
28 | 3 | 0 | 2 | 0 | 1 | 3 | 1 | 2 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | cluster3
29 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | cluster3
30 | 2 | 1 | 2 | 1 | 1 | 3 | 1 | 2 | 1 | 2 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | cluster4
31 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
32 | 3 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
33 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 0 | 2 | 1 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | cluster4
34 | 1 | 1 | 2 | 0 | 0 | 3 | 1 | 1 | 1 | 2 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
35 | 1 | 1 | 2 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
36 | 0 | 1 | 2 | 1 | 0 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
37 | 2 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
38 | 3 | 1 | 2 | 0 | 0 | 2 | 1 | 2 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
39 | 3 | 1 | 1 | 0 | 0 | 2 | 1 | 2 | 1 | 2 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
40 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | cluster4
41 | 1 | 1 | 2 | 1 | 1 | 3 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | cluster4
42 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
43 | 1 | 1 | 2 | 1 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | cluster4
44 | 2 | 1 | 1 | 0 | 0 | 3 | 1 | 2 | 0 | 2 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4
45 | 0 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 0 | 1 | 1 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | cluster4
46 | 0 | 1 | 2 | 1 | 0 | 3 | 1 | 1 | 0 | 2 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | cluster4