Clustering: An Overview
1. Introduction
Clustering algorithms map data items into clusters, where clusters are natural groupings
of data items based on similarity measures. Unlike classification and prediction, which
analyze class-labelled data objects, clustering analyzes data objects without class labels and
tries to generate such labels. Clustering has many applications. In business and marketing,
clustering can help identify different customer groups so that an appropriate marketing
campaign can be targeted at each group. In agriculture, it can be used to
derive plant and animal taxonomies and to characterize diseases and varieties; in
bioinformatics, to categorize genes with similar functionality. Further, it can be used to
group similar documents on the web for faster discovery of content, or to group
geographical locations based on crime, amenities, weather, etc. As a data mining function,
cluster analysis is used to gain insight into the distribution of data, to observe the
characteristics of each cluster and to focus on a particular set of clusters for further
analysis.
2. Similarity Measures
Similarity is fundamental to the majority of clustering algorithms. Similarity is a quantity
that reflects the strength of the relationship between two objects or two features. This
quantity usually ranges either from -1 to +1 or is normalized into 0 to 1. If the similarity
between feature i and feature j is denoted by s_ij, we can measure this quantity in several
ways depending on the scale of measurement (or data type) that we have. Dissimilarity is
the opposite of similarity. There are many types of distance and similarity measures.
Similarity and dissimilarity can be measured for two objects based on several features/
variables. After the distance or similarity for each variable is determined, we can aggregate
all features/variables together into a single similarity (or dissimilarity) index between the
two objects.
2.1 Distance for binary variables
We often face variables that take only binary values, such as Yes and No, Agree and
Disagree, True and False, Success and Failure, 0 and 1, Absent and Present, Positive and
Negative, etc. Similarity or dissimilarity (distance) of two objects represented by binary
variables can be measured in terms of the number of occurrences (frequency) of positive
and negative values in each object.
For example:

Feature of Fruit   Object i = Apple   Object j = Banana
Sphere shape       Yes                No
Sweet              Yes                Yes
Sour               Yes                No
Crunchy            Yes                No
Winter School on "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets"
The coordinate of Apple is (1, 1, 1, 1) and that of Banana is (0, 1, 0, 0). Because each
object is represented by 4 variables, we say that these objects have 4 dimensions.
Let p = number of variables that are positive for both objects,
q = number of variables that are positive for the i-th object and negative for the j-th object,
r = number of variables that are negative for the i-th object and positive for the j-th object,
s = number of variables that are negative for both objects,
t = p + q + r + s = total number of variables.
                 Object j = Yes   Object j = No
Object i = Yes        p               q
Object i = No         r               s
For our example above, Apple and Banana have p = 1, q = 3, r = 0 and
s = 0. Thus, t = p + q + r + s = 4.
The most commonly used binary dissimilarity (distance) measures are:
Simple matching distance = (q + r) / (p + q + r + s)
Jaccard's distance = (q + r) / (p + q + r)
Hamming distance = q + r
Example: Simple matching distance between Apple and Banana is (3 + 0)/4 = 3/4.
Jaccard's distance between Apple and Banana is (3 + 0)/(1 + 3 + 0) = 3/4.
Hamming distance between Apple and Banana is 3 + 0 = 3.
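A minimal Python sketch of these three measures on the Apple/Banana example, encoding Yes = 1 and No = 0:

```python
# Feature vectors for the fruit example (Yes = 1, No = 0).
apple  = [1, 1, 1, 1]   # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]

# Count agreements/disagreements p, q, r, s as defined above.
p = sum(1 for a, b in zip(apple, banana) if a == 1 and b == 1)
q = sum(1 for a, b in zip(apple, banana) if a == 1 and b == 0)
r = sum(1 for a, b in zip(apple, banana) if a == 0 and b == 1)
s = sum(1 for a, b in zip(apple, banana) if a == 0 and b == 0)
t = p + q + r + s

simple_matching = (q + r) / t        # mismatches over all variables
jaccard = (q + r) / (p + q + r)      # ignores negative matches (s)
hamming = q + r                      # raw count of mismatches

print(simple_matching, jaccard, hamming)  # 0.75 0.75 3
```

Note how Jaccard's distance drops s from the denominator: joint absences carry no evidence of similarity, which matters for sparse binary data.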
2.2 Distance for quantitative variables
These are variables which take quantitative (numeric) values. For example:
Features    cost   time   weight   incentive
Object A     0      3       4         5
Object B     7      6       3         1
We can represent the two objects as points in 4 dimensions. Point A has coordinate (0, 3, 4,
5) and point B has coordinate (7, 6, 3, 1). Dissimilarity (or similarity) between the two
objects is based on these coordinates.
Euclidean Distance: Euclidean distance is the most commonly used distance. In most
cases when people talk about distance, they refer to Euclidean distance. Euclidean
distance, or simply 'distance', examines the root of squared differences between the
coordinates of a pair of objects.
Formula: d(A, B) = sqrt(sum_i (x_i - y_i)^2)
For our example, d(A, B) = sqrt((0-7)^2 + (3-6)^2 + (4-3)^2 + (5-1)^2) = sqrt(75) = 8.66.
Euclidean distance is a special case of Minkowski distance with lambda = 2.
City Block (Manhattan) Distance: It is also known as Manhattan distance, boxcar
distance or absolute value distance. It examines the sum of the absolute differences
between the coordinates of a pair of objects. City block distance is a special case of
Minkowski distance with lambda = 1.
Formula: d(A, B) = sum_i |x_i - y_i|
The city block distance between point A and B is |0-7| + |3-6| + |4-3| + |5-1| = 15.
Chebyshev Distance: Chebyshev distance is also called maximum value distance. It
examines the absolute magnitude of the differences between the coordinates of a pair of
objects. This distance can be used for both ordinal and quantitative variables.
Formula: d(A, B) = max_i |x_i - y_i|
The Chebyshev distance between point A and B is max(7, 3, 1, 4) = 7.
Minkowski Distance: This is the generalized metric distance. When lambda = 1 it becomes
city block distance and when lambda = 2 it becomes Euclidean distance. Chebyshev distance
is a special case of Minkowski distance with lambda approaching infinity (taking a limit).
This distance can be used for both ordinal and quantitative variables.
Formula: d(A, B) = [sum_i |x_i - y_i|^lambda]^(1/lambda)
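These distances can be checked with a short Python sketch for A = (0, 3, 4, 5) and B = (7, 6, 3, 1):

```python
# Points from the quantitative-variables example.
A = (0, 3, 4, 5)
B = (7, 6, 3, 1)

def minkowski(x, y, lam):
    """Generalized Minkowski distance with parameter lambda."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1 / lam)

euclidean = minkowski(A, B, 2)                      # sqrt(49 + 9 + 1 + 16) = sqrt(75)
city_block = minkowski(A, B, 1)                     # 7 + 3 + 1 + 4 = 15
chebyshev = max(abs(a - b) for a, b in zip(A, B))   # limit of Minkowski as lambda -> infinity

print(round(euclidean, 2), city_block, chebyshev)   # 8.66 15.0 7
```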
3. Clustering Algorithms
There are many clustering algorithms available in the literature; the choice of an
appropriate algorithm depends on the data type and desired results. We will focus on
commonly used clustering algorithms.
3.1 Hierarchical Algorithms
A hierarchical method creates a hierarchical decomposition of data objects in the form of a
tree-like diagram called a dendrogram. There are two approaches to building a
cluster hierarchy. The agglomerative approach, also called the bottom-up approach, starts
with each object forming a separate group and successively merges the objects or groups
close to one another, until all the groups are merged into one. The divisive approach, also
called the top-down approach, starts with all the objects in the same cluster and
successively splits clusters, until each object is in a cluster of its own.
The process flow of the agglomerative hierarchical clustering method is given below:
• Convert object features to a distance matrix.
• Set each object as a cluster (thus if we have 6 objects, we will have 6 clusters in the
beginning).
• Iterate until the number of clusters is 1:
1. Merge the two closest clusters.
2. Update the distance matrix.
First, the distance matrix is computed using any valid distance measure between pairs of
objects. The choice of which clusters to merge is determined by a linkage criterion, which
is a function of the pairwise distances between observations. Commonly used linkage
criteria are mentioned below:
• Complete linkage: the maximum distance between elements of each cluster
• Single linkage: the minimum distance between elements of each cluster
• Average linkage (UPGMA): the mean distance between elements of each cluster
(Figure: an example dendrogram over objects s1, s2, s3, s4, s5)
3.1.1 Example:
Consider 6 objects (named A, B, C, D, E and F), each with two measured
features (X1 and X2). We can plot the features in a scatter plot to visualize the
proximity between objects.
For example, the distance between object A = (1, 1) and B = (1.5, 1.5) is computed as
d(A, B) = sqrt((1 - 1.5)^2 + (1 - 1.5)^2) = 0.71.
Another example: the distance between object D = (3, 4) and F = (3, 3.5) is calculated as
d(D, F) = sqrt((3 - 3)^2 + (4 - 3.5)^2) = 0.5.
In the same way, we can compute all pairwise distances between objects and put the
distances into matrix form. The distance matrix is symmetric (i.e. the distance between A
and B is equal to the distance between B and A), and its diagonal elements are zero,
representing the distance from an object to itself.
Clearly the minimum distance is 0.5 (between objects D and F). Thus, we group D
and F into cluster (D, F). Then we update the distance matrix. Distances between ungrouped
clusters do not change from the original distance matrix. Now the problem is how to
calculate the distance between the newly grouped cluster (D, F) and the other clusters.
That is exactly where the linkage rule comes into effect. Using single linkage, we take the
minimum distance between the original objects of the two clusters.
Using the input distance matrix, the distance between cluster (D, F) and cluster A is
computed as
d((D, F), A) = min(d(D, A), d(F, A)) = min(3.61, 3.20) = 3.20
Then the distance matrix is updated. Looking at the lower triangle of the updated distance
matrix, we find that the closest distance, 0.71, is now between cluster A and cluster B.
Thus, we group cluster A and cluster B into a single cluster named (A, B).
Now we update the distance matrix. Aside from the first row and first column, all the other
elements of the new distance matrix are unchanged.
Using the input distance matrix (size 6 by 6), the distance between cluster C and cluster
(D, F) is computed as d(C, (D, F)) = min(d(C, D), d(C, F)).
The distance between cluster (D, F) and cluster (A, B) is the minimum distance between all
objects involved in the two clusters:
d((D, F), (A, B)) = min(d(D, A), d(D, B), d(F, A), d(F, B))
Observing the lower triangle of the updated distance matrix, we can see that the closest
distance between clusters is now between cluster E and (D, F), at distance 1.00. Thus, we
merge them into cluster ((D, F), E).
The distance between cluster ((D, F), E) and cluster (A, B) is calculated in the same way,
as the minimum distance over all pairs of objects from the two clusters.
Continuing this process until a single cluster remains gives the final dendrogram.
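The single-linkage procedure above can be sketched in plain Python. The coordinates of A, B, D and F come from the worked example; the coordinates used here for C and E are hypothetical placeholders (the originals appear only in the scatter plot, not in the text), chosen to be consistent with the merge distances quoted above:

```python
from math import dist

# A, B, D, F from the text; C and E are hypothetical placeholder coordinates.
points = {"A": (1, 1), "B": (1.5, 1.5), "C": (5, 5),
          "D": (3, 4), "E": (4, 4), "F": (3, 3.5)}

def single_link(c1, c2):
    """Single linkage: minimum distance between members of the two clusters."""
    return min(dist(points[a], points[b]) for a in c1 for b in c2)

# Start with every object in its own cluster, then merge until one remains.
clusters = [{name} for name in points]
merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters under single linkage ...
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    d = single_link(clusters[i], clusters[j])
    # ... merge them and record the step (this traces the dendrogram).
    merged = clusters[i] | clusters[j]
    merges.append((sorted(merged), round(d, 2)))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for step in merges:
    print(step)
```

With these placeholder coordinates the trace reproduces the merges discussed above: (D, F) at 0.5, (A, B) at 0.71, then E joining (D, F) at 1.00.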
3.2 Partitional Algorithms
It basically involves segmenting data objects into k partitions, optimizing some criterion
over t iterations. These methods are popularly known as iterative relocation methods.
3.2.1 K-means Algorithm
K-means is the most popularly used algorithm in this category. It randomly selects k
objects as cluster means or centers. It works towards optimizing the square error criterion
function, defined as:
E = sum_{i=1..k} sum_{x in C_i} |x - m_i|^2, where m_i is the mean of cluster C_i.
Main steps of the k-means algorithm are:
1) Assign initial means m_i.
2) Assign each data object x to the cluster C_i with the closest mean.
3) Compute the new mean for each cluster.
4) Iterate until the criterion function converges, that is, there are no more new assignments.
The k-means algorithm is sensitive to outliers since an object with an extremely large value
may substantially distort the distribution of data.
Example:
Suppose we have several objects (4 types of medicines), each with two attributes
or features, as shown in the table below. Our goal is to group these objects into K = 2
groups of medicine based on the two features (pH and weight index).
Object       attribute 1 (X): weight index   attribute 2 (Y): pH
Medicine A                1                           1
Medicine B                2                           1
Medicine C                4                           3
Medicine D                5                           4
1. Initial value of centroids: Suppose we use medicine A and medicine B as the first
centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and
c2 = (2, 1).
2. Objects-centroids distance: We calculate the distance from each cluster centroid to each
object. Using Euclidean distance, the distance matrix at iteration 0 is

D0 = [ 0     1     3.61   5
       1     0     2.83   4.24 ]
Each column in the distance matrix corresponds to one object (A, B, C, D). The first row of
the distance matrix holds the distance of each object to the first centroid, and the second
row the distance of each object to the second centroid. For example, the distance from
medicine C = (4, 3) to the first centroid is sqrt((4 - 1)^2 + (3 - 1)^2) = 3.61, and its
distance to the second centroid is sqrt((4 - 2)^2 + (3 - 1)^2) = 2.83.
3. Objects clustering: We assign each object based on the minimum distance. Thus,
medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2 and
medicine D to group 2. The element of the group matrix below is 1 if and only if the object
is assigned to that group.

G0 = [ 1  0  0  0
       0  1  1  1 ]
4. Iteration 1, determine centroids: Knowing the members of each group, we now
compute the new centroid of each group based on these new memberships. Group 1 has only
one member, thus the centroid remains c1 = (1, 1). Group 2 now has three members,
thus the centroid is the average coordinate of the three members:
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3).
5. Iteration 1, objects-centroids distances: The next step is to compute the distance of all
objects to the new centroids. Similar to step 2, the distance matrix at iteration 1 is

D1 = [ 0      1      3.61   5
       3.14   2.36   0.47   1.89 ]
6. Iteration 1, objects clustering: Similar to step 3, we assign each object based on the
minimum distance. Based on the new distance matrix, we move medicine B to group 1
while all the other objects remain. The group matrix becomes

G1 = [ 1  1  0  0
       0  0  1  1 ]
7. Iteration 2, determine centroids: We repeat step 4 to calculate the new centroid
coordinates based on the clustering of the previous iteration. Group 1 and group 2 both
have two members, thus the new centroids are c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1) and
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5).
8. Iteration 2, objects-centroids distances: Repeating step 2, we obtain the new distance
matrix at iteration 2 as

D2 = [ 0.5    0.5    3.20   4.61
       4.30   3.54   0.71   0.71 ]
9. Iteration 2, objects clustering: Again, we assign each object based on the minimum
distance. We obtain G2 = G1. Comparing the grouping of the last iteration and this iteration
reveals that the objects no longer move between groups. Thus, the computation of k-means
clustering has reached its stability and no more iterations are needed.
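The iterations above can be sketched in plain Python, starting from the same initial centroids (medicines A and B) and using Euclidean distance:

```python
from math import dist

# The four medicines: (weight index, pH).
data = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [data["A"], data["B"]]  # step 1: initial centroids

assignment = None
while True:
    # Steps 2-3: assign each object to its nearest centroid (0 or 1).
    new_assignment = {name: min(range(2), key=lambda i: dist(p, centroids[i]))
                      for name, p in data.items()}
    if new_assignment == assignment:  # step 4: stop when no object moves
        break
    assignment = new_assignment
    # Recompute each centroid as the mean of its members.
    for i in range(2):
        members = [p for name, p in data.items() if assignment[name] == i]
        centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))

print(assignment)   # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(centroids)    # [(1.5, 1.0), (4.5, 3.5)]
```

The loop converges in two iterations, reproducing the final grouping {A, B} and {C, D} with centroids (1.5, 1) and (4.5, 3.5) derived above.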
4. Case Study in Agriculture
4.1 Soybean dataset
The soybean disease dataset contains 47 objects and 35 multivalued variables characterizing
the diseases diaporthe stem canker, charcoal rot, rhizoctonia root rot and phytophthora rot.
Variables are broadly categorized into environmental descriptors, condition of leaves,
condition of stem, condition of fruit pods and condition of roots. It was observed that
some variables take a single unique value across the dataset; those variables are irrelevant
and were removed from the dataset. The reduced dataset then has 20 variables characterizing
soybean diseases. The dataset also contains an instance number and a class variable, which
were not considered while clustering.
4.2 Soybean dataset: Clustering
The k-means clustering algorithm was applied to the soybean disease dataset with the number
of clusters set to four. Four disease clusters corresponding to the diseases diaporthe stem
canker (Cluster 1), charcoal rot (Cluster 2), rhizoctonia root rot (Cluster 3) and
phytophthora rot (Cluster 4) were obtained. The table below presents the result of the
clustering algorithm on the soybean disease dataset.
Table: Soybean dataset with clustering results
sno  v1  v2  v3  v4  v5  v6  v7  v8  v9  v10  v12  v20  v21  v22  v23  v24  v25  v26  v27  v28  v35  Cluster
0    4   0   2   1   1   1   0   1   0   2    1    0    3    1    1    1    0    0    0    0    0    cluster1
1    5   0   2   1   0   3   1   1   1   2    1    1    3    0    1    1    0    0    0    0    0    cluster1
2    3   0   2   1   0   2   0   2   1   1    1    0    3    0    1    1    0    0    0    0    0    cluster1
3    6   0   2   1   0   1   1   1   0   0    1    1    3    1    1    1    0    0    0    0    0    cluster1
4    4   0   2   1   0   3   0   2   0   2    1    0    3    1    1    1    0    0    0    0    0    cluster1
5    5   0   2   1   0   2   0   1   1   0    1    1    3    1    1    1    0    0    0    0    0    cluster1
6    3   0   2   1   0   2   1   1   0   1    1    1    3    0    1    1    0    0    0    0    0    cluster1
7    3   0   2   1   0   1   0   2   1   2    1    0    3    0    1    1    0    0    0    0    0    cluster1
8    6   0   2   1   0   3   0   1   1   1    1    0    3    1    1    1    0    0    0    0    0    cluster1
9    6   0   2   1   0   1   0   1   0   2    1    0    3    1    1    1    0    0    0    0    0    cluster1
10   6   0   0   2   1   0   2   1   0   0    1    1    0    3    0    0    0    2    1    0    0    cluster2
11   4   0   0   1   0   2   3   1   1   1    1    0    0    3    0    0    0    2    1    0    0    cluster2
12   5   0   0   2   0   3   2   1   0   2    1    0    0    3    0    0    0    2    1    0    0    cluster2
13   6   0   0   1   1   3   3   1   1   0    1    0    0    3    0    0    0    2    1    0    0    cluster2
14   3   0   0   2   1   0   2   1   0   1    1    0    0    3    0    0    0    2    1    0    0    cluster2
15   4   0   0   1   1   1   3   1   1   1    1    1    0    3    0    0    0    2    1    0    0    cluster2
16   3   0   0   1   0   1   2   1   0   0    1    0    0    3    0    0    0    2    1    0    0    cluster2
17   5   0   0   2   1   2   2   1   0   2    1    1    0    3    0    0    0    2    1    0    0    cluster2
18   6   0   0   2   0   1   3   1   1   0    1    0    0    3    0    0    0    2    1    0    0    cluster2
19   5   0   0   2   1   3   3   1   1   2    1    0    0    3    0    0    0    2    1    0    0    cluster2
20   0   1   2   0   0   1   1   1   1   1    0    0    1    1    0    1    1    0    0    3    0    cluster3
21   2   1   2   0   0   3   1   2   0   1    0    0    1    1    0    1    0    0    0    3    0    cluster3
22   2   1   2   0   0   2   1   1   0   2    0    0    1    1    0    1    1    0    0    3    0    cluster3
23   0   1   2   0   0   0   1   1   1   2    0    0    1    1    0    1    0    0    0    3    0    cluster3
24   0   1   2   0   0   2   1   1   1   1    0    0    1    1    0    1    0    0    0    3    0    cluster3
25   4   0   2   0   1   0   1   2   0   2    1    1    1    1    0    1    1    0    0    3    0    cluster3
26   2   1   2   0   0   3   1   2   0   2    0    0    1    1    0    1    1    0    0    3    0    cluster3
27   0   1   2   0   0   0   1   1   0   1    0    0    1    1    0    1    0    0    0    3    1    cluster3
28   3   0   2   0   1   3   1   2   0   1    0    1    1    1    0    1    1    0    0    3    0    cluster3
29   0   1   2   0   0   1   1   2   1   2    0    0    1    1    0    1    0    0    0    3    0    cluster3
30   2   1   2   1   1   3   1   2   1   2    1    0    2    2    0    1    0    0    0    3    1    cluster4
31   0   1   1   1   0   1   1   1   0   0    1    0    1    2    0    0    0    0    0    3    1    cluster4
32   3   1   2   0   0   1   1   2   1   0    1    0    2    2    0    0    0    0    0    3    1    cluster4
33   2   1   2   1   1   1   1   2   0   2    1    0    1    2    0    1    0    0    0    3    1    cluster4
34   1   1   2   0   0   3   1   1   1   2    1    0    2    2    0    0    0    0    0    3    1    cluster4
35   1   1   2   1   0   0   1   2   1   1    1    0    2    2    0    0    0    0    0    3    1    cluster4
36   0   1   2   1   0   3   1   1   0   0    1    0    1    2    0    0    0    0    0    3    1    cluster4
37   2   1   2   0   0   1   1   2   0   0    1    0    1    2    0    0    0    0    0    3    1    cluster4
38   3   1   2   0   0   2   1   2   1   1    1    0    2    2    0    0    0    0    0    3    1    cluster4
39   3   1   1   0   0   2   1   2   1   2    1    0    2    2    0    0    0    0    0    3    1    cluster4
40   0   1   2   1   1   1   1   1   0   0    1    0    1    2    0    1    0    0    0    3    1    cluster4
41   1   1   2   1   1   3   1   2   0   1    1    1    1    2    0    1    0    0    0    3    1    cluster4
42   1   1   2   0   0   0   1   2   1   0    1    0    2    2    0    0    0    0    0    3    1    cluster4
43   1   1   2   1   1   2   3   1   1   1    1    0    2    2    0    1    0    0    0    3    1    cluster4
44   2   1   1   0   0   3   1   2   0   2    1    0    1    2    0    0    0    0    0    3    1    cluster4
45   0   1   1   1   1   2   1   2   1   0    1    1    2    2    0    1    0    0    0    3    1    cluster4
46   0   1   2   1   0   3   1   1   0   2    1    0    1    2    0    0    0    0    0    3    1    cluster4