ACM Student Chapter, Heritage Institute of Technology
17th February, 2012
SIGKDD Presentation by Megha Nangia, J. M. Mansa, Koustav Mullick
• Clustering results are used:
  – As a stand-alone tool to get insight into data distribution
    • Visualization of clusters may unveil important information
  – As a preprocessing step for other algorithms
    • Efficient indexing or compression often relies on clustering
Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more “similar” (in some sense or another) to each other than to those in other clusters.
Cluster analysis itself is not one specific algorithm but the general task to be solved: forming clusters of similar objects. It can be achieved by various algorithms.
Recall that the goal is to group together “similar” data, but what does this mean?
No single answer: it depends on what we want to find or emphasize in the data; this is one reason why clustering is an “art”.
The similarity measure is often more important than the clustering algorithm used, so don’t overlook this choice!
Minimize intra-cluster distance
Maximize inter-cluster distance
Clustering is a main task of explorative data mining and can be used to reduce the size of large data sets. It is a common technique for statistical data analysis used in many fields, including:
• Machine learning
• Pattern recognition
• Image analysis
• Information retrieval
• Bioinformatics
• Web applications such as social network analysis, grouping of shopping items, search result grouping, etc.
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• Interpretability and usability
How many clusters?
[Figure: the same points grouped as Two Clusters, Four Clusters, and Six Clusters]
Clustering algorithms can be categorized by their approach. Some of the major categories are:
1) Hierarchical or connectivity-based clustering
2) Partitional clustering (K-means or centroid-based clustering)
3) Density-based
4) Grid-based
5) Model-based
In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells.
Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
[Figure: Original Points; A Partitional Clustering]
Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away.
As such, these algorithms connect "objects" to form "clusters" based on their distance. At different distances, different clusters will form, which can be represented using a dendrogram.
These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
[Figure: Traditional Hierarchical Clustering with Traditional Dendrogram; Non-traditional Hierarchical Clustering with Non-traditional Dendrogram]
Hierarchical Clustering.
Partitional Clustering.
Partitioning method: Construct a partition of n objects into a set of K clusters
• Given: a set of objects and the number K
• Find: a partition of K clusters that optimizes the chosen partitioning criterion
• Effective heuristic methods: K-means and K-medoids algorithms
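Common distance and similarity measures, for m-dimensional vectors x and y (and sets A and B for Jaccard):
Euclidean distance: $d(x, y) = \sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$
City block or Manhattan distance: $d(x, y) = \sum_{j=1}^{m} |x_j - y_j|$
Cosine similarity: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
Jaccard similarity: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$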
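The four measures named above can be sketched in a few lines of Python. This is only an illustration using the standard textbook definitions; the function names are hypothetical, vectors are assumed to be equal-length sequences of numbers, and the Jaccard inputs are assumed to be non-empty collections treated as sets.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(a, b):
    """Size of the intersection over size of the union (assumes a non-empty union)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((3, 4), (2, 6)))          # 3
print(cosine_similarity((1, 0), (0, 1)))  # 0.0
print(jaccard({1, 2, 3}, {2, 3, 4}))      # 0.5
```

Note that the first two are distances (smaller means more similar) while the last two are similarities (larger means more similar), so they cannot be swapped into an algorithm without flipping the comparison.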
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
The basic algorithm is very simple:
1. Select K points as initial centroids.
2. Repeat:
3.     Form K clusters by assigning all points to their respective closest centroid.
4.     Recompute the centroid for each cluster.
5. Until: the centroids don’t change.
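The five steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function name `kmeans`, the `max_iter` safety cap, the fixed `seed`, and the choice to leave the centroid of an empty cluster where it was are all assumptions made for the example. Points are assumed to be tuples of numbers, and Euclidean distance is used via `math.dist` (Python 3.8+).

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of numeric tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Step 1: select K points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its cluster.
        # An empty cluster keeps its previous centroid (an assumption of this sketch).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, 2)
print(sorted(sorted(c) for c in clusters))
```

On this small, well-separated data set the two natural groups are recovered; on less clean data the result depends on the initial centroids.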
[Flowchart: START → Choose K centroids → Form K clusters → Recompute centroids → Centroids change? YES: form clusters again; NO: END]
• Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, or O(knm).
• Computing centroids: each instance vector gets added once to some centroid: O(nm).
• Assume these two steps are each done once per iteration, for I iterations: O(Iknm).
[Figure sequence: Algorithm: k-means, Distance Metric: Euclidean Distance; successive iterations with three centroids k1, k2, k3]
[Figure: Original Points; Optimal Clustering; Sub-optimal Clustering]
• Multiple runs
  – Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids and then select among these initial centroids
  – Select most widely separated
• Postprocessing
• Bisecting K-means
  – Not as susceptible to initialization issues
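One possible reading of "select more than k initial centroids, then select the most widely separated" is a greedy farthest-point pick, sketched below. This interpretation, the function name `widely_separated`, and the choice to start from the first candidate are assumptions for illustration only. Candidates are assumed to be distinct tuples, with at least k of them.

```python
import math

def widely_separated(candidates, k):
    """Greedily pick k candidates, each as far as possible from those already chosen."""
    chosen = [candidates[0]]  # arbitrary starting point (assumption of this sketch)
    while len(chosen) < k:
        # For every remaining candidate, measure the distance to its nearest
        # already-chosen centroid, and take the candidate where that is largest.
        farthest = max(
            (c for c in candidates if c not in chosen),
            key=lambda c: min(math.dist(c, s) for s in chosen),
        )
        chosen.append(farthest)
    return chosen

candidates = [(0, 0), (1, 0), (10, 0), (0, 10), (5, 5)]
print(widely_separated(candidates, 3))  # [(0, 0), (10, 0), (0, 10)]
```

A side effect worth keeping in mind is that this kind of selection is drawn towards outliers, since an outlier is by definition far from everything else.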
• Most common measure is Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster centroid
  – To get SSE, we square these errors and sum them:
    $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2$
  – x is a data point in cluster C_i and m_i is the representative point for cluster C_i
    • One can show that m_i corresponds to the center (mean) of the cluster
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
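As a small illustration of the SSE measure, the sketch below computes it for a clustering given as a list of clusters, taking the mean of each cluster as its representative point and Euclidean distance as the error. The function name `sse` and the sample points are made up for the example.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        # Representative point m_i: the coordinate-wise mean of the cluster.
        centroid = [sum(c) / len(cluster) for c in zip(*cluster)]
        total += sum(
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for point in cluster
        )
    return total

two_clusters = [[(1, 1), (3, 1)], [(5, 5)]]
one_cluster = [[(1, 1), (3, 1), (5, 5)]]
print(sse(two_clusters))  # 2.0
print(sse(one_cluster) > sse(two_clusters))  # True
```

The second comparison shows the effect mentioned above: splitting the same points into more clusters lowers the SSE, which is why SSE alone cannot be used to choose K.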
Strength
• Relatively efficient: O(ikn), where n is # objects, k is # clusters, and i is # iterations. Normally, k, i << n.
• Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weakness
• Applicable only when a mean is defined; what about categorical data?
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
• May also give rise to empty clusters
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
[Figure: a cluster and its outliers]
Bisecting K-means: a variant of K-means that can produce a partitional or hierarchical clustering.
Which cluster to split? We can pick:
• the largest cluster, or
• the cluster with the lowest average similarity, or
• the cluster with the largest SSE.
[Flowchart: START → Initialize clusters → Select a cluster → (bisection loop) → K clusters? YES: END; NO: select another cluster]
1. Initialize the list of clusters to contain the cluster consisting of all points.
2. Repeat:
3.     Select (and remove) a cluster from the list of clusters.
4.     For i = 1 to number_of_iterations
5.         Bisect the cluster using the K-means algorithm
6.     End for
7.     Select the two clusters from the bisection with the lowest total SSE
8.     Add these two clusters to the list of clusters
9. Until: the list contains K clusters.
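The steps above might look like this in Python. It is a sketch under several assumptions: the cluster to split is the one with the largest SSE (one of the three options listed earlier), points are distinct numeric tuples, k does not exceed the number of points, and the helper names (`mean`, `sse`, `two_means`, `bisecting_kmeans`) and the `trials` parameter are invented for the example.

```python
import math
import random

def mean(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def sse(cluster):
    m = mean(cluster)
    return sum(math.dist(p, m) ** 2 for p in cluster)

def two_means(points, rng, max_iter=100):
    """One run of k-means with k = 2; returns the two halves."""
    centroids = rng.sample(points, 2)
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            closer_to_first = math.dist(p, centroids[0]) <= math.dist(p, centroids[1])
            halves[0 if closer_to_first else 1].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; the caller skips it
        new = [mean(halves[0]), mean(halves[1])]
        if new == centroids:
            break
        centroids = new
    return halves

def bisecting_kmeans(points, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]                 # step 1: one cluster with all points
    while len(clusters) < k:                  # steps 2 and 9
        target = max(clusters, key=sse)       # step 3: here, the largest-SSE cluster
        if len(target) < 2:
            break                             # nothing left that can be split
        clusters.remove(target)
        best = None
        for _ in range(trials):               # steps 4 to 6: several trial bisections
            a, b = two_means(target, rng)
            if a and b:
                cost = sse(a) + sse(b)
                if best is None or cost < best[0]:
                    best = (cost, a, b)       # step 7: keep the lowest total SSE
        if best is None:
            clusters.append(target)
            break
        clusters.extend(best[1:])             # step 8
    return clusters

pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0), (30, 0), (31, 0), (32, 0)]
print(sorted(sorted(c) for c in bisecting_kmeans(pts, 3)))
```

Recording the order of the splits would give the hierarchical view; the final list on its own is the partitional result.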
[Flowchart, continued: while i < no. of iterations, bisect the cluster and increment i; then add the two bisected clusters having the lowest SSE to the list of clusters]
– Bisecting K-means tends to produce clusters of relatively uniform size.
– Regular K-means tends to produce clusters of widely different sizes.
– Bisecting K-means beats regular K-means in entropy measurement.
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes
K-means has problems when the data contains outliers.
[Figures: Original Points vs. K-means (3 Clusters); Original Points vs. K-means (3 Clusters); Original Points vs. K-means (2 Clusters); Original Points vs. K-means Clusters]
One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put together afterwards.
[Figures: Original Points vs. K-means Clusters (two further examples)]
A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster.
In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars).
The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm.
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point to the closest medoid.
3. For each medoid m
   1. For each non-medoid data point o
      1. Swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
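A compact sketch of the PAM loop described above is given below. It departs from step 1 in one respect: the initial medoids are passed in rather than chosen at random, so that the run is reproducible. Manhattan distance as the default dissimilarity, and the names `total_cost` and `pam`, are assumptions of this example. Points are assumed to be tuples.

```python
from itertools import product

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids, dist=manhattan):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, initial_medoids, dist=manhattan):
    medoids = list(initial_medoids)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:                       # step 5: repeat until no change
        improved = False
        best_swap = None
        # Step 3: try swapping every medoid with every non-medoid point.
        for i, o in product(range(len(medoids)), points):
            if o in medoids:
                continue
            candidate = medoids[:i] + [o] + medoids[i + 1:]
            cost = total_cost(points, candidate, dist)
            if cost < best:               # step 4: keep the cheapest configuration
                best, best_swap = cost, candidate
        if best_swap is not None:
            medoids, improved = best_swap, True
    # Step 2, done once at the end: associate each point with its closest medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return medoids, clusters, best

pts = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids, clusters, cost = pam(pts, [(3, 4), (7, 4)])
print(medoids, cost)
```

Each pass evaluates k(n - k) swaps and each evaluation touches every point, which is the reason the method becomes expensive on large data sets.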
Cluster the following set of ten objects into two clusters, i.e. k = 2.
Consider a data set of ten objects as follows:

Point   Coordinate 1   Coordinate 2
X1      2              6
X2      3              4
X3      3              8
X4      4              7
X5      6              2
X6      6              4
X7      7              3
X8      7              4
X9      8              5
X10     7              6
Initialize k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as the medoids.
Calculate the distances so as to associate each data object with its nearest medoid. The cost values below correspond to the Manhattan distance.
c1       Data object (Xi)   Cost
(3, 4)   (2, 6)             3
(3, 4)   (3, 8)             4
(3, 4)   (4, 7)             4
(3, 4)   (6, 2)             5
(3, 4)   (6, 4)             3
(3, 4)   (7, 3)             5
(3, 4)   (8, 5)             6
(3, 4)   (7, 6)             6
c2       Data object (Xi)   Cost
(7, 4)   (2, 6)             7
(7, 4)   (3, 8)             8
(7, 4)   (4, 7)             6
(7, 4)   (6, 2)             3
(7, 4)   (6, 4)             1
(7, 4)   (7, 3)             1
(7, 4)   (8, 5)             2
(7, 4)   (7, 6)             2
The clusters then become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
The total cost involved is 20 (3 + 4 + 4 for Cluster1 plus 3 + 1 + 1 + 2 + 2 for Cluster2).
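The assignment and the total cost of 20 can be checked with a few lines of Python. The sketch assumes the Manhattan distance, which reproduces the cost values in the tables; the variable names are arbitrary.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
c1, c2 = (3, 4), (7, 4)

clusters = {c1: [], c2: []}
total = 0
for p in points:
    d1, d2 = manhattan(p, c1), manhattan(p, c2)
    # Each object joins the medoid it is closer to and contributes that distance.
    clusters[c1 if d1 <= d2 else c2].append(p)
    total += min(d1, d2)

print(clusters[c1])  # [(2, 6), (3, 4), (3, 8), (4, 7)]
print(clusters[c2])  # [(6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(total)         # 20
```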
[Figure: scatter plot of the ten objects with the medoids c1 and c2 marked]
Next, we choose a non-medoid point for each medoid, swap it with the medoid and recompute the cost. If the cost decreases, we make it the new medoid and proceed similarly, until there is no change in the medoids.
PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean.
PAM works well for small data sets but does not scale well for large data sets.
Partitional clustering is a very efficient and easy to implement clustering method.
Its heuristics converge to a local optimum, which may or may not be the global optimum.
Some of the heuristic approaches involve the K-means and K-medoid algorithms.
However, partitional clustering also suffers from a number of shortcomings:
• The performance of the algorithm depends on the initial centroids, so the algorithm gives no guarantee of an optimal solution.
• Choosing poor initial centroids may lead to the generation of empty clusters as well.
• The number of clusters needs to be determined beforehand.
• It does not work well with non-globular clusters.
Some of the above drawbacks can be addressed by other popular clustering approaches, such as hierarchical or density-based clustering.
Nevertheless, the importance of partitional clustering cannot be denied.