ACM Student Chapter, Heritage Institute of Technology. 17 February, 2012. SIGKDD Presentation by Megha Nangia, J. M. Mansa and Koustav Mullick.



Clustering results are used:
- As a stand-alone tool to get insight into data distribution: visualization of clusters may unveil important information.
- As a preprocessing step for other algorithms: efficient indexing or compression often relies on clustering.

Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more "similar" (in some sense or another) to each other than to those in other clusters.

Cluster analysis itself is not one specific algorithm; the general task to be solved is forming such groups of similar objects, and it can be achieved by various algorithms.

Recall that the goal is to group together "similar" data. But what does this mean?
- There is no single answer: it depends on what we want to find or emphasize in the data. This is one reason why clustering is an "art".
- The similarity measure is often more important than the clustering algorithm used; don't overlook this choice.

In general, we want to minimize the intra-cluster distance and maximize the inter-cluster distance.

Clustering is a main task of exploratory data mining, used to reduce the size of large data sets. It is a common technique for statistical data analysis used in many fields, including:
- Machine learning
- Pattern recognition
- Image analysis
- Information retrieval
- Bioinformatics
- Web applications such as social network analysis, grouping of shopping items, search result grouping, etc.


Desirable properties of a clustering algorithm include:
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Interpretability and usability

How many clusters? The same set of points could reasonably be divided into two, four, or six clusters.

Clustering algorithms can be categorized. Some of the major categories are:
1) Hierarchical or connectivity-based clustering
2) Partitional clustering (K-means or centroid-based clustering)
3) Density-based clustering
4) Grid-based clustering
5) Model-based clustering


In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells.
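Stated formally (this is the standard k-means objective implied by "nearest mean" above, not a formula taken from the slides), the method seeks cluster sets S_1, ..., S_k minimizing

\min_{S_1,\dots,S_k} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2, \qquad \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x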

Partitional clustering is a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

(Figure: the original points and a partitional clustering of them.)

Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. At different distances, different clusters will form, which can be represented using a dendrogram.

These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.
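To make this concrete, here is a minimal sketch (not from the original slides) of connectivity-based clustering using SciPy's single-linkage implementation; the toy data and the cut levels are arbitrary choices for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two loose groups (illustrative data only).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

# Build the hierarchy: single linkage always merges the two closest clusters first.
Z = linkage(X, method="single")   # Z encodes the dendrogram as a linkage matrix

# Cutting the hierarchy at different levels yields different flat clusterings.
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
labels_3 = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(labels_2, labels_3)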

A hierarchical clustering is a set of nested clusters organized as a hierarchical tree.

(Figures: a traditional and a non-traditional hierarchical clustering, with the corresponding traditional and non-traditional dendrograms.)

Hierarchical clustering vs. partitional clustering.

Partitioning method: construct a partition of n objects into a set of K clusters.
- Given: a set of objects and the number K.
- Find: a partition into K clusters that optimizes the chosen partitioning criterion.
- Effective heuristic methods: the K-means and K-medoids algorithms.



Common proximity measures between two data objects include (each is sketched in the code below):
- Euclidean distance
- City block (Manhattan) distance
- Cosine similarity
- Jaccard similarity
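A minimal Python sketch of these four measures, assuming plain numeric vectors for the distances and sets of items for the Jaccard similarity (the function names are mine, not from the slides):

import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # City block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def jaccard_similarity(a, b):
    # |intersection| / |union| of two sets of items.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(euclidean((3, 4), (7, 4)))          # 4.0
print(manhattan((3, 4), (2, 6)))          # 3
print(cosine_similarity((1, 0), (1, 1)))  # ~0.707
print(jaccard_similarity("abc", "bcd"))   # 0.5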



K-means is a partitional clustering approach:
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.

The basic algorithm is very simple (a minimal Python sketch follows the steps below):

1. Select K points as the initial centroids.
2. Repeat:
3.   Form K clusters by assigning all points to their respective closest centroid.
4.   Re-compute the centroid of each cluster.
5. Until: the centroids don't change.
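A self-contained Python sketch of these steps (pure NumPy; the variable names and the tiny example data are my own, and K is assumed to be given):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: X is an (n, m) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 3. Assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. Re-compute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 5. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny illustrative example (data made up for this sketch).
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)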






(Flowchart: START → choose K centroids → form K clusters → recompute the centroids; repeat while the centroids change, END when they do not.)


Time complexity:
- Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(kn) distance computations, i.e. O(knm).
- Computing centroids: each instance vector gets added once to some centroid, so O(nm).
- Assume these two steps are each done once for each of I iterations: O(Iknm) overall.

Algorithm: k-means. Distance metric: Euclidean distance. (A sequence of slides stepped through successive iterations of k-means with three centroids k1, k2 and k3.)


(Figure: the same original points clustered two ways, showing a sub-optimal clustering and the optimal clustering, depending on the initial centroids.)


Solutions to the initial centroids problem:
- Multiple runs: helps, but probability is not on your side.
- Sample the data and use hierarchical clustering to determine the initial centroids.
- Select more than k initial centroids and then select among these candidates, e.g. the most widely separated ones (a sketch of this idea follows the list).
- Postprocessing.
- Bisecting K-means: not as susceptible to initialization issues.
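A small sketch of the "select more than k candidates, then keep the most widely separated" idea, using a greedy farthest-point pass over a random sample (the candidate-pool size and the tie-breaking are my own assumptions, not from the slides):

import numpy as np

def spread_out_centroids(X, k, n_candidates=None, seed=0):
    """Pick k widely separated initial centroids from a larger candidate set."""
    rng = np.random.default_rng(seed)
    n_candidates = n_candidates or min(len(X), 5 * k)
    # Draw more candidate points than we actually need.
    candidates = X[rng.choice(len(X), size=n_candidates, replace=False)]
    # Greedily keep the candidate farthest from those already chosen.
    chosen = [candidates[0]]
    while len(chosen) < k:
        d = np.min(
            [np.linalg.norm(candidates - c, axis=1) for c in chosen], axis=0
        )
        chosen.append(candidates[int(d.argmax())])
    return np.array(chosen)

X = np.random.default_rng(1).normal(size=(200, 2))
print(spread_out_centroids(X, k=3))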


Evaluating K-means clusterings:
- The most common measure is the Sum of Squared Error (SSE).
- For each point, the error is the distance to the nearest cluster centre; to get the SSE, we square these errors and sum them:

  SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2

  where x is a data point in cluster C_i and m_i is the representative point of cluster C_i.
- One can show that m_i corresponds to the centre (mean) of the cluster.
- Given two clusterings, we can choose the one with the smaller error.
- One easy way to reduce the SSE is to increase K, the number of clusters; however, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
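A short Python helper for this measure, assuming Euclidean distance and NumPy arrays (names and numbers are illustrative only):

import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of every point to its centroid."""
    diffs = X - centroids[labels]      # vector from each point to its cluster centre
    return float(np.sum(diffs ** 2))   # square and sum over all points

# Example with two one-dimensional clusters (made-up numbers).
X = np.array([[1.0], [2.0], [9.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5], [10.0]])
print(sse(X, labels, centroids))       # 0.25 + 0.25 + 1.0 + 1.0 = 2.5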

Strengths:
- Relatively efficient: O(ikn), where n is the number of objects, k the number of clusters, and i the number of iterations. Normally, k, i << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.


Weaknesses:
- Applicable only when a mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.
- May also give rise to empty clusters.

Outliers are objects that do not belong to any cluster or form clusters of very small cardinality.









Bisecting K-means is a variant of K-means that can produce a partitional or a hierarchical clustering.

To choose which cluster to split next, one can pick:
- the largest cluster, or
- the cluster with the lowest average similarity, or
- the cluster with the largest SSE.


1. Initialize the list of clusters.
2. Repeat:
3.   Select a cluster from the list of clusters.
4.   For i = 1 to number_of_iterations:
5.     Bisect the cluster using the K-means algorithm.
6.   End for.
7.   Select the two clusters of the bisection having the lowest total SSE.
8.   Add these two clusters to the list of clusters.
9. Until: the list contains K clusters.

(A Python sketch of this loop follows below.)

(Flowchart of the bisecting K-means loop described above: select a cluster, bisect it repeatedly, keep the lowest-SSE bisection, stop once K clusters exist.)
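A compact Python sketch of this procedure, reusing the kmeans sketch given earlier; selecting the cluster with the largest SSE and the number of trial bisections are my own choices, not prescribed by the slides:

import numpy as np

def bisecting_kmeans(X, K, n_trials=5, seed=0):
    """Repeatedly split one cluster in two with K-means until K clusters exist."""
    clusters = [X]                          # start with all points in a single cluster
    while len(clusters) < K:
        # Select the cluster to split: here, the one with the largest SSE.
        errs = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(errs)))
        best_pair, best_err = None, np.inf
        # Try several bisections and keep the one with the lowest total SSE.
        for trial in range(n_trials):
            labels, cents = kmeans(target, k=2, seed=seed + trial)
            if len(set(labels.tolist())) < 2:
                continue                    # skip degenerate splits
            err = float(np.sum((target - cents[labels]) ** 2))
            if err < best_err:
                best_err = err
                best_pair = [target[labels == 0], target[labels == 1]]
        if best_pair is None:               # could not split; keep the cluster as-is
            clusters.append(target)
            break
        clusters.extend(best_pair)
    return clusters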





- Bisecting K-means tends to produce clusters of relatively uniform size.
- Regular K-means tends to produce clusters of widely different sizes.
- Bisecting K-means beats regular K-means in entropy measurements.

K-means has problems when clusters have:
- differing sizes,
- differing densities,
- non-globular shapes.

K-means also has problems when the data contains outliers.





(Figures: original points compared with K-means results using 2 and 3 clusters, illustrating the problems with differing sizes, differing densities and non-globular shapes.)

One solution is to use many clusters: K-means then finds parts of the true clusters, which need to be put back together afterwards.





(Figures: the same original points with many small K-means clusters that can later be combined into the natural clusters.)

A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster.

In contrast to the K-means algorithm, K-medoids chooses actual data points as centers (medoids or exemplars).

The most common realisation of K-medoid clustering is the Partitioning Around Medoids (PAM) algorithm.

1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. For each medoid m:
     For each non-medoid data point o:
       Swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
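A minimal Python sketch along these lines, using the Manhattan distance as the cost (as in the worked example that follows); the implementation details are mine and this is a greedy variant rather than a canonical PAM implementation:

import numpy as np

def manhattan_cost(X, medoids):
    """Total cost: sum over points of the Manhattan distance to the nearest medoid."""
    d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum(), d.argmin(axis=1)

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly select k of the n data points as the medoids.
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best_cost, labels = manhattan_cost(X, medoids)
    improved = True
    while improved:                      # 5. repeat until the medoids stop changing
        improved = False
        for i in range(k):               # 3. for each medoid position ...
            for o in range(len(X)):      #    ... and each non-medoid point o
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o             # swap the medoid with o
                cost, lab = manhattan_cost(X, trial)
                if cost < best_cost:     # 4. keep the lowest-cost configuration
                    best_cost, labels, medoids, improved = cost, lab, trial, True
    return medoids, labels, best_cost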



Example: cluster the following set of ten objects into two clusters, i.e. k = 2.

Consider a data set of ten objects as follows:

Point   Coordinate1   Coordinate2
X1      2             6
X2      3             4
X3      3             8
X4      4             7
X5      6             2
X6      6             4
X7      7             3
X8      7             4
X9      8             5
X10     7             6

(Scatter plot of the ten data points.)

Initialize k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as the medoids.

Calculate the distances (here, Manhattan distances) so as to associate each data object with its nearest medoid.

Costs with respect to c1 = (3,4):

Data object (Xi)   Cost to c1
(2,6)              3
(3,8)              4
(4,7)              4
(6,2)              5
(6,4)              3
(7,3)              5
(8,5)              6
(7,6)              6

Costs with respect to c2 = (7,4):

Data object (Xi)   Cost to c2
(2,6)              7
(3,8)              8
(4,7)              6
(6,2)              3
(6,4)              1
(7,3)              1
(8,5)              2
(7,6)              2

Each object is assigned to the medoid with the lower cost, so the clusters become:

Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

The total cost involved is 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20.
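For completeness, a few lines of Python reproduce this assignment and the total cost (using the Manhattan distance, as in the tables above):

points  = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
           (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids = [(3, 4), (7, 4)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

total = 0
clusters = {m: [] for m in medoids}
for p in points:
    costs = [manhattan(p, m) for m in medoids]
    best = costs.index(min(costs))      # nearest medoid wins the point
    clusters[medoids[best]].append(p)
    total += costs[best]

print(clusters)   # (3,4) gets 4 points, (7,4) gets 6 points
print(total)      # 20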

(Scatter plot showing the two clusters around the medoids c1 and c2.)

Next, for each medoid we choose a non-medoid point, swap it with the medoid and re-compute the total cost. If the cost improves, we keep the new medoid, and we proceed in this way until there is no change in the medoids.
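For instance (this particular swap is my illustration, not one worked on the slides), swapping c2 = (7,4) with the non-medoid (7,3) and recomputing the Manhattan costs gives per-point costs of 3, 0, 4, 4, 2, 2, 0, 1, 3 and 3, for a total of 22. Since 22 > 20, the swap is rejected and c2 stays at (7,4).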


- PAM is more robust than K-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
- PAM works well for small data sets but does not scale well to large data sets.




Conclusion:

Partitional clustering is a very efficient and easy-to-implement clustering method. Its heuristic algorithms, such as K-means and K-medoids, aim to find good (locally, and sometimes globally, optimal) partitions of the data.

However, partitional clustering also suffers from a number of shortcomings:
- The performance of the algorithm depends on the initial centroids, so the algorithm gives no guarantee of an optimal solution.
- Choosing poor initial centroids may also lead to the generation of empty clusters.
- The number of clusters needs to be determined beforehand.
- It does not work well with non-globular clusters.

Some of the above drawbacks can be addressed using other popular clustering approaches, such as hierarchical or density-based clustering. Nevertheless, the importance of partitional clustering cannot be denied.