CMune: A Clustering Using Mutual Nearest Neighbors Algorithm

Amin Shoukry
Computer Science and Engineering Department, Egypt-Japan University of Science and Technology, Alexandria, Egypt.
email: amin.shoukry@ejust.edu.eg

Mohamed A. Abbas
Graduate student with the College of Computing and Information Technology, Arab Academy for Science & Technology, Alexandria, Egypt.
e-mail: mohamed.alyabbas@gmail.com
Agenda
1. Introduction & Related Work
2. CMune (Clustering Using Mutual Nearest Neighbors) Algorithm, Underlying Data Model and Complexity Analysis
3. Experimentation & Validation (9 data sets, with feature space dimensionality varying from 4 up to 5000; six well-known validity indices have been used)
4. Conclusion
Clustering
• Clustering of data is an important step in data analysis. The main goal of clustering is to partition data objects into well-separated groups, so that objects lying in the same group are more similar to one another than to objects in other groups.
• Clusters can be described in terms of internal homogeneity and external separation.
Natural Clusters
A natural cluster is a cluster of any shape, size and density; it should not be restricted to a globular shape, as a wide number of classical algorithms assume, or to a specific user-defined density, as some density-based algorithms require.
Cluster Prototype
• Many clustering algorithms adopt the notion of a prototype (i.e., a data point that is a representative for a set of points).
• This prototype can be:
– only one data point, as in K-means, K-medoids, DBScan [Martin Ester, 1996] and Cure [Guha et al., 1998].
– a set of points representing tiny clusters to be merged/propagated, as in Chameleon [Karypis G, 1999] and CMune.
Cluster Representation
CMune relies on the principle of K-Mutual Nearest-Neighbor consistency.

K-Mutual Nearest Neighbors versus K-Nearest Neighbors Consistency
• Principle of K-NB consistency of a cluster [1]: “An object should be in the same cluster as its nearest neighbors”.
• Principle of K-Mutual Nearest-Neighbor consistency (K-MNB consistency): “An object should be in the same cluster as its mutual nearest neighbors”. K-MNB consistency is stronger than K-NB consistency (i.e., K-MNB consistency implies K-NB consistency).

[1] Lee, J.-S. and Ólafsson, S. (2011). Data clustering by minimizing disconnectivity. Inf. Sci., 181(4):732–746.
“A” is in 4-NB of “B”; however, “B” is not in 4-NB of “A”. Therefore, “A” and “B” are not Mutual Nearest Neighbors. The K-Nearest Neighborhood is not a symmetric relation.
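This asymmetry can be sketched in a few lines of Python (an illustrative helper, not the authors' code; the point coordinates and K are arbitrary):

```python
from math import dist

def k_nb(points, i, k):
    """K-NB(i): indices of the K nearest neighbors of point i (excluding i)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return set(order[:k])

def mutual(points, a, b, k):
    """a and b are mutual nearest neighbors iff each lies in the other's K-NB."""
    return b in k_nb(points, a, k) and a in k_nb(points, b, k)

# one tight group plus a relatively isolated point B
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.2), (5.0, 5.0)]
A, B = 4, 5
print(A in k_nb(pts, B, 2))   # True:  A is in 2-NB of B
print(B in k_nb(pts, A, 2))   # False: B is not in 2-NB of A
print(mutual(pts, A, B, 2))   # False: they are not mutual nearest neighbors
```

The isolated point B must pick its K nearest neighbors from the group, but no group point picks B back, which is exactly the asymmetry of the K-NB relation.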
CMune: Concept of Mutual Nearest Neighbors
Reference Point and Reference Block/List
RL(A) consists of points that are Mutual Nearest Neighbors to point ‘A’. It is constructed from the intersection of the set of points in K-NB(A) and the set of points having A in their K-Nearest Neighborhood.
Reference Point ‘A’ and the Reference List RL(A) it represents.
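The intersection defining RL(A) can be sketched as follows (a minimal illustration under the Euclidean-distance setting of the paper; the helper names are ours, not the authors'):

```python
from math import dist

def k_nb(points, i, k):
    """K-NB(i): indices of the K nearest neighbors of point i."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return set(order[:k])

def reference_list(points, a, k):
    """RL(a) = K_NB(a) ∩ {points having a in their K-NB}."""
    rb = {j for j in range(len(points)) if j != a and a in k_nb(points, j, k)}
    return k_nb(points, a, k) & rb

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(reference_list(pts, 0, 2))   # {1, 2}: mutual nearest neighbors of point 0
```

The far-away point never enters any reference list, which previews why small blocks flag noise.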
Role of Reference Blocks/Lists
• They are considered as dense regions/blocks.
• These blocks are the seeds from which clusters may grow. Therefore, CMune is not a point-to-point clustering algorithm. Rather, it is a block-to-block clustering technique.
• Much of its advantages come from these facts: noise points and outliers correspond to blocks of small sizes, and homogeneous blocks highly overlap.
Types of Representative Points
• Given a Reference List RL(A), A is said to represent RL(A). There are three possible types of representative points:
1) Strong points, representing blocks of size greater than a pre-defined threshold parameter.
2) Noise points, representing empty Reference Lists (they can neither form clusters nor participate in cluster growing).
3) Weak points, which are neither strong points nor noise points. These points may be merged with other clusters if they are members of other strong points’ blocks.
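A minimal sketch of this three-way classification (the comparison follows the “greater than a pre-defined threshold” rule above; the function and variable names are hypothetical):

```python
def classify_points(ref_lists, t):
    """Label each point 'strong', 'noise', or 'weak' from its reference list.

    ref_lists: dict mapping point index -> set of its mutual nearest neighbors.
    t: pre-defined threshold on block size (the paper's parameter T).
    """
    labels = {}
    for p, rl in ref_lists.items():
        if len(rl) == 0:
            labels[p] = "noise"    # empty reference list: cannot form or join clusters
        elif len(rl) > t:
            labels[p] = "strong"   # block large enough to seed a cluster
        else:
            labels[p] = "weak"     # may only join clusters via strong points' blocks
    return labels

rls = {0: {1, 2, 3}, 1: {0, 2}, 2: set(), 3: {0}}
print(classify_points(rls, 2))
# {0: 'strong', 1: 'weak', 2: 'noise', 3: 'weak'}
```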
Clusters Merging/Propagation
Cluster Ci greedily chooses cluster Cl to merge with, as they have the maximum mutual intersection among all overlapping reference blocks.
How to Impose an Order on the Merging of Reference Blocks?
[Figure: 3 cases with different homogeneity factors α: (a) 0.98, (b) 0.80 and (c) 0.55]
Answer: by cardinality (density) and homogeneity.
CMune
• Initialize parameters:
– K {size of the neighbourhood of a point}
– T {noise threshold / min size of a reference list}
• Construct the similarity matrix based on Euclidean distance.
• Construct the reference list for each point pi:
RL(pi) = K_NB(pi) ∩ RB(pi)
• Form a sorted list L based on the cardinality of the reference lists RL(pi).
– Exclude noise and weak points, for which |RL(pi)| < T.
[Flowchart: sort reference blocks w.r.t. homogeneity → form the first cluster → test closeness to existing clusters → cluster merging]
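The steps above can be sketched end-to-end as follows. This is a simplified illustration only: it keeps the block-to-block propagation and the cardinality ordering, but omits the homogeneity ordering and other details of the actual CMune implementation:

```python
from math import dist

def k_nb(points, i, k):
    """K-NB(i): indices of the K nearest neighbors of point i."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return set(order[:k])

def cmune_sketch(points, k, t):
    """Simplified CMune-style clustering: grow clusters by merging
    overlapping reference blocks, largest blocks first."""
    n = len(points)
    nb = [k_nb(points, i, k) for i in range(n)]
    # RL(i): mutual nearest neighbors of i (intersection of the K-NB relations)
    rl = [{j for j in nb[i] if i in nb[j]} for i in range(n)]
    # strong points: |RL| >= t, processed in order of decreasing cardinality
    strong = sorted((i for i in range(n) if len(rl[i]) >= t),
                    key=lambda i: -len(rl[i]))
    label = [-1] * n
    next_label = 0
    for s in strong:
        block = rl[s] | {s}
        touched = {label[j] for j in block if label[j] != -1}
        if not touched:
            cid = next_label          # the block seeds a new cluster
            next_label += 1
        else:
            # merge into the existing cluster with maximum overlap
            cid = max(touched,
                      key=lambda c: sum(1 for j in block if label[j] == c))
        for j in block:
            label[j] = cid
    return label  # points left at -1 are noise / unassigned weak points

# two well-separated groups plus an outlier
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
       (20.0, 20.0)]
print(cmune_sketch(pts, k=3, t=2))   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The outlier has an empty reference list, so it never seeds or joins a cluster, matching the noise-handling behaviour described above.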
Experimental Results
Results were assessed using (100 experiments / data set / algorithm):
1- V-measure
2- F-measure
3- Adjusted Rand Index
4- Jaccard Coefficient
5- Purity
6- Entropy
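As a concrete illustration of one of these indices, purity counts the fraction of points whose cluster's majority class matches their true class (a standard definition, not code from the paper):

```python
from collections import Counter

def purity(labels_true, labels_pred):
    """Fraction of points lying in the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    # sum the size of the majority class inside each cluster
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(labels_true)

true_ = [0, 0, 0, 1, 1, 1]
pred  = [0, 0, 1, 1, 1, 1]
print(purity(true_, pred))   # 5/6: five of the six points are in majority classes
```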
Eight data sets:
1. Iris data
2. OCR data
3. Time series data
4. Letter recognition data
5. E. coli data
6. Yeast data
7. Libras movement data
8. Gisette data
For the first five indices, a higher index value indicates better accuracy; for Entropy, a lower index value corresponds to better accuracy.
Experimental Results
• 100 experiments / data set / algorithm are conducted to determine the best values of T & K.
• CMune is compared to 4 state-of-the-art clustering techniques:
– K-means (Forgy, 1965)
– DBScan (Martin Ester, 1996)
– Mitosis (Noha Yousri, 2009)
– Spectral Clustering (Chen, W.-Y., Song, Y., 2011)
Data Description & Results Statistics

Iris Dataset
This is a well-known database found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Time Series Dataset
The Synthetic Control Charts (SCC) data set includes 600 patterns, each of 60 dimensions (time points). In general, CMune has better index values.
Optical Character Recognition Dataset
Consists of 5620 patterns, each of 64 dimensions. The features describe the bitmaps of the digit characters. The aim is to properly classify the digit characters into 10 classes, from 0 to 9. In general, CMune has better index values.
Ecoli Dataset
This data contains protein localization sites. Consists of 336 patterns, each of 8 dimensions. In general, CMune has better index values.
Libras Movement Dataset
The dataset contains 15 classes of 24 instances each, where each class refers to a hand movement type. Consists of 360 patterns, each of 91 dimensions. In general, CMune has better index values.
Yeast Dataset
The Protein Localization Sites (Yeast) data set, obtained from the UCI repository. Consists of 1484 patterns, each of 8 dimensions. In general, CMune has better index values.
Letter Recognition Dataset
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. Consists of 20000 patterns, each of 16 attributes. In general, CMune has better index values.
SPECT Heart Dataset*
The dataset describes the diagnosis of cardiac Single Proton Emission Computed Tomography (SPECT) images. Consists of 267 patterns, each of 22 dimensions. Each patient is classified into one of two categories: normal and abnormal. In general, CMune has better index values.
Breast Cancer Diagnostic Dataset*
Consists of 569 patterns, each of 30 dimensions. The aim is to classify the data into two diagnoses (malignant and benign). In general, CMune has better index values.
Gisette Dataset
GISETTE is a handwritten digit recognition problem. The problem is to separate the highly confusable digits ‘4’ and ‘9’. This dataset is one of the five datasets of the NIPS 2003 feature selection challenge. Consists of 13500 patterns, each of 5000 dimensions. In general, CMune has better index values.
Sensitivity to parameters (K)
Conclusion
• We present a novel clustering algorithm based on the mutual nearest neighbor concept. It can find clusters of varying shapes, sizes and densities, even in the presence of noise and outliers, and in high-dimensional spaces as well.
• Clusters are represented by reference blocks (point + list). Two clusters can be merged if their link strength is maximal (i.e., their reference blocks have maximum intersection). Any data point not belonging to a cluster is considered as noise.
• The results of our experimental study on several data sets are encouraging. CMune solutions have been found, in general, superior to those obtained by DBScan, K-means and Mitosis, and competitive with the spectral clustering algorithm.
• We intend to parallelize our algorithm, as its clustering propagation is inherently parallel, and to determine T through some statistical analysis.
• The algorithm is publicly available to other researchers at http://www.csharpclustering.com.
CSHARP Evolution
• CSHARP [Abbas, Shoukry & Kashef, 2012] is presented for the purpose of finding clusters of arbitrary shapes and arbitrary densities in high-dimensional feature spaces. It can be considered as a variation of the Shared Nearest Neighbor algorithm (SNN) (Ertoz, 2003).
• Then a modified version of CSHARP was presented [Abbas & Shoukry, 2012]. The modification includes the incorporation of a new measure of cluster homogeneity.
• In this paper, an enhanced version of Modified CSHARP is presented. Specifically, the number of parameters has been reduced from three to only two, to reduce the effort needed to select the best parameters.
Algorithm Complexity
The overall time complexity of CMune is dominated by the construction of the similarity matrix (O(N²) pairwise distance computations), where N is the number of data points and K is the number of nearest neighbors. CMune has a space complexity of O(KN), where N is the number of data points and K is the number of nearest neighbors used, since only the K nearest neighbors of each data point are required.
Speed Performance
Speed of CMune compared to (a) DBScan and (b) K-means and DBScan, using the letter recognition data set, as a function of data size. CMune's speed performance is “stable”.