
CMune: A CLUSTERING USING MUTUAL NEAREST NEIGHBORS ALGORITHM


Amin Shoukry
Computer Science and Engineering Department
Egypt-Japan University of Science and Technology, Alexandria, Egypt.
email: amin.shoukry@ejust.edu.eg

Mohamed A. Abbas
Graduate student, College of Computing and Information Technology
Arab Academy for Science & Technology, Alexandria, Egypt.
email: mohamed.alyabbas@gmail.com

Agenda

1. Introduction & Related Work
2. CMune (Clustering Using Mutual Nearest Neighbors) Algorithm, Underlying Data Model and Complexity Analysis
3. Experimentation & Validation (9 data sets with feature space dimensionality varying from 4 up to 5000; six well-known validity indices have been used)
4. Conclusion

Clustering

Clustering of data is an important step in data analysis. The main goal of clustering is to partition data objects into well-separated groups, so that objects lying in the same group are more similar to one another than to objects in other groups.

Clusters can be described in terms of internal homogeneity and external separation.

Natural Clusters

A natural cluster is a cluster of any shape, size and density; it should not be restricted to a globular shape, as many classical algorithms assume, or to a specific user-defined density, as some density-based algorithms require.

Cluster Prototype

Many clustering algorithms adopt the notion of a prototype (i.e., a data point that is representative of a set of points).

This prototype can be:

- A single data point, as in K-means, K-medoids, DBScan [Martin Ester, 1996] and CURE [Guha et al., 1998].
- A set of points representing tiny clusters to be merged/propagated, as in Chameleon [Karypis G, 1999] and CMune.

Cluster Representation

CMune relies on the principle of K-Mutual Nearest-Neighbor consistency.

K-Mutual Nearest Neighbors versus K-Nearest Neighbors Consistency

Principle of K-NB consistency of a cluster [1]: "An object should be in the same cluster as its nearest neighbors."

Principle of K-Mutual Nearest-Neighbor consistency (K-MNB consistency): "An object should be in the same cluster as its mutual nearest neighbors." K-MNB consistency is stronger than K-NB consistency (i.e., K-MNB consistency implies K-NB consistency).

[1] Lee, J.-S. and Ólafsson, S. (2011). Data clustering by minimizing disconnectivity. Inf. Sci., 181(4):732-746.


"A" is in the 4-NB of "B"; however, "B" is not in the 4-NB of "A". Therefore, "A" and "B" are not mutual nearest neighbors. The K-nearest-neighbor relation is not symmetric; requiring mutuality is what makes the resulting relation symmetric.
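To make the asymmetry concrete, here is a minimal Python sketch of the 4-NB test (the coordinates are invented for illustration, and knn_indices is our helper, not part of CMune):

```python
import numpy as np

def knn_indices(X, k):
    """For each point, the indices of its k nearest neighbors
    (excluding the point itself) under Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

# "A" (index 1) sits inside a tight group; "B" (index 0) lies apart.
X = np.array([[0.0, 0.0],                      # B
              [1.0, 0.0],                      # A
              [1.1, 0.1], [1.2, 0.0],
              [1.0, 0.2], [1.1, -0.1]])
nb = knn_indices(X, k=4)
print(1 in nb[0])   # True:  A is in 4-NB(B)
print(0 in nb[1])   # False: B is not in 4-NB(A)
# A and B are mutual nearest neighbors only if BOTH tests are True.
```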

CMune: Concept of Mutual Nearest Neighbors

Reference Point and Reference Block/List

RL(A) consists of the points that are mutual nearest neighbors of point 'A'. It is constructed from the intersection of the set of points in K-NB(A) and the set of points having A in their K-nearest neighborhood.

Reference point 'A' represents the reference list RL(A).
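A small sketch of this construction (our illustration, assuming Euclidean distance): RL(p) is the intersection of K_NB(p) with the set of points whose own K-NB contains p.

```python
import numpy as np

def reference_lists(X, k):
    """RL(p) = K_NB(p) ∩ {q : p ∈ K_NB(q)} -- the mutual nearest
    neighbors of each point, per the definition above."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knb = [set(row) for row in np.argsort(d, axis=1)[:, :k]]
    return [{q for q in knb[p] if p in knb[q]} for p in range(len(X))]

# rl = reference_lists(X, k=4); rl[i] is the reference list of point i.
```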

Role of Reference Blocks/Lists

- They are considered as dense regions/blocks.
- These blocks are the seeds from which clusters may grow. Therefore, CMune is not a point-to-point clustering algorithm; rather, it is a block-to-block clustering technique.
- Much of its advantage comes from two facts: noise points and outliers correspond to blocks of small size, and homogeneous blocks highly overlap.

Types of Representative Points

Given a reference list RL(A), A is said to represent RL(A).

There are three possible types of representative points (see the sketch after this list):

1) Strong points, representing blocks of size greater than a pre-defined threshold parameter.
2) Noise points, representing empty reference lists (they can neither form clusters nor participate in cluster growing).
3) Weak points, which are neither strong points nor noise points. These points may be merged with other clusters if they are members of other strong points' blocks.
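A minimal sketch of this three-way classification (the names and the >= boundary are our choices; the slides say "greater than a threshold" here and later exclude points with |RL| < T):

```python
def classify_points(rl, T):
    """Split points into strong / noise / weak by reference-list size.
    rl: list of sets (RL per point); T: noise threshold."""
    strong = [p for p, block in enumerate(rl) if len(block) >= T]
    noise  = [p for p, block in enumerate(rl) if len(block) == 0]
    weak   = [p for p, block in enumerate(rl) if 0 < len(block) < T]
    return strong, noise, weak
```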


Cluster Merging/Propagation

Ci greedily chooses Cl to merge with, as the two have the maximum mutual intersection among all overlapping reference blocks.

How to Impose an Order on the Merging of Reference Blocks?

[Figure: three cases with different homogeneity factors α: (a) 0.98, (b) 0.80 and (c) 0.55.]

Answer: by cardinality (density) and homogeneity.
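The slides do not reproduce the formula for the homogeneity factor α, so, purely as an illustration, here is one plausible overlap-based measure (our assumption, not the paper's definition):

```python
def overlap_homogeneity(rl_i, rl_j):
    """Illustrative overlap ratio in [0, 1] between two reference
    blocks; NOT the paper's definition of alpha, which the slides omit."""
    inter = len(rl_i & rl_j)
    return inter / min(len(rl_i), len(rl_j)) if inter else 0.0
```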

CMune Algorithm

- Initialize parameters:
  K {size of the neighbourhood of a point}
  T {noise threshold / minimum size of a reference list}
- Construct the similarity matrix based on Euclidean distance.
- Construct the reference list of each point pi: RL(pi) = K_NB(pi) ∩ RB(pi), where RB(pi) is the set of points having pi in their K-nearest neighborhood.
- Form a sorted list L based on the cardinality of the reference lists RL(pi).
- Exclude noise and weak points, i.e., points for which |RL(pi)| < T.

[Flowchart: first cluster → sort w.r.t. homogeneity → test closeness to existing clusters → cluster merging.]
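Putting the steps together, here is a condensed and deliberately simplified sketch of the pipeline (our reading of the slides, not the authors' implementation; ties and the re-labelling of overlapping blocks are resolved in the simplest way):

```python
import numpy as np

def cmune_sketch(X, k, T):
    """Block-to-block clustering: seed clusters from strong points'
    reference blocks, then greedily merge by maximum intersection."""
    # Reference lists: RL(p) = K_NB(p) ∩ reverse-K_NB(p)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knb = [set(row) for row in np.argsort(d, axis=1)[:, :k]]
    rl = [{q for q in knb[p] if p in knb[q]} for p in range(len(X))]

    # Strong points, visited in order of decreasing block cardinality
    order = sorted((p for p in range(len(X)) if len(rl[p]) >= T),
                   key=lambda p: -len(rl[p]))

    clusters = []                              # cluster id -> set of points
    for p in order:
        block = rl[p] | {p}
        # Attach the block to the cluster it overlaps most, if any
        gain, cid = max(((len(block & c), i) for i, c in enumerate(clusters)),
                        default=(0, -1))
        if gain > 0:
            clusters[cid] |= block
        else:
            clusters.append(block)

    labels = {p: cid for cid, c in enumerate(clusters) for p in c}
    return labels                              # unlabeled points are noise
```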

Experimental Results

Results were assessed using (100 experiments / data set / algorithm):

1- V-measure
2- F-measure
3- Adjusted Rand Index
4- Jaccard Coefficient
5- Purity
6- Entropy

For the first five indices, a higher value indicates better accuracy; for entropy, a lower value corresponds to better accuracy.

Eight data sets:

1. Iris data                    5. E. coli data
2. OCR data                     6. Yeast data
3. Time series data             7. Libras movement data
4. Letter recognition data      8. Gisette data
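For reference, a minimal sketch of how three of these indices can be computed: V-measure and Adjusted Rand Index ship with scikit-learn, while purity is easy to write directly (the toy labels are invented):

```python
import numpy as np
from sklearn.metrics import v_measure_score, adjusted_rand_score

def purity(y_true, y_pred):
    """Purity: every cluster votes for its majority ground-truth class."""
    hits = sum(np.bincount(y_true[y_pred == c]).max()
               for c in np.unique(y_pred))
    return hits / len(y_true)

y_true = np.array([0, 0, 0, 1, 1, 1])     # ground truth (toy)
y_pred = np.array([0, 0, 1, 1, 1, 1])     # clustering output (toy)
print(v_measure_score(y_true, y_pred))
print(adjusted_rand_score(y_true, y_pred))
print(purity(y_true, y_pred))              # 5/6 ≈ 0.83
```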

Experimental Results

100 experiments / data set / algorithm were conducted to determine the best values of T & K.

CMune is compared to 4 state-of-the-art clustering techniques:

- K-means (Forgy, 1965)
- DBScan (Martin Ester, 1996)
- Mitosis (Noha Yousri, 2009)
- Spectral Clustering (Chen, W.-Y., Song, Y., 2011)

Data Description & Results Statistics

Iris Dataset

This is a well-known database found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

[Chart: validity index values for the Iris dataset.]

Time Series Dataset

The Synthetic Control Charts (SCC) data set includes 600 patterns, each of 60 dimensions (time points).

[Chart: validity index values for the time series dataset.]

In general, CMune has better index values.

Optical Character Recognition Dataset

Consists of 5620 patterns, each of 64 dimensions. The features describe the bitmaps of the digit characters. The aim is to properly classify the digit characters into 10 classes, from 0 to 9.

[Chart: validity index values for the OCR dataset.]

In general, CMune has better index values.

E. coli Dataset

This data contains protein localization sites. Consists of 336 patterns, each of 8 dimensions.

[Chart: validity index values for the E. coli dataset.]

In general, CMune has better index values.

Libras Movement Dataset

The dataset contains 15 classes of 24 instances each, where each class refers to a hand movement type. Consists of 360 patterns, each of 91 dimensions.

[Chart: validity index values for the Libras movement dataset.]

In general, CMune has better index values.

Yeast Dataset

The protein localization sites (Yeast) data, obtained from the UCI repository. Consists of 1484 patterns, each of 8 dimensions.

[Chart: validity index values for the Yeast dataset.]

In general, CMune has better index values.

Letter Recognition Dataset

The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. Consists of 20000 patterns, each of 16 attributes.

[Chart: validity index values for the letter recognition dataset.]

In general, CMune has better index values.

SPECT Heart Dataset*

The dataset describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Consists of 267 patterns, each of 22 dimensions. Each patient is classified into one of two categories: normal and abnormal.

[Chart: validity index values for the SPECT heart dataset.]

In general, CMune has better index values.

Breast Cancer Diagnostic Dataset*

Consists of 569 patterns, each of 30 dimensions. The aim is to classify the data into two diagnoses (malignant and benign).

[Chart: validity index values for the breast cancer diagnostic dataset.]

In general, CMune has better index values.

Gisette Dataset

GISETTE is a handwritten digit recognition problem. The task is to separate the highly confusable digits '4' and '9'. This dataset is one of the five datasets of the NIPS 2003 feature selection challenge. Consists of 13500 patterns, each of 5000 dimensions.

[Chart: validity index values for the Gisette dataset.]

In general, CMune has better index values.

Sensitivity to parameters (K)

Conclusion

- We present a novel clustering algorithm based on the mutual nearest neighbor concept. It can find clusters of varying shapes, sizes and densities, even in the presence of noise and outliers, and in high dimensional spaces as well.
- Clusters are represented by reference blocks (reference point + reference list). Two clusters can be merged if their link strength is maximal (i.e., their reference blocks have maximum intersection). Any data point not belonging to a cluster is considered noise.
- The results of our experimental study on several data sets are encouraging. CMune solutions have been found, in general, superior to those obtained by DBScan, K-means and Mitosis, and competitive with the spectral clustering algorithm.
- We intend to parallelize our algorithm, as its cluster propagation is inherently parallel, and to determine T through statistical analysis.

The algorithm is publicly available to other researchers at http://www.csharpclustering.com.

CSHARP Evolution

- CSHARP [Abbas, Shoukry & Kashef, 2012] was presented for the purpose of finding clusters of arbitrary shapes and arbitrary densities in high dimensional feature spaces. It can be considered a variation of the Shared Nearest Neighbor (SNN) algorithm (Ertoz, 2003).
- A modified version of CSHARP was then presented [Abbas & Shoukry, 2012]. The modification includes the incorporation of a new measure of cluster homogeneity.
- In this paper, an enhanced version of Modified CSHARP is presented. Specifically, the number of parameters has been reduced from three to two, to reduce the effort needed to select the best parameters.


Algorithm Complexity

The overall time complexity of CMune is O(N²), where N is the number of data points; the cost is dominated by the construction of the N×N similarity matrix, with the remaining neighborhood and merging steps depending on K, the number of nearest neighbors.

CMune has a space complexity of O(K·N), where N is the number of data points and K is the number of nearest neighbors used, since only the K nearest neighbors of each data point need to be stored.


Speed Performance

[Figure: speed of CMune compared to (a) DBScan and (b) K-means and DBScan, using the letter recognition data set; x-axis: data size.]

CMune's speed performance is "stable".