A Cluster Validity Measure With Outlier

Nov 30, 2013

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

A Cluster Validity Measure With Outlier Detection for Support Vector Clustering

Presenter: Lin, Shu-Han

Authors: Jeen-Shing Wang, Jen-Chieh Chiang

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS (2008)


Outline


Introduction of SVC
Motivation
Objective
Methodology
Experiments
Conclusion
Comments


SVC


SVC is derived from SVMs

SVMs are a supervised learning technique
  Fast convergence
  Good generalization performance
  Robustness to noise

SVC is an unsupervised approach

1. Data points are mapped to a high-dimensional (HD) feature space using a Gaussian kernel.
2. Look for the smallest sphere that encloses the data.
3. Map the sphere back to data space to form a set of contours.
4. The contours are treated as the cluster boundaries.


SVC - Sphere Analysis

Find the minimal enclosing sphere with a soft margin.

The problem is solved by introducing the Lagrangian function; a denotes the center of the sphere.
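For reference, the soft-margin sphere problem and its Lagrangian are usually written as follows in the standard SVC formulation. The notation is an assumption rather than something stated on the slide: R is the sphere radius, a its center, Φ the feature map, ξ_j the slack variables, and β_j, μ_j the Lagrange multipliers.

```latex
\[
\min_{R,\,a,\,\xi}\; R^2 + C \sum_j \xi_j
\quad \text{s.t.} \quad
\|\Phi(x_j) - a\|^2 \le R^2 + \xi_j, \qquad \xi_j \ge 0
\]
\[
L = R^2 - \sum_j \bigl(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\bigr)\beta_j
    - \sum_j \xi_j \mu_j + C \sum_j \xi_j,
\qquad \beta_j \ge 0,\; \mu_j \ge 0
\]
% Setting the derivatives of L with respect to R, a, and \xi_j to zero gives
\[
\sum_j \beta_j = 1, \qquad
a = \sum_j \beta_j \Phi(x_j), \qquad
\beta_j = C - \mu_j
\]
```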


SVC - Sphere Analysis

Karush-Kuhn-Tucker complementarity conditions

Bounded SV (BSV): outlier
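In the standard SVC formulation the complementarity conditions take the form below. The symbols are assumed as before: ξ_j are the slack variables, and β_j and μ_j are the multipliers of the two constraints.

```latex
\[
\xi_j \mu_j = 0, \qquad
\bigl(R^2 + \xi_j - \|\Phi(x_j) - a\|^2\bigr)\beta_j = 0
\]
% Consequences, using \beta_j = C - \mu_j:
%   \xi_j > 0        =>  \mu_j = 0,  \beta_j = C : bounded SV (outlier, outside the sphere)
%   0 < \beta_j < C  =>  \xi_j = 0               : SV (on the sphere surface)
%   \beta_j = 0                                  : point inside the sphere
```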


SVC - Sphere Analysis

The minimal enclosing sphere with a soft margin is found by solving the Wolfe dual optimization problem.

C: determines whether outliers are allowed to exist
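For reference, the Wolfe dual in the standard SVC formulation is usually written as below, with the dot products Φ(x_i)·Φ(x_j) replaced by a kernel K(x_i, x_j). The symbols β_j and K are assumed notation.

```latex
\[
\max_{\beta}\; W = \sum_j K(x_j, x_j)\,\beta_j - \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
\quad \text{s.t.} \quad
0 \le \beta_j \le C, \qquad \sum_j \beta_j = 1
\]
```

With C = 1 the upper bound β_j ≤ C can never be active alongside other nonzero multipliers because the multipliers sum to 1, so no point becomes a bounded SV.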


SVC - Sphere Analysis

The distance between x and the sphere center a is computed in feature space through a Mercer kernel.

Kernel: Gaussian function with width parameter q

q: determines the number of clusters and the smoothness/tightness of the cluster boundaries.
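A minimal sketch of the kernel and the distance to the sphere center, assuming the standard SVC expressions K(x, y) = exp(-q ||x - y||^2) and R^2(x) = K(x, x) - 2 Σ_j β_j K(x_j, x) + Σ_{i,j} β_i β_j K(x_i, x_j). The function names and the β values passed in are illustrative; in practice β comes from solving the dual problem.

```python
import math


def gaussian_kernel(x, y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)."""
    squared_norm = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-q * squared_norm)


def squared_distance_to_center(x, points, beta, q):
    """Squared feature-space distance R^2(x) from x to the sphere center a.

    Assumes a = sum_j beta_j * Phi(x_j), so the distance can be written
    with kernel evaluations only.
    """
    k_xx = gaussian_kernel(x, x, q)  # always 1 for the Gaussian kernel
    cross = sum(b * gaussian_kernel(p, x, q) for p, b in zip(points, beta))
    center_norm = sum(
        bi * bj * gaussian_kernel(pi, pj, q)
        for pi, bi in zip(points, beta)
        for pj, bj in zip(points, beta)
    )
    return k_xx - 2.0 * cross + center_norm


# Toy usage with two points and equal (assumed) multipliers.
points = [(0.0, 0.0), (1.0, 0.0)]
beta = [0.5, 0.5]
r_end = squared_distance_to_center((0.0, 0.0), points, beta, q=1.0)
r_mid = squared_distance_to_center((0.5, 0.0), points, beta, q=1.0)
```

In this toy case the midpoint lies closer to the center than either data point, which is why a contour at the radius of the support vectors encloses the region between them.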


Motivation

Traditional cluster validity measures, such as
  Partition coefficient (PC)
  Separation measures
are based on fuzzy membership grades and centroids of clusters.

The cluster boundaries generated by the SVC algorithm
  are arbitrarily shaped
  have no fuzzy membership grades.

Which clustering is better?


Objectives


Optimal cluster number
  Cluster validity measure
  Outlier-detection algorithm
  Cluster-merging mechanism


Methodology - Overview

Cluster Validity Measure for the SVC Algorithm
Outlier Detection
Cluster-Merging Mechanism

C = 1: no outliers are allowed


Methodology

Cluster Validity Measure for the SVC Algorithm

Compactness (intra-cluster)

Separation (inter-cluster)

Cluster validity measure for SVC: a ratio of compactness and separation, for which the minimum is sought
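As a rough illustration of a compactness-to-separation ratio, the sketch below uses stand-in definitions that are assumptions, not the paper's formulas: compactness is the largest mean distance of a cluster's points to their mean, and separation is the smallest distance between cluster means. Lower values then indicate tighter, better separated clusters.

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def validity_ratio(clusters):
    """Illustrative compactness / separation ratio (lower is better).

    `clusters` is a list of at least two non-empty lists of points.
    Both definitions below are hypothetical stand-ins.
    """
    centers = [mean_point(c) for c in clusters]
    # Compactness: worst-case average spread inside a cluster.
    compactness = max(
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    )
    # Separation: distance between the two closest cluster means.
    separation = min(
        math.dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return compactness / separation


tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 1.0), (10.0, 0.0)], [(10.0, 1.0)]]
```

Comparing `validity_ratio(tight)` with `validity_ratio(loose)` shows the intended behavior: the partition that keeps nearby points together scores lower.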


Methodology

Outlier Detection

In SVC, outliers (BSVs) are the data points in the boundary regions.

[Figure: SVC clustering results for q = 1, q = 4, q = 2, q = 1.8, and C = 0.02, with a singleton cluster marked]


Methodology

Outlier Detection

C
  If C = 1, the resulting clusters are smooth but not desirable
BSVs (outliers)
  All outliers are SVs
  Some outliers are far away from the other data in the clusters
SVs
  Too many SVs make the boundaries fit the data too tightly
q
  Increasing q makes the clusters more compact
Singleton
  An important criterion


Methodology

Outlier Detection

Outlier Existence Criterion

Desirable Cluster Criterion
  The number of singleton clusters cannot exceed a threshold
  The percentage of data points that are SVs cannot be greater than a threshold (suggested: 50%)

Recursively adjust C to satisfy these two criteria

Suggested γ = 2
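The desirable cluster criterion above can be sketched as a simple check. The function name and the `max_singletons` default are hypothetical, since the singleton threshold is not given; only the 50% limit on SVs comes from the slide.

```python
def is_desirable(cluster_sizes, n_support_vectors, n_points,
                 max_singletons=0, max_sv_fraction=0.5):
    """Return True when both parts of the desirable cluster criterion hold.

    cluster_sizes     -- number of points in each cluster
    n_support_vectors -- number of SVs found by SVC
    n_points          -- total number of data points
    max_singletons    -- assumed threshold on singleton clusters
    max_sv_fraction   -- suggested limit on the fraction of SVs (50%)
    """
    singletons = sum(1 for size in cluster_sizes if size == 1)
    sv_fraction = n_support_vectors / n_points
    return singletons <= max_singletons and sv_fraction <= max_sv_fraction
```

When the check fails, C would be adjusted and SVC run again, as the slide describes.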


Methodology

Cluster-Merging Mechanism

Similarity: overlapping degree, measured with a Gaussian function

[Figure: example in which P_C = 0 and P_A > 0]


Methodology

Cluster-Merging Mechanism

1) Agglomerative outliers/noises: identification
   For all c_i < ε, i = 1, . . . , K, where ε is the density, chosen as 3%~5%:
   {Set x ← m_i. For each j, j ≠ i, perform p_j(x), where p_j ∈ [0, 1] is the normalized overlapping index of the jth cluster.
   If p_j(x) > 0, merge cluster i and cluster j.
   Otherwise, discard cluster i. Set K ← K − 1.}

2) Compatible clusters: combination (similarity)
   Sort the sizes of the remaining K clusters in ascending order such that c_K = max(c_i) for all i ≤ K.
   For each i, i = 1, . . . , K, perform {Set x ← m_i. For each j, j = i + 1, . . . , K, perform p_j(x).
   Find l = arg max_{i+1≤j≤K} p_j(x), where arg max_a denotes the value of a at which the expression that follows is maximized.
   If p_l > 0, merge cluster i with cluster l. Set K ← K − 1 and repeat 2) until no further combination.}
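The two merging steps above can be sketched as follows. Several parts are assumptions made for illustration: m_i is taken to be the cluster mean, c_i the cluster's share of all points, and `gaussian_overlap` is a hypothetical stand-in for the normalized overlapping index p_j(x). As a simplification, tiny clusters are compared only against the non-tiny ones and are merged into the one with the largest positive overlap.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster (assumed meaning of m_i)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def gaussian_overlap(x, cluster, width=1.0, cutoff=1e-3):
    """Hypothetical overlap index in [0, 1]; values below `cutoff` count as 0."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, centroid(cluster)))
    p = math.exp(-d2 / (2.0 * width ** 2))
    return p if p >= cutoff else 0.0


def merge_clusters(clusters, overlap=gaussian_overlap, eps=0.05):
    """Sketch of the two-step merging: absorb or drop tiny clusters, then combine."""
    total = sum(len(c) for c in clusters)
    big = [list(c) for c in clusters if len(c) / total >= eps]
    small = [list(c) for c in clusters if len(c) / total < eps]

    # Step 1: tiny clusters are merged into an overlapping cluster or discarded.
    for tiny in small:
        if not big:
            break
        x = centroid(tiny)
        scores = [overlap(x, b) for b in big]
        best = max(range(len(big)), key=scores.__getitem__)
        if scores[best] > 0:
            big[best].extend(tiny)
        # otherwise the tiny cluster is treated as noise and dropped

    # Step 2: repeatedly merge a cluster into the larger cluster it overlaps most.
    merged = True
    while merged and len(big) > 1:
        merged = False
        big.sort(key=len)  # ascending size, so big[-1] is the largest
        for i in range(len(big) - 1):
            x = centroid(big[i])
            scores = [overlap(x, big[j]) for j in range(i + 1, len(big))]
            k = max(range(len(scores)), key=scores.__getitem__)
            if scores[k] > 0:
                big[i + 1 + k].extend(big[i])
                del big[i]
                merged = True
                break  # cluster count changed, so restart step 2
    return big
```

Calling `merge_clusters` with a different `overlap` callable is the intended extension point, since the real overlapping index depends on the SVC result.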


Methodology

Summary

1) Initialize a small value of q, and set C = 1 and γ = 2.
2) Perform the SVC algorithm and get |clusters|.
3) If |clusters| < 2, increase q and go to 2).
4) If the outlier-detection criterion holds, decrease C, fix q, and go to 2). Otherwise, go to 5).
5) If |SVs| < 50% of the data points, go to 6). Otherwise, decrease C and go to 2).
6) Compute the validity measure index V(m).
7) If |clusters| > N, increase q and go to 2). Otherwise, stop the SVC.
8) Use the cluster-merging mechanism to identify an ideal |clusters|. Output |clusters|.
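The control flow of steps 1) to 7) can be sketched as a loop around an SVC solver. Everything named here is hypothetical: `run_svc` is a caller-supplied function returning a dict with the keys shown, and `q_step`, `c_shrink`, `max_clusters`, and `max_iter` are illustrative parameters. Two choices are assumptions rather than the slide's wording: the loop stops once the cluster count exceeds `max_clusters`, and the setting with the smallest validity index is returned.

```python
def search_parameters(run_svc, n_points, q0=0.1, q_step=0.1,
                      c_shrink=0.5, max_clusters=10, max_iter=1000):
    """Sketch of the q / C search loop; returns (q, C, n_clusters, validity).

    run_svc(q, C) must return a dict with keys "n_clusters", "n_svs",
    "outlier_criterion" (bool), and "validity" (float).
    """
    q, C = q0, 1.0                                   # step 1
    history = []
    for _ in range(max_iter):                        # guard against endless loops
        result = run_svc(q, C)                       # step 2
        if result["n_clusters"] < 2:                 # step 3
            q += q_step
            continue
        if result["outlier_criterion"]:              # step 4: decrease C, keep q
            C *= c_shrink
            continue
        if result["n_svs"] >= 0.5 * n_points:        # step 5: too many SVs
            C *= c_shrink
            continue
        history.append(                              # step 6
            (q, C, result["n_clusters"], result["validity"])
        )
        if result["n_clusters"] > max_clusters:      # step 7 (assumed stop rule)
            break
        q += q_step
    # Assumed selection rule: smallest validity index wins.
    return min(history, key=lambda h: h[3]) if history else None


def fake_svc(q, C):
    """Toy stand-in for an SVC run, used only to exercise the loop."""
    n = max(1, round(q * 10))
    return {"n_clusters": n, "n_svs": 10,
            "outlier_criterion": False, "validity": abs(n - 4) + 1.0}


best = search_parameters(fake_svc, n_points=100)
```

The cluster-merging mechanism of step 8) would then be applied to the clustering obtained with the returned q and C.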



Experiments - Benchmark and Artificial Examples

Bensaid Data Set







Experiments - Benchmark and Artificial Examples

Five-Cluster Data Set & Five-Cluster Data Set With Noise







Experiments - Benchmark and Artificial Examples

Five-Cluster Data Set With Noise, after cluster merging


Experiments - Benchmark and Artificial Examples

Crescent Data Set







Experiments - IRIS Data Set

Misclassification


Conclusions


This paper integrates the following into SVC:
  Cluster validity measure
  Outlier detection
  Merging mechanism

Suitable values are determined automatically for
  Kernel parameter
  Soft-margin constant

Clustering with
  Compact and smooth arbitrarily shaped cluster contours
  Increased robustness to outliers and noise


Comments


Advantage
  Provides a cluster validity index for a clustering method

Drawback

Application
  SVC
