DATA CLUSTERING WITH KERNEL K-MEANS++


Nov 8, 2013



PROJECT OBJECTIVES

o PROJECT GOAL
  Experimentally demonstrate the application of Kernel K-Means to data sets that are not linearly separable

o ACADEMIC IMPORTANCE
  Expand the application of the Kernel K-Means clustering algorithm to non-traditional uses

Matt Strautmann, Dept. of Electrical and Computer Engineering

BACKGROUND

o WHAT IS K-MEANS CLUSTERING?
  K-Means clustering aims to divide the dataset into clusters (“groups”) in which each data point belongs to the cluster with the nearest mean vector.
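The nearest-mean rule above can be sketched in a few lines of Python. This is a minimal illustration of standard K-Means, not the code used in the project; the toy points and starting means are hypothetical.

```python
import math


def nearest(point, means):
    """Index of the mean vector closest to point (Euclidean distance)."""
    return min(range(len(means)), key=lambda j: math.dist(point, means[j]))


def k_means(points, means, max_iter=100):
    """Alternate assignment and update steps until assignments stop changing."""
    means = list(means)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, means) for p in points]   # assignment step
        if new_labels == labels:
            break                                          # converged
        labels = new_labels
        for j in range(len(means)):                        # update step
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old mean if a cluster is empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, means


# Hypothetical toy data: two tight groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, means = k_means(points, means=[(0.0, 0.0), (10.0, 9.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each point ends up labeled with the index of its nearest mean, and each mean is the centroid of the points assigned to it.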


o WHAT IS KERNEL K-MEANS?
  - Sum-of-squares algorithm: minimizes the within-cluster sum of squared distances, measured in the kernel feature space
  - Two-step process: data point assignment and update
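As a rough sketch of the two-step process, the distance from a point to a cluster mean in feature space can be written with kernel values alone, so the mean never has to be computed explicitly. The Gaussian (RBF) kernel, the bandwidth `sigma`, and the toy data below are assumptions for illustration; the poster does not state which kernel was used.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; sigma is an illustrative bandwidth."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))


def feature_space_sq_dist(K, i, members):
    """Squared feature-space distance from point i to the mean of `members`.

    K[i][i] - 2/|C| * sum_j K[i][j] + 1/|C|^2 * sum_{j,l} K[j][l]
    """
    n = len(members)
    cross = sum(K[i][j] for j in members) / n
    within = sum(K[j][l] for j in members for l in members) / (n * n)
    return K[i][i] - 2.0 * cross + within


def kernel_k_means(points, labels, k, kernel=rbf_kernel, max_iter=100):
    """Reassign points to the nearest feature-space mean until nothing changes."""
    K = [[kernel(a, b) for b in points] for a in points]  # Gram matrix
    labels = list(labels)
    for _ in range(max_iter):
        # "Update" step: cluster means are implied by the current membership.
        clusters = [[i for i, l in enumerate(labels) if l == c] for c in range(k)]
        # Assignment step: move each point to its nearest implied mean.
        new_labels = []
        for i in range(len(points)):
            dists = [feature_space_sq_dist(K, i, m) if m else math.inf
                     for m in clusters]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical toy data with one point deliberately mislabeled at the start.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
print(kernel_k_means(points, labels=[0, 0, 1, 1, 1, 1], k=2))  # [0, 0, 0, 1, 1, 1]
```

With a linear kernel (a plain dot product) the same distance formula reduces to the ordinary squared Euclidean distance to the centroid, which is why this can be read as K-Means carried out in a different space.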

o WHAT IS THE PLUS PLUS INITIALIZATION SCHEME?
  - The first mean vector is a randomly selected data point
  - Each subsequent mean vector is created by evaluating randomly selected data points against a vector weighting probability
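A minimal sketch of this initialization, assuming the standard k-means++ weighting in which a point's chance of becoming the next mean is proportional to its squared distance from the nearest mean already chosen. The poster does not spell out its exact weighting, so this is one plausible reading rather than the project's code.

```python
import math
import random


def plus_plus_init(points, k, rng=random):
    """Pick k initial mean vectors from points using squared-distance weighting."""
    means = [rng.choice(points)]                 # first mean: uniform random point
    while len(means) < k:
        # Squared distance from each point to its nearest already-chosen mean.
        d2 = [min(math.dist(p, m) ** 2 for m in means) for p in points]
        # Sample the next mean with probability proportional to d2.
        # (Raises ValueError if every weight is zero, i.e. fewer than k distinct points.)
        means.append(rng.choices(points, weights=d2, k=1)[0])
    return means


rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # hypothetical data
print(plus_plus_init(points, k=2, rng=rng))
```

Points that are already means have weight zero and cannot be picked again, and far-away points are favored, which tends to spread the initial means across the data.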

APPROACH


o Evaluate standard K-Means (Soft K-Means++) against 4 datasets to form a benchmark
o Hybridize Soft K-Means++ with Kernel K-Means to form Kernel K-Means++
o Test Kernel K-Means++ on small size, small dimension Gaussian, large dimension Gaussian, and large size datasets

Dr. Donald C. Wunsch II, Dept. of Electrical and Computer Engineering

PROJECT DATASETS

DISCUSSION


o Kernel K-Means++ was found to cluster the test datasets better than Soft K-Means++
o Kernel data mapping was seen to resolve overlapping data sets by:
  - Mapping the data, before clustering, to a higher-dimensional feature space using a nonlinear function
  - Partitioning the points with linear separators in the new space
o Soft K-Means++ could not successfully cluster the Lung Cancer Dataset; only one of the three clusters was successfully recovered
o Soft K-Means++ clustered the two-dimension, two-cluster Gaussian dataset with only one error out of the one thousand data points
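The mapping idea in the discussion above can be illustrated with an explicit feature map. The map, the ring data, and the threshold below are hypothetical choices for the illustration; a kernel method applies such a map implicitly through kernel evaluations.

```python
import math


def lift(point):
    """Nonlinear map from 2-D to 3-D: append the squared radius as a new axis."""
    x, y = point
    return (x, y, x * x + y * y)


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


inner, outer = ring(1.0, 12), ring(3.0, 12)   # hypothetical concentric rings

# In the lifted space the plane z = 5 separates the rings, although
# no straight line separates them in the original 2-D plane.
threshold = 5.0
inner_side = [lift(p)[2] < threshold for p in inner]
outer_side = [lift(p)[2] > threshold for p in outer]
print(all(inner_side), all(outer_side))  # True True
```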

SOFT K-MEANS++ VS. KERNEL K-MEANS++




CONCLUDING REMARKS


o The initialization was seen to be the most important factor in the algorithm converging
o The “PLUS PLUS” cluster mean initialization was seen to improve the results
o Kernel assignment works better than the maximum responsibility calculation of Soft K-Means
o Kernel K-Means++ can handle small or large dimension datasets well; the increase in dimensionality seemed to be advantageous for the Lung Cancer Dataset (56 dimensions) compared with the lower clustering accuracy on the Iris Plant Dataset (4 dimensions)
o Kernel K-Means++ produced superior results to Soft K-Means++ when clustering the Lung Cancer Dataset and demonstrated recognition of all three clusters
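For readers unfamiliar with the maximum responsibility calculation mentioned in the remarks above, here is a small sketch. It assumes a common Soft K-Means formulation in which each point receives a responsibility per cluster from a softmax over negative scaled squared distances; the stiffness parameter `beta` and the example values are hypothetical, and the poster does not state which formulation was used.

```python
import math


def responsibilities(point, means, beta=1.0):
    """Soft assignment of one point: softmax of -beta * squared distance."""
    d2 = [math.dist(point, m) ** 2 for m in means]
    shift = min(d2)  # subtract the smallest distance for numerical stability
    weights = [math.exp(-beta * (d - shift)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


means = [(0.0, 0.0), (4.0, 0.0)]          # hypothetical mean vectors
r = responsibilities((1.0, 0.0), means)
hard_label = r.index(max(r))              # maximum-responsibility assignment
print(r, hard_label)
```

The responsibilities sum to one for each point, and taking the largest one turns the soft assignment into a single cluster label.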

RESULTS COMPARISON



o Kernel K-Means++ clustering accuracy was superior in all cases except the two-dimensional, two-cluster dataset.
o Relative to Soft K-Means++, the clustering accuracy of the datasets changed by the following amounts:
  - Iris Plant: 104%
  - Lung Cancer: 38%
  - 2D2K: -2.5%
  - 8D5K: 30%
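Assuming these percentages are relative changes in average accuracy from Soft K-Means++ to Kernel K-Means++, they can be recomputed from the averages in the results tables. Three of the four figures reproduce to rounding; the Lung Cancer value comes out near 41.7% rather than 38%, so that figure may have been derived from different numbers.

```python
def relative_change(baseline, new):
    """Percent change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# (Soft K-Means++, Kernel K-Means++) average accuracy in percent, from the results tables.
accuracy = {
    "Iris Plant": (28.00, 57.00),
    "Lung Cancer": (43.75, 62.00),
    "2D2K": (99.00, 96.50),
    "8D5K": (58.50, 76.31),
}
for name, (soft, kernel) in accuracy.items():
    print(f"{name}: {relative_change(soft, kernel):+.1f}%")
# Iris Plant: +103.6%
# Lung Cancer: +41.7%
# 2D2K: -2.5%
# 8D5K: +30.4%
```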

FUTURE WORK


o Further improvement of the mean vector initialization is believed possible over the “PLUS PLUS” initialization
o Other options for the mean-squared error calculation for data point evaluation are possible
o The time analysis of the algorithm must be calculated





ACKNOWLEDGEMENTS

The author would like to acknowledge the expertise of Dr. Rui Xu in advising this project.



Figure: K-Means clustering steps (source: http://en.wikipedia.org/wiki/K-means_clustering)

1.) Initial Mean Orientations
2.) Voronoi Diagram Generated by the Means (data points associated with nearest cluster mean)
3.) Cluster Centroid Becomes New Cluster Mean
4.) Steps 2 and 3 Repeated until Convergence

Figures: Iris Plant Dataset; 2 Dimension, 2 Cluster Dataset (Gaussian 2D2K)
Image sources: lans.ece.utexas.edu, eleves.ens.fr

Soft K-Means++ (all statistics over ten runs)

Dataset                  Clustering Accuracy Average   Standard Deviation of Accuracy   Variance of Accuracy
Iris Plant Dataset       28.00%                        8.218%                           2.867%
Lung Cancer Dataset      43.75%                        -                                -
2D2K Gaussian Dataset    99.00%                        -                                -
8D5K Gaussian Dataset    58.50%                        2.082%                           0.043%

Kernel K-Means++ (all statistics over ten runs)

Dataset                  Clustering Accuracy Average   Standard Deviation of Accuracy   Variance of Accuracy
Iris Plant Dataset       57.00%                        5.009%                           2.238%
Lung Cancer Dataset      62.00%                        6.878%                           0.473%
2D2K Gaussian Dataset    96.50%                        1.677%                           0.028%
8D5K Gaussian Dataset    76.31%                        10.366%                          1.075%
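The poster does not say how clustering accuracy was scored. One common convention, assumed here purely for illustration, is to try every relabeling of the clusters and keep the best fraction of points that match the known classes, then summarize several runs with a mean, standard deviation, and variance. All labels and run values below are made up and are not poster data.

```python
from itertools import permutations
from statistics import mean, pstdev


def clustering_accuracy(true_labels, cluster_labels, k):
    """Best fraction of matching labels over all relabelings of the k clusters."""
    best = 0.0
    for perm in permutations(range(k)):
        hits = sum(t == perm[c] for t, c in zip(true_labels, cluster_labels))
        best = max(best, hits / len(true_labels))
    return best


truth = [0, 0, 0, 1, 1, 1]                 # hypothetical ground truth
found = [1, 1, 0, 0, 0, 0]                 # hypothetical clustering output
acc = clustering_accuracy(truth, found, k=2)
print(acc)                                 # 5 of 6 points match after relabeling

run_accuracies = [0.50, 0.60, 0.55]        # made-up values, not poster data
avg = mean(run_accuracies)
std = pstdev(run_accuracies)               # population standard deviation (an assumption)
var = std ** 2                             # variance is the square of the standard deviation
print(avg, std, var)
```

Trying every permutation is only practical for a small number of clusters, such as the two to five clusters in the datasets here.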