DATA CLUSTERING WITH KERNEL K-MEANS++
PROJECT OBJECTIVES
o PROJECT GOAL
Experimentally demonstrate the application of Kernel K-Means to non-linearly clusterable data sets
o ACADEMIC IMPORTANCE
Expand the application of the Kernel K-Means clustering algorithm to non-traditional uses
Matt Strautmann, Dept. of Electrical and Computer Engineering
BACKGROUND
o WHAT IS K-MEANS CLUSTERING?
K-Means clustering aims to divide the dataset into clusters (“groups”) in which each data point belongs to the cluster with the nearest mean vector.
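The nearest-mean rule above is commonly written as minimizing a within-cluster sum of squares. The notation below (K clusters C_k with mean vectors \mu_k, data points x_n) is introduced here for illustration and does not appear on the poster.

```latex
J = \sum_{k=1}^{K} \sum_{x_n \in C_k} \lVert x_n - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_n \in C_k} x_n
```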
o WHAT IS KERNEL K-MEANS?
Sum-of-squares algorithm
Two-step process: data point assignment and update
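A minimal sketch of the two-step process, assuming a Gaussian (RBF) kernel; the poster does not state which kernel was used, and the names `rbf_kernel`, `kernel_kmeans`, and `gamma` are hypothetical. Distances to the cluster means are computed from the kernel matrix alone, so the means in feature space are never formed explicitly.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian kernel matrix; the kernel choice and gamma are assumptions."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, labels, n_clusters, n_iter=50):
    """Alternate assignment and update using only the kernel matrix K."""
    labels = np.asarray(labels).copy()
    n = K.shape[0]
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            size = mask.sum()
            if size == 0:
                continue  # an empty cluster stays unreachable
            # ||phi(x_i) - m_c||^2 = K_ii - 2 * mean_j K_ij + mean_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / size
                          + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)  # assignment step
        if np.array_equal(new_labels, labels):
            break                         # converged
        labels = new_labels               # update step (means are implicit)
    return labels
```

The starting `labels` can come from any initialization, for example by assigning each point to the nearest of a set of initial mean vectors.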
o WHAT IS THE PLUS PLUS INITIALIZATION SCHEME?
The first mean vector is a randomly selected data point
Each subsequent mean vector is chosen by evaluating randomly selected data points against a weighting probability vector
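A sketch of the initialization, assuming the standard k-means++ rule in which each point's selection probability is proportional to its squared distance from the nearest mean already chosen. The poster's implementation may differ in detail; `kmeans_pp_init` is a hypothetical name.

```python
import numpy as np

def kmeans_pp_init(X, n_clusters, rng=None):
    """Pick initial mean vectors from the rows of X with D^2 weighting."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first mean: a uniformly random data point
    for _ in range(1, n_clusters):
        chosen = np.array(centers)
        # squared distance from every point to its nearest chosen mean
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()       # the weighting probability vector
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Points that coincide with an already chosen mean get probability zero, which is what spreads the initial means across the data.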
APPROACH
• Evaluate standard K-Means (Soft K-Means++) against four datasets to form a benchmark
• Hybridize Soft K-Means++ with Kernel K-Means to form Kernel K-Means++
• Test Kernel K-Means++ on small size, small dimension Gaussian, large dimension Gaussian, and large size datasets
Dr. Donald C. Wunsch II, Dept. of Electrical and Computer Engineering
PROJECT DATASETS
DISCUSSION
• Kernel K-Means++ was found to cluster the test datasets better than Soft K-Means++
• Kernel data mapping was seen to resolve the overlapping data sets by:
  • Mapping the data before clustering to a higher-dimensional feature space using a nonlinear function
  • Partitioning the points with linear separators in the new space
• Soft K-Means++ could not successfully cluster the Lung Cancer Dataset; only one of the three clusters was successfully clustered
• Soft K-Means++ clustered the two-dimension, two-cluster Gaussian dataset with only one error out of the one thousand data points
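The kernel mapping idea in the discussion can be illustrated with an explicit feature map. This is a toy sketch on synthetic concentric rings, not one of the project datasets; the map (x, y) -> (x, y, x^2 + y^2) and the names `ring` and `lift` are assumptions chosen for illustration. The rings share a center, so no straight line separates them in two dimensions, but a plane does after lifting.

```python
import numpy as np

def ring(radius, n, rng):
    """n points on a circle of the given radius centered at the origin."""
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack((radius * np.cos(theta), radius * np.sin(theta)))

def lift(X):
    """Nonlinear map (x, y) -> (x, y, x^2 + y^2); an illustrative choice."""
    return np.column_stack((X, (X ** 2).sum(axis=1)))

rng = np.random.default_rng(0)
inner, outer = ring(1.0, 100, rng), ring(3.0, 100, rng)
Z_inner, Z_outer = lift(inner), lift(outer)

# In the lifted space the plane z = 5 is a linear separator for the rings.
separable = bool(Z_inner[:, 2].max() < 5.0 < Z_outer[:, 2].min())
```

A kernel function gives the same effect without building the lifted coordinates, since only inner products in the new space are needed.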
SOFT K-MEANS++ VS. KERNEL K-MEANS++
CONCLUDING REMARKS
• The initialization was seen to be the most important factor in the algorithm converging
• The “PLUS PLUS” cluster mean initialization was seen to improve the results
• Kernel assignment works better than the maximum responsibility calculation of Soft K-Means
• Kernel K-Means++ can handle small or large dimension datasets well; the increase in dimensionality seemed to be advantageous for the Lung Cancer Dataset (56 dimensions), which clustered more accurately than the Iris Plant Dataset (4 dimensions)
• Kernel K-Means++ produced superior results to Soft K-Means++ when clustering the Lung Cancer Dataset and demonstrated recognition of all three clusters
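For reference, the maximum responsibility calculation mentioned in the remarks above can be sketched as follows, assuming the common soft K-Means formulation in which responsibilities are a softmax over negative squared distances with a stiffness parameter. The exact form used in the project is not given on the poster; `beta`, `responsibilities`, and `soft_kmeans_step` are hypothetical names.

```python
import numpy as np

def responsibilities(X, means, beta=1.0):
    """Soft assignment of each point to each mean; each row sums to 1."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

def soft_kmeans_step(X, means, beta=1.0):
    """One update: responsibility-weighted means and hard labels."""
    r = responsibilities(X, means, beta)
    new_means = (r.T @ X) / r.sum(axis=0)[:, None]
    labels = r.argmax(axis=1)  # label = cluster with maximum responsibility
    return new_means, labels
```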
RESULTS COMPARISON
• Kernel K-Means++ clustering accuracy was superior in all cases except the two-dimensional, two-cluster dataset.
• Relative to Soft K-Means++, the clustering accuracy of the datasets changed by the following amounts:
  • Iris Plant: 104% increase
  • Lung Cancer: 38% increase
  • 2D2K: 2.5% decrease
  • 8D5K: 30% increase
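The arithmetic behind these figures can be checked from the ten-run average accuracies in the poster's tables, assuming each figure is the relative change from Soft K-Means++ to Kernel K-Means++. The poster does not state the formula, so `relative_change` is an assumption.

```python
# Ten-run average accuracies (percent) as listed in the poster's tables.
soft = {"Iris Plant": 28.00, "Lung Cancer": 43.75, "2D2K": 99.00, "8D5K": 58.50}
kernel = {"Iris Plant": 57.00, "Lung Cancer": 62.00, "2D2K": 96.50, "8D5K": 76.31}

def relative_change(old, new):
    """Percent change from old to new, relative to old."""
    return 100.0 * (new - old) / old

changes = {name: round(relative_change(soft[name], kernel[name]), 1)
           for name in soft}
# changes == {"Iris Plant": 103.6, "Lung Cancer": 41.7, "2D2K": -2.5, "8D5K": 30.4}
```

Under this assumption the Iris Plant, 2D2K, and 8D5K values match the listed figures after rounding, while the Lung Cancer value comes out near 41.7% rather than 38%, so that figure may have been computed differently.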
FUTURE WORK
• Further improvement of the mean vector initialization is believed possible over the “PLUS PLUS” initialization
• Other options for the mean-squared error calculation for data point evaluation are possible
• The time analysis of the algorithm must be calculated
ACKNOWLEDGEMENTS
The author would like to acknowledge the expertise of Dr. Rui Xu in advising this project.
1.) Initial Mean Orientations
2.) Voronoi Diagram Generated by the Means (data points associated with nearest cluster mean)
3.) Cluster Centroid Becomes New Cluster Mean
4.) Steps 2 and 3 Repeated until Convergence
http://en.wikipedia.org/wiki/K-means_clustering
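The four steps above can be sketched as a short loop. This is a minimal illustration of the standard K-Means iteration with hard assignments, not the project's code; `kmeans` and its arguments are hypothetical names, and the initial means are supplied by the caller (step 1).

```python
import numpy as np

def kmeans(X, means, n_iter=100):
    """Standard K-Means iteration starting from the given initial means."""
    means = np.asarray(means, dtype=float).copy()
    for _ in range(n_iter):
        # Step 2: associate each data point with its nearest cluster mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: each cluster centroid becomes the new cluster mean
        new_means = means.copy()
        for k in range(len(means)):
            if np.any(labels == k):
                new_means[k] = X[labels == k].mean(axis=0)
        # Step 4: repeat until the means stop moving
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```

A mean with no assigned points is left where it is, which is one simple way to handle empty clusters.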
Iris Plant Dataset
2 Dimension, 2 Cluster Dataset (Gaussian 2D2K)
Figure sources: lans.ece.utexas.edu, eleves.ens.fr
Soft K-Means++ (all values over ten runs)
Dataset                  Clustering Accuracy Average   Standard Deviation of Accuracy   Variance of Accuracy
Iris Plant Dataset       28.00%                        8.218%                           2.867%
Lung Cancer Dataset      43.75%                        -                                -
2D2K Gaussian Dataset    99.00%                        -                                -
8D5K Gaussian Dataset    58.50%                        2.082%                           0.043%

Kernel K-Means++ (all values over ten runs)
Dataset                  Clustering Accuracy Average   Standard Deviation of Accuracy   Variance of Accuracy
Iris Plant Dataset       57.00%                        5.009%                           2.238%
Lung Cancer Dataset      62.00%                        6.878%                           0.473%
2D2K Gaussian Dataset    96.50%                        1.677%                           0.028%
8D5K Gaussian Dataset    76.31%                        10.366%                          1.075%
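The poster does not say how clustering accuracy was scored. One common convention, sketched here as an assumption, is to relabel the clusters with whichever permutation best matches the true classes and report the fraction of points that agree. `clustering_accuracy` is a hypothetical helper, and the exhaustive search is only practical for a small number of clusters.

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(true_labels, pred_labels, n_clusters):
    """Best agreement between predicted and true labels over all relabelings."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best = 0.0
    for perm in permutations(range(n_clusters)):
        mapped = np.array(perm)[pred_labels]  # rename cluster k to perm[k]
        best = max(best, float(np.mean(mapped == true_labels)))
    return best
```

Averages, standard deviations, and variances such as those in the tables would then be taken over the per-run accuracies, for example with `np.mean`, `np.std`, and `np.var`.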