CS685: Special Topics in Data Mining, UKY

Semi-Supervised Clustering
CS 685: Special Topics in Data Mining
Spring 2008
Jinze Liu
Outline of lecture
• Overview of clustering and classification
• What is semi-supervised learning?
  – Semi-supervised clustering
  – Semi-supervised classification
• Semi-supervised clustering
  – What is semi-supervised clustering?
  – Why semi-supervised clustering?
  – Semi-supervised clustering algorithms
Supervised classification versus unsupervised clustering
• Unsupervised clustering
  – Group similar objects together to find clusters
    • Minimize intra-class distance
    • Maximize inter-class distance
• Supervised classification
  – Class label for each training sample is given
  – Build a model from the training data
  – Predict class labels for unseen future data points
What is clustering?
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: two clusters of points; inter-cluster distances are maximized, intra-cluster distances are minimized]
What is Classification?
A learning algorithm performs induction on a training set to learn a model; the model is then applied (deduction) to a test set to predict the unknown class labels.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
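The induction/deduction loop above can be made concrete with a minimal 1-nearest-neighbor classifier over the toy training table. This is an illustrative sketch only; the distance function (one unit per mismatched categorical attribute plus a scaled numeric gap) is an assumption made here, not part of the slides.

```python
# Illustrative only: 1-NN over the toy training table above.
# The mixed categorical/numeric distance is an assumed choice.

train = [
    # (Attrib1, Attrib2, Attrib3 in thousands, Class)
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes"),
]

def dist(a, b):
    # Each mismatched categorical attribute adds 1; numeric gap is scaled.
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100.0

def predict(x):
    # "Apply model": take the class label of the closest training record.
    return min(train, key=lambda r: dist(x, r))[3]

print(predict(("No", "Small", 55)))   # test record 11 -> "No"
```

Under this distance, record 11 is closest to training record 3 (No, Small, 70K), so 1-NN predicts "No".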
Clustering algorithms
• K-Means
• Hierarchical clustering
• Graph-based clustering (spectral clustering)
Classification algorithms
• Decision Trees
• Naïve Bayes classifier
• Support Vector Machines (SVM)
• K-Nearest-Neighbor classifiers
• Logistic Regression
• Neural Networks
• Linear Discriminant Analysis (LDA)
Supervised Classification Example
[Figures: example data points for supervised classification]

Unsupervised Clustering Example
[Figures: example data points for unsupervised clustering]
Semi-Supervised Learning
• Combines labeled and unlabeled data during training to improve performance:
  – Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  – Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
[Spectrum: unsupervised clustering — semi-supervised learning — supervised classification]
Semi-Supervised Classification Example
[Figures: labeled and unlabeled data points for semi-supervised classification]
Semi-Supervised Classification
• Algorithms:
  – Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
  – Co-training [Blum:COLT98]
  – Transductive SVMs [Vapnik:98, Joachims:ICML99]
  – Graph-based algorithms
• Assumptions:
  – Known, fixed set of categories given in the labeled data.
  – Goal is to improve classification of examples into these known categories.
• More discussion next week
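To make the labeled-plus-unlabeled idea concrete, here is a minimal self-training loop: a simple scheme in the same spirit as the semi-supervised EM approach above, though not itself one of the listed algorithms. The 1-D centroid classifier and the closest-point confidence proxy are assumptions made for this sketch.

```python
# Self-training sketch (illustrative, not from the slides): train on labeled
# data, adopt the most "confident" unlabeled point, and retrain.

labeled = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
unlabeled = [1.5, 2.0, 8.0, 8.5]

def fit(pts):
    # One centroid per class label.
    cents = {}
    for x, y in pts:
        cents.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in cents.items()}

def self_train(labeled, unlabeled, rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        cents = fit(labeled)
        # Crude confidence proxy: the unlabeled point closest to any centroid.
        x = min(unlabeled, key=lambda p: min(abs(p - c) for c in cents.values()))
        y = min(cents, key=lambda lab: abs(x - cents[lab]))
        labeled.append((x, y))
        unlabeled.remove(x)
    return fit(labeled)

print(self_train(labeled, unlabeled))  # -> {'a': 1.125, 'b': 8.875}
```

Each round moves the centroids toward the newly adopted points, which is how the unlabeled data end up influencing the final classifier.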
Semi-Supervised Clustering Example
[Figures: example data points for semi-supervised clustering]

Second Semi-Supervised Clustering Example
[Figures: a second semi-supervised clustering example]
Semi-supervised clustering: problem definition
• Input:
  – A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
  – A small amount of domain knowledge
• Output:
  – A partitioning of the objects into k clusters (possibly with some discarded as outliers)
• Objective:
  – Maximum intra-cluster similarity
  – Minimum inter-cluster similarity
  – High consistency between the partitioning and the domain knowledge
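When the domain knowledge takes the form of must-link and cannot-link pairs, the "high consistency" objective can be measured directly. The helper below is an illustrative sketch (the function name and scoring rule are my own): it scores a partitioning by the fraction of constraints it satisfies.

```python
# Illustrative sketch: score a partitioning against pairwise constraints.

def constraint_consistency(labels, must_link, cannot_link):
    """labels: dict mapping object id -> cluster id."""
    sat = total = 0
    for a, b in must_link:
        sat += labels[a] == labels[b]   # must-link satisfied if same cluster
        total += 1
    for a, b in cannot_link:
        sat += labels[a] != labels[b]   # cannot-link satisfied if different
        total += 1
    return sat / total if total else 1.0

labels = {"o1": 0, "o2": 0, "o3": 1, "o4": 1}
print(constraint_consistency(labels, [("o1", "o2")],
                             [("o2", "o3"), ("o1", "o4")]))  # -> 1.0
```

A score of 1.0 means the partitioning is fully consistent with the given domain knowledge; lower values quantify how much of it the clustering ignored.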
Why semi-supervised clustering?
• Why not clustering?
  – The clusters produced may not be the ones required.
  – Sometimes there are multiple possible groupings.
• Why not classification?
  – Sometimes there is insufficient labeled data.
• Potential applications
  – Bioinformatics (gene and protein clustering)
  – Document hierarchy construction
  – News/email categorization
  – Image categorization
Semi-Supervised Clustering
• Domain knowledge
  – Partial label information is given
  – Apply some constraints (must-links and cannot-links)
• Approaches
  – Search-based semi-supervised clustering (this class)
    • Alter the clustering algorithm using the constraints
  – Similarity-based semi-supervised clustering (next class)
    • Alter the similarity measure based on the constraints
  – Combination of both
Search-Based Semi-Supervised Clustering
• Alter the clustering algorithm that searches for a good partitioning by:
  – Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].
  – Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
  – Using the labeled data to initialize clusters in an iterative refinement algorithm (e.g., K-Means) [Basu:ICML02].
Overview of K-Means Clustering
• K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters {X_l}, l = 1..K.

Algorithm: initialize K cluster centers randomly, then repeat until convergence:
  – Cluster Assignment Step: assign each data point x to the cluster X_l such that the L2 distance of x from μ_l (the center of X_l) is minimum.
  – Center Re-estimation Step: re-estimate each cluster center μ_l as the mean of the points in that cluster.
K-Means Objective Function
• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

    J = Σ_{l=1}^{K} Σ_{x_i ∈ X_l} ||x_i − μ_l||²

• Initialization of K cluster centers:
  – Totally random
  – Random perturbation from the global mean
  – Heuristic to ensure well-separated centers
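The two steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's code: for simplicity it initializes centers from the first K points rather than one of the random schemes listed above, and it declares convergence when the centers stop moving.

```python
# Minimal K-Means: assignment by squared L2 distance, then mean re-estimation.

def kmeans(points, k, iters=100):
    centers = [points[i] for i in range(k)]   # simplistic initialization
    clusters = []
    for _ in range(iters):
        # Cluster Assignment Step: nearest center by squared L2 distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Center Re-estimation Step: mean of each cluster's points.
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:        # convergence: centers unchanged
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
print(centers)   # two centers, one near (1/3, 1/3) and one near (28/3, 28/3)
```

Note that, as the objective above suggests, this only finds a local minimum of J; the result depends on initialization, which is exactly what the seeded variants below exploit.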
K-Means Example
[Figures, one step per slide:]
1. Randomly initialize means
2. Assign points to clusters
3. Re-estimate means
4. Re-assign points to clusters
5. Re-estimate means
6. Re-assign points to clusters
7. Re-estimate means and converge
Semi-Supervised K-Means
• Partial label information is given
  – Seeded K-Means
  – Constrained K-Means
• Constraints (must-link, cannot-link)
  – COP-K-Means
Semi-Supervised K-Means for partially labeled data
• Seeded K-Means:
  – Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  – Seed points are used only for initialization, and not in subsequent steps.
• Constrained K-Means:
  – Labeled data provided by the user are used to initialize the K-Means algorithm.
  – Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
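Both variants share the same machinery and differ only in whether seed labels are clamped during assignment, so they can be sketched in one routine. This is an illustrative sketch, not the papers' code: the function name and the `clamp_seeds` flag (False for Seeded, True for Constrained K-Means) are my own.

```python
# Seeded / Constrained K-Means sketch: seeds give the initial centers;
# with clamp_seeds=True the seed labels stay fixed during assignment.

def seeded_kmeans(points, seeds, k, clamp_seeds=False, iters=100):
    """seeds: dict mapping point index -> cluster label in 0..k-1."""
    # Initialization: center l is the mean of the seed points labeled l.
    centers = []
    for l in range(k):
        mem = [points[i] for i, lab in seeds.items() if lab == l]
        centers.append(tuple(sum(c) / len(c) for c in zip(*mem)))
    labels = {}
    for _ in range(iters):
        labels = {}
        for i, p in enumerate(points):
            if clamp_seeds and i in seeds:
                labels[i] = seeds[i]            # Constrained: seeds don't move
            else:
                labels[i] = min(range(k), key=lambda l: sum(
                    (a - b) ** 2 for a, b in zip(p, centers[l])))
        new = []
        for l in range(k):
            mem = [points[i] for i in labels if labels[i] == l]
            new.append(tuple(sum(c) / len(c) for c in zip(*mem)) if mem else centers[l])
        if new == centers:
            break
        centers = new
    return labels, centers

pts = [(0, 0), (1, 0), (9, 9), (10, 9), (5, 5)]
labels, centers = seeded_kmeans(pts, {0: 0, 2: 1}, 2)
print(labels)   # -> {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
```

With `clamp_seeds=False`, point 0 could in principle drift to the other cluster in a later iteration; with `clamp_seeds=True` it never will, which is exactly the distinction the slides draw.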
Seeded K-Means
Use labeled data to find the initial centroids and then run K-Means. The labels of seeded points may change.
Seeded K-Means Example
[Figures, one step per slide:]
1. Initialize means using labeled data
2. Assign points to clusters
3. Re-estimate means
4. Assign points to clusters and converge (the label of one seed point is changed)
Constrained K-Means
Use labeled data to find the initial centroids and then run K-Means. The labels of seeded points will not change.
Constrained K-Means Example
[Figures, one step per slide:]
1. Initialize means using labeled data
2. Assign points to clusters
3. Re-estimate means and converge
COP-K-Means
• COP-K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
• Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints that it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
• Algorithm: during the cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
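The constrained assignment step described above can be sketched as follows: try clusters in order of increasing distance and take the nearest one that violates no constraint, aborting if none exists. This is an illustrative sketch of that one step, not the full COP-K-Means implementation; the helper names are my own.

```python
# COP-K-Means cluster-assignment step (sketch): nearest legal cluster.

def violates(i, l, labels, must_link, cannot_link):
    # Illegal if a must-link partner was already placed in another cluster,
    # or a cannot-link partner was already placed in cluster l.
    for a, b in must_link:
        if i in (a, b):
            j = b if i == a else a
            if j in labels and labels[j] != l:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            j = b if i == a else a
            if j in labels and labels[j] == l:
                return True
    return False

def assign(points, centers, must_link, cannot_link):
    labels = {}
    for i, p in enumerate(points):
        order = sorted(range(len(centers)), key=lambda l: sum(
            (a - b) ** 2 for a, b in zip(p, centers[l])))
        for l in order:                      # nearest legal cluster wins
            if not violates(i, l, labels, must_link, cannot_link):
                labels[i] = l
                break
        else:
            raise RuntimeError("no legal assignment: abort")   # COP-K-Means fails
    return labels

centers = [(0, 0), (5, 5)]
pts = [(0, 0), (0.4, 0), (5, 5)]
print(assign(pts, centers, [], [(0, 1)]))   # -> {0: 0, 1: 1, 2: 1}
```

In the example, point 1 is nearest to cluster 0 but is pushed to cluster 1 by its cannot-link with point 0; a must-link would pull points together in the same greedy fashion. Conflicting constraints make every cluster illegal for some point, which triggers the abort, matching the failing illustration below.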
COP-K-Means Algorithm
[Figure: pseudocode of the COP-K-Means algorithm]
Illustration
[Figures, one case per slide:]
1. Must-link: determine the point's label; assign it to the red class.
2. Cannot-link: determine the point's label; assign it to the red class.
3. Must-link and cannot-link together: no legal assignment exists, so the clustering algorithm fails.
Summary
• Seeded and Constrained K-Means: partially labeled data
• COP-K-Means: constraints (must-link and cannot-link)
• Constrained K-Means and COP-K-Means require all the constraints to be satisfied.
  – May not be effective if the seeds contain noise.
• Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
  – Less sensitive to noise in the seeds.
• Experiments show that semi-supervised K-Means outperforms traditional K-Means.
Reference
• Semi-supervised Clustering by Seeding
  – http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
• Constrained K-means Clustering with Background Knowledge
  – http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf
Next class
• Topics
  – Similarity-based semi-supervised clustering
• Readings
  – Distance metric learning, with application to clustering with side-information
    • http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
  – From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering
    • http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf