Incorporating User Provided Constraints
into Document Clustering
Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi
Department of Computer Science
Wayne State University
Detroit, MI 48202
{chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu
Outline
• Introduction
• Overview of related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical results for SS-NMF
• Experiments and results
• Conclusion
What is clustering?
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Inter-cluster distances are maximized; intra-cluster distances are minimized
Document Clustering
• Grouping of text documents into meaningful clusters in an unsupervised manner
(Example cluster labels: Government, Science, Arts)
Unsupervised Clustering Example
(Figure: scatter of unlabeled data points grouped into clusters.)
Semi-supervised clustering: problem definition
• Input:
  – A set of unlabeled objects
  – A small amount of domain knowledge (labels or pairwise constraints)
• Output:
  – A partitioning of the objects into k clusters
• Objective:
  – Maximum intra-cluster similarity
  – Minimum inter-cluster similarity
  – High consistency between the partitioning and the domain knowledge
• Depending on the form of the given domain knowledge:
  – Users provide class labels (seeded points) a priori for some of the documents
  – Users know which few documents are related (must-link) or unrelated (cannot-link)
Semi-supervised Clustering
(Figure: examples of seeded points, must-link constraints, and cannot-link constraints.)
Why semi-supervised clustering?
• Large amounts of unlabeled data exist
  – More is being produced all the time
• Generating labels for data is expensive
  – It usually requires human intervention
• Use human input to provide labels for some of the data
  – Improves existing naive clustering methods
  – Labeled data guides the clustering of unlabeled data
  – The end result is a better clustering of the data
• Potential applications
  – Document/word categorization
  – Image categorization
  – Bioinformatics (gene/protein clustering)
Outline
• Introduction
• Overview of related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical results for SS-NMF
• Experiments and results
• Conclusion
Clustering Algorithms
• Document hierarchical clustering
  – Bottom-up (agglomerative)
  – Top-down (divisive)
• Document partitioning (flat clustering)
  – K-means
  – Probabilistic clustering using Naïve Bayes or Gaussian mixture models, etc.
• Document clustering based on graph models
Semi-supervised Clustering Algorithms
• Semi-supervised clustering with labels (partial label information is given):
  – SS-Seeded-KMeans (Sugato Basu et al., ICML 2002)
  – SS-Constrained-KMeans (Sugato Basu et al., ICML 2002)
• Semi-supervised clustering with constraints (pairwise must-link and cannot-link constraints are given):
  – SS-COP-KMeans (Wagstaff et al., ICML 2001)
  – SS-HMRF-KMeans (Sugato Basu et al., ACM SIGKDD 2004)
  – SS-Kernel-KMeans (Brian Kulis et al., ICML 2005)
  – SS-Spectral-Normalized-Cuts (X. Ji et al., ACM SIGIR 2006)
Overview of K-means Clustering
• K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters.
• Objective function: locally minimizes the sum of squared distances between the data points and their corresponding cluster centers,

  J = Σ_{h=1}^{k} Σ_{x_i ∈ f_h} ‖x_i − m_h‖²

• Algorithm: initialize k cluster centers randomly, then repeat until convergence:
  – Cluster assignment step: assign each data point x_i to the cluster f_h whose center is closest to x_i.
  – Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster.
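As a concrete illustration of the two-step loop above, here is a minimal NumPy sketch of plain k-means; the function and variable names are our own, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: X is an (n, d) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(max_iter):
        # Cluster assignment step: nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster
        new_centers = np.array([X[labels == h].mean(axis=0) if np.any(labels == h)
                                else centers[h] for h in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```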
Semi-supervised Kernel K-means (SS-KK) [Brian Kulis et al., ICML 2005]
• Semi-supervised kernel k-means objective:

  J_{SS-KK} = Σ_{x_i ∈ X} ‖φ(x_i) − m_{y_i}‖² − Σ_{(x_i,x_j) ∈ C_ML s.t. y_i = y_j} w_ij + Σ_{(x_i,x_j) ∈ C_CL s.t. y_i = y_j} w̄_ij

where φ is the kernel-induced mapping of the data, m_{y_i} is the centroid of the cluster containing x_i, and w_ij (resp. w̄_ij) is the cost of violating the constraint between two points.
  – First term: kernel k-means objective function
  – Second term: reward function for satisfying must-link constraints
  – Third term: penalty function for violating cannot-link constraints
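One standard way to realize this objective (following the construction in Kulis et al.) is to fold the rewards and penalties into the kernel matrix and then run ordinary kernel k-means on the result. The sketch below only shows that kernel modification; the uniform weights w and w_bar are illustrative assumptions.

```python
import numpy as np

def constrained_kernel(K, must_link, cannot_link, w=1.0, w_bar=1.0):
    """Add a reward w to kernel entries of must-link pairs and subtract a
    penalty w_bar for cannot-link pairs, so that ordinary kernel k-means on
    the modified kernel optimizes the semi-supervised objective."""
    K_mod = K.copy()
    for i, j in must_link:
        K_mod[i, j] += w
        K_mod[j, i] += w
    for i, j in cannot_link:
        K_mod[i, j] -= w_bar
        K_mod[j, i] -= w_bar
    return K_mod
```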
Overview of Spectral Clustering
• Spectral clustering is a graph-theoretic clustering algorithm.
• The data are modeled as a weighted graph G = (V, E, A); clustering minimizes the between-cluster similarities (edge weights A_ij).
Spectral Normalized Cuts
• Minimize the similarity between two clusters π_p and π_q: cut(π_p, π_q) = Σ_{i∈π_p, j∈π_q} A_ij
• Balance weights: each cut is normalized by the cluster's total degree, deg(π_p) = Σ_{i∈π_p} Σ_j A_ij
• Cluster indicator: a (scaled) membership vector q_p for each cluster π_p
• The graph partition problem becomes: min Σ_p cut(π_p, V \ π_p) / deg(π_p)
• The solution is given by the eigenvectors of the generalized eigenproblem (D − A) q = λ D q, where D is the diagonal degree matrix.
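For the two-cluster case, the eigenvector solution can be sketched as follows with NumPy; the function name and the zero-thresholding rule are illustrative assumptions, not code from the paper.

```python
import numpy as np

def normalized_cut_2way(A):
    """Two-way normalized cut: take the second-smallest generalized
    eigenvector of (D - A) q = lambda D q and threshold it at zero."""
    degrees = A.sum(axis=1)
    D = np.diag(degrees)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    # Normalized Laplacian  L_sym = D^{-1/2} (D - A) D^{-1/2}
    L_sym = D_inv_sqrt @ (D - A) @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    q = D_inv_sqrt @ eigvecs[:, 1]          # second-smallest eigenvector, rescaled
    return (q > 0).astype(int)              # cluster indicator
```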
Semi-supervised Spectral Normalized Cuts (SS-SNC) [X. Ji et al., ACM SIGIR 2006]
• Semi-supervised spectral learning objective: the normalized cut criterion augmented with constraint terms,
  – First term: spectral normalized cut objective function
  – Second term: reward function for satisfying must-link constraints
  – Third term: penalty function for violating cannot-link constraints
Outline
• Introduction
• Related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
  – NMF review
  – Model formulation and algorithm derivation
• Theoretical results for SS-NMF
• Experiments and results
• Conclusion
Non-negative Matrix Factorization (NMF)
• NMF decomposes a non-negative matrix into two non-negative factors (D. Lee et al., Nature 1999):

  min ‖X − FGᵀ‖²

• Symmetric NMF for clustering (C. Ding et al., SIAM Int'l Conf. on Data Mining, 2005):

  min ‖A − GSGᵀ‖²

• Numerical example: the 5×5 similarity matrix

  A = [ 0.3172 0.3148 0.2568 0.2640 0.2650
        0.3148 0.3244 0.2055 0.2090 0.2038
        0.2568 0.2055 0.7202 0.7411 0.8311
        0.2640 0.2090 0.7411 0.7822 0.8749
        0.2650 0.2038 0.8311 0.8749 1.0000 ]

  is factorized approximately as G S Gᵀ, with a 5×2 cluster indicator G and a 2×2 diagonal S; the block structure of A (documents 1–2 vs. documents 3–5) is recovered in G.
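A minimal sketch of the Lee–Seung multiplicative updates for min ‖X − FGᵀ‖² is shown below; the update rules are the standard ones assumed here, not the authors' implementation.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for min ||X - F G^T||^2 with F, G >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k))
    G = rng.random((m, k))
    for _ in range(n_iter):
        # Alternate multiplicative updates; eps avoids division by zero
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    return F, G
```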
SS-NMF
• Incorporate prior knowledge into an NMF-based framework for document clustering.
• Users provide pairwise constraints:
  – Must-link constraints C_ML: two documents d_i and d_j must belong to the same cluster, (d_i, d_j) ∈ C_ML.
  – Cannot-link constraints C_CL: two documents d_i and d_j must belong to different clusters, (d_i, d_j) ∈ C_CL.
• Constraints are weighted by an associated cost matrix W:
  – W_reward: reward for placing documents d_i and d_j in the same cluster when (d_i, d_j) ∈ C_ML.
  – W_penalty: cost of violating the constraint between documents d_i and d_j when (d_i, d_j) ∈ C_CL.
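For illustration, the two cost matrices can be assembled from the user-provided pairs roughly as below; the uniform weights w_reward and w_penalty are an assumption (per-pair costs are equally possible), and the function name is hypothetical.

```python
import numpy as np

def constraint_matrices(n, must_link, cannot_link, w_reward=1.0, w_penalty=1.0):
    """Build symmetric reward/penalty matrices from user-provided pairs."""
    W_reward = np.zeros((n, n))
    W_penalty = np.zeros((n, n))
    for i, j in must_link:                 # (d_i, d_j) in C_ML
        W_reward[i, j] = W_reward[j, i] = w_reward
    for i, j in cannot_link:               # (d_i, d_j) in C_CL
        W_penalty[i, j] = W_penalty[j, i] = w_penalty
    return W_reward, W_penalty
```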
SS-NMF Algorithm
• Define the objective function of SS-NMF:

  J_{SS-NMF} = min_{S ≥ 0, G ≥ 0} ‖Ã − GSGᵀ‖²

where

  Ã = A + W_reward − W_penalty,
  W_reward = { w_ij : (d_i, d_j) ∈ C_ML s.t. y_i = y_j },
  W_penalty = { w̄_ij : (d_i, d_j) ∈ C_CL s.t. y_i = y_j },

and y_i is the cluster label of document d_i.
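Putting the pieces together, a sketch of the SS-NMF iteration might look as follows. The sign convention for Ã, the clipping used to keep it non-negative, and the exact multiplicative update rules are assumptions based on the general symmetric tri-factorization form, not the authors' exact derivation.

```python
import numpy as np

def ss_nmf(A, W_reward, W_penalty, k, n_iter=200, eps=1e-9, seed=0):
    """Sketch of SS-NMF: factorize the constraint-modified similarity matrix
    A_tilde into G S G^T using multiplicative updates (assumed update form)."""
    A_tilde = A + W_reward - W_penalty      # raise must-link, lower cannot-link entries
    A_tilde = np.clip(A_tilde, 0.0, None)   # assumption: keep entries non-negative
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    G = rng.random((n, k))
    S = rng.random((k, k))
    for _ in range(n_iter):
        S *= np.sqrt((G.T @ A_tilde @ G) / (G.T @ G @ S @ G.T @ G + eps))
        G *= np.sqrt((A_tilde @ G @ S) / (G @ G.T @ A_tilde @ G @ S + eps))
    labels = G.argmax(axis=1)               # axis with the largest projection value
    return labels, G, S
```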
Summary of SS

NMF Algorithm
Outline
• Introduction
• Overview of related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical results for SS-NMF
• Experiments and results
• Conclusion
Algorithm Correctness and Convergence
Based on constrained optimization theory and an auxiliary function, we can prove for SS-NMF:
1. Correctness: the solution converges to a local minimum.
2. Convergence: the iterative algorithm converges.
(Details in papers [1], [2])
[1] Y. Chen, M. Rege, M. Dong and J. Hua, "Incorporating User Provided Constraints into Document Clustering", Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%)
[2] Y. Chen, M. Rege, M. Dong and J. Hua, "Non-negative Matrix Factorization for Semi-supervised Data Clustering", Journal of Knowledge and Information Systems, to appear, 2008.
SS-NMF: General Framework for Semi-supervised Clustering
• Starting from the SS-KK objective

  J_{SS-KK} = Σ_{h=1}^{k} Σ_{x_i ∈ X_h} ‖φ(x_i) − m_h‖² − Σ_{(d_i,d_j) ∈ C_ML s.t. y_i = y_j} w_ij + Σ_{(d_i,d_j) ∈ C_CL s.t. y_i = y_j} w̄_ij

• Proof (steps (1)–(3) in the paper): each objective can be rewritten as a trace optimization over the same constraint-modified similarity matrix.
• Orthogonal symmetric semi-supervised NMF is equivalent to semi-supervised kernel k-means (SS-KK) and semi-supervised spectral normalized cuts (SS-SNC)!
Advantages of SS-NMF

Clustering indicator:
• SS-KK: hard clustering; exactly orthogonal cluster indicator.
• SS-SNC: the derived latent semantic space is orthogonal; no direct relationship between the singular vectors and the clusters.
• SS-NMF: soft clustering; maps the documents into a non-negative latent semantic space which need not be orthogonal; the cluster label can be determined by the axis with the largest projection value.

Time complexity:
• SS-KK: iterative algorithm.
• SS-SNC: requires solving a computationally expensive constrained eigen-decomposition.
• SS-NMF: iterative algorithm that can give a partial answer at intermediate stages of the solution by specifying a fixed number of iterations; uses simple basic matrix computations and is easily deployed over a distributed computing environment when dealing with large document collections.
Outline
• Introduction
• Overview of related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical results for SS-NMF
• Experiments and results
  – Artificial toy data
  – Real data
• Conclusion
Experiments on Toy Data
1. Artificial toy data consisting of two natural clusters.
Results on Toy Data (SS-KK and SS-NMF)
Right table: difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) for the toy data.
• Hard clustering: each object belongs to a single cluster.
• Soft clustering: each object is probabilistically assigned to clusters.
Results on Toy Data (SS-SNC and SS-NMF)
(a) Data distribution in the SS-SNC subspace of the first two singular vectors: there is no relationship between the axes and the clusters.
(b) Data distribution in the SS-NMF subspace of the two column vectors of G: the data points from the two clusters get distributed along the two axes.
Time Complexity Analysis
Figure above: computational speed comparison for SS-KK, SS-SNC, and SS-NMF (O(tkn²), where t is the number of iterations, k the number of clusters, and n the number of documents).
Experiments on Text Data
Table 2: summary of the data sets [1] used in the experiments.
[1] http://www.cs.umn.edu/~han/data/tmdata.tar.gz
• Evaluation metric: clustering accuracy

  AC = (1/n) Σ_{i=1}^{n} δ(y_i, ŷ_i)

where n is the total number of documents in the experiment, δ is the delta function that equals one if ŷ_i = y_i (and zero otherwise), ŷ_i is the estimated label, and y_i is the ground-truth label.
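In practice the estimated cluster labels must first be matched to the ground-truth classes before computing this accuracy; a common choice for that mapping (an assumption here, the slide only states the delta-function form) is the Hungarian algorithm over the confusion matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """AC = (1/n) * sum_i delta(y_i, map(y_hat_i)); the mapping between
    predicted clusters and true classes is found by the Hungarian algorithm."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    confusion = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[p, t] += 1                       # rows: predicted, cols: true
    row, col = linear_sum_assignment(-confusion)   # maximize matched counts
    return confusion[row, col].sum() / len(y_true)
```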
Results on Text Data (Comparison with Unsupervised Clustering)
• (1) Comparison with unsupervised clustering approaches.
Note: SS-NMF uses only 3% pairwise constraints.
Results on Text Data (Before and After Clustering)
(a) Typical document-document matrix before clustering.
(b) Document-document similarity matrix after clustering with SS-NMF (k = 2).
(c) Document-document similarity matrix after clustering with SS-NMF (k = 5).
Results on Text Data (Clustering with Different Constraints)
Left table: comparison of the confusion matrix C and the normalized cluster centroid matrix S of SS-NMF for different percentages of pairwise-constrained documents.
Results on Text Data (Comparison with Semi-supervised Clustering)
• (2) Comparison with SS-KK and SS-SNC:
  (a) Graft-Phos, (b) England-Heart, (c) Interest-Trade

Results on Text Data (Comparison with Semi-supervised Clustering)
• Comparison with SS-KK and SS-SNC: Fbis2, Fbis3, Fbis4, Fbis5
Experiments on Image Data
Figure above: sample images for image categorization (from top to bottom: O = Owls, R = Roses, L = Lions, E = Elephants, H = Horses).
Table 3: image data sets [2] used in the experiments.
[2] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html
Results on Image Data (Comparison with Unsupervised Clustering)
• (1) Comparison with unsupervised clustering approaches.
Table above: comparison of image clustering accuracy between KK, SNC, NMF, and SS-NMF with only 3% pairwise constraints on the images. SS-NMF consistently outperforms the other well-established unsupervised image clustering methods.
Results on Image Data (Comparison with Semi-supervised Clustering)
• (2) Comparison with SS-KK and SS-SNC:
Left figure: comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of constrained image pairs: (a) O-R, (b) L-H, (c) R-L, (d) O-R-L.
Results on Image Data (Comparison with Semi-supervised Clustering)
• (2) Comparison with SS-KK and SS-SNC:
Left figure: comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of constrained image pairs: (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, (h) O-R-L-E-H.
Outline
• Introduction
• Related work
• Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
• Theoretical results for SS-NMF
• Experiments and results
• Conclusion
Conclusion
• Semi-supervised clustering:
  – has many real-world applications
  – outperforms traditional clustering algorithms
• The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering.
• Many existing semi-supervised clustering algorithms can be extended to achieve multi-type object co-clustering tasks.
References
[1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, "Deriving Semantics for Image Clustering from Accumulated User Feedbacks", Proc. of ACM Multimedia, Germany, 2007.
[2] Y. Chen, M. Rege, M. Dong and J. Hua, "Incorporating User Provided Constraints into Document Clustering", Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%)
[3] Y. Chen, M. Rege, M. Dong and J. Hua, "Non-negative Matrix Factorization for Semi-supervised Data Clustering", Journal of Knowledge and Information Systems, invited as a best paper of ICDM 07, to appear 2008.