Incorporating User Provided Constraints into Document Clustering

Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi

Department of Computer Science

Wayne State University

Detroit, MI 48202

{chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu

Outline

- Introduction
- Overview of related work
- Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical results for SS-NMF
- Experiments and results
- Conclusion

What is clustering?

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups:

- Inter-cluster distances are maximized
- Intra-cluster distances are minimized

Document Clustering

Grouping of text documents into meaningful clusters in an unsupervised manner, e.g. into categories such as Government, Science, and Arts.

Unsupervised Clustering Example

[Figure: scatter plot of unlabeled points grouped into clusters]

Semi-supervised clustering: problem definition

Input:
- A set of unlabeled objects
- A small amount of domain knowledge (labels or pairwise constraints)

Output:
- A partitioning of the objects into k clusters

Objective:
- Maximum intra-cluster similarity
- Minimum inter-cluster similarity
- High consistency between the partitioning and the domain knowledge

Two forms of domain knowledge:
- Users provide class labels (seeded points) a priori for some of the documents
- Users know which few documents are related (must-link) or unrelated (cannot-link)

[Figure: semi-supervised clustering with seeded points, must-link, and cannot-link constraints]

Why semi-supervised clustering?

- Large amounts of unlabeled data exist
  - More is being produced all the time
- Generating labels for data is expensive
  - Usually requires human intervention
- Use human input to provide labels for some of the data
  - Improves existing naive clustering methods
  - Labeled data guides the clustering of the unlabeled data
  - The end result is a better clustering of the data
- Potential applications
  - Document/word categorization
  - Image categorization
  - Bioinformatics (gene/protein clustering)

Outline

- Introduction
- Overview of related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical results for SS-NMF
- Experiments and results
- Conclusion

Clustering Algorithms

- Document hierarchical clustering
  - Bottom-up, agglomerative
  - Top-down, divisive
- Document partitioning (flat clustering)
  - K-means
  - Probabilistic clustering using the Naïve Bayes or Gaussian mixture model, etc.
- Document clustering based on graph models

Semi-supervised Clustering Algorithms

- Semi-supervised clustering with labels (partial label information is given):
  - SS-Seeded-KMeans (Sugato Basu, et al., ICML 2002)
  - SS-Constrained-KMeans (Sugato Basu, et al., ICML 2002)
- Semi-supervised clustering with constraints (pairwise must-link and cannot-link constraints are given):
  - SS-COP-KMeans (Wagstaff et al., ICML 2001)
  - SS-HMRF-KMeans (Sugato Basu, et al., ACM SIGKDD 2004)
  - SS-Kernel-KMeans (Brian Kulis, et al., ICML 2005)
  - SS-Spectral-Normalized-Cuts (X. Ji, et al., ACM SIGIR 2006)


Overview of K-means Clustering

K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters.

Objective function: locally minimize the sum of squared distances between the data points and their corresponding cluster centers:

J = Σ_{h=1}^{k} Σ_{x_i ∈ f_h} ||x_i − m_h||²

Algorithm:
1. Initialize k cluster centers randomly.
2. Repeat until convergence:
   - Cluster assignment step: assign each data point x_i to the cluster f_h whose center is nearest to x_i.
   - Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster.
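A minimal NumPy sketch of this loop (random-sample initialization and the convergence test are implementation choices the slide leaves open):

```python
# K-means sketch: iterative relocation between the assignment and
# re-estimation steps described above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Cluster assignment step: each point joins its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes its cluster mean.
        new_centers = np.array([X[labels == h].mean(axis=0)
                                if np.any(labels == h) else centers[h]
                                for h in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return labels, centers
```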

Semi-supervised Kernel K-means (SS-KK) [Brian Kulis, et al., ICML 2005]

Semi-supervised Kernel K-means objective:

J_{SS-KK} = Σ_{h=1}^{k} Σ_{x_i ∈ f_h} ||φ(x_i) − m_h||² − Σ_{(x_i, x_j) ∈ C_ML, y_i = y_j} w_ij + Σ_{(x_i, x_j) ∈ C_CL, y_i = y_j} w_ij

where φ is the kernel function mapping points from the input space into the kernel space, m_h is the centroid of cluster f_h, and w_ij is the cost of violating the constraint between two points.

- First term: kernel k-means objective function
- Second term: reward function for satisfying must-link constraints
- Third term: penalty function for violating cannot-link constraints
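Kulis et al. show that an objective of this form can be optimized by running kernel k-means on a constraint-adjusted kernel matrix. A hedged sketch of that construction (the function name, the uniform weight w, and the diagonal shift that restores positive semi-definiteness are assumptions, not the paper's exact recipe):

```python
# Sketch: fold must-link / cannot-link constraints into a kernel matrix,
# after which ordinary kernel k-means can be run on the result.
import numpy as np

def constrained_kernel(K, must_link, cannot_link, w=1.0):
    """K: n x n base kernel; must_link/cannot_link: lists of (i, j) pairs."""
    K = K.copy()
    for i, j in must_link:      # reward: raise similarity of must-link pairs
        K[i, j] += w
        K[j, i] += w
    for i, j in cannot_link:    # penalty: lower similarity of cannot-link pairs
        K[i, j] -= w
        K[j, i] -= w
    # Diagonal shift sigma*I keeps the adjusted matrix positive semi-definite.
    sigma = max(0.0, -np.linalg.eigvalsh(K).min())
    return K + sigma * np.eye(len(K))
```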



Overview of Spectral Clustering

Spectral clustering is a graph-theoretic clustering algorithm:

- Weighted graph G = (V, E, A)
- Minimize between-cluster similarities (edge weights A_ij)

Spectral Normalized Cuts

Minimize the similarity between clusters π₁ and π₂:

cut(π₁, π₂) = Σ_{i ∈ π₁, j ∈ π₂} A_ij

Balance the weights by the total degree of each cluster:

Ncut(π₁, π₂) = cut(π₁, π₂)/deg(π₁) + cut(π₁, π₂)/deg(π₂)

Cluster indicator: a relaxed real-valued indicator vector z, with degree matrix D (D_ii = Σ_j A_ij).

The graph partition becomes:

min_z (zᵀ(D − A)z) / (zᵀDz)

The solution is an eigenvector of the generalized eigenproblem:

(D − A)z = λDz
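A compact sketch of this two-way relaxation (median thresholding of the second eigenvector is one common rounding rule, not mandated by the slide):

```python
# 2-way normalized cut: solve (D - A) z = lambda * D * z and split on the
# second-smallest eigenvector (the relaxed cluster indicator).
import numpy as np
from scipy.linalg import eigh

def normalized_cut_2way(A):
    """A: symmetric non-negative affinity matrix with positive degrees."""
    D = np.diag(A.sum(axis=1))          # degree matrix
    vals, vecs = eigh(D - A, D)         # generalized eigenproblem
    z = vecs[:, 1]                      # second-smallest eigenvector
    return (z > np.median(z)).astype(int)
```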




Semi-supervised Spectral Normalized Cuts (SS-SNC) [X. Ji, et al., ACM SIGIR 2006]

Semi-supervised spectral learning objective:

J_{SS-SNC} = J_{SNC} − Σ_{(d_i, d_j) ∈ C_ML, y_i = y_j} w_ij + Σ_{(d_i, d_j) ∈ C_CL, y_i = y_j} w_ij

where J_{SNC} is the normalized-cut objective and w_ij is the cost of violating the constraint between documents d_i and d_j.

- First term: spectral normalized cut objective function
- Second term: reward function for satisfying must-link constraints
- Third term: penalty function for violating cannot-link constraints



Outline

- Introduction
- Related work
- Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
  - NMF review
  - Model formulation and algorithm derivation
- Theoretical results for SS-NMF
- Experiments and results
- Conclusion





Non-negative Matrix Factorization (NMF)

NMF decomposes a matrix into two non-negative factors (D. Lee et al., Nature 1999):

min ||X − FGᵀ||², F ≥ 0, G ≥ 0

Symmetric NMF for clustering (C. Ding et al., SIAM ICDM 2005):

min ||A − GSGᵀ||², G ≥ 0, S ≥ 0

Example: A ≈ G × S × Gᵀ

A =
| 0.3172  0.3148  0.2568  0.2640  0.2650 |
| 0.3148  0.3244  0.2055  0.2090  0.2038 |
| 0.2568  0.2055  0.7202  0.7411  0.8311 |
| 0.2640  0.2090  0.7411  0.7822  0.8749 |
| 0.2650  0.2038  0.8311  0.8749  1.0000 |

G =
| 0.0348  0.5538 |
| 0.0005  0.6449 |
| 0.5476  0.3698 |
| 0.5355  0.3765 |
| 0.5256  0.3672 |

S =
| 2.0402  0      |
| 0       1.0735 |
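For reference, the classic multiplicative update rules of Lee and Seung minimize the first objective, ||X − FGᵀ||²; a minimal sketch (iteration count, initialization, and the epsilon guard are arbitrary choices):

```python
# NMF sketch: Lee & Seung multiplicative updates for min ||X - F G^T||^2
# with F, G >= 0; eps guards against division by zero.
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k))
    G = rng.random((m, k))
    for _ in range(n_iter):
        F *= (X @ G) / (F @ (G.T @ G) + eps)    # update row-space factor
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)  # update column-space factor
    return F, G
```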

SS-NMF

Incorporate prior knowledge into an NMF-based framework for document clustering.

Users provide pairwise constraints:
- Must-link constraints C_ML: two documents d_i and d_j with (d_i, d_j) ∈ C_ML must belong to the same cluster.
- Cannot-link constraints C_CL: two documents d_i and d_j with (d_i, d_j) ∈ C_CL must belong to different clusters.

Constraints are defined by an associated violation cost matrix W:
- W_reward: cost of violating the must-link constraint between documents d_i and d_j, if such a constraint exists.
- W_penalty: cost of violating the cannot-link constraint between documents d_i and d_j, if such a constraint exists.
SS-NMF Algorithm

Define the objective function of SS-NMF:

J_{SS-NMF} = min_{G ≥ 0, S ≥ 0} ||Ã − GSGᵀ||²

where

Ã = A + W_reward − W_penalty
W_reward = { w_ij | (d_i, d_j) ∈ C_ML s.t. y_i ≠ y_j }
W_penalty = { w_ij | (d_i, d_j) ∈ C_CL s.t. y_i = y_j }

and y_i is the cluster label of document d_i.
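A minimal sketch of assembling Ã from the two constraint sets (a uniform weight w stands in for the per-pair costs w_ij; clipping at zero so the multiplicative updates stay non-negative is an added assumption):

```python
# Sketch: constraint-augmented similarity matrix A~ = A + W_reward - W_penalty.
import numpy as np

def augment_similarity(A, must_link, cannot_link, w=1.0):
    W_reward = np.zeros_like(A)
    W_penalty = np.zeros_like(A)
    for i, j in must_link:        # reward entries for must-link pairs
        W_reward[i, j] = W_reward[j, i] = w
    for i, j in cannot_link:      # penalty entries for cannot-link pairs
        W_penalty[i, j] = W_penalty[j, i] = w
    A_tilde = A + W_reward - W_penalty
    return np.maximum(A_tilde, 0.0)  # assumption: clip to keep A~ non-negative
```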
Summary of SS-NMF Algorithm
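As a hedged stand-in for the algorithm box, here is the iterative scheme the objective implies, written with symmetric tri-factorization multiplicative updates in the style of Ding et al. (treat the exact update rules as an illustration, not the paper's verbatim derivation):

```python
# SS-NMF sketch: alternate multiplicative updates for
# min ||A~ - G S G^T||^2 with G >= 0, S >= 0.
import numpy as np

def ss_nmf(A_tilde, k, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n = A_tilde.shape[0]
    G = rng.random((n, k))
    S = rng.random((k, k))
    for _ in range(n_iter):
        # Update the cluster association matrix S.
        S *= (G.T @ A_tilde @ G) / (G.T @ G @ S @ G.T @ G + eps)
        # Update the cluster indicator matrix G.
        G *= np.sqrt((A_tilde @ G @ S) / (G @ (G.T @ A_tilde @ G @ S) + eps))
    labels = G.argmax(axis=1)  # hard label: axis with the largest projection
    return G, S, labels
```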

Outline

- Introduction
- Overview of related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical results for SS-NMF
- Experiments and results
- Conclusion





Algorithm Correctness and Convergence

Based on constrained optimization theory and an auxiliary function, we can prove for SS-NMF:

1. Correctness: the solution converges to a local minimum.
2. Convergence: the iterative algorithm converges.

(Details in papers [1], [2])

[1] Y. Chen, M. Rege, M. Dong and J. Hua, "Incorporating User Provided Constraints into Document Clustering", Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%)
[2] Y. Chen, M. Rege, M. Dong and J. Hua, "Non-negative Matrix Factorization for Semi-supervised Data Clustering", Journal of Knowledge and Information Systems, to appear, 2008.

SS-NMF: General Framework for Semi-supervised Clustering

Proof sketch, starting from the SS-KK objective J_{SS-KK} given earlier:

(1) SS-KK can be rewritten as a trace maximization, max_{GᵀG = I} tr(GᵀÃG), over the cluster indicator G;
(2) SS-SNC reduces to the same trace maximization with a degree-normalized Ã;
(3) orthogonal symmetric SS-NMF, min_{GᵀG = I} ||Ã − GSGᵀ||², reduces to the same trace maximization.

Orthogonal Symmetric Semi-supervised NMF is equivalent to Semi-supervised Kernel K-means (SS-KK) and Semi-supervised Spectral Normalized Cuts (SS-SNC)!

Advantages of SS-NMF

Clustering indicator:
- SS-KK: hard clustering; the cluster indicator is exactly orthogonal.
- SS-SNC: the derived latent semantic space is constrained to be orthogonal; there is no direct relationship between the singular vectors and the clusters.
- SS-NMF: soft clustering; maps the documents into a non-negative latent semantic space which need not be orthogonal; the cluster label can be determined by the axis with the largest projection value.

Time complexity:
- SS-KK: iterative algorithm.
- SS-SNC: requires solving a computationally expensive constrained eigen-decomposition.
- SS-NMF: iterative algorithm; a partial answer can be obtained at intermediate stages of the solution by specifying a fixed number of iterations; uses simple basic matrix computations and is easily deployed over a distributed computing environment when dealing with large document collections.

Outline

- Introduction
- Overview of related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical results for SS-NMF
- Experiments and results
  - Artificial toy data
  - Real data
- Conclusion






Experiments on Toy Data

1. Artificial toy data: consists of two natural clusters.


Results on Toy Data (SS-KK and SS-NMF)

Table (right): difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) for the toy data.

- Hard clustering: each object belongs to a single cluster.
- Soft clustering: each object is probabilistically assigned to clusters.

Results on Toy Data (SS-SNC and SS-NMF)

(a) Data distribution in the SS-SNC subspace of the first two singular vectors: there is no relationship between the axes and the clusters.

(b) Data distribution in the SS-NMF subspace of the two column vectors of G: the data points from the two clusters get distributed along the two axes.

Time Complexity Analysis

Figure above: computational speed comparison for SS-KK, SS-SNC and SS-NMF (O(tkn²), for t iterations, k clusters, and n documents).

Experiments on Text Data

Table 2: Summary of the data sets [1] used in the experiments.

[1] http://www.cs.umn.edu/~han/data/tmdata.tar.gz

Evaluation metric:

AC = (Σ_{i=1}^{n} δ(y_i, ŷ_i)) / n

where n is the total number of documents in the experiment, δ is the delta function that equals one if y_i = ŷ_i (and zero otherwise), ŷ_i is the estimated label, and y_i is the ground truth label.
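A sketch of this metric in code. Because cluster indices are arbitrary, predicted clusters are first matched to ground-truth labels; using the Hungarian method for that matching is an assumption, since the slide only states the delta-function sum:

```python
# Clustering accuracy: AC = sum_i delta(y_i, map(y_hat_i)) / n.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=int)     # count[p, t]: pred p, truth t
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    rows, cols = linear_sum_assignment(-count)  # best pred -> truth matching
    mapping = dict(zip(rows, cols))
    hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```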
Results on Text Data (Comparison with Unsupervised Clustering)

(1) Comparison with unsupervised clustering approaches:

[Table: document clustering accuracy comparison]

Note: SS-NMF adds only 3% constraints.

Results on Text Data (Before Clustering and After Clustering)

(a) Typical document-document matrix before clustering.
(b) Document-document similarity matrix after clustering with SS-NMF (k=2).
(c) Document-document similarity matrix after clustering with SS-NMF (k=5).

Results on Text Data (Clustering with Different Constraints)

Table (left): comparison of the confusion matrix C and the normalized cluster centroid matrix S of SS-NMF for different percentages of pairwise-constrained documents.

Results on Text Data (Comparison with Semi-supervised Clustering)

(2) Comparison with SS-KK and SS-SNC:

[Figures: (a) Graft-Phos, (b) England-Heart, (c) Interest-Trade]

[Figures: comparison with SS-KK and SS-SNC on Fbis2, Fbis3, Fbis4, Fbis5]

Experiments on Image Data

Figure above: sample images for image categorization (from top to bottom: O-Owls, R-Roses, L-Lions, E-Elephants, H-Horses).

Table 3: Image data sets [2] used in the experiments.

[2] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html

Results on Image Data (Comparison with Unsupervised Clustering)

(1) Comparison with unsupervised clustering approaches:

Table above: comparison of image clustering accuracy between KK, SNC, NMF and SS-NMF with only 3% pairwise constraints on the images. SS-NMF consistently outperforms the other well-established unsupervised image clustering methods.














Results on Image Data (Comparison with Semi-supervised Clustering)

(2) Comparison with SS-KK and SS-SNC:

Figure: comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of constrained image pairs: (a) O-R, (b) L-H, (c) R-L, (d) O-R-L.

Results on Image Data (Comparison with Semi-supervised Clustering)

(2) Comparison with SS-KK and SS-SNC:

Figure: comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of constrained image pairs: (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, (h) O-R-L-E-H.

Outline

- Introduction
- Related work
- Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
- Theoretical results for SS-NMF
- Experiments and results
- Conclusion






Conclusion

- Semi-supervised clustering:
  - has many real-world applications
  - outperforms traditional clustering algorithms
- The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering.
- Many existing semi-supervised clustering algorithms can be extended to achieve multi-type object co-clustering tasks.







References

[1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, "Deriving Semantics for Image Clustering from Accumulated User Feedbacks", Proc. of ACM Multimedia, Germany, 2007.
[2] Y. Chen, M. Rege, M. Dong and J. Hua, "Incorporating User Provided Constraints into Document Clustering", Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%)
[3] Y. Chen, M. Rege, M. Dong and J. Hua, "Non-negative Matrix Factorization for Semi-supervised Data Clustering", Journal of Knowledge and Information Systems, invited as a best paper of ICDM 07, to appear 2008.