Text Categorization
Moshe Koppel
Lecture 11: Unsupervised Text Categorization
Some slides from Prabhakar Raghavan, Chris Manning, Ray Mooney, Soumen Chakrabarti and TK Prasad
No Training Examples
• Suppose we want to classify docs as positive or negative but have no training examples
   – We could try the bottom-up methods we've seen
• Suppose we want to classify docs as written by X or Y but have no training examples
   – We don't have a clue what distinguishes them
• Suppose we want to classify docs into two classes but we don't even know what the classes are
• Suppose we don't even know how many classes there should be
Sometimes the Answer is Obvious
Clustering
• The idea is to define "natural" clusters
• Position in space depends on the feature set
• You need some notion of what you want in order to choose the right feature types
   – e.g., cluster by topic: content words
   – e.g., cluster by author: style features
• With many features, clusters aren't so elegant
How Many Clusters?
• Some methods decide k in advance
• Some methods try to optimize k based on the data (see the sketch below)
   – e.g., in the previous example, we would want 3 clusters
• Caution: some optimization criteria are biased towards one giant cluster or many tiny ones
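A minimal sketch of the "optimize k from the data" idea, assuming scikit-learn is available and using the silhouette score as the criterion; the lecture does not prescribe a particular criterion, and the names below are illustrative.

```python
# Hedged sketch: choose k by silhouette score (one of several possible criteria).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_range=range(2, 10)):
    """Return the k in k_range whose k-means clustering of X scores best."""
    best_k, best_score = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)  # higher = tighter, better-separated clusters
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```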
Flat or Hierarchical
We consider two kinds of clustering:
• Partitional (flat) clustering
• Hierarchical clustering
Flat Clustering: K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges or another stopping criterion is met:
   For each doc di:
      Assign di to the cluster cj such that dist(di, sj) is minimal.
   For each cluster cj:
      sj = μ(cj)
      (Update each seed to the centroid of its cluster.)
(A runnable sketch of this loop follows.)
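A minimal NumPy sketch of the loop above, assuming docs have already been mapped to rows of a feature matrix X; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: X is an (n_docs, n_features) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]  # K random docs as seeds
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assign each doc to the cluster whose seed is closest (squared Euclidean distance).
        dists = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid of its cluster (keep the old seed if a cluster empties).
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):  # centroids unchanged -> converged
            break
        seeds = new_seeds
    return labels, seeds
```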
K-Means Example (K = 2)
[Figure: points in the plane. The algorithm picks two seeds, then alternates "Reassign clusters" and "Compute centroids" over several iterations until it converges.]
Termination conditions
• Several possibilities, e.g.,
   – A fixed number of iterations.
   – The doc partition is unchanged.
   – Centroid positions don't change.
Convergence
• Why should the K-means algorithm ever reach a fixed point?
   – A state in which clusters don't change.
• K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
   – EM is known to converge.
   – The number of iterations could be large.
Convergence of K-Means
• Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid (see the sketch below):
   Gk = Σi (di – ck)²   (sum over all di in cluster k)
   G = Σk Gk
• Reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
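A small sketch that computes the goodness measure G for a given partition, reusing the output of the kmeans sketch above; again, the names are illustrative.

```python
import numpy as np

def goodness(X, labels, centroids):
    """G = sum over clusters of squared distances from member docs to their centroid."""
    G = 0.0
    for j, c in enumerate(centroids):
        members = X[labels == j]
        G += ((members - c) ** 2).sum()
    return G
```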
Dependence on Seed Choice
• Results depend on the random seed selection.
• Some tricks for avoiding bad seeds (see the sketch below):
   – Choose each seed far from any already chosen
   – Try out multiple seed sets
• Example (points A–F): if you start with B and E as seeds, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
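One way to act on the "try multiple seed sets" trick, sketched with the illustrative kmeans and goodness helpers defined above: run the algorithm from several random seed sets and keep the partition with the lowest G.

```python
def best_of_restarts(X, k, n_restarts=10):
    """Run k-means from several random seed sets and keep the lowest-G clustering."""
    best = None
    for seed in range(n_restarts):
        labels, centroids = kmeans(X, k, seed=seed)
        G = goodness(X, labels, centroids)
        if best is None or G < best[0]:
            best = (G, labels, centroids)
    return best[1], best[2]
```

(Scikit-learn's KMeans does essentially this internally via its n_init parameter.)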
A Problem…
• The methods we've discussed so far don't always give very satisfying results.
• Have a look at this example…
[Figure: example from Ng, Jordan & Weiss, "On Spectral Clustering: Analysis and an Algorithm"]
Spectral Methods
• Because of examples like that, spectral methods have been developed.
• Here's the idea: imagine weighted edges between points, where the weight reflects similarity. Zero out all edges except the high-weight edges to each point's k nearest neighbors.
• Now make the "cheapest" cuts that result in separate components.
(Ng, Jordan & Weiss, "On Spectral Clustering: Analysis and an Algorithm")
Spectral Clustering
Properly formulated, the min-Ncut problem is NP-hard.
But there are some nice linear algebra tricks we can use to get reasonable approximate solutions (see the sketch after these steps):
1. Pre-processing
   • Construct a matrix representation of the dataset.
2. Decomposition
   • Compute eigenvalues and eigenvectors of the matrix.
   • Map each point to a lower-dimensional representation based on one or more eigenvectors.
3. Grouping
   • Assign points to two or more clusters, based on the new representation.
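A compact sketch of those three steps using scikit-learn's SpectralClustering, which builds a k-nearest-neighbor affinity graph, embeds the points with eigenvectors of the graph Laplacian, and clusters the embedding; the two-moons data and all parameter values are illustrative, not from the slides.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Toy data of the kind where plain k-means struggles but spectral clustering succeeds.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

model = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",  # keep only edges to each point's nearest neighbors
    n_neighbors=10,
    assign_labels="kmeans",        # cluster the eigenvector embedding with k-means
    random_state=0,
)
labels = model.fit_predict(X)      # one cluster label per point
```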
Applications to Text Categorization
• There are many applications where we cluster according to topic.
• We'll look at a few standard IR applications.
• But what really interests me is clustering according to style.
Scatter/Gather: Cutting, Karger, and Pedersen '92 (Sec. 16.1)
For improving search recall
• Cluster hypothesis – documents in the same cluster behave similarly with respect to relevance to information needs
• Therefore, to improve search recall:
   – Cluster the docs in the corpus a priori
   – When a query matches a doc D, also return the other docs in the cluster containing D (sketched below)
• The hope: the query "car" will also return docs containing "automobile"
   – Because clustering grouped docs containing "car" together with those containing "automobile"
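A toy sketch of that recall trick: cluster the corpus once, then expand each query's hits with the rest of their clusters. Everything here (the function name, and the assumption that retrieval and clustering have already produced hit_ids and labels) is illustrative.

```python
def expand_with_clusters(hit_ids, labels):
    """hit_ids: doc indices matched by the query; labels: precomputed cluster label per doc.
    Returns the hits plus every other doc in the clusters those hits belong to."""
    hit_clusters = {labels[d] for d in hit_ids}
    return [d for d, cluster in enumerate(labels) if cluster in hit_clusters]

# e.g., if doc 3 ("car") and doc 7 ("automobile") share a cluster,
# a query matching only doc 3 will also bring back doc 7.
```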
For better navigation of search results
• For grouping search results thematically
   – clusty.com / Vivisimo