Text Categorization

Moshe Koppel

Lecture 11: Unsupervised Text Categorization

Some slides from Prabhakar Raghavan, Chris Manning, Ray Mooney, Soumen Chakrabarti and TK Prasad

No Training Examples


Suppose we want to classify docs as positive or negative but have no training examples

We could try the bottom-up methods we’ve seen

Suppose we want to classify docs as written by X or Y but have no training examples

We don’t have a clue what distinguishes them

Suppose we want to classify docs into two classes but we don’t even know what the classes are

Suppose we don’t even know how many classes there should be

Sometimes the Answer is Obvious

Clustering


The idea is to define “natural” clusters

Position in space depends on the feature set

Need to have some notion of what you want in order to choose the right feature types

e.g., cluster by topic: content words

e.g., cluster by author: style features

With many features, clusters aren’t so elegant
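To make the feature-set point concrete, here is a minimal sketch (assuming scikit-learn is available; the two toy docs and the short function-word list are invented purely for illustration) of how the same documents land in different spaces depending on whether we extract content words or style features.

```python
# Sketch: same docs, two different feature spaces.
# Assumes scikit-learn; toy docs and the function-word list are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the striker scored a late goal in the match",
    "the court ruled that the contract was void",
]

# Cluster by topic: content words (TF-IDF with English stop words removed).
topic_space = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Cluster by author: style features, e.g. counts of a fixed list of function words.
function_words = ["the", "a", "in", "that", "was", "of"]
style_space = CountVectorizer(vocabulary=function_words).fit_transform(docs)

print(topic_space.shape, style_space.shape)  # different dimensionalities, different geometry
```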

How Many Clusters?


Some methods decide on k in advance

Some methods try to optimize k based on the data

e.g., in the previous example, we would want 3 clusters

Caution: some optimization criteria are biased towards one giant cluster or many tiny ones
Flat or Hierarchical

We consider two kinds of clustering:


Partitional (flat) clustering


Hierarchical clustering


Flat Clustering: K-Means Algorithm

Select K random docs {s_1, s_2, ..., s_K} as seeds.

Until clustering converges or another stopping criterion is met:

    For each doc d_i:
        Assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal.

    For each cluster c_j:
        s_j = μ(c_j)
        (Update the seeds to the centroid of each cluster)
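Below is a minimal NumPy sketch of the algorithm exactly as stated above: pick K random rows as seeds, assign each doc vector to its nearest seed, recompute each seed as its cluster’s centroid, and stop when the partition no longer changes. Function and variable names are mine; the matrix X would come from whatever feature set was chosen earlier.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means on row vectors X (shape: n_docs x n_features)."""
    rng = np.random.default_rng(seed)
    # Select K random docs as the initial seeds s_1, ..., s_K.
    seeds = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign each doc d_i to the cluster c_j whose seed s_j is nearest.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # partition unchanged: converged
            break
        assign = new_assign
        # Update each seed to the centroid mu(c_j) of its cluster.
        for j in range(K):
            if np.any(assign == j):
                seeds[j] = X[assign == j].mean(axis=0)
    return assign, seeds
```

In practice one would reach for a library implementation such as sklearn.cluster.KMeans, but the loop above is the whole algorithm.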

K-Means Example (K = 2)

[Figure: animation of K-means on a 2D point set: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]

Termination conditions


Several possibilities, e.g.,


A fixed number of iterations.


Doc partition unchanged.


Centroid positions don’t change.

Convergence


Why should the K-means algorithm ever reach a fixed point?

A state in which clusters don’t change.

K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.

EM is known to converge.

Number of iterations could be large.

Convergence of K-Means

Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:

    G_k = Σ_i (d_i − c_k)²    (sum over all d_i in cluster k)

    G = Σ_k G_k

Reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
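The centroid-update step decreases G as well, because the mean of a cluster minimizes the summed squared distance to its members; and since there are only finitely many partitions, G cannot decrease forever, which is why a fixed point is reached. A small numerical check of the centroid half, with toy points and a deliberately bad centroid (both invented for illustration):

```python
# Toy check: moving a cluster's centroid to the mean of its members can only lower G.
# Points, assignment, and the "bad" centroid are made up for illustration.
import numpy as np

def goodness(X, assign, centroids):
    """G = sum_k G_k, where G_k = sum over docs d_i in cluster k of (d_i - c_k)^2."""
    return sum(((X[assign == k] - c) ** 2).sum() for k, c in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
assign = np.array([0, 0, 1])
bad = np.array([[1.0, 1.0], [5.0, 5.0]])                          # arbitrary centroid for cluster 0
good = np.array([X[assign == k].mean(axis=0) for k in range(2)])  # true centroids
print(goodness(X, assign, bad), ">=", goodness(X, assign, good))  # 3.0 >= 0.5
```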

Dependence on Seed Choice


Results depend on the random seed selection.

Some tricks for avoiding bad seeds:

Choose each seed far from any already chosen

Try out multiple seed sets

If you start with B and E as centroids, you converge to {A, B, C} and {D, E, F}

If you start with D and F, you converge to {A, B, D, E} and {C, F}
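Both tricks map onto standard knobs in, e.g., scikit-learn’s KMeans (a sketch, with placeholder data): init="k-means++" picks each new seed with probability proportional to its squared distance from the seeds already chosen, and n_init reruns the algorithm from several seed sets and keeps the run with the lowest G.

```python
# Sketch: the two seed tricks via scikit-learn (blobs are placeholder data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

km = KMeans(
    n_clusters=2,
    init="k-means++",  # prefer new seeds that are far from those already chosen
    n_init=10,         # try 10 seed sets, keep the clustering with the lowest G
    random_state=0,
).fit(X)
print(km.inertia_)      # inertia_ is exactly G, the sum of squared distances to centroids
```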

A Problem…


The methods we’ve discussed so far don’t
always give very satisfying results.



Have a look at this example….

Ng et al., On Spectral Clustering: Analysis and an Algorithm

Spectral Methods

Because of examples like that, spectral methods have been developed.

Here’s the idea: imagine weighted edges between points, where the weight reflects similarity. Zero out all edges except those to each point’s k nearest neighbors.

Now make the “cheapest” cuts that result in separate components.

Ng et al., On Spectral Clustering: Analysis and an Algorithm

Spectral Clustering

Properly formulated, the min-Ncut problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions (a code sketch follows the steps):

1. Pre-processing

Construct a matrix representation of the dataset.

2. Decomposition

Compute eigenvalues and eigenvectors of the matrix.

Map each point to a lower-dimensional representation based on one or more eigenvectors.

3. Grouping

Assign points to two or more clusters, based on the new representation.
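Here is a sketch of the three steps using scikit-learn’s SpectralClustering (the two-moons data is a placeholder; it is the classic shape on which plain K-means makes the wrong cut). The affinity="nearest_neighbors" option builds the k-nearest-neighbor graph of step 1, eigenvectors of the resulting graph Laplacian give the low-dimensional representation of step 2, and assign_labels chooses the grouping method of step 3.

```python
# Sketch of the three steps (graph -> eigenvectors -> grouping) via scikit-learn.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",  # step 1: keep only edges to each point's nearest neighbors
    n_neighbors=10,                # size of the neighborhood used for the graph
    assign_labels="kmeans",        # step 3: cluster the points in eigenvector space
    random_state=0,
).fit_predict(X)

print(labels[:20])  # the two moons come out as two clusters, unlike with plain K-means
```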

Applications to Text Categorization


There are many applications where we cluster according to topic.

We’ll look at a few standard IR applications.

But what really interests me is clustering according to style.

Scatter/Gather: Cutting, Karger, and Pedersen ’92

Sec. 16.1

For improving search recall

Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs

Therefore, to improve search recall:

Cluster docs in the corpus a priori

When a query matches a doc D, also return the other docs in the cluster containing D

The hope: the query “car” will also return docs containing “automobile”, because clustering grouped docs containing “car” together with those containing “automobile”

Sec. 16.1
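A toy sketch of the recall trick (doc texts, cluster assignments, and the query are all invented here): the clustering is computed a priori, and at query time every matched doc drags in the rest of its cluster.

```python
# Toy sketch: expand each query hit with the other members of its precomputed cluster.
docs = {
    "d1": "cheap car insurance quotes",
    "d2": "used automobile dealers nearby",
    "d3": "banana bread recipe",
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}  # assigned a priori by clustering the corpus

def search(query):
    hits = {d for d, text in docs.items() if query in text}
    for d in list(hits):  # add the cluster-mates of every matched doc
        hits |= {e for e, c in cluster_of.items() if c == cluster_of[d]}
    return sorted(hits)

print(search("car"))  # ['d1', 'd2']: d2 contains "automobile", not "car"
```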

For better navigation of search results

For grouping search results thematically

clusty.com / Vivisimo

Sec. 16.1