# Lecture 11

Artificial Intelligence and Robotics

25 Nov 2013

Text Categorization

Moshe Koppel

Lecture 11: Unsupervised Text Categorization

Some slides from Prabhakar Raghavan, Chris Manning, Ray
Mooney, Soumen Chakrabarti and TK Prasad

No Training Examples

Suppose we want to classify docs as positive or negative but have no training examples.

We could try the bottom-up methods we’ve seen.

Suppose we want to classify docs as written by X or Y but have no training examples.

We don’t have a clue what distinguishes them.

Suppose we want to classify docs into two classes but we don’t even know what the classes are.

Suppose we don’t even know how many classes there should be.

Clustering

The idea is to define “natural” clusters

Position in space depends on feature set

Need some notion of what you want in order to choose the right feature types:

e.g., cluster by topic: content words

e.g., cluster by author: style features

With many features, clusters aren’t so elegant
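The two feature choices above can be sketched as follows (a toy illustration; the tiny function-word list and the function names are my own, not from the lecture):

```python
# A handful of English function words, just for illustration.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that"}

def topic_features(text):
    """Content-word counts: the feature type for clustering by topic."""
    feats = {}
    for w in text.lower().split():
        if w not in FUNCTION_WORDS:
            feats[w] = feats.get(w, 0) + 1
    return feats

def style_features(text):
    """Function-word relative frequencies: the feature type for
    clustering by author/style."""
    words = text.lower().split()
    n = max(len(words), 1)
    return {w: words.count(w) / n for w in FUNCTION_WORDS}
```

The same document lands in a very different position in feature space depending on which extractor you use, which is exactly why the choice matters.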

How Many Clusters?

Some methods try to optimize k based on data

e.g., in the previous example, we would want 3 clusters

Caution: some optimization criteria biased
towards one giant cluster or many tiny ones

Flat or Hierarchical

We consider two kinds of clustering:

Partitional (flat) clustering

Hierarchical clustering

Flat Clustering: K-Means Algorithm

Select K random docs {s1, s2, …, sK} as seeds.

Until the clustering converges or another stopping criterion is met:

For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.

For each cluster cj: sj = μ(cj) (update the seeds to the centroid of each cluster)
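The loop above can be sketched in a few lines (a minimal sketch, assuming docs are plain tuples of floats and using "partition unchanged" as the stopping criterion):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, n_iters=100, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(docs, k)          # select K random docs as seeds
    assign = None
    for _ in range(n_iters):
        # Assignment step: each doc goes to the cluster of its nearest seed.
        new_assign = [min(range(k), key=lambda j: dist2(d, seeds[j]))
                      for d in docs]
        if new_assign == assign:         # partition unchanged -> converged
            break
        assign = new_assign
        # Update step: move each seed to the centroid of its cluster.
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                seeds[j] = tuple(sum(xs) / len(members)
                                 for xs in zip(*members))
    return assign, seeds
```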

K-Means Example (K = 2)

Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!

(Figure: the x’s mark the centroid positions after each update.)

Termination conditions

Several possibilities, e.g.,

A fixed number of iterations.

Doc partition unchanged.

Centroid positions don’t change.

Convergence

Why should the K-means algorithm ever reach a fixed point, i.e., a state in which clusters don’t change?

K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.

EM is known to converge.

The number of iterations could be large.

Convergence of K-Means

Define a goodness measure of cluster k as the sum of squared distances from the cluster centroid:

Gk = Σi (di − ck)²  (sum over all di in cluster k)

G = Σk Gk

Reassignment monotonically decreases G, since each vector is assigned to the closest centroid. Recomputing centroids also decreases each Gk, since the centroid is the point that minimizes the sum of squared distances to the cluster’s members.
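The goodness measure is easy to compute directly (a minimal sketch; docs and centroids are plain tuples):

```python
def goodness(docs, assign, centroids):
    """G = sum over all docs of the squared distance to their
    cluster's centroid, i.e. G = sum_k G_k."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(d, centroids[j]))
        for d, j in zip(docs, assign)
    )
```

With two docs at (0,0) and (10,10) and centroids at those same points, the crossed assignment gives G = 400 while the nearest-centroid assignment gives G = 0, illustrating why reassignment can only lower G.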

Dependence on Seed Choice

Results depend on random seed selection.

Some tricks for avoiding bad seeds:

Choose each seed far from any already chosen

Try out multiple seed sets

e.g., with one set of seeds you converge to {A,B,C} and {D,E,F}; with another, to {A,B,D,E} and {C,F}
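The “far from any already chosen” trick can be sketched as greedy farthest-first seeding (an illustrative sketch; the function names are mine):

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_seeds(docs, k):
    """Pick the first seed arbitrarily, then repeatedly pick the doc
    farthest from all seeds chosen so far."""
    seeds = [docs[0]]
    while len(seeds) < k:
        next_seed = max(docs, key=lambda d: min(dist2(d, s) for s in seeds))
        seeds.append(next_seed)
    return seeds
```

This spreads the seeds out, making it less likely that two seeds start inside the same natural cluster.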

A Problem…

The methods we’ve discussed so far don’t
always give very satisfying results.

Have a look at this example….

(Ng et al., On Spectral Clustering: Analysis and an Algorithm)

Spectral Methods

Because of examples like that, spectral
methods have been developed.

Here’s the idea: imagine weighted edges between points, where the weight reflects similarity. Zero out all edges except each point’s k highest-weight (nearest-neighbor) edges.

Now make the “cheapest” cuts that result in separate components.

Spectral Clustering

Properly formulated, the min-Ncut problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions:

1. Pre-processing

Construct a matrix representation of the dataset.

2. Decomposition

Compute the eigenvalues and eigenvectors of the matrix.

Map each point to a lower-dimensional representation based on one or more eigenvectors.

3. Grouping

Assign points to two or more clusters, based on the new representation.
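The three steps can be sketched for the two-cluster case (a toy sketch assuming NumPy; the Gaussian similarity, the `bandwidth` parameter, and the function names are my own choices, not from the slides):

```python
import numpy as np

def spectral_bipartition(X, n_neighbors=3, bandwidth=8.0):
    """Toy 2-way spectral clustering following the three steps above."""
    n = len(X)
    # 1. Pre-processing: Gaussian similarity matrix; zero out all edges
    #    except each point's n_neighbors highest-weight ones (symmetrized).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / bandwidth)
    np.fill_diagonal(W, 0.0)
    mask = np.zeros_like(W)
    nearest = np.argsort(W, axis=1)[:, -n_neighbors:]
    mask[np.repeat(np.arange(n), n_neighbors), nearest.ravel()] = 1.0
    W = W * np.maximum(mask, mask.T)
    # 2. Decomposition: eigenvectors of the graph Laplacian L = D - W
    #    (eigh returns them in ascending eigenvalue order).
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)
    # 3. Grouping: split points by the sign of the second eigenvector
    #    (the Fiedler vector), an approximation to the cheapest cut.
    return (vecs[:, 1] > 0).astype(int)
```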

Applications to Text Categorization

There are many applications where we
cluster according to topic.

We’ll look at a few standard IR
applications.

But what really interests me is clustering
according to style.

Scatter/Gather: Cutting, Karger, and Pedersen ’92

(Sec. 16.1)

For improving search recall

Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.

Therefore, to improve search recall:

Cluster docs in the corpus a priori.

When a query matches a doc D, also return the other docs in the cluster containing D.

The hope: the query “car” will also return docs containing automobile, because clustering grouped docs containing car together with those containing automobile.
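A minimal sketch of this retrieval trick, assuming the clusters were computed a priori (the data structures here are illustrative):

```python
def expand_results(matched_docs, cluster_of, clusters):
    """When a query matches doc D, also return the other docs in the
    precomputed cluster containing D, without duplicates."""
    results, seen = [], set()
    for d in matched_docs:
        for doc in clusters[cluster_of[d]]:
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return results
```

A query matching only a doc about “car” would then also surface the co-clustered doc about “automobile”, which boosts recall at some cost in precision.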
