lecture_17

AI and Robotics

Clustering


Supervised vs. Unsupervised Learning

Examples of clustering in Web IR

Characteristics of clustering

Clustering algorithms

Cluster Labeling

1

Supervised vs. Unsupervised Learning

Supervised Learning


Goal: A program that performs a task as well as humans.


TASK


well defined (the target function)


EXPERIENCE


training data provided by a human


PERFORMANCE


error/accuracy on the task


Unsupervised Learning


Goal: To find some kind of structure in the data.


TASK


vaguely defined


No EXPERIENCE


No PERFORMANCE (but there are some evaluation metrics)



2

What is Clustering?

Clustering is the most common form of Unsupervised Learning.


Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.


It can be used in IR:


To improve recall in search applications


For better navigation of search results

3

Example 1: Improving Recall

Cluster hypothesis: Documents with similar text are related.

Thus, when a query matches a document D, also return other documents in the cluster containing D.
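A minimal sketch of how this could look in code, assuming clustering has already produced a document-to-cluster assignment; the function expand_with_cluster and the mappings doc_to_cluster / cluster_to_docs are hypothetical names used only for illustration.

```python
# Hypothetical helper: expand matched documents with their cluster-mates,
# following the cluster hypothesis (related documents end up in one cluster).

def expand_with_cluster(matched_docs, doc_to_cluster, cluster_to_docs):
    """Return the matched docs plus every other doc in the clusters they belong to."""
    expanded = set(matched_docs)
    for d in matched_docs:
        expanded.update(cluster_to_docs[doc_to_cluster[d]])
    return expanded

# Toy example: the query only matched D1, but D2 sits in the same cluster.
doc_to_cluster = {"D1": 0, "D2": 0, "D3": 1}
cluster_to_docs = {0: {"D1", "D2"}, 1: {"D3"}}
print(expand_with_cluster({"D1"}, doc_to_cluster, cluster_to_docs))  # {'D1', 'D2'}
```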


4

Example 2: Better Navigation

5

Clustering Characteristics

Flat versus Hierarchical Clustering


Flat means dividing objects into groups (clusters)


Hierarchical means organizing clusters into a subsuming hierarchy


Evaluating Clustering


Internal Criteria


The intra-cluster similarity is high (tightness)


The inter-cluster similarity is low (separateness)


External Criteria


Did we discover the hidden classes? (we need gold
standard data for this evaluation)

6

Clustering for Web IR

Representation for clustering


Document representation


Vector space? Normalization?


Need a notion of similarity/distance


How many clusters?


Fixed a priori?


Completely data driven?


Avoid “trivial” clusters: too large or too small


7

Recall documents as vectors

Each doc d_j is a vector of tf·idf values, one component for each term.

Can normalize to unit length.




So we have a vector space


terms are axes, aka features


n docs live in this space


even with stemming, may have 20,000+ dimensions


8

$\vec{d}_j = (w_{1,j},\, w_{2,j},\, \ldots,\, w_{n,j})$, where $w_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$


What makes documents related?

Ideal: semantic similarity.

Practical: statistical similarity.


We will use cosine similarity.


Documents as vectors.

We will describe algorithms in terms of cosine similarity.

9





Cosine similarity of normalized $\vec{d}_j$, $\vec{d}_k$:

$\mathrm{sim}(\vec{d}_j, \vec{d}_k) = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}$

This is known as the normalized inner product.
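For illustration, a minimal Python sketch of this normalized inner product, assuming the documents have already been mapped to tf-idf weight vectors (the toy numpy arrays below are made up):

```python
import numpy as np

def cosine_sim(d_j, d_k):
    """Cosine similarity: normalize both term-weight vectors, then take the inner product."""
    d_j = d_j / np.linalg.norm(d_j)
    d_k = d_k / np.linalg.norm(d_k)
    return float(np.dot(d_j, d_k))

d1 = np.array([0.0, 2.3, 1.1])   # toy tf-idf weights over a 3-term vocabulary
d2 = np.array([0.5, 2.0, 0.9])
print(cosine_sim(d1, d2))        # close to 1.0 for documents about the same things
```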

Intuition for relatedness

10

[Figure: documents D1–D4 plotted in a two-dimensional term space with axes t1 and t2]

Documents that are “close together” in vector space talk about the same things.

Clustering Algorithms

Partitioning “flat” algorithms


Usually start with a random (partial) partitioning


Refine it iteratively



k-means clustering


Model-based clustering (we will not cover it)


Hierarchical algorithms


Bottom-up, agglomerative


Top-down, divisive (we will not cover it)


11

Partitioning “flat” algorithms


Partitioning method: Construct a partition of n documents into a set of k clusters


Given: a set of documents and the number k


Find: a partition of k clusters that optimizes the chosen partitioning criterion


12

Watch animation of k-means

K-means

Assumes documents are real-valued vectors.

Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

Reassignment of instances to clusters is based on distance to the current cluster centroids.


13




$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
K-Means Algorithm

14

Let d be the distance measure between instances.

Select k random instances {s_1, s_2, …, s_k} as seeds.

Until clustering converges or other stopping criterion:

  For each instance x_i:
    Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.

  (Update the seeds to the centroid of each cluster)
  For each cluster c_j:
    s_j = μ(c_j)
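A minimal Python sketch of this loop, assuming documents are rows of a numpy tf-idf matrix X and taking Euclidean distance as d; the function name and defaults are illustrative, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    rng = rng or np.random.default_rng(0)
    # select k random instances as seeds
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each x_i to the cluster c_j whose seed s_j is closest
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # update each seed to the centroid mu(c_j) of its cluster
        new_seeds = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):   # stop when centroid positions no longer change
            break
        seeds = new_seeds
    return assign, seeds
```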

K-means: Different Issues

When to stop?


When a fixed number of iterations is reached


When centroid positions do not change

Seed Choice


Results can vary based on random seed selection.


Try out multiple starting points (see the sketch after the example below)

15

Example showing sensitivity to seeds

[Figure: six points A, B, C, D, E, F in the plane]

If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.

If you start with D and F, you converge to {A,B,D,E} and {C,F}.
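One common way to try multiple starting points is simply to rerun k-means from several random seeds and keep the tightest clustering; a minimal sketch, assuming scikit-learn is available (its n_init parameter performs exactly this re-running), on made-up toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])  # toy 2-D "documents"

# n_init=10: run k-means from 10 random seed sets and keep the run with the
# lowest within-cluster sum of squares, reducing sensitivity to the initial seeds.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: the two tight pairs end up in separate clusters
```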

Hierarchical clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.


16

[Example dendrogram: animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]

Hierarchical Agglomerative Clustering

We assume there is a similarity function that determines
the similarity of two instances.


17

Algorithm:

  Start with all instances in their own cluster.
  Until there is only one cluster:
    Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
    Replace c_i and c_j with a single cluster c_i ∪ c_j.

Watch animation of HAC
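A minimal Python sketch of this agglomerative loop, assuming a precomputed pairwise similarity matrix S and a caller-supplied cluster_sim(S, ci, cj) function (single-link and complete-link versions are sketched after the complete-link slide below); all names are illustrative.

```python
def hac(S, cluster_sim):
    clusters = [{i} for i in range(len(S))]   # start with every instance in its own cluster
    merges = []
    while len(clusters) > 1:
        # among the current clusters, find the most similar pair c_i, c_j
        _, a, b = max(((cluster_sim(S, clusters[a], clusters[b]), a, b)
                       for a in range(len(clusters))
                       for b in range(a + 1, len(clusters))),
                      key=lambda t: t[0])
        merges.append((clusters[a], clusters[b]))   # one level of the dendrogram
        clusters[a] = clusters[a] | clusters[b]     # replace c_i and c_j with c_i ∪ c_j
        del clusters[b]
    return merges
```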

What is the most similar cluster?

Single-link

Similarity of the most cosine-similar pair (single-link)

Complete-link

Similarity of the “furthest” points, the least cosine-similar

Group-average agglomerative clustering

Average cosine between pairs of elements

Centroid clustering

Similarity of clusters’ centroids


18

Single link clustering

19

1) Use maximum similarity of pairs:

$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\; y \in c_j} \mathrm{sim}(x, y)$

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

$\mathrm{sim}((c_i \cup c_j), c_k) = \max(\mathrm{sim}(c_i, c_k),\; \mathrm{sim}(c_j, c_k))$


Complete link clustering

20

1) Use minimum similarity of pairs:

$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} \mathrm{sim}(x, y)$

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

$\mathrm{sim}((c_i \cup c_j), c_k) = \min(\mathrm{sim}(c_i, c_k),\; \mathrm{sim}(c_j, c_k))$
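For illustration, minimal sketches of these two cluster-similarity rules, written against a pairwise similarity matrix S; either function could be passed as cluster_sim to the hac() sketch shown earlier.

```python
def single_link(S, ci, cj):
    # similarity of the closest (most cosine-similar) cross-cluster pair
    return max(S[x][y] for x in ci for y in cj)

def complete_link(S, ci, cj):
    # similarity of the furthest (least cosine-similar) cross-cluster pair
    return min(S[x][y] for x in ci for y in cj)

# Note: after merging c_i and c_j, these rules reduce to the max/min update
# formulas above, e.g. the single-link similarity of (c_i ∪ c_j) to c_k equals
# max(single_link(c_i, c_k), single_link(c_j, c_k)).
```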


Major issue: labeling

After the clustering algorithm finds clusters, how can they be made useful to the end user?


Need a concise label for each cluster


In search results, say “Animal” or “Car” in the jaguar example.


In topic trees (Yahoo), need navigational cues.


Often done by hand, a posteriori.


21

How to Label Clusters

Show titles of typical documents


Titles are easy to scan


Authors create them for quick scanning!


But you can only show a few titles, which may not fully represent the cluster

Show words/phrases prominent in cluster


More likely to fully represent the cluster (see the sketch after this list)


Use distinguishing words/phrases


But harder to scan
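A minimal sketch of the words/phrases approach, labeling each cluster with the terms that have the highest average tf-idf weight in it; X (a dense tf-idf matrix), labels (cluster assignments), and terms (the vocabulary) are hypothetical inputs.

```python
import numpy as np

def label_clusters(X, labels, terms, top_n=3):
    labels = np.asarray(labels)
    cluster_labels = {}
    for c in np.unique(labels):
        centroid = X[labels == c].mean(axis=0)     # centroid of the cluster
        top = np.argsort(centroid)[::-1][:top_n]   # indices of the heaviest terms
        cluster_labels[int(c)] = [terms[i] for i in top]
    return cluster_labels
```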


22

Not covered in this lecture

Complexity:


Clustering is computationally expensive; implementations must carefully balance clustering quality against running time.


How to decide how many clusters are best?


Evaluating the “goodness” of clustering


There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.

23