Basic Machine Learning: Clustering
CS 315 – Web Search and Data Mining
Supervised vs. Unsupervised Learning

Two fundamental methods in machine learning:

Supervised Learning ("learn from my example")
- Goal: a program that performs a task as well as humans.
- TASK – well defined (the target function)
- EXPERIENCE – training data provided by a human
- PERFORMANCE – error/accuracy on the task

Unsupervised Learning ("see what you can find")
- Goal: to find some kind of structure in the data.
- TASK – vaguely defined
- No EXPERIENCE
- No PERFORMANCE (but there are some evaluation metrics)
What is Clustering?

Clustering is the most common form of unsupervised learning.

Clustering is the process of grouping a set of physical or abstract objects into classes ("clusters") of similar objects.

It can be used in IR:
- To improve recall in search
- For better navigation of search results
Ex1: Cluster to Improve Recall

Cluster hypothesis: documents with similar text are related.

Thus, when a query matches a document D, also return other documents in the cluster containing D.
Ex2: Cluster for Better Navigation
Clustering Characteristics

Flat clustering vs. hierarchical clustering:
- Flat: just dividing objects into groups (clusters)
- Hierarchical: organize clusters in a hierarchy

Evaluating clustering:
- Internal criteria:
  - The intra-cluster similarity is high (tightness)
  - The inter-cluster similarity is low (separateness)
- External criteria:
  - Did we discover the hidden classes? (We need gold-standard data for this evaluation.)
Clustering for Web IR

- Representation for clustering:
  - Document representation
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data-driven?
  - Avoid "trivial" clusters – too large or small
Recall: Documents as Vectors

- Each doc j is a vector of tf.idf values, one component for each term.
- Can normalize to unit length.
- Vector space: terms are axes, aka features.
- N docs live in this space.
- Even with stemming, we may have 20,000+ dimensions.

What makes documents related?
d_j = (w_{1,j}, w_{2,j}, ..., w_{n,j}),  where  w_{i,j} = tf_{i,j} × idf_i
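As a minimal sketch of this weighting (the function name and the toy corpus are illustrative, not from the slides; idf is taken as log(N/df) here, one of several common variants):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf.idf weight vectors for a toy corpus.

    docs: list of token lists. Returns (vocabulary, list of sparse
    vectors, each a {term: weight} dict).
    """
    n = len(docs)
    # df_i: number of documents containing term i
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # w_{i,j} = tf_{i,j} * idf_i, with idf_i = log(N / df_i)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vocab, vectors

docs = [["jaguar", "car", "speed"],
        ["jaguar", "cat", "jungle"],
        ["car", "engine"]]
vocab, vecs = tfidf_vectors(docs)
```

Terms that occur in fewer documents get a larger idf boost, so rare terms dominate the vector, which is exactly the behavior the cosine-based clustering below relies on.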
Intuition for Relatedness

[Figure: documents D1–D4 plotted in a vector space with term axes t1 and t2]

Documents that are "close together" in vector space talk about the same things.

What makes documents related?
- Ideal: semantic similarity
- Practical: statistical similarity
- We will use cosine similarity, and describe the algorithms in terms of it.
10
n
i
k
i
w
j
i
w
j
d
sim
d
d
k
d
k
j
1
,
,
)
(
:
,
normalized
of
similarity
Cosine
,
This is known as the “
normalized inner product
”
.
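A small sketch of cosine similarity on sparse term-weight vectors (the function name and sample vectors are illustrative; this version normalizes inside the function rather than assuming unit-length input):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two sparse vectors ({term: weight} dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # For already-normalized vectors this reduces to the inner product.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = {"jaguar": 1.2, "car": 0.8}
d2 = {"jaguar": 0.9, "jungle": 1.1}
```

A document is always maximally similar to itself (cosine 1.0), and two documents sharing some but not all terms fall strictly between 0 and 1.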
Clustering Algorithms

- Hierarchical algorithms
  - Bottom-up, agglomerative clustering
- Partitioning ("flat") algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively

The famous k-means partitioning algorithm:
- Given: a set of n documents and the number k
- Compute: a partition of k clusters that optimizes the chosen partitioning criterion
K-means

- Assumes documents are real-valued vectors.
- Clusters are based on the centroid of the points in a cluster c (= the center of gravity or mean):

  μ(c) = (1/|c|) Σ_{x ∈ c} x

- Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s_1, s_2, ..., s_k} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance x_i:
        Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster.)
    For each cluster c_j:
        s_j = μ(c_j)
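The algorithm above can be sketched directly in Python. This is a toy version on dense vectors with squared Euclidean distance as d (names like `kmeans` and the sample points are illustrative; a document-clustering version would use cosine distance on tf.idf vectors instead):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on dense vectors (lists of floats)."""
    rng = random.Random(seed)
    # Select k random instances as seeds.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign each point to the nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update each seed to the centroid (mean) of its cluster;
        # keep the old centroid if a cluster went empty.
        new = [[sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: centroid positions stable
            break
        centroids = new
    return centroids, clusters

pts = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
cents, cls = kmeans(pts, 2)
```

With two well-separated groups of points, the loop settles into one cluster per group regardless of which two instances were drawn as seeds.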
K-means: Different Issues

When to stop?
- When a fixed number of iterations is reached
- When centroid positions do not change

Seed choice:
- Results can vary based on random seed selection.
- Try out multiple starting points.
Example showing sensitivity to seeds:

[Figure: six points A, B, C, D, E, F]

If you start with centroids B and E, you converge to one clustering; if you start with centroids D and F, you converge to a different one.
Hierarchical Clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.

animal
├─ vertebrate: fish, reptile, amphib., mammal
└─ invertebrate: worm, insect, crustacean
Hierarchical Agglomerative Clustering

We assume there is a similarity function that determines the similarity of two instances.

Algorithm:

Start with all instances in their own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
    Replace c_i and c_j with a single cluster c_i ∪ c_j.
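The HAC loop above can be sketched as follows. This toy version stops once a target number of clusters remains rather than merging all the way to one, and uses single-link similarity on 1-D points (negative distance, so larger means more similar); the names `hac` and `single_link` are illustrative:

```python
def hac(items, sim, target=1):
    """Bottom-up agglomerative clustering.

    Starts with each item in its own cluster and repeatedly merges the
    most similar pair until `target` clusters remain. `sim` scores two
    clusters (lists of items); larger means more similar.
    """
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > target:
        # Among the current clusters, find the most similar pair.
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j]))
        # Replace c_i and c_j with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters, merges

# Single-link on 1-D points: similarity of the closest pair.
single_link = lambda ci, cj: max(-abs(x - y) for x in ci for y in cj)
final, steps = hac([0.0, 0.1, 1.0, 1.1, 5.0], single_link, target=2)
```

On this toy input the four nearby points chain together under single-link, leaving the outlier 5.0 in its own cluster. The quadratic pair search makes this O(n^3) overall; real implementations cache pairwise similarities.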
What is the most similar cluster?

- Single-link: similarity of the most cosine-similar pair (single link)
- Complete-link: similarity of the "furthest" points, the least cosine-similar pair
- Group-average agglomerative clustering: average cosine between pairs of elements
- Centroid clustering: similarity of clusters' centroids
Single-link clustering

1) Use maximum similarity of pairs:

   sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

   sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))
Complete-link clustering

1) Use minimum similarity of pairs:

   sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

   sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
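The two linkage definitions and their merge-update shortcuts can be sketched side by side (helper names are illustrative; `sim` is any instance-level similarity, here negative distance on numbers):

```python
def single_link(ci, cj, sim):
    """Cluster similarity = max over all cross-cluster pairs."""
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim):
    """Cluster similarity = min over all cross-cluster pairs."""
    return min(sim(x, y) for x in ci for y in cj)

def merged_sim(sim_ik, sim_jk, linkage="single"):
    """Similarity of (c_i ∪ c_j) to c_k from the two pairwise values,
    per the update rules above: max for single-link, min for complete."""
    return max(sim_ik, sim_jk) if linkage == "single" else min(sim_ik, sim_jk)

sim = lambda x, y: -abs(x - y)  # higher = more similar
```

The update rules avoid rescanning all instance pairs after a merge: the merged cluster's similarity to c_k follows directly from the two previously computed values.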
Major issue – labeling

After the clustering algorithm finds clusters, how can they be useful to the end user?

Need a concise label for each cluster:
- In search results, say "Animal" or "Car" in the jaguar example.
- In topic trees (Yahoo), need navigational cues.
- Often done by hand, a posteriori.
How to Label Clusters

Show titles of typical documents:
- Titles are easy to scan
- Authors create them for quick scanning!
- But you can only show a few titles, which may not fully represent the cluster

Show words/phrases prominent in the cluster:
- More likely to fully represent the cluster
- Use distinguishing words/phrases
- But harder to scan
Further Issues

Complexity:
- Clustering is computationally expensive. Implementations need careful balancing of needs.
- How do we decide how many clusters are best?

Evaluating the "goodness" of clustering:
- There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.