Clustering Web Search Results Using Transduction-Based Relevance Model
Lurong Xiao and Edward Hung
Hong Kong Polytechnic University
csehung@comp.polyu.edu.hk
Edward Hung
Outline
Motivation
Transduction-based Clustering Algorithm (TCA)
Result preprocessing
Similarity measurement
Transduction-based Relevance Model (TRM)
Clustering Algorithm
Performance Evaluation
Conclusion
Motivation
Existing search engines return a long list of ranked results
Keyword: jaguar
Challenges
Search engines return only titles, snippets and links
Downloading and processing the original result pages is time-consuming
Short texts → hard to extract reliable features and produce good clusters
Online → need fast response time
Clusters should have readable descriptions
No study on the relationship between results and distributions
Our Approach
Transduction-based Relevance Model (TRM)
No assumption about the distribution
Determine the relevance (relationship) between two results
Our Approach
Clustering algorithm
Clustering based on relevance
Compute the importance value of each result (how closely are other results relevant to it?)
An important result → cluster representative
The importance of results relevant to a chosen representative is suppressed
Repeat to pick important results above the threshold
Assign the other results to their most relevant clusters
Results with very low relevance → outliers (a special cluster with miscellaneous topics)
Adjust the importance and relevance thresholds → determine the number of clusters and the granularity of clustering
Transduction-based Clustering Algorithm
1. Result preprocessing
2. Similarity measurement
3. Transduction-based relevance model
4. Clustering algorithm
1. Result preprocessing
User query → a list of ranked results
First n results: $X = \{x_1, x_2, \ldots, x_n\}$
Term set of m terms → an m × 1 cell array: all terms in titles and snippets (except non-term tokens and frequent terms)
m × n histogram matrix $H = (h_{i,j})_{m \times n}$, where $H = \alpha H_t + (1 - \alpha) H_s$
$H(k,i)$: weighted occurrence of the k-th term in $x_i$
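A minimal Python sketch of this preprocessing step. It assumes that $H_t$ and $H_s$ are raw term counts from the titles and the snippets respectively. The tokenizer, the `stopwords` argument and the function names are illustrative and do not come from the slides.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops non-term tokens)."""
    return re.findall(r"[a-z]+", text.lower())


def build_histogram(titles, snippets, alpha=0.5, stopwords=frozenset()):
    """Return (terms, H) where H = alpha * H_t + (1 - alpha) * H_s is m x n."""
    docs_t = [[w for w in tokenize(t) if w not in stopwords] for t in titles]
    docs_s = [[w for w in tokenize(s) if w not in stopwords] for s in snippets]
    terms = sorted({w for doc in docs_t + docs_s for w in doc})
    index = {w: k for k, w in enumerate(terms)}
    m, n = len(terms), len(titles)
    H_t = np.zeros((m, n))  # term counts in titles (assumed meaning of H_t)
    H_s = np.zeros((m, n))  # term counts in snippets (assumed meaning of H_s)
    for i in range(n):
        for w, c in Counter(docs_t[i]).items():
            H_t[index[w], i] = c
        for w, c in Counter(docs_s[i]).items():
            H_s[index[w], i] = c
    return terms, alpha * H_t + (1 - alpha) * H_s
```

With `alpha=0.5`, a term that occurs once in a title and once in the matching snippet gets weight 1.0, while a term that occurs only in the snippet gets 0.5.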
2. Similarity measurement
Normalized tf-idf weighted vectors
Term frequency
Inverse document frequency
Weight of the k-th term in the i-th result
Cosine similarity
Distance between results $x_i$, $x_j$
n × n distance matrix D
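The formulas for this step are not legible in this text version, so the sketch below uses a common textbook formulation as an assumption: term frequency normalized per result, idf = log(n / df), unit-length vectors, and distance = 1 − cosine similarity. The function name is illustrative.

```python
import numpy as np


def tfidf_distance(H):
    """H: m x n term-by-result histogram. Returns the n x n distance matrix D."""
    m, n = H.shape
    # Term frequency: each column (result) is scaled to sum to 1.
    tf = H / np.maximum(H.sum(axis=0, keepdims=True), 1e-12)
    # Document frequency: number of results that contain each term.
    df = np.count_nonzero(H > 0, axis=1)
    idf = np.log(n / np.maximum(df, 1))
    V = tf * idf[:, None]
    # Normalize every result vector to unit length.
    V = V / np.maximum(np.linalg.norm(V, axis=0, keepdims=True), 1e-12)
    S = np.clip(V.T @ V, 0.0, 1.0)  # cosine similarity
    D = 1.0 - S
    np.fill_diagonal(D, 0.0)
    return D
```

Results that share no terms end up at distance 1, and results with identical weighted vectors at distance 0.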
3. Transduction-based Relevance Model (TRM)
Construct a KNN graph based on distance matrix D
Result $x_i$ → node
$x_j$ is a KNN of $x_i$ ($x_j \in N(x_i)$) → edge $(x_i, x_j)$
Affinity weight matrix $W = (w_{i,j})_{n \times n}$
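A sketch of the graph construction, under stated assumptions: the affinity formula is not preserved here, so a Gaussian kernel $w_{i,j} = \exp(-d_{i,j}^2 / 2\sigma^2)$ on KNN edges is used as a common choice, and the graph is left directed. The function name `knn_affinity` is illustrative.

```python
import numpy as np


def knn_affinity(D, k=20, sigma=1.0):
    """Return an n x n affinity matrix with nonzero weights only on KNN edges."""
    n = D.shape[0]
    k = min(k, n - 1)
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        # The k nearest results to x_i, excluding x_i itself.
        neighbours = [j for j in order if j != i][:k]
        # Assumed Gaussian kernel; the slides' own weight formula is not recoverable.
        W[i, neighbours] = np.exp(-D[i, neighbours] ** 2 / (2 * sigma ** 2))
    return W
```

Each row of `W` then has exactly `k` nonzero entries (fewer only when `n - 1 < k`), and the diagonal stays zero.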
3. Transduction-based Relevance Model (TRM)
Propagation coefficient matrix $Q = (q_{i,j})_{n \times n}$
Propagate relevance to get the relevance matrix $R = (r_{i,j})_{n \times n}$
$R = Q R$, with $r_{i,i} = 1$ and $R \neq I$
$x_i$ has a high relevance to $x_j$ if $x_i$ is near to nodes (e.g. $x_k$) with high relevances to $x_j$
3. Transduction-based Relevance Model (TRM)
Iterative propagation algorithm
Apply the propagation step iteratively until R becomes stable
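A sketch of one way the iteration could look. The definition of Q is not preserved in this text version, so the code assumes Q is a row-normalized copy of W with a small hypothetical `leak` term in the denominator. That keeps every row sum below 1 so the iteration settles instead of driving all relevances to 1. The leak is an illustrative device and is not claimed to be the parameter $w_0$ from the slides.

```python
import numpy as np


def propagation_matrix(W, leak=0.1):
    """Row-normalize W; `leak` (hypothetical) keeps each row sum strictly below 1."""
    return W / (W.sum(axis=1, keepdims=True) + leak)


def propagate(Q, tol=1e-9, max_iter=10000):
    """Iterate R <- Q R with r_ii reset to 1 until R stops changing."""
    n = Q.shape[0]
    R = np.eye(n)
    for _ in range(max_iter):
        R_new = Q @ R
        np.fill_diagonal(R_new, 1.0)  # every result stays fully relevant to itself
        if np.max(np.abs(R_new - R)) < tol:
            return R_new
        R = R_new
    return R
```

Starting from the identity, each step lets a node inherit relevance from its graph neighbours, so results in the same dense neighbourhood end up with higher mutual relevance than results linked only through weak edges.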
3. Transduction-based Relevance Model (TRM)
Matrix solution
Solve matrix equations instead of iteratively propagating relevance values
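The matrix equations themselves are not legible here, so the following is a sketch of one possible closed form, assuming the fixed point $R = QR$ with $r_{i,i} = 1$ and a Q whose row sums are below 1. For a fixed column j, writing u for the entries $r_{i,j}$ with $i \neq j$ gives $u = Q_{uu} u + Q_{u,j}$, hence $(I - Q_{uu})\,u = Q_{u,j}$. The function name is illustrative.

```python
import numpy as np


def relevance_closed_form(Q):
    """Solve R = Q R with r_ii = 1, one column (one linear system) at a time."""
    n = Q.shape[0]
    R = np.eye(n)
    for j in range(n):
        rest = np.array([i for i in range(n) if i != j])
        # (I - Q_uu) u = Q_{u,j}, where u holds the relevance of every other node to x_j.
        A = np.eye(n - 1) - Q[np.ix_(rest, rest)]
        b = Q[rest, j]
        R[rest, j] = np.linalg.solve(A, b)
    return R
```

This trades the iteration for n linear solves of size n − 1. For the 50 to 100 results used in the evaluation, that is a small amount of dense linear algebra.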
3. Transduction-based Relevance Model (TRM)
Importance of node $x_i$
A node is more important iff it has more other nodes highly relevant to it
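The importance formula is not preserved in this text version. One reading consistent with the description is the total relevance that the other nodes have to $x_i$, that is $Im_i = \sum_{j \neq i} r_{j,i}$. The sketch below implements that assumption, and whether the self-relevance term should be included is a guess.

```python
import numpy as np


def importance(R):
    """Im_i = sum over j != i of r[j, i]: column sums of R without the diagonal."""
    return R.sum(axis=0) - np.diag(R)
```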
4. Clustering Algorithm
Iteratively find $N_c$ or fewer cluster representatives with $Im_i$ > threshold
Pick the node $x_i$ with the highest importance value $Im_i$
Avoid picking nodes highly relevant to $x_i$
Suppression attenuation
Assign the other nodes to their most relevant cluster representatives
If relevance < threshold → outlier, or a special cluster for miscellaneous topics
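The steps above can be sketched as follows. The suppression rule is an assumption, since the attenuation formula is not preserved: after picking $x_i$, every importance value $Im_k$ is multiplied by $1 - r_{k,i}$. The function name, the argument names and the use of label -1 for outliers are illustrative.

```python
import numpy as np


def tca_cluster(R, im, max_clusters, im_threshold, rel_threshold):
    """Return (representatives, labels); label -1 marks an outlier."""
    n = R.shape[0]
    im = np.asarray(im, dtype=float).copy()
    picked = np.zeros(n, dtype=bool)
    reps = []
    while len(reps) < max_clusters:
        candidates = np.where(picked, -np.inf, im)
        i = int(np.argmax(candidates))
        if candidates[i] <= im_threshold:
            break  # no remaining node is important enough
        reps.append(i)
        picked[i] = True
        # Hypothetical suppression: damp nodes in proportion to their relevance to x_i.
        im *= 1.0 - np.clip(R[:, i], 0.0, 1.0)
    labels = np.full(n, -1)
    for c, rep in enumerate(reps):
        labels[rep] = c
    if reps:
        for k in range(n):
            if picked[k]:
                continue
            best = int(np.argmax(R[k, reps]))  # most relevant representative
            if R[k, reps[best]] >= rel_threshold:
                labels[k] = best
    return reps, labels
```

Raising `im_threshold` yields fewer representatives and coarser clusters. Raising `rel_threshold` sends more weakly attached results to the outlier group.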
Performance Evaluation
Query log of June 10, 2007 from Google
Keywords with multiple subtopics and commonly used in the literature
jaguar, iraq, java
α = 0.5, σ = 1, $w_0$ = 0.001, 20-NN graph
Search results: 50, 75, 100
Relevance threshold: 0.001, 0.01
Importance threshold: 2.5, 3, 3.5
98.9% accuracy (higher than k-medoids clustering: 88.3%)
Running time: 0.3 s to 0.7 s
Performance Evaluation
Higher relevance threshold → higher accuracy
More relevant results are usually more likely to be correct
Lower importance threshold → more clusters → higher accuracy
Members of new clusters were usually outliers or from clusters with low ranks (relevances)
More data → more clusters
Some results were once too few in number and too low in relevance to form a cluster
They now become important enough to form their own cluster
Example of member movement across different numbers of clusters
Conclusion
Transduction-based Clustering Algorithm (TCA)
Analyzes inter-result relationships by propagating relevance through local interactions
Given an importance threshold, TCA decides the number of clusters automatically
Search results clustering and outlier detection
Fast running time and high accuracy, suitable for online users