Clustering Web-Search Results Using Transduction-Based Relevance Model


Nov 25, 2013 (3 years and 8 months ago)


Lurong Xiao and Edward Hung

Hong Kong Polytechnic University

csehung@comp.polyu.edu.hk

Edward Hung


Outline


Motivation


Transduction-based Clustering Algorithm (TCA)

  Result preprocessing

  Similarity measurement

  Transduction-based Relevance Model (TRM)

  Clustering Algorithm


Performance Evaluation


Conclusion


Motivation


Existing search engines return a long list of ranked results


Keyword: jaguar



Challenges


Search engines return titles, snippets and links

  Time consuming to download and process original results

Short texts → hard to extract reliable features and produce good clusters

Online → need fast response time

Clusters should have readable descriptions

No study on relationship between results and distributions


Our Approach


Transduction-based Relevance Model (TRM)

  No assumption on distribution

  Determine relevance (relationship) between two results




Our Approach


Clustering algorithm

  Clustering based on relevance

  Compute the importance value of each result (how closely are other results relevant to it?)

  An important result → cluster representative

  The importance of results relevant to a chosen representative is suppressed

  Repeat to pick important results above threshold

  Assign other results to most relevant clusters

  Results with very low relevance → outliers (a special cluster with miscellaneous topics)

  Adjust importance and relevance thresholds → determine cluster number and granularity of clustering


Transduction-based Clustering Algorithm

1. Result preprocessing

2. Similarity measurement

3. Transduction-based relevance model

4. Clustering algorithm


1. Result preprocessing


User query → a list of ranked results

First n results: $X = \{x_1, x_2, \ldots, x_n\}$

Term set of m terms

  an m x 1 cell array: all terms in titles and snippets (except non-term tokens and frequent terms)

m x n histogram matrix $H = (h_{i,j})_{m \times n}$, where $H = \alpha H_t + (1 - \alpha) H_s$

  H(k,i): weighted occurrence of the k-th term in $x_i$

2. Similarity measurement


Normalized tf-idf weighted vectors

  Term frequency

  Inverse document frequency

  Weight of k-th term in i-th result

Cosine similarity

  Distance between results $x_i$, $x_j$

  n x n distance matrix $D$
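The slide's formulas did not survive extraction, so the sketch below uses one common convention as an assumption: term frequency taken directly from H, idf as log(n / document frequency), unit-length vectors, and distance defined as 1 minus cosine similarity. The authors' exact weighting may differ.

```python
import math


def tfidf_vectors(H):
    """H: m x n weighted term-occurrence matrix (rows = terms, columns = results).

    Returns n unit-length tf-idf vectors, one per result.
    """
    m, n = len(H), len(H[0])
    # Document frequency: number of results containing each term.
    df = [sum(1 for i in range(n) if H[k][i] > 0) for k in range(m)]
    idf = [math.log(n / d) if d else 0.0 for d in df]
    vectors = []
    for i in range(n):
        v = [H[k][i] * idf[k] for k in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v] if norm else v)
    return vectors


def distance_matrix(vectors):
    """Pairwise distance, assumed here to be 1 - cosine similarity."""
    n = len(vectors)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            D[i][j] = D[j][i] = 1.0 - cos
    return D


# Toy input: 3 terms x 3 results.
H = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
]
D = distance_matrix(tfidf_vectors(H))
```

In the toy input, results 0 and 2 share no terms, so their distance is 1; results 1 and 2 share one term and end up closer.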


3. Transduction-based Relevance Model (TRM)

Construct a KNN graph based on distance matrix $D$

  Result $x_i$ → node

  $x_j$ is a KNN of $x_i$ ($x_j \in N(x_i)$) → edge $(x_i, x_j)$

Affinity weight matrix $W = (w_{i,j})_{n \times n}$
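A possible construction of the KNN graph and its affinity weights is sketched below. The affinity formula is not preserved on the slide; a Gaussian kernel with bandwidth `sigma` is assumed here because the evaluation settings mention a parameter σ, and the graph is symmetrized by assumption. The role of $w_0$ is not stated on the slides, so it is left out.

```python
import math


def knn_affinity(D, k=20, sigma=1.0):
    """Build a symmetric k-nearest-neighbour affinity matrix from distances D.

    Assumed weight for an edge (i, j): exp(-D[i][j]^2 / (2 * sigma^2)).
    """
    n = len(D)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k nearest other results (ties broken by index).
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: D[i][j])[:k]
        for j in neighbours:
            w = math.exp(-D[i][j] ** 2 / (2.0 * sigma ** 2))
            W[i][j] = W[j][i] = w  # keep the graph undirected
    return W


# Four points on a line at positions 0, 1, 2 and 10.
positions = [0.0, 1.0, 2.0, 10.0]
D = [[abs(p - q) for q in positions] for p in positions]
W = knn_affinity(D, k=1)
```

Only pairs linked by a nearest-neighbour relation receive a non-zero weight, so distant pairs such as the first and last point stay disconnected.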



3. Transduction-based Relevance Model (TRM)

Propagation coefficient matrix $Q = (q_{i,j})_{n \times n}$

Propagate relevance to get the relevance matrix $R = (r_{i,j})_{n \times n}$

  $R = Q \times R$

  $r_{i,i} = 1$

  $R \neq I$

$x_i$ has a high relevance to $x_j$ if $x_i$ is near to nodes (e.g. $x_k$) with high relevances to $x_j$

3. Transduction-based Relevance Model (TRM)

Iterative propagation algorithm

  Apply iteratively until R becomes stable
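The iteration can be sketched as below, under stated assumptions. The definition of Q is not preserved on the slides; here Q is taken to be the row-normalized affinity matrix scaled by a hypothetical `damping` factor below 1. Some such attenuation is needed in this sketch: with a purely row-stochastic Q on a connected graph, clamping $r_{i,i} = 1$ drives every entry of R to 1.

```python
def propagation_matrix(W, damping=0.5):
    """Row-normalise W and scale by `damping` (hypothetical parameter, < 1)."""
    Q = []
    for row in W:
        total = sum(row)
        Q.append([damping * w / total if total else 0.0 for w in row])
    return Q


def propagate(Q, tol=1e-12, max_iter=10000):
    """Repeat R <- Q R with the diagonal clamped to 1 until R stops changing."""
    n = len(Q)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        new = [[sum(Q[i][k] * R[k][j] for k in range(n)) for j in range(n)]
               for i in range(n)]
        for i in range(n):
            new[i][i] = 1.0  # every result is fully relevant to itself
        delta = max(abs(new[i][j] - R[i][j])
                    for i in range(n) for j in range(n))
        R = new
        if delta < tol:
            break
    return R


# Path graph 0 - 1 - 2 with unit affinities.
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
R = propagate(propagation_matrix(W))
```

On the path graph, the end nodes are more relevant to the middle node than to each other, since relevance decays with each hop.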


3. Transduction-based Relevance Model (TRM)

Matrix solution

  Apply matrix equations instead of iteratively propagating relevance values
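The slide's closed form is not preserved. One way to obtain a matrix solution from the stated constraints ($R = Q \times R$ off the diagonal, $r_{j,j} = 1$) is to solve each column of R separately. The index set $U_j$ (all nodes except $j$) is notation introduced here, and the inverse is assumed to exist, which holds for example when the spectral radius of $Q_{U_j U_j}$ is below 1.

```latex
% Assumption: r_{i,j} = \sum_k q_{i,k} r_{k,j} for i \neq j, with r_{j,j} = 1.
% U_j = \{1, \ldots, n\} \setminus \{j\} is notation introduced for this sketch.
\begin{align*}
r_{i,j} &= q_{i,j} + \sum_{k \neq j} q_{i,k}\, r_{k,j}, \qquad i \neq j \\
(I - Q_{U_j U_j})\, \mathbf{r}_{U_j, j} &= \mathbf{q}_{U_j, j} \\
\mathbf{r}_{U_j, j} &= (I - Q_{U_j U_j})^{-1}\, \mathbf{q}_{U_j, j}
\end{align*}
```

Here $Q_{U_j U_j}$ is Q with row and column $j$ removed, and $\mathbf{q}_{U_j, j}$ is column $j$ of Q without its $j$-th entry.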


3. Transduction-based Relevance Model (TRM)

Importance of node $x_i$

  A node is more important iff it has more other nodes highly relevant to it
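The importance formula itself is missing from the extracted slide. A simple reading of the statement above, used here purely as an assumption, is to sum the relevance of all other nodes to $x_i$, with `R[j][i]` read as the relevance of $x_j$ to $x_i$.

```python
def importance(R):
    """Assumed importance: Im_i = sum over j != i of R[j][i]."""
    n = len(R)
    return [sum(R[j][i] for j in range(n) if j != i) for i in range(n)]


# Toy relevance matrix: both other nodes are strongly relevant to node 1.
R = [
    [1.0, 0.5, 0.1],
    [0.3, 1.0, 0.3],
    [0.1, 0.5, 1.0],
]
im = importance(R)
most_important = max(range(len(im)), key=lambda i: im[i])
```

In the toy matrix, node 1 receives the largest total relevance from the other nodes, so it comes out as the most important.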


4. Clustering Algorithm


Iteratively find $N_c$ or fewer cluster representatives with $Im_i$ > threshold

  Pick node $x_i$ with highest importance value $Im_i$

  Avoid picking nodes highly relevant to $x_i$ → suppression attenuation

Assign other nodes to the most relevant cluster representatives

  If relevance < threshold → outlier or special cluster for miscellaneous topics
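The steps above can be sketched as a greedy loop. The suppression formula is not preserved on the slide; multiplying each remaining importance by `1 - R[j][rep]` is an assumption made for illustration, as is the column-sum importance. All function and parameter names are hypothetical.

```python
def cluster(R, n_clusters, importance_threshold, relevance_threshold):
    """Greedy representative selection followed by assignment.

    Returns (reps, labels); labels[j] is the representative of node j,
    or None when node j is treated as an outlier.
    """
    n = len(R)
    # Assumed importance: total relevance of the other nodes to node i.
    im = [sum(R[j][i] for j in range(n) if j != i) for i in range(n)]
    reps = []
    while len(reps) < n_clusters:
        candidates = [i for i in range(n) if i not in reps]
        if not candidates:
            break
        best = max(candidates, key=lambda i: im[i])
        if im[best] <= importance_threshold:
            break
        reps.append(best)
        # Assumed suppression: damp nodes in proportion to their relevance
        # to the new representative.
        for j in range(n):
            im[j] *= 1.0 - R[j][best]
    labels = []
    for j in range(n):
        if j in reps:
            labels.append(j)
            continue
        best_rep = max(reps, key=lambda c: R[j][c], default=None)
        if best_rep is None or R[j][best_rep] < relevance_threshold:
            labels.append(None)  # outlier / miscellaneous cluster
        else:
            labels.append(best_rep)
    return reps, labels


# Toy relevance matrix: two tight groups and one isolated node.
groups = [[0, 1, 2], [3, 4], [5]]
strengths = [0.8, 0.7, 0.0]
n = 6
R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for members, s in zip(groups, strengths):
    for i in members:
        for j in members:
            if i != j:
                R[i][j] = s

reps, labels = cluster(R, n_clusters=3, importance_threshold=0.5,
                       relevance_threshold=0.1)
```

In this toy run, one representative is picked per group, the suppressed members of each group no longer qualify as representatives, and the isolated node is left as an outlier.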


Performance Evaluation


Query log of June 10, 2007 from Google

Keywords with multiple subtopics and commonly used in literature

  jaguar, iraq, java

$\alpha = 0.5$, $\sigma = 1$, $w_0 = 0.001$, 20-NN graph

Search results: 50, 75, 100

Relevance threshold: 0.001, 0.01

Importance threshold: 2.5, 3, 3.5

98.9% accuracy (higher than k-medoids clustering: 88.3%)

Running time: 0.3s to 0.7s


Performance Evaluation


Higher relevance threshold → higher accuracy

  More relevant results are usually more likely to be correct

Lower importance threshold → more clusters → higher accuracy

  Members of new clusters were usually outliers or from clusters with low ranks (relevances)

More data → more clusters

  Some results were once too small in quantity and relevance to form a cluster

  They now become important enough to form their own cluster


Example of member movement across different cluster numbers



Conclusion


Transduction-based Clustering Algorithm (TCA)

  Analyze inter-result relationship by propagating relevance through local interactions

  Given importance threshold, TCA decides the number of clusters automatically

  Search results clustering and outlier detection

  Fast running time and high accuracy, suitable for online users