# Clustering Web-Search Results Using Transduction-Based Relevance Model

Nov 25, 2013

Lurong Xiao and Edward Hung

Hong Kong Polytechnic University

csehung@comp.polyu.edu.hk

Edward Hung

Outline

Motivation

Transduction-based Clustering Algorithm (TCA)

Result preprocessing

Similarity measurement

Transduction-based Relevance Model (TRM)

Clustering Algorithm

Performance Evaluation

Conclusion

Motivation

Existing search engines return a long list of ranked results

Keyword: jaguar

Challenges

Search engines return titles, snippets and original results

Short texts: hard to extract reliable features and produce good clusters

Online: need fast response time

No study on the relationship between results and distributions

Our Approach

Transduction-based Relevance Model (TRM)

No assumption on distribution

Determine relevance (relationship) between two results

Clustering algorithm

Clustering based on relevance

Compute the importance value of each result (how closely are other results relevant to it?)

An important result becomes a cluster representative

The importance of results relevant to a representative is suppressed

Repeat to pick important results above a threshold

Assign the other results to their most relevant clusters

Results with very low relevance become outliers (a special cluster with miscellaneous topics)

Determine the cluster number and granularity of clustering

Transduction-based Clustering Algorithm

1. Result preprocessing
2. Similarity measurement
3. Transduction-based relevance model
4. Clustering algorithm

1. Result preprocessing

User query: returns a list of ranked results

First n results: $X = \{x_1, x_2, \ldots, x_n\}$

Term set of m terms: an m x 1 cell array of all terms in titles and snippets (except non-term tokens and frequent terms)

m x n histogram matrix $H = (h_{i,j})_{m \times n}$, where $H = \alpha H_t + (1 - \alpha) H_s$

$H(k,i)$: weighted occurrence of the k-th term in $x_i$
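The histogram construction can be sketched in a few lines of Python. This is only an illustration: it assumes that H_t and H_s are the raw term-count matrices of the titles and the snippets (the subscripts suggest this, but the slides do not define them), and the tokenizer, the stop-word list and the names `tokenize` and `build_histogram` are hypothetical stand-ins for the preprocessing actually used.

```python
import re

def tokenize(text, stop_words):
    """Lowercase, keep alphabetic tokens, drop stop words (illustrative rule only)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]

def build_histogram(results, alpha=0.5, stop_words=frozenset()):
    """results: list of (title, snippet) pairs.

    Returns (terms, H) where H[k][i] is the weighted occurrence of term k
    in result i, i.e. alpha * title count + (1 - alpha) * snippet count.
    """
    title_tokens = [tokenize(title, stop_words) for title, _ in results]
    snippet_tokens = [tokenize(snippet, stop_words) for _, snippet in results]
    terms = sorted({tok for doc in title_tokens + snippet_tokens for tok in doc})
    index = {term: k for k, term in enumerate(terms)}
    n = len(results)
    H = [[0.0] * n for _ in terms]
    for i in range(n):
        for tok in title_tokens[i]:
            H[index[tok]][i] += alpha          # assumed contribution of H_t
        for tok in snippet_tokens[i]:
            H[index[tok]][i] += 1.0 - alpha    # assumed contribution of H_s
    return terms, H

results = [("Jaguar cars", "Official site of Jaguar cars"),
           ("Jaguar the cat", "The jaguar is a big cat")]
terms, H = build_histogram(results, alpha=0.5,
                           stop_words=frozenset({"the", "of", "is", "a"}))
```

With alpha = 0.5, a term that appears once in the title and once in the snippet of a result gets weight 1.0 for that result.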

2. Similarity measurement

Normalized tf-idf weighted vectors

Term frequency

Inverse document frequency

Weight of the k-th term in the i-th result

Cosine similarity

Distance between results $x_i$, $x_j$

n x n distance matrix D
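The slides name the ingredients of this step, but the formulas themselves did not survive. The sketch below therefore uses common textbook choices, all of which are assumptions: term frequency normalized by the result length, idf = log(n / df), unit-length vectors, and distance = 1 - cosine similarity. The function names are hypothetical.

```python
import math

def tfidf_vectors(H):
    """H[k][i]: weighted count of term k in result i. Returns n unit-length vectors."""
    m, n = len(H), len(H[0])
    # Document frequency: number of results that contain term k.
    df = [sum(1 for i in range(n) if H[k][i] > 0) for k in range(m)]
    idf = [math.log(n / df[k]) if df[k] else 0.0 for k in range(m)]
    vectors = []
    for i in range(n):
        total = sum(H[k][i] for k in range(m)) or 1.0
        v = [H[k][i] / total * idf[k] for k in range(m)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def distance_matrix(vectors):
    """Symmetric n x n matrix with D[i][j] = 1 - cosine similarity."""
    n = len(vectors)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            D[i][j] = D[j][i] = 1.0 - cos
    return D

# Toy histogram: 4 terms (rows) x 3 results (columns).
H = [[2.0, 1.0, 0.0],   # "car"
     [1.0, 2.0, 0.0],   # "dealer"
     [0.0, 0.0, 3.0],   # "cat"
     [1.0, 0.0, 1.0]]   # "speed"
D = distance_matrix(tfidf_vectors(H))
```

In this toy example the first two results share most of their terms and end up closest, while the second and third share none and are at distance 1.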

3. Transduction-based Relevance Model (TRM)

Construct a KNN graph based on distance matrix D

Result $x_i$: node

$x_j$ is a KNN of $x_i$, i.e. $x_j \in N(x_i)$: edge $(x_i, x_j)$

Affinity weight matrix $W = (w_{i,j})_{n \times n}$
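A minimal sketch of the graph construction follows. The weight formula is not given in the slides, so the Gaussian kernel used here, with σ as its width, is an assumption. The evaluation settings also list a parameter w_0 whose role is not specified, so it is left out. `knn_affinity` is a hypothetical name.

```python
import math

def knn_affinity(D, k, sigma=1.0):
    """Directed KNN affinity matrix: W[i][j] > 0 only if x_j is in N(x_i)."""
    n = len(D)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # N(x_i): the k other results closest to x_i.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: D[i][j])[:k]
        for j in neighbours:
            # Assumed Gaussian kernel on the distance.
            W[i][j] = math.exp(-D[i][j] ** 2 / (2.0 * sigma ** 2))
    return W

D = [[0.0, 0.2, 0.9, 1.0],
     [0.2, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.1],
     [1.0, 0.9, 0.1, 0.0]]
W = knn_affinity(D, k=2, sigma=1.0)
```

Because neighbourhoods are not symmetric in general, this W is directed. Whether the original model symmetrizes it is not stated.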

Propagation coefficient matrix $Q = (q_{i,j})_{n \times n}$

Propagate relevance to get the relevance matrix $R = (r_{i,j})_{n \times n}$

$R = QR$, with $r_{i,i} = 1$ and $R \neq I$

$x_i$ has a high relevance to $x_j$ if $x_i$ is near to nodes (e.g. $x_k$) with high relevance to $x_j$

Iterative propagation algorithm

Apply the propagation step iteratively until R becomes stable
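The update rule itself is missing from the slides, so the following is a sketch under stated assumptions rather than the authors' algorithm. Q is taken to be the row-normalized affinity matrix, each step computes R ← QR and then resets the diagonal to r_ii = 1, and iteration stops when the largest entry change falls below a tolerance. The `decay` factor is this sketch's own addition: with a row-stochastic Q and a clamped diagonal, repeated averaging can drive every entry toward 1, so relevance is attenuated at each hop to keep the result informative.

```python
def row_normalise(W):
    """Assumed propagation coefficients: q_ij = w_ij / sum_k w_ik."""
    Q = []
    for row in W:
        s = sum(row)
        Q.append([w / s if s else 0.0 for w in row])
    return Q

def propagate(Q, decay=0.8, tol=1e-9, max_iter=1000):
    """Iterate R <- decay * Q R with r_ii clamped to 1 until R is stable."""
    n = len(Q)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        new_R = [[decay * sum(Q[i][k] * R[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        for i in range(n):
            new_R[i][i] = 1.0            # every result is fully relevant to itself
        change = max(abs(new_R[i][j] - R[i][j])
                     for i in range(n) for j in range(n))
        R = new_R
        if change < tol:
            break
    return R

# Two tight pairs (0, 1) and (2, 3), weakly linked through nodes 1 and 2.
W = [[0.0, 0.9, 0.1, 0.0],
     [0.9, 0.0, 0.1, 0.0],
     [0.0, 0.1, 0.0, 0.9],
     [0.0, 0.1, 0.9, 0.0]]
R = propagate(row_normalise(W))
```

In this toy graph, node 0 ends up far more relevant to its close neighbour 1 than to node 3, which it only reaches through intermediate nodes.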

Matrix solution

Iterative propagation of relevance values

Importance of node $x_i$

A node is more important iff it has more other nodes highly relevant to it
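The importance formula is not reproduced in the slides. One simple reading of the statement above, used here purely as an assumption, is to sum the relevance of all other nodes to $x_i$, i.e. the i-th column of R without its diagonal entry.

```python
def importance(R):
    """Assumed importance: Im_i = sum over j != i of r_ji (relevance of x_j to x_i)."""
    n = len(R)
    return [sum(R[j][i] for j in range(n) if j != i) for i in range(n)]

# R[j][i] is the relevance of node j to node i.
R = [[1.0, 0.8, 0.1],
     [0.7, 1.0, 0.2],
     [0.6, 0.9, 1.0]]
im = importance(R)
```

Here node 1 receives the most relevance from the others and would be the first candidate for a cluster representative.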

4. Clustering Algorithm

Iteratively find $N_c$ or fewer cluster representatives with $Im_i$ > threshold

Pick the node $x_i$ with the highest importance value $Im_i$

Avoid picking nodes highly relevant to $x_i$

Suppression attenuation

Assign the other nodes to their most relevant cluster representatives

If relevance < threshold: outlier, or a special cluster for miscellaneous topics
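The steps above can be sketched as a greedy loop. Two details are assumptions, since the slides do not give them: importance is the column sum of R without the diagonal, and suppression multiplies the importance of each remaining node x_j by (1 - r_ji) once x_i is picked. The function name `cluster` and the toy thresholds are illustrative.

```python
def cluster(R, n_clusters, importance_threshold, relevance_threshold):
    """Greedy representative selection followed by assignment.

    Returns (reps, labels); labels[i] is the representative of node i,
    or None if node i is treated as an outlier.
    """
    n = len(R)
    im = [sum(R[j][i] for j in range(n) if j != i) for i in range(n)]
    reps = []
    while len(reps) < n_clusters:
        best = max((i for i in range(n) if i not in reps),
                   key=lambda i: im[i], default=None)
        if best is None or im[best] <= importance_threshold:
            break
        reps.append(best)
        for j in range(n):
            if j != best:
                im[j] *= 1.0 - R[j][best]   # assumed suppression rule
    labels = {}
    for i in range(n):
        if i in reps:
            labels[i] = i
        elif not reps:
            labels[i] = None
        else:
            rep = max(reps, key=lambda r: R[i][r])
            labels[i] = rep if R[i][rep] >= relevance_threshold else None
    return reps, labels

# Toy relevance matrix: group {0, 1, 2}, group {3, 4}, and an outlier 5.
R = [[1.0, 0.8, 0.7, 0.05, 0.05, 0.0],
     [0.9, 1.0, 0.7, 0.05, 0.05, 0.0],
     [0.9, 0.8, 1.0, 0.05, 0.05, 0.0],
     [0.05, 0.05, 0.05, 1.0, 0.8, 0.0],
     [0.05, 0.05, 0.05, 0.9, 1.0, 0.0],
     [0.0005, 0.0005, 0.0005, 0.0005, 0.0005, 1.0]]
reps, labels = cluster(R, n_clusters=3, importance_threshold=0.5,
                       relevance_threshold=0.01)
```

Although up to three clusters are allowed, only two representatives pass the importance threshold in this example, so the number of clusters is set by the threshold and not fixed in advance. Node 5 falls below the relevance threshold and is left as an outlier.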

Performance Evaluation

Query log of June 10, 2007 from Google

Keywords with multiple subtopics and commonly used in the literature

jaguar, iraq, java

$\alpha = 0.5$, $\sigma = 1$, $w_0 = 0.001$, 20-NN graph

Search results: 50, 75, 100

Relevance threshold: 0.001, 0.01

Importance threshold: 2.5, 3, 3.5

98.9% accuracy (higher than k-medoids clustering: 88.3%)

Running time: 0.3s to 0.7s

Higher relevance threshold: higher accuracy

More relevant results are usually more likely to be correct

Lower importance threshold: more clusters, higher accuracy

Members of new clusters were usually outliers or from clusters with low ranks (relevances)

More data: more clusters

Some results were once too small in quantity and relevance to form a cluster

They now become important enough to form their own cluster

Example of member movement under different cluster numbers

Conclusion

Transduction-based Clustering Algorithm (TCA)

Analyze inter-result relationships by propagating relevance through local interactions

Given an importance threshold, TCA decides the number of clusters automatically

Search results clustering and outlier detection

Fast running time and high accuracy, suitable for online users