TopicalClustering of Search Results

hesitantdoubtfulΤεχνίτη Νοημοσύνη και Ρομποτική

29 Οκτ 2013 (πριν από 3 χρόνια και 5 μήνες)

61 εμφανίσεις

Topical

Clustering
of Search Results

Scaiella

et al


[Originally published in


“Proceedings
of the fifth ACM
international conference on Web search and data
mining
, ACM, New York, NY, USA

(2012
)”,
pp.
223
-
232]

4/23/2013

1

Topical Clustering., Scaiella et al

Presented by

Anupam

Kumar / CSCI 572 / Spring 2013

Topics covered in CSCI 572
course


Lecture


Rich Text Snippets

o
Basic text snippets (Relates to tools needed extract snippets and assign a
set of topics to it)


Lecture


Search Engine Query Processing

o
Query Expansion (Relates to query topic decomposition)

o
Google
Wonderwheel

(performed syntactical clustering to aid topic
discovery)


Lecture


Intelligent Information Retrieval

o
AI and machine learning

o
Judging success


Precision/ Recall

o
Retrieval Models


(Vector Space/ Probabilistic /Graph Based)



4/23/2013

Topical Clustering., Scaiella et al

2

Clustering 101

4/23/2013

Topical Clustering., Scaiella et al

3


What is clustering &
why should we study it



What is search result
clustering (SRC) and
how do we do it


What are the
challenges with the
current SRC face &
how to address them






Bag of Words
vs

Graph of
Topics


Bag of Words


It’s syntactical


Can never bridge the gap
between user’s intentions
and query result relevance


Not very effective for
clustering


Easier to implement, but still
a computationally intensive
(NP Complete problem)


4/23/2013

Topical Clustering., Scaiella et al

4


Graph of Topics


It’s
semantical


Associating the pages with
topics and clustering them
based on topics inferred
helps bridge the subjective
user’s query intentions and
objective query results


Still harder to implement and
introduces newer variables
into clustering model

4/23/2013

Topical Clustering., Scaiella et al

5

Semantic Recipe to solve
SRC Problem

To solve the problem using Graph of Topics paradigm
Follow these steps….


#0 Get the results from your favorite search engine


#1 Annotate those results with topics using a
knowledge base (like
wikipedia
) and a annotation
tool (like TAGME)


#2 Represent each result’s text
-
snippet as a graph
of topics containing two types of nodes


topic
nodes & text snippet nodes


#3 Plug in their algorithm on this graph, you create
to get meaningful & intelligent clusters (More on
algorithm coming up…)

4/23/2013

Topical Clustering., Scaiella et al

6

Topical Clustering
Algorithm


Let S be set of snippets S = { s1,s2,s3…
sn
}


Let T be set of Topics To = { t1, t2, t3, ….
tr
} identified


For each snippet s and a topic, associate a annotation score , p(
s,t
).
p(
s,t
) represents the reliability/importance of topic t w.r.t s


For each pair of topics ta,
tb

rel
(
ta,tb
) is the degree of their relatedness,
measured by mutual citation and co
-
citation of ta in
tb

or vice versa in
the knowledge base (here wikipedia.org)


Let S(t) be the set of snippets that are annotated by topic t. For any set
of Topics T = { t1, t2…
tk
} S(T) = S(t1) U S(t2) U… S(
tk
) is the set of snippets
annotated with at least one topic of T.


Let the
subgraph

of the topics in set To be called G’. For a given “m”
(where m is maximum the number of topical clusters we want), perform
a topical decomposition of G’, consisting of C = { T1, T2… Tm} where T1,
T2 … Tm are disjoint subsets of T


Let h(Ti)
be a
labelling

function that labels each Ti in C that defines its
general theme


Create m clusters of topics, with each cluster contain S(Ti) snippets


These m clusters are your topical clusters.

4/23/2013

Topical Clustering., Scaiella et al

7

How to Run
the algorithm


What are the decomposition criteria to get the
desirable properties for needed to create
intelligible clusters ?


How to achieve this ?

o
Create G’

o
Perform topical decomposition on G’ to produce a set of topical clusters

o
Label each element topical cluster appropriately



4/23/2013

Topical Clustering., Scaiella et al

8

Creating G’

4/23/2013

Topical Clustering., Scaiella et al

9


For each element in To remove the elements that
represent the topics that cover more than 50% of
snippets, because they are too generic to form
intelligent cluster


Let U = S = { s1, s2 …
sn
}


Let B = { {t1, t3, t4 } , { t2, t5}… }


This is an optimization version of set cover problem,
where the goal is to cover all elements of U by
having minimum cardinality of B.


This is NP hard, but Greedy method CAN provide a
solution in polynomial time, though it is always
complete, it may not always be optimal



Creating set C out of G’

4/23/2013

Topical Clustering., Scaiella et al

10


This step is where the topic decomposition of results
take place for the topic graph G’.


Perform Eigen decomposition / spectral
decomposition of the graph S + G’


Iteratively bi
-
section or cut G’ into clusters such that:

o
A random walk seldom transitions across the bisection (or cut).

o
Of the clusters filtered above, pick the cluster that covers more snippets
than empirically derived delta
-
max value

o
Of the clusters filtered above, pick the selected cluster for bisection has
second lowest
eigen

value


Stop when m is reached or if there are no more
clusters to be cut




Clustering and Labeling

4/23/2013

Topical Clustering., Scaiella et al

11


Assign snippet to topic, based to annotations
discovered by TAGME


The main topic of a cluster Ti is calculated by
looking at a topic t belonging to Ti that maximizes
the sum of rho
-
score between t and its covered
snippets.


The title of the
wikipedia

page is the standard
label of the topic


If the topic title is same as query string then the
title is appended with the most frequent anchor
text used to refer to the topic page

Experimental Evaluation


The setup



o
2 Datasets


AMBIENT (from yahoo search engine) and ODP
-
239 from
DMOZ.org

o
AMBIENT contained 44 ambiguous queries with 100 results returned for
each query; ODP
-
239 contained 239 topics with 100 documents per
topic spread across 25580 topics


Each document had a name and
very short text snippet

o
The queries used on ODP
-
239 was 100 query from
TextRerieval

Conference (TREC) 2009
-
2010


The dataset and query were same across Topical
Clustering, Lingo3G and CLUSTY clustering engines



4/23/2013

Topical Clustering., Scaiella et al

12

Experimental Evaluation
(
Contd
)


EXPERIMENTAL FINDINGS

o
COVERAGE ANALYSIS


TAGME
tags 98% of snippets & annotates
5 tags per snippet on average





o
CLUSTERING EVALUATION


Topical algorithm outperformed
the rest by a large margin

4/23/2013

Topical Clustering., Scaiella et al

13

Experimental Evaluation
(
Contd
)

4/23/2013

Topical Clustering., Scaiella et al

14


EXPERIMENTAL FINDINGS

o
Diversification of topics



Topical clustering worked better than
all other popular algorithms



o
Time Efficiency

The algorithm runs in polynomial time
mostly with the help of optimization
and pruning at all the 3 stages of the
algorithm.



Pros & Cons


Pros

o
Attempts to look at semantic aspect of result

o
Attempts to provide a promising data structure and a novel algorithm

o
Provides a new way to look at the problem and departs from traditional “bag
of words” paradigm

o
Provided extensive experiment data that showed promising results


Cons

o
It still uses syntactical search results to gather snippets and talks nothing about
what would happen if we used semantic search results (Would it even work if
we applied this to semantic
resultsets

?)

o
Does not provide any information on what would happen if we used any other
knowledge base other than
wikipedia

o
This perhaps will not cluster well if the topics are closely related. Thus it probably
may not function for clustering vertical search engine results as well as it does
for horizontal search engines

o
The paper did not discuss or show how the system scales to increasingly larger
datasets

o
The paper did not talk about effective would the clustering be if we are trying
results that are changing over time(
eg
: Result set containing snippets from RSS
feed, which may change over time)

4/23/2013

Topical Clustering., Scaiella et al

15

Thank You

4/23/2013

Topical Clustering., Scaiella et al

16