Slides - VLDB 2009


Arnd Christian König


Data Management, Exploration and Mining Group

Microsoft Research








Data Management, Exploration and Mining Group, MSR:
Sanjay Agrawal, Surajit Chaudhuri, Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin

Text Mining, Search and Navigation Group, MSR:
Kenneth W. Church, Qiang Wu

Natural Language Processing Group, MSR:
Michael Gamon

Microsoft AdCenter:
Martin Markov

Microsoft Search:
Liying Sui










Issues with non-document data in search:
- Management of separate data stores.
- Varying data format, retrieval semantics, query processing.
- Selection of verticals to show.

Sponsored Search Ads:
- Separate data store / index.
- Ads associated with 'bid-phrases'.
- Retrieval by matching (modified) queries with bid-phrases.
- Ranking using a combination of relevance, bid-amount and expected CTR.


Vertical Sub-collections:
- Examples: Products, News, Images, ...
- Separate data store / index.
- Index may be different from the web index.
- Ranking function different for each vertical.
- Many verticals => vertical selection problem.


Retrieval Overhead
Retrieval Quality
Vertical Selection

Vertical Selection: verticals are not provisioned to handle 100% of traffic => fast, initial filter on queries. Some queries may have relevant results in many verticals.

Retrieval Overhead: can the specific ranking function be indexed efficiently?


Different retrieval engine for each 'vertical' and ads.

Retrieval processing involves matching and ranking.

Match-processing independent of the ranking function:
- Multiple ranking functions to be tested in parallel.
- Allows for arbitrarily complex ranking functions.
- Organizational boundaries.

Ranking of results is not a monotone function of single-word scores: Top-k optimizations do not apply.






Long latency for some queries with a large number of matches.














[Chart: relative running times for 2-word queries, by keyword-frequency combination]
F + F: 1.00
F + M: 0.17
M + M: 0.027
F + L: 0.02
M + L: 0.003
L + L: 0.001

Setup:
- Index 20 M product descriptions from the Product Vertical.
- Using a commercial full-text engine, measure latency for 2-word queries, for words of different frequencies.

User perceives significant latency: 3 orders of magnitude latency difference between combinations.

F = Frequent keywords (>800K postings)
M = Medium-frequency keywords (~50K postings)
L = Low-frequency keywords (<1K postings)

Query logs: combinations of frequent keywords are common.
Examples: "BMW bike", "Book jewelry", "Book Dumbledore" (> 2.1% of queries in the log).





Search latency is crucial (via Dries Buytaert's blog):
- Amazon: 100 ms of extra load time caused a 1% drop in sales. (Source: Greg Linden, Amazon)
- Google: 500 ms of extra load time caused 20% fewer searches. (Source: Marissa Mayer, Google)
- Yahoo!: 400 ms of extra load time caused a 5 to 9% increase in the number of people who clicked "back" before the page even loaded. (Source: Nicole Sullivan, Yahoo!)



Latency issues are addressed through parallelism and caching, but also through specialized data structures.


Matching: conditions on the overlap between bid and query; 3 different match-types are possible:
- Exact-Match: query Q = bid B.
- Phrase-Match: words in bid B occur in query Q in identical order.
- Broad-Match: query Q = {w1, ..., wn} contains all words of bid B, e.g.:
  query {cheap books} matches bid {books},
  but does not match bid {cheap used books}.

Broad-Match is the most common matching logic (>90%).
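To make the three match types concrete, here is a minimal sketch (our illustration, not code from the talk); `query` and `bid` are whitespace-tokenized strings, and reading "identical order" as a contiguous phrase is our interpretation:

```python
def exact_match(query, bid):
    # Exact-Match: the query equals the bid phrase.
    return query.lower().split() == bid.lower().split()

def phrase_match(query, bid):
    # Phrase-Match: the bid words appear in the query contiguously and in order
    # (one reading of "in identical order").
    q, b = query.lower().split(), bid.lower().split()
    return any(q[i:i + len(b)] == b for i in range(len(q) - len(b) + 1))

def broad_match(query, bid):
    # Broad-Match: every bid word occurs somewhere in the query, in any order.
    return set(bid.lower().split()) <= set(query.lower().split())

assert broad_match("cheap books", "books")
assert not broad_match("cheap books", "cheap used books")
```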



Information retrieval: query Q = {w1, ..., wn}, all query words contained in document D.

Can we re-use Information Retrieval techniques?



Exact-Match: query Q = bid B.
Phrase-Match: words in bid B occur in query Q in identical order.
Is this efficient?

Augmented Inverted Index:
- Vocabulary: "cheap", "used", "books", ...
- Each word's posting list stores (Bid-ID, # words in the bid), e.g.:
  "cheap": (177, 2), (2090, 3), ...
  "used": (11, 2), (99, 1), ...
  "books": (2004, 1), (2090, 3), ...

Query: {cheap books}

Problem: nearly all processed bids do not match the query.
- Indexing by word is not selective.
- No early termination.
- Some improvement via non-redundant indexing.
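A minimal sketch (ours, with made-up bid data) of why the word-level index is not selective for broad match: every posting list of every query word gets scanned, yet only the few bids whose entire word set is covered by the query actually match.

```python
from collections import defaultdict

def build_word_index(bids):
    # Augmented inverted index: word -> [(bid_id, number of words in the bid), ...]
    index = defaultdict(list)
    for bid_id, phrase in bids.items():
        words = phrase.lower().split()
        for w in set(words):
            index[w].append((bid_id, len(words)))
    return index

def broad_match_candidates(index, bids, query):
    # Scan the postings of every query word; a bid broad-matches only if *all*
    # of its words are covered by the query, so most scanned entries are wasted work.
    qwords = set(query.lower().split())
    matches = set()
    for w in qwords:
        for bid_id, _n_words in index.get(w, []):
            if set(bids[bid_id].lower().split()) <= qwords:
                matches.add(bid_id)
    return matches

bids = {177: "cheap flights", 2090: "cheap used books", 99: "books", 11: "used cars"}
index = build_word_index(bids)
# Four posting entries are scanned for {cheap, books}, but only bid 99 matches.
print(broad_match_candidates(index, bids, "cheap books"))  # {99}
```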

[Chart: number of bid-phrases (log scale, 1 to 10^8) by number of keywords per bid-phrase (1 to 31).]


Alternative approach: index bids at fine granularity.

"Vocabulary" (hash of the bid's word set -> bid list):
- Hash({"cheap", "books"}) -> "Cheap books" [bid phrase]: B1, B4 [Bid IDs]
- Hash({"cheap", "used", "books"}) -> "Cheap used books" [bid phrase]: B3 [Bid ID]
- Hash({"new", "books"}) -> "New books" [bid phrase]: B2 [Bid ID]

Bids:
- B1: "Cheap books"
- B2: "New books"
- B3: "Cheap used books"
- B4: "Cheap books"

Query: {cheap books}
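A minimal sketch (assumptions ours) of this fine-granularity index: bids are keyed by the set of their words, with a Python `frozenset` standing in for the hash, so a query whose word set equals a bid's word set is resolved with a single hash probe. Broad match in general still needs probes for every subset of the query's words, which is the problem addressed next.

```python
from collections import defaultdict

bids = {"B1": "cheap books", "B2": "new books", "B3": "cheap used books", "B4": "cheap books"}

# "Vocabulary": hash of the bid's word set -> list of Bid IDs sharing that word set.
bid_lists = defaultdict(list)
for bid_id, phrase in bids.items():
    bid_lists[frozenset(phrase.lower().split())].append(bid_id)

def lookup(word_set):
    # One hash probe returns exactly the bids whose word set equals `word_set`.
    return bid_lists.get(frozenset(w.lower() for w in word_set), [])

print(lookup({"cheap", "books"}))  # ['B1', 'B4']
print(lookup({"new", "books"}))    # ['B2']
```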





Problem: the number of hash-lookups becomes significant for long queries.
- A query containing n words requires 2^n - 1 hash lookups.
- Unacceptable for long queries, due to the tight latency constraints of sponsored search.

Idea: queries that access a set of words access all its subsets [ICDE 2009].

Why does this help?
- Trading off random access against sequential access (main memory).
- Large reduction in page walks (via TLB misses).
- Fewer (worst-case) lookups: let k be the number of words in the largest vocabulary node; then an n-word query requires

  sum_{i=1..k} C(n, i)

  lookups, instead of 2^n - 1.

Example (merged vocabulary): the entry Hash({"cheap", "books"}) can also store the bid list for "Cheap used books" (B3), so the entry Hash({"cheap", "used", "books"}) is dropped from the vocabulary; Hash({"new", "books"}) -> "New books": B2 stays unchanged.

Can it hurt? Yes, if we do too much merging...
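A sketch (ours) of lookups against such a merged vocabulary: bids with more than k words are stored under one of their k-word subsets, so a query only probes subsets of size at most k, and a final broad-match check against the stored bid phrase filters out candidates that were only picked up through merging.

```python
from itertools import combinations

# Merged vocabulary with k = 2: the 3-word bid "cheap used books" (B3)
# is stored under its 2-word subset {"cheap", "books"}.
bid_lists = {
    frozenset({"cheap", "books"}): [("B1", "cheap books"), ("B4", "cheap books"),
                                    ("B3", "cheap used books")],
    frozenset({"new", "books"}):   [("B2", "new books")],
}
K = 2  # largest number of words in any vocabulary node

def broad_match(query):
    qwords = set(query.lower().split())
    matches = []
    # Probe only subsets of the query of size <= K: sum_{i=1..K} C(n, i) lookups.
    for size in range(1, min(K, len(qwords)) + 1):
        for subset in combinations(sorted(qwords), size):
            for bid_id, phrase in bid_lists.get(frozenset(subset), []):
                # Verify the full broad-match condition to drop merged-in non-matches.
                if set(phrase.split()) <= qwords:
                    matches.append(bid_id)
    return matches

print(broad_match("cheap books"))              # ['B1', 'B4'] (B3 is scanned, then filtered)
print(broad_match("cheap used books online"))  # ['B1', 'B4', 'B3'] via the subset entry
```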

Task: can we compute an 'optimal' assignment of bids to bid lists?

Optimization problem: given a cost model and a query workload, compute bid-lists that minimize query cost.

Workload:
- The relative frequencies of the most frequent queries in search logs are stable.
- The assignment is computed off-line and refreshed periodically.

Cost Model:
- Simple model, decomposing access cost into:
  - the cost of random memory accesses, and
  - the cost of sequential memory scans (monotonic in # bytes read).





Example vocabulary and bid lists:
- Hash({w1, w2}) -> "w1 w2": B1, B4 [Bid IDs]
- Hash({w1, w4}) -> "w1 w4": B2 [Bid ID]
- Hash({w1, w2, w3}) -> "w1 w3 w2": B3 [Bid ID]

Solution sketch:
- We model the assignment as a grouping of bids.
- For each query q, we can now assign a cost to the query:

  sum_{i=1..k} C(|q|, i) lookups + the cost of scanning the bid-lists,

  where k = MAX(# words in an entry).
- For each value of k and the workload, we assign a cost to each set of bids.
- => Mapping selection is (approximately) a set-covering problem.
- l = MAX(# distinct bids in a single node), l small.
- Greedy selection => log l approximation.

Possible mappings:
(a) {B1, B4}, {B3}, {B2}
(b) {B1, B4, B3}, {B2}
(c) ...
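The slides frame the list selection as a set-covering problem solved greedily; the following toy sketch (our code with made-up costs, not the paper's exact cost model) shows the classic greedy step behind the log l approximation: repeatedly pick the candidate list with the lowest cost per newly covered bid.

```python
def greedy_cover(universe, candidates):
    # candidates: list of (cost, frozenset_of_bids).
    # Classic greedy weighted set cover: pick the candidate minimizing
    # cost per newly covered bid until every bid is covered.
    uncovered = set(universe)
    chosen = []
    while uncovered:
        cost, bid_set = min(
            (c for c in candidates if c[1] & uncovered),
            key=lambda c: c[0] / len(c[1] & uncovered),
        )
        chosen.append((cost, bid_set))
        uncovered -= bid_set
    return chosen

# Toy candidate lists in the spirit of the possible mappings above (costs invented):
candidates = [
    (1.0, frozenset({"B1", "B4"})),        # list for {w1, w2}
    (1.2, frozenset({"B1", "B4", "B3"})),  # merged list, slightly costlier to scan
    (1.0, frozenset({"B3"})),
    (1.0, frozenset({"B2"})),
]
# With these costs the greedy choice corresponds to mapping (b): {B1, B4, B3}, {B2}.
print(greedy_cover({"B1", "B2", "B3", "B4"}, candidates))
```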


Other Examples:
- WAND processing [Broder et al., CIKM'03]
- Keyword Search in Spatial Data, e.g., [De Felipe et al., ICDE'08], [Zhang et al., ICDE'09]
- Entity Search, e.g., [Balmin, VLDB'04], [S. Chakrabarti et al., PKDD'04, WWW'06, WWW'07], [Agrawal, WWW'09]
- (Approximate) Auto-completion [Bast et al., SIGIR'06], [Nandi et al., SIGMOD'07, VLDB'07], [Chaudhuri et al., SIGMOD'09]
... these techniques are modifications of IR-style processing or string matching.

Search over relational objects in an RDBMS is a well-studied problem (e.g., DBXplorer, Discover, BANKS, etc.)
... but join-paths (= business objects) are known in vertical search and can be pre-materialized and indexed.




Retrieval Overhead



Retrieval Quality



Vertical Selection

Q: [canon rebel xti]

Result 1: "The Canon EOS Digital Rebel XTi offers an unbeatable combination of performance, ease-of-use and value. It has a newly designed 10.1 MP Canon CMOS sensor plus a host of new features including a 2.5-inch LCD monitor, the exclusive... More..."

Result 2: "Canon EOS Digital Rebel XTi offers an unbeatable combination of performance, ease-of-use and value. It has a newly designed 10.1 Mega Pixel Canon CMOS sensor plus a host of new features including a 2.5-inch LCD monitor, the... More..."

Result 3: "The ultra-powerful 12x optical zoom on the PowerShot S5 IS means you'll get the shot you want with no compromise, yet that's only the beginning of what makes this camera so exciting. The S5 IS is..."

Retrieval semantics: keyword-search over product descriptions.

Q: [low light camera]

[Same three product results as above.]



Observation I: many web documents mention instances of low-light digital cameras in close proximity to the query keywords {low light, digital camera}.


Observation II (Pseudo-Relevance): the top web search results will contain mostly relevant pages.

Hence, we can identify the most relevant entities by:
- submitting the query to a search engine,
- identifying mentions of entities in the top returned documents,
- aggregating scores for these entities.
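A schematic sketch (our illustration; `web_search` and `extract_entities` are hypothetical stand-ins for the engine components the talk piggy-backs on, and the rank-discounted vote is just one simple aggregation choice) of the three steps just listed:

```python
from collections import Counter

def rank_entities(query, web_search, extract_entities, top_k=50):
    # 1. Submit the query to a web search engine and take the top results.
    results = web_search(query)[:top_k]
    scores = Counter()
    # 2. Identify mentions of entities in the top returned documents.
    for rank, doc in enumerate(results, start=1):
        for entity in extract_entities(doc):
            # 3. Aggregate scores for these entities (here: rank-discounted vote per mention).
            scores[entity] += 1.0 / rank
    return scores.most_common()
```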

Results [WWW 2009]:
- Significant improvement in retrieval precision and recall.
- Low overhead by piggy-backing on search engine components:
  - entities extracted as part of the page-crawl pipeline,
  - entity indexing and retrieval in snippet generation.

General approach:
- Issue the search query against a document corpus.
- Identify relevant sub-components of the top results (e.g., titles, captions, tags, categories, entities, etc.).
- Aggregate over the components.

Example of such an aggregate, scoring words w from the result documents d against a background corpus C (a clarity-style score):

  sum_{w, d in Result(q)}  P(w | q) * log_2 ( P(w | q) / P(w | C) )
Approach:
- Retrieve the top ~50 documents from a web search engine.
- Categorize each document into a commercial taxonomy: Cat(D1), Cat(D2), Cat(D3), ...
- Use the combination of categories to characterize the query for advertising.
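A sketch of the same pattern for ad-oriented query categorization (our code; `web_search` and `classify_into_taxonomy` are placeholders for the engine and the taxonomy classifier): categorize each of the top results and characterize the query by the aggregated category distribution.

```python
from collections import Counter

def categorize_query(query, web_search, classify_into_taxonomy, top_k=50):
    docs = web_search(query)[:top_k]             # retrieve the top ~50 documents
    votes = Counter()
    for doc in docs:
        votes[classify_into_taxonomy(doc)] += 1  # Cat(D_i): category of each document
    total = sum(votes.values()) or 1
    # The query is characterized by the combination (distribution) of categories.
    return {category: count / total for category, count in votes.items()}
```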

Approach:
- Retrieve the top news documents from a web search engine.
- Extract the publish date / order: Date(D1), Date(D2), Date(D3), ...
- Count how many of the retrieved documents were among the k most recently published ones.
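A sketch of this recency signal (ours; `news_search`, `publish_date` and `corpus_dates` are hypothetical stand-ins for the news index and its metadata):

```python
def recency_signal(query, news_search, publish_date, corpus_dates, k=1000):
    # The k most recent publish dates in the news corpus define a freshness cutoff.
    cutoff = sorted(corpus_dates, reverse=True)[min(k, len(corpus_dates)) - 1]
    # Count how many retrieved documents were published on or after that cutoff,
    # i.e. are among the k most recently published ones.
    return sum(1 for doc in news_search(query) if publish_date(doc) >= cutoff)
```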



Additional Examples:
- [Shen et al., SIGKDD Explorations 2005] Query classification: using title, snippet and category information from each document.
- [Collins-Thompson et al., SIGIR'09] Query difficulty prediction: documents are represented as a low-dimensional feature vector.

Many isolated variations on the general approach:
- What is the right abstraction or infrastructure?
- Neither the corpus nor the retrieval depth need to correspond to the 'normal' web search result.
- => Integration of pre-computed information into retrieval, and aggregation over this data.

Q: [Sony]

[Same three product results as above.]

Similar problem:
- Many verticals with relevant answers.
- Example: the query "Harry Potter" may trigger products, images, movies, etc.


Retrieval Overhead



Retrieval Quality



Vertical Selection

Once instances of a query have been observed, its CTR can be tracked,
... but how do we deal with unseen queries?

Task: estimate Pr( Click | Query, News-Results ).
- News results compete for space with web results/ads.
- => Trigger only for queries with a likely click.













News CTR is not primarily a function of document relevance:
- Relevant document(s) are necessary, but not sufficient, for high CTR.
- CTR for an ongoing news story (often) remains stable, even as the underlying documents change.
- => "Buzz/Attention" around a story makes a difference.

Identifying news queries is not a (binary) query classification task:
- Many queries are inherently ambiguous, e.g. 'Georgia'.
- Human labeling of training data is difficult:
  'Voter Registration': 1.5%-5% CTR
  'Oil Prices': 22%-29% CTR
  'Caylee Anthony': 63%-69% CTR

News click-through rates change (rapidly) over time.
Query text n-grams are unlikely to yield good features.

Observations:
- Queries without news intent may still receive clicks.
- CTR varies significantly among news-queries.
- Keywords that are specific to a news event receive higher CTR.




Supervised learning, using collected click data.

Model Pr( Click | Query, News-Results ) as

Pr( Click | Relevance(Top News Result(s)),      [BM-25 score]
            Attention/Buzz around keywords,      [from the news crawl; next slides...]
            "Cohesion" of the retrieved stories,
            query surface properties ).
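As a concrete, hypothetical rendering of this model (ours, not the talk's implementation), the feature vector for one (query, news-results) observation could bundle the four signal groups; the feature functions and field names below are placeholders for the components discussed on the surrounding slides.

```python
def click_features(query, news_results, buzz_counters, cohesion_score):
    # One feature row per (query, news-results) observation; the label is the observed click.
    return {
        "bm25_top_result": news_results[0]["bm25"] if news_results else 0.0,  # relevance
        "title_buzz": buzz_counters.get("title", 0),                          # attention/buzz
        "cohesion": cohesion_score,                                           # story cohesion
        "query_length": len(query.split()),                                   # surface property
    }
```

Any calibrated classifier over such rows (e.g., logistic regression or boosted trees) can then estimate Pr( Click | Query, News-Results ).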

[Chart: number of occurrences of 'Georgia' in news titles, news first paragraphs and news article bodies, 31-Aug through 12-Sep (y-axis 0 to 1800).]
Partition news articles by crawl-date, separately for Titles, 1st Paragraph and Text Body.

Now, track "attention" in news by measuring the incidence of query-keywords in each partition.
=> Each query generates an array of counters.
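A sketch (ours; the article fields are assumed, not the talk's schema) of these counter arrays: group the crawl by crawl date and count query-keyword occurrences separately in titles, first paragraphs and article bodies.

```python
from collections import defaultdict

def attention_counters(query, articles):
    """articles: iterable of dicts with 'crawl_date', 'title', 'first_paragraph', 'body'.
    Returns {crawl_date: [title_hits, first_paragraph_hits, body_hits]} for the query keywords."""
    keywords = [w.lower() for w in query.split()]
    counters = defaultdict(lambda: [0, 0, 0])
    for article in articles:
        for slot, field in enumerate(("title", "first_paragraph", "body")):
            text = article[field].lower()
            counters[article["crawl_date"]][slot] += sum(text.count(w) for w in keywords)
    return counters
```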

Issues:
- Occurrence of query-keywords in news tracks coverage, not attention.
- How to differentiate keywords that are 'globally' frequent from those in new news headlines?

Multiple corpora (blog crawl and news crawl, each partitioned into Titles, 1st Paragraph, Text Body):
- Blogs and news complement each other to capture coverage vs. attention.
- Use of a 'background' corpus allows us to identify keywords indicative of news.


Supervised approach, using collected click data.

Model Pr( Click | Query, News-Results ) as

Pr( Click | Relevance(Top News Result(s)),
            Attention/Buzz around keywords,
            "Cohesion" of the retrieved stories,
            query surface properties ).

Query "President Obama": occurs often, but in several different news events => Pr(click) is lower.
Query "Hurricane Ike": occurs less often, uniquely identifies a specific news event => Pr(click) is higher.




How similar are the documents the query terms occur in?

Approach: for all (subsets of) query terms:
- Retrieve the matching documents.
- Compute a language model of the contexts the terms occur in.
- Compute the similarity of these models.

Similarity metric: Jensen-Shannon divergence.

Carmel, Yom-Tov, Darlow and Pelleg, 'What makes a query difficult?', SIGIR 2006.
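The cohesion signal compares the context language models with Jensen-Shannon divergence; a minimal sketch (ours), assuming unigram models given as plain word-probability dicts:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q) over a shared vocabulary; 0 * log 0 := 0.
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def jensen_shannon(p, q):
    # JSD(p, q) = 1/2 D(p || m) + 1/2 D(q || m), where m is the average distribution.
    vocab = set(p) | set(q)
    pf = {w: p.get(w, 0.0) for w in vocab}
    qf = {w: q.get(w, 0.0) for w in vocab}
    m = {w: 0.5 * (pf[w] + qf[w]) for w in vocab}
    return 0.5 * kl(pf, m) + 0.5 * kl(qf, m)

# Similar context models -> low divergence -> high cohesion for the query terms.
print(jensen_shannon({"storm": 0.5, "gulf": 0.5}, {"storm": 0.4, "gulf": 0.6}))
```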









[Chart: precision (80%-100%) vs. recall (0%-100%) for predicting CTR > 10%, CTR > 15% and CTR > 20%.]

Baseline:
- CTR > 10%: 70.1%
- CTR > 15%: 75.9%
- CTR > 20%: 81.8%

For 82.5% of queries, the prediction is within an 'error-band' of +/- 10%.


Using (relevance-) scores for single verticals is limiting.
=> How indicative is a query for a vertical?

Additional sources of evidence:
- Query-logs (e.g., [Arguello et al., SIGIR'09], [Diaz, WSDM'09]):
  - Sets of queries issued against / resulting in clicks for a given vertical.
  - Generalization through language models.
- Non-web text corpora (e.g., [Arguello et al., SIGIR'09]):
  - Collections representative of verticals or concepts (e.g., via Wikipedia).
  - Measures: clarity/cohesion, expected # results, trends over time.
- Document categories (e.g., [Collins-Thompson et al., SIGIR'09])
- Concept graphs (e.g., [Diemert et al., WWW'09]):
  - Based on co-reference between concepts.
  - Extracted automatically, leveraging the search engine to take advantage of its relevance model and spam filtering.

Query-text-based classification performs well given large training data sets.
- Automatic generation of queries/labels (e.g., [Li et al., SIGIR'08], [Fuxman et al., SIGKDD'09]).



Retrieval processing:
- Novel retrieval problems.
- Loose coupling between retrieval processing and ranking.
- (Worst-case) latency matters.
- Faster retrieval by leveraging data distributions and matching semantics.

Integration of web search and ad/vertical retrieval:
- Search provides context and can be used to enrich both the query and the text associated with items in a vertical.
- The approach of search, extract/pre-compute & aggregate appears to apply in many scenarios.
- Extends to additional (non-web) corpora, query logs, etc.
- Combining evidence from multiple sources.

Challenges / next steps:
- Identifying common abstractions / operators.
- What is the correct system infrastructure?