Ziv Bar-Yossef Maxim Gurevich


Dec 4, 2013


Ziv Bar-Yossef, Google and Technion
Maxim Gurevich, Technion




Impressions and ImpressionRank


Impression of page/site x on a keyword w:

  A user sends w to a search engine
  The search engine returns x as one of the results
  The user sees the result x

ImpressionRank of x:

  # of impressions of x
  Within a certain time frame

Measure of page/site visibility in a search engine

Each result has an impression on the keyword “www 2009”:

  www.2009.org
  www2009.org/calls.html
  www.loginconference.com
  ...
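The definitions above can be illustrated with a small counting sketch. It assumes a hypothetical query log of (timestamp, keyword, results the user saw) records; the log layout, the sample rows, and the function names are illustrative only and are not taken from the talk.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records: (timestamp, keyword, URLs the user actually saw).
LOG = [
    (datetime(2009, 3, 1), "www 2009", ["www2009.org/calls.html", "www.loginconference.com"]),
    (datetime(2009, 3, 2), "www 2009", ["www2009.org/calls.html"]),
    (datetime(2009, 4, 9), "web conference", ["www2009.org/calls.html"]),
]

def impressions_by_keyword(log, x, start, end):
    """Count impressions of page x per keyword within [start, end)."""
    counts = Counter()
    for when, keyword, seen in log:
        if start <= when < end and x in seen:
            counts[keyword] += 1
    return counts

def impression_rank(log, x, start, end):
    """ImpressionRank of x: total number of impressions in the time frame."""
    return sum(impressions_by_keyword(log, x, start, end).values())

march = (datetime(2009, 3, 1), datetime(2009, 4, 1))
rank = impression_rank(LOG, "www2009.org/calls.html", *march)
```

Only a search engine holds such a log, which is why the rest of the talk estimates these quantities externally.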

Popular Keyword Extraction


The Popular Keyword Extraction problem:

  Input: web page x, int k
  Output: k keywords on which x has the most impressions among all keywords

Example: x = www.johnmccain.com

  sarah palin
  john mccain
  cindy mccain
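If per-keyword impression counts were available, the problem statement would reduce to a top-k selection. The sketch below shows only that selection step; the counts are made up for illustration and the function name is hypothetical.

```python
from collections import Counter

def popular_keywords(impressions, k):
    """Return the k keywords with the most impressions, most popular first."""
    return [kw for kw, _ in Counter(impressions).most_common(k)]

# Made-up impression counts, for illustration only.
example = {
    "sarah palin": 50,
    "john mccain": 40,
    "cindy mccain": 10,
    "arizona senator": 3,
}
top3 = popular_keywords(example, 3)
```

The hard part, addressed in the rest of the talk, is that an external observer does not have these counts and must discover the keywords through search and suggestion requests.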

Motivation


Popularity rating of pages and sites

Site analytics

  Enable site owners to determine their visibility in different search engines
  Combine with traffic data to derive click-through rates
  Compare to other sites

Keyword suggestions for online advertising

Social analysis

Search engine evaluation

Finding similar pages

Internal Measurements of ImpressionRank and Popular Keyword Extraction

Search engines can compute both ImpressionRank and popular keywords based on their query logs

Query logs are not publicly released due to privacy concerns

Caveats:

  Only search engines can do this
  Non-transparent

External Measurements of ImpressionRank and Popular Keyword Extraction

[Diagram: Target page URL, ImpressionRank estimator / Popular keyword extractor, ImpressionRank / Popular Keywords]

Main cost measure: # of requests to the search engine and to the suggestion server

Our Contributions


Reduce ImpressionRank Estimation to Popular Keyword Extraction

First external algorithm for popular keyword extraction

  Accurate
  Uses relatively few search engine requests
  Applies to:
    Single web pages (www.cnn.com)
    Web sites (www.cnn.com/*)
    Domains (*.cnn.com/*)

Related Work


Keyword extraction [Frank et al 99, Turney 00, …]

Keyword suggestions (for online advertising) [Yih et al 06, Fuxman et al 08]

Query by Document [Yang et al 09]

Commercial traffic reporting [GoogleTrends, comScore, Nielsen, Compete]


Roadmap


The naïve popular keyword extraction algorithm

The improved popular keyword extraction algorithm

Best-First Search

Experimental results

Popular Keyword Extraction: The Naïve Algorithm

[Diagram components: Term Extractor, Term Pool, Candidate keyword generator, Candidate Verifier, Popular Keywords, Search Engine, Suggestion Server]

Verification procedure for keyword w:

  Submit w to the search engine and the suggestion server
  Verify that w returns the target page
  Verify that the popularity of w > 0 [BG08]

Recall problem: Target page may have impressions on keywords that do not occur in its text

Efficiency problem: 10^3 terms → 10^9 3-term candidates
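A minimal sketch of the naïve approach, to make the efficiency problem concrete. It assumes candidates are ordered sequences of page terms, and it replaces the search engine and the popularity estimator with caller-supplied stand-ins (`search`, `popularity`); these names and the toy data are hypothetical.

```python
from itertools import product

def candidates(terms, max_len):
    """Yield every ordered sequence of 1..max_len terms as a candidate keyword."""
    for n in range(1, max_len + 1):
        for combo in product(terms, repeat=n):
            yield " ".join(combo)

def verify(w, target_url, search, popularity):
    """Accept w if the engine returns the target page on w and w has popularity > 0."""
    return target_url in search(w) and popularity(w) > 0

def naive_extract(terms, max_len, target_url, search, popularity):
    """Verify every candidate: one search and one popularity lookup per candidate."""
    return [w for w in candidates(terms, max_len)
            if verify(w, target_url, search, popularity)]

# Toy stand-ins for the search engine and the popularity estimator.
def fake_search(w):
    return ["example.com/mp3"] if w.startswith("mp3") else []

def fake_popularity(w):
    return 1.0 if w in {"mp3", "mp3 tag"} else 0.0

found = naive_extract(["mp3", "tag"], 2, "example.com/mp3", fake_search, fake_popularity)

# With a pool of 10^3 terms, the 3-term candidates alone number (10^3)^3.
three_term_candidates = (10 ** 3) ** 3
```

Verifying each candidate costs at least one request, so the candidate count is the quantity the improved algorithm has to cut down.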



Popular Keyword Extraction: The Improved Algorithm

[Diagram components: Target Page, Similar Pages, Anchor Text, Term Extractor, Term Pool, Candidate keyword TRIE, Best-First Search, Candidate Verifier, Popular Keywords, Search Engine, Suggestion Server]

[Candidate keyword TRIE example: terms mp3, song, tag, weather; candidates such as “mp3” and “mp3 tag”]
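The candidate keyword TRIE can be pictured as a tree whose root-to-node paths spell out candidates. The class below is a minimal sketch under that assumption; the class and method names are hypothetical, and the talk does not specify the actual data structure layout.

```python
class TrieNode:
    """One node of a candidate keyword trie; each edge adds one term."""

    def __init__(self, term=None, parent=None):
        self.term = term
        self.parent = parent
        self.children = {}

    def child(self, term):
        """Return the child for `term`, creating it on first use."""
        if term not in self.children:
            self.children[term] = TrieNode(term, self)
        return self.children[term]

    def keyword(self):
        """Spell out the candidate keyword on the path from the root to this node."""
        terms = []
        node = self
        while node.parent is not None:
            terms.append(node.term)
            node = node.parent
        return " ".join(reversed(terms))

root = TrieNode()
mp3 = root.child("mp3")
root.child("weather")
mp3.child("song")
tag = mp3.child("tag")
```

Storing candidates this way means that discarding one node discards every longer candidate that starts with the same terms.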

Best-First Search

Goals:

  Prune as many candidates as possible
  Verify the most promising candidates first

Start with single term candidates

Score candidates

While the search engine request budget is not exceeded

  w = top scoring candidate
  Send w to the verifier
  Decide whether to prune w
  If w is not pruned
    Expand w: generate and score the children of w

[Diagram: candidate tree with example scores 3, 5, 8; components: Candidate Verifier, Search Engine, Suggestion Server]
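The loop above can be sketched with a priority queue. This is an illustration, not the authors' implementation: `score`, `verify`, and `should_prune` are caller-supplied stand-ins, candidates are tuples of terms, and the budget is simplified to one unit per verified candidate.

```python
import heapq

def best_first_search(terms, score, verify, should_prune, budget):
    """Verify candidates in descending score order until the budget is spent."""
    # heapq is a min-heap, so scores are negated to pop the best candidate first.
    frontier = [(-score((t,)), (t,)) for t in terms]
    heapq.heapify(frontier)
    found = []
    while frontier and budget > 0:
        _, w = heapq.heappop(frontier)
        budget -= 1
        if verify(w):
            found.append(" ".join(w))
        if should_prune(w):
            continue  # w and all its descendants are dropped
        for t in terms:  # expand w: generate and score its children
            child = w + (t,)
            heapq.heappush(frontier, (-score(child), child))
    return found

# Toy stand-ins, for illustration only.
TERMS = ["mp3", "tag", "weather"]
POPULAR = {("mp3",), ("mp3", "tag")}

def toy_score(w):
    return 1.0 / len(w) + (1.0 if w[0] == "mp3" else 0.0)

def toy_verify(w):
    return w in POPULAR

def toy_prune(w):
    return w[0] != "mp3" or len(w) >= 2

result = best_first_search(TERMS, toy_score, toy_verify, toy_prune, budget=10)
```

Because children are only generated for candidates that survive pruning, most of the candidate space is never materialized.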

Pruning


Pruning decision for keyword w:

  Submit query inurl:<target url> w
    If no results, prune w and all its descendants
  Retrieve suggestions for w
    If no results, prune w and all its descendants

Pruning eliminates the vast majority of candidates

  A single search/suggestion request may eliminate thousands of candidates
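The two pruning checks can be sketched as a single predicate. `search` and `suggest` are hypothetical callables standing in for the search engine and the suggestion server; each is assumed to return a (possibly empty) list of results.

```python
def should_prune(w, target_url, search, suggest):
    """Return True if w and all its descendants can be discarded."""
    # Check 1: restrict the query to the target URL; no results means
    # the target page is never returned on w.
    if not search(f"inurl:{target_url} {w}"):
        return True
    # Check 2: no suggestions are returned for w.
    if not suggest(w):
        return True
    return False

# Toy stand-ins, for illustration only.
def fake_search(query):
    return ["hit"] if "mp3" in query else []

def fake_suggest(prefix):
    return ["mp3 tag"] if prefix == "mp3" else []

prune_weather = should_prune("weather", "example.com", fake_search, fake_suggest)
prune_mp3 = should_prune("mp3", "example.com", fake_search, fake_suggest)
```

The suggestion request is skipped when the search check already fails, which saves one request per pruned candidate in this sketch.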

Scoring


The Best-First search algorithm considers only the top scoring candidates given the budget

Want to predict

  Whether the search engine returns the target page on w
  Whether w is a popular keyword

score(w) = tf(w)^α · idf(w)^β · popularity_score(w)^γ

α, β, and γ: relative weights of the scoring components

tf(w) and idf(w): predict whether the search engine returns the target page on w

popularity_score(w): predicts the popularity of w
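A minimal sketch of combining the three scoring components. It assumes the relative weights act as exponents in a product, which is one plausible reading of the formula; the parameter names `alpha`, `beta`, `gamma` and the sample values are illustrative.

```python
def score(tf, idf, popularity, alpha=1.0, beta=1.0, gamma=1.0):
    """Combine the components; the exponents are the assumed relative weights."""
    return (tf ** alpha) * (idf ** beta) * (popularity ** gamma)

# Made-up (tf, idf, popularity_score) triples, for illustration only.
components = {
    "mp3 tag": (0.5, 2.0, 4.0),
    "mp3 table": (0.5, 2.0, 1.0),
}
ranked = sorted(components, key=lambda w: score(*components[w]), reverse=True)
```

With equal tf and idf, the candidate with the higher popularity score is ranked first, so it would be verified earlier by the best-first search.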

How to Compute Candidate Scores


Every time the algorithm expands a keyword, it needs to compute scores for all its children

  There could be thousands of such children

TF Score

  Straightforward. No search requests needed.

IDF Score

  Approximated based on an offline corpus. No search requests needed.

Popularity Score

  [BarYossefGurevich 08]: Algorithm for estimating keyword popularity using the query suggestion service
    Too costly: may use dozens of suggestion requests per estimate
  We present a new algorithm that estimates popularity for all the children in bulk
    Uses hundreds of suggestion requests to estimate the popularity of all the children
    Estimates are less accurate
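The TF and IDF components need no search requests because they depend only on the target page text and on offline corpus statistics. The sketch below assumes single-term scoring, a relative-frequency TF, and a smoothed logarithmic IDF; the talk does not give the exact formulas, so these are standard textbook choices and the names are hypothetical.

```python
import math
from collections import Counter

def tf_scores(page_terms):
    """Relative frequency of each term in the target page text."""
    counts = Counter(page_terms)
    total = len(page_terms)
    return {term: count / total for term, count in counts.items()}

def idf_score(term, doc_freq, num_docs):
    """Smoothed inverse document frequency from offline corpus statistics."""
    return math.log((1 + num_docs) / (1 + doc_freq.get(term, 0)))

# Made-up page text and corpus statistics, for illustration only.
tf = tf_scores(["mp3", "tag", "mp3", "song"])
DOC_FREQ = {"the": 999, "mp3": 10}
idf_mp3 = idf_score("mp3", DOC_FREQ, 1000)
idf_the = idf_score("the", DOC_FREQ, 1000)
```

Rare terms receive a higher IDF than common ones, so candidates built from distinctive page terms are favored.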

Cheap Popularity Estimation


Input: a keyword w

Goal: Estimate popularity of all w’s children

  Bucket children according to their first character
  Estimate relative popularity of each bucket
  Estimate the relative popularity within each bucket

Example: w = “mp3”

  children: “mp3 song”, “mp3 tag”, “mp3 table”, …

[Diagram: the BG08 Popularity Estimator returns an estimate of popularity_score(prefix) for prefixes such as “mp3 s” and “mp3 t”; the children of “mp3_” are bucketed by first character (a, …, s, t), e.g. “mp3 song”, “mp3 tag”, “mp3 table”; example values shown: 5, 6, 2, 4, 5]
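The three steps above can be sketched as follows. `estimate_prefix` stands in for a prefix popularity estimator and `within_bucket_weight` for whatever signal ranks children inside a bucket; the talk does not say how the within-bucket estimate is obtained, so that callable, the function names, and all numbers are assumptions.

```python
from collections import defaultdict

def bulk_popularity(w, child_terms, estimate_prefix, within_bucket_weight):
    """Estimate the popularity of every child "w t" with one estimate per bucket."""
    # Step 1: bucket children according to their first character.
    buckets = defaultdict(list)
    for term in child_terms:
        buckets[term[0]].append(term)
    estimates = {}
    for first_char, terms in buckets.items():
        # Step 2: one popularity estimate for the whole bucket, e.g. "mp3 t".
        bucket_popularity = estimate_prefix(f"{w} {first_char}")
        # Step 3: split the bucket's popularity among its members.
        total = sum(within_bucket_weight(t) for t in terms)
        for t in terms:
            share = within_bucket_weight(t) / total if total else 1 / len(terms)
            estimates[f"{w} {t}"] = bucket_popularity * share
    return estimates

# Made-up numbers, for illustration only.
PREFIX_POPULARITY = {"mp3 s": 5.0, "mp3 t": 8.0}
WEIGHTS = {"song": 1.0, "tag": 3.0, "table": 1.0}
estimates = bulk_popularity(
    "mp3", ["song", "tag", "table"],
    PREFIX_POPULARITY.get, WEIGHTS.get,
)
```

The number of estimator calls grows with the number of buckets rather than the number of children, which is where the savings over per-child estimation come from.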

Popular Keyword Extraction Algorithm: Quality Analysis

Precision: 100%

  All extracted keywords return the target page

Recall: do we miss some popular keywords?

  More difficult to measure: no ground truth to compare to
  Estimate lower bound on the recall
    Google: recall > 90%
    Yahoo!: recall = 70%-80%


Resource Usage

~10000 suggestion server requests per page

~1000 search engine requests per page

85% (Google) and 75% (Yahoo!) of popular keywords found after 25% of resources spent

[Plot: weighted fraction of popular keywords found (0%-100%) vs. search requests used (0-1000); series: Google, Yahoo!]

ImpressionRank of News Sites (March 2009)

[Chart: Relative ImpressionRank of cnn.com, nytimes.com, and washingtonpost.com; series: Google, Yahoo!, Compete]

[Keyword callouts on the chart: weather, cnn, video, obama, weather, cnn, bristol palin, news, amazon, movies, barack obama, stimulus package, new york times, barack obama]

ImpressionRank of Social Sites (March 2009)

[Chart: Relative ImpressionRank of en.wikipedia.org, www.youtube.com, www.facebook.com, and www.myspace.com; series: Google, Yahoo!, Compete]

Conclusions


First external algorithms for

  ImpressionRank estimation
  Popular keyword extraction

Future work

  Improve efficiency
  Improve recall