A Brief Tour of Modern Web Search Engines

Hugh E. Williams

eBay Inc.

hugh.williams@ebay.com

Overview


Introduction


Web crawling


Document stores and indexing


Inverted Indexing


Query Evaluation


Ranking and Relevance Measurement


Caching and Web Serving


eBay


Reading Materials

Web Search Basics


Web search engines don’t search the web


They search a copy of the web


They crawl or spider documents from the web


They index the documents, and provide a search
interface based on that index


Document summarization is used to present short snippets that allow users to judge relevance


Users click on links to visit the actual, original web
document

(Simplified) Web Search Architecture

Crawlers

Document Store

Index File Managers

Result Cache

Web Servers

Aggregators

CRAWLERS AND CRAWLING

Crawling from Seed Resources


The basic seed-based crawling algorithm is as follows (a minimal sketch in code follows the list):

1. Create an empty URL queue

2. Add user-supplied seed URLs to the queue (simplest approach: append to the tail)

3. If the resource at the head of the queue meets the “crawl criteria” (more later), request the resource at the head of the queue

4. Process the retrieved resource:

1. Store the headers and resource in the collection store

2. Extract URLs from the resource; for each URL, decide whether it should be added to the URL queue, and if so append it to the tail

3. Record the URL in the visited URL list with the time visited

5. Repeat from Step 3 until the queue is empty, then stop.
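
A minimal sketch of this loop in Python, assuming placeholder fetch() and extract_urls() callables; a real crawler also adds politeness delays, robots.txt handling, and URL canonicalization (discussed next).

```python
# A minimal sketch of the seed-based crawl loop described above.
# fetch() and extract_urls() are placeholders supplied by the caller.
from collections import deque
import time

def crawl(seed_urls, meets_crawl_criteria, fetch, extract_urls):
    queue = deque(seed_urls)      # Steps 1-2: queue seeded with user-supplied URLs
    visited = {}                  # URL -> time visited
    collection_store = {}         # URL -> (headers, body)

    while queue:                  # Step 5: repeat until the queue is empty
        url = queue.popleft()
        if url in visited or not meets_crawl_criteria(url):
            continue              # Step 3: only fetch URLs that meet the crawl criteria
        headers, body = fetch(url)
        collection_store[url] = (headers, body)   # Step 4.1: store headers and resource
        for new_url in extract_urls(body, url):   # Step 4.2: harvest and enqueue new URLs
            if new_url not in visited and new_url not in queue:
                queue.append(new_url)
        visited[url] = time.time()                # Step 4.3: record the visit time
    return collection_store, visited
```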

So, it’s that simple?



“I'm writing a robot, what do I need to be careful of? Lots. First read through all the stuff on the robot page, then read the proceedings of past WWW Conferences, and the complete HTTP and HTML spec. Yes; it's a lot of work.” (from http://www.robotstxt.org/faq/writing.html)


Writing a crawler isn’t straightforward. Some examples:


Sites can use the robots.txt exclusion standard to limit which pages
should be retrieved


Crawlers shouldn’t overload or over-visit sites


Many URLs exist for the same resource


URLs redirect to other resources (more in a moment)


Dynamic pages can generate loops, unending lists, and other traps


URLs are difficult to harvest: some are embedded in JavaScript scripts,
hidden behind forms, and so on




Example: Resolving URLs


The following URLs resolve to the same resource (a normalization sketch in code follows the list):


ebay.com/garden


pages.ebay.com/garden


www.ebay.com/garden


www.ebay.com/./garden


www.ebay.com//////garden


ebay.com/GARDEN


ebay.com:80/garden


ebay.com/%67%61%72%64%65%6e


ebay.com/garden/foo/..


garden.ebay.com


garden.ebay.com/index.html


garden.ebay.com/#test


garden.ebay.com/?test=hello
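
A minimal sketch of URL canonicalization for several of the cases above, using only the Python standard library; the host-aliasing rules (ebay.com versus pages.ebay.com versus garden.ebay.com) and case-insensitive paths are site-specific and are not handled here.

```python
# A minimal sketch of URL canonicalization: lowercase the host, drop the
# default port and fragment, decode %-escapes, and normalize ./ ../ and
# repeated slashes in the path.
from urllib.parse import urlsplit, urlunsplit, unquote
import posixpath

def canonicalize(url, default_scheme="http"):
    if "://" not in url:
        url = default_scheme + "://" + url
    scheme, netloc, path, query, fragment = urlsplit(url)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]               # drop the default port
    path = unquote(path)                   # %67%61%72%64%65%6e -> garden
    path = posixpath.normpath(path)        # resolve ./ , .. and runs of slashes
    if path == ".":
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))   # discard the #fragment

print(canonicalize("www.ebay.com//////garden"))        # http://www.ebay.com/garden
print(canonicalize("ebay.com:80/%67%61%72%64%65%6e"))  # http://ebay.com/garden
```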



Crawl Criteria


Crawlers actually need to do three fundamental tasks:

1. Fetch new resources from new domains or pages

2. Fetch new resources from existing domains or pages

3. Re-fetch existing resources (that have changed)

We can think of a successful crawl action as one that leads to a new resource being indexed and visited (or viewed?) by the user of the search engine

A failure is fetching a resource that isn’t used, or refetching a resource that didn’t change

Crawl Criteria…


Crawl prioritization is essential:


There are far more URLs than available fetching
bandwidth


For large sites, where we’re being polite, it’s
impossible to fetch all resources


It’s essential to balance refetch and discovery


It’s essential to balance new site exploration with
old site exploration

Interesting problems


Crawler challenges:


HTTP HEAD and GET requests sometimes return different headers

Not Found pages often return HTTP 2xx codes

A page can redirect to itself, or into a cycle (more in a moment)

Pages can look different to end-user browsers and crawlers

Pages can require JavaScript processing

Pages can require cookies

Pages can be built in non-HTML environments

DOCUMENT STORES AND INDEXES

(Simplified) Web Search Architecture

Crawlers

Document Store

Index File Managers

Result Cache

Web Servers

Aggregators

Indexing Challenges


There are hundreds of billions of web pages (and, as we’ve seen, the web is effectively infinite)


It is neither practical nor desirable to search over all of them:


Should remove spam pages


Should remove illegal pages


Should remove repetitive or duplicate pages


Should remove crawler traps


Should remove automatically generated pages


Should remove pages that no longer exist


Should remove pages that have substantially changed


Should remove pages that cannot be understood by the target users





Most search engines index somewhere in the range of 20 to 50 billion
documents


Figuring out how many pages each engine indexes, and how many pages
are on the web are both hard research problems

How do we choose the right pages?


There are many ways to choose the right pages:


Store those that meet future information needs!


In practice, this means:


Choose pages that users visit


Choose pages that are popular in the web link graph


Choose pages that match queries


Choose pages from popular sites


Choose pages that are clicked on in search results


Choose pages shown by competitors


Choose pages in the language or market of the users


Choose pages that are distinct from other pages


Choose pages that change at a moderate rate





Whatever choice is made:


The head is stable


The tail “wags around”: billions of candidate pages have similar or identical scores

Choosing Pages in Practice


In practice, there are two solutions to choosing pages for the index:


In real time, make a yes/no decision about each page, and add to the index


Store the pages, and process them offline to construct an index


The former solution is typically based on the well-known AltaVista “chunk” approach:


Create a buffer of documents (a “chunk”)


Build an index on that buffer


Move the index and content to an index serving node


(After some time) Mark the chunk’s URLs for refetch


(After some time) Expire the chunk


The latter approach is likely what’s used at Google:


Store multiple copies of the web in a document store


Iterate over the document store (potentially multiple times) to choose
documents


Create an index, and ship it to the index serving nodes


Repeat

INVERTED INDEXES

Supporting Query Based Retrieval


We’ll talk about query evaluation in the next
section


But, for now, believe that queries are evaluated using inverted indexes


Compressed inverted indexes are typically 10%-20% of the size of the data being stored.


In many cases,
they are too large to store in memory,
so disk storage is a necessity. However:


disk size is limited


disk access is slow

Inverted Index


A document-level inverted index for a collection consists of:

a lexicon: a searchable in-memory vocabulary containing the unique searchable terms in the collection (t_1, …, t_n)

for each term t, a pointer to the inverted list of that term on disk

The inverted list contains information about the occurrence of terms:

postings <d, f_d,t>, where f_d,t is the frequency of term t in document d; one posting is stored for each document in which t occurs

additional statistics such as f_t (the number of documents that t occurs in) and L_d (the length of document d)

Inverted Index

[Figure: an example inverted index. The collection contains eight documents, three of which are shown: “wild cat”, “fat cat”, and “cat on the mat”. The lexicon is held in memory; the inverted lists are held in memory or on disk; a mapping file connects document numbers to documents in the collection. The inverted list for “cat” is 3: 1, 2, 7, meaning the term occurs in three documents, numbered 1, 2, and 7.]
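
A minimal sketch of building a document-level index of this form in Python; the toy documents echo the figure, and the whitespace tokenization is a simplifying assumption.

```python
# A minimal sketch of a document-level inverted index with <d, f_d,t> postings
# and L_d (document length) statistics, as defined above.
from collections import Counter, defaultdict

def build_index(documents):
    """documents: dict mapping document number -> text."""
    index = defaultdict(list)     # term -> list of (d, f_d,t) postings
    doc_lengths = {}              # d -> L_d
    for d, text in sorted(documents.items()):
        terms = text.lower().split()
        doc_lengths[d] = len(terms)
        for term, f_dt in Counter(terms).items():
            index[term].append((d, f_dt))   # postings ordered by increasing d
    return index, doc_lengths

docs = {1: "wild cat", 2: "fat cat", 7: "cat on the mat"}
index, lengths = build_index(docs)
print(index["cat"])   # [(1, 1), (2, 1), (7, 1)] -> f_t = 3 documents
```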

Answering Queries


Document numbers and frequencies are sufficient to answer ranked (more later) and Boolean queries

the position of terms in a document is not important

For phrase and proximity queries, we must additionally store term offsets o_i, so postings need to be of the form: <d, f_d,t [o_1 … o_f_d,t]>

Example: inverted list for the term “cat” (a phrase-matching sketch follows the example):

3 <1, 2 [2, 6]> <2, 1 [8]> <7, 3 [4, 8, 11]>
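
A minimal sketch of using positional postings of this form to answer a phrase query; the postings for “wild” are made up for illustration.

```python
# A minimal sketch of phrase matching over positional postings of the form
# term -> [(d, f_d,t, [offsets])]. The postings below are illustrative.
postings = {
    "wild": [(1, 1, [1])],
    "cat":  [(1, 2, [2, 6]), (2, 1, [8]), (7, 3, [4, 8, 11])],
}

def phrase_match(terms, postings):
    """Return document numbers where the terms occur as a contiguous phrase."""
    offsets_by_doc = [dict((d, offs) for d, _, offs in postings.get(t, []))
                      for t in terms]
    common_docs = set(offsets_by_doc[0]).intersection(*offsets_by_doc[1:])
    results = []
    for d in sorted(common_docs):
        # the phrase matches if each later term occurs i positions after the first
        if any(all(o + i in offsets_by_doc[i][d] for i in range(1, len(terms)))
               for o in offsets_by_doc[0][d]):
            results.append(d)
    return results

print(phrase_match(["wild", "cat"], postings))   # [1]
```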

Index Ordering


Postings are usually ordered by increasing d, and offsets within postings are ordered by increasing o

this allows the difference (gap) between successive values to be stored instead of the values themselves

This improves compressibility because the values are smaller

This improves compressibility because the integer distribution is more skewed

The inverted list for “cat”

3 <1, 2 [2, 6]> <2, 1 [8]> <7, 3 [4, 8, 11]>

becomes:

3 <1, 2 [2, 4]> <1, 1 [8]> <5, 3 [4, 4, 3]>

(A gap-encoding sketch in code follows.)

Other orderings are also used in web search engines:

frequency-sorted indexes

impact-ordered indexes

PageRank-ordered indexes

access-ordered indexes

In these orderings, differences can’t be taken between d values, but other differences can often be taken
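
A minimal sketch of the gap (difference) encoding described above, applied to the “cat” list:

```python
# A minimal sketch of gap encoding for a d-ordered postings list: document
# numbers and in-document offsets are replaced by differences from the
# previous value, which are smaller and easier to compress.
def gap_encode(postings):
    """postings: list of (d, f_d,t, [offsets]) with d and offsets increasing."""
    encoded, prev_d = [], 0
    for d, f_dt, offsets in postings:
        gaps, prev_o = [], 0
        for o in offsets:
            gaps.append(o - prev_o)
            prev_o = o
        encoded.append((d - prev_d, f_dt, gaps))
        prev_d = d
    return encoded

cat = [(1, 2, [2, 6]), (2, 1, [8]), (7, 3, [4, 8, 11])]
print(gap_encode(cat))   # [(1, 2, [2, 4]), (1, 1, [8]), (5, 3, [4, 4, 3])]
```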

Benefits of Compression



Compression of indexes has several benefits:

1. less storage space is needed

2. better use of disk-to-CPU communication bandwidth (or main-memory-to-CPU bandwidth)

3. more data can be cached in memory, so fewer disk accesses are required for a stream of queries

To be effective, the total retrieval time and CPU processing costs under a compression scheme should be less than the retrieval time for the uncompressed representation (a sketch of one simple code, variable-byte, follows)
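
One simple way to exploit the small gap values produced above is a variable-byte code, one of the code families examined in the Scholer et al. paper in the references; a minimal sketch, assuming the usual seven-data-bits-per-byte layout with a stop bit:

```python
# A minimal sketch of variable-byte coding: each integer is split into
# seven-bit groups, low-order group first, and the high bit marks the final
# byte of a value.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)      # low seven bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)          # final byte: set the stop bit
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:               # stop bit set: the value is complete
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

gaps = [1, 2, 2, 4, 1, 1, 8, 5, 3, 4, 4, 3]
assert vbyte_decode(vbyte_encode(gaps)) == gaps
print(len(vbyte_encode(gaps)), "bytes for", len(gaps), "gaps")   # 12 bytes
```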


Compression Experiments


Hardware

Intel Pentium III, 1.0 GHz

512 MB main memory

Linux operating system (kernel 2.4.7)

Collections

small: 500 MB (94,802 documents from TREC-7 VLC); the index fits in main memory (703,518 terms)

large: 20 GB (4,014,894 documents from TREC-7 VLC); the index is several times larger than main memory (9,574,703 terms)

Queries

10,000 / 25,000 queries from a 1997 query log from the Excite search engine

filtered to remove profanities

evaluated as conjunctive Boolean queries



Results: Small Collection

Results: Large Collection

QUERY PROCESSING

(Simplified) Web Search Architecture

Crawlers

Document Store

Index File Managers

Result Cache

Web Servers

Aggregators

Index Serving


The document collection is partitioned equally
between
n
machines


Each machine evaluates a query on its fraction of
the collection, and returns its best
m
results


An aggregator collates the responses, and chooses the overall best l (typically l = 10); a small aggregation sketch in code follows this list


The set of
n
machines is known as a row


Rows can be copied to increase throughput


Rows can be widened to decrease latency
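
A minimal sketch of the aggregation step, assuming each machine in the row returns its best m results as (score, document id) pairs; the scores and ids below are illustrative.

```python
# A minimal sketch of an aggregator: merge the per-machine result lists and
# keep the overall best l by score.
import heapq

def aggregate(per_machine_results, l=10):
    """per_machine_results: one list of (score, doc_id) pairs per machine."""
    all_hits = (hit for machine in per_machine_results for hit in machine)
    return heapq.nlargest(l, all_hits)      # overall top-l by score

row = [
    [(0.91, "doc17"), (0.40, "doc3")],      # machine 1's best m results
    [(0.87, "doc52"), (0.35, "doc61")],     # machine 2's best m results
    [(0.95, "doc24"), (0.10, "doc7")],      # machine 3's best m results
]
print(aggregate(row, l=3))   # [(0.95, 'doc24'), (0.91, 'doc17'), (0.87, 'doc52')]
```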

Querying on the nodes


In practice, nodes don’t exhaustively evaluate
queries on the inverted index:


They stop evaluating when:


Time runs out


Result sets are stable


Enough results have been found


The system is under too much load




RANKING

Querying in Web Search


Web search users search for a variety of information needs:


Broder (2002) proposed this taxonomy:


Informational (want to learn something) [around 80% of queries]


Navigational (want to go somewhere else) [around 10%]


Transactional (want to do something) [around 10%]


Users express their information needs as queries

Usually informally expressed as two or three words (we call this a ranked query; more later)


A year 2000 study showed the mean query length was 2.4 words per
query with a median of 2; the mean length is getting longer (Why?)


Around 48.4% of users submit just one query in a session, 20.8%
submit two, and about 31% submit three or more


Less than 5% of queries use Boolean operators (AND, OR, and NOT), and around 5% contain quoted phrases


What Users Are Searching For

Reproduced from: Bernard J. Jansen and Amanda Spink, “How are we searching the World Wide Web? A comparison of nine search engine transaction logs”, Inf. Process. Manage. 42(1): 248-263 (2006)

Answers


What is a good answer to a query?


One that is relevant to the user’s information need!


Web search engines typically return ten answers per page, where each answer is a short summary of a web document

Likely relevance to an information need is approximated by statistical similarity between web documents and the query

Users favour search engines that have high precision, that is, those that return relevant answers in the first page of results


Around 75% of queries don’t go beyond page one


Approximating Relevance


Statistical similarity is used to estimate the relevance of
a query to an answer


Consider the query “Mark Nason Adler Boots”

An interesting document contains all four words

Web search engines enforce this Boolean AND requirement

The more frequently the words occur in the document, the better; this is called the term frequency (TF)

Better documents have more occurrences of the rarer words

For example, an answer containing only “Adler” is likely to be better than an answer containing only “Boots”

This is the so-called inverse document frequency (IDF)


Term Frequency…


The notion of term frequency is typically expressed as tf_t,d, where tf is the term frequency of term t in document d

The weight of the term frequency component in the ranking function is usually a logarithm of the raw frequency (and 0, if tf_t,d is zero)

This “dampens” the effect of high term frequency values

Usually, if tf_t,d > 0 then w_t,d = 1 + log10(tf_t,d)


Inverse Document Frequency


To introduce discrimination between terms, we introduce the notion of IDF

The inverse document frequency is typically expressed as idf_t, where idf is the inverse of the number of documents in the collection that contain term t

Usually, idf_t = log10(N / df_t), where N is the number of documents in the collection

The log is again used to dampen the effect of very uncommon terms

Note that every term in the collection has one IDF value

This is important for index design and query evaluation; this is one reason why inverted indexing works for web search





tf.idf


Most popular ranking functions bring together TF and IDF to weight terms:

w_t,d = (1 + log10(tf_t,d)) × log10(N / df_t)

When you hear the phrase “tf.idf”, this is the basic formalization that’s being discussed:

The more a term occurs in a relevant document, the better

The more discriminating a term is across the collection, the better

You’ll sometimes see the same concept written as “tf-idf”

You’ll often see the elements of the “tf.idf” approach hidden amongst constants and other factors (a small sketch of this weight follows)
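
A minimal sketch of this weight computation; the statistics in the example are illustrative numbers, not from a real collection.

```python
# A minimal sketch of tf.idf term weighting as defined above:
# w_t,d = (1 + log10(tf_t,d)) * log10(N / df_t)
import math

def tfidf_weight(tf_td, df_t, N):
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# Toy numbers: a term occurring 3 times in a document and in 1,000 of
# 50,000,000 documents.
print(round(tfidf_weight(3, 1_000, 50_000_000), 3))   # about 6.94
```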



The Okapi ranking function is given below (under “A ranking function: Okapi BM25”). In it:

Q is a query that contains the words T

k1, b, and k3 are constant parameters (k1 = 1.2 and b = 0.75 work well, k3 is 7 or 1000)

K is k1 × ((1 - b) + b × dl / avdl)

tf is the term frequency of the term within a document

qtf is the term frequency in the query

w is log((N - n + 0.5) / (n + 0.5))

N is the number of documents, n is the number containing the term

dl and avdl are the document length and average document length

Okapi is a well-known ranking function that you’ll often find in the literature and in experimental research work

It also contains an IDL component (more later)

A ranking function: Okapi BM25








For a query Q containing terms T, the score of a document is:

sum over each term T in Q of: w × ((k1 + 1) × tf) / (K + tf) × ((k3 + 1) × qtf) / (k3 + qtf)

where K = k1 × ((1 - b) + b × dl / avdl) and w = log((N - n + 0.5) / (n + 0.5))
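
A minimal sketch of this formula in Python; the query, document, and collection statistics are illustrative, and a real engine folds in the many other ranking factors discussed later.

```python
# A minimal sketch of Okapi BM25 as written above, with k1 = 1.2, b = 0.75,
# and k3 = 7 as suggested on the earlier slide.
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=7):
    """query_terms: term -> qtf; doc_tf: term -> tf; df: term -> n."""
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_terms.items():
        tf, n = doc_tf.get(term, 0), df.get(term, 0)
        if tf == 0 or n == 0:
            continue
        w = math.log((N - n + 0.5) / (n + 0.5))
        score += (w * ((k1 + 1) * tf) / (K + tf)
                    * ((k3 + 1) * qtf) / (k3 + qtf))
    return score

# Toy example: 50,000,000 documents, a 300-word document, 900-word average.
query = {"adler": 1, "boots": 1}
doc   = {"adler": 2, "boots": 5}
df    = {"adler": 20_000, "boots": 2_000_000}
print(round(bm25_score(query, doc, 300, 900, 50_000_000, df), 2))
```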


Comments on tf.idf schemes


If a query contains only one word, the IDF
component has no effect; it’s only useful for
discriminating between terms in queries


Many formulations also include IDL, which
ensures long documents don’t dominate short
documents


They are only one (important) component of a
modern search engine ranking function


Query Evaluation in Web Search


In practice, web search engines:


Don’t do pure ranking, because they perform the Boolean AND to find documents that contain all query terms, and then rank over those documents

High precision, lower recall (more later)


Less expensive to evaluate than a ranked query


In practice, because of query alterations, the AND is
often very broad

Ranking in Practice


Search engine rankers are complex:


Machine-learned ranking functions

Hundreds of ranking factors

Query independent ranking factors

Document segments or streams

Two-pass ranking

Early query termination

Query alterations


Query independent ranking factors


Documents may have ranking factors that are
query independent:


Spam score


PageRank or page authority score


Basic statistics (word counts, inlink and outlink
counts, intra and inter domain link counts, image
counts, …)


Impression counts




Streams


It is desirable to rank different parts of the document
using different factors, weights, and rankers


For example, consider:


URL text


Title text


Body text


Anchor text (more in a moment)


Query text (queries that lead to clicks on the document)





Streams allow logical documents to be represented in
the index

Anchor text


Anchor text is drawn from the HTML <a> tags that “point” to a document

<a href="http://ebay.com">eBay home page</a>

Anchor text is often a more useful description of a site or page than the page itself contains

Anchor text is very useful for navigational querying

In practice, we may take just the anchor text, or some fragment of the text surrounding the tag too

Anchor text is painted into the destination document

In practice, anchor text management is a key challenge of crawler design (a small extraction sketch follows)
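
A minimal sketch of harvesting (target URL, anchor text) pairs with Python's standard-library HTMLParser, so the text can later be painted into the destination document's anchor-text stream; a real crawler also resolves relative URLs and copes with malformed markup.

```python
# A minimal sketch of anchor-text extraction: collect (href, anchor text)
# pairs from <a> tags in a page.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.anchors = []       # list of (href, anchor_text) pairs
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorExtractor()
parser.feed('<a href="http://ebay.com">eBay home page</a>')
print(parser.anchors)   # [('http://ebay.com', 'eBay home page')]
```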

Query Text


110009824911: number stencils


110009869781: rivera ceramic tiles


110009873312: rivera ceramic tiles
rivera ceramic tiles rivera ceramic
tiles rivera ceramic tiles


110009952936: ibm machines


110010165729: sick


110010223000: mudd


110010296589: carson pirie


110010301601: boy scout shoes


110010311498: ferragamo 6


110010377525: spin win


110010581717: invitations sweet 16


110010594270: bonfire of the vanities


110010672814: 1968 vw manual


110010675084: studebaker poster


110010757262: hawaiian shirt


110010785797: silver capris 27 silver
jeans capris 27


110010831515: fishing lures deep
diving


110010874213: harley davidson boots
harley davidson boots harley davidson
boots


110011110468: soligen hunting knife


110011350242: amanda lee


110011535343: orphan annie


110011646306: crkt weasel


110011977526: 18k gold ruby earrings


110011979581: the dale earnhardt
story


Idea: tag item with query <x>, when the item is clicked for query <x>

Query Alterations


Most engines use past user behavior to aid in
augmenting queries


For example, if many users correct a misspelling, it
will be automatically corrected


Query alterations fall in several classes:


Corrections (active, partial, or suggested)


Additional terms


Can be added to the AND, or used in reranking later

Used for highlighting


RELEVANCE MEASUREMENT

Relevance Judgment


Web search engines measure relevance using
human judges


Each result to each query is judged on a scale. For
example:


Perfect, the ideal result for this query


Relevant, that is, meets the information need


Irrelevant, that is, does not meet the information need


Detrimental, hurts the impression of the search
engine


The judgments are used to compute various
metrics that measure
recall
and
precision


Recall and Precision


Recall
is the fraction of relevant documents
retrieved from all relevant documents in the
collection


If there are 100 documents in the collection, 10 are
relevant, 12 are retrieved, and 3 of those are relevant,
the recall is 0.3 or 30%


Precision
is the fraction of retrieved documents
that are relevant


If there are 100 documents in the collection, 10 are
relevant, 12 are retrieved, and 3 of those are relevant,
the precision is 0.25 or 25%
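
A tiny check of the two definitions with the numbers above:

```python
# Recall and precision for: 10 relevant documents in the collection,
# 12 retrieved, 3 of the retrieved documents relevant.
relevant_in_collection = 10
retrieved = 12
relevant_retrieved = 3

recall = relevant_retrieved / relevant_in_collection    # 0.3
precision = relevant_retrieved / retrieved              # 0.25
print(recall, precision)
```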


Recall and Precision…


In practice, recall is hard to measure:


Requires relevance judgment of all documents in
the collection to each query


Impractical for large collections


Typically ignored by web search engines


Precision is much easier to measure


Precision may fluctuate, but typically
decreases with the number of results
inspected

Recall and Precision…


The simplest way to compare two retrieval systems is the P@n measure (a small sketch follows this list):

Take the top n results for a query q

Measure the fraction that are relevant

Store this as the P@n for q for system x

Determine the mean P@n over all queries for system x

Repeat for system y

Determine whether x is better or worse than y using a statistical measure, such as a two-sided t-test

It’s difficult to choose the right value for n, but a typical choice is n = 10

In practice, the measurements are more complex, but typically favor high precision measures
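
A minimal sketch of the P@n comparison, assuming a dictionary of human judgments keyed by (query, document); the runs and judgments below are illustrative placeholders.

```python
# A minimal sketch of mean P@10 for one system; compare two systems by
# computing this for each and applying a paired two-sided t-test to the
# per-query P@10 values.
def p_at_n(ranked_docs, query, judgments, n=10):
    top = ranked_docs[:n]
    return sum(judgments.get((query, d), False) for d in top) / n

def mean_p_at_n(run, judgments, n=10):
    """run: dict mapping query -> ranked list of document ids."""
    scores = [p_at_n(docs, q, judgments, n) for q, docs in run.items()]
    return sum(scores) / len(scores)

judgments = {("garden gnome", "d1"): True, ("garden gnome", "d4"): True}
run_x = {"garden gnome": ["d1", "d2", "d3", "d4"]}
run_y = {"garden gnome": ["d9", "d2", "d1", "d7"]}
print(mean_p_at_n(run_x, judgments), mean_p_at_n(run_y, judgments))   # 0.2 0.1
```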

CACHING AND WEB SERVING

Caching


Around 70% of web search queries have been
seen recently


Therefore, web search engines can serve most results from large, distributed caches


In practice, caching is keyed on more than the
query


Market, preferences, personalization, …
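
A minimal sketch of a result cache whose key includes the market (and optionally a personalization token) as well as the query text; the LRU policy and capacity are illustrative choices, not a description of any particular engine.

```python
# A minimal sketch of a keyed result cache with least-recently-used eviction.
from collections import OrderedDict

class ResultCache:
    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self._cache = OrderedDict()   # (query, market, personalization) -> results

    def get(self, query, market, personalization=None):
        key = (query, market, personalization)
        if key in self._cache:
            self._cache.move_to_end(key)      # mark as recently used
            return self._cache[key]
        return None                           # miss: fall through to the index

    def put(self, query, market, results, personalization=None):
        self._cache[(query, market, personalization)] = results
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)   # evict the least recently used entry

cache = ResultCache()
cache.put("garden gnome", "US", ["doc24", "doc17", "doc52"])
print(cache.get("garden gnome", "US"))   # ['doc24', 'doc17', 'doc52']
print(cache.get("garden gnome", "UK"))   # None: different market, different key
```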

Web Serving


The web servers host the web resources


HTML, CSS, JavaScript, and interpreter


The web servers often host many simple services


The web servers open, manage, and close
connections to the search engine


The web servers also manage the issues of being
connected to the Internet:


Denial of service attack prevention


Load balancing


They’re also a convenient place to do logging



EBAY

CHALLENGES IN SEARCH

Challenges at eBay


eBay manages:


Over 90 million active users worldwide


Over 200 million items for sale in 50,000 categories


Over 8 billion URL requests per day


Over 10,000 queries per second at peak


… in a dynamic environment


Hundreds of new features per quarter


Roughly 10% of items are listed or ended every day


… worldwide!


In 39 countries and 10 languages


24x7x365


More than 70 billion read / write operations per day


eBay Search: differences



The first major real-time search engine:

Dynamic collection

New documents

Time is important in relevance

Low-latency requirement for publication-to-index

Index updates

Changes in documents, new documents, and document deletions

eBay Search: differences…


Ranking challenges


Different signals:


Temporal relevance


Rapidly changing signals


Difficult to maintain accurate statistics


New terms and phrases


Missing signals:


Anchor text, link graph, page rank, …


One major query type


Auction cycle makes tuning harder


Systems challenges


Cache hit ratio; results need to be fresh



Q&A

Pssst…. eBay is hiring! Mail me if you’re interested, hugh.williams@ebay.com

REFERENCE MATERIAL

Great Books!


Manning, Raghavan, and Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (free online)

Croft, Metzler, and Strohman, Search Engines: Information Retrieval in Practice, 2010.

Witten, Moffat, and Bell, Managing Gigabytes, Morgan Kaufmann, 2nd Edition, 1999.

Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.



References


Spink and Xu, “Selected results from a large study of Web searching: the Excite study”, Information Research 6(1), October 2000.

Scholer, Williams, Yiannis, and Zobel, “Compression of inverted indexes for fast query evaluation”, in Proc. of the ACM-SIGIR International Conference on Research and Development in Information Retrieval, 2002.

Broder, “A taxonomy of web search”, SIGIR Forum 36(2), 2002.

Jansen and Spink, “How are we searching the World Wide Web? A comparison of nine search engine transaction logs”, Inf. Process. Manage. 42(1), 2006.