Search Engine Internal Processes - Greg Newby


Search Engine Internal Processes

Greg Newby @ WTH 2005


Talk slides are available online:
http://petascale.org/presentations/wth/wth-search.ppt


Who is this guy?


My academic research is mostly focused on
information retrieval
(
http://petascale.org/vita.html
)


I’ve authored a research retrieval system


I’m co-chair of the Global Grid Forum’s
working group on Grid Information Retrieval
(GIR-WG), working to standardize
distributed search

About this presentation


We’re not focusing on how to get good
ranking in search engines, or how to search
effectively


Instead, we will look at the
scientific and
practical basis for search engines
, from
an “information science” point of view


What unanswered questions and open
possibilities exist for enhanced information
retrieval systems of the future?

Outline: The Science of Search


Basic language; recall and precision


Basic technologies: indexers, spiders and
query processors


Query optimization strategies


Spiders/harvesters


Indexers


Where’s the action in search engines?

Basic language


Information retrieval

is the science, study and
practice of how
humans seek information


information seeking is complex human
behavior, in which some sort of
cognitive
change

is sought


The
nature of information

is similarly complex.
Does it exist apart from a human observer?
Why is one person’s “data” another person’s
“information?” Can we measure the information
content of a message, or is that only for the
telephone engineers (like Shannon & Weaver)?


Information retrieval (IR) systems attempt to
match

information seekers with information

Why IR is hard:


For all their performance, modern search engines are often
unsatisfying
: finding the information you want is
difficult.


(Many people are satisfied, though their results were
poor)


IR systems use
queries

as expressions of
information need. But such expressions are
necessarily inexact:


human language is imprecise


queries are usually short, but might represent complex needs


a person’s history and background will impact which information is useful


A document != information

More on why IR is hard


The language of documents is imprecise.
Documents, or document extracts, or answers,
or ... are what an IR system presents.


long documents have many topics


what is the “meaning” of a document?


what is its information content?


does the document type match the information need
type? I.e., answers for questions ... or, quick basic
information for quick basic information needs …

Core concept:
Relevance


Relevance is the core goal of an IR system


Relevance is
multifaceted
, and includes:


useful


timely


pertinent


accurate


authoritative


etc... as needed for a particular information seeker and her query


For evaluation, we think of relevance as
binary
: yes/no
for a particular document’s relevance to a particular
query


In reality, relevance is a difficult topic, and hard to
measure accurately

Recall and Precision


How good is an IR system?
Precision

is the
best measure for search engines:


The proportion of retrieved documents that are
relevant.


Want perfect precision? Try retrieving just 1
document. If it’s relevant, precision is 100%!


Early high precision is a common approach
of IR systems: present some quality
documents first, in the hopes they satisfy the
information seeker

Recall


Recall is the proportion of relevant documents that
are retrieved.


Not so useful for Web search, where there are
potentially
very many relevant documents


For perfect recall: retrieve all documents, so recall
= 100%!


Recall is appropriate for very complete or
specialized searches or very small collections


Usually, Web search just looks at precision,
especially precision @ some number (p@10)
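
To make these measures concrete, here is a small Python sketch (my own illustration; the toy relevance judgments and result list are invented) that computes precision, recall, and precision@k:

    # Toy evaluation sketch: precision, recall, and precision@k.
    # "relevant" is the set of documents judged relevant for one query;
    # "retrieved" is the ranked list an engine returned. Both are made up.

    def precision(retrieved, relevant):
        """Proportion of retrieved documents that are relevant."""
        if not retrieved:
            return 0.0
        hits = sum(1 for d in retrieved if d in relevant)
        return hits / len(retrieved)

    def recall(retrieved, relevant):
        """Proportion of relevant documents that were retrieved."""
        if not relevant:
            return 0.0
        hits = sum(1 for d in retrieved if d in relevant)
        return hits / len(relevant)

    def precision_at_k(retrieved, relevant, k=10):
        """Precision over only the first k results (p@10 by default)."""
        return precision(retrieved[:k], relevant)

    relevant = {"d2", "d5", "d7", "d9"}
    retrieved = ["d2", "d1", "d5", "d3", "d8"]
    print(precision(retrieved, relevant))        # 0.4
    print(recall(retrieved, relevant))           # 0.5
    print(precision_at_k(retrieved, relevant))   # 0.4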

Anatomy of a Web search engine


Harvesters

(or spiders, or collection
managers): these gather documents and
prepare them for indexing


Indexers
: the core IR system, that
represents a collection for rapid retrieval


Query processors
: front end to the index, to
retrieve documents

Harvesters: to gather input


Lots of challenges:


different document types


duplicate documents; finding authoritative/master sites


different languages


dynamic content


firewalls, passwords


invalid HTML; frames


not overloading harvested sites


dealing with site requests for non-indexing


bandwidth (to sites; to indexer)


harvest schedule; retiring removed or inaccessible documents

Harvesters, continued


Harvesting is complex, and
largely
orthogonal

to the rest of the IR system (i.e.,
the IR system doesn’t really care how
difficult it was to get the documents ... it just
indexes and retrieves them!)


Utilities such as htdig, wget and curl can be
used for basic harvesting, but more complete
harvesting is challenging
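
A basic harvester fits in a few lines of Python. The sketch below is my own illustration using only the standard library; it checks robots.txt (the standard way sites request non-indexing) and pauses between fetches so as not to overload sites:

    # Minimal polite-fetch sketch using only the Python standard library.
    # It checks robots.txt before fetching and sleeps between requests
    # so as not to overload the harvested site.
    import time
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    def polite_fetch(urls, user_agent="demo-harvester", delay=2.0):
        pages = {}
        robots = {}                      # cache one parser per host
        for url in urls:
            host = "{0.scheme}://{0.netloc}".format(urlparse(url))
            if host not in robots:
                rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
                rp.read()
                robots[host] = rp
            if not robots[host].can_fetch(user_agent, url):
                continue                 # site asked not to be indexed
            with urllib.request.urlopen(url) as resp:
                pages[url] = resp.read()
            time.sleep(delay)            # don't hammer the server
        return pages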

Indexers: the core IR system


Similar concepts and practice to
DBMS


Terms and documents are usually assigned
ID #s (for
fixed-length fields
); information
about term frequency and position is kept,
as well as weights for terms in documents.


Better query terms (more unique; better at
distinguishing among documents) get
higher
weights

in documents
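
A stripped-down illustration of such an index in Python, with term/document ID assignment and positional postings (the variable names are mine):

    # Sketch of an inverted index with positional postings.
    # postings: term_id -> {doc_id: [positions]}; term frequency is just
    # the length of the position list. IDs keep the entries fixed-length.
    from collections import defaultdict

    term_ids, doc_ids = {}, {}
    postings = defaultdict(lambda: defaultdict(list))

    def index_document(name, text):
        doc_id = doc_ids.setdefault(name, len(doc_ids))
        for pos, term in enumerate(text.lower().split()):
            term_id = term_ids.setdefault(term, len(term_ids))
            postings[term_id][doc_id].append(pos)

    index_document("doc1", "to be or not to be")
    tid = term_ids["be"]
    print(dict(postings[tid]))   # {0: [1, 5]}: "be" occurs twice in doc 0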

Indexers, continued


Documents are also weighted: better
documents should be ranked more highly.
Google’s
PageRank

is one way of
measuring document quality, based on site
authoritativeness


The challenge for indexers is to
represent

documents quickly and efficiently (input),
but more importantly to enable
rapid
querying

(output)
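
Document quality scores like PageRank are computed separately from term weights. The following is a generic power-iteration sketch on a tiny invented link graph, not Google's actual algorithm or data:

    # Generic power-iteration PageRank on a toy link graph (illustrative
    # only; the real computation is far larger and more refined).
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(links))   # "c" ends up with the highest rank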

IR interaction


A user sends a
query


Conceptually, all documents in the/each
collection are evaluated against the query for
relevance, based on a formula


(In fact, only a small subset need to be ranked.
More in a minute…)


The
top-ranked

documents are presented to
the user. If some of those documents are
relevant, the search engine did a good job!
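
In code, this interaction is essentially "score the candidates, keep the best k". A hedged sketch, where the scoring function is a stand-in for whatever ranking formula the engine uses:

    # Sketch of the interaction loop: score candidates, return the top k.
    # score(query, doc) stands in for whatever ranking formula is used.
    import heapq

    def top_ranked(query, documents, score, k=10):
        # nlargest keeps only the k best-scoring documents
        return heapq.nlargest(k, documents,
                              key=lambda doc: score(query, doc))

    def overlap_score(query, doc):
        # placeholder formula: count of shared terms
        return len(set(query.split()) & set(doc.split()))

    docs = ["grid information retrieval", "web search engines",
            "retrieval of web information"]
    print(top_ranked("information retrieval", docs, overlap_score, k=2))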

Query shortcuts IR systems can take


Table lookup and caching
: if a query
matches a known query, just return the prior
results from a table/database; no need to
run the query. Yes, this can be used to
“hand tune” query results (i.e., human
optimizers)


Algebra shortcuts
: most forms of ranking
only look at the occurrence of query terms.
So, any document without any/all query
terms is automatically considered non-relevant
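
Both shortcuts are easy to sketch: a cache keyed on the query string, and a candidate set built from the union of postings lists so that documents with no query terms are never scored. The function and variable names below are illustrative:

    # Sketch of the two shortcuts: query-result caching and restricting
    # ranking to documents that contain at least one query term.
    query_cache = {}                 # query string -> cached result list

    def search(query, postings, score, k=10):
        if query in query_cache:     # table lookup: skip ranking entirely
            return query_cache[query]
        terms = query.lower().split()
        candidates = set()
        for term in terms:           # union of postings lists
            candidates |= postings.get(term, set())
        ranked = sorted(candidates,
                        key=lambda doc_id: score(terms, doc_id),
                        reverse=True)[:k]
        query_cache[query] = ranked  # remember for next time
        return ranked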

Sample Basic Equations of IR


tf

is the weighted frequency of a term in a
document (“term frequency”) for term
i

in
document
j


idf
is the collection-level weight for term
i
(“inverse document frequency”): terms that
appear in fewer documents get a higher
idf


The weighted relevance score of a query
term
i

in a document
j

is:
tf * idf
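
A toy tf*idf computation in Python, assuming raw counts for tf and a log-based idf (one common choice; real systems vary the details):

    # Toy tf*idf sketch: tf is the raw count of term i in document j,
    # idf is log(N / document frequency). Real systems vary the details.
    import math

    docs = {"d1": "grid information retrieval",
            "d2": "web information retrieval systems",
            "d3": "web search engines"}

    def tf(term, doc_id):
        return docs[doc_id].split().count(term)

    def idf(term):
        df = sum(1 for text in docs.values() if term in text.split())
        return math.log(len(docs) / df) if df else 0.0

    def tf_idf(term, doc_id):
        return tf(term, doc_id) * idf(term)

    print(tf_idf("grid", "d1"))   # "grid" is rarer, so it scores higher
    print(tf_idf("web", "d2"))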


A Relevance Score


A query has
t
terms (i.e., words). To get a
relevance score for an entire document
j
,
we treat the query and document as
vectors,
normalize (divide each vector by its norm)
and compute the
cosine

(which is the dot
product of the normalized vectors)


Cosine ranges from 0 to 1; 0 is orthogonal
(“unrelated”), 1 is a perfect match.
Rank in
descending order and present results

Numerically…


Imagine a two word query


Document d1 has a weighted score of 1 for
term 1, and 2 for term 2:
vector[1,2]


Query terms are weighted
vector[1,3]


We first normalize the document to vector[.45,.89], and the query to vector[.32,.95]

Then get cosine == .99
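
The arithmetic can be checked in a few lines of Python:

    # Verify the two-term example: document [1, 2], query [1, 3].
    import math

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def cosine(a, b):
        return sum(x * y for x, y in zip(normalize(a), normalize(b)))

    print(normalize([1, 2]))      # [0.447..., 0.894...]  ~ [.45, .89]
    print(normalize([1, 3]))      # [0.316..., 0.948...]  ~ [.32, .95]
    print(cosine([1, 2], [1, 3])) # 0.9899...             ~ .99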

Real-world Ranking is more Complex


Sample for “LNU.LTC” weighting after Salton &
Buckley (1998) or

http://www.cs.mu.oz.au/~aht/SMART2Q.html


LNU
is the term-in-document weight:

( log(tf{i,j}) + 1 ) / sqrt( Sigma{k=1..t} ( log(tf{k,j}) + 1 )^2 )


LTC
is the term-in-collection weight:

( 2 x idf{i} ) / sqrt( Sigma{k=1..t} ( 2 x idf{k} )^2 )
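
Read literally, the two expressions above can be coded as follows (a sketch of the slide's formulas only, with natural log where the base is unspecified; the full SMART weighting schemes have further refinements):

    # Sketch of the slide's two weights, taken literally.
    # tfs: term frequencies of the document's terms; idfs: idf values for
    # the collection's query terms. Both example inputs are invented.
    import math

    def lnu_weight(tf_ij, tfs):
        """Term-in-document weight from the slide's LNU expression."""
        numerator = math.log(tf_ij) + 1
        denom = math.sqrt(sum((math.log(tf) + 1) ** 2 for tf in tfs))
        return numerator / denom

    def ltc_weight(idf_i, idfs):
        """Term-in-collection weight from the slide's LTC expression."""
        numerator = 2 * idf_i
        denom = math.sqrt(sum((2 * idf) ** 2 for idf in idfs))
        return numerator / denom

    print(lnu_weight(3, [3, 1, 2]))
    print(ltc_weight(1.1, [1.1, 0.4]))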



There are many variations…


Simple Boolean retrieval


Probabilistic retrieval


Latent semantic indexing or information space (uses
eigensystems to measure term relations, rather than assume all
terms are orthogonal)


Semantic-based representation (i.e., part of speech, document
structure)


Link-based techniques (for HTML: use inlinks and outlinks to
identify document topic & relations)


Many HTML tricks: use META, H1/title, etc.


But the fundamental processes of
weighting and ranking

almost always apply

How can Web search engines be
really
fast
, with huge collections?


Query optimization


Fast building of candidate response set; fast
ranking of results


This is mostly about
engineering
: parallelizing search;
storing data in memory; fast disk structures …


More engineering: handle simultaneous
harvesting/updating, concurrent queries, and subset
queries (i.e., local search or search within particular
sites or domains)
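
One common engineering pattern is to shard the index and fan a query out to the shards in parallel, then merge the partial results. A rough Python sketch using threads; the shard layout, scoring, and names are all illustrative:

    # Sketch of parallel query fan-out over index shards: each shard is
    # searched concurrently and the partial top-k lists are merged.
    import heapq
    from concurrent.futures import ThreadPoolExecutor

    def search_shard(shard, terms, k):
        """shard: dict doc_id -> text; returns up to k (score, doc_id)."""
        scored = [(len(set(terms) & set(text.split())), doc_id)
                  for doc_id, text in shard.items()]
        return heapq.nlargest(k, scored)

    def parallel_search(shards, query, k=10):
        terms = query.lower().split()
        with ThreadPoolExecutor(max_workers=len(shards)) as pool:
            partials = pool.map(lambda s: search_shard(s, terms, k), shards)
        # merge the per-shard top-k lists into one global top-k
        return heapq.nlargest(k, (hit for part in partials for hit in part))

    shards = [{"d1": "web search engines"},
              {"d2": "grid information retrieval"}]
    print(parallel_search(shards, "web retrieval", k=2))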

How can Web search engines be
really
good
?


Be really fast


Have effective measures for weights of
terms, weights of documents, and other
factors:


Site quality/authority; duplicate removal;
currentness …


Query spelling check; good HTML parsing; non-HTML document representation


Have features for better utility & usability

What do Information Scientists know
about IR that could be useful?


A lot about human information seeking


Depth on the topic of relevance


Aspects of documents (i.e., different types) and
queries (i.e., different needs)


Techniques for:


Personalized search (training; standing queries)


Searching streams of documents with standing queries
(filtering)


Multi-lingual search; cross-language IR

For Further Study


The Text Retrieval Conference (TREC) at
http://trec.nist.gov

; also DARPA’s TIDES
program, and several international
conferences/competitions such as CLEF


Citeseer: there’s a lot of literature


Try “information retrieval” in Google


Read Robertson’s book


Go forth and search: better, and more
informed!


There are lots of opportunities to improve
IR performance: it is not a solved problem