Searching the Web

photofitterInternet και Εφαρμογές Web

4 Δεκ 2013 (πριν από 3 χρόνια και 10 μήνες)

80 εμφανίσεις

Searching the Web

Mark Levene

(Follow the links to learn more!)


Mechanics of a Typical Search

Query submitted to Google

Mechanics of a Typical Search

Google results for the query

Search Engines as Information
Gatekeepers


Search engines are becoming the primary
entry point for discovering web pages.


Ranking of web pages influences which
pages users will view.


Exclusion of a site from search engines
will cut off the site from its intended
audience.


The privacy policy of a search engine is
important.


Search Engine Wars


The battle for domination of the web
search space is heating up!


The competition is good news for users!


The way in which advertising is combined
with search results is crucial!


There are serious implications if one of the
search engines will manage to dominate
the space!



Google


Verb “google” has become synonymous
with searching for information on the web.


Has raised the bar on search quality,


Has been the most popular search engine
in the last few years.


Had a very successful IPO in August
2004.


Is innovative and dynamic.

Yahoo!


Synonymous with the dot
-
com boom,
probably the best known brand on the
web.


Started off as a web directory service.


Has very strong advertising and e
-
commerce partnerships.


Acquired leading search engine
technology in 2003.

MSN Search


Synonymous with PC software.


Remember its victory in the browser wars
with Netscape.


Developed its own search engine
technology only recently, officially
launched in Feb. 2005.


May link web search into its next version
of Windows.


Others


Ask Jeeves


Specialises in natural language question
answering.


Search driven by
Teoma
.


Looksmart


Has its own directory service.


Search driven by
Wisenut
.




Statistics from search engine
logs

Statistic

(Year)

AltaVista

(1998)

AlltheWeb

(2002)

Excite

(2001)

average terms per
query

2.35

2.30

2.60

average queries per
session

2.02

2.80

2.30

average result
pages viewed

1.39

1.55

1.70

usage of advanced
search features

20.4%

1.0%

10.0%

Experiment with search engine
query syntax


Default is AND, e.g. “computer chess” normally
interpreted as “computer AND chess”, i.e. both
keywords must be present in all hits.


“+chess” in a query means the user insists that
“chess” be present in all hits.


“computer OR chess” means either keywords
must be present in all hits.


“”computer chess”” means that the phrase
“computer chess” must be present in all hits.

The most popular search
keywords

AltaVista
(1998)

AlltheWeb
(2002)

Excite

(2001)

sex

free

free

applet

sex

sex

porno

download

pictures

mp3

software

new

chat

uk

nude

Search Engine Architecture

Crawler Algorithm



A crawler is a program that traverses web pages, downloads them
for indexing and follows (or harvests) the hyperlinks on the
downloaded pages.


A crawler will typically start from a multitude of web pages and aims
to cover as much of the indexable web as possible.


Standard algorithm used breadth
-
first strategy.


Focused crawlers use best
-
first strategy.


Search Index
-

Inverted File


Also store position of word in web page and info.
on HTML structure.

The query engine


The interface between the search index,
the user and the web.


Algorithmic details of commercial search
engines kept as trade secrets.



First step is retrieval of potential results
from the index.


Second step is the ranking of the results
based on their “relevance” to the query.

Vector Space Model


Content Relevance

Term Frequency (TF)


Count number of
occurrences of each term.


Bag of words approach


Ignore stopwords such as
is
,

a
,

of
,

the
,



Stemming
-

computer

is
replaced by
comput
, as are
its variants:
computers
,
computing

computation
,
computer

and
computed
.


Normalise TF by dividing by
doc length, byte size of doc
or max num of occurrences
of a word in the bag.


chess

computer

programming

chess

game

chess

game

is a

Inverse Document Frequency (IDF)

i
n
N
log

N

is number of documents in the corpus.


ni
is number of docs in which word
i
appears.


Log dampens the effect of IDF.


IDF is also number of bits to represent the term.

Ranking with TF
-
IDF





q
i
j
i
j
i
j
i
j
i
w
score
IDF
TF
w
,
,
,

j


refers to document
j


i


refers to word (or term)
i
in doc
j


q


is the query which is a sequence of terms


scorej

-

is the score for document
j

given
q


Rank results according to the scoring function.




Content Relevance


Phrase matching.


Synonyms.


URL analysis.


Date last updated.


Spell checking.


Home page detection.

Link Text (Anchor Text)


Include link text for a link pointing to a web
page, say
P
, as part of the content of
P


Link text is very useful in finding home
pages.


Link text behaves like user queries


They act as short summaries


They often match query terms


HTML Weighting

Class Name

HTML

tags

1) Plain Text

None of the above

2) Strong

STRONG, B, EM, I, U

3) List

DL, OL, UL

4) Header

H1, H2, H3, H4, H5, H6

5) Anchor

A

6) Title

TITLE


Normal retrieval = (111101)

ranking with TF
-
IDF


(181882)


39.6% improvement.


(181782)


48.3% improvement


C2, C4 and C5.


(181582)
-

43.5% improvement


Meta tag text is mostly ignored by search engines



Factor in Link Metrics

j
i
j
i
j
i
PR
IDF
TF
w
,
,


Multilply by PageRank of document (web page).


We do not know exactly how Google factors in
the PR, it may be that log(PR) is used.

Popularity Based Metrics


Factor in users’ opinions as represented in
the query logs.


Document space modification adjusts the
weights of keywords in popular pages.


Clickthrough data can also be taken into
account to improve the ranking of search
engine query results.


Precision and Recall


Precision is Overlap/Retrieved (first results page
retrieved is most important).


Recall is Overlap/Relevant (for web search recall is
related to index coverage).

Typical Recall
-
Precision Curve


Top
-
n precision


proportion of relevant for top n ranked
results.


Measure top
-
n precision at fixed recall point for n being
0% to 100% of the ranked results.

Probabilistic IR


Basic question
: What is the probability that
a document,
D
, is
relevant

to a query
Q
?


Probability ranking assumption
: If docs
retrieved are ordered by decreasing
probability of relevance then the overall
effectiveness of the system is the best
obtainable given the input documents.

Bayes Formulation of Relevance

)
(
)
(
)
|
(
)
|
(
D
P
R
P
R
D
P
D
R
P


R



relevance of
D
with respect to a query
Q


D



document (web page)


P(R|D)



probability that a page is relevant given
its
description
(or
representation
)


NR


D not
relevant with respect to
Q

Naïve Bayes Independence
Assumption







n
i
i
n
i
i
NR
w
P
NR
D
P
R
w
P
R
D
P
1
1
)
|
(
)
|
(
)
|
(
)
|
(

n


the number of words in
D


wi



the word in position
i
in
D


Also assume that the probability of a word is
independent of its position in the document.

Computing the probabilities

c
NRW
nr
NR
w
P
c
RW
r
R
w
P
i
i
i
i






1
)
|
(
1
)
|
(

ri


number of times
wi
occurs in relevant docs


RW
-

number of words in relevant docs

(counting
duplicate words multiple times, since docs are
bags
)


nri


number of times
wi
occurs in relevant docs


NRW


number of words in non relevant docs


P(R)
is the number of relevant docs with respect to
Q.


P(NR)
is the number of docs which are not relevant with
respect to
Q.


c


is a smoothing constant greater or equal to one

Is a Document Relevant?






n
i
n
i
i
i
NR
w
P
NR
P
R
w
P
R
P
1
1
)
|
(
)
(
)
|
(
)
(

Assume we have a set of
training examples

of
relevant and non
-
relevant documents to
compute
P(wi|R) and P(wi|NR)

for words
wi.


The user could mark docs as
R
or
NR.


Choose the class (
R
or
NR
) which has higher
probability.

Ranking Documents


D1
is more relevant than
D2
given
Q,
where


nD1
-

the number of words in
D1


nD2


the number of words in
D2


If we are ranking, then
P(wi|R)
can be
approximated on the basis of a document with a
weighting function such as TF
-
IDF.








1
1
2
1
)
|
(
)
|
(
nD
i
nD
i
i
i
R
w
P
R
w
P
Other types of Search Engine


Directory


e.g.
Yahoo!
(
Open Directory
)


MetaSearch


e.g.
Dogpile

(
Mamma
)


Clustering


e.g.
Clusty


Question Answering


e.g.
Ask Jeeves

WolframAlpha
,
True Knowledge
, & Google,
Yahoo and Bing


Visual


e.g.
Quintura


Collaborative


e.g,
Omgili,

Sproose
,
AfterVote


Human Input


e.g.
ChaCha


Social Tagging


e.g.
Blekko
,
MrTaggy

Directions in Search


Mobile search


e.g.
Google Mobile


Local search


e.g.
Google Local


Video search


e.g.
YouTube


Image search


e.g.
Picsearch


Audio search


e.g.
Yahoo Audio


Blog search


e.g.
Technorati


Social bookmarking
-

e.g.
Delicious


Some new ideas
-

e.g.
Cuil

Paid Inclusion and Paid
Placement


Paid inclusion


payment to speed up
inclusion in the search index.


Pay
-
Per
-
Click (PPC) or Cost
-
Per
-
Click
(CPC)


payment for being advertised on
the search engine’s sponsored results list.


The sponsored list should be separated
from the organic list.


PPC is a major revenue source for search
engines.


Click fraud is a problem!

Behavioural Targeting


Contextual targeting is a weaker form
based on a single user session.


Personalised advertising in order to
increase the effectiveness of advertising.


Data collected from individual users,
normally through cookies.


Pay
-
Per
-
Action (PPA)


Charge the advertiser only when an action
takes place such as a purchase, a
download or any other trackable action.


Ad network will require the advertiser to
place a script in the web page triggering
the action.


Some level of trust is needed.