WEB SEARCH ENGINES


WHAT ARE WEB SEARCH ENGINES?


“The perfect search engine would understand exactly what you mean and give you back exactly what you want.”

Larry Page (CEO of Google)



General process of cataloging the web:

1. Crawl the web.
2. Index the web.
3. Deliver search results.


WEB CRAWLING

The simplest web crawling algorithm is just a queue of URLs that have yet to be visited, plus a fast method for figuring out whether a URL has been seen before (a sketch follows below).

Crawlers initialize the queue with seed URLs.

Good seed URLs: websites that link to many other high-quality sites.
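A minimal sketch of this idea in Python, assuming a placeholder fetch_links(url) function that returns the URLs linked from a page (a real crawler would download and parse HTML, respect robots.txt, and handle errors):

    from collections import deque

    def crawl(seed_urls, fetch_links, max_pages=1000):
        """Breadth-first crawl: a queue of unvisited URLs plus a 'seen' set."""
        frontier = deque(seed_urls)      # URLs that have yet to be visited
        seen = set(seed_urls)            # fast check for URLs seen before
        visited = []
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            visited.append(url)              # a real crawler would fetch and index here
            for link in fetch_links(url):    # placeholder: URLs linked from the page
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return visited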

WHAT MAKES A GOOD SEED URL?


Poorly chosen seed URLs lead to low-quality search results.

Quality
Amount of outgoing links.
Whether the page is considered spam, or associates with spam.

Importance
Popularity, trustworthiness, etc.
May depend upon a factor like PageRank.

Potential yield of hosts
Potential for the discovery of new sites.
May search for different seed URLs based upon geography, to target specific regions.

OTHER ASPECTS OF WEB CRAWLING


Speed
Crawling is carried out using a large cluster of computers.
A hashing function may be used to divvy up URLs between machines and split the workload.

Politeness
Send only one request at a time to a given web server, so it doesn’t get overloaded.
A “politeness” delay can be put between requests (see the sketch below).
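A rough sketch of both ideas in Python; the cluster size, the one-second delay, and the hash-based dispatch are illustrative assumptions, not how any particular engine does it:

    import hashlib
    import time
    from urllib.parse import urlparse

    NUM_MACHINES = 8          # assumed size of the crawling cluster
    POLITENESS_DELAY = 1.0    # assumed seconds to wait between requests to one host
    last_request = {}         # host -> timestamp of the last request sent to it

    def machine_for(url):
        """Hash the URL's host so all URLs on one host go to the same machine."""
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode()).hexdigest()
        return int(digest, 16) % NUM_MACHINES

    def polite_wait(url):
        """Sleep if this host was contacted less than POLITENESS_DELAY seconds ago."""
        host = urlparse(url).netloc
        elapsed = time.time() - last_request.get(host, 0.0)
        if elapsed < POLITENESS_DELAY:
            time.sleep(POLITENESS_DELAY - elapsed)
        last_request[host] = time.time()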

OTHER ASPECTS OF WEB CRAWLING


Excluded Content
Before requesting pages from a site, the crawler reads the site’s robots.txt file, if it exists.
robots.txt specifies which files and directories may be crawled (see the sketch below).
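Python’s standard library ships a robots.txt parser; a small sketch, where the example.com URLs and the "MyCrawler" user agent are placeholders:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # placeholder site
    rp.read()                                      # fetch and parse robots.txt

    if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")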


Spam Rejection
Examples: white text on a white background, zero-point fonts, keyword-stuffed meta tags, etc.
These tricks are largely ineffective now that page rankings depend heavily on link information.
Crawlers can put sites determined to be spam into blacklists and reject URLs of pages associated with blacklisted sites.

OPTIMIZING YOUR SITE FOR CRAWLING


Have actual content/text.

Use normal HTML links (not JavaScript events/links).

Submit your site to various search engines, so it is found and crawled faster.

Have a proper robots.txt file.

Create a sitemap: an XML document that lists all the pages that make up your site (a small example follows).
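For illustration, a sitemap in the standard sitemaps.org format can be generated with Python’s standard library; the two URLs are placeholders, and a real sitemap would list every page of the site:

    import xml.etree.ElementTree as ET

    pages = ["https://example.com/", "https://example.com/about.html"]  # placeholders

    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page       # one <loc> entry per page

    ET.ElementTree(urlset).write("sitemap.xml",
                                 encoding="utf-8", xml_declaration=True)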


PAGERANK ALGORITHM


A proprietary link analysis algorithm, created and used by Google.

The algorithm is based on the web graph: a directed graph whose nodes are pages and whose edges are links.

Assigns numerical weights to each element of a hyperlinked (connected) set of documents.

Measures each element’s relative importance within the set.

PageRank is defined recursively (the formula is given below).

PAGERANK ALGORITHM


PageRank is a probability distribution: it represents the likelihood that a person randomly clicking on links will end up on any particular page.

Expressed as a value between 0 and 1.

A PageRank of 0.5 means a 50% chance that randomly clicking will lead to that page.

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where:
u = the page being ranked
v = each page linking to u
L(v) = the number of outbound links from page v
B_u = the set of all pages linking to u

The PageRank transmitted by an outbound link = the document’s PageRank divided by its number of outbound links.

PAGERANK EXAMPLE


If our example universe only had pages A, B, C, and D (links to self and multiple links between two pages being ignored):

Initial PR value of each page = 0.25 (25%).


If the only links on B, C, and D all pointed to A:

PR(A) after the next PR iteration = PR(B)/1 + PR(C)/1 + PR(D)/1 = 0.25 + 0.25 + 0.25 = 0.75



PAGERANK EXAMPLE CONTINUED

If the only links were:
B → C and A
D → A, B, and C
C → A

Upon the next iteration:

B’s current PageRank would be divided by 2 (since there are two outgoing links from B) and transmitted to A’s and C’s PageRank values.

D’s current PageRank would be divided by 3 and transmitted to A, B, and C.

C’s current PageRank would be divided by 1, since it only links to A.

So A’s PageRank is calculated as PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 (see the check below).
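A quick arithmetic check of that single update in Python, using the link structure above and the initial value of 0.25 for every page (this is just the one-step update for A, not the full algorithm):

    # initial PageRank values
    pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
    # out-links of each page in the example (A has none)
    links = {"B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

    # PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
    pr_a = sum(pr[v] / len(targets)
               for v, targets in links.items() if "A" in targets)
    print(pr_a)   # 0.25/2 + 0.25/1 + 0.25/3 ≈ 0.458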

DAMPING FACTOR


The damping factor d is the probability that, at each step, the random surfer keeps clicking links; (1 - d) is the probability that the surfer stops and jumps to a random page.

Generally set to about 0.85.

In the equation:

PR(u) = (1 - d)/N + d · Σ_{v ∈ B_u} PR(v) / L(v)

where d = damping factor and N = number of documents.
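A minimal iterative sketch of this damped update in Python, under the simplifying assumption that pages with no outgoing links simply contribute nothing; production implementations handle that dangling mass and convergence checks more carefully:

    def pagerank(links, d=0.85, iterations=20):
        """links: dict mapping each page to the list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        n = len(pages)
        pr = {p: 1.0 / n for p in pages}          # start from a uniform distribution

        for _ in range(iterations):
            new_pr = {}
            for u in pages:
                incoming = sum(pr[v] / len(links[v])          # PR(v) / L(v)
                               for v in pages if u in links.get(v, []))
                new_pr[u] = (1 - d) / n + d * incoming
            pr = new_pr
        return pr

    # the example graph from the previous slide
    print(pagerank({"B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}))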

PAGERANK ALGORITHM COMPLEXITY


Difficult to determine, since the exact algorithm is proprietary.

Probably an FP problem: polynomial time complexity.

The trouble with finding the runtime goes beyond the algorithm being proprietary:

There are so many webpages that the web’s adjacency matrix is too large to fit on a single computer.

Since a search engine like Google uses clusters of computers, the data and workload are spread out.

Technically FP, but because of its size it does not fit on the traditional scale of single-processor time complexity.

A distributed computation model would need to be used for a more accurate analysis.

HITS ALGORITHM


Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities.

A precursor to PageRank; another link-based analysis algorithm that ranks webpages.

Two types of rankings for each document:

Hub value: higher if the page points to many pages with high authority weights.

Authority value: higher if the page is pointed to by pages with high hub weights.

HITS ALGORITHM CONTINUED

Consider a three-node example graph in which nodes 1 and 2 each link to node 3.

The adjacency matrix of the graph and its transpose are:

A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]

A^T = [[0, 0, 0],
       [0, 0, 0],
       [1, 1, 0]]

Assume the initial hub weight vector is u = [1, 1, 1]^T.

Compute the authority weight vector: v = A^T u = [0, 0, 2]^T.

Then the updated hub weight vector is: u = A v = [2, 2, 0]^T.

This would mean node 3 is the most authoritative! Nodes 1 and 2 are equally important hubs.
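The same update can be reproduced in Python with NumPy; note that a full HITS implementation normalizes the hub and authority vectors after every round, which is omitted here:

    import numpy as np

    # adjacency matrix of the 3-node example: nodes 1 and 2 both link to node 3
    A = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 0]])

    u = np.ones(3)     # initial hub weights [1, 1, 1]
    v = A.T @ u        # authority weights: [0, 0, 2] -> node 3 is most authoritative
    u = A @ v          # updated hub weights: [2, 2, 0] -> nodes 1 and 2 are equal hubs

    print(v, u)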

DIFFERENCES IN HITS FROM PAGERANK


For HITS, the hub and authority scores resulting from the link analysis are influenced by the search terms.

Executed at query time, not at indexing time.

Computes two scores for each document, not a single score like PageRank.

GOOGLE PANDA


A Google algorithm update.

Aims to lower the rankings of low-quality sites and return higher-quality sites near the top of search results.

Older ranking factors, like PageRank, were downgraded in importance.

Human quality testers rated thousands of websites.

The algorithm is a machine-learning system that looks for similarities between the websites human testers deemed high quality and those deemed low quality.

GOOGLE PENGUIN


Another Google algorithm update.

Decreases the rankings of websites that violate Google’s Webmaster Guidelines.

Basic principles from Google’s Webmaster Guidelines:

Make pages primarily for users, not for search engines.

Don’t deceive users.

Avoid tricks intended to improve search engine rankings.

IMPORTANCE OF OPTIMIZATION AND SPACE COMPLEXITY


The World Wide Web is BIG (duh): over 500 terabytes of data, far exceeding the average graph problem.

The smallest optimization of the algorithm is multiplied in effect, because that same algorithm is being run thousands of times over thousands of computers.

Every space-saving change makes the engine more cost-efficient for an ever-growing web.

REFERENCES


http://www.seobythesea.com/2010/05/what-makes-a-good-seed-site-for-search-engine-web-crawls/

http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html

http://en.wikipedia.org/wiki/PageRank

http://searchengineland.com/the-penguin-update-googles-webspam-algorithm-gets-official-name-119623

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769#3

http://www.cis.hut.fi/Opinnot/T-61.6020/2008/pagerank_hits.pdf