Web Search Engines

WHAT ARE WEB SEARCH ENGINES?
“The perfect search engine would understand exactly what you mean and give you back exactly what you want.”
– Larry Page (CEO of Google)
General process of cataloging the web:
1. Crawl the web.
2. Index the web.
3. Deliver search results.
WEB CRAWLING
- The simplest web crawling algorithm is just a queue of URLs that have yet to be visited.
- It also includes a fast method for figuring out whether a URL has been seen before.
- Crawlers initialize their queues with seed URLs.
- Good seed URLs – websites that link to many other high-quality sites.
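The queue-plus-seen-set loop described above can be sketched as follows; `fetch_links` is a hypothetical stand-in for real page fetching and link extraction:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: a queue of URLs yet to be visited, plus a
    'seen' set giving a fast has-this-URL-been-seen-before check."""
    queue = deque(seed_urls)
    seen = set(seed_urls)   # fast membership test
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):  # outgoing links on the page
            if link not in seen:       # skip URLs already queued/visited
                seen.add(link)
                queue.append(link)
    return visited

# Toy link structure standing in for the real web:
web = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"], "c.com": []}
print(crawl(["a.com"], lambda u: web.get(u, [])))  # → ['a.com', 'b.com', 'c.com']
```

The `seen` set is what keeps the crawl from revisiting pages even when many pages link to the same URL.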
WHAT MAKES A GOOD SEED URL?
Poorly chosen seed URLs lead to low-quality search results.
Quality
- Amount of outgoing links.
- Whether a page is considered spam, or associates with spam.
Importance
- Popularity, trustworthiness, etc.
- May depend upon a factor like PageRank.
Potential yield of hosts
- Potential for the discovery of new sites.
- May search for different seed URLs based upon geography, to target specific regions.
OTHER ASPECTS OF WEB CRAWLING
Speed
- Crawling is carried out using large clusters of computers.
- May use a hashing function to divvy up URLs between machines, to split workloads.
Politeness
- Make sure to send only one request at a time to a specific webserver, so it doesn’t overload.
- A “politeness” delay can be put between requests.
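The hash-partitioning idea can be sketched like this; the machine count is an illustrative assumption, and hashing by host (rather than full URL) is one common choice because it also keeps per-host politeness bookkeeping on a single machine:

```python
import hashlib
from urllib.parse import urlparse

NUM_MACHINES = 4  # assumed cluster size

def machine_for(url):
    """Assign a URL to a machine by hashing its host, so every URL from
    the same site lands on the same machine. That machine can then
    enforce the politeness delay for that host locally."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).hexdigest()  # stable hash
    return int(digest, 16) % NUM_MACHINES

# All URLs from one host map to the same machine:
print(machine_for("http://example.com/page1"))
print(machine_for("http://example.com/page2"))  # same machine as page1
```

Using a stable hash (here MD5) rather than a per-process one matters in practice, since every machine in the cluster must agree on the assignment.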
OTHER ASPECTS OF WEB CRAWLING (CONTINUED)
Excluded Content
- Before requesting pages from a site, the crawler accesses robots.txt, if it exists.
- robots.txt – specifies which files and directories can be crawled.
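A hypothetical robots.txt illustrating the directives (the paths are made-up examples):

```text
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /public/
```

A polite crawler fetches this file first and skips any URL matching a `Disallow` rule for its user agent.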
Spam Rejection
- Examples: white text on a white background, zero-point fonts, meta tags, etc.
- These tricks are ineffective now that page ranks depend heavily on link information.
- Crawlers can put sites determined to be spam into blacklists and reject URLs of pages associated with blacklisted sites.
OPTIMIZING YOUR SITE FOR CRAWLING
- Have actual content/text.
- Use normal HTML links (not JavaScript events/links).
- Submit your site to various search engines, so it is found and crawled faster.
- Have a proper robots.txt file.
- Create a sitemap – an XML document that lists all the pages that make up your site.
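A minimal sitemap might look like the following; the example.com URLs and dates are placeholders, while the namespace comes from the sitemaps.org protocol:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2012-01-01</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>
```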
PAGERANK ALGORITHM
- A proprietary link analysis algorithm, created and used by Google.
- Based on the web graph – a directed graph whose nodes are sites and whose edges are links.
- Assigns numerical weights to each element of a hyperlinked (connected) set of documents.
- Measures elements’ relative importance within the set.
- PageRank is defined recursively: a page’s rank depends on the values PR(v)/L(v) of the pages v that link to it.
PAGERANK ALGORITHM (CONTINUED)
PageRank is a probability distribution:
- Represents the likelihood that a person randomly clicking on links will end up on any particular page.
- Expressed as a value between 0 and 1. A PageRank of 0.5 means a 50% chance that randomly clicking will lead to that page.

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where u = the page, B_u = the set of all pages linking to u, and L(v) = the number of outbound links from page v.

PageRank transmitted by an outbound link = document’s PageRank / number of outbound links.
PAGERANK EXAMPLE
If our example universe had only pages A, B, C, and D (links to self and multiple links between two pages being ignored):
- Initial PR value of each page = 0.25 (25%).
- If the only links were from B, C, and D to A:
  PR(A) after the next PR iteration = PR(B)/1 + PR(C)/1 + PR(D)/1 = 0.75
If the only links were instead:
- B → C, A
- D → A, B, C
- C → A
Upon the next iteration:
- B’s current PageRank would be divided by 2 (since there are two outgoing links from B) and transmitted to A’s and C’s PageRank values.
- D’s current PageRank would be divided by 3 and transmitted to A, B, and C.
- C’s current PageRank would be divided by 1, since it only links to A.
To calculate A’s PageRank: PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
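This iteration can be checked numerically. A minimal sketch, applying PR(u) = Σ PR(v)/L(v) once to the link structure above, starting from the uniform 0.25 values (no damping factor yet):

```python
# Links in the example: B -> {C, A}, D -> {A, B, C}, C -> {A}
links = {"B": ["C", "A"], "D": ["A", "B", "C"], "C": ["A"], "A": []}
pr = {page: 0.25 for page in "ABCD"}  # initial uniform PageRank

# One iteration: each page v passes PR(v)/L(v) along each outbound link.
new_pr = {page: 0.0 for page in pr}
for v, outs in links.items():
    for u in outs:
        new_pr[u] += pr[v] / len(outs)

print(new_pr["A"])  # PR(B)/2 + PR(C)/1 + PR(D)/3 = 0.125 + 0.25 + 0.0833... ≈ 0.458
```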
PAGERANK EXAMPLE CONTINUED: DAMPING FACTOR
The damping factor d is the probability that, at any step, the random surfer will keep clicking links (so 1 − d is the probability they stop and jump to a random page).
- Generally set to about 0.85.
Adding it to the equation:

PR(u) = (1 − d)/N + d · Σ_{v ∈ B_u} PR(v) / L(v)

where d = damping factor and N = number of documents.
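Putting the damping factor into the iteration gives the usual power-method sketch. This version ignores the dangling-node correction a production implementation would need, and the tolerance and iteration cap are illustrative choices:

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """Iterate PR(u) = (1 - d)/N + d * sum(PR(v)/L(v) for v in B_u)
    until the ranks stop changing (within tol)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: (1 - d) / n for p in pages}  # random-jump term
        for v, outs in links.items():
            for u in outs:
                new[u] += d * pr[v] / len(outs)  # link term
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr

# The four-page example from the previous slide:
ranks = pagerank({"B": ["C", "A"], "D": ["A", "B", "C"], "C": ["A"], "A": []})
print(ranks)  # A ends up with the highest rank, since every page links to it
```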
PAGERANK ALGORITHM COMPLEXITY
- Difficult to determine, since it is a proprietary algorithm.
- Probably an FP (function problem) with polynomial time complexity.
- The trouble with finding the runtime goes further than the algorithm being private:
  - There are so many webpages that the web’s adjacency matrix is too large to fit on a single computer.
  - Since a search engine like Google uses clusters of computers, the data and workload are spread out.
  - Technically FP, but because of its size, it does not fit on the traditional scale of single-processor time complexity.
  - A distributed computation model would be needed for a more accurate analysis.
HITS ALGORITHM
Hyperlink-Induced Topic Search algorithm, also known as hubs and authorities. Another link-based analysis algorithm that ranks webpages. A precursor to PageRank.
Two types of rankings for each document:
- Hub value – higher if the page points to many pages with high authority weights.
- Authority value – higher if the page is pointed to by pages with high hub weights.
HITS ALGORITHM CONTINUED
Example: a three-node graph with links 1 → 3 and 2 → 3.
The adjacency matrix of the graph is

A = | 0 0 1 |        A^T = | 0 0 0 |
    | 0 0 1 |    and       | 0 0 0 |
    | 0 0 0 |              | 1 1 0 |

Assume the initial hub weight vector is u = (1, 1, 1)^T.
Compute the authority weight vector: v = A^T u = (0, 0, 2)^T.
Then the updated hub weight is: u = A v = (2, 2, 0)^T.
This would mean node 3 is the most authoritative!
Nodes 1 and 2 are equally important hubs.
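These two matrix-vector products can be reproduced directly; a minimal sketch of one HITS update step (without the normalization a full implementation would apply between steps):

```python
# Adjacency matrix for edges 1 -> 3 and 2 -> 3 (rows = source, cols = target).
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]

def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

u = [1, 1, 1]                # initial hub weights
v = matvec(transpose(A), u)  # authority update: v = A^T u -> [0, 0, 2]
u = matvec(A, v)             # hub update:       u = A v   -> [2, 2, 0]
print(v, u)
```

Node 3 gets all the authority weight (everything points to it), while nodes 1 and 2 tie as hubs, matching the hand computation above.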
DIFFERENCES IN HITS FROM PAGERANK
- For HITS, the hub and authority scores resulting from the link analysis are influenced by the search terms.
- Executed at query time, not at indexing time.
- Computes two scores for each document, not a single score like PageRank.
GOOGLE PANDA
- A Google algorithm update.
- Aims to lower the rankings of low-quality sites and return higher-quality sites near the top of search results.
- Older ranking factors, like PageRank, were downgraded in importance.
- Human quality testers rated thousands of websites; the algorithm is a learning AI that looks for similarities between the websites human testers deemed high quality and low quality.
GOOGLE PENGUIN
- Another Google algorithm update.
- Decreases rankings of websites that violate Google’s Webmaster Guidelines.
Basic principles from Google’s Webmaster Guidelines:
- Make pages primarily for users, not for search engines.
- Don’t deceive users.
- Avoid tricks intended to improve search engine rankings.
IMPORTANCE OF OPTIMIZATION AND SPACE COMPLEXITY
- The World Wide Web is BIG (duh): over 500 terabytes of data, far exceeding the size of the average graph problem.
- The smallest optimization of the algorithm multiplies, because that same algorithm is being run thousands of times over thousands of computers.
- Every space-saving change makes it cost-efficient for an ever-growing web.