Optimal Crawling Strategies for Web Search Engines
Wolf, Sethuraman, Ozsen
Why Do We Care?
Purpose of the paper.
Proposed solution for optimal crawling
Why Do We Care?
Search engines use crawlers in an automated manner to
build local repositories containing web pages.
These local copies of web pages are used for later
processing, such as creating the index and running ranking.
Because websites are dynamic, web pages are updated.
To maintain fresh copies of these pages, we require an
efficient crawling mechanism.
Purpose of the Paper
This paper provides efficient solutions to:
The optimal crawling frequency problem.
The crawl scheduling problem.
Minimizing the average level of staleness over all web pages.
Minimizing the search engine embarrassment level metric.
Using efficient resource allocation algorithms to
achieve an optimal crawling mechanism.
Solution: Minimize staleness over all web pages
The size of the web is estimated to be 10+ billion pages.
According to the study, around 25-30% of web pages change
To maintain a fresh web page repository, an efficient crawling
algorithm should be used.
The two main aspects of building an efficient crawling algorithm are:
1) Optimal frequency: the number of crawls for each web page over a
fixed period of time, and the ideal crawl times within that period.
2) Efficient scheduling of these crawls.
The update pattern of the web pages must also be handled: some pages are updated
in a quasi-deterministic manner, while others tend to be updated in a Poisson manner.
Solution: Optimal frequency problem
Compute a probability function that captures
whether the search engine has a stale copy of web page i at an
arbitrary time t in the interval [0, T].
From this we can compute the time-average staleness estimate
by averaging this probability function over all t within [0, T].
Find the time intervals that minimize the time-average staleness estimate.
Find the importance of web pages (weights), in order to
organize the possible results of a search query. This can be
explained by the search engine embarrassment metric.
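The averaging step above can be sketched in a few lines. This is only an illustration under stated assumptions: page i is updated by a Poisson process with rate `lam`, and it is crawled `x` times at equal spacing over [0, T]. The function names, the equal-spacing assumption and the weighted sum are hypothetical choices for this sketch, not the paper's exact formulation.

```python
import math

def stale_probability(lam, elapsed):
    """P(local copy is stale) `elapsed` time units after the last crawl,
    assuming Poisson updates with rate lam > 0 (assumption of this sketch)."""
    return 1.0 - math.exp(-lam * elapsed)

def time_average_staleness(lam, T, x):
    """Average of stale_probability over [0, T] with x >= 1 equally spaced crawls."""
    a = lam * T / x  # expected number of updates per crawl interval
    # Closed form of (1/I) * integral_0^I (1 - exp(-lam * t)) dt with I = T / x.
    return 1.0 - (1.0 - math.exp(-a)) / a

def numeric_average(lam, T, x, steps=10_000):
    """Midpoint-rule check of the closed form over one crawl interval."""
    interval = T / x
    h = interval / steps
    total = sum(stale_probability(lam, (j + 0.5) * h) for j in range(steps))
    return total * h / interval

def weighted_staleness(weights, rates, T, crawls):
    """Weighted sum of per-page staleness; weights stand for page importance."""
    return sum(w * time_average_staleness(lam, T, x)
               for w, lam, x in zip(weights, rates, crawls))
```

With `lam = 1`, `T = 1` and a single crawl, the closed form reduces to exp(-1), roughly 0.37; crawling twice in the same period lowers the average.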
Search engine embarrassment level metrics
The frequency with which a client
makes a query and finds that the
resulting page is inconsistent.
Case 1: lucky case, the stale page is not
returned to the user.
Case 2: unlucky case, the stale page is
returned to the user but not clicked by the user.
Case 3: stale page returned and user
clicks the result page to find the
Case 4: the returned page has a
result that is inconsistent with the query.
Solution: Greedy approach for resource allocation
The probability of clicking a page is related to
the position or weight of the web page.
In the quasi-deterministic case of
page updates, crawls should be
done at the potential update times.
To solve the resource allocation
problem and find the
optimal crawl times, the authors
use dynamic programming and greedy algorithms.
To find the optimal time interval
between minimum and maximum
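A greedy allocation of a fixed crawl budget can be sketched as follows. This is a hedged illustration, not the authors' algorithm: it assumes Poisson updates, equally spaced crawls, and that a page that is never crawled counts as always stale. The names `avg_staleness` and `greedy_allocate` are hypothetical. Each step gives one more crawl to the page whose weighted staleness drops the most.

```python
import heapq
import math

def avg_staleness(lam, T, x):
    """Time-average staleness with x equally spaced crawls (Poisson assumption).
    x == 0 is treated as 'always stale', an assumption of this sketch."""
    if x == 0:
        return 1.0
    a = lam * T / x
    return 1.0 - (1.0 - math.exp(-a)) / a

def greedy_allocate(weights, rates, T, budget):
    """Distribute `budget` crawls over pages by largest marginal gain."""
    crawls = [0] * len(weights)
    heap = []  # entries are (-marginal_gain, page_index); heapq is a min-heap
    for i, (w, lam) in enumerate(zip(weights, rates)):
        gain = w * (avg_staleness(lam, T, 0) - avg_staleness(lam, T, 1))
        heapq.heappush(heap, (-gain, i))
    for _ in range(budget):
        _neg_gain, i = heapq.heappop(heap)
        crawls[i] += 1
        x = crawls[i]
        # Gain from giving page i its next crawl after this one.
        gain = weights[i] * (avg_staleness(rates[i], T, x)
                             - avg_staleness(rates[i], T, x + 1))
        heapq.heappush(heap, (-gain, i))
    return crawls
```

For example, `greedy_allocate([10, 1], [1.0, 1.0], 1.0, 2)` gives both crawls to the heavier page, while two equally weighted pages with the same rate split a budget of four evenly. The greedy rule relies on the staleness curve having diminishing returns in the number of crawls, which holds for the curve assumed here.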
Solution: Optimum scheduling problem
Given the number of crawls needed to obtain fresh copies of a page over a
time period T, the problem is to decide the optimal time
intervals between these crawls.
In most cases, scheduling a crawl a bit early or
a bit late does not affect performance too much.
But for a quasi-deterministic
update process, being late is
acceptable while being early is not useful.
This scheduling problem can be posed and solved as a
transportation problem, a type of network flow problem.
A bipartite network graph with one-sided flow depicts this problem.
Solution: Optimum scheduling problem (continued)
Let C be the total number of crawlers and S the number of
crawl tasks in time T.
Each crawl task node has a supply of 1 unit, and
there is one demand node per time
slot and crawler pair.
The demand nodes are indexed by 1 ≤ l ≤ S and
1 ≤ k ≤ C,
where k identifies the individual crawler and l is
the number of the task.
The solution of this transportation
problem ensures the existence of an
optimal solution with optimal flow
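The bipartite formulation above can be illustrated on a toy instance. This sketch makes several assumptions that are not in the slides: slot l is taken to occur at time l, the cost function and `EARLY_PENALTY` are invented for illustration, and plain brute-force enumeration stands in for a real transportation or network-flow solver, so it is only usable for a handful of tasks.

```python
from itertools import permutations

EARLY_PENALTY = 10.0  # hypothetical: crawling before the update is not useful

def cost(ideal_time, slot_time):
    """Hypothetical cost: lateness is mildly penalized, earliness heavily."""
    if slot_time >= ideal_time:
        return slot_time - ideal_time
    return EARLY_PENALTY * (ideal_time - slot_time)

def schedule(ideal_times, num_crawlers, num_slots):
    """Assign each crawl task (supply 1) to a distinct (slot l, crawler k) node."""
    demand_nodes = [(l, k)
                    for l in range(1, num_slots + 1)
                    for k in range(1, num_crawlers + 1)]
    best_cost, best_plan = float("inf"), None
    # Enumerate every injective map from tasks to demand nodes.
    for nodes in permutations(demand_nodes, len(ideal_times)):
        total = sum(cost(t, float(l)) for t, (l, _k) in zip(ideal_times, nodes))
        if total < best_cost:
            best_cost, best_plan = total, list(nodes)
    return best_cost, best_plan
```

With two crawlers, two slots and ideal crawl times `[1.0, 1.0, 1.0, 2.0]`, only two tasks fit in slot 1, so one of the tasks due at time 1 is pushed to slot 2 and crawled one time unit late rather than moving the last task early.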
Parameterization issue about the update process:
Information about the last crawl time does not tell
anything about other updates occurring since the last crawl.
Crawl time, update pattern and data can be used
together to formulate the statistical properties of the update process.
This information can then be used to build a
probability distribution for the inter-update times for any page.
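One way to see the issue: a crawl reveals only whether a page changed since the previous crawl, not how many times. The sketch below uses a standard textbook estimator for that situation, assuming Poisson updates and a fixed crawl interval; it is not necessarily the parameterization used in the paper, and the function names are hypothetical. It relies on P(change within an interval I) = 1 - exp(-lam * I).

```python
import math

def estimate_update_rate(num_crawls, num_changed, interval):
    """Estimate a Poisson update rate from change / no-change observations.

    num_crawls:  number of crawls performed at a fixed spacing `interval`
    num_changed: how many of those crawls found the page changed
    """
    if num_crawls <= 0 or not 0 <= num_changed <= num_crawls:
        raise ValueError("need 0 <= num_changed <= num_crawls and num_crawls > 0")
    if num_changed == num_crawls:
        return math.inf  # every crawl saw a change; the rate cannot be bounded
    p_changed = num_changed / num_crawls
    # Invert p = 1 - exp(-lam * interval) for lam.
    return -math.log(1.0 - p_changed) / interval

def interupdate_cdf(lam, t):
    """P(next update arrives within time t) under the Poisson assumption."""
    return 1.0 - math.exp(-lam * t)
```

For instance, a page found changed in 5 of 10 daily crawls gets an estimated rate of ln 2 per day, slightly higher than the naive 0.5 because several updates between two crawls are observed as a single change.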
Precisely describes the optimal crawling
process to reduce the staleness of web pages.
Provides a good introduction to search
engine embarrassment metrics.
Provides schemes for optimizing the number of
crawls for a dynamic page using dynamic programming.
Gives us a clear idea about the optimal
The research data is quite outdated, and a lot of
advances have been made since then.
No strategy has been proposed for
handling content replication.
The introduction of blogs, forums and social
networking sites has changed the way we
calculate weights for the pages.
The crawling process can improve the quality of
services provided by a search engine.
The optimal crawling process and the scheduling
algorithm play a vital role in determining the
quality and freshness of web pages.
The overall objective is to reduce the search engine
embarrassment metric and to provide the best
possible search results.
Further Research:
Event-driven web page crawlers, to be able to fetch Ajax
Adaptive model-based crawling strategies; fixed-order vs
random-order crawling.
Implementing ranking-based crawling strategies.
Formulating crawling strategies that take page replication
into account, to reduce the crawling task to some extent.
Current Trends:
Building adaptive model-based web crawlers.
Using separate crawling strategies for finding fresh pages
and for deep crawls (e.g. Google's organic crawl):
a fresh bot to fetch fresh pages, and a deep crawl bot to index
all the web pages.
Duplicate-content-aware crawling to
reduce the crawling load.
Current Trends (continued)
URL ordering and queuing based on priority.
Context-focused crawling for better results.
Distributed crawling and multi-threaded crawlers.
Crawling and real-time web search.
J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, L. Ozsen. Optimal crawling
strategies for web search engines. Proceedings of the 11th International Conference
on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA.
Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An adaptive model for
optimizing performance of an incremental web crawler". In Proceedings of the Tenth
Conference on World Wide Web (Hong Kong: Elsevier Science): 106
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused
crawling using context graphs. In Proceedings of 26th International Conference on
Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.
Pant, Gautam; Srinivasan, Padmini; Menczer, Filippo (2004). "Crawling the Web". in
Levene, Mark; Poulovassilis, Alexandra. Web Dynamics: Adapting to Change in
Content, Size, Topology and Use. Springer. pp. 153-178. ISBN 9783540406761.
Articles from Search Engine Journal, Search Engine Roundtable, Wikipedia ...
Q & A