Web Search Engines: System Architecture & Techniques

ACCST Research Journal, Vol. VII, No. 2, April 2009




Aarti Singh 1, Dimple Juneja 2, A.K. Sharma 3

1 Tilak Raj Chadha Institute of Management & Technology, Yamuna Nagar
2 M.M. Institute of Computer Technology & Business Management, M.M. University, Mullana
3 Y.M.C.A Institute of Engineering, Faridabad

Email id: 1 singh2208@gmail.com, 2 dimplejunejagupta@gmail.com, 3 ashokkale2@rediffmail.com


Abstract - Today the internet has become an inseparable part of our lives. In today's technology-oriented scenario we depend on the internet for many of our routine tasks. The importance of the internet is apparent to everyone and needs no formal argument. However, the information spread over the WWW would be of little use without web search engines, which deliver valuable information to us within a fraction of a minute. This paper aims to throw light on the working details of web search engines, along with some algorithms used in the searching process.

Keywords: WWW, URL, Web Server, Internet


1. Introduction

The internet plays a very important role in information retrieval. Information retrieval refers to finding relevant information from databases in response to a user's request. Search Engines (SEs) are the most powerful tools for retrieving information from the web. A search engine acts as an interface between the end user and the vast repository of knowledge on the web, and it is almost impossible to utilize the information spread over the web without a search engine in the picture.

Engineering a search engine is a very challenging task. A search engine indexes hundreds of millions of web pages involving a comparable number of distinct keywords, and answers tens of millions of queries every day. In spite of the importance of large-scale search engines on the web, very little academic research has been conducted in this area [1].


1.1 Search Engine System Architecture [1][6][8]

Search Engines (SEs) are very efficient tools for retrieving information about web pages. Before an SE can provide information regarding any topic, keyword or web site, that information must be available to it. So the SE needs to traverse web pages and their associated links in order to copy and index them into a web database. For the purpose of traversing the hundreds of millions of web pages available on the WWW, SEs typically employ a software module called a Web Crawler.

A Web Crawler (WC) is a program which automatically traverses the web by downloading documents and following links from page to page. WCs are also known as robots, worms, spiders, etc. They are mainly used by web search engines to gather data for indexing into the web database. The main decisions associated with a crawler's algorithms are when to visit a site again (to see if a page has changed) and when to visit new sites that have been discovered through links.

The main goals of a crawler are [8]:

1. The index should contain a large number of web objects that are interesting to the search engine's users.

2. Every object in the index should accurately represent a real object on the web.

3. Generate a representation of the objects that captures the most significant aspects of the crawled object using the minimum amount of resources.

These programs are given a starting set of URLs, whose pages they retrieve from the web. The crawlers extract URLs appearing in the retrieved pages, and give this information to the crawler control module. This module determines which links to visit next, and feeds the links to visit back to the crawlers.



The crawlers also pass the retrieved pages into a page repository. Crawlers continue visiting the web until local resources, such as storage, are exhausted.

The indexer module extracts all the words from each page and records the URL where each word occurred. The result is a generally very large lookup table that can provide all the URLs pointing to pages where a given word occurs (the text index in Fig. 1). The table is of course limited to the pages that were covered in the crawling process. The indexing module may also create a structure index, which reflects the links between pages. Such indexes would not be appropriate for traditional text collections that do not contain links.
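As a concrete illustration of the text index, the following is a minimal Python sketch (not taken from the paper) of how an indexer could build such a word-to-URL lookup table; the naive tokenization and the in-memory dictionary are simplifying assumptions.

from collections import defaultdict
import re

def build_text_index(pages):
    """Build a simple inverted index: word -> set of URLs containing it.

    `pages` maps URL -> page text (already fetched by the crawler).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        # Naive tokenization: lowercase alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

# Toy usage with two pages.
pages = {
    "http://example.org/a": "Web crawlers gather pages for indexing",
    "http://example.org/b": "Search engines rank indexed pages",
}
index = build_text_index(pages)
print(sorted(index["pages"]))     # both URLs
print(sorted(index["crawlers"]))  # only page a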




Fig. 1 Architecture of Search Engine


The collection analysis module creates the utility index shown in Fig. 1. The utility indexes may provide access to pages of a given length, pages of a certain importance, or pages with some number of images in them. The collection analysis module may use the text and structure indexes when creating utility indexes.

During a crawling and indexing run, search engines must store the pages they retrieve from the web. The page repository represents this possibly temporary collection.

The query engine module is responsible for receiving and fulfilling search requests from users. The engine relies heavily on the indexes, and sometimes on the page repository. Because of the web's size, and the fact that users typically only enter one or two keywords, result sets are usually very large.

The ranking module therefore has the task of sorting the results so that results near the top are the most likely ones to be what the user is looking for. The query module is of special interest, because traditional information retrieval (IR) techniques have run into selectivity problems when applied without modification to web searching: most traditional techniques rely on measuring the similarity of query texts with texts in a collection's documents. The tiny queries over vast collections that are typical of web search engines prevent such similarity-based approaches from filtering sufficient numbers of irrelevant pages out of search results.

The next section elaborates the working of the web crawler.

2. Web Crawler's Working [6]

The web crawler is the heart of a SE, so it is worth elaborating its working in detail. A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical and automated manner. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.



While crawling the web, the general steps that a crawler performs are as follows (a minimal code sketch of this loop is given after the list):

1) Check for the next page to download - the system keeps track of pages to download in a queue called the frontier.

2) Check whether the page is allowed to be downloaded - this is performed by checking the robots exclusion file and also reading the header of the page to see if any exclusion instructions were provided. Some people do not want their pages archived by search engines.

3) Download the whole page.

4) Extract all links from the page (additional web site and page addresses) and add those to the queue mentioned above, to be downloaded later.

5) Extract all words, save them to a database associated with this page, and save the order of the words so that people can search for phrases, not just keywords.

6) Save the summary of the page and update the last-processed date for the page so that the system knows when it should re-check the page at a later date.
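The following Python sketch is one possible rendering of the loop above; it is illustrative only, and the regex-based link extraction, the MAX_PAGES limit and the use of urllib are assumptions made for brevity, not the implementation described in the paper.

import re
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin

MAX_PAGES = 50  # stop after this many downloads (local resource limit)

def allowed(url):
    """Step 2: consult the site's robots exclusion file."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return True  # if robots.txt is unreachable, assume allowed
    return rp.can_fetch("*", url)

def crawl(seeds):
    frontier = deque(seeds)   # step 1: queue of pages to download
    pages = {}                # step 5: words saved per URL
    visited = set()
    while frontier and len(pages) < MAX_PAGES:
        url = frontier.popleft()
        if url in visited or not allowed(url):
            continue
        visited.add(url)
        try:
            # step 3: download the whole page
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue
        # step 4: extract links and add them to the frontier
        for link in re.findall(r'href="(http[^"]+)"', html):
            frontier.append(link)
        # step 5: extract and store the words in order
        pages[url] = re.findall(r"[a-zA-Z0-9]+", html)
        # step 6: a real crawler would also store a summary and a last-processed date here
    return pages

if __name__ == "__main__":
    crawled = crawl(["http://example.com/"])
    print(len(crawled), "pages crawled")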



Fig. 2 Working of a Web-Crawler (flowchart: get next URL from frontier, contact web server, check whether the robot is allowed, fetch page, process page, index contents, extract summary, extract URLs and update frontier)



2.1 Web Crawling Algorithms

As already mentioned, crawlers are the main actors in the process, crawling the web on a search engine's behalf, so implementing crawling algorithms effectively can definitely improve the performance of search engines. According to their process of searching, crawlers may be classified into two types [3]: general purpose web crawlers and topical (also called focused, preferential or heuristic-based) web crawlers.

General purpose crawlers serve as entry points to web pages and strive for coverage that is as broad as possible. These crawlers are blind and exhaustive in their approach. Topical or focused web crawlers, on the other hand, are specifically designed to retrieve web pages related to a specific topic. A topical crawler tries to follow edges that are expected to lead to portions of the web graph that are relevant to the topic.

There are many algorithms which can be used in web crawlers, such as breadth-first search, best-first search, PageRank, shark search, fish search, etc.



2.1.1 Breadth First Search [2][3]

Breadth-first crawling is the simplest strategy for crawling. This algorithm was explored as early as 1994 in the WebCrawler [Pinkerton 1994] [7] as well as in more recent research [Cho et al. 1998] [5], [Najork and Wiener 2001] [9]. In this algorithm the frontier is implemented as a FIFO queue and the links are crawled in the order in which they are encountered. Crawlers based on this algorithm do not use any domain-specific knowledge, so they act as a baseline for the other crawlers. The algorithm for the breadth-first crawler is as follows [3]:

Breadth-First(starting_urls)
{
  foreach link (starting_urls)
    { enqueue(frontier, link); }
  while (visited < MAX_PAGES)
  {
    link := dequeue_link(frontier);
    doc := fetch(link);
    enqueue(frontier, extract_links(doc));
    if (#frontier > MAX_BUFFER)
      { dequeue_last_links(frontier); }
  }
}

Algorithm for Breadth-First Crawling




Fig. 3 Breadth-First Crawler
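To illustrate how the frontier behaves as a bounded FIFO queue, the short Python sketch below mirrors the pseudocode above; fetch() and extract_links() are stand-in stubs, and MAX_PAGES / MAX_BUFFER are arbitrary limits chosen for illustration.

from collections import deque

MAX_PAGES = 100
MAX_BUFFER = 1000

def fetch(link):
    """Stub: download the page at `link` (a real crawler would issue an HTTP request)."""
    return ""

def extract_links(doc):
    """Stub: parse `doc` and return its outgoing links."""
    return []

def breadth_first(starting_urls):
    frontier = deque(starting_urls)   # FIFO queue: oldest link is crawled first
    visited = 0
    while frontier and visited < MAX_PAGES:
        link = frontier.popleft()     # dequeue_link
        doc = fetch(link)
        visited += 1
        frontier.extend(extract_links(doc))
        # dequeue_last_links: drop the newest links if the frontier overflows
        while len(frontier) > MAX_BUFFER:
            frontier.pop()
    return visited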



2.1.2 Best First Search [2][7]

Best-first crawlers have been studied by Cho et al. 1998 [5] and Hersovici et al. [1998]. In this case, given a frontier of links, the best link according to some estimation criterion is selected for traversal. A crawler using a best-first search strategy is known as a focused crawler. Different best-first strategies of increasing complexity and (potentially) effectiveness can be designed on the basis of increasingly sophisticated link estimation criteria. As a simple example, the link to be explored next may be chosen by computing the similarity between a topic's keywords and the source page of the link; that is, the similarity between the topic's keywords and a page p is used to estimate the relevance of the pages pointed to by p. The URL with the best similarity estimate is then selected for traversal. The pages with minimum similarity scores may be removed from the frontier, so as to avoid exceeding the maximum frontier size. These crawlers use the cosine similarity between the keywords and the pages. The sim() function returns the cosine similarity between topic and page:

sim(q, p) = \frac{\sum_{k \in q \cap p} f_{kq} f_{kp}}{\sqrt{\left(\sum_{k \in q} f_{kq}^2\right)\left(\sum_{k \in p} f_{kp}^2\right)}}
where q is the topic, p is the fetched page, and f_{kd} is the frequency of term k in document d. The algorithm for best-first search is as follows [3]:



BFS(topic, starting_urls)
{
  foreach link (starting_urls)
    { enqueue(frontier, link, 1); }
  while (visited < MAX_PAGES)
  {
    link := dequeue_top_link(frontier);
    doc := fetch(link);
    score := sim(topic, doc);
    enqueue(frontier, extract_links(doc), score);
    if (#frontier > MAX_BUFFER)
      { dequeue_bottom_links(frontier); }
  }
}
















Fig. 4 Best-First Crawler
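The following Python sketch is one hypothetical way to realize the best-first strategy, using a priority queue for the frontier and the cosine similarity defined above with raw term-frequency weights; fetch() and extract_links() are assumed to be supplied by the caller rather than defined here.

import heapq
import math
import re
from collections import Counter

def sim(q, p):
    """Cosine similarity between topic string q and page text p (term-frequency weights)."""
    fq = Counter(re.findall(r"[a-z0-9]+", q.lower()))
    fp = Counter(re.findall(r"[a-z0-9]+", p.lower()))
    num = sum(fq[k] * fp[k] for k in fq.keys() & fp.keys())
    den = math.sqrt(sum(v * v for v in fq.values()) * sum(v * v for v in fp.values()))
    return num / den if den else 0.0

def best_first(topic, starting_urls, fetch, extract_links,
               max_pages=100, max_buffer=1000):
    # Scores are negated because heapq is a min-heap: popping the smallest
    # negated score yields the highest-scoring link (dequeue_top_link).
    frontier = [(-1.0, url) for url in starting_urls]
    heapq.heapify(frontier)
    visited = 0
    while frontier and visited < max_pages:
        neg_score, link = heapq.heappop(frontier)
        doc = fetch(link)
        visited += 1
        score = sim(topic, doc)
        for out_link in extract_links(doc):
            heapq.heappush(frontier, (-score, out_link))
        if len(frontier) > max_buffer:
            # dequeue_bottom_links: keep only the best-scoring entries
            frontier = heapq.nsmallest(max_buffer, frontier)
            heapq.heapify(frontier)
    return visited

A heap keeps both the "pick the best link" and "drop the worst links" operations cheap, which is why it is a natural data structure for the frontier in this strategy.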



2.1.3 PageRank Algorithm [3]

PageRank was proposed by Brin and Page [1998] [10] as a possible model of user surfing behavior. The PageRank of a page represents the probability that a random surfer (one who follows links randomly from page to page) will be on that page at any given time. The score of a page depends recursively upon the scores of the pages that point to it. Source pages distribute their PageRank across all their outlinks. Formally:







PR(p) = (1 - \gamma) + \gamma \sum_{d \in in(p)} \frac{PR(d)}{|out(d)|}



where p is the page being scored, in(p) is the set of pages pointing to p, out(d) is the set of links out of d, and the constant γ < 1 is a damping factor that represents the probability that the random surfer requests another random page. As originally proposed, PageRank was intended to be used in combination with content-based criteria to rank retrieved sets of documents [10]. This is in fact how PageRank is used in the Google search engine. More recently, PageRank has been used to guide crawlers [Cho et al. 1998] [5] and to assess page quality [Henzinger et al. 1999].

The PageRank crawling algorithm is as follows [3]:


PageRank(topic, starting_urls, frequency)
{
  foreach link (starting_urls)
    { enqueue(frontier, link); }
  while (visited < MAX_PAGES)
  {
    if (multiplies(visited, frequency))
      { recompute_scores_PR; }
    link := dequeue_top_link(frontier);
    doc := fetch(link);
    score_sim := sim(topic, doc);
    enqueue(buffered_pages, doc, score_sim);
    if (#buffered_pages >= MAX_BUFFER)
      { dequeue_bottom_links(buffered_pages); }
    merge(frontier, extract_links(doc), score_PR);
    if (#frontier > MAX_BUFFER)
      { dequeue_bottom_links(frontier); }
  }
}
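Separately from the crawling loop, the PageRank scores themselves can be computed iteratively from the formula given earlier. The Python sketch below is a minimal illustration on a toy link graph; the damping factor of 0.85 and the fixed iteration count are assumed values, not prescribed by the paper.

def pagerank(graph, gamma=0.85, iterations=50):
    """Iteratively compute PR(p) = (1 - gamma) + gamma * sum(PR(d) / |out(d)|) over in-links.

    `graph` maps each page to the list of pages it links to (its outlinks).
    """
    pages = list(graph)
    pr = {p: 1.0 for p in pages}  # arbitrary initial scores
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = (pr[d] / len(graph[d]) for d in pages
                        if graph[d] and p in graph[d])
            new_pr[p] = (1 - gamma) + gamma * sum(incoming)
        pr = new_pr
    return pr

# Toy web of four pages: A and B link to each other and to C, C links to D, D links back to A.
web = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["D"],
    "D": ["A"],
}
print(pagerank(web))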


3. Conclusion & Future Scope

In this paper, an effort has been made to throw light on the concept of web search engines and some of the techniques used in web crawlers. Due to the widespread use of the internet, search engines and their optimization have emerged as a challenging area of research, although the commercialization of this area poses hurdles for researchers. Still, this area has large potential for research, and vigorous efforts can yield fruitful results in terms of search engine optimization.


References:

1. Monica Peshave: 'How Search Engines Work and a Web Crawler Application', Dept. of Computer Science, University of Illinois at Springfield, Springfield, IL 62703.

2. Pant, G., Srinivasan, P., and Menczer, F.: 'Crawling the Web'. In Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, pp. 153-178, Springer-Verlag, 2004.

3. Gautam Pant, Padmini Srinivasan and Filippo Menczer: 'Topical Web Crawlers: Evaluating Adaptive Algorithms', ACM Transactions on Internet Technology, vol. 4(4), pp. 378-419, November 2004.

4. Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S.: 'Searching the Web', ACM Transactions on Internet Technology, vol. 1(1), 2001.

5. Cho, J., Garcia-Molina, H., and Page, L.: 'Efficient Crawling Through URL Ordering'. In Proceedings of the 7th World Wide Web Conference, Brisbane, Australia, pp. 161-172, April 1998.

6. 'Implementing an Effective Web Crawler', from http://www.informatics.indiana.edu/fil/papers/TOIT.pdf

7. Pinkerton, B.: 'Finding What People Want: Experiences with the WebCrawler'. In Proceedings of the Second International WWW Conference, 1994.

8. El-Ramly, N. A., Harb, H. M., Amin, M., and Tolba, A. M.: 'More Effective, Efficient, and Scalable Web Crawler System Architecture'. In International Conference on Electrical, Electronics and Computer Engineering (ICEEC'04), pp. 120-123, 5-7 September 2004.

9. Najork, M. and Wiener, J. L.: 'Breadth-First Crawling Yields High-Quality Pages'. In Proceedings of the Tenth Conference on World Wide Web, pp. 114-118, Hong Kong, May 2001, Elsevier Science.

10. Brin, S., Page, L., Motwani, R., and Winograd, T.: 'The PageRank Citation Ranking: Bringing Order to the Web', Tech. Rep. 1999-66, Stanford University, January 1998. Available at http://dbpubs.stanford.edu:8090/pub/1999-66.

11. Manber, U.: 'Finding Similar Files in a Large File System'. In Proceedings of the Winter 1994 USENIX Technical Conference, San Francisco, CA, January 1994.

12. Brin, S., Davis, J., and Garcia-Molina, H.: 'Copy Detection Mechanisms for Digital Documents'. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 398-409, 1995.

13. Aggarwal, C. C., Al-Garawi, F., and Yu, P. S.: 'Design of a Learning Crawler for Topical Resource Discovery', ACM Transactions on Information Systems, vol. 19(3), pp. 286-309, July 2001.