Internet Search Engines

By

Anup D. Warnulkar & Indrajeet S. Kamat

E-mail: anup_git@rediffmail.com
        i_kamat@rediffmail.com

Semester: Second

Gogte Institute of Technology

Abstract

The Internet is a vast ocean of information, and both its size and its number of users roughly double every year. There is therefore a need to simplify information retrieval, and this problem is solved by search engines. Search engines send out crawlers, which return links related to the query keywords as hits. The engine analyzes these links and displays results ranked by PageRank. In this paper, we discuss the anatomy and working of search engines. We also present the architecture of Google, the most popular and efficient search engine.

Keywords: Internet, Information retrieval, Search Engine, Crawlers, Hits, PageRank, Google.

Word Count: 3492




INDEX

INTRODUCTION
ANATOMY
GOOGLE ARCHITECTURE
MAJOR DATA STRUCTURES
COMPARISON
CONCLUSION
GLOSSARY
REFERENCES

Introduction

The World-Wide Web is moving rapidly from text-based towards multimedia content, and requires more personalized access. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web search.


Search engines use automated software programs known as spiders or robots to survey the Web and build their databases. These programs retrieve and analyze web documents.

Data collected from each web page are then added to the search engine's index. When you enter a query at a search engine site, your input is checked against the search engine's index of all the web pages it has analyzed. The best URLs are then returned to you as hits, ranked in order with the best results at the top.

Internet search engines are special tools, hosted on websites or as separate websites, designed to help people find information on the World Wide Web.



1. Difference between a Search Engine and a Directory

A directory (such as Yahoo!) stores the name of a site, a relevant category, and a short description of what the site contains. The information is stored as a hierarchy, with divisions represented by separate pages. When the directory is searched, the search is performed on the title and description of the site, not on the contents of the site.

A search engine (such as Google) links all the URLs on the web. Based on the keywords, it sends out crawlers, which return the linked pages containing those keywords as hits. It then ranks all the returned pages and displays the results.


2. Different methods of searching used by a Search Engine

There are differences in the ways various search engines work, but they all perform three basic tasks:

- They search the Internet, or select pieces of the Internet, based on important words.
- They keep an index of the words they find, and where they find them.
- They allow users to look for words or combinations of words found in that index.

Types of Search Engines

There are three basic categories of search engines:

1) Spider- or crawler-based search engines.
2) Directories powered by humans.
3) Combinations or 'hybrids' of spiders and directories.

Spider-based search engines create their listings by using digital 'spiders' that 'crawl' the Web. People sort the spiders' findings and enter the information into the search engine's database, which can then be searched by users.

There are also human-powered search sites, such as Yahoo!. Marketers submit a short Web site description to the directory, or the site's editors may write one for sites they review. User searches are matched against the submitted descriptions, which means that changes to Web pages will not affect listings. Generally, today's search engines present both types of results.

Anatomy & Working of Search Engines

Data is stored in a number of files on the Internet: ASCII files, binary files, or databases. Search engines may vary in the way the data is stored. If the data is stored in a database, the database itself can be used to build the search engine. For HTML files, graphics, and PDFs, the search engine is an additional program.

A search engine that does not hold a given piece of content searches for it elsewhere. This data comes from a program that crawls many pages and reads their contents. Such a program is called a ROBOT or SPIDER. It crawls the URLs specified by the search engine and marks any new URL it finds. Google.com differentiates between pages that have been crawled and those that have not: for crawled pages it displays the page title on the results page; for pages that have not been crawled it displays only the URL.

When users search, they are not actually searching the contents. Instead, they are searching an index of the content the spider has found. In a database-driven site, the user performs a query on the content.


2.1 Simple data queries

When data is stored in a database, simple queries are possible: a middleware program calls the database based on user input. The query looks at a select number of fields in the database. If it finds a match for the input, the database returns the information to the middleware program, which generates a useful HTML display of the content that was found.

The database will be indexed for complex queries, so that the search runs against the index instead of the content. Indexing also helps with noise-word reduction, stemming, and look-up tables for content mapping.
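As an illustration of this flow, here is a minimal sketch in Python, assuming a hypothetical SQLite database site.db with a table pages(title, description, url); the field names and the HTML display are illustrative, not a specific product's API.

    # Minimal middleware-query sketch (hypothetical schema: pages(title, description, url)).
    import sqlite3

    def simple_query(keyword: str) -> str:
        conn = sqlite3.connect("site.db")
        # The query looks only at a select number of fields (title, description).
        rows = conn.execute(
            "SELECT title, url FROM pages "
            "WHERE title LIKE ? OR description LIKE ?",
            (f"%{keyword}%", f"%{keyword}%"),
        ).fetchall()
        conn.close()
        # The middleware turns the matches into a simple HTML display.
        items = "".join(f'<li><a href="{u}">{t}</a></li>' for t, u in rows)
        return f"<ul>{items}</ul>"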


2.2 Complex data queries

Nielsen's summary of searching behavior suggests that if users are not successful with their first search, they will not improve their search results on a second or third query (Nielsen). Since finding the correct piece of information quickly is important, complex queries are appropriate for keyword searching. They allow the user to ask that a series of conditions about their specific query be met.




2.3 Boolean searching

Some search engines allow the user to specify conditions they want met in their search results. Boolean searching allows the user to specify groups of words that should or should not appear, and whether or not the search should be case sensitive. Operators such as AND and OR can be used to refine the search; these terms are the logical expressions of Boolean searching.

Most search engines allow some form of Boolean searching. Boolean syntax includes case-sensitive searching, but some databases store their information in case-insensitive field types.
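To make the logic concrete, here is a minimal sketch of Boolean query evaluation over a toy inverted index; the index contents and the example query are illustrative assumptions.

    # Boolean operators as set operations over an inverted index
    # (word -> set of document IDs containing it).
    index = {
        "search": {1, 2, 3},
        "engine": {2, 3},
        "directory": {3, 4},
    }
    docs = {1, 2, 3, 4}   # the universe of all documents

    def AND(a, b): return a & b            # both terms must appear
    def OR(a, b): return a | b             # either term may appear
    def NOT(a): return docs - a            # the term must not appear

    # Evaluate "search AND engine NOT directory".
    print(AND(index["search"], index["engine"]) & NOT(index["directory"]))
    # -> {2}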



2.4 Pre-processed data

In most search engines, the data that the user searches is not the actual pages of information but a dataset of information about what is contained in the pages. This is called an index. The original content is in a database, and the index is a second dataset. Content indexing creates a document index that contains information about where each word is found on each page. The user performs a search on this index. The results page then translates the information found in the document index back into the information that is on the actual pages.


2.5 Indexing content

Databases are sometimes given indexes to improve performance. A search engine can likewise be made faster by using an index. An index is also where noise words are stripped out of the content.


2.6 Document index

A document index is a special content index. Most search engines use a document index to answer keyword queries. Information about each of the words in the documents allows the search engine's relevancy calculation to return the best results.



2.7 Noise words

To save space and time, search engines strip out certain words when you query the database. Some databases, such as MySQL, have noise-word rules built in. These general rules can be modified, and additional rules can be placed on specific datasets to give the best results. The stripped words are called noise words. Noise words may be stripped out based on a specific list of words or on word length.
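A minimal sketch of noise-word stripping follows; the stop-word list and the length rule are illustrative assumptions, not a fixed standard.

    # Strip noise words by list membership and by word length.
    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}

    def strip_noise(query: str, min_length: int = 2) -> list[str]:
        words = query.lower().split()
        return [w for w in words if w not in STOP_WORDS and len(w) >= min_length]

    print(strip_noise("the anatomy of a search engine"))
    # -> ['anatomy', 'search', 'engine']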



2.8 Content mapping

In search engines where noise words are calculated based on length, it is beneficial to create a look-up table. While doing so, you can map acronyms to full words. This is called content mapping.
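A look-up table of this kind can be as simple as a dictionary; the acronym entries below are illustrative assumptions.

    # Content mapping: expand known acronyms to their full words before indexing.
    CONTENT_MAP = {
        "www": "world wide web",
        "ir": "information retrieval",
        "se": "search engine",
    }

    def expand(words: list[str]) -> list[str]:
        return [CONTENT_MAP.get(w, w) for w in words]

    print(expand(["ir", "on", "the", "www"]))
    # -> ['information retrieval', 'on', 'the', 'world wide web']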


2.9 Stemming

Stemming is a technique that removes suffixes, leaving only the root word in the index. This allows users to search not only for the term they entered but also for its variations. Stemming works well for small datasets, where it increases the chances of matching terms, but it is less suitable for large datasets, which need to isolate information precisely.
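A minimal suffix-stripping sketch is shown below; the suffix list is an illustrative assumption (production engines use full algorithms such as Porter's stemmer).

    # Strip the longest matching suffix, keeping a minimum root length.
    SUFFIXES = ("ing", "ers", "ies", "ed", "er", "es", "s")

    def stem(word: str) -> str:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    print([stem(w) for w in ["searching", "searched", "searches", "searcher"]])
    # -> ['search', 'search', 'search', 'search']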


2.10 Calculating relevancy

Not all search engines calculate relevance. Some may simply return documents based on when they were indexed and how many keywords they contain. For small datasets, this is sufficient.


2.11 Search results display

2.11.1 Number of records displayed

Search results should give relevant information. They can be divided over a number of pages. Nielsen states that 'users almost never look behind the second page of search results' (Nielsen). Some results may be lost through pagination.


2.11.2 Suggesting new spellings

Spelling mistakes sometimes occur. To provide a better chance of a relevant search, the engine can suggest alternate spellings. Synonym lists present the user with an alternative word, and a spellchecker can provide a list of alternate spellings of a word.
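As one possible sketch, the Python standard library's difflib can serve as a simple alternate-spelling suggester; the dictionary here is an illustrative assumption, not how any particular engine implements this.

    # Suggest dictionary words whose spelling is close to the query term.
    from difflib import get_close_matches

    DICTIONARY = ["search", "engine", "crawler", "index", "query"]

    def suggest(word: str) -> list[str]:
        return get_close_matches(word, DICTIONARY, n=3, cutoff=0.7)

    print(suggest("serch"))
    # -> ['search']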


2.11.3 Hit highlighting

On the search results page, the words you searched for are sometimes highlighted in some way, usually in bold. This is called hit highlighting.


2.11.4 Returning results for each successful query

Often a user will search for multiple keywords, but each page should be returned only once. Rather than displaying a page once for each successful match, the search engine must know whether a page has already been tagged as containing a result; if it has, it should not be re-tagged.




2.12 Crawlers

2.12.1 Indexing a site

When searching a web site, chances are you are not actually searching the content, but rather a pre-formatted copy of the content. This increases the speed of each search, just as looking up a keyword in the index of a book is faster than reading the entire book. A database site may have an additional index, or it may search the content directly. An HTML site will need to enter all of its content into an index for the search engine to search.

With HTML sites, the content is usually crawled at specific intervals of time. If new information is added, it will not appear in the search engine until the site has been re-indexed. Database sites that require an additional search engine usually have content stored in multiple tables. A site that always uses the same format for its content might not need an additional index for the search engine. (Refer to Figure 1.)

2.13 PageRank: Bringing Order to the Web

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. The search engine creates maps of all the hyperlinks. These maps allow rapid calculation of a web page's 'PageRank', a measure of its citation importance that corresponds to people's subjective idea of importance. Because of this, PageRank is an excellent way to prioritize the results of keyword searches. For most popular subjects, a simple text-matching search restricted to web page titles performs admirably when PageRank prioritizes the results.

2.13.1 PageRank Calculation

Academic citation analysis has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; it is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of page A is then:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
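The formula can be computed by simple iteration until the values settle. Below is a minimal sketch implementing the formula above; the three-page link graph is an illustrative assumption.

    # Iterative PageRank over a link graph (page -> list of pages it links to).
    def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50):
        pages = list(links)
        pr = {p: 1.0 for p in pages}        # initial rank for every page
        for _ in range(iters):
            new = {}
            for a in pages:
                # Sum PR(T)/C(T) over every page T that links to A.
                incoming = sum(pr[t] / len(links[t])
                               for t in pages if a in links[t])
                new[a] = (1 - d) + d * incoming
            pr = new
        return pr

    # Example: A and B link to each other; C links to A.
    print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))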





Figure 1: Various components of a search engine and its working.







Google Architecture

This section explains how the Google system works, as pictured in Figure 2.

In Google, web crawling is done by several distributed crawlers. A URL server sends lists of URLs to be fetched to the crawlers. The fetched web pages are then sent to the storeserver, which compresses and stores them in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of 'barrels', creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs, and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index. This is done in place, so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.











Figure 2: A complete model of the Google architecture.








Figure 3: Detailed illustration of the searching mechanism.




Major Data Structures



4.1 BigFiles

BigFiles are virtual files spanning multiple file systems and addressable by 64-bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for these needs.

4.2 Repository

The repository contains the full HTML of every web page. Each page is compressed using zlib. The choice of compression technique is a tradeoff between speed and compression ratio. Google chose zlib's speed over the significant improvement in compression offered by bzip. In the repository, the documents are stored one after another, each prefixed by its docID, length, and URL.

4.3 Document Index

The document index keeps information about each document. It is a fixed-width ISAM (Index Sequential Access Mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, the entry also contains a pointer into a variable-width file called docinfo, which contains its URL and title. Otherwise, the pointer points into the URLlist, which contains just the URL.
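A fixed-width layout allows the n-th entry to be located by direct arithmetic rather than by scanning. The sketch below uses Python's struct module; the field layout and sizes are illustrative assumptions, not Google's actual record format.

    import struct

    # Hypothetical entry: docID (8 bytes), status (1), repository pointer (8), checksum (4).
    ENTRY = struct.Struct("<QBQI")

    def read_entry(index: bytes, n: int):
        # Fixed width means the n-th record starts at byte n * ENTRY.size.
        return ENTRY.unpack_from(index, n * ENTRY.size)

    index = ENTRY.pack(0, 1, 4096, 0xDEADBEEF) + ENTRY.pack(1, 0, 8192, 0xCAFEBABE)
    print(read_entry(index, 1))
    # -> (1, 0, 8192, 3405691582)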

4.4 Lexicon

The lexicon has several different forms. The current lexicon contains 14 million words. It is implemented in two parts: a list of the words and a hash table of pointers.

4.5 Hit Lists

A hit list corresponds to the occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. Google considered several alternatives for encoding position, font, and capitalization: simple encoding, a compact encoding, and Huffman coding.
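As a sketch of a compact encoding, the two-byte layout described in [7] packs one capitalization bit, three font-size bits, and twelve position bits into a single 16-bit value; the helper functions themselves are illustrative.

    # Pack and unpack a 16-bit hit: 1 bit caps, 3 bits font size, 12 bits position.
    def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
        assert 0 <= font_size < 8 and 0 <= position < 4096
        return (int(capitalized) << 15) | (font_size << 12) | position

    def unpack_hit(hit: int):
        return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

    hit = pack_hit(True, 3, 157)
    print(hex(hit), unpack_hit(hit))
    # -> 0xb09d (True, 3, 157)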


4.6 Forward Index

The forward index is actually already partially sorted. It is stored in a number of barrels, each holding a range of wordIDs. If a document contains words that fall into a particular barrel, its docID is recorded into the barrel, followed by a list of wordIDs with hit lists corresponding to those words. Instead of storing actual wordIDs, each wordID is stored as a relative difference from the minimum wordID of the barrel it falls into.
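The relative-difference trick keeps the stored numbers small. A minimal sketch, with an assumed barrel minimum:

    # Store wordIDs relative to the barrel's minimum wordID (assumed value).
    BARREL_MIN = 100_000

    def encode(word_ids: list[int]) -> list[int]:
        # Small relative offsets need fewer bits than absolute wordIDs.
        return [w - BARREL_MIN for w in word_ids]

    def decode(offsets: list[int]) -> list[int]:
        return [o + BARREL_MIN for o in offsets]

    print(encode([100_004, 100_250]))
    # -> [4, 250]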

4.7 Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into. The pointer points to a doclist of docIDs together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
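The inversion itself is a regrouping of the forward postings by wordID. A minimal sketch, with a toy forward barrel as an assumption:

    from collections import defaultdict

    # Forward barrel: docID -> list of (wordID, hit positions).
    forward = {
        1: [(7, [3, 19]), (9, [4])],
        2: [(7, [1])],
    }

    inverted = defaultdict(list)            # wordID -> doclist of (docID, hits)
    for doc_id, postings in forward.items():
        for word_id, hits in postings:
            inverted[word_id].append((doc_id, hits))

    print(dict(inverted))
    # -> {7: [(1, [3, 19]), (2, [1])], 9: [(1, [4])]}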

4.8 Crawling the Web

Running a web crawler is a challenging task. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system.

To scale to millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers. Each crawler keeps roughly 300 connections open at once and works on a simple iterative algorithm. The details of the algorithm differ between search engines and with the kind of query, but some such scheme is necessary to retrieve web pages at a fast enough pace, and it makes the crawler a complex component of the system. (Refer to Figure 3.)

Such a crawler connects to more than half a million servers and generates tens of millions of log entries. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Systems which access large parts of the Internet must therefore be designed to be very robust and carefully tested.
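The iterative loop at the heart of a crawler is simple, even though a production crawler is not. Below is a single-threaded sketch; a real crawler keeps hundreds of connections open and must also honor robots.txt and politeness delays, which are omitted here.

    import re
    import urllib.request
    from collections import deque

    def crawl(seed: str, max_pages: int = 10) -> dict[str, str]:
        frontier, seen, pages = deque([seed]), {seed}, {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue                    # crawling is fragile: skip failures
            pages[url] = html
            # Mark new absolute URLs found in the page and queue them.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages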

4.9 Indexing the Web

- Parsing: Any parser designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones.

- Indexing documents into barrels: After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID using an in-memory hash table, the lexicon. Once the words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels. A minimal sketch follows this list.

- Sorting: In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel.
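The sketch below shows the wordID conversion and forward-barrel write; the single in-memory structures are illustrative assumptions (Google splits the barrels by wordID range).

    lexicon: dict[str, int] = {}                       # word -> wordID
    forward_barrel: list[tuple[int, int, int]] = []    # (docID, wordID, position)

    def index_document(doc_id: int, text: str) -> None:
        for position, word in enumerate(text.lower().split()):
            # The in-memory hash table assigns each new word the next wordID.
            word_id = lexicon.setdefault(word, len(lexicon))
            forward_barrel.append((doc_id, word_id, position))

    index_document(1, "search engines index the web")
    print(lexicon)
    # -> {'search': 0, 'engines': 1, 'index': 2, 'the': 3, 'web': 4}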



4.10 The Ranking System

Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information. Combining all of this information into a rank is difficult.

First, consider the simplest case: a single-word query. To rank a document for a single-word query, Google looks at the document's hit list for that word. It counts the number of hits of each type in the hit list, then computes an IR score for the document. Finally, the IR score is combined with PageRank to give the document a final rank.

For a multi-word search, the situation is more complicated. Multiple hit lists must now be scanned through at once, so that hits occurring close together in a document are weighted higher than hits occurring far apart. For every matched set of hits, a proximity is computed. Counts are computed not only for every type of hit but for every type and proximity, and from these an IR score is computed. (Refer to Figure 4.)
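As a sketch of the single-word case, the following combines a type-weighted hit count with PageRank; the weights and the combination rule are illustrative assumptions, not Google's actual formula.

    # Weight hits by type, then blend the IR score with PageRank.
    TYPE_WEIGHTS = {"title": 5.0, "anchor": 3.0, "plain": 1.0}

    def ir_score(hits: list[str]) -> float:
        return sum(TYPE_WEIGHTS[h] for h in hits)

    def final_rank(hits: list[str], pagerank: float, alpha: float = 0.5) -> float:
        return alpha * ir_score(hits) + (1 - alpha) * pagerank

    print(final_rank(["title", "plain", "plain"], pagerank=1.46))
    # -> 4.23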


Figure 4: Mechanism of Google query results.



A comparison of various Internet Search Engines

Category                          AltaVista  Excite  WebCrawler  Lycos  OpenText  InfoSeek  Yahoo!  Google
Case Sensitive?                   Y          N       N           N      N         Y         N       N
Considers Phrases?                Y          N       Y           N      Y         Y         N       Y
Required Term Operator            +          +       N           N      N         +         N       Y
Prohibited Term Operator          -          -       N           N      N         -         N       N
Wildcard Expander                 *          N       N           $      N         N         N       *
Limiting Character                N          N       N           .      N         N         N       N
Results Ranking?                  Y          Y       Y           Y      Y         Y         N       Y
Controllable Results Ranking?     Y          N       N           Y      N         N         N       Y
Booleans Allowed?                 Y          Y       Y           N      Y         N         N       Y
Proximity Operators Allowed?      Y(10)      N       Y(range)    N      Y(80)     Y(100)    N       Y
Subject (Directory) Searching?    N          Y       Y           Y      N         Y         Y       N
Refine Based On First Search?     N          Y       N           N      Y         N         N       Y

Table 1: A comparison of various Internet search engines.




Conclusion

In this paper, we discussed the anatomy and working of a web search engine. We presented the Google architecture and its operations. We conclude that different search engines have different strong points; there is no perfect search engine. However, due to its advanced features, Google is one of the most efficient. As users, we are living in a special era in which search engines are undergoing a profound evolution and a refinement of their special tools. We believe that very soon the Internet will evolve standards, such as standard categories, ways of automatically classifying information into these categories, and the search tools to take advantage of them. We believe this will greatly improve searching.

Glossary

1. Spider:
A spider is a robotic program that downloads web pages. It works just as a browser does when it connects to a web site and downloads a page.

2. Crawler:
As a spider downloads pages, it can strip apart the page and look for 'links'. It is the crawler's job to decide where the spider should go next, based on those links or on a preprogrammed list of URLs.

3. Indexer:
An indexer rips a page apart into its various components and analyzes them.

4. Database:
The database is the storage medium for all the data a search engine downloads and analyzes.


References

[1] Barbara Kasser, "Using the Internet", Fourth Edition, Prentice Hall India Ltd, pp. 122-144.

[2] Thomas Levine Young, "The Complete Reference: Internet", Second Edition, Tata McGraw Hill Ltd.

[3] Gilles Brassard & Paul Bratley, "Fundamentals of Algorithmics", Prentice Hall India.

[4] Best of the Web 1994 - Navigators: http://botw.org/1994/awards/navigators.html

[5] Google Search Engine: http://google.stanford.edu/

[6] Search Engine Watch: http://www.searchenginewatch.com/

[7] Sergey Brin & Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine": http://www.db-stanford.edu/

[8] Emma Jane Hogbin, "Search Engines 101": http://xtrinsic.com/geek/articles/searchengines101.pdf

[9] Advanced Web Search: http://www.learnthenet.com/advancedwebsearch.html

[10] Google Search: http://www.google.guide.com/howgoogleworks.html