Project Report Comparative Analysis of Data Structures for Inverted File Indexing in Web Search Engines










Project Report



Comparative Analysis of Data Structures

for Inverted File Indexing in Web Search Engines


Ingrid Biswas, Vikram Phadke

CSE 598: Design and Analysis of Algorithms Project

Computer Science & Engineering Department

Arizona State University

ibiswas@asu.edu, vikram.phadke@asu.edu























CONTENTS

ABSTRACT
1  INTRODUCTION
2  BACKGROUND: INFORMATION RETRIEVAL SYSTEMS
3  WEB SEARCH ENGINE ARCHITECTURE
   3.1  Crawler
   3.2  Repository
   3.3  Parser
   3.4  Indexer
   3.5  Page ranking module and Query Engine
4  THE GOOGLE SEARCH ENGINE
5  TEXT INDEXING AND RETRIEVAL
   5.1  Signature files
   5.2  Vector space models
   5.3  Latent semantic indexing (LSI)
   5.4  Inverted File Indexing
   5.5  Inverted File Compression
   5.6  Representing and Accessing Lexicons
6  IMPLEMENTATION
7  ANALYSIS AND RESULTS
   7.1  Inverted File Indexing Using Sorted Array
   7.2  Inverted File Indexing Using Hash Table
   7.3  Inverted File Indexing Using BTrees
   7.4  Comparative analysis of all three data structures
   7.5  Search and retrieval efficiency of the data structures
   7.6  BTrees and external memory
8  FUTURE RESEARCH DIRECTIONS
9  CONCLUSION
10 REFERENCES





































ABSTRACT


Search engines today serve as portals to the millions of web pages that form the WWW (World Wide Web). They are probably the most popular examples of Information Retrieval tools. They contain several major components that interact together, namely the Crawler, the Storage module, the Parser, the Indexer, the Query Processor, and the Ranking module. Efficient algorithms and data structures can make the difference between an average and an exceptional search engine, and search engines today have to index millions of pages. Our work studies text indexing in the context of web search engines. In particular, the inverted file indexing algorithm is studied in detail, and different data structures are compared in terms of the time required to create the index, the time required to query the index, and the space footprint.



1

INTRODUCTION


Search engines are extremely useful information retrieval tools. They are used for just about everything, from shopping for electronics to looking for research papers. With the size of the WWW growing rapidly, search engine technology faces increasing challenges. Our work had the following objectives: (1) gain an in-depth understanding of search engine technology; (2) look at search engines from the perspective of algorithms and data structures; (3) study the different modules of search engines in detail, analyzing their algorithms and data structures.

We focus on the indexing module of search engines and analyze the inverted file indexing algorithm. Different kinds of data structures can be used to implement the index: sorted arrays, tries, BTrees, and hash tables can all be used to create it. Various issues, such as the time required to create the index, the space footprint of the index, and the time required for retrieval, arise when talking about efficient data structures for the indexing algorithm. Our work focuses on comparing data structures for the inverted file indexing algorithm in terms of the time required to create the index. An outline of this report follows. Section 2 provides some background on information retrieval techniques. Section 3 discusses web search engines and their various modules. Section 4 describes in detail the working of the Google search engine. Section 5 describes the various algorithms used for text indexing; it covers in detail the inverted file indexing algorithm and the data structures that can be used to store the index. Section 6 describes the design and implementation of the "evaluation environment" that was used for comparing the performance of the inverted file indexing algorithm when different data structures are used. Section 7 explains the results of the experiments. Section 8 outlines future research directions based on the experiences with this work.


2

BACKGROUND: INFORMATION RETRIEVAL SYSTEMS


Information retrieval is a general term used to identify all those activities that enable us to choose from a given collection of documents. These could be documents that belong to a particular domain of interest or a particular topic. The activities we are concerned with in retrieving information are those that permit us to reach the target of choosing the documents that are probably relevant to the initial information need in an automatic way. The main criterion for automatic information retrieval is that the collections of documents that are available are in a digital form. In traditional IR, the collection of documents is a set of documents that has been put together because it is related to a specific context of interest for the users that are going to use it. An IR collection is a set of all the documents of the collection that have certain properties or features in common. These features are used to cluster similar documents, enabling faster retrieval of documents pertaining to the user query.

It is possible to use a traditional IR system and its document collection in a web-based IR system, but there are issues that need to be looked into. The IR system needs to be made available to the end user through a program that connects the IR system sitting on the web server to a Web page that acts as an interface between the user and the IR system.

Retrieving information from the Internet is a common practice for Internet users. However, the size and heterogeneity of the web make it very challenging, and also reduce the effectiveness of information retrieval techniques that are used to retrieve information from traditional data sources. Many software tools are available these days for web information retrieval, such as search engines (Google, AltaVista), hierarchical directories (Yahoo), and many other software agents.

Web users started having proper tools to access documents on the Internet during 1994. Before that year, the available tools indexed and managed only the title, the URL, and some small parts of Web pages [Maud98]. Since then there have been so many advances in this field that it can be looked at as a big event in the history of information retrieval system technology. WebCrawler, developed at the University of Washington (USA) and available in April 1994, was the first tool that allowed the user to search the full text of entire Web documents [Maud98]. Lycos, another web search engine, was developed at Carnegie Mellon University (USA) in July 1994 [Herr99]. So we can say that from 1994 on it has been possible to have Web tools with effective IR functionalities.

Since 1994, IR systems for the web have flourished, with innovative and better tools for effective and faster information retrieval. AltaVista entered the scene in 1995 with a number of innovative features, and in the following years many other search tools were made available.

3

WEB SEARCH ENGINE ARCHITECTURE




3.1

Crawler

The crawler module retrieves pages from the Web. It typically starts with an initial set of URLs, which is fed into the crawler in a queue structure. The crawler then gets a URL from this queue one at a time. There are different ways to choose which URL to visit next, namely depth-first, breadth-first, or randomly, depending on the implementation of the crawler. The crawler downloads the web page, extracts any URLs in the downloaded web page, and adds the URLs that it found to the same queue. This action continues until the crawler decides to stop; the crawler will stop once it has visited all the web page URLs in its queue. There are several issues that need to be taken into consideration regarding how the crawler behaves. The main issues stem from the enormous size of the Internet: it is impossible for the crawler to download all pages on the Web, and even the most comprehensive search engine can index only a small fraction of the entire Internet. Based on this fact, it is necessary for the crawler to prioritize the URLs in such a way that it will visit "important" pages first. This ensures that the part of the Web that is visited by the crawler is more meaningful.

The main steps that the crawler has to take can be summarized as follows. First, it needs to be fed a URL or a set of URLs. The crawler picks a URL from this queue and fetches the web page from this URL. It then parses this page and extracts links to other URLs from this page. It filters out unwanted links and links that it has already visited. It adds all these URLs to the queue. This is the basic working of all crawlers. The main difference between crawlers lies in which algorithm they use for choosing the next URL. Some crawlers use simple policies such as random, FIFO, or LIFO; others use a priority algorithm such as the one from [Kwon00], discussed below.

After the crawler has downloaded a number of pages, it sends the downloaded web pages to the repository module to be stored. It then needs to make sure that the repository of web pages it has stored is kept fresh. For this the crawler needs to revisit the same URLs in order to detect changes in the downloaded pages and refresh the collection. Because Web pages change at very different rates [Cho00], and due to the enormous size of the web, the crawler is not able to revisit all the web pages quickly and refresh them. Hence, it needs to decide which pages to revisit and which pages to skip. This decision significantly impacts the "freshness" of the downloaded documents. As an example, if a certain page changes rarely, the crawler may not want to revisit it very often; that way it is able to visit more pages that change more frequently.


[Kwon00] gives a prediction algorithm that can be used to find out when a particular web page will be updated, helping the crawler decide when to visit the page. The paper calculates the update frequency of each page using three main factors. First, we need LA(P), the local average of the page P, i.e., the average update frequency of the pages that are in proximity of page P, where all the page frequencies must be close within a certain threshold. Second, we need the history average of the page, HA(P), which gives the average frequency calculated from the page's modification history. Third, we need the tolerance of the page, which defines how close this page is to other pages; this value is used in calculating LA(P). The formula used to calculate the update frequency of a given page P, denoted FR(P), combines the first two terms:

FR(P) = HA(P) * (1 - LW(n)) + LA(P) * LW(n)

where LW is a weight factor associated with the local average LA(P) and n is the number of history records. The algorithm makes a few simple but useful assumptions. First, recent history is much more important than old history. Second, history data of the page are more trustworthy than locality data, provided that we have enough history records. The equations for the history average and the local weight are defined based on these two assumptions.
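To make the formula concrete, here is a minimal Java sketch of this estimate. Since this report does not reproduce [Kwon00]'s equations for HA(P) and LW(n), the localWeight function below is a labeled placeholder assumption, not the paper's definition.

// Minimal sketch of the [Kwon00] update-frequency estimate FR(P).
// The paper's exact equations for HA(P) and LW(n) are not reproduced
// here, so localWeight below is a placeholder assumption.
public class RefreshPredictor {

    /** FR(P) = HA(P) * (1 - LW(n)) + LA(P) * LW(n) */
    static double updateFrequency(double historyAverage,   // HA(P)
                                  double localAverage,     // LA(P)
                                  int historyRecords) {    // n
        double lw = localWeight(historyRecords);
        return historyAverage * (1 - lw) + localAverage * lw;
    }

    // Placeholder: trust history more as records accumulate, i.e. the
    // local-average weight decays with n (assumption, not from [Kwon00]).
    static double localWeight(int n) {
        return 1.0 / (n + 1);
    }

    public static void main(String[] args) {
        // A page whose history says "changes every 10 days" and whose
        // neighbours average "every 7 days", with 4 history records.
        System.out.println(updateFrequency(1.0 / 10, 1.0 / 7, 4));
    }
}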


Due to the enormous size of the Web, crawlers often run on multiple machines and download pages in parallel [Cho00]. This parallel processing is needed so that the crawler is able to download a substantial number of pages in a reasonable amount of time. These parallel crawlers need to be coordinated with each other so that multiple crawlers do not visit the same URL multiple times.


3.2

Repository

The page repository is a scalable storage system for managing large collections of Web pages. The repository needs to perform two main functions. First, it needs to provide an interface for the crawler to store web pages it has crawled. Second, it must provide an efficient API that the indexer module can use to retrieve the pages. There are a few challenges that the storage module needs to address. It needs to be scalable and distributed, so that the data it stores can be spread over a network of servers, due to the large size of the data we are dealing with. The repository also must support two different access modes, namely random access and streaming access.

Random access is used to quickly retrieve a specific Web page, given the page's unique identifier. The query engine module needs random access to the repository in order to serve out web pages to the end user depending on their query string. Streaming access is used to receive the entire collection, or a significant subset, as a stream of pages. The indexer module uses streaming access to process and analyze pages in bulk. The repository also needs to deal with issues regarding updating to newer versions of the web pages. The repository needs to be able to identify pages that are obsolete (deleted from their websites). When web pages are removed from their web sites, the repository is not informed. Thus, the repository needs a mechanism to be able to identify and remove obsolete pages from its storage.



3.3

Parser


The parser module is an intermediate module between the repository and the indexer. The indexer module uses this module to extract the web pages from the repository and process them to remove the HTML tags. The parser then takes this page content, i.e., the web page without all the tags, and parses the page again to remove any stop-list words. Stop-list words are words that occur very frequently and do not help in any way to differentiate between documents; in other words, they appear in almost all documents. Examples of stop-list words are a, and, the, if, how, etc. The indexer module will take the page content left by the parser and use it to index the text. The parser then extracts the keywords from the page content and creates a forward index for each page. A forward index is a structure that stores a list of all the keywords that appear in the web page along with the occurrence count of each keyword.
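As a concrete illustration, the following minimal Java sketch builds such a forward index for one page. The tokenization and the tiny stop list are simplifying assumptions; stemming is omitted.

import java.util.*;

// Minimal sketch of the parser's output: a forward index mapping each
// page URL to its keywords and their occurrence counts. Tokenization and
// the stop list are simplified stand-ins; stemming is omitted here.
public class ForwardIndexer {
    static final Set<String> STOP_WORDS =
            Set.of("a", "and", "the", "if", "how", "to", "of");

    /** keyword -> number of occurrences on one page */
    static Map<String, Integer> index(String pageText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : pageText.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> forwardIndex = new HashMap<>();
        forwardIndex.put("http://example.com",
                index("The quick brown fox and the lazy dog. Quick!"));
        System.out.println(forwardIndex);
    }
}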


3.4

Indexer


The indexer module builds a variety of indexes on the pages in the repository. It takes the forward index structure built by the parser module and creates an inverted index structure. An inverted index structure contains a list of all keywords and, for each keyword, the list of URLs that the keyword appears in. The inverted index structure is indexed on the keywords. The indexer module creates two main indexes: a Text Index to index all the keywords, and a Link Index to index all the links on the web pages.

Text-based retrieval, namely searching for pages containing some keywords, is the main method for identifying pages relevant to a query. Various methods have been used to implement support for text-based retrieval over text document collections. Examples include suffix arrays [Manb90], inverted files or inverted indexes [Salt89, Witt94], and signature files [Falo84]. Inverted indexes have traditionally been the index structure of choice on the Web. They are discussed in detail in Section 5.

The whole Web is modeled as a graph with nodes and edges [Brod00]. Each node in the graph is a Web page, and a directed edge from node A to node B represents a hypertext link in page A that points to page B. A Link Index is a subset of this graph that contains web pages (nodes) that have been visited and links (edges) that have been found on those web pages. The most common structural information used by search algorithms [Brin98] is neighborhood information, i.e., for a given page P, the outward links (the set of pages that are pointed to by P) or the incoming links (the set of pages pointing to P). Neighborhood information of the original graph and its subgraphs can be easily retrieved using adjacency-list representations [Aho83] of the graph. The information stored in these adjacency lists can be used to extract other structural properties of the Web graph. For example, if we need to retrieve pages that are related to a given page, then the notion of sibling pages is often used. This information about siblings can be easily derived from the adjacency-list structures described above.
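A minimal Java sketch of this idea follows, assuming integer node ids and an illustrative sample graph: adjacency lists for out-links and in-links, with sibling pages derived from them.

import java.util.*;

// Minimal sketch of the adjacency-list representation of the Web graph
// and of deriving sibling pages (pages sharing an in-link parent with a
// given page). Node numbering and the sample graph are illustrative.
public class LinkIndex {
    // outLinks.get(a) = pages that page a points to
    static Map<Integer, List<Integer>> outLinks = new HashMap<>();
    // inLinks.get(b) = pages pointing to page b
    static Map<Integer, List<Integer>> inLinks = new HashMap<>();

    static void addEdge(int from, int to) {
        outLinks.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        inLinks.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    /** Siblings of p: other pages pointed to by any page that points to p. */
    static Set<Integer> siblings(int p) {
        Set<Integer> result = new TreeSet<>();
        for (int parent : inLinks.getOrDefault(p, List.of()))
            result.addAll(outLinks.get(parent));
        result.remove(p);
        return result;
    }

    public static void main(String[] args) {
        addEdge(1, 2); addEdge(1, 3); addEdge(4, 3); addEdge(4, 5);
        System.out.println(siblings(3));   // [2, 5]
    }
}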

Small graphs of hundreds or even thousands of nodes can be efficiently represented by any one of a variety of well-known data structures [Aho83]. However, the biggest challenge is to do the same for a graph with several million nodes and edges. The Connectivity Server in the AltaVista search engine, which is used to deliver linkage information for all pages retrieved and indexed, is described in [Bhar98]. Even though link-based techniques are used to enhance the quality and relevance of search results, text-based structures remain the most important ones.



3.5

Page ranking module and Query Engine

The Query Engine takes the query string from the user, containing the terms to search for, and retrieves pages that are likely to be relevant to the query. The relevant pages that are retrieved need to be ranked. Traditional Information Retrieval (IR) techniques do not have any effective algorithm for ranking query results, for the reasons listed below.

First, the Web is very large and has great variation in the content, amount, and quality of information present in its pages. Hence, many pages that contain the search terms may not be relevant to the user or could be of poor quality. Second, most Web pages are not very self-descriptive, so the traditional IR techniques that examine the contents of a page do not work very well. An often-cited example to illustrate this issue is the search for "search engines" [Klei99]: the homepages of most of the important search engines do not contain the text "search engine". Spamming is a big issue when ranking pages. Web developers have started adding misleading terms to their pages so that search engines will rank them higher. This is another reason the content of pages alone cannot be used as a technique to rank pages.

As mentioned earlier, the web can be viewed as a graph structure. The information maintained by the link structure can be used in ranking pages. For example, if web page A contains a link to page B, then it implies that page A is recommending page B. This recommendation can be used to assign an importance to a web page based on how many pages refer to it. Some new algorithms have been proposed that make use of this link structure. These algorithms are based not only on the content of a page but also on the link structure, hence they are generally better than traditional IR algorithms. Spamming has entered even this aspect of the web, with web developers adding extra links to particular web pages. But the advantage is that they are not able to influence the link structure at a global level. Hence link analysis algorithms working at a global level are relatively robust against spamming.

Page and Brin describe a global ranking scheme, called PageRank, in [Page98] that tries to capture the notion of "importance" of a page. The rank of a page is defined based on the number of pages that link to it; in other words, a page is more important than another page if it has more incoming links. The rank of a web page A can be defined as the number of pages in the Web that point to A, and could be used to rank the results of a search query. This is known as citation ranking. It does not work very well against spamming, as it is very easy to artificially create a huge number of pages pointing to a desired page.

The PageRank algorithm extends the basic citation-ranking algorithm. It takes into consideration how important the pages are that point to a given web page. Thus, if an important web page points to a page A, page A receives more importance in its ranking than if an unimportant page pointed to it.

The definition of PageRank is recursive: the importance of a page both depends on and influences the importance of other pages. A simple definition of the PageRank algorithm that captures the above intuition is given below. Let us denote the pages on the Web as 1, 2, ..., m. forward(i) denotes the number of outgoing (forward) links from a page i, and back(i) denotes all the pages that contain a link to page i (back links). In this algorithm we assume that we can reach every page from any given page, i.e., the web forms a strongly connected graph. A simple formula to calculate the PageRank of page i, denoted rank(i), is

rank(i) = sum over j in back(i) of rank(j) / forward(j)

The division by forward(j) captures the intuition that pages which point to page i evenly distribute their rank as a boost to all of the pages they point to.
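The following minimal Java sketch iterates this formula to a fixed point on a tiny illustrative graph. It assumes, as the text does, a strongly connected graph, and it omits the damping factor of the full PageRank algorithm.

import java.util.*;

// Minimal sketch of the simple PageRank formula above, iterated to a
// fixed point: rank(i) = sum over j in back(i) of rank(j)/forward(j).
// It assumes the graph is strongly connected, as the text does; the
// damping factor of the full PageRank algorithm is omitted.
public class SimplePageRank {
    public static void main(String[] args) {
        // outLinks[i] = pages that page i points to (0-indexed);
        // the cycle 0 -> 1 -> 2 -> 3 -> 0 keeps it strongly connected
        int[][] outLinks = { {1, 2}, {2}, {0, 3}, {0} };
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++)
                for (int i : outLinks[j])          // j distributes its rank
                    next[i] += rank[j] / outLinks[j].length;
            rank = next;
        }
        System.out.println(Arrays.toString(rank));
    }
}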


4

THE GOOGLE SEARCH ENGINE



In this section we look at the architecture and working of a very popular search engine, Google. Most of Google is implemented in C and C++ for efficiency, and it can run on Linux and Solaris servers.


Google has several distributed servers for web crawling, i.e., finding URLs and downloading web pages from the Internet. This helps in parallel processing, as there are millions of web pages all over the Internet. At the start of each run, the URL Server has a list of URLs that need to be crawled. The URL Server sends a list of these URLs to the crawlers. The crawlers use these URLs as a starting point to go and fetch more URLs from the web pages. The fetched web pages are then sent to the store server, where they are compressed and sent to the repository for storage. This function of the crawler is an ongoing process: it keeps returning to the same URLs and checks whether the web pages have been updated since it last fetched them. If they have been updated, then the crawler gets the new web page and sends it to the store server. Every web page is given a Document ID number, assigned when its URL is parsed out of a web page.


The next step in the process is to index and sort the web pages, which is done by the indexer and sorter. The indexing module takes the web pages from the repository, uncompresses, and parses them. Each web page is then converted to a forward index structure that contains all the words in that web page along with their occurrences, i.e., the number of times each word occurs in the document. The position of the word in the document, along with its font size and capitalization, is also stored in the forward index. The indexer then distributes this structure into a set of barrels, creating a forward index that is partially sorted. The indexer also parses out all the links in the web page and stores them in the anchors file. Information from this file can be used to easily determine where each link points to and from, and the text that is part of the link.



The URL Resolver reads the URLs from the anchors file and converts them to absolute Document IDs. It puts the anchor text into the forward index associated with the Document ID. It also generates a Links database, consisting of pairs of Document IDs. This Links database is used later in the page ranking algorithm.



The sorter takes the documents in the barrels, which are sorted by Document ID, and creates an inverted index. The inverted index contains, for each word, the documents it is associated with and its occurrences in each document. The DumpLexicon program takes this inverted index along with the lexicon produced by the indexer module and creates a new lexicon to be used by the searcher. The searcher, which is run by the web server, takes the query words from the user and uses the lexicon produced by DumpLexicon, the inverted index, and PageRank to answer the query.





Figure 1. High level Google Architecture [Huan00]



5

TEXT INDEXING AND RETRIEVAL


Indexing addresses the issue of how information from a collection of documents should be organized so that queries can be resolved efficiently and relevant portions of the data extracted quickly. We will describe a variety of indexing methods. To be as general as possible, a document collection or document database can be treated as a set of separate documents, each described by a set of representative terms, or simply terms (each term might have additional information, such as its location within the document).

An index must be capable of identifying all documents that contain combinations of specified terms, or that are in some other way judged to be relevant to the set of query terms. The process of identifying the documents based on the terms is called a search or query of the index.


Applications of indexing

Indexing has been used for many years in a wide variety of applications. It has gained particular recent interest in the area of web searching (e.g., AltaVista, HotBot, Lycos, Excite, ...). Applications include Web searches, library article and catalog searches, law and patent searches, and information filtering (e.g., retrieving interesting New York Times articles).


The goals of these applications are:

Speed -- minimal information retrieval latency

Space -- storing the document and indexing information with minimal space

Accuracy -- returning the "right" set of documents

Updates -- the ability to modify the index on the fly (only required by some applications)


Figure 2 provides an overview of the indexing and searching process.

Figure 2: Overview of indexing and searching


The main approaches used for text indexing are as follows:

Full text scanning (e.g., grep, egrep)

Inverted file indexing (most web search engines)

Signature files

Vector space models

Each of these approaches will be explained in the following sections. Our work focuses on inverted file indexing and the efficient data structures that can be used for it.

The different types of queries that an index may have to support are Boolean (and, or, not), proximity (adjacent, within), keyword set, and queries in relation to other documents (relevance feedback). The index should also allow for prefix matches (AltaVista does this), wildcards, and edit distance bounds (egrep).


There are some general techniques that are used by all indexing approaches, irrespective of the algorithm or data structure:

case folding: London = london

stemming: compress = compression = compressed (several off-the-shelf English language stemmers are available)

ignoring stop words: to, the, it, be, or, ... (problems arise when searching for phrases such as "To be or not to be" or "the month of May")

thesaurus expansion: fast = rapid (hand-built clustering)


Granularity of the index

The granularity of the index refers to the resolution to which term locations are recorded within each document. This might be at the document level, at the sentence level, or exact locations. For proximity searches, the index must know exact (or near exact) locations.


5.1

Signature files


Signature files are an alternative to inverted file indexing. The main advantage of signature files is that they don't require a lexicon to be kept in memory during query processing; in fact, they do not require a lexicon at all. If the vocabulary of the stored documents is rich, then the amount of space occupied by a lexicon may be a substantial fraction of the amount of space filled by the documents themselves.

Signature files are a probabilistic method for indexing documents. Each term in a document is assigned a random signature, which is a bit vector; these assignments are made by hashing. The descriptor of a document is the bitwise logical OR of the signatures of its terms. As we will see, queries to signature files sometimes respond that a term is present in a document when in fact the term is absent. Such false matches necessitate a three-valued query logic.

There are three main issues with respect to signature files: (1) generating signatures, (2) searching on signatures, and (3) query logic on signature files.
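A minimal Java sketch of these ideas, with an illustrative signature width and bits-per-term: terms hash to random bit signatures, a document descriptor is the OR of its term signatures, and a query can answer only "maybe present" or "definitely absent".

import java.util.*;

// Minimal sketch of signature-file indexing: each term hashes to a random
// bit signature, a document descriptor is the OR of its term signatures,
// and a query term "may be present" when its bits are all set. Signature
// width and bits-per-term are illustrative choices.
public class SignatureFile {
    static final int WIDTH = 64, BITS_PER_TERM = 3;

    static long signature(String term) {
        long sig = 0;
        Random rng = new Random(term.hashCode());   // deterministic per term
        for (int i = 0; i < BITS_PER_TERM; i++)
            sig |= 1L << rng.nextInt(WIDTH);
        return sig;
    }

    public static void main(String[] args) {
        long descriptor = 0;                        // one document's descriptor
        for (String term : List.of("inverted", "file", "indexing"))
            descriptor |= signature(term);

        long q = signature("indexing");
        // "maybe present" can be a false match; "absent" is always exact
        System.out.println((descriptor & q) == q ? "maybe present" : "absent");
    }
}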


5.2

Vector space models


Boolean queries are useful for detecting Boolean combinations of the presence and absence of terms in documents. However, Boolean queries never yield more information than a yes or no answer. In contrast, vector space models allow search engines to quantify the degree of similarity between a query and a set of documents. The uses of vector space models include:

Ranked keyword searches, in which the search engine generates a list of documents that are ranked according to their relevance to a query.

Relevance feedback, where the user specifies a query and the search engine returns a set of documents; the user then tells the search engine which documents among the set are relevant, and the search engine returns a new set of documents. This process continues until the user is satisfied.

Semantic indexing, in which search engines are able to return a set of documents whose "meaning" is similar to the meanings of terms in a user's query.

In vector space models, documents are treated as vectors in which each term is a separate dimension. Queries are also modeled as vectors, typically 0-1 vectors. Vector space models are often used in conjunction with clustering to accelerate searches.
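As a small illustration, the following Java sketch ranks two documents against a 0-1 query vector by cosine similarity. The three-term dimension space and raw term counts (no tf-idf weighting) are simplifications.

// Minimal sketch of vector-space ranking: documents and the query are
// vectors over a fixed term dimension, compared by cosine similarity.
public class VectorSpace {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // dimensions: [search, engine, index] -- raw counts, no tf-idf
        double[] doc1  = {3, 2, 0};
        double[] doc2  = {0, 1, 4};
        double[] query = {1, 1, 0};    // 0-1 query vector, as in the text
        System.out.println("doc1: " + cosine(doc1, query));
        System.out.println("doc2: " + cosine(doc2, query));
    }
}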

5.3

Latent semantic indexing (LSI)

All of the methods we have explained so far for searching a collection of documents match words in users' queries to words in documents. These approaches have two drawbacks. First, since there are usually many ways to express a given concept, there may be no document that matches the terms in a query even if there is a document that matches the meaning of the query. Second, since a given word may mean many things, a term in a query may retrieve irrelevant documents. In contrast, latent semantic indexing allows users to retrieve information on the basis of the conceptual content or meaning of a document. For example, the query automobile will pick up documents that do not contain automobile, but that do contain car or perhaps driver.


5.4

Inverted File Indexing

Inverted file indices are probably the most common method used for indexing documents. Figure 3 shows the structure of an inverted file index. It consists first of a lexicon with one entry for every term that appears in any document; we will discuss later how the lexicon can be organized. For each item in the lexicon, the inverted file index has an inverted file entry (or posting list) that stores a list of pointers (also called postings) to all occurrences of the term in the main text. Thus, to find the documents containing a given term, we need only look up the term in the lexicon and then grab its posting list. Boolean queries involving more than one term can be answered by taking the intersection (conjunction) or union (disjunction) of the corresponding posting lists; a sketch of the intersection appears after Figure 3.

We will consider the following important issues in implementing inverted file indices:

How to minimize the space taken by the posting lists?

How to access the lexicon efficiently, allowing for prefix and wildcard queries?

How to take the union and intersection of posting lists efficiently?


Figure 3: Structure of Inverted Index
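A minimal Java sketch of the conjunctive case: intersecting two sorted posting lists with a linear merge (the posting list for "elephant" reappears in Section 5.5; the "tiger" list is illustrative).

import java.util.*;

// Minimal sketch of answering a Boolean AND query by intersecting two
// sorted posting lists with a linear merge, as described above. Lists
// hold document numbers in ascending order.
public class PostingListOps {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j])      { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j])  i++;
            else                   j++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] elephant = {3, 5, 20, 21, 23, 76, 77, 78};
        int[] tiger    = {5, 21, 40, 77};
        System.out.println(intersect(elephant, tiger)); // [5, 21, 77]
    }
}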

5.5

Inverted File Compression


The total size of the posting lists can be as large as the document data itself. In fact, if the granularity of the posting lists is such that each pointer points to the exact location of the term in the document, then we can in effect recreate the original documents from the lexicon and posting lists (i.e., they contain the same information). By compressing the posting lists we can both reduce the total storage required by the index and at the same time potentially reduce access time, since fewer disk accesses will be required and/or the compressed lists can fit in faster memory. This has to be balanced against the fact that any compression of the lists is going to require on-the-fly decompression, which might increase access times. In this section we discuss compression techniques that are quite cheap to decompress on the fly. The key to compression is the observation that each posting list is an ascending sequence of integers (assume each document is indexed by an integer). The list can therefore be represented by an initial position followed by a list of gaps or deltas between adjacent locations.

For example:

original posting list: elephant: [3, 5, 20, 21, 23, 76, 77, 78]

posting list with deltas: elephant: [3, 2, 15, 1, 2, 53, 1, 1]

The advantage of using the deltas is that they can usually be compressed much better than the indices themselves, since their entropy is lower. To implement compression on the deltas we need some model describing their probabilities. Based on these probabilities we can use standard Huffman or arithmetic coding to code the deltas in each posting list. Models for the probabilities can be divided into global or local models (whether the same probabilities are given to all lists or not) and into fixed or dynamic models (whether the probabilities are fixed independent of the data or change based on the data).
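A minimal Java sketch of the gap idea follows. In place of the Huffman or arithmetic codes discussed above, it uses the simpler variable-byte code purely to show how the small deltas shrink the list; the encoding choice is ours, not the report's.

import java.io.ByteArrayOutputStream;

// Minimal sketch of posting-list compression via gaps. Instead of the
// Huffman/arithmetic codes discussed above, this uses a variable-byte
// code (7 data bits per byte, high bit marks the last byte) purely to
// show how small gaps shrink the list.
public class GapCompression {
    static byte[] compress(int[] postings) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int p : postings) {
            int gap = p - prev;          // deltas between adjacent postings
            prev = p;
            while (gap >= 128) { out.write(gap & 0x7F); gap >>= 7; }
            out.write(gap | 0x80);       // high bit terminates the number
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] elephant = {3, 5, 20, 21, 23, 76, 77, 78};
        System.out.println(elephant.length * 4 + " bytes -> "
                + compress(elephant).length + " bytes");
    }
}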



5.6

Representing and Accessing Lexicons


There are many ways to store the lexicon. Here we list some of them:

Sorted -- just store the terms one after the other in a sorted array

Tries -- store terms in a trie data structure

B-trees -- well suited for disk storage

Perfect hashing -- assuming the lexicon is fixed, a perfect hash can be calculated

Front coding -- stores terms sorted but does not repeat the front part of terms; requires much less space than a simple sorted array (see the sketch below)

When choosing among these methods one needs to consider both the space taken by the data structure and the access time. Another consideration is whether the structure allows for easy prefix queries (e.g., all terms that start with wux). Of the above methods, all except perfect hashing allow for easy prefix searching, since terms with the same prefix appear adjacently in the structure. Wildcard queries (e.g., w*x) can be handled in two ways. One way is to use n-grams, by which fragments of the terms are indexed (adding a level of indirection). Another way is to use a rotated lexicon.
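A minimal Java sketch of front coding over a small sorted term list; encoding each entry as a (shared-prefix length, suffix) pair is the standard idea, with details chosen for illustration.

import java.util.*;

// Minimal sketch of front coding a sorted lexicon: each term stores how
// many leading characters it shares with its predecessor plus only the
// differing suffix.
public class FrontCoding {
    record Entry(int sharedPrefix, String suffix) {}

    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int k = 0;
            while (k < prev.length() && k < term.length()
                    && prev.charAt(k) == term.charAt(k)) k++;
            out.add(new Entry(k, term.substring(k)));
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        // "compress" is stored once; later terms keep only their suffixes
        System.out.println(encode(List.of(
                "compress", "compressed", "compression", "computer")));
    }
}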












6

IMPLEMENTATION


The main idea behind this project is to create indexes of keywords from web pages and store them in data structures. We then measure the time complexity and space complexity of storing and searching the keywords in these data structures.

This project contains three main modules, the crawler module, the parser module, and the indexer module, as shown in Figure 4.
























Figure 4: Architecture of the implementation of the indexer module. The crawler takes a URL and outputs the list of crawled URLs to out.txt; the HTML parser takes each URL and parses the HTML page to remove all HTML tags; the text parser takes the page contents from the HTML parser and retrieves the keywords; the indexer builds the index of documents with keywords and occurrences, creates the inverted index, and stores the keywords in the data structures.

The crawler module takes as input a root URL and the number of levels that it needs to go down. In order to test our system with a changing number of keywords, we use the option of crawling to different levels to obtain a variable number of web pages. This module uses a breadth-first-search approach to get URLs and visits these URLs to get the web page content. The crawler visits the root URL and looks for links on the web page at that URL. It stores the links on that page in a queue and visits these pages in sequence. The crawler then outputs the visited URLs into a file, out.txt. In traditional search engines, the web page content is also downloaded, compressed, and stored in a repository. Our crawler only gets the URLs and does not store the web page content, as we have very few web pages and we are not implementing a query processor that would need the documents for returning results to the user.

The next module is the parser module. The parser module reads the file output by the crawler, containing all the URLs visited. It takes each URL and processes the page to remove the HTML tags first. The content is then text-parsed to extract all the words on that page. The words that are extracted are then processed some more to remove stop words, and stemmed to save each word as its root. Stop words are words that occur very often in the documents and do not assist in any way in discriminating one document from another. The Porter stemming algorithm is used to stem the ends of the words; we have not used any algorithm to remove prefixes from the words. These words are then stored in the forward indexing structure. The forward indexing structure stores the URL of the page visited, a list of the keywords processed, and the number of times each keyword occurs in that document.

The indexer module takes the forward indexing structure and processes it to create an inverted index structure that stores the keywords along with a list of documents that each keyword occurs in. Ideally the location of the keyword in each document should also be stored, but we have ignored that aspect as we are not interested in displaying the result to the user. Our main aim is to study the time taken to build this inverted index and the time it takes to search for keywords in this structure.
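A minimal Java sketch of this inversion step, using illustrative document names and counts: the parser's forward index (URL to keyword counts) is flipped into keyword to per-document occurrences.

import java.util.*;

// Minimal sketch of the indexer step described above: turning the
// parser's forward index (URL -> keyword counts) into an inverted index
// (keyword -> documents it occurs in, with occurrence counts).
public class Inverter {
    public static void main(String[] args) {
        Map<String, Map<String, Integer>> forward = new LinkedHashMap<>();
        forward.put("doc1", Map.of("search", 2, "engine", 1));
        forward.put("doc2", Map.of("engine", 3, "index", 1));

        // keyword -> (document -> occurrences)
        Map<String, Map<String, Integer>> inverted = new TreeMap<>();
        for (var page : forward.entrySet())
            for (var word : page.getValue().entrySet())
                inverted.computeIfAbsent(word.getKey(), k -> new LinkedHashMap<>())
                        .put(page.getKey(), word.getValue());

        System.out.println(inverted);
        // {engine={doc1=1, doc2=3}, index={doc2=1}, search={doc1=2}}
    }
}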

We have used three indexing structures. The first is a simple sorted array that stores the keyword as the key for sorting; we use a binary search technique to find the keywords. The second index structure is a hash table; here again we use the keyword as the key for hashing into the hash table. The third data structure is the BTree. We have implemented a 2-4 BTree that stores a minimum of 2 keywords in each of its nodes and a maximum of 4.


7

ANALYSIS AND RESULTS


We compared the efficiency of three different data structures with respect to the inverted file indexing algorithm. As explained in the implementation section (Section 6), a crawler is used that can retrieve links contained in web pages. We first use the crawler to retrieve a set of pages to be indexed. An HTML parser is given the list of URLs or web pages. The HTML parser then parses the files and retrieves only the strings in each web page. This set of strings from a web page is then passed on to a text parser that performs stemming and uses a stop list to remove words. The text parser then yields the set of words that shall be indexed using the inverted file indexing algorithm.

We can set the crawler to visit pages to varying depths, thus allowing us to vary the number of keywords that are indexed. We initiated the crawler with different starting URLs such as http://www.cnn.com, http://www.nbc.com, etc. The depth used by the crawler was also varied so as to vary the number of keywords. With a depth of 1, the crawler generates the set of all URLs available on the home page of the website.

Once a set of keywords is generated, the number of keywords is calculated. We compare the performance of the different data structures based on the time needed to create the index. A tester program retrieves a set of keywords from a set of URLs and uses three different data structures, sorted array, hash table, and BTree, to create indexes using the inverted file indexing algorithm.

In the following sections the results on the performance of each of the data structures are presented.

7.1

Inverted File Indexing Using Sorted Array

To create an inverted index using a sorted array, initially an index of the following format is created, as shown in Figure 5.

URL    | Words found on the web page
URL1   | word1, word2, ..., word n
URL2   | word1, word2, ..., word n
URL3   | ...
...    | ...
URLn   | word1, word2, ..., word n

Figure 5: Forward index of URLs with the lists of keywords that appear in them

To create an inverted index from the above structure, we visit each URL in the URL list and then every word contained in the URL. If a word has never been indexed, it is inserted into a sorted array and a pointer is placed to the list of URLs that contain this word. If the word is already present in the sorted array, then the pointer that points to the URLs is updated to add the new URL. Eventually the inverted index has the structure shown in Figure 6.






























Figure 6: Inverted index using sorted array


The plot in Figure 7 shows the performance of the sorted array structure when used in the inverted file indexing algorithm. The X-axis represents the number of keywords indexed, and the Y-axis represents the time required to create an inverted index based on the sorted array data structure.




[Figure 6 content: a sorted array of words (quick, brown, fox, over, lazy, dog, ...), each entry pointing to its postings list of indexes into the URL list. Figure 7 plot data: index-creation times for 394 to 26,616 keywords.]


Figure 7: Plot of performance of the sorted array


As can be seen from the plot, the curve is close to linear. The time required to create the sorted array index is quite large because every time a keyword needs to be indexed, a binary search decides where the word should be placed; this is a relatively expensive operation and must be performed for every keyword. The other operation that needs to be performed is retrieving and updating the postings list for a keyword that has been indexed before, or creating and initializing a postings list for a new keyword.
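A minimal Java sketch of this insertion path, using the library binary search; the parallel-list layout is an illustrative choice, not necessarily the project's exact code.

import java.util.*;

// Minimal sketch of the sorted-array index: each insertion binary-searches
// for the keyword's position; a miss shifts the array to make room, which
// is the per-keyword cost discussed above.
public class SortedArrayIndex {
    List<String> terms = new ArrayList<>();           // kept sorted
    List<List<Integer>> postings = new ArrayList<>(); // parallel to terms

    void add(String keyword, int docId) {
        int pos = Collections.binarySearch(terms, keyword);
        if (pos >= 0) {                               // keyword already indexed
            List<Integer> list = postings.get(pos);
            if (list.get(list.size() - 1) != docId) list.add(docId);
        } else {                                      // insert at -(pos)-1
            int at = -pos - 1;
            terms.add(at, keyword);
            postings.add(at, new ArrayList<>(List.of(docId)));
        }
    }

    public static void main(String[] args) {
        SortedArrayIndex idx = new SortedArrayIndex();
        idx.add("quick", 4); idx.add("brown", 2); idx.add("quick", 8);
        System.out.println(idx.terms + " " + idx.postings);
        // [brown, quick] [[2], [4, 8]]
    }
}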

7.2

Inverted File Indexing Using Hash Table

To create an inverted file index using a hash table, the Java class library was used. The class Hashtable implements a hash table which maps keys to values. Here the keywords represent the keys and the values are the lists of URLs that contain each keyword. Any non-null object can be used as a key or as a value. The hash table is open: in the case of a hash collision, a single bucket stores multiple entries, which must be searched sequentially. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. To create the inverted index, for each URL a structure similar to that of Figure 5 is used. Each URL is visited sequentially and the words contained in it are read. Each keyword, representing a key, is then hashed using the hashing function. The Java hashing function seemingly satisfies the requirements of a good hashing function. The structure of an inverted file index with a hash table is shown in Figure 8.





Figure 8: Hash table implementation (keyword key -> hash code H(K) -> posting list inserted at that hash code)

Figure 9: Inverted index structure of the hash table (keywords being indexed hash to codes 0-16, each slot holding the keyword's postings list of URL-list indexes)


The plot in Figure 10 shows the performance of the hash table when used in the inverted file indexing algorithm. The X-axis represents the number of keywords indexed, and the Y-axis represents the time required to create an inverted index based on the hash table data structure.


[Figure 10 plot data: index-creation times for 394 to 28,585 keywords, all under 300 ms.]


Figure 10: Plot of performance of the hash table


As can be seen from the plot, the curve is sublinear. The time required to create the hash table is small because, as opposed to the sorted array, where every keyword insertion requires a binary search to decide where the word should be placed, the hash table requires only a simple hash function to compute the hash code, which serves as an index into an array where the posting list is stored. The other operation that needs to be performed is retrieving and updating the postings list for a keyword that has been indexed before, or creating and initializing a postings list for a new keyword. Because of the absence of the binary search operation, this is a very efficient data structure for creating the inverted index.
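A minimal Java sketch of the same insertion path with the library Hashtable class mentioned above; the sample words and postings are illustrative.

import java.util.*;

// Minimal sketch of the hash-table index: the keyword is the key and the
// postings list is the value, so each insertion costs one hash lookup
// instead of a binary search. Hashtable mirrors the report's use of the
// Java class library.
public class HashTableIndex {
    public static void main(String[] args) {
        Hashtable<String, List<Integer>> index = new Hashtable<>();
        int[][] docs = { {1, 3, 7}, {2, 4, 6} };      // illustrative postings
        String[] words = { "fox", "brown" };

        for (int w = 0; w < words.length; w++)
            for (int doc : docs[w])
                index.computeIfAbsent(words[w], k -> new ArrayList<>()).add(doc);

        System.out.println(index.get("fox"));         // [1, 3, 7]
    }
}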

7.3

Inverted File Indexing Using BTrees


A BTree is a balanced tree (all leaf nodes are at the same level) in which all the nodes are sorted based on a key. Each node, except the root, stores a maximum of m keys and a minimum of m/2 keys. If a node holds t keys, then it has t+1 children, representing the ranges of values that it can store.

At each insertion and deletion, the tree is restructured so that it remains height-balanced. There are two ways the tree can be restructured. The first is to add the new keyword into the slot it should occupy; if the node is then over its size limit, split the node into two nodes and pass the middle keyword up to the parent of the full node. This continues up to the root, so that no node is left over-full and the end result is a balanced tree. This method makes two passes over the tree, as in an AVL tree, where the node is first added and then split if full. The other way to insert a new key is to start at the root and split every full node (one holding m keys) encountered on the way down; this way we traverse the tree only once while creating a place for the new key. The BTree class we have implemented uses this second method. We have not concentrated on deleting keywords from the tree. In the index structure that we have created using the BTree, the keyword is used as the key to sort the nodes. Each keyword has two lists associated with it: one contains indexes to all the documents that contain that keyword, and the second contains the occurrence list, i.e., the number of times the keyword occurs in each document.

The index in the BTree is shown in the diagram. Each node contains a minimum of 2 keywords and a maximum of 4 keywords. (The associated document and occurrence lists for each keyword are not shown.) The structure of the BTree is shown in Figure 11.

An example of a node entry would be:

keyword:   document
urlList:   [doc1, doc3, doc4]
occurList: [1, 2, 1]

This means that the keyword "document" occurs in doc1 one time, in doc3 two times, and in doc4 one time.
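The sketch below mirrors this node payload in Java. It uses the library TreeMap (a balanced red-black tree) as a stand-in for this report's hand-built 2-4 BTree, since both keep keywords sorted in a balanced tree; the BTree itself is not reproduced here.

import java.util.*;

// Minimal sketch of the node payload described above, using TreeMap (a
// balanced red-black tree from the Java library) as a stand-in for the
// hand-built 2-4 BTree: keywords are the sort keys, and each keyword
// carries a parallel urlList and occurList.
public class BalancedTreeIndex {
    static class Entry {
        List<String> urlList = new ArrayList<>();
        List<Integer> occurList = new ArrayList<>();
    }

    public static void main(String[] args) {
        TreeMap<String, Entry> index = new TreeMap<>();

        Entry e = index.computeIfAbsent("document", k -> new Entry());
        e.urlList.addAll(List.of("doc1", "doc3", "doc4"));
        e.occurList.addAll(List.of(1, 2, 1));

        // "document" occurs once in doc1, twice in doc3, once in doc4
        System.out.println(index.get("document").urlList + " "
                + index.get("document").occurList);
    }
}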

















Figure 11: Structure of the BTree created


The following plot in figure 12 shows the performance of the BTree when used in the
inverted file indexing algorithm. The X
-
axis represents the number of keywords indexed,
and the Y
-
axis
represents the time required to create an inverted index based on the
BTree data structure.






[Figure 11 content: a sample BTree over keywords such as abet, alpha, bat, cut, doc, sit, util, vest, wet, zip. Figure 12 plot data: index-creation times for 394 to 26,616 keywords, up to about 8,000 ms.]


Figure 12: Plot of performance of the BTree


As can be seen from the plot, the curve is sublinear. The time required to create the index using the BTree is of the order of O(m log_m n), where m is the order of the BTree and n is the total number of keywords that we have indexed. The maximum depth of the BTree is always log_{m/2} n, since each node must hold a minimum of m/2 keywords.

7.4

Comparative analysis of all three data structures

The plot in Figure 13 shows the performance of the hash table, the sorted array, and the BTree when used in the inverted file indexing algorithm. The X-axis represents the number of keywords indexed, and the Y-axis represents the time required to create an inverted index using each of the three data structures. Three different colors distinguish the lines corresponding to the hash table, the sorted array, and the BTree.



[Figure 13 plot data: index-creation times for the sorted array, hash table, and BTree over 394 to 26,616 keywords.]


Figure 13: Comparative plot of all three data structures


As is quite evident from the comparison plot of the three data structures, the hash table outperforms the sorted array and the BTree in terms of the time required to create the inverted index. This can be attributed to the fact that when inserting a new keyword into the index, a minimal amount of computing time is required in the case of the hash table. Compare this to the binary search algorithm that the sorted array requires to find the correct place to insert the keyword: binary search searches a sorted array by repeatedly dividing the search interval in half, beginning with an interval covering the whole array. If the value of the search key is less than the item in the middle of the interval, the interval is narrowed to the lower half; otherwise it is narrowed to the upper half. This check is repeated until the value is found or the interval is empty. It runs in O(log N), where N is the size of the array, in this case the size of the lexicon in the index.

The hash table complexity depends on the hash function and the collision resolution strategy, but in this case is constant, O(1). Some open addressing schemes may suffer from clustering more than others. So it is evident that if we use a hashing function that minimizes collisions and a good resolution strategy, outperforming the sorted array is an easy task.

A BTree is a balanced search tree in which every node has between m/2 and m children, where m > 1 is a fixed integer; the root may have as few as 2 children, and the leaf nodes have no children. Inserting a keyword into a BTree has complexity O(m log_m n), where m is the order of the tree and n is the total number of keywords being indexed.


7.5

Search and retrieval efficiency of the data structures

Even though the project has focused on comparing the data structures in terms of the time required to create the index, the search and retrieval efficiency and the memory requirements of the data structures warrant discussion. Searching in a hash table is O(1), i.e., constant time, and is extremely efficient.

Searching in a BTree is O(log_{m/2} n), where n is the total number of keywords indexed. The advantage of using the BTree is that it is balanced, hence its height remains small and predictable.

Searching and retrieval on a sorted array requires O(log n) operations, because it needs the binary search algorithm for retrieval.

7.6

BTrees and external memory

The payoff of the BTree insert and delete rules is that B-trees are always balanced. Searching an unbalanced tree may require traversing an arbitrary and unpredictable number of nodes and pointers. Searching a balanced tree means that all leaves are at the same depth; there is no runaway pointer overhead. Indeed, even very large BTrees guarantee that only a small number of nodes must be retrieved to find a given key. For example, a B-tree of 10,000,000 keys with 50 keys per node never needs to retrieve more than 4 nodes to find any key.

This is a good structure if much of the tree is in slow memory (disk), since the height, and hence the number of accesses, can be kept small, say one or two, by picking a large m. BTrees are especially useful for search structures stored on disk, as disks have different retrieval characteristics than internal memory (RAM). Obviously, disk access is much, much slower. Furthermore, data is arranged in concentric circles (called tracks) on each side of a disk platter. (Most disks these days have a single platter, but some disks are a stack of platters.) A disk is read by read/write heads mounted on an arm that is moved in and out from track to track. Moving that arm takes time, so there is a real timing benefit to grouping data so that it can be read without moving the arm. The amount of data that can be read without moving the arm (from both sides of all platters) is called a cylinder. It is much faster to read an entire cylinder than to read a little, move the arm, read a little more, move the arm, etc., even if the total amount of data in a cylinder is much more than we need.

BTrees are a good match for on-disk storage and searching because we can choose the node size to match the cylinder size. In doing so, we will store many data members in each node, making the tree flatter, so fewer node-to-node transitions will be needed.

8

FUTURE RESEARCH DIRECTIONS


This work has analyzed the performance of different data structures when used to build a text index using the inverted file indexing algorithm. The metric used for the comparison was the time required to build an index. This work could be expanded in the following ways:

Using other metrics to compare the data structures, for example the space footprint of the index and the time required to search for a keyword in the index.

Analyzing the efficiency of data structures such as kd-trees and tries within the context of the inverted file indexing algorithm.

Evaluating different text indexing algorithms such as signature files, LSI, and the vector space model. Different metrics can be used for this analysis.

Analyzing indexing algorithms for image and video retrieval.

Text indexes are compressed to save space. Analysis of compression algorithms like Huffman coding, and of searching compressed indexes, is another interesting research topic.

9

CONCLUSION


We achieved the goals that we had set for this project. We have gained a sound understanding of search engine technology and information retrieval techniques, particularly text indexing. We have studied in depth the inverted file indexing algorithm and related data structures: hash tables, BTrees, and sorted arrays. From the performance analysis of the inverted file indexing algorithm and data structures, we can conclude that efficient algorithms and data structures are the key to efficient search engines. Google's PageRank algorithm, which revolutionized search engine technology, also bears testament to this fact. This work also enumerates future work based on this project.




10

REFERENCES



[Huan00] L. Huang. A survey on web information retrieval technologies. Tech. rep., ECSL, 2000.

[Aras01] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1, p. 2-43, 2001.

[Brin98] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International WWW Conference, 1998.

[Najo01] M. Najork and J. L. Wiener. Breadth-First Search Crawling Yields High-Quality Pages. In Proceedings of the Tenth International World Wide Web Conference, pages 114-118, May 2001.

[Bent97] J. Bentley and R. Sedgewick. Fast Algorithms for Sorting and Searching Strings. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, January 1997, pages 360-369.

[Have99] T. Haveliwala. Efficient Computation of PageRank. Stanford Technical Report 2000-36.

[Brod00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph Structure in the Web. In Proceedings of the Ninth International World Wide Web Conference (WWW9), 2000.

[Cho00] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the Twenty-sixth International Conference on Very Large Databases, 2000. Available at http://www-diglib.stanford.edu/cgi-bin/get/SIDL-WP-1999-0129.

[Kwon00] A. Kwong and M. Gertz. Improving the Quality of a Web Page Index. Department of Computer Science, University of California, Davis, 2000.

[Manb90] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms, pages 319-327, 1990.

[Salt89] G. Salton. Automatic Text Processing. Addison-Wesley, Reading, Mass., 1989.

[Witt94] I. H. Witten. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.

[Falo84] C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM Transactions on Office Information Systems, 2(4):267-288, October 1984.

[Aho83] A. Aho, J. Hopcroft, and J. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.

[Bhar98] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the web. In Proceedings of the Seventh International World-Wide Web Conference, April 1998.

[Klei99] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, November 1999.

[Page98] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Computer Science Department, Stanford University, 1998.

[Maud98] M. Maudlin. A history of search engines. 1998. http://www.wiley.com/compbooks/sonnenreich/history.html.

[Herr99] S. Davis Herring. The value of interdisciplinarity: A study based on the design of Internet search engines. Journal of the American Society for Information Science, 50(4):358-365, 1999.