Towards a Highly-Scalable and Effective Metasearch Engine







By Zonghuan Wu, Weiyi Meng, Clement Yu, Zhuogang Li














Presented By:

Suganya Ravikumar

Jitendra Bethina

Mira Chokshi

Vijayaram Bethina










Introduction:

In recent years, the World Wide Web has become an important information source. Most of the data on the Web is in the form of text. Many search engines have been created to find the desired data on the Web in a timely and cost-effective way. There are two types of search engines. General-purpose search engines aim to index all pages on the Web. Special-purpose search engines focus on specific domains, such as the documents of an organization or of a specific subject area.

The amount of data on the Web is huge. By February 1999 there were more than 800 million documents publicly indexable on the Web [Lawrence and Giles 1999]; the number is now well over 1.3 billion and is growing ever faster. Given such an enormous amount of data, it is unrealistic to assume that a single general-purpose search engine will retrieve most useful documents: first, its processing power may not scale to the rapidly increasing and essentially unlimited amount of data, and second, gathering all data on the Web and keeping it up-to-date is extremely difficult. A metasearch engine is built to overcome these limitations.

A metasearch engine is a system that provides unified access to multiple local search engines. A metasearch engine does not maintain its own index of documents, but it maintains representative information about the contents of the underlying local search engines in order to provide a better service.

A Simple (2-Level) Metasearch Engine Architecture:

[Figure: a Global Interface sitting on top of the local search engines Search Engine 1, Search Engine 2, ..., Search Engine N.]

Reasons for developing a metasearch engine:

1. Increased Search Coverage of the Web: A study by Lawrence and Giles (1999) shows that the average coverage of a single general-purpose search engine is decreasing steadily as the number of documents on the Web grows. According to Bergman (2000), while the largest general-purpose search engine can index about 2 billion web pages, all special-purpose search engines combined may index up to 500 billion web pages.

2. Increased Scalability of Searching the Web: Employing a single general-purpose search engine for the entire Web scales poorly. Since a typical special-purpose local search engine is much smaller than a general-purpose search engine, it is much easier to keep its index data up-to-date. It is also easier to provide the hardware and software support for a special-purpose search engine.

3. Improved Retrieval Effectiveness: When a user searches for documents related to a specific subject area, both special-purpose and general-purpose search engines return documents relevant to the user's query, but the general-purpose search engine may also return many unrelated documents, hindering the effective retrieval of useful documents.












Goal of this Paper:

When a metasearch engine receives a user query, it parses the query, selects appropriate databases with a database selection algorithm, collects the documents retrieved from the underlying databases, and reorganizes the results received from them. Selecting the right databases and collecting and merging the retrieved results are the main challenges in building a metasearch engine. Many methods have been proposed in the past to meet these challenges. This paper focuses on a new approach to perform database selection and collection fusion. The method builds on a framework for ranking databases optimally developed in previous research papers, some of them contributed by the authors themselves. A new technique is proposed to rank the databases, and experiments are conducted to support this new approach.




















Related Work:

Most of the existing approaches for database selection rank the databases for a given query based on certain usefulness measures:

- gGlOSS uses the sum of the document similarities that are higher than a threshold.
- CORI Net uses the probability that a database contains relevant documents due to the terms in a given query.
- D-WISE uses the sum of weighted document frequencies of the query terms.
- Q-Pilot uses the dot-product similarity between an expansion query and a database description.

All of these database-ranking methods are heuristics; they are not designed to produce optimal orders based on some optimality criterion. The framework used in this paper instead ranks a database based on the similarity of the most similar document in the database.

For collection fusion, earlier approaches use weighted allocation to retrieve documents, that is, they retrieve proportionally more documents from databases with higher ranking scores (e.g. CORI Net, D-WISE, ProFusion and MRDD). Some approaches then use adjusted local similarities of documents to merge the retrieved documents (e.g. D-WISE and ProFusion). These collection-fusion approaches are also heuristics and do not guarantee the retrieval of all potentially useful documents for a given query.

The Inquirus metasearch engine uses the real global similarities of documents to merge retrieved documents. The advantage is that a high quality of merging is achieved; the disadvantage is that, in order to compute the global similarities, the documents must be fetched to the metasearch engine.





Framework for Database Selection and Collection Fusion:

The framework for database selection and collection fusion used in this paper is previous work by the authors; it ranks the databases optimally using the similarity of the most similar document in each database.

Necessary and Sufficient Condition: Ranking the databases in descending order of the similarity of the most similar document in each database is a necessary and sufficient condition for ranking the databases optimally for retrieving the m most similar documents across all databases, for any positive integer m.

Global Similarity: When the inverse document frequency (idf) weight of each term is computed from the global df of the term (i.e. the number of documents containing the term across all databases), the computed similarities are global similarities.

The framework for optimal database selection is summarized as follows:

Proposition 1: Databases D_1, D_2, ..., D_N are optimally ranked in the order [D_1, D_2, ..., D_N] with respect to a given query q if and only if msim(q, D_1) > msim(q, D_2) > ... > msim(q, D_N), where msim(q, D_i) is the global similarity of the most similar document in database D_i with respect to the query q.

Based on the optimal order of the databases [D_1, D_2, ..., D_N], an algorithm known as OptDocRetrv was developed to perform database selection and collection fusion. Suppose the top s databases are selected. Each of the selected search engines returns the actual global similarity of its most similar document to the metasearch engine, which computes the minimum of these s values, denoted min_sim. Each of the s search engines then returns to the metasearch engine those documents whose global similarities are greater than or equal to min_sim. If m or more documents are returned from the s search engines, they are sorted in descending order of similarity and the first m documents are returned to the user; otherwise, more databases are selected and the process is repeated.

The authors' previous paper, "A Methodology for Retrieving Text Documents from Multiple Databases", shows that if the databases are ranked optimally, then the algorithm OptDocRetrv guarantees the retrieval of all of the m desired documents.
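To make the control flow concrete, here is a minimal Python sketch of the retrieval loop described above. It is not the authors' code; the engine objects and their top_similarity / docs_at_least methods, as well as the batch size, are hypothetical stand-ins for the interaction with the local search engines.

```python
# Illustrative sketch of the OptDocRetrv idea (not the authors' implementation).
# Assumed (hypothetical) engine interface:
#   engine.top_similarity(query)          -> global similarity of its most similar document
#   engine.docs_at_least(query, min_sim)  -> [(doc_id, global_similarity), ...]

def opt_doc_retrv(query, ranked_engines, m, step=2):
    """Retrieve the m most similar documents across optimally ranked engines."""
    results = []
    s = 0
    while s < len(ranked_engines):
        s = min(s + step, len(ranked_engines))   # consult the next batch of top-ranked engines
        selected = ranked_engines[:s]
        # Each selected engine reports the global similarity of its most similar
        # document; min_sim is the smallest of these values.
        min_sim = min(e.top_similarity(query) for e in selected)
        # Each selected engine returns all documents whose global similarity >= min_sim.
        results = [(sim, doc_id)
                   for e in selected
                   for doc_id, sim in e.docs_at_least(query, min_sim)]
        # Stop once at least m documents have been collected; otherwise select more engines.
        if len(results) >= m:
            break
    results.sort(reverse=True)                   # descending global similarity
    return [doc_id for _, doc_id in results[:m]]
```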

New Database Ranking Measure

Advantages:

The advantages of this method are:

a) It uses less information than estimating the similarity of the most similar document.

b) Its ranking matches well with the ranking produced by using the similarity of the most similar document.


Description:

Terms used:

mnw_{i,j} = max{ d_i / |d| }, the maximum normalized weight of term t_i in local database D_j, where d = (d_1, d_2, ..., d_n) is the term-weight vector of a document in D_j and the maximum is taken over all documents in D_j.

anw_{i,j} = the average normalized weight of term t_i over the documents in D_j.

gidf_i = the global inverse document frequency weight of term t_i.

q = the user query, with term weights (q_1, ..., q_n).

q' = (q_1 * gidf_1, ..., q_n * gidf_n), the globally weighted query vector.
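The following toy sketch shows one way the per-database statistics mnw and anw could be computed. It is not the authors' code; it assumes documents are given as term-weight dictionaries, uses the Euclidean norm for |d|, and assumes anw averages over all documents in the database (including those not containing the term).

```python
# Illustrative sketch of the per-database statistics defined above (not the authors' code).
import math
from collections import defaultdict

def database_representative(documents):
    """documents: list of dicts {term: weight} for one database D_j. Returns (mnw, anw)."""
    max_nw = defaultdict(float)   # mnw_{i,j}: maximum normalized weight of each term
    sum_nw = defaultdict(float)   # running sum used to compute anw_{i,j}
    n_docs = len(documents)
    for doc in documents:
        norm = math.sqrt(sum(w * w for w in doc.values()))   # |d|
        if norm == 0.0:
            continue
        for term, w in doc.items():
            nw = w / norm                                     # normalized weight d_i / |d|
            max_nw[term] = max(max_nw[term], nw)
            sum_nw[term] += nw
    # Assumption: the average is taken over all documents in D_j.
    avg_nw = {t: s / n_docs for t, s in sum_nw.items()}
    return dict(max_nw), avg_nw

# Toy usage:
docs = [{"metasearch": 3, "engine": 1}, {"engine": 2, "database": 2}]
mnw, anw = database_representative(docs)
```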



Formula:

The global similarity of the most similar document in database D_j with respect to query q is estimated as:

    msim(q, D_j) = max_i { q_i * gidf_i * mnw_{i,j} + sum_{k != i} q_k * gidf_k * anw_{k,j} } / |q'|

The two parts of the formula represent the maximum normalized weight of one query term and the average normalized weights of the remaining query terms. The maximum normalized weight is in general much larger than the average normalized weight, and query terms usually have equal weights (a term usually appears only once in a query), so the formula can be simplified to the following simpler one.
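A small sketch of this estimate, under the same assumptions as the previous example (dictionaries mnw, anw and gidf keyed by term, and the query given as a term-weight dictionary). It is illustrative only, not the authors' implementation.

```python
# Illustrative sketch of the msim estimate above (not the authors' code).
import math

def estimated_msim(query_weights, mnw, anw, gidf):
    """Estimate the global similarity of the most similar document in one database."""
    # |q'|: norm of the globally weighted query vector
    q_norm = math.sqrt(sum((w * gidf.get(t, 0.0)) ** 2 for t, w in query_weights.items()))
    if q_norm == 0.0:
        return 0.0
    best = 0.0
    for t_i, w_i in query_weights.items():
        # One query term contributes its maximum normalized weight ...
        total = w_i * gidf.get(t_i, 0.0) * mnw.get(t_i, 0.0)
        # ... and every other query term contributes its average normalized weight.
        for t_k, w_k in query_weights.items():
            if t_k != t_i:
                total += w_k * gidf.get(t_k, 0.0) * anw.get(t_k, 0.0)
        best = max(best, total)
    return best / q_norm
```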

Adjusted maximum normalized weight of a term:

    am_{i,j} = gidf_i * mnw_{i,j}

The ranking score of database D_j for query q under the new measure, with time complexity linear in the number of query terms, is:

    Rs(q, D_j) = max_i { q_i * am_{i,j} }
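The simplified score can then be computed with one lookup per query term, as in this illustrative sketch (again not the authors' code; it reuses the hypothetical dictionaries from the previous examples).

```python
# Illustrative sketch of the simplified ranking score Rs(q, D_j) (not the authors' code).

def adjusted_max_weights(mnw, gidf):
    """am_{i,j} = gidf_i * mnw_{i,j} for every term in database D_j."""
    return {term: gidf.get(term, 0.0) * w for term, w in mnw.items()}

def ranking_score(query_weights, am):
    """Rs(q, D_j) = max over the query terms of q_i * am_{i,j}."""
    scores = [q_w * am.get(term, 0.0) for term, q_w in query_weights.items()]
    return max(scores) if scores else 0.0

# Databases are then ranked in descending order of ranking_score and passed to an
# OptDocRetrv-style selection step as sketched earlier.
```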






Integrated Representative of Databases

The main requirement for computing the score is that the adjusted maximum normalized weight of each term in each database must be obtained and stored. These statistics can be obtained easily when the database is under the control of the information retrieval system's developer, when the documents can be retrieved independently, or when the local search engines are cooperative. If the data cannot be obtained directly, query sampling techniques can be used.

Assuming that the adjusted maximum normalized weights are available, a database representative containing these weights for each term in the database is created for each database. During query processing, the metasearch engine uses the query information and the database representatives to compute the ranking score of each database. The OptDocRetrv algorithm is then used to select databases and retrieve the documents.

This representative is scalable in the sense that it stores only one datum per term. Still, the scalability becomes questionable when thousands of local search engines are considered, because storing and scoring even this single value per term and per database would require a huge amount of storage and processing.

To solve this problem, only the r largest adjusted maximum normalized weights of each term across all local databases are stored in a single integrated representative (see the sketch below). This is justified on the basis that storage preference is given to the information in the databases that are most relevant for each potential query term; the computation of scores is likewise restricted to the databases that have an entry for a query term. The authors demonstrate the scalability by showing that for 10 million distinct terms the storage requirement is only about 1.7 GB. They also give Proposition 2, which states that when the number of documents m desired by the user for a single-term query does not exceed the number r of adjusted weights stored per term, all of the m most similar documents are contained in the r local databases listed in the integrated representative. The argument is that if the databases are ranked in descending order of the maximum normalized weights of the query term, they are also ranked optimally for that query.
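A minimal sketch of the idea: keep only the r largest adjusted weights per term and rank databases from that integrated structure. The data layout and function names are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the integrated representative described above (not the authors' code).
import heapq
from collections import defaultdict

def build_integrated_representative(db_adjusted_weights, r):
    """db_adjusted_weights: {db_id: {term: am_{i,j}}}. Keeps the r largest weights per term."""
    per_term = defaultdict(list)                       # term -> [(am, db_id), ...]
    for db_id, weights in db_adjusted_weights.items():
        for term, am in weights.items():
            per_term[term].append((am, db_id))
    # Keep only the r largest entries for each term.
    return {term: heapq.nlargest(r, entries) for term, entries in per_term.items()}

def rank_databases(query_weights, integrated, top_n=None):
    """Rs(q, D_j) = max over query terms of q_i * am_{i,j}, restricted to stored entries."""
    scores = defaultdict(float)
    for term, q_w in query_weights.items():
        for am, db_id in integrated.get(term, []):
            scores[db_id] = max(scores[db_id], q_w * am)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n] if top_n else ranked
```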


Experimental Results:

Input Documents:

The experiments were performed on 221 databases obtained from 5 TREC document collections created by NIST (the National Institute of Standards and Technology of the US). The five collections are:

CR   - Congressional Record of the 103rd Congress
FR   - Federal Register 1994
FT   - Articles in the Financial Times from 1992 to 1994
FBIS - Articles via the Foreign Broadcast Information Service
LAT  - Los Angeles Times

These collections were partitioned into databases of sizes ranging from 2 MB to 20 MB. There are more than 1 million distinct terms in these databases. 1000 Internet queries collected from a Stanford University collection were used in the experiments: 2 queries had no terms left (after removing stopwords), 343 queries have two terms, 185 queries have three terms, 94 queries have four terms, 29 queries have five terms and 24 queries have six terms. TREC queries were not used since their average length is longer than that of typical Internet queries.


Performance Metrics:

cor_iden_db (correctly identified databases):
The ratio of the number of databases that contain one or more of the m most similar documents and are searched by this method to the number of databases that contain one or more of the m most similar documents.

cor_iden_doc (correctly identified documents):
The ratio of the number of documents retrieved among the m most similar documents to m.

db_effort (database search effort):
The ratio of the number of databases searched by the algorithm to the number of databases that contain one or more of the m most similar documents. This ratio is usually greater than 1.

doc_effort (document search effort):
The ratio of the number of documents received by the metasearch engine to m. This is a measure of transmission cost.
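For concreteness, a small helper that computes the four metrics from one query's outcome, assuming we already know the m most similar documents and the databases that contain them. This is an illustrative sketch, not the authors' evaluation code.

```python
# Illustrative computation of the four metrics defined above (not the authors' code).

def evaluate(searched_dbs, received_docs, ideal_dbs, ideal_docs, m):
    """ideal_dbs: databases containing one or more of the m most similar documents;
    ideal_docs: the m most similar documents themselves."""
    cor_iden_db = len(set(searched_dbs) & set(ideal_dbs)) / len(ideal_dbs)
    cor_iden_doc = len(set(received_docs) & set(ideal_docs)) / m
    db_effort = len(searched_dbs) / len(ideal_dbs)
    doc_effort = len(received_docs) / m        # transmission cost
    return cor_iden_db, cor_iden_doc, db_effort, doc_effort
```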


Beta (β)

The original algorithm OptDocRetrv terminates when at least m documents have been returned by the local search engines to the metasearch engine. A parameter β was used to control the termination of the OptDocRetrv algorithm: for example, if β = 2m, the algorithm does not stop until at least 2m documents have been returned to the metasearch engine by the local search engines.


The graphical representation of the experimental results is as follows:

1. Results for cor_iden_db & cor_iden_doc

[Graphs 1 and 2: cor_iden_db and cor_iden_doc as m varies from 2 to 20, for β = m, 1.5m and 2m.]
When β = m, as m varies from 2 to 20:
86.4% to 92.3% of the correct databases are identified.
86.4% to 92.7% of the correct documents are identified.

When β = 1.5m, as m varies from 2 to 20:
Approximately 2% more of the correct databases and correct documents are identified.
Approximately 27% more databases are searched and 45% more documents are retrieved.

When β = 2m, as m varies from 2 to 20:
Approximately 3.5% more of the correct databases and correct documents are identified.
Approximately 50% more databases are searched and 80% more documents are retrieved.


2. Results for db_effort & doc_effort

[Graphs 3 and 4: db_effort and doc_effort as m varies from 2 to 20, for β = m, 1.5m and 2m.]

From Graph 3, the value of db_effort can be less than 1.0. However, when all desired databases are ranked ahead of all other databases, db_effort will be at least 1.0.

In Graph 4, for β = 2m, doc_effort is less than 2. In general, if for each query there were at least β documents with positive similarities in the searched databases, then we should have doc_effort ≥ 2.


The proposed method of database selection and collection fusion guarantees the correct retrieval of the m most similar documents for single-term queries if m ≤ r. However, the experimental results show that the method performs very well even for multi-term queries. In general, this approach gives better results for shorter queries.










Comparisons of Selected Metasearch Engines:

(Listed in order of best ranking)

Copernic
  Databases searched: AltaVista, AOL.com, Direct Hit, EuroSeek, Excite, Fast Search, Google, GoTo, HotBot, LookSmart, Lycos, Magellan, MSN Web, Netscape Netcenter, Open Directory, Snap, WebCrawler, Yahoo! (also international collections of search engines)
  Special features: Can include Booleans. Translates any Web page or search result into any of seven languages: English, French, German, Italian, Japanese, Portuguese, and Spanish.

Ixquick
  Databases searched: AOL, AltaVista, LookSmart, EuroSeek, Excite, FindWhat, MSN, AllTheWeb, GoTo, HotBot, Yahoo!
  Special features: Translates your search into each search engine's syntax. Brings the "top 10" from each search engine and aggregates the results.

ProFusion
  Databases searched: AltaVista, LookSmart, Excite, Magellan, WebCrawler, GoTo, AllTheWeb, Yahoo! (you can customize what is searched)
  Special features: Aggregates results into one ranked list. Results can be sorted by relevancy, A-Z by site title, or by source.

Dogpile
  Databases searched: About 15 search engines and directories; some are big, some are small and esoteric.
  Special features: The results can be very inconsistent, especially from some of the smaller sources, which are not search engines but subject directories. Results are retrieved in lists of 10 hits from each engine queried.









Prototype System:

Based on the proposed method, the authors implemented a metasearch engine prototype called CSams (Computer Science Academic Metasearch Engine):
http://slate.cs.binghamton.edu:8080/CSams/

The system has 104 databases, each containing Web pages from a Computer Science department at a US university. Each database is treated as a search engine in the demo system. The user can specify the number of documents desired and can also view the search statistics.


Conclusion:

The new approach to optimal database ranking and collection fusion significantly improves scalability in terms of computation and space, since it uses an integrated database representative that can scale to an essentially unlimited number of databases while still permitting efficient selection of useful databases for any given query.