4 Dec 2013

Optimized Inverted List Assignment in
Distributed Search Engine Architecture

Jiangong Zhang Torsten Suel

Web Exploration and Search Technology Lab (WestLab)

Polytechnic University

Brooklyn, NY 11201

Distributed Search Engines


DSE: Search engines based on highly
distributed or P2P architecture


A DSE indexes and searches data
residing in P2P systems


Google: more centralized


The performance of DSEs cannot yet
compete with current centralized search
engines

Challenge in Distributed SE


Limited bandwidth and longer latency


Each node has limited storage space and
computational power


Textual collection size in terabytes


Compute top-k in millions of matches


Challenge: efficiency of search engine
query processing in such environments

Text Index Structures


An inverted index consists of inverted lists


Each inverted list is a sequence of postings


Each posting includes docID, frequency,
additional position and context information


Inverted lists are sorted and compressed
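
The index structure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slides do not say which compression scheme is used, so docID gaps with variable-byte coding are assumed here, and the names `build_index`, `compress_list`, and `decompress_list` are hypothetical. Position and context information is left out.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a docID-sorted list of (docID, frequency) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)

def varbyte_encode(numbers):
    """Variable-byte encode non-negative integers, 7 data bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)  # high bit marks the last byte of a number
    return bytes(out)

def varbyte_decode(data):
    """Inverse of varbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

def compress_list(postings):
    """Interleave docID gaps and frequencies, then varbyte-encode them."""
    flat, prev = [], 0
    for doc_id, freq in postings:
        flat.extend((doc_id - prev, freq))  # store the gap, not the docID
        prev = doc_id
    return varbyte_encode(flat)

def decompress_list(data):
    """Rebuild (docID, frequency) postings from the compressed bytes."""
    flat = varbyte_decode(data)
    postings, doc_id = [], 0
    for gap, freq in zip(flat[0::2], flat[1::2]):
        doc_id += gap
        postings.append((doc_id, freq))
    return postings
```

Because the postings are sorted by docID, the gaps are small numbers, which is what makes a byte-oriented code effective.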

Term-Based Ranking

Local Index Organization

Global Index Organization

Related Work


Our previous work [34, 40]


Index partition schemes [2, 13, 19, 25, 38]


Query processing in P2P search engines with
local index [14, 21, 32, 37, 36, 24, 5]


P2P query processing with global index [28,
22, 18, 34, 40]


Bloom Filter [7, 12, 27]


Pruning Technique [15, 16]

Query Processing in DSE

Query: distributed search engine polytechnic university

List Assignment Problem

Query: distributed search engine polytechnic university

List Assignment with replication

Query: distributed search engine polytechnic university

Contributions


Study the problem of assigning/replicating
inverted lists over a set of nodes to minimize
communication costs during query processing


Propose heuristic algorithms for this problem


Evaluate the performance of the algorithms
on real web pages and query traces

Problem Definition


NP-Complete


Two realistic approaches


Greedy approach


Graph approach

Greedy Approach


List-Driven: round-robin over the lists, place each on
the node with the largest overall reduction of
communication cost on the given query trace


Node-Driven: select the node with the most
available space, assign the inverted list that
gives the most benefit


Node-Driven Ratio: similar to Node-Driven,
but the benefit is the ratio between the cost
reduction and the size of the list
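
The Node-Driven Ratio variant can be sketched as follows. This is a rough sketch under a simplifying assumption that is not stated on the slides: the communication saving is modeled pairwise, with `pair_benefit[(a, b)]` giving the cost saved over the query trace when the lists of terms `a` and `b` share a node. The function name, the parameters, and the uniform `capacity` are hypothetical.

```python
def node_driven_ratio(list_sizes, pair_benefit, num_nodes, capacity):
    """Greedy assignment of one copy of each inverted list to a node.

    list_sizes:   term -> size of its inverted list
    pair_benefit: (term_a, term_b) -> assumed cost saved when co-located
    Returns a dict term -> node index.
    """
    free = [capacity] * num_nodes
    placed = [set() for _ in range(num_nodes)]
    assignment = {}
    unassigned = set(list_sizes)

    def gain(term, node):
        # Cost reduction from putting `term` next to lists already on `node`.
        total = 0
        for (a, b), benefit in pair_benefit.items():
            if (a == term and b in placed[node]) or (b == term and a in placed[node]):
                total += benefit
        return total

    while unassigned:
        # Select the node with the most available space.
        node = max(range(num_nodes), key=lambda n: free[n])
        fitting = sorted(t for t in unassigned if list_sizes[t] <= free[node])
        if not fitting:
            raise ValueError("remaining lists do not fit on any node")
        # Benefit is the ratio of cost reduction to list size.
        best = max(fitting, key=lambda t: gain(t, node) / list_sizes[t])
        assignment[best] = node
        placed[node].add(best)
        free[node] -= list_sizes[best]
        unassigned.remove(best)
    return assignment
```

Replacing the ratio with `gain(t, node)` alone gives the plain Node-Driven rule. Ties, including the first pick on an empty node, are broken alphabetically here only to keep the sketch deterministic.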

Graph Approach


Each term is a vertex with weight (size of list)


Each edge weight is proportional to the benefit of
having two terms on the same node


Minimize the total weight of those edges that
are “completely cut”
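
A small sketch of the graph formulation follows. Two points are assumptions rather than statements from the slides: the benefit of co-locating two terms is taken to be the size of the smaller list, counted once per query in which both terms occur, and an edge is read as "completely cut" when no node holds a copy of both terms. The names `build_term_graph` and `completely_cut_weight` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_term_graph(queries, list_sizes):
    """Build vertex and edge weights from a query trace.

    Vertex weight: size of the term's inverted list.
    Edge weight:   assumed benefit, min list size per co-occurrence.
    """
    vertex_weight = {}
    edge_weight = defaultdict(int)
    for query in queries:
        terms = sorted(set(query))
        for term in terms:
            vertex_weight[term] = list_sizes[term]
        for a, b in combinations(terms, 2):
            edge_weight[(a, b)] += min(list_sizes[a], list_sizes[b])
    return vertex_weight, dict(edge_weight)

def completely_cut_weight(edge_weight, placement):
    """Total weight of edges whose two terms share no node.

    placement: term -> set of nodes holding a copy of its list,
    so replicated lists are allowed.
    """
    return sum(weight for (a, b), weight in edge_weight.items()
               if placement[a].isdisjoint(placement[b]))
```

With one copy per list this reduces to the usual edge cut of a graph partitioning; with replicas an edge stops counting as soon as any node holds both lists.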


Combination of Two Approaches


No existing tool produces an overlapping partitioning


Combining graph and greedy approaches


Use the METIS* package to do the initial graph
partitioning of a single copy of the index

*http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

Experimental Setup


GOV2 data set used in the TREC Terabyte track
(25.2 million pages)


100K queries used in the TREC Terabyte
efficiency competition task


Precompute the intersection sizes of all pairs
of terms that occur in a common query


Query trace is divided into two sets: training
set and testing set
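
The precomputation and the trace split can be sketched as below. The slides do not say how the trace is divided, so a simple split by position with a hypothetical `train_fraction` parameter is assumed, and each inverted list is reduced to a set of docIDs for brevity.

```python
from itertools import combinations

def split_trace(queries, train_fraction=0.5):
    """Split a query trace by position into training and testing sets."""
    cut = int(len(queries) * train_fraction)
    return queries[:cut], queries[cut:]

def pair_intersection_sizes(queries, index):
    """Intersection size for every pair of terms sharing a query.

    index: term -> set of docIDs containing the term.
    Only pairs that actually co-occur in some query are computed,
    which avoids the quadratic blow-up over the whole vocabulary.
    """
    sizes = {}
    for query in queries:
        for a, b in combinations(sorted(set(query)), 2):
            if (a, b) not in sizes:
                sizes[(a, b)] = len(index.get(a, set()) & index.get(b, set()))
    return sizes
```

The assignment would be tuned on the pairs from the training set and its communication cost measured on the testing set.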

Relative Costs on 16 Nodes

Relative Costs on 128 Nodes

Imbalance in List Assignment


List-Driven brings significant imbalances


Other approaches perform similarly on
balance of index size


Node-Driven approaches show less difference
in the number of inverted lists on each node

List Assignment with replication

Query: distributed search engine polytechnic university

Change of Percentage of K-term Queries


Transform each query into another query


Average number of terms per query decreases


Queries with many terms are very likely to have
at least one pair on the same node
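
One way to read the transformation above: terms whose lists sit on the same node can be intersected locally, so they act as a single term for communication purposes. The sketch below follows that reading, which is an interpretation and not spelled out on the slides, and it covers only the case of one copy per list. The names `transform_query` and `k_term_percentages` are hypothetical.

```python
from collections import Counter

def transform_query(query, assignment):
    """Merge terms stored on the same node into one combined term.

    assignment: term -> node index (single copy per list assumed).
    Returns one tuple of terms per node touched by the query.
    """
    groups = {}
    for term in query:
        groups.setdefault(assignment[term], []).append(term)
    return [tuple(sorted(terms)) for _, terms in sorted(groups.items())]

def k_term_percentages(queries):
    """Percentage of queries having k terms, for each k that occurs."""
    counts = Counter(len(query) for query in queries)
    return {k: 100.0 * c / len(queries) for k, c in sorted(counts.items())}
```

Comparing `k_term_percentages` before and after applying `transform_query` to every query shows the shift toward queries with fewer effective terms.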

Total Cost Per Query in KB

Conclusion


Extrapolate to a collection of 2.5 billion pages


TREC is not the best data set for this evaluation


Careful assignment of inverted lists to nodes
obtains significant savings

Open Questions


Further improvements are possible


Hybrid organization may be the best choice


Thank You !!!