Optimized Inverted List Assignment in
Distributed Search Engine Architecture

Jiangong Zhang, Torsten Suel

Web Exploration and Search Technology Lab (WestLab)

Polytechnic University

Brooklyn, NY 11201

Distributed Search Engines

DSE: search engines based on a highly
distributed or P2P architecture

A DSE indexes and searches data
residing in P2P systems

Google: more centralized

DSE performance cannot yet compete with
current centralized search engines

Challenges in Distributed Search Engines

Limited bandwidth and longer latency

Each node has limited storage space and
computational power

Textual collection size in terabytes

Compute top-k among millions of matches

Efficiency of search engine
query processing in such environments

Text Index Structures

An inverted index consists of inverted lists

Each inverted list is a sequence of postings

Each posting includes docID, frequency,
additional position and context information

Inverted lists are sorted and compressed
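
The structure described above can be illustrated with a small sketch. It assumes postings that hold only a docID and a frequency, and it uses docID gaps with variable-byte coding as one common compression scheme; the talk does not say which scheme is used, and all function names here are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a docID-sorted list of (docID, frequency) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in Counter(docs[doc_id].split()).items():
            index[term].append((doc_id, freq))
    return dict(index)

def varbyte_encode(numbers):
    """7 payload bits per byte, low bits first; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def varbyte_decode(data):
    """Inverse of varbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

def compress_doc_ids(postings):
    """Store the gaps between consecutive docIDs instead of the docIDs themselves."""
    doc_ids = [doc_id for doc_id, _ in postings]
    gaps = [cur - prev for prev, cur in zip([0] + doc_ids, doc_ids)]
    return varbyte_encode(gaps)
```

Sorting by docID keeps the gaps small, which is what makes a byte-oriented code like this effective.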

Based Ranking

Local Index Organization

Global Index Organization

Related Work

Our previous work [34, 40]

Index partitioning schemes [2, 13, 19, 25, 38]

Query processing in P2P search engines with
local indexes [14, 21, 32, 37, 36, 24, 5]

P2P query processing with a global index [28,
22, 18, 34, 40]

Bloom filters [7, 12, 27]

Pruning techniques [15, 16]

Query Processing in DSE

Query: distributed search engine polytechnic university

List Assignment Problem

Query: distributed search engine polytechnic university

List Assignment with replication

Query: distributed search engine polytechnic university


Study the problem of assigning/replicating
inverted lists over a set of nodes to minimize
communication costs during query processing

Propose heuristic algorithms for this problem

Evaluate the performance of the algorithms
on real web pages and query traces
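
To make the objective concrete, here is a minimal sketch of how the communication cost of a query trace could be scored for a given assignment. The cost model is a deliberate simplification assumed for illustration, not the one from the talk: a query costs nothing if some node holds the lists of all its terms, and otherwise costs the size of its shortest list. The names `query_cost`, `trace_cost`, `placement`, and `list_size` are hypothetical.

```python
def query_cost(query, placement, list_size):
    """Cost of one query under the simplified model.

    placement maps each term to the set of nodes holding a copy of its list,
    so replication is expressed by a set with more than one node.
    """
    terms = set(query)
    holders = set.intersection(*(placement[t] for t in terms))
    if holders:
        return 0  # some node can answer the query locally
    return min(list_size[t] for t in terms)  # assumed: ship the shortest list

def trace_cost(trace, placement, list_size):
    """Total cost of all queries in the trace."""
    return sum(query_cost(q, placement, list_size) for q in trace)
```

Replicating a list onto a second node can only lower this score, which is why storage limits per node are what make the problem non-trivial.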

Problem Definition


Two realistic



Greedy Approach

List-Driven: round-robin over the lists, placing each list on
the node with the largest overall reduction of
communication cost on the given query trace

Node-Driven: select the node with the most
available space and assign it the inverted list that
gives the most benefit

Node-Driven Ratio: similar to Node-Driven,
but the benefit is the ratio between the cost
reduction and the size of the list
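
A greedy pass that goes over the lists one by one might look like the sketch below. It rests on several assumptions that are not from the talk: each list is stored once, lists are visited in a single pass from largest to smallest, every node has the same `capacity`, and the benefit of co-locating two terms is the size of the shorter list summed over the queries that contain both. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pair_benefits(trace, list_size):
    """Assumed benefit of co-locating two terms, accumulated over the trace."""
    benefit = defaultdict(int)
    for query in trace:
        for a, b in combinations(sorted(set(query)), 2):
            benefit[(a, b)] += min(list_size[a], list_size[b])
    return benefit

def greedy_assign(list_size, trace, num_nodes, capacity):
    """Place each list on the node with room that yields the largest benefit."""
    benefit = pair_benefits(trace, list_size)
    members = [set() for _ in range(num_nodes)]
    used = [0] * num_nodes
    assignment = {}
    for term in sorted(list_size, key=lambda t: (-list_size[t], t)):
        best_node, best_key = None, None
        for node in range(num_nodes):
            if used[node] + list_size[term] > capacity:
                continue  # list does not fit on this node
            gain = sum(benefit.get(tuple(sorted((term, other))), 0)
                       for other in members[node])
            key = (gain, -used[node])  # break ties toward the emptier node
            if best_key is None or key > best_key:
                best_node, best_key = node, key
        if best_node is None:
            raise ValueError(f"no node has room for {term!r}")
        members[best_node].add(term)
        used[best_node] += list_size[term]
        assignment[term] = best_node
    return assignment
```

Dividing `gain` by `list_size[term]` would turn the same loop into a ratio-style variant that favors short lists with high benefit.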

Graph Approach

Each term is a vertex with a weight (the size of its list)

Each edge weight is proportional to the benefit of
having the two terms on the same node

Minimize the total weight of those edges that
are “completely cut”
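
The graph formulation can be sketched as follows. As an assumption for illustration, the edge weight is simply the number of queries in which both terms occur, and an edge counts as completely cut when no node holds copies of both lists. The names `build_term_graph` and `completely_cut_weight` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_term_graph(trace, list_size):
    """Return (vertex_weight, edge_weight) for the terms seen in the trace."""
    vertex_weight = {t: list_size[t] for query in trace for t in query}
    edge_weight = Counter()
    for query in trace:
        for a, b in combinations(sorted(set(query)), 2):
            edge_weight[(a, b)] += 1  # assumed benefit: co-occurrence count
    return vertex_weight, edge_weight

def completely_cut_weight(edge_weight, placement):
    """Sum the weights of edges whose two terms share no node.

    placement maps each term to the set of nodes holding a copy of its list.
    """
    return sum(w for (a, b), w in edge_weight.items()
               if placement[a].isdisjoint(placement[b]))
```

With one copy per list this reduces to the ordinary edge cut of a graph partition; with replication an edge is saved as soon as any single node holds both endpoints.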

Combination of Two Approaches

No tool is available for overlapping partitioning

Combining graph and greedy approaches

Use the METIS package for the initial graph
partitioning of a single copy of the index


Experimental Setup

GOV2 data set used in the TREC Terabyte track
(25.2 million pages)

100K queries used in the TREC Terabyte
efficiency competition task

Precompute the intersection sizes of all pairs
of terms that occur in a common query
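
The precomputation step could be sketched like this, assuming each inverted list is available as a sorted list of docIDs. The names `intersection_size` and `precompute_intersections` are hypothetical, and a real run over a collection of this size would need a far more careful implementation.

```python
from itertools import combinations

def intersection_size(list_a, list_b):
    """Count docIDs common to two sorted docID lists with a linear merge."""
    i = j = count = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            count += 1
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return count

def precompute_intersections(index, trace):
    """Intersection size for every term pair that occurs in a common query."""
    pairs = set()
    for query in trace:
        pairs.update(combinations(sorted(set(query)), 2))
    return {(a, b): intersection_size(index[a], index[b]) for a, b in pairs}
```

Restricting the work to pairs that actually co-occur in the trace keeps the table far smaller than one over all term pairs.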

The query trace is divided into a training
set and a testing set

Relative Costs on 16 Nodes

Relative Costs on 128 Nodes

Imbalance in List Assignment

Driven brings significant imbalances

Other approaches perform similarly in
balancing index size

Driven approaches differ less
in the number of inverted lists on each node

List Assignment with replication

Query: distributed search engine polytechnic university

Change in Percentage of k-Term Queries

Transform each query into another query

The average number of terms per query decreases

Queries with many terms are very likely to have
at least one pair on the same node
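
One way to read the transformation above is that terms whose lists sit on the same node collapse into a single unit, so a query effectively has as many terms as the nodes it touches. The sketch below assumes exactly that, plus a single copy per list; the names `effective_terms` and `k_term_percentages` are hypothetical.

```python
from collections import Counter

def effective_terms(query, assignment):
    """Number of distinct nodes holding the query's lists (one copy per list)."""
    return len({assignment[term] for term in set(query)})

def k_term_percentages(trace, assignment=None):
    """Percentage of queries with k terms, before or after the transformation."""
    counts = Counter(
        len(set(query)) if assignment is None else effective_terms(query, assignment)
        for query in trace
    )
    return {k: 100.0 * c / len(trace) for k, c in sorted(counts.items())}
```

Calling `k_term_percentages(trace)` and `k_term_percentages(trace, assignment)` on the same trace gives the two distributions to compare.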

Total Cost Per Query in KB


Extrapolate to a collection of 2.5 billion pages

TREC is not the best data set for this evaluation

Careful assignment of inverted lists to nodes
yields significant savings

Open Questions

Further improvements are possible

A hybrid organization may be the best choice

Thank You !!!