Chapter 3: Data Storage and Access Methods



Title: The R* Tree: An Efficient and Robust Access Method for Points and Rectangles

Authors: N. Beckmann, H. Kriegel, R. Schneider and B. Seeger

Pages: 207-216

The R* Tree: An Efficient and Robust Access Method for Points and Rectangles


Problem


Problem Statement


Why is this problem important?


Why is this problem hard?



Approaches


Approach description, key concepts


Contributions (novelty, improved)


Assumptions


Problem Statement


R* Tree


Given


Data containing points and rectangles


Spatial queries (point, range query, insert, delete)


Find

An Access Method (Data Structure)


A hierarchical organization of rectangles


Example from wikipedia


Objectives


Efficiency of spatial queries


Constraints


Balanced tree


Each node is a disk page and holds at least m entries, where m is the minimum number of entries per node.


Root has at least two children unless it is a leaf


Efficiency metric = number of disk pages accessed

Why is this problem important?


Multi-dimensional Applications


Large geographic data, e.g., map objects such as countries occupy regions of non-zero size in two dimensions.


Common real-world usage: "Find all museums within 2 miles of my current location".


CAD





Many DBMS servers support spatial indices


Oracle, IBM DB2, …



Why is this problem hard?


B-tree split methods are ineffective in two dimensions


Ex. Sorting



Size variation across data rectangles


Large rectangles limit split options!



Non-uniform data distribution over space



Dynamic Access Method


Insertions and deletions


Overlapping directory rectangles => multiple search paths
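
The last point can be illustrated with a small sketch. The Python below is not taken from the paper: the `Node` class, the `point_query` function, and the `stats` counter are hypothetical names, and rectangles are assumed to be `(xmin, ymin, xmax, ymax)` tuples. It counts one page access per visited node, in line with the efficiency metric stated earlier, and shows that a query point lying in the overlap of two directory rectangles forces both subtrees to be read.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def contains_point(r: Rect, x: float, y: float) -> bool:
    """True if point (x, y) lies inside or on the border of rectangle r."""
    return r[0] <= x <= r[2] and r[1] <= y <= r[3]


@dataclass
class Node:
    mbr: Rect                                              # bounding rectangle of this node
    children: List["Node"] = field(default_factory=list)   # non-empty for directory nodes
    entries: List[Rect] = field(default_factory=list)      # data rectangles (leaves only)


def point_query(node: Node, x: float, y: float, stats: Dict[str, int]) -> List[Rect]:
    """Return all data rectangles containing (x, y); count one page per visited node."""
    stats["pages"] += 1
    if not node.children:  # leaf: test the data rectangles directly
        return [r for r in node.entries if contains_point(r, x, y)]
    hits: List[Rect] = []
    for child in node.children:
        # Every child whose rectangle contains the point must be searched.
        if contains_point(child.mbr, x, y):
            hits.extend(point_query(child, x, y, stats))
    return hits


# Two leaves whose bounding rectangles overlap in the square (5, 5)-(6, 6).
leaf_a = Node(mbr=(0, 0, 6, 6), entries=[(0, 0, 2, 2), (3, 3, 6, 6)])
leaf_b = Node(mbr=(5, 5, 10, 10), entries=[(5, 5, 8, 8), (9, 9, 10, 10)])
root = Node(mbr=(0, 0, 10, 10), children=[leaf_a, leaf_b])

for qx, qy in [(1.0, 1.0), (5.5, 5.5)]:
    stats = {"pages": 0}
    hits = point_query(root, qx, qy, stats)
    print((qx, qy), "hits:", hits, "pages read:", stats["pages"])
```

In this toy tree the first query reads two pages (root and one leaf), while the second falls in the overlap region and reads three.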

Novelty of Contribution


Related Work


Traditional one-dimensional indexing structures (e.g., hash, B-tree) are not appropriate for multi-dimensional range search


B+ tree


Represents sorted data in a way that allows for efficient insertion and removal of elements.


Dynamic, multilevel index with maximum and minimum bounds on the number of keys in each node.


Leaf nodes are linked together as a linked list to make range queries easy.


R-tree


R-tree is a foundation for spatial access methods


A complex spatial object is represented by its minimum bounding rectangle while preserving essential geometric properties


Overlapping regions


Heuristic: minimize the area of each enclosing rectangle in the inner nodes.



Principles of R-tree



Reference: A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching", 1984


Height-balanced tree similar to a B-tree, with index records in its leaf nodes containing pointers to data objects.


Heuristic optimization: minimize the area of each enclosing rectangle in the inner nodes.
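
One way the area heuristic shows up during insertion is in the choice of subtree: at each level, Guttman's R-tree descends into the entry whose rectangle needs the least area enlargement to cover the new rectangle, breaking ties by smaller area. A minimal sketch of that rule, assuming `(xmin, ymin, xmax, ymax)` tuples and using hypothetical function names (`area`, `union`, `enlargement`, `choose_subtree`):

```python
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(mbr: Rect, new: Rect) -> float:
    """Extra area mbr would need in order to also cover new."""
    return area(union(mbr, new)) - area(mbr)


def choose_subtree(child_mbrs: Sequence[Rect], new: Rect) -> int:
    """Index of the child needing least enlargement; ties go to the smaller area."""
    return min(
        range(len(child_mbrs)),
        key=lambda i: (enlargement(child_mbrs[i], new), area(child_mbrs[i])),
    )


children = [(0, 0, 4, 4), (5, 5, 6, 6)]
new_rect = (4, 4, 5, 5)
print("chosen child:", choose_subtree(children, new_rect))
```

Here the first child would grow from area 16 to 25, the second from 1 to 4, so the second child is chosen even though the new rectangle touches both.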

Performance Parameters beyond R-tree


(Q1) The area covered by a directory rectangle should be minimized.



(Q2) The overlap between directory rectangles should be minimized.



(Q3) The margin of a directory rectangle should be minimized.



(Q4) Storage utilization should be optimized.
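
The three geometric quantities in (Q1) to (Q3) can be computed directly from rectangle coordinates. The sketch below assumes `(xmin, ymin, xmax, ymax)` tuples and takes the margin to be the sum of the edge lengths (the perimeter); the function names are illustrative, not the paper's.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    """(Q1) area covered by the rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


def margin(r: Rect) -> float:
    """(Q3) margin, taken here as the sum of the four edge lengths."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def overlap(a: Rect, b: Rect) -> float:
    """(Q2) area of the intersection of a and b, 0 if they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


a = (0, 0, 4, 2)
b = (3, 1, 6, 5)
print("area(a) =", area(a), "margin(a) =", margin(a), "overlap(a, b) =", overlap(a, b))
```

For a fixed area, the margin is smallest for a square, so minimizing it favors compact, square-like directory rectangles.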



Intuitions:


Reduce overlap between sibling nodes.


Reduce traversal of multiple branches for point query


Reinserting old data moves entries between neighboring nodes and thus decreases overlap.


Because of the additional restructuring, fewer splits occur



Difference between R-tree and R*-tree


Minimization of area, margin, and overlap is crucial to the performance of R-tree / R*-tree.



The R*-tree attempts to reduce all three, using a combination of a revised node split algorithm and the concept of forced reinsertion at node overflow. This is based on the observation that R-tree structures are highly susceptible to the order in which their entries are inserted, so an insertion-built (rather than bulk-loaded) structure is likely to be sub-optimal. Deletion and reinsertion of entries allows them to "find" a place in the tree that may be more appropriate than their original location.


Improve retrieval performance
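
A rough sketch of the selection step in forced reinsertion follows. It assumes the overflowing node's entries are `(xmin, ymin, xmax, ymax)` tuples and that the entries to reinsert are those whose centers lie farthest from the center of the node's bounding rectangle. The number of entries to remove, `p`, is a tuning parameter passed in by the caller, and the names `pick_reinsert`, `bounding`, and `center` are hypothetical. The actual reinsertion into the tree is omitted.

```python
import math
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def bounding(rects: Sequence[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty collection of rectangles."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def center(r: Rect) -> Tuple[float, float]:
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def pick_reinsert(entries: Sequence[Rect], p: int) -> Tuple[List[Rect], List[Rect]]:
    """Split entries into (to_reinsert, kept): the p entries farthest from the node center go first."""
    cx, cy = center(bounding(entries))

    def dist(r: Rect) -> float:
        rx, ry = center(r)
        return math.hypot(rx - cx, ry - cy)

    by_distance = sorted(entries, key=dist, reverse=True)  # farthest first
    return by_distance[:p], by_distance[p:]


overflowing = [(0, 0, 2, 2), (1, 1, 3, 3), (2, 0, 4, 2), (6, 5, 7, 6)]
removed, kept = pick_reinsert(overflowing, p=1)
print("reinsert:", removed)
print("node MBR before:", bounding(overflowing), "after:", bounding(kept))
```

In this example removing a single distant entry shrinks the node's bounding rectangle from (0, 0, 7, 6) to (0, 0, 4, 3), which is the kind of tightening the reinsertion step aims for.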

Example



[Figure: three panels, each showing rectangles R1 to R5; one grouping is labeled "Preferred by R-tree" and another "Preferred by R*-tree"]
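
To make the difference in split preference more tangible, here is a simplified sketch in the spirit of the revised R*-tree split: pick the axis whose candidate distributions have the smallest total margin, then on that axis pick the distribution with the least overlap, breaking ties by total area. It assumes two-dimensional `(xmin, ymin, xmax, ymax)` tuples and a minimum fill `m` with `2 * m <= len(entries)`; all function names are hypothetical, and the sketch is not a faithful transcription of the paper's pseudocode.

```python
from typing import Iterator, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)
Split = Tuple[List[Rect], List[Rect]]


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def margin(r: Rect) -> float:
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def overlap(a: Rect, b: Rect) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def bounding(rects: Sequence[Rect]) -> Rect:
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def distributions(ordered: Sequence[Rect], m: int) -> Iterator[Split]:
    """All ways to cut a sorted sequence in two so that each group has at least m entries."""
    for k in range(m, len(ordered) - m + 1):
        yield list(ordered[:k]), list(ordered[k:])


def rstar_split(entries: Sequence[Rect], m: int) -> Split:
    best_sorts: List[List[Rect]] = []
    best_margin_sum = float("inf")
    for axis in (0, 1):  # 0 = x, 1 = y
        sorts = [sorted(entries, key=lambda r: r[axis]),       # by lower bound
                 sorted(entries, key=lambda r: r[axis + 2])]   # by upper bound
        margin_sum = sum(margin(bounding(g1)) + margin(bounding(g2))
                         for s in sorts for g1, g2 in distributions(s, m))
        if margin_sum < best_margin_sum:  # keep the axis with the smallest total margin
            best_margin_sum, best_sorts = margin_sum, sorts
    candidates = [d for s in best_sorts for d in distributions(s, m)]
    # On the chosen axis: least overlap first, then least total area.
    return min(candidates, key=lambda d: (overlap(bounding(d[0]), bounding(d[1])),
                                          area(bounding(d[0])) + area(bounding(d[1]))))


entries = [(0, 0, 1, 1), (0, 2, 1, 3), (5, 0, 6, 1), (5, 2, 6, 3)]
left, right = rstar_split(entries, m=2)
print("group 1 MBR:", bounding(left), "group 2 MBR:", bounding(right))
```

With these four rectangles the split along x yields two compact groups, whereas a split along y would produce two long, thin bounding rectangles with a larger margin.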

Validation Methodology


Methodology


Experiments with simulated workloads


Evaluation of design decisions



Results


R*-tree outperforms variants of R-tree and 2-level grid file.


R*-tree is robust against non-uniform data distributions.


Summary


Paper’s focus


R*-tree


implementations and performance



Ideas


Heuristic Optimizations (p. 208)


Reduction of area, margin, and overlap of the directory rectangles


Better Storage Utilization (p. 211)


Forced Reinsertion (splits can be prevented)



Experimental comparison


Using many data distributions

Assumptions, Rewrite today


Assumptions


Indexing data in two-dimensional space


Bulk load and bulk reorganization not available


Concurrency control and recovery costs are negligible


Reinserts during split!



Rewrite today


Bulk loading of rectangles


Compare with newer methods


R+ tree (disjoint siblings), Hilbert R-tree


Analytical results


Formally compare R*-tree with alternatives