# Chapter 3: Data Storage and Access Methods

Data Management

Oct 31, 2013


Title: The R* Tree: An Efficient and Robust Access Method for Points and Rectangles

Authors: N. Beckmann, H. Kriegel, R. Schneider and B. Seeger

Pages: 207-216

The R* Tree: An Efficient and Robust Access Method for Points and Rectangles

Problem

Problem Statement

Why is this problem important?

Why is this problem hard?

Approaches

Approach description, key concepts

Contributions (novelty, improved)

Assumptions

Problem Statement

R* Tree

Given

Data containing points and rectangles

Spatial queries (point, range query, insert, delete)

Find

An Access Method (Data Structure)

A hierarchical organization of rectangles

[Figure: R-tree example from Wikipedia]

Objectives

Efficiency of spatial queries

Constraints

Balanced tree

Each node is a disk page; every node except the root holds at least m entries (m = minimum number of entries per node).

Root has at least two children unless it is a leaf

Efficiency metric = number of disk pages accessed
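
The problem statement above can be illustrated with a minimal sketch of a two-level hierarchy of rectangles and a range query that counts the nodes (disk pages) it touches. The names `Node`, `mbr`, `intersects`, and `range_query` are hypothetical and chosen only for this illustration; the sketch ignores insertion, deletion, and the minimum-fill constraint.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def intersects(a: Rect, b: Rect) -> bool:
    """True if the two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

@dataclass
class Node:
    is_leaf: bool
    rects: List[Rect] = field(default_factory=list)        # one rectangle per entry
    children: List["Node"] = field(default_factory=list)   # empty for leaf nodes

def range_query(node: Node, query: Rect, stats: dict) -> List[Rect]:
    """Return all data rectangles intersecting `query`; count visited nodes."""
    stats["pages"] += 1  # each visited node stands for one disk page access
    hits = []
    for i, r in enumerate(node.rects):
        if intersects(r, query):
            if node.is_leaf:
                hits.append(r)
            else:
                hits.extend(range_query(node.children[i], query, stats))
    return hits

leaf_a = Node(True, [(0, 0, 1, 1), (2, 2, 3, 3)])
leaf_b = Node(True, [(8, 8, 9, 9), (6, 7, 7, 9)])
root = Node(False, [mbr(leaf_a.rects), mbr(leaf_b.rects)], [leaf_a, leaf_b])

stats = {"pages": 0}
result = range_query(root, (0.5, 0.5, 2.5, 2.5), stats)
```

In this example the query only intersects the directory rectangle of `leaf_a`, so the root and one leaf are read and `leaf_b` is pruned.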

Why is this problem important?

Multi-dimensional applications

Large geographic data, e.g., map objects like countries occupy regions of non-zero size in two dimensions.

Common real-world usage: "Find all museums within 2 miles of my current location".

Many DBMS servers support spatial indices

Oracle, IBM DB2, …

Why is this problem hard?

B-tree split methods are ineffective in two dimensions

Example: sorting (rectangles in the plane have no natural total order)

Size variation across data rectangles

Large rectangles limit split options!

Non-uniform data distribution over space

Dynamic Access Method

Insertions and deletions

Overlapping directory rectangles => multiple search paths
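
The last point can be made concrete with a small sketch: when directory rectangles overlap, a single point query may have to descend into more than one subtree. The function names and the two toy directories below are assumptions made up for this illustration.

```python
def contains(rect, p):
    """True if point p = (x, y) lies inside rect = (xmin, ymin, xmax, ymax)."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

def subtrees_to_visit(directory, p):
    """Names of all directory entries whose rectangle contains the point."""
    return [name for name, rect in directory.items() if contains(rect, p)]

# Two directory layouts over the same region (hypothetical data).
overlapping = {"A": (0, 0, 6, 6), "B": (4, 4, 10, 10)}
disjoint = {"A": (0, 0, 5, 10), "B": (5.5, 0, 10, 10)}

p = (5, 5)
paths_overlapping = subtrees_to_visit(overlapping, p)  # both subtrees qualify
paths_disjoint = subtrees_to_visit(disjoint, p)        # only one subtree qualifies
```

With the overlapping layout the query follows two search paths; with the disjoint layout it follows one.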

Novelty of Contribution

Related Work

One-dimensional indexing structures (e.g., hash, B-tree) are not appropriate for range search

B+ tree

Represents sorted data in a way that allows for efficient insertion and
removal of elements.

Dynamic, multilevel index with maximum and minimum bounds on the
number of keys in each node.

Leaf nodes are linked together as a linked list to make range queries easy.
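
A minimal sketch of why linked leaves help one-dimensional range queries: once the first relevant leaf is found, the scan simply follows the `next` pointers. The `Leaf` class and helper names are hypothetical, and the binary search over the leaves stands in for the root-to-leaf descent of a real B+ tree.

```python
from bisect import bisect_left

class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted keys stored in this leaf
        self.next = None   # pointer to the right sibling leaf

def build_leaves(sorted_keys, capacity):
    """Pack sorted keys into leaves and chain them left to right."""
    leaves = [Leaf(sorted_keys[i:i + capacity])
              for i in range(0, len(sorted_keys), capacity)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves

def range_scan(leaves, lo, hi):
    """Return all keys k with lo <= k <= hi by walking the leaf chain."""
    last_keys = [leaf.keys[-1] for leaf in leaves]
    start = bisect_left(last_keys, lo)  # first leaf that can contain a key >= lo
    leaf = leaves[start] if start < len(leaves) else None
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

leaves = build_leaves([1, 3, 5, 7, 9, 11, 13, 15], capacity=3)
found = range_scan(leaves, 4, 12)
```

This works because keys have a total order; rectangles in two dimensions do not, which is why the same trick does not carry over directly.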

R-tree

R-tree is a foundation for spatial access methods

A complex spatial object is represented by minimum bounding rectangles while preserving essential geometric properties

Overlapping regions

Heuristic: minimize the area of each enclosing rectangle in the inner nodes.

Principles of R-tree

Reference: A. Guttman, ‘R-trees: A Dynamic Index Structure for Spatial Searching’, 1984

Height-balanced tree similar to a B-tree with index records in its leaf nodes containing pointers to data objects.

Heuristic optimization: minimize the area of each enclosing rectangle in the inner nodes.
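
The area heuristic can be sketched as a subtree choice during insertion: pick the directory rectangle that needs the least area enlargement to include the new rectangle, breaking ties by the smaller area. This is a simplified illustration in the spirit of the original R-tree, not a full insertion routine; all function names are hypothetical.

```python
def area(r):
    """Area of rectangle r = (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(r, new):
    """Extra area r would need in order to also cover `new`."""
    return area(union(r, new)) - area(r)

def choose_subtree(directory, new):
    """Index of the entry with least enlargement; ties go to the smaller area."""
    return min(range(len(directory)),
               key=lambda i: (enlargement(directory[i], new), area(directory[i])))

directory = [(0, 0, 4, 4), (5, 5, 6, 6)]
new_rect = (6, 6, 7, 7)
chosen = choose_subtree(directory, new_rect)
```

Here the first entry would grow from area 16 to 49, while the second grows from 1 to 4, so the second entry is chosen.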

Performance Parameters beyond R-tree

(Q1) The area covered by a directory rectangle should be minimized.

(Q2) The overlap between directory rectangles should be minimized.

(Q3) The margin of a directory rectangle should be minimized.

(Q4) Storage utilization should be optimized.
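
The geometric quantities behind Q1 to Q3 can be written down directly. The sketch below assumes rectangles are tuples `(xmin, ymin, xmax, ymax)` and takes the margin to be the perimeter in two dimensions; the function names are hypothetical.

```python
def area(r):
    """Q1: area covered by a rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])

def margin(r):
    """Q3: margin, taken here as the perimeter of the rectangle."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

def overlap(a, b):
    """Q2: area of the intersection of two rectangles (0 if they are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

square = (0, 0, 2, 2)
strip = (0, 0, 4, 1)

same_area = area(square) == area(strip)          # both cover 4 units of area
square_is_tighter = margin(square) < margin(strip)  # 8 versus 10
```

For a fixed area, a smaller margin means a more square-like rectangle, which is why margin is tracked separately from area.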

Intuitions:

Reduce overlap between sibling nodes.

Reduce traversal of multiple branches for point query

Reinserting old data moves entries between neighboring nodes and thus decreases overlap.

Due to more restructuring, fewer splits occur

Difference between R-tree and R*-tree

Minimization of area, margin, and overlap is crucial to the performance of R-tree / R*-tree.

The R*-tree attempts to reduce all three, using a combination of a revised node split algorithm and the concept of forced reinsertion at node overflow. This is based on the observation that R-tree structures are highly susceptible to the order in which their entries are inserted, so an insertion-built (rather than bulk-loaded) structure is likely to be sub-optimal. Deletion and reinsertion of entries allows them to "find" a place in the tree that may be more appropriate than their original location.
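
Forced reinsertion at node overflow can be sketched as a selection step: rank the entries of the overflowing node by the distance of their centers from the center of the node's bounding rectangle, and take out the farthest ones for reinsertion. The function names are hypothetical, and the fraction of 0.3 is an assumption used here for illustration; the actual reinsertion into the tree is not shown.

```python
import math

def center(r):
    """Center point of rectangle r = (xmin, ymin, xmax, ymax)."""
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)

def mbr(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def pick_for_reinsertion(rects, fraction=0.3):
    """Split entries into (kept, to_reinsert); the farthest entries are removed."""
    cx, cy = center(mbr(rects))

    def dist(r):
        x, y = center(r)
        return math.hypot(x - cx, y - cy)

    p = max(1, int(len(rects) * fraction))        # number of entries to remove
    ordered = sorted(rects, key=dist, reverse=True)  # farthest first
    return ordered[p:], ordered[:p]

overflowing = [(0, 0, 4, 4), (3, 3, 5, 5), (4, 2, 6, 4), (9, 9, 10, 10)]
kept, to_reinsert = pick_for_reinsertion(overflowing)
```

In this example the small rectangle far from the others is removed, and the bounding rectangle of the remaining entries shrinks from (0, 0, 10, 10) to (0, 0, 6, 5).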

Improve retrieval performance

Example

[Figure: three arrangements of rectangles R1 to R5; panel labels: "Preferred by R-tree" and "Preferred by R*-tree"]
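
One way to see why the two trees may prefer different groupings is a simplified sketch of a margin-and-overlap-driven split: first choose the split axis with the smallest total margin over all candidate distributions, then choose the distribution with the least overlap (ties broken by total area). This is an assumption-laden simplification with hypothetical names: it sorts entries only by their lower and then upper coordinate on each axis, and `m` is the minimum number of entries per group.

```python
def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def margin(r):
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))

def overlap(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def distributions(rects, axis, m):
    """All splits of the axis-sorted entries into two groups of at least m."""
    ordered = sorted(rects, key=lambda r: (r[axis], r[axis + 2]))
    return [(ordered[:k], ordered[k:]) for k in range(m, len(ordered) - m + 1)]

def split(rects, m):
    # Step 1: pick the axis whose candidate distributions have the least total margin.
    axis = min((0, 1), key=lambda ax: sum(margin(mbr(g1)) + margin(mbr(g2))
                                          for g1, g2 in distributions(rects, ax, m)))
    # Step 2: on that axis, pick the distribution with least overlap, then least area.
    return min(distributions(rects, axis, m),
               key=lambda g: (overlap(mbr(g[0]), mbr(g[1])),
                              area(mbr(g[0])) + area(mbr(g[1]))))

entries = [(0, 0, 1, 1), (0, 2, 1, 3), (5, 0, 6, 1), (5, 2, 6, 3)]
group1, group2 = split(entries, m=1)
```

For these four entries the x-axis has the smaller margin sum, and the chosen distribution separates the left pair from the right pair with zero overlap.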

Validation Methodology

Methodology

Evaluation of design decisions

Results

R*-tree outperforms variants of R-tree and 2-level grid file.

R*-tree is robust against non-uniform data distributions.

Summary

Paper’s focus

R*-tree implementations and performance

Ideas

Heuristic Optimizations (p. 208)

Reduction of area, margin, and overlap of the directory rectangles

Better Storage Utilization (p. 211)

Forced Reinsertion (splits can be prevented)

Experimental comparison

Using many data distributions

Assumptions, Rewrite today

Assumptions

Indexing data in two-dimensional space

Bulk load and bulk reorganization not available

Concurrency control and recovery costs are negligible

Reinserts during split!

Rewrite today

Bulk-

R+ tree (disjoint siblings), Hilbert R-tree

Analytical results

Formally compare R*-tree with alternatives