Survey of techniques used to reduce the Semantic Gap between Database Management Systems and Storage Subsystems


Oct 29, 2013



Problem Description:

One of the most important modules of any database management system is the storage
manager module. This module essentially controls the way data is allocated, accessed,
and stored on storage devices. Storage subsystems are typically virtualized for the
purposes of consolidation, ease of management, reducing interdependence, etc. Because of
this virtualization, database storage managers neither have strict control over
the physical layout of data nor are they aware of the internal characteristics of storage
subsystems, so they apply coarse rules of thumb to optimize access. On the other
hand, storage subsystems do not have semantic knowledge of the data that they are
storing, and rely on their own rules of thumb to manage such workload-dependent
tasks as prefetching, caching, and data layout. The end result is that both sides work
blindly to optimize their performance without knowing what the other side is doing.

Various studies [1][2] emphasize the importance of reducing the information gap between
applications and underlying storage devices. Over the years, various techniques have
been developed to reduce the semantic gap between database management systems and
storage systems. In this project we plan to survey the approaches used by researchers
and implementers to reduce this semantic gap.

The idea of placing intelligence in storage systems to help database operation was
explored extensively in the context of database machines in the late 1970s and 1980s.
Database machines can be classified into four categories according to where processing
is attached to the disk:

Processor per head: DBC, SURE
Processor per track: CAFS
Processor per disk: SURE
Multi-processor cache: DIRECT, RDBM, DBMAC, INFOPLEX, RAP.2

In all of these architectures, a central processor pushed simple database
operations (e.g., scan) closer to the disk and achieved a dramatic performance improvement
for these operations. The main counter-arguments are summarized by Boral and DeWitt
[11]. First, most database machines used special-purpose hardware, such as associative
disks, associative CCD devices, and magnetic bubble memory, which increased the design
time and cost of these machines, and the performance gain was not enough to justify
the additional cost of this hardware. Second, although the performance was
impressive for scan operations, complex database operations such as sorts and
joins did not see significant benefits. Third, the performance offered by database
machines could be matched by smart indexing techniques. Fourth, CPU processing
speed was improving much faster than disk transfer rates, so the CPU was sitting
idle. Fifth, the communication overhead between processing elements was high. Finally,
database vendors did not agree to rewrite their legacy code bases to take advantage of
the features offered by this new hardware.

Storage technology has evolved considerably in the intervening years, and disk
processing got attention from the research community again in the 1990s. The biggest
change was the widespread use of disk arrays, in which a large number of disks work in
parallel. The special-purpose silicon cores of database machines were replaced by
general-purpose embedded processors and larger memories. Numerous parallel algorithms
for database operations, such as joins and sorts, have been developed for different
architectures, such as shared-nothing, shared-memory, and shared-disk, since the
inception of specialized database machines. Serial communication links were able to
provide enough bandwidth to the disks to overcome message-passing overheads.

Description of Candidate Solutions:

To take advantage of these technology innovations, Acharya et al. [9] and Riedel et al.
[6][10] explored the benefits of mapping applications from database, data mining, image
processing, sorting, and data cubes onto storage devices to enable application-specific
processing close to the data. Their work mostly focuses on how to partition applications
across host and storage devices to minimize the communication overhead due to data
transfer. Keeton [8] explored using an analytical model of speedup to offload
portions of SMP databases to use the processing power in storage devices.
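
The partitioning trade-off described above can be pictured with a small sketch. It is
not taken from any of the cited systems: the record size, the table contents, and the
function names host_side_scan and device_side_scan are assumptions made for
illustration. It only compares how many bytes would cross the interconnect when a
selection runs on the host versus on the storage device.

```python
from typing import Callable, List, Tuple

Record = Tuple[int, str]   # (key, payload); a hypothetical row format
RECORD_BYTES = 128         # assumed on-disk size of one record


def host_side_scan(table: List[Record], pred: Callable[[Record], bool]):
    """Ship every record to the host, then filter there."""
    shipped = len(table) * RECORD_BYTES
    result = [r for r in table if pred(r)]
    return result, shipped


def device_side_scan(table: List[Record], pred: Callable[[Record], bool]):
    """Filter on the storage device and ship only the matching records."""
    result = [r for r in table if pred(r)]
    shipped = len(result) * RECORD_BYTES
    return result, shipped


# Illustrative data: 1000 rows, of which the predicate selects 1%.
table = [(k, "row-%d" % k) for k in range(1000)]
selective = lambda r: r[0] % 100 == 0

host_rows, host_bytes = host_side_scan(table, selective)
dev_rows, dev_bytes = device_side_scan(table, selective)
print(host_bytes, dev_bytes)
```

Both scans return the same rows; only the volume of transferred data differs, which is
why highly selective operations are the natural candidates for offloading.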

The Fates database storage project at CMU [3][5] explores how to extract disk
characteristics and map database access patterns onto them efficiently. A Logical
Volume Manager (LVM) exposes information about disk layout to the database storage
manager, which uses this information to write data to disks in a way that improves
later reads from the storage devices.
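
One way a storage manager might use exposed layout information is to align its
allocations to track boundaries. The sketch below assumes, hypothetically, that the
volume manager reports the starting logical block number of each track as a sorted
list with a final sentinel marking the end of the last track. The function names are
illustrative and this is not the interface of the cited projects.

```python
from bisect import bisect_left, bisect_right
from typing import List


def crosses_boundary(track_starts: List[int], start: int, nblocks: int) -> bool:
    """True if blocks [start, start + nblocks) span more than one track."""
    first = bisect_right(track_starts, start)
    last = bisect_right(track_starts, start + nblocks - 1)
    return first != last


def track_aligned_start(track_starts: List[int], lbn: int, nblocks: int) -> int:
    """Return the first track start at or after `lbn` whose track can hold
    `nblocks` blocks without spilling into the next track."""
    i = bisect_left(track_starts, lbn)          # first boundary >= lbn
    while i + 1 < len(track_starts):
        capacity = track_starts[i + 1] - track_starts[i]
        if capacity >= nblocks:
            return track_starts[i]
        i += 1
    raise ValueError("no single track can hold the request")


# Assumed layout: three tracks of 100, 95, and 105 blocks.
tracks = [0, 100, 195, 300]
naive = 40                                       # where a layout-unaware allocator might start
aligned = track_aligned_start(tracks, naive, 64)
print(crosses_boundary(tracks, naive, 64), crosses_boundary(tracks, aligned, 64))
```

The unaligned extent straddles two tracks, while the aligned one stays within a single
track, so a later sequential read of it needs no track switch.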

The database-aware semantically smart storage project [4] explores the possibility of a
database-aware storage device. It uses write-ahead log (WAL) entries to find the block
access patterns of the database and uses this information for efficient prefetching,
caching, and data layout inside the disk. These optimizations are transparent to the
host.
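
As a rough illustration of how a device could derive access patterns from a log, the
sketch below counts how often each block appears in a stream of log records and picks
the most frequently updated blocks as caching candidates. The LogRecord layout and the
hot_blocks function are assumptions made for this example; the cited work infers much
richer information than a simple frequency count.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class LogRecord(NamedTuple):
    lsn: int        # log sequence number
    txn_id: int     # transaction that made the update
    block_no: int   # data block the logged update refers to


def hot_blocks(log: Iterable[LogRecord], k: int) -> List[int]:
    """Return the k block numbers that appear most often in the log."""
    counts = Counter(rec.block_no for rec in log)
    return [block for block, _ in counts.most_common(k)]


# Illustrative log: block 7 is updated three times, block 3 twice, block 9 once.
log = [
    LogRecord(1, 10, 7),
    LogRecord(2, 10, 7),
    LogRecord(3, 11, 3),
    LogRecord(4, 11, 7),
    LogRecord(5, 12, 3),
    LogRecord(6, 12, 9),
]
candidates = hot_blocks(log, 2)
print(candidates)
```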

Schlosser and Iren [7] make use of object-based storage devices (OSD). Their paper
examines the approach of passing semantic information from a database system to the
storage subsystem to bridge the information gap between these two levels. The recently
standardized OSD interface moves low-level storage functionality close to the data and
provides an object interface to access the data. The paper leverages the OSD interface
for communicating semantic information from the database to the storage device. It
discusses how a relation of a database can be mapped to an OSD object, and how database
relations can be read and written efficiently by taking advantage of geometry-aware data
layout through an additional OSD interface. The paper only scratches the surface: it
neither builds a prototype system nor shows a performance improvement over traditional
approaches.
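
The idea of mapping a relation to a storage object can be sketched with a toy
in-memory object store. Everything here is hypothetical: the ObjectStore class, its
method names, and the attribute names are stand-ins chosen for illustration and do not
reproduce the standardized OSD command set or the design in [7].

```python
import struct
from typing import Dict


class ObjectStore:
    """Toy stand-in for an object-based storage device (hypothetical API)."""

    def __init__(self) -> None:
        self._data: Dict[int, bytearray] = {}
        self._attrs: Dict[int, Dict[str, str]] = {}
        self._next_id = 1

    def create(self, **attrs: str) -> int:
        """Create an empty object carrying the given attributes."""
        oid = self._next_id
        self._next_id += 1
        self._data[oid] = bytearray()
        self._attrs[oid] = dict(attrs)
        return oid

    def append(self, oid: int, payload: bytes) -> int:
        """Append bytes to an object and return the offset they were written at."""
        offset = len(self._data[oid])
        self._data[oid] += payload
        return offset

    def read(self, oid: int, offset: int, length: int) -> bytes:
        return bytes(self._data[oid][offset:offset + length])

    def get_attr(self, oid: int, name: str) -> str:
        return self._attrs[oid][name]


# One relation maps to one object; its schema travels with it as attributes,
# so the device knows what it stores without inspecting raw blocks.
store = ObjectStore()
emp = store.create(relation="employee", schema="id:int32,dept:int32", record_bytes="8")
off = store.append(emp, struct.pack("<ii", 42, 7))
row = struct.unpack("<ii", store.read(emp, off, 8))
print(store.get_attr(emp, "relation"), row)
```

Because the schema is attached to the object, placement and prefetching decisions can
in principle be made per relation rather than per anonymous block range.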

Methodology and Objective:

In this work we plan to survey the above-mentioned techniques to provide a clear
picture of the state of the art in current database management systems. We
plan to compare these approaches using common illustrative examples. Our final
goal is to clearly identify the advantages and disadvantages of each approach and,
if possible, to conclude which approach is the most appropriate or promising for
future database systems.


[1] T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Bridging the
information gap in storage protocol stacks. Summer USENIX Technical Conference
(Monterey, CA, 10-15 June 2002), pages 177-190, 2002.

[2] G. R. Ganger. Blurring the line between OSs and storage devices. Technical report
CMU-CS-01-166, Carnegie Mellon University, December 2001.

[3] Jiri Schindler, Steven W. Schlosser, et al. Atropos: A Disk Array Volume Manager for
Orchestrated Use of Disks. In 3rd USENIX Conference on File and Storage Technologies
(FAST 04), CA, March 2004.

[4] Muthian Sivathanu, Lakshmi N. Bairavasundaram, et al. Database-Aware Semantically
Smart Storage. FAST 2005.

[5] Minglong Shao, et al. Clotho: Decoupling Memory Page Layout from Storage
Organization. VLDB 2004.

[6] E. Riedel, C. Faloutsos, and D. Nagle. Active Disk Architecture for Databases.
Technical Report CMU-CS-00-145, Carnegie Mellon University, April 2000.

[7] Steve Schlosser and Sami Iren. Database Storage Management with Object-based
Storage Devices. DaMoN 2005.

[8] K. Keeton. Computer Architecture Support for Database Applications. PhD thesis,
University of California at Berkeley, 1999.

[9] A. Acharya, M. Uysal, and J. Saltz. Active Disks: Programming Model, Algorithms
and Evaluation. ASPLOS VIII, 1998.

[10] E. Riedel, G. Gibson, and C. Faloutsos. Active Storage for Large-Scale Data Mining
and Multimedia. VLDB 1998.

[11] H. Boral and D. J. DeWitt. Database Machines: An Idea Whose Time Has Passed?
Workshop on Database Machines, 1983.