CS511 Advanced Database Management Systems

righteousgaggleΔιαχείριση Δεδομένων

31 Ιαν 2013 (πριν από 7 χρόνια και 5 μήνες)

288 εμφανίσεις

CS511 Advanced Database Management Systems


Lecture 3:

System R

Sept. 1 2006

ChengXiang Zhai

Most slides are adapted from Kevin Chang’s lecture slides

CS511 Advanced Database Management Systems


System R

System R: 1974

IBM San Jose Labs, lots of PhD researchers

Gray: coming from OS, first CS PhD of Berkeley

lots of influence in RSS

ACM SIGMOD Innovation Award 1992

Turing Award 1998

Won Kim is UIUC alum

Dissertation: Query Processing for Relational Database

CS511 Advanced Database Management Systems



INGRES: 1973

U.C. Berkeley faculty & graduate students

Mike Stonebraker (then an asst. prof)

ACM SIGMOD Innovation Award 1991

Eugene Wong



CS511 Advanced Database Management Systems


System R and INGRES: Gray

Jim Gray: (see System R 25th Reunion page)

Hostility developed between the San Jose IBM group and the Berkeley
group because they were working on very, very similar things and
had very, very similar ideas…

As a consequence we came to the conclusion that the best thing was
not to talk to each other.

The Berkeley folks thought the IBM guys were ripping off ideas from
the INGRES project. We had a strained relationship.

CS511 Advanced Database Management Systems



ACM Software System Award

System R:

Donald Chamberlin, James Gray, Raymond Lorie, Gianfranco Putzolu,

Patrici Selinger, Irving Traiger


Gerald Held, Michael Stonebraker, Eugene Wong


The INGRES and System R systems demonstrated that a practical and
efficient database management system (DBMS) could be implemented based
on the relational data model.

These systems were full
function DBMS's that
supported non
procedural query languages (QUEL and SQL), automatic
query optimization, alternative storage structures, transactions, crash
recovery, views, integrity, and protection. They have revolutionized the
database system industry by showing how data stored in a computer can be
conveniently accessed by end users and while at the same time it can be used
by production application programs.


CS511 Advanced Database Management Systems


Contributions of System R??

CS511 Advanced Database Management Systems


Contributions of System R?

Bringing theory to practice

nice theory implemented into practical system

High level query language (SQL)

Codd’s relational algebra/calculus were criticized as too

“System” research in action

macro: design a complete system architecture

micro: identify key problems and provide solutions

Defining database landscape

industry product spec. and research directions

CS511 Advanced Database Management Systems


Complete System Study

Phase 0: 1974, 1975

initial single
user prototype

try out ideas and find issues

felt a good idea to plan to throw away ver. 1.0

Phase 1: 1976, 1977

function, multi
user prototype

Phase 2: 1978, 1979

evaluation and feedback

lots of good lessons learned

Very similar process took place in INGRES

CS511 Advanced Database Management Systems


Relational System Modules??

CS511 Advanced Database Management Systems


System Modules Identified

view management

query parser/rewriter

query optimizer

query executor

data storage

access methods

buffer manager

lock manager

log/recovery system

CS511 Advanced Database Management Systems


System R Architecture

RDS/RSS divide remains in many systems

RDS: query processing (logical)

view, query parser, rewriter, optimizer, executor

RSS: storage/access methods (physical)

storage, access methods, buffer manager, lock,

CS511 Advanced Database Management Systems



View defined as a query

another consistent use of SQL (no separate DDL)

Query on views:

query rewriter to flatten view (unfold def.)

form a composite query tree

View transparency

Almost any queries on any views

Not fully transparent though:

update: only for single
relation views

no right meanings in some cases

one nature of view def. (ambiguity)

even none
one (some view state has no correspondence)

CS511 Advanced Database Management Systems


SQL as Query Language

level declarative, English
based language

declarative language: what not how

founded/simple semantics based on relational algebra

small set of well
understood operators, so optimizer knows:

how operators can be interchanged/transformed

what equivalent implementations are for each op

Consistent for different functionalities

data definition: e.g., table creation, view definition

data manipulation: e.g., queries, updates

Uniform for different usage scenarios

embedding from different host languages (canned queries)

ad hoc user queries from command lines

Unexpected benefit: Standardized DB interface (mid 80’s)

CS511 Advanced Database Management Systems


What makes SQL possible?

Query: parse, access
path selection, code gen, execute

based access
path selection (optimization)

compilation for canned queries

remove preprocessing/optimization from run time

data indexes and statistics may change:

reoptimize and recompile by observing dependencies

alternative approaches?

caching of recent used query plans

trigger to invalidate cached plans on relevant events

e.g., on rebuilding system statistics, on index creation

in contrast, interpreted QUEL in INGRES admitted mistake

CS511 Advanced Database Management Systems


Query Optimizer: Cost Based

based optimizer: set up paradigm

largely unchanged since

Cost model:

C = weighted
sum(CPU time, #IO)

CPU time modeled as number of RSS calls

CS511 Advanced Database Management Systems


Query Optimizer: Access Path

Access path selection based on expected costs

select people where job = programmer and city = champaign

path 1: job index
> check city; path 2: city index
> check job

more paths?

Data independence: what are hidden from users?

Cost estimation based on

index selectivity

job (=programmer) more selective or city (=chamapign)?

index clustering

records of same/neighboring key are packed physically together

minimize #IO to fetch records of same key or a range

a relation can typically has at most one clustering index. Why?

CS511 Advanced Database Management Systems


Query Optimizer: Join Strategies

To evaluate R.a = S.a:

loop join:

for each tuple r in R:

use index, fetch S tuples s s.t. s.a=r.a

Q: B
tree or hashing index better?

merge join:

sort R, sort S

merge tuples in R and S in order

Q: Use B
tree index to speed up? Hashing?

All joins two
way; n
ary joins as binary trees

prune away lots of alternative plans for n
ary joins

CS511 Advanced Database Management Systems


Storage: Phase 0

Tuples have TID, containing page number

direct access by TID to fetch the page

Tuples contain pointers to values in “domain”


clean and original “domain” notion in Codd’s model

ultimate data normalization

cons: inefficient

idea picked up by some web search engines to store cached

Design of Database Systems …

0AF1 00A0 A0B8 B001 …

CS511 Advanced Database Management Systems


Access Methods: Phase 0

Inverted Index (an inversion is also an index)

domain value
> list of TIDs

similar idea used in text information retrieval for word
to document ID mapping

Q: find documents with “cat” and “dog”

inverted index:

> d1, d2, d3

> d1, d3, d5, d6

A: d1, d3

How is this different from, say, B

CS511 Advanced Database Management Systems


Access Methods: Phase 1


Hashing: rejected

place records to buckets based on hashing values

reason of rejection: cannot support range queries


to reach a record, you…

CS511 Advanced Database Management Systems


Lock Manager: Influential

Multiple granularity (hierarchy) of locks

records, relations, entire database

Intention locks for traversing data hierarchies

lock table with intention before locking tuples

acquiring locks from top down

Deadlock handling: detection

We will study this in more detail later (Gray)

CS511 Advanced Database Management Systems


Lock Manager: Predicate

Lock all the tuples satisfying a predicate

e.g., lock “student.dept = CS”


hard to determine if locks conflict

“dept = CS” vs. “GPA > 3.0”?

involving semantics

data snapshot also matters

locked set can be changing during locking

CS511 Advanced Database Management Systems


Lock Manager: Convoy Problem

Lock resource may interfere with OS resources

Transaction T1 holds lock while to be dispatched by OS

Transaction T2, …, Tn all wait; convoy formed

T1 dispatched, and then release lock

say 10K instructions for dispatching

lock granted to T2, which is waiting to be dispatched by OS

T1 soon need the lock again, go back to end of convoy

say only executed 1k instructions between locking

T1 is now wasting its CPU time slice

Most cycles are for dispatching only!

T1: 10K dispatching, 1K execution, waiting in convoy

CS511 Advanced Database Management Systems


Log/Recovery: Failures

Transaction failures

i.e., transaction rollback if cannot be committed

System failures

all data updates in main memory buffer lost

Media failures

data on disk lost

CS511 Advanced Database Management Systems


Log/Recovery: Before Logging

“Before” (undo) logging: record old value

after crash, bring DB back to consistency

undo all uncommitted transactions


data must be flushed to disk before transaction commit

after crash, cannot bring to the most recent status

CS511 Advanced Database Management Systems


Log/Recovery: After Logging

“After” (redo) logging: record new value

after crash, bring DB forward to consistency

redo all committed transactions


must hold data in buffer until commit made

long history of “committed transactions”; slow in

CS511 Advanced Database Management Systems


Log/Recovery: Before/After Logging

Combined before/after logging; to recover:

checkpoints to archive current healthy state

redo all committed transactions

undo all uncommitted transactions

Dual logs to prevent log failure

Shadow pages not a good idea

tend to lose physical clustering

too much overhead in maintaining

alternative: in
place updates; separate logs #

CS511 Advanced Database Management Systems


System R Influences: Other Ideas

catalog as relations

metadata managed as data

security management (grant/revoke)

integrity constraints, triggers

CS511 Advanced Database Management Systems


System R Influence

Ellison's Oracle beats IBM to market by reading white papers.

IBM releases multiple RDBMSs, settles down to DB2.

Gray (System R), Jerry Held (Ingres) and others join Tandem
Stop SQL), Kapali Eswaran starts EsVal, which begets
HP Allbase and Cullinet

Relational Technology Inc (Ingres Corp), Britton
Wang PACE grow out of Ingres group

CA releases CA
Universe, a commercialization of INGRES

CS511 Advanced Database Management Systems


A Comment from Students

I think it is remarkable how this System R influenced many
actual Relational Database systems architecture, with it's
various components as the locking , recovery, security and
views subsystems. To read this is not much different than
for example reading the description of the architecture of
Oracle (I used to work for them).

CS511 Advanced Database Management Systems


What You Should Know

The main challenges the system R builders had to

Challenges related to the implementation of the
relational model (storage, indexing, query
optimization, query language)

Additional challenges (locking system, log/recovery

The main contributions of system R in implementing
a relational data model (mostly corresponding to the
challenges above)

CS511 Advanced Database Management Systems


Carry Away Messages

System development is as valuable as (or more
valuable than?) theory formulation

Often, new problems/challenges would be
discovered through system building

A system’s value depends on

Utility of the system (new/better functions that people
care about)

Technical contributions (solutions to particular

System R has high values in both aspects!