CS511 Advanced Database Management Systems

righteousgaggleΔιαχείριση Δεδομένων

31 Ιαν 2013 (πριν από 4 χρόνια και 2 μήνες)

165 εμφανίσεις

CS511 Advanced Database Management Systems

1



Lecture 3:

System R



Sept. 1 2006

ChengXiang Zhai

Most slides are adapted from Kevin Chang’s lecture slides

CS511 Advanced Database Management Systems

2


System R


System R: 1974
-
1978


IBM San Jose Labs, lots of PhD researchers


Gray: coming from OS, first CS PhD of Berkeley


lots of influence in RSS


ACM SIGMOD Innovation Award 1992


Turing Award 1998


Won Kim is UIUC alum


Dissertation: Query Processing for Relational Database
Systems

CS511 Advanced Database Management Systems

3


INGRES


INGRES: 1973
-
1977


U.C. Berkeley faculty & graduate students


Mike Stonebraker (then an asst. prof)


ACM SIGMOD Innovation Award 1991


Eugene Wong




Postgres


偯獴杲敓兌

CS511 Advanced Database Management Systems

4


System R and INGRES: Gray

Jim Gray: (see System R 25th Reunion page)

Hostility developed between the San Jose IBM group and the Berkeley
group because they were working on very, very similar things and
had very, very similar ideas…

As a consequence we came to the conclusion that the best thing was
not to talk to each other.

The Berkeley folks thought the IBM guys were ripping off ideas from
the INGRES project. We had a strained relationship.

CS511 Advanced Database Management Systems

5


Joint

ACM Software System Award
1988

System R:


Donald Chamberlin, James Gray, Raymond Lorie, Gianfranco Putzolu,


Patrici Selinger, Irving Traiger

INGRES:


Gerald Held, Michael Stonebraker, Eugene Wong


Citation



The INGRES and System R systems demonstrated that a practical and
efficient database management system (DBMS) could be implemented based
on the relational data model.

These systems were full
-
function DBMS's that
supported non
-
procedural query languages (QUEL and SQL), automatic
query optimization, alternative storage structures, transactions, crash
recovery, views, integrity, and protection. They have revolutionized the
database system industry by showing how data stored in a computer can be
conveniently accessed by end users and while at the same time it can be used
by production application programs.




http://awards.acm.org/software_system/


CS511 Advanced Database Management Systems

6


Contributions of System R??


CS511 Advanced Database Management Systems

7


Contributions of System R?


Bringing theory to practice


nice theory implemented into practical system


High level query language (SQL)


Codd’s relational algebra/calculus were criticized as too
mathematical


“System” research in action


macro: design a complete system architecture


micro: identify key problems and provide solutions


Defining database landscape


industry product spec. and research directions

CS511 Advanced Database Management Systems

8


Complete System Study


Phase 0: 1974, 1975


initial single
-
user prototype


try out ideas and find issues


felt a good idea to plan to throw away ver. 1.0


Phase 1: 1976, 1977


full
-
function, multi
-
user prototype


Phase 2: 1978, 1979


evaluation and feedback


lots of good lessons learned


Very similar process took place in INGRES

CS511 Advanced Database Management Systems

9


Relational System Modules??
(subsystems)


CS511 Advanced Database Management Systems

10


System Modules Identified


view management


query parser/rewriter


query optimizer


query executor


data storage


access methods


buffer manager


lock manager


log/recovery system

CS511 Advanced Database Management Systems

11


System R Architecture


RDS/RSS divide remains in many systems


RDS: query processing (logical)


view, query parser, rewriter, optimizer, executor


RSS: storage/access methods (physical)


storage, access methods, buffer manager, lock,
log/recovery

CS511 Advanced Database Management Systems

12


Views


View defined as a query


another consistent use of SQL (no separate DDL)


Query on views:


query rewriter to flatten view (unfold def.)


form a composite query tree


View transparency


Almost any queries on any views


Not fully transparent though:


update: only for single
-
relation views


no right meanings in some cases


many
-
to
-
one nature of view def. (ambiguity)


even none
-
to
-
one (some view state has no correspondence)

CS511 Advanced Database Management Systems

13


SQL as Query Language


High
-
level declarative, English
-
based language


declarative language: what not how


well
-
founded/simple semantics based on relational algebra


small set of well
-
understood operators, so optimizer knows:


how operators can be interchanged/transformed


what equivalent implementations are for each op


Consistent for different functionalities


data definition: e.g., table creation, view definition


data manipulation: e.g., queries, updates


Uniform for different usage scenarios


embedding from different host languages (canned queries)


ad hoc user queries from command lines


Unexpected benefit: Standardized DB interface (mid 80’s)

CS511 Advanced Database Management Systems

14


What makes SQL possible?

Query: parse, access
-
path selection, code gen, execute


Cost
-
based access
-
path selection (optimization)


Pre
-
compilation for canned queries


remove preprocessing/optimization from run time


data indexes and statistics may change:


reoptimize and recompile by observing dependencies


alternative approaches?


caching of recent used query plans


trigger to invalidate cached plans on relevant events


e.g., on rebuilding system statistics, on index creation


in contrast, interpreted QUEL in INGRES admitted mistake

CS511 Advanced Database Management Systems

15


Query Optimizer: Cost Based


Cost
-
based optimizer: set up paradigm


largely unchanged since


Cost model:


C = weighted
-
sum(CPU time, #IO)


CPU time modeled as number of RSS calls


CS511 Advanced Database Management Systems

16


Query Optimizer: Access Path
Selection


Access path selection based on expected costs


select people where job = programmer and city = champaign


path 1: job index
--
> check city; path 2: city index
--
> check job


more paths?


Data independence: what are hidden from users?


Cost estimation based on


index selectivity


job (=programmer) more selective or city (=chamapign)?


index clustering


records of same/neighboring key are packed physically together


minimize #IO to fetch records of same key or a range


a relation can typically has at most one clustering index. Why?

CS511 Advanced Database Management Systems

17


Query Optimizer: Join Strategies

To evaluate R.a = S.a:


Nested
-
loop join:

for each tuple r in R:

use index, fetch S tuples s s.t. s.a=r.a


Q: B
-
tree or hashing index better?


Sort
-
merge join:

sort R, sort S

merge tuples in R and S in order


Q: Use B
-
tree index to speed up? Hashing?


All joins two
-
way; n
-
ary joins as binary trees


prune away lots of alternative plans for n
-
ary joins

CS511 Advanced Database Management Systems

18


Storage: Phase 0


Tuples have TID, containing page number


direct access by TID to fetch the page


Tuples contain pointers to values in “domain”


pros:


clean and original “domain” notion in Codd’s model


ultimate data normalization


cons: inefficient


idea picked up by some web search engines to store cached
documents


Design of Database Systems …
--
>


0AF1 00A0 A0B8 B001 …

CS511 Advanced Database Management Systems

19


Access Methods: Phase 0


Inverted Index (an inversion is also an index)


domain value
--
> list of TIDs


similar idea used in text information retrieval for word
to document ID mapping


Q: find documents with “cat” and “dog”


inverted index:



cat
--
> d1, d2, d3



dog
--
> d1, d3, d5, d6


A: d1, d3


How is this different from, say, B
-
trees?

CS511 Advanced Database Management Systems

20


Access Methods: Phase 1


B
-
trees


Hashing: rejected


place records to buckets based on hashing values


reason of rejection: cannot support range queries


Advantages?


to reach a record, you…

CS511 Advanced Database Management Systems

21


Lock Manager: Influential


Multiple granularity (hierarchy) of locks


records, relations, entire database


Intention locks for traversing data hierarchies


lock table with intention before locking tuples


acquiring locks from top down


Deadlock handling: detection



We will study this in more detail later (Gray)

CS511 Advanced Database Management Systems

22


Lock Manager: Predicate
-
Lock
Abandoned


Lock all the tuples satisfying a predicate


e.g., lock “student.dept = CS”


Problems:


hard to determine if locks conflict


“dept = CS” vs. “GPA > 3.0”?


involving semantics


data snapshot also matters


locked set can be changing during locking




CS511 Advanced Database Management Systems

23


Lock Manager: Convoy Problem

Lock resource may interfere with OS resources


Transaction T1 holds lock while to be dispatched by OS


Transaction T2, …, Tn all wait; convoy formed


T1 dispatched, and then release lock


say 10K instructions for dispatching


lock granted to T2, which is waiting to be dispatched by OS


T1 soon need the lock again, go back to end of convoy


say only executed 1k instructions between locking


T1 is now wasting its CPU time slice


Most cycles are for dispatching only!


T1: 10K dispatching, 1K execution, waiting in convoy



CS511 Advanced Database Management Systems

24


Log/Recovery: Failures


Transaction failures


i.e., transaction rollback if cannot be committed


System failures


all data updates in main memory buffer lost


Media failures


data on disk lost


CS511 Advanced Database Management Systems

25


Log/Recovery: Before Logging


“Before” (undo) logging: record old value


after crash, bring DB back to consistency


undo all uncommitted transactions


problems:


data must be flushed to disk before transaction commit


after crash, cannot bring to the most recent status

CS511 Advanced Database Management Systems

26


Log/Recovery: After Logging


“After” (redo) logging: record new value


after crash, bring DB forward to consistency


redo all committed transactions


problem:


must hold data in buffer until commit made


long history of “committed transactions”; slow in
recovery


CS511 Advanced Database Management Systems

27


Log/Recovery: Before/After Logging


Combined before/after logging; to recover:


checkpoints to archive current healthy state


redo all committed transactions


undo all uncommitted transactions


Dual logs to prevent log failure


Shadow pages not a good idea


tend to lose physical clustering


too much overhead in maintaining


alternative: in
-
place updates; separate logs #

CS511 Advanced Database Management Systems

28


System R Influences: Other Ideas


catalog as relations
--

metadata managed as data


security management (grant/revoke)


integrity constraints, triggers




CS511 Advanced Database Management Systems

29


System R Influence


Ellison's Oracle beats IBM to market by reading white papers.


IBM releases multiple RDBMSs, settles down to DB2.



Gray (System R), Jerry Held (Ingres) and others join Tandem
(Non
-
Stop SQL), Kapali Eswaran starts EsVal, which begets
HP Allbase and Cullinet


Relational Technology Inc (Ingres Corp), Britton
-
Lee/Sybase,
Wang PACE grow out of Ingres group


CA releases CA
-
Universe, a commercialization of INGRES



CS511 Advanced Database Management Systems

30


A Comment from Students

I think it is remarkable how this System R influenced many
actual Relational Database systems architecture, with it's
various components as the locking , recovery, security and
views subsystems. To read this is not much different than
for example reading the description of the architecture of
Oracle (I used to work for them).



CS511 Advanced Database Management Systems

31


What You Should Know


The main challenges the system R builders had to
solve


Challenges related to the implementation of the
relational model (storage, indexing, query
optimization, query language)


Additional challenges (locking system, log/recovery
system)


The main contributions of system R in implementing
a relational data model (mostly corresponding to the
challenges above)

CS511 Advanced Database Management Systems

32


Carry Away Messages


System development is as valuable as (or more
valuable than?) theory formulation


Often, new problems/challenges would be
discovered through system building


A system’s value depends on


Utility of the system (new/better functions that people
care about)


Technical contributions (solutions to particular
challenges)


System R has high values in both aspects!