Slides - University of Pennsylvania

righteousgaggleData Management

Jan 31, 2013 (4 years and 4 months ago)

110 views

Archetypal Databases,

or, How Do I Build a DBMS?

Zachary G. Ives

University of Pennsylvania

CIS 650


Implementing Data Management Systems



January 12, 2005


2

Ahh, the
1970
s…

System
-
R

… well, it

returned


Which Icons Endured?

3

The INGRES System


Interactive Graphics and Retrieval System


Probably the first usable RDBMS


One of the first real projects built on UNIX


On a minicomputer, the PDP
-
11
, which greatly constrained things due
to limited memory


No low
-
level access to disk


prevented clustering in storage


Based on a relational query language called QUEL


Many religious wars between QUEL and SEQUEL camps


Stonebraker’s first commercialized project


Changed university IP rules forever



Today: was just open
-
sourced by Computer Associates

4

Processes


Due to 64K process size limits,
needed to break things into 4
processes communicating via pipes


Later added a Process 2.5 to make 5
processes

C+EQUEL

program

lexer, parser

concurrency

query mod.

Query

processing

Utilities,

recovery

2

3

4

5

System
-
R


Probably a bit closer to today’s DBMSs, at least at
the low level


but didn’t run on UNIX


Based on years of experience with IMS and other
IBM database systems


At this point, prior to Selinger (note limited optimization),
Lindsay (not a huge performance focus)


Components were built to be generally reusable


Default language was called SEQUEL, by Chamberlin and
Boyce

6

System
-
R Architecture

Relational Storage

System


Storage, concurrency,

access paths (“images”, “links”),

triggers

Relational

Data System


parsing,

(limited) optimization

SEQUEL or e
-
SEQUEL

(RDI)

(RSI)

7

Languages


QUEL vs. SEQUEL


The focus of many religious wars, though they borrowed each other’s ideas


Ultimately IBM won due to (
1
) market presence, (
2
) Oracle


SEQUEL had NULL concept, bag semantics, aggregation



Postgres was originally based on QUEL, hence PostgreSQL

QUEL: more orthogonal,

simple EQUEL embedding


RANGE OF X IS MYREL

RETRIEVE (A.B)

WHERE …


## C
-
block

SEQUEL: more block
-
oriented,

embedded via cursors and row sets,

aggregation

SELECT A.B

FROM MYREL X

WHERE …

8

Administration of a Database

INGRES:


Only the DBA can create shared relations and grant access to them


It’s possible to create temporary relations (and required for query
processing)

System
-
R:


Anyone can create private relations and grant access to them


It’s possible to create persistent or temporary relations


Mechanisms to add columns to tables (by default these become
NULL)

9

Integrity and Security


Both systems allowed much more general notions of integrity
than key constraints


Assertions as a means of validation!


Query modification

is one method used to do this


Conjunction of assertions plus the query/update


Only limited expressiveness


in INGRES there must be at most one
variable


Security models:


System
-
R used a view
-
based security model


INGRES used query modification


How do these differ? Which is better?

Data Storage

11

Physical Data Layout


All tuples have TIDs


TID is a page + index


System
-
R


Allowed links between tuples


Built
-
in concept of NULL


Where else are TIDs useful?



Page
-
level clustering: done
by System
-
R in
extents


What about INGRES?

t1

t2

t
3

12

Access Paths


INGRES


INGRES Access Methods Interface (AMI):


Can have unordered “heap file”


Hash
-
based indexing


ISAM
-
like indexing on a primary key


Predates B+ Tree


initially height
-
balanced, but afterwards index
structure is static; requires overflow pages


“ISAM
-
like” because can’t sort across pages


can only lock one page at a
time!!!



Lookups are done via OPENR(), GET()


FIND() can be used to find a start or stop point


Can call PARAMD, PARAMI to get parameters of data or index


Idea of extensible access methods

13

Access Paths


System
-
R


“Images”


Slightly more than indices (which are included there)


also linked lists
and orderings


Index structures include ISAM and B Trees


Search arguments, aka sargable predicates


Note that clustering was a key concept here


Links via TIDs


Note that these were NOT relational in spirit!


Intention was to support IMS over RSS


Some code likely made its way back into IMS



How do these concepts relate to OO databases and today’s XML
databases? Path indices?

14

Catalogs and Indices as Logical
Abstractions


System
-
R and INGRES decided to express catalogs
as relations, making them accessible to queries as
well as enabling reuse of code


How are they accessible in today’s RDBMSs?


INGRES even did this with indices


Indices vs. views: how are they different?


How are they different in the presence of TIDs or links?


15

Views


INGRES:


Not supported at that time (though later)


System
-
R:


Single
-
table views are typically updatable, if one
-
to
-
one
(contrast with today’s SQL)


Additionally, cursors could typically be modified!


Views as a security/encapsulation mechanism


GRANT and REVOKE privileges

16

Triggers


Note that System
-
R had triggers from the beginning!


Later work led to the idea of
active databases
, which
had very rich triggger languages

Query Processing

18

Query Processing
-

INGRES


DECOMP algorithm and One Variable Query Processor


Break every query into separate operations that generate a temp
relation for every projection, selection, join


Pick a relation to iterate over, substitute value for variable


Repeat recursively



Note that (1) it’s interpreted, and (2) it’s adaptive, always choosing
the smallest relation for the substitution



Note that access paths were chosen in a pretty ad hoc way


Also, no concept of sargable predicates


instead, could FIND both
upper and lower bounds, but needed to GET between

19

DECOMP Example


RANGE OF E, M IS EMPLOYEE

RANGE OF D IS DEPT

RETRIEVE (E.NAME)

WHERE
E.SALARY > M.SALARY

AND

E.MANAGER = M.NAME

AND

E.DEPT = D.DEPT

AND

D.FLOOR# = 1

AND

E.AGE > 40



First: apply selection and projection

20

DECOMP, Ctd.


RANGE OF D IS DEPT

RETRIEVE INTO T1 (D.DEPT)

WHERE D.FLOOR# = 1


RANGE OF E IS EMPLOYEE

RETRIEVE INTO T2(E.NAME, E.SALARY,
E.MANAGER, E.DEPT)

WHERE E.AGE > 40



Now substitute these back into the main query

21

DECOMP, Ctd.


RANGE OF E IS T2

RANGE OF M IS EMPLOYEE

RANGE OF D IS T1

RETRIEVE (E.NAME)

WHERE
E.SALARY > M.SALARY

AND

E.MANAGER = M.NAME

AND

E.DEPT = D.DEPT




Now pick the relation with smallest cardinality, e.g.,
T1, and substitute

22

One
-
Variable Substitution


For each D in T1, recursively process:


RANGE OF E IS T2

RANGE OF M IS EMPLOYEE

RETRIEVE (E.NAME)

WHERE
E.SALARY > M.SALARY

AND

E.MANAGER = M.NAME

AND

E.DEPT = *val*

23

DECOMP, Recursively


Next stage:


RANGE OF E IS T
2

RETRIEVE INTO T
3
(E.NAME, E.SALARY, E.MANAGER, E.DEPT)

WHERE
E.DEPT = *val*


And substitute into the bigger query:


RANGE OF E IS T
3

RANGE OF M IS EMPLOYEE

RETRIEVE (E.NAME)

WHERE
E.SALARY > M.SALARY

AND

E.MANAGER = M.NAME


Now choose the smallest relation for substitution (e.g., T
3
)

24

DECOMP, Recursively


For each D in T1


For each E in T2


RANGE OF M IS EMPLOYEE

RETRIEVE INTO (val1)

WHERE
val2 > M.SALARY

AND

val3 = M.NAME

25

Query Processing


System
-
R


Optimization is cost
-
based


Consider both disk cost (primary) and CPU cost (scaled
by some
H
)


Compare clustered, non
-
clustered indices; sequential scan


Needed to consider cases where data was interspersed with
other relations



Every join is binary; query plans are compiled


Merge join and nested loops join are present


Also have a link
-
based join


No dynamic programming yet


Selinger was not yet aboard

ACIDity

27

Rollback / Abort


Notion of transactions in System
-
R encompassed
multiple operations


BEGIN_TRANS, END_TRANS, SAVE, RESTORE


Both systems had a notion of “old” and “new” pages
(“shadow paging”)


Could roll back transactions by swapping back the old
page


But how did that work with concurrency?


What else did System
-
R do to help here?

28

Concurrency & Locking


INGRES:


Query locked all resources (table
-
level) before it began


All page updates are atomic via locks


Avoid deadlocks by preventing cases where they could occur


this is
why they don’t do true ISAM


System
-
R:


Multiple levels of lock granularity


logical: table
-
, range
-
level


physical: page
-
, tuple
-
level


Shared, exclusive locks


Multiple isolation levels (READ UNCOMMITTED, READ
COMMITTED, SERIALIZABLE; no REPEATABLE READ)


Resolve deadlocks by choosing a victim, restarting

29

Recovery


INGRES:


Deferred updates


for performance, isolation


Also used to make recovery possible


But how far does this take you?


System
-
R:


Notion of checkpoints and restarting


Transaction logging and replay


Not much mentioned about possibility of recovery from
failed restart


Today’s techniques even recover from that

Analysis

31

What Ideas from this Work

Were Broken?


Shadow paging


Now everything is purely log
-
based


SQL idiosyncrasies


INGRES recovery


Relations as files


Query optimization


32

Discussion:

What Ideas Are We Still Using?