CSC 485D/586D/SEng 480D Introduction

basesprocketData Management

Oct 31, 2013 (3 years and 9 months ago)

52 views

CSC 485D/586D/SEng 480D

Introduction


In essence a database is nothing more than a
collection of
information

that exists over a long period of time.



Databases are empowered by a body of knowledge and technology
embodied in specialized software called a
database management
system
, or
DBMS
.


A DBMS is a powerful tool for creating and managing large amounts of
data efficiently and allowing it to persist over long periods of time, safely.


Among the most complex types of software available.

What’s a database?

1.
Allows users to create new databases and specify their
schema

(
logical
structure of the data
), using a
data
-
definition language
.


2.
Gives user the ability to
query

the data and
modify
the data, using a
query
language

and

data
-
manipulation language
.


3.
Supports
intelligent storage

of very large amounts of data.



Protects the data from

accident
or

not proper use
.


Example:

We can require from the DBMS to not allow the insertion of two
different employees with the same SIN.



Allows
efficient access

to the data for queries and modifications.


Example:

The use of indexes over a specified field, e.g. on the
name

field for
employees, allows fast response for queries asking a specific name.


4.
Controls access to data from
many users at once

(
concurrency
), without
allowing “bad” interactions that can corrupt the data accidentally.

5.
Recovers from failures and crashes.

The database [management] system

Relational Model


Based on tables:



Today used in most DBMS’s


Oracle, SQL
-
Server, IBM DB2,
Sybase, Microsoft Access etc…


Challengers: Object oriented
DB’s (ObjectStore,
Objectivity…)


Object
-
relational:: oo
-
extension
of rel’s

Database Studies


Design of databases.



What kinds of information go into the database?


How is the information structured?


How do data items connect?



Database programming.



How does one express queries on the database?


How does one use other capabilities of a DBMS, such as transactions or
constraints, in an application?


How is database programming combined with conventional
programming?



Database system implementation.



How does one build a DBMS, including such matters as query
processing, transaction processing and organizing storage for efficient
access?

Fictitious Megatron 2006 DBMS


Stores relations as Unix files


Students
(
name, sid, dept
) is stored in the file
/home/megatron/students

as


Smith#123#CS

Jones#533#EE



Schemas are stored in
/home/megatron/schemas

e.g.


Students#name#STR#id#INT#dept#STR

Depts#name#STR#office#str


Megatron sample session

mayne$

megatron

WELCOME TO MEGATRON 2006


megaSQL%

SELECT * FROM Students;


Name


id

dept



----------------------------------

Smith


123

CS

Johnson

522

EE


megaSQL%

Megatron sample session II

megaSQL%

SELECT * FROM Students


WHERE id >= 500 | HighId.txt;


megaSQL%

more HighId.txt

Jones#522#EE


megaSQL%

quit

THANK YOU FOR USING MEGATRON 2006


mayne$

Megatron Implementation


To execute SELECT * FROM R WHERE <COND>



Read file schema to get attributes of R


Check that the <COND> is semantically valid for R


Read file R,


for each line


check condition


if OK, display



If we pipe the result into a file, say T, then add an entry for T in
the file
/home/megatron/schemas

Megatron Implementation II


To execute


SELECT office


FROM Students, Dept


WHERE Students.name = 'Smith' AND


Students.dept = Depts.name;



Read file schema to get attributes and do semantic check.


If Ok, then,

for each tuple
s

in
Students


for each tuple
d

in
Depts




if
s

and
d

satisfy the WHERE condition,




display the
office

value from
s

What’s wrong with Megatron?


Tuple layout on disk: no flexibility for DB modifications.


Change CS to ECON and the entire file has to be rewritten.


Search Expensive: no indexes; always read entire relation.


Brute
force query processing.


Did we need to look at all pairs of student
dept tuples?


No buffer manager: everything comes off of disk all the time.


No concurrency control: several users can modify a file at the
same time with unpredictable results.


No reliability: can lose data in a crash or leave operations half
done.


Little security: file system protection too coarse.


The “cylindrical” component contains
not only
data
, but also
metadata
, i.e.

info
about the structure of data
.


If the DBMS is
relational

the
metadata includes:


names of relations
,


names of attributes

of those
relations, and


data types

for those attributes
(e.g., integer or character
string).



Often a database maintains
indexes

for
the data.


Indexes are part of the stored data.


A description of which attributes
have indexes is part of the metadata.

Architecture of a DBMS


The job of the
storage manager

is


to
obtain

requested information
from

the data storage, and


to
modify

the information
to

the data storage when requested.



The
buffer manager

handles main memory. It obtains and returns blocks of
data from/to the file manager and stores the blocks temporarily in main
memory pages.



E.g. 1 block = 1 page = 4,000 to 16,000 bytes.



One block is the smallest unit of data that is read/written from/to disk.

Storage and BufferManager

Query Processor


The
query processor

handles:
queries+modifications to the data.


Its job is to find the best way to carry
out a requested operation and,


to issue commands to the storage
manager that will carry them out.



E.g.

A bank has a DB with two relat.:



Customers (name, ssn, address),


Accounts (accountNo, balance, ssn)





Query
: “Find the balances of all accounts
of which Sally is the owner.”


SELECT

Accounts.balance


FROM

Customers, Accounts


WHERE

Customers.ssn = Accounts.ssn

AND

Customers.name = “Sally”


What this query
logically

says is:

1.
Make the Cartesian product of the tables specified in the
FROM
-
clause,


i.e. associate each tuple of
Customers

with each tuple of
Accounts
.


We get a new temporary relation
R

with longer tuples,


the attributes are renamed so as to include the name of originating relation.
(
Customer.ssn

etc.)

2.
Chose from
R

only the tuples satisfying the condition in the
WHERE

clause.

3.
Produce in the answer only the values of attributes in
SELECT
-
clause.





Of course,

if we would answer this query as it says the performance would be
terrible (step 1)
.



Supp. we have an
index

on
name

of
Customer

and an
index

on
ssn

of
Accounts
.


Using the index on name of
Customer

we need usually
three

disk access.


One

more access gets us the tuple for “
Sally”


Similarly we need
four

disk accesses for finding the correspond. account.


If there are several accounts of “
Sally”

we need a few more accesses.

Query Processor (Cont.)


The
transaction manager

is responsible
for the
integrity

of the system. It must
assure that:


several queries running
simultaneously do not interfere with
each other and that,


the system will not lose data even if
there is a power failure.



The transaction manager
interacts

with:


execution engine
,


it may need to delay certain
queries or operations in order to
avoid conflicts.


storage manager



schemes for protecting the data
usually involve storing a
log

of
changes to the data.


Transaction Manager

What will be covered

1.
Storage Systems:

a)
Physical devices and characteristics, especially disks.

b)
Logical layout of data; data structures, especially, indexed
sequential files,
B
trees, hashing.

c)
Multidimensional indexes for GIS and OLAP

2.
Query optimization (
we concentrate a lot here
):

a)
Query
plan generation; algebraic transformations.

b)
Join methods.

3.
Resilience:

a)
Logging.

b)
Authorization and encryption.

4.
Transaction processing: Serialization, deadlocks, locking, timestamping.

5.
Distributed DB's.

6.
OLAP
in detail

7.
Data Integration