CS377: Database Systems

radiographerfictionΔιαχείριση Δεδομένων

31 Οκτ 2013 (πριν από 4 χρόνια και 12 μέρες)

117 εμφανίσεις

CS377: Database Systems

CS377: Database Systems


Fusheng Wang

Department of Biomedical Informatics

Department of Mathematics and Computer Science

Emory University

CS377: Database Systems

Data and Database


Data are collection of facts or numbers


By relating different pieces of data, we are able to
extract valuable information by presenting it in
meaningful context


For that we need to be able to:


Store data
to get it persisted


Structure data
for easier manipulation


Organize data
in a meaning way


Process data
to derive data value from it


A database is an organized collection of
data,
and
supporting
data
structures

Introduction

CS377: Database Systems

Example


Patient’s history, lab results, pathology reports,
radiology images, medications, billing data …


All the data is in databases


Questions to ask about data


Can I find the data I need?


Can I find the data quickly?


Can I understand the data once I find them?


Can I add and update data reliably?


Such questions are answered through database
management systems

Introduction

CS377: Database Systems

What Is a Database Management System?


A

database

is an integrated collection of data
records, files, and other digital

objects


A database management system (DBMS) is


a software system designed to store, manage, and
facilitate access to databases


A DBMS allows multiple users or applications to
concurrently access the same database reliably


Excel is not a database management system

Introduction

CS377: Database Systems

Benefits of Using a DBMS


Information representation:
structured and
normalized data


Persistent:
data in the database outlives the programs
that execute on that data


Convenient:
declarative language for accessing data


Efficient retrieval:
efficiently handle huge volumes of
data,
e.g., terabytes
or more


Multiple user access:
data can be updated
simultaneously without risk of losing data and its
consistency


Maintenance:
tools for data backup, restore and
recovery


Data security:
database security and fine grained
access control




Introduction

CS377: Database Systems

Data
Modeling

Schema
Definition

Data
Loading

Querying

Data
Exchange

Build a Database

Introduction

DBMS

Key concepts: data modeling; schema definition;
querying; efficiency; data exchange

CS377: Database Systems

Example Database: Patient Encounter


Patient has encounters, described with diagnosis
codes (ICD9)


Introduction

CS377: Database Systems

Data Model


Define how data is to be represented,
structured, linked, and constrained


Entity relational model (ER):
a popular
conceptual model for database design


Relational model: implemented by most
DBMS in the market: RDBMS

Data Modeling

CS377: Database Systems

ER Model


A database is of a collection of instances of
entities
, independent of any other entities


Entities have
attributes

to characterize the
entity


There could be relationships between entities

Patient

Encounter

diagnosis

code

has

1

N

has

1

N

gender

DOB

race

encounter

type

discharge

disp

start

timestamp

end

timestamp

l
ast


name

f
irst


name

OBX

timestamp

entity id

Data Modeling

CS377: Database Systems

Relational Database with the Relational Model


A relational database is implemented based on the
relational data model


Data stored in
tables
, consists of
columns

and
rows


Each column has a specific
data type


Constraints
: primary key, uniqueness…

key

Data Modeling

CS377: Database Systems

SQL: Database Language for RDBMS


SQL (Structured Query Language) is the
standard language of relational database


A SQL query is compiled and executed by the
DBMS engine


Two types of SQL


Data Definition Language (DDL),
defines properties of
data objects. e.g. creation of a table


Data manipulation language (DML)
: retrieve, insert,
update and delete

Relational Database

CS377: Database Systems

Example SQL: Creating a Table

CREATE TABLE
PATIENT (


PATIENT_KEY
INTEGER

NOT NULL
,


FIRST_NAME
VARCHAR
(32),


LAST_NAME
VARCHAR
(32),


DATE_OF_BIRTH
DATE
,


LANGUAGE


VARCHAR
(32),


MARITAL_STATUS
VARCHAR
(50),


RACE
VARCHAR
(16),


GENDER
CHAR
(6),


PRIMARY KEY
(PATIENT_KEY) );

Schema Definition

CS377: Database Systems

Data Loading


Data loading using SQL statement:





Data loading using import tools:

Data Loading

INSERT INTO
PATIENT
VALUES
(


2,’MONICA’, ’REESE’, ‘1960
-
11
-
02’,


’Czech’, ‘Widowed’, ‘Black’, ‘female’ );

CS377: Database Systems

Example SQL

SELECT
PATIENT_KEY

FROM
ENCOUNTER

WHERE
ENCOUNTER_TYPE =‘
EMERGENCY
’;

Select all patients with “emergency” encounters:

Querying

CS377: Database Systems

Making Queries Fast


DBMS provides many optimizations to
accelerate queries


Indexing: tree like data structure with sorted
and branched keys for quick identification of
records




Hard disk reading can be a bottleneck for
queries: use many disks reading in parallel


Caching: caching frequently used data in
memory



Indexing

CS377: Database Systems

Data Access Architecture

Application
Server

Database
Server

Architecture

Network


JDBC/ODBC…

Network


Web Service

XML

Network

JDBC/ODBC…

Client

Client


Standard database access API


Standard application communication
interface


Web Service


Standard data exchange format


XML

CS377: Database Systems

Data Exchange: XML


XML is a language to encode data with simple,
human
-
readable tags


Standard language for data exchange over the Web


HL7 CDA uses XML for messaging

Data Exchange

2,MONICA,REESE,1960
-
11
-
02,Czech,Widowed,Black,female

CSV

XML

CS377: Database Systems

Key People



DBMS implementer: build the system


Database kernel engineer


DBMS designer: establish schemas


Database architect


Database application designer: develop
applications that use databases


Application/database developer


Database administrator: load data and keep
system running smoothly


DBA


Database Users

CS377: Database Systems

A Brief History of DBMS


Pre
-
relational era (1970’s)


Hierarchical (IMS), Network (
Codasyl
)


Complex data structures and low
-
level query language


Relational DBMSs (1980s)


Edgar F.
Codd's

relational model in 1970


Set of tuples (i.e., tables) as data model


Powerful high
-
level query language


Object
-
Oriented DBMSs (1990s)


Motivated by “
impedance mismatch” between RDBMS and
OO PL


Persistent types in C++, Java or Small Talk


Issues: Lack of high level QL, no standards, performance

History of DBMS

CS377: Database Systems

A Brief History of DBMS (2)


OR
-
DBMS (1990s)


Relational DBMS vendors’ answer to OO


User
-
defined types, functions (spatial, multimedia)


Nested tables


SQL: 1999 (2003) standards. Plus performance


XML/DBMS (2000s)


Web and XML are merging


Native support of XML through ORDBMS extension or
native XML DBMS


History of DBMS

CS377: Database Systems

A Brief History of DBMS (3)


Decision support system (DSS) (2000s)


Data warehousing and OLAP


Data stream management systems (2000s)


Continuous query against data streams


The era of big data (mid 2000
-
now):


Big data
: datasets that grow so large (terabytes to
petabytes
) that they become awkward to work with
traditional DBMS


MapReduce:
a simplified distributed computing framework;
dominates on big data processing



NoSQL
” (not only SQL) is fast growing


History of DBMS

CS377: Database Systems

A Brief History of DBMS Products


First hierarchy DBMS: IBM Information
Management System (IMS)


starting in 1966 for the Apollo program


Still going strong over 40 years later


Mainframe only


IDMS (Integrated Database Management System) is
a network model based system


The roots of IDMS go back to Dr. Charles Bachman's
IDS (Integrated Data Store) developed at GE


Since 1989 the product has been owned by Computer
Associates, who renamed it CA
-
IDMS


Mainframe only


History of DBMS Products

CS377: Database Systems

A Brief History of DBMS Products (2)


Two early RDBMS projects started and were
operational in late 1970s:INGRES and System R


INGRES (
IN
teractive

G
raphics
RE
trieval

S
ystem)
started at UC Berkeley, by Michael
Stonebraker

and
Eugene Wong


In the early 1980s, Ingres competed head
-
to
-
head
with Oracle, but lost market due to Oracle’s
marketing and Ingres’ own proprietary QUEL


Since the mid
-
1980s, Ingres has spawned into:
Sybase, Microsoft SQL Server,
NonStop

SQL, etc


Postgres

(Post Ingres) started in the mid
-
1980s, later
evolved into
PostgreSQL


In the 1990s
Stonebraker

commercialized
Postgres

as
Illustra
, later sold to Informix (sold to IBM in 2001)

History of DBMS Products

CS377: Database Systems

A Brief History of DBMS Products (3)


IBM
System R
was a research project at IBM San
Jose Research (now IBM
Almaden

Research) in the
1970s


SQL/DS

was IBM's first commercial DBMS for
mainframe built around SQL in early 1980s


A little later, in 1983, IBM released
DB2
on its MVS
mainframe platform


IBM brought DB2 to other platforms (LUW) in 90s.
DB2 renamed as
DB2 UDB z/OS, DB2 UDB LUW


Larry Ellison and his friends started Software
Development Laboratories (SDL) in 1977, which
developed the original version of
Oracle


The name
Oracle

comes from the code
-
name of a
CIA
-
funded project Ellison had worked before


History of DBMS Products

CS377: Database Systems

Genealogy of RDBMS

http://www.hpi.uni
-
potsdam.de/naumann/projekte/rdbms_genealogy.html

History of DBMS Products

CS377: Database Systems

A Brief History of SQL


SQL: Structured Query Language Invented in 1974
by Donald Chamberlin and Raymond Boyce for IBM


Initially called SEQUEL, changed to SQL due to
trademark issue


In late 1970s, Relational Software, Inc. (now Oracle
Corporation) introduced the first commercially
available implementation of SQL in Oracle V2


Multiple standard revisions and multiple flavors
(implementations) exist


History of SQL

CS377: Database Systems

SQL Standard Revisions


SEQUEL/Original SQL
-

1974


SQL
86: ratification and acceptance of a formal SQL
standard by ANSI and ISO


SQL2 (a.k.a. SQL
92): still strictly relational, with new
primitive data types, operations and join types


SQL3: working documents discussing new specs for
OR systems, but also for recursion, active rules, OLAP


SQL:1999: added user defined types, etc


SQL:2003: added XML
-
related features, etc


SQL:2006: increased support for XML support for
XQuery, an XML
-
SQL interface standard


SQL:2011: added temporal support


And evolution continues…

History of SQL

CS377: Database Systems

Database Landscape Map


December 2012

http://blogs.the451group.com/information_management/2012/12/20/database
-
landscape
-
map
-
december
-
2012/

DBMS Products

CS377: Database Systems

The Current RDBMS Market (Forrester’09)

DBMS Products

CS377: Database Systems

Enterprise Data Warehouse Platforms
(Forrester’11)

DBMS Products

CS377: Database Systems

Fundamental Principles of Relational
Databases: ACID


A
tomicity: all or nothing


C
onsistency: any transaction will take the DB from
one consistent state to another, without broken
constraints (referential integrity)


I
solation: other operations cannot access data that
has been modified during a transaction that has not
yet completed


D
urability: ability to recover the committed
transaction updates against any kind of system
failure (transaction log)

DB Fundamentals

CS377: Database Systems

RDBMS: ACID

Availability

Consistency

Properties of Transactions


A
tomicity: all or nothing


C
onsistency: any transaction will take the DB from one
consistent state to another, without broken constraints


I
solation: other operations cannot access data that has
been modified during a transaction that has not yet
completed


D
urability: ability to recover the committed transaction
updates against any kind of system failure

CS377: Database Systems

Distributed DBMS: CAP
theorem

Properties of Transactions


Consistency
: all nodes see the same data at the same
time


Availability
: a guarantee that every request receives a
response about whether it was successful or failed


Partition tolerance
: the system continues to operate
despite arbitrary message loss or failure of part of the
system

CS377: Database Systems

NoSQL

Systems


Most people think “
NoSQL
” is short for “Not Only
SQL”. It’s more a movement rather than technology


When “
NoSQL
” is applied to a database, it refers to
a set
of
weakly defined, mostly
open
-
source
databases, mostly developed in the early 21st
century, and mostly not using SQL


Common characteristics of
NoSQL

databases:


Non
-
relational


Schema
-
free


Non
-
(quite)
-
acid


Horizontally scalable, distributed, easy replication


Simple API


NoSQL

CS377: Database Systems

NoSQL

Systems

Category

Data Model

Example Databases

Key
-
Value

(Global) collection of
K
-
V pairs

BerkeleyDB
,
LevelDB
,
Memcached
, Project
Voldemort
,

Redis
,
Riak

Column Families

Big table, column
families

Amazon
SimpleDB
, Cassandra,
HBase
,
Hypertable

Document

Collections of K
-
V
Collections

CouchDB
,
MongoDB
,
OrientDB
,
RavenDB
,
Terrastore

Graph

Nodes, relations, K
-
V
on both

FlockDB
,
HerperGraphDB
,
Infinite Graph,
AllegroGraph
,
Neo4j,
OrientDB

NoSQL

Products

CS377: Database Systems

Scope the Course


Relational data model


SQL query language


Relational algebra


Conceptual modeling: ER model and UML model


Relational design theory


XML and XML query language (
XPath

and XQuery)


Database programming with Java: JDBC


Data storage and indexing


Constraints and t
riggers


Transactions


Views


NoSQL


Overview

CS377: Database Systems

Textbook

Ramez

Elmasri

and

Shamkant

Navathe:


Fundamentals
of Database Systems


(
6th Edition),

Addison
-
Wesley.

Overview

CS377: Database Systems

Course Overview


Lectures


About 25 lectures


Homework


MySQL setup; SQL; XML and XML queries; relational
algebra; functional dependencies


Projects


Database design


JDBC programming (Java programming needed)


Grading: homework (30%) + projects (30%) +
midterm (15%) + final (25%)


Overview

CS377: Database Systems

Course Resources


Course wiki


https://web.cci.emory.edu/confluence/display/CS377


Schedules, lecture presentations, readings, resources


Please check the wiki periodically for updates


Blackboard


Submit your homework/projects


Discussion board for questions: the instructor and TA
will monitor the board to answer questions

Overview

CS377: Database Systems

Software


MySQL and MySQL Workbench: for database and
SQL queries


http://www.mysql.com


Oxygen XML Editor: for XML and XML queries


http://www.oxygenxml.com


Do not download now; 30
-
day
trial
edition


IBM Infosphere Data Architect (educational license
available): for database design


Will be distributed; Windows and Linux only (no
MacOS

version)


Eclipse IDE:


http://www.eclipse.org



Overview

CS377: Database Systems

Office Hours


TA: Shuai Zheng (
shuai.zheng@emory.edu
)

MSC N414, 2
-
4pm,
Friday


Instructor: by appointment

Office Hours

CS377: Database Systems

Additional resources


Introduction to databases


Introduction to databases (Stanford online course)


https://class.coursera.org/db/lecture/index


DB2 and Oracle have free express edition. You can
install and play by yourself


DB2
express
-
C:
www.ibm.com/software/data/db2/express
-
c


Oracle express:
http://
www.oracle.com/technetwork/products/expres
s
-
edition
/


Resources