Nov 7, 2013

Parallel and Distributed Databases


CS263 Lecture 16

LECTURE PLAN





Parallel DBMS - What and Why?




What is a Client/Server DBMS?




Why do we need Distributed DBMSs?




Date’s rules for a Distributed DBMS




Benefits of a Distributed DBMS




Issues associated with a Distributed DBMS




Disadvantages of a Distributed DBMS


PARALLEL DATABASE SYSTEM


PARALLEL DBMSs

WHY DO WE NEED THEM?



More and More Data!



We have databases that hold a high amount of data, in the order of 10^12 bytes:

1,000,000,000,000 bytes!




Faster and Faster Access!



We have data applications that need to process data at very high speeds:

10,000s of transactions per second!


SINGLE-PROCESSOR DBMSs AREN'T UP TO THE JOB!




INTERQUERY PARALLELISM

It is possible to process a number of transactions in parallel with each other.

Improves Throughput.

INTRAQUERY PARALLELISM

It is possible to process 'sub-tasks' of a transaction in parallel with each other.

Improves Response Time.
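
The two kinds of parallelism can be contrasted in a minimal Python sketch. Everything here is an assumption made for illustration: the in-memory RECORDS list stands in for a table, scan_sum stands in for a query, and a thread pool stands in for multiple processors (in CPython, threads show the structure of the work but do not by themselves give CPU speed-up).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "table": one integer value per record.
RECORDS = list(range(1, 10_001))


def scan_sum(records):
    """Stand-in for one query (or one sub-task): scan records and total them."""
    return sum(records)


def interquery(queries, workers=4):
    """Interquery parallelism: several independent queries run side by side."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_sum, queries))


def intraquery(records, workers=4):
    """Intraquery parallelism: ONE query is split into sub-tasks, then combined."""
    chunk = (len(records) + workers - 1) // workers
    parts = [records[i:i + chunk] for i in range(0, len(records), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_sum, parts))


if __name__ == "__main__":
    # Two separate queries processed together (throughput).
    print(interquery([RECORDS[:100], RECORDS[100:200]]))
    # One query whose scan is shared out between workers (response time).
    print(intraquery(RECORDS))
```

In interquery, each worker answers a different query, so more queries finish per second. In intraquery, all workers co-operate on the same query, so that one query finishes sooner.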


PARALLEL DBMSs

BENEFITS OF A PARALLEL DBMS




Speed-Up.

As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor:

10 seconds to scan a DB of 10,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
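
The speed-up arithmetic can be written down as a small sketch. The function names and the "efficiency" measure are illustrative assumptions, not terms taken from the slides; the first example uses the scan times quoted above, and the second is a made-up sub-linear case.

```python
def speed_up(time_before, time_after):
    """Ratio of elapsed times for the SAME task before and after adding resources."""
    return time_before / time_after


def efficiency(time_before, time_after, resource_factor):
    """Fraction of the ideal (linear) speed-up achieved; 1.0 means linear."""
    return speed_up(time_before, time_after) / resource_factor


if __name__ == "__main__":
    # Slide example: 10 s with 1 CPU, 1 s with 10 CPUs, same 10,000-record scan.
    print(speed_up(10.0, 1.0))        # 10.0
    print(efficiency(10.0, 1.0, 10))  # 1.0
    # Hypothetical sub-linear case: the scan only drops to 2 s on 10 CPUs.
    print(efficiency(10.0, 2.0, 10))  # 0.5
```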

PARALLEL DBMSs

HOW TO MEASURE THE BENEFITS





Scale-up.

As you multiply resources, the size of a task that can be executed in a given time should be increased by the same factor.

1 second to scan a DB of 1,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
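
Scale-up can be expressed as a ratio in the same spirit. This is a minimal sketch: the function name is an assumption, and it presumes the task size and the resources have been grown by the same factor, as in the example above. A value of 1.0 is linear scale-up; anything below 1.0 is sub-linear.

```python
def scale_up(small_time, large_time):
    """Elapsed time of the small task on the small system, divided by the
    elapsed time of the proportionally larger task on the larger system."""
    return small_time / large_time


if __name__ == "__main__":
    # Slide example: 1,000 records on 1 CPU and 10,000 records on 10 CPUs
    # both take 1 second.
    print(scale_up(1.0, 1.0))   # 1.0
    # Hypothetical sub-linear case: the larger scan takes 1.25 s instead.
    print(scale_up(1.0, 1.25))  # 0.8
```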


PARALLEL DBMSs

SPEED-UP

[Figure: number of transactions/second plotted against number of CPUs. Linear speed-up (ideal): 1000/sec with 5 CPUs rising to 2000/sec with 10 CPUs. Sub-linear speed-up: 1600/sec with 16 CPUs.]

PARALLEL DBMSs

SCALE-UP

[Figure: number of transactions/second plotted against number of CPUs and database size. Starting point: 1000/sec with 5 CPUs and a 1 GB database. With 10 CPUs and a 2 GB database, linear scale-up (ideal) keeps the rate at 1000/sec; sub-linear scale-up gives 900/sec.]


Parallel Database Architecture: Shared Memory

[Figure: six CPUs all connected to one shared MEMORY.]

Parallel Database Architecture: Shared Disk

[Figure: six CPUs, each with its own memory (M), sharing the disks.]

Parallel Database Architecture: Shared Nothing

[Figure: CPUs each paired with their own memory (M); nothing is shared between them.]


MAINFRAME DATABASE SYSTEM

[Figure: dumb terminals joined by a specialised network connection to a mainframe computer, which holds the presentation logic, business logic and data logic.]


CLIENT/SERVER DATABASE SYSTEM


CLIENT/SERVER DBMS

CLIENT PROCESS

Manages user interface
Accepts user data
Processes application/business logic
Generates database requests (SQL)
Transmits database requests to server
Receives results from server
Formats results according to application logic
Presents results to the user

CLIENT/SERVER DBMS

SERVER PROCESS

Accepts database requests
Processes database requests
Performs integrity checks
Handles concurrent access
Optimises queries
Performs security checks
Enacts recovery routines
Transmits result of database request to client







CLIENT/SERVER DBMS ARCHITECTURE (FAT CLIENT)

[Figure: CLIENT #1, CLIENT #2 and CLIENT #3 exchange data requests and data responses with the D/BASE SERVER. Layers shown: presentation logic, business logic, data logic.]





CLIENT/SERVER DBMS ARCHITECTURE (THIN CLIENT)

[Figure: CLIENT #1, CLIENT #2 and CLIENT #3 exchange data requests and data responses with the D/BASE SERVER. Layers shown: presentation logic, business logic, data logic.]

DISTRIBUTED PROCESSING ARCHITECTURE

[Figure: clients on LANs at four sites (Leyton, Stratford, Barking, Leytonstone) all access a single DBMS.]


DISTRIBUTED DATABASE SYSTEM





A distributed database system is a collection of logically related databases that co-operate in a transparent manner.

Transparent implies that each user within the system may access all of the data within all of the databases as if they were a single database.

There should be 'location independence', i.e. as the user is unaware of where the data is located, it is possible to move the data from one physical location to another without affecting the user.
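
Location independence can be illustrated with a minimal Python sketch. The class and method names are hypothetical and the data lives in plain dictionaries; the point is only that users ask for data by its logical name, so moving it between sites changes the internal lookup table and nothing else.

```python
class DistributedDatabase:
    """Toy model: users name a relation; the system works out where it lives."""

    def __init__(self):
        self._sites = {}    # site name -> {relation name -> rows}
        self._catalog = {}  # relation name -> site currently holding it

    def store(self, site, relation, rows):
        """Place a relation at a site and record its location."""
        self._sites.setdefault(site, {})[relation] = list(rows)
        self._catalog[relation] = site

    def move(self, relation, new_site):
        """Relocate a relation; only the internal location record changes."""
        old_site = self._catalog[relation]
        rows = self._sites[old_site].pop(relation)
        self._sites.setdefault(new_site, {})[relation] = rows
        self._catalog[relation] = new_site

    def read(self, relation):
        """User-facing access: no site name is ever supplied by the user."""
        site = self._catalog[relation]
        return self._sites[site][relation]


if __name__ == "__main__":
    db = DistributedDatabase()
    db.store("Stratford", "Account", [(200, "JONES"), (345, "SMITH")])
    before = db.read("Account")
    db.move("Account", "Barking")
    after = db.read("Account")
    print(before == after)  # the user's view of the data is unchanged
```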

DISTRIBUTED DATABASES

WHAT IS A DISTRIBUTED DATABASE?

DISTRIBUTED DATABASE ARCHITECTURE

[Figure: four sites (Leytonstone, Stratford, Barking, Leyton), each with its own clients and its own DBMS, connected by LANs.]

M:N CLIENT/SERVER DBMS ARCHITECTURE

[Figure: CLIENT #1, CLIENT #2 and CLIENT #3 connect to D/BASE SERVER #1 and D/BASE SERVER #2.]

NOT TRANSPARENT!



COMPONENTS OF A DDBMS

[Figure: Site 1 and Site 2 connected by a computer network. Components shown: GSC, DDBMS and DC at each site, plus an LDBMS and a DB.]

LDBMS = Local DBMS
DC = Data Communications
GSC = Global Systems Catalog
DDBMS = Distributed DBMS



Reduced Communication Overhead

Most data access is local, less expensive and performs better.

Improved Processing Power

Instead of one server handling the full database, we now have a collection of machines handling the same database.

Removal of Reliance on a Central Site

If a server fails, then the only part of the system that is affected is the relevant local site. The rest of the system remains functional and available.



DISTRIBUTED DATABASES

ADVANTAGES



Expandability

It is easier to accommodate increasing the size of the global (logical) database.

Local autonomy

The database is brought nearer to its users. This can effect a cultural change as it allows potentially greater control over local data.



DISTRIBUTED DATABASES

ADVANTAGES


A distributed system looks exactly like a non-distributed system to the user!

1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction processing
9. Hardware independence
10. Operating system independence
11. Network independence
12. Database independence

DISTRIBUTED DATABASES

DATE’S TWELVE RULES FOR A DDBMS




Data Allocation





Data Fragmentation





Distributed Catalogue Management





Distributed Transactions





Distributed Queries



(see chapter 20)

DISTRIBUTED DATABASES

ISSUES


1. Locality of reference
   Is the data near to the sites that need it?

2. Reliability and availability
   Does the strategy improve fault tolerance and accessibility?

3. Performance
   Does the strategy result in bottlenecks or under-utilisation of resources?

4. Storage costs
   How does the strategy affect the availability and cost of data storage?

5. Communication costs
   How much network traffic will result from the strategy?

DISTRIBUTED DATABASES

DATA ALLOCATION METRICS



DISTRIBUTED DATABASES

DATA ALLOCATION STRATEGIES

CENTRALISED

Locality of Reference:     Lowest
Reliability/Availability:  Lowest
Storage Costs:             Lowest
Performance:               Unsatisfactory
Communication Costs:       Highest



DISTRIBUTED DATABASES

DATA ALLOCATION STRATEGIES

PARTITIONED/FRAGMENTED

Locality of Reference:     High
Reliability/Availability:  Low (item), High (system)
Storage Costs:             Lowest
Performance:               Satisfactory
Communication Costs:       Low



DISTRIBUTED DATABASES

DATA ALLOCATION STRATEGIES

COMPLETE REPLICATION

Locality of Reference:     Highest
Reliability/Availability:  Highest
Storage Costs:             Highest
Performance:               High
Communication Costs:       High (update), Low (read)



DISTRIBUTED DATABASES

DATA ALLOCATION STRATEGIES

SELECTIVE REPLICATION

Locality of Reference:     High
Reliability/Availability:  Low (item), High (system)
Storage Costs:             Average
Performance:               Satisfactory
Communication Costs:       Low




Usage

Applications are usually interested in 'views', not whole relations.

Efficiency

It's more efficient if data is close to where it is frequently used.

Parallelism

It is possible to run several 'sub-queries' in tandem.

Security

Data not required by local applications is not stored at the local site.



DISTRIBUTED DATABASES

WHY FRAGMENT DATA?

DISTRIBUTED DATABASES

HORIZONTAL DATA FRAGMENTATION

ACCOUNT   CUSTOMER   BRANCH      BALANCE
200       JONES      STRATFORD   1000.00
324       GRAY       BARKING      200.00
345       SMITH      STRATFORD     23.17
350       GREEN      BARKING      340.14
400       ONO        BARKING      500.00
456       KHAN       STRATFORD    333.00

Horizontal Fragmentation: Consists of a Restriction on a Relation.

e.g., σ branch = 'Stratford' (Account)
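
A restriction simply keeps the rows that satisfy a condition. The sketch below applies it to the Account relation from the slide. The helper name restrict and the dictionary keys are assumptions made for illustration; the rows are the ones in the table.

```python
# The Account relation from the slide, one dictionary per row.
ACCOUNT = [
    {"account": 200, "customer": "JONES", "branch": "STRATFORD", "balance": 1000.00},
    {"account": 324, "customer": "GRAY",  "branch": "BARKING",   "balance": 200.00},
    {"account": 345, "customer": "SMITH", "branch": "STRATFORD", "balance": 23.17},
    {"account": 350, "customer": "GREEN", "branch": "BARKING",   "balance": 340.14},
    {"account": 400, "customer": "ONO",   "branch": "BARKING",   "balance": 500.00},
    {"account": 456, "customer": "KHAN",  "branch": "STRATFORD", "balance": 333.00},
]


def restrict(relation, predicate):
    """Restriction: keep whole rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


# One horizontal fragment per branch.
stratford = restrict(ACCOUNT, lambda row: row["branch"] == "STRATFORD")
barking = restrict(ACCOUNT, lambda row: row["branch"] == "BARKING")

# The original relation is recovered as the union of its fragments.
rebuilt = sorted(stratford + barking, key=lambda row: row["account"])

if __name__ == "__main__":
    print([row["account"] for row in stratford])
    print([row["account"] for row in barking])
    print(rebuilt == ACCOUNT)
```

Because every row lands in exactly one fragment, the union loses nothing and duplicates nothing.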

DISTRIBUTED DATABASES

HORIZONTAL DATA FRAGMENTATION

STRATFORD BRANCH

ACCT NO.   CUSTOMER   BRANCH      BALANCE
200        JONES      STRATFORD   1000.00
345        SMITH      STRATFORD     23.17
456        KHAN       STRATFORD    333.00

BARKING BRANCH

ACCT NO.   CUSTOMER   BRANCH      BALANCE
324        GRAY       BARKING      200.00
350        GREEN      BARKING      340.14
400        ONO        BARKING      500.00

DISTRIBUTED DATABASES

VERTICAL DATA FRAGMENTATION

S#    NAME    SITE        PHONE NO        LOGIN     PASSWORD
200   JONES   STRATFORD   0208-500-9000   JON200T   XXYY22
324   GRAY    BARKING     0208-545-7528   GRA324S   ZZEE56
456   KHAN    STRATFORD   0208-500-5821   KHA456T   KJTR78

Vertical Fragmentation: Consists of a Projection on a Relation.

e.g., Π S#, NAME, SITE, PHONE NO (Student)
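
A projection keeps chosen columns of every row. The sketch below splits the Student relation from the slide into two vertical fragments and joins them back on S#. The helper names and dictionary keys are illustrative assumptions; the rows are those in the table, and both fragments keep the key S# so that the relation can be rebuilt.

```python
# The Student relation from the slide, one dictionary per row.
STUDENT = [
    {"s_no": 200, "name": "JONES", "site": "STRATFORD",
     "phone": "0208-500-9000", "login": "JON200T", "password": "XXYY22"},
    {"s_no": 324, "name": "GRAY", "site": "BARKING",
     "phone": "0208-545-7528", "login": "GRA324S", "password": "ZZEE56"},
    {"s_no": 456, "name": "KHAN", "site": "STRATFORD",
     "phone": "0208-500-5821", "login": "KHA456T", "password": "KJTR78"},
]


def project(relation, attributes):
    """Projection: keep only the named attributes of every row."""
    return [{name: row[name] for name in attributes} for row in relation]


def join_on_key(left, right, key):
    """Rebuild rows by matching the two fragments on their shared key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left]


# Two vertical fragments, each repeating the key S#.
student_admin = project(STUDENT, ["s_no", "name", "site", "phone"])
network_admin = project(STUDENT, ["s_no", "login", "password"])

rebuilt = join_on_key(student_admin, network_admin, "s_no")

if __name__ == "__main__":
    print(student_admin[0])
    print(network_admin[0])
    print(rebuilt == STUDENT)
```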

DISTRIBUTED DATABASES

VERTICAL DATA FRAGMENTATION

STUDENT ADMINISTRATION

S#    NAME    SITE        PHONE NO.
200   JONES   STRATFORD   0208-500-9000
324   GRAY    BARKING     0208-545-7528
456   KHAN    STRATFORD   0208-500-5821

NETWORK ADMINISTRATION

S#    LOGIN-ID   PASSWORD
200   JON200T    XXYY22
324   GRA324S    ZZEE56
456   KHA456T    KJTR78

DISTRIBUTED DATABASES

DISTRIBUTED CATALOG MANAGEMENT



Centralised Global Catalog

One site maintains the full global catalog. All changes to any local system catalog have to be propagated to the site maintaining the global catalog. Bad performance, single point of failure, compromises site autonomy.

Dispersed Catalog

There is no physical global catalog. Each time a remote data item is required, the catalogs from ALL other sites are examined for the item. This has severe performance penalties.

DISTRIBUTED DATABASES

DISTRIBUTED CATALOG MANAGEMENT



Replicated Global Catalog

Each site maintains its own global catalog. Although this greatly speeds up remote data location, it is very inefficient to maintain. A detail of every data item added, changed or deleted locally has to be propagated to ALL other sites.

Local-Master Catalog

Each site maintains both its local system catalog and a catalog of all of its data items that are replicated at other sites. This avoids compromising site autonomy, is fairly efficient, and is not a single point of failure.

DISTRIBUTED DATABASES

DISTRIBUTED TRANSACTIONS

ATOMIC DISTRIBUTED TRANSACTION

[Figure: Stratford clients submit a global transaction to the Stratford DBMS. Part (a) runs against the Stratford DB, part (b) goes to the Barking DBMS and Barking DB, and part (c) goes to the Leyton DBMS and Leyton DB.]

Global Transaction

(a) Debit Stratford A/C £500
(b) Credit Barking A/C £350
(c) Credit Leyton A/C £150

TWO-PHASE COMMIT (2PC) - OK

TWO-PHASE COMMIT (2PC) - ABORT
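
The OK and ABORT outcomes can be illustrated with a minimal, single-process Python sketch of two-phase commit. It is a teaching sketch under stated assumptions, not a full protocol: there is no network, logging, timeout handling or crash recovery, and the class name, the can_commit flag and the opening balances are all hypothetical. Only the three amounts come from the global transaction above.

```python
class Participant:
    """One site's DBMS holding a single account balance (hypothetical model)."""

    def __init__(self, name, balance, can_commit=True):
        self.name = name
        self.balance = balance
        self.can_commit = can_commit  # set False to simulate a site voting no
        self.pending = 0

    def prepare(self, delta):
        """Phase 1: vote on whether this site can apply its part of the work."""
        if not self.can_commit or self.balance + delta < 0:
            return False
        self.pending = delta
        return True

    def commit(self):
        """Phase 2 (all voted yes): make the pending change permanent."""
        self.balance += self.pending
        self.pending = 0

    def abort(self):
        """Phase 2 (someone voted no): discard the pending change."""
        self.pending = 0


def two_phase_commit(work):
    """work is a list of (participant, amount) pairs; returns 'OK' or 'ABORT'."""
    votes = [site.prepare(amount) for site, amount in work]  # phase 1: voting
    if all(votes):                                           # phase 2: decision
        for site, _ in work:
            site.commit()
        return "OK"
    for site, _ in work:
        site.abort()
    return "ABORT"


if __name__ == "__main__":
    stratford = Participant("Stratford", 1000)
    barking = Participant("Barking", 200)
    leyton = Participant("Leyton", 0)
    result = two_phase_commit([(stratford, -500), (barking, 350), (leyton, 150)])
    print(result, stratford.balance, barking.balance, leyton.balance)
```

The transaction is atomic because no site changes its balance until every site has voted yes; a single no vote leaves all three balances untouched.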




Architectural complexity.




Cost.





Security.




Integrity control more difficult.




Lack of standards.




Lack of experience.




Database design more complex.

DISTRIBUTED DATABASES

DISADVANTAGES OF DDBMSs