CS 491/591: Cloud Computing - The University of Alabama

Data Management

Oct 31, 2013

Homework 4


Code for word count


http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-examples/0.20.2-320/org/apache/hadoop/examples/WordCount.java#WordCount
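
For quick reference, here is a condensed sketch of the linked example's mapper and reducer (driver setup omitted; this mirrors the 0.20.x org.apache.hadoop.mapreduce API):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);  // emit (word, 1) for every token
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();  // all counts for one word arrive at one reducer
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }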


Databases in Cloud Environments


Based on:

Md. Ashfakul Islam

Department of Computer Science

The University of Alabama


Data Today


Data sizes are increasing exponentially every day.


Key difficulties in processing large-scale data:

acquiring the required amount of on-demand resources

auto-scaling up and down based on dynamic workloads

distributing and coordinating a large-scale job on several servers

Replication

maintaining update consistency

Cloud platforms can solve most of the above.

Large Scale Data Management


Large-scale data management is attracting attention.

Many organizations produce data at the petabyte (PB) level.

Managing such an amount of data requires huge resources.

The ubiquity of huge data sets inspires researchers to think in new ways.


Issues to Consider


Distributed or Centralized application?


How can ACID guarantees be maintained?


CAP theorem

Consistency, Availability, Partition tolerance

Data availability and reliability (even under network partitions) are achieved by compromising consistency.

Traditional consistency techniques become obsolete.

Consistency becomes the bottleneck of data management deployment in the cloud.

Costly to maintain.

Evaluation Criteria for Data
Management


Evaluation criteria:


Elasticity



scalable, distribute new resources, offload unused
resources, parallelizable, low coupling


Security



untrusted host, moving off premises, new rules/regulations


Replication



available, durable, fault-tolerant, replicated across the globe

Evaluation of Analytical DB


Analytical DBs handle historical data with little or no updates, so no ACID properties are needed.


Elasticity


Since there is no ACID requirement, elasticity is easier.

E.g., no updates, so locking is not needed.

A number of commercial products support elasticity.


Security


analysis requires sensitive and detailed data

a third-party vendor stores the data

potential risk of data leakage and privacy violation


Replication


A recent snapshot of the DB serves the purpose.

Strong consistency isn't required.

Analytical DBs - Data Warehousing


Data warehousing (DW) is a popular application of Hadoop.

Typically a DW is relational (OLAP)

but can also hold semi-structured and unstructured data.

Can also be built on parallel DBs (Teradata)

column-oriented

expensive: about $10K per TB of data


Hadoop for DW

Facebook abandoned Oracle for Hadoop (Hive)

Also Pig for semi-structured data

Evaluation of Transactional DM


Elasticity



data is partitioned over sites

locking and commit protocols become complex and time-consuming

huge distributed data processing overhead


Security



transactions involve sensitive and detailed data

a third-party vendor stores the data

potential risk of data leakage and privacy violation


Evaluation of Transactional DM


Replication


data is replicated in the cloud

CAP theorem: of Consistency, Availability, and Partition tolerance, only two are achievable

between consistency and availability, one must be chosen

availability is the main goal of the cloud

so consistency is sacrificed

violating ACID




Transactional Data Management

Transactional Data Management


Needed because:


Transactional Data Management

is the heart of the database industry

almost all financial transactions are conducted through it

these systems rely on ACID guarantees

ACID properties are the main challenge in deploying transactional DM in the cloud.

Scalable Transactions for Web
Applications in the Cloud


Two important properties of Web applications:

all transactions are short-lived

a data request can be answered with a small set of well-identified data items

Scalable database services like Amazon SimpleDB and Google BigTable allow data to be queried only by primary key.

Eventual data consistency is maintained in these database services.

Relational Joins


Hadoop is not a DB.

Debate between parallel DBs and MR for OLAP

DeWitt/Stonebraker call MR a "major step backwards"

Parallel DBs are faster because they can create indexes.


Relational Joins - Example


Given 2 data sets S and T:


(k1, (s1, S1)): k1 is the join attribute, s1 is the tuple ID, S1 is the rest of the attributes

(k2, (s2, S2))

(k1, (t1, T1)): the same layout for T

(k2, (t2, T2))

S could be user profiles:

k is the PK; the tuple holds info about age, gender, etc.

T could be logs of online activity:

each tuple is a particular URL visited, and k is a FK

Reduce side Join 1:1


Map over both datasets, emitting (join key, tuple).

All tuples are grouped by join key

which is exactly what is needed for the join.

Which is what type of join?

A parallel sort-merge join.

If it is a one-to-one join:

at most one tuple from S and one from T match.

If the reducer sees two values, one must be from S and the other from T (we don't know which, since values are unordered); join them.
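
A minimal sketch of this 1:1 reduce-side join, assuming both S and T are text files whose first tab-separated field is the join key (the class names are illustrative, not from the slides):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReduceSideJoin {

      // The same mapper logic runs over both S and T: emit (join key, tuple).
      public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
        private final Text joinKey = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);  // join key \t rest
          joinKey.set(fields[0]);
          context.write(joinKey, value);  // shuffle groups tuples by join key
        }
      }

      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          // 1:1 join: at most one tuple from S and one from T share this key.
          Text first = null;
          for (Text v : values) {
            if (first == null) {
              first = new Text(v);  // copy; Hadoop reuses the value object
            } else {
              // Two values: one from S, one from T (order unknown). Join them.
              context.write(key, new Text(first.toString() + "\t" + v.toString()));
            }
          }
        }
      }
    }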

Reduce side Join 1:N


If it is a one-to-many join:

if S is the "one" side (based on its PK), the same approach as 1:1 will work

But

which value is the one from S? (there is no ordering)

Solution: buffer all tuples in memory

pick out the tuple from S and perform the join (see the sketch below)

Scalability problem:

buffering everything uses memory
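
A sketch of that buffering reducer; it assumes each mapper prepends a source tag ("S" or "T") to its value, since the reducer otherwise cannot tell the datasets apart (the tag scheme is an assumption, not specified on the slide):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BufferingJoinReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      public void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<String> sTuples = new ArrayList<String>();
        List<String> tTuples = new ArrayList<String>();
        // Values arrive in no particular order, so buffer everything,
        // separated by the source tag the mapper prepended.
        for (Text v : values) {
          String[] tagged = v.toString().split("\t", 2);  // "S" or "T", then tuple
          if (tagged[0].equals("S")) {
            sTuples.add(tagged[1]);
          } else {
            tTuples.add(tagged[1]);
          }
        }
        // Cross S with T; in a 1:N join, sTuples holds a single element.
        // This is the scalability problem: both lists live in memory.
        for (String s : sTuples) {
          for (String t : tTuples) {
            context.write(key, new Text(s + "\t" + t));
          }
        }
      }
    }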

Reduce side Join 1:N


Use value-to-key conversion.

Create a composite key: (join key, tuple ID).

Define the sort order so that:

keys sort first by join key

within a join key, IDs from S sort first

then IDs from T

Define the partitioner to use only the join key, so all keys with the same join key go to the same reducer.

Reduce side Join 1:N

Can remove the join key and tuple ID from the value to save space.

Whenever the reducer encounters a new join key, the first tuple will be from S, not T:

hold it in memory (only the S tuple).

Join it with the following tuples until the next new join key appears.

No more memory bottleneck: only one S tuple is buffered at a time.
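
A sketch of the value-to-key machinery, using a source tag ("0" for S, "1" for T) in place of tuple IDs to get the same "S sorts first" property; the class names and tag scheme are illustrative assumptions:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;

    // Composite key: (join key, source tag). Within one join key,
    // the S tuple (tag "0") always sorts ahead of the T tuples (tag "1").
    public class CompositeKey implements WritableComparable<CompositeKey> {
      public String joinKey = "";
      public String tag = "";

      public void write(DataOutput out) throws IOException {
        out.writeUTF(joinKey);
        out.writeUTF(tag);
      }

      public void readFields(DataInput in) throws IOException {
        joinKey = in.readUTF();
        tag = in.readUTF();
      }

      public int compareTo(CompositeKey o) {
        int c = joinKey.compareTo(o.joinKey);      // sort by join key first
        return c != 0 ? c : tag.compareTo(o.tag);  // then S before T
      }
    }

    // Partition on the join key only, so every composite key sharing a
    // join key lands on the same reducer.
    class JoinKeyPartitioner extends Partitioner<CompositeKey, Text> {
      public int getPartition(CompositeKey key, Text value, int numPartitions) {
        return (key.joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // Reducer state carries across reduce() calls within a task; calls
    // arrive in composite-key order, so the S tuple precedes its T tuples.
    class StreamingJoinReducer extends Reducer<CompositeKey, Text, Text, Text> {
      private String currentKey = null;
      private String sTuple = null;  // only one S tuple buffered at a time

      @Override
      public void reduce(CompositeKey key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text v : values) {
          if (!key.joinKey.equals(currentKey)) {
            currentKey = key.joinKey;
            sTuple = v.toString();  // first tuple of a new join key is from S
          } else {
            context.write(new Text(key.joinKey), new Text(sTuple + "\t" + v));
          }
        }
      }
    }

This is the standard secondary-sort pattern: the tuple ID (here, the tag) moves from the value into the key, so the framework's sort does the buffering work that the previous approach did in memory.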




Consistency in Clouds

Transactional DM


A transaction is a sequence of read and write operations.

Guarantee the ACID properties of transactions:

Atomicity - either all operations execute or none.

Consistency - the DB remains consistent after each transaction execution.

Isolation - the effect of a transaction can't be altered by another one.

Durability - the effect of a committed transaction persists.

ACID Properties


Atomicity is maintained by 2PC (two-phase commit); a sketch of the protocol follows this list.

Eventual consistency is maintained.

Isolation is maintained by decomposing transactions.

Timestamp ordering is introduced to order conflicting transactions.

Durability is maintained by replicating data items across several LTMs (local transaction managers).
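
For intuition, a minimal sketch of two-phase commit's decision logic, assuming ideal conditions (no participant or coordinator failures, no timeouts); this illustrates the generic protocol, not the LTM implementation itself:

    import java.util.List;

    // Two-phase commit: a transaction commits only if every participant
    // votes yes in the prepare phase; otherwise all of them abort.
    interface Participant {
      boolean prepare();  // phase 1: vote yes (true) or no (false)
      void commit();      // phase 2a: make the transaction's effects durable
      void abort();       // phase 2b: roll the transaction back
    }

    class TwoPhaseCommitCoordinator {
      boolean execute(List<Participant> participants) {
        // Phase 1: collect votes; any "no" aborts the whole transaction.
        for (Participant p : participants) {
          if (!p.prepare()) {
            for (Participant q : participants) {
              q.abort();
            }
            return false;  // atomicity: none of the effects survive
          }
        }
        // Phase 2: everyone voted yes, so everyone commits.
        for (Participant p : participants) {
          p.commit();
        }
        return true;  // atomicity: all of the effects survive
      }
    }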

Consistency in Clouds


A consistent database must remain consistent after the execution of successful operations.

Inconsistency may cause huge damage.

Consistency is routinely sacrificed to achieve availability and scalability.

Maintaining strong consistency in the cloud is very costly.

Traditional DM is becoming obsolete.

Thin portable devices backed by concentrated computing power point to a new model.

ACID guarantees become the main challenge.

Some solutions have been proposed to overcome the challenge.

Consistency remains the bottleneck.

Our goal: provide low-cost solutions that ensure data consistency in the cloud.


Current DB Market Status


MS SQL doesn't support auto-scaling with dynamic load.

MySQL is recommended for "lower traffic."

New products advertise: "replace MySQL with us."

Oracle recently released on-demand resource allocation.

IBM DB2 can auto-scale with dynamic workloads.

Azure Relational DB:

great performance