C-Store: Data Management in the Cloud

Jianlin Feng

School of Software

SUN YAT-SEN UNIVERSITY

Jun 5, 2009

What is the Cloud?


A definition from Wikipedia

Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.



Infrastructure as a service (e.g., Amazon EC2)

Allows customers to rent computers (virtual machines) on which to run their own computer applications.

Platform as a service

Software as a service

Amazon EC2 (Elastic Compute Cloud)


EC2 uses Xen virtualization.

Each virtual machine, called an "instance", functions as a virtual private server in one of three sizes: small, large, or extra large.

Amazon.com sizes instances based on "EC2 Compute Units":

the equivalent CPU capacity of physical hardware. One EC2 Compute Unit equals the capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Pricing


Amazon charges customers in two primary ways:


Hourly charge per virtual machine


Data transfer charge



Amazon advertising describes the pricing scheme as "you pay for resources you consume".

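The two charges on the pricing slide can be combined into a rough cost estimate. The sketch below is only an illustration: the function name and both rates are hypothetical placeholders, not Amazon's actual prices.

```python
def estimate_cost(instance_hours, gb_transferred, hourly_rate, per_gb_rate):
    """Return the total charge: hourly VM charge plus data-transfer charge."""
    compute_charge = instance_hours * hourly_rate
    transfer_charge = gb_transferred * per_gb_rate
    return compute_charge + transfer_charge


# Hypothetical rates chosen only for illustration (not real EC2 prices).
HOURLY_RATE = 0.10   # dollars per instance-hour (assumed)
PER_GB_RATE = 0.15   # dollars per GB transferred (assumed)

# Two instances running for 100 hours each, moving 50 GB in total.
total = estimate_cost(2 * 100, 50, HOURLY_RATE, PER_GB_RATE)
print(f"Estimated charge: ${total:.2f}")
```

Because both terms scale with usage, shutting instances down when they are idle directly lowers the bill, which is the point of "you pay for resources you consume".
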
Advantage of Public Cloud


Public clouds are hosted by large infrastructure companies such as Amazon, Google, Yahoo, Microsoft, and Sun.

These companies can afford to build huge clouds.


For many companies (especially start-ups and medium-sized businesses), setting up a private cloud can be too expensive:

Hardware cost

Software cost

Personnel cost for maintaining the system

Cloud Characteristics


Computing power is elastic, but only if the workload is parallelizable.

Computing power comes from a shared-nothing architecture.

Data is stored at an untrusted host.


A possible solution is encrypting data.



Data is replicated, often across large geographic distances.


To provide data availability and durability.
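The first characteristic (elasticity only helps parallelizable workloads) can be made concrete with Amdahl's law. The slides do not name this law; it is brought in here as a standard illustration, and the parallel fractions below are made-up values.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup on `nodes` machines when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Illustrative parallel fractions (assumed, not measured).
for fraction in (0.5, 0.9, 0.99):
    speedups = [round(amdahl_speedup(fraction, n), 1) for n in (10, 100, 1000)]
    print(f"parallel fraction {fraction}: speedup at 10/100/1000 nodes = {speedups}")
```

In this model a workload that is only half parallelizable never runs more than twice as fast, no matter how many nodes are rented.
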

Transactional Data Management (OLTP)


Typically does not use a shared-nothing architecture.


OLTP systems are usually less than 1TB in size.



It is hard to maintain ACID guarantees in the face of
data replication over large geographic distances.


Google’s Bigtable implements a replicated shared-nothing database by weakening the “A” in ACID.

The H-Store project still remains in the vision stage.



There are big risks in storing transactional data on an untrusted host.


Transactional data include details at the lowest granularity.

First Conclusion


Transactional data management applications are not well suited for deployment in the cloud.

Analytical Data Management (DW)


Tend to be read-mostly (or read-only), with occasional batch inserts.

Shared-nothing architecture is a good match.

The ever-increasing amount of data is the primary driver for choosing shared-nothing.

Large scans, multidimensional aggregations, and star schema joins in analytical workloads are easy to parallelize on a shared-nothing system.

Infrequent writes eliminate the need for complex distributed locking and commit protocols.
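Why scans and aggregations parallelize easily on shared-nothing can be sketched as follows: each node aggregates only the rows it owns, and a coordinator merges the small partial results. This is a single-process simulation under assumed names and data; the function names, the round-robin partitioning, and the sample rows are all hypothetical.

```python
from collections import Counter


def partition_rows(rows, num_nodes):
    """Assign rows to nodes round-robin; each node owns only its own rows."""
    partitions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)
    return partitions


def local_aggregate(partition):
    """Each node scans its own partition and builds partial sums per group."""
    partial = Counter()
    for region, amount in partition:
        partial[region] += amount
    return partial


def merge_partials(partials):
    """A coordinator adds up the partial sums to produce the final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


# Hypothetical fact rows: (region, sales amount).
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]
partials = [local_aggregate(p) for p in partition_rows(rows, num_nodes=3)]
result = merge_partials(partials)
print(result)
```

The nodes never need to see each other's rows, and because the data is read-only during the scan, no locking or commit protocol is involved.
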

Analytical Data Management (DW): continued


ACID guarantees are typically not needed.


Snapshot isolation is usually enough.



Particularly sensitive data can often be left out of the analysis.

Less granular versions of the data are usually used for analysis.

Second Conclusion


Analytical data management applications are well-suited for deployment in the cloud.

Vertica (C-Store) for the Cloud

Cloud DBMS Wish List


Efficiency



Fault Tolerance


If a query must restart each time a node fails, then long, complex queries are difficult to complete.

Ability to run in a heterogeneous environment.

Should prevent the slowest node from having a disproportionate effect on total query performance.

Ability to operate on encrypted data.

Ability to interface with business intelligence products.
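The fault-tolerance item on the wish list can be illustrated with a back-of-the-envelope model. It assumes that nodes fail independently and that each node has the same probability of failing during one run of the query; both assumptions, and the 1% figure, are illustrative and not from the source.

```python
def prob_no_failure(nodes, per_node_failure_prob):
    """Probability that none of `nodes` independent nodes fails during one query run."""
    return (1.0 - per_node_failure_prob) ** nodes


def expected_attempts(nodes, per_node_failure_prob):
    """Expected number of full runs if any failure restarts the query from scratch."""
    return 1.0 / prob_no_failure(nodes, per_node_failure_prob)


# Assumed 1% chance that a given node fails during one run of the query.
for nodes in (10, 100, 1000):
    p_ok = prob_no_failure(nodes, 0.01)
    attempts = expected_attempts(nodes, 0.01)
    print(f"{nodes} nodes: P(no failure) = {p_ok:.3f}, expected attempts = {attempts:.1f}")
```

Under these assumptions the chance of a clean run shrinks geometrically with cluster size, which is why restart-on-failure becomes impractical for long queries on many nodes.
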

MapReduce vs. Parallel DBMS (1)


Efficiency


MapReduce is good for brute-force scans over unstructured data such as text documents.

Parallel DBMS is good for selective access to structured data.

Fault Tolerance

MapReduce treats it as a high priority.

Most parallel DBMSs restart a query upon a failure.

Ability to run in a heterogeneous environment.

MapReduce does well.

Parallel DBMSs are generally designed to run in a homogeneous environment.
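The fault-tolerance contrast above can be sketched by counting how many task executions each strategy needs. This is a simplified, sequential simulation: the function names and the `failures` set are hypothetical, and real systems run tasks in parallel. It only contrasts restarting the whole query with re-running just the failed task, as MapReduce does.

```python
def executions_with_query_restart(num_tasks, failures):
    """Count task executions when any failure restarts the whole query.

    `failures` is a set of (task_id, attempt_number) pairs that fail.
    """
    executions = 0
    attempts = [0] * num_tasks
    while True:
        failed = False
        for task in range(num_tasks):
            attempts[task] += 1
            executions += 1
            if (task, attempts[task]) in failures:
                failed = True
                break  # abandon this run and start the query over
        if not failed:
            return executions


def executions_with_task_retry(num_tasks, failures):
    """Count task executions when only the failed task is re-executed."""
    executions = 0
    for task in range(num_tasks):
        attempt = 1
        executions += 1
        while (task, attempt) in failures:
            attempt += 1
            executions += 1
    return executions


# Hypothetical scenario: 4 tasks, and task 2 fails on its first attempt.
failures = {(2, 1)}
print("query restart:", executions_with_query_restart(4, failures))
print("task retry:   ", executions_with_task_retry(4, failures))
```

With whole-query restart, the work of every task completed before the failure is thrown away; with task-level retry, only the failed task is repeated.
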

MapReduce vs. Parallel DBMS (2)


Ability to operate on encrypted data.

Neither has the native ability to operate on encrypted data.

Ability to interface with business intelligence products.

MapReduce is not intended for interfacing with BI products.

Parallel DBMS supports BI products well.

A Call for A Hybrid Solution


Bring together ideas from MapReduce and Parallel DBMS.

The hybrid solution should combine

the fault tolerance, heterogeneous-cluster support, and out-of-the-box ease of use of MapReduce

with the efficiency, performance, and tool pluggability of shared-nothing parallel DBMS.

References

1. Abadi, Daniel J. Data Management in the Cloud: Limitations and Opportunities. In IEEE Data Engineering Bulletin, 2009.

2. Vertica Company. Getting Started with Vertica Analytic Database for the Cloud. 2009.