
Cloud Computing Systems

Lin Gu

Hong Kong University of Science and Technology


Sept. 14, 2011


How to effectively compute in a datacenter?

Is MapReduce the best answer to computation in the cloud?

What is the limitation of MapReduce?

How to provide general-purpose parallel processing in DCs?


MapReduce

Parallel computing for Web-scale data processing


Fundamental component in Google’s
technological architecture


Why didn’t Google use parallel Fortran, MPI, …?


Adopted by many technology firms

The MapReduce Approach

Program Execution on Web-Scale Data

MapReduce

Old ideas can be fabulous, too!


( = Lisp “Lost In Silly Parentheses”) ?


Map and Fold


Map: do something to all elements in a list


Fold: aggregate elements of a list


Used in functional programming languages
such as Lisp


Map is a higher-order function: apply an op to all elements in a list


Result is a new list


Parallelizable

[Diagram: f applied to each element of the input list, producing a new list]

MapReduce

(map (lambda (x) (* x x)) '(1 2 3 4 5))
  => '(1 4 9 16 25)


Reduce is also a higher-order function


Like “fold”:

aggregate elements of a list


Accumulator set to initial value


Function applied to list element and the accumulator


Result stored in the accumulator


Repeated for every item in the list


Result is the final value in the accumulator

[Diagram: f folds each element into the accumulator, from the initial value to the final result]

(fold + 0 '(1 2 3 4 5))  =>  15

(fold * 1 '(1 2 3 4 5))  =>  120
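The same two operations in Python (an assumed translation for readers who do not speak Lisp, not from the original slides):

from functools import reduce

# map: apply an operation to every element, yielding a new list
print(list(map(lambda x: x * x, [1, 2, 3, 4, 5])))          # [1, 4, 9, 16, 25]

# reduce (fold): combine elements into an accumulator, left to right
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0))   # 15
print(reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1))   # 120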

The MapReduce Approach

Program Execution on Web-Scale Data

Massive parallel processing made simple


Example: word count


Map: parse a document and generate <word, 1> pairs

Reduce: receive all pairs for a specific word, and count (sum)

Map (for document D):
    for each word w in D
        output <w, 1>

Reduce (for key w):
    count = 0
    for each input item
        count = count + 1
    output <w, count>
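A runnable single-process sketch of this word count (the parallel shuffle between map and reduce is simulated with a dictionary; the names are illustrative, not the framework's API):

from collections import defaultdict

def map_wordcount(document):
    for word in document.split():
        yield (word, 1)                 # emit <word, 1> for each word

def reduce_wordcount(word, values):
    return (word, sum(values))          # sum all pairs for one word

documents = ["the quick brown fox", "the lazy dog the end"]
groups = defaultdict(list)              # "shuffle": group values by key
for doc in documents:
    for word, one in map_wordcount(doc):
        groups[word].append(one)

print(sorted(reduce_wordcount(w, v) for w, v in groups.items()))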

The MapReduce Approach

Program Execution on Web-Scale Data

Design Context


Big data, but simple dependence


Relatively easy to partition data


Supported by a distributed system


Distributed OS services across thousands of
commodity PCs (e.g., GFS)


First users are search oriented


Crawl, index, search

Designed years ago, still working today, with growing adoption

Workflow

[Diagram: a single master node coordinating many worker threads]

Single master, numerous worker threads

Workflow


1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.


2. One of the copies of the program is the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

Workflow


3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.


4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (sketched after step 7). The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

Workflow


5. When a reduce worker is notified by the master about these locations, it uses RPCs to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.


6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.


7. When all map tasks and reduce tasks have been completed, the MapReduce call returns to the user code.
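The partitioning function from step 4, made concrete. The default described in the MapReduce paper is hash(key) mod R; a minimal sketch (the printed assignments are only illustrative):

# default partitioner: map output key -> one of the R reduce tasks
def partition(key, R):
    return hash(key) % R

R = 4
for key in ["apple", "banana", "cherry"]:
    print(key, "-> reduce task", partition(key, R))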

Programming


How to write a MapReduce program to


Generate inverted indices? (see the sketch after this list)


Sort?


How to express more sophisticated logic?


What if some workers (slaves) or the master fail?
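For the first question above, one possible answer (a sketch under the same single-process simulation as the word-count example, not from the slides): Map emits <word, document id> and Reduce collects the posting list.

from collections import defaultdict

def map_invert(doc_id, text):
    for word in text.split():
        yield (word, doc_id)             # emit <word, doc id>

def reduce_invert(word, doc_ids):
    return (word, sorted(set(doc_ids)))  # posting list for this word

docs = {1: "cloud computing systems", 2: "cloud storage systems"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_invert(doc_id, text):
        groups[word].append(d)

print(dict(reduce_invert(w, ds) for w, ds in groups.items()))
# {'cloud': [1, 2], 'computing': [1], 'systems': [1, 2], 'storage': [2]}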

Workflow

Where is the communication-intensive part?

1. Initial data split into 64MB blocks
2. Computed, results locally stored
3. Master informed of result locations
4. R reducers retrieve data from mappers
5. Final output written


Distributed, scalable storage for key-value pairs

Example: Dynamo (Amazon)

Another example may be P2P storage (e.g., Chord)

Key-value store can be a general foundation for more complex data structures

But performance may suffer

Data Storage


Key-Value Store

Data Storage


Key-Value Store

Dynamo: a decentralized, scalable key-value store

Used in Amazon

Uses consistent hashing to distribute data among nodes

Replicated, versioned, load balanced

Easy-to-use interface: put()/get()
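A minimal consistent-hashing sketch in the spirit of Dynamo (illustrative only: the real system adds virtual nodes, preference lists, replication, and versioning):

import hashlib
from bisect import bisect

def h(s):
    # stable hash onto the ring (Python's built-in hash() varies per run)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((h(n), n) for n in nodes)    # node positions on the ring

def lookup(key):
    # walk clockwise to the first node at or after the key's position
    i = bisect(ring, (h(key), "")) % len(ring)
    return ring[i][1]

print(lookup("user:42"))   # the node that put()/get() would contact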



Networked block storage

ND by SUN Microsystems

Remote block storage over the Internet

Use S3 as a block device [Brantner]

Block-level remote storage may become slow in networks with long latencies

Data Storage


Network Block Device


PC file systems

Link together all clusters of a file

Directory entry: filename, attributes, date/time, starting cluster, file size

Boot sector (superblock): file-system-wide information

File allocation table, root directory, …


Data Storage


Traditional File Systems

[Disk layout: boot sector | FAT 1 | FAT 2 (duplicate) | root directory | normal directories and files]
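A toy illustration of "link together all clusters of a file": the FAT is an array in which entry i names the cluster that follows cluster i (the table values and end-of-chain marker below are simplified assumptions, not the real FAT encoding):

EOC = -1                          # end-of-chain marker (simplified)
fat = {2: 5, 5: 6, 6: 9, 9: EOC}  # toy file allocation table

def clusters_of(start):
    chain, c = [], start
    while c != EOC:
        chain.append(c)
        c = fat[c]                # follow the link to the next cluster
    return chain

print(clusters_of(2))             # [2, 5, 6, 9]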


NFS: Network File System [Sandberg]

Designed by SUN Microsystems in the 1980's

Transparent remote access to files stored remotely

XDR, RPC, VNode, VFS

Mountable file system, synchronous behavior

Stateless server


Data Storage


Network File System
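Why statelessness matters: every NFS read carries its full context, so a crashed server can simply restart and clients can safely retry. A rough sketch (an illustration, not the actual NFS protocol):

def nfs_read(path, offset, count):
    # 'path' stands in for an opaque file handle; the server keeps
    # no per-client state between calls
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(count)      # idempotent: safe to retry

# e.g., data = nfs_read("/export/readme.txt", 0, 4096)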

NFS organization

[Diagram: NFS client communicating with NFS server]

Data Storage


Network File System


A distributed file system at work (GFS)

Single master and numerous slaves communicate with each other

File data unit, "chunk", is up to 64MB. Chunks are replicated.

"Master" is a single point of failure and a scalability bottleneck; the consistency model is difficult to use

Data Storage

Google File System (GFS)
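The read path this design implies, sketched (the metadata layout and names below are assumptions for illustration): the client asks the single master which chunkservers hold chunk i of a file, then reads the data directly from a replica.

CHUNK = 64 * 2**20                 # 64MB chunks, as above

# toy master metadata: (file, chunk index) -> replica chunkservers
locations = {("/logs/web.log", 0): ["cs1", "cs7", "cs9"]}

def read(path, offset):
    idx = offset // CHUNK               # which chunk holds this offset
    replicas = locations[(path, idx)]   # one small RPC to the master
    return replicas[0], offset % CHUNK  # then read a replica directly

print(read("/logs/web.log", 1024))      # ('cs1', 1024)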


[Diagram: the Parts table stored on two replicas; rows (E, 75656, C), (A, 42342, E), (B, 42521, W), (C, 66354, W), (D, 12352, E), (F, 15677, E)]

CREATE TABLE Parts (
    ID          VARCHAR,
    StockNumber INT,
    Status      VARCHAR
)

Parallel database

Replication

Indexes and views

Structured schema

[Diagram: the same rows ordered by ID, A through F]

Data Storage


Database

PNUTS: a relational database service designed and used by Yahoo!

MapReduce/Hadoop


Around 2004, Google invented MapReduce to parallelize computation over large data sets. It has been a key component in Google's technology foundation.

Around 2008, Yahoo! developed the open-source variant of MapReduce named Hadoop.

After 2008, MapReduce/Hadoop became a key technology component in cloud computing.

In 2010, the U.S. granted the MapReduce patent to Google.

[Timeline: MapReduce → … Hadoop or variants …]

Hadoop


MapReduce provides an easy-to-use framework for parallel programming, but is it the most efficient and best solution to program execution in datacenters?


MapReduce has its discontents


DeWitt and Stonebraker: "MapReduce: A major step backwards"

MapReduce is far less sophisticated and efficient than parallel query processing


MapReduce is a parallel processing framework, not a database
system, nor a query language


It is possible to use MapReduce to implement some of the parallel query
processing functions


What are the real limitations?


Inefficient for general programming (and not designed for that)


Hard to handle data with complex dependence, frequent updates, etc.


High overhead, bursty I/O, difficult to handle long streaming data


Limited opportunity for optimization


MapReduce

Limitations

Critiques

MapReduce: A major step backwards








-- David J. DeWitt and Michael Stonebraker

(MapReduce) is:

A giant step backward in the programming paradigm for large-scale data-intensive applications


A sub-optimal implementation, in that it uses brute force instead of indexing


Not novel at all


Missing features


Incompatible with all of the tools DBMS users have come to
depend on



Inefficient for general programming (and not designed
for that)


Hard to handle data with complex dependence, frequent
updates, etc.


High overhead, bursty I/O


Experience with developing a Hadoop-based distributed compiler

Workload: compile Linux kernel

4 machines available to Hadoop for parallel compiling

Observation: parallel compiling on 4 nodes with Hadoop can be even slower than sequential compiling on one node

MapReduce

Limitations


Proprietary solution developed in an environment with
one prevailing application (web search)


The assumptions introduce several important constraints in
data and logic


Not a general-purpose parallel execution technology


Design choices in MapReduce


Optimizes for throughput rather than latency


Optimizes for large data set rather than small data structures


Optimizes for coarse-grained parallelism rather than fine-grained

Re-thinking MapReduce


A lightweight parallelization framework following the MapReduce paradigm

Implemented in C++

More than just an efficient implementation of MapReduce

Goal: a lightweight "parallelization" service that programs can invoke during execution

MRlite follows several principles

Memory is media: avoid touching hard drives

Static facility for dynamic utility: use and reuse threads for map tasks

MRlite: Lightweight Parallel Processing

MRlite

Towards Lightweight, Scalable, and General Parallel Processing

[Architecture diagram: application → MRlite client → MRlite master (scheduler) → slaves, over high-speed distributed storage; data flow and command flow shown separately]

Linked together with the app, the MRlite client library accepts calls from the app and submits jobs to the master.

The MRlite master accepts jobs from clients and schedules them to execute on slaves.

Distributed nodes (slaves) accept tasks from the master and execute them.

High-speed distributed storage stores intermediate files.

Computing Capability

Execution time (sec):

Workload      | gcc (on one node) | mrcc/Hadoop | mrcc/MRlite
Linux kernel  | 2936              | 9044        | 506
ImageMagick   | 312               | 653         | 50
Xen tools     | 128               | 1419        | 65

Using MRlite, the parallel compilation job, mrcc, runs 10 times faster than on Hadoop!

Z. Ma and L. Gu. The Limitation of MapReduce: A Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010.

Network activities under MapReduce/Hadoop workload

Hadoop: open-source implementation of MapReduce

Processing data with 3 servers (20 cores)

116.8GB input data

Network activities captured with Xen virtual machines

Inside MapReduce-Style Computation

Workflow

Where is the communication-intensive part?

1. Initial data split into 64MB blocks
2. Computed, results locally stored
3. Master informed of result locations
4. R reducers retrieve data from mappers
5. Final output written


Packet reception under MapReduce/Hadoop workload

Large data volume

Bursty network traffic

Generality: widely observed in MapReduce workloads

Packet reception on a slave server

Inside
MapReduce

Packet reception on the master server

Inside
MapReduce

Packet transmission on the master server

Inside
MapReduce

Major Components of a Datacenter


Computing hardware (equipment racks)


Power supply and distribution hardware


Cooling hardware and cooling fluid
distribution hardware


Network infrastructure


IT Personnel and office equipment


Datacenter Networking

Growth Trends in Datacenters


Load on network & servers continues to rapidly grow


Rapid growth: a rough estimate of annual growth rate: enterprise data centers ~35%, Internet data centers 50%-100%


Information access anywhere, anytime, from many devices


Desktops, laptops, PDAs & smart phones, sensor
networks, proliferation of broadband


Mainstream servers moving towards higher speed links


1-GbE to 10-GbE in 2008-2009

10-GbE to 40-GbE in 2010-2012


High-speed datacenter-MAN/WAN connectivity

High-speed datacenter syncing for disaster recovery


Datacenter Networking


A large part of the total cost of the DC hardware


Large routers and high-bandwidth switches are very expensive

Relatively unreliable: many components may fail


Many major operators and companies design their
own datacenter networking to save money and
improve reliability/scalability/performance.


The topology is often known


The number of nodes is limited


The protocols used in the DC are known


Security is simpler inside the data center, but
challenging at the border


We can distribute applications to servers to distribute
load and minimize hot spots

Datacenter Networking

Networking components (examples)


High Performance & High Density Switches & Routers


Scaling to 512 10GbE ports per
chassis


No need for proprietary protocols
to scale



Highly scalable DC Border Routers


3.2 Tbps capacity in a single
chassis


10 Million routes, 1 Million in
hardware


2,000 BGP peers


2K L3 VPNs, 16K L2 VPNs


High port density for GE and
10GE application connectivity


Security

768 1-GE ports downstream

64 10-GE ports upstream

Datacenter Networking

Common data center topology

[Diagram: Internet → Core (Layer-3 routers) → Aggregation (Layer-2/3 switches) → Access (Layer-2 switches) → Servers, inside the Data Center]

Datacenter Networking

Data center network design goals


High network bandwidth, low latency


Reduce the need for large switches in the core


Simplify the software, push complexity to the
edge of the network


Improve reliability


Reduce capital and operating cost

Datacenter Networking

Data Center Networking

[Figures: "Avoid this…" and "…and simplify this…"]

Can we avoid using high-end switches?


Expensive high-end switches to scale up


Single point of failure and
bandwidth bottleneck


Experiences from real systems


One answer: DCell

Interconnect

DCell

Ideas



#1: Use mini-switches to scale out

#2: Leverage servers to be part of the routing infrastructure

Servers have multiple ports and need to forward packets

#3: Use recursion to scale and build a complete graph to increase capacity (see the sketch below)
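How far idea #3 scales, sketched from the DCell construction (a level-k DCell is built from t_{k-1}+1 copies of a level-(k-1) DCell, so the server count grows roughly doubly exponentially):

def dcell_servers(n, k):
    # n = servers per mini-switch in DCell_0; k = recursion level
    t = n
    for _ in range(k):
        t = t * (t + 1)            # t_k = t_{k-1} * (t_{k-1} + 1)
    return t

print(dcell_servers(4, 2))         # 4 -> 20 -> 420 servers at level 2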

Interconnect

One approach: switched network with a hypercube interconnect

Leaf switch: 40 1Gbps ports + 2 10Gbps ports

One switch per rack

Not replicated (if a switch fails, lose one rack of capacity)

Core switch: 10 10Gbps ports

Form a hypercube

Hypercube: a high-dimensional rectangle

Data Center Networking

Hypercube properties

Minimum hop count

Even load distribution for all-all communication

Can route around switch/link failures

Simple routing:

Outport = f(Dest xor NodeNum)

No routing tables
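A sketch of that table-free routing rule (choosing the lowest differing dimension first is an assumption; any differing dimension works):

def next_hop(node, dest):
    diff = node ^ dest             # set bits = dimensions still to cross
    if diff == 0:
        return None                # arrived
    dim = (diff & -diff).bit_length() - 1   # lowest differing dimension
    return node ^ (1 << dim)       # flip that bit: the out-port

node = 0b0000
while node != 0b1011:              # route node 0 -> node 11 (dimension 4)
    node = next_hop(node, 0b1011)
    print(node)                    # prints 1, 3, 11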

Interconnect

A 16-node (dimension 4) hypercube

[Figure: nodes 0-15 connected along dimensions 0-3]
Interconnect



[Figure: a 64-switch hypercube built from containers.
Level 0: 32 40-port 1 Gb/sec switches.
Level 1: 8 10-port 10 Gb/sec switches, 64 10 Gb/sec links.
Level 2: 2 10-port 10 Gb/sec switches, 16 10 Gb/sec links.
Four 4x4 sub-cubes with 16 links each; 1280 Gb/sec links; one container: 4 links; 63*4 links to other containers]
Interconnect

How many servers can be connected in this system?

81920 servers with 1Gbps bandwidth

Core switch: 10Gbps port x 10

Leaf switch: 1Gbps port x 40 + 10Gbps port x 2
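One plausible reading of how the 81920 figure arises (an assumption from the figure above: 64 containers, each with 32 leaf switches exposing 40 1Gbps server ports):

containers   = 64
leaves_each  = 32        # level-0 40-port switches per container
server_ports = 40        # 1Gbps server ports per leaf switch
print(containers * leaves_each * server_ports)   # 81920 servers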

The Black Box

Data Center Networking

Shipping Container as Data Center Module


Data Center Module


Contains network gear, compute, storage, &
cooling


Just plug in power, network, & chilled water


Increased cooling efficiency


Water & air flow


Better air flow management


Meet seasonal load requirements


Data Center Network

Unit of Data Center Growth


One at a time:


1 system


Racking & networking: 14 hrs ($1,330)


Rack at a time:


~40 systems


Install & networking: 0.75 hrs ($60)


Container at a time:


~1,000 systems


No packaging to remove


No floor space required


Power, network, & cooling only


Weatherproof & easy to transport


Data center construction takes 24+ months


Data Center Network

Multiple-Site Redundancy and Enhanced Performance using load balancing




Handling site failures transparently

Providing best site selection per user

Leveraging both DNS and non-DNS methods for multi-site redundancy

Providing disaster recovery and non-stop operation

[Diagram: an LB system and DNS directing users across three datacenters]

LB (load balancing) System


The load balancing systems regulate global data center traffic


Incorporates site health, load, user proximity, and service response for user
site selection


Provides transparent site failover in case of disaster or service outage
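A toy sketch of such site selection (the scoring below is invented; the slides say only that health, load, proximity, and service response feed the decision):

sites = [
    {"name": "dc-east", "healthy": True,  "load": 0.7, "rtt_ms": 30},
    {"name": "dc-west", "healthy": True,  "load": 0.2, "rtt_ms": 80},
    {"name": "dc-eu",   "healthy": False, "load": 0.1, "rtt_ms": 120},
]

def pick_site(sites):
    live = [s for s in sites if s["healthy"]]   # transparent failover
    # lower is better: weighted mix of load and user proximity
    return min(live, key=lambda s: 0.6 * s["load"] * 100 + 0.4 * s["rtt_ms"])

print(pick_site(sites)["name"])                 # dc-west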

Global Data Center Deployment Problems

Data Center Network

Challenges and Research Problems

Hardware


High-performance, reliable, cost-effective computing infrastructure


Cooling, air cleaning, and energy efficiency

[Barroso] Clusters

[Fan] Power

[Andersen] FAWN

[Raghavendra] Power

Challenges and Research Problems

System software


Operating systems


Compilers


Database


Execution engines and containers

Ghemawat: GFS

Chang: Bigtable

DeCandia: Dynamo

Brantner: DB on S3

Cooper: PNUTS

Yu: DryadLINQ

Dean: MapReduce

Burrows: Chubby

Isard: Quincy

Challenges and Research Problems

Networking


Interconnect and global network structuring


Traffic engineering

Al-Fares: Commodity DC

Guo 2008: DCell

Guo 2009: BCube

Challenges and Research Problems


Data and programming


Data consistency mechanisms (e.g., replications)


Fault tolerance


Interfaces and semantics


Software engineering


User interface


Application architecture


Pike: Sawzall

Olston: Pig Latin

Buyya: IT services

Resources


[Al-Fares] Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (Seattle, WA, USA, August 17-22, 2008). SIGCOMM '08. 63-74. http://baijia.info/showthread.php?tid=139


[Andersen] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee,
Lawrence Tan, Vijay Vasudevan. FAWN: A Fast Array of Wimpy Nodes. SOSP'09.
http://baijia.info/showthread.php?tid=179


[Barroso] Luiz Barroso, Jeffrey Dean, Urs Hoelzle, "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003. http://baijia.info/showthread.php?tid=133


[Brantner] Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. Building a database on S3. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (Vancouver, Canada, June 09-12, 2008). SIGMOD '08. 251-264. http://baijia.info/showthread.php?tid=125



Resources


[Burrows] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06-08, 2006). 335-350. http://baijia.info/showthread.php?tid=59


[Buyya] Buyya, R., Chee Shin Yeo, and Venugopal, S. Market-Oriented Cloud Computing. The 10th IEEE International Conference on High Performance Computing and Communications, 2008. HPCC '08. http://baijia.info/showthread.php?tid=248


[Chang] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06-08, 2006). 205-218. http://baijia.info/showthread.php?tid=4


[Cooper] Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277-1288. http://baijia.info/showthread.php?tid=126

Resources


[Dean] Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (San Francisco, CA, December 06-08, 2004). http://baijia.info/showthread.php?tid=2


[DeCandia] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA, October 14-17, 2007). SOSP '07. ACM, New York, NY, 205-220. http://baijia.info/showthread.php?tid=120


[Fan] Fan, X., Weber, W., and Barroso, L. A. Power provisioning for a warehouse-sized computer. In Proceedings of the 34th Annual International Symposium on Computer Architecture (San Diego, California, USA, June 09-13, 2007). ISCA '07. 13-23. http://baijia.info/showthread.php?tid=144


Resources


[Ghemawat] Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA, October 19-22, 2003). SOSP '03. ACM, New York, NY, 29-43. http://baijia.info/showthread.php?tid=1



[Guo 2008] Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In ACM SIGCOMM 08. http://baijia.info/showthread.php?tid=142


[Guo 2009] Chuanxiong Guo, Guohan Lu, Dan Li, Xuan Zhang, Haitao Wu, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In ACM SIGCOMM 09. http://baijia.info/showthread.php?tid=141


[Isard] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar and
Andrew Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. SOSP'09.
http://baijia.info/showthread.php?tid=203

Resources


[Olston] Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (Vancouver, Canada, June 09-12, 2008). SIGMOD '08. 1099-1110. http://baijia.info/showthread.php?tid=124


[Pike] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data: Parallel analysis with Sawzall. Sci. Program. 13, 4 (Oct. 2005), 277-298. http://baijia.info/showthread.php?tid=60


[Raghavendra] Ramya Raghavendra, Parthasarathy Ranganathan, Vanish Talwar, Zhikui Wang, Xiaoyun Zhu. No "Power" Struggles: Coordinated Multi-level Power Management for the Data Center. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Seattle, WA, March 2008. http://baijia.info/showthread.php?tid=183


[Yu] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), December 8-10, 2008. http://baijia.info/showthread.php?tid=5

Thank you!

Questions?