Yeah! That’s what
I’d like to know.

Indranil Gupta (Indy)

Lecture 2

What(’s in) the Cloud?

January 20, 2011

CS 525

Advanced Distributed
Systems

Spring 2011

All Slides © IG

2

Clouds are Water Vapor


Oracle has a Cloud Computing Center.


And yet…


Larry Ellison’s Rant on Cloud Computing



3

The Hype!


Gartner: Cloud computing revenue will soar faster than expected and will exceed $150 billion within five years.

Forrester: Cloud-Based Email Is Often Cheaper Than On-Premise Email


Vivek Kundra, Federal CIO in the Obama administration: “Growing adoption of cloud computing could improve data sharing and promote collaboration among federal, state and local governments.” E.g., fedbizopps.gov


Merrill Lynch: “By 2011 the volume of cloud computing market opportunity would amount to $160bn, including $95bn in business and productivity apps (email, office, CRM, etc.) and $65bn in online advertising.”

IDC: “Spending on IT cloud services will triple in the next 5 years, reaching $42 billion and capturing 25% of IT spending growth in 2012.”

Sources: http://www.infosysblogs.com/cloudcomputing/2009/08/the_cloud_computing_quotes.htm and http://www.mytestbox.com

4

Ha ha hype! It’s a bunch of tripe, since probably no one is making or saving money.

5

$$$


Ingo Elfering, Vice President of Information Technology Strategy, GlaxoSmithKline: “With Online Services, we are able to reduce our IT operational costs by roughly 30% of what we’re spending now and introduce a variable cost subscription model for these technologies that allows us to more rapidly scale or divest our investment as necessary as we undergo a transformational change in the pharmaceutical industry.”

Jim Swartz, CIO, Sybase: “At Sybase, a private cloud of virtual servers inside its data centre has saved nearly $US2 million annually since 2006, Swartz says, because the company can share computing power and storage resources across servers.”

Dave Powers, Associate Information Consultant at Eli Lilly and Company: “With AWS, Powers said, a new server can be up and running in three minutes (it used to take Eli Lilly seven and a half weeks to deploy a server internally) and a 64-node Linux cluster can be online in five minutes (compared with three months internally). The deployment time is really what impressed us. It’s just shy of instantaneous.”

Sources: http://www.infosysblogs.com/cloudcomputing/2009/08/the_cloud_computing_quotes.htm and http://www.mytestbox.com

6

Alright, alright. But for heaven’s sake, can someone tell me what is a cloud?

7

What is a Cloud?


It’s a cluster! It’s a supercomputer! It’s a datastore!

It’s superman!

None of the above

All of the above

Cloud = Lots of storage + compute cycles nearby

8

What is a Cloud?


A single-site cloud (aka “Datacenter”) consists of

Compute nodes (split into racks)

Switches, connecting the racks

A network topology, e.g., hierarchical

Storage (backend) nodes connected to the network

Front-end for submitting jobs

Services: physical resource set, software services

A geographically distributed cloud consists of

Multiple such sites

Each site perhaps with a different structure and services

9

A Sample Cloud Topology

Top of the Rack Switch

Core Switch

Servers

Rack

If links closer to the core have higher bandwidth, then this is a “fat tree” topology
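
As a rough illustration of the hierarchy in this figure, the sketch below counts how many switches a packet crosses between two servers. It assumes a two-level tree with one top-of-the-rack switch per rack and a single core switch; the server names and the rack layout are made up for the example.

```python
# Hypothetical layout: which rack each server sits in (names are illustrative).
RACK_OF = {"s1": "rack-A", "s2": "rack-A", "s3": "rack-B", "s4": "rack-C"}


def switches_crossed(src: str, dst: str) -> int:
    """Count switches on the path between two servers in a two-level tree."""
    if src == dst:
        return 0  # same machine, nothing crosses the network
    if RACK_OF[src] == RACK_OF[dst]:
        return 1  # only the shared top-of-the-rack switch
    return 3      # source ToR switch -> core switch -> destination ToR switch


print(switches_crossed("s1", "s2"))  # two servers in the same rack
print(switches_crossed("s1", "s3"))  # two servers in different racks
```

In this model all cross-rack traffic passes through the core switch, which is why the links toward the core are the ones that benefit most from extra bandwidth.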

10

Scale of Industry Datacenters


Microsoft [NYTimes, 2008]


150,000 machines


Growth rate of 10,000 per month


Largest datacenter: 48,000 machines


80,000 total running Bing


Yahoo! [Hadoop Summit, 2009]


25,000 machines


Split into datacenters of 4000 machines each


AWS EC2 (Oct 2009)


40,000 machines


8 cores/machine


Google


(Rumored) several hundreds of thousands of machines

11

OK, they are massive. But it
is still called a “cluster”! And
that’s not a new concept!

12

“A Cloudy History of Time” (© IG 2010)

[Timeline figure, 1940 to 2010. Labels: Timesharing Companies & Data Processing Industry; Grids; Peer to peer systems; Clusters; The first datacenters!; PCs (not distributed!); Clouds and datacenters]

13

Timesharing Industry (1975):


Market Share: Honeywell 34%, IBM 15%, Xerox 10%, CDC 10%, DEC 10%, UNIVAC 10%

Honeywell 6000 & 635, IBM 370/168, Xerox 940 & Sigma 9, DEC PDP-10, UNIVAC 1108

Grids (1980s-2000s):

GriPhyN (1970s-80s)

Open Science Grid and Lambda Rail (2000s)

Globus & other standards (1990s-2000s)

First large datacenters: ENIAC, ORDVAC, ILLIAC

Many used vacuum tubes and mechanical relays

P2P Systems (90s-00s)

Many Millions of users

Many GB per day

Data Processing Industry

1968: $70 M. 1978: $3.15 Billion.

Berkeley NOW Project

Supercomputers

Server Farms (e.g., Oceano)

“A Cloudy History of Time”

© IG 2010

Clouds

14

Why did all of this happen?

15

Trends: Technology


Doubling Periods


storage: 12 mos, bandwidth: 9 mos, and (what law is this?) cpu capacity: 18 mos


Then and Now


Bandwidth


1985: mostly 56Kbps links nationwide


2004: 155 Mbps links widespread

Disk capacity


Today’s PCs have 100GBs, same as a 1990 supercomputer
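
To get a feel for the doubling periods on this slide, here is a small sketch that turns them into growth factors. It assumes clean exponential growth at the quoted rates; the three-year horizon is only an example, not a figure from the lecture.

```python
# Doubling periods in months, as quoted on the slide.
DOUBLING_MONTHS = {"storage": 12, "bandwidth": 9, "cpu": 18}


def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Capacity multiplier after months_elapsed, given a doubling period."""
    return 2 ** (months_elapsed / doubling_months)


# Example horizon (an assumption for illustration): 36 months.
for resource, period in DOUBLING_MONTHS.items():
    print(f"{resource}: x{growth_factor(36, period):.0f} in 3 years")
```

Over the same three years bandwidth grows 16x while CPU capacity grows only 4x, so under these rates moving data keeps getting cheaper relative to computing on it.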

16

Trends: Users


Then and Now


Biologists:


1990: were running small single-molecule simulations

2004: want to calculate structures of complex macromolecules, want to screen thousands of drug candidates, sequence very complex genomes

Physicists

2008 onwards: CERN’s Large Hadron Collider will produce 700 MB/s or 15 PB/year

Trends in Technology and User Requirements: Independent or Symbiotic?
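
The two LHC figures on this slide (700 MB/s and 15 PB/year) can be related with a quick back-of-the-envelope calculation. The sketch assumes decimal units (1 PB = 10^9 MB) and a 365-day year; the conclusion about how much of the year data is recorded is an inference from the arithmetic, not a statement from the lecture.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def petabytes_per_year(megabytes_per_second: float) -> float:
    """Yearly volume at a constant rate, in decimal units (1 PB = 1e9 MB)."""
    return megabytes_per_second * SECONDS_PER_YEAR / 1e9


continuous = petabytes_per_year(700)   # volume if 700 MB/s ran nonstop
fraction_of_year = 15 / continuous     # share of the year implied by 15 PB

print(f"700 MB/s nonstop: {continuous:.1f} PB/year")
print(f"15 PB/year implies recording about {fraction_of_year:.0%} of the time")
```

Running nonstop at 700 MB/s would give about 22 PB/year, so the 15 PB/year figure is consistent with data being produced roughly two thirds of the time.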


17

Prophecies

In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”.

Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Communicate Application


[Have today’s clouds brought us closer to this reality?]

18

So, clouds have been around for decades! But aside from massive scale what’s new about today’s cloud computing?!

19

What(’s new) in Today’s Clouds?

Three major features:

I. On-demand access: Pay-as-you-go, no upfront commitment.

Anyone can access it (e.g., Washington Post Hillary Clinton example)

II. Data-intensive Nature: What was MBs has now become TBs.

Daily logs, forensics, Web data, etc.

Do you know the size of a Wikipedia dump?

III. New Cloud Programming Paradigms: MapReduce/Hadoop, Pig Latin, DryadLINQ, Swift, and many others.

High in accessibility and ease of programmability

A combination of one or more of these gives rise to novel and unsolved distributed computing problems in cloud computing.

20

I. On-demand access: *aaS Classification

On-demand: renting a cab vs (previously) renting a car, or buying one. E.g.:

AWS Elastic Compute Cloud (EC2): $0.086-$1.16 per CPU hour

AWS Simple Storage Service (S3): $0.055-$0.15 per GB-month

HaaS: Hardware as a Service

You get access to barebones hardware machines, do whatever you want with them

Ex: Your own cluster, Emulab

IaaS: Infrastructure as a Service

You get access to flexible computing and storage infrastructure. Virtualization is one way of achieving this. Often said to subsume HaaS.

Ex: Amazon Web Services (AWS: EC2 and S3), Eucalyptus, Rightscale.

PaaS: Platform as a Service

You get access to flexible computing and storage infrastructure, coupled with a software platform (often tightly)

Ex: Google’s AppEngine

SaaS: Software as a Service

You get access to software services, when you need them. Often said to subsume SOA (Service Oriented Architectures).

Ex: Microsoft’s LiveMesh, MS Office on demand
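
To make the pay-as-you-go idea concrete, here is a minimal sketch of a monthly bill estimate built from the EC2 and S3 price ranges quoted at the top of this slide. The workload (1,000 CPU hours and 500 GB stored) is a made-up example, and the model ignores everything the slide does not mention, such as data transfer charges.

```python
# Price ranges in USD, as quoted on the slide (low end, high end).
EC2_PER_CPU_HOUR = (0.086, 1.16)
S3_PER_GB_MONTH = (0.055, 0.15)


def monthly_cost_range(cpu_hours: float, stored_gb: float) -> tuple[float, float]:
    """Lowest and highest monthly bill for a given compute and storage usage."""
    low = cpu_hours * EC2_PER_CPU_HOUR[0] + stored_gb * S3_PER_GB_MONTH[0]
    high = cpu_hours * EC2_PER_CPU_HOUR[1] + stored_gb * S3_PER_GB_MONTH[1]
    return low, high


# Hypothetical workload: 1,000 CPU hours and 500 GB stored for one month.
low, high = monthly_cost_range(cpu_hours=1000, stored_gb=500)
print(f"Estimated bill: ${low:.2f} to ${high:.2f}")
```

The point of the cab analogy shows up in the function signature: cost depends only on what was used this month, with no upfront term.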

21

II. Data-intensive Computing

Computation-Intensive Computing

Example areas: MPI-based, High-performance computing, Grids

Typically run on supercomputers (e.g., NCSA Blue Waters)

Data-Intensive

Typically store data at datacenters

Use compute nodes nearby

Compute nodes run computation services

In data-intensive computing, the focus shifts from computation to the data: CPU utilization is no longer the most important resource metric


Problem areas include


Distributed systems


Middleware


OS


Storage


Networking


Security


Others


22

III. New Cloud Programming Paradigms

Dataflow programming frameworks


Google: MapReduce and Sawzall


Yahoo: Hadoop and Pig Latin


Microsoft: DryadLINQ


Facebook: Hive


Amazon: Elastic MapReduce service (pay-as-you-go)

Google (MapReduce)

Indexing: a chain of 24 MapReduce jobs

~200K jobs processing 50PB/month (in 2006)

Yahoo! (Hadoop + Pig)

WebMap: a chain of 100 MapReduce jobs

280 TB of data, 2500 nodes, 73 hours

Facebook (Hadoop + Hive)

~300TB total, adding 2TB/day (in 2008)

3K jobs processing 55TB/day

Similar numbers from other companies, e.g., Yieldex, eharmony.com, etc.
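
For readers who have not seen the MapReduce model named on this slide, the following is a minimal single-machine sketch of the idea: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step folds each group. It is not Hadoop's or Google's API; the function names and the word-count job are illustrative assumptions, and a real framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_reduce(records: Iterable, mapper: Callable, reducer: Callable) -> dict:
    """Run one in-memory MapReduce job: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # map phase
            groups[key].append(value)          # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> Iterator[tuple[str, int]]:
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word: str, counts: list[int]) -> int:
    """Sum the partial counts collected for one word."""
    return sum(counts)


lines = ["the cloud is a cluster", "the cloud is a datastore"]
counts = map_reduce(lines, word_count_mapper, word_count_reducer)
print(counts)
```

A chain of jobs, like the indexing and WebMap pipelines listed above, amounts to feeding the output of one such job into the next as its input records.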


23

This is all confusing. Can
you give me some examples
of clouds?

24

Two Categories of Clouds


Industrial Clouds



Can be either a (i) public cloud, or (ii) private cloud

Private clouds are accessible only to company employees

Public clouds provide service to any paying customer:

Amazon S3 (Simple Storage Service): store arbitrary datasets, pay per GB-month stored

Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary images, pay per CPU hour used

Google AppEngine: develop applications within their appengine framework, upload data which is then imported into their format, and run

Academic Clouds

Allow researchers to innovate, deploy, and experiment

Google-IBM Cloud (U. Washington): run apps programmed atop Hadoop

Cloud Computing Testbed (CCT @ UIUC): first cloud testbed to support systems research. Runs: (i) apps programmed atop Hadoop and Pig, (ii) systems-level research on this first generation of cloud computing models (~HaaS), and (iii) Eucalyptus services (~AWS EC2). http://cloud.cs.illinois.edu

OpenCirrus: first federated cloud testbed. http://opencirrus.org


25

Academic Clouds


CCT = Cloud Computing Testbed


NSF infrastructure


Used by 10+ NSF projects, including several non-UIUC projects

Housed within Siebel Center (4th floor!)

Accessible to students of CS525!

Almost half of the SP09/SP10 class used CCT for their projects

OpenCirrus = Federated Cloud Testbed

Contains CCT and other sites

If you need a CCT account for your CS525 experiment, let me know asap! Only a limited number of these are available for CS525

26

Cloud Computing Testbed (CCT)

27


128 compute nodes = 64+64


500 TB & 1000+ shared cores


CCT Hardware in more Detail

28

Goal of CCT: Support both Systems Research and Applications Research in Data-intensive Distributed Computing

29

CCT Software Services

Accessing and Using CCT:

I. Systems Partition (64-8 nodes):

CentOS machines

Dedicated access to a subset of machines (~Emulab), with sudo access

User accounts

User requests # machines (<= 64) + storage quota (<= 30 TB)

Machine allocation survives for 4 weeks, storage survives for 6 months (both extendible)

II. Hadoop/Pig Partition and Service (64 nodes)

III. Eucalyptus Partition (8 nodes)

30

Accessing and Using CCT:

I. Systems Partition (64-8 nodes)

II. Hadoop/Pig Partition and Service (64 nodes):

Looks like a regular shared Hadoop cluster service

Users share 64 nodes. Individual nodes not directly reachable.

4 slots per machine

Several users are reporting stable operation at 256 instances

During Spring 09/10, 10+ projects running simultaneously

User accounts

User requests account + storage quota (<= 30 TB)

Storage survives for 6 months (extendible)

III. Eucalyptus Partition (8 nodes)

CCT Software Services

31

Accessing and Using CCT:

I. Systems Partition (64-8 nodes)

II. Hadoop/Pig Partition and Service (64 nodes)

III. Eucalyptus Partition (8 nodes):

Based on open-source version of Eucalyptus from UCSB (Rich Wolski)

Exports same interface as AWS EC2 and S3.


CCT Software Services

32


Some Services running inside CCT


ZFS: backend file system.


Zenoss: Systems Monitoring. Shared with department’s other computing clusters

Hadoop + HDFS

Ability to make datasets publicly available

How do users request an account: two-stage process (go to http://cloud.cs.illinois.edu)

1. User account request

requires background check

2. Allocation request

CCT Software Services

33

Founding 6 sites

Open Cirrus Federation

34


Sites shown: HP, UIUC, Intel, KIT (de), IDA (sg), Yahoo, CMU, RAS, ETRI, MIMOS

First open federated cloud testbed

Shared: research, applications, infrastructure (9*1,000 cores), data sets

Global services: sign on, monitoring, store, etc.

Federated clouds, meaning each is different

Grown to 9 sites, with more to come

Open Cirrus Federation

35

OK, so that’s what a cloud
looks like today. Now,
suppose I want to start my
own company, Devils Inc.
Should I buy a cloud and own
it, or should I outsource to a
public cloud?

We’ll do that next week…


For now, it’s important to start thinking of
who’s on your project team…


Projects


Groups of 2 (need not be same as presentation groups). Could be 3.

We’ll start detailed discussions “soon” (a few classes into the student-led presentations)




Entr. Tidbits: Selecting your Team


Selecting your partner is important: select someone with a complementary personality to yours!

Apple: Wozniak loved being an engineer and hated interacting with people, Jobs loved making calls and doing sales and cared much less for engineering

Flickr: Butterfield was improvisational, Fake was goal-driven

PayPal: Levchin loved to program and break things, Thiel talked to VCs and did sales.

Hansson says that development of Ruby on Rails benefited from having a small team and a small budget that kept them focused; this is why the big giants could not beat them.

The upshot is that you have to select a team with complementary characteristics


37

Next Week


We will continue discussion of cloud computing

How MapReduce works

What are PlanetLab and Emulab

What is Grid computing

Then we will start to discuss Basics of P2P systems

Please read at least one paper from each session

39

Administrative Announcements

Student-led paper presentations (see instructions on website)

Start from February 10th

Groups of up to 2 students present each class, responsible for a set of 3 “Main Papers” on a topic

45 minute presentations (total) followed by discussion

Set up appointment with me to show slides by 5 pm the day prior to presentation

Select your topic by Jan 31st

List of papers is up on the website

Each of the other students (non-presenters) is expected to read the papers before class and turn in a one- to two-page review of any two of the main set of papers (summary, comments, criticisms and possible future directions)

Email review and bring in hardcopy before class

Reviews are not due until student presentations start

Submit reviews for any 15 sessions (from 2/10 to 4/28)