Running Scientific Workflow Applications on the Amazon EC2 Cloud







Bruce Berriman, NASA Exoplanet Science Institute, IPAC

Gideon Juve, Ewa Deelman, Karan Vahi, Gaurang Mehta, Information Sciences Institute, USC

Benjamin Berman, USC Epigenome Center

Phil Maechling, Southern California Earthquake Center


Clouds (Utility Computing)


Pay for what you use rather than purchase compute and storage resources that end up underutilized.

Analogous to household utilities.

Originated in the business domain to provide services for small companies that did not want to maintain an IT department.

Provided by data centers built on compute and storage virtualization technologies.

Clouds are built with commodity hardware: they are a "new purchasing paradigm" rather than a new technology.



Benefits and Concerns

Benefits

Pay only for what you need.

Elasticity: increase or decrease capacity within minutes.

Ease strain on the local physical plant.

Control local system administration costs.

Concerns

What if providers become oversubscribed and a user cannot increase capacity on demand?

How will the cost structure change with time?

If we become dependent on them, will we be at the cloud providers' mercy?

Are clouds secure?

Are they up to the demands of science applications?



Cloud Providers


Pricing structures vary widely:

Amazon EC2 charges for hourly usage.

Skytap charges per month.

IBM requires an annual subscription.

Savvis offers servers for purchase.

Uses:

Running business applications.

Web hosting.

Providing additional capacity for heavy loads.

Application testing.

Providers (source: InformationWeek, 9/4/09):

Amazon.com EC2
AT&T Synaptic Hosting
GNi Dedicated Hosting
IBM Computing on Demand
Rackspace Cloud Servers
Savvis Open Cloud
ServePath GoGrid
Skytap Virtual Lab
3Tera
Unisys Secure
Verizon Computing
Zimory Gateway

Purposes of Our Study

How useful is cloud computing for scientific workflow applications?

An experimental study of the performance of three workflows with different I/O, memory, and CPU requirements on a commercial cloud.

A comparison of the performance of cloud resources and typical HPC resources.

An analysis of the various costs associated with running workflows on a commercial cloud.



Clouds are well suited to processing of workflows

Workflows are loosely-coupled applications composed of tasks connected by data.

Clouds can allocate resources as needed for processing tasks and decrease scheduling overheads.

Chose the Amazon EC2 cloud and the NCSA Abe cluster:

http://aws.amazon.com/ec2/

http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64Cluster/
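Because the tasks are only coupled through data, a workflow engine needs nothing more than the task graph to decide what can run in parallel. A minimal sketch of that idea in plain Python (hypothetical task names; this is an illustration, not Pegasus itself):

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical three-stage, Montage-like workflow: each task lists the
# tasks whose output files it consumes (the edges carry data).
deps = {
    "reproject_1": set(),
    "reproject_2": set(),
    "rectify": {"reproject_1", "reproject_2"},  # background rectification
    "coadd": {"rectify"},                        # co-addition into a mosaic
}

def run(task):
    # Placeholder for dispatching the task to a cloud or cluster node.
    return f"{task}.out"

# Tasks become runnable as soon as their inputs exist, so the two
# independent reprojections could be handed to separate EC2 instances.
order = list(TopologicalSorter(deps).static_order())
outputs = [run(t) for t in order]
```

The scheduling freedom in the first two tasks is exactly what lets a cloud provision extra instances only while the wide stages of the workflow are running.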

The Applications: Montage

Toolkit for assembling FITS images into science-grade mosaics. http://montage.ipac.caltech.edu

Montage processing flow: reprojection, background rectification, co-addition.

Science grade: preserves spatial and calibration fidelity of input images.

Portable: runs on all common *nix platforms.

Open source code.

General: supports all common coordinate systems and image projections.

Speed: processes 40 million pixels in 32 min on 128 nodes of a 1.2 GHz Linux cluster.

Utilities for managing and manipulating image files.

Stand-alone modules.

The Applications: Broadband and Epigenome

Broadband simulates and compares seismograms from earthquake simulation codes.

Generates high- and low-frequency seismograms for several earthquake sources.

Computes intensities of seismograms at measuring stations.

Epigenome maps short DNA segments collected using high-throughput gene sequencing machines to a reference genome.

Maps chunks to a reference genome.

Produces an output map of gene density compared with the reference genome.

Comparison of Resource Usage


Ran a mosaic of 8 deg sq of M17 in the 2MASS J-band:

Workflow contains 10,429 tasks.

Reads 4.2 GB of input data.

Produces 7.9 GB of output data.

Montage is I/O-bound because it spends more than 95% of its time in I/O operations.



| Application | I/O    | Memory | CPU    |
|-------------|--------|--------|--------|
| Montage     | High   | Low    | Low    |
| Broadband   | Medium | High   | Medium |
| Epigenome   | Low    | Medium | High   |

Comparison of Resource Usage

Broadband:

4 sources and 5 stations.

Workflow contains 320 tasks.

6 GB of input data and 160 MB of output data.

Memory-limited because more than 75% of its runtime is consumed by tasks requiring more than 1 GB of physical memory.

Epigenome:

Workflow contains 81 tasks.

1.8 GB of input data.

300 MB of output data.

CPU-bound because it spends 99% of its runtime in the CPU and only 1% on I/O and other activities.




Processing Resources

Networks and File Systems

HPC systems use high-performance networks and parallel file systems, BUT Amazon EC2 uses commodity hardware.

Ran all processes on single, multi-core nodes. Used the local and parallel file systems on Abe.

Processors and OS

Linux Red Hat Enterprise with VMWare.

Amazon EC2 offers different instances: look at cost vs. performance.

c1.xlarge and abe.local are equivalent: estimate the overhead due to virtualization.

abe.lustre and abe.local differ only in file system.

Amazon EC2 instance types and Abe node configurations:

| Type       | Arch   | CPU                 | Cores | Memory | Network            | Storage | Price    |
|------------|--------|---------------------|-------|--------|--------------------|---------|----------|
| m1.small   | 32-bit | 2.0-2.6 GHz Opteron | 1/2   | 1.7 GB | 1-Gbps Ethernet    | Local   | $0.10/hr |
| m1.large   | 64-bit | 2.0-2.6 GHz Opteron | 2     | 7.5 GB | 1-Gbps Ethernet    | Local   | $0.40/hr |
| m1.xlarge  | 64-bit | 2.0-2.6 GHz Opteron | 4     | 15 GB  | 1-Gbps Ethernet    | Local   | $0.80/hr |
| c1.medium  | 32-bit | 2.33-2.66 GHz Xeon  | 2     | 1.7 GB | 1-Gbps Ethernet    | Local   | $0.20/hr |
| c1.xlarge  | 64-bit | 2.0-2.66 GHz Xeon   | 8     | 7.5 GB | 1-Gbps Ethernet    | Local   | $0.80/hr |
| abe.local  | 64-bit | 2.33 GHz Xeon       | 8     | 8 GB   | 10-Gbps InfiniBand | Local   |          |
| abe.lustre | 64-bit | 2.33 GHz Xeon       | 8     | 8 GB   | 10-Gbps InfiniBand | Lustre  |          |
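One useful way to read the price column is cost per core-hour. A quick back-of-the-envelope sketch from the EC2 rows of the table (treating m1.small's "1/2" core as 0.5 for the arithmetic, which is an assumption):

```python
# (instance, cores, $/hr) from the EC2 rows of the table above;
# m1.small's fractional core is approximated as 0.5.
instances = [
    ("m1.small", 0.5, 0.10),
    ("m1.large", 2, 0.40),
    ("m1.xlarge", 4, 0.80),
    ("c1.medium", 2, 0.20),
    ("c1.xlarge", 8, 0.80),
]

# Dollars per core-hour for each instance type.
per_core = {name: price / cores for name, cores, price in instances}
cheapest = min(per_core, key=per_core.get)
```

The high-CPU (c1.*) types work out to $0.10 per core-hour versus $0.20 for the m1.* types, which foreshadows why c1 instances dominate the cost-performance results later in the talk.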

Execution Environment

Establish equivalent software environments on the two platforms.

A "submit" host is used to send jobs to EC2 or Abe.

All workflows used the Pegasus Workflow Management System with DAGMan and Condor:

Pegasus transforms abstract workflow descriptions into concrete plans.

DAGMan manages dependencies.

Condor manages task execution.

[Diagram: submit host dispatching jobs to Amazon EC2 and Abe]
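For illustration, a dependency such as "run co-addition only after rectification" is expressed in a DAGMan input file roughly as follows (hypothetical job and file names; in the study the DAGs were generated by Pegasus, not written by hand):

```
# workflow.dag -- each JOB line names a Condor submit file;
# PARENT/CHILD lines encode the data dependencies.
JOB reproject1 reproject1.sub
JOB reproject2 reproject2.sub
JOB rectify    rectify.sub
JOB coadd      coadd.sub
PARENT reproject1 reproject2 CHILD rectify
PARENT rectify CHILD coadd
```

DAGMan releases a job to Condor only after all of its parents have completed, which is all the coordination these loosely-coupled workflows require.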

Montage Performance (I/O Bound)

Slowest on m1.small, but fastest on the machines with the most cores: m1.xlarge, c1.xlarge, abe.lustre, and abe.local.

The parallel file system on abe.lustre offers a big performance advantage for I/O-bound applications: to compete, cloud providers would need to offer parallel file systems and high-speed networks.

Virtualization overhead <10%.

Broadband Performance (Memory Bound)

Lower I/O requirements: not much difference between abe.lustre and abe.local; both have 8 GB memory. Only slightly worse performance on c1.xlarge, with 7.5 GB memory.

Poor performance on c1.medium, with only 1.7 GB of memory: cores may sit idle to prevent the system from running out of memory.

Virtualization overhead small.

Epigenome Performance (CPU Bound)

c1.xlarge, abe.lustre, and abe.local give the best performance: they are the three most powerful machines (64-bit, 2.3-2.6 GHz).

The parallel file system on abe.lustre offers little benefit.

Virtualization overhead is roughly 10%, the largest of the three applications: the CPU-bound code competes with the OS for the CPU.

Resource Cost Analysis

You get what you pay for! The cheapest instances are the least powerful.

| Instance  | Cost $/hr |
|-----------|-----------|
| m1.small  | 0.10      |
| m1.large  | 0.40      |
| m1.xlarge | 0.80      |
| c1.medium | 0.20      |
| c1.xlarge | 0.80      |

c1.medium is a good choice for Montage, but the more powerful processors are better for the other two applications.


Data Transfer Costs

| Operation    | Cost $/GB |
|--------------|-----------|
| Transfer in  | 0.10      |
| Transfer out | 0.17      |

For Broadband and Epigenome, it is economical to transfer data out of the cloud.

For Montage, the output is larger than the input, so the costs to transfer data out are equal to or higher than the processing costs for all but one instance type.

Is it more economical to store data on the cloud?



Data volumes:

| Application | Input (GB) | Output (GB) | Logs (MB) |
|-------------|------------|-------------|-----------|
| Montage     | 4.2        | 7.9         | 40        |
| Broadband   | 4.1        | 0.16        | 5.5       |
| Epigenome   | 1.8        | 0.3         | 3.3       |

Transfer costs:

| Application | Input | Output | Logs   | Total |
|-------------|-------|--------|--------|-------|
| Montage     | $0.42 | $1.32  | <$0.01 | $1.75 |
| Broadband   | $0.40 | $0.03  | <$0.01 | $0.43 |
| Epigenome   | $0.18 | $0.05  | <$0.01 | $0.23 |
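These totals are essentially data volume times the per-GB rates from the rate table, as this sketch shows (flat-rate arithmetic; the slide's figures differ by a cent or two in places, presumably from rounding or rate tiers):

```python
RATE_IN, RATE_OUT = 0.10, 0.17  # $/GB, from the transfer-rate table

def transfer_cost(input_gb, output_gb, logs_gb):
    """Flat-rate cost of moving one workflow run's data in and out of EC2."""
    cost_in = input_gb * RATE_IN
    cost_out = (output_gb + logs_gb) * RATE_OUT  # outputs and logs both leave
    return round(cost_in, 2), round(cost_out, 2)

# Epigenome: 1.8 GB in, 0.3 GB of output plus ~3.3 MB of logs out.
cin, cout = transfer_cost(1.8, 0.3, 0.0033)
```

Running the same function on Montage's 7.9 GB of output makes the asymmetry obvious: the $0.17/GB outbound rate is what pushes its transfer bill above its processing bill.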

Storage Costs

Storage charges:

| Item                   | Charge         |
|------------------------|----------------|
| Storage of VMs in S3   | $0.15/GB-month |
| Storage of data on EBS | $0.10/GB-month |

Storage costs of output per job:

| Application | Data ($/month) | VM ($/month) | Monthly Cost ($) |
|-------------|----------------|--------------|------------------|
| Montage     | 0.95           | 0.12         | 1.07             |
| Broadband   | 0.02           | 0.10         | 0.12             |
| Epigenome   | 0.20           | 0.10         | 0.32             |

Low- and high-cost scenarios for Montage with cloud storage:

| Item             | Low Cost ($) | High Cost ($) |
|------------------|--------------|---------------|
| Transfer data in | 0.42         | 0.42          |
| Processing       | 0.55         | 2.45          |
| Storage          | 1.07         | 1.07          |
| Transfer out     |              | 1.32          |
| Totals           | 2.04         | 5.22          |
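The monthly figures combine the two rates: EBS for the data and S3 for the VM image. A sketch of the arithmetic (the GB sizes here are hypothetical, back-solved to reproduce Montage's $1.07 row, since the slide does not state them):

```python
S3_RATE, EBS_RATE = 0.15, 0.10  # $/GB-month, from the charges table

def monthly_storage_cost(data_gb, vm_image_gb):
    """Monthly cost of keeping a job's data on EBS and its VM image in S3."""
    return round(data_gb * EBS_RATE + vm_image_gb * S3_RATE, 2)

# Hypothetical sizes: 9.5 GB of Montage products on EBS, a 0.8 GB VM image.
cost = monthly_storage_cost(9.5, 0.8)  # 0.95 + 0.12
```

At these rates a month of storage quickly rivals the one-time transfer-out charge, which is the trade-off the "store on the cloud?" question is probing.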

… And the bottom line

Most cost-effective model?

15 Xeon 3.2-GHz dual-processor, dual-core Dell 2650 PowerEdge servers.

Aberdeen Technologies 6-TB staging disk farm.

Dell PowerVault MD1200 storage disks.

| Item              | Transfer In ($) | Store 2MASS ($) | IPAC Service ($) |
|-------------------|-----------------|-----------------|------------------|
| Transfer in       | 7,560           | 3,780           |                  |
| Store input data  | 17,100          | 61,500          | 13,200           |
| Processing        | 9,000           | 9,000           | 66,000           |
| Transfer data out | 25,560          | 25,560          |                  |
| Cost $/job        | 1.65            | 2.75            | 2.20             |

Assumes 1,000 2MASS mosaics of 4 deg sq centered on M17 per month for 3 years, using the c1.medium instance on Amazon EC2.
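The per-job figures appear to be each column's total spread over the assumed load of 1,000 mosaics/month for 36 months, i.e. 36,000 jobs. A sketch of that arithmetic (the 36,000-job divisor is an inference from the stated assumptions; it reproduces the slide's figures to within a few cents):

```python
JOBS = 1_000 * 36  # 1,000 mosaics/month for 3 years

# Column totals from the table above, in dollars over the 3 years.
scenarios = {
    "EC2, transfer input each run": 7_560 + 17_100 + 9_000 + 25_560,
    "EC2, store 2MASS in the cloud": 3_780 + 61_500 + 9_000 + 25_560,
    "IPAC local service": 13_200 + 66_000,
}

per_job = {name: total / JOBS for name, total in scenarios.items()}
```

Seen this way, the three models land within about a dollar per job of each other, which is the "no dramatic cost benefit" conclusion that follows.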
Conclusions

Clouds can be used effectively and fairly efficiently for scientific applications. The virtualization overhead is low.

The high-speed network and parallel file systems give HPC clusters a significant performance advantage over cloud computing for I/O-bound applications.

On Amazon EC2, the primary cost for Montage is data transfer; processing is the primary cost for Broadband and Epigenome.

Amazon EC2 offers no dramatic cost benefits over a locally mounted image-mosaic service.

Reference:

G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, and P. Maechling, "Scientific Workflow Applications on Amazon EC2," in Cloud Computing Workshop in Conjunction with e-Science, Oxford, UK: IEEE, 2009.