Technology Solutions In Data Intensive Astronomy

Nov 3, 2013








Approaches to Investigating Technology Solutions In Data Intensive Astronomy














G. Bruce Berriman

gbb@ipac.caltech.edu

NASA Exoplanet Science Institute,


Infrared Processing and Analysis Center, Caltech


Innovations in Data Intensive Astronomy, May 3-5 2011.

1

Developing A New Business Model For Astronomical Computing


Astronomy is already a data intensive science


Over 1 PB served electronically through data centers and archives.

Growing at 0.5 PB/yr, and accelerating.

Astro2010 recognized that future research will demand high performance computing on massive, distributed data sets.

High Performance/Massive Parallelization: Scalability

Current model for managing data unsustainable: universities hitting “power wall”




Learn how to unleash the power of new technologies


Learn how to write applications that take advantage of the technology


Learn how to develop innovative data discovery and access mechanisms.




2

Cloud Computing In A Nutshell

New model for purchasing resources: pay only for what you use.

Amazon EC2 front page: http://aws.amazon.com/ec2/

This looks cheap!

Commercial Providers

- Amazon.com EC2
- AT&T Synaptic Hosting
- GNi Dedicated Hosting
- IBM Computing on Demand
- Rackspace Cloud Servers
- Savvis Open Cloud
- ServePath GoGrid
- Skytap Virtual Lab
- 3Tera
- Unisys Secure
- Verizon Computing
- Zimory Gateway

Science Clouds

- FutureGrid
- NERSC Magellan
- NASA Nebula

3

“Little sins add up …”

… and that’s not all. You pay for:

- Transferring data into the cloud
- Transferring them back out again
- Storage while you are processing (or sitting idle)
- Storage of the VM and your own software
- Special services: virtual private cloud…

See Manav Gupta’s blog post http://manavg.wordpress.com/2010/12/01/amazon-ec2-costs-a-reality-check/

Annual Costs!

4

How Useful Is Cloud Computing For Scientific Workflow Applications?

The study was designed to answer the question: How useful is cloud computing for scientific workflow applications?

Workflow applications are loosely coupled applications in which the output files from one component become the input to the next.

There were three goals:

1. Conduct an experimental study of the performance of three workflows with different I/O, memory and CPU requirements on a commercial cloud.

2. Compare the performance of cloud resources with the performance of a typical High Performance Cluster (HPC). The cloud uses commodity hardware and virtualization, and HPCs use parallel file systems and fast networks.

3. Provide an analysis of the various costs associated with running workflows on a commercial cloud.

We chose Amazon EC2 as the cloud provider and the NCSA Abe cluster as a high-performance cluster.










1. Compare performance/cost of different resource configurations

2. Compare performance of grid and cloud

3. Characterize virtualization overhead

Scientific Workflow Applications on Amazon EC2. G. Juve et al. arxiv.org/abs/1005.2718

Data Sharing Options for Scientific Workflows on Amazon EC2. G. Juve et al. arxiv.org/abs/1010.4822












Loosely-coupled parallel applications

Many domains: astronomy, biology, earth science, others

Potentially very large: 10K tasks common, >1M not uncommon

Potentially data-intensive: 10GB common, >1TB not uncommon

Data communicated via files

Shared storage system, or network transfers required
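
The idea of a loosely coupled workflow, where each task is released once the files it depends on exist, can be sketched in a few lines of Python. This is only an illustration: the task names and the dictionary standing in for shared storage are hypothetical, and the standard-library `graphlib` module is used in place of a real workflow engine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the tasks whose output files it reads.
workflow = {
    "reproject_1": set(),
    "reproject_2": set(),
    "rectify_background": {"reproject_1", "reproject_2"},
    "coadd": {"rectify_background"},
}

def run_workflow(graph):
    """Release each task only after every task it depends on has finished."""
    files = {}    # stands in for the shared storage system
    order = []    # (task, input files) in execution order
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    while sorter.is_active():
        # Tasks returned together have no mutual dependencies,
        # so a real engine could dispatch them in parallel.
        for task in sorted(sorter.get_ready()):
            inputs = [files[dep] for dep in sorted(graph[task])]
            files[task] = f"{task}.out"   # the task "writes" its output file
            order.append((task, inputs))
            sorter.done(task)
    return order

order = run_workflow(workflow)
for task, inputs in order:
    print(task, "<-", inputs)
```

Because tasks only exchange files, the same graph can run on a single node, a cluster, or a cloud instance; only the storage layer and the dispatcher change.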

5

The Applications

Montage (http://montage.ipac.caltech.edu) creates science-grade image mosaics from multiple input images.

Broadband simulates and compares seismograms from earthquake simulation codes.

Epigenome maps short DNA segments collected using high-throughput gene sequencing machines to a reference genome.


Montage Workflow

[Figure: workflow stages, in order: Input, Reprojection, Background Rectification, Co-addition, Output]

6

Characteristics of Workflows

Resource Usage of the Three Workflow Applications

Workflow Specifications for this Study

7

Computing Resources

Processors and OS

Amazon offers a wide selection of processors.

Ran Red Hat Enterprise Linux with VMWare

c1.xlarge and abe.local are equivalent: used to estimate overhead due to virtualization

abe.lustre and abe.local differ only in file system

Networks and File Systems

HPC systems use high-performance networks and parallel file systems

Amazon EC2 uses commodity hardware

Ran all processes on single, multi-core nodes. Used local and parallel file system on Abe.




8

Execution Environment


Pegasus: workflow planner
- Maps tasks and data from abstract descriptions to executable resources
- Performance optimizer

DAGMan: workflow engine
- Tracks dependencies, releases tasks, retries tasks

Condor: task manager; schedules and dispatches tasks (and data) to resources

NCSA Abe: high-performance cluster.

Amazon EC2
- Amazon provides the resources.
- End-user must configure and manage them

9

Performance Results











Virtualization Overhead <10%


Large differences in performance between the resources and between the applications

The parallel file system on abe.lustre offers a big performance advantage of x3 for Montage

10

How Much Did It Cost?

Instance     Cost $/hr
m1.small     0.10
m1.large     0.40
m1.xlarge    0.80
c1.medium    0.20
c1.xlarge    0.80

Montage:

Clear trade-off between performance and cost.

Most powerful processor c1.xlarge offers 3x the performance of m1.small, but at 4x the cost.

Most cost-effective processor for Montage is c1.medium: 20% performance loss over m1.small, but 5x lower cost.


11

Data Transfer Costs

Operation       Cost $/GB
Transfer In     0.10
Transfer Out    0.17

Application    Input (GB)    Output (GB)    Logs (MB)
Montage        4.2           7.9            40
Broadband      4.1           0.16           5.5
Epigenome      1.8           0.3            3.3

Application    Input     Output    Logs      Total
Montage        $0.42     $1.32     <$0.01    $1.75
Broadband      $0.40     $0.03     <$0.01    $0.43
Epigenome      $0.18     $0.05     <$0.01    $0.23

Transfer Rates

Amazon charges different rates for transferring data into the cloud and back out again.

Transfer-out costs are the higher of the two.

Transfer Costs

For Montage, the cost to transfer data out of the cloud is higher than monthly storage and processing costs.

For Broadband and Epigenome, processing incurs the biggest costs.
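
As a rough check, the transfer costs can be recomputed from the quoted rates and data volumes. The sketch below is illustrative only: it assumes logs are billed at the transfer-out rate and that 1 GB = 1000 MB, neither of which is stated on the slides. Because the volumes in the table are rounded, the results agree with the published costs only to within a few cents.

```python
RATE_IN = 0.10    # $/GB, transfer into the cloud (from the rates table)
RATE_OUT = 0.17   # $/GB, transfer out of the cloud (from the rates table)

# (input GB, output GB, logs MB) per workflow, from the volumes table
VOLUMES = {
    "Montage":   (4.2, 7.9, 40),
    "Broadband": (4.1, 0.16, 5.5),
    "Epigenome": (1.8, 0.3, 3.3),
}

def transfer_cost(input_gb, output_gb, logs_mb):
    """Return (input, output, logs, total) transfer costs in dollars.

    Assumptions: logs are charged at the transfer-out rate, 1 GB = 1000 MB.
    """
    cost_in = input_gb * RATE_IN
    cost_out = output_gb * RATE_OUT
    cost_logs = logs_mb / 1000.0 * RATE_OUT
    return cost_in, cost_out, cost_logs, cost_in + cost_out + cost_logs

costs = {app: transfer_cost(*volumes) for app, volumes in VOLUMES.items()}
for app, (c_in, c_out, c_logs, total) in costs.items():
    print(f"{app:10s} in ${c_in:.2f}  out ${c_out:.2f}  logs ${c_logs:.3f}  total ${total:.2f}")
```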

12

Storage Costs

Storage Rates

Item                                  Charges $
Storage of VM’s in local disk (S3)    0.15/GB-Month
Storage of data in EBS disk           0.10/GB-Month

Data Storage Charges

Amazon charges for storing Virtual Machines (VM) and user’s applications in local disk

It also charges for storing data in persistent network-attached Elastic Block Storage (EBS).

Storage Volumes

Storage Costs

Montage Storage Costs Exceed Most Cost-Effective Processor Costs


13

The bottom line for Montage

Item                Best Value (c1.medium)    Best Performance (c1.xlarge)
Transfer Data In    $ 0.42                    $ 0.42
Processing          $ 0.55                    $ 2.45
Storage/month       $ 1.07                    $ 1.07
Transfer Out        $ 1.32                    $ 1.32
Totals              $ 3.36                    $ 5.26

4.5x the processor cost for 20% better performance
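
The totals and the processor-cost ratio follow directly from the line items in the table. A minimal sketch that recomputes them (the dictionary keys are illustrative names, not terms from the study):

```python
# Line items in dollars, copied from the table above.
costs = {
    "c1.medium": {"transfer_in": 0.42, "processing": 0.55,
                  "storage_month": 1.07, "transfer_out": 1.32},
    "c1.xlarge": {"transfer_in": 0.42, "processing": 2.45,
                  "storage_month": 1.07, "transfer_out": 1.32},
}

# Total cost of one Montage run, including one month of storage.
totals = {name: round(sum(items.values()), 2) for name, items in costs.items()}

# Ratio of processing costs between the two instance types.
processing_ratio = costs["c1.xlarge"]["processing"] / costs["c1.medium"]["processing"]

# Share of the best-value total that is not processing (transfer + storage).
overhead_share = 1 - costs["c1.medium"]["processing"] / totals["c1.medium"]

print(totals)
print(f"processing ratio: {processing_ratio:.2f}x")
print(f"non-processing share (c1.medium): {overhead_share:.0%}")
```

Only the processing line differs between the two columns, so the choice of instance changes the total by much less than the 4.5x processor ratio suggests.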

14

Just To Keep It Interesting …

Running the Montage Workflow With Different File Storage Systems

Cost and performance vary widely with different types of file storage, depending on how the storage architecture handles lots of small files

Cf. Epigenome

15

Cost-Effective Mosaic Service

Local Option

Amazon EBS Option

Amazon S3 Options

Amazon cost is 2X local!

- 2MASS image data set
- 1,000 x 4 square degree mosaics/month

16

When Should I Use The Cloud?



The answer is: it depends on your application and use case.

Recommended best practice: Perform a cost-benefit analysis to identify the most cost-effective processing and data storage strategy. Tools to support this would be beneficial.

Amazon offers the best value
- For compute- and memory-bound applications.
- For one-time bulk-processing tasks, providing excess capacity under load, and running test-beds.

Parallel file systems and high-speed networks offer the best performance for I/O-bound applications.

Mass storage is very expensive on Amazon EC2

17

Periodograms and the Search for Exoplanets


What is a periodogram?
- Calculates the significance of different frequencies in time-series data to identify periodic signals.
- Powerful tool in the search for exoplanets

NStED Periodogram tool
- Computes periodograms using 3 algorithms: Box Least Squares, Lomb-Scargle, Plavchan
- Fast, portable implementation in C
- Easily scalable: each frequency sampled independently of all other frequencies
- Implemented at NStED on a 128-node cluster.
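
Since each trial frequency is evaluated independently, a periodogram is straightforward to split across workers. The sketch below is not the NStED code (which is written in C); it is a minimal pure-Python version of the classical Lomb-Scargle power, run on a made-up light curve. A thread pool is used only to show the independent-task structure; real speed-ups would need processes or a cluster.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def lomb_scargle_power(freq, times, values):
    """Classical Lomb-Scargle power at one trial frequency (cycles per time unit)."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]
    w = 2.0 * math.pi * freq
    # The time offset tau makes the sine and cosine terms orthogonal.
    tau = math.atan2(sum(math.sin(2 * w * t) for t in times),
                     sum(math.cos(2 * w * t) for t in times)) / (2 * w)
    c = [math.cos(w * (t - tau)) for t in times]
    s = [math.sin(w * (t - tau)) for t in times]
    yc = sum(a * b for a, b in zip(y, c))
    ys = sum(a * b for a, b in zip(y, s))
    cc = sum(a * a for a in c)
    ss = sum(a * a for a in s)
    return 0.5 * (yc * yc / cc + ys * ys / ss)

def periodogram(freqs, times, values, workers=4):
    """Evaluate every trial frequency as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: lomb_scargle_power(f, times, values), freqs))

# Made-up, unevenly sampled light curve containing a 0.25 cycles/day signal.
times = [0.37 * i + 0.11 * math.sin(i) for i in range(200)]
values = [math.sin(2 * math.pi * 0.25 * t) for t in times]
freqs = [0.01 * k for k in range(1, 101)]   # 0.01 to 1.00 cycles/day

power = periodogram(freqs, times, values)
best = freqs[power.index(max(power))]
print(f"strongest frequency: {best:.2f} cycles/day")
```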


The Application of Cloud Computing to Astronomy: A Study of Cost and Performance. Berriman et al. 2010. http://arxiv.org/abs/1006.4860

http://nsted.ipac.caltech.edu/periodogram/cgi-bin/Periodogram/nph-simpleupload


18

Kepler Periodogram Atlas


Compute periodogram atlas for public Kepler dataset

~200K light curves X 3 algorithms X 3 parameter sets

Each parameter set was a different “Run”, 3 runs total

Use 128 processor cores in parallel
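
The scale of the atlas follows from the figures above. A short worked calculation, treating the "~200K" light curves as exactly 200,000 for illustration and assuming the work is spread evenly over the cores:

```python
import math

light_curves = 200_000   # "~200K" on the slide, so an approximate count
algorithms = 3
parameter_sets = 3       # one run per parameter set
cores = 128

periodograms_per_run = light_curves * algorithms
periodograms_total = periodograms_per_run * parameter_sets
# Assumes an even split; real runtimes vary per light curve and algorithm.
periodograms_per_core = math.ceil(periodograms_total / cores)

print(periodograms_per_run, periodograms_total, periodograms_per_core)
```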

Estimated cost: Compute is ~10X Transfer

19

Should We All Move To The Cloud?

“The Canadian Advanced Network For Astronomical Research (CANFAR) is an operational system for the delivery, processing, storage, analysis, and distribution of very large astronomical datasets. The goal of CANFAR is to support large Canadian astronomy projects.”

20

GPU’s In Astronomy


GPU invented to accelerate building of images in a frame buffer as an output on a display device.

Consist of many floating point processor cores

Highly parallel structure makes them attractive for processing huge blocks of data in parallel.

In early days, apps had to look like video apps, but there are now frameworks to support application development: CUDA, Open GL


21

What Types of Applications Do We Run on GPU’s?

Barsdell, Barnes and Fluke (2010) have analyzed astronomy algorithms to understand which types are best suited to running on GPU’s. (arxiv.org/abs/1007.1660)

Can be parallelized into many fine-grained elements.

Neighboring threads access similar locations in memory.

Minimize neighboring threads that execute different instructions.

Have high arithmetic intensity

Avoid host-device memory transfers
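
Arithmetic intensity, the number of floating-point operations per byte moved to or from memory, can be estimated on paper before any GPU code is written. The sketch below compares two textbook cases; the operation counts (one add per element, 20 operations per pairwise interaction) and the assumption of ideal data reuse are illustrative choices, not figures from the cited paper.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations per byte of memory traffic."""
    return flops / bytes_moved

def vector_add(n, word=4):
    # One add per element; two operands read and one result written,
    # each a single-precision word.
    return arithmetic_intensity(n, 3 * n * word)

def direct_nbody(n, flops_per_pair=20, word=4):
    # Every body interacts with every other body (assumed 20 flops per pair).
    # With ideal reuse, each body is read once (3 position words + 1 mass word)
    # and its 3-word acceleration is written once.
    return arithmetic_intensity(flops_per_pair * n * (n - 1), n * 7 * word)

n = 1024
low = vector_add(n)
high = direct_nbody(n)
print(f"vector add: {low:.3f} flops/byte, direct N-body: {high:.1f} flops/byte")
```

Under these assumptions the vector add is limited by memory bandwidth, while the pairwise sum keeps the arithmetic units busy, which is why brute-force algorithms of the second kind tend to map well onto GPUs.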

“CPU’s handle complexity, GPU’s handle concurrency”

22

“Critical Decisions For Early Adopters”


Title of a paper by Fluke et al (2010) on Astrophysical Supercomputing with GPU’s. (arxiv.org/abs/1008.4623)

Suggest brute-force parallelization may be highly competitive with algorithmic complexity.

Development times can be reduced with brute-force approach.

GPU’s support single precision calculations, but astronomy often needs double precision.

Need to understand architecture to get speed-ups of x100

Speeds quoted are for graphics-like calculations

Code profiling will very likely help code optimization

23

What Have We Learned About “Next Generation” Code?


Downloaded 5,000 times with wide applicability in astronomy and computer science.

Simple to build.

Written in ANSI-C for performance and portability.

Portable to all flavors of *nix






Developed as a component-based toolkit for flexibility.

Environment agnostic

Naturally “data parallel”

Technology Agnostic: Supports tools such as Pegasus, MPI, .. Same code runs on all platforms.

24

Applications of Montage: Science Analysis


Desktop research tool: astronomers now sharing their scripts

Incorporation into pipelines to generate products or perform QA.
- Spitzer Space Telescope Legacy teams
- Cosmic Background Imager
- ALFALFA
- BOLOCAM

1,500-square-degree equal-area Aitoff projection mosaic of HI observed with the ALFALFA survey near the North Galactic Pole (NGP). Dr Brian Kent

25

Applications of Montage: Computational Infrastructure

Task scheduling in distributed environments (performance focused)

Designing job schedulers for the grid

Designing fault tolerance techniques for job schedulers

Exploring issues of data provenance in scientific workflows

Exploring the cost of scientific applications running on Clouds

Developing high-performance workflow restructuring techniques

Developing application performance frameworks

Developing workflow orchestration techniques

List kindly provided by Dr. Ewa Deelman

26

What Are The Next Steps?


Greater recognition of the role of software engineering


Provide career-paths for IT professionals.

Next generation software skills should be a mandatory part of graduate education.

An on-line journal devoted to computational techniques in astronomy.

Share computational knowledge from different fields and take advantage of it.


27

A U.S. Software Sustainability Institute: A Brain Trust For Software

“A US Software Infrastructure Institute that provides a national center of excellence for community based software architecture, design and production; expertise and services in support of software life cycle practices; marketing, documentation and networking services; and transformative workforce development activities.”

Report from the Workshops on Distributed Computing, Multidisciplinary Science, and the NSF’s Scientific Software Innovation Institutes Program. Miron Livny, Ian Foster, Ruth Pordes, Scott Koranda, JP Navarro. August 2011

28

U.K. Software Sustainability Institute

http://www.software.ac.uk

- Nuclear Fusion: Culham Centre for Fusion Energy
- Pharmacology: DMACRYS
- Climate change: Enhancing Community Integrated Assessment
- Geospatial Information: Geospatial transformations with OGSA-DAI
- Scottish Brain Imaging Research Centre
- Keeping up to date with research

29

The Moderate Resolution Imaging Spectroradiometer
(MODIS)


Science products created by aggregating calibrated products in various bands

Calibrated data kept for 30-60 days (size) and so:

MODIS maintains a virtual archive of the provenance of the data and processing history that enables reproduction of any science product

Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance, Berriman et al. arxiv.org/abs/1006.4860

Global Surface Reflectance and Sea Surface Temperature

Global Vegetation Index

Scans Earth every 2 days in 36 bands

30

What Are The Next Steps?


The VAO can play a big role in providing sharable, scalable software for the community.

From the VAO’s Expected Outcomes:

“The VAO’s services and libraries, developed to respond to the growing scale and complexity of modern data sets, will be indispensable tools for astronomers integrating data sets and creating new data sets.”

“The VAO will collaborate and cooperate with missions, observatories and new projects, who will be able to routinely integrate VAO libraries into their processing environments to simplify and accelerate the development and dissemination of new data products.”

- VAO Program Execution Plan, version 1.1 (Nov 2010)

31

VAO Inventory: R-tree Indexing


Fast searches over very large and distributed data sets

Performance scales as log(N)

Performance gain of x1000 over table scan

Used in Spitzer and WISE image archives
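
The reason a tree of bounding rectangles beats a table scan can be seen in a toy example. The sketch below is a simplified, bulk-loaded tree in the spirit of an R-tree, not the VAO implementation; the footprint catalogue, the fan-out of 8, and all function names are made up for illustration.

```python
def chunks(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def intersects(a, b):
    """Boxes are (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def bounding_box(boxes):
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def build(items, fanout=8):
    """items: non-empty list of (id, box). Returns the root node."""
    ordered = sorted(items, key=lambda it: (it[1][0], it[1][1]))
    # Leaves hold the items; each upper level groups the level below it.
    level = [{"box": bounding_box([b for _, b in group]), "items": group}
             for group in chunks(ordered, fanout)]
    while len(level) > 1:
        level = [{"box": bounding_box([n["box"] for n in group]), "children": group}
                 for group in chunks(level, fanout)]
    return level[0]

def search(node, query, stats):
    """Return ids whose box overlaps query; stats['tests'] counts box comparisons."""
    stats["tests"] += 1
    if not intersects(node["box"], query):
        return []          # prune the whole subtree with one comparison
    if "items" in node:
        stats["tests"] += len(node["items"])
        return [i for i, b in node["items"] if intersects(b, query)]
    hits = []
    for child in node["children"]:
        hits.extend(search(child, query, stats))
    return hits

# Made-up catalogue: a 64 x 64 grid of unit-square image footprints.
footprints = [(f"img_{x}_{y}", (x, y, x + 1, y + 1))
              for x in range(64) for y in range(64)]
root = build(footprints)
stats = {"tests": 0}
query = (10.2, 20.2, 10.8, 20.8)
hits = search(root, query, stats)
scan = [i for i, b in footprints if intersects(b, query)]
print(hits, stats["tests"], "box tests vs", len(footprints), "for a full scan")
```

Each level discards most of the catalogue with a handful of comparisons, so the number of tests grows with the depth of the tree rather than with the number of footprints.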





Memory-mapped files: Segment of virtual memory is assigned a byte for byte correlation with part of a file.

Parallelization / cluster processing

REST-based web services
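
A memory-mapped file lets a program address file contents as if they were an in-memory buffer, with the operating system paging in only the parts that are touched. A minimal sketch using Python's standard `mmap` module; the file name and its contents are made up for illustration.

```python
import mmap
import os
import struct
import tempfile

# Made-up binary column: 1000 little-endian float64 values 0.0, 0.5, 1.0, ...
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000d", *(0.5 * i for i in range(1000))))

def read_value(path, index):
    """Fetch one float64 by mapping the file rather than reading it all."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Byte offsets in the mapping correspond one-to-one to file offsets.
            return struct.unpack_from("<d", view, index * 8)[0]

value = read_value(path, 750)
print(value)
```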

32


Where Can I Learn More?

Scientific Workflow Applications on Amazon EC2. G. Juve et al. Cloud Computing Workshop in Conjunction with e-Science 2009 (Oxford, UK). http://arxiv.org/abs/1005.2718

Data Sharing Options for Scientific Workflows on Amazon EC2. G. Juve et al. Proceedings of Supercomputing 10 (SC10), 2010. http://arxiv.org/abs/1010.4822

The Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance. G. B. Berriman et al. SPIE Conference 7740: Software and Cyberinfrastructure for Astronomy. 2010. http://arxiv.org/abs/1006.4860

The Application of Cloud Computing to Astronomy: A Study of Cost and Performance. G. B. Berriman et al. 2010. Proceedings of “e-Science in Astronomy” Workshop. Brisbane. http://arxiv.org/abs/1006.4860

Astrophysical Supercomputing with GPUs: Critical Decisions for Early Adopters. Fluke et al. 2011. PASA Submitted. http://arxiv.org/abs/1008.4623

Analysing Astronomy Algorithms for GPUs and Beyond. Barsdell, Barnes and Fluke. 2010. Submitted to MNRAS. http://arxiv.org/abs/1007.1660

Bruce Berriman’s blog, “Astronomy Computing Today,” at http://astrocompute.wordpress.com

33