
Some Thoughts on Scientific
Applications of Cloud Computing

Rick Stevens

Argonne National Laboratory

University of Chicago

Supercomputing & Cloud Computing


Two macro strategies dominate large-scale (intentional) computing infrastructures


Supercomputing-type structures

Large-scale integrated, coherent systems

Managed for high utilization and efficiency

Emerging cloud-type structures

Large-scale, loosely coupled, lightly integrated

Managed for availability, throughput, reliability

My (limited but so far happy) experience with cloud-like stuff


EC2, S3, EBS


I've been playing with the Amazon AWS offering; in fact, I have a couple of images running right now.


Elasticfox and Spandexfox have made this pretty fun stuff


Eucalyptus, Nimbus, etc.


Web services interfaces to biological databases (KEGG, SEED, etc.)

ec2-75-101-195-159.compute-1.amazonaws.com

How should we think about the
cloud opportunities?


Virtual zoo of systems?


Replacements for Clusters?


Extensions to existing systems
and infrastructure?


Surge capacity?


Edge data systems?


Opportunity to go "hardwareless" when designing new systems and services?


Infinitely scalable genome
annotation?

The Virtual Zoo


Access to a diverse image library provides an inexpensive mechanism to test applications and services on a variety of OS configurations without having to build all of them.


Leverages virtualization and community images

Leverages the "cloud" when scale is important



Using cloud for scalability testing could be
interesting when you have servers you want
to stress and test, but limited time and
resources


Creating hundreds of running instances is
relatively easy and could be done by a few
people in less than a day


Automation of the scalability testing could be
easily accomplished
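A hedged sketch of what that automation might look like: fan out many launch calls in parallel from one script. The `launch_instance` callable is an assumption standing in for whatever wraps the provider's run-instances API (here it is stubbed out so the sketch runs standalone):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def launch_fleet(launch_instance, image_id, count, max_parallel=32):
    """Launch `count` instances of `image_id` concurrently and
    return the list of instance identifiers that came back."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(launch_instance, image_id) for _ in range(count)]
        return [f.result() for f in futures]

# Stub launcher standing in for a real cloud API call (e.g. EC2 RunInstances).
_ids = itertools.count(1)
def fake_launch(image_id):
    return f"i-{image_id}-{next(_ids):04d}"

instance_ids = launch_fleet(fake_launch, "ami-stress-test", count=100)
print(len(instance_ids))  # 100 "instances" started
```

With a real launcher plugged in, the same loop that starts the fleet can drive the stress test and then tear everything down, which is why a few people can do this in less than a day.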

As Replacements for Clusters?


There have been several experiments creating virtual clusters in EC2, and probably in other environments as well [Peter Skomoroch, et al].

These "soft" clusters are interesting: constructed on demand and then torn down when the application run is complete.


It might be possible to integrate virtual clusters into
existing Linux cluster queues such that jobs that are
queued for a physical cluster could be dispatched to a
local cluster or a cloud based virtual cluster for
execution.
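The dispatch decision such an integrated queue would have to make can be sketched in a few lines. Nothing here is from an actual scheduler; the function name and thresholds are illustrative assumptions:

```python
def choose_backend(job_cores, free_local_cores, local_queue_wait_s,
                   max_acceptable_wait_s=3600):
    """Decide where a queued job should run.

    Prefer the local physical cluster when it can start the job
    soon; fall back to a cloud-based virtual cluster when the
    local queue is full or the expected wait is too long.
    """
    if job_cores <= free_local_cores:
        return "local"            # starts immediately on-site
    if local_queue_wait_s <= max_acceptable_wait_s:
        return "local"            # short wait is cheaper than cloud
    return "cloud"                # spin up a virtual cluster instead

print(choose_backend(64, 128, 0))       # local: fits right now
print(choose_backend(256, 0, 7200))     # cloud: two-hour local wait
```

For throughput jobs the "cloud" branch is especially attractive, since latency to start matters less than aggregate cycles delivered.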


In fact, for throughput jobs this might be even more effective.


Local facilities that start supporting image-based scheduling services would lead in this transition (i.e., you submit your job as one or more images rather than scripts or executables)


The economics for cloud clusters start to make sense if clusters are not fully utilized (as many in universities are not).


Cloud hosting for clusters provides one easy way to implement cycle banking, since each application determines its own operating environment and overheads are relatively low


This would ideally be implemented as a distributed resource if physical ownership were important

Virtual ownership would make it much easier and more robust to implement

Seamless extensions


As in the previous example, seamlessly extending an existing queue could be one way to integrate clouds with existing services and systems.


But we can imagine others.


How about using the cloud as a giant impedance matcher between geographically distributed systems of large-scale sensors and tightly coupled data analysis environments?


The idea is simple.

Surge Capacity


Power companies have peakers.


Typically natural gas powered turbines used
during times of peak demand for power.


Clouds can be used for surge capacity for
groups that have variable demands for
access to compute cycles or server/service
cycles


Sensor + Cloud + Supercomputer =
Next Generation Simulations


Imagine thousands (or millions) of
distributed sensors deployed over
the globe each generating data in
some asynchronous fashion.


Each sensor updates data structures in the cloud via local internet connections. The cloud is ubiquitous, secure enough, reliable, etc.; it scales to the size of the sensor network and acts as an impedance matcher.


Periodically, harvesting processes (in the cloud, say) wake up and organize the datasets into a form that can be downloaded coherently to a supercomputer for data assimilation into a large-scale parallel simulation.
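The buffer-and-harvest pattern described above can be sketched in a few lines. The class and names are illustrative assumptions, not any real cloud API; a thread-safe queue stands in for the cloud data store:

```python
import queue

class CloudBuffer:
    """Toy impedance matcher: absorbs asynchronous sensor updates
    and hands them out as coherent batches for the supercomputer."""
    def __init__(self):
        self._q = queue.Queue()  # thread-safe, standing in for cloud storage

    def sensor_update(self, sensor_id, value):
        # Called independently by each sensor, at its own rate.
        self._q.put((sensor_id, value))

    def harvest(self):
        # Periodic harvesting process: drain everything currently
        # buffered into one dataset, keyed by sensor.
        batch = {}
        while not self._q.empty():
            sensor_id, value = self._q.get()
            batch.setdefault(sensor_id, []).append(value)
        return batch

buf = CloudBuffer()
for t in range(3):
    buf.sensor_update("temp-berlin", 20 + t)
buf.sensor_update("temp-oslo", 5)
snapshot = buf.harvest()
print(snapshot)  # {'temp-berlin': [20, 21, 22], 'temp-oslo': [5]}
```

The point of the pattern: sensors write whenever they like, while the supercomputer only ever sees organized, coherent snapshots.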

Going Hardwareless


Need: 24x7 access to flexibly
configured hardware, scalable data
infrastructure, and customized
operating environment


1,000 cores × $0.10/hour × 8,760 hours/year × 3 years = $2.6M

1,000 cores × $390/core + 3 × $43,800 power + 3 × $200K + 3 × $100K = $1.4M


In my example, if cluster utilization is < 53% then it is cheaper to go "hardwareless" at current retail prices
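The arithmetic above can be checked directly. A minimal sketch; the slide does not say what the $200K and $100K per-year line items cover, so the comments below are guesses:

```python
HOURS_PER_YEAR = 8760
YEARS = 3
CORES = 1000

# Cloud: pay per core-hour, so cost scales with utilization.
cloud_rate = 0.10  # $/core-hour retail
cloud_full = CORES * cloud_rate * HOURS_PER_YEAR * YEARS   # $2.628M at 100%

# Owned cluster: roughly fixed cost regardless of utilization.
owned = (CORES * 390            # hardware, $/core
         + YEARS * 43_800       # power, per year
         + YEARS * 200_000      # (assumed) staff/admin, per year
         + YEARS * 100_000)     # (assumed) space/cooling, per year
# = $1.4214M

# Owning wins only above this utilization level.
break_even = owned / cloud_full
print(f"break-even utilization = {break_even:.0%}")
```

This comes out to about 54% utilization, in line with the ~53% figure quoted above (the small gap is rounding in the slide's $2.6M/$1.4M totals).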

Enterprise vs Science Issues


Many companies are moving to large-scale server consolidation and virtualization


Typical business servers are about 5% utilized (web and database servers)


When they virtualize they can often consolidate 20:1, and it's a big deal.


What is the equivalent value proposition for
science applications?

Test Case we are Porting to the Cloud


The RAST genome annotation service


Currently runs on hardware in our lab


Front end (16 core Linux cluster)


Web interface, queue management, analysis
interface, and rapid propagation


Backend (40-node cluster) supporting three services (similarities, blasting, seed)


GenBank

Now bigger than 100,000,000,000 bases

240,000 named organisms

More than 60 million records

[Figure: GenBank growth chart: http://woldlab.caltech.edu/biohub/scipy2006/genbankgrowth.jpg]

www.nmpdr.org

RAST Challenge


provide framework for the annotation of
1000+ genomes


in addition to BLAST, use most trusted
technology (“Clusters”)


lower computational cost per genome by pre-computing


ensure “identical” proteins are annotated the
same


provide free and easy access to everyone


RAST workflow -- Rapid propagation

[Workflow diagram; labels: Upload; Predict ORFs; ORFs; glimmer2 training set; Create gene model; Master CDS set; Create CDS set; Find universal genes via BLAST; small gene set; phylogenetic neighbors; Master FIGfam set; Find candidate protein; Assign functions; Extract functions; Annotations]
New paradigm for annotation:

reverse search: search for the proteins for a given predicted function
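That reverse search is essentially an inverted index from function to proteins. A minimal sketch; the SEED-style protein IDs and function strings below are illustrative examples, not taken from the slides:

```python
from collections import defaultdict

def build_reverse_index(annotations):
    """Invert a (protein, function) table into a
    function -> [proteins] index, so a predicted function
    can be looked up directly ("reverse search")."""
    index = defaultdict(list)
    for protein, function in annotations:
        index[function].append(protein)
    return index

annotations = [
    ("fig|83333.1.peg.1", "Thr operon leader peptide"),
    ("fig|83333.1.peg.2", "Aspartokinase"),
    ("fig|83333.1.peg.7", "Aspartokinase"),
]
idx = build_reverse_index(annotations)
print(idx["Aspartokinase"])  # ['fig|83333.1.peg.2', 'fig|83333.1.peg.7']
```

Pre-computing this index once is what lets "identical" proteins be annotated the same way without re-running BLAST per genome.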


RAST Server workflow (2)



Master functions set via protein load of neighboring genomes



Master CDS set via neighbor-trained glimmer2



Map functions from neighbors via FIGfams



BLAST search for the rest


Some issues we are facing


The minimal RAST instance needs four servers


Web submission and data analysis interface


Similarity Server with a 500GB database


Propagation Server with a 100GB database


Blast Server with a 200GB database


Each database is updated daily


We want to launch an instance for each job
submitted


The critical issue is data access, updates, replication
and sharing


Code is also updated every few days so all the
images have to be rebuilt often.


Requirements for Science?


Ability to easily import and export images.


Ability to queue a large number of image-based jobs.


Ability to create "clusters" with affinity for high-performance communication and I/O.


Ability to update and snapshot collections of related images
easily.


Parallel operations across sets of images or storage objects.

Shared EBS equivalents across sets of images.


High-bandwidth upload/download to S3 and EBS equivalents.


Ability to incrementally update large (TB) class datasets.


Tools for lightweight access to cloud based storage.


Tools to integrate cloud queues with existing production queues.


Conclusions


The emerging concept of the cloud is pretty cool.


The existing available “retail” models are hugely empowering,
since they require only a credit card to get going.


Ease of use is being tackled; a market is developing for images and value-added services.


Clouds feel like the next thing that will have traction and will enable "hardwareless" ventures.


Scientific applications will not drive clouds, but will benefit from
their widespread adoption.


It is a disruptive technology in many ways, and the university/agency shift will take some time; hence the private sector will likely get significantly ahead.


Many groups should be experimenting and it really is pretty
cheap to gain the critical experience to figure out interesting
things to try.