Scientic Computing in the Cloud with
Google App Engine
master thesis in computer science
Michael Sperk
submitted to the Faculty of Mathematics,Computer
Science and Physics of the University of Innsbruck
in partial fulllment of the requirements
for the degree of Master of Science
supervisor:Prof.Dr.Radu Prodan,Institute of Computer
Innsbruck,17 January 2011
Certicate of authorship/originality
I certify that the work in this thesis has not previously been submitted for a
degree nor has it been submitted as part of requirements for a degree except as
fully acknowledged within the text.
I also certify that the thesis has been written by me. Any help that I have
received in my research work and the preparation of the thesis itself has been
acknowledged. In addition, I certify that all information sources and literature
used are indicated in the thesis.
Michael Sperk, Innsbruck, 17 January 2011
Cloud Computing as a computing paradigm has recently emerged as a topic of high
research interest. It has become an attractive alternative to traditional computing
environments, especially for smaller research groups that cannot afford expensive
infrastructure. Most of the research regarding scientific computing in the
cloud, however, has focused on IaaS cloud providers. Google App Engine is a PaaS
cloud framework dedicated to the development of scalable web applications.
The focus of this thesis is to investigate App Engine's capabilities in terms of
scientific computing. Moreover, algorithm properties that are well suited for
execution on Google App Engine, as well as potential problems and bottlenecks,
are identified.
0.1 Architecture
0.1.1 The Runtime Environment
0.1.2 The Datastore
0.1.3 Scalable Services
0.1.4 The App Engine Development Server
0.1.5 Quotas and Limits
0.2 HTTP Requests
0.2.1 The HTTP Protocol
0.2.2 Apache HTTP Components
0.2.3 Entity Compression
0.3 Slave Types
0.3.1 App Engine Slaves
0.3.2 Local Slaves
0.3.3 Comparison of Slave Types
0.4 The Master Application
0.4.1 Architecture
0.4.2 Generating Jobs
0.4.3 Job Mapping
0.4.4 Fault Tolerance
0.5 The Slave Application
0.5.1 WorkJobs
0.5.2 Results
0.5.3 Message Headers
0.6 Shared Data Management
0.6.1 Data Splitting
0.6.2 Data Transfer Strategy
0.6.3 Performance Evaluation
0.7 Monte Carlo Routines
0.7.1 Pi Approximation
0.7.2 Integration
0.8 Matrix Multiplication
0.8.1 Algorithm
0.8.2 Parallelization
0.8.3 Implementation
0.9 Mandelbrot Set
0.9.1 Algorithm
0.9.2 Parallelization
0.9.3 Implementation
0.10 Rank Sort
0.10.1 Algorithm
0.10.2 Parallelization
0.10.3 Implementation
0.10.4 Hardware and Experimental Setup
0.11 Analyzing Google App Engine Performance
0.11.1 Latency Analysis
0.11.2 Bandwidth
0.11.3 Java Performance Analysis
0.11.4 JIT Compilation
0.11.5 Cache Hierarchy
0.12 Speedup Analysis
0.12.1 Pi Approximation
0.12.2 Matrix
0.12.3 Rank Sort
0.12.4 Mandelbrot Set
0.13 Scalability
0.14 Resource Consumption and Cost Estimation
List of Figures
List of Tables
Bibliography
In the last few years a new paradigm for handling computing resources called
Cloud Computing has emerged. The basic idea is that resources, software and data
are provided as on-demand services over the Internet [20]. The actual technology
used in the cloud is abstracted from the user, so all the administrative tasks
are shifted to the service provider. Moreover, the provider deals with problems
such as load balancing and scalability. Typically, resource virtualization is used
to deal with these problems. Cloud Computing provides a flexible and
cost-efficient alternative to local management of compute resources. Payment is
typically done on a per-use basis, so the user only pays for the resources that
were actually consumed.
Cloud services can be classified into three categories by the level of abstraction
of the service [20]:
1. Infrastructure as a Service (IaaS): IaaS provides only basic storage,
network and computing resources. The user does not manage or control
the underlying cloud infrastructure, but can deploy and execute arbitrary
software including operating system and applications.
2. Platform as a Service (PaaS): PaaS provides a platform for executing
consumer-created applications, developed using programming languages
and tools provided by the producer. The user does not manage the
underlying cloud infrastructure, storage or the operating system, but has
control over the deployed applications.
3. Software as a Service (SaaS): SaaS provides the use of applications
developed by the producer running on the cloud infrastructure through
a thin client, typically a web browser. The user has no control over the
infrastructure, the operating system or the software capabilities.
Cloud computing has recently become an appealing alternative for research
groups to buying and maintaining expensive computing clusters. Most work on
scientific computing in the cloud has focused on IaaS clouds such as Amazon EC2 [1],
because arbitrary software can be executed, which makes the process of porting
existing scientific programs a lot easier. Moreover, there are no restrictions in
terms of operating system or programming language.
Google App Engine is a PaaS cloud service especially dedicated to scalable
web applications [5]. It mainly targets smaller companies that cannot afford
the infrastructure to handle a large number of requests or sudden traffic peaks.
App Engine provides a framework for developing servlet-based web applications
using Python or Java as the programming language. Applications are deployed to
Google's server infrastructure.
Each application can consume resources such as CPU time, number of requests
or used bandwidth. Google grants a certain amount of free daily resources to
each application. If billing is enabled, the user pays for resource usage that
surpasses the free limits; otherwise the web application becomes unavailable once
critical resources are depleted. This makes the service especially interesting
for scientific computing, since an automated program could just use up the
given daily free resources and pause computation until resources are replenished.
Moreover, each member of the research group can provide their own account with
separate resources.
The problem, though, is that the framework is very restrictive in terms of
programming language, libraries and many other aspects. This makes its use for
scientific computing more difficult than on common IaaS cloud platforms.
The focus of this thesis is to explore the capabilities of the Google App Engine
framework in terms of scientific computing. The goal is to build a simple
framework for utilizing App Engine servers for parallel scientific computations.
Subsequently, a few exemplary algorithms are implemented and analyzed
in order to identify algorithm properties that might be well suited for execution
in a PaaS cloud environment. Moreover, potential problems and bottlenecks that
arise are analyzed as well.
The thesis is structured in four main parts. The first introduces the App Engine
framework and the parts of the API that will be used. It is followed by a description
of the Distribution Framework that was developed in the course of the thesis
and a basic introduction to the algorithms that were implemented to test the
system. Finally, the experimental results obtained by testing the Distribution
Framework under practical circumstances are presented.
Google App Engine
Google App Engine is a Cloud service for hosting Web applications on Google's
large-scale server infrastructure. However, it provides a whole framework for
building scalable Web applications rather than plain access to hardware. As
more people access the application, App Engine automatically allocates and
manages additional resources.
The user never has to set up or maintain a server environment. In addition,
common problems such as load balancing, caching and traffic peaks are handled
by App Engine automatically.
The framework provides a certain amount of free resources, enough for smaller
applications. Google estimates that with the free resources an application can
handle about 5 million page views per month. If an application needs resources that
exceed the free quota limits, these are billed on a per-use basis. For example,
if an application is very computation heavy, only the additional CPU hours are
billed.
In this chapter the aspects of the App Engine framework relevant to this thesis
will be described. The descriptions in this chapter are mostly based on [22] and
the official online documentation [2].
0.1 Architecture
Figure 0.1 shows the request handling architecture of Google App Engine. Each
request is first inspected by the App Engine frontend. In fact, there are multiple
frontend machines and a load balancer that manages the proper distribution
of requests to the actual machines. The frontend determines to which application
the request is addressed by inspecting the domain name of the request.
In the next step the frontend reads the configuration of the corresponding
application. The configuration of an application determines how the frontend handles
a request, depending on the URL. The URL path can be mapped either to a
static file or to a request handler. Static files are typically images, Java scripts or
files. A request handler dynamically generates a response for the request, based
on application code. If no matching mapping is found in the configuration, the
frontend responds with an HTTP 404 "Not Found" error message.
Figure 0.1: Request handling architecture of Google App Engine, taken
from [22]
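The dispatch step described above can be pictured with a small model. The following is only a sketch: the class `FrontendRouter`, the enum `Target` and the first-matching-prefix rule are hypothetical names and assumptions made for illustration, not App Engine's actual implementation or configuration format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the frontend's URL dispatch; not App Engine code. */
final class FrontendRouter {
    enum Target { STATIC_FILE, REQUEST_HANDLER, NOT_FOUND }

    // Insertion order is kept so that the first matching prefix wins (an assumption).
    private final Map<String, Target> mappings = new LinkedHashMap<>();

    /** Registers a URL path prefix and the kind of server that should answer it. */
    void map(String pathPrefix, Target target) {
        mappings.put(pathPrefix, target);
    }

    /** Returns the target for a URL path, or NOT_FOUND (HTTP 404) if no mapping matches. */
    Target route(String path) {
        for (Map.Entry<String, Target> entry : mappings.entrySet()) {
            if (path.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return Target.NOT_FOUND;
    }

    public static void main(String[] args) {
        FrontendRouter router = new FrontendRouter();
        router.map("/images/", Target.STATIC_FILE);
        router.map("/compute", Target.REQUEST_HANDLER);
        System.out.println(router.route("/images/logo.png"));
        System.out.println(router.route("/compute"));
        System.out.println(router.route("/unknown"));
    }
}
```

The sketch only shows the decision itself; in the real system the mapping comes from the deployed application's configuration.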
Requests to static files are forwarded to the static file servers. The static file
servers are optimized for fast delivery of resources that do not change often.
Whether a file is static and should be stored on the static file servers is decided
at application deployment.
If the request is linked to a request handler, it is forwarded to the application
servers. One specific application server is chosen and an instance of the
application is started. If there is already an instance of the application running,
it can be reused, so typically servers already running an instance are preferred.
The appropriate request handler of the application is then invoked.
The strategies for load balancing and distributing requests to application servers
are still being optimized. However, the main goal is fast-responding request
handlers, in order to guarantee a high throughput of requests. How many instances
of an application are started at a time and how requests are distributed depends
on the application's traffic and resource usage patterns. Typically there are just
enough instances started at a time to handle the current traffic.
The application code itself runs in a runtime environment, an abstraction above
the operating system. This mechanism allows servers to manage resources such
as CPU cycles and memory for multiple applications running on the same server.
In addition, applications are prevented from interfering with one another.
The application server waits until the request handler terminates and returns the
response to the frontend, thus completing the request. Request handlers have to
terminate before returning data, therefore streaming of data is not possible. The
frontend then constructs the final response to the client. If the client indicates
that it supports compression by adding the "Accept-Encoding" request header,
data is automatically compressed using the gzip format.
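To give a feeling for what such response compression does to a payload, the following minimal sketch compresses and restores a byte array with the JDK's gzip streams. It assumes gzip is the negotiated encoding; the class `GzipSketch` and its methods are hypothetical helpers, and the code runs on the client or in a test, not inside the App Engine frontend.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper showing gzip compression of an HTTP entity body. */
final class GzipSketch {
    /** Compresses the given bytes with gzip. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the original bytes from gzip-compressed data. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A highly repetitive payload, which compresses well.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("0.0 0.0 0.0 0.0\n");
        }
        byte[] raw = text.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes raw, " + packed.length + " bytes compressed");
    }
}
```

How much is saved depends entirely on the data: repetitive numeric text shrinks considerably, while already compressed or random data does not.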
App Engine consists of three main parts: the runtime environment, the Datastore
and the scalable services. The runtime environment executes the code of
the application. The Datastore provides a way for developers to persist
data beyond requests. Finally, App Engine provides a couple of scalable services
typically useful to web applications. In the following, each of the parts will be
described briefly.
0.1.1 The Runtime Environment
As already mentioned, the application code runs in a runtime environment,
which is abstracted from the underlying operating system. This isolated
environment running the applications is called the sandbox. App Engine applications
can be programmed either in Python or in Java. As a consequence, each
programming language has its own runtime environment.
The Python runtime provides an optimized interpreter (at the time of this
writing Python 2.5.2 was the latest supported version). Besides the standard library,
a wide variety of useful libraries and frameworks for Python web application
development, such as Django, can be used.
The Java runtime follows the Java Servlet standards, providing the corresponding
APIs to the application. Common web application technologies such as
JavaServer Pages (JSP) are supported as well. The App Engine SDK supports
developing applications using Java version 5 or 6.
Though applications are typically developed in Java, in principle any language
with a compiler producing Java bytecode, such as JavaScript, Ruby or
Scala, can be used. This section will focus on the Java runtime.
The sandbox imposes several restrictions on applications:
1. Developers have limited access to the filesystem. Files deployed along
with the application can be read, however there is no write access to the
filesystem whatsoever.
2. Applications have no direct access to the network, though HTTP requests
can be performed through a service API.
3. In general no access to the underlying hardware or the operating system
is granted.
4. App Engine does not support domains without "www" such as
http://, because canonical name records are used for load balancing.
5. Usage of threads is not permitted.
6. Java applications can only use a limited set of classes from the standard
Java Runtime Environment, documented in [8].
Sandboxing on the one hand prevents applications from performing malicious
operations that could harm the stability of the server infrastructure or interfere
with other applications running on the same physical machine. On the other
hand, it enables App Engine to perform automatic load balancing, because it
does not matter on what underlying hardware or operating system the
application is executed. There is no guarantee that two requests will be executed
on the same machine, even if the requests arrive one after another and from the
same client. Multiple instances of the same or even of different applications can
run on the same machine without affecting one another.
The sandbox also limits resources such as CPU or memory use and can throttle
applications that use a particularly high amount of resources in order to protect
applications executed on the same machine. A single request has a maximum
of 30 seconds to terminate and respond to the client, although App Engine
is optimized for much shorter requests and may slow down an application that
consumes too many CPU cycles.
Since scientific applications are CPU intensive, these limitations imposed by the
runtime environment are problematic for such an application.
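One way a CPU-intensive request handler can live with the 30 second limit is to stop working before the deadline and report how far it got, so that the remainder can be submitted as a new request. The following is a minimal sketch of that idea, not the approach taken by the framework described later; the class `DeadlineAwareWorker`, the `Progress` type, the 25 second budget and the placeholder workload are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;

/** Hypothetical worker that stops before a per-request time budget is exhausted. */
final class DeadlineAwareWorker {
    /** Partial result plus the first index that was not processed. */
    static final class Progress {
        final double partialSum;
        final long nextIndex;

        Progress(double partialSum, long nextIndex) {
            this.partialSum = partialSum;
            this.nextIndex = nextIndex;
        }
    }

    private final LongSupplier clockMillis; // injectable clock, which makes the loop testable
    private final long budgetMillis;        // time allowed per request, chosen below the hard limit

    DeadlineAwareWorker(LongSupplier clockMillis, long budgetMillis) {
        this.clockMillis = clockMillis;
        this.budgetMillis = budgetMillis;
    }

    /** Processes indices [from, to) until done or until the budget is used up. */
    Progress run(long from, long to) {
        long deadline = clockMillis.getAsLong() + budgetMillis;
        double sum = 0.0;
        long i = from;
        // Checking the clock on every iteration is simple but costly; a real loop
        // would probably check only every few thousand iterations.
        while (i < to && clockMillis.getAsLong() < deadline) {
            double n = (double) (i + 1);
            sum += 1.0 / (n * n); // placeholder workload
            i++;
        }
        return new Progress(sum, i);
    }

    public static void main(String[] args) {
        // 25 s leaves a safety margin below the 30 s request limit (the margin is an assumption).
        DeadlineAwareWorker worker = new DeadlineAwareWorker(System::currentTimeMillis, 25_000L);
        Progress p = worker.run(0, 1_000_000);
        System.out.println("next index: " + p.nextIndex + ", partial sum: " + p.partialSum);
    }
}
```

If `nextIndex` is smaller than the requested upper bound, the caller knows the range was not finished and can resubmit the rest.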
0.1.2 The Datastore
Web applications need a way to persist data between the stateless requests
to the application. The traditional approach is a relational database residing
on a single database server. The central database is accessed by a single or
potentially by multiple web servers retrieving the necessary data. The advantage
of such a system is that every web server always has the most recent version
of the data. However, once the limit for handling multiple parallel database
requests is reached, it gets difficult to scale the system up to more requests.
Alongside relational database systems there are various other approaches like
XML databases or object databases.
The Datastore is App Engine's own distributed data storage service. The main
idea is to provide a high-level API and hide the details of how storage
is actually done from the developer. This spares the application developer the
task of keeping data up to date while still maintaining scalability.
The database paradigm of the Datastore most closely resembles an object
database. Data objects are called entities and have a set of properties. Property
values can be chosen from a set of supported data types. Entities are of a
named kind in order to provide a mechanism for categorizing data.
This concept might seem similar to a relational database. Entities resemble rows
and properties resemble the columns in a table. However, there are some key
differences to a relational database. First of all, the Datastore is schemaless,
which means that entities of the same kind are not required to have the same
properties. Furthermore, two entities are allowed to have a property with the
same name but different value types. Another important difference is that a
single property can have multiple values.
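These three differences can be made concrete with a small in-memory model. The sketch below is purely illustrative: `EntitySketch` is a hypothetical class written for this example and is not the Datastore API. It only mimics the data model, in which entities of one kind may carry different properties, the same property name may hold different value types, and one property may hold several values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of a schemaless entity; not the Datastore API. */
final class EntitySketch {
    final String kind;
    final String key;
    // Property name -> values; a list with several items models a multi-valued property.
    private final Map<String, List<Object>> properties = new LinkedHashMap<>();

    EntitySketch(String kind, String key) {
        this.kind = kind;
        this.key = key;
    }

    /** Sets a property to one or more values of any type. */
    void setProperty(String name, Object... values) {
        properties.put(name, new ArrayList<>(Arrays.asList(values)));
    }

    /** Returns the values of a property, or an empty list if the entity lacks it. */
    List<Object> getProperty(String name) {
        List<Object> values = properties.get(name);
        if (values == null) {
            return Collections.emptyList();
        }
        return values;
    }

    public static void main(String[] args) {
        EntitySketch first = new EntitySketch("Job", "job-1");
        first.setProperty("size", 1024);              // integer value

        EntitySketch second = new EntitySketch("Job", "job-2");
        second.setProperty("size", "large");          // same name, different type
        second.setProperty("tags", "matrix", "test"); // multi-valued property

        System.out.println(first.getProperty("size") + " " + first.getProperty("tags"));
        System.out.println(second.getProperty("size") + " " + second.getProperty("tags"));
    }
}
```

Both entities are of kind "Job", yet only the second one has a "tags" property, which is exactly what a fixed relational schema would not allow.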
Entities are identified by a key, which can either be generated automatically by
App Engine or manually by the programmer. Unlike the primary key in
a relational database, an entity key is not a field, but a separate aspect of the
entity. App Engine uses the key in combination with the kind of an entity to
determine where to store the entity in the distributed server infrastructure. As
a consequence, the key as well as the kind of an entity cannot be changed once
it is created.
Indexes used in the Datastore are defined in a configuration file. While testing
the application locally on the development server, index suggestions are
automatically added to the configuration. The framework recognizes typical queries
performed by the application and generates the corresponding indexes. The index
definitions can be manually fine-tuned by modifying the configuration file before
uploading the application.
App Engine's query concept provides the most common query types. A query
consists of the entity kind along with a set of conditions and a sorting order.
Executing a query returns all the entities of the given kind that meet all of
the given conditions, sorted by the given order. Besides letting
the query return the entities, there is also the option to let it return only the
key values of the entities. This helps to minimize the data transfer from the
Datastore to the application, if only some of the queried entities are actually
needed.
The data of a web application is typically accessed by multiple users
simultaneously, thus making a transaction concept important. App Engine guarantees
atomicity: every update of an entity involving multiple properties either
succeeds entirely or fails entirely, leaving the object in its original state. Other users
will only see the complete update or the original entity and never something in
between.
App Engine uses an optimistic concurrency control mechanism. It is assumed that
transactions can take place without conflict. Once a conflict occurs (multiple
users try to update the same entity at the same time), the entity is rolled back to
its original state and all users trying to perform an update receive a concurrency
failure exception. Such a concept is most efficient for a system where conflicts
occur rarely, which is usually the case for a web application. Reads
will always succeed and the user will just see the most recent version of the
entity. There is also a possibility to read multiple entities in a group in order to
guarantee consistency of the data.
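The general idea of optimistic concurrency control can be sketched with a version number per entity: a write is only accepted if the entity has not changed since it was read, and a rejected writer re-reads and retries. The following is a simplified illustration and not the Datastore's mechanism. The class `OptimisticStore` and its methods are hypothetical, and the model lets the first writer succeed while later writers fail, which is one common variant of the scheme.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical versioned store illustrating optimistic concurrency control. */
final class OptimisticStore {
    /** Immutable snapshot of an entity value together with its version. */
    static final class Versioned {
        final long version;
        final int value;

        Versioned(long version, int value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<String, Versioned> entities = new HashMap<>();

    /** Creates or overwrites an entity unconditionally, starting at version 0. */
    synchronized void put(String key, int value) {
        entities.put(key, new Versioned(0, value));
    }

    /** Reads never fail; they return the most recent committed snapshot. */
    synchronized Versioned read(String key) {
        return entities.get(key);
    }

    /** Commits only if nobody else committed since the snapshot was read. */
    synchronized void update(String key, Versioned snapshot, int newValue) {
        Versioned current = entities.get(key);
        if (current == null || current.version != snapshot.version) {
            throw new ConcurrentModificationException("entity " + key + " changed since it was read");
        }
        entities.put(key, new Versioned(current.version + 1, newValue));
    }

    /** Read-modify-write with retry; assumes the entity exists. */
    int incrementWithRetry(String key, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Versioned snapshot = read(key);
            try {
                update(key, snapshot, snapshot.value + 1);
                return snapshot.value + 1;
            } catch (ConcurrentModificationException e) {
                // Another writer committed first: re-read and try again.
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

No locks are held between the read and the write, which is why the scheme is cheap when conflicts are rare and wasteful when they are frequent.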
There is also the possibility to define transactions manually, by bundling
multiple database operations into a transaction. For example, an application can
read an entity, update a property accordingly, write the entity back and commit
the transaction. Again, if the transaction fails, all of the database operations
have to be repeated.
The Datastore provides two standard Java interfaces for data access: Java Data
Objects (JDO) and Java Persistence API (JPA). The implementation of the two
interfaces uses the DataNucleus Access Platform, which is an open source
implementation of the specified APIs. Alongside the high-level APIs, App Engine
also provides a low-level API, which can be used to program further database
interfaces. The low-level API can also be used directly from the application,
which in some cases might be more efficient than the high-level APIs.
The Java Data Objects API
In the following, the JDO API will be described briefly, along with an
example illustrating the use of the API. JDO uses annotations to describe how
entities are stored and reconstructed. In the following, a JDO data class called
DataStoreEntity is defined in order to demonstrate the use of annotations:
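Listing 1: example of a Datastore entity
    import javax.jdo.annotations.PersistenceCapable;
    import javax.jdo.annotations.Persistent;
    import javax.jdo.annotations.PrimaryKey;

    import com.google.appengine.api.datastore.Blob;

    // Marks the class as storable in the Datastore via JDO.
    @PersistenceCapable
    public class DataStoreEntity {

        // Database key of the entity.
        @PrimaryKey
        @Persistent
        String key;

        // Binary payload, stored as a Blob property.
        @Persistent
        private Blob data;

        public DataStoreEntity(byte[] data, String key) {
            this.data = new Blob(data);
            this.key = key;
        }

        public byte[] getData() {
            return data.getBytes();
        }

        public void setData(byte[] data) {
            this.data = new Blob(data);
        }

        public String getKey() {
            return key;
        }

        public void setKey(String key) {
            this.key = key;
        }
    }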
The class is marked with the annotation @PersistenceCapable, indicating that
it is a storable data class. The class defines two fields annotated with
@Persistent, telling the Datastore that they should be stored as properties of the
entity. The field key is additionally annotated with @PrimaryKey, making it
the database key of the entity. Besides the standard Java data types, there are
several additional classes provided for various purposes.
In order to perform database operations, a PersistenceManager is needed,
which is retrieved through a PersistenceManagerFactory (PMF). The PMF
takes some time to initialize, though only one instance is needed for the
application. Typically the PMF is stored in a static variable, making it available to
the application through a singleton wrapper:
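Listing 2: PersistenceManagerFactory
    import javax.jdo.JDOHelper;
    import javax.jdo.PersistenceManagerFactory;

    // Singleton wrapper: the factory is slow to create, so it is built once.
    public final class PMF {
        private static final PersistenceManagerFactory pmfInstance =
            JDOHelper.getPersistenceManagerFactory("transactions-optional");

        // No instances; access goes through get().
        private PMF() {}

        public static PersistenceManagerFactory get() {
            return pmfInstance;
        }
    }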
Having defined a JDO data class and the singleton wrapper for the
PersistenceManagerFactory, instances of the entity can be stored into the
Datastore and retrieved using the query API:
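Listing 3: using the query API
    // Fragment from a request handler; data (a byte array) and req (the
    // servlet request) are provided by the surrounding code.
    PersistenceManager pm = PMF.get().getPersistenceManager();

    // Store a new entity, keyed by the "id" request header.
    pm.makePersistent(new DataStoreEntity(data, req.getHeader("id")));

    // Query all entities of the class; no further constraints are set.
    Query query = pm.newQuery(DataStoreEntity.class);
    List<DataStoreEntity> objs =
        (List<DataStoreEntity>) query.execute();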
Every database operation is performed through a PersistenceManager
instance. The makePersistent() method simply stores persistence capable
classes in the Datastore. Datastore entities are retrieved using queries. Queries
are also generated by the PersistenceManager: the newQuery() method returns
a query for a given class. Executing the query without further constraints
returns all Datastore entities of the given class. Entities are returned in a Java
List of the corresponding class.
0.1.3 Scalable Services
The Datastore provides a high-level API that hides implementation details
from the programmer. In a similar fashion, App Engine provides an API to
several scalable services. On the one hand, some services compensate for
the restrictions of the sandbox. On the other hand, there are services typically
useful to web applications.
This system enables App Engine to handle scalability and performance of the
services while the developers do not have to worry about implementation details.
In the following, the different services will be described in short:
1. URL Fetch: Because of the restrictions of the sandbox, applications are
not allowed to initiate arbitrary network connections. The URL Fetch
service provides an API for accessing HTTP resources on the Internet,
such as web services or other web sites. Since requests to web resources
often take a long time, there is a way to perform asynchronous HTTP
requests as well as a timeout mechanism to abort requests to resources
that do not respond in time.
2. Mail: An application can send emails through the mail service. Many web
applications use emails for user notification or confirmation of user data.
There is also the possibility for an application to receive emails. When a
mail is sent to the application's address, the mail service performs an HTTP
request to a request handler, forwarding the message to the application.
3. Memcache: The Memcache is a short-lived key-value storage service used
for data that does not need the persistence or transactional features of the
Datastore. It can also be accessed by multiple instances of the application.
The advantage over the Datastore is that it performs much faster, since
the data is stored in memory. As the name indicates, the service is usually
used as a cache for persistent data.
4. Image Manipulation: Web applications often need image transformations,
for example when creating thumbnails. This service allows the
application to perform simple image manipulations on common file formats.
5. XMPP: An application can send and receive messages from any XMPP
compatible instant messaging service. Received messages trigger a request
handler similar to an HTTP request.
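The typical use of Memcache as a cache in front of persistent data (item 3 above) follows the cache-aside pattern: look in the cache first, fall back to the slow store on a miss, and remember the result. The sketch below models this with plain Java collections. `CacheAside` is a hypothetical class, the `HashMap` merely stands in for Memcache and the loader function stands in for a Datastore read, so none of this is the App Engine service API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cache-aside wrapper; the map stands in for Memcache. */
final class CacheAside<K, V> {
    private final Map<K, V> cache = new HashMap<>(); // fast, but may lose entries at any time
    private final Function<K, V> slowStore;          // stands in for a Datastore lookup
    private int storeReads = 0;

    CacheAside(Function<K, V> slowStore) {
        this.slowStore = slowStore;
    }

    /** Returns the cached value, loading it from the slow store on a miss. */
    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        storeReads++;
        V loaded = slowStore.apply(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    /** Models the cache dropping an entry, which a short-lived cache is free to do. */
    void evict(K key) {
        cache.remove(key);
    }

    /** Number of times the slow store had to be consulted. */
    int storeReads() {
        return storeReads;
    }

    public static void main(String[] args) {
        CacheAside<String, Integer> lengths = new CacheAside<>(String::length);
        lengths.get("matrix");
        lengths.get("matrix"); // served from the cache
        System.out.println("store reads: " + lengths.storeReads());
    }
}
```

Because cached entries can vanish, the application must always be able to rebuild a value from the persistent store, which is why the fallback path is part of every read.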
0.1.4 The App Engine Development Server
The App Engine SDK includes a development server that simulates the runtime
environment along with all the accessible services, including the Datastore. As
the name states, the development server is intended for development and
debugging purposes, however there is the possibility to make the server remotely
accessible. This provides a way to host an App Engine web application on
hardware other than Google's servers. For example, if the free quotas limit the
application and there is additional hardware available, one can host the application
on an alternative server.
Though this rarely makes sense for an actual web application, for scientific
computations it can actually be very useful. The work can be distributed
heterogeneously on several Google App Engine accounts, as well as on some development
servers running on additional hardware. Since the development server has the
same behavior as the App Engine runtime environment, there is in principle no
difference where the application is executed.
There are necessarily some differences between the development server and the
App Engine runtime, however most of them make things easier on the
development server. For example, the quota restrictions do not apply to an
application running on the development server, leaving more freedom in resource
usage. Moreover, the underlying hardware is known, making rough runtime
estimates possible and thus correct scheduling of jobs easier. The differences
between an application running on Google's infrastructure and one running in
the development server will be discussed in more detail in section 0.3.
The scalable services are simulated by the development server, in order to
provide the same API to the programmer. For example, the Datastore is simulated
using the local filesystem.
0.1.5 Quotas and Limits
App Engine applications can use each resource up to a maximum limit, called
quota. Each type of resource has a quota associated with it. There are two
different types of quotas: billable and fixed [6].
Billable quotas are set by the application administrator in order to prevent the
application from overusing costly resources. There is a certain amount of each
billable quota provided to the application for free. In order to use more than
the free resources, billing has to be activated for the application. With billing
activated, the user sets a daily budget for the application, assigned to the desired
resources. Application owners are only charged for the amount of resources the
application actually used and only for the amount that exceeded the free quotas.
Fixed quotas are set by App Engine in order to ensure stability and integrity
of the server system. These are maximum limits shared by all applications,
preventing applications from consuming too many resources at a time. When
billing is enabled for an application, the fixed quotas increase.
Once the quota for a resource is reached, the resource is considered depleted.
Resources are replenished at the beginning of every day, giving the application
a fresh contingent for the next 24 hours. An exception are the Datastore quotas,
which represent the total amount of storable data and thus are not replenished.
Besides the daily quotas there are also per-minute quotas, preventing applications
from consuming their resources in a very short time. Per-minute quotas are
likewise increased for applications with billing enabled.
There are essential resources required to initiate a request handler. If one of those
is depleted, requests will be rejected with an HTTP 403 "Forbidden" status code.
The following resources are necessary for handling a request:
• number of allowed requests;
• CPU cycles;
• incoming and outgoing bandwidth.
For the rest of the resources, an exception is raised once an application tries to
access a depleted resource. These exceptions can be caught by the application
in order to display appropriate error messages for users.
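For an automated client, a rejected request therefore carries information: a 403 response suggests that waiting for the next quota reset is more sensible than retrying immediately. The sketch below shows one possible client-side policy. The class `QuotaBackoff`, the `Action` names and the mapping from status codes to actions are assumptions made for illustration; in particular a 403 may have causes other than a depleted quota, and the time zone in which the daily reset happens is also an assumption.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

/** Hypothetical client-side policy for reacting to quota-related rejections. */
final class QuotaBackoff {
    enum Action { ACCEPT_RESULT, WAIT_FOR_QUOTA_RESET, RETRY_LATER }

    /** Maps an HTTP status code to the next step the client should take. */
    static Action classify(int httpStatus) {
        if (httpStatus == 200) {
            return Action.ACCEPT_RESULT;
        }
        if (httpStatus == 403) {
            // Treated here as "an essential resource is depleted" (an assumption).
            return Action.WAIT_FOR_QUOTA_RESET;
        }
        return Action.RETRY_LATER;
    }

    /** Time from 'now' until the start of the next day in the zone of 'now'. */
    static Duration untilNextMidnight(ZonedDateTime now) {
        ZonedDateTime nextMidnight = now.toLocalDate().plusDays(1).atStartOfDay(now.getZone());
        return Duration.between(now, nextMidnight);
    }

    public static void main(String[] args) {
        // The reset zone is an assumption; adjust it to wherever quotas are actually reset.
        ZonedDateTime now = ZonedDateTime.now(ZoneId.of("America/Los_Angeles"));
        if (classify(403) == Action.WAIT_FOR_QUOTA_RESET) {
            System.out.println("pause for " + untilNextMidnight(now).toMinutes() + " minutes");
        }
    }
}
```

Keeping the classification separate from the waiting logic makes it easy to replace either part once the real failure modes of a deployment are known.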
In the following, we briefly describe the resources and their corresponding
quotas relevant to this thesis. However, there are many more quotas besides the
ones mentioned in this section; in particular, every scalable service has its own set
of quotas.
In the following the general resources with a quota are listed:
• Requests: The total number of HTTP requests to the application.
• Outgoing Bandwidth: The total amount of data sent by the application.
This includes data returned by request handlers, data served by the static
file servers, data sent in emails and data sent using the URL Fetch service.
• Incoming Bandwidth: The total amount of data received by the
application. This includes data sent to the application via HTTP requests as
well as data returned by the URL Fetch service.
• CPU Time: Total time spent processing, including all database
operations. Waiting for other services such as URL Fetch or image processing
does not count. CPU time is reported in seconds. CPU seconds are calculated
in reference to a 1.2 GHz Intel x86 processor. This value is adjusted
because CPU cycles may vary greatly due to App Engine internal
configurations, such as differing hardware.
Resource              Daily Limit           Maximum Rate
Requests              1,300,000 requests    7,400 requests/minute
Outgoing Bandwidth    1 gigabyte            56 megabytes/minute
Incoming Bandwidth    1 gigabyte            56 megabytes/minute
CPU Time              6.5 CPU-hours         15 CPU-minutes/minute
Table 0.1: Free quotas for general resources (as of 20.09.2010) [6].
Table 0.1 shows the quota limits for the general resources. For scientific
computations the main limitation will be CPU time. The per-minute quota limits
the application to a maximum computation power of 15 times a 1.2 GHz Intel
processor within any given minute. Because the reported CPU time is adjusted,
the amount of CPU cycles actually usable for computation may be even lower.
Moreover, a system using App Engine in an automated way has to implement
proper fault tolerance mechanisms, since once resources are depleted requests
may result in an exception or may even be rejected in the first place.
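As a rough worked example of the figures in Table 0.1, the following sketch computes how long the free daily CPU budget would last if an application ran constantly at the maximum per-minute rate. The class and method names are illustrative assumptions and not part of the thesis framework.

```java
/** Illustrative helper; names are hypothetical, not part of the framework. */
final class QuotaEstimate {

    /**
     * Wall-clock minutes until the daily CPU budget is used up when the
     * application constantly consumes CPU time at the given rate.
     */
    static double minutesUntilExhausted(double dailyCpuHours,
                                        double cpuMinutesPerMinute) {
        double dailyCpuMinutes = dailyCpuHours * 60.0;
        return dailyCpuMinutes / cpuMinutesPerMinute;
    }

    public static void main(String[] args) {
        // Free quota values from Table 0.1:
        // 6.5 CPU-hours per day, 15 CPU-minutes per minute.
        System.out.println(minutesUntilExhausted(6.5, 15.0) + " minutes");
    }
}
```

With the free quota values, 6.5 CPU-hours correspond to 390 CPU-minutes, so a computation running at the full rate of 15 CPU-minutes per minute would use up the daily budget after 26 minutes.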
Neither the maximum number of requests nor the corresponding per-minute quota
is a problem for scientific applications, since splitting a problem into more
than 7,400 requests per minute would create substantial transmission overhead.
The limiting factor will therefore be CPU time long before the number of
requests becomes relevant. Note that these quota limits make sense in the
context of web applications, which are typically optimized for high throughput
and fast response but have no need for large amounts of CPU cycles. An
application dedicated to scientific computations, on the other hand, consumes
far more CPU time relative to its number of requests.
The bandwidth limits will in most cases not be problematic either. Data has to
be transferred over the Internet, which is a relatively slow medium, so
problems that need only small amounts of data transferred are better suited
for execution on Google App Engine in the first place. Data-intensive problems
have a high communication overhead and are thus not a preferable class of
problems for execution on Google App Engine.
The quotas associated with the Datastore are:
• Stored Data: The total amount of data stored in the Datastore and its
indexes. There might be considerable overhead when storing entities in the
Datastore. For each entity the id, the ids of its ancestors and its kind have
to be stored. Since the Datastore is schemaless, the name of every property
has to be stored along with its value. Finally, all the index data has to be
stored along with the entities.
• Number of Indexes: The number of different Datastore indexes for an
application, including every index created in the past that has not been
explicitly deleted.
• Datastore API Calls: The total number of calls to the Datastore API,
including retrieving, creating, updating or deleting an entity as well as
posting a query.
• Datastore Queries: The total number of Datastore queries. Some interface
operations, such as "not equal" queries, internally perform multiple queries.
Every internal query counts towards this quota.
• Data Sent to API: The amount of data sent to the API. This includes
creating and updating entities as well as data sent with a query.
• Data Received from API: The amount of data received from the Datastore
API when querying for entities.
• Datastore CPU Time: The CPU time needed for performing database
operations. The Datastore CPU time is calculated in the same way as the
regular CPU time quota. Note that CPU cycles used for database operations also
count towards the CPU time quota.
Stored Data                                         1 gigabyte
Maximum entity size                                 1 megabyte
Maximum number of entities in a batch put/delete    500 entities
Maximum size of a datastore API call request        1 megabyte
Maximum size of a datastore API call response       1 megabyte
Number of Indexes
Table 0.2: General Datastore quotas (as of 20.09.2010) [6].
Tables 0.2 and 0.3 list the general and daily quotas for the Datastore. The
Datastore will be used for data that has to persist between multiple requests
to the application. This will typically be data that is inherent to the
algorithm and shared among all requests. The daily limits will not be
problematic for a scientific computation, for the same reasons stated for the
bandwidth limitations. Moreover, the Datastore will be cleared after each
algorithm run, which frees the storage counted against the general Datastore
quotas.
Resource                 Daily Limit          Maximum Rate
Datastore API Calls      10,000,000 calls     57,000 calls/minute
Datastore Queries        10,000,000 queries   57,000 queries/minute
Data Sent to API         12 gigabytes         68 megabytes/minute
Data Received from API   115 gigabytes        659 megabytes/minute
Datastore CPU Time       60 CPU-hours         20 CPU-minutes/minute
Table 0.3: Daily Datastore quotas (as of 20.09.2010) [6].
However, the maximum entity size of one megabyte is quite a problem. A normal
web application has no need to store large data entities but rather stores
many different small entities that are optimized for quick retrieval.
Scientific applications, though, often have large amounts of data to operate
on. As a consequence, data beyond one megabyte has to be partitioned in order
to fit in the Datastore. A more detailed discussion of the implications can be
found in Section 0.6.
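The partitioning mentioned above can be sketched as follows. This is a minimal illustration, not the strategy of Section 0.6: the class name, the method names and the chunk size parameter are assumptions, and in practice the chunk size would have to stay somewhat below one megabyte to leave room for the entity overhead described earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch; names are hypothetical, not part of the framework. */
final class EntityPartitioner {

    /** Splits data into consecutive chunks of at most maxChunkBytes bytes. */
    static List<byte[]> split(byte[] data, int maxChunkBytes) {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("chunk size must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        int offset = 0;
        while (offset < data.length) {
            int length = Math.min(maxChunkBytes, data.length - offset);
            chunks.add(Arrays.copyOfRange(data, offset, offset + length));
            offset += length;
        }
        return chunks;
    }

    /** Reassembles chunks, in list order, into one array. */
    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        byte[] data = new byte[total];
        int offset = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, data, offset, chunk.length);
            offset += chunk.length;
        }
        return data;
    }
}
```

Each chunk could then be stored as its own entity, with the chunk index kept alongside it so that the original order can be restored on retrieval.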
The Distribution Framework
The goal of this thesis is to build a simple framework for utilizing App Engine
servers for parallel scientific computations. The system will mainly be used
to identify properties of parallel algorithms that are well suited for the App
Engine environment and, in turn, those that are less suited. The system should
therefore be extensible, to allow easy incorporation of additional algorithms.
In addition, the management of data and the distribution of jobs should be
independent of the actual algorithm used. Finally, the system should provide
an algorithm library that utilizes App Engine for parallelization.
The system uses a simple master-slave architecture. The master-slave model is
commonly used in distributed architectures [14]. It consists of a central
master process that controls the distribution of work among several slave
processes. When implementing a system based on the master-slave model, it
should be guaranteed that the master can provide work fast enough to keep all
slaves supplied with sufficient work. When the job size is too small, the
master might be too slow to generate enough jobs and can become a bottleneck.
The slave application is implemented as a web application using the Google App
Engine framework. It provides a simple HTTP interface that is invoked
programmatically by the master application. The interface accepts either data
transfer requests or requests containing a computational job. In either case
data is transmitted in the payload of the HTTP request. Parallelism is achieved
by sending multiple parallel requests to the application. In order to make
communication between master and slaves easier, both applications are written
in the Java programming language.
The master application is a Java program running on the user's machine. It
manages the logic and the distribution of the algorithm. The problem is split
into several work chunks. Each chunk is then submitted as a job to the slave
application, which performs the actual computation. The results of the jobs
are then collected and reassembled to provide the complete result of the
algorithm. In addition, the master application has to manage the scheduling of
jobs and data transfers.
In this chapter the architecture of the system and its components are explained
in detail. Furthermore, important concepts used in the implementation of the
system are discussed.
0.2 HTTP Requests
The slave application is in principle an HttpServlet implementing the request
handling method for Hypertext Transfer Protocol (HTTP) POST requests. The
communication between the master and its slaves is therefore entirely based on
HTTP.
In this section the basics of the protocol are explained, followed by a
description of the HTTP library used by the system.
0.2.1 The HTTP Protocol
HTTP is a stateless application-level networking protocol. The latest version
of the protocol is HTTP/1.1, defined in RFC 2616 [9]. The protocol assumes a
reliable transport layer protocol, which is why TCP is most widely used as the
transport layer protocol.
HTTP is mainly used by web browsers to load web pages, but it has numerous
other applications. The protocol follows the request-response message exchange
pattern. A client sends an HTTP request message to a server, which typically
stores content or provides resources. The server replies with an HTTP response
message containing status information and any content requested by the client.
RFC 2616 defines eight request methods indicating the action that should be
performed by the server. The most important methods are:
1. GET: Retrieves whatever information is identified by the request URI. The
information should be returned in the message-body of the HTTP response
message.
2. HEAD: Is identical to the GET method, except that the HTTP response must
not contain a message-body. This method is typically used for retrieving
metainformation on an entity or testing the validity of hypertext links.
3. POST: The POST method is used to submit the entity enclosed in the request
message to the server. A POST request might result in the modification of an
existing entity or even in the creation of a new one.
4. OPTIONS: The OPTIONS method is a request for information on the available
communication options for the entity associated with the request URI.
Servers are required to implement at least the HEAD and GET methods. The
communication between the master and slave application uses the HTTP POST
method.
Every HTTP response message contains a three-digit numeric status code,
followed by a textual description of the status. Status codes are organized in
five general classes indicated by the first digit of the status code:
1. 1xx Informational: Indicates that a request was received and the server
continues to process it. Such a response is only provisional, consisting of
the status line and optional headers. One or more such responses might be
returned before a regular response to the request is sent back. Informational
responses are typically used to avoid timeouts.
2. 2xx Successful: Indicates that the server has successfully received,
understood and accepted the corresponding request.
3. 3xx Redirection: Indicates that further action is needed by the client in
order to fulfill the request.
4. 4xx Client Error: Indicates that the client seems to have caused an error.
The server should include an entity containing a closer description of the
error in the response message.
5. 5xx Server Error: Indicates that the server is unable to process a
seemingly valid request. Again, an entity containing a closer description of
the error should be included in the response.
A client has to recognize at least these five classes of response status codes
and react accordingly.
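Recognizing the five classes only requires looking at the first digit of the status code. The following sketch illustrates this; the class and enum names are assumptions made for the example and do not appear in the thesis code.

```java
/** Illustrative sketch; names are hypothetical, not part of the framework. */
final class StatusClass {

    enum Kind {
        INFORMATIONAL, SUCCESSFUL, REDIRECTION, CLIENT_ERROR, SERVER_ERROR, UNKNOWN
    }

    /** Maps a three-digit HTTP status code to its general class. */
    static Kind of(int statusCode) {
        switch (statusCode / 100) {   // integer division keeps the first digit
            case 1:  return Kind.INFORMATIONAL;
            case 2:  return Kind.SUCCESSFUL;
            case 3:  return Kind.REDIRECTION;
            case 4:  return Kind.CLIENT_ERROR;
            case 5:  return Kind.SERVER_ERROR;
            default: return Kind.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        // 403 is the code returned when an essential quota is depleted.
        System.out.println(of(403));
    }
}
```

A master application could use such a classification to decide how to react, for instance treating a 403 "Forbidden" response from a slave as a sign of a depleted quota rather than as a malformed request.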
0.2.2 Apache HTTP Components
The Apache HTTP Components library [?] provides an easy-to-use API for
applications making use of the HTTP protocol. In addition, the library is open
source, which makes any required adaptations of the code possible. The master
application uses the HTTP Components client 4.0.1, the latest stable version
at the time of writing, for building the HTTP requests necessary for invoking
the slave application.
Listing 4 shows a code sample performing an HTTP request to "" using the
functionality of the Apache HTTP Components library used by the master
application:
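Listing 4: Code example for performing a simple HTTP request.
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.SerializableEntity;
import org.apache.http.impl.client.DefaultHttpClient;

HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost("");

// attach a serializable Java object as the request entity
HttpEntity entity = new SerializableEntity(Integer.valueOf(5), true);
post.setEntity(entity);

post.setHeader("type", "integer");

// execute the request and print the status line of the response
HttpResponse response = client.execute(post);
System.out.println(response.getStatusLine());

// release the connections held by the client
client.getConnectionManager().shutdown();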
The main class necessary for initiating an HTTP communication is the
HttpClient. Its most basic functionality is to execute HTTP methods. Execution
of an HTTP method consists of one or several HTTP request-response exchanges.
The DefaultHttpClient is the standard implementation of the HttpClient.
The user provides a request object to the HttpClient for execution. The
HttpClient supports all methods defined in the HTTP/1.1 specification. The
library provides a separate class for every method. In the example an HttpPost
method is used for the request. Every method is initialized with a URI defining
the target of the request. A URI consists of the protocol in use, the host
name, an optional port and a resource path. In this case the URI is ""; the
protocol used is HTTP, the target host is and the port used is 80.
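The parts of a URI named above can be inspected with the JRE class java.net.URI, as the following sketch shows. The address used here is a hypothetical placeholder and the class name is an assumption made for the example.

```java
import java.net.URI;

/** Illustrative sketch; class name and address are hypothetical. */
final class UriParts {

    /** Returns protocol, host, port and resource path separated by spaces. */
    static String describe(String address) {
        URI uri = URI.create(address);
        // getPort() returns -1 when the URI names no port explicitly;
        // for the HTTP protocol the default port 80 is used in that case.
        int port = uri.getPort();
        if (port == -1 && "http".equalsIgnoreCase(uri.getScheme())) {
            port = 80;
        }
        return uri.getScheme() + " " + uri.getHost() + " " + port + " " + uri.getPath();
    }

    public static void main(String[] args) {
        System.out.println(describe("http://slave.example.org/compute"));
    }
}
```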
HTTP request and response messages can optionally have content entities
associated with them. Requests carrying an entity are referred to as entity
enclosing requests. The specification defines two entity enclosing methods,
namely POST and PUT. The SerializableEntity class allows constructing an
entity containing a serializable Java object. In the example, an entity
containing an Integer object is created and attached to the POST method. The
second parameter of the SerializableEntity constructor determines whether the
object will be buffered.
A message can contain one or multiple message headers describing properties of
the message, such as content type and content encoding. The example attaches
a header with the name "type" and the value "integer" to the POST method.
Message headers help a server to decide how to handle a request.
An HTTP response is the message sent back by the server as reply to a request,
implemented by the HttpResponse class. The first line of an HTTP response is
the status line, which contains the protocol version followed by a numeric
status code and its textual description. In the example the status line is
retrieved by calling the getStatusLine() method of the HttpResponse and printed
to the standard output. A successful request will usually print
HTTP/1.1 200 OK, indicating that protocol version 1.1 was used and the request
was successful.
0.2.3 Entity Compression
The HTTP Components library provides a wide variety of functionality, though
there is no convenient way to compress entities attached to HTTP requests.
Therefore we slightly modified the SerializableEntity into an entity called
CompressedSerializableEntity, providing compression of the contained
serialized object. The modified version works basically the same way as the
original, except that a compression filter stream is put between the
ObjectOutputStream and the ByteArrayOutputStream that actually writes the
serialized object to the buffer.
The information whether entity compression is enabled is stored in the header
of the HTTP request, so the slave application can decompress entities prior to
their usage.
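The stream chain described above can be sketched with JRE classes only. This is not the code of the CompressedSerializableEntity, which is not reproduced here; the class and method names are assumptions, and a DeflaterOutputStream is used as one possible compression filter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Illustrative sketch; names are hypothetical, not part of the framework. */
final class CompressedSerialization {

    /** Serializes an object through a compression filter into a byte buffer. */
    static byte[] serializeCompressed(Serializable object) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // ObjectOutputStream -> compression filter -> ByteArrayOutputStream
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new DeflaterOutputStream(buffer))) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // closing the stream chain has flushed the remaining compressed data
        return buffer.toByteArray();
    }

    /** Reverses serializeCompressed, as the receiving side would do. */
    static Object deserializeCompressed(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                 new InflaterInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("unknown class in payload", e);
        }
    }
}
```

Because only classes from java.io and java.util.zip are involved, the receiving side can decompress such a payload without any additional library.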
Java provides three different compression filter streams in the JRE standard
library: ZIP, GZIP and raw deflate compression [18]. The ZipOutputStream is an
output stream filter for writing files in the ZIP file format. The
GZIPOutputStream does the same for the GZIP file format. The
DeflaterOutputStream generates a stream compressing the data using the deflate
algorithm.
An alternative compression library is unfortunately not an option, since the
App Engine runtime environment does not allow additional libraries besides the
JRE standard library. The slave application would therefore be incapable of
decompressing the payload when using an alternative compression library.
Figure 0.2: Compression time needed for the different compression streams
(x-axis: array size, x 10^6; y-axis: compression time in milliseconds).
In order to determine the best suited compression we tested the different
compression streams in terms of compression efficiency and runtime required.
For this experiment, a one-dimensional integer array filled with random numbers
was used as raw data. For the compression efficiency, random numbers in the
range from 0 to 1000 and numbers in the range from 0 to 10,000 were tested.
Using a smaller range of random numbers increases compression efficiency, since
there are fewer possible values that have to be encoded. The tests were
performed on a system with an Intel Core 2 Duo CPU at 2.4 GHz and 4 gigabytes
of RAM.
Figure 0.2 shows a comparison of the compression times needed by the three
different streams. The times are almost identical, especially those of ZIP and
GZIP. Deflate, however, performs best throughout all tested data sizes. In
addition, for all three algorithms the runtime grows linearly with increasing
data size.
In terms of compression the streams performed equally well, though deflate
compressed data was generally slightly smaller. The reason that deflate
performs slightly better in terms of execution time and compression efficiency
is the overhead needed for the ZIP and GZIP file format information. The
streams most likely even use the same internal compression routine.
Figure 0.3: Data size for an integer array filled with random numbers
(x-axis: size of integer array, x 10^6; y-axis: data size in bytes, x 10^7;
series: deflate 1000, deflate 10000).
Figure 0.3 shows the compression efficiency of the deflate algorithm (ZIP and
GZIP were omitted because of overlapping graphs). As expected, the test using
random numbers in the range from 0 to 1000 (green line) performs better than
the second test using numbers from 0 to 10,000 (red line).
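The effect of the value range on the compressed size can be reproduced with a small experiment along the following lines. This is a sketch under assumptions, not the test program used for the figures: the class name, the array length and the random seed are chosen freely, and only the deflate stream is measured.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;

/** Illustrative sketch; names and parameters are hypothetical. */
final class CompressionExperiment {

    /** Creates an array of random values in the range [0, bound). */
    static int[] randomArray(int length, int bound, long seed) {
        Random random = new Random(seed);
        int[] values = new int[length];
        for (int i = 0; i < length; i++) {
            values[i] = random.nextInt(bound);
        }
        return values;
    }

    /** Size in bytes of the serialized array after deflate compression. */
    static int deflatedSize(int[] values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new DeflaterOutputStream(buffer))) {
            out.writeObject(values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        int length = 100_000;   // uncompressed payload is about 4 * length bytes
        System.out.println("bound 1000:  "
                + deflatedSize(randomArray(length, 1000, 42L)) + " bytes");
        System.out.println("bound 10000: "
                + deflatedSize(randomArray(length, 10000, 42L)) + " bytes");
    }
}
```

Timing the call to deflatedSize for growing array lengths would give a runtime measurement comparable in spirit to Figure 0.2, although absolute numbers depend on the machine and the JRE.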
0.3 Slave Types
A slave is a server, reachable by a distinct network address, that executes an
instance of the slave web application. The slave application is primarily
intended to be deployed to and executed on Google's App Engine servers. In
order to provide more flexibility and a way to compare the performance of the
App Engine server infrastructure to known hardware, we included the possibility
to address local servers executing an instance of the slave application using
the App Engine development server. As described earlier, the development
server is just a slim web server simulating the App Engine API locally. In the
following, deployed slaves are referred to as App Engine slaves and slaves
running on machines using the development server are referred to as local
slaves. Some basic considerations for each slave type are discussed first,
followed by a concrete comparison of their properties.
0.3.1 App Engine Slaves
An App Engine slave is an instance of our slave application executed in the
runtime environment of the App Engine servers. In principle one instance of
the slave application would be enough, since parallelism is achieved by
sending multiple parallel HTTP requests. However, various restrictions are
imposed on an App Engine application in terms of resource usage. In order to
circumvent these restrictions it is necessary to enable the system to
distribute the work among multiple instances of the slave application.
One App Engine account allows the user to create and deploy up to ten different
applications, each of them having a separate URI as well as separate resource
quota restrictions. Usually these are meant to be different applications,
though it is possible to deploy a single application multiple times. For a web
application this would not make any sense, since each of the instances would
be reachable through a different address. For a scientific application that
needs as many resources as possible, though, it is a useful way to get
additional resources. As a consequence, the master application has to be able
to distribute tasks to different instances of the slave application, each
reachable through a different network address.
0.3.2 Local Slaves
As already mentioned, instances of an App Engine application can also be
executed by the development server. Apart from minor differences the instances
behave the same way as those deployed to Google's infrastructure. Some
restrictions present for deployed applications do not even apply to an
application running on a development server. As a consequence, local computers
executing an instance of the slave application using the development server
can be incorporated as additional computing resources. For example, an
algorithm could be distributed to a couple of deployed instances as well as
some instances running on local cluster nodes using the development server.
The local nodes are typically machines with multiple CPU cores, so the concept
of sending multiple parallel HTTP requests to one instance in order to achieve
parallelism applies here as well. The development server handles each request
in a separate thread, thus automatically distributing the load over the
available cores. In principle it does not make a difference to the master
application whether it sends requests to a deployed instance, where the App
Engine frontend manages load balancing of parallel requests, or to an instance
running on the development server, where every request is handled by a
separate thread. To the master application it is only relevant how many
requests an instance can handle in parallel, which will be referred to as the
queue size. Further discussion on the impact of the queue size is provided in
Section ??.
Besides making the distributed system more flexible, the use of the development
server provides a way to compare the performance of the App Engine framework
to regular hardware. In addition, changes in the distributed system that might
have an impact on the runtime of algorithms can be tested more reliably on
local slaves, since measurements on Google's App Engine infrastructure are
often biased due to background load on the servers or the network.
0.3.3 Comparison of Slave Types
In terms of interface and general behavior both types of slaves are equivalent,
though there are still some differences that have to be considered:
1. Latency/Bandwidth: Generally the network connection to a local slave will
be better, since local slaves typically reside in the same local network as
the master application. App Engine slaves, on the other hand, are always
accessed over the Internet and thus have a much slower connection. In
addition, latency and bandwidth may often vary due to background load. A
closer analysis of App Engine's network behavior is provided in Section 5.
2. Hardware: For local slaves the underlying hardware is known. As a result,
rough calculation time estimates can be made and heuristics for scheduling
jobs can be applied. For App Engine slaves, in contrast, the underlying
hardware is neither known nor are there any guarantees in that respect.
Multiple requests can be executed on completely different hardware even if the
requests happen in a small time frame one after another.
3. Reliability/Accessibility: The reliability of local slaves depends on the
proper administration of the machines the instances are running on. Granting
access to the application by opening the corresponding ports is an
administrative issue as well. App Engine slaves, on the other hand, have no
need for administration at all. Applications running on Google App Engine are
highly reliable and are accessible from everywhere over the Internet.
4. Restrictions: App Engine slaves have various restrictions; for example, a
request handler has to terminate within 30 seconds, otherwise the request
fails. Furthermore, the total quotas as well as the per-minute quotas are
limiting for App Engine slaves. For a local slave none of these restrictions
apply, which leaves the programmer more flexibility.
5. Services: The scalable services provided by the runtime environment are
only simulated by the development server and thus may differ in their
behavior. However, the slave application only uses the Datastore, which is
provided sufficiently by the development server.
In conclusion, App Engine slaves are in general more difficult to handle
programmatically because of the various unknown variables and the strict
restrictions of the runtime environment. This, however, also means that
incorporating the possibility to use local slaves into the system does not
require a lot of adjustments in the code. Table 0.4 shows a quick overview of
the differences between local and App Engine slaves.
                            Local Slave                          App Engine Slave
Latency/Bandwidth           fast local network
Hardware                    hardware is known;                   completely unknown;
                            runtime estimates are possible       may vary
Reliability/Accessibility   administration needed                highly reliable and accessible
Restrictions                most restrictions do not apply       very restrictive (see quotas)
Services                    only simulated                       provided by Google infrastructure
Table 0.4: Properties overview of App Engine and local slaves.
0.4 The Master Application
The master application is a program written in Java that automatically invokes
the web interface provided by the slave application. It is responsible for the
generation and distribution of the parallel tasks, as well as for collecting
and assembling the partial results into a complete solution. Moreover, it is
responsible for mapping jobs efficiently to the given slaves. Another important
requirement is a good fault tolerance mechanism, since requests may fail for
various reasons.
In the following, the architecture of the master application is described,
starting with a general overview and followed by a more detailed description
of the individual components and their responsibilities.
0.4.1 Architecture
Figure 0.4: Master application architecture.
Figure 0.4 shows the main components of the master application and their
dependencies. The main entry point to the system is the DistributionEngine
class. A client using the system instantiates the DistributionEngine, handing
it an implementation of the JobFactory that represents the parallel problem.
Furthermore, the DistributionEngine needs a list of URIs of reachable slave
instances. For every slave, a HostConnector is instantiated, which manages the
actual connection to it and provides high-level control to the
DistributionEngine.
The HostConnector associated with a slave instance is responsible for
supplying it with data and jobs. Each HostConnector has a reference to the
JobFactory, directly requests jobs from it and posts the results of finished
jobs back to it.
For the actual HTTP connection, multiple threads have to be used for managing
the parallel data and job requests. For that purpose, a HostConnector uses a
JobTransferThread for every WorkJob it submits. The HostConnector implements
the ResultListener interface, which provides a callback method for the
JobTransferThreads to deliver the Result of finished jobs. These results are
then forwarded to the JobFactory, which is responsible for assembling the
final result. The transfer threads contain the code for building the actual
HTTP request matching the interface of the slave application.
Since the HostConnectors are responsible for supplying the slaves with tasks,
they implicitly determine the mapping of jobs. A closer description of the
mapping strategy is provided in Section 0.4.3. In addition, the HostConnectors
are responsible for handling failed requests and possibly failed slave
instances. Fault tolerance mechanisms are discussed more closely in Section
0.4.4.
Some algorithms, such as matrix multiplication, require additional data to be
transferred besides the data associated with WorkJobs. For such algorithms
HostConnectors additionally manage a reference to a DataManager that is
responsible for transferring data to the slaves. The DataManager itself uses
DataTransferThreads, which are just a slightly modified version of
JobTransferThreads. Data is generally transferred prior to the distribution of
jobs. Data transfers are split into multiple parallel HTTP requests. Section
0.6 provides a detailed description of the shared data management concept and
the underlying data transfer strategies.
0.4.2 Generating Jobs
A substantial task of the master is to correctly split the problem into
smaller work items that can be wrapped into WorkJobs. The system uses the
concept of a JobFactory, which is an interface that, similar to the WorkJob
interface, has to be implemented for a concrete parallel algorithm. A class
implementing the interface carries the logic for generating appropriate
WorkJobs that can be submitted to the slave applications. The JobFactory is
also responsible for reassembling the partial results of the WorkJobs in order
to produce a final Result. For applications using shared data, the JobFactory
class also provides the serialized data that has to be sent separately.
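Listing 5: The JobFactory interface.
import java.io.Serializable;

public interface JobFactory {

    // next job ready for submission, or null if none are left
    public WorkJob getWorkJob();
    // number of jobs that still have to be executed
    public int remainingJobs();
    // hands in the partial result of a completed job
    public void submitResult(Result r);
    // assembled result once all partial results are submitted
    public Result getEndResult();

    // only for applications using shared data
    public boolean useSharedData();
    public Serializable getSharedData();

}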
The JobFactory manages a list of WorkJobs that have to be completed to solve
the problem with the given parameters. The getWorkJob() method returns the
next WorkJob ready for submission, or null if there are no more jobs left to
execute. The remainingJobs() method returns the number of remaining jobs that
still have to be executed. This information is necessary for load balancing
purposes. For example, if there are three jobs left and five idle slaves
available, the slaves with the fastest expected execution should be chosen
first.
Results of completed WorkJobs are submitted to the JobFactory via the
submitResult() method. The class is responsible for assembling all the partial
results into a final result. WorkJobs and their corresponding Results must
have the same identifier in order to allow proper assembly of the end result.
Once all the results are submitted, the getEndResult() method provides the
result of the algorithm.
The useSharedData() method indicates whether the algorithm uses shared data
management. In case shared data is used, the serialized data can be requested
through the getSharedData() method. Shared data management is described in
more detail in Section 0.6.
Using the concept of a JobFactory, WorkJobs and Results, the logic of a
specific algorithm is decoupled from the rest of the system. For integrating an
additional algorithm into the system, a programmer simply needs to implement
these interfaces. In Section 0.7.1 concrete examples for implementing
algorithms in the system are discussed.
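To illustrate how an algorithm might plug into these interfaces, the following sketch implements a JobFactory for a toy problem, summing the integers from 0 to n-1 in chunks. The WorkJob and Result interfaces shown here are minimal stand-ins, since their actual signatures are not given in this section; RangeSumFactory, SumJob and SumResult are hypothetical names as well.

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins (assumptions): the real interfaces are not shown here.
interface Result extends Serializable { int getId(); }
interface WorkJob extends Serializable { int getId(); Result run(); }

interface JobFactory {
    WorkJob getWorkJob();
    int remainingJobs();
    void submitResult(Result r);
    Result getEndResult();
    boolean useSharedData();
    Serializable getSharedData();
}

final class SumResult implements Result {
    final int id;
    final long sum;
    SumResult(int id, long sum) { this.id = id; this.sum = sum; }
    public int getId() { return id; }
}

/** Sums the integers in [from, to); shares its id with its result. */
final class SumJob implements WorkJob {
    final int id, from, to;
    SumJob(int id, int from, int to) { this.id = id; this.from = from; this.to = to; }
    public int getId() { return id; }
    public Result run() {
        long sum = 0;
        for (int i = from; i < to; i++) sum += i;
        return new SumResult(id, sum);
    }
}

/** Hypothetical factory splitting the sum of 0..n-1 into chunked jobs. */
final class RangeSumFactory implements JobFactory {
    private final Deque<WorkJob> pending = new ArrayDeque<>();
    private final Map<Integer, SumResult> results = new HashMap<>();
    private final int totalJobs;

    RangeSumFactory(int n, int chunkSize) {
        int id = 0;
        for (int from = 0; from < n; from += chunkSize) {
            pending.add(new SumJob(id++, from, Math.min(n, from + chunkSize)));
        }
        totalJobs = id;
    }

    // poll() returns null once no jobs are left, as the interface requires
    public synchronized WorkJob getWorkJob() { return pending.poll(); }
    public synchronized int remainingJobs() { return pending.size(); }

    // results are keyed by the identifier shared with their job
    public synchronized void submitResult(Result r) {
        results.put(r.getId(), (SumResult) r);
    }

    /** Returns null until every partial result has been submitted. */
    public synchronized Result getEndResult() {
        if (results.size() < totalJobs) return null;
        long total = 0;
        for (SumResult r : results.values()) total += r.sum;
        return new SumResult(-1, total);
    }

    public boolean useSharedData() { return false; }
    public Serializable getSharedData() { return null; }
}
```

In the real system the run() step would happen on a slave, with the job and its result travelling as serialized payloads of HTTP requests; here it is invoked locally only to keep the example self-contained.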
0.4.3 Job Mapping
Mapping in parallel computing is the procedure of assigning the tasks of a
parallel computation to the available processors [14]. The main goal when
mapping tasks is to minimize the global execution time of the computation.
This is usually achieved by minimizing the overhead of the parallel execution.
Typical sources of overhead are communication and processors staying idle due
to insufficient supply with work.
Communication overhead is minimized by avoiding unnecessary communication
between processors or machines. Avoiding idle processors requires good load
balancing, which means that the work should be distributed equally among the
processors.
Mapping strategies can be roughly classified into two categories, static and
dynamic:
1. Static Mapping: Static mapping strategies decide the mapping of tasks to
the available processors before executing the program. Providing a good
mapping in advance is a complex and computationally expensive task. In fact,
finding an optimal mapping is NP-complete. Knowledge of task size, data size,
the underlying hardware used and even the system implementation is crucial.
However, for most practical applications there exist heuristics that produce
fairly good static mappings.
2. Dynamic Mapping: Dynamic mapping strategies distribute the tasks
dynamically during execution of the algorithm. When there is insufficient
knowledge of the environment, static mappings can cause load imbalances. In
such cases, dynamic mapping techniques often yield better results.
As described earlier, parallelism is achieved by sending multiple tasks in
parallel to the web application, which then handles those requests in
parallel. In case of an App Engine slave, the requests might be handled in
parallel on one machine or on several different machines. In case of a local
slave the requests are handled in separate threads, thus using multiple
available CPU cores. The mapping to cores therefore depends on when and how
many requests are sent in parallel to each slave managed by the system.
The basic mapping approach of our system is similar to a dynamic work pool.
A distributed system using a work pool approach typically has a central node
managing the parallel tasks, and the computation nodes request tasks from the
work pool for computation. More advanced implementations sort the parallel
tasks in the work pool using a priority queue. Such an approach has the
advantage that work is given only to nodes that actually have free resources.
Moreover, if there are sufficient tasks of roughly the same size, almost no load
imbalances are to be expected, even in a heterogeneous environment. Faster nodes
will automatically request more work, since they finish tasks faster, and slower
nodes will not be flooded by work they cannot handle.
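The pull-based work pool described above can be sketched as follows. This is only an illustration under stated assumptions: the class WorkPoolSketch, its method names and the sleep that stands in for the computation are hypothetical and not part of the system described in this thesis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

class WorkPoolSketch {
    /** Runs one thread per node and returns how many tasks each node completed. */
    static int[] run(int taskCount, long[] nodeDelayMillis) {
        // Central work pool holding the ids of all open tasks.
        ConcurrentLinkedQueue<Integer> pool = new ConcurrentLinkedQueue<>();
        for (int id = 0; id < taskCount; id++) {
            pool.add(id);
        }
        int[] completed = new int[nodeDelayMillis.length];
        List<Thread> nodes = new ArrayList<>();
        for (int n = 0; n < nodeDelayMillis.length; n++) {
            final int node = n;
            Thread t = new Thread(() -> {
                // A node requests the next task only once it is free again.
                while (pool.poll() != null) {
                    pause(nodeDelayMillis[node]); // stands in for the computation
                    completed[node]++;            // each slot is written by one thread only
                }
            });
            nodes.add(t);
            t.start();
        }
        for (Thread t : nodes) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return completed;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int[] completed = run(40, new long[] {1, 10});
        System.out.println("fast node: " + completed[0] + ", slow node: " + completed[1]);
    }
}
```

With a fast and a slow node, the fast node will usually end up with most of the tasks, without the pool needing any knowledge about node speeds.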
The mapping strategy of the system is inspired by the work pool approach.
The work pool is implemented by the JobFactory, which provides WorkJobs
on demand as well as the possibility to put back WorkJobs for reassignment. A
closer description of the JobFactory is provided in Section 0.4.2.
Because tasks are pushed to the nodes by HTTP requests, the computation nodes
are not able to request additional work by themselves. Therefore every slave has
a HostConnector associated with it, managing the job retrieval and the posting
of results for the particular slave instance. Every slave has a queue size, which
is simply the number of parallel requests it can handle.
Initially the HostConnector retrieves the number of jobs indicated by the queue
size and sends them to the corresponding slave. Every time a job is finished, the
HostConnector retrieves a new job and sends it to the slave.
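The push-based variant with a fixed queue size can be sketched with a counting semaphore. The following is a minimal illustration, not the actual HostConnector: the class QueueSizeSketch and its fields are hypothetical, and a short sleep stands in for the HTTP round trip and the computation on the slave.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class QueueSizeSketch {
    final AtomicInteger inFlight = new AtomicInteger();
    final AtomicInteger maxInFlight = new AtomicInteger();
    final AtomicInteger completed = new AtomicInteger();

    /** Pushes jobCount jobs to one slave, never exceeding queueSize parallel requests. */
    void dispatch(int jobCount, int queueSize) {
        Semaphore freeSlots = new Semaphore(queueSize);
        ExecutorService senders = Executors.newCachedThreadPool();
        for (int job = 0; job < jobCount; job++) {
            freeSlots.acquireUninterruptibly(); // wait until the slave can take another job
            senders.execute(() -> {
                maxInFlight.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                try {
                    Thread.sleep(2); // stands in for the HTTP round trip and the computation
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                    completed.incrementAndGet();
                    freeSlots.release(); // a finished job makes room for the next one
                }
            });
        }
        senders.shutdown();
        try {
            senders.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        QueueSizeSketch connector = new QueueSizeSketch();
        connector.dispatch(20, 4);
        System.out.println("completed: " + connector.completed.get()
                + ", max parallel requests: " + connector.maxInFlight.get());
    }
}
```

The first queueSize jobs are sent without waiting; afterwards every finished job releases one slot, which corresponds to the HostConnector retrieving a new job whenever a result arrives.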
Different slave instances have different optimal queue sizes. For example, a slave
running on a machine with a larger number of CPU cores is able to handle more
requests in parallel than one running on a machine with only one core. For local
slaves the optimal queue size is usually the number of available processor cores.
For App Engine slaves it is a little more difficult to find an appropriate queue
size, since there is no information on the hardware available. Besides, the
execution time might be influenced by various factors such as other applications
sharing the same server and thus causing background load. The fact that
subsequent requests might be handled on completely different hardware is
problematic as well. However, if the problem can be partitioned into similar sized
jobs and only App Engine slaves are used, the best choice is typically to evenly
distribute the whole problem right at the start of the algorithm, since if the load
is too high the excessive requests get aborted and can be rescheduled.
0.4.4 Fault Tolerance
Fault tolerance in the context of a distributed system means guaranteeing that
every parallel task is executed and results are collected correctly in order to make
completion of the algorithm possible. Recoverable problems such as single slave
instances going offline should be recognized in a timely manner and handled
accordingly. If there are problems the system cannot recover from, for example a
complete loss of network connectivity, the system should persist its state in order
to make continuation of execution at a later time possible.
Retransmission of Requests
HTTP requests to the slave application may fail at any time for various reasons,
thus a mechanism for correctly handling failed requests forms an important
part of the system. Guaranteeing execution of the task associated with a
request requires either resending the request until it is performed correctly or
detaching the task and putting it back into the work pool.
The best action for recovering from an error often depends on the cause of the
problem. First of all, requests may get lost due to an unreliable network; here
the best reaction is to resend the request as soon as possible. Another reason
can be a depleted resource or a busy slave instance that is temporarily not able
to handle additional requests. The best reaction in this case is to resend the
request as soon as the slave is able to receive additional requests. In some cases
a task cannot be executed correctly by one slave, while others might be able
to execute it without a problem. For example, a long task assigned to an App
Engine slave that repeatedly exceeds the runtime limitations could be executed
by a local slave without a problem.
Resending of requests is implemented in the DataTransferThread and the
JobTransferThread themselves, in order to avoid creating a new thread every time
an HTTP request gets lost. Threads resend failed requests a configurable
number of times using an exponential backoff mechanism. A TransferThread
initially waits a small amount of time before resending a request and
doubles the wait time for every further attempt. This technique avoids flooding
the network with unnecessary HTTP requests that will be discarded anyway.
Once the maximum number of retries is reached, a TransferThread sets its state
to failed. The HostConnector regularly checks for failed JobTransferThreads,
and tasks associated with a failed JobTransferThread are detached and put back
into the work pool in order to make reassignment to a different slave possible.
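The retry behavior with exponential backoff can be sketched as follows. This is a minimal illustration and not the code of the TransferThreads: the class BackoffSketch, the Sleeper interface and all parameter names are hypothetical, and the request is modeled as a function that returns true on success.

```java
import java.util.function.BooleanSupplier;

class BackoffSketch {
    /** Abstraction over waiting, so that the sketch can be exercised without real delays. */
    interface Sleeper {
        void sleep(long millis);
    }

    /**
     * Sends the request once and resends it up to maxRetries times.
     * The wait before the first resend is initialWaitMillis and doubles
     * for every further attempt. Returns false if all attempts failed.
     */
    static boolean sendWithRetry(BooleanSupplier request, int maxRetries,
                                 long initialWaitMillis, Sleeper sleeper) {
        if (request.getAsBoolean()) {
            return true; // the first attempt succeeded
        }
        long wait = initialWaitMillis;
        for (int retry = 0; retry < maxRetries; retry++) {
            sleeper.sleep(wait);
            if (request.getAsBoolean()) {
                return true;
            }
            wait *= 2; // exponential backoff
        }
        return false; // the caller would now mark the transfer as failed
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        boolean ok = sendWithRetry(() -> ++attempts[0] > 2, 5, 100,
                millis -> System.out.println("waiting " + millis + " ms"));
        System.out.println("success: " + ok + " after " + attempts[0] + " attempts");
    }
}
```

A caller that receives false would detach the task and put it back into the work pool in the case of a job, or deactivate the slave in the case of a data transfer.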
Unlike single jobs, data has to be transferred to the slave it is intended for in
order to make execution of the algorithm possible. As a consequence, once
a DataTransferThread fails, the corresponding slave has to be removed from
the list of available slaves and its associated HostConnector has to be
deactivated. Therefore, DataTransferThreads typically have a higher retry count
than JobTransferThreads in order to avoid accidentally removing an active
slave.
Handling Offline Slave Instances
The HostConnector regularly checks for failed TransferThreads, and once it
discovers a high number of failed requests it suspends job transmission in order
to check the slave's availability. Ping requests are sent in order to check whether
the slave is still online. A ping request should cause the slave application to
return immediately with an empty response. If a ping request succeeds or a
successful result of a previous job is returned, job transmission
is resumed.
However, if a certain number of ping requests fail, the HostConnector assumes
its slave has gone offline, puts all active tasks back into the work pool and
deactivates itself. Optionally, the availability of slaves can be checked prior to
execution in order to avoid starting to send requests to inactive slaves.
Handling Loss of Connectivity
Once an unrecoverable fault is detected, such as complete loss of connectivity,
the Distribution System tries to persist its state in order to continue execution
at a later point in time. This behavior is especially desirable for long running
algorithms, where an unrecoverable error would mean the complete loss of all
the already finished computation. The state of the problem is implicitly given
by the JobFactory class, which manages the open WorkJobs as well as the
already computed partial Results. The framework provides the possibility to
make a JobFactory Serializable and to additionally implement the interface
Persistable.
Listing 6: The Persistable interface.
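public interface Persistable {
    // called by the system once an unrecoverable error is detected
    public void saveState();
    // restores a previously saved state so that execution can continue
    public void loadState();
}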
Listing 6 shows the Persistable interface providing the methods saveState()
and loadState(). If a JobFactory implementation additionally
implements this interface, the system puts all unfinished tasks back in the work
pool and calls the saveState() method once an unrecoverable error is detected.
This provides the possibility to load the state of the JobFactory at a later point
in time and continue execution of the algorithm.
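One possible way to implement saveState() and loadState() is plain Java serialization to a file. The sketch below is an assumption made for illustration: the class SimpleJobFactory, its fields and its helper methods are hypothetical and much simpler than the JobFactory of Section 0.4.2. Jobs are represented only by their ids and results by a single number.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.HashMap;

interface Persistable {
    void saveState();
    void loadState();
}

class SimpleJobFactory implements Persistable {
    private final Path file;                                   // where the state is persisted
    private ArrayDeque<Integer> openJobs = new ArrayDeque<>(); // ids of jobs still to be done
    private HashMap<Integer, Long> results = new HashMap<>();  // partial results by job id

    SimpleJobFactory(Path file) { this.file = file; }

    void addJob(int id) { openJobs.add(id); }
    Integer nextJob() { return openJobs.poll(); }
    void putBack(int id) { openJobs.addFirst(id); }
    void addResult(int id, long value) { results.put(id, value); }
    int openCount() { return openJobs.size(); }
    int resultCount() { return results.size(); }

    @Override
    public void saveState() {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(openJobs);
            out.writeObject(results);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    @SuppressWarnings("unchecked")
    public void loadState() {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            openJobs = (ArrayDeque<Integer>) in.readObject();
            results = (HashMap<Integer, Long>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Before saveState() is called, jobs that were handed out but not finished are put back with putBack(), so that the saved state contains every open job.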
0.5 The Slave Application
The slave application is a web application written in Java using the Google App
Engine framework. As previously discussed, instances of the slave application
can be either App Engine slaves or local slaves. Basically, the slave
application does nothing more than receive small pieces of work, execute them
and send back the results to the master.
Figure 0.5: Activity diagram illustrating the control flow of the slave
application.
Figure 0.5 shows a UML activity diagram visualizing the general control flow of
the slave application. The entry point of the slave application is an HTTP request
to the servlet's POST method. First of all, it has to be checked whether the entity
in the message body of the HTTP request is compressed and, if so, the payload
has to be decompressed prior to further usage. The next step is to determine the
type of the request. There are different types of requests: job, data, clear and
ping. Each request type has to be treated differently. The corresponding meta
information of the request is stored in the form of message headers in the HTTP
request (see Section 0.5.3).
The most important request type is the job request. Such a request contains a
parallel task intended to be executed by the slave application. Once a job request
is identified, the job itself has to be extracted from the entity by deserializing
the data to a WorkJob object. The next step is to determine whether the job
needs shared data and, if so, it has to be retrieved from the Datastore prior to
execution. After that, the WorkJob is executed by calling its run() method. The
result of the computation is then again stored in serialized form in the HTTP
response. If result compression is enabled, the serialized object is additionally
compressed.
Data requests are used to transfer shared data that is used by all the jobs and
thus only has to be transferred once to each slave instance. A closer description
of the shared data management concept is provided in Section 0.6. Once a data
request is identified, the raw data is extracted and stored in the Datastore using
a wrapper data entity.
A clear request causes the slave application to delete the entire content of the
Datastore in order to erase all saved state. A clear request is typically sent
after a successful or failed run of an algorithm in order to prepare the slave for
subsequent runs of the algorithm.
Ping requests are used for determining whether a slave instance is still online.
The slave application returns immediately once it identifies a ping
request. A closer description of the fault tolerance mechanisms is provided in
Section 0.4.4.
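The routing of requests by type can be sketched as a simple dispatch on the type header. The sketch is deliberately reduced and makes several assumptions: the class SlaveDispatchSketch is hypothetical, an in-memory map stands in for the Datastore, decompression and the retrieve type are omitted, and a job is not actually deserialized or executed. Only the header names and values are taken from Section 0.5.3.

```java
import java.util.HashMap;
import java.util.Map;

class SlaveDispatchSketch {
    // In-memory stand-in for the Datastore (assumption made for this sketch).
    private final Map<String, byte[]> store = new HashMap<>();

    /** Routes one request by its type header and returns a short description of the action. */
    String handle(Map<String, String> headers, byte[] payload) {
        String type = headers.getOrDefault("type", "");
        switch (type) {
            case "job":
                // A real slave would deserialize the WorkJob here and call run().
                boolean shared = Boolean.parseBoolean(headers.getOrDefault("sharedData", "false"));
                return shared ? "job executed with shared data" : "job executed";
            case "data":
                store.put("chunk-" + store.size(), payload);
                return "data stored";
            case "clear":
                store.clear();
                return "store cleared";
            case "ping":
                return ""; // empty response, returned immediately
            default:
                throw new IllegalArgumentException("unsupported request type: " + type);
        }
    }

    int storedEntries() {
        return store.size();
    }

    public static void main(String[] args) {
        SlaveDispatchSketch slave = new SlaveDispatchSketch();
        Map<String, String> headers = new HashMap<>();
        headers.put("type", "data");
        System.out.println(slave.handle(headers, new byte[] {1, 2, 3}));
    }
}
```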
0.5.1 WorkJobs
A WorkJob is a piece of work that can be received and executed by the slave
application. It contains the algorithmic logic as well as the data needed for
execution. WorkJob itself is an abstract class defining the necessary methods
expected by the system:
Listing 7: The abstract WorkJob class.
import java.io.Serializable;

public abstract class WorkJob implements Serializable {
    private int id;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    // contains the algorithmic logic of the job
    public abstract Result run();

    // only needed for algorithms with shared data
    public void fetchSharedData() { }
}
Every algorithm needs a specific implementation of a WorkJob that extends this
abstract class. WorkJobs always have to be serializable, since they are transferred
in serialized form.
The core of a WorkJob is the run() method, which contains the algorithmic
logic of the job. The return value is of the type Result, which is again a generic
abstract class that needs to be extended when implementing a result class for a
specific algorithm.
How data needed for the algorithm is managed is left to the programmer
implementing the specific WorkJob. However, the class should only contain data
specific to the job. Transferring data that is used by multiple jobs within the
class would lead to redundant data transfers. Data shared by multiple jobs can
be sent separately and should be retrieved by invoking the fetchSharedData()
method. The concept of shared data management is described in more detail in
Section 0.6.
Every WorkJob has a unique identifier that has to be the same as the identifier
of the corresponding Result. This allows the master to correctly map WorkJobs
to their Results, which is necessary for assembling the solution of the algorithm.
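As an illustration, a concrete WorkJob and its Result could look like the sketch below. The classes SumOfSquaresJob and SumResult are hypothetical examples and not part of the system. The sketch also assumes that run() itself measures its execution time and that the time is stored in nanoseconds; the text does not specify either point.

```java
import java.io.Serializable;

abstract class WorkJob implements Serializable {
    private int id;
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public abstract Result run();
    public void fetchSharedData() { } // only needed for algorithms with shared data
}

abstract class Result implements Serializable {
    private int id;
    private long calculationTime;
    public long getCalculationTime() { return calculationTime; }
    public void setCalculationTime(long calculationTime) { this.calculationTime = calculationTime; }
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}

/** Hypothetical job: sums the squares of the integers in [from, to). */
class SumOfSquaresJob extends WorkJob {
    private final int from;
    private final int to;

    SumOfSquaresJob(int id, int from, int to) {
        setId(id);
        this.from = from;
        this.to = to;
    }

    @Override
    public Result run() {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += (long) i * i;
        }
        SumResult result = new SumResult(sum);
        result.setId(getId()); // same identifier as the job
        result.setCalculationTime(System.nanoTime() - start);
        return result;
    }
}

/** Hypothetical result carrying the partial sum of one job. */
class SumResult extends Result {
    final long sum;

    SumResult(long sum) {
        this.sum = sum;
    }
}
```

The master could split a large range into several such jobs and add up the sum fields of the returned results, using the ids to check that every job has been answered.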
0.5.2 Results
The slave application returns a serialized instance of the class Result wrapped
in the HTTP response. Result is an abstract class that every algorithm-specific
result implementation must extend:
Listing 8: The abstract Result class.
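import java.io.Serializable;

public abstract class Result implements Serializable {
    private int id;               // same id as the corresponding WorkJob
    private long calculationTime; // execution time of the run() method

    public long getCalculationTime() { return calculationTime; }
    public void setCalculationTime(long calculationTime) {
        this.calculationTime = calculationTime;
    }
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}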
The class only defines the id used for relating the result to the corresponding
WorkJob and a field storing the execution time of the run() method. The actual
data types for returning the results must be defined in the algorithm-specific
implementation.
In the field calculationTime the execution time needed for the run() method
is stored. This value represents the time spent doing useful computations and
is used to determine the ratio between parallelization overhead and the actual
computation time.
Results can optionally be returned in compressed form in order to reduce
the amount of data to be transferred. The master application indicates that it
expects a compressed result in the HTTP message header (see Section 0.5.3).
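Compressing a serialized result can be sketched with the standard Java stream classes. The text does not state which compression algorithm the system uses, so GZIP is an assumption here, and the class ResultCompressionSketch with its two methods is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ResultCompressionSketch {
    /** Serializes the value and compresses the serialized bytes with GZIP. */
    static byte[] serializeCompressed(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // complete once the streams have been closed
    }

    /** Reverses serializeCompressed: decompresses and deserializes. */
    static Object deserializeCompressed(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        double[] values = new double[10000]; // highly redundant data compresses well
        byte[] packed = serializeCompressed(values);
        System.out.println("compressed size in bytes: " + packed.length);
    }
}
```

Whether compression pays off depends on the data: redundant results shrink considerably, while results that are already dense gain little and still cost CPU time on the slave.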
0.5.3 Message Headers
HTTP requests contain header fields used for transferring meta-information,
such as which encoding is accepted by a browser. A header field has a name and a
corresponding value, which is usually a string. Besides the standard header
fields, self-defined custom headers can be used for transferring information. The
slave application uses the header fields to decide how to treat requests.
In the following, the parameters used by the slave application are listed:
• type: The type field indicates the kind of request transferred, how it has
  to be handled by the application and the kind of data contained in the
  payload.
  - job: A job request is a computational task to be executed by
    the slave. The payload of the request contains the corresponding
    WorkJob.
  - data: A data request serves as a means to transfer data to the
    application. The payload contains shared data to be stored in the
    Datastore.
  - clear: A clear request causes the slave to clear all stored data. It
    contains no data in the payload. Such a request is typically sent after
    all jobs have finished to reset the application.
  - ping: A ping request is used to determine whether a slave is
    reachable. The application should respond immediately with an empty
    response.
  - retrieve: A retrieve request causes the application to read the
    contents of the Datastore and send them back in the response to the
    request. This flag is only used for debugging purposes.
• compression: The compression field indicates whether the payload of
  the request is compressed and therefore has to be decompressed prior to
  further usage.
  - enabled: The enabled flag indicates that request compression is
    enabled.
  - disabled: The disabled flag indicates that request compression is
    disabled.
• resultCompression: The resultCompression field indicates whether the
  result should be compressed before returning it.
  - enabled: The enabled flag indicates that result compression is
    enabled.
  - disabled: The disabled flag indicates that result compression is
    disabled.
• sharedData: The sharedData field indicates whether the corresponding
  WorkJob uses shared data and therefore whether shared data has to be
  retrieved prior to execution of the job.
  - true: The true flag causes the slave to invoke the fetchSharedData()
    method prior to the run() method.
  - false: In contrast, the false flag causes the slave to invoke the
    run() method immediately without retrieving further data.
• benchmark: The benchmark field indicates whether database operations
  should be performed when handling data requests. The field is used for
  deactivating database operations in order to measure the data transfer
  itself more precisely.
  - true: The true flag causes the slave to discard the transferred data and
    return immediately.
  - false: The false flag causes the slave to store the enclosed data in
    the Datastore.
0.6 Shared Data Management
The naive approach for transferring data from the master to its slaves is to
simply attach all the necessary data to each job. However, various parallel
algorithms have shared data that has to be accessed by all of the jobs. Since
a single slave will compute multiple jobs, this results in unnecessary redundant
data transfer from the master to the slaves. Especially if communication takes
place over relatively slow networks such as the Internet, this can result in a
major bottleneck. A common example would be a parallel implementation of
the matrix multiplication algorithm, where one matrix is shared and has to
be accessed by each job. So in principle the shared matrix only has to be
communicated once to each slave.
For this reason we introduced data requests besides regular job requests in the
system. Using shared data is optional though, since not every algorithm needs
shared data, and in some cases the overhead of sending data multiple times might
be acceptable for other reasons. When shared data is used, the parallel computation
process is split into two phases: first, shared data is transferred to each slave and
stored; second, the jobs with the actual computation tasks are distributed among the
slave instances.
As described in Section 2, the runtime environment of the App Engine framework
restricts any access to the file system. Consequently we had to use the Datastore
service in order to store shared data in a way that all the jobs executed on one
slave can access the data. The development server simulates the Datastore by
storing data in a single file in the file system, since it is usually used for testing
purposes only. This may seem inefficient, yet the jobs do not have to query for
data but usually need the whole shared data stored. Therefore, it is still more
efficient to read data from the file system than to communicate it over a slow
network. For an application running on Google's servers it is not guaranteed
that every request will be executed on the same hardware (though if possible it
is preferred). However, the Datastore service manages proper access to the data
for every request, thus such an application can be logically treated as a single
slave. From a performance viewpoint, using the Datastore service will again in most
cases be better than plainly sending data multiple times.
0.6.1 Data Splitting
We chose to split data into multiple chunks for data transfer for two reasons.
First of all, the Datastore service allows a single storage entity to have a
maximum size of one megabyte. Besides, HTTP has limitations on how much data
one request is allowed to carry in its payload as well. In order to avoid these
limitations, the shared data has to be split and stored in several parts. Once a job
needs to access the data, it simply reads all the chunks and reassembles them.
The second major advantage is that by splitting data, multiple TCP streams
are used for transmission, which can often improve transfer speed notably.
Especially in our case, where data transfers typically last only a couple of
seconds, multiple streams help to hide the effects of the TCP startup phase [13].
Of course, the overhead produced by transferring an additional HTTP
header for each separate data chunk also has to be considered. So a good ratio
between the total data to be transferred and the data transferred in a single
chunk has to be found. An experimental analysis of the splitting factor is
provided in Section 0.11.2.
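Splitting and reassembling shared data can be sketched in a few lines. The class ChunkingSketch and its method names are hypothetical; in the real system each chunk would be sent in its own data request and stored as a separate Datastore entity, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkingSketch {
    /** Splits data into consecutive chunks of at most chunkSize bytes. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int from = 0; from < data.length; from += chunkSize) {
            int to = Math.min(data.length, from + chunkSize);
            chunks.add(Arrays.copyOfRange(data, from, to));
        }
        return chunks;
    }

    /** Reassembles the original data from its chunks, in list order. */
    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        byte[] data = new byte[total];
        int position = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, data, position, chunk.length);
            position += chunk.length;
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] shared = new byte[2_500_000];
        List<byte[]> chunks = split(shared, 1_000_000);
        System.out.println("number of chunks: " + chunks.size());
    }
}
```

The chunk size is the tuning knob discussed above: smaller chunks mean more parallel streams but also more HTTP headers per byte of payload.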
0.6.2 Data Transfer Strategy
An important consideration when transferring data to multiple slaves is whether
to transfer data to all of them in parallel or sequentially one after another. If
there are separate independent network links to the slaves, the parallel approach
is clearly the best strategy.
However, this is typically not the case in a practical scenario, and outgoing
bandwidth to the hosts is shared. Over the Internet the most likely network
bottleneck is the upload bandwidth of the master. In a local network the nodes
are usually interconnected by a switch or a router. In a typical homogeneous
network topology like in Figure 0.6, the bandwidth bottleneck is already the link
to the switch.
In a heterogeneous network topology where the master has a considerably faster
connection to the switch, like in Figure 0.7, data transfers won't affect each other,
thus sending data in parallel is clearly the best option.
Figure 0.6: Typical homogeneous local network topology.
Assuming, however, that the network bandwidth to the slaves has to be shared, like in
Figure 0.6, it is better to send data to the slaves one after another. A slave needs
the complete shared data in order to start the computation, so partial data is
not useful to a slave. Assuming a topology like in the homogeneous example, with
three slaves behind a shared link of 1000 Mbit/s, and 1000 Mbit of data to be
transferred to each host, it takes three seconds to broadcast the data to all
slaves. When transferring the data in parallel, every slave can start computing
only after these three seconds.
However, if data is sent separately to the slaves, it takes one second for each
transfer, since the whole bandwidth can be used. So after one second the data
transfer to the first slave is finished and a job can be assigned, so the first slave
can start its computation. It then takes another second to transfer the
data to the second slave, which can start the computation after a total of two
seconds. After three seconds the third slave can also start computing. In total we
gain three seconds of computation time in comparison to transferring the data in
parallel, by enabling the first two slaves to start their computations earlier. As
a result, in a scenario where the bandwidth to all the slaves is shared, a sequential
data transfer strategy should be preferred over a parallel data transfer.
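The arithmetic of this example can be reproduced with a small calculation. The sketch assumes an idealized shared link that is divided evenly between parallel transfers and ignores protocol overhead; the class TransferTimingSketch and its method names are hypothetical.

```java
import java.util.Arrays;

class TransferTimingSketch {
    /** Start times in seconds if all slaves receive their data in parallel over a shared link. */
    static double[] parallelStartTimes(int slaves, double dataMbit, double linkMbitPerSecond) {
        double[] start = new double[slaves];
        // Every transfer gets 1/slaves of the bandwidth, so all of them finish together.
        Arrays.fill(start, slaves * dataMbit / linkMbitPerSecond);
        return start;
    }

    /** Start times in seconds if the slaves receive their data one after another. */
    static double[] sequentialStartTimes(int slaves, double dataMbit, double linkMbitPerSecond) {
        double[] start = new double[slaves];
        for (int i = 0; i < slaves; i++) {
            // Slave i has to wait for the i transfers before it and for its own transfer.
            start[i] = (i + 1) * dataMbit / linkMbitPerSecond;
        }
        return start;
    }

    /** Total computation time gained by the sequential strategy, summed over all slaves. */
    static double gainedSeconds(double[] parallel, double[] sequential) {
        double gain = 0;
        for (int i = 0; i < parallel.length; i++) {
            gain += parallel[i] - sequential[i];
        }
        return gain;
    }

    public static void main(String[] args) {
        double[] parallel = parallelStartTimes(3, 1000, 1000);
        double[] sequential = sequentialStartTimes(3, 1000, 1000);
        System.out.println("parallel:   " + Arrays.toString(parallel));
        System.out.println("sequential: " + Arrays.toString(sequential));
        System.out.println("gained seconds: " + gainedSeconds(parallel, sequential));
    }
}
```

For three slaves, 1000 Mbit of data and a 1000 Mbit/s link, the first slave gains two seconds and the second slave one second, which gives the three seconds stated in the text.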
Figure 0.7: Heterogeneous local network topology, with superior master node.
When transferring data sequentially, the next consideration is in which order to
transfer the data to the slaves. In general, slaves to which a faster network link
is available and slaves that have more computing power are preferable. The
data is transferred faster to slaves with more bandwidth available, thus enabling
them to begin their computation earlier. On the other hand, the system benefits
more from faster slaves beginning their computation early. So if in the previous