MR-Tandem: Parallel X!Tandem Using Hadoop MapReduce on Amazon Web Services

photofitterInternet and Web Development

Dec 4, 2013 (3 years and 6 months ago)

59 views

MR
-
Tandem: Parallel X!Tandem Using Hadoop MapReduce on Amazon Web Services

Brian S. Pratt, J. Jeffry Howbert, Natalie I. Tasman, Erik J. Nilsson

Insilicos LLC, Seattle WA


Clusters Are:


Too Big, Too Small, Too Expensive

and You Probably Don’t Own One Anyway

A compute cluster is never the right size


either it’s overburdened, or underutilized.


The moment the parts are ordered it starts to become obsolete. The moment it’s installed (which is usually
quite a while after the parts are ordered) components begin to age and eventually fail.


Everything must be backed up, regularly, and off site. In practice, this doesn’t always happen, especially as
data volumes grow.


Clusters are a pain to own. Yet the only thing that’s worse than owning a cluster is not owning one, and
having to negotiate for time on a shared cluster.


What if you could have your own perfectly sized cluster that exists only when you need it, and have your data
backed up in three different data centers automatically?



Many Researchers Miss Valuable PTM Information

for Lack of Compute Power

Post
-
translational modifications (PTMs) of proteins are a valuable source of biological insights, and mass
spectrometry is one of the few techniques able to reliably prospect for and identify PTMs.


Yet many valuable data sets are not searched for PTMs due to computational constraints on investigators who
do not have claim to significant time on a compute cluster.


MR
-
Tandem

adapts the popular X!Tandem peptide search engine to work in the Amazon Web Services cloud
so that investigators can easily and inexpensively create their own temporary compute clusters on demand, and
quickly conduct what would otherwise be excessively time consuming searches.

MR
-
Tandem Drops In

Where You Already Use X!Tandem

MR
-
Tandem uses exactly the same parameter files as your local copy of X!Tandem, so it’s easy to switch back
and forth between local and cloud
-
based X!Tandem searches. MR
-
Tandem places its results on your computer
in the same place your local copy of X!Tandem would, and all file references in the results are in terms of your
local file system.


MR
-
Tandem uses a Python script on your local computer to spin up a cluster in AWS Elastic Map Reduce. Any
data files which aren’t already in the cloud are compressed and transferred, your search (or searches) are
performed, and results are copied back to your computer. Your data files and results are also stored in
Amazon’s S3 Simple Storage Service, which replicates them in three different data centers.

Parallel vs. Simultaneous Search

There are many systems that will launch multiple simultaneous searches on a cluster. While this is
obviously faster than running them one at a time, it still doesn’t address the issue of very thorough
searches that take hours or even days.




Works With Any Hadoop Cluster

MR
-
Tandem also works with Hadoop clusters other than AWS Elastic Map Reduce. Operation is the same,
except your files are stored in HDFS instead of S3, and backups are your cluster administrator’s business.

Other
Insilicos

Cloud Army Projects

Easily Launch Multiple Searches

AWS bills by the hour (1 minute of computation costs the same as 60), so it makes sense to keep a cluster
running for multiple searches instead of bringing a cluster up and down for each search. MR
-
Tandem will
accept a file containing a list of parameter files instead of an actual parameter file, and each search will be
executed in turn on the cluster before it is finally shut down.

Simple Setup

MR
-
Tandem is a Python script that manages
your connection to AWS or any Hadoop cluster.
A convenient installer for Windows provides the
proper version of Python and needed Python
modules.


By default MR
-
Tandem provisions the Hadoop
-
enabled X!Tandem executable from a public S3
site. You can override this if needed, and code
for building your own X!Tandem executable is in
the Sashimi project on Sourceforge.


The MR
-
Tandem Python script can be found in
the Insilicos Cloud Army (ICA) project:

http://ica.svn.sourceforge.net/viewvc/ica/trunk



Why Hadoop?

MR
-
Tandem isn’t the only parallel implementation of X!Tandem. Standard
X!Tandem already supports
threaded execution, which is useful on multicore processors. The X!!Tandem project uses MPI to spread
that threading model across multiple compute nodes, and MR
-
Tandem uses Hadoop to do the same thing.
All three will yield identical results as long as the thread/node counts are the same.


The problem with X!!Tandem is that MPI is notoriously brittle as cluster size goes up. Specialized
hardware is needed to satisfy MPI’s intolerance of network latency, and a failure anywhere in the MPI ring
results in the entire search failing. In practice we find that it is very difficult to instantiate an MPI ring of
more than about 10 nodes on AWS.


MR
-
Tandem’s use of Hadoop puts fault tolerance is at the core of its design, and MR
-
Tandem will happily
run on commodity hardware. If a worker node should fail for any reason, its workload is transparently
reassigned and the search will be completed despite the failure.


# Nodes

t (hh:mm:ss)

Speedup

1

36:07:02

2

29:19:26

1.23

5

11:32:09

3.13

10

6:41:31

5.40

20

4:05:30

8.83

50

1:45:23

20.56

100

1:17:03

28.13

200

1:09:26

31.21

400

1:09:26

31.21

The table to the right shows the completion time for a
233 MB mzXML file searched against a 32.5 MB
database with PTMs using MR
-
Tandem. The single
node case was run in traditional X!Tandem mode with
two threads, the multinode cases are all single threaded.


Note that as node counts become very high the task of
processing the database to create theoretical spectra
begins to eclipse the task of comparing them to the
actual spectra, and each node is doing essentially
identical work. Adding more nodes doesn’t help at that
point.

Enter the Cloud

MR
-
Tandem uses Amazon Web Services (AWS) Simple Storage Service (S3) and Elastic Map Reduce (EMR) to
create on
-
demand compute clusters and secure redundant data stores. Unlike a traditional cluster, your up front
costs are zero, and you never have to replace a burned out node. Your data is automatically backed up across
multiple data centers. You don’t have to share your cluster with anybody and you don’t have to ask anybody for
time on it. Your cluster can be just as big or small as you need it to be for the job at hand, and when the job is
complete the cluster just goes away.


Some of the biggest names on the web (Yelp, Netflix, IMDB, etc) use AWS. They decided AWS could do a better,
more cost effective job of managing their mission critical compute and storage infrastructure than they could
themselves. Your lab is going to do better than that?

Acknowledgments

It all begins with X!Tandem, of course.
http://www.thegpm.org/TANDEM/index.html



The X!!Tandem project provided insights and some code for expanding the X!Tandem threading model onto
networked compute nodes.
http://wiki.thegpm.org/wiki/X!!Tandem



The Ensemble Cloud Army project provided valuable insights into distributed computing on AWS.
http://sourceforge.net/projects/ica/



This work was funded in part by NIH grant # HG006091.

Insilicos Cloud Army (ICA) was developed as a general purpose framework for parallel
computing using AWS. It manages the uploading of data files and computational code
from a user’s local computer to AWS, the distribution of work across an AWS cluster,
and the compilation and return of results to the local computer. It has been adapted to
support cluster communication using both MPI and Elastic MapReduce, and can
deploy either self
-
contained compute engines like X!Tandem or user
-
defined code.


An example of another ICA project is
Ensemble Cloud Army
(ECA). ECA is designed
for parallel computation of machine learning ensembles (e.g. ensembles of decision
trees or neural networks). Ensembles achieve classification accuracy superior to that
of an single model by running large numbers (100’s to 1000’s) of base classifiers, and
aggregating their predictions via a voting process. The training and prediction
processes of each base classifier are independent of the other base classifiers, so the
ensemble as a whole is ideally suited to parallelization on a cluster.


In the current implementation of ECA, machine learning computations are driven by
scripts in the statistical language R. The Figure below shows typical improvements in
accuracy obtained with ECA on large datasets (covertype = 581,012 land usage
records; Jones = 227,260 protein secondary structure records; 90 to 180 base
classifiers, 3 to 60 cluster nodes). Speedups were qualitatively similar to those seen
with MR
-
Tandem.


No new algorithms


No new search settings


Existing multithread model running on
networked nodes instead of onboard cores


So it’s actually just X!!Tandem


Except it’s Hadoop, not MPI


Hadoop is an open source implementation of
the MapReduce algorithm invented at Google


Classically used for text search, but also a
robust general framework for distributed
computing


Designed for cheap commodity hardware,
and lots of it


More scalable and fault tolerant than MPI



Data is just getting bigger, searches getting
longer


Researchers short on compute power may
shortchange themselves on search:


Overly selective database


Insufficient decoys


Too few PTMs


Too small


Too big


Too expensive


You don’t own one


At about $0.10 per hour per core you
probably can’t run a cluster as cheaply as
they can


Especially since you have no up front costs


Data and results stored in 3 data centers


Good enough for Netflix, probably good
enough for the average lab


MR
-
Tandem will run on any Hadoop system,
though


Computing as a utility


Prediction: next wave of young investigators
will never own clusters


Prediction: granting agencies will soon view
compute power as a consumable


more like
a reagent than a mass spec