The Limitation of MapReduce: A Probing Case and a Lightweight Solution

Zhiqiang Ma, Lin Gu
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

CLOUD COMPUTING 2010, November 21-26, 2010, Lisbon, Portugal
MapReduce

MapReduce: a parallel computing framework for large-scale data processing
Successfully used in datacenters comprising commodity computers
A fundamental piece of software in the Google architecture for many years
Open source variant already exists: Hadoop
Widely used in solving data-intensive problems
Introduction to MapReduce

Map and Reduce are higher-order functions
Map: apply an operation to all elements in a list
Reduce: like "fold"; aggregate the elements of a list
[Figure: Map applies m: x → x² to the list (1, 2, 3, 4, 5), producing (1, 4, 9, 16, 25); Reduce folds the results with r: + starting from the initial value 0, producing the running values 1, 5, 14, 30 and the final value 1² + 2² + 3² + 4² + 5² = 55]
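A minimal sketch of the same example in Python (illustration only, not part of the original slides):

    from functools import reduce

    values = [1, 2, 3, 4, 5]
    squares = list(map(lambda x: x * x, values))        # Map: apply m(x) = x^2 to every element
    total = reduce(lambda acc, x: acc + x, squares, 0)  # Reduce: fold with "+" from the initial value 0
    print(squares, total)                               # [1, 4, 9, 16, 25] 55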
Introduction to MapReduce

Massively parallel processing made simple
Example: word count
Map: parse a document and generate <word, 1> pairs
Reduce: receive all pairs for a specific word, and count them

Map:
    // D is a document
    for each word w in D
        output <w, 1>

Reduce (for key w):
    count = 0
    for each input item
        count = count + 1
    output <w, count>
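A minimal runnable sketch of the same word-count logic in Python (illustration only; the helper names map_phase and reduce_phase are ours, and a real MapReduce run would shuffle the pairs to reducers by key):

    from collections import defaultdict

    def map_phase(document):
        # Map: emit a <word, 1> pair for every word in the document
        return [(word, 1) for word in document.split()]

    def reduce_phase(pairs):
        # Reduce: group the pairs by word and count the occurrences of each word
        counts = defaultdict(int)
        for word, one in pairs:
            counts[word] += one
        return dict(counts)

    print(reduce_phase(map_phase("to be or not to be")))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}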
Thoughts on MapReduce

MapReduce provides an easy-to-use framework for parallel programming.
But is it good for general programs running in datacenters?
Our work

Analyze MapReduce's design and use a case study to probe the limitation
Design a new parallelization framework: MRlite
Evaluate the new framework's performance

Design a general parallelization framework and programming paradigm for cloud computing
Thoughts on MapReduce

Originally designed for processing large static data sets
    No significant dependence
Throughput over latency
Large-data parallelism over small, maybe-ephemeral parallelization opportunities

[Figure: a pipeline of MapReduce jobs transforming Input into Output]
The limitation of MapReduce

One-way scalability
    Allows a program to scale up to process very large data sets
    Constrains the program's ability to process moderate-size data items
Limits the applicability of MapReduce
    Difficult to handle dynamic, interactive and semantic-rich applications
A case study on MapReduce

Distributed compiler
    Very useful in development environments
    Code (data) has dependence
    Abundant parallelization opportunities
A "typical" application, but a hard case for MapReduce

[Figure: the dependency graph explored by "make -j N" when building vmlinux from object files such as init/version.o, kallsyms.o, driver/built-in.o, mm/built-in.o, …]
A case study: mrcc

Develop a distributed compiler using the MapReduce model
    How to extract the parallelizable components in a relatively complex data flow?
mrcc: a distributed compilation system
    The workload is parallelizable but data-dependence constrained
    Explores parallelism using the MapReduce model
mrcc

Multiple machines available to MapReduce for parallel compilation
A master instructs multiple slaves ("map workers") to compile source files
Design of mrcc

[Figure: the master runs "make -j N"; "make" explores parallelism among compiling source files; mrcc submits MapReduce jobs for compiling source files, and the map task compiles an individual file on a slave]
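As a rough illustration of the map-task role described above (a hypothetical sketch, not mrcc's actual code; the compiler invocation and the paths are assumptions), the map task essentially turns one source file into one object file:

    import subprocess
    import sys

    def compile_map_task(source_path, object_path):
        # Map task for one source file: run the compiler on the slave and
        # leave the object file where the master ("make") can retrieve it.
        result = subprocess.run(["gcc", "-c", source_path, "-o", object_path])
        return result.returncode

    if __name__ == "__main__":
        sys.exit(compile_map_task(sys.argv[1], sys.argv[2]))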
Experiment: mrcc over Hadoop

MapReduce implementation
    Hadoop 0.20.2
Testbed
    10 nodes available to Hadoop for parallel execution
    Nodes are connected by 1 Gbps Ethernet
Workload
    Compiling Linux kernel, ImageMagick, and Xen tools
Result and observation

The compilation using mrcc on 10 nodes is 2~11 times slower than sequential compilation on one node.

Project       | Time for gcc (sequential compilation) (min) | Time for mrcc/Hadoop (min)
Linux kernel  | 49                                          | 151
ImageMagick   | 5                                           | 11
Xen tools     | 2                                           | 24
Where does the slowdown come from?
• Network communication overhead for data transportation and replication
• Tasking overhead

For compiling one source file:
• Put source files to HDFS: > 2 s
• Start Hadoop job: > 20 s
• Retrieve object files: > 2 s

Is there sufficient parallelism to exploit?
Yes. "distcc" serves as the baseline.

One-way scalability in the (MapReduce) design and (Hadoop) implementation.
MapReduce is not designed for compiling. We use this case to show some of its limitations.
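Taken together, these fixed costs add up to more than 2 s + 20 s + 2 s = 24 s of overhead for every source file compiled through Hadoop, which by itself typically exceeds the time gcc needs to compile a single source file.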
mrcc: Distributed Compilation
Parallelization framework

MapReduce/Hadoop is inefficient for general programming
Cloud computing needs a general parallelization framework!
    Handle applications with complex logic, data dependence, frequent updates, etc.
    39% of Facebook's MapReduce workloads have only 1 Map [Zaharia 2010]
    Easy to use and high performance

[Zaharia 2010] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys '10, 2010.
Lightweight solution: MRlite

A lightweight parallelization framework following the MapReduce paradigm
    Parallelization can be invoked when needed
    Able to scale "up" like MapReduce, and scale "down" to process moderate-size data
    Low latency and massive parallelism
    Small run-time system overhead

A general parallelization framework and programming paradigm for cloud computing
Architecture of MRlite

[Figure: the application linked with the MRlite client, the MRlite master with its scheduler, the slaves, and a high-speed distributed storage; arrows distinguish data flow from command flow]

Linked together with the application, the MRlite client library accepts calls from the application and submits jobs to the master.
The MRlite master accepts jobs from clients and schedules them to execute on slaves.
The slaves are distributed nodes that accept tasks from the master and execute them.
The high-speed distributed storage stores intermediate files.
Design

Parallelization is invoked when needed
    An application can request parallel execution an arbitrary number of times
    The program's natural logic flow is integrated with parallelism
    Removes one important limitation
Facility outlives utility
    Use and reuse threads for master and slaves (see the sketch below)
Memory is the "first class" medium
    Avoid touching hard drives
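A conceptual sketch of "facility outlives utility" in Python (MRlite itself is implemented in C, so this is only an illustration of the idea, not MRlite code): a worker pool is created once and reused each time the program's logic flow requests parallel execution.

    from concurrent.futures import ThreadPoolExecutor

    def process(item):
        # Placeholder task body; in mrcc this would be compiling one source file
        return item * item

    with ThreadPoolExecutor(max_workers=4) as pool:      # facility created once...
        first = list(pool.map(process, range(5)))        # ...parallelism invoked when needed
        second = list(pool.map(process, range(5, 10)))   # ...and reused later in the same run
    print(first, second)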
Design

Programming interface
    Provides a simple API
    The API allows programs to invoke parallel processing during execution
Data handling
    A network file system that stores files in memory
    No replication for intermediate files
    Applications are responsible for retrieving output files
Latency control
    Jobs and tasks have timeout limitations
Implementation

Implemented in C as Linux applications
Distributed file storage
    Implemented with NFS in memory; mounted from all nodes; stores intermediate files
    A specially designed distributed in-memory network file system may further improve performance (future work)
There is no limitation on the choice of programming languages
Evaluation

Re-implement mrcc on MRlite
    It is not difficult to port mrcc because MRlite can handle a "superset" of the MapReduce workloads
Testbed and workload
    Use the same testbed and the same workload to compare MRlite's performance with MapReduce/Hadoop's
Result

The compilation of the three projects using mrcc on MRlite is much faster than compilation on one node. The speedup is at least 2 and the best speedup reaches 6.

Project       | gcc on one node (sec) | mrcc/Hadoop (sec) | mrcc/MRlite (sec)
Linux kernel  | 2936                  | 9044              | 506
ImageMagick   | 312                   | 653               | 50
Xen tools     | 128                   | 1419              | 65
MRlite vs. Hadoop

The average speedup of MRlite is more than 12 times that of Hadoop.
The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.
Project       | Speedup on Hadoop | Speedup on MRlite | MRlite vs. Hadoop
Linux         | 0.32              | 5.8               | 17.9
ImageMagick   | 0.48              | 6.2               | 13.0
Xen tools     | 0.09              | 2.0               | 22.0
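These ratios follow directly from the measured times above; for example, for the Linux kernel: 2936 s / 9044 s ≈ 0.32 (speedup on Hadoop), 2936 s / 506 s ≈ 5.8 (speedup on MRlite), and 9044 s / 506 s ≈ 17.9 (MRlite vs. Hadoop).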
Conclusion

Cloud computing needs a general programming framework
    Cloud computing shall not be a platform that runs just simple OLAP applications. It is important to support complex computation and even OLTP on large data sets.
Use the distributed compilation case (mrcc) to probe the one-way scalability limitation of MapReduce
Design MRlite: a general parallelization framework for cloud computing
    Handles applications with complex logic flow and data dependencies
    Mitigates the one-way scalability problem
    Able to handle all MapReduce tasks with comparable (if not better) performance
Conclusion

Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU
MRlite respects applications' natural logic flow and data dependencies
    This modularization of parallelization capability from application logic enables MRlite to integrate GPGPU processing very easily (future work)
Thank you!