The Limitation of MapReduce: A Probing Case and a Lightweight Solution


Zhiqiang Ma, Lin Gu


Department of Computer Science and Engineering

The Hong Kong University of Science and Technology

CLOUD COMPUTING 2010, November 21-26, 2010, Lisbon, Portugal


MapReduce

MapReduce: a parallel computing framework for large-scale data processing

Successfully used in datacenters comprising commodity computers

A fundamental piece of software in the Google architecture for many years

An open-source variant already exists: Hadoop

Widely used in solving data-intensive problems


Introduction to MapReduce

Map and Reduce are higher-order functions

Map: apply an operation to all elements in a list

Reduce: like "fold"; aggregate the elements of a list

[Figure: Map applies m: x → x² to the list (1, 2, 3, 4, 5), producing (1, 4, 9, 16, 25); Reduce folds the results with r: + from the initial value 0 through the partial sums 1, 5, 14, 30 to the final value 55. That is, 1² + 2² + 3² + 4² + 5² = 55.]
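To make the figure concrete, the same computation can be written with map and reduce as ordinary higher-order functions. A minimal C sketch; the names map, reduce, square, and add are ours, for illustration only:

// map_reduce_demo.c: compute 1^2 + 2^2 + ... + 5^2 with map and reduce
#include <stdio.h>

/* Map: apply op to every element of in[], writing results to out[]. */
static void map(int (*op)(int), const int *in, int *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = op(in[i]);
}

/* Reduce: fold the list with op, starting from the initial value init. */
static int reduce(int (*op)(int, int), int init, const int *in, int n) {
    int acc = init;
    for (int i = 0; i < n; i++)
        acc = op(acc, in[i]);
    return acc;
}

static int square(int x)     { return x * x; }
static int add(int a, int b) { return a + b; }

int main(void) {
    int xs[5] = {1, 2, 3, 4, 5}, squares[5];
    map(square, xs, squares, 5);
    printf("%d\n", reduce(add, 0, squares, 5));  /* prints 55 */
    return 0;
}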

Introduction to MapReduce

Massively parallel processing made simple

Example: word count

Map: parse a document and generate <word, 1> pairs

Reduce: receive all pairs for a specific word, and count

Map:
// D is a document
for each word w in D
    output <w, 1>

Reduce for key w:
count = 0
for each input item
    count = count + 1
output <w, count>
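For concreteness, here is a runnable word-count pair in the style of Hadoop Streaming, where the mapper and reducer are separate programs connected by an external sort. The two-program structure and the file names are our assumptions for illustration; the slide's pseudocode is framework-neutral.

// wc_mapper.c: emit a <word, 1> pair for every word on stdin
#include <stdio.h>

int main(void) {
    char w[256];
    while (scanf("%255s", w) == 1)
        printf("%s\t1\n", w);
    return 0;
}

// wc_reducer.c: input is sorted by word; sum the counts for each word
#include <stdio.h>
#include <string.h>

int main(void) {
    char w[256], prev[256] = "";
    long n, count = 0;
    while (scanf("%255s %ld", w, &n) == 2) {
        if (strcmp(w, prev) != 0) {          /* a new key begins */
            if (prev[0]) printf("%s\t%ld\n", prev, count);
            strcpy(prev, w);
            count = 0;
        }
        count += n;
    }
    if (prev[0]) printf("%s\t%ld\n", prev, count);
    return 0;
}

Run as ./wc_mapper < doc.txt | sort | ./wc_reducer; a MapReduce framework performs the same shuffle-and-sort between the two phases, but across many machines.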

Thoughts on MapReduce

MapReduce provides an easy-to-use framework for parallel programming.

But is it good for general programs running in datacenters?

Our work

Analyze MapReduce's design and use a case study to probe the limitation

Design a new parallelization framework: MRlite

Evaluate the new framework's performance

Design a general parallelization framework and programming paradigm for cloud computing

Thoughts on MapReduce

Originally designed for processing large static data sets

No significant dependence

Throughput over latency

Large-data parallelism over small, maybe-ephemeral parallelization opportunities



[Figure: data flows from Input through MapReduce to Output.]

The limitation of MapReduce

One-way scalability

Allows a program to scale up to process very large data sets

Constrains the program's ability to process moderate-size data items

Limits the applicability of MapReduce

Difficult to handle dynamic, interactive and semantic-rich applications

A case study on MapReduce

Distributed compiler

Very useful in development environments

Code (data) has dependence

Abundant parallelization opportunities

A "typical" application, but a hard case for MapReduce

[Figure: fragment of the "make -j N" dependence graph for the Linux kernel, with object files such as init/version.o, driver/built-in.o, and mm/built-in.o feeding the intermediate targets vmlinux-main and vmlinux-init, which combine with kallsyms.o to produce vmlinux.]

A case study: mrcc

Develop a distributed compiler using the MapReduce model

How to extract the parallelizable components in a relatively complex data flow?

mrcc: a distributed compilation system

The workload is parallelizable but data-dependence constrained

Explores parallelism using the MapReduce model

mrcc

Multiple machines available to MapReduce for parallel compilation

A master instructs multiple slaves ("map workers") to compile source files

Design of mrcc

[Figure: "make -j N" runs on the master; "make" explores parallelism among compiling source files and submits MapReduce jobs for compiling them; each map task, running on a slave, compiles an individual file.]
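The slides do not show mrcc's code, but the shape of a map task is easy to picture. A hypothetical C sketch, assuming the framework has already staged the source file onto the slave and will collect the object file afterwards; the command-line interface and the call to gcc are illustrative assumptions:

// compile_task.c: hypothetical mrcc-style map task compiling one file
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    char cmd[1024];
    if (argc != 3) {
        fprintf(stderr, "usage: %s <source.c> <output.o>\n", argv[0]);
        return 1;
    }
    /* The map task does nothing but run the real compiler on one file;
     * data movement (e.g., via HDFS) happens outside this program. */
    snprintf(cmd, sizeof cmd, "gcc -c %s -o %s", argv[1], argv[2]);
    return system(cmd) == 0 ? 0 : 1;
}

This shape also hints at why per-job overhead matters so much: each task often runs for well under a second on a typical source file, so any fixed startup or data-staging cost is paid once per file.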

Experiment: mrcc over Hadoop

MapReduce implementation: Hadoop 0.20.2

Testbed: 10 nodes available to Hadoop for parallel execution; nodes are connected by 1 Gbps Ethernet

Workload: compiling the Linux kernel, ImageMagick, and Xen tools

Result and observation

The compilation using mrcc on 10 nodes is 2~11 times slower than sequential compilation on one node.

Project        Time for gcc (sequential compilation, min)   Time for mrcc/Hadoop (min)
Linux kernel   49                                           151
ImageMagick    5                                            11
Xen tools      2                                            24

mrcc: Distributed Compilation

Where does the slowdown come from?

For compiling each source file:
putting source files into HDFS: >2 s
starting a Hadoop job: >20 s
retrieving object files: >2 s

Network communication overhead for data transportation and replication

Tasking overhead

With more than 24 seconds of fixed overhead per file, a compilation that takes gcc only a fraction of that time cannot be accelerated, no matter how many nodes are added.

Is there sufficient parallelism to exploit? Yes: "distcc" serves as the baseline.

One-way scalability in the (MapReduce) design and (Hadoop) implementation.

MapReduce is not designed for compiling. We use this case to show some of its limitations.

Parallelization framework

MapReduce/Hadoop is inefficient for general programming

Cloud computing needs a general parallelization framework!

Handle applications with complex logic, data dependence, frequent updates, etc.

39% of Facebook's MapReduce workload has only 1 Map [Zaharia 2010]

Easy to use and high performance

[Zaharia 2010] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. EuroSys '10, 2010.

Lightweight solution: MRlite

A lightweight parallelization framework following the MapReduce paradigm

Parallelization can be invoked when needed

Able to scale "up" like MapReduce, and scale "down" to process moderate-size data

Low latency and massive parallelism

Small run-time system overhead

A general parallelization framework and programming paradigm for cloud computing

Architecture of MRlite

[Figure: an application, the MRlite client, the MRlite master with its scheduler, several slaves, and high-speed distributed storage; arrows distinguish data flow from command flow.]

MRlite client: linked together with the app, the MRlite client library accepts calls from the app and submits jobs to the master

MRlite master: accepts jobs from clients and schedules them to execute on slaves

Slaves: distributed nodes that accept tasks from the master and execute them

High-speed distributed storage: stores intermediate files

Design

Parallelization is invoked when needed

An application can request parallel execution an arbitrary number of times

The program's natural logic flow is integrated with parallelism

This removes one important limitation

Facility outlives utility: use and reuse threads for the master and slaves

Memory is the "first class" medium: avoid touching hard drives

Design

Programming interface

Provides a simple API that allows programs to invoke parallel processing during execution (a hypothetical sketch of such an API follows below)

Data handling

A network file system stores files in memory

No replication for intermediate files

Applications are responsible for retrieving output files

Latency control

Jobs and tasks have timeout limits
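The slides do not define the API itself, so the following is only a hypothetical C sketch of what such a client interface could look like; every name, type, and parameter here is an illustrative assumption, not MRlite's real interface:

/* mrlite_sketch.h: hypothetical MRlite client API, for illustration only */
typedef struct mrlite_job mrlite_job;

/* Describe a job by the commands its map and reduce tasks should run. */
mrlite_job *mrlite_job_create(const char *map_cmd, const char *reduce_cmd);

/* Add one input file (e.g., a path on the in-memory network file system). */
int mrlite_job_add_input(mrlite_job *job, const char *path);

/* Submit the job to the master and block until it finishes or until
 * timeout_ms expires, reflecting the latency-control design above. */
int mrlite_job_run(mrlite_job *job, int timeout_ms);

void mrlite_job_destroy(mrlite_job *job);

/* Typical use (illustrative):
 *   mrlite_job *j = mrlite_job_create("./compile_task %in %out", NULL);
 *   mrlite_job_add_input(j, "/mrlite/src/foo.c");
 *   int rc = mrlite_job_run(j, 30000);   // 30 s timeout
 *   mrlite_job_destroy(j);
 */

An application could call something like mrlite_job_run at any point in its control flow, and as many times as needed, which is exactly the "parallelization invoked when needed" property the design emphasizes.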

Implementation

Implemented in C as Linux applications

Distributed file storage: implemented with NFS in memory, mounted from all nodes; stores intermediate files

A specially designed distributed in-memory network file system may further improve performance (future work)

There is no limitation on the choice of programming languages

Evaluation

Re-implement mrcc on MRlite

It is not difficult to port mrcc because MRlite can handle a "superset" of the MapReduce workloads

Testbed and workload

Use the same testbed and the same workload to compare MRlite's performance with MapReduce/Hadoop's

Result

[Chart: compilation time in seconds]
Project        gcc (on one node)   mrcc/Hadoop   mrcc/MRlite
Linux kernel   2936                9044          506
ImageMagick    312                 653           50
Xen tools      128                 1419          65

The compilation of the three projects using mrcc on MRlite is much faster than compilation on one node. The speedup is at least 2 and the best speedup reaches 6.

MRlite vs. Hadoop

The average speedup of MRlite is more than 12 times better than that of Hadoop.

Project        Speedup on Hadoop   Speedup on MRlite   MRlite vs. Hadoop
Linux          0.32                5.8                 17.9
ImageMagick    0.48                6.2                 13.0
Xen tools      0.09                2.0                 22.0

The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.

Conclusion

Cloud computing needs a general programming framework

Cloud computing shall not be a platform that runs only simple OLAP applications; it is important to support complex computation and even OLTP on large data sets

Use the distributed compilation case (mrcc) to probe the one-way scalability limitation of MapReduce

Design MRlite: a general parallelization framework for cloud computing

Handles applications with complex logic flow and data dependencies

Mitigates the one-way scalability problem

Able to handle all MapReduce tasks with comparable (if not better) performance

Conclusion

Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU

MRlite respects applications' natural logic flow and data dependencies

This separation of parallelization capability from application logic enables MRlite to integrate GPGPU processing very easily (future work)

Thank you!