Automatic optimization of MapReduce Programs

Michael Cafarella, Eaman Jahani, Christopher Re

August 2011

MapReduce is victorious

Google statistics:

                          Aug 04    Mar 06    Sept 07    May 10
Number of jobs            29K       171K      2127K      4474K
Machine years used        217       2002      11081      39121
Input Data (TB)           3,288     52,254    403,152    946,460
Output Data (TB)          193       2,970     14,018     45,720
Average worker machines   157       268       394        368

Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]

1. Omer Trajman, Cloudera VP, http://www.dbms2.com/

MapReduce in relational land

Designers' original intention: free-formed data
  o Web-scale indexing/log processing
But many relational workloads [1]
  o Complex queries/data analysis
Caveat: MR performance lags RDBMS performance

1. Karmasphere Corporation: A study of Hadoop developers, http://karmasphere.com, 2010


Selection is Slower with MapReduce

Source: Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

Join is Even Slower

Source: Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

MR Lags in Relational Land

Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
Query processing tasks
  o No metadata, semantics, indices
  o Free-formed input is a double-edged sword

1. MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008

Manimal

Manimal is a hybrid system, combining the MapReduce programming model with well-known execution techniques
Techniques found today only in RDBMSs, but which should be in MapReduce, too.



Manimal Approach

[Diagram labels: bytecode (*.class); MR Engine; Static Analyzer; Optimizer logic; Execution Framework; optimization opportunities; execution path]

    void map(Text key, WebPage w) {
      if (w.rank > 10)
        emit(w.url, w.rank);
    }

Annotation on the code: SELECTION from B+Tree index on W.RANK

Challenges:
  o Safely detect query semantic optimization
  o How much performance gain?
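
The selection annotated on this slide can be pictured with a small, self-contained sketch. This is not Manimal's implementation: a java.util.TreeMap stands in for the B+Tree index, and the class and method names (RankIndexSketch, scan, buildIndex, indexedSelect) are hypothetical. It only illustrates that a range lookup on an index over rank can return the same URLs as the brute-force scan that the unmodified map function performs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class RankIndexSketch {
    // Hypothetical record type mirroring the WebPage fields used on the slide.
    static final class WebPage {
        final String url;
        final int rank;
        WebPage(String url, int rank) { this.url = url; this.rank = rank; }
    }

    // Baseline: examine every record, as a plain map() over the input would.
    static List<String> scan(List<WebPage> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > threshold) out.add(w.url);
        }
        Collections.sort(out);
        return out;
    }

    // Build a sorted index rank -> urls (TreeMap is an assumption, standing in for a B+Tree).
    static TreeMap<Integer, List<String>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<String>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w.url);
        }
        return index;
    }

    // Indexed selection: visit only the entries whose key is strictly greater than threshold.
    static List<String> indexedSelect(TreeMap<Integer, List<String>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (List<String> urls : index.tailMap(threshold, false).values()) {
            out.addAll(urls);
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        List<WebPage> pages = new ArrayList<>();
        pages.add(new WebPage("a.example", 5));
        pages.add(new WebPage("b.example", 11));
        pages.add(new WebPage("c.example", 10));
        pages.add(new WebPage("d.example", 42));
        System.out.println(scan(pages, 10));
        System.out.println(indexedSelect(buildIndex(pages), 10));
    }
}
```

Both calls sort their result so the two lists can be compared directly; the "same output" requirement is what makes substituting the index lookup for the scan safe.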

Manimal Contributions

Our Manimal system:
  o Detect safe relational optimizations in users' compiled MapReduce programs
Our results:
  o Runs with unmodified MapReduce code
  o Runs up to 11x faster on same code
  o Provides framework for more optimizations


Outline

Introduction
Execution Framework
Optimization/Analyzer Examples
Experiments
  o Analyzer recall
  o Performance gain
Related Work and Conclusion


Execution Framework

    public void map(Text key, WebPage w,
                    OutputCollector<Text, LongWritable> out) {
      if (w.rank > 10)
        emit(w.url, w.rank);
    }

Execution Framework

User program bytecode:

    varload value
    invokevirtual
    astore text
    ifeq

Pipeline: Analyzer -> Optimizer -> Execution

Execution Framework

Analyzer in: user program
Analyzer out: optimization descriptor, index-generation program

Optimization descriptor:

    (SELECT f, w.rank>10)

Index-generation program:

    void map(k, w) {
      out.set(indexedOutputFormat);
      emit(w.rank, (k, w));
    }

Execution Framework

Optimizer in: optimization descriptor, catalog
Optimizer out: execution descriptor

Optimization descriptor (in):

    (SELECT f, w.rank>10)

Catalog:

    /logs/log.1    /logs/log.1.idx    select src…
    /logs/log.2    /logs/log.2.idx    select src

Execution descriptor (out):

    (SELECT, log.1.idx, w.rank>10)
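
The optimizer step on this slide, which takes an optimization descriptor plus a catalog and produces an execution descriptor, might be sketched roughly as follows. Everything here is an assumption made for illustration: the class names (OptimizerSketch, OptimizationDescriptor, CatalogEntry, ExecutionDescriptor), the idea that a catalog entry records which field an index covers, and the sample catalog contents. The slides do not specify the catalog's actual format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class OptimizerSketch {
    // Hypothetical: what the analyzer reports for one input file.
    static final class OptimizationDescriptor {
        final String op, inputPath, field, predicate;
        OptimizationDescriptor(String op, String inputPath, String field, String predicate) {
            this.op = op; this.inputPath = inputPath; this.field = field; this.predicate = predicate;
        }
    }

    // Hypothetical: one catalog row linking an input file to an index built over one field.
    static final class CatalogEntry {
        final String inputPath, indexPath, indexedField;
        CatalogEntry(String inputPath, String indexPath, String indexedField) {
            this.inputPath = inputPath; this.indexPath = indexPath; this.indexedField = indexedField;
        }
    }

    // Hypothetical: what the execution stage is told to do.
    static final class ExecutionDescriptor {
        final String op, indexPath, predicate;
        ExecutionDescriptor(String op, String indexPath, String predicate) {
            this.op = op; this.indexPath = indexPath; this.predicate = predicate;
        }
        @Override public String toString() {
            return "(" + op + ", " + indexPath + ", " + predicate + ")";
        }
    }

    // Use an index only if the catalog has one for this input and this field;
    // otherwise return empty so the caller can run the program unchanged.
    static Optional<ExecutionDescriptor> plan(OptimizationDescriptor d, List<CatalogEntry> catalog) {
        for (CatalogEntry e : catalog) {
            if (e.inputPath.equals(d.inputPath) && e.indexedField.equals(d.field)) {
                return Optional.of(new ExecutionDescriptor(d.op, e.indexPath, d.predicate));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<CatalogEntry> catalog = new ArrayList<>();
        catalog.add(new CatalogEntry("/logs/log.1", "/logs/log.1.idx", "w.rank")); // sample data, assumed
        OptimizationDescriptor d =
            new OptimizationDescriptor("SELECT", "/logs/log.1", "w.rank", "w.rank>10");
        System.out.println(plan(d, catalog).map(Object::toString).orElse("no index: run unmodified"));
    }
}
```

Returning an empty Optional when no index matches reflects the fallback one would expect from a system that runs unmodified MapReduce code: if no optimization applies, the original program simply runs as written.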

Execution Framework

Execution in: execution descriptor, user program
Execution out: program output

Execution descriptor (in):

    (SELECT, log.1.idx, w.rank>10)

Program output:

    numwords 19519




An Optimization Example

    // webpage.java: SCHEMA!
    class WebPage { String url; int rank; String content; }

    // mapper.java
    void map(Text key, WebPage w) {
      if (w.url.equals("teaparty.fr"))
        emit(w.url, 1);
    }

Data-centric programming idioms == relational ops

Annotations on the code:
  o PROJECTED view: (url, null, null)
  o DIRECT-OP on compressed WebPage

Semantic Extraction

Query semantics are obvious to human readers, but not explicit in the code for the framework
EXTRACT IT!
  o Static code analysis
  o Control-flow graph and data-flow graph
  o Find opportunities: selection, projection, direct op
  o Safe optimizations: same output



Analyzer: An Example

    // webpage.java
    class WebPage { String url; int rank; String content; }

    // mapper.java
    map(Text key, WebPage w) {
      if (w.rank > 10)
        emit(w.url, w.rank);
    }

[Control-flow graph built by the Analyzer, with nodes: Fn Entry; w.rank > 10; emit(url, rank); Fn Exit]
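
One ingredient of such an analysis, checking that emit is reachable only through the true branch of the w.rank > 10 test, can be sketched on a toy graph. This is a simplified illustration and not the analyzer described in the talk: the node names and the GuardCheckSketch class are hypothetical, the graph is written by hand instead of being derived from bytecode, and a real analysis would also need data-flow information (for example, that the predicate depends only on the input record).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GuardCheckSketch {
    // Toy control-flow graph for the mapper above: node -> successor nodes.
    static Map<String, List<String>> guardedMapper() {
        Map<String, List<String>> g = new HashMap<>();
        g.put("entry", List.of("cond"));         // Fn Entry -> test w.rank > 10
        g.put("cond", List.of("emit", "exit"));  // true branch -> emit, false branch -> Fn Exit
        g.put("emit", List.of("exit"));
        g.put("exit", List.of());
        return g;
    }

    // Variant with an extra path that reaches emit without passing the test.
    static Map<String, List<String>> unguardedMapper() {
        Map<String, List<String>> g = new HashMap<>(guardedMapper());
        g.put("entry", List.of("cond", "emit"));
        return g;
    }

    // Depth-first search from start to target that refuses to use the edge skipFrom -> skipTo.
    static boolean reachable(Map<String, List<String>> g, String start, String target,
                             String skipFrom, String skipTo) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (node.equals(target)) return true;
            if (!seen.add(node)) continue;
            for (String next : g.getOrDefault(node, List.of())) {
                if (node.equals(skipFrom) && next.equals(skipTo)) continue;
                stack.push(next);
            }
        }
        return false;
    }

    // emit is guarded if removing the true-branch edge makes it unreachable from entry.
    static boolean emitOnlyViaTrueBranch(Map<String, List<String>> g) {
        return !reachable(g, "entry", "emit", "cond", "emit");
    }

    public static void main(String[] args) {
        System.out.println(emitOnlyViaTrueBranch(guardedMapper()));
        System.out.println(emitOnlyViaTrueBranch(unguardedMapper()));
    }
}
```

If any path reaches emit without going through the test, replacing the scan with an index lookup on rank could change the output, so a conservative analyzer would decline the optimization in that case.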

Current Optimizations

B+-Tree for selections
Projected views
Delta compression on numerics
Direct operation on compressed data

Hadoop compression is not semantics-aware
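
As an illustration of the "delta compression on numerics" item, here is a minimal sketch of plain delta encoding: each value is stored as its difference from the previous one, which keeps the stored numbers small when neighbouring values are close. The slides do not say which exact encoding Manimal uses, so this generic scheme and the DeltaSketch name are assumptions.

```java
import java.util.Arrays;

class DeltaSketch {
    // Replace each value by its difference from the previous value (the first is kept as-is).
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long previous = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - previous;
            previous = values[i];
        }
        return deltas;
    }

    // Rebuild the original values by keeping a running sum of the deltas.
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    public static void main(String[] args) {
        long[] ranks = {100, 103, 104, 110};
        long[] deltas = encode(ranks);
        System.out.println(Arrays.toString(deltas));
        System.out.println(Arrays.toString(decode(deltas)));
    }
}
```

The deltas only save space once they are written with a variable-length or bit-packed encoding; that step is omitted here.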




Experiments: Analyzer

Test MapReduce programs from Pavlo, SIGMOD 09:
Detected 5 out of 8 opportunities:
  o Two misses due to custom serialization class
  o Another miss requires knowledge of java.util.Hashtable semantics


Experiments: Performance

Optimize four Web page handling tasks:
  o Selection (filtering)
  o Projection (aggregation on subfield of page)
  o Join (pages to user visits)
  o User Defined Functions (aggregation)
5 cluster nodes, 123GB of data


Experiments: Performance

Description   Hadoop    Manimal   Speedup   Space Overhead
Selection     430 s     38 s      11.2      0.1%
Projection    5496 s    1856 s    2.96      20%
Join          6078 s    904 s     6.73      11.7%

Up to 11x speedup over original Hadoop
Performance comparable to DBMS-X from Pavlo
UDF not detected: running time identical



Related Work

Lots of recent MapReduce activity
  o Quincy: Task scheduling (Isard et al., SOSP 2009)
  o HadoopDB (Abouzeid et al., PVLDB 2009)
  o Hadoop++ (Dittrich et al., PVLDB 2010)
  o HaLoop (Bu et al., PVLDB 2010)
  o Twister (Ekanayake et al., HPDC 2010)
  o Starfish (Herodotou et al., CIDR 2011)
Manimal does not introduce new optimizations. It detects and applies existing optimizations to code



Lessons Learned

The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world
The Ugly: When we started this project in 2009, we underestimated interest in writing in higher-level languages (e.g., Pig Latin)




Conclusion

Manimal provides a framework for applying well-known optimization techniques to MapReduce
  o Automatic optimization of user code
  o Up to 11x speed increase
  o Provides framework for more optimizations