# Query Optimization for Map Reduce Systems

Nov 18, 2013


Presented by: AISHWARYA G (113050042)

Under the guidance of: Dr. S. Sudarshan

## Agenda

- Introduction
- Parallel programming models
- Optimization
- Pipelining
- Incremental computation
- Other optimization techniques
- Future work
- Conclusion

## Introduction

### Motivation

- Analytical queries running on MapReduce (a parallel processing architecture) systems are complex, process huge amounts of data, and are very expensive.
- There is great benefit if we can optimize the execution of these queries.

### What is MapReduce?

- A simplified programming model to process huge amounts of unstructured data.
- Computations are specified as the functional programming primitives `map` and `reduce`.
- The runtime automatically parallelizes the computation across the cluster, distributes the input data, and provides fault tolerance and reliability.

## Parallel Programming Models

- Relieve the programmer from the burden of building a resilient and scalable distributed computing environment.
- In this section:
  - MapReduce
  - Microsoft's Dryad
  - High-level declarative languages
### MapReduce

- An implementation by Google based on functional programming primitives.
- Provides a simple interface for programmers to execute queries on large data sets, typically terabytes of data.
- Programmers specify computations in terms of the `map` and `reduce` primitives; the framework automatically parallelizes the computation.
### Map and Reduce Primitives

- `map` is applied to every logical record in the input and generates a set of intermediate key/value pairs.
- The framework groups the intermediate values by intermediate key and passes each key, together with its list of values, to the `reduce` function.
- `reduce` reduces this list of values.

Type signatures:

    map    (k1, v1)       → list(k2, v2)
    reduce (k2, list(v2)) → list(v2)
### WordCount Example

Counting the number of occurrences of each word in a very large data set:

    function map(String key, String value)
        // key: line number
        // value: contents of the line
        for each word in value:
            EmitIntermediate(word, "1");

    function reduce(String key, List<String> values)
        // key: word
        // values: list of counts
        result = 0;
        for each value in values:
            result += ParseInt(value);
        Emit(AsString(result));
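The pseudocode above can be exercised with a small, self-contained simulation of the three MapReduce phases — map, group-by-key, reduce. The function names mirror the slide's pseudocode, but the single-process driver is purely illustrative, not part of any framework:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # key: line number, value: contents of the line
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: word, values: list of counts
    yield (key, sum(values))

def run_mapreduce(lines):
    # Map phase: apply map_fn to every input record.
    intermediate = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
    # Shuffle phase: group intermediate pairs by intermediate key.
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g]) for k, g in groupby(intermediate, key=itemgetter(0)))
    # Reduce phase: apply reduce_fn to each (key, list-of-values) group.
    return dict(kv for k, vs in grouped for kv in reduce_fn(k, vs))

counts = run_mapreduce(["to be or not", "to be"])
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

In a real deployment the map and reduce calls run on different machines and the sort-and-group step is the distributed shuffle; here all three phases collapse into one process.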

### Working of MapReduce

[Figure: the user submits a job via the client to the job tracker, which assigns work to task trackers. Input file splits (Split 0–3) feed the map phase; map output is written locally as intermediate files, which are remotely read in the reduce phase to produce the output files.]

### Fault Tolerance & Other Optimizations in MapReduce

- Fault tolerance:
  - Periodic heartbeats
  - Re-execution of failed tasks
  - Materialization of the output of both map and reduce
- Tries to run map and reduce tasks on local copies of the data.
- An optional combiner function can be applied locally on map output: local pre-aggregation.
- Execution of backup tasks prevents stragglers from affecting the overall execution time of the job.
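The combiner's effect can be sketched locally: before a mapper's output leaves the node, pairs with the same key are pre-aggregated, shrinking what must be shuffled across the network. This is a minimal illustration of the idea, not Hadoop's actual combiner API:

```python
from collections import Counter

def combine(map_output):
    """Locally pre-aggregate (word, count) pairs emitted by one mapper."""
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())

# One mapper's raw output for the line "to be or not to be":
raw = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
shuffled = combine(raw)
# Six pairs shrink to four before crossing the network;
# the reducer still computes the same totals.
```

The combiner is only safe because summation is associative and commutative; the reducer cannot tell whether it received raw or pre-aggregated pairs.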

### Dryad

- A general-purpose distributed execution engine for coarse-grained data-parallel applications, from Microsoft.
- The end user's problem is specified as a DAG, with vertices as computations and edges as data channels.
- Vertex programs: map, reduce, distribute, joins, etc.
- Channels are implemented as TCP pipes, shared memory, or files.

### Dryad DAG for the WordCount Example

[Figure: three input vertices (I) feed map vertices (M) that hash and distribute their output; sort vertices (S) feed two reduce vertices (R), which write to output vertices (O).]

### Dryad vs MapReduce

- Computation can be expressed as any number of stages, not limited to map and reduce steps as in MapReduce.
- Graph vertices are allowed to use an arbitrary number of inputs and outputs, unlike in MapReduce.
- Encapsulation of vertices: data is pipelined between these vertices.
- Similarities to MapReduce:
  - Re-execution of failed vertices
  - Backup mechanism
- Runtime graph refinement (like combiners): pre-aggregation on a set of vertices, applied recursively to form an aggregation tree.

### High-Level Declarative Languages

- The MapReduce programming model and DAGs are low-level specifications.
- End users familiar with query languages such as SQL found it difficult to write MapReduce programs even for simple queries.
- SQL-like declarative languages that provide a high-level structured abstraction were developed: Yahoo!'s Pig, Hive, and SCOPE (over Dryad).

## Optimization

### Optimizing Query Processing in MapReduce

Many optimizations are possible. A few aspects of optimization are studied here:

- Pipelining: data is pushed from producers to consumers effectively, as and when it is produced.
- Incremental computation: previous state is made use of.
- Reducing redundant I/O by sharing input scans between multiple jobs.
- Online aggregation: provide a near approximation of the result while the job is in progress.

## Pipelining

- MapReduce systems follow a pessimistic approach towards fault tolerance: output from map and reduce tasks is materialized before it can be consumed.
- Some systems follow a less pessimistic approach to attain potential gains:
  - Hyracks
  - MapReduce Online
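The contrast between materialization and pipelining can be sketched with Python generators: a materialized stage builds its full output before the consumer starts, while a pipelined stage hands each record downstream as soon as it is produced. The stage functions here are illustrative placeholders, not any system's API:

```python
def materialized_stage(records):
    # Pessimistic style: fully materialize the stage's output first.
    return [r.upper() for r in records]

def pipelined_stage(records):
    # Pipelined style: push each record downstream as it is produced.
    for r in records:
        yield r.upper()

records = ["a", "b", "c"]

# The consumer of the pipelined stage sees the first record immediately,
# without waiting for the whole stage to finish.
first = next(pipelined_stage(iter(records)))
assert first == "A"

# Both styles ultimately produce the same result.
assert list(pipelined_stage(records)) == materialized_stage(records)
```

The trade-off the slides describe is visible even here: the materialized list survives if the consumer crashes and restarts, while the generator's output is gone and must be recomputed from its input.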

### Hyracks

- A partitioned-parallel dataflow execution platform that runs on shared-nothing clusters of computers.
- Jobs are DAGs of operators and connectors.
- Operators: file mappers, sorters, joiners, and aggregators.
- Connectors: M:N hash partitioner, M:N hash-partitioning merger, M:N range partitioner, M:N replicator, and 1:1 connector.

### Hyracks DAG for the WordCount Example

[Figure: Stage 1 contains three pipelines of input (I), map (M), hash-and-distribute, and sort operators; Stage 2 contains the reduce and output (O) operators. Edges within a stage are non-blocking; the edge between the two stages is blocking.]

### Hyracks vs MapReduce

- Reduced disk contention, due to pipelining of data between operators; in MapReduce, all reducers try to pull their partitions from all the mappers at the end of the map phase.
- Reduced job startup time, thanks to push-based job activation.
- Drawback: re-execution of all tasks in a stage even if a single task fails.

### MapReduce Online

Goal: a modified MapReduce architecture that supports:

- Pipelining of intermediate data between operators
- Online aggregation: produce approximate results while the job is in progress
- Support for incremental computation

### Pipelining in MapReduce Online

- Naïve pipelining: each record is pushed to the reducer as soon as it is produced.
- A refined pipelining scheme:
  - Accumulate map output, apply the combiner, and sort.
  - Push data to the reducer if the network is not a bottleneck and the reducer is able to keep up with the data from many mappers.
  - Else accumulate until it can be sent, combining and sorting before sending.

### Fault Tolerance in MapReduce Online

- Output from unfinished map tasks is marked tentative and ignored if the map task fails; the failed task is re-executed.
- A possible future work, checkpointing: a map task notifies the jobtracker that it has processed up to offset x, and a new map task resumes from x.
- Map tasks retain their output on local disk until the job completes; a failed reduce task is re-executed by sending all map outputs to it.

## Incremental Computation

### Why Incremental Computation?

- Large-scale computations have a large input data set to which incoming data gets incrementally added, and are often run on a daily basis.
- MapReduce and Dryad re-run the computation from scratch over the entire repository, ignoring the work done in the previous runs.
- The time of computation is proportional to the size of the entire repository, rather than to the size of the data that arrived since the computation was last run.

### Percolator

- A system by Google to provide an improved indexing mechanism.
- Aims at reducing the time between when a page is found and when it is available in search.
- In the previous system, based upon MapReduce, the whole repository was re-processed whenever pages were newly crawled.
- Percolator is built over Bigtable. Indexing for each new page is run incrementally by providing:
  - random access to the repository through indexing
  - multi-row transactions
  - observers: triggered by changes in columns of Bigtable, applied in an eager fashion

### DryadInc

- Tries to find solutions that automate the process of incrementalizing the computation.
- Two approaches:
  - Identical computation: reuse partial results by caching.
  - Mergeable computation: users specify a merging function M such that F(I + ΔI) = M(F(I), F(ΔI)).
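The mergeable-computation contract F(I + ΔI) = M(F(I), F(ΔI)) can be checked concretely for word count, where F computes per-word counts and the user-supplied merge function M adds the two count maps. This is a sketch of the contract only, not the DryadInc API:

```python
from collections import Counter

def F(docs):
    """The original computation: word counts over a set of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def M(old_result, delta_result):
    """User-specified merge function: add the two count maps."""
    return old_result + delta_result

I = ["to be or not", "to be"]       # previously processed input
delta = ["not to be"]               # newly arrived input (ΔI)

# The incremental run reuses a cached F(I) and only computes F(delta),
# yet the merged result equals recomputing from scratch over I + delta.
assert M(F(I), F(delta)) == F(I + delta)
```

Word count is mergeable because counting is a sum; a computation like "median word length" would need a much richer state than the final result for M to exist.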

### Identical Computation

- Run a BFS to find the stage in the DAG at which all the vertices get affected by ΔI, and cache the channel inputs to this stage.

[Figure: when a new input I4 arrives, the cached output of M on I1, I2, and I3 — the identical sub-DAG — is reused; only the map over I4 is recomputed before the sort, reduce, and output stages run.]

### Mergeable Computation

[Figure: the cached result of the previous DAG run over I1–I3 is merged with the result of running map, sort, and reduce over the new input I4, producing the final output O.]

### Continuous Workflows Using Nova

- A workflow manager deployed at Yahoo! that carries out incremental processing of continuously arriving data through Pig programs.
- A workflow is a graph of vertices and edges.
- Vertices and data containers can be stateful or stateless, incremental or non-incremental.
- Edges are annotated as ALL, NEW, B, and Δ.

### Nova Workflow for WordCount

[Figure: the Nova workflow for the WordCount example.]

## Other Optimization Techniques

### Comet: Sharing Scans

- An optimization over DryadLINQ: the scanning of input is shared among multiple queries.
- A TEE operator connects one input-stream provider with multiple concurrent consumers.
- Sub-queries from different queries are aligned to form a single jumbo query.
- The jumbo query is optimized to remove redundancies, guided by a cost model.
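Python's standard library happens to ship the same fan-out idea in miniature: `itertools.tee` splits one iterator into several independent consumers, so a single pass over the input can feed multiple "queries". This only illustrates the shared-scan concept, not Comet's distributed TEE operator:

```python
from itertools import tee

def scan_input():
    # One simulated input scan, shared by the consumers below.
    for n in [3, 1, 4, 1, 5]:
        yield n

# TEE: one input-stream provider, two concurrent consumers.
consumer_a, consumer_b = tee(scan_input(), 2)

total = sum(consumer_a)     # sub-query 1: sum of the scanned values
maximum = max(consumer_b)   # sub-query 2: maximum of the scanned values
assert (total, maximum) == (14, 5)
```

As with Comet, the benefit hinges on the consumers staying roughly in step: `tee` buffers whatever one consumer has read but another has not, so wildly divergent consumers pay for the sharing in memory.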

### Online Aggregation

- An option provided by MapReduce Online: users need not wait for results till the job has completed.
- Map output is pipelined to the reducers, and the reduce function is applied to what each reducer has received so far.
- Also possible in the case of multiple MapReduce jobs.
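Online aggregation can be mimicked by applying the reduce function to whatever a reducer has received so far, yielding progressively better snapshots of the answer. The snapshot loop below is an illustration of the idea, not MapReduce Online's interface:

```python
from collections import Counter

def snapshots(pipelined_pairs):
    """Yield an approximate word-count result after every arriving pair."""
    partial = Counter()
    for word, count in pipelined_pairs:
        partial[word] += count      # reduce applied to the data seen so far
        yield dict(partial)         # snapshot visible to the user early

stream = [("to", 1), ("be", 1), ("to", 1)]
results = list(snapshots(stream))
# Early snapshot is approximate; the last snapshot is the exact answer.
assert results[0] == {"to": 1}
assert results[-1] == {"to": 2, "be": 1}
```

For word count every snapshot is a valid undercount that only grows toward the true result; for non-monotonic aggregates (e.g. averages) a snapshot is merely an estimate.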

## Future Work

- Developing an efficient query processing framework over systems like MapReduce/Hyracks.
- Translating effectively from high-level languages like Hive and Pig to a query plan to be executed on MapReduce/Hyracks.
- Applying optimizations like incremental computation and pipelining to MapReduce/Hyracks.

## Conclusion

- There has been an explosion in the amount of data due to the growth of the web.
- Massively parallel processing architectures have simplified query processing on this huge amount of data.
- Optimization techniques for efficient query processing in these systems will reduce job completion time.

## References

1. Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 1151–1162, Washington, DC, USA, 2011. IEEE Computer Society.
2. Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI '10, pages 21–21, Berkeley, CA, USA, 2010. USENIX Association.
3. Christopher Olston, Greg Chiou, Laukik Chitnis, Francis Liu, Yiping Han, Mattias Neumann, Vellanki B. N. Rao, Vijayanand Sankarasubramanian, Siddharth Seth, Chao Tian, Topher ZiCornell, and Xiaodan Wang. Nova: continuous Pig/Hadoop workflows. In Proceedings of the 2011 International Conference on Management of Data, SIGMOD '11, pages 1081–1090, New York, NY, USA, 2011. ACM.
4. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), pages 137–150, December 2004.
5. Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Bing Su, Wei Lin, and Lidong Zhou. Comet: batched stream processing for data intensive distributed computing. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 63–74, New York, NY, USA, 2010. ACM.
6. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59–72, New York, NY, USA, 2007. ACM.
7. Daniel Peng and Frank Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1–15, Berkeley, CA, USA, 2010. USENIX Association.
8. Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard. DryadInc: reusing work in large-scale computations. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, HotCloud '09, Berkeley, CA, USA, 2009. USENIX Association.
9. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. Hive — a petabyte scale data warehouse using Hadoop. Data Engineering, International Conference on, 0:996–1005, 2010.

Thank you!