Query Optimization for MapReduce Systems

Presented by: AISHWARYA G (113050042)
Under the guidance of: Dr. S. Sudarshan

AGENDA

- Introduction
- Parallel programming models
- Optimization
- Pipelining
- Incremental computation
- Other optimization techniques
- Future work
- Conclusion

INTRODUCTION

MOTIVATION

- Analytical queries running on MapReduce systems (a parallel processing architecture):
  - are complex
  - process huge amounts of data
  - are very expensive
- Great benefit if we can optimize the execution of these queries

WHAT IS MAPREDUCE?

- A simplified programming model for processing huge volumes of unstructured data
- Computations are specified as two functional programming primitives: map and reduce
- The runtime automatically:
  - parallelizes the computation across the cluster
  - distributes the input data
  - provides fault tolerance and reliability


PARALLEL PROGRAMMING MODELS

- Relieve the programmer from the burden of building a resilient and scalable distributed computing environment
- In this section:
  - Google's MapReduce
  - Microsoft's Dryad
  - High-level declarative languages

MAPREDUCE

- An implementation by Google based on functional programming primitives
- Provides a simple interface for programmers to execute queries on large data sets, typically terabytes of data
- Programmers specify computations in terms of the map and reduce primitives
- The framework automatically parallelizes the computation

MAP AND REDUCE PRIMITIVES

- map
  - applied to every logical record in the input
  - generates a set of intermediate key/value pairs
- Framework
  - groups the intermediate values by intermediate key
  - passes each key and its list of values to the reduce function
- reduce
  - reduces this list of values

    map(k1, v1) → list(k2, v2)
    reduce(k2, list(v2)) → list(v2)
WORDCOUNT EXAMPLE

- Counting the number of occurrences of each word in a very large data set

    function map(String key, String value)
      // key: line number
      // value: contents of the line
      for each word in value:
        EmitIntermediate(word, "1");

    function reduce(String key, List<String> values)
      // key: word
      // values: list of counts
      result = 0;
      for each value in values:
        result += ParseInt(value);
      Emit(AsString(result));
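Illustration (not from the original slides): a minimal, self-contained Python sketch that simulates the map, shuffle/group and reduce phases of the word-count job on an in-memory list of lines.

    from collections import defaultdict

    def map_fn(line_no, line):
        # Emit an intermediate (word, 1) pair for every word in the line.
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Sum the list of counts collected for this word.
        return word, sum(counts)

    def run_wordcount(lines):
        intermediate = defaultdict(list)
        # Map phase: apply map_fn to every input record.
        for line_no, line in enumerate(lines):
            for key, value in map_fn(line_no, line):
                # Shuffle/group phase: collect values under their intermediate key.
                intermediate[key].append(value)
        # Reduce phase: apply reduce_fn to each key and its list of values.
        return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

    print(run_wordcount(["to be or not to be", "to do is to be"]))
    # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}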


WORKING OF MAPREDUCE

[Diagram: the client submits a job to the job tracker, which assigns map and reduce tasks to task trackers. Map tasks read their input splits and write intermediate files to local disk; reduce tasks remotely read the intermediate files and write the output files. Flow: input file → map phase → intermediate files → reduce phase → output files.]

FAULT TOLERANCE & OTHER OPTIMIZATIONS IN MAPREDUCE

- Fault tolerance
  - Periodic heartbeats
  - Re-execution of failed tasks
  - Materialization of the output of both map and reduce tasks
- Tries to run map and reduce tasks on local copies of the data
- Optional combiner function applied locally to a map task's output: local pre-aggregation (see the sketch below)
- Execution of backup tasks prevents straggling tasks from affecting the overall execution time of the job.
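Illustration (not from the original slides), a minimal Python sketch of the combiner idea: the output of one map task is pre-aggregated locally, so each word crosses the network at most once per map task instead of once per occurrence.

    from collections import Counter

    def combine(map_output):
        # map_output is the list of (word, 1) pairs produced by one map task.
        # Sum counts locally before the shuffle.
        combined = Counter()
        for word, count in map_output:
            combined[word] += count
        return list(combined.items())

    print(combine([("to", 1), ("be", 1), ("to", 1), ("to", 1)]))
    # [('to', 3), ('be', 1)]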


DRYAD

- A general-purpose distributed execution engine for coarse-grain data-parallel applications, from Microsoft
- The end user's problem is specified as a DAG, with vertices as computations and edges as data channels
- Vertex programs: map, reduce, distribute, joins, etc.
- Channels implemented as TCP pipes, shared memory or files.
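To make the DAG abstraction concrete, an illustrative Python sketch of the word-count graph shown on the next slide. The Vertex and Channel names here are invented for this example; they are not Dryad's actual API.

    class Vertex:
        # A vertex is a computation with any number of input and output channels.
        def __init__(self, name):
            self.name, self.inputs, self.outputs = name, [], []

    class Channel:
        # A channel connects a producer vertex to a consumer vertex; in Dryad it
        # could be a TCP pipe, a shared-memory FIFO or a file.
        def __init__(self, src, dst):
            src.outputs.append(self)
            dst.inputs.append(self)

    inputs   = [Vertex(f"I{i}") for i in range(3)]
    mappers  = [Vertex(f"M{i}") for i in range(3)]
    sorters  = [Vertex(f"S{i}") for i in range(2)]
    reducers = [Vertex(f"R{i}") for i in range(2)]
    outputs  = [Vertex(f"O{i}") for i in range(2)]

    for i, m in zip(inputs, mappers):
        Channel(i, m)                  # each input split feeds one map vertex
    for m in mappers:
        for s in sorters:
            Channel(m, s)              # hash & distribute: every mapper feeds every sorter
    for s, r in zip(sorters, reducers):
        Channel(s, r)
    for r, o in zip(reducers, outputs):
        Channel(r, o)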


I

M

Input

Map(Hash


& Distribute)

I

M

I

M

S

Sort

S

R

S

R

Reduce

O

O

Output

D
RYAD

S

DAG

FOR

WORDCOUNT

EXAMPLE

DRYAD VS MAPREDUCE

- The computation can be expressed as any number of stages, not limited to the map and reduce steps as in MapReduce
- Graph vertices are allowed to use an arbitrary number of inputs and outputs, unlike in MapReduce
- Encapsulation of vertices: data is pipelined between the encapsulated vertices
- Similarities to MapReduce:
  - Re-execution of failed vertices
  - Backup task mechanism
  - Runtime graph refinement (like combiners): pre-aggregation on a set of vertices, applied recursively to form an aggregation tree


HIGH-LEVEL DECLARATIVE LANGUAGES

- The MapReduce programming model and Dryad's DAG are low-level specifications
- End users familiar with query languages such as SQL found it difficult to write MapReduce programs even for simple tasks
- SQL-like declarative languages offering a high-level structured abstraction were therefore developed:
  - Yahoo's Pig
  - Facebook's Hive
  - DryadLINQ and SCOPE over Dryad

OPTIMIZATION

OPTIMIZING QUERY PROCESSING IN MAPREDUCE

- Many optimizations are possible
- A few aspects of optimization are studied here:
  - Pipelining: data is pushed from producers to consumers as and when it is produced
  - Incremental computation: previous state is reused
  - Reducing redundant I/O by sharing input scans between multiple jobs
  - Online aggregation: providing a close approximation of the result while the job is in progress

PIPELINING

- MapReduce systems follow a pessimistic approach towards fault tolerance:
  - output from map and reduce tasks is materialized before it can be consumed
- Some systems follow a less pessimistic approach to attain potential gains:
  - Hyracks
  - MapReduce Online
  - Dryad

HYRACKS

- A partitioned-parallel dataflow execution platform that runs on shared-nothing clusters of computers
- A job is a DAG of operators and connectors
  - Operators: file readers/writers, mappers, sorters, joiners and aggregators
  - Connectors: M:N hash partitioner, M:N hash-partitioning merger, M:N range partitioner, M:N replicator and 1:1 connector (see the sketch below)
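Illustration (not Hyracks code; the function name is invented): what an M:N hash-partitioning connector does, as a minimal Python sketch. Each producer routes every record to one of N consumer partitions based on a hash of its key, so records with the same key always reach the same consumer.

    def hash_partition(records, n_consumers):
        # Route each (key, value) record to one of n_consumers partitions.
        partitions = [[] for _ in range(n_consumers)]
        for key, value in records:
            partitions[hash(key) % n_consumers].append((key, value))
        return partitions

    producer_output = [("to", 1), ("be", 1), ("to", 1), ("not", 1)]
    for i, part in enumerate(hash_partition(producer_output, 2)):
        print(f"consumer {i}: {part}")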

HYRACKS DAG FOR THE WORDCOUNT EXAMPLE

[Diagram: Stage 1: input → map → hash & distribute → sort, on each partition; Stage 2: reduce → output. Edges within a stage are non-blocking (pipelined); the edges between the sort and reduce operators are blocking.]

HYRACKS VS MAPREDUCE

- Reduced disk contention due to pipelining of data between tasks
  - In MapReduce, all reducers try to pull their partitions from all the mappers at the end of the map phase
- Reduced job startup overhead: push-based job activation
- But: re-execution of all tasks in a stage even if a single task fails

MAPREDUCE ONLINE

- Goal: a modified MapReduce architecture that supports:
  - Pipelining of intermediate data between operators
  - Online aggregation: produce approximate results while the job is in progress
  - Support for incremental computation


PIPELINING IN MAPREDUCE ONLINE

- Naïve pipelining
  - map tasks push output to reduce tasks as and when it is produced
- An adaptive pipelining scheme (see the sketch below)
  - Accumulate map output, apply the combiner and sort
  - Push data to the reducer if the network is not a bottleneck and the reducer is able to keep up with the data from many map tasks
  - Else keep accumulating until the data can be sent, combining and sorting before sending
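A hedged sketch of the adaptive decision described above; the thresholds and helper names are invented for illustration and are not the actual MapReduce Online implementation.

    def maybe_send(buffer, network_busy, reducer_backlog, max_backlog=4):
        # Push the accumulated map output only when the network has capacity and
        # the reducer is keeping up; otherwise keep buffering and combine later.
        if network_busy or reducer_backlog > max_backlog:
            return None                      # keep accumulating
        combined = {}
        for word, count in buffer:           # apply the combiner locally
            combined[word] = combined.get(word, 0) + count
        return sorted(combined.items())      # sort before sending

    print(maybe_send([("to", 1), ("be", 1), ("to", 1)],
                     network_busy=False, reducer_backlog=1))
    # [('be', 1), ('to', 2)]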


FAULT TOLERANCE IN MAPREDUCE ONLINE

- Map task failure:
  - Output from unfinished map tasks is marked tentative and ignored if the map task fails
  - A failed map task is simply re-executed
  - A possible future work: checkpointing
    - the map task notifies the jobtracker that it has processed up to offset x
    - a new map task resumes from x
- Reduce task failure:
  - Map tasks retain their output on local disk until the job completes
  - A failed reduce task is re-executed by sending all map outputs to it

INCREMENTAL COMPUTATION

WHY INCREMENTAL COMPUTATION?

- Large-scale computations:
  - run over a large input data set to which incoming data gets added incrementally
  - run on a daily basis
- MapReduce and Dryad run the computation from scratch over the entire repository, discarding the work done in previous runs
- The computation time is proportional to the size of the entire repository rather than the size of the updates made since the computation was last run

PERCOLATOR

- A system by Google that provides an improved indexing mechanism
- Aims at reducing the time between when a page is found and when it is made available in Google search
- In the previous system, based on MapReduce, the whole repository was re-processed whenever pages were newly crawled
- Percolator is built over Bigtable
- The indexing task for each new page is run incrementally by means of:
  - random access to the repository through indexing
  - multi-row transactions
  - observers, triggered by changes in columns of Bigtable (see the sketch below)
- Updates are applied in an eager fashion
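An illustrative Python sketch of the observer idea; the Table class and its methods are invented stand-ins, not Percolator's or Bigtable's real interface. A change to one row triggers only the incremental work for that row, rather than re-processing the whole repository.

    class Table:
        # A toy stand-in for a Bigtable-like table with per-column observers.
        def __init__(self):
            self.rows, self.observers = {}, []

        def on_change(self, column, callback):
            # Register an observer that fires whenever `column` changes in any row.
            self.observers.append((column, callback))

        def write(self, row, column, value):
            self.rows.setdefault(row, {})[column] = value
            for col, callback in self.observers:
                if col == column:
                    callback(row, value)   # incremental work for this row only

    repo = Table()
    repo.on_change("raw_html", lambda row, html: print(f"re-index {row} only"))
    repo.write("http://example.com/a", "raw_html", "<html>...</html>")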


DRYADINC

- Tries to automate the task of incrementalizing the computation
- Two approaches:
  - Identical computation: reuse partial results by caching
  - Mergeable computation: users specify a merging function M such that F(I + Δ) = M(F(I), F(Δ))
- Run a BFS to find the stage in the DAG at which all the vertices are affected by Δ, and cache the channel inputs to this stage

IDENTICAL COMPUTATION

[Diagram: when a new input I4 arrives, only a new map vertex M over I4 is executed; the cached outputs of M on I1, I2 and I3 are reused, since that sub-DAG is identical to the previous run.]
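An illustrative Python sketch of the identical-computation idea (a plain cache keyed by vertex and input; the names are invented, not DryadInc's API): a vertex whose input has not changed since the previous run is answered from the cache instead of being re-executed.

    cache = {}

    def run_vertex(name, fn, input_partition):
        # Re-run the vertex only if this (vertex, input) pair was not seen before;
        # otherwise reuse the cached output of the identical sub-DAG.
        key = (name, tuple(input_partition))
        if key not in cache:
            cache[key] = fn(input_partition)
        return cache[key]

    count_words = lambda lines: sum(len(l.split()) for l in lines)
    run_vertex("M", count_words, ["to be or not to be"])     # computed and cached
    run_vertex("M", count_words, ["to be or not to be"])     # cache hit: M is not re-run
    print(run_vertex("M", count_words, ["to do is to be"]))  # new input I4: computed (5)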

MERGEABLE COMPUTATION

[Diagram: the new input I4 runs through its own map, sort and reduce vertices; the cached result of the previous DAG over I1, I2 and I3 is then combined with the new partial result by the merge vertex to produce the output.]
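An illustration of the mergeable-computation rule F(I + Δ) = M(F(I), F(Δ)) for word count, as a minimal Python sketch; the merge function here simply adds per-word counts.

    def wordcount(lines):
        counts = {}
        for line in lines:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    def merge(old_counts, delta_counts):
        # Merge function M: combine F(I) with F(Δ) without recomputing F(I + Δ).
        merged = dict(old_counts)
        for word, count in delta_counts.items():
            merged[word] = merged.get(word, 0) + count
        return merged

    cached = wordcount(["to be or not to be"])   # F(I), reused from the previous run
    delta = wordcount(["to be"])                 # F(Δ), computed on the new data only
    assert merge(cached, delta) == wordcount(["to be or not to be", "to be"])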

CONTINUOUS WORKFLOWS USING NOVA

- A workflow manager deployed at Yahoo that carries out incremental processing of continuously arriving data through Pig programs
- A workflow consists of vertices and edges
  - Vertices: tasks and data containers
  - Tasks can be stateful or stateless, incremental or non-incremental
  - Edges are annotated as ALL, NEW, B and Δ

NOVA WORKFLOW FOR WORDCOUNT

OTHER OPTIMIZATION TECHNIQUES

COMET - SHARING SCANS

- An optimization over Dryad
- Scanning of the input is shared among multiple queries: a TEE operator connects one input-stream provider with multiple concurrent consumers (see the sketch below)
- Sub-queries from different queries are aligned to form a single jumbo query
- The jumbo query is optimized to remove redundancies
- Uses a cost model
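A minimal Python sketch of the TEE idea (generic illustrative code, not Comet's API): one scan of the input feeds several consumers, so the data is read only once.

    def tee_scan(records, consumers):
        # Share a single scan of the input among multiple concurrent consumers:
        # each record is read once and handed to every registered consumer.
        for record in records:
            for consume in consumers:
                consume(record)

    lengths, uppercase = [], []
    tee_scan(["alpha", "beta"],
             [lambda r: lengths.append(len(r)),
              lambda r: uppercase.append(r.upper())])
    print(lengths, uppercase)   # [5, 4] ['ALPHA', 'BETA']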



ONLINE AGGREGATION

- An option provided by MapReduce Online
- Users need not wait for results until the job has completed
- Map output is already pipelined to the reduce tasks
- The reduce function is applied to whatever the reduce task has received so far (see the sketch below)
- Also possible in the case of multiple MapReduce jobs
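A hedged sketch of an online-aggregation snapshot for word count (illustrative only): applying the reduce logic to the pairs received so far gives an early approximation that is refined as more map output arrives.

    def snapshot(received_pairs):
        # Apply the reduce logic to the (word, count) pairs received so far,
        # producing an approximate result before the job has finished.
        approx = {}
        for word, count in received_pairs:
            approx[word] = approx.get(word, 0) + count
        return approx

    stream = [("to", 1), ("be", 1), ("to", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
    print(snapshot(stream[:3]))   # early approximation: {'to': 2, 'be': 1}
    print(snapshot(stream))       # final result: {'to': 3, 'be': 2, 'or': 1, 'not': 1}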



FUTURE WORK

- Developing an efficient query processing framework over systems like MapReduce/Hyracks:
  - Translate effectively from high-level languages like Hive and Pig to a query plan to be executed on MapReduce/Hyracks
  - Apply additional optimizations, such as incremental computation and pipelining, to MapReduce/Hyracks

CONCLUSION

- Explosion in the amount of data due to the growth of the web
- Massively parallel processing architectures have simplified query processing on this huge amount of data
- Optimization techniques for efficient query processing in these systems will reduce job completion times


REFERENCES

1. Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 1151–1162, Washington, DC, USA, 2011. IEEE Computer Society.

2. Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI '10, pages 21–21, Berkeley, CA, USA, 2010. USENIX Association.

3. Christopher Olston, Greg Chiou, Laukik Chitnis, Francis Liu, Yiping Han, Mattias Larsson, Andreas Neumann, Vellanki B. N. Rao, Vijayanand Sankarasubramanian, Siddharth Seth, Chao Tian, Topher ZiCornell, and Xiaodan Wang. Nova: continuous Pig/Hadoop workflows. In Proceedings of the 2011 International Conference on Management of Data, SIGMOD '11, pages 1081–1090, New York, NY, USA, 2011. ACM.



REFERENCES (CONTD)

4. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), pages 137–150, December 2004.

5. Bingsheng He, Mao Yang, Zhenyu Guo, Rishan Chen, Bing Su, Wei Lin, and Lidong Zhou. Comet: batched stream processing for data intensive distributed computing. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 63–74, New York, NY, USA, 2010. ACM.

6. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59–72, New York, NY, USA, 2007. ACM.


REFERENCES (CONTD)

7. Daniel Peng and Frank Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1–15, Berkeley, CA, USA, 2010. USENIX Association.

8. Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard. DryadInc: reusing work in large-scale computations. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, HotCloud '09, Berkeley, CA, USA, 2009. USENIX Association.

9. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. Hive - a petabyte scale data warehouse using Hadoop. In Proceedings of the 2010 IEEE International Conference on Data Engineering (ICDE), pages 996–1005, 2010.


THANK YOU