Presentation

Nov 21, 2013

IMPROVING MAPREDUCE PERFORMANCE THROUGH DATA PLACEMENT IN HETEROGENEOUS HADOOP CLUSTERS

Prepared by: Eng. SALAH HARB

PRESENTATION OUTLINE

- Background
- Motivation
- Design and Implementation
- Experimental Results
- Conclusions

BACKGROUND

- What is MapReduce?
  - A simple programming model (framework)
  - Originated at Google
  - Designed for large-scale data processing
  - Exploits a large set of computers (nodes) forming a cluster
  - Executes jobs/tasks in a distributed manner
  - Offers high availability
- The computation process has two phases:
  - Map phase
  - Reduce phase
- The computation is done on a data set of key/value pairs
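The two phases can be sketched in miniature with plain Python (no Hadoop involved; the function names are illustrative, not the framework's API):

```python
from collections import defaultdict

def map_phase(document):
    """Map phase: emit a (word, 1) key/value pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return (key, sum(values))

def mapreduce(documents):
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(mapreduce(["the map phase", "the reduce phase"]))
# {'the': 2, 'map': 1, 'phase': 2, 'reduce': 1}
```

In a real cluster each map and reduce call runs as a task on a different node; the shuffle step is what the framework performs over the network between the two phases.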

BACKGROUND (MAPREDUCE)

Figure 1: MapReduce Architecture
BACKGROUND (HADOOP)

- What is Hadoop?
  - A framework for running applications on large clusters built of commodity hardware
  - Scale: petabytes of data on thousands of nodes
  - Consists of:
    - Storage: HDFS
    - Processing model: supports the MapReduce programming model (Java-based)
- Why is it desirable? (Yahoo!, Facebook, etc.)
  - Scalable: uses a cluster of a large set of nodes efficiently
  - Easy to use
    - Users: no need to deal with the complexity of distributed computing
  - Reliable: can handle node failures automatically

BACKGROUND (HADOOP)

Figure 2: Hadoop Architecture (http://lucene.apache.org/hadoop)

BACKGROUND (HDFS)

- The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster.
- It provides high-throughput access to application data, is highly fault-tolerant, and is designed to be deployed on low-cost hardware.
- Hadoop implements MapReduce using the Hadoop Distributed File System (HDFS).
- HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
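The "process the data where it is located" idea can be illustrated with a toy scheduler (a hypothetical sketch, not Hadoop's actual task scheduler):

```python
# Toy data-locality scheduler: given which datanodes hold replicas of each
# block, prefer to run a task on a node that already stores the block,
# falling back to any free node when no replica holder is available.

def schedule_task(block, replicas, free_nodes):
    """Return a node on which to process `block`, preferring data-local nodes.

    replicas:   dict mapping block id -> set of nodes storing a replica
    free_nodes: list of nodes with a free task slot
    """
    local = [n for n in free_nodes if n in replicas[block]]
    if local:
        return local[0]      # data-local: no network transfer needed
    return free_nodes[0]     # remote: the block must be fetched over the network

replicas = {"blk_1": {"A", "B", "C"}}               # three replicas (HDFS default)
print(schedule_task("blk_1", replicas, ["D", "B"]))  # B (holds a replica)
print(schedule_task("blk_1", replicas, ["D", "E"]))  # D (no replica holder is free)
```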

BACKGROUND (HDFS)

Figure 3: HDFS Architecture (http://lucene.apache.org/hadoop). The Namenode handles coordination and management; the Datanodes handle storage and data processing.

MOTIVATION

- Data locality
  - A determining factor for MapReduce performance.
  - To balance load, Hadoop distributes data to multiple nodes based on disk space availability.
  - Very practical and efficient for a homogeneous cluster:
    - All the nodes have an identical workload, indicating that no data needs to be moved from one node to another.
- But what about a heterogeneous cluster?
  - High-performance nodes can complete processing their local data faster than low-performance nodes.

MOTIVATION (EXAMPLE)

Figure: Execution time (min) on three nodes. Node A (fast) processes 1 task/min; Node B (slow) is 2x slower; Node C (slowest) is 3x slower.

MOTIVATION (EXAMPLE)

Figure: Execution time (min) for Nodes A, B, and C with 3, 2, and 6 tasks; each bar is broken into loading, transferring, and processing time.

SOLUTION (EXAMPLE)

Figure: Execution time (min) for Nodes A', B', and C' after redistributing the same tasks (3, 2, and 6) in proportion to node speed; each bar is broken into loading, transferring, and processing time.

DESIGN AND IMPLEMENTATION

- The goal is to boost the performance of Hadoop in heterogeneous clusters by minimizing data movement between slow and fast nodes.
- Data placement scheme:
  - Distribute data across multiple heterogeneous nodes based on their computing capacities.
  - Data movement can be reduced if the number of file fragments placed on the disk of each node is proportional to the node's data processing speed.
- Two stages:
  - Initial Data Placement
  - Data Redistribution (Reorganizer)

INITIAL DATA PLACEMENT

- Begins by dividing a large input file into a number of even-sized fragments.
- Assigns fragments to nodes in a cluster in accordance with the nodes' data processing speed.
- High-performance nodes are expected to store and process more file fragments than low-performance nodes.

Figure: The Namenode divides File 1 into fragments 1-6 and distributes them across Datanodes A, B, and C.
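A minimal sketch of this placement idea (illustrative Python; `split_into_fragments` and `assign_fragments` are hypothetical names, not the authors' implementation):

```python
# Split an input into even-sized fragments, then hand each node a share of
# fragments proportional to its relative processing speed.

def split_into_fragments(data, fragment_size):
    """Divide the input into even-sized fragments (the last may be shorter)."""
    return [data[i:i + fragment_size] for i in range(0, len(data), fragment_size)]

def assign_fragments(fragments, speeds):
    """speeds: dict node -> relative processing speed (higher = faster).

    Each node receives a fragment count proportional to its speed,
    rounded with a largest-remainder apportionment."""
    total_speed = sum(speeds.values())
    quotas = {n: len(fragments) * s / total_speed for n, s in speeds.items()}
    counts = {n: int(q) for n, q in quotas.items()}
    leftover = len(fragments) - sum(counts.values())
    # Give any leftover fragments to the nodes with the largest remainders.
    for n in sorted(quotas, key=lambda n: quotas[n] - counts[n], reverse=True)[:leftover]:
        counts[n] += 1
    placement = {node: [] for node in speeds}
    it = iter(fragments)
    for node, count in counts.items():
        for _ in range(count):
            placement[node].append(next(it))
    return placement

fragments = split_into_fragments("x" * 60, 10)   # 6 even-sized fragments
# Node A is 3x as fast as C and 1.5x as fast as B (speeds 3 : 2 : 1).
placement = assign_fragments(fragments, {"A": 3, "B": 2, "C": 1})
print({n: len(f) for n, f in placement.items()})  # {'A': 3, 'B': 2, 'C': 1}
```

The fast node ends up holding three times as many fragments as the slow one, so all three nodes finish processing their local data at roughly the same time.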

DATA REDISTRIBUTION

- Each node's processing speed in a heterogeneous cluster is quantified using a new term called the computing ratio.
  - Computing ratios vary from application to application.
- Input file fragments distributed by the initial data placement algorithm might be disrupted for the following reasons:
  - New data is appended to an existing input file.
  - Data blocks are deleted from the existing input file.
  - New data computing nodes are added to an existing cluster.
- So, a data redistribution algorithm is required to reorganize file fragments automatically, based on computing ratios.

CALCULATING COMPUTING RATIO

- Computing ratios are determined by a profiling procedure carried out using the following steps:
  - The data processing operations of a given MapReduce application are performed separately on each node.
    - All the nodes process the same amount of data.
  - Record the response time of each node performing the data processing operations.
  - The shortest response time is used as a reference to normalize the response time measurements.

Table 1: Computing Ratios
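The normalization step above amounts to a one-line calculation (the response times below are made-up illustration values, not the measured ones):

```python
# Profiling arithmetic: run the same workload on each node, time it, and
# normalize every node's response time by the fastest node's time.

def computing_ratios(response_times):
    """response_times: dict node -> seconds to process the same input."""
    fastest = min(response_times.values())
    return {node: t / fastest for node, t in response_times.items()}

times = {"A": 100, "B": 200, "C": 330}   # seconds, illustrative only
print(computing_ratios(times))           # {'A': 1.0, 'B': 2.0, 'C': 3.3}
```

A ratio of 2.0 means the node needs twice as long as the fastest node for this application, so it should receive half as many fragments.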

DATA REDISTRIBUTION

Figure: The Namenode keeps two lists of Datanodes: L1 (over-utilized nodes) and L2 (under-utilized nodes). Fragments are moved from over-utilized nodes to under-utilized nodes using a round-robin algorithm.
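The reorganizer pass can be sketched as follows (an illustrative sketch of the L1/L2 round-robin idea, not the authors' code; fragment counts stand in for actual block migration):

```python
from collections import deque

def redistribute(current, target):
    """current/target: dict node -> fragment count. Returns a list of moves.

    Moves one fragment at a time from over-utilized nodes (list L1) to
    under-utilized nodes (list L2), visiting L2 in round-robin order,
    until every node holds its target share."""
    over = deque(n for n in sorted(current) if current[n] > target[n])   # L1
    under = deque(n for n in sorted(current) if current[n] < target[n])  # L2
    moves = []
    while over and under:
        src, dst = over[0], under[0]
        moves.append((src, dst))       # migrate one fragment src -> dst
        current[src] -= 1
        current[dst] += 1
        if current[src] == target[src]:
            over.popleft()             # src is no longer over-utilized
        if current[dst] == target[dst]:
            under.popleft()            # dst has reached its share
        else:
            under.rotate(-1)           # round-robin over under-utilized nodes
    return moves

# Node A is fastest, so its target share grows; C's shrinks.
print(redistribute({"A": 1, "B": 2, "C": 3}, {"A": 3, "B": 2, "C": 1}))
# [('C', 'A'), ('C', 'A')]
```

Note the sketch mutates `current` in place; the real reorganizer would also pick which physical blocks to move and update HDFS metadata.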

EXPERIMENTAL RESULTS

Table 2: Five nodes in the Hadoop cluster

- Grep and WordCount MapReduce applications:
  - Grep is a tool searching for a regular expression in a text file
  - WordCount is a program used to count words in a text file

EXPERIMENTAL RESULTS (CONT.)

Table 3: Computing ratios

- Grep and WordCount MapReduce applications:
  - Computing ratios of the five nodes with respect to the Grep and WordCount applications
E
XPERIMENTAL

R
ESULTS

(
CONT
.)

20

Figure 4:
Response times


Response time of Grep
and Wordcount
in each Node

EXPERIMENTAL RESULTS (CONT.)

Figure 5: Decisions

- Six data placement decisions

EXPERIMENTAL RESULTS (CONT.)

Figure 6: Performance of Grep

- Impact of data placement on the performance of Grep

EXPERIMENTAL RESULTS (CONT.)

Figure 7: Performance of WordCount

- Impact of data placement on the performance of WordCount

CONCLUSIONS

- Identified the performance degradation caused by heterogeneity.
- Designed and implemented a data placement mechanism in HDFS.
- The mechanism distributes fragments of an input file to heterogeneous nodes based on their computing capacities (computing ratios).
- Improved the performance of Hadoop on heterogeneous clusters.

Thanks!

Questions?