HaLoop: Efficient Iterative Data Processing On Large Scale Clusters


Presentation by Amr Swafta

Outline

- Introduction / Motivation
- Iterative Application Example
- HaLoop Architecture
- Task Scheduling
- Caching and Indexing
- Experiments & Results
- Conclusion

Introduction / Motivation

- HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve iterative applications.
- The MapReduce framework can't directly support recursion/iteration.
- Many data analysis techniques require iterative computations, for example:
  - PageRank
  - Clustering
  - Neural-network analysis
  - Social-network analysis
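To see why this matters: with stock MapReduce, iteration has to be driven by a client-side program that resubmits a fresh job for every pass and re-reads all input from scratch. A minimal sketch of that pattern (the helpers `run_job` and `converged` are illustrative names, not part of Hadoop's API):

```python
# Hypothetical driver loop for iterating with plain MapReduce.
# `run_job` and `converged` stand in for application-specific code.
def iterate(run_job, converged, invariant_input, initial_state, max_iters=100):
    state = initial_state
    for i in range(max_iters):
        output = f"output/iter-{i}"
        # Every pass re-ships the loop-invariant input and writes a new output.
        run_job(inputs=[invariant_input, state], output=output)
        # The fixpoint test is itself an extra pass over the data.
        if converged(state, output):
            return output
        state = output
    return state
```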

Iterative Application Example



PageRank algorithm: a system for ranking web pages.

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where:

- PR(A): the PageRank of page A.
- PR(Ti): the PageRank of the pages Ti which link to page A.
- C(Ti): the number of outbound links on page Ti.
- d: a damping factor which can be set between 0 and 1.


Consider a small web consisting of three pages A, B, and C, with d = 0.5. The PageRank values are calculated as follows:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
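These three equations are mutually dependent, so they are solved iteratively: start each rank at 1 and re-apply the updates until the values stop changing. A small Python sketch of this fixpoint iteration (not from the slides):

```python
# Iteratively solve the three-page example with d = 0.5, starting all ranks at 1.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}

for _ in range(100):
    new_pr = {
        "A": (1 - d) + d * pr["C"],
        "B": (1 - d) + d * (pr["A"] / 2),
        "C": (1 - d) + d * (pr["A"] / 2 + pr["B"]),
    }
    # Fixpoint test: stop once no rank moves by more than a tiny epsilon.
    done = all(abs(new_pr[p] - pr[p]) < 1e-9 for p in pr)
    pr = new_pr
    if done:
        break

print(pr)  # converges to PR(A) ~= 1.077, PR(B) ~= 0.769, PR(C) ~= 1.154
```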


HaLoop Architecture


- HaLoop's master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body.
- HaLoop uses a modified task scheduler for iterative applications.
- HaLoop caches and indexes application data on slave nodes.



[Figure: the difference between Hadoop and HaLoop on iterative applications.]

Note: the loop control is pushed from the application into the infrastructure.
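For illustration, here is roughly what framework-side loop control could look like; the class and method names below are hypothetical stand-ins, not HaLoop's real interface:

```python
# Hypothetical sketch of loop control living in the framework's master node.
# All names are illustrative; see the HaLoop paper for the actual interface.
class LoopControl:
    def __init__(self, steps, max_iterations, distance_fn, threshold):
        self.steps = steps                  # map-reduce steps forming the loop body
        self.max_iterations = max_iterations
        self.distance_fn = distance_fn      # compares consecutive iteration outputs
        self.threshold = threshold

    def run(self, run_step, state):
        for _ in range(self.max_iterations):
            previous = state
            for step in self.steps:         # the master starts each step itself...
                state = run_step(step, state)
            # ...and evaluates the fixpoint condition, instead of a client driver.
            if self.distance_fn(previous, state) < self.threshold:
                break
        return state
```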


Task Scheduling



- Inter-iteration locality: place on the same physical machine those map and reduce tasks that occur in different iterations but access the same data.
- This allows cached data to be reused between iterations.
- The scheduling exhibits inter-iteration locality if, for all i > 1, Ti(d) and Ti-1(d) are assigned to the same physical node, where:
  - d: mapper/reducer input.
  - T: a task which consumes d during iterations.


- Scheduling the first iteration in Hadoop and HaLoop is the same.
- Subsequent iterations place tasks that access the same data on the same physical node.
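A toy model of this rule (the data structures are illustrative, not HaLoop's): the scheduler remembers which node ran Ti-1(d) and sends Ti(d) back there whenever possible:

```python
# Toy sketch of inter-iteration locality in task assignment.
last_node = {}  # partition d -> node that ran T_{i-1}(d)

def schedule(partition, iteration, available_nodes):
    preferred = last_node.get(partition)
    if iteration > 0 and preferred in available_nodes:
        node = preferred            # reuse the node holding d's cached data
    else:
        node = available_nodes[0]   # first iteration, or fall back to a substitute
    last_node[partition] = node
    return node
```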


Caching and Indexing



- To reduce I/O cost, HaLoop caches loop-invariant data partitions on the physical node's local disk for subsequent reuse.
- To further accelerate processing, it indexes the cached data: keys and values are stored in separate local files.
- Types of caches:
  - Reducer Input Cache
  - Reducer Output Cache
  - Mapper Input Cache
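A minimal sketch of the idea, assuming a simple file layout that does not mirror HaLoop's actual on-disk format: write a loop-invariant partition to local disk once, with keys and values in separate files so a lookup can seek directly to a value:

```python
import os
import pickle

CACHE_DIR = "haloop_cache"  # illustrative local-disk location

def cache_partition(name, records):
    """Write a loop-invariant partition once: keys and values in separate files."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    index = {}  # key -> byte offset of its value
    with open(os.path.join(CACHE_DIR, name + ".values"), "wb") as vf:
        for key, value in records:
            index[key] = vf.tell()
            pickle.dump(value, vf)
    with open(os.path.join(CACHE_DIR, name + ".keys"), "wb") as kf:
        pickle.dump(index, kf)

def lookup(name, key):
    """Later iterations seek straight to a cached value instead of re-reading input."""
    with open(os.path.join(CACHE_DIR, name + ".keys"), "rb") as kf:
        index = pickle.load(kf)
    with open(os.path.join(CACHE_DIR, name + ".values"), "rb") as vf:
        vf.seek(index[key])
        return pickle.load(vf)
```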

Reducer Input Cache

- Gives access to loop-invariant data without a map/shuffle phase.
- RI-cached data is used by the reducer function.
- Assumes:
  1. Mapper output is constant across iterations.
  2. Static partitioning (implies: no new nodes).
- In HaLoop, the number of reducer tasks is unchanged across iterations.
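As an illustration of why this helps, take PageRank: the link structure is loop-invariant, while only the ranks change each iteration. A hedged sketch (`combine` stands in for application code):

```python
# Illustrative only: a reducer that merges cached loop-invariant input
# (e.g., PageRank's link structure) with the freshly shuffled dynamic values
# (e.g., ranks from the previous iteration).
def reduce_with_ri_cache(key, shuffled_values, ri_cache, combine):
    invariant = ri_cache[key]        # read from local disk; never re-mapped or re-shuffled
    dynamic = list(shuffled_values)  # only this small part crossed the network
    return combine(key, invariant, dynamic)
```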



Reducer Output Cache

- Stores and indexes the most recent local output on each reducer node.
- Provides distributed access to the output of previous iterations.
- RO-cached data is used by fixpoint evaluation.
- It is very efficient when the fixpoint evaluation must be conducted after each iteration.
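A sketch of a fixpoint check over the reducer output cache, assuming the distance measure is the sum of per-key differences between consecutive iterations (the actual measure is application-defined):

```python
# Illustrative fixpoint test against the RO cache: compare each key's current
# value with the cached value from the previous iteration, locally on each node.
def reached_fixpoint(current, previous, threshold=1e-6):
    distance = sum(abs(current[k] - previous.get(k, 0.0)) for k in current)
    return distance < threshold
```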



Mapper Input Cache

- In the first iteration, if a mapper performs a non-local read on an input split, the split is cached on the local disk of the mapper's physical node.
- In later iterations, all mappers read data only from local disks.
- MI-cached data is used during scheduling of map tasks.
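A toy sketch of the mapper-side behavior (the helpers `is_local` and `fetch_to_local_disk` are hypothetical):

```python
# Illustrative mapper input caching: a split read remotely in iteration 0 is
# copied to local disk, so every later iteration reads only local data.
local_path = {}  # split id -> path of the locally cached copy

def open_split(split, iteration, is_local, fetch_to_local_disk):
    if iteration == 0 and not is_local(split):
        local_path[split] = fetch_to_local_disk(split)  # cache the non-local split
    return open(local_path.get(split, split))
```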



Cache Reloading

Cached data must be reloaded when:

1. The hosting node fails.
2. The hosting node has a full load and a map or reduce task must be scheduled on a different (substitution) node.

Experiments & Results



- HaLoop is evaluated on real queries and real datasets.
- Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers.

Evaluation of Reducer Input Cache

[Figures: overall runtime; reduce and shuffle time.]

Evaluation of Reducer Output Cache

[Figure: time spent on fixpoint evaluation in each iteration; Livejournal dataset (50 nodes) and Freebase dataset (90 nodes).]

Evaluation of Mapper Input Cache

[Figure: overall runtime; Cosmo-dark and Cosmo-gas datasets (8 nodes each).]

Conclusion

- The authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications.
- HaLoop is built on top of Hadoop and extends it with several important optimizations, including:
  - A loop-aware task scheduler
  - Loop-invariant data caching
  - Caching for efficient fixpoint verification

Questions


Thank You
