# HaLoop: Efficient Iterative Data Processing On Large Scale Clusters

Oct 19, 2013

Presentation by Amr Swafta

## Outlines

- Introduction / Motivation
- Iterative Application Example
- HaLoop Architecture
- Caching and Indexing
- Experiments & Results
- Conclusion

## Introduction / Motivation

HaLoop is a MapReduce framework designed to serve iterative applications.

The MapReduce framework can't directly support recursion/iteration.

Many data analysis techniques require iterative computations:

- PageRank
- Clustering
- Neural-network analysis
- Social network analysis

## Iterative Application Example

PageRank algorithm: a system for ranking web pages.

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where:

- PR(A): the PageRank of page A.
- PR(Ti): the PageRank of pages Ti which link to page A.
- C(Ti): the number of outbound links on page Ti.
- d: a damping factor which can be set between 0 and 1.

Consider a small web consisting of three pages A, B and C with d = 0.5. The PageRank will be calculated as follows:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
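The three equations above can be iterated until the ranks stop changing; a minimal Python sketch (the A/B/C link structure is the one implied by the equations):

```python
# Iterative PageRank for the three-page example with d = 0.5.
# The update rules below are exactly the three slide equations.

def pagerank(d=0.5, tol=1e-10, max_iter=1000):
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}   # initial ranks
    for _ in range(max_iter):
        new = {
            "A": (1 - d) + d * pr["C"],
            "B": (1 - d) + d * (pr["A"] / 2),
            "C": (1 - d) + d * (pr["A"] / 2 + pr["B"]),
        }
        # Fixpoint test: stop when no rank changes by more than tol.
        if all(abs(new[p] - pr[p]) < tol for p in pr):
            return new
        pr = new
    return pr

ranks = pagerank()
print({p: round(v, 4) for p, v in sorted(ranks.items())})
# → {'A': 1.0769, 'B': 0.7692, 'C': 1.1538}
```

The fixpoint matches the closed-form solution of the three equations (A = 14/13, B = 10/13, C = 15/13).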

## HaLoop Architecture

HaLoop's master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body.

HaLoop uses a modified, loop-aware task scheduler for iterative applications.

HaLoop caches and indexes application data on slave nodes.
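The loop control module's job can be sketched in a few lines; `run_mapreduce` and `distance` below are toy stand-ins, not HaLoop's actual API:

```python
# Sketch of HaLoop-style loop control on the master node: keep
# launching map-reduce steps (the loop body) until a fixpoint or an
# iteration cap is reached.

def run_mapreduce(ranks):
    # Toy loop body: damp every value toward 1.0 (one PageRank-like step).
    return {k: 0.5 + 0.5 * v for k, v in ranks.items()}

def distance(a, b):
    # L1 distance between the outputs of two consecutive iterations.
    return sum(abs(a[k] - b[k]) for k in a)

def loop_control(initial, max_iterations=100, epsilon=1e-9):
    current = initial
    for i in range(max_iterations):
        result = run_mapreduce(current)          # one map-reduce step
        if distance(result, current) < epsilon:  # fixpoint reached
            return result, i + 1
        current = result
    return current, max_iterations

final, iters = loop_control({"A": 0.0, "B": 2.0})
```

The point of the design is that this `while not converged` driver lives in the infrastructure, not in every application.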

## Difference between Hadoop and HaLoop with iterative applications

Note: the loop control is pushed from the application into the infrastructure.

Inter-iteration locality: place on the same physical machines those map and reduce tasks that occur in different iterations but access the same data, so that cached data can be reused between iterations.

The scheduling exhibits inter-iteration locality if:

For all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node, where:

- d: mapper / reducer input.
- T: task which consumes d during iterations.

Consequences:

- Scheduling the first iteration in Hadoop and HaLoop is the same.
- Subsequent iterations put tasks that access the same data on the same physical node.
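The locality condition above amounts to a scheduler that remembers placements; a minimal sketch (data structures are illustrative, not HaLoop's implementation):

```python
# Inter-iteration-local scheduling: remember which node processed each
# input partition d in the first iteration, and assign the same
# partition to the same node in every later iteration.

def schedule(iteration, partitions, nodes, placement):
    """Assign each partition d to a node; `placement` maps
    partition -> node, carried over from earlier iterations."""
    assignment = {}
    for i, d in enumerate(partitions):
        if iteration == 1 or d not in placement:
            node = nodes[i % len(nodes)]   # first iteration: any policy
            placement[d] = node            # remember where d was cached
        assignment[d] = placement[d]       # later iterations: same node
    return assignment

placement = {}
first = schedule(1, ["d0", "d1", "d2"], ["n0", "n1"], placement)
second = schedule(2, ["d0", "d1", "d2"], ["n0", "n1"], placement)
print(first == second)   # inter-iteration locality holds
```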

## Caching and Indexing

To reduce I/O cost, HaLoop caches loop-invariant data partitions on the physical node's local disk for subsequent reuse.

To further accelerate processing, it indexes the cached data.

- Keys and values are stored in separate local files.

Types of caches:

- Reducer Input Cache
- Reducer Output Cache
- Mapper Input Cache
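The "keys and values in separate files, plus an index" idea can be sketched as follows; the file names and on-disk format here are purely illustrative, not HaLoop's actual layout:

```python
# Toy cache layout: keys and values go to separate local files, and an
# index maps each key to the byte offset of its value, so a lookup can
# seek straight to the value instead of scanning the whole file.

import json, os, tempfile

def write_cache(pairs, cache_dir):
    index = {}
    with open(os.path.join(cache_dir, "values.dat"), "w") as vf, \
         open(os.path.join(cache_dir, "keys.dat"), "w") as kf:
        for key, value in pairs:
            kf.write(key + "\n")
            index[key] = vf.tell()          # byte offset of this value
            vf.write(json.dumps(value) + "\n")
    with open(os.path.join(cache_dir, "index.json"), "w") as f:
        json.dump(index, f)

def read_value(key, cache_dir):
    with open(os.path.join(cache_dir, "index.json")) as f:
        index = json.load(f)
    with open(os.path.join(cache_dir, "values.dat")) as vf:
        vf.seek(index[key])                 # jump straight to the value
        return json.loads(vf.readline())

cache = tempfile.mkdtemp()
write_cache([("url1", [0.25, 0.75]), ("url2", [1.0])], cache)
print(read_value("url2", cache))   # [1.0]
```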

## Reducer Input Cache

Caches reducer input data locally, so that later iterations can read it without re-running the map/shuffle phases.

RI cached data is used by the reducer function.

Assumes:

1. Mapper output is constant across iterations.
2. Static partitioning (implies: no new nodes).

In HaLoop, the number of reducer tasks is unchanged across iterations.

## Reducer Output Cache

Stores and indexes the most recent local output on each reducer node.

Supports fixpoint evaluation in a distributed fashion.

RO cached data is used by fixpoint evaluation.

It is very efficient when the fixpoint evaluation must be conducted after each iteration.
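A minimal sketch of how cached reducer output can drive distributed fixpoint evaluation (the structure is illustrative; HaLoop's actual mechanism differs in details):

```python
# Each reducer compares its current local output against its own
# cached output from the previous iteration and reports a local
# distance; the master sums the per-reducer distances. No old results
# need to be shuffled across the cluster.

def reducer_fixpoint_distance(current_output, cached_output):
    return sum(abs(current_output[k] - cached_output.get(k, 0.0))
               for k in current_output)

# Outputs of two reducers in iterations i-1 (cached) and i (current).
cached = [{"a": 1.0, "b": 2.0}, {"c": 3.0}]
current = [{"a": 1.1, "b": 2.0}, {"c": 2.9}]

total = sum(reducer_fixpoint_distance(c, p) for c, p in zip(current, cached))
converged = total < 0.05
print(round(total, 2), converged)   # prints: 0.2 False
```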

## Mapper Input Cache

In the first iteration, if a mapper performs a non-local read on an input split, the split will be cached on the local disk of the mapper's physical node.

In later iterations, all mappers read data only from local disks.

MI cached data is used during the scheduling of map tasks.

The cached data cannot be used when:

1. The hosting node fails.
2. The hosting node has a full load.

In either case, a map or reduce task must be scheduled on a different substitution node.

## Experiments & Results

HaLoop is evaluated on real queries and real datasets.

Compared with Hadoop, HaLoop on average reduces query runtimes by a factor of 1.85, and shuffles only 4% of the data between mappers and reducers.

## Evaluation of Reducer Input Cache

[Figure: overall runtime.]

[Figure: reduce and shuffle time.]

## Evaluation of Reducer Output Cache

[Figure: time spent on fixpoint evaluation (in seconds) in each iteration, for the Livejournal dataset on 50 nodes and the Freebase dataset on 90 nodes.]

## Evaluation of Mapper Input Cache

[Figure: overall runtime for the Cosmo-dark and Cosmo-gas datasets on 8 nodes.]

## Conclusion

The authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications.

HaLoop is built on top of Hadoop and extends it with several important optimizations, including:

- A loop-aware task scheduler
- Loop-invariant data caching
- Caching for efficient fixpoint verification

## Questions

Thank You