Fault Tolerant Parallel Data-Intensive Algorithms

Mucahid Kutlu, Gagan Agrawal, Oguz Kurt

(Ohio State University)


Introduction and Motivation




- The Mean-Time-To-Failure (MTTF) of systems is decreasing as the number of cores grows.
- For future exascale systems, it is being argued that checkpointing and recovery time (with current methods) will even exceed the MTTF.
- Algorithm-based fault tolerance can be an alternative method.


Our Goal




- We focus only on fail-stop failures.
- We do not use any backup node; after a failure, processing continues on the remaining nodes.
- We have two main goals for faster recovery:
  - minimize data loss, since lost data needs to be reread from the storage cluster
  - minimize re-processing of the lost data



Our Approach


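- Intelligent Replication
  - Minimum data intersection: data is divided into blocks and the blocks are distributed across different processors, so that any two processors hold as few blocks in common as possible.
  - Passive replicas: a replica block is processed only if the processor holding its primary copy fails.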
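
The placement idea can be sketched in a few lines of Python. This is only an illustration, not the authors' C/MPI implementation: the poster does not state its placement algorithm, so the rule used here (the j-th block of processor p is replicated on processor p + j + 1, wrapping around) is an assumption chosen because it reproduces the two layouts drawn on the poster. The names `place_blocks` and `max_shared` are hypothetical.

```python
from itertools import combinations


def place_blocks(num_procs, blocks_per_proc):
    """Assign primary and replica blocks; processors are 0-based, block ids 1-based."""
    if not 0 < blocks_per_proc < num_procs:
        raise ValueError("need 0 < blocks_per_proc < num_procs")
    layout = {p: {"primary": [], "replica": []} for p in range(num_procs)}
    for p in range(num_procs):
        for j in range(blocks_per_proc):
            block = p * blocks_per_proc + j + 1
            layout[p]["primary"].append(block)
            # Assumed rule: the j-th block of p is copied to processor p + j + 1,
            # so the blocks of one processor are spread over different neighbours.
            holder = (p + j + 1) % num_procs
            layout[holder]["replica"].append(block)
    return layout


def max_shared(layout):
    """Largest number of blocks that any two processors both hold."""
    held = {p: set(v["primary"]) | set(v["replica"]) for p, v in layout.items()}
    return max(len(held[a] & held[b]) for a, b in combinations(held, 2))


if __name__ == "__main__":
    # Print the layout for 7 processors with 2 primary blocks each.
    for proc, blocks in place_blocks(7, 2).items():
        print(f"P{proc + 1}", blocks["primary"], blocks["replica"])
```

Under this assumed rule, with 7 processors and 2 primary blocks each, no two processors hold more than one block in common, so a simultaneous failure of any two processors loses at most one block outright.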










- Summarization
  - After processing one block, a summary is generated for that block and sent to the master node.
  - Blocks whose summary was already sent before the failure do not need to be re-processed.
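
To make the summarization step concrete, here is a minimal Python sketch for the k-means case. It assumes that a block summary consists of per-cluster coordinate sums and point counts, which is a common way to make k-means partial results mergeable; the poster does not specify the summary format, and `summarize_block` and `Master` are hypothetical names rather than the authors' code.

```python
def summarize_block(points, centroids):
    """Return (per-cluster coordinate sums, per-cluster counts) for one block."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for pt in points:
        # Index of the centroid with the smallest squared distance to pt.
        nearest = min(
            range(k),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(pt, centroids[c])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += pt[d]
    return sums, counts


class Master:
    """Keeps the summaries received so far, keyed by block id."""

    def __init__(self):
        self.summaries = {}

    def receive(self, block_id, summary):
        self.summaries[block_id] = summary

    def blocks_to_redo(self, blocks):
        # Only blocks without a stored summary need to be processed again.
        return [b for b in blocks if b not in self.summaries]

    def new_centroids(self, old_centroids):
        """Merge all stored summaries into updated centroids."""
        k, dim = len(old_centroids), len(old_centroids[0])
        total = [[0.0] * dim for _ in range(k)]
        count = [0] * k
        for sums, counts in self.summaries.values():
            for c in range(k):
                count[c] += counts[c]
                for d in range(dim):
                    total[c][d] += sums[c][d]
        # An empty cluster keeps its previous centroid.
        return [
            [t / count[c] for t in total[c]] if count[c] else list(old_centroids[c])
            for c in range(k)
        ]
```

If a processor fails after its summary for a block has reached the master, `blocks_to_redo` leaves that block out, which is the saving described in the bullets above.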


[Figure: master node, file system, and processors P1, P2, P3, P4]


Recovery Scenario




- P1 and P2 fail at the beginning of the iteration.
- The master node notifies
  - P3 to process D2 and D3
  - P4 to process D4
- Since all copies of D1 are lost, the master node reads D1 from the file system/storage cluster and notifies P4 to process it.




[Figure: block placement on P1 to P4]

Processor   Primary   Replica
P1          D1, D2    D6, D7
P2          D3, D4    D1, D8
P3          D5, D6    D2, D3
P4          D7, D8    D4, D5
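
The recovery scenario can be replayed with a short Python sketch. The block layout is taken from the figure; everything else is an assumption made for illustration: the function name `plan_recovery`, the `summarized` parameter, and the tie-breaking policy (blocks with a surviving replica are assigned first, and a block that must be reread goes to the survivor with the fewest recovery blocks). The poster does not say how the master chooses P4 for D1.

```python
# Layout from the figure: primary and replica blocks held by each processor.
LAYOUT = {
    "P1": {"primary": ["D1", "D2"], "replica": ["D6", "D7"]},
    "P2": {"primary": ["D3", "D4"], "replica": ["D1", "D8"]},
    "P3": {"primary": ["D5", "D6"], "replica": ["D2", "D3"]},
    "P4": {"primary": ["D7", "D8"], "replica": ["D4", "D5"]},
}


def plan_recovery(layout, failed, summarized=()):
    """Return (plan, reread): extra blocks per survivor, and blocks to reread."""
    survivors = [p for p in layout if p not in failed]
    if not survivors:
        raise ValueError("no surviving processor")
    plan = {p: [] for p in survivors}
    reread = []
    # Primary blocks of failed processors whose summary has not reached the master.
    pending = [
        b for f in sorted(failed) for b in layout[f]["primary"] if b not in summarized
    ]
    # Pass 1: use a surviving passive replica when one exists.
    for block in pending:
        holders = [p for p in survivors if block in layout[p]["replica"]]
        if holders:
            target = min(holders, key=lambda p: (len(plan[p]), p))
            plan[target].append(block)
        else:
            reread.append(block)  # every copy was on a failed processor
    # Pass 2 (assumed policy): give reread blocks to the least-loaded survivor.
    for block in reread:
        target = min(survivors, key=lambda p: (len(plan[p]), p))
        plan[target].append(block)
    return plan, reread


if __name__ == "__main__":
    print(plan_recovery(LAYOUT, {"P1", "P2"}))
```

With P1 and P2 failed, this assigns D2 and D3 to P3, D4 to P4, and marks D1 as a block to reread and hand to P4, matching the scenario above.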

Experimental Setup




- Implemented the k-means and apriori algorithms in the C programming language using the MPI library.
- Used 2.5 GHz Opteron processors and 24 GB of memory.
- The number of processors is 8.
- In the experiments with Hadoop:
  - Replication factor (R): 3
  - Summarization frequency (S): 4


Experimental Results

[Figure: Impact of Summary Exchange Frequency in Apriori: Varying Number of Failures]

[Figure: Total Execution Time that Changes with the Number of Failures]

[Figure: primary and replica block placement on P1 to P7]

Processor   Primary   Replica
P1          1, 2      12, 13
P2          3, 4      1, 14
P3          5, 6      2, 3
P4          7, 8      4, 5
P5          9, 10     6, 7
P6          11, 12    8, 9
P7          13, 14    10, 11