Selective Recovery From Failures In
A Task Parallel Programming Model

James Dinan*, Sriram Krishnamoorthy#, Arjun Singri*, P. Sadayappan*

*The Ohio State University
#Pacific Northwest National Laboratory

1

Faults at Scale

Future systems built with a large number of components

MTBF inversely proportional to #components


Faults will be frequent


Checkpoint-restart too expensive with numerous faults

Strain on system components, notably file system


Assumption of fault-free operation infeasible


Applications need to think about faults


2

Programming Models

SPMD ties computation to a process

Fixed machine model

Applications need to change with major architectural shifts


Fault handling involves non-local design changes

Rely on p processes: what if one goes away?


Message-passing makes it harder

Consistent cuts are challenging

Message logging, etc. expensive


Fault management requires a lot of user involvement

3

Problem Statement

Fault management framework


Minimize user effort


Components

Data state

Application data

Communication operations

Control state

What work is each process doing?

Continue to completion despite faults

4

Approach

One-sided communication model

Easy to derive consistent cuts


Task parallel control model

Computation decoupled from processes


User specifies computation

Collection of tasks on global data


Runtime schedules computation

Load balancing

Fault management

5

Global Arrays (GA)

PGAS Family: UPC (C), CAF (Fortran), Titanium (Java), GA (library)

Aggregate memory from multiple nodes into global address space

Data access via one-sided get(..), put(..), acc(..) operations (see the sketch below)

Programmer controls data distribution and locality

Fully interoperable with MPI and ARMCI

Support for higher-level collectives: DGEMM, etc.

Widely used: chemistry, subsurface transport, bioinformatics, CFD

6
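To make the one-sided model concrete, a minimal GA program using the C bindings might look like the sketch below. The array shape, the 9x9 patch, and the MA stack/heap sizes are arbitrary choices for illustration.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);     /* local buffer space used by GA */

        int dims[2]  = {1024, 1024};
        int chunk[2] = {-1, -1};              /* let GA pick the distribution */
        int g_x = NGA_Create(C_DBL, 2, dims, "X", chunk);
        GA_Zero(g_x);

        /* One-sided access to a 9x9 patch of X: the owner of the data is not
         * involved; there is no receive operation and no tag matching. */
        double buf[9][9];
        int lo[2] = {0, 0}, hi[2] = {8, 8}, ld[1] = {9};
        for (int i = 0; i < 9; i++)
            for (int j = 0; j < 9; j++)
                buf[i][j] = (double)(i * 9 + j);

        if (GA_Nodeid() == 0)
            NGA_Put(g_x, lo, hi, buf, ld);    /* put the patch into the global array */
        GA_Sync();                            /* make the update globally visible */
        NGA_Get(g_x, lo, hi, buf, ld);        /* every process reads the same patch */

        GA_Destroy(g_x);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }

NGA_Acc(..) follows the same pattern but atomically accumulates into the target patch, which is what makes it interesting for the fault-handling discussion later.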

GA Memory Model

[Figure: global address space spanning Proc 0 .. Proc n, each with shared and private segments; a distributed array X[M][M][N] accessed via slices such as X[1..9][1..9][1..9]]

Remote memory access

Dominant communication in GA programs

Destination known in advance

No receive operation or tag matching

Remote Progress

Ensure overlap


Atomics and collectives

Blocking

Few outstanding at any time

7

Saving Data State

Data State = Communication state + memory state


Communication state

“Flush” pending RMA operations (single call)

Save atomic and collective ops (small state)


Memory state

Force other processes to flush their pending ops


Used in virtualized execution of GA apps (Comp. Frontiers’09)


Also enables pre-emptive migration

8
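A minimal sketch of the flush step, assuming ARMCI_AllFence() as the single call that completes all one-sided operations the caller has issued, followed by a barrier so no process still has operations in flight into anyone else's memory. checkpoint_local_patch() is a hypothetical helper that copies this process's portion of the global data aside.

    #include <string.h>
    #include "armci.h"
    #include "ga.h"

    /* Hypothetical helper: copy this process's patch of the global data
     * into a shadow buffer (or to spare memory elsewhere). */
    static void checkpoint_local_patch(const void *src, void *shadow, size_t nbytes) {
        memcpy(shadow, src, nbytes);
    }

    void save_data_state(const void *local_patch, void *shadow, size_t nbytes) {
        ARMCI_AllFence();   /* complete every RMA operation this process initiated */
        GA_Sync();          /* barrier: nobody has pending operations into our memory */
        checkpoint_local_patch(local_patch, shadow, nbytes);
        /* atomic and collective operation state is small and is saved separately */
    }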

9

The Asynchronous Gap

The PGAS memory model simplifies managing data

Computation model is still regular, process-centric SPMD

Irregularity in the data can lead to load imbalance


Extend PGAS model to bridge asynchronous gap

Dynamic, irregular view of the computation

Runtime system should perform load balancing

Allow for computation movement to exploit locality


[Figure: a get(…) on a slice X[1..9][1..9][1..9] of the global array X[M][M][N]]

Control State


Task Model

Express computation as a collection of tasks

Tasks operate on data stored in Global Arrays

Executed in collective task parallel phases


Runtime system manages task execution

10
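A single task in this style typically reads its input blocks from global arrays with one-sided gets, computes locally, and accumulates its result back. The block-matrix-multiply body, the block size B, and the task_args_t struct below are invented for illustration; this is not the SCF kernel used later in the talk.

    #include "ga.h"

    #define B 32   /* illustrative block size */

    typedef struct { int g_a, g_b, g_c; int i, j, k; } task_args_t;

    /* Illustrative task body: C-block(i,j) += A-block(i,k) * B-block(k,j). */
    void block_gemm_task(const task_args_t *t) {
        double a[B][B], b[B][B], c[B][B];
        int ld[1] = {B};
        double one = 1.0;

        int alo[2] = {t->i * B, t->k * B}, ahi[2] = {t->i * B + B - 1, t->k * B + B - 1};
        int blo[2] = {t->k * B, t->j * B}, bhi[2] = {t->k * B + B - 1, t->j * B + B - 1};
        int clo[2] = {t->i * B, t->j * B}, chi[2] = {t->i * B + B - 1, t->j * B + B - 1};

        NGA_Get(t->g_a, alo, ahi, a, ld);          /* one-sided reads of the inputs */
        NGA_Get(t->g_b, blo, bhi, b, ld);

        for (int x = 0; x < B; x++)
            for (int y = 0; y < B; y++) {
                c[x][y] = 0.0;
                for (int z = 0; z < B; z++)
                    c[x][y] += a[x][z] * b[z][y];
            }

        NGA_Acc(t->g_c, clo, chi, c, ld, &one);    /* accumulate the result block */
    }

Because the task only names global-array patches, it can run on whichever process dequeues it.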

[Figure: execution alternates between SPMD phases and collective task-parallel phases, each of which runs to termination]
11

Task Model


Inputs: Global data, Immediates, CLOs (common local objects)


Outputs: Global data, CLOs, Child tasks


Strict dependence: only parent -> child (for now)

[Figure: a task f(...) with inputs (immediate 5, Y[0], ..., CLO 1) and output X[1], operating on arrays X and Y in the partitioned global address space across Proc 0 .. Proc n]
12
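One way to picture a task descriptor with these inputs and outputs is the struct below; the field names and sizes are hypothetical, not the actual Scioto data structures.

    /* Hypothetical task descriptor.  Inputs: global-data patches, immediate
     * values, and common local objects (CLOs); outputs: global-data patches
     * (child tasks would be added to the pool by the task body itself). */
    typedef struct {
        int ga_handle;        /* which global array        */
        int lo[3], hi[3];     /* which patch of that array */
    } gdata_ref_t;

    typedef struct {
        int         task_class;     /* selects the task body f(...)      */
        gdata_ref_t in[4];          /* input patches of global data      */
        gdata_ref_t out[2];         /* output patches                    */
        long        immediates[4];  /* small by-value arguments (e.g. 5) */
        int         clo_id;         /* id of a registered CLO            */
    } task_desc_t;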

Scioto Programming Interface

High-level interface: shared global task collection

Low-level interface: set of distributed task queues

Queues are prioritized by affinity

Use the work-first principle (LIFO)

Load balancing via work stealing (FIFO)

13
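The LIFO/FIFO split can be sketched as a split queue per process: the owner pushes and pops at one end (work-first, LIFO, good locality), while thieves take the oldest tasks from the other end (FIFO). The structure and function names below are hypothetical, bounds checks and synchronization are elided, and task_desc_t is the sketch above; in the real runtime the queue lives in ARMCI-accessible memory so steals are one-sided.

    #define QUEUE_MAX 1024

    /* Hypothetical split task queue. */
    typedef struct {
        task_desc_t tasks[QUEUE_MAX];
        int head;   /* owner's end: push/pop here (LIFO)    */
        int tail;   /* thieves' end: steal from here (FIFO) */
        /* a lock or atomic reservation on tail is elided   */
    } task_queue_t;

    void owner_push(task_queue_t *q, task_desc_t t) { q->tasks[q->head++] = t; }

    int owner_pop(task_queue_t *q, task_desc_t *t) {
        if (q->head == q->tail) return 0;      /* empty: time to steal */
        *t = q->tasks[--q->head];              /* newest task first    */
        return 1;
    }

    int thief_steal(task_queue_t *q, task_desc_t *t) {
        if (q->tail == q->head) return 0;      /* victim has no work   */
        *t = q->tasks[q->tail++];              /* oldest task first    */
        return 1;
    }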

Work Stealing Runtime System

ARMCI task queue on each processor

Steals don’t interrupt remote process

When a process runs out of work

Select a victim at random and steal work from them

Scaled to 8192 cores (SC’09)

Communication Markers

Communication initiated by a failed process


Handling partial completions

Get(), Put() are idempotent -> ignore

Acc() is non-idempotent -> mark beginning and end of acc() ops (sketch below)


Overhead

Memory usage -> proportional to #tasks

Communication -> additional small messages

14
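A hedged sketch of the markers around a non-idempotent acc(): before the accumulate, the writer records 'started' for the (task, output block) pair in a small global marker array, and only after the accumulate is known to be complete at the target does it record 'contributed'. After a failure, an entry stuck at 'started' flags data that may hold a partial update. The marker array g_marks, its layout, and the helper names are assumptions for illustration.

    #include "ga.h"

    enum { MARK_EMPTY = 0, MARK_STARTED = 1, MARK_CONTRIBUTED = 2 };

    /* Hypothetical marker table: one integer per (task, output block),
     * kept in an integer global array g_marks so it outlives the writer. */
    static void set_marker(int g_marks, int task_id, int block_id, int value) {
        int lo[2] = {task_id, block_id}, hi[2] = {task_id, block_id}, ld[1] = {1};
        NGA_Put(g_marks, lo, hi, &value, ld);
    }

    /* Accumulate one block of results, bracketed by begin/end markers. */
    void marked_acc(int g_marks, int task_id, int block_id,
                    int g_out, int lo[], int hi[], double *buf, int ld[]) {
        double one = 1.0;

        GA_Init_fence();
        set_marker(g_marks, task_id, block_id, MARK_STARTED);
        GA_Fence();                             /* marker visible before the data moves */

        GA_Init_fence();
        NGA_Acc(g_out, lo, hi, buf, ld, &one);
        GA_Fence();                             /* accumulate complete at the target    */

        set_marker(g_marks, task_id, block_id, MARK_CONTRIBUTED);
    }

The extra cost is exactly the small marker messages and the per-task metadata listed under Overhead; gets and puts need no markers because re-executing them is harmless.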

Fault Tolerant Task Pool

15

Re-execute incomplete tasks until a round completes without failures (sketch below)
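The outer loop this caption describes can be sketched as below; the three functions are hypothetical placeholders for the runtime's internals.

    #include <stdbool.h>

    bool process_task_pool(void);           /* run to termination; false if a failure was detected  */
    void recover_lost_data_blocks(void);    /* restore data owned by failed or corrupted processes  */
    void readd_incomplete_tasks(void);      /* put tasks without a recorded completion back in pool */

    void fault_tolerant_task_pool(void) {
        bool clean_round;
        do {
            clean_round = process_task_pool();
            if (!clean_round) {
                recover_lost_data_blocks();  /* data state */
                readd_incomplete_tasks();    /* control state: redo only unfinished work */
            }
        } while (!clean_round);              /* stop after a round with no failures */
    }

On re-execution a task contributes its result only if the corresponding marker shows the data has not already been modified, which is the rule on the next slide.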

Task Execution

16

Update result only if it has not already been modified

Detecting Incomplete Communication

Data with ‘started’ set but not ‘contributed’



Approach 1: “Naïve” scheme

Check all markers for any that remain ‘started’

Not scalable


Approach 2: “Home-based” scheme

Invert the task-to-data mapping

Distributed meta-data check + all-to-all (sketch below)


17
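A hedged sketch of the home-based check, assuming a local table of marker records on each process and MPI for the exchange; the marker_t layout and every name below are illustrative. Each process scans only the marker entries it hosts for dangling 'started' states, buckets the affected block ids by the home process of that data, and a single all-to-all exchange tells every home which of its blocks may hold a partial accumulate.

    #include <mpi.h>
    #include <stdlib.h>

    enum { MARK_STARTED = 1, MARK_CONTRIBUTED = 2 };

    /* Hypothetical marker record hosted on this process. */
    typedef struct { int block_id; int home; int state; } marker_t;

    /* On return, *suspects holds the ids of blocks homed here that have a
     * dangling 'started' marker somewhere; the count is returned. */
    int gather_suspect_blocks(marker_t *marks, int nmarks,
                              int **suspects, MPI_Comm comm) {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        /* 1. Count dangling markers per home process. */
        int *scount = calloc(nprocs, sizeof(int));
        for (int i = 0; i < nmarks; i++)
            if (marks[i].state == MARK_STARTED)
                scount[marks[i].home]++;

        /* 2. Pack the suspect block ids grouped by home. */
        int *sdispl = calloc(nprocs + 1, sizeof(int));
        for (int p = 0; p < nprocs; p++) sdispl[p + 1] = sdispl[p] + scount[p];
        int *sendbuf = malloc((sdispl[nprocs] + 1) * sizeof(int));
        int *fill = calloc(nprocs, sizeof(int));
        for (int i = 0; i < nmarks; i++)
            if (marks[i].state == MARK_STARTED) {
                int p = marks[i].home;
                sendbuf[sdispl[p] + fill[p]++] = marks[i].block_id;
            }

        /* 3. All-to-all: every home learns which of its blocks are suspect. */
        int *rcount = malloc(nprocs * sizeof(int));
        MPI_Alltoall(scount, 1, MPI_INT, rcount, 1, MPI_INT, comm);
        int *rdispl = calloc(nprocs + 1, sizeof(int));
        for (int p = 0; p < nprocs; p++) rdispl[p + 1] = rdispl[p] + rcount[p];
        int total = rdispl[nprocs];
        int *recvbuf = malloc((total + 1) * sizeof(int));
        MPI_Alltoallv(sendbuf, scount, sdispl, MPI_INT,
                      recvbuf, rcount, rdispl, MPI_INT, comm);

        free(scount); free(sdispl); free(sendbuf); free(fill);
        free(rcount); free(rdispl);
        *suspects = recvbuf;
        return total;
    }

Compared with the naïve scheme, no process ever scans markers it does not host, so the check scales with per-process metadata rather than the total task count.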

Algorithm Characteristics

Tolerance to arbitrary number of failures


Low overhead in absence of failures

Small messages for markers

Can be optimized through pre-issue/speculation


Space overhead proportional to task pool size

Storage for markers


Recovery cost proportional to #failures

Redo work to produce data in failed processes

18

Bounding Cascading Failures

A process with “corrupted” data

Incomplete comm. from failed process


Marking it as failed -> cascading failures


A process with “corrupted” data

Flushes its communication; then recovers its data


Each task computes only a few data blocks

Each process: pending comm. to a few blocks at a time

Total recovery cost

Data in failed processes + a small number of additional blocks

19

Experimental Setup

Linux cluster


Each node

Dual quad-core 2.5 GHz Opterons

24GB RAM


InfiniBand interconnection network


Self-Consistent Field (SCF) kernel


48 Be atoms


Worst-case fault: at the end of a task pool


20

Cost of Failure


Strong Scaling

21

#tasks re-executed goes down with increasing process count

Worst Case Failure Cost

22

Relative Performance

23

Less than 10% cost for one worst-case fault

Related Work

Checkpoint restart

Continues to handle the SPMD portion of an app

Finer-grain recoverability using our approach

BOINC: client-server

CilkNOW: single-assignment form

Linda: requires transactions

CHARM++: processor virtualization based

Needs message logging

Efforts on fault tolerant runtimes

Complement this work

24

Conclusions

Fault tolerance through

PGAS memory model

Task parallel computation model


Fine-grain recoverability through markers


Cost of failure proportional to #failures


Demonstrated low cost recovery for an SCF kernel

25

Thank You!

26