Neutron Sensitivity and Software Hardening Strategies for Matrix Multiplication and FFT on Graphics Processing Units

June 18th, 2013

New York City, NY, USA

P. Rech, L. Pilla, F. Silvestri, P. O. Navaux, and Luigi Carro

Paolo Rech


FTXS 2013, New York City, NY

Outline

- Radiation Effects on Graphics Processing Units
- Experimental Setup
- Matrix Multiplication
  - Error Rate at Sea Level
  - Hardening Techniques
- Fast Fourier Transform
  - Error Rate at Sea Level
  - Hardening Techniques
- Conclusions


Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles:

- Muons
- Pions
- Protons
- Gamma rays
- Neutrons

Neutron flux: $13\ \mathrm{n/(cm^2\,h)}$ @sea level

Radiation is an issue at sea level!


GPU Internal Structure

[Figure: block diagram of a GPU. Each Streaming Multiprocessor contains threads, one register set (Reg) per thread, and Shared Memory; the Streaming Multiprocessors are connected to DRAM.]

- A GPU is an array of Streaming Multiprocessors (SMs)
- The SMs share DRAM
- Each SM executes many threads in parallel
- Threads have access to Registers and Shared Memory


Radiation Effects on a GPU

[Figure: the GPU block diagram (threads, registers, Shared Memory, Streaming Multiprocessor, DRAM) annotated with SEU markers and one SET marker.]

Radiation can corrupt memory resources (SEU)...

...but also logic (SET) and control circuitry: a scheduler failure may have severe repercussions.


Why Radiation Test on GPUs?

- Titan (Oak Ridge National Lab): 18,000 GPUs. High probability of having a GPU corrupted.
- Pedestrian Detection* (NVIDIA Tegra): high reliability is required.

*From 2015, Euro NCAP gives a 5-star safety rating only to cars with pedestrian detection.


Tested Devices

- NVIDIA GeForce GTX480 (desktop board)
- NVIDIA Tesla C2050 (built-in ECC)


Radiation Test Facilities

[Figures: three slides of images of the radiation test facilities. Surviving labels: "p+" and "Weapon Nuclear Research".]


Neutron Spectrum

1 sec @ISIS = $10^{7}$ sec (110 days) of natural irradiation @NYC


GPU Radiation Test Setup

[Figure: the test setup. Labels: PC, 20 cm PCI-E bus, Beam spot.]

- The PC is inside the room but out of the beam
- A PCI-E bus extension connects the PC and the GPU
- The extension has fuses on the power lines to prevent GPU latchups from affecting the PC


GPU Radiation Test Setup

- The GPU power control circuitry is out of the beam: a power control circuitry failure could compromise the experiment and the GPU
- The DDR memories are out of the beam
- The beam spot is 3 cm wide: the GPU is fully irradiated


Matrix Multiplication

[Figure: A x B = M, each matrix with 2048 elements per side.]

- 2048 x 2048 threads
- Each thread performs 2048 sums and multiplications


Matrix Multiplication Results

Experimental Cross Section* @ISIS = $2.01 \times 10^{-6}\ \mathrm{cm^2}$

The neutron spectrum @ISIS resembles the atmospheric one, so the Cross Section @ISIS resembles the Cross Section @sea level.

Cross Section x Particle Flux (@sea level) = Error Rate

$2.01 \times 10^{-6}\ \mathrm{cm^2} \times 13\ \mathrm{n/(cm^2\,h)} = 2.60 \times 10^{4}\ \mathrm{FIT}$

1 error every 4.5 years

Titan (GTX): 18,000 errors every 4.5 years, about 10 errors per day!

*with double-precision data
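The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative, FIT is taken as failures per 10^9 device-hours, and the hours-per-year constant is an assumption. The results land close to the slide's rounded figures rather than matching them digit for digit.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: average calendar year


def error_rate_per_hour(cross_section_cm2, flux_n_per_cm2_h):
    """Expected errors per hour for one device: cross section times flux."""
    return cross_section_cm2 * flux_n_per_cm2_h


def to_fit(rate_per_hour):
    """FIT = failures per 10^9 hours of operation."""
    return rate_per_hour * 1e9


sigma = 2.01e-6  # cm^2, cross section measured at ISIS (from the slide)
flux = 13.0      # n/(cm^2 h) at sea level (from the slide)

rate = error_rate_per_hour(sigma, flux)
fit = to_fit(rate)
years_between_errors = 1.0 / (rate * HOURS_PER_YEAR)
titan_errors_per_day = 18_000 * rate * 24  # 18,000 GPUs, as for Titan

print(f"FIT per GPU: {fit:.3g}")
print(f"Years between errors on one GPU: {years_between_errors:.2f}")
print(f"Errors per day on 18,000 GPUs: {titan_errors_per_day:.1f}")
```

The per-GPU rate looks harmless in isolation; it is the multiplication by the number of devices that turns it into roughly ten errors per day.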


Multiple Output Errors

It was commonly believed that a fault corrupts just a single output element.

Experimental results:

- Single: 42.2%
- Multiple: 58.8%

The majority of errors are multiple output errors.


Multiple Output Errors Analysis

Three different Multiple Errors patterns are detected:

1) 22.8% on the same Row
2) 26.8% on the same Column
3) 8% Cluster Errors

[Figure: bar chart of Output Errors [%] for Single and Multiple errors (Row, Column, RND), next to a diagram of matrix M with the corrupted elements marked by x.]
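The three patterns can be told apart from the positions of the corrupted elements alone. The sketch below is one way to do that classification; the category names follow the slide, but the exact rules the authors used are not given here, so the thresholds (for example, treating anything that is not confined to one row or one column as a cluster) are assumptions.

```python
def diff_positions(expected, observed):
    """Return the (row, col) positions where two matrices differ."""
    return [(i, j)
            for i, row in enumerate(expected)
            for j, value in enumerate(row)
            if observed[i][j] != value]


def classify_errors(positions):
    """Classify corrupted positions as none, single, row, column, or cluster."""
    points = set(positions)
    if not points:
        return "none"
    if len(points) == 1:
        return "single"
    rows = {r for r, _ in points}
    cols = {c for _, c in points}
    if len(rows) == 1:
        return "row"      # several errors, all on the same row
    if len(cols) == 1:
        return "column"   # several errors, all on the same column
    return "cluster"      # assumption: everything else counts as a cluster


golden = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
faulty = [[1, 2, 3], [0, 5, 0], [7, 8, 9]]
pattern = classify_errors(diff_positions(golden, faulty))
print(pattern)
```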


Errors on Row/Column Causes

[Figure: matrices A, B and M with the GPU cache; the corrupted elements of M (x) lie along one column.]

A column of M is calculated using the rows of A and one column of B, which is stored in the GPU cache.

Threads on an SM share the cache, so cache corruption causes errors on a row or column.
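A small simulation makes the mechanism concrete. It is a simplified model, not the authors' experiment: it assumes one element of B is corrupted before any thread reads it, so every output that uses that element is affected. The function names and the tiny 4 x 4 matrices are illustrative.

```python
def matmul(A, B):
    """Plain triple-loop matrix product on lists of lists."""
    inner = len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def corrupted_positions(A, B, bad_row, bad_col, delta):
    """Corrupt B[bad_row][bad_col] and list the entries of M that change."""
    golden = matmul(A, B)
    B_bad = [row[:] for row in B]
    B_bad[bad_row][bad_col] += delta  # models a flipped value in the cached copy
    faulty = matmul(A, B_bad)
    return [(i, j)
            for i in range(len(golden))
            for j in range(len(golden[0]))
            if golden[i][j] != faulty[i][j]]


n = 4
A = [[i + j + 1 for j in range(n)] for i in range(n)]
B = [[(i * j) % 3 + 1 for j in range(n)] for i in range(n)]
errors = corrupted_positions(A, B, bad_row=2, bad_col=1, delta=8)
print(errors)  # [(0, 1), (1, 1), (2, 1), (3, 1)]
```

All four wrong results sit in column 1 of M, the column that consumes the corrupted element of B. Corrupting an element of A instead would spread the error along a row of M.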


Error Correction

1) ECC on Cache memory

- Corrects multiple errors on Row/Column, which are almost 50% of the total (tested on C2050)
- Memory availability is reduced by 12.5%*
- Execution time is increased by up to 30%*

*NVIDIA datasheet

2) Algorithm-Based Fault Tolerance (ABFT): a technique specifically designed for an algorithm

[Figure: A and B, each extended with a checksum, are multiplied to give M extended with col-check and row-check.]

*Freivalds '79


Matrix Multiplication ABFT

[Figure: M with its col-check and row-check compared against the recomputed col-sum and row-sum; X marks the corrupted element and the mismatching checksums.]

Single Errors* are detected in O(N) and corrected in O(1)

[Figure: the same comparison with several corrupted elements (X) along one row or column of M.]

Errors on a Row/Col* are detected in O(N) and corrected in O(1)

*Huang and Abraham '84

*P. Rech et al., '12
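The checksum scheme for a single corrupted element can be sketched as follows. This is a minimal sequential illustration of the idea in the style of Huang and Abraham, not the authors' GPU implementation: the function names are illustrative, integer data is used so that comparisons are exact (floating-point data would need a tolerance), and only the single-error case is handled.

```python
def matmul(A, B):
    """Plain triple-loop matrix product on lists of lists."""
    inner = len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def with_column_checksum(A):
    """Append a row holding the sum of each column of A."""
    return A + [[sum(col) for col in zip(*A)]]


def with_row_checksum(B):
    """Append to each row of B the sum of that row."""
    return [row + [sum(row)] for row in B]


def detect_and_correct(Mf):
    """Check a full-checksum product and fix one corrupted data element.

    Mf is (n+1) x (m+1): the last row is col-check, the last column is
    row-check. Returns the corrected (row, col), or None if all checks pass.
    """
    n, m = len(Mf) - 1, len(Mf[0]) - 1
    # row-sum minus row-check, and col-sum minus col-check
    row_delta = [sum(Mf[i][:m]) - Mf[i][m] for i in range(n)]
    col_delta = [sum(Mf[i][j] for i in range(n)) - Mf[n][j] for j in range(m)]
    bad_rows = [i for i, d in enumerate(row_delta) if d != 0]
    bad_cols = [j for j, d in enumerate(col_delta) if d != 0]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("mismatch pattern is not a single data error")
    i, j = bad_rows[0], bad_cols[0]
    Mf[i][j] -= row_delta[i]  # the correction itself is one subtraction
    return (i, j)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Mf = matmul(with_column_checksum(A), with_row_checksum(B))
Mf[1][0] += 1000  # inject a single corrupted element
fixed = detect_and_correct(Mf)
print(fixed, Mf[1][0])  # (1, 0) 43
```

The mismatching row and column intersect at the corrupted element, and the size of the mismatch is exactly the error, which is why the correction needs no recomputation of the product.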


Cluster Errors Causes

Cluster errors can be caused by:

- Cache cross-talk
- Errors in dirty cache flags
- Pairwise bit flips in cache
- Scheduler failure

A scheduler failure affects the synchronization of some threads or provides incomplete results. Random locations of M are then erroneous.

[Figure: matrix M with corrupted elements (x) at scattered locations.]


Cluster Errors Criticality

Cluster errors:

- are not corrected by ECC (tested on C2050)
- the scheduler cannot be physically hardened
- scheduler SW hardening* has not yet been proven on GPUs

*Rossi et al.

10

*Karimi et al. ‘10

[Figure: bar chart of Output Errors [%] for Single and Multiple errors (Row, Column).]

Cluster errors are less likely to occur; however, their FIT is $1.13 \times 10^{3}$, which is not negligible!


Cluster Errors Correction

[Figure: M with col-check and row-check compared against col-sum and row-sum; several corrupted elements (X) cause various mismatches between row-check and row-sum and between col-check and col-sum.]

Checksum information is not enough to distinguish the errors, but we can try to correct them with the row-checksums or the col-checksums and check whether the correction succeeds.

Experimentally observed: the corrupted locations in a cluster are ≤ 4, so at most 16 checks are needed!


Fast Fourier Transform

[Figure: a grid of 64-point FFT blocks, 512 FFTs per side.]

- 512x512 threads, each executing the Stockham algorithm on a 64-point FFT
- $\log_2 64 = 6$ iterations are required
- At each iteration a thread updates the 64 elements 2-by-2
- A thread in one iteration uses the output of previous threads as input

Threads are not independent, so errors are likely to spread.

FFT cross section = $3.69 \times 10^{-6}\ \mathrm{cm^2}$ ($5.17 \times 10^{5}$ FIT)
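For reference, a radix-2 Stockham FFT can be sketched in plain Python. This is a sequential illustration of the algorithm family named on the slide, not the GPU kernel that was irradiated; the function names are illustrative. It shows the two properties the slide relies on: a 64-point transform takes 6 iterations, and each iteration combines the elements two at a time.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT; returns (spectrum, iterations)."""
    total = len(x)
    assert total > 0 and total & (total - 1) == 0, "length must be a power of two"
    src = list(x)
    dst = [0j] * total
    n, stride, iterations = total, 1, 0
    while n > 1:
        half = n // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / n)  # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                # elements are updated two at a time (one butterfly)
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src  # ping-pong buffers, no bit-reversal pass
        n, stride = half, stride * 2
        iterations += 1
    return src, iterations


def naive_dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


signal = [complex(i % 5, i % 3) for i in range(64)]
spectrum, iterations = stockham_fft(signal)
max_error = max(abs(a - b) for a, b in zip(spectrum, naive_dft(signal)))
print(iterations)  # 6
```

Because every iteration reads what the previous one wrote, a wrong intermediate value is carried into all later iterations.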


FFT Multiple Errors

[Figure: histogram of the percentage of faulty FFTs (0 to 10%) versus the number of errors per execution (from 2 to >130), for the Real and Imaginary parts.]

- Less than 4% of executions have single errors
- Few executions have an odd number of errors
- Most executions have fewer than 32 errors, or exactly 64 (a thread failure leads to the wrong update of all 64 elements in the FFT)

Software hardening idea: prevent error propagation
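Error propagation in a 64-point FFT can be illustrated by injecting one wrong intermediate value and counting the corrupted outputs. This is a toy fault-injection model on a sequential radix-2 Stockham FFT, not the authors' analysis; the `inject` parameter is a hypothetical hook added for the illustration.

```python
import cmath


def stockham_fft(x, inject=None):
    """Radix-2 Stockham FFT.

    inject=(iteration, index, delta) adds delta to one value after the
    given iteration (iteration 0 corrupts the input itself).
    """
    total = len(x)
    src, dst = list(x), [0j] * total
    if inject and inject[0] == 0:
        src[inject[1]] += inject[2]
    n, stride, done = total, 1, 0
    while n > 1:
        half = n // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / n)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src
        n, stride, done = half, stride * 2, done + 1
        if inject and inject[0] == done:
            src[inject[1]] += inject[2]
    return src


def corrupted_outputs(x, inject, tol=1e-9):
    """Count outputs that differ between a clean and a faulty run."""
    good = stockham_fft(x)
    bad = stockham_fft(x, inject)
    return sum(1 for g, b in zip(good, bad) if abs(g - b) > tol)


signal = [complex(i % 5, i % 3) for i in range(64)]
counts = [corrupted_outputs(signal, (it, 10, 1.0)) for it in range(7)]
print(counts)  # [64, 32, 16, 8, 4, 2, 1]
```

In this model a single upset early in the computation corrupts all 64 outputs, and the count halves with every iteration that has already completed. The counts are powers of two, which is one possible reading of why odd error counts are rare in the measurements.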


FFT Hardening

[Figure: FFT with ABFT: input coding, FFT, output decoding, and checksum generation.]

All errors are detected with a suitable coding-decoding scheme*...

...but only when all iterations are completed: errors do propagate and FFT recomputation is required.

Divide the N-point FFT into N2-point FFTs and N1-point FFTs (N = N1*N2), performing coding-decoding-checksum on each smaller FFT...

...so only the small FFT found corrupted has to be recomputed.

[Figure: Hardened FFT with a check after each smaller FFT; labels: error propagation, computational overhead.]

*J.Y. Jou and Abraham '88

*P. Rech et al. '13
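The divide-and-check idea can be sketched with a two-stage decomposition in which every small transform is verified on its own. The check used here is a deliberately simple linear identity (the outputs of a DFT sum to n times its first input), standing in for the coding-decoding scheme cited on the slide, which is not reproduced. The function names and the `faults` hook are hypothetical, and the twiddle multiplication between the stages is not covered by the check.

```python
import cmath


def dft(x):
    """Direct DFT of a short sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def checksum_ok(x, X, tol=1e-6):
    """Stand-in check: sum of the outputs must equal n * x[0]."""
    return abs(sum(X) - len(x) * x[0]) <= tol


def checked_dft(x, fault=None):
    """Run a small DFT, verify it, and recompute only this DFT on failure.

    fault=(index, delta) is a hypothetical hook corrupting the first attempt.
    """
    X = dft(x)
    if fault is not None:
        X[fault[0]] += fault[1]
    if checksum_ok(x, X):
        return X, False
    return dft(x), True


def hardened_fft(x, n1, n2, faults=None):
    """N-point DFT (N = n1*n2) from n2 DFTs of size n1, then n1 of size n2."""
    faults = faults or {}
    n = n1 * n2
    assert len(x) == n
    recomputed, inner = [], []
    for b in range(n2):  # stage 1: n2 DFTs of size n1
        X, redo = checked_dft(x[b::n2], faults.get((1, b)))
        if redo:
            recomputed.append((1, b))
        inner.append(X)
    for b in range(n2):  # twiddle factors between the two stages
        for k1 in range(n1):
            inner[b][k1] *= cmath.exp(-2j * cmath.pi * b * k1 / n)
    out = [0j] * n
    for k1 in range(n1):  # stage 2: n1 DFTs of size n2
        X, redo = checked_dft([inner[b][k1] for b in range(n2)],
                              faults.get((2, k1)))
        if redo:
            recomputed.append((2, k1))
        for k2 in range(n2):
            out[k1 + n1 * k2] = X[k2]
    return out, recomputed


signal = [complex(i % 7, (3 * i) % 5) for i in range(64)]
result, recomputed = hardened_fft(signal, 8, 8, faults={(1, 3): (2, 1.0)})
max_error = max(abs(a - b) for a, b in zip(result, dft(signal)))
print(recomputed)  # [(1, 3)]
```

Only the one 8-point transform that failed its check is recomputed, and the error never reaches the second stage. The price is one check per small transform, which is the trade-off between error propagation and computational overhead named on the slide. A check this simple would miss errors whose sum is zero.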


Conclusions

- GPUs are very prone to corruption by neutrons
- The radiation response depends on the executed algorithm
- The corruption of shared and critical resources leads to multiple output errors
- ECC is not sufficient to guarantee high reliability
- Software-Based Hardening Strategies can be built by analyzing the algorithm and experimental data

Work in Progress:

- Reduce scheduler strain by optimizing thread distributions
- Analyze cache flag corruptions
- Evaluate error criticality (precision of data)