

CudaDMA: Emulating DMA engines on GPUs for Performance and Programmability

Brucek Khailany (NVIDIA Research)

Michael Bauer (Stanford)

Henry Cook (UC Berkeley)

CudaDMA overview


- A library for efficient bulk transfers between global and shared memory in CUDA kernels (not host<->device copies)
- Motivation: ease programmer burden for high performance
- http://code.google.com/p/cudadma/

[Diagram: GPU with multiple SMs (each with shared memory), an L2 cache, and device DRAM, connected over PCIe to the CPU and host DRAM; CudaDMA targets the global<->shared transfers on the GPU]

Motivation: data shape != thread shape

- Thread block size/shape often mismatches the shared data size/shape
- Result: complex kernel code (lots of 'if' statements, thread index math)
- Goal: decouple data shape from thread block dimensions

[Figure: a 6x7 input data tile loaded by a 4x5 thread block]

Example: 3D finite difference stencil

- 8th order in space, 1st order in time computation
- Thread per (x,y) location
- Step through Z-dimension
- Load 2D halos into shared for each step in Z-dimension
- Programmer challenges:
  - How to split halo xfers across threads?
  - Memory B/W optimizations

/////////////////////////////////////////
// update the data slice in smem

// interior points (two per thread in x)
s_data[ty][tx]       = local_input1[radius];
s_data[ty][tx+BDIMX] = local_input2[radius];

if( threadIdx.y < radius )   // halo above/below
{
    s_data[threadIdx.y][tx]       = g_curr[c_offset - radius*dimx];
    s_data[threadIdx.y][tx+BDIMX] = g_curr[c_offset - radius*dimx + BDIMX];
}

if( threadIdx.y >= radius && threadIdx.y < 2*radius )   // other side of the y halo
{
    s_data[threadIdx.y+BDIMY][tx]       = g_curr[c_offset + (BDIMY-radius)*dimx];
    s_data[threadIdx.y+BDIMY][tx+BDIMX] = g_curr[c_offset + (BDIMY-radius)*dimx + BDIMX];
}

if( threadIdx.x < radius )   // halo left/right
{
    s_data[ty][threadIdx.x]                = g_curr[c_offset - radius];
    s_data[ty][threadIdx.x+2*BDIMX+radius] = g_curr[c_offset + 2*BDIMX];
}

__syncthreads();

Example copy code for 3D stencil

CudaDMA approach

- CudaDMA library
  - Block transfers explicitly declared in CUDA kernels
  - Primarily used for “streaming” data through shared memory
  - Common access patterns supported
  - Implemented as C++ objects instantiated in CUDA kernels
  - Object member functions used to initiate “DMA transfers”
- Advantages
  - Simple, maintainable user code
  - Access patterns independent of thread block dimensions
  - Optimized library implementations for global memory bandwidth

Kernel pseudocode with CudaDMA

- CudaDMA objects declared at top of the kernel
  - Fixed access pattern
- Kernel loops over large dataset:
  - Copy data to shared
  - Barrier
  - Process data in shared
  - Barrier

__global__
void cuda_dma_kernel(float *data)
{
  __shared__ float buffer[NUM_ELMTS];

  cudaDMAStrided<false,ALIGNMENT>
    dma_ld(EL_SZ, EL_CNT, EL_STRIDE);

  for (int i = 0; i < NUM_ITERS; i++)
  {
    dma_ld.execute_dma(&data[A*i], buffer);  // copy one chunk into shared
    __syncthreads();
    process_buffer(buffer);                  // compute on the shared buffer
    __syncthreads();
  }
}

Supported access patterns

- CudaDMASequential
- CudaDMAStrided
- CudaDMAIndirect
  - Scatter/gather
- CudaDMAHalo
  - 2D halo regions

Specifying access patterns

- Access pattern described with parameters
- Up to 5 parameters for strided patterns (see the sketch below):
  - BYTES_PER_ELMT - the size of each element in bytes
  - NUM_ELMTS - the number of elements to be transferred
  - ALIGNMENT - whether elements are 4-, 8-, or 16-byte aligned
  - src_stride - the stride between the source elements in bytes
  - dst_stride - the stride between elements after they have been transferred, in bytes
- Similar parameters used for other patterns
- # of threads independent of access pattern
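
To make the parameter mapping concrete, below is a minimal sketch of a strided load without warp specialization, written in the same style as the kernel pseudocode earlier. The <false, ALIGNMENT> template arguments and the three-argument constructor mirror that pseudocode; the header name, the placeholder constants, and the exact constructor overloads (for example, whether src_stride and dst_stride are passed separately) are assumptions, so consult the documentation at http://code.google.com/p/cudadma/ for the real signatures.

// Minimal sketch (constructor overload and header name assumed): gather
// NUM_ELMTS elements of BYTES_PER_ELMT bytes each, EL_STRIDE bytes apart in
// global memory, and pack them densely into shared memory.
#include "cudaDMA.h"              // header name assumed

#define ALIGNMENT      16         // elements are 16-byte aligned
#define BYTES_PER_ELMT 512        // size of each element in bytes
#define NUM_ELMTS      16         // number of elements to transfer
#define EL_STRIDE      4096       // stride between source elements in bytes

__global__ void strided_load_kernel(float *g_data)
{
  // Dense destination buffer: NUM_ELMTS elements of BYTES_PER_ELMT bytes
  __shared__ float buffer[(BYTES_PER_ELMT * NUM_ELMTS) / sizeof(float)];

  // false => no warp specialization: all threads in the block do the copy
  cudaDMAStrided<false, ALIGNMENT>
    dma_ld(BYTES_PER_ELMT, NUM_ELMTS, EL_STRIDE);

  dma_ld.execute_dma(g_data, buffer);   // one strided transfer into shared
  __syncthreads();
  // ... compute on buffer ...
}

With this form the destination packing is dense; passing a separate dst_stride would presumably use a longer constructor overload.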

Optimizations and tuning

- Optimizations performed by CudaDMA implementations:
  - Pointer casting enables vector loads and stores (see the sketch below)
  - Hoisting of pointer arithmetic into the constructor
  - Memory coalescing and shared memory bank conflict avoidance
- Considerations for memory bandwidth performance:
  - Use compile-time constant template parameters if possible
  - Load at maximum alignment (highest performance at 16 bytes)
  - Size, # threads: highest performance with <= 64 bytes per thread
    (64 threads: 4 KB transfers; 128 threads: 8 KB transfers, …)
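
The pointer-casting bullet above refers to vectorizing the underlying loads and stores. The snippet below is an illustration of that general technique in plain CUDA, not CudaDMA library code: when source and destination are 16-byte aligned, casting float* to float4* lets each thread issue one 16-byte load instead of four 4-byte loads, which is the mechanism behind the "highest performance at 16 bytes" guideline.

// Illustration only (not library code): a block-wide copy that exploits
// 16-byte alignment by moving float4 vectors.
__device__ void copy_vectorized(float *dst, const float *src, int num_floats)
{
  // Assumes src and dst are 16-byte aligned and num_floats is a multiple of 4
  const float4 *src4 = reinterpret_cast<const float4 *>(src);
  float4       *dst4 = reinterpret_cast<float4 *>(dst);
  const int num_vecs = num_floats / 4;

  for (int i = threadIdx.x; i < num_vecs; i += blockDim.x)
    dst4[i] = src4[i];   // one 16-byte load and one 16-byte store per iteration
}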

Predictable performance

- Strided access pattern
- 2 KB transfers
- Tesla C2070 (ECC off)
- 128 threads per thread block participating in the DMA
- 2 thread blocks per SM
- Similar results for other access patterns

  Element size (bytes)   Total elements   GB/s
                    32               64   73.4
                    64               32   73.5
                   128               16   73.6
                   256                8   73.5
                   512                4   73.5
                  1024                2   83.7
                  2048                1   84.5

Optimizations using warp specialization

- CudaDMA supports splitting thread blocks into compute and DMA warps
- Supports producer-consumer synchronization functions (sketched below):
  - On compute warps: non-blocking start_async_dma(), blocking wait_for_dma_finish()
  - On DMA warps: execute_dma() includes synchronization
- Uses named barriers in PTX: bar.arrive, bar.sync

[Diagram: timeline for iterations i and i+1; in each iteration the compute warps call start_async_dma() and then wait_for_dma_finish(), while the DMA warps run execute_dma(), with the hand-off through barrier 1 and barrier 2]
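
The sketch below illustrates the structure in the diagram above for a single buffer. The synchronization calls (start_async_dma(), wait_for_dma_finish(), execute_dma()) are the ones named on this slide; the warp-specialized constructor arguments, the thread partitioning, and the placeholder constants are assumptions, so treat this as annotated pseudocode and consult the library documentation for the exact signatures.

// Hedged sketch of warp specialization (constructor arguments assumed).
#include "cudaDMA.h"                        // header name assumed

#define ALIGNMENT       16
#define COMPUTE_THREADS 128                 // first four warps compute
#define DMA_THREADS     32                  // one additional warp does the copies
#define BYTES_PER_ELMT  512
#define NUM_ELMTS       16
#define EL_STRIDE       4096
#define NUM_ITERS       64
#define CHUNK_FLOATS    ((BYTES_PER_ELMT * NUM_ELMTS) / sizeof(float))

__device__ void process_buffer(float *buf);   // user compute, as in the earlier pseudocode

__global__ void ws_kernel(float *data)
{
  __shared__ float buffer[CHUNK_FLOATS];

  // true => warp specialization; the remaining constructor arguments (a DMA id,
  // warp counts, where the DMA threads start) are assumptions in this sketch.
  cudaDMAStrided<true, ALIGNMENT>
    dma_ld(0 /* dma id */, DMA_THREADS, COMPUTE_THREADS,
           COMPUTE_THREADS /* dma_threadIdx_start */,
           BYTES_PER_ELMT, NUM_ELMTS, EL_STRIDE);

  if (threadIdx.x < COMPUTE_THREADS) {
    // Compute warps: the producer-consumer handshake below is built on the
    // named barriers (bar.arrive / bar.sync) mentioned above.
    for (int i = 0; i < NUM_ITERS; i++) {
      dma_ld.start_async_dma();       // non-blocking: tell the DMA warps to fill the buffer
      dma_ld.wait_for_dma_finish();   // blocking: the buffer is now valid
      process_buffer(buffer);         // compute on the shared buffer
    }
  } else {
    // DMA warps: execute_dma() waits for start_async_dma() and signals completion
    for (int i = 0; i < NUM_ITERS; i++)
      dma_ld.execute_dma(&data[i * CHUNK_FLOATS], buffer);
  }
}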

Warp specialization buffering techniques

- Usually one set of DMA warps per buffer
- Single-buffering: 1 buffer, 1 warp group
- Double-buffering: 2 buffers, 2 warp groups (see the sketch below)
- Manual double-buffering: 2 buffers, 2 warp groups
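
Below is a hedged sketch of the double-buffered variant, reusing the assumed constants, header, and process_buffer declaration from the previous sketch. Compute warps ping-pong between two shared buffers, each filled by its own DMA warp group through its own CudaDMA object, so the transfer into one buffer overlaps the computation on the other. As before, only the synchronization calls come from the slides; the constructor arguments and thread partitioning are assumptions.

// Hedged sketch of double-buffering (assumes NUM_ITERS is even and the block
// is launched with COMPUTE_THREADS + 2*DMA_THREADS threads).
__global__ void double_buffered_kernel(float *data)
{
  __shared__ float buf0[CHUNK_FLOATS];
  __shared__ float buf1[CHUNK_FLOATS];

  // One object per buffer, each bound to its own group of DMA warps
  // (constructor argument order assumed, as in the previous sketch).
  cudaDMAStrided<true, ALIGNMENT>
    dma0(0, DMA_THREADS, COMPUTE_THREADS, COMPUTE_THREADS,
         BYTES_PER_ELMT, NUM_ELMTS, EL_STRIDE);
  cudaDMAStrided<true, ALIGNMENT>
    dma1(1, DMA_THREADS, COMPUTE_THREADS, COMPUTE_THREADS + DMA_THREADS,
         BYTES_PER_ELMT, NUM_ELMTS, EL_STRIDE);

  if (threadIdx.x < COMPUTE_THREADS) {
    // Compute warps: prefetch both buffers, then alternate between them.
    dma0.start_async_dma();                        // chunk 0 -> buf0
    dma1.start_async_dma();                        // chunk 1 -> buf1
    for (int i = 0; i < NUM_ITERS; i += 2) {
      dma0.wait_for_dma_finish();
      process_buffer(buf0);                        // overlaps the load into buf1
      if (i + 2 < NUM_ITERS) dma0.start_async_dma();
      dma1.wait_for_dma_finish();
      process_buffer(buf1);                        // overlaps the load into buf0
      if (i + 3 < NUM_ITERS) dma1.start_async_dma();
    }
  } else if (threadIdx.x < COMPUTE_THREADS + DMA_THREADS) {
    for (int i = 0; i < NUM_ITERS; i += 2)         // DMA warp group 0: even chunks
      dma0.execute_dma(&data[i * CHUNK_FLOATS], buf0);
  } else {
    for (int i = 1; i < NUM_ITERS; i += 2)         // DMA warp group 1: odd chunks
      dma1.execute_dma(&data[i * CHUNK_FLOATS], buf1);
  }
}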

Experimental results

- Up to 1.15x speedup over a tuned 3D finite difference kernel
  - Using cudaDMA with warp specialization on a Tesla C2050
- Up to 3.2x speedup on SGEMV
  - Large speedups at small matrix sizes
  - Compared to the MAGMA BLAS library
- 2.74x speedup on a finite-element CFD code

[Chart: execution time (ms) comparisons]

Status

- Recent library changes:
  - [Nov 2011] Library released on Google Code
    - Included optimized implementations for many sequential, strided, and halo access patterns
    - cudaDMA objects with and without warp specialization supported
  - [May 2012] Enhancements to cudaDMAIndirect
- In progress:
  - Kepler-optimized implementation
  - Exploring the use of loads through the texture cache
  - Future productization discussions

Summary

- CudaDMA
  - A library for efficient bulk transfers between global and shared memory in CUDA kernels
  - Supports asynchronous “DMA” transfers using warp specialization and inline PTX producer-consumer synchronization instructions
- Detailed documentation and source code:
  - http://code.google.com/p/cudadma/
  - Bauer, Cook, and Khailany, “CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization”, SC’11
- Feedback? bkhailany@nvidia.com