CUDA_SCED11x - SC11 Education Program




Your goal is to speed up your code as much as possible, BUT you have a budget for power...

Do you choose:

1. 6 processors, each providing N performance and using P Watts
2. 450 processors, each providing N/10 performance and collectively using 2P Watts
3. It depends!

~1.0 TFLOPS (SP) / ~500 GFLOPS (DP)

140+ GB/s DRAM bandwidth

ASCI Red (Sandia National Labs, 1997)


A quiet revolution

Calculation: TFLOPS vs. 100 GFLOPS

Memory bandwidth: ~10x

GPU in every PC: massive volume

[Figure 1.1: Enlarging performance gap between many-core GPUs and multi-core CPUs. Courtesy: John Owens]


512 GPU cores

1.30 GHz

Single precision floating point performance: 1331 GFLOPs
  (2 single precision flops per clock per core)

Double precision floating point performance: 665 GFLOPs
  (1 double precision flop per clock per core)

Internal RAM: 6 GB GDDR5

Internal RAM speed: 177 GB/sec (compared to roughly 30 GB/sec for regular RAM)

Has to be plugged into a PCIe slot (at most 8 GB/sec)


4 C2050 cards inside a 1U server
  (looks like a typical CPU node)

1.15 GHz

Single Precision (SP) floating point performance: 4121.6 GFLOPs

Double Precision (DP) floating point performance: 2060.8 GFLOPs

Internal RAM: 12 GB total (3 GB per GPU card)

Internal RAM speed: 576 GB/sec aggregate

Has to be plugged into two PCIe slots (at most 16 GB/sec)


Let’s compare a good dual socket x86 server today vs. the S2050.

                        Dual socket AMD,            NVIDIA Tesla S2050
                        2.3 GHz 12-core
Peak DP FLOPs           220.8 GFLOPs DP             2060.8 GFLOPs DP (9.3x)
Peak SP FLOPs           441.6 GFLOPs SP             4121.6 GFLOPs SP (9.3x)
Peak RAM BW             25 GB/sec                   576 GB/sec (23x)
Peak PCIe BW            N/A                         16 GB/sec
Needs x86 server
to attach to?           No                          Yes
Power/Heat              ~450 W                      ~900 W + ~400 W (~2.9x)
Code portable?          Yes                         No (CUDA)
                                                    Yes (PGI, OpenCL)


Here are some interesting measures:

                        Dual socket AMD,            NVIDIA Tesla S2050
                        2.3 GHz 12-core
DP GFLOPs/Watt          ~0.5 GFLOPs/Watt            ~1.6 GFLOPs/Watt (~3x)
SP GFLOPs/Watt          ~1 GFLOPs/Watt              ~3.2 GFLOPs/Watt (~3x)
DP GFLOPs/sq ft         ~590 GFLOPs/sq ft           ~2750 GFLOPs/sq ft (4.7x)
SP GFLOPs/sq ft         ~1180 GFLOPs/sq ft          ~5500 GFLOPs/sq ft (4.7x)
Racks per PFLOP DP      142 racks/PFLOP DP          32 racks/PFLOP DP (23%)
Racks per PFLOP SP      71 racks/PFLOP SP           16 racks/PFLOP SP (23%)

OU’s Sooner is 34.5 TFLOPs DP, which is just over 1 rack of S2050.



Do they bear out in practice?

Tianhe-1
  Hybrid (GPU-heavy) machine
  55% peak on HPL

Jaguar
  CPU-based machine
  75% peak on HPL

But they do bear out more fully on some applications.

Many of these applications are in computational science and engineering.

[Image: Stone, et al., Overset Grid/Gridless Methods for Fuselage and Rotor Wakes]



Dealing with the graphics API

To get general purpose code working, you had to use the corner cases of the graphics API.

Essentially: re-write the entire program as a collection of shaders and polygons.

[Diagram: the fragment program pipeline; input registers, constants, textures, and temp registers feed a fragment program, which writes output registers to FB memory; resources are allocated per thread, per shader, or per context.]


“Compute Unified Device Architecture”

General purpose programming model
  User kicks off batches of threads on the GPU
  GPU = dedicated super-threaded, massively data-parallel co-processor

Targeted software stack
  Compute-oriented drivers, language, and tools

Driver for loading computational programs onto the GPU


CUDA programming model
  Basic concepts and data types

CUDA application programming interface (API) basics

A couple of simple examples

Some performance issues will be covered in session #2 (3:30-5pm)


A CUDA compute device
  Is a coprocessor to the CPU or host
  Has its own DRAM (device memory)
  Runs many threads in parallel
  Is typically a GPU, but can also be another type of parallel processing device

Data-parallel portions of an application are expressed as device kernels, which run on many threads.

Differences between GPU and CPU threads:
  GPU threads are extremely lightweight
    Very little creation overhead
  GPU needs 1000s of threads for full efficiency
  Multi-core CPU needs only a few (and is hurt by having too many)

One program, two devices:
  Serial or modestly parallel parts in host C code
  Highly parallel parts in device kernel C code
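As a concrete (and optional) first step, the CUDA runtime can report what the attached compute device looks like. The sketch below uses the standard cudaGetDeviceCount / cudaGetDeviceProperties calls; the program itself is an illustration, not part of the original slides.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);               // how many CUDA devices are attached?
        printf("Found %d CUDA device(s)\n", count);

        for (int dev = 0; dev < count; dev++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s\n", dev, prop.name);
            printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
            printf("  Device memory:   %lu bytes\n", (unsigned long)prop.totalGlobalMem);
            printf("  Clock rate:      %d kHz\n", prop.clockRate);
        }
        return 0;
    }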

Serial Code (host)
    . . .

Parallel Kernel (device)
    KernelA<<<nBlk, nTid>>>(args);

Serial Code (host)
    . . .

Parallel Kernel (device)
    KernelB<<<nBlk, nTid>>>(args);



In CUDA, a kernel is code (typically a function) that can be run inside the GPU.

The kernel code runs on many of the stream processors in the GPU in parallel.
  Each processor runs the code over different data (SPMD)


In CUDA, a thread is an execution of a kernel with a given index.

Each thread uses its index to access a specific subset of the data, such that the collection of all threads cooperatively processes the entire data set.

Think: MPI process ID.

These operate very much like threads in OpenMP; they even have shared and private variables.

So what’s the difference with CUDA? Threads are free.

[Diagram: an array with elements 0..7; each thread executes
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
on the element selected by its threadID.]
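To make that diagram concrete, here is a minimal sketch of the same per-thread pattern written as an actual kernel. The kernel name scale_by_two and the use of 2.0f as a stand-in for func(x) are illustrative choices, not from the slides; it also assumes a single block whose size matches the array length.

    // Each thread handles exactly the element selected by its own index.
    __global__ void scale_by_two(float *input, float *output)
    {
        int threadID = threadIdx.x;      // this thread's index within its block
        float x = input[threadID];
        float y = 2.0f * x;              // stand-in for func(x)
        output[threadID] = y;
    }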


In CUDA, a block is a group of threads.

Blocks are used to organize threads into manageable (and schedulable) chunks.
  Can organize threads in 1D, 2D, or 3D arrangements
  What best matches your data?
  Some restrictions, based on hardware

Threads within a block can do a bit of synchronization, if necessary.
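On the host side, the 1D/2D/3D arrangement is expressed with dim3 when the kernel is launched (the launch syntax itself appears a few slides later). The specific sizes below are made-up examples, not values from the slides.

    dim3 block1D(256);        // 256 threads in a line
    dim3 block2D(16, 16);     // 256 threads in a 16x16 tile (good for images)
    dim3 block3D(8, 8, 4);    // 256 threads in an 8x8x4 brick (good for volumes)
    // Inside a kernel, threads of one block can synchronize with __syncthreads().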


In CUDA, a grid is a group of blocks; there is no synchronization at all between the blocks.

Grids are used to organize blocks into manageable (and schedulable) chunks.
  Can organize blocks in 1D or 2D arrangements
  What best matches your data?

A grid is the set of threads created by a call to a CUDA kernel.

Grids map to GPUs.

Blocks map to the MultiProcessors (MPs).
  Blocks are never split across MPs, but an MP can have multiple blocks.

Threads map to Stream Processors (SPs).

Warps are groups of (32) threads that execute simultaneously.
  Completely forget these exist until you get good at this.

Image source: NVIDIA CUDA Programming Guide


16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.

[Diagram: the host and input assembler feed a thread execution manager; an array of streaming multiprocessors, each with a parallel data cache and texture unit, with load/store paths to global memory; a kernel's blocks (Block 0 through Block 7) are distributed across the SMs.]

Hardware is free to assign blocks to any SM (processor).

A kernel scales across any number of parallel processors.

[Diagram: the same kernel grid of eight blocks (Block 0 through Block 7) scheduled onto devices with different numbers of SMs; a device with fewer SMs simply runs more of the blocks per SM over time.]

Each block can execute in any order relative to other blocks.

[Figure 3.2: An example of CUDA thread organization; the host launches Kernel 1 and then Kernel 2, and each launch creates a grid of blocks on the device, e.g. Grid 1 with Block (0,0), Block (1,0), Block (0,1), Block (1,1); each block, e.g. Block (1,1), contains a 3D arrangement of threads such as Thread (0,0,0) through Thread (3,1,0). Courtesy: NVIDIA.]


Each thread uses IDs to decide what data to work on:
  BlockIdx: 1D or 2D
  ThreadIdx: 1D, 2D, or 3D

This simplifies memory addressing when processing multidimensional data:
  Image processing
  Solving PDEs on volumes
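For example, in image processing each thread can combine its block and thread IDs into a 2D pixel coordinate. This is a sketch; the kernel name invert_image and its parameters are hypothetical.

    __global__ void invert_image(unsigned char *pixels, int width, int height)
    {
        // 2D position of this thread within the whole grid
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;

        if (col < width && row < height) {       // guard threads that fall off the edge
            int idx = row * width + col;         // 2D coordinate -> 1D memory offset
            pixels[idx] = 255 - pixels[idx];
        }
    }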


Global memory
  Main means of communicating R/W data between host and device
  Contents visible to all threads
  Long latency access

We will focus on global memory for now; other memories will come later.

[Diagram: CUDA memory model; the host talks to the device's global memory; each block in the grid has its own shared memory; each thread has its own registers. Note: this is the programming model, not hardware!]

cudaMalloc()
  Allocates an object in the device global memory
  Requires two parameters:
    Address of a pointer to the allocated object
    Size of the allocated object

cudaFree()
  Frees an object from device global memory
  Takes a pointer to the freed object


Code example:
  Allocate a 64 * 64 single precision float array
  Attach the allocated storage to a pointer named Md
  "d" is often used in naming to indicate a device data structure

    const int TILE_WIDTH = 64;
    float *Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);

    cudaFree(Md);

[Diagram: CPU (host) connected to a GPU with its own local DRAM (device).]


cudaMemcpy()
  Memory data transfer
  Requires four parameters:
    Pointer to destination
    Pointer to source
    Number of bytes copied
    Type of transfer
      Host to Host
      Host to Device
      Device to Host
      Device to Device
  Asynchronous transfer


Code example:
  Transfer a 64 * 64 single precision float array
  M is in host memory and Md is in device memory
  cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
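Every runtime call above also returns a status code, and checking it is cheap insurance. A small sketch (not from the slides) using the runtime's cudaError_t / cudaGetErrorString API:

    cudaError_t err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));  // needs <stdio.h>
        exit(EXIT_FAILURE);                                                   // needs <stdlib.h>
    }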


In C:

    void foo(int a, float b)
    {
        // slow code goes here
    }

In CUDA C:

    __global__ void foo(int a, float b)
    {
        // fast code goes here!
    }
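Two details worth knowing, stated here from the CUDA C language rules rather than from the slides: a __global__ function must return void, and it can call helper functions marked __device__, which run on the GPU but cannot be launched from the host. The output parameter below is hypothetical.

    // Runs on the GPU, callable only from GPU code.
    __device__ float square(float v)
    {
        return v * v;
    }

    // Runs on the GPU, launched from host code; must return void.
    __global__ void foo(int a, float b, float *out)
    {
        out[a] = square(b);    // hypothetical use of the helper
    }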


A kernel function must be called with an execution configuration.

First, declare the dimensions for the grid and the blocks:

    dim3 DimGrid(100, 50);   // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);  // 256 threads per block

    KernelFunc(...);         // invoke a function


A kernel function must be called with an execution configuration:

    dim3 DimGrid(100, 50);   // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);  // 256 threads per block

    KernelFunc<<<DimGrid, DimBlock>>>(...);

Any call to a kernel function is asynchronous; explicit synchronization is needed for blocking.
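Because the launch returns immediately, the host must synchronize explicitly before it depends on the kernel's results. A minimal sketch, continuing the fragment above and using the runtime call cudaDeviceSynchronize():

    KernelFunc<<<DimGrid, DimBlock>>>(...);   // returns immediately

    cudaDeviceSynchronize();                  // block the host until the kernel has finished

In practice, a blocking cudaMemcpy from device to host also waits for previously launched work in the default stream, so an explicit synchronize is often only needed for timing or error checking.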

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        int i;
        for (i = 0; i < n; i++) {
            y[i] = a*x[i] + y[i];
        }
    }

    // invoke the kernel
    saxpy_serial(n, 2.0, x, y);


Doing anything across an entire vector is perfect for massively parallel (GPGPU) computing.

Instead of one function looping over the data set, we’ll use many threads, each doing one calculation.

[Diagram: an array with elements 0..7; each thread executes
    y[tid] = a*x[tid] + y[tid];
on the element selected by its threadID.]

    __global__ void saxpy_cuda(int n, float a, float *x, float *y)
    {
        int i = (blockIdx.x * blockDim.x) + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

    int nblocks = (n + 255) / 256;

    // invoke the kernel with 256 threads per block
    saxpy_cuda<<<nblocks, 256>>>(n, 2.0, x, y);
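Putting the pieces from the earlier slides together, a complete host-side driver would allocate device arrays, copy the inputs over, launch saxpy_cuda, and copy the result back. This is a sketch that assumes the kernel defined above; the sizes and fill values are made up, and error checking is omitted for brevity.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    // saxpy_cuda kernel as defined above

    int main(void)
    {
        int n = 1 << 20;
        size_t size = n * sizeof(float);

        // Host arrays
        float *x = (float*)malloc(size);
        float *y = (float*)malloc(size);
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        // Device arrays ("d" suffix for device, as in the earlier naming convention)
        float *xd, *yd;
        cudaMalloc((void**)&xd, size);
        cudaMalloc((void**)&yd, size);

        // Copy inputs host -> device
        cudaMemcpy(xd, x, size, cudaMemcpyHostToDevice);
        cudaMemcpy(yd, y, size, cudaMemcpyHostToDevice);

        // Launch with 256 threads per block, enough blocks to cover n elements
        int nblocks = (n + 255) / 256;
        saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, xd, yd);

        // Copy the result device -> host (this also waits for the kernel to finish)
        cudaMemcpy(y, yd, size, cudaMemcpyDeviceToHost);

        cudaFree(xd); cudaFree(yd);
        free(x); free(y);
        return 0;
    }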


What kinds of codes are good for GPGPU acceleration?

What kinds of codes are bad?


Could I be getting better performance?
  Probably a little bit. Most of the performance is handled in HW.

How much better?
  If you compile with -O3, you can get faster (maybe 2x).
  If you are careful about tiling your memory, you can get faster on codes that benefit from that (maybe 2-3x).

Is that much performance worth the work?
  Compiling with optimizations is a no-brainer (and yet…).
  Tiling is useful, but takes an investment.


Could I be getting better performance?
  Am I getting near peak GFLOP performance?

How much better?
  Brandon’s particle code, using several different code modifications:
    148 ms per time step, reduced to 4 ms per time step

Is that much worth the work?
  How much work would you do for 30-40x?
  Most of the modifications are fairly straightforward.
    You just need to know how the hardware works a bit more.

Am I bandwidth bound? (How do I tell?)
  Make sure I have high thread occupancy to tolerate latencies.
    These threads can get some work done while we wait for memory.
  Move re-used values to closer memories (see the sketch after this list):
    Shared
    Constant/Texture

If I am not bandwidth bound, what is now my limit?
  Take a closer look at the instruction stream.
    Unroll loops
    Minimize branch divergence
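As one concrete illustration of "move re-used values to closer memories", a block can stage a tile of global memory in __shared__ memory once, and then every thread in the block re-reads it cheaply. This is a generic sketch (a simple 1D 3-point stencil with hypothetical names), not the particle code mentioned above; it assumes the kernel is launched with BLOCK threads per block.

    #define BLOCK 256

    // Each block loads BLOCK+2 values into shared memory once,
    // then every thread reads its three neighbors from the fast copy.
    __global__ void stencil3(const float *in, float *out, int n)
    {
        __shared__ float tile[BLOCK + 2];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int lid = threadIdx.x + 1;                         // local index, shifted for the halo

        if (gid < n) tile[lid] = in[gid];
        if (threadIdx.x == 0 && gid > 0)
            tile[0] = in[gid - 1];                         // left halo element
        if (threadIdx.x == blockDim.x - 1 && gid < n - 1)
            tile[lid + 1] = in[gid + 1];                   // right halo element

        __syncthreads();                                   // tile is now fully loaded

        if (gid > 0 && gid < n - 1)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }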