1

Workshop 9: General purpose computing using GPUs: Developing a hands-on undergraduate course on CUDA programming

SIGCSE 2011 -- The 42nd ACM Technical Symposium on Computer Science Education

Wednesday March 9, 2011, 7:00 pm - 10:00 pm

Dr. Barry Wilkinson
University of North Carolina Charlotte

Dr. Yaohang Li
Old Dominion University

SIGCSE 2011 Workshop 9 Session1.ppt © 2010 B. Wilkinson  Modification date: Feb 23, 2011


2

Session 1    7:15 pm - 8:25 pm

Session 1: Basic CUDA programming

    Presentation (about 35 minutes)

        Kernel calls, data movement, predefined variables, thread organization, code examples

    Hands-on experience using remote GPU server (about 35 minutes)

3

CUDA Program

A CUDA program has code to be executed on the CPU and code to be executed on the GPU in one source file (in simple cases) and one executable when compiled.

A CUDA kernel is a routine to be executed on the GPU -- a SIMT code sequence.

Kernel code is regular C except that one typically needs to use the thread ID in expressions to ensure each thread accesses different data.

When a kernel is reached in the code for the first time, it will be launched onto the GPU.

4

CPU and GPU Memories

Separate memories on CPU (host) and GPU (device)*

Usually need to

    Explicitly transfer data from CPU main memory to GPU global memory for GPU computation, and

    Explicitly transfer results in GPU global memory back to CPU main memory

[Figure: CPU with its main memory and GPU with its global memory, with a copy from CPU to GPU and a copy from GPU to CPU]

* assuming a separate GPU card. Integrated systems might share some memory.

5

Basic CUDA program structure

int main (int argc, char **argv ) {

    1. Allocate memory space in device (GPU) for data
    2. Allocate memory space in host (CPU) for data
    3. Copy data to GPU
    4. Call "kernel" routine to execute on GPU
       (with CUDA syntax that defines the number of threads and their physical structure)
    5. Transfer results from GPU to CPU
    6. Free memory space in device (GPU)
    7. Free memory space in host (CPU)

    return;
}

6

1. Allocating memory space in "device" (GPU) for data

Use CUDA malloc routines:

    int size = N * sizeof(int);      // space for N integers

    int *devA, *devB, *devC;         // devA, devB, devC ptrs

    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.

7

2. Allocating memory space in "host" (CPU) for data

Use regular C malloc routines:

    int *a, *b, *c;

    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);

or statically declare variables:

    #define N 256

    int a[N], b[N], c[N];


8

3. Transferring data from host (CPU) to device (GPU)

Use CUDA routine cudaMemcpy:

    cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);

    cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);

where devA and devB are pointers to the destination in the device, and A and B are pointers to the host data.

9

4. Declaring "kernel" routine to execute on device (GPU)

CUDA introduces a <<<...>>> syntax addition to C for kernel calls:

    myKernel<<< n, m >>>(arg1, ... );

<<< ... >>> contains the thread organization for this particular kernel call in two parameters, n and m:

    For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.

arg1, ... -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.

10


Example -- Adding two vectors A and B

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {   // Kernel definition
    int i = threadIdx.x;        // CUDA structure that provides thread ID in block
    C[i] = A[i] + B[i];
}

int main() {
    // allocate device memory &
    // copy data to device
    // device mem. ptrs devA, devB, devC

    vecAdd<<<1, N>>>(devA, devB, devC);   // one block of N threads
}

Loosely derived from CUDA C Programming Guide, v 3.2, 2010, NVIDIA.

Kernel Routine

Defined using CUDA specifier __global__

Each thread performs one pair-wise addition:

    Thread 0: devC[0] = devA[0] + devB[0];
    Thread 1: devC[1] = devA[1] + devB[1];
    Thread 2: devC[2] = devA[2] + devB[2];
    .
    .
    .

11

5. Transferring data from device (GPU) to host (CPU)

Use CUDA routine cudaMemcpy:

    cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);

where devC is a pointer in the device and C is a pointer in the host.

12

Free memory space

In "device" (GPU) -- use the CUDA cudaFree routine:

    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );

In host (CPU), if CPU memory was allocated with malloc -- use the regular C free routine:

    free( a );
    free( b );
    free( c );


13

Complete CUDA program

Adding two vectors, A and B. N elements in A and B, and N threads. (Without code to load the arrays with data.)

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main (int argc, char **argv ) {

    int size = N * sizeof(int);
    int a[N], b[N], c[N];            // host arrays, statically declared
    int *devA, *devB, *devC;         // device pointers

    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);

    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);

    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );

    return (0);
}


14

So far, the organization of threads is one block of N threads.

GPUs are actually organized to execute blocks of threads in 1 or 2 dimensions -- the collection of blocks being called a grid.

The blocks themselves can be organized in 1-D, 2-D, or 3-D.

15

CUDA SIMT Thread Structure

[Figure: a grid of thread blocks -- the grid can be 1 or 2 dimensions; each block of threads can be 1, 2, or 3 dimensions. From the CUDA C Programming Guide, v 3.2, 2010, NVIDIA.]

Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.

Linked to internal organization.

Threads in one block execute together.

16

Device characteristics -- some limitations

NVIDIA defines "compute capabilities", 1.0, 1.1, ..., with these limits and features supported.

Compute capability 1.0:

    Maximum number of threads per block                        = 512
    Maximum size of x- and y-dimensions of a thread block      = 512
    Maximum size of each dimension of a grid of thread blocks  = 65535
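These limits can be queried at run time. Below is a minimal sketch (not from the slides) using the CUDA runtime call cudaGetDeviceProperties; the fields shown are from the cudaDeviceProp structure. Compile it with nvcc as a .cu file so the CUDA runtime header is included automatically.

    #include <stdio.h>

    int main() {
        cudaDeviceProp prop;                        // filled in by the CUDA runtime
        cudaGetDeviceProperties(&prop, 0);          // query device 0

        printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max block dimensions:  %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dimensions:   %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }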

17

Defining Grid/Block Structure

Need to provide each kernel call with values for two key structures:

    Number of blocks in each dimension
    Threads per block in each dimension

    myKernel<<< B, T >>>(arg1, ... );

B -- a structure that defines the number of blocks in the grid in each dimension (1-D or 2-D).

T -- a structure that defines the number of threads in a block in each dimension (1-D, 2-D, or 3-D).

18

1-D grid and/or 1-D blocks

If you want a 1-D structure, you can use an integer for B and T in:

    myKernel<<< B, T >>>(arg1, ... );

B -- an integer defining a 1-D grid of that size

T -- an integer defining a 1-D block of that size

Example:

    myKernel<<< 1, 100 >>>(arg1, ... );



19


CUDA Built-in Variables for a 1-D grid and 1-D block

    threadIdx.x  --  "thread index" within block in "x" dimension

    blockIdx.x   --  "block index" within grid in "x" dimension

    blockDim.x   --  "block dimension" in "x" dimension
                     (i.e. number of threads in a block in the x dimension)

Full global thread ID in x dimension can be computed by:

    x = blockIdx.x * blockDim.x + threadIdx.x;




20

Example -- x direction

A 1-D grid and 1-D blocks: 4 blocks, each having 8 threads (gridDim = 4 x 1, blockDim = 8 x 1).

[Figure: four blocks with blockIdx.x = 0, 1, 2, 3; within each block the threads have threadIdx.x = 0 .. 7. Thread 2 of block 3 is highlighted as global ID 26.]

Global thread ID = blockIdx.x * blockDim.x + threadIdx.x
                 = 3 * 8 + 2 = thread 26 with linear global addressing

Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.
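As a small sketch (not from the slides), each of the 4 x 8 threads can write its global ID into an array, which makes the figure concrete:

    __global__ void writeID(int *out) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID in x
        out[id] = id;
    }

    // Launched to match the figure: writeID<<<4, 8>>>(devOut);
    // After copying devOut back to the host, element 26 holds the value 26,
    // written by the thread with blockIdx.x = 3 and threadIdx.x = 2.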

21

Code example with a 1-D grid and blocks -- Vector addition

#define N 2048    // size of vectors
#define T 256     // number of threads per block

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

int main (int argc, char **argv ) {
    ...
    vecAdd<<<N/T, T>>>(devA, devB, devC);   // N/T blocks map each vector across the grid,
                                            // one element of each vector per thread.
                                            // Assumes N/T is an integer.
    ...
    return (0);
}

Note: __global__ CUDA function qualifier.
    __ is two underscores.
    A __global__ function must return void.
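When N is not a multiple of T, a common variant (a hedged sketch, not from the slides) is to round the number of blocks up and guard the kernel with a bounds check, as the 2-D example later in this session does:

    __global__ void vecAdd(int *A, int *B, int *C, int n) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)                        // surplus threads in the last block do nothing
            C[i] = A[i] + B[i];
    }

    ...
    int blocks = (N + T - 1) / T;         // ceiling division: enough blocks to cover N elements
    vecAdd<<<blocks, T>>>(devA, devB, devC, N);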

22

Higher dimensional grids/blocks

A 1-D grid and 1-D blocks are suitable for processing one-dimensional data.

Higher dimensional grids and blocks are convenient for higher dimensional data:

    Processing 2-D arrays might use a two-dimensional grid and two-dimensional blocks.

    Might need higher dimensions because of the limit on the size of a block in each dimension.

CUDA provides built-in variables and structures to define the number of blocks in a grid in each dimension and the number of threads in a block in each dimension.

23

Built-in CUDA data types and structures to define multidimensional structures

dim3 -- can be considered essentially as a CUDA-defined structure of unsigned integers x, y, z, i.e.:

    struct dim3 { x; y; z; };

Grid/Block Sizes

    dim3 gridDim   --  Grid dimensions, x and y (z not used).
                       Number of blocks in grid = gridDim.x * gridDim.y

    dim3 blockDim  --  Size of block dimensions x, y, and z.
                       Number of threads in a block = blockDim.x * blockDim.y * blockDim.z

24

Example Initializing Values

To set dimensions, use for example:

    dim3 grid(16, 16);      // Grid  -- 16 x 16 blocks
    dim3 block(32, 32);     // Block -- 32 x 32 threads

    myKernel<<<grid, block>>>(...);

which sets:

    gridDim.x = 16
    gridDim.y = 16
    blockDim.x = 32
    blockDim.y = 32
    blockDim.z = 1

(although you do not initialize CUDA structure elements that way)

25

CUDA Built-in Variables for Grid/Block Indices

uint3 -- can be considered essentially as a CUDA-defined structure of unsigned integers x, y, z, i.e.:

    struct uint3 { x; y; z; };

Block index within grid:

    uint3 blockIdx   --  blockIdx.x, blockIdx.y (z not used)

Thread index within block:

    uint3 threadIdx  --  threadIdx.x, threadIdx.y, threadIdx.z

26

2-D Grids and 2-D blocks

[Figure: a 2-D grid of 2-D blocks; a thread is located within its block by threadIdx.x and threadIdx.y]

Global thread index in each dimension:

    blockIdx.x * blockDim.x + threadIdx.x
    blockIdx.y * blockDim.y + threadIdx.y

27

Flattening arrays onto linear memory

Generally memory is allocated dynamically on the device (GPU) and we cannot use two-dimensional indices (e.g. A[row][column]) to access the array as we might otherwise.

Need to know how the array is laid out in memory and then compute the distance from the beginning of the array.

C uses row-major order -- rows are stored one after the other in memory, i.e. row 0 then row 1, etc.
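A minimal host-side illustration (not from the slides) of what row-major order means for a statically declared array:

    #define N 4                      // number of columns (small hypothetical example)

    int a[3][N];                     // 3 rows, N columns
    int *flat = &a[0][0];            // the same memory viewed as a 1-D array

    // a[row][col] and flat[col + row * N] refer to the same element,
    // because C stores row 0, then row 1, then row 2 contiguously.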

28

Flattening an array

[Figure: a 2-D array with N columns (columns 0 to N-1); the offset of element a[row][column] from the start of the array is row * N + column]

Array element:

    a[row][column] = a[offset]

    offset = column + row * N

where N is the number of columns in the array, so row * N is the number of elements in the rows preceding the element.

29

With one thread per array element:

    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int index = col + row * N;       // thread ID

    A[index] = ...

30

CUDA version using 2-D grid and 2-D blocks

Adding two arrays where one thread handles one element in each array

#define N 2048    // size of arrays

__global__ void addMatrix (int *a, int *b, int *c) {
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int index = col + row * N;

    if ( col < N && row < N) c[index] = a[index] + b[index];
}

int main() {
    ...
    dim3 dimBlock (16,16);
    dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);

    addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
    ...
}
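If N were not a multiple of the block dimensions, a hedged variant (not from the slides) rounds the grid size up with ceiling division; the bounds check in the kernel above already discards the surplus threads:

    dim3 dimBlock (16,16);
    dim3 dimGrid ((N + dimBlock.x - 1)/dimBlock.x,     // ceiling division in x
                  (N + dimBlock.y - 1)/dimBlock.y);    // ceiling division in y

    addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);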

31

Compiling CUDA programs -- "nvcc"

NVIDIA provides nvcc -- the NVIDIA CUDA "compiler driver".

    Will separate out code for host and for device

    Regular C/C++ compiler used for host (needs to be available)

    Programmer simply uses nvcc instead of the gcc/cc compiler on a Linux system

    Command line options include GPU-specific features

32

Compiling code -- Linux

Command line:

    nvcc -O3 -o <exe> <source_file> -I/usr/local/cuda/include
         -L/usr/local/cuda/lib -lcuda -lcudart

    -O3                          Optimization level if you want optimized code
    -I/usr/local/cuda/include    Directory for #include files
    -L/usr/local/cuda/lib        Directory for libraries
    -lcuda -lcudart              Libraries to be linked

A CUDA source file that includes device code has the extension .cu

nvcc separates code for CPU and for GPU and compiles the code.

Need a regular C compiler installed for the CPU.

A Makefile is convenient -- see next.

See "The CUDA Compiler Driver NVCC" from NVIDIA for more details.

33

Very simple sample Makefile

    NVCC = /usr/local/cuda/bin/nvcc
    CUDAPATH = /usr/local/cuda

    NVCCFLAGS = -I$(CUDAPATH)/include
    LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

    # A regular C program
    prog1:
            cc -o prog1 prog1.c -lm

    # A C program with X11 graphics
    prog2:
            cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

    # A CUDA program
    prog3:
            $(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3 prog3.cu

    # A CUDA program with X11 graphics
    prog4:
            $(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 prog4.cu -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

34

Compilation process

    nvcc -o prog prog.cu -I/includepath -L/libpath

nvcc "wrapper" divides the code into host and device parts.

    Host part compiled by a regular C compiler (e.g. gcc)

    Device part compiled by the NVIDIA "ptxas" assembler

    The two compiled parts (object files) are combined into one executable

The executable file is a "fat binary" with both host and device code.

35

Executing Program

Simply type the name of the executable created by nvcc:

    ./prog1

The file includes all the code for host and for device in a "fat binary" file.

Host code starts running.

When the first device kernel is encountered, the GPU code is physically sent to the GPU and the function is launched on the GPU.

    Hence the first launch will be slow!!

The run-time environment (cudart) controls memcpy timing and synchronization.

36

Ways to measure time of execution

Generally instrument the code: measure the time at two places and take the difference.

Ways to measure time:

    C clock() or time() routines

    CUDA "events" (seems the best way)

    CUDA SDK timer
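As a minimal sketch (not from the slides) of the clock() approach, timing the vecAdd kernel from earlier: since a kernel launch is asynchronous, a synchronization call is needed before reading the clock a second time. The CUDA events approach on the next slide is generally preferred because it uses the GPU's own clock.

    #include <stdio.h>
    #include <time.h>

    ...
    clock_t t1 = clock();

    vecAdd<<<N/T, T>>>(devA, devB, devC);   // the kernel being timed

    cudaDeviceSynchronize();                // wait for the GPU to finish before stopping the clock
    clock_t t2 = clock();

    printf("Elapsed time: %f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);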



37

Timing GPU Execution with CUDA events

Code:

    cudaEvent_t start, stop;
    float elapsedTime;

    cudaEventCreate(&start);        // create event objects
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);      // record start event
    .
    .                               // time period being measured
    .
    cudaEventRecord(stop, 0);       // record end event
    cudaEventSynchronize(stop);     // wait for all device work to complete

    cudaEventElapsedTime(&elapsedTime, start, stop);   // time between events

    cudaEventDestroy(start);        // destroy start event
    cudaEventDestroy(stop);         // destroy stop event

38

Recording Events

cudaEventRecord(event1, 0) records an "event" into the default "stream" (0).

    The device will record a timestamp for the event when it reaches that event in the stream, that is, after all preceding operations have completed.

    (Default stream 0 means completed in the CUDA context.)

    NOTE: This operation is asynchronous and may return before recording the event!

39

Making sure the event is actually recorded

cudaEventSynchronize(event) -- waits until the named event is actually recorded.

    The event is recorded when all work issued prior to the specified event has completed.

    (Not strictly necessary if there is a synchronous CUDA call in the code.)


40

Measuring time between two events

cudaEventElapsedTime(&time, event1, event2)

will
return (pointer argument) the time elapsed between
two events, in milliseconds.


Resolution approx ½ millisecond.


Timing measured using GPU clock.

Questions