CUDA

Assignment


Subject: DES using CUDA


Deliverables: des.c, des.cu, report


Due: 12/14, nai0315@snu.ac.kr

Index


What is GPU?


Programming model and Simple Example


The Environment for CUDA programming


What is DES?

What’s in a GPU?


A GPU is a heterogeneous chip multiprocessor (highly tuned for graphics)

Slimming down



Idea #1: Remove components that help a single instruction stream run fast

Parallel execution

Two cores

Four cores

Sixteen cores: 16 simultaneous instruction streams




Be able to share an instruction stream

SIMD processing

Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs

16 cores = 128 ALUs

What about branches?
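When lanes of the same SIMD group take different paths, the hardware runs both paths one after the other with the non-participating lanes masked off, which costs throughput. A minimal CUDA sketch of such a divergent branch (the kernel and buffer names are invented for illustration):

    __global__ void divergent_kernel(int *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)              // even and odd lanes of a 32-thread warp diverge,
            data[i] = data[i] * 2;   // so the warp executes both paths serially,
        else                         // masking off the inactive lanes each time
            data[i] = data[i] + 1;
    }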

Throughput!


Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations

Summary: three key ideas of the GPU

1. Use many "slimmed down" cores to run in parallel

2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments)

3. Avoid latency stalls by interleaving execution of many groups of fragments

When one group stalls, work on another group

Programming Model


The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)

Data-parallel, compute-intensive functions should be off-loaded to the device

Functions that are executed many times, but independently on different data, are prime candidates

e.g., the body of for-loops (see the sketch after this list)

A function compiled for the device is called a kernel

The kernel is executed on the device as many different threads

Both host (CPU) and device (GPU) manage their own memory: host memory and device memory
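As a hedged illustration of off-loading a for-loop body (vector addition is my example here, not from the slides), each thread takes exactly one loop iteration:

    // CPU version: for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
    __global__ void vec_add(int n, const float *a, const float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one loop index per thread
        if (i < n)        // guard: the grid may have more threads than iterations
            c[i] = a[i] + b[i];
    }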

Block and Thread Allocation


Blocks are assigned to SMs (Streaming Multiprocessors)

Threads are assigned to PEs (Processing Elements)

Each thread executes the kernel

Each block has a unique block ID

Each thread has a unique thread ID within the block

Warp: max 32 threads

GTX 280: 30 SMs

1 SM: 8 SPs

1 SM: 32 warps (1024 threads)

Total threads: 30 * 1024 = 30,720
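A minimal sketch (hypothetical kernel) of how the block ID and the per-block thread ID combine into a globally unique index, and where the 32-thread warp granularity shows up:

    __global__ void id_demo(int *out) {
        int tid = threadIdx.x;                    // unique within the block
        int gid = blockIdx.x * blockDim.x + tid;  // unique across the whole grid
        out[gid] = tid / 32;                      // this thread's warp index within its block
    }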


Memory model


Memory types

Registers (r/w per thread)

Local mem (r/w per thread)

Shared mem (r/w per block)

Global mem (r/w per kernel)

Constant mem (r per kernel)

Separate from CPU

CPU can access global and constant mem via the PCIe bus
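A hedged sketch (names invented) of where each memory type appears in CUDA C; the host would set the constant with cudaMemcpyToSymbol before the launch:

    __constant__ float scale;                      // constant mem: read-only per kernel

    __global__ void memory_demo(float *global_data) {   // global mem: r/w per kernel
        __shared__ float tile[256];                // shared mem: r/w per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register (per thread)
        tile[threadIdx.x] = global_data[i];        // global -> shared
        __syncthreads();                           // make the tile visible block-wide
        global_data[i] = tile[threadIdx.x] * scale;   // shared * constant -> global
    }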

Simple Example (C to CUDA conversion)

__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {}

__global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {}

int main(...) {
    Body *body, *body1;
    ...
    cudaMalloc((void **)&body1, sizeof(Body) * nbodies);
    cudaMemcpy(body1, body, sizeof(Body) * nbodies, cudaMemcpyHostToDevice);
    for (timestep = ...) {
        ForceCalcKernel<<<1, 1>>>(nbodies, body1, ...);
        AdvancingKernel<<<1, 1>>>(nbodies, body1, ...);
    }
    cudaMemcpy(body, body1, sizeof(Body) * nbodies, cudaMemcpyDeviceToHost);
    cudaFree(body1);
    ...
}

__global__ indicates a GPU kernel that the CPU can call

Separate address spaces, so two pointers are needed (body on the CPU, body1 on the GPU)

cudaMalloc allocates memory on the GPU

The first cudaMemcpy copies CPU data to the GPU

<<<1, 1>>> calls the GPU kernel with 1 block and 1 thread per block

The second cudaMemcpy copies GPU data back to the CPU
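The <<<1, 1>>> launch above runs a single thread, which serializes the work on the GPU; a common pattern (a sketch reusing the example's names) is to launch one thread per body and round the block count up:

    int threadsPerBlock = 256;
    int blocks = (nbodies + threadsPerBlock - 1) / threadsPerBlock;  // round up so every body is covered
    ForceCalcKernel<<<blocks, threadsPerBlock>>>(nbodies, body1, ...);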

Environment


The NVCC compiler

CUDA kernels are typically stored in files ending with .cu

NVCC uses the host compiler (CL/G++) to compile CPU code

NVCC automatically handles #include's and linking

You can download the CUDA toolkit from:

http://developer.nvidia.com/cuda-downloads
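For this assignment the build could be as simple as the following (a hedged example using the deliverable name from the first slide; NVCC splits off the device code and forwards the rest to the host compiler):

    nvcc -o des des.cu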


What is DES?


The archetypal block cipher


An algorithm that takes a fixed-length string of plaintext bits and transforms it through a series of complicated operations into another ciphertext bitstring of the same length

The block size is 64 bits
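DES is a 16-round Feistel cipher over the 64-bit block. A hedged C sketch of just the Feistel skeleton (omitting DES's initial/final permutations and key schedule; f and the 32-bit round keys here are placeholders, real DES round keys are 48 bits):

    #include <stdint.h>

    uint64_t feistel_encrypt(uint64_t block, const uint32_t key[16],
                             uint32_t (*f)(uint32_t half, uint32_t round_key)) {
        uint32_t L = (uint32_t)(block >> 32);       // left 32-bit half
        uint32_t R = (uint32_t)block;               // right 32-bit half
        for (int round = 0; round < 16; round++) {  // DES uses 16 rounds
            uint32_t next_R = L ^ f(R, key[round]); // mix one half with the round key
            L = R;                                  // swap halves for the next round
            R = next_R;
        }
        return ((uint64_t)R << 32) | L;             // DES undoes the final swap
    }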