CUDA Assignment
Subject: DES using CUDA
Deliverables: des.c, des.cu, report
Due: 12/14, nai0315@snu.ac.kr
Index
What is GPU?
Programming model and Simple Example
The Environment for CUDA programming
What is DES?
What’s in a GPU?
A GPU is a heterogeneous chip multiprocessor (highly tuned for graphics)
Slimming down
Idea #1: Remove components that help a single instruction stream run fast
Parallel execution
Two cores
Four cores
Sixteen cores: 16 simultaneous instruction streams
Be able to share an instruction stream
SIMD processing
Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs
16 cores = 128 ALUs
What about branches?
Throughput!
Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations
Summary: three key ideas of GPU
1. Use many “slimmed down cores” to run in parallel
2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments)
3. Avoid latency stalls by interleaving execution of many groups of fragments; when one group stalls, work on another group
Programming Model
GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)
Data-parallel, compute-intensive functions should be off-loaded to the device
Functions that are executed many times, but independently on different data, are prime candidates
I.e. the bodies of for-loops (see the sketch below)
A function compiled for the device is called a kernel
The kernel is executed on the device as many different threads
Both host (CPU) and device (GPU) manage their own memory: host memory and device memory
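As a concrete illustration of this mapping, here is a minimal sketch (a hypothetical scale-by-a example, not part of the assignment): the body of the serial for-loop becomes the kernel, and the loop index becomes the thread's ID.

// Serial CPU version: the loop body runs many times, independently per element.
void scale_cpu(int n, float a, float *x) {
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];
}

// CUDA version: each thread executes the former loop body for one index.
__global__ void ScaleKernel(int n, float a, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                      // guard: the grid may overshoot n
        x[i] = a * x[i];
}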
Block and Thread Allocation
Blocks assigned to SMs (Streaming Multiprocessors)
Threads assigned to PEs (Processing Elements)
• Each thread executes the kernel
• Each block has a unique block ID
• Each thread has a unique thread ID within the block (see the sketch below)
• Warp: max 32 threads
• GTX 280: 30 SMs
• 1 SM: 8 SPs
• 1 SM: 32 warps = 1,024 threads
• Total threads: 30 * 1,024 = 30,720
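A minimal sketch of how a kernel typically derives these IDs (variable names are illustrative, not from the slides):

__global__ void WhoAmIKernel(int n, int *out) {
    int block_id  = blockIdx.x;                         // unique per block
    int thread_id = threadIdx.x;                        // unique within the block
    int global_id = block_id * blockDim.x + thread_id;  // unique across the whole grid
    int warp_id   = thread_id / 32;                     // threads are scheduled in warps of 32
    if (global_id < n)
        out[global_id] = warp_id;
}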
Memory model
Memory types
Registers (r/w per thread)
Local mem (r/w per thread)
Shared mem (r/w per block)
Global mem (r/w per kernel)
Constant mem (read-only per kernel)
Separate from CPU
CPU can access global and constant mem via the PCIe bus
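In source code, these memory types map onto CUDA qualifiers roughly as follows (a hedged sketch; all names are illustrative):

__constant__ float coeff[16];    // constant mem: read-only on the device,
                                 // filled by the host via cudaMemcpyToSymbol

__device__ float table[256];     // global mem: r/w, visible to every thread

__global__ void MemoryDemoKernel(float *in, float *out) {
    __shared__ float tile[128];  // shared mem: r/w, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];             // 'v' and 'i' live in registers (per thread)
    tile[threadIdx.x] = v * coeff[0];   // assumes blockDim.x <= 128
    __syncthreads();             // make shared-mem writes visible to the whole block
    out[i] = tile[threadIdx.x] + table[i % 256];
}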
Simple Example (C to CUDA conversion)
__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {}
__global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {}

int main(...) {
  Body *body, *body1;
  ...
  cudaMalloc((void **)&body1, sizeof(Body) * nbodies);
  cudaMemcpy(body1, body, sizeof(Body) * nbodies, cudaMemcpyHostToDevice);
  for (timestep = ...) {
    ForceCalcKernel<<<1, 1>>>(nbodies, body1, ...);
    AdvancingKernel<<<1, 1>>>(nbodies, body1, ...);
  }
  cudaMemcpy(body, body1, sizeof(Body) * nbodies, cudaMemcpyDeviceToHost);
  cudaFree(body1);
  ...
}
Indicates GPU kernel that CPU can call
Separate address spaces, need two pointers
Allocate memory on GPU
Copy CPU data to GPU
Call GPU kernel with 1 block and 1 thread per block
Copy GPU data back to CPU
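For completeness, here is a self-contained version of the same allocate/copy/launch/copy-back pattern that actually compiles (a hypothetical vector-increment example, not the assignment code):

#include <stdio.h>

__global__ void IncKernel(int n, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void) {
    const int n = 1024;
    float host[1024], *dev;
    for (int i = 0; i < n; i++) host[i] = (float)i;

    cudaMalloc((void **)&dev, sizeof(float) * n);                     // allocate memory on GPU
    cudaMemcpy(dev, host, sizeof(float) * n, cudaMemcpyHostToDevice); // copy CPU data to GPU
    IncKernel<<<n / 256, 256>>>(n, dev);                              // 4 blocks, 256 threads per block
    cudaMemcpy(host, dev, sizeof(float) * n, cudaMemcpyDeviceToHost); // copy GPU data back to CPU
    cudaFree(dev);

    printf("host[0] = %f\n", host[0]);  // expect 1.000000
    return 0;
}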
Environment
The NVCC compiler
CUDA kernels are typically stored in files ending with .cu
NVCC uses the host compiler (CL/G++) to compile CPU code
NVCC automatically handles #include's and linking
You can download the CUDA toolkit from:
http://developer.nvidia.com/cuda-downloads
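For this assignment, a typical build might look like the following (assuming the deliverable file name des.cu from above):

nvcc -O2 -o des des.cu    # NVCC compiles the device code and hands the host code to CL/G++
./des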
What is DES?
The archetypal block cipher
An algorithm that takes a fixed-length string of plaintext bits and transforms it through a series of complicated operations into another ciphertext bitstring of the same length
The block size is 64 bits
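Because each 64-bit block is transformed independently (assuming ECB mode), DES maps naturally onto the one-thread-per-data-element pattern shown earlier. Below is a hedged skeleton of what the kernel in des.cu might look like; des_encrypt_block and its signature are hypothetical placeholders, not a given API:

// Hypothetical device routine: the 16-round DES transformation of one 64-bit block.
__device__ unsigned long long des_encrypt_block(unsigned long long block,
                                                const unsigned long long *subkeys);

__global__ void DesKernel(int nblocks, unsigned long long *data,
                          const unsigned long long *subkeys /* 16 round keys */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nblocks)  // one thread encrypts one independent 64-bit block
        data[i] = des_encrypt_block(data[i], subkeys);
}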