GPU computing and CUDA

Marko Mišić (marko.misic@etf.rs)

Milo Tomašević (mvt@etf.rs)

YUINFO 2012, Kopaonik, 29.02.2012.

Introduction to GPU computing (1)

Graphics Processing Units (GPUs) have been used for non-graphics computation for several years

This trend is called General-Purpose computation on GPUs (GPGPU)


GPGPU applications can be found in:


Computational physics/chemistry/biology


Signal processing


Computational geometry


Database management


Computational finance


Computer vision

Introduction to GPU computing (2)

The GPU is a highly parallel processor, good at data-parallel processing with many calculations per memory access

The same computation is executed on many data elements in parallel, with high arithmetic intensity

The same computation on all elements means a lower requirement for sophisticated flow control

High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches


CPU vs. GPU trends (1)

CPU is optimized to execute tasks

Big caches hide memory latencies

Sophisticated flow control

GPU is specialized for compute-intensive, highly parallel computation

More transistors can be devoted to data processing rather than data caching and flow control

[Figure: CPU vs. GPU die layout — the CPU devotes area to control logic and a large cache next to a few ALUs and DRAM; the GPU fills the same area with many ALUs and little control logic or cache.]

CPU vs. GPU trends (2)

The GPU has evolved into a very flexible and powerful processor

Programmable using high-level languages

Computational power: 1 TFLOPS vs. 100 GFLOPS

Bandwidth: ~10x higher

GPU is found in almost every workstation

[Figure 1.1: Enlarging performance gap between many-core GPUs and multi-core CPUs. Courtesy: John Owens]
CPU vs. GPU trends (3)

CUDA Advantage — reported speedups of 10x, 20x, 47x, and 197x across application domains:

Rigid body physics solver

Matrix numerics (BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS)

Wave equation (FDTD: 1.2 Gcells/s, FFT: 52 GFLOPS, as defined by benchFFT)

Biological sequence match (SSEARCH: 5.2 Gcells/s)

Finance (Black-Scholes: 4.7 GOptions/s)

History of GPU programming


The fast-growing video game industry exerts strong pressure for constant innovation

GPUs evolved from fixed-function pipeline processors to more programmable, general-purpose processors

Programmable shaders (2000)

Programmed through the OpenGL and DirectX APIs

Lots of limitations

Memory access, ISA, floating-point support, etc.

NVIDIA CUDA (2007)

AMD/ATI (Brook+, FireStream, Close-To-Metal)

Microsoft DirectCompute (DirectX 10/DirectX 11)

Open Computing Language, OpenCL (2009)

CUDA overview (1)


Compute Unified Device Architecture (CUDA)

A new hardware and software architecture for issuing and managing computations on the GPU

Started with NVIDIA 8000 (G80) series GPUs

General-purpose programming model

SIMD / SPMD

User launches batches of threads on the GPU

GPU can be seen as a dedicated, super-threaded, massively data-parallel coprocessor

Explicit and unrestricted memory management

CUDA overview (2)


The GPU is viewed as a compute device that is a coprocessor to the CPU (host)

Executes the compute-intensive part of the application

Runs many threads in parallel

Has its own DRAM (device memory)

Data-parallel portions of an application are expressed as device kernels which run on many threads

GPU threads are extremely lightweight

Very little creation overhead

GPU needs 1000s of threads for full efficiency

Multicore CPU needs only a few

CUDA overview (3)


Dedicated software stack

Runtime and driver

C-language extension for easier programming

Targeted API for advanced users

Complete tool chain

Compiler, debugger, profiler

Libraries and 3rd party support

GPU Computing SDK

cuFFT, cuBLAS...

FORTRAN, C++, Python, MATLAB, Thrust, GMAC…

[Figure: CUDA software stack — the application uses CUDA libraries (FFT, BLAS) and the CUDA runtime, which sit on top of the CUDA driver; the driver talks to the GPU, alongside the CPU.]

Programming model (1)

Serial Code (host) → Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); → Serial Code (host) → Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);

A CUDA application consists of two parts

Sequential parts are executed on the CPU (host)

Compute-intensive parts are executed on the GPU (device)

The CPU is responsible for data management, memory transfers, and the GPU execution configuration
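As a concrete illustration of this host/device split (not from the original slides), here is a minimal vector-addition sketch; the kernel name vecAdd, the block size of 256, and the problem size are illustrative assumptions:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Parallel kernel (device): each thread adds one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Serial code (host): allocate and initialize data.
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // The host manages device memory and host-device transfers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Execution configuration chosen by the host: nBlk blocks of nTid threads.
    int nTid = 256;
    int nBlk = (n + nTid - 1) / nTid;
    vecAdd<<<nBlk, nTid>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);   // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```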

Programming model (2)

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Efficiently sharing data through shared memory

Synchronizing their execution

Two threads from two different blocks cannot cooperate

[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, e.g. Block (0,0) through Block (2,1), and each block, e.g. Block (1,1), is a 2D array of threads, e.g. Thread (0,0) through Thread (4,2).]

Programming model (3)

Threads and blocks have IDs

So each thread can decide what data to work on

Block ID: 1D or 2D

Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data (see the indexing sketch below)

Image processing

Solving PDEs on volumes


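A brief sketch (not from the slides) of how 2D block and thread IDs map to pixel coordinates in an image-processing kernel; the kernel name brighten and the 16x16 block shape are illustrative assumptions:

```cuda
// Hypothetical kernel: each thread brightens one pixel of a width x height image.
__global__ void brighten(unsigned char *img, int width, int height, int delta) {
    // 2D block and thread IDs give a natural (x, y) pixel index.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;          // row-major addressing
        int v = img[idx] + delta;
        img[idx] = v > 255 ? 255 : v;     // clamp to 8-bit range
    }
}

// Host-side launch: a 2D grid of 16x16-thread blocks covering the image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height, 40);
```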

Memory model (1)

Each thread can:

Read/write per-thread registers

Read/write per-thread local memory

Read/write per-block shared memory

Read/write per-grid global memory

Read only per-grid constant memory

Read only per-grid texture memory

[Figure: memory spaces — each thread has registers and local memory, each block has shared memory, and the grid has global, constant, and texture memory reachable from the host.]

Memory model (2)


The host can read/write global, constant, and texture memory

All stored in device DRAM

Global memory accesses are slow

~200 cycles

Memory architecture optimized for high bandwidth

Memory banks

Transactions

[Figure: all blocks read and write global memory (DRAM); each block keeps its own shared memory and per-thread registers; the host also accesses global memory.]

Memory model (3)


Shared memory is a fast on-chip memory

Allows threads in a block to share intermediate data

Access time ~3-4 cycles

Can be seen as a user-managed cache (scratchpad); see the reduction sketch below

Threads are responsible for bringing data into and moving it out of shared memory

Small in size (up to 48 KB)

[Figure: two multiprocessors stage data elements d0–d3 and d4–d7 from DRAM into their on-chip shared memory before the ALUs operate on them.]
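To make the role of shared memory concrete, here is a minimal sketch (not from the slides) of a block-wise sum reduction that stages data in shared memory and synchronizes with __syncthreads(); the kernel name blockSum and the fixed block size of 256 threads are assumptions:

```cuda
#define BLOCK 256   // assumed block size (a power of two); launch with BLOCK threads per block

// Each block sums BLOCK consecutive elements of `in` and writes one partial sum.
__global__ void blockSum(const float *in, float *partial, int n) {
    __shared__ float buf[BLOCK];          // per-block shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // 1) Load the block's subset from global memory into shared memory.
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // 2) Compute on the subset in fast shared memory (tree reduction).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    // 3) Copy the block's result back to global memory.
    if (tid == 0) partial[blockIdx.x] = buf[0];
}
```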





A common programming strategy

Local and global memory reside in device memory (DRAM)

Much slower access than shared memory

A common way of performing computation on the device is to block it up (tile) to take advantage of fast shared memory

Partition the data set into subsets that fit into shared memory

Handle each data subset with one thread block by:

Loading the subset from global memory to shared memory

Performing the computation on the subset from shared memory

Each thread can efficiently make multiple passes over any data element

Copying results from shared memory to global memory

Matrix Multiplication Example (1)

P = M * N of size WIDTH x WIDTH

Without blocking:

One thread handles one element of P

M and N are loaded WIDTH times from global memory (see the naive kernel sketch below)

[Figure: matrices M, N, and P, each WIDTH x WIDTH; one thread computes one element of P from a row of M and a column of N.]
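A minimal sketch of the non-blocked version (the kernel name matMulNaive is an assumption; matrices are stored row-major as float arrays):

```cuda
// One thread computes one element P[row][col]; every thread re-reads
// a full row of M and a full column of N from global memory.
__global__ void matMulNaive(const float *M, const float *N, float *P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += M[row * width + k] * N[k * width + col];
        P[row * width + col] = sum;
    }
}
```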

Matrix Multiplication Example (2)

P = M * N of size WIDTH x WIDTH

With blocking:

One thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P

M and N are only loaded WIDTH / BLOCK_SIZE times from global memory

Great saving of memory bandwidth! (see the tiled kernel sketch below)

[Figure: P is partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; the block computing Psub walks matching BLOCK_SIZE-wide tiles of M and N across the WIDTH dimension.]
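A sketch of the blocked (tiled) version following the shared-memory strategy above; it assumes BLOCK_SIZE divides WIDTH evenly and that the block dimensions are BLOCK_SIZE x BLOCK_SIZE:

```cuda
#define BLOCK_SIZE 16

__global__ void matMulTiled(const float *M, const float *N, float *P, int width) {
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];   // tile of M in shared memory
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];   // tile of N in shared memory

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk the WIDTH dimension one tile at a time.
    for (int t = 0; t < width / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile from global memory.
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_SIZE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_SIZE + threadIdx.y) * width + col];
        __syncthreads();

        // Multiply the two tiles out of fast shared memory.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();
    }
    P[row * width + col] = sum;
}
```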

CUDA API (1)


The CUDA API is an extension to the C programming language consisting of:

Language extensions

To target portions of the code for execution on the device

A runtime library split into:

A common component providing built-in vector types and a subset of the C runtime library in both host and device code

A host component to control and access one or more devices from the host

A device component providing device-specific functions

CUDA API (2)

Function declaration qualifiers

__global__, __host__, __device__

Variable qualifiers

__device__, __constant__, __shared__, etc.

Built-in variables

gridDim, blockDim, blockIdx, threadIdx

Mathematical functions

Kernel calling convention (execution configuration)

myKernel<<< DimGrid, DimBlock >>>(arg1, … );

Programmer explicitly specifies block and grid organization

1D, 2D, or 3D
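A small sketch (not from the slides) tying these extensions together; the names scale, bias, and applyScale are illustrative:

```cuda
__constant__ float bias;                 // per-grid constant memory variable
                                         // (set from the host with cudaMemcpyToSymbol)

__device__ float scale(float x) {        // callable from device code only
    return 2.0f * x + bias;
}

__global__ void applyScale(float *data, int n) {
    // Built-in variables give each thread its position in the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

// Host-side execution configuration with an explicit 1D grid and block:
// dim3 dimBlock(256);
// dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);
// applyScale<<<dimGrid, dimBlock>>>(d_data, n);
```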

Hardware implementation (1)

The device is a set of multiprocessors

Each multiprocessor is a set of 32-bit processors with a SIMD architecture

At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp

Including branches

Allows scalable execution of kernels

Adding more multiprocessors improves performance

[Figure: the device contains multiprocessors 1..N; each multiprocessor has one instruction unit driving processors 1..M.]
Hardware implementation (2)

[Figure: G80-style block diagram — the host feeds an input assembler and thread execution manager, which dispatch work to an array of multiprocessors, each with its parallel data cache and texture unit; load/store units connect the array to global memory.]
Hardware implementation (3)

Each thread block of a grid is split into warps that get executed by one multiprocessor

A warp consists of threads with consecutive thread IDs

Each thread block is executed by only one multiprocessor

Shared memory space resides in the on-chip shared memory

Registers are allocated among the threads

A kernel that requires too many registers will fail to launch

A multiprocessor can execute several blocks concurrently

Shared memory and registers are allocated among the threads of all concurrent blocks

Decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

Memory architecture (1)

In a parallel machine, many threads access memory

Memory is divided into banks

Essential to achieve high bandwidth

Each bank can service one address per cycle

A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict

Conflicting accesses are serialized

Shared memory is organized in a similar fashion (see the padding sketch below)

[Figure: memory banks 0–15, each serviced independently.]
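As an illustration (not from the slides) of avoiding shared-memory bank conflicts, a common trick is to pad a shared tile by one column so that threads reading a column hit different banks; the 16x16 tile, the kernel name transposeTile, and the assumption that width is a multiple of TILE are illustrative:

```cuda
#define TILE 16   // launch with dim3 block(TILE, TILE)

// Transpose one TILE x TILE block of a square matrix using shared memory.
__global__ void transposeTile(const float *in, float *out, int width) {
    // The extra "+ 1" column staggers rows across banks,
    // so the column-wise reads below do not conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // row-wise load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;               // transposed block origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column-wise read
}
```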

Memory architecture (2)

When accessing global memory, accesses are combined into transactions

Peak bandwidth is achieved when all threads in a half-warp access contiguous memory locations

“Memory coalescing”

In that case, there are no bank conflicts

The programmer is responsible for optimizing algorithms to access data in an appropriate fashion
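A short sketch (not from the slides) contrasting a coalesced access pattern with a strided one; the kernel names are illustrative:

```cuda
// Coalesced: consecutive threads touch consecutive addresses,
// so a half-warp's accesses combine into a few wide transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so each access may need its own transaction and bandwidth drops sharply.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```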

Performance considerations

CUDA has a low learning curve

It is easy to write a correct program

Performance can vary greatly depending on the resource constraints of the particular device architecture

Performance-conscious programmers still need to be aware of them to make good use of contemporary hardware

It is essential to understand the hardware and memory architecture

Thread scheduling and execution

Suitable memory access patterns

Shared memory utilization

Resource limitations

Conclusion

The highly multithreaded architecture of modern GPUs is very suitable for solving data-parallel problems

Vastly improves performance in certain domains

It is expected that GPU architectures will evolve to further broaden application domains

We are at the dawn of heterogeneous computing

Software support is developing rapidly

Mature tool chain

Libraries

Available applications

OpenCL

References

David Kirk, Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.

Course ECE498AL, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/

Dan Connors, OpenCL and CUDA Programming for Multicore and GPU Architectures, ACACES 2011, Fiuggi, Italy, 2011.

David Kirk, Wen-mei Hwu, Programming and Tuning Massively Parallel Systems, PUMPS 2011, Barcelona, Spain, 2011.

NVIDIA CUDA C Programming Guide 4.0, 2011.

Mišić, Đurđević, Tomašević, “Evolution and Trends in GPU Computing”, MIPRO 2012, Abbazia, Croatia, 2012. (to be published)

NVIDIA Developer Zone, http://developer.nvidia.com/category/zone/cuda-zone

http://en.wikipedia.org/wiki/GPGPU

http://en.wikipedia.org/wiki/CUDA

GPU training wiki, https://hpcforge.org/plugins/mediawiki/wiki/gpu-training/index.php/Main_Page


GPU computing and CUDA

Questions?

Marko Mišić (marko.misic@etf.rs)

Milo Tomašević (mvt@etf.rs)

YUINFO 2012, Kopaonik, 29.02.2012.