01-GPU-Motivation

2 Dec 2013

GPUs: Overview of Architecture and Programming Options

Lee Barford
firstname dot lastname at gmail dot com

Outline

Why parallel computing is now important
What GPUs are and what they provide
Overview of GPU architecture
  Enough to orient the discussion of programming them
  Future changes
Two language-tool chain pairs for programming GPUs
  Those that we’re not covering include CUDA Fortran, Python CUDA & CL bindings, OpenCL, WebCL





[Figure: graph from UC Berkeley ParLab; labels: serial app performance, exponentially growing gap]

Graphics Processor (GPU) as Parallel Accelerator



Commodity priced, massively parallel floating point
Claimed performance on various problems: 50-2500x that of a CPU running serial code




Graph from http://drdobbs.com/high-performance-computing/231500166

The GPU as a Co-Processor to the CPU:
The physical and logical connections


[Diagram labels: CPU; main memory; chipset; I/Os (video, Ethernet, USB hub, FireWire); PCIe (slow); GPU; GPU memory; control actions & code (kernels) to run]

Running GPU code is like requesting asynchronous I/O
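
The asynchronous-I/O analogy can be sketched in plain C++11, with no GPU involved. This is only an illustration of the request / overlap / wait pattern: `square_all` is a hypothetical stand-in for a kernel, and `std::async` stands in for the submission of work to the GPU. Neither is part of any GPU API.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a GPU kernel: squares every element.
// On a real GPU this work would run on the device, not on a CPU thread.
std::vector<float> square_all(std::vector<float> input) {
    for (float& x : input) x *= x;
    return input;
}

// Submit the "kernel", keep the CPU busy, then collect the result:
// the same request / overlap / wait shape as asynchronous I/O.
float offload_and_overlap(const std::vector<float>& data) {
    // 1. Request: start the work asynchronously. The input is copied,
    //    much as it would be copied to GPU memory over PCIe.
    std::future<std::vector<float>> pending =
        std::async(std::launch::async, square_all, data);

    // 2. Overlap: the CPU does unrelated work while the request is in flight.
    float cpu_side_sum = std::accumulate(data.begin(), data.end(), 0.0f);

    // 3. Completion: block until the result is available, then use it.
    std::vector<float> squared = pending.get();
    float offloaded_sum =
        std::accumulate(squared.begin(), squared.end(), 0.0f);

    return cpu_side_sum + offloaded_sum;
}
```

Calling `offload_and_overlap` on the values 1, 2, 3 adds the CPU-side sum (6) to the sum of the offloaded squares (14). The point of the shape is that step 2 costs nothing extra as long as the offloaded work takes at least as long.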

Now from AMD & Intel: Fusion of CPU and GPU

[Diagram labels: CPU; main memory; I/O subsystem; multiple cores; GPU; hardware task scheduler]

Running GPU code will be like queuing method pointers for future execution (as in C++11, TBB, TPL, PPL).

Programming Tomorrow’s CPU will be Like Programming Today’s GPU

GPUs that compute will come “for free” with computers
Slow step of moving data to/from GPU will be eliminated
Hardware task scheduler for both CPU and GPU will
  Almost eliminate OS & I/O overhead for invoking GPU kernels
  Also almost eliminate OS overhead for invoking parallel tasks on CPU
AMD laptop chip; Intel laptops (e.g. fall ’12 refresh MacBook Pros)
NVIDIA GPU+ARM chip available now for battery-operated devices
Both promise desktop chips in next year or two
Programming models will probably evolve from what we’ll cover
Course will use current, PCIe-based GPUs
We will be dealing with overheads that will pass away over next few years

CUDA (NVIDIA) GPU Compute Architecture:
Many Simple, Floating-Point Cores

32 cores (Streaming Multiprocessor) share:
  Instruction stream
  Registers
Execute same program (kernel)
  SPMD: ~ [Same place in same kernel at the same time]
Act as 100s to 1000s more cores by switching context instead of waiting for memory
1000s of virtual cores executing same lines of code together, but sharing limited resources

Cores organized into groups

GPU has multiple SMs
  SMs run in parallel
  Do not need to be executing same location in the same program at the same time
In aggregate, many 1000s of parallel copies of same kernel running simultaneously
Total of up to 1 Tflop/s at peak

CENTRAL SOFTWARE ISSUE:
How to generate and control this much parallelism
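
One way to picture "many copies of the same kernel" is a CPU emulation in standard C++. It is a sketch under stated assumptions, not CUDA code: `scale_kernel`, `launch`, and `scale_all` are hypothetical names, each `std::thread` stands in for a group of cores (an SM), and the 32 logical threads in a group run one after another here, whereas on the GPU they would execute together.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical "kernel": every logical thread runs this same function and
// uses its own global index to pick the element it is responsible for.
void scale_kernel(std::size_t global_index, float factor,
                  std::vector<float>& data) {
    if (global_index < data.size()) {  // guard: the grid may overshoot the data
        data[global_index] *= factor;
    }
}

// CPU emulation of a launch: `num_blocks` groups run in parallel (one
// std::thread each), and each group steps through `threads_per_block`
// logical threads sequentially.
void launch(std::size_t num_blocks, std::size_t threads_per_block,
            float factor, std::vector<float>& data) {
    std::vector<std::thread> blocks;
    for (std::size_t block = 0; block < num_blocks; ++block) {
        blocks.emplace_back([=, &data] {
            for (std::size_t t = 0; t < threads_per_block; ++t) {
                scale_kernel(block * threads_per_block + t, factor, data);
            }
        });
    }
    for (std::thread& b : blocks) b.join();
}

// Convenience wrapper: choose enough groups to cover every element.
std::vector<float> scale_all(std::vector<float> data, float factor) {
    const std::size_t threads_per_block = 32;  // the 32 cores per SM noted above
    const std::size_t num_blocks =
        (data.size() + threads_per_block - 1) / threads_per_block;
    launch(num_blocks, threads_per_block, factor, data);
    return data;
}
```

The only thing that distinguishes one copy of the kernel from another is its index, which is how this much parallelism is generated from a single piece of code. The bounds check is needed because the number of logical threads is rounded up to a whole number of groups.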

GPUs: Programming Options


Libraries: called from CPU code. Write no GPU code. Examples:
  Image/video processing, dense & sparse matrix, FFT, random numbers
Generic programming for GPU
  Thrust
    Like C++ Standard Template Library
    Specialize & use built-in data structures and algorithms
    NVIDIA GPUs only
Programming the GPU directly
  CUDA C/C++, OpenCL, WebCL, CUDA Fortran, various Python libraries
  Write code that runs on GPU (kernels)
  Write CPU code that directly controls and coordinates
    Data movement between CPU memory and GPU memory
    Startup of kernels on GPU
    CPU processing of results from GPU when they become available




Two Programming Environments that We’ll Cover

CUDA C/C++:
  Very efficient code
  Lots of fussy detail to get that efficiency
  Robust tool chains for Linux, Windows, Mac OS
  Specific to NVIDIA

Thrust:
  Easy to write
  Algorithms provided among the fastest (e.g., sort)
  NVIDIA GPUs only
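
To give a feel for the "like the C++ Standard Template Library" style, here is a small host-only sketch written with the STL itself. It is not Thrust code and runs entirely on the CPU. Thrust provides similarly named pieces (`thrust::device_vector`, `thrust::transform`, `thrust::sort`, `thrust::reduce`) that operate on data in GPU memory. The function name `sum_of_sorted_squares` is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Square each value, sort the results, and return their sum, written
// entirely with standard-library containers and algorithms.
float sum_of_sorted_squares(std::vector<float>& values) {
    // Element-wise transform, in place.
    std::transform(values.begin(), values.end(), values.begin(),
                   [](float x) { return x * x; });

    // Library-provided sort: no hand-written sorting code.
    std::sort(values.begin(), values.end());

    // Reduction to a single value.
    return std::accumulate(values.begin(), values.end(), 0.0f);
}
```

The program is a sequence of calls to built-in algorithms over a container, with small user-supplied functions plugged in. That is the sense in which this style is "easy to write" compared with hand-written kernels.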

Questions

BACKUP SLIDES

CUDA C/C++ vs OpenCL

CUDA C/C++
  Proprietary (NVIDIA)
  Code runs on NVIDIA GPUs
  Reportedly 10-50% faster than OpenCL
  Compiles at build time to binary code for particular targeted hardware
    Specific NVIDIA hardware architecture versions
  No compiler available at run time

OpenCL
  Open standard (Khronos)
  Code runs on NVIDIA & AMD GPUs, x86 multicore, FPGAs (academic research) at the same time
  Compiles at build time to intermediate form that is compiled at run time for the hardware that is present
  Compiler is available at run time
    Can execute downloaded or dynamically generated source code

Class Project Idea


Accurate edge finding in a 1D signal
  Journal paper published on multicore version
  Student project last year doing Thrust implementation
Project: Do CUDA version + performance tests
Paper combining previous student’s work with above: 60% probability of getting accepted in a particular IEEE conference
  3 co-authors, including previous student & Lee
  Extended abstract due: Nov 6
  Class project due during finals, same as everyone else
  Camera-ready paper due: March 4
See or email me in the next week or two if interested