GPU Computing - CENG545


1 Dec 2013


Ceng 545


GPU Computing



Grading


Midterm Exam: 20%

Homeworks: 40%
  Demo/knowledge: 25%
  Functionality: 40%
  Report: 35%

Project: 40%
  Design Document: 25%
  Project Presentation: 25%
  Demo/Final Report: 50%

TextBooks /References

D. Kirk and W. Hwu, "Programming Massively Parallel Processors", Morgan Kaufmann, 2010, ISBN 978-0-12-381472-2

J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Pearson, 2010, ISBN 978-0-13-138768-3

Draft textbook by Prof. Hwu and Prof. Kirk available at the website

NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007 (reference book)

Videos (Stanford University)
http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322

Lecture Notes (University of Illinois)
http://courses.engr.illinois.edu/ece498/al/

Lecture notes will be posted at the class web site

GPU vs CPU


A GPU is tailored for highly parallel operation, while a CPU executes programs serially.

For this reason, GPUs have many parallel execution units and higher transistor counts (the GTX 480 has 3,200 million transistors), while CPUs have few execution units and higher clock speeds.

GPUs have much deeper pipelines (several thousand stages vs. 10-20 for CPUs).

GPUs have significantly faster and more advanced memory interfaces, as they need to move far more data than CPUs.

Many-core GPUs vs Multicore CPUs

Design philosophies:

The design of a CPU is optimized for sequential code performance. (Large cache memories are provided to reduce instruction and data access latencies.)

Memory bandwidth:

CPU: It has to satisfy requirements from the OS, applications, and I/O devices.

GPU: Small cache memories are provided to help control the bandwidth requirements, so that multiple threads accessing the same memory data do not all need to go to DRAM.

Marketplace


CPU vs. GPU - Hardware

More transistors devoted to data processing

Supercomputing 2008 Education Program

Processing Element









Processing element = thread processor = ALU


Memory Architecture


Constant Memory


Texture Memory


Device Memory

Why Massively Parallel Processor

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign


A quiet revolution and potential build-up
  Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
  Memory bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU)
  Until last year, programmed through graphics API

GPU in every PC and workstation: massive volume and potential impact

[Figure: GFLOPS chart. Legend: G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800]
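The size of the gap quoted on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the first figure in each pair is the GPU (the GeForce 8800 figures later in the deck match), and the variable and function names are made up for the example.

```python
# Figures quoted on the slide; assumed order is GPU first, CPU second.
gpu_gflops, cpu_gflops = 367.0, 32.0
gpu_bw_gbs, cpu_bw_gbs = 86.4, 8.4


def ratio(gpu, cpu):
    """Return how many times larger the GPU figure is than the CPU figure."""
    return gpu / cpu


compute_ratio = ratio(gpu_gflops, cpu_gflops)    # about 11.5
bandwidth_ratio = ratio(gpu_bw_gbs, cpu_bw_gbs)  # about 10.3

print(f"Compute advantage:   {compute_ratio:.1f}x")
print(f"Bandwidth advantage: {bandwidth_ratio:.1f}x")
```

Both ratios come out at roughly an order of magnitude, which is the point the slide makes.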

GeForce 8800


16 highly threaded SMs (each with 8 SPs), >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU

[Figure: GeForce 8800 block diagram. Labels: Host, Input Assembler, Thread Execution Manager, Texture units, Parallel Data Caches, Load/store units, Global Memory]

We have

GTX 480

15 highly threaded SMs (each with 32 SPs), 480 FPUs, 1344 GFLOPS, 1536 MB DRAM, 177.4 GB/s memory bandwidth
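One way to read the GeForce 8800 and GTX 480 figures together is to divide peak GFLOPS by memory bandwidth. This gives the peak arithmetic available per byte of DRAM traffic. The sketch below uses only the numbers quoted on the two slides. The dictionary layout and the function name are illustrative assumptions, and the interpretation is a rough rule of thumb rather than something stated in the deck.

```python
# Peak figures as quoted on the slides; the data layout is illustrative.
specs = {
    "GeForce 8800": {"gflops": 367.0, "bandwidth_gbs": 86.4},
    "GTX 480": {"gflops": 1344.0, "bandwidth_gbs": 177.4},
}


def flops_per_byte(gflops, bandwidth_gbs):
    """Peak floating-point operations available per byte of DRAM traffic."""
    return gflops / bandwidth_gbs


for name, s in specs.items():
    balance = flops_per_byte(s["gflops"], s["bandwidth_gbs"])
    print(f"{name}: {balance:.2f} FLOP per byte")
```

The ratio rises from about 4.2 to about 7.6, which suggests that on the newer card a kernel has to do more arithmetic per byte fetched before memory stops being the limit.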

Tesla C2070


Compared to the latest quad-core CPUs, Tesla C2050 and C2070 Computing Processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption.

Terms



GPGPU
  General-Purpose computing on a Graphics Processing Unit
  Using graphics hardware for non-graphics computations

CUDA
  Compute Unified Device Architecture
  Software architecture for managing data-parallel programming


Parallel Programming



MPI: Computing nodes do not share memory; all data sharing and interaction must be done through explicit message passing.
  CUDA provides shared memory.

OpenMP: It has not been able to scale beyond a couple hundred computing nodes due to thread management overheads and cache coherence hardware requirements.
  CUDA achieves simple thread management with no cache coherence hardware requirements.

Future Apps Reflect a Concurrent World



Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications"
  Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
  These "super-apps" represent and model the physical, concurrent world

Various granularities of parallelism exist, but...
  the programming model must not hinder parallel implementation
  data delivery needs careful management

Stretching Traditional Architectures


Traditional parallel architectures cover some super-applications
  DSP, GPU, network apps, scientific

The game is to grow mainstream architectures "out" or domain-specific architectures "in"
  CUDA is the latter

[Figure: diagram with labels: Traditional applications, Current architecture coverage, New applications, Domain-specific architecture coverage, Obstacles]
Previous Projects

Application | Description | Source (lines) | Kernel (lines) | % time in kernel
H.264 | SPEC '06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC '06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rys Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952 | 31 | >99%
TPACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner's configuration in MRI reconstruction | 490 | 33 | >99%
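SAXPY is the smallest kernel in the table, and it shows why such code suits a GPU: every output element depends only on the matching input elements. The sketch below is a plain Python illustration of the operation y = a*x + y, not the course's CUDA code. The function name and the sample values are assumptions for the example, and Python floats are double precision rather than single precision.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise; each element is independent of the others."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # On a GPU, each iteration of this loop would map naturally to one thread.
    return [a * xi + yi for xi, yi in zip(x, y)]


result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)  # [12.0, 24.0, 36.0]
```

Because no iteration reads a value written by another, the loop can be split across as many threads as there are elements.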


Speedup of Applications

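[Figure: bar chart of GPU speedup relative to CPU, with kernel and application bars for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD. The axis runs from 0 to 60; off-scale bars are labeled 210, 457, 431, 316, 263, and 79]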

GeForce 8800 GTX vs. 2.2GHz Opteron 248


10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads

25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized
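The gap between kernel speedup and application speedup follows from the "% time" column of the Previous Projects table. Amdahl's law, a standard result that the slides do not name, bounds the whole-application speedup by the fraction of time spent in the accelerated kernel. The sketch below applies it to three rows of that table. The function name is illustrative, and the 10x kernel speedup is the "typical" figure quoted above, not a measured value.

```python
def application_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only the kernel part is accelerated."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Kernel time fractions taken from the table (H.264: 35%, FDTD: 16%, FEM: 99%).
for name, fraction in [("H.264", 0.35), ("FDTD", 0.16), ("FEM", 0.99)]:
    print(f"{name}: {application_speedup(fraction, 10.0):.2f}x")
```

With a 10x kernel speedup this gives about 1.46x for H.264, 1.17x for FDTD, and 9.17x for FEM, so applications that spend little time in the kernel gain little overall.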

