Graphics Processing Unit
Joshua Reynolds
Ted Gardner
GPUs

Background
• Graphics rendering is one of the most obvious examples of an embarrassingly parallel computation
• Graphics cards use their own computational unit: the GPU
• GPUs have evolved to process graphics in a highly parallel way
Shaders
• Shader types
  o Pixel/Fragment, Vertex, and Geometry
  o The unified shader model allows a single shader to be used for any of the three shader types
• Functions
  o Read/write data from buffers
  o Perform arithmetic operations
• Run entirely in parallel and can be very numerous
• Example: Radeon HD 8xxx generation
  o Radeon HD 8350 has 80 unified shaders
  o Radeon HD 8970 has 2048 unified shaders
Example: NVIDIA Tesla
• Up to 128 scalar processors
• 12,000+ concurrent threads in flight
• 470+ GFLOPS sustained performance
• Speedups of 100x or better over CPUs claimed for some workloads
General Purpose Computing on GPU
• GPUs were originally designed for manipulation of graphics
• Shaders are programmable and can be used for non-graphical data
• Each shader can apply a kernel to a set of data (or use a kernel to generate a set of data)
• Individual shaders are generally slower and more limited than CPU cores, but their parallel nature can give a dramatic speedup
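The idea that each shader applies the same kernel to one element of a data set can be sketched on the CPU in plain C. This is only an illustration under stated assumptions: the loop stands in for the GPU's parallel work-items, and the names `kernel_fn`, `scale_kernel`, and `run_kernel` are hypothetical, not part of any GPU API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: one call handles one element (one work-item). */
typedef void (*kernel_fn)(size_t id, const float *in, float *out);

/* Example kernel: double the element that this work-item owns. */
static void scale_kernel(size_t id, const float *in, float *out) {
    out[id] = 2.0f * in[id];
}

/* Sequential stand-in for a kernel launch. On a GPU the iterations would run
   in parallel, which is safe here because no work-item reads another's output. */
static void run_kernel(kernel_fn k, size_t global_size,
                       const float *in, float *out) {
    assert(k != NULL);
    for (size_t id = 0; id < global_size; id++) {
        k(id, in, out);
    }
}
```

Calling `run_kernel(scale_kernel, n, in, out)` fills `out` with the doubled values of `in`. The work per element is trivial, so any speedup on real hardware would come from running many such elements at once.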
Computational Uses
• Conway's Game of Life
• Video encoding/decoding
• Fluid Simulation
• N-Body Simulation
• Fourier Transform
• Computation of Voronoi Diagrams
• Cracking UNIX password encryption (PixelFlow SIMD graphics computer)
• Computation of artificial neural networks
• Bitcoin mining (SHA-256)
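As an illustration of why a problem such as Conway's Game of Life maps well onto a GPU, here is a minimal C sketch in which each cell is treated as one work-item. Each cell reads only the previous generation and writes only its own slot, so all cells could be updated in parallel. The function names and the choice to treat cells outside the grid as dead are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Count live neighbours of cell (x, y); cells outside the grid count as dead
   (a boundary choice made for this sketch). */
static int live_neighbours(const unsigned char *grid, int w, int h, int x, int y) {
    int count = 0;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if ((dx != 0 || dy != 0) && nx >= 0 && nx < w && ny >= 0 && ny < h) {
                count += grid[ny * w + nx];
            }
        }
    }
    return count;
}

/* "Kernel" body for one cell: reads the old grid, writes only its own cell. */
static void life_cell(const unsigned char *old_grid, unsigned char *new_grid,
                      int w, int h, int id) {
    int x = id % w, y = id / w;
    int n = live_neighbours(old_grid, w, h, x, y);
    int alive = old_grid[id];
    new_grid[id] = (unsigned char)((n == 3) || (alive && n == 2));
}

/* Sequential stand-in for launching w * h work-items. */
static void life_step(const unsigned char *old_grid, unsigned char *new_grid,
                      int w, int h) {
    assert(old_grid != new_grid); /* double buffering keeps work-items independent */
    for (int id = 0; id < w * h; id++) {
        life_cell(old_grid, new_grid, w, h, id);
    }
}
```

The two separate buffers matter: if cells were updated in place, one work-item could read a neighbour that another work-item had already overwritten.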
Programming Languages
• CUDA (C, C++ and Fortran)
  o Third-party wrappers for: Python, Perl, Java, Ruby, Lua, Haskell, MATLAB, IDL, Mathematica
• OpenCL (C99)
  o Wrappers for: C++, C, Java, C#, Python, Ruby, Perl, Lisp, Haskell, Mathematica, R, MATLAB, Pascal
Vendors
• CUDA
  o NVIDIA
• OpenCL
  o NVIDIA
  o AMD
  o Apple
  o Intel
  o IBM
  o Portable OpenCL
Primary Scheduler
Voronoi diagram
[Figure: Voronoi diagram; figure label "Shops"]
Centroidal Voronoi Tessellation
GPU-Assisted Computation of Centroidal Voronoi Tessellation
Performance Tuning: Optimization
• Populate all of the multiprocessors
• Keep the cores busy with multithreading
• Optimize device memory accesses for contiguous data, that is, for stride-1 memory accesses
• Use the software-managed data cache to store intermediate results or to reorganize data that would otherwise require non-stride-1 device memory accesses
• Take advantage of asynchronous kernel launches by overlapping CPU computations with kernel execution
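To make the stride-1 advice concrete, the following C sketch counts how many memory segments a group of work-items touches for a given stride. It is a deliberately simplified model and not a description of any specific GPU: the 4-byte element size is fixed in the code, and the group size and segment size are parameters the caller chooses. The function name `segments_touched` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: work-item i loads one 4-byte element at index i * stride,
   and memory is fetched in aligned segments of segment_bytes. Returns how many
   distinct segments the whole group touches (fewer means better access). */
static size_t segments_touched(size_t group_size, size_t stride,
                               size_t segment_bytes) {
    assert(group_size > 0 && stride > 0 && segment_bytes > 0);
    const size_t elem_bytes = 4;
    size_t count = 0;
    size_t last_segment = 0;
    for (size_t i = 0; i < group_size; i++) {
        size_t segment = (i * stride * elem_bytes) / segment_bytes;
        /* Addresses grow with i, so a new segment never repeats an old one. */
        if (i == 0 || segment != last_segment) {
            count++;
            last_segment = segment;
        }
    }
    return count;
}
```

Under the assumed numbers of 32 work-items and 128-byte segments, `segments_touched(32, 1, 128)` returns 1, while `segments_touched(32, 32, 128)` returns 32: the same amount of useful data costs 32 times as many memory transactions in this model.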
Example Kernel: Prime Number Sieve (OpenCL)
• CPU sets up the data as an array of 'P' characters
  o 'P' denotes prime
  o 'c' denotes composite
• For each prime, the CPU instructs the GPU to apply the composite kernel to the array
  o The kernel marks entries of the array as composite
  o get_global_id(0): the "rank" of the work-item, transformed so that the GPU only needs to run the kernel on the multiples of the prime, starting at its square
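__kernel void composite(int currentPrime, __global char *output) {
    // Work-item k marks the k-th multiple of currentPrime, counting from its square.
    // Widen to size_t before multiplying so the index cannot overflow int.
    size_t p = (size_t)currentPrime;
    size_t i = p * p + p * get_global_id(0);
    output[i] = 'c';
}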
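The slides do not show the host side of the sieve, so the following plain C sketch is one way the CPU loop could look, with an inner loop standing in for the kernel launch. The driver `sieve`, the extra `gid` parameter (replacing `get_global_id(0)`), and the global-size formula are assumptions made for illustration, not the presenters' code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CPU stand-in for the composite kernel: work-item gid marks one multiple. */
static void composite(int currentPrime, char *output, size_t gid) {
    size_t p = (size_t)currentPrime;
    output[p * p + p * gid] = 'c';
}

/* Hypothetical host driver: fills output[0..n] with 'P' or 'c'.
   The caller must provide n + 1 bytes. */
static void sieve(char *output, size_t n) {
    assert(output != NULL && n >= 1);
    memset(output, 'P', n + 1);
    output[0] = 'c'; /* 0 and 1 are not prime; marked 'c' for simplicity */
    output[1] = 'c';
    for (size_t p = 2; p * p <= n; p++) {
        if (output[p] != 'P') continue; /* launch only for primes */
        /* One work-item per multiple p*p, p*p + p, ... not exceeding n. */
        size_t global_size = (n - p * p) / p + 1;
        for (size_t gid = 0; gid < global_size; gid++) { /* parallel on a GPU */
            composite((int)p, output, gid);
        }
    }
}
```

Choosing the global size this way means no work-item ever writes past index n, so the kernel itself needs no bounds check.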
Test: O(n²) Description
• List of n integers, numbered 0 to n-1
• For each element in the list, sum all the values in the list and store the result
  o Obviously not the best algorithm for summing values in parallel, but we're just trying to simulate O(n²) work
• CPU has 4 cores
• GPU has 480 unified shaders
• OpenCL runs the same kernel on both the GPU and the CPU
Test: O(n²) OpenCL kernel
__kernel void sum(__global int *input, __global int *output) {
    size_t i = get_global_id(0);
    size_t n = get_global_size(0);  // one work-item per list element
    int out = 0;
    // Every work-item sums the whole list, so the total work is O(n^2).
    for (size_t j = 0; j < n; j++) {
        out += input[j];
    }
    output[i] = out;
}
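A plain C stand-in for this kernel can be useful for checking results before timing anything on a device. In the sketch below, `sum_work_item` and `sum_all` are hypothetical names, and the outer loop takes the place of the n parallel work-items.

```c
#include <assert.h>
#include <stddef.h>

/* CPU stand-in for one work-item of the sum kernel. */
static void sum_work_item(const int *input, int *output,
                          size_t gid, size_t global_size) {
    int out = 0;
    for (size_t j = 0; j < global_size; j++) {
        out += input[j];
    }
    output[gid] = out;
}

/* Runs all n work-items one after another: n items times n additions. */
static void sum_all(const int *input, int *output, size_t n) {
    assert(input != NULL && output != NULL);
    for (size_t gid = 0; gid < n; gid++) {
        sum_work_item(input, output, gid, n);
    }
}
```

With the inputs 0 to n-1 described above, every output element should equal n(n-1)/2, which gives a simple correctness check for any device.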
Test: O(n²) Result
Video Example
CUDA vs. OpenCL
• CUDA
  o More popular
  o Large and mature libraries
  o Slightly faster
  o NVIDIA only
• OpenCL
  o More flexible synchronization
  o Can enqueue regular CPU function pointers in its command queues
  o Run-time code generation built in
Sources
http://techreport.com/review/17670/nvidia-fermi-gpu-architecture-revealed/2
http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf
Guodong Rong; Yang Liu; Wenping Wang; Xiaotian Yin; Gu, X.D.; Guo, Xiaohu, "GPU-Assisted Computation of Centroidal Voronoi Tessellation," Visualization and Computer Graphics, IEEE Transactions on, vol. 17, no. 3, pp. 345-356, March 2011
http://www.computer.org/csdl/trans/tg/2011/03/ttg2011030345-abs.html
http://www.math.psu.edu/qdu/Res/Pic/gallery3.html