Dec 3, 2013
Graphics Processing Unit

Joshua Reynolds

Ted Gardner

GPUs - Background


Graphics are one of the most obvious examples of embarrassingly parallel computations


Graphics cards use their own computational unit: the GPU


GPUs have evolved to process graphics in a
highly parallel way

Shaders


Shader types

o Pixel/Fragment, Vertex, and Geometry

o Unified shader model allows for a single shader to be used for any of the three types of shader


Functions

o Read/write data from buffer

o Perform arithmetic operations


Run entirely in parallel and can be very
numerous


Example - Radeon HD 8xxx generation

o Radeon HD 8350 has 80 unified shaders

o Radeon HD 8970 has 2048 unified shaders

Example: NVIDIA Tesla


Up to 128 scalar processors



12,000+ concurrent threads in flight



470+ GFLOPS sustained performance



100x or better speedups on GPUs


General Purpose Computing on GPU


GPUs were originally designed for
manipulation of graphics


Shaders are programmable, and can be used for non-graphical data


Each shader can apply a kernel to a set of
data (or to create a set of data)


Individual shaders are generally slower and
more limited than CPU cores, but their
parallel nature can give a dramatic speedup

Computational Uses


Conway's Game of Life


Video encoding/decoding


Fluid Simulation


N-Body Simulation


Fourier Transform


Computation of Voronoi Diagrams


Crack UNIX password encryption (PixelFlow SIMD graphics computer)


Computation of artificial neural networks


Bitcoin mining (SHA-256)

Programming Languages


CUDA (C, C++ and Fortran)

o Third party wrappers for: Python, Perl, Java, Ruby, Lua, Haskell, MATLAB, IDL, Mathematica


OpenCL (C99)

o Wrappers for: C++, C, Java, C#, Python, Ruby, Perl, Lisp, Haskell, Mathematica, R, MATLAB, Pascal

Vendors


CUDA

o NVIDIA


OpenCL

o NVIDIA

o AMD

o Apple

o Intel

o IBM

o Portable OpenCL


Voronoi Diagram - Shops

Centroidal Voronoi Tessellation


GPU-Assisted Computation of Centroidal Voronoi Tessellation

Performance Tuning - Optimization




Populating all of the multiprocessors.


Keeping the cores busy with multithreading.


Optimizing device memory accesses for contiguous data, essentially optimizing for stride-1 memory accesses.


Utilizing the software data cache to store intermediate results or to reorganize data that would otherwise require non-stride-1 device memory accesses.



Taking advantage of asynchronous kernel launches by overlapping CPU computations with kernel execution.






Example Kernel - Prime Number Sieve (OpenCL)


CPU sets up data as array of "P" characters

o 'P' denotes prime

o 'c' denotes composite


For each prime, the CPU instructs the GPU to
apply the composite kernel on the array

o Kernel applies marking on the array

o get_global_id(0) - "Rank" of the process, transformed so that the GPU only needs to run the kernel on the multiples of the prime

__kernel void composite(int currentPrime, __global char* output){
    size_t i = currentPrime*currentPrime + currentPrime*get_global_id(0);
    output[i] = 'c';
}

Test - O(n²)

Description


List of n integers, numbered 0 to n-1


For each value in list, add up and store all
the values in the list

o Obviously not the best algorithm for summing values in parallel, but we're just trying to simulate O(n²)


CPU has 4 cores


GPU has 480 unified shaders


OpenCL applies same kernel to GPU and
CPU

Test - O(n²)

OpenCL kernel

__kernel void sum(__global int* input, __global int* output){
    size_t i = get_global_id(0);
    int out = 0;
    for(int j = 0; j < get_global_size(0); j++){
        out += input[j];
    }
    output[i] = out;
}

Test - O(n²)

Result

Video Example


CUDA vs. OpenCL


CUDA

o More popular

o Large and mature libraries

o Slightly faster

o NVIDIA only


OpenCL

o More flexible synchronization

o Can enqueue regular CPU function pointers in its command queues

o Run-time code generation built in

Sources

http://techreport.com/review/17670/nvidia-fermi-gpu-architecture-revealed/2


http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf


Guodong Rong; Yang Liu; Wenping Wang; Xiaotian Yin; Gu, X.D.; Guo, Xiaohu, "GPU-Assisted Computation of Centroidal Voronoi Tessellation," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 3, pp. 345-356, March 2011


http://www.computer.org/csdl/trans/tg/2011/03/ttg2011030345-abs.html


http://www.math.psu.edu/qdu/Res/Pic/gallery3.html