Thesis Defense Slides - Kentucky State University Cluster


Nicholas Lykins

Kentucky State University

May 7, 2012



Introduction


Why pursue GPU accelerated computing?


Performance figures



Historical background


Graphics rendering pipeline


History of GPU technology


NVIDIA and GPU implementations


Alternative GPU processing frameworks



CUDA


Background and available libraries


Terminology


Architectural design


Syntax



Hands-on CUDA sample demonstration


Line-by-line illustration of code execution


Animated execution pipeline for sample application



Conclusion and future outlook



Initial goal: Demonstrate the potential for
GPU technology to further enhance the data
processing needs of the scientific community.



Objectives


Deliver an account of the history of GPU technology


Provide an overview of NVIDIA’s CUDA framework


Demonstrate the motivation for scientists to pursue
GPU acceleration and apply it to their own scientific
disciplines


Multi-Core Processing



GPU Acceleration



…How are they different?



Hardware differences: CPU vs. GPU


CPU

(Single Instruction, Single Data)


Control unit, arithmetic and logic unit, internal
registers, internal data bus


Speed limitations


One bit in, one bit out



GPU

(Single Instruction, Multiple Data)


Many processing cores and onboard memory


Parallel execution of each core


One bit in, multiple bits out




GPU processing time is measurably faster
than comparable CPU processing time when
working with large-scale input data.




Graphics rendering pipeline


Entire process through which an image is generated
by a graphics processing device


Vertex calculations


Color generation


Shadows and lighting



Shaders


Specialized program executed by graphics
processing hardware to produce a particular
aspect of the resulting image



Traditional pipelining process


System collects data to be graphically represented


Modeling transformations within the world space


Vertices are “shaded” according to various properties


Lighting, materials, textures


Viewing transformation is performed, reorienting the
graphical object with respect to the human eye


Clipping is performed, eliminating constructed content
outside the frustum




The three-dimensional scene is then rendered onto
a two-dimensional viewing plane, or screen space


Rasterization takes place, in which the continuous
geometric representation of objects is translated
into a set of discrete fragments for a particular
display


Color, transparency, and depth


Fragments are stored within the frame buffer, where
Z-buffering and alpha blending take place and each
pixel's final appearance on the screen is determined.



OpenGL


OpenGL 1.0 first developed by Silicon Graphics in 1992.


First middleware layer developed to translate between the
operating system and the underlying hardware.


Industry-wide standard was implemented for graphics development, with
each vendor crafting its hardware architecture with those standards in
mind.


Cross-platform compatibility



DirectX


Developed in 1995 by Microsoft employees Craig Eisler, Alex St. John,
and Eric Engstrom to give programmers low-level access to Windows'
restricted memory space.


Set of related APIs (Direct3D, DirectDraw, DirectSound) that enable
multimedia development.


Each vendor provides a device driver that enables compatibility for its
own hardware across all Windows systems.


Restricted to Windows only.


Released in August of 1999, the GeForce 256 was
marketed as the world's first official GPU device.



Integration of all graphics processing actions
onto a single chip.



Implemented with a fixed-function rendering
pipeline



OpenGL 2.0


Programmable shaders


Programmers could write unique instructions for accessing hardware
functionality



Programmability enabled by proprietary “shading languages”


ARB


Low-level assembly-based language for directly interfacing with hardware
elements


Unintuitive and difficult to use effectively


GLSL (OpenGL Shading Language)


High-level language derived from C


Translates high-level code into corresponding low-level instructions to be
interpreted as ARB language



Cg


High-level shader language designed by NVIDIA


Compiles into assembly-based and GLSL code for
interpretation by OpenGL and DirectX





Released in November of 2006, the G80 architecture
was first implemented within the GeForce 8800.


First architecture to implement the CUDA
framework, and first instance of a unified
graphics rendering pipeline


Vertex and fragment shaders integrated as one
hardware component


Programmability given over individual processing
elements on the device


Scalability based on targeted consumer
market


Proportions of processing cores, memory, etc.



GeForce 8800 GTX


Each “tile” represents a separate multiprocessor


Eight streaming cores per multiprocessor, 16
multiprocessors per card


Shared L1 cache per pair of tiles


Texture handling units attached to each tile


Recursive method for handling graphics rendering


Output data for one core becomes input data
for another


Six discrete memory partitions, each 64-bit,
totaling a 384-bit interface.


Bus width and memory size vary based on the
specific G80 device.





Second-generation GPU architecture (GT200), released
in June of 2008.



The most recent featured architecture until the
Kepler architecture was released in March of
2012.



Rebranding of streaming processor cores as
CUDA cores.



Overall superior design in terms of performance
and computational precision



Core count increased from 240 to 512.



32 cores per multiprocessor, totaling 16 streaming
multiprocessors



Similar memory interface to the G80, hosting six
64-bit memory partitions totaling a 384-bit memory
interface.



64 KB shared memory
per streaming
multiprocessor








Unified memory address space: thread-level,
block-level, and global layers.


Enables a read and write mechanism compatible
with C++ via pointer handling.



Configurable shared memory
: 48 KB shared,
16 KB as L1 cache, vs. 48 KB L1 cache and 16
KB shared memory



L2 cache common across all streaming
multiprocessors





Added CUDA compatibility with the
implementation of PTX (Parallel Thread
Execution) 2.0


Low-level equivalent of assembly language


A low-level virtual machine, responsible for
translating system calls from the CPU into
hardware instructions interpretable by the GPU's
onboard hardware.


CUDA passes high-level CUDA code to the compiler.


PTX translates it into corresponding low-level code.


Hardware instructions are then interpreted based on that
low-level code and executed by the GPU itself.


Rival GPU manufacturer ATI (now AMD) develops
its own proprietary line of graphics cards


Significant architectural differences from
NVIDIA products


Evergreen chipset


ATI Radeon HD 5870 - Comparison


NVIDIA’s GTX 480


512 active cores, 3 billion
transistors


Radeon HD 5870


20 parallel engines × 16 cores × 5 processing
elements, totaling 1600 work units; 2.15
billion transistors


OpenCL


Parallel computing framework similar to CUDA


Initially introduced by Apple, but development of its
standards currently done by the Khronos Group


Emphasis on portability and cross-platform
implementations


Flagship parallel computing API of AMD


CPU/GPU, Apple systems, GPUs, etc.


Adopted by Intel, AMD, NVIDIA, ARM Holdings



CTM (Close to Metal)


Released in 2006 by AMD as a low-level API providing
hardware access, similar to NVIDIA's PTX instruction set.


Discontinued in 2008, replaced by OpenCL for principal
usage





Programming framework by NVIDIA for
performing GPGPU (General-Purpose GPU)
computing



Potential for applying parallel processing
capabilities of GPU hardware to traditional
software applications



NVIDIA Libraries


Ready-made libraries for implementing complex
computational functions


cuFFT (NVIDIA CUDA Fast Fourier Transform),
cuBLAS (NVIDIA CUDA Basic Linear Algebra Subroutines), and
cuSPARSE (NVIDIA CUDA Sparse)
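
As a brief, hedged illustration of the library route, this sketch calls cuBLAS to compute a SAXPY (y = alpha*x + y) on the device; the sizes and values are arbitrary placeholders, and the program links against -lcublas.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 8;
    float x[8], y[8];
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Move the operands into device memory.
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

    // cuBLAS performs y = alpha*x + y entirely on the GPU.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 3.0f;
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);

    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);  // 3*1 + 2 = 5.0

    cublasDestroy(handle);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}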






Hardware or software? …or both.



Development framework that mediates between
hardware elements on the GPU and the algorithms
responsible for accessing and manipulating those
elements



Expands on its original definition as a C-compatible
compiler with special extensions for recognizing
CUDA code


Resource allocation is dynamically handled by the
framework.



Scalable to different hardware devices without the
need to recode an application.



CUDA is designed as a high-level API, hiding low-level
hardware details from the user.



Three major layers of abstraction between the
architecture and the programmer:


Hierarchy of thread groups, shared memories, barrier
synchronization



Computational features implemented as functions.
Input data passed as parameters.



High-level functionality allows for a low learning
curve.



Allows for applications to be run on any GPU card
with a compatible architecture.


Backward compatible with older versions.


Resource allocation handled through threading


A thread represents a single work unit or operation


Lowest level of resource allocation for CUDA



Hierarchical structure


Threads, blocks, and grids; from lowest to highest



Analogous to multiple layers of nested
execution loops




Visual representation
of thread hierarchy


Multiple threads embedded in blocks,
multiple blocks embedded in grids


Intuitive scheme for understanding the
allocation mechanism
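
For concreteness, a minimal sketch of how this hierarchy is expressed at launch time with CUDA's built-in dim3 type (the kernel name and dimensions below are illustrative, not taken from the slides):

dim3 grid(4, 2);     // 4 x 2 = 8 blocks per grid
dim3 block(16, 16);  // 16 x 16 = 256 threads per block

// Hypothetical kernel launched over 8 blocks of 256 threads each:
myKernel<<<grid, block>>>(args);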


Threading syntax



Recognized by the framework for handling thread
usage within an application.



Each variable provides for tracking and monitoring
of individual thread activity.



Resource assignment for an application not covered
by these syntax elements.




Keywords


threadIdx.x/y/z: index of the current thread
within its block; three-dimensional.


blockIdx.x/y/z: refers to a particular block within
a grid; three-dimensional.


blockDim.x/y/z: total number of threads allocated
along a single dimension of a block; three-dimensional.


gridDim.x/y/z: block count per dimension;
three-dimensional.


tid: identifying marker for each individual thread; a
unique value for each allocated thread. (By convention
a programmer-defined variable, not a built-in.)




Flexibility for managing threads within an
application.


Example: int tid = threadIdx.x + blockIdx.x * blockDim.x


Current block number, multiplied by the number of
threads per block, added to the current thread
index within the block.


Thread IDs are managed in this equation by
mapping each value on a per-thread basis.


Simultaneous evaluation of all thread IDs.


Parallel mapping of the equation across all threads as
opposed to one thread at a time.


blockDim.x = 4


blockIdx.x = {0, 1, 2, 3…}


threadIdx.x = {0, 1, 2, 3}…{0, 1, 2, 3}…


idx/tid = blockDim.x * blockIdx.x + threadIdx.x


Problem size of ten operations, so two
threads go to waste (see the bounds-check
sketch below).
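
A minimal sketch of how that mismatch is guarded against in practice (the kernel and array names are illustrative): twelve threads are launched for ten elements, and a bounds check lets the two surplus threads fall through without touching memory.

#define N 10

__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)  // surplus threads do no work
        c[tid] = a[tid] + b[tid];
}

// Launch: 3 blocks of 4 threads = 12 threads for 10 elements.
// add<<<3, 4>>>(dev_a, dev_b, dev_c);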


Current scheme handles thread execution,
but not subsequent incrementing of thread
IDs.


Right and wrong way to increment threads, to
avoid overflow into other allocated IDs.


Increment based on grid dimensions, not on
block and thread counts


Example: tid += blockDim.x * gridDim.x


Thread ID incremented by the product of
threads per block and blocks per grid.
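
Combined with the bounds check, this yields the common grid-stride loop pattern; a sketch under the same illustrative names as before:

__global__ void add(int *a, int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;  // step over the entire grid
    }
}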


Compute capability indicates structural
limitations of hardware architectures


Determines various technical thresholds, such
as block and thread ceilings



Revision 1.x


Pre-Fermi architectures


Revision 2.x


Fermi architecture
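
The revision a given card supports can be queried at runtime through the CUDA runtime API; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // prop.major and prop.minor give the compute capability, e.g. 2.0 for Fermi.
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}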


Host memory vs. device memory


Each platform has a separate memory space


Host can read and write to host only, device can
read and write to device only



Synchronization needed between CPU and
GPU activity



GPU only handles computationally intensive
calculations


CPU still executes serial code
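
One way this division of labor appears in host code (a sketch; prepareInput and compute are hypothetical names, while cudaDeviceSynchronize is the runtime call that blocks the CPU until outstanding GPU work finishes):

// Serial setup runs on the CPU...
prepareInput(data);                 // hypothetical host-side helper

// ...the intensive computation is launched on the GPU...
compute<<<blocks, threads>>>(d_data);

// ...and the CPU waits here until the GPU has finished.
cudaDeviceSynchronize();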


Application pipeline



Represents CPU and
GPU activity


Illustrates behavior of the
application and the invocation
of GPU computations


Three address
spaces


Localized memory


Unique to each
thread


Shared memory


Shared among
threads within a
particular block


Global memory


Accessible by
threads and blocks
across a given grid


More accurate
representation of
hardware-level
interaction between
address spaces


Two new spaces:
constant memory
and texture memory


Constant memory is
read-only and
globally accessible.


Texture memory is a
subset of global
memory, useful in
graphics rendering


Two-dimensionality


Surface memory


Similar functionality
to texture memory
but different technical
elements
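
A compact sketch of how three of these spaces are expressed in CUDA C (kernel and variable names are illustrative):

__constant__ float coeff[16];         // constant memory: read-only, global scope
                                      // (filled from the host via cudaMemcpyToSymbol)

__global__ void smooth(float *out) {  // out points into global memory
    __shared__ float tile[256];       // shared memory: visible to this block only
    float t;                          // plain locals live in per-thread registers/local memory

    tile[threadIdx.x] = coeff[threadIdx.x % 16];
    __syncthreads();                  // make the shared tile visible block-wide
    t = tile[threadIdx.x];
    out[threadIdx.x + blockIdx.x * blockDim.x] = t;
}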


Three basic steps of the allocation process


1. Declare host and device memory allocations


2. Copy input data from host memory to device
memory


3. Transfer processed data back to host upon
completion



The bare minimum memory handling required for
successfully executing a GPU application


More sophisticated memory functions exist, but are
geared towards more complex functionality and
better performance


CUDA-specific keywords for dynamically
allocating memory


cudaMalloc: allocates a dynamic reference to a
location in GPU memory. Identical in function to
malloc in C.


cudaMemcpy: transfers data from CPU memory to
GPU memory. Also responsible for reversing the
transfer.


cudaFree: deallocates a reference to a GPU memory
location. Identical to free in C.



Basic syntax needed for handling memory
allocation


Additional features available for more sophisticated
applications
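
The three steps translate almost directly into host code; a minimal sketch (the array size and names are illustrative):

int a[10], result[10];
int *dev_a;

// 1. Declare host and device memory allocations.
cudaMalloc(&dev_a, 10 * sizeof(int));

// 2. Copy input data from host memory to device memory.
cudaMemcpy(dev_a, a, 10 * sizeof(int), cudaMemcpyHostToDevice);

// ...kernel launches operate on dev_a here...

// 3. Transfer processed data back to host upon completion.
cudaMemcpy(result, dev_a, 10 * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a);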


Kernel



Executes processing instructions for
data loaded onto the GPU


Executes an operation N times for N threads
simultaneously


Structured similarly to a normal function, but with its
own unique changes



Kernel syntax


Declaration: __global__ void example1(A, B, C)
Launch: example1<<<M, N>>>(A, B, C)


__global__


Declaration specifier identifying a function as a GPU
kernel.


void example1


Return type and kernel name


<<<M, N>>>


M represents the number of blocks to set aside for
executing the kernel. N indicates the number of threads
to be allocated per block.


(A, B, C)


Argument list to be passed to the kernel
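
A complete, runnable instance of this syntax, in the spirit of the vector-addition samples from CUDA By Example (this exact listing is a sketch, not the book's code):

#include <stdio.h>

#define N 10

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;  // one block per element in this sketch
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc(&dev_a, N * sizeof(int));
    cudaMalloc(&dev_b, N * sizeof(int));
    cudaMalloc(&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * i; }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);  // N blocks, 1 thread per block

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}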


During kernel execution, threads are organized into
warps.


A warp is a grouping of 32 threads, all executed in
parallel with one another.


Threads are executed at the same program address, but
each is mapped onto its own instruction counter and
register state.


Allows parallel execution, but independent pacing of
each thread in terms of completion.



Handling of threads in a warp is managed by a
warp scheduler.


Two warp schedulers available per streaming
multiprocessor


Warp execution optimized if no data dependence
between threads.


Otherwise, dependent threads remain disabled until
required data is received from completed operations


Separation of threads between warps can
cause data to get “tangled”.


Completed data does not coalesce back in memory
as it should due to out-of-order warp execution.



Problem avoided by using __syncthreads()


Forcibly halts continued execution of a thread batch
until all threads in the block have reached that point.


Minimizes idle time for threads that finish early and
ensures fewer errors in sensitive computations
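
A small sketch of the pattern (illustrative names; assumes a single block of BLOCK threads): each thread writes one slot of shared memory, and __syncthreads() guarantees every write has landed before any thread reads a neighbor's slot.

#define BLOCK 128

__global__ void shift(int *in, int *out) {
    __shared__ int cache[BLOCK];
    int i = threadIdx.x;

    cache[i] = in[i];
    __syncthreads();  // wait until the whole block has written

    // Safe past the barrier: cache[] is fully populated for every thread.
    out[i] = cache[(i + 1) % BLOCK];
}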


Animated visualization, indicating the relation
between CPU and GPU elements



Sample code obtained from: Sanders, Jason and
Kandrot, Edward. CUDA By Example: An Introduction
to General-Purpose GPU Programming. Boston:
Pearson Education, Inc., 2011.



Highlights the activities needed to facilitate
completion of a GPU-based data processing
application.



Code Animation Link



Major topics covered:



Performance benefits of GPU accelerated
applications.


Historical account of GPU technology and graphics
processing.


Hands-on demonstration of CUDA, including
syntax, architecture, and implementation.






Promising future, with positive projected market
demand for GPU technology



Growing market share for NVIDIA products


Gaming applications, scientific computing, and video
editing and engineering purposes



Release of Kepler architecture


March 2012.


Indicates further increase in performance metrics and
optimized resource consumption


Currently little documentation released in terms of
technical specifications



The role of GPU technology is sure to keep
growing throughout the professional market as its
capabilities continue to rise.


1. Meyers, Michael. Mike Meyers' CompTIA A+ Guide to Managing and Troubleshooting PCs. s.l.: McGraw-Hill Osborne Media, 2010.


2. MAC. Hardware Canucks. [Online] November 14, 2011. [Cited: February 21, 2012.] http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/48210-intel-sandy-bridge-e-core-i7-3960x-cpu-review-3.html.


3. Intel Corporation. Intel AVX. [Online] [Cited: February 21, 2012.] http://software.intel.com/en-us/avx/.


4. Gupta, Shubham and Babu, M. Rajasekhara. Performance Analysis of GPU compared to Single-core and Multi-core CPU for Natural Language Applications. 2011, International Journal of Advanced Computer Science and Applications, Vol. 2, No. 5, p. 4.


5. IAP 2009 CUDA @ MIT / 6.963. [Online] January 2009. [Cited: February 7, 2012.] https://sites.google.com/site/cudaiap2009/.


6. Palacios, Jonathan and Triska, Josh. A Comparison of Modern GPU and CPU Architectures: And the Common Convergence of Both. [Online] March 15, 2011. [Cited: February 21, 2012.] http://web.engr.oregonstate.edu/~palacijo/cs570/final.pdf.


7. NVidia. NVidia's Next Generation CUDA Compute Architecture: Fermi. nvidia.com. [Online] 2009. [Cited: February 21, 2012.] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.


8. NVidia. NVidia CUDA C Programming Guide 4.1. NVidia. [Online] November 18, 2011. [Cited: February 17, 2012.] http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf.


9. Lillian, Peter. NVIDIA GPU'S Workshop. Lexington: s.n., January 17, 2012.


10. Cutler, Barb. The Traditional Graphics Pipeline. Rensselaer Polytechnic Institute - Barb Cutler - Faculty Website. [Online] 2009. [Cited: February 12, 2012.] http://www.cs.rpi.edu/~cutler/classes/advancedgraphics/S09/lectures/15_Graphics_Pipeline.pdf.


11. Thomson, Richard. The Direct3D Graphics Pipeline. Richard Thomson - Personal Website. [Online] 2006. [Cited: February 24, 2012.] http://user.xmission.com/~legalize/book/download/index.html.


12. Edwards, Benj. A Brief History of Computer Displays. [Online] November 1, 2010. [Cited: January 12, 2012.] http://www.pcworld.com/article/209224/a_brief_history_of_computer_displays.html.


13. Intel Corporation. iSBX 275 Video Graphics Controller Multimodule Board Reference Manual. [Online] 1982. [Cited: January 9, 2012.] http://www.bitsavers.org/pdf/intel/iSBX/144829-001_iSBX_275_Video_Graphics_Multimodule_Sep82.pdf.


14. Farrimond, Dorian. Technology that Changed Gaming #2: The Commodore Amiga. [Online] April 15, 2011. [Cited: January 23, 2012.] http://www.blitterandtwisted.com/2011/04/technology-that-changed-gaming-2-the-commodore-amiga.html.


15. Silicon Graphics International Corporation. OpenGL Overview. [Online] 2009. [Cited: February 12, 2012.] http://www.sgi.com/products/software/opengl/?/overview.html.


16. Coding Unit. The History of DirectX. [Online] [Cited: February 22, 2012.] http://www.codingunit.com/the-history-of-directx.


17. NVidia. GeForce 256. [Online] [Cited: February 12, 2012.] http://www.nvidia.com/page/geforce256.html.


18. Rost, Randi J. OpenGL Shading Language, Second Edition. s.l.: Addison Wesley Professional, 2006.


19. Woo, Mason, et al. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.1. s.l.: Addison Wesley Publishing, 1997.


20. Mark, William R., et al. Cg: A system for programming graphics hardware in a C-like language. 2003, ACM Transactions on Graphics, pp. 896-907.


21. NVidia. The Cg Tutorial: Chapter 1. Introduction. NVidia.com. [Online] April 20, 2011. [Cited: February 23, 2012.] http://developer.nvidia.com/node/76.


22. NVidia. NVidia GeForce 8800 GPU Architecture Overview. NVidia.com. [Online] November 2006. [Cited: February 12, 2012.] http://www.nvidia.com/object/IO_37100.html.


23. Kirk, David B. and Hwu, Wen-Mei W. Programming Massively Parallel Processors: A Hands-on Approach. s.l.: Morgan Kaufmann, 2010.


24. NVidia. PTX: Parallel Thread Execution ISA Version 2.3. NVidia.com. [Online] March 8, 2011. [Cited: February 27, 2012.] http://developer.download.nvidia.com/compute/cuda/4_0_rc2/toolkit/docs/ptx_isa_2.3.pdf.


25. AMD. AMD Radeon HD 5870. AMD. [Online] [Cited: February 27, 2012.] http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5870/Pages/ati-radeon-hd-5870-overview.aspx#2.


26. AMD-ATI. Heterogeneous Computing OpenCL and the ATI Radeon HD 5870 ("Evergreen") Architecture. [Online] [Cited: February 27, 2012.] http://developer.amd.com/gpu_assets/Heterogeneous_Computing_OpenCL_and_the_ATI_Radeon_HD_5870_Architecture_201003.pdf.


27. Rosenberg, Ofer. OpenCL Overview. Khronos Group. [Online] November 2011. [Cited: February 28, 2012.] http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf.


28. Khronos Group. OpenCL Overview. Khronos Group. [Online] [Cited: February 28, 2012.] http://www.khronos.org/opencl/.


29. AMD. GPGPU History. AMD. [Online] [Cited: February 28, 2012.] http://www.amd.com/us/products/technologies/stream-technology/opencl/Pages/gpgpu-history.aspx.


30. AMD. AMD "Close to Metal" Press Release. AMD. [Online] November 14, 2006. [Cited: February 28, 2012.] http://www.amd.com/us/press-releases/Pages/Press_Release_114147.aspx.


31. NVIDIA. GPU-Accelerated Libraries. NVIDIA.com. [Online] [Cited: April 4, 2012.] http://developer.nvidia.com/gpu-accelerated-libraries.


32. McGlaun, Shane. DailyTech. [Online] April 5, 2008. [Cited: February 12, 2012.] http://www.nvidia.com/object/GPU_Computing.html.


33. Farber, Rob. CUDA, Supercomputing for the Masses: Part 2. Dr. Dobb's - The World of Software Development. [Online] April 29, 2008. [Cited: April 5, 2012.] http://drdobbs.com/cpp/207402986.


34. NVIDIA. NVIDIA GeForce GTX 680 Whitepaper. NVIDIA.com. [Online] March 22, 2012. [Cited: April 7, 2012.] http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf.


35. Sanders, Jason and Kandrot, Edward. CUDA By Example: An Introduction to General-Purpose GPU Programming. Boston: Pearson Education, Inc., 2011.


36. Lee, Hsien-Hsin Sean. Multicore And Programming for Video Games. Georgia Institute of Technology. [Online] October 5, 2008. [Cited: February 22, 2012.] http://users.ece.gatech.edu/lanterma/mpg08/mpglecture12f08_4up.pdf.


37. Phillips, Jeff M. Introduction to and History of GPU Algorithms. The University of Utah - Models of Computation for Massive Data Course. [Online] November 9, 2011. [Cited: February 22, 2012.] http://www.cs.utah.edu/~jeffp/teaching/cs7960/GPU-intro.pdf.

