Kernel Specialization for Improved Adaptability and Performance on
Graphics Processing Units (GPUs)
A Dissertation Presented
by
Nicholas John Moore
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the eld of
Computer Engineering
Northeastern University
Boston, Massachusetts
June 2012
© Copyright 2012 by Nicholas Moore
All Rights Reserved
Abstract
Graphics processing units (GPUs) offer significant speedups over CPUs for certain classes of applications. However, maximizing GPU performance can be a difficult task due to the relatively high programming complexity as well as frequent hardware changes. Important performance optimizations are applied by the GPU compiler ahead of time and require fixed parameter values at compile time. As a result, many GPU codes offer minimal adaptability to variations among problem instances and hardware configurations. These factors limit code reuse and the applicability of GPU computing to a wider variety of problems. This dissertation introduces GPGPU kernel specialization, a technique that can be used to describe highly adaptable kernels that work across different generations of GPUs with high performance. With kernel specialization, customized GPU kernels incorporating both problem- and implementation-specific parameters are compiled for each problem and hardware instance combination. This dissertation explores the implementation and parameterization of three real world applications targeting two generations of NVIDIA CUDA-enabled GPUs and utilizing kernel specialization: large template matching, particle image velocimetry, and cone-beam image reconstruction via backprojection. Starting with high performance adaptable GPU kernels that compare favorably to multi-threaded and FPGA-based reference implementations, kernel specialization is shown to maintain adaptability while providing performance improvements in terms of speedups and reduction in per-thread register usage. The proposed technique offers productivity benefits, the ability to adjust parameters that otherwise must be static, and a means to increase the complexity and parameterizability of GPGPU implementations beyond what would otherwise be feasible on current GPUs.
Acknowledgements
I would like to thank Professor Leeser for her guidance and patience over the past several years. MathWorks generously supported my studies. Together with supervision by James Lebak, this assistance significantly enhanced the quality of my education. Completing this work would not have been possible without the support of Catherine, my family, and my friends.
Contents
1 Introduction

2 Background
  2.1 NVIDIA CUDA
  2.2 In-Block Reductions
  2.3 Recent CUDA Techniques
  2.4 CUDA Development and Adaptability
  2.5 Implementation Parameters and Parameterization
  2.6 OpenCV Example

3 Related Work
  3.1 Run Time Code Generation
  3.2 Autotuning
  3.3 Domain Specific Tools
  3.4 Summary

4 Kernel Specialization
  4.1 Benefits
  4.2 OpenCV Example
  4.3 Trade-Offs
  4.4 Implementation and Tools
    4.4.1 GPU-PF
    4.4.2 Validation and Parameterization

5 Applications
  5.1 Large Template Matching
    5.1.1 Algorithm and Problem Space
    5.1.2 CUDA Implementation Challenges
    5.1.3 CUDA Implementation
      5.1.3.1 Numerator Stage
      5.1.3.2 Variable Tile Sizes and Kernel Specialization
      5.1.3.3 Other Stages
      5.1.3.4 Runtime Operation
    5.1.4 CPU Implementations
  5.2 Particle Image Velocimetry
    5.2.1 CUDA Implementation
    5.2.2 Kernel Specialization
  5.3 Cone Beam Backprojection
    5.3.1 Kernel Specialization
  5.4 Summary

6 Experiments and Results
  6.1 Experimental Setup
    6.1.1 Hardware and Software Configurations
    6.1.2 Problem and Implementation Parameterization
      6.1.2.1 Template Matching
      6.1.2.2 PIV
      6.1.2.3 Cone Beam Back Projection
  6.2 Results
    6.2.1 Comparative Performance
    6.2.2 Kernel Specialization Performance
      6.2.2.1 Template Matching
      6.2.2.2 PIV
      6.2.2.3 Cone Beam Backprojection
  6.3 Analysis

7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 Existing Applications
    7.2.2 GPU-PF
    7.2.3 Kernel Specialization

A Glossary
B Flexibly Specializable Kernel
C Sample Run Time Evaluated PTX
D Sample Kernel Specialized PTX
E OpenCV Kernel Source
F OpenCV Specialized Kernel Source
G Sample GPU-PF Log Output
List of Figures
2.1 A graphical representation of the NVIDIA CUDA-capable GPU architecture realization. This figure appears in [36].
2.2 Example of a parallel reduction tree starting with eight elements.
4.1 Different template tile regions.
5.1 Ā and B̄ are the matrix averages of A and B.
5.2 B̄ is the matrix average of B, A_C is the current template with the template average subtracted from each value, and A_D is the template contribution to the denominator.
5.3 The actual functionality implemented by the numerator stage. B̄ is the matrix average of B, and A_C is the template with its average value subtracted.
5.4 Different template tile regions.
5.5 Graphical data layout of the shift area for a single tile. Regardless of dimensions, all tiles are applied to the same shift area. A single block of the summation kernel may only combine a subset, shown in gray, of the template locations.
5.6 Each thread accumulates the tile contributions for a single shift offset, an example of which is shown in gray.
5.7 A pseudocode representation of the computation performed by each CPU thread.
5.8 A graphical depiction of the physical PIV configuration.
5.9 A graphical depiction of the terminology originally used for the FPGA PIV implementation. This image is from Bennis's dissertation [6].
5.10 The per-mask offset sum of squared differences similarity score, defined in terms of the original PIV problem specification as shown in Figure 5.9.
5.11 Example of a set of threads striped across a mask's area.
5.12 A depiction of the warp specialization used in the PIV kernel to remove the reduction as a bottleneck.
5.13 A graphical depiction of the cone beam scanning setup.
6.1 Contour plots of performance relative to the peak for each of the data sets in Table 6.4 on the Tesla C1060. The location of peak performance is marked with a white square.
6.2 Contour plots of performance relative to the peak for each of the data sets in Table 6.4 on the Tesla C2070. The location of peak performance is marked with a white square.
List of Tables
2.1 CUDA thread block characteristics by compute capability.
2.2 The amount of register and shared memory available within each streaming multiprocessor for NVIDIA GPUs of various compute capabilities [38].
4.1 Various parameter types provided by GPU-PF.
4.2 Various resource types provided by GPU-PF.
4.3 Various memory types provided by GPU-PF.
4.4 Various actions provided by GPU-PF.
5.1 Per patient, the number of image frames, template number and size, vertical/horizontal shift within ROI, and number of corr2() calls.
5.2 Template tiling examples for the template size associated with Patient 4 (156 × 116 pixels).
6.1 Template matching GPU implementation parameters benchmarked.
6.2 The PIV problem set parameters, in terms of interrogation window and image dimensions, used for comparing performance of the FPGA and GPU implementations.
6.3 The PIV problem set parameters, in terms of mask and offset counts, used for comparing performance of the FPGA and GPU implementations.
6.4 PIV problem set parameters used to test the impact of mask size on the performance of the GPU implementation.
6.5 PIV problem set parameters used to test the impact of the number of search offsets on the performance of the GPU implementation.
6.6 PIV problem set parameters used to test the impact of interrogation window overlaps on the performance of the GPU implementation.
6.7 PIV GPU implementation parameters benchmarked.
6.8 Cone beam backprojection GPU problem parameters benchmarked.
6.9 Cone beam backprojection GPU implementation parameters benchmarked.
6.10 Template matching performance results comparing the multi-threaded C CPU implementation to the best performing CUDA implementation on two GPUs.
6.11 PIV performance results comparing the FPGA implementation to the best performing CUDA implementation on two GPUs.
6.12 Cone beam backprojection results comparing the OpenMP CPU implementation with four threads to the best performing configuration on both GPUs.
6.13 Template matching partial sums: performance and optimal configuration characteristics for the tiled summation kernel. RE stands for runtime evaluated, and SK stands for specialized kernel.
6.14 PIV GPU performance comparisons for several kernel variants across the FPGA benchmark set.
6.15 PIV GPU performance data for the FPGA benchmark set, including optimal register blocking and thread counts.
6.16 PIV GPU performance data for the varying mask size benchmark set, including optimal register blocking and thread counts.
6.17 PIV GPU performance data for the varying search benchmark set, including optimal register blocking and thread counts.
6.18 PIV GPU performance data for the varying overlap benchmark set, including optimal register blocking and thread counts.
6.19 Performance comparisons for the backprojection kernels.
6.20 Occupancy and execution data for the C1060 on the V2 data set.
6.21 Percentage of the peak performance for the template matching application with various fixed main tile sizes and thread counts.
6.22 Percentage of the peak performance for the PIV application with various fixed data register counts and thread counts.
Listings
4.1 A CUDA C GPU kernel designed to demonstrate the common ways kernel specialization can improve the performance of kernels. This kernel is a regular, fully run-time evaluated kernel and does not rely on specialization in any way.
4.2 A CUDA C GPU kernel designed to demonstrate kernel specialization. The constants specified by identifiers in all capital letters must be provided at compile time.
5.1 MATLAB function implementing the sliding window correlation for each template for each ROI within the current frame.
B.1 A CUDA C GPU kernel designed to demonstrate flexible kernel specialization. The kernel can be compiled both with and without specialization.
C.1 The nvcc command line used to generate the PTX in C.2. The mathTest2.cu file contained the source of Listing B.1.
C.2 The run-time adaptable PTX produced by calling nvcc on the CUDA C source in Appendix B without any fixed parameters.
D.1 The nvcc command line used to generate the PTX in D.2. The mathTest2.cu file contained the source of Listing B.1.
D.2 Specialized PTX produced by calling nvcc on the CUDA C source in Appendix B and specifying all parameters on the command line.
E.1 Unmodified OpenCV CUDA example.
F.1 Modified OpenCV CUDA example: this portion is specialized.
F.2 Modified OpenCV CUDA example: this portion is compiled into the host program.
G.1 Initial application refresh, Part 1.
G.2 Initial application refresh, Part 2.
G.3 Pipeline iteration.
G.4 Per-operation timing.
G.5 High-level timing, Example A.
G.6 High-level timing, Example B.
G.7 High-level timing, Example C.
Chapter 1
Introduction
General purpose computing on graphics processing units (GPGPU) has seen wide adoption in the past few years. The promise of significant performance gains over traditional CPUs has resulted in an ever increasing variety of problem types being accelerated on GPUs.

However, developing GPGPU applications is often difficult, and peak performance is obtained by carefully accommodating particular GPU hardware characteristics. This leads to programming practices that limit the adaptability of kernel implementations to specific problems. Compounding this issue is the rapid evolution of GPU hardware over time. Constructing a GPGPU application around the specific properties of one GPU model can limit the performance of an implementation on other hardware. Together, these practices hinder code reuse and the applicability of GPGPU to a wider range of problem instances.
This dissertation introduces kernel specialization of GPGPU kernels, specifically targeting CUDA-enabled GPUs from NVIDIA. Kernel specialization refers to the generation of GPU binaries that are customized for the current problem and/or hardware parameters. Specifically, the approach uses developer-friendly CUDA C language kernels and run-time calls to the CUDA compiler. Kernel specialization allows GPU kernel implementations to achieve greater levels of adaptability to variations among both problems and hardware while preserving the performance associated with hard-coded approaches. It may also enable more complicated kernels to fit within the constraints imposed by the CUDA environment.
Many important GPU optimizations, such as loop unrolling, strength reduction, and register blocking, require static values at compile time. With GPGPU kernel compilation typically done ahead of time, it is common practice to hard code both problem and hardware parameters. This limits the ability of many GPGPU kernels to adapt to a wide range of problems and GPU targets. When adaptability is required and parameter values are not specified, performance often suffers since many crucial performance optimizations cannot be applied by the compiler. Generation of specialized GPU binaries once parameter values are known results in higher performance code based on the now-fixed parameters. Parameters selected for specialization can only be changed through recompilation.

The fixed parameters used to generate customized kernels are derived in part from the particular problem instance. However, many CUDA kernel implementations also provide a set of implementation parameters that can be adjusted independently of the problem parameters. These often provide an opportunity to tune kernel implementations to the particular characteristics of the current GPU. Examples of these parameters include the number of threads per block and the amount of work assigned to each thread.
The kernel specialization method presented here, while helping improve the performance of GPU kernels, can also improve the CUDA development experience. Common techniques that select implementation parameters at compile time using preprocessor macros, or that instantiate many versions of the same kernel for different parameter and data type combinations, can be avoided. This improves the maintainability and readability of GPU kernels. The size of program binaries is also minimized, as they do not contain compiled versions of all possible versions of the kernel. C++ template techniques can be used to extend this benefit to higher levels of algorithmic specialization.

This dissertation covers the application of kernel specialization to CUDA versions of three real-world applications: large template matching, particle image velocimetry, and cone-beam image reconstruction via backprojection. Each application utilizes specialization differently to achieve better performance than is otherwise possible. The applications utilize a custom framework developed as a part of this research that automates the compilation of customized CUDA C kernels at run time.

The technique presented here complements other related research projects that focus on greater parameterization of GPU kernels. It fills in gaps between automated domain-limited frameworks and auto-tuning tools. Custom human-developed kernels are often required to augment libraries' problem space coverage. Auto-tuning approaches are important for determining optimal parameterizations for kernels developed with kernel specialization in mind, but may not explore the large variety of fundamental implementation approaches that a human designer or domain-specific tool may employ.
Contributions of this research include a methodology to write kernels once and recompile for different parameters, along with an application framework that automates the process. Using the methodology, this research:

• Demonstrates improved adaptability of GPU kernels to different problems and architectures while offering good performance using kernel specialization

• Employs register blocking and loop unrolling in a run-time adjustable fashion with kernel specialization

• Demonstrates reduced register usage with kernel specialization
Chapter 2 provides background information, and Chapter 3 discusses related research projects. Chapter 4 presents kernel specialization in more detail. The details of the three application case studies to which kernel specialization has been applied are described in Chapter 5. The testing scenarios and performance results for each application are covered in Chapter 6. Chapter 7 concludes the dissertation with a summary and plans for future work. A glossary is provided in Appendix A.
Chapter 2
Background
This chapter provides an overview of the challenges this research addresses. First, Section 2.1 provides a brief description of the CUDA architectural abstraction and programming environment provided by NVIDIA. While the principles discussed in this dissertation apply to GPGPU in general, all experiments to date have targeted NVIDIA CUDA-enabled GPUs. Some aspects of typical CUDA usage make writing general code that can adapt to arbitrary incoming problems challenging, and these issues are examined in Section 2.4.
2.1 NVIDIA CUDA
CUDA provides a software environment for writing GPGPU kernels and specifies an abstract hardware target with large amounts of parallelism. The abstractions provided help shield the application developer from a number of low-level hardware considerations and allow NVIDIA to change the underlying implementation of the hardware abstraction, increasing peak performance over time. However, this paradigm still requires application programmers to contend with a number of issues not often considered when developing software for general purpose processors, including a structured and restricted parallelism model, several distinct memory spaces, and transferring data to and from a device that is external to the host processor. Additionally, developers have not been completely isolated from changes in the CUDA hardware realization, which NVIDIA has been updating rapidly over time. New GPUs have changed important hardware parameters and introduced completely new capabilities.
In CUDA, parallelism is presented in the form of thousands of threads. To organize these threads, CUDA provides a two-level thread hierarchy composed of thread blocks and the grid. Thread blocks can have one, two, or three logical dimensions. The grid specifies a one, two, or, on newer GPUs, three dimensional space consisting of identically shaped thread blocks. Together, the grid and blocks can be used to define a large thread space, defined independently for each kernel invocation but fixed for the duration of a kernel's execution.
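As a brief illustration (not one of the dissertation's listings; the kernel and sizes are hypothetical), a host-side launch defines this thread space with dim3 block and grid dimensions:

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread scales one pixel of a 2D image.
    __global__ void scale(float *img, int width, int height, float factor) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column within the grid-wide thread space
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row within the grid-wide thread space
        if (x < width && y < height)
            img[y * width + x] *= factor;
    }

    void launchScale(float *d_img, int width, int height) {
        dim3 block(16, 16);                               // 256 threads per block, two logical dimensions
        dim3 grid((width + block.x - 1) / block.x,        // enough identically shaped blocks to cover the image
                  (height + block.y - 1) / block.y);
        scale<<<grid, block>>>(d_img, width, height, 2.0f);  // thread space is fixed for this invocation
    }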
Inter-thread communication is possible at thread block scope via a manually managed block-local memory called shared memory, which is available for reading and writing by all threads. Block-level thread barriers allow for synchronized shared memory access by all the threads within a block. However, large scale inter-block communication is generally not efficient. Combined, the thread hierarchy and communication restrictions allow for the scalability of CUDA applications by constraining kernels to relatively small and independent chunks of computation. CUDA applications scale to larger problems by increasing the number of blocks, and newer CUDA hardware scales performance by providing more parallel block execution resources. This requires, however, that the given problem is amenable to the problem partitioning required by CUDA.
The amount of memory and execution resources used at the block level can affect performance and must be considered by kernel developers. NVIDIA's hardware realization of the abstract CUDA architecture uses streaming multiprocessors (SMs) to execute blocks, as shown in Figure 2.1. Full thread blocks are assigned to a SM, and within the SM a thread block is mapped to groups of 32 consecutive threads, each collectively referred to as a warp. Warps are the unit of execution, with the threads within a warp executed in a single-instruction, multiple-thread (SIMT) manner. Fast context switching between warps and the ability to execute multiple thread blocks within the same SM generates a significant amount of thread-level parallelism. However, kernel configurations can have a significant impact on the ability of the GPU to execute more warps simultaneously. SMs contain a limited number of shared resources, including registers, shared memory, and warp and block state tracking hardware. The number of blocks simultaneously executed by a SM is limited by the block-level resource usage required by a kernel.
Figure 2.1: A graphical representation of the NVIDIA CUDA-capable GPU architecture realization. This figure appears in [36].

The above restrictions require CUDA developers to balance a number of trade-offs when developing GPGPU applications. However, developers concerned with targeting a wide variety of CUDA-capable GPUs must also account for changes in both the CUDA hardware abstraction and the physical hardware. In general, newer CUDA-capable GPUs have had an increasing number of resources per block, as shown in Table 2.1, with the biggest changes associated with the introduction of CUDA devices of compute capability 2.x (the Fermi architecture). While the number of many block-level resources has increased, the physical realization of the CUDA abstraction has also changed.

Table 2.1: CUDA thread block characteristics by compute capability

                                     Compute Capability
                                 1.0 & 1.1    1.2 & 1.3     2.0 & 2.1
    Date of Toolkit Support [13] June 2007    August 2008   March 2010
    Max. Threads                 512          512           1024
    Shared Mem.                  16 KB        16 KB         16/48 KB
    Shared Mem. Banks            16           16            32
    32-bit Registers             8 K          16 K          32 K
    Max. Warps/SM                24           32            48
Further complicating CUDA development are memory usage concerns. While they are not covered here, the CUDA environment provides a number of different memory types, each with its own set of size and performance characteristics. Accommodating some memory access requirements (e.g. coalescing for global memory) is fundamental to achieving good performance on CUDA-enabled devices. Many other issues, however, are more nuanced, and while they occur less frequently, they can also significantly degrade performance of the memory hierarchy.
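For example (an illustrative fragment, not from the dissertation), the copy below is coalesced because consecutive threads in a warp access consecutive global memory addresses; the commented-out strided form would break that pattern and require many more memory transactions:

    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];              // consecutive threads touch consecutive addresses
        // out[32 * i] = in[32 * i];     // a strided access like this would not coalesce
    }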
2.2 In-Block Reductions
There are many fundamental parallel programming patterns. Of interest here are parallel reductions. Reductions have been an important part of GPGPU programming since its inception, and an examination of techniques for implementing high-performance reductions with CUDA has been included with the CUDA SDK for many releases [24]. In particular, this work relies on reductions that include an associative operation, such as addition. Reductions are often applied to large amounts of data, and the usual concerns about generating parallelism across a set of thread blocks still apply. Generally, each block will perform a moderately-sized reduction on a subset of the data, usually producing a single output value per block. Multiple rounds of reduction kernel calls are used, with each successive call requiring fewer and fewer thread blocks to span the data set.
The reductions discussed in this dissertation are of moderate size, so only the in-block behavior is relevant. Much like the multiple reduction kernel calls, multiple reduction rounds within a block are used, as shown in Figure 2.2. Less parallelism is available after each round, with the working set, and therefore the number of threads needed, reduced by half (assuming power-of-two initial element counts). This forms a tree where the number of levels, or rounds, in the tree is determined by the base-2 logarithm of the initial number of elements.

Figure 2.2: Example of a parallel reduction tree starting with eight elements.

In NVIDIA GPUs, register memory is private and not accessible by other threads. This requires reductions to take place through shared memory. For each level of the tree, data is written to shared memory, a thread-synchronization barrier is executed, and then the remaining participating threads read out the newly compacted data. However, with CUDA, the number of synchronizations is not the same as the number of levels in the tree. Due to the thirty-two-thread SIMT width (warp) in NVIDIA GPUs, the threads within a warp are guaranteed to be in sync. Once the number of elements reaches sixty-four, a single warp can finish a reduction without synchronization.
    Compute Capability   Register File Size per SM   Shared Memory per SM
    1.0 – 1.1            32 KB                       16 KB
    1.2 – 1.3            64 KB                       16 KB
    2.x                  128 KB                      16/48 KB

Table 2.2: The amount of register and shared memory available within each streaming multiprocessor for NVIDIA GPUs of various compute capabilities [38].
Regardless, throughout the duration of a block-level reduction, fewer and fewer threads participate after each level of the tree. This results in an increasing number of idle threads.
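A minimal sketch of such an in-block tree reduction is shown below; it follows the general pattern of the reduction examples distributed with the CUDA SDK rather than any listing from this dissertation, and it assumes a power-of-two BLOCK_SIZE fixed at compile time:

    #define BLOCK_SIZE 256  // assumed power of two, known at compile time

    __global__ void blockSum(const float *in, float *out) {
        __shared__ float sdata[BLOCK_SIZE];
        unsigned int tid = threadIdx.x;

        // Each thread loads one element of this block's chunk into shared memory.
        sdata[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
        __syncthreads();

        // Tree reduction: the number of participating threads halves at each level.
        for (unsigned int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // One value per block; successive kernel launches combine the per-block results.
        if (tid == 0)
            out[blockIdx.x] = sdata[0];
    }

As described above, the barrier inside the loop can be dropped once only a single warp remains active, which is how the SDK examples reduce the synchronization count.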
2.3 Recent CUDA Techniques
Since the introduction of CUDA, many kernel implementation techniques have been suggested. Most have not broken with the original paradigm suggested by NVIDIA, which emphasizes context switching among many warps. However, some recent research has proposed new techniques that take advantage of particular traits of NVIDIA hardware.
First, NVIDIA GPUs provide a memory hierarchy that is inverted relative to CPUs, with more on-die memory dedicated to the register file than to shared memory [54], as shown in Table 2.2. However, since shared memory does not provide enough throughput to achieve peak performance, sourcing ALU operands from registers is mandatory to maximize computational throughput.
While this encourages increased register usage, other factors can exert downward pressure on register usage. When sufficient instruction-level parallelism is not available, CUDA-enabled GPUs rely on rapid low-latency context switching between warps, possibly from multiple thread blocks, to hide latencies. However, a streaming multiprocessor can only execute warps from multiple blocks if it contains enough hardware resources, including registers, to execute multiple full blocks simultaneously. As each thread uses more registers, fewer thread blocks may be processed by a single SM. Even when a SM is only executing a single thread block, complex or highly register-blocked kernels often require a reduction in the number of threads per block to fit within resource limits.

Increasing the number of warps that can be run simultaneously within a streaming multiprocessor is one of NVIDIA's main performance recommendations [37]. However, NVIDIA GPUs have been shown to be able to take advantage of significant memory- and instruction-level parallelism, even to the extent where a relatively small number of warps fully saturates the available memory or computational hardware resources [54]. While occupancy may be very low, additional active warps will not improve performance. This creates a scenario where maximum performance can be achieved using resource-intensive thread blocks with low numbers of threads per block.
One technique to increase the number of operands coming from register memory is register blocking. Register blocking has been shown to be an effective performance technique for GPUs [55, 53]. Higher levels of register blocking on the GPU generally assign more work to each thread, shifting the balance from thread-level parallelism (TLP) to instruction-level parallelism (ILP).
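The sketch below illustrates the idea (it is not code from this dissertation): each thread produces ELEMS_PER_THREAD results held in registers, and because the blocking factor is a compile-time constant the loops can be fully unrolled, exposing independent operations to the scheduler:

    #define ELEMS_PER_THREAD 4  // register blocking factor, fixed at compile time

    __global__ void scaleBlocked(const float *in, float *out, int n, float factor) {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        float regs[ELEMS_PER_THREAD];  // stays in registers: all indices are constants after unrolling

        #pragma unroll
        for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
            int idx = tid + i * stride;            // coalesced: consecutive threads, consecutive addresses
            regs[i] = (idx < n) ? in[idx] : 0.0f;  // independent loads provide memory-level parallelism
        }

        #pragma unroll
        for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
            int idx = tid + i * stride;
            if (idx < n)
                out[idx] = regs[i] * factor;       // independent multiplies provide ILP
        }
    }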
A second relatively new CUDA kernel programming technique is called warp specialization. While intra-warp thread divergence is undesirable, there is no intrinsic penalty associated with inter-warp divergence. Introduced by Bauer et al. [3] in their CudaDMA library, warp specialization was used to dedicate several warps to loading data and the rest to compute. Data was double buffered through shared memory, and a block-wide synchronization point was used to control toggling buffers between the two groups of threads. By manually assigning and limiting warps to the distinct tasks of data movement and processing, CudaDMA is able to maximize the parallelism used by ensuring that both the memory hierarchy and compute resources are constantly engaged.
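The control structure of a warp-specialized kernel looks roughly like the following sketch; it is a simplified illustration of the pattern, not the CudaDMA API, and it assumes a block size that is a multiple of 32 and larger than one warp, with a hypothetical tile size TILE:

    #define TILE 256  // elements staged per iteration (assumed for this sketch)

    __global__ void warpSpecialized(const float *in, float *out, int numTiles) {
        __shared__ float buf[2][TILE];       // double buffer in shared memory
        int warpId = threadIdx.x / 32;
        int lane   = threadIdx.x & 31;

        // The loader warp stages the first tile before any compute warp touches it.
        if (warpId == 0)
            for (int i = lane; i < TILE; i += 32)
                buf[0][i] = in[i];
        __syncthreads();

        for (int t = 0; t < numTiles; ++t) {
            if (warpId == 0) {
                // Loader warp: prefetch the next tile into the other buffer.
                if (t + 1 < numTiles)
                    for (int i = lane; i < TILE; i += 32)
                        buf[(t + 1) & 1][i] = in[(t + 1) * TILE + i];
            } else {
                // Compute warps: consume the tile staged during the previous step.
                for (int i = threadIdx.x - 32; i < TILE; i += blockDim.x - 32)
                    out[t * TILE + i] = buf[t & 1][i] * 2.0f;
            }
            __syncthreads();                 // block-wide toggle point between the two buffers
        }
    }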
2.4 CUDA Development and Adaptability
Much like traditional CPU code, typical CUDA C development practices compile code once ahead of time. However, unlike traditional C-like languages on CPUs, the target for compilation is often not machine executable. CUDA C is first compiled to PTX (parallel thread execution), an assembly-level intermediate representation for NVIDIA GPUs [40], and then, at run time, translated to the final instruction set architecture (ISA) used by the target GPU. The separation of PTX from a fixed ISA allows NVIDIA to make hardware changes between generations of GPUs while preserving portability of compiled programs.

It is possible to instruct the CUDA C compiler front-end to generate the actual binary ISA version of a given kernel before run time, as is done in this work. PTX and binary versions of a compiled kernel can reside in the same Executable and Linkable Format (ELF) formatted executable objects. A compatible binary version will be selected before translating a compatible PTX version.
The syntax of PTX files is similar to many other assembly language formats, with some extra CUDA-specific registers and instructions for managing the wide parallelism available on NVIDIA GPUs. Sample PTX listings are provided in Appendices C and D. Familiar instruction types, such as load, add, multiply, etc., are suffixed with a data type and/or memory space. PTX uses load-store semantics, with the destination specified before source operands. Registers, whose names are prefixed with %, are virtual. Register assignment takes place during the translation from PTX to a binary ISA. This abstraction helps to accommodate the varying register files on different CUDA-capable GPUs.
The just-in-time (JIT) translation from PTX to a binary ISA is intended to be light-weight so as to minimize overhead, and it leaves little time for applying optimizing code transformations. Many important optimizations, especially those important to high performance on NVIDIA GPUs, are applied when CUDA C is compiled to PTX. Optimizations of particular relevance for this dissertation include strength reduction, constant folding and propagation, loop unrolling, and register blocking.
An important property of these optimizations is that they require fixed values at compile time. For example, the compiler must know when scalars are powers of two to strength reduce division or modulus (two relatively expensive operations on NVIDIA GPUs) to bit-wise operations. Likewise, loop counts must be fixed for the compiler to unroll loops or implement register blocking.
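For instance (an illustrative fragment, not taken from the dissertation), when WIDTH below is a power of two known at compile time the division and modulus can be strength reduced to a shift and a mask, and the fixed trip count TAPS lets the compiler unroll the loop completely:

    #define WIDTH 128   // power of two fixed at compile time
    #define TAPS  8     // fixed loop count, enabling full unrolling

    __global__ void rowSum(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int row = i / WIDTH;   // strength reduced to i >> 7 when WIDTH == 128
        int col = i % WIDTH;   // strength reduced to i & 127

        float acc = 0.0f;
        #pragma unroll
        for (int t = 0; t < TAPS; ++t)                    // fully unrolled: TAPS is a constant
            acc += in[row * WIDTH + ((col + t) % WIDTH)];
        out[i] = acc;
    }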
While program loops are fully supported by GPGPU languages like CUDA C, they may incur significant overhead [49]. Unrolling loops is a key CUDA performance optimization. Rolled loops need to include several instructions for loop setup, iteration, termination condition checking, and branching, all of which introduce overhead and reduce the ability of the GPU to take advantage of ILP.

Register blocking can be complicated by the fact that existing NVIDIA GPU architectures cannot indirectly address registers. Fixed loop counts are required for the CUDA C compiler to specify the use of extra registers for data and assign them unique virtual registers in the PTX representation of a program.
A similar issue occurs with constant memory space declarations. These must have a fixed size at compile time. The constant memory is reserved when a CUDA module (the CUDA translation unit) is loaded, and the total amount of constant memory that can be allocated across all loaded kernels is limited to 64 KB.
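A typical declaration looks like the following sketch (illustrative; the names and size are hypothetical): the array length must be a compile-time constant, so an adaptable kernel must either over-allocate up to some ceiling or be recompiled when the size changes.

    #define FILTER_SIZE 32  // must be fixed at compile time

    // Reserved when the module is loaded; counts against the 64 KB constant memory limit.
    __constant__ float d_filter[FILTER_SIZE];

    // Host side: coefficients are copied in before any kernel launch, e.g.
    //   cudaMemcpyToSymbol(d_filter, h_filter, FILTER_SIZE * sizeof(float));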
A fundamental issue with fixing values at compile time is limiting the ability of a kernel to adapt to new problems without recompilation. It is possible to leave CUDA C kernels without fixed values, forgoing the aforementioned optimizations, but this may incur additional performance penalties. Registers may have to be dedicated to the storage of intermediate values computed from one or more adjustable parameters. Independent parameters, either inputs or intrinsic values like thread indexes, have to be loaded from shared memory or special registers into regular registers before they can be used. Dynamic code without fixed values at compile time also often requires extra run-time guards against illegal values and memory accesses.
Adaptability can also interact with preferences for hardware-friendly parameter values. Common fundamental parallel algorithmic components are simplest in terms of control flow at powers of two. Reductions, for example, are guaranteed to have an even number of elements at each tree level if the initial size is a power of two. Otherwise, extra logic to handle tree levels with odd element counts must be incorporated. All told, adaptability can contribute a significant number of non-compute instructions to a GPU kernel.
Since CUDA C supports many features of C++ templates, they can be used to help circumvent these restrictions while offering some level of adaptability. Templates with template arguments for parameter values that control optimizations can be explicitly instantiated multiple times for different fixed parameter values. A table lookup of function or method pointers can be used to select the optimal specialization for a given problem. This technique is useful, especially for handling multiple data types, but has drawbacks. First, the adaptability is limited to the pre-compiled range, and second, it can significantly increase compiled binary size. While the impact of applying this technique to a single kernel may be limited, the increase in code size across a large software project may not be acceptable.
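A compressed sketch of that pattern follows (names and the parameter range are hypothetical, not from the dissertation): one template argument fixes the unroll factor, a small set of specializations is instantiated ahead of time, and a function pointer table selects among them at run time.

    template <int UNROLL>
    __global__ void accumulate(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        #pragma unroll
        for (int u = 0; u < UNROLL; ++u)             // UNROLL is fixed per instantiation
            if (i * UNROLL + u < n)
                acc += in[i * UNROLL + u];
        if (i * UNROLL < n)
            out[i] = acc;
    }

    // Every pre-compiled specialization ends up in the program binary.
    typedef void (*kernel_t)(const float *, float *, int);
    static const kernel_t variants[4] = {
        accumulate<1>, accumulate<2>, accumulate<4>, accumulate<8>
    };

    void launchVariant(int idx, dim3 grid, dim3 block,
                       const float *in, float *out, int n) {
        variants[idx]<<<grid, block>>>(in, out, n);  // run-time selection from the fixed range
    }

The adaptability is limited to the four unroll factors compiled above, and each additional data type or parameter value multiplies the number of variants carried in the binary.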
The above issues can make CUDA development difficult, but the same types of issues apply to hardware, adding another layer of complexity. As mentioned, NVIDIA has frequently updated its GPU architecture. Newer devices have increased the number of block-level resources, which may affect the partitioning of a large problem among thread blocks. New versions of CUDA have also introduced new features. Examples of these include atomic operations, double-precision floating-point support, and more advanced block-level synchronization primitives. The relative latencies of several operations have also changed over time. For example, the relative throughput of the * operator and the __[u]mul24() intrinsic for integer multiplies was inverted between NVIDIA GPUs of compute capability 1.3 and 2.0. Another example is that the throughput of shared memory relative to the register file decreased between GPUs of CUDA capability 1.3 and 2.0, putting additional emphasis on effective use of the register file in newer GPUs. These factors can result in different optimal implementation approaches between architecture generations, making performance portability difficult.
2.5 Implementation Parameters and Parameterization
The thread block and grid structured parallelism provided by CUDA, block-level resource limitations, the need to utilize important optimizations, and variations among CUDA-capable GPUs all combine to create a complicated development environment that requires developers to balance many non-obvious trade-offs. One mechanism for balancing some of these trade-offs is implementation parameters.

Implementation parameters are generally scalar values that control how a CUDA kernel performs its computation. These are usually distinct from problem parameters and can be independently adjusted. The shape and number of threads in a thread block, for example, is a built-in CUDA implementation parameter that every kernel must select. At the most basic level, the number of threads in a block trades off the availability of block-level resources with the number of total thread blocks. A second CUDA built-in parameter is warp size. This, however, has not been a major issue as NVIDIA has so far kept the warp size at thirty-two threads.
In addition to the required CUDA implementation parameters, many kernel implementations add a number of other parameters. An example of this occurs with tiling, a common kernel implementation technique that breaks a large parallel computation up into smaller chunks that are distributed among thread blocks. The size of the tiles can control how much shared memory is required per block, potentially accommodating increased shared memory sizes in newer GPUs. Increasing tile sizes often assigns a larger amount of work to each block, which can be used to scale the block-level work load to newer GPUs with more threads and resources available in each thread block.

Another example involves adjusting the number of registers used for register blocking, which can be used to adjust the amount of work assigned to each thread. As discussed above, register blocking can improve ILP but comes at the cost of increased register usage.
In addition to performance, changing implementation parameters can also impact the complexity of CUDA kernel code. Shared memory, for example, can be allocated either statically at compile time or dynamically at kernel launch time. Dynamically allocated shared memory, however, is more complicated and error prone to use.
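The contrast looks roughly like the following sketch (illustrative; the tile sizes are hypothetical): the static form needs the size at compile time, while the dynamic form defers it to the launch configuration at the cost of an extern declaration and a third launch parameter.

    #define TILE 128  // required at compile time for the static form

    // Static allocation: the size is part of the kernel's compiled resource usage.
    __global__ void reverseStatic(const float *in, float *out) {
        __shared__ float tile[TILE];
        tile[threadIdx.x] = in[blockIdx.x * TILE + threadIdx.x];
        __syncthreads();
        out[blockIdx.x * TILE + threadIdx.x] = tile[TILE - 1 - threadIdx.x];
    }

    // Dynamic allocation: the size comes from the third launch argument.
    __global__ void reverseDynamic(const float *in, float *out, int tileSize) {
        extern __shared__ float tile[];
        tile[threadIdx.x] = in[blockIdx.x * tileSize + threadIdx.x];
        __syncthreads();
        out[blockIdx.x * tileSize + threadIdx.x] = tile[tileSize - 1 - threadIdx.x];
    }

    // Launches (hypothetical sizes):
    //   reverseStatic<<<grid, TILE>>>(in, out);
    //   reverseDynamic<<<grid, tileSize, tileSize * sizeof(float)>>>(in, out, tileSize);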
It is common practice, and required for certain parameters such as register blocking levels, to fix implementation parameters for all launches of the kernel. This allows the compiler to apply optimizations, but a single value may not be optimal across a range of problem parameters. CUDA kernel performance is notoriously sensitive to small variations in both problem and implementation parameters. Several research projects have focused on investigating techniques for the optimal selection of implementation parameters across problem parameter ranges. Some of these efforts are described in Chapter 3. There is, however, a fundamental tension between adjusting implementation parameters for different problem and hardware parameters and the relative importance of compile-time optimizations for GPGPU performance. This issue is the focus of the work described in this dissertation.
2.6 OpenCV Example
Taken together, the issues discussed in this chapter produce a complex environment that negatively affects GPGPU kernel adaptability and code reuse. Ultimately, this reduces the application and adoption of GPGPU. As a real world example of these issues, Appendix E contains a listing from the open-source OpenCV computer vision library's CUDA module that implements row filtering, a basic image processing algorithm [42, 41]. The kernel is indicative of the effort required to achieve maximum performance with CUDA while maintaining adaptability.

A number of the kernel development issues discussed in the last section appear in the filter kernel. Lines 67 through 77 encode fixed block and grid dimensions and other parameters directly in the kernel, and these are used to control loop unrolling counts. The preprocessor conditional selects values based on compute capability. The macro definition on line 71 declares constant memory for storing the filter that will be applied to the input. The constant size creates an arbitrary ceiling on the size of filters that can be applied. This may make sense for the application (it is unlikely that a user will have a filter larger than thirty-two pixels), but it is also required to meet the CUDA restriction that the constant memory size is known at compile time.
Many of the compile-time optimizations discussed above are present in the OpenCV kernel. The kernel utilizes explicit instantiation of kernel variants for every supported parameter combination, contained in the array of function pointers starting on line 164. The declaration explicitly instantiates template specializations for filter sizes from one to thirty-two pixels, which are necessary for loop unrolling, and for each addressing mode.

The explicit specializations starting on line 348 are used to include multiple versions of the filter code based on the needed data types. Instead of relying on a lookup table, C++ overloading is used. Versions of the lookup table are compiled for each data type pair. All told, 800 variants of the kernel are generated and compiled into the kernel binary. It should be noted that for each kernel version a dedicated CPU-invoking function (linearRowFilter_caller()) is also compiled, increasing the penalty for including multiple kernel variants in a binary.
Chapter 3
Related Work
This chapter discusses a number of other projects and research efforts that are related to the work discussed in this dissertation. The related work can be roughly categorized as: run-time code generation, autotuning, and domain-specific tools. Many of the domain-specific tools also include an autotuning component.

In addition to these main categories, there have also been examples in the literature of kernel customization for a very specific application. Stone et al. [52] have demonstrated the potential for improving the performance of GPU kernels through the use of custom compilation for a specific problem instance, but did not integrate runtime compilation of GPU kernels into their molecular orbital visualization software. The group manually generated variants of their kernels for specific problem instances so that loops are unrolled. They report a 40% performance improvement for the problem-specific kernels.
Linford et al. [31] used CUDA for a GPGPU implementation of large scale atmospheric chemical kinetics simulations. The basis of their GPU implementation is a problem instance specific C kernel generated by the Kinetic PreProcessor [15] software package. While the serial C code is problem specific and generated automatically, the conversion to CUDA C was a manual process.

Stivala et al. [51] created a CUDA implementation of a bioinformatics protein structure database search problem. Before performing a search, each thread block loads information about the query into shared memory. However, for some searches, the query information is too large for shared memory. To handle this, they compile a second version of the kernel that only uses global memory. While the kernel that only uses global memory is much slower, the number of queries in their data sets that exceed shared memory capacity is small, so it is still faster to run the inefficient kernel on the GPU instead of transferring the data to the CPU for handling the outliers and then back to the GPU.
Also addressing hardware differences, Levine et al. [30] provide different implementations of radial distribution function histogramming kernels for different generations of NVIDIA GPUs and determine optimal implementation parameters.
3.1 Run Time Code Generation
In this dissertation, kernels that are customized for a specific problem and target GPU at run time are explored. Another approach for executing customized GPU code is generating it on the fly at run time.

The NVIDIA OptiX ray tracing engine [44] uses run-time manipulation and compilation of PTX kernels, but requires them to be compiled offline using the CUDA nvcc compiler. While operating only on PTX, the OptiX PTX-to-PTX compilation provides a number of domain specific optimizations at runtime by analyzing both the PTX source and the current problem. These include memory selection, inlining object pointers, register reduction, and efforts to reduce the impact of divergent code.
Garg et al. [21] have created a framework that accepts parallelism and type annotated Python code and automatically generates GPU code targeting the AMD Compute Abstraction Layer (CAL) [1], a low-level intermediate language and API for AMD GPUs. An ahead-of-time compilation step converts Python functions to an intermediate state, and a runtime just-in-time (JIT) compiler generates both GPU kernel and host control code using program information that only becomes known at runtime. The JIT compiler can perform loop unrolling and fusion and optimizes some memory access patterns.

Dotzler et al. [18] present a framework (JCudaMP) that takes Java code with OpenMP parallelization pragmas and automatically generates code to run on NVIDIA CUDA-capable GPUs. A custom Java class loader converts a subset of Java/OpenMP parallel code sections with additional JCudaMP annotations and dynamically generates C for CUDA source. JCudaMP relies on the CUDA nvcc compiler and invokes it at runtime to compile and link C for CUDA to a shared object that is loaded by the Java virtual machine. The generated CUDA code will check for and indicate addressing violations for exception handling and uses Java's array-of-arrays multi-dimensional array organization. While tiling of problems larger than the target GPU's global memory is supported, important aspects of CUDA performance, like shared memory, are not. Overhead information is provided for CUDA compilation but not for CUDA code generation.
Also starting with Java, Leung et al. [29] have modified the Jikes RVM Java virtual machine to connect to the RapidMind framework for the purpose of running Java code on NVIDIA GPUs. After parallelization, their modifications generate RapidMind IR to feed directly into the RapidMind back-end, which generates target specific code. The RapidMind template libraries and front-end stage are not used. Loops are automatically identified and parallelized, and a cost-benefit analysis is used to decide which loops should be executed on the GPU.

Mainland and Morrisett [33] create an abstract domain-specific language, Nikola, and embed it inside of Haskell. While significantly raising the abstraction (Nikola is a first-order language represented in Haskell using higher-order abstract syntax) and allowing the use of NVIDIA GPUs from within a pure Haskell environment, the approach seems limited in its ability to effectively utilize many CUDA resources and target all algorithms. Nikola automatically generates and compiles CUDA kernel code and manages data movement for the user. It appears that CUDA functions are compiled either once at program compile time or at runtime at each invocation.
Ocelot, created by Diamos et al. [17], dynamically translates NVIDIA PTX code to Low Level Virtual Machine (LLVM) IR. From there, a just-in-time compiler can generate code for either a multi-core CPU system or NVIDIA GPUs at runtime. The focus of the project is on restructuring CUDA-oriented threads and memory use to match multi-core CPU platforms.

Bergen et al. [7] discuss their experiences using OpenCL in an HPC environment, creating a compressible gas dynamics simulator that runs on GPUs from multiple vendors. They cover several middle-level abstractions they developed to increase programmer productivity. Of particular interest, the authors use C-based macros to compile in parameter constants to reduce memory usage and specialize kernels. Additionally, one of two variations of the kernel is selected at run time based on the hardware target. One variation consists of separate stages that increase the ratio of memory accesses to computation but use fewer registers. The second variation fuses the two stages but uses more register resources.
PyCUDA [27] explicitly attempts to address many of the same issues discussed in this dissertation, including parameterization and GPGPU kernel flexibility. PyCUDA includes a number of Python object oriented abstractions around CUDA API entities and provides many of the same features as the GPU Prototyping Framework, discussed in the next chapter, such as GPU memory abstractions, precision timing, and caching of compiled GPU binaries. PyCUDA also provides higher-level abstractions such as GPU-based numerical arrays and supports a number of element-wise and reduction operations and fast Fourier transforms. In addition to the built-in operations, PyCUDA provides element-wise and reduction code generators for user-provided functions. By combining PyCUDA with other Python-based tools for modifying code templates, it is possible to implement parameter space exploration functionality. PyCUDA forms the basis for some even higher-level domain-specific tools discussed in Section 3.3.

The MATLAB Parallel Computing Toolbox [34] provides a set of CUDA-accelerated versions of many standard MATLAB functions. Functionality is largely hidden from the user behind custom MATLAB array types that allow users to program at the same high abstraction level as CPU-based MATLAB code. Interfacing custom CUDA C kernels into the built-in GPU functionality is supported, as is automated kernel generation from user-specified MATLAB functions. Instead of generating CUDA C and relying on the CUDA C compiler, the Parallel Computing Toolbox generates code that runs on the GPU.
3.2 Autotuning
Autotuning is becoming an increasingly popular approach to improving kernel performance. There are many examples in the literature regarding autotuning specific classes of problems, but the work highlighted here focuses on more general autotuning tools. It has been employed both to improve performance and to help adapt to different problems. Choi et al. [12] have worked on sparse matrix-vector multiplication on NVIDIA GPUs. Starting with two high-performance hand-optimized kernels, they use offline autotuning with a performance prediction model to select implementation parameters quickly at runtime. The performance model is derived from assumptions drawn from the characteristics of the sparse matrix-vector problem and incorporates a number of hardware, kernel, and problem instance parameters to select thread block counts and data tiling sizes. For the problems tested, their model selects the optimal configuration in 86.5 percent of test cases. When a non-optimal configuration is selected, the performance was only a few percent lower than that of the optimal configuration. Nath et al. [35] apply similar techniques, but for dense linear algebra.
While Baskaran et al. [2] perform loop nest transformations similar to other work, they also use a model-driven empirical search for optimizing GPU kernels. Their work focuses on memory hierarchy performance and determines optimal tile sizes and loop unroll values.

The G-ADAPT framework of Liu et al. [32] takes GPU kernels and empirically evaluates multiple kernel variation and configuration parameters through benchmarking. At runtime, G-ADAPT can estimate with high accuracy the best kernel and configuration for an incoming set of problem parameters, including the size of the inputs. Kernel configurations are generated automatically from instrumented GPU code.
Rudy et al. [48] have created the CUDA-CHiLL framework for generating optimized CUDA code. The framework can apply transformation recipes that convert C-language loop nests to CUDA C, accounting for memory access patterns, tiling, and assigning data to registers or shared memory. Combined with autotuning to search the transformation and parameter space, highly optimized kernels can be generated.

Using OpenMP code as a starting point, Lee and Eigenmann [28] document their OpenMP to CUDA source-to-source compiler called OpenMPC. Directives and environment variables are used to guide the CUDA code generation and cover a large subset of CUDA features, but translation between OpenMP and CUDA concepts is automatic, as is selecting portions of the application code to convert to GPU kernels. Parallel regions are selected for GPU acceleration based on the level of inter-thread data sharing and communication, and data transfers between the host and GPU are minimized. The authors provide tools for searching the configuration space, including tools to prune unnecessary variables from the search space and to generate and test variants. Results using the tool set on a few OpenMP benchmark applications and algorithm kernels show that they achieve, on average, 88 percent of the performance of hand-tuned kernels.
3.3 Domain Specific Tools
A number of domain-specific tools allow high performance GPGPU implementations to be generated, in a highly automated fashion, from high-level representations of algorithms. Many projects include aspects of run-time code generation and/or autotuning.

Tools from the FLAME project [25] can generate very efficient dense linear algebra code by applying a number of loop-based transformations. Designed for distributed memory systems, including GPUs, cost functions for computation and data movement are used to estimate the performance of each. Modeling the performance of the transformations enables a much faster search of the permutation space than empirical benchmarking.

Ravi et al. [45] have developed a system for easing the implementation of generalized reduction (MapReduce) applications on systems with CPUs and GPUs. Users provide serial kernels conforming to the MapReduce paradigm, and their compiler will automatically parallelize them and generate both kernel and host code. A runtime component dynamically adjusts the workload balance between the CPU and GPU.
A number of frameworks relying on embedded domain specific languages (DSLs) have been documented in the literature. The Delite Compiler Framework and Runtime [11, 8] embeds OptiML, a machine learning DSL, in Scala and targets CUDA, while the Chai [26] framework reimplements the PeakStream [43] language for targeting OpenCL. Both feature code generation and automatic GPU memory management. Liszt is a DSL in Scala by DeVito et al. [16] used to represent partial differential equations, which are solved with stencil-based computations. Their framework translates the DSL into an intermediate representation, and at run time it is optimized for the target problem and translated to CUDA C.

Grewe et al. [22] have presented a framework that generates and then autotunes CUDA or OpenCL kernels for sparse matrix-vector multiplication from a high-level input representation. Zhang and Mueller [58] have developed a framework for generating highly optimized 3D stencils from a concise user-provided representation consisting of a single equation and a number of general stencil parameters. After code generation, an autotuning facility searches for optimal kernel designs and configurations and supports targeting one or more GPUs.
Also working with stencils, Catanzaro et al. [10] use the JIT compilation features of Ruby and Python (based on PyCUDA) to transform concise embedded stencil representations into GPU code. The resulting performance is lower than that of hand-coded kernels, partially due to JIT compilation overhead, but some benefit is seen from specialized kernels with fixed loop bounds. Building on this work is the Copperhead [9] domain-specific language embedded in Python, which can be used to represent data parallel operations in the form of map, reduce, scan, gather, etc. These are then compiled into CUDA code at run time.
3.4 Summary

The literature surveyed here relates to this research in a number of different ways. While run-time code generation tools are able to specialize GPU kernels for specific problems once the parameters are known, they are limited by the range of problem domains that they understand. Likewise, by limiting a tool to a specific problem area, as many of the domain-specific tools do, the automatic generation of high performance GPU code becomes tractable. However, users are restricted to a language subset that may not be able to represent the problem at hand. The approach to kernel specialization presented here relies on standard CUDA C kernel input, which can be designed to target any problem domain. While a much more manual approach, it can be used to fill in gaps between the capabilities offered by more restricted tools.
General autotuning tools, while highly effective, can take a long time to search the parameter space. In addition, these tools may be limited by the variety of supported code transformations and may never radically restructure implementations the way a skilled human designer might. The research presented here is complementary to these types of advanced parameter space mapping and performance estimation tools. By using highly parameterized CUDA kernels that are specialized quickly at run time, autotuning tools can be used to characterize the performance of a given implementation so that effective parameters can be selected quickly and used to compile a specialized kernel.
Chapter 4
Kernel Specialization
Chapter 2 described the current state of CUDA development, with particular emphasis on the trade-offs between performance and adaptability. In this chapter, a technique for addressing these trade-offs, kernel specialization, is introduced.

Kernel specialization is a technique that enables greater GPGPU kernel adaptability while mitigating the associated performance overheads. Kernel specialization involves delaying the generation of GPU binaries until problem and hardware parameters are known. This enables a number of static-value optimizations that are important for achieving high performance on GPUs.

From a kernel development perspective, all that is required to use kernel specialization is to write a kernel in terms of undefined constants. These constants are provided values when the kernel is compiled. The CUDA kernel in Listing 4.1 is designed to highlight several of the most common types of optimizations that are applied at compilation time. In the example kernel, each thread determines its global offset in the thread space to use as a base offset to the in and out pointers. Then, using a stride of argA * argB, count values from in are accumulated in acc. The kernel assumes that the memory regions pointed to by in and out are large enough for any accesses generated by the other input values.
__global__ void mathTest(int *in, int *out, int argA, int argB,
                         int loopCount) {
    int acc = 0;

    const unsigned int stride = argA * argB;
    const unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < loopCount; i++) {
        acc += *(in + offset + i * stride);
    }

    *(out + offset) = acc;
    return;
}

Listing 4.1: A CUDA C GPU kernel designed to demonstrate the common ways kernel specialization can improve the performance of kernels. This kernel is a regular, fully run-time evaluated kernel and does not rely on specialization in any way.
__global__ void mathTest(int *in, int *out, int argA, int argB,
                         int loopCount) {
    int acc = 0;

    const unsigned int offset = blockIdx.x * BLOCK_DIM_X + threadIdx.x;

    for (int i = 0; i < LOOP_COUNT; i++) {
        acc += *((int *)PTR_IN + offset + i * ARG_A * ARG_B);
    }

    *((int *)PTR_OUT + offset) = acc;
    return;
}

Listing 4.2: A CUDA C GPU kernel designed to demonstrate kernel specialization. The constants specified by identifiers in all capital letters must be provided at compile time.
The kernel in Listing 4.1 is fully run-time evaluated (RE). That is, no parameters are assumed to have fixed values at compile time. As discussed in Chapter 2, this allows the kernel to adapt to any set of inputs but comes at the expense of lower performance.

The kernel can be modified for use with kernel specialization by replacing only a few identifiers with constants that will be specified at compile time, as shown in Listing 4.2. Unique names are required, assuming that the original kernel prototype is not modified, and as a convention, macro values that will be specified on the command line are in all capital letters. In the specialized kernel (SK), a fixed value for LOOP_COUNT allows the for() loop to be unrolled. Fixed values for ARG_A, ARG_B, and PTR_IN result in constant folding and propagation. Lastly, fixed values for BLOCK_DIM_X and PTR_OUT allow the use of immediate values in the generated PTX, removing the need to load special registers or kernel arguments from shared memory.
Expanding on the specialized kernel of Listing 4.2, the sample kernel code in Appendix B demonstrates how it is possible to create a single CUDA C kernel that can be used in both specialized and non-specialized situations. The kernel source listing implements the same functionality as the first two examples, but allows for individual control over which parameters are evaluated at run time or specialized when the kernel is compiled. It also demonstrates two basic routes for taking advantage of the parameter values specified on the compiler command line: preprocessing directives and constant-value macro expansion, as previously shown, or as arguments to C++ templates.

In the example, macro definition names that are prefixed with CT_ function as Boolean flags to control whether a parameter is evaluated at run time or specialized at compile time. The series of preprocessor directives before the kernel declaration define default values for macros when left undefined and allow the kernel to be compiled in a fully run-time evaluated (non-specialized) state when it is not compiled in a kernel specialization-aware setting.

Lines 51 through 57 in the kernel demonstrate using C++ templates to toggle the specialization state of three dependent parameters: count, stride, and offset. The typedefs on lines 51 through 53 use the CT_-prefixed macros to select between template specializations that either return the run-time evaluated argument or the value specified at compilation time. Here, the loop iteration count, thread block dimensions, and a purely dependent parameter are optionally specialized. The gpu and gpu::ctrt namespace utilities are defined in the gpuFunctions.hpp header. The gpu::num class converts macro values to types for use in additional math template classes, like gpu::ctrt::mult on line 52. (The ctrt namespace name is an acronym for compile time/run time.)
Lines 55 through 57 invoke static methods belonging to the template classes instantiated on lines 51 through 53. In the run-time evaluated, non-specialized parameter template specializations, the argument, which can refer to an input or a computed value, is returned. This produces a run-time dependency. For example, the run-time evaluated specialization of computedStride::op() is simply a * operator applied to the kernel inputs argA and argB.

The template specializations associated with specialized parameters generate static values at compile time that the compiler can optimize. Applying the __forceinline__ CUDA function attribute guarantees inlining, even with the method call syntax, and allows the same kernel code to access values regardless of specialization status. There is no need to access the static const struct members directly, as is typical with C++ template compile-time processing. For the same computedStride::op() example, the specialized version returns the multiplication of the two template arguments specified on line 52: ARG_A and ARG_B. Since, as template arguments, these are static, the compiler performs the multiplication during compilation. Because the result is a constant with a known value, the compiler can also propagate it when it is used later in the kernel.
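The following sketch illustrates this compile time/run time toggle idiom. It is a simplified reconstruction rather than the contents of gpuFunctions.hpp or Appendix B: the CT_STRIDE flag, the strideDemo kernel, and the exact template signatures are assumptions made for illustration.

// Simplified reconstruction of the ctrt idiom (illustrative only).
namespace gpu {

// Carry an integer macro value as a type so it can feed other templates.
template <int V> struct num { static const int value = V; };

namespace ctrt {

// Primary template: run-time evaluated variant; op() multiplies its arguments.
template <typename A, typename B, bool Specialized>
struct mult {
    static __device__ __forceinline__ int op(int a, int b) { return a * b; }
};

// Specialized variant: the product is a compile-time constant, so the
// compiler can fold and propagate it; the run-time arguments are ignored.
template <typename A, typename B>
struct mult<A, B, true> {
    static __device__ __forceinline__ int op(int, int) {
        return A::value * B::value;
    }
};

} // namespace ctrt
} // namespace gpu

// Defaults allow compilation without any command-line definitions.
#ifndef CT_STRIDE
#define CT_STRIDE false   // false: evaluate the stride at run time
#endif
#ifndef ARG_A
#define ARG_A 1
#endif
#ifndef ARG_B
#define ARG_B 1
#endif

__global__ void strideDemo(const int *in, int *out, int argA, int argB) {
    // The call site is identical whether or not the stride is specialized.
    typedef gpu::ctrt::mult<gpu::num<ARG_A>, gpu::num<ARG_B>,
                            CT_STRIDE> computedStride;
    const unsigned int stride = computedStride::op(argA, argB);
    const unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;
    out[offset] = in[offset * stride];
}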
Using preprocessor directives to toggle specialization status is shown by the two if/else conditionals starting on lines 60 and 67. The conditionals control inclusion of either non-specialized code dependent on run-time evaluated kernel arguments or specialized code based on fixed values. In the latter case, pointer values are expanded directly into the code. (Statically compiled pointer values are provided as unsigned long hexadecimal values, so casting is needed; single- and double-precision floating-point values can be specified on the command line in a similar manner. See Section 4.4.)
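A minimal sketch of the preprocessor route is shown below. The CT_PTR_IN and PTR_IN names and the ptrDemo kernel are illustrative stand-ins for the identifiers used in Appendix B; when CT_PTR_IN is nonzero, PTR_IN must also be defined on the nvcc command line.

#ifndef CT_PTR_IN
#define CT_PTR_IN 0          // default: run-time evaluated input pointer
#endif

__global__ void ptrDemo(const int *in, int *out) {
#if CT_PTR_IN
    // Specialized: the pointer value was supplied at compile time as an
    // unsigned long hexadecimal literal, so a cast is required.
    const int *inPtr = (const int *)PTR_IN;
#else
    // Run-time evaluated: fall back to the kernel argument.
    const int *inPtr = in;
#endif
    const unsigned int offset = blockIdx.x * blockDim.x + threadIdx.x;
    out[offset] = inPtr[offset];
}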
To demonstrate the significant differences produced by kernel specialization, Appendices C and D contain PTX generated from the single CUDA C kernel in Appendix B. For the listing in Appendix C, none of the parameter values were specialized, making the PTX fully run-time evaluated. In contrast, Appendix D contains PTX for the case where every parameter was specialized and fixed at compile time.
In this case, a loop iteration count of five and a one-dimensional block of 128 threads were used. The argA and argB inputs were fixed at 3 and 7, respectively. The input pointer was set to 0x200ca0200, and the output pointer to 0x200b80000.

Comparing the two PTX samples, a number of observations are immediately apparent. First, the specialized PTX in Appendix D has no control flow: the for() loop is completely unrolled. The run-time adaptable kernel, in contrast, includes several instructions for loop setup, iteration, termination condition checking, and branching. As discussed in Chapter 2, loop unrolling is important on NVIDIA GPUs.

Other optimizations are also visible in the PTX samples. These include strength reduction and constant folding and propagation. While many computed values depend on thread index values that are unique to each thread and must be generated at run time, such as offsets from a base memory pointer, the computations often involve common subexpressions that depend only on parameters that can be determined at compile time, such as thread block, grid, or data dimensions.

In the specialized PTX, base plus offset addressing is used and fully unrolled. The base input pointer (0x200ca0200, or 8,603,173,376 in decimal) plus a stride of 84 bytes (argA * argB * 4) is propagated through the unrolled loads, but the base register still depends on a run-time value, the thread index. A strength-reduced multiply by 128 (shift left by 7) also appears.
In this case, the specialization is complete: the specialized PTX kernel contains no references to the input arguments. The kernel arguments, however, are kept to preserve the interchangeability of the kernel in case various input parameters are toggled between run-time evaluated and statically compiled variants.
4.1 Benefits

Instead of locking a kernel into one of two regimes, higher performance without adaptability or adaptability with lower performance, kernel specialization allows for both performance and adaptability. The static-value requirements for many important optimizations are effectively removed.

Beyond generating efficient code for a given problem, kernel specialization can also improve the performance portability of a single GPU kernel implementation between different GPUs. Implementation parameters that are adjustable independently from problem parameters can be used to adjust the characteristics of a kernel for a particular device. Of particular importance are implementation parameters that impact the amount of work assigned per block and per thread, discussed in Chapter 2, such as adjusting tile sizes and the number of registers used for register blocking.

Second order performance effects resulting from the interplay between problem and hardware characteristics are possible and can also be addressed with kernel specialization. For example, problem parameters at low extremes can reduce the amount of work and/or inherent parallelism available at the block level to the point where ILP alone is not sufficient to maximize performance. Reducing the amount of work assigned to each thread can increase thread counts, potentially improving performance.
A less measurable benefit of kernel specialization at run time is better code reuse and maintainability. While greater adaptability may increase initial development time due to the need to account for a greater number of corner cases and parameter interactions, once completed, a single GPU kernel can often be specialized for and applied to a wide range of problems. Assumed block and grid dimensions and parameters related to differing compute capabilities often appear both in the kernel code (usually as macro definition constants or hard-coded values) and in host code. With kernel specialization, these assumptions can be removed from kernel code, as the values are provided to the kernel when it is compiled. Similar to loop unrolling and register blocking, kernel specialization can convert fixed-size constant memory declarations into ones that are effectively dynamically sized. Kernel specialization also allows developers to use the simpler static shared memory allocation syntax but have it behave like dynamically allocated shared memory, as sketched below.
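The following sketch illustrates these last two points. The TILE_DIM and COEFF_COUNT names and the tileScale kernel are assumptions made for illustration; both macros would be supplied on the compiler command line, and the kernel assumes TILE_DIM x TILE_DIM thread blocks.

#ifndef TILE_DIM
#define TILE_DIM 16          // tile edge length, chosen per problem and device
#endif
#ifndef COEFF_COUNT
#define COEFF_COUNT 32       // number of filter coefficients
#endif

// Constant memory whose size is set when the kernel is specialized.
__constant__ float coeffs[COEFF_COUNT];

__global__ void tileScale(const float *in, float *out, int width) {
    // Statically declared shared memory that is effectively dynamically
    // sized, since TILE_DIM is fixed only at specialization time.
    __shared__ float tile[TILE_DIM][TILE_DIM];

    const int x = blockIdx.x * TILE_DIM + threadIdx.x;
    const int y = blockIdx.y * TILE_DIM + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    out[y * width + x] = tile[threadIdx.y][threadIdx.x] * coeffs[0];
}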
Kernel specialization can be very powerful when combined with C++ templates and compile-time techniques. Kernels can be further customized beyond purely numerical parameter values. Template specializations can be used to select among a number of variants based on specific problem parameters or hardware characteristics. An example related to the former would be selecting a different data type or per-pixel comparison metric (e.g., sum of absolute differences versus sum of squared differences) for a sliding window matching operation; a sketch of such a selection follows this paragraph. For the latter, kernel specialization could be used to select between the * operator and the __[u]mul24() intrinsic for integer multiplies. With kernel specialization, the kernels for the particular problem and hardware instance are generated as needed instead of compiling all supported variants ahead of time. This can significantly reduce program binary size.
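In the sketch below, which is illustrative and not taken from the dissertation's applications, a template parameter fixed at specialization time (via the assumed METRIC_SSD macro) selects the per-pixel cost, so only the requested variant is compiled into the binary.

enum Metric { SAD, SSD };

// Primary template is declared only; the two specializations below are used.
template <Metric M>
__device__ __forceinline__ float pixelCost(float a, float b);

template <>
__device__ __forceinline__ float pixelCost<SAD>(float a, float b) {
    return fabsf(a - b);            // sum of absolute differences term
}

template <>
__device__ __forceinline__ float pixelCost<SSD>(float a, float b) {
    float d = a - b;                // sum of squared differences term
    return d * d;
}

#ifndef METRIC_SSD
#define METRIC_SSD 0                // 0: SAD, nonzero: SSD
#endif

// One cost per window position along a row; bounds checks omitted.
__global__ void windowCost(const float *img, const float *tmpl, float *cost,
                           int imgWidth, int tmplW, int tmplH) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int r = 0; r < tmplH; r++) {
        for (int c = 0; c < tmplW; c++) {
            acc += pixelCost<(METRIC_SSD ? SSD : SAD)>(
                img[r * imgWidth + x + c], tmpl[r * tmplW + c]);
        }
    }
    cost[x] = acc;
}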
Put together, a single implementation can generate optimized GPU binaries for hardware-friendly values while still performing well for other values. Kernel implementations may include many additional implementation parameters for greater problem and hardware adaptability without many of the impacts usually associated with more flexible GPU code.
4.2 OpenCV Example

As a concrete example of several of the benefits discussed in this section, Appendix F includes a lightly specialized version of the same OpenCV GPU kernel shown in Appendix E. It represents a real-world study of the possible benefits provided by kernel specialization at run time. It also provides an example of incrementally adding kernel specialization to existing CUDA C code. An implementation designed with kernel specialization in mind may have produced a significantly different design.

The modified source is split over two compilation units, Listings F.1 and F.2. The first contains the GPU kernel and the host code that launches the kernel. This file would be compiled with specialized values at run time. The second grouping contains the host function that OpenCV applications invoke from the host.

The kernel code in the specialized portion is similar to the original version, since many specific template specializations were present in the original. The anchor argument was converted from run-time evaluation to compile time. Similarly, the template arguments to the linearRowFilter_caller() function were replaced with tokens that are substituted directly with the type names. Another difference is the elimination of the preprocessor conditional that controls certain kernel launch parameters based on the current GPU's compute capability. This logic is now isolated in the host code.
The second compilation unit contains the host code that is invoked by OpenCV. The specialize() function accepts a source file name, a target function name, and then key-value pairs in the form of strings. This specialize() function generates a specialized version of the CUDA C source and returns a function handle to a customized version of the linearRowFilter_caller() function. The type specializations for the linearRowFilter_gpu() host code are kept since they are used for host-side C++ linking with the rest of the OpenCV library.

In this example, the specialize() function is designed to work with the compilation model of the run-time CUDA API, where a host function is usually compiled along with one or more kernels. The actual mechanism used in this research is described in Section 4.4.
Notably absent from the example are the explicit template specializations used to pre-compile many kernel variants. These are now generated on the fly for any combination of types, parameter values, and target compute capabilities.
4.3 Trade-Offs

While compiling kernels at run time offers a number of benefits, the enabling mechanism is itself the downside: compiling kernels incurs overhead. The delay incurred by compiling kernels can vary significantly, depending on the platform and the complexity of the kernel.

Despite this, the overhead is not a factor in many situations. For example, with many long-running streaming applications, processing throughput is more important than the length of the initialization phase. The framework used in this work and described in the next section also caches generated binaries. If the same set of parameters is encountered again, the previously generated kernel can be loaded quickly (with speed similar to loading a dynamically linked shared object). Additionally, parameters that change often between kernel invocations can be left as run-time evaluated. Kernel specialization is more relevant for parameters that control algorithmic behavior and control flow than for those that can be considered data, such as scaling factors; the sketch below illustrates this distinction.
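In the following sketch (the kernel and the FILTER_LEN macro are illustrative), the filter length shapes control flow and is therefore specialized, while the scale factor is plain data and remains a run-time argument that can change on every launch without recompilation.

#ifndef FILTER_LEN
#define FILTER_LEN 9          // fixed at compile time; lets the loop unroll
#endif

__global__ void scaledFilter(const float *in, float *out, float scale) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < FILTER_LEN; k++) {
        acc += in[i + k];     // bounds handling omitted for brevity
    }
    out[i] = acc * scale;     // scale stays run-time evaluated
}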
4.4 Implementation and Tools

Kernel specialization is a general technique, but a specific approach is used in this dissertation: the CUDA C compiler driver, nvcc, is invoked at run time to convert CUDA C language source to GPU binaries. This approach leverages existing compiler tools to generate highly optimized binaries for hardware-friendly values while preserving adaptability to arbitrary values. Working with high-level CUDA C kernels also has the benefit of improving the kernel development experience.

This use of NVIDIA-provided tools is also currently required, as the CUDA API supports compiling kernels at run time from PTX input, but not from CUDA C. The undefined constants in kernel code are given values at compile time via nvcc's macro definition flag, -D. As discussed, many important optimizations are applied during the translation from CUDA C to PTX. Relying on the NVIDIA-provided front end for this stage leverages vendor-implemented code generation and optimization techniques, minimizes effort spent reimplementing existing technologies, and keeps application kernel development in the CUDA C language.
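The host-side sketch below shows the general shape of this mechanism, not the GPU-PF implementation itself. The file and kernel names are illustrative, the -D values mirror the earlier mathTest example (pointer-valued macros, which would come from prior device allocations, are omitted), nvcc is assumed to be on the PATH, the kernel is assumed to be declared extern "C" so it can be retrieved by its unmangled name, and error checking of the driver API calls is omitted for brevity.

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Specialize: fix problem and hardware parameters on the nvcc command
       line via -D, producing a GPU binary for this problem instance. */
    const char *cmd =
        "nvcc -arch=sm_20 -cubin -o mathTest.cubin mathTest.cu "
        "-D LOOP_COUNT=5 -D BLOCK_DIM_X=128 -D ARG_A=3 -D ARG_B=7";
    if (system(cmd) != 0) {
        fprintf(stderr, "run-time nvcc invocation failed\n");
        return 1;
    }

    /* Load the freshly built module through the CUDA driver API. */
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kernel;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "mathTest.cubin");
    cuModuleGetFunction(&kernel, mod, "mathTest");

    /* ... allocate device memory with cuMemAlloc and launch with
       cuLaunchKernel, passing only the run-time evaluated arguments ... */

    cuCtxDestroy(ctx);
    return 0;
}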
For OpenCL, which includes an API call for C-source-level kernel compilation, or a future CUDA API that includes high-level source compilation, an alternative would be a source-to-source translation where identifiers are replaced with the constant values before compilation. Instead of providing the constant values on the command line, the source itself would be directly customized. Another option would be generating CUDA C or PTX code directly, as some other existing frameworks do, as discussed in Chapter 3. Versions 4.1 and above of the CUDA toolchain have transitioned the CUDA compiler to LLVM, opening up the possibility of more complex code generation for CUDA C as well as other high-level languages [39].
To automate the process of kernel compilation within the CUDA environment and assist with application prototyping, testing, and parameterization, a pair of tools was developed. The first is a C-language framework, referred to as the GPU Prototyping Framework, that automates kernel specialization and was used to build the applications considered here. The second is a MATLAB-based tool for verification against reference code, parameterization, and performance analysis.
4.4.1 GPU-PF

While the macro definition mechanism for kernel specialization is relatively straightforward, as is most CUDA driver-level API code, it is often verbose and error prone. In response, the GPU Prototyping Framework (GPU-PF) was developed with many host-code abstractions, including kernel specialization automation. The framework is designed for rapidly constructing applications with streaming processing pipelines. It provides a problem- and implementation-parameter-focused set of objects where resources and actions are defined in terms of parameters. Once an application has been specified, only the values of the various parameters need to be adjusted. The framework handles propagating the effects of parameter changes through the application. The currently supported parameter types are listed in Table 4.1.
  Memory Extent - Geometry (up to three dimensions) and element size of a memory reference.
  Memory Subset - Subrange of a memory extent with an associated stride between updates.
  Schedule - Period between events and delay before the first occurrence.
  Integer - Scalar integer parameter.
  Float - Scalar floating point parameter.
  Array Traits - Various properties used for CUDA texture and array memory types.
  Pointer - A pointer value.
  Triplet - Three integer values. General, but commonly used for grid and block dimensions. Individual elements can be referenced.
  Pair - Two integer values. Individual elements can be referenced.
  Type - Data type (int32, uint8, float4, double, etc.).
  Boolean - True or false boolean parameter.
  Step - Self-updating parameter that iterates through a specified range with an associated stride.
Table 4.1: Various parameter types provided by GPU-PF.

Parameters are used to control the properties and behavior of both resources and actions. Resources include data locations (memory or files) and module-based resources like CUDA kernels. Table 4.2 lists the currently supported resource types. The single memory reference type may refer to any memory type except for textures, due to the decoupling of the actual memory allocation in host code, as global or CUDA array memory, from the texture reference inside a CUDA module. The supported memory types are detailed in Table 4.3. They include a memory subview that can be used any place a full allocation reference can be used. The subviews can update themselves on each pipeline iteration to refer to a different subset of the full memory allocation.
Table 4.4 lists the supported actions. The number of actual GPU-PF functions representing actions is quite low, as the single memory reference resource type can represent any type of host or GPU memory. The GPU-PF framework automatically determines the correct actions based on the underlying memory type.
  Module - CUDA module. Dependencies (optional): boolean, integer, pointer, pitch, and floating point parameters.
  Kernel - CUDA kernel. Dependencies: module.
  Memory reference - A generic reference to any type of memory except texture. Dependencies: various; see Table 4.3.
  Texture - CUDA texture reference. Dependencies: module, memory reference, and array traits (not required for binding to CUDA arrays).
Table 4.2: Various resource types provided by GPU-PF.

  Constant - Constant memory from a module. Dependencies: module resource and data type parameter.
  Array - CUDA array. Dependencies: extent and array traits parameters.
  Global - Pitched or linear global memory. Dependencies: extent parameter.
  Host - Malloced, CUDA pinned, memory mapped, or user-provided host memory. Dependencies: extent parameter.
  Subset - Generic reference to a subset of any memory type; can move the subset through the full memory reference over time and can be used any place a regular reference can be used. Dependencies: memory reference, memory subset, schedule, and integer (reset period) parameters.
Table 4.3: Various memory types provided by GPU-PF.

  Memory copy - A single function transfers data properly according to the underlying memory types at each end point. Dependencies: two memory references, schedule.
  Kernel execution - Kernel launch arguments and configuration. Dependencies (required): two triplet (grid and block shape), integer (dynamic shared memory), and schedule parameters; kernel resource. Dependencies (optional): integer, pointer, pitch, and floating point parameters as arguments; texture reference.
  User function - Arbitrary user function. Dependencies: schedule parameter.
  File I/O - Binary data input or output. Dependencies: memory reference, schedule.
Table 4.4: Various actions provided by GPU-PF.

  [Figure 4.1 is a block diagram; its nodes are labeled Dimensions, Integer, DataType, Host Memory, Module, Kernel, Global Memory, Transfer, Execution, and Transfer.]
Figure 4.1: Different template tile regions

The three classes of concepts (parameter, resource, and action) form a natural hierarchy of dependencies. A resource may depend on one or more parameters, and an action may depend on one or more parameters and/or resources. Figure 4.1 depicts an example of the dependencies that form between the three concept types. The GPU-resident global memory object depends on two parameter objects that specify the memory's data type and dimensions. These parameters determine the size of the memory allocation. The kernel execution, which takes the global memory allocation as a source or destination for data, depends on the memory object to provide a pointer value. While not directly dependent, the kernel execution can be affected by the parameters associated with the memory allocation.
The GPU-PF library breaks a program's lifetime into three phases: specification, refresh, and execution. First, a pipeline is constructed (specified) by instantiating parameters, resources like memory allocations and kernel modules, and actions like data transfers and kernel executions. At this point, nothing is allocated and no actions are performed, other than error checking.
The refresh stage occurs before the first pipeline execution and, whenever parameters are updated, before the next pipeline execution. A separate refresh phase allows for comprehensive error checking and convenient abstractions without sacrificing performance during repeated execution. Only the subset of application resources affected by parameter changes is updated for a given refresh, and all resource allocation takes place during the refresh phase instead of during processing. Resource allocation includes generation of CUDA module-based resources, allocating memory, and preparing many CUDA API arguments. The framework automatically manages constructing the nvcc command line, building and loading the module, and extracting