Porting the CPN code to GPU

Author: Jie Lin (s0969522)


August 26, 2010
















MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2010


Abstract

GPGPU, which stands for General-Purpose computation on Graphics Processing Units, is now a popular topic in high performance computing. It has attracted the attention of many programmers for its high computational capability as well as its ease of use. Existing code with rich parallelism can be ported to the GPU to examine the potential performance gain. This dissertation aims to implement and benchmark a GPU implementation of the CPN code using the CUDA programming language. It discusses the strategies for both the porting and optimisation processes. With large parameters, the first version of the modified code has a speed-up of 6.3 for the whole program, and about twenty times speed-up for some selected functions. Further optimisation, removing unnecessary memory allocation and de-allocation, produces an even better speed-up: with large parameters, the speed-up is about 7.4 for the whole program and 37 for some selected suitable functions.


Contents

Chapter 1 Introduction
Chapter 2 Background Knowledge
  2.1 Features of GPU and CUDA Programming language
    2.1.1 Main differences to CPUs
    2.1.2 Architecture of a modern GPU
  2.2 CUDA programming language
    2.2.1 CUDA Program Structure
    2.2.2 CUDA Memory and variable Types
  2.3 Libraries
Chapter 3 Porting the CPN code to GPU
  3.1 Desirable characteristics of selected code
    3.1.1 High computational intensity
    3.1.2 Embarrassingly parallel
    3.1.3 Minimal conditional execution
  3.2 Profiling
  3.3 Selecting code to port
  3.4 Identifying data required on GPU
  3.5 Decomposition and execution configuration
  3.6 Parallel reduction algorithm
  3.7 Porting code using thrust
  3.8 Debugging
Chapter 4 Optimisation
  4.1 Overall performance optimisation strategies
  4.2 Measuring Performance
    4.2.1 Timing
    4.2.2 Maximum Performance Benefit
  4.3 Optimisation for the CPN code
    4.3.1 Optimise memory usage
    4.3.2 Optimise instruction usage
Chapter 5 Conclusions
Appendix A Profile of CPN code
References




List of Tables

Table 1: Suitable functions and code sections
Table 2: Time spent for the CPN code on GPU vs CPU
Table 3: Performance gain from removing unnecessary memory allocation and de-allocation
Table 4: Profiles of the CPN code




















List of Figures

Figure 1 [6]: A Stream Multiprocessor
Figure 2 [14]: The execution of a typical CUDA program
Figure 3 [7]: Hierarchy of CUDA memory
Figure 4 [11]: Tree-based approach for parallel reduction
Figure 5 [11]: Parallel reduction with divergence
Figure 6: Overall speed-up for the first version
Figure 7: Time spent on GPU-related operations




Acknowledgements

I am grateful to my supervisor Chris Johnson for his patient guidance during the project, and for his careful corrections and suggestions about the style and language of the report during the revision stage. I would also like to thank my friends and classmates for helping me to learn CUDA programming techniques and dissertation writing skills.



Chapter 1 Introduction

Driven by the very large computer game industry, graphics processing units have seen rapid performance increases in recent years. Traditionally, GPUs have been highly efficient at, and widely used for, computer graphics. Due to this rapid performance improvement and their highly parallel structure, they have now moved into the general-purpose domain as accelerators for the CPU.

GPGPU, which refers to general-purpose computation on Graphics Processing Units, "has a broader application across a variety of fields such as computer clusters, physics engines, audio signal processing, digital image processing, bioinformatics as well as scientific computing" [13]. Given the impressive power of GPUs, some institutions have begun schemes to construct supercomputers using GPUs. "Tokyo Institute of Technology announced a plan to build a 2.4 Petaflop supercomputer based on GPU" [17], and ACCRE also plans to construct a GPU cluster to boost its supercomputing power [18]. In summary, the application of GPUs has become prevalent in many traditional HPC areas, and it is predictable that GPUs will be applied more extensively and more deeply in the future.

In Chapter 2, I briefly describe GPU features and the CUDA programming language. Chapter 3 describes the porting process for the CPN code along with a discussion of parallel reduction. Chapter 4 discusses general optimisation strategies, approaches to evaluating performance, and optimisation attempts on the first version of the modified CPN code.



Chapter 2 Background Knowledge

2.1 Features of GPU and CUDA Programming language

In this section I will discuss some features associated with GPUs and the CUDA programming language.

2.1.1 Main differences to CPUs

While the pace of CPU performance improvement has slowed in recent years, GPU performance has improved quickly and steadily. The reason is the difference in design between GPUs and CPUs.

D. Kirk and W. Hwu [14] summarise the difference: "Design of a CPU is optimised for sequential code performance, while GPU is designed to optimize for the execution of massive number of threads" [14]. A CPU has a sophisticated design which can deal with very complex control logic. By contrast, a GPU pursues high computational throughput: it sustains thousands of threads to perform intensive computation and hide memory latency. In addition, a CPU is designed to minimise instruction and data access latencies, which requires large caches, while a GPU uses small caches so that more transistor area is dedicated to calculation [6].

Although the performance of a GPU is impressive, it cannot take over all the work. The GPU is used to complement the CPU: a program which can benefit from a GPU should have its parallel portion run on the GPU and its sequential portion run on the CPU.

2.1.2 Architecture of a modern GPU

The Streaming Multiprocessor (SM) is the basic building block of a modern NVIDIA GPU. It is composed of 8 Streaming Processors, each of which has the essential functional units such as a multiply-add unit [14]. Besides that, each SM contains a piece of shared memory and an instruction unit which issues one instruction to all Streaming Processors per clock cycle. SMs also have the capability to perform fast floating point operations through special functional units. Figure 1 illustrates an example of a Stream Multiprocessor.




Figure 1 [6]: A Stream Multiprocessor

The details of the GPU cards in Ness are as follows. There are two nodes in total. On each node there are two 2.0GHz Intel Xeon E5504 quad-core processors and a dual M1060 Tesla card, so there are 16 2.0GHz Intel Xeon cores and 4 GPUs in total. They are connected by integrated Intel dual-port Gigabit Ethernet.

The M1060 Tesla card is massively parallel, with 240 Streaming Processor cores [2]. The peak single precision and double precision floating point performance is 933 GFlops and 78 GFlops respectively. It has 4 GBytes of dedicated memory with a bandwidth of 102 GB/sec. These hardware resources provide powerful computational capability. The GPU will sustain thousands of threads during the runtime of a program; applications well written for the GPU can have more than ten thousand threads running simultaneously. In order to write effective programs, it is very important to keep this architecture in mind.

2.2 CUDA programming language

Several years ago, when the concept of GPGPU was initially introduced, programming on a GPU was very hard because there were no dedicated programming languages for GPU programming. The programmer needed to reformulate the code in a graphics development pattern, which was very difficult. In November 2006, NVIDIA introduced CUDA. CUDA allows developers to use C as a high level programming language [1]. An alternative programming language is OpenCL [3]. OpenCL, which refers to Open Computing Language, was initiated and developed by Apple. It supports a large range of applications and architectures. OpenCL is a low level API, and it introduces more work for the programmer. For my project, I chose C for CUDA to port the code to the GPU.

2.2.1 CUDA Program Structure

In this section, I will discuss a simple CUDA programming model.

D. Kirk and W. Hwu [14] illustrate the execution process of a typical CUDA program. A CUDA program consists of two kinds of code: one is the standard ANSI C code which runs on the CPU, and the other is the extended C code which executes on the GPU. In the CUDA programming model, the CPU is also referred to as the host, while the GPU is referred to as the device. Figure 2 gives an example of CUDA program execution. In general, a CUDA program starts from the host side code, which runs in serial. When it reaches device side code, such as a kernel, the program starts to execute in parallel on multiple streaming processors in a single GPU. When the parallel execution of the device code completes, the program returns control to the CPU. The CUDA programming model also allows concurrent execution of host code on the CPU and device code on the GPU, especially when there is no data dependence between the host code and the device code.


Figure 2 [14]: The execution of a typical CUDA program


The compilation phase reflects the execution process of a CUDA program. NVIDIA provides a compiler called NVCC, which first separates the host and device code in a single CUDA program. The host code is then further compiled by a standard C compiler such as the GNU compiler. On the other hand, NVCC further compiles the device code to make it able to execute in parallel on multiple streaming processors.

The device code uses C syntax with a few extensions, i.e. keywords, according to which the NVCC compiler can split the device code from the whole program. The main differences between host code and device code are their execution pattern and memory structure. The device code is executed in parallel rather than in serial, and it also uses a variety of memory types within the GPU. The parallel execution code and the different memory types are specified by the extended CUDA keywords. CUDA introduces two additional kinds of function, kernel functions and device functions. A kernel function with the __global__ qualifier generates multiple threads running on the GPU, while a device function with the __device__ qualifier executes on a single thread. The parallel execution pattern in the CUDA programming model is a two-level hierarchical structure. The invocation of a kernel produces a grid which consists of several blocks, and each block includes a large number of threads. The structure of the grid and the blocks within the grid is determined at the invocation of a kernel. A kernel invocation uses the <<<gridDim, blockDim>>> syntax to specify the dimensions of the grid and blocks.

Similar to the functions running on the GPU, variables in host code and device code use separate memory spaces. The GPU has its own memory, and there are many types of memory in the GPU's memory hierarchy. Variables on the GPU side cannot be used on the CPU side, nor can variables on the CPU side be used on the GPU side. For this reason, there must be explicit transfer of data between host and device. A CUDA program will usually allocate two interrelated buffers which reside on the GPU and the CPU separately. The host buffer stores the initial data. In order to make it available on the GPU side for parallel operation during the invocation of a kernel, the buffer on the CPU should copy its data to the corresponding buffer on the GPU. When the execution of the kernel completes and the output data has been generated, the buffer on the GPU should transfer the result back to the corresponding buffer on the CPU. This process is realised by the CUDA runtime function cudaMemcpy. Allocation of a buffer on the GPU is done by cudaMalloc; this buffer is dynamically allocated so it can be freed after use by cudaFree.

Listing 2.1 [1]: a simple C for CUDA example

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    float *A, *B, *C;   /* device pointers, assumed allocated with cudaMalloc */

    // Kernel invocation
    dim3 dimBlock(16, 16);
    dim3 dimGrid(4, 4);
    vecAdd<<<dimGrid, dimBlock>>>(A, B, C);
}

In the example shown in Listing 2.1, the block and grid are both defined as two dimensional, and the threads in a block are organised in a 16x16 pattern.
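Listing 2.1 omits the host-side memory handling described above. The following sketch is my own illustration, not code from the CPN program: the array size N and the host arrays h_A, h_B and h_C are hypothetical, and for simplicity the kernel is launched with a single one-dimensional block of N threads, matching the threadIdx.x indexing used by vecAdd.

int N = 256;
size_t sz = N * sizeof(float);
float *h_A, *h_B, *h_C;          /* host buffers, assumed allocated and filled with input data */
float *d_A, *d_B, *d_C;          /* corresponding device buffers */

/* allocate buffers in global memory on the device */
cudaMalloc((void**)&d_A, sz);
cudaMalloc((void**)&d_B, sz);
cudaMalloc((void**)&d_C, sz);

/* copy input data from host to device */
cudaMemcpy(d_A, h_A, sz, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, sz, cudaMemcpyHostToDevice);

/* launch the kernel: one block of N threads */
vecAdd<<<1, N>>>(d_A, d_B, d_C);

/* copy the result back and release the device buffers */
cudaMemcpy(h_C, d_C, sz, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);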

2.2.2 CUDA Memory and variable Types

So far, we have briefly discussed a simple CUDA programming model. The only memory type we have covered is global memory, which is used to allocate global variables from the host. But we will not gain a performance improvement by only using global memory, because accessing global memory is fairly slow. Besides the global memory space, there are three other memory types, i.e. shared memory, registers and constant memory, which are much more efficient than global memory. Figure 3 illustrates the hierarchy of CUDA memory.


Figure 3 [7]: Hierarchy of CUDA memory


The performance of memory can be measured from different aspects, including access latency and bandwidth. We can further understand its usage by looking at the features of the variable types that reside in each memory, such as the scope and lifetime of a particular type of variable. Again, different types of variable are distinguished by additional CUDA keywords. The scope here refers to the range of threads that can access the variable. This follows from the two-level hierarchical structure of threads in the CUDA programming model: a variable can be accessible from a single thread, from a block, or from the whole grid. The lifetime of a variable is another important issue: it identifies the range of the program in which the variable can be used. There are two different kinds of lifetime, one within the execution of a kernel, and the other for the whole program. We can characterise the different memory spaces and their associated variable types in terms of these aspects.



2.2.2.1 Global Memory

Global memory has much higher latency and lower bandwidth than the other memory spaces on the GPU, although compared to contemporary CPU memory the bandwidth of global memory is many times higher. Residing in the global memory space are global variables, specified by an optional __device__ qualifier. A global variable is created and operated on in the host code by invoking CUDA runtime API functions such as cudaMalloc and cudaMemcpy. Global variables have the scope of the entire grid, and their lifetime is across the entire program. The value of a global variable is maintained throughout multiple invocations of different kernel functions, so global variables are particularly useful for passing data between different kernel executions. The device side buffer mentioned in Section 2.2.1 is a typical example of a global variable.

2.2.2.2 Shared Memory

Global memory exists in DRAM on the GPU. By contrast, shared memory is on-chip memory which behaves much like a user-managed cache. Each SM has a piece of shared memory that is evenly partitioned among the blocks resident on the SM and assigned to these blocks at runtime. Access to shared memory is therefore much faster than access to global memory. Besides the low latency, shared memory also consists of several memory banks which allow it to be accessed simultaneously by threads; as a result its bandwidth is several times as high as that of a single bank. Residing in shared memory are shared variables, specified by the __shared__ qualifier. Shared variables have the scope of a block: they are accessible to all threads within the block to which the shared memory is assigned. Shared variables have the lifetime of the block; they are not kept when the kernel that created them completes execution. Due to the remarkably low latency and high bandwidth, shared variables are used to replace global variables in a kernel in order to get good data locality when those global variables are heavily used.

2.2.2.3 Registers

Registers are another kind of on-chip memory with extremely low latency. Each SM has a set of registers that are assigned to threads. Unlike shared memory, which is owned by all threads in a block, registers are exclusively owned by each thread. Access to shared memory can be highly parallel due to its multiple banks; likewise, access to registers can be highly parallel because each thread has its own registers. Variables placed into registers are automatic variables, which refers to variables declared within a kernel or device function without any qualifier. The scope of an automatic variable is an individual thread: each thread generates a private copy of an automatic variable during the kernel's execution. Automatic variables have a lifetime within a kernel's execution. As the number of registers is limited, only a few automatic variables can be stored in registers; the remaining variables reside in the local memory space, which has the same high latency and low bandwidth as global memory. This raises a trade-off in the number of threads for a kernel: we can either use a large number of threads to achieve a high level of parallelism, or reduce the number of threads so that each thread has more registers and so gets fast access to frequently used variables.



2.2.2.4 Constant memory

Similar to global memory, the constant memory space is in DRAM and can be both read and written from the host code. However, because it is cached by a constant cache in each SM, constant memory has much faster access than global memory. In the device code, constant memory provides read-only access. Constant variables are specified by the __constant__ qualifier and must be declared outside any function body. Their scope is therefore across all grids, and they have the lifetime of the entire program. Constant variables are often used to hold constant input values which are passed to a kernel.
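As a brief illustration of the qualifiers discussed above (my own sketch, not code from the CPN program; the variable names are hypothetical), the following fragment declares one variable of each kind and notes its scope and lifetime:

__device__   double d_total;          /* global memory: grid scope, application lifetime */
__constant__ double c_coupling;       /* constant memory: grid scope, application lifetime, read-only in kernels */

__global__ void example_kernel(double *input)
{
    __shared__ double s_buf[256];     /* shared memory: block scope, kernel lifetime (blockDim.x <= 256 assumed) */
    int tid = threadIdx.x;            /* automatic variable: thread scope, usually held in a register */

    s_buf[tid] = input[blockIdx.x * blockDim.x + tid] * c_coupling;
    __syncthreads();

    if (tid == 0)
        d_total = s_buf[0];           /* write to a global variable visible to later kernels */
}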

2.3 Libraries

Programming libraries are important tools and building blocks for application development. Besides the CUDA library from NVIDIA, which is a component of CUDA C, there are several third party libraries which provide powerful utilities to facilitate CUDA programming. There are two libraries from which the porting of the CPN code may benefit: one is the Thrust library, which uses a similar style to the C++ STL and contributes immensely to the productivity of application development [8], and the other is the CUDPP library, which provides data parallel algorithm primitives such as parallel reduction [9]. Using these libraries can give high and stable performance and ease the development process.




Chapter 3 Porting the CPN code to GPU

The project aims to port the CPN code to the GPU. The CPN code deals with a 2D lattice QCD field theory. It uses three different Monte-Carlo simulation algorithms, i.e. Metropolis, overheat bath and Microcanonical, to update fields. Within the code, the main routine is in the file mc.c. The input parameters are read from a text file, "input_file". There are two choices of action, 0 and 1, corresponding to "action_unimproved" and "action_symanzik". However, from a programmer's point of view, the mathematics and the physical meaning of the results are not important to the project.

Porting is the process of modifying the code so that it can run on the GPU. Since modifying a large code is difficult and time consuming, it is better to first make sure the code runs correctly, without considering performance at this stage. The common porting strategy can be divided into the following steps. The first step is to find out which parts of the CPN code can be ported to the GPU, according to the profiling results and whether a function or code section has the desirable characteristics. Then all the data that will be used and modified on the device must be identified. The next step is the decomposition process, which parallelises the code into many threads.

It is always difficult to port a large program to the GPU. Many functions or code segments will be selected to be converted to kernels, and many CUDA runtime API functions will be needed to deal with the data required by the GPU. It is preferable to move all the code newly introduced during porting into separate files instead of the original ones, to avoid damaging the originals.

3.1 Desirable characteristics of selected code

The GPU is only beneficial to certain types of code. This section discusses several desirable characteristics of code that is suitable for running on the GPU.

3.1.1 High computational intensity

As described in the background section, host and device have separate memory spaces, and they cannot reference each other's memory. Therefore the CPU and GPU must transfer data via the PCI Express connection between them in order to see each other's data. The latency for transferring data is large, which can mean that performing an operation on the CPU is faster than performing it on the GPU. Although the GPU is good at large amounts of computation, the delay of transferring data back and forth can significantly reduce performance. Therefore, the code must involve a significant number of computations per data element in order to achieve a speed-up when implemented on a GPU.

3.1.2 Embarrassingly parallel

The selected code should involve rich data parallelism. A program with rich data parallelism involves many identical operations which can be executed simultaneously on a large number of data elements. GPUs have an explicitly parallel programming model [4]. When a kernel is invoked, thousands of lightweight threads are created to perform similar operations [14]. Therefore a large amount of data parallelism is needed for a GPU application to benefit from its tremendous computational capability.

We should also avoid interaction between threads. Due to the implementation of thread scheduling, different threads are not guaranteed to be active simultaneously, so if we do need interaction between threads, we need expensive barrier synchronisations. The interaction is also limited, because a synchronisation only occurs within a block.

3.1.3 Minimal conditional execution

The selected code should have as little conditional execution as possible. The warp is the scheduling unit in an SM. Threads are not independent: they are grouped into blocks, and blocks are executed as warps of 32 threads. Only one instruction is issued to a warp per 'warp time', which is 4 clock cycles. If, at some point, the code has different sets of instructions for threads within a warp, then because of the one-instruction limit some threads will take the instruction while others that fall into another execution path will be idle. This is called thread divergence, and it wastes the GPU's computational capability. Therefore, the programmer should avoid conditional statements where possible to ensure maximum efficiency.

3.2 Profiling

According to Amdahl's law, it is only worth working on the parts of the code that take a lot of time; a large speed-up of unimportant sections will have little impact on overall performance. Therefore it is essential to examine the code using profiling tools. Only functions that consume the most time and are good candidates for parallelism have the potential to benefit from GPU acceleration. The functions identified should also exhibit rich data parallelism so that they can be decomposed into many threads. In a word, the overall performance gain from using the GPU depends entirely on the extent to which the code can be parallelised.

3.3 Selecting code to port

With the profile of the code, it is possible to identify which functions would be suitable for acceleration. Note that although the profile indicates that the top three functions which consume the most time are iz, itheta and Trpp, it is not actually possible to port them to the GPU, as they do not include loops and so cannot be decomposed. Though these primitive functions, iz, itheta and Trpp, cannot be converted to kernel functions, functions that contain loops and invoke these primitive functions inside loops can be parallelised. From the profile, it is obvious that the functions and code sections that both take large portions of time and have rich data parallelism are the three field updating functions (update_fields_heat, update_fields_motropolis, update_fields_micro) and the two magnetic functions (magnetic_susceptibility, magnetic_GP1). In addition, it is desirable that the selected functions and code sections have enough computational intensity that the memory latency can be well hidden.

A profile of the CPN code indicates that if the parameters N, T, L are small, then no function takes an absolute advantage over the others, which means that there will be little performance benefit from porting a single function even though it may parallelise efficiently. However, when scaling the problem size by increasing these parameters (N, T, L), some functions take a large portion of the runtime. When keeping the parameter N small (e.g. 2) but making T and L large (e.g. 16), about 80% of the code's runtime is spent in the functions magnetic_susceptibility and magnetic_GP1. When keeping the parameters T and L small but making N large, the updating functions consume much of the time, as these functions scale with N. Moreover, many functions in the file observables.c have the same loop pattern as magnetic_susceptibility and magnetic_GP1, so they could be ported to the GPU as well, though the improvement would be trivial. Table 1 illustrates some potential functions and code sections which are suitable for running on the GPU.


Functions and code sections                                  Loop size
plaquette                                                    L*T
topological_susceptibility                                   L*T
action_unimproved                                            L*T
action_symanzik                                              L*T
code section which invokes the three update functions        L*T
magnetic_susceptibility                                      L*L*T*T
magnetic_GP1                                                 L*L*T*T

Table 1: Suitable functions and code sections

It is also important to verify that there is no data dependency between loop iterations in a function. The absence of data dependency is an important requirement for decomposition: if a later computation within a loop depends on the computation in a previous iteration, the function cannot be parallelised. For the loops that invoke update_fields_heat, update_fields_motropolis and update_fields_micro, although each call operates on a distinct vector element in the lattice field, the computation on each vector element depends on its neighbouring vector elements. Therefore the direct decomposition method cannot be applied here.

3.4 Identifying data required on GPU

As discussed in the background section, there are separate memory spaces for host and device. Programmers have to declare two interrelated buffers in the host and device code, and explicitly transfer values between them before and after the invocation of GPU functions.

Another point to note is that CUDA-supported GPUs have a multi-level memory hierarchy. CUDA threads within a grid may access data from multiple memory spaces during their execution, and variables in different memory spaces have different access scope, lifetime and memory latency. Using different memories can result in significant differences in performance. Therefore, after identifying all the data that is required by the GPU, it is necessary to figure out which memory space is suitable for placing each piece of data. The different memory spaces and their associated variables are discussed as follows.

The global memory space is persistent across kernel launches by the same application, and it can be accessed by all threads. At least two arrays should reside in global memory: z_field and theta_field. The former is a 2N-component real vector at each site on the L*T lattice (so a total of L*T*2N double precision values), and the latter is a real phase theta at each site (so a total of L*T*2 double precision values). Dealing with global memory requires memory management functions from the CUDA runtime API, all of which are invoked on the host side. We need these functions to allocate memory on the GPU, to copy data from the host side to the device side or from the device side to the host side, and finally to free the allocated memory on the GPU.
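A minimal sketch of this host-side handling for z_field is shown below. This is my own illustration, not code from the CPN program; the device pointer name d_zfield and the use of global.V (= L*T sites) and global.N are assumptions based on the listings later in this chapter.

/* size of z_field: 2N real components per site, global.V = L*T sites (assumed) */
size_t z_sz = sizeof(double) * 2 * global.V * global.N;
double *d_zfield;

cudaMalloc((void**)&d_zfield, z_sz);                          /* allocate in global memory */
cudaMemcpy(d_zfield, z_field, z_sz, cudaMemcpyHostToDevice);  /* host -> device before the kernel */

/* ... kernel invocations operating on d_zfield ... */

cudaMemcpy(z_field, d_zfield, z_sz, cudaMemcpyDeviceToHost);  /* device -> host after the kernel */
cudaFree(d_zfield);                                           /* release when no longer needed */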

Threads within a block can access the shared memory on the multiprocessor on which the block is running. Shared memory has the same lifetime as the block. As shared memory is on-chip it has very low latency; when there are no bank conflicts between threads, its latency can be about 100 times lower than that of global memory. For this reason, if a variable is used many times in a kernel function, it should be loaded from device memory and stored in shared memory. In the CPN code, the reduction operation is used frequently in the selected functions and code sections which will be ported to the GPU. As the parallel reduction uses the same data elements many times, these elements should reside in shared memory. However, although shared memory is fast, it is small, and it is evenly distributed among the blocks: as the number of blocks increases, the amount of shared memory issued to each block decreases. For example, the maximum amount of shared memory per multiprocessor on the Tesla M1060 is 16 KB, so for an SM with 8 blocks, the shared memory per block is 2 KB.

Each thread has its own local variables, which are called automatic variables. An automatic variable is only accessible from its thread and has the same lifetime as the kernel. All variables declared within a kernel without any qualifier are automatic variables, and they will reside either in registers or in DRAM, depending on conditions such as whether the registers issued to each thread have been exhausted. It is important to note that there is a finite number of registers on each SM. Similar to shared memory, registers are shared by all the threads resident on an SM, so the more threads on the SM, the fewer registers each thread can have. The number of 32-bit registers per multiprocessor is 16 K (16384). Since the maximum number of resident threads per multiprocessor is 1024 on Tesla, at full occupancy there can be 16384/1024 = 16 registers per thread.

The constant memory space resides in device memory. As it is accelerated by the constant cache in each SM, accessing constant memory is much faster than accessing global memory. So it is better to place unchanging parameters in constant memory, to reduce unnecessary global memory accesses. In the CPN code, there is a C struct called global_params_t that stores the parameters read from the input file, and these parameters are never changed. As these parameters are loaded frequently throughout the whole program, placing them in constant memory instead of global memory gives a good performance gain.
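A sketch of how such parameters could be placed in constant memory is shown below. This is my own illustration; it assumes that the host-side instance of global_params_t is the variable global used in the listings later in this chapter.

__constant__ global_params_t d_params;   /* device-side copy of the parameters in constant memory */

extern global_params_t global;           /* host-side parameters read from "input_file" (assumed) */

/* host side: copy the parameters to constant memory once, after reading the input file */
void copy_params_to_device(void)
{
    cudaMemcpyToSymbol(d_params, &global, sizeof(global_params_t));
}

/* kernels then read the parameters directly from constant memory, e.g. d_params.L, d_params.T */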

As the CPN code uses double precision arithmetic and double precision data elements, we need the compiler flag -arch=sm_13 to support this; otherwise, all the double precision operations are converted to single precision.
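For example, a compilation command of the following form could be used (the file names here are placeholders; the essential part is the -arch=sm_13 flag):

nvcc -arch=sm_13 -O2 -o cpn_gpu mc.c cpn_kernels.cu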

3.5 Decomposition and execution configuration

After selecting the suitable functions and code sections and identifying the necessary variables for them, the next stage of porting is to decide how to decompose the functions or code sections into many GPU threads. Parallel decomposition on CUDA requires one to decide how many threads each block has and how many blocks each grid has. This is called the execution configuration in CUDA. The selected functions and code sections all involve many loops, or rather nested loops. In order to port them to the GPU, we first convert the computations within the loops into a kernel, which is specified by the __global__ qualifier. Then the selected functions and code sections invoke the kernels instead of containing the loops from the serial code. Invocation of a kernel requires one to specify the execution configuration. Apart from the computations that are converted into a kernel function, the primitive functions invoked in these computations, such as iz, itheta and Trpp, should be converted into device functions with the __device__ qualifier.

The execution configuration is part of the kernel invocation. The whole invocation of a kernel function is of the form function<<<gridDim, blockDim, Ns, S>>>(args, ...). The gridDim and blockDim arguments refer to the grid dimension and block dimension respectively. Ns specifies the amount of dynamically allocated shared memory, and S specifies the associated stream (covered in the optimisation section). Ns and S are optional arguments.

There are several important points to note here. Firstly, there is a limit to the number of threads per SM, which depends on the compute capability. For the Tesla M1060, the maximum number of threads per SM is 1024. Also, there is a finite amount of memory resources such as registers and shared memory available per multiprocessor. For example, using a lot of registers in each thread will reduce the number of active blocks, which means registers and shared memory can limit parallelism. In addition, threads within a block can communicate with each other through shared memory, and only threads in the same block can have barrier synchronisation with each other.

The magnetic_susceptibility and magnetic_GP1 functions have some desirable characteristics for parallel decomposition. They contain four nested loops, and within the inner loop the computational intensity is high enough. For these two functions, the block dimension can initially be set to the total iteration count of the inner nested loops, and the grid dimension to the total iteration count of the outer nested loops. As there is rich data parallelism in magnetic_susceptibility and magnetic_GP1, one can be sure that the performance gain from porting them to the GPU will be large. The nested loops in the main function invoke the three field updating functions (update_fields_heat, update_fields_motropolis, update_fields_micro), so these loops could be parallelised as well. One way to parallelise these loops is to set the block dimension to the total iteration count of the inner loop and the grid dimension to the number of SMs. This decomposition distributes the threads evenly between blocks. However, there are alternative decompositions. As threads are scheduled in warps of 32 threads, it is always better to set the number of threads per block to a multiple of 32 in order to fully utilise the SM's capability. As memory transactions are issued per half warp, which is 16 threads, it is always better to set the number of threads per block to a multiple of 16 to facilitate coalesced memory access. Therefore, it is better to try different parameter combinations for the execution configuration in order to find the optimal one. When setting the block size to a constant value which is a multiple of 16, the number of blocks is equal to the total iteration count of the nested loops divided by that block size. The division will probably not be exact, so the number of blocks should be rounded up to the next integer, which means one of the blocks will have idle threads that have no work to do. A sketch of this calculation is given below.
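The following fragment is my own sketch, not taken from the CPN code: the kernel name some_kernel is hypothetical, the use of global.L and global.T follows the later listings, and the block size of 256 is an arbitrary multiple of 32. It shows the usual way of computing such a configuration and guarding the extra threads inside the kernel.

__global__ void some_kernel(double *d_zfield, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                  /* idle threads in the last, partly filled block do nothing */
        /* ... work on element i ... */
    }
}

/* host side */
int n          = global.L * global.T;               /* total number of loop iterations */
int block_size = 256;                               /* a multiple of 32 (and hence of 16) */
int num_blocks = (n + block_size - 1) / block_size; /* round up to cover all n elements */
some_kernel<<<num_blocks, block_size>>>(d_zfield, n);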

3.6 Parallel reduction algorithm

Reduction operations are common and important computational primitives. In the CPN code, they occur many times: almost all the functions or code segments that have been selected for conversion into kernels contain a reduction operation over intermediate results. So it is worth taking a close look at how to implement a parallel version of the reduction operation.

Suppose we have to perform a sum reduction over an array of size N, and we are going to use the GPU to do it. One method of performing reductions on the GPU is to use the intrinsic atomic functions in CUDA. In this implementation, there is one variable holding the result, and each thread in the grid holds one element of the computation. In the kernel, the result variable updates itself by invoking an atomic function on the element and itself. However, atomic functions access the result in a serial manner, which wastes the GPU's resources and leads to very low efficiency. The recommended approach is to use a tree-based algorithm which can be executed in parallel.




Figure 4 [11]: Tree-based approach for parallel reduction

Generally, each thread block runs a loop that implicitly builds an addition tree over the input elements. In every round of the reduction, neighbouring data elements are added together, and the result is saved for the next round of addition. Each step of this pair-wise addition cuts the number of partial sums in half, and the final sum is produced after log2(N) steps.

Listing 3.1 [11]: Reduction with divergence

__global__ void block_sum(double *input, double *result, size_t n)
{
    extern __shared__ double s_data[];

    /* load one element per thread into shared memory */
    s_data[threadIdx.x] = 0;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        s_data[threadIdx.x] = input[i];
    }
    __syncthreads();

    /* pair-wise addition: in each round the even-indexed threads add in their neighbour */
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (threadIdx.x % (2*stride) == 0) {
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        }
        __syncthreads();
    }

    /* thread 0 writes this block's partial sum */
    if (threadIdx.x == 0) {
        result[blockIdx.x] = s_data[threadIdx.x];
    }
}

Listing 3.1 illustrates the implementation of a reduction kernel. There is an array declared in the kernel which holds the initial values and afterwards stores the partial sums. The reduction kernel mainly consists of a loop which repeatedly adds neighbouring elements of the array and stores the partial result in the proper position. Each iteration of the loop corresponds to one round of pair-wise addition, as shown in Figure 5. In order to access this array in parallel, and for performance reasons, this array should be a shared variable. Another consideration is the synchronisation of threads within a block: the intrinsic function __syncthreads is needed to avoid read-after-write hazards and to ensure the completion of each round of pair-wise addition. There is a stride variable which selects the threads that take part in each round of pair-wise addition. In the first iteration the stride is set to 1, which means the even-indexed threads perform the pair-wise addition. All the elements participate in the reduction, and the results are stored in the elements with even indices. At the start of the second iteration, the number of remaining elements is half the number of initial elements. The second iteration picks up all the even-indexed elements, performs the additions on them, and places the results in the positions whose indices are multiples of four. This process goes on until thread 0 finishes the last round of addition and obtains the final result [15].


Figure 5 [11]: Parallel reduction with divergence


However, this implementation has a big drawback, because there is thread divergence within each warp. When thread divergence occurs in a warp, it harms performance significantly. We can take the first round of the reduction as an example. In the first iteration, only the threads with even indices perform the pair-wise addition. This creates two execution paths: at first, half of the threads do the work while the others are idle; when that work is completed, this half of the threads goes to sleep and the other half takes control. In successive iterations this effect becomes even more serious, as fewer and fewer threads in a warp take part in the pair-wise addition and more and more threads are idle. This is obviously a waste of GPU resources.

Listing 3.2 [11]: Reduction with interleaved addressing

__global__ void block_sum(double *input, double *result, size_t n)
{
    extern __shared__ double s_data[];

    /* load one element per thread into shared memory */
    s_data[threadIdx.x] = 0;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        s_data[threadIdx.x] = input[i];
    }
    __syncthreads();

    /* interleaved addressing: the first half of the remaining threads do the additions */
    for (int stride = blockDim.x/2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride) {
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        }
        __syncthreads();
    }

    /* thread 0 writes this block's partial sum */
    if (threadIdx.x == 0) {
        result[blockIdx.x] = s_data[threadIdx.x];
    }
}


Listing 3.2 shows a modified kernel with a slightly different algorithm for the sum reduction. Thread divergence is inevitable for the grid as a whole, but many occurrences of divergence can be avoided, because divergence only matters within a warp. This gives rise to an improved version of the parallel reduction with interleaved addressing [11]. In contrast to the old version of the reduction algorithm, the new version always picks the first half of the remaining threads to perform the pair-wise additions. This is achieved by adding pairs of elements that are half the remaining size away from each other; the stride variable controls the distance between the elements of a pair. Although one divergence may still exist in the warp that contains the boundary between the last working threads and the first idle threads in any iteration, this algorithm markedly reduces the redundant divergence compared to the first version.

For loops with a very large number of iterations, the solution is to decompose the computation into multiple kernel invocations. In the first invocation, as there is a limit on the number of threads per block, we need to use multiple thread blocks to deal with the large array; each thread block takes a portion of the total reduction work. In order to combine the partial results from all blocks within the grid, we declare an array to store these temporary results and invoke a reduction again on them. The number of input elements for this second launch is much smaller, so it is possible to place them in a single block to get the final result [11]. In the case of reductions, the code for all levels is the same. Listing 3.3 illustrates the multiple kernel invocations of the parallel reduction.

Listing 3.3 [14]: Multiple kernel invocation of parallel reduction

// reduce per-block partial sums
int smem_sz = block_size*sizeof(double);   /* shared memory per block (double, to match block_sum above) */
block_sum<<<num_blocks, block_size, smem_sz>>>(d_input, d_sums, n);

// reduce the per-block partial sums to a total sum in a single block
block_sum<<<1, block_size, smem_sz>>>(d_sums, d_sums + num_blocks, num_blocks);



3.7 Porting code using thrust

This section discusses porting using the Thrust library, which was mentioned in the background section. Using the Thrust library has many benefits. First, Thrust has many parallel primitives which can simplify CUDA computations and make the CUDA code concise; employing Thrust allows users to rapidly develop complex applications. Thrust also encourages generic programming: one reduction can be applied to input of any data type. In addition, Thrust achieves high performance with minimal programmer effort. Finally, Thrust integrates well with CUDA C/C++ code.

Thrust provides two vector containers, host_vector and device_vector. A host_vector is stored in host memory and a device_vector is stored in device memory. Thrust's vector containers are analogous to the C++ STL std::vector: they are generic containers that can store any data type and can be resized dynamically. Listing 3.4 illustrates the use of Thrust's vector containers and compares it with raw CUDA.

Listing 3.4: comparing Thrust's vector containers with a raw CUDA implementation

/* raw CUDA implementation */

// array in host memory
double z_field[2*global.V*global.N];

// allocate array in the device memory
double *d_zfield;
cudaMalloc((void**)&d_zfield, sizeof(double)*(2*global.V*global.N));

// copy the array from host memory to device memory
cudaMemcpy(d_zfield, z_field, sizeof(double)*(2*global.V*global.N), cudaMemcpyHostToDevice);

/* Thrust's vector containers implementation */

// array in host memory
thrust::host_vector<double> z_field(2*global.V*global.N);

// copy the array from host memory to device memory
thrust::device_vector<double> d_zfield = z_field;

As this example shows, the = operator can be used to copy a host_vector to a device_vector. The = operator can also be used to copy a host_vector to a host_vector or a device_vector to a device_vector. This feature hides cudaMalloc, cudaMemcpy and cudaFree, and makes these common operations concise and readable. Also note that individual elements of a device_vector can be accessed using the standard bracket notation. This feature facilitates debugging in CUDA, because device values can be examined directly from the host side. However, because each of these accesses requires a call to cudaMemcpy, they should be used sparingly.
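For example (my own illustration, continuing from the d_zfield device_vector in Listing 3.4), a single device element can be printed from the host like this:

// reads element 0 back from the device (one cudaMemcpy behind the scenes)
std::cout << "z_field[0] on device = " << d_zfield[0] << std::endl;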

In order to use a user-defined kernel, we can extract a raw pointer from a device_vector and pass it to the kernel. Listing 3.5 illustrates an example.



Listing 3.5: passing the raw pointer of a device vector to a kernel

// allocate device vector
thrust::device_vector<double> d_trpp(n);

// obtain raw pointer to device vector's memory
double *ptrpp = thrust::raw_pointer_cast(&d_trpp[0]);

// use the raw pointer in a CUDA C kernel
sum_magnetic_GP1<<<global.L*global.T, global.L*global.T>>>(d_zfield, ptrpp);

Thrust provides a large number of common parallel algorithms, such as the reduction operation. Unlike the raw CUDA implementation, performing such an operation with Thrust does not require writing a kernel or setting an execution configuration. The sum of an array D is implemented with thrust::reduce as follows: double sum = thrust::reduce(D.begin(), D.end(), 0.0, thrust::plus<double>()); Here the first two arguments to the reduction, begin() and end(), are iterators which define the range of values, the third argument is the initial value, and the fourth argument plus() is the addition operator.
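Applied to the device vector from Listing 3.5, the per-site values written by the kernel could be summed in one call (my own illustration; it assumes d_trpp holds the partial results to be reduced):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// sum all elements of d_trpp on the GPU; the result is returned to the host
double total = thrust::reduce(d_trpp.begin(), d_trpp.end(), 0.0, thrust::plus<double>());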

3.8 Debugging

CUDA debugging is time consuming, and there are several ways of doing it. In earlier versions of the CUDA Toolkit, the programming environment did not include any native debug support for kernel functions. There was only a compilation mode called device emulation mode, which makes the entire program run on the CPU. Programmers are then able to debug both the host code and the device code on the host side with common tools such as GDB [1]. As the code runs on the host, device emulation mode also allows print statements in kernel functions. However, execution on the GPU may produce different results from device emulation mode, because device emulation behaves slightly differently from the normal mode; as a consequence, device emulation mode may not work for debugging. In addition, when compiling with device emulation mode, performance is rather slow. Both of these reasons make device emulation mode an unsuitable approach for debugging.

Directly examining the value of variables using print statements is the common way of debugging code that runs on the CPU. But print statements are not natively supported on the GPU in older toolkit versions and on compute capability 1.x devices: the printf function can neither be invoked in a kernel, nor used on the host side to examine device data. Fortunately, there is a software package called cuPrintf that allows the user to add printf-equivalent cuPrintf calls to CUDA C code. The cuPrintf code works with CUDA 2.3 or newer, and is supported on all platforms on which the CUDA Toolkit is supported. This was the main way of examining variables inside kernels on the Ness system. With the Thrust library, it is also possible to examine device data directly from the host side, avoiding the need to transfer device data to the host before the value can be printed. On the fermi0 system, with compute capability 2.0, formatted output which behaves in a similar way to the standard C printf function is natively supported.

Newer versions of CUDA have introduced a debugger called CUDA-GDB. It is an extension to the standard GDB, and it allows the user to debug not only the native host code but also the CUDA code. "CUDA-GDB supports setting breakpoints at any host or device function residing in a CUDA application by using the function symbol name or the source file line number" [12]. This allows programmers to check the value of a variable at any point, no matter whether the check occurs on the host side or the device side. However, the Ness system, where I spent most of the project time, does not support CUDA-GDB for some reason. Fortunately, the fermi0 system does support it, so I was able to use that.

21

Chapter 4 Optimisation


Porting the code to the GPU aims to exploit the GPU's capacity to accelerate the code; therefore the performance of the resulting code is always a key objective of the project. The GPU has many features which are different from the CPU, and some of these features have a significant influence on performance. To use the GPU's compute capability efficiently, it is better to optimise the code according to these features after the first round of porting. Several papers list the key issues for achieving optimal performance on GPUs [5]. The general performance optimisation approach involves several steps. The first step is to measure the performance and find out which parts of the code will give a significant performance gain when optimised. The optimisation should then follow the overall performance optimisation strategies listed in Section 4.1, covering memory usage optimisation and instruction usage optimisation respectively. This chapter discusses the overall performance optimisation strategies, the importance of performance measurement, and the optimisation attempts for the CPN code.

4.1 Overall performance optimisation strategies

Performance optimisation involves three basic strategies [1]: parallel execution maximisation, memory usage optimisation and instruction usage optimisation.

Parallel execution maximisation means efficiently mapping the parallelism of the application onto the various components of the GPU system, in order to fully utilise the GPU's compute capability.

Instruction usage optimisation includes two aspects. The first is to minimise the use of arithmetic instructions with lower speed. CUDA supports standard mathematical functions as well as its own intrinsic functions; using the intrinsic functions sacrifices precision but gives higher speed.

Optimising instruction usage also means avoiding divergence within warps. In the CUDA programming model, a block is executed as warps of 32 threads. During execution, threads in the same warp perform the same operation at the same time as long as there are no different paths within the warp. However, if there are different paths within a single warp, the different execution paths are serialised. An example of conditional statements with different outcomes is discussed below.

If a conditional statement is written in the form if (threadIdx.x > 2) {...}, then two different execution paths are produced for the threads in a block. If we write another statement such as if (threadIdx.x / 32 > 2) {...}, we also get two different execution paths. Although both of them create different execution paths, their impact is different. As thread divergence only occurs within a warp, the second statement, which gives the same execution path for all threads in a warp, will not actually incur thread divergence.
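A small sketch of the two cases (my own illustration, not from the CPN code) makes the difference concrete:

__global__ void divergence_example(float *data)
{
    int tid = threadIdx.x;

    /* branches within a warp: threads 0-2 and 3-31 of warp 0 take different paths,
       so the two paths are serialised for that warp */
    if (tid > 2)
        data[tid] += 1.0f;

    /* branch granularity is a whole warp: every thread in a given warp evaluates
       this condition identically, so no warp diverges */
    if (tid / 32 > 2)
        data[tid] *= 2.0f;
}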

Memory usage optimisation is the key issue for performance improvement. CUDA memory handling is complex: there are various types of memory, and different memory spaces have different bandwidth and latency. For example, it takes 4 cycles to issue a memory fetch but 400 to 600 cycles of latency to complete it, which is equivalent to about 100 MADs, while the latency for registers, shared memory and constant memory is much lower; effectively using the different memories can therefore lead to a huge speed-up. Memory usage optimisation means minimising data transfer between host and device, minimising data transfer between the device and global memory, and accessing each memory space with its optimal access pattern. The concepts of memory coalescing and shared memory bank conflicts, which relate to the optimal access pattern, are briefly discussed here.

Memory coalescing refers to the fact that, when accessing global memory, peak performance is obtained when all threads in a half warp access contiguous memory locations [14]. The tiling algorithm is another example of efficient use of memory, and it is beneficial when there is much data reuse within a tile.

Shared memory is divided into banks so that many threads can access it simultaneously. Each bank can service one address per cycle, so shared memory can service as many simultaneous accesses as it has banks. Without bank conflicts, CUDA shared memory is as fast as registers. However, multiple simultaneous accesses to the same bank result in a bank conflict, and the conflicting accesses are serialised. Therefore, we must try to eliminate bank conflicts.
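A common way to remove bank conflicts is to pad a shared memory array so that the threads of a half warp fall into different banks. The sketch below (a hypothetical tiled matrix transpose, not from the CPN code, assuming a square matrix whose side is a multiple of TILE) shows the idea.

// Hypothetical example: padding a shared-memory tile to avoid bank conflicts
// when the tile is read column-wise.

#define TILE 16

__global__ void transpose_tile(const float *in, float *out)
{
    // the extra column shifts each row by one bank, so column-wise reads
    // by a half warp hit different banks instead of the same one
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int width = gridDim.x * TILE;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write, no bank conflict
}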

4.2 Measuring Performance

Measuring performance is an essential preparation for the subsequent optimisation steps. It clearly shows the effect of each optimisation step, indicating whether that optimisation is worthwhile, and it can also point out the limit of the speed-up achievable on the GPU. This section discusses the common approaches to evaluating performance using either a CPU timer or CUDA events.

4.2.1 Timing

Measuring performance for a CUDA program focuses particularly on CUDA API calls and user-defined kernel executions. It is possible to measure the performance with both CPU and GPU timers. There are a variety of ways to measure time on the host side, and the programmer should always choose a CPU timer with high resolution. Another issue to keep in mind is that many CUDA API functions are asynchronous [16], which means that when these functions are invoked they immediately return control back to the calling CPU thread, even though they have only just begun their work on the GPU. Likewise, execution of kernel functions is asynchronous as well. As a result, the programmer should use the CUDA API function cudaThreadSynchronize to force synchronisation between the CPU thread and the device in order to get an accurate result for the duration of a CUDA call or kernel execution.

Listing 4.1: Timing code using CPU timer

double kernel_time, start, stop;

// block until all the previous calls have completed
cudaThreadSynchronize();

start = second();

magnetic_susceptibility_kernel<<<global.L*global.T, global.L*global.T>>>(d_zfield, ptrpp);

// block the CPU thread until the device has completed the kernel
cudaThreadSynchronize();

stop = second();
kernel_time += stop - start;

To make the result more accurate, it is better to use events. Events are part of the CUDA API and provide a system-independent way to measure execution times on CUDA devices. CUDA provides calls that create and destroy events, record events (via a timestamp), and convert timestamp differences into a floating-point value. Note that events are recorded on the GPU, so they can only be used for timing GPU execution. The returned value is expressed in milliseconds and has a resolution of approximately half a microsecond. With such a resolution, it is possible to get reliable timings even from a single kernel launch. Listing 4.2 illustrates an example of timing using CUDA events.

Listing 4.2: Timing code using CUDA events

cudaEvent_t start, stop;
float time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord( start, 0 );

magnetic_susceptibility_kernel<<<global.L*global.T, global.L*global.T>>>(d_zfield, ptrpp);

cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

cudaEventElapsedTime( &time, start, stop );

cudaEventDestroy( start );
cudaEventDestroy( stop );

Here cudaEventRecord is a CUDA runtime API function used to place the start and stop events into the default stream. "Stream is a sequence of operations that are performed in order on the device. The device will record a timestamp for the event when it reaches that event in the stream" [1]. Two events are necessary to get the elapsed time: the programmer needs a start event and a stop event surrounding the target code that is to be timed. The elapsed time is returned by the cudaEventElapsedTime function. Table 2 shows the time spent by the CPN code on the GPU compared with the CPU.


T    CPU runtime   GPU runtime   magnetic_susceptibility runtime   magnetic_susceptibility_gpu runtime
6    11.59024      21.789        1.5467                            7.6587
8    53.193        28.298        5.7527                            8.2164
10   58.191        34.558        12.008                            8.516
12   174.81        45.843        24.943                            10.196
14   195.23        57.657        46.265                            11.544
16   454.94        72.616        79.039                            13.946

Table 2: Time spent for CPN code on GPU vs CPU

4.2.2 Maximum Performance Benefit

Before starting the optimisation, it is essential to figure out how much speed-up can be achieved theoretically. One way of doing this is to apply Amdahl's law, which predicts the maximum speed-up of a parallel program and can clearly be applied to CUDA programming. It gives the expected maximum speed-up by the formula:

Equation 1: S(N, P) = 1 / (a + (1 - a)/P)

S(N, P) here stands for the speed-up of a program with problem size N on P processors, and a is the sequential portion of the program that cannot be executed in parallel. This equation can be transformed into another form in the extreme case to make the result clearer. In CUDA programming, where there are hundreds of streaming processors, the parallel fraction of the execution time, (1 - a)/P, can be ignored. Therefore we can simplify the equation and derive the result S = 1/a, which indicates that the speed-up of a program is limited by the time of the sequential portion, however large the number of processors. For example, with 10% sequential code, a program is expected to get at most a 10 times speed-up.

So in order to get as much speed-up as possible, it is important to parallelise as large a portion of the code as possible, thereby minimising a.



Figure 6: Overall speed-up for the first version

Figure 6 illustrates the speed-up for the first version of the modified code. For small values of T, there is no speed-up; porting the code to the GPU actually gives a performance reduction. However, as the value of T increases, execution on the GPU gives a big speed-up for the selected functions magnetic_GP1 and magnetic_susceptibility. This is because the GPU is suited to code with rich data parallelism.

The figure also indicates the maximum performance benefit that can be achieved by modifying only the magnetic_GP1 and magnetic_susceptibility functions. As these functions take approximately 85% of the total runtime, by Amdahl's law the speed-up for the whole program can be at most about 1/(1 - 0.85), which is roughly 7.

4.3 Optimisation for the CPN code

There are many spots where the first version of the modified CPN code can be optimised. This section illustrates the optimisation attempts for the CPN code. All the optimisations described below are applied to magnetic_susceptibility_gpu and magnetic_GP1_gpu.

4.3.1 Optimise memory usage

Memory optimisation is the key issue for performance improvement, as discussed in section 4.1. Therefore, it is best to explore the potential performance improvement from memory optimisation first. Before optimising, it is worth examining the time spent on each GPU-related operation.

The total time spent on the GPU includes not only kernel execution but also the time spent dealing with data: memory allocation (which may implicitly include CUDA initialisation in the first cudaMalloc call), de-allocation, and memory transfers between CPU and GPU. Therefore, to get an accurate measure of the performance gain from using the GPU, it is essential to record all the time spent on the GPU. Figure 7 illustrates the time spent on each GPU-related operation.

Figure 7: Time spent on GPU-related operations

4.3.1.1 Remove redundant device memory allocation

When first porting the code to the GPU, a correct result was the first concern rather than performance. In order to get correct results, the programmer should be very cautious when modifying the code, so in the first version all the introduced code, including the memory allocation and de-allocation, was placed inside the selected functions. However, device memory allocation and de-allocation via cudaMalloc and cudaFree are expensive operations, so these operations should be minimised.

From Figure 7, it is obvious that memory allocation and de-allocation take more than a third of the total time spent on GPU-related operations. It is therefore expected that eliminating unnecessary memory allocation and de-allocation will give a good performance win.

In the CPN code, as the variables z_field and theta_field exist throughout the whole program, it is better to move the invocations of cudaMalloc and cudaFree out of the selected functions and place them at the beginning and end of the program respectively. After this optimisation, the allocation and de-allocation are only invoked once; compared with thousands of iterations of computation, their cost is trivial and no longer a performance problem. This optimisation gives a speed-up of 1.5 for both functions.
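The following sketch outlines this restructuring; the buffer names and the size arguments are assumptions for illustration and do not reproduce the CPN code verbatim.

/* Hypothetical sketch of hoisting device allocations out of the measurement
   functions: allocate once at start-up, reuse the pointers in every call,
   and free once at shutdown. */

double *d_zfield;   /* device copy of z_field, allocated once */
double *d_trpp;     /* device buffer for per-thread results   */

void gpu_init(size_t zfield_bytes, size_t trpp_bytes)
{
    cudaMalloc((void **)&d_zfield, zfield_bytes);
    cudaMalloc((void **)&d_trpp,  trpp_bytes);
}

void gpu_finalise(void)
{
    cudaFree(d_zfield);
    cudaFree(d_trpp);
}

/* magnetic_susceptibility_gpu and magnetic_GP1_gpu now only copy data and
   launch kernels; they no longer call cudaMalloc/cudaFree on every invocation. */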




             magnetic_susceptibility_gpu   magnetic_GP1_gpu
version 1    13.9459                       15.443742
version 2    8.77401                       10.393218

Table 3: Performance gain from removing unnecessary memory allocation and de-allocation

After this optimisation, the speed-up for the whole program is 7.3 and the speed-up for magnetic_GP1_gpu is 37.

4.3.1.2 Minimise data transfer between CPU and GPU

The next optimisation of memory usage is to minimise data transfers between the host and the device. This can be done by only transferring data when it has been updated.

In the previous version, both modified functions, magnetic_susceptibility_gpu and magnetic_GP1_gpu, transfer the z_field array to the device. However, neither function changes the values of the elements of z_field. Therefore, the transfer of z_field in the magnetic_GP1_gpu function, which immediately follows the invocation of magnetic_susceptibility_gpu, is not necessary, and this redundant memory transfer can be removed from magnetic_GP1_gpu.
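In outline, the change is of the kind sketched below; the names magnetic_GP1_kernel, d_gp1, nblocks and nthreads are assumptions for illustration only.

/* Hypothetical sketch of the call sequence. z_field is copied to the device
   once; the second copy is removed because neither kernel modifies z_field. */

cudaMemcpy(d_zfield, z_field, zfield_bytes, cudaMemcpyHostToDevice);
magnetic_susceptibility_kernel<<<nblocks, nthreads>>>(d_zfield, d_trpp);

/* magnetic_GP1_gpu follows immediately and reuses d_zfield as-is, so its own
   cudaMemcpy of z_field is deleted */
magnetic_GP1_kernel<<<nblocks, nthreads>>>(d_zfield, d_gp1);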

4.3.1.3 Using mapped page-locked memory

Another way to eliminate the data transfer between host and device entirely is to use mapped pinned memory. Section 2.2.1 discussed the separate memory spaces in the CUDA programming model and the consequent need to transfer data between host and device. However, the latest version of CUDA provides a way for the device to access host memory directly. This is realised by mapped pinned memory, which resides on the host system but can be shared with the device. There are two scenarios where pinned memory is suitable: discrete GPUs and integrated GPUs. As integrated GPUs actually use the same memory as the host system, using mapped pinned memory is always beneficial for them [10]. Whether a given GPU is integrated can be verified by calling the CUDA runtime's cudaGetDeviceProperties function and checking the integrated member of the cudaDeviceProp structure. For the ness system with its integrated GPU, it should therefore be better to use mapped pinned memory. However, since the mapped buffer is shared by the CPU and GPU, developers must synchronise access using the existing context, stream or event APIs, and the introduced synchronisation overhead is large. In the modified CPN code, applying mapped pinned memory actually results in a big performance hit.
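A minimal sketch of how mapped pinned memory can be set up is shown below; the buffer name, size and launch parameters are assumptions for illustration, and the integrated member could be checked in the same way as canMapHostMemory.

/* Hypothetical sketch of using mapped pinned (page-locked) memory. */

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

if (prop.canMapHostMemory) {
    /* must be set before any CUDA context is created */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    double *h_zfield;            /* host pointer to the mapped buffer   */
    double *d_zfield_mapped;     /* device alias of the same buffer     */

    cudaHostAlloc((void **)&h_zfield, zfield_bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_zfield_mapped, h_zfield, 0);

    /* the kernel reads the host buffer directly; no cudaMemcpy is needed,
       but the CPU must not touch the buffer until the kernel has finished */
    magnetic_susceptibility_kernel<<<nblocks, nthreads>>>(d_zfield_mapped, d_trpp);
    cudaThreadSynchronize();
}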

4.3.1.4 Eliminating unnecessary global memory accesses

Given the huge difference in latency between global memory and on-chip shared memory, kernel accesses to global memory should also be minimised by maximising the use of shared memory on the device. Using shared memory is valuable when global memory is accessed many times inside a kernel. In the kernel function magnetic_susceptibility_kernel the array z_field is accessed frequently, so it is desirable to load the whole array from global memory into shared memory. At full occupancy, where each multiprocessor has the maximum number of resident blocks, each block can have at least 16KB/8 = 2KB of shared memory. This amount is enough to hold the whole z_field array when T is 16 and N is 2.

Listing 4.3: Loading global memory to shared memory

__global__ void magnetic_susceptibility_kernel2(double *z_field, double *trpp) {
    extern __shared__ double s_zfield[];

    int x1 = blockIdx.x / d_global.T,  t1 = blockIdx.x % d_global.T;
    int x2 = threadIdx.x / d_global.T, t2 = threadIdx.x % d_global.T;
    int j;
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    int k = 2 * d_global.N * threadIdx.x;

    /* each thread copies its 2*N elements of z_field into shared memory */
    #pragma unroll 32
    for (j = 0; j < 2 * d_global.N; j++)
        s_zfield[k + j] = z_field[k + j];
    __syncthreads();

    trpp[i] = d_TrPP(s_zfield + d_iz(t1, x1), s_zfield + d_iz(t2, x2));
}

But loading the whole array into shared memory is itself an overhead, and it also introduces a __syncthreads call to avoid read-after-write hazards, which reduces performance. Moreover, the global memory latency is already well hidden in the first modification of the code, which means there is little potential performance gain from using shared memory. Global memory latency is well hidden in this kernel for two reasons. Firstly, for the kernel function magnetic_susceptibility_kernel the block size is big (256 when T and L are set to 16). With such a block, each multiprocessor can hold 8 warps, which gives it great flexibility to switch between warps; as each warp takes 4 clock cycles per instruction, 8 warps can cover 32 cycles per instruction. The second reason is that this kernel contains many arithmetic instructions, and sufficient instructions give the thread scheduler more scope to schedule threads in a way that hides the global memory latency well.

For these reasons, loading the whole array into shared memory is not very beneficial. Nevertheless, it is best to avoid accessing global memory whenever possible.

4.3.2 Optimise instruction usage

Integer division is very costly. In the previous version of the kernel function there are four integer divisions, which should be removed. Instruction usage can also be optimised by using intrinsic functions, as mentioned in section 4.1. The benefit can be tested by replacing all the arithmetic operations with intrinsic functions; the intrinsic functions for double precision are __dadd_rn and __dmul_rn, and for integers __mul24. Applying the intrinsic functions results in a small performance gain for the kernel magnetic_susceptibility_kernel.
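As a small illustration (assuming, as in the runs above, that T is a power of two, e.g. 16), the integer division and modulo used to derive coordinates could be rewritten with shifts and masks, and the double-precision arithmetic expressed with the intrinsics named above. The kernel and helper below are hypothetical and not part of the CPN code.

/* Hypothetical sketch: replacing integer division/modulo and plain double
   arithmetic with cheaper equivalents, assuming T is a power of two. */

#define LOG2_T 4
#define T_MASK ((1 << LOG2_T) - 1)

__device__ double scaled_sum(double a, double b, double w)
{
    /* __dadd_rn / __dmul_rn: double-precision add/multiply, round-to-nearest */
    return __dadd_rn(a, __dmul_rn(w, b));
}

__global__ void coords_kernel(int *x_out, int *t_out)
{
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

    /* i / T and i % T rewritten as shift and mask */
    x_out[i] = i >> LOG2_T;
    t_out[i] = i & T_MASK;
}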


Chapter 5

Conclusions

This dissertation evaluates the potential benefit of porting a CPN code to GPU. With T and L set to 16, a final speed-up of 7.3 is achieved. With large values of T and L, the first version already gives a big performance gain for the selected functions: magnetic_susceptibility_gpu is 5 times faster than the serial code and magnetic_GP1_gpu is 21 times faster. However, for small values of T and L there is no performance gain. This is because GPUs are only suited to code with enough data parallelism to fully utilise the GPU resources; small values of T and L mean small loop sizes, which result in a big waste of the GPU resources.


Several optimisations have been performed to achieve further speed-ups, concentrating mainly on memory usage. The elimination of the redundant allocation and de-allocation is a performance win over the first version of the modification. Using mapped pinned memory does not perform as expected; it actually reduces performance dramatically, probably because of the introduced synchronisation overhead. The benefit of using shared memory instead of global memory is trivial, and using intrinsic functions gives a small performance gain.


Apart from the speed-up, this dissertation also discusses approaches for porting the code and strategies to follow for optimisation. Measurement of the code is always important, as it tells which parts of the code are worth porting and optimising.



Appendix A

Profile of CPN code

Each sample counts as 0.01 seconds.




  %   cumulative     self                self     total
 time    seconds   seconds       calls  s/call   s/call  name
36.65      84.67     84.67  2248179112    0.00     0.00  iz
21.53     134.42     49.74           6    8.29     8.29  auto_corr
 9.60     156.59     22.18   519536492    0.00     0.00  itheta
 6.92     172.58     15.98   777600000    0.00     0.00  TrPP
 4.89     183.87     11.29   104519206    0.00     0.00  ranlxd
 4.65     194.62     10.75      300000    0.00     0.00  magnetic_GP1
 3.15     201.89      7.27      300000    0.00     0.00  magnetic_susceptibility
 2.78     208.32      6.42   216430108    0.00     0.00  mul_zytheta
 2.43     213.93      5.61    48707527    0.00     0.00  local_action_unimproved
 0.94     216.11      2.18     2701287    0.00     0.00  update_fields_metropolis
 0.78     217.90      1.79    27012870    0.00     0.00  rand_dz
 0.70     219.52      1.63     5396903    0.00     0.00  update_fields_micro
 0.56     220.81      1.29     8170713    0.00     0.00  F_z_unimproved
 0.52     222.01      1.19     8321430    0.00     0.00  random_heat_angle
 0.47     223.08      1.08     2773810    0.00     0.00  update_fields_heat
 0.44     224.10      1.02    21600000    0.00     0.00  ImLnTrPPP
 0.38     224.98      0.88           6    0.15     8.44  print_stats
 0.33     225.75      0.77    54025740    0.00     0.00  rand_theta
 0.30     226.45      0.70      300000    0.00     0.00  plaquette
 0.28     227.09      0.64    11578052    0.00     0.00  h
 0.22     227.59      0.50      300000    0.00     0.00  action_unimproved
 0.21     228.07      0.48    10800000    0.00     0.00  Q
 0.19     228.52      0.45     8170713    0.00     0.00  normalise_z
 0.19     228.95      0.43                               safe_mod
 0.16     229.33      0.38     8170713    0.00     0.00  cos_real_xy
 0.15     229.67      0.34                               main
 0.14     229.99      0.32    29403905    0.00     0.00  copy_z
 0.12     230.27      0.28     8170713    0.00     0.00  F_theta1_unimproved
 0.11     230.52      0.25     8170713    0.00     0.00  mod_z
 0.09     230.73      0.21     8170713    0.00     0.00  F_theta0_unimproved
 0.04     230.82      0.09           1    0.09     0.09  rand_theta_field
 0.03     230.90      0.08           1    0.08     0.08  rlxd_init
 0.03     230.97      0.07      300000    0.00     0.00  topological_susceptibility
 0.02     231.02      0.05                               local_action_symanzik
 0.01     231.04      0.02                               F_z_symanzik
 0.00     231.05      0.01                               cool_fields_standard
 0.00     231.06      0.01                               copy_theta_field
 0.00     231.06      0.00          17    0.00     0.00  get_val
 0.00     231.06      0.00           6    0.00     8.29  sigma
 0.00     231.06      0.00           3    0.00     0.00  write_config_theta
 0.00     231.06      0.00           3    0.00     0.00  write_config_z
 0.00     231.06      0.00           1    0.00     0.00  rand_z_field
 0.00     231.06      0.00           1    0.00     0.00  read_input

Table 4: Profiles of CPN code




References

[1] Nvidia Corporation. NVIDIA CUDA Programming Guide, version 2.3.

[2] Tesla M1060 Processor. http://www.nvidia.com/object/product_tesla_m1060_us.html

[3] OpenCL. http://www.khronos.org/opencl

[4] David Tarditi, Sidd Puri, Jose Oglesby. Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses.

[5] S. Ryoo, C. Rodrigues, S. Baghsorkhi, S. Stone, D. Kirk, W. Hwu. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA.

[6] An Introduction to Modern GPU Architecture. http.download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf

[7] Parallel Numerical Methods for Partial Differential Equations. http://www.caam.rice.edu/~timwar/RMMC/CUDA.html

[8] Thrust. http://code.google.com/p/thrust/

[9] CUDA Data Parallel Primitives Library. http://code.google.com/p/cudpp/

[10] CUDA 2.2 Pinned Memory APIs. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html

[11] Mark Harris. Optimizing Parallel Reduction in CUDA.

[12] CUDA-GDB User Manual. http://developer.nvidia.com/object/cuda_3_1_downloads.html

[13] GPU computing: high performance computing at its best. http://www.jenniferdejoya.com/2010/04/gpu-computing-high-performance-computing-at-its-best.html

[14] D. Kirk, W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach, 2010.

[15] J. Nickolls, I. Buck, M. Garland, K. Skadron. Scalable Parallel Programming with CUDA. http://queue.acm.org/detail.cfm?id=1365500

[16] Nvidia Corporation. CUDA C Best Practices Guide, version 3.2.

[17] HPCwire. http://www.hpcwire.com/features/The-Week-in-Review-20100527.html

[18] Vanderbilt News. "New graphics processor cluster gives Vanderbilt supercomputer a major boost".