D3.2 Report on CUDA porting tool




This document is produced under the EC contract 286770. It is the property of the FASTCUDA consortium and shall not be distributed or reproduced without the formal approval of the FASTCUDA Steering Committee.



FASTCUDA Project No: 286770




D3.2 Report on CUDA porting tool


31st October 2012

Abstract:

This deliverable describes the porting tool that is used in order to run CUDA kernels on the multi-core processor platform. There are several challenging goals in effectively translating CUDA applications to C-based applications. For example, in a typical usage case of the CUDA programming model, programmers specify hundreds to thousands of small, simultaneously active threads to achieve full utilization of GPU execution resources. However, current CPU architectures support only up to tens of active thread contexts. Also, each CUDA thread block should be scheduled to a single core for locality, yet maintain the ordering semantics imposed by potential barrier synchronization points. Finally, the SIMD-like nature of the CUDA threads in many applications should be clearly exposed to the compiler. However, this goal is in conflict with supporting arbitrary control flow among CUDA threads. The translation techniques of the MCUDA tool address most of these challenges and provide an efficient mapping of CUDA kernels to a multi-processor platform. The FASTCUDA porting tool will incorporate the MCUDA flow and will tune various parameters, such as the number of CUDA threads in a block and the number of blocks, in order to optimize the performance.


Document Manager: Iakovos Mavroidis (TSI), Researcher

Document Id N°: Report on CUDA porting tool
Version: V0.1
Date: 9/11/12
Filename: FASTCUDA-D3.2_TSI_V0.1-09112012

Disclaimer

This document contains material, which is the copyright of certain FASTCUDA contractors, and may not be reproduced or copied without permission. All FASTCUDA consortium partners have agreed to the full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information.




The FASTCUDA Consortium consists of the following companies:

Participant no.    Participant organisation name                          Short name   Country
P1 (Coordinator)   Ingenieria de Sistemas Intensivos en Software Ltd      ISIS         Spain
P2                 Politecnico di Torino                                  POLITO       Italy
P3                 Universidad Politécnica de Madrid                      UPM          Spain
P4                 Telecommunication Systems Institute (TSI - Technical   TSI          Greece
                   University of Crete)
P5                 Ardoran                                                ARD          Estonia
P6                 FSResult GmbH                                          FSR          Germany


The information in this document is provided “as is” and no guarantee or warranty is given that the
information is fit for any particular purpose. The user thereof
uses the information at its sole risk and liability.


An up-to-date version of this document can be found on FASTCUDA's website (http://fastcuda.eu/techreports.html).






1. INTRODUCTION


In this deliverable we describe in detail the flow from the input CUDA code (application code) to the executable code that runs on the multi-core processor consisting of multiple interconnected Microblaze processors. In order to run a CUDA application on the multi-core processor we used the MCUDA open-source tool provided by the University of Illinois.

The porting of a CUDA kernel onto FASTCUDA's multi-core processor is shown in Figure 1. First, MCUDA is used to transform the CUDA kernel into thread-based C code (see also D1.1). Each thread performs all the functionality of a CUDA thread-block. By having multiple thread-blocks in the original CUDA code we can derive multiple threads at the output of the MCUDA-transformed code. These "MCUDA threads" will be scheduled to the available cores of the multi-core processor. We can safely assume that the workload of the original CUDA code is equally balanced between the thread blocks, and therefore the optimum scheduling policy is to schedule the same number of threads to each core, which can be achieved by using a round-robin scheduler.

In the example shown in Figure 1, the original CUDA kernel is organized into 6 thread-blocks and therefore MCUDA will generate 6 threads (one for each thread-block). This transformation step is performed at compile time, before the execution of the code on the multi-core processor. Each MCUDA thread now represents a CUDA thread block, and the original CUDA shared memory of the thread-block has been moved into the stack memory of the MCUDA thread. In the original CUDA code there were thread blocks with private shared memories. Since each CUDA thread block has been transformed into a single thread, the shared data that can be accessed only by the thread block can be stored in the stack of that thread.
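To make this concrete, the following minimal sketch (our own illustrative code, not the actual MCUDA output; the names scale_block, buf and BLOCK_SIZE are ours) shows how a kernel fragment that uses CUDA shared memory might look after its thread-block has become a single C thread: the former __shared__ array becomes an automatic variable, so it lives on the stack of the thread, i.e. in the private scratchpad of the core that runs it.

/* Per-thread CUDA view (illustrative fragment, names are ours):
 *     __shared__ float buf[BLOCK_SIZE];
 *     int i = blockIdx.x * blockDim.x + threadIdx.x;
 *     buf[threadIdx.x] = in[i] * factor;
 *     out[i] = buf[threadIdx.x];
 */

#define BLOCK_SIZE 64   /* assumed thread-block size for this sketch */

/* Per-block C function after the transformation (sketch): one OS thread
 * executes the whole thread-block, and the former __shared__ array is
 * now a stack variable of that thread.                                 */
void scale_block(float *in, float *out, float factor, int blockIdx_x)
{
    float buf[BLOCK_SIZE];                       /* former __shared__ array   */
    for (int tid = 0; tid < BLOCK_SIZE; tid++) { /* loop over logical threads */
        int i = blockIdx_x * BLOCK_SIZE + tid;
        buf[tid] = in[i] * factor;
        out[i]  = buf[tid];
    }
}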


Figure 1. Execution Flow of CUDA kernel






The OS-level software running on our multi-core processor is a modified version of the Xilinx kernel "Xilkernel". Xilkernel supports POSIX threads, mutexes and semaphores, but was targeted at single-core processors, thus having no support for a symmetric multiprocessing (SMP) environment like ours. We consequently have to add SMP support to Xilkernel. Moreover, we have to modify the thread scheduler of the Xilinx kernel to provide round-robin (RR) scheduling. In the example shown in Figure 1, the Xilkernel schedules the 6 MCUDA threads between the two Microblazes in a RR fashion (three threads are scheduled to each Microblaze processor).

The Microblaze supports two separate data memory spaces: a) a private, non cache-coherent memory space which is used to store the stacks of the threads. As described above, the CUDA shared memory has been moved into the thread stack, and therefore the data of the original CUDA shared memory is stored in this memory space along with the other data of the thread's stack; and b) a global, cache-coherent memory space (for simplicity we will use a single data cache instead of implementing a complicated coherence protocol with multiple caches), which is used for storing the global data of the CUDA kernel. Moreover, the global memory space is used by the Xilkernel and the CUDA host program.

In Section 2 we provide a short description of the CUDA programming model, and in Section 3 we review the multi-core architecture in order to better understand the flow. In Section 4 we describe how MCUDA works, and finally in Section 5 we show the flow in detail.







2. CUDA PROGRAMMING MODEL


This section briefly describes the CUDA programming model in order to better understand the CUDA porting tool.

CUDA is a data-parallel programming model that supports several key abstractions (thread blocks, hierarchical memory and barrier synchronization), allowing for efficient application development. In CUDA, the routines of an application are split into two groups: those that can benefit from multi-threaded parallel execution and those that cannot. The first group of routines, called the "CUDA kernels", are written in standard C/C++ using special annotations and constructs to specify the parallelism and the memory hierarchy. The second group of routines, called the "CUDA host program", are written in standard C/C++.

Execution starts with the CUDA host program running single-threaded on the host CPU. Whenever a CUDA kernel is invoked, the host CPU dispatches the execution of the kernel to an accelerator (separate device) that supports parallel execution of multiple threads. Traditionally these are Nvidia's GPUs or other multi-core platforms. However, we believe that even higher acceleration can be obtained if a CUDA kernel is synthesized into hardware and mapped onto an FPGA for execution. Therefore, FASTCUDA employs a hybrid approach: it uses an FPGA-based accelerator for executing the time-critical CUDA kernels and a multi-core processor for executing the CUDA kernels that could not fit in the FPGA fabric.



// kernel
__global__ void vectorAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

#define N 100
#define M N*sizeof(float)

// host program
int main() {
    float A[N], B[N], C[N];
    ...
    // copy input vectors from host memory to device memory
    cudaMemcpy(d_A, A, M, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, M, cudaMemcpyHostToDevice);
    // kernel invocation
    vectorAdd<<<1,N>>>(d_A, d_B, d_C);
    // copy output vectors from device memory to host memory
    cudaMemcpy(C, d_C, M, cudaMemcpyDeviceToHost);
    ...
}

Figure 2. Example CUDA code

CUDA is a SIMD architecture and programming model initially developed by Nvidia for its GPUs. Figure 2 shows an example CUDA code that adds two arrays, A and B, into a resulting array C. The addition is performed in a CUDA "kernel" that runs in parallel across multiple cores in a SIMD fashion. The CUDA kernels are invoked by the CUDA "host program", which runs serially on a single core.

Each kernel implicitly describes multiple CUDA threads that are organized in groups, called "thread-blocks". Thread-blocks are further organized into a grid structure. Threads within a thread-block are executed by a single "streaming multiprocessor" inside a GPU and are synchronized and share data through a fast and small private memory of the streaming multiprocessor, called "shared memory". On the other hand, synchronization between threads belonging to different thread-blocks is not supported. However, a slow and large "global memory" is accessible by all thread-blocks. Similar to a GPU, FASTCUDA employs two separate memory spaces (global and local), as well as a similar mapping of the block-threads onto the FPGA resources, as described below.







3. MULTI-CORE PROCESSOR

The CUDA host program, as well as the SW kernels (the subset of the kernels determined by the design space exploration tool), will run in software on the multi-core processor. In this section we first review the architecture of this processor and the operating system that was employed in order to execute multi-threaded applications. Next, we discuss the required steps in order to port the CUDA source code to this architecture.


3.1. SHORT DESCRIPTION OF ARCHITECTURE



Figure 3. Multi-Core Processor Architecture


Figure 3 shows our prototype's multi-core processor architecture. It uses Xilinx Microblaze soft cores (configurable small processors) with separate instruction caches and a shared data cache, all communicating through two AXI4-based buses. FASTCUDA uses a thread mapping scheme that is similar to what is used by a GPU. Each core executes a thread-block, which can use the core's scratchpad memory as a thread-private and block-shared local memory. All the threads from any thread-block can access the global memory, which can also be accessed by the HW accelerators (notice the connection on the AXI4 bus in the Figure).

The AXI4_Lite bus is used for the communication between the multi-core processor and the Accelerator block that is running the HW kernels. A simple handshake protocol is employed to pass the arguments and invoke a specific HW kernel, which will then respond back when it has finished running. According to the CUDA approach, the host can go on running after a kernel invocation. However, if another kernel is launched before the previous one has finished, the host will be blocked until the first kernel finishes.
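The register layout of this handshake is defined by the FASTCUDA hardware design and is not specified in this document; purely as an illustration of the general idea, the sketch below assumes a hypothetical memory-mapped interface (the base address, offsets and function names are ours) and waits synchronously for completion, whereas in the actual flow the host only blocks when it launches a second kernel before the first has finished.

#include <stdint.h>

#define ACC_BASE   0x44A00000u                  /* assumed base address of the block  */
#define REG_START  (ACC_BASE + 0x00)            /* write 1 to start the HW kernel     */
#define REG_DONE   (ACC_BASE + 0x04)            /* reads 1 when the kernel is done    */
#define REG_ARG(n) (ACC_BASE + 0x10 + 4*(n))    /* argument registers                 */

static inline void reg_write(uint32_t addr, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)addr = val;
}

static inline uint32_t reg_read(uint32_t addr)
{
    return *(volatile uint32_t *)(uintptr_t)addr;
}

/* Pass the arguments, start the HW kernel and wait for its completion flag. */
void hw_kernel_invoke(const uint32_t *args, int nargs)
{
    for (int n = 0; n < nargs; n++)
        reg_write(REG_ARG(n), args[n]);
    reg_write(REG_START, 1);
    while (reg_read(REG_DONE) == 0)
        ;   /* simple handshake: poll until the accelerator responds back */
}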

Lastly, the timer and mutex blocks on the AXI4_Lite bus are a requirement for the symmetric multiprocessing (SMP) support of the runtime that is running on the processor, as we will explain in the following section.

Notice that the number of cores, as well as the data cache size and organization (single or multiple banks), the configuration of the Microblazes, and other configuration parameters are application-dependent and are determined by the design space exploration tool described in the previous section according to the requirements of the CUDA application.


3.2. OPERATING SYSTEM

The OS-level software running on our multi-core processor is a modified version of the Xilinx kernel "Xilkernel". Xilkernel supports POSIX threads, mutexes and semaphores, but was meant to run on a single core, thus having no support for an SMP environment like ours. We consequently had to add SMP support to Xilkernel, following the methodology described by Huerta et al. (Pablo Huerta, Javier Castillo, Carlos Sánchez, Jose Ignacio Martínez, "Operating System for Symmetric Multiprocessors on FPGA", ReConFig'08, pp. 157-162).

However, this methodology assumes that the cores access a shared global memory where the operating system and the threads reside. A global "ready" queue holds all the threads that are ready for execution, and during a context switch a core is free to pick any thread from the ready queue. In this way a thread may execute on different cores during its lifetime. We had to modify this approach in order to support separate private scratchpad memories per core, as shown in Figure 4.

In FASTCUDA, the stack memory of a thread is stored in a core's fast local scratchpad memory instead of the slow global memory. Therefore, a thread can run only on the core that holds its stack memory. We modified Xilkernel in order to support a "ready" queue per core, as well as to assign new threads to the cores in a round-robin fashion, as shown in Figure 4. When a new thread is created it is stored in the "ready" queue of the scheduled core, which is responsible for its execution. In parallel, the stack memory of the thread is allocated in the local scratchpad memory of the scheduled core.
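The following is a minimal sketch of these scheduling data structures (our own simplified C, not the actual Xilkernel sources; all names and sizes are assumptions): each core owns a private ready queue, new threads are assigned to cores round-robin, and each new thread's stack is allocated from the scratchpad of the core it was assigned to.

#define NUM_CORES       2
#define STACKS_PER_CORE 8       /* no overflow handling in this sketch */
#define STACK_SIZE      4096

struct thread {
    void (*entry)(void *);
    void *arg;
    void *stack;                            /* lies in the owning core's scratchpad */
    struct thread *next;
};

struct core {
    struct thread *ready_head;              /* per-core "ready" queue */
    struct thread *ready_tail;
    char stacks[STACKS_PER_CORE][STACK_SIZE];   /* local scratchpad stacks */
    int next_free_stack;
};

static struct core cores[NUM_CORES];
static int rr_next;                         /* next core, round-robin */

/* Assign a new thread to a core in round-robin order, allocate its stack
 * from that core's scratchpad and append it to that core's ready queue. */
void thread_create_rr(struct thread *t, void (*entry)(void *), void *arg)
{
    struct core *c = &cores[rr_next];
    rr_next = (rr_next + 1) % NUM_CORES;

    t->entry = entry;
    t->arg   = arg;
    t->stack = c->stacks[c->next_free_stack++];
    t->next  = 0;

    if (c->ready_tail) c->ready_tail->next = t;
    else               c->ready_head = t;
    c->ready_tail = t;
}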


Figure 4. Scheduling of threads to cores





4. OVERVIEW OF MCUDA

The MCUDA translation framework is a Linux-based tool designed to effectively compile the CUDA programming model to a CPU architecture. The MCUDA system translates CUDA applications into efficient parallel CPU programs. Automatic translation of the thread blocks is composed of a few key code transformations: iterative wrapping, synchronization enforcement, and data buffering. All transformations are performed on the program's abstract syntax tree (AST).

The first step in the transformation changes the nature of the kernel function from a per-thread code specification to a per-block code specification. This means that the implicit threadIdx variable now needs to be explicitly included, with control flow introduced to perform a logical thread's computation for each value of threadIdx within a single OS thread.
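As an illustration (a simplified sketch in the spirit of this "iterative wrapping", not the literal MCUDA output; the function name vectorAdd_block is ours), the vectorAdd kernel of Figure 2 would be rewritten from a per-thread specification into a per-block function in which an explicit loop enumerates the values of threadIdx:

/* Per-thread CUDA specification (from Figure 2):
 *     int i = threadIdx.x;
 *     C[i] = A[i] + B[i];
 */

/* Per-block C specification after iterative wrapping (sketch): the
 * implicit threadIdx becomes an explicit loop variable, so one OS
 * thread performs the computation of every logical thread of a block. */
void vectorAdd_block(float *A, float *B, float *C,
                     int blockIdx_x, int blockDim_x)
{
    for (int tid = 0; tid < blockDim_x; tid++) {    /* "thread loop" */
        int i = blockIdx_x * blockDim_x + tid;
        C[i] = A[i] + B[i];
    }
}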

In order to enforce synchronization between the threads of a block, MCUDA defines a synchronization statement to be a statement or control structure in the program that all logical threads must enter and leave synchronously. This means that no logical thread can begin executing a synchronization statement before all other logical threads reach that synchronization statement, and all logical threads must complete the synchronization statement before any logical thread can continue past it. A thread loop is an instance of a synchronous statement, for example. A programmer-specified synchronization point is an example of a synchronization statement that contains no computation.
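For example (again an illustrative sketch rather than actual MCUDA output; the kernel fragment and the name reverse_block are ours), a __syncthreads() barrier can be enforced by splitting the thread loop at the synchronization point, so that all logical threads finish the part before the barrier before any of them starts the part after it:

/* Kernel fragment with a barrier (per-thread view, illustrative):
 *     buf[threadIdx.x] = in[i];
 *     __syncthreads();
 *     out[i] = buf[blockDim.x - 1 - threadIdx.x];
 */

/* Transformed per-block code (sketch): the thread loop is split in two,
 * which preserves the ordering imposed by the barrier. Values that live
 * across the split are kept in per-thread buffers ("data buffering").  */
void reverse_block(float *in, float *out, float *buf,
                   int blockIdx_x, int blockDim_x)
{
    for (int tid = 0; tid < blockDim_x; tid++)      /* before the barrier */
        buf[tid] = in[blockIdx_x * blockDim_x + tid];

    for (int tid = 0; tid < blockDim_x; tid++)      /* after the barrier  */
        out[blockIdx_x * blockDim_x + tid] = buf[blockDim_x - 1 - tid];
}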

After these transformations the kernels are defined as block-level functions, and all that remains is, on kernel invocation, to iterate through the specified block indices and call the transformed function once for every specified block index. For a CPU that gains no benefit from multithreading, this is an efficient way of executing the kernel computation. However, CPU architectures that do gain performance benefits from multithreading will likely not achieve full efficiency with this method.

Since these blocks can execute independently according to the programming model, it is trivial to have multiple OS threads partition the set of block indices among themselves and execute blocks concurrently on multithreaded CPU architectures. Many frameworks exist for such work distribution, such as OpenMP or Threading Building Blocks. The specific implementation uses POSIX threads as an example of how thread blocks can be efficiently scheduled.

In the host code, the kernel launch statement is translated into a function call to the runtime kernel launch routine. The function call specifies a reference to the kernel function to be invoked, the kernel configuration parameters, and the parameters to the kernel function itself. In the run-time library kernel launch routine, the host thread stores the kernel launch information into global variables and enters a barrier synchronization point. A statically created pool of worker pthreads, representing the device in the CUDA model, also enters the barrier. On exiting, each worker thread reads the kernel launch data and begins executing blocks. The host thread then enters a second barrier to wait for kernel completion before returning to the host code.
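A minimal pthread-based sketch of this launch protocol is given below (our own simplified code, not the actual MCUDA runtime; all names, the kernel-function signature and the worker count are assumptions). The host publishes the launch information, host and workers meet at a first barrier, each worker executes a contiguous share of the blocks (the static method), and a second barrier signals completion.

#include <pthread.h>

#define NUM_WORKERS 2

/* kernel launch information published by the host thread */
static void (*g_kernel)(int blockIdx, int blockDim, void *args);
static int   g_gridDim, g_blockDim;
static void *g_args;

static pthread_barrier_t start_bar, done_bar;   /* host + workers */

static void *worker(void *idp)
{
    int id = *(int *)idp;
    for (;;) {
        pthread_barrier_wait(&start_bar);        /* wait for a kernel launch   */
        /* static method: a contiguous range of block indices per worker */
        int first = (id * g_gridDim) / NUM_WORKERS;
        int last  = ((id + 1) * g_gridDim) / NUM_WORKERS;
        for (int b = first; b < last; b++)
            g_kernel(b, g_blockDim, g_args);
        pthread_barrier_wait(&done_bar);         /* report completion          */
    }
    return 0;
}

/* Create the worker pool once, at program start. */
void runtime_init(void)
{
    static int ids[NUM_WORKERS];
    pthread_barrier_init(&start_bar, 0, NUM_WORKERS + 1);
    pthread_barrier_init(&done_bar,  0, NUM_WORKERS + 1);
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_t t;
        ids[i] = i;
        pthread_create(&t, 0, worker, &ids[i]);
    }
}

/* Called by the translated host code in place of kernel<<<grid,block>>>(...). */
void launch_kernel(void (*kernel)(int, int, void *),
                   int gridDim, int blockDim, void *args)
{
    g_kernel = kernel; g_gridDim = gridDim;
    g_blockDim = blockDim; g_args = args;
    pthread_barrier_wait(&start_bar);            /* release the worker pool    */
    pthread_barrier_wait(&done_bar);             /* wait for kernel completion */
}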

The MCUDA runtime includes support for static and dynamic methods of assigning computation to CPU threads. The static method distributes a contiguous set of blocks to each worker thread. Any thread is assigned at most one additional block compared to any other thread. Each thread then executes independently until completing its set. Under the dynamic method, each worker thread iteratively acquires and executes blocks until all blocks in the kernel have been issued. Each OS thread, when requesting a block to execute, atomically loads the current block index, represented by a global variable. If it is within the range specified by the kernel launch configuration parameters, it executes that block and increments the current block index to mark that the block is being processed. Otherwise, all blocks in the kernel have been issued. In both methods, when each worker thread completes processing, it enters the barrier at which the host thread is waiting. When all worker threads reach the barrier, the kernel execution has completed, and the host thread is allowed to leave the barrier and return to the host code.
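The dynamic method can be sketched as follows (again our own simplified code, not the MCUDA sources; a mutex-protected counter stands in for the atomic load-and-increment of the shared block index, and the counter is assumed to be reset to 0 by the host before it releases the workers):

#include <pthread.h>

/* shared block index for the dynamic method */
static int g_next_block;
static pthread_mutex_t g_block_lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomically claim the next unprocessed block index; returns -1 when
 * every block of the current kernel launch has already been issued.   */
static int claim_block(int gridDim)
{
    int b;
    pthread_mutex_lock(&g_block_lock);
    b = (g_next_block < gridDim) ? g_next_block++ : -1;
    pthread_mutex_unlock(&g_block_lock);
    return b;
}

/* Body executed by each worker between the two barriers (dynamic method):
 * keep acquiring and executing blocks until the whole grid is issued.  */
static void run_blocks_dynamic(void (*kernel)(int, int, void *),
                               int gridDim, int blockDim, void *args)
{
    int b;
    while ((b = claim_block(gridDim)) >= 0)
        kernel(b, blockDim, args);
}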







5. IMPLEMENTING CUDA KERNELS ON THE MULTI-CORE PROCESSOR

The programmer specifies data-parallel CUDA kernels, expressed as parallel logical threads that execute cooperatively in CUDA thread blocks. CUDA kernels are supposed to run on SIMT devices (i.e. GPUs), which are drastically different from our multi-core processor. Thus, the next step is to port the CUDA kernels to run on top of the multi-core, multi-threaded environment provided by our modified Xilkernel.

A closer view of the CUDA programming model shows that we can have an efficient mapping of the execution of a CUDA kernel onto a multi-core architecture. At the first granularity of parallelism, blocks can execute completely independently. Thus, if all logical threads within a block run on the same CPU core, there is no need for inter-core synchronization during the execution of the blocks. We could successfully implement this parallel execution of CUDA kernels on the FASTCUDA multi-core processor using MCUDA.

MCUDA transforms the CUDA code into thread-based C code that uses the MCUDA runtime in order to create a pool of threads and coordinate the operations of the threads, as well as to provide the basic CUDA runtime functionality for kernel invocation and data movements.

In particular, MCUDA transforms the code executed by a single CUDA thread block into its equivalent sequential code (i.e. single instruction, single data). At a kernel invocation, the MCUDA runtime executes this sequential code once for each thread block. These multiple executions of the sequential code are performed by the threads of the MCUDA thread pool.

MCUDA runs inside the Xilkernel environment described in Section 3.2. We have configured the size of the MCUDA thread pool to be equal to the number of cores, and therefore Xilkernel will schedule a single thread of the MCUDA runtime to each core. This whole process is depicted in Figure 5.


Figure 5. Scheduling of threads to cores


The compilation process that results in a single executable code is shown in the following Figure.







Figure 6. Software Porting Process


Xilkernel provides the mutex support required by the MCUDA library and the thread-based support required by the multi-threaded SW kernels.


5.1. IMPLEMENTATION

The Figure below shows the porting of the MCUDA library to the Xilinx Software Development Kit (SDK). The files in Figure 7 provide the MCUDA runtime shown in Figure 6, which is compiled together with the transformed thread-based C code and the Xilkernel.


Figure 7. MCUDA library on Xilinx SDK


So far we have tested the porting tool by running some simple CUDA applications. We used the CUDA code of a matrix multiplication kernel of two 256x256 element matrices. The source code was derived from the CUDA SDK 4.2 (https://developer.nvidia.com/cuda-downloads).


