
A Source-to-Source OpenACC compiler for CUDA

Akihiro Tabuchi †1, Masahiro Nakao †2, Mitsuhisa Sato †1

†1 Graduate School of Systems and Information Engineering, University of Tsukuba
†2 Center for Computational Sciences, University of Tsukuba


Outline

- Background
- OpenACC
- Compiler Implementation
- Performance Evaluation
- Conclusion & Future Work

Background

- Accelerator programming models
  - CUDA (for NVIDIA GPUs)
  - OpenCL (for various accelerators)
- Accelerator programming is complex
  - memory management, kernel functions, …
  - low productivity & low portability
- OpenACC was proposed to solve these problems

OpenACC

- A directive-based programming model for accelerators
  - supports C, C++, and Fortran
- Offloading model
  - offloads parts of the code to an accelerator
- High productivity
  - only requires adding directives
- High portability
  - runs on any accelerator, as long as a compiler supports it



Example of OpenACC

    int main(){
        int i;
        int a[N], b[N], c[N];

        /* initialize arrays 'a' and 'b' */

        #pragma acc parallel loop copyin(a,b) copyout(c)
        for (i = 0; i < N; i++){
            c[i] = a[i] + b[i];
        }
    }

This directive specifies the data transfers, the loop offloading, and the parallelization.

Purpose of Research

- Design and implement an open-source OpenACC compiler
  - Target language: C
  - Target accelerator: NVIDIA GPU
- Source-to-source approach
  - C + OpenACC → C + CUDA API
  - This approach leaves detailed machine-specific code optimization to NVIDIA's mature CUDA compiler
  - The result of compilation is an executable file




Related Work

- Commercial compilers
  - PGI Accelerator compiler
  - CAPS HMPP
  - Cray compiler
- Open-source compiler
  - accULL
    - developed at the University of La Laguna in Spain
    - source-to-source translation
    - backends are CUDA and OpenCL
    - output is code and a Makefile

OpenACC directives (OpenACC specification 1.0)

- parallel
- kernels
- loop
- data
- host_data
- update
- wait
- cache
- declare
- parallel loop
- kernels loop


data construct

    int a[4];
    #pragma acc data copy(a)
    {
        /* some code using 'a' */
    }

[Figure: 'a' is copied from host memory to device memory before the computation on the device, and back afterwards]

Data management on the accelerator — if an array is specified in a "copy" clause:

1. Device memory allocation (at the beginning of the region)
2. Data transfer from host to device (at the beginning of the region)
3. Data transfer from device to host (at the end of the region)
4. Device memory release (at the end of the region)
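For reference, these four steps correspond to the following plain CUDA API calls (a generic illustration with hypothetical variable names; the runtime calls actually emitted by the compiler are shown on the next slide):

    int a[4];
    void *dev_a;
    /* 1. device memory allocation */
    cudaMalloc(&dev_a, 4 * sizeof(int));
    /* 2. data transfer from host to device */
    cudaMemcpy(dev_a, a, 4 * sizeof(int), cudaMemcpyHostToDevice);
    /* ... computation on the device ... */
    /* 3. data transfer from device to host */
    cudaMemcpy(a, dev_a, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    /* 4. device memory release */
    cudaFree(dev_a);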

Translation of data construct

    int a[4];
    #pragma acc data copy(a)
    {
        /* some code using 'a' */
    }

is translated to:

    int a[4];
    {
        void *_ACC_DEVICE_ADDR_a, *_ACC_HOST_DESC_a;
        /* allocate 'a' on GPU */
        _ACC_gpu_init_data(&_ACC_HOST_DESC_a, &_ACC_DEVICE_ADDR_a, a, 4*sizeof(int));
        /* copy 'a' to GPU from host */
        _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 400);
        {
            /* some code using 'a' */
        }
        /* copy 'a' to host from GPU */
        _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 401);
        /* free 'a' on GPU */
        _ACC_gpu_finalize_data(_ACC_HOST_DESC_a);
    }

The host descriptor holds the host address, the device address, the size, ….


parallel construct

- Code in a parallel region is executed on the device
- Three levels of parallelism: gang, worker, and vector

    #pragma acc parallel num_gangs(1) vector_length(128)
    {
        /* code in parallel region */
    }

The OpenACC parallelism maps to CUDA as follows:

    OpenACC | CUDA
    --------+--------------
    gang    | thread block
    worker  | (warp)
    vector  | thread

The number of gangs or workers, or the vector length, can be specified by clauses.
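In plain CUDA terms, the table above means a gang index becomes a block index and a vector index becomes a thread index (a generic illustration of the mapping; the code our compiler actually generates is shown on the next slide):

    /* generic CUDA kernel illustrating the mapping */
    __global__ void kernel(int *a)
    {
        int gang_id   = blockIdx.x;    /* gang   -> thread block */
        int vector_id = threadIdx.x;   /* vector -> thread within a block */
        /* a worker, when used, corresponds to a warp of threads */
        a[gang_id * blockDim.x + vector_id] += 1;
    }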

Translation of parallel construct

    #pragma acc parallel num_gangs(1) vector_length(128)
    {
        /* code in parallel region */
    }

is translated to:

    /* GPU kernel function */
    __global__ static void _ACC_GPU_FUNC_0_DEVICE( ... )
    {
        /* code in parallel region */
    }

    /* kernel launch function */
    extern "C" void _ACC_GPU_FUNC_0( ... )
    {
        dim3 _ACC_block(1, 1, 1), _ACC_thread(128, 1, 1);
        _ACC_GPU_FUNC_0_DEVICE<<<_ACC_block, _ACC_thread>>>( ... );
        _ACC_GPU_M_BARRIER_KERNEL();
    }

loop construct

    /* inside parallel region */
    #pragma acc loop vector
    for (i = 0; i < 256; i++){
        a[i]++;
    }

- The loop construct describes the parallelism of a loop
  - loop iterations are distributed among gangs, workers, or vector lanes
  - two or more levels of parallelism can be specified for one loop (a combined example follows below)
- Loops without a loop directive inside a parallel region are basically executed serially.
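For example, gang and vector parallelism can be combined on a single loop (a generic illustration, not from the slides):

    /* iterations are distributed across both thread blocks and threads */
    #pragma acc loop gang vector
    for (i = 0; i < 256; i++){
        a[i]++;
    }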


Translation of loop construct (1/3)

    /* inside parallel region */
    #pragma acc loop vector
    for (i = 0; i < N; i++){
        a[i]++;
    }

1. A virtual index with the same length as the loop iteration space is prepared
2. The virtual index is divided and distributed among blocks and/or threads
3. Each thread calculates the value of the loop variable from the virtual index and executes the loop body

[Figure: a virtual index 0–15 is divided and distributed among threads]

Translation of loop construct (2/3)

    /* inside parallel region */
    #pragma acc loop vector
    for (i = 0; i < N; i++){
        a[i]++;
    }

is translated to:

    /* inside GPU kernel code */
    int i, _ACC_idx;                        /* virtual index */
    int _ACC_init, _ACC_cond, _ACC_step;    /* virtual index range variables */
    /* calculate the range of the virtual index */
    _ACC_gpu_init_thread_x_iter(&_ACC_init, &_ACC_cond, &_ACC_step, 0, N, 1);
    for (_ACC_idx = _ACC_init; _ACC_idx < _ACC_cond; _ACC_idx += _ACC_step){
        /* calculate 'i' from the virtual index */
        _ACC_gpu_calc_idx(_ACC_idx, &i, 0, N, 1);
        a[i]++;   /* loop body */
    }
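The slides do not show the bodies of these runtime functions; the following is a minimal sketch of one plausible implementation, assuming a cyclic distribution at the vector (thread) level (hypothetical code — the actual Omni runtime may divide the virtual index differently):

    /* hypothetical sketch of the two runtime calls above */
    __device__ static void _ACC_gpu_init_thread_x_iter(int *init, int *cond, int *step,
                                                       int lower, int upper, int stride)
    {
        int n_iter = (upper - lower + stride - 1) / stride;  /* virtual index length */
        *init = threadIdx.x;   /* each thread starts at its own offset */
        *cond = n_iter;        /* iterate up to the end of the virtual range */
        *step = blockDim.x;    /* stride by the number of threads in the block */
    }

    __device__ static void _ACC_gpu_calc_idx(int idx, int *i,
                                             int lower, int upper, int stride)
    {
        *i = lower + idx * stride;  /* recover the loop variable from the virtual index */
    }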

Translation of loop construct (3/3)

- Our compiler supports 2D blocking for nested loops
  - nested loops are distributed among the 2D thread blocks of the 2D grid in CUDA (the default block size is 16x16)
  - this is not allowed in OpenACC 2.0, which provides the "tile" clause instead (see the sketch after this slide)

    #pragma acc loop gang vector
    for (i = 0; i < N; i++)
        #pragma acc loop gang vector
        for (j = 0; j < N; j++)
            /* … */

[Figure: the iteration space is distributed over 2D blocks in a 2D grid]
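Under OpenACC 2.0, the same effect is requested with the "tile" clause (a sketch assuming an OpenACC 2.0 compiler):

    /* OpenACC 2.0 style: 16x16 tiling of the nested loops */
    #pragma acc parallel loop tile(16,16)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            /* … */;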

Compiler Implementation

- Our compiler translates C with OpenACC directives into C with CUDA API calls
  - it reads C code with directives and outputs the translated code
  - it uses the Omni compiler infrastructure
- Omni compiler infrastructure
  - a set of programs for building source-to-source compilers with code analysis and transformation
  - supports C and Fortran95



Flow of Compilation

    sample.c (C with OpenACC directives)
        |  Omni Frontend
        v
    sample.xml (XcodeML)
        |  OpenACC translator
        v
    sample_tmp.c (C with ACC API)    sample.cu (CUDA)
        |  C compiler                    |  nvcc
        v                                v
    sample_tmp.o                     sample.gpu.o
        \                                /
         +-- linked with the acc runtime
                        |
                        v
                      a.out

The Omni Frontend and the OpenACC translator are provided by the Omni compiler infrastructure.

Performance Evaluation

- Benchmarks
  - Matrix multiplication
  - N-body problem
  - NAS Parallel Benchmarks CG
- Evaluation environment
  - 1 node of a Cray XK6m-200
    - CPU: AMD Opteron Processor 6272 (2.1 GHz)
    - GPU: NVIDIA X2090 (MatMul, N-body); NVIDIA K20 (NPB CG)

Performance Comparison

The following configurations are compared:

- Cray compiler
- Our compiler
- Hand-written CUDA
  - the code is written in CUDA and compiled by NVCC
  - the code does not use the GPU's shared memory
- Our compiler (2D blocking)
  - the code uses 2D blocking and is compiled by our compiler
  - applied only to matrix multiplication




Matrix multiplication

[Figure: relative performance against CPU vs. matrix size (1K–8K) for the Cray compiler, hand-written CUDA, our compiler, and our compiler with 2D blocking; callouts: 4.6x, 5.5x, 1.5x, 1.4x]

The performance of our compiler with 2D blocking and of hand-written CUDA is slightly lower.

Matrix multiplication

- Our compiler achieves better performance than the Cray compiler
  - the PTX code generated directly by the Cray compiler has more operations in the innermost loop
  - our compiler outputs CUDA code, and NVCC generates better-optimized PTX code
- 2D blocking gives lower performance
  - the default 2D block size (16x16) is not adequate for this program
  - the best block size was 512x2
  - the hand-written CUDA code also uses a 16x16 block
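For reference, the innermost loop discussed above belongs to a loop nest of roughly this shape (our assumption of the benchmark's structure; the actual benchmark source is not shown in the slides):

    /* N x N matrix multiplication in OpenACC; the i/j nest is what
       2D blocking distributes, and the k loop is the innermost loop
       whose PTX quality dominates performance */
    #pragma acc parallel loop copyin(a[0:N*N], b[0:N*N]) copyout(c[0:N*N])
    for (i = 0; i < N; i++) {
        #pragma acc loop vector
        for (j = 0; j < N; j++) {
            float sum = 0.0f;
            for (k = 0; k < N; k++)
                sum += a[i*N + k] * b[k*N + j];
            c[i*N + j] = sum;
        }
    }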






N-body

[Figure: relative performance against CPU vs. number of particles (1K–32K) for the Cray compiler, hand-written CUDA, and our compiler; callouts: 5.4x, 31x, 0.95x, 1.2x]

At small problem sizes, the performance of our compiler is lower than that of the Cray compiler.

N-body

- At small problem sizes, the performance became worse
  - caused by a decline in the utilization of the Streaming Multiprocessors (SMs)
    - each thread block of a kernel is executed on a single SM
    - if the number of blocks is smaller than the number of SMs, the performance of the kernel becomes low
  - default block sizes
    - Cray compiler: 128 threads / block
    - our compiler: 256 threads / block
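As a rough sanity check of this argument (assuming one thread per particle and the 16 SMs of the Fermi-class X2090): at 1K particles, our compiler's default of 256 threads/block yields only 1024 / 256 = 4 thread blocks, while the Cray default of 128 threads/block yields 8; neither fills all SMs, but the smaller block size keeps twice as many of them busy.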









NPB CG

[Figure: relative performance against CPU vs. class (matrix size) for the Cray compiler and our compiler; callouts: 0.66x, 9.7x, 0.74x, 2.1x]

At class S, the performance is lower than that of the CPU and of the Cray compiler.

NPB CG

- At class S, the performance of the GPU is lower than that of the CPU
  - the overheads are large compared with the kernel execution time:
    - launching kernel functions
    - synchronization with the device
    - data allocation / release / transfer
- The overhead is larger than that of the Cray compiler
  - largely due to the overhead of reduction (an example of the kind of loop involved follows below)
- The performance of the GPU kernels themselves is better than that of the Cray compiler
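The reductions in question come from clauses of the following form (a generic OpenACC example, not the actual NPB CG source):

    /* dot product with a scalar reduction; the generated code must
       combine per-thread partial sums across the whole grid, which
       adds extra kernel launches and synchronization */
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += x[i] * y[i];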

Conclusion

- We implemented a source-to-source OpenACC compiler for CUDA
  - C with OpenACC directives → C with CUDA API
  - built on the Omni compiler infrastructure
- In most cases, the GPU code produced by our compiler outperforms a single CPU core
  - speedup of up to 31x on N-body
- Our compiler successfully exploits the CUDA backend through the source-to-source approach
  - its performance is often better than that of the Cray compiler
- There is room for performance improvement
  - using suitable grid and block sizes
  - reducing the overhead of synchronization and reduction

Future Work

- Optimization
  - tuning the block size at compile time
  - reducing the overhead of synchronization and reduction
- Supporting the full set of directives, to conform to the OpenACC specification
- We will release our compiler at the next SC