Sieve C++ Parallel Programming System


Codeplay Software Ltd

2nd Floor, 45 York Place

Edinburgh, EH1 3HP

United Kingdom

Tel: +44(0)131 466 0506


www.codeplay.com


Multi-core Compilation: an Industrial Approach

Alastair F. Donaldson

EPSRC Postdoctoral Research Fellow, University of Oxford

Formerly at Codeplay Software Ltd.


Thanks to the Codeplay Sieve team:

Pete Cooper, Uwe Dolinsky, Andrew Richards, Colin Riley, George Russell


Coverage

- Limits of automatic parallelisation
- Programming heterogeneous multi-core processors
- Codeplay Sieve Threads approach
  - Like pthreads for accelerator processors
- The promises and limitations of OpenCL
- Laboratory session: Sieve Partitioning System for Cell Linux - a practical introduction


Limits of automatic parallelisation

- Part of why this has not been achieved:
  - C/C++, pointers, function pointers, multiple source files, precompiled libraries
- Why this will never be achieved:
  - Many parallelisable programs require ingenuity to parallelise!
- State-of-the-art: we are good at parallelising regular loops, when we can see all the code

Dream: a tool which takes a serial program, finds opportunities for parallelism, produces parallel code optimized for the target processor, and preserves determinism.



Example: Floyd-Steinberg error diffusion

Threshold = 128. Consider this 2x3 block of pixel values:

  255   22   64
  180  200   55

The pixel with value 22 is processed: 22 < 128, so the pixel is set to 0, and the error = old - new = 22 is diffused to the unprocessed neighbours with weights 7/16, 5/16, 3/16 and 1/16, giving:

  255    0   68
  187  210   56
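Written out as code, one pass of the algorithm above might look as follows (a minimal serial sketch: the 8-bit greyscale row-major layout, the function name and the clamping are our assumptions, not from the slides):

#include <algorithm>

void floyd_steinberg(unsigned char* img, int w, int h) {
  for (int r = 0; r < h; ++r) {
    for (int c = 0; c < w; ++c) {
      const int old_val = img[r*w + c];
      const int new_val = (old_val < 128) ? 0 : 255;  // threshold
      img[r*w + c] = (unsigned char)new_val;
      const int err = old_val - new_val;              // error = old - new
      // Diffuse the error to the four unprocessed neighbours.
      auto spread = [&](int rr, int cc, int weight) {
        if (rr >= 0 && rr < h && cc >= 0 && cc < w) {
          const int v = img[rr*w + cc] + err * weight / 16;
          img[rr*w + cc] = (unsigned char)std::clamp(v, 0, 255);
        }
      };
      spread(r,   c+1, 7);  // right:       7/16
      spread(r+1, c-1, 3);  // below-left:  3/16
      spread(r+1, c,   5);  // below:       5/16
      spread(r+1, c+1, 1);  // below-right: 1/16
    }
  }
}

Note how each pixel writes into pixels to its right and below: this is the dependence pattern that makes the loop look inherently serial.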


Error diffusion can be parallelised

[Figure: wavefront schedule on a 10x12 pixel grid. Labelling pixel (r, c) with step number 2r + c (0-indexed), every pixel's dependencies carry strictly smaller labels, so all pixels sharing a label are independent and can be processed simultaneously; the figure highlights six such pixels, all at step 17. The diffusion weights 7/16, 5/16, 3/16, 1/16 are as before.]
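A sketch of that schedule as code (process_pixel is a hypothetical stand-in for the threshold-and-diffuse step of the previous slide):

#include <algorithm>

void process_pixel(unsigned char* img, int w, int h, int r, int c);  // as before

void wavefront_diffusion(unsigned char* img, int w, int h) {
  // Pixel (r, c) runs at step 2r + c; its dependencies all carry
  // strictly smaller step numbers, so the pixels visited by the inner
  // loop are mutually independent and could go to different threads.
  for (int step = 0; step <= 2*(h-1) + (w-1); ++step) {
    for (int r = 0; r <= std::min(h-1, step/2); ++r) {
      const int c = step - 2*r;   // c >= 0 is guaranteed by r <= step/2
      if (c < w)
        process_pixel(img, w, h, r, c);
    }
  }
}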


Error diffusion can be parallelised

- ...but the approach is problem-specific and requires human ingenuity
- Panagiotis Metaxas: Parallel Digital Halftoning by Error-Diffusion, PCK50 (2003)
- Previously believed to be non-parallelisable:

"[the Floyd-Steinberg algorithm] is an inherently serial method; the value of [the pixel in the lower right corner of the image] depends on all m.n entries of [the input]"

- Donald Knuth: Digital Halftones by Dot Diffusion, ACM Transactions on Graphics (1987)


Another example: collision response

void apply_collisions(GameWorld* world,
                      CollisionPair* collisions,
                      int num_collisions) {
  for (int i = 0; i < num_collisions; i++) {
    world->update_velocities(collisions[i].first,
                             collisions[i].second);
  }
}

- Can process (a, b) and (c, d) in either order
- If { a, b } intersects { c, d } then we cannot process (a, b) and (c, d) simultaneously
- How should we deal with this? Locks? Transactional memory? Data preprocessing? One lock-based possibility is sketched below.
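For illustration, a lock-based sketch (one possibility only, not necessarily what the Sieve system does; CollisionPair holding two distinct integer entity indices and the MAX_ENTITIES bound are assumptions):

#include <algorithm>
#include <mutex>

constexpr int MAX_ENTITIES = 4096;      // hypothetical bound
std::mutex entity_lock[MAX_ENTITIES];   // one mutex per entity

void apply_collision(GameWorld* world, const CollisionPair& p) {
  const int lo = std::min(p.first, p.second);
  const int hi = std::max(p.first, p.second);
  // Acquire both locks in a fixed order to avoid deadlock; pairs on
  // disjoint entities run in parallel, intersecting pairs serialize.
  std::scoped_lock guard(entity_lock[lo], entity_lock[hi]);
  world->update_velocities(p.first, p.second);
}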


Our perspective

- Let's nail auto-parallelisation for special cases
- In general, we are stuck with multi-threading
- Let's design sophisticated tools to help with multi-threaded programming
- Modern problem: multi-threaded programming for heterogeneous multi-core is very hard


Heterogeneous architectures

[Figure: a host (e.g. an x86 PC, or the Cell's Power Processing Element) with main memory, connected over a data bus to several accelerators (Synergistic Processing Elements, GPUs, FPGAs, etc.), each with its own RAM. Data moves by direct memory access (DMA); control uses mailbox/interrupt signalling.]


Example: Cell Broadband Engine

- PPE = Power Processing Element (Host): a dual hyperthreaded PowerPC core, connected to main memory
- SPE = Synergistic Processing Element (Accelerator): a 128-bit SIMD processor (3.2 GHz) with 256 KB of local RAM
- One PPE drives eight SPEs; the SPEs access main memory via a DMA interface


Programming heterogeneous machines

- Write separate programs for host and accelerator
- Lots of "glue" code:
  - launch accelerators
  - orchestrate data movement
  - clear down accelerators
- Can achieve great performance, but:
  - Time consuming
  - Non-portable
  - Error prone (limited scope for static checking)
  - Multiple source files for logically related functionality


Illustrative example

Serial code for the Mandelbrot loops:

#define HEIGHT ...
#define WIDTH ...

unsigned char mand(int, int);

void computeMandelbrot(unsigned char* pixels) {
  for (int y = 0; y < HEIGHT; ++y) {
    for (int x = 0; x < WIDTH; ++x) {
      pixels[y*WIDTH + x] = mand(x, y);
    }
  }
}
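The slides leave mand unspecified; a plausible definition is the standard escape-time iteration (a sketch - the pixel-to-complex-plane mapping here is an illustrative choice):

unsigned char mand(int px, int py) {
  const float cx = (px - 0.75f * WIDTH) * (3.0f / WIDTH);
  const float cy = (py - 0.5f * HEIGHT) * (3.0f / WIDTH);
  float x = 0.0f, y = 0.0f;
  unsigned char iter = 0;
  // Iterate z := z^2 + c until divergence or the iteration cap.
  while (iter < 255 && x*x + y*y < 4.0f) {
    const float t = x*x - y*y + cx;
    y = 2.0f*x*y + cy;
    x = t;
    ++iter;
  }
  return iter;  // the iteration count doubles as the pixel intensity
}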


Illustrative example (continued)

PPE Code:

#define HEIGHT ...
#define WIDTH ...

typedef struct {
  int row;
  int length;
  unsigned char* dest;
  int padding;
} context;

// PPE uses this handle to run SPE code
extern spe_program_handle_t speComputeMandelbrot;

void ppeComputeMandelbrot(unsigned char* pixels) {
  speid_t spe_ids[8];
  context ctxs[8] __attribute__ ((aligned (16)));
  const int count = HEIGHT / 8;

  for (int i = 0, offset = 0; i < 8; i++, offset += count) {
    ctxs[i].length = (i == 7) ? HEIGHT - offset : count;
    ctxs[i].dest = &(pixels[offset*WIDTH]);
    ctxs[i].row = offset;

    spe_ids[i] = spe_create_thread(
        &speComputeMandelbrot, &ctxs[i]);
  }

  for (int i = 0; i < 8; i++) {
    spe_wait(spe_ids[i]);
  }
}

SPE Code:

unsigned char mand(int, int);

#define BLOCK ...

volatile unsigned char myPixels[BLOCK]
    __attribute__ ((aligned (16)));

volatile context ctx __attribute__ ((aligned (16)));

int main(unsigned long long spu_id,
         unsigned long long ctxAddress) {

  spu_mfcdma32(&ctx, ctxAddress,
               sizeof(context), MFC_GET_CMD);
  spu_mfcstat(MFC_TAG_UPDATE_ALL);

  for (int y = 0; y < ctx.length; y++) {
    for (int x = 0; x < WIDTH; x += BLOCK) {
      int N = (WIDTH - x < BLOCK ? WIDTH - x : BLOCK);
      for (int k = 0; k < N; k++) {
        myPixels[k] = mand(x+k, ctx.row + y);
      }

      spu_mfcdma32(myPixels, ctx.dest + y*WIDTH + x,
                   N*sizeof(unsigned char), MFC_PUT_CMD);
      spu_mfcstat(MFC_TAG_UPDATE_ALL);
    }
  }
  return 0;
}


Why bother with heterogeneous architectures?

- Homogeneous multi-threading is relatively easy:
  - Every thread runs on the same type of processor
  - All methods compiled as usual
  - No need for explicit data movement code
  - Minimal start-up code: pthread_create(...)
- Heterogeneous architectures can give better performance:
  - Scratchpad memory => contention-free local access
  - Accelerator faster than host at e.g. vector processing

PlayStation is a registered trademark of Sony Computer Entertainment Inc.



Codeplay Sieve Thread approach

- Wrap code inside a sievethread block to say "run this code asynchronously on an accelerator"

Before:

#include <libsieve>

void GameWorld::doFrame(...)
{
  // Suppose calculateStrategy and
  // detectCollisions are independent
  this->calculateStrategy(...);

  this->detectCollisions();

  this->updateEntities();
  this->renderFrame();
}

After:

#include <libsieve>

void GameWorld::doFrame(...)
{
  int handle = sievethread (...)  // offload to accelerator: non-blocking;
  {                               // the call graph for calculateStrategy
    this->calculateStrategy(...); // is compiled for the accelerator
  }
  this->detectCollisions();

  sieveThreadJoin(handle);        // host can wait for the sievethread to complete

  this->updateEntities();
  this->renderFrame();
}

- Full implementation for Cell: the sievethread runs on an SPE.
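For comparison, the same structure with POSIX threads on a homogeneous machine - the analogy behind "like pthreads for accelerator processors" (a sketch; strategy_entry is an illustrative wrapper):

#include <pthread.h>

static void* strategy_entry(void* arg) {
  static_cast<GameWorld*>(arg)->calculateStrategy(/*...*/);
  return nullptr;
}

void GameWorld::doFrame(/*...*/)
{
  pthread_t handle;
  pthread_create(&handle, nullptr, strategy_entry, this);  // non-blocking
  this->detectCollisions();
  pthread_join(handle, nullptr);  // plays the role of sieveThreadJoin
  this->updateEntities();
  this->renderFrame();
}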


Parameters to sievethread block

#include <libsieve>

void start_accelerators(int* handles)
{
  for (int i = 0; i < NUM_SPES; i++)
  {
    handles[i] = sievethread
    {
      do_work(i);   // ILLEGAL: i may change, or disappear!
    };
  }
}

void wait_for_accelerators(int* handles)
{
  for (int i = 0; i < NUM_SPES; i++)
  {
    sieveThreadJoin(handles[i]);
  }
}


Parameters to sievethread block

Solution: pass i by value as a parameter to the sievethread block.

#include <libsieve>

void start_accelerators(int* handles)
{
  for (int i = 0; i < NUM_SPES; i++)
  {
    handles[i] = sievethread (i) {   // i passed by value
      do_work(i);
    };
  }
}

void wait_for_accelerators(int* handles)
{
  for (int i = 0; i < NUM_SPES; i++)
  {
    sieveThreadJoin(handles[i]);
  }
}

Parameters and handles are omitted from many of the following examples.
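The same hazard and the same fix exist in modern C++; a sketch with std::thread, where capturing i by value plays the role of the sievethread parameter list:

#include <thread>

void start_and_wait()
{
  std::thread workers[NUM_SPES];   // NUM_SPES as in the slide above
  for (int i = 0; i < NUM_SPES; i++) {
    // [i] captures by value - safe. [&i] would capture by reference,
    // and i may change, or disappear!
    workers[i] = std::thread([i] { do_work(i); });
  }
  for (int i = 0; i < NUM_SPES; i++) {
    workers[i].join();
  }
}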


Working with multiple threading libraries

#ifdef __WIN32__
#include <windows.h>
#define thread_handle_t WinThreadHandle_t
#define createThread(context, program) WinThreadCreate(context, program)
#else
#ifdef __LINUX__
#include <pthread.h>
#define thread_handle_t pthread_t
#define createThread(context, program) pthread_create(context, program)
#else
#ifdef __SIEVE_THREADS__
#include <sievethread.h>
#define thread_handle_t SieveThreadHandle_t
#define createThread(context, program) sievethread(context) { \
    program(context);                                         \
  }
#endif
#endif
#endif



Pointer recap

int* p;          // pointer to integer
*p = 5;          // store 5 in the location pointed to by p
int x;
p = &x;          // p holds the address of x

const int* p;    // pointer to constant integer
int const* p;    // means the same thing



Pointer types

- Separate pointers into two categories:
  - Pointer to host data: marked with the __outer qualifier
  - Pointer to accelerator data: not marked

[Figure: accelerator memory (kilobytes) vs. host memory (gigabytes)]

int* x;                      // points to accelerator data
int __outer* y;              // points to host data
int __outer* __outer* z;     // points to an __outer pointer held in host memory
int __outer** w;             // not supported

- __outer is a type qualifier, similar to const and volatile; the "__" prefix is common for extensions in C++


Pointer types

- Pointers outside a sievethread context: implicitly __outer
- On the accelerator, dereferencing an __outer pointer => DMA transfer
- Illegal to assign between local and outer pointers
- For sensible code, we can statically eliminate attempts to dereference a host address as if it were an accelerator address, and vice versa
- C++ => the programmer can always get their way if they really want!



Pointer types: example

float f_out = 3.0f;
float* out_ptr;        // implicitly an __outer pointer

sievethread
{
  float f_in = 5.0f;
  float* in_ptr;

  in_ptr = &f_in;

  out_ptr = &f_out;

  *in_ptr = *out_ptr;  // DMA: Host -> Accelerator

  out_ptr = in_ptr;    // ILLEGAL
}

[Figure: f_out (3.0) and out_ptr live in host memory; f_in (5.0) and in_ptr live on the accelerator. The assignment *in_ptr = *out_ptr triggers a DMA transfer that copies 3.0 into f_in.]


Method duplication

- A method has pointer/reference parameters
- It is called from a sievethread context with a mixture of outer and local pointers/references
- For each accelerator calling context, compile a separate version of the method

void func(float* x, int* y) { ... }

int x;
sievethread
{
  float y;
  func(&y, &x);
  // signature: void (float*, __outer int*)
}


Method duplication example

class Circle {
  ...
public:
  static bool collides(Circle* c1, Circle* c2) { ... }
};

void my_func() {

  Circle out_circ_1, out_circ_2;

  sievethread
  {
    Circle in_circ_1, in_circ_2;

    if (   collides(&out_circ_1, &out_circ_2)
        && collides(&out_circ_1, &in_circ_2)
        && collides(&in_circ_1,  &out_circ_2)
        && collides(&in_circ_1,  &in_circ_2)) {
      ...
    }
  }
}

collides is duplicated four times, once per calling context:

bool collides(__outer Circle*, __outer Circle*)
bool collides(__outer Circle*, Circle*)
bool collides(Circle*, __outer Circle*)
bool collides(Circle*, Circle*)


Challenges

- Function pointers, virtual methods
- Method duplication across multiple compilation units
- Silent deduction of __outer (type inference)


Function pointers

Given a function type:

typedef void (* int_to_void) (int);

plus methods:

void meth1(int);
void meth2(int);

plus a function pointer:

int_to_void f_ptr;

plus a call in a sievethread context:

sievethread { f_ptr(25); ... }

- We don't know until runtime which method is called
- How do we know what to duplicate?


Possible solutions

- Compile and load all matching methods
  - May be hundreds: long compilation time, large code size
- Compile all methods, load on demand
  - Slow compilation, significant runtime overhead
- Compile methods on demand
  - Prohibitive runtime overhead; requires access to the compiler at runtime
- Delegate the call to the host
  - Defeats the point of offloading
  - Would only work if all pointers are __outer
  - Useful as a fallback


Our solution: function domains

- A sievethread block is equipped with a domain of functions that are OK to call via pointers

typedef void (* int_to_void) (int);

void meth1(int x) { ... }
void meth2(int x) { ... }
void meth3(int x) { ... }

int_to_void f_ptr;
...

sievethread [ meth1, meth3 ]
{
  f_ptr(25);
}

- The call graphs for meth1 and meth3 are duplicated, and the methods are loaded and ready to call
- Runtime exception if f_ptr == meth2


Domains in practice

// 2d table of methods
collisionFunction collisionFunctions[3][3] =
  { fix_fix, fix_mov, ..., dead_dead };

sievethread [ fix_fix, fix_mov, ..., dead_dead ]
{
  for (...i, j...)
  {
    // Apply function according to objects' status
    collisionFunctions[ status[i] ][ status[j] ](...);
  }
}

- Virtual methods are handled similarly



Method duplication across compilation units

Box.h:

class Box
{
public:
  bool collides(Entity &);
  ...

Box.cpp (compilation unit A, implements collides):

bool Box::collides(Entity &)
{
  ...
}

Physics.cpp (compilation unit B, #includes Box.h):

Box b, c;

sievethread
{
  if (b.collides(c)) {
    ...
  }
}

When compiling unit B we need to duplicate collides, but we don't have its source code.


Method duplication across compilation units

- Current solution: mark externally called functions to be duplicated

bool collides(Entity &)
    __attribute ((__duplicate(
        bool (__outer Entity &) __outer
    )));

- Possible automatic solution:
  - Build up "compilation conditions" while processing files
  - Repeatedly process files until the conditions are fulfilled
  - If collides calls other functions in its compilation unit, these will be automatically duplicated


Silent deduction of __outer

Two simple relaxations help with automatic method duplication:

__outer short* x;
__outer int* y;

// z is given type '__outer int*' due to its initializer
int* z = y;

// OK to use 'short*' rather than '__outer short*' in the cast
x = (short*) y;


Other features

- Mark a method sievethread: only compile it for the accelerator (it can then be hand-optimized)
- Overloading based on __outer (sketched below)
- Facility for the accelerator to invoke a method on the host, e.g. to allocate a lot of memory
- The Sieve Partitioning System comes with libraries to help optimize data movement
- The compiler generates advice to suggest how to use these libraries
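A sketch of overloading on __outer (the function names and the exact overload-resolution behaviour are our assumptions from the bullet above):

float* host_data;   // declared outside a sievethread: implicitly __outer

float sum(float* data, int n);          // accelerator-local data: fast path
float sum(__outer float* data, int n);  // host data: can batch its DMA traffic

sievethread {
  float local[64];
  // ...
  float a = sum(local, 64);        // resolves to the local overload
  float b = sum(host_data, 1024);  // resolves to the __outer overload
}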


Development approach

- Identify code to offload (manually, using a profiler)
- Enclose it in a sievethread block (fix a few __outer issues)
- A basic offload may not yield optimal performance...
  - ...but any offloading frees up the host
- Incremental performance improvements:
  - Overload core functions with sievethread versions optimized for the accelerator
  - Compiler advice guides optimization


Performance

- Results on PS3 (image processing, raytracing, fractals):
  - Linear scaling
  - With 6 SPEs, speedup between 3x and 14x over the host, after some optimization
- Possible to hand-optimize as much as desired
- Tradeoff: hand-optimization increases performance at the expense of portability


OpenCL

- Language and API from the Khronos Group for programming heterogeneous multicore systems
- Codeplay is a contributing member
- Motivation: unify bespoke languages for programming CPUs, GPUs and Cell BE-like systems
- Host code: C/C++ with API calls to launch kernels to run on devices
- Kernels are written in OpenCL C: C99 with some restrictions and some extensions (a sketch follows below)
- OpenCL is portable, but too low level for large applications
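For concreteness, a minimal sketch of what a kernel for the earlier Mandelbrot example might look like in OpenCL C (our illustration, not Codeplay's generated output): one work-item per pixel, with the escape-time loop inlined since the kernel cannot call host code.

__kernel void mandelbrot(__global uchar* pixels, int width, int height)
{
    int px = get_global_id(0);   // one work-item per pixel
    int py = get_global_id(1);
    float cx = (px - 0.75f * width) * (3.0f / width);
    float cy = (py - 0.5f * height) * (3.0f / width);
    float x = 0.0f, y = 0.0f;
    uchar iter = 0;
    while (iter < 255 && x*x + y*y < 4.0f) {
        float t = x*x - y*y + cx;
        y = 2.0f*x*y + cy;
        x = t;
        iter++;
    }
    pixels[py * width + px] = iter;
}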


Sievethreads -> OpenCL

[Figure: a C++ application with its hot spots enclosed in sievethread blocks is automatically translated into (a) a C++ application, running on the host, in which each hot spot is replaced by a kernel call, and (b) one OpenCL kernel per hot spot, running on the accelerator(s). The low-level OpenCL code for data movement is generated automatically.]


Sievethreads -> OpenCL: challenges

- Various limitations in OpenCL 1.0 (e.g. no recursion, no function pointers) which will probably go away
- Severe (prohibitive?) limitation: the accelerator cannot randomly access host memory

void some_method(__outer int* x)
{
  ... = *x;  // Read from "who knows where?" in host memory
}

- On the Cell processor, DMA from the host on demand is fine
- OpenCL does not support this (due to limitations of GPUs)


Related work

- Hera-JVM (University of Glasgow): a Java virtual machine on Cell SPEs
- CUDA (NVIDIA), Brook+ (AMD): somewhat subsumed by OpenCL
- Cilk++ (Cilk Arts): shared memory only
- OpenMP (IBM have an implementation for Cell)
- PS-Algol (Atkinson, Chisholm, Cockshott): pointers to memory vs. pointers to disk is analogous to local vs. outer pointers


Summary

- Sievethreads: a practical way to get C++ code running on heterogeneous systems
- Can co-exist with other threading methods
- Core technology: method duplication
- Main area for future work: data movement
  - Data movement optimizations
  - A declarative language for specifying data movement patterns


Thank you!

After the break, come back and use the Sieve Partitioning System!

Codeplay are interested in academic collaborations, e.g. a student project applying sievethreads to a large open-source application.