Aparapi Presentation - Houston Java Users Group



APARAPI

Java™ platform’s ‘Write Once Run Anywhere’ ® now includes the GPU


Gary Frost

AMD

PMTS Java Runtime Team


AGENDA

- The age of heterogeneous computing is here
- The supercomputer in your desktop/laptop
- Why Java?
- Current GPU programming options for Java developers
- Are developers likely to adopt emerging Java OpenCL/CUDA bindings?
- Aparapi
  - What is it
  - How it works
  - Performance
  - Examples/Demos
  - Proposed Enhancements
  - Future work

THE AGE OF HETEROGENEOUS COMPUTE IS HERE

- GPUs originally developed to accelerate graphics operations
- Early adopters repurposed their GPUs for ‘general compute’ by performing ‘unnatural acts’ with shader APIs
- OpenGL allowed shaders/textures to be compiled and executed via extensions
- OpenCL™/GLSL/CUDA™ standardized/formalized how to express GPU compute and simplified host programming
- New programming models are emerging and lowering barriers to adoption

THE SUPERCOMPUTER IN YOUR DESKTOP

- Some interesting tidbits from http://www.top500.org/
  - November 2000: “ASCI White is new #1 with 4.9 TFlops on the Linpack”
    http://www.top500.org/lists/2000/11
  - November 2002: “3.2 TFlops are needed to enter the top 10”
    http://www.top500.org/lists/2002/11
  - May 2011: AMD Radeon™ HD 6990 delivers 5.1 TFlops single precision performance
    http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/hd-6990/Pages/amd-radeon-hd-6990-overview.aspx#3


WHY JAVA?

- One of the most widely used programming languages
  http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
- Established in domains likely to benefit from heterogeneous compute
  - BigData, Search, Hadoop+Pig, Finance, GIS, Oil & Gas
- Even if applications are not implemented in Java, they may still run on the Java Virtual Machine (JVM)
  - JRuby, JPython, Scala, Clojure, Quercus (PHP)
- Acts as a good proxy/indicator for enablement of other runtimes/interpreters
  - JavaScript, Flash, .NET, PHP, Python, Ruby, Dalvik?

[Pie chart: TIOBE programming language popularity, %]
  Java 18.16 | C 16.17 | C++ 9.14 | C# 7.54 | PHP 6.51 | Objective C 5.01 | Python 4.58 | Other 32.89

GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS

- Emerging Java GPU APIs require coding a ‘Kernel’ in a domain-specific language

// JOCL/OpenCL kernel code
__kernel void squares(__global const float *in, __global float *out){
   int gid = get_global_id(0);
   out[gid] = in[gid] * in[gid];
}

- As well as writing the Java ‘host’ CPU-based code to:
  - Initialize the data
  - Select/Initialize execution device
  - Allocate or define memory buffers for args/parameters
  - Compile ‘Kernel’ for a selected device
  - Enqueue/Send arg buffers to device
  - Execute the kernel
  - Read results buffers back from the device
  - Cleanup (remove buffers/queues/device handles)
  - Use the results

import static org.jocl.CL.*;
import org.jocl.*;

public class Sample {
    public static void main(String args[]) {
        // Create input- and output data
        int size = 10;
        float inArray[] = new float[size];
        float outArray[] = new float[size];
        for (int i = 0; i < size; i++) {
            inArray[i] = i;
        }
        Pointer in = Pointer.to(inArray);
        Pointer out = Pointer.to(outArray);

        // Obtain the platform IDs and initialize the context properties
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

        // Create an OpenCL context on a GPU device
        cl_context context = clCreateContextFromType(contextProperties,
            CL_DEVICE_TYPE_GPU, null, null, null);

        // Obtain the cl_device_id for the first device
        cl_device_id devices[] = new cl_device_id[1];
        clGetContextInfo(context, CL_CONTEXT_DEVICES,
            Sizeof.cl_device_id, Pointer.to(devices), null);

        // Create a command-queue
        cl_command_queue commandQueue =
            clCreateCommandQueue(context, devices[0], 0, null);

        // Allocate the memory objects for the input- and output data
        cl_mem inMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_float * size, in, null);
        cl_mem outMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
            Sizeof.cl_float * size, null, null);

        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(context, 1, new String[]{
            "__kernel void sampleKernel("+
            "      __global const float *in,"+
            "      __global float *out){"+
            "   int gid = get_global_id(0);"+
            "   out[gid] = in[gid] * in[gid];"+
            "}"
        }, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create and extract a reference to the kernel
        cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(inMem));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(outMem));

        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel,
            1, null, new long[]{inArray.length}, null, 0, null, null);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, outMem, CL_TRUE, 0,
            outArray.length * Sizeof.cl_float, out, 0, null, null);

        // Release kernel, program, and memory objects
        clReleaseMemObject(inMem);
        clReleaseMemObject(outMem);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);

        for (float f : outArray) {
            System.out.printf("%5.2f, ", f);
        }
    }
}


ARE DEVELOPERS LIKELY TO ADOPT EMERGING JAVA OPENCL/CUDA BINDINGS?

- Some will
  - Early adopters
  - Prepared to learn new languages
  - Motivated to squeeze all the performance they can from available compute devices
  - Prepared to implement algorithms both in Java and in CUDA/OpenCL
- Many won’t
  - OpenCL/CUDA C99 heritage likely to disenfranchise Java developers
    - Many walked away from C/C++ or possibly never encountered it at all (due to CS education shifts)
  - Difficult to expose low-level concepts (such as the GPU memory model) to developers who have ‘moved on’ and just expect the JVM to ‘do the right thing’
  - Who pays for retraining of Java developers?
  - The notion of writing code twice (once for Java execution, again for GPU/APU) is alien
    - Where’s my ‘Write Once, Run Anywhere’?

WHAT IS APARAPI?

- An API for expressing data parallel workloads in Java
  - Developer extends a Kernel base class
  - Compiles to Java bytecode using the existing tool chain
  - Uses the existing/familiar Java tool chain to debug the logic of Kernel implementations
- A runtime component capable of either:
  - Executing the Kernel via a Java thread pool
  - Converting the Kernel bytecode to OpenCL and executing it on the GPU

[Flow chart: MyKernel.class -> “Platform supports OpenCL?” -> if no, execute Kernel using Java thread pool; if yes -> “Bytecode can be converted to OpenCL?” -> if no, execute Kernel using Java thread pool; if yes, convert bytecode to OpenCL and execute OpenCL kernel on GPU]

AN EMBARRASSINGLY PARALLEL USE CASE

- First let’s revisit our earlier code example
  - Calculate square[0..size] for a given input in[0..size]

final int[] square = new int[size];
final int[] in = new int[size];
// populating in[0..size] omitted

for (int i = 0; i < size; i++){
   square[i] = in[i] * in[i];
}

- Note that the order we traverse the loop is unimportant
- Ideally Java would provide a way to indicate that the body of the loop need not be executed sequentially
  - Something like a parallel-for?

parallel-for (int i = 0; i < size; i++){
   square[i] = in[i] * in[i];
}

- However, we don’t want to modify the language, compiler, or tool chain.

REFACTORING OUR EXAMPLE TO USE APARAPI

final int[] square = new int[size];
final int[] in = new int[size];
// populating in[0..size] omitted

// Sequential loop form
for (int i = 0; i < size; i++){
   square[i] = in[i] * in[i];
}

// Aparapi Kernel form
new Kernel(){
   @Override public void run(){
      int i = getGlobalId();
      square[i] = in[i] * in[i];
   }
}.execute(size);
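
A complete, runnable version of this refactoring for readers who want to try it. It assumes the com.amd.aparapi package name used by the alpha builds this deck describes:

import com.amd.aparapi.Kernel;

public class SquareExample {
    public static void main(String[] args) {
        final int size = 16;
        final int[] in = new int[size];
        final int[] square = new int[size];
        for (int i = 0; i < size; i++) {
            in[i] = i; // populate the input
        }

        // Aparapi tries the GPU first and silently falls back to a
        // Java thread pool if bytecode conversion is not possible.
        new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                square[i] = in[i] * in[i];
            }
        }.execute(size);

        for (int i = 0; i < size; i++) {
            System.out.printf("%d^2 = %d%n", in[i], square[i]);
        }
    }
}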


EXPRESSING DATA PARALLEL IN APARAPI

- What happens when we call execute(n)?

Kernel kernel = new Kernel(){
   @Override public void run(){
      int i = getGlobalId();
      square[i] = in[i] * in[i];
   }
};

kernel.execute(size);

[Flow chart: “Is this the first execution?” -> if yes, attempt conversion (“Platform supports OpenCL?”, “Bytecode can be converted to OpenCL?”); then “Do we have OpenCL?” -> if yes, execute the OpenCL kernel on the GPU, otherwise execute the Kernel using a Java thread pool]

FIRST CALL OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE

- Reload the classfile via a classloader and locate all methods and fields
- For the ‘run()’ method and all methods reachable from run():
  - Convert method bytecode to an IR
    - Expression trees
    - Conditional sequences analyzed and converted to if{}, if{}else{} and for{} constructs
  - Create a list of fields accessed by the bytecode
    - Note the access type (read/write/read+write)
    - Accessed fields will be turned into args and passed to the generated OpenCL
- Create an OpenCL buffer for each accessed primitive array (read, write or read+write)
- Create and compile the OpenCL
- Bail and revert to the Java thread pool if we encounter any issues in the previous steps

ALL CALLS OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE

- Lock any accessed primitive arrays (so the garbage collector doesn’t move or collect them)
- For each field readable by the kernel:
  - If the field is an array: enqueue a buffer write
  - If the field is scalar: set the kernel arg value
- Enqueue the Kernel execution
- For each array writeable by the kernel:
  - Enqueue a buffer read
- Wait for all enqueued requests to complete
- Unlock the accessed primitive arrays

(A rough hand-written JOCL equivalent of this per-call sequence is sketched below.)
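
This sketch is illustrative only, not Aparapi's internals: it shows roughly what one Kernel.execute(size) call maps to in hand-written JOCL. The handles and the scalar arg index are hypothetical, assumed to have been created during the first-call setup:

import static org.jocl.CL.*;
import org.jocl.*;

// Illustrative per-call sequence, expressed with the JOCL bindings
// already shown earlier in this deck.
class PerCallSketch {
    static void executeOnce(cl_command_queue queue, cl_kernel kernel,
                            cl_mem inMem, cl_mem outMem,
                            float[] inArray, float[] outArray, int size) {
        // readable array field -> enqueue a buffer write
        clEnqueueWriteBuffer(queue, inMem, CL_FALSE, 0,
            Sizeof.cl_float * size, Pointer.to(inArray), 0, null, null);
        // scalar field -> set the kernel arg value (arg index hypothetical)
        clSetKernelArg(kernel, 2, Sizeof.cl_int, Pointer.to(new int[]{ size }));
        // enqueue the kernel execution
        clEnqueueNDRangeKernel(queue, kernel, 1, null,
            new long[]{ size }, null, 0, null, null);
        // writeable array field -> enqueue a buffer read
        clEnqueueReadBuffer(queue, outMem, CL_FALSE, 0,
            Sizeof.cl_float * size, Pointer.to(outArray), 0, null, null);
        // wait for all enqueued requests to complete
        clFinish(queue);
    }
}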


KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS NOT AN OPTION

- Create a thread pool
  - One thread per core
- Clone the Kernel once for each thread
  - Each Kernel is accessed exclusively from a single thread
- Each Kernel loops globalSize/threadCount times
  - Updating globalId, localId, groupSize, globalSize on the Kernel instance
  - Executing the run() method on the Kernel instance
- Wait for all threads to complete

(A simplified sketch of this dispatch scheme follows.)
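
The sketch below illustrates the idea only (one worker per core, each striding a contiguous share of the global range) and is not Aparapi's actual implementation; the Body interface is a hypothetical stand-in for Kernel.run():

import java.util.concurrent.CountDownLatch;

public class PoolDispatchSketch {
    interface Body { void run(int globalId); }   // stands in for Kernel.run()

    static void execute(final int globalSize, final Body body) throws InterruptedException {
        final int threads = Runtime.getRuntime().availableProcessors(); // one per core
        final CountDownLatch done = new CountDownLatch(threads);
        final int chunk = globalSize / threads;
        for (int t = 0; t < threads; t++) {
            final int start = t * chunk;
            final int end = (t == threads - 1) ? globalSize : start + chunk;
            new Thread(new Runnable() {
                public void run() {
                    for (int id = start; id < end; id++) {
                        body.run(id);            // globalId advances per iteration
                    }
                    done.countDown();
                }
            }).start();
        }
        done.await();                            // wait for all threads to complete
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] in = {1, 2, 3, 4};
        final int[] square = new int[in.length];
        execute(in.length, new Body() {
            public void run(int id) { square[id] = in[id] * in[id]; }
        });
        System.out.println(java.util.Arrays.toString(square));
    }
}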




ADOPTION CHALLENGES (APARAPI VS EMERGING JAVA GPU BINDINGS)

Task                                                      | Emerging GPU bindings | Aparapi
----------------------------------------------------------+-----------------------+----------
Learn OpenCL/CUDA                                         | DIFFICULT             | N/A
Locate potential data parallel opportunities              | MEDIUM                | MEDIUM
Refactor existing code/data structures                    | MEDIUM                | MEDIUM
Create Kernel code                                        | DIFFICULT             | EASY
Create code to coordinate execution and buffer transfers  | MEDIUM                | N/A
Identify GPU performance bottlenecks                      | DIFFICULT             | DIFFICULT
Iterate code/debug algorithm logic                        | DIFFICULT             | MEDIUM
Solve build/deployment issues                             | DIFFICULT             | MEDIUM

MANDELBROT EXAMPLE

new Kernel(){
   @Override public void run(){
      int gid = getGlobalId();
      float x = (((gid % w) - (w/2)) / (float)w); // x {-1.0 .. +1.0}
      float y = (((gid / w) - (h/2)) / (float)h); // y {-1.0 .. +1.0}

      float zx = x, zy = y, new_zx = 0f;
      int count = 0;
      while (count < maxIterations && zx * zx + zy * zy < 8) {
         new_zx = zx * zx - zy * zy + x;
         zy = 2 * zx * zy + y;
         zx = new_zx;
         count++;
      }
      rgb[gid] = pallette[count];
   }
}.execute(width*height);

EXPRESSING DATA PARALLEL IN JAVA WITH APARAPI BY EXTENDING KERNEL

class SquareKernel extends Kernel{
   final int[] in;
   final int[] square;

   public SquareKernel(final int[] in){
      this.in = in;
      this.square = new int[in.length];
   }

   @Override public void run(){
      int i = getGlobalId();
      square[i] = in[i] * in[i];
   }

   public int[] square(){
      execute(in.length);
      return square;
   }
}

int[] in = new int[size];
SquareKernel squareKernel = new SquareKernel(in);
// populating in[0..size] omitted
int[] result = squareKernel.square();

- The square() method ‘wraps’ the execution mechanics
- Provides a more natural Java API

EXPRESSING DATA PARALLELISM IN APARAPI USING PROPOSED JAVA 8 LAMBDAS

- JSR 335 ‘Project Lambda’ proposes the addition of ‘lambda’ expressions to Java 8
  http://cr.openjdk.java.net/~briangoetz/lambda/lambda-state-3.html
- How we expect Aparapi will make use of the proposed Java 8 extensions:

final int[] square = new int[size];
final int[] in = new int[size]; // populating in[0..size] omitted

Kernel.execute(size, #{ i -> square[i] = in[i] * in[i]; });


HOW APARAPI EXECUTES ON THE GPU

- At runtime Aparapi converts Java bytecode to OpenCL
- The OpenCL compiler converts OpenCL to device-specific ISA (for GPU/APU)
- A GPU is comprised of multiple SIMD (Single Instruction Multiple Data) cores
  - SIMD performance stems from executing the same instruction on different data at the same time
  - Think of a single program counter shared across multiple threads
  - All SIMD lanes executing at the same time (in lock-step)

new Kernel(){
   @Override public void run(){
      int i = getGlobalId();
      int temp = in[i]*2;
      temp = temp+1;
      out[i] = temp;
   }
}.execute(4);

[Lock-step trace across four work items:]

        i=0                 i=1                 i=2                 i=3
int temp=in[0]*2    int temp=in[1]*2    int temp=in[2]*2    int temp=in[3]*2
temp=temp+1         temp=temp+1         temp=temp+1         temp=temp+1
out[0]=temp         out[1]=temp         out[2]=temp         out[3]=temp

DEVELOPER IS RESPONSIBLE FOR ENSURING PROBLEM IS DATA PARALLEL

- Data dependencies may violate the ‘in any order’ contract

for (int i = 1; i < 100; i++){
   out[i] = out[i-1] + in[i];
}

new Kernel(){ @Override public void run(){
   int i = getGlobalId();
   out[i] = out[i-1] + in[i];
}}.execute(100);

  - out[i-1] refers to a value resulting from a previous iteration which may not have been evaluated yet

- Loops mutating shared data will need to be refactored or will necessitate atomic operations

for (int i = 0; i < 100; i++){
   sum += in[i];
}

new Kernel(){ @Override public void run(){
   int i = getGlobalId();
   sum += in[i];
}}.execute(100);

  - sum += in[i] causes a race condition
    - Almost certainly will not be atomic when translated to OpenCL
    - Not safe in multi-threaded Java either

SOMETIMES WE CAN REFACTOR TO RECOVER SOME PARALLELISM

// Original sequential loop
for (int i = 0; i < 100; i++){
   sum += in[i];
}

// Naive kernel: races on 'sum'
new Kernel(){
   @Override public void run(){
      int i = getGlobalId();
      sum += in[i];
   }
}.execute(100);

// Refactored: each of 10 work items sums its own slice into partial[n]
new Kernel(){
   @Override public void run(){
      int n = getGlobalId();
      for (int i = 0; i < 10; i++)
         partial[n] += data[n*10+i];
   }
}.execute(10);

for (int i = 0; i < 10; i++){
   sum += partial[i];
}

// Equivalent sequential form of the refactored version
for (int n = 0; n < 10; n++){
   for (int i = 0; i < 10; i++){
      partial[n] += data[n*10+i];
   }
}
for (int i = 0; i < 10; i++){
   sum += partial[i];
}
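
Putting the two passes together, here is a runnable version of the partial-sum refactoring, again assuming the alpha-era com.amd.aparapi package; the final combine deliberately stays on the CPU:

import com.amd.aparapi.Kernel;

public class PartialSumExample {
    public static void main(String[] args) {
        final int groups = 10, perGroup = 10;
        final int[] data = new int[groups * perGroup];
        final int[] partial = new int[groups];
        for (int i = 0; i < data.length; i++) data[i] = i;

        // Pass 1 (parallel): each work item reduces its own slice
        new Kernel() {
            @Override public void run() {
                int n = getGlobalId();
                for (int i = 0; i < perGroup; i++) {
                    partial[n] += data[n * perGroup + i];
                }
            }
        }.execute(groups);

        // Pass 2 (serial): combine the partial sums on the CPU
        int sum = 0;
        for (int i = 0; i < groups; i++) sum += partial[i];
        System.out.println("sum = " + sum); // 0+1+...+99 = 4950
    }
}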




TRY TO AVOID BRANCHING WHEREVER POSSIBLE

- SIMD performance is impacted when code contains branches
  - To stay in lockstep the SIMD lanes must process both the ‘then’ and ‘else’ blocks
  - The result of the ‘condition’ is used to predicate instructions (conditionally mask to a no-op); a branch-free rewrite of this kernel is sketched after the trace

new Kernel(){
   @Override public void run(){
      int i = getGlobalId();
      int temp = in[i]*2;
      if (i%2==0)
         temp = temp+1;
      else
         temp = temp-1;
      out[i] = temp;
   }
}.execute(4);

[Lock-step trace across four work items, branches predicated:]

        i=0                  i=1                  i=2                  i=3
int temp=in[0]*2     int temp=in[1]*2     int temp=in[2]*2     int temp=in[3]*2
<c> = (0%2==0)       <c> = (1%2==0)       <c> = (2%2==0)       <c> = (3%2==0)
if <c>  temp=temp+1  if <c>  temp=temp+1  if <c>  temp=temp+1  if <c>  temp=temp+1
if <!c> temp=temp-1  if <!c> temp=temp-1  if <!c> temp=temp-1  if <!c> temp=temp-1
out[0]=temp          out[1]=temp          out[2]=temp          out[3]=temp
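
Where both branch bodies are simple arithmetic, the divergence can sometimes be removed entirely by folding the condition into the expression. A hedged illustration of that idea for the kernel above (a manual rewrite, not something Aparapi does automatically):

import com.amd.aparapi.Kernel;

public class BranchFreeExample {
    public static void main(String[] args) {
        final int[] in  = {10, 11, 12, 13};
        final int[] out = new int[in.length];
        new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                int temp = in[i] * 2;
                temp = temp + 1 - 2 * (i % 2); // +1 when i is even, -1 when i is odd
                out[i] = temp;
            }
        }.execute(in.length);
        // out = {21, 21, 25, 25}: same result as the if/else version, no branch
    }
}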



CHARACTERISTICS OF IDEAL DATA PARALLEL WORKLOADS

- Code which iterates over large arrays of primitives
  - 32/64 bit data types preferred
- Where the order of iteration is not critical
  - Avoid data dependencies between iterations
- Each iteration contains sequential code (few branches)
- Good balance between data size (low) and compute (high)
  - Transfer of data to/from the GPU can be costly
    - Although APUs are likely to mitigate this over time
  - Trivial compute is often not worth the transfer cost
    - May still benefit by freeing up the CPU for other work

[Diagram: Compute vs. Data Size quadrants; the ‘Ideal’ region sits at high compute relative to data size, bounded by GPU memory]

APARAPI NBODY EXAMPLE

- NBody is a common OpenCL/CUDA benchmark/demo
- For each particle/body:
  - Calculate a new position based on the gravitational force exerted on each body by every other body (a sketch of the per-body update follows this list)
- Essentially an N^2 space problem
  - If we double the number of bodies, we perform four times the positional calculations
- The following charts compare:
  - Naive Java version (single loop)
  - Aparapi version using a Java thread pool
  - Aparapi version running on the GPU (ATI Radeon™ 5870)
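
Below is a minimal, hedged sketch of the per-body update. The flattened xyz/vxyz arrays, unit body mass, and the softening term espSqr are assumptions of this sketch, not details taken from the shipped demo:

import com.amd.aparapi.Kernel;

// Hypothetical per-body update kernel: O(N^2) force accumulation.
public class NBodySketch extends Kernel {
    final float delT = 0.005f, espSqr = 1.0f;   // time step, softening term
    final float[] xyz, vxyz;                    // positions and velocities, 3 floats per body
    final int bodies;

    NBodySketch(float[] xyz, float[] vxyz) {
        this.xyz = xyz; this.vxyz = vxyz; this.bodies = xyz.length / 3;
    }

    @Override public void run() {
        int body = getGlobalId() * 3;
        float ax = 0f, ay = 0f, az = 0f;
        for (int j = 0; j < bodies * 3; j += 3) {       // every other body
            float dx = xyz[j]   - xyz[body];
            float dy = xyz[j+1] - xyz[body+1];
            float dz = xyz[j+2] - xyz[body+2];
            float invDist = 1f / sqrt(dx*dx + dy*dy + dz*dz + espSqr);
            float s = invDist * invDist * invDist;      // ~ 1/r^3, unit mass assumed
            ax += dx * s; ay += dy * s; az += dz * s;
        }
        // Note: updates positions in place; a stricter version would
        // double-buffer positions between frames.
        xyz[body]   += vxyz[body]   * delT + 0.5f * ax * delT * delT;
        xyz[body+1] += vxyz[body+1] * delT + 0.5f * ay * delT * delT;
        xyz[body+2] += vxyz[body+2] * delT + 0.5f * az * delT * delT;
        vxyz[body]   += ax * delT;
        vxyz[body+1] += ay * delT;
        vxyz[body+2] += az * delT;
    }
}

Calling kernel.execute(bodies) once per frame advances the simulation by one time step.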


APARAPI NBODY PERFORMANCE (FRAME RATE VS NUMBER OF BODIES)

[Chart data: frames per second vs. number of bodies/particles]

  Bodies   Java Single Thread   Aparapi Thread Pool   Aparapi GPU
  1k       80.42                260.8                 670.2
  2k       19.96                72.67                 389.12
  4k       5.19                 19.37                 186.05
  8k       1.29                 5.47                  79.87
  16k      0.32                 1.45                  34.24
  32k      0.08                 0.38                  12.18
  64k      0.02                 0.1                   3.57
  128k     0.01                 0.02                  0.94

NBODY PERFORMANCE: CALCULATIONS PER µSEC VS. NUMBER OF BODIES

[Chart data: position calculations per µs vs. number of bodies/particles]

  Bodies   Java Single Thread   Aparapi Thread Pool   Aparapi GPU
  1k       84                   273                   702
  2k       83                   304                   1632
  4k       83                   313                   3146
  8k       86                   367                   5360
  16k      86                   388                   9190
  32k      86                   407                   13078
  64k      86                   412                   15663
  128k     86                   412                   16101

APARAPI EXPLICIT BUFFER MANAGEMENT

- This code demonstrates a fairly common pattern, namely a Kernel executed inside a loop:

int[] buffer = new int[HUGE];
int[] unusedBuffer = new int[HUGE];

Kernel k = new Kernel(){
   @Override public void run(){
      // mutates buffer contents
      // no reference to unusedBuffer
   }
};

for (int i = 0; i < 1000; i++){
   k.execute(HUGE);   // Transfer buffer to GPU
                      // Transfer buffer from GPU
}

- Although Aparapi analyzes kernel methods to optimize host buffer transfer requests, it has no knowledge of buffer accesses from the enclosing loop
- Aparapi must assume that the buffer is modified between invocations
- This forces (possibly unnecessary) buffer copies to and from the device for each invocation of Kernel.execute(int)

APARAPI EXPLICIT BUFFER MANAGEMENT

- Using the new explicit buffer management APIs:

int[] buffer = new int[HUGE];

Kernel k = new Kernel(){
   @Override public void run(){
      // mutates buffer contents
   }
};
k.setExplicit(true);
k.put(buffer);
for (int i = 0; i < 1000; i++){
   k.execute(HUGE);
}
k.get(buffer);

- Developer takes control (of all buffer transfers) by marking the kernel as explicit
- Then coordinates when/if transfers take place
- Here we save 999 buffer writes and 999 buffer reads

APARAPI EXPLICIT BUFFER MANAGEMENT

- A possible alternative might be to expose the ‘host’ code to Aparapi:

int[] buffer = new int[HUGE];

Kernel k = new Kernel(){
   @Override public void run(){
      // mutates buffer contents
   }

   @Override public void host(){
      for (int i = 0; i < 1000; i++){
         execute(HUGE);
      }
   }
};
k.host();

- Developer exposes the host code to Aparapi by overriding the host() method
- By analyzing the bytecode of host(), Aparapi can determine when/if buffers are mutated and can ‘inject’ appropriate put()/get() requests behind the scenes

APARAPI BITONIC SORT WITH EXPLICIT BUFFER MANAGEMENT

- Bitonic mergesort is a parallel friendly ‘in place’ sorting algorithm
  http://en.wikipedia.org/wiki/Bitonic_sorter
- On 10/18/2010 the following post appeared on the Aparapi forums:
  “Aparapi 140x slower than single thread Java?! what am I doing wrong?”
  - Source code (for Bitonic Sort) was included in the post
  - An Aparapi Kernel (for each sort pass) executed inside a Java loop (the shape of this pattern is sketched below)
  - Aparapi was forcing unnecessary buffer copies
- The following chart compares:
  - Single threaded Java version
  - Aparapi/GPU version without explicit buffer management (default AUTO mode)
  - Aparapi/GPU version with the recent explicit buffer management feature enabled
  - Both Aparapi versions running on an ATI Radeon™ 5870
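
A hedged sketch of the pattern under discussion: one compare-exchange kernel execution per (stage, pass), driven from a host loop, with explicit buffer management so the array crosses the bus only twice. This illustrates the shape of the forum code, not its actual source:

import com.amd.aparapi.Kernel;

// Hypothetical bitonic sort driver: one kernel execution per (k, j) pass.
// Assumes a.length is a power of two.
public class BitonicSketch {
    public static void sort(final int[] a) {
        final int[] kj = new int[2];                 // current stage/pass, read by the kernel
        Kernel pass = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                int k = kj[0], j = kj[1];
                int ixj = i ^ j;                     // partner index for compare-exchange
                if (ixj > i) {                       // only one side of each pair swaps
                    boolean ascending = (i & k) == 0;
                    if ((ascending && a[i] > a[ixj]) || (!ascending && a[i] < a[ixj])) {
                        int t = a[i]; a[i] = a[ixj]; a[ixj] = t;
                    }
                }
            }
        };
        pass.setExplicit(true);                      // we manage transfers ourselves
        pass.put(a);                                 // single write of the data up front
        for (int k = 2; k <= a.length; k <<= 1) {    // stage size doubles
            for (int j = k >> 1; j > 0; j >>= 1) {   // pass stride halves
                kj[0] = k; kj[1] = j;
                pass.put(kj);                        // tiny, but mutated on the host each pass
                pass.execute(a.length);
            }
        }
        pass.get(a);                                 // single read-back at the end
    }
}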


EXPLICIT BUFFER MANAGEMENT EFFECT ON BITONIC SORT IMPLEMENTATION

[Chart data: time (ms) vs. number of integers]

  Integers   Java Single Thread   GPU (AUTO)   GPU (EXPLICIT)
  16k        13                   117          17
  32k        21                   137          19
  64k        36                   164          23
  128k       69                   215          25
  256k       142                  332          34
  512k       296                  495          54
  1024k      632                  850          97
  2048k      1525                 1462         165
  4096k      3235                 2855         337

PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS

- A Java developer implementing an ‘nbody’ solution would probably define a class for each particle

public class Particle{
   private int x, y, z;
   private String name;
   private Color color;
   // ...
}

- ... would make all fields private and limit access via setters/getters

public void setX(int x){ this.x = x; }
public int getX(){ return this.x; }
// same for y, z, name etc.

- ... and expect to create a Kernel to update positions for an array of such particles

Particle[] particles = new Particle[1024];
ParticleKernel kernel = new ParticleKernel(particles);
while(displaying){
   kernel.execute(particles.length);
   updateDisplayPositions(particles);
}

PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS

- Unfortunately the current ‘alpha’ version of Aparapi would fail to convert this kernel to OpenCL
  - It would fall back to using a thread pool
- Aparapi currently requires that the previous code be refactored so that data is held in primitive arrays (a hypothetical kernel over these arrays is sketched below)

int[] x = new int[1024];
int[] y = new int[1024];
int[] z = new int[1024];
Color[] color = new Color[1024];
String[] name = new String[1024];
Positioner.position(x, y, z);

- This is clearly a potential ‘barrier to adoption’



PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS

- The proposed enhancement will allow Aparapi Kernels to access arrays (or array-based collections) of objects
- At runtime Aparapi:
  - Tracks all fields accessed via objects reachable from Kernel.run()
  - Extracts the data from these fields into a parallel-array form
  - Executes a Kernel using the parallel-array form
  - Returns the data back into the original object array
- These extra steps do impact performance (compared with the refactored data parallel form)
  - However, we can still demonstrate performance gains over non-Aparapi versions

FUTURE WORK

- Sync with ‘project lambda’ (Java 8) and allow kernels to be represented as lambda expressions
- Continue to investigate automatic extraction of buffer transfers from object collections
- Hand more explicit control to ‘power users’
  - Explicit buffer (or even sub-buffer) transfers
  - Expose local memory and barriers
- Open Source
  - Aiming for Q3 Open Source release of Aparapi
  - License TBD, probably a BSD variant
  - Still reviewing hosting options
  - Encourage community contributions

SIMILAR INTERESTING/RELATED WORK

- Tidepowerd
  - Offers a similar solution for .NET
  - NVIDIA cards only at present
  - http://www.tidepowerd.com/
- java-gpu
  - An open source project for extracting kernels from nested loops
  - Extracts code structure from bytecode
  - Creates CUDA behind the scenes
  - http://code.google.com/p/java-gpu/
- GRAPHITE-OpenCL
  - Auto-detects data parallel loops in the gcc compiler and generates OpenCL + host code for those loops
  - http://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=2010-GCC-Summit-Proceedings.pdf

SUMMARY

- APUs/GPUs offer unprecedented performance for the appropriate workload
  - Don’t assume everything can/should execute on the APU/GPU
  - Profile your Java code to uncover potential parallel opportunities
- Aparapi provides an ideal framework for executing data-parallel code on the GPU
- Find out more about Aparapi at http://developer.amd.com/Aparapi
- Participate in the upcoming Aparapi Open Source community

QUESTIONS

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, AMD Radeon, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group, Inc.

NVIDIA, the NVIDIA logo, and CUDA are trademarks or registered trademarks of NVIDIA Corporation.

Java, JVM, JDK and “Write Once, Run Anywhere” are trademarks of Oracle and/or its affiliates.

© 2011 Advanced Micro Devices, Inc. All rights reserved.