Challenges in Binary Translation for Desktop Supercomputing


David Kaeli

Rodrigo Dominguez


Department of Electrical and Computer Engineering

Northeastern University

Boston, MA


Current Trends in Many-core Computing

- The CPU industry has elected to jump off the cycle-time scaling bandwagon
- Power/thermal constraints have become a limiting factor
- We now see CPU vendors placing multiple (10s of) cores on a single chip
- Clock speeds have not changed
- The memory wall persists, and multiple cores that assume a shared-memory model place further pressure on this problem
- Software vendors are looking for new parallelization technology:
  - Multi-core aware operating systems
  - Semi-automatic parallelizing compilers

Current Trends in Many-core Computing

- There has been a renewed interest in parallel computing paradigms and languages
- Existing many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)
- Heterogeneous systems are becoming a common theme
- The trend will only accelerate if proper programming frameworks are available to effectively exploit many-core resources

Graphics Processing Units (GPUs)

- More than 64% of Americans played a video game in 2009
- High-end GPUs: primarily used for 3-D rendering in video game graphics and movie animation
- Mid/low-end GPUs: primarily used for computer displays
- Manufacturers include NVIDIA, AMD/ATI, and IBM (Cell)
- A very competitive commodity market





GPU Performance

- GPUs provide a path for performance growth
- Cost and power usage numbers are also impressive
- [Chart: near-exponential growth in performance for GPUs. Source: NVIDIA, 2009]

Comparison of CPU and GPU Hardware Architectures

- CPU: cache heavy, focused on individual thread performance
- GPU: ALU heavy, massively parallel, throughput-oriented

CPU/GPU Relationship

[Diagram: the CPU (host) drives a GPU with its own local DRAM (the device)]

A Wide Range of GPU Applications

3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projection, electromagnetic simulation, equity trading, film, finance, languages, GIS, holographic cinema, machine learning, mathematics research, military, mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, N-body simulation, network processing, neural networks, oceanographic research, optical inspection, particle physics, protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescopes, video, visualization, wireless, X-ray

GPU as a General-Purpose Computing Platform

Speedups are impressive and ever increasing!

Application                                               Speedup
Genetic algorithm                                         2600x
Real-time elimination of undersampling artifacts          2300x
Lattice-Boltzmann method for numerical fluid mechanics    1840x
Total variation modeling                                  1000x
Fast total variation for computer vision                  1000x
Monte Carlo simulation of photon migration                1000x
Stochastic differential equations                         675x
K-nearest neighbor search                                 470x

Source: CUDA Zone at www.nvidia.com/cuda/

GPGPU Is Becoming Mainstream Research

- Research activities are expanding significantly
- [Chart: search results for the keyword "GPGPU" in IEEE and ACM]

NVIDIA GT200 Architecture

[Diagram: a Streaming Processor Array of 10 Texture Processor Clusters (TPCs); each TPC contains 3 Streaming Multiprocessors (SMs) and a Texture Unit, and each SM contains 8 Streaming Processors (SPs) and 2 Special Function Units (SFUs)]

- Threads are organized as a grid of thread blocks; each block comprises many warps of threads, and each warp comprises individual threads
- 240 shader cores
- 1.4B transistors
- Up to 2GB onboard memory
- ~150GB/sec bandwidth
- 1.06 single-precision TFLOPS
- CUDA and OpenCL support
- Programmable memory spaces
- Tesla S1070 provides 4 GPUs in a 1U unit

AMD/ATI Radeon HD 5870

- Codename "Evergreen"
- 1600 SIMD cores
- L1/L2 memory architecture
- 153GB/sec memory bandwidth
- 2.72 single-precision TFLOPS
- OpenCL and DirectX 11 support
- Hidden memory microarchitecture
- Provides for vectorized operation

Comparison of CPU and GPU Hardware Architectures

CPU/GPU          Single-precision TFLOPS   Cores   GFLOPS/Watt   $/GFLOP
NVIDIA 285       1.06                      240     5.8           $3.12
NVIDIA 295       1.79                      480     6.2           $3.80
AMD HD 5870      2.72                      1600    14.5          $0.16
AMD HD 4890      1.36                      800     7.2           $0.18
Intel Core i7 965  0.051                   4       0.39          $11.02

Source: NVIDIA, AMD, and Intel

AMD vs. NVIDIA

                        AMD                        NVIDIA
Hardware architecture   Vector                     Scalar
Programming language    Brook+, IL, OpenCL         CUDA, OpenCL
Programming model       SIMD vector                SIMT
Thread hierarchy        Single level               Two level
Memory exposure         Uniform space              Multiple spaces
Source of horsepower    Vectorization and          Memory space utilization,
                        multiple outputs           including shared memory
Pros                    Easier programming         More flexible programming
Challenges              Harnessing the potential horsepower (both)

Talk Outline

- Introduction to GPUs
- Overview of the toolchains for both CUDA and OpenCL
- Motivation for pursuing this work:
  - Comparing intermediate representations
  - Leveraging/analyzing benefits of Open64 optimization on AMD GPUs
  - Comparing challenges of fundamentally different ISAs (superscalar SIMT versus VLIW SIMT)
- Discuss PTX and IL
- Describe new common IR
- Two examples of PTX -> IR -> IL binary translation
- Discuss status of project and future work


GPU Programming Model

- Single Instruction Multiple Threads (SIMT)
- Parallelism is implicit
- Programs (also called kernels or shaders) are generally small and contain nested loops
- Synchronization is handled explicitly (see the sketch below)
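To make the model concrete, here is a minimal CUDA sketch (our illustration, not from the slides): parallelism is implicit in the per-thread index, while the barrier is explicit.

// Reverse an array in place within one thread block.
// Assumes a single block and n <= 256.
__global__ void reverse(int *d, int n) {
    __shared__ int tmp[256];           // per-block scratch space
    int i = threadIdx.x;               // implicit parallelism: one index per thread
    if (i < n) tmp[i] = d[i];
    __syncthreads();                   // explicit barrier: every thread in the
                                       // block must arrive before any proceeds
    if (i < n) d[i] = tmp[n - 1 - i];
}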


Toolchains

- Toolchain = compiler + runtime library
- NVIDIA: C for CUDA and OpenCL sit on top of the CUDA runtime and the graphics driver, which drive the GPU
- AMD: Brook+ and OpenCL sit on top of the CAL runtime and the graphics driver, which drive the GPU

CUDA Compiler

[Diagram: at compile time, cudafe splits C for CUDA source into host code, which the host compiler links into the exe, and GPU code, which Open64 compiles to PTX*; at execution time, the runtime and driver compile the PTX into the GPU binary]

* PTX is included as data in the host application

OpenCL (Dynamic) Compiler

[Diagram: at compile time, the host compiler links the application (exe) against the OpenCL library; at execution time, the OpenCL runtime compiles kernels through LLVM, and the driver produces the GPU binary]

Objectives of Our Work

- Compare two different IRs from similar massively-threaded architectures
- Influence future IR design (an active topic in GPGPU research)
- Leverage/analyze benefits of Open64 optimizations
- Compare challenges of fundamentally different ISAs: superscalar/SIMT versus VLIW/SIMT

CUDA Runtime

- Device management: cudaSetDevice, cudaGetDevice
- Memory management
  - Allocation: cudaMalloc, cudaFree
  - Transfer: cudaMemcpy, cudaMemset
- Execution control
  - Kernel launch: cudaLaunch
  - Configuration: cudaConfigureCall
- Thread management: cudaThreadSynchronize


CUDA Runtime (Vector Add example)

__global__ void vecAdd(int A[], int B[], int C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    int hA[] = {…};
    int hB[] = {…};
    int hC[N];                               // host result buffer
    int *dA, *dB, *dC;                       // device buffers
    cudaMalloc((void **)&dA, sizeof(hA));
    cudaMalloc((void **)&dB, sizeof(hB));
    cudaMalloc((void **)&dC, sizeof(hC));

    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(dA, dB, dC);            // expands to cudaConfigureCall,
                                             // cudaSetupArgument, cudaLaunch

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
}
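The three runtime calls noted in the comment do the actual launch. A rough, hedged sketch (ours, using the old execution-control API; the argument offsets assume 64-bit pointers) of what the compiler emits for the <<<1, N>>> launch above:

cudaConfigureCall(dim3(1), dim3(N));        // grid and block dimensions
cudaSetupArgument(&dA, sizeof(dA), 0);      // push each kernel argument
cudaSetupArgument(&dB, sizeof(dB), 8);
cudaSetupArgument(&dC, sizeof(dC), 16);
cudaLaunch("vecAdd");                       // launch by entry point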

NVIDIA PTX

- Low-level IR (close to the ISA)
- Pseudo-assembly style syntax
- Load-store instruction set
- Strongly typed language, e.g.:
    cvt.s32.u16 %r1, %tid.x;
- Unlimited virtual registers
- Predicate registers

AMD IL


High
-
level IR



Structured control flow (if
-
endif
, while
-
end, switch
-
end)



No predication



32
-
bit registers (4 components)
-

vectorization

Common PTX and IL Instructions

vectorAdd (PTX):

mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
ld.param.s32 %r4, [N];
setp.le.s32 %p1, %r4, %r3;
@%p1 bra $LabelA;
cvt.u64.s32 %rd1, %r3;
mul.lo.u64 %rd2, %rd1, 4;
ld.param.u64 %rd3, [A];
add.u64 %rd4, %rd3, %rd2;
ld.global.f32 %f1, [%rd4+0];
ld.param.u64 %rd5, [B];
add.u64 %rd6, %rd5, %rd2;
ld.global.f32 %f2, [%rd6+0];
add.f32 %f3, %f1, %f2;
ld.param.u64 %rd7, [C];
add.u64 %rd8, %rd7, %rd2;
st.global.f32 [%rd8+0], %f3;
$LabelA:
exit;

- Data movement (mov)
- Memory access (ld, st)
- Arithmetic (mul, add)
- Conversion (cvt)
- Comparison and selection (setp)
- Control flow (bra): uses predication for conditional branches

Common PTX and IL Instructions

vectorAdd (IL):

mov r0, vThreadGrpId.x
mov r1, cb0[0].x
imul r2, r0, r1
mov r3, vTidInGrp.x
iadd r4, r3, r2
mov r5, cb1[3]
ige r6, r4, r5
if_logicalz r6
  mov r7, r4
  imul r8, r7, l0
  mov r9, cb1[0]
  iadd r10, r9, r8
  uav_raw_load_id(0) r11, r10
  mov r12, cb1[1]
  iadd r13, r12, r8
  uav_raw_load_id(0) r14, r13
  add r15, r11, r14
  mov r16, cb1[2]
  iadd r17, r16, r8
  uav_raw_store_id(0) mem.xyzw, r17, r15
endif
end

- Data movement (mov)
- Memory access (uav_raw_load, uav_raw_store)
- Arithmetic (imul, iadd)
- No conversion instructions
- Comparison and selection (ige)
- Control flow (if_logicalz): structured statements

Ocelot Framework*

- Implemented as a CUDA library
- Intercepts library calls
- PTX emulation on the CPU
- Parses PTX into an internal IR
- Analysis: CFG, SSA, data flow, optimizations

Our work:

- An IR for IL programs
- PTX IR -> IL IR translation
- AMD/CAL backend

* Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU workloads and systems. In GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31-42, New York, NY, USA, 2010. ACM.
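Since Ocelot presents itself as the CUDA runtime, interception is just symbol replacement: the application links against a library that exports the runtime's entry points. A minimal sketch of the idea (our illustration; Ocelot's real implementation is far more complete, and the translation steps shown are hypothetical):

#include <cstdio>

typedef int cudaError_t;                  // 0 == cudaSuccess
static const cudaError_t cudaSuccess = 0;

// Exported in place of libcudart's cudaLaunch: every kernel launch in the
// application lands here instead of in NVIDIA's runtime.
extern "C" cudaError_t cudaLaunch(const char *entry) {
    std::printf("intercepted launch of kernel entry %p\n", (const void *)entry);
    // ...locate the kernel's PTX, translate it to IL, and dispatch the
    // resulting binary through the CAL runtime (hypothetical steps)...
    return cudaSuccess;
}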

Translation Framework

[Diagram: the exe carries PTX; at compile time, Ocelot's parser builds the IR, runs analysis, and translates it to IL; the CAL back-end and ATI driver then produce and execute the GPU binary]

IL Control Tree

- Based on structural analysis*
- Build a DFS spanning tree of the control flow graph and traverse it in postorder
- Form regions and collapse the nodes in the CFG
- Construct the control tree in the process
- Repeat until only 1 node is left in the CFG (see the sketch below)

* S. Muchnick. Advanced Compiler Design and Implementation, chapter 7.7. Morgan Kaufmann, 1997.
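A compilable toy version of this loop (our sketch, not Ocelot's code; real structural analysis handles many more schemas and irreducible flow). It matches just two schemas, straight-line Block and IfThen, which is enough to reduce the if-then CFG of Example 1 below to a single control-tree node:

#include <cstdio>
#include <vector>
#include <algorithm>

struct Node {
    const char *name;            // label for tracing
    std::vector<Node *> succ;    // CFG successors
    std::vector<Node *> kids;    // control-tree children once collapsed
};

static void redirect(std::vector<Node *> &cfg, Node *from, Node *to) {
    for (Node *n : cfg)
        for (Node *&s : n->succ)
            if (s == from) s = to;      // retarget edges into the region
}

static void erase(std::vector<Node *> &cfg, Node *n) {
    cfg.erase(std::find(cfg.begin(), cfg.end(), n));
}

// Try to collapse a schema rooted at n; true if the CFG shrank.
static bool matchAndCollapse(std::vector<Node *> &cfg, Node *n) {
    if (n->succ.size() == 2) {          // IfThen: n -> {then, merge}
        Node *t = n->succ[0], *m = n->succ[1];
        if (t->succ.size() == 1 && t->succ[0] == m) {
            Node *r = new Node{"IF", {m}, {n, t}};
            redirect(cfg, n, r); erase(cfg, n); erase(cfg, t);
            cfg.push_back(r);
            return true;
        }
    }
    if (n->succ.size() == 1 && n->succ[0] != n
        && n->succ[0]->succ.size() <= 1) {   // Block: n falls through to s
        Node *s = n->succ[0];
        Node *r = new Node{"Block", s->succ, {n, s}};
        redirect(cfg, n, r); erase(cfg, n); erase(cfg, s);
        cfg.push_back(r);
        return true;
    }
    return false;
}

int main() {
    // CFG of Example 1: mov -> setp -> {cvt, exit}, cvt -> exit
    Node exitBB{"BB:exit", {}, {}};
    Node cvt{"BB:cvt", {&exitBB}, {}};
    Node setp{"BB:setp", {&cvt, &exitBB}, {}};
    Node mov{"BB:mov", {&setp}, {}};
    std::vector<Node *> cfg = {&mov, &setp, &cvt, &exitBB};

    bool changed = true;
    while (cfg.size() > 1 && changed) {  // repeat until one node remains
        changed = false;
        for (Node *n : std::vector<Node *>(cfg))
            if ((changed = matchAndCollapse(cfg, n))) break;
    }
    std::printf("control tree root: %s\n", cfg.front()->name);
    return 0;                            // region nodes are leaked; sketch only
}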

IL Control Tree

[Diagram: an example control tree. Under the Entry node, a WHILE abstract node groups cond and body basic blocks, and an IF abstract node groups cond, true, and false basic blocks; abstract nodes represent collapsed regions]

Example 1 (if-then)

PTX:

mov.u16 …
setp.le.s32 p1, r4, r3
@p1 bra LabelA
cvt.u64.s32 …
LabelA:
exit

[Control tree: Entry -> Block( BB: mov.. ; IF( cond: BB: setp.., true: BB: cvt… ) ; BB: exit )]

Example 1 (if-then)

The same control tree, emitted as structured IL:

mov …
ige r6, r4, r5
if_logicalz r6
  mov …
endif
end

Example 2 (for-loop)

PTX:

mov.u16 …
setp.le.s32 p1, r5, r3
@p1 bra LabelA
cvt.u64.s32 …
LabelB:
…
setp.lt.s32 p2, r4, r5
@p2 bra LabelB
LabelA:
exit

[Control tree: Entry -> Block( BB: mov.. ; IF( cond: BB: setp.., true: Block( BB: cvt… ; WHILE( cond/body: setp… ) ) ) ; BB: exit )]

Example 2 (for-loop)

The same control tree, emitted as structured IL:

mov …
ige r7, r4, r6
if_logicalz r7
  mov …
  whileloop
    …
    if_logicalz r17
      break
    endif
  endloop
endif
end

Other BT Challenges

- Pointer arithmetic in CUDA needs to be emulated in CAL (see the sketch below)
- Translating the Application Binary Interface (ABI), e.g., different calling conventions
- Architectural bitness: Tesla and Cypress are 32-bit architectures, but Fermi is 64-bit
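To make the first point concrete, a hedged sketch of the idea (our illustration with hypothetical types, not the project's actual code): since CAL/IL addresses a buffer by resource plus offset rather than by raw address, a CUDA pointer can be modeled as a (resource, byte offset) pair on which the arithmetic is performed.

#include <cstdint>

// Emulated CUDA pointer: a CAL resource id plus a byte offset into it.
struct EmulatedPtr {
    uint32_t resource;   // CAL/IL buffer (UAV) id
    uint32_t offset;     // byte offset within the resource
};

// p + i (for an element type of elemSize bytes) becomes offset arithmetic,
// mirroring the mul.lo/add address computation in the PTX listing above.
static EmulatedPtr ptrAdd(EmulatedPtr p, int32_t i, uint32_t elemSize) {
    p.offset += (uint32_t)i * elemSize;
    return p;
}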

Project Status

- Main CUDA library APIs are implemented (cudaMalloc, cudaMemcpy, cudaLaunch, etc.)
- 3 CUDA applications from the SDK are running
- Code quality is comparable to LLVM code generation

Next Steps

- Enhance translation of the control tree to support other IL constructs (e.g., switch-case)
- Implement other GPGPU abstractions (e.g., shared memory, textures)
- Handle PTX predicated instructions, since IL does not support predication directly (see the sketch below)
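One plausible lowering (our sketch in the slides' PTX/IL notation, not necessarily the approach the project will take) wraps each predicated PTX instruction in a structured IL conditional. A predicated PTX instruction such as

    @%p1 add.f32 %f1, %f2, %f3;

could be emitted as a guarded IL region (register names illustrative):

    if_logicalnz r_p1
      add r1, r2, r3
    endif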

Summary and Future Work

- GPUs are revolutionizing desktop supercomputing
- A number of critical applications have been migrated successfully
- CUDA and OpenCL have made these platforms much more accessible for general-purpose computing
- AMD presently has the highest double-precision floating-point performance
- CUDA presently produces higher-performance code for NVIDIA GPUs
- We are developing a platform that leverages the best of both worlds