Efficiency Programming for the (Productive) Masses

Nov 7, 2013

BERKELEY PAR LAB

Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson, Krste Asanovic, Dave Patterson, Kurt Keutzer


UC Berkeley Parallel Computing Lab/UPCRC

Make productivity programmers efficient, and efficiency programmers productive?

• Productivity level language (PLL): Python, Ruby
  • high-level abstractions well-matched to application domain => 5x faster development and 3-10x fewer lines of code
  • >90% of programmers
• Efficiency level language (ELL): C/C++, CUDA, OpenCL
  • >5x longer development time
  • potential 10x-100x performance by exposing HW model
  • <10% programmers, yet their work is poorly reused

5x development time
10x-100x performance!

Raise level of abstraction and get performance?

Capture patterns instead of “domains”?

• Efficiency programmers know how to target computation patterns to hardware
  • stencil/SIMD codes => GPUs
  • sparse matrix => communication-avoiding algorithms on multicore
  • “Big finance” Monte Carlo sim => MapReduce
• Libraries? Useful, but don’t raise abstraction level
• How to make ELL work accessible to more PLL programmers?

“Stovepipes”: Connect Pattern to Platform

[Figure: two layer diagrams, listed here top to bottom]

Traditional Layers
  App domains: Virt. worlds, Data viz., Robotics, Music
  Computation domains: Rendering, Probabilistic, Physics, Lin. Alg.
  Language: Common language substrate
  Thick Runtime: Runtime & OS
  Hardware: OOO, GPU, SIMD, FPGA, Cloud

“Stovepipes”
  Applications: Virt. worlds, Data viz., Robotics, Music
  Motifs/Patterns: Sparse Matrix, Dense Matrix, Stencil (annotation: “Humans must produce these”)
  Thin Runtime: Runtime & OS
  Hardware: OOO, GPU, SIMD, FPGA, Cloud

SEJITS: Selective, Embedded Just-in-Time Specialization

• Productivity programmers write in a general-purpose, modern, high-level PLL
• SEJITS infrastructure specializes computation patterns selectively at runtime
• Specialization uses runtime info to generate and JIT-compile ELL code targeted to hardware
• Embedded because the PLL’s own machinery enables it (vs. extending the PLL interpreter)

Specifically...

• When “specializable” function is called:
  • determine if specializer available for current platform
  • if no: continue executing normally in PLL
• If a specializer is found, it can:
  • manipulate/traverse AST of the function
  • emit & JIT-compile ELL source code
  • dynamically link compiled code to PLL interpreter
• Specializers written in PLL
• Necessary features present in modern PLL’s, but absent from older widely-used PLL’s
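
The call-time flow in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the SEJITS implementation: the names `specializable` and `register_specializer` and the platform string "openmp" are hypothetical, and a "specializer" here is just a Python callable standing in for code that would generate and compile ELL source.

```python
import functools

# Hypothetical registry: (function name, platform) -> specializer.
# A specializer receives the original function and returns a replacement
# callable, or None if it cannot handle that function.
_SPECIALIZERS = {}


def register_specializer(func_name, platform, specializer):
    _SPECIALIZERS[(func_name, platform)] = specializer


def specializable(platform):
    """Mark a function as specializable for the given (assumed) platform."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Step 1: is a specializer available for this platform?
            specializer = _SPECIALIZERS.get((func.__name__, platform))
            specialized = specializer(func) if specializer else None
            if specialized is None:
                # Step 2: if not, keep executing normally in the PLL.
                return func(*args, **kwargs)
            # Step 3: otherwise run the specialized version instead.
            return specialized(*args, **kwargs)
        return wrapper
    return decorate


@specializable("openmp")
def sum_squares(values):
    return sum(v * v for v in values)
```

With no specializer registered, `sum_squares([1, 2, 3])` simply runs the Python body. This sketch re-invokes the specializer on every call; a real system would presumably cache the result.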

B
ERKELEY
P
AR
L
AB

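[Figure: SEJITS flow. A productivity app (.py) containing f() and the decorated functions @g() and @h() runs in the PLL interpreter. For a decorated function, SEJITS hands off to a specializer, which emits .c source; cc/ld turns it into a .so that is linked back into the interpreter. A box labeled “$” and the OS/HW layer are also shown.]

SEJITS makes tuning decisions per-function (not per-app)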
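
To illustrate the "traverse the AST, emit ELL source" step, here is a deliberately tiny sketch using Python's standard `ast` module. It is an assumption-laden toy, not a SEJITS specializer: `emit_c` and `expr_to_c` are hypothetical names, only single-`return` functions built from `+ - * /`, names and numeric literals are handled, and every argument is assumed to be a C `double`.

```python
import ast

# Python operator node -> C operator (only these four are supported).
_C_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def expr_to_c(node):
    """Translate a small arithmetic expression AST into C source text."""
    if isinstance(node, ast.BinOp) and type(node.op) in _C_OPS:
        return "(%s %s %s)" % (
            expr_to_c(node.left), _C_OPS[type(node.op)], expr_to_c(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return repr(float(node.value))
    # Anything else is outside this toy specializer's pattern.
    raise NotImplementedError(type(node).__name__)


def emit_c(py_source):
    """Emit a C function for a Python function of the form 'return <expr>'."""
    func = ast.parse(py_source).body[0]
    if not isinstance(func, ast.FunctionDef) or len(func.body) != 1:
        raise NotImplementedError("expected a single one-statement function")
    stmt = func.body[0]
    if not isinstance(stmt, ast.Return) or stmt.value is None:
        raise NotImplementedError("expected 'return <expr>'")
    params = ", ".join("double " + a.arg for a in func.args.args)
    return "double %s(%s) { return %s; }" % (
        func.name, params, expr_to_c(stmt.value))
```

A caller would treat `NotImplementedError` as "no specializer applies" and keep running the original Python function.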

[Figure: the same SEJITS flow diagram, now annotated with the four parts of the name: Selective, Embedded, JIT, Specialization.]

SEJITS makes tuning decisions per-function (not per-app)
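
The ".c -> cc/ld -> .so -> interpreter" path in the figure can be approximated with the standard library alone. The sketch below is an assumption, not the SEJITS mechanism: `jit_compile` is a hypothetical helper, it expects a compiler named `cc` or `gcc` on the PATH, it only supports functions taking and returning `double`, and it returns `None` whenever anything goes wrong so the caller can fall back to the PLL.

```python
import ctypes
import os
import shutil
import subprocess
import tempfile


def jit_compile(c_source, func_name, n_args):
    """Compile C source to a shared object and return a callable, or None."""
    cc = shutil.which("cc") or shutil.which("gcc")
    if cc is None:
        return None  # no toolchain available: caller should stay in Python
    workdir = tempfile.mkdtemp(prefix="sejits_sketch_")  # not cleaned up here
    c_path = os.path.join(workdir, func_name + ".c")
    so_path = os.path.join(workdir, func_name + ".so")
    with open(c_path, "w") as f:
        f.write(c_source)
    try:
        # Build a shared object from the emitted source.
        subprocess.run(
            [cc, "-O2", "-shared", "-fPIC", "-o", so_path, c_path],
            check=True, capture_output=True)
        # Dynamically link it into the running interpreter.
        lib = ctypes.CDLL(so_path)
        fn = getattr(lib, func_name)
    except (subprocess.CalledProcessError, OSError, AttributeError):
        return None
    fn.restype = ctypes.c_double
    fn.argtypes = [ctypes.c_double] * n_args
    return fn
```

Usage: `fn = jit_compile("double axpy(double a, double x, double y) { return a * x + y; }", "axpy", 3)`; if `fn` is not `None`, `fn(2.0, 3.0, 1.0)` calls the compiled code.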

Example: Stencil Computation in Ruby

    class LaplacianKernel < Kernel
      def kernel(in_grid, out_grid)
        in_grid.each_interior do |point|
          point.neighbors(1).each do |x|
            out_grid[point] += 0.2*x.val
          end
        end
      end
    end


    VALUE kern_par(int argc, VALUE* argv, VALUE self) {
      unpack_arrays into in_grid and out_grid;

      #pragma omp parallel for default(shared) private (t_6,t_7,t_8)
      for (t_8=1; t_8<256-1; t_8++) {
        for (t_7=1; t_7<256-1; t_7++) {
          for (t_6=1; t_6<256-1; t_6++) {
            int center = INDEX(t_6,t_7,t_8);
            out_grid[center] = (out_grid[center]
                +(0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
            ...
            out_grid[center] = (out_grid[center]
                +(0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
          }
        }
      }
      return Qtrue;
    }


• Specializer emits OpenMP
• 1000x-2000x faster than Ruby
• Use introspection to grab parameters, inspect AST of computation
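
For readers who want the semantics of the kernel spelled out, here is a plain-Python rendering of the same stencil, i.e. roughly what the unspecialized fallback computes. It is a sketch under assumptions: `laplacian_kernel` and `make_grid` are hypothetical names, grids are nested lists, and `neighbors(1)` is taken to mean the six face neighbors at distance 1 (the emitted C shows the first and last of these terms and elides the rest).

```python
def make_grid(n, value):
    """Build an n x n x n grid (nested lists) filled with value."""
    return [[[value] * n for _ in range(n)] for _ in range(n)]


def laplacian_kernel(in_grid, out_grid, weight=0.2):
    """Add weight * each face neighbor of in_grid into every interior point."""
    nx, ny, nz = len(in_grid), len(in_grid[0]), len(in_grid[0][0])
    # Assumed meaning of neighbors(1): the six points one step away
    # along each axis.
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0),
               (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    # Interior points only, mirroring the 1 .. n-2 loop bounds in the C code.
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                for di, dj, dk in offsets:
                    out_grid[i][j][k] += weight * in_grid[i + di][j + dj][k + dk]
    return out_grid
```

The four nested interpreted loops are the part a specializer replaces with the OpenMP loop nest.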

Example: Sparse Matrix-Vector Multiply in Python

    # “Gather nonzero entries,
    # multiply them by vector,
    # do for each column”

[Python listing not captured in this transcript]

Specializer outputs CUDA for nvcc:

[CUDA listing not captured in this transcript]

SEJITS leverages downstream toolchains

B. Catanzaro et al., joint work with NVIDIA Research
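
Since the slide's listing did not survive, the following is one plausible reading of its three comments in plain Python, not the code from the slide. The function name `spmv_by_column` and the storage layout (one list of `(row, value)` pairs per column) are assumptions made for this sketch.

```python
def spmv_by_column(columns, x, n_rows):
    """Compute y = A * x for a sparse matrix stored column by column.

    columns[j] holds (row, value) pairs for the nonzeros of column j.
    """
    y = [0.0] * n_rows
    for j, nonzeros in enumerate(columns):   # "do for each column"
        for row, value in nonzeros:          # "gather nonzero entries"
            y[row] += value * x[j]           # "multiply them by vector"
    return y
```

For the 2x3 matrix [[1, 0, 2], [0, 3, 0]], `columns` would be `[[(0, 1.0)], [(1, 3.0)], [(0, 2.0)]]`.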

SEJITS in the Cloud

[Figure: the SEJITS flow retargeted to the cloud. The productivity app (.py) with f(), @g() and @h() runs in the PLL interpreter; the specializer emits .scala, which scalac compiles for a Spark worker running on Nexus on Eucalyptus or EC2. A box labeled “$” is also shown.]

Spark & Nexus

• Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
• Relies on Scala runtime and data-parallel abstractions
• Relies on Nexus (cloud resource management) layer

Example: Logistic regression using Spark/Scala (in progress)

M. Zaharia et al., Spark: Cluster Computing With Working Sets, HotCloud’10

B. Hindman et al., Nexus: A Common Substrate for Cluster Computing, HotCloud’09
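
The slide gives no code for this example, so here is a hedged sketch of the data-parallel shape such a computation takes: a `map` that computes a per-point gradient and a `reduce` that sums them. It uses only the Python standard library and no Spark API; the function names, the step size, and the choice of labels in {-1, +1} are assumptions of this sketch.

```python
import math
from functools import reduce


def gradient(w, point):
    """Logistic-loss gradient contribution of one (features, label) point."""
    x, y = point  # label y is assumed to be -1 or +1
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]


def add(u, v):
    """Element-wise vector sum, used as the reduce operator."""
    return [a + b for a, b in zip(u, v)]


def train(points, dims, iterations, step=1.0):
    """Gradient descent where each iteration is one map followed by a reduce."""
    w = [0.0] * dims
    for _ in range(iterations):
        grad = reduce(add, map(lambda p: gradient(w, p), points))
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w
```

The `map`/`reduce` pair is the pattern a cloud specializer would have to recognize and retarget; everything else stays ordinary Python.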

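SEJITS in the Cloud

[Figure: the same flow targeting Hadoop. The productivity app (.py) with f(), @g() and @h() runs in the PLL interpreter; the specializer emits .java, which javac compiles for a Hadoop master running on Nexus on Cloud. A box labeled “$” is also shown.]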

SEJITS for Cloud Computing

• Idea: same Python app runs on desktop, on manycore, and in cloud
• Cloud/multicore synergy: specialize intra-node as well as generate cloud code
• Cloud: Emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
• Single node: Emit JIT-able code for OpenCL, CUDA, OpenMP, ...
• Combine abstractions in one app
• Remember...can always fall back to PLL


Questions

• Won’t we need lots & lots of specializers?
  • if ParLab “motifs” bet is correct, ~10s of specializers will go a long way
• What about libraries, frameworks, etc.?
  • SEJITS is complementary to frameworks
  • Most libraries are written for ELLs, which lack features that promote code reuse; libraries don’t raise abstraction level
• Why isn’t this just as hard as “magic compiler”?
  • Specializers written by human experts
  • SEJITS allows “crowdsourcing” them
• Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?
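
To make the last question concrete, here is the same element-wise scaling written three ways in Python: an index loop of the kind a Fortran or Matlab programmer might write first, a list comprehension, and a `map`. The function names are illustrative only.

```python
def scale_loop(values, factor):
    """Index-based loop: explicit storage and subscripts."""
    result = [0.0] * len(values)
    for i in range(len(values)):
        result[i] = factor * values[i]
    return result


def scale_comprehension(values, factor):
    """List comprehension: the per-element operation is stated directly."""
    return [factor * v for v in values]


def scale_map(values, factor):
    """Functional style: apply one function to every element."""
    return list(map(lambda v: factor * v, values))
```

All three return the same list. The second and third forms state the per-element computation without index bookkeeping, which is arguably the kind of structure that is easier for a pattern-based tool to recognize.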

Conclusion

• SEJITS enables code-generation strategy per-function, not per-app
• Uniform approach to productive programming
  • same app on cloud, multicore, autotuned libraries
  • Combine multiple frameworks/abstractions in same app
• Research enabler
  • Incrementally develop specializers for different motifs or prototype HW
  • Don’t need full compiler & toolchain just to get started

Questions
