Accelerators - SC11 Education Program



Old CW (conventional wisdom): Transistors expensive

New CW: Power wall. Power expensive, transistors free
(can put more on a chip than you can afford to turn on)

Old CW: Multiplies are slow, memory access is fast

New CW: Memory wall. Memory slow, multiplies fast
(200-600 clocks to DRAM memory, 4 clocks for an FP multiply)

Old CW: Increasing instruction-level parallelism (ILP) via compilers, innovation
(out-of-order execution, speculation, VLIW, ...)

New CW: ILP wall. Diminishing returns on more ILP

New: Power Wall + Memory Wall + ILP Wall = Brick Wall

Old CW: Uniprocessor performance 2X / 1.5 yrs

New CW: Uniprocessor performance only 2X / 5 yrs?

Credit: D. Patterson, UC Berkeley
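A quick sense of scale for the memory wall: at 200-600 clocks per DRAM access versus 4 clocks per FP multiply, a core needs on the order of 50-150 multiplies of independent work per memory access just to hide the latency; that gap is what caches, and later massive multithreading, try to cover.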



It turns out that sacrificing uniprocessor performance for power savings can buy you a lot.

Example:

Scenario One: a one-core processor with power budget W

Increase frequency/ILP by 20%

Substantially increases power, by more than 50%

But only increases performance by 13%

Scenario Two: decrease frequency by 20% with a simpler core

Decreases power by 50%

Can now add another core (one more ox!)


"If one ox could not do the job, they did not try to grow a bigger ox,


but used two oxen."
-

Admiral Grace Murray Hopper.
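A rough sanity check on those numbers, assuming dynamic power scales as P = C * V^2 * f and that supply voltage tracks frequency (so power goes roughly as f^3):

1.2^3 = 1.73: a 20% frequency bump costs ~73% more power ("more than 50%"), while the performance gain stays sublinear (~13%) since memory accesses do not speed up.

0.8^3 = 0.51: a 20% frequency cut roughly halves power, so two such cores fit the original budget (2 x 0.51 = 1.02 W) and offer up to 2 x 0.8 = 1.6x the throughput on parallel work.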





Chickens are gaining momentum nowadays:

For certain classes of applications (not including field plowing...), you can run many cores at lower frequency and come out ahead (big time) at the speed game

Molecular dynamics codes (VMD, NAMD, etc.) have reported speedups of 25x-100x!!

"If one ox could not do the job, they did not try to grow a bigger ox,


but used two oxen."
-

Admiral Grace Murray Hopper.


"If you were plowing a field, which would you rather use?


Two strong oxen or 1024 chickens ?"
-

Seymour Cray


Oxen are good at plowing


Chickens pick up feed



Which do I use if I want to catch mice?


I’d much rather have a couple cats



Moral: Finding the most appropriate tool for the job brings
about savings in efficiency


Addendum: That tool will only exist and be affordable if
someone can make money on it.


Cray High Density Custom Compute System

"Same" performance on Cray's 2-cabinet custom solution compared to a 200-cabinet x86 off-the-shelf system

Engineered to achieve application performance at < 1/100 the space, weight, and power cost of an off-the-shelf system

Cray designed, developed, integrated, and deployed

System Characteristics    Cray Custom Solution    Off-the-Shelf System
Cabinets                  2                       200
Sockets                   48                      37,376
Core Count                96                      149,504
FPGAs                     88                      0
Total Power               42.7 kW                 8,780 kW
Peak Flops                499 GF                  1.2 PF
Total Floor Space         8.87 sq ft              4,752 sq ft

[Figure: energy efficiency (log scale, 0.1-1000) vs. flexibility (coverage) for dedicated HW ASICs, reconfigurable processor/logic, DSPs, ASPs, and embedded processors; efficiency rises as flexibility narrows. GPUs sat in the flexible, low-efficiency region 7-10 years ago; now they have moved toward the reconfigurable processor/logic space.]




Previous GPGPU constraint:

To get general-purpose code working, you had to use the corner cases of the graphics API

Essentially, re-write the entire program as a collection of shaders and polygons

[Figure: the legacy fragment-program model. Input registers, constants, textures, and temp registers (scoped per thread / per shader / per context) feed a fragment program whose output registers write to FB memory.]

GPGPU: general-purpose computing on Graphics Processing Units


“Compute Unified Device Architecture”

General-purpose programming model

The user kicks off batches of threads on the GPU

GPU = dedicated super-threaded, massively data-parallel co-processor

Targeted software stack

Compute-oriented drivers, language, and tools

Driver for loading computational programs onto the GPU


512 GPU cores

1.30 GHz

Single-precision floating-point performance: 1331 GFLOPs
(2 single-precision flops per clock per core)

Double-precision floating-point performance: 665 GFLOPs
(1 double-precision flop per clock per core)

Internal RAM: 6 GB GDDR5

Internal RAM speed: 177 GB/sec (compared to the 30s-ish GB/sec of regular RAM)

Has to be plugged into a PCIe slot (at most 8 GB/sec)
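Those peak numbers follow directly from the core count and clock: 512 cores x 1.30 GHz x 2 SP flops/clock/core = 1331 GFLOP/s, and 512 x 1.30 GHz x 1 DP flop/clock/core = 665 GFLOP/s.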


Calculation: TFLOPS (GPU) vs. 150 GFLOPS (CPU)

Memory bandwidth: ~5-10x

Cost benefit: a GPU in every PC means massive volume
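The bandwidth ratio also falls out of the earlier numbers: 177 GB/sec of GDDR5 against the 30s-ish GB/sec of host DRAM is roughly 6x, squarely in that ~5-10x range.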


The Good:

Performance: focused silicon use

High bandwidth for streaming applications

Similar power envelope to high-end CPUs

High volume, hence affordable

The Bad:

Programming: streaming languages (CUDA, OpenCL, etc.)

Requires significant application intervention / development

Sensitive to hardware knowledge: memories, banking, resource management, etc.

Not good at certain operations or applications: integer performance, irregular data, pointer logic, low compute intensity*

Questions about reliability / error

Many have been addressed in the most recent hardware models


Knights Ferry

32 cores

Wide vector units

x86 ISA

Mostly a test platform at this point

Knights Corner will be the first real product (2012)

Configurable logic blocks


Interconnection mesh



Can be incorporated into cards
or integrated inline.


The Good:

Performance: good silicon use (do only what you need, maximize parallel ops/cycle)

Rapid growth: cells, speed, I/O

Power: 1/10th of CPUs

Flexible: tailor to the application

The Bad:

Programming: VHDL, Verilog, etc.

Advances have been made here to translate high-level code (C, Fortran, etc.) to HW

Compile time: place-and-route for the FPGA layout can take multiple hours

FPGAs are typically clocked at about 1/10th to 1/5th of an ASIC

Cost: they're actually not cheap


External: entire application offloading

"Appliances": DataPower, Azul

Attached: targeted offloading

PCIe cards: CUDA/FireStream GPUs, FPGA cards

Integrated: tighter connection

On-chip: AMD Fusion, Cell BE, network processing chips

Incorporated: CPU instructions

Vector instructions, FMA, crypto-acceleration

Examples pictured: AMD "Fusion", Nvidia M2090, IBM "CloudBurst" (DataPower), Cray XK6 Integrated Hybrid Blade




C. Cascaval, et al., IBM Journal of R&D, 2010


Programming accelerators requires describing:

1. What portions of the code will run on the accelerator (as opposed to on the CPU)

2. How that code maps to the architecture of the accelerator, both compute elements and memories

The first is typically done on a function-by-function basis (i.e., a GPU kernel)

The second is much more variable: parallel directives, SIMT block description, VHDL/Verilog

Integrating these is not very mature at this point, but it is coming; the saxpy kernel below illustrates both.


__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    // one thread per element: compute a global index from block and thread IDs
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

int nblocks = (n + 255) / 256;   // ceiling division: enough blocks to cover all n elements

// invoke the kernel with 256 threads per block
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x, y);
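The launch above assumes x and y already point to device memory. A minimal host-side sketch of the surrounding boilerplate (an assumption, not shown on the slides) moves the data across the PCIe link before and after the kernel:

float *dx, *dy;
cudaMalloc(&dx, n * sizeof(float));                              // allocate device buffers
cudaMalloc(&dy, n * sizeof(float));
cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);    // host -> device
cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, dx, dy);                   // run on device data

cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);    // device -> host
cudaFree(dx);
cudaFree(dy);

Those transfers run at PCIe speed (at most 8 GB/sec here), which is one reason low-compute-intensity kernels rarely pay off.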


There are several efforts (mostly libraries and directive methods) to lower the entry point for accelerator programming

Library example: Thrust

STL-like interface for GPUs

thrust::host_vector<int> H(10);   // host vector (implied on the original slide; declared here)
thrust::device_vector<int> D(10, 1);
thrust::fill(D.begin(), D.begin() + 7, 9);   // set the first 7 elements of D to 9
thrust::sequence(H.begin(), H.end());        // fill H with 0, 1, 2, ...


Directive example: OpenACC (like OpenMP)

#pragma acc parallel [clauses]
{ structured block }

http://www.openacc-standard.org/
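As a companion sketch (again an assumption, not from the slides), the same saxpy under OpenACC keeps the loop in ordinary C and lets the directive handle the mapping; the function name saxpy_acc and the data clauses are illustrative:

void saxpy_acc(int n, float a, float *x, float *y)
{
    // offload the loop to the accelerator; copy x in, copy y in and back out
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}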

1. Profile your code

What code is heavily used (and amenable to acceleration)?

2. Write accelerator kernels for the heavily used code (Amdahl; see the note after this list)

Replace the CPU version with an accelerator offload

3. Play "chase the bottleneck" around the accelerator

AKA re-write the kernel a dozen times

4. Profit!

Faster science/engineering/finance/whatever!
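The Amdahl reference in step 2 is why step 1 comes first: if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p/s). Accelerating half the runtime even by 100x gives 1 / (0.5 + 0.005), still under 2x overall, so the kernels must cover the code that actually dominates.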



Architectures are moving towards “effective use of space” (or power).

Focusing architectures on a specific task (at the expense of others) can make for very efficient/effective tools (for that task)

HPC systems are beginning to integrate acceleration at numerous levels, but the “PCIe card GPU” is the most common

Exploiting the most popular accelerators requires intervention by application programmers to map codes to the architecture.

Developing for accelerators can be challenging, as significantly more hardware knowledge is needed to get good performance

There are major efforts at improving this


Tomorrow

2 - 3 pm: CUDA Programming Part I

3:30 - 5 pm: CUDA Programming Part II

WSCC 2A/2B

Tomorrow at 5:30pm

BOF: Broad-based Efforts to Expand Parallelism Preparedness in the Computing Workforce

WSCC 611/612 (here)

Wednesday at 10:30am

Panel/Discussion: Parallelism, the Cloud, and the Tools of the Future for the next generation of practitioners

WSCC 2A/2B