IMplicitly PArallel Compiler Technology - Microsoft Research


1

An IMplicitly PArallel Compiler Technology Based on Phoenix

For thousand-core microprocessors

Wen-mei Hwu

with Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado, Stone, Yi, Kidd, Barghsorkhi, Mahesri, Tsao, Stratton, Navarro, Lumetta, Frank, Patel

University of Illinois, Urbana-Champaign

2

Background

- Academic compiler research infrastructure is a tough business
  - IMPACT, Trimaran, and ORC for VLIW and Itanium processors
  - Polaris and SUIF for multiprocessors
  - LLVM for portability and safety

- In 2001, the IMPACT team moved into many-core compilation with MARCO FCRC funding
  - A new implicitly parallel programming model that balances the burden on programmers and the compiler in parallel programming
  - Infrastructure work has slowed down ground-breaking work

- Timely visit by the Phoenix team in January 2007
  - Rapid progress has since been taking place
  - Future IMPACT research will be built on Phoenix

3

The Next Software Challenge

- Today, multi-core chips make more effective use of area and power than large ILP CPUs
- Scaling from 4-core to 1000-core chips could happen in the next 15 years
- All semiconductor market domains are converging to concurrent system platforms
  - PCs, game consoles, mobile handsets, servers, supercomputers, networking, etc.

Big picture: we need to make these systems effectively execute valuable, demanding apps.

4

The Compiler Challenge

- To meet this challenge, the compiler must
  - Allow simple, effective control by programmers
  - Discover and verify parallelism
  - Eliminate tedious effort in performance tuning
  - Reduce the testing and support cost of parallel programs

"Compilers and tools must extend the human's ability to manage parallelism by doing the heavy lifting."


5


An Initial Experimental Platform

- A quiet revolution and potential build-up
  - Calculation: 450 GFLOPS vs. 32 GFLOPS
  - Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  - Until last year, programmed through graphics APIs
- GPU in every PC and workstation: massive volume and potential impact

[Figure: GFLOPS growth across GPU generations. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

6

GeForce 8800

16 highly threaded SMs, >128 FPUs, 450 GFLOPS, 768 MB DRAM, 86.4 GB/s memory BW, 4 GB/s BW to CPU

[Figure: GeForce 8800 block diagram. Host, Input Assembler, and Thread Execution Manager feed the SM array; groups of SMs share Parallel Data Caches and texture units; load/store paths connect to Global Memory]
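The way to exploit this organization is to launch far more threads than there are FPUs, so the SMs can hide DRAM latency. A minimal sketch of the CUDA-style execution model (a generic elementwise kernel for illustration; not code from the talk):

    #include <cuda_runtime.h>

    // Each thread processes one element; blocks of threads are scheduled
    // onto the 16 SMs, and thousands of in-flight threads hide the
    // global-memory latency.
    __global__ void scale(float* out, const float* in, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * in[i];
    }

    void launch_scale(float* d_out, const float* d_in, float a, int n) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // cover all n
        scale<<<blocks, threadsPerBlock>>>(d_out, d_in, a, n);
        cudaDeviceSynchronize();   // wait for the grid to finish
    }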

7

Some Hand-Coded Results

App.      Architectural Bottleneck                         Simult. Threads   Kernel X   App X
H.264     Registers, global memory latency                  3,936             20.2       1.5
LBM       Shared memory capacity                            3,200             12.5      12.3
RC5-72    Registers                                         3,072             17.1      11.0
FEM       Global memory bandwidth                           4,096             11.0      10.1
RPES      Instruction issue rate                            4,096            210.0      79.4
PNS       Global memory capacity                            2,048             24.0      23.7
LINPACK   Global memory bandwidth, CPU-GPU data transfer   12,288             19.4      11.8
TRACF     Shared memory capacity                            4,096             60.2      21.6
FDTD      Global memory bandwidth                           1,365             10.5       1.2
MRI-Q     Instruction issue rate                            8,192            457.0     431.0

(Kernel X and App X are kernel-only and whole-application speedups over the CPU version.)

[HKR HotChips-2007]

8

Computing Q: Performance

[Figure: runtime in minutes for eight versions of the Q computation]

V1 (cpu, dp)            1164.1
V2 (cpu, dp, sse2)      1156.5
V3 (cpu, dp, sse2, fm)   400.1
V4 (cpu, sp)             953.9
V5 (cpu, sp, sse2)       923.7
V6 (cpu, sp, sse2, fm)   267.6
V7 (gpu, sp)               3.3
V8 (gpu, sp, fm)           0.6

CPU (V6): 230 MFLOPS. GPU (V8): 96 GFLOPS, a 446x speedup (267.6 / 0.6 minutes).
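The Q computation is dominated by sine/cosine evaluations over all k-space samples for every voxel, which is why the single-precision (sp) and fast-math (fm) variants dominate the version space. A hedged sketch in the style of the MRI-Q kernel (the formulation and the use of CUDA's __sincosf intrinsic are illustrative assumptions, not the authors' code):

    #include <cuda_runtime.h>

    struct KPoint { float kx, ky, kz, phiMag; };

    // Each thread computes Q at one voxel by summing over all k-space
    // samples. __sincosf is the hardware fast-math intrinsic that the
    // "fm" versions exploit.
    __global__ void computeQ(int numK, const KPoint* k,
                             const float* x, const float* y, const float* z,
                             float* Qr, float* Qi, int numX) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= numX) return;
        float qr = 0.0f, qi = 0.0f;
        for (int i = 0; i < numK; ++i) {
            float arg = 2.0f * 3.14159265f *
                        (k[i].kx * x[v] + k[i].ky * y[v] + k[i].kz * z[v]);
            float s, c;
            __sincosf(arg, &s, &c);    // fast sin and cos in one call
            qr += k[i].phiMag * c;
            qi += k[i].phiMag * s;
        }
        Qr[v] = qr;
        Qi[v] = qi;
    }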

9

Lessons Learned

- Parallelism extraction requires global understanding
  - Most programmers only understand parts of an application
- Algorithms need to be re-designed
  - Programmers benefit from a clear view of the algorithm's effect on parallelism
- Real but rare dependences often need to be ignored
  - Error-checking code, etc.: parallel code is often not equivalent to sequential code (see the sketch below)
- Getting more than a small speedup over sequential code is very tricky
  - ~20 versions were typically tried per application to move away from architecture bottlenecks
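As an illustration of a real-but-rare dependence, consider error checking inside a reduction: the sequential early exit is a loop-carried control dependence that almost never fires. A hypothetical sketch of the sequential form and one way to "ignore" the dependence by validating after the fact (the OpenMP version is illustrative):

    // Sequential: the early 'return' creates a loop-carried control
    // dependence. It almost never fires, but it makes the iterations
    // formally non-independent.
    bool sum_seq(const float* a, int n, float* out) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) {
            if (a[i] < 0.0f) return false;   // rare error path
            s += a[i];
        }
        *out = s;
        return true;
    }

    // Parallel-friendly: ignore the dependence during the computation
    // and validate afterwards. Correct, but not behaviorally equivalent
    // to the sequential program on bad input (no early abort).
    bool sum_par(const float* a, int n, float* out) {
        float s = 0.0f;
        bool ok = true;
        #pragma omp parallel for reduction(+:s) reduction(&&:ok)
        for (int i = 0; i < n; ++i) {
            ok = ok && (a[i] >= 0.0f);       // record, do not abort
            s += a[i];
        }
        *out = s;
        return ok;
    }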

10

Implicitly Parallel Programming Flow

[Figure: the implicitly parallel programming flow, with the human in the loop]

Human writes stylized C/C++ or a DSL w/ assertions (for increased composability)
  -> Concurrency discovery, driven by deep analysis w/ feedback assistance
  -> Visualizable concurrent form
  -> Code-gen space exploration, a systematic search for best/correct code gen (for increased scalability)
  -> Visualizable sequential assembly code with parallel annotations
  -> Parallel HW w/ sequential state gen
  -> Debugger: parallel execution w/ sequential semantics (for increased supportability)

11

Key Ideas

- Deep program analyses that extend programmer and DSE knowledge for parallelism discovery
  - Key to reducing programmer parallelization effort
- Exclusion of infrequent but real dependences using HW STU (Speculative Threading with Undo) support (see the sketch below)
  - Key to successful parallelization of many real applications
- Rich program information maintained in the IR for access by tools and HW
  - Key to integrating multiple programming models and tools
- Intuitive, visual presentation to programmers
  - Key to good programmer understanding of algorithm effects
- A managed search space of parallel execution arrangements
  - Key to reducing programmer performance-tuning effort
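A purely software model may make the STU idea concrete. In the actual proposal, checkpointing, conflict detection, and rollback are hardware mechanisms; everything below is an illustrative stand-in:

    #include <atomic>
    #include <vector>

    // Illustrative software model of Speculative Threading with Undo:
    // snapshot state, run the speculatively parallel version, and roll
    // back to a safe sequential re-execution if the rare dependence fired.
    static std::vector<float> data(1024, 1.0f);    // state touched by the loop
    static std::atomic<bool> violation{false};     // stands in for HW detection

    static void parallel_body() {                  // speculative version
        for (size_t i = 0; i < data.size(); ++i) { // (imagine threads here)
            if (data[i] < 0.0f) violation = true;  // rare dependence fires
            data[i] *= 2.0f;
        }
    }

    static void sequential_fallback() {
        for (size_t i = 0; i < data.size(); ++i) data[i] *= 2.0f;
    }

    void stu_region() {
        std::vector<float> checkpoint = data;  // "checkpoint" the state
        violation = false;
        parallel_body();                       // speculate past the rare dep
        if (violation) {                       // conflict detected
            data = checkpoint;                 // undo speculative updates
            sequential_fallback();             // safe re-execution
        }                                      // else: commit by keeping results
    }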

12

Parallelism in Algorithms (H.263 motion estimation example)

[Figure: two guess-vector strategies, each showing prev_frame and cur_frame; the sketch below contrasts their dependence structure]

(a) Guess vectors are obtained from the previous macroblock.
(b) Guess vectors are obtained from the corresponding macroblock in the previous frame.
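The dependence structure explains the difference: with (a), each macroblock's search is seeded by its just-computed neighbor, serializing the frame; with (b), seeds come from the already-completed previous frame, so all macroblocks are independent. A hedged sketch (types and the search stub are hypothetical):

    struct MV { int dx, dy; };

    // Placeholder for the SAD-based full-pel search around a guess vector.
    static MV search(int mb, MV guess) { (void)mb; return guess; }

    // (a) Guess from the previous macroblock: mv[mb - 1] must finish
    // before macroblock mb can start, so the loop is inherently serial.
    void estimate_a(MV* mv, int numMB) {
        for (int mb = 0; mb < numMB; ++mb)
            mv[mb] = search(mb, mb ? mv[mb - 1] : MV{0, 0});
    }

    // (b) Guess from the previous frame: prev[] is read-only, every
    // iteration is independent, and the loop can become thousands of
    // GPU threads.
    void estimate_b(MV* mv, const MV* prev, int numMB) {
        for (int mb = 0; mb < numMB; ++mb)
            mv[mb] = search(mb, prev[mb]);
    }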

13

MPEG-4 H.263 Encoder Parallelism Rediscovery

[Figure: encoder call graph (MotionEstimation, MotionCompensation, FrameSubtraction, VopShapeMotText, BlockQuant, BlockDequant, BlockDCT, BlockIDCT, MotionEstimatePicture, FullPelMotionEstMB, MBMotionEstimation, SAD_Macroblock, GetMotionImages, LuminanceComp, ChrominanceComp, Interpolation, FindSubPel (x5), MBBlockRebuild, BlockRebuildUV, BitstreamEncode, CodeMB), annotated with the loop granularity at which each routine parallelizes: pixel, pixel row, component, block, or macroblock]

Analysis combinations applied to rediscover the parallelism:

(f) Combination #1: Original + interprocedural array disambiguation + context- & heap-sensitive pointer analysis
(g) Combination #2: Combination #1 + non-affine expression array disambiguation
(h) Combination #3: Combination #2 + field-sensitive pointer analysis
(i) Final: Combination #3 with value constraint and relationship inference analyses

14

Code Gen Space Exploration

[Figure: two schedules of the per-macroblock pipeline (Motion Estimation; Motion Compensation, Frame Subtraction; DCT & Quantization; Dequantization, IDCT, Frame Addition) over time, operating on 16x16 macroblocks, with main memory accesses marked. (a) Loop Partitioning. (b) Loop Fusion + Memory Privatization; the sketch below illustrates this transformation]
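A hedged sketch of the panel (b) transformation, with illustrative stage names and placeholder bodies: the partitioned form streams whole intermediate frames through main memory, while the fused form keeps one privatized macroblock in fast storage:

    struct Block { float px[16][16]; };   // one 16x16 macroblock

    // Stage stubs, one macroblock at a time (bodies are placeholders).
    static Block motion_est(int mb) { (void)mb; return Block{}; }
    static Block subtract(Block b)  { return b; }
    static Block dct_quant(Block b) { return b; }
    static void  rebuild(int mb, const Block& b) { (void)mb; (void)b; }

    enum { NUM_MB = 396 };   // e.g., macroblocks in a CIF frame

    // (a) Loop partitioning: each stage is a separate loop over all
    // macroblocks, so every intermediate frame round-trips through memory.
    void encode_partitioned() {
        static Block t1[NUM_MB], t2[NUM_MB], t3[NUM_MB];
        for (int mb = 0; mb < NUM_MB; ++mb) t1[mb] = motion_est(mb);
        for (int mb = 0; mb < NUM_MB; ++mb) t2[mb] = subtract(t1[mb]);
        for (int mb = 0; mb < NUM_MB; ++mb) t3[mb] = dct_quant(t2[mb]);
        for (int mb = 0; mb < NUM_MB; ++mb) rebuild(mb, t3[mb]);
    }

    // (b) Loop fusion + memory privatization: one loop, one private block
    // that can live in registers or scratchpad; no inter-stage traffic.
    void encode_fused() {
        for (int mb = 0; mb < NUM_MB; ++mb) {
            Block b = motion_est(mb);   // privatized temporary
            b = subtract(b);
            b = dct_quant(b);
            rebuild(mb, b);
        }
    }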

15

Unification-Based Fulcra: Moving an Accurate Interprocedural Analysis into Phoenix
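For readers unfamiliar with the approach: unification-based points-to analysis merges the abstract locations on both sides of each assignment into one equivalence class with union-find, giving near-linear-time interprocedural analysis. A minimal Steensgaard-style sketch, a drastic simplification of Fulcra (which adds context, heap, and field sensitivity); all names are illustrative:

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Steensgaard {
        std::vector<int> parent, pts;   // union-find forest; class -> pointee class
        std::unordered_map<std::string, int> var;

        int fresh() {
            parent.push_back((int)parent.size());
            pts.push_back(-1);
            return (int)parent.size() - 1;
        }
        int node(const std::string& v) {
            auto it = var.find(v);
            if (it != var.end()) return it->second;
            int n = fresh();
            var[v] = n;
            return n;
        }
        int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
        void unify(int a, int b) {              // merge two equivalence classes
            a = find(a); b = find(b);
            if (a == b) return;
            int pa = pts[a], pb = pts[b];
            parent[b] = a;
            if (pa == -1) pts[a] = pb;
            else if (pb != -1) unify(pa, pb);   // recursively merge pointees
        }
        int pointee(int x) {                    // lazily create the pointee class
            x = find(x);
            if (pts[x] == -1) pts[x] = fresh();
            return find(pts[x]);
        }
        void addrOf(const std::string& p, const std::string& x) {   // p = &x
            unify(pointee(node(p)), node(x));
        }
        void assign(const std::string& p, const std::string& q) {   // p = q
            unify(pointee(node(p)), pointee(node(q)));
        }
        bool mayAlias(const std::string& p, const std::string& q) {
            return pointee(node(p)) == pointee(node(q));
        }
    };
    // Usage: addrOf("p","a"); assign("q","p"); then mayAlias("p","q") is true.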

16

Getting Started with Phoenix

- Meetings with the Phoenix team in January 2007
- Determined the set of Phoenix API routines necessary to support IMPACT analyses and transformations
- Received a custom build of Phoenix that supports full type information

17


Fulcra to Phoenix: Action!

Four-step process (step 2 is sketched below):

1. Convert IMPACT's data structures to Phoenix's equivalents, and from C to C++/CLI.

2. Create the initial constraint graph using Phoenix's IR instead of IMPACT's IR.

3. Convert the solver, i.e., the pointer analysis itself. This consists of porting from C to C++/CLI and dealing with any changes to the ported Fulcra data structures.

4. Annotate the points-to information back into Phoenix's alias representation.
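To make step 2 concrete, here is a hedged sketch of walking an IR and classifying each pointer assignment into the four classic constraint forms a points-to solver consumes. The mini-IR types are invented stand-ins, not the Phoenix API:

    #include <string>
    #include <vector>

    // Hypothetical mini-IR: "dst = src" where either side may carry an
    // address-of or a dereference. The real port walks Phoenix IR units
    // and operands instead.
    struct IrAssign {
        std::string dst, src;
        bool derefDst = false;   // *dst = ...
        bool addrSrc  = false;   //  ... = &src
        bool derefSrc = false;   //  ... = *src
    };

    enum class CKind { AddressOf, Copy, Load, Store };
    struct Constraint { CKind kind; std::string lhs, rhs; };

    // Classify each assignment into one of the four constraint forms
    // that a Fulcra-style solver consumes.
    std::vector<Constraint> buildConstraints(const std::vector<IrAssign>& ir) {
        std::vector<Constraint> graph;
        for (const IrAssign& a : ir) {
            CKind k = CKind::Copy;                     // dst = src
            if (a.derefDst)      k = CKind::Store;     // *dst = src
            else if (a.addrSrc)  k = CKind::AddressOf; // dst = &src
            else if (a.derefSrc) k = CKind::Load;      // dst = *src
            graph.push_back({k, a.dst, a.src});
        }
        return graph;
    }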

18

April 16, 2007

Phoenix Support Wish List

- Access to code across file boundaries
  - LTCG
  - Access to multiple files within a pass
- Full (source-code-level) type information
- Feeding results from Fulcra back to Phoenix
  - Need more information on the Phoenix alias representation
- In the long run, a highly extensible IR and API for Phoenix

19

Conclusion

- Compiler research for many-core processors will require a very high-quality infrastructure with strong engineering support
  - New language extensions, new user models, new functionality, new analyses, new transformations
- We chose Phoenix based on its robustness, features, and engineering support
  - Our current industry partners are also moving to Phoenix
  - We also plan to share our advanced extensions with other academic Phoenix users