Title of Presentation - Dong Hyuk Woo

SIMD Divergence Optimization through Intra-Warp Compaction

Aniruddha Vaidya*, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi

Intel Labs, Intel

* Now with Nvidia Corp.

Public

2

Intel Labs

June 26, 2013

Public

Legal Notices


INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE,
EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS
GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR
SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT
OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT
INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.


Intel may make changes to specifications and product descriptions at any time, without notice.


All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without
notice.


Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause
the product to deviate from published specifications. Current characterized errata are available on request.


Ivy Bridge and other code names featured are used internally within Intel to identify products that are in development and
not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use
code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code
names is at the sole risk of the user.


Performance tests and ratings are measured using specific computer systems and/or components and reflect the
approximate performance of Intel products as measured by those tests. Any difference in system hardware or software
design or configuration may affect actual performance.


Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.


*Other names and brands may be claimed as the property of others.


Copyright © 2009 Intel Corporation.

“By exploiting the difference between logical and physical SIMD width of a GPU pipeline, we address the SIMD control divergence problem with intra-warp compaction.”


What is a GPU, Really?

Wide SIMD
A group of SIMD lanes
16 lanes per warp in this talk
Many, many warps per core
Huge context
Huge register file
Also very energy hungry


Understanding a GPU Register File

[Figure: register file layout. Each row holds one register (r0, r1, r2, ..., r(n-1)) of one warp and is 16 lanes wide (lanes 0 to f); rows are indexed by warp ID and register number.]

Single-Ported RF / Multi-Cycle Issue

fma r3 = r0 * r1 + r2

r0 / fetch @ 1
r1 / fetch @ 2
r2 / fetch @ 3
issue @ 4
issue @ 5
issue @ 6
issue @ 7

Throughput-balanced pipeline:
Occupying RF stage for 4 cycles (3 inputs + 1 output)
Occupying EX stage for 4 cycles

An array of MUX network for operand selection / distribution

Logically SIMD16, physically SIMD4
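The timing on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `fma_timeline` and its parameters are hypothetical, and it assumes one operand read per cycle (single-ported RF) and one group of four lanes issued per cycle (SIMD16 on SIMD4 hardware), as the slide states.

```python
LOGICAL_WIDTH = 16   # lanes per warp (from the slide)
PHYSICAL_WIDTH = 4   # ALU lanes available per cycle (from the slide)


def fma_timeline(num_inputs=3, first_cycle=1):
    """Return (fetch_cycles, issue_cycles) for one 3-input instruction.

    Hypothetical helper: assumes a single-ported register file, so each
    source operand takes its own fetch cycle.
    """
    fetch = [first_cycle + i for i in range(num_inputs)]
    # The warp goes through the narrow ALU one group of lanes per cycle.
    groups = LOGICAL_WIDTH // PHYSICAL_WIDTH
    first_issue = fetch[-1] + 1
    issue = [first_issue + g for g in range(groups)]
    return fetch, issue


fetch, issue = fma_timeline()
rf_cycles = len(fetch) + 1   # 3 operand reads + 1 result write
ex_cycles = len(issue)       # one cycle per group of 4 lanes

print("fetch cycles:", fetch)
print("issue cycles:", issue)
print("RF stage cycles:", rf_cycles, "EX stage cycles:", ex_cycles)
```

Both stages are busy for the same number of cycles, which is what the slide means by a throughput-balanced pipeline.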


SIMD Control Divergence Problem

if (……) {

} else {

}

[Figure: throughput degradation with nested branches. Y-axis: throughput, 0% to 100%; x-axis: if/else nesting depth, 0 to 6. Below it, four 16-lane rows (lanes 0 to f).]
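To see why nesting hurts, consider a worst-case model. The chart's actual data is not recoverable from the text, so the sketch below is not a reproduction of it. It assumes that every if/else splits the active lanes evenly, that both sides take equally long, and that the paths of a divergent warp run one after another with the other lanes masked off. The function name `simd_utilization` is hypothetical.

```python
def simd_utilization(depth, lanes=16):
    """Fraction of lane-cycles doing useful work at a given nesting depth.

    Assumption: each if/else level splits the active lanes in half, so a
    depth-d nest has 2**d leaf paths that are executed serially.
    """
    paths = 2 ** depth
    # A path that no lane takes is never executed.
    paths_run = min(paths, lanes)
    active_per_path = lanes // paths_run
    useful_lane_cycles = active_per_path * paths_run
    total_lane_cycles = lanes * paths_run
    return useful_lane_cycles / total_lane_cycles


for depth in range(7):
    print(f"depth {depth}: {simd_utilization(depth):.1%}")
```

Under these assumptions utilization halves with each nesting level until every lane is on its own path, after which it stays at 1/16 for a 16-lane warp.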

Previous Proposals

[Figure: 16-lane warps (lanes 0 to f) with cycle labels x, 1n, 2n, and 3n. Caption: dynamically forming a new warp.]

A Problem of Previous Proposals

fma r3 = r0 * r1 + r2

r0 / fetch @ 1
r1 / fetch @ 2
r2 / fetch @ 3

Significant Change in RF Design

[Figure: register file rows, 16 lanes each (lanes 0 to f), followed by the same rows with the lanes regrouped into {0, 2, 3, 5, 7, 8, 9, b, c, d, f} and {1, 4, 6, a, e}.]

Summary of Dynamic Warp Formation

                                     Dynamic Warp Formation
Inter-warp optimization              Yes
Execution cycle compaction           No
Inter-lane optimization              No
Register file                        Per-lane bank
Per-lane, warp-level context info    Required
Memory divergence                    Worse


Basic Cycle Compression (BCC)

fma r3 = r0 * r1 + r2

r0 / fetch @ 1
r1 / fetch @ 2
r2 / fetch @ 3
issue @ 4
issue @ 5
issue @ 6
issue @ 7

Instead, we want to issue the next warp at cycle 5.
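A minimal sketch of the idea, assuming the 16 logical lanes are executed as four groups of four consecutive lanes and that a group with no active lane can simply be skipped. The function name `bcc_cycles` and the bitmask encoding of active lanes are assumptions made for illustration; the slides do not give the hardware's actual grouping or encoding.

```python
PHYSICAL_WIDTH = 4  # ALU lanes per cycle (from the slides)


def bcc_cycles(active_mask, logical_width=16):
    """Return the lane groups that still need an execution cycle.

    Assumption: lanes are grouped as 0-3, 4-7, 8-11, 12-15, and bit i of
    active_mask is set when lane i is enabled.
    """
    groups = []
    for base in range(0, logical_width, PHYSICAL_WIDTH):
        lanes = [lane for lane in range(base, base + PHYSICAL_WIDTH)
                 if (active_mask >> lane) & 1]
        if lanes:  # a group with no active lane is skipped entirely
            groups.append(lanes)
    return groups


# Only lanes 4-7 are active: one execution cycle instead of four, so the
# next warp can start issuing one cycle after this one.
print(len(bcc_cycles(0x00F0)))

# Every group has an active lane: nothing can be skipped.
print(len(bcc_cycles(0x5555)))
```

The second mask is the kind of case where skipping whole groups saves nothing, even though half the lanes are idle.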

Unfruitful Cases for BCC

r0 / fetch @ 1
r1 / fetch @ 2
r2 / fetch @ 3
issue @ 4
issue @ 5
issue @ 6
issue @ 7

Swizzled Cycle Compression (SCC)

fma r3 = r0 * r1 + r2

r0 / fetch @ 1
r1 / fetch @ 2
r2 / fetch @ 3
issue @ 4
issue @ 5
issue @ 6
issue @ 7

Datapath for Swizzling

Control Algorithm for Swizzling

Goal: Determine crossbar settings and lane enables for datapath

Method:

1. Detect the optimal occupancy
2. Balance occupancy across lanes

[Figure: active-lane counts per physical lane. Lanes 0 to 3 start with totals 4, 0, 4, 0. For the 1st EXE cycle, fill idle lanes (1, 3) from busy lanes; for the 2nd EXE cycle, fill idle lanes (1, 3) similarly. Later totals on the slide: 3, 2, 1, 2 (shown twice).]
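The two steps can be sketched in software as a greedy rebalancing. This is a hypothetical illustration, not the hardware algorithm: the function name `scc_schedule`, the bitmask encoding, the mapping of logical lane i to physical lane i % 4, and the rule of always moving work to the least-loaded lane are all assumptions, and real crossbar constraints are ignored.

```python
from math import ceil

PHYSICAL_WIDTH = 4  # ALU lanes per cycle (from the slides)


def scc_schedule(active_mask, logical_width=16):
    """Return one list per execution cycle: the logical lane assigned to
    each physical lane, or None when that physical lane is disabled."""
    # Queue the active logical lanes on their home physical lane.
    queues = [[] for _ in range(PHYSICAL_WIDTH)]
    for lane in range(logical_width):
        if (active_mask >> lane) & 1:
            queues[lane % PHYSICAL_WIDTH].append(lane)

    # Step 1: the optimal occupancy is the fewest cycles that can hold
    # all active lanes.
    active = sum(len(q) for q in queues)
    optimal = ceil(active / PHYSICAL_WIDTH)

    # Step 2: move surplus work from busy lanes to the least-loaded lane.
    for busy in queues:
        while len(busy) > optimal:
            idle = min(queues, key=len)
            idle.append(busy.pop())

    # Cycle c executes the c-th entry of every queue.
    return [[q[c] if c < len(q) else None for q in queues]
            for c in range(optimal)]


# Physical lanes 0 and 2 each hold four active logical lanes, lanes 1 and 3
# hold none (totals 4, 0, 4, 0): two cycles are enough after rebalancing.
for cycle in scc_schedule(0x5555):
    print(cycle)
```

Each returned row plays the role of the crossbar setting for one execution cycle, and a `None` entry corresponds to a cleared lane enable.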

Hardware Overhead

BCC
  Improved register file throughput
  Coarse-grained multi-banked RF needed (e.g., warp-granularity)

SCC
  BCC overhead
  Swizzling decision logic
    Enough time from the decode stage to the register fetch stage
  Swizzling logic
    Swizzling logic for input already available
    Updated datapath for output operand with additional SIMD4 crossbar

Compared to Dynamic Warp Formation

                                     Dynamic Warp Formation    Our approach
Inter-warp optimization              Yes                       No
Execution cycle compaction           No                        Yes
Inter-lane optimization              No                        Yes
Register file                        Per-lane bank             4x bandwidth
Per-lane, warp-level context info    Required                  No
Memory divergence                    Worse                     No impact

EXPERIMENTAL RESULTS

Simulation Methods

Execution-driven simulation
  In-house cycle-level Intel GPGPU simulator
  Standalone GPU simulation
  A module in parallel CPU+GPU simulation
  Entire GPU performance simulation with entire memory hierarchy
  50+ OpenCL benchmark applications evaluated

Trace-driven simulation
  GPU core performance simulation only
  ~600 OpenCL, OpenGL, multimedia workload traces

Experimental Setup

[Figure: simulated system with CPU cores, a thread dispatcher, EUs, a data cluster, L1$, SLM, and a last-level $.]

EU              SIMD16, 6 warps / EU
Data cluster    64B ~ 128B per cycle
L1$             64-way, 4-banked, 128KB
SLM             64KB scratchpad
Last-level $    16-way, 8-banked, 2MB

Large Similarity in Neighboring Lanes

ALU cycles saved (OpenGL and OpenCL)

[Figure: bar chart of ALU cycles saved, 0% to 50%, with BCC% and SCC% series for: BFS, HtS, LavaMD, NW, Part, EV, RT-PR-Conf, RT-PR-AL, RT-PR-BL, RT-PR-WM, RT-AO-AL, RT-AO-BL, RT-AO-WM, LuxMark-sky, LuxMark_sala, luxmark_oclcp, bulletphysics, oclprofv1p0, rightware_mandelbulb, tree_search, LuxMark_hdr, OptSAA, sandra_ocl, ati-eigenval, ati_floydwarshall, glbench_egypt, glbench_pro, FD_IntelFinalists, FD_politicians.]

Dependent on Data Cluster Bandwidth

System Performance (OpenCL; RayTracing)

[Figure: bar chart of speedup, 0% to 60%, with BCC% and SCC% series at data cluster bandwidths of 1 $ line / cycle and 2 $ lines / cycle.]

On average (across divergent applications),
+12% with 1 $ line / cycle bandwidth
+18% with 2 $ lines / cycle bandwidth


Conclusion

SIMD control divergence solutions
  Exploiting the multi-cycle execution feature of GPUs

Intra-Warp Compaction
  Basic cycle compression
  Swizzled cycle compression

Learnings
  Control path similarity in neighboring lanes
  Divergent applications are likely to require high memory bandwidth.

Thank you!

Acknowledgement

Intel: Murali Sundaresan, S. Maiyuran, Jonathan Pearce, Ben Ashbaugh, Kipp Owens, Berna Adalier, Sven Woop, Warren Hunt, Ingo Wald, Aaron Kunze

ISCA: Brucek Khailany, anonymous ISCA reviewers