Dec 13, 2013 (3 years and 5 months ago)

Shader Performance Analysis on a Modern GPU Architecture

Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC

Roger Espasa
Intel DEG

Barcelona

Introduction

- Shaders in GPUs evolving towards general programming
  - Branches, generic loads, scatter
- New types of shaders: geometry in DX10
- Current specialized shaders
  - Area hungry
  - Unbalancing leads to inefficiencies
- This paper: unify all shaders
  - ~8% higher performance with less area & resources

Outline

- Attila: our GPU architecture
- Attila-Classic: Non-unified shaders
- Attila-Unified: Unified Shaders
- Simulation Framework
- Results


ATTILA

- Our implementation of current GPUs
  - Inspired by both NVIDIA and ATI
  - Not an exact match for either pipeline
    - Lack of detailed microarchitecture information
    - Educated guessing on our side
- Implemented Features
  - 2D Homogeneous Recursive Rasterization
  - Tiled Rasterization
  - Hierarchical Z
  - Texture compression
  - Anisotropic filtering
  - Depth compression, fast z/stencil and color clear


Attila Classic

[Figure: Attila Classic pipeline with specialized shaders. Blocks shown: Vertex Fetch, Vertex Shader (x4), Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shader (x4), ROP (x4), Memory Controller (x4).]

Specialized Shader Issues

- Unbalancing
  - In fragment-shading-limited scenarios (typical), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
  - In vertex-shading-limited scenarios, up to 70% of the processing power remains idle
- Dedicated Area
  - 4 unused vertex shaders have the same processing power as one fragment shader
  - 4 vertex shaders require 66% of the area of a fragment shader
- Different Designs
  - Increases the complexity of the microarchitecture
  - Increases development and verification time
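
The idle percentages on this slide can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes, as the slide states, that four vertex shaders match the processing power of one fragment shader. It further assumes that in a fully fragment-limited (or vertex-limited) phase the other shader type sits completely idle. The function name `idle_fraction` is ours and is not part of Attila.

```python
def idle_fraction(n_vertex, n_fragment, limited_by, vs_per_fs=4.0):
    """Fraction of total shader processing power left idle.

    Power is measured in fragment-shader equivalents: vs_per_fs vertex
    shaders are assumed to match one fragment shader (figure from the slide).
    """
    vertex_power = n_vertex / vs_per_fs
    fragment_power = float(n_fragment)
    total = vertex_power + fragment_power
    if limited_by == "fragment":
        return vertex_power / total    # vertex shaders sit idle
    if limited_by == "vertex":
        return fragment_power / total  # fragment shaders sit idle
    raise ValueError("limited_by must be 'fragment' or 'vertex'")


if __name__ == "__main__":
    for phase in ("fragment", "vertex"):
        print(f"{phase}-limited: {idle_fraction(8, 4, phase):.0%} idle")
```

For 8 vertex and 4 fragment shaders this simple model gives about one third idle when fragment limited and about two thirds idle when vertex limited, in line with the "up to 30%" and "up to 70%" figures quoted above.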


Attila Unified

[Figure: Attila Unified pipeline with a unified shader pool. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader (x4, labeled "Unified Shader Pool"), ROP (x4), Memory Controller (x4).]

Unified Shader Architecture

- Benefits
  - Unified programming model
    - DX10/SM4 and OpenGL/GLSlang are already pushing for it
  - The same features for all the program targets
    - Texturing, branching, outputs
  - Not just vertex and fragment programs
    - DX10 => geometry shader
    - General Purpose GPU or Stream Processor
  - Workload balance
    - Shading resources allocated as required at any point of the rendering

Unified Shader Architecture

- Costs
  - Scheduler
    - Select which kind of workload must be processed next
    - Partly implemented with multithreading in the fragment shader to hide texture access latency
  - Larger instruction memory and constant bank
  - Rerouting required
    - All the paths cross the shader pool
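
The scheduler's job of selecting which kind of workload to process next can be illustrated with a toy model. The slides do not describe Attila's actual selection policy, so the policy below (serve the queue with more pending work, preferring fragments on a tie) and the class name `UnifiedScheduler` are purely illustrative assumptions.

```python
from collections import deque


class UnifiedScheduler:
    """Toy scheduler choosing the next workload for a free unified shader."""

    def __init__(self):
        self.queues = {"vertex": deque(), "fragment": deque()}

    def submit(self, kind, item):
        """Queue a vertex or fragment work item."""
        self.queues[kind].append(item)

    def next_work(self):
        """Return (kind, item), or None if both queues are empty.

        Hypothetical policy: serve the queue with more pending work,
        preferring fragments on a tie.
        """
        kind = max(("fragment", "vertex"), key=lambda k: len(self.queues[k]))
        if not self.queues[kind]:
            return None
        return kind, self.queues[kind].popleft()


if __name__ == "__main__":
    sched = UnifiedScheduler()
    sched.submit("vertex", "v0")
    sched.submit("fragment", "f0")
    sched.submit("fragment", "f1")
    work = sched.next_work()
    while work is not None:
        print(work)
        work = sched.next_work()
```

A real scheduler would also have to account for buffer occupancy downstream and for threads stalled on texture accesses, which this sketch ignores.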


ATTILA Framework

- OpenGL Interceptor tool
- OpenGL library for Attila GPU
- Driver for our Attila GPU
- Attila GPU simulator
- Signal Visualizer Tool

[Figure: ATTILA framework flow in four stages (Collect, Verify, Simulate, Analyze). Elements shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVidia G70, Framebuffer, Trace, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Statistics, Signal Traffic, Signal Visualizer, and two "CHECK!" marks.]


GLInterceptor

- Capture a trace of OpenGL API calls from a real game


GLPlayer

- Reproduce the captured trace
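
The capture-and-replay idea behind GLInterceptor and GLPlayer can be sketched in a few lines. This is not the tools' actual implementation or trace format, which the slides do not show. `FakeGL`, `Interceptor` and `replay` are hypothetical names, and the stand-in "driver" only logs the calls it receives.

```python
class FakeGL:
    """Stand-in for an OpenGL driver: it just logs the calls it receives."""

    def __init__(self):
        self.log = []

    def glClear(self, mask):
        self.log.append(("glClear", mask))

    def glDrawArrays(self, mode, first, count):
        self.log.append(("glDrawArrays", mode, first, count))


class Interceptor:
    """Forwards every call to the real target and records it in a trace."""

    def __init__(self, target):
        self.target = target
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the Interceptor itself.
        func = getattr(self.target, name)

        def wrapper(*args):
            self.trace.append((name, args))
            return func(*args)

        return wrapper


def replay(trace, target):
    """Re-issue each recorded call against another target."""
    for name, args in trace:
        getattr(target, name)(*args)


if __name__ == "__main__":
    original = FakeGL()
    gl = Interceptor(original)
    gl.glClear(0x4000)
    gl.glDrawArrays(4, 0, 3)

    replayed = FakeGL()
    replay(gl.trace, replayed)
    print(replayed.log == original.log)
```

Replaying the same trace against two different back ends, a vendor driver and a simulator driver for example, is what makes it possible to compare their outputs.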


OpenGL Library

- Transforms Fixed Function into Shader code
- 200 API Calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via Shader code

Driver

- Low level access
- Attila memory management
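
To give a feel for what emulating alpha test and fog in shader code involves, the sketch below writes the standard OpenGL fixed-function linear fog blend and a "greater than" alpha test as plain per-fragment functions. The ARB programs that the Attila library generates are not shown in the slides. The function names and the choice of Python instead of shader assembly are ours.

```python
def linear_fog(rgb, fog_rgb, depth, fog_start, fog_end):
    """Blend a fragment color with the fog color (OpenGL GL_LINEAR fog).

    f = (end - depth) / (end - start), clamped to [0, 1];
    result = f * color + (1 - f) * fog_color.
    """
    f = (fog_end - depth) / (fog_end - fog_start)
    f = min(1.0, max(0.0, f))
    return tuple(f * c + (1.0 - f) * fc for c, fc in zip(rgb, fog_rgb))


def alpha_test_greater(rgba, ref):
    """Return the fragment if its alpha exceeds ref, else None (discarded)."""
    return rgba if rgba[3] > ref else None


if __name__ == "__main__":
    print(linear_fog((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 5.0, 0.0, 10.0))
    print(alpha_test_greater((1.0, 1.0, 1.0, 0.25), 0.5))
```

In a shader, the discard case would map to a fragment kill instruction instead of returning `None`.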


ATTILA Simulator

- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
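
A cycle-by-cycle simulator built from boxes can be pictured as a loop that clocks every box once per cycle, with boxes exchanging data over fixed-latency signals. The slides mention boxes and signal traffic but not their interfaces, so the `Signal`, `Producer` and `Consumer` classes and the `simulate` function below are illustrative assumptions, not Attila code.

```python
from collections import deque


class Signal:
    """Fixed-latency wire between two boxes."""

    def __init__(self, latency):
        self.latency = latency
        self.in_flight = deque()  # entries are (arrival_cycle, data)

    def write(self, cycle, data):
        self.in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        """Return data that has arrived by this cycle, or None."""
        if self.in_flight and self.in_flight[0][0] <= cycle:
            return self.in_flight.popleft()[1]
        return None


class Producer:
    """Box that emits one item per cycle until it runs out."""

    def __init__(self, out, count):
        self.out, self.remaining = out, count

    def clock(self, cycle):
        if self.remaining:
            self.out.write(cycle, f"item{cycle}")
            self.remaining -= 1


class Consumer:
    """Box that records when each item arrives."""

    def __init__(self, inp):
        self.inp, self.received = inp, []

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.received.append((cycle, data))


def simulate(boxes, cycles):
    """Clock every box once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


if __name__ == "__main__":
    wire = Signal(latency=3)
    consumer = Consumer(wire)
    simulate([Producer(wire, count=2), consumer], cycles=6)
    print(consumer.received)
```

Because every signal has a latency of at least one cycle, the order in which boxes are clocked within a cycle does not change the result in this sketch.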

Find the differences

[Figure: rendered images labeled "NVIDIA GeForce FX 5900XT" and "Attila".]


Benchmark

- Unreal Tournament 2004
  - Fixed function OpenGL API
    - Vertex and fragment shaders generated by our library
  - 1024x768 resolution
  - 8x Anisotropic Filtering
  - 160 of 450 frames simulated
  - 40 frames ~ 1 day simulation
  - On a Xeon P4 @ 2.0 GHz

Baseline Configuration

- Four Vertex Shaders (only for Attila-Classic)
- Fragment and Unified shader configuration:
  - 32 threads
    - 4 fragments/vertices per thread
    - 16 128-bit FP registers available for temporary storage per thread
  - n SIMD ALUs
  - 1 scalar ALU (optional)
  - 1 Texture Unit per Shader Unit
    - 16 KB texture cache
    - Single cycle bilinear and two cycle trilinear
    - AF up to 16x
- Geometry and Rasterization pipelines limited to 1 vertex and 1 triangle per cycle
- Two ROPs: 8 z and 8 color values written per cycle
- Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
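
The quoted peak bandwidth follows directly from the bus configuration once DDR's two transfers per clock cycle are taken into account. The helper name `peak_bytes_per_cycle` below is ours.

```python
def peak_bytes_per_cycle(n_buses, bus_width_bits, transfers_per_cycle=2):
    """Peak memory bandwidth in bytes per clock cycle.

    transfers_per_cycle defaults to 2 because DDR transfers data on
    both clock edges.
    """
    return n_buses * bus_width_bits * transfers_per_cycle // 8


if __name__ == "__main__":
    print(peak_bytes_per_cycle(4, 64))  # four 64-bit DDR buses
```

With four 64-bit buses this evaluates to 64 bytes per cycle, matching the slide.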


“Classic” Performance

[Figure: relative performance (axis from 1 to 3) for 1-way, 1-way + scalar, 2-way and 4-way configurations. Series: 2sh, 4sh, 6sh, 8sh. Annotations on the chart: ~75%, ~45%, ~40%, 7%, 8%.]

- 8% improvement for 2-way
- Near linear improvement for 4 shaders
- Sublinear improvement for 6 and 8 shaders
  - Limited by memory bandwidth and latency

Vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units

- Frame 330
  - Detailed Zoom

[Figure: utilization (axis from 0 to 1) of the Vertex Shader and the Fragment Shader over time (10K-cycle steps, axis from 1 to 901), with a region annotated "Vertex shading limited".]

Unified Shader Performance

[Figure: relative performance (axis from 1 to 3) for 1-way, 1-way + scalar, 2-way and 4-way configurations. Series: 2sh, uni2sh, 4sh, uni4sh, 6sh, uni6sh, 8sh, uni8sh.]

- Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)
  - Fragment shading limited
  - Vertex fetch limited
  - Geometry pipeline limited

Area Estimation

                          ATI R400   ATI RV400
Transistors (millions)    160        120
Vertex Shaders            6          4
Fragment Shaders          4          2

Hardware Element          Estimated Area (Millions of Transistors)
Vertex Shader             2.5
Fragment Shader           15
Additional SIMD ALU       +15%
Additional scalar ALU     +5%

160 - 120 = 40 = 2 vertex shaders * 2.5 + 2 fragment shaders * 15 + 5 (other)
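
The area estimate can be checked with a small calculation. The per-shader figures and the 5 million "other" transistors come from the slide. The reading of "+15%" and "+5%" as increments relative to the base fragment shader area that add linearly is our assumption, as are the function names.

```python
VERTEX_SHADER_MT = 2.5     # millions of transistors (from the slide)
FRAGMENT_SHADER_MT = 15.0  # millions of transistors (from the slide)
OTHER_MT = 5.0             # "other" term in the slide's equation


def area_delta(extra_vertex, extra_fragment, other=OTHER_MT):
    """Transistor difference (millions) explained by extra shader units."""
    return (extra_vertex * VERTEX_SHADER_MT
            + extra_fragment * FRAGMENT_SHADER_MT
            + other)


def fragment_shader_area(extra_simd_alus=0, scalar_alu=False):
    """Fragment shader area with optional extra ALUs.

    Assumption: each extra SIMD ALU adds 15% and a scalar ALU adds 5%
    of the base fragment shader area.
    """
    factor = 1.0 + 0.15 * extra_simd_alus + (0.05 if scalar_alu else 0.0)
    return FRAGMENT_SHADER_MT * factor


if __name__ == "__main__":
    print(area_delta(2, 2))   # two extra vertex and two extra fragment shaders
    print(4 * VERTEX_SHADER_MT / FRAGMENT_SHADER_MT)  # 4 VS vs 1 FS area
    print(fragment_shader_area(extra_simd_alus=1))    # 2-way fragment shader
```

The first value reproduces the 40 million transistor gap between the two chips, and the second reproduces the earlier claim that four vertex shaders take about 66% of the area of one fragment shader.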

Shader Scaling vs Transistors

[Figure: fps versus MTransistors for the 2-way, uni 2-way, 1-way and uni 1-way configurations, with a linear reference line and points for 2sh, 4sh, 6sh and 8sh.]

- Linear for 4 shader units, sublinear for more than 4 shader units
- Up to 30% more efficient per area for the unified architecture (two 1-way shaders)

Conclusion

- Attila Unified architecture has better performance than Attila Classic with less hardware
  - Up to 8% better performance
  - 8% to 25% less area required
  - 10% to 30% better performance per area
- Up to 8% better performance for 2-way shader units
- 160% better performance from 2 to 8 fragment or unified shader units
  - Memory bandwidth limited beyond 4 shaders
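
The scaling result can be put in perspective by comparing it with ideal linear scaling. The sketch below reads "160% better performance" as a 2.6x speedup when going from 2 to 8 shader units, which is our interpretation of the phrasing. The helper name `scaling_efficiency` is ours.

```python
def scaling_efficiency(improvement, unit_ratio):
    """Achieved speedup divided by the ideal (linear) speedup.

    improvement: relative gain, e.g. 1.60 for "160% better".
    unit_ratio: factor by which the number of shader units grew.
    """
    speedup = 1.0 + improvement
    return speedup / unit_ratio


if __name__ == "__main__":
    eff = scaling_efficiency(1.60, 8 / 2)
    print(f"speedup 2.6x of an ideal 4x: {eff:.0%} scaling efficiency")
```

Under that reading, quadrupling the shader count delivers roughly 65% of the ideal speedup, consistent with the slide's remark that the design becomes memory bandwidth limited beyond 4 shaders.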


Questions

Performance of Attila Unified vs Classic Attila

[Figure: relative performance (axis from 1 to 1.09) for uni2sh, uni4sh, uni6sh and uni8sh. Series: 1-way, 1-way + scalar, 2-way, 4-way.]