Graphics processors


Norm Rubin


compiler architect



normanr@ati.com


Feb 15, 2005


Size of market


Many millions of GPUs ship per month


The 3D market is entertainment (games)


Each new generation of GPU adds enough
performance to support a new version of a
game.


Each time a game is released, players have to
replace hardware to run the game.



The game industry is larger than Hollywood.



Technology view


[Figure: GPU and CPU placed along three scales. Performance / function: not enough, ok, too good. Architecture: proprietary to commodity. Interfaces: mutable to locked down.]


How much headroom


Pixar uses 100,000 minutes of compute per minute of
image


GPUs are real time, so closing the 100,000x gap takes roughly 20 doublings


Most optimistic marketing version of Moore’s
law


performance doubles every 6 months



So there are about 10 years to go.
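
The headroom estimate can be checked with a few lines of arithmetic. This is only a sketch of the slide's reasoning: the variable names are made up for illustration, and the exact logarithm gives about 17 doublings (8.5 years), which the slide rounds up to 20 doublings and 10 years.

```python
import math

# Figures quoted on the slide.
offline_to_realtime_gap = 100_000   # minutes of compute per minute of image
months_per_doubling = 6             # the "most optimistic marketing" doubling period

# Doublings needed to close the gap, computed exactly and rounded up.
doublings = math.ceil(math.log2(offline_to_realtime_gap))
years = doublings * months_per_doubling / 12

print(f"doublings needed: {doublings}")
print(f"years at one doubling per {months_per_doubling} months: {years}")
```

Either way the conclusion is the same order of magnitude: close to a decade of doubling every six months.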



Application space


Problems are embarrassingly parallel


Problems are big: for a 1000 x 1000 screen the program
runs per pixel, including some pixels that are
behind others, so 10 * 1000 * 1000 calls per
frame * 20-60 frames per second


Run the same program over and over, so


GPUs are SIMD machines
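
As a rough check on the size of the problem, the slide's figures can be multiplied out. The function name and parameter names below are illustrative only; the numbers are the ones quoted above (a 1000 x 1000 screen, a factor of 10 for hidden pixels, 20 to 60 frames per second).

```python
def shader_calls_per_second(width, height, overdraw, fps):
    """Pixel-program invocations per second for one screen."""
    return width * height * overdraw * fps

low = shader_calls_per_second(1000, 1000, 10, 20)
high = shader_calls_per_second(1000, 1000, 10, 60)
print(f"{low:,} to {high:,} calls per second")
```

That is hundreds of millions of invocations of the same small program every second, which is what makes a SIMD design attractive.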


SIMD


There are many units executing in parallel


These are in lock-step, executing the same
instruction on different pixels/vertices at the same
time


Dynamic flow control can cause inefficiencies in
such an architecture since different pixels/vertices
can take different code paths


Dynamic branching is not always a performance win


For an if…then…else, the hardware needs to execute both sides,
turning processors on and off.
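
The cost of divergence can be illustrated with a toy model of lock-step execution. This is a sketch under simple assumptions, not a description of any real chip: `simd_if_else` is a hypothetical helper, each list element stands for one processing unit, and a boolean mask plays the role of turning processors on and off.

```python
def simd_if_else(values, cond, then_fn, else_fn):
    """Run an if/then/else over all lanes in lock-step using an execution mask."""
    mask = [cond(v) for v in values]   # per-lane branch decision
    out = list(values)
    steps = 0

    # Pass 1: the "then" side. Lanes whose mask is False are switched off.
    if any(mask):
        out = [then_fn(v) if m else o for v, m, o in zip(values, mask, out)]
        steps += 1

    # Pass 2: the "else" side. Lanes whose mask is True are switched off.
    if not all(mask):
        out = [o if m else else_fn(v) for v, m, o in zip(values, mask, out)]
        steps += 1

    return out, steps

def double(v):
    return v * 2

def negate(v):
    return -v

def positive(v):
    return v > 0

# All lanes agree: only one side of the branch is executed.
coherent, coherent_steps = simd_if_else([1, 2, 3, 4], positive, double, negate)
# Lanes disagree: both sides are executed, each with half the lanes idle.
divergent, divergent_steps = simd_if_else([1, -2, 3, -4], positive, double, negate)
print(coherent, coherent_steps)
print(divergent, divergent_steps)
```

When neighboring pixels take the same path the branch costs one pass; when they diverge, the whole group pays for both.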


Application space


Many values are coherent


Values in neighboring
pixels are close.


Compute coherent variables at selected points, then
use interpolation to find the intermediate values



Today the programmer specifies which variables
are coherent by splitting programs in two.
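
A minimal sketch of the idea, assuming the selected points are the three vertices of a triangle and the intermediate values are found with barycentric weights. The names and the sample brightness values are hypothetical.

```python
def interpolate(weights, vertex_values):
    """Blend per-vertex values with barycentric weights that sum to 1."""
    return sum(w * v for w, v in zip(weights, vertex_values))

# Hypothetical brightness computed once per vertex by the first half of the program.
brightness_at_vertices = [0.0, 1.0, 0.5]

# A pixel at the triangle's centroid weights all three vertices equally.
centroid = interpolate([1 / 3, 1 / 3, 1 / 3], brightness_at_vertices)
# A pixel sitting exactly on the second vertex reproduces that vertex's value.
corner = interpolate([0.0, 1.0, 0.0], brightness_at_vertices)
print(centroid, corner)
```

The expensive computation runs three times per triangle rather than once per pixel; the per-pixel half of the program only sees the interpolated result.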


Application space


Common subproblem is texture filtering


Evaluate some array of memory around a stencil and
combine


Provide a small fixed set of stencil patterns in
hardware


You could think of this as slightly smart memory


Hardware support for 1D to 3D arrays and several
filtering functions


Exact stencil patterns and combining operations are
proprietary

(some look better than others)
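
Since the actual hardware stencils are proprietary, the sketch below uses textbook bilinear filtering as a stand-in: it reads a 2x2 stencil of texels around the sample point and combines them with distance weights. The function name, the list-of-rows texture layout, and the clamp-to-edge behavior are all assumptions made for illustration.

```python
def bilinear_sample(texture, u, v):
    """Sample a 2D texture (list of rows) at fractional texel coordinates."""
    height, width = len(texture), len(texture[0])

    # Clamp the sample point to the texture so the stencil stays in bounds.
    u = min(max(u, 0.0), width - 1)
    v = min(max(v, 0.0), height - 1)

    # The 2x2 stencil of texels surrounding the sample point.
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0

    # Combine: blend horizontally on both rows, then blend the rows.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

texture = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(texture, 0.5, 0.5)
print(center)
```

From the program's point of view this is a single memory read that happens to return a filtered value, which is the sense in which the memory is slightly smart.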



Application space


Little communication between processing
elements


Approximate spatial derivative by 2x2 difference
operator


Forces all machine designs to work on multiples
of four pixels
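
A sketch of why the derivative approximation ties pixels together in groups of four: each derivative is a difference between neighbors in the same 2x2 quad, so all four pixels have to be computed together. Which pairs of pixels are differenced is an assumption here; real hardware may choose differently.

```python
def quad_derivatives(quad):
    """Approximate d/dx and d/dy of a value from one 2x2 pixel quad.

    quad is [[top_left, top_right], [bottom_left, bottom_right]].
    """
    (top_left, top_right), (bottom_left, _bottom_right) = quad
    ddx = top_right - top_left      # difference between horizontal neighbors
    ddy = bottom_left - top_left    # difference between vertical neighbors
    return ddx, ddy

# Hypothetical values of some shader variable at the four pixels of a quad.
ddx, ddy = quad_derivatives([[1.0, 3.0],
                             [4.0, 6.0]])
print(ddx, ddy)
```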


Application space


Throughput is important


Use threading to cover latency


The chips can support hundreds of threads, and
can switch from thread to thread every cycle


No thread switch overhead


Hardware scheduler and thread system


Compiler knows about threads and splits resources
over threads


Caches are very different


can only cover
spatial locality
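
The effect of free thread switching can be seen in a toy simulation. The model is deliberately crude and all names and numbers are illustrative: every instruction is followed by a fixed memory wait, one instruction can issue per cycle, and switching threads costs nothing.

```python
def alu_utilization(num_threads, memory_latency, cycles=10_000):
    """Fraction of cycles in which some thread issues an instruction."""
    ready_at = [0] * num_threads   # cycle at which each thread can issue again
    busy = 0
    for cycle in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # Issue one instruction, then wait for memory.
                ready_at[t] = cycle + 1 + memory_latency
                busy += 1
                break              # only one issue slot per cycle
    return busy / cycles

single = alu_utilization(num_threads=1, memory_latency=99)
many = alu_utilization(num_threads=100, memory_latency=99)
print(single, many)
```

With one thread the unit sits idle 99% of the time; with enough threads to cover the latency it is always busy, even though no individual thread runs any faster. That is the throughput-over-latency trade the slide describes.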


Programming model



Performance is much less than users want


At least 100,000 times less


Most developers write each program at least
four times


Xbox


Playstation


ATI top machine


Nvidia top machine


Programs are in two parts: Vertex and Pixel
shaders.


Programming model 2


Programs can be written in a high-level,
C-like language (HLSL/OGL2)


Or in virtual assembly language (DirectX, …)


Almost one dialect per chip


The languages are virtual, but the resources are physical.


Developers review virtual machine listings for
performance


Developers ship virtual assembly language.


Programming model 3


At game startup, virtual assembly language is
JIT compiled to real machine language




Drastic change in resource requirements


Somewhat hard to debug


Hard to identify performance bottlenecks


Even though applications could build code on
the fly, developers pretest everything


They want the most performance to get the best-looking
image, and can only approximate what they
really want.


Programmable Pipeline


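[Diagram: Vertex Data (model space) enters the Geometry Stage, which runs either Fixed Function Transform and Lighting or a Vertex Shader, followed by Clipping and Viewport Mapping. The Rasterizer Stage then runs either the Texture Stages or a Pixel Shader, followed by Fog, Alpha, Stencil, and Depth Testing.]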


Vertex Processing Flow

[Diagram: the Vertex Shader Engine reads per-vertex data from the triangle mesh (position, normal, texture coordinates, etc.), constants (view matrix, projection matrix, skin/bone matrices, light positions, etc.), temporary registers, and the vertex shader instructions. It outputs position, “texture” coordinates, and color(s).]


Vertex Shader


Input:


Program specifies vertex data


Position


Normal


Vertex color


Texture coordinate(s)


Data is sent to the graphics card and processed
by the vertex shader


Output:


Vertex shader computes output quantities


Position


Vertex color: diffuse and specular


Texture coordinate(s)


Sent to rasterizer via interpolators
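
The input and output lists above map naturally onto a function that is called once per vertex. The sketch below is written in Python purely for illustration; a real vertex shader would be written in HLSL or virtual assembly. The constant names (`view_projection`, `light_dir`, `light_color`) and the simple Lambert lighting term are assumptions, not something taken from the slides.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-element vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def dot3(a, b):
    """Dot product of the first three components of two vectors."""
    return sum(a[i] * b[i] for i in range(3))

def vertex_shader(position, normal, texcoord, constants):
    """Map one input vertex to the outputs handed to the interpolators."""
    out_position = mat_vec(constants["view_projection"], position)
    # Simple Lambert term as a stand-in for the diffuse color.
    n_dot_l = max(0.0, dot3(normal, constants["light_dir"]))
    diffuse = [n_dot_l * c for c in constants["light_color"]]
    return {"position": out_position, "diffuse": diffuse, "texcoord": texcoord}

identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
constants = {"view_projection": identity,
             "light_dir": [0.0, 0.0, 1.0],
             "light_color": [1.0, 0.5, 0.25]}
result = vertex_shader([1.0, 2.0, 3.0, 1.0], [0.0, 0.0, 1.0], [0.25, 0.75], constants)
print(result)
```

Each call depends only on its own vertex and the shared constants, which is why vertices can be processed in parallel.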


Pixel Processing Flow

[Diagram: the Pixel Shader Engine reads interpolated values (“texture” coordinates, color(s)), constants (light colors, ambient lighting colors, etc.), textures, temporary registers, and the pixel shader instructions. It outputs color, optionally to multiple render targets.]


Program sizes


Most programs are very small


100 virtual instructions would be a large
program


Basic data type is a four-element vector of
floats


Integer data types are not yet available


Dynamic branching is new


Small amount of nesting allowed


Polygons


Polygon Budget


Ruby: 75,000


Optico: 50,000


Ninja: 25,000


Environment: 100,000


Props: 50,000



Lighting Limits


3 dynamic lights per shot (1 shadow-casting)


Lightmaps used for set



Animation Limits


35 total blend shapes


5 simultaneous blend shapes


4 weighted bones per vertex


Number of on-screen characters limited
to 4 at once



Shader Breakdown



Depth of Field



Hair



Skin


[Image slides: Depth Of Field]

Shader Breakdown


Glows



Motion Blur



Reflections


[Image slides: Glows, Motion Blur, Reflections]

Hardware view



X1900


Xbox 360




Both machines are current


X1900

[Block diagram. Labeled blocks: Vertex Data; Vertex Shader Engine; Backface Cull, Clip, Perspective Divide, Viewport Transform; Setup Engine; Geometry Assembly, Rasterization; Hierarchical Z Test; Interpolators; Ultra-Threading Dispatch Processor; Pixel Shader Engine; General Purpose Register Arrays; Texture Units; Texture Cache; Z / Stencil Buffer Cache; Compress / Decompress.]

Quad Pixel Shader Core

[Block diagram: the X1900 pipeline with the Pixel Shader Engine expanded into pixel shader processors. Each processor is labeled with Vector ALU 1, Vector ALU 2, Scalar ALU 1, Scalar ALU 2, and a Branch Execution Unit, alongside Texture Address Units 1 to 4.]

Pixel Shader Processors

Per clock cycle:


1 vec3 ADD + input modifier

1 scalar ADD + input modifier

1 vec3 ADD/MUL/MADD

1 scalar ADD/MUL/MADD

1 flow control instruction

Texture Address Units


1 texture address instruction
per unit per clock cycle


Vertex Engine


Upgraded to support SM3.0


Dynamic flow control


1,024 instructions (practically
unlimited with flow control)


More temporary registers


8 Vertex Shader Processors


Each can handle 2 shader
instructions per clock


10 billion instructions per
second
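
The three figures above are consistent with one another only at a particular clock rate, which the slide does not state. The sketch below simply solves for it; the variable names are illustrative and the resulting clock is an inference from the quoted numbers, not a specification.

```python
# Figures quoted on the slide.
processors = 8
instructions_per_clock = 2
instructions_per_second = 10e9

# Clock rate implied by the three quoted numbers (not stated on the slide).
implied_clock_hz = instructions_per_second / (processors * instructions_per_clock)
print(f"implied clock: {implied_clock_hz / 1e6:.0f} MHz")
```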


[Diagram. Labeled blocks: Vertex Data; 128-bit Vector ALU; 32-bit Scalar ALU; Flow Control; Backface Cull, Clip, Perspective Divide, Viewport Transform; output to the Setup Engine.]

Ring Bus Memory Controller


Supports today’s fastest
graphics memory devices


GDDR3, 48+ GB/sec


GDDR4, The future


512-bit Ring Bus


Simplifies layout and enables
extreme memory clock scaling


New Cache Design


Fully Associative for more
optimal performance


Improved Hyper Z


Better compression and hidden
surface removal


Programmable Arbitration Logic


Maximizes memory efficiency


Can be upgraded via software



Memory Channels: 4x Improvement in Random Access over X850

Radeon X850


4 x 64-bit channels


4 banks per DRAM

Radeon X1900


8 x 32-bit channels


8 banks per DRAM

[Diagram: memory controllers connect to the memory devices over a 256-bit interface, drawn as four 64-bit channels for the X850 and eight 32-bit channels for the X1900.]



Cache Design

[Diagram: graphics memory mapped through a direct-mapped cache versus a fully associative cache.]


Fully Associative Caches


Cache lines can map to any location in
external memory


Earlier designs used Direct Mapped &
N-Way Associative Caches


Could only access limited blocks of
external memory


Texture, Color, Z & Stencil caches are all
now fully associative


Reduces memory bandwidth
requirements


Minimizes cache contention stalls


Optimized game performance


Gains up to 25% clock for clock in
fill/bandwidth bound cases
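
The contention the slide refers to shows up in a small simulation. This is a generic textbook model, not the X1900 design: addresses stand for whole cache-line-sized blocks, the direct-mapped cache picks a line with a modulo, and the fully associative cache uses least-recently-used replacement, which is an assumption.

```python
from collections import OrderedDict

def direct_mapped_misses(addresses, num_lines):
    """Each address can live in exactly one line: address % num_lines."""
    lines = [None] * num_lines
    misses = 0
    for addr in addresses:
        slot = addr % num_lines
        if lines[slot] != addr:
            lines[slot] = addr     # evict whatever shared this slot
            misses += 1
    return misses

def fully_associative_misses(addresses, num_lines):
    """Any address can live in any line; evict the least recently used."""
    lines = OrderedDict()
    misses = 0
    for addr in addresses:
        if addr in lines:
            lines.move_to_end(addr)          # mark as most recently used
        else:
            misses += 1
            if len(lines) == num_lines:
                lines.popitem(last=False)    # drop the least recently used
            lines[addr] = True
    return misses

# Two blocks that collide in a 4-line direct-mapped cache, accessed alternately.
pattern = [0, 4, 0, 4, 0, 4]
dm = direct_mapped_misses(pattern, num_lines=4)
fa = fully_associative_misses(pattern, num_lines=4)
print(dm, fa)
```

In the direct-mapped case the two blocks keep evicting each other even though three lines sit empty; the fully associative cache misses only on the first touch of each block.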




Xbox 360


3.2GHz Custom IBM Central Processor



Three CPU Cores


Two Threads Per core


VMX Unit Per Core


128 VMX Registers Per Thread


1MB L2 Cache (Lockable by Graphics Processor)


500MHz Custom ATI Graphics Processor



Unified Shader Core


48 ALU’s for Vertex or Pixel Shader processing


16 Filtered & 16 Unfiltered Texture samples per clock


10MB eDRAM Framebuffer


512MB System RAM



Unified Memory Architecture (UMA)


128-bit interface


700MHz GDDR3 RAM
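
The interface width and memory clock above imply a peak system memory bandwidth. The sketch assumes two transfers per clock, as is usual for double-data-rate memory such as GDDR3; the function and parameter names are illustrative.

```python
def peak_bandwidth_gb_per_s(bus_bits, clock_mhz, transfers_per_clock=2):
    """Peak bandwidth in GB/s (10^9 bytes/s) for a DDR-style memory bus."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9

xbox_bandwidth = peak_bandwidth_gb_per_s(bus_bits=128, clock_mhz=700)
print(f"{xbox_bandwidth:.1f} GB/s")
```

This is a theoretical peak for the shared system RAM only; the 10MB eDRAM framebuffer is a separate memory with its own path.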



Architecture

[Block diagram. Labeled blocks: Command Processor; Memory Hub; Vertex Grouper; Primitive Assembly; Scan Converter; Shader Interp (two); Sequencer; Shader Pipe (x16), three of them; Vertex Cache; Texture Pipe, four of them; Texture Cache; Pipe Comm; Z/Alpha/Stencil Processors; 10MB DRAM; a link labeled 256 GB/sec.]


Adaptive Shader Array


Unified shader architecture


One processor type


Dynamic load balancing


Pixel and vertex processing where and
when they’re needed



48 shaders


120 billion operations per second
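
A back-of-the-envelope comparison shows what dynamic load balancing buys. The model is idealized and every number other than the 48 units is hypothetical: work is measured in abstract unit-cycles, it divides perfectly across units, and the fixed design is assumed to split its units 8 for vertices and 40 for pixels.

```python
def frame_time_split(vertex_work, pixel_work, vertex_units, pixel_units):
    """Cycles for a frame when vertex and pixel units are separate pools."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def frame_time_unified(vertex_work, pixel_work, units):
    """Cycles for a frame when every unit can run either kind of work."""
    return (vertex_work + pixel_work) / units

# A hypothetical vertex-heavy frame: equal amounts of vertex and pixel work.
split = frame_time_split(4800, 4800, vertex_units=8, pixel_units=40)
unified = frame_time_unified(4800, 4800, units=48)
print(split, unified)
```

In the split design the 40 pixel units finish early and wait on the 8 vertex units; in the unified design all 48 stay busy regardless of how the work is divided.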






Some interesting problems


Coherence (branch prediction?)


What are the right instructions?


Can you do non-graphics applications?


Programming language


Threading by compiler


Offline compile?



Implications for programming
languages


GPU: you can convince people to use a new
language if you can prove it is faster, even if it
means lots of changes


Desktop CPU: you have to prove it can meet some
other (non-performance/function) need



Top-of-the-line GPU prices are going up while
top-of-the-line desktop CPU prices are going down, leaving lots of
chances to do cool design.



Less need to be backward compatible.




More info


http://www.ati.com/developer/index.html