GPU and PC System Architecture




UC Santa Cruz BSoE



March 2009

John Tynefield / NVIDIA Corporation

My Goals

Survey history and direction of GPU/PC system architecture

Demonstrate the process of system level architectural problem solving

Motivate some of you to become architects

Disclaimers

I work for NVIDIA

Public Info

All numbers and dates approximate

Rounding is our friend

No bus/processor is 100% efficient, etc, etc


All examples are meant to be illustrative

Not comprehensive


“there were >40 gfx companies in 1995”

About Me

I love games and graphics

I love building things

Structure

Intro to PC and GPU Architecture

A Sampling of Architectures

1996 - Voodoo Graphics / Pentium

2000 - GeForce 256 / P3

2004 - GeForce 6800 / P4

2008 - GeForce GTX280 / Core2

Ideas for the future of the platform

What do architects do?

Impose structure on complex design problems

Make tradeoffs

Validate high risk design bets

Structure verification

Why this is a great time to be an Architect

Radical design mobility

I have contributed to 10 completely new processor designs, 7 of which shipped in millions of units.

Steep competition

Not for everybody

Changing the World…no…really!

Heterogeneous many core computing is here to stay and it has changed the nature of computing

Design Tension

Fixed Function vs. Programmable

Scalar vs. Vector

Bandwidth vs. Latency

In Order vs. Out of Order

Limited vs. Unlimited ( virtualized ) resources

Technology Trends

CPUs get faster

GPUs get faster

Interconnects get faster

Memory gets faster

Memory gets denser

Latency increases

Feature load increases

Physics intrudes more and more








All at different rates

[Chart: growth from 1996 to 2008 on a 0% to 30000% scale. Series: CPU Cores, CPU Interconnect BW, GPU Cores, GPU Interconnect BW, System Memory BW, GPU Memory BW]

The long time horizon

The Awesome ideas of now take 2+ years to reach market

Awesome depreciates rapidly

Predictable

Silicon Process Roadmap

PC Arch Roadmap

3rd Party Component Roadmap

Your capabilities and resources

Unpredictable

Market Shifts ( commodity prices, supply shocks )

3rd Party Strategic Errors ( OS/platform/partner slips )

Innovative Competition ( N-way struggle for design initiative )

Ultra Simplified PC Anatomy

[Diagram: CPU, Core Logic, GPU, GPU Memory, System Memory]

Ultra Simplified GPU Anatomy

[Diagram: Host Logic, Processor (x3), DRAM MGMT (x3)]

Ultra Simplified GPU Anatomy (2)

[Diagram: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, Memory]

GPU

Prehistory

1960s to 1970s

Single Purpose BIG IRON

E&S, GE, Lockheed, …

1980s to 1990s

General Purpose BIG IRON

Custom ASICs, Workstations

SGI, Sun, Intergraph, ..

1994

Maybe we can fit this on a single consumer add-in card?

Enabling Technologies in 1994

Fast consumer CPUs with floating point

Try 3D rendering in fixed point!

PCI

VGA and VESA

Id Software’s DOOM

Contract fabrication facilities offering .6 micron

ASIC design tools

1996 3dfx - Voodoo Graphics

PIO Programming Model

Pure Pipelined Graphics

Partial Triangle Setup

FP32

Fixed Point Integer Texture Mapping and Gouraud Shading

Z Buffer and Full OpenGL Blending

All at 1 PPC, all the time, with no caches

32-bit PCI - .09 GB/s

128-bit EDO 50 MHz DRAM - .8 GB/s
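
The memory figure above follows from bus width times clock rate. A minimal sketch, assuming one transfer per clock and using a hypothetical helper name (`peak_bandwidth_gb_s`). The slide's PCI figure (.09 GB/s) is lower than the theoretical 32-bit, 33 MHz peak, which fits the earlier disclaimer that no bus is 100% efficient.

```python
def peak_bandwidth_gb_s(bus_bits, clock_mhz, transfers_per_clock=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9

# 128-bit EDO DRAM at 50 MHz, as quoted on the slide
edo_gb_s = peak_bandwidth_gb_s(bus_bits=128, clock_mhz=50)

# 32-bit PCI at 33 MHz: theoretical peak vs. the slide's .09 GB/s
pci_peak_gb_s = peak_bandwidth_gb_s(bus_bits=32, clock_mhz=33)
pci_efficiency = 0.09 / pci_peak_gb_s

print(f"EDO DRAM peak: {edo_gb_s:.2f} GB/s")
print(f"PCI theoretical peak: {pci_peak_gb_s:.3f} GB/s")
print(f"Slide's PCI figure as a fraction of peak: {pci_efficiency:.0%}")
```

The same helper with `transfers_per_clock=2` models double data rate memory.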

Voodoo Graphics System Architecture

[Diagram. Pipeline stages: Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend. System blocks: CPU, Core Logic, System Memory, FBI, FB Memory, TMU, TEX Memory. Legend: GPU, CPU]

Arch Decision - Triangle Setup

Target 3D Triangle with texture and Gouraud shading

3 * XYW RGBA ST = 72 bytes/triangle pre setup

32-bit PCI 33 MHz - 90 MB/s

1.25 M triangles / second speed of light ( 1M is magic )

Observe that post setup

3 * XY WRGBAST start values + screen space derivatives + Area - 76 bytes/triangle - 1.18M Tris ( still magic )

Setup can be coded on Pentium in ~100 clocks

1M triangles on P100 ( mktg happy )

Data-limited setup on chip - >10% die cost

Typical game scenes <<1000 triangles/frame
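
The arithmetic behind this decision can be checked directly. A minimal sketch using only the slide's figures, except for the frame rate, which is an assumption added here to put the per-frame number in context. All names are hypothetical.

```python
PCI_BYTES_PER_S = 90e6     # slide's figure for 32-bit, 33 MHz PCI
PRE_SETUP_BYTES = 72       # bytes/triangle when the chip does setup (slide)
POST_SETUP_BYTES = 76      # bytes/triangle when the CPU does setup (slide)
CPU_HZ = 100e6             # Pentium 100
SETUP_CLOCKS = 100         # approximate CPU clocks per triangle setup (slide)
FRAMES_PER_S = 30          # ASSUMPTION: not stated on the slide

def bus_limited_tris_per_s(bytes_per_s, bytes_per_triangle):
    """Upper bound on triangle rate when the bus is the only limit."""
    return bytes_per_s / bytes_per_triangle

pre_setup_rate = bus_limited_tris_per_s(PCI_BYTES_PER_S, PRE_SETUP_BYTES)
post_setup_rate = bus_limited_tris_per_s(PCI_BYTES_PER_S, POST_SETUP_BYTES)
cpu_setup_rate = CPU_HZ / SETUP_CLOCKS

# The slowest of bus and CPU bounds the CPU-setup design.
cpu_design_rate = min(post_setup_rate, cpu_setup_rate)
tris_per_frame_budget = cpu_design_rate / FRAMES_PER_S

print(f"Bus limit, on-chip setup: {pre_setup_rate / 1e6:.2f} M tris/s")
print(f"Bus limit, CPU setup:     {post_setup_rate / 1e6:.2f} M tris/s")
print(f"CPU setup limit:          {cpu_setup_rate / 1e6:.2f} M tris/s")
print(f"Budget per frame at {FRAMES_PER_S} Hz: {tris_per_frame_budget:.0f} triangles")
```

Both options clear the 1M triangles/second mark, and both sit far above the typical scene load quoted on the slide, so doing setup on the CPU costs little bus headroom and saves the die area.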

2000 Nvidia GeForce 256

Decoupled input queuing

Hardware Transform & Lighting

FP32 FF Transform

FP22 FF Lighting

Complex fixed function pixel shading

4 Pipelines

AGP4X - 1.06 GB/s

256 Bit DDR 300 MHz Memory - 19.2 GB/s


GeForce 256 System Architecture

[Diagram. Pipeline stages: Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend. System blocks: CPU, Core Logic, GPU, GPU Memory, System Memory. Legend: GPU, CPU]

Architecture Detail - Combiners

Logical fixed function extension of OpenGL Machine

Surface Color = Diffuse * Texture + Specular

[Diagram inputs: Diffuse Color, Texture, Specular]

Multi Texture

If one texture is good, more are better

Diff * ( Tex1 + Tex2 ) + Spec or Diff * Tex1 * Tex2 or …

[Diagram inputs: Diffuse Color, Texture, 0.0, Texture, Specular]

[Diagram inputs: Diffuse Color, Texture, Texture2, 1.0, Specular]

Combiners

Cascading Mux / SOP / Mux / SOP pipeline

Very flexible, harder to program with deeper nesting

Everything is full speed!

[Diagram: A MUX, B MUX, AB Partial, C MUX, D MUX, CD Partial, Inputs for Next Stage of Pipeline. Inputs: Texture, Fog, Light]
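
One way to read the mux / sum-of-products (SOP) structure named above: each stage selects four inputs A, B, C, D from a register set, forms the partial products AB and CD, and outputs their sum for the next stage. The sketch below models that reading in Python. The register names, the constant `zero` and `one` inputs, and the lack of clamping are assumptions for illustration, not a description of the actual hardware.

```python
def combiner_stage(registers, a, b, c, d):
    """One mux / SOP stage: per-channel A*B + C*D on mux-selected inputs."""
    A, B, C, D = (registers[name] for name in (a, b, c, d))
    ab_partial = tuple(x * y for x, y in zip(A, B))
    cd_partial = tuple(x * y for x, y in zip(C, D))
    return tuple(p + q for p, q in zip(ab_partial, cd_partial))

registers = {
    "diffuse": (0.5, 0.5, 0.5),
    "tex1": (1.0, 0.5, 0.25),
    "tex2": (0.0, 0.5, 0.5),
    "specular": (0.25, 0.25, 0.25),
    "zero": (0.0, 0.0, 0.0),
    "one": (1.0, 1.0, 1.0),
}

# Single stage: Surface Color = Diffuse * Texture + Specular
single = combiner_stage(registers, "diffuse", "tex1", "specular", "one")

# Two cascaded stages: Diff * ( Tex1 + Tex2 ) + Spec
registers["tex_sum"] = combiner_stage(registers, "tex1", "one", "tex2", "one")
cascaded = combiner_stage(registers, "diffuse", "tex_sum", "specular", "one")

print(single)
print(cascaded)
```

Each added term needs another stage and another set of mux selections, which is one way to see why deeper nesting becomes harder to program.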

Programmable Shading

But the future was obviously RenderMan-like shaders

    normal surfaceN;
    color C = { 1.0, 0.5, 0.0 };
    normal lightDirection;

    Ci = C * dot( surfaceN, lightDirection );
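
For readers who do not know shading languages, the same computation can be written as a short Python function. This is a sketch that mirrors the slide's shader line for line. The function names are hypothetical, and like the slide's version it does not clamp negative dot products.

```python
def dot(a, b):
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))

def shade(surface_n, light_direction, c=(1.0, 0.5, 0.0)):
    """Ci = C * dot(surfaceN, lightDirection), applied per color channel."""
    n_dot_l = dot(surface_n, light_direction)
    return tuple(channel * n_dot_l for channel in c)

# Surface facing the light: full color.
facing = shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))

# Surface edge-on to the light: black.
edge_on = shade((0.0, 0.0, 1.0), (1.0, 0.0, 0.0))

print(facing)
print(edge_on)
```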



2004 Nvidia GeForce 6800

Fully general Vertex and Pixel ISA

6 Geometry Processors

16 Pixel Processors

Deep recirculating pipelines to hide latency

FP32 datapath end to end

AGP8X - 2.11 GB/s

256 Bit 700 MHz GDDR3 - 44 GB/s

GeForce 6800 System Architecture

[Diagram. Pipeline stages: Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend. System blocks: CPU, Core Logic, GPU, GPU Memory, System Memory. Legend: GPU, CPU. Additional labels: Physics and AI, Scene Mgmt]

Architecture Decision - Tex/Shader Structure

Problem:

Build a general programmable pipeline

Optimize for common workloads

TEX, BLEND, FOG

Common game shaders ( e.g. Doom 3 )

Plan A - Uncoupled

Elegant

Small fundamental unit

Many “passes” for common shaders

[Diagram labels: TBF, TEXMTH, TEX, BLND, BLND. Blocks: Registers, Texture, Math]

Plan B - Coupled

Less elegant

Larger fundamental unit

Single pass for common shaders

Good scaling for longer shaders

Big perf / area win given workloads

Not forward looking

[Diagram blocks: Registers, Math, Texture, Math]

2008 - GeForce GTX280

Fully unified programmable architecture

240 instances of the same processor

IEEE FP32 and FP64

Gen2 PCIE - 8 GB/s

512 bit 1100 MHz GDDR3 - 144 GB/s
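
Collecting the interconnect and GPU memory figures quoted on the four spec slides shows how far the two have moved since 1996 and how wide the gap between them stays. A small sketch using only those quoted (and admittedly approximate) numbers. The dictionary layout and function name are hypothetical.

```python
# GB/s figures as quoted on the 1996, 2000, 2004 and 2008 spec slides
generations = {
    1996: {"interconnect": 0.09, "gpu_memory": 0.8},    # PCI, EDO DRAM
    2000: {"interconnect": 1.06, "gpu_memory": 19.2},   # AGP4X, DDR
    2004: {"interconnect": 2.11, "gpu_memory": 44.0},   # AGP8X, GDDR3
    2008: {"interconnect": 8.0, "gpu_memory": 144.0},   # Gen2 PCIE, GDDR3
}

def growth_percent(first, last):
    """Percentage increase from first to last."""
    return (last / first - 1.0) * 100.0

first, last = generations[1996], generations[2008]
interconnect_growth = growth_percent(first["interconnect"], last["interconnect"])
memory_growth = growth_percent(first["gpu_memory"], last["gpu_memory"])

print(f"Interconnect growth 1996 to 2008: {interconnect_growth:.0f}%")
print(f"GPU memory growth 1996 to 2008:   {memory_growth:.0f}%")

# GPU memory bandwidth relative to the interconnect, per generation
for year, bw in generations.items():
    ratio = bw["gpu_memory"] / bw["interconnect"]
    print(f"{year}: GPU memory is {ratio:.1f}x the interconnect")
```

In every generation the GPU's local memory is roughly an order of magnitude faster than the link to the CPU, which is why work sent across the interconnect has to be chosen carefully.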

GeForce GTX280 System Architecture

[Diagram. Pipeline stages: Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend. System blocks: CPU, Core Logic, GPU, GPU Memory, System Memory. Legend: GPU, CPU. Additional labels: Physics and AI, Scene Mgmt]

Architecture Decision - Heterogeneous Computing Support

Build a bigger Chip

Radically improve ability of GPU to share work with the CPU

[Diagram: Thread with Register File and Local Memory; Block with Shared Memory; Grid 0 . . . and Grid 1 . . . as Sequential Grids in Time, sharing Global Memory]

Computing Support

Add Efficient Thread Launching

Add General Load / Store Instructions and Datapath

Add Shared Memory

Add computational loads to performance design requirements
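
The thread, block, and grid hierarchy named on this slide can be emulated serially in Python to show how the pieces relate. This is only a sketch of the programming model, not of the hardware. `launch_grid`, `make_block_sum`, and treating each phase boundary as a block-wide barrier are assumptions made for illustration, and real GPU threads run in parallel.

```python
def launch_grid(phases, num_blocks, threads_per_block, global_memory):
    """Run one grid: every block gets its own shared memory."""
    for block_idx in range(num_blocks):
        shared = [0.0] * threads_per_block       # per-block shared memory
        for phase in phases:                     # phase end acts as a barrier
            for thread_idx in range(threads_per_block):
                phase(block_idx, thread_idx, threads_per_block,
                      shared, global_memory)

def make_block_sum(src, dst):
    """Build a two-phase kernel: load to shared memory, then reduce."""
    def load(block_idx, thread_idx, block_dim, shared, gmem):
        global_idx = block_idx * block_dim + thread_idx   # per-thread register
        shared[thread_idx] = gmem[src][global_idx]        # general load

    def reduce(block_idx, thread_idx, block_dim, shared, gmem):
        if thread_idx == 0:
            gmem[dst][block_idx] = sum(shared)            # general store

    return [load, reduce]

gmem = {
    "x": [float(i) for i in range(16)],
    "partial": [0.0] * 4,
    "total": [0.0],
}

# Grid 0: four blocks of four threads, one partial sum per block.
launch_grid(make_block_sum("x", "partial"), 4, 4, gmem)

# Grid 1 runs after Grid 0 and reads its results from global memory.
launch_grid(make_block_sum("partial", "total"), 1, 4, gmem)

print(gmem["partial"])
print(gmem["total"][0])
```

Blocks only communicate through global memory, and only between grids, which is what lets the hardware schedule blocks independently across many processors.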

Future Graphics Directions

Higher density

Higher refresh

Higher dynamic range

Ubiquity

Lower Power

Shaving off the last burrs

Global Illumination

Higher quality modeling

Virtualized resources at interactive rates

Future PC Architecture Directions

Highly Integrated


Low Cost

Require a minimum visual feature set

Web/video/run today’s apps

And everyone else

Differentiated PCs

More bandwidth and more parallel horsepower

More mature unified programming models

C on CUDA

DX11


OpenCL

More resource virtualization


Q & A