physicsCML - Computer Science

rucksackbulgeΤεχνίτη Νοημοσύνη και Ρομποτική

1 Δεκ 2013 (πριν από 3 χρόνια και 9 μήνες)

70 εμφανίσεις

Modified from:


A Survey of General
-
Purpose Computation on
Graphics Hardware

John Owens

University of California, Davis

David Luebke

University of Virginia

with Naga Govindaraju, Mark Harris, Jens Kr
ü
ger, Aaron Lefohn, Tim Purcell

2

Motivation: The Potential of GPGPU


In short:


The power and flexibility of GPUs makes them an
attractive platform for general
-
purpose computation


Example applications range from in
-
game physics
simulation to conventional computational science


Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor

3

Problems: Difficult To Use


GPUs designed for & driven by video games


Programming model unusual


Programming idioms tied to computer graphics


Programming environment tightly constrained


Underlying architectures are:


Inherently parallel


Rapidly evolving (even in basic feature set!)


Largely secret


Can’t simply “port” CPU code!

4

STAR Goals


Detailed & useful survey of general
-
purpose
computing on graphics hardware


Hardware and software developments behind GPGPU


Building blocks
: techniques for mapping general
-
purpose computation to the GPU


Applications
: important applications of GPGPU


A comprehensive GPGPU bibliography

5

Triangle Setup

L2 Tex

Shader Instruction Dispatch

Fragment Crossbar

Memory

Partition

Memory

Partition

Memory

Partition

Memory

Partition

Z
-
Cull

NVIDIA GeForce 6800 3D Pipeline

Courtesy Nick Triantos, NVIDIA

Vertex

Fragment

Composite

6

Programming a GPU for Graphics


Each fragment is
shaded
w/

SIMD program


Shading can use values
from texture memory


Image can be used as
texture on future
passes


Application specifies
geometry


r慳a敲楺ed

7

Programming a GPU for GP Programs


Run a SIMD
kernel

over
each fragment


“Gather” is permitted
from texture memory


Resulting buffer can be
treated as texture on
next pass


Draw a screen
-
sized
quad


獴牥慭

8

Feedback



Each algorithm step depend on
the results of previous steps



Each time step depends on the
results of the previous time step

9

CPU
-
GPU Analogies








.



.


.


Grid[i][j]= x;


.


.


.





Array Write


=

Render to Texture

CPU

GPU

10

CPU
-
GPU Analogies





CPU





GPU










Stream / Data Array = Texture



Memory Read = Texture Sample


11

Kernels

Kernel / loop body / algorithm step = Fragment Program

CPU

GPU

12

Scatter vs. Gather


Grid communication


Grid cells share information



Gather


Indirect read from memory ( x = a[i])


Naturally maps to a texture fetch


Used to access data structures and data streams



Scatter


Indirect write to memory (a[i] = x)


Difficult to emulate:


Usually done on CPU


13

Computational Resources Inventory


Programmable parallel processors


Vertex & Fragment pipelines


Rasterizer


Mostly useful for interpolating addresses (texture
coordinates) and per
-
vertex constants


Texture unit


Read
-
only memory interface


Render to texture


Write
-
only memory interface

14

Vertex Processor


Fully programmable (SIMD / MIMD)


Processes 4
-
vectors (RGBA / XYZW)


Capable of scatter but not gather


Can change the location of current vertex


Cannot read info from other vertices


Can only read a small constant memory


Latest GPUs: Vertex Texture Fetch


Random access memory for vertices



Gather (But not from the vertex stream itself)

15

Fragment Processor


Fully programmable (SIMD)


Processes 4
-
component vectors (RGBA / XYZW)


Random access memory read (textures)


Capable of gather but not scatter


RAM read (texture fetch), but no RAM write


Output address fixed to a specific pixel


Typically more useful than vertex processor


More fragment pipelines than vertex pipelines


Direct output (fragment processor is at end of pipeline)

Building Blocks & Applications

17

GPGPU Building Blocks


fundamental techniques & computational
building blocks:


Flow control (a
very

fundamental building block)


Stream operations


Data structures


Differential equations & linear algebra


Data queries

18

Flow control


Surprising number of issues on GPUs


Main themes:


Avoid branching when possible


Move branching earlier in the pipeline when possible


Largely SIMD


coherent branching most efficient


Mechanisms:


Rasterized geometry


Z
-
cull


Occlusion query

19

Domain Decomposition


Avoid branches where outcome is fixed


One region is always true, another false


Separate FPs for each region, no branches


Example:

boundaries

20

Flat 3D Textures

21

Flat 3D Textures


Advantages


One texture update per operation


Better use of GPU parallelism


Non
-
power
-
of
-
two Textures


Quick simulation preview


Disadvantage


Must compute texture offsets

22

Staggered Simulation


Non
-
interactive application:


Simulate as fast as possible


Frame rate suffers

20ms

23

Staggered Simulation


Interactive frame rate!


Simulation still proceeds pretty fast

10

20ms

24

Z
-
Cull


In early pass, modify depth buffer


Write depth=0 for pixels that should not be modified
by later passes


Write depth=1 for rest


Subsequent passes


Enable depth test (GL_LESS)


Draw full
-
screen quad at z=0.5


Only pixels with previous depth=1 will be processed


Can also use early stencil test


Note: Depth replace disables ZCull

25

Pre
-
computation


Pre
-
compute anything that will not change
every iteration!


Example: arbitrary boundaries


When user draws boundaries, compute texture
containing boundary info for cells


Reuse that texture until boundaries modified


Combine with Z
-
cull for higher performance!

26

Stream Operations


Several stream operations in GPGPU toolkit:


Map
: apply a function to every element in a stream


Reduce
: use a function to reduce a stream to a
smaller stream (often 1 element)


Scatter/gather
: indirect read and write


Filter
: select a subset of elements in a stream


Sort
: order elements in a stream


Search
: find a given element, nearest neighbors, etc

27

Simple Fire Effect

Blur and scroll upward

Trails of blur emerge from
bright source ‘embers’ at the
bottom

VD

VA

VC

VB

28

Cellular Automata


Great for generating noise and other animated patterns to use in
blending


Game of Life in a Pixel Shader


Cell ‘state’ relative to the rules is computed at each texel


Dependent texture read


State accesses ‘rules’ table, which is a texture


Highly complex rules are easy!

The Rules

For a space that is 'populated':

Each cell with one or no neighbors dies,

as if by loneliness.

Each cell with four or more neighbors dies,

as if by overpopulation.

Each cell with two or three neighbors survives.

For a space that is 'empty' or 'unpopulated'

Each cell with three neighbors becomes populated

29

Lattice Computations



How far can we take them?


Anything we can describe with discrete PDE equations!


Discrete in space and time


Also other approximations

30

Approximate Methods


Several different approximations


Cellular Automata (CA)


Coupled Map Lattice (CML)


Lattice
-
Boltzmann Methods (LBM)

31

Coupled Map Lattice


Mapping:


Continuous state


lattice nodes


Coupling:


Nodes interact with each other to produce new state
according to specified rules

32

Coupled Map Lattice


CML introduced by Kaneko (1980s)


Used CML to study spatio
-
temporal chaos


Others adapted CML to physical simulation:


Boiling [Yanagita 1992]


Convection [Yanagita 1993]


Clouds [Yanagita 1997; Miyazaki 2001]


Chemical reaction
-
diffusion [Kapral ‘93]


Saltation (sand ripples / dunes) [ Nishimori ‘93]


And more


33

CML vs. CA


CML extends cellular automata (CA)




CA

CML

SPACE

Discrete

Discrete

TIME

Discrete

Discrete

STATE

Discrete

Continuous

34

CML vs. CA


Continuous state is more useful


Discrete: physical quantities difficult


Must filter over many nodes to get “real”
values


Continuous: physical quantities easy


Real physical values at each node


Temperature, velocity, concentration, etc.

35

Rules?


CML updated via simple, local rules


Simple: same rule applied at every cell (SIMD)


Local: cells updated according to some function of
their neighbors’ state


36

Example: Buoyancy


Used in temperature
-
based boiling
simulation


At each cell:


If neighbors to left and right of cell are warmer,
raise the cell’s temperature


If neighbors are cooler, lower its temperature


37

CML Operations


Implement operations as building blocks for use in
multiple simulations


Diffusion


Buoyancy (2 types)


Latent Heat


Advection


Viscosity / Pressure


Gray
-
Scott Chemical Reaction


Boundary Conditions


User interaction (drawing)


Transfer function (color gradient)

38

Anatomy of a CML operation


Neighbor Sampling


Select and read values,
v
, of nearby cells


Computation on Neighbors


Compute
f
(
v
)
for each sample (
f
can be arbitrary
computation)


Combine new values (arithmetic)


Store new values back in lattice


39

Graphics Hardware


Why use it?


Speed: up to 25x speedup in our sims


GPU perf. grows faster than CPU perf.


Cheap: GeForce 4 Ti 4200 < $130


Load balancing in complex applications


Why not use it?


Low precision computation (not anymore!)


Difficult to program (not anymore!)

40

Hardware Implementation (GF4)

Simulating the world


Simulate a wide variety of phenomena on
GPUs


Anything we can describe with discrete PDEs or
approximations of PDEs