Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

sizzlepictureSoftware and s/w Development

Dec 2, 2013 (3 years and 11 months ago)

72 views

Sparse Matrix Solvers on the GPU:
Conjugate Gradients and Multigrid

Jeffrey Bolz, Ian Farmer, Eitan Grinspun,
Peter Schr
öder


Caltech ASCI Center

Why Use the GPU?


Semiconductor trends


cost


wires vs. compute


Stanford streaming supercomputer


Parallelism


many functional units


graphics is prime example


Harvesting this power


what application suitable?


what abstractions useful?


History


massively parallel SIMD machines


media processing

1e-4
1e-3
1e-2
1e-1
1e+0
1e+1
1e+2
1e+3
1e+4
1e+5
1e+6
1e+7
1980
1990
2000
2010
2020
Perf (ps/Inst)
Linear (ps/Inst)
Chart courtesy Bill Dally

Possible

Actual

Imagine stream processor; Bill Dally, Stanford

Connection Machine CM2; Thinking Machines

Contributions and Related Work


Contributions


numerical algorithms on GPU


unstructured grids: conjugate gradients


regular grids: multigrid


what abstractions are needed?


Numerical algorithms


Goodnight et al. 2003 (MG)


Hall et al. 2003 (cache)


Harris et al. 2002 (FD sim.)


Hillisland et al. 2003 (optimization)


Krueger & Westermann 2003 (NLA)


Strzodka (PDEs)

Streaming Model


Abstract model


Purcell, et al. 2002


data structures: streams


algorithms: kernels


Concrete

model


render a rectangle


data structures:
textures


algorithms:
fragment programs

Kernel

input

record

stream

output

record

stream

globals

Rasterizer

(set up texture

indices and all

associated data)

Fragment

program

(for all pixels

in parallel)

Texture

as read
-
only

memory

Output goes to

texture

Bind buffer

to texture

Kernel

globals

Sparse Matrices: Geometric Flow


Ubiquitous in numerical computing


discretization of PDEs: animation


finite elements, difference, volumes


optimization, editing, etc., etc.


Example here:


processing of surfaces


Canonical non
-
linear problem


mean curvature flow



implicit time discretization


solve sequence of SPD systems









)
(
4
))
cot(
)
(cot(
i
N
j
ij
i
ii
ij
ij
ij
a
A
a
t
a



)
(
)
(
)
(
)
(
t
n
t
H
t
t
x
i
i
i
i
t





Velocity opposite mean

curvature normal





i
i
x
Ax


1
Conjugate Gradients


High level code


inner loop


matrix
-
vector

multiply


sum
-
reduction


scalar
-
vector

MAD


Inner product


fragment
-
wise multiply


followed by sum
-
reduction


odd dimensions can be handled

y=Ax

Aj


off
-
diagonal

matrix elements

R


pointers

to segments

Row
-
Vector Product

X


vector elements

R


pointers to segments

A
i



diagonal matrix elements

J


pointers to x
j

A
j



off
-
diagonal matrix elements

Fragment program

Apply to All Pixels


Two extremes


one row at a time: setup overhead


all rows at once: limited by worst row


Middle ground


organize “batches” of work


How to arrange batches?


order rows by non
-
zero entries


optimal packing NP hard


We choose fixed size rectangles


fragment pipe is quantized


simple experiments reveal best size


26 x 18


91% efficient


wasted fragments on diagonal

Time

Area

(pixels)

Packing (Greedy)

9

9

8

8

8

8

8

7

7

15

13

13

12

12

11

10

9

9

7

7

7

7

7

7

7

7

6

5

5

4

15

13

13

12

12

11

10

9

9

9

9

8

8

8

8

8

7

7

7

7

7

7

7

7

7

7

6



non
-
zero entries

per row

each batch

bound to an

appropriate

fragment program

All this setup done

once only at the

beginning of time.

Depends only on

mesh connectivity

Recomputing Matrix


Matrix entries depend on surface


must “render” into matrix


two additional indirection textures




previous and next

Results (NV30@500MHz)


37k elements


matrix multiply


33 instructions, 120 per second


only 13 flops


latency limited


reduction


7 inst/frag/pass, 3400 per second


CG solve: 20 per second

Regular Grids


Poisson solver as example


multigrid approach


this time variables on “pixel grid”


e.g.: Navier
-
Stokes

b
u
u
u
u
u















2
)
(
0
t
u




p
2
after discretization:

solve Poisson eq.

at each time step

Poisson Equation


Appears all over the place


easy to discretize on regular grid


matrix multiply is

stencil application


FD Laplace stencil:


Use iterative matrix solver


just need application of stencil


easy: just like filtering


incorporate geometry (Jacobian)


variable coefficients

(i,j)

-
4

1

1

1

1

0

0

0

0

j
i
j
i
j
i
j
i
j
i
j
i
X
X
X
X
X
X
,
1
,
1
,
,
1
,
1
,
2
4










Multigrid

Relax

Relax

Relax

Relax

Relax

Projection

Projection

Interpolation

Interpolation


Fine to coarse to fine cycle


high freq. error removed quickly


lower frequency error takes longer

Relax, Project, Interpolate

Computations and Storage Layout


Lots of stencil applications


matrix multiply: 3x3 stencil


projection: 3x3 stencil


interpolation: 2x2(!)


floor op in indexing


Storage for matrices and DOFs


variables in one texture


matrices in 9(=3x3) textures


all textures packed


exploit 4 channels


domain decomp.


padded boundary

1/16

1

1

1

1

2

2

2

2

4













2
1
,
0
2
2
/
)
(
4
1
d
h
h
d
i
i
v
v
x

y

z

w

Coarser Matrices


Operator at coarser level


needed for relaxation at all levels





Triple matrix product…


work out terms and map to stencils


exploit local support of stencils


straightforward but t
-
e
-
d
-
i
-
o
-
u
-
s

A
f

A
c

S

P

=















2
2
}
1
,
0
,
1
{
,
}
1
,
0
,
1
{
,
2
2
]
2
[
]
2
[
'
]
[
'
4
/
1
]
2
[
4
/
1
]
[
g
e
g
h
g
e
g
h
d
g
e
e
d
h
e
i
A
d
g
e
S
e
S
e
i
A
S
S
i
A
Results (NV30@500MHz)


257x257 grid



matrix multiply
-

27 instructions


1370 per second


interpolation 10 inst.


projection 19 inst.


Overall performance


257x257 at 80 fps!

Conclusions


Enhancements


global registers for reductions


texture fetch with offset


rectangular texture border


scalar versus vector problems


Where are we now?


good streaming processor


twice as fast as CPU implementation


lots of room for improvement


Scientific computing compiler


better languages! Brook? C*?


manage layout in a buffer