Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

GPU Computing for Numerical Relativity

SFB/TR7 Video Seminar

Andreas Weyhausen

B.Brugmann,J.Grigsby

Theoretisch-Physikalisches Institut Jena

SFB TR/7

15.11.2010

1/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Contents

1

Physical Motivation

2

GPU Computing

3

Finite dierences on the GPU

4

BSSN in 3D

5

Conclusion and further work

2/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Numerical relativity

Uses numerical methods to solve Einstein's equations.

This gives the possibility to study physical properties of

astrophysically interesting space-times.

In particular,the computation of gravitational waves.

The 3+1 split of the space-time

Foliate space-time in spatial hypersurfaces.

The slices are described by the spatial metric

ij

and the

extrinsic curvature K

ij

.

This allows us to rewrite Einstein's eld equations as a set of

evolution equations and constraints.

3/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

The ADM equations (Arnowitt,Deser and Misner (1962))

@

t

ij

=2K

ij

+D

i

j

+D

j

i

@

t

K

ij

=D

i

D

j

+(R

ij

2K

ik

K

k

j

+KK

ij

)

+

k

D

k

K

ij

+K

ik

D

j

k

+K

kj

D

i

k

H =R+K

2

K

ij

K

ij

= 0

M

i

=D

j

(K

ij

ij

K) = 0

Implemented on GPU by Burkhard Zink (2008)

Gauge choice

Lapse function and shift vector

i

represent part of the

freedom of chosing coordinates

Gauge used in this work:

1 +log slicing condition:(@

t

j

@

j

) = 2K

-driver:@

0

i

=

3

4

B

i

;@

0

B

i

= @

0

i

B

i

:

4/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

The BSSN equations (Baumgarte,Shapiro,Shibata and Nakamura)

ADM equations don't form a strongly hyperbolic evolution

system and therefore don't lead to numerically stable

simulations.

BSSN formulation:

Transform K

ij

:K

ij

= e

4

~

A

ij

+

1

3

ij

K

Transform

ij

:

ij

= e

4

ij

New variables

i

=

ij

i

ij

BSSN evolution variables:

ij

;

~

A

ij

;K;;

i

BSSN equations together with 1 +log slicing and -driver

condition lead to a strongly hyperbolic evolution system which

allows for long time stable numerical simulations.

5/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

The BSSN equations (Baumgarte,Shapiro,Shibata and Nakamura)

@

t

=

1

6

K +

i

@

i

+

1

6

@

i

i

@

t

ij

=2

~

A

ij

+

k

@

k

ij

+

ik

@

j

k

+

kj

@

i

k

2

3

ij

@

k

k

@

t

~

A

ij

= e

4

((D

i

D

j

)

TF

+R

TF

ij

) +(K

~

A

ij

2

~

A

il

~

A

l

j

)

+

k

@

k

~

A

ij

+

~

A

ik

@

j

k

+

~

A

kj

@

i

k

2

3

~

A

ij

@

k

k

@

t

K =

ij

D

j

D

i

+(

~

A

i

j

~

A

ij

+

1

3

K

2

) +

i

@

i

K

@

t

i

=2

~

A

ij

@

j

+2(

i

jk

~

A

kj

2

3

ij

@

j

K 6

~

A

ij

@

j

)

+

j

@

j

i

j

@

j

i

+

2

3

i

@

j

j

+

1

3

li

@

l

@

j

j

+

lj

@

j

@

l

i

Solving these equations in four dimensions is numerically

expensive.Per time step several thousands of Flops per data

point have to be computed.

6/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

GPUs (Graphics processing units)

Have developed from special purpose devices to powerful

general highly parallel devices.

The peak Flops/s performance outnumbers those of CPUs

by a factor of 100.

7/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Architecture of a modern GPU

RAM

Core

Frame buﬀer

Constant memory

Scheduler

SP

SP

SP

SP

SP

SP

SP

SP

Data-parallel

Cache

Double precision

unit

Registers

SP

SP

SP

SP

SP

SP

SP

SP

Data-parallel

Cache

Double precision

unit

Registers

SP

SP

SP

SP

SP

SP

SP

SP

Data-parallel

Cache

Double precision

unit

Registers

SMP SMP SMP

A GPU consists of many streaming processors (SP) which

are grouped to streaming multiprocessors (SMP).

There is global memory available,each SMP has a local fast

shared memory.

The scheduler assigns work to the SMP.

The GPU is connected to the CPU by a data bus.

8/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Programming a graphics card

Until ve years ago,special graphics programming frameworks

have to be used.

Today several programming frameworks allow general

programming of GPUs.

Nvidia's CUDA allows to program Nvidia GPUs.

OpenCL provides a standardized framework for several

dierent parallel devices.

Still,programming for a GPU is more low-level than

programming for a CPU.

9/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

The CUDA framework

Programming in C with some extensions.

A CUDA program consists of host code which is executed on

the CPU and device code which is executed on the GPU.

Kernel functions

global

void mykernel( args ) f...g

are called from the host to run on the GPU in a set of parallel

instances,the threads.

The threads are grouped into blocks.

The blocks are grouped to a thread grid.

10/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

CUDA's thread grid structure

Grid of blocks

(0,0)

(0,1)

(0,2)

(0,3)

(1,0)

(1,1)

(1,2)

(1,3)

(2,0)

(2,1)

(2,2)

(2,3)

Block of threads

(0,0)

(0,1)

(0,2)

(0,3)

(1,0)

(1,1)

(1,2)

(1,3)

(2,0)

(2,1)

(2,2)

(2,3)

(3,0)

(3,1)

(3,2)

(3,3)

Thread

blockIdx

(2,2)

threadIdx

(1,3)

Global

Memory

Constant

Memory

Shared Memory

Registers

Local

Memory

Each thread and each block has a unique ID which is used

for mapping data to the grid.

The thread blocks are one to three dimensional,the grid

one to two dimensional.

All threads have read/write access to the global memory.

All threads within a block have access to the shared memory.

Threads within one block can synchronize,threads of

dierent blocks have to be independent.

11/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

CUDA's execution model

Host Device

Memory

copy

Memory

copy

Kernel execution

Kernel execution

Serial code

ThreadGrid 1

ThreadGrid 2

RAM

Global

Memory

RAM

Global

Memory

Kernel calls are

asynchronous.

Only one kernel grid per

time.

Assignment of blocks to the

GPU

SMP SMP

(0,0)

(1,3)

(0,2)

(1,1)

(1,0)

(0,1)

(0,3)

(1,2)

SMP SMP SMP SMP

(1,2)

(0,0)

(0,3)

(1,3)

(0,2)

(1,0)

(0,1)

(1,1)

Scheduler Scheduler

Grid of blocks

(1,0)

(1,1)

(1,2)

(1,3)

(0,0)

(0,1)

(0,2)

(0,3)

Time

The scheduler assigns

the blocks to the SMP in

no particular order.

12/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

The scalar wave equation

Discretize scalar eld in space and time.

Evolve according to the 2D wave equation using the Leapfrog

scheme.

f

n+1

ij

= 2(1 2

2

)f

n

ij

f

n1

ij

+

2

f

n

i+1j

+f

n

i1j

+f

n

ij+1

+f

n

ij1

On CPU

i

j

Data point Ghost point

Update eld with a loop over

all grid points.

On GPU

Thread block

Thread grid

i

j

Kernel function exectued in

grid updates eld.

13/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Use shared memory as a cache

For nite dierences,each data point is accessed several

times.

Global memory is not cached.Use shared memory as a cache.

Global memory Shared memory

Which threads loads the shared memory ghost points?

14/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Three dierent ways of loading shared memory

Global memory Shared memory

Include ghost threads in

thread block.(Version

w2dv2)

Boundary threads load

ghost points.(Version

w2dv3)

Threads of rst and last

row load ghost points.

(Version w2dv4)

15/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Performance results

Single precision

0

10

20

30

40

50

60

0 200 400 600 800 1000 1200 1400

w2dv1

w2dv2

w2dv3

w2dv4

GFlops/s

Points per Dimension

Double precision

0

10

20

30

0 200 400 600 800 1000 1200 1400

w2dv1

w2dv2

w2dv3

w2dv4

GFlops/s

Points per Dimension

16/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Speed-up compared to CPU code

Single precision

0

10

20

30

40

50

60

70

80

90

100

0 200 400 600 800 1000 1200 1400

w2dv1

w2dv2

w2dv3

w2dv4

Speed-upfactor

Points per Dimension

Double precision

0

10

20

30

40

0 200 400 600 800 1000 1200 1400

w2dv1

w2dv2

w2dv3

w2dv4

Speed-upfactor

Points per Dimension

17/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Performance results

Including memory transfer

0

2

4

6

8

10

12

14

0 200 400 600 800 1000 1200 1400

w2dv1

w2dv2

w2dv3

w2dv4

GFlops/s

Points per Dimension

single precision

double precision

OpenCL implementation

0

5

10

15

20

25

30

35

40

45

50

0 200 400 600 800 1000 1200 1400

GFlops/s

Points per Dimension

18/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

3D nite dierences

CUDA thread grid can only be one or two dimensional.

How to map the three dimensional data on a lower

dimensional thread grid?

Method 1:Map 3D grid to 1D grid (w3dv1)

19/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Method 2:Launch grid several times (w3dv2)

z

x

shared memory

ghost zone

2d grid with

3d blocks

kernel call#1 kernel call#2 kernel call#3

Method 3:Threads iterate through data grid (w3dv3)

x

z

k+1

k

k-1

compute z=k shift and read memory compute z=k+1

Register

Shared

memory

new read

shift

20/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Double precision results

0

10

20

30

40

0 20 40 60 80 100 120 140 160 180 200

w3dv1

w3dv2

w3dv3

GFlops/s

Points per Dimension

Speed-up to CPU version

0

10

20

30

40

0 20 40 60 80 100 120 140 160 180 200

w3dv1

w3dv2

w3dv3

Speed-up

Points per Dimension

21/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Solving the BSSN equations on the GPU

Implementation is based on the BAM code.

Solving the BSSN equations using the method of lines:

The spatial RHS is discretized with second order nite

dierences.

A fourth order Runge-Kutta scheme is used for time

integration

Evolution on a uni grid,no adaptive mesh renement.

Sommerfeld boundary conditions are used.

22/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Basic code structure

src/

bssnCuda/

InitialData/

Output/

AlgCon/

Boundaries/

bssnRHS/

Update/

Structure of right hand side

Evolution Equations/

Advection/

Christoﬀels/

ConformalFactor/

CovariantDerivatives/

Derivatives/

Divergence/

ExtrinsicCurvature/

InverseMetric/

RHS/

Ricci/

23/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

The GPU code uses 4 dierent kinds of kernels

Derivative kernels compute all rst and second derivatives.

Advection kernels compute advection derivatives @

i

i

X.

Algebraic kernels compute algebraic expressions.

Boundary kernels ll ghost zones according Sommerfeld

boundary condition.

Update kernels Add two data elds and multiply data eld

with scalar.

24/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Evolution of a Schwarzschild black-hole

Evolution of the radial metric

component

1

1.1

1.2

1.3

1.4

1.5

−2.5 −1.5 −0.5 0.5 1.5 2.5

x

˜γ

rr

Evolution of the lapse function

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

−2.5 −1.5 −0.5 0.5 1.5 2.5

x

α

25/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Performance results for double precision.

Time per time step

0

1

2

3

4

5

6

10 30 50 70 90

BAM

GPU code

Timeinms

Points per Dimension

Flops/s for GPU version

0

1

2

3

4

5

6

7

8

9

10

10 30 50 70 90

including memory copy

without memory copy

GFlops/s

Points per Dimension

26/27

Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work

Conclusion

GPUs do speed-up applications,but fast enough?

Memory needs limit grid sizes of BSSN code.

There are still many optimizations possible.

Further work

Port the code the Nvidias new fermi architecture.

Optimize kernels of BSSN GPU code.

Generalize code to run on muliple GPUs.

Implement adaptive mesh renement.

OpenCL implementation.

Use spectral methods to compute the right hand side.

27/27

## Comments 0

Log in to post a comment