GPU Computing for Numerical Relativity

Physical Motivation · GPU Computing · Finite differences on the GPU · BSSN in 3D · Conclusion and further work
GPU Computing for Numerical Relativity
SFB/TR7 Video Seminar
Andreas Weyhausen
B.Brugmann,J.Grigsby
Theoretisch-Physikalisches Institut Jena
SFB TR/7
15.11.2010
Contents
1. Physical Motivation
2. GPU Computing
3. Finite differences on the GPU
4. BSSN in 3D
5. Conclusion and further work
Numerical relativity

- Uses numerical methods to solve Einstein's equations.
- This makes it possible to study the physical properties of astrophysically interesting space-times.
- In particular, the computation of gravitational waves.

The 3+1 split of the space-time

- Foliate space-time into spatial hypersurfaces.
- The slices are described by the spatial metric $\gamma_{ij}$ and the extrinsic curvature $K_{ij}$.
- This allows us to rewrite Einstein's field equations as a set of evolution equations and constraints.
The ADM equations (Arnowitt, Deser and Misner (1962))

\partial_t \gamma_{ij} = -2\alpha K_{ij} + D_i \beta_j + D_j \beta_i

\partial_t K_{ij} = -D_i D_j \alpha + \alpha \left( R_{ij} - 2 K_{ik} K^k{}_j + K K_{ij} \right) + \beta^k D_k K_{ij} + K_{ik} D_j \beta^k + K_{kj} D_i \beta^k

H = R + K^2 - K_{ij} K^{ij} = 0

M^i = D_j \left( K^{ij} - \gamma^{ij} K \right) = 0

- Implemented on the GPU by Burkhard Zink (2008)
Gauge choice

- Lapse function \alpha and shift vector \beta^i represent part of the freedom of choosing coordinates.
- Gauge used in this work:
  - 1+log slicing condition: (\partial_t - \beta^j \partial_j)\alpha = -2\alpha K
  - \Gamma-driver: \partial_0 \beta^i = \frac{3}{4} B^i, \quad \partial_0 B^i = \partial_0 \tilde{\Gamma}^i - \eta B^i
The BSSN equations (Baumgarte, Shapiro, Shibata and Nakamura)

- The ADM equations don't form a strongly hyperbolic evolution system and therefore don't lead to numerically stable simulations.
- BSSN formulation:
  - Transform K_{ij}: K_{ij} = e^{4\phi} \tilde{A}_{ij} + \frac{1}{3} \gamma_{ij} K
  - Transform \gamma_{ij}: \gamma_{ij} = e^{4\phi} \tilde{\gamma}_{ij}
  - New variables: \tilde{\Gamma}^i = \tilde{\gamma}^{jk} \tilde{\Gamma}^i{}_{jk}
  - BSSN evolution variables: \tilde{\gamma}_{ij}, \tilde{A}_{ij}, K, \phi, \tilde{\Gamma}^i
- The BSSN equations together with the 1+log slicing and \Gamma-driver conditions lead to a strongly hyperbolic evolution system which allows for long-time stable numerical simulations.
The BSSN equations (Baumgarte, Shapiro, Shibata and Nakamura)

\partial_t \phi = -\frac{1}{6} \alpha K + \beta^i \partial_i \phi + \frac{1}{6} \partial_i \beta^i

\partial_t \tilde{\gamma}_{ij} = -2\alpha \tilde{A}_{ij} + \beta^k \partial_k \tilde{\gamma}_{ij} + \tilde{\gamma}_{ik} \partial_j \beta^k + \tilde{\gamma}_{kj} \partial_i \beta^k - \frac{2}{3} \tilde{\gamma}_{ij} \partial_k \beta^k

\partial_t \tilde{A}_{ij} = e^{-4\phi} \left( (-D_i D_j \alpha)^{TF} + \alpha R^{TF}_{ij} \right) + \alpha \left( K \tilde{A}_{ij} - 2 \tilde{A}_{il} \tilde{A}^l{}_j \right) + \beta^k \partial_k \tilde{A}_{ij} + \tilde{A}_{ik} \partial_j \beta^k + \tilde{A}_{kj} \partial_i \beta^k - \frac{2}{3} \tilde{A}_{ij} \partial_k \beta^k

\partial_t K = -\gamma^{ij} D_j D_i \alpha + \alpha \left( \tilde{A}_{ij} \tilde{A}^{ij} + \frac{1}{3} K^2 \right) + \beta^i \partial_i K

\partial_t \tilde{\Gamma}^i = -2 \tilde{A}^{ij} \partial_j \alpha + 2\alpha \left( \tilde{\Gamma}^i{}_{jk} \tilde{A}^{kj} - \frac{2}{3} \tilde{\gamma}^{ij} \partial_j K - 6 \tilde{A}^{ij} \partial_j \phi \right) + \beta^j \partial_j \tilde{\Gamma}^i - \tilde{\Gamma}^j \partial_j \beta^i + \frac{2}{3} \tilde{\Gamma}^i \partial_j \beta^j + \frac{1}{3} \tilde{\gamma}^{li} \partial_l \partial_j \beta^j + \tilde{\gamma}^{lj} \partial_j \partial_l \beta^i

- Solving these equations in four dimensions is numerically expensive: several thousand floating-point operations per data point must be computed each time step.
GPUs (Graphics processing units)

- Have developed from special-purpose devices into powerful, highly parallel general-purpose devices.
- Their peak Flop/s performance exceeds that of CPUs by a factor of 100.
Architecture of a modern GPU
[Diagram: a GPU with RAM, frame buffer, constant memory and a scheduler; several streaming multiprocessors, each containing eight streaming processors, a data-parallel cache, a double-precision unit and registers]

- A GPU consists of many streaming processors (SPs) which are grouped into streaming multiprocessors (SMPs).
- There is global memory available; in addition, each SMP has its own fast shared memory.
- The scheduler assigns work to the SMPs.
- The GPU is connected to the CPU by a data bus.
Programming a graphics card

- Until five years ago, special graphics programming frameworks had to be used.
- Today several programming frameworks allow general-purpose programming of GPUs.
- Nvidia's CUDA allows programming Nvidia GPUs.
- OpenCL provides a standardized framework for several different parallel devices.
- Still, programming for a GPU is more low-level than programming for a CPU.
The CUDA framework

- Programming in C with some extensions.
- A CUDA program consists of host code which is executed on the CPU and device code which is executed on the GPU.
- Kernel functions

  __global__ void mykernel( args ) {...}

  are called from the host to run on the GPU in a set of parallel instances, the threads.
- The threads are grouped into blocks.
- The blocks are grouped into a thread grid.
CUDA's thread grid structure
[Diagram: a grid of blocks, each block a grid of threads; the highlighted thread has blockIdx (2,2) and threadIdx (1,3); memory spaces shown: global memory, constant memory, shared memory, registers, local memory]

- Each thread and each block has a unique ID which is used for mapping data to the grid.
- Thread blocks are one- to three-dimensional, the grid one- or two-dimensional.
- All threads have read/write access to the global memory.
- All threads within a block have access to the shared memory.
- Threads within one block can synchronize; threads of different blocks have to be independent.
CUDA's execution model
[Diagram: host/device timeline — serial host code, memory copies between RAM and global memory, and kernel executions launching thread grids 1 and 2]

- Kernel calls are asynchronous.
- Only one kernel grid runs at a time.

Assignment of blocks to the GPU

[Diagram: the scheduler distributes the blocks of a grid across the available SMPs over time]

- The scheduler assigns the blocks to the SMPs in no particular order.
The scalar wave equation

- Discretize the scalar field in space and time.
- Evolve according to the 2D wave equation using the Leapfrog scheme.

f^{n+1}_{ij} = 2(1 - 2\lambda^2) f^n_{ij} - f^{n-1}_{ij} + \lambda^2 \left( f^n_{i+1,j} + f^n_{i-1,j} + f^n_{i,j+1} + f^n_{i,j-1} \right)

On CPU

[Figure: 2D grid of data points surrounded by ghost points]

Update the field with a loop over all grid points.

On GPU

[Figure: the same grid decomposed into thread blocks forming a thread grid]

A kernel function executed in the grid updates the field.
Use shared memory as a cache

- For finite differences, each data point is accessed several times.
- Global memory is not cached. Use shared memory as a cache.

[Figure: a thread block's tile of the global-memory grid copied into shared memory, including a ring of ghost points]

- Which threads load the shared-memory ghost points?
Three dierent ways of loading shared memory

￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ ￿ ￿ ￿ ￿ ￿ ￿
￿ ￿ ￿ ￿ ￿ ￿ ￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿
￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿￿ ￿￿ ￿￿ ￿￿
￿￿ ￿￿ ￿￿ ￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
Global memory Shared memory
Include ghost threads in
thread block.(Version
w2dv2)

￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ ￿ ￿ ￿ ￿ ￿ ￿
￿ ￿ ￿ ￿ ￿ ￿ ￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿
￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿￿ ￿￿ ￿￿
￿￿ ￿￿ ￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
￿
￿
￿
￿
￿
￿
￿
￿
Boundary threads load
ghost points.(Version
w2dv3)

￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ ￿ ￿ ￿ ￿ ￿ ￿
￿ ￿ ￿ ￿ ￿ ￿ ￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿
￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿￿ ￿￿ ￿￿
￿￿ ￿￿ ￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
￿￿
￿
￿
￿
￿
￿
￿
￿
￿
Threads of rst and last
row load ghost points.
(Version w2dv4)
Performance results
Single precision

[Plot: GFlops/s (0-60) vs. points per dimension (0-1400) for versions w2dv1-w2dv4]

Double precision

[Plot: GFlops/s (0-30) vs. points per dimension (0-1400) for versions w2dv1-w2dv4]
Speed-up compared to CPU code
Single precision

[Plot: speed-up factor (0-100) vs. points per dimension (0-1400) for versions w2dv1-w2dv4]

Double precision

[Plot: speed-up factor (0-40) vs. points per dimension (0-1400) for versions w2dv1-w2dv4]
Performance results
Including memory transfer

[Plot: GFlops/s (0-14) vs. points per dimension (0-1400) for versions w2dv1-w2dv4, single and double precision]

OpenCL implementation

[Plot: GFlops/s (0-50) vs. points per dimension (0-1400)]
3D nite dierences
 CUDA thread grid can only be one or two dimensional.

How to map the three dimensional data on a lower
dimensional thread grid?
Method 1:Map 3D grid to 1D grid (w3dv1)
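One common way to realize method 1 (a sketch with our naming, not necessarily the talk's exact layout) is to give each thread a single linear ID and recover its (i, j, k) coordinates by integer division:

```python
# Flatten a 3D data grid onto a 1D index range and back, as a 1D thread
# grid would do inside a kernel.
def flatten(i, j, k, nx, ny):
    """Linear index of point (i, j, k) in an nx * ny * nz grid."""
    return i + nx * (j + ny * k)

def unflatten(tid, nx, ny):
    """Recover (i, j, k) from the linear thread index tid."""
    k, rest = divmod(tid, nx * ny)
    j, i = divmod(rest, nx)
    return i, j, k

# Round trip over a small 4 x 3 x 2 grid
nx, ny, nz = 4, 3, 2
assert all(unflatten(flatten(i, j, k, nx, ny), nx, ny) == (i, j, k)
           for k in range(nz) for j in range(ny) for i in range(nx))
```

With this layout, neighbors in i are adjacent in memory, which matters for coalesced global-memory access.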
Method 2: Launch the grid several times (w3dv2)

[Figure: a 2D grid of 3D blocks sweeps through the z direction over several kernel calls, reloading the shared-memory ghost zone for each call]

Method 3: Threads iterate through the data grid (w3dv3)

[Figure: each thread column marches along z, holding the planes k-1, k, k+1; per iteration it computes z = k, then shifts the planes through registers and shared memory and reads one new plane to compute z = k+1]
Double precision results

[Plot: GFlops/s (0-40) vs. points per dimension (0-200) for versions w3dv1-w3dv3]

Speed-up compared to the CPU version

[Plot: speed-up (0-40) vs. points per dimension (0-200) for versions w3dv1-w3dv3]
Solving the BSSN equations on the GPU

- The implementation is based on the BAM code.
- The BSSN equations are solved using the method of lines:
  - The spatial RHS is discretized with second-order finite differences.
  - A fourth-order Runge-Kutta scheme is used for time integration.
- Evolution on a unigrid, no adaptive mesh refinement.
- Sommerfeld boundary conditions are used.
Basic code structure

src/
  bssnCuda/
  InitialData/
  Output/
  AlgCon/
  Boundaries/
  bssnRHS/
  Update/

Structure of the right hand side

Evolution Equations/
  Advection/
  Christoffels/
  ConformalFactor/
  CovariantDerivatives/
  Derivatives/
  Divergence/
  ExtrinsicCurvature/
  InverseMetric/
  RHS/
  Ricci/
The GPU code uses 4 different kinds of kernels

- Derivative kernels compute all first and second derivatives.
- Advection kernels compute advection derivatives \beta^i \partial_i X.
- Algebraic kernels compute algebraic expressions.
- Boundary kernels fill the ghost zones according to the Sommerfeld boundary condition.
- Update kernels add two data fields and multiply a data field by a scalar.
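The update kernels are the simplest of these; one plausible reading of "add two data fields and multiply a data field by a scalar" is an axpy-style operation, as needed for the Runge-Kutta substeps (a sketch with our naming, not the actual kernel):

```python
import numpy as np

# Sketch of an update kernel: elementwise y <- y + a * x over whole data
# fields; on the GPU this is one thread per grid point.
def update(y, a, x):
    y += a * x
    return y

y = np.zeros(4)
x = np.ones(4)
print(update(y, 0.5, x))   # [0.5 0.5 0.5 0.5]
```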
Evolution of a Schwarzschild black hole

Evolution of the radial metric component

[Plot: \tilde{\gamma}_{rr} (1.0-1.5) vs. x over -2.5 ... 2.5]

Evolution of the lapse function

[Plot: \alpha (0-1.0) vs. x over -2.5 ... 2.5]
Performance results for double precision

Time per time step

[Plot: time in ms (0-6) vs. points per dimension (10-90) for BAM and the GPU code]

Flops/s for the GPU version

[Plot: GFlops/s (0-10) vs. points per dimension (10-90), including and without memory copy]
Conclusion

- GPUs do speed up applications, but is it fast enough?
- Memory requirements limit the grid sizes of the BSSN code.
- There are still many possible optimizations.

Further work

- Port the code to Nvidia's new Fermi architecture.
- Optimize the kernels of the BSSN GPU code.
- Generalize the code to run on multiple GPUs.
- Implement adaptive mesh refinement.
- OpenCL implementation.
- Use spectral methods to compute the right hand side.