Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
GPU Computing for Numerical Relativity
SFB/TR7 Video Seminar
Andreas Weyhausen
B.Brugmann,J.Grigsby
TheoretischPhysikalisches Institut Jena
SFB TR/7
15.11.2010
1/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Contents
1
Physical Motivation
2
GPU Computing
3
Finite dierences on the GPU
4
BSSN in 3D
5
Conclusion and further work
2/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Numerical relativity
Uses numerical methods to solve Einstein's equations.
This gives the possibility to study physical properties of
astrophysically interesting spacetimes.
In particular,the computation of gravitational waves.
The 3+1 split of the spacetime
Foliate spacetime in spatial hypersurfaces.
The slices are described by the spatial metric
ij
and the
extrinsic curvature K
ij
.
This allows us to rewrite Einstein's eld equations as a set of
evolution equations and constraints.
3/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
The ADM equations (Arnowitt,Deser and Misner (1962))
@
t
ij
=2K
ij
+D
i
j
+D
j
i
@
t
K
ij
=D
i
D
j
+(R
ij
2K
ik
K
k
j
+KK
ij
)
+
k
D
k
K
ij
+K
ik
D
j
k
+K
kj
D
i
k
H =R+K
2
K
ij
K
ij
= 0
M
i
=D
j
(K
ij
ij
K) = 0
Implemented on GPU by Burkhard Zink (2008)
Gauge choice
Lapse function and shift vector
i
represent part of the
freedom of chosing coordinates
Gauge used in this work:
1 +log slicing condition:(@
t
j
@
j
) = 2K
driver:@
0
i
=
3
4
B
i
;@
0
B
i
= @
0
i
B
i
:
4/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
The BSSN equations (Baumgarte,Shapiro,Shibata and Nakamura)
ADM equations don't form a strongly hyperbolic evolution
system and therefore don't lead to numerically stable
simulations.
BSSN formulation:
Transform K
ij
:K
ij
= e
4
~
A
ij
+
1
3
ij
K
Transform
ij
:
ij
= e
4
ij
New variables
i
=
ij
i
ij
BSSN evolution variables:
ij
;
~
A
ij
;K;;
i
BSSN equations together with 1 +log slicing and driver
condition lead to a strongly hyperbolic evolution system which
allows for long time stable numerical simulations.
5/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
The BSSN equations (Baumgarte,Shapiro,Shibata and Nakamura)
@
t
=
1
6
K +
i
@
i
+
1
6
@
i
i
@
t
ij
=2
~
A
ij
+
k
@
k
ij
+
ik
@
j
k
+
kj
@
i
k
2
3
ij
@
k
k
@
t
~
A
ij
= e
4
((D
i
D
j
)
TF
+R
TF
ij
) +(K
~
A
ij
2
~
A
il
~
A
l
j
)
+
k
@
k
~
A
ij
+
~
A
ik
@
j
k
+
~
A
kj
@
i
k
2
3
~
A
ij
@
k
k
@
t
K =
ij
D
j
D
i
+(
~
A
i
j
~
A
ij
+
1
3
K
2
) +
i
@
i
K
@
t
i
=2
~
A
ij
@
j
+2(
i
jk
~
A
kj
2
3
ij
@
j
K 6
~
A
ij
@
j
)
+
j
@
j
i
j
@
j
i
+
2
3
i
@
j
j
+
1
3
li
@
l
@
j
j
+
lj
@
j
@
l
i
Solving these equations in four dimensions is numerically
expensive.Per time step several thousands of Flops per data
point have to be computed.
6/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
GPUs (Graphics processing units)
Have developed from special purpose devices to powerful
general highly parallel devices.
The peak Flops/s performance outnumbers those of CPUs
by a factor of 100.
7/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Architecture of a modern GPU
RAM
Core
Frame buﬀer
Constant memory
Scheduler
SP
SP
SP
SP
SP
SP
SP
SP
Dataparallel
Cache
Double precision
unit
Registers
SP
SP
SP
SP
SP
SP
SP
SP
Dataparallel
Cache
Double precision
unit
Registers
SP
SP
SP
SP
SP
SP
SP
SP
Dataparallel
Cache
Double precision
unit
Registers
SMP SMP SMP
A GPU consists of many streaming processors (SP) which
are grouped to streaming multiprocessors (SMP).
There is global memory available,each SMP has a local fast
shared memory.
The scheduler assigns work to the SMP.
The GPU is connected to the CPU by a data bus.
8/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Programming a graphics card
Until ve years ago,special graphics programming frameworks
have to be used.
Today several programming frameworks allow general
programming of GPUs.
Nvidia's CUDA allows to program Nvidia GPUs.
OpenCL provides a standardized framework for several
dierent parallel devices.
Still,programming for a GPU is more lowlevel than
programming for a CPU.
9/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
The CUDA framework
Programming in C with some extensions.
A CUDA program consists of host code which is executed on
the CPU and device code which is executed on the GPU.
Kernel functions
global
void mykernel( args ) f...g
are called from the host to run on the GPU in a set of parallel
instances,the threads.
The threads are grouped into blocks.
The blocks are grouped to a thread grid.
10/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
CUDA's thread grid structure
Grid of blocks
(0,0)
(0,1)
(0,2)
(0,3)
(1,0)
(1,1)
(1,2)
(1,3)
(2,0)
(2,1)
(2,2)
(2,3)
Block of threads
(0,0)
(0,1)
(0,2)
(0,3)
(1,0)
(1,1)
(1,2)
(1,3)
(2,0)
(2,1)
(2,2)
(2,3)
(3,0)
(3,1)
(3,2)
(3,3)
Thread
blockIdx
(2,2)
threadIdx
(1,3)
Global
Memory
Constant
Memory
Shared Memory
Registers
Local
Memory
Each thread and each block has a unique ID which is used
for mapping data to the grid.
The thread blocks are one to three dimensional,the grid
one to two dimensional.
All threads have read/write access to the global memory.
All threads within a block have access to the shared memory.
Threads within one block can synchronize,threads of
dierent blocks have to be independent.
11/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
CUDA's execution model
Host Device
Memory
copy
Memory
copy
Kernel execution
Kernel execution
Serial code
ThreadGrid 1
ThreadGrid 2
RAM
Global
Memory
RAM
Global
Memory
Kernel calls are
asynchronous.
Only one kernel grid per
time.
Assignment of blocks to the
GPU
SMP SMP
(0,0)
(1,3)
(0,2)
(1,1)
(1,0)
(0,1)
(0,3)
(1,2)
SMP SMP SMP SMP
(1,2)
(0,0)
(0,3)
(1,3)
(0,2)
(1,0)
(0,1)
(1,1)
Scheduler Scheduler
Grid of blocks
(1,0)
(1,1)
(1,2)
(1,3)
(0,0)
(0,1)
(0,2)
(0,3)
Time
The scheduler assigns
the blocks to the SMP in
no particular order.
12/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
The scalar wave equation
Discretize scalar eld in space and time.
Evolve according to the 2D wave equation using the Leapfrog
scheme.
f
n+1
ij
= 2(1 2
2
)f
n
ij
f
n1
ij
+
2
f
n
i+1j
+f
n
i1j
+f
n
ij+1
+f
n
ij1
On CPU
i
j
Data point Ghost point
Update eld with a loop over
all grid points.
On GPU
Thread block
Thread grid
i
j
Kernel function exectued in
grid updates eld.
13/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Use shared memory as a cache
For nite dierences,each data point is accessed several
times.
Global memory is not cached.Use shared memory as a cache.
Global memory Shared memory
Which threads loads the shared memory ghost points?
14/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Three dierent ways of loading shared memory
Global memory Shared memory
Include ghost threads in
thread block.(Version
w2dv2)
Boundary threads load
ghost points.(Version
w2dv3)
Threads of rst and last
row load ghost points.
(Version w2dv4)
15/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Performance results
Single precision
0
10
20
30
40
50
60
0 200 400 600 800 1000 1200 1400
w2dv1
w2dv2
w2dv3
w2dv4
GFlops/s
Points per Dimension
Double precision
0
10
20
30
0 200 400 600 800 1000 1200 1400
w2dv1
w2dv2
w2dv3
w2dv4
GFlops/s
Points per Dimension
16/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Speedup compared to CPU code
Single precision
0
10
20
30
40
50
60
70
80
90
100
0 200 400 600 800 1000 1200 1400
w2dv1
w2dv2
w2dv3
w2dv4
Speedupfactor
Points per Dimension
Double precision
0
10
20
30
40
0 200 400 600 800 1000 1200 1400
w2dv1
w2dv2
w2dv3
w2dv4
Speedupfactor
Points per Dimension
17/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Performance results
Including memory transfer
0
2
4
6
8
10
12
14
0 200 400 600 800 1000 1200 1400
w2dv1
w2dv2
w2dv3
w2dv4
GFlops/s
Points per Dimension
single precision
double precision
OpenCL implementation
0
5
10
15
20
25
30
35
40
45
50
0 200 400 600 800 1000 1200 1400
GFlops/s
Points per Dimension
18/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
3D nite dierences
CUDA thread grid can only be one or two dimensional.
How to map the three dimensional data on a lower
dimensional thread grid?
Method 1:Map 3D grid to 1D grid (w3dv1)
19/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Method 2:Launch grid several times (w3dv2)
z
x
shared memory
ghost zone
2d grid with
3d blocks
kernel call#1 kernel call#2 kernel call#3
Method 3:Threads iterate through data grid (w3dv3)
x
z
k+1
k
k1
compute z=k shift and read memory compute z=k+1
Register
Shared
memory
new read
shift
20/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Double precision results
0
10
20
30
40
0 20 40 60 80 100 120 140 160 180 200
w3dv1
w3dv2
w3dv3
GFlops/s
Points per Dimension
Speedup to CPU version
0
10
20
30
40
0 20 40 60 80 100 120 140 160 180 200
w3dv1
w3dv2
w3dv3
Speedup
Points per Dimension
21/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Solving the BSSN equations on the GPU
Implementation is based on the BAM code.
Solving the BSSN equations using the method of lines:
The spatial RHS is discretized with second order nite
dierences.
A fourth order RungeKutta scheme is used for time
integration
Evolution on a uni grid,no adaptive mesh renement.
Sommerfeld boundary conditions are used.
22/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Basic code structure
src/
bssnCuda/
InitialData/
Output/
AlgCon/
Boundaries/
bssnRHS/
Update/
Structure of right hand side
Evolution Equations/
Advection/
Christoﬀels/
ConformalFactor/
CovariantDerivatives/
Derivatives/
Divergence/
ExtrinsicCurvature/
InverseMetric/
RHS/
Ricci/
23/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
The GPU code uses 4 dierent kinds of kernels
Derivative kernels compute all rst and second derivatives.
Advection kernels compute advection derivatives @
i
i
X.
Algebraic kernels compute algebraic expressions.
Boundary kernels ll ghost zones according Sommerfeld
boundary condition.
Update kernels Add two data elds and multiply data eld
with scalar.
24/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Evolution of a Schwarzschild blackhole
Evolution of the radial metric
component
1
1.1
1.2
1.3
1.4
1.5
−2.5 −1.5 −0.5 0.5 1.5 2.5
x
˜γ
rr
Evolution of the lapse function
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
−2.5 −1.5 −0.5 0.5 1.5 2.5
x
α
25/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Performance results for double precision.
Time per time step
0
1
2
3
4
5
6
10 30 50 70 90
BAM
GPU code
Timeinms
Points per Dimension
Flops/s for GPU version
0
1
2
3
4
5
6
7
8
9
10
10 30 50 70 90
including memory copy
without memory copy
GFlops/s
Points per Dimension
26/27
Physical Motivation GPU Computing Finite dierences on the GPU BSSN in 3D Conclusion and further work
Conclusion
GPUs do speedup applications,but fast enough?
Memory needs limit grid sizes of BSSN code.
There are still many optimizations possible.
Further work
Port the code the Nvidias new fermi architecture.
Optimize kernels of BSSN GPU code.
Generalize code to run on muliple GPUs.
Implement adaptive mesh renement.
OpenCL implementation.
Use spectral methods to compute the right hand side.
27/27
Enter the password to open this PDF file:
File name:

File size:

Title:

Author:

Subject:

Keywords:

Creation Date:

Modification Date:

Creator:

PDF Producer:

PDF Version:

Page Count:

Preparing document for printing…
0%
Comments 0
Log in to post a comment