Mesh-free numerical methods for large-scale engineering problems. Basics of GPU-based computing.



Mesh-free numerical methods for large-scale engineering problems.
Basics of GPU-based computing.



D. Stevens, M. Lees, P. Orsini

Supervisors: Prof. Henry Power, Herve Morvan

January 2008


Outline:


Meshless numerical methods using Radial Basis Functions (RBFs)


Basic RBF interpolation


Brief overview of the work done in our research group, under the direction
of Prof. Henry Power


Example results


Large scale future problems


GPU computing


Introduction to the principle of using graphics processors (GPUs) as floating-point co-processors


Current state of GPU hardware and software


Basic implementation strategy for numerical simulations


GABARDINE - EU project


GABARDINE:


Groundwater Artificial recharge Based on Alternative sources of
wateR: aDvanced INtegrated technologies and managEment



Aims to investigate the viability of artificial recharge in groundwater
aquifers, and produce decision support mechanisms


Partners include: Portugal, Germany, Greece, Spain, Belgium, Israel, Palestine


University of Nottingham is developing numerical methods to handle:


Phreatic aquifers (with associated moving surfaces)


Transport processes


Unsaturated zone problems (with nonlinear governing equations)

RBF Interpolation Methods









An interpolation to data values at the N points x_j is built from a radial basis function Φ plus a polynomial term:

$$u(x) = \sum_{j=1}^{N} \lambda_j\, \Phi\big(\|x - x_j\|\big) + P_m(x)$$

Apply the above at each of the N test locations:



$$u(x_i) = \sum_{j=1}^{N} \lambda_j\, \Phi\big(\|x_i - x_j\|\big) = f_i, \qquad i = 1, \dots, N$$

giving N equations for the N unknown weights λ_j.

Typical choices of basis function include the thin-plate spline and the generalised multiquadric:

$$\Phi(r) = r^{2m} \ln(r), \qquad \Phi(r) = \left(r^2 + c^2\right)^{m/2}$$

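To make the interpolation concrete, the sketch below assembles and solves the dense collocation system for a 1D multiquadric interpolant. This is an illustrative example rather than code from the talk: the polynomial term P_m is omitted, and the data, point count, and shape parameter c are all assumed.

```c
/* A minimal sketch of basic RBF interpolation in 1D with the
 * multiquadric basis phi(r) = sqrt(r^2 + c^2), solving the dense
 * collocation system by naive Gaussian elimination. Data, point
 * count, and shape parameter are assumed for illustration. */
#include <math.h>
#include <stdio.h>

#define N 10
static const double PI = 3.14159265358979;
static const double c  = 0.5;              /* assumed shape parameter */

static double phi(double r) { return sqrt(r * r + c * c); }

int main(void) {
    double x[N], f[N], lambda[N], A[N][N];

    /* Sample data: f(x) = sin(pi x) at N points on [0,1] (illustrative) */
    for (int i = 0; i < N; ++i) {
        x[i] = (double)i / (N - 1);
        f[i] = sin(PI * x[i]);
    }

    /* Collocation matrix A_ij = phi(|x_i - x_j|): dense and symmetric */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[i][j] = phi(fabs(x[i] - x[j]));

    /* Gaussian elimination with partial pivoting on [A | f] */
    for (int k = 0; k < N; ++k) {
        int p = k;
        for (int i = k + 1; i < N; ++i)
            if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
        for (int j = 0; j < N; ++j) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        double t = f[k]; f[k] = f[p]; f[p] = t;
        for (int i = k + 1; i < N; ++i) {
            double m = A[i][k] / A[k][k];
            for (int j = k; j < N; ++j) A[i][j] -= m * A[k][j];
            f[i] -= m * f[k];
        }
    }
    for (int i = N - 1; i >= 0; --i) {      /* back substitution */
        lambda[i] = f[i];
        for (int j = i + 1; j < N; ++j) lambda[i] -= A[i][j] * lambda[j];
        lambda[i] /= A[i][i];
    }

    /* Continuous reconstruction: u(s) = sum_j lambda_j phi(|s - x_j|) */
    double s = 0.37, u = 0.0;
    for (int j = 0; j < N; ++j) u += lambda[j] * phi(fabs(s - x[j]));
    printf("u(%.2f) = %f, exact = %f\n", s, u, sin(PI * s));
    return 0;
}
```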
Kansa’s Method

The solution of a boundary value problem

$$L[u(x)] = f(x) \;\text{in}\; \Omega, \qquad B[u(x)] = g(x) \;\text{on}\; \Gamma$$

(e.g. $L = \nabla^2$, the diffusion operator) is expanded in the RBF basis:

$$u(x) = \sum_{j=1}^{N} \lambda_j\, \Phi\big(\|x - x_j\|\big)$$

Collocating at the interior and boundary points gives the system:

$$\begin{bmatrix} L[\Phi]_{ij} \\ B[\Phi]_{ij} \end{bmatrix} \lambda_j = \begin{bmatrix} f_i \\ g_i \end{bmatrix}$$
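For the diffusion operator, the interior-row entries L[Φ]_ij have a closed form obtained by differentiating the multiquadric twice; a minimal sketch (shape parameter assumed):

```c
/* Sketch: closed-form entries for Kansa collocation with L = the 2D
 * Laplacian applied to the multiquadric basis. Interior rows of the
 * system use lap_phi(r_ij); Dirichlet boundary rows use phi(r_ij).
 * The shape parameter c is assumed. */
#include <math.h>

static const double c = 0.5;   /* assumed shape parameter */

/* phi(r) = sqrt(r^2 + c^2) */
double phi(double r) { return sqrt(r * r + c * c); }

/* Differentiating twice: in 2D, lap phi = phi'' + phi'/r, which
 * simplifies to (r^2 + 2c^2) / (r^2 + c^2)^(3/2). */
double lap_phi(double r) {
    double q = r * r + c * c;
    return (r * r + 2.0 * c * c) / (q * sqrt(q));
}
```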

Hermitian Method

$$u(x) = \sum_{j=1}^{NB} \lambda_j\, B_{\xi}\big[\Phi(\|x - x_j\|)\big] + \sum_{j=NB+1}^{N} \lambda_j\, L_{\xi}\big[\Phi(\|x - x_j\|)\big]$$

for the boundary value problem

$$L[u(x)] = f(x) \;\text{in}\; \Omega, \qquad B[u(x)] = g(x) \;\text{on}\; \Gamma$$

Collocating:

$$\begin{bmatrix} B_x B_{\xi}[\Phi]_{ij} & B_x L_{\xi}[\Phi]_{ij} \\ L_x B_{\xi}[\Phi]_{ij} & L_x L_{\xi}[\Phi]_{ij} \end{bmatrix} \lambda_j = \begin{bmatrix} g_i \\ f_i \end{bmatrix}$$

Operators are also applied to the basis functions: SYMMETRIC SYSTEM

RBF - Features and Drawbacks


Partial derivatives are obtained cheaply and accurately by
differentiating the (known) basis functions


Leads to a highly flexible formulation allowing boundary operators to
be implemented exactly and directly


Once solution weights are obtained, a continuous solution can be
reconstructed over the entire solution domain


A densely populated linear system must be solved to obtain the
solution weights


Leads to high computational cost, O(N^2), and numerical conditioning issues with large systems, setting a practical limit of ~1000 points

Formulation: LHI method

$$u(x) = \sum_{j=1}^{NB} \lambda_j\, B_{\xi}\big[\Phi(\|x - x_j\|)\big] + \sum_{j=NB+1}^{N} \lambda_j\, \Phi\big(\|x - x_j\|\big) + P_m(x)$$

$$u(x) = f(x) \;\text{in}\; \Omega, \qquad B[u(x)] = g(x) \;\text{on}\; \Gamma$$

Hermitian interpolation using solution values, and boundary operator(s) if present

Initial value problem:

$$\frac{\partial u}{\partial t} = Lu$$
LHI method formulation cont…

Form N systems, based on a local support:

$$H^{(k)} \lambda^{(k)} = d^{(k)}$$

$$H^{(k)} = \begin{bmatrix} B_x B_{\xi}[\Phi] & B_x[P_m] \\ B_{\xi}[P_m]^T & 0 \end{bmatrix}, \qquad d^{(k)} = \begin{bmatrix} u_i \;\text{(interior)} \;/\; g(x_i) \;\text{(boundary)} \\ 0 \end{bmatrix}$$

Hence, the solution can be reconstructed in the vicinity of local system k via:

$$u^{(k)}(x) = R(x)\, \lambda^{(k)}, \qquad \lambda^{(k)} = \big(H^{(k)}\big)^{-1} d^{(k)}$$

$$R(x) = \Big( B_{\xi}\big[\Phi(\|x - x_1\|)\big], \;\dots,\; p_1(x), \;\dots,\; p_m(x) \Big)$$

where: H ~ 10 x 10 matrix
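A practical step the slide leaves implicit is gathering the local support around each node before forming H^(k); a brute-force nearest-neighbour sketch, with all names assumed and the support size taken from the ~10 x 10 figure above:

```c
/* Sketch: choosing the local support for node k by brute-force
 * nearest-neighbour search. All names are assumed; the support size
 * matches the ~10 x 10 local systems quoted above. O(npts * S) per node. */
#define S 10   /* local support size */

/* Fill supp[] with the indices of the S nodes nearest to node k. */
void local_support(const double x[][3], int npts, int k, int supp[S]) {
    double best[S];                         /* squared distances, ascending */
    for (int m = 0; m < S; ++m) { supp[m] = -1; best[m] = 1e300; }
    for (int i = 0; i < npts; ++i) {
        double dx = x[i][0] - x[k][0];
        double dy = x[i][1] - x[k][1];
        double dz = x[i][2] - x[k][2];
        double d2 = dx * dx + dy * dy + dz * dz;
        if (d2 >= best[S - 1]) continue;    /* not among the S closest */
        int m = S - 1;                      /* insert into the sorted list */
        while (m > 0 && best[m - 1] > d2) {
            best[m] = best[m - 1]; supp[m] = supp[m - 1]; --m;
        }
        best[m] = d2; supp[m] = i;
    }
}
```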

CV-RBF (Modified CV Scheme)

Classic CV approach: polynomial interpolation to compute the flux

CV-RBF approach: apply the internal operator, the Dirichlet operator, and the boundary operator, with RBF interpolation to compute the flux

$$L[u(x)] = f(x) \;\text{in}\; \Omega, \qquad B[u(x)] = g(x) \;\text{on}\; \Gamma$$

Simulation Workflow

Dataset Generation → Pre-Processing → Simulation → Post-Processing

Dataset generation: CAD, GridGen

Pre-processing: Triangle (meshless, RBF-specific)

Simulation: RBF and CV-RBF (our meshless codes)

Post-processing: TecPlot

Convection-Diffusion: Validation

Both methods have been validated against a series of 1D and 3D linear and nonlinear advection-diffusion-reaction problems, e.g.:

[Figures: analytical vs. calculated profiles for Pe = 50 (LHIM, c = 0.12), Pe = 100 (LHIM, c = 0.08), Pe = 200 (LHIM, c = 0.03), and k = 120 (c = 0.04)]
CV-RBF: Infiltration well + Pumping

Domain: L_x = 75 m, L_y = 31 m, L_z = 50 m

Well diameter: d = 3 m

Soil properties: K_x = 1.0 m/h, K_z = 0.3 m/h, K_y = 0.2 m/h

Pumping location: 25 m from the infiltration well (height y = 15 m)

Infiltration/pumping rate: I = P = qA = 20 m^3/h

CV-RBF: Infiltration well + Pumping

3D model: mesh (60,000 cells) and boundary conditions

Boundary conditions: fixed head h = 31 m, with zero normal flux (∂h/∂n = 0) everywhere else

CV-RBF: Infiltration well + Pumping

Piezometric head contours and streamlines at t = 30 h (planes at z = 25 m and y = 29 m; length scale: 100 m)

Maximum displacements: Δh_well ≈ 0.2 m, Δh_pump ≈ 0.85 m

LHI: Infiltration model - Setup

Infiltrate over a 10 m x 10 m region at the ground surface

Infiltration pressure = 2 m

Zero-flux boundary at base (‘solid wall’)

Fixed pressure distribution at sides

Initial pressure:

$$\psi_0 = \begin{cases} (8.0 - z)\,\text{m} & \text{saturated zone} \\ -1.0\,\text{m} & \text{unsaturated zone} \end{cases}$$

Solving Richards’ equation:

$$\big(S_0\, S_p(\psi) + \theta'(\psi)\big)\, \frac{\partial \psi}{\partial t} = \frac{\partial}{\partial x_i}\left[K_{ij}(\psi)\left(\frac{\partial \psi}{\partial x_j} + \frac{\partial z}{\partial x_j}\right)\right]$$
LHI: Infiltration model - Soil properties

Using the Van Genuchten soil representation:

$$K(h) = \frac{K_s}{\big[1 + (h / h_g)^2\big]^{3/2}}, \qquad \theta(h) = \theta_r + \frac{\theta_s - \theta_r}{\big[1 + (h / h_g)^2\big]^{1/2}}$$

with θ_r = 0.15, θ_s = 0.45

Saturated conductivity: K_s = (1.0, 1.0, 0.2) m/h

Storativity: S_0 = 10^-4 m^-1

[Figures: relative hydraulic conductivity and water content vs. water pressure (m), over -1.2 m ≤ h ≤ 0]
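A small sketch of these two retention curves as C functions, using the slide’s θ_r and θ_s; the capillary scale h_g is not given on the slide, so its value here is an assumption:

```c
/* Sketch: the soil retention curves above as plain C functions,
 * using the slide's parameters (theta_r = 0.15, theta_s = 0.45).
 * The capillary length scale h_g is not stated, so its value is an
 * assumption for illustration only. */
#include <math.h>

static const double theta_r = 0.15;   /* residual water content */
static const double theta_s = 0.45;   /* saturated water content */
static const double h_g     = -0.4;   /* assumed capillary scale (m) */

/* Relative conductivity K/Ks = 1 / (1 + (h/h_g)^2)^(3/2) */
double K_rel(double h) {
    double s = 1.0 + (h / h_g) * (h / h_g);
    return 1.0 / (s * sqrt(s));
}

/* Water content theta(h) = theta_r + (theta_s - theta_r) / sqrt(1 + (h/h_g)^2) */
double theta(double h) {
    double s = 1.0 + (h / h_g) * (h / h_g);
    return theta_r + (theta_s - theta_r) / sqrt(s);
}
```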
LHI: Infiltration model - Results


11,585 points arranged in 17 layers


‘Short’ runs: solution to 48 hours


‘Long’ run: solution to 160 hours

Richards’ equation - First example

Using the steady-state example given in Tracy (2006)* for the solution of Richards’ equation, with:

$$K = K_s\, e^{\alpha h}, \qquad \theta(h) = \theta_r + (\theta_s - \theta_r)\, e^{\alpha h}$$

On the domain:

$$0 \le x \le 15.24\,\text{m}, \qquad 0 \le y \le 15.24\,\text{m}, \qquad 0 \le z \le 15.24\,\text{m}$$

with S_0 = S_p = 0, and:

$$h = \frac{1}{\alpha} \ln\!\left[e^{\alpha h_r} + \big(1 - e^{\alpha h_r}\big) \sin\!\left(\frac{\pi x}{15.24}\right) \sin\!\left(\frac{\pi y}{15.24}\right)\right] \quad \text{(top face)}$$

$$h = h_r \quad \text{(all other faces)}$$

* F.T. Tracy, Clean two- and three-dimensional analytical solutions of Richards’ equation for testing numerical solvers
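A minimal sketch of the top-face boundary condition above as a C function; α follows the values used on the following slides, while h_r is not stated here and is therefore an assumed placeholder:

```c
/* Sketch: Tracy's top-face boundary condition evaluated in C.
 * alpha matches one of the slide's values; h_r is an assumption. */
#include <math.h>

#define PI 3.14159265358979323846
static const double a     = 15.24;   /* domain edge length (m) */
static const double alpha = 0.164;   /* one of the slide's alpha values */
static const double h_r   = -15.24;  /* assumed reference pressure head (m) */

/* h(x, y) on the top face */
double h_top(double x, double y) {
    double e = exp(alpha * h_r);
    return log(e + (1.0 - e) * sin(PI * x / a) * sin(PI * y / a)) / alpha;
}
```

At the face edges (x = 0 or y = 0) the formula reduces to h = h_r, so the top-face condition joins the side faces continuously.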
Richards’ equation - First example

With (11 x 11 x 11) and (21 x 21 x 21) uniformly spaced points:

[Figures: hydraulic head (m) vs. z (m), analytical vs. calculated, for α = 0.164 and α = 0.328, each at N = 11 and N = 21 points per side]
Richards’ equation - First example (error analysis)

                Points          L2 error norm   Max error   FV max error   Improvement factor
Alpha = 0.164   11 x 11 x 11    1.72E-02        3.16E-02
                21 x 21 x 21    3.52E-03        7.09E-03    6.62E-02       9.34
Alpha = 0.328   11 x 11 x 11    1.61E-02        3.20E-02
                21 x 21 x 21    4.24E-03        8.57E-03    1.11           129.5
Alpha = 0.492   11 x 11 x 11    1.77E-02        2.43E-02
                21 x 21 x 21    5.16E-03        7.26E-03    5.13           706.6

(FV max error is the finite volume result; the improvement factor is the FV max error divided by the meshless max error on the 21 x 21 x 21 grid.)


Good improvement over the finite volume results from the Tracy paper, particularly with rapidly decaying K and θ functions

Reasonable convergence rate with increasing point density

Future work

Future work will focus on large-scale problems, including:

Regional-scale models of real-world experiments in Portugal and Greece

Country-scale models of aquifer pumping and seawater intrusion across Israel and Gaza

The large problem size will require a parallel implementation for efficient solution - hence our interest in HPC and GPUs

To our knowledge, we are the only group working on large-scale hydrology problems using meshless numerical techniques

Practical implementation will require the parallelisation of large, sparsely populated, iterative matrix solvers

GPU Computing:

GPU: Graphics Processing Unit

Originally designed to accelerate floating-point-heavy calculations in computer games, e.g.:

Pixel shader effects (lighting, reflection/refraction, other effects)

Geometry setup (character meshes etc.)

Solid-body physics (…not yet widely adopted)

Massively parallel architecture - currently up to 128 floating-point processing units

Recent hardware (from Nov 2006) and software (Feb 2007) advances have allowed programmable processing units (rather than units specialised for pixel or vertex processing)

Has led to "General Purpose GPUs" - ‘GPGPUs’

GPU Computing:

GPUs are extremely efficient at handling add-multiply instructions in small ‘packets’ (usually the main computational cost in numerical simulations)

FP capacity outstrips CPUs, in both theoretical capacity and efficiency (if properly implemented)

GPU Computing:

Modern GPUs effectively work like a shared-memory cluster:

GPUs have extremely fast (~1000 MHz vs ~400 MHz), dedicated onboard memory

Onboard memory sizes currently range up to 1.5 GB (in addition to system memory)

CUDA - BLAS and FFT libraries

The CUDA toolkit is a C-language development environment for CUDA-enabled GPUs.

Two libraries are implemented on top of CUDA:

Basic Linear Algebra Subprograms (BLAS)

Fast Fourier Transform (FFT)

Pre-parallelised routines.
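A minimal sketch of offloading a dense multiply C = A·B through the BLAS library. It is shown with the later handle-based CUBLAS interface rather than the CUDA 1.x calling convention of the time, but the workflow (copy in, sgemm, copy out) is the same:

```c
/* Sketch: dense matrix multiply C = A * B via CUBLAS (modern
 * handle-based API; sizes and data are illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 512;                      /* illustrative size */
    size_t bytes = (size_t)n * n * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes),
          *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    /* Copy inputs to the GPU's dedicated onboard memory */
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    /* Column-major SGEMM: C = alpha*A*B + beta*C */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expect %f)\n", hC[0], 2.0f * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```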

Available Examples/Demos:

Parallel bitonic sort

Matrix multiplication

Matrix transpose

Performance profiling using timers

Parallel prefix sum (scan) of large arrays

Image convolution

1D DWT using Haar wavelet

Graphics interoperation examples

CUDA BLAS and FFT examples

CPU-GPU C and C++ code integration

Binomial Option Pricing

Black-Scholes Option Pricing

Monte-Carlo Option Pricing

Parallel Mersenne Twister

Parallel Histogram

Image Denoising

Sobel Edge Detection Filter

TESLA - GPUs for HPC

Deskside (2 GPUs, 3 GB) and rackmount (4 GPUs, 6 GB) options

With ~500 Gflops per GPU, that is ~2 Teraflops per rack

Deskside is listed at around $4200

GPU Computing - Some results

Example: multiplication of densely populated matrices - an O(N^3) algorithm…

GPU specs:

GPU model   Processors   GPU clock speed   Memory size   Memory bus   Cost
8600GTS     32           675 MHz           512 MB        256-bit      ~£75
8800GT      112          600 MHz           512 MB        256-bit      ~£150
8800GTX     128          575 MHz           768 MB        320-bit      ~£250


Matrices are broken down into vector portions, and sent to the GPU stream processors

Use CUDA - nVidia’s C-based API for GPUs
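A minimal CUDA sketch of the kernel side of this strategy: one thread per output element, each computing a row-column dot product. This is the naive, untiled version for illustration; a tuned version would stage sub-blocks through shared memory. All names are illustrative:

```c
/* Sketch: naive CUDA kernel for dense matrix multiplication C = A * B,
 * one thread per output element (row-major n x n matrices). */
#include <cuda_runtime.h>

__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)      /* dot product of row and column */
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

/* Launch with a grid of 16x16 thread blocks covering the n x n output */
void run_matmul(const float *dA, const float *dB, float *dC, int n) {
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmul<<<grid, block>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();             /* wait for the kernel to finish */
}
```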

Various CPUs vs GPU

Note: performance of dual-core and quad-core CPUs is approximated from an idealised parallelisation (i.e. 100% efficiency)

[Figure: improvement factor of the 8800GTX over various CPUs vs. matrix size (column/row size up to ~6000), with factors up to ~200. CPUs compared: Pentium 4 3 GHz; E2160 (1.86 GHz dual) - £55; E6850 (3.0 GHz dual) - £190; QX9650 (3 GHz quad) - £690]
More GPU propaganda:

Good anecdotal evidence for improvement in real-world simulations is available from those who have switched to GPU computing:

Dr. John Stone, Beckman Institute, NAMD virus simulations:

110 CPU hours on an SGI Itanium supercomputer => 47 minutes with a single GPU

Represents a 240-fold speedup

Dr. Graham Pullan, University of Cambridge, CFD with turbine blades (LES and RANS models):

40x absolute speedup switching from a CPU cluster to ‘a few’ GPUs

Uses 10 million cells on the GPU, up from 500,000 on the CPU cluster

Closing words

GPU performance is advancing at a much faster rate than CPU performance. This is expected to continue for some time yet

With CUDA and its BLAS library, exploiting the parallelism of GPUs is in some cases easier than traditional MPI approaches

Later this year:

2-3 times the performance of current hardware (over 1 TFLOP per card)

Native 64-bit capability

More info:

www.nvidia.com/tesla

www.nvidia.com/cuda

www.GPGPU.org