High performance computing
for a family of smooth
trajectories using parallel
environments
Gianluca Argentini
gianluca.argentini@riellogroup.com
Bologna,
March 23

26, 2004
Advanced Computing Laboratory
The
company
:
•
products for heating and conditioning
•
development and production of residential and
industrial burners
•
presence of a Center of Excellence for the study of
combustion and flame processes
•
R&D Department, extensive CAD (Catia from IBM

Dassault Systemes) and FLUENT computations
Introduction

1
1
Introduction

2
Industrial and power
burners
have
particular requirements:
•
customized study of combustion
head
•
study of accurate geometry of
combustion chamber (shape of the
flame, flow of gas or oil and oxygen)
•
ventilation and air circulation fans
for a correct oxygen supply, right
pressurization and continous cooling
•
reduction of vibrations and noise
2
Introduction

3
Rapid prototyping for optimal shape of combustion head
and combustion chamber involves Computational Fluid
Dynamics:
•
tracing of air or gas flow particles streamlines
•
shape of the flow in a generic geometry
3
High graphic resolution requires a large amount of
particles paths:
•
strong computational memory

expensive and cpu

based effort
•
distribution of paths on a multiprocessor environment
The problem
Focus on numerical simulation of flows (in combustion
head, chamber or in fans mechanism)
The large numerical output of simulation is generated
by Navier

Stokes (use of
FLUENT
package) or
Cellular Automaton models (
MATLAB
package)
From data, we would obtain:
•
path tracking of fluid particles, useful for customized
design of combustion heads and chambers
•
smooth 3D visualization of particles trajectories,
possibly with continuous slope and curvature
(analitically: class C
2
)
4
Step 1
. The data obtained from simulation
model are treated by an algorithm for the
computation of algebric curves (cubic
polynomials) associated to particles paths:
•
block

data distribution for parallel computing
•
necessity of continuous reallocation in RAM
Step 2
. Evaluation of polynomials on a large
set of values for fine resolution:
•
very expensive CPU computation
•
sets of curves distribution on processors, no
communication
About problem treatment
Data
Algebric
curves
Massive
Computing
5
Fitting the trajectories
From
simulation
, a single particle trajectory is a set of 3D
points:
•
S
is the number of points
•
M
is the number of trajectories
Interpolation
of the points:
•
Bezier

like is not realistic in case of
twist or divergence of speeds field
•
Chebychev or Least

Squares

like are
too rigid for a customized application
•
polinomial fitting is simple but often
shows spurious effects as Runge

Gibbs
phenomenon
We think a
splines

based technique is more useful
6
The splines

based algorithm
Let
S
=
4
x
N
: path is divided into
four

points groups
For every group the points are interpolated by three
cubic polynomials imposing four analytical conditions:
•
passage at P
k
point, 1
k
3
•
passage at P
k+1
point
•
continuous slope at P
k
point
•
continuous curvature at P
k
point
For smooth rendering and for avoiding excessive
twisting of trajectories, the cubics
u
k
are added to the
Bezier curve
b
associated to the four points:
v
=
a
b
+
b
u
k
0 <
a
,
b
< 1
7
Finding the splines
We have choose
a
=
b
= 0.5
Let
b
= As
3
+ Bs
2
+ Cs + D
(0
s
1) the Bezier curve
of control points P1,…,P4; for every spline
u
k
= at
3
+ bt
2
+ ct + d
(0
t
1)
the coefficients are computed by (2
k
3, for k = 1
the formulas are slightly different but of the same
algebraic form; a, b, c, d are 3

dimensional cartesian
vector)
a = P
k+1

P
k

3B

C

6
b = B + 3 (1)
c = 2B + C + 3
d = P
k
8
A matrix for splines
The system (1) can be represented as
c = T b
(matrix

vector multiplication) where
c
= (a, b, c, d)
b
= (P
k+1
, P
k
, B, C, 1)
1

1

3

1

6
T
= 0 0 1 0 3
0 0 2 1 3
0 1 0 0 0
For every spline, only
the vector
b
is variable;
for a single trajectory, it
must be reassigned in
RAM every group of
two points,
after
the
computation of the
relative Bezier curve.
9
A global matrix for splines
If we define a
global matrix
as
=
T 0 . . . 0
0 T . . . 0
.
0 0 . . 0 T
.
with
0
as 4 x 5 zero

matrix, we
have a 4
M
x 5
M
sparse
matrix
(optimization of memory storage
in MATLAB)
and with
B
= (P
k+1
, P
k
, B
1
, C
1
, 1, . . ., P
k+1
, P
k
, B
M
, C
M
, 1)
we can compute for every two

points group the
coefficients of cubic splines for all the
M
trajectories:
C
=
B
10
Computational complexity analysis
Every four

points group, for the
M
trajectories the flops
(floating point operations) number for computing the
splines coefficients is:
•
for Bezier curves (customized Matlab script):
316
M
•
for
浡m物r

癥捴潲畬瑩v汩捡瑩潮c⡵灰敲獴業慴攩i
324
M
We have
N
groups of four

points at every trajectory: the
total flops number of the
Step 1
is about
640
MN
11
A parallel distribution for splines
With
P
, number of processes, divisor of
M
,
the method
used is the distribution of
M/P
trajectories (rows of
浡m物r⤠瑯癥特⁰牯捥獳㬠湯潭v畮u捡瑩潮c楳湶i汶l搮
The value of
M
is important for the occupation of RAM
at every computational node.
p
1
p
2
p
P
.
.
M
N
linear execution for every process
12
Bezier curves and splines computation on
•
Linux cluster IBM x330, biprocessor Pentium III 1.133
GHz, at CINECA (2003); C routines and MPI (for parallel
startup and data distribution)
•
2 nodes Windows2000 / Linux RedHat IBM x440,
biprocessor Xeon 2.4 GHz Hyper Threading, 2 GB RAM,
at Riello (2003); MATLAB rel. 6.5 scripts (startup of
simultaneous multi

engine)
Computing splines: hardware and software
13
Computing splines: performance results
X440 cluster:
Better performances of
Win2k (linear speedup)

compared with Linux

with Intel HT technology
Beowulf CINECA:
The registered speedup is
quasi

linear; for high value of
P
the amount of data distribution
(
M
variable) among processes
is more intrusive.
14
Post

processing for splines
Now we would a fast method for computing the splines
values in a set of
parameter ticks
with fine sampling.
The CFD packages have some limits in the post

processing
phase:
•
resolution based on pre

processing mesh
•
rigid (when possible) load distribution among available
processors
For good graphic visualization,
the interval between two data

points might be divided in a
suitable number of ticks:
15
Valuating the splines
Let
V
+ 1 the number of ticks for each cubic spline
valuation; then the ticks are
(0, 1/
V
, 2/
V
, . . ., (
V

1)/
V
, 1)
and the values of splines parameter in the computation
are their (0, 1, 2, 3)

th degree powers. The value of a
cubic at
t
0
can be view as a dot product:
at
0
3
+ bt
0
2
+ ct
0
+ d = (a, b, c, d)
(t
0
3
, t
0
2
, t
0
, 1)
Let
the pre

allocable constant
4x(
V
+1) matrix:
0 (1/
V
)
3
. . . . ((
V

1)/
V
)
3
1
0 (1/
V
)
2
. . . . ((
V

1)/
V
)
2
1
0 (1/
V
)
1
. . . . ((
V

1)/
V
)
1
1
1 1 . . . . 1 1
16
An eulerian view
Let
the
M
x 4
matrix (each row a
spline for each
trajectory):
a
1
b
1
c
1
d
1
a
2
b
2
c
2
d
2
a
M
b
M
c
M
d
M
. . . .
Then the
M
x (
V
+1) matrix product
=
contains in each row the values of a cubic between two
data

points, for all the
M
trajectories (
eulerian
method).
For the product, the flops are 21
M
(
V
+1), the number of
matrices
is 3
N
; the total number of flops are
63
NM
(
V
+1).
17
A lagrangian view
Let
the 3
N
x 4
matrix (each row a
spline along one single
trajectory):
a
1
b
1
c
1
d
1
a
2
b
2
c
2
d
2
a
3
N
b
3
N
c
3
N
d
3
N
. . . .
Then the 3
N
x (
V
+1) matrix product
=
contains in each row the values of a cubic between two
data

points, for a single trajectory (
lagrangian
method).
For the product, the flops are 63
N
(
V
+1), the number of
matrices
is
M
; the total number of flops are
63
NM
(
V
+1).
18
Data distribution: eulerian case
With
P
, number of processes, divisor of 3
N
(amount of
two

points groups),
the method used is the distribution of
3
N/P
matrices to every process; no communication
is involved.
The value of
N
is important for the total computation
time,
N
and
M
for the RAM allocation of each process.
3
N
M
. . . . .
CPU
RAM
19
Data distribution: lagrangian case
With
P
, number of processes, divisor of
M
(amount of
trajectories),
the method used is the distribution of
M/P
matrices to every process; no communication is
involved.
The value of
N
is important for the total computation
time,
N
and
M
for the RAM allocation of each process.
3
N
M
CPU
RAM
.
20
Hardware and software
Hardware
: 2 x { IBM x440, 2 Xeon 2.4 GHz HT, 2 GB },
at Riello
Software
: Windows2000 / Linux RH 8.1, MATLAB 6.5,
parallelism of simultaneous Matlab engines
•
for matrix multiplication, Matlab 6.5 uses internal
LAPACK Level 3 BLAS routines (good performances)
•
the
matrix is computed only one time (in case of
uniform and costant sampling interval), its values are
probably always cached during matrices multiplication
21
Performance results
Performances of a single
Matlab process for the
product
with
V
= 100; as
theory, the execution time is
linear on
M
variable.
Performances of multiprocess
products (case 3
N
= 4200
P
); for
P
8, the total computation
time depends on
NM
(
Gustafson
law
), as expected.
22
Performance results: considerations
•
Linear speedup until
P
=8 (= number
of virtual Hyper Threads processors);
for
P
8 reallocations of RAM and
caches have a negative effect
•
For large data sets, the amount of
RAM in the nodes of cluster is a
critical factor, while the CPUs
performances are good with the use
of LAPACK routines
•
First results with a technique using “global M

N” matrices,
an MPI

multithreads version of MATLAB (
Cornell Toolbox
),
and parallel matrix multiplication algorithms, show an
overhead, in case of large data, due to communications
23
Performance results: Hyper Threading
Performance of Intel
Hyper
Threading Technology
of
Xeon processors; the vertical
unit is time execution in the
case of 8 processes
(
M
=5000,
3
N
= 4200
P
); until
8, the time seems to be
quadratic on processes
number.
Similar results have been obtained
•
using Win2k or Linux
•
using
High Performance Linpack
benchmarking
24
Examples
red
= trajectory computation
with
V
= 100;
black
= least
squares method, 3
°
degree
polynomials;
gray
= data

points
from simulation
Thanks
Forced injection of air in combustion
head; the ribbons show some particles
trajectories; data

points from simulation,
paths computation with
V
=100,
M
=5000,
N
=1600,
P
=8; computation and rendering
by Matlab; total computation time 85 secs
Comments 0
Log in to post a comment