High performance computing

rucksackbulgeΤεχνίτη Νοημοσύνη και Ρομποτική

1 Δεκ 2013 (πριν από 3 χρόνια και 6 μήνες)

72 εμφανίσεις

High performance computing
for a family of smooth
trajectories using parallel
environments

Gianluca Argentini

gianluca.argentini@riellogroup.com

Bologna,
March 23
-

26, 2004

Advanced Computing Laboratory

The
company
:




products for heating and conditioning



development and production of residential and
industrial burners



presence of a Center of Excellence for the study of
combustion and flame processes



R&D Department, extensive CAD (Catia from IBM
-
Dassault Systemes) and FLUENT computations

Introduction

-

1

1

Introduction

-

2

Industrial and power
burners

have
particular requirements:



customized study of combustion
head



study of accurate geometry of
combustion chamber (shape of the
flame, flow of gas or oil and oxygen)



ventilation and air circulation fans
for a correct oxygen supply, right
pressurization and continous cooling



reduction of vibrations and noise

2

Introduction

-

3

Rapid prototyping for optimal shape of combustion head
and combustion chamber involves Computational Fluid
Dynamics:



tracing of air or gas flow particles streamlines



shape of the flow in a generic geometry

3

High graphic resolution requires a large amount of
particles paths:



strong computational memory
-
expensive and cpu
-
based effort



distribution of paths on a multiprocessor environment

The problem

Focus on numerical simulation of flows (in combustion
head, chamber or in fans mechanism)

The large numerical output of simulation is generated
by Navier
-
Stokes (use of
FLUENT

package) or
Cellular Automaton models (
MATLAB

package)

From data, we would obtain:



path tracking of fluid particles, useful for customized
design of combustion heads and chambers



smooth 3D visualization of particles trajectories,
possibly with continuous slope and curvature
(analitically: class C
2
)

4

Step 1
. The data obtained from simulation
model are treated by an algorithm for the
computation of algebric curves (cubic
polynomials) associated to particles paths:



block
-
data distribution for parallel computing



necessity of continuous reallocation in RAM

Step 2
. Evaluation of polynomials on a large
set of values for fine resolution:



very expensive CPU computation



sets of curves distribution on processors, no
communication

About problem treatment

Data

Algebric
curves

Massive
Computing

5

Fitting the trajectories

From
simulation
, a single particle trajectory is a set of 3D
points:



S

is the number of points



M

is the number of trajectories

Interpolation

of the points:



Bezier
-
like is not realistic in case of
twist or divergence of speeds field



Chebychev or Least
-
Squares
-
like are
too rigid for a customized application



polinomial fitting is simple but often
shows spurious effects as Runge
-
Gibbs
phenomenon

We think a
splines
-
based technique is more useful

6

The splines
-
based algorithm

Let
S
=

4
x
N

: path is divided into
four
-
points groups

For every group the points are interpolated by three
cubic polynomials imposing four analytical conditions:



passage at P
k

point, 1


k


3



passage at P
k+1

point



continuous slope at P
k

point



continuous curvature at P
k

point

For smooth rendering and for avoiding excessive
twisting of trajectories, the cubics
u
k

are added to the
Bezier curve
b

associated to the four points:

v

=
a
b

+
b
u
k


0 <
a
,
b

< 1

7

Finding the splines

We have choose

a

=
b

= 0.5

Let
b

= As
3

+ Bs
2

+ Cs + D

(0


s


1) the Bezier curve
of control points P1,…,P4; for every spline
u
k

= at
3

+ bt
2

+ ct + d
(0


t


1)
the coefficients are computed by (2


k


3, for k = 1
the formulas are slightly different but of the same
algebraic form; a, b, c, d are 3
-
dimensional cartesian
vector)


a = P
k+1

-

P
k

-

3B
-

C
-

6


b = B + 3 (1)


c = 2B + C + 3


d = P
k

8

A matrix for splines

The system (1) can be represented as

c = T b
(matrix
-
vector multiplication) where

c

= (a, b, c, d)

b

= (P
k+1
, P
k
, B, C, 1)


1
-
1
-
3
-
1
-
6

T

= 0 0 1 0 3


0 0 2 1 3


0 1 0 0 0



For every spline, only
the vector
b

is variable;
for a single trajectory, it
must be reassigned in
RAM every group of
two points,
after

the
computation of the
relative Bezier curve.

9

A global matrix for splines

If we define a
global matrix



as



=

T 0 . . . 0


0 T . . . 0



.

0 0 . . 0 T



.

with

0

as 4 x 5 zero
-
matrix, we
have a 4
M

x 5
M

sparse

matrix
(optimization of memory storage
in MATLAB)

and with
B

= (P
k+1
, P
k
, B
1
, C
1
, 1, . . ., P
k+1
, P
k
, B
M
, C
M
, 1)
we can compute for every two
-
points group the
coefficients of cubic splines for all the
M

trajectories:

C

=


B


10

Computational complexity analysis

Every four
-
points group, for the
M

trajectories the flops
(floating point operations) number for computing the
splines coefficients is:



for Bezier curves (customized Matlab script):

316
M



for


浡m物r
-
癥捴潲畬瑩v汩捡瑩潮c⡵灰敲⁥獴業慴攩i








324
M

We have

N
groups of four
-
points at every trajectory: the
total flops number of the
Step 1

is about

640
MN

11


A parallel distribution for splines

With
P
, number of processes, divisor of
M
,

the method
used is the distribution of
M/P

trajectories (rows of


浡m物r⤠瑯⁥癥特⁰牯捥獳㬠湯⁣潭v畮u捡瑩潮c楳⁩湶i汶l搮

The value of
M

is important for the occupation of RAM
at every computational node.

p
1

p
2

p
P

.

.

M

N

linear execution for every process

12

Bezier curves and splines computation on



Linux cluster IBM x330, biprocessor Pentium III 1.133
GHz, at CINECA (2003); C routines and MPI (for parallel
startup and data distribution)



2 nodes Windows2000 / Linux RedHat IBM x440,
biprocessor Xeon 2.4 GHz Hyper Threading, 2 GB RAM,
at Riello (2003); MATLAB rel. 6.5 scripts (startup of
simultaneous multi
-
engine)

Computing splines: hardware and software

13

Computing splines: performance results

X440 cluster:

Better performances of
Win2k (linear speedup)
-

compared with Linux
-

with Intel HT technology

Beowulf CINECA:

The registered speedup is
quasi
-
linear; for high value of
P

the amount of data distribution
(
M

variable) among processes
is more intrusive.

14

Post
-
processing for splines

Now we would a fast method for computing the splines
values in a set of
parameter ticks

with fine sampling.

The CFD packages have some limits in the post
-
processing
phase:



resolution based on pre
-
processing mesh



rigid (when possible) load distribution among available
processors

For good graphic visualization,
the interval between two data
-
points might be divided in a
suitable number of ticks:

15

Valuating the splines

Let
V
+ 1 the number of ticks for each cubic spline
valuation; then the ticks are

(0, 1/
V
, 2/
V
, . . ., (
V

-
1)/
V

, 1)

and the values of splines parameter in the computation
are their (0, 1, 2, 3)
-
th degree powers. The value of a
cubic at
t
0

can be view as a dot product:

at
0
3

+ bt
0
2

+ ct
0
+ d = (a, b, c, d)



(t
0
3
, t
0
2
, t
0
, 1)

Let



the pre
-
allocable constant
4x(
V
+1) matrix:

0 (1/
V
)
3

. . . . ((
V

-
1)/
V
)
3

1

0 (1/
V
)
2

. . . . ((
V

-
1)/
V
)
2

1

0 (1/
V
)
1

. . . . ((
V

-
1)/
V
)
1

1

1 1 . . . . 1 1

16

An eulerian view

Let



the
M

x 4
matrix (each row a
spline for each
trajectory):

a
1

b
1

c
1

d
1

a
2

b
2

c
2

d
2

a
M

b
M

c
M

d
M


. . . .

Then the
M

x (
V
+1) matrix product


=

 

contains in each row the values of a cubic between two
data
-
points, for all the
M

trajectories (
eulerian

method).
For the product, the flops are 21
M
(
V
+1), the number of
matrices


is 3
N
; the total number of flops are
63
NM
(
V
+1).

17

A lagrangian view

Let



the 3
N

x 4
matrix (each row a
spline along one single
trajectory):

a
1

b
1

c
1

d
1

a
2

b
2

c
2

d
2

a
3
N

b
3
N

c
3
N

d
3
N


. . . .

Then the 3
N

x (
V
+1) matrix product


=

 

contains in each row the values of a cubic between two
data
-
points, for a single trajectory (
lagrangian

method).
For the product, the flops are 63
N
(
V
+1), the number of
matrices


is
M
; the total number of flops are
63
NM
(
V
+1).

18

Data distribution: eulerian case

With
P
, number of processes, divisor of 3
N
(amount of
two
-
points groups),

the method used is the distribution of
3
N/P



matrices to every process; no communication
is involved.

The value of
N
is important for the total computation
time,
N

and
M

for the RAM allocation of each process.

3
N

M

. . . . .

CPU

RAM

19

Data distribution: lagrangian case

With
P
, number of processes, divisor of
M
(amount of
trajectories),

the method used is the distribution of
M/P



matrices to every process; no communication is
involved.

The value of
N
is important for the total computation
time,
N

and
M

for the RAM allocation of each process.

3
N

M

CPU

RAM

.

20

Hardware and software

Hardware
: 2 x { IBM x440, 2 Xeon 2.4 GHz HT, 2 GB },
at Riello

Software
: Windows2000 / Linux RH 8.1, MATLAB 6.5,
parallelism of simultaneous Matlab engines



for matrix multiplication, Matlab 6.5 uses internal
LAPACK Level 3 BLAS routines (good performances)



the


matrix is computed only one time (in case of
uniform and costant sampling interval), its values are
probably always cached during matrices multiplication

21

Performance results

Performances of a single
Matlab process for the



product

with
V

= 100; as
theory, the execution time is
linear on
M

variable.

Performances of multiprocess
products (case 3
N

= 4200
P
); for
P


8, the total computation
time depends on
NM
(
Gustafson
law
), as expected.

22

Performance results: considerations



Linear speedup until
P
=8 (= number
of virtual Hyper Threads processors);
for
P

8 reallocations of RAM and
caches have a negative effect



For large data sets, the amount of
RAM in the nodes of cluster is a
critical factor, while the CPUs
performances are good with the use
of LAPACK routines



First results with a technique using “global M
-
N” matrices,
an MPI
-
multithreads version of MATLAB (
Cornell Toolbox
),
and parallel matrix multiplication algorithms, show an
overhead, in case of large data, due to communications

23

Performance results: Hyper Threading

Performance of Intel
Hyper
Threading Technology

of
Xeon processors; the vertical
unit is time execution in the
case of 8 processes
(
M
=5000,

3
N

= 4200
P
); until
8, the time seems to be
quadratic on processes
number.

Similar results have been obtained



using Win2k or Linux



using
High Performance Linpack

benchmarking

24

Examples

red

= trajectory computation
with
V

= 100;
black

= least
squares method, 3
°

degree
polynomials;
gray

= data
-
points
from simulation

Thanks

Forced injection of air in combustion
head; the ribbons show some particles
trajectories; data
-
points from simulation,
paths computation with
V
=100,
M
=5000,
N
=1600,
P
=8; computation and rendering
by Matlab; total computation time 85 secs