Special Problem Report



Investigation of a New Parallel Algorithm for Turbomachinery Calculations and New Results for Self-Recirculation Control




submitted in partial fulfillment of the requirement for the degree of

Master of Science

in

Aerospace Engineering

by

Joseph Gillman
gtg638s@mail.gatech.edu

under the guidance of

Dr. Lakshmi N. Sankar
School of Aerospace Engineering
Georgia Institute of Technology
Atlanta, GA

December 2004


Introduction


Parallel computing has given the world of Computational Fluid Dynamics a new method for decreasing the time required to obtain a solution. Three-dimensional viscous, turbulent Navier-Stokes solvers for turbomachinery, like GT-TURBO-3D, are very time-intensive: normally it takes a day or more to run a code on a single-processor PC for 35,000 iterations.

Adding parallel directives to the code, such as Message Passing Interface (MPI) or OpenMP, is the easiest way to take advantage of parallel computing, rather than rewriting the code in High-Performance Fortran. The computing facilities at the Georgia Institute of Technology include a High Performance Computing (HPC) Center where such parallel codes can be implemented and tested by the Georgia Tech research community.


“The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.” [1]

In this parallelization study, OpenMP is used to parallelize GT-TURBO-3D. MPI is concurrently being used by other researchers working with this code; however, OpenMP is attractive because less additional programming is required to implement it compared to MPI. On the other hand, OpenMP gives the programmer much less control over important aspects of parallelization, most notably the division of the computational domain among the processors.


Before the author commenced working on this project, the GT-TURBO-3D code with an Alternating-Direction Implicit (ADI) scheme was already implemented. During the course of this research, it was desired to move away from the ADI scheme towards a Lower-Upper Symmetric Gauss-Seidel (LU-SGS) scheme because the latter has improved numerical stability characteristics. However, the LU-SGS scheme is inherently sequential (not parallelizable), so parallelization is not as straightforward as it was for the ADI scheme. The DP-LUR scheme [2] was chosen because, of all parallelizable variants of the LU-SGS scheme, it requires the least increase in memory and CPU work [3,4]. However, the results with DP-LUR were not as good as with the original ADI scheme, so either other, more computationally expensive schemes can be investigated in the future, or the ADI scheme should be used if parallelization is necessary.



Towards the end of the research period, the author began working on analysis of a “self-recirculation” control concept for Rotor 67. It is desired to increase the stable operating range of an axial compressor at the low-mass-flow-rate end, but the design mass flow rate of Rotor 67 is very close to its stall/surge limit. For the full results of that research the reader is referred to [5], but some results not in that work are presented in this paper.

Implicit Time Marching Schemes


The 3-D unsteady compressible Reynolds-averaged Navier-Stokes equations are solved in strong conservation law form. The governing equations are solved in control volume form and can be written as:

\[
\frac{\partial}{\partial t}\iiint_V q\,dV + \iint_S \left(\hat{E}\,\hat{i} + \hat{F}\,\hat{j} + \hat{G}\,\hat{k}\right)\cdot\hat{n}\,dS = \iint_S \left(\hat{R}\,\hat{i} + \hat{S}\,\hat{j} + \hat{T}\,\hat{k}\right)\cdot\hat{n}\,dS \tag{1}
\]

Here, S and V refer to the control surface and control volume respectively, and \(\hat{n}\) is the outward normal vector to the surface S.


The inviscid fluxes E, F, and G are calculated to third-order spatial accuracy using Roe’s approximate Riemann solver, whereas the viscous fluxes R, S, and T are computed explicitly. The time derivatives are computed with a two- or three-point difference scheme, which is thus first- or second-order accurate in time.
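As an illustration (standard forms assumed here; the exact coefficients used in GT-TURBO-3D are not reproduced in this report), the two- and three-point approximations of the time derivative are

\[
\frac{\partial q}{\partial t} \approx \frac{q^{n+1} - q^{n}}{\Delta t} \quad \text{(two-point, first order)},
\qquad
\frac{\partial q}{\partial t} \approx \frac{3q^{n+1} - 4q^{n} + q^{n-1}}{2\Delta t} \quad \text{(three-point, second order)}.
\]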

LU-SGS Scheme:

The conservative differential form of the three-dimensional Navier-Stokes equations, transformed from Cartesian coordinates to a generalized curvilinear coordinate system, can be written as:

\[
\frac{\partial \mathbf{q}'}{\partial t} + \frac{\partial \mathbf{E}'}{\partial \xi} + \frac{\partial \mathbf{F}'}{\partial \eta} + \frac{\partial \mathbf{G}'}{\partial \zeta} = \frac{\partial \mathbf{R}'}{\partial \xi} + \frac{\partial \mathbf{S}'}{\partial \eta} + \frac{\partial \mathbf{T}'}{\partial \zeta} \tag{2}
\]

where the boldface type indicates that this is a vector equation representing the five equations of continuity, u-, v-, and w-momentum, and energy, respectively, and the primes denote that the inviscid and viscous fluxes have been transformed from Cartesian (x, y, z) coordinates to computational space (\(\xi, \eta, \zeta\)). In the remainder of this report this will be understood, and boldface and primes will be omitted.

By applying a first-order approximation to the time derivative of the state vector and linearizing the inviscid fluxes in time, (2) can be rewritten in the linearized delta form as






\[
\left[I + \Delta t\,\delta_\xi \hat{A} + \Delta t\,\delta_\eta \hat{B} + \Delta t\,\delta_\zeta \hat{C}\right]\Delta q = RHS^{n}
\]
where \(\hat{A} = \partial\hat{E}/\partial q\), \(\hat{B} = \partial\hat{F}/\partial q\), and \(\hat{C} = \partial\hat{G}/\partial q\) are 5x5 matrices describing the inviscid flux Jacobians. \(RHS^{n}\) denotes the right-hand side calculated by the “roeresi” and “stress” subroutines at time step n and is described by







\[
RHS^{n} = -\Delta t\left(\delta_\xi \hat{E}^{n} + \delta_\eta \hat{F}^{n} + \delta_\zeta \hat{G}^{n}\right) + \Delta t\left(\delta_\xi \hat{R}^{n} + \delta_\eta \hat{S}^{n} + \delta_\zeta \hat{T}^{n}\right).
\]
The \(\delta\) terms are difference operators which operate on \(\Delta q\) in the computational directions (\(\xi, \eta, \zeta\)).

The required matrix inversions are extremely computationally expensive, so Jameson and Yoon [12] proposed that the inviscid flux Jacobians be split along the spectral radius as
\[
\hat{A}^{\pm} = \frac{\hat{A} \pm \varepsilon\,\lambda_{A,\max}\,I}{2},
\]
where \(\lambda_{A,\max}\) is the normal velocity through the face (the i-constant face for the A case) plus the local speed of sound.


\(\hat{B}^{\pm}\) and \(\hat{C}^{\pm}\) are found by replacing \(\hat{A}\), \(\lambda_{A,\max}\) with \(\hat{B}\), \(\lambda_{B,\max}\) and \(\hat{C}\), \(\lambda_{C,\max}\), respectively.


\(\varepsilon\) is a dissipation term that enhances stability if \(\varepsilon \ge 1\) by increasing the diagonal dominance of the system of equations; however, the rate of convergence is reduced. In this implementation, \(\varepsilon\) is left at 1 [6]. It should be noted that, with this approximation, the eigenvalues of the split fluxes are no longer consistent with the true propagation speeds of the flow, which may introduce additional error in time accuracy and in wave propagation.

The eigenvalues of the plus-matrices are now greater than or equal to zero, while those of the minus-matrices are less than or equal to zero. The system of equations to solve now becomes




\[
\left[I + \Delta t\,\delta^{-}_\xi \hat{A}^{+} + \Delta t\,\delta^{-}_\eta \hat{B}^{+} + \Delta t\,\delta^{-}_\zeta \hat{C}^{+} + \Delta t\,\delta^{+}_\xi \hat{A}^{-} + \Delta t\,\delta^{+}_\eta \hat{B}^{-} + \Delta t\,\delta^{+}_\zeta \hat{C}^{-}\right]\Delta q = RHS^{n},
\]
where \(\delta^{-}\) represents backward differences and \(\delta^{+}\) represents forward differences, as follows:
\[
\delta^{-}_\xi \hat{A}^{+}\Delta q = \hat{A}^{+}_{i,j,k}\Delta q_{i,j,k} - \hat{A}^{+}_{i-1,j,k}\Delta q_{i-1,j,k},
\qquad
\delta^{+}_\xi \hat{A}^{-}\Delta q = \hat{A}^{-}_{i+1,j,k}\Delta q_{i+1,j,k} - \hat{A}^{-}_{i,j,k}\Delta q_{i,j,k},
\]
and similarly for B and C in the j- and k-directions, respectively.
















































































































The inviscid flux Jacobians are

\[
\hat{A},\,\hat{B},\,\hat{C} =
\begin{bmatrix}
k_t & k_x & k_y & k_z & 0 \\
k_x\phi^2 - u\theta & k_t + \theta - (\gamma-2)k_x u & k_y u - (\gamma-1)k_x v & k_z u - (\gamma-1)k_x w & (\gamma-1)k_x \\
k_y\phi^2 - v\theta & k_x v - (\gamma-1)k_y u & k_t + \theta - (\gamma-2)k_y v & k_z v - (\gamma-1)k_y w & (\gamma-1)k_y \\
k_z\phi^2 - w\theta & k_x w - (\gamma-1)k_z u & k_y w - (\gamma-1)k_z v & k_t + \theta - (\gamma-2)k_z w & (\gamma-1)k_z \\
\theta\left(2\phi^2 - \gamma E\right) & k_x\left(\gamma E - \phi^2\right) - (\gamma-1)u\theta & k_y\left(\gamma E - \phi^2\right) - (\gamma-1)v\theta & k_z\left(\gamma E - \phi^2\right) - (\gamma-1)w\theta & \gamma\theta + k_t
\end{bmatrix}
\]



In the above formulation, \(\phi^2 = \tfrac{1}{2}(\gamma-1)\left(u^2+v^2+w^2\right)\), \(\theta = k_x u + k_y v + k_z w\), \(k_t\) is the time metric, \(E = E_t/\rho\), and the only difference between \(\hat{A}\), \(\hat{B}\), and \(\hat{C}\) is that “k” is replaced by \(\xi\), \(\eta\), or \(\zeta\), respectively. [7]

The Lower-Upper factorization proceeds as follows:
\[
(L + D + U)\,\Delta q = RHS,
\]
where U is an upper triangular matrix with null matrices on the diagonal, D is a block diagonal matrix, and L is the lower triangular matrix. For the case where D is a non-singular matrix, this can be rewritten as:



\[
D\left(I + D^{-1}L\right)\left(I + D^{-1}U\right)\Delta q = RHS,
\]
or equivalently
\[
(D + L)\,D^{-1}\,(D + U)\,\Delta q = RHS,
\]
where
\[
D + L = I + \Delta t\left(\delta^{-}_\xi \hat{A}^{+} + \delta^{-}_\eta \hat{B}^{+} + \delta^{-}_\zeta \hat{C}^{+} - \hat{A}^{-} - \hat{B}^{-} - \hat{C}^{-}\right),
\]
\[
D + U = I + \Delta t\left(\delta^{+}_\xi \hat{A}^{-} + \delta^{+}_\eta \hat{B}^{-} + \delta^{+}_\zeta \hat{C}^{-} + \hat{A}^{+} + \hat{B}^{+} + \hat{C}^{+}\right),
\]
and
\[
D = I + \Delta t\left(\hat{A}^{+} - \hat{A}^{-}\right) + \Delta t\left(\hat{B}^{+} - \hat{B}^{-}\right) + \Delta t\left(\hat{C}^{+} - \hat{C}^{-}\right).
\]

Since \(\hat{A}^{\pm} = \left(\hat{A} \pm \varepsilon\lambda_{A,\max} I\right)/2\), it follows that \(\hat{A}^{+} - \hat{A}^{-} = \varepsilon\lambda_{A,\max} I\), and with \(\varepsilon = 1\) the diagonal becomes
\[
D = \left[1 + \Delta t\left(\lambda_{A,\max} + \lambda_{B,\max} + \lambda_{C,\max}\right)\right] I.
\]

The diagonal and plus/minus matrices are arranged into a large block matrix as follows:
\[
D_{ijk}\,\Delta q_{ijk}
- \Delta t\left(\hat{A}^{+}_{i-1,j,k}\Delta q_{i-1,j,k} + \hat{B}^{+}_{i,j-1,k}\Delta q_{i,j-1,k} + \hat{C}^{+}_{i,j,k-1}\Delta q_{i,j,k-1}\right)
+ \Delta t\left(\hat{A}^{-}_{i+1,j,k}\Delta q_{i+1,j,k} + \hat{B}^{-}_{i,j+1,k}\Delta q_{i,j+1,k} + \hat{C}^{-}_{i,j,k+1}\Delta q_{i,j,k+1}\right)
= \Delta t\,J\,RHS_{ijk}.
\]
The Lower-Upper Symmetric Gauss-Seidel sweeps then solve for the change in the state vector by:




\[
D_{ijk}\,\Delta q^{*}_{ijk} = \Delta t\,J\,RHS_{ijk}
+ \Delta t\left(\hat{A}^{+}_{i-1,j,k}\Delta q^{*}_{i-1,j,k} + \hat{B}^{+}_{i,j-1,k}\Delta q^{*}_{i,j-1,k} + \hat{C}^{+}_{i,j,k-1}\Delta q^{*}_{i,j,k-1}\right)
\]
(lower sweep, solve for \(\Delta q^{*}\)), and
\[
D_{ijk}\,\Delta q_{ijk} = D_{ijk}\,\Delta q^{*}_{ijk}
- \Delta t\left(\hat{A}^{-}_{i+1,j,k}\Delta q_{i+1,j,k} + \hat{B}^{-}_{i,j+1,k}\Delta q_{i,j+1,k} + \hat{C}^{-}_{i,j,k+1}\Delta q_{i,j,k+1}\right)
\]
(upper sweep, solve for \(\Delta q\)).

Since both sweeps update the state vector continuously, the algorithm is sequential and not immediately parallelizable.
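A minimal sketch of the lower sweep illustrates the recurrence (the names LSWEEP, DQ, RES, AP, BP, CP, and D are hypothetical, not taken from GT-TURBO-3D, and the 5x5 block Jacobians are shown as scalars for brevity):

      SUBROUTINE LSWEEP(DQ, RES, AP, BP, CP, D, DT, IMAX, JMAX, KMAX)
C     Illustrative serial lower sweep of LU-SGS.  The 5x5 block
C     Jacobians are represented by scalars here for brevity.
      INTEGER IMAX, JMAX, KMAX, I, J, K, M
      REAL DT
      REAL DQ(IMAX,JMAX,KMAX,5), RES(IMAX,JMAX,KMAX,5)
      REAL AP(IMAX,JMAX,KMAX), BP(IMAX,JMAX,KMAX)
      REAL CP(IMAX,JMAX,KMAX), D(IMAX,JMAX,KMAX)
      DO 10 K = 2, KMAX - 1
        DO 10 J = 2, JMAX - 1
          DO 10 I = 2, IMAX - 1
            DO 10 M = 1, 5
C             Each point uses values at (I-1,J,K), (I,J-1,K) and
C             (I,J,K-1) that were updated earlier in this same sweep,
C             so the loop nest is a serial recurrence.
              DQ(I,J,K,M) = ( RES(I,J,K,M)
     &            + DT*( AP(I-1,J,K)*DQ(I-1,J,K,M)
     &                 + BP(I,J-1,K)*DQ(I,J-1,K,M)
     &                 + CP(I,J,K-1)*DQ(I,J,K-1,M) ) ) / D(I,J,K)
   10 CONTINUE
      RETURN
      END

The upper sweep has the mirror-image dependence on the (I+1, J+1, K+1) neighbors, so neither loop nest can be handed to OpenMP as-is.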


DP-LUR Scheme

It is desired to implement the LU-SGS scheme on parallel computers in order to increase the speed of the calculations and decrease the turn-around time for a CFD solution. As described earlier, the data dependencies that arise from updating the solution vector (\(\Delta q_{i,j,k}\)) during both the L- and U-sweeps cause LU-SGS to have many more parallelization problems than the original ADI time-marching scheme, which solves the system of equations with the Thomas algorithm.

The Data-Parallel Lower-Upper Relaxation (DP-LUR) scheme uses Yoon and Jameson’s LU-SGS scheme as a starting point and makes some adjustments to remove the data dependencies and make the scheme parallelizable.

The symmetric Gauss-Seidel sweeps are replaced by a series of pointwise relaxation steps using the following scheme. First, the off-diagonal terms are neglected and the right-hand side \(RHS^{n}\) is divided by the diagonal operator from the LU-SGS scheme to obtain \(\Delta U^{0}\), the initial

sub-iteration of the state vector solution:
\[
\Delta U^{0}_{i,j,k} = \left[I + \Delta t\left(\lambda^{n}_{A,\max} + \lambda^{n}_{B,\max} + \lambda^{n}_{C,\max}\right)\right]^{-1} RHS^{n}_{i,j,k}.
\]
Then a series of \(k_{\max}\) relaxation steps are made as follows, for \(m = 1, \dots, k_{\max}\):









\[
\Delta U^{m}_{i,j,k} = \left[I + \Delta t\left(\lambda^{n}_{A,\max} + \lambda^{n}_{B,\max} + \lambda^{n}_{C,\max}\right)\right]^{-1}
\Big[ RHS^{n}_{i,j,k}
+ \Delta t\big( \hat{A}^{+}_{i-1,j,k}\Delta U^{m-1}_{i-1,j,k} - \hat{A}^{-}_{i+1,j,k}\Delta U^{m-1}_{i+1,j,k}
+ \hat{B}^{+}_{i,j-1,k}\Delta U^{m-1}_{i,j-1,k} - \hat{B}^{-}_{i,j+1,k}\Delta U^{m-1}_{i,j+1,k}
+ \hat{C}^{+}_{i,j,k-1}\Delta U^{m-1}_{i,j,k-1} - \hat{C}^{-}_{i,j,k+1}\Delta U^{m-1}_{i,j,k+1}\big)\Big].
\]
After the \(k_{\max}\) sub-iterations are completed, the state vector is updated by \(\Delta q^{n}_{i,j,k} = \Delta U^{k_{\max}}_{i,j,k}\). This scheme was presented originally in two dimensions by [2] and extended to three dimensions in [3].

Parallel Implementation using OpenMP

Reference [8] is a comprehensive guide to using OpenMP and is an excellent reference describing the benefits, methods, and techniques for proper parallelization on shared-memory computers. The advantage of OpenMP over, say, explicit message passing (as in MPI) is that much less programming is required to add the parallelization routines. Some adjustments might need to be made to the code, such as changing the order of nested loops to improve parallel performance, but in general the structure of the parallel code is identical to that of the serial version. However, there is no fine control over how little or how much data is communicated. Also, in MPI the programmer can manually divide the computational domain among the processors in whatever manner he or she decides is most efficient. Each processor can then do its work serially, for example using the sequential LU-SGS scheme, and communicate the necessary data across the processor-domain borders. In this manner good parallel speedup can be achieved even though the main scheme is itself not parallel. Since this is not the case for OpenMP, a new time-marching scheme must replace LU-SGS, such as the DP-LUR scheme described previously or DPLR [9].

In its most basic form, OpenMP uses directives to parallelize do loops. Other portions of the program are run serially. The best parallel performance is achieved for a large loop with many calculations, because parallelization incurs overhead and the desired parallel speedup would not be reached for a small loop. Contrary to programming for vector computers, where the innermost loop is put through the vector pipe, with OpenMP it is the outermost loop that is parallelized. This saves the overhead that would be required for thread creation and destruction if the innermost loop were parallelized [10]. The compiler decides

which portion of the computational domain (in general, imax × jmax × kmax) each processor works with.

In implementing OpenMP in GT-TURBO-3D, the parallelization of the outermost loop in a set of nested do loops is done with the directive “C$OMP PARALLEL DO”. Since the C is in the first column, a non-parallel compiler will ignore the line completely; therefore the same code can be compiled on a PC to be executed serially. Virtually every subroutine has a parallel set of nested loops. Sometimes extra statements must be inserted after the “C$OMP PARALLEL DO” line in order to maintain good parallel performance, as sketched below.
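A minimal sketch of such a parallelized loop nest (with hypothetical array names, not taken from GT-TURBO-3D) is:

      SUBROUTINE UPDATE(Q, RES, DT, IMAX, JMAX, KMAX)
C     Illustrative use of C$OMP PARALLEL DO on the outermost loop of
C     a nested do loop.  A compiler without OpenMP support treats the
C     C$OMP line as an ordinary comment and runs the loop serially.
      INTEGER IMAX, JMAX, KMAX, I, J, K
      REAL DT
      REAL Q(IMAX,JMAX,KMAX), RES(IMAX,JMAX,KMAX)
C$OMP PARALLEL DO
      DO 20 K = 1, KMAX
        DO 20 J = 1, JMAX
          DO 20 I = 1, IMAX
            Q(I,J,K) = Q(I,J,K) + DT*RES(I,J,K)
   20 CONTINUE
      RETURN
      END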

By default, all variables within the parallel do loop are shared between all processors. Scalar work variables that are assigned inside the loop, however, must be declared as private, so that each processor updates its own copy independently in its section of the computational domain. These variables are listed under the C$OMP PARALLEL DO line as “C$OMP+PRIVATE(privatevar1, privatevar2, ...)”. These private variables have multiple storage locations, one within the execution context of each thread, for the duration of the parallel construct. Reads or writes of such a variable refer to the separate, private instance in each individual processor’s memory allocation and are inaccessible to the other threads. In contrast, a shared variable has only one memory location, which all threads can read and write, and this facilitates communication between the OpenMP threads [8].
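As a sketch (hypothetical names), a scalar temporary such as the inverse density below must be made private:

      SUBROUTINE PRIM(Q, U, V, W, IMAX, JMAX, KMAX)
C     Illustrative PRIVATE clause.  RHOI is assigned by every thread,
C     so each thread must carry its own copy.
      INTEGER IMAX, JMAX, KMAX, I, J, K
      REAL RHOI
      REAL Q(IMAX,JMAX,KMAX,5)
      REAL U(IMAX,JMAX,KMAX), V(IMAX,JMAX,KMAX), W(IMAX,JMAX,KMAX)
C$OMP PARALLEL DO
C$OMP+PRIVATE(RHOI)
      DO 30 K = 1, KMAX
        DO 30 J = 1, JMAX
          DO 30 I = 1, IMAX
            RHOI = 1.0/Q(I,J,K,1)
            U(I,J,K) = Q(I,J,K,2)*RHOI
            V(I,J,K) = Q(I,J,K,3)*RHOI
            W(I,J,K) = Q(I,J,K,4)*RHOI
   30 CONTINUE
      RETURN
      END

The shared arrays U, V, and W are safe to write because each thread touches only the elements belonging to its own slice of the K loop.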

Yet another clause is the reduction clause, which has storage properties resembling both private and shared variables. The reduction clause denotes a scalar variable that is the result of some operation (e.g., addition or multiplication) on the components of a vector, such as the result of a dot product [10]. The reduction attribute is the most efficient way for the compiler to handle these “mixed” private and shared variables [8]. It is implemented with slightly different syntax, “C$OMP+REDUCTION(+:reductionvar)” (note the plus sign and colon in front of the variable name, with no spaces).
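A sketch of the reduction clause (hypothetical names) accumulating a global sum of squared residuals:

      SUBROUTINE RESNRM(RES, RESSUM, IMAX, JMAX, KMAX)
C     Illustrative REDUCTION clause.  Each thread accumulates its own
C     partial sum in a private copy of RESSUM, and the copies are
C     added together when the parallel loop ends.
      INTEGER IMAX, JMAX, KMAX, I, J, K
      REAL RESSUM
      REAL RES(IMAX,JMAX,KMAX)
      RESSUM = 0.0
C$OMP PARALLEL DO
C$OMP+REDUCTION(+:RESSUM)
      DO 40 K = 1, KMAX
        DO 40 J = 1, JMAX
          DO 40 I = 1, IMAX
            RESSUM = RESSUM + RES(I,J,K)**2
   40 CONTINUE
      RETURN
      END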

The parallel codes described in this report were run on the High Performance Computing (HPC) resources available at Georgia Tech. The specific machine used is called “buteo” and its specifications are listed on the Georgia Tech HPC website [11] as follows:

SGI® Origin™ 3800
Machine name: buteo



Front end (vireo): 8-processor R14000™-based Origin™ 300, 8 GB memory
32 MIPS® R14000™ processors in the new NUMAflex™ architecture, answering to the name buteo
600 MHz clock speed
16 GBytes 64-bit shared memory (2 GBytes per node board)
8 MB cache per node board
1.6 GB/sec, each direction, NUMAflex connectivity between C- and R-bricks
OS: IRIX™ 6.5

The code is compiled on the vireo front-end machine using the SGI Fortran 77 compiler, which is accessed via SecureCRT (a secure Telnet application supported by the Georgia Tech Office of Information Technology) at hostname vireo.gatech.edu. Once logged into this host, WinSCP3 is used (with a separate login) to upload the source code, grid file, and other input files to the vireo server. The compiler command used is “f77 -O2 -mp code.f -o code.exe”. “-O2” is a scalar optimizer, and the HPC website notes that “-O2 performs global optimizations, but does not affect numerical accuracy.” A different optimizer, “-O3”, is recommended for specifying “automatic parallelization,” in which no OpenMP directives are added to the code and the compiler decides the best way to parallelize. This was not used, as it is inefficient for a large CFD code such as GT-TURBO-3D.

Once compiled, an .exe executable file is created, but it cannot be executed immediately on the buteo machine. A batch file must be written that tells the vireo front-end machine to use the buteo computer, specifies how many processors are desired, and so on. An example of a batch file is shown here:

#PBS -N DPAxial_par8_b
#PBS -l ncpus=8
#PBS -q buteo
#PBS -k oe
#PBS -m abe
export MP_SET_NUMTHREADS=8
cd /scratch/buteo/gtg638s/DP-LUR
limit stacksize unlimited
./LUaxial-par4test5.exe


This batch file is saved in the user’s vireo.gatech.edu home directory. The first line gives the name of the batch job, in this case “DPAxial_par8_b.” The second line specifies that 8 processors are desired for this job. The next line specifies which computer will run the program (buteo is used exclusively in this study). The fourth line, “-k oe”, specifies that both output (o) and error (e) files will be saved in the user’s home directory (which is where the batch file is located). For example, when GT-TURBO-3D runs serially on a PC the screen outputs “dq1max”, theta, and the iteration number, so this “screen output” will be saved as a text file. The line “-m abe” specifies that an email notification will be sent upon code abortion (a), beginning (b), and ending (e). The line “export MP_SET_NUMTHREADS=8” tells the IRIX operating system that 8 CPUs will be utilized. The earlier line told the program scheduler how many processors are requested, while this line is used by the operating system; it is imperative that the two numbers match (8 in this case). Lastly, the “cd ...” line describes how to get from the location of the batch file (normally the home directory, as described above) to the executable file. The scratch directory is often used because GT-TURBO-3D outputs the grid and solution files after a specified number of iterations, and since these files can be quite large, the scratch folder is the best place from which to execute the program, as disk space there is unlimited. However, an automated system deletes older files from time to time, so every file, including grid and input files, must have a “new” timestamp. These files can be opened in a text editor and re-saved without changing the numbers, which prevents them from being deleted before the program can execute.

After the code is compiled and all input files are placed with the executable in the scratch directory, the batch job is submitted from the vireo.gatech.edu command prompt (using SecureCRT) with “qsub DPAxial_par8_b”, and the batch job executes once it reaches the front of the queue.

Parallelization Results

As mentioned previously, it was desired to parallelize the LU-SGS scheme in a manner such that the speedup gain would be similar to that of the original ADI scheme. The ADI speedup factor was measured as approximately 6 when using 8 CPUs. When OpenMP routines were added directly to the LU-SGS scheme, the speedup factor was only 3 on 8 processors. This is because much of the computational work in GT-TURBO-3D is done in the

SLPS subroutine (which calls the implicit time-marching scheme, be it ADI, LU-SGS, or DP-LUR). Therefore the data dependencies of the LU-SGS scheme have a large impact on parallel performance: simply adding the OpenMP directives to a non-parallelizable LU-SGS scheme yields an unsatisfactory result.

The references which discuss the parallelization of the LU-SGS scheme all use the Message Passing Interface instead of OpenMP. It was hoped that the algorithmic adjustments to the LU-SGS scheme would be the same whether using OpenMP or MPI. Some schemes mentioned in these papers, such as the “Hybrid DP-LUR” [2,3], can only use message-passing routines because precise control is required over which data are communicated between processors. In the Hybrid DP-LUR, for example, LU-SGS is performed on each individual processor, and on the data-domain border between two processors DP-LUR is used because of its parallelizable nature.

Two implementations of the parallelized LU-SGS scheme were attempted in this study. In one, the L-elimination and U-elimination routines were kept intact, but sub-iterations on \(\Delta q\), the updated state vector, were added. The state vector in the k-direction was based on the previous sub-iteration, so the data dependency in the k-direction was removed and the sweeps should therefore be parallelizable. For this case the speed-up factor improved to match that of the ADI scheme, and the results output by the code for mass flow rate and stagnation pressure ratio mirrored the LU-SGS scheme very closely. However, the code became unstable after only 1600 iterations, and this happened despite using a large number of sub-iterations (16). Using more and more sub-iterations was necessary to increase the stability (measured by the number of iterations the code was able to complete), which increased the CPU effort required. A rough sketch of the lagged-k idea is shown below.
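The sketch below is illustrative only (hypothetical names, block Jacobians as scalars): lagging only the k-direction neighbor lets the K loop be distributed among threads, while the i- and j-recurrences stay serial inside each thread.

      SUBROUTINE LSWEEPK(DQ, DQOLD, RES, AP, BP, CP, D, DT,
     &                   IMAX, JMAX, KMAX)
C     Illustrative lagged-k lower sweep: the (I,J,K-1) contribution is
C     taken from the previous sub-iteration (DQOLD), removing the
C     k-direction dependency so the K loop can be parallelized.
      INTEGER IMAX, JMAX, KMAX, I, J, K, M
      REAL DT
      REAL DQ(IMAX,JMAX,KMAX,5), DQOLD(IMAX,JMAX,KMAX,5)
      REAL RES(IMAX,JMAX,KMAX,5)
      REAL AP(IMAX,JMAX,KMAX), BP(IMAX,JMAX,KMAX)
      REAL CP(IMAX,JMAX,KMAX), D(IMAX,JMAX,KMAX)
C$OMP PARALLEL DO
C$OMP+PRIVATE(I,J,M)
      DO 50 K = 2, KMAX - 1
        DO 50 J = 2, JMAX - 1
          DO 50 I = 2, IMAX - 1
            DO 50 M = 1, 5
              DQ(I,J,K,M) = ( RES(I,J,K,M)
     &            + DT*( AP(I-1,J,K)*DQ(I-1,J,K,M)
     &                 + BP(I,J-1,K)*DQ(I,J-1,K,M)
     &                 + CP(I,J,K-1)*DQOLD(I,J,K-1,M) ) ) / D(I,J,K)
   50 CONTINUE
      RETURN
      END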

The second attempt more closely resembled the DP-LUR scheme of [2]. Instead of L and U sweeps, both with sub-iterations, the symmetric Gauss-Seidel sweeps were replaced by the pointwise relaxation scheme described earlier. There are drawbacks to using such a scheme, however. Much of the increased stability from using LU-SGS is lost because the Gauss-Seidel sweeps are replaced by Jacobi sweeps, which are less stable and take an increased number of iterations to achieve convergence. When this was implemented in GT-TURBO-3D, the program ran for 16,000 iterations without the need to increase the number of sub-iterations all the way up to 16 as was the case earlier. However, the state-vector update was so small that there was hardly any noticeable change from one time step to the next when examining the results file.


For further investigation, it is recommended that a hybrid scheme be investigated with MPI routines. The schemes that could be implemented with OpenMP take much more CPU time than the ADI-parallel version of the code due to the requirement of sub-iterations. Therefore it is recommended that the parallel ADI scheme continue to be used when quick turnarounds are desired, as the ADI scheme is sufficiently stable for the design mass flow rate condition of Rotor 67 and Rotor 37. The self-recirculation control code, described in the next section, uses the ADI scheme, and it is parallelized with the speedup shown in Figure 1 as the ADI curve.

[Plot “Speed-up study for parallelization”: speed-up factor (0-10) versus number of CPUs (0-18) for the ADI, LU-SGS, and DP-LUR curves.]

Figure 1: Speed-up chart for the original ADI scheme, the sequential LU-SGS scheme, and the DP-LUR scheme.

Although the speed-up rate for DP-LUR is the best, the speedup factor calculated is relative to performing DP-LUR on one processor. This is much more computationally expensive than the ADI scheme, as discussed earlier. Therefore, even with the improved speed-up, the fact that DP-LUR currently gives less than optimal results suggests that even if the DP-LUR code is improved further so that a converged solution is achieved, it is much more efficient to use the parallel ADI scheme for parallel computation.



Self-Recirculation Control Concept

It has been shown previously in numerical studies by Niazi [13] and Stein [14] that open- and closed-loop bleeding and injection of air at the case can both improve the stability of a compressor and increase its operating range at lower mass flow rates. The author contributed to a study, using GT-TURBO-3D, of a unique “self-recirculation” control concept first developed by Hathaway [15].

Active control approaches, while effective, have two drawbacks. First, since air is being bled, there is a loss of high-pressure air, which hinders the total pressure rise of the compressor. Second, there is still the need for an active controller and actuator [16].

Hathaway’s concept has already been applied to the Rotor 67 configuration. The concept is shown in Figure 2.

[Schematic of the rotor between hub and case with incoming flow, showing bleed and injection ports at the case and mass flow rates m1 and m2.]

Figure 2: ‘Self-recirculation’ control treatment concept as proposed by Hathaway [15]


Hathaway describes that a main barrier to increased performance and operating range is
the airflow at the tip. Referred to as “tip
-
blocka
ge effects,” the vortices and reversed flow
regions in the small clearance region between the spinning blade tip and the stationary case are
detrimental to overall compressor stage performance. Therefore it is desired to alleviate this
problem with the ble
ed
-
injection control concept.


Airflow is bled downstream of the blade trailing edge and then re-injected upstream of the leading edge near the blade tip. The concept requires the bleed and injection mass flow rates to be the same. The injected fluid is directed to lie along the casing end-wall to energize the local low-momentum fluid. The injection port is located just upstream of the region of low-momentum fluid [16].

The necessary adjustments to the GT-TURBO-3D code and boundary conditions are described fully in [16] and [5], and the latest results from the numerical study are given in [5]. In this work, some extra descriptions of the work that went into the study, which were not described in the final paper [5], are provided.




Figure 3: Tip blockage vortices shown. Mid-pitch, 30 kg/sec, ¼ rotor revolution. [Figure annotations: low relative static pressure, higher relative static pressure, shock location, reversed-flow regions near the blade tip.]






Figure 4: The stagnation pressure at the bleed port is higher than that at the injection port.


At the beginning of the author’s contributions to the self-recirculation concept study, it was desired to show that the self-recirculation is indeed a passive concept by ensuring that the stagnation pressure at the bleed port is higher than that at the injection port. This difference in stagnation pressure is the only way to drive the bled airflow from the bleed port back to the injection port. Figure 4 shows that this is indeed the case: normalized stagnation pressure is shown in the chart in the upper right-hand corner, and the stagnation pressure at the bleed port is higher than that at the injection port.


The next series of figures shows an uncontrolled case at low mass flow rate. There is no bleed or injection, and even after many rotor revolutions, where in a stable case a converged solution would result, it is clearly seen that vortices are still being shed in the unstable case. The pictures show velocity vectors and pressure contours as computed by Fieldview. The velocity vectors also show reversed flow near the blade tip where there is a vortex. The case shown is for a specified mass flow rate through the compressor of 31 kg/sec.




[Four velocity-vector and pressure-contour frames, panels (a)-(d).]

Figure 5: Uncontrolled case, (a): theta = 1017°, (b): theta = 1089°, (c): theta = 1162°, (d): theta = 1235°. This is at k=21 (mid-pitch).












For the controlled case, these unstable phenomena are stabilized by the bleeding and injection, and the vortices and reversed-flow regions disappear two revolutions before the above pictures were taken. The initial instabilities near the blade tip are no longer present at a theta of 363°, as shown in the following set of pictures.




[Four velocity-vector and pressure-contour frames, panels (a)-(d).]

Figure 6: Recirculation case, (a): theta = 72°, (b): theta = 145°, (c): theta = 218°, (d): theta = 290°. This is at k=21 (mid-pitch).




Figure 7: theta = 363°, also at mid-pitch. Note that the blockage vortex at the tip is no longer present, and it did not arise again for the remainder of the computational simulation (up to 2 rotor revolutions).


Conclusions



In this study, a parallel version of LU-SGS was attempted. The results are not as good as with the original ADI parallel scheme, and more CPU time is required for the parallelized LU schemes compared to the ADI scheme. Therefore it is recommended that current design needs for a rapid turnaround time use the ADI version of the code.

Further work could also investigate DPLR as an alternative to DP-LUR. However, DPLR requires even more memory and processor power than the DP-LUR scheme. In a recent attempt to implement the DPLR scheme on a wind-rotor code with a grid of approximately 2 million grid points, the parallel SGI machines would not run the program due to the excessive memory required.



New results for the recirculation concept applied to Rotor 67 using GT-TURBO-3D were also presented. The operating range of the compressor was successfully extended to lower mass flow rates with this passive control concept. The full paper [5] gives a detailed analysis of the results and is due to be presented at the 43rd Annual Aerospace Sciences Meeting and Exhibit in Reno, Nevada, in January 2005.

References

[1] www.openmp.org

[2] Wright, Michael J., Candler, Graham V., and Prampolini, Marco, “Data-Parallel Lower-Upper Relaxation Method for the Navier-Stokes Equations,” AIAA Journal, Vol. 34, No. 7, pp. 1371-1377, July 1996.

[3] Wissink, Andrew M., Lyrintzis, Anastasios S., and Strawn, Roger C., “Parallelization of the Three-Dimensional Flow Solver for Euler Rotorcraft Aerodynamics Predictions,” AIAA Journal, Vol. 34, No. 11, Nov. 1996.

[4] Wissink, Andrew M., “Efficient Parallel Implicit Methods for Rotary-Wing Aerodynamics Calculations,” Ph.D. Dissertation, University of Minnesota, May 1997.

[5] Iyengar, Gillman, Niazi, and Sankar, “Numerical Studies of Two Approaches for Surge Control in Axial Compressors,” AIAA Conference, Reno, NV, 2005.

[6] Tomaro, Robert, “An Implicit Higher-Order Spatially Accurate Scheme for Solving Time Dependent Flows on Unstructured Meshes,” Ph.D. Dissertation, Georgia Institute of Technology, School of Aerospace Engineering, December 1997.

[7] Wang, Gang, “Study of a Low-Dispersion Finite Volume Scheme in Rotorcraft Noise Prediction,” Ph.D. Dissertation, Georgia Institute of Technology, School of Aerospace Engineering, April 2002.

[8] Chandra, Rohit, et al., “Parallel Programming in OpenMP,” Morgan Kaufmann Publishers, 2001.

[9] Wright, Michael J., Candler, Graham V., and Bose, Deepak, “Data-Parallel Line Relaxation Method for the Navier-Stokes Equations,” AIAA Journal, Vol. 36, No. 9, pp. 1603-1609, Sept. 1998.

[10] Shonkwiler, R., “Introduction to Parallel and Scientific Computing,” Georgia Institute of Technology, 2004.


[11] http://www.hpc.gatech.edu/facilities/o3800.html

[12] Yoon, S., and Jameson, A., “Lower-Upper Symmetric-Gauss-Seidel Method for the Euler and Navier-Stokes Equations,” AIAA Journal, Vol. 26, No. 9.

[13] Niazi, S., “Numerical Simulation of Rotating Stall and Surge Alleviation in Axial Compressors,” Ph.D. Dissertation, Georgia Institute of Technology, Aerospace Engineering, July 2000.

[14] Stein, A., “Computational Analysis of Stall and Separation Control in Centrifugal Compressors,” Ph.D. Dissertation, Georgia Institute of Technology, Aerospace Engineering, May 2000.

[15] Hathaway, M.D., “Self-Recirculating Casing Treatment Concept for Enhanced Compressor Performance,” Turbo-Expo 2002, Amsterdam, Holland.

[16] Iyengar, Vishwas, “Advanced Control Techniques for Modern Compressor Rotors,” MS Special Problem Report, Georgia Institute of Technology, School of Aerospace Engineering, July 2004.



Appendix A: DPLUR code

Appendix B: Parallelized Lower and Upper Sweeps

Appendix C: Composition of A, B, and C matrices