Parallelized PFI for Large-Scale Nonlinear Systems in a Distributed Computing Environment



S.K. Dey [1]
Professor of Mathematics

and

Dheeraj Bhardwaj [2]
Information Technology Services

Eastern Illinois University
Charleston, IL 61920 (USA)

[1] e-mail: cfskd@eiu.edu
[2] e-mail: dheerajb@eiu.edu

Applicable Mathematics: Its Perspectives and Challenges (Ed. J.C. Mishra), Narosa Publishing House (ISBN 81-7319-406-8), 2001, pp. 249-266.




Abstract


The Perturbed Functional Iterations (PFI) algorithm has been modified and rendered suitable for a distributed computing environment. Applications of this algorithm to large-scale tridiagonal nonlinear systems, which frequently arise in flow models, are discussed, and detailed algorithmic studies are presented. A domain decomposition algorithm is used to distribute the workload, and the tasks communicate via MPI (Message Passing Interface) message-passing calls. The performance of the algorithm is analysed on a cluster of SUN SPARC workstations.


Introduction


In the last decade, parallel processing has become the principal means of doing large-scale scientific computing. There is a strong consensus among scientists that the greatest gains in price/performance can only be achieved through multiple-processor parallel systems. Of the two paradigms for parallel processing, viz. shared memory and distributed memory, distributed memory has emerged as the more cost-effective form of computing.

PFI (Perturbed Functional Iterations) is a matrix-free algorithm for solving nonlinear systems of equations. It has been applied in the past to solve flow models quite accurately [1,2]. Recently, it has been parallelized for SIMD and MIMD parallel architectures [3,4,5]. PFI has been found to have a favourable rate of convergence, which is quadratic with respect to each processor. Hence it is well suited to a distributed computing environment with parallelism for solving nonlinear systems.


The premise of distributed-memory parallel computing is straightforward: connect many independent computers, each with its own memory, together with a high-speed network. Most distributed-memory computing environments use the message-passing parallel programming paradigm, i.e. the code developer must explicitly move data among the participating processors by passing messages. Message passing is a powerful and very general method of expressing parallelism. It can be used to develop extremely efficient parallel programs, and it is currently the most widely used method of programming for many types of parallel computers [6,7].


Vectorized Splitting of a Nonlinear Tridiagonal System


Let us consider a nonlinear tridiagonal system:


$U_j = G_j(U_{j-1},\, U_j,\, U_{j+1}), \qquad j = 1, 2, \ldots, J$   (1)


where $U_0$ and $U_{J+1}$ are boundary points whose values are known at each time step. It is assumed that the nonlinear system (1) has a solution $U_j^*$. Let $U = (U_1\ U_2\ \ldots\ U_J)^T \in \mathbb{R}^J$, where $\mathbb{R}^J$ is the real $J$-dimensional space. Also $U^* = (U_1^*\ U_2^*\ \ldots\ U_J^*)^T \in \mathbb{R}^J$.


Let us consider a distributed computing environment with $M$ parallel processors $P_1, P_2, \ldots, P_M$. This system does not have any common shared memory. The processors are connected by a high-speed network and have individual memories. They must all be activated together and can exchange information among themselves.


Let $P_1$ solve $J_1$ of the equations, $P_2$ solve $J_2$ of the equations, and so on, with $P_M$ solving $J_M$ of the equations. Thus,

$\sum_{m=1}^{M} J_m = J$   (2)


Because of the structure of the nonlinear system (1), there will be some overlapping of elements at the boundaries, which must be set up for each processor.

Thus the first set of equations, for $P_1$, is:

$U_1 = G_1(U_0,\, U_1,\, U_2)$
$U_2 = G_2(U_1,\, U_2,\, U_3)$
$\vdots$
$U_{J_1} = G_{J_1}(U_{J_1-1},\, U_{J_1},\, U_{J_1+1})$

The processor $P_2$ will solve the second set, given by:

$U_{J_1+1} = G_{J_1+1}(U_{J_1},\, U_{J_1+1},\, U_{J_1+2})$
$U_{J_1+2} = G_{J_1+2}(U_{J_1+1},\, U_{J_1+2},\, U_{J_1+3})$
$\vdots$
$U_{J_1+J_2} = G_{J_1+J_2}(U_{J_1+J_2-1},\, U_{J_1+J_2},\, U_{J_1+J_2+1})$

The last processor $P_M$ will solve:

$U_{L+1} = G_{L+1}(U_L,\, U_{L+1},\, U_{L+2})$
$U_{L+2} = G_{L+2}(U_{L+1},\, U_{L+2},\, U_{L+3})$
$\vdots$
$U_{L+J_M} = G_{L+J_M}(U_{L+J_M-1},\, U_{L+J_M},\, U_{J+1})$

where $L = J_1 + J_2 + \cdots + J_{M-1}$ and $U_{J+1}$ is the edge of the outer boundary.


Thus, $P_1$ needs to solve for $U_1, U_2, \ldots, U_{J_1}$. To solve for $U_{J_1}$, $P_1$ needs the information of $U_{J_1+1}$. This must be supplied by $P_2$. But $P_2$ needs the information of $U_{J_1}$ to solve for $U_{J_1+1}$. Hence $P_2$ will get the value of $U_{J_1}$ supplied by $P_1$. Such exchanges of information must take place at each level of iteration for all processors $P_1, P_2, \ldots, P_M$.


For the $m$th processor $P_m$, the first element in the computational field is $U_{J_1+J_2+\cdots+J_{m-1}+1}$. For the sake of simplicity, let it be denoted by $U_{1,m}$. Thus the $j$th element $U_{J_1+J_2+\cdots+J_{m-1}+j}$ in this field is denoted by $U_{j,m}$.
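For illustration, the local-to-global index map implied by this notation can be written as a small C helper. This is only a sketch; the array Jm holding the subdomain sizes and the function name are ours, not part of the paper.

/* Sketch: global index of the j-th local unknown on processor m, i.e.
   U_{j,m} corresponds to the global unknown U_{offset+j}, where
   offset = J_1 + J_2 + ... + J_{m-1}.  Jm[i] holds J_{i+1}.           */
int global_index(int j, int m, const int Jm[])
{
    int offset = 0;
    for (int i = 0; i < m - 1; i++)      /* sum J_1 ... J_{m-1} */
        offset += Jm[i];
    return offset + j;                   /* local index j runs from 1 to J_m */
}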


Hence the nonlinear tridiagonal subsystem to be solved by the processor $P_m$ is:

$U_{j,m} = G_{j,m}(U_{j-1,m},\, U_{j,m},\, U_{j+1,m}), \qquad j = 1, 2, \ldots, J_m$   (3)


In fact the entire nonlinear system is now given by (3) for $m = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, J_m$ for each $m$.

If $m = 1$ (the system to be solved by $P_1$),

$U_{j,1} = G_{j,1}(U_{j-1,1},\, U_{j,1},\, U_{j+1,1}), \qquad j = 1, 2, \ldots, J_1$   (4)

denotes the system to be solved by $P_1$.

Here, for $j = 1$, $U_{0,1} = U_0$ is a boundary element. But $U_{J_1+1,1}$ is not a boundary element.

For $m = 2$ (the system to be solved by $P_2$), the system is:

$U_{j,2} = G_{j,2}(U_{j-1,2},\, U_{j,2},\, U_{j+1,2}), \qquad j = 1, 2, \ldots, J_2$   (5)

Here $U_{0,2} = U_{J_1}$ and $U_{1,2} = U_{J_1+1}$, which are denoted by $U_{J_1,1}$ and $U_{J_1+1,1}$ respectively.

Since the subdomains of the functionals $G_{j,1}$ and $G_{j,2}$ overlap, the points shared by these two domains are $U_{0,2}$ and $U_{1,2}$; in other words,

$U_{J_1} \equiv U_{J_1,1} = U_{0,2}, \qquad U_{J_1+1} \equiv U_{J_1+1,1} = U_{1,2}$   (6)


For $m = M$, the system to be solved by the processor $P_M$ is:

$U_{j,M} = G_{j,M}(U_{j-1,M},\, U_{j,M},\, U_{j+1,M}), \qquad j = 1, 2, \ldots, J_M$   (7)

Here $U_{0,M} = U_{J_{M-1},M-1}$, $U_{1,M} = U_{J_{M-1}+1,M-1}$, and $U_{J_M+1,M} = U_{J+1}$, the boundary point.


Thus,

$U_{J_{M-1},M-1} = U_{0,M}, \qquad U_{J_{M-1}+1,M-1} = U_{1,M}$   (8)


Definition


The boundary points to be shared by consecutive processors $P_{m-1}$ and $P_m$ are called pseudo-boundary points. They are:

$U_{J_{m-1},m-1} = U_{0,m}$   (9a)

$U_{J_{m-1}+1,m-1} = U_{1,m}, \qquad m = 2, 3, \ldots, M$   (9b)


In the message-passing parallel programming paradigm, these are called the ghost points. At the termination of each iteration, $U_{J_{m-1},m-1}$ computed by $P_{m-1}$ will be sent to $P_m$ as the value of $U_{0,m}$, to be used for the next iteration by $P_m$, whereas $U_{1,m}$ computed by $P_m$ will be sent to $P_{m-1}$ as the value of $U_{J_{m-1}+1,m-1}$, to be used for the next iteration by $P_{m-1}$.


The Schematic of Distributed Computation

Figure 1: The flow of ghost-point information between the processors $P_{m-1}$, $P_m$ and $P_{m+1}$: each processor holds $U_{0,m}, U_{1,m}, \ldots, U_{J_m,m}, U_{J_m+1,m}$ and exchanges the end values with its neighbours.

Thus there are two sets of ghost points in this environment.

(a) Forward ghost points: $U_{J_{m-1},m-1}$, computed by $P_{m-1}$ and communicated to $P_m$; $P_m$ sets it equal to $U_{0,m}$.

(b) Backward ghost points: $U_{1,m}$, computed by $P_m$ and communicated to $P_{m-1}$; $P_{m-1}$ sets it equal to $U_{J_{m-1}+1,m-1}$.


Matrix-Free PFI Algorithm

Let us consider the $m$th subsystem, to be processed by the processor $P_m$:

$U_{j,m} = G_{j,m}(U_{j-1,m},\, U_{j,m},\, U_{j+1,m}), \qquad j = 1, 2, \ldots, J_m$   (10)

(since there are $J_m$ equations).


Let us denote $U_{j,m}$ by $V_j$. Then (10) may be expressed as:

$V_j = G_j(V_{j-1},\, V_j,\, V_{j+1}), \qquad j = 1, 2, \ldots, J_m$   (11)

We assume that $V_j^*$ are the roots.


Let us assume that at some $k$th iteration the values of the $V_j$'s, denoted by $V_j^k$, are known. Then let

$\hat V_j = G_j(V^k_{j-1},\, V^k_j,\, V^k_{j+1}), \qquad j = 1, 2, \ldots, J_m$   (12)

This is an intermediary step, which is equivalent to a nonlinear Jacobi iteration.



Let

$V^{k+1}_j = \hat V_j + \epsilon^k_j$   (13)


Assuming convergence at the intermediary step (12), we can write:

$\hat V_j + \epsilon^k_j = G_j(\hat V_{j-1},\, \hat V_j + \epsilon^k_j,\, \hat V_{j+1})$   (14)


Let us assume that the functionals $G_j$ are at least twice continuously differentiable and that the perturbation parameters $\epsilon^k_j$ are small, such that terms of the order $(\epsilon^k_j)^2$ may be neglected.


Then,

$\hat V_j + \epsilon^k_j = G_j(\hat V_{j-1},\, \hat V_j,\, \hat V_{j+1}) + \epsilon^k_j\, G_{jj}(\hat V_{j-1},\, \hat V_j,\, \hat V_{j+1})$   (15)

where $G_{jj} = \partial G_j / \partial V_j$, giving



$\epsilon^k_j = \dfrac{\hat G_j - \hat V_j}{1 - G_{jj}(\hat V_{j-1},\, \hat V_j,\, \hat V_{j+1})}$   (16)

where $\hat G_j = G_j(\hat V_{j-1},\, \hat V_j,\, \hat V_{j+1})$. Thus the algorithm is:


Step #1: Compute $\hat V_j,\ j = 1, 2, \ldots, J_m$, using (12).

Step #2: Compute $\epsilon^k_j,\ j = 1, 2, \ldots, J_m$, using (16).

Step #3: Compute $V^{k+1}_j,\ j = 1, 2, \ldots, J_m$, using (13).
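A minimal C sketch of one such iteration on a single processor is given below. It assumes the ghost values V[0] and V[Jm+1] (and, after Step #1, Vhat[0] and Vhat[Jm+1]) have already been supplied by the neighbouring processors as described above; the function pointers G and dG_dVj stand for the problem-specific functional $G_j$ and its derivative $\partial G_j/\partial V_j$, and are not part of the paper.

#include <math.h>

/* One PFI iteration (Steps #1-#3) on the local subvector V[1..Jm].
   Returns max_j |eps_j^k|, which feeds the convergence test later on. */
double pfi_iteration(double *V, double *Vhat, int Jm,
                     double (*G)(int j, double Vm1, double Vj, double Vp1),
                     double (*dG_dVj)(int j, double Vm1, double Vj, double Vp1))
{
    double local_max = 0.0;

    /* Step #1: intermediary (nonlinear Jacobi) values, eq. (12) */
    for (int j = 1; j <= Jm; j++)
        Vhat[j] = G(j, V[j-1], V[j], V[j+1]);

    /* ...in the parallel code, the ghost values Vhat[0] and Vhat[Jm+1]
       are exchanged with the neighbouring processors at this point...  */

    /* Step #2: perturbation parameters, eq. (16); Step #3: update, eq. (13) */
    for (int j = 1; j <= Jm; j++) {
        double Ghat = G(j, Vhat[j-1], Vhat[j], Vhat[j+1]);
        double Gjj  = dG_dVj(j, Vhat[j-1], Vhat[j], Vhat[j+1]);
        double eps  = (Ghat - Vhat[j]) / (1.0 - Gjj);   /* eq. (16) */
        V[j] = Vhat[j] + eps;                           /* eq. (13) */
        if (fabs(eps) > local_max)
            local_max = fabs(eps);
    }
    return local_max;
}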


This is PFI (Perturbed Functional Iterations), which is totally matrix free. The next question is: how does it converge?


Analysis of Convergence

Theorem 1: A necessary and sufficient condition for convergence of PFI is

$\lim_{k \to \infty} \|\epsilon^k_j\| = 0$   (17)

for all $j$, for any given norm, and for all processors.


Proof: Let us assume that PFI converges. Then

$\epsilon^k_j = (V^{k+1}_j - V^*_j) - (\hat V_j - V^*_j)$   (18)

Then,

$\|\epsilon^k_j\| \le \|V^{k+1}_j - V^*_j\| + \|\hat V_j - V^*_j\|$   (19)


Since $\lim_{k \to \infty} V^{k+1}_j = \lim_{k \to \infty} \hat V_j = V^*_j$, it follows from (19) that for all $k \ge K$ (for some $K$),

$\|\epsilon^k_j\| \le \delta/2 + \delta/2 = \delta$   (20)

where $\delta$ is positive and arbitrarily small.


Let $D = I - \bar G$, where $\bar G = \mathrm{diag}(G_{11}, G_{22}, \ldots, G_{J_m J_m})$ and $I$ is the $J_m \times J_m$ identity matrix. Let $b = (b_1, b_2, \ldots, b_{J_m})^T$, where $b_j = \hat G_j - \hat V_j$. Then

$\epsilon^k = D^{-1} b$ and $b = D\,\epsilon^k$, or $\|b\| \le \|D\|\, \|\epsilon^k\|$.


If $\lim_{k \to \infty} \epsilon^k = 0$ and $\|D\|$ is bounded, this implies

$\lim_{k \to \infty} \|b\| = 0$   (21)


which implies that the method must converge. Following [9], PFI has been modified and the perturbation parameters are given by

$\epsilon^k = (1 + \omega)\, D^{-1} b$   (22)

where $\omega$ is a scalar and

$\omega \le \dfrac{1}{J_m^{\,2}}$   (23)




MPI Implementation of PFI

The most important part of parallel programming is to map a particular problem onto a multiprocessor environment. The problem must be broken down into a set of tasks that can be solved concurrently. The choice of an approach to the problem decomposition depends on the computational scheme. In the present work we have used domain decomposition, as it is well suited for PFI. Domain decomposition involves assigning subdomains of the computational domain to different processors and solving each subdomain concurrently. Since the algorithm is implemented on a distributed computing environment, it is designed using the message-passing paradigm. MPI (Message Passing Interface) has been used to communicate between the processors [8].


(a) Domain Decomposition: In the present implementation we have divided the base domain along grid lines in the $x$-direction, as shown in Figure 2.








Figure 2: Division of the problem domain into subdomains $J_1, J_2, \ldots, J_M$ on $M$ processors.


Our next task is to decide how to assign processes to each of the decomposed subdomains. We first create a virtual Cartesian topology of the processes using the MPI topology functions. The routine MPI_Cart_create creates a Cartesian decomposition of the processes. Next we find the neighbours in the Cartesian mesh. The most direct way is to use the routine MPI_Cart_get; in the case of a 1D decomposition, MPI_Cart_shift is the more convenient function for finding the neighbours. After setting up the virtual topology of processes, we map the subdomains onto the processes.
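A sketch of this setup in C follows. The variable names are illustrative; J is the global number of unknowns and Jm receives the nearly equal local share used for the load balancing discussed below.

#include <mpi.h>

/* Sketch (not the authors' code): create the 1-D virtual Cartesian topology,
   find the left/right neighbours, and assign a nearly equal share of the J
   unknowns to each process.                                                 */
void setup_decomposition(int J, MPI_Comm *cart, int *left, int *right, int *Jm)
{
    int size, rank;
    int dims[1]    = {0};
    int periods[1] = {0};                       /* non-periodic in x */

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Dims_create(size, 1, dims);             /* dims[0] = M       */
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, cart);
    MPI_Comm_rank(*cart, &rank);
    MPI_Cart_shift(*cart, 0, 1, left, right);   /* MPI_PROC_NULL at the ends */

    /* nearly equal subdomain sizes J_m */
    *Jm = J / size + (rank < J % size ? 1 : 0);
}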


Load balancing plays an important role in achieving optimal speedup. We have carried out the parallel implementation assuming that the machine has equally powerful processors connected by a high-performance network. Therefore, we have divided the problem domain into subdomains of nearly equal size, so that each contains approximately the same number of unknowns, in order to obtain an appropriate load balance.


(b) Communication: After domain decomposition, each processor computes the values in the interior of the corresponding subdomain. In order to calculate the boundary values inside the subdomain, boundary data from the neighbouring processors are required. Thus, neighbouring processors must exchange the boundary data inside the time-stepping loop. Figure 3 shows the exchange of data between the processors. In this figure, the black strip shows the boundary inside the interior of the subdomain and the grey strip shows the boundary outside the subdomain. To exchange boundary data, the black-strip data of one processor are sent to the grey strip of the other processor.








Figure 3: Exchange of boundary data between two processors
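In C, the exchange sketched in Figures 1 and 3 might look as follows. This is only a sketch: U is the local array U[0..Jm+1], left and right come from MPI_Cart_shift, and the message tags are arbitrary.

#include <mpi.h>

/* Sketch of the pseudo-boundary (ghost-point) exchange of eqs. (9a)-(9b). */
void exchange_ghosts(double *U, int Jm, int left, int right, MPI_Comm cart)
{
    /* forward ghost point: send U[Jm] to the right neighbour,
       receive U[0] from the left neighbour                      */
    MPI_Sendrecv(&U[Jm], 1, MPI_DOUBLE, right, 0,
                 &U[0],  1, MPI_DOUBLE, left,  0,
                 cart, MPI_STATUS_IGNORE);

    /* backward ghost point: send U[1] to the left neighbour,
       receive U[Jm+1] from the right neighbour                  */
    MPI_Sendrecv(&U[1],    1, MPI_DOUBLE, left,  1,
                 &U[Jm+1], 1, MPI_DOUBLE, right, 1,
                 cart, MPI_STATUS_IGNORE);
}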



Algorithm


At the very first step, $U_j^0$, the initial guesses for the values of the $U_j$'s, are supplied to all the processes. Thus, at the very outset:


Step #1

$P_1$ computes:

$\hat U_{1,1} = G_{1,1}(U^0_{0,1},\, U^0_{1,1},\, U^0_{2,1})$
$\hat U_{2,1} = G_{2,1}(U^0_{1,1},\, U^0_{2,1},\, U^0_{3,1})$
$\vdots$
$\hat U_{J_1,1} = G_{J_1,1}(U^0_{J_1-1,1},\, U^0_{J_1,1},\, U^0_{J_1+1,1})$

$U_0$ is a boundary point in the main computational field.


$P_2$ computes:

$\hat U_{1,2} = G_{1,2}(U^0_{0,2},\, U^0_{1,2},\, U^0_{2,2})$
$\hat U_{2,2} = G_{2,2}(U^0_{1,2},\, U^0_{2,2},\, U^0_{3,2})$
$\vdots$
$\hat U_{J_2,2} = G_{J_2,2}(U^0_{J_2-1,2},\, U^0_{J_2,2},\, U^0_{J_2+1,2})$

Note that $U^0_{0,2} = U^0_{J_1,1}$ and $U^0_{1,2} = U^0_{J_1+1,1}$. These are the pseudo-boundary points.


$P_M$ computes:

$\hat U_{1,M} = G_{1,M}(U^0_{0,M},\, U^0_{1,M},\, U^0_{2,M})$
$\hat U_{2,M} = G_{2,M}(U^0_{1,M},\, U^0_{2,M},\, U^0_{3,M})$
$\vdots$
$\hat U_{J_M,M} = G_{J_M,M}(U^0_{J_M-1,M},\, U^0_{J_M,M},\, U^0_{J+1})$

$U^0_{J+1}$ is the other boundary point of the main computational field.


After this first round of computations has been completed, $P_1$ supplies to $P_2$ the value of $\hat U_{J_1,1}$ and $P_2$ stores it as the value of $\hat U_{0,2}$, while $P_2$ supplies the value of $\hat U_{1,2}$ to $P_1$ and $P_1$ stores it as $\hat U_{J_1+1,1}$. This implies that the processor $P_{M-1}$ supplies to $P_M$ the value of $\hat U_{J_{M-1},M-1}$ and $P_M$ stores it as the value of $\hat U_{0,M}$, and $P_M$ supplies the value of $\hat U_{1,M}$ to $P_{M-1}$ and $P_{M-1}$ stores it as the value of $\hat U_{J_{M-1}+1,M-1}$.


This exchange of the values of the ghost points (which are also pseudo-boundary points) between two successive processors is accomplished by using the MPI_Sendrecv function, and it follows the diagram in Figure 1. Thus all processors $P_m\ (m = 1, 2, \ldots, M)$ have the values of $\hat U_{j,m}$ for $j = 1, 2, \ldots, J_m$ for each $m = 1, 2, \ldots, M$.


Step #2

Now each processor $P_m$ computes

$\hat G_{j,m} = G_{j,m}(\hat U_{j-1,m},\, \hat U_{j,m},\, \hat U_{j+1,m}), \qquad j = 1, 2, \ldots, J_m$
Again, as before, we exchange the ghost points for the $\hat G_{j,m}$ values, following the equations:

$\hat G_{J_{m-1},m-1} = \hat G_{0,m}$   (27)

$\hat G_{J_{m-1}+1,m-1} = \hat G_{1,m}, \qquad m = 2, 3, \ldots, M$   (28)


At the $k$th iteration each processor computes $\epsilon^k_{j,m}$ using (22) and computes

$U^{k+1}_{j,m} = \hat U_{j,m} + \epsilon^k_{j,m}, \qquad j = 1, 2, \ldots, J_m \ \text{for each}\ m = 1, 2, \ldots, M$   (29)


Step #3

Each processor now computes

$\epsilon_{\max,m} = \max_j\, |\epsilon^k_{j,m}|, \qquad j = 1, 2, \ldots, J_m,\ \ m = 1, 2, \ldots, M$   (30)

at the end of each iteration $k$.


Step #4

In this step we use the MPI_Allreduce function to find the global maximum of $\epsilon_{\max,m}$ over all the processors, i.e.

$\epsilon_{\max} = \max_m\, (\epsilon_{\max,m}), \qquad m = 1, 2, \ldots, M$   (31)


If

$\epsilon_{\max} \le \tau$   (32)

where $\tau$ is positive and arbitrarily small, the method has converged and the values (at the given time step) are printed.
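A sketch of Steps #3 and #4 in C (the tolerance tau and the variable names are illustrative):

#include <mpi.h>

/* Sketch of the convergence test, eqs. (30)-(32): local_max is
   max_j |eps_{j,m}^k| on this processor; MPI_Allreduce returns the
   global maximum on every processor so that all of them stop together. */
int converged(double local_max, double tau, MPI_Comm cart)
{
    double global_max;
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, cart);
    return global_max <= tau;
}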


Coupled Equations

Let

$U_j = F_j(U_{j-1},\, U_j,\, U_{j+1};\ V_{j-1},\, V_j,\, V_{j+1})$
$V_j = G_j(U_{j-1},\, U_j,\, U_{j+1};\ V_{j-1},\, V_j,\, V_{j+1}), \qquad j = 1, 2, \ldots, J$   (33)

be two nonlinear systems which are coupled.


Let each processor now solve a coupled system. Let the processor $P_m$ solve (from equation (3)):

$U_{j,m} = F_{j,m}(U_{j-1,m},\, U_{j,m},\, U_{j+1,m};\ V_{j-1,m},\, V_{j,m},\, V_{j+1,m})$
$V_{j,m} = G_{j,m}(U_{j-1,m},\, U_{j,m},\, U_{j+1,m};\ V_{j-1,m},\, V_{j,m},\, V_{j+1,m})$   (34)

where $m = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, J_m$ for each $m$.


Let

$W_{j,m} = (U_{j,m}\ V_{j,m})^T, \qquad H_{j,m} = (F_{j,m}\ G_{j,m})^T$   (35)

Then the coupled nonlinear systems may be represented as:


$W_{j,m} = H_{j,m}(W_{j-1,m},\, W_{j,m},\, W_{j+1,m})$   (36)

This is the same as equation (3). Here each processor will supply to the following neighbouring processor two values at the first ghost point and receive from the preceding neighbouring processor two values at the other ghost point. The only difference is that the PFI algorithm is no longer totally matrix free. Let us study this now.


Let us denote $W_{j,m}$ by $z_j$. Then

$z_j = (U_{j,m}\ V_{j,m})^T$   (37)

and

$z_j = H_j(z_{j-1},\, z_j,\, z_{j+1})$   (38)


After some $k$ iterations, let

$\hat z_j = H_j(z^k_{j-1},\, z^k_j,\, z^k_{j+1})$   (39)

Let $z^{k+1}_j = \hat z_j + \xi^k_j$, where $\xi^k_j = (\epsilon^k_{j,m}\ \eta^k_{j,m})^T$ is the perturbation vector, and

$U^{k+1}_{j,m} = \hat U_{j,m} + \epsilon^k_{j,m}; \qquad V^{k+1}_{j,m} = \hat V_{j,m} + \eta^k_{j,m}$   (40)


Let the iterative scheme converge after $k$ iterations; then

$\hat z_j + \xi^k_j = H_j(\hat z_{j-1},\, \hat z_j + \xi^k_j,\, \hat z_{j+1})$   (41)


This means

$\hat U_{j,m} + \epsilon^k_{j,m} = F_{j,m}(\hat z_{j-1},\ \hat U_{j,m} + \epsilon^k_{j,m},\ \hat V_{j,m} + \eta^k_{j,m},\ \hat z_{j+1})$   (42)

and

$\hat V_{j,m} + \eta^k_{j,m} = G_{j,m}(\hat z_{j-1},\ \hat U_{j,m} + \epsilon^k_{j,m},\ \hat V_{j,m} + \eta^k_{j,m},\ \hat z_{j+1})$   (43)


Assuming that both $F_{j,m}$ and $G_{j,m}$ are at least twice continuously differentiable and that the perturbation parameters are small, so that $(\epsilon^k_{j,m})^2$ and $(\eta^k_{j,m})^2$ may be neglected,

$\hat U_{j,m} + \epsilon^k_{j,m} = F_{j,m}(\hat z_{j-1},\, \hat z_j,\, \hat z_{j+1}) + \epsilon^k_{j,m}\, \dfrac{\partial F_{j,m}}{\partial U_{j,m}}(\hat z_{j-1},\, \hat z_j,\, \hat z_{j+1}) + \eta^k_{j,m}\, \dfrac{\partial F_{j,m}}{\partial V_{j,m}}(\hat z_{j-1},\, \hat z_j,\, \hat z_{j+1})$   (44)

and similarly,

$\hat V_{j,m} + \eta^k_{j,m} = G_{j,m}(\hat z_{j-1},\, \hat z_j,\, \hat z_{j+1}) + \epsilon^k_{j,m}\, \dfrac{\partial G_{j,m}}{\partial U_{j,m}}(\hat z_{j-1},\, \hat z_j,\, \hat z_{j+1}) + \eta^k_{j,m}\, \dfrac{\partial G_{j,m}}{\partial V_{j,m}}(\hat z_{j-1},\, \hat z_j,\, \hat z_{j+1})$   (45)


These two equations may be represented as

$\begin{pmatrix} 1 - \dfrac{\partial F_{j,m}}{\partial U_{j,m}}(\hat z) & -\dfrac{\partial F_{j,m}}{\partial V_{j,m}}(\hat z) \\ -\dfrac{\partial G_{j,m}}{\partial U_{j,m}}(\hat z) & 1 - \dfrac{\partial G_{j,m}}{\partial V_{j,m}}(\hat z) \end{pmatrix} \begin{pmatrix} \epsilon^k_{j,m} \\ \eta^k_{j,m} \end{pmatrix} = \begin{pmatrix} F_{j,m}(\hat z) - \hat U_{j,m} \\ G_{j,m}(\hat z) - \hat V_{j,m} \end{pmatrix}$   (46)

which must be solved simultaneously to compute the perturbation parameters.
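For each $j$, the 2x2 system (46) can be solved directly, for example by Cramer's rule. A C sketch follows; fU, fV, gU, gV denote the partial derivatives of $F_{j,m}$ and $G_{j,m}$ with respect to $U_{j,m}$ and $V_{j,m}$ evaluated at the hatted values, and rF, rG the right-hand sides of (46). All names are illustrative, not taken from the paper.

/* Sketch: solve the 2x2 system (46) for the two perturbation parameters. */
void solve_coupled_perturbation(double fU, double fV, double gU, double gV,
                                double rF, double rG,
                                double *eps, double *eta)
{
    double a11 = 1.0 - fU, a12 = -fV;      /* matrix of eq. (46) */
    double a21 = -gU,      a22 = 1.0 - gV;
    double det = a11 * a22 - a12 * a21;

    *eps = (rF * a22 - a12 * rG) / det;    /* perturbation of U_{j,m} */
    *eta = (a11 * rG - a21 * rF) / det;    /* perturbation of V_{j,m} */
}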


The rest of the procedure is essentially the same as before.




Numerical Examples

We apply the algorithm to solve the following examples.

Example 1

$\dfrac{\partial u}{\partial t} + u\dfrac{\partial u}{\partial x} + u\dfrac{\partial u}{\partial y} = \nu\left(\dfrac{\partial^2 u}{\partial x^2} + \dfrac{\partial^2 u}{\partial y^2}\right)$   (47)

over the domain $0 \le x \le 1$, $0 \le y \le 1$ for $t > 0$, whose exact solution is

$u(x, y, t) = \exp(-t)\, \sin\!\big((x + y)/2\big)$   (48)

The following are the initial and boundary conditions:


Initial condition:

$u(x, y, 0) = \sin\!\big((x + y)/2\big)$

Boundary conditions:

$u(0, y, t) = \exp(-t)\, \sin(y/2)$
$u(x, 0, t) = \exp(-t)\, \sin(x/2)$
$u(1, y, t) = \exp(-t)\, \sin\!\big((1 + y)/2\big)$
$u(x, 1, t) = \exp(-t)\, \sin\!\big((x + 1)/2\big)$

Example 2

A coupled system:

$\dfrac{\partial u}{\partial t} + u\dfrac{\partial u}{\partial x} = 0$
$\dfrac{\partial v}{\partial t} + u\dfrac{\partial v}{\partial x} = 0$   (49)

over the domain $0 \le x \le 1$ for $t > 0$, whose exact solution is

$u = f_1(x - ut), \qquad v = f_2(x - ut)$   (50)

where $f_1$ and $f_2$ are arbitrary functions.

We use the following initial and boundary conditions:

Initial conditions:

$u = x, \qquad v = x \quad \text{at } t = 0$

Boundary conditions:

$u(0, t) = 0, \qquad v(0, t) = 0$
$u(1, t) = \dfrac{1}{1 + t}, \qquad v(1, t) = \dfrac{t}{1 + t}$

Performance Analysis

The execution time of a parallel algorithm depends not only on the input size but also on the architecture of the parallel computer and the number of processors. Therefore, a performance study of any parallel algorithm should also describe the computer used for the implementation. For the present algorithm we have used a cluster of SUN workstations. The computing system comprises eight SUN E-450 machines (300 MHz, four CPUs each) connected via a fast Ethernet switch. This is a part of the facility called PARAM 10000.

We benchmarked the parallel codes for a problem of size 101 x 101 and 500 time steps. Figure 4 shows the speedup versus the number of processors. It is clear from the speedup chart that the algorithm has almost linear speedup up to 8 processors. Since the problem size of the present problem of interest is small, we need not go beyond 8 processors.

Figure 4: Speedup chart for the 101 x 101 problem, showing almost linear speedup up to 8 processors.


A parallel program is scalable if it is possible to maintain a given efficiency by increasing the problem size as the number of processors increases [7]. Since the problem size has a direct relationship with the speedup, and keeping in mind the inherent parallelism present in the algorithm, we expect that a physical problem requiring a larger problem size will scale well to a large number of processors.



Conclusion

The principal drawback of message passing is that it is very difficult to design and develop programs using it. Indeed, it has been called the "assembly language" of parallel computing, because it forces the programmer to deal with so much detail. Nevertheless, message passing is the most commonly used method for programming distributed-memory systems. Since the PFI algorithm has inherent parallelism built into both its elementwise and vectorwise forms (see Appendix), it is suitable for implementation on distributed-memory parallel computing environments.


Acknowledgements

The authors wish to thank the Centre for Development of Advanced Computing, Pune, for providing the computational facility on PARAM 10000. The second author wishes to thank the Information Technology Services Department, Eastern Illinois University, Charleston (USA) for the support to carry out this research.



References


1. Dey, S.K., Numerical solution of nonlinear implicit finite difference equations of flow problems by Perturbed Functional Iterations, Computational Fluid Dynamics (Edited by W. Kollmann), Hemisphere, New York, (1980), 543-602.

2. Dey, S.K., Numerical solution of Euler's equations by Perturbed Functional Iterations, Lectures in Applied Mathematics, Am. Math. Soc., 22 (1985), 113-127.

3. Dey, S.K., Large-scale Nonlinear Parallel Computations by Perturbed Functional Iterations, Parallel Algorithms and Applications, 3 (1994), 211-226.

4. Dey, S.K., A Massively Parallel Algorithm for Large-scale Nonlinear Computation with Application to Nonlinear Parabolic PDE's, Computational Acoustics, 1 (1993), 55-69.

5. Dey, S.K., Analysis of convergence of parallelized PFI for coupled large-scale nonlinear systems, Int. J. Computer Math., 75 (2000), 527-542.

6. Phadke, S. and Bhardwaj, D., Parallel Implementation of Seismic Modeling Algorithms on PARAM OpenFrame, Neural, Parallel and Scientific Computation, 6(4) (1998), 469-478.

7. Kumar, V., Grama, A., Gupta, A. and Karypis, G., Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Company Inc., 1994.

8. Gropp, W., Lusk, E. and Skjellum, A., Using MPI: Portable Parallel Programming with the Message-Passing Interface, The MIT Press, 1999.

9. Dey, S.K., Modified Perturbed Functional Iterations (MPFI) for nonlinear equations, to be presented at Int. Conf. on Mathematical Modeling, University of Roorkee (India), January 29-31, 2001.


Appendix

PFI Algorithm for the mth System

Let us consider the $m$th system:

$U_{j,m} = G_{j,m}(U_{j-1,m},\, U_{j,m},\, U_{j+1,m}), \qquad j = 1, 2, \ldots, J_m$   (1.1)

Let

$U_{j,m} \equiv V_j$   (1.2)

Then (1.1) may be expressed as:

$V_j = G_j(V_{j-1},\, V_j,\, V_{j+1}), \qquad j = 1, 2, \ldots, J_m$   (1.3)


Let the $V_j$'s be known at the $k$th iteration, and let $V^k_j$ denote the value of $V_j$ at the $k$th iteration.

Let

$\hat V_j = G_j(V^k_{j-1},\, V^k_j,\, V^k_{j+1}), \qquad j = 1, 2, \ldots, J_m$   (1.4)

Let

$V^{k+1}_j = \hat V_j + \epsilon^k_j$   (1.5)


Let us assume that the method converges. Then, for all processors and for all $j = 1, 2, \ldots, J_m$,

$\lim_{k \to \infty} G_j(\hat V_{j-1},\, \hat V_j,\, \hat V_{j+1}) = G_j(V^*_{j-1},\, V^*_j,\, V^*_{j+1})$   (1.6)

where $V^*_j = U^*_{j,m}$ for all $m$, and all $j$ for each $m$. Since $V^{k+1}_j = \hat V_j + \epsilon^k_j \to V^*_j$,

$\hat V_j + \epsilon^k_j = G_j(\hat V_{j-1} + \epsilon^k_{j-1},\, \hat V_j + \epsilon^k_j,\, \hat V_{j+1} + \epsilon^k_{j+1})$   (1.7)


Let us again focus on the $m$th processor. For $j = 1$,

$\hat V_1 + \epsilon^k_1 = G_1(\hat V_0 + \epsilon^k_0,\, \hat V_1 + \epsilon^k_1,\, \hat V_2 + \epsilon^k_2)$   (1.8)

and for $j = J_m$,

$\hat V_{J_m} + \epsilon^k_{J_m} = G_{J_m}(\hat V_{J_m-1} + \epsilon^k_{J_m-1},\, \hat V_{J_m} + \epsilon^k_{J_m},\, \hat V_{J_m+1} + \epsilon^k_{J_m+1})$   (1.9)



Let us assume that the perturbation parameters $\epsilon_j$ are small and that terms of the order $\epsilon_j^2$ may be neglected. It is also assumed that each $G_j$ is at least twice continuously differentiable with respect to the $U_j$'s $(j = 1, 2, \ldots, J)$.

Expanding the right-hand side of (1.7) and neglecting terms of the order $\epsilon_j^2$, we get for the $m$th processor $P_m$:


$\hat V_j + \epsilon^k_j = G_j(\hat V_{j-1},\, \hat V_j,\, \hat V_{j+1}) + \dfrac{\partial G_j}{\partial V_{j-1}}(\hat V)\, \epsilon^k_{j-1} + \dfrac{\partial G_j}{\partial V_j}(\hat V)\, \epsilon^k_j + \dfrac{\partial G_j}{\partial V_{j+1}}(\hat V)\, \epsilon^k_{j+1}$   (1.10)


From (1.8) we get

$\hat V_1 + \epsilon^k_1 = G_1(\hat V_0,\, \hat V_1,\, \hat V_2) + \dfrac{\partial G_1}{\partial V_1}(\hat V)\, \epsilon^k_1 + \dfrac{\partial G_1}{\partial V_2}(\hat V)\, \epsilon^k_2$   (1.11)


and from (1.9)

$\hat V_{J_m} + \epsilon^k_{J_m} = G_{J_m}(\hat V_{J_m-1},\, \hat V_{J_m},\, \hat V_{J_m+1}) + \dfrac{\partial G_{J_m}}{\partial V_{J_m-1}}(\hat V)\, \epsilon^k_{J_m-1} + \dfrac{\partial G_{J_m}}{\partial V_{J_m}}(\hat V)\, \epsilon^k_{J_m}$   (1.12)


where $\dfrac{\partial G_j}{\partial V_j}(\hat V) \equiv \dfrac{\partial G_j}{\partial V_j}$ evaluated at $\hat V_{j-1},\ \hat V_j,\ \hat V_{j+1}$ (and analogously for the derivatives with respect to $V_{j-1}$ and $V_{j+1}$).


From equations (1.10), (1.11) and (1.12) we get

$A^k\, \epsilon^k = b^k$   (1.13)

where

$\epsilon^k = (\epsilon^k_1,\ \epsilon^k_2,\ \ldots,\ \epsilon^k_{J_m})^T$   (1.14)


$A^k = \text{tridiagonal}\!\left[-\dfrac{\partial G_j}{\partial V_{j-1}}(\hat V),\ \ 1 - \dfrac{\partial G_j}{\partial V_j}(\hat V),\ \ -\dfrac{\partial G_j}{\partial V_{j+1}}(\hat V)\right], \qquad j = 1, 2, \ldots, J_m$   (1.15)

($\partial G_1 / \partial V_0$ and $\partial G_{J_m} / \partial V_{J_m+1}$ are set to zero), and



$b^k = \big(\hat G_1 - \hat V_1,\ \hat G_2 - \hat V_2,\ \ldots,\ \hat G_{J_m} - \hat V_{J_m}\big)^T, \qquad \hat G_j = G_j(\hat V_{j-1},\, \hat V_j,\, \hat V_{j+1})$   (1.16)
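The paper does not prescribe how the tridiagonal system (1.13) is solved on each processor; one standard possibility is the Thomas algorithm, sketched below. Here a, d and c hold the sub-, main and super-diagonals of $A^k$, b holds the right-hand side $b^k$, and the solution overwrites b; all names are illustrative.

#include <stdlib.h>

/* Sketch: Thomas algorithm for a tridiagonal system of size n.
   Row i reads a[i]*x[i-1] + d[i]*x[i] + c[i]*x[i+1] = b[i];
   a[0] and c[n-1] are unused.                                   */
void thomas_solve(int n, const double *a, const double *d,
                  const double *c, double *b)
{
    double *w = malloc((size_t)n * sizeof *w);

    /* forward elimination */
    w[0] = c[0] / d[0];
    b[0] = b[0] / d[0];
    for (int i = 1; i < n; i++) {
        double m = d[i] - a[i] * w[i-1];
        w[i] = c[i] / m;
        b[i] = (b[i] - a[i] * b[i-1]) / m;
    }
    /* back substitution: b now holds the solution eps^k */
    for (int i = n - 2; i >= 0; i--)
        b[i] -= w[i] * b[i+1];

    free(w);
}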



Analysis of Convergence

Theorem 1: A necessary and sufficient condition for convergence of PFI is

$\lim_{k \to \infty} \|\epsilon^k_j\| = 0$   (1.17)

for all $j$ and for each processor.


Proof: If $P_m$ finds convergence, then

$\lim_{k \to \infty} U^k_{j,m} = \lim_{k \to \infty} V^k_j = \lim_{k \to \infty} \hat V_j = V^*_j = U^*_{j,m}$

Thus

$\lim_{k \to \infty} \epsilon^k_j = \lim_{k \to \infty} (V^{k+1}_j - \hat V_j) = V^*_j - V^*_j = 0$

Hence (1.17) is valid, and it is a necessary condition.

From (1.13),

$A^k\, \epsilon^k = b^k$

From (1.16), if $b^k \to 0$ the method must converge. Hence, if $\|A^k\|$ remains bounded, (1.17) is also a sufficient condition for convergence.