Coupling Dynamic Load Balancing with Asynchronism in Iterative Algorithms on the Computational Grid ∗
Jacques M. Bahi, Sylvain Contassot-Vivier and Raphaël Couturier
Laboratoire d'Informatique de Franche-Comté (LIFC),
IUT de Belfort-Montbéliard, BP 27, 90016 Belfort, France
Abstract
In a previous work, we have shown the great power of asynchronism for parallel iterative algorithms in a global context of grid computing. In this article, we study the benefit of coupling load balancing with asynchronism in these algorithms. We propose a non-centralized version of dynamic load balancing which is best suited to asynchronism. After showing, by some experiments on a given ODE problem, that this technique can efficiently enhance the performance of our algorithms, we give some general conditions for the use of load balancing to obtain good results with this kind of algorithm.
Introduction
In the context of scientific computations, iterative algorithms are very well suited for a large class of problems and are in many cases either preferred to direct methods or even sometimes the only way to solve the problem. Direct algorithms give the exact solution of a problem within a finite number of operations, whereas iterative algorithms provide an approximation of it; we say that they converge (asymptotically) towards this solution. When dealing with problems of very large dimension, iterative algorithms are preferred, especially if they give a good approximation in a small number of iterations.
These properties have led to a wide development of parallel iterative algorithms. Nevertheless, most of these parallel versions are synchronous. We have shown in [3] the benefit of using asynchronism in such parallel iterative algorithms, especially in a global context of grid computing. Moreover, in another work [2], we have also shown that static load balancing can sharply improve the performance of our algorithms.
In this article, we discuss the general interest of using dynamic load balancing in asynchronous iterative algorithms and we show with some experiments its major efficiency in the global context of grid computing. Due to the nature of these algorithms, a centralized version of load balancing would not be well suited. Hence, the technique used in this study works locally between neighboring processors. The neighborhood in our case is determined by the communications between processors: two nodes are defined to be neighbors if they have to exchange data to perform their job. To evaluate the gain brought by this technique, some experiments are performed on the Brusselator problem [8], which is described by an Ordinary Differential Equation (ODE).
∗ This research was supported by the STIC Department of the CNRS.
The following section recalls the principle of asynchronous iterative algorithms and places them in the context of parallel iterative algorithms. Then, Section 2 presents a short discussion about the motivations for using load balancing in such algorithms. A brief overview of related works concerning non-centralized load balancing techniques is given in Section 3. An example of application is given with the Brusselator problem, detailed in Section 4. The corresponding algorithm and the insertion of load balancing are then detailed in Section 5. Finally, experimental results are given and interpreted in Section 6.
1 What are asynchronous iterative algorithms?
1.1 Iterative algorithms:backgrounds
Iterative algorithms have the structure
$$x^{k+1} = g(x^k), \quad k = 0, 1, \ldots, \quad \text{with } x^0 \text{ given} \qquad (1)$$
where each $x^k$ is an $n$-dimensional vector, and $g$ is some function from $\mathbb{R}^n$ into itself. If the sequence $\{x^k\}$ generated by the above iteration converges to some $x^*$ and if $g$ is continuous, then we have $x^* = g(x^*)$; we say that $x^*$ is a fixed point of $g$.
0-7695-1926-1/03/$17.00 (C) 2003 IEEE
Let $x^k$ be partitioned into $m$ block-components $X_i^k$, $i \in \{1, \ldots, m\}$, and $g$ be partitioned in a compatible way into $m$ block-components $G_i$; then equation (1) can be written as
$$X_i^{k+1} = G_i\left(X_1^k, \ldots, X_m^k\right), \quad i = 1, \ldots, m, \quad \text{with } X^0 \text{ given} \qquad (2)$$
and the iterative algorithm can be parallelized by letting each of the $m$ processors update a different block-component of $x$ according to (2) (see [12]). At each stage, the $i$-th processor knows the value of all components of $X^k$ on which $G_i$ depends, computes the new values $X_i^{k+1}$, and communicates those on which other processors depend to make their own iterations. The communications required for the execution of iteration (2) can then be described by means of a directed graph called the dependency graph.
Iteration (2), in which all the components of $x$ are simultaneously updated, is called a Jacobi-type iteration. If the components are updated one at a time and the most recently computed values are used, then the iteration is called a Gauss-Seidel iteration. Jacobi algorithms are thus suitable for parallelization, whereas Gauss-Seidel algorithms may converge faster than Jacobi ones but may be completely non-parallelizable (for example if every $G_i$ depends on all components $X_j$).
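The difference between the two update orders can be illustrated on a small linear system. The following Python sketch is not taken from the article: the test matrix, the tolerance and all function names are assumptions chosen for illustration. It applies iteration (1) with a Jacobi sweep and with a Gauss-Seidel sweep and counts the iterations each one needs.

```python
def jacobi_sweep(A, b, x):
    """One Jacobi sweep: every component is computed from the previous iterate."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: each component reuses the values just computed."""
    n = len(b)
    x = list(x)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def iterate(sweep, A, b, tol=1e-10, max_iter=10_000):
    """Apply x^{k+1} = g(x^k) until two successive iterates differ by less than tol."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        x_next = sweep(A, b, x)
        if max(abs(u - v) for u, v in zip(x_next, x)) < tol:
            return x_next, k
        x = x_next
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical diagonally dominant test system whose solution is (1, 1, 1).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x_jacobi, n_jacobi = iterate(jacobi_sweep, A, b)
x_gs, n_gs = iterate(gauss_seidel_sweep, A, b)
```

In the Jacobi sweep the list comprehension has no dependency between components, so each entry could be computed by a different processor; the loop of the Gauss-Seidel sweep is inherently sequential on this example.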
1.2 A categorization of parallel iterative algorithms
Since this article deals with what we commonly call asynchronous iterative algorithms, it appears necessary, for clarity, to detail the class of parallel iterative algorithms. This class can be decomposed into three main parts:
Synchronous Iterations - Synchronous Communications (SISC) algorithms: all processors begin the same iteration at the same time since data exchanges are performed at the end of each iteration by synchronous global communications. After parallelization of the problem, these algorithms have exactly the same behavior as the sequential version in terms of the iterations performed. Hence, their convergence is directly deducible from the initial algorithm. Unfortunately, the synchronous communications strongly penalize the performance of these algorithms. As can be seen in Figure 1, there may be a lot of idle time (white spaces) between iterations (grey blocks), depending on the speed of communications.
Synchronous Iterations - Asynchronous Communications (SIAC) algorithms: all processors also wait for the receipt of the needed data updated at the previous iteration before beginning the next one. Nevertheless, each data item (or group of data) required on another processor is sent asynchronously as soon as it has been updated, in order to overlap its communication with the remaining computations of the current iteration. This scheme relies on the likelihood that data will be received on the destination processor before the end of the current iteration, and will then be directly available for the next iteration. Hence, this partial overlapping of communications by computations during each iteration implies shorter idle times and thus better performance. Since each processor begins its next iteration as soon as it has received all its needed data updated from the previous iteration, all the processors may not begin their iterations at the same time. Nonetheless, in terms of iterations, the notion of synchronism still holds in this scheme since at any time t, it is not possible to have two processors performing different iterations. In fact, at each t, processors are either computing the same iteration or idle (waiting for data). Hence, like the SISC, this category of algorithms performs the same iterations as the sequential version from the algorithmic point of view, and thus has the same convergence properties. Unfortunately, this scheme does not completely eliminate idle times between iterations, as shown in Figure 2, since some communications may be longer than the computation of the current iteration and also because the sending of the last updated data on the processor that finishes last cannot be overlapped by computations.
Asynchronous Iterations - Asynchronous Communications (AIAC) algorithms: all processors perform their iterations without taking care of the progress of the other processors. They do not wait for predetermined data to become available from other processors but keep on computing, trying to solve the given problem with whatever data happen to be available at that time. Since the processors do not wait for communications, there are no more idle times between the iterations, as can be seen in Figure 3. Although these algorithms have been widely studied theoretically, very few implementations and experimental analyses have been carried out, especially in the context of grid computing. In the literature, there are two algorithmic models corresponding to these algorithms, the Bertsekas and Tsitsiklis model [5] and El Tarazi's model [11]. Nevertheless, several variants can be deduced from these models depending on when the communications are performed and when the received data are incorporated in the computations, see e.g. [4, 1]. Figure 3 depicts a general version of an AIAC with a data decomposition in two halves for the asynchronous sendings. This type of algorithm requires a meticulous study to ensure convergence because even if a sequential iterative algorithm converges to the right solution, its asynchronous parallel counterpart may not converge. It is then necessary to develop new converging algorithms, and several problems appear, such as choosing a good criterion for convergence detection and a good halting procedure. There are also some implementation problems due to the asynchronous communications, which imply the use of an adequate programming environment. Nevertheless, despite all these obstacles, these algorithms are quite convenient to implement and are the most efficient, especially in a global context of grid computing, as we have already shown in [3]. This comes from the fact that they allow communication delays to be substantial and unpredictable, which is a typical situation in large networks of heterogeneous machines.
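The behavior of asynchronous iterations can be mimicked in a single process. The sketch below is only a simulation under stated assumptions, not the authors' multithreaded implementation: the update order and the staleness of the data are drawn at random, the delay bound `max_delay` is arbitrary, and the test matrix `B` is chosen as a contraction in the maximum norm, a classical sufficient condition for the convergence of asynchronous iterations.

```python
import random


def async_iterations(B, c, steps=3000, max_delay=5, seed=0):
    """Simulate x_i <- sum_j B[i][j] * x_j + c[i] using possibly outdated x_j."""
    rng = random.Random(seed)
    n = len(c)
    history = [[0.0] * n]  # history[k] is the global state after k updates
    for _ in range(steps):
        i = rng.randrange(n)  # the "processor" that finishes an iteration now
        seen = []
        for j in range(n):
            # Value of component j as seen by processor i: possibly outdated.
            lag = rng.randint(0, min(max_delay, len(history) - 1))
            seen.append(history[-1 - lag][j])
        state = list(history[-1])
        state[i] = sum(B[i][j] * seen[j] for j in range(n)) + c[i]
        history.append(state)
    return history[-1]


# Hypothetical fixed-point problem x = Bx + c whose solution is (1, 1, 1).
B = [[0.0, 0.25, 0.0], [0.25, 0.0, 0.25], [0.0, 0.25, 0.0]]
c = [0.75, 0.5, 0.75]
x = async_iterations(B, c)
```

No component ever waits for fresh data from the others, yet the iterates still approach the fixed point on this example; with an iteration function that is not contracting, the same scheme may fail to converge, which is the difficulty mentioned above.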
Figure 1. Execution flow of a SISC algorithm with two processors.
Figure 2. Execution flow of a SIAC algorithm with two processors. In this example, the first half of data is sent as soon as updated and the second half is sent at the end of the iteration.
Figure 3. Execution flow of an AIAC algorithm with two processors. Dashed lines represent the communications of the first half of data, and solid lines are for the second half.
2 Why use load balancing in AIAC?
The scope of this paper is to study the interest of load balancing in the AIAC model. One of our goals is to show that, contrary to a generally accepted idea, asynchronism does not remove the need to balance the workload. Indeed, load balancing can efficiently take into account the heterogeneity of the machines involved in the parallel iterative computation. This heterogeneity can be found at the hardware level when using machines with different speeds, but also at the user level if the machines are used in multi-user or multi-task contexts. All these cases are especially encountered when dealing with grid computing.
Moreover, even in a homogeneous context, this coupling has the great advantage of dealing with the evolution of the computation during the iterative process. In numerous problems solved by iterative algorithms, the progression towards the solution is not the same for all the components of the system, and some of them reach the fixed point faster than others. By performing load balancing with some criteria based on this progression (the residual for example), it is then possible to improve the distribution of the actually evolving computations over the processors.
Hence, there are two main ideas motivating the coupling of load balancing and AIAC algorithms:
• when the workload is well balanced on the distributed system, asynchronism makes it possible to efficiently overlap communications by computations, especially on networks with very fluctuating latencies and/or bandwidths.
• even if AIACs are potentially more efficient than the other models, they do not take into account the workload distribution over the processors. If this is well managed, still better performance can reasonably be expected.
The great advantage of AIACs in this context is that they are far more flexible than synchronous ones, in that it is less imperative to have exactly the same amount of work on each processor at all times. The goal here is then to avoid too large differences of progress between processors. A non-centralized strategy of load balancing appears to be best suited since it avoids global communications, which would synchronize the processors. It also allows an adaptive load balancing strategy according to the local context.
3 Non-centralized load balancing models
The load balancing problem has been widely studied from different perspectives and in different contexts. A categorization of the various techniques for load balancing can be found in [9], based on criteria like centralized/distributed, static/dynamic, and synchronous/asynchronous. To be concise, we present here the few techniques which are the most suited to AIAC algorithms.
In the context of parallel iterative computations, the schedule of load balancing must be non-centralized and iterative by nature. Local iterative load balancing algorithms were first proposed by Cybenko in [7]. These algorithms iteratively balance the load of a node with its neighbors until the whole network is globally balanced. There are mainly two iterative load balancing algorithms: diffusion algorithms [7] and their variants, the dimension exchange algorithms [9, 7]. Diffusion algorithms assume that a processor simultaneously exchanges load with its neighbors, whereas dimension exchange algorithms assume that a processor exchanges load with only one neighbor (along each dimension or link) at each time step.
Unfortunately, these techniques are all synchronous, which is not convenient for the AIAC class of algorithms. Bertsekas and Tsitsiklis have proposed in [5] an asynchronous model for iterative non-centralized load balancing. The principle is that each processor has an evaluation of its load and of those of all its neighbors. Then, at some given times, this processor looks for its neighbors which are less loaded than itself. Finally, it distributes a part of its load to all these processors. A variant mentioned by the authors is to send a part of the work only to the lightest loaded neighbor. This last variant has been chosen for implementation in our AIAC algorithms since it has the most suitable properties: it maintains the asynchronism in the system with only local communications between two neighboring nodes.
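The "lightest loaded neighbor" variant can be sketched on a chain of processors. The code below is a simplified, sequential illustration and not the model of [5] itself: the load is treated as a continuous quantity, the transferred `fraction` of the load difference is an assumed parameter, and the function name is hypothetical.

```python
def balance_with_lightest_neighbor(loads, node, fraction=0.5):
    """Move part of the load of `node` to its lightest loaded neighbor on a chain.

    Returns the amount transferred (0.0 if no neighbor is less loaded).
    Assumes the chain contains at least two nodes.
    """
    neighbors = [n for n in (node - 1, node + 1) if 0 <= n < len(loads)]
    lightest = min(neighbors, key=lambda n: loads[n])
    if loads[lightest] >= loads[node]:
        return 0.0
    # Assumed rule: transfer a fixed fraction of the load difference.
    transfer = fraction * (loads[node] - loads[lightest])
    loads[node] -= transfer
    loads[lightest] += transfer
    return transfer


# Hypothetical initial situation: all the work is on the first processor.
loads = [100.0, 0.0, 0.0, 0.0]
for _ in range(500):
    for node in range(len(loads)):
        balance_with_lightest_neighbor(loads, node)
```

Each call only involves two neighboring nodes, so no global communication is needed; on this example the repeated local exchanges drive the chain towards a uniform load while the total load is conserved.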
In the following section, we describe a typical problem of Ordinary Differential Equations (ODEs) which has been chosen for our experiments, the Brusselator problem.
4 The Brusselator problem
In this section, we present the Brusselator problem, which is a large stiff system of ODEs. Thus, as pointed out by Burrage [6], the use of implicit methods is required and then large systems of nonlinear equations have to be solved at each iteration. Parallelism is therefore natural for this kind of problem.
The Brusselator system models a chemical reaction mechanism which leads to an oscillating reaction. It deals with the conversion of two elements A and B into two others C and D by the following series of steps:
$$\begin{aligned}
A &\rightarrow X \\
2X + Y &\rightarrow 3X \\
B + X &\rightarrow Y + C \\
X &\rightarrow D
\end{aligned} \qquad (3)$$
There is an autocatalysis and, when the concentrations of A and B are maintained constant, the concentrations of X and Y oscillate with time. For any initial concentrations of X and Y, the reaction converges towards what is called the limit cycle of the reaction. This is the graph representing the concentration of X against that of Y, and it corresponds in this case to a closed loop.
The desired results are the evolutions of the concentrations $u$ and $v$ of both elements X and Y along the discretized space as a function of time. If the discretization is made with $N$ points, the evolution of the $u_i$ and $v_i$ for $i = 1, \ldots, N$ is given by the following differential system:
$$\begin{aligned}
u_i' &= 1 + u_i^2 v_i - 4u_i + \alpha (N+1)^2 (u_{i-1} - 2u_i + u_{i+1}) \\
v_i' &= 3u_i - u_i^2 v_i + \alpha (N+1)^2 (v_{i-1} - 2v_i + v_{i+1})
\end{aligned} \qquad (4)$$
The boundary conditions are:
$$u_0(t) = u_{N+1}(t) = \alpha (N+1)^2, \qquad v_0(t) = v_{N+1}(t) = 3$$
and the initial conditions are:
$$u_i(0) = 1 + \sin(2\pi x_i) \ \text{ with } x_i = \frac{i}{N+1}, \quad i = 1, \ldots, N, \qquad v_i(0) = 3$$
Here, we fix the time interval to $[0, 10]$ and $\alpha = \frac{1}{50}$. $N$ is a parameter of the problem.
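For readers who want to experiment with the system, the right-hand side of (4) and the initial conditions can be written directly from the formulas above. The following Python sketch is an illustration only: the function names are hypothetical, plain lists are used instead of an optimized array library, and the boundary values are taken exactly as printed in the text.

```python
import math


def brusselator_rhs(u, v, alpha=1.0 / 50.0):
    """Right-hand side of system (4) for the interior points i = 1..N."""
    n = len(u)
    c = alpha * (n + 1) ** 2
    # Pad with the boundary values u_0 = u_{N+1} and v_0 = v_{N+1} of the text.
    ub = [c] + list(u) + [c]
    vb = [3.0] + list(v) + [3.0]
    du, dv = [], []
    for i in range(1, n + 1):
        reaction = ub[i] ** 2 * vb[i]
        du.append(1.0 + reaction - 4.0 * ub[i]
                  + c * (ub[i - 1] - 2.0 * ub[i] + ub[i + 1]))
        dv.append(3.0 * ub[i] - reaction
                  + c * (vb[i - 1] - 2.0 * vb[i] + vb[i + 1]))
    return du, dv


def initial_state(n):
    """Initial conditions u_i(0) = 1 + sin(2*pi*x_i), v_i(0) = 3, x_i = i/(N+1)."""
    u0 = [1.0 + math.sin(2.0 * math.pi * i / (n + 1)) for i in range(1, n + 1)]
    v0 = [3.0] * n
    return u0, v0


u0, v0 = initial_state(3)
du0, dv0 = brusselator_rhs(u0, v0)
```

Each derivative only involves the point itself and its two immediate neighbors in space, which is the source of the local data dependencies between processors.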
For further information about this problem and its formulation, the reader should refer to [8].
5 AIAC algorithm and load balancing
In this section, we consider the use of a network of workstations composed of NbProcs machines (processors, nodes...) numbered from 0 to NbProcs−1. Each processor can send data to and receive data from any other one.
It must be noted that the principle of AIAC algorithms is generic and can be adapted to every iterative process under convergence hypotheses which are satisfied for a large class of problems. In most cases, the adaptation comes from the data dependencies, the function to approximate and the methods used for intermediate computations. In this way, these algorithms can be used to solve either linear or nonlinear systems, which can be stationary or not.
In the case of the Brusselator problem, the $u_i$ and $v_i$ of the system are represented in a single vector as follows:
$$y = (u_1, v_1, \ldots, u_N, v_N)$$
with $u_i = y_{2i-1}$ and $v_i = y_{2i}$, $i \in \{1, \ldots, N\}$.
The $y_j$ functions, $j \in \{1, \ldots, 2N\}$, thereby defined will also be referred to as spatial components in the remainder of the article.
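The interleaved layout of the vector $y$ can be made concrete with a few lines of Python. This is only an illustrative sketch with hypothetical helper names; note that the article numbers components from 1 whereas Python lists are indexed from 0.

```python
def pack(u, v):
    """Build y = (u_1, v_1, ..., u_N, v_N) from the two concentration lists."""
    y = []
    for ui, vi in zip(u, v):
        y.extend((ui, vi))
    return y


def unpack(y):
    """Recover (u, v) from the interleaved vector y."""
    return y[0::2], y[1::2]


def component_indices(i):
    """0-based positions in y of u_i = y_{2i-1} and v_i = y_{2i} (i counted from 1)."""
    return 2 * i - 2, 2 * i - 1


y = pack([1.0, 2.0], [10.0, 20.0])
```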
5.1 The AIAC algorithm solving the Brusselator problem
To solve the system (4), we use a two-stage iterative algorithm:
• At each iteration:
  – use the implicit Euler algorithm to approximate the derivative,
  – use the Newton algorithm to solve the resulting nonlinear system.
The inner procedure will be called Solve in our algorithm.
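The two stages can be illustrated on a scalar equation $y' = f(y)$. The sketch below is not the Solve procedure of the article, whose details are not given here: it is a minimal implicit Euler step in which the implicit relation $z = y_n + \delta t\, f(z)$ is solved by Newton iterations, with hypothetical names, tolerances and test problem.

```python
def implicit_euler_step(f, dfdy, y, dt, tol=1e-12, max_newton=50):
    """Advance y' = f(y) by one implicit Euler step of size dt.

    The unknown z = y_{n+1} satisfies F(z) = z - y - dt * f(z) = 0,
    which is solved by Newton iterations started from z = y.
    """
    z = y
    for _ in range(max_newton):
        residual = z - y - dt * f(z)
        if abs(residual) <= tol:
            return z
        # Newton update: z <- z - F(z) / F'(z), with F'(z) = 1 - dt * f'(z).
        z -= residual / (1.0 - dt * dfdy(z))
    raise RuntimeError("Newton iterations did not converge")


# Hypothetical stiff test problem y' = -50 y, y(0) = 1, integrated over [0, 1].
y = 1.0
for _ in range(10):
    y = implicit_euler_step(lambda z: -50.0 * z, lambda z: -50.0, y, dt=0.1)
```

On this linear test problem each step divides the value by $1 + 50\,\delta t = 6$, and the scheme remains stable although the time step is large compared with the time scale of the equation, which is why implicit methods are used for stiff systems.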
In order to exploit the parallelism, the $y_j$ functions are initially homogeneously distributed over the processors. Since these functions are represented in a one-dimensional space (the state vector $y$), we have chosen to logically organize our processors in a linear way and map the spatial components ($y_j$ functions) over them. Hence, each processor applies the Newton method over its local components using the needed data from other processors involved in its computations. From the Brusselator problem formulation, it follows that the processing of components $y_p$ to $y_q$ also depends on the two spatial components before $y_p$ and the two spatial components after $y_q$. Hence, if we consider that each processor owns at least two functions $y_j$, the non-local data needed by each processor to perform its iterations come only from the previous processor and the following one in the logical organization. In practical cases, there will be many more than two functions on each node.
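The initial distribution and the resulting dependencies can be sketched as follows. This is an assumed block distribution written for illustration (hypothetical function names, 0-based global indices); the article only states that the components are homogeneously distributed over linearly organized processors.

```python
def local_range(rank, nb_procs, nb_components):
    """First and last (inclusive, 0-based) global indices owned by `rank`."""
    base, extra = divmod(nb_components, nb_procs)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0) - 1
    return start, end


def halo(rank, nb_procs, nb_components, width=2):
    """Global indices needed from the left and right neighbors (two on each side)."""
    start, end = local_range(rank, nb_procs, nb_components)
    left = list(range(max(0, start - width), start))
    right = list(range(end + 1, min(nb_components, end + 1 + width)))
    return left, right
```

For instance, with 20 components on 4 processors, `local_range(1, 4, 20)` gives the block 5..9 and `halo(1, 4, 20)` lists components 3, 4 on the left and 10, 11 on the right, while the first processor has no left dependency.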
Algorithm 1 presents the core of the AIAC algorithm without load balancing. Since the convergence detection and halting procedure are not directly involved in the modifications brought by the load balancing, only the iterative computations and corresponding communications are detailed.
In this algorithm, the arrays Ynew and Yold always have the following organization: the two last components from the left neighbor, the local components of the node and the two first components of the right neighbor. This structure will have to be maintained even when performing load balancing. The StartC and EndC variables are used to indicate the beginning and the end of the local components actually computed by the node. Finally, the δt variable represents the precision of the time discretization needed to compute the evolution of spatial components in time.
In order to facilitate and enhance the implementation of asynchronous communications, we have chosen to use the PM2 multithreaded programming environment [10]. This kind of environment makes it possible to perform the send and receive operations in additional threads rather than in the main program. This is why the receipts of data do not directly appear in our algorithms. In fact, they are localized in functions called by a thread created at the beginning of the program and dealing with incoming messages. Thus, when a sending operation is performed to a given processor, it must be specified which function on the destination node will manage the message. In the same way, the asynchronous sending operations appearing in our algorithms actually correspond to the creation of a communication thread calling the related sending function.
The receive functions given in Algorithms 2 and 3 only consist in receiving two components from the corresponding neighbor (left or right) and putting them at the right place, before or after the local components, in array Ynew. It can be noticed that all the variables in Algorithm 1 can be directly accessed by the receive functions since they are in threads which share the same memory space.
For each communication function (send or receive), a mutual exclusion system is used to prevent simultaneous threads from performing the same kind of communication with different data, which could lead to incoherent situations and also to useless overloading of the network. This also has the advantage of generating fewer communications. Hence, the AIAC variant used here and detailed in Figure 4 is slightly different from the general case given in Figure 3.

Algorithm 1 Unbalanced AIAC algorithm
  Initialize the communication interface
  NbProcs = Number of processors
  MyRank = Rank of the processor
  Yold, Ynew = Arrays of local spatial components
  StartC, EndC = Indices of the first and last local spatial components
  ReT = Range of evolution time of the spatial components
  StartT, EndT = First (0) and last (ReT/δt) values of time
  Initialization of local data
  repeat
    for j = StartC to EndC do
      for t = StartT to EndT do
        Ynew[j,t] = Solve(Yold[j,t])
      end for
      if j = StartC+2 and MyRank > 0 then
        if there is no left communication in progress then
          Send asynchronously the two first local components to left processor
        end if
      end if
    end for
    if MyRank < NbProcs-1 then
      if there is no right communication in progress then
        Send asynchronously the two last local components to right processor
      end if
    end if
    Copy Ynew in Yold
  until Global convergence is achieved
  Display or save local components
  Halt the communication system

Algorithm 2 function RecvDataFromLeft()
  Receive two components from left node
  Put these components before local components in array Yold

Algorithm 3 function RecvDataFromRight()
  Receive two components from right node
  Put these components after local components in array Yold
Figure 4. Execution flow of our AIAC variant with two processors. Dashed lines represent communications which are not actually performed due to mutual exclusion. Solid lines starting during iterations correspond to left sendings whereas those at the end of iterations are for right ones.
5.2 Load balanced version of the AIAC algorithm
As mentioned in Section 3, Bertsekas et al. have proposed a theoretical algorithm to perform load balancing asynchronously and have proved its convergence. We have used this model to design our load balancing algorithm adapted to parallel iterative algorithms and particularly to AIACs on the grid. Each processor periodically tests whether it has to balance its load with one of its neighbors, the left or the right one here. If needed, it sends a given amount of data to its lightest loaded neighbor.
Algorithm 4 presents the load balanced version of the AIAC algorithm given in Section 5.1. For clarity, implementation details which are specific to the programming environment used are not shown.
Most of the additional parts take place at the beginning of the main loop. At each iteration, we test whether a load balancing process has been performed. If it is the case, the data arrays have to be resized in order to contain just the local components assigned to the node. Hence, a second test is performed to see whether the node has received or sent data. In the former case, the arrays have to be enlarged in order to receive the additional data, which then have to be copied into the new array. In the latter, the arrays have to be reduced and no data copying is necessary.
If no load balancing has been performed, several things have to be tested before performing a load balancing towards the left or right processor. The first test allows us to try load balancing periodically, every k iterations. This is useful to tune the frequency of load balancing during the iterative process, which directly depends on the problem considered. In some cases, a high frequency will be efficient whereas in other cases lower frequencies will be recommended, since too much load balancing could take up most of the computation time of the process relative to the iterations, especially with low bandwidth networks.
The second test detects whether a communication from a previous load balancing is not finished yet. In this case, the trial is delayed to the next iteration and so on until the previous communication is completed. Otherwise, the corresponding function is called.
It can be noticed that, according to the current organization of these tests, the left load balancing is tested before the right one, which could seem to favor it. In fact, this is not actually the case and this does not alter the generality of our algorithm. This has only been done to avoid simultaneous load balancings of a processor with its two neighbors, which would not conform to the model used.
Finally, the last point in the main algorithm concerns the data sendings performed at each iteration. Since the arrays may change from one iteration to another, we have to ensure that the received data correspond to the local data before (/after) the current arrays and can then be safely put before (/after) them. This is why the global position of the two first (/last) components is attached to the data. Moreover, in order to decide whether or not to balance the load, the local residuals are used and thus sent together with the components. It may seem surprising to use the residual as a load estimator, but this choice is very well adapted to this kind of computation, as explained in Section 2. At first sight, one could think that taking, for example, the time to perform the k last iterations would give a better criterion. Nevertheless, the local residual allows us to take into account the progress of the current computations on a given processor. So, if a processor has a low residual, its components are no longer evolving much and its computations are not so useful for the overall progression of the algorithm. Hence, it can receive more components to treat in order to potentially increase its usefulness and also allow its neighbor to progress faster.
Algorithm 5 details the function that balances the load with the left neighbor. Obviously, this function has its symmetrical version for the right neighbor. Its first step is to test whether a balancing is actually needed by computing the ratio of the residuals on the two processors and comparing it to a given threshold. If this test is satisfied, the number of data to send is then computed and another test is done to verify that the number of data remaining on the processor will be large enough. This is done to avoid the famine phenomenon on the slowest processors. Finally, the computed number of first (/last) data are asynchronously sent together with two more components, which represent the data dependencies of the left (/right) processor. These two additional data continue to be computed by the current processor, but their values are sent to the left (/right) processor to allow it to perform its own computations with updated values of its data dependencies. In the same way, the two components before (/after) those two are kept on the current processor and become its data dependencies from the left (/right) neighbor.

Algorithm 4 Load balanced AIAC algorithm
  Initialize the communication interface
  Variables from Algorithm 1
  LBDone = boolean indicating if LB has just been performed
  LBReceipt = boolean indicating if additional data from LB have been received
  OkToTryLB = integer allowing to periodically test for performing LB. Initially set to 20
  Initialization of local data
  repeat
    if LBDone = true then
      if LBReceipt = true then
        Resize Ynew, Yold arrays after receipt of additional data
        Complete new Yold array with additional data from temporary array
        LBReceipt = false
      else
        Resize Ynew, Yold arrays after sending of transferred data
      end if
      LBDone = false
    else
      if OkToTryLB = 0 then
        if there is no left LB communication in progress then
          TryLeftLB()
        else
          if there is no right LB communication in progress then
            TryRightLB()
          end if
        end if
      else
        OkToTryLB = OkToTryLB-1
      end if
    end if
    for j = StartC to EndC do
      ... /* ... indicate the same parts as in Algorithm 1 */
      Send asynchronously the two first local components and the residual of previous iteration preceded by their global position to left processor
      ...
    end for
    ...
    Send asynchronously the two last local components and the residual of current iteration preceded by their global position to right processor
    ...
  until Global convergence is achieved
  ...
Algorithm 5 function TryLeftLB()
  /* symmetrical for TryRightLB() */
  Ratio = Ratio of residuals between local node and its left neighbor
  NbLocal = Number of local data
  NbToSend = Number of data to send to perform LB
  Ratio = local residual / left residual
  if Ratio > ThresholdRatio then
    Compute the number of data to send NbToSend
    if NbLocal-NbToSend > ThresholdData then
      Send asynchronously the NbToSend+2 first data to left processor
      /* +2 added for data dependencies */
      OkToTryLB = 20
      LBDone = true
    end if
  end if
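The decision logic of TryLeftLB can be sketched in Python as follows. The two tests mirror Algorithm 5, but the formula used to compute the number of data to send is an assumption introduced for illustration, since the article does not give it; the default thresholds and the handling of a zero residual are likewise hypothetical.

```python
def try_left_lb(local_residual, left_residual, nb_local,
                threshold_ratio=2.0, threshold_data=4):
    """Return the number of components to migrate to the left neighbor (0 if none)."""
    if left_residual > 0.0:
        ratio = local_residual / left_residual
    else:
        # Assumed convention when the left neighbor has fully converged.
        ratio = float("inf") if local_residual > 0.0 else 0.0
    if ratio <= threshold_ratio:
        return 0  # the two nodes progress at comparable rates
    # Assumed rule (not given in the article): send at most half of the
    # local data, and less when the residuals are closer to each other.
    nb_to_send = int(nb_local * (1.0 - 1.0 / ratio) / 2.0)
    if nb_local - nb_to_send > threshold_data:
        return nb_to_send  # the caller would then send nb_to_send + 2 data
    return 0  # keep enough data locally to avoid famine
```

With these assumed settings, `try_left_lb(4.0, 1.0, 100)` asks for 37 components to be migrated, whereas `try_left_lb(4.0, 1.0, 6)` returns 0 because too few components would remain on the node.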
Concerning the receipt functions, the first kind, shown in Algorithm 6, is related to the load balancing whereas the second type, given in Algorithm 7, deals with the classical data exchanges induced by dependencies. The former function consists in placing additional data into a temporary array until they are copied into the resized array Yold, after which the temporary array is destroyed. Once the receipt is done, the flags indicating the completion of a load balancing communication and its nature are set. The latter function has the same role as the one presented in Algorithm 2. Nonetheless, in this version, the global position of the received data must be compared with the expected one before storing them in the array. Also, the residual obtained on the source node is an additional data item to receive.
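The position check performed when dependency data are received can be sketched as follows. This is a simplified, single-threaded illustration with a hypothetical message format (a dictionary with the keys `position`, `values` and `residual`); the actual functions run in communication threads of the programming environment and are not reproduced here.

```python
HALO = 2  # two dependency components on each side, as in the text


def recv_data_from_left(yold, start_global, message):
    """Store the left dependencies only if they match the current local layout.

    yold         : local array laid out as [left halo | local data | right halo]
    start_global : global index of the first local component
    message      : dict with keys "position", "values" and "residual"
    Returns (stored, residual received from the left node).
    """
    expected = start_global - HALO
    stored = message["position"] == expected and len(message["values"]) == HALO
    if stored:
        yold[:HALO] = message["values"]
    # The residual of the left node is kept in both cases.
    return stored, message["residual"]
```

A message whose global position no longer matches, for instance because the local arrays changed after a load balancing, is simply ignored and the dependencies keep their previous values until the next message arrives.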
Algorithm 6 function RecvDataFromLeftLB()
  /* symmetrical for RecvDataFromRightLB() */
  Receive the number of additional data sent
  Receive these data and put them in a temporary array
  LBReceipt = true
  LBDone = true
Algorithm 7 function RecvDataFromLeft()
  /* symmetrical for RecvDataFromRight() */
  if not accessing data array then
    Receive the global position and the two components from left node
    if global position corresponds to the two left data needed on local node then
      Put these data before local components in array Yold
    else
      Do not store these data in array Yold
      /* array Yold is being resized */
    end if
    Receive the residual obtained on the left node
  end if

Finally, we obtain a load balanced AIAC algorithm which solves the Brusselator problem.
6 Experiments
In order to perform our experiments, we have used the PM2 (Parallel Multithreaded Machine) environment [10]. Its first goal is to efficiently support irregular parallel applications on distributed architectures. We have already shown in [3] the convenience of this kind of environment for programming asynchronous iterative algorithms in a global context of grid computing.
In order to evaluate the gain obtained by coupling load balancing with asynchronism, the balanced and non-balanced versions of our AIAC algorithm are compared in two different contexts. The former is a local homogeneous cluster with a fast network and the latter is a collection of heterogeneous machines scattered over distant sites. In this last context, the machines were subject to multi-user utilization directly influencing their load. Hence, our results correspond to the average of a series of executions.
Figure 5 shows the evolution of execution times as a function of the
number of processors on a local homogeneous cluster. It can be seen
that both versions scale very well. This is an important point since
load balancing usually introduces significant overheads in parallel
algorithms, leading to only moderate scalability. This good result
mainly comes from the non-centralized nature of the balancing used in
our algorithm. Nevertheless, the most interesting point is the large
vertical offset between the curves, which denotes a high gain in
performance. In fact, the ratio of execution times between the
non-balanced and balanced versions varies from 6.2 to 7.4 with an
average of 6.8. These results show the efficiency of coupling load
balancing with AIAC algorithms on a local cluster of homogeneous
machines.
Concerning the heterogeneous cluster, fifteen machines have been used
over three sites in France: Belfort, Montbéliard and Grenoble, between
which the speed of the network may vary sharply. The logical
organization of the system has been chosen to be irregular in order to
get a grid computing context that is not favorable to load balancing.
The machine types vary from a PII 400 MHz to an Athlon 1.4 GHz. The
results obtained are given in Table 1.
Here also, the balancing brings an impressive enhancement to the
performances of the initial AIAC algorithm.
[Figure 5: plot of Time against Number of processors on logarithmic
axes, with two curves, "Without LB" and "With LB"]
Figure 5. Execution times (in seconds) on a homogeneous cluster
version          non-balanced   balanced   ratio
execution time   515.3          105.5      4.88
Table 1. Execution times (in seconds) on a heterogeneous system
The ratio is smaller than on the local cluster because communications,
and hence data migrations, are more expensive. Although this ratio
remains very satisfying, this remark calls for a closer study of the
tuning of the load balancing frequency during the iterative process.
This is outside the scope of this article but will probably be the
subject of future work.
Despite this, load balancing is more interesting in this context than
on a local cluster. This comes from the fact that in the homogeneous
context, as was shown in [3], the synchronous and asynchronous
iterative algorithms have almost the same behavior and performances,
whereas in the global context of grid computing, the asynchronous
version reveals all its interest by providing far better results.
Hence, we can reasonably deduce that load balanced AIAC algorithms in a
local homogeneous context would only produce slightly better results
than their SISC counterparts, whereas in the global context, the
difference between the SISC and AIAC load balanced versions will be
much larger. In fact, the load balanced AIAC version will obtain the
very best performances.
As explained in Section 2 and confirmed by these experiments, load
balancing and asynchronism are not incompatible and can actually lead
to very efficient parallel iterative algorithms.
The essential points for reaching this efficiency are the way this
coupling is performed and the context in which it is used. The first
point has already been discussed, and the important role played by the
non-centralized nature of the balancing technique has been shown.
Concerning the second point, there are also some conditions which the
treated problem should satisfy to ensure good performances.
According to our experiments, at least four conditions appear to be
required to get efficient load balancing in asynchronous iterative
algorithms. The first one concerns the number of iterations, which must
be large enough to make load balancing worthwhile. In the same way, the
average time to perform one iteration must be long enough to give a
reasonable ratio of computations to communications. Otherwise, the load
balancing will not noticeably influence the performances and will have
the drawback of overloading the network. Another important point is the
frequency of load balancing operations, which must be neither too high
(to avoid overloading the system) nor too low (to avoid too large an
imbalance in the system). It is then important to design a good measure
of the need to load balance. Finally, the last point is the accuracy of
the load balancing, which depends on the network load. If the network
is heavily loaded (or slow), it may be preferable to perform a coarse
load balancing with less data migration. On the other hand, an accurate
load balancing will tend to speed up the global convergence. The tricky
part is then to find a good tradeoff between these two constraints.
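The frequency and accuracy conditions above can be illustrated with a small decision heuristic. The following Python sketch is hypothetical and is not the criterion used in our experiments: the function names, the residual ratio test, and the default values of `ratio_threshold`, `min_interval` and `min_keep` are all assumptions chosen for the example. It also assumes that a larger residual signals a more heavily loaded node.

```python
def should_balance(local_residual: float, neighbor_residual: float,
                   iters_since_last_lb: int,
                   ratio_threshold: float = 2.0,
                   min_interval: int = 20) -> bool:
    """Decide whether to trigger a load balancing towards a neighbor.

    min_interval bounds the frequency from above (avoids overloading the
    system); ratio_threshold triggers a transfer once the imbalance
    becomes too large. Both values are illustrative assumptions.
    """
    if iters_since_last_lb < min_interval:
        return False
    if neighbor_residual <= 0.0:
        return False                  # no meaningful ratio can be formed
    return local_residual / neighbor_residual > ratio_threshold


def components_to_send(n_local: int, local_residual: float,
                       neighbor_residual: float, network_slow: bool,
                       min_keep: int = 4) -> int:
    """Choose how many components to migrate (accuracy of the balancing).

    On a slow or loaded network the transfer is halved, which gives a
    coarser balancing with less data migration.
    """
    if local_residual <= 0.0:
        return 0
    fraction = (1.0 - neighbor_residual / local_residual) / 2.0
    if network_slow:
        fraction /= 2.0               # coarse balancing
    amount = int(n_local * fraction)
    # Never send a negative amount, and always keep min_keep components.
    return max(0, min(amount, n_local - min_keep))
```

For instance, with 100 local components and residuals of 8.0 locally and 2.0 on the neighbor, this sketch would migrate 37 components on a fast network and 18 on a slow one.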
7 Conclusion
The general interest of load balancing parallel iterative algorithms
has been discussed and its strong efficiency in the context of grid
computing has been experimentally shown.
A comparison has been presented between a non-balanced and a balanced
asynchronous iterative algorithm. Experiments have been done with the
Brusselator problem using the PM2 multithreaded environment. The
algorithms have been tested in two representative contexts. The first
one is a local homogeneous cluster and the second one corresponds to a
global context of grid computing.
The results of these experiments clearly show that the coupling of load
balancing and asynchronism is fully justified since it gives far better
performances than asynchronism alone, which is itself better than
synchronous algorithms. The efficiency of this coupling comes from the
fact that these two techniques individually optimize two different
aspects of parallel iterative algorithms. Asynchronism brings a natural
and automatic overlapping of communications by computations, and load
balancing, as its name indicates, provides a good distribution of the
work over the processors. The advantage induced by the non-centralized
nature of the balancing technique has also been pointed out. Avoiding
global synchronizations leads to less overhead and thus to better
scalability.
In conclusion, balancing the load in asynchronous iterative algorithms
can actually bring higher performances in both local and global
contexts of grid computing.
References
[1] J. M. Bahi. Asynchronous iterative algorithms for nonexpansive
linear systems. Journal of Parallel and Distributed Computing,
60(1):92–112, Jan. 2000.
[2] J. M. Bahi, S. Contassot-Vivier, and R. Couturier. Evaluation of
the asynchronous model in iterative algorithms for global computing.
(in submission).
[3] J. M. Bahi, S. Contassot-Vivier, and R. Couturier. Asynchronism for
iterative algorithms in a global computing environment. In The 16th
Annual International Symposium on High Performance Computing Systems
and Applications (HPCS'2002), pages 90–97, Moncton, Canada, June 2002.
[4] D. E. Baz, P. Spiteri, J. C. Miellou, and D. Gazen. Asynchronous
iterative algorithms with flexible communication for nonlinear network
flow problems. Journal of Parallel and Distributed Computing,
38(1):1–15, 10 Oct. 1996.
[5] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed
Computation: Numerical Methods. Prentice Hall, Englewood Cliffs NJ,
1989.
[6] K. Burrage. Parallel and Sequential Methods for Ordinary
Differential Equations. Oxford University Press Inc., New York, 1995.
[7] G. Cybenko. Dynamic load balancing for distributed memory
multiprocessors. Journal of Parallel and Distributed Computing,
7(2):279–301, Oct. 1989.
[8] E. Hairer and G. Wanner. Solving ordinary differential equations
II: Stiff and differential-algebraic problems, volume 14 of Springer
series in computational mathematics, pages 5–8. Springer-Verlag,
Berlin, 1991.
[9] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and
K. Vairavan. Analysis of a graph coloring based distributed load
balancing algorithm. Journal of Parallel and Distributed Computing,
10(2):160–166, Oct. 1990.
[10] R. Namyst and J. F. Méhaut. PM2: Parallel multithreaded machine.
A computing environment for distributed architectures. In Parallel
Computing: State-of-the-Art and Perspectives, ParCo'95, volume 11,
pages 279–285. Elsevier, North-Holland, 1996.
[11] M. E. Tarazi. Some convergence results for asynchronous
algorithms. Numer. Math., 39:325–340, 1982.
[12] R. S. Varga. Matrix iterative analysis. Prentice-Hall, 1962.