Dynamic load balancing for switch-based networks

J. Parallel Distrib. Comput. 63 (2003) 286–298
Wan Yeon Lee (a), Sung Je Hong (b), Jong Kim (b), and Sunggu Lee (c)

(a) Division of IET, Hallym University, 1 Okchon-dong, Chunchon, Kangwon-do, 200-702, South Korea
(b) Department of CSE, Pohang University of Science and Technology (POSTECH), San-31 Hyoja Dong, Pohang 790-784, South Korea
(c) Department of EE, Pohang University of Science and Technology (POSTECH), San-31 Hyoja Dong, Pohang 790-784, South Korea

Received 15 November 2000; revised 4 January 2003; accepted 7 February 2003
Abstract

Recently, switch-based networks of workstations (NOWs) have been introduced as an alternative to traditional parallel computers. Although many dynamic load balancing algorithms have been developed for point-to-point networks (static networks), little progress has been made on load balancing in switch-based networks (dynamic networks). Thus, in this paper, we propose a dynamic load balancing algorithm, called the Switch Walking Algorithm (SWA), suitable for switch-based networks. In SWA, each processor's load information is gathered to form global load information, which is then used for load balancing. SWA is compared to a previous algorithm, called the Tree Walking Algorithm (TWA), which has been applied to switch-based networks. Through analysis, we show that SWA requires less communication time for distribution of global load information and migrates fewer tasks than TWA. Also, we show, through the implementation of a Mandelbrot set generation program, that SWA achieves about 20% better performance than TWA on a system with 32 processing elements.
© 2003 Elsevier Science (USA). All rights reserved.

Keywords: Load balancing; Switch-based network; Tree Walking Algorithm; SP2 machine; PC cluster
1. Introduction

Load imbalance in parallel computation can lead to processing inefficiencies. To achieve high efficiency, the computation load must be evenly distributed among the processing elements (PEs); load balancing aims to distribute the workload evenly among the available PEs. If the execution time of each task is known in advance, it is possible to distribute tasks evenly to all processors so as to minimize the overall completion time of the tasks; this is referred to as static load balancing. However, there are many applications in which the task loads change during the computation. In these applications, static load balancing cannot be applied, since the execution times of tasks cannot be known in advance. Whenever the load becomes severely unbalanced, redistribution of the load is necessary during runtime in order to retain high efficiency; this is referred to as dynamic load balancing. Dynamic load balancing is carried out through task migration, the transfer of tasks from overloaded PEs to underloaded PEs. To decide how to perform task migration, information about the current workload of the PEs must be exchanged among the PEs. However, the communication required to share load information and the computation required to make a balancing decision consume the processing power of the PEs, which may result in worse overall performance. The performance of dynamic load balancing algorithms depends heavily on the accuracy of each balancing decision and on the amount of computation and communication required to make balancing decisions [15].
Many algorithms [3,7,8,11,12,15–19] have been proposed for load balancing in parallel computation. Most of them have been developed for multicomputers with point-to-point networks (static networks), such as meshes or hypercubes, in which each PE has dedicated connections to some number of PEs, called neighboring PEs. However, little progress has been made on computers with switch-based networks (dynamic networks), in which each PE has a dynamic connection to any other PE obtained by setting the switching elements. There are many switch-based multicomputers, such as the IBM
This work was supported by the Research Grant from Hallym University, Korea, and the Ministry of Education of Korea through its BK21 program toward the Electrical and Computer Engineering Division at POSTECH.
Corresponding author. Fax: +82-33-242-2524.
E-mail addresses: wanlee@postech.ac.kr (W.Y. Lee), sjhong@postech.ac.kr (S.J. Hong), jkim@postech.ac.kr (J. Kim), slee@postech.ac.kr (S. Lee).
© 2003 Elsevier Science (USA). All rights reserved.
doi:10.1016/S0743-7315(03)00032-7
SP2 [14], TMC CM-5 [10], Meiko CS-2, and Data Diffusion Machine. Moreover, switch-based networks of workstations (NOWs), in which workstations or PCs are connected through high-performance switches, have recently been used as parallel machines [1]. This new trend, referred to as cluster computing, is gaining popularity as a new parallel processing paradigm, and many switch interconnects such as Myrinet [2], Autonet, S-connect, Tnet, and Spider have been developed to support fast communication among PEs [4]. The load balancing algorithms developed for point-to-point networks cannot be applied directly to switch-based networks. Even if those algorithms are applied to switch-based networks with some modifications, their performance may not be satisfactory, because the interconnection network significantly affects the performance of the load balancing algorithm [7,19]. Thus, in this paper, we propose a dynamic load balancing algorithm suitable for switch-based networks.
Generally, dynamic load balancing algorithms can be classified into three types: asynchronous neighbor diffusion, synchronous neighbor diffusion, and synchronous parallel algorithms. In asynchronous neighbor diffusion [19], PEs make balancing decisions based on local load information in a decentralized manner, and manage workload migrations within their neighborhoods. Successive local load balancing iterations cause the entire system to progress toward a globally balanced state. In synchronous neighbor diffusion [19], all PEs participate in performing load balancing at the same time, and the PEs within a local area may perform load balancing based on the local load information. In synchronous parallel algorithms [5,12,16,17], balancing decisions are made based on the global load information, and all PEs cooperate to form and distribute the global load information. The synchronous parallel algorithms can accurately balance the load and reduce unnecessary task migrations. Algorithms belonging to asynchronous neighbor diffusion are the Gradient Model [15] and Receiver (or Sender) Initiated Diffusion [15]. Algorithms belonging to synchronous neighbor diffusion are the Dimension Exchange Method [15,19] and Generalized Dimension Exchange [18]. Synchronous parallel algorithms include the Tree Walking Algorithm [12,16], Cube Walking Algorithm [16], Mesh Walking Algorithm [16], Modified Cube Walking Algorithm [8], and Direct Dimension Exchange [17]. Willebeek-LeMair and Reeves [15] showed that the synchronous neighbor diffusion algorithms are superior to the asynchronous neighbor diffusion algorithms. Later, Wu and Shu [12,17] showed that the synchronous parallel algorithms are superior to the synchronous neighbor diffusion algorithms. A specific synchronous parallel algorithm, called the Tree Walking Algorithm (TWA), was applied to a switch-based multiprocessor, the CM-5 machine [12]. Since the algorithm was conceptually designed to support a tree-structured static network, it was applied after mapping the network of the CM-5 into a tree-structured network. Although TWA performs well in a tree-structured network, it results in heavy communication traffic and unnecessary task migrations when applied to switch-based networks [5].
The proposed load balancing algorithm, called the Switch Walking Algorithm (SWA), is a synchronous approach; that is, all PEs perform load balancing concurrently. To achieve accurate global load balancing, each PE's load information is gathered to form global load information. Based on the global load information, the excessive or lacking load amount of each PE is calculated, and direct task migrations can occur from PEs with excessive loads to PEs with lacking loads. In addition, all PEs cooperate to share the global load information using a low-time-complexity algorithm. Collective communication methods, in which all PEs participate, are designed and utilized in SWA. The communication methods used to broadcast a message to all PEs and to gather messages from all PEs are implemented in software using only unicasts, since a PE is normally allowed to exchange at most one message at a time on switch-based systems. We analyze the proposed algorithm and compare it with the previous algorithm, TWA. The analytical evaluation results show that the proposed algorithm can significantly reduce the communication time needed to share global load information, maximize the locality of tasks, and minimize the number of task migrations. We also implement the proposed algorithm for the execution of a Mandelbrot set generation application on a 32-node IBM SP2 machine and on a cluster system consisting of 14 Pentium III PCs connected by two Myrinet switches. The implementation results show that the proposed algorithm performs much better than TWA as the system size becomes large. The proposed algorithm achieves about 20% better performance than TWA on a system with 32 PEs.
The remainder of this paper is organized as follows. Section 2 formulates the dynamic load balancing problem and describes a model of a switch-based network. In Section 3, we examine the previous load balancing algorithm, TWA. Section 4 presents our dynamic load balancing algorithm in detail. Section 5 evaluates the proposed algorithm and compares it with TWA using analysis and the performance measured when executing a Mandelbrot set generation program. Finally, we provide concluding remarks in Section 6.
2. Preliminaries

2.1. The dynamic load balancing problem

In our problem model, a dynamic load balancing algorithm attempts to minimize the overall completion time of a single application running in a multiprocessor system. All PEs have an identical code image (single program multiple data programming model) of the application, and the application is partitionable into smaller grain-sized tasks. All tasks are computation-intensive and may be executed on any PE in any sequence. Due to the unpredictable nature of the task requirements, each task is considered to require equal computation time; thus, the load of a PE is measured as the number of tasks in its queue, and tasks are evenly distributed to PEs initially. Whenever the numbers of tasks in the PEs become severely unbalanced due to the grain-size variation of tasks, a load balancing process is triggered and tasks are redistributed so that each PE again has approximately the same number of tasks.
The performance of the load balancing algorithm depends heavily on the balancing accuracy and the scheduling overhead. The balancing accuracy of a load balancing algorithm is measured by the variation in loads after the PEs have executed the load balancing algorithm. The scheduling overhead of a load balancing algorithm consists of the communication overhead required to get the load information of other PEs (or to redistribute tasks) and the computation overhead required to perform the balancing decision. Let PE p_i have l_i tasks, and let the average number of tasks of the N PEs be L_avg = (Σ_{i=0}^{N-1} l_i) / N. Then, when loads are balanced accurately, each PE will have L_avg tasks. To accurately determine the required amount of task migration from overloaded PEs to underloaded PEs, each PE must communicate its load information with other PEs. As more load information of PEs is communicated, tasks can be redistributed more accurately to all PEs. However, the scheduling overhead of the load balancing algorithm will degrade the performance of the application. Hence, the objective of this paper is to find a load balancing algorithm with accurate balancing and small scheduling overhead.
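The accuracy target above can be made concrete with a few lines of Python; the load values below are made up for illustration and do not come from the paper:

```python
# Hypothetical per-PE task counts l_i for N = 4 PEs (illustrative only).
loads = [7, 3, 6, 4]
N = len(loads)

# L_avg = floor(sum(l_i) / N); L_rem is the leftover when the total
# does not divide evenly among the PEs.
L_avg, L_rem = divmod(sum(loads), N)

# An accurately balanced result gives every PE L_avg tasks
# (with L_rem of them holding L_avg + 1 when L_rem > 0).
print(L_avg, L_rem)  # 5 0
```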
2.2. System model

The interconnection network of a parallel system can be broadly divided into two classes: point-to-point (or static) and switch-based (or dynamic). In a point-to-point network, each PE has dedicated links connecting it to some number of neighboring PEs. The link connection pattern forms a specific topology such as a mesh, hypercube, or k-ary n-cube. Neighboring PEs can send messages directly to each other, but PEs not connected directly to each other must send messages through other PEs that relay messages from a source to a destination, as shown in Fig. 1(a). On the other hand, in a switch-based network, a PE can send messages to any other PE directly by dynamically setting the switching elements, as shown in Fig. 1(b).
When designing a network for a parallel system, another important issue is the communication architecture, for which there are two basic architectures: the k-port (k ≥ 2) and the one-port models [19]. The k-port model allows a PE to exchange messages with k PEs simultaneously, while the one-port model restricts a PE to exchanging messages with at most one PE at a time. Fig. 2(a) shows concurrent communication on a network with the one-port model, in which a PE must take four communication steps to send a message to four PEs: p_1 sends a message to p_0 in the first communication step, to p_2 in the second communication step, to p_4 in the third communication step, and to p_3 in the fourth communication step. Fig. 2(b) shows concurrent communication on a network with the 2-port model, in which a PE requires two communication steps to send a message to four PEs: p_1 sends a message to p_0 and p_4 in the first communication step, and to p_2 and p_3 in the second communication step. In the k-port model, each PE has a separate I/O module connected to each link and can send a message to k PEs simultaneously. However, the maximum concurrent communication capability of a PE is restricted to the number of links connected to the PE. Consequently, the network shown in Fig. 2 can be implemented using only the 2-port model or the one-port model, since the number of links connected to each PE is two.
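The step counts in the examples above follow from a simple ceiling division; a minimal sketch (the helper name is ours, not the paper's):

```python
import math

def one_to_many_steps(n_dest: int, ports: int) -> int:
    """A PE with `ports` simultaneously usable links needs
    ceil(n_dest / ports) steps to unicast one message to n_dest PEs."""
    return math.ceil(n_dest / ports)

print(one_to_many_steps(4, 1))  # one-port model: 4 steps
print(one_to_many_steps(4, 2))  # 2-port model: 2 steps
```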
The parallel system considered in this paper consists
of N homogeneous PEs connected through switches.
Fig. 1. Sending a message on two different networks: (a) a static network (mesh); (b) a dynamic network (fat tree).
Since a PE has one link connection in most switch-based networks and the k-port model is not a prevalent implementation, we assume that the communication architecture of the switch-based network is the one-port model. We also assume that multicast is not supported in hardware, that communication is based on the distance-insensitive wormhole routing [9,10] method, that links are full duplex (PEs can send and receive messages simultaneously on a link), and that switching elements support a crossbar connection. We represent such a system by a simple connected graph G = (V, E), where V is a set of nodes representing PEs or switching elements and E ⊆ V × V is a set of edges representing links. V is itself divided into two sets: a set of PEs, P, and a set of switching elements, S. In this paper, we represent a PE p ∈ P, a switching element s ∈ S, and a link e ∈ E using a circle, a square, and an edge, respectively. Each edge e ∈ E has one of the forms (p, s), (s, p), or (s, s), and corresponds to the communication link between a PE and a switch, or between two switches.
3. Tree Walking Algorithm

The Tree Walking Algorithm (TWA) [12,16] was originally designed to support tree-structured networks. The algorithm was applied to the TMC CM-5 machine (which has a switch-based network) by mapping the switch-based network into a tree-structured network, referred to as a scheduling tree. Fig. 3 shows a mapping of the CM-5 network with 16 PEs into the scheduling tree by Shu and Wu [12]. In TWA, all communications are based on the scheduling tree. Whenever a PE becomes idle (or its load is less than some threshold), it broadcasts a message to request a load balancing process. Then the root PE of the scheduling tree collects the load information, i.e., the number of tasks, of all PEs and then broadcasts the average load information. Each PE calculates the number of tasks which shall be assigned to its subtree in order to balance the total load. The number of tasks required is calculated by multiplying the number of PEs in that subtree by the average load. If the total number of tasks in the subtree is larger than the number of tasks calculated, the PE sends its excessive tasks to its parent PE. Otherwise, the PE receives its lacking tasks from its parent PE. TWA can be described more fully as follows.
Tree Walking Algorithm (TWA)

Phase 1 (Initiation of load balancing): An idle PE sends a message indicating the start of load balancing to its parent PE and child PEs. When PEs receive the start message, they retransmit it to their parent or child PEs. This retransmission is repeated until all PEs receive the message.
Phase 2 (Collection of load information): Each PE receives the load information from its child PEs. Then, the PE sums the total number of tasks in its own subtree, W, and sends it to its parent PE. This cumulative summation is repeated until the root PE of the scheduling tree is reached.
Phase 3 (Broadcast of average load):
3.1. The root PE calculates the average number of tasks L_avg (= ⌊W/N⌋) and the remainder L_rem (= W mod N).
3.2. The root PE sends L_avg and L_rem to its child PEs. When a PE receives L_avg and L_rem from its parent PE, the PE sends them to its child PEs. This process is repeated until the PEs in the leaf nodes are reached.
Phase 4 (Determination of task transfer): As the balancing quota of each PE, L_avg + 1 is assigned to the L_rem PEs p_0, ..., p_{L_rem - 1}, and L_avg is assigned to the N - L_rem PEs p_{L_rem}, ..., p_{N-1}. Then, each PE calculates the balancing quota of its own subtree, Q, by summing the quotas (L_avg or L_avg + 1) of the PEs in its subtree.
Fig. 2. Concurrent communications on a switch-based network: (a) the one-port model (four communication steps); (b) the 2-port model (two communication steps).
Phase 5 (Task migration): Each PE sends W - Q tasks to its parent PE if W > Q, and receives Q - W tasks from its parent PE otherwise.
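The subtree bookkeeping of Phases 2, 4, and 5 can be sketched in Python. This is our own illustrative reconstruction, not the paper's code: the scheduling tree is given as a child map, per-PE quotas give L_avg + 1 to the first L_rem identifiers, and each PE's transfer with its parent is W - Q (positive means tasks are sent up).

```python
def twa_transfers(children, loads, root=0):
    """For each PE, the number of tasks exchanged with its parent:
    positive = sent up to the parent, negative = received from it."""
    N = len(loads)
    L_avg, L_rem = divmod(sum(loads), N)
    # Quota per PE: L_avg + 1 for the first L_rem identifiers, L_avg otherwise.
    quota = lambda i: L_avg + 1 if i < L_rem else L_avg

    transfers = {}
    def walk(p):  # returns (PE ids in p's subtree, total tasks W in it)
        ids, w = [p], loads[p]
        for c in children.get(p, []):
            cids, cw = walk(c)
            ids += cids
            w += cw
        transfers[p] = w - sum(quota(i) for i in ids)  # W - Q
        return ids, w

    walk(root)
    return transfers

# A small 4-PE scheduling tree rooted at PE 0, with illustrative loads:
moves = twa_transfers({0: [1, 2], 1: [3]}, [5, 1, 2, 4])
```

Here PE 3 sends one task up to its parent, while PEs 1 and 2 each receive one task; the root's subtree is already balanced overall.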
In TWA, Phase 1 (Initiation of load balancing), Phase 2 (Collection of load information), and Phase 3 (Broadcast of average load) are each expected to require at least D communication steps, where D is the depth of the scheduling tree. Specifically, Phase 1 requires at least 2D communication steps when a leaf PE of the scheduling tree initiates a load balancing process.
The number of communication steps needed to perform Phases 1, 2, or 3 is larger than required in a switch-based network. In a tree-structured network, PEs can always use dedicated links to communicate with their parent PE or child PEs. However, in a switch-based network, PEs cannot send (or receive) messages simultaneously to (or from) all of their child PEs. Link contention, in which two or more PEs try to use a link simultaneously, may occur due to the limited number of link connections of the PEs. When PEs are connected to two links, they can communicate with at most two PEs simultaneously. Thus, if a PE sends messages to more than two child PEs at the same time, link contention occurs and some messages have to wait until a link becomes available. Link contention increases the number of communication steps required. Fig. 4(a) shows the contention-free TWA communication in the scheduling tree. Fig. 4(b) shows the link contention that results when those communication patterns are applied to the CM-5 network: p_0 tries to send a message to p_1, p_4, p_8, and p_12 simultaneously. To send a message to p_1 and p_4, p_0 already uses two links, and it then cannot send a message to p_8 since there is no available link.
4. The proposed algorithm

The proposed algorithm makes a balancing decision based on the global load information in order to achieve accurate load balancing. Since the communication time required to gather global information is large, all PEs cooperate to gather the global load information in the proposed algorithm. We first design efficient broadcasting and global gathering methods on switch-based networks. The characteristics of switch-based networks, such as the use of the one-port model and dynamic connection establishment between nodes, are considered in the design of the collective communication methods. In the one-port model, a PE is restricted to exchanging a message with at most one PE at a time. In addition, a PE can communicate messages directly with any other PE by setting switching elements dynamically in switch-based networks. Dynamic connection establishment allows tasks to migrate directly from an overloaded PE to an underloaded PE. Compared with the indirect task migration of TWA, direct task migration leads to faster load balancing and reduces the overall network traffic.
4.1. Broadcasting and global gathering

In the absence of multicast support in hardware, broadcast must be implemented with several unicast messages, even if it requires several communication steps. Also, gathering messages from all PEs must be implemented with unicast messages. When switches employ wormhole routing, the communication time of a unicast is mostly unaffected by the locations of the two PEs, provided that there is no contention [10]. The communication time required then depends heavily on the number of communication steps used for broadcasting or data gathering. Thus, the number of unicast communication steps should be minimized in order to achieve fast communication. We use the In-Fat multicast algorithm [4] for fast broadcasting,¹ which works as follows.
Fig. 3. Mapping the network of the CM-5 machine into a scheduling tree: (a) the network of the CM-5 machine; (b) the mapping into a scheduling tree.
¹ The In-Fat multicast algorithm was designed to work on fat-tree structured networks. When PEs are connected irregularly to one another, the multicast algorithm presented by Libeskind-Hadas [6] would be used instead.
Initially, PEs are assigned addresses in increasing lexicographical order from the left side to the right side. The destination PEs and the source PE of a multicast are sorted in increasing order of their addresses to form an ordered list, called a multicast list F. If the address of the source PE is in the lower half of the list, it sends a copy of the message to the PE having the smallest address in the upper half of the list. If the address of the source PE is in the upper half of the list, it sends a copy of the message to the PE having the largest address in the lower half of the list. The source PE continues this procedure until the partitioned list contains only itself. The destination PE receiving the message is responsible for delivering the message to the other PEs in its half, using the same procedure as the source. PEs receiving a message forward it to a set of destinations which have not yet received it, until all destinations receive the message.
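The halving procedure above can be sketched as a short recursive simulation. This is our own reconstruction from the description (function and variable names are ours); it returns the (step, sender, receiver) unicast schedule:

```python
def in_fat_schedule(members, source, step=1, schedule=None):
    """Simulate the In-Fat-style halving multicast.

    members: sorted list of PE addresses (source + destinations).
    Returns a list of (step, sender, receiver) unicasts."""
    if schedule is None:
        schedule = []
    if len(members) <= 1:
        return schedule
    half = len(members) // 2
    lower, upper = members[:half], members[half:]
    if source in lower:
        peer = upper[0]                       # smallest address in the upper half
        mine, theirs = lower, upper
    else:
        peer = lower[-1]                      # largest address in the lower half
        mine, theirs = upper, lower
    schedule.append((step, source, peer))
    in_fat_schedule(theirs, peer, step + 1, schedule)   # peer covers its half
    in_fat_schedule(mine, source, step + 1, schedule)   # source keeps halving its half
    return schedule

# The Fig. 5 scenario from the paper: source p_2, PEs p_0..p_7.
sched = in_fat_schedule(list(range(8)), 2)
```

Running this reproduces the Fig. 5 pattern: p_2 sends to p_4 in step 1; p_2 to p_1 and p_4 to p_6 in step 2; and p_1, p_2, p_4, p_6 reach p_0, p_3, p_5, p_7 in step 3, i.e., ⌈log 8⌉ = 3 steps in total.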
Fig. 5 shows a working example of the In-Fat multicast algorithm, where the source PE is p_2 and the multicast list F is [p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7]. As a result of the first half partitioning of the multicast list, p_2 sends a message to p_4 in the first communication step and has a modified multicast list F = [p_0, p_1, p_2, p_3]. As a result of the successive half partitioning, p_2 sends the message to p_1 in the second communication step and has a modified multicast list F = [p_2, p_3]. Also, p_4 sends the message to p_6 in the second communication step. Similarly, in the third communication step, p_1, p_2, p_4, and p_6 send the message to p_0, p_3, p_5, and p_7, respectively. The In-Fat multicast algorithm requires ⌈log N⌉ steps to reach N PEs, since the number of PEs receiving the message is doubled at each step. To prevent an increase in the communication overhead of load balancing, it is important to avoid link contention. It is guaranteed that the In-Fat multicast algorithm does not incur link contention [4]. Consequently, broadcasting a message to N PEs based on the In-Fat multicast algorithm requires exactly ⌈log N⌉ steps.
By applying the reverse of the In-Fat broadcasting steps, the gathering of global data can also be easily implemented. In the first step, the PEs which received a broadcast message in the last step of the In-Fat broadcasting send their messages in the opposite direction of the broadcasting communication pattern. On receiving a message for gathering, the PEs combine it with their own message until they have collected the messages of all the PEs to which they sent a multicast message during the broadcast communication. They then send the combined message to the PE which was originally the sender of the multicast message during the broadcast communication. This procedure continues until the source of the broadcast communication collects the messages of all PEs. Fig. 6(a) shows the reverse steps of the broadcasting communication in Fig. 5. The global information gathering algorithm, called the All-Gather, which uses the reverse steps of the broadcasting communication, is described as follows.
Fig. 6(b) shows an example of the All-Gather algorithm, in which p_2 gathers the load information of p_0, p_1, ..., p_7 and creates a global information list L = [3, 4, 0, 5, 3, 7, 4, 5].
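The reversal described above can be sketched as follows (our own self-contained reconstruction; names are illustrative): rebuild the broadcast schedule, then replay its unicasts in reverse order with the direction flipped, merging partial load lists along the way.

```python
def halving_schedule(members, source, step=1, out=None):
    """Compact restatement of the In-Fat halving broadcast (Section 4.1):
    returns the (step, sender, receiver) unicasts."""
    if out is None:
        out = []
    if len(members) > 1:
        half = len(members) // 2
        lower, upper = members[:half], members[half:]
        peer = upper[0] if source in lower else lower[-1]
        mine, theirs = (lower, upper) if source in lower else (upper, lower)
        out.append((step, source, peer))
        halving_schedule(theirs, peer, step + 1, out)
        halving_schedule(mine, source, step + 1, out)
    return out

def all_gather(loads, sink):
    """All-Gather: run the broadcast unicasts in reverse order and
    reverse direction, combining partial load information on the way."""
    pes = list(range(len(loads)))
    bcast = halving_schedule(pes, sink)
    known = {p: {p: loads[p]} for p in pes}     # each PE starts with its own load
    # Broadcast step s becomes gather step (last + 1 - s); direction flips,
    # so the broadcast receiver (dst) now sends its data to the sender (src).
    for s, src, dst in sorted(bcast, key=lambda t: -t[0]):
        known[src].update(known[dst])
    return [known[sink][p] for p in pes]
```

With the Fig. 6 loads and sink p_2, `all_gather([3, 4, 0, 5, 3, 7, 4, 5], 2)` collects the complete list in three reversed steps, using N - 1 = 7 unicasts.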
It can also be guaranteed that the All-Gather algorithm does not incur link contention, and thus the
Fig. 4. Link contention during broadcasting on a CM-5 network: (a) broadcasting in the scheduling tree; (b) link contention in the CM-5 network.
All-Gather algorithm requires exactly ⌈log N⌉ communication steps to gather messages from N PEs. To gather the load information of eight PEs, three communication steps are required. In the given broadcasting and gathering algorithms, all PEs cooperate to reduce the communication time. If one PE tried to broadcast a message to, or collect messages from, the other N - 1 PEs one by one, it would require at least N - 1 communication steps, while the given algorithms require only ⌈log N⌉ steps.
4.2. The Switch Walking Algorithm

In this subsection, we describe the proposed dynamic load balancing algorithm. It uses the broadcasting and gathering algorithms presented in Section 4.1. When a PE becomes idle (or its load is less than some threshold), it triggers load balancing by broadcasting a message which indicates the start of the load balancing process. Then all PEs send their load information to the idle PE, and the global load information list L = [l_0, l_1, ..., l_{N-1}] is created by the idle PE, where l_i is the load of p_i. The idle PE broadcasts the list L to all PEs, and all PEs perform the computation required to determine task migrations using the global load information. Each PE calculates the average load of the N PEs, L_avg = ⌊(Σ_{i=0}^{N-1} l_i) / N⌋, and the number of remaining tasks, L_rem = (Σ_{i=0}^{N-1} l_i) mod N. As the balancing quota q_i for each p_i, L_avg + 1 is assigned preferentially to L_rem PEs among the overloaded PEs, i.e., those whose l_i is larger than L_avg; L_avg is assigned to the remaining PEs. Assigning the remaining tasks to overloaded PEs introduces fewer task migrations than assigning them to underloaded PEs. Then, the excessive or lacking load amount (l_i - q_i when l_i > q_i, or q_i - l_i when l_i < q_i) of each PE can be calculated by comparing l_i and q_i. Next, two lists are created: the overloaded list OL, including the PEs whose current load is greater than their balancing quota, and the underloaded list UL, including the PEs whose current load is smaller than their balancing quota. The identifier of each PE is put into one of the two lists, repeated as many times as the excessive or lacking amount of the PE. The two lists are sorted in increasing order of the PEs' identifiers. Finally, each overloaded PE finds the underloaded PEs to be the recipients of its task migrations by matching its locations in OL with the corresponding locations in UL one by one. The overloaded PE sends tasks to a destination PE; the number of tasks sent is determined by the number of locations in OL that match the locations of the destination PE in UL. The proposed algorithm,
Fig. 5. An example of broadcasting using the In-Fat algorithm.
Fig. 6. An example of global gathering communication: (a) the reverse of the broadcasting communication; (b) forming a global information list.
called the Switch Walking Algorithm (SWA), is described as follows.
Switch Walking Algorithm (SWA)

Assign each PE a unique identifier i according to the lexicographic order.

Phase 1 (Initiation of load balancing): An idle PE broadcasts a message indicating the start of load balancing using the In-Fat algorithm with F = [p_0, p_1, ..., p_{N-1}].
Phase 2 (Collection of load information): The idle PE creates the global load list L using the All-Gather algorithm.
Phase 3 (Broadcast of global load information): The idle PE broadcasts L using the In-Fat algorithm with F = [p_0, p_1, ..., p_{N-1}].
Phase 4 (Determination of task transfers):
4.1. Each PE calculates L_avg and L_rem.
4.2. Each PE calculates Q = [q_0, q_1, ..., q_{N-1}].
4.3. Each PE creates OL and UL.
Phase 5 (Task migration): If l_i > q_i, each PE searches for its identifier in OL and finds its destination PEs by matching its locations in OL with the corresponding locations in UL. Each PE then sends l_i - q_i tasks to its destination PEs.

In SWA, Phases 1, 2, and 3 each require ⌈log N⌉ communication steps. In Phase 4, all PEs perform the computation required to determine task migrations individually, in a decentralized manner. All PEs have the same OL and UL as the result of Phase 4. Then, in Phase 5, the overloaded PEs in OL can find the underloaded PEs in UL by matching the two lists and determine how many tasks to send to the underloaded PEs to achieve a balanced load. An overloaded PE may send tasks to more than one underloaded PE when its excessive amount is larger than the lacking amount of one underloaded PE.
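Phases 4 and 5 can be sketched in a few lines of Python (our own reconstruction; function and variable names are illustrative, and the input below is the Fig. 7 data from the paper):

```python
def swa_plan(loads):
    """Phases 4-5 of SWA: per-PE quotas, the OL/UL lists, and the
    resulting (sender, receiver) -> task-count migrations."""
    N = len(loads)
    L_avg, L_rem = divmod(sum(loads), N)
    # Phase 4.2: L_avg + 1 goes preferentially to L_rem of the overloaded PEs.
    quotas, rem = [], L_rem
    for l in loads:
        if l > L_avg and rem > 0:
            quotas.append(L_avg + 1)
            rem -= 1
        else:
            quotas.append(L_avg)
    # Phase 4.3: repeat each identifier once per excessive/lacking task.
    OL = [i for i, (l, q) in enumerate(zip(loads, quotas)) for _ in range(l - q)]
    UL = [i for i, (l, q) in enumerate(zip(loads, quotas)) for _ in range(q - l)]
    # Phase 5: matching OL and UL position by position yields the migrations.
    moves = {}
    for s, r in zip(OL, UL):
        moves[(s, r)] = moves.get((s, r), 0) + 1
    return quotas, OL, UL, moves

# Fig. 7 data: 16 PEs, 86 tasks in total (L_avg = 5, L_rem = 6).
loads = [3, 4, 0, 5, 3, 7, 4, 5, 8, 6, 10, 4, 5, 9, 7, 6]
quotas, OL, UL, moves = swa_plan(loads)
```

Matching the two lists reproduces the migrations described below: p_5 sends one task to p_0, p_8 sends one task each to p_0 and p_1, p_10 sends four tasks to p_2, and so on.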
Fig. 7 shows a working example of SWA. After Phase 2, p_2 collects the global load information L = [3, 4, 0, 5, 3, 7, 4, 5, 8, 6, 10, 4, 5, 9, 7, 6] and broadcasts it to all PEs in Phase 3. In Phase 4.1, the average number L_avg and the remaining number L_rem are calculated as 5 and 6, respectively. The balancing quota of all PEs is created as Q = [5, 5, 5, 5, 5, 6, 5, 5, 6, 6, 6, 5, 5, 6, 6, 5] in Phase 4.2. In Phase 4.3, the overloaded list OL and the underloaded list UL are created as [5, 8, 8, 10, 10, 10, 10, 13, 13, 13, 14, 15] and [0, 0, 1, 2, 2, 2, 2, 2, 4, 4, 6, 11], respectively. If l_i < q_i, the identifier of the PE is inserted in UL, repeated as many times as the difference between the two values (the lacking amount of the PE). The identifiers of PEs are inserted in OL in a similar manner when l_i > q_i. For example, because l_0 is two smaller than q_0, the identifier of p_0 is inserted in the underloaded list two times, and the identifier of p_5 is inserted in the overloaded list one time because l_5 is one larger than q_5. Finally, in Phase 5, each overloaded PE finds a matching underloaded PE and decides how many tasks to send to it by matching OL and UL one by one: p_5 sends one task to p_0, p_8 sends one task to p_0, p_8 sends one task to p_1, p_10 sends four tasks to p_2, etc.
5. Performance evaluation
The performance of a dynamic load balancing algorithm depends heavily on the accuracy of each balancing decision and on the scheduling overhead of the load balancing process. The scheduling overhead consists of the computation overhead and the communication overhead. We first analyze the balancing accuracy and the scheduling overhead of SWA and evaluate them through comparison with TWA. Next, we compare practical implementations of SWA and TWA on an IBM SP2 parallel machine and a PC cluster system connected by Myrinet switches. The test application used is Mandelbrot set generation [13].
5.1. Analytical evaluation
The total load of N PEs can be written as Σ_{i=0}^{N-1} l_i = L_avg · N + L_rem. When L_rem = 0, accurate balancing should result in all PEs having L_avg load. Otherwise, accurate balancing should make the loads of all PEs differ by at most one task, where some PEs have L_avg load and the other PEs have L_avg + 1 load. The computation of Phases 4.2 and 4.3 can be written as follows:

FOR i = 0 TO N-1 DO
  IF l_i > L_avg AND L_rem > 0
    q_i ← L_avg + 1;
    L_rem ← L_rem - 1;
  ELSE q_i ← L_avg;
  ENDIF
ENDFOR

j ← 0; k ← 0;
FOR i = 0 TO N-1 DO
  diff ← l_i - q_i;
  IF diff > 0
    WHILE diff > 0
      OL[j] ← i; j ← j + 1;
      diff ← diff - 1;
    ENDWHILE
  ELSE
    WHILE diff < 0
      UL[k] ← i; k ← k + 1;
      diff ← diff + 1;
    ENDWHILE
  ENDIF
ENDFOR
Property 1. The balancing quality of SWA is optimal in the sense that SWA assigns all tasks to PEs equally so that the number of tasks in each PE differs by at most one.

Proof. In Phase 4.2 of SWA, Σ_{i=0}^{N-1} q_i = (L_avg + 1) · L_rem + L_avg · (N - L_rem) = L_avg · N + L_rem = Σ_{i=0}^{N-1} l_i. Since the two lists OL and UL are created according to the difference between l_i and q_i in Phase 4.3 and Σ_{i=0}^{N-1} (l_i - q_i) = 0, they have the same length. Consequently, each PE p_i has exactly q_i tasks after performing the task migrations from OL to UL in Phase 5. Then, L_avg + 1 tasks are assigned to L_rem PEs among the overloaded PEs and L_avg tasks are assigned to the other PEs. After executing SWA, the difference among the numbers of tasks in all PEs is zero when L_rem = 0 and one otherwise. □
We define tasks to be local if they are never migrated to other PEs and nonlocal if they are migrated to other PEs. Task migrations incur communication traffic and lengthen the total execution time due to the increased communication time. Thus, a load balancing algorithm resulting in fewer nonlocal tasks is preferable. The balancing accuracy of TWA is also optimal [12,16]. However, SWA produces fewer nonlocal tasks than TWA while achieving accurate load balancing. In SWA, the L_rem extra tasks are preferentially assigned to overloaded PEs so that L_rem overloaded PEs have L_avg + 1 tasks. In TWA, on the other hand, the remaining tasks are assigned not only to overloaded PEs but also to underloaded PEs. In the worst case, when all remaining tasks are assigned to underloaded PEs, TWA results in L_rem more nonlocal tasks than SWA.
Property 2. The communication overhead of SWA is less than that of TWA.

Proof. In SWA and TWA, the communication overhead consists of broadcasting, gathering, and task migration. When a PE is allowed to communicate with at most one PE at a time, it has already been shown that the number of communication steps taken by SWA to broadcast a message or gather messages in Phases 1–3 is the minimum possible. Thus, the communication overhead required to perform Phases 1–3 of SWA is less than that required to perform Phases 1–3 of TWA. Also, the number of tasks migrated to achieve load balance in Phase 5 of SWA is the minimum possible. Each p_i must send or receive at least |l_i - q_i| tasks in order to have q_i tasks after load balancing, and thus at least (1/2) Σ_{i=0}^{N-1} |l_i - q_i| tasks need to be migrated in total. Since OL and UL, created in Phase 4.3 of SWA, have the same length, equal to (1/2) Σ_{i=0}^{N-1} |l_i - q_i|, the total number of tasks migrated from OL to UL is (1/2) Σ_{i=0}^{N-1} |l_i - q_i|. Hence, the number of tasks migrated to achieve accurate load balancing in Phase 5 of SWA is the minimum possible and less than that in Phase 5 of TWA. Consequently, the communication overhead of SWA is less than that of TWA. □

Fig. 7. A working example of SWA.
In TWA, task migrations from an overloaded PE to an underloaded PE may be made indirectly through several task migration steps via other PEs. If an overloaded PE is not directly connected to an underloaded PE in the scheduling tree, the overloaded PE sends its excess tasks to a neighboring PE. The neighboring PE then retransmits those tasks toward the underloaded PE, possibly traversing further intermediate PEs. Such indirect task migrations incur substantial communication traffic and long delays. In contrast, SWA migrates tasks directly from overloaded PEs to underloaded PEs.
Property 3. The computational time complexity of SWA is equal to that of TWA.

Proof. In SWA, computation is required to determine the task transfers to be made in Phases 4 and 5. Each PE executes Phases 4.1, 4.2, and 4.3 in O(N), O(N), and O(s · N) time, respectively, where s represents the deviation of the PEs' loads from their quotas (l_i - q_i). Also, the matching operation in Phase 5 requires O(s · N) time. Therefore, the overall computational time complexity with N PEs is O(N^2), since s is a constant. Similarly, TWA requires computation to determine the task transfers to be made in Phase 4. Each PE checks the balancing quotas q_i of the PEs in its own subtree to determine the number of tasks to be transferred. All PEs perform this computation at the same time, and the amount of computation required is the same as that required at the root PE of the scheduling tree. Thus, the computational time complexity of each PE is O(N) and the overall computational time complexity of TWA is O(N^2), which is equal to that of SWA. □
The exact total computational complexity of SWA actually seems to be somewhat larger than that of TWA. However, this difference is negligible, since each PE requires little extra time to execute the computation needed to make a load balancing decision. Hence, the computation overhead of SWA is comparable to that of TWA. From Properties 1–3, we observe that SWA achieves accurate load balancing with less communication overhead and comparable computation overhead when compared with TWA.
5.2. Experimental evaluation

This subsection describes the implementation results of a Mandelbrot set generation application executed on an IBM SP2 and a PC cluster system. The SP2 machine consists of 32 superscalar RISC processors connected via IBM Vulcan switches. Each processor has a peak performance of 266 MFLOPS. The PC cluster system consists of 14 PCs connected by two Myrinet switches. Each PC is a Pentium III 500 MHz PC running Red Hat Linux 5.1. The configurations of the processing elements in the SP2 and the cluster system are shown in Fig. 8(a) and (b), respectively. Mandelbrot set generation is used to yield fractal curves, to blow up a selected section of a picture to a full-screen display, or to modify the EGA palette to display any desired color. In Mandelbrot set generation, the equation z_{i+1} = z_i^n + c, where both z and c are complex numbers, is computed iteratively until a specified condition is satisfied. In this experiment, zero is assigned to z_0, and the vertical and horizontal coordinates over a 640 × 320 image are given to the real and imaginary parts of c in the above equation, respectively. Since it is difficult to predict the number of iterations required for the completion of each point, the image is partitioned equally and assigned to all PEs so that every PE initially has an equal-sized portion of the image to compute. There is no dependency between two different points, and the load variation among points is quite large. As the computation of the equation progresses after the initial mapping stage, some PEs become idle after completing their work while other PEs are still busy. To balance the computational load of the PEs, TWA and SWA have been applied. In the implementation of TWA, the PEs are mapped into well-balanced scheduling trees. All of the algorithms were written in C with MPI communication primitives. We use speedup as the performance measure for the comparison of the load balancing algorithms. The speedup is the ratio of the execution time on one processor to the execution time on multiple processors.
First, we compare the communication times of broadcasting and gathering when TWA or SWA is used. Table 1 shows the communication times of TWA and SWA as the number of PEs is varied. The broadcasting time of TWA shown in the table corresponds to the best case, i.e., when the root PE of the scheduling tree broadcasts a message to all other PEs. In the table, the communication time of SWA is shorter than that of TWA, and the difference becomes larger as the number of PEs increases. The broadcasting time of SWA grows logarithmically while the broadcasting time of TWA grows linearly. However, the communication time of gathering increases at a faster rate than that of broadcasting. This is partly because the communication time of gathering includes the computation time required to combine multiple messages as they are received. It is notable that the communication time using 14 PEs on the PC cluster system is shorter than that using eight PEs on the SP2 machine.
Table 2 shows the overall execution time of Mandelbrot set generation when no load balancing algorithm is applied (No-Bal), when TWA is applied, and when SWA is applied on the given number of PEs. The execution times with TWA and SWA are much shorter than with No-Bal, and the execution times with SWA are shorter than with TWA. Again, it is notable that the overall execution time on the PC cluster system is much less than that on the SP2 machine.
Table 1
Communication time (ms)

(a) On the SP2 machine

            Broadcasting          Global gathering
# of PEs    8      16     32      8      16     32
TWA         341    776    1288    1420   1919   3245
SWA         316    448    525     933    1501   2889

(b) On the PC cluster system

            Broadcasting    Global gathering
# of PCs    7      14       7      14
TWA         105    302      140    482
SWA         98     162      127    206

Fig. 8. Experimental systems: (a) the SP2 machine, 32 PEs connected by Vulcan switches in two frames; (b) the PC cluster system, 14 Pentium III PCs connected by two Myrinet switches.

In Fig. 9, the speedup and the total number of task migrations are compared when Mandelbrot set generation is implemented on the SP2 machine. Fig. 9(a) shows the speedup of No-Bal, TWA, and SWA, and Fig. 9(b) shows the number of task migrations incurred by these load balancing algorithms during the overall computation. The speedup of TWA is much larger than that of No-Bal, and the speedup of SWA is larger than that of TWA. As the number of PEs increases, the performance gap between TWA and SWA widens: the performance of SWA is similar to that of TWA with eight PEs, about 10% better than that of TWA with 16 PEs, and about 20% better than that of TWA with 32 PEs. The number of task migrations in SWA is about half of that in TWA in all cases.

In Fig. 10, the speedup and the total number of task migrations are compared when Mandelbrot set generation is implemented on the PC cluster system. Fig. 10(a) shows the speedup of No-Bal, TWA, and SWA, and Fig. 10(b) shows the number of task migrations incurred by those load balancing algorithms during the overall computation. Since the PC cluster system has fewer PEs than the SP2 machine, we apply the load balancing algorithms using seven PEs instead of eight PEs and 14 PEs instead of 16 PEs. The results in Fig. 10 are similar to those in Fig. 9: the performance of SWA is similar to that of TWA with seven PEs and is about 10% better than that of TWA with 14 PEs. The number of task migrations in SWA is about half of that in TWA in all cases. Compared to the number of task migrations on the SP2 machine, the number of task migrations on the PC cluster system is relatively small. From the results of Figs. 9 and 10, we can infer that SWA outperforms TWA because of its faster communication and smaller number of task migrations. Also, extrapolating our results suggests that SWA will perform much better than TWA on larger parallel systems.
Table 2
Overall execution time (s)

(a) On the SP2 machine (exec. time on one proc.: 63.978041)

# of PEs    No-Bal       TWA         SWA
8           15.170944    8.501002    8.360333
16          11.883218    5.193915    4.693693
32          9.430924     3.335179    2.735639

(b) On the PC cluster system (exec. time on one proc.: 46.254273)

# of PCs    No-Bal      TWA         SWA
7           9.303447    7.603897    7.334367
14          5.972260    4.000707    3.645943

Fig. 9. Comparison of no balancing, TWA, and SWA on the SP2 machine.
Fig. 10. Comparison of no balancing, TWA, and SWA on the PC cluster system.

6. Conclusions

In this paper, we propose a dynamic load balancing algorithm suitable for switch-based networks. The proposed algorithm, called the Switch Walking Algorithm (SWA), was designed to exploit the characteristics of most commercial switch-based networks, such as the one-port communication model, dynamic communication paths, and direct task migration. To achieve accurate global load balancing, global load information is used in making each load balancing decision. Also, to reduce the communication time required for load balancing, all processing elements cooperate to communicate the global load information with one another. The proposed load balancing algorithm reduces communication overhead significantly, maximizes the locality of tasks, and minimizes the number of task migrations. The analytical evaluation results show that the proposed algorithm is more suitable for switch-based networks than a previously proposed algorithm called the Tree Walking Algorithm (TWA). In experiments with Mandelbrot set generation on an IBM SP2 machine and a cluster system of Pentium III PCs, SWA performs significantly better than TWA, with the performance difference becoming more pronounced as the size of the parallel system increases: SWA has performance similar to that of TWA on a system with seven or eight processing elements, about 10% better performance than TWA on a system with 14 or 16 processing elements, and about 20% better performance than TWA on a system with 32 processing elements.
References

[1] T. Anderson, D. Culler, D. Patterson, A case for NOW (Networks of Workstations), IEEE Micro 15 (1) (1995) 54–64.
[2] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, W. Su, Myrinet: a gigabit-per-second local area network, IEEE Micro 15 (1) (1995) 29–35.
[3] W. Lee, S. Hong, J. Kim, Dynamic load distribution on meshes with broadcasting, Internat. J. High Speed Comput. 9 (4) (1997) 337–357.
[4] W. Lee, S. Hong, J. Kim, On the configuration of switch-based networks with wormhole routing, J. Interconnection Networks 1 (2) (2000) 95–114.
[5] W. Lee, S. Hong, J. Kim, S. Lee, A dynamic load balancing algorithm on switch-based networks, ISCA International Conference on Parallel and Distributed Computing Systems, Las Vegas, Nevada, 2000, pp. 302–307.
[6] R. Libeskind-Hadas, D. Mazzoni, R. Rajagopalan, Optimal contention-free unicast-based multicasting in switch-based networks of workstations, Internat. Parallel Process. Symp. 12 (1998) 358–364.
[7] P. Loh, W. Hsu, C. Wentong, N. Sriskanthan, How network topology affects dynamic load balancing, IEEE Parallel Distrib. Technol. 4 (3) (1996) 25–35.
[8] K. Nam, S. Lee, J. Kim, Synchronous load balancing in hypercube multicomputers with faulty nodes, J. Parallel Distrib. Comput. 58 (1999) 26–43.
[9] L. Ni, P. McKinley, A survey of wormhole routing techniques in direct networks, IEEE Comput. 26 (2) (1993) 62–76.
[10] R. Ponnusamy, R. Thakur, A. Choudhary, K. Velamakanni, Z. Bozkus, G. Fox, Experimental performance evaluation of the CM-5, J. Parallel Distrib. Comput. 19 (1993) 192–202.
[11] N. Shivarati, P. Krueger, M. Singhal, Load distribution for locally distributed systems, IEEE Comput. 25 (12) (1992) 33–44.
[12] W. Shu, M. Wu, Runtime incremental parallel scheduling (RIPS) on distributed memory computers, IEEE Trans. Parallel Distrib. Systems 7 (6) (1996) 637–649.
[13] R. Stevens, Fractal Programming in C, M&T Publishing Inc., Redwood City, CA, 1989.
[14] C. Stunkel, D. Shea, B. Abali, M. Atkins, C. Bender, D. Grice, P. Hochschild, D. Joseph, B. Nathansan, The SP2 high performance switch, IBM Systems Journal 34 (1995) 185–202.
[15] M. Willebeek-LeMair, A. Reeves, Strategies for dynamic load balancing on highly parallel computers, IEEE Trans. Parallel Distrib. Systems 4 (9) (1993) 979–993.
[16] M. Wu, On runtime parallel scheduling for processor load balancing, IEEE Trans. Parallel Distrib. Systems 8 (2) (1997) 173–186.
[17] M. Wu, W. Shu, DDE: a modified dimensional exchange method for load balancing in k-ary n-cubes, J. Parallel Distrib. Comput. 44 (1997) 88–96.
[18] C. Xu, F. Lau, Analysis of the generalized dimension exchange method for dynamic load balancing, J. Parallel Distrib. Comput. 16 (1992) 385–393.
[19] C. Xu, B. Monien, R. Lüling, An analytical comparison of nearest neighbor algorithms for load balancing in parallel computers, Internat. Parallel Process. Symp. 9 (1995) 472–479.
Wan Yeon Lee received the BS, MS, and Ph.D. degrees in computer science and engineering from Pohang University of Science and Technology in 1994, 1996, and 2000, respectively. He is currently a research engineer at LG Electronics and participates in the standardization of the Next Generation Mobile Network of the 3GPP Group. His areas of interest include mobile networks, real-time systems, multimedia communication, and parallel computing.

Sung Je Hong received the BS degree in electronics engineering from Seoul National University in 1973, the MS degree in computer science from Iowa State University in 1979, and the Ph.D. degree in computer science from the University of Illinois, Urbana-Champaign, in 1983. He is currently a Professor in the Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, Korea. From 1983 to 1989, he was a staff member of Corporate Research and Development, General Electric Company, Schenectady, NY, U.S.A. From 1975 to 1976, he was with Oriental Computer Engineering, Korea, as a logic design engineer. From 1973 to 1975, he served in the army as a system analyst at the Army Finance Center, Korea. His current research interests include VLSI design, CAD algorithms, testing, and parallel processing.

Jong Kim received the BS degree in electronic engineering from Hanyang University in 1981 and the MS degree in computer science from Pennsylvania State University in 1991. He is currently an Associate Professor in the Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, Korea. Prior to this appointment, he was a research fellow in the Real-Time Computing Laboratory of the Department of Electrical Engineering and Computer Science at the University of Michigan from 1991 to 1992. From 1983 to 1986, he was a system engineer at the Korea Securities Computer Corporation, Seoul, Korea. His major areas of interest are fault-tolerant computing, performance evaluation, parallel and distributed computing, and computer security.

Sunggu Lee received the B.S.E.E. degree with highest distinction from the University of Kansas, Lawrence, in 1985 and the M.S.E. and Ph.D. degrees from the University of Michigan, Ann Arbor, in 1987 and 1990, respectively. He is currently an Associate Professor in the Department of Electronic and Electrical Engineering at the Pohang University of Science and Technology (POSTECH), Pohang, Korea. Prior to this appointment, he was an Assistant Professor in the Department of Electrical Engineering at the University of Delaware in Newark, Delaware, U.S.A. From June 1997 to June 1998, he spent one year as a Visiting Scientist at the IBM T.J. Watson Research Center. His research interests are in parallel computing using clusters, fault-tolerant computing, and real-time computing.