The Row/Column Pivoting Strategy on Multicomputers
M. ANGELACCIO and M. COLAJANNI
Dipartimento di Ingegneria Elettronica, Università di Roma “Tor Vergata”, Via della Ricerca Scientifica, Roma, Italy
(Tel: +39 6 72594478, Fax: +39 6 2020519, e-mail: colajanni@tovvx1.ccd.utovrm.it)
ABSTRACT. On multicomputers the partial pivoting phase of the LU factorization has a peculiar load unbalancing due to the presence of idle processors in most matrix decompositions. Moreover, intrinsic synchronization barriers do not allow a complete masking of this overhead by means of pipelining techniques. We propose to reduce load unbalancing by “assigning extra work to idle processors”; this leads to a new pivoting strategy, named row/column pivoting, that is mainly attractive for 2D decompositions. Row/column pivoting furnishes an LU factorization algorithm that guarantees better numerical stability at the same cost as partial pivoting in the case of square decomposition. A further improvement is achieved by adding pipelining schemes to the naive form. In the design of the algorithms and in their evaluation we have adopted a new environment that allows decomposition-independent parallel programming.

Keywords. Growth factor, Load balancing, Pipelining schemes, Pivoting strategies, Numerical stability.
1. INTRODUCTION
Pivoting is a widely used technique for improving the numerical stability of matrix factorization. While pivoting is almost a trivial operation on sequential machines, it can require a fairly complicated algorithm in a concurrent implementation [8]. In particular, this is true if one considers multicomputers, where the matrix decomposition has a relevant impact on the performance of the pivot search. In almost all decompositions partial pivoting does not perform well due to an inherent load unbalancing. This can arise: 1) from an unfortunate sequence of pivot choices, and/or 2) from the inactivity of processors during the pivoting phase. If we restrict to row partial pivoting, the former unbalancing is typical of row decomposition and is mitigated by row swaps [5], whereas the latter, typical of column decomposition, is reduced by pipelining schemes, which aim to mask the serial overhead by anticipating the pivot computation.
Parallel implementations of LU factorization with partial pivoting are treated in [5, 6, 9, 10, 11] for 1D matrix decompositions (i.e., row or column). The authors [8, 13] that propose 2D decompositions (e.g., square) as a better alternative to row and column decomposition focus more on the LU factorization than on the pivoting phase. This paper, instead, aims to mitigate pivoting overheads especially in such decompositions. When we adopt partial pivoting in a 2D decomposition both kinds of load unbalancing are present, and processor idleness still remains even if some strategy (i.e., data swap, pipelining) proposed for 1D is naively extended to 2D decompositions. For this reason we suggest replacing partial pivoting with a less partial strategy, named row/column pivoting, that “assigns extra work to idle processors during the pivot search”. Since for 2D decompositions this extra computation affects only otherwise idle processors (i.e., idle if partial pivoting were adopted), we expect parallel row/column pivoting to achieve a speedup higher than that of partial pivoting. In addition, we demonstrate that row/column pivoting increases the numerical stability while preserving the linear complexity of the pivot search. It should be noted that row/column pivoting, not particularly useful in serial LU factorization, enjoys successful employment on multicomputers. This demonstrates that parallelism tends not only to parallelize serial programs, but sometimes also to propose quite original algorithms.
Before entering into the details of row/column pivoting, we would like to summarize the main features of a new programming environment that has allowed us to implement the parallel algorithms of this paper in a decomposition-independent way. The theoretical foundation of this environment, named I-PLUS, can be found in [1], whereas a complete description is reported in [3].
I-PLUS is constituted by a set of programming primitives based on a meta-decomposition, called Subnet Matrix Decomposition (SMD), from which it is possible to derive any other decomposition. A 1D matrix decomposition can be seen as a combination of two functions: the matrix partition that decomposes the matrix by row or by column, and the node mapping strategy that assigns these parts to each processor in a block, scattered, etc. way. The key idea of SMD is to substitute the node mapping strategy with a subnet mapping strategy that assigns each row to a subnet (e.g., a subcube) instead of a single processor. By adopting such a function, in practice we define a functional decomposition whose parameters are the subnet dimension r (e.g., in a subcube, 0 ≤ r ≤ q = log p), and the mapping attributes.
These unifying properties propagate to the set of decomposition-independent programming primitives that constitutes I-PLUS. By adopting this environment, the algorithm design itself becomes decomposition-independent, and the resulting programs unify all those based on a particular decomposition. Throughout the paper, this property will be used to implement the row/column pivoting meta-algorithms, to carry out a decomposition-independent performance analysis, and to obtain execution times for different matrix decompositions from a single code.
This paper is organized as follows. Section 2 gives the motivation of this paper by discussing the load unbalancing of the pivoting phase and the strategies to mask it. Section 3 introduces the naive and the pipelined forms of LU factorization with row/column pivoting. Section 4 contains a comparative analysis of row/column vs. partial pivoting in terms of speedup evaluation and numerical stability. Section 5 shows the experimental results, whereas Section 6 gives some conclusive remarks.
2. LOAD BALANCING IN THE PIVOTING PHASE
Several papers have demonstrated that 2D matrix decompositions (and in particular the square one) perform better than 1D decompositions as regards the LU factorization algorithm [1, 4, 13]. This seems untrue if one focuses the analysis on the partial pivoting phase, which is divisible into three steps; i.e., search for pivot (O(n) computation and O(p) communications), broadcast pivot index (O(p) communications), and swap rows (O(1) communications). For example, if one applies the Fox algorithm [8] to dense matrices, the spanning trees (devoted to searching the pivot by a fan-in call, and to broadcasting the pivot index by a fan-out call) result defined on subcubes instead of on all the nodes. This implies, on one hand, a reduction of the spanning tree depth and of the communication time; on the other, more idle nodes and more serial comparisons in the local maximum search. The graphical description of the pivoting steps highlights the load unbalancing of this algorithm when it is applied to 2D decompositions (Fig. 1).
In correspondence of the pivoting steps, each matrix decomposition causes a different load unbalancing (Table 1). It should be noted that, since the complexity of the searching step depends on the matrix size, it usually requires most of the computation; therefore, the load unbalancing intrinsic in square and column decomposition results as being much more serious than that of row decomposition.
Decomposition    search for pivot                   send pivot index       swap rows
row (r=0)        No load unbalancing                No load unbalancing    (2^q − 2) idle processors
grid (0<r<q)     (2^q − 2^(q−r)) idle processors    No load unbalancing    (2^q − 2^(r+1)) idle processors
column (r=q)     (2^q − 1) idle processors          —                      No load unbalancing

Table 1. Load unbalancing due to the row partial pivoting phase in accordance with the matrix decomposition (2^q = p).
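The idle-processor counts in Table 1 follow from how many of the p = 2^q nodes actually hold the data touched by each step. A minimal sketch of this arithmetic (the helper names are ours, not from the paper):

```python
# Idle processors during the row partial pivoting phase, per Table 1
# (p = 2^q nodes; subnet dimension r: r=0 row, 0<r<q grid, r=q column).

def idle_in_search(q: int, r: int) -> int:
    # Only the 2^(q-r) nodes holding column k take part in the pivot search.
    return 2**q - 2**(q - r)

def idle_in_swap(q: int, r: int) -> int:
    # Only the nodes holding row k and row `pivot` (2 * 2^r nodes) swap data.
    if r == q:            # column decomposition: the swap is local everywhere
        return 0
    return 2**q - 2**(r + 1)

# q = 4 (p = 16):
print(idle_in_search(4, 0), idle_in_swap(4, 0))   # row:    0 and 14 idle
print(idle_in_search(4, 2), idle_in_swap(4, 2))   # square: 12 and 8 idle
print(idle_in_search(4, 4), idle_in_swap(4, 4))   # column: 15 and 0 idle
```

With q = 4 this reproduces the Table 1 entries for the three decompositions.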
Two strategies have been commonly adopted to mitigate such load unbalancing; i.e., the introduction of pipelining schemes and/or the use of pivoting strategies weaker than partial.
Pipelining schemes increase the efficiency of the naive parallel algorithms by anticipating some computation and/or overlapping computation with communications. An example of pipelining is the send-ahead scheme that, applied to the naive LU factorization without pivoting, avoids the unnecessary synchronizations at the end of each stage k by broadcasting row k+1 as soon as it is updated (for more details on this scheme see [1, 11, 12]).
Even in the case of the LU factorization with pivoting, it is possible to increase the efficiency of the naive algorithms by anticipating some computation. Here, though, the problem is quite different from the without-pivoting case because no processor can anticipate the updating phase k+1 at the stage k without knowing pivot k+1. As a consequence, the first aim of any pipelining scheme has to regard the anticipated evaluation of the pivot. Anticipating the search for pivot k+1 before the completion of the stage k results particularly useful when row partial pivoting is applied to a matrix decomposed by columns [7] (or, in a dual manner, when column partial pivoting is used in combination with row decomposition). This scheme, here called (pivot)-compute-ahead, has also been proposed in the case of row partial pivoting and row decomposition [5]. In other algorithms (e.g., LU factorization without pivoting) we can also adopt pipelining schemes that combine some pre-computation with the send-ahead of row k+1. This is not the case for LU with partial pivoting because it is impossible to anticipate the updating and sending phases before knowing the row pivot index [1].
Among the weaker pivoting strategies, an attractive alternative to partial pivoting is pairwise pivoting, which introduces parallelism into the pivoting steps. It has been demonstrated to guarantee enough stability in many examples [14], but the upper bound on its growth factor is the square of that of partial pivoting. Another possibility is local partial pivoting, in which only the processor holding a_kk^(k) searches the pivot among its column-k entries [6]. This algorithm allows the combination of send-ahead and compute-ahead schemes also for 2D decompositions [1]; in fact, no communication is necessary in the search-for-pivot and swap steps, and only a subnet spanning tree is required in the pivot index communication. The good performance of this algorithm, though, is counterbalanced by an unknown loss of numerical stability.
Unfortunately, the pipelining schemes do not completely solve the load unbalancing of pivoting strategies; on the other hand, weaker pivoting strategies may suffer stability problems. To reduce load unbalancing and to guarantee at least the same (in reality better) stability as partial pivoting, we propose a stronger pivoting strategy, called row/column pivoting. It assigns extra work to the row-k nodes (otherwise idle in row partial pivoting) by exploring both the column below a_kk^(k) and the row to the right of a_kk^(k); i.e.,

pivot_k = max { max_{i≥k} |a_ik^(k)|, max_{j≥k} |a_kj^(k)| }.
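The pivot choice just described — the entry of maximum magnitude over column k below the diagonal and over row k to its right — can be sketched serially as follows (a serial emulation with our own helper names; the paper distributes this search over the row-k and column-k subnets):

```python
import numpy as np

def rc_pivot(A: np.ndarray, k: int):
    """Row/column pivot choice at stage k: search |entries| in column k
    below a_kk and in row k to the right of a_kk."""
    col = np.abs(A[k:, k])          # candidates in column k
    row = np.abs(A[k, k:])          # candidates in row k
    if col.max() >= row.max():
        return ('row_swap', k + int(col.argmax()))   # bring the pivot up
    return ('col_swap', k + int(row.argmax()))       # bring the pivot left

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
print(rc_pivot(A, 0))   # ('row_swap', 2): |a_31| = 7 beats the row-0 maximum 3
```

The returned tag says which kind of swap NaiveRCP would perform at this stage.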
A further improvement of the load balancing can be achieved by adding some pipelining scheme to this algorithm. Two parallel forms (naive and optimized) of row/column pivoting are described in Section 3, whereas its properties are demonstrated in Section 4.
3. ROW/COLUMN PIVOTING
3.1 The Naive Algorithm
As regards the partial pivoting steps, LU factorization for 2D decompositions is not preferable to that for 1D decompositions because all the nodes not involved in the pivot search are idle. The number of such nodes increases with the subcube size r, thus lessening the gain achieved by the spanning tree depth reduction (evidenced in Figure 1).
Following the 2D paradigm, we replace partial pivoting with row/column pivoting, which combines row and column partial pivoting. For the pivot search step we use two fan-in calls, each implemented by a spanning tree with a common root node named manager; one tree spans the row-k nodes, the other the column-k nodes. The concurrent use of disjoint spanning trees of the same dimension reduces pivoting unbalancing (Fig. 2) and improves the stability (Section 4.2) at no cost. The naive form of the parallel LU factorization with row/column pivoting is outlined in NaiveRCP.
The algorithms are written in a pseudo-code language, whereas the actual programs are implemented in the I-PLUS programming environment. It allows us to obtain programs based on a particular matrix decomposition by setting the SMD parameters at run-time [1]. These pseudo-codes do not specify the decomposition-independent I-PLUS primitives, but denote all the expressions and instructions that require them by means of “< ... >” phrases.
There are two main differences between LU factorization with row pivoting and NaiveRCP; i.e., the simultaneous search for the pivot both in row k and in column k, and the swap data step. It should be observed that for 1D matrix decompositions row/column pivoting is not very efficacious because, at each stage k, it would require n−k additional comparisons on the row-k node. On the other hand, this strategy becomes more efficient than partial pivoting for 2D decompositions. Among them, the most balanced computation is achieved by the square scattered decomposition, which involves the highest number of processors in the search for the pivot.
NaiveRCP
for k=1 to n−1 do
    /* Row/column partial pivoting */
    if (<I own column k>) then <search for pivot k in mycolumn k>
    if (<I own row k>) then <search for pivot k in myrow k>
    if (manager = mynode) then <send pivot index>
    else receive(pivot index)
    if (k ≠ pivot index) then <swap (row k, row pivot) or (column k, column pivot)>
    /* LU Factorization */
    if (<I own row k>) then <send myrow k>
    else receive(myrow k)
    if (<I own column k>) then <compute and send mymultipliers k>
    else receive(mymultipliers k)
    for all (<i ≥ k+1 in myrow i>) do
        for all (<j ≥ k+1 in mycolumn j>) do
            a_ij = a_ij − l_ik · a_kj
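The serial effect of NaiveRCP can be emulated with a short sketch (our own serial emulation, not the paper's parallel code; the parallel version distributes rows and columns over subnets, and the bookkeeping of the applied permutations is omitted here for brevity):

```python
import numpy as np

def lu_rc(A):
    """Serial emulation of LU factorization with row/column pivoting:
    at each stage choose the largest |entry| in column k below a_kk or
    in row k to its right, then swap rows or columns accordingly."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        i = k + int(np.argmax(np.abs(A[k:, k])))   # column-k candidate
        j = k + int(np.argmax(np.abs(A[k, k:])))   # row-k candidate
        if abs(A[i, k]) >= abs(A[k, j]):
            if i != k:
                A[[k, i], :] = A[[i, k], :]        # swap rows
        elif j != k:
            A[:, [k, j]] = A[:, [j, k]]            # swap columns
        A[k+1:, k] /= A[k, k]                      # multipliers l_ik
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

L, U = lu_rc(np.array([[2., 1.], [4., 3.]]))
# L @ U reproduces the input matrix with the row/column interchanges applied
```

A zero pivot is not handled, as in the naive form; the pipelined variants below change only the scheduling, not this arithmetic.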
3.2 The Pipelined Algorithms
The efficiency of NaiveRCP can be increased by introducing some pipelining scheme. First of all we propose the well-known (pivot)-compute-ahead scheme, which aims to anticipate the search for pivot k+1 before the completion of the stage k. This algorithm would require a pre-updating of column k+1 and row k+1, an immediate search for pivot k+1 on these matrix entries, and a broadcast of the pivot row index. We do not explicitly illustrate this algorithm, which can be easily deduced from the next PipeRCP.
A better overlapping scheme, which we propose for 2D matrix decompositions, is achieved by anticipating the computation both of the pivot k+1 and of the multipliers k+1 during the stage k. This scheme, called (pivot&multiplier)-compute-ahead, aims to: pre-update a subset of matrix entries (i.e., column k+1 and row k+1), anticipate the evaluation of pivot k+1 and the broadcasting of its index, and anticipate the computation of the multipliers k+1 and their sending. This algorithm is sketched in PipeRCP (shaded boxes evidence the differences with NaiveRCP due to pipelining).
PipeRCP
if (<I own column 1>) then <search for pivot 1 in mycolumn 1>
if (<I own row 1>) then <search for pivot 1 in myrow 1>
if (manager = mynode) then <send-ahead pivot index>
else receive(pivot index)
if (1 ≠ pivot index) then <swap (row 1, row pivot) or (column 1, column pivot)>
if (<I own column 1>) then <compute and send-ahead mymultipliers 1>
for k=1 to n−1 do
    if not(<I own column k>) then receive(mymultipliers k)
    if (<I own row k>) then <send myrow k>
    else receive(myrow k)
    if (<I own column k+1>) then
        <update mycolumn k+1>
        <search for pivot k+1 in mycolumn k+1>
    if (<I own row k+1>) then
        <update myrow k+1>
        <search for pivot k+1 in myrow k+1>
    if (manager = mynode) then <send-ahead pivot index>
    else receive(pivot index)
    if (k+1 ≠ pivot index) then <swap (row k+1, row pivot) or (column k+1, column pivot)>
    if (<I own column k+1>) then <compute and send-ahead multipliers k+1>
    /* Update submatrix A_{i>k+1, j>k+1} */
    for all (<i ≥ k+2 in myrow i>) do
        for all (<j ≥ k+2 in mycolumn j>) do
            a_ij = a_ij − l_ik · a_kj
It should be observed that both the pipelining schemes introduced here can be applied to LU factorization with partial pivoting [1]. Analogously to that algorithm, row/column pivoting also has a synchronization barrier that prevents any combination of pre-computation and send-ahead schemes on row pivot, because the pivot index cannot be known in advance. However, two properties make row/column pivoting more attractive than partial pivoting; i.e., a speedup gain and better numerical stability, shown in Sections 4.1 and 4.2, respectively.
4. PERFORMANCE ANALYSIS
4.1 Speedup Analysis
The main goal of this section is to demonstrate that the “extra work assigned to idle nodes” renders row/column pivoting more efficient than partial pivoting. Let us denote by SMD a generic matrix decomposition that follows the paradigm introduced in Section 1; this function assumes values in the interval [row, ..., square, ..., column].
The performance parameters regarding the partial pivoting phase are denoted by:

PP_comm^SMD    communication time
PP_op^SMD      operation time (i.e., number of comparisons)
PP_tot^SMD = PP_comm^SMD + PP_op^SMD    global execution time.

In the above definitions we replace PP with RC to indicate the performance parameters of row/column pivoting. Moreover, let us denote by p the processor number, by τ the propagation time of a unitary data, and by σ the machine time for executing a comparison between double-precision numbers; these indexes are used in Proposition 1 (whose proof is postponed to the Appendix).
Proposition 1. If σ < √p·τ then PP_tot^col ≥ PP_tot^row ≥ PP_tot^square, otherwise PP_tot^col ≥ PP_tot^square ≥ PP_tot^row.

Since the hypothesis σ < √p·τ is satisfied in almost all machines, it seems reasonable to restrict the comparison between partial and row/column speedup to SMD = square. In such a case, Proposition 1 holds true also for row/column pivoting because:
RC_op^square = PP_op^square  and  RC_comm^square = PP_comm^square    (1)

For SMD = square, in fact, the search for the maximum in row k and column k involves two sets of nodes which are disjoint and of equal size. In general, instead, these sets of nodes have different sizes, whereby RC_op^{SMD≠square} ≥ PP_op^{SMD≠square}.
Proposition 2. For square matrix decomposition, the row/column pivoting speedup is twice the partial pivoting speedup.

Proof. By definition, the row/column pivoting speedup is S_RC^SMD = RC_serial / RC_tot^SMD. For the serial pivoting phase it is easy to obtain:

RC_serial = 2·PP_serial    (2)

By using (1) and (2), the speedup equation for square decomposition results:

S_RC^square = RC_serial / (RC_comm^square + RC_op^square) = 2·PP_serial / (PP_comm^square + PP_op^square) = 2·S_PP^square    (3)

Q.E.D.
Proposition 3 provides an estimate of S_RC^square by using the results obtained in the Appendix.

Proposition 3. If n becomes arbitrarily large, the speedup obtainable by row/column pivoting for square decomposition approaches 2√p.
Proof. From (3) we have the row/column pivoting speedup for square decomposition:

S_RC^square = 2·PP_serial / PP_tot^square    (4)

By replacing the denominator of (4) with formula (A13) of the Appendix,

PP_tot^square = (n−1)·[S·(3q/2 + 1) + τ·(3q/2 + n/2^{q/2+1})] + σ·n(n−1)/2^{q/2+1},

and since PP_serial = σ·n(n−1)/2, equation (4) becomes:

S_RC^square = σ·n(n−1) / { (n−1)·[S·(3q/2 + 1) + τ·(3q/2 + n/2^{q/2+1})] + σ·n(n−1)/2^{q/2+1} }    (5)

By restricting to the O(n²) terms in (5), and neglecting the propagation time τ with respect to the comparison time σ, it holds:

S_RC^square ≈ σ·n² / (σ·n²/2^{q/2+1} + O(n)) → 2^{q/2+1} = 2√p    (6)

from which it follows that S_RC^square tends to 2√p as n grows.

Q.E.D.
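Evaluating (5) numerically shows the approach to the limit. The sketch below assumes the reconstructed form of (A13) and uses purely illustrative machine constants S, τ, σ (the names are ours):

```python
def s_rc_square(n, q, S=100.0, tau=1.0, sigma=2.0):
    """Row/column speedup on square decomposition from eq. (4):
    2*PP_serial / PP_tot^square, with PP_serial = sigma*n*(n-1)/2
    and PP_tot^square from (A13)."""
    pp_serial = sigma * n * (n - 1) / 2
    pp_tot = (n - 1) * (S * (3 * q / 2 + 1)
                        + tau * (3 * q / 2 + n / 2 ** (q // 2 + 1))) \
             + sigma * n * (n - 1) / 2 ** (q // 2 + 1)
    return 2 * pp_serial / pp_tot

q = 8                      # p = 256, so 2*sqrt(p) = 32
for n in (1_000, 100_000, 10_000_000):
    print(n, round(s_rc_square(n, q), 2))
# With tau > 0 the limit is 2^(q/2+1) * sigma/(sigma+tau),
# which tends to 2*sqrt(p) as tau/sigma -> 0.
```

As n grows, the O(n) startup terms vanish and the speedup saturates near 2√p, as stated by Proposition 3.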
4.2 Numerical Stability
The pivot search carried out on a larger set of entries (though O(n)) suggests that row/column pivoting yields a numerical stability of the LU factorization better than partial pivoting does. Let us denote by a_max^PP(k) and a_max^RC(k) the value of max_ij |a_ij^(k)| obtained at stage k of the LU factorization with partial pivoting and row/column pivoting, respectively. Moreover we define:

– one-step growth factor: g_STR(k) = a_max^STR(k) / a_max^STR(k−1)
– growth factor: G_STR(k) = a_max^STR(k) / a_max(1)

where STR stands for RC or PP.
The upper bound of the growth factor for partial pivoting is easily obtained by observing that [15]:

a_max^PP(k) ≤ 2·a_max^PP(k−1),  and  a_max^PP(k) ≤ 2^(k−1)·a_max(1),

i.e.,

g_PP(k) ≤ 2,  and  G_PP(k) ≤ 2^(k−1).    (7)

By definition, one would expect that row/column pivoting provides a growth factor smaller than that of partial pivoting.
Theorem 4. The row/column pivoting growth factor is bounded by G_RC(k) ≤ 2^(k−1).

Proof. Let us distinguish two cases:
[i] If the pivots obtained from row/column and partial pivoting differ (i.e., pivot_RC > pivot_PP), the row/column pivoting growth factor is strictly lower than the partial pivoting one; i.e., G_RC(k) < G_PP(k), and the thesis follows immediately from (7).
[ii] If the pivots obtained from row/column and partial pivoting coincide, the partial pivoting growth factor does not reach its upper bound; i.e., G_RC(k) = G_PP(k) < 2^(k−1). The claim is to demonstrate that if at a certain step the one-step growth factor does not reach 2, then the growth factor cannot reach 2^(k−1). To this purpose we apply the theorem shown in [2]; i.e., if at a certain step i (2 ≤ i ≤ k) a_max^RC(i) = a_max^PP(i), then g_PP(i) < 2 or g_PP(i−1) < 2. By using this result and the definition of G_PP(k), it follows:

G_PP(k) = g_PP(k) · ... · g_PP(2) < 2^(k−1).

Q.E.D.
The following matrix furnishes an example in which the row/column pivoting growth factor is lower than that of partial pivoting (but remains higher than that of complete pivoting):

A = |  1    0    0   1/4 |
    | −1    1    0   1/3 |
    | −1   −1    1   1/2 |
    | −1   −1   −1    1  |

In such a case, the growth factor resulting from the LU factorization is: G_PP(4) = 19/6 ≈ 3.167, G_RC(4) = 31/22 ≈ 1.409, and 1573/1245 (≈ 1.263) for complete pivoting.
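The partial-pivoting value G_PP(4) = 19/6 can be checked by tracking the factorization of this matrix in exact rational arithmetic (a small sketch; the function name is ours):

```python
from fractions import Fraction

F = Fraction
A = [[F(1),  F(0),  F(0),  F(1, 4)],
     [F(-1), F(1),  F(0),  F(1, 3)],
     [F(-1), F(-1), F(1),  F(1, 2)],
     [F(-1), F(-1), F(-1), F(1)]]

def growth_partial(A):
    """Growth factor of Gaussian elimination with partial pivoting:
    max over all stages of max_ij |a_ij^(k)| divided by max_ij |a_ij^(1)|."""
    n = len(A)
    gmax = amax0 = max(abs(x) for row in A for x in row)
    for k in range(n - 1):
        i = max(range(k, n), key=lambda r: abs(A[r][k]))   # partial pivot
        A[k], A[i] = A[i], A[k]
        for r in range(k + 1, n):
            m = A[r][k] / A[k][k]
            for c in range(k, n):
                A[r][c] -= m * A[k][c]
        gmax = max(gmax, max(abs(x) for row in A for x in row))
    return gmax / amax0

print(growth_partial(A))    # 19/6, matching G_PP(4) in the text
```

The trailing column grows to 5/4, 11/6 and finally 19/6, exactly as the bound analysis predicts for this kind of matrix.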
A more extended analysis on the numerical stability of row/column pivoting is in [2].
5. EXPERIMENTAL RESULTS
In this section we show the experimental results obtained on an Intel iPSC/2 hypercube with 16 nodes. The purpose is to verify that the better stability of row/column pivoting is obtained without degrading the performance of partial pivoting.
We consider the pipelined forms of LU factorization with row partial and with row/column pivoting (PipeRCP). Since these codes have been implemented in I-PLUS, results for row, square and column decomposition have been simply obtained by setting the SMD parameters. To force a row or column swap (with communication) at each step k we used two types of matrices in the experiments; i.e., ARS (always row swap) and ACS (always column swap).
The diagram in Figure 3 displays the overhead due to the pivoting phase in the LU factorization. The matrix size is fixed to n=1024, the processor number to p=16, and the decomposition is variable. It should be remarked that this diagram: 1) confirms Proposition 1, by indicating a minimum in the pivoting overhead for SMD=square, both for partial and for row/column pivoting (this also implies that the iPSC/2 is a machine where the first hypothesis of Proposition 1 holds true); 2) testifies that the difference in the execution time between LU factorization with partial and with row/column pivoting is negligible in the case of square decomposition.
Figure 4 plots the speedup of LU factorization with the partial and the row/column pivoting strategies as a function of the processor number and the matrix size. The matrix chosen in all the experiments is ARS, whereas the matrix decomposition corresponds to r = q/2. This figure shows that the speedup of LU factorization with row/column pivoting is always better than that of partial pivoting. In addition, the gap between the two curves grows when p becomes large. These results, though concerning the entire LU factorization algorithm with pivoting and not only the pivoting phase, give an experimental confirmation of Proposition 2.
Table 2 shows in more detail the execution times for row/column pivoting as a function of p, r and n. The conclusion is that on the iPSC/2 the square scattered decomposition remains preferable to row and column decompositions even with pivoting (both partial and row/column). The lower execution time shown by ARS matrices in column decomposition is due to the absence of communications in its row swap step.
          n     row (ARS)   square (ARS)   column (ACS)   column (ARS)
         64       0.683        0.592          0.662          0.541
        128       3.520        3.176          3.499          3.083
p=4     256      23.163       21.952         23.572         21.862
        512     170.790      166.506        173.493        166.090
       1024    1328.78      1304.17        1332.77        1303

         64       0.719        0.545          0.704          0.570
        128       2.778        2.177          2.725          2.26
p=8     256      14.621       12.510         14.375         12.70
        512      95.630       87.948         94.260         88.32
       1024     695.144      666.884        686.717        668.9

         64       0.9          0.524          0.986          0.842
        128       3.006        1.595          3.006          2.511
p=16    256      12.116        7.251         12.030         10.215
        512      63.755       46.307         63.560         56.179
       1024     406.095      340.9          405.341        375.204

Table 2. Execution times (in seconds) for LU factorization with row/column pivoting in case of row, square, and column decomposition of ARS and ACS matrices.
6. CONCLUSION
LU factorization algorithms for 2D matrix decompositions (in particular square scattered) are known to perform better than those for 1D matrix decompositions [1, 4, 13]. This is not true if one restricts to the pivoting phase, due to an intrinsic load unbalancing. In this paper we have proposed to reduce processor inactivity by means of previously unexplored techniques.
First, we have introduced a new pivoting strategy (row/column pivoting) that “assigns extra work to idle processors”, thus reducing load unbalancing and assuring a better stability [2]. Combining both these properties, we obtain a parallel LU factorization algorithm with a numerical stability better than partial pivoting without increasing the computational cost. Second, we have improved the performance of partial and row/column pivoting by proposing the introduction of two pipelining schemes into these strategies.
The experimental and analytical analyses of row/column vs. partial pivoting have demonstrated that the former is more efficient than the latter, and that it is very effective for square decomposition. The analysis has been carried out by means of the new programming environment I-PLUS, which allowed us to design decomposition-independent algorithms and to develop a performance analysis independent of a specific decomposition.
APPENDIX
In this section we estimate the influence of the domain decomposition on the pivoting phase of the LU factorization with row partial pivoting. These results are referred to in Section 4.1.
A remarkable property of the I-PLUS programming environment is the opportunity to analyze the performance in a decomposition-independent way. The communication and operation times of the previously described meta-algorithms can be evaluated by means of formulas containing the parameter r (i.e., the subcube dimension).
To describe the communication time required to broadcast data d in a q-dimensional hypercube we adopt the following well-known model:

T_broadcast(d) = (S + τ|d|)·q    (A1)

where S is the startup time, |d| is the data size, and τ is the propagation time of a unitary data. By considering the subcube size parameter r, we obtain (in correspondence of the I-PLUS primitives used in the pivoting phase):
T_<search d in column k>^SMD = (S + τ|d|)·(q − r)    (A2)
T_<search d in row k>^SMD = (S + τ|d|)·r    (A3)
T_<swap(d1, d2)>^SMD = S + τ·max{|d1|, |d2|}    (A4)
Equations (A2)–(A4), under the hypothesis that a row swap occurs at all stages k (therefore upper-bound equations will be derived), allow us to evaluate the communication time for the pivoting steps of the LU factorization meta-algorithm with row partial pivoting:

PP_comm^SMD = Σ_{k=1}^{n−1} T_<search pivot in mycol k> + Σ_{k=1}^{n−1} T_<broadcast pivot index> + Σ_{k=1}^{n−1} T_<swap(myrow k, myrow pivot)>
            = (n−1)(S+τ)(q−r) + (n−1)(S+τ)q + [(n−1)S + τ·n(n−1)/2^{r+1}]
            = (n−1)·[S(2q−r+1) + τ(2q−r + n/2^{r+1})]    (A5)
From (A5) it is possible to derive the communication time for any algorithm corresponding to a specific decomposition by suitably setting the parameter r; as an example, for SMD=row:

PP_comm^row = Σ_{k=1}^{n−1} T_<search pivot in mycol k> + Σ_{k=1}^{n−1} T_<broadcast pivot index> + Σ_{k=1}^{n−1} T_<swap(row k, row pivot)>
            = (n−1)(S+τ)q + (n−1)(S+τ)q + [(n−1)S + τ·n(n−1)/2]
            = (n−1)·[S(2q+1) + τ(2q + n/2)]    (A6)
Equation (A6) can be obtained more directly from (A5) by setting r=0. In an analogous way, it is possible to derive the communication time in the case of any grid decomposition; for example, for SMD=square we set r=q/2 in (A5), thus obtaining:

PP_comm^square = (n−1)·[S(3q/2 + 1) + τ(3q/2 + n/2^{q/2+1})]    (A7)
This is also true for the column decomposition, except for the detail that T_swap in such a case does not cause any communication. As a consequence, the communication time can be obtained by setting r=q in (A5) and by excluding T_swap:

PP_comm^col = (n−1)(S+τ)·q    (A8)
The same can be carried out in evaluating the operation time due to the pivoting phase in a decomposition-independent algorithm. By setting this time equal to the number of comparisons,

PP_op^SMD = Σ_{k=1}^{n−1} (n−k)/2^{q−r} = n(n−1)/2^{q−r+1}    (A9)

It would be possible to derive the formulas for row, column and square decomposition by substituting r=0, r=q, and r=q/2 in (A9), respectively.
The combination of the communication times (A5)–(A8) with the operation time (A9) gives the global execution time of the pivoting phase (σ denotes the machine time for executing a comparison between double-precision numbers):

PP_tot^SMD = (n−1)·[S(2q−r+1) + τ(2q−r + n/2^{r+1})] + σ·n(n−1)/2^{q−r+1}    (A10)

from which,

PP_tot^row = (n−1)·[S(2q+1) + τ(2q + n/2)] + σ·n(n−1)/2^{q+1}    (A11)
PP_tot^col = (n−1)(S+τ)·q + σ·n(n−1)/2    (A12)
PP_tot^square = (n−1)·[S(3q/2+1) + τ(3q/2 + n/2^{q/2+1})] + σ·n(n−1)/2^{q/2+1}    (A13)
Assuming high values of n and a propagation time strictly lower than the comparison time (τ < σ), Proposition 1 shows that there exists a threshold for the ratio σ/τ suitable to determine the lowest execution time among (A11), (A12) and (A13).
Proposition 1. If σ < √p·τ then PP_tot^col ≥ PP_tot^row ≥ PP_tot^square, otherwise PP_tot^col ≥ PP_tot^square ≥ PP_tot^row.
Proof. We first prove PP_tot^col ≥ {PP_tot^row, PP_tot^square}, then the relation between PP_tot^row and PP_tot^square.

I) From (A11) and (A12) we have:

PP_tot^col − PP_tot^row = (n−1)·{(S+τ)q + σn/2 − S(2q+1) − τ(2q + n/2) − σn/2^{q+1}}
                        = (n−1)·{(n/2)·[σ(2^q − 1)/2^q − τ] − S(q+1) − τq},

which is non-negative when (2^q/(2^q − 1))·τ < σ − ε1.

II) From (A12) and (A13) we have:

PP_tot^col − PP_tot^square = (n−1)·{(S+τ)q + σn/2 − S(3q/2+1) − τ(3q/2 + n/2^{q/2+1}) − σn/2^{q/2+1}}
                           = (n−1)·{(n/2^{q/2+1})·[σ(2^{q/2} − 1) − τ] − S(q/2+1) − τq/2},

which is non-negative when τ/(2^{q/2} − 1) < σ − ε2.

III) From (A11) and (A13) we have:

PP_tot^row − PP_tot^square = (n−1)·{S(2q+1) + τ(2q + n/2) + σn/2^{q+1} − S(3q/2+1) − τ(3q/2 + n/2^{q/2+1}) − σn/2^{q/2+1}}
                           = (n−1)·{(n(2^{q/2} − 1)/2^{q+1})·[2^{q/2}·τ − σ] + (q/2)(S+τ)},

which is non-negative when σ < 2^{q/2}·τ + ε3.

In summary, by considering that p = 2^q, we have demonstrated:

I)   (p/(p−1))·τ < σ − ε1  ⇒  PP_tot^col ≥ PP_tot^row
II)  τ/(√p − 1) < σ − ε2   ⇒  PP_tot^col ≥ PP_tot^square
III) σ < √p·τ + ε3          ⇒  PP_tot^row ≥ PP_tot^square

The conjunction of I), II) and III) is equivalent to:

(p/(p−1))·τ < σ − ε1 and σ < √p·τ + ε3  ⇒  PP_tot^col ≥ PP_tot^row ≥ PP_tot^square
σ ≥ √p·τ + ε3                            ⇒  PP_tot^col ≥ PP_tot^square ≥ PP_tot^row

If we let n become arbitrarily large, by considering that the εi (i = 1, 2, 3) are O(1/n) terms, the thesis holds.

Q.E.D.
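The two orderings of Proposition 1 can be observed numerically from (A10). The sketch below assumes the reconstructed totals and uses illustrative constants for S, τ and σ (all names are ours):

```python
def pp_tot(n, q, r, S, tau, sigma):
    """(A10): total time of the row partial pivoting phase
    for matrix size n, hypercube dimension q, subnet size r."""
    return (n - 1) * (S * (2 * q - r + 1)
                      + tau * (2 * q - r + n / 2 ** (r + 1))) \
           + sigma * n * (n - 1) / 2 ** (q - r + 1)

n, q = 4096, 4                         # p = 16, sqrt(p) = 4
row    = lambda S, t, s: pp_tot(n, q, 0,      S, t, s)
square = lambda S, t, s: pp_tot(n, q, q // 2, S, t, s)
col    = lambda S, t, s: pp_tot(n, q, q,      S, t, s)

# sigma = 2 < sqrt(p)*tau = 4:   column >= row >= square
assert col(100, 1.0, 2.0) >= row(100, 1.0, 2.0) >= square(100, 1.0, 2.0)
# sigma = 8 > sqrt(p)*tau = 4:   column >= square >= row
assert col(100, 1.0, 8.0) >= square(100, 1.0, 8.0) >= row(100, 1.0, 8.0)
```

Note that (A12) omits the swap term, so `col` here is a slight overestimate of the column-decomposition time; it does not affect the orderings shown.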
REFERENCES
[1] M. Angelaccio and M. Colajanni, “Unifying and optimizing parallel linear algebra algorithms”, to appear in IEEE Trans. on Parallel and Distributed Systems.
[2] M. Angelaccio and M. Colajanni, “On the numerical stability of row/column pivoting”, Res. Report RI.92.05, Univ. of Roma “Tor Vergata”, Italy, July 1992.
[3] M. Angelaccio and M. Colajanni, “PLUS: A run-time library for decomposition-independent parallel programming”, Res. Report RI.92.12, Univ. of Roma “Tor Vergata”, Italy, Nov. 1992; also submitted for publication.
[4] R. Bisseling and J. van de Vorst, Parallel LU decomposition on a transputer network, in: Proc. of Shell Conference on Parallel Computing (Lecture Notes in Computer Science, Springer Verlag, Berlin, 1989).
[5] E. Chu and A. George, Gaussian elimination with partial pivoting and load balancing on a multiprocessor, Parallel Computing 5 (1987) 65–74.
[6] M. Cosnard, B. Tourancheau and G. Villard, Gaussian elimination on message passing architecture, in: E.N. Houstis, T.S. Papatheodorou, C.D. Polychronopoulos, eds., Proc. of 1st Int. Conf. on Supercomputing, Athens (Lecture Notes in Computer Science, n. 297, Springer Verlag, NY, 1987).
[7] G. Davis, Column LU factorization with pivoting on a hypercube multiprocessor, SIAM J. Algebraic Discrete Methods 7 (1986).
[8] G.C. Fox, M. Johnson, G. Lyzenga, S.W. Otto, J. Salmon and D. Walker, Solving Problems on Concurrent Processors, Vol. 1 (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[9] G.A. Geist and M.T. Heath, Matrix factorization on a hypercube multiprocessor, in: M.T. Heath, ed., Hypercube Multiprocessors 1986 (Philadelphia, 1986) 161–180.
[10] G.A. Geist and C.H. Romine, LU factorization algorithms on distributed-memory multiprocessor architectures, SIAM J. Sci. Statist. Comp. 9 (1988).
[11] J.M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems (Frontiers of Computer Science, Plenum Press, New York, 1988).
[12] J.M. Ortega and C.H. Romine, The ijk forms of factorization methods II. Parallel systems, Parallel Computing 7 (1988) 149–162.
[13] Y. Saad, Gaussian elimination on hypercubes, YALEU/DCS/RR-462, Yale University, 1986.
[14] D. Sorensen, Analysis of pairwise pivoting in Gaussian elimination, IEEE Trans. on Computers C-34 (1985) 274–278.
[15] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis (Springer-Verlag, New York, 1980).
[Figure 1 depicts, over the subcubes {00*}, {01*}, {10*} and {11*}, the steps: search for pivot in mycol k (fan-in to the Manager), broadcast (pivot row index), swap (myrow k, myrow pivot), and broadcast (myrow pivot).]

Figure 1. Partial pivoting steps, at stage k, in a LU factorization algorithm on a hypercube (q=3, r=1) applied to a 2D decomposed matrix. Active communication links are evidenced by arrows.
[Figure 2 marks, over the subcubes {00**}, {01**}, {10**} and {11**}, the nodes active in row/column pivoting, those active in row partial pivoting, and the nodes that would have been idle in case of row partial pivoting and square scattered decomposition.]

Figure 2. Pivot search step in NaiveRCP assuming that (q=4, r=2), and node Manager, at stage k, is 0000.
[Figure 3 is a chart of execution times (sec., 0–400) showing the LU factorization time and the pivoting overhead for the partial and row/column pivoting strategies over the row, square and column matrix decompositions.]

Figure 3. Pivoting overhead in the parallel LU factorization (n=1024, p=16).
[Figure 4 plots the speedup (0–16) versus the matrix size (0–1200) for row partial and row/column pivoting, with p=8 and p=16.]

Figure 4. Speedup for row partial and row/column pivoting in case of square decomposition.