The Row/Column Pivoting Strategy on Multicomputers




M. ANGELACCIO and M. COLAJANNI


Dipartimento di Ingegneria Elettronica, Università di Roma “Tor Vergata”, Via della Ricerca Scientifica,
Roma, Italy (Tel: +39-6-72594478, Fax: +39-6-2020519, e-mail: colajanni@tovvx1.ccd.utovrm.it)



ABSTRACT. On multicomputers the partial pivoting phase of LU factorization suffers a peculiar load
unbalancing due to the presence of idle processors in most matrix decompositions. Moreover, intrinsic
synchronization barriers do not allow a complete masking of this overhead by means of pipelining
techniques. We propose to reduce load unbalancing by “assigning extra work to idle processors”; this leads
to a new pivoting strategy, named row/column pivoting, that is mainly attractive for 2D decompositions.
Row/column pivoting yields an LU factorization algorithm that guarantees better numerical stability at
the same cost as partial pivoting in the case of square decomposition. A further improvement is achieved by
adding pipelining schemes to the naive form. In the design of the algorithms and in their evaluation we
have adopted a new environment that allows decomposition-independent parallel programming.


Keywords. Growth factor, Load balancing, Pipelining schemes, Pivoting strategies, Numerical stability.



1. INTRODUCTION


Pivoting is a widely used technique for improving the numerical stability of matrix
factorization. While pivoting is almost a trivial operation on sequential machines, it can
require a fairly complicated algorithm in a concurrent implementation [8]. In particular, this
is true on multicomputers, where the matrix decomposition has a relevant impact
on the performance of the pivot search. In almost all decompositions partial pivoting does
not perform well due to an inherent load unbalancing. This can arise: 1) from an unfortunate
sequence of pivot choices, and/or 2) from the inactivity of processors during the pivoting
phase. If we restrict attention to row partial pivoting, the former unbalancing is typical of row
decomposition and is mitigated by row swaps [5], whereas the latter, typical of column
decomposition, is reduced by pipelining schemes, which aim to mask the serial overhead by
anticipating the pivot computation.

Parallel implementations of LU factorization with partial pivoting are treated in [5, 6, 9, 10,
11] for 1D matrix decompositions (i.e., row or column). The authors [8, 13] who propose 2D
decompositions (e.g., square) as a better alternative to row and column decomposition focus
more on the LU factorization than on the pivoting phase. This paper, instead, aims to mitigate
pivoting overheads especially in such decompositions. When partial pivoting is adopted in a
2D decomposition, both kinds of load unbalancing are present, and processor idleness still
remains even if some strategy (i.e., data swap, pipelining) proposed for 1D is naively extended
to 2D decompositions. For this reason we suggest replacing partial pivoting with a less partial
strategy, named row/column pivoting, that “assigns extra work to idle processors during the
pivot search”. Since for 2D decompositions this extra computation affects only otherwise idle
processors (i.e., idle if partial pivoting were adopted), we expect parallel row/column pivoting
to achieve a higher speedup than partial pivoting. In addition, we demonstrate that row/column
pivoting increases the numerical stability while preserving the linear complexity of the pivot
search. It should be noted that row/column pivoting, not particularly useful in serial LU
factorization, enjoys successful employment on multicomputers. This demonstrates that
parallelism not only leads to parallelizing serial programs, but sometimes also suggests quite
original algorithms.

Before entering into the details of row/column pivoting, we summarize the main features
of a new programming environment that has allowed us to implement the parallel algorithms
of this paper in a decomposition-independent way. The theoretical foundation of this
environment, named I_PLUS, can be found in [1], whereas a complete description is reported in
[3].

I_PLUS consists of a set of programming primitives based on a meta-decomposition,
called Subnet Matrix Decomposition (SMD), from which it is possible to derive any other
decomposition. A 1D matrix decomposition can be seen as a combination of two functions:
the matrix partition, which decomposes the matrix by row or by column, and the node mapping
strategy, which assigns these parts to each processor in a block, scattered, etc. way. The key idea
of SMD is to substitute for the node mapping strategy a subnet mapping strategy that assigns
each row to a subnet (e.g., a subcube) instead of a single processor. By adopting such a
function, in practice we define a functional decomposition whose parameters are the subnet
dimension r (e.g., in a subcube, 0 ≤ r ≤ q = log₂ p), and the mapping attributes.
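
As an illustration of the SMD idea, the following Python sketch (our own, with hypothetical names; it is not the I_PLUS API) shows how a single mapping function parameterized by r covers row, grid, and column decompositions:

    # Sketch of a subnet mapping strategy (hypothetical, not the I_PLUS API):
    # row i is assigned to a whole subcube of dimension r, scattered among
    # the 2^(q-r) disjoint subnets of a q-dimensional hypercube.
    def subnet_of_row(i, q, r):
        s = i % (1 << (q - r))                  # subnet index (scattered mapping)
        return [(s << r) | j for j in range(1 << r)]

    # q=4 (p=16): r=0 gives one node per row (row decomposition), r=2 a grid,
    # r=4 the whole machine per row (column decomposition).
    for r in (0, 2, 4):
        print(r, subnet_of_row(5, q=4, r=r))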

These unifying properties propagate to the set of decomposition-independent programming
primitives that constitutes I_PLUS. By adopting this environment, the algorithm design itself
becomes decomposition-independent, and the resulting programs unify all those based on a
particular decomposition. Throughout the paper, this property will be used to implement the
row/column pivoting meta-algorithms, to carry out a decomposition-independent performance
analysis, and to obtain execution times for different matrix decompositions from a single
code.

This paper is organized as follows. Section 2 gives the motivation of this paper by discussing
the load unbalancing of the pivoting phase and the strategies to mask it. Section 3 introduces the
naive and the pipelined forms of LU factorization with row/column pivoting. Section 4
contains a comparative analysis of row/column vs. partial pivoting in terms of speedup
evaluation and numerical stability. Section 5 shows the experimental results, whereas Section
6 gives some conclusive remarks.

2. LOAD BALANCING IN THE PIVOTING PHASE


Several papers have demonstrated that 2D matrix decompositions (and in particular square)
perform better than 1D decompositions as regards the LU factorization algorithm [1, 4, 13].
This seems untrue if one focuses the analysis on the partial pivoting phase, which is divisible
into three steps; i.e., search for pivot (O(n) computation and O(p) communications),
broadcast pivot index (O(p) communications), and swap rows (O(1) communications). For
example, if one applies the Fox algorithm [8] to dense matrices, the spanning trees (devoted
to searching the pivot by a fan-in call, and to broadcasting the pivot index by a fan-out call)
turn out to be defined on subcubes instead of on all the nodes. This implies, on one hand, a reduction
of the spanning tree depth and of the communication time; on the other, more idle nodes and more
serial comparisons in the local maximum search. The graphical description of the pivoting
steps highlights the load unbalancing of this algorithm when it is applied to 2D decompositions
(Fig. 1).
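
For concreteness, the subcubes on which these spanning trees are defined can be enumerated with a few lines of Python (a sketch assuming that a subnet groups the nodes sharing their q−r high address bits):

    def subcube(node, q, r):
        # All nodes whose high q-r address bits equal those of `node`:
        # a subcube of dimension r in a q-dimensional hypercube.
        high = node >> r
        return [(high << r) | low for low in range(1 << r)]

    # q=3, r=1 yields the four subcubes {00*}, {01*}, {10*}, {11*} of Fig. 1.
    print([subcube(base, q=3, r=1) for base in (0b000, 0b010, 0b100, 0b110)])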

In correspondence with the pivoting steps, each matrix decomposition causes a different load
unbalancing (Table 1). It should be noted that, since the complexity of the searching step
depends on the matrix size, it usually requires most of the computation; therefore, the load
unbalancing intrinsic in square and column decomposition turns out to be much more serious
than that of row decomposition.


    Decomposition      search for pivot                  send pivot index      swap rows
    row (r = 0)        No load unbalancing               No load unbalancing   (2^q - 2) idle processors
    grid (0 < r < q)   (2^q - 2^(q-r)) idle processors   No load unbalancing   (2^q - 2^(r+1)) idle processors
    column (r = q)     (2^q - 1) idle processors         -                     No load unbalancing

Table 1. Load unbalancing due to the row partial pivoting phase in
accordance with the matrix decomposition (2^q = p).
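
The counts of Table 1 can be checked with a short Python sketch (the formulas are those of the table; r = 0, 0 < r < q, and r = q select the row, grid, and column cases):

    def idle_processors(q, r):
        # Idle processors in the search and swap steps of row partial
        # pivoting on p = 2**q nodes with subnets of dimension r (Table 1).
        p = 2 ** q
        search = p - 2 ** (q - r)                  # 0 when r = 0
        swap = 0 if r == q else p - 2 ** (r + 1)   # no swap communication when r = q
        return search, swap

    for r in (0, 2, 4):                            # row, square, column for q = 4
        print(r, idle_processors(q=4, r=r))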


Two strategies have been commonly adopted to mitigate such load unbalancing; i.e., the
introduction of pipelining schemes and/or the use of pivoting strategies weaker than partial.
Pipelining schemes increase the efficiency of the naive parallel algorithms by anticipating
some computation and/or overlapping computation with communications. An example of
pipelining is the send-ahead scheme that, applied to the naive LU factorization without
pivoting, avoids the unnecessary synchronizations at the end of each stage k by broadcasting
row k+1 as soon as it is updated (for more details on this scheme see [1, 11, 12]).

Even in the case of LU factorization with pivoting, it is possible to increase the efficiency
of the naive algorithms by anticipating some computation. Here, though, the problem is quite
different from the without-pivoting case, because no processor can anticipate the updating
phase k+1 at stage k without knowing pivot k+1. As a consequence, the first aim of any
pipelining scheme has to be the anticipated evaluation of the pivot. Overlapping the
search for pivot k+1 with the completion of stage k turns out to be particularly useful when row
partial pivoting is applied to a matrix decomposed by columns [7] (or, in a dual manner, when
column partial pivoting is used in combination with row decomposition). This scheme, here
called (pivot)-compute-ahead, has also been proposed in the case of row partial pivoting and
row decomposition [5]. In other algorithms (e.g., LU factorization without pivoting) we can
also adopt pipelining schemes that combine some pre-computation with the send-ahead of row k+1.
This is not the case for LU with partial pivoting, because it is impossible to anticipate the
updating and sending phases before knowing the row pivot index [1].

Among the weaker pivoting strategies, an attractive alternative to partial pivoting is
pairwise pivoting, which introduces parallelism into the pivoting steps. It has been demonstrated
to guarantee enough stability in many examples [14], but the upper bound on its growth factor
is the square of that of partial pivoting. Another possibility is local partial pivoting, in
which only the processor holding a_kk^(k) searches the pivot among its column k entries [6].
This algorithm allows the combination of send-ahead and compute-ahead schemes also for 2D
decompositions [1]; in fact, no communication is necessary in the search-for-pivot and
swap steps, and only a subnet spanning tree is required in the pivot index communication.
The good performance of this algorithm, though, is counterbalanced by an unknown loss of
numerical stability.

Unfortunately, the pipelining schemes do not completely solve the load unbalancing of pivoting
strategies; on the other hand, weaker pivoting strategies may suffer stability problems. To
reduce load unbalancing and to guarantee at least the same (in reality better) stability as partial
pivoting, we propose a stronger pivoting strategy, called row/column pivoting. It assigns extra
work to the row k nodes (otherwise idle in row partial pivoting) by exploring both the column
below a_kk^(k) and the row next to a_kk^(k); i.e.,

    pivot_k = max { max_{k≤i≤n} |a_ik^(k)| , max_{k≤j≤n} |a_kj^(k)| }

A further improvement of the load balancing can be achieved by adding some pipelining
scheme to this algorithm. Two parallel forms (naive and optimized) of row/column pivoting
are described in Section 3, whereas its properties are demonstrated in Section 4.



3. ROW/COLUMN PIVOTING



3.1 The Naive Algorithm

As regards the partial pivoting steps, LU factorization for 2D decompositions is not
preferable to that for 1D decompositions, because all the nodes not involved in the pivot
search are idle. The number of such nodes increases with the subcube size r, thus weakening
the gain achieved by the spanning tree depth reduction (evidenced in Figure 1).

Following the 2D paradigm, we replace partial pivoting with row/column pivoting, which
combines row and column partial pivoting. For the pivot search step we use two fan-in calls,
each implemented by a spanning tree with a common root node named manager; one tree
spans the row k nodes, the other the column k nodes. The concurrent use of disjoint spanning
trees of the same dimension reduces pivoting unbalancing (Fig. 2) and improves the stability
(Section 4.2) at no cost. The naive form of the parallel LU factorization with row/column
pivoting is outlined in NaiveRCP.

The algorithms are written in a pseudo-code language, whereas the actual programs are
implemented in the I_PLUS programming environment, which allows us to obtain programs based
on a particular matrix decomposition by setting the SMD parameters at run-time [1]. These
pseudo-codes do not spell out the decomposition-independent I_PLUS primitives, but denote all
the expressions and instructions that require them by means of “< ... >” phrases.

There are two main differences between LU factorization with row pivoting and NaiveRCP;
i.e., the simultaneous search for the pivot both in row k and in column k, and the swap data step.

It should be observed that for 1D matrix decompositions row/column pivoting is not very
effective because, at each stage k, it would require n-k additional comparisons on the row k
node. On the other hand, this strategy becomes more efficient than partial pivoting for 2D
decompositions. Among them, the most balanced computation is achieved by square scattered
decomposition, which involves the highest number of processors in the search for the pivot.


NaiveRCP

    for k=1 to n-1 do
        /* Row/column partial pivoting */
        if (<I own column k>) then <search for pivot k in mycolumn k>
        if (<I own row k>) then <search for pivot k in myrow k>
        if (manager = mynode) then <send pivot index>
            else receive(pivot index)
        if (k ≠ pivot index) then <swap (row k, row pivot) or (column k, column pivot)>
        /* LU Factorization */
        if (<I own row k>) then <send myrow k>
            else receive(myrow k)
        if (<I own column k>) then <compute and send mymultipliers k>
            else receive(mymultipliers k)
        for all (<i ≥ k+1 in myrow i>) do
            for all (<j ≥ k+1 in mycolumn j>) do
                a_ij = a_ij - l_ik * a_kj
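
To make the pivot choice of NaiveRCP concrete, here is a serial Python/NumPy rendering of LU factorization with row/column pivoting (a sketch of the strategy only, not the parallel I_PLUS program; ties between the column and the row maximum are broken in favor of the column, which is one possible convention):

    import numpy as np

    def lu_rc_pivoting(A):
        # At stage k the pivot is the entry of largest modulus found either
        # in column k (below a_kk) or in row k (right of a_kk); a row swap
        # or a column swap brings it into pivotal position.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        for k in range(n - 1):
            ic = k + int(np.argmax(np.abs(A[k:, k])))   # candidate in column k
            jr = k + int(np.argmax(np.abs(A[k, k:])))   # candidate in row k
            if abs(A[ic, k]) >= abs(A[k, jr]):
                if ic != k:
                    A[[k, ic]] = A[[ic, k]]             # swap (row k, row pivot)
            elif jr != k:
                A[:, [k, jr]] = A[:, [jr, k]]           # swap (column k, column pivot)
            A[k+1:, k] /= A[k, k]                       # multipliers l_ik
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # a_ij -= l_ik * a_kj
        return A                                        # L (strict lower) and U packed

A complete solver would also record the row and column permutations; they are omitted here to keep the sketch focused on the pivot search.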




3.2 The Pipelined Algorithms


The efficiency of NaiveRCP can be increased by introducing some pipelining scheme. First of
all we propose the well-known (pivot)-compute-ahead scheme, which aims to anticipate the
search for pivot k+1 before the completion of stage k. This algorithm requires a
pre-updating of column k+1 and row k+1, an immediate search for pivot k+1 on these matrix
entries, and a broadcast of the pivot row index. We do not explicitly illustrate this algorithm,
which can be easily deduced from the next PipeRCP.
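
The ordering imposed by this scheme can be illustrated serially; the Python sketch below (our own illustration, shown for row partial pivoting for brevity) pre-updates column k+1 so that pivot k+1 is known, and could be sent ahead, before the trailing update of stage k is completed:

    import numpy as np

    def lu_pivot_compute_ahead(A):
        # (pivot)-compute-ahead ordering: column k+1 is updated first and
        # pivot k+1 is chosen early (the point where a parallel code would
        # send it ahead); only then is the rest of the submatrix updated.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        piv = int(np.argmax(np.abs(A[:, 0])))           # pivot 1 known up front
        for k in range(n - 1):
            if piv != k:
                A[[k, piv]] = A[[piv, k]]               # row swap
            A[k+1:, k] /= A[k, k]                       # multipliers l_ik
            if k < n - 2:
                A[k+1:, k+1] -= A[k+1:, k] * A[k, k+1]  # pre-update column k+1
                piv = k + 1 + int(np.argmax(np.abs(A[k+1:, k+1])))  # early pivot
                A[k+1:, k+2:] -= np.outer(A[k+1:, k], A[k, k+2:])   # rest of stage k
            else:
                A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return A

The factorization is identical to the one produced by the usual ordering; only the point at which the pivot becomes available changes.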

A better overlapping scheme, which we propose for 2D matrix decompositions, is achieved by
anticipating the computation both of pivot k+1 and of multipliers k+1 during stage k. This
scheme, called (pivot&multiplier)-compute-ahead, aims to: pre-update a subset of matrix
entries (i.e., column k+1 and row k+1), anticipate the evaluation of pivot k+1 and the
broadcasting of its index, and anticipate the computation of multipliers k+1 and their sending.
This algorithm is sketched in PipeRCP (shadow boxes evidence the differences with NaiveRCP
due to pipelining).


PipeRCP

    if (<I own column 1>) then <search for pivot 1 in mycolumn 1>
    if (<I own row 1>) then <search for pivot 1 in myrow 1>
    if (manager = mynode) then <send-ahead pivot index>
        else receive(pivot index)
    if (1 ≠ pivot index) then <swap (row 1, row pivot) or (column 1, column pivot)>
    if (<I own column 1>) then <compute and send-ahead mymultipliers 1>
    for k=1 to n-1 do
        if not(<I own column k>) then receive(mymultipliers k)
        if (<I own row k>) then <send myrow k>
            else receive(myrow k)
        if (<I own column k+1>) then <update mycolumn k+1>
                                     <search for pivot k+1 in mycolumn k+1>
        if (<I own row k+1>) then <update myrow k+1>
                                  <search for pivot k+1 in myrow k+1>
        if (manager = mynode) then <send-ahead pivot index>
            else receive(pivot index)
        if (k+1 ≠ pivot index) then <swap (row k+1, row pivot) or (column k+1, column pivot)>
        if (<I own column k+1>) then <compute and send-ahead multipliers k+1>
        /* Update submatrix A_{i>k+1, j>k+1} */
        for all (<i ≥ k+2 in myrow i>) do
            for all (<j ≥ k+2 in mycolumn j>) do
                a_ij = a_ij - l_ik * a_kj


It should be observed that both pipelining schemes introduced here can be applied to LU
factorization with partial pivoting [1]. Like that algorithm, row/column pivoting also has a
synchronization barrier that prevents any combination of pre-computation and send-ahead
schemes on row pivot, because the pivot index cannot be known in advance. However, two
properties make row/column pivoting more attractive than partial pivoting; i.e., a speedup gain
and better numerical stability, shown in Sections 4.1 and 4.2, respectively.



4. PERFORMANCE ANALYSIS


4.1 Speedup Analysis


The main goal of this section is to demonstrate that the “extra work assigned to idle nodes” renders
row/column pivoting more efficient than partial pivoting. Let us denote by SMD a generic matrix
decomposition that follows the paradigm introduced in Section 1; this function assumes
values in the interval [row, ..., square, ..., column].

The performance parameters regarding the partial pivoting phase are denoted by:

    PP_comm^SMD                              communication time
    PP_op^SMD                                operation time (i.e., number of comparisons)
    PP_tot^SMD = PP_comm^SMD + PP_op^SMD     global execution time.

In the above definitions we replace PP with RC to indicate the performance parameters of
row/column pivoting. Moreover, let us denote by p the processor number, by τ the propagation
time of a unitary data, and by ω the machine time for executing a comparison between double
precision numbers; these indexes are used in Proposition 1 (whose proof is postponed to the
Appendix).


Proposition 1. If ω < τ√p then PP_tot^col ≥ PP_tot^row ≥ PP_tot^square;
otherwise PP_tot^col ≥ PP_tot^square ≥ PP_tot^row.

Since the hypothesis ω < τ√p is satisfied on almost all machines, it seems reasonable to
restrict the comparison between partial and row/column speedup to SMD = square. In such a
case, Proposition 1 holds true also for row/column pivoting because:

    RC_op^square = PP_op^square   and   RC_comm^square = PP_comm^square          (1)

For SMD = square, in fact, the search for the maximum in row k and column k involves two sets
of nodes that are disjoint and of equal size. In general, instead, these sets of nodes have
different sizes, whereby RC_op^{SMD≠square} ≥ PP_op^{SMD≠square}.


Proposition 2. For square matrix decomposition, the row/column pivoting speedup is twice
the partial pivoting speedup.

Proof. By definition, the row/column pivoting speedup is S_RC^SMD = RC_serial / RC_tot^SMD. For
the serial pivoting phase it is easy to obtain:

    RC_serial = 2 PP_serial                                                      (2)

By using (1) and (2), the speedup equation for square decomposition becomes:

    S_RC^square = RC_serial / (RC_comm^square + RC_op^square)
                = 2 PP_serial / (PP_comm^square + PP_op^square) = 2 S_PP^square  (3)

Q.E.D.


Proposition 3 provides an estimate of S_RC^square by using the results obtained in the Appendix.

Proposition 3. If n becomes arbitrarily large, the speedup obtainable by row/column pivoting
for square decomposition approaches 2√p ω/(ω+τ).


Proof. From (3) we have the row/column pivoting speedup for square decomposition:

    S_RC^square = 2 PP_serial / PP_tot^square                                    (4)

By replacing the denominator of (4) with formula (A13) of the Appendix,

    PP_tot^square = (n-1)[ S(3q/2 + 1) + τ(3q/2 + n/2^{q/2+1}) ] + ω n(n-1)/2^{q/2+1}

and observing that PP_serial = ω n(n-1)/2, equation (4) becomes:

    S_RC^square = ω n(n-1) / { (n-1)[ S(3q/2 + 1) + τ(3q/2 + n/2^{q/2+1}) ] + ω n(n-1)/2^{q/2+1} }    (5)

By restricting to the O(n²) terms in (5), it holds:

    S_RC^square ≈ ω n² / ( τ n²/2^{q/2+1} + ω n²/2^{q/2+1} + O(n) )
                = 2^{q/2+1} ω/(ω+τ) = 2√p ω/(ω+τ)                                (6)

from which it follows that S_RC^square tends to 2√p ω/(ω+τ) as n grows.

Q.E.D.
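
A quick numeric check of (5) (a sketch; the machine constants S, τ, ω below are invented, chosen so that τ < ω < τ√p, as assumed above):

    import math

    def s_rc_square(n, q, S, tau, w):
        # Row/column pivoting speedup for square decomposition, formula (5).
        denom = (n - 1) * (S * (1.5 * q + 1) + tau * (1.5 * q + n / 2 ** (q / 2 + 1))) \
                + w * n * (n - 1) / 2 ** (q / 2 + 1)
        return w * n * (n - 1) / denom

    q, S, tau, w = 4, 100.0, 1.0, 2.0               # p = 16, assumed constants
    for n in (64, 1024, 16384, 262144):
        print(n, round(s_rc_square(n, q, S, tau, w), 3))
    print("limit:", 2 * math.sqrt(2 ** q) * w / (w + tau))   # 2 sqrt(p) w/(w+tau)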



4.2 Numerical Stability


The pivot search carried out on a larger set of entries (though still O(n)) suggests that row/column
pivoting yields a better numerical stability of the LU factorization than partial pivoting does.
Let us denote by a_max^PP(k) and a_max^RC(k) the value of max_ij |a_ij^(k)| obtained at stage k of the LU
factorization with partial pivoting and with row/column pivoting, respectively. Moreover, we
define:



- one-step growth factor:   g^STR(k) = a_max^STR(k) / a_max^STR(k-1)

- growth factor:            G^STR(k) = a_max^STR(k) / a_max(1)

where STR stands for RC or PP.

The upper bound of the growth factor for partial pivoting is easily obtained by observing that
[15]:

    a_max^PP(k) ≤ 2 a_max^PP(k-1),   and   a_max^PP(k) ≤ 2^{k-1} a_max(1),

i.e.,

    g^PP(k) ≤ 2,   and   G^PP(k) ≤ 2^{k-1}.                                      (7)

By definition, one would expect that row/column pivoting provides a growth factor smaller
than that of partial pivoting.


Theorem 4. The row/column pivoting growth factor is bounded by G^RC(k) ≤ 2^{k-1}.

Proof. Let us distinguish two cases:

[i] If the pivots obtained from row/column and partial pivoting differ (i.e., pivot_RC > pivot_PP),
the row/column pivoting growth factor is strictly lower than the partial pivoting one; i.e.,
G^RC(k) < G^PP(k), and the thesis follows immediately from (7).

[ii] If the pivots obtained from row/column and partial pivoting coincide, the partial pivoting
growth factor does not reach its upper bound; i.e., G^RC(k) = G^PP(k) < 2^{k-1}.
The claim is to demonstrate that if at a certain step the one-step growth factor does not
reach 2, then the growth factor cannot reach 2^{k-1}. To this purpose we apply the theorem shown in [2]; i.e.,

    If at a certain step i (2 ≤ i ≤ k) a_max^RC(i) = a_max^PP(i), then g^PP(i) < 2 or g^PP(i-1) < 2.

By using this result and the definition of G^PP(k), it follows:

    G^PP(k) =def g^PP(k) · ... · g^PP(2) < 2^{k-1}.

Q.E.D.


The following matrix furnishes an example in which the row/column pivoting growth factor is
lower than that of partial pivoting (but remains higher than that of complete pivoting):

          |  1    0    0   1/4 |
    A  =  | -1    1    0   1/3 |
          | -1   -1    1   1/2 |
          | -1   -1   -1    1  |

In such a case, the growth factors resulting from the LU factorization are: G^PP(4) = 19/6 ≈ 3.167,
G^RC(4) = 31/22 ≈ 1.409, and 1573/1245 (≈ 1.263) for complete pivoting.
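
The partial pivoting value can be verified with a short NumPy sketch (only G^PP(4) is checked here; the row/column and complete pivoting values depend on the swap and tie-breaking conventions of the actual implementation):

    import numpy as np

    def growth_factor_pp(A):
        # G^PP(n) = a_max(n) / a_max(1) under row partial pivoting.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        a_max_1 = np.abs(A).max()
        for k in range(n - 1):
            piv = k + int(np.argmax(np.abs(A[k:, k])))
            A[[k, piv]] = A[[piv, k]]                   # row swap
            A[k+1:, k+1:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k+1:])
            A[k+1:, k] = 0.0                            # eliminated entries
        return np.abs(A).max() / a_max_1

    A = [[ 1,  0,  0, 1/4],
         [-1,  1,  0, 1/3],
         [-1, -1,  1, 1/2],
         [-1, -1, -1, 1  ]]
    print(growth_factor_pp(A))                          # 19/6 = 3.1666...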

A more extended analysis of the numerical stability of row/column pivoting can be found in [2].



5. EXPERIMENTAL RESULTS


In this section we show the experimental results obtained on an Intel iPSC/2 hypercube with 16
nodes. The purpose is to verify that the better stability of row/column pivoting is obtained
without degrading the performance of partial pivoting.

We consider the pipelined forms of LU factorization with row partial pivoting and with
row/column pivoting (PipeRCP). Since these codes have been implemented in I_PLUS, the results
for row, square and column decomposition have simply been obtained by setting the SMD
parameters. To force a row or column swap (with communication) at each step k, we used two
types of matrices in the experiments; i.e., ARS (always row swap) and ACS (always column swap).

The diagram in Figure 3 displays the overhead due to the pivoting phase in the LU
factorization. The matrix size is fixed at n=1024, the processor number at p=16, and the
decomposition is variable. It should be remarked that this diagram: 1) confirms Proposition 1,
by indicating a minimum in the pivoting overhead for SMD=square, both for partial and for
row/column pivoting (this also implies that the iPSC/2 is a machine where the first hypothesis of
Proposition 1 holds true); 2) testifies that the difference in execution time between LU
factorization with partial and with row/column pivoting is negligible in the case of square
decomposition.

Figure 4 plots the speedup of LU factorization with the partial and row/column pivoting strategies
as a function of the processor number and the matrix size. The matrix chosen in all the
experiments is ARS, whereas the matrix decomposition corresponds to r = ⌈q/2⌉. This figure
shows that the speedup of LU factorization with row/column pivoting is always better than
that of partial pivoting. In addition, the gap between the two curves grows when p becomes
large. These results, though concerning the entire LU factorization algorithm with pivoting
and not only the pivoting phase, give an experimental confirmation of Proposition 2.

Table 2 shows in more detail the execution times for row/column pivoting as a function of p, r
and n. The conclusion is that on the iPSC/2 square scattered decomposition remains preferable to
row and column decompositions even with pivoting (both partial and row/column). The lower
execution time shown by ARS matrices in column decomposition is due to the absence of
communications in its row swap step.




              n     row (ARS)   square (ARS)   column (ACS)   column (ARS)
             64       0.683        0.592          0.662          0.541
            128       3.520        3.176          3.499          3.083
    p=4     256      23.163       21.952         23.572         21.862
            512     170.790      166.506        173.493        166.090
           1024     1328.78      1304.17        1332.77        1303

             64       0.719        0.545          0.704          0.570
            128       2.778        2.177          2.725          2.26
    p=8     256      14.621       12.510         14.375         12.70
            512      95.630       87.948         94.260         88.32
           1024     695.144      666.884        686.717        668.9

             64       0.9          0.524          0.986          0.842
            128       3.006        1.595          3.006          2.511
    p=16    256      12.116        7.251         12.030         10.215
            512      63.755       46.307         63.560         56.179
           1024     406.095      340.9          405.341        375.204

Table 2. Execution times (sec.) for LU factorization with row/column pivoting
in the case of row, square, and column decomposition of ARS and ACS matrices.



6. CONCLUSION


LU factorization algorithms for 2D matrix decompositions (in particular square scattered) are
known to perform better than those for 1D matrix decompositions [1, 4, 13]. This is not true if
one restricts attention to the pivoting phase, due to an intrinsic load unbalancing. In this paper
we have proposed to reduce processor inactivity by means of previously unexplored techniques.

First, we have introduced a new pivoting strategy (row/column pivoting) that “assigns extra
work to idle processors”, thus reducing load unbalancing and assuring a better stability [2].
Combining both these properties, we obtain a parallel LU factorization algorithm with a
numerical stability better than that of partial pivoting, without increasing the computational cost.

Second, we have improved the performance of partial and row/column pivoting by proposing
the introduction of two pipelining schemes into these strategies.

The experimental and analytical analyses of row/column vs. partial pivoting have
demonstrated that the former is more efficient than the latter, and that it is very effective for
square decomposition. The analysis has been carried out by means of the new programming
environment I_PLUS, which allowed us to design decomposition-independent algorithms and to
develop a performance analysis independent of a specific decomposition.



APPENDIX


In this section we estimate the influence of the domain decomposition on the pivoting phase
of the LU factorization with row partial pivoting. These results are referred to in Section 4.1.

A remarkable property of the I_PLUS programming environment is the opportunity to analyze
the performance in a decomposition-independent way. The communication and operation times of
the previously described meta-algorithms can be evaluated by means of formulas containing
the parameter r (i.e., the subcube dimension).

To describe the communication time required to broadcast data d in a q-dimensional
hypercube we adopt the following well-known model:

    T_broadcast(d) = (S + τ|d|) q                                                (A1)

where S is the startup time, |d| is the data size, and τ is the propagation time of a unitary data.
By considering the subcube size parameter r, we obtain (in correspondence with the I_PLUS
primitives used in the pivoting phase):

    T_<search d in column k>^SMD = (S + τ|d|)(q - r)                             (A2)

    T_<search d in row k>^SMD = (S + τ|d|) r                                     (A3)

    T_<swap(d1,d2)>^SMD = S + τ max{|d1|, |d2|}                                  (A4)


Equations (A2)-(A4), under the hypothesis that a row swap occurs at every stage k (therefore
upper bound equations will be derived), allow us to evaluate the communication time for the
pivoting steps of the LU factorization meta-algorithm with row partial pivoting:

    PP_comm^SMD = Σ_{k=1}^{n-1} T_<search pivot in mycol k>
                + Σ_{k=1}^{n-1} T_<broadcast pivot index>
                + Σ_{k=1}^{n-1} T_<swap(myrow k, myrow pivot)>

                = (n-1)(S + τ)(q - r) + (n-1)(S + τ) q + ( (n-1) S + τ n(n-1)/2^{r+1} )

                = (n-1) [ S(2q - r + 1) + τ(2q - r + n/2^{r+1}) ]                (A5)


From (A5) it is possible to derive the communication time for any algorithm corresponding to
a specific decomposition by suitably setting the parameter r; as an example, for SMD = row:

    PP_comm^row = Σ_{k=1}^{n-1} T_<search pivot in mycol k>
                + Σ_{k=1}^{n-1} T_<broadcast pivot index>
                + Σ_{k=1}^{n-1} T_<swap(row k, row pivot)>

                = (n-1)(S + τ) q + (n-1)(S + τ) q + ( (n-1) S + τ n(n-1)/2 )

                = (n-1) [ S(2q + 1) + τ(2q + n/2) ]                              (A6)

Equation (A6) can be obtained more directly from (A5) by setting r = 0. In an analogous way, it
is possible to derive the communication time in the case of any grid decomposition; for
example, for SMD = square we set r = q/2 in (A5), thus obtaining:



    PP_comm^square = (n-1) [ S(3q/2 + 1) + τ(3q/2 + n/2^{q/2+1}) ]               (A7)

This also holds for the column decomposition, except for the detail that T_swap in such a
case does not cause any communication. As a consequence, the communication time can be
obtained by setting r = q in (A5) and by excluding T_swap:

    PP_comm^col = (n-1)(S + τ) q                                                 (A8)


The same can be carried out in evaluating the operation time due to the pivoting phase in a
decomposition-independent algorithm. By setting this time equal to the number of
comparisons,

    PP_op^SMD = Σ_{k=1}^{n-1} (n - k)/2^{q-r} = n(n-1)/2^{q-r+1}                 (A9)

It would be possible to derive the formulas for row, column and square decomposition by
substituting r = 0, r = q, and r = q/2 in (A9), respectively.

The combination of (A5) and (A9) gives the global execution time of the pivoting phase
(ω denotes the machine time for executing a comparison between double precision numbers):

    PP_tot^SMD = (n-1) [ S(2q - r + 1) + τ(2q - r + n/2^{r+1}) ] + ω n(n-1)/2^{q-r+1}    (A10)


from which,

    PP_tot^row = (n-1) [ S(2q + 1) + τ(2q + n/2) ] + ω n(n-1)/2^{q+1}            (A11)

    PP_tot^col = (n-1)(S + τ) q + ω n(n-1)/2                                     (A12)

    PP_tot^square = (n-1) [ S(3q/2 + 1) + τ(3q/2 + n/2^{q/2+1}) ] + ω n(n-1)/2^{q/2+1}    (A13)


Assuming high values of n and a propagation time strictly lower than the comparison time (τ < ω),
Proposition 1 shows that there exists a threshold for ω suitable to determine the lowest
execution time among (A11), (A12) and (A13).


Proposition 1. If ω < τ√p then PP_tot^col ≥ PP_tot^row ≥ PP_tot^square;
otherwise PP_tot^col ≥ PP_tot^square ≥ PP_tot^row.


Proof. We first prove PP_tot^col ≥ {PP_tot^row, PP_tot^square}, then the relation between
PP_tot^row and PP_tot^square.

I) From (A11) and (A12) we have:

    PP_tot^col ≥ PP_tot^row  ⟺  (S + τ) q + ω n/2 ≥ S(2q + 1) + τ(2q + n/2) + ω n/2^{q+1}

                             ⟺  ω ≥ (2^{q+1}/(n(2^q - 1))) [ (q+1) S + q τ ] + τ 2^q/(2^q - 1)
                                   ≡ τ (2^q/(2^q - 1)) (1 + ε₁)

II) From (A12) and (A13) we have:

    PP_tot^col ≥ PP_tot^square  ⟺  (S + τ) q + ω n/2 ≥ S(3q/2 + 1) + τ(3q/2 + n/2^{q/2+1}) + ω n/2^{q/2+1}

                                ⟺  ω ≥ (2^{q/2}/(n(2^{q/2} - 1))) [ (q+2) S + q τ ] + τ/(2^{q/2} - 1)
                                      ≡ τ (1/(2^{q/2} - 1)) (1 + ε₂)

III) From (A11) and (A13) we have:

    PP_tot^row ≥ PP_tot^square  ⟺  S(2q + 1) + τ(2q + n/2) + ω n/2^{q+1}
                                       ≥ S(3q/2 + 1) + τ(3q/2 + n/2^{q/2+1}) + ω n/2^{q/2+1}

                                ⟺  ω ≤ (q 2^q/(n(2^{q/2} - 1))) (S + τ) + τ 2^{q/2}
                                      ≡ τ 2^{q/2} (1 + ε₃)

In summary, by considering that p = 2^q, we have demonstrated:

    I)   ω ≥ τ (p/(p - 1)) (1 + ε₁)   ⟹  PP_tot^col ≥ PP_tot^row

    II)  ω ≥ τ (1/(√p - 1)) (1 + ε₂)  ⟹  PP_tot^col ≥ PP_tot^square

    III) ω ≤ τ √p (1 + ε₃)            ⟹  PP_tot^row ≥ PP_tot^square

The conjunction of I), II) and III) is equivalent to:

    τ (p/(p - 1)) (1 + ε₁) ≤ ω ≤ τ √p (1 + ε₃)  ⟹  PP_tot^col ≥ PP_tot^row ≥ PP_tot^square

    ω ≥ τ √p (1 + ε₃)                           ⟹  PP_tot^col ≥ PP_tot^square ≥ PP_tot^row

If we let n become arbitrarily large, by considering that the ε_i (i = 1, 2, 3) are O(1/n) terms,
the thesis holds.

Q.E.D.
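
The threshold of Proposition 1 can also be observed numerically (a Python sketch using (A10), with the column special case (A12); the machine constants are invented, and p = 16, so the threshold τ√p equals 4τ):

    def pp_tot(n, q, r, S, tau, w):
        # Global pivoting time (A10); for r = q (column) the swap term is
        # dropped, as in (A12).
        if r == q:
            comm = (n - 1) * (S + tau) * q
        else:
            comm = (n - 1) * (S * (2*q - r + 1) + tau * (2*q - r + n / 2 ** (r + 1)))
        return comm + w * n * (n - 1) / 2 ** (q - r + 1)

    n, q, S, tau = 1024, 4, 100.0, 1.0
    for w in (3.0, 8.0):                  # below and above the threshold 4*tau
        t = {name: pp_tot(n, q, r, S, tau, w)
             for name, r in (("row", 0), ("square", 2), ("column", 4))}
        print(w, sorted(t, key=t.get))    # decompositions from fastest to slowest
    # prints ['square', 'row', 'column'] for w = 3.0
    # and    ['row', 'square', 'column'] for w = 8.0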



REFERENCES


[1] M. Angelaccio and M. Colajanni, “Unifying and optimizing parallel linear algebra algorithms”, to appear
in IEEE Trans. on Parallel and Distributed Systems.

[2] M. Angelaccio and M. Colajanni, “On the numerical stability of row/column pivoting”, Res. Report
RI.92.05, Univ. of Roma “Tor Vergata”, Italy, July 1992.

[3] M. Angelaccio and M. Colajanni, “PLUS: A run-time library for decomposition-independent parallel
programming”, Res. Report RI.92.12, Univ. of Roma “Tor Vergata”, Italy, Nov. 1992; also submitted for
publication.

[4] R. Bisseling and J. van de Vorst, Parallel LU decomposition on a transputer network, in: Proc. of Shell
Conference on Parallel Computing (Lecture Notes in Computer Science, Springer Verlag, Berlin, 1989).

[5] E. Chu and A. George, Gaussian elimination with partial pivoting and load balancing on a multiprocessor,
Parallel Computing 5 (1987) 65-74.

[6] M. Cosnard, B. Tourancheau and G. Villard, Gaussian elimination on message passing architecture, in:
E.N. Houstis, T.S. Papatheodorou, C.D. Polychronopoulos, eds., Proc. of 1st Int. Conf. on
Supercomputing, Athens (Lecture Notes in Computer Science, n. 297, Springer Verlag, NY, 1987).

[7] G. Davis, Column LU factorization with pivoting on a hypercube multiprocessor, SIAM J. Algebraic
Discrete Methods 7 (1986).

[8] G.C. Fox, M. Johnson, G. Lyzenga, S.W. Otto, J. Salmon and D. Walker, Solving problems on concurrent
processors. Vol. 1 (Prentice-Hall, Englewood Cliffs, NJ, 1988).

[9] G.A. Geist and M.T. Heath, Matrix factorization on a hypercube multiprocessor, in: M.T. Heath, ed.,
Hypercube Multiprocessors 1986 (Philadelphia, 1986) 161-180.

[10] G.A. Geist and C.H. Romine, LU factorization algorithms on distributed-memory multiprocessor
architectures, SIAM J. Sci. Statist. Comp. 9 (1988).

[11] J.M. Ortega, Introduction to parallel and vector solution of linear systems (Frontiers of Computer
Science, Plenum Press, New York, 1988).

[12] J.M. Ortega and C.H. Romine, The ijk forms of factorization methods II. Parallel systems, Parallel
Computing 7 (1988) 149-162.

[13] Y. Saad, Gaussian elimination on hypercubes, YALEU/DCS/RR-462, Yale University, 1986.

[14] D. Sorensen, Analysis of pairwise pivoting in Gaussian elimination, IEEE Trans. on Computers C-34
(1985) 274-278.

[15] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis (Springer-Verlag, New York, 1980).

[Figure 1 depicts the three pivoting steps (search for pivot in mycol k, broadcast of the pivot row
index, swap of myrow k with myrow pivot and broadcast of myrow pivot) on the subcubes {00*},
{01*}, {10*}, {11*}, rooted at the Manager node.]

Figure 1. Partial pivoting steps, at stage k, in an LU factorization algorithm
on a hypercube (q=3, r=1) applied to a 2D decomposed matrix.
Active communication links are evidenced by arrows.




[Figure 2 marks, on the subcubes {00**}, {01**}, {10**}, {11**}, the nodes active in row/column
pivoting versus row partial pivoting; the extra nodes are those that would have been idle in case
of row partial pivoting and square scattered decomposition.]

Figure 2. Pivot search step in NaiveRCP assuming that (q=4, r=2),
and that the node Manager, at stage k, is 0000.

[Figure 3 is a bar chart of execution times (sec.): LU factorization versus pivoting overhead for
the row, square, and column decompositions, for both the partial and the row/column pivoting
strategies.]

Figure 3. Pivoting overhead in the parallel LU factorization (n=1024, p=16).




[Figure 4 plots speedup against matrix size for row partial and row/column pivoting,
with curves for p=8 and p=16.]

Figure 4. Speedup for row partial and row/column pivoting in
case of square decomposition.