
Parallel Block LU Factorization

An Experience in Stateless Parallel Processing

Weijun He (weijunhe@temple.edu)
Yuan Shi (shi@temple.edu, faculty supervisor)

CIS Department
Temple University
Philadelphia, PA 19122

Abstract. This paper presents an experience in parallelizing a Block LU factorization algorithm using a Stateless Parallel Processing (SPP) framework. Compared to message passing systems such as PVM/MPI, SPP allows a simpler dataflow computation model that enables highly effective implicit parallel processing. This brings a range of benefits that are not possible in direct message-passing systems: programming ease, automated fault tolerance, and load balancing with easy granularity control. The SPP architecture also promises high performance, featuring automatic formation of SIMD, MIMD and pipeline processor clusters at runtime.

This paper uses a Block LU factorization algorithm to illustrate the SPP programming, debugging and execution methods. We use timing models to evaluate and justify parallel performance. Timing models are also used to project the scalability of the application in various processing environments in order to establish relevance. The reported experiments are conducted using Synergy V3.0 on a cluster of 4 Linux machines running RedHat 9.


1 Introduction

Difficulties in parallel programming are rooted in two areas: a) inter-process communication and synchronization; and b) performance optimization. For multiprocessor systems without a shared memory (such as clusters of COTS processors), the communication and synchronization programming difficulties increase exponentially. Identifying the best processing granularity for optimized performance is another challenge. Great performance losses result if the granularity is either too large or too small. Unfortunately, for direct message-passing systems, processing granularity is interwoven in the task design and cannot be easily adjusted.

In addition to programming difficulties, multiprocessor systems also face a serious availability challenge in that higher processor counts adversely impact application availability, since any single processor crash can halt the entire parallel application.


Stateless Parallel Processing (SPP) concerns a processor design and a parallel programming discipline that facilitate automated fault tolerance and load balancing. In particular, SPP eliminates the need for a single uniform memory space while allowing automatic program/data partitioning with high performance. An SPP system can automatically form SIMD, MIMD and pipelined processors at runtime. It is free from the difficult cache coherence problems. Therefore, SPP is ideally suited for using multiple COTS processors.

Implementation of an SPP system requires a data access model using tuple space [4]. The tuple space data access model is in theory identical to the dataflow model. It allows implicit parallel processing and thus simplifies programming.


There have been numerous efforts exploiting the tuple space programming paradigm and its implementation techniques. In this paper, we focus on a server-centric implementation (Synergy) that offers fault resilience to multiple worker faults but is known to add overhead to application inter-process communication when compared to direct message-passing systems. A full SPP implementation will eliminate this overhead and offer fault resilience to multiple processor and network failures.

In this paper, we use a Block LU factorization algorithm to illustrate the programming and performance tuning features of SPP. We use timing models to guide our program design and to predict the achievable timing results. We use the measured timing results to validate the models. Finally, we examine the scalability of this application under various processing environments.


2 Stateless Parallel Processing (SPP)

For parallel programming, SPP requires a separation between functionality programming and performance programming (including fault tolerance). An SPP application consists of an interconnected network of stateless parallel programs. A stateless parallel program is any program that does not contain or generate persistent global state information. For example, a program that uses a specific IP address and port number to communicate with another program is NOT stateless. Similarly, a program that spawns programs on multiple processors is also NOT stateless, for it creates a persistent parent-child relationship. Typically, it is performance-programming practice that produces non-stateless codes.

An SPP application contains only two types of programs: master and worker. A master is any program that distributes work assignments and assembles the results. A worker program reads a work assignment and returns the completed result. Using the tuple space data access model, each SPP implementation represents a complete dataflow computer.

An SPP runtime system exploits the implied data dependencies in the parallel programs. It optimizes processing efficiency by minimizing communication overhead and by automatically forming SIMD, MIMD and pipelined processor clusters at runtime.

The user can further optimize performance by tuning the processing granularity by leveraging the stateless coding style.


3 Synergy Parallel Programming and Processing System

Synergy V3.0 is a research parallel programming and processing system that has been in use for graduate and undergraduate computer science education at Temple University since 1990. It has also been used successfully in a number of research projects at the IBM T.J. Watson Research Center, the Wharton School of the University of Pennsylvania, and the Center for Advanced Computing and Communications at Temple University.

Synergy uses a server-centric tuple space mechanism [5] for inter-parallel-program communication. Different from Linda's compiler-based implementation [4], Synergy's tuple space is implemented by servers created at runtime. Synergy programs can be written in any sequential programming language and compiled with a link to the Synergy runtime library.

A Synergy parallel application consists of a set of SPP programs and a configuration specification that links the programs into a logical parallel application.

The Synergy runtime system interprets the configuration specification, activates the specified components and provides functions for monitoring and control. It contains debugging support using concurrent activation of multiple gdb debuggers. It also contains automatic fault tolerance supporting multiple worker processor faults.

Currently, Synergy V3.0 supports RedHat Linux 9 and SunOS 5.9.


4 Synergy Application Programming Interface (SAPI)

The Synergy runtime library contains the following calls:

  tsd = cnf_open("TupleSpaceName")
    Opens a named tuple space object and returns a handle.

  cnf_term()
    Terminates a running program.

  cnf_tsput(tsd, TupleName, Buffer)
    Inserts a named tuple into an opened space object.

  cnf_tsget(tsd, TupleName, &Buffer)
    Fetches a matching tuple and removes it from the given space object.

  cnf_tsread(tsd, TupleName, &Buffer)
    Reads the contents of a matching tuple in the given space object.

There are three environment variable fetching calls:

  cnf_gett()
    Gets the value defined by the "threshold=" clause in the application specification.

  cnf_getf()
    Gets the value defined by the "factor=" clause in the application specification.

  cnf_getP()
    Gets the number of processors.

These values facilitate grain size adjustments.


Every Synergy program is separately compiled using a host language compiler and the Synergy runtime library, and must use SAPI in order to achieve parallel processing benefits.
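To make the calling sequence concrete, the following is a minimal worker skeleton. It follows the call forms used in Section 7; the tuple names, the buffer size and the prototypes shown here are assumptions for illustration, and in a real program the prototypes come from the Synergy runtime library header.

/* Minimal Synergy worker skeleton (illustrative only). The prototypes below
   are assumed from the usage shown in Section 7; a real program obtains them
   from the Synergy runtime library header. */
#include <string.h>

extern int cnf_open(char *space_name, int mode);
extern int cnf_tsget(int tsd, char *tuple_name, char *buffer, int size);
extern int cnf_tsput(int tsd, char *tuple_name, char *buffer, int size);
extern int cnf_term(void);

#define BUFSZ 4096

int main(void)
{
    char tpname[64], buffer[BUFSZ];
    int  tsd_in, tsd_out, len;

    tsd_in  = cnf_open("problem", 0);        /* input tuple space            */
    tsd_out = cnf_open("result", 0);         /* output tuple space           */

    for (;;) {
        strcpy(tpname, "task*");             /* match any task tuple         */
        len = cnf_tsget(tsd_in, tpname, buffer, 0);
        if (len <= 0)                        /* assumed end-of-work signal   */
            break;

        /* ... compute on buffer, leaving the result in buffer ... */

        strcpy(tpname, "res0");              /* illustrative result name     */
        cnf_tsput(tsd_out, tpname, buffer, len);
    }
    cnf_term();                              /* detach from the runtime      */
    return 0;
}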


5 Synergy Parallel Application Configuration Specification (SPACS)

The user is required to specify a parallel application by linking multiple compiled parallel programs. The following is the specification for the Block LU factorization application (BLU.csl):

configuration: BLU;
m: master = BLUMaster (threshold = 1 debug = 0)
  -> f: problem (type = TS)
  -> m: worker = BLUWorker (type = slave)
  -> f: result (type = TS)
  -> m: master;

The tuple space names in this specification (problem, result) correspond to the cnf_open statements in the parallel programs. The "=" signs in the M-clauses are used to equate the logical components to the physical binary file names in the $HOME/bin directory.

This specification is processed by the Synergy runtime system to generate dynamic tuple space daemons and a customized application control console responsible for monitoring and controlling the entire application (DAC, Distributed Application Controller).


6 Synergy Runtime Environment (SRE)

Synergy was designed to exploit resource-sharing as opposed to cycle-stealing. The idea is that all users should share the same pool of resources in any way legally permitted. To this end, it does not use SSH (Secure Shell) to activate user programs. Every user is required to run a login-based command interpreter daemon (CID) to gain CPU access to a remote machine. These daemons inherit all legal restrictions of the login ID and thus avoid any security related concerns. A port mapping daemon (PMD) manages the dynamic ports for all users on a single host. With NFS (Network File System), a single shared copy of the binaries (and link library) is sufficient for all cluster users. On multiple distributed clusters without a shared file system, multiple copies of the Synergy binaries and consistent path naming are necessary.

Each user maintains a host list as the resource pool. The state of this pool defines resource availability for the next parallel run. Activating CID will automatically launch PMD if it is not already running. Multiple activations of PMD on the same node will not affect the correctness of operation.

The following commands manage the host list states in SRE:




  cds [-a] [-f]
    Checks daemon states. The default is to check only active hosts. "-a" lists inactive hosts as well. "-f" saves the output to a file named .snghosts.

  sds
    Starts CIDs on all currently active hosts.

  kds
    Kills all daemons on currently active hosts.

  addhost
    Adds a host to the host list.

  delhost
    Deletes a host from the host list.

  chosts
    Chooses hosts to activate or deactivate.

Multiple cluster users can share the same processor pool at the same time.


The following commands execute, debug and monitor a Synergy application:

  prun [-debug] SCF_name [-ft]
    Activates a parallel application named SCF_name. The "-debug" option prepares a virtual dataflow computer ready for debugging using multiple gdb sessions. The "-ft" option activates worker processor fault tolerance. The default is no fault tolerance, to facilitate debugging.

  pcheck
    Monitors all applications started by the same login ID. It monitors at both the application level and the process level, and can also be used to terminate either a process or an entire application.


7 Synergy Programming and Communication

Direct message passing systems allow the best possible communication performance deliverable by the low-level networking systems [3]. However, for load balancing and fault tolerance, indirect communication is necessary, since the target recipient may not be present or appropriate for work due to dynamic status changes. Synergy uses indirect communication in which each tuple transmission must go through three phases: sending, matching and receiving. The inevitable overhead can be offset by the load balancing, automatic SIMD, MIMD and pipeline formation, and fault tolerance benefits [5].

Similar to message passing systems, Synergy programs contain initialization, communication and termination library calls:


tsd1 = cnf_open("problem", 0);    // Initializing the "problem" space
tsd2 = cnf_open("result", 0);     // Initializing the "result" space

strcpy(tpname, "A0");
cnf_tsput(tsd1, tpname, buffer, buf_size);    // Communication: put a tuple "A0"

strcpy(tpname, "B*");
len = cnf_tsread(tsd2, tpname, buffer2, 0);   // Communication: read a tuple "B*"

cnf_term();                       // Termination


In a debugging session, a virtual dataflow computer is created by launching the appropriate daemons. This allows the user to step through the programs instruction by instruction using tools like gdb.


8 LU Factorization

Dense linear systems are commonly solved using LU factorization. Block methods in matrix computations are widely recognized as being able to achieve high performance on parallel computers [1].

In solving a dense linear system A.X = B, we first factorize A into L.U, where L is a lower triangular unit matrix and U is an upper triangular matrix. There are four major types of computations; here we omit the permutation matrix P for simplicity:

  - LU factorization for diagonal sub-matrices
  - Solve lower triangular unit sub-linear systems like LZ = subA
  - Solve upper triangular sub-linear systems like WU = subA
  - Matrix product for updates like subA = subA - W.Z


Figure 1. Four tasks for LU Factorization (LU of the diagonal block, LZ = subA, WU = subA, and the update subA = subA - W.Z)
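For clarity, the four task types can be read off the standard 2 x 2 block factorization identity (a textbook restatement of Figure 1, not taken verbatim from the paper):

\[
A =
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix},
\]
so that
\[
L_{11}U_{11} = A_{11}, \qquad
L_{11}U_{12} = A_{12}, \qquad
L_{21}U_{11} = A_{21}, \qquad
L_{22}U_{22} = A_{22} - L_{21}U_{12}.
\]

The first equation is the diagonal LU task; the second is the lower triangular solve LZ = subA (with Z = U12); the third is the upper triangular solve WU = subA (with W = L21); and the last is the update subA = subA - W.Z followed by LU factorization of the updated block.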



9 Parallel LU Factorization

The parallel LU factorization application contains one master and one worker program. The master maintains a task roadmap according to the sequential dependencies. The worker performs a given task (there are four types) and returns the corresponding result. The worker is activated on multiple hosts, including the host that runs the master.

Assuming the total number of tasks is N, the parallel framework is as follows:


Master:

  Initialization: put the ready-to-compute tasks into the task pool
  For i = 1 to N
    Receive a completed task from a worker
    Refresh the task roadmap
    While a new task is ready-to-compute
      Put the task in the task pool

Workers:

  Get any ready-to-compute task from the pool
  Do the computing
  Return the completed task to the Master







10 Optimization for the Parallel Block LU Algorithm

We denote the sub-blocks as submat[i][j][status], where status = O/U/F:

  O: Original
  U: Updated
  F: Finished

The dependency graph for the sub-tasks of a 3 x 3 block factorization is as follows (submat[1][i][O] = submat[1][i][U], submat[i][1][O] = submat[i][1][U]):



Figure 2. Dependency Graph for Block LU Factorization (3 x 3 blocks): the nodes are block states [i][j][O], [i][j][U] and [i][j][F], and the edges show how finishing one block enables updates and, eventually, finishing of the dependent blocks.


Each task involves either a ma
trix
-
matrix product or a upper(lower) triangle solver. To
optimize parallel execution time,
the task distribution should ensure the minimal
waiting time between tasks
.
Figure 2 shows that
this application does not render itself
well for parallel processing
.

M
any paths are sequential in essence.
However, it contains

SIMD

(multiple matrix
-
matrix products (L or U solvers) can run in parallel)
, MIMD

(matrix
-
matrix products, L and U solvers can run at the same time)

and pipelined
(the
wave
-
front dependency)
proc
essor forms
.



11 Synergy Implementation

Roadmap Status Matrix

We use a K x K matrix to record the update history of the K x K blocks. Initially each cell is assigned the value 0. Each time the corresponding sub-matrix is updated, the value increases by 1. For Block LU factorization, cell (i, j) is done when value(i, j) = min(i, j) + 1.
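In C terms, this bookkeeping can be sketched as follows (illustrative names only; not taken from the Synergy sources):

/* Roadmap status matrix for a K x K block decomposition (sketch). */
#define K 3                         /* number of blocks per dimension       */

static int PStatus[K][K];           /* update counts; all cells start at 0  */

static int min2(int a, int b) { return a < b ? a : b; }

/* Called whenever block (i, j) receives one update. */
void block_updated(int i, int j)
{
    PStatus[i][j]++;
}

/* Block (i, j) is finished once it has received min(i, j) + 1 updates. */
int block_finished(int i, int j)
{
    return PStatus[i][j] == min2(i, j) + 1;
}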


For a 3 x 3 grid, the serialized execution path of the dependency graph is as follows (many blocks can execute in parallel):

  LU(0,0)  LZ(0,1)  LZ(0,2)  WU(1,0)  WU(2,0)
  Update(1,1)  Update(1,2)  Update(2,1)  Update(2,2)  LU(1,1)
  LZ(1,2)  WU(2,1)  Update(2,2)  LU(2,2)




Figure 5. Typical serialization path for 3 x 3 blocks, shown as the sequence of roadmap status matrices evolving from all zeros to the finished state in which value(i, j) = min(i, j) + 1.



The task and the corresponding result structures are defined as follows:

struct msg
{
    int type;
    int x;
    int y;
    int row;
    int col;
    float mat1[M][M];
    float mat2[M][M];
};

struct res_msg
{
    int type;
    int x;
    int y;
    int row;
    int col;
    float res_mat[M][M];
};

Figure 6. Task and corresponding result structures


where:

  x, y: the coordinates where the result sub-matrix begins.
  row, col: the size of the result sub-matrix.
  type:
    = 0, Function: LU factorization for the diagonal sub-matrix.
    = 1, Function: Solve lower triangular unit sub-linear systems like LZ = subA.
         mat1: input L; mat2: input subA.
    = 2, Function: Solve upper triangular sub-linear systems like WU = subA.
         mat1: input U; mat2: input subA.
    = 3, Function: Matrix-matrix product, result = mat1 * mat2.
    = 4, Function: Termination of all tasks.
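As an illustration, a type 3 (matrix product) task could be packed into struct msg and posted to the problem space roughly as follows. This is only a sketch: the tuple-name convention, the helper name and the use of M as the block dimension are assumptions, not taken from the paper's code.

/* Post a type 3 (matrix product) task whose result block starts at (x, y). */
#include <stdio.h>
#include <string.h>

void put_matmul_task(int tsd_problem, int x, int y,
                     float W[M][M], float Z[M][M])
{
    struct msg task;
    char tpname[32];

    task.type = 3;                        /* matrix-matrix product            */
    task.x = x;    task.y = y;            /* where the result sub-matrix goes */
    task.row = M;  task.col = M;          /* size of the result sub-matrix    */
    memcpy(task.mat1, W, sizeof(task.mat1));   /* first operand               */
    memcpy(task.mat2, Z, sizeof(task.mat2));   /* second operand              */

    sprintf(tpname, "task%d_%d", x, y);   /* assumed tuple naming convention  */
    cnf_tsput(tsd_problem, tpname, (char *)&task, sizeof(task));
}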


12 Flow Charts

Putting all these together, the following flowcharts show how the design minimizes synchronization. The roadmap matrix is PStatus.



The master's control flow (Figure 7) is as follows. It first solves LU(0,0), updates the status, and puts LZ(0, 1:NBLOCKS-1) and WU(1:NBLOCKS-1, 0) into the Problem Space (PS). It then repeatedly receives a result tuple (i, j) from the Result Space, copies the sub-matrix back, refreshes PStatus, and releases the newly ready tasks:

  - If i = j (an LU result): for k = i+1 to NBLOCKS-1, put LZ(i, k) and WU(k, i) into PS for every block whose PStatus shows it is ready.
  - If i < j (an LZ result): for k = i+1 to NBLOCKS-1, if PStatus(k, i) = i+1 (the W operand is finished), put MatMul((k, i), (i, j)) into PS.
  - If i > j (a WU result): for k = i+1 to NBLOCKS-1, put MatMul((i, j), (j, k)) into PS for every finished Z operand block (j, k).
  - When a diagonal block becomes fully updated (PStatus(i, i) = i+1), put LU(i, i) into PS. When the last diagonal block (i = NBLOCKS-1) is reached, the master solves the final LU itself, puts the termination tuple into PS, and exits.

Figure 7. Master Flowchart
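In code form, the master loop might be organized roughly as in the sketch below. The helper functions stand in for the PStatus tests listed above and are illustrative, not taken from the Synergy sources; struct res_msg and the cnf_* calls are the ones introduced earlier.

/* Master loop sketch for Figure 7 (illustrative helpers). */
#include <string.h>

extern void solve_lu_block(int i, int j);
extern void post_initial_tasks(int tsd_problem, int nblocks);
extern void apply_result(struct res_msg *res);       /* copy block, bump PStatus */
extern int  release_ready_tasks(int tsd_problem, int i, int j, int nblocks);
extern void post_termination_tuple(int tsd_problem); /* a type = 4 tuple         */

void master_loop(int tsd_problem, int tsd_result, int nblocks, int bk)
{
    struct res_msg res;
    char tpname[32];
    int finished = 0;

    solve_lu_block(0, 0);                       /* factor the first diagonal   */
    post_initial_tasks(tsd_problem, nblocks);   /* LZ(0,*) and WU(*,0) tasks   */

    while (!finished) {
        strcpy(tpname, "res*");                 /* match any result tuple      */
        cnf_tsget(tsd_result, tpname, (char *)&res, 0);

        apply_result(&res);                     /* refresh data and roadmap    */

        /* Release every task whose dependencies are now satisfied; returns
           nonzero once the last diagonal block has been factored.             */
        finished = release_ready_tasks(tsd_problem,
                                       res.x / bk, res.y / bk, nblocks);
    }

    post_termination_tuple(tsd_problem);
    cnf_term();
}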


Each worker (Figure 8) repeatedly receives a tuple from PS. If the received length is 4 bytes, the tuple is the termination signal: the worker puts the termination tuple back (so the remaining workers also see it) and exits. Otherwise it switches on the task type: Type=0 solves LU(i, i), Type=1 solves LZ(i, j), Type=2 solves WU(i, j), and Type=3 computes MatMul((i, k), (k, j)). The result is then put into the Result Space (RS).

Figure 8. Worker Flowchart
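A corresponding sketch of the worker loop of Figure 8 follows; the solver helper names and the result tuple name are illustrative, while the 4-byte termination convention is the one shown in the figure.

/* Worker loop sketch for Figure 8 (illustrative helper names). */
#include <string.h>

extern void solve_lu(struct msg *t, struct res_msg *r);
extern void solve_lz(struct msg *t, struct res_msg *r);
extern void solve_wu(struct msg *t, struct res_msg *r);
extern void solve_matmul(struct msg *t, struct res_msg *r);

void worker_loop(int tsd_problem, int tsd_result)
{
    struct msg task;
    struct res_msg res;
    char tpname[32];
    int len;

    for (;;) {
        strcpy(tpname, "task*");                  /* match any task tuple      */
        len = cnf_tsget(tsd_problem, tpname, (char *)&task, 0);

        if (len == 4) {                           /* 4-byte termination tuple  */
            cnf_tsput(tsd_problem, tpname, (char *)&task, len); /* re-post it  */
            break;
        }

        switch (task.type) {
        case 0: solve_lu(&task, &res);     break; /* LU(i,i)                   */
        case 1: solve_lz(&task, &res);     break; /* LZ(i,j): lower solve      */
        case 2: solve_wu(&task, &res);     break; /* WU(i,j): upper solve      */
        case 3: solve_matmul(&task, &res); break; /* sub-matrix product        */
        }

        strcpy(tpname, "res0");                   /* assumed result tuple name */
        cnf_tsput(tsd_result, tpname, (char *)&res, sizeof(res));
    }
    cnf_term();
}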


The above parallel system contains two overlapping parts: computation and communication. Using high performance network switches, it is possible to carry out communications concurrently.



Figure 9. Overlapping Computing and Communications: the master (holding the tuple spaces, data and roadmap) runs with Worker0 on one host and communicates with Worker1, Worker2 and Worker3.




13 Performance Analysis

In this section, we use timing models to evaluate the potential of the designed parallel application. Timing modeling is a performance analysis methodology that uses holistic time complexity models and itemized instrumentation to reveal a parallel application's performance characteristics under different processing environments [2].


Let:

  n: size of the matrix
  m: number of blocks
  k: block size, k = n/m
  w: CPU power (algorithmic steps per second)
  μ: network speed (bytes per second)
  β: number of bytes in each word
  p: number of processors


We note that in our program model we have largely eliminated synchronization, and computation and communication overlap most of the time. So the parallel execution time is determined by the larger of the two terms, T_comp and T_comm. That is,

\[ T_{par} = \max(T_{comp}, T_{comm}), \qquad T_{comp} = \frac{2n^3}{3wp}. \]


The communication time is

\[ T_{comm} = A_{comm} \cdot \beta / \mu, \]

where the communication volume (in words) is

\[ A_{comm} = \sum_{q=1}^{m-1}\Big( 6(m-q)k^2 + 3(m-q)^2 k^2 \Big) = m^3 k^2 + o(m^3 k^2) = mn^2 + o(mn^2). \]

Therefore,

\[ T_{comm} = \frac{\beta m n^2}{\mu} + o\!\left(\frac{\beta m n^2}{\mu}\right). \]



Since m = n/k,

\[ T_{par} = \max(T_{comp}, T_{comm}) = \max\!\left(\frac{2n^3}{3wp}, \frac{\beta m n^2}{\mu}\right) = \max\!\left(\frac{2n^3}{3wp}, \frac{\beta n^3}{k\mu}\right), \]

while the sequential time is

\[ T_{seq} = \frac{2n^3}{3w}. \]

Hence

\[ Speedup = \frac{T_{seq}}{T_{par}} = \frac{1}{\max\!\left(\dfrac{1}{p}, \dfrac{3\beta w}{2k\mu}\right)} = \min\!\left(p, \frac{2k\mu}{3\beta w}\right). \]




The above formula illustrates that the speedup depends on p, k, w, μ and β:

  - The bigger p, k and μ, the better the speedup.
  - The smaller w, the better the speedup.
  - To sustain a speedup of p, the communication term must not dominate, i.e. 2kμ/(3βw) ≥ p, so the best (smallest sufficient) block size is

\[ k = \frac{3\beta p w}{2\mu}. \]


Suppose that w = 10^7, μ = 10^6, k = 500, β = 4, p = 5 and m = 4. We can change the value of each variable separately to see its influence on the speedup.
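For experimentation, the speedup model can also be written out directly as a small helper; this is only a restatement of the formula above, not part of the Synergy code.

/* Predicted speedup from the timing model: min(p, 2*k*mu / (3*beta*w)). */
#include <math.h>

double predicted_speedup(double p,     /* number of processors              */
                         double k,     /* block size                        */
                         double w,     /* CPU power (steps per second)      */
                         double mu,    /* network speed (bytes per second)  */
                         double beta)  /* bytes per word                    */
{
    return fmin(p, 2.0 * k * mu / (3.0 * beta * w));
}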


Figure 10. Block LU Factorization Scalability Analysis: predicted speedup as a function of network speed (μ, 10^6), block size, number of processors p, and node computing power (w, 10^6).


14 Experimental Results

Figure 11 shows the actual parallel running results.


matrix size  block size  seq-time(s)  seq-Mflops  #CPU  par-time(s)  par-Mflops  Speedup
 500          200           0.66        125.62      3      0.69        120.61      0.96
 800          300           3.75         91.14      3      2.62        130.10      1.43
1200          400          15.03         76.65      3      7.63        150.94      1.97
1500          300          32.21         69.85      3     12.24        183.80      2.63
1600          400          33.02         82.71      3     15.07        181.25      2.19
2000          500          72.70         73.36      3     34.95        152.61      2.08
1000          250           6.00        111.20      4      3.24        205.96      1.85
1000          300           7.91         84.30      4      4.77        139.86      1.66
1200          300          13.39         86.03      4      5.89        195.66      2.27
1600          400          33.02         82.71      4     13.73        198.94      2.41

Figure 11. Computation Timing Results


We observe that the best speedup (2.63) is achieved for N = 1500, reaching about 184 MFLOPS on three processors. The timing models also indicate that if w improves too much, there will be no chance for speedup.


15 Conclusion

This report documents the computational experiments conducted using Synergy V3.0 for a Block LU factorization algorithm. It shows that Synergy is easier to program than direct message passing systems. It also shows that it is possible to achieve positive results using slow networks when the processor power is small. Timing model analysis reveals that this positive result can go away if we use BLAS-optimized library codes in the parallel core, because BLAS library codes would improve the w measure roughly tenfold.


References

1) Gene H. Golub and Charles F. Van Loan, Matrix Computations, 3rd Edition, The Johns Hopkins University Press.

2) Yuan Shi, "Parallel Program Scalability Analysis," International Conference on Parallel and Distributed Computing, October 1997, pp. 451-456.

3) The Message Passing Interface (MPI) Standard. http://www-unix.mcs.anl.gov/mpi/

4) S. Ahuja, N. Carriero and D. Gelernter, "Linda and friends," IEEE Computer, 26-32, August 1986.

5) Yuan Shi, Getting Started in Synergy, Online Documentation: http://spartan.cis.temple.edu/shi/synergy