
A Tutorial Introduction
to Parallel Computing

Ian C. Smith

February 2003

Document 015 Version 0.1

Computing Services Department

Contents

1. Introduction
2. The Motivation for Parallel Computing
3. An Outline of Parallel Computing Hardware
4. Programming Models for Parallel Computation
5. An Example Parallel Algorithm - the Matrix-Matrix Multiply
6. Implementing the Matrix-Matrix Multiply in FORTRAN 90
7. Testing the Matrix-Matrix Multiply
8. Further Information
9. Appendix A - DGEMM Function



Format Conventions

In this document the following format conventions are used:

Computer output is given in a Courier font.
    Password

Input which is to be typed by you is in bold.
    a:\setup

Input which must be replaced by your details is given in bold italics.
    LOGIN user_name

Keys that you press are bold.
    Enter

Menu options are given in a Helvetica font.
    File New

Feedback

The Computing Services Department welcomes feedback on its documentation. If you notice any
mistakes in this document, or if you wish to make any suggestions on how the document could be
improved, please contact Allyson Matthews, Information Officer. Electronic mail should be sent to the
email address allyson@liv.ac.uk


1  Introduction

The Computing Services Department (CSD) within the University has recently made
available three high performance computing platforms for users needing to solve
computationally demanding problems. The systems comprise a 24-node Beowulf
cluster and two identical Sun Fire V880 8-processor symmetric multi-processor
machines. Since all of the machines employ parallel hardware, this document has
been written to give users who are new to parallel computing some initial guiding
steps in the design, coding, running and testing of efficient parallel programs.
Although this document concentrates on the Sun machines, some of the general
points are equally applicable to the Beowulf cluster.

To begin the document we look at the "hows" and "whys" of parallel computing
before describing a particular class of parallel computer, the symmetric multi-processor,
in detail. The Sun V880 is an example of this architecture. We then
examine the design of a parallel algorithm for a common type of problem - dense
matrix-matrix multiplication - and its implementation in FORTRAN 90 code.
Finally, the results obtained from testing the programs are analyzed. The emphasis
here is on solving the kinds of problems that occur frequently in science and
engineering. Although parallel machines are used in applications such as database
engines, web servers and mail servers etc., these are beyond the scope of this
document.

Given the limited space available, it is only possible to give the briefest coverage of
each topic. It is intended that CSD will provide more detailed information in due
course. Some pointers to further information are given at the end of this document.


2  The Motivation for Parallel Computing

Users who have only ever written programs for PCs and machines such as those
provided by the Sun UNIX service (uxa, uxb, ...) may be wondering "what
constitutes a parallel computer" and "what advantages are there to using it". To
answer the first question, it is useful to examine how "conventional" computers work.
The vast majority of these, from early mainframes to modern desktop PCs, contain a
single processing unit which executes instructions stored in memory one after another
(i.e. sequentially). This sequential model of computation has had an enormous
influence on the design of algorithms and programs. Almost sub-consciously,
programmers look at the solution of problems by computer as a series of steps to be
performed in sequence by the hardware.

Very often it is the case that once a working program has been developed it is
desirable (and in some cases essential) that the time taken for it to solve a given
problem be reduced significantly. Whilst careful tuning of the code can reduce the
execution time, eventually a point is reached where no further speed up is possible
due to the limitations of the underlying hardware. The actual limiting factor will
depend on the nature of the problem, but may be attributable to processor speed,
memory bandwidth or I/O (usually hard disk) bandwidth.

In the first instance, the execution time cannot be reduced because the processor is
incapable of executing instructions any faster. Although there seems to be a
widespread belief that a faster processor always gives faster execution (a point not
lost on the advertisers of PCs), this is not always the case. For many problems, the
rate at which instructions and data can be supplied to the processor from memory
constrains the overall speed; this is the memory bandwidth limitation. In other
problem classes (typically databases), speed up is constrained by the rate at which
data can be supplied to / from the hard disk; this is the I/O bandwidth limitation.

In contrast to a conventional computer, parallel machines contain multiple processor
units (possibly with their own associated memory and disk I/O) which are capable of
executing program instructions concurrently (i.e. in parallel). If a program can be
divided up such that parts of it are executed in parallel on different processors it
should be possible to arrive at a solution faster than in the single processor,
sequential, case (assuming the same processors). We have therefore answered the
second question posed at the beginning. A parallel computer provides the hardware
resources that allow (at least some) programs to be executed faster than on any given
single processor machine. This is true regardless of the state of the art in processor design
and will always be true.

Solving problems on a parallel computer can therefore be neatly summed up by the
adage that "many hands make for lighter work". More formally we can state that:

If a problem can be divided into N equal and independent tasks, then an ideal
N-processor parallel computer can solve the problem N times faster than an
equivalent uni-processor machine.

In practice, the words "can be divided", "equal" and "independent" turn out to be
important caveats which significantly affect the performance of any parallel
implementation.

To begin with, it may be difficult or impossible to divide an entire program into
parallel tasks. Generally, algorithms and programs will contain parts that are
inherently sequential. As a program is broken up into more and more tasks in search
of the N times speed up, the sequential parts come to dominate the overall run time.
Eventually, a point is reached where no further performance increase is possible. This
is an example of the widely cited Amdahl's law.
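Amdahl's law is not stated formally in this document, so the following is the standard textbook form
rather than a quotation: if a fraction $f$ of a program's execution can be parallelised and the remaining
$1-f$ is inherently sequential, the best possible speed up on $N$ processors is

$$ S(N) \;=\; \frac{1}{(1-f) + f/N} \;\le\; \frac{1}{1-f} $$

so, for example, a program that is 90% parallelisable ($f = 0.9$) can never run more than 10 times
faster, however many processors are used.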

Speed up may also be limited by the fact that all tasks may not be created equal. Here
the execution time is determined by the time taken for the longest task to complete.
Finally, there is usually some interdependence between tasks in that they need to
share data with other tasks (e.g. a global sum calculation would require data from all
processors). On a perfect parallel computer this would not in itself pose a problem¹.
In practice, however, it takes a finite amount of time for data from a task running on
one processor to be communicated to a task running on another processor. This
constitutes an overhead which is not present in a purely sequential implementation
and can eventually limit the overall speed up.

These points may seem off-putting to those new to parallel computing, however not
all is doom and gloom. Many classes of problem have been solved simply and
efficiently using parallel computing. Before spending a lot of time and effort
designing and developing parallel software though, it is always worth considering
how amenable a problem is to parallel solution.

¹ although note that the tasks may need to synchronise before communicating data to
one another - so some may be waiting for others to "catch up"


3  An Outline of Parallel Computing Hardware


The design of parallel computer hardware has historically been the subject of much
experimentation and innovation. This has led to the manufacture of a wide variety of
machines, some specifically tailored to particular types of problem (e.g. vector
processing on the early Cray supercomputers). Many of the designs, along with their
manufacturers, have long since disappeared and rationalisation within the industry has
led to the current situation where there are just two main design categories (or
architectures). These are the massively² parallel processor (MPP) architecture and the
symmetric multi-processor (SMP) architecture (another is the non-uniform memory
access or NUMA architecture - essentially a special version of SMP).

In the MPP architecture, each parallel processor has access to its own dedicated
memory which is used for storing program instructions and data. In an MPP, if data
is required from other processors, then that data needs to be transmitted along special
communication media, termed an interconnect. The definition of an MPP is
sufficiently broad that the interconnect can be anything from a dedicated high speed
link, in the case of MPP supercomputers, to a relatively ordinary Ethernet connection,
in the case of a PC cluster. The nature of the interconnect has a large influence on the
overall performance of an MPP machine.

In the SMP architecture, all of the parallel processors have equal (i.e. symmetric)
access to a large shared memory which contains the program instructions and data for
all processors. At the simplest level, only one processor may access the main memory
at a given time and so the memory bandwidth (i.e. the aggregate rate at which data
can be transferred to/from memory) needs to be shared amongst the processors. In its
most basic form, contention between processors for memory bandwidth severely
limits the parallel performance of the machine so that only designs involving a few
processors are practical. To increase the effective memory bandwidth, use can be
made of banked memory which allows for parallel memory accesses. Machines
employing this design are commercially available containing up to 128 processors.

A popular alternative to banked memory is the use of cache. Caches are relatively
small high speed memories intimately associated with each of the parallel processors.
They hold a copy of some of the program instructions and data required by the
processor so that the processor may quickly access the locally held copies rather
than using the slower main memory. If it can be arranged that instructions and data
needed by the processor are often available in the cache, then the number of main
memory accesses is reduced and the shared memory bandwidth becomes less of a
bottleneck. Hence the parallel performance of the machine is improved. A good
introduction to cache memory systems is provided by Sun Microsystems [1].

The job of organising which instructions and data are copied from main memory to
cache is down to the operating system, with varying degrees of help from the system
hardware. The strategies involved in populating each processor cache can be highly
complicated. As a simple example, consider the case where a datum (e.g. the value of
some variable) is required by the processor. If the datum is not present in the cache
then the operating system may load the datum, along with a block of data adjacent to
it, from main memory into cache. If the processor then needs to access another datum
within the block that has just been fetched, it is no longer necessary to visit the main
memory.

² The term "massive" should not be taken too literally. Although a modern MPP
supercomputer may have hundreds of processors, the CSD Beowulf cluster, which has
"only" 24 processing nodes (a total of 48 processors), is still termed an MPP.

Although this may seem artificial, in many types of problem, data are often read from
a set of contiguous memory locations (e.g. array elements) and data that are accessed
once are accessed again repeatedly in the near future (e.g. variables within loops).
The computer science terms for these properties are spatial locality and temporal
locality, respectively. Programs that exhibit these properties should give good
performance on cache based SMP machines. For example, well written dense linear
algebra codes have excellent locality properties leading to efficient implementations.
Sparse linear algebra problems however can involve irregular memory access patterns
which lead to poor spatial locality. Codes for these types of problem are less likely to
perform well on SMP hardware.
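As a small FORTRAN 90 illustration of spatial locality (a sketch written for this discussion, not code
from the original document; the program name and matrix size are illustrative), summing a matrix with
the inner loop running down a column visits memory contiguously, whereas swapping the loop order
strides through memory and wastes most of each cache line fetched from main memory:

PROGRAM locality_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000
  DOUBLE PRECISION, ALLOCATABLE :: A(:, :)
  DOUBLE PRECISION :: total
  INTEGER :: i, j
  ALLOCATE(A(n, n))
  A = 1D0
  total = 0D0
  ! good spatial locality: FORTRAN stores columns contiguously, so the
  ! inner loop walks through adjacent memory locations
  DO j = 1, n
     DO i = 1, n
        total = total + A(i, j)
     END DO
  END DO
  ! poor spatial locality: with the loops swapped (DO i ... DO j ...) the
  ! inner loop would jump n elements at a time between successive accesses
  PRINT *, total
END PROGRAM locality_sketch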

A final issue to consider is that of cache coherency. If a task running on a processor
writes a value to cache, for example by assigning a value to a cached global variable,
then the value in main memory will also need to be updated, otherwise tasks running
on other processors will only ever see the old value of the variable. A problem occurs
when other processors have cached copies of the variable containing the old value. In
this case the caches are said to be incoherent and it is necessary for the contents to be
refreshed from main memory. Whilst the strategies used by the hardware/operating
system to ensure cache coherency are complicated, it is important to note that none of
this need trouble the programmer. The operating system always ensures coherency
without any special steps needing to be taken in the program code. It is worth
bearing in mind though that ensuring coherency incurs an additional overhead not
present in the single processor case.

The two Sun V880 machines, recently acquired by CSD, are both based on the SMP
architecture. Each machine (or node) contains 32 GB of main memory which is
shared between eight 900 MHz UltraSPARC III processors operating in parallel. The
cache for each processor is divided into two parts: a high speed 64 KB level 1 cache
and a slower 8 MB level 2 cache. As each processor is capable of concurrently
executing a floating point addition and floating point multiplication, the theoretical
maximum performance is 1.8 GFLOPS³ per processor or 14.4 GFLOPS per node.

³ 1 GFLOPS = 10⁹ double precision floating point operations per second


4  Programming Models for Parallel Computation

The main goal of parallel programming is to utilise all of the processors and so
minimise the elapsed time of the program. To this end, many programming models
have been developed, reflecting the variety of underlying hardware. Here we
concentrate on just two models, namely multi-thread and distributed message
passing. It is important to note that there is no "one size fits all" approach to parallel
programming; each has its own strengths and weaknesses. Ian Foster gives a good
introduction to the design of parallel programs in his book "Designing and Building
Parallel Programs" [2].

In the multi-thread approach, parallelism is achieved using the fork-join model. A
fork is a high-level language instruction, or compiler directive (usually implemented
as a call to a function in the thread library), which causes a separate series of
instruction streams to execute concurrently. These separate strands of execution are
termed threads and each executes independently of the others. Threads may contain
identical instructions, giving data parallelism, or different instructions, giving
functional parallelism.

Because the threads share the same memory address space it is easy to reference data
that other threads have updated. However, care must be taken that data being written
by one thread is not at the same time read by another thread (a so-called race condition).
Special thread instructions can prevent this.
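One such mechanism, shown purely for illustration (this example is not part of the original document;
it uses the OpenMP directives that appear later, in section 6), is the REDUCTION clause, which gives
each thread a private copy of a variable and combines the copies safely when the threads join:

PROGRAM race_free_sum
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000000
  INTEGER :: i
  DOUBLE PRECISION :: total
  total = 0D0
  ! without the REDUCTION clause every thread would read and write the
  ! shared variable "total" at the same time - a race condition; with it,
  ! each thread accumulates into a private copy combined at the join
!$OMP PARALLEL DO REDUCTION(+:total)
  DO i = 1, n
     total = total + 1D0 / DBLE(i)
  END DO
!$OMP END PARALLEL DO
  PRINT *, 'sum =', total
END PROGRAM race_free_sum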

When all of the thread instructions have been completed, a join instruction causes the
newly created threads to be destroyed. This leaves only a master thread, present
before the fork, to carry on execution of the program sequentially. A join instruction
causes synchronisation between the threads since all have to have reached the join
instruction before execution can continue.

Multi-threading is inherently suited to the SMP architecture since all processors
share the same memory address space. For a parallel implementation, separate
threads may run on separate processors (possibly with multiple threads per processor
if there is a surfeit of threads).

Thread based programs can also be easily run on a uni-processor machine during the
development phase. Here the operating system uses its time-sharing mechanism to
divide processor time between threads, giving an illusion of concurrency. A major
advantage of thread-based programming is the availability of compilers which can
generate multi-thread executables automatically, although performance may not be as
good as for "hand coded" programs. The Sun C and FORTRAN compilers on the
SMP machines have this useful ability.

In a machine based on the MPP architecture, each parallel processor has its own
memory address space. The lack of any shared memory between processors means
the multi-thread memory model cannot be employed and therefore an alternative is
needed. One such model is distributed message passing. Here, data that is needed by
one processor, but held in the memory of another, is transferred in the form of a
message. As an example, consider the case where we wish to sum the elements of an
array that is distributed amongst N processors. Processors will form the partial sums
for their elements (in parallel), then it will be necessary to communicate the partial
sums such that a global sum can be calculated. A simple method would be to send
the N-1 partial sums to one processor, although it is actually possible to calculate the
result in just ⌈log₂ N⌉ steps (for example, with 8 processors the partial sums can be
combined in pairs, then in pairs of pairs, so only 3 communication steps are needed).


Two very popular standards which implement distributed message passing are the
parallel virtual machine (PVM) and the message passing interface (MPI). We'll
concentrate on just MPI since it provides all of the functionality of PVM as well as
many more powerful features. MPI provides a library of functions, which can be
called from FORTRAN or C programs, that allow for parallel execution. Such is the
power of these functions that the global summation previously mentioned requires
just one call to an MPI SUBROUTINE in FORTRAN or a call to an MPI function in
C.
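For example, the global sum described above could be written as follows (a minimal sketch added
here for illustration, not a program from the original document); a single call to MPI_REDUCE combines
the partial sums held by all of the processes:

PROGRAM global_sum_sketch
  IMPLICIT NONE
  INCLUDE "mpif.h"
  INTEGER :: myrank, nprocs, ierr
  DOUBLE PRECISION :: partial, total
  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  partial = DBLE(myrank + 1)   ! stands in for a partial sum computed locally
  ! one call gathers and adds the partial sums on process 0; the library is
  ! free to use a log2(nprocs) step reduction tree internally
  CALL MPI_REDUCE(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  IF (myrank == 0) PRINT *, 'global sum =', total
  CALL MPI_FINALIZE(ierr)
END PROGRAM global_sum_sketch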

The MPI standard has been widely adopted by computer vendors and is available on a
substantial range of hardware including large supercomputers (e.g. SGI T3E, NEC
Earth Simulator), workstation/PC clusters and shared memory machines such as our
own Sun V880.

MPI processes executing on different processors run identical copies of the MPI
program. Whether the processors execute different instructions or identical ones is up
to the programmer. The former case gives functional parallelism while the latter gives
data parallelism. This is often referred to as the "single program multiple data"
(SPMD) model of programming.

Whilst the exact nature of a message may seem rather fuzzy, it is precisely this level
of abstraction that allows MPI to be used on such disparate architectures. On a PC
cluster, a message will need to be sent in the form of a TCP/IP packet across an
Ethernet connection, whereas on a shared memory machine all that is required is to
copy the data from one part of memory to another. The important point is that none of
this need concern the programmer, who can concentrate on solving problems using a
high level methodology without worrying about such low level implementation
details.







5  An Example Parallel Algorithm - the Matrix-Matrix Multiply



The task of calculating the product of two dense matrices is particularly amenable to
parallel solution and for that reason is used as an example here. Linear algebra tasks
such as this occur in a wide variety of scientific and engineering problems, although it
is fair to say that many problems (e.g. finite element analysis) employ sparse methods
which are more difficult to parallelize.

The elements of a matrix C which is the product of two matrices, A and B, can be
formed by the dot products of corresponding rows of A and columns of B. If A is of order
m × k and B is of order k × n then the elements of C = AB (order m × n) are given by:





$$ c_{i,j} \;=\; \sum_{r=1}^{k} a_{i,r}\, b_{r,j} $$

i.e. each element of C is the dot product of a row of A, a(i,:), with the corresponding column of B, b(:,j).

To keep things simple, we will concentrate on just square matrices of order n × n in
the rest of this document.

There are many ways in which this problem could be divided up to run on p parallel
processors. For example, matrix B could be divided up along columns into p blocks
(assuming that n is evenly divisible by p). We can then form blocks of C using each
of these blocks of B in p parallel tasks, i.e.

C(:, 1+(i-1)*n/p : i*n/p) = A * B(:, 1+(i-1)*n/p : i*n/p)    for i = 1, p
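The decomposition can be checked in ordinary sequential FORTRAN 90 before any parallel machinery
is introduced. The following sketch (written for this discussion, not taken from the original; the
program name and sizes are illustrative) forms C one column block at a time, and each pass of the
loop corresponds to one of the p independent tasks:

PROGRAM column_blocks_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 8, p = 4          ! here p divides n evenly
  DOUBLE PRECISION :: A(n, n), B(n, n), C(n, n)
  INTEGER :: i, lo, hi
  CALL RANDOM_NUMBER(A)
  CALL RANDOM_NUMBER(B)
  DO i = 1, p
     lo = 1 + (i - 1) * n / p
     hi = i * n / p
     C(:, lo:hi) = MATMUL(A, B(:, lo:hi))     ! block i of C from block i of B
  END DO
  PRINT *, MAXVAL(ABS(C - MATMUL(A, B)))      ! zero, apart from rounding
END PROGRAM column_blocks_sketch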

It is straightforward to cover the case where p does not divide n evenly. Let r be the
remainder from this division, r = mod(n, p); then r processors will receive ⌈n/p⌉
columns of B whilst the remaining (p - r) processors receive ⌊n/p⌋ columns. For
large values of n this will give only a small difference in the work load of different
processors. In the limit that n = p, this just becomes n matrix-vector products using
the whole of A and single columns of B.
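A small worked example of this uneven split (added for illustration; the numbers are not from the
original text): with n = 10 columns and p = 4 processors, r = mod(10, 4) = 2, so two processors
receive ⌈10/4⌉ = 3 columns and the other two receive ⌊10/4⌋ = 2. The same bookkeeping appears in
the OpenMP and MPI programs of section 6:

PROGRAM block_sizes_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 10, p = 4
  INTEGER :: cols(p), r
  r = MOD(n, p)
  cols(: r)     = n / p + 1    ! the first r processors take one extra column
  cols(r + 1 :) = n / p
  PRINT *, cols                ! prints  3 3 2 2
  PRINT *, SUM(cols)           ! prints 10: every column is accounted for
END PROGRAM block_sizes_sketch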

If there are a large number of processors, say p², then we can form p² blocks of order
n/p × n/p of C by multiplying blocks of order n/p × n from A and blocks of order
n × n/p from B in parallel, i.e.

C(1+(i-1)*n/p : i*n/p, 1+(j-1)*n/p : j*n/p) =
    A(1+(i-1)*n/p : i*n/p, :) * B(:, 1+(j-1)*n/p : j*n/p)    for i = 1, p;  j = 1, p

In the limit that n = p, this reduces to each processor calculating a single element of C
using a dot product.

When attempting to parallelise a particular problem it is useful to ascertain the
maximum possible degree of concurrency that can be achieved. In the example
above, further parallelism can be found in the dot product formation. Many
processors are capable of executing a multiply and an add instruction concurrently so
that the dot product could be calculated in k time steps rather than 2k steps. Whilst
not all of the parallelism may be realisable on the currently available hardware, new
hardware allowing greater degrees of parallelism may become available in the future.
If all of the inherent parallelism in a problem is identified at an early stage, it will
then be much easier to modify an existing program to take full advantage of more
powerful computing resources.

The matrix-matrix product algorithms should give good parallel performance since
they satisfy the three conditions mentioned in section 2. Firstly, the problem can be
divided into many separate tasks. Secondly, the tasks are approximately equal (at
least for n >> p, which is usually the case). Finally, the tasks can be performed
independently since no data are required from other processors in order to perform
each parallel task.

The degree to which problems can be divided into independent tasks performed
(potentially) in parallel is termed the granularity. Coarse grained problems perform a
relatively large amount of work on each of a small number of processors, whilst for
fine grained problems the opposite is true. The optimum granularity for any particular
machine depends on the relative cost of computation and communication. Where
inter-processor communication is relatively slow, such as on a PC cluster, only
coarse grained parallelism can be usefully employed; however, where this
communication is much faster, fine grained parallelism is also possible.


6  Implementing the Matrix-Matrix Multiply in FORTRAN 90


The FORTRAN 90 implementation will look only at the first parallel
algorithm
presented in section 5, where the second matrix is partitioned into (almost) equal
number of columns. This method is particularly suited to FORTRAN since
FORTRAN stores matrix columns

contiguously

in memory.

When weighing up the usefulness of an
y parallel implementation it is useful to test a
sequential implementation first so that the best sequential performance can be used as
a benchmark. Setting number of processors = 1 in a parallel version, as benchmark,
usually serves only as an exercise in

self

deception

A very simple sequential code can be arrived at almost trivially using FORTRAN
90's built in matrix

matrix multiply function (MATMUL):

1: PROGRAM multiply_serial
2: IMPLICIT NONE
3: DOUBLE PRECISION, DIMENSION(1000, 1000) :: A, B, C
4: INTEGER :: i, j, n

5: n = 1000
6: A = RESHAPE((/ ((i, i = 1, n), j = 1, n) /), (/ n, n /))
7: B = RESHAPE((/ ((j, i = 1, n), j = 1, n) /), (/ n, n /))

8: C = MATMUL(A, B)

9: END PROGRAM multiply_serial



The matrices, A and B, have here been

initialised
, in lin
es 6 and 7, to prevent the
compiler

optimising

the MATMUL to nothing if the default

initialisation

is zero.
Although this code is simple and gives reasonable results, much better performance
can be gained using a BLAS DGEMM function:


1: PROGRAM multiply_s
erial



2: USE SUNPERF


3: IMPLICIT NONE



4: DOUBLE PRECISION, DIMENSION(1000, 1000) :: A,B,C


5: INTEGER :: n



6: n = 1000


7: A = RESHAPE((/ ((i, i = 1, n), j = 1, n) /), (/ n, n /))


8: B = RESHAPE((/ ((j, i = 1, n), j = 1, n) /), (/ n, n /))



9:

CALL DGEMM('N', 'N', n, n, n, 1D0, A, n, B, n, 0D0, C, n)


10: END PROGRAM multiply_serial



The program can be compiled and linked using:

$ start_hpc
HPC 1.0 $ tmf90 -o serial serial.f90 -fast -xtarget=ultra3 \
              -xarch=v8plusb


The CSD document "Using the Sun Shared Memory Cluster" [3] gives more detailed
information on building and running parallel programs. This code is functionally
equivalent to that using MATMUL but should execute much faster when linked with
the Sun Performance Library (see section 7). For this reason, it will be used as the
basis of the rest of the FORTRAN 90 implementations. The DGEMM function is
described in detail in Appendix A.

It is quite straightforward to produce a multi-thread version of this program using the
widely available multi-thread library known as OpenMP. Lawrence Livermore
provide a succinct but useful introduction to this [4]. For example, the parallel matrix
multiply could be coded as follows:


 1: PROGRAM multiply

 2: IMPLICIT NONE

 3: INTEGER :: thread_id
 4: INTEGER :: i, j, n, no_of_columns, no_of_threads, col
 5: INTEGER :: OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
 6: INTEGER :: remainder

 7: DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE :: A, B, C

 8: n = 1000

 9: ALLOCATE(A(n, n))
10: ALLOCATE(B(n, n))
11: ALLOCATE(C(n, n))

12: A = RESHAPE((/ ((j, j = 1, n), i = 1, n) /), (/ n, n /))
13: B = RESHAPE((/ ((j, i = 1, n), j = 1, n) /), (/ n, n /))
14: C = 0D0

15: !$OMP PARALLEL PRIVATE(col, thread_id, no_of_columns, no_of_threads, remainder) &
16: !$OMP SHARED(A, B, C, n)

17: thread_id = OMP_GET_THREAD_NUM()
18: no_of_threads = OMP_GET_NUM_THREADS()

19: no_of_columns = n / no_of_threads
20: remainder = MOD(n, no_of_threads)

21: IF(thread_id < remainder) THEN
22:    no_of_columns = no_of_columns + 1
23:    col = thread_id * no_of_columns + 1
24: ELSE
25:    col = remainder * (no_of_columns + 1) + &
26:          (thread_id - remainder) * no_of_columns + 1
27: END IF

29: CALL DGEMM('N', 'N', n, no_of_columns, n, 1D0, A, n, B(1, col), &
               n, 0D0, C(1, col), n)

30: !$OMP END PARALLEL

31: DEALLOCATE(A, B, C)

32: END PROGRAM multiply


The heart of the program is in lines 15 to 30. Prior to line 15, the code executes
sequentially on just one thread, known as the master thread. Line 15 is an OpenMP
compiler directive which causes subsequent execution, up to line 30, to be performed
by a group of threads known as a team. Line 30 terminates the thread team and causes
subsequent execution to be performed sequentially again.

All OpenMP directives in FORTRAN 90 must start with !$OMP. Those variables
declared as PRIVATE in line 15 are local to each thread and can therefore hold
different values in each thread, whereas those variables declared as SHARED are
accessed from common memory areas and will therefore hold the same values. The
arrays have been made allocatable so that it is easier to change their size when
benchmarking the code.


The hard work in the program is done by the DGEMM function in line 29. Here we
have arranged that each thread operates on the whole of matrix A and a set number of
columns of matrix B to compute the corresponding columns of matrix C.
Unfortunately the Sun Performance Library BLAS use a FORTRAN 77 style
function interface and so cannot be called with variables that use FORTRAN 90 style
array syntax. DGEMM is therefore called with the first element of each block of B
rather than an array slice (we would like to have written
B(:, col:col+no_of_columns-1)). The USE SUNPERF statement has
deliberately been omitted so that the compiler does not generate an error when using this
inelegant method.

Lines 19-27 look slightly complicated but are just there to handle the case where
the number of threads does not divide the matrix order evenly and so different threads
need to operate on different sized blocks of matrix B. Finally, lines 17 and 18 allow
the program to determine how many threads form part of the team and the identity of
each thread within the team. The number of threads can be specified at run time by
setting the OMP_NUM_THREADS environment variable.

The program can be compiled and linked using the following sample Makefile:

thread.o: thread.f90
	f95 thread.f90 -c -fast -xtarget=ultra3 -xarch=v8plusb -xopenmp

thread: thread.o
	f95 -o thread thread.o lsame.f xerbla.f \
	    -xtarget=ultra3 -xarch=v8plusb \
	    -L/apps/globus/ATLAS/lib/SUN4_USIII \
	    -lf77blas -lcblas -latlas -xopenmp


This makes use of the Atlas BLAS since it was not possible to use the Sun
Performance Library BLAS with the optimisation options. This appears to be a fault
with the Sun compiler which we expect to be fixed in the next release.

An MPI implementation can make use of the same basic algorithm; however, since
this uses a distributed memory model, each MPI process needs to receive a copy of
the whole of matrix A and a block of matrix B. This data cannot simply be accessed
from shared memory as with OpenMP. An example MPI program is given below:


 1: PROGRAM multiply

 2: IMPLICIT NONE
 3: INCLUDE "mpif.h"

 4: INTEGER :: myrank, proc, nprocs, ierr
 5: INTEGER :: ncolumns, remainder
 6: INTEGER :: i, j, n

 7: DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: A, B, C
 8: DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: buffer_B
 9: DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: buffer_C
10: INTEGER, ALLOCATABLE, DIMENSION(:) :: send_count, displacement, cols

11: CALL MPI_INIT(ierr)
12: CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
13: CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

14: n = 1000
15: ALLOCATE(A(n, n))

16: IF(myrank == 0) THEN
17:    ALLOCATE(B(n, n))
18:    ALLOCATE(C(n, n))
19:    A = RESHAPE((/ ((j, j = 1, n), i = 1, n) /), (/ n, n /))
20:    B = RESHAPE((/ ((j, i = 1, n), j = 1, n) /), (/ n, n /))
21:    C = 0.0
22: END IF

23: ALLOCATE(send_count(nprocs))
24: ALLOCATE(displacement(nprocs))
25: ALLOCATE(cols(nprocs))

26: ncolumns = n / nprocs
27: remainder = MOD(n, nprocs)
28: cols(: remainder) = ncolumns + 1
29: cols(remainder + 1 :) = ncolumns
30: send_count = cols * n
31: displacement(1) = 0
32: DO proc = 2, nprocs
33:    displacement(proc) = displacement(proc - 1) + send_count(proc - 1)
34: END DO

35: ALLOCATE(buffer_B(n, cols(myrank + 1)))
36: ALLOCATE(buffer_C(n, cols(myrank + 1)))

37: CALL MPI_BCAST(A, n*n, MPI_DOUBLE_PRECISION, 0, &
                   MPI_COMM_WORLD, ierr)

39: CALL MPI_SCATTERV(B, send_count, displacement, &
                      MPI_DOUBLE_PRECISION, &
                      buffer_B, send_count(myrank + 1), &
                      MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

40: CALL DGEMM('N', 'N', n, cols(myrank + 1), n, 1D0, A, &
               n, buffer_B, n, 0D0, buffer_C, n)

41: CALL MPI_GATHERV(buffer_C, send_count(myrank + 1), &
                     MPI_DOUBLE_PRECISION, C, send_count, displacement, &
                     MPI_DOUBLE_PRECISION, &
                     0, MPI_COMM_WORLD, ierr)

42: CALL MPI_FINALIZE(ierr)

43: END PROGRAM multiply


In the MPI program, all of the variables are local to each MPI process and any data
that needs to be shared has to be passed explicitly. All MPI programs need to include
the file "mpif.h" which contains definitions of parameters such as
MPI_COMM_WORLD. Line 11 causes a set number of MPI processes to be started,
usually on separate processors. The parameter MPI_COMM_WORLD is an MPI
communicator and establishes the group to which a process belongs. It is possible to
form separate sets of processes; however, the union of all of the sets is covered by
MPI_COMM_WORLD.

Lines 12 and 13 establish the number of processes in the MPI_COMM_WORLD
group and the identity (known as the rank) of each. In this example, an array used to
hold matrix A is allocated by all processes whereas matrices B and C are allocated
only by process 0. The arrays buffer_B and buffer_C are used to hold blocks of
matrices B and C respectively and are allocated by all processes.

The code in lines 26-34 deals with the case where the matrix order is not evenly
divisible by the number of processes. Two arrays are set up, send_count and
displacement, which hold the number of elements in each block of B and the position
of each block (i.e. the number of elements from the start of the array). Since MPI
was written originally for FORTRAN 77, it only understands one-dimensional arrays.
We have therefore made use of FORTRAN's column major storage scheme to handle
two-dimensional arrays. Matrix A is sent to all processes using the MPI_BCAST
subroutine in line 37. On an SMP machine such as the Sun V880, this only requires
copying a block of memory but on an MPP it would result in the data actually being
transmitted over the interconnect from process 0 to all other processes.
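The column major point can be seen in a small stand-alone sketch (an illustration added here, not part
of the original program): element B(i,j) of an n x n array occupies position (j-1)*n + i of the underlying
one-dimensional storage, which is why the start of a column, together with an element count, is
enough to describe a block of whole columns to the FORTRAN 77 style MPI and BLAS routines:

PROGRAM column_major_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 4
  DOUBLE PRECISION :: B(n, n), flat(n*n)
  INTEGER :: i, j
  B = RESHAPE((/ ((10*j + i, i = 1, n), j = 1, n) /), (/ n, n /))
  flat = RESHAPE(B, (/ n*n /))            ! same element order as B in memory
  ! B(1,3) and flat((3-1)*n + 1) are the same value: column 3 starts at
  ! position 2*n + 1 of the contiguous storage
  PRINT *, B(1, 3), flat((3 - 1)*n + 1)   ! both print 31.0
END PROGRAM column_major_sketch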

The MPI_SCATTERV subroutine call in line 39 is used to distribute blocks of matrix
B amongst the processes. It makes use of the send_count and displacement arrays to
determine the size and location of the blocks that are to be used. This is called a
scatter operation, hence the name. As with the OpenMP version, the DGEMM
routine again performs the hard work of the program. In line 40, DGEMM is used to
calculate the product of matrix A and a block of B in exactly the same way as for the
OpenMP case. The MPI_GATHERV routine called in line 41 is used to collect the
blocks of matrix C on process 0. This gather operation is essentially the opposite of
MPI_SCATTERV in that data is collected rather than distributed. Finally, the MPI
processes are terminated at line 42 using the MPI_FINALIZE routine.

The program can be compiled and linked using:

$ start_hpc
HPC 1.0 $ tmf90 -o multiply multiply.f90 -fast -xtarget=ultra3 \
              -xarch=v8plusb -xlic_lib=sunperf -lmpi

An IBM Red book gives a good overview of MPI although some of it is more geared
to their RS/6000 workstations [5].



7  Testing the Matrix-Matrix Multiply

The various matrix-multiply routines were tested by running them on ulgsmp1 using
the Sun Grid Engine. Additional code was added to calculate the elapsed time, and
each multiply operation was performed several times in order to give accurate and precise
results. The examples can be found under:

/varapps/sgeee_5.3/examples

Each directory contains a README file with further details, and it is a good idea to
copy them under your home file store in order to build and run them.
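The timing code used for these tests is not reproduced in this document. A minimal sketch of the kind
of measurement described (an assumed implementation using SYSTEM_CLOCK and the usual 2n³
operation count for a dense matrix-matrix multiply; it must be linked against a BLAS library providing
DGEMM) might look like this:

PROGRAM time_dgemm_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000, nrep = 5
  DOUBLE PRECISION, DIMENSION(n, n) :: A, B, C
  DOUBLE PRECISION :: elapsed, gflops
  INTEGER :: count0, count1, rate, rep
  CALL RANDOM_NUMBER(A)
  CALL RANDOM_NUMBER(B)
  CALL SYSTEM_CLOCK(count0, rate)
  DO rep = 1, nrep                            ! repeat to smooth out timing noise
     CALL DGEMM('N', 'N', n, n, n, 1D0, A, n, B, n, 0D0, C, n)
  END DO
  CALL SYSTEM_CLOCK(count1)
  elapsed = DBLE(count1 - count0) / DBLE(rate) / DBLE(nrep)
  gflops  = 2D0 * DBLE(n)**3 / elapsed / 1D9  ! 2*n**3 operations per multiply
  PRINT *, 'time per multiply (s):', elapsed, '   GFLOPS:', gflops
END PROGRAM time_dgemm_sketch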

The serial code can be run using a batch script such as this:

#!/bin/ksh
#$ -pe mt_smp1 8
#$ -o /home/qcl/smithic/.lfs/txt/ftn/output
#$ -e /home/qcl/smithic/.lfs/txt/ftn/error
/home/qcl/smithic/.lfs/txt/ftn/serial

A version of the serial code making use of the automatic threading options was
submitted using the following batch script:

#!/bin/ksh
#$ -pe mt_smp1 8
#$ -o /home/qcl/smithic/.lfs/txt/ftn/output
#$ -e /home/qcl/smithic/.lfs/txt/ftn/error
export PARALLEL=2
/home/qcl/smithic/.lfs/txt/ftn/serial

Here the PARALLEL environment variable sets the maximum number of threads to
be used. A batch file for the explicitly coded thread version uses
OMP_NUM_THREADS instead of PARALLEL to set the maximum number of
threads, e.g.:

#!/bin/ksh
#$ -pe mt_smp1 8
#$ -o /home/qcl/smithic/.lfs/txt/ftn/output
#$ -e /home/qcl/smithic/.lfs/txt/ftn/error
export OMP_NUM_THREADS=2
/home/qcl/smithic/.lfs/txt/ftn/thread

Finally, a batch script for the MPI implementation requires that the MPI parallel
environment be specified using the -l and -pe options, e.g.:

#!/bin/ksh
#$ -pe mpi_smp1 2
#$ -o /home/qcl/smithic/.lfs/txt/ftn/output
#$ -e /home/qcl/smithic/.lfs/txt/ftn/error
MPIRUN /home/qcl/smithic/.lfs/txt/ftn/matrix

The number of MPI processes is given in the second argument to the -pe option (here
2). A fuller description of how to use the Sun Grid Engine is given in the CSD
document, "Using the Sun Shared Memory Cluster" [3].

The results of testing the serial code with automatic parallelisation are shown in figure
1. The standard serial code without any parallelisation has a peak performance of
around 1.8 GFLOPS for large problems, which is the peak performance of a single
processor. This is therefore a useful benchmark for judging the performance of the
parallel versions.

With small matrices the performance is relatively poor since the cost of getting data
from main memory into cache is significant compared to the time taken for the
computations to be performed. However, for n > 100 the results show good scalable
performance increases as the number of threads is increased.

With 7 threads, the asymptotic speed is around 11 GFLOPS which compares
favourably with the peak achievable speed of 7 * 1.8 = 12.6 GFLOPS (an efficiency
of 87%). Since the various operating system daemons need to run on at least one
processor, there is a conflict between these and the application program in the 8 thread
case; hence the poor performance of the 8 thread version.
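For reference, the parallel efficiency quoted here is the usual definition (it is not spelled out in the
original text): the achieved rate divided by the number of processors times the single processor rate,

$$ E(p) \;=\; \frac{R_p}{p\,R_1} \;=\; \frac{11}{7 \times 1.8} \;\approx\; 0.87 $$

where $R_p$ is the rate achieved on $p$ processors.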

The results for the OpenMP version are shown in figure 2. The single thread version
is clearly far inferior to the serial benchmark code since the Atlas BLAS are not
optimised for the UltraSPARC processors to the same extent as the Sun Performance
Library versions. The performance does appear scalable though, and reasonable speed
ups are achievable with only small problems (n > 60 approximately). It is expected
that the OpenMP performance will be much improved when the compiler problems
are resolved and the Sun Performance Library BLAS routines can be used.

Figure 3 shows the results obtained for the MPI version of the program. It is clear that
much larger problems are needed to get good performance in the MPI case. This is
because of the overhead in distributing the A and B matrices and gathering the C
matrix. Since the amount of data movement is of order O(n²) while the amount of
computation is of order O(n³), the overhead becomes less significant for larger
problems. Also, since the amount of data movement is proportional to the number of
processes, increasing the number of processes requires an increase in problem size to
give the same efficiency for small problems. In spite of this, the 7 process version still
gives a parallel efficiency of around 80% for n = 3000.
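To put rough numbers on this (an illustrative estimate based on the algorithm of section 6, not
measurements from these tests): broadcasting A to p processes, scattering B and gathering C moves
on the order of $(p+2)\,n^2$ matrix elements, while the multiplication itself performs about $2n^3$
floating point operations. For $n = 3000$ and $p = 7$ that is roughly $8 \times 10^{7}$ elements (about
650 MB in double precision) of communication against $5.4 \times 10^{10}$ operations, and the ratio
of computation to communication grows linearly with $n$.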










Figure 1: DGEMM on ulgsmp1 using automatic threading. [Plot of GFLOPS against matrix order for the standard serial code and for 1 to 8 threads.]







Figure 2: Matrix-matrix multiply on ulgsmp1 using OpenMP. [Plot of GFLOPS against matrix order for the standard serial code and for 1 to 8 threads.]





Figure 3: Matrix-matrix multiply on ulgsmp1 using MPI. [Plot of GFLOPS against matrix order for the standard serial code and for 1 to 8 MPI processes.]



8  Further Information


[1]

Memory Hierarchy in Cache
-
Based Systems Ruud van der Pas, Sun
Microsystems




http://www.sun.com/solutions/blueprints/online.html

[2]

Designing and Building Parallel Programs, Ian Foster. Addison Wesley
-
Inc.
Also in HTML form at:


http://www

unix.mcs.anl.gov/dbpp/

[3]

Using the Sun Shared Memory Cluster, University of Liverpool Computing
Services Department document 14



http://www.liv.ac.uk/CSD/acuk_html/index
.html

[4]

open

MP, Lawrence Livermore National Laboratory,

http://www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html

[5]

RS/6000 SP: Practical
MPI Programming, IBM



http://www.redbooks.ibm.com/





Appendix A - DGEMM Function






      SUBROUTINE DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
C     .. Scalar Arguments ..
      DOUBLE PRECISION ALPHA, BETA
      INTEGER K, LDA, LDB, LDC, M, N
      CHARACTER TRANSA, TRANSB
C     .. Array Arguments ..
      DOUBLE PRECISION A(LDA,*), B(LDB,*), C(LDC,*)
C
C  Purpose
C  =======
C
C  DGEMM performs one of the matrix-matrix operations
C
C     C := alpha*op(A)*op(B) + beta*C,
C
C  where op(X) is one of
C
C     op(X) = X   or   op(X) = X',
C
C  alpha and beta are scalars, and A, B and C are matrices, with
C  op(A) an m by k matrix, op(B) a k by n matrix and C an m by n matrix.
C
C  Parameters
C  ==========
C
C  TRANSA - CHARACTER*1.
C           On entry, TRANSA specifies the form of op(A) to be used in
C           the matrix multiplication as follows:
C
C              TRANSA = 'N' or 'n',  op(A) = A.
C
C              TRANSA = 'T' or 't',  op(A) = A'.
C
C              TRANSA = 'C' or 'c',  op(A) = A'.
C
C           Unchanged on exit.
C
C  TRANSB - CHARACTER*1.
C           On entry, TRANSB specifies the form of op(B) to be used in
C           the matrix multiplication as follows:
C
C              TRANSB = 'N' or 'n',  op(B) = B.
C
C              TRANSB = 'T' or 't',  op(B) = B'.
C
C              TRANSB = 'C' or 'c',  op(B) = B'.
C
C           Unchanged on exit.
C
C  M      - INTEGER.
C           On entry, M specifies the number of rows of the matrix
C           op(A) and of the matrix C. M must be at least zero.
C           Unchanged on exit.
C
C  N      - INTEGER.
C           On entry, N specifies the number of columns of the matrix
C           op(B) and the number of columns of the matrix C. N must be
C           at least zero.
C           Unchanged on exit.
C
C  K      - INTEGER.
C           On entry, K specifies the number of columns of the matrix
C           op(A) and the number of rows of the matrix op(B). K must
C           be at least zero.
C           Unchanged on exit.
C
C  ALPHA  - DOUBLE PRECISION.
C           On entry, ALPHA specifies the scalar alpha.
C           Unchanged on exit.
C
C  A      - DOUBLE PRECISION array of DIMENSION (LDA, ka), where ka is
C           k when TRANSA = 'N' or 'n', and is m otherwise.
C           Before entry with TRANSA = 'N' or 'n', the leading m by k
C           part of the array A must contain the matrix A, otherwise
C           the leading k by m part of the array A must contain the
C           matrix A.
C           Unchanged on exit.
C
C  LDA    - INTEGER.
C           On entry, LDA specifies the first dimension of A as declared
C           in the calling (sub) program. When TRANSA = 'N' or 'n' then
C           LDA must be at least max(1, m), otherwise LDA must be at
C           least max(1, k).
C           Unchanged on exit.
C
C  B      - DOUBLE PRECISION array of DIMENSION (LDB, kb), where kb is
C           n when TRANSB = 'N' or 'n', and is k otherwise.
C           Before entry with TRANSB = 'N' or 'n', the leading k by n
C           part of the array B must contain the matrix B, otherwise
C           the leading n by k part of the array B must contain the
C           matrix B.
C           Unchanged on exit.
C
C  LDB    - INTEGER.
C           On entry, LDB specifies the first dimension of B as declared
C           in the calling (sub) program. When TRANSB = 'N' or 'n' then
C           LDB must be at least max(1, k), otherwise LDB must be at
C           least max(1, n).
C           Unchanged on exit.
C
C  BETA   - DOUBLE PRECISION.
C           On entry, BETA specifies the scalar beta. When BETA is
C           supplied as zero then C need not be set on input.
C           Unchanged on exit.
C
C  C      - DOUBLE PRECISION array of DIMENSION (LDC, n).
C           Before entry, the leading m by n part of the array C must
C           contain the matrix C, except when beta is zero, in which
C           case C need not be set on entry.
C           On exit, the array C is overwritten by the m by n matrix
C           (alpha*op(A)*op(B) + beta*C).
C
C  LDC    - INTEGER.
C           On entry, LDC specifies the first dimension of C as declared
C           in the calling (sub) program. LDC must be at least
C           max(1, m).
C           Unchanged on exit.