PARALLEL PROGRAMMING COMPANION GUIDE

Doc. Identifier: Parallel Programming Companion Guide
Date: 01/10/2009
Subject: Parallel Programming Companion Guide
Author(s): Adrian Jackson
Distribution: Public

1. Introduction

Parallel programming is necessary for most scientific simulations with high
computational requirements. Scientific simulation codes generally begin as serial
programs written to investigate a particular scientific phenomenon or theory.
Depending on the complexity of the simulation code and the problem being
investigated (the input parameters), the program may take a long time to run on a
single processor, or require more memory than a single processor has available. If
this is the case, some form of parallel programming is necessary.

The type of parallel programming used for scientific simulation often depends on
the specifics of the parallel computer that the programmer has access to and the
particular requirements of the program to be parallelised. The aim of this document
is to provide some background on the dominant parallel programming paradigms
currently used in the scientific programming community, to enable a developer to
make a more informed decision about whether parallelisation is necessary for their
program and what type of parallelism is best suited to their situation.

It is possible to exploit small amounts of parallelism using basic thread
programming, such as that provided by the Java and C programming languages.
However, thread programming is often complicated to perform and can only
exploit a small amount of computing resources (i.e. up to the number of cores or
processors available in a single computer, currently a maximum of around 8
processing elements). Therefore, if you want to exploit large amounts of
computing resources you need to undertake some level of parallel programming.

There is a class of problems, called “embarrassingly parallel” or “parameter
investigation” problems, where large amounts of computing resources (either
parallel or Grid computers) can be exploited without parallel programming, simply
by running large numbers of copies of the same program with different input data
and then performing some processing (post-processing) on the data produced.
This type of parallel computation does not require any change to the original
program, but it is restricted in the size of problem or input data it can tackle, as
each copy can only exploit at most the resources of a single CPU (or core) and its
associated memory (i.e. you cannot use more than one core/CPU per program and
it can only access the memory associated with it). As previously mentioned, this
approach is most useful for programs that can be run on a single core/processor in
a reasonable time but have a wide parameter space (i.e. set of possible inputs) to
investigate.

There is a range of different parallel programming methodologies, often classified
using “Flynn’s Taxonomy”. The dominant methodology for most current parallel
programs is MIMD (multiple instruction, multiple data), where separate programs
are executed using separate data but co-operate in some way. There are two
sub-classes of MIMD, DM (distributed memory) and SM (shared memory), which we
will discuss in more detail. However, it is important to recognise that, in general,
MIMD is actually implemented using SPMD (single program, multiple data)
methodologies (i.e. a single program is executed by each processor/core, but the
processors can take different paths through that program, or be at different stages
in it, depending on their input data and status).


There are currently two main parallel programming approaches based on SPMD:
• Shared memory
• Distributed memory

Shared memory programming can only exploit processors that have access to the
same memory space. For instance, a multi-core processor can be used for shared
memory programs.

Distributed memory programming can exploit any connected processors (including
shared memory systems) providing they have the software/libraries installed on
them that support this type of programming. For performance reasons it is usually
restricted to processors connected by a high performance network.

The fundamental difference between the two types of parallel programming is the
way they communicate and access data in memory. In shared memory
programming each running process can access a shared memory space, and data
is communicated between processes (the instances of the program that are being
executed) by reading and writing data in this shared memory. Generally the
processes have access both to shared memory, which all processes can access,
and to private memory, which can only be accessed by the owning process.

In distributed memory programming, on the other hand, processes only have
access to private memory. The mechanism for transferring and communicating
data between different processes in a distributed memory program is generally via
the sending of messages. Often distributed memory programming techniques are
known as message passing programming.

The next sections describe these programming paradigms in more detail and
provide information on the standard libraries or tools that are used to implement
parallelism.

However, it must be mentioned that these programming paradigms are not the only
issue to consider when thinking about developing parallel programs. Which
paradigm you choose depends upon the specifics of the problem being simulated,
how the algorithms operate, and what the computational requirements are. For a
developer to consider parallelising a code there must be a clear benefit, such as a
reduced runtime, access to more memory, or the ability to compute larger
simulations than are possible on a single computer, as any parallelisation has an
associated overhead which affects performance.

Once it is clear that parallelisation is necessary for a given code, a number of
features of the code need to be examined. The first consideration is the potential
for parallelism in the code, i.e. the parts of the code which are inherently serial
versus the parts which could be done at the same time (i.e. in parallel). Next, the
data structures used in the current code, and how that data will be split amongst
the different processes when they are running in parallel, must be examined. This
is often called the data decomposition. The work performed in the code should also
be considered, and whether parts of it can be assigned to different processes in the
parallel program (work or task decomposition). Once all these aspects have been
analysed, a decision can be made as to whether parallelism is possible with the
code and what technique to use. Often the algorithms used in a serial code have to
be modified before parallelisation can be undertaken.

This document only discusses the different parallel paradigms; design decisions
such as data and work decomposition are beyond the scope of what we are covering
here. Our aim is to provide some background information on how parallelism is
generally implemented and the differences between the various techniques.

2. Shared Memory

Shared memory programming relies on each process in a running program (the
individual instances of a running parallel program are generally called processes)
having access to some shared memory space where they can read and write data.
Processes communicate with one another by writing data into the shared memory
space for other processes to read.

This means that communication between processes is not written explicitly by the
developer of the program; it is done implicitly through the normal memory/data
operations that a program undertakes. One of the main advantages of this method
of communication is that it enables shared memory programs to be developed
easily and incrementally. It is generally very easy to take a serial program and add
parallelism to different parts of it as required, without having to fundamentally
change the structure or nature of the program. This makes shared memory
programming very attractive for new parallel developers as it is easy to add and
remove the parallelism and check whether the program is still functioning correctly.

However, this simplicity of programming, which is very beneficial for initial
development, means that programmers have limited control over how the
parallelism is implemented, particularly over where the shared memory data is
located and how it is accessed. This can lead to performance issues for more
complicated or larger scale shared memory programs, although these can often be
addressed using some of the more advanced features available in the shared
memory programming languages/techniques.

The dominant shared memory programming standard is OpenMP. It is based on
compiler directives (additions to a program which instruct the compiler to
undertake various actions when compiling it), which consist of sentinels of the
following form:

• Fortran: !$OMP (or C$OMP or *$OMP for F77)
• C/C++: #pragma omp

The sentinels shown above are then followed by keywords which are instructions to
the compiler, for example:

!$OMP PARALLEL DO
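
The corresponding directive in C/C++ is written with the pragma sentinel; the C
form of the directive above is:

#pragma omp parallel for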

The compiler directives are interpreted by an OpenMP-aware compiler and ignored
by a compiler that does not have such functionality, or where such functionality
has been turned off. Therefore, it is possible to compile the same source code to
run in both serial and parallel modes, one of the features which makes it easy to
add and test parallel functionality using OpenMP.
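
For example, with the GNU compilers OpenMP support is enabled with the
-fopenmp flag, while compiling the same source file without it produces a serial
executable (the file name here is purely illustrative):

gcc -fopenmp myprogram.c -o myprogram
gcc myprogram.c -o myprogram_serial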

There are two main constructs which can be used to provide parallel functionality in
OpenMP:

• Parallel Loops
• Parallel Regions

An OpenMP program starts by executing only one instance of itself in serial. The
parallelism only occurs when one of the parallel constructs of the language is
encountered. At that point more instances of the program are created to do the
parallel work, and once that parallel work has been completed these parallel
instances are no longer used meaning the program goes back to serial operation
once more.

Parallel loops involve distributing the iterations of a standard for or do loop (in C or
FORTRAN respectively) across different processes, splitting up the work to be done
in the loop and thus reducing the overall runtime by using more processes to
complete the work of the loop. This type of parallelism is very simple to implement
as it only requires a directive at the start of the loop to be parallelised, and OpenMP
can automatically distribute the work. However, this type of parallelism can only be
used if the iterations of the loop are independent, i.e. the calculation of one loop
iteration does not depend on the output of a previous loop iteration. For instance,
the following code does have independent loop iterations:

for (i = 0; i < 100; i++) {
    a[i] = a[i] + c[i];
}
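
As stated above, parallelising such a loop only requires a directive before it; a
minimal sketch (assuming i, a and c are declared as in the surrounding program) is:

#pragma omp parallel for
for (i = 0; i < 100; i++) {
    a[i] = a[i] + c[i];
}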

Whereas this loop does not have independent iterations, as the value of a at one
iteration depends on the value of a at the previous iteration:

do i=1,100
a(i) = a(i-1) + c(i)
end do

Parallel regions provide more generic parallel functionality within OpenMP. This
basic parallel construct specifies that the section of the program contained within
the parallel region (defined with parallel and end parallel directives) should be
executed in parallel (i.e. using the number of processes specified when the program
was run). The statements inside the parallel region are executed by all the parallel
processes, although it is possible to undertake conditional execution depending on
the identity of the process (when OpenMP creates the processes for the parallel
program it gives each of them a number, which can be queried using OpenMP
functions; this number can then be used in if statements to change the behaviour
of different processes).
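
A minimal sketch of a parallel region in C, using the standard OpenMP runtime
functions omp_get_thread_num and omp_get_num_threads (the parallel instances
referred to above as processes correspond to OpenMP threads here):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* each parallel instance queries its own identity */
        int id = omp_get_thread_num();
        int total = omp_get_num_threads();

        /* conditional execution based on that identity */
        if (id == 0) {
            printf("Running with %d parallel instances\n", total);
        }
        printf("Hello from instance %d\n", id);
    }
    return 0;
}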

OpenMP also has directives that allow reductions to be performed automatically
(i.e. performing a basic mathematical operation on data from every process to
produce a single result), as well as advanced operations that allow optimisation of
memory accesses and synchronisation.
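
A reduction is requested with a clause on the parallel loop directive; as a sketch
(assuming i is declared and c is an array of 100 values declared elsewhere), the
following sums the elements of c across all parallel instances:

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 100; i++) {
    sum = sum + c[i];
}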

As previously discussed, shared memory programming techniques are limited to
exploiting computational resources within which multiple processors or cores have
access to the same memory space. The recent trend of building parallel computers
from clusters of small shared memory machines provides the potential for both
shared memory and distributed memory programming on most parallel resources,
depending on the number of processors that a user wants to exploit.

The advent of multi-core processors, which are effectively shared memory
resources, has provided the potential for small amounts of parallelism even on
users’ desktop machines.

Current trends point to the development of many-core processors in the near
future, with processors of up to 24 cores expected to be released towards the end
of 2010.

All these factors make shared memory an interesting and relevant parallel
programming technique which is likely to become more applicable in the near
future.

3. Distributed Memory

Distributed memory parallelism is characterised by the ability to use processors
that are separate, with no shared memory or shared resources, connected only by
a network (of some form): a so-called “shared-nothing” model. The commonest
distributed memory paradigm is message passing, where the different processes
that co-operate to undertake the parallel program communicate data and
synchronise control by sending and receiving messages over the network that
connects them.

Message passing, from a programmer’s perspective, is undertaken by calls to a
message passing interface or library, which does the work of dealing with the
physical communication network and its features. The current de-facto message
passing standard is MPI (Message Passing Interface). MPI is a set of standards,
defined by the MPI Forum, which specify functions for sending messages between
processes, together with associated datatypes and ancillary functions to support
these communications. There are two main standards that make up MPI: MPI-1
and MPI-2. Most basic functionality is provided in the MPI-1 standard, with more
advanced features defined in the MPI-2 standard. There are a number of open-
source or freely available MPI implementations, as well as those provided by the
major parallel computer manufacturers. All MPI distributions support the MPI-1
standard, although not all fully support the MPI-2 standard.

One of the key objectives of the MPI standard is to provide portability between
different parallel machines. MPI therefore defines its own datatypes, used for data
transfers, which are mapped to specific machine datatypes by the MPI library
implementation; this should ensure that programs do not have to be re-written to
use different computing hardware.

As with shared memory programming the general approach is to write a single
program which all processes execute, although they can take different paths
through that program based on their process ID (often called a process rank).

All communications in MPI take place within a communication space called a
communicator. A communicator defines a group of processes within the parallel
program and provides a mechanism for messages to be sent between them. The
MPI library sets up a default communicator when MPI is initialised in a program.
This communicator contains all the processes which are running the parallel
program, and it is the context within which a process ID can be obtained. There
are standard functions provided by the MPI library to obtain a process’s rank and
the total number of processes co-operating on the parallel program (the size of
the parallel program).

Any MPI program requires at least two library function/sub-routine calls: one to
initialise the message passing functionality and another to finish it. These are
called Init and Finalize respectively (in C the functions are MPI_Init() and
MPI_Finalize(), in FORTRAN MPI_INIT() and MPI_FINALIZE()).
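
A minimal MPI program in C, showing initialisation, querying the process rank and
the program size within the default communicator (MPI_COMM_WORLD), and
finalisation, would look something like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the message passing */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();                        /* finish the message passing */
    return 0;
}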

There are a number of different mechanisms for communicating using MPI, defined
by which processes are involved in the communication and how the communication
occurs. The two main types are:
• Point-to-point messages
• Collective messages

Point-to-point communications are those which take place between two processes,
one sending and one receiving data. Each point-to-point operation must have a
sender and a receiver who call different MPI functions to carry out these tasks. The
sender specifies the location, size, and type of data to be sent along with the rank
of the process the data is being sent to. The receiver specifies where the received
data will be stored, the type and size of data it is expecting, and the rank of the
process that the data is coming from. This is all done using functions or sub-
routines from the MPI library. For instance, a C program could call the following
library routine to send a message:

MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);

And this code to receive a message:

MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);


A FORTRAN 90 code could call the following routine to send a message:

CALL MPI_SEND (BUFF, BUFSIZE, MPI_CHARACTER, i, TAG, &
MPI_COMM_WORLD, IERR)

And this code to receive a message:

CALL MPI_RECV (BUFF, BUFSIZE, MPI_CHARACTER, i, TAG, &
MPI_COMM_WORLD, STAT, IERR)

It is important to note that for any point-to-point message to be successfully sent,
both the send and the receive code need to be implemented. If a process has
called a send routine but the process it has specified as the recipient of that
message has not called a receive routine (or vice versa), then the calling process
can be left waiting indefinitely until such a call is made. This is often called
deadlock: a condition that occurs when one process is waiting for another process
to do something which never happens.
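
As a sketch of a correctly matched exchange in C (assuming rank, buff, BUFSIZE,
TAG and stat are set up as in the earlier examples, and the program is run with at
least two processes), process 0 calls the send and process 1 calls the matching
receive:

if (rank == 0) {
    /* send BUFSIZE characters to process 1 */
    MPI_Send(buff, BUFSIZE, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
} else if (rank == 1) {
    /* matching receive from process 0 */
    MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
}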

MPI supports a number of different modes and types of communication.
Communication can be blocking (the routine or library function waits until the
message has started to send before completing and allowing the process to
continue computation) or non-blocking (the routine finishes its operation once the
data to be sent has been copied to the MPI library), and messages can be
synchronous (the message send completes once it has started to be received by the
receiving process) or asynchronous (the message send completes once the data
has been passed to the MPI library regardless of whether it has been sent or not).
This variety of sending modes and routine completion behaviours provides a
number of opportunities for optimising communications, and therefore program
performance, but can also lead to incorrect or poorly performing programs if not
fully understood.
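
As a sketch of the non-blocking routines (MPI_Isend, MPI_Irecv and MPI_Wait are
standard MPI functions; the buffer names and the partner rank are illustrative and
assumed to be set up elsewhere):

MPI_Request sendreq, recvreq;
MPI_Status  stat;

/* post the receive and the send; both calls return immediately */
MPI_Irecv(recvbuff, BUFSIZE, MPI_CHAR, partner, TAG, MPI_COMM_WORLD, &recvreq);
MPI_Isend(sendbuff, BUFSIZE, MPI_CHAR, partner, TAG, MPI_COMM_WORLD, &sendreq);

/* useful computation can overlap with the communication here */

MPI_Wait(&sendreq, &stat);   /* sendbuff can safely be reused after this */
MPI_Wait(&recvreq, &stat);   /* recvbuff now contains the incoming message */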

Collective communications are those which involve all processes in the MPI program
(more accurately, all the processes in an MPI communicator, which for most
programs is the same as all the processes involved in the parallel execution). As
they involve all processes, they must be called by every process to work, unlike
point-to-point communications, which require only two processes to call functions
to complete the communication. They generally take longer to perform than
point-to-point communications, as they involve messages to all processes and imply
a synchronisation where all processes have to reach the same point in a program
before the collective communication can complete. However, if collective
communication is required by an application, using the functions provided by the
MPI library will invariably give better performance than writing your own collective
communications using point-to-point messages (as well as removing the
implementation overhead and the risk of developing incorrect code).

The MPI standard provides collective functions for barrier (a synchronisation
collective), broadcast (send data to all processes), gather and all-gather (collect
data from all processes), scatter (send different data to different processes),
reduction (perform global calculations using a range of mathematical operations),
and other variations of collective communication.
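
As a sketch of two of these collectives in C (the buffer names, counts and root
process are illustrative): a broadcast followed by a sum reduction onto the root
process.

/* send the contents of data on the root process to every process */
MPI_Bcast(data, n, MPI_DOUBLE, root, MPI_COMM_WORLD);

/* ... each process computes its own localsum ... */

/* combine the local sums from every process into totalsum on the root */
MPI_Reduce(&localsum, &totalsum, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);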

Because of the shared-nothing nature of distributed memory programming, it tends
to be harder to develop and modify programs in an incremental manner than is
possible using a shared memory approach such as OpenMP. Given a working serial
application, all the data accesses and communications have to be developed before
the program can be tested and validated, making the process of developing a
distributed memory program more daunting and difficult than developing the
equivalent shared memory program. However, these problems are often
outweighed by the fact that distributed memory programs, such as those
implemented using MPI, currently have access to a much wider range of resources
and machines than the equivalent shared memory program.


4. Conclusions

Of the two major parallel programming paradigms discussed in this document,
shared memory parallelism is generally easier to implement but is currently limited
in the size and scope of the parallelisation possible. Distributed memory parallelism
provides much greater opportunities for exploiting large amounts of computing
resources, but is more difficult and time consuming to program, especially in the
initial conversion of a serial program into a parallel program. Often programmers
will first use shared memory parallelism for their codes and then, when they
encounter simulations for which this approach does not provide enough resources,
look at using distributed memory parallelism. Both methods have the potential to
provide parallelism efficiently, and both also have the potential for poor
performance and errors if not correctly implemented.

This companion guide has attempted to provide the basic background and
information required by developers or scientists who have never attempted
parallel programming but are considering doing so. The aim has been to provide a
brief overview of the main areas of parallel programming generally used in the
scientific simulation community and to point to the main standards, libraries, or
tools that are employed. This companion guide should be used in conjunction with
more in-depth training on parallel programming, such as that provided by the
EUFORIA project. Please see the EUFORIA project website
(http://www.euforia-project.eu) for further details of forthcoming training courses.






