2D1263:Scientic Computing Lecture 9 (1)
Plan
Parallel programming
- Parallel architectures
- Distributed memory architectures: essentials
- Parallel programming at Nada
Distributed memory architectures (MPI)
- Managing multiple processes with MPI
- Point-to-point communication (blocking)
- Example: difference operator in 1D
2D1263:ScienticComputingLecture9(2)
Parallelprogramming
Foster:\Aparallelcomputerisasetofprocessorsthatareableto
workcooperativelytosolveacomputationalproblem."
Weshallsometimesconsiderprocessesinsteadofprocessors,since
parallelcomputerscanbeemulatedonasingleprocessor.
Manyrelevantcomputationaltasksarelarge,andmayrequire
millionsorbillionsofgridpoints.
Underlyingidea:Ifmanyprocessesworksimultaneouslyona
problemweshouldbeabletosolveitfaster.
2D1263:Scientic Computing Lecture 9 (3)
Parallel programming
+ Parallel programming leads to increased
computational resources (CPU and RAM).
+ We can now solve larger problems or
increase the accuracy.
{ Can be dicult to fully exploit
computational resources.
{ Can lead to portability problems if one is
not careful.
{ Parallel systems are expensive.
We consider three types of parallel
architectures:
 Vector computers
 Shared memory architectures
 Distributed memory architectures
Modern computers are often a mixture of these.
2D1263:Scientic Computing Lecture 9 (4)
Vector computers
Some operations are naturally parallel:matrix
addition can be performed on independent
sub-blocks.
Vector computers include vector operations in
the instruction set.
Two dierent types:
 Multiple functional units { the CPU can
perform many operations simultaneously
 Pipelining { even simple operations like
addition require several assembler
statements.Perform one assembler
operation at a time,rather than one
mathematical operation.
Oldest type of parallel architecture.
Good if algorithms can be formulated in terms
of vector-vector operations.
Example:Earth Simulator (NEC,Japan),rank
3 in TOP 500.
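As a small illustration (not from the slides), an axpy-style loop is the kind of vector-vector operation that pipelined or multi-unit vector hardware handles well; the function name is ours:

    #include <vector>
    #include <cstddef>

    // y := a*x + y: a vector-vector operation whose iterations are
    // independent, so a vector unit can pipeline them back to back.
    void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] += a * x[i];
    }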
2D1263:ScienticComputingLecture9(5)
Sharedmemoryarchitectures
P1P2Pn
. . .
MEMORY
2D1263:Scientic Computing Lecture 9 (6)
Shared memory architectures
Multiple processors with little local memory
sharing a (large) global memory.
All data is accessible from every processor.
+ Access time to all data roughly equal for
all processors.
+ From the programmers point of view
memory is accessible as on an ordinary
computer
{ Reading & writing must be synchronized to
ensure that no memory con icts occur.
Often imposes an upper bound on the
number of processors.
{ Dicult to exploit data locality in
problems.
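As a small illustration (not from the slides), even a simple shared counter must be protected when several threads on a shared-memory machine update it; this sketch uses C++ threads and a mutex:

    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        long counter = 0;   // shared data in the global memory
        std::mutex m;       // serializes the conflicting writes

        auto work = [&]() {
            for (int i = 0; i < 100000; ++i) {
                std::lock_guard<std::mutex> lock(m);  // without this, the updates race
                ++counter;
            }
        };

        std::vector<std::thread> threads;
        for (int p = 0; p < 4; ++p) threads.emplace_back(work);
        for (auto& t : threads) t.join();

        std::cout << counter << std::endl;  // 400000 only because of the lock
        return 0;
    }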
2D1263:ScienticComputingLecture9(7)
Distributedmemoryarchitectures
P1P2Pn
MnM2M1
. . .
. . .
COMMUNICATION NETWORK
2D1263:Scientic Computing Lecture 9 (8)
Distributed memory architectures
Each processor has it's own local memory,and
does not have direct access to data on other
processes.
+ No need to synchronize memory access
+ Data locality can be eciently exploited
{ If data is needed from external processes it
must be sent over the network (\message
passing").
This course will focus on distributed memory
architectures,which seem to be the most
popular.
The current trend is to use small shared
memory computers as processors in a
distributed architecture.
Example:BlueGene L (IBM,USA),rank 1 in
TOP 500
2D1263:Scientic Computing Lecture 9 (9)
Parallel algorithms
In what follows we shall focus on distributed
memory architectures.Some important
concepts are:
Data distribution
Data should be distributed across the
processors such that locality can be exploited.
Communication
Communication takes time and should be
minimized.We are interested in computation.
Load balancing
Each process should have roughly the same
amount of work.
Otherwise some processes may be idle while
waiting to communicate.
2D1263:Scientic Computing Lecture 9 (10)
Parallel algorithms
Choosing an appropriate data distribution is
often the most crucial part of formulating an
ecient algorithm.
It is sometimes necessary to choose a dierent
numerical algorithm than one would have on a
single processor machine.
Example,y
0
= Ay for large matrices A.
 y
n+1
= y
n
+tAy
n
{ stability restrictions,
little communication
 y
n+1
= (I tA)
1
y
n
{ more stable,much
communication
For PDEs on structured grids good load
balancing can often be obtained by dividing the
grid evenly among the processes.
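As a small sketch (not from the slides) of such an even division, the following function computes which slice of an N-point 1D grid a given process owns; the function and variable names are ours:

    // Even block distribution of N grid points over P processes:
    // process p owns the half-open index range [start, start + count).
    // The first (N mod P) processes get one extra point, so the load
    // differs by at most one point between processes.
    void block_range(int N, int P, int p, int& start, int& count) {
        int base = N / P;
        int rem  = N % P;
        count = base + (p < rem ? 1 : 0);
        start = p * base + (p < rem ? p : rem);
    }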
2D1263:Scientic Computing Lecture 9 (11)
Communication
Local communication:Involves a few (often
two) processes.
Global communication:Involves all processes.
Global communication is more expensive since
all processes have to wait for the slowest one.
Communication and computation should be
overlapped if possible.
Processes can perform relevant work while
waiting for messages to arrive.
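As a rough sketch (not from the slides) of such overlap, using the non-blocking routines MPI_Isend/MPI_Irecv from the MPI library introduced on the next slide; the surrounding variables and function name are assumptions:

    #include <mpi.h>
    #include <vector>

    // Post non-blocking receives and sends for the boundary data, do the
    // interior work that needs no remote data, then wait and finish the rest.
    // left/right are the neighbour ranks (MPI_PROC_NULL at the ends).
    void overlapped_step(std::vector<double>& u, int left, int right) {
        double ghost_left = 0.0, ghost_right = 0.0;
        MPI_Request reqs[4];

        MPI_Irecv(&ghost_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&ghost_right, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&u.front(),   1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&u.back(),    1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

        // ... update the interior points here: they do not need the ghosts ...

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        // ... now update the two end points using ghost_left and ghost_right ...
    }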
There are C++ packages providing (to the programmer) uniform access to data on distributed architectures. These are not recommended.
2D1263:Scientic Computing Lecture 9 (12)
MPI
 The Message Passing Interface (MPI) is a
de facto standard for distributed parallel
computing.
 MPI provides portable routines for
inter-process communication.
 Available for C++,C and Fortran.
 Only interfaces dened.Actual
implementation by computer vendors.
 Free implementation mpich.
 Can be slow.
 Other libraries:PVM,Vertex,p4,...
The MPI standard:
http://www-unix.mcs.anl.gov/mpi/
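A minimal sketch of an MPI program (not from the slides): every process reports its rank, and process 0 sends one value to process 1 with the blocking point-to-point calls MPI_Send and MPI_Recv.

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                 // start the MPI environment

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's number
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
        std::cout << "Process " << rank << " of " << size << std::endl;

        if (size >= 2) {
            double x = 3.14;
            if (rank == 0) {
                MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                std::cout << "Process 1 received " << x << std::endl;
            }
        }

        MPI_Finalize();                         // shut down MPI
        return 0;
    }

Compiled with mpiCC and started with mpirun as on slide 14, every process runs this same program and branches on its rank.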
2D1263:Scientic Computing Lecture 9 (13)
MPI at Nada
 PDC operates a number of parallel
machines,including
{ the 112 processor (11 nodes) IBM SP
nighthawk (formerly strindberg)
{ the 180 processor (90 node) HP
Itanium2 lucidor
{ the 884 processor (442 node) Dell Xeon
lenngren
{ the 354 processor PC cluster SBC
 Parallel programs can be run over the
networks of Sun workstations.
See http://www.pdc.kth.se/for more
information.
For your second and third computer
laborations,accounts on lenngren (nighthawk
and lucidor) will be arranged.
2D1263:Scientic Computing Lecture 9 (14)
MPI on Sun stations
1.Set up paths to parallel compilers and
libraries:
prompt> module add sunonestudio/7
prompt> module add mpich/1.2.6
2.mpiCC file.cc { compile & link C++
program
3.Use mpirun to execute in parallel
mpirun -np 4 -machinefile c.txt a.out
runs the program a.out on the rst four
computers listed in the plain text le c.txt.
(See mpirun -h for more help.)
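The machinefile is simply a list of host names, one per line, and mpirun starts the processes on those hosts in order. A sketch of what c.txt might contain (the host names here are made up):

    host1.nada.kth.se
    host2.nada.kth.se
    host3.nada.kth.se
    host4.nada.kth.se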
It is convenient to debug programs by running several processes on your local machine. The performance is poor due to slow communication.