Introduction to Parallel Computing

J.-F. Remacle
Université catholique de Louvain
C++, BLAS, OpenMP, MPI, LAPACK...
Parallel Computing
Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel").
Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling.
The eect of processor frequency on computer speed can be seen as
Runtime =
Instructions
Program

Cycles
Instruction

Seconds
Cycles
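As a worked illustration (the numbers are ours, chosen only to make the arithmetic round): a program of \(10^9\) instructions at an average of 2 cycles per instruction on a 2 GHz processor (0.5 ns per cycle) runs in
\[
10^9 \times 2 \times 0.5\,\text{ns} = 1\,\text{s},
\]
so for a fixed instruction count and cycles-per-instruction, doubling the frequency halves the runtime.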
Frequency Scaling
However, power consumption in a chip is given by the equation
\[
P \simeq C\,V^2 F,
\]
where P is power, C is the switched capacitance, V is voltage, and F is the processor frequency (cycles per second). Increases in frequency thus increase the amount of power used in a processor (and higher frequencies typically also require higher voltages, so in practice power grows faster than linearly with F).
Increasing processor power consumption led ultimately to Intel's May 2004 cancellation of its "Tejas and Jayhawk" processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm.
See the article in the NY Times: http://www.nytimes.com/2004/05/08/business/08chip.html?ex=1399348800&en=98cc44ca97b1a562&ei=5007
Parallel computing
Parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors.
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task.
Amdahl's law
Overall speedup of a process:
\[
\frac{1}{(1-P) + \dfrac{P}{S}}
\]
where P is the portion of the process that can be improved and S is the speedup factor that can be expected for this portion. In the case of parallelization, the maximum speedup that can be achieved by using N processors is
\[
\frac{1}{(1-P) + \dfrac{P}{N}}.
\]
Asymptotically (as N goes to infinity), we have
\[
\frac{1}{1-P}.
\]
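A worked example (the numbers are ours): if P = 0.95 of the work parallelizes perfectly, then with N = 8 processors the speedup is
\[
\frac{1}{0.05 + 0.95/8} \approx 5.9,
\]
and no matter how many processors are used, the speedup can never exceed \(1/0.05 = 20\).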
Dependencies
Understanding data dependencies is fundamental in implementing parallel algorithms.
No program can run more quickly than the longest chain of dependent calculations (known as the critical path), since calculations that depend upon prior calculations in the chain must be executed in order.
However, most algorithms do not consist of just a long chain of dependent calculations; there are usually opportunities to execute independent calculations in parallel.
Dependencies
Consider the following algorithm (matrix-vector product):
for (int i=0;i<n;i++){
b[i] = 0.0;
for (int j=0;j<n;j++){
b[i] += a[i][j] * x[j];
}
}
Matrix-vector operation: n independent dot products.
Imagine that N processors are in charge of this task.
Each processor could do n/N of those dot products.
This algorithm could be parallelized optimally: P = 1, S = N.
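A minimal sketch of this decomposition using OpenMP (a hedged illustration; the directive used here is introduced later in these notes, and a, x, b and n are assumed to be declared as above):
// Each of the n dot products is independent, so the outer loop
// can be split over the available threads.
#pragma omp parallel for
for (int i=0;i<n;i++){
  b[i] = 0.0;
  for (int j=0;j<n;j++){
    b[i] += a[i][j] * x[j];
  }
}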
Dependencies
Imagine, e.g., that N > n.
for (int j=0;j<n;j++){
b[i] += a[i][j] * x[j];
}
Each of those dot products consists of n independent operations that can also be parallelized.
This could be considered "small-grain" parallelism.
Dependencies
Consider the following algorithm (compute n!):
long int x = 1;
for (int i=1;i<=n;i++){
  x *= i;
}
The situation is different: every iteration updates the same variable x.
Yet, processor p (out of N) could do a part of the job:
for (int i=n*p/N+1;i<=n*(p+1)/N;i++){
  x *= i;
}
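A hedged sequential sketch of that decomposition (the values n = 20 and N = 4 are ours, chosen so the result fits in 64 bits): each block produces a partial product, and the partial products must then be multiplied into the shared result x, which is exactly the shared-update problem discussed on the next slides.
#include <cstdio>
int main(){
  const int n = 20, N = 4;        // assumed problem size and number of "processors"
  long long partial[N];
  for (int p = 0; p < N; p++){    // each processor p computes its own block
    partial[p] = 1;
    for (int i = n*p/N+1; i <= n*(p+1)/N; i++)
      partial[p] *= i;
  }
  long long x = 1;
  for (int p = 0; p < N; p++)     // combining the partials updates a shared variable
    x *= partial[p];
  printf("%lld\n", x);            // prints 2432902008176640000, i.e. 20!
  return 0;
}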
Race conditions
Subtasks in a parallel program are called threads, processes or fibers.
Threads will often need to update some variable that is shared between them (x in the n! algorithm).
Let us assume that two threads T1 and T2 each want to increment the value of a global integer by one. Ideally, the following sequence of operations would take place:
1. Integer i = 0; (memory)
2. T1 reads the value of i from memory into register1: 0
3. T1 increments the value of i in register1: (register1 contents) + 1 = 1
4. T1 stores the value of register1 in memory: 1
5. T2 reads the value of i from memory into register2: 1
6. T2 increments the value of i in register2: (register2 contents) + 1 = 2
7. T2 stores the value of register2 in memory: 2
8. Integer i = 2; (memory)
Race conditions
Another possible sequence:
1. Integer i = 0; (memory)
2. T1 reads the value of i from memory into register1: 0
3. T2 reads the value of i from memory into register2: 0
4. T1 increments the value of i in register1: (register1 contents) + 1 = 1
5. T2 increments the value of i in register2: (register2 contents) + 1 = 1
6. T1 stores the value of register1 in memory: 1
7. T2 stores the value of register2 in memory: 1
8. Integer i = 1; (memory)
The programmer must use a lock to provide mutual exclusion.
Locking multiple variables using non-atomic locks introduces the possibility of program deadlock.
Frequent use of locks also slows the computation down significantly and may even lead to parallel slowdown!
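A minimal OpenMP sketch of the two-thread increment above (a hedged illustration; OpenMP itself is introduced later in these notes). Without the atomic directive the program may print 1, as in the second sequence; with it, the read-modify-write is indivisible and the result is always 2:
#include <cstdio>
int main(){
  int i = 0;
  #pragma omp parallel num_threads(2)   // two threads, playing the roles of T1 and T2
  {
    #pragma omp atomic                  // makes the read-modify-write indivisible
    i += 1;
  }
  printf("i = %d\n", i);                // always prints 2 with the atomic directive
  return 0;
}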
Race conditions
Mutual Exclusion:
1. T1 lock i
2. T1 read i
3. T1 add 1 to i
4. T1 write i
5. T1 unlock i
1. T2 lock i
2. T2 read i
3. T2 add 1 to i
4. T2 write i
5. T2 unlock i
Either thread T1 or thread T2 executes the critical section first (whichever thread acquires the lock first).
Many parallel programs require that their subtasks act in synchrony. This requires the use of a barrier. Barriers are typically implemented using a software lock.
We will show examples of software locks using OpenMP and MPI.
Classes of parallel computers
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism.

Multi-core processor. A multicore processor is a processor that includes multiple execution units ("cores"). The general trend in processor development has been from multi-core to many-core.

Distributed computing. A distributed computer (also known as a distributed memory computer) is a computer system in which the processing elements are connected by a network.

Symmetric multiprocessing. A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus.

Specialized parallel computers. Within parallel computing, there are specialized parallel devices that remain niche areas of interest. While not domain-specific, they tend to be applicable to only a few classes of parallel problems.
Distributed Computing
Cluster computing. A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer.
Clusters are composed of multiple standalone machines connected by a network.
While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not.
The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network.
The vast majority of the TOP500 supercomputers are clusters.
Distributed Computing
Massively parallel computing. A massively parallel processor (MPP) is a single computer with many networked processors.
MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect."
(Figure: a cabinet from Blue Gene/L, ranked as the fourth fastest supercomputer in the world in the 11/2008 TOP500 rankings. Blue Gene/L is a massively parallel processor.)
General-purpose computing on graphics processing units (GPGPU)
General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research.
GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data-parallel operations, particularly linear algebra matrix operations.
In the early days, GPGPU programs used the normal graphics APIs for executing programs. However, recently several new programming languages and platforms have been built to do general purpose computation on GPUs, with both Nvidia and AMD releasing programming environments with CUDA and CTM respectively. Other GPU programming languages are BrookGPU, PeakStream, and RapidMind. Nvidia has also released specific products for computation in their Tesla series.
Parallel programming languages
Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers.
These can generally be divided into classes based on the assumptions they make about the underlying memory architecture: shared memory, distributed memory, or shared distributed memory.
Shared memory programming languages communicate by manipulating shared memory variables.
Distributed memory uses message passing.
POSIX Threads and OpenMP are two of the most widely used shared memory APIs.
Message Passing Interface (MPI) is the most widely used message-passing system API.
OpenMP
OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms.
It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
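For example (a hedged illustration; the executable name my_program is hypothetical), the standard OMP_NUM_THREADS environment variable sets the default number of threads used by parallel regions:
OMP_NUM_THREADS=4 ./my_program
The library routine omp_get_max_threads() reports this value at run time, and the compiler directive #pragma omp parallel (next slides) opens a parallel region.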
OpenMP
The parallel pragma precedes a block of code that should be executed by all of the threads. The fork is done there.
Syntax:
#pragma omp parallel <clause> <statement block>
OpenMP
gcc 4.2 and higher support OpenMP directives.
File openmp1.cc:
g++ -fopenmp -o openmp1 openmp1.cc
#ifdef _OPENMP
#include <omp.h>
#endif
#include <cstdio>
#include <cstdlib>   // for atoi
int main (int argc, char *argv[]){
  int nProc = atoi(argv[1]);
  omp_set_num_threads (nProc);
  #pragma omp parallel
  {
    printf("coucou I am Thread %d\n", omp_get_thread_num());
  }
  return 0;
}
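A possible run with four threads (a hedged illustration: the interleaving of the output lines is nondeterministic and differs from run to run):
./openmp1 4
coucou I am Thread 2
coucou I am Thread 0
coucou I am Thread 3
coucou I am Thread 1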
OpenMP, race condition
Compute in parallel:
\[
\sum_{i=1}^{N} i = \frac{N(N+1)}{2}.
\]
long sum_silly (int N){
  long sum = 0;
  #pragma omp parallel for schedule(static,1)
  for (int i = 1; i <= N; i++) {
    sum += i;          // data race: all threads update the shared variable sum
  }
  return sum;
}
Note: The static schedule is characterized by the properties that each thread gets approximately the same number of iterations as any other thread, and each thread can independently determine the iterations assigned to it. Thus no synchronization is required to distribute the work, and, under the assumption that each iteration requires the same amount of work, all threads should finish at about the same time.
OpenMP, critical section
Compute in parallel:
\[
\sum_{i=1}^{N} i = \frac{N(N+1)}{2}.
\]
long sum_less_silly (int N){
long sum = 0;
#pragma omp parallel for schedule(static)
for (int i = 1;i <= N;i++) {
#pragma omp critical
sum += i;
}
return sum;
}
Note: Putting a lock around the only computation performed in the algorithm provides no parallel acceleration; it is more likely to slow the computation down.
OpenMP, better use of critical sections
long sum_not_silly_hand_made (int N){
  long sum = 0;
  #pragma omp parallel
  {
    long local_sum = 0;  // declared inside the parallel region: private to each thread
    int start = N* omp_get_thread_num()   /omp_get_num_threads() + 1;
    int end   = N*(omp_get_thread_num()+1)/omp_get_num_threads() + 1;
    for (int i = start; i < end; i++) {
      local_sum += i;
    }
    #pragma omp critical
    sum += local_sum;
  }
  return sum;
}
Note: variables declared inside the parallel section are private to the thread, while variables declared outside are shared.
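As a complement to this note (a hedged sketch, not from the original slides): the same thread-private behaviour can also be requested explicitly with data-sharing clauses such as private and firstprivate; a variable listed in private starts uninitialized inside the region, while firstprivate initializes each private copy from the value outside:
int x = 10;
#pragma omp parallel firstprivate(x)
{
  // each thread works on its own copy of x, initialized to 10
  x += omp_get_thread_num();
}
// here x is still 10: the private copies are discarded when the region ends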
OpenMP, reduction
long sum_not_silly (int N){
long sum = 0;
#pragma omp parallel for reduction(+:sum) schedule(static)
for (int i = 1;i <= N;i++) {
sum += i;
}
return sum;
}
Note: Available operators for the reduction clause: +, -, *, &, |, ^, && and ||.
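As a hedged illustration (the function name is ours), the n! example from earlier in these notes can be written with a multiplicative reduction:
long long factorial (int n){
  long long x = 1;
  #pragma omp parallel for reduction(*:x) schedule(static)
  for (int i = 1; i <= n; i++) {
    x *= i;
  }
  return x;  // note: n! overflows a 64-bit integer for n > 20
}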
OpenMP example
Parallel Gaussian elimination with OpenMP
The principle (that you know) of Gaussian elimination: transform
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\]
into
\[
U = \begin{bmatrix}
u_{11} & u_{12} & \cdots & u_{1n} \\
0      & u_{22} & \cdots & u_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0      & 0      & \cdots & u_{nn}
\end{bmatrix}
\]
by replacing lines (or columns) of the matrix by linear combinations of other lines (or columns).
OpenMP example
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
= \begin{bmatrix}
l_1^t \\ l_2^t \\ \vdots \\ l_n^t
\end{bmatrix}
\]
Reduction of column 1:
\[
l_2 := l_2 - \frac{a_{21}}{a_{11}}\, l_1, \qquad
l_3 := l_3 - \frac{a_{31}}{a_{11}}\, l_1, \qquad \ldots, \qquad
l_k := l_k - \frac{a_{k1}}{a_{11}}\, l_1
\]
Define
\[
l_{k1} = \frac{a_{k1}}{a_{11}}.
\]
OpenMP example
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} & \cdots & a_{1n} \\
0      & a_{22} & \cdots & a_{2k} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots &        & \vdots \\
0      & 0      & \cdots & a_{kk} & \cdots & a_{kn} \\
\vdots & \vdots &        & \vdots & \ddots & \vdots \\
0      & 0      & \cdots & a_{nk} & \cdots & a_{nn}
\end{bmatrix}
\]
Reduction of column k: for i = k+1, ..., n,
\[
l_{ik} = \frac{a_{ik}}{a_{kk}}, \qquad r_i := r_i - l_{ik}\, r_k
\]
The BLAS 1 routine "daxpy" computes \(y := \alpha x + y\).
Do the same for the RHS:
\[
b_i := b_i - l_{ik}\, b_k.
\]
OpenMP example
#include"myVector.h"
#include"myMatrix.h"
//...
myMatrix a(n,n);
myVector b(n);
//fill the matrix and the RHS here
//...
//Gauss Reduction
for (int k=1;k<n;k++){
for (int i=k+1;i<n;i++){
double lik = a(i,k)/a(k,k);
for (int j=k+1;i<n;i++){
a(i,j) -= lik * a(k,j);
}
b(i) -= lik b(k);
}
}
28
OpenMP example
#include <omp.h>
int iam, nt;
for (int k=1;k<n;k++){
  #pragma omp parallel private (iam,nt)
  {
    iam = omp_get_thread_num();
    nt  = omp_get_num_threads();
    int nrows_t = (n-k-1 + nt-1)/nt;      // rows k+1..n-1, split into blocks of this size
    int istart  = k+1 + iam     * nrows_t;
    int iend    = k+1 + (iam+1) * nrows_t;
    if (iend > n) iend = n;               // the last thread may get a smaller block
    for (int i=istart;i<iend;i++){
      double lik = a(i,k)/a(k,k);
      for (int j=k+1;j<n;j++){
        a(i,j) -= lik * a(k,j);
      }
      b(i) -= lik * b(k);
    }
  } // end of the scope of the omp pragma
}
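A simpler alternative (a hedged sketch, not from the original slides): the manual partitioning above can be left to the OpenMP runtime with a work-sharing directive, since for a fixed k the updates of the rows i = k+1, ..., n-1 are independent of each other:
for (int k=1;k<n;k++){
  #pragma omp parallel for              // the runtime distributes rows k+1..n-1
  for (int i=k+1;i<n;i++){
    double lik = a(i,k)/a(k,k);
    for (int j=k+1;j<n;j++){
      a(i,j) -= lik * a(k,j);
    }
    b(i) -= lik * b(k);
  }
}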
Parallel computing
• Schematic of a generic parallel computer (figure).
• OpenMP can be used in the SMPs.
• Communications between the SMP "boxes" have to be done in a different way.
The Message Passing Interface (MPI)
Almost everything in MPI can be summed up in the single idea of "Message Sent | Message Received".
The rules of parallel computing:
• Partition the work between processors;
• Minimize the ratio Communication/Computation;
• Alternate computations and communications.
The Message Passing Interface (MPI)
MPI is a language-independent communications protocol used to program parallel computers.
MPI has various implementations for different computer architectures.
MPICH is a freely available, portable implementation of MPI, a standard for message passing for distributed-memory applications used in parallel computing. MPICH is Free Software and is available for most flavours of Unix (including Linux and Mac OS X) and Microsoft Windows.
MPI-2 provides additional features such as dynamic process management and parallel I/O.
The Open MPI Project is an open source MPI-2 implementation.
The Message Passing Interface (MPI)
There are three basic concepts in MPI:
• Communicators are groups of processes in the MPI session; each process has a rank within its communicator, and each communicator provides its own virtual communication fabric for point-to-point operations.
• Point-to-point operations: a much used example is the MPI_Send interface, which allows one specified process to send a message to a second specified process.
• Collective operations: operations that involve communication between all processes in a process group. As an example, the MPI_Allreduce function.
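A minimal hedged sketch combining these three concepts (all identifiers are from the standard MPI C API; error handling omitted):
#include <mpi.h>
#include <cstdio>
int main (int argc, char *argv[]){
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank of this process in the communicator
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes in the communicator
  // point-to-point: process 0 sends one integer to process 1
  if (rank == 0 && size > 1) {
    int msg = 42;
    MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    int msg;
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  // collective: sum the ranks of all processes
  int sum;
  MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("process %d of %d, sum of ranks = %d\n", rank, size, sum);
  MPI_Finalize();
  return 0;
}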
The Message Passing Interface (MPI)
• Example 1: mpi1.cc, initialize and finalize an MPI program.
• Example 2: mpi2.cc, get the processor rank and the total number of processes.
• Example 3: mpi3.cc, use a point-to-point Send | Receive round of communication.
• Example 4: mpi5.cc, the master process sends a message to all.
• Example 5: mpi6.cc, use the MPI_Allreduce function.
• Example 6: mpi7.cc, use the MPI_Bcast function for doing the parallel LU (live?).