Dirk Roose Albert-Jan Yzelman
Office: 200A 03.25
Table of contents of the book
Book: Parallel Scientific Computation
The first text to explain how to use BSP in parallel computing
Clear exposition of distributed-memory parallel computing with
applications to core topics of scientific computation
Each topic treated follows the complete path from theory to practice
This is the first text explaining how to use the bulk synchronous
parallel (BSP) model and the freely available BSPlib communication
library in parallel algorithm design and parallel programming. The
book is aimed at graduate students and researchers in mathematics,
physics and computer science. Its main topics are core topics in the
area of scientific computation; many additional topics are treated in
numerous exercises.
Parallel Scientific Computation (2)
An appendix on the message-passing interface (MPI) discusses how
to program using the MPI communication library. MPI equivalents of all
the programs are also presented.
The main topics treated in the book are core in the area of scientific
computation: solving dense linear systems by Gaussian elimination,
computing fast Fourier transforms, and solving sparse linear systems
by iterative methods. Each topic is treated in depth, starting from the
problem formulation and a sequential algorithm, through a parallel
algorithm and its analysis, to a complete parallel program written in C
and BSPlib, and experimental results obtained using this program on a
parallel computer.
Parallel Scientific Computation (3)
Additional topics treated in the exercises include: data compression,
random number generation, cryptography, eigensystem solving, 3D
and Strassen matrix multiplication, wavelets and image compression,
the fast cosine transform, decimals of pi, and simulated annealing.
The book contains five small but complete example programs written
in BSPlib which illustrate the methods taught. The appendix on MPI
discusses how to program in a structured, bulk synchronous parallel
style using the MPI communication library. It presents MPI equivalents
of all the programs in the book.
The complete programs of the book and their driver programs are
freely available online in the packages BSPedupack and MPIedupack.
Why parallel computing ?
Limits of single computer/processor
Until 2007: growing single-processor performance
(clock frequency: +40%/year; memory access speed: +10%/year)
Since 2007: multicore processors!
Parallel computing allows
to solve problems that don’t fit in the memory of a single computer
to solve problems that can’t be solved in a reasonable time on a
single core (processor)
Introduction & Motivation
adapted version of slides by
Kathy Yelick and Jim Demmel,
EECS & Math Departments, UC Berkeley
A Japanese supercomputer capable of performing more than 8
petaflop/s is the new number one system in the world, putting
Japan back in the top spot for the first time since the Earth
Simulator was dethroned in November 2004 [...]. The system,
called the K Computer, is at the RIKEN Advanced Institute for
Computational Science (AICS) in Kobe.
For the first time, all of the top 10 systems achieved petaflop/s
performance (U.S. : 5, Japan: 2, China: 2, France: 1).
Bumped to second place after capturing No. 1 on the previous
list is the Tianhe-1A supercomputer at the National
Supercomputing Center in Tianjin, China, with a performance
of 2.6 petaflop/s.
Also moving down a notch was Jaguar, a Cray supercomputer
at the U.S. Department of Energy’s (DOE’s) Oak Ridge
National Laboratory, at No. 3 with 1.75 petaflop/s.
TOP500 June 2011
The New Number One
The K Computer, built by Fujitsu, currently combines 68544
SPARC64 VIIIfx CPUs, each with eight cores, for a total of
548,352 cores—almost twice as many as any other system in
the TOP500. The K Computer is also more powerful than the
next five systems on the list combined.
The K Computer’s name draws upon the Japanese word "Kei"
for 10^16 (ten quadrillion), representing the system's
performance goal of 10 petaflops. RIKEN is the Institute for
Physical and Chemical Research.
Unlike the Chinese system it displaced from the No. 1 slot and
other recent very large systems, the K Computer does not use
graphics processors or other accelerators. The K Computer is
also one of the most energy-efficient systems on the list.
TOP500 June 2011 (cont.)
Scientific computing & simulation
other application areas:
databases, data mining, bio-informatics, finance, ...
webservers, financial transactions
cryptography, embedded systems
Example : weather prediction
Navier-Stokes equations (PDE) : discretized on a grid
domain : 3000 km x 3000 km x 10 km
resolution: 1 km x 1 km x 0.1 km
grid of size: 3000 x 3000 x 100 = ± 10^9 grid points
time interval: 48 h. ; time step : 3 min. -> ± 1000 timesteps
cost per grid point : 1000 flop (flop = floating point operations)
total: 10^9 x 1000 x 1000 = 10^15 flop = 1 Pflop
PC or workstation (200 Mflops) : ± 1500 hours
Cluster (e.g. VIC) (200 Gflops) : ± 1.5 hours
Required memory :
solution only: 10^9 grid points x 5 variables x 8 bytes = 4 x 10^10 bytes
= 40 Gbyte ! --> total required memory: ± 400 Gbyte
Global Climate Modeling Problem
Problem is to compute:
f (latitude, longitude, elevation, time) ->
temperature, pressure, humidity, wind velocity
Approach: discretize the domain, e.g., a measurement point every 10 km
Devise an algorithm to predict the weather at time t+dt, given the state at time t
Predict major events,
e.g., El Nino
Use in setting air quality standards
Parallel Computing in Data Analysis
Finding information amidst large quantities of data
General theme: sifting through large, unstructured data sets
e.g., has there been an outbreak of some medical condition in a region?
Data is collected and stored at enormous speeds (Gbyte/hour):
telescopes scanning the skies
microarrays generating gene expression data
"Automatic" Parallelism in Modern Machines
Bit level parallelism
within floating point operations, etc.
Instruction level parallelism (ILP)
multiple instructions execute per clock cycle
Memory system parallelism
overlap of memory operations with computation
Job-level parallelism
multiple jobs run in parallel on clusters & parallel systems
Limits to all of these -- for very high performance, the
user needs to identify, schedule and coordinate the parallel tasks explicitly
Processor-DRAM gap (latency): the gap grows ± 50% / year
(old slide, year 2000)
Locality and Parallelism
Large memories are slow, fast memories are small
cache : fast
main (local) memory : slower
memory of other processors : slow
Algorithm should do most work on local data
!!! Exploit & increase data locality !!!
Also useful on single processor systems
Design of parallel algorithms
Take memory hierarchy into account (data locality)
Distribute data over memories
Distribute work over processors
Introduce & analyse
communication & synchronization
Chapter 1 Introduction
This chapter is a self-contained tutorial which tells you how to get
started with parallel programming and how to design and implement
algorithms in a structured way. The chapter introduces a simple target
architecture for designing parallel algorithms, the bulk synchronous
parallel computer. Using the computation of the inner product of two
vectors as an example, the chapter shows how an algorithm is
designed hand in hand with its cost analysis. The algorithm is
implemented in a short program that demonstrates the most important
primitives of BSPlib, the main communication library used in this book.
If you understand this program well, you can start writing your own
parallel programs. Another program included in this chapter is a
benchmarking program that allows you to measure the BSP
parameters of your parallel computer. Substituting these parameters
into a theoretical cost formula for an algorithm gives a prediction of the
actual running time of an implementation.
Parallel programming platforms
See slides Chapter 2 'Parallel programming platforms':
files chapter2.pdf (short version) & chap2_slides.ppt (long version)
COMMENT ON SLIDE 25 of chapter2.pdf :
Communication cost in parallel systems
bandwidth = 1 Gbit/second = 125 Mbyte/second
1 word = 8 bytes
-> ± 15 Mword/sec -> per-word transfer time
t_w = 1/(15 x 10^6) sec = 0.06 microsec
latency (startup cost): t_s = 25 microsec
calculation speed = 1 Gflops (= Gflop/sec)
-> 1 flop takes 0.001 microsec
-> 416 words could be sent during t_s
-> 25000 flop could be executed during t_s
Conclusion: avoid sending many short messages;
if necessary, reorganize the algorithm to group messages