Parallel Computing

Dirk Roose, Albert-Jan Yzelman
Office: 200A 03.25
Table of contents of the book

Book: Parallel Scientific Computation
http://ukcatalogue.oup.com/product/9780198529392.do#.UGSxJULv-2U

The first text to explain how to use BSP in parallel computing

Clear exposition of distributed-memory parallel computing with
applications to core topics of scientific computation

Each topic treated follows the complete path from theory to practice

This is the first text explaining how to use the bulk synchronous
parallel (BSP) model and the freely available BSPlib communication
library in parallel algorithm design and parallel programming. Aimed at
graduate students and researchers in mathematics, physics and
computer science, the book treats core topics in the area of scientific
computation; many additional topics are treated in numerous exercises.
Parallel Scientific Computation (2)

An appendix on the message-passing interface (MPI) discusses how
to program using the MPI communication library. MPI equivalents of all
the programs are also presented.

The main topics treated in the book are core topics in the area of scientific
computation: solving dense linear systems by Gaussian elimination,
computing fast Fourier transforms, and solving sparse linear systems
by iterative methods. Each topic is treated in depth, starting from the
problem formulation and a sequential algorithm, through a parallel
algorithm and its analysis, to a complete parallel program written in C
and BSPlib, and experimental results obtained using this program on a
parallel computer.
Parallel Scientific Computation (3)

Additional topics treated in the exercises include: data compression,
random number generation, cryptography, eigensystem solving, 3D
and Strassen matrix multiplication, wavelets and image compression,
fast cosine transform, decimals of pi, simulated annealing, and
molecular dynamics.

The book contains five small but complete example programs written
in BSPlib which illustrate the methods taught. The appendix on MPI
discusses how to program in a structured, bulk synchronous parallel
style using the MPI communication library. It presents MPI equivalents
of all the programs in the book.

The complete programs of the book and their driver programs are
freely available online in the packages BSPedupack and MPIedupack.
Why parallel computing?

Limits of single computer/processor

available memory

performance

Until 2007: growing mismatch between clock cycle and memory access time
(clock cycle: +40%/year; memory access: +10%/year)

Since 2007: multicore processors!

Parallel computing allows

to solve problems that don't fit in the memory of a single computer (processor)

to solve problems that can't be solved in a reasonable time on a single core (processor)
Parallel Computing
Introduction & Motivation
adapted version of slides from Kathy Yelick and Jim Demmel,
EECS & Math Departments, UC Berkeley
www.cs.berkeley.edu/~demmel/cs267_spr11
TOP500 June 2011

A Japanese supercomputer capable of performing more than 8
petaflop/s is the new number one system in the world, putting
Japan back in the top spot for the first time since the Earth
Simulator was dethroned in November 2004 [...]. The system,
called the K Computer, is at the RIKEN Advanced Institute for
Computational Science (AICS) in Kobe.

For the first time, all of the top 10 systems achieved petaflop/s
performance (U.S.: 5, Japan: 2, China: 2, France: 1).

Bumped to second place after capturing No. 1 on the previous
list is the Tianhe-1A supercomputer at the National
Supercomputing Center in Tianjin, China, with a performance
of 2.6 petaflop/s.

Also moving down a notch was Jaguar, a Cray supercomputer
at the U.S. Department of Energy's (DOE's) Oak Ridge
National Laboratory, at No. 3 with 1.75 petaflop/s.
TOP500 June 2011 (cont.)
The New Number One

The K Computer, built by Fujitsu, currently combines 68,544
SPARC64 VIIIfx CPUs, each with eight cores, for a total of
548,352 cores -- almost twice as many as any other system in
the TOP500. The K Computer is also more powerful than the
next five systems on the list combined.

The K Computer's name draws upon the Japanese word "Kei"
for 10^16 (ten quadrillion), representing the system's
performance goal of 10 petaflops. RIKEN is the Institute for
Physical and Chemical Research.

Unlike the Chinese system it displaced from the No. 1 slot and
other recent very large systems, the K Computer does not use
graphics processors or other accelerators. The K Computer is
also one of the most energy-efficient systems on the list.
Application areas

Scientific computing & simulation
(engineering, science)

Data analysis
(databases, data mining, bio-informatics, finance, ...)

Commercial applications
(webservers, financial transactions)

Informatics
(cryptography, embedded systems)
Example: weather prediction

Navier-Stokes equations (PDE), discretized on a grid

Assume
domain: 3000 km x 3000 km x 10 km
resolution: 1 km x 1 km x 0.1 km
-> grid of size 3000 x 3000 x 100 ≈ 10^9 grid points
time interval: 48 h; time step: 3 min
-> ≈ 1000 time steps
cost per grid point: 1000 flop (flop = floating point operations)

Total cost:
10^9 x 1000 x 1000 = 10^15 flop = 1 Pflop

PC or workstation (200 Mflop/s): 1500 hours
Cluster (e.g. VIC) (200 Gflop/s): 1.5 hours

Required memory:
solution only: 10^9 grid points x 5 variables x 8 bytes = 4 x 10^10 bytes = 40 Gbyte!
-> total required memory: ≈ 400 Gbyte
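The arithmetic on this slide is easy to reproduce; the following small C program is only a back-of-the-envelope sketch (it uses the rounded values above -- 10^9 grid points, 1000 time steps, 1000 flop per grid point, and the 200 Mflop/s and 200 Gflop/s example machine speeds -- not measured data):

/* Back-of-the-envelope cost estimate for the weather prediction
 * example; the constants are the rounded values from the slide. */
#include <stdio.h>

int main(void) {
    double grid_points = 1e9;     /* 3000 x 3000 x 100, rounded         */
    double time_steps  = 1e3;     /* 48 h at a 3 min time step, rounded */
    double flop_per_pt = 1e3;     /* work per grid point per time step  */

    double total_flop = grid_points * time_steps * flop_per_pt;   /* 1e15 */

    double pc      = 200e6;       /* 200 Mflop/s workstation            */
    double cluster = 200e9;       /* 200 Gflop/s cluster (e.g. VIC)     */

    printf("total work          : %.1e flop\n", total_flop);
    printf("PC, 200 Mflop/s     : %.0f hours\n", total_flop / pc / 3600.0);
    printf("cluster, 200 Gflop/s: %.1f hours\n", total_flop / cluster / 3600.0);

    /* memory for the solution: 5 variables of 8 bytes per grid point */
    double bytes = grid_points * 5.0 * 8.0;
    printf("solution memory     : %.0f Gbyte\n", bytes / 1e9);
    return 0;
}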
Global Climate Modeling Problem

Problem is to compute:
f(latitude, longitude, elevation, time) -> temperature, pressure, humidity, wind velocity

Approach:
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict weather at time t + Δt given t

Uses:
- Predict major events, e.g., El Niño
- Use in setting air emissions standards

Source: http://www.epm.ornl.gov/chammp/chammp.html
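To make "discretize the domain and advance from t to t + Δt" concrete, here is a generic explicit stencil update in C. It is a toy diffusion-like sweep on a small 2D grid, not the actual climate or Navier-Stokes solver; it only illustrates the loop structure that such time-stepping simulations share:

/* Generic "discretize + time step" illustration: a simple explicit
 * 2D diffusion-like stencil update (illustrative toy code only). */
#include <stdio.h>

#define NX 100          /* grid points in x        */
#define NY 100          /* grid points in y        */
#define NSTEPS 1000     /* number of time steps    */

static double u[NX][NY], unew[NX][NY];

int main(void) {
    double nu = 0.1;                          /* illustrative coefficient */

    u[NX / 2][NY / 2] = 1.0;                  /* some initial condition   */

    for (int t = 0; t < NSTEPS; t++) {        /* advance from t to t + dt */
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                unew[i][j] = u[i][j] + nu * (u[i - 1][j] + u[i + 1][j]
                           + u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j]);
        for (int i = 1; i < NX - 1; i++)      /* copy back for next step  */
            for (int j = 1; j < NY - 1; j++)
                u[i][j] = unew[i][j];
    }
    printf("u at centre after %d steps: %g\n", NSTEPS, u[NX / 2][NY / 2]);
    return 0;
}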
Parallel Computing in Data Analysis

Finding information amidst large quantities of data

General themes of sifting through large, unstructured data sets:
- Has there been an outbreak of some medical condition in a community?
- bio-informatics
- ...

Data collected and stored at enormous speeds (Gbyte/hour)
- telescope scanning the skies
- microarrays generating gene expression data
- ...
"Automatic" Parallelism in Modern Machines

Bit level parallelism
- within floating point operations, etc.

Instruction level parallelism (ILP)
- multiple instructions execute per clock cycle

Memory system parallelism
- overlap of memory operations with computation

OS parallelism
- multiple jobs run in parallel on clusters & parallel systems

Limits to all of these -- for very high performance,
user needs to identify, schedule and coordinate parallel tasks
Processor-DRAM Gap (latency)

[Figure, old slide (year 2000): processor vs. DRAM performance, 1980-2000.
µProc performance grows ~60%/yr, DRAM only ~7%/yr ("Moore's Law"), so the
processor-memory performance gap grows ~50%/year.]
Locality and Parallelism

Large memories are slow, fast memories are small

[Figure: conventional storage hierarchy, replicated per processor
(Proc - Cache - L2 Cache - L3 Cache - Memory), with potential
interconnects between the processors' memories.]
Memory hierarchy

Cache: fast
Main (local) memory: slower
Memory of other procs: slow

Algorithm should do most work on local data
!!! Exploit & increase data locality !!!
Also useful on single processor systems
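A small, generic illustration of the locality point above (plain C, not taken from the book): a matrix is stored row-major in C, so a row-by-row sweep touches memory with unit stride and reuses cache lines, while a column-by-column sweep of the same data strides through memory and typically runs much slower:

/* Locality illustration: same arithmetic, different memory access
 * order. The i-then-j loop walks memory contiguously (cache-friendly);
 * the j-then-i loop strides by N and causes many more cache misses. */
#include <stdio.h>
#include <stdlib.h>

#define N 2048

int main(void) {
    double *a = calloc((size_t)N * N, sizeof(double));
    if (a == NULL) return 1;

    double s1 = 0.0, s2 = 0.0;

    for (int i = 0; i < N; i++)          /* cache-friendly: unit stride */
        for (int j = 0; j < N; j++)
            s1 += a[i * N + j];

    for (int j = 0; j < N; j++)          /* cache-unfriendly: stride N  */
        for (int i = 0; i < N; i++)
            s2 += a[i * N + j];

    printf("sums: %f %f\n", s1, s2);
    free(a);
    return 0;
}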
Design of parallel algorithms

Take memory hierarchy into account (data locality)

Distribute data over memories (see the distribution sketch below)

Distribute work over processors

Introduce & analyse communication & synchronization
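As a sketch of what "distribute data over memories" can mean, the fragment below shows the two standard distributions of a vector of length n over p processors, block and cyclic. It is a generic illustration; the function names are chosen here and do not come from the book:

/* Two standard data distributions of a vector of length n over p
 * processors (generic illustration):
 *   cyclic: element i goes to processor i mod p
 *   block : processor s owns a contiguous chunk of about n/p elements */
#include <stdio.h>

int owner_cyclic(long i, int p) {
    return (int)(i % p);
}

int owner_block(long i, long n, int p) {
    long b = (n + p - 1) / p;          /* block size, rounded up */
    return (int)(i / b);
}

int main(void) {
    long n = 10;
    int p = 4;
    for (long i = 0; i < n; i++)
        printf("i=%ld  cyclic->P%d  block->P%d\n",
               i, owner_cyclic(i, p), owner_block(i, n, p));
    return 0;
}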
Chapter 1 Introduction

This chapter is a self-contained tutorial which tells you how to get
started with parallel programming and how to design and implement
algorithms in a structured way. The chapter introduces a simple target
architecture for designing parallel algorithms, the bulk synchronous
parallel computer. Using the computation of the inner product of two
vectors as an example, the chapter shows how an algorithm is
designed hand in hand with its cost analysis. The algorithm is
implemented in a short program that demonstrates the most important
primitives of BSPlib, the main communication library used in this book.
If you understand this program well, you can start writing your own
parallel programs. Another program included in this chapter is a
benchmarking program that allows you to measure the BSP
parameters of your parallel computer. Substituting these parameters
into a theoretical cost formula for an algorithm gives a prediction of the
actual running time of an implementation.
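For a taste of what such a program looks like, here is a minimal BSP-style parallel inner product in C, in the spirit of the chapter's example. Treat it as a simplified sketch (vector length, data values and variable names are chosen here for illustration), not as the book's bspinprod program; it only uses the core BSPlib primitives bsp_begin/bsp_end, bsp_push_reg, bsp_put and bsp_sync:

/* Simplified BSP-style inner product sketch using BSPlib. */
#include <stdio.h>
#include <stdlib.h>
#include "bsp.h"

#define N 1024                           /* global vector length */

void bspinprod_sketch(void) {
    bsp_begin(bsp_nprocs());
    int p = bsp_nprocs();
    int s = bsp_pid();

    /* cyclic distribution: processor s owns elements s, s+p, s+2p, ... */
    long nloc = (N - s + p - 1) / p;
    double *x = malloc(nloc * sizeof(double));
    double *y = malloc(nloc * sizeof(double));
    for (long i = 0; i < nloc; i++) {    /* some example data */
        x[i] = 1.0;
        y[i] = (double)(s + i * p);      /* the global index  */
    }

    /* superstep 1: local partial inner product */
    double alpha = 0.0;
    for (long i = 0; i < nloc; i++)
        alpha += x[i] * y[i];

    /* superstep 2: every processor collects all partial results */
    double *partial = malloc(p * sizeof(double));
    bsp_push_reg(partial, p * sizeof(double));
    bsp_sync();
    for (int t = 0; t < p; t++)
        bsp_put(t, &alpha, partial, s * sizeof(double), sizeof(double));
    bsp_sync();

    double inprod = 0.0;
    for (int t = 0; t < p; t++)
        inprod += partial[t];
    if (s == 0)
        printf("inner product = %f\n", inprod);

    bsp_pop_reg(partial);
    free(partial); free(x); free(y);
    bsp_end();
}

int main(int argc, char **argv) {
    bsp_init(bspinprod_sketch, argc, argv);
    bspinprod_sketch();
    return 0;
}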
Parallel programming platforms

See slides Chapter 2 'Parallel programming platforms':
files chapter2.pdf (short version) & chap2_slides.ppt (long version)

COMMENT ON SLIDE 25 of chapter2.pdf:
Communication cost in parallel systems

Assume
bandwidth = 1 Gbit/second = 125 Mbyte/second; 1 word = 8 bytes
-> ~15 Mword/sec -> t_w = 1/(15 x 10^6) sec ≈ 0.06 microsec

latency (startup cost) t_s = 25 microsec

calculation speed = 1 Gflops (= Gflop/sec) (AMD Opteron 250)
-> 1 flop takes 0.001 microsec

416 words could be sent during t_s!
25000 flop could be executed during t_s!

=> avoid sending many short messages;
if necessary, reorganise the algorithm to group messages
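The two conclusions above follow from the usual linear cost model T(m) = t_s + m * t_w for sending a message of m words. The sketch below (an illustration using the slide's constants, not measured values) reproduces the 416-word and 25000-flop figures and shows how splitting one transfer into many small messages multiplies the startup cost:

/* Linear communication cost model T(m) = t_s + m * t_w, with the
 * slide's example constants; compares one large message against
 * many small ones carrying the same total data. */
#include <stdio.h>

#define T_S 25.0      /* startup cost (latency) in microseconds */
#define T_W 0.06      /* per-word transfer time in microseconds */

double comm_time(long words, long messages) {
    return (double)messages * T_S + (double)words * T_W;
}

int main(void) {
    long n = 10000;                        /* total words to send */
    printf("1 message of %ld words   : %8.1f microsec\n",
           n, comm_time(n, 1));
    printf("100 messages of %ld words : %8.1f microsec\n",
           n / 100, comm_time(n, 100));
    printf("words sendable during t_s : %.0f\n", T_S / T_W);
    printf("flop executable during t_s: %.0f\n", T_S / 0.001);
    return 0;
}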