
1

SCIENTIFIC (PARALLEL) COMPUTING
(Calcolo Scientifico Parallelo)

Prof. Luca F. Pavarino
Dipartimento di Matematica
Università di Milano

Academic year 2010-2011

luca.pavarino@unimi.it, http://www.mat.unimi.it/~pavarino

Master's degree and PhD programs in Applied Mathematics

2

Course structure

Schedule:
 - Monday 12:30-14:30, Room 4
 - Tuesday 13:30-14:30, Room 5 (to be compacted?)
 - Wednesday 14:30-16:30, Room 3
 - Friday 8:30-10:30, Room 2

12-13 weeks, 9 CFU (6 lecture + 3 laboratory)

Laboratory in Room 2 or in the LIR or LID labs: hands-on sessions with

 Hardware:
 - our Linux cluster (ulisse.mat.unimi.it), 104 processors
 - our new Linux cluster Nemo
 - the Cilea Linux cluster (avogadro.cilea.it), ~1700 processors
 - (the new IBM SP6 at Cineca (sp6.sp.cineca.it), ~5300 processors)

 Software:
 - the standard "message passing" library MPI
 - the PETSc parallel scientific computing library from Argonne
   National Lab., based on MPI
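As a first taste of the MPI library used in the laboratory, here is a minimal "hello world" sketch in C, only as an illustration of the kind of program run on the clusters above:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* start MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank (id) of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc hello.c -o hello and launched with mpirun -np 4 ./hello, each of the 4 processes prints its own rank.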

3

Materials and Textbooks

Slides in English, based on parallel computing courses taught at
Univ. Illinois by Michael Heath, UC Berkeley by Jim Demmel
(+ MIT by Alan Edelman)

Possible textbooks:
 - A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel
   Computing, 2nd ed., Addison Wesley, 2003
 - L. R. Scott, T. Clark, B. Bagheri, Scientific Parallel Computing,
   Princeton University Press, 2005

Much material available on-line, e.g.:
 - www-unix.mcs.anl.gov/dbpp/ (Ian Foster's book)
 - www.cs.berkeley.edu/~demmel/ (Demmel's course)
 - www-math.mit.edu/~edelman/ (Edelman's course)
 - www.cse.uiuc.edu/~heath/ (Heath's course)
 - www.cs.rit.edu/~ncs/parallel.html (Nan's ref page)

4

Schedule of Topics

1. Introduction

2. Parallel architectures

3. Networks

4. Interprocessor communications: point-to-point, collective

5. Parallel algorithm design

6. Parallel programming, MPI: message passing interface

7. Parallel performance

8. Vector and matrix products

9. LU factorization

10. Cholesky factorization

11. PETSc parallel library

12. Iterative methods for linear systems

13. Nonlinear equations and ODEs

14. Partial Differential Equations

15. Domain Decomposition Methods

16. QR factorization

17. Eigenvalues


5

1) Introduction


What is parallel computing


Large important problems require powerful computers


Why powerful computers must be parallel processors


Why writing (fast) parallel programs is hard


Principles of parallel computing performance


6

What is parallel computing

It is an example of parallel processing:
 - division of a task (process) into smaller tasks (processes)
 - assignment of the smaller tasks to multiple processing units that work
   simultaneously
 - coordination, control and monitoring of the units

Many examples from nature:
 - the human brain consists of ~10^11 neurons
 - complex living organisms consist of many cells (although monocellular
   organisms are estimated to make up about half of the earth's biomass)
 - leaves of trees ...

Many examples from daily life:
 - highway tollbooths, supermarket cashiers, bank tellers, ...
 - elections, races, competitions, ...
 - building construction
 - written exams ...


7


Parallel computing is the use of multiple processors to execute different
parts of the same program (task) simultaneously.

Main goals of parallel computing:
 - Increase the size of problems that can be solved
   - a bigger problem might not be solvable on a serial computer in a
     reasonable amount of time: decompose it into smaller problems
   - a bigger problem might not fit in the memory of a serial computer:
     distribute it over the memory of many compute nodes
 - Reduce the "wall-clock" time to solve a problem

=> Solve (much) bigger problems (much) faster

Subgoal: save money by using the cheapest available resources
(clusters, Beowulf, grid computing, ...)

8

It is not at all trivial that more processors help to achieve these goals:

"If a man can dig a hole of 1 m^3 in 1 hour, can 60 men dig the same hole
in 1 minute (!)? Can 3600 men do it in 1 second (!!)?"

"I know how to make 4 horses pull a cart, but I do not know how to make
1024 chickens do it" (Enrico Clementi)

"What happens if the mean time to failure for nodes on the Tflops machine
is shorter than the boot time?" (Courtenay Vaughan)




9

Why we need powerful computers

10



Simulation: The Third Pillar of Science

Traditional scientific and engineering method:
 (1) Do theory or paper design
 (2) Perform experiments or build a system

Limitations:
 - Too difficult: build large wind tunnels
 - Too expensive: build a throw-away passenger jet
 - Too slow: wait for climate or galactic evolution
 - Too dangerous: weapons, drug design, climate experimentation

Computational science and engineering paradigm:
 (3) Use high performance computer systems to simulate and analyze
     the phenomenon
 - Based on known physical laws and efficient numerical methods
 - Analyze simulation results with computational tools and methods beyond
   what is used traditionally for experimental data analysis

(Diagram: Theory - Experiment - Simulation triangle)

11

Some Particularly Challenging Computations

Science
 - Global climate modeling, weather forecasts
 - Astrophysical modeling
 - Biology: genome analysis; protein folding (drug design)
 - Medicine: cardiac modeling, physiology, neurosciences

Engineering
 - Airplane design
 - Crash simulation
 - Semiconductor design
 - Earthquake and structural modeling

Business
 - Financial and economic modeling
 - Transaction processing, web services and search engines

Defense
 - Nuclear weapons (ASCI), cryptography, ...

12

$5B World Market in Technical Computing

(Stacked-bar chart, 1998-2003: market share by application segment -
Biosciences, Chemical Engineering, Classified Defense, Digital Content
Creation and Distribution, Economics/Financial, Electrical Design/Engineering
Analysis, Geoscience and Geo-engineering, Imaging, Mechanical Design and
Drafting, Mechanical Design/Engineering Analysis, Scientific Research and R&D,
Simulation, Technical Management and Support, Other.)

Source: IDC 2004, from NRC Future of Supercomputer Report

13

Units of Measure in HPC


High Performance Computing (HPC) units are:
 - Flop: floating point operation
 - Flop/s: floating point operations per second
 - Bytes: size of data (a double precision floating point number is 8 bytes)

Typical sizes are millions, billions, trillions...

  Mega    Mflop/s = 10^6  flop/sec    Mbyte = 2^20 = 1048576 ~ 10^6 bytes
  Giga    Gflop/s = 10^9  flop/sec    Gbyte = 2^30 ~ 10^9  bytes
  Tera    Tflop/s = 10^12 flop/sec    Tbyte = 2^40 ~ 10^12 bytes
  Peta    Pflop/s = 10^15 flop/sec    Pbyte = 2^50 ~ 10^15 bytes
  Exa     Eflop/s = 10^18 flop/sec    Ebyte = 2^60 ~ 10^18 bytes
  Zetta   Zflop/s = 10^21 flop/sec    Zbyte = 2^70 ~ 10^21 bytes
  Yotta   Yflop/s = 10^24 flop/sec    Ybyte = 2^80 ~ 10^24 bytes

Current fastest (public) machine ~ 1.5 Pflop/s
Up-to-date list at www.top500.org
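As a rough illustration of what "flop/s" means operationally, the following toy C sketch times a loop of floating-point operations and reports the achieved Mflop/s. It is only a back-of-the-envelope measurement, not a benchmark: the 10^8 iteration count and the "2 flops per iteration" bookkeeping are assumptions of this example, and compiler optimization can distort the result.

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const long n = 100000000L;          /* 10^8 iterations */
        double s = 0.0, x = 1.000000001;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; i++)
            s = s * x + 1.0;                /* 1 multiply + 1 add = 2 flops */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("s = %g, ~%.1f Mflop/s\n", s, 2.0 * n / secs / 1e6);
        return 0;
    }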









14

Ex. 1: Global Climate Modeling Problem

Problem is to compute:
   f(latitude, longitude, elevation, time) ->
       temperature, pressure, humidity, wind velocity

Atmospheric model: equations of fluid dynamics
   (Navier-Stokes system of nonlinear partial differential equations)

Approach:
 - Discretize the domain, e.g., a measurement point every 1 km
 - Devise an algorithm to predict the weather at time t+1 given time t

Uses:
 - Predict major events, e.g., El Niño
 - Use in setting air emissions standards

Source: http://www.epm.ornl.gov/chammp/chammp.html
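To make "predict the state at time t+1 given time t" concrete, here is a hedged toy sketch in C (not the actual climate code): one scalar field on a small 2D grid is advanced in time by a simple neighbor-averaging rule. The grid size, number of steps and the update rule are illustrative assumptions only.

    #include <stdio.h>

    #define NX 64
    #define NY 64
    #define STEPS 100

    int main(void)
    {
        static double u[NX][NY], unew[NX][NY];  /* field at time t and t+1 */

        u[NX/2][NY/2] = 100.0;                  /* toy initial condition */

        for (int t = 0; t < STEPS; t++) {
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++)
                    /* toy rule: new value = average of the 4 neighbors */
                    unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                         u[i][j-1] + u[i][j+1]);
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++)
                    u[i][j] = unew[i][j];       /* advance to time t+1 */
        }

        printf("u at center after %d steps: %g\n", STEPS, u[NX/2][NY/2]);
        return 0;
    }

In a parallel version the grid would be partitioned among processors, and the values along the partition boundaries would be exchanged with MPI at every time step.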

15

Climate Modeling on the Earth Simulator System


Development of the ES started in 1997, in order to build a comprehensive
understanding of global environmental changes such as global warming.

26.58 Tflops was obtained by a global atmospheric circulation code.

35.86 Tflops (87.5% of the peak performance) was achieved in the Linpack
benchmark.

Its construction was completed at the end of February 2002 and practical
operation started on March 1, 2002.
16

Ex. 2: Cardiac simulation


Very difficult problem spanning many disciplines:
 - Electrophysiology (spreading of the electrical excitation front)
 - Structural mechanics (large deformation of incompressible biomaterial)
 - Fluid dynamics (flow of blood inside the heart)

Large-scale simulations in computational electrophysiology
(joint work with P. Colli-Franzone and S. Scacchi):
 - Bidomain model (system of 2 reaction-diffusion equations) coupled with
   Luo-Rudy 1 gating (system of 7 ODEs) in 3D
 - Q1 finite elements in space + adaptive semi-implicit method in time
 - Parallel solver based on the PETSc library
 - Linear systems with up to 36 M unknowns at each time-step (128 procs of
   the Cineca SP4) solved in seconds or minutes
 - Simulation of a full heartbeat (4 M unknowns in space, thousands of
   time-steps) took more than 6 days on 25 procs of the Cilea HP Superdome,
   then about 50 hours on 36 procs of our cluster, now 6.5 hours using a
   multilevel preconditioner



17

3D simulations: isochrones of activation, repolarization, APD (action potential duration)

18




Hemodynamics in the circulatory system (work in Quarteroni's group at MOX, Polimi)

Blood flow in the heart (Peskin's group, CIMS, NYU)
 - Modeled as an elastic structure in an incompressible fluid
 - The "immersed boundary method" due to Peskin and McQueen
 - 20 years of development of the model
 - Many applications other than the heart: blood clotting, inner ear, paper
   making, embryo growth, and others
 - Uses a regularly spaced mesh (set of points) for evaluating the fluid

Uses:
 - The current model can be used to design artificial heart valves
 - Can help in understanding the effects of disease (leaky valves)
 - Related projects look at the behavior of the heart during a heart attack
 - Ultimately: real-time clinical work

19

Ex. 3: latest breakthrough

(Slides 20-29: figures)

30

Ex. 4: Parallel Computing in Data Analysis

Web search:
 - Functional parallelism: crawling, indexing, sorting
 - Parallelism between queries: multiple users
 - Finding information amidst junk
 - Preprocessing of the web data set to help find information

Google physical structure (2004 estimate; check current status on e.g. Wikipedia):
 - about 63,272 nodes (126,544 CPUs)
 - 126,544 GB RAM
 - 5,062 TB of hard drive space

(This would make the Google server farm one of the most powerful
supercomputers in the world)

Google index size (June 2005 estimate):
 - about 8 billion web pages, 1 billion images


31

Note that the total Surface Web (= publicly indexable, i.e. reachable by web
crawlers) has been estimated (Jan. 2005) at over 11.5 billion web pages.

The Invisible (or Deep) Web (= not indexed by search engines; it consists of
dynamic web pages, subscription sites, searchable databases) has been
estimated (2001) at over 550 billion documents.

The Invisible Web is not to be confused with the Dark Web, consisting of
machines or network segments not connected to the Internet.

Data is collected and stored at enormous speeds (Gbyte/hour):
 - remote sensors on satellites
 - telescopes scanning the skies
 - microarrays generating gene expression data
 - scientific simulations generating terabytes of data
 - NSA analysis of telecommunications


32

Why powerful computers are parallel

33

Tunnel Vision by Experts

"I think there is a world market for maybe five computers."
 - Thomas Watson, chairman of IBM, 1943.

"There is no reason for any individual to have a computer in their home."
 - Ken Olson, president and founder of Digital Equipment Corporation, 1977.

"640K [of memory] ought to be enough for anybody."
 - Bill Gates, chairman of Microsoft, 1981.

Slide source: Warfield et al.

34

Technology Trends: Microprocessor Capacity

2x transistors/chip every 1.5-2 years: called "Moore's Law"

Moore's Law: microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor
density of semiconductor chips would double roughly every 18 months.

Slide source: Jack Dongarra

35

Impact of Device Shrinkage


What happens when the feature size shrinks by a factor of x?

Clock rate goes up by x
 - actually less than x, because of power consumption

Transistors per unit area go up by x^2

Die size also tends to increase
 - typically by another factor of ~x

Raw computing power of the chip goes up by ~x^4 !
 - of which x^3 is devoted either to parallelism or locality

36

Microprocessor Transistors per Chip

(Left chart: growth in transistors per chip, 1970-2005, log scale from 1,000
to 100,000,000, with processors from the i4004, i8080 and i8086 through the
i80286, i80386, R2000, R3000 and R10000 to the Pentium. Right chart: increase
in clock rate, log scale from 0.1 to 1000 MHz, 1970-2000.)

37


But there are limiting forces

Moore's 2nd law (Rock's law): costs go up

(Image: demo of 0.06 micron CMOS. Source: Forbes Magazine)

Yield
 - What percentage of the chips are usable?
 - E.g., the Cell processor (PS3) is sold with 7 out of 8 "on" to improve yield

Manufacturing costs and yield problems limit the use of density

38


Revolution is Happening Now

Chip density is continuing to increase ~2x every 2 years
 - Clock speed is not
 - The number of processor cores may double instead

There is little or no more hidden parallelism (ILP) to be found

Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

39



Parallelism in 2009-10?

These arguments are no longer theoretical

All major processor vendors are producing multicore chips
 - Every machine will soon be a parallel machine
 - To keep doubling performance, parallelism must double

Which commercial applications can use this parallelism?
 - Do they have to be rewritten from scratch?

Will all programmers have to be parallel programmers?
 - A new software model is needed
 - Try to hide complexity from most programmers, eventually
 - In the meantime, we need to understand it

The computer industry is betting on this big change, but it does not have
all the answers
 - The Berkeley ParLab was established to work on this


40

Physical limits: how fast can a serial computer be?


Consider a 1 Tflop/s sequential machine:
 - Data must travel some distance, r, to get from memory to CPU.
 - To get 1 data element per cycle, memory must be reached 10^12 times per
   second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm.

Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
 - the area per byte is 0.3^2 mm^2 / 10^12 = 9x10^(-2) x 10^(-6) m^2 / 10^12
   = 9x10^(-20) m^2 = (3x10^(-10) m)^2 = (3 Angstrom)^2
 - so each byte occupies about 9 square Angstroms, roughly the size of a
   small atom! (1 Angstrom = 10^(-10) m = 0.1 nanometer)
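The same back-of-the-envelope calculation written out (a restatement of the numbers above, in LaTeX notation):

    \[
      r \;<\; \frac{c}{10^{12}\,\mathrm{s}^{-1}}
        \;=\; \frac{3\times 10^{8}\ \mathrm{m/s}}{10^{12}\ \mathrm{s}^{-1}}
        \;=\; 3\times 10^{-4}\ \mathrm{m} \;=\; 0.3\ \mathrm{mm},
    \]
    \[
      \frac{(3\times 10^{-4}\ \mathrm{m})^{2}}{10^{12}\ \text{bytes}}
        \;=\; 9\times 10^{-20}\ \mathrm{m}^{2}\ \text{per byte}
        \;=\; \bigl(3\times 10^{-10}\ \mathrm{m}\bigr)^{2}
        \;=\; (3\ \text{\AA})^{2}\ \text{per byte}.
    \]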


No choice but parallelism

(Diagram: the whole 1 Tflop/s, 1 Tbyte sequential machine confined to a disk
of radius r = 0.3 mm.)

41


More Exotic Solutions on the Horizon

GPUs - Graphics Processing Units (e.g. NVidia)
 - Parallel processor attached to the main processor
 - Originally special purpose, getting more general

FPGAs - Field Programmable Gate Arrays
 - Inefficient use of chip area
 - More efficient than multicore now, maybe not later
 - Wire routing heuristics still troublesome

Dataflow and tiled processor architectures
 - Considerable experience with dataflow from the 1980s
 - Are we ready to return to functional programming languages?

Cell
 - Software-controlled memory uses bandwidth efficiently
 - Programming model not yet mature

42

“Automatic” Parallelism in Modern Machines


Bit level parallelism: within floating point operations, etc.


Instruction level parallelism (ILP): multiple instructions execute per
clock cycle.


Memory system parallelism: overlap of memory operations with
computation.


OS parallelism: multiple jobs run in parallel on commodity SMPs.


There are limitations to all of these: to achieve high performance, the
programmer needs to identify, schedule and coordinate parallel tasks and data.


43

Processor-DRAM Gap (latency)

(Chart: relative performance vs. time, 1980-2000, log scale 1 to 1000. CPU
performance ("Moore's Law") improves ~60% per year while DRAM improves only
~7% per year, so the processor-memory performance gap grows ~50% per year.)

44

Principles of Parallel Computing


Parallelism and Amdahl’s Law


Finding and exploiting granularity


Preserving data locality


Load balancing


Coordination and synchronization


Performance modeling


All of these issues make parallel programming harder than sequential programming.

45

Amdahl's Law: Finding Enough Parallelism

Suppose only part of an application seems parallel.

Amdahl's law:
 - Let s be the fraction of work done sequentially, so (1-s) is the fraction
   that is parallelizable.
 - Let P = number of processors.

   Speedup(P) = Time(1)/Time(P)
              <= 1/(s + (1-s)/P)
              <= 1/s

Even if the parallel part speeds up perfectly, we may be limited by the
sequential portion of the code.

Ex: if only s = 1%, then speedup <= 100:
not worth using more than P = 100 processors
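A small C sketch that just evaluates this bound for the s = 1% example (the particular values of P are arbitrary):

    #include <stdio.h>

    /* Amdahl's law: upper bound on speedup with P processors
       when a fraction s of the work is sequential. */
    static double amdahl_speedup(double s, int P)
    {
        return 1.0 / (s + (1.0 - s) / P);
    }

    int main(void)
    {
        const double s = 0.01;                      /* 1% sequential work */
        const int procs[] = {10, 100, 1000, 10000};

        for (int i = 0; i < 4; i++)
            printf("P = %5d   speedup <= %6.2f   (limit 1/s = %.0f)\n",
                   procs[i], amdahl_speedup(s, procs[i]), 1.0 / s);
        return 0;
    }

For P = 100 the bound is already only about 50, and no number of processors can push it past 1/s = 100.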

46

Overhead of Parallelism


Given enough parallel work, this is the most significant
barrier to getting desired speedup.


Parallelism overheads include:
 - cost of starting a thread or process
 - cost of communicating shared data
 - cost of synchronizing
 - extra (redundant) computation


Each of these can be in the range of milliseconds
(= millions of flops) on some systems


Tradeoff: Algorithm needs sufficiently large units of work
to run fast in parallel (i.e. large granularity), but not so
large that there is not enough parallel work.

47

Locality and Parallelism


Large memories are slow, fast memories are small.

Storage hierarchies are large and fast on average.

Parallel processors, collectively, have large, fast memories -- the slow
accesses to "remote" data we call "communication".

Algorithms should do most of their work on local data.

(Diagram: conventional storage hierarchy - Proc, Cache, L2 Cache, L3 Cache,
Memory - replicated on several processors, with potential interconnects
between the memories.)

48

Load Imbalance


Load imbalance is the time that some processors in the system are idle due to:
 - insufficient parallelism (during that phase)
 - unequal size tasks

Examples of the latter:
 - adapting to "interesting parts of a domain"
 - tree-structured computations
 - fundamentally unstructured problems
 - adaptive numerical methods for PDEs (adaptivity and parallelism seem
   to conflict)

The algorithm needs to balance the load
 - but techniques to balance the load often reduce locality



49

Measuring Performance: Real Performance?

(Chart: peak vs. real performance in Teraflops, log scale 0.1 to 1,000,
1996-2004; the performance gap between the two curves widens.)

Peak performance grows exponentially, a la Moore's Law:
 - in the 1990s, peak performance increased 100x; in the 2000s, it will
   increase 1000x

But efficiency (the performance relative to the hardware peak) has declined:
 - it was 40-50% on the vector supercomputers of the 1990s
 - it is now as little as 5-10% on the parallel supercomputers of today

Close the gap through ...
 - mathematical methods and algorithms that achieve high performance on a
   single processor and scale to thousands of processors
 - more efficient programming models and tools for massively parallel
   supercomputers

50

Performance Levels


Peak advertised performance (PAP)
 - You can't possibly compute faster than this speed

LINPACK
 - The "hello world" program of parallel computing
 - Solve Ax=b using Gaussian elimination, highly tuned

Gordon Bell Prize winning applications performance
 - The right application/algorithm/platform combination plus years of work

Average sustained applications performance
 - What one can reasonably expect for standard applications

When reporting performance results, these levels are often confused, even
in reviewed publications

51


Performance Levels (for example on NERSC-5)

Peak advertised performance (PAP): 100 Tflop/s

LINPACK (TPP): 84 Tflop/s

Best climate application: 14 Tflop/s
 - WRF code benchmarked in December 2007

Average sustained applications performance: ? Tflop/s
 - Probably less than 10% of peak!

We will study performance:
 - hardware and software tools to measure it
 - identifying bottlenecks
 - practical performance tuning (Matlab demo)

(Slides 52-58: figures)

59

Simple example 1: sum of N numbers, P procs

Compute  A = sum_{i=1}^{N} a_i

Also known as a reduction (of the vector [a_1, ..., a_N] to the scalar A)

 - Assume N is an integer multiple of P: N = kP
 - Divide the sum into P partial sums:

     A_j = sum_{i=(j-1)k+1}^{jk} a_i ,   j = 1, ..., P

   P parallel tasks, each with k-1 additions of k = N/P data

 - Then

     A = sum_{j=1}^{P} A_j

   Global sum (not parallel, communication needed)
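A minimal C/MPI sketch of this scheme (the values a_i = i and the choice of N are illustrative assumptions; the global sum is the MPI_Reduce call):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int myid, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        const long N = 1000000;              /* assume N is a multiple of P */
        long k = N / P;                      /* each process sums k numbers */

        /* local partial sum A_j over a_i for i = myid*k+1, ..., (myid+1)*k */
        double Aj = 0.0;
        for (long i = myid * k + 1; i <= (myid + 1) * k; i++)
            Aj += (double)i;

        /* global sum of the P partial sums (the communication step) */
        double A = 0.0;
        MPI_Reduce(&Aj, &A, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (myid == 0)
            printf("A = %.0f   (exact: %.0f)\n", A, 0.5 * N * (N + 1.0));

        MPI_Finalize();
        return 0;
    }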

60

Simple example 2: pi

   pi = integral_0^1 4/(1+x^2) dx = 4 arctan(x) |_0^1

Use the composite midpoint quadrature rule:

   pi ~ h * sum_{i=1}^{N} 4/(1 + x_i^2),
   where h = 1/N and x_i = (i - 1/2) h

 - Decompose the sum into P parallel partial sums + 1 global sum
   (as before, or with stride P)

On processor myid = 0, ..., P-1 (P = numprocs) compute:

   sum = 0;
   for I = myid+1 : numprocs : N,
       x = h*(I - 0.5);
       sum = sum + 4/(1 + x*x);
   end;
   mypi = h*sum;

then global-sum the local mypi into glob_pi (reduction)
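The same computation as a runnable C/MPI sketch, a hedged rendition of the pseudocode above (the value of N is an arbitrary choice):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int myid, numprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

        const int N = 1000000;               /* number of midpoint intervals */
        double h = 1.0 / N, sum = 0.0;

        /* local partial sum: indices myid+1, myid+1+P, ... (stride P) */
        for (int I = myid + 1; I <= N; I += numprocs) {
            double x = h * (I - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        double mypi = h * sum, glob_pi = 0.0;

        /* global sum of the local contributions (reduction) */
        MPI_Reduce(&mypi, &glob_pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (myid == 0)
            printf("pi ~ %.12f\n", glob_pi);

        MPI_Finalize();
        return 0;
    }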

61

Simple example 3: prime number sieve

See exercise in class

Simple example 4: Jacobi method for BVP

See exercise in class