Parallel Scientific Computing: Algorithms and Tools Lecture #1


1

Parallel Scientific Computing:
Algorithms and Tools

Lecture #1

APMA 2821A, Spring 2008

Instructors: George Em Karniadakis, Leopold Grinberg

2

Logistics


Contact:
  Office hours: GK: M 2-4 pm; LG: W 2-4 pm
  Email: {gk,lgrinb}@dam.brown.edu
  Web: www.cfm.brown.edu/people/gk/APMA2821A

Textbook:
  Karniadakis & Kirby, “Parallel Scientific Computing in C++/MPI”

Other books:
  Shonkwiler & Lefton, “Parallel and Vector Scientific Computing”
  Wadleigh & Crawford, “Software Optimization for High Performance Computing”
  Foster, “Designing and Building Parallel Programs” (available online)

3

Logistics


CCV Accounts
  Email: Sharon_King@brown.edu

Prerequisite: C/Fortran programming

Grading:
  5 assignments/mini-projects: 50%
  1 final project/presentation: 50%

4

History

5

History

6

Course Objectives



Understanding of fundamental concepts and programming principles for the development of high-performance applications

Ability to program a range of parallel computers: PCs, clusters, supercomputers

Make efficient use of high-performance parallel computing in your own research

7

Course Objectives

8

Content Overview


Parallel computer architecture: 2-3 weeks
  CPU, memory; shared-/distributed-memory parallel machines; network connections

Parallel programming: 5 weeks
  MPI; OpenMP; UPC

Parallel numerical algorithms: 4 weeks
  Matrix algorithms; direct/iterative solvers; eigensolvers; Monte Carlo methods (simulated annealing, genetic algorithms)

Grid computing: 1 week
  Globus, MPICH-G2

9

What & Why


What is high performance computing (HPC)?
  The use of the most efficient algorithms on computers capable of the highest performance to solve the most demanding problems.

Why HPC?
  Large problems, spatially and temporally:
    A 10,000 x 10,000 x 10,000 grid has 10^12 grid points; with 4 double variables per point that is 4x10^12 doubles, i.e. 32x10^12 bytes = 32 Tera-Bytes (a sketch of this arithmetic follows below).
    Usually we need to simulate tens of millions of time steps.
  On-demand/urgent computing; real-time computing
  Weather forecasting; protein folding; turbulence simulations/CFD; aerospace structures; full-body simulation/digital human …
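A minimal C sketch of the memory estimate above. The grid size and the figure of 4 double variables per grid point come from the slide; the variable names and the 8 bytes per double are standard assumptions.

    #include <stdio.h>

    int main(void)
    {
        /* Grid dimensions and variables per point, as quoted on the slide. */
        const double nx = 1.0e4, ny = 1.0e4, nz = 1.0e4;
        const double vars_per_point   = 4.0;   /* 4 doubles per grid point  */
        const double bytes_per_double = 8.0;   /* IEEE-754 double precision */

        double points = nx * ny * nz;          /* 10^12 grid points         */
        double bytes  = points * vars_per_point * bytes_per_double;

        printf("grid points: %.3e\n", points);
        printf("memory     : %.3e bytes (about %.0f TB)\n", bytes, bytes / 1.0e12);
        return 0;
    }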

10

HPC Examples: Blood Flow in
Human Vascular Network


Cardiovascular disease accounts for about 50% of deaths in the western world.

Formation of arterial disease is strongly correlated with blood flow patterns.

Computational challenges: enormous problem size
  In one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, about a quarter of the distance between the earth and the moon.
  Blood flow involves multiple scales.

11

HPC Examples

Earthquake simulation
  Surface velocity 75 sec after the earthquake

Flu pandemic simulation
  300 million people tracked
  Density of infected population, 45 days after the outbreak

12

HPC Example: Homogeneous Turbulence

Direct Numerical Simulation of Homogeneous Turbulence: 4096^3

[Figure: vorticity iso-surface, with successive zoom-ins]

13

How HPC fits into Scientific Computing

Physical processes -> Mathematical models -> Numerical solutions -> Data visualization, validation, physical insight

Example: air flow around an airplane -> Navier-Stokes equations -> algorithms, BCs, solvers, application codes, supercomputers -> viz software

HPC enters at the numerical-solution and visualization stages.

14

Performance Metrics


FLOPS, or FLOP/S: FLoating-point OPerations Per Second
  MFLOPS: MegaFLOPS, 10^6 flops
  GFLOPS: GigaFLOPS, 10^9 flops, a home PC
  TFLOPS: TeraFLOPS, 10^12 flops, present-day supercomputers (www.top500.org)
  PFLOPS: PetaFLOPS, 10^15 flops, expected by 2011
  EFLOPS: ExaFLOPS, 10^18 flops, expected by 2020

MIPS = Mega Instructions Per Second = MegaHertz (if 1 instruction per cycle)

Note: the von Neumann computer -- 0.00083 MIPS

15

Performance Metrics


Theoretical peak performance R_theor: the maximum FLOPS a machine can reach in theory.
  R_theor = clock_rate * no_cpus * no_FPUs_per_CPU
  Example: 3 GHz, 2 CPUs, 1 FPU/CPU -> R_theor = 3x10^9 * 2 * 1 = 6 GFLOPS (see the sketch below)

Real performance R_real: FLOPS for specific operations, e.g. vector multiplication.

Sustained performance R_sustained: performance on a full application, e.g. CFD.

R_sustained << R_real << R_theor
It is not uncommon that R_sustained < 10% of R_theor.
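A minimal C sketch of the peak-performance formula above, plugged with the slide's example numbers (3 GHz, 2 CPUs, 1 FPU per CPU); the variable names are illustrative only.

    #include <stdio.h>

    int main(void)
    {
        /* Example numbers from the slide. Real processors may retire more
         * than one floating-point result per FPU per cycle; that factor is
         * omitted here, as it is on the slide. */
        double clock_rate_hz = 3.0e9;   /* 3 GHz */
        int    n_cpus        = 2;
        int    fpus_per_cpu  = 1;

        double r_theor = clock_rate_hz * n_cpus * fpus_per_cpu;
        printf("R_theor = %.1f GFLOPS\n", r_theor / 1.0e9);   /* prints 6.0 */
        return 0;
    }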

16

Top 10 Supercomputers

www.top500.org

[Table: Top 10 systems from the November 2007 list, showing LINPACK performance (R_real) alongside R_theor]

17

Number of Processors

18

Fastest Supercomputers

www.top500.org

[Figure: peak performance of the fastest computers from 1950 to 2010 (projected), on a log scale from 1 KFlop/s to 1 PFlop/s, spanning scalar, super-scalar, vector and parallel machines. Labeled systems include EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, the Japanese Earth Simulator, IBM BG/L, and "My Laptop".]

Performance milestones (floating-point operations per second, Flop/s):
  1941: 1
  1945: 100
  1949: 1,000 (1 KiloFlop/s, KFlop/s)
  1951: 10,000
  1961: 100,000
  1964: 1,000,000 (1 MegaFlop/s, MFlop/s)
  1968: 10,000,000
  1975: 100,000,000
  1987: 1,000,000,000 (1 GigaFlop/s, GFlop/s)
  1992: 10,000,000,000
  1993: 100,000,000,000
  1997: 1,000,000,000,000 (1 TeraFlop/s, TFlop/s)
  2000: 10,000,000,000,000
  2005: 131,000,000,000,000 (131 TFlop/s)

2X transistors/chip every 1.5 years: a growth factor of a billion in performance over a career.

Japanese “Life Simulator” Effort for a 10 Pflop/s System

From the Nikkei newspaper, May 30th morning edition:
  A collaboration of industry, academia and government, organized by NEC, Hitachi, U of Tokyo, Kyushu U, and RIKEN.
  Includes a competition component similar to the DARPA HPCS program.
  This year about $4M each was allocated for advanced development towards petascale.
  A total of ¥100,000M ($909M) will be invested in this development.
  Planned to be operational in 2011.


Japan’s Life Simulator:
  Original concept design in 2005
  Driven by the needs of multi-scale, multi-physics simulation
  Integration of multiple architectures: a tightly-coupled heterogeneous computer
  Requires multiple computation components

[Diagram: present architecture -- vector, scalar and MD nodes, each with its own fast interconnect, joined through a switch over a slower connection; proposed architecture -- vector, scalar, FPGA and MD nodes all sharing a single fast interconnect.]

Major Applications of the Next Generation Supercomputer

Targeted as grand challenges:
  Basic concept for simulations in nano-science
  Basic concept for simulations in life sciences

[Diagram: multi-scale hierarchy for life-science simulation, from genome, protein and cell (micro scale: bio-MD, chemical processes) through tissue and organ (meso scale: tissue structure, multi-physics) to the vascular system and organism (macro scale: blood circulation, DDS, gene therapy, HIFU, micro-machines, catheters). Image sources: http://ridge.icu.ac.jp, http://info.med.yale.edu/, RIKEN.]

25

Petascale Era: 2008-

NCSA: Blue Waters, 1 PFLOP/s, 2011

26

Bell versus Moore

27

Grand Challenge Applications

28

The von Neumann Computer

Walk-through: c = a + b

  1. Get next instruction
  2. Decode: fetch a
  3. Fetch a to internal register
  4. Get next instruction
  5. Decode: fetch b
  6. Fetch b to internal register
  7. Get next instruction
  8. Decode: add a and b (c in register)
  9. Do the addition in the ALU
  10. Get next instruction
  11. Decode: store c in main memory
  12. Move c from internal register to main memory

Note: some units are idle while others are working … a waste of cycles.
Pipelining (modularization) & caching (advance decoding) introduce parallelism.

29

Basic Architecture
  - CPU, pipelining
  - Memory hierarchy, cache

30

Computer Performance


The CPU operates on data. If no data is available, the CPU has to wait and performance degrades.
  Typical workstation: 3.2 GHz CPU, 667 MHz memory -- the memory is roughly 5 times slower.
  Moore’s law: CPU speed doubles every 18 months.
  Memory speed increases much more slowly.
  A fast CPU requires sufficiently fast memory.

Rule of thumb: memory size in GB = R_theor in GFLOPS
  1 CPU cycle (1 FLOP) handles 1 byte of data
  1 MFLOPS needs 1 MB of data/memory
  1 GFLOPS needs 1 GB of data/memory

Many “tricks” designed for performance improvement target the memory.

31

CPU Performance


Computer time is measured in terms of CPU cycles.
  The minimum time to execute 1 instruction is 1 CPU cycle.

Time to execute a given program:
  T = n_c * t_c = n_i * CPI * t_c
where
  n_c: total number of CPU cycles
  n_i: total number of instructions
  CPI = n_c/n_i: average cycles per instruction
  t_c: cycle time; for a 1 GHz clock, t_c = 1/(10^9 Hz) = 10^(-9) s = 1 ns
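A small C sketch evaluating the execution-time formula above; the instruction count and CPI values are made-up numbers for illustration.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative values: 10^9 instructions, average CPI of 1.5,
         * 1 GHz clock (t_c = 1 ns). */
        double n_i = 1.0e9;       /* total instructions             */
        double cpi = 1.5;         /* average cycles per instruction */
        double t_c = 1.0e-9;      /* cycle time in seconds          */

        double n_c = n_i * cpi;   /* total CPU cycles               */
        double t   = n_c * t_c;   /* execution time = n_i*CPI*t_c   */

        printf("cycles = %.3e, time = %.3f s\n", n_c, t);   /* 1.5e9, 1.5 s */
        return 0;
    }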

32

To Make a Program/Computer Faster…


Reduce cycle time t_c:
  Increase the clock frequency; however, there is a physical limit.
  In 1 ns, light travels 30 cm. At the current ~3 GHz, light travels only 10 cm within 1 CPU cycle, so the processor's length/size must be < 10 cm (1 atom is about 0.2 nm).

Reduce the number of instructions n_i:
  More efficient algorithms (see the Horner's-rule sketch below)
  Better compilers

Reduce CPI -- the key is parallelism:
  Instruction-level parallelism: pipelining technology
  Internal parallelism: multiple functional units; superscalar processors; multi-core processors
  External parallelism: multiple CPUs, parallel machines
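A standard illustration (not from the slides) of reducing the instruction count through a better algorithm: evaluating a degree-n polynomial term by term costs O(n^2) multiplications, while Horner's rule needs only n multiplications and n additions.

    #include <stdio.h>

    /* Naive evaluation: recomputes x^i for every term, O(n^2) multiplications. */
    static double poly_naive(const double *c, int n, double x)
    {
        double sum = 0.0;
        for (int i = 0; i <= n; i++) {
            double xi = 1.0;
            for (int j = 0; j < i; j++)
                xi *= x;
            sum += c[i] * xi;
        }
        return sum;
    }

    /* Horner's rule: n multiplications and n additions. */
    static double poly_horner(const double *c, int n, double x)
    {
        double sum = c[n];
        for (int i = n - 1; i >= 0; i--)
            sum = sum * x + c[i];
        return sum;
    }

    int main(void)
    {
        double c[] = {1.0, -2.0, 0.5, 3.0};   /* 1 - 2x + 0.5x^2 + 3x^3 */
        printf("%f %f\n", poly_naive(c, 3, 2.0), poly_horner(c, 3, 2.0));
        return 0;
    }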

33

Processor Types


Vector processors
  Cray X1/T90; NEC SX#; Japan Earth Simulator; early Cray machines; Japan Life Simulator (hybrid)

Scalar processors
  CISC: Complex Instruction Set Computer
    Intel 80x86 (IA32)
  RISC: Reduced Instruction Set Computer
    Sun SPARC, IBM Power #, SGI MIPS
  VLIW: Very Long Instruction Word; explicitly parallel instruction computing (EPIC); probably dying
    Intel IA64 (Itanium)

34

CISC Processor


CISC
  Complex instructions; a large number of instructions; can complete more complicated functions at the instruction level.
  An instruction actually invokes microcode; microcodes are small programs in processor memory.
  Slower; many instructions access memory; the varying instruction length allows no pipelining.

35

RISC Processor


No microcode

Simple instructions; fewer instructions; fast

Only load and store instructions access memory

Common instruction word length

Allows pipelining

Almost all present-day high performance computers use RISC processors.

36

Locality of References


Spatial/temporal locality
  If the processor executes an instruction at time t, it is likely to execute an adjacent/next instruction at (t + delta_t).
  If the processor accesses a memory location/data item x at time t, it is likely to access an adjacent memory location/data item (x + delta_x) at (t + delta_t).

Pipelining, caching and many other techniques are all based on the locality of references (see the loop-order sketch below).
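A common textbook illustration of spatial locality (not from the slides): C stores 2-D arrays row by row, so a row-wise loop touches consecutive memory locations and reuses cache lines, while a column-wise loop strides through memory.

    #include <stddef.h>

    #define N 2048

    /* Row-wise traversal: consecutive addresses, good spatial locality. */
    double sum_row_wise(const double a[N][N])
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];            /* stride of 1 double */
        return s;
    }

    /* Column-wise traversal: each access jumps N doubles, poor locality. */
    double sum_col_wise(const double a[N][N])
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];            /* stride of N doubles */
        return s;
    }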

37

Pipelining


Overlapping the execution of multiple instructions
  Goal: 1 instruction completed per cycle (once the pipeline is full)

Sub-divide each instruction into multiple stages; the processor handles different stages of adjacent instructions simultaneously.

Suppose 4 stages in an instruction:
  Instruction fetch and decode (IF)
  Read data (RD)
  Execute (EX)
  Write-back results (WB)

38

Instruction Pipeline

[Diagram: 7 instructions flowing through the 4-stage pipeline (IF, RD, EX, WB) over 10 cycles; once the pipeline is full, one new instruction enters, and one completes, every cycle.]

Depth of pipeline: the number of stages in an instruction.

After the pipeline is full, 1 result per cycle! CPI = (n + depth - 1)/n

With the pipeline, 7 instructions take 10 cycles; without it, they would take 28 cycles.
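A tiny C sketch evaluating the pipeline formula above for the slide's 7-instruction, 4-stage example.

    #include <stdio.h>

    int main(void)
    {
        int depth = 4;   /* pipeline stages: IF, RD, EX, WB       */
        int n     = 7;   /* number of instructions in the example */

        int cycles_pipelined = n + depth - 1;          /* 10 cycles */
        int cycles_serial    = n * depth;              /* 28 cycles */
        double cpi           = (double)cycles_pipelined / n;

        printf("pipelined: %d cycles (CPI = %.2f), no pipeline: %d cycles\n",
               cycles_pipelined, cpi, cycles_serial);
        return 0;
    }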

39

Inhibitors of Pipelining


Dependencies between instructions interrupt pipelining, degrading performance:
  Control dependence
  Data dependence

40

Control Dependence


Branching: an instruction occurs after a conditional branch, so it is not known beforehand whether that instruction will be executed.
  Loop: for(i=0;i<n;i++)…; do…enddo
  Jump: goto …
  Condition: if…else…, e.g. if(x>y) n=5;

Branching in programs interrupts the pipeline and degrades performance. Avoid excessive branching!

41

Data Dependence


Occurs when an instruction depends on data from a previous instruction:

x = 3*j;
y = x + 5.0;  // depends on the previous instruction
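A short illustration (not from the slides) of how data dependence limits pipelining in loops: the first loop has a loop-carried dependence, so each iteration must wait for the previous one, while the second loop's iterations are independent and keep the pipeline full.

    /* Loop-carried dependence: iteration i needs a[i-1] computed by the
     * previous iteration, so the additions cannot overlap in the pipeline. */
    void prefix_sum(double *a, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + a[i];
    }

    /* Independent iterations: every element can be computed without waiting,
     * so the pipeline (or a vector unit) stays busy. */
    void scale(double *a, int n, double c)
    {
        for (int i = 0; i < n; i++)
            a[i] = c * a[i];
    }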

42

Vector Pipeline


Vector processors have vector registers which can hold an entire vector, e.g. 128 elements. Commonly encountered processors, e.g. in a home PC, are scalar processors.

Vector pipelines are efficient for loops involving vectors:

for (i = 0; i < 128; i++)
    z[i] = x[i] + y[i];

Instructions:
  Vector Load  X(1:128)
  Vector Load  Y(1:128)
  Vector Add   Z = X + Y
  Vector Store Z

43

Vector Pipeline

[Diagram: the four vector instructions (Load X(1:128), Load Y(1:128), Add Z=X+Y, Store Z) streaming through the pipeline; after start-up, one element result is produced per cycle, and the whole 128-element operation completes in about 133 cycles.]

44

Vector Operations: Hockney’s Formulas

[Figure: measured performance of vector operations; cache size: 64 KB]

45

Exceeding Cache Size

[Figure: performance as the working set exceeds the 32 KB cache (cache line: 64 bytes). NOTE: the asymptotic rate of ~5 MFLOPS corresponds to one result every 15 clocks, the time to reload a cache line following a miss.]

46

Internal Parallelism


Functional units: the components in the processor that actually do the work
  Memory operations (MU): load, store
  Integer arithmetic (IU): integer add, bit shift, …
  Floating-point arithmetic (FPU): floating-point add, multiply, …

Typical instruction latencies:

  Instruction type          Latency (cycles)
  Integer add               1
  Floating-point add        3
  Floating-point multiply   3
  Floating-point divide     31

Division is much slower than add/multiply! Minimize or avoid divisions (see the reciprocal sketch below).
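A standard way to act on that advice (an illustration, not from the slides): when dividing many values by the same number, compute the reciprocal once and multiply, trading a ~31-cycle divide per element for a ~3-cycle multiply. Note that this can change the last bits of the result, so compilers typically only do it automatically under relaxed floating-point options.

    /* Repeated division: roughly one 31-cycle divide per element. */
    void normalize_div(double *a, int n, double norm)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] / norm;
    }

    /* One divide up front, then cheap multiplies inside the loop. */
    void normalize_mul(double *a, int n, double norm)
    {
        double inv = 1.0 / norm;       /* single division               */
        for (int i = 0; i < n; i++)
            a[i] = a[i] * inv;         /* ~3-cycle multiply per element */
    }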

47

Internal Parallelism


Superscalar RISC processors: multiple functional units in the processor, e.g. multiple FPUs

Capable of executing more than one instruction (producing more than one result) per cycle

Shared registers, L1 cache, etc.

Need faster memory access to provide data to multiple functional units!

Limiting factor: memory-processor bandwidth

48

Internal Parallelism


Multi-core processors: Intel dual-core, quad-core
  Multiple execution cores (functional units, registers, L1 cache) on one CPU chip
  Multiple cores share the L2 cache and memory
  Lower energy consumption
  Need FAST memory access to provide data to multiple cores
  Effective memory bandwidth per core is reduced
  Limiting factor: memory-processor bandwidth

[Diagram: a CPU chip with several cores, each with its own functional units and L1 cache, sharing an L2 cache.]

49

Heat Flux also Increases with Speed!


50

New Processors are Too Hot!


51

52

Your Next PC?

53

External Parallelism


Parallel machines: Will be discussed later

54

Memory: Next Lecture


Bit: 0, 1; Byte: 8 bits

Memory size:
  PB = 10^15 bytes; TB = 10^12 bytes; GB = 10^9 bytes; MB = 10^6 bytes

Memory performance measures:
  Access time (response time, latency): the interval between the issuance of a memory request and the time when the request is satisfied.
  Cycle time: the minimum time between two successive memory requests.

[Diagram: a request issued at t0 is satisfied at t1; the memory cannot accept another request until t2. Access time = t1 - t0; cycle time = t2 - t0. If another request arrives at t0 < t < t2, the memory is busy and will not respond; the request has to wait until t > t2.]