# Parallel Scientific Computing: Algorithms and Tools Lecture #1


APMA 2821A, Spring 2008

Instructors

Leopold Grinberg
## Logistics

Contact:

- Office hours: GK: M 2-4 pm; LG: W 2-4 pm
- Email: {gk,lgrinb}@dam.brown.edu
- Web: www.cfm.brown.edu/people/gk/APMA2821A

Textbook:

- Karniadakis & Kirby, “Parallel Scientific Computing in C++/MPI”

Other books:

- Shonkwiler & Lefton, “Parallel and Vector Scientific Computing”
- Wadleigh & Crawford, “Software Optimization for High Performance Computing”
- Foster, “Designing and Building Parallel Programs” (available online)

## Logistics

- CCV accounts. Email: Sharon_King@brown.edu
- Prerequisite: C/Fortran programming
- 5 assignments/mini-projects: 50%
- 1 final project/presentation: 50%

## History

## Course Objectives

- Understanding of fundamental concepts and programming principles for the development of high performance applications
- Able to program a range of parallel computers: PCs, clusters, supercomputers
- Make efficient use of high performance parallel computing in your own research

## Content Overview

- Parallel computer architecture: 2-3 weeks
  - CPU, memory; shared-/distributed-memory parallel machines; network connections
- Parallel programming: 5 weeks
  - MPI; OpenMP; UPC
- Parallel numerical algorithms: 4 weeks
  - Matrix algorithms; direct/iterative solvers; eigensolvers; Monte Carlo methods (simulated annealing, genetic algorithms)
- Grid computing: 1 week
  - Globus, MPICH-G2

## What & Why

What is high performance computing (HPC)?

- The use of the most efficient algorithms on computers capable of the highest performance to solve the most demanding problems.

Why HPC?

- Large problems, spatially/temporally
  - A 10,000 x 10,000 x 10,000 grid has 10^12 grid points; with 4 double variables per point, that is 4x10^12 doubles, i.e. 32x10^12 bytes = 32 Tera-Bytes.
  - Usually need to simulate tens of millions of time steps.
- On-demand/urgent computing; real-time computing
- Weather forecasting; protein folding; turbulence simulations/CFD; aerospace structures; full-body simulation/digital human …

## HPC Examples: Blood Flow in the Human Vascular Network

- Cardiovascular disease accounts for about 50% of deaths in the western world
- Formation of arterial disease is strongly correlated with blood flow patterns
- Computational challenges: enormous problem size
  - In one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, a quarter of the distance between the moon and the earth
  - Blood flow involves multiple scales

## HPC Examples

- Earthquake simulation: surface velocity 75 sec after the earthquake
- Flu pandemic simulation: 300 million people tracked; density of infected population 45 days after the outbreak

## HPC Example: Homogeneous Turbulence

Direct Numerical Simulation of Homogeneous Turbulence on a 4096^3 grid.

(Figures: vorticity iso-surface, with successive zoom-ins.)

## How HPC Fits into Scientific Computing

Physical processes → mathematical models → numerical solutions → data visualization, validation, physical insight.

Example: air flow around an airplane (physical process) → Navier-Stokes equations (mathematical model) → algorithms, BCs, solvers (numerical solution) → application codes and supercomputers → viz software. HPC supplies the application codes, supercomputers, and visualization software.

## Performance Metrics

- FLOPS, or FLOP/S: FLoating-point Operations Per Second
- MFLOPS: MegaFLOPS, 10^6 flops
- GFLOPS: GigaFLOPS, 10^9 flops: a home PC
- TFLOPS: TeraFLOPS, 10^12 flops: present-day supercomputers (www.top500.org)
- PFLOPS: PetaFLOPS, 10^15 flops: expected by 2011
- EFLOPS: ExaFLOPS, 10^18 flops: expected by 2020
- MIPS = Mega Instructions Per Second = MegaHertz (if 1 instruction per cycle)
- Note: the von Neumann computer: 0.00083 MIPS

## Performance Metrics

- Theoretical peak performance R_theor: maximum FLOPS a machine can reach in theory
  - R_theor = clock_rate x no_cpus x no_FPU/CPU
  - Example: 3 GHz, 2 CPUs, 1 FPU/CPU: R_theor = 3x10^9 x 2 = 6 GFLOPS
- Real performance R_real: FLOPS for specific operations, e.g. vector multiplication
- Sustained performance R_sustained: performance on a full application, e.g. CFD
- R_sustained << R_real << R_theor; it is not uncommon that R_sustained < 10% of R_theor

## Top 10 Supercomputers

www.top500.org, November 2007: LINPACK performance, R_real vs. R_theor. (Table not reproduced.)

## Number of Processors

## Fastest Supercomputers

Top500 chart (www.top500.org): present systems and projections, from EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White (Pacific), the Japanese Earth Simulator, and IBM BG/L (plus “my laptop” for scale), spanning the scalar, vector, super scalar, and parallel eras, 1950-2010, from 1 KFlop/s to 1 PFlop/s.

Peak performance milestones (Floating Point operations / second, Flop/s):

| Year | Flop/s |
| --- | --- |
| 1941 | 1 |
| 1945 | 100 |
| 1949 | 1,000 (1 KiloFlop/s, KFlop/s) |
| 1951 | 10,000 |
| 1961 | 100,000 |
| 1964 | 1,000,000 (1 MegaFlop/s, MFlop/s) |
| 1968 | 10,000,000 |
| 1975 | 100,000,000 |
| 1987 | 1,000,000,000 (1 GigaFlop/s, GFlop/s) |
| 1992 | 10,000,000,000 |
| 1993 | 100,000,000,000 |
| 1997 | 1,000,000,000,000 (1 TeraFlop/s, TFlop/s) |
| 2000 | 10,000,000,000,000 |
| 2005 | 131,000,000,000,000 (131 TFlop/s) |

2X transistors/chip every 1.5 years: a growth factor of a billion in performance over a career.

## Japanese “Life Simulator” Effort for a 10 PFlop/s System

- From the Nikkei newspaper, May 30th morning edition
- Organized by NEC, Hitachi, U of Tokyo, Kyusyu U, and RIKEN
- Competition component similar to the DARPA HPCS program
- This year allocated about \$4 M each toward petascale
- Total of ¥100,000 M (\$909 M) will be invested in this development
- Planned to be operational in 2011

## Japan’s Life Simulator: Proposed Architecture

- Needs of multi-scale, multi-physics simulation: integration of multiple architectures
- A tightly-coupled heterogeneous computer; needs multiple computation components
- Present architecture: vector, scalar, and MD nodes each on their own faster interconnect, joined through a switch by a slower connection
- Proposed architecture: vector, scalar, MD, and FPGA nodes tightly coupled on a faster interconnect

## Major Applications of the Next Generation Supercomputer

Targeted as grand challenges:

- Basic concept for simulations in nano-science
- Basic concept for simulations in life sciences, spanning micro, meso, and macro scales: genome/genes (bio-MD), protein, cell, tissue (tissue structure, multi-physics, chemical processes), organ, vascular system (blood circulation), organism; applications include DDS, gene therapy, HIFU, micro-machines, and catheters

(Image credits: http://ridge.icu.ac.jp, http://info.med.vale.edu/, RIKEN)

## Petascale Era: 2008-

NCSA: Blue Waters, 1 PFlop/s, 2011

## Bell versus Moore

## Grand Challenge Applications

## The von Neumann Computer

Walk-through: c = a + b

1. Get next instruction
2. Decode: fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b
9. Add a and b (c in register)
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory

Note: some units are idle while others are working … a waste of cycles.

Pipelining (modularization) & caching (advance decoding) … parallelism

## Basic Architecture

- CPU, pipelining
- Memory hierarchy, cache

## Computer Performance

- The CPU operates on data; if no data is available, the CPU has to wait (stall)
- Typical workstation: 3.2 GHz CPU, 667 MHz memory; memory is about 5 times slower
- Moore’s law: CPU speed doubles every 18 months; memory speed increases much, much more slowly
- A fast CPU requires sufficiently fast memory
- Rule of thumb: memory size in GB = R_theor in GFLOPS
  - 1 CPU cycle (1 FLOP) handles about 1 byte of data
  - 1 MFLOPS needs 1 MB of data/memory; 1 GFLOPS needs 1 GB of data/memory
- Many “tricks” designed for performance improvement target the memory

## CPU Performance

- Computer time is measured in terms of CPU cycles
- The minimum time to execute 1 instruction is 1 CPU cycle
- Time to execute a given program: t = n_c x t_c = n_i x CPI x t_c
  - n_c: total number of CPU cycles
  - n_i: total number of instructions
  - CPI = n_c/n_i: average cycles per instruction
  - t_c: cycle time; at 1 GHz, t_c = 1/(10^9 Hz) = 10^(-9) sec = 1 ns

## To Make a Program/Computer Faster…

- Reduce cycle time t_c: increase the clock frequency; however, there is a physical limit
  - In 1 ns, light travels 30 cm; currently ~GHz: at 3 GHz, light travels 10 cm within 1 cpu cycle, so the length/size must be < 10 cm
- Reduce the number of instructions n_i:
  - More efficient algorithms
  - Better compilers
- Reduce CPI: the key is parallelism
  - Instruction-level parallelism: pipelining technology
  - Internal parallelism: multiple functional units; superscalar processors; multi-core processors
  - External parallelism: multiple CPUs, parallel machines
## Processor Types

- Vector processor
  - Cray X1/T90; NEC SX#; Japan Earth Simulator; early Cray machines; Japan Life Simulator (hybrid)
- Scalar processor
  - CISC: Complex Instruction Set Computer: Intel 80x86 (IA32)
  - RISC: Reduced Instruction Set Computer: Sun SPARC, IBM Power #, SGI MIPS
  - VLIW: Very Long Instruction Word; explicitly parallel instruction computing (EPIC); probably dying: Intel IA64 (Itanium)

## CISC Processor

- Complex instructions; a large number of instructions; can complete more complicated functions at the instruction level
- An instruction actually invokes microcode; microcodes are small programs in processor memory
- Slower; many instructions access memory; varying instruction length allows no pipelining

## RISC Processor

- No microcode
- Simple instructions; fewer instructions; fast
- Only load and store instructions access memory
- Common instruction word length allows pipelining
- Almost all present-day high performance computers use RISC processors

## Locality of References

- Spatial/temporal locality:
  - If the processor executes an instruction at time t, it is likely to execute an adjacent/next instruction at t + delta_t
  - If the processor accesses a memory location/data item x at time t, it is likely to access x + delta_x at t + delta_t
- Pipelining, caching, and many other techniques are all based on the locality of references

## Pipelining

- Overlapping execution of multiple instructions: 1 instruction per cycle
- Sub-divide an instruction into multiple stages; the processor handles different stages of adjacent instructions simultaneously
- Suppose 4 stages in an instruction:
  - Instruction fetch and decode (IF)
  - Read data (RD)
  - Execute (EX)
  - Write-back results (WB)
## Instruction Pipeline

| instruction \ cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | IF | RD | EX | WB | | | | | | |
| 2 | | IF | RD | EX | WB | | | | | |
| 3 | | | IF | RD | EX | WB | | | | |
| 4 | | | | IF | RD | EX | WB | | | |
| 5 | | | | | IF | RD | EX | WB | | |
| 6 | | | | | | IF | RD | EX | WB | |
| 7 | | | | | | | IF | RD | EX | WB |

- Depth of pipeline: number of stages in an instruction
- After the pipeline is full, 1 result per cycle! CPI = (n + depth - 1)/n
- With the pipeline, 7 instructions take 10 cycles; with no pipeline, 7 instructions take 28 cycles

## Inhibitors of Pipelining

Dependencies between instructions degrade pipeline performance:

- Control dependence
- Data dependence

## Control Dependence

- Branching: an instruction occurs after a conditional branch, so it is unknown beforehand whether that instruction will be executed
  - Loop: for(i=0;i<n;i++)…; do…enddo
  - Jump: goto …
  - Condition: if…else…, e.g. if(x>y) n=5;
- Branching in programs interrupts the pipeline: avoid excessive branching!

## Data Dependence

An instruction depends on data from a previous instruction:

    x = 3*j;
    y = x + 5.0; // depends on previous instruction

## Vector Pipeline

- Vector processors have vector registers which can hold an entire vector, e.g. of 128 elements
- Commonly encountered processors are scalar processors, e.g. in a home PC
- Efficient for loops involving vectors:

      for (i = 0; i < 128; i++)
          z[i] = x[i] + y[i];

- Instructions: Vector Load X; Vector Load Y; Vector Add; Vector Store Z

## Vector Pipeline

(Timing diagram: the loads of X and Y, the add, and the store of Z are each pipelined and overlapped; for 128-element vectors the sequence completes in about 133 cycles, roughly one result per cycle once the pipeline is full.)

## Vector Operations: Hockney’s Formulas

(Benchmark figure; cache: 64 KB)

## Exceeding Cache Size

- Cache: 32 KB; cache line: 64 bytes
- NOTE: asymptotic 5 MFlop/s, i.e. a result every 15 clocks: the time to reload a cache line following a miss

## Internal Parallelism

- Functional units: the components in a processor that actually do the work
  - Memory operations
  - Integer arithmetic (IU)
  - Floating-point arithmetic (FPU): floating-multiply, …
- Typical instruction latencies:

| Instruction type | Latency (cycles) |
| --- | --- |
| Integer | 1 |
| Floating-point add | 3 |
| Floating-point multiply | 3 |
| Floating-point divide | 31 |

- Division is much slower than add/multiply! Minimize or avoid divisions!

## Internal Parallelism

- Superscalar RISC processors: multiple functional units in the processor, e.g. multiple FPUs
- Capable of executing more than one instruction (producing more than one result) per cycle
- The units share registers, the L1 cache, etc.; memory must supply data to multiple functional units!
- Limiting factor: memory-processor bandwidth

## Internal Parallelism

- Multi-core processors, e.g. Intel dual-core
- Multiple execution cores (functional units, registers, L1 cache) on one CPU chip
- The cores share the L2 cache and memory
- Lower energy consumption
- Need FAST memory access to provide data to multiple cores; effective memory bandwidth per core is reduced
- Limiting factor: memory-processor bandwidth

## Heat Flux also Increases with Speed!

## New Processors are Too Hot!


## External Parallelism

Parallel machines: will be discussed later.

## Memory: Next Lecture

- Bit: 0 or 1; byte: 8 bits
- Memory sizes: PB = 10^15 bytes; TB = 10^12 bytes; GB = 10^9 bytes; MB = 10^6 bytes
- Memory performance measures:
  - Access time (response time, latency): interval between the time a memory request is issued and the time the request is satisfied
  - Cycle time: minimum time between two successive memory requests

(Timing diagram: a request is issued at t0 and satisfied at t1; the memory is free again at t2.)

- Access time: t1 - t0
- Cycle time: t2 - t0
- The memory is busy for t0 < t < t2; another request issued in that window gets no response and has to wait until t > t2