
CS591x - Cluster Computing and Parallel Programming

Parallel Computer Architecture
and Software Models

It’s all about performance

Greater performance is the reason for parallel
computing

Many types of scientific and engineering
programs are too large and too complex for
traditional uniprocessors

Such large problems are common in:

ocean modeling, weather modeling, astrophysics, solid state physics, power systems, …

FLOPS - a measure of performance

FLOPS - Floating Point Operations per Second

… a measure of how much computation can be done in a certain amount of time


MegaFLOPS - MFLOPS - 10^6 FLOPS

GigaFLOPS - GFLOPS - 10^9 FLOPS

TeraFLOPS - TFLOPS - 10^12 FLOPS

PetaFLOPS - PFLOPS - 10^15 FLOPS
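To make the unit concrete, here is a minimal C sketch (added for illustration; the loop length and use of clock() are arbitrary choices, not from the slides) that times 10^8 floating point additions and reports the rate in MFLOPS:

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const long n = 100000000L;   /* 10^8 floating point additions */
        volatile double x = 0.0;     /* volatile keeps the loop from being optimized away */
        clock_t start = clock();
        for (long i = 0; i < n; i++)
            x += 1.0;                /* one floating point operation per iteration */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("~%.1f MFLOPS\n", n / secs / 1e6);
        return 0;
    }

Real rankings such as Top500 use the far more careful LINPACK benchmark; this sketch only shows what the unit means.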

How fast …

Cray 1 - ~150 MFLOPS

Pentium 4 - 3-6 GFLOPS

IBM’s BlueGene - +70 TFLOPS

PSC’s Big Ben - 10 TFLOPS

Humans - it depends

as calculators - 0.001 MFLOPS

as information processors - 10 PFLOPS

FLOPS vs. MIPS

FLOPS is concerned only with floating point calculations

other performance issues:

memory latency

cache performance

I/O capacity




See…

www.Top500.org


biannual performance reports and …


rankings of the fastest computers in the
world

Performance

Speedup(n processors) = time(1 processor) / time(n processors)

** Culler, Singh and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
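As a small added illustration (the timings are made-up numbers, not from the slides), a C helper that computes speedup and the related parallel efficiency:

    #include <stdio.h>

    /* Speedup = time(1 processor) / time(n processors);
       efficiency = speedup / n, where 1.0 is perfect linear scaling. */
    static double speedup(double t1, double tn) { return t1 / tn; }

    int main(void) {
        double t1 = 120.0, t8 = 18.0;   /* hypothetical timings, in seconds */
        int n = 8;
        double s = speedup(t1, t8);
        printf("speedup = %.2f, efficiency = %.2f\n", s, s / n);  /* 6.67, 0.83 */
        return 0;
    }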

Consider…

from: www.lib.utexas.edu/maps/indian_ocean.html

… a model of the Indian Ocean - 73,000,000 square kilometers

One data point per 100 meters - 7,300,000,000 surface points

Need to model the ocean at depth - say, every 10 meters down to 200 meters - 20 depth data points

Every 10 minutes for 4 hours - 24 time steps

So

73 × 10^6 (sq. km of surface) × 10^2 (points per sq. km) × 20 (depth points) × 24 (time steps)

= 3,504,000,000,000 data points in the model grid

Suppose 100 instructions per grid point

= 350,400,000,000,000 instructions in the model

Then -

Imagine that you have a computer that can run 1 billion (10^9) instructions per second

3.504 × 10^14 / 10^9 = 350,400 seconds

or about 97 hours

But

On a 10 teraflops computer:

3.504 × 10^14 / 10^13 = 35.04 seconds
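The arithmetic above can be reproduced with a short C sketch (added for illustration):

    #include <stdio.h>

    int main(void) {
        double surface_km2  = 73e6;   /* Indian Ocean surface, sq. km */
        double pts_per_km2  = 100;    /* one point per 100 m = 10 x 10 per sq. km */
        double depth_points = 20;     /* every 10 m down to 200 m */
        double time_steps   = 24;     /* every 10 minutes for 4 hours */
        double instr_per_pt = 100;

        double grid   = surface_km2 * pts_per_km2 * depth_points * time_steps;
        double instrs = grid * instr_per_pt;

        printf("grid points:  %.4g\n", grid);    /* 3.504e12 */
        printf("instructions: %.4g\n", instrs);  /* 3.504e14 */
        printf("at 10^9 instr/s: %.0f s (%.1f hours)\n",
               instrs / 1e9, instrs / 1e9 / 3600);
        printf("at 10 TFLOPS:    %.2f s\n", instrs / 1e13);
        return 0;
    }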

Gaining performance

Pipelining

More instructions, faster

More instructions in execution at the same time in a single processor

Not usually an attractive strategy these days

why?

Instruction Level Parallelism
(ILP)

based on the fact that many instructions do not depend on the instructions that precede them…

The processor has extra hardware to execute several instructions at the same time

…multiple adders…
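For example (an illustrative fragment, not from the slides), the first four statements below have no data dependences on one another, so a superscalar core with multiple functional units can issue them in the same cycle; the last statement depends on all four and must wait:

    /* Illustrative only: a, b, c, d are mutually independent. */
    double ilp_demo(double x1, double y1, double x2, double y2,
                    double x3, double y3, double x4, double y4) {
        double a = x1 + y1;   /* can issue together ... */
        double b = x2 + y2;
        double c = x3 * y3;
        double d = x4 - y4;
        return (a + b) + (c + d);   /* ... but this needs all four results */
    }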

Pipelining and ILP are not the solution to our problem

why?

only incremental improvements in performance

already being done

we need orders of magnitude improvements in performance

Gaining Performance

Vector Processors

Scientific and engineering computations are often vector and matrix operations

graphic transformations - e.g. shift object x to the right

Replicated arithmetic hardware and vector registers operate on an entire vector in one step (SIMD)
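As a concrete sketch (assuming an x86 machine with SSE; the function is hypothetical, not from the slides), the loop below adds four floats with each instruction, the same idea a vector processor applies to entire vector registers:

    #include <immintrin.h>   /* x86 SSE intrinsics */

    /* Add two float arrays; n is assumed to be a multiple of 4. */
    void vec_add(const float *x, const float *y, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(&x[i]);            /* load 4 floats */
            __m128 b = _mm_loadu_ps(&y[i]);
            _mm_storeu_ps(&out[i], _mm_add_ps(a, b));  /* 4 additions in one step */
        }
    }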


Gaining Performance

Vector Processors

Declining popularity for a while - hardware expensive

Popularity returning

Applications - science, engineering, cryptography, media/graphics

Earth Simulator

Parallel Computer Architecture

Shared Memory Architectures

Distributed Memory Architectures

Shared Memory Systems

Multiple processors connected to, and sharing, the same pool of memory (SMP)

Every processor has, potentially, access to and control of every memory location
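In software, this is the model OpenMP exposes; a minimal sketch (illustrative, not from the slides) in which every thread reads and writes the same shared array (compile with, e.g., gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 1000000 };
        static double a[N];            /* one array, visible to every thread */
        #pragma omp parallel for       /* iterations split across threads,
                                          all touching the same shared memory */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        printf("up to %d threads shared the array\n", omp_get_max_threads());
        return 0;
    }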

Shared Memory Computers

[Diagram: six processors connected to a single shared memory]

Shared Memory Computers

[Diagram: three processors connected to three shared memory banks]

Shared Memory Computer

[Diagram: three processors connected to three memory banks through a switch]

Shared Memory Computers

SGI Origin2000 at NCSA (Balder)

256 250 MHz R10000 processors

128 GB of memory


Shared Memory Computers

Rachel at PSC

64 1.15 GHz EV7 processors

256 GB of shared memory

Distributed Memory Systems

Multiple processors, each with its own memory

Interconnected to share/exchange data and processing

The modern architectural approach to supercomputers
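In software this maps to message passing; a minimal MPI sketch (illustrative, not from the slides) in which each process owns its own data and results are combined only through explicit communication:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = rank + 1.0;   /* lives in this process's private memory */
        double total = 0.0;
        /* No shared memory: data moves only via explicit messages. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum over %d processes = %.1f\n", size, total);
        MPI_Finalize();
        return 0;
    }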

Supercomputers and Clusters - similar

Clusters - distributed memory

[Diagram: a cluster - six nodes, each a processor with its own local memory, connected by an interconnect]

Distributed Memory with SMP

[Diagram: several two-processor SMP nodes, each pair sharing a local memory, with the nodes connected by an interconnect]

Distributed Memory Supercomputer

BlueGene/L (DOE/IBM)

0.7 GHz PowerPC 440

32,768 processors

70 TFLOPS

Distributed Memory Supercomputer

Thunder at LLNL

Number 5 on the Top500 list

20 TFLOPS

1.4 GHz Itanium processors

4096 processors

Grid Computing Systems

What is a Grid?


Means different things to different people

Distributed Processors


Around campus


Around the state


Around the world

Grid Computing Systems

Widely distributed

Loosely connected (e.g. the Internet)

No central management

Grid Computing Systems

Connected Clusters/other dedicated scientific
computers

I2/Abilene

Grid Computer Systems

[Diagram: a control/scheduler coordinating harvested idle cycles from machines across the Internet]

Grid Computing Systems

Dedicated Grids

TeraGrid

Sabre

NASA Information Power Grid

Cycle Harvesting Grids

Condor

GlobalGridForum (Parabon)

SETI@home

Let’s revisit speedup…

we can achieve speedup (theoretically) by using more processors…

but other factors may limit speedup…

Interprocessor communication

Interprocess synchronization

Load balance


Amdahl’s Law

According to Amdahl’s Law…

Speedup = 1 / (S + (1 - S)/N)

where

S is the purely sequential part of the program

N is the number of processors

Amdahl’s Law

What does it mean?

Part of a program is parallelizable

Part of the program must be sequential (S)

Amdahl’s law says:

Speedup is constrained by the portion of the program that must remain sequential, relative to the part that is parallelized.

Note: if S is very small, this is an “embarrassingly parallel” problem
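A short C sketch (added for illustration) that tabulates the formula; with S = 0.05, speedup approaches but never exceeds 1/S = 20, no matter how many processors are added:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (S + (1 - S) / N) */
    static double amdahl(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void) {
        double s = 0.05;                       /* 5% purely sequential */
        int procs[] = { 1, 8, 64, 512, 4096 };
        for (int i = 0; i < 5; i++)
            printf("N = %4d  speedup = %6.2f\n", procs[i], amdahl(s, procs[i]));
        return 0;
    }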

Software models for parallel
computing

Shared Memory

Distributed Memory

Data Parallel

Flynn’s Taxonomy

Single Instruction/Single Data - SISD

Multiple Instruction/Single Data - MISD

Single Instruction/Multiple Data - SIMD

Multiple Instruction/Multiple Data - MIMD

Single Program/Multiple Data - SPMD
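SPMD is the style most cluster codes use: one program, many copies, with behavior branched on the process rank. A minimal MPI sketch (illustrative, not from the slides):

    #include <stdio.h>
    #include <mpi.h>

    /* SPMD: every process runs this same program; rank decides its role. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("rank 0: coordinating\n");    /* one copy acts as master */
        else
            printf("rank %d: working\n", rank);  /* the rest do the work */
        MPI_Finalize();
        return 0;
    }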

Next

Cluster Computer Architecture

Linux