# PARALLEL AND DISTRIBUTED COMPUTING OVERVIEW Fall 2003


Parallel Computers

1


TOPICS:

Parallel computing requires an understanding of
parallel algorithms, parallel languages, and parallel
architectures, all of which are covered in this class
through the following topics.

Fundamental concepts in parallel computation.

Synchronous Computation

SIMD, Vector, Pipeline Computing

Associative Computing

Fortran 90

ASC Language

MultiC Language

Asynchronous (MIMD or Multiprocessors) Shared
Memory Computation

OpenMP language

Distributed Memory MIMD/Multiprocessor Computation

Sometimes called Multicomputers

Programming using Message Passing

MPI language

Interconnection Networks (SIMD and MIMD)

Comparison of MIMD and SIMD Computation in Real
Time Computation Applications

Five or six homework assignments

Midterm and Final Examination

Grading: Homework 50%, Midterm 20%, Final 30%


Introduction to Parallel Computing

(Chapter One)

References: [1]-[4] given below.

1. Chapter 1, “Parallel Programming” by Wilkinson, et al.

2. Chapter 1, “Parallel Computation” by Akl

3. Chapters 1-2, “Parallel Computing” by Quinn, 1994

4. Chapter 2, “Parallel Processing & Parallel Algorithms” by Roosta

Need for Parallelism

Numerical modeling and simulation of
scientific and engineering problems.

Command & Control problems like ATC (air traffic control).

Grand Challenge Problems

Sequential solutions may take months or
years.

Weather Prediction - Grand Challenge Problem

Atmosphere is divided into 3D cells.

Data such as temperature, pressure, humidity,
wind speed and direction, etc. are recorded at
regular time intervals in each cell.

10⁸ cells of (1 mile)³.

It would take a modern computer over 100
days to perform the necessary calculations for a
10-day forecast.

Parallel Programming - a viable way to increase
computational speed.

The overall problem can be split into parts, each of
which is solved by a single processor.


Ideally, n processors would have n times the
computational power of one processor, with
each doing 1/n-th of the computation.

Such gains in computational power are rare, due
to reasons such as

Inability to partition the problem perfectly
into n parts of the same computational size.

Necessary data transfer between processors

Necessary synchronization of processors

Two major styles of partitioning problems

(Job) Control parallel programming

The problem is divided into different, non-identical
tasks that have to be performed.

The tasks are divided among the processors
so that their workload is roughly balanced.

This is considered to be coarse grained
parallelism.

Data parallel programming

Each processor performs the same
computation on different data sets.

Computations do not necessarily have to be
synchronous.

This is considered to be fine grained
parallelism.
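The two partitioning styles can be sketched with Python's thread pools (an illustrative example, not from the text; the task functions and data are hypothetical): control parallelism hands different, non-identical tasks to the workers, while data parallelism applies the same function to different slices of the data.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(16))

# Control (job) parallelism: different, non-identical tasks are
# divided among the workers.
def task_sum(d):        return sum(d)
def task_max(d):        return max(d)
def task_count_even(d): return sum(1 for x in d if x % 2 == 0)

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(t, data) for t in (task_sum, task_max, task_count_even)]
    control_results = [f.result() for f in futures]

# Data parallelism: the same computation on different data sets
# (the same squaring function applied to four slices of the array).
def square_all(chunk):
    return [x * x for x in chunk]

chunks = [data[i::4] for i in range(4)]   # 4 "processors", 1/4 of the data each
with ThreadPoolExecutor(max_workers=4) as pool:
    data_results = list(pool.map(square_all, chunks))

print(control_results)   # [120, 15, 8]
```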


Shared Memory Multiprocessors (SMPs)

The processors access memory through some type
of interconnection network.

This type of memory access is called uniform
memory access (UMA).

A data parallel programming language, based on a
language like FORTRAN or C/C++, may be
available.

Alternately, programming using threads is
sometimes used.

More programming details will be discussed later.

It is difficult for the SMP architecture to provide fast
memory access for all processors as the number of
processors grows; this leads to having hierarchical
or distributed memory systems.

This type of memory access is called
nonuniform memory access (NUMA).

Normally, fast cache is used with NUMA systems
to reduce the problem of different memory access
times for PEs.

This creates the problem of ensuring that all
copies of the same data in different memory
locations are kept identical.

Numerous complex algorithms have been
designed for this problem.


Message-Passing Multiprocessors
(Multicomputers)

Processors are connected by an interconnection
network (which will be discussed later in the chapter).

Each processor has a local memory and can only
access its own local memory.

Data is passed between processors using messages,
as dictated by the program.

Note:

If the processors run in SIMD mode (i.e.,
synchronously), then the movement of the data
over the network can be synchronous:

Movement of the data can be controlled by
program steps.

Much of the message-routing and hot-spot
congestion overhead can be avoided.

A common approach to programming
multiprocessors is to use message-passing library
routines in addition to conventional sequential
programs (e.g., MPI, PVM)

The problem is divided into processes that can be
executed concurrently on individual processors. A
processor is normally assigned multiple processes.

Multicomputers can be scaled to larger sizes much
more easily than shared memory multiprocessors.
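The message-passing model can be imitated in a few lines using Python threads and queues standing in for processors and the interconnection network (a toy sketch; real multicomputer codes would use library routines such as MPI or PVM):

```python
import threading
import queue

# Each "processor" has only local memory; data moves via explicit messages.
# Two workers each sum half of an array; worker 1 sends its partial sum
# to worker 0, which combines the results (like a send/receive pair).
def worker(rank, local_data, inbox, neighbor_inbox, results):
    partial = sum(local_data)            # compute on local memory only
    if rank == 1:
        neighbor_inbox.put(partial)      # explicit "send" to processor 0
    else:
        other = inbox.get()              # explicit "receive" from processor 1
        results["total"] = partial + other

data = list(range(100))
inbox0, inbox1 = queue.Queue(), queue.Queue()
results = {}
t0 = threading.Thread(target=worker, args=(0, data[:50], inbox0, inbox1, results))
t1 = threading.Thread(target=worker, args=(1, data[50:], inbox1, inbox0, results))
t0.start(); t1.start(); t0.join(); t1.join()
print(results["total"])   # 4950
```

Note that the data is copied into each worker's message rather than shared, mirroring the data-copying disadvantage discussed next.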


Multicomputers (cont.)

Disadvantages of message-passing

Programmers must make explicit message-passing
calls in the code.

This is low-level programming and is error
prone.

Data is not shared but copied, which increases
the total data size.

Data integrity: difficulty in maintaining
correctness of multiple copies of a data item.

Advantages of message-passing

Allows different PCs to operate on the same
data independently.

Allows PCs on a network to be easily upgraded
when faster processors become available.

Mixed “distributed shared memory” systems

Lots of current interest in a cluster of SMPs.

See Dr. David Bader’s or Dr. Joseph JaJa’s
website

Other mixed systems have been developed.


Flynn’s Classification Scheme

SISD - single instruction stream, single data stream

Primarily sequential processors

MIMD - multiple instruction stream, multiple data
stream.

Includes SMPs and multicomputers.

Processors are asynchronous, since they can
independently execute different programs on
different data sets.

Considered by most researchers to contain the
most powerful, least restricted computers.

Have very serious message-passing (or shared-memory)
overheads that are often ignored when
computing algorithmic complexity, unlike
with SIMDs.

May be programmed using a multiple-program,
multiple-data (MPMD) technique.

A common way to program MIMDs is to use a
single-program, multiple-data (SPMD) method.

This is the normal technique when the number of
processors is large.

Data Parallel programming style for MIMDs

SIMD: single instruction and multiple data streams.

One instruction stream is broadcast to all
processors.


Flynn’s Taxonomy (cont.)

SIMD (cont.)

Each processor (also called a processing
element or PE) is very simple and is
essentially an ALU;

PEs do not store a copy of the program nor
have a program control unit.

Individual processors can be inhibited from
participating in an instruction (based on a data
test).

All active processors execute the same
instruction synchronously, but on different data
On a memory access, all active processors must
access the same location in their local memory.

The data items form an array and an instruction
can act on the complete array in one cycle.
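The SIMD execution style, including inhibited PEs, can be mimicked in plain Python (a hypothetical simulation of the behavior described above, not real SIMD hardware):

```python
# Toy SIMD simulation: each PE holds one data item; the control unit
# "broadcasts" one instruction, and a boolean mask inhibits some PEs.
def simd_step(instruction, pe_data, active):
    return [instruction(x) if on else x          # inhibited PEs keep old values
            for x, on in zip(pe_data, active)]

pe_data = [3, -1, 4, -5, 9]
# A data test inhibits the PEs holding negative values...
active = [x >= 0 for x in pe_data]
# ...then all active PEs execute the same instruction synchronously.
pe_data = simd_step(lambda x: x * 10, pe_data, active)
print(pe_data)   # [30, -1, 40, -5, 90]
```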

MISD - multiple instruction streams, single data
stream.

This category is not used very often.

Some include pipelined architectures in this
category.


Interconnection Network Overview

References: Texts [1]-[4] discuss these network examples, but
reference [3] (Quinn) is particularly good.

Only an overview of interconnection networks is
included here. It will be covered in greater depth later.

The PEs (processing elements) are called nodes.

A link is the connection between two nodes.

Links are either bidirectional or use two
directional links.

Either one wire to carry one bit, or parallel wires (one
wire for each bit in a word), can be used.

The above choices do not have a major impact on the
concepts presented in this course.

The diameter is the minimal number of links between
the two farthest nodes in the network.

The diameter of a network gives the maximal
distance a single message may have to travel.

Completely Connected Network

Each of n nodes has a link to every other node.

Requires n(n − 1)/2 links.

Impractical, unless there are very few processors.

Line/Ring Network

A line consists of a row of n nodes, with a connection
between adjacent nodes.

Called a ring when a link also connects the two
end nodes of a line.

The line/ring networks have many applications.

The diameter of a line is n − 1, and of a ring is ⌊n/2⌋.


The Mesh Interconnection Network

The nodes are in rows and columns in a rectangle.

The nodes are connected by links that form a 2D
mesh. (Give diagram on board.)

Each interior node in a 2D mesh is connected to its
four nearest neighbors.

A square mesh with n nodes has √n rows and √n
columns.

The diameter of a √n × √n mesh is 2(√n − 1).

If the horizontal and vertical ends of a mesh are
connected to the opposite sides, the network is called
a torus.

Meshes have been used more on actual computers
than any other network.

A 3D mesh is a generalization of a 2D mesh and has
been used in several computers.

The fact that 2D and 3D meshes model physical
space makes them useful for many scientific and
engineering problems.

Binary Tree Network

A binary tree network is normally assumed to be a
complete binary tree.

It has a root node, and each interior node has two
links connecting it to nodes in the level below it.

The height of the tree is ⌊lg n⌋ and its diameter is
2⌊lg n⌋.
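The diameter formulas above can be collected into a small checker (a hypothetical helper; the mesh case assumes n is a perfect square and the tree case a complete binary tree):

```python
import math

# Diameters from the formulas above; n is the number of nodes.
def diameters(n):
    return {
        "completely connected": 1,               # direct link between any pair
        "line": n - 1,
        "ring": n // 2,                          # floor(n/2)
        "square mesh": 2 * (math.isqrt(n) - 1),  # 2(sqrt(n) - 1), n a perfect square
        "binary tree": 2 * (n.bit_length() - 1), # 2 * floor(lg n), complete tree
    }

print(diameters(16))
```

For example, a 7-node complete binary tree has height 2 and diameter 4 (leaf to leaf through the root), matching 2⌊lg 7⌋.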


Metrics for Evaluating Parallelism

References: All references cover most topics in this
section, and each has useful information not contained
in the others. Ref. [2, Akl] includes new research and is
the main reference used, although others (esp. [3,
Quinn] and [1, Wilkinson]) are also used.

Granularity: The amount of computation done between
communication or synchronization steps; it is
ranked as fine, intermediate, or coarse.

SIMDs are built for efficient communications and
handle fine-grained solutions well.

SMPs or message-passing MIMDs handle
communications less efficiently than SIMDs but
more efficiently than clusters, and can handle
intermediate-grained solutions well.

Clusters of workstations or distributed systems have
slower communications among PEs and are better
suited for coarse-grained computations.

For asynchronous computations, increasing the
granularity

reduces expensive communications,

reduces the costs of process creation,

but reduces the number of concurrent processes.


Parallel Metrics (continued)

Speedup

A measure of the increase in speed (i.e., the
reduction in running time) due to parallelism.

Based on running times, S(n) = t_s/t_p, where

t_s is the execution time on a single processor,
using the fastest known sequential algorithm,

t_p is the execution time using a parallel
processor.

In theoretical analysis, S(n) = t_s/t_p, where

t_s is the worst-case running time of the
fastest known sequential algorithm for the
problem,

t_p is the worst-case running time of the parallel
algorithm using n PEs.


Parallel Metrics (continued)

Linear Speedup is optimal for most problems

Claim: The maximum possible speedup for parallel
computers with n PEs for ‘normal problems’ is n.

Proof of claim

Assume a computation is partitioned perfectly
into n processes of equal duration.

Assume no overhead is incurred as a result of this
partitioning of the computation.

Then, under these ideal conditions, the parallel
computation will execute n times faster than the
sequential computation.

The parallel running time is t_s/n.

Then the parallel speedup of this computation is
S(n) = t_s/(t_s/n) = n.

We shall later see that this “proof” is not valid for
certain nonstandard problems.

Unfortunately, the best speedup possible for most
applications is much less than n, as the above
assumptions are usually invalid.

Usually some parts of programs are sequential
and only one PE is active.

Sometimes a large number of processors are idle
for certain portions of the program.

E.g., during parts of the execution, many PEs
may be waiting to receive or to send data.


Parallel Metrics (cont)

Superlinear speedup (i.e., when S(n) > n): Most texts
besides [2, 3] argue that

Linear speedup is the maximum speedup
obtainable.

The preceding “proof” is used to argue that
superlinearity is impossible.

Occasionally speedup that appears to be superlinear
may occur, but can be explained by other reasons
such as

the extra memory in the parallel system.

a sub-optimal sequential algorithm being used.

luck, in the case of an algorithm that has a random
aspect in its design (e.g., random selection)

Selim Akl has shown that for some less standard
problems, superlinear algorithms can be given.

Some problems cannot be solved without use
of parallel computation.

Some problems are natural to solve using
parallelism and sequential solutions are
inefficient.

The final chapter of Akl’s textbook and several
journal papers have been written to establish
that these claims are valid, but it may still be a long
time before they are fully accepted.

Superlinearity has been a hotly debated topic
for too long to be accepted quickly.


Amdahl’s Law

Assumes that the speedup is not superlinear; i.e.,

S(n) = t_s/t_p ≤ n

This assumption is only valid for traditional problems.

By Figure 1.29 in [1] (or slide #40), if f denotes the
fraction of the computation that must be sequential,

t_p ≥ f·t_s + (1 − f)·t_s/n

Substituting this inequality into the above
equation for S(n) and simplifying (see slide #41 or
book) yields

Amdahl’s “law”: S(n) ≤ 1/f, where f is as above.

See Slide #41 or Fig. 1.30 for related details.

Note that S(n) never exceeds 1/f and approaches 1/f
as n increases.

Example: If only 5% of the computation is serial,
the maximum speedup is 20, no matter how many
processors are used.
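The 5% example follows directly from the formula (a quick numerical check; the function name is mine):

```python
# Amdahl's "law" from above: with sequential fraction f,
# S(n) = n / (1 + (n - 1) * f), which can never exceed 1/f.
def amdahl_speedup(f, n):
    return n / (1 + (n - 1) * f)

# The 5% example: the maximum speedup approaches 1/0.05 = 20,
# no matter how many processors are used.
for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.05, n), 2))
```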

Observations on the limits Amdahl’s law places on
parallelism:

For a long time, Amdahl’s law was viewed as a
fatal limit to the usefulness of parallelism.

S(n) = n / (1 + (n − 1)f) ≤ 1/f


Amdahl’s law is valid, and some textbooks
discuss how it can be used to increase the
efficiency of many parallel algorithms.

It shows that efforts to further reduce the
fraction of the code that is sequential may pay
off in large performance gains.

Hardware that allows even a small decrease in
the percentage of work executed sequentially may
be considerably more efficient.

A key flaw in past arguments that Amdahl’s
law is a fatal limit to the future of parallelism is
revealed by

Gustafson’s Law: The proportion of the
computations that are sequential normally
decreases as the problem size increases.

Other limitations in applying Amdahl’s Law:

Its proof focuses on the steps in a particular
algorithm, and does not consider that other
algorithms with more parallelism may exist.

Amdahl’s law applies only to ‘standard’
problems where superlinearity doesn’t occur.

For more details on superlinearity, see [2]
“Parallel Computation: Models and Methods”,
Selim Akl, pgs 14-20 (Speedup Folklore
Theorem) and Chapter 12.


More Metrics for Parallelism

Efficiency is defined by

E = t_s / (t_p × n) = S(n) / n

Efficiency gives the percentage of full utilization
of the parallel processors on the computation,
assuming a speedup of n is the best possible.

Cost: The cost of a parallel algorithm or parallel
execution is defined by

Cost = (running time) × (Nr. of PEs) = t_p × n

Note that, equivalently, Cost = t_s / E.

Observe that

Cost allows the quality of parallel algorithms to
be compared to that of sequential algorithms.

Compare the cost of a parallel algorithm to the
running time of a sequential algorithm.

The advantage that parallel algorithms have
in using multiple processors is removed by
multiplying their running time by the
number n of processors they are using.

If a parallel algorithm requires exactly 1/n-th
the running time of a sequential algorithm,
then the parallel cost is the same as the
sequential running time.
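A minimal sketch tying the three metrics together (the times t_s = 1000 and t_p = 150 on n = 8 PEs are illustrative values, not from the text):

```python
# Speedup, efficiency, and cost from the definitions above.
def metrics(ts, tp, n):
    speedup = ts / tp
    efficiency = ts / (tp * n)     # equals speedup / n
    cost = tp * n                  # running time x number of PEs
    return speedup, efficiency, cost

S, E, cost = metrics(1000.0, 150.0, 8)
print(S, E, cost)                  # ~6.67x speedup, ~83% efficiency, cost 1200
# Cost-optimality check: the constant of proportionality k = cost / ts.
print(cost / 1000.0)               # k = 1.2
```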


More Metrics (cont.)

Cost-Optimal Parallel Algorithm: A parallel
algorithm for a problem is said to be cost-optimal if
its cost is proportional to the running time of an
optimal sequential algorithm for the same problem.

By proportional, we mean that

cost = t_p × n = k × t_s

where k is a constant. (See pg 67 of [1].)

Equivalently, a parallel algorithm is optimal if

parallel cost = O(f(t)),

where f(t) is the running time of an optimal
sequential algorithm.

In cases where no optimal sequential algorithm
is known, the “fastest known” sequential
algorithm is used instead.


Sieve of Eratosthenes

(A Data-Parallel vs Control-Parallel Example)

Reference: [3, Quinn, Ch. 1], pages 10-17

A prime number is a positive integer with exactly
two factors, itself and 1.

The sieve (pronounced “siv”) algorithm finds the prime
numbers less than or equal to some positive integer n.

Begin with a list of natural numbers

2, 3, 4, …, n

Remove composite numbers from the list by
striking multiples of 2, 3, 5, and successive
primes

After each striking, the next unmarked natural
number is prime

The sieve terminates after multiples of the largest
prime less than or equal to √n have been struck
from the list.

Sequential implementation uses 3 data structures:

A Boolean array indexed by the numbers being
sieved.

An integer holding the latest prime found so far.

A loop index that is incremented as multiples
of the current prime are marked as composite numbers.
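The three data structures map directly onto a short sequential implementation (a sketch; the variable names are mine, not Quinn's):

```python
def sieve(n):
    # Boolean array indexed by the numbers being sieved (True = unmarked).
    unmarked = [True] * (n + 1)
    unmarked[0] = unmarked[1] = False
    current_prime = 2                           # latest prime found so far
    while current_prime * current_prime <= n:
        index = current_prime * current_prime   # loop index for striking
        while index <= n:
            unmarked[index] = False             # strike out a composite
            index += current_prime
        # The next unmarked natural number is the next prime.
        current_prime += 1
        while not unmarked[current_prime]:
            current_prime += 1
    return [i for i in range(2, n + 1) if unmarked[i]]

print(sieve(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```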


A Control-Parallel Approach

Control parallelism involves applying a different
sequence of operations to different data elements

Useful for

Shared-memory MIMD

Distributed-memory MIMD

Asynchronous PRAM

Control-parallel sieve

Each processor works with a different prime,
and is responsible for striking multiples of that
prime and identifying a new prime number

Each processor starts marking…

Shared memory contains

the boolean array containing the numbers being
sieved,

an integer corresponding to the largest prime
found so far.

PEs’ local memories contain local loop indexes
keeping track of multiples of their current primes
(since each is working with a different prime).


A Control-Parallel Approach (cont.)

Problems and inefficiencies

Algorithm for Shared-Memory MIMD

1. A processor accesses the variable holding the
current prime,

2. searches for the next unmarked value, and

3. updates the current prime to that value.

Must avoid having two processors doing this at the
same time.

A processor could waste time sieving multiples of a
composite number.
How much speedup can we get?

Suppose n = 1000

Sequential algorithm

Time to strike out multiples of prime p is
⌊(n + 1 − p²)/p⌋

Multiples of 2: ⌊((1000+1) − 4)/2⌋ = ⌊997/2⌋ = 498

Multiples of 3: ⌊((1000+1) − 9)/3⌋ = ⌊992/3⌋ = 330

Total sum = 1411 (number of “steps”)

2 PEs give speedup 1411/706 = 2.00

3 PEs give speedup 1411/499 = 2.83

3 PEs require 499 strikeout time units, so no more
speedup is possible using additional PEs

Multiples of 2 dominate, with 498 strikeout
steps
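The per-prime strikeout counts can be checked directly from the formula above (a sketch; only strikeout steps are counted here):

```python
# Strikeout counts for the sequential sieve with n = 1000:
# striking multiples of p starts at p^2 and ends at or before n,
# so it takes floor((n + 1 - p^2) / p) steps.
n = 1000
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]   # primes <= sqrt(1000)

counts = {p: (n + 1 - p * p) // p for p in primes}
print(counts[2], counts[3])   # 498 and 330, matching the text
```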


A Data-Parallel Approach

Data parallelism refers to using multiple PEs to
apply the same sequence of operations to different
data elements.

Useful for the following types of parallel operations

SIMD

PRAM

Shared-memory MIMD,

Distributed-memory MIMD.

Generally not useful for pipeline operations.

A data-parallel sieve algorithm:

Each processor works with the same prime, and is
responsible for striking multiples of that prime
from a segment of the array of natural numbers

Assume we have k processors, where

k ≪ √n (i.e., k is much less than √n).

Each processor gets no more than ⌈n/k⌉
natural numbers

All primes less than √n, as well as the first
prime greater than √n, are in the list controlled
by the first processor

A Data-Parallel Approach (cont.)

Data-parallel sieve (cont.)

Distributed-memory MIMD algorithm

Processor 1 finds the next prime and broadcasts it
to all PEs

Each PE goes through its part of the array,
striking multiples of that prime (performing the
same operation)

This continues until the first processor reaches a
prime greater than √n

How much speedup can we get?

Suppose n = 1,000,000 and we have k PEs

There are 168 primes less than 1,000, the
largest of which is 997

Maximum execution time to strike out primes:

( ⌈⌈1,000,000/k⌉/2⌉ + ⌈⌈1,000,000/k⌉/3⌉ +
⌈⌈1,000,000/k⌉/5⌉ + … + ⌈⌈1,000,000/k⌉/997⌉ ) × etime

The sequential execution time is the above sum
with k = 1.

The communication time = 168(k − 1) × ctime.

We assume that each communication takes 100
times longer than the time to execute a strikeout.
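The execution-time model can be evaluated numerically (a sketch with etime = 1 and ctime = 100·etime per the stated assumption; the helper for generating the sieving primes is mine):

```python
import math

# Model of the data-parallel sieve times above.
n, etime, ctime = 1_000_000, 1, 100

def primes_upto(m):
    flags = [True] * (m + 1)
    flags[0:2] = [False, False]
    for i in range(2, math.isqrt(m) + 1):
        if flags[i]:
            flags[i * i :: i] = [False] * len(flags[i * i :: i])
    return [i for i, f in enumerate(flags) if f]

sieving_primes = primes_upto(999)        # the 168 primes below 1,000

def exec_time(k):
    chunk = math.ceil(n / k)             # each PE holds at most ceil(n/k) numbers
    compute = sum(math.ceil(chunk / p) for p in sieving_primes) * etime
    comm = 168 * (k - 1) * ctime         # one broadcast per prime
    return compute + comm

seq = exec_time(1)                       # k = 1: no communication
speedups = {k: seq / exec_time(k) for k in range(1, 25)}
best = max(speedups, key=speedups.get)
print(best)                              # speedup peaks at 11 PEs
```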


A Data-Parallel Approach (cont.)

How much speedup can we get? (cont.)

Speedup is not directly proportional to the
number of PEs; it is highest at 11 PEs

Computation time is inversely proportional
to the number of processors used

Communication time increases linearly

After 11 processors, the increase in the
communication time is higher than the
decrease in computation time, and total
execution time increases.

Study Figures 1-11 and 1-12 on pg 15 of [3].

Practically, the primes generated must be
stored on an external device

I/O time is constant because output must be
performed sequentially

This sequential code severely limits the
parallel speedup according to Amdahl’s law

the fraction of operations that must be
performed sequentially limits the maximum
speedup possible.

Parallel I/O is an important topic in parallel
computing.


The work metric and work optimal concept.

The speedup and slowdown results from [2].

Data parallel vs control parallel example

Possibly a simpler example than the sieve in
Quinn

Look at chapter 2 of [8, Jordan] for possible
new information.

