
PARALLEL AND DISTRIBUTED COMPUTING
OVERVIEW


Fall 2003

TOPICS:

Parallel computing requires an understanding of
parallel algorithms, parallel languages, and parallel
architectures, all of which are covered in this class
through the following topics.


Fundamental concepts in parallel computation.


Synchronous Computation


SIMD, Vector, Pipeline Computing


Associative Computing


Fortran 90


ASC Language


MultiC Language


Asynchronous (MIMD or Multiprocessors) Shared
Memory Computation


OpenMP language


Distributed Memory MIMD/Multiprocessor Computation


Sometimes called Multicomputers


Programming using Message Passing


MPI language


Interconnection Networks (SIMD and MIMD)


Comparison of MIMD and SIMD Computation in Real
Time Computation Applications


GRADING FOR PDC


Five or six homework assignments


Midterm and Final Examination


Grading: Homework 50%, Midterm 20%, Final 30%


Introduction to Parallel Computing

(Chapter One)


References: [1]-[4] given below.

1. Chapter 1, “Parallel Programming” by Wilkinson et al.

2. Chapter 1, “Parallel Computation” by Akl

3. Chapters 1-2, “Parallel Computing” by Quinn, 1994

4. Chapter 2, “Parallel Processing & Parallel Algorithms” by Roosta


Need for Parallelism


Numerical modeling and simulation of
scientific and engineering problems.


Solution for problems with deadlines


Command & Control problems like ATC (air traffic control).


Grand Challenge Problems


Sequential solutions may take months or
years.


Weather Prediction
-

Grand Challenge Problem


Atmosphere is divided into 3D cells.


Data such as temperature, pressure, humidity, wind speed and direction, etc. are recorded at regular time intervals in each cell.

There are about 5 × 10⁸ cells of (1 mile)³.

It would take a modern computer over 100 days to perform the necessary calculations for a 10-day forecast.


Parallel Programming
-

a viable way to increase
computational speed.


The overall problem can be split into parts, each of which is solved by a single processor.



Ideally, n processors would have n times the computational power of one processor, with each doing 1/n of the computation.


Such gains in computational power are rare, due to reasons such as


Inability to partition the problem perfectly
into n parts of the same computational size.


Necessary data transfer between processors


Necessary synchronizing of processors


Two major styles of partitioning problems


(Job) Control parallel programming


Problem is divided into the different, non
-
identical tasks that have to be performed.


The tasks are divided among the processors
so that their work load is roughly balanced.


This is considered to be
coarse grained

parallelism.


Data parallel programming


Each processor performs the same
computation on different data sets.


Computations do not necessarily have to be
synchronous.


This is considered to be
fine grained

parallelism.



Shared Memory Multiprocessors (SMPs)


All processors have access to all memory locations .


The processors access memory through some type
of interconnection network.


This type of memory access is called
uniform
memory access

(UMA) .


A data parallel programming language, based on a
language like FORTRAN or C/C++ may be
available.


Alternately, programming using threads is sometimes used (see the sketch at the end of this list).


More programming details will be discussed later.


The difficulty for the SMP architecture to provide fast access to all memory locations results in most SMPs having hierarchical or distributed memory systems.


This type of memory access is called
nonuniform memory access

(NUMA).


Normally, fast cache is used with NUMA systems
to reduce the problem of different memory access
time for PEs.


This creates the problem of ensuring that all copies of the same data in different memory locations are identical.


Numerous complex algorithms have been
designed for this problem.
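To make the threads / data-parallel bullets above concrete, here is a minimal OpenMP sketch in C (my illustration, not taken from the course materials; the array and loop are hypothetical). Each thread applies the same operation to a different chunk of a shared array.

    /* Minimal shared-memory, data-parallel sketch: the loop iterations are
     * divided among the threads; the array is shared and the loop index is
     * private to each thread.  Compile with an OpenMP flag, e.g. cc -fopenmp. */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;            /* same operation, different data */

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }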


Message
-
Passing Multiprocessors
(Multicomputers)


Processors are connected by an interconnection
network (which will be discussed later in chapter).


Each processor has a local memory and can only
access its own local memory.


Data is passed between processors using messages,
as dictated by the program.


Note:

If the processors run in SIMD mode (i.e., synchronously), then the movement of data over the network can be synchronous:


Movement of the data can be controlled by
program steps.


Much of the message
-
passing overhead (e.g.,
routing, hot
-
spots, headers, etc.) can be
avoided.


A common approach to programming multiprocessors is to use message-passing library routines (e.g., MPI, PVM) in addition to conventional sequential programs (a minimal MPI sketch appears at the end of this list).


The problem is divided into
processes

that can be
executed concurrently on individual processors. A
processor is normally assigned multiple processes.


Multicomputers can be scaled to larger sizes much more easily than shared memory multiprocessors.
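As a concrete illustration of the message-passing style just described (this is the sketch referred to in the MPI/PVM bullet above; it is my own minimal example, not course code), process 0 passes one integer to process 1, and every process runs the same program in the SPMD style discussed later.

    /* Minimal message-passing sketch (SPMD): every process runs this same
     * program; process 0 sends one integer to process 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                          /* data to pass */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }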


Multicomputers (cont.)


Programming disadvantages of message
-
passing


Programmers must make explicit message
-
passing calls in the code


This is low
-
level programming and is error
prone.


Data is not shared but copied, which increases
the total data size.


Data Integrity: difficulty in maintaining
correctness of multiple copies of data item.


Programming advantages of message
-
passing


No problem with simultaneous access to data.


Allows different PCs to operate on the same
data independently.


Allows PCs on a network to be easily upgraded
when faster processors become available.


Mixed “distributed shared memory” systems


Lots of current interest in a cluster of SMPs.


See Dr. David Bader’s or Dr. Joseph JaJa’s
website


Other mixed systems have been developed.


Flynn’s Classification Scheme


SISD
-

single instruction stream, single data stream


Primarily sequential processors


MIMD
-

multiple instruction stream, multiple data
stream.


Includes SMPs and multicomputers.


Processors are asynchronous, since they can independently execute different programs on different data sets.


Considered by most researchers to contain the
most powerful, least restricted computers.


Have very serious message-passing (or shared-memory) problems that are often ignored, compared to SIMDs, when computing algorithmic complexity


May be programmed using a multiple-program, multiple-data (MPMD) technique.


A common way to program MIMDs is to use a
single program, multiple data (SPMD) method


Normal technique when the number of processors is large.


Data Parallel programming style for MIMDs


SIMD: single instruction and multiple data streams.


One instruction stream is broadcast to all
processors.


Flynn’s Taxonomy (cont.)


SIMD (cont.)


Each processor (also called a processing element or PE) is very simple and is essentially an ALU;


PEs do not store a copy of the program nor
have a program control unit.


Individual processors can be inhibited from
participating in an instruction (based on a data
test).


All active processors execute the same instruction synchronously, but on different data


On a memory access, all active processors must
access the
same location

in their local memory.


The data items form an array and an instruction
can act on the complete array in one cycle.
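The masking idea above can be pictured with a short sequential C sketch (an illustration only; it is not SIMD hardware or any particular SIMD language). Each array element stands for one PE's local data, and the data test decides which "PEs" take part in the broadcast instruction.

    /* Pretend each array element is one PE's local datum.  One "broadcast"
     * instruction (double the value) is executed only by the PEs whose data
     * passes the test a[i] > 0; the others are inhibited (masked off). */
    #include <stdio.h>

    #define PES 8

    int main(void)
    {
        int a[PES] = {3, -1, 4, -1, 5, -9, 2, 6};

        for (int i = 0; i < PES; i++)
            if (a[i] > 0)        /* the mask: only active PEs participate */
                a[i] *= 2;

        for (int i = 0; i < PES; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }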



MISD
-

Multiple Instruction streams, single data
stream.


This category is not used very often.


Some include pipelined architectures in this
category.


Interconnection Network Overview

References:
Texts [1
-
4] discuss these network examples, but
reference 3 (Quinn) is particularly good.



Only an overview of interconnection networks is
included here. It will be covered in greater depth later.


The PEs (processing elements) are called
nodes
.


A link is the connection between two nodes.

Links can be bidirectional, or two unidirectional links can be used.

Either one wire to carry one bit or parallel wires (one wire for each bit in a word) can be used.


The above choices do not have a major impact on the
concepts presented in this course.


The
diameter

is the minimal number of links between
the two farthest nodes in the network.


The diameter of a network gives the maximal
distance a single message may have to travel.


Completely Connected Network


Each of n nodes has a link to every other node.


Requires n(n - 1)/2 links


Impractical, unless very few processors


Line/Ring Network


A
line

consists of a row of n nodes, with connection
to adjacent nodes.


Called a
ring

when a link is added to connect the two
end nodes of a line.


The line/ring networks have many applications.


Diameter of a line is n - 1 and of a ring is ⌊n/2⌋.




The Mesh Interconnection Network



The nodes are in rows and columns in a rectangle.


The nodes are connected by links that form a 2D
mesh. (Give diagram on board.)


Each interior node in a 2D mesh is connected to its
four nearest neighbors.


A square mesh with n nodes has √n rows and √n columns.

The diameter of a √n × √n mesh is 2(√n - 1)


If the horizontal and vertical ends of a mesh are connected to the opposite sides, the network is called a torus.


Meshes have been used more on actual computers
than any other network.


A 3D mesh is a generalization of a 2D mesh and has
been used in several computers.


The fact that 2D and 3D meshes model physical space makes them useful for many scientific and engineering problems.


Binary Tree Network


A
binary tree

network is normally assumed to be a
complete binary tree.


It has a root node, and each interior node has two
links connecting it to nodes in the level below it.


The height of the tree is ⌈lg n⌉ and its diameter is 2⌈lg n⌉.
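The diameter formulas above can be tied together with a small C sketch (mine, not from the texts); the value n = 64 and the assumption that n is a perfect square (for the mesh) are illustrative only.

    /* Evaluate the diameter formulas quoted above for a network of n nodes.
     * Link with the math library (e.g. cc file.c -lm). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int n = 64;                                  /* example size; a perfect square */
        int line  = n - 1;                           /* line:  n - 1                   */
        int ring  = n / 2;                           /* ring:  floor(n/2)              */
        int mesh  = 2 * ((int)sqrt((double)n) - 1);  /* sqrt(n) x sqrt(n) mesh         */
        int tree  = 2 * (int)ceil(log2((double)n));  /* complete binary tree           */
        int links = n * (n - 1) / 2;                 /* completely connected network   */

        printf("n=%d: line=%d ring=%d mesh=%d tree=%d complete-net links=%d\n",
               n, line, ring, mesh, tree, links);
        return 0;
    }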




Metrics for Evaluating Parallelism

References:

All references cover most topics in this
section and have useful information not contained
in others. Ref. [2, Akl] includes new research and is
the main reference used, although others (esp. [3,
Quinn] and [1,Wilkinson]) are also used.


Granularity: the amount of computation done between communication or synchronization steps; it is ranked as fine, intermediate, or coarse.


SIMDs are built for efficient communications and
handle fine
-
grained solutions well.


SMPs or message-passing MIMDs handle communications less efficiently than SIMDs but more efficiently than clusters, and can handle intermediate-grained solutions well.


Clusters of workstations or distributed systems have slower communications among PEs and are better suited for coarse-grained computations.


For asynchronous computations, increasing the
granularity


reduces expensive communications


reduces costs of process creation


but reduces the number of concurrent processes


Parallel Metrics (continued)

Speedup


A measure of the increase in running time due to
parallelism.


Based on running times, S(n) = t_s / t_p, where

t_s is the execution time on a single processor, using the fastest known sequential algorithm

t_p is the execution time using a parallel processor.

In theoretical analysis, S(n) = t_s / t_p, where

t_s is the worst-case running time of the fastest known sequential algorithm for the problem

t_p is the worst-case running time of the parallel algorithm using n PEs.





Parallel Metrics (continued)


Linear Speedup is optimal for most problems


Claim:

The maximum possible speedup for parallel
computers with n PEs for ‘normal problems’ is
n.



Proof of claim


Assume a computation is partitioned perfectly
into
n

processes of equal duration.


Assume no overhead is incurred as a result of this
partitioning of the computation.


Then, under these ideal conditions, the parallel
computation will execute
n
times faster than the
sequential computation.


The parallel running time is t_s / n.

Then the parallel speedup of this computation is S(n) = t_s / (t_s / n) = n.


We shall later see that this “proof” is not valid for
certain types of nontraditional problems.


Unfortunately, the best speedup possible for most applications is much less than n, since

The above assumptions are usually invalid.


Usually some parts of programs are sequential
and only one PE is active.


Sometimes a large number of processors are idle
for certain portions of the program.


E.g., during parts of the execution, many PEs
may be waiting to receive or to send data.



Parallel Metrics (cont)

Superlinear speedup

(i.e., when
S(n) > n
): Most texts
besides [2,3] argue that


Linear speedup is the maximum speedup
obtainable.


The preceding “proof” is used to argue that
superlinearity is impossible.


Occasionally speedup that appears to be superlinear may occur, but it can be explained by other reasons such as

the extra memory in the parallel system.

a sub-optimal sequential algorithm being used.

luck, in the case of an algorithm that has a random aspect in its design (e.g., random selection)


Selim Akl has shown that for some less standard
problems, superlinear algorithms can be given.


Some problems cannot be solved without use
of parallel computation.


Some problems are natural to solve using
parallelism and sequential solutions are
inefficient.


The final chapter of Akl’s textbook and several journal papers have been written to establish that these claims are valid, but it may still be a long time before they are fully accepted.


Superlinearity has been a hotly debated topic
for too long to be accepted quickly.


Amdahl’s Law


Assumes that the speedup is not superlinear; i.e.,

S(n) = t_s / t_p ≤ n

Assumption only valid for traditional problems.

By Figure 1.29 in [1] (or slide #40), if f denotes the fraction of the computation that must be sequential,

t_p ≥ f · t_s + (1 - f) · t_s / n




Substituting the above inequality into the equation for S(n) and simplifying (see slide #41 or the book) yields

Amdahl’s “law”: S(n) ≤ 1/f, where f is as above.


See Slide #41 or Fig. 1.30 for related details.


Note that S(n) never exceeds 1/f and approaches 1/f as n increases.


Example: If only 5% of the computation is serial,
the maximum speedup is 20, no matter how many
processors are used.


Observations on Amdahl’s law as a limitation to parallelism:


For a long time, Amdahl’s law was viewed as a
fatal limit to the usefulness of parallelism.

Amdahl’s law: S(n) = n / ((n - 1) f + 1) ≤ 1/f
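A quick numeric sketch of the formula above (my own check, not from the references): evaluating S(n) = n / (1 + (n - 1)f) for f = 0.05 shows the speedup approaching the 1/f = 20 bound as n grows.

    /* Amdahl's law: S(n) = n / (1 + (n - 1) * f), bounded above by 1/f. */
    #include <stdio.h>

    static double amdahl(int n, double f)
    {
        return n / (1.0 + (n - 1) * f);
    }

    int main(void)
    {
        double f = 0.05;                    /* 5% of the computation is serial */
        int sizes[] = {1, 10, 100, 1000, 10000};

        for (int i = 0; i < 5; i++)
            printf("n = %5d   S(n) = %6.2f\n", sizes[i], amdahl(sizes[i], f));
        printf("upper bound 1/f = %.2f\n", 1.0 / f);
        return 0;
    }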






Amdahl’s law is valid, and some textbooks discuss how it can be used to increase the efficiency of many parallel algorithms.


Shows that efforts required to further reduce the
fraction of the code that is sequential may pay
off in large performance gains.


Hardware that allows even a small decrease in the percentage of the code executed sequentially may be considerably more efficient.


A key flaw in past arguments that Amdahl’s law is a fatal limit to the future of parallelism is

Gustafson’s Law: The proportion of the computation that is sequential normally decreases as the problem size increases.


Other limitations in applying Amdahl’s Law:


Its proof focuses on the steps in a particular
algorithm, and does not consider that other
algorithms with more parallelism may exist.


Amdahl’s law applies only to ‘standard’ problems where superlinearity does not occur.


For more details on superlinearity, see [2] “Parallel Computation: Models and Methods”, Selim Akl, pp. 14-20 (Speedup Folklore Theorem) and Chapter 12.


More Metrics for Parallelism


Efficiency is defined by

E = t_s / (n × t_p) = S(n) / n

Efficiency gives the percentage of full utilization of the parallel processors on the computation, assuming a speedup of n is the best possible.


Cost:

The cost of a parallel algorithm or parallel execution is defined by

Cost = (running time) × (Nr. of PEs) = t_p × n

Observe that E = t_s / Cost, i.e., Cost = t_s / E.




Cost
allows the quality of parallel algorithms to
be compared to that of sequential algorithms.


Compare
cost
of parallel algorithm to
running time

of sequential algorithm


The advantage that parallel algorithms have
in using multiple processors is removed by
multiplying their running time by the
number
n

of processors they are using.


If a parallel algorithm requires exactly 1/n the running time of a sequential algorithm, then the parallel cost is the same as the sequential running time.
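A small worked sketch of these definitions (the timing numbers are made up purely for illustration):

    /* Compute speedup, efficiency, and cost from the definitions above,
     * using made-up timing numbers purely for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double ts = 100.0;    /* sequential running time        */
        double tp = 12.5;     /* parallel running time on n PEs */
        int    n  = 10;       /* number of processors           */

        double speedup    = ts / tp;         /* S(n) = ts / tp          */
        double efficiency = ts / (n * tp);   /* E = ts / (n * tp) = S/n */
        double cost       = tp * n;          /* Cost = tp * n           */

        printf("S(n) = %.2f   E = %.2f   Cost = %.1f   ts/E = %.1f\n",
               speedup, efficiency, cost, ts / efficiency);
        return 0;
    }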




More Metrics (cont.)


Cost-Optimal Parallel Algorithm: A parallel algorithm for a problem is said to be cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.


By proportional, we mean that

cost = t_p × n = k × t_s

where k is a constant. (See pg 67 of [1].)
is a constant. (See pg 67 of [1]).


Equivalently, a parallel algorithm is cost-optimal if

parallel cost = O(f(t)),

where f(t) is the running time of an optimal sequential algorithm.


In cases where no optimal sequential algorithm
is known, then the “fastest known” sequential
algorithm is often used instead.


Sieve of Eratosthenes

(A Data
-
Parallel vs Control
-
Parallel Example)


Reference [3, Quinn, Ch. 1], pages 10
-
17


A prime number is a positive integer with exactly
two factors, itself and 1.


The Sieve (pronounced “siv”) algorithm finds the prime numbers less than or equal to some positive integer n


Begin with a list of natural numbers

2, 3, 4, …, n


Remove composite numbers from the list by
striking multiples of 2, 3, 5, and successive
primes


After each striking, the next unmarked natural
number is prime


Sieve terminates after multiples of the largest prime less than or equal to √n have been struck from the list.


Sequential implementation uses 3 data structures

A Boolean array indexed by the numbers being sieved.

An integer holding the latest prime found so far.



A loop index that is incremented as multiples of the current prime are marked as composite numbers.
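A compact C sketch of the sequential sieve just outlined, using the same three data structures (the Boolean array, the current prime, and a loop index for striking). It is my illustration, not the code from [3]:

    /* Sequential Sieve of Eratosthenes built from the three data structures
     * above: a Boolean array, the current prime, and a loop index. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1000;
        char *composite = calloc(n + 1, 1);   /* Boolean array, 0 = unmarked */
        int prime = 2;                        /* latest prime found so far   */

        while (prime * prime <= n) {
            for (int i = prime * prime; i <= n; i += prime)   /* loop index */
                composite[i] = 1;             /* strike multiples of prime  */
            do {                              /* next unmarked value        */
                prime++;
            } while (composite[prime]);
        }

        int count = 0;
        for (int i = 2; i <= n; i++)
            if (!composite[i])
                count++;
        printf("%d primes <= %d\n", count, n);   /* prints 168 for n = 1000 */
        free(composite);
        return 0;
    }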

A Control
-
Parallel Approach


Control parallelism

involves applying a
different

sequence of operations to different data elements


Useful for


Shared
-
memory MIMD


Distributed
-
memory MIMD


Asynchronous PRAM


Control
-
parallel sieve


Each processor works with a different prime,
and is responsible for striking multiples of that
prime and identifying a new prime number


Each processor starts marking…


Shared memory contains

a boolean array containing the numbers being sieved,

an integer corresponding to the largest prime found so far

Each PE’s local memory contains local loop indexes keeping track of multiples of its current prime (since each is working with a different prime).
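A hedged sketch of this control-parallel scheme, assuming OpenMP for the shared-memory MIMD case (my own illustration, not code from [3]): the array and the counter of the last value handed out live in shared memory, each thread claims the next unmarked value inside a critical section and then strikes its multiples with a private loop index. As the next page notes, a thread may occasionally claim a value that is actually composite (its mark may not be visible yet); that only wastes work and does not make the result wrong.

    /* Control-parallel sieve sketch: the Boolean array and the counter of the
     * last value handed out are shared; each thread claims the next unmarked
     * value in a critical section, then strikes its multiples with a private
     * loop index.  Compile with -fopenmp. */
    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static char composite[N + 1];   /* shared array being sieved     */
        int current = 1;                /* shared: last value handed out */

        #pragma omp parallel
        {
            for (;;) {
                int my_prime;

                #pragma omp critical    /* one thread at a time claims here */
                {
                    do {
                        current++;
                    } while (current * current <= N && composite[current]);
                    my_prime = current;
                }
                if (my_prime * my_prime > N)
                    break;              /* all primes <= sqrt(N) handled    */

                for (int i = my_prime * my_prime; i <= N; i += my_prime)
                    composite[i] = 1;   /* strike multiples (private index) */
            }
        }

        int count = 0;
        for (int i = 2; i <= N; i++)
            if (!composite[i])
                count++;
        printf("%d primes <= %d\n", count, N);   /* prints 168 */
        return 0;
    }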


A Control
-
Parallel Approach

(cont.)


Problems and inefficiencies


Algorithm for Shared Memory MIMD

1.
Processor accesses variable holding current
prime

2.
searches for next unmarked value

3.
updates variable containing current prime


Must avoid having two processors doing this at the same time


A processor could waste time sieving multiples of a
composite number


How much speedup can we get?


Suppose n = 1000


Sequential algorithm


Time to strike out multiples of prime p is ⌊(n + 1 - p²)/p⌋

Multiples of 2: ⌊((1000 + 1) - 4)/2⌋ = ⌊997/2⌋ = 498

Multiples of 3: ⌊((1000 + 1) - 9)/3⌋ = ⌊992/3⌋ = 330

Total Sum = 1411 (number of “steps”)


2 PEs gives speedup 1411/706=2.00


3 PEs gives speedup 1411/499=2.83


3 PEs require 499 strikeout time units, so no more speedup is possible using additional PEs

Multiples of 2 dominate with 498 strikeout steps


A Data
-
Parallel Approach


Data parallelism

refers to using multiple PEs to
apply the
same

sequence of operations to different
data elements.


Useful for following types of parallel operations


SIMD


PRAM


Shared
-
memory MIMD,


Distributed
-
memory MIMD.


Generally not useful for pipeline operations.


A data
-
parallel sieve algorithm:


Each processor works with the same prime, and is responsible for striking multiples of that prime from a segment of the array of natural numbers


Assume we have k processors, where k << √n (i.e., k is much less than √n).

Each processor gets no more than ⌈n/k⌉ natural numbers

All primes less than √n, as well as the first prime greater than √n, are in the list controlled by the first processor

A Data
-
Parallel Approach

(cont.)


Data
-
parallel Sieve (cont.)


Distributed
-
memory MIMD Algorithm


Processor 1 finds next prime, broadcasts it
to all PEs


Each PE goes through their part of the array,
striking multiples of that prime (performing
same operation)


Continues until the first processor reaches a prime greater than √n
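A hedged MPI sketch of this broadcast-and-strike scheme (my own, modeled on the description above rather than taken from [3]). It assumes the number of processes k divides n and that k << √n, so that process 0's block contains every prime up to √n:

    /* Data-parallel sieve sketch: each of the k processes owns a block of the
     * numbers 2..n; process 0 finds each prime and broadcasts it, and every
     * process strikes that prime's multiples inside its own block. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, k, n = 1000000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &k);

        int  size = n / k;                  /* numbers per process        */
        long low  = 2 + (long)rank * size;  /* first number in my block   */
        char *mark = calloc(size, 1);       /* mark[i] refers to low + i  */

        long prime = 2;
        while (prime * prime <= n) {
            /* first multiple of prime inside my block, but never the prime itself */
            long first = (low % prime == 0) ? low : low + (prime - low % prime);
            if (first < prime * prime)
                first = prime * prime;
            for (long i = first; i < low + size; i += prime)
                mark[i - low] = 1;

            if (rank == 0)                  /* process 0 finds the next prime */
                do { prime++; } while (mark[prime - 2]);
            MPI_Bcast(&prime, 1, MPI_LONG, 0, MPI_COMM_WORLD);
        }

        int local = 0, total = 0;
        for (int i = 0; i < size; i++)
            if (low + i <= n && !mark[i])
                local++;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d primes <= %d\n", total, n);   /* 78498 for n = 10^6 */

        free(mark);
        MPI_Finalize();
        return 0;
    }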


How much speedup can we get?


Suppose n = 1,000,000 and we have k PEs


There are 168 primes less than 1,000, the
largest of which is 997


Maximum execution time to strike out primes:

( ⌈⌈1,000,000/k⌉ / 2⌉ + ⌈⌈1,000,000/k⌉ / 3⌉ + ⌈⌈1,000,000/k⌉ / 5⌉ + … + ⌈⌈1,000,000/k⌉ / 997⌉ ) × etime


The sequential execution time is the above sum
with k = 1.


The communication time = 168 (k - 1) × ctime.

We assume that each communication takes 100 times longer than the time to execute a strikeout



A Data
-
Parallel Approach

(cont.)


How much speedup can we get? (cont.)


Speedup is not directly proportional to the number of PEs; it is highest at 11 PEs


Computation time is inversely proportional
to the number of processors used


Communication time increases linearly


After 11 processors, the increase in the
communication time is higher than the
decrease in computation time, and total
execution time increases.


Study Figures 1-11 and 1-12 on pg 15 of [3].
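The computation/communication trade-off just described can be reproduced with a short C sketch (my own, using the assumptions stated above: each strikeout costs 1 time unit and each of the 168 broadcasts costs 100 units per receiving PE). Printing the totals for k = 1..20 shows the total time falling and then rising again, with the best value near 11 PEs, consistent with the figures cited from [3]:

    /* Reproduce the trade-off: computation shrinks roughly as 1/k while the
     * broadcast cost grows as 168*(k-1)*100 strikeout-times (ctime = 100). */
    #include <stdio.h>

    int main(void)
    {
        const long n = 1000000, ctime = 100;
        char composite[1000] = {0};

        /* small sieve to enumerate the 168 primes below 1,000 */
        for (int p = 2; p < 1000; p++)
            if (!composite[p])
                for (int m = 2 * p; m < 1000; m += p)
                    composite[m] = 1;

        for (long k = 1; k <= 20; k++) {
            long block = (n + k - 1) / k;          /* ceil(n/k) numbers per PE */
            long comp = 0;
            for (int p = 2; p < 1000; p++)
                if (!composite[p])
                    comp += (block + p - 1) / p;   /* ceil(block/p) strikeouts */
            long comm = 168 * (k - 1) * ctime;
            printf("k=%2ld  compute=%8ld  communicate=%7ld  total=%8ld\n",
                   k, comp, comm, comp + comm);
        }
        return 0;
    }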


How about parallel I/O time?


Practically, the primes generated must be
stored on an external device


Assume access to device is sequential.


I/O time is constant because output must be
performed sequentially


This sequential code severely limits the
parallel speedup according to Amdahl’s law


the fraction of operations that must be
performed sequentially limits the maximum
speedup possible.


Parallel I/O is an important topic in parallel
computing.



Future Additions Needed


To be added:



The work metric and work optimal concept.


The speedup and slowdown results from [2].


Data parallel vs control parallel example


Possibly a simpler example than sieve in
Quinn


Look at chapter 2 of [8, Jordan] for possible
new information.




