Parallel Computers
PARALLEL AND DISTRIBUTED COMPUTING
OVERVIEW
Fall 2003
TOPICS:
Parallel computing requires an understanding of
parallel algorithms, parallel languages, and parallel
architecture, all of which are covered in this class under the
following topics.
•
Fundamental concepts in parallel computation.
•
Synchronous Computation
–
SIMD, Vector, Pipeline Computing
–
Associative Computing
–
Fortran 90
–
ASC Language
–
MultiC Language
•
Asynchronous (MIMD or Multiprocessors) Shared
Memory Computation
–
OpenMP language
•
Distributed Memory MIMD/Multiprocessor Computation
–
Sometimes called Multicomputers
–
Programming using Message Passing
–
MPI language
•
Interconnection Networks (SIMD and MIMD)
•
Comparison of MIMD and SIMD Computation in Real
Time Computation Applications
GRADING FOR PDC
•
Five or six homework assignments
•
Midterm and Final Examination
•
Grading: Homework 50%, Midterm 20%, Final 30%
Introduction to Parallel Computing
(Chapter One)
•
References:
[1]-[4] given below.
1.
Chapter 1, “Parallel Programming” by Wilkinson et al.
2.
Chapter 1, “Parallel Computation” by Akl
3.
Chapters 1-2, “Parallel Computing” by Quinn, 1994
4.
Chapter 2, “Parallel Processing & Parallel Algorithms” by Roosta
•
Need for Parallelism
–
Numerical modeling and simulation of
scientific and engineering problems.
–
Solution for problems with deadlines
•
Command & Control problems like ATC.
–
Grand Challenge Problems
•
Sequential solutions may take months or
years.
•
Weather Prediction: a Grand Challenge Problem
–
Atmosphere is divided into 3D cells.
–
Data such as temperature, pressure, humidity,
wind speed and direction, etc. are recorded at
regular time intervals in each cell.
–
There are about $5 \times 10^{8}$ cells, each of size (1 mile)$^3$.
–
It would take a modern computer over 100
days to perform the calculations needed for a
10-day forecast.
•
Parallel Programming: a viable way to increase
computational speed.
–
Overall problem can be split into parts, each of
which is solved by a single processor.
–
Ideally, n processors would have n times the
computational power of one processor, with
each doing 1/n of the computation.
–
Such gains in computational power are rare,
for reasons such as
•
Inability to partition the problem perfectly
into n parts of the same computational size.
•
Necessary data transfer between processors
•
Necessary synchronizing of processors
•
Two major styles of partitioning problems
–
(Job) Control parallel programming
•
Problem is divided into the different, non-identical
tasks that have to be performed.
•
The tasks are divided among the processors
so that their work load is roughly balanced.
•
This is considered to be coarse-grained
parallelism.
–
Data parallel programming
•
Each processor performs the same
computation on different data sets.
•
Computations do not necessarily have to be
synchronous.
•
This is considered to be fine-grained
parallelism.
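The two partitioning styles above can be contrasted in a few lines. This is a minimal sketch, not part of the course material: the function names are hypothetical, and Python threads are used only to show the structure of each style, not to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def total(values):
    return sum(values)

def largest(values):
    return max(values)

def square_all(chunk):
    return [x * x for x in chunk]

def control_parallel(values):
    """Control parallel: different, non-identical tasks run concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_total = pool.submit(total, values)
        f_largest = pool.submit(largest, values)
        return f_total.result(), f_largest.result()

def data_parallel(values, workers=4):
    """Data parallel: the same computation is applied to different data sets."""
    size = max(1, -(-len(values) // workers))      # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(square_all, chunks))  # results come back in chunk order
    return [y for part in parts for y in part]
```

In the control-parallel version the work load is balanced only if the two tasks cost about the same; in the data-parallel version balance comes from giving each worker a chunk of about the same size.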
Shared Memory Multiprocessors (SMPs)
•
All processors have access to all memory locations.
•
The processors access memory through some type
of interconnection network.
•
This type of memory access is called
uniform memory access (UMA).
•
A data parallel programming language, based on a
language like FORTRAN or C/C++, may be
available.
•
Alternatively, programming using threads is
sometimes used.
•
More programming details will be discussed later.
•
The difficulty of providing fast access to all memory
locations in an SMP architecture results in most SMPs
having hierarchical or distributed memory systems.
–
This type of memory access is called
nonuniform memory access (NUMA).
•
Normally, fast cache is used with NUMA systems
to reduce the problem of different memory access
times for PEs.
–
This creates the problem of ensuring that all
copies of the same data in different memory
locations are identical.
–
Numerous complex algorithms have been
designed for this problem.
Message-Passing Multiprocessors
(Multicomputers)
•
Processors are connected by an interconnection
network (which will be discussed later in the chapter).
•
Each processor has a local memory and can only
access its own local memory.
•
Data is passed between processors using messages,
as dictated by the program.
•
Note:
If the processors run in SIMD mode (i.e.,
synchronously), then the movement of data
over the network can be synchronous:
–
Movement of the data can be controlled by
program steps.
–
Much of the message-passing overhead (e.g.,
routing, hot-spots, headers, etc.) can be
avoided.
•
A common approach to programming
multiprocessors is to use message-passing library
routines in addition to conventional sequential
programs (e.g., MPI, PVM).
•
The problem is divided into processes that can be
executed concurrently on individual processors. A
processor is normally assigned multiple processes.
•
Multicomputers can be scaled to larger sizes much
more easily than shared memory multiprocessors.
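The message-passing style described above can be imitated on one machine. The sketch below is a simulation under stated assumptions, not MPI or PVM code: each "node" is a thread that touches only its own local data, and a `queue.Queue` stands in for the interconnection network. All names are hypothetical.

```python
import queue
import threading

def node(rank, local_data, to_root):
    """One simulated node: reads only its own local_data, then sends a message."""
    partial = sum(local_data)
    to_root.put((rank, partial))            # explicit send

def message_passing_sum(data, nodes=4):
    to_root = queue.Queue()                 # stands in for the network
    chunks = [data[r::nodes] for r in range(nodes)]   # data is copied, not shared
    workers = [threading.Thread(target=node, args=(r, chunks[r], to_root))
               for r in range(nodes)]
    for w in workers:
        w.start()
    total = 0
    for _ in range(nodes):
        _rank, partial = to_root.get()      # explicit receive, in arrival order
        total += partial
    for w in workers:
        w.join()
    return total
```

Every exchange of data is an explicit send or receive call written by the programmer, which is the property that makes this style low-level.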
Multicomputers (cont.)
•
Programming disadvantages of message-passing
–
Programmers must make explicit message-passing
calls in the code
–
This is low-level programming and is error
prone.
–
Data is not shared but copied, which increases
the total data size.
–
Data Integrity: difficulty in maintaining
correctness of multiple copies of a data item.
•
Programming advantages of message-passing
–
No problem with simultaneous access to data.
–
Allows different PCs to operate on the same
data independently.
–
Allows PCs on a network to be easily upgraded
when faster processors become available.
•
Mixed “distributed shared memory” systems
–
Lots of current interest in a cluster of SMPs.
•
See Dr. David Bader’s or Dr. Joseph JaJa’s
website
–
Other mixed systems have been developed.
Flynn’s Classification Scheme
•
SISD: single instruction stream, single data stream
–
Primarily sequential processors
•
MIMD: multiple instruction streams, multiple data
streams.
–
Includes SMPs and multicomputers.
–
Processors are asynchronous, since they can
independently execute different programs on
different data sets.
–
Considered by most researchers to contain the
most powerful, least restricted computers.
–
Have very serious message passing (or shared
memory) problems that are often ignored
•
when compared to SIMDs
•
when computing algorithmic complexity
–
May be programmed using a multiple
programs, multiple data (MPMD) technique.
–
A common way to program MIMDs is to use a
single program, multiple data (SPMD) method
•
Normal technique when the number of
processors is large.
•
Data parallel programming style for MIMDs
•
SIMD: single instruction stream, multiple data streams.
–
One instruction stream is broadcast to all
processors.
Flynn’s Taxonomy (cont.)
•
SIMD (cont.)
–
Each processor (also called a processing
element or PE) is very simple and is
essentially an ALU.
•
PEs do not store a copy of the program nor
have a program control unit.
–
Individual processors can be inhibited from
participating in an instruction (based on a data
test).
–
All active processors execute the same
instruction synchronously, but on different data.
–
On a memory access, all active processors must
access the same location in their local memory.
–
The data items form an array and an instruction
can act on the complete array in one cycle.
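The SIMD behavior described above (one broadcast instruction, PEs inhibited by a data test) can be mimicked in ordinary code. This is only an illustrative sketch with hypothetical names; a real SIMD machine performs the step in hardware in one cycle, whereas the list comprehension here runs sequentially.

```python
def simd_step(instruction, local_values, mask):
    """Apply one broadcast instruction on every active PE.

    local_values[i] is the datum held by PE i; PEs whose mask entry is
    False are inhibited and keep their value unchanged.
    """
    return [instruction(v) if active else v
            for v, active in zip(local_values, mask)]

values = [3, -1, 4, -5]
mask = [v < 0 for v in values]                    # data test selects active PEs
result = simd_step(lambda v: -v, values, mask)    # negate only where active
```

Here `result` holds the absolute values of the inputs: the same instruction was issued to all PEs, but only the PEs that passed the test executed it.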
•
MISD: multiple instruction streams, single data
stream.
–
This category is not used very often.
–
Some include pipelined architectures in this
category.
Interconnection Network Overview
References:
Texts [1-4] discuss these network examples, but
reference 3 (Quinn) is particularly good.
•
Only an overview of interconnection networks is
included here. They will be covered in greater depth later.
•
The PEs (processing elements) are called nodes.
•
A link is the connection between two nodes.
–
A link is either bidirectional or consists of two directional links.
–
Either one wire to carry one bit or parallel wires (one
wire for each bit in a word) can be used.
–
The above choices do not have a major impact on the
concepts presented in this course.
•
The diameter is the number of links on a shortest path between
the two farthest nodes in the network.
–
The diameter of a network gives the maximal
distance a single message may have to travel.
•
Completely Connected Network
–
Each of n nodes has a link to every other node.
–
Requires $n(n-1)/2$ links
–
Impractical, unless there are very few processors
•
Line/Ring Network
–
A line consists of a row of n nodes, with connections
to adjacent nodes.
–
Called a ring when a link is added to connect the two
end nodes of a line.
–
The line/ring networks have many applications.
–
Diameter of a line is $n - 1$ and of a ring is $\lfloor n/2 \rfloor$.
•
The Mesh Interconnection Network
–
The nodes are in rows and columns in a rectangle.
–
The nodes are connected by links that form a 2D
mesh. (Give diagram on board.)
–
Each interior node in a 2D mesh is connected to its
four nearest neighbors.
–
A square mesh with n nodes has $\sqrt{n}$ rows and $\sqrt{n}$
columns
•
The diameter of a $\sqrt{n} \times \sqrt{n}$ mesh is $2(\sqrt{n} - 1)$
–
If the horizontal and vertical ends of a mesh are connected to the
opposite sides, the network is called a torus.
–
Meshes have been used more on actual computers
than any other network.
–
A 3D mesh is a generalization of a 2D mesh and has
been used in several computers.
–
The fact that 2D and 3D meshes model physical
space makes them useful for many scientific and
engineering problems.
•
Binary Tree Network
–
A binary tree network is normally assumed to be a
complete binary tree.
–
It has a root node, and each interior node has two
links connecting it to nodes in the level below it.
–
The height of the tree is $\lfloor \lg n \rfloor$ and its diameter is $2 \lfloor \lg n \rfloor$.
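The link counts and diameters quoted for these networks can be checked on small instances by breadth-first search. The sketch below is one way to do that; the builder functions and the heap-style numbering of tree nodes are illustrative choices, not taken from the references.

```python
from collections import deque

def build(nodes, links):
    """Adjacency sets for an undirected network."""
    adj = {u: set() for u in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def diameter(adj):
    """Largest shortest-path length (in links) over all node pairs, by BFS."""
    best = 0
    for src in adj:
        dist = {src: 0}
        todo = deque([src])
        while todo:
            u = todo.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    todo.append(v)
        best = max(best, max(dist.values()))
    return best

def line(n):
    return build(range(n), [(i, i + 1) for i in range(n - 1)])

def ring(n):
    return build(range(n), [(i, (i + 1) % n) for i in range(n)])

def complete(n):
    return build(range(n), [(i, j) for i in range(n) for j in range(i + 1, n)])

def mesh(side):
    nodes = [(r, c) for r in range(side) for c in range(side)]
    links = [((r, c), (r, c + 1)) for r in range(side) for c in range(side - 1)]
    links += [((r, c), (r + 1, c)) for r in range(side - 1) for c in range(side)]
    return build(nodes, links)

def binary_tree(n):
    # Nodes are numbered 1..n; node i is linked to its parent i // 2.
    return build(range(1, n + 1), [(i // 2, i) for i in range(2, n + 1)])
```

For example, `diameter(mesh(4))` measures a 16-node square mesh and `diameter(binary_tree(15))` a complete binary tree of height 3.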
Metrics for Evaluating Parallelism
References:
All references cover most topics in this
section and have useful information not contained
in the others. Ref. [2, Akl] includes new research and is
the main reference used, although others (esp. [3,
Quinn] and [1, Wilkinson]) are also used.
Granularity: Amount of computation done between
communication or synchronization steps; it is
ranked as fine, intermediate, and coarse.
•
SIMDs are built for efficient communications and
handle fine-grained solutions well.
•
SMPs or message passing MIMDs handle
communications less efficiently than SIMDs but
more efficiently than clusters and can handle
intermediate-grained solutions well.
•
Clusters of workstations or distributed systems have
slower communications among PEs and are better
suited for coarse-grained computations.
•
For asynchronous computations, increasing the
granularity
–
reduces expensive communications
–
reduces costs of process creation
–
but reduces the number of concurrent processes
Parallel Metrics (continued)
Speedup
•
A measure of the decrease in running time due to
parallelism.
•
Based on running times, $S(n) = t_s / t_p$, where
–
$t_s$ is the execution time on a single processor,
using the fastest known sequential algorithm
–
$t_p$ is the execution time using a parallel
processor.
•
In theoretical analysis, $S(n) = t_s / t_p$, where
–
$t_s$ is the worst case running time of the
fastest known sequential algorithm for the
problem
–
$t_p$ is the worst case running time of the parallel
algorithm using n PEs.
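As a small worked instance of the definition, with made-up timings chosen only for illustration (a sequential run of 120 s and a parallel run of 20 s on n = 8 PEs):

```latex
S(8) = \frac{t_s}{t_p} = \frac{120\ \text{s}}{20\ \text{s}} = 6
```

The speedup of 6 is below the number of PEs, which is the typical situation.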
Parallel Metrics (continued)
•
Linear Speedup is optimal for most problems
–
Claim:
The maximum possible speedup for parallel
computers with n PEs for ‘normal problems’ is
n.
–
Proof of claim
•
Assume a computation is partitioned perfectly
into n processes of equal duration.
•
Assume no overhead is incurred as a result of this
partitioning of the computation.
•
Then, under these ideal conditions, the parallel
computation will execute n times faster than the
sequential computation.
•
The parallel running time is $t_s / n$.
•
Then the parallel speedup of this computation is
$S(n) = t_s / (t_s / n) = n$.
–
We shall later see that this “proof” is not valid for
certain types of nontraditional problems.
–
Unfortunately, the best speedup possible for most
applications is much less than n, as
•
Above assumptions are usually invalid.
•
Usually some parts of programs are sequential
and only one PE is active.
•
Sometimes a large number of processors are idle
for certain portions of the program.
–
E.g., during parts of the execution, many PEs
may be waiting to receive or to send data.
Parallel Metrics (cont)
Superlinear speedup (i.e., when $S(n) > n$): Most texts
besides [2,3] argue that
•
Linear speedup is the maximum speedup
obtainable.
–
The preceding “proof” is used to argue that
superlinearity is impossible.
•
Occasionally speedup that appears to be superlinear
may occur, but can be explained by other reasons
such as
–
the extra memory in the parallel system.
–
a sub-optimal sequential algorithm being used.
–
luck, in the case of an algorithm that has a random
aspect in its design (e.g., random selection)
•
Selim Akl has shown that for some less standard
problems, superlinear algorithms can be given.
–
Some problems cannot be solved without use
of parallel computation.
–
Some problems are natural to solve using
parallelism and sequential solutions are
inefficient.
–
The final chapter of Akl’s textbook and several
journal papers have been written to establish
that these claims are valid, but it may still be a long
time before they are fully accepted.
–
Superlinearity has been a hotly debated topic
for too long to be accepted quickly.
Amdahl’s Law
•
Assumes that the speedup is not superlinear; i.e.,
$S(n) = t_s / t_p \le n$
–
Assumption only valid for traditional problems.
•
By Figure 1.29 in [1] (or slide #40), if f denotes the
fraction of the computation that must be sequential,
$t_p \ge f t_s + (1 - f) t_s / n$
•
Substituting the above inequality into the above
equation for S(n) and simplifying (see slide #41 or
book) yields
•
Amdahl’s “law”:
$S(n) = \frac{t_s}{t_p} \le \frac{t_s}{f t_s + (1 - f) t_s / n} = \frac{n}{1 + (n - 1) f} \le \frac{1}{f}$, where f is as above.
•
See Slide #41 or Fig. 1.30 for related details.
•
Note that S(n) never exceeds 1/f and approaches 1/f as
n increases.
•
Example: If only 5% of the computation is serial,
the maximum speedup is 20, no matter how many
processors are used.
•
Observations on Amdahl’s law as a limit to
parallelism:
–
For a long time, Amdahl’s law was viewed as a
fatal limit to the usefulness of parallelism.
–
Amdahl’s law is valid and some textbooks
discuss how it can be used to increase the
efficiency of many parallel algorithms.
–
Shows that efforts required to further reduce the
fraction of the code that is sequential may pay
off in large performance gains.
–
Hardware that allows even a small decrease in
the percent of things executed sequentially may
be considerably more efficient.
–
A key flaw in past arguments that Amdahl’s
law is a fatal limit to the future of parallelism is identified by
•
Gustafson’s Law:
The proportion of the
computations that are sequential normally
decreases as the problem size increases.
–
Other limitations in applying Amdahl’s Law:
•
Its proof focuses on the steps in a particular
algorithm, and does not consider that other
algorithms with more parallelism may exist.
•
Amdahl’s law applies only to ‘standard’
problems where superlinearity does not occur.
–
For more details on superlinearity, see [2]
“Parallel Computation: Models and Methods”,
Selim Akl, pgs 14-20 (Speedup Folklore
Theorem) and Chapter 12.
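The bound in Amdahl's law is easy to tabulate. The following is a minimal sketch (the function name is hypothetical) that evaluates n / (1 + (n - 1) f) for the 5% example given earlier and shows the bound approaching 1/f = 20 as n grows.

```python
def amdahl_bound(f, n):
    """Upper bound on speedup with n PEs when a fraction f of the work is sequential."""
    return n / (1 + (n - 1) * f)

# Bound for f = 0.05 at increasing processor counts.
table = {n: round(amdahl_bound(0.05, n), 2) for n in (1, 10, 100, 1000, 10**6)}
```

Each entry of `table` stays below 20, however large n becomes; with f = 0 the bound reduces to linear speedup n.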
More Metrics for Parallelism
•
Efficiency
is defined by
$E = \frac{t_s}{t_p \times n} = \frac{S(n)}{n}$
–
Efficiency gives the percentage of full utilization
of parallel processors on a computation,
assuming a speedup of n is the best possible.
•
Cost:
The cost of a parallel algorithm or parallel
execution is defined by
Cost = (running time) $\times$ (Nr. of PEs) = $t_p \times n$
–
Observe that $\text{Cost} = t_s / E$
–
Cost allows the quality of parallel algorithms to
be compared to that of sequential algorithms.
•
Compare the cost of the parallel algorithm to the
running time of the sequential algorithm.
•
The advantage that parallel algorithms have
in using multiple processors is removed by
multiplying their running time by the
number n of processors they are using.
•
If a parallel algorithm requires exactly 1/n
the running time of a sequential algorithm,
then the parallel cost is the same as the
sequential running time.
More Metrics (cont.)
•
Cost-Optimal Parallel Algorithm:
A parallel
algorithm for a problem is said to be cost-optimal if
its cost is proportional to the running time of an
optimal sequential algorithm for the same problem.
–
By proportional, we mean that
cost = $t_p \times n = k \times t_s$
where k is a constant. (See pg 67 of [1]).
–
Equivalently, a parallel algorithm is cost-optimal if
parallel cost = O(f(t)),
where f(t) is the running time of an optimal
sequential algorithm.
–
In cases where no optimal sequential algorithm
is known, then the “fastest known” sequential
algorithm is often used instead.
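The efficiency and cost metrics are simple ratios and products of running times. A minimal sketch follows; the function names and the timings (t_s = 120, t_p = 20, n = 8) are made up for illustration.

```python
def speedup(t_s, t_p):
    return t_s / t_p

def efficiency(t_s, t_p, n):
    """E = t_s / (t_p * n), equivalently S(n) / n."""
    return t_s / (t_p * n)

def cost(t_p, n):
    """Cost = (parallel running time) * (number of PEs)."""
    return t_p * n

t_s, t_p, n = 120.0, 20.0, 8          # hypothetical timings
S = speedup(t_s, t_p)
E = efficiency(t_s, t_p, n)
C = cost(t_p, n)
```

With these numbers the cost exceeds the sequential running time, and dividing t_s by E gives back the cost. A cost-optimal algorithm keeps the ratio of cost to t_s bounded by a constant as the problem grows.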
Sieve of Eratosthenes
(A Data-Parallel vs Control-Parallel Example)
•
Reference [3, Quinn, Ch. 1], pages 10-17
•
A prime number is a positive integer with exactly
two factors, itself and 1.
•
The sieve (pronounced “siv”) algorithm finds the prime numbers less
than or equal to some positive integer n
–
Begin with a list of natural numbers
2, 3, 4, …, n
–
Remove composite numbers from the list by
striking multiples of 2, 3, 5, and successive
primes
–
After each striking, the next unmarked natural
number is prime
–
Sieve terminates after multiples of the largest prime
less than or equal to $\sqrt{n}$ have been struck
from the list.
•
Sequential implementation uses 3 data structures
–
Boolean array indexed by the numbers being
sieved.
–
An integer holding the latest prime found so far.
–
A loop index that is incremented as multiples
of the current prime are marked as composite numbers.
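The sequential sieve can be written directly in terms of the three data structures listed above. This is a minimal sketch with hypothetical variable names; it starts striking at the square of each prime, since smaller multiples have already been struck by smaller primes.

```python
def sieve(n):
    """Return the primes less than or equal to n."""
    is_composite = [False] * (n + 1)    # Boolean array indexed by the numbers sieved
    prime = 2                           # latest prime found so far
    while prime * prime <= n:
        multiple = prime * prime        # loop index over multiples of the prime
        while multiple <= n:
            is_composite[multiple] = True
            multiple += prime
        # The next unmarked number is the next prime.
        prime += 1
        while is_composite[prime]:
            prime += 1
    return [i for i in range(2, n + 1) if not is_composite[i]]
```

The outer loop stops once the square of the current prime exceeds n, which matches the termination rule stated above.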
A Control-Parallel Approach
•
Control parallelism involves applying a different
sequence of operations to different data elements
•
Useful for
–
Shared-memory MIMD
–
Distributed-memory MIMD
–
Asynchronous PRAM
•
Control-parallel sieve
–
Each processor works with a different prime,
and is responsible for striking multiples of that
prime and identifying a new prime number
–
Each processor starts marking…
–
Shared memory contains
•
Boolean array containing the numbers being
sieved,
•
integer corresponding to the largest prime found so
far
–
Each PE’s local memory contains local loop indexes
keeping track of multiples of its current prime
(since each is working with a different prime).
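A shared-memory version of this scheme can be sketched with threads. The code below is an illustration under stated assumptions, not the algorithm from [3]: names are hypothetical, a lock protects the shared "current prime" variable, and Python threads show the structure only (they do not run the striking loops truly in parallel).

```python
import threading

def control_parallel_sieve(n, workers=3):
    is_composite = [False] * (n + 1)     # shared Boolean array
    shared = {"current": 1}              # shared: last value handed out as a "prime"
    lock = threading.Lock()

    def claim_next():
        # Only one worker at a time may search for and update the current prime.
        with lock:
            p = shared["current"] + 1
            while p * p <= n and is_composite[p]:
                p += 1
            shared["current"] = p
            return p

    def work():
        while True:
            p = claim_next()
            if p * p > n:
                return
            multiple = p * p             # local loop index of this worker
            while multiple <= n:
                is_composite[multiple] = True
                multiple += p

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for i in range(2, n + 1) if not is_composite[i]]
```

A worker may claim a value such as 4 before another worker has struck it. The result is still correct, because multiples of a composite are themselves composite, but that worker's time is wasted.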
A Control-Parallel Approach
(cont.)
•
Problems and inefficiencies
–
Algorithm for Shared Memory MIMD
1.
Processor accesses variable holding current
prime
2.
searches for next unmarked value
3.
updates variable containing current prime
–
Must avoid having two processors doing this at
same time
–
A processor could waste time sieving multiples of a
composite number
•
How much speedup can we get?
–
Suppose n = 1000
–
Sequential algorithm
•
Time to strike out multiples of prime p is
$\lceil (n + 1 - p^2) / p \rceil$
•
Multiples of 2: $\lceil ((1000 + 1) - 4) / 2 \rceil = \lceil 997/2 \rceil = 499$
•
Multiples of 3: $\lceil ((1000 + 1) - 9) / 3 \rceil = \lceil 992/3 \rceil = 331$
•
Total Sum = 1411 (number of “steps”)
–
2 PEs gives speedup 1411/706 = 2.00
–
3 PEs gives speedup 1411/499 = 2.83
–
3 PEs require 499 strikeout time units, so no more
speedup is possible using additional PEs
•
Multiples of 2 dominate with 499 strikeout
steps
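The step counts above can be reproduced with a small scheduling simulation. This sketch makes simplifying assumptions that should be read as such: the time to find the next prime is ignored, primes are handed out in increasing order to whichever PE becomes free first, and no PE wastes time on a composite. Function names are hypothetical.

```python
import heapq
from math import isqrt

def strike_time(n, p):
    """Steps to strike multiples of p from p*p through n: ceil((n + 1 - p*p) / p)."""
    return -(-(n + 1 - p * p) // p)

def control_parallel_steps(n, pes):
    """Finishing time when each free PE takes the next prime up to sqrt(n)."""
    primes = [p for p in range(2, isqrt(n) + 1)
              if all(p % q for q in range(2, isqrt(p) + 1))]
    free_at = [0] * pes                  # time at which each PE becomes free
    heapq.heapify(free_at)
    for p in primes:
        start = heapq.heappop(free_at)   # earliest available PE
        heapq.heappush(free_at, start + strike_time(n, p))
    return max(free_at)

steps = {pes: control_parallel_steps(1000, pes) for pes in (1, 2, 3, 4)}
```

Dividing `steps[1]` by `steps[pes]` gives the speedup for each processor count; beyond 3 PEs the finishing time is fixed by the PE that strikes the multiples of 2.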
A Data-Parallel Approach
•
Data parallelism refers to using multiple PEs to
apply the same sequence of operations to different
data elements.
•
Useful for following types of parallel operations
–
SIMD
–
PRAM
–
Shared-memory MIMD,
–
Distributed-memory MIMD.
•
Generally not useful for pipeline operations.
•
A data-parallel sieve algorithm:
–
Each processor works with the same prime, and is
responsible for striking multiples of that prime
from a segment of the array of natural numbers
–
Assume we have k processors, where
$k \ll \sqrt{n}$ (i.e., k is much less than $\sqrt{n}$).
•
Each processor gets no more than $\lceil n/k \rceil$
natural numbers
•
All primes less than $\sqrt{n}$, as well as the first
prime greater than $\sqrt{n}$, are in the list controlled
by the first processor
A Data-Parallel Approach
(cont.)
•
Data-parallel Sieve (cont.)
–
Distributed-memory MIMD Algorithm
•
Processor 1 finds the next prime and broadcasts it
to all PEs
•
Each PE goes through its part of the array,
striking multiples of that prime (performing the
same operation)
•
Continues until the first processor reaches a
prime greater than $\sqrt{n}$
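The distributed-memory algorithm above can be simulated on one machine by giving each PE its own segment of the array. This is a sketch under stated assumptions: the PEs are visited in a sequential loop rather than running concurrently, the "broadcast" is just a shared loop variable, and the size check enforces the requirement that the first segment holds every prime the first processor must find. All names are hypothetical.

```python
from math import isqrt

def data_parallel_sieve(n, k):
    """Simulate k PEs, each striking multiples of the broadcast prime in its own segment."""
    size = -(-(n - 1) // k)                       # numbers 2..n, at most this many per PE
    if size < 2 * isqrt(n):
        raise ValueError("k is too large: first segment must cover the primes up to sqrt(n)")
    bounds = [(2 + i * size, min(n, 1 + (i + 1) * size)) for i in range(k)]
    marked = [[False] * max(0, hi - lo + 1) for lo, hi in bounds]   # local memories
    prime = 2
    while prime * prime <= n:
        # The first PE "broadcasts" prime; every PE performs the same operation.
        for (lo, hi), segment in zip(bounds, marked):
            first = max(prime * prime, -(-lo // prime) * prime)   # first multiple in segment
            for m in range(first, hi + 1, prime):
                segment[m - lo] = True
        # The first PE finds the next prime inside its own segment.
        lo0 = bounds[0][0]
        prime += 1
        while marked[0][prime - lo0]:
            prime += 1
    return [lo + j
            for (lo, hi), segment in zip(bounds, marked)
            for j, struck in enumerate(segment) if not struck]
```

Only the current prime crosses segment boundaries; each PE reads and writes nothing but its own part of the array.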
•
How much speedup can we get?
–
Suppose n = 1,000,000 and we have k PEs
–
There are 168 primes less than 1,000, the
largest of which is 997
–
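Maximum execution time to strike out multiples of the primes
$(\lceil \lceil 1{,}000{,}000/k \rceil / 2 \rceil + \lceil \lceil 1{,}000{,}000/k \rceil / 3 \rceil + \lceil \lceil 1{,}000{,}000/k \rceil / 5 \rceil + \dots + \lceil \lceil 1{,}000{,}000/k \rceil / 997 \rceil)$ etime
–
The sequential execution time is the above sum
with k = 1.
–
The communication time = 168(k - 1) ctime.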
–
We assume that each communication takes 100
times longer than the time to execute a strikeout
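The cost model above can be evaluated numerically to see how the speedup varies with k. The sketch below takes etime as the unit of time and ctime = 100 etime, as assumed above; the function names are hypothetical and the model is only the one stated on this slide, not a measurement.

```python
def primes_below(m):
    flags = [True] * m
    flags[0:2] = [False, False]
    for p in range(2, m):
        if flags[p]:
            for q in range(p * p, m, p):
                flags[q] = False
    return [p for p in range(m) if flags[p]]

def parallel_time(n, k, primes, ctime=100):
    """Model time in units of etime: strikeout time plus broadcast time."""
    share = -(-n // k)                               # ceil(n / k) numbers per PE
    compute = sum(-(-share // p) for p in primes)    # ceil(share / p) for each prime
    communicate = len(primes) * (k - 1) * ctime
    return compute + communicate

def model_speedup(n, k, primes):
    return parallel_time(n, 1, primes) / parallel_time(n, k, primes)

PRIMES = primes_below(1000)
speedups = {k: model_speedup(1_000_000, k, PRIMES) for k in range(1, 33)}
best_k = max(speedups, key=speedups.get)
```

Computation time shrinks roughly as 1/k while communication time grows linearly in k, so `speedups` rises to a peak and then falls; `best_k` is the processor count at that peak.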
A Data-Parallel Approach
(cont.)
•
How much speedup can we get? (cont.)
–
Speedup is not directly proportional to the
number of PEs; it is highest at 11 PEs
•
Computation time is inversely proportional
to the number of processors used
•
Communication time increases linearly
•
After 11 processors, the increase in the
communication time is higher than the
decrease in computation time, and total
execution time increases.
•
Study Figures 1-11 and 1-12 on pg 15 of [3].
–
How about parallel I/O time?
•
Practically, the primes generated must be
stored on an external device
•
Assume access to device is sequential.
•
I/O time is constant because output must be
performed sequentially
•
This sequential code severely limits the
parallel speedup according to Amdahl’s law
–
the fraction of operations that must be
performed sequentially limits the maximum
speedup possible.
•
Parallel I/O is an important topic in parallel
computing.
Future Additions Needed
•
To be added:
–
The work metric and work optimal concept.
–
The speedup and slowdown results from [2].
–
Data parallel vs control parallel example
•
Possibly a simpler example than sieve in
Quinn
–
Look at chapter 2 of [8, Jordan] for possible
new information.