Chapter ∞: Parallel Algorithms
• PCAM
• SIMD & MIMD
• Designing Parallel Programs
CS 340
PCAM
When designing algorithms for systems of parallel
processors, two primary problems must be addressed:
• Avoid processor idle time by balancing the processing load between all of the processors.
• Avoid time-consuming communication between processors by allocating interdependent tasks to the same processor.
One common approach: PCAM
[Figure: the four PCAM stages in order: Partition, Communicate, Agglomerate, Map]
• Partition: Split the problem into a large number of small tasks to expose opportunities for parallel execution.
• Communicate: Identify the data flow that is required between separate tasks.
• Agglomerate: Combine tasks to minimize communication and balance loads; evaluate data replication pros and cons.
• Map: Assign tasks to processors in a manner that improves concurrency while maintaining locality.
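The four stages can be walked through on a toy problem. The Python sketch below is only an illustration: the function names, the list-of-lists task representation, and the round-robin mapping are assumptions made for this example, not part of the PCAM method itself.

```python
def partition(data):
    """Partition: one fine-grained task per data item."""
    return [[item] for item in data]

# Communicate: in this toy problem the only data flow is each task's
# partial result travelling to whoever combines the results, so no
# task-to-task exchange needs to be modelled.

def agglomerate(tasks, group_size):
    """Agglomerate: merge neighbouring tasks into coarser tasks."""
    return [sum(tasks[i:i + group_size], [])
            for i in range(0, len(tasks), group_size)]

def map_tasks(tasks, processors):
    """Map: deal the coarse tasks out to processors round-robin."""
    assignment = {p: [] for p in range(processors)}
    for i, task in enumerate(tasks):
        assignment[i % processors].append(task)
    return assignment

fine = partition([1, 2, 3, 4, 5, 6, 7, 8])   # 8 tiny tasks
coarse = agglomerate(fine, 2)                # 4 tasks of 2 items each
placement = map_tasks(coarse, 2)             # 2 tasks per processor
```

Round-robin mapping is the simplest choice that balances load when tasks are of equal size; it does nothing to preserve locality, which a real mapping would also weigh.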
PCAM Example: VLSI Floorplan
Assume that you are designing a microchip with several layout restrictions:
• The northwest corner of the chip must contain a 5 × 5 component of type A.
• Right next to the type-A component on the north side of the chip, a 24-square-unit component of type B is needed.
• Adjacent to the type-B component’s east or south side, a 40-square-unit component of type C is needed.
[Figure: possible layouts (among many others)]
The idea is to minimize the overall area of the chip.
VLSI Floorplan: Algorithm Outline
The common sequential approach for this problem is a tree-based, depth-first technique called branch-and-bound, in which subtrees are pruned as soon as it becomes clear that they are too costly.
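A minimal Python sketch of depth-first branch-and-bound follows. The tree is made up for illustration (each node is a pair of chip area and child list, and a node without children stands for a complete floorplan), and the sketch assumes that chip area never shrinks as more components are placed.

```python
# Hypothetical search tree: (chip_area, children). Not the slide's full tree.
TREE = (25, [
    (64, [(114, []), (143, [])]),
    (54, [(112, []), (102, [])]),
    (104, [(130, []), (140, [])]),
])

def branch_and_bound(node, best=float("inf"), visited=None):
    """Depth-first search returning the smallest complete-floorplan area."""
    area, children = node
    if visited is not None:
        visited.append(area)      # record the order in which nodes are examined
    if area >= best:              # bound: this subtree cannot beat the best so far
        return best
    if not children:              # leaf: a complete floorplan
        return area
    for child in children:        # branch: explore each placement option
        best = branch_and_bound(child, best, visited)
    return best

order = []
smallest = branch_and_bound(TREE, visited=order)
```

In this run the node with area 104 is examined after a floorplan of area 102 has been found, so its two children are never visited.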
[Figure: branch-and-bound search tree, labeled with the chip size at each node]
Component A Placement: chip size 25
Component B Placement (six possibilities): chip sizes 64, 65, 54, 55, 85, 84
Component C Placement, B1 version (twelve possibilities): chip sizes 114, 143, 174, 200, 144, 112, 130, 234, 140, 150, 102, 220
Parallelizing VLSI Floorplan: Partitioning
There is no obvious data structure that could be used to perform a decomposition of this problem’s domain into components that could be mapped to separate processors.
A fine-grained functional decomposition is therefore needed, where the exploration of each search tree node is handled by a separate task.
This means that new tasks will be created in a wavefront as the search progresses down the search tree, which will be explored in a breadth-first fashion.
Notice that only tasks on the wavefront will be able to execute concurrently.
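The wavefront idea can be sketched in Python as a level-by-level expansion in which every node on the current level is its own task. The tree, the `expand` helper, and the use of a thread pool are assumptions for illustration; in CPython the threads show the task structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search tree: (chip_area, children); leaves are complete floorplans.
TREE = (25, [(64, [(114, []), (143, [])]), (54, [(112, []), (102, [])])])

def expand(node):
    """One task: explore a single search tree node and return its children."""
    _area, children = node
    return children

def wavefront_search(root, workers=4):
    best = float("inf")
    frontier = [root]             # the current wavefront
    widths = []                   # how many tasks could run concurrently per level
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            widths.append(len(frontier))
            for area, children in frontier:
                if not children:  # complete floorplan: candidate solution
                    best = min(best, area)
            # every node on the wavefront is expanded as a separate task
            frontier = [child for kids in pool.map(expand, frontier)
                        for child in kids]
    return best, widths
```

The `widths` list makes the limit visible: the available concurrency at any moment is the size of the current wavefront, not the size of the whole tree.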
Parallelizing VLSI Floorplan: Communication
In a parallel implementation of simple search, tasks can execute independently and need to communicate only to report solutions.
The parallel algorithm for this problem will also need to keep track of the bounding value (i.e., the smallest chip area found so far), which must be accessed by every task.
One possibility would be to encapsulate the bounding value maintenance in a single centralized task with which the other tasks will communicate.
This approach is inherently unscalable, since the processor handling the centralized task can only service requests from the other tasks at a particular rate, thus bounding the number of tasks that can execute concurrently.
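A rough Python sketch of the centralized arrangement is given here. The `BoundKeeper` class and `search_leaves` function are hypothetical names; a lock-protected object stands in for the centralized task, and each call to `report` models one request to it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class BoundKeeper:
    """Stand-in for the centralized task that owns the bounding value."""

    def __init__(self):
        self._best = float("inf")
        self._lock = threading.Lock()
        self.requests = 0

    def report(self, area):
        """Offer a candidate chip area; return the current bounding value."""
        with self._lock:          # only one request is serviced at a time
            self.requests += 1
            self._best = min(self._best, area)
            return self._best

def search_leaves(leaf_areas, workers=4):
    """Have several worker tasks report their solutions to one keeper."""
    keeper = BoundKeeper()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(keeper.report, leaf_areas))
    return keeper

keeper = search_leaves([114, 143, 112, 102])
```

Every request passes through the same lock, so adding workers increases the waiting at the keeper rather than its service rate, which is the scalability limit described above.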
Parallelizing VLSI Floorplan: Agglomeration
Practically speaking, parallelism will have to be constrained by the number of processors being employed by the program.
For this problem, an obvious agglomeration would be to split the breadth-first search into tasks at each node until a certain level of the tree is reached, at which point each subtree can be handled as a sequential depth-first search by its assigned processor.
Of course, this step is more significant if the number of components being placed on the chip and the number of options for placing them are substantially increased.
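One way to picture this agglomeration in Python: expand the tree breadth-first down to a cutoff level, then give each resulting subtree to a worker that searches it depth-first. The tree, the function names, and the cutoff of one level are assumptions for illustration, and the subtree searches in this sketch do not share a bounding value.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search tree: (chip_area, children); leaves are complete floorplans.
TREE = (25, [
    (64, [(114, []), (143, [])]),
    (54, [(112, []), (102, [])]),
    (85, [(130, []), (140, [])]),
])

def sequential_dfs(node, best=float("inf")):
    """Depth-first branch-and-bound over one subtree."""
    area, children = node
    if area >= best:
        return best
    if not children:
        return area
    for child in children:
        best = sequential_dfs(child, best)
    return best

def split_at_depth(root, cutoff):
    """Breadth-first expansion down to `cutoff`; returns the subtree roots."""
    frontier = [root]
    for _ in range(cutoff):
        next_frontier = []
        for node in frontier:
            _area, children = node
            next_frontier.extend(children if children else [node])  # keep leaves
        frontier = next_frontier
    return frontier

def agglomerated_search(root, cutoff=1, workers=3):
    subtrees = split_at_depth(root, cutoff)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return min(pool.map(sequential_dfs, subtrees))  # one coarse task per subtree
```

Raising `cutoff` yields more, smaller tasks; lowering it yields fewer, larger ones, which is the knob that matches task count to processor count.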
[Figure: agglomerated search tree. The root (chip size 25) has six Component B children (chip sizes 64, 65, 54, 55, 85, 84), which head the B1 through B6 subtrees.]
Parallelizing VLSI Floorplan: Mapping
One priority when mapping tasks to processors is determining how to avoid lengthy idle periods for the processors.
For this problem, one processor can be assigned a supervisory role by handling the root node and generating tasks associated with the subtrees.
The agglomeration step ensured that the tasks are handled in a depth-first manner, which facilitates the pruning of unnecessary tasks.
The other processors each maintain a queue of assigned tasks and, whenever their queues are depleted, they request tasks from other processors, which facilitates load balancing.
In order to keep the pruning rate as high as possible,
processors keep tasks that are far from the root (i.e.,
more likely to cause pruning soon) and hand tasks that
are near the root to idle processors requesting tasks.
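The keep-deep, give-shallow policy can be sketched with a double-ended queue. The `Worker` class, the `request_work` function, and the (depth, label) task pairs are hypothetical, and the sketch ignores how newly generated tasks would be kept in depth order.

```python
from collections import deque

class Worker:
    """A processor with its own queue of (depth, label) tasks."""

    def __init__(self, tasks=()):
        # ordered from shallow (near the root) to deep (far from the root)
        self.queue = deque(sorted(tasks))

    def next_task(self):
        """Take the deepest task for local work (most likely to prune soon)."""
        return self.queue.pop() if self.queue else None

    def give_task(self):
        """Hand the task nearest the root to an idle requester."""
        return self.queue.popleft() if self.queue else None

def request_work(idle, others):
    """An idle worker asks the others in turn until one hands over a task."""
    for other in others:
        task = other.give_task()
        if task is not None:
            idle.queue.append(task)
            return task
    return None

busy = Worker([(3, "c"), (1, "a"), (2, "b")])
idle = Worker()
handed_over = request_work(idle, [busy])   # the shallowest task moves
kept = busy.next_task()                    # the deepest task stays local
```

Tasks near the root represent large subtrees, so handing those over also tends to keep the requester busy longer before it has to ask again.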
SIMD
Parallel processing on SIMD (Single Instruction, Multiple Data) machines is accomplished by having all processors perform the same instruction on different data in a vectorized manner.
For applications that require huge blocks of data upon which
the same set of operations must be performed, both the data
retrieval and the data processing can experience significant
performance benefits.
[Figure: four processors executing the same instruction stream, each on its own data blocks]
Instruction               Processor 1    Processor 2    Processor 3    Processor 4
Retrieve Data Block A[ ]  A[0…99]        A[100…199]     A[200…299]     A[300…399]
Retrieve Data Block B[ ]  B[0…99]        B[100…199]     B[200…299]     B[300…399]
A[0] = A[0] * B[0]        (same instruction on all four processors)
A[19] = A[3] * B[6]       (same instruction on all four processors)
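The lockstep behavior can be imitated in plain Python. This is a simulation only, not real vector hardware: the "lanes" are dictionaries, the function names are made up for the example, and it assumes that the indices in the figure are local to each processor's retrieved block.

```python
def make_lanes(a, b, width):
    """Split arrays A and B into equal blocks, one block pair per lane."""
    return [{"A": a[i:i + width], "B": b[i:i + width]}
            for i in range(0, len(a), width)]

def simd_step(instruction, lanes):
    """Issue one instruction; every lane applies it to its own data."""
    for lane in lanes:
        instruction(lane)

def multiply_first(lane):
    # A[0] = A[0] * B[0], using indices local to the lane's block
    lane["A"][0] = lane["A"][0] * lane["B"][0]

def multiply_mixed(lane):
    # A[19] = A[3] * B[6], again with local indices
    lane["A"][19] = lane["A"][3] * lane["B"][6]

A = list(range(400))          # sample data, chosen arbitrarily
B = [2] * 400
lanes = make_lanes(A, B, 100)  # four lanes holding 100 elements each
simd_step(multiply_first, lanes)
simd_step(multiply_mixed, lanes)
```

After the two steps, each lane has updated element 0 and element 19 of its own block of A, even though only two instructions were issued.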
SIMD Inefficiency
Parallelism in a SIMD machine does not necessarily
eliminate all inefficiency and waste.
For example, consider what happens when multiple processors attempt to execute the following conditional with their respective data:
    if (u < v)
        z = x * y;
    else
        z = x + y;
[Figure: four processors stepping through the conditional in lockstep]
Step      Processor 1          Processor 2          Processor 3          Processor 4
Compare   u1 (10) to v1 (20)   u2 (16) to v2 (12)   u3 (14) to v3 (21)   u4 (15) to v4 (13)
Multiply  x1 and y1            x2 and y2            x3 and y3            x4 and y4
Add       x1 and y1            x2 and y2            x3 and y3            x4 and y4
Store     product in z1        sum in z2            product in z3        sum in z4
Because the processors handle the instructions in lockstep, both condition cases are executed, regardless of the condition’s evaluation.
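The wasted work can be made explicit with a small Python simulation that computes both branches for every lane and then selects a result per lane. The u and v values are the ones from the figure; the x and y values are invented for illustration, as is the function name.

```python
def simd_conditional(u, v, x, y):
    """Evaluate `if (u < v) z = x * y; else z = x + y;` across all lanes."""
    mask = [ui < vi for ui, vi in zip(u, v)]         # compare on every lane
    products = [xi * yi for xi, yi in zip(x, y)]     # multiply on every lane
    sums = [xi + yi for xi, yi in zip(x, y)]         # add on every lane
    z = [p if m else s                               # store the selected result
         for m, p, s in zip(mask, products, sums)]
    return mask, z

u = [10, 16, 14, 15]
v = [20, 12, 21, 13]
x = [2, 3, 4, 5]              # hypothetical operands
y = [6, 7, 8, 9]              # hypothetical operands
mask, z = simd_conditional(u, v, x, y)
```

Eight arithmetic results are computed but only four are stored, so half of the arithmetic in this example is discarded.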
SIMD Applications
The limitations of the SIMD approach (i.e., large data
sets to which uniform instructions are applied in lockstep
fashion) have restricted its applicability to such areas as
graphics (e.g., animation, games) and video processing
(e.g., image enhancement, video compression, format
conversion).
Note that divide-and-conquer algorithms for sequential machines tend to have linear speedups on SIMD machines. Mergesort, for example, takes $O\left(\frac{n}{p}\log\frac{n}{p}\right)$ time for the local sorts on a p-processor MIMD machine; once the $O(n)$ cost of the final merges is included, this translates to $O(n)$ time on an n-processor machine.
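The structure of such a sort can be sketched in Python: split the input into p chunks, sort the chunks concurrently, then merge the sorted runs. This is an illustration under several assumptions. The function name is made up, Python's built-in `sorted` stands in for the local mergesort, the merge here is done sequentially with `heapq.merge`, and CPython threads show the structure rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_mergesort(data, p=4):
    """Sort `data` by sorting p chunks concurrently, then merging them."""
    if not data:
        return []
    size = -(-len(data) // p)                    # ceil(n / p) items per chunk
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        runs = list(pool.map(sorted, chunks))    # local sorts, one per worker
    return list(heapq.merge(*runs))              # combine the sorted runs

result = parallel_mergesort([5, 2, 9, 1, 7, 3, 8, 6, 4], p=4)
```

Each local sort handles about n/p items, which is where the $\frac{n}{p}\log\frac{n}{p}$ term comes from; the merge touches all n items regardless of p.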
MIMD
Parallel processing on MIMD (Multiple Instruction,
Multiple Data) machines is accomplished by having all
processors work independently, performing different
instructions on different data.
Applications for this type of system involve tasks that can be easily separated from each other, such as CAD/CAM, modeling and simulation, and network processing, thus minimizing the time-consuming communication between processors.
[Figure: four processors, each executing its own instruction stream]
Processor 1:
    …
    z = x * y;
    alpha(z, n);
    …
Processor 2:
    …
    for (i = 1; i < n; i++)
    {
        A[i] = 0.0;
        B[i] = 14.8;
    }
    …
Processor 3:
    …
    z = x * y;
    alpha(z, n);
    …
Processor 4:
    …
    w = u * v;
    t = 2 * w + v / 10;
    …
Most current supercomputers, networked parallel computer clusters and “grids”, and multi-core PCs use the MIMD approach.
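A small Python sketch of the MIMD style: unrelated functions run concurrently, each on its own data. The function names and arguments are invented for the example and loosely echo the fragments in the figure; a thread pool stands in for independent processors.

```python
from concurrent.futures import ThreadPoolExecutor

def scale(x, y):
    """One instruction stream: a single multiplication."""
    return x * y

def fill(n):
    """Another stream: initialise two arrays of length n."""
    return [0.0] * n, [14.8] * n

def combine(u, v):
    """A third stream: w = u * v, then t = 2 * w + v / 10."""
    w = u * v
    return 2 * w + v / 10

def run_mimd():
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(scale, 3, 4),      # different code ...
            pool.submit(fill, 2),          # ... on different data ...
            pool.submit(combine, 2, 10),   # ... with no communication between tasks
        ]
        return [f.result() for f in futures]

results = run_mimd()
```

None of the three tasks waits on another, which is the property that makes such workloads a good fit for MIMD machines.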
Designing Parallel Programs
Designing and developing parallel programs has characteristically been a very manual process, with the programmer typically responsible for both identifying and actually implementing parallelism.
Very often, manually developing parallel codes is a time-consuming, complex, error-prone, and iterative process.
The first step is to determine whether or not a problem is one that can actually be parallelized.
Example of Parallelizable Problem:
Calculate the potential energy for each of several thousand independent conformations of a molecule; when done, find the minimum energy conformation.
This problem can be solved in parallel. The energy of each molecular conformation can be determined independently of the others. The calculation of the minimum energy conformation is also a parallelizable problem.
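The shape of this computation can be sketched in Python. The energy function below is a made-up stand-in (a sum of squared coordinates), not a real molecular model, and the function names are hypothetical; the point is only that each energy is computed without reference to any other.

```python
from concurrent.futures import ThreadPoolExecutor

def potential_energy(conformation):
    """Hypothetical stand-in energy: sum of squared coordinates."""
    return sum(c * c for c in conformation)

def minimum_energy(conformations, workers=4):
    """Evaluate every conformation independently, then pick the minimum."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        energies = list(pool.map(potential_energy, conformations))
    best = min(range(len(energies)), key=energies.__getitem__)
    return conformations[best], energies[best]

winner = minimum_energy([(1, 2), (0, 1), (3, 0)])
```

The final minimum is a reduction, which can itself be parallelized by having each worker report the minimum of its own share.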
Example of Non-Parallelizable Problem:
Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:
$F(n) = F(n-1) + F(n-2)$
This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown would entail dependent calculations rather than independent ones. The calculation of the $F(n)$ value uses those of both $F(n-1)$ and $F(n-2)$. These three terms cannot be calculated independently and therefore cannot be calculated in parallel.
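The dependency is easy to see when the formula is written as a loop. In this short Python sketch (the function name is chosen for the example), each new term can only be computed after the two before it exist.

```python
def fibonacci(count):
    """Return the first `count` terms of 1, 1, 2, 3, 5, ..."""
    terms = [1, 1][:count]
    while len(terms) < count:
        # F(n) = F(n-1) + F(n-2): needs both previous results first
        terms.append(terms[-1] + terms[-2])
    return terms

series = fibonacci(8)
```

No iteration of the loop can start before the previous one finishes, so splitting the iterations across processors gains nothing for this formulation.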
Problem Decomposition
One of the first steps in designing a parallel program is to break the problem into discrete “chunks” of work that can be distributed to multiple tasks.
Method 1: Domain Decomposition
In this type of partitioning, the data associated with the problem is decomposed. Each parallel task then works on a portion of the data.
[Figure: the problem data set divided among Task 1, Task 2, Task 3, and Task 4]
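A minimal Python sketch of domain decomposition: the data set is cut into four contiguous blocks and each task runs the same computation on its own block. The function names and the sum-of-squares workload are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Cut `data` into `parts` contiguous blocks of near-equal size."""
    base, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)  # first blocks take the remainder
        blocks.append(data[start:end])
        start = end
    return blocks

def block_sum_of_squares(block):
    """The work each task performs on its portion of the data."""
    return sum(v * v for v in block)

def parallel_sum_of_squares(data, tasks=4):
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        partials = list(pool.map(block_sum_of_squares, split(data, tasks)))
    return sum(partials)

blocks = split(list(range(10)), 4)
total = parallel_sum_of_squares(list(range(10)))
```

Near-equal block sizes keep the load balanced when every item costs about the same to process.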
Method 2: Functional Decomposition
In this approach, the partitioning is based on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.
[Figure: the problem instruction set divided among Task 1, Task 2, Task 3, and Task 4]
Functional Decomposition Examples
Functional decomposition lends itself to problems that
can be split into different tasks, such as:
Ecosystem Modeling
Each program calculates the population of a given
group, where each group's growth depends on that
of its neighbors. As time progresses, each process
calculates its current state, then exchanges
information with the neighbor populations. All tasks
then progress to calculate the state at the next time
step.
Signal Processing
An audio signal data set is passed through four
distinct computational filters. Each filter is a
separate process. The first segment of data
must pass through the first filter before
progressing to the second. When it does, the
second segment of data passes through the
first filter. By the time the fourth segment of
data is in the first filter, all four tasks are busy.
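The pipeline timing can be traced with a short Python sketch. It assumes that every filter takes one time step per segment; the function name and the numbering of segments from 1 are choices made for this example.

```python
def pipeline_schedule(segments, stages=4):
    """Return, per time step, which segment (or None) occupies each filter."""
    schedule = []
    for t in range(segments + stages - 1):
        row = []
        for s in range(stages):
            seg = t - s       # index of the segment reaching filter s at time t
            row.append(seg + 1 if 0 <= seg < segments else None)
        schedule.append(row)
    return schedule

timeline = pipeline_schedule(6)
```

In `timeline`, row 0 shows only the first filter busy, while row 3 shows segment 4 in the first filter and segments 3, 2, and 1 in the filters after it, so all four tasks are working.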
Climate Modeling
Each model component can be thought of as a separate task. Arrows
represent exchanges of data between components during computation:
the atmosphere model generates wind velocity data that are used by the
ocean model, the ocean model generates sea surface temperature data
that are used by the atmosphere model, and so on.