Parallel Algorithms - SIUe


Chapter ∞: Parallel Algorithms


• PCAM
• SIMD & MIMD



Designing Parallel Programs


PCAM

When designing algorithms for systems of parallel processors, two primary problems must be addressed:

• Avoid processor idle time by balancing the processing load between all of the processors.
• Avoid time-consuming communication between processors by allocating interdependent tasks to the same processor.

[Diagram: the four PCAM stages: Partition → Communicate → Agglomerate → Map.]

One common approach: PCAM

• Partition: Split the problem into a large number of small tasks to expose opportunities for parallel execution.
• Communicate: Identify the data flow that is required between separate tasks.
• Agglomerate: Combine tasks to minimize communication and balance loads; evaluate the pros and cons of data replication.
• Map: Assign tasks to processors in a manner that improves concurrency while maintaining locality.
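
To make the four stages concrete, here is a minimal sketch (not from the slides) of PCAM applied to summing a large array in C++: the partition is one conceptual task per element, the only required communication is the exchange of partial sums, agglomeration combines the elements into one chunk per processor, and mapping assigns one chunk to each thread.

    #include <numeric>
    #include <thread>
    #include <vector>

    // Hypothetical example: PCAM applied to an array sum.
    double pcamSum(const std::vector<double>& a, unsigned p) {
        // Partition: conceptually, one tiny task per element.
        // Agglomerate: group the elements into p contiguous chunks so that
        // interdependent work stays on one processor.
        // Map: one chunk (and one partial sum) per thread.
        std::vector<double> partial(p, 0.0);
        std::vector<std::thread> workers;
        const std::size_t chunk = a.size() / p;
        for (unsigned t = 0; t < p; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = (t + 1 == p) ? a.size() : lo + chunk;
            workers.emplace_back([&a, &partial, t, lo, hi] {
                partial[t] = std::accumulate(a.begin() + lo, a.begin() + hi, 0.0);
            });
        }
        for (std::thread& w : workers) w.join();
        // Communicate: the only data flow between tasks is the p partial sums.
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }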


PCAM Example: VLSI Floorplan

Assume that you are designing a microchip with several layout restrictions:

• The northwest corner of the chip must contain a 5×5 component of type A.
• Right next to the type-A component on the north side of the chip, a 24-square-unit component of type B is needed.
• Adjacent to the type-B component's east or south side, a 40-square-unit component of type C is needed.

There are many possibilities for placing each component; the idea is to minimize the overall area of the chip.


VLSI Floorplan: Algorithm Outline

The common sequential approach for this problem is a tree-based, depth-first technique called branch-and-bound, in which subtrees are pruned whenever it becomes clear that they will be too costly.

[Figure: the branch-and-bound search tree. The root is the component-A placement (chip size 25); component B has six placement possibilities (chip sizes 54, 55, 64, 65, 84, 85); under the B1 version, component C has twelve possibilities (chip sizes ranging from 102 to 234). Each node is labeled with the chip size its partial placement implies.]
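
The pruning logic can be sketched as follows. This is a minimal illustration, not the slides' code; Node, children(), isComplete(), and lowerBoundArea() are hypothetical helpers standing in for the real floorplan bookkeeping.

    #include <climits>
    #include <vector>

    struct Node { /* partial placement state (hypothetical) */ };
    std::vector<Node> children(const Node&); // legal next placements (stub)
    bool isComplete(const Node&);            // all components placed? (stub)
    int  lowerBoundArea(const Node&);        // lower bound on any completion's area (stub)

    int best = INT_MAX; // bounding value: smallest complete chip area found so far

    void search(const Node& n) {
        if (lowerBoundArea(n) >= best) return;       // prune: cannot beat the bound
        if (isComplete(n)) { best = lowerBoundArea(n); return; }
        for (const Node& child : children(n))        // depth-first expansion
            search(child);
    }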


Parallelizing VLSI Floorplan: Partitioning

There is no obvious data structure that could be used to perform a decomposition of this problem's domain into components that could be mapped to separate processors.

A fine-grained functional decomposition is therefore needed, where the exploration of each search tree node is handled by a separate task.

This means that new tasks will be created in a wavefront as the search progresses down the search tree, which will be explored in a breadth-first fashion.

Notice that only tasks on the wavefront are able to execute concurrently.

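
A sketch of the wavefront idea, under the same hypothetical Node/children() helpers as before: each node taken from the frontier corresponds to one fine-grained task, and its children join the next wavefront. A real implementation would spawn these as parallel tasks rather than loop over them.

    #include <deque>
    #include <vector>

    struct Node { /* partial placement state (hypothetical) */ };
    std::vector<Node> children(const Node&); // legal next placements (stub)

    void expandWavefront(const Node& root) {
        std::deque<Node> frontier{root};         // the current wavefront
        while (!frontier.empty()) {
            Node task = frontier.front();        // one task per tree node
            frontier.pop_front();
            for (const Node& child : children(task))
                frontier.push_back(child);       // new tasks join the next wavefront
        }
    }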


Parallelizing VLSI Floorplan: Communication

In a parallel implementation of simple search, tasks can execute independently and need communicate only to report solutions.

The parallel algorithm for this problem will also need to keep track of the bounding value (i.e., the smallest chip area found so far), which must be accessed by every task.

One possibility would be to encapsulate the maintenance of the bounding value in a single centralized task with which the other tasks communicate.


This approach is inherently unscalable, since the processor handling the centralized task can only service requests from the other tasks at a particular rate, thus bounding the number of tasks that can execute concurrently.
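
On a shared-memory machine, the same bookkeeping can be sketched without a centralized task by keeping the bound in a single atomic variable. This is an illustrative assumption, not the slides' message-passing design, but it shows the two operations every task needs: publishing a better bound and reading the current one.

    #include <atomic>
    #include <climits>

    std::atomic<int> bound{INT_MAX}; // smallest chip area found so far

    void reportSolution(int chipArea) {
        int current = bound.load();
        // lock-free "atomic min": retry until a smaller bound is published
        while (chipArea < current &&
               !bound.compare_exchange_weak(current, chipArea)) {}
    }

    bool shouldPrune(int lowerBound) { // every task consults the bound
        return lowerBound >= bound.load();
    }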


Parallelizing VLSI Floorplan: Agglomeration

Practically speaking, parallelism will have to be constrained by the number of processors being employed on the program.

For this problem, an obvious agglomeration would be to split the breadth-first search into tasks at each node until a certain level of the tree is reached, at which point each subtree can be handled as a sequential depth-first search by its assigned processor.

Of course, this step is more significant if the number of components being placed on the chip, and the number of options for placing them, is substantially increased.

[Figure: the top of the search tree (chip size 25 at the root; 54, 55, 64, 65, 84, 85 at the next level); below that level, each of the six subtrees B1–B6 is handled sequentially by its assigned processor.]
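
A sketch of this agglomeration, under the same hypothetical helpers as the earlier sketches; the cutoff level is an assumed tuning parameter, and a real version would update the shared bound atomically, as in the communication sketch.

    #include <thread>
    #include <vector>

    struct Node { /* partial placement state (hypothetical) */ };
    std::vector<Node> children(const Node&); // legal next placements (stub)
    void search(const Node&);                // sequential depth-first search (stub)

    void agglomeratedSearch(const Node& root, int cutoffLevel) {
        std::vector<Node> frontier{root};
        for (int level = 0; level < cutoffLevel; ++level) {
            std::vector<Node> next;                       // breadth-first expansion
            for (const Node& n : frontier)
                for (const Node& c : children(n)) next.push_back(c);
            frontier = std::move(next);                   // wavefront at the cutoff
        }
        std::vector<std::thread> workers;
        for (const Node& subtreeRoot : frontier)          // one subtree per processor
            workers.emplace_back([subtreeRoot] { search(subtreeRoot); });
        for (std::thread& w : workers) w.join();
    }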


Parallelizing VLSI Floorplan: Mapping

One priority when mapping tasks to processors is determining how to avoid lengthy idle periods for the processors.

For this problem, one processor can be assigned a supervisory role by handling the root node and generating the tasks associated with the subtrees.

The agglomeration step ensured that the tasks are handled in a depth-first manner, which facilitates the pruning of unnecessary tasks.


The other nodes maintain a queue of assigned tasks and, whenever their queues are depleted, they request tasks from other nodes, which facilitates load balancing.

In order to keep the pruning rate as high as possible, processors keep tasks that are far from the root (i.e., more likely to cause pruning soon) and hand tasks that are near the root to idle processors requesting tasks.
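
The per-processor queue can be sketched as a deque with two ends (an illustrative assumption, not the slides' code): the owner pops deep tasks from the back, while requests from idle processors are served near-root tasks from the front, which is exactly the keep-deep, give-shallow policy described above.

    #include <deque>
    #include <mutex>
    #include <optional>

    struct Node { /* partial placement state (hypothetical) */ };

    struct TaskQueue {
        std::deque<Node> tasks; // front = near the root, back = deep in the tree
        std::mutex m;

        std::optional<Node> takeLocal() {    // owner keeps working depth-first
            std::lock_guard<std::mutex> g(m);
            if (tasks.empty()) return std::nullopt;
            Node n = tasks.back(); tasks.pop_back();
            return n;
        }
        std::optional<Node> steal() {        // an idle processor's request
            std::lock_guard<std::mutex> g(m);
            if (tasks.empty()) return std::nullopt;
            Node n = tasks.front(); tasks.pop_front();
            return n;
        }
    };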


SIMD

Parallel processing on SIMD (Single Instruction, Multiple Data) machines is accomplished by having all processors perform the same instruction on different data in a vectorized manner.

For applications that require huge blocks of data upon which the same set of operations must be performed, both the data retrieval and the data processing can experience significant performance benefits.

[Figure: four SIMD processors. Each retrieves its own data blocks (A[0…99], A[100…199], A[200…299], A[300…399] and the corresponding blocks of B), then all four execute the same instruction stream in lockstep on their respective blocks, e.g., A[0] = A[0] * B[0] through A[19] = A[3] * B[6].]
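
The element-wise multiply in the figure corresponds to what a single SIMD instruction does on real hardware. A minimal sketch with x86 SSE intrinsics (assuming an SSE-capable CPU) multiplies four floats per instruction:

    #include <immintrin.h> // x86 SSE intrinsics

    // One instruction operates on four floats at a time, mirroring the
    // four lockstep processors in the figure.
    void simdMultiply(float* A, const float* B, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 a = _mm_loadu_ps(A + i);          // load 4 elements of A
            __m128 b = _mm_loadu_ps(B + i);          // load 4 elements of B
            _mm_storeu_ps(A + i, _mm_mul_ps(a, b));  // 4 multiplies at once
        }
        for (; i < n; ++i) A[i] *= B[i];             // scalar remainder
    }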


SIMD Inefficiency

Parallelism in a SIMD machine does not necessarily eliminate all inefficiency and waste.

For example, consider what happens when multiple processors attempt to execute the following conditional in lockstep with their respective data:

    if (u < v)
        z = x * y;
    else
        z = x + y;

    Phase             Proc 1          Proc 2          Proc 3          Proc 4
    Compare u to v    10 < 20: true   16 < 12: false  14 < 21: true   15 < 13: false
    Multiply x * y    x1 * y1         x2 * y2         x3 * y3         x4 * y4
    Add x + y         x1 + y1         x2 + y2         x3 + y3         x4 + y4
    Store into z      product → z1    sum → z2        product → z3    sum → z4

Because the processors handle the instructions in lockstep, both condition cases are executed, regardless of the condition's evaluation.
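
On real SIMD hardware this shows up as masking or blending: both branches are computed for every lane, and a per-lane mask selects the result. A sketch with SSE4.1 intrinsics (the intrinsics are real; the framing is illustrative):

    #include <immintrin.h> // SSE / SSE4.1 intrinsics

    // Both cases of the conditional are computed for all four lanes;
    // the mask then picks, lane by lane, which result lands in z.
    __m128 simdConditional(__m128 u, __m128 v, __m128 x, __m128 y) {
        __m128 mask    = _mm_cmplt_ps(u, v);      // per-lane result of (u < v)
        __m128 product = _mm_mul_ps(x, y);        // "then" case, every lane
        __m128 sum     = _mm_add_ps(x, y);        // "else" case, every lane
        return _mm_blendv_ps(sum, product, mask); // product where (u < v) held
    }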


SIMD Applications

The limitations of the SIMD approach (i.e., large data sets to which uniform instructions are applied in lockstep fashion) have restricted its applicability to such areas as graphics (e.g., animation, games) and video processing (e.g., image enhancement, video compression, format conversion).

Note that divide-and-conquer algorithms for sequential machines tend to have linear speedups on SIMD machines. Mergesort, for example, takes O((n/p) log(n/p)) time on a p-processor MIMD machine, which translates to O(n) time on an n-processor machine.


MIMD

Parallel processing on MIMD (Multiple Instruction, Multiple Data) machines is accomplished by having all processors work independently, performing different instructions on different data.

Applications for this type of system involve tasks that can be easily separated from each other, such as CAD/CAM, modeling and simulation, and network processing, thus minimizing the time-consuming communication between processors.



[Figure: four MIMD processors, each executing its own instruction stream:]

    Processor 1:  z = x * y;  alpha(z, n);
    Processor 2:  for (i = 1; i < n; i++) { A[i] = 0.0; B[i] = 14.8; }
    Processor 3:  z = x * y;  alpha(z, n);
    Processor 4:  w = u * v;  t = 2 * w + v / 10;

Most current supercomputers, networked parallel computer clusters and “grids”, and multi-core PCs use the MIMD approach.
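
A minimal sketch of MIMD-style execution on a multi-core PC, using the streams from the figure (alpha() is left as a comment, since the slides do not define it): each thread runs a different instruction stream on different data, in contrast with the SIMD sketches above.

    #include <thread>

    void initArrays(double* A, double* B, int n) { // Processor 2's stream
        for (int i = 1; i < n; i++) { A[i] = 0.0; B[i] = 14.8; }
    }

    int main() {
        double A[100], B[100];
        double x = 2.0, y = 3.0, z = 0.0, u = 4.0, v = 5.0, w = 0.0, t = 0.0;
        std::thread p1([&] { z = x * y; /* then alpha(z, n) */ });
        std::thread p2(initArrays, A, B, 100);
        std::thread p4([&] { w = u * v; t = 2 * w + v / 10; });
        p1.join(); p2.join(); p4.join();
    }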


Designing Parallel Programs

Designing and developing parallel programs has characteristically been a very manual process, with the programmer typically responsible for both identifying and actually implementing parallelism.

Very often, manually developing parallel codes is a time-consuming, complex, error-prone, and iterative process.

The first step is to determine whether or not a problem is one that can actually be parallelized.

Example of a Parallelizable Problem:

Calculate the potential energy for each of several thousand independent conformations of a molecule; when done, find the minimum-energy conformation.

This problem can be solved in parallel: each of the molecular conformations is independently determinable, and the calculation of the minimum-energy conformation is also a parallelizable problem.
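
A sketch of this pattern with C++17 parallel algorithms; Conformation and energy() are hypothetical stand-ins, with the placeholder body standing in for the real potential-energy computation.

    #include <algorithm>
    #include <execution>
    #include <vector>

    struct Conformation { /* atom positions, etc. (hypothetical) */ };
    double energy(const Conformation&) { return 0.0; } // placeholder computation

    double minimumEnergy(const std::vector<Conformation>& confs) {
        std::vector<double> e(confs.size());
        // Thousands of independent evaluations, safely run in parallel:
        std::transform(std::execution::par, confs.begin(), confs.end(),
                       e.begin(), energy);
        // The reduction to the minimum is itself parallelizable:
        return *std::min_element(std::execution::par, e.begin(), e.end());
    }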

Example of a Non-Parallelizable Problem:

Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula F(n) = F(n-1) + F(n-2).

This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown entails dependent calculations rather than independent ones: the calculation of the F(n) value uses those of both F(n-1) and F(n-2), so these three terms cannot be calculated independently and, therefore, not in parallel.


Problem Decomposition

One of the first steps in designing a parallel program is to break the problem into discrete “chunks” of work that can be distributed to multiple tasks.

Method 1: Domain Decomposition

In this type of partitioning, the data associated with the problem is decomposed. Each parallel task then works on a portion of the data.

[Figure: the problem data set partitioned into four portions, one for each of Task 1 through Task 4.]
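
A minimal sketch of domain decomposition (illustrative, not from the slides): the data is one large array, and each task touches only its own contiguous slice.

    #include <thread>
    #include <vector>

    void processAll(std::vector<double>& data, int tasks) {
        std::vector<std::thread> pool;
        const std::size_t chunk = data.size() / tasks;
        for (int t = 0; t < tasks; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = (t == tasks - 1) ? data.size() : lo + chunk;
            pool.emplace_back([&data, lo, hi] {  // this task: its portion only
                for (std::size_t i = lo; i < hi; ++i) data[i] *= 2.0;
            });
        }
        for (std::thread& th : pool) th.join();
    }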


Method 2: Functional Decomposition

In this approach, the partitioning is based on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done, and each task then performs a portion of the overall work.

[Figure: the problem instruction set divided among Task 1 through Task 4, each performing a different portion of the work.]
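
A minimal sketch of functional decomposition (the three stage functions are hypothetical placeholders): unlike the domain-decomposition sketch, each task here runs different code.

    #include <thread>

    void computeForces()      { /* one piece of the overall work */ }
    void integratePositions() { /* a different piece */ }
    void writeCheckpoint()    { /* yet another piece */ }

    int main() {
        std::thread t1(computeForces);      // each task performs a
        std::thread t2(integratePositions); // different portion of the
        std::thread t3(writeCheckpoint);    // overall computation
        t1.join(); t2.join(); t3.join();
    }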


Functional Decomposition Examples

Functional decomposition lends itself to problems that
can be split into different tasks, such as:

Ecosystem Modeling

Each program calculates the population of a given
group, where each group's growth depends on that
of its neighbors. As time progresses, each process
calculates its current state, then exchanges
information with the neighbor populations. All tasks
then progress to calculate the state at the next time
step.

Signal Processing

An audio signal data set is passed through four
distinct computational filters. Each filter is a
separate process. The first segment of data
must pass through the first filter before
progressing to the second. When it does, the
second segment of data passes through the
first filter. By the time the fourth segment of
data is in the first filter, all four tasks are busy.
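
The filling of the pipeline can be illustrated with a small scheduling simulation (an illustration of the timing only, not a threaded implementation): at step t, segment s occupies filter t - s.

    #include <cstdio>

    int main() {
        const int kFilters = 4, kSegments = 8;
        for (int t = 0; t < kSegments + kFilters - 1; ++t) {
            std::printf("step %d:", t);
            for (int s = 0; s <= t && s < kSegments; ++s) {
                int f = t - s;                        // which filter holds segment s
                if (f < kFilters)
                    std::printf("  seg %d in filter %d", s, f);
            }
            std::printf("\n");                        // from step 3 on, all 4 are busy
        }
        return 0;
    }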

Climate Modeling

Each model component can be thought of as a separate task. Arrows
represent exchanges of data between components during computation:
the atmosphere model generates wind velocity data that are used by the
ocean model, the ocean model generates sea surface temperature data
that are used by the atmosphere model, and so on.