Section Three: Parallel Algorithms

1. Introduction
1.1 Parallel Algorithm
1.2 Common Terms for Parallelism
1.3 Parallel Programming
1.4 Performance Issues
1.5 Processor Interconnections
1.6 Parallel Computing Models: PRAM (Parallel Random-Access Machine)
2. PRAM Algorithms
2.1 Parallel Reduction
2.2 The Prefix Sums Problem
2.3 The Sum of Contents of Array Problem
2.4 Parallel List Ranking
2.5 Merging of Two Sorted Arrays into a Single Sorted Array
2.6 Parallel Preorder Traversal
3. Matrix Multiplication
3.1 Row-Column-Oriented Matrix Multiplication Algorithm
3.2 Block-Oriented Matrix Multiplication Algorithms
4. Solving Linear Systems
4.1 GAUSSIAN Elimination
4.2 The JACOBI Algorithm
5. Quicksort
5.1 Parallel Quicksort Algorithm
5.2 Hypercube Quicksort Algorithm
1. Introduction

The subject of this chapter is the design and analysis of parallel algorithms. Most of today's algorithms are sequential; that is, they specify a sequence of steps in which each step consists of a single operation. These algorithms are well suited to today's computers, which basically perform operations in a sequential fashion. Although the speed at which sequential computers operate has been improving at an exponential rate for many years, the improvement is now coming at greater and greater cost. As a consequence, researchers have sought more cost-effective improvements by building "parallel" computers: computers that perform multiple operations in a single step. In order to solve a problem efficiently on a parallel machine, it is usually necessary to design an algorithm that specifies multiple operations on each step, i.e., a parallel algorithm.
As an example, consider the problem of computing the sum of a sequence A of n numbers. The standard algorithm computes the sum by making a single pass through the sequence, keeping a running sum of the numbers seen so far. It is not difficult, however, to devise an algorithm for computing the sum that performs many operations in parallel.
The parallelism in an algorithm can yield improved performance on many different kinds of computers. For example, on a parallel computer, the operations in a parallel algorithm can be performed simultaneously by different processors. Furthermore, even on a single-processor computer the parallelism in an algorithm can be exploited by using multiple functional units, pipelined functional units, or pipelined memory systems. Thus, it is important to make a distinction between the parallelism in an algorithm and the ability of any particular computer to perform multiple operations in parallel. Of course, in order for a parallel algorithm to run efficiently on any type of computer, the algorithm must contain at least as much parallelism as the computer, for otherwise resources would be left idle. Unfortunately, the converse does not always hold: some parallel computers cannot efficiently execute all algorithms, even if the algorithms contain a great deal of parallelism. Experience has shown that it is more difficult to build a general-purpose parallel machine than a general-purpose sequential machine.
The remainder of this chapter consists of five sections. We begin in Section 1 with a discussion of how to model parallel computers.
1.1 Parallel Algorithms
In computer science, a parallel algorithm or concurrent algorithm, as opposed to a traditional sequential (or serial) algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result. Some algorithms are easy to divide up into pieces like this. For example, splitting up the job of checking all of the numbers from one to a hundred thousand to see which are primes could be done by assigning a subset of the numbers to each available processor, and then putting the list of positive results back together.
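The prime-checking example above can be sketched as follows. This is an illustrative simulation only: the chunking scheme and the function names are ours, and a thread pool stands in for the available processors.

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    # Trial division: adequate for the small numbers of this example.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def primes_in_range(bounds):
    # One worker's job: check its assigned contiguous subset.
    lo, hi = bounds
    return [n for n in range(lo, hi) if is_prime(n)]

def parallel_primes(limit, workers=4):
    # Assign each worker a contiguous subset of 1..limit, then
    # concatenate the partial lists of positive results at the end.
    step = limit // workers + 1
    chunks = [(lo, min(lo + step, limit + 1)) for lo in range(1, limit + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(primes_in_range, chunks)
    return [p for part in parts for p in part]
```

Because the chunks are contiguous and `map` preserves their order, the concatenated result comes back already sorted.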
Most of the available algorithms to compute pi (π), on the other hand, cannot be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three-body problem, are also algorithms which are inherently serial. Some problems are very difficult to parallelize, although they are recursive. One such example is the depth-first search of graphs.
Parallel algorithms are valuable because of substantial improvements in multiprocessing systems and the rise of multi-core processors. In general, it is easier to construct a computer with a single fast processor than one with many slow processors with the same throughput. But processor speed is increased primarily by shrinking the circuitry, and modern processors are pushing physical size and heat limits. These twin barriers have flipped the equation, making multiprocessing practical even for small systems.
The cost or complexity of serial algorithms is estimated in terms of the space (memory) and time (processor cycles) that they take. Parallel algorithms need to optimize one more resource: the communication between different processors. There are two ways parallel processors communicate, shared memory or message passing.

Shared memory processing needs additional locking for the data, imposes the overhead of additional processor and bus cycles, and also serializes some portion of the algorithm.

Message passing processing uses channels and message boxes, but this communication adds transfer overhead on the bus, additional memory for queues and message boxes, and latency in the messages. Designs of parallel processors use special buses like crossbars so that the communication overhead will be small, but it is the parallel algorithm that decides the volume of the traffic.
Another problem with parallel algorithms is ensuring that they are suitably load balanced. For example, checking all numbers from one to a hundred thousand for primality is easy to split amongst processors; however, some processors will get more work to do than others, which will sit idle until the loaded processors complete.
1.2 Common Defining Terms for Parallelism

CRCW: A shared memory model that allows for concurrent reads (CR) and concurrent writes (CW) to the memory.

CREW: A shared memory model that allows for concurrent reads (CR) but only exclusive writes (EW) to the memory.

Depth: The longest chain of sequential dependences in a computation.

EREW: A shared memory model that allows for only exclusive reads (ER) and exclusive writes (EW) to the memory.

Multiprocessor Model: A model of parallel computation based on a set of communicating sequential processors.

Pipelined Divide-and-Conquer: A divide-and-conquer paradigm in which partial results from recursive calls can be used before the calls complete. The technique is often useful for reducing the depth of an algorithm.

Pointer Jumping: In a linked structure, replacing a pointer with the pointer it points to. Used for various algorithms on lists and trees.

PRAM Model: A multiprocessor model in which all processors can access a shared memory for reading or writing with uniform cost.

Prefix Sums: A parallel operation in which each element in an array or linked list receives the sum of all the previous elements.

Random Sampling: Using a randomly selected sample of the data to help solve a problem on the whole data.

Recursive Doubling: The same as pointer jumping.

Scan: A parallel operation in which each element in an array receives the sum of all the previous elements.

Shortcutting: Same as pointer jumping.

Symmetry Breaking: A technique to break the symmetry in a structure such as a graph which can locally look the same to all the vertices.

Work: The total number of operations taken by a computation.

Work-Depth Model: A model of parallel computation in which one keeps track of the total work and depth of a computation without worrying about how it maps onto a machine.

Work-Efficient: A parallel algorithm is work-efficient if asymptotically (as the problem size grows) it requires at most a constant factor more work than the best known sequential algorithm (or the optimal work).

Work-Preserving: A translation of an algorithm from one model to another is work-preserving if the work is the same in both models, to within a constant factor.

Concurrent Processing: A program is divided into multiple processes which are run on a single processor. The processes are time-sliced on the single processor.

Distributed Processing: A program is divided into multiple processes which are run on multiple distinct machines. The multiple machines are usually connected by a LAN. Machines used typically are workstations running multiple programs.

Parallel Processing: A program is divided into multiple processes which are run on multiple processors. The processors normally are in one machine, execute one program at a time, and have high-speed communications between them.
1.3 Parallel Programming

Issues in parallel programming not found in sequential programming include task decomposition, allocation, and sequencing:
- Breaking down the problem into smaller tasks (processes) that can be run in parallel
- Allocating the parallel tasks to different processors
- Sequencing the tasks in the proper order
- Using the processors efficiently
- Communicating interim results between processors
1.4 Performance Issues

Scalability: using more nodes should
- allow a job to run faster
- allow a larger job to run in the same time

Load balancing: all nodes should have the same amount of work; avoid having nodes idle while others are computing.

Bottlenecks:
- Communication bottlenecks: nodes spend too much time passing messages, or too many messages travel on the same path
- Serial bottlenecks

Communication: message passing is slower than computation, so maximize computation per message and avoid making nodes wait for messages.
1.5 Processor Interconnections

Parallel computers may be loosely divided into two groups:
a) Shared Memory (or Multiprocessor)
b) Message Passing (or Multicomputer)

a) Shared Memory or Multiprocessor
Individual processors have access to one or more common shared memory modules. Examples: Alliant, the Cray series, some Sun workstations.
Features: easy to build and program, but limited to a small number of processors (20-30 for a bus-based multiprocessor).

b) Message Passing or Multicomputer
Individual processors have local memory, and processors communicate via a communication network. Examples: the Connection Machine series (CM-2, CM-5), Intel Paragon, nCube, Transputers, Cosmic Cube.
Features: can scale to thousands of processors.
1.6 Parallel Computing Models: PRAM (Parallel Random-Access Machine)
PRAM stands for Parallel Random-Access Machine, a model assumed for most parallel algorithms and a generalization of most sequential machines (which consist of a processor and an attached random-access memory of arbitrary size containing instructions, where reads and writes take unit time regardless of location). This model simply attaches multiple processors to a single chunk of memory. Whereas a single-processor model also assumes a single thread running at a time, the standard PRAM model involves processors working in sync, stepping on the same instruction with the program counter broadcast to every processor (also known as SIMD: single instruction, multiple data).

Each of the processors possesses its own set of registers that it can manipulate independently; for example, an instruction to add to an element broadcast to the processors might result in each processor referring to different register indices to obtain the element so directed.

Once again, recall that all instructions, including reads and writes, take unit time. The memory is shared amongst all processors, making this a similar abstraction to a multi-core machine: all processors read from and write to the same memory, and communicate with each other using it.
The PRAM model changes the nature of some operations. Given n processors, one could perform independent operations on n values in unit time, since each processor can perform one operation independently. Computations involving aggregation (e.g., summing up a set of numbers) can also be sped up by dividing the values to be aggregated into parts, computing the aggregates of the parts recursively in parallel, and then combining the results together.

While this results in highly parallel algorithms, the PRAM poses certain complications for algorithms that have multiple processors simultaneously operating on the same memory location. What if two processors attempt to read or write different data at the same memory location at the same time? How does a practical system deal with such contentions? To address this issue, several constraints and semantic interpretations may be imposed on the PRAM model.
Depending on how concurrent access to a single memory cell (of the shared memory) is resolved, there are various PRAM variants.

ER (Exclusive Read) or EW (Exclusive Write): these variants of the PRAM do not allow concurrent access to the shared memory.

CR (Concurrent Read) or CW (Concurrent Write): these variants of the PRAM allow concurrent access to the shared memory.

Combining the rules for read and write access, there are four PRAM variants:

1. EREW: access to a memory location is exclusive. No concurrent read or write operations are allowed. This is the weakest PRAM model.

2. CREW: multiple read accesses to a memory location are allowed. Multiple write accesses to a memory location are serialized.

3. ERCW: multiple write accesses to a memory location are allowed, but multiple read accesses to a memory location are serialized. It can simulate an EREW PRAM.

4. CRCW: allows multiple read and write accesses to a common memory location. This is the most powerful PRAM model. It can simulate both an EREW PRAM and a CREW PRAM.
Most work on parallel algorithms has been done on the PRAM model. However, real machines usually differ from the PRAM in that not all processors have the same unit-time access to a shared memory. A parallel machine typically involves several processors and memory units connected by a network. Since the network can be of any shape, the memory access times differ between each processor-memory pair. Nevertheless, the PRAM presents a simple abstract model for developing parallel algorithms that can then be ported to a particular network topology.

Advantages of PRAM:
- Abstracts away from the network, including dealing with varieties of protocols
- Most ideas translate to other models

Problems of PRAM:
- Unrealistic memory model: access is not always constant time in practice, so practicality is an issue
- Synchronous: the lock-step nature removes much flexibility and restricts programming practices
- Fixed number of processors
2. PRAM Algorithms

In this chapter we discuss a variety of algorithms to solve some problems on the PRAM. As you will see, these techniques can deal with many different kinds of data structures, and often have counterparts in design techniques for sequential algorithms.
2.1 Parallel Reduction

Reduction: an operation that computes a single result from a set of data. Given a set of n values a1, a2, …, an and an associative binary operator ⊕, reduction is the process of computing a1 ⊕ a2 ⊕ … ⊕ an. Examples: finding the minimum, maximum, average, sum, product, etc.

Parallel Reduction: do the above in parallel.

Let us take the example of finding the maximum of n numbers and see how reduction can take place. Let there be n processors and n inputs; the model is EREW PRAM. At each step, pairs of elements are compared on n/2 processors in parallel; one element of each pair becomes the partial result while the other is discarded by its processor. This actually reduces the size of the problem at every step. Finally, one element, which is the output, remains on processor P0.
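The reduction above can be sketched as a sequential simulation of the PRAM rounds. This is an illustrative sketch, assuming a max reduction; the function name is ours, and each pass of the inner loop plays the synchronous step taken by the n/2 processors.

```python
def parallel_max(values):
    """Simulate parallel reduction: each round compares pairs of the
    surviving elements "in parallel" and keeps only the larger one,
    halving the problem size until one element remains."""
    data = list(values)
    while len(data) > 1:
        nxt = []
        # One PRAM step: n/2 processors each compare one pair.
        for i in range(0, len(data) - 1, 2):
            nxt.append(max(data[i], data[i + 1]))
        if len(data) % 2 == 1:
            nxt.append(data[-1])   # odd element passes through unchanged
        data = nxt                 # the problem size has been reduced
    return data[0]
```

The while loop executes ⌈lg n⌉ times, matching the logarithmic number of parallel steps claimed in the text.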
2.2 The Prefix Sums Problem

The Prefix-Sum operation takes an array A of n elements as input, drawn from some domain. Assume that a binary operation, denoted *, is defined on the set. Assume also that * is associative; namely, for any elements a, b, c in the set, a * (b * c) = (a * b) * c. (The operation * is pronounced "star" and often referred to as "sum" because addition, relative to real numbers, is a common example for *.)

The n prefix sums of array A are defined as

Sum(i) = A(1) * A(2) * … * A(i), for all i, 1 ≤ i ≤ n,

or:

Sum(1) = A(1)
Sum(2) = A(1) * A(2)
. . .
Sum(i) = A(1) * A(2) * … * A(i)
. . .
Sum(n) = A(1) * A(2) * … * A(i) * … * A(n)
EREW PRAM Parallel Algo Prefix_Sum(A[n])
{
Input: A list of elements stored in array A[0…n-1]
Output: A[i] = A[0] + A[1] + … + A[i], for each i
Spawn(P0, P1, …, Pn-1)
for each Pi where 0 ≤ i ≤ n-1
    for (j = 0 to ⌈log2 n⌉ - 1)
        if (i - 2^j ≥ 0)
            A[i] = A[i] + A[i - 2^j]
}

(Figure: the combining pattern over six elements A…F; the rounds form the partial sums A+B, C+D, E+F, then A+B+C+D, and finally A+B+C+D+E+F.)
Example: Input array is A = {3, 1, 8, 4, 2}

j = 0:
i = 0: 0 - 2^0 = -1, no work
i = 1: 1 - 2^0 = 0, A[1] = A[1] + A[0] = 4
i = 2: 2 - 2^0 = 1, A[2] = A[2] + A[1] = 9
i = 3: 3 - 2^0 = 2, A[3] = A[3] + A[2] = 12
i = 4: 4 - 2^0 = 3, A[4] = A[4] + A[3] = 6

j = 1:
i = 0: 0 - 2^1 = -2, no work
i = 1: 1 - 2^1 = -1, no work
i = 2: 2 - 2^1 = 0, A[2] = A[2] + A[0] = 12
i = 3: 3 - 2^1 = 1, A[3] = A[3] + A[1] = 16
i = 4: 4 - 2^1 = 2, A[4] = A[4] + A[2] = 15

j = 2:
i = 0: 0 - 2^2 = -4, no work
i = 1: 1 - 2^2 = -3, no work
i = 2: 2 - 2^2 = -2, no work
i = 3: 3 - 2^2 = -1, no work
i = 4: 4 - 2^2 = 0, A[4] = A[4] + A[0] = 18

The final array A = {3, 4, 12, 16, 18} holds the prefix sums of the input.
Running time of Algorithms
If the input sequence has
n
steps, then the recursion continues to a depth of
O(log
n), which is also
the
bound on the parallel running time of this algorithm. The number of steps of the algorithm is
O(n).
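The pseudocode above can be simulated sequentially in Python. This is an illustrative sketch with names of our choosing; the snapshot list stands in for the synchronous lock-step reads of the EREW PRAM.

```python
import math

def pram_prefix_sums(A):
    """Simulate the EREW prefix-sums algorithm: in round j, every
    "processor" i with i - 2^j >= 0 adds A[i - 2^j] into A[i].
    All reads in a round see the values from the previous round,
    mimicking the PRAM's synchronous steps."""
    A = list(A)
    n = len(A)
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for j in range(rounds):
        old = list(A)              # snapshot: simultaneous reads
        for i in range(n):         # one iteration per processor Pi
            if i - 2 ** j >= 0:
                A[i] = old[i] + old[i - 2 ** j]
    return A
```

Running it on the worked example's input {3, 1, 8, 4, 2} reproduces the same intermediate arrays round by round.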
Applications:
- Within counting sort and radix sort
- List ranking
- Combining list ranking, prefix sums, and Euler tours solves many important problems on trees
- Simulating parallel algorithms
2.3 The Sum of Contents of Array Problem

This algorithm works on the EREW PRAM model, as there are no read or write conflicts. With p processors, the first loop to add the numbers in each segment takes each processor O(n/p) time, as we would like. But then processor 0 must perform its loop to receive the subtotals from each of the p - 1 other processors; this loop takes O(p) time. So the total time taken is O(n/p + p), a bit more than the O(n/p) time we hoped for.
Input: A list of elements stored in array A[0…n-1]
Output: A[0] = A[0] + A[1] + … + A[n-1]

EREW PRAM Parallel Algo Sum_of_Array(A[n])
{
Spawn(P0, P1, …, Pn/2-1)
for each Pi where 0 ≤ i ≤ n/2 - 1
    for (j = 0 to ⌈log2 n⌉ - 1)
        if (i mod 2^j = 0 AND 2i + 2^j < n)
            A[2i] = A[2i] + A[2i + 2^j]
}
Example: Input array is A = {2, 4, 9, 7, 6, 3}

In the first round, processor 1 sends its total to processor 0, processor 3 to processor 2, and so on. The n/2 receiving processors all do their work simultaneously. In the next round, processor 2 sends its total to processor 0, processor 6 to processor 4, and so on, leaving us with n/4 sums, which again can all be computed simultaneously. After only ⌈lg n⌉ rounds, processor 0 will have the sum of all elements.

Input:        A = 2, 4, 16 is not yet formed; A = 2, 4, 9, 7, 6, 3
Round one:    A = 6, 4, 16, 7, 9, 3
Round two:    A = 22, 4, 16, 7, 9, 3
Round three:  A = 31, 4, 16, 7, 9, 3
Running time of Algorithms
i
f

else
statement runs in O(1). The
for
loop
i
f

else
statemen
t runs
in
O(
lg n
)
, thus e
ach processor runs for loop
(inner) O(lg
n) time. Thus
the time complexity of algorithm is
O(
lg n
)
with n

processors
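The summation algorithm can likewise be simulated in Python. The sketch and its names are ours; the snapshot again models the synchronous PRAM step.

```python
import math

def pram_array_sum(A):
    """Simulate the EREW sum algorithm: in round j, a "processor" i
    with i mod 2^j = 0 and 2i + 2^j < n adds A[2i + 2^j] into A[2i].
    After ceil(lg n) rounds the total sits in A[0]."""
    A = list(A)
    n = len(A)
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for j in range(rounds):
        old = list(A)              # snapshot for the synchronous step
        for i in range(n):         # one iteration per processor Pi
            if i % (2 ** j) == 0 and 2 * i + 2 ** j < n:
                A[2 * i] = old[2 * i] + old[2 * i + 2 ** j]
    return A[0]
```

On the example input {2, 4, 9, 7, 6, 3} the three rounds produce the partial totals 6, 16, 9, then 22, then the final sum 31, matching the table above.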
2.4 Parallel List Ranking

The list ranking problem was posed by Wyllie (1979), who solved it with a parallel algorithm using logarithmic time and O(n log n) total steps (with n processors). In parallel algorithms, the list ranking problem involves determining the position (or rank) of each item in a linked list. That is, the first item in the list should be assigned the number 1, the second item the number 2, etc.

List ranking can equivalently be viewed as performing a prefix sum operation on the given list, in which the values to be summed are all equal to one. The list ranking problem can be used to solve many problems on trees via an Euler tour technique. The serial algorithm for list ranking takes Θ(n) time.

List ranking is defined as: "Given a singly linked list L with n objects, compute, for each object in L, its distance from the end of the list."
Formally: suppose next is the pointer field. Then

d[i] = 0 if next[i] = NIL
d[i] = d[next[i]] + 1 if next[i] ≠ NIL

Input: A linked list
Output: The rank of the ith node is stored in d[i], and each next[i] points to the rightmost node.
EREW algorithm List_Rank(L)
{
Spawn(P0, P1, …, Pn-1)
for each Pi where 0 ≤ i ≤ n-1
{
    if (next[i] = NIL)
        d[i] = 0
    else
        d[i] = 1
    for (j = 1 to ⌈lg n⌉)
    {
        if (next[i] ≠ NIL)
        {
            d[i] = d[i] + d[next[i]]
            next[i] = next[next[i]]
        }
    }
}
}
Example: given a linked list of six nodes.

First step: d[i] = 0 if next[i] = NIL, and d[i] = 1 otherwise, giving d = 1, 1, 1, 1, 1, 0.
After the first pointer jump:  d = 2, 2, 2, 2, 1, 0
After the second pointer jump: d = 4, 4, 3, 2, 1, 0
After the third pointer jump:  d = 5, 4, 3, 2, 1, 0
Running time of the algorithm: the inner for loop runs O(lg n) times, and the if-else statement runs in O(1); thus the body of the outer for loop takes O(lg n) + O(1) = O(lg n). So the time taken by each processor in the system is O(lg n), and the time complexity of the algorithm is O(lg n) with n processors.
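Wyllie's pointer-jumping loop can be simulated as follows. This is a sketch with names of our choosing; a Python None plays NIL, and the snapshots model the synchronous PRAM step.

```python
import math

def pram_list_rank(next_ptr):
    """Simulate pointer-jumping list ranking. next_ptr[i] is the index
    of node i's successor, or None at the end of the list. Returns d,
    where d[i] is node i's distance from the end."""
    n = len(next_ptr)
    nxt = list(next_ptr)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    for _ in range(math.ceil(math.log2(n)) if n > 1 else 0):
        old_d, old_nxt = list(d), list(nxt)   # synchronous snapshot
        for i in range(n):                    # one iteration per processor
            if old_nxt[i] is not None:        # guard: stop at the list end
                d[i] = old_d[i] + old_d[old_nxt[i]]
                nxt[i] = old_nxt[old_nxt[i]]
    return d
```

For the six-node chain 0 → 1 → 2 → 3 → 4 → 5 the intermediate d arrays match the example's steps: (1,1,1,1,1,0), (2,2,2,2,1,0), (4,4,3,2,1,0), (5,4,3,2,1,0).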
2.5 Merging of Two Sorted Arrays into a Single Sorted Array

Many PRAM algorithms achieve low time complexity by performing more operations than an optimal RAM algorithm. The problem of merging two sorted lists is another example. One optimal RAM algorithm creates the merged list one element at a time. It requires at most n - 1 comparisons to merge two sorted lists of n/2 elements. Its time complexity is Θ(n). A PRAM algorithm can perform the task in Θ(log n) time by assigning each list element its own processor. Every processor finds the position of its own element on the other list using binary search. Because an element's index on its own list is known, its place on the merged list can be computed when its index on the other list has been found and the two indices added. All n elements can be inserted into the merged list by their processors in constant time (see Fig. 2-16). The pseudocode for the PRAM algorithm to merge two sorted lists appears in Fig. 2-17. In this version of the algorithm the two lists and their union have disjoint values.

Let's examine the algorithm in detail. As usual, the first step is to spawn the maximum number of processors needed at any point in the algorithm's execution. In this case we need n processors, one for each element of the two lists to be merged. After the processors are spawned, we immediately activate them. In parallel, the processors determine the range of indices they are going to search. The processors associated with elements in the lower half of the array will perform binary search on the elements in the upper half of the array, and vice versa.

Every processor has a unique value of x, an element to be merged. The repeat…until loop implements binary search. When a processor exits the loop, its private value of high will be set to the index of the largest element on the other list that is smaller than x.

Consider a processor Pi associated with value A[i] in the lower half of the list. The processor's final value of high must be between n/2 and n. Element A[i] is larger than i - 1 elements on the lower half of the list. It is also larger than high - n/2 elements on the upper half of the list. Therefore, we should put A[i] on the merged list after i + high - n/2 - 1 other elements, at index i + high - n/2.

Now consider a processor Pi associated with value A[i] in the upper half of the list. The processor's final value of high must be between 0 and n/2. Element A[i] is larger than i - (n/2 + 1) other elements on the upper half of the list. It is also larger than high elements on the lower half of the list. Therefore, we should put A[i] on the merged list after i + high - n/2 - 1 other elements, at index i + high - n/2.

Since all processors use the same expression to place elements in their proper places on the merged list, every processor relocates its element using the same assignment statement at the end of the algorithm.
Input: An array A[1…n] holding two sorted lists, in A[1…n/2] and A[(n/2)+1…n], with disjoint values
Output: A[1…n] containing the single merged sorted list
CREW PRAM algorithm merge_Lists(A[1..n])
{
Spawn(P1, P2, …, Pn)
for each Pi where 1 ≤ i ≤ n
{
    if (i ≤ n/2)
    {
        low = n/2 + 1
        high = n
    }
    else
    {
        low = 1
        high = n/2
    }
    x = A[i]
    do
    {
        index = ⌊(low + high)/2⌋
        if (x < A[index])
            high = index - 1
        else
            low = index + 1
    } while (low ≤ high)
    A[high + i - n/2] = x
}
}
Running time: the total number of operations performed to merge the lists has increased from Θ(n) in the sequential algorithm to Θ(n log n) in the parallel algorithm, which runs in Θ(log n) time with n processors. This tactic is sensible only when the number of processors is unbounded. When we begin to develop algorithms for real parallel computers, with processors a limited resource, we must consider the cost of the parallel algorithm.
Example: We are given two sorted lists stored in one array: {1, 5, 7, 9, 13, 17, 19, 23} and {2, 4, 8, 11, 12, 21, 22, 24}. Here n = 16.

Step 1: here i = 1 and the condition i ≤ n/2 is true, so
low = 8 + 1 = 9
high = 16
x = A[1] = 1
index = ⌊(9 + 16)/2⌋ = ⌊12.5⌋ = 12, and A[12] = 11
Here x < A[index] is true, so high = index - 1 = 12 - 1 = 11.
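The binary-search placement can be simulated in Python. This sketch assumes, as the text does, that n is even and the two halves hold disjoint values; the function name is ours.

```python
def pram_merge(A):
    """Simulate the CREW merging algorithm: each element binary-searches
    the opposite half for the largest element smaller than itself, then
    is written to slot high + i - n/2 (1-based indices, as in the text)."""
    n = len(A)
    result = [None] * n
    for i in range(1, n + 1):             # 1-based "processor" index
        if i <= n // 2:
            low, high = n // 2 + 1, n     # search the upper half
        else:
            low, high = 1, n // 2         # search the lower half
        x = A[i - 1]
        while low <= high:                # binary search
            index = (low + high) // 2
            if x < A[index - 1]:
                high = index - 1
            else:
                low = index + 1
        result[high + i - n // 2 - 1] = x  # convert to a 0-based slot
    return result
```

Because the values are disjoint, every processor computes a distinct target slot, so all the writes can land simultaneously.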
2.6 Parallel Preorder Traversal

Sometimes it is appropriate to attempt to reduce a complicated-looking problem to a simpler one for which a fast parallel algorithm is already known. The problem of numbering the vertices of a rooted tree in preorder (depth-first search order) is a case in point. Note that a preorder tree traversal algorithm visits the nodes of the given tree according to the principle root-left-right.

The algorithm works in the following way. In step one, the algorithm constructs a singly linked list. Each vertex of the singly linked list corresponds to a downward or upward edge traversal of the tree. In step two, the algorithm assigns weights to the vertices of the newly created singly linked list. In the preorder traversal algorithm, a vertex is labelled as soon as it is encountered via a downward edge traversal, so every such list element gets the weight 1, meaning that the node count is incremented when this edge is traversed. List elements corresponding to upward edges have the weight 0, because the node count does not increase when the preorder traversal works its way back up the tree through previously labelled nodes. In the third step, we compute, for each element of the singly linked list, the rank of that list element. In step four, the processors associated with downward edges use the ranks they have computed to assign each vertex a preorder traversal number.
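The four steps above can be sketched sequentially. This is an illustrative sketch under the assumption that the tree is given as a children-adjacency dictionary (names ours); the prefix sum of step three is done serially here, whereas a real PRAM would use list ranking.

```python
def parallel_preorder_numbers(children, root=0):
    """Sequential sketch of the four-step parallel preorder algorithm.
    children maps a vertex to the list of its children."""
    tour, weights = [], []
    def euler(v):
        # Step 1 + 2: build the Euler-tour list of edge traversals,
        # weighting downward edges 1 and upward edges 0.
        for c in children.get(v, []):
            tour.append(('down', c)); weights.append(1)
            euler(c)
            tour.append(('up', c)); weights.append(0)
    euler(root)
    # Step 3: prefix sums of the weights along the tour.
    rank, total = [], 0
    for w in weights:
        total += w
        rank.append(total)
    # Step 4: a downward edge's rank, plus one, is its vertex's
    # preorder number; the root is number 1.
    numbers = {root: 1}
    for (kind, v), r in zip(tour, rank):
        if kind == 'down':
            numbers[v] = r + 1
    return numbers
```

For a root 0 with children 1 and 2, where 1 has child 3, the preorder visit order is 0, 1, 3, 2.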
3. Matrix Multiplication

Matrix Multiplication: in matrix multiplication we compute C = A.B, where A, B, and C are dense matrices of size n × n. (A dense matrix is a matrix in which most of the entries are nonzero.) This matrix-matrix multiplication involves Θ(n^3) operations, since for each element Cij of C, we must compute:
Cij = Σ (k = 0 to n-1) Aik × Bkj

(Figure (a)-(d): the preorder traversal example of Section 2.6 on a tree with vertices A-H, showing the tree, its Euler-tour list with weight 1 on each downward edge and 0 on each upward edge, the ranks computed by list ranking, and the resulting preorder numbers A=1, B=2, C=7, D=3, E=4, F=8, G=5, H=6.)
Following is the sequential algorithm, which runs in O(n^3) time.

Input: Two matrices An,n and Bn,n
Output: A matrix C with Cij = Σk Aik × Bkj

Sequential Algorithm Matrix_Multiplication(An,n, Bn,n)
{
for (i = 0 to n-1)
    for (j = 0 to n-1)
    {
        t = 0
        for (k = 0 to n-1)
            t = t + Aik × Bkj
        Cij = t
    }
}
Parallel Matrix Multiplication: obviously, the operations of matrix multiplication can be done in parallel.
3.1 Row-Column-Oriented Matrix Multiplication Algorithm
We have two varieties of parallel matrix multiplication algorithms:
- Row-Column-Oriented Matrix Multiplication Algorithm
- Block-Oriented Matrix Multiplication Algorithms
If the PRAM has p = n^3 processors, then matrix multiplication can be done in Θ(log n) time by using one processor to compute each product Aik × Bkj and then allowing groups of n processors to perform the n-input summations (semigroup computation) in Θ(log n) time. Because we are usually not interested in parallel processing for matrix multiplication unless n is fairly large, this is not a practical solution.
PRAM algorithm using n^3 processors

Input: Two matrices An,n and Bn,n
Output: A matrix C with Cij = Σk Aik × Bkj

PRAM Parallel Algo Matrix_Multiplication(An,n, Bn,n)
{
for each Processor (i, j, k), 0 ≤ i, j, k < n, do
    T[i, j, k] = Aik × Bkj
for each group of n processors (i, j), do
    Cij = T[i, j, 0] + T[i, j, 1] + … + T[i, j, n-1]   // a Θ(log n) parallel reduction
}
Now assume that the PRAM has p = n² processors. In this case, matrix multiplication can be done in Θ(n) time by using one processor to compute each element Cij of the product matrix C. The processor responsible for computing Cij reads the elements of row i in A and the elements of column j in B, multiplies their corresponding kth elements, and adds each of the products thus obtained to a running total t. This amounts to parallelizing the i and j loops in the sequential algorithm (Fig. 5.12). For simplicity, we label the n² processors with two indices (i, j), each ranging from 0 to n - 1, rather than with a single index ranging from 0 to n² - 1.
PRAM algorithm using n² processors

Input: Two matrices An,n and Bn,n
Output: A matrix C with Cij = Σk Aik × Bkj

PRAM Parallel Algo Matrix_Multiplication(An,n, Bn,n)
{
for each Processor (i, j), 0 ≤ i, j < n, do
{
    t = 0
    for (k = 0 to n-1)
        t = t + Aik × Bkj
    Cij = t
}
}
Because each processor reads a different row of the matrix A, no concurrent reads from A are ever attempted. For matrix B, however, all n processors access the same element Bkj at the same time. Again, one can skew the memory accesses for B in such a way that the EREW submodel is applicable.
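The n²-processor algorithm can be simulated sequentially; each (i, j) iteration of this sketch plays the role of one processor (function name ours).

```python
def pram_matmul(A, B):
    """Simulate the n^2-processor PRAM algorithm: the "processor" for
    (i, j) walks row i of A and column j of B, accumulating a running
    total t into C[i][j] over n steps."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):            # each (i, j) pair is one processor
        for j in range(n):
            t = 0
            for k in range(n):    # the Θ(n) sequential inner loop
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C
```

On a real PRAM all n² (i, j) iterations would run simultaneously, so the elapsed time is that of the inner loop alone: Θ(n).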
3.2 Block-Oriented Matrix Multiplication Algorithms

In many practical situations, the number of processors is even less than n, so we need to develop an algorithm for this case as well. We can let Processor i compute a set of m/p rows in the result matrix C; say rows i, i+p, i+2p, …, i+(m/p-1)p. Again, we are parallelizing the i loop, as this is preferable to parallelizing the k loop (which has data dependencies) or the j loop.

On a lightly loaded Sequent Symmetry shared-memory multiprocessor, this last algorithm exhibits almost linear speedup, with a speedup of about 22 reported in [Quin94]. This is typical of what can be achieved on UMA multiprocessors with our simple parallel algorithm. Recall that the UMA (uniform memory access) property implies that any memory location is accessible with the same amount of delay. The drawback of the above algorithm for NUMA (non-uniform memory access) shared-memory multiprocessors is that each element of B is fetched n/p times, with only two arithmetic operations (one multiplication and one addition) performed for each such element. Block matrix multiplication, discussed next, increases the computation to memory access ratio, thus improving the performance for NUMA multiprocessors.
Let us divide the m × m matrices A, B, and C into p blocks of size q × q (Fig. 5.13), where q = m/√p. We can then multiply the m × m matrices by using block matrix multiplication with √p × √p blocks, where the terms in the algorithm statement t = t + Aik × Bkj are now q × q matrices and Processor (i, j) computes Block (i, j) of the result matrix C. Thus, the algorithm is similar to our second algorithm above, with the statement t = t + Aik × Bkj replaced by a sequential q × q matrix multiplication algorithm.
Figure 5.13. Partitioning the matrices for block matrix multiplication
Each multiply-add computation on q × q blocks needs 2q² = 2m²/p memory accesses to read the blocks and 2q³ arithmetic operations. So q arithmetic operations are performed for each memory access, and better performance will be achieved as a result of improved locality. The assumption here is that Processor (i, j) has sufficient local memory to hold Block (i, j) of the result matrix C (q² elements) and one block-row of the matrix B; say the q elements in Row kq + c of Block (k, j) of B. Elements of A can be brought in one at a time. For example, as the element in Row iq + a of Column kq + c in Block (i, k) of A is brought in, it is multiplied in turn by the locally stored q elements of B, and the results added to the appropriate q elements of C (Fig. 5.14).
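The block scheme can be sketched as follows, assuming m is a multiple of q; each (bi, bj) pair corresponds to one processor accumulating its q × q block of C:

```python
def block_matrix_multiply(A, B, q):
    """Block matrix multiplication: the statement t = t + A_ik * B_kj
    now adds the product of two q x q blocks into Block (bi, bj) of C."""
    m = len(A)
    nb = m // q                       # number of blocks per dimension
    C = [[0] * m for _ in range(m)]
    for bi in range(nb):
        for bj in range(nb):          # Processor (bi, bj)'s block of C
            for bk in range(nb):      # block-level t = t + A[bi][bk] * B[bk][bj]
                for i in range(bi * q, (bi + 1) * q):
                    for j in range(bj * q, (bj + 1) * q):
                        for k in range(bk * q, (bk + 1) * q):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The result is identical to ordinary multiplication; only the access order changes, which is what improves locality.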
4. Solving Linear Systems
In this section we discuss the parallelization of:
4.1 GAUSSIAN Elimination
4.2 The JACOBI Algorithm
4.1 GAUSSIAN Elimination
Gaussian elimination is a well known algorithm for solving the linear system:
Ax = B
where A is a known, square, dense n × n matrix, while x and B are both n-dimensional vectors. This is called a system of equations. The general procedure is: Gaussian elimination reduces Ax = B to an upper triangular system Tx = C, which can then be solved through backward substitution.
In the numeric program, vector B is stored as the (n+1)th column of matrix A. Let k control the elimination step, loop i control access to the ith row, and loop j control access to the jth column; the sequential Gaussian elimination algorithm is described as follows:
Sequential Algo Forward_Elimination (A_n,n+1)
// Transforming to Upper Triangular
{
    for (k = 1 to n - 1)
        for (i = k + 1 to n)
        {
            A_ik = A_ik / A_kk
            for (j = k + 1 to n + 1)
                A_ij = A_ij - A_ik * A_kj
        }
}
Note that since the entries in the lower triangular matrix vanish after elimination, their space is used to store the multipliers A_ik = A_ik / A_kk.
Sequential Algo Backward_Substitution (A_n,n+1)
// Back-Substitution
{
    for (i = n to 1)
    {
        x_i = A_i,n+1
        for (j = i + 1 to n)
            x_i = x_i - A_ij * x_j
        x_i = x_i / A_ii
    }
}
Note that x_i is stored in the space of A_i,n+1.
Gaussian Elimination (GE) is one of the direct methods of solving linear systems (Ax = b). Of the above algorithms, the first, Forward_Elimination, converts matrix A into a triangular matrix. Thereafter, the second algorithm, Backward_Substitution, computes the value of the vector x.
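A direct Python transcription of the two sequential algorithms (with b kept as column n+1 of the augmented matrix, as in the text; the multiplier is held in a local variable here rather than stored in place):

```python
def gaussian_solve(A, b):
    """Forward elimination to an upper triangular system,
    then backward substitution."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]  # b as the (n+1)th column
    for k in range(n - 1):                    # forward elimination
        for i in range(k + 1, n):
            mult = M[i][k] / M[k][k]          # the multiplier A_ik / A_kk
            for j in range(k, n + 1):
                M[i][j] -= mult * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):            # backward substitution
        x[i] = M[i][n]
        for j in range(i + 1, n):
            x[i] -= M[i][j] * x[j]
        x[i] /= M[i][i]
    return x
```

For example, 2x + y = 3 and x + 3y = 4 yield x = y = 1. (No pivoting is done, matching the text; a robust implementation would add partial pivoting.)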
Parallel Gaussian elimination:
Now, we describe a parallel Gaussian elimination algorithm. In the forward elimination part, the following task (denoted E_k^i) can be parallelized:
E_k^i:
    A_ik = A_ik / A_kk
    for (j = k + 1 to n + 1)
        A_ij = A_ij - A_ik * A_kj
The data access pattern of each task is: read rows A_k and A_i, then write row A_i. Thus, for each step k, the tasks E_k^(k+1), E_k^(k+2), …, E_k^n are independent, which leads to the following parallel algorithm:
Parallel forward elimination Algorithm:
for (k = 1 to n - 1)
{
    Calculate E_k^(k+1), E_k^(k+2), …, E_k^n in parallel on p processors.
}
Group the tasks into a set of clusters such that each task E_k^i with the same row i is in the same cluster C_i. Row i and cluster C_i will be mapped to the same processor. Cyclic mapping should be used to achieve a balanced load among processors: Processor number = (i - 1) mod p.
In the backward substitution part, denote the task for row i by S_i:
S_i:
    for (j = i + 1 to n)
        x_i = x_i - A_ij * x_j
    x_i = x_i / A_ii
The dependence is: S_n must complete before S_(n-1), …, before S_1.
Parallel backward substitution algorithm:
for (i = n to 1)
{
    Calculate S_i gradually on the processor that owns row i, as the needed x_j values become available.
}
4.2 The JACOBI Algorithm
In numerical linear algebra, the Jacobi method is an algorithm for determining the solution of a system of linear equations in which the diagonal element of each row dominates the other entries in absolute value (a diagonally dominant system). Each diagonal element is solved for, and an approximate value plugged in. The process is then iterated until it converges. This algorithm is a stripped-down version of the Jacobi transformation method of matrix diagonalization. The method is named after the German mathematician Carl Gustav Jakob Jacobi.
Given a square system of n linear equations:
Ax = b
where A is the coefficient matrix, x the vector of unknowns, and b the right-hand-side vector. Then A can be decomposed into a diagonal component D and the remainder R:
A = D + R
The system of linear equations may be rewritten as:
(D + R)x = b
Thus
Dx + Rx = b
and finally:
Dx = b - Rx
The Jacobi method is an iterative technique that solves the left-hand side of this expression for x, using the previous value for x on the right-hand side. Analytically, this may be written as:
x^(k+1) = D^(-1) (b - R x^(k))
The element-based formula is thus:
x_i^(k+1) = (1 / A_ii) (b_i - sum over j ≠ i of A_ij x_j^(k))
Note that the computation of x_i^(k+1) requires each element in x^(k) except itself. Unlike the Gauss-Seidel method, we can't overwrite x_i^(k) with x_i^(k+1), as that value will be needed by the rest of the computation. The minimum amount of storage is two vectors of size n.
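A minimal sketch of the iteration, keeping the two size-n vectors just described (the fixed iteration count is illustrative; a real code would test for convergence):

```python
def jacobi(A, b, iterations=100):
    """Jacobi iteration x^(k+1) = D^-1 (b - R x^(k)); x and x_new are
    the two size-n vectors, since x^(k) cannot be overwritten in place."""
    n = len(A)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)  # (R x)_i
            x_new[i] = (b[i] - s) / A[i][i]                      # divide by D_ii
        x = x_new
    return x
```

For the diagonally dominant system 4x + y = 5, x + 3y = 4 this converges to x = y = 1.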
Parallel JACOBI Method
Parallel implementation:
Distribute the rows of B and the diagonal elements of D to the processors.
Perform the computation based on the owner-computes rule.
Perform an all-to-all broadcast of the updated values after each iteration.
Note: If the iteration matrix is very sparse, i.e., contains a lot of zeros, the code design should take advantage of this and should not store the zero elements. Also, the code design should explicitly skip operations applied to zero elements.
5. Quicksort
Quicksort:
The quicksort algorithm was developed in 1960 by Tony Hoare. Quicksort is a divide and conquer algorithm. Quicksort first divides a large list into two smaller sub-lists: the low elements and the high elements. Quicksort can then recursively sort the sub-lists.
The idea of sorting is:
1. Pick an element, called a pivot, from the list.
2. Reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
3. Recursively sort the sub-list of lesser elements and the sub-list of greater elements.
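The three steps translate directly into a short (non-in-place) Python sketch:

```python
def quicksort(lst):
    """Quicksort following the three steps above."""
    if len(lst) <= 1:
        return lst
    pivot = lst[0]                                  # step 1: pick a pivot
    lesser = [x for x in lst[1:] if x <= pivot]     # step 2: partition
    greater = [x for x in lst[1:] if x > pivot]
    return quicksort(lesser) + [pivot] + quicksort(greater)  # step 3: recurse
```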
Parallel Quicksort:
Obviously, the sorting operation can be done in parallel. We have two varieties of parallel quicksort:
Parallel Quicksort Algorithm
Hypercube Quicksort Algorithm
5.1 Parallel Quicksort Algorithm
Quicksort is a divide-and-conquer algorithm that sorts a sequence by recursively dividing it into smaller subsequences. Sorting the two smaller arrays represents two completely independent subproblems that can be solved in parallel. Therefore, one way to parallelize quicksort is to execute it initially on a single processor; then, when the algorithm performs its recursive calls, assign different subproblems to different processors. Each of these processors then sorts its array using sequential quicksort and assigns its own subproblems to other processors. The algorithm terminates when the arrays cannot be further partitioned.
Initially, the input list of elements is partitioned serially and allotted to the processors P_1, P_2, …, P_n. We randomly choose a pivot on one of the processors and broadcast it to every processor. Then each processor P_i sorts its unsorted list using sequential quicksort, in parallel, and divides its list into two sub-lists (this is called Local Arrangement), L_i and R_i. L_i has the set of elements smaller than (or equal to) the pivot element, and R_i has the set of elements larger than the pivot element. Each processor in the upper half of the processor list sends its L_i to a partner process in the lower half of the processor list and receives an R_i in return (called Global Arrangement). Now, the upper half of the list has only values greater than the pivot, and the lower half of the list has only values smaller than the pivot. Thereafter, the processors divide themselves into two teams. Each team then performs the above operation recursively.
The same procedure, step by step:
Initially, the input list of elements is partitioned serially and allotted to the processors P_1, P_2, P_3, …
Step 1: Randomly choose a pivot on one of the processors and broadcast it to every processor.
Step 2: Each processor P_i sorts its unsorted list using sequential quicksort and divides it into two lists (Local Arrangement):
  o Those smaller than (or equal to) the pivot; say L_i, and
  o Those greater than the pivot; say R_i.
Step 3: Each processor in the upper half of the processor list sends its L_i to a partner process in the lower half of the processor list and receives an R_i in return (Global Arrangement). Now, the upper half of the list has only values greater than the pivot, and the lower half only values smaller than the pivot.
Step 4: Thereafter, the processors divide themselves into two groups.
Step 5: Each group performs this operation recursively.
Example:
Input array is 7, 13, 18, 2, 17, 1, 14, 20, 6, 10, 15, 9, 3, 16, 19, 4, 11, 12, 5, 8
(Steps First through Fourth and the final arrangement are shown in the accompanying figures.)
5.2 Hypercube Quicksort Algorithm
The hypercube network has structural characteristics that offer scope for implementing efficient divide-and-conquer sorting algorithms, such as quicksort. Initially, a list of n numbers is placed on any one node of a d-dimensional hypercube. The processor that has the list divides it, as in sequential quicksort, into two sublists with respect to a pivot it selects, and sends one part to the adjacent node in the highest dimension. Then the two nodes can repeat the process.
For example, suppose the list is on node 000 of a 3-dimensional hypercube.
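The bit-inversion pattern described below can be generated mechanically; this sketch (names are my own) lists, for each step, which node sends its high sublist to which partner:

```python
def hypercube_send_pattern(d):
    """For a d-dimensional hypercube with the data initially on node 0,
    return for each step the (sender, receiver) pairs: the receiver's
    label is the sender's label with one bit inverted, starting from
    the leftmost (highest-dimension) bit."""
    holders = [0]                          # nodes currently holding data
    pattern = []
    for step in range(d):
        bit = 1 << (d - 1 - step)          # leftmost bit first
        pattern.append([(s, s | bit) for s in holders])
        holders = sorted(holders + [s | bit for s in holders])
    return pattern
```

For d = 3 this reproduces the table below: step 1 is 000 → 100, step 2 is 000 → 010 and 100 → 110, and step 3 inverts the rightmost bit of all four holders.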
Steps | Label of Sender Node | Label of Receiver Node | Remark
1st step | 000 | 100 (numbers greater than a pivot, say p1) | Invert the leftmost bit of the sender's label to get the receiver's address.
2nd step | 000, 100 | 010 (numbers greater than a pivot, say p2), 110 (numbers greater than a pivot, say p3) | Invert the middle bit of the sender's label to get the receiver's address.
3rd step | 000, 100, 010, 110 | 001 (say p4), 101 (say p5), 011 (say p6), 111 (say p7) | Invert the rightmost bit of the sender's label to get the receiver's address.
Finally, the parts are sorted using a sequential algorithm, all in parallel, as shown in the following figure. If required, the sorted parts can be returned to one processor in a sequence that allows that processor to concatenate the sorted lists to create the final sorted list.
Example: the input array is 2, 4, 1, 3, 8, 7, 5, 22, 6, 9, 11, 44, 33, 99, 88, initially placed on P0.
Processors: P0 (000), P1 (001), P2 (010), P3 (011), P4 (100), P5 (101), P6 (110), P7 (111); all nodes except P0 start with no data.

Step 1 (pivot = 9, chosen on P0): after partitioning, P0 holds 2, 4, 1, 3, 8, 7, 5, 6, 9, 22, 11, 44, 33, 99, 88; the sublist greater than the pivot is sent to P4.
P0: 2, 4, 1, 3, 8, 7, 5, 6, 9 | P4: 22, 11, 44, 33, 99, 88

Step 2 (pivot = 5 on P0, pivot = 33 on P4): the sublists greater than the pivots are sent to P2 and P6.
P0: 2, 4, 1, 3, 5 | P2: 7, 8, 6, 9 | P4: 22, 11, 33 | P6: 44, 99, 88

Step 3 (pivot = 3 on P0, 8 on P2, 22 on P4, 44 on P6): the sublists greater than the pivots are sent to P1, P3, P5, and P7.
P0: 2, 1, 3 | P1: 4, 5 | P2: 7, 6, 8 | P3: 9 | P4: 11, 22 | P5: 33 | P6: 44 | P7: 99, 88

Step 4: each node sorts its own list using sequential quicksort, all in parallel.
P0: 1, 2, 3 | P1: 4, 5 | P2: 6, 7, 8 | P3: 9 | P4: 11, 22 | P5: 33 | P6: 44 | P7: 88, 99

All processors now send their lists to P0, and the final sorted list on P0 is
1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 22, 33, 44, 88, 99