Section Three: Parallel Algorithms



1. Introduction
   1.1 Parallel Algorithms
   1.2 Common Terms for Parallelism
   1.3 Parallel Programming
   1.4 Performance Issues
   1.5 Processor Interconnections
   1.6 Parallel Computing Models: PRAM (Parallel Random-Access Machine)

2. PRAM Algorithms
   2.1 Parallel Reduction
   2.2 The Prefix-Sums Problem
   2.3 The Sum of Contents of Array Problem
   2.4 Parallel List Ranking
   2.5 Merging of Two Sorted Arrays into a Single Sorted Array
   2.6 Parallel Preorder Traversal

3. Matrix Multiplication
   3.1 Row-Column Oriented Matrix Multiplication Algorithm
   3.2 Block-Oriented Matrix Multiplication Algorithms

4. Solving Linear Systems
   4.1 Gaussian Elimination
   4.2 The Jacobi Algorithm

5. Quicksort
   5.1 Parallel Quicksort Algorithm
   5.2 Hypercube Quicksort Algorithm




















1. Introduction

The subject of this chapter is the design and analysis of parallel algorithms. Most of today's algorithms are sequential, that is, they specify a sequence of steps in which each step consists of a single operation. These algorithms are well suited to today's computers, which basically perform operations in a sequential fashion. Although the speed at which sequential computers operate has been improving at an exponential rate for many years, the improvement is now coming at greater and greater cost. As a consequence, researchers have sought more cost-effective improvements by building "parallel" computers, that is, computers that perform multiple operations in a single step.

In order to solve a problem efficiently on a parallel machine, it is usually necessary to design an algorithm that specifies multiple operations on each step, i.e., a parallel algorithm. As an example, consider the problem of computing the sum of a sequence A of n numbers. The standard algorithm computes the sum by making a single pass through the sequence, keeping a running sum of the numbers seen so far. It is not difficult, however, to devise an algorithm for computing the sum that performs many operations in parallel.

The parallelism in an algorithm can yield improved performance on many different kinds of computers. For example, on a parallel computer, the operations in a parallel algorithm can be performed simultaneously by different processors. Furthermore, even on a single-processor computer the parallelism in an algorithm can be exploited by using multiple functional units, pipelined functional units, or pipelined memory systems. Thus, it is important to make a distinction between the parallelism in an algorithm and the ability of any particular computer to perform multiple operations in parallel. Of course, in order for a parallel algorithm to run efficiently on any type of computer, the algorithm must contain at least as much parallelism as the computer, for otherwise resources would be left idle. Unfortunately, the converse does not always hold: some parallel computers cannot efficiently execute all algorithms, even if the algorithms contain a great deal of parallelism. Experience has shown that it is more difficult to build a general-purpose parallel machine than a general-purpose sequential machine.

The remainder of this chapter begins with a discussion of how to model parallel computers, and then presents parallel algorithms for reduction, prefix sums, list ranking, merging, matrix multiplication, linear systems, and sorting.

1.1 Parallel Algorithms

In computer science, a parallel algorithm or concurrent algorithm, as opposed to a traditional sequential (or serial) algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result.

Some algorithms are easy to divide up into pieces like this. For example, splitting up the job of checking all of the numbers from one to a hundred thousand to see which are primes could be done by assigning a subset of the numbers to each available processor, and then putting the list of positive results back together.
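As a concrete illustration of this kind of decomposition, the following sketch splits the primality check across worker processes with Python's standard concurrent.futures module. The range bound, chunk size, and function names are illustrative choices, not part of the original notes.

from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    # Trial division; good enough for small n.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def primes_in_range(lo, hi):
    # Each worker independently checks one sub-range.
    return [n for n in range(lo, hi) if is_prime(n)]

if __name__ == "__main__":
    limit, chunk = 100_000, 10_000
    starts = list(range(1, limit + 1, chunk))
    ends = [min(s + chunk, limit + 1) for s in starts]
    with ProcessPoolExecutor() as pool:           # one task per sub-range
        parts = pool.map(primes_in_range, starts, ends)
    primes = [p for part in parts for p in part]  # put the results back together
    print(len(primes))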

Most of the available algorithms to compute pi (π), on the other hand, cannot be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three-body problem, are also algorithms which are inherently serial. Some problems are very difficult to parallelize, although they are recursive. One such example is the depth-first search of graphs.

Parallel algorithms are valuable because of substantial improvements in multiprocessing systems and the rise of multi-core processors. In general, it is easier to construct a computer with a single fast processor than one with many slow processors with the same throughput. But processor speed is increased primarily by shrinking the circuitry, and modern processors are pushing physical size and heat limits. These twin barriers have flipped the equation, making multiprocessing practical even for small systems.

The cost or complexity of serial algorithms is estimated in terms of the space (memory) and time (processor cycles) that they take. Parallel algorithms need to optimize one more resource, the communication between different processors. There are two ways parallel processors communicate: shared memory or message passing.

Shared memory processing needs additional locking for the data, imposes the overhead of additional processor and bus cycles, and also serializes some portion of the algorithm.

Message passing processing uses channels and message boxes, but this communication adds transfer overhead on the bus, additional memory for queues and message boxes, and latency in the messages. Designs of parallel processors use special buses like crossbars so that the communication overhead will be small, but it is the parallel algorithm that decides the volume of the traffic.

Another problem with parallel algorithms is ensuring that they are suitably load balanced. For example, checking all numbers from one to a hundred thousand for primality is easy to split amongst processors; however, some processors will get more work to do than the others, which will sit idle until the loaded processors complete.

1.2 Common Defining Terms for Parallelism

CRCW: A shared memory model that allows for concurrent reads (CR) and concurrent writes (CW) to the memory.

CREW: A shared memory model that allows for concurrent reads (CR) but only exclusive writes (EW) to the memory.

Depth: The longest chain of sequential dependences in a computation.

EREW: A shared memory model that allows for only exclusive reads (ER) and exclusive writes (EW) to the memory.

Multiprocessor Model: A model of parallel computation based on a set of communicating sequential processors.

Pipelined Divide-and-Conquer: A divide-and-conquer paradigm in which partial results from recursive calls can be used before the calls complete. The technique is often useful for reducing the depth of an algorithm.

Pointer Jumping: In a linked structure, replacing a pointer with the pointer it points to. Used for various algorithms on lists and trees.

PRAM Model: A multiprocessor model in which all processors can access a shared memory for reading or writing with uniform cost.

Prefix Sums: A parallel operation in which each element in an array or linked list receives the sum of all the previous elements.

Random Sampling: Using a randomly selected sample of the data to help solve a problem on the whole data.

Recursive Doubling: The same as pointer jumping.

Scan: A parallel operation in which each element in an array receives the sum of all the previous elements.

Shortcutting: Same as pointer jumping.

Symmetry Breaking: A technique to break the symmetry in a structure, such as a graph, which can locally look the same to all of its vertices.

Work: The total number of operations taken by a computation.

Work-Depth Model: A model of parallel computation in which one keeps track of the total work and depth of a computation without worrying about how it maps onto a machine.

Work-Efficient: A parallel algorithm is work-efficient if asymptotically (as the problem size grows) it requires at most a constant factor more work than the best known sequential algorithm (or the optimal work).

Work-Preserving: A translation of an algorithm from one model to another is work-preserving if the work is the same in both models, to within a constant factor.

Concurrent Processing: A program is divided into multiple processes which are run on a single processor. The processes are time-sliced on the single processor.

Distributed Processing: A program is divided into multiple processes which are run on multiple distinct machines. The multiple machines are usually connected by a LAN. Machines used are typically workstations running multiple programs.

Parallel Processing: A program is divided into multiple processes which are run on multiple processors. The processors normally are in one machine, execute one program at a time, and have high-speed communications between them.

1.3 Parallel Programming

Issues in parallel programming not found in sequential programming:

- Task decomposition, allocation and sequencing
- Breaking down the problem into smaller tasks (processes) that can be run in parallel
- Allocating the parallel tasks to different processors
- Sequencing the tasks in the proper order
- Efficiently using the processors
- Communication of interim results between processors

1.4 Performance Issues

Scalability
- Using more nodes should allow a job to run faster
- Using more nodes should allow a larger job to run in the same time

Load Balancing
- All nodes should have the same amount of work
- Avoid having nodes idle while others are computing

Bottlenecks
- Communication bottlenecks
  - Nodes spend too much time passing messages
  - Too many messages are traveling on the same path
- Serial bottlenecks

Communication
- Message passing is slower than computation
- Maximize computation per message
- Avoid making nodes wait for messages

1.5 Processor Interconnections

Parallel computers may be loosely divided into two groups:

a) Shared Memory (or Multiprocessor)
b) Message Passing (or Multicomputer)

a) Shared Memory or Multiprocessor

Individual processors have access to a common shared memory module (or modules). Examples: Alliant, Cray series, some Sun workstations.

Features:
- Easy to build and program
- Limited to a small number of processors (20-30 for a bus-based multiprocessor)

b) Message Passing or Multicomputer

Individual processors have local memory. Processors communicate via a communication network. Examples: Connection Machine series (CM-2, CM-5), Intel Paragon, nCube, Transputers, Cosmic Cube.

Features:
- Can scale to thousands of processors.


1.6 Parallel Computing Models: PRAM (Parallel Random-Access Machine)

PRAM stands for Parallel Random-Access Machine, a model assumed for most parallel algorithms and a generalization of most sequential machines (which consist of a processor and an attached memory of arbitrary size containing instructions: random-access memory where reads and writes take unit time regardless of location). This model simply attaches multiple processors to a single chunk of memory. Whereas a single-processor model also assumes a single thread running at a time, the standard PRAM model involves processors working in sync, stepping on the same instruction with the program counter broadcast to every processor (also known as SIMD: single instruction, multiple data). Each of the processors possesses its own set of registers that it can manipulate independently; for example, an instruction to add to an element broadcast to the processors might result in each processor referring to different register indices to obtain the element so directed.

Once again, recall that all instructions, including reads and writes, take unit time. The memory is shared amongst all processors, making this a similar abstraction to a multi-core machine: all processors read from and write to the same memory, and communicate with each other through it.

The PRAM model changes the nature of some operations. Given n processors, one could perform independent operations on n values in unit time, since each processor can perform one operation independently. Computations involving aggregation (e.g. summing up a set of numbers) can also be sped up by dividing the values to be aggregated into parts, computing the aggregates of the parts recursively in parallel, and then combining the results together.

While this results in highly parallel algorithms, the PRAM poses certain complications for algorithms that have multiple processors simultaneously operating on the same memory location. What if two processors attempt to read or write different data to the same memory location at the same time? How does a practical system deal with such contentions? To address this issue, several constraints and semantic interpretations may be imposed on the PRAM model.

Depending on how concurrent access to a single memory cell (of the shared memory) is resolved, there are various PRAM variants.

ER (Exclusive Read) or EW (Exclusive Write): this variant of PRAM does not allow concurrent access to the shared memory.

CR (Concurrent Read) or CW (Concurrent Write): this variant of PRAM allows concurrent access to the shared memory.

Combining the rules for read and write access, there are four PRAM variants:

1. EREW: Access to a memory location is exclusive. No concurrent read or write operations are allowed. This is the weakest PRAM model.

2. CREW: Multiple read accesses to a memory location are allowed. Multiple write accesses to a memory location are serialized.

3. ERCW: Multiple write accesses to a memory location are allowed, but multiple read accesses to a memory location are serialized. It can simulate an EREW PRAM.

4. CRCW: Allows multiple read and write accesses to a common memory location. This is the most powerful PRAM model. It can simulate both the EREW PRAM and the CREW PRAM.

Most work on parallel algorithms has been done on the PRAM model. However, real machines usually differ from the PRAM in that not all processors have the same unit-time access to a shared memory. A parallel machine typically involves several processors and memory units connected by a network. Since the network can be of any shape, the memory access times differ between each processor-memory pair. However, the PRAM model presents a simple abstract model in which to develop parallel algorithms that can then be ported to a particular network topology.

Advantages of PRAM:
- Abstracts away from the network, including dealing with varieties of protocols.
- Most ideas translate to other models.

Problems of PRAM:
- Unrealistic memory model: access is not always constant time, so practicality is an issue.
- Synchronous; the lock-step nature removes much flexibility and restricts programming practices.
- Fixed number of processors.


2. PRAM Algorithms

In this chapter we discuss a variety of algorithms for solving problems on the PRAM. As you will see, these techniques can deal with many different kinds of data structures, and often have counterparts in design techniques for sequential algorithms.






















2.1 Parallel Reduction

Reduction: an operation that computes a single result from a set of data. Given a set of n values a1, a2, ..., an and an associative binary operator *, reduction is the process of computing a1 * a2 * ... * an. Examples: finding the minimum, the maximum, the average, the sum, the product, etc.


Parallel Reduction: do the above in parallel.

Let us take the example of finding the maximum of n numbers and see how reduction can take place. Let there be n processors and n inputs, and let the model be an EREW PRAM. In each step, pairs of elements are compared on ⌈n/2⌉ processors in parallel; one element of each pair becomes the partial result while the other element is discarded by its processor. This actually reduces the size of the problem by half at every step. Finally, the one element that remains, which is the output, ends up on processor P0.
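The sketch below, a minimal round-by-round simulation of this EREW reduction (not code from the original notes), finds the maximum by halving the number of active "processors" in each round; the helper name parallel_reduce_max is illustrative.

def parallel_reduce_max(values):
    # Simulate the EREW reduction rounds: in each round, "processor" i
    # compares element 2i with element 2i+1 and keeps the larger one.
    a = list(values)
    while len(a) > 1:
        nxt = []
        for i in range(0, len(a) - 1, 2):   # each pair handled by one processor
            nxt.append(max(a[i], a[i + 1]))
        if len(a) % 2 == 1:                  # odd leftover element passes through
            nxt.append(a[-1])
        a = nxt                              # the problem size halves every round
    return a[0]                              # the value that would end up on P0

print(parallel_reduce_max([5, 2, 9, 1, 7, 3]))   # prints 9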





2.2 The Prefix-Sums Problem

The prefix-sum operation takes an array A of n elements, drawn from some domain, as input. Assume that a binary operation, denoted *, is defined on the set. Assume also that * is associative; namely, for any elements a, b, c in the set, a * (b * c) = (a * b) * c. (The operation * is pronounced "star" and often referred to as "sum" because addition, relative to real numbers, is a common example for *.)

The n prefix-sums of array A are defined as Sum(i) = A(1) * A(2) * ... * A(i), for all i, 1 ≤ i ≤ n, or:

Sum(1) = A(1)
Sum(2) = A(1) * A(2)
. . .
Sum(i) = A(1) * A(2) * ... * A(i)
. . .
Sum(n) = A(1) * A(2) * ... * A(n)


EREW PRAM Parallel Algo Prefix_Sum(A[n])
{
Input: a list of elements stored in array A[0...n-1]
Output: A[i] = A[0] + A[1] + ... + A[i] for every i

Spawn(P0, P1, ..., Pn-1)
for each Pi where 0 ≤ i ≤ n-1
    for (j = 0 to ⌈log2 n⌉ - 1)
        if (i - 2^j ≥ 0)
            A[i] = A[i] + A[i - 2^j]
}

[Figure: prefix-sum rounds on a six-element array A, B, C, D, E, F. Adjacent partial sums such as A+B, C+D, E+F and A+B+C+D are formed in successive rounds, until every position holds the prefix sum ending at that position; the last position holds A+B+C+D+E+F.]


Example: the input array is A = {3, 1, 8, 4, 2}.

j = 0:
  i = 0: 0 - 2^0 = -1, no work
  i = 1: 1 - 2^0 = 0, A[1] = A[1] + A[0] = 4
  i = 2: 2 - 2^0 = 1, A[2] = A[2] + A[1] = 9
  i = 3: 3 - 2^0 = 2, A[3] = A[3] + A[2] = 12
  i = 4: 4 - 2^0 = 3, A[4] = A[4] + A[3] = 6

j = 1:
  i = 0: 0 - 2^1 = -2, no work
  i = 1: 1 - 2^1 = -1, no work
  i = 2: 2 - 2^1 = 0, A[2] = A[2] + A[0] = 12
  i = 3: 3 - 2^1 = 1, A[3] = A[3] + A[1] = 16
  i = 4: 4 - 2^1 = 2, A[4] = A[4] + A[2] = 15

j = 2:
  i = 0 through 3: the index i - 2^2 is negative, no work
  i = 4: 4 - 2^2 = 0, A[4] = A[4] + A[0] = 18

The array now holds the prefix sums {3, 4, 12, 16, 18}.
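A small round-by-round simulation of this EREW prefix-sum algorithm, written as a sketch rather than taken from the notes. Each round copies the array first so that all "processors" read the old values, which is how a synchronous PRAM step behaves.

def parallel_prefix_sum(a):
    # Simulate the PRAM rounds: at round j, position i adds in a[i - 2**j].
    a = list(a)
    n = len(a)
    j = 0
    while (1 << j) < n:                 # ceil(log2 n) rounds
        old = a[:]                      # all reads see the previous round's values
        for i in range(n):              # conceptually, processor Pi
            if i - (1 << j) >= 0:
                a[i] = old[i] + old[i - (1 << j)]
        j += 1
    return a

print(parallel_prefix_sum([3, 1, 8, 4, 2]))   # [3, 4, 12, 16, 18]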


Running time of the algorithm

If the input sequence has n elements, the loop runs for ⌈log2 n⌉ rounds, which is also the bound on the parallel running time of this algorithm: O(log n) with n processors. The total number of operations performed over all rounds is O(n log n).


Applications
- Within counting sort, radix sort, and list ranking.
- When combining list ranking, prefix sums, and Euler tours, many important problems on trees may be solved.
- Simulating parallel algorithms.

2.3 The Sum of Contents of Array Problem

This algorithm works on the EREW PRAM model, as there are no read or write conflicts.

With p processors, the first loop, in which each processor adds the numbers in its own segment, takes O(n/p) time, as we would like. But then processor 0 must perform its loop to receive the subtotals from each of the p - 1 other processors; this loop takes O(p) time. So the total time taken is O(n/p + p), a bit more than the O(n/p) time we hoped for.


Input: a list of elements stored in array A[0...n-1]
Output: A[0] = A[0] + A[1] + ... + A[n-1]

EREW PRAM Parallel Algo Sum_of_Array(A[n])
{
Spawn(P0, P1, ..., P⌈n/2⌉-1)
for each Pi where 0 ≤ i ≤ ⌈n/2⌉ - 1
    for (j = 0 to ⌈log2 n⌉ - 1)
        if (i mod 2^j = 0 AND 2i + 2^j < n)
            A[2i] = A[2i] + A[2i + 2^j]
}

Example: the input array is A = {2, 4, 9, 7, 6, 3}, handled by processors P0 through P2 (one per pair of elements).

In the first round, A[1] is added into A[0], A[3] into A[2], A[5] into A[4], and so on; the n/2 receiving positions are all updated simultaneously, giving {6, 4, 16, 7, 9, 3}. In the next round the partial total at A[2] is added into A[0], the one at A[6] into A[4], and so on, leaving n/4 sums, which again can all be computed simultaneously; the array becomes {22, 4, 16, 7, 9, 3}. After only ⌈lg n⌉ rounds, processor P0 holds the sum of all the elements in A[0]; here the final array is {31, 4, 16, 7, 9, 3}.





Running time of the algorithm

The if statement runs in O(1) time, and the inner for loop executes ⌈lg n⌉ times, so each processor runs for O(lg n) time. Thus the time complexity of the algorithm is O(lg n) with ⌈n/2⌉ processors.
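To illustrate the O(n/p + p) scheme described at the start of this section (each of p workers sums its own segment, then the subtotals are combined), here is a hedged sketch using Python's multiprocessing.Pool; the chunking scheme and names are illustrative, not from the notes.

from multiprocessing import Pool

def chunk_sum(chunk):
    # Each worker sums its own segment in O(n/p) time.
    return sum(chunk)

def parallel_array_sum(a, p=4):
    n = len(a)
    size = (n + p - 1) // p                      # ceil(n / p) elements per worker
    chunks = [a[i:i + size] for i in range(0, n, size)]
    with Pool(processes=p) as pool:
        subtotals = pool.map(chunk_sum, chunks)  # p partial sums
    return sum(subtotals)                        # "processor 0" combines the subtotals

if __name__ == "__main__":
    print(parallel_array_sum([2, 4, 9, 7, 6, 3], p=3))   # 31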


2.4 Parallel List Ranking

The list ranking problem was posed by Wyllie (1979), who solved it with a parallel algorithm using logarithmic time and O(n log n) total steps (with n processors). In parallel algorithms, the list ranking problem involves determining the position (or rank) of each item in a linked list. That is, the first item in the list should be assigned the number 1, the second item the number 2, etc.

List ranking can equivalently be viewed as performing a prefix sum operation on the given list, in which the values to be summed are all equal to one. The list ranking problem can be used to solve many problems on trees via an Euler tour technique.

The serial algorithm for list ranking takes Θ(n) time.


List ranking is defined as: given a singly linked list L with n objects, compute, for each object in L, its distance from the end of the list.

Formally, if next is the pointer field, then

d[i] = 0               if next[i] = NIL
d[i] = d[next[i]] + 1  if next[i] ≠ NIL

Input: a linked list L of n nodes.
Output: the distance of node i from the end of the list is stored in d[i], and every next[i] points to the rightmost (last) node.

EREW PRAM Algorithm List_Rank(L)
{
Spawn(P0, P1, ..., Pn-1)
for each Pi where 0 ≤ i ≤ n-1
{
    if (next[i] = NIL)
        d[i] = 0
    else
        d[i] = 1
    for (j = 1 to ⌈lg n⌉)
    {
        if (next[i] ≠ NIL)
        {
            d[i] = d[i] + d[next[i]]
            next[i] = next[next[i]]
        }
    }
}
}


Example:

Given a linked list, the first step sets d = 1 at every node except the last, which sets d = 0. Each pointer-jumping step then roughly doubles the distance covered by every node's next pointer, so after ⌈lg n⌉ steps every node holds its distance from the end of the list: n-1, n-2, ..., 2, 1, 0.

[Figure: the linked list after the initialization and after the first, second, and third pointer-jumping steps.]


Running time of the algorithm

The inner for loop runs in O(lg n) time and the if-else statement runs in O(1), so the body executed by each processor takes O(lg n) + O(1) = O(lg n). Thus the time taken by each processor in the system is O(lg n), and the time complexity of the algorithm is O(lg n) with n processors.
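Below is a sketch (not from the notes) that simulates the pointer-jumping rounds sequentially; lists stand in for the shared memory, NIL is represented by None, and copies of d and next are taken each round so that all reads see the previous round's values, as a synchronous PRAM would.

def list_rank(next_ptr):
    # next_ptr[i] is the index of the node after i, or None for the last node.
    n = len(next_ptr)
    d = [0 if next_ptr[i] is None else 1 for i in range(n)]
    nxt = list(next_ptr)
    rounds = 0
    while (1 << rounds) < n:            # ceil(lg n) pointer-jumping rounds
        old_d, old_nxt = d[:], nxt[:]   # snapshot = synchronous PRAM step
        for i in range(n):
            if old_nxt[i] is not None:
                d[i] = old_d[i] + old_d[old_nxt[i]]
                nxt[i] = old_nxt[old_nxt[i]]
        rounds += 1
    return d

# List stored as 0 -> 1 -> 2 -> 3 -> 4 (node 4 is last).
print(list_rank([1, 2, 3, 4, None]))    # [4, 3, 2, 1, 0]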


2.5 Merging of Two Sorted Arrays into a Single Sorted Array

Many PRAM algorithms achieve low time complexity by performing more operations than an optimal RAM algorithm. The problem of merging two sorted lists is another example.

One optimal RAM algorithm creates the merged list one element at a time. It requires at most n - 1 comparisons to merge two sorted lists of n/2 elements, so its time complexity is Θ(n). A PRAM algorithm can perform the task in Θ(log n) time by assigning each list element its own processor. Every processor finds the position of its own element on the other list using binary search. Because an element's index on its own list is known, its place on the merged list can be computed once its index on the other list has been found and the two indices added. All n elements can then be inserted into the merged list by their processors in constant time (see Fig. 2-16). The pseudocode for the PRAM algorithm to merge two sorted lists appears in Fig. 2-17. In this version of the algorithm the two lists and their union have disjoint values. Let's examine the algorithm in detail. As usual, the first step is to spawn the maximum number of processors needed at any point in the algorithm's execution. In this case we need n processors, one for each element of the two lists to be merged. After the processors are spawned, we immediately activate them. In parallel, the processors determine the range of indices they are going to search. The processors associated with elements in the lower half of the array will perform binary search on the elements in the upper half of the array, and vice versa.

Every processor has a unique value x, an element to be merged. The repeat...until loop implements binary search. When a processor exits the loop, its private value of high will be set to the index of the largest element on the other list that is smaller than x.

Consider a processor Pi associated with value A[i] in the lower half of the list. The processor's final value of high must be between n/2 and n. Element A[i] is larger than i - 1 elements on the lower half of the list. It is also larger than high - (n/2) elements on the upper half of the list. Therefore, we should put A[i] on the merged list after i + high - n/2 - 1 other elements, at index i + high - n/2.

Now consider a processor Pi associated with value A[i] in the upper half of the list. The processor's final value of high must be between 0 and n/2. Element A[i] is larger than i - (n/2 + 1) other elements on the upper half of the list. It is also larger than high elements on the lower half of the list. Therefore, we should put A[i] on the merged list after i + high - n/2 - 1 other elements, at index i + high - n/2.

Since all processors use the same expression to place elements in their proper places on the merged list, every processor relocates its element using the same assignment statement at the end of the algorithm.


Input: an array A[1...n] whose first half A[1...n/2] and second half A[(n/2)+1...n] each hold a sorted list (all values distinct).
Output: array A[1...n] holding the two lists merged into a single sorted list.

CREW PRAM Algorithm Merge_Lists(A[1...n])
{
Spawn(P1, P2, ..., Pn)
for each Pi where 1 ≤ i ≤ n
{
    if (i ≤ n/2)
    {
        low = n/2 + 1
        high = n
    }
    else
    {
        low = 1
        high = n/2
    }
    x = A[i]
    do
    {
        index = ⌊(low + high)/2⌋
        if (x < A[index])
            high = index - 1
        else
            low = index + 1
    } while (low ≤ high)
    A[high + i - n/2] = x
}
}

Running Time:

The parallel algorithm runs in Θ(log n) time with n processors, so the total number of operations performed to merge the lists has increased from Θ(n) in the sequential algorithm to Θ(n log n) in the parallel algorithm. This tactic is sensible only when the number of processors is unbounded. When we begin to develop algorithms for real parallel computers, with processors a limited resource, we must consider the cost of the parallel algorithm.


Example: the two sorted lists, stored in one array, are {1, 5, 7, 9, 13, 17, 19, 23} and {2, 4, 8, 11, 12, 21, 22, 24}, so n = 16.

Consider processor P1 (i = 1). Since i ≤ n/2, the condition is true and the processor searches the upper half:
    low = 8 + 1 = 9, high = 16, x = A[1] = 1
    index = ⌊(9 + 16)/2⌋ = 12, A[12] = 11; x < A[index], so high = index - 1 = 11
    index = ⌊(9 + 11)/2⌋ = 10, A[10] = 4; x < A[index], so high = 9
    index = ⌊(9 + 9)/2⌋ = 9, A[9] = 2; x < A[index], so high = 8
Now low = 9 > high = 8, so the loop ends, and the element is written to A[high + i - n/2] = A[8 + 1 - 8] = A[1]; the value 1 ends up in the first position of the merged list. The other fifteen processors place their elements in the same way, in parallel.
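The following sketch (an illustration, not the notes' code) computes each element's destination with the same binary-search rank idea; it runs sequentially per element here, whereas on the CREW PRAM every element's search would run at the same time.

import bisect

def pram_style_merge(a):
    # a[0:half] and a[half:] are the two sorted lists (disjoint values).
    n = len(a)
    half = n // 2
    out = [None] * n
    for i, x in enumerate(a):               # conceptually, one processor per element
        if i < half:
            rank_other = bisect.bisect_left(a, x, half, n) - half
            rank_own = i                    # elements before x in its own list
        else:
            rank_other = bisect.bisect_left(a, x, 0, half)
            rank_own = i - half
        out[rank_own + rank_other] = x      # final position in the merged list
    return out

lists = [1, 5, 7, 9, 13, 17, 19, 23, 2, 4, 8, 11, 12, 21, 22, 24]
print(pram_style_merge(lists))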





2.6 Parallel Preorder Traversal

Sometimes it is appropriate to reduce a complicated-looking problem to a simpler one for which a fast parallel algorithm is already known. The problem of numbering the vertices of a rooted tree in preorder (depth-first search order) is a case in point. Note that a preorder tree traversal visits the nodes of the given tree according to the principle root-left-right.

The algorithm works in the following way. In step one the algorithm constructs a singly linked list. Each vertex of the singly linked list corresponds to a downward or upward edge traversal of the tree.

In step two the algorithm assigns weights to the vertices of the newly created singly linked list. In the preorder traversal, a vertex is labelled as soon as it is encountered via a downward edge traversal. Every such list element gets the weight 1, meaning that the node count is incremented when this edge is traversed. List elements corresponding to upward edges get the weight 0, because the node count does not increase when the preorder traversal works its way back up the tree through previously labelled nodes. In the third step we compute, for each element of the singly linked list, the rank of that list element. In step four the processors associated with downward edges use the ranks they have computed to assign each vertex its preorder traversal number.
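A compact sequential sketch of these four steps (illustrative only; the tree representation, the names, and the use of an ordinary loop in place of parallel list ranking are assumptions):

def preorder_numbers(children, root=0):
    # children[v] is the ordered list of v's children.
    # Step 1: build the Euler-tour edge list (downward and upward edge traversals).
    tour = []
    def euler(v):
        for c in children[v]:
            tour.append(("down", c))   # entering c for the first time
            euler(c)
            tour.append(("up", c))     # returning from c to its parent
    euler(root)
    # Step 2: weight 1 for downward edges, 0 for upward edges.
    weights = [1 if kind == "down" else 0 for kind, _ in tour]
    # Step 3: rank each list element (a prefix sum over the weights; the PRAM
    # version does this with parallel list ranking / prefix sums).
    ranks, running = [], 0
    for w in weights:
        running += w
        ranks.append(running)
    # Step 4: downward edges assign the preorder numbers; the root is number 1.
    number = {root: 1}
    for (kind, v), r in zip(tour, ranks):
        if kind == "down":
            number[v] = r + 1
    return number

# Example tree: 0 is the root with children 1 and 2; 1 has children 3 and 4.
print(preorder_numbers({0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}))
# {0: 1, 1: 2, 3: 3, 4: 4, 2: 5}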











































3. Matrix Multiplication

In matrix multiplication we compute C = A.B, where A, B, and C are dense matrices of size n × n. (A dense matrix is a matrix in which most of the entries are nonzero.) This matrix-matrix multiplication involves Θ(n³) operations, since for each element C_ij of C we must compute

C_ij = Σ_{k=0}^{n-1} A_ik × B_kj



[Figure (belongs with Section 2.6): preorder numbering of an example tree with vertices A-H. The Euler-tour list entries for downward edges carry weight 1 and those for upward edges weight 0; after ranking, the vertices receive the preorder numbers A=1, B=2, D=3, E=4, G=5, H=6, C=7, F=8.]


Following is the sequential algorithm, which runs in O(n³) time.

Input: two matrices A_{n,n} and B_{n,n}
Output: a matrix C with C_ij = Σ_k A_ik × B_kj

Sequential Algorithm Matrix_Multiplication(A_{n,n}, B_{n,n})
{
for (i = 0 to n - 1)
    for (j = 0 to n - 1)
    {
        t = 0
        for (k = 0 to n - 1)
            t = t + A_ik × B_kj
        C_ij = t
    }
}


Parallel Matrix Multiplication: the operations of matrix multiplication can obviously be performed in parallel.


3.1 Row-Column Oriented Matrix Multiplication Algorithm



Parallel Matrix Multiplication Problem: the operations of matrix multiplication can obviously be performed in parallel. We have two varieties of parallel matrix multiplication algorithm:

- Row-Column Oriented Matrix Multiplication Algorithm
- Block-Oriented Matrix Multiplication Algorithms




If the PRAM has p = n³ processors, then matrix multiplication can be done in Θ(log n) time by using one processor to compute each product A_ik × B_kj and then allowing groups of n processors to perform n-input summations (semigroup computation) in Θ(log n) time. Because we are usually not interested in parallel processing for matrix multiplication unless n is fairly large, this is not a practical solution.


PRAM algorithm using n³ processors

Input: two matrices A_{n,n} and B_{n,n}
Output: a matrix C with C_ij = Σ_k A_ik × B_kj

PRAM Parallel Algo Matrix_Multiplication(A_{n,n}, B_{n,n})
{
for each Processor (i, j, k), 0 ≤ i, j, k < n, do
    T_{i,j,k} = A_ik × B_kj                            // one processor per product
for each group of n processors (i, j), in parallel, do
    C_ij = T_{i,j,0} + T_{i,j,1} + ... + T_{i,j,n-1}   // n-input summation in Θ(log n) time
}


Now assume that the PRAM has p = n² processors. In this case, matrix multiplication can be done in Θ(n) time by using one processor to compute each element C_ij of the product matrix C. The processor responsible for computing C_ij reads the elements of row i in A and the elements of column j in B, multiplies their corresponding kth elements, and adds each of the products thus obtained to a running total t. This amounts to parallelizing the i and j loops in the sequential algorithm (Fig. 5.12). For simplicity, we label the n² processors with two indices (i, j), each ranging from 0 to n - 1, rather than with a single index ranging from 0 to n² - 1.



PRAM algorithm using n² processors

Input: two matrices A_{n,n} and B_{n,n}
Output: a matrix C with C_ij = Σ_k A_ik × B_kj

PRAM Parallel Algo Matrix_Multiplication(A_{n,n}, B_{n,n})
{
for each Processor (i, j), 0 ≤ i, j < n, do
{
    t = 0
    for (k = 0 to n - 1)
        t = t + A_ik × B_kj
    C_ij = t
}
}


Because each processor reads a different row of the matrix A, no concurrent reads from A are ever attempted. For matrix B, however, all n processors in a column group access the same element B_kj at the same time. Again, one can skew the memory accesses for B in such a way that the EREW submodel is applicable.
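A sketch of the n²-processor idea, with one thread per output element standing in for one PRAM processor (the thread pool, names, and sizes are illustrative assumptions, not the notes' implementation):

from concurrent.futures import ThreadPoolExecutor

def matmul_parallel(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    def compute_cell(i, j):
        # The "processor (i, j)" of the PRAM algorithm: one dot product.
        t = 0
        for k in range(n):
            t += A[i][k] * B[k][j]
        C[i][j] = t
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(compute_cell, i, j)
                   for i in range(n) for j in range(n)]
        for f in futures:
            f.result()                    # wait for every cell to finish
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_parallel(A, B))              # [[19, 22], [43, 50]]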


3.2 Block-Oriented Matrix Multiplication Algorithms

In many practical situations, the number of processors is even less than n, so we need to develop an algorithm for this case as well. We can let processor i compute a set of n/p rows in the result matrix C, say rows i, i+p, i+2p, ..., i+(n/p - 1)p. Again, we are parallelizing the i loop, as this is preferable to parallelizing the k loop (which has data dependencies) or the j loop.

On a lightly loaded Sequent Symmetry shared-memory multiprocessor, this last algorithm exhibits almost linear speedup, with a speed-up of about 22 reported for multiplying floating-point matrices [Quin94]. This is typical of what can be achieved on UMA multiprocessors with our simple parallel algorithm. Recall that the UMA (uniform memory access) property implies that any memory location is accessible with the same amount of delay. The drawback of the above algorithm for NUMA (non-uniform memory access) shared-memory multiprocessors is that each element of B is fetched n/p times, with only two arithmetic operations (one multiplication and one addition) performed for each such element. Block matrix multiplication, discussed next, increases the computation to memory access ratio, thus improving the performance for NUMA multiprocessors.


Let us divide the n × n matrices A, B, and C into p blocks of size q × q, as in Fig. 5.13, where q = n/√p. We can then multiply the n × n matrices by applying the same algorithm to the √p × √p block matrices, where the terms in the algorithm statement t = t + A_ik × B_kj are now q × q blocks and t is a q × q block of the result matrix C. Thus, the algorithm is similar to our second algorithm above, with the statement t = t + A_ik × B_kj replaced by a sequential q × q matrix multiplication algorithm.

Figure 5.13. Partitioning the matrices for block matrix multiplication.




Each multiply-add computation on q × q blocks needs 2q² = 2n²/p memory accesses to read the blocks and 2q³ arithmetic operations. So q arithmetic operations are performed for each memory access, and better performance will be achieved as a result of improved locality. The assumption here is that processor (i, j) has sufficient local memory to hold Block (i, j) of the result matrix C (q² elements) and one block-row of the matrix B, say the q elements in Row kq + c of Block (k, j) of B. Elements of A can be brought in one at a time. For example, as the element in Row iq + a of Column kq + c in Block (i, k) of A is brought in, it is multiplied in turn by the locally stored q elements of B, and the results are added to the appropriate q elements of C (Fig. 5.14).
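A minimal sketch of the block decomposition (sequential, purely to show how the q × q blocks combine; the block size and helper names are illustrative assumptions):

def block_matmul(A, B, q):
    # Multiply n x n matrices by working on q x q blocks (n must be a multiple of q).
    n = len(A)
    nb = n // q                                  # number of blocks per dimension
    C = [[0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):
            for bk in range(nb):
                # C-block (bi, bj) += A-block (bi, bk) x B-block (bk, bj)
                for i in range(bi * q, (bi + 1) * q):
                    for j in range(bj * q, (bj + 1) * q):
                        t = C[i][j]
                        for k in range(bk * q, (bk + 1) * q):
                            t += A[i][k] * B[k][j]
                        C[i][j] = t
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(block_matmul(A, A, q=2) == block_matmul(A, A, q=4))   # True: same product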



4. Solving Linear Systems

In this section we discuss the parallelization of:

4.1 Gaussian Elimination
4.2 The Jacobi Algorithm



4.1 Gaussian Elimination

Gaussian elimination is a well known algorithm for solving the linear system

Ax = B

where A is a known, square, dense n × n matrix, while x and B are both n-dimensional vectors. This is called the system of equations.

The general procedure is as follows: Gaussian elimination reduces Ax = B to an upper triangular system Tx = C, which can then be solved by backward substitution. In the numeric program, the vector B is stored as the (n+1)th column of matrix A. Letting k control the elimination step, loop i control access to the ith row, and loop j control access to the jth column, the sequential Gaussian elimination algorithm is described as follows:


Sequential Algo Forward_Elimination(A_{n,n+1})
// Transforming A to upper triangular form; B is stored as column n+1 of A
{
for (k = 1 to n - 1)
    for (i = k + 1 to n)
    {
        A_ik = A_ik / A_kk
        for (j = k + 1 to n + 1)
            A_ij = A_ij - A_ik × A_kj
    }
}


Note that since the entries in the lower triangular part vanish after elimination, their space is used to store the multipliers A_ik = A_ik / A_kk.


Sequential Algo Backward_Substitution(A_{n,n+1})
// Back-substitution
{
for (i = n downto 1)
{
    x_i = A_{i,n+1}
    for (j = i + 1 to n)
        x_i = x_i - A_ij × x_j
    x_i = x_i / A_ii
}
}

Note that x_i is stored in the space of A_{i,n+1}.


Gaussian Elimination (GE) is one of the direct methods of solving linear systems (Ax = b). In the algorithms above, the first algorithm, Forward_Elimination, converts matrix A into a triangular matrix. Thereafter, the second algorithm, Backward_Substitution, computes the value of the vector x.



Parallel Gaussian Elimination:

Now we describe a parallel Gaussian elimination algorithm. In the forward elimination part, the following task (denoted E_k^i, the update of row i at elimination step k) can be parallelized:

E_k^i:
    A_ik = A_ik / A_kk
    for (j = k + 1 to n + 1)
        A_ij = A_ij - A_ik × A_kj

The data access pattern of each task is: read rows A_k and A_i, then write row A_i. Thus, for each step k, the tasks E_k^{k+1}, E_k^{k+2}, ..., E_k^n are independent, which leads to the following parallel algorithm:


Parallel Forward Elimination Algorithm:

for (k = 1 to n - 1)
{
    Calculate E_k^{k+1}, E_k^{k+2}, ..., E_k^n in parallel on p processors.
}



Group the tasks into a set of clusters such that all tasks E_k^i with the same row i are in the same cluster C_i. Row i and cluster C_i will be mapped to the same processor. Cyclic mapping should be used to achieve a balanced load among processors: processor number = (i - 1) mod p.
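A sketch of one elimination step with the row tasks farmed out to a thread pool (the pool, the names, and the lack of pivoting and of cyclic row mapping are illustrative simplifications, not the notes' implementation):

from concurrent.futures import ThreadPoolExecutor

def eliminate_row(A, k, i):
    # Task E_k^i: eliminate column k from row i using pivot row k (0-based indices).
    n = len(A)                          # A has n rows and n+1 columns (b stored last)
    A[i][k] = A[i][k] / A[k][k]
    for j in range(k + 1, n + 1):
        A[i][j] -= A[i][k] * A[k][j]

def parallel_forward_elimination(A):
    n = len(A)
    with ThreadPoolExecutor() as pool:
        for k in range(n - 1):
            # The tasks for rows k+1 ... n-1 are independent and run in parallel.
            list(pool.map(lambda i: eliminate_row(A, k, i), range(k + 1, n)))
    return A

# Augmented matrix for: 2x + y = 5, x + 3y = 10  ->  x = 1, y = 3
A = [[2.0, 1.0, 5.0], [1.0, 3.0, 10.0]]
parallel_forward_elimination(A)
print(A[1][2] / A[1][1])                # back-substitution for y -> 3.0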


In the backward substitution part, denote the task that computes x_i by S_i:

S_i:
    for (j = i + 1 to n)
        x_i = x_i - A_ij × x_j
    x_i = x_i / A_ii

The dependence is S_n → S_{n-1} → ... → S_1: task S_i cannot be finished until the values x_{i+1}, ..., x_n produced by the later-numbered tasks are known.






Parallel Backward Substitution Algorithm:

for (i = n downto 1)
{
    Calculate S_i gradually on the processor that owns row i.
}


4.2 The Jacobi Algorithm

In numerical linear algebra, the Jacobi method is an algorithm for determining the solution of a system of linear equations in which, in each row, the diagonal element dominates the other elements in absolute value. Each diagonal element is solved for, and an approximate value is plugged in. The process is then iterated until it converges. This algorithm is a stripped-down version of the Jacobi transformation method of matrix diagonalization. The method is named after the German mathematician Carl Gustav Jakob Jacobi.

Given a square system of n linear equations:

Ax = b

where A is the coefficient matrix, x is the vector of unknowns, and b is the right-hand-side vector. Then A can be decomposed into a diagonal component D and the remainder R:

A = D + R, where D holds the diagonal entries of A and R holds everything else.

The system of linear equations may be rewritten as:

(D + R)x = b

Thus

Dx + Rx = b

and finally:

Dx = b - Rx

The Jacobi method is an iterative technique that solves the left-hand side of this expression for x, using the previous value of x on the right-hand side. Analytically, this may be written as:

x^(k+1) = D^(-1) (b - R x^(k))

The element-based formula is thus:

x_i^(k+1) = (1 / A_ii) (b_i - Σ_{j ≠ i} A_ij x_j^(k)),  for i = 1, 2, ..., n


Note that the computation of x_i^(k+1) requires every element of x^(k) except x_i itself. Unlike the Gauss-Seidel method, we can't overwrite x_i^(k) with x_i^(k+1), as that value will be needed by the rest of the computation. The minimum amount of storage is two vectors of size n.


Parallel Jacobi Method

Parallel implementation:
- Distribute the rows of R (and the corresponding entries of b) and the diagonal elements of D to the processors.
- Perform the computation based on the owner-computes rule.
- Perform an all-to-all broadcast after each iteration.

Note: if the iteration matrix is very sparse, i.e. contains a lot of zeros, the code design should take advantage of this and should not store the zeros; it should also explicitly skip the operations applied to zero elements.
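A small sequential sketch of the element-based Jacobi update (the names and iteration limit are illustrative; in the parallel version each processor would own a block of rows and the new x would be exchanged with an all-to-all broadcast after every iteration):

def jacobi(A, b, iterations=50):
    n = len(A)
    x = [0.0] * n                          # initial guess
    for _ in range(iterations):
        x_new = [0.0] * n                  # second vector: x(k) must not be overwritten
        for i in range(n):                 # row i would live on the processor owning it
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i][i]
        x = x_new
    return x

# Diagonally dominant system: 2x + y = 11, x + 3y = 13  ->  x = 4, y = 3
print(jacobi([[2.0, 1.0], [1.0, 3.0]], [11.0, 13.0]))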





5. Quicksort

Quicksort: the quicksort algorithm was developed in 1960 by Tony Hoare. Quicksort is a divide and conquer algorithm. Quicksort first divides a large list into two smaller sub-lists: the low elements and the high elements. Quicksort can then recursively sort the sub-lists.

The idea of sorting is:

1. Pick an element, called a pivot, from the list.
2. Reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
3. Recursively sort the sub-list of lesser elements and the sub-list of greater elements.

Parallel Quicksort: the operations of the sorting problem can obviously be performed in parallel. We have two varieties of parallel quicksort:

- Parallel Quicksort Algorithm
- Hypercube Quicksort Algorithm



5.1 Parallel Quicksort Algorithm

Quicksort is a divide-and-conquer algorithm that sorts a sequence by recursively dividing it into smaller subsequences. Sorting the smaller arrays represents two completely independent subproblems that can be solved in parallel. Therefore, one way to parallelize quicksort is to execute it initially on a single processor; then, when the algorithm performs its recursive calls, assign different subproblems to different processors. Each of these processors sorts its array by using sequential quicksort and assigns its own subproblems to other processors. The algorithm terminates when the arrays cannot be further partitioned.

Initially, the input list of elements is partitioned serially and allotted to the available processors P1, P2, ..., Pp. We randomly choose a pivot from one of the processes and broadcast it to every processor. Then each processor Pi sorts its unsorted list using sequential quicksort, in parallel, and divides it into two sublists (this is called the Local Arrangement), Li and Ri. Li has the set of elements smaller than (or equal to) the pivot element, and Ri has the set of elements larger than the pivot element. Each processor in the upper half of the process list sends its Li to a partner process in the lower half of the process list and receives an Ri in return (the Global Arrangement). Now the upper half of the list has only values greater than the pivot, and the lower half of the list has only values smaller than the pivot. Thereafter, the processors divide themselves into two teams, and each team performs the operation just described recursively.


Initially, the input list of elements is partitioned serially and allotted to the processors P1, P2, P3, ...

Step 1: We randomly choose a pivot from one of the processes and broadcast it to every processor.

Step 2: Each processor Pi sorts its unsorted list by using sequential quicksort and divides it into two lists (the Local Arrangement):
  - those smaller than (or equal to) the pivot, say Li, and
  - those greater than the pivot, say Ri.

Step 3: Each processor in the upper half of the process list sends its Li to a partner process in the lower half of the process list and receives an Ri in return (the Global Arrangement). Now the upper half of the list has only values greater than the pivot, and the lower half of the list has only values smaller than the pivot.

Step 4: Thereafter, the processors divide themselves into two groups.

Step 5: Each group performs this operation recursively, as in the sketch below.
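A hedged sketch of the recursive-call parallelization using Python's multiprocessing (the two-worker split and the function names are illustrative assumptions; it follows the "assign each recursive call to another processor" idea rather than the pivot-broadcast scheme above):

from multiprocessing import Pool

def seq_quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[0]
    lower = [x for x in a[1:] if x <= pivot]
    upper = [x for x in a[1:] if x > pivot]
    return seq_quicksort(lower) + [pivot] + seq_quicksort(upper)

def parallel_quicksort(a):
    if len(a) <= 1:
        return a
    pivot = a[0]
    lower = [x for x in a[1:] if x <= pivot]
    upper = [x for x in a[1:] if x > pivot]
    # The two sub-lists are independent subproblems: sort them on two workers.
    with Pool(processes=2) as pool:
        left, right = pool.map(seq_quicksort, [lower, upper])
    return left + [pivot] + right

if __name__ == "__main__":
    data = [7, 13, 18, 2, 17, 1, 14, 20, 6, 10, 15, 9, 3, 16, 19, 4, 11, 12, 5, 8]
    print(parallel_quicksort(data))    # 1 ... 20 in order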


Example: the input array is 7, 13, 18, 2, 17, 1, 14, 20, 6, 10, 15, 9, 3, 16, 19, 4, 11, 12, 5, 8.

[Figures: the original notes illustrate the first, second, third and fourth steps of the parallel quicksort on this input, followed by the final sorted list; the figures are not reproduced here.]



5.2 Hypercube Quicksort Algorithm

The hypercube network has structural characteristics that offer scope for implementing efficient divide-and-conquer sorting algorithms, such as quicksort.

Initially, a list of n numbers is placed on any one node of a d-dimensional hypercube. The processor that holds the list sorts the data using sequential quicksort and divides the whole list into two sublists with respect to a pivot it selects, with one part sent to the adjacent node in the highest dimension. Then the two nodes can repeat the process, as the sketch below illustrates.
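A sequential simulation of this scheme on a d-dimensional hypercube (the names and structure are illustrative assumptions): in each round, every node holding data splits it around a locally chosen pivot and ships the upper part across the next dimension.

def hypercube_quicksort(data, d=3):
    # node label -> list currently held; all data starts on node 0 (label 000...0).
    nodes = {i: [] for i in range(2 ** d)}
    nodes[0] = list(data)
    for step in range(d):
        bit = 1 << (d - 1 - step)          # dimension used in this step, left-most bit first
        for label in list(nodes):
            if nodes[label] and not (label & bit):
                items = nodes[label]
                pivot = items[len(items) // 2]                         # local pivot
                nodes[label] = [x for x in items if x <= pivot]
                nodes[label ^ bit] = [x for x in items if x > pivot]   # send across dimension
    # Finally, every node sorts its own part sequentially (in parallel on a real machine)
    # and the parts are concatenated in node order to give the sorted list.
    result = []
    for label in sorted(nodes):
        result.extend(sorted(nodes[label]))
    return result

print(hypercube_quicksort([2, 4, 1, 3, 8, 7, 5, 22, 6, 9, 11, 44, 33, 99, 88]))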

For example, suppose the list is on node 000 of a 3-dimensional hypercube.




1st step: sender 000 sends the numbers greater than a pivot (say p1) to receiver 100. (Invert the left-most bit of the sender's label to get the receiver's address.)

2nd step: senders 000 and 100 send the numbers greater than their pivots (say p2 and p3) to receivers 010 and 110 respectively. (Invert the middle bit of the sender's label to get the receiver's address.)

3rd step: senders 000, 100, 010 and 110 send the numbers greater than their pivots (say p4, p5, p6 and p7) to receivers 001, 101, 011 and 111 respectively. (Invert the right-most bit of the sender's label to get the receiver's address.)

Finally, the parts are sorted using a sequential algorithm, all in parallel, as shown in the following figure. If required, the sorted parts can be returned to one processor in a sequence that allows that processor to concatenate the sorted lists and create the final sorted list.

Example: the input array is 2, 4, 1, 3, 8, 7, 5, 22, 6, 9, 11, 44, 33, 99, 88.

Input: processor P0 (000) holds the whole list 2, 4, 1, 3, 8, 7, 5, 22, 6, 9, 11, 44, 33, 99, 88; processors P1 (001) through P7 (111) hold no data.

Step 1: P0 selects the pivot 9. After partitioning, its data is 2, 4, 1, 3, 8, 7, 5, 6, 9, 22, 11, 44, 33, 99, 88; the part greater than the pivot (22, 11, 44, 33, 99, 88) is sent to P4, and P0 keeps 2, 4, 1, 3, 8, 7, 5, 6, 9.

Step 2: P0 selects the pivot 5 and P4 selects the pivot 33. After partitioning, P0 holds 2, 4, 1, 3, 5, 7, 8, 6, 9 and P4 holds 22, 11, 33, 44, 99, 88. P0 sends 7, 8, 6, 9 to P2 and keeps 2, 4, 1, 3, 5; P4 sends 44, 99, 88 to P6 and keeps 22, 11, 33.

Step 3: the pivots are 3 on P0, 8 on P2, 22 on P4 and 44 on P6. After partitioning, P0 holds 2, 1, 3, 4, 5; P2 holds 7, 6, 8, 9; P4 holds 11, 22, 33; P6 holds 44, 99, 88. P0 sends 4, 5 to P1, P2 sends 9 to P3, P4 sends 33 to P5 and P6 sends 99, 88 to P7.

Step 4: each processor selects a local pivot (2, 4, 6, 9, 11, 33, 44 and 99 respectively) and sorts its own sublist sequentially. The sorted data on each node is: P0: 1, 2, 3; P1: 4, 5; P2: 6, 7, 8; P3: 9; P4: 11, 22; P5: 33; P6: 44; P7: 88, 99.

All processors now send their lists to processor P0, and the final sorted list on P0 is 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 22, 33, 44, 88, 99.