Evaluating Parallel Programs
Cluster Computing, UNC-Charlotte, B. Wilkinson.
Sequential execution time, $t_s$: Estimate by counting computational steps of best sequential algorithm.
Parallel execution time, $t_p$: In addition to number of computational steps, $t_{comp}$, need to estimate communication overhead, $t_{comm}$:
$t_p = t_{comp} + t_{comm}$
Computational Time
Count number of computational steps.
When more than one process is executed simultaneously, count the computational steps of the most complex process.
Generally, a function of $n$ and $p$, i.e.
$t_{comp} = f(n, p)$
Often break down computation time into parts. Then
$t_{comp} = t_{comp1} + t_{comp2} + t_{comp3} + \dots$
Analysis is usually done assuming that all processors are the same and operating at the same speed.
Communication Time
Many factors, including network structure.
As a first approximation, use
$t_{comm} = t_{startup} + n t_{data}$
$t_{startup}$: startup time, essentially time to send a message with no data. Assumed to be constant.
$t_{data}$: transmission time to send one data word, also assumed constant, and there are $n$ data words.
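As a rough sketch of the first-approximation model above, the C function below evaluates $t_{comm} = t_{startup} + n t_{data}$. The function name `comm_time` and any sample values are illustrative assumptions, not part of the original notes; all times are taken to be in units of one computational step.

```c
/* First-approximation communication time for a single message:
 *   t_comm = t_startup + n * t_data
 * comm_time is a hypothetical helper name chosen for this sketch.
 * t_startup: time to send a message with no data (assumed constant)
 * t_data:    time to send one data word (assumed constant)
 * n:         number of data words in the message */
double comm_time(double t_startup, double t_data, long n)
{
    return t_startup + (double)n * t_data;
}
```

For example, with assumed values t_startup = 100 and t_data = 2, `comm_time(100.0, 2.0, 50)` evaluates to 200 steps, half of which is startup cost.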
Idealized Communication Time
[Figure: idealized communication time plotted against the number of data items ($n$), with the startup time marked.]
The equation for the communication time ignores the fact that the source and destination may not be directly linked in a real system, so that the message may pass through intermediate nodes.
It is also assumed that the overhead incurred by including information other than data in the packet is constant and can be part of the startup time.
Final communication time, $t_{comm}$
Summation of communication times of all sequential messages from one process, i.e.
$t_{comm} = t_{comm1} + t_{comm2} + t_{comm3} + \dots$
Communication patterns of all processes are assumed to be the same and to take place together, so that only one process need be considered.
Both $t_{startup}$ and $t_{data}$ are measured in units of one computational step, so that we can add $t_{comp}$ and $t_{comm}$ together to obtain the parallel execution time, $t_p$.
Communication Time of Broadcast/Gather
• If the broadcast is done through a single shared wire, as in Ethernet, the time complexity is O(1) for a single data item and O(w) for w data items.
• If a binary tree is used as the underlying network structure and a 1-to-N fan-out broadcast is used, what is the communication cost of reaching p final destinations (leaf nodes) with a message of w data items?
– We assume the left and right children receive the message from their parent sequentially. However, at each level, different parent nodes send out the message at the same time.
1-to-N fan-out Broadcast
• $t_{comm} = 2 (\log p)(t_{startup} + w t_{data})$
• It depends on the number of levels and the number of nodes at each level.
• For a binary tree and p final destinations at the leaf level.
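The broadcast estimate can be sketched in C as follows. This is a minimal illustration under stated assumptions: the logarithm is taken as base 2 (the notes write only log p), p is treated as a power of two (other values are rounded up to the next full level), and the name `tree_broadcast_time` is hypothetical.

```c
/* Estimated time for a 1-to-N fan-out broadcast down a binary tree
 * with p leaf destinations:
 *   t_comm = 2 * log2(p) * (t_startup + w * t_data)
 * The factor 2 reflects a parent sending to its left and right child
 * one after the other; parents on the same level send concurrently.
 * tree_broadcast_time is a hypothetical helper name. */
double tree_broadcast_time(int p, double t_startup, double t_data, int w)
{
    int levels = 0;

    /* Count tree levels: the number of doublings needed to reach p leaves.
     * Equals log2(p) when p is a power of two (assumed here). */
    for (int nodes = 1; nodes < p; nodes *= 2)
        levels++;

    return 2.0 * levels * (t_startup + (double)w * t_data);
}
```

With assumed values p = 8, t_startup = 10, t_data = 1 and w = 5, the tree has 3 levels and the estimate is 2 × 3 × (10 + 5) = 90 steps.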
Benchmark Factors
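With $t_s$, $t_{comp}$, and $t_{comm}$, we can establish the speedup factor and computation/communication ratio for a particular algorithm/implementation:
Speedup factor = $t_s / t_p = t_s / (t_{comp} + t_{comm})$
Computation/communication ratio = $t_{comp} / t_{comm}$
Both are functions of the number of processors, $p$, and the number of data elements, $n$.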
These factors give an indication of the scalability of the parallel solution with increasing number of processors and problem size.
The computation/communication ratio will highlight the effect of communication with increasing problem size and system size.
We want computation, rather than communication, to be the dominant factor, so that as n increases communication can be ignored and adding more processors can improve the performance.
Example
• Adding $n$ numbers using two computers, each adding $n/2$ numbers.
• Numbers initially held in one computer.
Computer 1 sends n/2 numbers to Computer 2: $t_{comm} = t_{startup} + (n/2) t_{data}$
Both computers add up n/2 numbers: $t_{comp} = n/2$
Computer 2 sends its result back: $t_{comm} = t_{startup} + t_{data}$
Computer 1 adds the partial sums: $t_{comp} = 1$
Overall
$t_{comm} = 2 t_{startup} + (n/2 + 1) t_{data} = O(n)$
$t_{comp} = n/2 + 1 = O(n)$
Computation/Communication ratio = O(1)
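To make the totals above concrete, here is a small C sketch that evaluates them and derives the speedup factor, taken here as $t_s / (t_{comp} + t_{comm})$. The sequential time $t_s = n - 1$ additions is an assumption of this sketch, as are the function names; times are in units of one computational step.

```c
/* Sequential time for adding n numbers.
 * Assumption of this sketch: n - 1 additions, one step each. */
double add_seq_time(long n)
{
    return (double)(n - 1);
}

/* Parallel computation time from the example: n/2 additions on each
 * computer, plus one addition to combine the two partial sums. */
double add_par_comp(long n)
{
    return n / 2.0 + 1.0;
}

/* Parallel communication time from the example:
 * 2 * t_startup + (n/2 + 1) * t_data */
double add_par_comm(long n, double t_startup, double t_data)
{
    return 2.0 * t_startup + (n / 2.0 + 1.0) * t_data;
}

/* Speedup factor: t_s / (t_comp + t_comm). */
double add_speedup(long n, double t_startup, double t_data)
{
    return add_seq_time(n) /
           (add_par_comp(n) + add_par_comm(n, t_startup, t_data));
}

/* Computation/communication ratio: t_comp / t_comm. */
double add_ratio(long n, double t_startup, double t_data)
{
    return add_par_comp(n) / add_par_comm(n, t_startup, t_data);
}
```

With assumed values n = 1000, t_startup = 100 and t_data = 1, this gives t_comp = 501 and t_comm = 701, so the speedup is 999/1202, which is below 1. Because both terms grow as O(n), increasing n does not let computation dominate, which is what the O(1) ratio predicts.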
Another problem
Computation time complexity = $O(n^2)$
Communication time complexity = $O(n)$
Computation/Communication ratio = $O(n)$
Cost
Cost = (execution time) × (number of processors)
Cost of sequential computation = $t_s$
Cost of parallel computation = $t_p \times p$
Cost-optimal algorithm
When parallel computation cost is proportional to sequential computation cost:
Cost = $t_p \times p = k \times t_s$, where $k$ is a constant
Example
Suppose $t_s = O(n \log n)$ for the best sequential program, where
$n$ = number of data items
$p$ = number of processors
Cost optimal if $t_p = O(n \log n)/p = O((n/p) \log n)$
Not cost optimal if $t_p = O(n^2/p)$
A parallel algorithm is cost-optimal if the parallel time complexity times the number of processors equals the sequential time complexity.
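The two cases can be checked by multiplying each parallel time by $p$ and comparing with $t_s = O(n \log n)$. The short derivation below only restates the definition of cost given in these notes:

```latex
\begin{align*}
t_p = O\!\left(\tfrac{n}{p}\log n\right)
  &\;\Rightarrow\; \text{Cost} = t_p \times p = O(n \log n) = O(t_s)
  && \text{(cost optimal)} \\
t_p = O\!\left(\tfrac{n^2}{p}\right)
  &\;\Rightarrow\; \text{Cost} = t_p \times p = O(n^2) \neq O(n \log n)
  && \text{(not cost optimal)}
\end{align*}
```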
Evaluating programs
• Measuring the execution time
• Time-complexity analysis gives an insight into the parallel algorithm and is useful in comparing different algorithms. We want to know how the algorithm actually performs in a real system.
• We can measure the elapsed time between two points in the code in seconds.
– System calls, such as clock(), time(), gettimeofday(), or MPI_Wtime()
– Example:
L1: time(&t1);                          /* record start time */
    .
    .
L2: time(&t2);                          /* record end time */
    elapsed_time = difftime(t2, t1);    /* elapsed time in seconds */
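Because time() typically has a resolution of one second, short code regions usually need a finer clock. The following is a minimal sketch using gettimeofday(), one of the calls listed above. The helper names `wall_time`, `sum_to` and `time_sum` are hypothetical, and the summation loop merely stands in for whatever code sits between the two timing points.

```c
#include <stddef.h>
#include <sys/time.h>

/* Current wall-clock time in seconds, with microsecond resolution. */
double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Hypothetical workload to be timed: sums the integers 0 .. n-1. */
double sum_to(long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += (double)i;
    return s;
}

/* Times one call of sum_to(n). Returns the elapsed time in seconds
 * and stores the computed sum in *result. */
double time_sum(long n, double *result)
{
    double t1 = wall_time();   /* first timing point  */
    *result = sum_to(n);       /* code being measured */
    double t2 = wall_time();   /* second timing point */
    return t2 - t1;
}
```

In an MPI program the same pattern would normally use MPI_Wtime() in place of `wall_time`. A single measurement can be noisy, so repeating the timed region and averaging is a common refinement.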
Communication Time by the Ping-Pong Method
• Point-to-point communication time of a specific system can be found using the ping-pong method.
• One process p0 sends a message to another process, say p1. Immediately upon receiving the message, p1 sends the message back to p0. The time is divided by two to obtain an estimate of the time of one-way communication. For example, at p0:
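time(&t1);                               /* start of round trip */
send(&x, p1);                            /* ping: send x to p1 */
recv(&x, p1);                            /* pong: wait for x to come back from p1 */
time(&t2);                               /* end of round trip */
elapsed_time = 0.5 * difftime(t2, t1);   /* one-way estimate: half the round trip */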
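The ping-pong pattern can be tried on a single machine with the sketch below. It is only an illustration under loud assumptions: two processes created with fork() stand in for p0 and p1, a POSIX socketpair() stands in for the interconnect, and the message is one byte. It therefore measures local inter-process communication, not a cluster network. The names `now_seconds` and `ping_pong` are hypothetical. The round trip is repeated and averaged because a single exchange is too short to time reliably.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wall-clock time in seconds (microsecond resolution). */
static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Estimates the one-way time, in seconds, for a 1-byte message between
 * two local processes, averaged over `rounds` round trips.
 * Returns -1.0 on error or if rounds <= 0. */
double ping_pong(int rounds)
{
    int sv[2];
    char x = 'x';

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1.0;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1.0;
    }

    if (pid == 0) {
        /* "p1": send every received byte straight back. */
        char y;
        close(sv[0]);
        while (read(sv[1], &y, 1) == 1) {
            if (write(sv[1], &y, 1) != 1)
                break;
        }
        _exit(0);
    }

    /* "p0": time `rounds` send/receive pairs. */
    close(sv[1]);
    int ok = 1;
    double t1 = now_seconds();
    for (int i = 0; i < rounds && ok; i++)
        ok = (write(sv[0], &x, 1) == 1) && (read(sv[0], &x, 1) == 1);
    double t2 = now_seconds();

    close(sv[0]);              /* p1 sees end-of-file and exits */
    waitpid(pid, NULL, 0);

    if (!ok || rounds <= 0)
        return -1.0;
    return 0.5 * (t2 - t1) / rounds;   /* half the mean round-trip time */
}
```

On a message-passing system the body of the loop would instead be a send followed by a receive, for example MPI_Send and MPI_Recv timed with MPI_Wtime(). Repeating the measurement for several message sizes gives points from which $t_{startup}$ and $t_{data}$ could be estimated.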
Profiling
• A profile of a program is a histogram or graph showing the time spent in different parts of the program, or the number of times certain source code statements are executed.
• It can help to identify hot spots, places in a program visited many times during execution. These places could be optimized first.
Program Profile Histogram
[Figure: program profile histogram; horizontal axis: statement number or region of program.]