Evaluating Parallel Programs


Cluster Computing, UNC-Charlotte, B. Wilkinson.

Sequential execution time, t_s: estimate by counting the computational steps of the best sequential algorithm.

Parallel execution time, t_p: in addition to the number of computational steps, t_comp, we need to estimate the communication overhead, t_comm:

    t_p = t_comp + t_comm


Computational Time

Count the number of computational steps. When more than one process executes simultaneously, count the computational steps of the most complex process. Generally t_comp is a function of n and p, i.e.

    t_comp = f(n, p)

Often the computation time is broken down into parts, so that

    t_comp = t_comp1 + t_comp2 + t_comp3 + ...

Analysis is usually done assuming that all processors are identical and operate at the same speed.


Communication Time

Communication time depends on many factors, including the network structure. As a first approximation, use

    t_comm = t_startup + n * t_data

t_startup -- the startup time, essentially the time to send a message with no data; assumed constant.

t_data -- the transmission time to send one data word, also assumed constant; there are n data words.


Idealized Communication Time

[Figure: idealized communication time as a straight line against the number of data items, n; the intercept is the startup time.]

This equation ignores the fact that the source and destination may not be directly linked in a real system, so the message may pass through intermediate nodes. It also assumes that the overhead incurred by including information other than data in a packet is constant and can be folded into the startup time.


Final Communication Time, t_comm

The summation of the communication times of all sequential messages from one process, i.e.

    t_comm = t_comm1 + t_comm2 + t_comm3 + ...

The communication patterns of all processes are assumed to be the same and to take place together, so only one process need be considered.

Both t_startup and t_data are measured in units of one computational step, so that t_comp and t_comm can be added together to obtain the parallel execution time, t_p.
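A minimal C sketch of this model (illustrative only: the constants T_STARTUP and T_DATA below are assumed values in computational-step units, not measurements):

    /* Predict t_p = t_comp + t_comm for one process.
       T_STARTUP and T_DATA are hypothetical machine constants. */
    #include <stdio.h>

    #define T_STARTUP 1000.0   /* assumed startup time, in step units        */
    #define T_DATA      50.0   /* assumed time per data word, in step units  */

    /* t_comm = t_startup + n * t_data */
    static double t_comm(long n) { return T_STARTUP + n * T_DATA; }

    int main(void) {
        double t_comp = 1.0e6;                /* counted computational steps */
        double t_p = t_comp + t_comm(5000);   /* one 5000-word message       */
        printf("predicted t_p = %.0f step units\n", t_p);
        return 0;
    }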

Communication Time of Broadcast/Gather

If the broadcast is done over a single shared wire, as in classical Ethernet, the time complexity is O(1) for a single data item and O(w) for w data items.

If a binary tree is used as the underlying network structure with a 1-to-N fan-out broadcast, what is the communication cost for p final destinations (leaf nodes) using w-word messages?

We assume that the left and right children receive the message from their parent sequentially, but that at each level the different parent nodes send out the message at the same time.


1-to-N Fan-out Broadcast

    t_comm = 2 (log p)(t_startup + w * t_data)

The cost depends on the number of levels and the number of nodes at each level. For a binary tree with p final destinations at the leaf level there are log p levels, and each parent sends to its two children one after the other, which gives the factor of 2.
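A quick sketch of this formula in C (p, w, and the timing constants are illustrative assumptions):

    /* Binary-tree fan-out broadcast: t_comm = 2*(log2 p)*(t_startup + w*t_data) */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double t_startup = 1000.0, t_data = 50.0;  /* assumed constants       */
        int p = 64, w = 100;                       /* leaves, words/message   */
        double t = 2.0 * log2((double)p) * (t_startup + w * t_data);
        printf("broadcast time ~ %.0f step units\n", t);  /* 2*6*6000 = 72000 */
        return 0;
    }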



Benchmark Factors

With t_s, t_comp, and t_comm, we can establish the speedup factor and the computation/communication ratio for a particular algorithm/implementation:

    Speedup factor S(p) = t_s / t_p = t_s / (t_comp + t_comm)

    Computation/communication ratio = t_comp / t_comm

Both are functions of the number of processors, p, and the number of data elements, n.
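A tiny C illustration of the two definitions (the times plugged in are assumed values, not measurements):

    /* Compute the two benchmark factors for illustrative times (step units). */
    #include <stdio.h>

    int main(void) {
        double t_s = 1.0e6;                      /* sequential steps (assumed) */
        double t_comp = 2.6e5, t_comm = 1.4e5;   /* parallel parts (assumed)   */
        double t_p = t_comp + t_comm;
        printf("speedup S = %.2f\n", t_s / t_p);        /* S = t_s / t_p      */
        printf("comp/comm ratio = %.2f\n", t_comp / t_comm);
        return 0;
    }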


These factors give an indication of the scalability of the parallel solution with increasing number of processors and increasing problem size. The computation/communication ratio highlights the effect of communication as the problem size and system size grow.

We want computation, not communication, to be the dominant factor: then, as n increases, communication can be ignored and adding more processors improves performance.


Example

Adding n numbers using two computers, each adding n/2 numbers. The numbers are initially held in one computer.

Computer 1 sends n/2 numbers to Computer 2:    t_comm1 = t_startup + (n/2) t_data
Both computers add up their n/2 numbers:       t_comp1 = n/2
Computer 2 sends its result back:              t_comm2 = t_startup + t_data
Computer 1 adds the two partial sums:          t_comp2 = 1


Overall

    t_comm = 2 t_startup + (n/2 + 1) t_data = O(n)
    t_comp = n/2 + 1 = O(n)

Computation/communication ratio = O(1)
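A short C check of this analysis (the timing constants are illustrative assumptions): as n grows, the ratio settles to the constant 1/t_data, which is the O(1) behavior claimed above.

    /* Two-computer addition: ratio t_comp/t_comm tends to a constant. */
    #include <stdio.h>

    int main(void) {
        double t_startup = 1000.0, t_data = 50.0;  /* assumed constants */
        for (long n = 1000; n <= 100000000L; n *= 100) {
            double t_comm = 2.0 * t_startup + (n / 2.0 + 1.0) * t_data;
            double t_comp = n / 2.0 + 1.0;
            printf("n = %9ld  ratio = %.4f\n", n, t_comp / t_comm);
        }
        return 0;
    }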





Another Problem

Computation time complexity = O(n^2)
Communication time complexity = O(n)

Computation/communication ratio = O(n)

Here computation dominates as n increases, which is the desirable situation.





Cost

Cost = (execution time) x (number of processors)

Cost of sequential computation = t_s

Cost of parallel computation = t_p x p

Cost-optimal algorithm

When the parallel computation cost is proportional to the sequential computation cost:

    Cost = t_p x p = k x t_s

where k is a constant.


Example

Suppose t_s = O(n log n) for the best sequential program, where n is the number of data items and p is the number of processors.

Cost-optimal if

    t_p = O(n log n) / p = O((n/p) log n)

Not cost-optimal if

    t_p = O(n^2 / p)

A parallel algorithm is cost-optimal if its parallel time complexity multiplied by the number of processors equals the sequential time complexity.
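Multiplying out both cases makes the criterion concrete:

    Cost = p x O((n/p) log n) = O(n log n) = k x t_s         -- cost-optimal
    Cost = p x O(n^2 / p)     = O(n^2), growing faster than t_s -- not cost-optimal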




Evaluating Programs

Measuring the execution time

Time-complexity analysis gives insight into a parallel algorithm and is useful for comparing different algorithms, but we also want to know how the algorithm actually performs on a real system.

We can measure the elapsed time between two points in the code, in seconds, using system calls such as clock(), time(), gettimeofday(), or MPI_Wtime().


Example:

    L1: time(&t1);
        .
        .
    L2: time(&t2);
        elapsed_time = difftime(t2, t1);
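A complete, runnable version of this measurement (the timed section here is just a placeholder loop; note that time() has only one-second resolution, so gettimeofday() or MPI_Wtime() is preferable for short intervals):

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        time_t t1, t2;
        double elapsed_time;
        time(&t1);                            /* L1: start of timed section */
        volatile double x = 0.0;              /* placeholder work to time   */
        for (long i = 0; i < 200000000L; i++)
            x += (double)i;
        time(&t2);                            /* L2: end of timed section   */
        elapsed_time = difftime(t2, t1);
        printf("elapsed_time = %.0f s\n", elapsed_time);
        return 0;
    }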



Communication Time by the Ping-Pong Method


The point-to-point communication time of a specific system can be found using the ping-pong method.

One process, p0, sends a message to another process, say p1. Immediately upon receiving the message, p1 sends the message back to p0. The round-trip time is divided by two to obtain an estimate of the one-way communication time. For example, at p0:

    time(&t1);
    send(&x, p1);
    recv(&x, p1);
    time(&t2);
    elapsed_time = 0.5 * difftime(t2, t1);
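The send()/recv() calls above are generic pseudocode; one possible MPI realization, using MPI_Wtime() for better resolution, is sketched below (run with at least two processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        double x = 0.0, t1, t2;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                      /* p0: time the round trip  */
            t1 = MPI_Wtime();
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            t2 = MPI_Wtime();
            printf("one-way time ~ %g s\n", 0.5 * (t2 - t1));
        } else if (rank == 1) {               /* p1: echo the message back */
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }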



Profiling

A profile of a program is a histogram or graph showing the time spent in different parts of the program, or the number of times particular pieces of source code are executed.

It can help to identify hot spots: places in a program visited many times during execution. These places could be optimized first.
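On Unix-like systems, one common way to obtain such a profile is gprof (assuming gcc; other compilers and profilers work similarly):

    gcc -pg myprog.c -o myprog    # compile with profiling instrumentation
    ./myprog                      # run the program; writes gmon.out
    gprof myprog gmon.out         # print the flat profile and call graph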

Program Profile Histogram

[Figure: program profile histogram; x-axis: statement number or region of the program, y-axis: time or number of executions.]