Performance Analysis of Parallel Processing
Dr. Subra Ganesan
Department of Computer Science and Engineering
School of Engineering and Computer Science
Oakland University
CSE664
Project report
Winter 2000
4/29/2000
BY
Ahmad Milhim
Contents
1. Introduction
2. Definitions
   2.1 Performance analysis
   2.2 Performance analysis techniques
   2.3 Performance analysis metrics
3. Performance means
   3.1 Arithmetic mean performance
   3.2 Geometric mean performance
   3.3 Harmonic mean performance
4. Speedup performance laws
   4.1 Asymptotic speedup
   4.2 Harmonic speedup
   4.3 Amdahl's speedup law
5. Workloads
   5.1 Types of test workloads
   5.2 HINT
6. References
1. Introduction
The goal of all computer system users and designers is to get the highest
performance at the lowest cost. Basic knowledge of performance evaluation
terminology and technology is a must for computer professionals to evaluate
their systems. Performance analysis is required at every stage of the life
cycle of a computer system, from design through manufacturing, sales, and
upgrade.
Designers do performance analysis when they want to compare a number of
alternative designs and pick the best one. System administrators use
performance analysis to choose a system from a set of possible systems for a
certain application. Users need to know how well their systems are
performing and whether an upgrade is needed.
It is said that performance analysis is an art: every analysis requires an
intimate knowledge of the system being evaluated. Computer applications are
so numerous and different that it is not possible to have a standard measure
of performance for all applications. Three techniques are used for
performance analysis: analytical modeling, simulation, and measurement.
Several considerations help analysts choose which technique to use. The key
consideration is the stage of the system's life cycle. If the system is a new
concept, analytical modeling and simulation are the only possible techniques,
although they can be based on previous measurements of other similar systems.
Considerations of less importance are the time available for analysis, tools,
accuracy, and cost.
Users of computer systems seek ways to increase the productivity of their
systems, in terms of both computer hardware and programmers, and thereby
reduce the cost of computing. Computer throughput was increased by making
the operating system handle resource sharing.
There has always been a desire to evaluate how well computer systems are
performing, and to find ways of improving their performance.
Measurement and analysis of parallel processing is a new field. The analysis
of parallel processing helps users and researchers answer key questions
about the configuration of parallel systems. The questions analysts usually
try to answer include: what is the best way to configure the processing
elements to solve a particular problem, how much overhead is generated by
parallelism, what is the effect of varying the number of processors on
performance, whether shared memory or local memory is better, and so on.
2. Definitions
In this section the most common terms and concepts relating to the
performance analysis of systems in general, and of parallel processing in
particular, are introduced, along with how these terms are applied to
computer systems and parallel processing. First the definition of performance
analysis is introduced; then performance metrics and techniques are discussed
in more detail.
2.1 Performance analysis. The ideal performance of a computer system can be
achieved if a perfect match between hardware capability and software
behavior is reached.
2.2 Performance analysis techniques. Three analysis techniques are used for
performance analysis: analytical modeling, simulation, and measurement. The
choice among them depends on certain considerations. The most important
consideration is the stage at which the analysis is to be performed.
Measurement cannot be performed unless the system, or at least a similar
system, already exists. If the proposed design is a new idea, then only
analytical modeling or simulation can be used.
2.2.1 Analytical modeling. Used if the system is in the early design stages
and the results are needed soon. It provides insight into the underlying
system, but it may not be precise because of the simplifications made in the
model and its mathematical equations.
2.2.2 Simulation. Simulation models provide an easy way to predict the
performance of computer systems before they exist. Simulation is also used to
validate the results of analytical modeling and measurement. It provides
snapshots of system behavior.
2.2.3 Measurement. It can be used only if the system is available for
measurement (postprototype), or at least if other systems similar to the
system under design exist. Its cost is high compared to the other two
techniques because hardware and software instruments are needed to perform
the measurements. It also takes a long time to monitor the system
efficiently. The most important considerations used to choose one or more of
these three techniques are illustrated in Figure 1 below.
Figure 1. Considerations in order of importance
It is said that the results of one technique should not be trusted until they
have been validated by the other two. This emphasizes that relying on a
single technique may be misleading, or at least may not give accurate
results.
2.3 Performance metrics. Each performance study has its own metrics that can
be defined and used to evaluate the specific system under study. Although
these metrics vary from one system to another, common metrics exist and are
used in most general computer system evaluations. The most common metrics
are introduced here.
Criteria for selecting an analysis technique (content of Figure 1), listed
from the most important to the least important:

Criterion       Analytical modeling   Simulation           Measurement
--------------  --------------------  -------------------  -------------
Stage           Any                   Any                  Postprototype
Time required   Small                 Medium               Varies
Tools           Analysts              Computer languages   Instruments
Accuracy        Low                   Moderate             Varies
Cost            Small                 Medium               High
2.3.1 Response time. The interval between a user's request and the system's
response. In general the response time increases as the load increases. Two
related terms are turnaround time and reaction time. Turnaround time is the
time elapsed between submitting a job and the completion of its output.
Reaction time is the time elapsed between submitting a request and the
beginning of its execution by the computer system. The stretch factor is the
ratio of the response time at a particular load to that at the minimum load.
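The timing terms above can be made concrete with a minimal Python sketch. The
function names and the timestamps are hypothetical and chosen only for
illustration; all times are assumed to be in seconds.

```python
def reaction_time(submitted: float, started: float) -> float:
    """Time from submitting a request to the start of its execution."""
    return started - submitted


def turnaround_time(submitted: float, completed: float) -> float:
    """Time from submitting a job to the completion of its output."""
    return completed - submitted


def stretch_factor(response_at_load: float, response_at_min_load: float) -> float:
    """Ratio of the response time at a given load to that at minimum load."""
    if response_at_min_load <= 0:
        raise ValueError("response time at minimum load must be positive")
    return response_at_load / response_at_min_load


# Hypothetical job: submitted at t=2.0, starts at t=2.5, completes at t=6.0.
print(reaction_time(2.0, 2.5))    # 0.5
print(turnaround_time(2.0, 6.0))  # 4.0
print(stretch_factor(1.5, 0.5))   # 3.0
```

A stretch factor of 3.0 here means that requests take three times longer at
the observed load than they do on a nearly idle system.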
2.3.2 Throughput. The rate (requests per unit time) at which requests can be
serviced by the system. For CPUs the rate is measured in million
instructions per second (MIPS) or million floating-point operations per
second (MFLOPS). For networks the rate is measured in bits per second (bps)
or packets per second, and for transaction systems it is measured in
transactions per second (TPS). In general the throughput of the system
increases as the load initially increases; after a certain load the
throughput stops increasing and in some cases starts decreasing.
The maximum achievable throughput under ideal workload conditions is called
the nominal capacity of the system. In computer networks it is called the
bandwidth and is measured in bits per second. The response time at the
nominal capacity is often too high, which leads to the definition of a new
term, the usable capacity: the maximum achievable throughput without
exceeding a prespecified response time. The optimal operating point is the
point after which the response time increases rapidly as a function of the
load while the gain in throughput is small. It is the knee capacity in
Figure 2 below.
Figure 2. Capacity of the system. (Two plots against load: throughput, with
the knee, usable, and nominal capacities marked, and response time.)
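As a rough illustration of the capacity terms, the following Python sketch
estimates them from measured samples. The sample data, the function names,
and the use of the highest observed throughput as a stand-in for the nominal
capacity are all assumptions made for this example; a real study would obtain
the nominal capacity under ideal workload conditions.

```python
# Each sample is (load, throughput, response_time); the values are hypothetical.
SAMPLES = [
    (10, 10.0, 0.10),
    (20, 20.0, 0.15),
    (30, 30.0, 0.30),
    (40, 35.0, 0.80),
    (50, 36.0, 2.00),
]


def nominal_capacity(samples):
    """Highest throughput observed (an approximation of nominal capacity)."""
    return max(throughput for _, throughput, _ in samples)


def usable_capacity(samples, max_response_time):
    """Highest throughput whose response time stays within the given limit."""
    within_limit = [thr for _, thr, rt in samples if rt <= max_response_time]
    return max(within_limit) if within_limit else None


print(nominal_capacity(SAMPLES))      # 36.0
print(usable_capacity(SAMPLES, 0.5))  # 30.0
```

With a response time limit of 0.5 seconds, the usable capacity (30 requests
per second) is noticeably lower than the highest observed throughput.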
2.3.3 System efficiency E(n). Efficiency is defined as the ratio of the
maximum achievable throughput (usable capacity) to the maximum achievable
capacity under ideal workload conditions (nominal capacity). For parallel
processing it indicates the actual degree of speedup achieved compared with
the maximum value: it is the speedup of an n-processor system over a
single-processor system, divided by n. The maximum efficiency, E(n) = 1,
corresponds to the case where all processors are fully utilized throughout
the execution period. The minimum efficiency, E(n) = 1/n, corresponds to the
case where the entire program code is executed sequentially on a single
processor, as illustrated in Figure 3.
E(n) = S(n)/n
Figure 3. Efficiency versus number of processors.
2.3.4 Speedup S(n). Speedup indicates the degree of speed gain in parallel
computation. Three speedup measures, asymptotic speedup, harmonic speedup,
and Amdahl's speedup law, are discussed in more detail later in this paper.
Put simply, the speedup is the ratio of the time taken by one processor to
execute a program to the time taken by n processors to execute the same
program.
S(n) = T(1)/T(n)
2.3.5 Redundancy R(n). The ratio of the total number of unit operations
performed by an n-processor system, O(n), to the total number of unit
operations performed by a single-processor system, O(1).
R(n) = O(n)/O(1)
2.3.6 Utilization U(n). The fraction of time a resource is busy, that is,
the ratio of busy time to total time over an observation period. Idle time is
the time during which a resource is not used. Balancing the load across
resources produces more efficient systems when a parallel processing system
is being designed. In terms of system redundancy and efficiency, the
utilization is the product of the two. It indicates the percentage of the
resources that were kept busy during the execution of a parallel program.
U(n) = R(n)*E(n)
2.3.7 Quality Q(n). This metric combines the effects of speedup, efficiency,
and redundancy to assess the relative merits of a parallel computation. It is
directly proportional to the speedup and efficiency and inversely
proportional to the redundancy.
Q(n) = S(n)*E(n)/R(n)
Since the efficiency is always a fraction (at most 1) and the redundancy is
at least 1, the quality of the parallel system is upper bounded by the
speedup.
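The formulas for S(n), E(n), R(n), U(n), and Q(n) can be evaluated together.
The Python sketch below is only an illustration: the function name and the
input numbers are hypothetical, with execution times and operation counts
assumed to come from measurement.

```python
def parallel_metrics(t1: float, tn: float, o1: float, on: float, n: int) -> dict:
    """Compute the parallel-processing metrics defined in Section 2.3.

    t1, tn: execution time on 1 and on n processors
    o1, on: unit operations performed by the 1- and n-processor systems
    """
    speedup = t1 / tn                               # S(n) = T(1)/T(n)
    efficiency = speedup / n                        # E(n) = S(n)/n
    redundancy = on / o1                            # R(n) = O(n)/O(1)
    utilization = redundancy * efficiency           # U(n) = R(n)*E(n)
    quality = speedup * efficiency / redundancy     # Q(n) = S(n)*E(n)/R(n)
    return {
        "S": speedup,
        "E": efficiency,
        "R": redundancy,
        "U": utilization,
        "Q": quality,
    }


# Hypothetical run: 100 time units sequentially, 25 on 8 processors,
# with 20% extra operations caused by parallelization.
metrics = parallel_metrics(t1=100, tn=25, o1=100, on=120, n=8)
for name, value in metrics.items():
    print(f"{name}(8) = {value:.3f}")
```

For these inputs the speedup is 4 and the efficiency 0.5, and the quality
stays below the speedup, as the bound stated above requires.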
2.3.8 Reliability. It is measured by the probability of errors or by the
mean time between errors.
2.3.9 Availability. The availability of a resource is the fraction of time
the resource is available to service users' requests. Two other terms used in
system analysis are downtime and uptime. System downtime is the time during
which the system is not available; uptime is the time during which the
system is available [1].
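Reliability and availability can be estimated from simple operational
records. The Python sketch below assumes a hypothetical log of uptime,
downtime, and error timestamps, all in hours; the function names are
illustrative only.

```python
def availability(uptime: float, downtime: float) -> float:
    """Fraction of the observation period during which the system was up."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime / total


def mean_time_between_errors(error_times: list) -> float:
    """Average gap between consecutive error timestamps."""
    if len(error_times) < 2:
        raise ValueError("need at least two errors to compute a gap")
    ordered = sorted(error_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical log: 99 hours up, 1 hour down, errors at hours 0, 10, 30, 60.
print(availability(99.0, 1.0))
print(mean_time_between_errors([0, 10, 30, 60]))  # 20.0
```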
2.3.10 Cost/performance ratio. This metric is used to compare two or more
systems in terms of their cost and performance. The cost should include the
cost of hardware, software, etc. The performance is measured in terms of
throughput for a prespecified response time.
3. Performance means.
To understand the concepts of performance analysis and the significance of
the output of the performance techniques, one should understand the general
concepts of mathematical performance laws and performance means. In this
section performance means are discussed briefly.
3.1 Arithmetic mean performance. It is defined as the ratio of the sum of
the execution rates of all programs to the total number of programs used.
Symbolically, let $R_j$ be the execution rate of program $j$, where
$j = 1, 2, \ldots, m$. Then the arithmetic mean performance is
$$R_a = \frac{1}{m} \sum_{j=1}^{m} R_j$$
This holds only if all programs have equal weighting. If the programs are
weighted, the weighted arithmetic mean is defined as
$$R_a^* = \sum_{j=1}^{m} f_j R_j$$
where $f_j$ is the weight of program $j$ and the weights sum to 1. With equal
weights $f_j = 1/m$ this reduces to $R_a$.
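A small Python sketch of the two arithmetic means follows. The rates and
weights are hypothetical. The weighted version assumes that the weights f_j
sum to 1, so the weighted sum is not divided by m again.

```python
def arithmetic_mean_rate(rates):
    """Equally weighted arithmetic mean of execution rates."""
    return sum(rates) / len(rates)


def weighted_arithmetic_mean_rate(rates, weights):
    """Weighted arithmetic mean; the weights are assumed to sum to 1."""
    if len(rates) != len(weights):
        raise ValueError("rates and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(f * r for f, r in zip(weights, rates))


rates = [10.0, 20.0, 30.0]   # hypothetical execution rates, e.g. in MIPS
weights = [0.5, 0.25, 0.25]  # hypothetical program weights

print(arithmetic_mean_rate(rates))                    # 20.0
print(weighted_arithmetic_mean_rate(rates, weights))  # 17.5
```

Giving the slowest program half of the total weight pulls the mean from 20.0
down to 17.5.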
3.2 Geometric mean performance. The geometric mean of n values is obtained
by multiplying the values together and taking the nth root of the product.
It is used only if the product of the values is a quantity of interest. The
geometric mean of the execution rates $R_j$ is
$$R_g = \left( \prod_{j=1}^{m} R_j \right)^{1/m}$$
where $m$ is the number of programs. Similarly, this value is correct only if
all the programs have equal weights. Neither the arithmetic mean nor the
geometric mean represents the real performance of the execution rates of
benchmark programs, which leads to the definition of a third kind of
performance mean.
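The geometric mean can be sketched in Python as follows. Computing it through
logarithms is an implementation choice made here to avoid overflow when many
rates are multiplied; the rates are hypothetical.

```python
import math


def geometric_mean_rate(rates):
    """Equally weighted geometric mean of positive execution rates."""
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("rates must be a non-empty list of positive numbers")
    # exp(mean(log r)) equals the m-th root of the product of the rates.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


print(geometric_mean_rate([1.0, 10.0, 100.0]))  # approximately 10
```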
3.3 Harmonic mean performance. The harmonic mean of n values is defined as
the ratio of the number of values n to the sum of the fractions $1/x_j$,
where $x_j$ is the jth value in the set. In the context of performance
analysis, it is defined as the average performance across a large number of
programs running in various execution modes. These modes correspond to
scalar, vector, sequential, or parallel processing of different program
parts. Symbolically, the harmonic mean performance is
$$R_h = \frac{m}{\sum_{j=1}^{m} (1/R_j)}$$
where $R_j$ is the execution rate of program $j$, $m$ is the total number of
programs, and $R_h$ is the harmonic mean under equal weighting of all
programs. The harmonic mean is the closest of the means to the real
performance.
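One way to see why the harmonic mean tracks real performance is to compare it
with total work divided by total time. The Python sketch below assumes, for
illustration only, that every program performs the same amount of work; the
rates and the work size are hypothetical.

```python
def harmonic_mean_rate(rates):
    """Equally weighted harmonic mean of positive execution rates."""
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("rates must be a non-empty list of positive numbers")
    return len(rates) / sum(1.0 / r for r in rates)


def overall_rate(rates, work_per_program):
    """Total work divided by total time when each program does equal work."""
    total_work = work_per_program * len(rates)
    total_time = sum(work_per_program / r for r in rates)
    return total_work / total_time


rates = [10.0, 40.0]  # hypothetical rates, e.g. in MFLOPS

print(harmonic_mean_rate(rates))   # approximately 16
print(overall_rate(rates, 100.0))  # approximately 16
print(sum(rates) / len(rates))     # arithmetic mean: 25.0
```

The harmonic mean matches the overall rate, while the arithmetic mean of 25
overstates it because the slow program dominates the running time.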
4. Speedup performance laws.
4.1 Asymptotic speedup law. Consider a parallel system of n processors, and
denote by $w_i$ the amount of work done with a degree of parallelism (DOP)
equal to $i$. The execution time of $w_i$ on a single processor is $t_i(1)$;
similarly, the execution time of $w_i$ on $k$ processors is $t_i(k)$. $T(1)$
is the response time of a single-processor system when executing the
workload $w_i$, and $T(\infty)$ is the response time of executing the same
workload if an infinite number of processors is available. The asymptotic
speedup $S_\infty$ is defined as the ratio of $T(1)$ to $T(\infty)$:
$$S_\infty = \frac{T(1)}{T(\infty)}$$
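The asymptotic speedup can be sketched in Python under two simplifying
assumptions that are not stated in the text above: every processor completes
one unit of work per unit time, and there is no communication or scheduling
overhead. Under these assumptions work with DOP i takes w_i time units on one
processor and w_i / i time units when unlimited processors are available. The
workload profile is hypothetical.

```python
def asymptotic_speedup(work_by_dop: dict) -> float:
    """S_inf = T(1) / T(inf) for a workload given as {DOP i: work w_i}.

    Assumes unit-speed processors and no parallel overhead.
    """
    if not work_by_dop or any(i < 1 for i in work_by_dop):
        raise ValueError("need a non-empty profile with DOP values >= 1")
    t_one = sum(work_by_dop.values())                       # T(1)
    t_inf = sum(w / i for i, w in work_by_dop.items())      # T(inf)
    return t_one / t_inf


# Hypothetical profile: 10 units sequential, 20 units at DOP 2, 40 at DOP 4.
profile = {1: 10.0, 2: 20.0, 4: 40.0}
print(asymptotic_speedup(profile))  # 70 / 30, about 2.33
```

Even with unlimited processors the speedup here is only about 2.33, because
each part of the workload can use at most as many processors as its DOP.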
4.2 Harmonic speedup law. Suppose a workload of multiple programs is to be
executed on an n-processor system. The workload may use different numbers of
processors at different times during execution. The program is executed in
mode $i$ if $i$ processors are used, and $R_i$ is the corresponding execution
rate, which reflects the collective speed of $i$ processors. The weighted
harmonic speedup $S$ is defined as the ratio of the sequential execution time
$T_1$ to the weighted arithmetic mean execution time $T^*$ across the n
execution modes:
$$S = \frac{T_1}{T^*}$$
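A minimal Python sketch of the weighted harmonic speedup follows. It assumes
one unit of work, so that the execution time in mode i is 1/R_i, and weights
f_i that sum to 1; the weights, the rates, and the choice R_i = i are
hypothetical values used only for illustration.

```python
def harmonic_speedup(weights: dict, rates: dict) -> float:
    """S = T_1 / T*, where T* = sum_i f_i / R_i over the execution modes.

    weights: {mode i: weight f_i}, assumed to sum to 1
    rates:   {mode i: execution rate R_i}; must include mode 1
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    t_sequential = 1.0 / rates[1]                               # T_1
    t_star = sum(f / rates[i] for i, f in weights.items())      # T*
    return t_sequential / t_star


# Hypothetical two-mode workload on 4 processors with R_i = i:
# 25% of the weight in sequential mode, 75% in fully parallel mode.
weights = {1: 0.25, 4: 0.75}
rates = {1: 1.0, 4: 4.0}
print(harmonic_speedup(weights, rates))  # 1 / 0.4375, about 2.29
```

With only the sequential and the fully parallel mode in use, this agrees with
the Amdahl expression n / (1 + (n - 1) * 0.25) for n = 4.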
4.3 Amdahl's speedup law. This law can be derived from the previous laws
under the assumption that the system works in only two modes: fully
sequential mode with probability $\alpha$, or fully parallel mode with
probability $1 - \alpha$. Amdahl's speedup expression $S_n$ equals the ratio
of the total number of processors to $1 + (n - 1)\alpha$:
$$S_n = \frac{n}{1 + (n - 1)\alpha}$$
This implies that, under the above assumption, the best speedup that can be
obtained is upper bounded by $1/\alpha$ regardless of how many processors the
system actually has, because $S_n \to 1/\alpha$ as $n \to \infty$. In
Figure 4, $S_n$ is plotted as a function of $n$ for different values of
$\alpha$. Note that the ideal speedup is achieved when $\alpha = 0$, and the
speedup drops sharply as $\alpha$ increases.
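Amdahl's expression is easy to tabulate. In the Python sketch below, alpha
denotes the probability of the fully sequential mode; the particular alpha
values and processor counts are arbitrary choices for illustration and are
not taken from Figure 4.

```python
def amdahl_speedup(n: int, alpha: float) -> float:
    """Amdahl's speedup S_n = n / (1 + (n - 1) * alpha)."""
    if n < 1 or not 0.0 <= alpha <= 1.0:
        raise ValueError("need n >= 1 and 0 <= alpha <= 1")
    return n / (1.0 + (n - 1) * alpha)


# Tabulate speedup for a few illustrative values of alpha.
processor_counts = [1, 10, 100, 1000, 10000]
for alpha in (0.0, 0.01, 0.1, 0.5):
    row = ", ".join(f"{amdahl_speedup(n, alpha):9.2f}" for n in processor_counts)
    print(f"alpha = {alpha:4.2f}: {row}")
```

For alpha = 0 the speedup equals n, while for any alpha greater than 0 the
rows level off near 1/alpha as n grows, which is the bound stated above.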
Figure 4. Amdahl's speedup versus number of processors n for four values of
$\alpha$ (both axes run from 1 to 10000).
5. Workloads.
The term test workload denotes any workload that is used in a performance
study. Two types of workload exist. A real workload is one that is observed
in normal operations and cannot be repeated; it is therefore not suitable as
a test workload. A synthetic workload models the real workload and can be
applied repeatedly.
5.1 Types of test workloads
5.1.1 Addition instruction. Most early computers were designed around
arithmetic logic units that used the addition instruction to perform most of
the computations needed at that time. Using the addition instruction to
measure the performance of those computers was therefore good enough: the
computer with the faster addition instruction was considered better.
5.1.2 Instruction mix. As the number of instructions supported by computers
increased, the addition instruction was no longer sufficient to measure the
performance of computer systems. An instruction mix workload was then
introduced.
5.1.3 Kernels. Kernels are sets of instructions that constitute a
higher-level function performed by the processor. They are used mainly to
measure the performance of processors. The kernel performance measure does
not reflect total system performance because most kernel workloads do not
use I/O operations.
5.1.4 Synthetic programs. Simple programs written in a high-level language
that make a specified number of I/O or service call requests. This workload
measures the CPU time and the time for the I/O requests. Synthetic programs
do not make representative memory and secondary memory references.
5.1.5 Application benchmarks. Application programs that represent the real
workload. They make use of all available resources, including processors,
networks, I/O devices, and databases. Sieve, Whetstone, Linpack, Dhrystone,
and SPEC are just some of the well-known benchmarks that have been used to
evaluate the performance of computer systems. In the next section, a new
benchmark called HINT is discussed in more detail.
5.2 HINT.
HINT is a new benchmark developed by John L. Gustafson and Quinn O. Snell.
“It is a practical approach that provides mathematically sound comparison of
computational performance even when the algorithms, computer, and precision
are changed”. It can be used to compare computing as slow as a human
calculator to computing as fast as the best supercomputers.
Most benchmarks are based on measuring the time various computers take to
complete a fixed-size task; others, like database benchmarks, fix the time
and vary the job size. HINT, on the other hand, is based on a different
concept. HINT stands for Hierarchical INTegration. It produces a speed
measure called QUIPS (QUality Improvement Per Second), and it fixes neither
the time nor the problem size.
It reveals memory bandwidth. It is scalable: it compares computing as slow
as hand calculation to computing as fast as the largest supercomputers. It
is portable: it ports to every sequential and parallel environment with very
little effort. It also has a low cost, permitting low-cost comparison of any
architecture.
A computer with twice the QUIPS rating is twice as powerful, so it must have
more:
arithmetic speed
precision
storage
bandwidth
The following HINT plots use a logarithmic scale for time. They illustrate
the power of HINT as a performance analysis and evaluation benchmark by
comparing different kinds of computer systems.
Figure 6. Comparison of Different Precisions
Figure 7. Comparison of Different Clock Speeds
Figure 9. Comparison of Different Main Memory Sizes
Figure 10. Comparison of a Scalable Parallel Computer
Figure 11. Comparison of Several Parallel Systems
Figure 12. Comparison of Various Workstations
References
1. Jain, Raj. (1991). The Art of Computer Systems Performance Analysis:
   Techniques for Experimental Design, Measurement, Simulation, and
   Modeling. John Wiley and Sons.
2. Hwang, Kai. (1993). Advanced Computer Architecture: Parallelism,
   Scalability, Programmability. McGraw-Hill.
3. McKerrow, Phillip. (1988). Performance Measurement of Computer Systems.
   Addison-Wesley.
4. Gustafson, John L. and Snell, Quinn O. HINT: A New Way To Measure
   Computer Performance. (http://www.scl.ameslab.gov/HINT)