ELEG 652: Principles of Parallel Computer Architectures
Handout #1: Problem Set #1
Issued: Wednesday, September 13, 2006
Due: Wednesday, October 4, 2006
Please begin your answer to every problem on a new sheet of paper. Be as concise and clear as you can. Make an effort to be legible. To avoid misplacement of the various components of your assignment, make sure that all the sheets are stapled together. You may discuss problems with your classmates, but all solutions must be written up independently.
1. The LINPACK benchmark (Culler's text, Chapter 1, Figure 1.10, page 22) is often used to report the performance of various computers, including the most powerful parallel computers. In this problem, you will learn how to obtain the LINPACK benchmark program and use it to benchmark two machines.
LINPACK is a collection of subroutines that analyze and solve linear equations and linear least squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition, the package computes the QR and singular value decompositions (SVD) of rectangular matrices and applies them to least squares problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference. LINPACK was designed for the supercomputers that were in use in the 1970s and early 1980s. LINPACK has been largely superseded by LAPACK, which has been designed to run efficiently on shared-memory and vector supercomputers.
Visit http://www.netlib.org/linpack/index.html for a detailed description of LINPACK and browse the repository at the beginning of this page for many subroutines. Please create a subdirectory, say called LINPACK, and store the downloaded LINPACK benchmark code there (as explained below). Download the benchmark from http://www.capsl.udel.edu/courses/eleg652/2006/homework/clinpack.c. Print out a code listing and read it, as a part of your homework.
Compile the program using single precision (the command line is cc -O3 -fallow-single-precision -DROLL -DSP clinpack.c -o clinpack). Run the code on the EECIS main teaching machine mlb.acad.ece.udel.edu. (You should all have accounts; otherwise go to http://www.eecis.udel.edu/.)
What is the LINPACK performance of your run? Report it carefully; e.g., you should state clearly the following information:

- The experimental test bed: the machine net address, the OS version, compiler version, the machine configuration, processor type, model, speed, cache size, memory capacity, etc. Check the man pages for the UNIX command to get the CPU time.
- The input: the size of the problem, which should be 100, 256 and 512. You should make at least one run per data size (three in total).
- The performance parameters: explain clearly what is being reported ("Fundamentals of Matrix Computations" by D.S. Watkins, and "Numerical Linear Algebra for High-Performance Computers" by Jack Dongarra et al., Chapter 4, will be of use here). Also, check http://www.netlib.org/performance/ and compare how it matches your performance.

Comment on your results and explain your observations.
2. Please perform the same steps as in Problem 1 for your personal PC. If you do not have one, try to use one from our labs or your friends. There are special versions of the benchmarks available for PCs. Look for the linpack-pc version on the web page (http://www.netlib.org/benchmark/linpack-pc.c). If you do not have access to a PC with a C compiler, use any other machine (workstations, servers, etc.) you have access to.
Points and Hints:

- Microsoft Visual Studio users: Go to Project Settings. In the popup window, choose the C/C++ tab and add -DSP -DUNROLL to the Project Options box.
- For all: You need to change the code a bit. Check the code for a variable n. This will tell you how big an array is going to be used. Change that variable to 100, 256 and 512. Please be advised, there are other arrays that depend on that value (arrays of 200 and 201), so change them accordingly.
3. In this problem, you will learn to apply Amdahl's law. Suppose that you are given a large scientific program P, and 8% of this code is not parallelizable.

(a) What is Amdahl's law?
(b) What is the maximum speedup you can achieve by parallelization?
(c) If you wanted to achieve a speedup of 15 (i.e., to make your program run 15 times faster), what percentage of the code should be parallelized?
(d) Assume someone has looked at P and your computation, and becomes quite upset with what you said about the speedup. Now, if the person is given a chance to use a massively parallel machine (with the option to upgrade to many more processors than the speedup limit you gave for P), what (hopefully positive) advice should you give in this case? Please justify.
4. The Top 500 list categorizes the fastest scientific machines in the world according to their performance on the LINPACK benchmark. Visit their website at http://www.top500.org/ and look at the top 100 performers (there are many repeats of a particular vendor product, since individual supercomputer sites rather than a product line are counted). Obtain the particular types of supercomputers that are in the list. Obtain such data back to 1993 and make a graph of the changes in computer types across the years. You can obtain this information on the website. Please use the website's facilities instead of trying to do it yourself, since that would take too long.
One important application of high-performance computing systems is solving large engineering problems efficiently. Some physical/engineering problems can be formulated as mathematical problems involving a large number of algebraic equations; a million equations with a million unknowns is not a rarity. The core computation in solving such systems of equations is matrix operations, either Matrix-Matrix Multiply or Matrix-Vector Multiply, depending on the problem and solution methods under study. Therefore, a challenge to computer architectures is how to handle these two matrix operations efficiently.

In this problem, you will study the implementation of the matrix-vector multiply (or MVM for short). There are lots of challenging issues associated with the implementation of MVM. First of all, matrices in many engineering problems are sparse matrices, where the number of non-zero entries in a matrix is much less than the total size of the matrix. Such sparse matrices pose an interesting problem in terms of efficient storage and computation. Different compressed formats for such matrices have been proposed and used in practice.

There are several ways of compressing sparse matrices to reduce the storage space. However, identifying the non-zero elements with the right indices is not a trivial task. The following questions deal with three types of sparse matrix storage.
Compressed Row Storage (CRS) is a method that deals with the rows of a sparse matrix. It consists of 3 vectors. The first vector is the "element value vector": a vector which contains all the non-zero elements of the array (collected by rows). The second vector has the column indices of the non-zero values; this vector is called the "column index vector". The third vector consists of the row pointers to each of the elements in the first vector. That is, each element of this "row pointer vector" is the position in the first vector (for example, the x-th element) at which a new row starts.
With the above format in mind, try to compress the following sparse matrix and represent it in CRS format:

        [ 0  1  0  0  0  3  0  0  0  0 ]
        [ 4  0  0  0  0  0  0  0  0  0 ]
        [ 0  0  0  0  1  0  7  0  6  0 ]
        [ 8  0  0  0  2  0  0  0  0  0 ]
    A = [ 3  0  0  0  0  0  0  0  0  0 ]
        [ 0  1  0  0  0  5  0  2  0  0 ]
        [ 0  0  1  0  0  0  0  0  6  0 ]
        [ 0  1  0  0  0  0  0  0  0  0 ]
        [ 8  0 12  0  0  5  0  0  0  0 ]
        [ 0 10  0  0  0  0  5  0  2  0 ]

An example of CRS is presented below.

[Example figure: a small matrix B shown together with its three CRS vectors -- Vector 1: the non-zero elements; Vector 2: the column indices; Vector 3: the row pointers. The last element of Vector 3 is defined as the number of non-zero elements plus one.]

This is the result if the initial index is zero. If the initial index is one, then every entry in Vectors 2 and 3 will be plus one, except for the last entry of Vector 3, which will be 9 in both cases. Both results (initial index 0 or 1) are acceptable, but please be consistent.
5. In this problem, you are going to experiment with an implementation of Matrix-Vector Multiply (MVM) using different storage formats. Matrix-Vector Multiply has the following pseudo-code:

    for I = 1 to N
        for J = 1 to N
            C[I] += A[I][J] * B[J];

where C and B are vectors of size N and A is a sparse matrix of size NxN (see footnote 1). In this specific code, the order is Sparse Matrix times Vector.

Create a C code that:

a. Creates a random dense vector of size N (B).
b. Reads a sparse matrix of size NxN (A) and transforms it to CRS format.
c. Calculates the MVM (A*B) of the vector and the sparse matrix in CRS format. Make sure that your code has some type of timing function (i.e., getrusage, clock, gettimeofday, _ftime, etc.) and time the MVM operation using the CRS matrix.
d. Repeats the operation with a Compressed Column Storage (see footnote 2) format matrix and compares performance numbers. Report your findings as to which method has better performance, and run it for at least five matrix/vector sizes. You can use the same methods that were used in Homework 1 to calculate performance. Please report your machine configuration as you did in Homework 1.
e. Have a look at the following storage representation of the matrix C below, known as Jagged Diagonal Storage (JDS):

        [ 1  0  0  2  1 ]
        [ 0  2  0  0  0 ]
    C = [ 3  0  3  0  0 ]
        [ 0  0  0  4  0 ]
        [ 0  1  0  0  5 ]

    C:    1 3 1 2 4 2 3 5 1
    jC:   1 1 2 2 4 4 3 5 5
    OffC: 1 6 9 10
    Perm: 1 3 5 2 4

    i. Explain how it is derived (Hint: think of the sequence E.S.S., which stands for Extract, Shift and Sort).
    ii. Why is this format claimed to be useful for the implementation of iterative methods on parallel and vector processors?
    iii. Can you imagine a situation in which this format will perform worse?
    iv. Modify your program to create (read) matrices in JDS format and repeat your experiments as in (2). Report your results for this storage method as you did for CRS and CCS.

Footnote 1: If you want to get more information about the MVM, please take a look at the Fortran implementation of MVM at http://www.netlib.org/slatec/lin/dsmv.f
Footnote 2: Compressed Column Storage is the same as CRS, but it uses column pointers instead of row pointers; the non-zero elements are gathered by columns and not rows; and Vector 2 contains the row indices instead of the column indices.
6. You have learned from class that the performance of a vector architecture can often be described by the timing of a single arithmetic operation on a vector of length n. This will fit closely the following generic formula for the time of the operation, t, as a function of the vector length n:

    t = r_∞^(-1) * (n + n_1/2)

The two parameters r_∞ and n_1/2 describe the performance of an idealized computer under the architecture model/technology and give a first-order description of a real computer. They are defined as:

- The maximum asymptotic performance r_∞: the maximum rate of computation in floating-point operations performed per second (in MFLOPS). For the generic computer, this occurs asymptotically for vectors of infinite length.
- The half-performance length n_1/2: the vector length required to achieve half the maximum possible performance.

The benchmark used to measure (r_∞, n_1/2) is shown below.
Please replace the call Timing_function with a C timing function (clock, getrusage, etc.). Please identify and explain your choices.

Assume vector machine X is measured (by the benchmark above) such that its r_∞ = 300 MFLOPS and n_1/2 = 10, and vector machine Y is measured (by the benchmark above) such that its r_∞ = 300 MFLOPS and n_1/2 = 100. What do these numbers mean? How can you judge the performance difference between X and Y? Explain.

Describe how you would use this benchmark (or a variation of it) to derive (r_∞, n_1/2).

Rewrite the code in C, port it onto a Sun Workstation and run it. Tell us which machine you used. Choose a sensible NMAX (Suggestion: NMAX = 256). Report your results and make plots of:

- Runtime vs. vector length
- Performance (MFLOPS) vs. vector length
- What are the values of the two parameters you obtained?
    T1 = call Timing_function;
    T2 = call Timing_function;
    T0 = T2 - T1;
    FOR N = 1, NMAX
    {
        T1 = call Timing_function;
        FOR I = 1, N
        {
            A[I] = B[I] * C[I];
        }
        T2 = call Timing_function;
        T = T2 - T1 - T0;
    }
State on which machine model and type you performed your experiments.
Problem 7. Match each of the following computer systems:

KSR-1, Dash, CM-5, Monsoon, Tera MTA, IBM SP-2, Beowulf, SUN UltraSparc-450

with one of the best descriptions listed below. The mapping is a one-to-one correspondence in this case.

1. A massively parallel system built with multiple-context processors and a 3-D torus architecture.
2. Linux-based PCs with fast Ethernet.
3. A ring-connected multiprocessor using cache-only memory architecture.
4. An experimental multiprocessor built with a dataflow architecture.
5. A research scalable multiprocessor built with distributed shared memory and coherent caches.
6. An MIMD distributed-memory computer built with a large multistage switching network.
7. A small-scale shared-memory multiprocessor with uniform address space.
8. A cache-coherent non-uniform memory access multiprocessor built with a fat hypercube network.

You are encouraged to search conference proceedings, related journals, and the Internet extensively for answers. For each computer system, write one paragraph about the system characteristics, such as the maximum number of nodes.

Hint: You need to learn how to quickly find references in the Proceedings of the International Symposium on Computer Architecture (ISCA), IEEE Transactions on Computers, IEEE Micro, the Journal of Supercomputing, and IEEE Parallel & Distributed Processing.
Problem 8. Please try to fit the machine "Cray Red Storm" into one of the classifications above. If none of them fits, please provide a short architecture description.