
ELEG 652
Principles of Parallel Computer Architectures

Handout #1
Problem Set #1

Issued: Wednesday, September 13, 2006
Due: Wednesday, October 4, 2006

Please begin your answer to every problem on a new sheet of paper. Be as concise and clear as you can. Make an effort to be legible. To avoid misplacement of the various components of your assignment, make sure that all the sheets are stapled together. You may discuss problems with your classmates, but all solutions must be written up independently.

1.- The LINPACK benchmark (Culler's text, Chapter 1, Figure 1.10, page 22) is often used to report the performance of various computers, including the most powerful parallel computers. In this problem, you will learn how to obtain the LINPACK benchmark program and use it to benchmark two machines.

LINPACK is a collection of subroutines that analyze and solve linear equations and linear least squares problems. The package solves linear systems whose matrices are general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square. In addition, it computes the QR and singular value decompositions (SVD) of rectangular matrices and applies them to least squares problems. LINPACK uses column-oriented algorithms to increase efficiency by preserving locality of reference. LINPACK was designed for the supercomputers that were in use in the 1970s and early 1980s. LINPACK has been largely superseded by LAPACK, which has been designed to run efficiently on shared-memory and vector supercomputers.

Visit http://www.netlib.org/linpack/index.html for a detailed description of LINPACK and browse the benchmark code (as explained below):

http://www.capsl.udel.edu/courses/eleg652/2006/homework/clinpack.c

Print out a code listing and compile the program using single precision (the command line is cc -O3 -fallow-single-precision -DROLL -DSP clinpack.c -o clinpack). Run the code on the EECIS main teaching machine mlb.acad.ece.udel.edu. (You should all have accounts; otherwise go to http://www.eecis.udel.edu/.)

What is the LINPACK performance of your run? Report it carefully; that is, you should state clearly the following information:

The experimental test bed: the machine net address, the OS version, compiler version, the machine configuration, processor type, model, speed, cache size, memory capacity, etc. Check the man pages for the UNIX command to get the CPU time.

The input: the size of the problem, which should be 100, 256 and 512. You should run at least one test per data size (three in total).

The performance parameters: explain clearly what is being reported ("Fundamentals of Matrix Computations" by D.S. Watkins, and "Numerical Linear Algebra for High-Performance Computers" by Jack Dongarra et al., Chapter 4, will be of use here). Also, check http://www.netlib.org/performance/ and compare how it matches your performance.

2.- Repeat the same steps as in Problem 1 for your personal PC. If you do not have one, try to use one from our labs or from your friends. There are special versions of the benchmark available for PCs. Look for the linpack-pc version on the web page (http://www.netlib.org/benchmark/linpack-pc.c). If you do not have access to a PC with a C compiler, use any other machine (workstations, servers, etc.) you have.

Points and Hints:

- Microsoft Visual Studio Users: Go to Project Settings. In the popup window, choose the C/C++ tab and add -DSP -DUNROLL to the Project Options box.

- For all: You need to change the code a bit.

- Check the code for a variable n. This will tell you how big an array is going to be used. Change that variable to 100, 256 and 512. Please be advised that there are other arrays depending on that value (arrays of 200 and 201), so change them accordingly.

3.- In this problem, you will learn to apply Amdahl's law. Suppose that you are given a large scientific program P, and 8% of this code is not parallelizable.

(a) What is Amdahl's Law?

(b) What is the maximum speedup you can achieve by parallelization?

(c) If you wanted to achieve a speedup of 15 (i.e. to make your program run 15 times faster), what percentage of the code should be parallelized?

(d) Assume someone has looked at P and your computation, and becomes quite upset with what you said about the speedup. Now if the person is given a chance to use a massively parallel machine (with the option to upgrade to many more processors than the speedup limit you gave for P), what (hopefully positive) answer can you give?

4.- The Top 500 list categorizes the fastest scientific machines in the world according to their performance on the LINPACK benchmark. Visit their website at http://www.top500.org/ and look at the top 100 performers (there are many repeats of a particular vendor product, since individual supercomputer sites rather than product lines are counted). Obtain the particular types of supercomputers that are in the list. Obtain such data back to 1993 and make a graph of the changes in computer types across the years. You can obtain this information on the website. Please use the website facilities instead of trying to do it yourself, since that would take too long.

One important application of high performance computing systems is solving large engineering problems efficiently. Some physical/engineering problems can be formulated as mathematical problems involving a large number of algebraic equations; a million equations with a million unknowns is not a rarity. The core computation in solving such systems of equations is matrix operations, either Matrix-Matrix Multiply or Matrix-Vector Multiply, depending on the problem and solution methods under study. Therefore, a challenge to computer architectures is how to handle these two matrix operations efficiently.

In this problem, you will study the implementation of the matrix vector multiply (or MVM for short). There are lots of challenging issues associated with the implementation of MVM. First of all, matrices in many engineering problems are sparse matrices, where the number of non-zero entries in a matrix is much less than the total size of the matrix. Such sparse matrices pose an interesting problem in terms of efficient storage and computation. The use of different compressed formats for such matrices has been proposed and used in practice.

There are several ways of compressing sparse matrices to reduce the storage space. However, identifying the non-zero elements with the right indices is not a trivial task. The following questions deal with three types of sparse matrix storage.

Compressed Row Storage (CRS) is a method that deals with the rows of a sparse matrix. It consists of 3 vectors. The first vector is the "element value vector": a vector which contains all the non-zero elements of the array (collected by rows). The second vector has the column indices of the non-zero values; this vector is called the "column index vector". The third vector consists of the row pointers to each of the elements in the first vector. That is, each element of this "row pointer vector" is the element number (for example, the x-th element starts row k) with which a new row starts.

With the above format in mind, try to compress the following sparse matrix and represent it in CRS format:

A =
0  1  0  0  0  3  0  0
0  0  4  0  0  0  0  0
0  0  0  0  0  0  0  0
1  0  7  0  6  0  8  0
0  0  2  0  0  0  0  0
3  0  0  0  0  0  0  0
0  0  0  1  0  0  0  5
0  2  0  0  0  0  1  0

An example of CRS is presented below, for the matrix B:

B =
 0  0  0  0  6  0
 0  1  0  0  0  0
 0  0  0  0  8  0
12  0  0  5  0  0
 0  0  0 10  0  0
 0  0  5  0  2  0

Vector 1 (non-zero elements): 6  1  8  12  5  10  5  2
Vector 2 (column indices):    4  1  4   0  3   3  2  4
Vector 3 (row pointers):      0  1  2   3  5   6  9

(The last element of the row pointer vector is defined as the number of non-zero elements plus one.)

This is the result if the initial index is zero. If the initial index is one, then every entry in Vectors 2 and 3 will be plus one, except for the last entry of Vector 3, which will be 9 in both cases. Both results (initial index 0 or 1) are accepted.

5.- In this problem, you are going to experiment with an implementation of Matrix-Vector Multiply (MVM) using different storage formats. Matrix-Vector Multiply has the pseudo-code shown below¹, where C and B are vectors of size N and A is a sparse matrix of size NxN. In this specific code, the order is Sparse Matrix times Vector.

Create a C code that:

a. Creates a random dense vector of size N (B).

b. Reads a sparse matrix of size NxN (A) and transforms it into CRS format.

c. Calculates the MVM (A*B) of the vector and the sparse matrix in CRS format. Make sure that your code has some type of timing function (i.e. getrusage, clock, gettimeofday, _ftime, etc.) and time the MVM operation using the CRS matrix.

d. Repeat the operation with a Compressed Column Storage² format matrix, compare the performance numbers, say which method has better performance, and run it for at least five matrix/vector sizes. You can use the same methods to calculate performance that you used in Homework 1.

1. Have a look at the following storage representation, known as Jagged Diagonal Storage (JDS):

C =
1  0  0  2  1
0  2  0  0  0
3  0  3  0  0
0  0  0  4  0
0  1  0  0  5

C:    1  3  1  2  4  2  3  5  1
jC:   1  1  2  2  4  4  3  5  5
OffC: 1  6  9  10
Perm: 1  3  5  2  4

i. Explain how it is derived (Hint: think of the sequence E.S.S., which stands for Extract, Shift and Sort).

ii. Why is this format claimed to be useful for the implementation of iterative methods on parallel and vector processors?

iii. Can you imagine a situation in which this format will perform worse?

iv. Modify your program to create (read) matrices in a JDS format and repeat your experiments as in (2). Report your results for this storage method as you did for CRS and CCS.

The pseudo-code for the MVM referred to above is:

for I = 1 to N
    for J = 1 to N
        C[I] += A[I][J] * B[J];

¹ http://www.netlib.org/slatec/lin/dsmv.f

² Compressed Column Storage is the same as CRS, but it uses column pointers instead of row pointers; the non-zero elements are gathered by columns and not rows; and Vector 2 contains the row indices instead of the column indices.

6.- You have learned from class that the performance of a vector architecture can often be described by the timing of a single arithmetic operation on a vector of length n. This fits closely the following generic formula for the time of the operation, t, as a function of the vector length n:

t = (n + n1/2) / r∞

The two parameters r∞ and n1/2 describe the performance of an idealized computer under the architecture model/technology and give a first-order description of a real computer. They are defined as:

o The maximum asymptotic performance r∞: the maximum rate of computation in floating point operations performed per second (in MFLOPS). For the generic computer, this occurs asymptotically for vectors of infinite length.

o The half-performance length n1/2: the vector length required to achieve half the maximum possible performance.

The benchmark used to measure (r∞, n1/2) is shown below:

Please replace the call Timing_function with a C timing function (clock, getrusage, etc.).

Assume vector machine X is measured (by the benchmark above) such that its r∞ = 300 MFLOPS and n1/2 = 10, and vector machine Y is measured (by the benchmark above) such that its r∞ = 300 MFLOPS and n1/2 = 100. What do these numbers mean? How can you judge the performance difference between X and Y? Explain.

Describe how you would use this benchmark (or a variation of it) to derive (r∞, n1/2).

Rewrite the code in C, port it onto a Sun Workstation and run it. Tell us which machine you used. Choose a sensible NMAX (Suggestion: NMAX = 256).

Report your results and make plots of:

o Runtime vs. Vector Length

o Performance (MFLOPS) vs. Vector Length

o What are the values of the two parameters you obtained?

T1 = call Timing_function;
T2 = call Timing_function;
T0 = T2 - T1;

FOR N = 1, NMAX
{
    T1 = call Timing_function;
    FOR I = 1, N
    {
        A[I] = B[I] * C[I];
    }
    T2 = call Timing_function;
    T = T2 - T1 - T0;
}

State on which machine model and type you performed your experiments.

Problem 7.

Match each of the following computer systems:

KSR-1, Dash, CM-5, Monsoon, Tera MTA, IBM SP-2, Beowulf, SUN UltraSparc-450

with one of the best descriptions listed below. The mapping is a one-to-one correspondence in this case.

1. A massively parallel system built with multiple-context processors and a 3-D torus architecture.

2. Linux-based PCs with fast Ethernet.

3. A ring-connected multiprocessor using a cache-only memory architecture.

4. An experimental multiprocessor built with a dataflow architecture.

5. A research scalable multiprocessor built with distributed shared memory and coherent caches.

6. An MIMD distributed-memory computer built with a large multistage switching network.

7. A small-scale shared memory multiprocessor with a single address space.

8. A cache-coherent non-uniform memory access multiprocessor built with a Fat Hypercube network.

You are encouraged to search conference proceedings, related journals, and the Internet extensively for answers. For each computer system, write one paragraph about the system characteristics, such as the maximum number of nodes.

Hint: You need to learn how to quickly find references in the Proceedings of the International Symposium on Computer Architecture (ISCA), IEEE Transactions on Computers, IEEE Micro, the Journal of Supercomputing, and IEEE Parallel & Distributed Processing.

Problem 8.

Please try to fit the machine "Cray Red Storm" into one of the classifications above. If none of them fits, please provide a short architecture description.