# Computational Methods in Applied Sciences I


University of Wyoming MA 5310

Spring, 2013

Professor Craig C. Douglas

http://www.mgnet.org/~douglas/Classes/na-sc/notes/2013sw.pdf


Course Description: First semester of a three-semester computational methods series. Review of basics (round-off errors and matrix algebra review), finite differences and Taylor expansions, solution of linear systems of equations (Gaussian elimination variations (tridiagonal, general, and sparse matrices), iterative methods (relaxation and conjugate gradient methods), and overdetermined systems (least squares)), nonlinear equations (root finding of functions), interpolation and approximation (polynomial, Lagrange, Hermite, piecewise polynomial, Chebyshev, tensor product methods, and least squares rules), and one other topic (simple optimization methods, Monte Carlo, etc.). (3 hours)

Prerequisites: Math 3310 and COSC 1010. Identical to COSC 5310, CHE 5140, ME 5140, and CE 5140.

Suggestion: Get a single-variable calculus textbook and reread it.

Textbook: George Em Karniadakis and Robert M. Kirby II, Parallel Scientific Computing in C++ and MPI: A Seamless Approach to Parallel Algorithms and Their Implementation, Cambridge University Press, 2003.

Preface: Outline of Course

Errors

In pure mathematics, a + b is well defined and exact. In computing, a and b might not even be representable in floating point (e.g., 1/3 is not representable in IEEE floating point and is only stored approximately), which is a finite subset of the reals. In addition, a + b is subject to roundoff errors, a concept unknown in the reals. We will study computational and numerical errors in this unit.

See Chapter 2.

C++ and parallel communications

If you do not know simple C++, you will learn enough to get by in this class. While you will be using MPI (Message Passing Interface), you will be taught how to use another set of routines that hides MPI from you. The advantages of hiding MPI will be given during the lectures. MPI has an enormous number of routines and great functionality. In fact, its vastness is also a disadvantage for newcomers to parallel computing on cluster machines.

See Appendix A and a web link.

Solution of linear systems of equations Ax=b

We will first review matrix and vector algebra. Then we will study a variety of direct methods (ones with a predictable number of steps) based on Gaussian elimination. Then we will study a collection of iterative methods (ones with a possibly unpredictable number of steps) based on splitting methods. Then we will study Krylov space methods, which are a hybrid of the direct and iterative paradigms. Finally we will study methods for sparse matrices (ones that are almost all zero).

See Chapters 2, 7, and 9.

Solution of nonlinear equations

We will develop methods for finding specific values of one variable functions using root finding and fixed point methods.

See Chapters 3 and 4.

Interpolation and approximation

Given $\{f(x_0), f(x_1), \dots, f(x_{N+1})\}$, what is $f(x)$ for $x_0 \le x \le x_{N+1}$, where $x_i < x_{i+1}$?

See Chapter 3.

Numerical integration and differentiation

Suppose I give you an impossible to integrate (formally) function f(x) and a domain of integration. How do you approximate the integral? Numerical integration, using quadrature rules, turns out to be relatively simple.

Alternately, given a reasonable function g(x), how do I take its derivative using just a computer? This turns out to be relatively difficult in comparison to integration. Surprisingly, from a numerical viewpoint, it is the exact opposite of what freshman calculus students determine in terms of hardness.

Finally, we will look at simple finite difference methods for approximating ordinary differential equations.

See Chapters 4, 5, and 6.

Specialized topic(s)

If there is sufficient time at the end of the course, one or more other topics will be covered, possibly by the graduate students in the class.

1. Errors

1. Initial errors
   a. Inaccurate representation of constants ($\pi$, e, etc.)
   b. Inaccurate measurement of data
   c. Overly simplistic model
2. Truncation
   a. From approximate mathematical techniques, e.g.,
      $e^x = 1 + x + x^2/2 + \dots + x^n/n! + \dots$
      $e^x \approx 1 + x + \dots + x^k/k! + E$
3. Rounding
   a. From finite number of digits stored in some base
   b. Chopping and symmetric rounding

Error types 1-2 are problem dependent whereas error type 3 is machine dependent.

Floating Point Arithmetic

We can represent a real number x by

$$x = \pm(0.a_1 a_2 \dots a_m) \cdot b^c,$$

where $0 \le a_i < b$, and m, b, and $|c| \le M$ are machine dependent, with common bases b of 2, 10, and 16.

IEEE 754 (circa 1985) floating point standard (all of ~6 pages):

| Feature | Single precision | Double precision |
| --- | --- | --- |
| Bits total | 32 | 64 |
| Sign bits | 1 | 1 |
| Mantissa bits | 23 | 52 |
| Exponent bits | 8 | 11 |
| Exponent range (base 10) | [-44.85, 38.53] | [-323.3, 308.3] |
| Decimal digits | 7 | 16 |

Conversion between bases is simple for integers, but is really tricky for real numbers. For example, given r in base 10, its equivalent $(r)_{10} = (\alpha)_{16}$ in base 16 is derived by computing

$$r = \alpha_0 16^0 + \alpha_1 16^1 + \alpha_2 16^2 + \dots + \alpha_{-1} 16^{-1} + \alpha_{-2} 16^{-2} + \dots$$

Integers are relatively easy to convert. Real numbers are quite tricky, however.

Consider $r_1 = 1/10$:

$$16 r_1 = 1.6 = \alpha_1 + \alpha_2/16 + \alpha_3/16^2 + \dots$$
$$16 r_2 = 9.6 = \alpha_2 + \alpha_3/16 + \alpha_4/16^2 + \dots \qquad (r_2 = 16 r_1 - \alpha_1 = 0.6)$$

Hence, $(.1)_{10} = (.199999\dots)_{16}$: a number with m digits in one base may not have a terminating representation in another base. It is not just irrationals that are a problem (e.g., consider $(0.3)_{10}$, which has no terminating representation in base 2).
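The non-terminating expansion is easy to see on any IEEE machine; a minimal sketch (the function name is mine):

```cpp
#include <cassert>

// 0.1 has no terminating binary expansion, so each stored 0.1 is slightly
// off and repeated addition accumulates the representation error.
inline double sum_of_tenths(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += 0.1;  // each term is already inexact
    return s;
}
```

In IEEE double arithmetic, sum_of_tenths(10) is not exactly 1.0, and 0.1 + 0.2 is not exactly 0.3.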

Consider r = .115. If b = 10 and m = 2, then

r = .11 (chopping)

r = .12 (symmetric rounding: form $r + .5\,b^{c-m}$ and then chop)

Most computers chop instead of round off. IEEE compliant CPUs can do both, and there may be a system call to switch, which is usually not user accessible.

Note: When the rounding changes, almost all nontrivial codes break.

Warning: On all common computers, none of the standard arithmetic operators are associative. When dealing with multiple chained operations, none you would expect are commutative, either, thanks to round off properties. (What a deal!)

Let's take a look, one operator at a time.
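A one-line check of the associativity warning (a sketch; the test values are mine, chosen so a large term swallows a small one):

```cpp
#include <cassert>

// Returns true when (a + b) + c and a + (b + c) round to the same value.
inline bool fp_add_is_associative(double a, double b, double c) {
    return (a + b) + c == a + (b + c);
}
```

With a = 1e16, b = -1e16, c = 1, the left grouping gives 1 while the right grouping gives 0.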

Let $e(x) = \bar{x} - x$ in the arithmetic operations that follow in the remainder of this section.

Addition:

$$\bar{x} + \bar{y} = (x + e(x)) + (y + e(y)) = (x + y) + (e(x) + e(y)).$$

Note that $\overline{x+y} \ne \bar{x} + \bar{y}$ in general; it is fun to construct an example.

$\bar{x} + \bar{y}$ can overflow (rounds off to $\pm\infty$) or underflow (rounds off to zero) even though the number in infinite precision is neither. Overflow is a major error, but underflow usually is not a big deal.

Warning: The people who defined IEEE arithmetic assumed that 0 is a signed number, thus violating a basic mathematical definition of the number system. Hence, on IEEE compliant CPUs, there is both +0 and -0 (but no signless 0), which are different numbers in floating point. This seriously disrupts comparisons with 0. The programming fix is to compare abs(expression) with 0, which is computationally ridiculous and inefficient.

Decimal shifting can lead to errors.

Example: Consider b = 10 and m = 4. Then given $x_1 = 0.5055 \cdot 10^4$ and $x_2 = x_3 = \dots = x_{11} = 0.4000 \cdot 10^0$, we have

$$x_1 + x_2 = 0.50554 \cdot 10^4 \rightarrow 0.5055 \cdot 10^4 = x_1.$$

Even worse,

$$(\dots((x_1 + x_2) + x_3) + \dots) + x_{11} = x_1, \quad\text{but}\quad (\dots((x_{11} + x_{10}) + x_9) + \dots) + x_1 = 0.5059 \cdot 10^4.$$

Rule of thumb: Sort the numbers by positive, negative, and zero values based on their absolute values. Add them up in ascending order inside each category. Then combine the numbers.
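The rule of thumb can be sketched in C++ (single precision makes the effect visible at modest magnitudes; function names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <vector>
#include <algorithm>

// Sum left to right in the order given.
inline float sum_in_given_order(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Sort by magnitude first so small terms accumulate before meeting large ones.
inline float sum_ascending_magnitude(std::vector<float> v) {
    std::sort(v.begin(), v.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    return sum_in_given_order(v);
}
```

Summing 1e8f followed by ten 1.0f terms leaves the ones rounded away every time; summing in ascending magnitude lets them accumulate first and contribute.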

Subtraction:

$$\bar{x} - \bar{y} = (x + e(x)) - (y + e(y)) = (x - y) + (e(x) - e(y)).$$

If x and y are close there is a loss of significant digits.

Multiplication:

$$\bar{x} \cdot \bar{y} \approx xy + x\,e(y) + y\,e(x).$$

Note that the e(x)e(y) term is not present above. Why?

Division:

$$\frac{\bar{x}}{\bar{y}} = \frac{x + e(x)}{y + e(y)} = \frac{x + e(x)}{y} \cdot \frac{1}{1 + e(y)/y} \approx \frac{x}{y} + \frac{e(x)}{y} - \frac{x\,e(y)}{y^2},$$

where we used

$$\frac{1}{1 - r} = 1 + r + r^2 + r^3 + \dots$$

Note that when y is sufficiently close to 0, it is utterly and completely disastrous in terms of rounding error.

2. An introduction to C++ and parallel computing basics

See the C++ Primer at http://www.mgnet.org/~douglas/Classes/na-sc/notes/C++Primer.pdf.

A parallel computing communications interface (parlib) is available from http://www.mgnet.org/~douglas/Classes/hpc-xtc/notes/parlib.tgz or http://www.mgnet.org/~douglas/Classes/hpc-xtc/notes/parlib.zip with documentation available from http://www.mgnet.org/~douglas/Classes/hpc-xtc/notes/parlib.pdf.

Assume there are p processors numbered from 0 to p-1 and labeled $P_i$. The communication between the processors uses one or more high speed, high bandwidth switches.

In the old days, various topologies were used, none of which scaled to more than a modest number of processors. The Internet model saved parallel computing.

Today parallel computers come in several flavors (hybrids, too):

- Small shared memory (SMPs)
- Small clusters of PCs
- Blade servers (in one or more racks)
- Forests of racks
- GRID or Cloud computing

Google operates the world's largest Cloud/GRID system. The Top 500 list provides an ongoing list of the fastest computers willing to be measured. It is not a comprehensive list, and Google, Yahoo!, and many governments and companies do not participate.

Data needs to be distributed sensibly among the p processors. Where the data needs to be can change, depending on the operation, and communication is usual.

Algorithms that essentially never need to communicate are known as embarrassingly parallel. These algorithms scale wonderfully and are frequently used as examples of how well so and so's parallel system scales. Most applications are not in this category, unfortunately.

To do parallel programming, you need only a few functions to get by:

- Initialize the environment and find out the processor number i.
- Finalize or end parallel processing on one or all processors.
- Send data to one, a set, or all processors.
- Receive data from one, a set, or all processors.
- Cooperative operations on all processors (e.g., sum of a distributed vector).

Everything else is a bonus. Almost all of MPI is designed for compiler writers and operating systems developers. Only a small subset is expected to be used by regular people.

3. Solution of Linear Systems of Equations

3a. Matrix Algebra Review

Let $R = [r_{ij}]$ be $m \times n$ and $S = [s_{ij}]$ be $n \times p$. Then $T = RS = [t_{ij}]$ is $m \times p$ with

$$t_{ij} = \sum_{k=1}^{n} r_{ik} s_{kj}.$$

SR exists if and only if m = p, and normally $SR \ne RS$.

$Q = [q_{ij}] = R + S = [r_{ij} + s_{ij}]$ exists if and only if R and S have the same dimensions.

Transpose: for $R = [r_{ij}]$, $R^T = [r_{ji}]$.

Inner product: for x, y n-vectors, $(x, y) = x^T y$ and $(Ax, y) = (Ax)^T y$.

Matrix-Matrix Multiplication (an aside)

    for i = 1,M do
      for j = 1,M do
        for k = 1,M do
          A(i,j) = A(i,j) + B(i,k)*C(k,j)

or the blocked form

    for i = 1,M, step by s, do
      for j = 1,M, step by s, do
        for k = 1,M, step by s, do
          for l = i, i+s-1 do
            for m = j, j+s-1 do
              for n = k, k+s-1 do
                A(l,m) = A(l,m) + B(l,n)*C(n,m)

If you pick the block size right, the blocked algorithm runs 2X+ faster than the standard algorithm.

Why does the blocked form work so much better? If you pick s correctly, the blocks fit in cache and only have to be moved into cache once with double usage. Arithmetic is no longer the limiting factor in run times for numerical algorithms. Memory cache misses control the run times and are notoriously hard to model or repeat in successive runs.

An even better way of multiplying matrices is a Strassen style algorithm (the Winograd variant is the fastest in practical usage). A good implementation is the GEMMW code (see http://www.mgnet.org/~douglas/ccd-free-software.html).
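The two loop nests above translate directly into C++; this sketch (row-major storage and the function names are mine) lets you verify that the blocked form computes the same product:

```cpp
#include <cassert>
#include <vector>
#include <algorithm>
#include <cstddef>

using Mat = std::vector<double>;  // row-major M x M matrix

inline Mat matmul_naive(const Mat& B, const Mat& C, std::size_t M) {
    Mat A(M * M, 0.0);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < M; ++j)
            for (std::size_t k = 0; k < M; ++k)
                A[i * M + j] += B[i * M + k] * C[k * M + j];
    return A;
}

inline Mat matmul_blocked(const Mat& B, const Mat& C, std::size_t M, std::size_t s) {
    Mat A(M * M, 0.0);
    for (std::size_t i = 0; i < M; i += s)
        for (std::size_t j = 0; j < M; j += s)
            for (std::size_t k = 0; k < M; k += s)
                // multiply one s x s block; std::min guards a ragged edge
                for (std::size_t l = i; l < std::min(i + s, M); ++l)
                    for (std::size_t m = j; m < std::min(j + s, M); ++m)
                        for (std::size_t n = k; n < std::min(k + s, M); ++n)
                            A[l * M + m] += B[l * M + n] * C[n * M + m];
    return A;
}
```

The speedup appears only when s is tuned so three s-by-s blocks fit in cache; the arithmetic is identical either way.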

Continuing basic definitions

If $x = (x_i)$ is an n-vector (i.e., an $n \times 1$ matrix), then

$$\mathrm{diag}(x) = \begin{pmatrix} x_1 & & & \\ & x_2 & & \\ & & \ddots & \\ & & & x_n \end{pmatrix}.$$

Let $e_i$ be an n-vector with all zeroes except the i-th component, which is 1. Then

$$I = [e_1, e_2, \dots, e_n]$$

is the $n \times n$ identity matrix. Further, if A is $n \times n$, then IA = AI = A.

The $n \times n$ matrix A is said to be nonsingular if $\exists!\,x$ such that $Ax = b$, $\forall b$.

Tests for nonsingularity:

Let $0_n$ be the zero vector of length n. A is nonsingular if and only if $0_n$ is the only solution of $Ax = 0_n$.

A is nonsingular if and only if $\det(A) \ne 0$.

Lemma: $\exists!\,A^{-1}$ such that $A^{-1}A = AA^{-1} = I$ if and only if A is nonsingular.

Proof: Suppose $\exists C$ such that $C \ne A^{-1}$, but $CA = AC = I$. Then

$$C = IC = (A^{-1}A)C = A^{-1}(AC) = A^{-1}I = A^{-1}.$$

Diagonal matrices: $D = \begin{pmatrix} a & & & 0 \\ & b & & \\ & & c & \\ 0 & & & d \end{pmatrix}$.

Triangular matrices: upper $U = \begin{pmatrix} x & x & x & x \\ & x & x & x \\ & & x & x \\ & & & x \end{pmatrix}$, strictly upper $U_0 = \begin{pmatrix} 0 & x & x & x \\ & 0 & x & x \\ & & 0 & x \\ & & & 0 \end{pmatrix}$,

lower $L = \begin{pmatrix} x & & & \\ x & x & & \\ x & x & x & \\ x & x & x & x \end{pmatrix}$, strictly lower $L_0 = \begin{pmatrix} 0 & & & \\ x & 0 & & \\ x & x & 0 & \\ x & x & x & 0 \end{pmatrix}$.

3b. Gaussian elimination

Solve Ux = b, U upper triangular, real, and nonsingular:

$$x_n = \frac{b_n}{a_{nn}} \quad\text{and}\quad x_{n-1} = \frac{1}{a_{n-1,n-1}}\left(b_{n-1} - a_{n-1,n}\,x_n\right).$$

If we define $\sum_{j=n+1}^{n} a_{ij} x_j \equiv 0$, then the formal algorithm is

$$x_i = (a_{ii})^{-1}\left(b_i - \sum_{j=i+1}^{n} a_{ij} x_j\right), \quad i = n, n-1, \dots, 1.$$

Solve Lx = b, L lower triangular, real, and nonsingular, similarly.

Operation count: O($n^2$) multiplies
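The formal algorithm above, written out in C++ (0-based indexing; a sketch with my own names):

```cpp
#include <cassert>
#include <vector>
#include <cstddef>

// Solve Ux = b for upper triangular, nonsingular U (row-major n x n).
inline std::vector<double> back_substitute(const std::vector<double>& U,
                                           const std::vector<double>& b,
                                           std::size_t n) {
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0; ) {        // i = n-1 down to 0
        double s = b[i];
        for (std::size_t j = i + 1; j < n; ++j)
            s -= U[i * n + j] * x[j];           // subtract the known terms
        x[i] = s / U[i * n + i];
    }
    return x;
}
```

The lower triangular solve is the mirror image, sweeping i upward from 0.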

Tridiagonal Systems

Only three diagonals nonzero around the main diagonal:

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 &= b_1 \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 &= b_2 \\ &\ \,\vdots \\ a_{n,n-1}x_{n-1} + a_{nn}x_n &= b_n \end{aligned}$$

Eliminate $x_i$ from the (i+1)-st equation sequentially to get

$$x_1 + p_1 x_2 = q_1, \quad x_2 + p_2 x_3 = q_2, \quad \dots, \quad x_n = q_n,$$

where

$$p_1 = \frac{a_{12}}{a_{11}}, \quad q_1 = \frac{b_1}{a_{11}}, \quad p_i = \frac{a_{i,i+1}}{a_{ii} - p_{i-1} a_{i,i-1}}, \quad q_i = \frac{b_i - a_{i,i-1}\,q_{i-1}}{a_{ii} - p_{i-1} a_{i,i-1}}.$$

Operation count: 5n - 4 multiplies
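The $p_i$, $q_i$ recurrences give the classic tridiagonal (Thomas) algorithm; a sketch in C++ (the array layout and names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <vector>
#include <cstddef>

// a: subdiagonal (a[0] unused), d: diagonal, c: superdiagonal (c[n-1] unused),
// b: right-hand side. Returns the solution x.
inline std::vector<double> solve_tridiagonal(const std::vector<double>& a,
                                             const std::vector<double>& d,
                                             const std::vector<double>& c,
                                             const std::vector<double>& b) {
    std::size_t n = d.size();
    std::vector<double> p(n), q(n), x(n);
    p[0] = c[0] / d[0];                          // p_1 = a_12 / a_11
    q[0] = b[0] / d[0];                          // q_1 = b_1 / a_11
    for (std::size_t i = 1; i < n; ++i) {
        double denom = d[i] - a[i] * p[i - 1];
        p[i] = (i + 1 < n) ? c[i] / denom : 0.0;
        q[i] = (b[i] - a[i] * q[i - 1]) / denom;
    }
    x[n - 1] = q[n - 1];                         // x_n = q_n
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = q[i] - p[i] * x[i + 1];           // x_i + p_i x_{i+1} = q_i
    return x;
}
```

Usage: solving tridiag(-1, 2, -1) with the right-hand side chosen so the answer is (1, 2, 3, 4) recovers that vector to roundoff.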

Parallel tridiagonal solvers

Parallel tridiagonal solvers come in several flavors, all of which are extremely
complicated. In the past I have confused entire classes at this point with one
such definition. I refer
interested readers to the textbook and its associated
software.

Parallel or simple cyclic reduction are my favorite algorithms to parallelize or
vectorize tridiagonal solvers.

General Matrix A (nonsingular), solve Ax = f by Gaussian elimination

Produce $A^{(k)}$, $f^{(k)}$, k = 1, ..., n, where $A^{(1)} = A$ and $f^{(1)} = f$, and for k = 2, 3, ..., n,

$$a^{(k)}_{ij} = \begin{cases} a^{(k-1)}_{ij} & i \le k-1 \\ 0 & i \ge k,\ j \le k-1 \\ a^{(k-1)}_{ij} - \dfrac{a^{(k-1)}_{i,k-1}}{a^{(k-1)}_{k-1,k-1}}\, a^{(k-1)}_{k-1,j} & i \ge k,\ j \ge k \end{cases}$$

$$f^{(k)}_i = \begin{cases} f^{(k-1)}_i & i \le k-1 \\ f^{(k-1)}_i - \dfrac{a^{(k-1)}_{i,k-1}}{a^{(k-1)}_{k-1,k-1}}\, f^{(k-1)}_{k-1} & i \ge k \end{cases}$$
The $2 \times 2$ block form of $A^{(k)}$ is

$$A^{(k)} = \begin{pmatrix} U^{(k)} & \tilde{A}^{(k)} \\ 0 & \hat{A}^{(k)} \end{pmatrix}.$$

Theorem 3.1: Let A be such that Gaussian elimination yields nonzero diagonal elements $a^{(k)}_{kk}$, k = 1, 2, ..., n. Then A is nonsingular and

(1) $\det A = a^{(1)}_{11} a^{(2)}_{22} \cdots a^{(n)}_{nn}$.

Also, $A^{(n)} = U$ is upper triangular and A has the factorization

(2) $LU = A$,

where $L = (m_{ik})$ is lower triangular with elements

$$m_{ik} = \begin{cases} 0 & \text{for } i < k \\ 1 & \text{for } i = k \\ a^{(k)}_{ik} / a^{(k)}_{kk} & \text{for } i > k \end{cases}$$

The vector

(3) $g = f^{(n)} = L^{-1} f$.

Proof: Note that once (2) is proven, $\det(A) = \det(L)\det(U) = \det(U)$, so (1) follows.

Now we prove (2). Set $LU = (c_{ij})$. Then (since L and U are triangular and $A^{(k)}$ is satisfied for k = n)

$$c_{ij} = \sum_{k=1}^{n} m_{ik} a^{(k)}_{kj} = \sum_{k=1}^{\min(i,j)} m_{ik} a^{(k)}_{kj}.$$
From the definitions of $a^{(k)}_{ij}$ and $m_{ik}$ we get

$$m_{ik}\, a^{(k)}_{kj} = a^{(k)}_{ij} - a^{(k+1)}_{ij} \quad \text{for } k < i,\ k \le j,$$

and recall that $a^{(1)}_{ij} = a_{ij}$. Thus, if $i \le j$, then

$$c_{ij} = \sum_{k=1}^{i} m_{ik} a^{(k)}_{kj} = \sum_{k=1}^{i-1}\left(a^{(k)}_{ij} - a^{(k+1)}_{ij}\right) + a^{(i)}_{ij} = a^{(1)}_{ij} = a_{ij}.$$

When $i > j$, $a^{(j+1)}_{ij} = 0$ and the same telescoping argument gives $c_{ij} = a_{ij}$, which proves (2).

Finally, we prove (3). Let $h = Lg$. So,

$$h_i = \sum_{k=1}^{i} m_{ik} g_k = \sum_{k=1}^{i} m_{ik} f^{(k)}_k.$$

From the definitions of $f^{(k)}_i$, $m_{ik}$, and $f^{(1)}_i = f_i$,

$$m_{ik} f^{(k)}_k = f^{(k)}_i - f^{(k+1)}_i \implies h_i = f_i.$$

L nonsingular completes the proof of (3). QED

Examples:

$$A = A^{(1)} = \begin{pmatrix} 4 & 6 & 1 \\ 8 & 10 & 3 \\ -12 & 48 & 2 \end{pmatrix},\ A^{(2)} = \begin{pmatrix} 4 & 6 & 1 \\ 0 & -2 & 1 \\ 0 & 66 & 5 \end{pmatrix},\ A^{(3)} = \begin{pmatrix} 4 & 6 & 1 \\ 0 & -2 & 1 \\ 0 & 0 & 38 \end{pmatrix} = U \ \text{and}\ L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -3 & -33 & 1 \end{pmatrix}$$

and

$$A = A^{(1)} = \begin{pmatrix} 4 & 6 & 1 \\ 8 & 10 & 3 \\ 8 & 12 & 2 \end{pmatrix},\ A^{(2)} = \begin{pmatrix} 4 & 6 & 1 \\ 0 & -2 & 1 \\ 0 & 0 & 0 \end{pmatrix} = U \ \text{and}\ L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 2 & 0 & 1 \end{pmatrix}.$$

The correct way to solve Ax = f is to compute L and U first, then solve

$$Ly = f, \quad Ux = y.$$

Generalized Gaussian elimination

1. Order of elimination arbitrary.

2. Set $A^{(1)} = A$ and $f^{(1)} = f$.

3. Select an arbitrary $a^{(1)}_{i_1 j_1} \ne 0$ as the first pivot element. We can eliminate $x_{j_1}$ from all but the $i_1$-st equation. The multipliers are $m_{k,j_1} = a^{(1)}_{k,j_1} / a^{(1)}_{i_1 j_1}$.

4. The reduced system is now $A^{(2)} x = f^{(2)}$.

5. Select another pivot $a^{(2)}_{i_2 j_2} \ne 0$ and repeat the elimination.

6. If $a^{(2)}_{rs} = 0,\ \forall r, s$, then the remaining equations are degenerate and we halt.

Theorem 3.2: Let A have rank r. Then we can find a sequence of distinct row and column indices $(i_1, j_1), (i_2, j_2), \dots, (i_r, j_r)$ such that the corresponding pivot elements in $A^{(1)}, A^{(2)}, \dots, A^{(r)}$ are nonzero and $a^{(r)}_{ij} = 0$ if $i \ne i_1, i_2, \dots, i_r$. Define permutation matrices (whose columns are unit vectors)

$$P = \left[e_{i_1}, e_{i_2}, \dots, e_{i_r}, \dots, e_{i_n}\right] \quad\text{and}\quad Q = \left[e_{j_1}, e_{j_2}, \dots, e_{j_r}, \dots, e_{j_n}\right],$$

where $\{i_k\}$ and $\{j_k\}$ are permutations of $\{1, 2, \dots, n\}$. Then

$$By = g \quad (\text{where } B = P^T A Q,\ y = Q^T x,\ g = P^T f)$$

is equivalent to Ax = f and can be reduced to triangular form by Gaussian elimination with the natural ordering.

Proof: Generalized Gaussian elimination alters $A = A^{(1)}$ by forming linear combinations of the rows. Thus, whenever no nonzero pivot can be found, the remaining rows were linearly dependent on the preceding rows. Permutations P and Q rearrange equations and unknowns such that $b_{vv} = a_{i_v, j_v}$, $v = 1, 2, \dots, n$. By the first half of the theorem, the reduced $B^{(r)}$ is triangular since all rows r+1, ..., n vanish. QED

Operation Counts

To compute $a^{(k)}_{ij}$: $(n-k+1)^2 + (n-k+1)$ (do quotients only once)

To compute $f^{(k)}_i$: $(n-k+1)$

Recall that $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$ and $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$. Hence, there are $\frac{n(n^2-1)}{3}$ multiplies to triangularize A and $\frac{n(n-1)}{2}$ multiplies to modify f.

Using the Ly = f and Ux = y approach, computing $x_i$ requires $(n-i)$ multiplies plus 1 divide. Hence, only $\frac{n(n+1)}{2}$ multiplies are required to solve the triangular systems.

Lemma: $\frac{n^3}{3} + mn^2 - \frac{n}{3}$ operations are required to solve m systems $Ax^{(j)} = f^{(j)}$, j = 1, ..., m, by Gaussian elimination.

Note: To compute $A^{-1}$ requires $n^3$ operations. In general, $n^2$ operations are required to compute $A^{-1} f^{(j)}$. Thus, to solve m systems requires $mn^2$ operations. Hence, $n^3 + mn^2$ operations are necessary to solve m systems.

Thus, it is always more efficient to use Gaussian elimination instead of computing the inverse!

We can always compute $A^{-1}$ by solving $Ax_i = e_i$, i = 1, 2, ..., n, and then the $x_i$'s are the columns of $A^{-1}$.

Theorem 3.3: If A is nonsingular, $\exists P$ such that PA = LU is possible and P is only a permutation of the rows. In fact, P may be found such that $|l_{kk}| \ge |l_{ik}|$ for i > k, k = 1, 2, ..., n-1.

Theorem 3.4: Suppose A is symmetric. If A = LU is possible, then the choice of $l_{kk} = u_{kk}$ implies $l_{ik} = u_{ki}$. Hence, $U = L^T$.

Variants of Gaussian elimination

LDU factorization: L and U are unit lower and upper triangular and D is diagonal.

Cholesky: $A = A^T$, so factor $A = LL^T$.

Fun example: $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ is symmetric, but cannot be factored into LU form.

Definition: A is positive definite if $x^T A x > 0$, $\forall x \ne 0$.

Theorem 3.5 (Cholesky Method): Let A be symmetric, positive definite. Then A can be factored in the form $A = LL^T$.
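A compact Cholesky sketch (column by column; row-major storage and the names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <vector>
#include <cstddef>

// Cholesky factorization A = L L^T for symmetric positive definite A
// (row-major n x n); returns L with zeros above the diagonal.
inline std::vector<double> cholesky(const std::vector<double>& A, std::size_t n) {
    std::vector<double> L(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        double s = A[j * n + j];
        for (std::size_t k = 0; k < j; ++k) s -= L[j * n + k] * L[j * n + k];
        L[j * n + j] = std::sqrt(s);             // one square root per column
        for (std::size_t i = j + 1; i < n; ++i) {
            double t = A[i * n + j];
            for (std::size_t k = 0; k < j; ++k) t -= L[i * n + k] * L[j * n + k];
            L[i * n + j] = t / L[j * n + j];
        }
    }
    return L;
}
```

For example, A = [4 2; 2 5] factors as L = [2 0; 1 2].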

Operation counts:

To find L and $g = L^{-1} f$ costs $\frac{n^3}{6} + n^2 - \frac{n}{6}$ operations + n $\sqrt{\ }$'s.

To find U costs $\frac{n^2 + n}{2}$ operations.

Total is $\frac{n^3}{6} + \frac{3n^2}{2} + \frac{n}{3}$ operations + n $\sqrt{\ }$'s.

Parallel LU Decomposition

There are 6 convenient ways of writing the factorization step of the $n \times n$ A in LU decomposition (see textbook). The two most common ways are as follows:

kij loop: A by row (daxpy)

    for k = 1, n-1
      for i = k+1, n
        l(i,k) = a(i,k)/a(k,k)
        for j = k+1, n
          a(i,j) = a(i,j) - l(i,k)*a(k,j)
        endfor
      endfor
    endfor

kji loop: A by column (daxpy)

    for k = 1, n-1
      for p = k+1, n
        l(p,k) = a(p,k)/a(k,k)
      endfor
      for j = k+1, n
        for i = k+1, n
          a(i,j) = a(i,j) - l(i,k)*a(k,j)
        endfor
      endfor
    endfor

Definition: A daxpy is a double precision vector update of the form $x \leftarrow x + \alpha y$, where $x, y \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$.

saxpy's are single precision vector updates defined similarly.

Four styles of axpy's (real and complex, single and double precision) are included in the BLAS (basic linear algebra subroutines) that are the basis for most high performance computing linear algebra and partial differential equation libraries.

It is frequently convenient to store A by rows in the computer.

Suppose there are n processors $P_i$, with one row of A stored on each $P_i$. Using the kji access method, the factorization algorithm is

    for i = 1, n-1
      Send a(i,i) to processors P(k), k = i+1, ..., n
      In parallel on each processor P(k), k = i+1, ..., n, do the daxpy update to row k
    endfor

Note that in step i, after $P_i$ sends $a_{ii}$ to the other processors, the first i processors are idle for the rest of the calculation. This is highly inefficient if this is the only thing the parallel computer is doing.

A column oriented version is very similar.
We can overlap communication with computing to hide some of the expenses of communication. This still does not address the processor dropout issue. We can do a lot better yet.

Improvements to aid parallel efficiency:

1. Store multiple rows (columns) on a processor. This assumes that there are p processors and that $p \ll n$. While helpful to have mod(n, p) = 0, it is unnecessary (it just complicates the implementation slightly).

2. Store multiple blocks of rows (columns) on a processor.

3. Store either 1 or 2 using a cyclic scheme (e.g., store rows 1 and 3 on $P_1$ and rows 2 and 4 on $P_2$ when p = 2 and n = 4).

Improvement 3, while extremely nasty to program (and already has been as part of Scalapack, so you do not have to reinvent the wheel if you choose not to), leads to the best use of all of the processors. No processor drops out. Figuring out how to get the right part of A to the right processors is lots of fun, too, but is also provided in the BLACS, which are required by Scalapack.

Now that we know how to factor A = LU in parallel, we need to know how to do back substitution in parallel. This is a classic divide and conquer algorithm leading to an operation count that cannot be realized on a known computer (why?).

We can write the lower triangular matrix L in block form as

$$L = \begin{pmatrix} L_1 & 0 \\ L_3 & L_2 \end{pmatrix},$$

where $L_1$ and $L_2$ are also lower triangular. If L is of order $2^k$, some k > 0, then no special cases arise in continuing to factor the $L_i$'s. In fact, we can prove that

$$L^{-1} = \begin{pmatrix} L_1^{-1} & 0 \\ -L_2^{-1} L_3 L_1^{-1} & L_2^{-1} \end{pmatrix},$$

which is also known as a Schur complement. Recursion solves the problem.

Norms

Definition: A vector norm $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ satisfies, for any $x = (x_i) \in \mathbb{R}^n$ and any $y \in \mathbb{R}^n$,

1. $\|x\| \ge 0$, $\forall x \in \mathbb{R}^n$, and $\|x\| = 0$ if and only if $x_1 = x_2 = \dots = x_n = 0$.
2. $\|\alpha x\| = |\alpha|\,\|x\|$, $\forall \alpha \in \mathbb{R}$, $\forall x \in \mathbb{R}^n$.
3. $\|x + y\| \le \|x\| + \|y\|$, $\forall x, y \in \mathbb{R}^n$.

In particular,

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|, \qquad \|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p},\ p \ge 1, \qquad \|x\|_\infty = \max\{|x_1|, |x_2|, \dots, |x_n|\}.$$

Example: for $x = (4, 2, 5, -6)^T$, $\|x\|_1 = 17$, $\|x\|_2 = 9$, and $\|x\|_\infty = 6$.

Definition: A matrix norm $\|\cdot\| : \mathbb{R}^{n \times n} \to \mathbb{R}$ satisfies, for any $A = [a_{ij}] \in \mathbb{R}^{n \times n}$ and any $B \in \mathbb{R}^{n \times n}$,

1. $\|A\| \ge 0$, and $\|A\| = 0$ if and only if $a_{ij} = 0$, $\forall i, j$.
2. $\|\alpha A\| = |\alpha|\,\|A\|$.
3. $\|A + B\| \le \|A\| + \|B\|$.
4. $\|AB\| \le \|A\|\,\|B\|$.

In particular,

$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|$, which is the maximum absolute column sum.

$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$, which is the maximum absolute row sum.

$\|A\|_E = \left(\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^2\right)^{1/2}$, which is the Euclidean matrix norm.

$\|A\| = \max_{\|u\| = 1} \|Au\|$ is the matrix norm induced by (subordinate to) a given vector norm.

Examples:

1. $A = \begin{pmatrix} 1 & 2 & 3 \\ 9 & 1 & 2 \\ 1 & 2 & 4 \end{pmatrix}$: $\|A\|_1 = 11$, $\|A\|_\infty = 12$, $\|A\|_E = 11$.

2. Let $I_n \in \mathbb{R}^{n \times n}$. Then $\|I_n\|_1 = \|I_n\|_\infty = 1$, but $\|I_n\|_E = \sqrt{n}$.

Condition number of a matrix

Definition: $\mathrm{cond}(A) = \|A\|\,\|A^{-1}\|$.

Facts (compatible norms): $\|Ax\|_1 \le \|A\|_1 \|x\|_1$, $\|Ax\|_2 \le \|A\|_E \|x\|_2$, $\|Ax\|_\infty \le \|A\|_\infty \|x\|_\infty$.

Theorem 3.6: Suppose we have an approximate solution $\tilde{x}$ of Ax = b, where $b \ne 0$ and $A \in \mathbb{R}^{n \times n}$ is nonsingular. Then for any compatible matrix and vector norms,

$$\frac{1}{\kappa}\,\frac{\|A\tilde{x} - b\|}{\|b\|} \le \frac{\|x - \tilde{x}\|}{\|x\|} \le \kappa\,\frac{\|A\tilde{x} - b\|}{\|b\|}, \qquad \text{where } \kappa = \mathrm{cond}(A).$$

Proof: (rhs) $x - \tilde{x} = -A^{-1} r$, where $r = A\tilde{x} - b$ is the residual. Thus,

$$\|x - \tilde{x}\| \le \|A^{-1}\|\,\|r\| = \|A^{-1}\|\,\|A\tilde{x} - b\|.$$

Since Ax = b, $\|b\| \le \|A\|\,\|x\|$ and $1/\|x\| \le \|A\| / \|b\|$. Thus,

$$\frac{\|x - \tilde{x}\|}{\|x\|} \le \|A^{-1}\|\,\|A\|\,\frac{\|A\tilde{x} - b\|}{\|b\|}.$$

(lhs) Note that since $A \ne 0$,

$$\|A\tilde{x} - b\| = \|r\| = \|A\tilde{x} - Ax\| \le \|A\|\,\|x - \tilde{x}\|, \quad\text{or}\quad \|x - \tilde{x}\| \ge \frac{\|A\tilde{x} - b\|}{\|A\|}.$$

Further,

$$x = A^{-1}b \implies \|x\| \le \|A^{-1}\|\,\|b\| \quad\text{or}\quad \frac{1}{\|x\|} \ge \left(\|A^{-1}\|\,\|b\|\right)^{-1}.$$

Combining the two inequalities gives us the lhs. QED

Theorem 3.7: Suppose x and $x + \delta x$ satisfy

$$Ax = f \quad\text{and}\quad (A + \delta A)(x + \delta x) = f + \delta f,$$

where $\delta A$ and $\delta f$ are perturbations. Let A be nonsingular and $\delta A$ be so small that $\|\delta A\| < 1/\|A^{-1}\|$. Then for $\kappa = \mathrm{cond}(A)$ we have

$$\frac{\|\delta x\|}{\|x\|} \le \frac{\kappa}{1 - \kappa\,\|\delta A\|/\|A\|} \left( \frac{\|\delta f\|}{\|f\|} + \frac{\|\delta A\|}{\|A\|} \right).$$

Note: Theorem 3.7 implies that when $\kappa$ is small, small relative changes in f and A cause small relative changes in x.

Iterative Improvement

1. Solve Ax = f to an approximation $\tilde{x}$ (all single precision).

2. Calculate $r = A\tilde{x} - f$ using double the precision of the data.

3. Solve Ae = r to an approximation $\tilde{e}$ (single precision).

4. Set $\tilde{x}' = \tilde{x} - \tilde{e}$ (single precision $\tilde{x}'$) and repeat steps 2-4 with $\tilde{x} \leftarrow \tilde{x}'$.

Normally the solution method is a variant of Gaussian elimination.

Note that $r = A\tilde{x} - f = A(\tilde{x} - x) = Ae$. Since we cannot solve Ax = f exactly, we probably cannot solve Ae = r exactly, either.

Fact: If the 1st $\tilde{x}'$ has q digits correct, then the 2nd $\tilde{x}'$ will have 2q digits correct (assuming that 2q is less than the number of digits representable on your computer) and the nth $\tilde{x}'$ will have nq digits correct (under a similar assumption as before).

Parallelization is straightforward: Use a parallel Gaussian elimination code and parallelize the residual calculation based on where the data resides.
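Steps 1-4 can be sketched with a templated Gaussian elimination; only the residual is formed in double precision (the Hilbert test matrix mentioned in the note below is my own choice of an ill-conditioned example):

```cpp
#include <cassert>
#include <cmath>
#include <vector>
#include <cstddef>
#include <utility>
#include <algorithm>

// Gaussian elimination with partial pivoting + back substitution in precision T.
template <typename T>
std::vector<T> ge_solve(std::vector<T> A, std::vector<T> b, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t piv = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(A[i*n+k]) > std::fabs(A[piv*n+k])) piv = i;
        for (std::size_t j = 0; j < n; ++j) std::swap(A[k*n+j], A[piv*n+j]);
        std::swap(b[k], b[piv]);
        for (std::size_t i = k + 1; i < n; ++i) {
            T m = A[i*n+k] / A[k*n+k];
            for (std::size_t j = k; j < n; ++j) A[i*n+j] -= m * A[k*n+j];
            b[i] -= m * b[k];
        }
    }
    std::vector<T> x(n);
    for (std::size_t i = n; i-- > 0; ) {
        T s = b[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= A[i*n+j] * x[j];
        x[i] = s / A[i*n+i];
    }
    return x;
}

// One refinement step: r = A x - f accumulated in double, solve A e = r in
// single precision, and update x' = x - e.
inline std::vector<float> refine_once(const std::vector<float>& A,
                                      const std::vector<float>& f,
                                      const std::vector<float>& x,
                                      std::size_t n) {
    std::vector<float> r(n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = -double(f[i]);
        for (std::size_t j = 0; j < n; ++j) s += double(A[i*n+j]) * double(x[j]);
        r[i] = float(s);
    }
    std::vector<float> e = ge_solve<float>(A, r, n), xp(n);
    for (std::size_t i = 0; i < n; ++i) xp[i] = x[i] - e[i];
    return xp;
}
```

On the 4x4 Hilbert matrix, one refinement step visibly reduces the error of the single precision solution, measured against a double precision solve of the same (promoted) data.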

3c. Iterative Methods

3c (i) Splitting or Relaxation Methods

Let A = S - T, where S is nonsingular. Then $Ax = b \iff Sx = Tx + b$. The iterative procedure is defined by

$$x^0 \text{ given}, \qquad S x^{k+1} = T x^k + b, \quad k \ge 0.$$

To be useful requires that

1. $x^{k+1}$ is easy to compute.

2. $x^k \to x$ in a reasonable amount of time.
Example: Let A = D - L - U, where D is diagonal and L and U are strictly lower and upper triangular, respectively. Then

a. S = D and T = L + U: both are easy to compute, but many iterations are required in practice.

b. S = A and T = 0: S is hard to compute with, but only 1 iteration is required.

Let $e^k = x^k - x$. Then

$$S e^{k+1} = T e^k, \quad\text{or}\quad e^k = (S^{-1}T)^k e^0,$$

which proves the following:

Theorem 3.8: The iterative procedure converges or diverges at the rate of $\|S^{-1}T\|$.

Named relaxation (or splitting) methods:

1. $S = D$, $T = L + U$ (Jacobi): requires 2 vectors for $x^k$ and $x^{k+1}$, which is somewhat unnatural, but parallelizes trivially and scales well.

2. $S = D - L$, $T = U$ (Gauss-Seidel, or Gauß-Seidel in German): requires only 1 vector for $x^k$. The method was unknown to Gauss, but known to Seidel.

3. $S = \omega^{-1}D - L$, $T = (\omega^{-1} - 1)D + U$:

   a. $\omega \in (1, 2)$ (Successive Over Relaxation, or SOR)
   b. $\omega \in (0, 1)$ (Successive Under Relaxation, or SUR)
   c. $\omega = 1$ is just Gauss-Seidel

Example: $A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$, $S_J = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$, $T_J = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, and $\|S_J^{-1} T_J\| = \frac{1}{2}$, whereas

$S_{GS} = \begin{pmatrix} 2 & 0 \\ -1 & 2 \end{pmatrix}$, $T_{GS} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$, and $\|S_{GS}^{-1} T_{GS}\| = \frac{1}{4}$,

which implies that 1 Gauss-Seidel iteration equals 2 Jacobi iterations.
Special Matrix Example

Let

$$A_{fd} = \begin{pmatrix} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & -1 & 2 & -1 & \\ & & & -1 & 2 & -1 \\ & & & & -1 & 2 \end{pmatrix}$$

be tridiagonal.

For this matrix, let $\mu = \|S_J^{-1} T_J\|$ and $\lambda = \|S_{SOR,\omega}^{-1} T_{SOR,\omega}\|$. The optimal $\omega$ is such that

$$(\lambda + \omega - 1)^2 = \lambda\,\omega^2 \mu^2,$$

which is part of Young's thesis (1950), but correctly proven by Varga later. We can show that

$$\omega = \frac{2\left(1 - \sqrt{1 - \mu^2}\right)}{\mu^2}$$

makes $\lambda$ as small as possible.

Aside: If $\omega = 1$, then $\lambda = \mu^2$, or $\sqrt{\lambda} = \mu$. Hence, Gauss-Seidel is twice as fast as Jacobi (in either convergence or divergence).

If $A_{fd} \in \mathbb{R}^{n \times n}$, let $h = \frac{1}{n+1}$.

Facts:

$\mu = \cos(\pi h)$  (Jacobi)

$\mu^2 = \cos^2(\pi h)$  (Gauss-Seidel)

$\lambda = \dfrac{1 - \sin(\pi h)}{1 + \sin(\pi h)}$  (SOR with optimal $\omega$)

Example: n = 21 and h = 1/22. Then $\mu \approx 0.99$, $\mu^2 \approx 0.98$, $\lambda \approx 0.75$: 30 Jacobis equal 1 SOR with the optimal $\omega$! Take n = 1000 and h = 1/1001. Then ~1275 Jacobis equal 1 SOR with the optimal $\omega$!!

55

There are many other splitting methods, including Alternating Direction Implicit
and a cottage industry of splitting methods developed
in the U.S.S.R. (1960’s). There are some interesting parallelization methods
-
like methods
have similar convergence properties of ADI.

Par
allelization of the Iterative Procedure

For Jacobi, parallelization is utterly trivial:

1.

Split up the unknowns onto processors.

2.

Each processor updates all of its unknowns.

3.

Each processor sends its unknowns to processors that need the updated
information.

4.

C
ontinue iterating until done.

56

Common fallacies:

When an element of the solution vector
x
k

has a small enough element
-
wise residual, stop updating the element. This leads to utterly wrong
solutions since the residuals are affected by updates of neighbors

after
the element stops being updated.

Keep computing and use the last known update from neighboring
processors. This leads to chattering and no element
-
wise convergence.

Asynchronous algorithms exist, but eliminate the chattering through
extra calculatio
ns.

Parallel Gauss
-
Seidel and SOR

are much, much harder. In fact, by and large,
they do not exist. Googling efforts leads to an interesting set of papers that
approximately parallelize Gauss
-
Seidel for a set of matrices with a very well
known structures only. Even then, the algorithms are

extremely complex.

57

Parallel Block
-
Jacobi

is commonly used instead as an approximation. The
matrix
A

is divided up into a number of blocks. Each block is assigned to a
processor. Inside of each block, Jacobi is performed some number of iterations.
Data
is exchanged between processors and the iteration continues.

See the book (
absolutely shameless plug
),

C. C. Douglas, G. Haase, and U. Langer,
A Tutorial on Elliptic PDE
Solvers and Their Parallelization
,

for how to do
parallelization of iterative methods for matrices that commonly
occur when solving partial differential equations (
what else would you ever want
to solve anyway???
).

3c (ii) Krylov Space Methods

Let A be symmetric, positive definite, i.e., $A = A^T$ and $(Ax, x) \ge \alpha \|x\|_2^2$, where $\alpha > 0$.

The conjugate gradient (CG) iteration method for the solution of Ax + b = 0 is defined as follows, with r = r(x) = Ax + b:

$x^0$ arbitrary (approximate solution)

$r^0 = Ax^0 + b$ (approximate residual)

$w^0 = r^0$ (search direction)
59

For k = 0, 1, …:

x^{k+1} = x^k + α_k w^k,   α_k = −(r^k, w^k) / (w^k, Aw^k)

r^{k+1} = r^k + α_k Aw^k

w^{k+1} = r^{k+1} + β_k w^k,   β_k = −(r^{k+1}, Aw^k) / (w^k, Aw^k)
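The iteration translates almost line for line into code. A sketch (mine) that keeps the notes' sign convention r = Ax + b, so the method solves Ax + b = 0:

```python
import numpy as np

# CG sketch with the notes' convention r = Ax + b (solves Ax + b = 0).
def cg(A, b, x0, tol=1e-10, max_iter=200):
    x = x0.copy()
    r = A @ x + b                              # approximate residual
    w = r.copy()                               # search direction
    for k in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Aw = A @ w
        alpha = -np.dot(r, w) / np.dot(w, Aw)  # minimizes Q along w (Lemma CG1)
        x = x + alpha * w
        r = r + alpha * Aw
        beta = -np.dot(r, Aw) / np.dot(w, Aw)  # makes (w_new, Aw) = 0 (Lemma CG2)
        w = r + beta * w
    return x

# Small SPD test: tridiag[-1, 2, -1], with b chosen so x_star solves Ax + b = 0.
N = 10
A = np.diag(2.0 * np.ones(N)) - np.diag(np.ones(N - 1), 1) - np.diag(np.ones(N - 1), -1)
x_star = np.linspace(1.0, 2.0, N)
b = -A @ x_star
x = cg(A, b, np.zeros(N))
```

By Theorem 3.9 below, in exact arithmetic the loop needs at most N passes.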
Lemma CG1: If Q(x(t)) = ½ (x(t), Ax(t)) + (b, x(t)) and x(t) = x^k + t w^k, then α_k is chosen to minimize Q(x(t)) as a function of t.

Proof: Expand x(t) and use inner product linearity:

Q(x(t)) = ½ (x^k + t w^k, Ax^k + t Aw^k) + (b, x^k + t w^k)

= ½ [ (x^k, Ax^k) + 2t (x^k, Aw^k) + t² (w^k, Aw^k) ] + (b, x^k) + t (b, w^k)
60

d/dt Q(x(t)) = (x^k, Aw^k) + t (w^k, Aw^k) + (b, w^k)

d/dt Q(x(α_k)) = (x^k, Aw^k) + α_k (w^k, Aw^k) + (b, w^k)

= (x^k, Aw^k) − (r^k, w^k) + (b, w^k)

= (Ax^k + b − r^k, w^k)

= 0

since

Ax^k − r^k = A(x^{k−1} + α_{k−1} w^{k−1}) − (r^{k−1} + α_{k−1} Aw^{k−1})

= Ax^{k−1} − r^{k−1}

= ⋯

= Ax^0 − r^0

= −b
61

Note that d²/dt² Q(x(α_k)) = (w^k, Aw^k) > 0 if w^k ≠ 0.

Lemma CG2: The parameter β_k is chosen so that (w^{k+1}, Aw^k) = 0.

Lemma CG3: For 0 ≤ q ≤ k,

1. (r^{k+1}, w^q) = 0

2. (r^{k+1}, r^q) = 0

3. (w^{k+1}, Aw^q) = 0
Lemma CG4: α_k = −(r^k, r^k) / (w^k, Aw^k).

Lemma CG5: β_k = (r^{k+1}, r^{k+1}) / (r^k, r^k).
62

Theorem 3.9 (CG): Let A ∈ ℝ^{N×N} be symmetric, positive definite. Then the CG iteration converges to the exact solution of Ax + b = 0 in not more than N iterations.

Preconditioning

We seek a matrix M (or a set of matrices) to use in solving

M⁻¹Ax + M⁻¹b = 0

such that

κ(M⁻¹A) ≪ κ(A),

M is easy to use when solving My = z, and

M and A have similar properties (e.g., symmetry and positive definiteness).

Reducing the condition number reduces the number of iterations necessary to achieve a given convergence factor.

63

Theorem 3.10: In finite arithmetic, the preconditioned conjugate gradient method converges at a rate based on the largest and smallest eigenvalues of M⁻¹A:

‖x^k − x‖ ≤ 2 ( (κ^{1/2}(M⁻¹A) − 1) / (κ^{1/2}(M⁻¹A) + 1) )^k ‖x^0 − x‖, where κ(M⁻¹A) = λ_max(M⁻¹A) / λ_min(M⁻¹A).

Proof: See Golub and Van Loan or many other numerical linear algebra books.

What are some common preconditioners?

Identity!!!

Main diagonal (the easiest to implement in parallel and very hard to beat)

Jacobi

Gauss-Seidel

Tchebyshev

64

Incomplete LU, known as ILU (or modified ILU)

Most of these do not work straight out of the box since symmetry may be required. How do we symmetrize Jacobi or an SOR-like iteration?

Do two iterations: once in the order specified and once in the opposite order. So, if the order is natural, i.e., 1, …, N, then the opposite is N, …, 1.

There are a few papers that show how to do two-way iterations for less than the cost of two matrix-vector multiplies (which is the effective cost of the solves).

x^0 arbitrary (approximate solution)

r^0 = Ax^0 + b (approximate residual)

M r̃^0 = r^0 (preconditioned residual)

w^0 = r̃^0 (search direction)

65

followed by, for k = 0, 1, … until (r^{k+1}, r^{k+1}) ≤ ε (r^0, r^0) and (r̃^{k+1}, r^{k+1}) ≤ ε (r̃^0, r^0) for a given ε:

x^{k+1} = x^k + α_k w^k,   α_k = −(r̃^k, r^k) / (w^k, Aw^k)

r^{k+1} = r^k + α_k Aw^k

M r̃^{k+1} = r^{k+1}

w^{k+1} = r̃^{k+1} + β_k w^k,   β_k = (r̃^{k+1}, r^{k+1}) / (r̃^k, r^k)
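The preconditioned loop can be sketched directly. Assumptions mine: M is the main diagonal of A (the "hard to beat" choice from the list of preconditioners), and r̃ is written rt:

```python
import numpy as np

# PCG sketch: r = Ax + b convention, diagonal preconditioner M = diag(A).
def pcg(A, b, x0, M_diag, eps=1e-20, max_iter=200):
    x = x0.copy()
    r = A @ x + b                  # approximate residual
    rt = r / M_diag                # solve M rt = r
    w = rt.copy()                  # search direction
    rr0, rtr0 = np.dot(r, r), np.dot(rt, r)
    for k in range(max_iter):
        if np.dot(r, r) <= eps * rr0 and np.dot(rt, r) <= eps * rtr0:
            break
        Aw = A @ w
        rt_r_old = np.dot(rt, r)
        alpha = -rt_r_old / np.dot(w, Aw)
        x = x + alpha * w
        r = r + alpha * Aw
        rt = r / M_diag            # solve M rt = r again
        beta = np.dot(rt, r) / rt_r_old
        w = rt + beta * w
    return x

# SPD, diagonally dominant test matrix.
N = 12
A = np.diag(np.arange(2.0, N + 2)) - np.diag(np.ones(N - 1), 1) - np.diag(np.ones(N - 1), -1)
x_star = np.ones(N)
b = -A @ x_star
x = pcg(A, b, np.zeros(N), np.diag(A))
```

With M = I this reduces to the unpreconditioned loop (rt = r throughout).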
66

3d. Sparse Matrix Methods

We want to solve Ax = b, where A is large, sparse, and N × N. By sparse, A is nearly all zeroes. Consider the tridiagonal matrix A = tridiag[−1, 2, −1]. If N = 10,000, then A is sparse, but if N = 4 it is not sparse. Typical sparse matrices are not just banded or diagonal matrices. The nonzero pattern may appear to be random at first glance.

There are a small number of common storage schemes so that (almost) no zeroes are stored for A, ideally storing only NZ(A) = number of nonzeroes in A:

Diagonal (or band)

Profile

Row or column (and several variants)

Any of the above for blocks

The schemes all work in parallel, too, for the local parts of A. Sparse matrices arise in a very large percentage of problems on large parallel computers.

67

Compressed row storage scheme (Yale Sparse Matrix Package format)

3 vectors: IA, JA, and AM.

Length  Description
N+1     IA(j) = index in AM of 1st nonzero in row j
NZ(A)   JA(j) = column of jth element in AM
NZ(A)   AM(j) = a_ik, for some row i and k = JA(j)

Row j is stored in AM(IA(j) : IA(j+1) − 1). The order in the row may be arbitrary or ordered such that JA(j) < JA(j+1) within a row. Sometimes the diagonal entry for a row comes first, then the rest of the row is ordered.

The compressed column storage scheme is defined similarly.
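A small sketch of the scheme (mine; the notes index from 1, Python arrays are 0-based, so IA and JA hold 0-based indices here):

```python
import numpy as np

# Build IA, JA, AM for a small matrix and compute y = Ax row by row.
A_dense = np.array([[4.0, 0.0, 1.0],
                    [0.0, 3.0, 0.0],
                    [2.0, 0.0, 5.0]])

IA, JA, AM = [0], [], []
for row in A_dense:
    for j, v in enumerate(row):
        if v != 0.0:
            JA.append(j)
            AM.append(v)
    IA.append(len(AM))      # IA[i+1] - IA[i] = nonzeroes in row i

def crs_matvec(N, IA, JA, AM, x):
    y = np.zeros(N)
    for i in range(N):
        for k in range(IA[i], IA[i + 1]):   # row i lives in AM[IA[i]:IA[i+1]]
            y[i] += AM[k] * x[JA[k]]
    return y

x = np.array([1.0, 2.0, 3.0])
y = crs_matvec(3, IA, JA, AM, x)
```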

68

Modified compressed row storage scheme (new Yale Sparse Matrix Package format)

2 vectors: IJA, AM, each of length NZ(A) + N + 1.

Assume A = D + L + U, where D is diagonal and L and U are strictly lower and upper triangular, respectively. Let η_i = NZ(row i of L+U).

Then

IJA(1) = N + 2

IJA(i) = IJA(i−1) + η_{i−1}, i = 2, 3, …, N+1

IJA(j) = column index of jth element in AM, j > N+1

AM(i) = a_ii, 1 ≤ i ≤ N

AM(N+1) is arbitrary

AM(j) = a_ik, IJA(i) ≤ j ≤ IJA(i+1) − 1 and k = IJA(j)

The modified compressed column storage scheme is defined similarly.
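A sketch of building the two arrays (mine, 0-based: AM[0:N] holds the diagonal, AM[N] is the unused slot, the off-diagonals follow row by row, and IJA doubles as row pointers and column indices):

```python
import numpy as np

A_dense = np.array([[4.0, 0.0, 1.0],
                    [0.0, 3.0, 0.0],
                    [2.0, 0.0, 5.0]])
N = 3

IJA = [N + 1]                           # IJA[0] (the notes' N+2, shifted to 0-based)
AM = list(np.diag(A_dense)) + [0.0]     # diagonal, then one arbitrary slot
cols = []
for i in range(N):
    nz = [j for j in range(N) if j != i and A_dense[i, j] != 0.0]
    IJA.append(IJA[-1] + len(nz))       # IJA[i+1] = IJA[i] + eta_i
    cols.extend(nz)
    AM.extend(A_dense[i, j] for j in nz)
IJA.extend(cols)                        # IJA[j] = column of AM[j] for j > N

def mcrs_matvec(N, IJA, AM, x):
    y = np.array([AM[i] * x[i] for i in range(N)])  # diagonal part first
    for i in range(N):
        for k in range(IJA[i], IJA[i + 1]):
            y[i] += AM[k] * x[IJA[k]]
    return y

x = np.array([1.0, 2.0, 3.0])
y = mcrs_matvec(N, IJA, AM, x)
```

Note the diagonal needs no index storage at all, which is the point of the scheme.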

69

Very modified compressed column storage scheme (Bank-Smith format)

Assumes that A is either symmetric or nearly symmetric.

Assume A = D + L + U, where D is diagonal and L and U are strictly lower and upper triangular, respectively. Let η_i = NZ(column i of U) that will be stored. Let η = Σ_{i=1}^{N} η_i.

2 vectors: IJA, AM, with both a_ij and a_ji stored if either is nonzero.

IJA(1) = N + 2

IJA(i) = IJA(i−1) + η_{i−1}, i = 2, 3, …, N+1

IJA(j) = row index of jth element in AM, j > N+1

AM(i) = a_ii, 1 ≤ i ≤ N

AM(N+1) is arbitrary

AM(j) = a_ki, IJA(i) ≤ j ≤ IJA(i+1) − 1 and k = IJA(j)

If A ≠ A^T, then AM(j + η) = a_ik.

70

AM contains first D, an arbitrary element, U^T, and then possibly L.

Example:

A = [ a11  0    a13  0
      0    a22  0    a24
      a31  0    a33  0
      0    a42  0    a44 ]

Then

index  1    2    3    4    5    6    7    8    9
IJA    6    6    6    7    8    1    2
AM     a11  a22  a33  a44  *    a13  a24  a31  a42

(AM entries 1–5 hold D and the arbitrary element, with IJA(1:5) the column "pointers"; entries 6–7 hold U^T; entries 8–9 hold the optional L.)
71

Compute Ax or A^T x

Procedure MULT(N, A, x, y)
  do i = 1:N
    y(i) = A(i) x(i)
  enddo
  Lshift = 0 if L is not stored, and IJA(N+1) − IJA(1) otherwise
  Ushift = 0 if y = Ax or L is not stored, and IJA(N+1) − IJA(1) if y = A^T x
  do i = 1:N
    do k = IJA(i) : IJA(i+1) − 1
      j = IJA(k)
      y(i) += A(k+Lshift) x(j)   // So-so caching properties
      y(j) += A(k+Ushift) x(i)   // Cache misses galore
    enddo
  enddo
end MULT
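A Python rendering of MULT (mine, 0-based, for the nonsymmetric case where L is stored; the shift pair is my interpretation of how the same loop serves both products: the transpose simply swaps which copy of each off-diagonal pair feeds each update):

```python
import numpy as np

N = 4
IJA = [5, 5, 5, 6, 7, 0, 1]
AM = [11.0, 22.0, 33.0, 44.0, 0.0,   # D and the arbitrary slot
      13.0, 24.0,                    # U^T: a13, a24
      31.0, 42.0]                    # L:  a31, a42
A_dense = np.array([[11.0, 0, 13.0, 0],
                    [0, 22.0, 0, 24.0],
                    [31.0, 0, 33.0, 0],
                    [0, 42.0, 0, 44.0]])

def mult(N, IJA, AM, x, transpose=False):
    nu = IJA[N] - IJA[0]                 # eta: number of stored off-diagonal pairs
    Lshift, Ushift = (0, nu) if transpose else (nu, 0)
    y = np.array([AM[i] * x[i] for i in range(N)])
    for i in range(N):
        for k in range(IJA[i], IJA[i + 1]):
            j = IJA[k]                      # row index of the stored U entry
            y[i] += AM[k + Lshift] * x[j]   # so-so caching properties
            y[j] += AM[k + Ushift] * x[i]   # cache misses galore
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
y = mult(N, IJA, AM, x)
yT = mult(N, IJA, AM, x, transpose=True)
```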

72

In the double loop, the first y update has so-so cache properties, but the second update is really problematic. It is almost guaranteed to cause at least one cache miss every time through the loop. Storing small blocks of size p × q (not 1 × 1) is one way to improve the cache behavior.
Note that when solving Ax = b by iterative methods like Gauss-Seidel or SOR, access to D, L, and U is required. These algorithms can be implemented fairly easily on a single processor core.

Sparse Gaussian elimination

We want to factor A = LDU. Without loss of generality, we assume that A is already reordered so that this is accomplished without pivoting. The solution is computed using forward, diagonal, and backward substitutions:

Lw = b,   Dz = w,   Ux = z
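The three substitutions can be sketched on a dense example (mine: a Doolittle-style LDU factorization with unit diagonals on L and U, then the three solves; np.linalg.solve stands in for the triangular substitutions):

```python
import numpy as np

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
b = np.array([10.0, 16.0, 11.0])

# Factor A = L D U (L, U unit triangular; D stored as a vector).
N = len(A)
L, D, U = np.eye(N), np.zeros(N), np.eye(N)
M = A.copy()
for i in range(N):
    D[i] = M[i, i]
    L[i+1:, i] = M[i+1:, i] / D[i]
    U[i, i+1:] = M[i, i+1:] / D[i]
    M[i+1:, i+1:] -= np.outer(L[i+1:, i], M[i, i+1:])

w = np.linalg.solve(L, b)   # forward substitution:  L w = b
z = w / D                   # diagonal solve:        D z = w
x = np.linalg.solve(U, z)   # backward substitution: U x = z
```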
73

There are 3 phases:

1. symbolic factorization (determine the nonzero structure of U and possibly L),

2. numeric factorization (compute LDU), and

3. forward/backward substitution (compute x).

Let G(V, E) denote the ordered, undirected graph corresponding to the matrix A. V = {v_i}_{i=1}^{N} is the vertex set, E = {e_ij | a_ij ≠ 0 or a_ji ≠ 0} is the edge set, and adj_G(v_i) = {v_k | e_ik ∈ E}.

Gaussian elimination corresponds to a sequence of elimination graphs G_i, 0 ≤ i < N. Let G_0 = G. Define G_i from G_{i−1}, i > 0, by removing v_i from G_{i−1} and all of its incident edges from G_{i−1}, and adding new edges as required to pairwise connect all vertices in adj_{G_{i−1}}(v_i).
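The elimination-graph process can be played out directly on a small pattern. A sketch (mine; the star graph below is the classic worst case: eliminating the hub first connects all of its neighbors pairwise):

```python
# Compute the fill set F by forming G_0, G_1, ... as described above.
def fill_edges(N, E):
    """E: iterable of 2-sets {i, j}; returns the set of fill edges F."""
    G = {frozenset(e) for e in E}
    F = set()
    for i in range(N):                       # eliminate v_i
        nbrs = [k for k in range(i + 1, N) if frozenset({i, k}) in G]
        G = {e for e in G if i not in e}     # remove v_i and incident edges
        for a in range(len(nbrs)):           # pairwise connect adj(v_i)
            for b in range(a + 1, len(nbrs)):
                e = frozenset({nbrs[a], nbrs[b]})
                if e not in G:
                    G.add(e)
                    F.add(e)
    return F

# Star: 0 -- 1, 0 -- 2, 0 -- 3. Eliminating vertex 0 first fills in all of
# {1, 2, 3} pairwise, giving three fill edges.
F = fill_edges(4, [{0, 1}, {0, 2}, {0, 3}])
```

Reordering the star so the hub is eliminated last produces no fill at all, which is why reordering matters so much in sparse elimination.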

74

Let F denote the set of edges added during the elimination process. Let G′ = (V, E ∪ F). Gaussian elimination applied to G′ produces no new fill-in edges. Symbolic factorization computes E ∪ F. Define

m(i) = min { k > i | v_k ∈ adj_{G′}(v_i) }, 1 ≤ i ≤ N

     = min { k | v_k ∈ adj_{G_{i−1}}(v_i) }

Theorem 3.11: e_ij ∈ E ∪ F, i ≠ j, if and only if

1. e_ij ∈ E, or

2. there exists a sequence k_1, k_2, …, k_p such that

a. k_1 = l, k_p = j, with e_jl ∈ E,

b. i = k_q for some 2 ≤ q ≤ p − 1, and

c. k_q = m(k_{q−1}), 2 ≤ q ≤ p.

75

Computing the fill-in

The cost in time will be
( )