
Design Patterns in Parallel Programming

The Current State of Frameworks

and Parallel Compilers






CSG280


Parallel Programming

Northeastern University

Fall 2005


Michael Everett


Abstract



Parallel patterns research looks for recurring patterns in code that can be abstracted into a more structured framework. These frameworks typically achieve three goals for the average programmer: more understandable code, better designed code created in less time, and fewer errors. A handful of patterns allow a programmer to think more abstractly and worry less about the details of the implementation. While these ideas can get complicated, as in a Model View Controller, they can often be quite simple, such as replacing goto and Label with while or for loops. It is this search for such patterns in parallel code that this paper will explore.


The field of parallel computing today is becoming more important than ever. Desktop PCs with multiple processors are no longer an odd rarity; they are commonplace. The implications of this have yet to really be felt or understood. While this will only accelerate the years of research into solving large, computationally intensive problems in short amounts of time, it has also resulted in a new field emerging. Today there are millions of sequential libraries and applications already written. As we attempt to exploit the new processor power in dual and quad core chips, the possibility of having to rewrite legacy code into suitable parallel form is simply unacceptable. On top of this, writing efficient parallel code is difficult. There are a number of reasons for this: lack of programming experience, and the increased complexity of having to deal with communication, shared data, synchronization, dirty caches, limited languages, etc. To make matters even worse, debugging a parallel program is non-trivial. When you compare these difficulties to those a programmer faces when writing a sequential program in a high level language such as Java, it is easy to tell that the field still has a long way to go.


Two areas of research that have attempted to address some of these issues are finding patterns or frameworks to generalize parallel ideas, and optimizing compilers or automated parallelization. Both address the same issue, making parallel coding easier by examining recurring patterns, but they approach it from very different angles.



Design Patterns


Overview



The field of research in design patterns in parallel computing has been around for quite some time. The basic idea of design patterns, skeletons, or frameworks, as they have been referred to over the years, is to find repetitive patterns typically found in parallel programs and abstract them. Just as sequential programming slowly introduced the ideas of macros, methods, data types, abstract methods/data types/classes, inheritance, etc., the field of design patterns hopes to do the same for parallel computing. The ultimate goal is a more efficient, easier to use language that allows a less trained user to produce faster code in less time.

In many ways this is a much more complicated task than finding sequential patterns. The pattern designer must worry about communication, shared data, network and processor configuration, and synchronization, to name just a subset of the problems. On top of this, they must also discover useful parallel algorithms and data structures that end users can use. The hope is that if this can be done successfully, the end user can worry about writing efficient code, fine tuning, and debugging, and forget about the framework this is occurring in.





History



Design patterns have been around in some form or another since the mid-80's. They originated as what is now commonly referred to as pattern-based skeletons or systems. These took a general look at abstracting patterns by providing the basic structure for specific parallel expressions. Often a pattern skeleton coupled the data structure, the network type, and the communication protocols very closely. The problem that resulted from this tight coupling was that the patterns were too specific. If a pattern did not fit your needs, you basically had to throw the whole pattern away and start from scratch, which of course defeats the purpose of using a pattern in the first place. Even if the data type and network structure were correct, e.g. master-slave, but the communication protocol could not handle your needs, the entire pattern was no longer usable even though two thirds of it still met the programmer's needs.


The algorithmic skeletons took a more abstract approach. They began by looking at the recurring structures that parallel programmers were continually rewriting; a few examples are divide and conquer, cluster, task queue (or farm), and pipeline. The idea was to have the programmer select one of these design types when first starting a project, and the skeleton would generate the general framework for you.

This approach was more successful than the previous pattern-based one. It did begin to capture some of the recurring problems facing parallel programmers. Unlike sequential programmers, the parallel programmer typically first asks how the data will be stored, what work is being done on that data, and how the work will be divided among the processors. The algorithmic model allows the programmer to complete the first two steps simply by choosing the appropriate skeleton; by forcing the programmer to think about certain design constraints from the start, it makes the remaining choices easier to implement. Once the programmer chooses a specific skeleton, an automatic code generator can begin building the routine work, saving time for the programmer, who can concentrate on implementing the application code instead of the framework code.
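
To make the skeleton idea concrete, the following is a minimal sketch (my own illustration, not code from any of the systems discussed here) of a task-farm skeleton in C with OpenMP: the skeleton owns the parallel structure, and the application programmer only supplies a sequential worker function. The function and type names are invented for this example.

    #include <stdio.h>
    #include <omp.h>

    /* The "hole" the application programmer fills in: a purely sequential task. */
    typedef void (*task_fn)(int task_id, void *shared);

    /* The skeleton: hands tasks out to whatever threads are available.
       Dynamic scheduling plays the role of the farm's task queue. */
    static void farm(int num_tasks, task_fn work, void *shared)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int t = 0; t < num_tasks; t++)
            work(t, shared);
    }

    /* Example worker written as ordinary sequential code. */
    static void square_task(int task_id, void *shared)
    {
        int *results = (int *)shared;
        results[task_id] = task_id * task_id;
    }

    int main(void)
    {
        int results[16];
        farm(16, square_task, results);
        for (int i = 0; i < 16; i++)
            printf("%d ", results[i]);
        printf("\n");
        return 0;
    }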


This worked great, in theory. Algorithmic skeletons still left programmers with much to be desired. To name just a few of the remaining problems: synchronization, data structures, and communication still had to be handled in the application code. Communication usually meant MPI or OpenMP, which can be quite tricky at times. Shared data and synchronization still had to be coded by the application writer. While most algorithmic skeletons had libraries containing the frameworks, these were often still limited and rather trivial for real world applications. The libraries let the user add new algorithms to the framework if they were not there already, but if this happened often it again defeated the purpose of using a pattern in the first place. Perhaps the biggest problem with the libraries was that modifying how they worked meant rewriting the entire algorithm. Often the auto-generated stubs would be written in some underlying byte-code. While this may have been easier for the framework builders, it meant that modifying how any part of an algorithm worked meant, once again, rewriting the entire library. The general consensus was: why bother dealing with these problems, given that the user must write the communication, synchronization, and data structures anyway?


In the mid-90's research into the field of parallel patterns began to fade, perhaps as a result of a new 'hot' research topic, or perhaps because the lack of cross-over into the real world market slowed funding. By 1997/1998 most of the substantial and consistent research in this field was coming from the University of Alberta and the University of Waterloo, both located in Canada.


Current Research



The University of Alberta and the University of Waterloo began looking at an even more abstract idea than algorithmic skeletons. There are essentially three problems with algorithmic skeletons: one, they suffer from a single hierarchy; two, they do not employ enough high-level abstraction; and three, they lack modifiability.

Just as Aspect Oriented Programming has recently emerged to highlight some problems with languages such as Java, a similar and in some cases more obvious case can be made about algorithmic skeletons. Algorithmic skeletons do a nice job of generating a framework based on what data you have and what you want to do to that data, but they completely ignore topics such as communication, operating system, and network configuration. All of these are just as important to parallel computing as what type of algorithm you want to use.


Despite this, it does not really come as a surprise that the parallel community has not flocked to high level languages the way sequential programmers have. Parallel programming is about efficiency and speed. Assembly code is faster than C, which is faster than Java. While this is all true, [4] makes a very good argument that while program speed is important, the number of times the program needs to be run and the cost of developing it in the first place also need to be considered. If a program can be written in a high level framework and language in significantly less time, that might offset the slightly slower execution time.


Perhaps the biggest problem with most frameworks or skeletons is that they are not robust enough. They may give you pre-generated patterns but lack the ability to easily extend or modify them. This is a direct result of a few problems: one, the pre-generated code is often in an unreadable format; two, the programmer does not have access to the pre-generated code; and/or three, the programmer cannot mix or combine several patterns together to solve a large complex problem without rewriting a library.


From around 1997 to roughly 2003, these universities focused heavily on implementing a parallel framework in a high level language to satisfy what they felt were the biggest deficiencies in parallel languages today. What they ended up developing became known as the Correct Object-Oriented Pattern-based Parallel Programming System (CO2P3S). CO2P3S is written in Java using multi-threaded technology and RMI for all its communication. Originally designed for shared-memory processors, it was eventually extended to work over distributed networks as well.



CO2P3S takes full advantage of object-oriented tools such as inheritance and abstract data types. The first step of starting a parallel project in CO2P3S is to choose a pattern. There are numerous patterns to choose from, and each pattern has a family of sub-patterns to get a better fit for the type of data being used. This pattern will not only choose the type of network structure, e.g. master-slave, but will also choose the data structure for storing the data, e.g. an NxN matrix. New patterns can be quickly and easily added by extending an abstract Pattern class.


Once the template has been chosen, the user is asked to fill in class names, data structure names, and variable names. Finally, the user is asked to describe the dependence levels of the data structure, which gives the data structure some idea of which data elements depend on each other.


At this point auto-generated code is created in what is described as an intermediate layer. Here the user does not have the ability to modify the generated code. They are given a series of hook methods that they must implement, similar to the hooks a user of Top-C must fill in. An interesting point here is that the hooks allow the user to simply write a sequential program; the auto-generated code has handled the data structure, network configuration, synchronization, and communication protocols for the user.
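
As a rough illustration of the hook idea (a sketch only; CO2P3S itself expresses hooks through Java abstract classes and inheritance, not the C shown here, and the names below are invented), the framework calls user-supplied functions at fixed points while keeping the parallel machinery to itself:

    #include <stdio.h>
    #include <string.h>
    #include <omp.h>

    #define N 64

    /* Hypothetical hook signature: sequential code the user writes. */
    typedef double (*update_hook)(const double *old, int i);

    /* Framework side: owns the parallel loop, the double buffering, and the
       synchronization. The user never edits this part. */
    static void run_mesh_pattern(double *grid, int steps, update_hook update)
    {
        double next[N];
        for (int s = 0; s < steps; s++) {
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                next[i] = update(grid, i);
            memcpy(&grid[1], &next[1], (N - 2) * sizeof(double));
        }
    }

    /* User side: an ordinary sequential hook, here a simple averaging stencil. */
    static double average_hook(const double *old, int i)
    {
        return 0.5 * (old[i - 1] + old[i + 1]);
    }

    int main(void)
    {
        double grid[N] = {0};
        grid[0] = 1.0;   /* fixed boundary value */
        run_mesh_pattern(grid, 100, average_hook);
        printf("grid[1] after 100 steps: %f\n", grid[1]);
        return 0;
    }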


Finally, the user has access to all the code. At this point the user may fine tune the code, make large modifications if the pattern set was too simple to describe the problem adequately, or simply run the program.


One very interesting thing to notice is that CO2P3S forces the user to write correct parallel code. If the code does not run correctly after the last step, the user automatically knows the problem is in the source code of one of the hook methods. The use of interfaces, abstract classes, and inheritance to design a loosely coupled communication protocol, network configuration, and data structure is a very unique way of approaching parallel computing. By abstracting these three parts of parallel computing, it allows the user to focus on each particular task outside the context of a specific implementation. It also encourages testing different network configurations and different data structures to see which performs best, with only minor modifications to the code. Since communication protocols inherit from the same interface, changing between MPI, OpenMP, or perhaps RMI is a matter of initializing a different class.


Future


Since the joint venture between the University of Alberta and the University of Waterloo, very little work has come out of either university on the topic. In the summer of 2005 a paper came out of the University of Alberta that received much attention and gives a glimpse at where the future research on parallel patterns is going. While Java does offer many characteristics that any parallel programmer would love, it has its downsides. Even with the use of JIT compilers, there is still some question as to whether Java can run as fast as good C/C++ code. While RMI is easy to use, it has a large overhead, which can be disastrous in a parallel program. This has led to work on porting the Java-implemented framework to an MPI wrapper using C.

This will not be easy. C is not Java; some things which are enforced in Java cannot be enforced in C, and some things are simply not possible, such as polymorphism. One cannot write simple hooks through interface design as one could in Java; instead, macros must be used. Despite the difficulties, preliminary tests show on average a two times speed-up over the Java implementation.

One of the stated goals for the project is the hope of seeing real world use of design patterns for the first time. Since MPI and OpenMP are seemingly becoming the industry standard for communication in parallel processing, a framework built on MPI will likely draw more attention from the parallel industry than their Java implementation did.




Conclusion



Despite twenty years of research on the topic, parallel skeletons have yet to reach outside the research community. There are likely a number of reasons for this, the greatest being cost and speed. While parallel computers are becoming commonplace, their use in real world applications is still relatively limited. The average software, business, or engineering company is not employing such computers, and if it is, only in a very limited manner. For most real world applications parallel processing is still an oddity. As a result, the cost to purchase and run applications on these machines is still relatively high compared to normal sequential programs. This means purchasers of such products are willing to spend extra money to have a more efficient application: if they are going to pay top dollar for the hardware, the software had better run as fast as it can.


This results in a double blow for parallel skeletons. While Java is convenient and produces understandable, easily modifiable code, convincing someone who just purchased a two million dollar parallel computer that you will be running an application in Java is a hard sell. The cost of the hardware still outweighs taking slight efficiency hits in software. Even on sequential processors, it has not been until very recently that Java applications have emerged as contenders in the real world market.


The second hit skeletons take as a result of this is that there is less urgency to produce an application quickly. Having to produce an application in a matter of weeks or a few months is less urgent when such large sums of money are involved. This is precisely one of the motivations of skeletons: to produce better code faster. With relatively few applications being written (compared to sequential code), the new programmers who might more readily adopt these methodologies are few and far between. As a result of all this, skeletons have been limited to the research community.


There are still other reasons, of course. Often skeletons simply do not work well. They are frequently built so sloppily that the need to modify minor details results in the whole framework failing. Only recently have researchers appeared to capture this robustness in a framework, this being the University of Alberta's work, but the Java implementation seems not to have drawn much attention for the reasons stated above. The fact that they are now porting it down to MPI is only further proof that real world application writers have no intention of moving away from low level code writing anytime soon.


This leads us to ask whether skeletons will ever become commonplace. The answer is that at some point in time they will have to. Just as we have seen a progressively more structured progression of programming languages, the need to easily write, maintain, test, and debug parallel code will need to be encompassed in some sort of structure or hierarchy. Once a suitable structure is found and parallel computing becomes commonplace enough to be taught as a requirement at most universities, a skeleton will emerge into the work force. When the average software company is writing parallel code, some higher level language will be needed, and skeletons seem to encompass the basic ideas that will be needed in such a language. The important thing to note is that the ideas skeletons encompass are what will be needed; whether it takes the form that CO2P3S takes, or perhaps some hybrid Top-C format, cannot be predicted until the tool emerges and becomes commonplace.




Compiler Optimization / Automated Parallelization



Overview


Automated parallel compiler design also began in the late 80's. As the field of parallel computing began to emerge, the desire to tap into old legacy code, especially old FORTRAN modules, was strong. Since most parallel computing programs involved solving large, complex mathematical problems, reusing old FORTRAN code was highly desirable, and it was quickly decided that rewriting all the code into parallel form would be too costly. From this problem arose the field of automated parallelization: take a sequential program, run it through a compiler, and get back the same program with its code parallelized to run over N processors. A similar and perhaps less ambitious field is that of optimizing parallel compilers, which take parallel code and return modifications to make the program more efficient.


Both of these employ very similar techniques and, in fact, if you can do the first then it is rather trivial to do the second. While this is a very different approach than parallel skeletons, they both share the same concepts. Both of them require a strong understanding of what patterns appear in parallel code. Parallel skeletons abstract the patterns so the user can select a pattern and fill in the sequential code; parallel compilers hide the patterns from the user and allow the user to write sequential code as they always have. When the user puts the code through the compiler, the compiler in effect says: I recognize this pattern, let me rearrange it so it can be run over N processors instead.

In both cases the final outcome is the same: the compiler or skeleton creator has hidden the complexity of generating efficient parallel code from the user.


Techniques for Finding Parallelizable Loops



Automated parallel compilers seem like a great idea; the problem arises when actually trying to make them work. It is non-trivial to generate parallel code from sequential code.

Let's look at a simple matrix-array multiplication:


    a)  for j = 1 to m do
            for i = 1 to n do
                a[i] = a[i] * b[i, j]
            endfor
        endfor


A first look at trying to parallelize this might be to observe that the inner for loop can be split between processors while the outer for loop cannot. This would suggest inverting the two loops:




    b)  for i = 1 to n by strip do
            for j = 1 to m do
                a[i] = a[i] * b[i, j]
            endfor
        endfor


Intuitively this makes sense. Now each processor can work independently on one column of the matrix and one index of the array. The algorithm works nicely, but the hardware does not agree. When this is divided among processors, each processor will read in a chunk of data larger than it needs, for efficiency reasons of its own. This means that while each processor is only writing to its own index of a[], each processor still has multiple indices of a[] in its cache that other processors are writing to.

Ex: Processor 0 reads and writes only a[0] but loads a[0] through a[49] for efficient cache use. Since other processors write to the other 49 indices, Processor 0 still holds those values in its cache, and when another processor writes to, say, a[10], the line must still be marked dirty and updated on Processor 0, even though Processor 0 never uses that value. The result is that each time a write is performed, every other processor that loaded that index of the array into its cache must invalidate it. We thought we had written a nicely granular parallel loop; instead, we are updating cache lines on multiple processors for each write operation!


Let's take a look at what an automated compiler has created:



    c)  for k = 1 to n by strip do
            for j = 1 to m do
                for i = k to min(k + strip - 1, n) do
                    a[i] = a[i] * b[i, j]
                endfor
            endfor
        endfor


Here we have divided the i loop into strips of roughly n/(number of processors) iterations each. As a result of this new ordering, the innermost loop now keeps the greatest amount of reusable information in cache. This results in fewer cache updates and invalidations. The result is a faster parallel algorithm than the hand written one.


The algorithm we just described is extremely simple, and yet the intuitive approach is not the best approach. What happens when we get three or four nested loops? From benchmark tests, the automated compiler works better and better as more loops occur, and the inverse is true for the hand written version.

Compiler optimizers and automaters are not the perfect solution, however. Yes, they can work quite well on perfectly nested loops or trivially parallel loops. Trivial parallelism is easily described by an example:



    for i = 1 to n do
        a[i] = a[i] + i
        b[i] = b[i] + i
    endfor


One can easily see that a and b are independent arrays and can both be operated on by two processors at the same time.
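
In OpenMP terms (a sketch of my own, not tied to any particular compiler), this is the kind of loop a parallelizer can handle with a single work-sharing directive, since no iteration touches data written by another:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = 0.0; }

        /* Every iteration writes only a[i] and b[i], so the iterations can be
           split across processors with no synchronization. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = a[i] + i;
            b[i] = b[i] + i;
        }

        printf("a[10] = %f, b[999] = %f\n", a[10], b[999]);
        return 0;
    }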


Compilers fall apart, however, on even trivial complications such as the introduction of a conditional statement, a method call with some unknown behavior, or unknown array dimensions. This is a huge problem, since these are some of the most common features of programs.


Numerous techniques have been developed to help determine whether a program can be parallelized, or to modify it to help the parallelization along. They can be categorized as: parallel enabling transformations, induction variable substitution, privatization, reduction, data-dependence testing, and loop coalescing techniques. We will now explore each group and look at some of the major accomplishments.


1) Parallel Enabling Transformations

The idea here is to rearrange data so that values no longer depend upon each other. A simple example: int x = y = 2; could be rewritten as int x = 2; int y = 2;. This technique is often used as a first pass to help the other computations.



2) Induction Variables

Induction variables are variables declared outside of a loop but incremented within the loop on each iteration.

    J = 0
    DO I = 1,N
        J = J + 2
        U(J) = …

This causes two problems: 1) J is read and written each iteration, making the loop cross-iteration dependent; 2) a subscript expression involving an induction variable cannot be directly analyzed by most dependence tests because the variable is not a loop index.


The solution is to rewrite the subscript in terms of the loop index, either by deriving it directly from the current loop index or by adding an additional loop to do this for us.

    DO I = 1,N
        U(I*2) = …

A common technique used to implement this is described in greater detail in the additional paper and implementation.
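
A C rendering of the same substitution (a sketch under the assumption that J has no other uses after the loop) shows why the rewritten form is parallelizable:

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void)
    {
        double u[2 * N + 1] = {0};

        /* Before: j is an induction variable, so every iteration depends on
           the previous one and the loop cannot be split.
           int j = 0;
           for (int i = 1; i <= N; i++) { j += 2; u[j] = i; }               */

        /* After substitution: the subscript is a pure function of i, so the
           iterations are independent.                                       */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            u[2 * i] = i;

        printf("u[4] = %f\n", u[4]);
        return 0;
    }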


3) Privatization

- identifying and renaming private variables and arrays inside loops

    DO I = 1,N
        T = A(I) + 1
        B(I) = T * 2
        C(I) = T + 1

T is a variable local to the loop body. Since it is rewritten each iteration, the loop as written is cross-iteration dependent. To fix this we must declare T private to each processor and store a different copy of T in a different memory location on each processor. This will almost always take place after induction substitution has occurred: induction substitution will often remove cross-iteration behavior from a variable inside a loop, but the variable will still need to be declared private so the loop can be strip-mined.
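
In OpenMP this is exactly what the private clause expresses; the following sketch (my own example, not taken from the paper's sources) gives each thread its own copy of t:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) a[i] = i;

        double t;
        /* private(t) gives every thread its own t, removing the false
           cross-iteration dependence on the temporary. */
        #pragma omp parallel for private(t)
        for (int i = 0; i < N; i++) {
            t = a[i] + 1.0;
            b[i] = t * 2.0;
            c[i] = t + 1.0;
        }

        printf("b[3] = %f, c[3] = %f\n", b[3], c[3]);
        return 0;
    }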



4) Reduction

This is usually a multi-step process of recognizing and restructuring variables that two or more processors would otherwise have to modify concurrently. The major drawback here, which is why compilers focus so much on this task, is that it produces substantially more code than the sequential version.

    Ex.  DO I = 1,N
             Sum = Sum + A(I)


It is fairly obvious that this loop could be run in parallel in its current format. But just because it is parallelizable does not mean we should necessarily run it in parallel in its given format: if it were strip-mined as-is, each processor would be writing to Sum, a shared variable, and the resulting synchronization and overhead would actually make this loop run slower than its sequential version.


The solution to this problem is to strip-mine the loop but give each processor a private version of Sum. Once all processors are finished, the partial sums are added up on a single computer.

    Ex.  DO I = 1, Num-Processors
             S(I) = 0
         DO I = 1,N by strip-mining
             S(my-proc-number) = S(my-proc-number) + A(I)
         DO I = 1, Num-Processors
             Sum = Sum + S(I)


It is quite obvious that reduction quickly increases code size: the two lines of sequential code have turned into six, and that is without including the communication overhead. Reduction has, however, parallelized a sequential piece of code that was not otherwise usefully parallelizable.
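
Modern OpenMP captures this whole transformation with a reduction clause; the sketch below (an illustration, not the generated code of any particular compiler) is what the private-partial-sums pattern looks like when the compiler and runtime do the bookkeeping:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        /* reduction(+:sum) gives each thread a private partial sum and adds
           the partial sums together at the end, exactly the hand
           transformation shown above. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);   /* expect 1000000.0 */
        return 0;
    }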






5) Data-dependence Testing

Data-dependence testing is one of the most important techniques in automated parallelization. The goal is to look at loops and determine whether references inside a loop are dependent upon each other: analyze every pair of references to the same array within a loop. This can be described by looking at a simple loop:


    Do I = 1,N
    R:    A(2*I) = …
    S:    …      = A(2*I)
    T:    …      = A(2*I+1)

The formal name for what is being checked here is cross-iteration dependence. We will begin by looking at the simplest tests first.


a) Equality Test

Checks for data dependence by comparing identical linear subscript expressions. Two instances of R from different iterations are independent, since the identical subscripts 2*I and 2*I' refer to the same element only when I = I'. R & S are likewise independent across iterations. The relationship between R & T is unknown to this test.



b) GCD Test

Determines whether two linear subscript expressions can have a common integer solution. Consider 2i = 2i' + 1 (references R and T). Since there is no integer solution for i and i', they never access the same element: GCD(2,2) = 2 does not divide 1.

(Note: although R and T are not cross-iteration dependent, they access neighboring elements of the same array, and the resulting cache line rewrites will degrade the computation.)
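
As a small illustration (my own sketch, using the standard form of the test rather than any specific compiler's code), the GCD test for subscripts a*i + c1 and b*i' + c2 reports a possible dependence only when gcd(a, b) divides c2 - c1:

    #include <stdio.h>
    #include <stdlib.h>

    /* Greatest common divisor of two positive integers. */
    static int gcd(int a, int b)
    {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /* GCD dependence test for references A(a*i + c1) and A(b*i + c2):
       a dependence is possible only if gcd(a, b) divides (c2 - c1). */
    static int gcd_test_may_depend(int a, int b, int c1, int c2)
    {
        int g = gcd(abs(a), abs(b));
        return (c2 - c1) % g == 0;
    }

    int main(void)
    {
        /* R: A(2*I) vs T: A(2*I + 1)  ->  gcd(2,2) = 2 does not divide 1. */
        printf("R vs T: %s\n",
               gcd_test_may_depend(2, 2, 0, 1) ? "maybe dependent" : "independent");
        /* R: A(2*I) vs S: A(2*I)      ->  2 divides 0, cannot be ruled out. */
        printf("R vs S: %s\n",
               gcd_test_may_depend(2, 2, 0, 0) ? "maybe dependent" : "independent");
        return 0;
    }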


c) Range Test

Determines whether dependence occurs in a set of nested loops by looking at the upper and lower bounds of two separate iterations. The idea here is not to determine whether two different references are independent, but whether a nested loop is independent of its outer loop. If it is, we can strip-mine the entire loop.

    Ex:  Do I = 1,N
             Do J = 1,N
                 A(I*M + J) = 0

Our goal here is to determine whether A is ever accessed by two different iterations of the outer loop. If it is not, we can strip-mine the outer loop and process it over various computers at the same time.



d) Banerjee(-Wolfe) Test

Determines whether, over the course of a loop, two references will access the same parts of the data.

    Do I = 1,100
    R:    A(I) = ….
    S:    ….   = A(I + 200)

By knowing the loop bounds, we can see that R accesses indices 1 to 100 while S accesses indices 201 to 300. From this test we can determine that R & S are independent of each other.



    Do I = 1,100
    R:    A(I) = ….
    S:    ….   = A(I+5)

In this example, the Banerjee test is ambiguous as to whether R & S are independent: R covers 1 to 100 while S covers 6 to 105. The solution is to try rewriting these indices in terms of a single variable. If we take the value R, we rewrite all the other references in terms of R, so S = R + 6 to 105, and we can then examine the two ranges [R] and [R+6:105]. If we examine these indices both in the case where R is performed prior to S and where S is performed prior to R, we see that they have dependencies on each other in both directions. From this test we can conclude that the two statements are dependent.
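
The flavor of the test can be conveyed with a much-simplified sketch (my own, valid only for subscripts of the form i + c over the same loop variable; the full Banerjee test handles general linear subscripts and direction vectors):

    #include <stdio.h>

    /* Simplified bounds check in the spirit of the Banerjee test: for
       references A(i + c1) and A(i + c2) over i in [lo, hi], the two
       references can touch the same element iff their index ranges overlap. */
    static int ranges_may_overlap(int lo, int hi, int c1, int c2)
    {
        int lo1 = lo + c1, hi1 = hi + c1;
        int lo2 = lo + c2, hi2 = hi + c2;
        return lo1 <= hi2 && lo2 <= hi1;
    }

    int main(void)
    {
        /* A(I) vs A(I + 200) over I = 1..100: ranges 1..100 and 201..300. */
        printf("offset 200: %s\n",
               ranges_may_overlap(1, 100, 0, 200) ? "possibly dependent" : "independent");
        /* A(I) vs A(I + 5) over I = 1..100: ranges 1..100 and 6..105 overlap. */
        printf("offset   5: %s\n",
               ranges_may_overlap(1, 100, 0, 5) ? "possibly dependent" : "independent");
        return 0;
    }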



6) Loop Coalescing Techniques

One of the complications of parallel computing is dealing with more than just a single processor and a single set of RAM. Communication, synchronization, and shared data are just a few of the problems, and they often result in less than optimal usage of each processor. A lot of the ideas here are implemented in automated parallelizers but are also found in compiler optimizers.


a) Loop Fusion

This is a technique employed in optimizers more than in parallelizers. The idea is that there are two loops with only a handful of iterations and operations each; strip-mining them separately would mean that the communication needed to distribute the work and then recollect it may take longer than processing it on a single computer.

    Do I = 1,N by strip-mining
        A(I) = B(I)
    Do I = 1,N by strip-mining
        C(I) = A(I) + D(I)

can, in certain cases, be rewritten more effectively as:

    Do I = 1,N
        A(I) = B(I)
        C(I) = A(I) + D(I)


b) Loop Coalescing

This is a technique for removing nested loops. Typically a pair of nested loops will have a part that can be parallelized and a part that cannot. By folding the inner loop into a single combined loop that can be strip-mined, it creates larger granularity and thus allows more processors to work on the data. A sketch of the equivalent OpenMP directive follows the example.

    Ex.  Do I = 1,N by strip-mining
             Do J = 1,M
                 A(I,J) = B(I,J)

    becomes

         Do IJ = 1, N*M by strip-mining
             I = 1 + (IJ - 1) Div M
             J = 1 + (IJ - 1) Mod M
             A(I,J) = B(I,J)

With slightly more work, the loop has been modified to be parallelizable over N*M iterations rather than N. This may be important depending on the amount of computation and the number of processors.
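
OpenMP's collapse clause asks the compiler to perform exactly this kind of coalescing; a brief sketch (illustrative only, with made-up array sizes) is below:

    #include <stdio.h>
    #include <omp.h>

    #define N 8
    #define M 256

    static double a[N][M], b[N][M];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                b[i][j] = i + j;

        /* collapse(2) coalesces the i and j loops into a single N*M iteration
           space, so the work can be spread over more threads than N alone
           would allow. */
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                a[i][j] = b[i][j];

        printf("a[7][255] = %f\n", a[7][255]);
        return 0;
    }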


c) Loop Interchange

Often, as we saw in the example above, only one of a pair of nested loops can successfully be strip-mined. While loop coalescing is one technique for dealing with this, it has drawbacks: it introduces extra overhead by having to recalculate the indices each iteration, and it is harder for the programmer to understand; unless you are achieving speed-up by allowing more processors to work on the data, there is really no reason to use loop coalescing.

Loop interchange is a simpler approach that simply swaps the loops when a parallelizable loop is found inside a non-parallelizable loop.

    Ex.  Do I = 1,N
             Do J = 1,M by strip-mining
                 A(I, J) = A(I-1, J)

    becomes

         Do J = 1,M by strip-mining
             Do I = 1,N
                 A(I, J) = A(I-1, J)

By doing this, granularity has been increased, a greater portion is parallelized, and the number of processors that can usefully work on the data can be better controlled. One problem with this method, which was described at the beginning of the section and will be looked at again next, is that it can often result in poorer use of the cache and fewer hits.
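
A C/OpenMP rendering of the interchange (my own sketch; note that in C's row-major layout the interchanged form also strides through memory column-wise, which is the cache cost mentioned above):

    #include <stdio.h>
    #include <omp.h>

    #define N 512
    #define M 512

    static double a[N][M];

    int main(void)
    {
        for (int j = 0; j < M; j++)
            a[0][j] = 1.0;

        /* The i loop carries the dependence (a[i][j] needs a[i-1][j]); only
           the j loop is parallel. Interchanging puts the parallel loop
           outermost, so each thread owns a set of independent columns. */
        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            for (int i = 1; i < N; i++)
                a[i][j] = a[i - 1][j] + 1.0;

        printf("a[511][0] = %f\n", a[511][0]);
        return 0;
    }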



d) Loop Blocking

The example given at the beginning of this section, comparing the hand parallelized loop with the automated one, is an example of loop blocking. In essence, it is a combination of strip-mining and loop interchange. The idea is to create large granular pieces that can be handled independently by different processors while at the same time keeping as much information in cache as possible. This is one of the more complicated parallel techniques to do by hand, yet it produces some of the most beneficial speed-ups by reducing cache misses and communication between processes.





Compiler Design



Both optimizing compilers and parallelizing compilers work almost exactly the same way. There are two main differences: the input and the number of transformations. Obviously an optimizing compiler is given parallelized code while a parallelizing compiler is given sequential code. The other difference is that an optimizing compiler has less work to do; most of its work is creating privatization and reducing dependence. A parallelizer has to go through much more work over more iterations to modify, privatize, and continue to reduce over and over again.


One of the best known parallelizers was Polaris, which was created to parallelize Fortran 77 code. It was never modified to support Fortran 90 and has since fallen into obscurity, but it was the driving force that created the field of parallel compilers. Today the goal remains the same, but the focus has shifted to C, C++, and Java code. Two leading projects have built framework compilers to give researchers easy access to modify and test compiler parallelizers:

Cetus: http://paramount.www.ecn.purdue.edu/ParaMount/Cetus/index.html

Mercurium: http://www.cepba.upc.edu/mercurium/

Cetus supports C and C++, while Mercurium supports C and Fortran and is designed to work with OpenMP.


Polaris worked in the following manner, and most parallel compilers work in a similar fashion:

1) scan and parse the code
2) substitute induction variables
3) reduction recognition
4) privatization
5) data dependence tests
6) loop transformations
7) output pass

Throughout these passes, information is gathered and stored in an intermediate representation, which takes the form of a parse tree. This tree can be traversed and modified in such a manner as to ensure that valid source code is produced.





Future



Both compiler optimization and automated parallelization continue to be heavily researched topics. When research in the field began, the desire was to reuse Fortran code, since most parallel projects were using scientific data, and the lack of data definitions and pointers made Fortran easier to parallelize than a language such as C. In the last six or seven years there has been a dramatic shift away from Fortran and towards more common languages such as C, C++ and Java, no doubt in part due to parallel chips becoming more common.

Research in the field will likely continue at its current pace. Optimizing compilers are already reaching the point where they can offer speed-ups over hand parallelized code and, at worst, do no worse than the hand parallelized code. This alone should keep research projects going.

There is a great likelihood that we will never see a fully automated parallelizer; many in this research field believe it is impossible. As more techniques are discovered, however, they can be directly applied to compilers and, in a broader sense, give the parallel community a better understanding of what is required from a parallel language.

Summary



At first the two fields of parallel research, that on parallel skeletons and that on parallel compilers, seem to fall into two very different camps. In fact, they both look to achieve the same goals, using the same information, in two very different ways. Parallel skeletons look to find the recurring patterns and abstract them into a high level hierarchy that is easy for the programmer to use and modify. In doing so they hide the complexity issues that can seemingly overwhelm a parallel programmer: design patterns, communication, operating systems, etc.

Parallel compilers look at the same recurring patterns and again try to hide them from the end user. Instead of abstract classes and methods, the patterns are hidden inside a compiler that can recognize them in sequential or parallel code.

At the moment both fields have yet to find the silver bullet of parallel programming. It seems highly unlikely that fully-automated sequential to parallel compilers will ever exist; high level code is simply too complex for the compilers to ever figure everything out at compile time. Parallel optimizers, on the other hand, will most likely continue to improve and become more and more commonplace. All code needs to be compiled, and if a compiler can also make it generate faster code, there is no reason not to pursue such research. Parallel skeletons will likely continue to exist in some form. Just as we have seen languages such as Java help to deal with complex programming issues, there is no reason why a similar structure will not become commonplace to deal with even greater complexity issues in parallel computing. As communication protocols, network and processor configurations begin to converge toward an industry standard, better and better skeletons will begin to emerge.

Perhaps the most important thing to take away from these research fields is what they have discovered. While they serve little or no purpose in solving real world parallel problems today, they have both led to the discovery of many recurring patterns and constraints that good parallel code must abide by. From the field of skeletons we have seen the recurring algorithms that are used and ways to hide communication. From the compilers we have seen which data structures work well in parallel computing and which make it more difficult, and, more importantly, the constraints that must hold on loops for them to be parallelized and ways to help balance loads. From these themes it could be imagined that, in time, a parallel language could be constructed from all the patterns that have been gathered from these fields of research.

Works Cited


Patterns

1) D. Goswami, A. Singh, and B. R. Preiss. Using Object-Oriented Techniques for Realizing Parallel Architectural Skeletons. Proc. ISCOPE '99, San Francisco, CA, December 1999.

2) S. Siu, M. De Simone, D. Goswami, and A. Singh. Design Patterns for Parallel Programming. University of Waterloo, Ontario, Canada.

3) D. Goswami, A. Singh, and B. R. Preiss. Architectural Skeletons: The Re-Usable Building Blocks for Parallel Applications. Proc. 1999 International Conference on Parallel and Distributed Processing Techniques and Applications, volume 3, pages 1250-1256, Las Vegas, NV, June 1999.

4) K. Tan, D. Szafron, J. Schaeffer, et al. Using Generative Design Patterns to Generate Parallel Code for a Distributed Memory Environment. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2003), San Diego, CA, June 2003, pages 203-215.

5) S. MacDonald, D. Szafron, and J. Schaeffer. "Object-Oriented Pattern-Based Parallel Programming with Automatically Generated Frameworks", in Proceedings of the 5th USENIX Conference on Object-Oriented Tools and Systems (COOTS '99), San Diego, CA, May 1999, pages 29-43.

6) P. Mehta, J. N. Amaral, and D. Szafron. Is MPI Suitable for a Generative Design-Pattern System? Workshop on Patterns in High Performance Computing, Champaign-Urbana, Illinois, May 2005.

7) D. K. G. Campbell. Towards the Classification of Algorithmic Skeletons. Technical Report YCS 276, Department of Computer Science, University of York, 1996.

Compilers

8) D. Milicev and Z. Jovanovic. A Formal Model of Software Pipelining Loops with Conditions. Proceedings of the 11th International Symposium on Parallel Processing, pages 554-558, 1997.

9) D. Szafron and J. Schaeffer. An Experiment to Measure the Usability of Parallel Programming Systems. Concurrency: Practice and Experience, Vol. 8, No. 2, March 1996, pp. 147-166.

10) G. S. Johnson and S. Sethumadhavan. Compiler Directed Parallelization of Loops in Scale for Shared-Memory Multiprocessors. Proceedings of the Third International Conference on Computational Science (ICCS 2003).

11) Y. Yu and E. D'Hollander. Partitioning Loops with Variable Dependence Distances. Proceedings of the 2000 International Conference on Parallel Processing (ICPP '00), 21-24 August 2000, Toronto, Canada, pages 209-218.

12) W. Blume, R. Doallo, R. Eigenmann, et al. Advanced Program Restructuring for High-Performance Computers with Polaris. University of Illinois at Urbana-Champaign, Urbana, IL, 1996.

13) K. S. McKinley. A Compiler Optimization Algorithm for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Aug 1998 (Vol. 9, No. 8), pp. 769-787.

Appendix A - Terms

Algorithmic Skeleton - a skeleton that implements various well known and commonly occurring algorithm patterns found in parallel computing. Often they are written as a library in such a way as to allow the end user to use the algorithm without having to write the implementation code, hence saving the programmer time.

Cross-iteration dependence - occurs when a loop accesses the same memory location in at least two iterations and changes the value of that memory location in at least one of those iterations. This condition must never occur in order for a loop to be parallelized.

Granularity - when looking at a sequential program or loop, granularity refers to the number of independent components. The more granularity a program or loop contains, the easier it is to parallelize and the greater the speed-up that can be achieved.

Loop-Blocking - see Strip-mining.

Parallel Pattern - any structure that helps to clarify recurring code typically found in parallel programming. A simple example from sequential programming is replacing goto and label with a for or while loop. The code was always there; the pattern makes it clearer and less error prone.

Perfectly nested loops - a set of nested loops which does not contain a function call with unknown side effects. This is an important concept since unknown side effects can often make it impossible to apply parallel compiler optimization.

Skeleton - a general term for a data abstraction sandbox that hides complexity details from the user.

Strip-mining - a similar idea to producer-consumer, where a 0 to N loop execution is broken down into pieces of size s; after completing all s iterations, the next s are taken, until N has been reached. This typically results in more overhead, since more synchronization and communication must take place. It is an easy way to prevent loop overrun, which occurs when numerous machines are calculating values in a loop and some of them do work beyond the exit condition.