A Brief Survey of Parallelization and/or Code Generation Software Tools




Florina Monica Ciorba

cflorina@cslab.ece.ntua.gr


Introduction


This document describes 20 parallelization and/or code generation software tools, in chronological order, starting in 1990. Not all publicly available tools are included in this report, only those most frequently cited in the literature. Software tools providing automated functionalities for parallelization and/or parallel code generation can make the parallel programming task much easier.

Parallelization is the process of analyzing sequential programs for parallelism and restructuring them to run efficiently on multiprocessor systems. The goal is to minimize the overall computation time by distributing the computational workload among the available processors (a process called scheduling). The target architectures can be shared memory, distributed memory, or distributed shared memory systems. Parallelization can be automatic, manual or semi-automatic. Scheduling methodologies can be static (performed at compile-time) or dynamic (performed at runtime). Furthermore, dynamic techniques can be distributed (the scheduling task and/or the scheduling information are distributed among the processors and their memories) or centralized (global information is stored at a centralized location and used to make more comprehensive scheduling decisions, using the computing and storage resources of one or more dedicated processors).
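
To make the static/dynamic distinction concrete, the following C sketch (a generic illustration using OpenMP, not taken from any of the surveyed tools) runs the same kind of loop once with a static schedule, where iteration blocks are fixed in advance, and once with a dynamic schedule, where idle threads claim chunks of iterations at runtime.

    /* Minimal illustration (not from any surveyed tool): the same kind of loop
     * under static scheduling (iteration blocks fixed before execution) and
     * dynamic scheduling (iterations handed out to idle threads at runtime).
     * Assumes an OpenMP-capable C compiler, e.g. `cc -fopenmp`. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;
        int i;

        /* Static scheduling: each thread receives a fixed block of iterations. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++)
            a[i] = (double)i * 0.5;

        /* Dynamic scheduling: chunks of 1024 iterations are claimed at runtime
         * by whichever thread becomes idle first (better for irregular work). */
        #pragma omp parallel for schedule(dynamic, 1024) reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }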



1. Hypertool [1], 1990 - is a tool developed for message passing systems. It takes a user-partitioned C program as input, automatically allocates these partitions to PEs (processing elements, i.e., processors), and inserts proper synchronization primitives where needed. The program development methodology goes as follows: in the beginning, a designer develops a proper algorithm, performs partitioning, and writes a program as a set of procedures. The program looks like a sequential program and it can run on a sequential machine for the purpose of debugging. This program is automatically converted into a parallel program for a message-passing target machine by parallel code synthesis and optimization. Hypertool then generates performance estimates, including execution time, communication time, suspension time for each PE, and network delay for each communication channel. The explanation facility displays data dependencies between PEs, as well as parallelism and load distribution. If the designer is not satisfied with the result, he/she will attempt to redefine the partitioning strategy and the size of partitions using the information provided by the performance estimator and the explanation facility.

The algorithms employed for the static scheduling of processes are based on the critical path method: Modified Critical Path (MCP) and Mobility Directed Scheduling (MD).
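
Critical-path-based heuristics such as MCP rank tasks by the length of the longest path from each task to an exit node of the DAG (often called the bottom level). The following C sketch is a generic illustration of that computation, not Hypertool's actual implementation; the Task structure and the example graph are assumptions made for the example.

    /* Generic sketch of a critical-path (bottom-level) computation for list
     * scheduling; not Hypertool's actual code. Tasks are assumed to be numbered
     * in topological order, so traversing them in reverse order guarantees that
     * all successors are processed before their predecessors. */
    #include <stdio.h>

    #define NTASKS 5

    typedef struct {
        double weight;               /* computation cost of the task         */
        int    nsucc;                /* number of successors                 */
        int    succ[NTASKS];         /* successor task ids                   */
        double comm[NTASKS];         /* communication cost to each successor */
    } Task;

    /* blevel[i] = weight[i] + max over successors s of (comm(i,s) + blevel[s]) */
    static void bottom_levels(const Task *t, int n, double *blevel) {
        for (int i = n - 1; i >= 0; i--) {
            double best = 0.0;
            for (int k = 0; k < t[i].nsucc; k++) {
                int s = t[i].succ[k];
                double cand = t[i].comm[k] + blevel[s];
                if (cand > best) best = cand;
            }
            blevel[i] = t[i].weight + best;
        }
    }

    int main(void) {
        /* A small diamond-shaped DAG: 0 -> {1,2} -> 3 -> 4 */
        Task t[NTASKS] = {
            { 2.0, 2, {1, 2}, {1.0, 1.0} },
            { 3.0, 1, {3},    {2.0}      },
            { 1.0, 1, {3},    {1.0}      },
            { 4.0, 1, {4},    {0.5}      },
            { 1.0, 0, {0},    {0.0}      },
        };
        double blevel[NTASKS];
        bottom_levels(t, NTASKS, blevel);
        for (int i = 0; i < NTASKS; i++)
            printf("task %d: bottom level %.1f\n", i, blevel[i]);
        return 0;
    }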


2. Parafrase-2 [2] [3], 1990 - is a project aimed at developing a source-to-source multilingual restructuring compiler. It provides a reliable, portable and efficient research tool for experimentation with program transformations and other compiler techniques for parallel supercomputers. It takes as input a sequential code (in different source languages, such as C or FORTRAN) and, by the use of a preprocessor, transforms each input language into an intermediate representation. Symbolic dependence analysis and interprocedural analysis are performed to detect the parallelism, followed by an auto-scheduling (at coarse-grain level) technique that results in a self-driven program (parallel C or parallel FORTRAN) where the code for the scheduling of tasks is generated by the compiler. Although the partitioning phase is carried out at compile time, task scheduling is considered to be a dynamic process. Likewise, a postprocessor (code generator) is used for each language desired as output. It also provides a GUI for displaying compiler information to the user. Parafrase-2 is a useful tool for developing new techniques in parallelizing compiler design.


3. OREGAMI [4], 1991 - is a project involving the design, implementation, and testing of algorithms for mapping parallel computations to message-passing parallel architectures. It addresses the mapping problem by exploiting regularity and by allowing the user to guide and evaluate mapping decisions made by OREGAMI's efficient combinatorial mapping algorithms. Its approach to mapping is based on a new graph theoretic model of parallel computation called the Temporal Communication Graph (TCG). OREGAMI is a set of tools that includes: the LaRCS compiler (LaRCS is a graph description language; its compiler translates textual user task descriptions into specialized Temporal Communication Graphs), the MAPPER tool (a library of mapping algorithms which utilize information provided by LaRCS to perform contraction, embedding and routing; it is also a tool for mapping tasks onto a variety of target architectures), and the METRICS tool (an interactive graphics tool for display and analysis of mappings and their performance). It is designed for use as a front-end mapping tool in conjunction with parallel programming languages that support explicit message passing, such as OCCAM, C*, Dino, Par, C and FORTRAN with communication extensions. Only symbolic mapping directives are generated, and no target code.


4. PYRROS [5] [6], 1992 - is a parallel programming tool for automatic scheduling of static task graphs and generation of the appropriate target code for message passing MIMD architectures. It takes as input a task graph and its associated sequential C code, and outputs a static schedule and a parallel C code for message-passing architectures (such as nCUBE-2 and INTEL-2). It consists of a task graph language (with an interface to C) that allows users to define partitioned programs and data, a scheduling system which uses only the DSC (Dominant Sequence Clustering) algorithm for clustering the task graph, performing load balanced mapping and computation/communication ordering, an X-Windows based graphic displayer for displaying Gantt charts for task graphs and scheduling results, and a code generator that automatically inserts synchronization primitives and performs parallel code optimization for the target architecture.


5. Parallax [7], 1993 - incorporates seven classical scheduling heuristics, providing an effective way for developers to find out how the schedulers affect program performance on various parallel architectures. Of the seven heuristics, two simple ones consider only task execution time, two consider both task execution and message-passing delay times, two use task duplication to reduce communication delay, and one considers communication delays, task execution time, and target machine characteristics such as interconnection network topology and overhead due to message passing and process creation. Developers must provide the input as a task graph and estimate the tasks' execution times (or obtain them by running the program). They also must express the target machine as an interconnection topology graph (with annotations for processing and message speeds). It generates Gantt chart schedules, speedup graphs, processor and communication use/efficiency charts, but no executable code. Additionally, an animated display of the simulated running program is provided to help users evaluate the differences among the scheduling heuristics used.


6. PARSA [8], 1994 - (PARallel program Scheduling and Assessment) is a software tool developed for automatic partitioning and scheduling of parallel programs on multiprocessor systems. The PARSA prototype is composed of four tools: an Application Specification Tool (converts a sequential SISAL program into a DAG, represented in textual form by the IF1 acyclic graphical language; it also attaches execution delays to each application task and intertask link, using delay information from the Architecture Specification Tool), an Architecture Specification Tool (used for target system specification in graphical form, and generates the execution delay for each task), a Partitioning and Scheduling Tool (used to partition fine-grain program instructions into coarse-grain tasks and schedule/map those tasks on the target architecture, in order to achieve a minimal execution time with as small a communication delay overhead as possible; it employs the HNF, LC and LCTD partitioning and scheduling techniques), and a Performance Assessment Tool (displays the expected runtime behavior of the scheduled program at compile-time). These tools are integrated into an interactive environment that is portable to any UNIX workstation that supports Motif and X Windows. PARSA does not generate any target code.


7. DF [9] [10], 1994 - (Distributed Filaments) is a software kernel, using multithreaded scheduling for overlapping communication and computation. It uses the Filaments package: a library of C code that runs on several different shared- and distributed-memory machines (meaning that the same user code, C plus Filaments calls, runs unchanged on either type of machine). It has been designed to support parallel scientific applications, and is intended to be a target for a compiler. It is a fine-grain parallelizing system. As input, it takes a sequential C code, and the resulting application code is written (not automatically) in C plus Filaments calls for distributed shared memory systems. It uses UDP communication and an efficient reliability protocol to create Packet (a low overhead reliable datagram communication package, with two types of messages: request and reply).


8. CompSys HPF & FM [11], 1994 - (High Performance FORTRAN and FORTRAN M) is an integrated task/data parallel programming system that allows the use of FM to coordinate concurrent HPF computations. HPF and FM procedures are clearly distinguished and are compiled with the HPF or FM compiler, respectively. Simple interface routines are used to coordinate the two programming models. FM processes are used to encapsulate data-parallel HPF computations, and virtual computers are used to control the allocation of computational resources to HPF computations. It is divided into three parts: a Translator (translates HPF to F77 plus calls to communication libraries: Nexus or Express), a Communication Library (currently Express is used) and Intrinsics (a set of routines acting on vector data elements). The compiler partitions the data and computation on the processors (using the decomposition, align and distribute directives given in the HPF program), then detects and generates communication, using pattern matching to detect commonly occurring communication patterns and generating the equivalent communication calls to the communication libraries; finally the code generator produces loosely synchronous SPMD code, structured as alternating phases of local computation and global communications. Therefore, the processes do not need to synchronize during local computations.
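
The loosely synchronous SPMD structure produced by such a code generator can be pictured with the C and MPI sketch below; it is a hand-written illustration of the pattern, not output of the HPF/FM system, and the problem sizes and loop bodies are arbitrary.

    /* Generic sketch of loosely synchronous SPMD code (C + MPI), not actual
     * output of the HPF/FM compilation system. Each process alternates a local
     * computation phase (no synchronization) with a global communication phase. */
    #include <mpi.h>
    #include <stdio.h>

    #define LOCAL_N 1000
    #define STEPS   10

    int main(int argc, char **argv) {
        double local[LOCAL_N], partial, global = 0.0;
        int rank, size, i, step;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (i = 0; i < LOCAL_N; i++)
            local[i] = rank + i * 0.001;

        for (step = 0; step < STEPS; step++) {
            /* Local computation phase: each process works only on its own data. */
            partial = 0.0;
            for (i = 0; i < LOCAL_N; i++) {
                local[i] = local[i] * 0.5 + 1.0;
                partial += local[i];
            }
            /* Global communication phase: all processes combine their results. */
            MPI_Allreduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("final global sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }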


9. Paradigm [12] [13], 1995 - (Parallelizing Compiler for Distributed-Memory, General-Purpose Multicomputers) is a flexible compiler framework that accepts a sequential F77/HPF program and produces an explicit message-passing version (i.e., an F77 program with calls to the selected message-passing library and the runtime system). It has the following components: Program Analysis (done with Parafrase-2; it parses the sequential program into an intermediate representation and analyzes the code to generate flow, dependence and call graphs), Automatic Data Partitioning (for regular computations, the compiler automatically determines the best static distribution of program data, using the owner-computes rule; for irregular computations, compile-time analysis and flexible, irregular runtime support are used), Task-Graph Synthesis (to exploit both data and functional parallelism), Multithreading (running multiple threads on each processor to overlap computation and communication) and a Generic Library Interface (for each supported library, abstract functions are mapped at compile time to corresponding library-specific code generators).
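
Under the owner-computes rule, each processor executes exactly those assignments whose left-hand-side elements it owns according to the chosen data distribution. The following C and MPI sketch illustrates the idea for a simple block distribution; it is a generic example, not code emitted by Paradigm, and the block_range helper is an assumption made for the illustration.

    /* Generic sketch of the owner-computes rule for a block-distributed array,
     * written in C with MPI; an illustration, not code emitted by the Paradigm
     * compiler. Process `rank` owns global elements [lo, hi) of an N-element
     * array and executes only the iterations that assign to them. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000

    /* Hypothetical helper: compute the block of global indices owned by `rank`. */
    static void block_range(int n, int rank, int size, int *lo, int *hi) {
        int base = n / size, rem = n % size;
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(int argc, char **argv) {
        int rank, size, lo, hi, i;
        double *a;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        block_range(N, rank, size, &lo, &hi);
        a = malloc((size_t)(hi - lo) * sizeof(double));

        /* Owner-computes: the global loop `for (i = 0; i < N; i++) a[i] = f(i);`
         * shrinks to the locally owned index range on each process. */
        for (i = lo; i < hi; i++)
            a[i - lo] = 2.0 * i + 1.0;

        printf("rank %d computed global indices [%d, %d)\n", rank, lo, hi);
        free(a);
        MPI_Finalize();
        return 0;
    }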



10. ProcSimity [14], 1995 - is a software simulation tool that provides a convenient environment for performance analysis of processor allocation and scheduling algorithms. It supports both stochastic independent user job streams as well as communication patterns from actual parallel applications, including several of the NAS parallel benchmarks. It takes as input independent user job streams and outputs simulation trace files. It has two components: a Multicomputer Simulator (supports experimentation with selected allocation and scheduling algorithms on architectures with a range of network topologies and for several current routing and flow control mechanisms; it is made up of a job scheduler and a processor allocator; it employs five different scheduling and twelve allocation strategies) and a Performance Analysis/Visualization Tool (allows the user to view a dynamic animation of the selected algorithms as well as a variety of system and job level performance metrics; it is comprised of several control windows helping to observe the animation: the topology, SimInfo, textual dynamic event display, system metrics, message blocking time and system averages windows).


11. RIPS [15], 1996 - (Runtime Incremental Parallel Scheduling) is an incremental parallel scheduling strategy where the system scheduling activity alternates with the underlying computation work. Tasks are incrementally generated and scheduled in parallel. RIPS schedules incoming tasks incrementally (using an eager and a lazy policy), followed by parallel execution of the scheduling algorithm (using the Tree Walking Algorithm for the tree topology, in which the communication network is tree-structured and each node of the tree represents a processor, and its modified version).

The RIPS system paradigm assumes the SPMD programming model, and starts with a system phase (which schedules initial tasks), followed by a user computation phase (to execute the scheduled tasks and possibly generate new tasks). Then, in a second system phase, the old tasks that have not been executed are scheduled together with the newly generated tasks. This process repeats iteratively until the entire computation is completed. RIPS combines the advantages of static and dynamic scheduling, adapting to dynamic problems and producing high-quality schedules. It balances the load very well and effectively reduces processor idle time.
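
The alternation between system (scheduling) phases and user (computation) phases can be summarized by the C sketch below; it is a structural illustration only, with a trivial task pool standing in for the real system, and every name in it is a placeholder rather than part of the RIPS interface.

    /* Structural sketch (not RIPS source code) of scheduling phases alternating
     * with computation phases. A trivial task pool stands in for the real
     * system; all names here are invented for the illustration. */
    #include <stdio.h>

    #define MAX_TASKS 64

    static int pool[MAX_TASKS];   /* task "descriptors": here just workload sizes */
    static int pool_size = 0;

    static void add_task(int work) {
        if (pool_size < MAX_TASKS)
            pool[pool_size++] = work;
    }

    int main(void) {
        int batch[MAX_TASKS];
        int round = 0;

        add_task(4);              /* initial tasks */
        add_task(3);

        while (pool_size > 0) {
            /* System phase: "schedule" all currently known tasks (old, not yet
             * executed ones together with newly generated ones). */
            int n = pool_size;
            for (int i = 0; i < n; i++)
                batch[i] = pool[i];
            pool_size = 0;

            /* User computation phase: execute the scheduled batch; executing a
             * task may generate new, smaller tasks for the next system phase. */
            for (int i = 0; i < n; i++) {
                printf("round %d: executing task of size %d\n", round, batch[i]);
                if (batch[i] > 1)
                    add_task(batch[i] - 1);
            }
            round++;
        }
        return 0;
    }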



12. MARS [16], 1996 - is an auto-parallelizing compiler that uses loop generation techniques to enhance program performance. It is targeted at single address space architectures, such as the KSR-1, Meiko CS-2 and SGI Power Challenge. It is built on top of Sage++ and employs the PIP (Parameter Integer Programming) loop generation technique. It has three phases: pre-partitioning (parallelism detection, significant regions), partitioning (global alignment, global data partitioning) and post-partitioning (local realignment, work scheduling, synchronization placement, mapping, hierarchical locality optimization, loop optimization). It takes as input a sequential F77 code (that is read and stored as an annotated syntax tree in Sage++, then translated into an augmented affine form) and, after the three phases that operate on (transform and update) the linear algebraic representation of the program, it generates machine-specific parallel FORTRAN code.


13. PROMIS [17], 1997 - is a parallelizer/code generator prototype that uses HTGs (hierarchical task graphs) as a unified, multilevel program representation to allow loop-level and instruction-level parallelization techniques. PROMIS is a multi-source, multi-target parallelizing compiler in which the front-end and the back-end are integrated via a single unified representation, thus providing several important opportunities for high-level/low-level interactions and trade-offs that are either very difficult or impossible to do effectively in conventional compilers. Its input is a sequential C/FORTRAN code and, by use of the Parafrase-2 front-end (for the coarse-grain level: parsing, symbolic and dependence analysis, and source-level parallelization), it produces C code with parallel directives and Semantics Retention Assertions, for use by the EVE Mutation Scheduling back-end compiler, which performs the parallelization process at instruction level based on the source-level information and finally generates parallel VLIW machine code. The prototype compiler generates code for a simulated shared-memory multiprocessor wherein each processor is a pipelined VLIW.


14. CASCH [18], 1997 - provides an integrated programming environment that includes parallelization, partitioning, scheduling, mapping, communication, synchronization, code generation and performance evaluation, for performing automatic parallelization and scheduling of applications without relying on simulations. It includes an extensive library of state-of-the-art scheduling algorithms, organized into different categories that are suitable for different architectural environments. Using CASCH, the user first writes a sequential program, from which a task graph (or macro-dataflow graph) is generated with the help of a lexical analyzer and parser. The task graph generator inserts the weights on the tasks and edges with the help of an estimator, which computes them using a database that contains the timing of various computation, communication, and I/O operations for different machines. The next step is partitioning of the problem into tasks and scheduling (using algorithms from the library) the task graph generated based on this partitioning. Possible partitioning schemes include: block, cyclic, block-cyclic pattern, and, for irregular problems, partitioning into many tasks for load balancing. The scheduling and mapping algorithms are used to schedule the task graph generated from the user program. CASCH has a GUI for analyzing various scheduling and mapping algorithms. The best schedule generated by an algorithm can be used by the code generator to generate a parallel program for a particular machine, a process that can be repeated for a different machine.

At this point, CASCH maps (allocates) each task to a processor of the target machine, and the communication inserter automatically inserts communication primitives (send or reply), to reduce the programmer's burden and eliminate insertion errors. After the communication primitives are properly inserted, the parallel code is generated by including appropriate library procedures from a standard package (such as the NX of the Intel Paragon). In order to reduce the number of message transfers and the time to initiate messages, CASCH packs and sends several messages together. The user can compile the resulting parallel program to generate native parallel machine code on the target machine (for example, the IBM SP2) for testing. Because CASCH automatically inserts some statements for collecting runtime statistics, the user can use CASCH to parse these statistics and generate runtime profiles to visualize the behavior of different parts of the program. By repeating the whole design process again, the user can improve his/her program.


15. Hypertool/2 [19], 2000 - is a runtime scheduling system. A CDAG (Compact Directed Acyclic Graph) is generated at compile-time, which is expanded incrementally into a DAG at runtime, then scheduled and executed on a parallel machine. The incremental execution model starts with a system phase (to generate and schedule only a part of the DAG), and is then followed by a user computation phase (to execute the scheduled tasks). The processing elements (PEs) execute until most tasks have been completed and transfer to the next system phase to generate and schedule the next part of the DAG. The PEs that execute a parallel scheduling algorithm are called the physical PEs (PPEs) in order to distinguish them from the target PEs (TPEs) to which the DAG is to be scheduled. The parallel scheduler uses the HPMCP (Horizontal Parallel MCP) algorithm, which works as follows: (1) The graph is partitioned along the time domain (horizontally), and each partition is assigned to a PPE. (2) Each PPE applies the MCP algorithm to its partition to produce a sub-schedule, ignoring the edges between a node and its remote parent nodes. The first node in the list is scheduled to the TPE that allows the earliest start time. This node is deleted from the list and the scheduling step is repeated until the list is empty. (3) Each pair of adjacent sub-schedules is concatenated. Using Hypertool/2, DAGs can be scheduled in parallel and incrementally executed at runtime.


16. SADS & DBSADS [20], 2000 - SADS (Self-Adjusting Dynamic Scheduling) is a class of centralized dynamic scheduling algorithms that simultaneously address load balancing and memory locality while explicitly accounting for the time spent on scheduling. The SADS algorithms overlap the scheduling process with the execution of tasks to mask the overhead associated with the scheduling process. They also reduce the bottleneck effect of centralized scheduling by proactively assigning tasks to working processors while those processors are performing their previously assigned tasks. SADS, similarly to the branch-and-bound algorithm, searches through a space of all possible partial and complete schedules. However, SADS is significantly faster than the original branch-and-bound algorithm, mainly due to its ability to divide the solution space into smaller clusters, each of which is relevant to only the current scheduling phase. SADS's unified cost model accounts for memory access and task processing costs. DBSADS (Depth-Bound SADS) is a heuristic variation of SADS that combines an online version of depth-first search with the original branch-and-bound SADS algorithm to give priority to examining nodes that reside at a higher depth in the tree (i.e., closer to the leaves). In effect, DBSADS gives higher priority to examining partial schedules in which a larger number of tasks have been assigned. This bias in the order of examining the nodes aims at reducing the branching ratio of the basic SADS algorithm, to limit examination of partial schedules with fewer tasks, which naturally incur lower costs. DBSADS can be regarded as a gradient-descent algorithm that uses SADS's unified cost model to schedule, during each iteration, a task at a deeper level of the search space compared to the previous iteration. Both categories of algorithms take as input a task tree and output the scheduled tasks.
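
The flavor of this search can be conveyed by the following C sketch, which performs a small branch-and-bound over partial schedules using the makespan as a stand-in cost; it is a generic illustration, not the authors' algorithm or cost model, and the DEPTH_BOUND_FIRST switch only mimics the depth bias of DBSADS.

    /* Generic branch-and-bound sketch over partial schedules, illustrating the
     * kind of search SADS/DBSADS perform; not the authors' algorithm or cost
     * model. A partial schedule assigns the first `depth` tasks to processors;
     * its cost here is simply the largest processor load. */
    #include <stdio.h>
    #include <string.h>

    #define NTASKS  4
    #define NPROCS  2
    #define MAXNODE 1024
    #define DEPTH_BOUND_FIRST 1   /* 1: prefer deeper nodes, as DBSADS does */

    static const double task_cost[NTASKS] = { 3.0, 2.0, 2.0, 1.0 };

    typedef struct {
        int    depth;             /* number of tasks already assigned */
        double load[NPROCS];      /* accumulated load per processor   */
        int    assign[NTASKS];    /* assign[i] = processor of task i  */
    } Node;

    static double makespan(const Node *n) {
        double m = 0.0;
        for (int p = 0; p < NPROCS; p++)
            if (n->load[p] > m) m = n->load[p];
        return m;
    }

    int main(void) {
        static Node frontier[MAXNODE];
        Node best;
        int nfrontier = 1;
        double best_cost = 1e30;

        memset(&frontier[0], 0, sizeof(Node));   /* start from the empty schedule */
        memset(&best, 0, sizeof(Node));

        while (nfrontier > 0) {
            /* Select the next node: lowest cost, optionally deeper-first. */
            int sel = 0;
            for (int i = 1; i < nfrontier; i++) {
                int deeper  = frontier[i].depth > frontier[sel].depth;
                int same    = frontier[i].depth == frontier[sel].depth;
                int cheaper = makespan(&frontier[i]) < makespan(&frontier[sel]);
                if (DEPTH_BOUND_FIRST ? (deeper || (same && cheaper)) : cheaper)
                    sel = i;
            }
            Node cur = frontier[sel];
            frontier[sel] = frontier[--nfrontier];

            if (makespan(&cur) >= best_cost)     /* prune dominated schedules */
                continue;
            if (cur.depth == NTASKS) {           /* complete schedule found   */
                best = cur;
                best_cost = makespan(&cur);
                continue;
            }
            /* Branch: assign the next task to every processor in turn. */
            for (int p = 0; p < NPROCS && nfrontier < MAXNODE; p++) {
                Node child = cur;
                child.assign[child.depth] = p;
                child.load[p] += task_cost[child.depth];
                child.depth++;
                frontier[nfrontier++] = child;
            }
        }

        printf("best makespan %.1f, assignment:", best_cost);
        for (int i = 0; i < NTASKS; i++)
            printf(" t%d->P%d", i, best.assign[i]);
        printf("\n");
        return 0;
    }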



17. ParAgent [21], 2002 - (Parallelization Agent) is a domain-specific, interactive automatic parallelization tool, used to diagnose, analyze and parallelize serial programs using the domain decomposition method. It is applicable to time-marching explicit FDM (finite difference method) programs, mostly found in Numerical Weather Prediction (NWP) models. The input is an F77 serial program and the output is a parallel F77 program with primitives for message passing, code that is portable to distributed memory platforms. The target architecture is a cluster of PCs. It has an interactive GUI to help the user. On launching the application, it shows the call order tree and the block-level abstract syntax tree (AST), and it provides a variable tracking facility to categorize and analyze a multitude of variables. After parallelization, the GUI can be used to see the synchronization/exchange points and the communication patterns using stencil-exchange diagrams, and it provides a block-level view to show how the parallel program relates to the serial program. Using ParAgent, the parallelization of an FDM program involves three major steps: diagnostic (the serial program is interactively analyzed, and the user has to fix specific problems using the auxiliary information provided by ParAgent), communication analysis and optimization (data flow analysis is performed to identify the synchronization/exchange points, the variables to be communicated and their communication patterns) and automatic code generation (involves changes in array declarations to incorporate decomposition, global-to-local index transformation and insertion of communication primitives). ParAgent can perform either 1-D (along the J direction) or 2-D (along the I and J dimensions) parallelization.
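
The global-to-local index transformation mentioned above can be illustrated with the short C sketch below, which splits the global J range into contiguous blocks and maps a global index to an owner rank and a local index; this is a generic illustration of the concept, not ParAgent-generated code.

    /* Generic sketch of the global-to-local index transformation used in 1-D
     * domain decomposition (an illustration, not ParAgent output). The global
     * J range [0, NJ) is split into contiguous blocks; a global index j is
     * translated into (owner rank, local index). */
    #include <stdio.h>

    #define NJ     100   /* global extent along the decomposed J dimension */
    #define NPROCS 4

    static int block_size(void) {            /* ceiling division */
        return (NJ + NPROCS - 1) / NPROCS;
    }

    static int owner_of(int j_global) {      /* rank that owns global index j */
        return j_global / block_size();
    }

    static int to_local(int j_global) {      /* local index on the owner rank */
        return j_global - owner_of(j_global) * block_size();
    }

    int main(void) {
        int samples[] = { 0, 24, 25, 73, 99 };
        for (int i = 0; i < 5; i++) {
            int j = samples[i];
            printf("global j=%2d -> rank %d, local j=%2d\n",
                   j, owner_of(j), to_local(j));
        }
        return 0;
    }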


18. PETSc [22], 2003 - (Portable, Extensible Toolkit for Scientific computation) is a powerful set of tools for the numerical solution of partial differential equations (PDEs) and related problems on high-performance computers. It consists of a variety of libraries (similar to classes in C++), each of which manipulates a particular family of objects (for instance, vectors) and the operations one would like to perform on the objects. Some of the PETSc modules deal with: index sets (including permutations, for indexing into vectors, renumbering, etc.), vectors, matrices (generally sparse), distributed arrays (useful for parallelizing regular grid-based problems), Krylov subspace methods, preconditioners (including multigrid and sparse direct solvers), nonlinear solvers, and timesteppers for solving time-dependent (nonlinear) PDEs. Each module consists of an abstract interface (simply a set of calling sequences) and one or more implementations using particular data structures.
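
To give a flavor of the PETSc programming style, the following minimal C example creates a distributed vector, fills it and prints its norm; it is written against the calling sequences of recent PETSc releases, which differ in detail from the 2003 version described here, so it should be read as a sketch rather than version-exact code.

    /* Minimal PETSc example: create a parallel vector, set all entries, and
     * report its norm. Written against recent PETSc calling sequences (they
     * have changed over the years, so details may differ from the 2003 release). */
    #include <petscvec.h>

    int main(int argc, char **argv) {
        Vec       x;
        PetscReal norm;

        PetscInitialize(&argc, &argv, NULL, NULL);

        VecCreate(PETSC_COMM_WORLD, &x);     /* abstract vector object         */
        VecSetSizes(x, PETSC_DECIDE, 100);   /* PETSc decides the local split  */
        VecSetFromOptions(x);                /* pick implementation at runtime */
        VecSet(x, 1.0);                      /* x_i = 1 for all i              */
        VecNorm(x, NORM_2, &norm);

        PetscPrintf(PETSC_COMM_WORLD, "||x|| = %g\n", (double)norm);

        VecDestroy(&x);
        PetscFinalize();
        return 0;
    }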


19. SUIF [24], 2003 - (Stanford University Intermediate Format) is a compiler infrastructure designed to support collaborative research and development of compilation techniques, based upon a program representation also called SUIF. The emphasis is to maximize code reuse by providing useful abstractions and frameworks for developing new compiler passes and by providing an environment that allows compiler passes to inter-operate. The SUIF system has a simple and modular architecture that comprises a small kernel, which implements a set of basic functions found to be useful across all compilation passes, a number of modules loaded dynamically under user control, and a driver that controls the system operation. The kernel consists of two layers: the IO kernel (implements a persistent object system that is independent of the application of writing compilers) and the SUIF kernel (defines and implements the SUIF compiler environment, SuifEnv, that is all the user needs to know when writing a SUIF program). The system comes with a number of basic modules, as well as some tools to help users construct their own modules. Modules can be of two kinds: a set of nodes in the intermediate representation, or a program analysis pass. To create a compiler or a standalone pass, the user needs to supply a "main" program (called a driver) that creates the SuifEnv, imports the relevant modules, loads a SUIF program, applies a series of transformations on the program and eventually writes out the information.


20. Cronus/1 [25] [26], 2003 - is a semi-automatic parallelization and code generation tool for distributed message passing architectures. It employs a dynamic scheduling methodology called SDS (Successive Dynamic Scheduling) that schedules the tasks at runtime. Cronus/1 has three stages: User Input (the user inputs a serial program), Compilation Time, and Run Time. In the Compilation Time stage, the first phase is called Parallelism Detection, where the tool checks whether nested loops exist. If no nested loop can be found in the sequential program, the tool stops. In the second phase (Parameters Extraction), the sequential program is parsed and the essential parameters for scheduling and parallel code generation are extracted. At this point, the available number of processors is required as input. In the final stage, Run Time, the appropriate parallel code is generated (with the help of an automatic code generator written in Perl) for the given number of processors. The resulting parallel code is written in C and contains runtime routines for SDS and MPI primitives for data communication (whenever communication is necessary); it is eligible for compilation and execution on the multicomputer at hand.
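
The generated code follows the usual C and MPI pattern sketched below; this is an illustration of the kind of output described, not actual Cronus/1 output, and the SDS runtime routines are only hinted at in a comment.

    /* Illustrative skeleton of C + MPI code of the kind a code generator such
     * as Cronus/1 emits; not actual generated output, and the SDS runtime
     * calls are only hinted at in comments. Each process computes a block of
     * iterations and neighbouring processes exchange a boundary value. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 64

    int main(int argc, char **argv) {
        int rank, size, lo, hi, i;
        double local_result = 0.0, boundary = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* (In the real tool, runtime SDS routines would decide at this point
         *  which loop instances each processor executes.) Here: a fixed block. */
        lo = rank * N / size;
        hi = (rank + 1) * N / size;

        for (i = lo; i < hi; i++)
            local_result += (double)i;

        /* Data communication primitives inserted where dependences require it:
         * send the last local value forward, receive from the previous rank. */
        if (rank + 1 < size)
            MPI_Send(&local_result, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        if (rank > 0)
            MPI_Recv(&boundary, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status);

        printf("rank %d: iterations [%d,%d), result %.1f, received %.1f\n",
               rank, lo, hi, local_result, boundary);

        MPI_Finalize();
        return 0;
    }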

The tool is available by request from the authors. For more information, please refer to the following URL: http://www.cslab.ece.ntua.gr/~cflorina/research, where the reader can also find a "Brief User's Guide" for Cronus/1.






Conclusion


This document briefly describes the most relevant parallelizing and/or code generation tools, which have been published since 1990. The reader interested in a particular tool is referred to the corresponding referenced paper(s) or URL(s).





References:


[1] Min-You Wu, Daniel Gajski, "Hypertool: A Programming Aid for Message-Passing Systems", IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 3, pp. 330-343, 1990

[2] C.D. Polychronopoulos, M.B. Girkar, M.R. Haghighat, C.L. Lee, B.P. Leung, D.A. Schouten, "The Structure of Parafrase-2: An Advanced Parallelizing Compiler for C and FORTRAN", Languages and Compilers for Parallel Computing (LCPC), MIT Press, 1990

[3] Parafrase-2 Homepage, URL: http://www.csrd.uiuc.edu/parafrase2/index.html, 1995

[4] Virginia M. Lo, Sanjay Rajopadhye, Samik Gupta, David Keldsen, Moataz A. Mohamed, Bill Nitzberg, Jan Arne Telle, Xiaoxiong Zhong, "OREGAMI: Tools for mapping parallel computations to parallel architectures", International Journal of Parallel Programming, vol. 20, no. 3, pp. 237-270, June, 1991

[5] T. Yang, A. Gerasoulis, "PYRROS: Static Task Scheduling and Code Generation for Message Passing Multiprocessors", Proceedings of the 1992 ACM International Conference on Supercomputing, Washington DC, 1992

[6] Pyrros, URL: http://www.cs.rutgers.edu/pub/gerasoulis/pyrros/, 1993

[7] T.G. Lewis, H. El-Rewini, "Parallax: A tool for parallel program scheduling", IEEE Parallel and Distributed Technology: Systems and Applications, vol. 1, no. 2, pp. 62-72, 1993

[8] Behrooz Shirazi, Krishna Kavi, A.R. Hurson, Prasenjit Biswas, "PARSA: A Parallel Program Scheduling and Assessment Environment", Proceedings of the 1993 International Conference on Parallel Processing, vol. II (Software), CRC Press, August, 1993

[9] Vincent W. Freeh, David K. Lowenthal, Gregory R. Andrews, "Distributed Filaments: Efficient Fine-Grain Parallelism on a Cluster of Workstations", Symposium on Operating Systems Design and Implementation (OSDI), pp. 201-213, 1994

[10] The Filaments Research Group, "The FILAMENTS Project", URL: http://www.cs.arizona.edu/people/filament/, 1994

[11] Ian Foster, Ming Xu, Bhaven Avalani, Alok Choudhary, "A Compilation System that Integrates High Performance FORTRAN and FORTRAN M", Proceedings of the 1994 Scalable High Performance Computing Conference, IEEE Computer Society Press, July, 1994

[12] P. Banerjee, J.A. Chandy, M. Gupta, E.W. Hodges IV, J.G. Holm, A. Lain, D.J. Palermo, S. Ramaswamy, E. Su, "The Paradigm Compiler for Distributed-Memory Multicomputers", The Computer Journal, vol. 28, no. 10, pp. 34-47, October, 1995

[13] The PARADIGM Compiler, "PARADIGM: A Parallelizing Compiler for Distributed Memory Message-Passing Multicomputers", URL: http://www.ece.northwestern.edu/cpdc/Paradigm/Paradigm.html or http://www.crhc.uiuc.edu/Paradigm/, 1995

[14] K. Windisch, J. Miller, V. Lo, "ProcSimity: an experimental tool for processor allocation and scheduling in highly parallel systems", Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, February, 1995

[15] W. Shu, M.-Y. Wu, "Runtime Incremental Parallel Scheduling (RIPS) on Distributed Memory Computers", IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 6, pp. 637-649, 1996

[16] Z.S. Chamski, M.F.P. O'Boyle, "Practical Loop Generation", Proceedings of the 29th Hawaii International Conference on System Sciences (HICSS'96), vol. 1: Software Technology and Architecture, January, 1996

[17] C.J. Brownhill, A. Nicolau, S. Novack, C.D. Polychronopoulos, "The PROMIS Compiler Prototype", IEEE Conference on Parallel Architectures and Compilation Techniques (PACT '97), pp. 116-125, 1997

[18] I. Ahmad, Y.-K. Kwok, M.-Y. Wu, W. Shu, "Automatic Parallelization and Scheduling of Programs on Multiprocessors using CASCH", Proceedings of the International Conference on Parallel Processing (ICPP), pp. 288-291, CRC Press, August, 1997

[19] M.-Y. Wu, W. Shu, Y. Chen, "Runtime Parallel Incremental Scheduling of DAGs", Proceedings of the 2000 International Conference on Parallel Processing (ICPP-2000), Toronto, Canada, pp. 541-548, IEEE Press, 2000

[20] B. Hamidzadeh, L.Y. Kit, D.J. Lilja, "Dynamic Task Scheduling Using Online Optimization", IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 11, pp. 1151-1163, November, 2000

[21] S.C. Kothari, J. Cho, Y. Deng, "Software Tools and Parallel Computing for Numerical Weather Prediction Models", Proceedings of the International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops, IEEE Press, April, 2002

[22] S. Balay, K. Buschelman, V. Eijkhout, W. Gropp, D. Kaushik, M. Knepley, L. Curfman McInnes, B. Smith, H. Zhang, "The PETSc User's Manual", URL: http://www-unix.mcs.anl.gov/petsc/petsc-2/snapshots/petsc-current/docs/manual.pdf, 2003

[24] The SUIF Group, "The SUIF Compiler System", URL: http://suif.stanford.edu/, 1999-current

[25] Florina M. Ciorba, Theodore Andronikos, Dimitris Kamenopoulos, Panagiotis Theodoropoulos, George Papakonstantinou, "Simple code generation for special UDLs", Proceedings of the 1st Balkan Conference in Informatics (BCI'03), pp. 466-475, Thessaloniki, Greece, November 21-23, 2003

[26] Computing Systems Laboratory, "Description of Cronus", URL: http://www.cslab.ece.ntua.gr/~cflorina/research, 2004