Models for Parallel Computing


In contrast to fork-join style execution, the programmer is responsible for mapping the parallel tasks in the program to the p available processors. Accordingly, the programmer has the responsibility for load balancing, while it is provided automatically by the dynamic scheduling in the fork-join style. While this is more cumbersome at program design time and limits flexibility, it leads to reduced runtime overhead, as dynamic scheduling is no longer needed. SPMD is used e.g. in the MPI message passing standard and in some parallel programming languages such as UPC.
Nested parallelism can be achieved with SPMD style as well, namely if a group of p processors is subdivided into s subgroups of p_i processors each, where Σ_i p_i ≤ p. Each subgroup takes care of one subtask in parallel. After all subgroups are finished with their subtask, they are discarded and the parent group resumes execution. As group splitting can be nested, the group hierarchy forms a tree at any time during program execution, with the leaf groups being the currently active ones.
This schema is analogous to exploiting nested parallelism in fork-join style if one regards the original group G of p processors as one p-threaded process, which may spawn s new p_i-threaded processes G_i, 1 ≤ i ≤ s, such that the total number of active threads is not increased. The parent process waits until all child processes (subgroups) have terminated and reclaims their threads.
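To illustrate, here is a minimal sketch (ours, not from the original text) of SPMD group splitting using MPI's communicator mechanism; the two-way split and the printed "subtask" are assumptions made for the example:

```c
#include <mpi.h>
#include <stdio.h>

/* Split MPI_COMM_WORLD into two subgroups that work on different
   subtasks in parallel, then discard them and resume in the parent group. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int color = (rank < size / 2) ? 0 : 1;   /* choose one of two subgroups */
    MPI_Comm subgroup;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subgroup);

    int subrank;
    MPI_Comm_rank(subgroup, &subrank);
    /* each subgroup would execute its subtask on its own communicator here */
    printf("world rank %d -> subgroup %d, subrank %d\n", rank, color, subrank);

    MPI_Comm_free(&subgroup);           /* subgroups are discarded          */
    MPI_Barrier(MPI_COMM_WORLD);        /* the parent group resumes         */
    MPI_Finalize();
    return 0;
}
```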
Parallel Random Access Machine
The Parallel Random Access Machine (PRAM) model was proposed by Fortune and Wyllie as a simple extension of the Random Access Machine (RAM) model used in the design and analysis of sequential algorithms. The PRAM assumes a set of processors connected to a shared memory. There is a global clock that feeds both processors and memory, and execution of any instruction (including memory accesses) takes exactly one unit of time, independent of the executing processor and the (possibly accessed) memory address. Also, there is no limit on the number of processors accessing shared memory simultaneously. The memory model of the PRAM is strict consistency, the strongest consistency model known, which says that a write in clock cycle t becomes globally visible to all processors in the beginning of clock cycle t+1, not earlier and not later.
The PRAM model also determines the effect of multiple processors writing or reading the same memory location in the same clock cycle. An EREW PRAM allows a memory location to be exclusively read or written by at most one processor at a time, the CREW PRAM allows concurrent reading but exclusive writing, and the CRCW PRAM even allows simultaneous write accesses by several processors to the same memory location in the same clock cycle. The CRCW model specifies in several submodels how such multiple accesses are to be resolved, e.g. by requiring that the same value be written by all (Common CRCW PRAM), by the priority of the writing processors (Priority CRCW PRAM), or by combining all written values according to some global reduction or prefix computation (Combining CRCW PRAM). A somewhat restricted form is the CROW PRAM, where each memory cell may only ever be written by one particular processor, called the cell's owner. In practice, many CREW algorithms really are CROW.
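As a small worked example (ours, not from the original text): on an EREW PRAM, the sum of n values can be computed by a balanced binary tree of pairwise additions using p = n/2 processors, where in each step every participating processor reads two distinct cells and writes one distinct cell, so no access conflicts arise:

$$ T_{\mathrm{par}}(n) = \lceil \log_2 n \rceil \ \text{steps}, \qquad \text{versus} \qquad T_{\mathrm{seq}}(n) = n - 1 \ \text{additions sequentially.} $$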
Practical Relevance. The PRAM model is unique in that it supports deterministic parallel computation, and it can be regarded as one of the most programmer-friendly models available. Numerous algorithms have been developed for the PRAM model, see e.g. the book by JáJá. Also, it can be used as a first model for teaching parallel algorithms, as it allows students to focus on pure parallelism only, rather than having to worry about data locality and communication efficiency already from the beginning.
The PRAM model, especially its cost model for shared memory access, has however been strongly criticized for being unrealistic. In the shadow of this criticism, several architectural approaches demonstrated that a cost-effective realization of PRAMs is nevertheless possible, using hardware techniques such as multithreading and smart combining networks: the NYU Ultracomputer, the SB-PRAM by Wolfgang Paul's group in Saarbrücken, the XMT by Vishkin, and ECLIPSE by Forsell. A fair comparison of such approaches with current clusters and cache-based multiprocessors should take into consideration that the latter are good for special-purpose, regular problems with high data locality, while they perform poorly on irregular computations. In contrast, the PRAM is a general-purpose model that is completely insensitive to data locality.
Partly as a reaction to the criticism about practicality, variants of the PRAM model have been proposed within the parallel algorithms theory community. Such models relax one or several of the PRAM's properties. These include asynchronous PRAM variants, the hierarchical PRAM (H-PRAM), the block PRAM, the queuing PRAM (QPRAM), and the distributed PRAM (DRAM), to name only a few. Even the BSP model, which we will discuss below, can be regarded as a relaxed PRAM, and actually was introduced to bridge the gap between idealistic models and actual parallel machines. On the other hand, an even more powerful extension of the PRAM was also proposed in the literature, the BSR (Broadcast with selective reduction).
Implementations. Several PRAM programming languages have been proposed in the literature, such as Fork. For most of them there is, beyond a compiler, a PRAM simulator available, and sometimes additional tools such as trace file visualizers or libraries for central PRAM data structures and algorithms. Also, methods for translating PRAM algorithms automatically for other models such as BSP or message passing have been proposed in the literature.
Unrestricted Message Passing
A distributed memory machine, sometimes called a message-passing multicomputer, consists of a number of RAMs that run asynchronously and communicate via messages sent over a communication network. Normally it is assumed that the network performs message routing, so that a processor can send a message to any other processor without consideration of the particular network structure. In the simplest form, a message is assumed to be sent by one processor executing an explicit send command and received by another processor with an explicit receive command (point-to-point communication). Send and receive commands can be either blocking, i.e. the processors get synchronized, or non-blocking, i.e. the sending processor puts the message in a buffer and proceeds with its program, while the message passing subsystem forwards the message to the receiving processor and buffers it there until the receiving processor executes the receive command. There are also more complex forms of communication that involve a group of processors, so-called collective communication operations, such as broadcast, multicast, or reduction operations. Also, one-sided communications allow a processor to perform a communication operation (send or receive) without the processor addressed being actively involved.
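As an illustration (a minimal sketch using the MPI C API; the payload and ranks are arbitrary choices of ours), blocking point-to-point communication looks as follows. Collective operations such as MPI_Bcast or MPI_Reduce follow the same pattern, with a single call executed by all processors of a group:

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends one integer to rank 1 via blocking point-to-point calls. */
int main(int argc, char **argv) {
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 42;
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* blocking send    */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                      /* blocking receive */
        printf("rank 1 received %d\n", x);
    }
    MPI_Finalize();
    return 0;
}
```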
The cost model of a message-passing multicomputer consists of two parts. The operations performed locally are treated as in a RAM. Point-to-point non-blocking communications are modelled by the LogP model, named after its four parameters. The latency L specifies the time that a message of one word needs to be transmitted from sender to receiver. The overhead o specifies the time that the sending processor is occupied in executing the send command. The gap g gives the time that must pass between two successive send operations of a processor, and thus models the processor's bandwidth to the communication network. The processor count P gives the number of processors in the machine. Note that by distinguishing between L and o, it is possible to model programs where communication and computation overlap. The LogP model has been extended to the LogGP model by introducing another parameter G that models the bandwidth for longer messages.
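As a worked example (ours, not from the original text): under LogP, a single one-word message is delivered after the sender's overhead, the network latency, and the receiver's overhead; a pipelined sequence of k one-word messages is limited by the gap g (assuming g ≥ o); and under LogGP a single k-word message uses the per-word bandwidth parameter G:

$$ T_{1\,\mathrm{word}} = 2o + L, \qquad T_{k\ \mathrm{msgs}} = 2o + (k-1)\,g + L, \qquad T_{k\,\mathrm{words}}^{\mathrm{LogGP}} = 2o + (k-1)\,G + L. $$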
Practical Relevance. Message passing models such as CSP (communicating sequential processes) have been used in the theory of concurrent and distributed systems for many years. As a model for parallel computing, it became popular with the arrival of the first distributed memory parallel computers in the late 1980s. With the definition of vendor-independent message-passing libraries, message passing became the dominant programming style on large parallel computers.
Message passing is the least common denominator of portable parallel programming. Message passing programs can quite easily be executed even on shared memory computers, while the opposite is much harder to perform efficiently. As a low-level programming model, unrestricted message passing gives the programmer the largest degree of control over the machine resources, including scheduling, buffering of message data, overlapping communication with computation, etc. This comes at the cost of code being more error-prone and harder to understand and maintain. Nevertheless, message passing is, as of today, the most successful parallel programming model in practice. As a consequence, numerous tools (e.g. for performance analysis and visualization) have been developed for this model.
Implementations. Early vendor-specific libraries were replaced in the early 1990s by portable message passing libraries such as PVM and MPI. MPI was later extended in the MPI-2 standard by one-sided communication and fork-join style. MPI interfaces have been defined for Fortran, C, and C++. Today, there exist several widely used implementations of MPI, including open-source implementations such as OpenMPI.
A certain degree of structuring in MPI programs is provided by the hierarchical group concept for nested parallelism, and by the communicator concept that allows creating separate communication contexts for parallel software components. A library for managing nested parallel multiprocessor tasks on top of this functionality has been provided by Rauber and Rünger. Furthermore, MPI libraries for specific distributed data structures such as vectors and matrices have been proposed in the literature.
Bulk-Synchronous Parallelism
The bulk-synchronous parallel (BSP) model, proposed by Valiant in 1990 and slightly modified by McColl, enforces a structuring of message passing computations as a (dynamic) sequence of barrier-separated supersteps, where each superstep consists of a computation phase operating on local variables only, followed by a global interprocessor communication phase. The cost model involves only three parameters: number of processors p, point-to-point network bandwidth g, and message latency resp. barrier overhead L, such that the worst-case (asymptotic) cost for a single superstep can be estimated as w + hg + L, if the maximum local work w per processor and the maximum communication volume h per processor are known. The cost for a program is then simply determined by summing up the costs of all executed supersteps.
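Written out (in our notation, which the text itself does not introduce), a program consisting of S supersteps with local work w_s and communication volume h_s in superstep s thus costs

$$ T \;=\; \sum_{s=1}^{S} \left( w_s + h_s\, g + L \right) \;=\; W + H g + S L, $$

where W and H denote the total work and total communication volume over all supersteps.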
An extension of the BSP model for nested parallelism by nestable supersteps was proposed by Skillicorn.
A variant of BSP is the CGM (coarse grained multicomputer) model proposed by Dehne, which has the same programming model as BSP, but an extended cost model that also accounts for aggregated messages in coarse-grained parallel environments.
Systolic computing is a pipelining-based parallel computing model involving a synchronously operating processor array (a so-called systolic processor array), where processors have local memory and are connected by a fixed, channel-based interconnection network. Systolic algorithms have been developed mainly in the 1980s, mostly for specific network topologies such as meshes, trees, pyramids, or hypercubes, see e.g. Kung and Leiserson. The main motivation of systolic computation is that the movement of data between processors is typically on a nearest-neighbor basis (so-called shift communications), which has shorter latency and higher bandwidth than arbitrary point-to-point communication on some distributed memory architectures, and that no buffering of intermediate results is necessary, as all processing elements operate synchronously.
A cellular automaton (CA) consists of a collection of finite state automata stepping synchronously, each automaton having as input the current state of one or several other automata. This neighbour relationship is often reflected by physical proximity, such as arranging the CA as a mesh. The CA model is a model for massively parallel computations with structured communication. Also, systolic computations can be viewed as a kind of CA computation. The restriction of nearest-neighbor access is relaxed in the global cellular automaton (GCA), where the state of each automaton includes an index of the automaton to read from in this step. Thus, the neighbor relationship can be time dependent and data dependent. The GCA is closely related to the CROW PRAM, and it can be efficiently implemented in reconfigurable hardware, i.e. field programmable gate arrays (FPGA).
In a very long instruction word (VLIW) processor, an instruction word contains several assembler instructions. Thus there is the possibility that the compiler can exploit instruction-level parallelism (ILP) better than a superscalar processor, by having knowledge of the program to be compiled. Also, the hardware to control execution in a VLIW processor is simpler than in a superscalar processor, thus possibly leading to a more efficient execution. The concept of explicitly parallel instruction computing (EPIC) combines VLIW with other architectural concepts, such as predication to avoid conditional branches, known from SIMD computing.
In stream processing, a continuous stream of data is processed, each element undergoing the same operations. In order to save memory bandwidth, several operations are interleaved in a pipelined fashion. As such, stream processing inherits concepts from vector and systolic computing.
Practical Relevance. Vector computing was the paradigm used by the early vector supercomputers in the 1970s and 1980s, and is still an essential part of modern high-performance computer architectures. It is a special case of the SIMD computing paradigm, which involves SIMD instructions as basic operations in addition to scalar computation. Most modern high-end processors have vector units extending their instruction set by SIMD/vector operations. Even many other processors nowadays offer SIMD instructions that can apply the same operation to 2, 4, or 8 subwords simultaneously, if the subword-sized elements of each operand and result are stored contiguously in the same word-sized register.
Systolic computing has been popular in the 1980s in the form of multi-unit coprocessors for high-performance computations. CA and GCA have found their niche for implementations in hardware. Also, with the relation to CROW PRAMs, the GCA could become a bridging model between high-level parallel models and efficient configurable hardware implementation. VLIW processors became popular in the 1980s, were pushed aside by the superscalar processors during the 1990s, but have seen a rebirth with Intel's Itanium processor. VLIW is today also a popular concept in high-performance processors for the digital signal processing (DSP) domain. Stream processing has current popularity because of its suitability for real-time processing of digital media.
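As an illustration of the subword SIMD instructions mentioned above (our sketch; the x86 SSE instruction set is an assumed target, not one named in the text), four single-precision additions are performed by a single instruction:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* x86 SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 packed floats into a register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* 4 subword additions in one instruction */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```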
Implementations. APL is an early SIMD programming language. Other SIMD languages include Vector C and C*. Fortran 90 supports vector computing and even a simple form of data parallelism. With the HPF extensions, it became a full-fledged data parallel language. Other data parallel languages include ZPL, NESL, Dataparallel C, and Modula-2*.
CA and GCA applications are mostly programmed in hardware description languages. Besides proposals for dedicated languages, the mapping of existing languages like APL onto those automata has been discussed. As the GCA is an active new area of research, there are no complete programming systems yet.
Clustered VLIW DSP processors such as the TI C6xx family allow parallel execution of instructions, yet apply additional restrictions on the operands, which is a challenge for optimizing compilers.
Early programming support for stream processing was available in the Brook language and the Sh library (Univ. of Waterloo). Based on the latter, RapidMind provides a commercial development kit with a stream extension for C++.
Task Parallel Models and Task Graphs
Many applications can be considered as a set of tasks, each task solving part of the problem at hand. Tasks may communicate with each other during their existence, or may only accept inputs as a prerequisite to their start, and send results to other tasks only when they terminate. Tasks may spawn other tasks in a fork-join style, and this may be done even in a dynamic and data dependent manner. Such collections of tasks may be represented by a task graph, where nodes represent tasks and arcs represent communication, i.e. data dependencies. The scheduling of a task graph involves ordering of tasks and mapping of tasks to processing units. Goals of the scheduling can be minimization of the application's runtime or maximization of the application's efficiency, i.e. of the machine resources. Task graphs can occur at several levels of granularity.
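As a small illustration (ours; expressed with OpenMP task dependencies, which the text does not prescribe, and with made-up task bodies), a three-node task graph in which task C consumes the outputs of tasks A and B can be written so that the runtime scheduler orders the tasks by their data dependences:

```c
#include <stdio.h>
#include <omp.h>

/* Illustrative task bodies; the names are assumptions for this sketch. */
static int produce_a(void)       { return 1; }
static int produce_b(void)       { return 2; }
static int consume(int a, int b) { return a + b; }

int main(void) {
    int a, b, c;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)   /* node A of the task graph */
        a = produce_a();
        #pragma omp task depend(out: b)   /* node B, independent of A */
        b = produce_b();
        /* node C: scheduled only after both A and B have finished */
        #pragma omp task depend(in: a, b) depend(out: c)
        c = consume(a, b);
        #pragma omp taskwait
    }
    printf("c = %d\n", c);
    return 0;
}
```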
While a superscalar processor must detect data and control flow dependencies from a linear stream of instructions, dataflow computing provides the execution machine with the application's dataflow graph, where nodes represent single instructions or basic instruction blocks, so that the underlying machine can schedule and dispatch instructions as soon as all necessary operands are available, thus enabling better exploitation of parallelism. Dataflow computing is also used in hardware-software co-design, where computationally intensive parts of an application are mapped onto reconfigurable custom hardware, while other parts are executed in software on a standard microprocessor. The mapping is done such that the workloads of both parts are about equal, and that the complexity of the communication overhead between the two parts is minimized.
Task graphs also occur in grid computing, where each node may already represent an executable with a runtime of hours or days. The execution units are geographically distributed collections of computers. In order to run a grid application on the grid resources, the task graph is scheduled onto the execution units. This may occur prior to or during execution (static vs. dynamic scheduling). Because of the wide distances between nodes, with correspondingly restricted communication bandwidths, scheduling typically involves clustering, i.e. mapping tasks to nodes such that the communication between nodes is minimized. As a grid node more and more often is a parallel machine itself, tasks can also carry parallelism, so-called malleable tasks. Scheduling a malleable task graph involves the additional difficulty of determining the amount of execution units allocated to parallel tasks.
Practical Relevance. While dataflow computing in itself has not become a mainstream in programming, it has seriously influenced parallel computing, and its techniques have found their way into many products. Hardware-software co-design has gained some interest by the integration of reconfigurable hardware with microprocessors on single chips. Grid computing has gained considerable attraction in the last years, mainly driven by the enormous computing power needed to solve grand challenge problems in natural and life sciences.
Implementations. A prominent example for parallel dataflow computation was the MIT Alewife machine with the Id functional programming language.
Hardware-software co-design is frequently applied in digital signal processing, where there exist a number of multiprocessor systems-on-chip (DSP-MPSoC). The Mitrion-C programming language from Mitrionics serves to program the SGI RASC appliance, which contains FPGAs and is mostly used in the field of bioinformatics.
There are several grid middlewares, most prominently Globus and Unicore, for which a multitude of schedulers exists.
General Parallel Programming Methodologies
In this section we briefly review the features, advantages, and drawbacks of several widely used approaches to the design of parallel software.
Many of these actually start from an existing sequential program for the same problem, which is more restricted but of very high significance for the software industry, which has to port a host of legacy code to parallel platforms these days; other approaches encourage a radically different parallel program design from scratch.
Foster's PCAM Method
Foster suggests that the design of a parallel program should start from an existing (possibly sequential) algorithmic solution to a computational problem by partitioning it into many small tasks and identifying dependences between these that may result in communication and synchronization, for which suitable structures should be selected. These first two design phases, partitioning and communication, are for a model that puts no restriction on the number of processors. The task granularity should be as fine as possible in order to not constrain the later design phases artificially. The result is a more or less scalable parallel algorithm in an abstract programming model that is largely independent from a particular parallel target computer. Next, the tasks are agglomerated to macrotasks (processes) to reduce internal communication and synchronization relations within a macrotask to local memory accesses. Finally, the macrotasks are scheduled to physical processors to balance load and further reduce communication. These latter two steps, agglomeration and mapping, are more machine dependent, because they use information about the number of processors available, the network topology, the cost of communication, etc. to optimize performance.
Incremental Parallelization
For many scientific programs, almost all of their execution time is spent in a fairly small part of the code. Directive-based parallel programming languages such as HPF and OpenMP, which are designed as a semantically consistent extension to a sequential base language such as Fortran or C, allow starting from sequential source code that can be parallelized incrementally. Usually, the most computationally intensive inner loops are identified (e.g. by profiling) and parallelized first by inserting some directives, e.g. for loop parallelization. If performance is not yet sufficient, more directives need to be inserted, and even rewriting of some of the original code may be necessary to achieve reasonable performance.
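For illustration (a minimal sketch, ours), incrementally parallelizing a hot loop in C with OpenMP requires only a single directive, while the surrounding program remains sequential:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {        /* sequential initialization */
        a[i] = (double)i;
        b[i] = 2.0 * i;
    }

    /* the one directive added during incremental parallelization */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f\n", sum);
    return 0;
}
```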
Automatic Parallelization
Automatic parallelization of sequential legacy code is of high importance to industry but notoriously difficult. It occurs in two forms: static parallelization by a smart compiler, and runtime parallelization with support by the language's runtime system or the hardware.
Static parallelization of sequential code is, in principle, an undecidable problem, because the dynamic behavior of the program (and thereby the exact data dependences between statement executions) is usually not known at compile time, but solutions for restricted programs and specific domains exist. Parallelizing compilers usually focus on loop parallelization, because loops account for most of the computation time in programs and their dynamic control structure is relatively easy to analyze. Methods for data dependence testing of sequential loops and loop nests often require index expressions that are linear in the loop variables. In cases of doubt, the compiler will conservatively assume dependence, i.e. non-parallelizability, of (all) loop iterations. Domain-specific methods may look for special code patterns that represent typical computation kernels in a certain domain, such as reductions, prefix computations, data-parallel operations, etc., and replace these by equivalent parallel code. These methods are, in general, less limited by the given control structure of the sequential source program than loop parallelization, but still rely on careful program analysis.
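The following sketch (ours) shows the distinction such dependence tests must make: the first loop has no loop-carried dependence and is safe to parallelize, while the second carries a flow dependence from each iteration to the next (though it matches the prefix-computation pattern mentioned above):

```c
/* Illustrative only: two loops with different dependence structure. */
void independent(double *a, const double *b, int n) {
    /* a[i] depends only on b[i]: no loop-carried dependence,
       so all iterations may run in parallel. */
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

void carried(double *a, int n) {
    /* a[i] reads a[i-1], written in the previous iteration:
       a loop-carried flow dependence forces sequential execution,
       unless the compiler recognizes the prefix-sum pattern. */
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```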
Runtime parallelization techniques defer the analysis of data dependences and the determination of the (parallel) schedule of computation to runtime, when more information about the program (e.g. values of input-dependent variables) is known. The necessary computations, prepared by the compiler, will cause runtime overhead, which is often prohibitively high if it cannot be amortized over several iterations of an outer loop where the dependence structure does not change between its iterations. Methods for runtime parallelization of irregular loops include doacross parallelization, the inspector-executor method for shared and for distributed memory systems, the privatizing DOALL test, and the LRPD test. The latter is a software implementation of speculative loop parallelization.
In speculative parallelization of loops, several iterations are started in parallel, and the memory accesses are monitored such that potential misspeculation can be discovered. If the speculation was wrong, the iterations that were erroneously executed in parallel must be rolled back and re-executed sequentially. Otherwise, their results will be committed to memory.
Thread-level parallel architectures can implement general thread-level speculation, e.g. as an extension to the cache coherence protocol. Potentially parallel tasks can be identified by the compiler or on-the-fly during program execution. Promising candidates for speculation on data or control independence are the parallel execution of loop iterations, or the parallel execution of a function call together with its continuation (the code following the call). Simulations have shown that thread-level speculation works best for a small number of processors.
Skeleton-Based and Library-Based Parallel Programming
Structured parallel programming, also known as skeleton programming, restricts the many ways of expressing parallelism to compositions of only a few predefined patterns, so-called skeletons. Skeletons are generic, portable, and reusable basic program building blocks for which parallel implementations may be available. They are typically derived from higher-order functions as known from functional programming languages. A skeleton-based parallel programming system, like P3L, SCL, eSkel, MuesLi, or QUAFF, usually provides a relatively small, fixed set of skeletons. Each skeleton represents a unique way of exploiting parallelism in a specifically organized type of computation, such as data parallelism, task farming, parallel divide-and-conquer, or pipelining. By composing these, the programmer can build a structured, high-level specification of parallel programs. The system can exploit this knowledge about the structure of the parallel computation for automatic program transformation, resource scheduling, and mapping. Performance prediction is also enhanced by composing the known performance prediction functions of the skeletons accordingly. The appropriate set of skeletons, their degree of composability, generality, and architecture independence, and the best ways to support them in programming languages have been intensively researched in the 1990s and are still issues of current research.
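As a minimal sketch of the idea (ours; the skeleton systems named above are far richer), a data-parallel map skeleton in C hides the parallel implementation behind a higher-order interface:

```c
#include <stdio.h>
#include <omp.h>

typedef double (*unary_fn)(double);

/* "map" skeleton: applies f elementwise. The parallelization strategy
   is fixed once, inside the skeleton, not in user code. */
static void map(unary_fn f, const double *in, double *out, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = f(in[i]);
}

static double square(double x) { return x * x; }

int main(void) {
    double in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    map(square, in, out, 8);       /* user code only names the pattern */
    for (int i = 0; i < 8; i++)
        printf("%.0f ", out[i]);
    printf("\n");
    return 0;
}
```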
Composition of skeletons may be either non-hierarchical, by sequencing using temporary variables to transport intermediate results, or hierarchical by (conceptually) nesting skeleton functions, that is, by building a new, hierarchically composed function by (virtually) inserting the code of one skeleton as a parameter into that of another one. This enables the elegant compositional specification of multiple levels of parallelism. In a declarative programming environment, such as in functional languages or separate skeleton languages, hierarchical composition gives the code generator more freedom of choice for automatic transformations and for efficient resource utilization, such as the decision of how many parallel processors to spend at which level of the compositional hierarchy. Ideally, the cost estimations of the composed function could be composed correspondingly from the cost estimation functions of the basic skeletons. While non-nestable skeletons can be implemented by generic library routines, nestable skeletons require, in principle, a static preprocessing that unfolds the skeleton hierarchy, e.g. by using C++ templates or C preprocessor macros.
The exploitation of nested parallelism specified by such a hierarchical composition is quite straightforward if a fork-join mechanism for recursive spawning of parallel activities is applicable. In that case, each thread executing the outer skeleton spawns a set of new threads that execute the inner skeleton, also in parallel. This may result in very fine-grained parallel execution and shifts the burden of load balancing and scheduling to the runtime system, which may incur tremendous space and time overhead. In an SPMD environment like MPI, UPC, or Fork, nested parallelism can be exploited by suitable group splitting.
Parallel programming with skeletons may be seen in contrast to parallel programming using parallel library routines. Domain-specific parallel subroutine libraries, e.g. for numerical computations on large vectors and matrices, are available for almost any parallel computer platform. Both for skeletons and for library routines, reuse is an important purpose. Nevertheless, the usage of library routines is more restrictive, because they exploit parallelism only at the bottom level of the program's hierarchical structure, that is, they are not compositional, and their computational structure is not transparent for the programmer.
Conclusion
At the end of this review of parallel programming models, we may observe some current trends and speculate a bit about the future of parallel programming models.
As far as we can foresee today, the future of computing is parallel computing, dictated by physical and technical necessity. Parallel computer architectures will be more and more hybrid, combining hardware multithreading, many cores, SIMD units, accelerators, and on-chip communication systems, which requires the programmer and the compiler to solicit parallelism, orchestrate computations, and manage data locality at several levels in order to achieve reasonable performance. A perhaps extreme example for this is the Cell BE processor.
Because of their relative simplicity, purely sequential languages will remain for certain applications that are not performance-critical, such as word processors. Some will have standardized parallel extensions and be slightly revised to provide a well-defined memory model for use in parallel systems, or disappear in favor of new, truly parallel languages. As parallel programming leaves the HPC market niche and goes mainstream, simplicity will be pivotal, especially for novice programmers. We foresee a multi-layer model with a simple, deterministic high-level model that focuses on parallelism and portability, while it includes transparent access to an underlying lower-level model layer with more performance tuning possibilities for experts. New software engineering techniques such as aspect-oriented and view-based programming and model-driven development may help in managing complexity.
Given that programmers have mostly been trained in a sequential way of algorithmic thinking for decades, migration paths from sequential programming to parallel programming need to be opened. To prepare coming generations of students better, undergraduate teaching should encourage a massively parallel access to computing, e.g. by taking up parallel time, work, and cost metrics in the design and analysis of algorithms early in the curriculum.
From an industry perspective, tools that allow porting sequential legacy software more or less automatically are of very high significance. Deterministic and time-predictable parallel models are useful e.g. in the real-time domain. Compiler and tools technology must keep pace with the introduction of new parallel language features. Even the most advanced parallel programming language is doomed to failure if its compilers are premature at its market introduction and produce poor code, as we could observe in the 1990s for HPF in the high-performance computing domain, where HPC programmers instead switched to the lower-level MPI as their main programming model.
Acknowledgments. Christoph Kessler acknowledges funding by Ceniit at Linköpings universitet, by Vetenskapsrådet (VR), by SSF (RISE), by Vinnova (SafeModSim), and by the CUGS graduate school.
References
Ferri Abolhassan, Reinhard Drefenstedt, Jörg Keller, Wolfgang J. Paul, and Dieter Scheerer. On the physical design of PRAMs. The Computer Journal, 36(8):756-762, December 1993.
Ali-Reza Adl-Tabatabai, Christos Kozyrakis, and Bratin Saha. Unlocking concurrency: multicore programming with transactional memory. ACM Queue, Dec./Jan. 2006/2007.
Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: a tutorial. IEEE Computer, 29(12):66-76, 1996.
Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, David Mackenzie, and Donald Yeung. The MIT Alewife machine: Architecture and performance. In Proc. 22nd Int. Symp. on Computer Architecture, 1995.
A. Aggarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71:3-28, 1990.
Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing, 44(1):71-79, 1997.
Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt. The Fortress language specification (beta version). Sun Microsystems, March 2007.
Bruno Bacci, Marco Danelutto, Salvatore Orlando, Susanna Pelagatti, and Marco Vanneschi. P3L: A structured high level programming language and its structured support. Concurrency: Practice and Experience, 7(3):225-255, 1995.
Bruno Bacci, Marco Danelutto, and Susanna Pelagatti. Resource optimisation via structured parallel programming. In Programming Environments for Massively Parallel Distributed Systems, April 1994.
Henri E. Bal, Jennifer G. Steiner, and Andrew S. Tanenbaum. Programming languages for distributed computing systems. ACM Computing Surveys, 21(3):261-322, September 1989.
R. Bisseling. Parallel Scientific Computation: A Structured Approach using BSP and MPI. Oxford University Press, 2004.
Guy Blelloch. Programming parallel algorithms. Comm. ACM, 39(3):85-97, March 1996.
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In Proc. 5th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 207-216, 1995.
Hans-J. Boehm. Threads cannot be implemented as a library. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, 2005.
Olaf Bonorden, Ben Juurlink, Ingo von Otte, and Ingo Rieping. The Paderborn University BSP (PUB) library. Parallel Computing, 29(2):187-207, 2003.
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 777-786, New York, NY, USA, 2004. ACM Press.
William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karen Warren. Introduction to UPC and language specification. Technical Report CCS-TR-99-157 (second printing), IDA Center for Computing Sciences, May 1999.
Brian D. Carlstrom, Austen McDonald, Hassan Chafi, JaeWoong Chung, Chi Cao Minh, Christoforos E. Kozyrakis, and Kunle Olukotun. The Atomos transactional programming language. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI). ACM, June 2006.
Nicholas Carriero and David Gelernter. Linda in context. Commun. ACM, 32(4):444-458, 1989.
Bradford L. Chamberlain, David Callahan, and Hans P. Zima. Parallel programmability and the Chapel language. Submitted, 2007.
Murray Cole. Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallel programming. Parallel Computing, 30(3):389-406, 2004.
Murray I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. Pitman and MIT Press, 1989.
Richard Cole and Ofer Zajicek. The APRAM: Incorporating asynchrony into the PRAM model. In Proc. 1st Annual ACM Symp. on Parallel Algorithms and Architectures, pages 169-178, 1989.
David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Proc. ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 1-12, 1993.
J. Darlington, A. J. Field, P. G. Harrison, P. H. B. Kelly, D. W. N. Sharp, and Q. Wu. Parallel programming using skeleton functions. In Proc. Conf. on Parallel Architectures and Languages Europe. Springer LNCS, 1993.
J. Darlington, Y. Guo, H. W. To, and J. Yang. Parallel skeletons for structured composition. In Proc. 5th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. ACM Press, July 1995.
K. M. Decker and R. M. Rehmann, editors. Programming Environments for Massively Parallel Distributed Systems. Birkhäuser, Basel, Switzerland, 1994. Proc. IFIP WG 10.3 Working Conf. at Monte Verità, Ascona, Switzerland, April 1994.
F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. ACM Symp. on Computational Geometry, 1993.
Beniamino di Martino and Christoph W. Keßler. Two program comprehension tools for automatic parallelization. IEEE Concurrency, 8(1), Spring 2000.
Dietmar W. Erwin and David F. Snelling. UNICORE: A Grid computing environment. In Proc. 7th Int. Conference on Parallel Processing (Euro-Par 2001), pages 825-834, London, UK, 2001. Springer-Verlag.
Vijay Saraswat et al. Report on the experimental language X10. White paper, IBM Research, February 2006.
Joel Falcou and Jocelyn Sérot. Formal semantics applied to the implementation of a skeleton-based parallel programming library. In Proc. ParCo 2007. IOS Press, 2007.
Martti Forsell. A scalable high-performance computing solution for networks on chips. IEEE Micro, pages 46-55, September 2002.
S. Fortune and J. Wyllie. Parallelism in random access machines. In Proc. 10th Annual ACM Symp. on Theory of Computing, pages 114-118, 1978.
Ian Foster. Designing and Building Parallel Programs. Addison-Wesley, 1995.
Ian Foster. Globus toolkit version 4: Software for service-oriented systems. In Proc. IFIP Int. Conf. on Network and Parallel Computing, LNCS 3779, pages 2-13. Springer, 2005.
Ian Foster, Carl Kesselman, and Steven Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. Int. J. Supercomputer Applications, 15(3), 2001.
Phillip B. Gibbons. A more practical PRAM model. In Proc. 1st Annual ACM Symp. on Parallel Algorithms and Architectures, pages 158-168, 1989.
W. K. Giloi. Parallel programming models and their interdependence with parallel architectures. In Proc. 1st Int. Conf. on Massively Parallel Programming Models. IEEE Computer Society Press, 1993.
Sergei Gorlatch and Susanna Pelagatti. A transformational framework for skeletal programs: Overview and case study. In Jose Rohlim et al., editors, IPPS/SPDP Workshops Proceedings, IEEE Int. Parallel Processing Symp. and Symp. on Parallel and Distributed Processing. Springer LNCS, 1999.
Allan Gottlieb. An overview of the NYU Ultracomputer project. In J. J. Dongarra, editor, Experimental Parallel Computing Architectures, pages 25-95. Elsevier Science Publishers, 1987.
Susanne E. Hambrusch. Models for parallel computation. In Proc. Int. Conf. on Parallel Processing, Workshop on Challenges for Parallel Processing, 1996.
Philip J. Hatcher and Michael J. Quinn. Data-Parallel Programming on MIMD Computers. MIT Press, 1991.
Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proc. Int. Symp. on Computer Architecture, 1993.
Heywood and Leopold. Dynamic randomized simulation of hierarchical PRAMs on meshes. In Proc. Aizu International Symposium on Parallel Algorithms/Architecture Synthesis. IEEE Computer Society Press, 1995.
Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSPlib: the BSP programming library. Parallel Computing, 24:1947-1980, 1998.
Rolf Hoffmann, Klaus-Peter Völkmann, Stefan Waldschmidt, and Wolfgang Heenes. GCA: Global Cellular Automata, a flexible parallel model. In PaCT: Proceedings of the Int. Conference on Parallel Computing Technologies, London, UK, 2001. Springer-Verlag.
Kenneth E. Iverson. A Programming Language. Wiley, New York, 1962.
Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
Johannes Jendrsczok, Rolf Hoffmann, Patrick Ediger, and Jörg Keller. Implementing APL-like data parallel functions on a GCA machine. In Proc. Workshop on Parallel Algorithms and Computing Systems (PARS), 2007.
Jörg Keller, Christoph Kessler, and Jesper Träff. Practical PRAM Programming. Wiley, New York, 2001.
Ken Kennedy, Charles Koelbel, and Hans Zima. The rise and fall of High Performance Fortran: an historical object lesson. In Proc. Int. Symposium on the History of Programming Languages (HOPL III), June 2007.
Christoph Kessler. Managing distributed shared arrays in a bulk-synchronous parallel environment. Concurrency Pract. Exp., 16:133-153, 2004.
Christoph Kessler. Teaching parallel programming early. In Proc. Workshop on Developing Computer Science Education: How Can It Be Done?, Linköpings universitet, Sweden, March 2006.
Christoph Kessler and Andrzej Bednarski. Optimal integrated code generation for VLIW architectures. Concurrency and Computation: Practice and Experience, 2006.
Christoph W. Keßler. NestStep: Nested parallelism and virtual shared memory for the BSP model. The Journal of Supercomputing, 17:245-262, 2000.
Christoph W. Kessler. A practical access to the theory of parallel algorithms. In Proc. ACM SIGCSE Symposium on Computer Science Education, March 2004.
Christoph W. Keßler and Helmut Seidl. The Fork95 parallel programming language: Design, implementation, application. Int. J. Parallel Programming, 25(1):17-50, February 1997.
Herbert Kuchen. A skeleton library. In Proc. Euro-Par 2002, 2002.
H. T. Kung and C. E. Leiserson. Algorithms for VLSI processor arrays. In C. Mead and L. Conway, editors, Introduction to VLSI Systems. Addison-Wesley, 1980.
Christian Lengauer. A personal, historical perspective of parallel programming for high performance. In Günter Hommel, editor, Communication-Based Systems (CBS). Kluwer, 2000.
Claudia Leopold. Parallel and Distributed Computing: A Survey of Models, Paradigms, and Approaches. Wiley, New York, 2001.
K.-C. Li and H. Schwetman. Vector C: A vector processing language. J. Parallel and Distrib. Comput., 2:132-169, 1985.
L. Fava Lindon and S. G. Akl. An optimal implementation of broadcasting with selective reduction. IEEE Trans. Parallel Distrib. Syst., 4(3), 1993.
B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: a survey and synthesis. In Proc. 28th Annual Hawaii Int. Conf. on System Sciences, volume 2, pages 61-70, January 1995.
W. F. McColl. General purpose parallel computing. In A. M. Gibbons and P. Spirakis, editors, Lectures on Parallel Computation: Proc. ALCOM Spring School on Parallel Computation, pages 337-391. Cambridge University Press, 1993.
Wolfgang J. Paul, Peter Bach, Michael Bosch, Jörg Fischer, Cédric Lichtenau, and Jochen Röhrig. Real PRAM programming. In Proc. Int. Euro-Par Conf., August 2002.
Susanna Pelagatti. Structured Development of Parallel Programs. Taylor & Francis, 1998.
Michael Philippsen and Walter F. Tichy. Modula-2* and its compilation. In Proc. 1st Int. Conf. of the Austrian Center for Parallel Computation, pages 169-183. Springer LNCS, 1991.
Thomas Rauber and Gudula Rünger. Tlib: a library to support programming with hierarchical multiprocessor tasks. J. Parallel and Distrib. Comput., 65:347-360, March 2005.
Lawrence Rauchwerger and David Padua. The privatizing DOALL test: A run-time technique for DOALL loop identification and array privatization. In Proc. 8th ACM Int. Conf. on Supercomputing, pages 33-43. ACM Press, July 1994.
Lawrence Rauchwerger and David Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, pages 218-232. ACM Press, June 1995.
J. Rose and G. Steele. C*: an extended C language for data parallel programming. Technical Report PL 87-5, Thinking Machines Inc., Cambridge, MA, 1987.
D. B. Skillicorn. Models for practical parallel computation. Int. J. Parallel Programming, 20(2):133-158, 1991.
D. B. Skillicorn. miniBSP: a BSP language and transformation system. Technical report, Dept. of Computing and Information Sciences, Queen's University, Kingston, Canada, October 1996.

David B. Skillicorn and Domenico Talia, editors. Programming Languages for Parallel Processing. IEEE Computer Society Press, 1995.
David B. Skillicorn and Domenico Talia. Models and languages for parallel computation. ACM Computing Surveys, 30(2), June 1998.
Lawrence Snyder. The design and development of ZPL. In Proc. ACM SIGPLAN Third Symposium on History of Programming Languages (HOPL III). ACM Press, June 2007.
Håkan Sundell and Philippas Tsigas. NOBLE: A non-blocking inter-process communication library. Technical report, Dept. of Computer Science, Chalmers University of Technology and Göteborg University, Göteborg, Sweden, 2002.
Leslie G. Valiant. A bridging model for parallel computation. Comm. ACM, 33(8):103-111, August 1990.
Fredrik Warg. Techniques to Reduce Thread-Level Speculation Overhead. PhD thesis, Dept. of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden, 2006.
Xingzhi Wen and Uzi Vishkin. PRAM-on-chip: first commitment to silicon. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, New York, NY, USA, 2007. ACM.
Wayne Wolf. Guest editor's introduction: The embedded systems landscape. Computer, 2007.