Cluster Computing
Design Issues for a High Performance Distributed Shared Memory on Symmetrical Multiprocessor Clusters

Sumit Roy and Vipin Chaudhary
Parallel and Distributed Computing Laboratory, Department of Electrical and Computer Engineering, Wayne State University, Detroit, Michigan
E-mail: {roy, chaud}@ece.eng.wayne.edu
Clusters of Symmetrical Multiprocessors (SMPs) have recently become the norm for high performance, economical computing solutions. Multiple nodes in a cluster can be used for parallel programming using a message passing library. An alternate approach is to use a software Distributed Shared Memory (DSM) to provide a view of shared memory to the application programmer. This paper describes Strings, a high performance distributed shared memory system designed for such SMP clusters. The distinguishing feature of this system is the use of a fully multithreaded run time system, using kernel level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel level light-weight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system, and illustrates the performance improvements with benchmark programs from the SPLASH suite, some computational kernels, as well as a full fledged application.

It is found that using multiple processes on SMP nodes provides good speedups only for a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree.

Keywords: Distributed Shared Memory, Symmetrical Multiprocessors, Multithreading, Performance Evaluation
Introduction

Though current microprocessors are getting faster at a very rapid rate, there are still some very large and complex problems that can only be solved by using multiple cooperating processors. These problems include the so-called Grand Challenge Problems, such as fuel combustion, ocean modeling, image understanding, and rational drug design. There has recently been a decline in the number of specialized parallel machines being built to solve such problems. Instead, many vendors of traditional workstations have adopted a design strategy wherein multiple state-of-the-art microprocessors are used to build high performance shared-memory parallel workstations. These symmetrical multiprocessors (SMPs) are then connected through high speed networks or switches to form a scalable computing cluster. Examples of this important class of machines include the SGI Power Challenge Array, the IBM SP with multiple POWER based nodes, the Convex Exemplar, the DEC AdvantageCluster, the SUN HPC cluster with the SUN Cluster Channel, as well as the Cray/SGI Origin series.

(This research was supported in part by NSF grants MIP and EIA, US Army Contract DAEA, and Ford Motor Company grants. A preliminary version of this paper appeared in the Proceedings of the High Performance Distributed Computing Conference.)

Using multiple nodes on such SMP clusters requires the programmer either to write explicit message passing programs, using libraries like MPI or PVM, or to rewrite the code using a new language with parallel constructs, e.g. HPF and Fortran. Message passing programs are cumbersome to write, while parallel languages usually only work well with code that has regular data access patterns. In both cases, the programmer has to be intimately familiar with the application program as well as the target architecture to get the best possible performance. The shared memory model, on the other hand, is easier to program, since the programmer does not have to worry about the data layout and does not have to explicitly send data from one process to another. Hence, an alternate approach to using these computing clusters is to provide a view of logically shared memory over physically distributed memory, known as a Distributed Shared Memory (DSM) or Shared Virtual Memory (SVM). Recent research projects with DSMs have shown good performance, for example IVY, Mirage, Munin, TreadMarks, Quarks, CVM, and Strings. This model has also been shown to give good results for programs that have irregular data access patterns which cannot be analyzed at compile time, or indirect data accesses that are dependent on the input data-set.

DSMs share data at the relatively large granularity of a virtual memory page, and can suffer from a phenomenon known as false sharing, wherein two processes simultaneously attempt to write to different data items that reside on the same page. If only a single writer is permitted, the page may ping-pong between the nodes. One solution to this problem is to hold on to a freshly arrived page for some time before releasing it to another requester. Relaxed memory consistency models that allow multiple concurrent writers have also been proposed to alleviate this symptom. These systems ensure that all nodes see the same data at well defined points in the program, usually when synchronization occurs. Extra effort is required to ensure program correctness in this case.

One technique that has been investigated recently to improve DSM performance is the use of multiple threads of control in the system. Multithreaded DSMs have been described as third generation systems. Published efforts have been restricted to non-preemptive, user-level thread implementations. However, user level threads cannot be scheduled across multiple processors on an SMP. Since SMP clusters are increasingly becoming the norm for High Performance Computing sites, we consider this to be an important problem to be solved. This paper describes Strings, a multithreaded DSM designed for SMP clusters. The distinguishing feature of Strings is that it is built using POSIX threads, which can be multiplexed on kernel light-weight processes. The kernel can schedule these light-weight processes across multiple processors for better performance. Strings is designed to exploit data parallelism by allowing multiple application threads to share the same address space on a node. Additionally, the protocol handler is multithreaded, and is able to use task parallelism at the run time level. The overhead of interrupt driven network I/O is avoided by using a dedicated communication thread. We show the impact of each of these design choices using some example programs, as well as some benchmark programs from the SPLASH suite.

The following section describes some details of the software system. The evaluation platform and programs for the performance analysis are described next. Experimental results are then shown and analyzed, and the final section suggests some directions for future work and concludes the paper.

System details

The Strings distributed shared memory was derived from the publicly available system Quarks. It shares the use of the Release Consistency model with that system, as well as the concept of a dsm server thread. We briefly describe the implementation details and highlight the differences between the two systems.
Execution model

The Strings system consists of a library that is linked with a shared memory parallel program. The program uses calls to the distributed shared memory allocator to create globally shared memory regions. A typical program goes through the initialization phase shown in the figure below. The master process starts up, and forks child processes on remote machines using rsh. Each process creates a dsm server thread and a communication thread. The forked processes then register their listening ports with the master. The master process then enters the application program proper and creates shared memory regions. Application threads are then created by sending requests to the remote dsm servers. Shared region identifiers and global synchronization primitives are sent as part of the thread create call. The virtual memory subsystem is used to enforce coherent access to the globally shared regions.

The original Quarks system used user level Cthreads and allowed only a single application thread. Strings allows multiple application threads to be created on a single node. This increases the concurrency level on each node in an SMP cluster.

Kernel level threads

Threads are light-weight processes that have minimal execution context and share the global address space of a program. The time to switch from one thread to another is very small when compared to the context switching time required for full-fledged processes. Moreover, the implicit shared memory leads to a very simple programming model. Thread implementations are distinguished as being user-level, usually implemented as a library, or as being kernel level, in terms of light-weight processes. Kernel level threads are a little more expensive to create, since the kernel is involved in managing them. However, user level threads suffer from some important limitations. Since they are implemented as a user level library, they cannot be scheduled by the kernel. If any thread issues a blocking system call, the kernel considers the process as a whole, and thus all the associated threads, to be blocked. Also, on a multiprocessor system, all user level threads can only run on one processor at a time. User level threads do allow the programmer to exercise very fine control over their scheduling within the process. In contrast, kernel level threads can be scheduled by the operating system across multiple processors. Most modern UNIX implementations provide a light-weight process interface on which these threads are then multiplexed. The thread package used in Strings is the standard POSIX thread library. Multithreading has been suggested for improving the performance of scientific code by overlapping communications with computations. Previous work on multithreaded message passing systems has pointed out that kernel-level implementations show better results than user level threads beyond moderate message sizes. Since virtual memory pages are of comparable size, this suggests that kernel threads should be useful for DSM systems.

Shared memory implementation

Shared memory in the system is implemented by using the UNIX mmap call to map a file to the bottom of the stack segment. Quarks used anonymous mappings of memory pages, to addresses determined by the system, but this works only with statically linked binaries. With dynamically linked programs, it was found that due to the presence of shared libraries, mmap would map the same page to different addresses in different processes. While an address translation table can be used to access opaquely shared data, it is not possible to pass pointers to shared memory this way, since they would potentially address different regions in different processes. An alternate approach would be to preallocate a very large number of pages, as done by CVM and TreadMarks, but this associates the same large overhead with every program, regardless of its actual requirements.
Figure: Initialization Phase of a Strings Program (the master uses rsh to create child processes; each child runs initialization, communication, dsm server, and application threads, and pages are subsequently faulted in on demand).
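As a toy re-enactment of this startup sequence (a sketch only: real Strings starts remote processes with rsh and registers UDP ports, whereas the processes and port numbers below are invented), consider:

```python
import multiprocessing as mp

# The master starts child processes; each child registers its "listening
# port" with the master before application threads are created.
ctx = mp.get_context("fork")      # POSIX-only, mirrors the fork/rsh model

def child(node_id, registry):
    port = 9000 + node_id         # invented port numbering
    registry.put((node_id, port))

registry = ctx.Queue()
children = [ctx.Process(target=child, args=(i, registry)) for i in (1, 2, 3)]
for c in children:
    c.start()
ports = dict(registry.get() for _ in children)
for c in children:
    c.join()
print(sorted(ports.items()))      # prints: [(1, 9001), (2, 9002), (3, 9003)]
```

Once the master knows every child's port, it can send thread-create requests carrying shared region identifiers, as described above.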
Allowing multiple application threads on a node leads to a peculiar problem with the DSM implementation. Once a page has been fetched from a remote node, its contents must be written to the corresponding memory region, so the protection has to be changed to writable. At this time, no other thread should be able to access this page. User level threads can be scheduled to allow atomic updates to the region; however, suspending all kernel level threads can potentially lead to a deadlock, and would also reduce concurrency. The approach used in Strings is illustrated in the thread safe memory update figure: every page is mapped to two different addresses. It is then possible to write to the shadow address without changing the protection of the primary memory region.

Polled network I/O

Early generation DSM systems used interrupt driven I/O to obtain pages, locks, etc. from remote nodes. This can cause considerable disruption at the remote end, and previous research tried to overcome this by aggregating messages, by reducing communication through combining synchronization with data, and by other such techniques. Strings instead uses a dedicated communication thread which monitors the network port, thus eliminating the overhead of an interrupt call. Incoming message queues are maintained for each active thread at a node, and message arrival is announced using condition variables. This prevents wasting CPU cycles on busy waits. A reliable messaging system is implemented on top of UDP.
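The dedicated communication thread and the per-thread inboxes can be sketched as follows (a Python analogy only: `queue.Queue` supplies the condition-variable wakeup, and a plain queue stands in for the UDP socket that Strings actually monitors):

```python
import queue
import threading

# One inbox per application thread, plus a single communication thread
# that dispatches incoming messages; no thread busy-waits or takes a
# signal. The thread names and messages are invented for illustration.
inboxes = {tid: queue.Queue() for tid in ("appl-0", "appl-1")}
network = queue.Queue()               # stands in for the UDP socket

def comm_thread():
    while True:
        msg = network.get()           # blocks until a "packet" arrives
        if msg is None:               # shutdown sentinel
            break
        dest, payload = msg
        inboxes[dest].put(payload)    # wakes any waiter via a condition variable

t = threading.Thread(target=comm_thread)
t.start()
network.put(("appl-1", "page 7 contents"))
network.put(("appl-0", "lock granted"))
got = inboxes["appl-0"].get()         # an application thread sleeps until notified
network.put(None)
t.join()
print(got)                            # prints: lock granted
```

The design point is that only one thread ever blocks on the network; everyone else blocks on their own inbox, so an incoming request never interrupts a compute thread.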
The mprotect call is used to control access to the shared memory region. When a thread faults while accessing a page, a page handler is invoked to fetch the contents from the owning node. Strings currently supports sequential consistency using an invalidate protocol, as well as release consistency using an update protocol. When a thread tries to write to a page, a twin copy of the page is created. At release time, i.e. when a lock is unlocked or a barrier is entered, the difference, or diff, between the current contents of the page and its twin is sent to the other threads that share the page. The release consistency model implemented in Quarks has been improved by aggregating multiple diffs, to decrease the number of messages sent.

Concurrent server

The original Quarks dsm server thread was an iterative server that handled one incoming request at a time. It was found that under certain conditions, lock requests could give rise to a deadlock between two communicating processes. Strings solves this by creating separate threads to handle each incoming request, for pages, lock acquires, and barrier arrivals. Relatively fine grain locking of internal data structures is used to maintain a high level of concurrency, while guaranteeing correctness when handling multiple concurrent requests.
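The twin and diff mechanism described above can be sketched as follows (an illustrative sketch: Strings operates on virtual memory pages and batches diffs into messages, while here small bytearrays stand in for pages):

```python
# Before the first write to a page, a twin (snapshot) is taken; at release
# time only the bytes that differ from the twin are shipped to other
# sharers, and several diffs can be aggregated into one message.
def make_twin(page):
    return bytes(page)                 # snapshot taken on the first write fault

def make_diff(page, twin):
    """Return (offset, new_byte) pairs for every byte changed since the twin."""
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page, diff):
    for offset, value in diff:
        page[offset] = value

home = bytearray(16)                   # the copy held by another sharer
local = bytearray(16)                  # locally cached copy
twin = make_twin(local)
local[3] = 7                           # writes by the application thread
local[12] = 9
diff = make_diff(local, twin)          # computed at lock release / barrier entry
apply_diff(home, diff)
print(diff, home == local)             # two changed bytes travel, not the whole page
```

Shipping diffs instead of whole pages is what lets multiple concurrent writers coexist under release consistency without the page ping-ponging.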



Figure: Thread safe memory update for Strings. Each globally shared page is mapped at two addresses in the process address space (which also holds the stack, shared libraries, heap, data, and text segments): a primary mapping whose protection moves between no-access, read-only, and read-write, and a writable shadow mapping. On an access error, the page handler fetches the page, installs its contents through the shadow mapping, enables read/write access on the primary mapping, and returns.
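The double mapping in the figure can be imitated in a few lines of Python (a sketch of the idea only: real Strings places pages at chosen addresses with mmap and flips protections with mprotect, while Python's mmap objects pick their own addresses). A read-only primary view and a writable shadow view of the same page stay coherent because both are shared mappings of one file:

```python
import mmap
import os
import tempfile

# Each shared page is mapped twice: a protected "primary" mapping used by
# application threads, and a writable "shadow" mapping that the page
# handler uses to install fetched contents without touching the primary
# protection. File name and contents are invented for illustration.
PAGE = mmap.PAGESIZE
fd, path = tempfile.mkstemp()
os.ftruncate(fd, PAGE)

primary = mmap.mmap(fd, PAGE, access=mmap.ACCESS_READ)   # what threads see
shadow = mmap.mmap(fd, PAGE, access=mmap.ACCESS_WRITE)   # handler's view

shadow[0:5] = b"hello"        # install page contents via the shadow address
updated = bytes(primary[0:5]) # visible through the read-only primary mapping
os.unlink(path)
print(updated)                # prints: b'hello'
```

Because the handler writes through the shadow address, no window ever opens in which another kernel thread could read a half-written page through a temporarily writable primary mapping.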
Synchronization primitives

Quarks provides barriers and locks as shared memory primitives. Strings also implements condition variables, for flag based synchronization. Barriers are managed by the master process. Barrier arrivals are first collected locally, and are then sent to the barrier manager. Dirty pages are also purged at this time, as per Release Consistency semantics.

Lock ownership is migratory, with distributed queues. For multiple application threads, only one lock request is sent to the current owner; the subsequent ones are queued locally, as are incoming requests. Requests on the same node preempt requests from remote nodes. While this does not guarantee fairness or progress, this optimization works very well for data parallel programs. A similar optimization was employed in CVM. Release Consistency operations are deferred if the lock transfer is within the local node.

Performance analysis

We evaluated the performance of Strings using programs from the SPLASH benchmark suite. These programs have been written for evaluating the performance of shared address-space multiprocessors, and include application kernels as well as full fledged code. Additionally, we show results for matrix multiplication, a program from the field of medical computing, a kernel for solving partial differential equations by the successive over-relaxation technique, and the classical traveling salesman problem.

SPLASH programs

The data access patterns of the programs in the SPLASH suite have been characterized in earlier research. FFT performs a transform of complex data points, and requires three all-to-all interprocessor communication phases for a matrix transpose. The data access is regular. LU-c and LU-n perform factorization of a dense matrix. The non-contiguous version has a single producer and multiple consumers; it suffers from considerable fragmentation and false sharing. The contiguous version uses an array of blocks to improve spatial locality. RADIX performs an integer radix sort, and suffers from a high degree of false sharing at page granularity during a permutation phase. RAYTRACE renders a complex scene using an optimized ray tracing method. It uses a shared task queue to allocate jobs to different threads; since the overhead of this approach is very high in a DSM system, the code was modified to maintain a local as well as a global queue per thread. Tasks were initially drained from the local queue, and then from the shared queue. VOLREND renders three dimensional volume data. It has a multiple producers with multiple consumers data sharing pattern, with both fragmentation and false sharing. WATER-sp evaluates the forces and potentials occurring over time in a system of water molecules. A 3-D grid of cells is used, so that a processor that owns a cell only needs to look at neighboring cells to find interacting molecules. Communication arises out of the movement of molecules from one cell to another at every time-step. WATER-n2 solves the same problem as WATER-sp, though with a less efficient algorithm that uses a simpler data-structure.

Image deblurring

The application tested is a parallel algorithm for deblurring of images obtained from Magnetic Resonance Imaging. Images generated by MRI may suffer a loss of clarity due to inhomogeneities in the magnetic field. One of the techniques for removing this blurring artifact is the demodulation of the data for each pixel of the image, using the value of the magnetic field near that point in space. This method consists of acquiring a local field map, finding the best fit to a linear map, and using it to deblur the image distortions due to local frequency variations. This is a very computation intensive operation, and has previously been parallelized using a message passing approach. The shared memory implementation uses a work-pile model, where each thread deblurs the input image around a particular frequency point and then updates the relevant portions of the final image. Since these portions can overlap, each thread does the update under the protection of a global lock.

Matrix multiplication

The matrix multiplication program (MATMULT) computes the product of two dense square matrices. The resultant matrix is partitioned using a block-wise distribution. The size of the blocks can be set to a multiple of the page size of the machine. Since each application thread computes a contiguous block of values, this eliminates the problem of false sharing.

Successive Over-Relaxation

The successive over-relaxation program (SOR) uses a red-black algorithm, and was adapted from the CVM sample code. In every iteration, each point in a grid is set to the average of its four neighbors. Most of the traffic arises out of nearest neighborhood communication at the borders of a rectangular grid.

Traveling Salesman Problem

The Traveling Salesman Problem (TSP) was also adapted from the CVM sample code. The program solves the classic traveling salesman problem using a branch-and-bound algorithm.

Evaluation Environment

Our experiments were carried out so as to show how various changes in the system impact performance. The runs were carried out on a cluster of four SUN UltraEnterprise servers connected using a ForeRunnerLE ATM switch. The master process was always run on the first machine; the three other machines are also multiprocessor UltraEnterprise servers. All the machines use UltraSparc-II processors.

The program parameters and the memory requirements for the sequential version are shown in the table below. It can be seen in each case that the memory requirements do not exceed the capacity of any one node.

Runtime Versions

The Strings run time was modified to demonstrate the incremental effect of different design decisions.
Table: Program Parameters and Memory Requirements. For each program, the input parameters and the memory size of the sequential version are listed: FFT (number of points), LU-c and LU-n (block size), RADIX (number of integers), RAYTRACE (balls scene), VOLREND (head data set, number of views), WATER-n2 and WATER-sp (molecules, time steps), MATMULT (matrix of doubles, block count), MRI PHANTOM (image size, frequency points), SOR (grid of doubles, iterations), and TSP (problem size).
The following runs were carried out:

S  Single Application Thread: sixteen processes, four per machine, with a single application thread per process. User level threads are used throughout. The dsm server thread handles one request at a time, and the network I/O is interrupt driven. This approximates typical existing DSMs that do not support multiple application threads, like TreadMarks, which has been studied on ATM networked DECstations.

M  Multiple Application Threads: four processes, one per machine, with four application threads per process. This case is similar to other DSMs that allow multiple application threads but are restricted to using user level threads; e.g. CVM results were presented on a cluster of SMP DEC Alpha machines. This was approximated by setting the thread scheduler to only allow process level contention for the threads, which were then constrained to run on a single processor per node.

K  Kernel Threads: this version allows the use of kernel level threads, which can be scheduled across multiple processors on an SMP node.

C  Concurrent Server: the dsm server thread now creates explicit handler threads, so that multiple requests can be handled in parallel.

P  Polled I/O: a communication thread waits on message arrivals and notifies the other threads on a node. The overhead of generating a signal and switching to the user level signal handler is thus avoided.

B  Summing Barrier: instead of each application thread sending an arrival message to the barrier manager, the arrivals are first collected locally, and then a single message is sent.

The Release Consistency model was used in each case.

Experimental Results and Analysis

The overall speedup results are shown for each version of the run time in the speedup figure. The time measured in this case excludes the initialization times of the application programs.

Figure: Speedup Results for Benchmark programs (speedup under each run time version, S, M, K, C, P and B, for FFT, LU-c, LU-n, RADIX, RAYTRACE, VOLREND, WATER-n2, WATER-sp, MATMULT, MRI, SOR and TSP).

To better analyze the performance behavior due to the various design choices, the overall execution times are split up into components in the execution time breakdown figure. The results are normalized to the total execution time of the sequential case. The data is separated into the time for:

Page Fault: the total time spent in the page-fault handler.
Lock: the time spent to acquire a lock.
Barrier Wait: the time spent waiting on the barrier after completing Release Consistency related protocol actions.
Compute: this includes the compute time, as well as some miscellaneous components like the startup time.

All times are wall clock times, and thus include time spent in the operating system.

Single Application Thread

If multiple processes are used on a node, only programs with very little false sharing and communication are seen to provide speedups. These include VOLREND, MATMULT and TSP. On the other hand, many programs show a significant slowdown, including FFT, RADIX, RAYTRACE and WATER-sp. From the execution time breakdown, it can be seen that a significant part of the execution time is spent on page faults and synchronization. Since each process on an SMP node executes in its own address space, the pages have to be faulted in separately. This leads to an increase in network contention, as well as disruption at the server side.

Multiple Application Threads

The use of multiple application threads significantly affects the number of page faults incurred per thread, as seen in the page fault figure. Since the threads on a single SMP node share the same memory space, a page that is required by multiple threads has to be faulted in only once. The very high improvement for MRI is due to the use of a work-pile model: essentially, the first thread does all the work and incurs all the faults. This can be verified by looking at the high compute time for this program. The time per page fault does not decrease as significantly as the total number of faults. This is a result of using user level threads, which are sequentialized on a single processor. When a user level thread has a page fault, the thread library will schedule another runnable thread. Once the page arrives, the faulted thread will be allowed to run only after the current thread is blocked in a system call or its timeslice expires. This can also be seen in the increased compute time in programs like LU-c, LU-n, RAYTRACE, MATMULT and TSP.

Kernel Threads

When the application threads are executed on top of kernel threads, the operating system can schedule them across multiple processors in an SMP. This is clearly seen in the reduction in overall execution time. The time spent on each barrier decreases substantially when using kernel threads, since the application threads are no longer serialized on a single processor. This effect is particularly visible in programs with high computational components, like LU-c, LU-n, RAYTRACE, WATER, MATMULT, MRI and TSP.
Figure: Execution time breakdown for Benchmark programs (for each program, bars S, M, K, C, P and B, split into Compute, Barrier Wait, Lock and Page Fault components, normalized to the sequential execution time).
Figure: Average number of Page Faults (per run time version and program).
Figure: Average time taken for a Page Fault (in ms, per run time version and program).
Figure: Average time spent per Barrier Call (per run time version and program; versions without barrier calls are marked NO BARRIER).
Figure: Average time required to acquire a Lock (in ms, per run time version and program).
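The migratory lock with local queuing, described under Synchronization primitives, can be sketched as follows (class, queue, and thread names are invented for illustration; real Strings exchanges lock messages between nodes):

```python
import collections

# Migratory lock ownership with distributed queues: only the first request
# from a node goes out to the current owner; later local requests are
# queued on the node, and local requests are served before remote ones.
class NodeLockState:
    def __init__(self):
        self.local_queue = collections.deque()
        self.remote_queue = collections.deque()
        self.request_sent = False

    def request(self, thread_id):
        """Return True when a message to the owner is actually needed."""
        self.local_queue.append(thread_id)
        if not self.request_sent:
            self.request_sent = True
            return True           # first local request: one message sent
        return False              # subsequent requests only queue locally

    def next_holder(self):
        # Local requests preempt queued remote ones (no fairness guarantee).
        if self.local_queue:
            return self.local_queue.popleft()
        if self.remote_queue:
            return self.remote_queue.popleft()
        return None

node = NodeLockState()
messages = [node.request(t) for t in ("t0", "t1", "t2")]
node.remote_queue.append("remote-node-2")
order = [node.next_holder() for _ in range(4)]
print(messages, order)   # one outgoing message; locals served before the remote
```

Serving local requesters first keeps the lock (and the Release Consistency traffic it would trigger) inside the node for as long as possible, which is exactly the deferral described in the text.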
Concurrent Server

Programs like WATER-n2 and WATER-sp have a high overhead due to locking of shared regions. This is effectively reduced by using the concurrent server, which allows requests for different locks to be served in parallel. The decrease is substantial in the case of WATER-sp, as seen from the lock acquire figure. However, in many of the programs tested, the overhead due to lock based synchronization is very low.

Polled I/O

The behavior of signal driven I/O compared to polled I/O can be explained by referring to the communication characteristics table. The overhead of signal generation becomes apparent as soon as the message size drops below a few kbytes. For larger average packet sizes, as seen in FFT, LU and RADIX, the signal driven I/O version performs as well, if not better, than polled I/O. The polled I/O provides visible benefits when the number of messages is small and the packet size is moderate, as in RAYTRACE and VOLREND.

Summing Barrier

As seen in the barrier figure, collecting all the barrier arrivals locally reduces the time per barrier in most cases. This provides an additional source of performance improvement in the programs.
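The locally summed barrier can be sketched as follows (illustrative Python; the barrier manager's message port is simulated with a list, and the node and thread counts are arbitrary):

```python
import threading

# Arrivals are first summed locally on each node; only the last arriving
# thread on a node sends a single message to the barrier manager.
class NodeBarrier:
    def __init__(self, threads_on_node, manager_inbox, node_id):
        self.lock = threading.Lock()
        self.expected = threads_on_node
        self.arrived = 0
        self.manager_inbox = manager_inbox
        self.node_id = node_id

    def arrive(self):
        with self.lock:
            self.arrived += 1
            if self.arrived == self.expected:   # last local arrival
                self.manager_inbox.append(("barrier", self.node_id, self.arrived))
                self.arrived = 0                # reset for the next barrier

inbox = []                        # the master's incoming message "port"
barrier = NodeBarrier(4, inbox, node_id=1)
workers = [threading.Thread(target=barrier.arrive) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(inbox))                 # prints: 1  (one message for four arrivals)
```

With four application threads per node, this turns four barrier messages into one, which matches the per-barrier time reduction reported above.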
Table: Communication characteristics per node. For each program (FFT, LU-c, LU-n, RADIX, RAYTRACE, VOLREND, WATER-n2, WATER-sp, MATMULT, MRI, SOR and TSP), the table lists the number of messages, the average message size in bytes, and the average latency under polled and under interrupt driven I/O.
Related Work

When compared to message passing programs, additional sources of overhead for traditional software DSM systems have been identified to include separation of data and synchronization, overhead in detecting memory faults, and absence of aggregation. Researchers have attempted to use compiler assisted analysis of the program to reduce these overheads. Prefetching of pages has been suggested by a number of groups for improving the performance of TreadMarks, by saving the overhead of a memory fault. This technique sacrifices the transparency of a page oriented DSM, but can be incorporated in parallelizing compilers. In Strings, a faulting thread does not block the execution of other application threads in the same process, hence the benefit of prefetching is not expected to be very large. Asynchronous data fetching was also identified to be a source of performance improvement. In our system, the dedicated dsm server and communication thread together hide the consistency related actions from the application threads.

SoftFLASH is a similar project, but uses kernel modifications to implement an SVM on an SMP cluster. In contrast, our implementation is completely in user space, and thus more portable. Some other research has studied the effect of clustering in SMPs using simulations; we have shown results from runs on an actual network of SMPs. HLRC-SMP is another DSM for SMP clusters. The consistency model used there is a modified version of invalidate based lazy release consistency. They do not use a threaded system, since they claim that it leads to more page invalidations in some irregular applications. Strings uses an update based protocol, and it is not clear whether the same results can be applied. Cashmere exploits the features found in the DEC MemoryChannel network interface to implement a DSM on a cluster of Alpha SMPs. Our system is more general, and provides good performance even with commodity networks; we have observed similar speedup results with a hub based Fast Ethernet network. Brazos is another DSM system designed to run on multiprocessor clusters, but only under Windows NT. The Strings run time, on the other hand, is very portable, and has currently been tested on Solaris, Linux x86, and AIX.

Conclusions

Though the performance of each implementation can be seen to depend on the data sharing and communication pattern of the application program, some general trends can be observed. It is found that using multiple processes on SMP nodes provides good speedups only in programs that have very little data sharing and communication.
all other cases the n um b er of page faults is v ery B D Fleisc h and G P op ek irage A Coheren t
Distributed Shared Memory Design in Pr o c e e dings of
high and causes excess comm unication Multiple
the A CM Symp osium on Op er ating System Principles
application threads can impro v e the p erformance
ew Y ork pp A CM
in some cases b y reducing the n um b er of page
J Bennett J Carter and W Zw aenep o el unin
faults This is v ery ectiv e when there is a large
Distributed Shared Memory Based on T yp ep eci
degree of sharing across the threads in a no de
Memory Coherence in Pr o c e e dings of the A CM Sym
p osium on the Principles and Pr actic e of Par al lel Pr o
Ho w ev er the use of user lev el threads causes an
gr amming ew Y ork pp A CM A CM
increase in computation time and resp onse time
Press
since all the threads comp ete for CPU time on a
C Amza A L Co x S Dw ark adas P Keleher H Lu
single pro cessor If k ernel threads are used addi
R Ra jamon y W Y u and W Zw aenep o el read
tionally the o v erall p erformance impro v es signi
Marks Shared Memory Computing on Net w orks of
can tly in all the programs tested Using a dedi W orkstations IEEE Computer pp F ebruary

cated communication thread to p oll for incoming
D Khandek ar Quarks Portable Distribute d Shar e d
messages is a preferred alternativ e to signal driv en
Memory on Unix Computer Systems Lab oratory Uni
I The concurren t dsm server approac h reduces
v ersit y of Utah b eta ed
the latencies for pageaults b y allo wing m ultiple
P Keleher CVM The Coher ent Virtual Machine Uni
requests to b e handled concurren tly Finally us
v ersit y of Maryland CVM V ersion ed July
S Ro y and V Chaudhary Strings A High
ing a hierarc hical summing barrier impro v es the
P erformance Distributed Shared Memory for Symmet
barrier w ait times in most of the programs
rical Multipro cessor Clusters in Pr o c e e dings of the
Ov erall using k ernel threads is v ery promis
Seventh IEEE International Symp osium on High Per
ing esp ecially for regular programs with little false
formanc e Distribute d Computing hicago IL pp
sharing Additional w ork needs to b e done to iden
July
tify the sources of o v erhead in the barrier imple H Lu A L Co x S Dw ark adas R Ra jamon y and
W Zw aenep o el ompiler and Soft w are Distributed
men tation since this dominates the execution time
Shared Memory Supp ort for Irregular Application in
in the cases where the o v erall results are not that
Pr o c e e dings of the A CM Symp osium on the Principles
go o d Our curren t w ork is to impro v e the p erfor
and Pr actic e of Par al lel Pr o gr amming
mance of the release consistency proto col
B N Bershad and M J Zek ausk as idw a y Shared
Memory P arallel Programming with En try Consistency
for Distributed Memory Multipro cessors T ec h Rep
Ac kno wledgmen ts
CMUS Carnegie Mellon Univ ersit y Pitts
burgh P A Septem b er
W e w ould lik e to thank John Carter and Dilip
J B Carter esign of the Munin Distributed Shared
Khandek ar for putting the Quarks source co de in
Memory System Journal of Par al lel and Distribute d
the public domain This allo w ed us to concen trate
Computing
our erts on dev eloping a multithr e ade d DSM
Y Zhou L Ifto de J P Singh K Li B R T o onen
W e thank the anon ymous review ers whose helpful I Sc hoinas M D Hill and D A W o o d elaxed Con
sistency and Coherence Gran ularit y in DSM Systems
commen ts shap ed an earlier v ersion of this pap er
A P erformance Ev aluation in Pr o c e e dings of the A CM
W e also thank P admanabhan Menon for p orting
Symp osium on the Principles and Pr actic e of Par al lel
the MRI co de to Strings
Pr o gr amming as V egas pp June
E Sp eigh t and J K Bennett razos A Third
Generation DSM System in Pr o c e e dings of the First
References
USENIX Windows NT Workshop August
K Li and P Hudak emory Coherence in Shared
K Thitik amol and P Keleher ultihreading and
Virtual Memory Systems A CM T r ansactions on
Remote Latency in Soft w are DSMs in Pr o c e e dings of
Computer Systems v ol pp No v em b er
the th International Confer enc e on Distribute d Com
S R oy V Chaudhary Design Issues for a Higherformanc e DSM on SMP Clusters
puting Systems Symp osium on High Performanc e Computer A r chite c
S C W o o M Ohara E T orri J P Singh and tur e
A Gupta he SPLASH Programs Characteriza R Stets S Dw ark adas N Harda v ellas G Hun t
tion and Metho dological Considerations in Pr o c e e d L Kon tothanassis S P arthasarath y and M Scott
ings of the International Symp osium on Computer A r ASHMEREL Soft w are Coheren t Shared Memory
chite ctur e pp June on a Clustered Remote rite Net w ork in Pr o c e e dings
E W F elten and D McNamee mpro ving the P er of the A CM Symp osium on Op er ating System Princi
formance of Message assing Applications b y Multi ples ain t Manlo F rance Octob er
threading in Pr o c e e dings of the Sc alable High Perfor S Ro y and V Chaudhary v aluation of Cluster In ter
manc e Computing Confer enc e pp April connects for a Distributed Shared Memory in Pr o c e e d
SY P ark J Lee and S Hariri Multithreaded ings of the IEEE Intl Performanc e Computing
Message assing System for High P erformance Dis and Communic ations Confer enc e pp F ebruary
tributed Computing Applications in Pr o c e e dings of
the IEEE th International Confer enc e on Distribute d
Systems
R Mirc handaney S Hiranandani and A Sethi m
pro ving the P erformance of DSM Systems via Compiler
In v olv emen t in Pr o c e e dings of Sup er c omputing

P Keleher and CW Tseng nhancing Soft w are
DSM for Compiler arallelized Applications in Pr o
c e e dings of International Par al lel Pr o c essing Symp o
sium August
D Jiang H Shan and J P Singh pplication
Restructuring and P erformance P ortabilit y on Shared
Virtual Memory and Hardw areoheren t Multipro ces
sors in Pr o c e e dings of the A CM Symp osium on the
Principles and Pr actic e of Par al lel Pr o gr amming as
V egas pp A CM
P Menon V Chaudhary and J G Pip e arallel
Algorithms for deblurring MR images in Pr o c e e dings
of ISCA th International Confer enc e on Computers
and Their Applic ations Marc h
A L Co x S Dw ark adas H Lu and W Zw aenep o el
v aluating the P erformance of Soft w are Distributed
Shared Memory as a T arget for P arallelizing Compil
ers in Pr o c e e dings of International Par al lel Pr o c essing
Symp osium April
S Dw ark adas A L Co x and W Zw aenep o el
n In tegrated Compileimeunime Soft w are
Distributed Shared Memory System in ASPLOS
VII Pr o c e e dings v ol am bridge Massac h usetts
pp A CM Octob er
A Erlic hson N Nuc k olls G Chesson and J Hennessy
oftFLASH Analyzing the P erformance of Clustered
Distributed Virtual Shared Memory in ASPLOS
VII Pr o c e e dings v ol am bridge Massac h usetts
pp A CM Octob er
R Saman ta A Bilas L Ifto de and J P Singh
omeased Shared Virtual Memory Across SMP
No des in Pr o c e e dings of the F ourth International