SCALABLE MESSAGE STABILITY DETECTION PROTOCOLS

A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

by
Katherine Hua Guo
May

© Katherine Hua Guo
ALL RIGHTS RESERVED

SCALABLE MESSAGE STABILITY DETECTION PROTOCOLS
Katherine Hua Guo, Ph.D.
Cornell University

In group communication, in order to deliver multicast messages reliably in a group, it is common practice for each member to maintain copies of all messages it sends and receives in a buffer for potential local retransmission. The storage of these messages is costly and buffers may grow out of bound. A form of garbage collection is needed to address this issue. Garbage collection occurs once a process learns that a message in its buffer has been received by every process in the group. The message is declared stable and is released from the buffer. An important part of garbage collection is message stability detection.

This dissertation presents the result of an investigation into message stability detection protocols. A number of message stability detection protocols used in popular reliable multicast protocols are studied, with a focus on their performance in large scale settings. This dissertation proposes a new gossip-style protocol with improved scalability and fault tolerance. This dissertation also shows that by adding a hierarchical structure to the set of basic protocols, their performance can be significantly improved when the number of participants is large.

Biographical Sketch

Katherine Hua Guo was born in Beijing, People's Republic of China. She entered the University of Science and Technology of China as a biology major. Shortly afterwards she transferred to the University of Texas at Austin in Austin, Texas. After three and a half years of college education, she earned bachelor's degrees in computer science and in mathematics, with special honor and highest honor. Then she moved to Ithaca, New York for her graduate study at Cornell University, where she earned a master's degree in computer science. Her Ph.D. followed. Now she joins Bell Laboratories in Holmdel, New Jersey.

To those on whose shoulders I stand, and to my family.

Acknowledgements

First, I want to thank the Chair of my committee, Ken Birman, for his guidance and support during my entire graduate study. His strategic insights and his unique perspective on distributed systems and on computer science research have been invaluable in my training process.

I am very lucky to have the opportunity to work with Robbert van Renesse. Step by step, he has shown me how to discover problems and solve problems which result in valuable computer system research. Werner Vogels started working with me when I lost my directions. I am grateful for his unwavering encouragement and sense of humor. I also want to thank S. Keshav for many insightful discussions during this work.

I would like to thank my committee member Steve Vavasis for carefully reading this dissertation and giving me valuable comments. My thanks also go to Nick Trefethen for his encouragement and for the opportunity to explore numerical analysis.

I wrote three papers with Luís Rodrigues at University of Lisboa. I am grateful to him for sharing his insights, knowledge, experience and time with me, and for being a true friend. I am grateful to Alexey Vaysburd, Olga Veksler and Yuri Boykov for being wonderful officemates and for many discussions that cleared up my thoughts.

Many thanks go to all the Horus researchers who were always willing to help me. I also wish to express my gratitude to Andrew Feng, who gave me help and support in too many ways to enumerate.

Looking through my past, however, I must express my ever deep gratitude to David Kincaid at the University of Texas at Austin, whose advice and guidance have helped me in choosing a career in computer science research and going through difficult times in life.

Finally, I would like to thank my parents and my brother for their constant love and support.

Table of Contents

Biographical Sketch
Dedication
Acknowledgements
List of Tables
List of Figures

1 Introduction
  Large Scale Multicast
  Two Categories of Reliable Multicast
  Separate Issues in Reliable Multicast
  Related Studies
  Dissertation Outline

2 Background
  Categories of Reliable Multicast Protocols
    Sender-initiated protocols
    Receiver-initiated protocols
    Combination of sender-initiated and receiver-initiated protocols
    Hierarchical protocols
  Buffer Management
  Garbage Collection
  Other Applications
  Summary

3 Failure Detection
  Failure Detection Algorithms
    Basic algorithm
    Gossip-style algorithm
  Integration of Stability Detection and Failure Detection
  Summary

4 Stability Detection
  Assumptions
  The Basic Protocols
    CoordP
    FullDist
    Train
    Gossip
    Analysis of the Gossip protocol
  The Structured Protocols
    S_CoordP
    S_Train
    S_Gossip
  Comparison of the Seven Protocols
  Summary

5 Simulation of Stability Detection Protocols
  The Underlying Network
  Complexity Metrics
  The Gossip Protocol
    Simulation with a fixed group size
    Adaptive method: finding the window of optimal step intervals
    Simulation with varying group sizes
    Summary
  Comparison of Various Protocols in Dense Groups with Senders
    Total number of messages on all the hops in the system
    Average and maximum queue sizes over all the nodes in the system
    Time per round (TPR)
  Comparison of Various Protocols in Sparse Groups with Senders
    Total number of messages on all the hops in the system
    Average and maximum queue sizes over all the nodes in the system
    Time per round (TPR)
  Comparison of Various Protocols in Dense Groups with One Sender
    Total number of messages on all the hops in the system
    Average and maximum queue sizes over all the nodes in the system
    Time per round (TPR)
  Comparison of Various Protocols in Sparse Groups with One Sender
  Summary

6 Stability Triggering Mechanism
  Summary

7 Discussion and Conclusions

A Simulation Results for Gossip in Dense Groups
B Simulation Results for Gossip in Sparse Groups
C Simulation Results for Dense Groups with One Sender
D Simulation Results for Sparse Groups with One Sender

Bibliography

List of Tables

Complexity of protocols (exact formula)
Complexity of protocols
The near-optimal step interval (in seconds) for Gossip for different group sizes and different numbers of senders
Total number of messages on all the hops for various protocols

List of Figures

An example run of the generic sender-based protocol when message m arrives at every receiver successfully
An example run of the generic sender-based protocol when message m does not reach receiver r2 in the original multicast
An example run of the first variation of the generic receiver-based protocol when message m2 does not reach receiver r2 in the original multicast
An example run of the second variation of the generic receiver-based protocol when message m2 does not reach any receiver in the original multicast
An example run of the third variation of the generic receiver-based protocol when message m2 does not reach receiver r2 in the original multicast
Gossip-style failure detection protocol
An example run of the gossip-style failure detection protocol
Steps of protocol CoordP
Steps of protocol FullDist
Steps of protocol Train
An example run of the gossip-style stability detection protocol Gossip
Parts I and II of the stability detection protocol Gossip. In this version, the stability array S is piggybacked on all gossip messages
Number of micro-steps needed to achieve different probabilities of incomplete stability detection
Number of steps needed to achieve different probabilities of incomplete stability detection
Cost of quality in terms of number of micro-steps
Cost of quality in terms of number of steps
Steps of protocol S_CoordP
Steps of protocol S_Train
Structure of the S_Gossip protocol
Typical network topologies: a star topology, a chain topology, and a bounded-degree tree
Simulation I (no message loss) with a dense group, for a given group size and subset size
Simulation I (no message loss) with a dense group (Parts I and II)
Simulation II (queue size) with a dense group (Parts I and II)
Simulations I and II with sparse groups
Part I of the adaptive algorithm: finding a near-minimum TPR. f(x) is the average measured TPR value for a step interval value x
Part II of the adaptive algorithm: finding the optimal step interval window given a near-minimum TPR. f(x) is the average measured TPR value for a step interval value x
Near-minimum TPR and the corresponding number of steps needed in a round for sparse groups in Simulation II using the global gossip scheme with senders. A data point with subset size x and step interval y seconds is labeled as (x, y)
Near-minimum TPR and the corresponding number of steps needed in a round for sparse groups in Simulation II using the global gossip scheme with one sender. A data point with subset size x and step interval y seconds is labeled as (x, y)
Total number of messages on all the hops for the four basic protocols in dense groups with senders
Total number of messages on all the hops for the three structured protocols in dense groups with senders
Total number of messages on all the hops for the basic and their corresponding structured protocols in dense groups with senders (Parts I and II)
Average and maximum queue sizes over all the nodes for the basic and structured protocols in dense groups with senders
Time per round (TPR) for the four basic protocols in dense groups with senders
Time per round (TPR) for Gossip and FullDist employing the scattering mechanism in dense groups with senders
Time per round (TPR) for the three structured protocols in dense groups with senders
Time per round (TPR) for the basic and their corresponding structured protocols in dense groups with senders (Parts I and II)
Total number of messages on all the hops for the four basic protocols in sparse groups with senders
Total number of messages on all the hops for the three structured protocols in sparse groups with senders
Total number of messages on all the hops for the basic and their corresponding structured protocols in sparse groups with senders (Parts I and II)
Average queue size over all the nodes for the four basic protocols in sparse groups with senders
Average queue size over all the nodes for the structured protocols in sparse groups with senders
Maximum queue size over all the nodes for the four basic protocols in sparse groups with senders
Maximum queue size over all the nodes for the structured protocols in sparse groups with senders
Time per round (TPR) for the four basic protocols in sparse groups with senders
Time per round (TPR) for the structured protocols in sparse groups with senders
Time per round (TPR) for the basic and their corresponding structured protocols in sparse groups with senders (Parts I and II)
Total number of messages on all the hops for Gossip and S_Gossip with different numbers of senders in dense groups
Average queue size over all the nodes with different numbers of senders for the basic and structured protocols in dense groups
Maximum queue size over all the nodes with different numbers of senders for the basic and structured protocols in dense groups
Time per round (TPR) with different numbers of senders for the four basic protocols in dense groups
Time per round (TPR) with different numbers of senders for the three structured protocols in dense groups
Optimal number of messages in the output buffer when the stability detection protocol should be triggered (two figures)
(Appendix A) Simulation I (no message loss) with a dense group of size n (Parts I and II)
(Appendix A) Simulation II (queue size and lost messages per step) with a dense group of size n (Parts I and II)
(Appendix A) Simulation III (lost messages per step) with a dense group of size n (Parts I and II)
(Appendix B) Simulation I (no message loss) with sparse groups of size n (Parts I and II)
(Appendix B) Simulation II (queue size and lost messages per step) with sparse groups of size n (Parts I and II)
(Appendix B) Simulation III (lost messages per step) with sparse groups of size n (Parts I and II)
(Appendix C) Total number of messages on all the hops for the four basic protocols in dense groups with one sender
(Appendix C) Total number of messages on all the hops for the three structured protocols in dense groups with one sender
(Appendix C) Total number of messages on all the hops for the basic and their corresponding structured protocols in dense groups with one sender (Parts I and II)
(Appendix C) Average and maximum queue sizes over all the nodes for the basic and structured protocols in dense groups with one sender
(Appendix C) Time per round (TPR) for the basic and structured protocols in dense groups with one sender
(Appendix C) Time per round (TPR) for the basic and their corresponding structured protocols in dense groups with one sender (Parts I and II)
(Appendix D) Total number of messages on all the hops for the four basic protocols in sparse groups with one sender
(Appendix D) Total number of messages on all the hops for the three structured protocols in sparse groups with one sender
(Appendix D) Total number of messages on all the hops for the basic and their corresponding structured protocols in sparse groups with one sender (Parts I and II)
(Appendix D) Maximum queue size over all the nodes for the four basic protocols in sparse groups with one sender
(Appendix D) Maximum queue size over all the nodes for the structured protocols in sparse groups with one sender
(Appendix D) Average queue size over all the nodes for the four basic protocols in sparse groups with one sender
(Appendix D) Average queue size over all the nodes for the structured protocols in sparse groups with one sender
(Appendix D) Time per round (TPR) for the four basic protocols in sparse groups with one sender
(Appendix D) Time per round (TPR) for the structured protocols in sparse groups with one sender
(Appendix D) Time per round (TPR) for the basic and their corresponding structured protocols in sparse groups with one sender (Parts I and II)

Chapter 1

Introduction

Multicast is an efficient communication paradigm for disseminating data in a group with a sender and a set of receivers. Typically a multicast group is identified by a single group address. The semantics of reliable multicast communication are normally defined such that all members of the group need to receive a copy of the multicast message. Informally, this means that all correct processes deliver the same set of messages, and that this set includes all messages multicast by correct processes and no spurious messages.

The growth of the Internet has triggered the widespread use of real-time multicast, including applications that support voice (for example, vat and NeVoT) and video, which do not require reliable multicast. There are also applications that do require reliable multicast, such as shared whiteboards (for example, wb).

Many other multicast applications also require reliable delivery of data to all the receivers. For example, reliable multicast is used in Distributed Interactive Simulations (DIS) for dynamic terrain updates. It is used for dissemination of stock quotes to a large number of clients and for distribution of software products to groups of customers. It is also used by web servers to send updates of web pages to their proxies.

The increasing popularity of end-to-end multicast applications, supporting either video-conferencing, Computer Supported Collaborative Work (CSCW), or reliable data dissemination over the Internet, is making the provision of reliable and unreliable multicast services an integral part of its architecture.

Large Scale Multicast

In the near future, global information exchange will become an essential part of everyday life. Driven by the availability of high speed networks and powerful processors, more and more applications will require reliable data transfer within large groups whose members may be spread all over the world. Therefore, forthcoming communication systems must scale well with respect to both the number of group members and geographical expansion.

Meanwhile, the widespread availability of IP multicast and the MBone has dramatically increased the geographic extent and the size of communication groups.

To support reliable multicast communication in such a scenario, efficient and scalable multicast control mechanisms have become more and more essential.

Two Categories of Reliable Multicast

Generally, reliable multicast techniques fall into two categories, sender-initiated and receiver-initiated, both of which employ a sequential numbering of data messages at the sender.

The sender-initiated approach is based on the use of positive acknowledgments (ACKs). It places the responsibility on the sender, which maintains state information of all the receivers that it is multicasting data to. The receivers acknowledge the receipt of data messages by sending unicast ACKs to the sender. The sender keeps track of from whom it has received ACKs for each multicast message. A timer is associated with each message at the sender, and whenever the timer expires before ACKs from all the receivers come in, the sender re-multicasts the message.

In contrast, the receiver-initiated approach does not use ACKs at all. Whenever receivers detect a missing message by observing gaps in the sequence number stream of data messages, they send negative acknowledgments (NAKs), which serve as repair requests. In general, none of the members keep the state information regarding the set of receivers. If the receivers send repair requests to the sender, then the sender is responsible for data retransmission; if the receivers multicast repair requests in the group, then any member who has the requested message may conduct the retransmission. The retransmission is multicast to the entire group.

Separate Issues in Reliable Multicast

Much work has been done on the performance issues of point-to-point and point-to-multipoint transport protocols. The flow control, transmission error recovery and buffer space management issues seem to interact in a rather complicated way.

We follow the discipline first proposed by Clark et al., whereby any transport protocol that operates efficiently decouples flow control and error control. Mixing the two in a single mechanism can make flow control vulnerable to transmission errors and delays. Under this discipline, the protocol uses a window for error control only. In practice, some protocols follow this discipline and some do not.

Protocols like TCP use the same window both for flow control and for error control. A TCP transmitter stops at the window boundary and waits for a new acknowledgment before it can continue. This wait synchronization serves as TCP's flow control mechanism and can often cause performance degradation.

On the other hand, protocols like NETBLT, the Blast File Transfer Protocol (FTP) and the StarBurst Multicast File Transfer Protocol (MFTP) separate error control from their rate-based flow control mechanism. NETBLT maintains a large window at the sender, whereas Blast FTP and StarBurst MFTP keep the entire file to be transferred in the error control window.

Out of the three important issues in the design of reliable multicast protocols (flow control, error control and buffer management for the error control window), this dissertation investigates the issue of buffer management, that is, how to free messages from the error control window.

A message can be released from the buffer at any member (a sender or a receiver) only if it is received by every member in the group, at which point the message is called stable. Therefore, buffer management is essentially the same as detecting message stability and releasing stable messages from the buffers.

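As a rough illustration of the stability condition just described, the following Python sketch assumes that a member already knows which sequence numbers every group member has received. Gathering that knowledge is exactly the job of a stability detection protocol, so this only shows the final release step. The names `stable_messages` and `garbage_collect` are hypothetical and do not come from the dissertation.

```python
from typing import Dict, Set


def stable_messages(received: Dict[str, Set[int]]) -> Set[int]:
    """Return the sequence numbers that every member has received.

    `received` maps a member name to the set of sequence numbers
    that member is known to hold (an assumed input format).
    """
    if not received:
        return set()
    per_member = iter(received.values())
    stable = set(next(per_member))
    for seqs in per_member:
        stable &= seqs  # a message is stable only if all members have it
    return stable


def garbage_collect(buffer: Dict[int, bytes],
                    received: Dict[str, Set[int]]) -> Dict[int, bytes]:
    """Return a new buffer with all stable messages released."""
    stable = stable_messages(received)
    return {seq: msg for seq, msg in buffer.items() if seq not in stable}


if __name__ == "__main__":
    knowledge = {"s": {1, 2, 3}, "r1": {1, 2}, "r2": {1, 3}}
    buf = {1: b"a", 2: b"b", 3: b"c"}
    print(sorted(garbage_collect(buf, knowledge)))
```

In this toy run only message 1 has reached every member, so it is the only one released from the buffer.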
Related Studies

There have been comparative analyses of sender-initiated and receiver-initiated reliable multicast protocols by Pingali et al. and by Levine. These studies conduct throughput analysis of different protocols based on processing requirements at both the sending and receiving hosts, rather than on the communication bandwidth requirements.

This dissertation studies only one aspect of reliable multicast protocols: the buffer management mechanisms. The focus is on the scalability of these protocols. We examine various buffer management mechanisms for reliable multicast protocols in a large scale environment, where messages may be dropped and processes may crash. We take both processing and communication bandwidth requirements into account and study delay performance using event simulation.

Dissertation Outline

The results presented in this dissertation are based on simulations. They apply to generic protocols rather than to specific implementations. We believe that they provide valuable insight for the design of next generation scalable reliable multicast protocols. Chapter 2 starts with a study of error control mechanisms in reliable multicast protocols, then introduces the need to do stability detection in reliable multicast and other important protocols. In order to detect message stability, it is necessary to have group membership information. Chapter 3 presents the failure detection protocol and the integration of failure detection and stability detection. The detailed comparison of stability detection protocols is presented in Chapter 4, followed by the simulation results in Chapter 5. Chapter 6 investigates the mechanism to invoke the detection of stability. Chapter 7 concludes the dissertation. The appendices at the end provide some additional simulation results.

Chapter 2

Background

Categories of Reliable Multicast Protocols

With respect to mechanisms for error correction, reliable multicast protocols can be broadly separated into two categories: sender-initiated and receiver-initiated. A sender-initiated protocol is defined as one in which the sender gets positive acknowledgments (ACKs) from all the receivers periodically and releases messages from its buffer accordingly. A receiver-initiated protocol is defined as one in which the receivers send negative acknowledgments (NAKs) when they detect message losses, and they never send any ACKs to the sender.

Sender-initiated protocols

Sender-initiated reliable multicast protocols are based on the use of ACKs. The responsibility of providing message reliability is placed on the sender, which maintains state information regarding all receivers to which it is multicasting data. The sender keeps the list of all receivers and, for each packet, the list of receivers from which it has received ACKs.

Reliability and the maintenance of this state information are ensured by having the receivers return ACKs for messages correctly received, and by using timers at the sender for the purpose of detecting message losses.

A generic protocol might operate as follows:

- Whenever the sender multicasts a message, it starts a timer associated with this message.
- Periodically, a receiver sends back a unicast ACK to the sender identifying the messages it has correctly received so far. The ACK might indicate either a specific message or a window of messages.
- Upon receipt of an ACK, the sender updates its ACK lists associated with the messages indicated in the ACK.
- Whenever the timer expires before ACKs from all the receivers arrive at the sender, the sender re-multicasts the message.
- Whenever the ACK list associated with a message contains all the receivers in the group, the sender can release this message from its buffer and cancel the corresponding timer.

Two example runs of the sender-initiated protocol are given in the accompanying figures. In the first case, once the sender s multicasts the message m to receivers r1 and r2, it starts a timer for m. After receiving the unicast ACK from both r1 and r2, the sender cancels the timer.

[Figure: An example run of the generic sender-based protocol when message m arrives at every receiver successfully.]

In the second case, message m is multicast in the group; m arrives at r1 successfully, but it is lost on its way to r2. When the timer for m expires, the sender s has not received an ACK from r2; therefore it re-multicasts m.

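The generic sender-initiated steps above can be sketched in Python roughly as follows. This is a minimal single-process illustration, not the dissertation's implementation: time is passed in explicitly instead of using real timers, each ACK names a single message rather than a window, the network is replaced by a `sent` log, and the class and method names (`Sender`, `on_ack`, `on_tick`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class PendingMessage:
    payload: str
    deadline: float                                   # when the timer expires
    acked_by: Set[str] = field(default_factory=set)   # the ACK list


class Sender:
    def __init__(self, receivers: Set[str], timeout: float) -> None:
        self.receivers = set(receivers)
        self.timeout = timeout
        self.buffer: Dict[int, PendingMessage] = {}
        self.next_seq = 0
        self.sent: List[Tuple[int, str]] = []         # log of multicasts

    def multicast(self, payload: str, now: float) -> int:
        """Multicast a message and start its timer."""
        seq = self.next_seq
        self.next_seq += 1
        self.buffer[seq] = PendingMessage(payload, now + self.timeout)
        self.sent.append((seq, payload))
        return seq

    def on_ack(self, receiver: str, seq: int) -> None:
        """Update the ACK list; release the message once it is complete."""
        pending = self.buffer.get(seq)
        if pending is None or receiver not in self.receivers:
            return
        pending.acked_by.add(receiver)
        if pending.acked_by == self.receivers:
            del self.buffer[seq]      # release from buffer, timer is dropped

    def on_tick(self, now: float) -> None:
        """Re-multicast every buffered message whose timer has expired."""
        for seq, pending in self.buffer.items():
            if now >= pending.deadline:
                self.sent.append((seq, pending.payload))
                pending.deadline = now + self.timeout


if __name__ == "__main__":
    s = Sender({"r1", "r2"}, timeout=1.0)
    m = s.multicast("m", now=0.0)
    s.on_ack("r1", m)        # r2's copy was lost, so no ACK from r2
    s.on_tick(now=1.0)       # timer expires: m is re-multicast
    s.on_ack("r2", m)        # now every receiver has acknowledged m
    print(s.sent, s.buffer)
```

The usage at the bottom mirrors the second example run: the message stays buffered until the ACK list covers the whole group.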
The sender-initiated approach is used in early reliable multicast protocols such as the Xpress Transport Protocol (XTP). (XTP actually uses a combination of both sender-initiated and receiver-initiated approaches.)

The main limitation of the sender-initiated protocol is that the sender needs to know the set of receivers and needs to process ACKs from all the receivers. When the group size is large, the sender can be overwhelmed by the large amount of state information it must maintain, and can experience an ACK implosion problem.

[Figure: An example run of the generic sender-based protocol when message m does not reach receiver r2 in the original multicast.]

The ACK implosion problem is as follows. As the number of receivers becomes very large, the sender is overwhelmed with ACK messages from all the receivers. The network becomes congested from the arrival of a large number of messages in a short period of time, and the sender process experiences significant overhead due to the processing of the large number of ACK messages.

ACK implosion has the following impact:

- First, the tremendous number of ACK messages results in processing overhead at the sender and results in delays in data communication.
- Second, a large number of ACK messages can cause an excess use of both buffer space and bandwidth, triggering additional message losses.

One approach to address the limitations of sender-initiated protocols is to use NAKs instead of ACKs for error detection, and to get rid of state information regarding all the receivers. We call this a receiver-initiated mechanism.

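To make the scaling concern concrete, here is a small back-of-the-envelope sketch in Python. It assumes the simplest case in which every receiver sends one unicast ACK per data message (the text also allows an ACK to cover a window of messages, which would lower these figures). The function names and the example numbers are illustrative assumptions, not measurements from the dissertation.

```python
def acks_per_second(receivers: int, message_rate: float) -> float:
    """ACK arrival rate at the sender, assuming one ACK per message per receiver."""
    return receivers * message_rate


def sender_state_entries(receivers: int, buffered_messages: int) -> int:
    """Worst-case ACK-list entries the sender tracks for its buffered messages."""
    return receivers * buffered_messages


if __name__ == "__main__":
    # Hypothetical group: 1000 receivers, 50 messages/s, 200 messages buffered.
    print(acks_per_second(1000, 50.0))
    print(sender_state_entries(1000, 200))
```

Both quantities grow linearly with the group size, which is why the load at the single sender becomes the bottleneck as groups grow.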
Receiver-initiated protocols

Receiver-initiated protocols shift the burden of providing reliable data transfer to the receivers and conduct error control based on negative acknowledgments (NAKs).

In this approach, the receivers are responsible for error detection and error recovery. They do not need to return status reports or acknowledgments to the sender. After a receiver detects lost messages by observing gaps in message sequence numbers, it informs other members via NAKs that solicit retransmissions.

In order to guard against either the loss of the NAK or the subsequent message retransmission, the receiver starts a timer for each missing message. If the timer expires before the message is received, the receiver sends out another NAK message and restarts the timer. The timer is canceled upon successful receipt of the message.

To aid loss detection for the last message in a burst, the sender must multicast the sequence number of that last message periodically.

A generic protocol might operate as follows:

- Whenever a receiver detects a lost message, it sends a NAK and then starts a timer associated with this message.
- Whenever the timer expires before the requested message arrives, a receiver sends a NAK again and restarts the timer.

There are two approaches to send retransmission requests (NAKs): one is to unicast to the sender, the other is to multicast to the entire group. Combined with the way retransmission is sent out, we have the following three variations:

- In the first variation, upon receipt of a unicast NAK, the sender re-multicasts the requested message.
- In the second variation, upon receipt of a multicast NAK, the sender re-multicasts the requested message. Whenever a receiver receives a NAK requesting the same message it has requested in its own NAK, it resets the timer associated with the missing message. This mechanism suppresses redundant NAKs by backing off the timers at receivers.
- The third variation is the same as the second one, except that any member with a copy of the requested message can re-multicast it in the group upon receipt of a NAK.

Examples of the three variations of the receiver-initiated protocol are presented in the accompanying figures. The first of these illustrates the first variation of the protocol. The sender s multicasts messages m1, m2 and m3 to the two receivers. Out of these three multicast messages, m1 and m3 arrive at both receivers successfully; m2 arrives at r1 but not r2. As soon as r2 detects the loss of m2 from the gap in the sequence number space, it sends a unicast NAK message to the sender s requesting the retransmission of m2, and starts a timer for the NAK message of m2. Once the sender receives the NAK from r2, it re-multicasts m2 to both receivers. r1 ignores the message since it is a duplicate. After r2 receives m2 successfully, it cancels the timer for the NAK.

[Figure: An example run of the first variation of the generic receiver-based protocol when message m2 does not reach receiver r2 in the original multicast.]

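The receiver side of the first variation (gap detection, NAK, and NAK timer) can be sketched in Python as below. It is a simplified model under several assumptions: sequence numbers start at 1, time is passed in explicitly, NAKs are recorded in a list rather than sent over a network, and the names `Receiver`, `on_data` and `on_tick` are hypothetical.

```python
from typing import Dict, List


class Receiver:
    def __init__(self, timeout: float, first_seq: int = 1) -> None:
        self.timeout = timeout
        self.next_expected = first_seq
        self.delivered: Dict[int, str] = {}
        self.nak_timers: Dict[int, float] = {}   # missing seq -> timer deadline
        self.naks_sent: List[int] = []           # log of NAKs sent to the sender

    def _send_nak(self, seq: int, now: float) -> None:
        """Send a NAK for `seq` and (re)start its timer."""
        self.naks_sent.append(seq)
        self.nak_timers[seq] = now + self.timeout

    def on_data(self, seq: int, payload: str, now: float) -> bool:
        """Handle a data message; return False if it is a duplicate."""
        if seq in self.delivered:
            return False                         # duplicate: ignore
        self.delivered[seq] = payload
        self.nak_timers.pop(seq, None)           # cancel the timer, if any
        # A gap in the sequence number stream reveals lost messages.
        for missing in range(self.next_expected, seq):
            if missing not in self.delivered and missing not in self.nak_timers:
                self._send_nak(missing, now)
        self.next_expected = max(self.next_expected, seq + 1)
        return True

    def on_tick(self, now: float) -> None:
        """Resend a NAK for every missing message whose timer has expired."""
        for seq, deadline in list(self.nak_timers.items()):
            if now >= deadline:
                self._send_nak(seq, now)


if __name__ == "__main__":
    r2 = Receiver(timeout=1.0)
    r2.on_data(1, "m1", now=0.0)
    r2.on_data(3, "m3", now=0.2)   # gap: m2 is missing, NAK is sent
    r2.on_data(2, "m2", now=0.6)   # retransmission arrives, timer canceled
    print(r2.naks_sent, r2.nak_timers)
```

The usage follows the example run above from the point of view of r2: the arrival of m3 exposes the gap, and the retransmitted m2 cancels the NAK timer.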
The next figure presents a sample run of the second variation. Message m2 fails to arrive at either r1 or r2. One receiver detects the loss first; it then multicasts a NAK message for m2 and starts a timer for the NAK. After the other receiver receives this NAK, it starts a timer for the NAK for m2 as if it had just sent out the NAK itself. Once the sender s receives the NAK, it re-multicasts m2 in the group. After m2 arrives at r1 and r2, both receivers cancel their corresponding timers for the NAK.

Figure: An example run of the second variation of the generic receiver-initiated protocol when message m2 does not reach any receiver in the original multicast. (Timeline for s, r1 and r2: s multicasts m1, m2 and m3; NAK_m2 is multicast; r1 and r2 each start a timer for NAK_m2; s re-multicasts m2; r1 and r2 each cancel the timer for NAK_m2.)

In the last of the three figures, the third variation is used. Message m2 arrives at r1 but not r2. Once r2 detects the loss, it multicasts a NAK message requesting m2. It also starts a timer for this NAK. As soon as r1 receives this NAK request, since it has already received m2, it re-multicasts m2 in the group. Once m2 arrives at r2, the timer for the NAK is canceled. The sender s receives the NAK for m2 after it receives m2; it ignores the NAK accordingly, since it knows some other member has already re-multicast m2.

Figure: An example run of the third variation of the generic receiver-initiated protocol when message m2 does not reach receiver r2 in the original multicast. (Timeline for s, r1 and r2: s multicasts m1, m2 and m3; r2 starts a timer for NAK_m2 and multicasts NAK_m2; r1 re-multicasts m2; s ignores NAK_m2; r2 cancels the timer for NAK_m2.)

The first variation is used by Birman in the ISIS reliable multicast protocol [Bir] in a local area network (LAN) environment.
The second variation is proposed by Ramakrishnan and Jain in the Negative Acknowledgments with Periodic Polling (NAPP) protocol for a LAN. Jacobson has used similar ideas to implement a reliable multicast protocol suitable for the wide area network (WAN).
The third variation is used in the Scalable Reliable Multicast (SRM) protocol designed by Floyd et al. [JL], where any member which has the requested message may conduct the retransmission. This technique effectively reduces the burden on the sender and shortens the retransmission delay.
Because the sender does not have group membership information and the receivers do not send feedback to the sender upon successful receipt of messages, the sender has no mechanism to ascertain when it can safely release messages from its buffer. Furthermore, this approach is not suitable to provide a fully reliable communication service, because there is no mechanism for any member to detect when a receiver fails or leaves the group.
The benefit of this scheme is that, since the state information is minimal, it scales well.
Clearly, in the receiver-initiated approach, in order to handle possible retransmission requests, if the sender is responsible for retransmission, then all the multicast messages must be kept in the sender's buffer, whereas if all the members are responsible for retransmission, then all the messages must be maintained in the buffers of all the members.
A pure receiver-initiated approach lacks a buffer management scheme to detect when a message has been received by all the members and can therefore be safely released from each member's buffer. To combat this limitation of the pure receiver-initiated approach, in practice many protocols combine the use of both ACKs and NAKs.
Combination of sender-initiated and receiver-initiated protocols
In the hybrid approach that combines the sender-initiated and receiver-initiated mechanisms, regardless of whether the sender or the receivers are in charge of detecting message losses and conducting retransmission (that is, regardless of whether ACKs or NAKs are used for retransmission requests), the sender is in charge of releasing a message from its buffer and the receivers' buffers after every receiver has positively acknowledged receipt of the message. ACKs are employed by the sender to ascertain that it is safe to release messages from memory.
The combination of ACKs and NAKs has been used extensively in reliable multicast protocols. For example, the Xpress Transport Protocol (XTP), the Negative Acknowledgments with Periodic Polling (NAPP) protocol, and the StarBurst Multicast File Transfer Protocol (StarBurst MFTP) group together large partitions of data messages that are periodically ACKed, while lost messages within the partition are NAKed.
Hierarchical protocols
A hierarchical structure can reduce the ACK implosion problem in sender-initiated protocols. The Reliable Multicast Transport Protocol (RMTP), developed at Bell Labs by Sabnani et al. [LPB], employs a hierarchy by dividing the receivers into some number of subsets and using Designated Receivers (DRs), one per subset, to collect ACKs from the other members in that subset. Thus the sender only keeps the list of DRs, and each DR keeps the membership information of its subset. This hierarchy reduces both the amount of state information required at the sender and the number of ACKs collected by the sender.
The receiver-initiated protocol can also benefit from a hierarchy. For example, the Log-Based Receiver-reliable Multicast (LBRM) developed by Holbrook et al. [SC] uses a hierarchy of log servers to store all multicast messages indefinitely. Receivers request retransmissions of lost messages from a log server. The log servers significantly reduce the sender's burden of handling retransmissions.
The Local Group-based Multicast Protocol (LGMP) by Hofmann is based on the concept of Local Groups, where lost messages are first recovered inside Local Groups using NAKs. A message is requested from the sender only if not a single member of the Local Group holds a copy of the missing message.
Buffer Management
Transport level buffers have a certain bound in terms of size. For example, in Unix the default TCP buffer size for each connection is a fixed number of kilobytes at the sender and at the receiver. The purpose of transport level buffers is to store messages for a reasonable amount of time in case retransmission is needed. After these buffers become full, new messages are either dropped or forced to replace existing messages in the buffer, depending on the buffer management scheme.
In the following simple calculation, we assume the sender is responsible for retransmission and that it stores all the multicast messages it has sent out in its buffer. Assume the sender multicasts fixed-size messages at the rate of r messages per second, and the message size is k bytes. Also assume the buffer size at the sender is B bytes. Then t = B/(rk) seconds after the sender starts multicasting, the buffer will be full. After this point, new messages have to replace old messages in the buffer. If the time it takes to detect message loss and send back NAK messages is greater than t, then the original message is no longer available in the transport level buffer for retransmission.
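As a worked illustration of this calculation, the short sketch below computes the fill time t = B/(rk) and checks whether a lost message would still be in the buffer when its NAK arrives. The function names and the sample numbers are assumptions chosen for illustration; they do not come from the text.

```python
def buffer_fill_time(buffer_bytes, rate_msgs_per_sec, msg_bytes):
    """Seconds until a sender buffer of B bytes fills: t = B / (r * k)."""
    if rate_msgs_per_sec <= 0 or msg_bytes <= 0:
        raise ValueError("rate and message size must be positive")
    return buffer_bytes / (rate_msgs_per_sec * msg_bytes)


def still_buffered(nak_delay_sec, buffer_bytes, rate_msgs_per_sec, msg_bytes):
    """True if a NAK arriving after nak_delay_sec can still be served.

    Assumes old messages are overwritten in arrival order once the
    buffer is full, as in the calculation above.
    """
    return nak_delay_sec <= buffer_fill_time(
        buffer_bytes, rate_msgs_per_sec, msg_bytes)
```

For instance, with a hypothetical 64 KB buffer, 1 KB messages and 100 messages per second, the buffer fills in 0.64 seconds, so a loss reported one second later can no longer be repaired from the transport level buffer.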
If the application layer can store all relevant data, then lost messages that are not in the transport level buffers can be reconstructed by the application and retransmitted; otherwise they are lost and can not be retransmitted.
SRM uses the first approach, employing a concept called Application Level Framing (ALF) proposed by Clark and Tennenhouse. In the design principle of ALF, the application breaks the data into suitable aggregates called Application Data Units, or ADUs. ADUs take the place of the packet as the unit of manipulation. When mis-ordered or incomplete data units occur, the application, rather than the transport protocol, provides the data for retransmission by reconstructing the data.
Garbage collection

In scalable reliable multicast protocols, it is most efficient to use the local repair scheme, that is, for each group member to retransmit messages in response to requests by other members that have detected message losses. For applications that require messages to be delivered to all correct processes in the group, it is also necessary to buffer all the received messages at every member to handle the case of sender crash and network partition.
On the other hand, the storage of these messages is costly and the buffer space at each member is limited, preventing the protocols from scaling to a large group size. A form of garbage collection is needed to address this issue.
In order for members that join the group late to catch up with the rest of the group, a small number of members are designated as Late-Join Handlers (LJHs). LJHs keep all messages they sent and received in their buffers. The decision of how long the LJHs should keep multicast messages is made by applications instead of the garbage collection mechanism.
Whenever a data message has been received by all the members, only the LJHs should store the message. Other members should discard it, since none of the members in the current multicast group needs retransmission. A message is called stable if it has been received by all members of the group. To do this garbage collection, a mechanism is needed to detect which messages are stable. Also, a failure detection mechanism is needed to report the current group membership; otherwise a failed member could prevent garbage collection altogether.
Under the ideal situation where the sender does not crash and there is no network partition, sender-initiated protocols already have built-in buffer management and garbage collection mechanisms. After the sender gets ACKs from all the receivers, it can detect which messages are stable and can be released from each member's buffer. Receiver-initiated protocols do not have this stability detection mechanism, since none of the members have group membership information.
If we want to guarantee reliable message delivery in the face of sender crash and network partition, then we do need a mechanism to do stability detection and buffer management for both sender-initiated and receiver-initiated protocols.
The problem of buffer management is largely ignored in existing scalable reliable multicast protocols in the literature. Based on the concept of Application Level Framing (ALF), the issue of how long a message should be buffered is handled by the application layer. For example, in SRM all messages belonging to the current whiteboard session are stored in the buffer of each group member. Depending on how long a session is, the number of messages in a session could be unbounded. To our knowledge, this is the first comprehensive study of transport level buffer management for reliable multicast protocols.
This dissertation evaluates the scalability and performance of existing stability detection protocols and further proposes new protocols.
Other Applications
The message stability detection mechanism can also support atomic message ordering, which means that a message is not delivered to any group member until all the members have received it. For example, the research reported here was triggered by a problem that a Swiss bank faced when using the Isis group communication system [Bir]. In their setup, they had two server machines and about a hundred PC workstations organized in a group. The servers multicast updates to replicated data maintained at each of the workstations. The updates had to be delivered atomically in spite of server failures. Therefore the workstations had to buffer the data until it was known that the data had been delivered everywhere. The rate of the updates was sometimes so high that the Isis stability detection protocol was not able to keep up, and buffers grew too large. The effect was much exacerbated by the fact that multiple groups were used. Correct ordering between groups required that switching from sending in one group to another was done only after the messages sent and delivered in the first group had become stable.
These machines were inside a single branch and on a single local area network. One can easily envision multiple branches being linked together, with many hundreds if not thousands of machines. Our interest is in finding a scalable stability detection protocol that meets future requirements.
The concept of message stability has been used in more traditional areas such as distributed database management and parallel computing. In distributed database systems, partial failures of transactions can lead to inconsistent results. Therefore the termination of a transaction that updates distributed data has to be coordinated among its participants. In the atomic commit protocols [HG], a process can not commit a transaction until everybody else has agreed to commit. This is similar to message stability detection protocols, in which all processes must deliver a message if any does so.
In parallel computing, barrier synchronization [KB] requires that all processes execute the barrier construct before any process can proceed past it to the next statement. Every process has to know if all other processes have reached the barrier before it can proceed again. This is also an agreement problem, similar to message stability and atomic commit.
Summary
This chapter provides the background to the study of message stability detection protocols to be presented in subsequent chapters. Two categories of reliable multicast protocols, the sender-initiated and the receiver-initiated protocols, are described along with the pros and cons of each category. Two common approaches to improve the scalability of reliable multicast protocols are studied: one is to combine the two categories of protocols, the other is to employ a hierarchical structure. The question of efficient buffer management and the concept of garbage collection are raised in the context of reliable multicast.
This background makes it possible to now study the failure detection algorithms in the next chapter, and to look in detail at the different schemes to conduct message stability detection in the chapters after it.

Chapter
Failure Detection
To do effective garbage collection, one must detect when a message is stable and obtain a consistent view of the current group membership. Therefore the stability detection protocol and the failure detection protocol are two integral parts of garbage collection. This chapter describes the failure detection algorithms.
Failure Detection Algorithms
To conduct multicast in a distributed environment, one must face the problem of dynamic group membership changes. Initially there are n members in the group, numbered consecutively. As time passes, new members might join the group, and existing members might crash or might leave the group voluntarily. The failure detection problem is to detect which members of the group are still operational and therefore constitute the current membership.

Basic algorithm
Traditionally, failure detection algorithms are based on time-out mechanisms. In general, a failure is detected when the lack of response from a remote member or process makes communication protocols unable to make progress. Under the assumption that every member process is constantly sending out messages, if a member has not been heard from after a certain time, it is assumed to have crashed or left the group. The failure detection algorithms normally reside in the transport layer that implements inter-process communication.
To ensure timely detection of failures when data traffic is low or unidirectional, some systems require each member to multicast I-am-alive session messages in the group periodically.
In general, failure detection algorithms can be divided into two categories under the same time-out principle, according to Vogels.
The first scheme uses a heartbeat mechanism, where each process sends out I-am-alive session messages in the group of processes using multiple point-to-point messages or a single IP-multicast message. Each process records the reception times of messages; if a number of consecutive heartbeats from a certain member are missing, a suspicion is raised for this member. The lengths of inter-heartbeat gaps and time-out periods are configurable by the application. The application can also piggyback data messages on the heartbeat messages.
The second scheme uses a polling method, where the failure detector sends request messages to processes in the group and collects acknowledgments from them. If no acknowledgments are received after a number of retries, the failure detector raises a suspicion. Poll periods, time-outs and retransmission limits are also configurable by the application.
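The heartbeat scheme can be sketched as below. This is a minimal sketch under stated assumptions, not a description of any particular system: the class `HeartbeatDetector`, its parameters, and the rule "suspect after `max_missed` inter-heartbeat intervals of silence" are hypothetical simplifications, and time is passed in explicitly rather than read from a clock.

```python
class HeartbeatDetector:
    """Hypothetical time-out based detector driven by I-am-alive messages."""

    def __init__(self, interval, max_missed):
        self.interval = interval      # expected gap between heartbeats (seconds)
        self.max_missed = max_missed  # consecutive misses tolerated
        self.last_heard = {}          # member -> time of last heartbeat

    def heartbeat(self, member, now):
        """Record the reception time of a heartbeat (or piggybacked data)."""
        self.last_heard[member] = now

    def suspects(self, now):
        """Members silent for more than max_missed heartbeat intervals."""
        timeout = self.interval * self.max_missed
        return sorted(m for m, t in self.last_heard.items()
                      if now - t > timeout)
```

Both `interval` and `max_missed` correspond to the application-configurable settings mentioned above; a polling detector would differ mainly in that the detector, not the member, initiates each exchange.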
When scaled up to more than several dozen members, many failure detectors are either unreasonably slow or make too many false detections, as reported by van Renesse et al. in [RMH]. The reason is that in either the heartbeat method or the polling method, the large number of I-am-alive messages and polling requests adds unnecessary load to the system.
Gossip-style algorithm
We propose a new gossip-style algorithm to do failure detection. The protocol is based on the gossiping technique pioneered in the Clearinghouse project. Over a decade before that, Baker and Shostak described a gossip protocol using ladies and telephones, before the widespread use of computers and networks.
The goal of gossip protocols is to distribute new information in the group. The mechanism is for each member to forward new information to randomly chosen members periodically. The randomly chosen members during each gossip constitute a gossip subset, and the gossip period is called a step interval.
The gossip-style failure detection algorithm works as follows. Every member maintains an n-element Live array L, which is filled with 0s initially. The protocol is divided into equally timed steps. During every gossip step, each member i increments all the other elements in its Live array L by 1 while keeping L[i] = 0; then it gossips L to a random subset. Upon receiving a data message from member j, a member sets L[j] = 0. Upon receiving another Live array L', a member replaces its own Live array L with the element-wise minimum of its old L and L'. Small values in the Live array indicate that the corresponding members are active, and large values signify that the corresponding members have not been heard from recently. The pseudo-code is presented in the following figure.
Each member i keeps a Live array L_i.
Initially L_i[j] = 0 for every member j.
Periodically, member i does the following:
    L_i[j] = L_i[j] + 1 for all j != i
    send out a gossip message containing L = L_i
Every member reacts to received messages as follows:
Upon receiving a data message from j, member i does the following:
    L_i[j] = 0
Upon receiving L, member i does the following:
    L_i = ArrayMin(L_i, L)
Figure: Gossip-style failure detection protocol
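The pseudo-code above translates almost directly into a small class. The following is a sketch for illustration only: the names `GossipMember`, `step`, `on_data` and `on_gossip` are hypothetical, members are identified by 0-based array indices, and message transport and the choice of the random gossip subset are left out.

```python
class GossipMember:
    """Hypothetical member state for the gossip-style failure detector."""

    def __init__(self, index, n):
        self.index = index   # this member's position in the Live array
        self.live = [0] * n  # Live array, initially all zeros

    def step(self):
        """One gossip step: age every other member, return the array to gossip."""
        for j in range(len(self.live)):
            if j != self.index:
                self.live[j] += 1
        return list(self.live)  # copy, so receivers cannot alias our state

    def on_data(self, j):
        """A data message from member j proves j was alive just now."""
        self.live[j] = 0

    def on_gossip(self, other_live):
        """Merge a received Live array by element-wise minimum (ArrayMin)."""
        self.live = [min(a, b) for a, b in zip(self.live, other_live)]
```

Taking the element-wise minimum means each entry records the freshest evidence any gossip partner has about a member, which is what lets liveness information spread without every member hearing from every other member directly.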
The next figure presents a sample execution of the gossip-style failure detection protocol. In this example there are four members in the group: A, B, C and D. We examine the behavior of the protocol at member A. Initially the Live array at A is [0 0 0 0]. After one step interval, A increments every element of the Live array by 1 except its own, and the Live array becomes [0 1 1 1]. After this increment, A gossips its current Live array to B. When the next step interval passes, the Live array is incremented and becomes [0 2 2 2]. A gossips it to C. When A receives a data message from C, it sets the element for C to 0. The Live array becomes [0 2 0 2] at this point. After some time passes, the Live array is [0 4 3 6]. At this moment, A receives D's Live array [2 1 2 0] and calculates the element-wise minimum ArrayMin, which is [0 1 2 0], the new Live array at A.

Figure: An example of the gossip-style failure detection protocol. (Timeline at member A, with array entries ordered A, B, C, D: [0 0 0 0]; [0 1 1 1], gossiped to B; [0 2 2 2], gossiped to C; data message from C gives [0 2 0 2]; later [0 4 3 6]; after receiving [2 1 2 0] from D, the array becomes [0 1 2 0].)
It takes time for the Live array from each member to propagate throughout the entire group. Therefore we set a threshold value K_fail for the maximum Live array value. Once K_fail is reached, the corresponding member is considered failed. The value of K_fail depends on the gossiping rate (the length of the step interval) and the subset size during each gossip step. K_fail is selected so that the probability that anybody makes an erroneous failure detection is less than some small threshold P_mistake.
In [RMH], van Renesse et al. conduct a statistical analysis for a better version of the protocol, where during each gossip step only one member is gossiping. This member is chosen at random and chooses one other member to gossip to at random. Under this condition, the number of steps needed increases logarithmically with the number of members n. The Live array size increases with the group size n; therefore, in order to keep the bandwidth requirement per member constant, the gossip step is set to be proportional to n. With this requirement, K_fail grows in the order of n log n.
Assuming each element of the Live array occupies one byte, the size of a gossip message becomes n bytes, where n is the group size. A hierarchical structure can also be employed in failure detection protocols to reduce gossip message sizes and improve scalability. Since the stability detection protocol is the focus of this dissertation, the failure detection protocol is not discussed further.
Integration of Stability Detection and Failure Detection
When there are membership changes, the failure detection protocol assists the stability detection protocol. If a faulty member is detected, this information is propagated throughout the group. As presented in the chapter on stability detection, each member maintains an n-bit "whom I heard from" bitmap array W for recording from which members it has received the information needed for stability detection. Members always check the array W against the current group membership before deciding if message stability is reached, thus preventing an indefinite wait for faulty members' sequence number arrays.
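The membership check described here can be sketched as a single predicate. This is an illustrative assumption about how the two data structures might be combined, not the dissertation's code: the function name `stability_info_complete` is hypothetical, the bitmap W is modeled as a list of booleans, and a member is treated as failed once its Live array entry reaches the threshold K_fail.

```python
def stability_info_complete(heard_from, live, k_fail):
    """True if every member not considered failed has been heard from.

    heard_from: list of bool, one bit per member (the bitmap array W)
    live:       list of int, the Live array maintained by failure detection
    k_fail:     Live array value at which a member is considered failed
    """
    return all(heard or counter >= k_fail
               for heard, counter in zip(heard_from, live))
```

With this predicate, a member that has crashed stops blocking the round as soon as failure detection marks it as failed, while a slow but live member still has to report before stability is declared.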
Recall that a small set of members are designated as Late-Join Handlers (LJHs), and that the LJHs provide new members with the data necessary to catch up with existing members. When a new receiver joins the group, after receiving the necessary information from the LJHs, it also joins the stability detection protocol. The bitmap array W adds one more bit at the end, representing the new receiver. Any member which hears indirectly or directly from this new receiver notices the change in W from gossip messages and adds one bit to its own W.
The invocation of the stability detection protocol depends on the patterns of message sending and membership changing. Since there is a limit on buffer space at multicast group members, a round of the stability protocol should start whenever the buffers reach some threshold. An analytical model for determination of this threshold is discussed in a later chapter.
Summary
The stability detection protocol and the failure detection protocol are two integral parts of garbage collection. This chapter starts with a survey of failure detection protocols, followed by the proposal of a new gossip-style protocol to detect failures. Finally, this chapter describes how failure detection protocols assist stability detection protocols when there are membership changes. With this background, in the next few chapters we can start looking at mechanisms to conduct stability detection, which are the main focus of this dissertation.
Chapter
Stability Detection
The protocol that collects message stability information and distributes it to every group member is called a message stability detection protocol. Such protocols are implemented as an integral part of reliable multicast protocols in many distributed systems [DKM, Bir, Car, Cri, CM, Hay, KTHB, vRBM]. We first study three representative protocols named CoordP, FullDist and Train, then propose a new protocol called Gossip.
Assumptions
As mentioned in an earlier chapter, this dissertation studies the message stability detection and failure detection framework intended for reliable multicast in a large scale environment where messages may be dropped and processes may crash.
In a dynamic distributed environment, certain communication problems may mimic process failures. For example, when processes p and q are both functional but the communication link between them experiences transient failures, perhaps process p will consider that process q has failed while process q believes the opposite is true. This situation is called a network partition or partitioning failure. In order for the system to make progress, one of these events must become official.
If a partitioning failure occurs in a system, it is impossible to guarantee that multiple components can deliver the same set of messages. To obtain strong system-wide guarantees, a protocol must wait for communication to be reestablished in at least one side of the partition. In the primary partition approach pioneered in the Isis system [Bir], only one of these components is designated as the primary component. The primary component is permitted to make progress, and the other components are forced to shut down. Processes within non-primary components reconnect to the primary component when communication is restored.
An alternative non-primary partition approach is used in the Transis [DKM, DMS] and Totem [AMSA, MMSA] systems, in which any component that can reach internal agreement on its membership is permitted to continue operation. However, only a single component of the system is the primary one. Applications might continue to be available in non-primary components. When the partition failure ends, non-primary components merge their states back into the primary one.
If the system follows the primary partition approach, then stability detection protocols only execute in the primary component when a network partition happens. If the system allows non-primary components to execute, the stability detection protocols are executed in these components also.
When the system is free of partition failures, whenever a message is determined to be stable by the stability detection protocols, it can be released, or garbage collected, from every member's buffer. When a network partition occurs, messages can not be released even if they are detected to be stable within the network component, because members in other components may need the retransmission during the merge process.
The merge of network partitions is handled by users of the stability detection protocols. Therefore we only describe the protocols under the situation where no network partition can happen.
We base our analysis on the following common assumptions about reliable multicast protocols. Notice that we do not assume FIFO ordering.
A multicast group of size n consists of a set of n named processes.
Each member of the group can be a sender, multicasting data messages to the entire group. Without loss of generality, we assume at most m (m <= n) processes are senders, and that they are the first m processes in the numbering.
Each member is always a receiver. This means the sender is also a receiver for the messages it sends out.
Each sender assigns each data message a sequence number that is unique for the particular sender. Therefore each data message has a unique name, and this name consists of the globally unique sender name and a locally unique sequence number.
Each sequence number occupies a fixed number of bytes, and the resulting sequence number space is large enough for most applications.
A multicast message is always sent to the entire group, and therefore a sender also receives a multicast message from itself.
Multicast in the stability detection mechanisms can use the underlying reliable multicast protocols, even though some stability detection protocols have their own built-in reliability mechanism.
For easy comparison, all protocols under consideration will be organized into rounds. A round begins when the protocol is initiated externally, and each round has a definite termination point. The reason is that one could otherwise imagine a stability protocol that runs autonomously, asynchronously and continuously.
For a stability detection protocol to conduct useful work, the following condition must be satisfied: if a message is stable at the time when a protocol round begins, then the stability detection protocol must certify its stability by the time the round finishes. This is because, during the execution of the protocol, the number of stable messages can only increase but not decrease. With this constraint, we rule out any protocol that does not make progress.
Notice there is a time difference between the moment a message becomes stable and the moment it is detected to be stable. If a protocol is triggered at the time a message becomes stable, and the message is detected to be stable at some later time, then the difference between these two times indicates the performance of the stability detection protocol. The smaller the time difference, the better the protocol performs.
Without considering the underlying network topology, we can use two metrics to characterize the performance of stability detection protocols.
We use the number of steps per round as a time complexity metric. Since not every step takes the same amount of time, the number of steps alone does not give accurate information about time complexity. We use it as a convenient aid in the description of the protocols. We can also draw conclusions on protocol performance based on the number of steps needed and the estimated time needed for each step.
To measure the distribution of processing load among group members, we use the number of messages processed per round, which is the sum of the number of messages sent and received by each member during each protocol round.
In all the protocols we study, each member maintains an m-element sequence number array R, where its j-th element R[j] is the maximum sequence number of all messages from sender j that have arrived at this member. Each member also maintains an n-element Live array L reflecting the current group membership, as described in the chapter on failure detection, and an n-bit "whom I heard from" bitmap array W for recording from which members it has received sequence number arrays.
All the message stability detection protocols follow the same procedure in a round of execution:
When the sequence number array R from each member is collected somehow, a stability array S is created, where S[j] is the minimum of the j-th element of each member's sequence number array.
After the stability array S is built, it is distributed in the group using various mechanisms.
After receiving S, each member can release data messages from sender j with sequence numbers less than S[j].
The Basic Protocols
CoordP
CoordP is a centralized protocol run by one of the members of the group, designated as the coordinator. There are m senders in the group; each member i therefore maintains an m-element sequence number array R_i, where its j-th element R_i[j] is the maximum sequence number in the sense that all messages with lower sequence numbers from sender j have arrived at member i. Three types of messages are used in this protocol: the START message has an empty body, and the ACK and INFO messages each carry an m-element sequence number array. There are three steps in a round of execution, as shown in the figure below.
Step 1: The coordinator multicasts a START message in the group.
Step 2: After receiving the START message, each member i sends its sequence number array R_i to the coordinator as an ACK message by point-to-point links.
Step 3: After collecting sequence number arrays from all the members, the coordinator calculates the stability array S = ArrayMin(R_1, ..., R_n), where S[j] = min(R_1[j], R_2[j], ..., R_n[j]). S is then multicast in the group as an INFO message. Based on the received S, any member in the group can label a message received from member i as stable if the message has a sequence number less than or equal to S[i].

Footnote: In an implementation of the protocols, one field in the message header normally designates the message type.
Figure: Steps of protocol CoordP (members A, B, C, D; Step 1: START; Step 2: ACK; Step 3: INFO).
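One round of CoordP can be sketched as a loss-free, in-process simulation. The class and function names here are hypothetical, message passing is modelled as direct method calls, and the stability test uses the "less than or equal to" wording of Step 3.

```python
from typing import List, Tuple


class Member:
    """One group member in a loss-free, in-process model of CoordP."""

    def __init__(self, rank: int, R: List[int]) -> None:
        self.rank = rank
        self.R = list(R)        # sequence number array, one entry per sender
        self.S: List[int] = []  # last stability array received, if any

    def on_start(self) -> Tuple[str, int, List[int]]:
        # Step 2: answer START with an ACK carrying the local array.
        return ("ACK", self.rank, list(self.R))

    def on_info(self, S: List[int]) -> None:
        # End of Step 3: remember the stability array from the INFO message.
        self.S = list(S)

    def is_stable(self, sender: int, seq: int) -> bool:
        return bool(self.S) and seq <= self.S[sender]


def coordp_round(members: List[Member]) -> List[int]:
    # Step 1: the coordinator multicasts START; every member replies.
    acks = [member.on_start() for member in members]
    # Step 3: the coordinator takes the column-wise minimum of all arrays.
    arrays = [R for (_kind, _rank, R) in acks]
    S = [min(column) for column in zip(*arrays)]
    for member in members:
        member.on_info(S)
    return S


group = [Member(0, [3, 8]), Member(1, [2, 8]), Member(2, [3, 5])]
S = coordp_round(group)
print(S)  # [2, 5]
```

A real deployment would also need timeouts and retransmission for lost ACKs, which this sketch leaves out.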
In this protocol there are 2 multicasts by the coordinator and $n - 1$ point-to-point messages from non-coordinators. The coordinator sends the START and INFO multicasts; it receives START, INFO, and $n - 1$ ACKs. Therefore the total number of messages processed by the coordinator is $n + 3$, of which 2 messages are sent and $n + 1$ are received. The multicast is counted as a message sent and received by the coordinator. A non-coordinator sends 1 point-to-point ACK message and receives the START and INFO messages. The total number of messages processed by a non-coordinator is 3, of which 1 is sent and 2 are received.

Lost ACKs from non-coordinators will prevent the coordinator from getting all the information in the group. This problem is solved by making non-coordinators send unicast ACK messages to the coordinator repetitively.
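The per-member load of one CoordP round can be tallied with a few lines of arithmetic. The sketch assumes a group of n members with exactly one coordinator, one ACK per non-coordinator, no losses or retransmissions, and the convention stated above that a multicast counts as one message sent and one received by the coordinator. The function name is illustrative.

```python
from typing import Dict


def coordp_message_counts(n: int) -> Dict[str, Dict[str, int]]:
    """Messages processed per round by the coordinator and by a non-coordinator."""
    if n < 2:
        raise ValueError("need a coordinator and at least one other member")
    coordinator = {
        "sent": 2,                # the START and INFO multicasts
        "received": 2 + (n - 1),  # its own two multicasts plus one ACK per other member
    }
    non_coordinator = {
        "sent": 1,      # one point-to-point ACK
        "received": 2,  # START and INFO
    }
    for load in (coordinator, non_coordinator):
        load["total"] = load["sent"] + load["received"]
    return {"coordinator": coordinator, "non_coordinator": non_coordinator}


counts = coordp_message_counts(4)
print(counts["coordinator"]["total"], counts["non_coordinator"]["total"])  # 7 3
```

The coordinator's total grows linearly with n while a non-coordinator's stays constant, which is the asymmetry that limits the protocol in large groups.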
In an extension to the Tandem global update protocol and in the Amoeba total ordering protocol, a particular version of CoordP is employed as the stability detection algorithm.

The original Tandem global update protocol works as follows. One member is designated as the sequencer of the group. When a sender s sends a global update message u to the group, it first sends u to the sequencer; then the sender sends u to the other group members one by one. At last it sends u to the sequencer again. Upon receipt of u, each member sends back a positive acknowledgment to the sender. Message losses are detected by timeouts and result in message retransmissions.

The Tandem global update protocol allows at most one update to be multicast at a time in the group. Therefore the performance is very bad when the group size is large. Cristian et al. propose an extension to the global update protocol, called the Positive-Acknowledgment, or PA, protocol, that allows concurrent multicasts.

In the PA protocol, a sender s starts the multicast of an update u by sending a message $(u, l)$ to the sequencer, where $l$ is the local sequence number for u. If the previous message received from s by the sequencer had local sequence number $l - 1$, the sequencer assigns a global sequence number $k$ to this update and sends a message $(u, k)$ to every group member. Otherwise, the sequencer stores $(u, l)$ in a local buffer until the previous message $u'$ arrives from s, after which it assigns $u'$ and $u$ global orders consistent with their origination order at s, and then multicasts $u'$ and $u$ in the group.
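The sequencer's ordering rule in the PA protocol can be sketched as below. This is an illustration under stated assumptions rather than the protocol's actual code: local and global sequence numbers are assumed to start at 0, duplicates are silently dropped, and the class name `Sequencer` and its fields are hypothetical.

```python
from typing import Dict, List, Tuple


class Sequencer:
    """Assigns global sequence numbers while preserving each sender's order."""

    def __init__(self) -> None:
        self.next_global = 0                          # next global number k
        self.expected: Dict[str, int] = {}            # next local number per sender
        self.pending: Dict[str, Dict[int, str]] = {}  # buffered out-of-order updates
        self.out: List[Tuple[str, int]] = []          # (u, k) messages sent to the group

    def receive(self, sender: str, update: str, l: int) -> None:
        expected = self.expected.get(sender, 0)
        if l < expected:
            return  # assumption: an already sequenced update is a duplicate
        buffered = self.pending.setdefault(sender, {})
        buffered[l] = update
        # Release every update whose predecessor has already been sequenced.
        while expected in buffered:
            self.out.append((buffered.pop(expected), self.next_global))
            self.next_global += 1
            expected += 1
        self.expected[sender] = expected


sequencer = Sequencer()
sequencer.receive("s", "u2", 1)  # arrives early, so it is buffered
sequencer.receive("s", "u1", 0)  # fills the gap; both updates are released
print(sequencer.out)  # [('u1', 0), ('u2', 1)]
```

Buffering by local sequence number is what lets a single sender have several multicasts in flight without losing its origination order.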
Upon receipt of $(u, k)$, each member sends a positive acknowledgment for $k$ to the sequencer. Message losses are detected by timeouts and result in message retransmissions. Updates are delivered at the members in the order imposed by the global sequence numbers attached by the sequencer. Concurrency is allowed among multicast senders as well as among multiple multicast requests at a single sender.

Message stability detection works as follows. Group members send to the sequencer the global sequence number $l$ of the last update they have delivered. The sequencer records them in a sequence number array, with one entry per member, and multicasts the minimum sequence number $l_{\min}$ stored in the array. Thus an update u with global sequence number $k$ is stable if $k \le l_{\min}$.
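The sequencer-side bookkeeping for this stability test is small enough to sketch directly. The sketch reads the stability condition as k less than or equal to l_min, assumes that -1 means "nothing delivered yet", and uses a hypothetical class name and member labels.

```python
from typing import Dict, Iterable


class StabilityTracker:
    """Tracks the last delivered global sequence number reported by each member."""

    def __init__(self, members: Iterable[str]) -> None:
        # Assumption: -1 marks a member that has not delivered any update yet.
        self.last_delivered: Dict[str, int] = {member: -1 for member in members}

    def report(self, member: str, l: int) -> None:
        # Reports may arrive late or out of order, so keep the largest value seen.
        self.last_delivered[member] = max(self.last_delivered[member], l)

    def l_min(self) -> int:
        return min(self.last_delivered.values())

    def is_stable(self, k: int) -> bool:
        return k <= self.l_min()


tracker = StabilityTracker(["A", "B", "C"])
tracker.report("A", 7)
tracker.report("B", 5)
tracker.report("C", 9)
print(tracker.l_min())  # 5
```

An update becomes stable only once the slowest member has reported delivering it, so a single silent member holds `l_min` back for the whole group.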
The Amoeba multicast protocol is also sequencer-based. To initiate the multicast of update u, a sender s sends u to the sequencer. The sequencer assigns a global sequence number $k$ to u and sends messages $(u, k)$ to all group members. After having received a message $(u, k)$, a member sends a negative acknowledgment to the sequencer only if the next message it receives contains a sequence number greater than $k + 1$. The sequencer retransmits messages only upon receiving such negative acknowledgments. Concurrency is allowed among different senders, but each sender handles only one multicast at a time. Message stability is established in the same manner as in the PA protocol.

In both systems there is a sequencer, or a coordinator in our terminology, which assigns a global sequence number to each data message. Other members periodically send to the sequencer a field holding the last consecutive sequence number they have received. After receiving these numbers, the sequencer calculates their minimum and announces the sequence number of the last stable message in the group.
FullDist

FullDist is a fully distributed protocol in the sense that every member periodically multicasts its information about message stability in the entire group. In FullDist, each member keeps a stability matrix $E$ of size $n \times m$, because there are m senders in the group. Matrix element $E[i, j]$ stores the sequence number of the last message that is sent by sender $j$ and has been received by member $i$. The $i$-th row of the matrix at member $i$ stores the last sequence numbers of messages that have been received by member $i$ from all the senders of the group. The minimum of the $j$-th column represents the last sequence number whose corresponding message is sent by the $j$-th sender and has been received by every member. Messages sent from sender $j$ with this sequence number or lower are stable. This protocol only uses one type of m-byte INFO messages. Periodically, each member multicasts its row of its stability matrix $E$ in the group. For the purpose of measuring how long it takes for each member to detect message stability in a later chapter, we introduce two steps in one round of execution of FullDist, as illustrated in the figure below.

Step 1: The first member multicasts the first row of its matrix $E$ via an INFO message. No significance is attached to the choice of the first member.

Step 2: After receiving the INFO message, the $i$-th member multicasts the $i$-th row of its stability matrix $E$ via an INFO message. Every member replaces the $i$-th row of its matrix with the received row information from member $i$.

Figure: Steps of protocol FullDist (members A, B, C, D; Steps 1 and 2: INFO messages).

As every member maintains its own stability matrix and determines whether or not any data messages in the system are stable, FullDist is decentralized, without any coordinator. This protocol includes $n$ multicast INFO messages. Every member sends 1 INFO multicast and receives $n$ INFO messages. Hence the total number of messages processed by each member is $n + 1$.

As with CoordP, this protocol is executed periodically, and as a result an INFO message will compensate for the lost one in the previous step.
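A FullDist member's matrix handling can be sketched as follows. The sketch assumes loss-free delivery, zero-based ranks, and 0 as the "nothing received" value; the class name and the sample rows are hypothetical.

```python
from typing import List


class FullDistMember:
    """Keeps an n-by-m stability matrix E and derives stability locally."""

    def __init__(self, rank: int, n: int, own_row: List[int]) -> None:
        self.rank = rank
        m = len(own_row)
        # Assumption: 0 means "nothing known yet" for rows of other members.
        self.E = [[0] * m for _ in range(n)]
        self.E[rank] = list(own_row)

    def info_message(self) -> List[int]:
        # The INFO message carries this member's own row of E.
        return list(self.E[self.rank])

    def on_info(self, i: int, row: List[int]) -> None:
        # Replace row i with the row just received from member i.
        self.E[i] = list(row)

    def stable_up_to(self) -> List[int]:
        # The minimum of column j bounds the stable messages of sender j.
        return [min(column) for column in zip(*self.E)]


rows = [[5, 9], [4, 9], [6, 7]]  # hypothetical rows for n = 3, m = 2
members = [FullDistMember(i, len(rows), row) for i, row in enumerate(rows)]

# One round: every member multicasts its row to all other members.
for sender in members:
    for receiver in members:
        if receiver is not sender:
            receiver.on_info(sender.rank, sender.info_message())

print(members[0].stable_up_to())  # [4, 7]
```

Every member ends the round with the same matrix and so reaches the same conclusion without a coordinator, at the cost of processing a number of messages that grows with the group size.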
FullDist is used as the stability detection protocol in Horus, Ensemble, and SRM.

Horus is a group communication system which offers great flexibility in the properties provided by protocols. It supports dynamic group membership, message ordering, synchronization, and failure handling. In the Horus architecture, protocols are constructed dynamically by stacking microprotocols which support a common interface. Each microprotocol offers a small integral set of communication properties, and Horus implements them as different layers. One such microprotocol is the message stability detection protocol. Horus and its next generation system, Ensemble, implement a set of stability protocols in separate layers so users can pick the appropriate one for their application. FullDist is one of the stability protocols offered by Horus and Ensemble.

In SRM, each member periodically multicasts session messages to report the sequence number state for active senders. This is essentially the FullDist protocol. The average bandwidth consumed by session messages is limited to a small fraction of the aggregate data bandwidth. SRM members dynamically adjust the generation rate of session messages in proportion to the multicast group size.
Train

Train is a decentralized linear protocol in the sense that a fixed-size train is passed around group members to spread the message stability information. In the Train protocol, each member $i$ keeps a sequence number array $R_i$ with m elements, where m is the number of senders. There is a cyclic order among group members $1, \ldots, n$. A train with a fixed size of m bytes passes through all the members in this cyclic order. Member 1 starts the protocol by putting its sequence number array on the train. When the train arrives at any other member, this member gets the array from the train, calculates the minimum of this array and its own stability array using ArrayMin, and then puts the result back in the train. After one circulation, the first member gets back in the train the minimum of the stability arrays of all members, since the train has visited every member once. The train containing this minimum array then passes around the group in the same circle once again. In this second step, every member takes this minimum array and marks stable messages accordingly.
The two types of messages used by the Train protocol are the m-byte ACK and INFO messages. The first member starts the protocol by assigning $M_1 = R_1$, then sending a point-to-point ACK message containing $M_1$ to the second member. Upon receiving this ACK, the second member calculates $M_2 = \mathrm{ArrayMin}(M_1, R_2)$ and sends $M_2$ to the third member via an ACK. In general, after receiving an ACK message containing $M_{i-1}$, member $i$ calculates $M_i = \mathrm{ArrayMin}(M_{i-1}, R_i)$ and sends $M_i$ on via an ACK message to member $i + 1$. After an ACK message from member $n$ arrives at the first member, that is, after the ACK finishes circulating the group once, the first member starts passing $M_n$ in the same circle in an INFO message. At this point, the stability array $S$ is constructed, since $S = M_n = \mathrm{ArrayMin}_{1 \le i \le n} R_i$; that is, $S$ contains the sequence numbers for the last stable messages sent from each member. The Train protocol is shown in the figure below.

In this protocol the ACK message needs to circulate the group once, and so does the INFO message. Hence $2n$ steps are required. Each member sends out and also receives one ACK and one INFO message. The total number of messages processed by each member is 4, of which 2 are sent and 2 are received.

Figure: Steps of protocol Train (members A, B, C, D; Steps 1 to n: ACK messages; Steps n+1 to 2n: INFO messages).

Once this protocol starts, each member repeatedly sends the same message to the next member in line. This mechanism effectively combats message losses.
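The two circulations of the train can be sketched as a simple loop over the ring. The sketch assumes loss-free delivery and that each hop between neighbours counts as one step; the function name `train_round` and the sample arrays are illustrative.

```python
from typing import List, Tuple


def train_round(R: List[List[int]]) -> Tuple[List[int], List[List[int]], int]:
    """Simulate one ACK circulation and one INFO circulation around the ring."""
    n = len(R)
    if n == 0:
        raise ValueError("the group must not be empty")
    steps = 0
    train = list(R[0])  # the first member puts its own array on the train
    for i in range(1, n):
        steps += 1  # ACK hop from the previous member to member i
        train = [min(a, b) for a, b in zip(train, R[i])]
    steps += 1  # the ACK from the last member returns to the first member
    S = train

    learned: List[List[int]] = [[] for _ in range(n)]
    learned[0] = list(S)  # the first member already holds the final array
    for i in range(1, n):
        steps += 1  # INFO hop delivering S to member i
        learned[i] = list(S)
    steps += 1  # the INFO message completes the circle
    return S, learned, steps


S, learned, steps = train_round([[5, 9], [4, 9], [6, 7]])
print(S, steps)  # [4, 7] 6
```

Each member handles a constant number of messages, but the step count grows linearly with the group size, which is the latency cost discussed later for large groups.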

In the Train and Totem single-ring protocols, the Train protocol is used for message stability detection.

In the Train protocol, there is a cyclic order among group members. A train containing a sequence of messages circulates from one member to another in this order. In order to multicast a message, a sender waits for the train to arrive. When the train arrives, the sender first delivers all new messages carried by the train, and then appends all messages that it wants to multicast at the end of the train. The sender removes its own multicast messages when it sees the train again. Group members other than the sender deliver all messages on the train when the train passes by.

If there are no multicasts in progress, the empty train remains idle at some designated group member, the trainmaster. In such a case, a sender must request the train in order to multicast a message. A lost train is detected by the trainmaster based on a timeout mechanism. The trainmaster is also responsible for regenerating the train.

A message is detected to be stable when the train completes one more round after delivering the message to all group members. The stability detection protocol used here is a special version of the generic Train protocol. When the train circulates in the first round, starting at the sender, it delivers the message to all members. After that, the sender removes the message from the train. Therefore, when the train circulates the second round, it implicitly tells all members this message is stable, since the message is not on the train any more.

The Totem single-ring protocol provides reliable, totally ordered delivery of messages using a logical token-passing ring. The token circulates around the ring as point-to-point messages. Only the member holding the token can multicast a message. Unlike the Train protocol, the token does not contain the multicast message. Instead of following the logical ring, messages are multicast in the group by the sender.

In the single-ring protocol, a sequence number field in the token, called global_seq, provides global message sequence numbers for all multicast messages and thus a total order on messages. When a sender multicasts a new message, it increments the global_seq field of the token and gives the message that sequence number. Messages are delivered with increasing global sequence numbers. Members recognize missing messages by detecting gaps in sequence numbers, and request retransmissions by inserting the sequence numbers of the missing messages into the retransmission request list of the token.

Again, a special version of the Train protocol is used here to detect message stability. Each member stores in seq its maximum sequence number such that all messages with lower sequence numbers have arrived successfully. As the token passes around the ring, it records the minimum seq of the members it has visited so far in its all-received-up-to field, or aru. After a full token rotation, the aru field records a sequence number such that all members on the ring have received all messages with lower sequence numbers. Each member can then reclaim the buffer space used by messages with sequence numbers up to aru, because they will never need to be retransmitted.
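The way the token's aru field accumulates a minimum during one rotation can be sketched as below. This is a simplified illustration that ignores retransmission requests and membership changes; the function names and the sample values are assumptions, not part of Totem.

```python
from typing import List


def token_rotation(member_seq: List[int]) -> int:
    """Return the aru value after the token has visited every member once."""
    if not member_seq:
        raise ValueError("the ring must not be empty")
    aru = member_seq[0]
    for seq in member_seq[1:]:
        # The token keeps the smallest seq among the members visited so far.
        aru = min(aru, seq)
    return aru


def reclaim(buffered: List[int], aru: int) -> List[int]:
    """Drop buffered sequence numbers up to aru; keep the rest for retransmission."""
    return [seq for seq in buffered if seq > aru]


aru = token_rotation([12, 9, 15])       # hypothetical seq values around the ring
print(aru)                              # 9
print(reclaim([7, 8, 9, 10, 11], aru))  # [10, 11]
```

Because the minimum is carried on the token itself, stability detection costs no extra messages beyond the token rotation that ordering already requires.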
Gossip
The three basic stability detection protocols we have discussed so far have their limitations on scalability. When the group size is large, an implosion problem will occur at the coordinator in CoordP and at every member in FullDist, because the number of messages they need to process increases linearly with the group size. For Train, the message train needs to traverse the entire group before the stability information is collected; as a result, the time for a round of protocol execution to finish increases linearly with the group size.

To avoid the implosion problem in CoordP and FullDist, and the linear traversal of all the group members in Train, we propose a new protocol called Gossip.
The protocol is divided into equally timed steps. During each step, every member constructs a gossip subset consisting of p distinct members with ranks randomly chosen from 1 to n. In the first step, every member sends its sequence number array R to its gossip subset. After receiving a gossip message, a member computes the Min-so-far array M, which is the element-wise minimum of the sequence number arrays of itself and of other members that it has heard from. It also computes the Whom-I've-Heard-From array W as the element-wise maximum of the Whom-I've-Heard-From arrays of itself and of other members that it has heard from. In the subsequent steps, every member gossips its Min-so-far array M and its Whom-I've-Heard-From array W to a different random subset. Instead of sending their information to one coordinator, members use gossip messages to disseminate their information in the group step by step. After a certain number of steps, once one member receives information about all current members, the Min-so-far array M at this member becomes the stability array S. This is detected when the Whom-I've-Heard-From bitmap array W contains 1s for all current group members.
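The gossip steps described above can be sketched as a small simulation. It rests on several assumptions that are not in the text: delivery is loss-free, a member never picks itself as a gossip target, every message in a step carries the sender's state from the start of that step, and the first-step message is modelled as the pair (M, W) with M initialised to R. All names are hypothetical.

```python
import random
from typing import List, Tuple


class GossipMember:
    def __init__(self, rank: int, n: int, R: List[int]) -> None:
        self.rank = rank
        self.M = list(R)   # Min-so-far array, initially the member's own R
        self.W = [0] * n   # Whom-I've-Heard-From bitmap
        self.W[rank] = 1   # a member has trivially heard from itself

    def merge(self, M: List[int], W: List[int]) -> None:
        self.M = [min(a, b) for a, b in zip(self.M, M)]  # element-wise minimum
        self.W = [max(a, b) for a, b in zip(self.W, W)]  # element-wise maximum

    def has_stability_array(self) -> bool:
        return all(self.W)


def gossip_step(members: List[GossipMember], p: int, rng: random.Random) -> None:
    outgoing = []
    for member in members:
        others = [r for r in range(len(members)) if r != member.rank]
        targets = rng.sample(others, min(p, len(others)))  # p distinct targets
        outgoing.append((targets, list(member.M), list(member.W)))
    for targets, M, W in outgoing:
        for target in targets:
            members[target].merge(M, W)


def run_gossip(R: List[List[int]], p: int, seed: int = 0,
               max_steps: int = 1000) -> Tuple[int, List[int]]:
    rng = random.Random(seed)
    members = [GossipMember(i, len(R), row) for i, row in enumerate(R)]
    for step in range(1, max_steps + 1):
        gossip_step(members, p, rng)
        for member in members:
            if member.has_stability_array():
                return step, list(member.M)  # M has become S at this member
    raise RuntimeError("no member detected stability within max_steps")


steps, S = run_gossip([[5, 9], [4, 9], [6, 7]], p=2)
print(steps, S)  # 1 [4, 7]
```

Because M and W are always merged together, a 1 in position j of W guarantees that member j's array is already folded into M, which is why an all-ones W certifies that M equals the stability array.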
At this point, the member that detects message stability starts disseminating S in the group by putting it on future gossip messages. Upon receiving S, a member discards stable messages accordingly. To save on the bandwidth requirement of future gossip messages and to disseminate S faster, one could multicast S in the entire group. Instead of implementing reliable multicast again, an existing reliable multicast protocol can be used. However, this method has a drawback. Some reliable multicast protocols do not guarantee a multicast message to be received by all the members in the group. If these protocols are used to distribute S, there is no guarantee that S will arrive at every member. A hybrid scheme solves this problem: S is multicast to the entire group, and periodically S is piggybacked on future gossip messages to reach members which are left out in the original multicast of S.

Figure: An example run of the gossip-style stability detection protocol Gossip.
An example is given in the figure above to illustrate the protocol. The group consists of the four members A, B, C, and D, each gossiping to a subset of fixed size. One round of the protocol is finished after two steps. At the end of the first step, members B and C construct the stability array, whereas A and D have partial information. During the second step