Data-Centric Reconfiguration with Network-Attached Disks


Alex Shraer (Technion)

Joint work with:
J.P. Martin, D. Malkhi, M. K. Aguilera (MSR)
I. Keidar (Technion)

Preview

The setting: data-centric replicated storage
- Simple network-attached storage nodes

Our contributions:
1. First distributed reconfigurable R/W storage
   - Allows adding/removing storage nodes dynamically
2. Asynchronous vs. consensus-based reconfiguration

Enterprise Storage Systems

- Highly reliable, customized hardware
  - Controllers and I/O ports may become a bottleneck
  - Expensive
  - Usually not extensible
- Different solutions for different scales
  - Example (HP): high end - XP (1152 disks); mid range - EVA (324 disks)

Alternative: Distributed Storage

- Made up of many storage nodes
  - Unreliable, cheap hardware
  - Failures are the norm, not an exception
- Challenges:
  - Achieving reliability and consistency
  - Supporting reconfigurations

Distributed Storage Architecture

[Figure: dynamic, fault-prone storage clients issue reads and writes over a
LAN/WAN to fault-prone storage nodes (cloud storage); network delays are
unpredictable (asynchrony).]

A Case for Data-Centric Replication

- Client-side code runs the replication logic (see the sketch below)
  - Communicates with multiple storage nodes
- Simple storage nodes (servers)
  - Can be network-attached disks
    - Not necessarily PCs with disks
    - Do not run application-specific code
    - Less fault-prone components
  - Simply respond to client requests
    - High throughput
  - Do not communicate with each other
    - If storage nodes communicate, their failures are likely to be correlated!
  - Oblivious to where other replicas of each object are stored
    - Scalable: the same storage node can be used for many replication sets

[Figure: a "not-so-thin" client and a "thin" storage node.]
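To make the data-centric pattern concrete, the sketch below shows a client-side
majority write/read over simple storage nodes. It is only an illustration (not
the DynaDisk code): the node operations read()/store() and the fixed choice of
the first majority are simplifying assumptions; a real client would contact all
nodes and proceed once any majority replies.

    # Minimal sketch of data-centric replication: the client runs the
    # replication logic, storage nodes only answer read/store requests.
    from collections import namedtuple

    Tagged = namedtuple("Tagged", ["ts", "value"])     # timestamped value

    class ReplicatedRegister:
        def __init__(self, key, nodes):
            self.key = key
            self.nodes = nodes                         # simple storage nodes
            self.majority = len(nodes) // 2 + 1        # quorum size

        def _some_majority(self):
            # Simplification: always use the first majority of nodes.
            return self.nodes[:self.majority]

        def write(self, value):
            # Phase 1: learn a timestamp higher than any seen by a majority
            ts = 1 + max(n.read(self.key).ts for n in self._some_majority())
            # Phase 2: store the timestamped value at a majority
            for n in self._some_majority():
                n.store(self.key, Tagged(ts, value))

        def read(self):
            # Read from a majority, pick the freshest value, then write it
            # back so that later reads cannot observe an older value.
            latest = max((n.read(self.key) for n in self._some_majority()),
                         key=lambda t: t.ts)
            for n in self._some_majority():
                n.store(self.key, latest)
            return latest.value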

Real Systems Are Dynamic

The challenge: maintain consistency, reliability, and availability while the
set of storage nodes changes.

[Figure: over a LAN/WAN, clients issue reconfig{-A, -B} and
reconfig{-C, +F, …, +I}; storage nodes A through E are removed or joined by
F, G, H, I.]

Pitfall of Naïve Reconfiguration

[Figure: nodes A, B, C, D and all clients initially hold the configuration
{A, B, C, D}. One client runs reconfig{+E} and some replicas switch to
{A, B, C, D, E}; another client concurrently runs reconfig{-D} and other
replicas switch to {A, B, C}. A later read returns the stale value "Italy".]

Pitfall of Naïve Reconfiguration (cont.)

[Figure: with the replicas split between {A, B, C, D, E} and {A, B, C}, a
writer using {A, B, C, D, E} stores x = "Spain" (timestamp 2) at a majority of
that configuration, while a reader using {A, B, C} reads from a majority that
still holds x = "Italy" (timestamp 1). The two majorities do not intersect:
split brain!]

Reconfiguration Option 1: Centralized

- Can be automatic
  - E.g., Ursa Minor [Abd-El-Malek et al., FAST 05]
- Downtime
  - Most solutions stop R/W while reconfiguring
- Single point of failure
  - What if the manager crashes while changing the system?

      "Tomorrow Technion servers will be down for maintenance
       from 5:30am to 6:45am.
                                     Virtually Yours,
                                     Moshe Barak"

Reconfiguration Option 2: Distributed Agreement

- Servers agree on the next configuration
  - Previous solutions are not data-centric
- No downtime
- In theory, might never terminate [FLP 85]
  - In practice, we have partial synchrony, so it usually works

Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, Shraer, PODC 09]

- Distributed & completely asynchronous
- No downtime
- Always terminates
- Not data-centric

In This Work: DynaDisk, a Dynamic Data-Centric R/W Storage

1. First distributed data-centric solution
   - No downtime
2. Tunable reconfiguration method
   - Modular design: coordination is separate from data
   - Allows easily setting/comparing the coordination method
   - Consensus-based vs. asynchronous reconfiguration
3. Many shared objects
   - Running a protocol instance per object is too costly
   - Transferring all state at once might be infeasible
   - Our solution: incremental state transfer
4. Built with an external (weak) location service
   - We formally state the requirements on such a service

Location Service

- Used in practice, ignored in theory
- We formalize the weak external service as an oracle:
  - oracle.query() returns some "legal" configuration
  - If reconfigurations stop and oracle.query() is invoked infinitely many
    times, it eventually returns the last system configuration
- On its own, not enough to solve reconfiguration (a sketch follows below)
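The sketch below writes the oracle specification down as a tiny interface. The
in-memory list and the random choice are illustrative assumptions, not the
location service the authors used; the point is only that query() is allowed to
return stale configurations.

    # Illustrative location-service oracle.  query() may return any
    # previously installed ("legal") configuration; only once
    # reconfigurations stop must repeated queries eventually return the
    # last configuration.
    import random

    class LocationOracle:
        def __init__(self, initial_config):
            self._legal = [initial_config]   # every configuration ever installed

        def record(self, config):
            # Hypothetical hook, called when a reconfiguration installs `config`
            self._legal.append(config)

        def query(self):
            # Weak guarantee: some legal configuration, not necessarily the
            # latest one -- which is why the oracle alone cannot solve
            # reconfiguration.
            return random.choice(self._legal)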

The Coordination Module in DynaDisk

- Storage devices in a configuration conf = {+A, +B, +C}

[Figure: each storage device A, B, C holds the distributed R/W objects x, y, z
(updated similarly to ABD) and a "next config" field; together these fields
implement a distributed "weak snapshot" object.]

- Weak snapshot API (sketched below):
  - update(set of changes) → OK
  - scan() → set of updates
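The API above can be written down as a small interface; the Python types below
are illustrative only. The two implementations behind it, consensus-based and
asynchronous, are sketched after the next two slides.

    # Interface of the per-configuration weak snapshot object (one such
    # object is associated with each configuration, next to the R/W data).
    from abc import ABC, abstractmethod
    from typing import FrozenSet, Set

    Change = str                    # e.g. "+D" or "-C"
    Changes = FrozenSet[Change]     # one proposed set of changes

    class WeakSnapshot(ABC):
        @abstractmethod
        def update(self, changes: Changes) -> None:
            """Propose a set of changes; returns OK (None) once stored."""

        @abstractmethod
        def scan(self) -> Set[Changes]:
            """Return a set of proposed updates.  Any two non-empty sets
            returned by scan(), by any clients, intersect."""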

Coordination with Consensus

[Figure: two clients concurrently invoke reconfig({-C}) and reconfig({+D});
consensus picks +D, and +D is stored in the "next config" field of storage
devices A, B, C.]

- update: run consensus on the proposed change and store the decided update
  (see the sketch below)
- scan: read & write back the "next config" field from a majority
  - Every scan returns +D or the empty set
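A hedged sketch of the consensus-based variant of this interface. It assumes a
per-configuration consensus object (e.g., built over Active Disk Paxos) with a
propose() call, and hypothetical read_next_config()/write_next_config()
operations on the storage devices.

    # Consensus-based weak snapshot: all proposals are funnelled through one
    # consensus instance, so scan() can only ever see the single decided
    # update (or nothing).
    class ConsensusWeakSnapshot:
        def __init__(self, nodes, consensus):
            self.nodes = nodes                    # storage devices of this config
            self.consensus = consensus            # decides one update per config
            self.majority = len(nodes) // 2 + 1

        def update(self, changes):
            decided = self.consensus.propose(changes)   # e.g. +D wins over -C
            for n in self.nodes[:self.majority]:
                n.write_next_config(decided)            # store the decided update

        def scan(self):
            # Read the "next config" field from a majority and write back what
            # was found, so that later scans cannot miss it.
            found = {n.read_next_config() for n in self.nodes[:self.majority]}
            found.discard(None)                         # ignore empty fields
            for val in found:
                for n in self.nodes[:self.majority]:
                    n.write_next_config(val)
            return found                                # {} or {decided update}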

Weak Snapshot

- Weaker than consensus
  - No need to agree on the next configuration, as long as each process has a
    set of possible next configurations, and all such sets intersect
  - The intersection allows clients to converge and again use a single config
- Non-empty intersection property of weak snapshots:
  - Every two non-empty sets returned by scan() intersect
- Example:
  - Client 1's scan {+D}, Client 2's scan {+D}: allowed (this is what
    consensus gives)
  - Client 1's scan {-C}, Client 2's scan {+D, -C}: allowed, the sets intersect
  - Client 1's scan {+D}, Client 2's scan {-C}: not allowed, the sets are disjoint

Coordination without Consensus

[Figure: two clients concurrently invoke reconfig({-C}) and reconfig({+D});
each storage device A, B, C keeps a small vector of numbered proposal slots
next to its R/W objects. The figure shows the {-C} client using CAS and WRITE
operations on individual slots, with different devices ending up holding +D,
-C, or both.]

- update: install the proposed change into a proposal slot with
  compare-and-swap; if another proposal got there first, store the change in
  an additional slot (a sketch of this idea follows below)
- scan: read & write back the proposals from a majority (twice)
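Below is a very rough sketch of the asynchronous (consensus-free) idea,
assuming each storage device exposes a small vector of compare-and-swap
proposal slots via hypothetical cas_slot()/read_slots() operations. The real
DynaDisk protocol has more machinery (which slot each client owns, and how the
double collect with write-back yields the intersection guarantee); this only
shows the overall structure.

    # Asynchronous weak snapshot: no agreement on a single next update.  A
    # client first tries to claim slot 0; if another proposal beat it there,
    # it records its own proposal in its private slot instead.
    class AsyncWeakSnapshot:
        def __init__(self, nodes, client_slot):
            self.nodes = nodes                  # storage devices of this config
            self.client_slot = client_slot      # slot reserved for this client (> 0)
            self.majority = len(nodes) // 2 + 1

        def update(self, changes):
            for n in self.nodes[:self.majority]:
                # cas_slot(i, expected, new) returns True iff slot i held
                # `expected` and now holds `new` (hypothetical node operation)
                if not n.cas_slot(0, expected=None, new=changes):
                    n.cas_slot(self.client_slot, expected=None, new=changes)

        def scan(self):
            def collect():
                seen = set()                    # (slot index, proposal) pairs
                for n in self.nodes[:self.majority]:
                    for i, c in enumerate(n.read_slots()):
                        if c is not None:
                            seen.add((i, c))
                return seen

            first = collect()
            for i, c in first:                  # write back so later scans see it
                for n in self.nodes[:self.majority]:
                    n.cas_slot(i, expected=None, new=c)
            second = collect()                  # "read ... from a majority (twice)"
            return {c for _, c in first | second}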

Tracking Evolving Configs

- With consensus: agree on the next configuration
  - E.g., {A, B, C} has the single successor {A, B, C, D} via the agreed
    change +D
- Without consensus: usually a chain, sometimes a DAG
  - E.g., {A, B, C} can branch into {A, B, C, D} (via +D) and {A, B}
    (via -C); the inconsistent updates are found and merged into {A, B, D}
  - One weak-snapshot scan() returns {+D, -C}, another returns {+D};
    all non-empty scans intersect

(A rough traversal sketch follows below.)
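For illustration only, here is a rough sketch (not the paper's exact traversal
procedure) of how a client could chase the evolving configurations: keep
scanning the weak snapshot of the newest configuration it knows and merge
whatever change sets it finds. snapshot_of() and apply_changes() are
hypothetical helpers.

    # Rough sketch of tracking the configuration chain/DAG: whenever the
    # current configuration's weak snapshot returns proposals, merge them all
    # into one successor configuration (this is how branches of the DAG
    # collapse back into a chain) and continue from there.
    def track_latest_config(initial_config, snapshot_of, apply_changes):
        current = initial_config
        while True:
            updates = snapshot_of(current).scan()     # set of proposed change-sets
            if not updates:
                return current                        # no successor proposed here
            merged = set().union(*updates)            # merge e.g. {+D} and {-C}
            current = apply_changes(current, merged)  # follow the merged successor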

Consensus-Based vs. Asynch. Coordination

- Two implementations of weak snapshots
  - Asynchronous
  - Partially synchronous (consensus-based)
    - Active Disk Paxos [Chockler, Malkhi, 2005]
    - Exponential backoff for leader election
- Unlike asynchronous coordination, consensus-based might not terminate [FLP 85]
- Storage overhead (per storage device and configuration)
  - Asynchronous: a vector of updates
    - vector size ≤ min(# reconfigs, # members in config)
  - Consensus-based: 4 integers and the chosen update

Strong Progress Guarantees Are Not for Free

[Graphs: average write latency and average reconfig latency (ms) vs. the
number of simultaneous reconfig operations (1, 2, 5), comparing
consensus-based and asynchronous (no consensus) coordination.]

- Asynchronous coordination has a significant negative effect on R/W latency
- Its reconfig latency is slightly better, and much more predictable, when
  many reconfigs execute simultaneously
- The two behave the same when there are no reconfigurations

Future & Ongoing Work

- Combine asynch. and partially-synch. coordination
- Consider other weak snapshot implementations
  - E.g., using randomized consensus
- Use weak snapshots to reconfigure other services
  - Not just for R/W

Summary

- DynaDisk: dynamic data-centric R/W storage
  - First decentralized solution
  - No downtime
  - Supports many objects, provides incremental reconfig
    - Uses one coordination object per config. (not per object)
  - Tunable reconfiguration method
    - We implemented asynchronous and consensus-based coordination
    - Many other implementations of weak snapshots are possible
- Asynchronous coordination in practice:
  - Works in more circumstances → more robust
  - But at a cost: it significantly affects ongoing R/W ops