
REDDnet: Enabling Data Intensive Science in the Wide Area

Alan Tackett

Mat Binkley, Bobby Brown, Chris Brumgard, Santiago de Ledesma, and Matt Heller

Vanderbilt University

Overview

What is REDDnet? And why is it needed?

Logistical Networking

Distributed Hash Tables

L-Store

Putting it all together

This is a work in progress! Most of the metadata operations are in the prototype stage.

REDDnet: Distributed Storage Project
http://www.reddnet.org

NSF-funded project

Currently have 500 TB

Currently 13 sites

Multiple disciplines:

Satellite imagery (AmericaView)

HEP

Terascale Supernova Initiative

Bioinformatics

Vanderbilt TV News Archive

Large Synoptic Survey Telescope

CMS-HI

Goal: providing working storage for distributed scientific collaborations.

"Data Explosion"

(Chart: Metadata vs. Raw storage)

Focus on increasing bandwidth and raw storage

Assume metadata growth is minimal

"Data Explosion" of both data and metadata

(Chart: Metadata vs. Raw storage)

Focus on increasing bandwidth and raw storage

Assume metadata growth is minimal

Works great for large files

For collections of small files the metadata becomes the bottleneck

Need the ability to scale metadata

ACCRE examples:

Proteomics: 89,000,000 files totaling 300 GB

Genetics: 12,000,000 files totaling 50 GB in a single directory

Design Requirements

Availability

Should survive a partial network outage

No downtime, hot upgradable

Data and Metadata Integrity

End-to-end guarantees are a must!

Performance

Metadata (transactions/s)

Data (MB/s)

Security

Fault Tolerance

Should survive multiple complete device failures

Metadata and data have very different design requirements

Logistical Networking

Logistical Networking (LN) provides a "bits are bits" infrastructure

Standardize on what we have an adequate common model for:

Storage/buffer management

Coarse-grained data transfer

Leave everything else to higher layers:

End-to-end services: checksums, encryption, error encoding, etc.

Enable autonomy in wide-area service creation

Security, resource allocation, QoS guarantees

Gain the benefits of interoperability!

One structure serves all




IBP: Internet Backplane Protocol

Middleware for managing and using remote storage

Allows advanced space and time reservation

Supports multiple connections per depot

User-configurable block size

Designed to support large-scale, distributed systems

Provides a global "malloc()" and "free()"

End-to-end guarantees

Capabilities

Each allocation has separate Read/Write/Manage keys

LoCI Tools*
Logistical Computing and Internetworking Lab
*http://loci.cs.utk.edu

IBP is the "waist of the hourglass" for storage

What makes LN different?

Based on a highly generic abstract block for storage (IBP)


Storage Depot

(Diagram: an IBP server, or depot, hosting several resources, e.g. RID 1005: 2 TB disk, RID CMS-1004: 120 GB SSD, RID 1006: 1 TB disk, plus the OS and optionally the RID DBs)

Runs the ibp_server process

Resource

Unique ID

Separate data and metadata partitions (or directories)

Optionally can import metadata to SSD

Typically a JBOD disk configuration

Heterogeneous disk sizes and types

Don't have to use dedicated disks




IBP functionality

Allocate

Reserve space for a limited time

Can create allocation aliases to control access

Enable block-level disk checksums to detect silent read errors

Manage allocations

INC/DEC allocation reference count (when it reaches 0 the allocation is removed)

Extend duration

Change size

Query duration, size, and reference counts

Read/Write

Can use either append mode or random access

Optionally use network checksums for end-to-end guarantees

Depot-Depot Transfers

Data can be either pushed or pulled to allow firewall traversal

Supports both append and random offsets

Supports network checksums for transfers

Phoebus support: I2 overlay routing to improve performance

Depot Status

Number of resources and their resource IDs

Amount of free space

Software version
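
Below is a minimal, self-contained sketch of the allocation bookkeeping described above: time-limited allocations, three separate capability keys, and a reference count that frees the space when it reaches zero. It only models the semantics; it is not the IBP client API (a real example using the asynchronous IBP client library appears later in these slides).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef struct {
    char read_key[32], write_key[32], manage_key[32];  /* separate caps per allocation */
    size_t size;          /* reserved bytes */
    time_t expires;       /* allocations are time-limited */
    int refcount;         /* allocation is removed when this reaches 0 */
    unsigned char *data;  /* backing store (disk on a real depot) */
} toy_alloc_t;

static toy_alloc_t *toy_allocate(size_t size, int duration_s)
{
    toy_alloc_t *a = calloc(1, sizeof(*a));
    a->size = size;
    a->expires = time(NULL) + duration_s;
    a->refcount = 1;
    a->data = calloc(1, size);
    snprintf(a->read_key,   sizeof(a->read_key),   "R-%p", (void *)a);
    snprintf(a->write_key,  sizeof(a->write_key),  "W-%p", (void *)a);
    snprintf(a->manage_key, sizeof(a->manage_key), "M-%p", (void *)a);
    return a;
}

/* Mirrors the manage INC/DEC operation: the space is reclaimed at zero. */
static void toy_decref(toy_alloc_t *a)
{
    if (--a->refcount == 0) { free(a->data); free(a); }
}

int main(void)
{
    toy_alloc_t *a = toy_allocate(1 << 20, 900);   /* 1 MB reserved for 15 minutes */
    printf("caps: %s %s %s\n", a->read_key, a->write_key, a->manage_key);
    memcpy(a->data, "hello", 5);                   /* a "write" against the allocation */
    toy_decref(a);                                 /* refcount hits 0, space is removed */
    return 0;
}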


What is a "capability"?

Controls access to an IBP allocation

3 separate caps per allocation:

Read

Write

Manage: delete or modify an allocation

Alias or proxy allocations supported

The end user never has to see the true capabilities

Can be revoked at any time


Alias Allocations

A temporary set of capabilities providing limited access to an existing allocation

Byte-range or full access is allowed

Multiple alias allocations can exist for the same primary allocation

Can be revoked at any time

Split/Merge Allocations

Split

Splits an existing allocation's space into two unique allocations

Data is not shared. The initial primary allocation is truncated.

Merge

Merges two allocations together, with only the "primary" allocation's capabilities surviving

Data is not merged. The primary's size is increased and the secondary allocation is removed.

Designed to prevent denial-of-"space" attacks

Allows implementation of user and group quotas

Split and Merge are atomic operations

Prevents another party from grabbing the space during the operation.


Depot-Depot Transfers

(Diagram, built over three slides: Depot A and Depot B each sit behind a firewall on a private IP, with only public IPs reachable. A direct copy from A to B gets blocked at the firewall, so either A issues a PUSH or B issues a PULL.)

exNode

XML file containing metadata

(Diagram: an exNode maps byte ranges of a file, e.g. 0-100-200-300, onto allocations A, B, and C held on IBP depots across the network; examples show a normal file, a file replicated at different sites, and a file replicated and striped)

Analogous to a disk i-node and contains:

Allocations

How to assemble the file

Fault-tolerance encoding scheme

Encryption keys

eXnode3

Exnode

A collection of layouts

Different layouts can be used for versioning, replication, optimized access (row vs. column), etc.

Layout

Simple mappings for assembling the file

Layout offset, segment offset, length, segment

Segment

A collection of blocks with a predefined structure

Type: linear, striped, shifted stripe, RAID5, generalized Reed-Solomon

Block

Simplest example is a full allocation

Can also be an allocation fragment
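
As a rough illustration of the hierarchy above, the C structs below sketch one way to represent an exnode, its layouts, their mappings, and the segments and blocks underneath. The names and fields are assumptions made for the sketch, not the actual exNode/L-Store definitions.

#include <stdint.h>

typedef enum { SEG_LINEAR, SEG_STRIPED, SEG_SHIFTED_STRIPE,
               SEG_RAID5, SEG_REED_SOLOMON } segment_type_t;

typedef struct {                 /* block: a full allocation or a fragment of one */
    char *read_cap, *write_cap, *manage_cap;   /* IBP capabilities */
    uint64_t offset, length;                   /* fragment within the allocation */
} block_t;

typedef struct {                 /* segment: blocks with a predefined structure */
    segment_type_t type;
    int n_blocks;
    block_t *blocks;
} segment_t;

typedef struct {                 /* one mapping row inside a layout */
    uint64_t layout_offset;      /* where these bytes sit in the file view */
    uint64_t segment_offset;     /* where they sit inside the segment */
    uint64_t length;
    segment_t *segment;
} mapping_t;

typedef struct {                 /* layout: simple mappings that assemble the file */
    int n_mappings;
    mapping_t *mappings;
} layout_t;

typedef struct {                 /* exnode: a collection of layouts, e.g. versions, */
    int n_layouts;               /* replicas, or access-optimized views             */
    layout_t *layouts;
} exnode_t;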

Segment types

(Diagram: block-placement patterns for each segment type)

Linear

A single logical device; blocks can have unequal size; access is linear

Striped

Multiple devices

Shifted Stripe

Multiple devices using a shifted stride between rows

RAID5 and Reed-Solomon segments are variations of the Shifted Stripe
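
A small worked example of the striping arithmetic, assuming the usual round-robin placement of fixed-size stripe units; the unit size and device count below are made up for illustration.

#include <stdio.h>
#include <stdint.h>

typedef struct { int device; uint64_t device_offset; } stripe_loc_t;

/* Map a logical byte offset to (device, offset on that device) for a plain
 * striped segment.  A shifted stripe would rotate the starting device each
 * row, e.g. device = (unit + row) % n_devices. */
static stripe_loc_t stripe_map(uint64_t logical_off,
                               uint64_t unit_size, int n_devices)
{
    uint64_t unit = logical_off / unit_size;     /* which stripe unit */
    uint64_t row  = unit / n_devices;            /* which row of units */
    stripe_loc_t loc;
    loc.device        = (int)(unit % n_devices);
    loc.device_offset = row * unit_size + logical_off % unit_size;
    return loc;
}

int main(void)
{
    /* 1 MB stripe units spread across 4 devices */
    stripe_loc_t l = stripe_map(5ULL * 1024 * 1024 + 17, 1024 * 1024, 4);
    printf("device %d, offset %llu\n", l.device,
           (unsigned long long)l.device_offset);
    return 0;
}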

Optimize data layout for WAN performance

(Diagram: a video file whose file header and per-frame headers are interleaved with the frame data, versus an optimized layout with header information and frame data in separate segments)

Skip to a specific video frame:

Have to read the file header and each frame header

WAN latency and disk seeks kill performance

Optimized layout:

Separate header and data into separate segments

The layout contains mappings to present the traditional file view

All header data is contiguous, minimizing disk seeks

Easy to add additional frames

Latency Impact on Performance

(Diagram: a synchronous client waits out the network latency for each command in turn, while an asynchronous client overlaps commands in time)

Asynchronous Internal Architecture

(Diagram: operations flow from a global host queue into per-host in-flight queues, each with independent send and receive queues; completed operations are retired, and socket errors cause a retry)

Each host has a dedicated queue

Multiple in-flight queues for each host (depends on load)

Send/Recv queues are independent threads

The socket library must be thread safe!

Each operation has a weight

An in-flight queue has a max weight

If socket errors occur, the number of in-flight queues is adjusted for stability
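
The weight rule in the last few bullets can be sketched as a simple admission check; this is a hypothetical illustration, not the actual client code, and the queue structures are invented for the sketch.

typedef struct op    { int weight; struct op *next; } op_t;
typedef struct queue { op_t *head; int weight, max_weight; } queue_t;

/* Move operations from the per-host queue into an in-flight queue only
 * while the queue's accumulated weight stays at or under its maximum. */
static void fill_inflight(queue_t *host_q, queue_t *inflight_q)
{
    while (host_q->head &&
           inflight_q->weight + host_q->head->weight <= inflight_q->max_weight) {
        op_t *op = host_q->head;
        host_q->head = op->next;       /* take the next queued operation */

        op->next = inflight_q->head;   /* hand it to the in-flight queue */
        inflight_q->head = op;
        inflight_q->weight += op->weight;
    }
}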

Asynchronous client interface

Similar to an MPI_WaitAll() or MPI_WaitAny() framework

All commands return handles

Execution can start immediately or be delayed

Failed ops can be tracked


#include <stdio.h>
#include <stdlib.h>
#include <time.h>
//** Plus the IBP client header that declares the ibp_* types and ops (not shown on the slide)

ibp_capset_t *create_allocs(int nallocs, int asize, ibp_depot_t *depot)
{
    int i, err;
    ibp_attributes_t attr;
    oplist_t *oplist;
    ibp_op_t *op;

    //** Create the caps list which is returned **
    ibp_capset_t *caps = (ibp_capset_t *)malloc(sizeof(ibp_capset_t)*nallocs);

    //** Specify the allocation attributes **
    set_ibp_attributes(&attr, time(NULL) + 900, IBP_HARD, IBP_BYTEARRAY);

    oplist = new_ibp_oplist(NULL);    //** Create a new list of ops
    oplist_start_execution(oplist);   //** Go on and start executing tasks. This could be done at any time

    //*** Main loop for creating the allocation ops ***
    for (i=0; i<nallocs; i++) {
        //** This is the actual alloc op (ibp_timeout is a global timeout defined elsewhere)
        op = new_ibp_alloc_op(&(caps[i]), asize, depot, &attr, ibp_timeout, NULL);
        add_ibp_oplist(oplist, op);   //** Now add it to the list and start execution
    }

    err = ibp_waitall(oplist);        //** Now wait for them all to complete
    if (err != IBP_OK) {
        printf("create_allocs: At least 1 error occurred! ibp_errno=%d nfailed=%d\n",
               err, ibp_oplist_nfailed(oplist));
    }

    free_oplist(oplist);              //** Free all the ops and oplist info

    return(caps);
}

Distributed Hash Tables

Distributed Hash Tables

Provide a mechanism to distribute a single hash table across multiple nodes

Used by many peer-to-peer systems

Properties

Decentralized: no centralized control

Scalable: most can support 100s to 1000s of nodes

Lots of implementations

Chord, Pastry, Tapestry, Kademlia, Apache Cassandra, etc.

Most differ by their lookup() implementation

The key space is partitioned across all nodes

Each node keeps a local routing table; typically routing is O(log n).


Chord Ring*

Distributed Hash Table

Maps a key (K##) = hash(name)

Nodes (N##) are distributed around the ring and are responsible for the keys "behind" them.

O(lg N) lookup

(Diagram: nodes N1, N8, N14, N21, N32, N38, N42, N56 on the ring holding keys K10, K24, K30, K54; the valid key range for N32 is 22-32)

*F. Dabek, et al., "Building Peer-to-Peer Systems With Chord, a Distributed Lookup Service." In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.

Locating keys

(Diagram: lookup(54) starting at N8 and hopping across the ring using N8's finger table: N8+1 -> N14, N8+2 -> N14, N8+4 -> N14, N8+8 -> N21, N8+16 -> N32, N8+32 -> N42; the ring holds nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56 and keys K10, K24, K30, K54)
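
To make the ring ownership concrete, here is a tiny self-contained sketch that maps the keys from the slide to their successor nodes. A real Chord node reaches the same answer in O(log n) hops via its finger table; the linear scan below only shows which node owns which key.

#include <stdio.h>

/* Return the first node at or clockwise-after the key (its successor). */
static int successor(const int *nodes, int n, int key)
{
    for (int i = 0; i < n; i++)
        if (nodes[i] >= key)
            return nodes[i];
    return nodes[0];               /* wrap around the ring */
}

int main(void)
{
    int ring[] = {1, 8, 14, 21, 32, 38, 42, 48, 51, 56};   /* N## from the slide */
    int keys[] = {10, 24, 30, 54};                         /* K## from the slide */
    int n = sizeof(ring) / sizeof(ring[0]);

    for (int i = 0; i < 4; i++)
        printf("K%d -> N%d\n", keys[i], successor(ring, n, keys[i]));
    return 0;
}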

Apache Cassandra

http://cassandra.apache.org/

Open source, actively developed, used by major companies

Created at and still used by Facebook; now an Apache project and also used by Digg, Twitter, Reddit, Rackspace, Cloudkick, Cisco

Decentralized

Every node is identical, no single points of failure

Scalable

Designed to scale across many machines to meet throughput or size needs (100+ node production systems)

Elastic Infrastructure

Nodes can be added and removed without downtime

Fault tolerance through replication

Flexible replication policy, with support for location-aware policies (rack and data center)

Uses Apache Thrift for communication between clients and server nodes

Thrift bindings are available for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml. Language-specific client libraries are also available.


Apache Cassandra

http://cassandra.apache.org/

High write speed

Use of RAM for write caching and sorting, and append-only disk writes (sequential writes are fast, seeking is slow, especially for mechanical hard drives)

Durable

Use of a commit log ensures that data will not be lost if a node goes down (similar to filesystem journaling).

Elastic Data Organization

Schema-less; namespaces, tables, and columns can be added and removed on the fly

Operations support various consistency requirements

Reads/writes of some data may favor high availability and performance and accept the risk of stale data.

Reads/writes of other data may favor stronger consistency guarantees and sacrifice performance and availability when nodes fail.

Does not support SQL and the wide range of query types of an RDBMS

Limited atomicity and support for transactions (no commit, rollback, etc.)
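
The consistency trade-off above can be made concrete with the usual quorum rule for Dynamo-style replication: with N replicas, a read that consults R nodes is guaranteed to overlap a write acknowledged by W nodes whenever R + W > N. The values below are illustrative, not REDDnet settings.

#include <stdio.h>

/* True when read and write quorums must overlap, so reads see the latest write. */
static int overlapping_quorums(int n, int r, int w) { return r + w > n; }

int main(void)
{
    int n = 3;   /* replication factor */
    printf("N=3, R=1, W=1: %s\n", overlapping_quorums(n, 1, 1)
           ? "consistent reads" : "may read stale data");   /* fast, stale data possible */
    printf("N=3, R=2, W=2: %s\n", overlapping_quorums(n, 2, 2)
           ? "consistent reads" : "may read stale data");   /* quorum reads and writes */
    return 0;
}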

L-Store

Putting it all together

L-Store

http://www.lstore.org

Generic object storage framework using a tree-based structure

Data and metadata scale independently

Large objects, typically file data, are stored in IBP depots

Metadata is stored in Apache Cassandra

Fault-tolerant in both data and metadata

No downtime for adding or removing storage or metadata nodes

Heterogeneous disk sizes and types supported

Lifecycle management: can migrate data off old servers

Functionality can be easily extended via policies.

The L-Store core is just a routing framework. All commands, both system and user provided, are added in the same way.

L-Store Policy

A Policy is used to extend L-Store functionality and consists of 3 components:

Commands, stored procedures, and services

Each can be broken into an ID, arguments, and an execution handler

Arguments are stored in Google protocol buffer messages

The ID is used for routing

Logistical Network Message Protocol

Provides a common communication framework

Message Format: [4 byte command] [4 byte message size] [message] [data]

Command

Similar to an IPv4 octet (10.0.1.5)

Message size

32-bit unsigned big-endian value

Message

Google Protocol Buffer message

Designed to handle command options/variables

The entire message is always unpacked using an internal buffer

Data

Optional field for large data blocks
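
A minimal sketch of framing a message with the header layout above: 4 command bytes, a 4-byte big-endian size, then the packed message (any bulk data would follow separately). The command value and payload here are placeholders, not real L-Store commands.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write [4 byte command][4 byte big-endian size][message] into out and
 * return the number of bytes framed. */
static size_t frame_message(uint8_t *out, const uint8_t cmd[4],
                            const uint8_t *msg, uint32_t msg_len)
{
    memcpy(out, cmd, 4);                   /* [4 byte command] */
    out[4] = (uint8_t)(msg_len >> 24);     /* [4 byte message size], big-endian */
    out[5] = (uint8_t)(msg_len >> 16);
    out[6] = (uint8_t)(msg_len >> 8);
    out[7] = (uint8_t)(msg_len);
    memcpy(out + 8, msg, msg_len);         /* [message]; [data] is sent afterwards */
    return 8 + msg_len;
}

int main(void)
{
    uint8_t buf[64];
    const uint8_t cmd[4] = {10, 0, 1, 5};              /* octet-style command, as in the slide */
    const uint8_t msg[]  = "packed-protobuf-bytes";    /* placeholder for a packed protobuf */
    size_t n = frame_message(buf, cmd, msg, (uint32_t)(sizeof(msg) - 1));
    printf("framed %zu bytes\n", n);
    return 0;
}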


Google Protocol Buffers

http://code.google.com/apis/protocolbuffers/

Language-neutral mechanism for serializing data structures

Binary compressed format

Special "compiler" to generate code for packing/unpacking messages

Messages support most primitive data types, arrays, optional fields, and message nesting

package activityLogger;

message alog_file {
    required string name        = 1;
    required uint32 startTime   = 2;
    required uint32 stopTime    = 3;
    required uint32 depotTime   = 4;
    required sint64 messageSize = 5;
}


L-Store Policies

Commands

Process heavyweight operations, including multi-object locking coordination via the locking service

Stored Procedures

Lightweight atomic operations on a single object.

A simple example is updating multiple key/value pairs on an object.

More sophisticated examples would test/operate on multiple key/value pairs and only perform the update on success.

Services

Triggered via metadata changes made through stored procedures

Services register with a messaging queue or create new queues listening for events.

Can trigger other L-Store commands

Can't access metadata directly

L-Store Architecture

Client

The client contacts L-Servers to issue commands

Data transfers are done directly via the depots, with the L-Server providing the capabilities

Can also create locks if needed

(In the architecture diagrams, arrow direction signifies which component initiates communication)

L-Store Architecture

L-Server

Essentially a routing layer for matching incoming commands to execution services

All persistent state information is stored in the metadata layer

Clients can contact multiple L-Servers to increase performance

Provides all Auth/AuthZ services

Controls access to metadata

L-Store Architecture

StorCore*

From Nevoa Networks

Disk Resource Manager

Allows grouping of resources into Logical Volumes

A resource can be a member of multiple Dynamical Volumes

Tracks depot reliability and space

*http://www.nevoanetworks.com/

L-Store Architecture

Metadata

L-Store DB

Tree-based structure to organize objects

Arbitrary number of key/value pairs per object

All MD operations are atomic and only operate on a single object

Multiple key/value pairs can be updated simultaneously

Can be combined with testing (stored procedure)

L-Store Architecture

Depot

Used to store large objects, traditionally file contents

IBP storage

Fault tolerance is done at the L-Store level

Clients directly access the depots

Depot-to-depot transfers are supported

Multiple depot implementations

The ACCRE depot* is actively being developed

*http://www.lstore.org/pwiki/pmwiki.php?n=Docs.IBPServer

L-Store Architecture

Messaging Service

Event notification system

Uses the Advanced Message Queuing Protocol (AMQP)

L-Store provides a lightweight layer to abstract all queue interaction

L-Store Architecture

Services

Mainly handle events triggered by metadata changes

L-Server commands and other services can also trigger events

Typically services are designed to handle operations that are not time sensitive

L-Store Architecture

Lock Manager

A variation of the Cassandra metadata nodes, but focused on providing a simple locking mechanism

Each lock is of limited duration, just a few seconds, and must be continually renewed

Guarantees that a dead client doesn't lock up the entire system

To prevent deadlock, a client must present all objects to be locked in a single call
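
As a rough sketch of the client-side behavior this implies (lock every object in one call, then keep renewing the short lease until the work is done), the functions below are hypothetical stand-ins, not the L-Store lock manager API.

#include <stdio.h>
#include <time.h>

#define LEASE_SECONDS 5   /* "just a few seconds" */

/* Hypothetical request that locks every object in a single call and
 * returns the lease expiry time. */
static time_t acquire_all(const char **objects, int n)
{
    (void)objects; (void)n;
    return time(NULL) + LEASE_SECONDS;
}

/* Hypothetical renewal request; returns the new expiry time. */
static time_t renew_lease(void)
{
    return time(NULL) + LEASE_SECONDS;
}

int main(void)
{
    const char *objs[] = {"objectA", "objectB"};   /* all locks requested at once */
    time_t expiry = acquire_all(objs, 2);

    for (int step = 0; step < 3; step++) {
        /* ... perform part of the locked operation ... */
        if (time(NULL) > expiry - 2)               /* renew before the lease lapses */
            expiry = renew_lease();
    }
    printf("finished before lease expiry at %ld\n", (long)expiry);
    return 0;
}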

Current functionality

Basic functionality

ls, cp, mkdir, rmdir, etc.

Augment

Create an additional copy using a different LUN

Trim

Remove a LUN copy

Add/remove arbitrary attributes

RAID5 and linear stripe supported

FUSE mount for Linux and OS X

Used for CMS-HI

Basic messaging and services

L-Sync, similar to Dropbox

Tape archive or HSM integration



L-Store Performance circa 2004

(Graph: roughly 3 GB/s aggregate sustained over 30 minutes)

Multiple simultaneous writes to 24 depots.

Each depot is a 3 TB disk server in a 1U case.

30 clients on separate systems uploading files.

The rate has scaled linearly as depots were added.

The latest generation depot is 72 TB with 18 Gb/s sustained transfer rates.

Summary

REDDnet provides short-term working storage

Scalable in both metadata (ops/s) and storage (MB/s)

Block-level abstraction using IBP

Apache Cassandra for metadata scalability

Flexible framework supporting the addition of user operations