Replica Location Service

musicincurable
Data Management

Jan 31, 2013


Data Management Services in the Globus Toolkit® 3.0 Release

Globus Toolkit


The Globus Project is developing the fundamental technologies needed to build computational Grids.

The Globus Project provides software tools that make it easier to build computational Grids and Grid-based applications.

The Toolkit is used by many organizations to build computational Grids that can support their applications.

Globus Toolkit Components

The composition of the Globus Toolkit can be pictured as three pillars.

Security is the foundation common to all three pillars.

Reliable File Transfer Service

Overview

The Reliable File Transfer Service (RFT) is an OGSA-based service that provides interfaces for controlling and monitoring third-party file transfers using GridFTP servers.

The client controlling the transfer is hosted inside a grid service, so it can be managed using the soft state model and queried using the ServiceData interfaces available to all grid services.

It is essentially a reliable and recoverable version of the GT2 globus-url-copy tool, and more.

Prerequisites and Dependencies



The prerequisites for RFT are:

GridFTP server with a host certificate

PostgreSQL

PostgreSQL is used to store the state of the transfer so that it can be restarted after failures.

The interface to PostgreSQL is JDBC, so any DBMS that supports JDBC can be used.

Note: GT3 was tested with PostgreSQL version 7.3.2, and the database setup instructions provided apply to that version.
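The idea of persisting transfer state so that a transfer can restart after a failure can be sketched as follows. This is a minimal illustration, not the RFT schema: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL over JDBC, and the table name, column names, and function names are all hypothetical.

```python
import sqlite3


def open_state_store(path=":memory:"):
    """Open the store and create the (hypothetical) transfer_state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transfer_state ("
        " transfer_id TEXT PRIMARY KEY,"
        " source_url TEXT NOT NULL,"
        " dest_url TEXT NOT NULL,"
        " bytes_done INTEGER NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def record_progress(conn, transfer_id, source_url, dest_url, bytes_done, status):
    """Insert or overwrite the persisted state of one transfer."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO transfer_state VALUES (?, ?, ?, ?, ?)",
            (transfer_id, source_url, dest_url, bytes_done, status),
        )


def resume_offset(conn, transfer_id):
    """Return the byte offset to restart from (0 if the transfer is unknown)."""
    row = conn.execute(
        "SELECT bytes_done FROM transfer_state WHERE transfer_id = ?",
        (transfer_id,),
    ).fetchone()
    return row[0] if row else 0
```

After a crash, a restarted service would call resume_offset() for each unfinished transfer and continue from that offset instead of starting over.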

Prerequisites and Dependencies

GridFTP performs the actual file transfer.

The GridFTP server can only be run on Unix or Linux.

There are two ways to get GridFTP:

Packaged with the core GT3 Final installation

As part of the Globus Toolkit 2.4 distribution

Prerequisites and Dependencies

1. PostgreSQL Setup

2. Configure and Run a GridFTP Server

3. RFT Grid Service Setup

4. Build the GAR from Source Distribution

www-unix.globus.org/toolkit/reliable_transfer.html

Service Data Elements for RFT



Version: the version of RFT.

FileTransferProgress: SDE that denotes the percentage of the file that has been transferred.

FileTransferRestartMarker: SDE for the last restart marker of a particular transfer.

FileTransferJobStatusElement: SDE for the status of a particular transfer.

FileTransferStatusElement: SDE that denotes the status of all the transfers in the request.

GridFTPRestartMarkerElement: SDE for the restart marker of the transfer.

GridFTPPerfMarkerElement: SDE for the performance marker of the transfer.
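To make the progress and status elements more concrete, here is a small sketch of how such values could be derived from per-transfer counters. The class, field names, and status labels are assumptions made for illustration; the actual SDE schemas are defined by RFT and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class TransferSnapshot:
    """Hypothetical per-transfer counters, not the real SDE schema."""
    bytes_transferred: int
    total_bytes: int
    status: str  # assumed labels such as "Active" or "Finished"


def file_transfer_progress(snapshot):
    """Percentage of the file transferred, as an integer from 0 to 100."""
    if snapshot.total_bytes <= 0:
        return 0
    done = min(snapshot.bytes_transferred, snapshot.total_bytes)
    return done * 100 // snapshot.total_bytes


def request_status(snapshots):
    """Count transfers per status across the whole request."""
    counts = {}
    for snapshot in snapshots:
        counts[snapshot.status] = counts.get(snapshot.status, 0) + 1
    return counts
```

The first function corresponds loosely to what FileTransferProgress reports for one file, and the second to the request-wide view given by FileTransferStatusElement.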



The Replica Location Service



The Replica Location Service (RLS) maintains and provides access to mapping information from logical names for data items to target names.

The distributed RLS is intended to replace the centralized Globus Replica Catalog available in earlier GT2.x releases.

The distributed RLS provides higher performance, reliability, and scalability.

Replica Location Service

Replication of data items can reduce access latency, improve data locality, and increase robustness, scalability, and performance for distributed applications.

An RLS typically does not operate in isolation, but functions as one component of a data grid architecture.

Replica Location Service


Consistent local state is maintained in Local Replica Catalogs (LRCs).

Each local catalog maintains mappings between arbitrary logical file names (LFNs) and the physical file names (PFNs) associated with those LFNs on its storage system(s).

Collective state with relaxed consistency is maintained in Replica Location Indices (RLIs).

Each RLI contains a set of mappings from LFNs to LRCs.

A variety of index structures with different performance characteristics can be defined simply by varying the number of RLIs and the amount of redundancy and partitioning among them.
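The two-level structure described above (an RLI maps LFNs to LRCs, and each LRC maps LFNs to PFNs) can be sketched with in-memory dictionaries. This is an illustrative model only: the class and method names are hypothetical and do not correspond to the RLS client API.

```python
class LocalReplicaCatalog:
    """Toy LRC: maps each LFN to the PFNs held on one site's storage."""

    def __init__(self, name):
        self.name = name
        self._mappings = {}  # LFN -> set of PFNs

    def add(self, lfn, pfn):
        self._mappings.setdefault(lfn, set()).add(pfn)

    def lookup(self, lfn):
        return sorted(self._mappings.get(lfn, ()))

    def lfns(self):
        return set(self._mappings)


class ReplicaLocationIndex:
    """Toy RLI: maps each LFN to the names of LRCs that know about it."""

    def __init__(self):
        self._index = {}  # LFN -> set of LRC names

    def update_from(self, lrc):
        for lfn in lrc.lfns():
            self._index.setdefault(lfn, set()).add(lrc.name)

    def lookup(self, lfn):
        return sorted(self._index.get(lfn, ()))


def find_replicas(lfn, rli, catalogs):
    """Two-step lookup: the RLI names candidate LRCs, each LRC returns PFNs."""
    replicas = []
    for lrc_name in rli.lookup(lfn):
        replicas.extend(catalogs[lrc_name].lookup(lfn))
    return replicas
```

A client first asks an RLI which catalogs may hold a logical name, then queries only those catalogs for physical names. Adding more RLIs or partitioning LFNs among them changes the index structure without changing this lookup pattern.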





Replica Location Service


Soft state maintenance of RLI state.

LRCs send information about their state to RLIs using soft state protocols.

State information in RLIs times out and must be periodically refreshed by soft state updates.

Compression of state updates.

Optional compression uses Bloom filters to summarize the content of an LRC before sending a soft state update to an RLI node.

Membership and partitioning information maintenance.

The current RLS implementation maintains static information about the LRCs and RLIs participating in the distributed system.

As new implementations of the RLS are developed, they will use OGSA mechanisms for registration of services and for service lifetime management.
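The combination of soft state expiry and Bloom filter summaries can be sketched as below. This is a simplified model under stated assumptions: the filter size, the number of hashes, the use of SHA-256, and all class names are choices made for the example, not details of the RLS implementation. Time is passed in explicitly so the behaviour is easy to follow.

```python
import hashlib


class BloomFilter:
    """Compact summary of a set of LFNs; may report false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class SoftStateIndex:
    """Toy RLI that forgets any LRC summary not refreshed within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self._entries = {}  # LRC name -> (BloomFilter, time received)

    def receive_update(self, lrc_name, summary, now):
        self._entries[lrc_name] = (summary, now)

    def candidate_lrcs(self, lfn, now):
        # Discard entries whose last update is older than the timeout.
        self._entries = {
            name: (summary, received)
            for name, (summary, received) in self._entries.items()
            if now - received <= self.timeout
        }
        return sorted(
            name
            for name, (summary, _) in self._entries.items()
            if summary.might_contain(lfn)
        )
```

Because a Bloom filter can return false positives but never false negatives, an RLI built this way may occasionally point a client at an LRC that does not hold the name, and the client then finds out when it queries that LRC.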


Relationship to Earlier Globus Replica Management Software

The RLS is intended to replace the replica management tools available in GT2.x, including:

the Replica Catalog API

the Replica Management API

The RLS differs from these earlier components in several important ways.

As a distributed system, the RLS is designed to provide reliability by avoiding single points of failure, as well as load balancing, performance, and scalability.

The RLS implementation is based on open source relational database technology.

The RLS separates replication information from other types of metadata.

The RLS does not include information about logical collections, but assumes such information is stored in a separate metadata service.

The GridFTP Protocol and Software

What is GridFTP?

GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks.

The GridFTP protocol is based on FTP, the highly popular Internet file transfer protocol.

Protocol Features


GSI security on control and data channels


Multiple data channels for parallel transfers


Partial file transfers


Third-party (direct server-to-server) transfers


Authenticated data channels


Reusable data channels


Command pipelining



Protocol Features


Grid Security Infrastructure (GSI) and Kerberos support:

Robust and flexible authentication, integrity, and confidentiality features are critical when transferring or accessing files.

Third-party control of data transfer:

To manage large data sets for large distributed communities, it is necessary to provide third-party control of transfers between storage servers.

Parallel data transfer:

On wide-area links, using multiple TCP streams can improve aggregate bandwidth over using a single TCP stream.

Striped data transfer:

Partitioning data across multiple servers can further improve aggregate bandwidth. GridFTP supports striped data transfers through extensions defined in the Grid Forum draft.
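The basic idea behind parallel and striped transfers, dividing one file among several streams or servers, can be illustrated with a small helper that splits a file into contiguous byte ranges. This is a deliberately simplified sketch: it does not describe how the GridFTP wire protocol assigns blocks to data channels, and the function name is hypothetical.

```python
def split_into_ranges(file_size, num_streams):
    """Split [0, file_size) into at most num_streams (offset, length) ranges."""
    if file_size < 0 or num_streams < 1:
        raise ValueError("file_size must be >= 0 and num_streams must be >= 1")
    base, extra = divmod(file_size, num_streams)
    ranges = []
    offset = 0
    for i in range(num_streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue  # fewer bytes than streams: leave this stream idle
        ranges.append((offset, length))
        offset += length
    return ranges
```

Each range could then be sent over its own TCP stream (parallel transfer) or served by a different server (striped transfer), with the receiver writing each block at its offset.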



Protocol Features


Partial file transfer:

GridFTP introduces new FTP commands to support transfers of regions of a file.

Support for reliable data transfer:

Reliable transfer is important for many applications that manage data. Fault recovery methods are needed for handling transient network failures, server outages, and similar faults.

Manual control of TCP buffer size:

This is a critical parameter for achieving maximum bandwidth with TCP/IP. The protocol also supports automatic buffer size tuning.

Integrated instrumentation:

The protocol calls for restart and performance markers to be sent back. It does not specify how often, and this is something we intend to address shortly.
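Two of the features above lend themselves to a short worked sketch. The first function shows how a client could use restart markers, modelled here simply as received byte ranges, to decide which regions of a file still need a partial transfer. The second applies the common bandwidth-delay product rule of thumb for choosing a TCP buffer size; that rule is general TCP practice rather than something stated in these slides. Both function names and the marker representation are assumptions.

```python
def remaining_ranges(file_size, received):
    """Return the (start, end) ranges, end exclusive, not yet received."""
    missing = []
    cursor = 0
    for start, end in sorted(received):
        if start > cursor:
            missing.append((cursor, min(start, file_size)))
        cursor = max(cursor, end)
        if cursor >= file_size:
            break
    if cursor < file_size:
        missing.append((cursor, file_size))
    return missing


def bandwidth_delay_product_bytes(bandwidth_bits_per_second, round_trip_ms):
    """Bytes in flight needed to keep a link full: bandwidth times round-trip time."""
    return bandwidth_bits_per_second * round_trip_ms // 8000
```

For example, a 1 Gbit/s path with a 50 ms round-trip time needs roughly 6.25 MB of buffer per connection to stay full, which is why default buffer sizes often limit throughput on wide-area links.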


What Does “GridFTP” Mean?

GridFTP protocol:

This refers to the wire protocol used and is defined by a draft technical specification submitted to the Global Grid Forum.

The Globus Toolkit V2.0 GridFTP server (GT2GridFTP):

This system is the widely used open source wu-ftpd FTP server code base, extended to support the GridFTP protocol extensions.

GT2GridFTP is distributed with the Globus Toolkit.

The GridFTP family of tools:

The term “GridFTP” is also used to refer to the entire family of GridFTP tools distributed with the Globus Toolkit: the GridFTP server, client tools, client library, control library, etc.

Implementation


The Globus implementation of the GridFTP protocol takes the form of two APIs and corresponding libraries:

globus_ftp_control

globus_ftp_client

Besides supporting the protocol features described above, the APIs also include interfaces for adding software "plug-ins".

In addition to these Globus software libraries, we have also implemented:

an API/library (globus_gass_copy)

a command-line tool (globus-url-copy) that integrates GridFTP, HTTP, and local file I/O to enable secure transfers using any combination of these protocols

Globus has adapted a popular FTP server package (Washington University's wu-ftpd) to support a majority of the GridFTP protocol features (GSI security, parallel transfer, third-party transfer, partial file transfer).

Availability of GridFTP

Our data grid software is currently available to the public as components of the Globus Toolkit 2.0 release.

Prior to the GT2.x release, the software was tested and evaluated for more than a year by several external project teams who are using our technologies to build data grids for their own use.

GASS: Global Access to Secondary Storage

Requirements for a Grid I/O Service

Uniform data access

Diverse data sources

Dynamic resource set

Support for streaming I/O

Little or no program modification

Support for programmer-directed performance optimization

Joseph Bester et al. “GASS: A Data Movement and Access Service for Wide Area Computing Systems”

GASS Architecture


Common Grid File Access Patterns


Default Data Movement Strategies


Specialized Data Movement Strategies


GASS Operation


Integration with the Globus Toolkit



Common Grid File Access Patterns

Read-only access

Write-shared access

Append-only access

Unrestricted read/write

[Figure: file access patterns, with operations labeled READ, WRITE, and APPEND]

Read-only access to constant data; read entire file

Write access to entire file; multiple writers: last writer wins

Append-only access; multiple writers

Concurrent write and read access; concurrent write access to the same file

Read-only access to part of the file

Default Data Movement Strategies

GASS addresses bandwidth management issues by providing a file cache: a “local” secondary storage area.

By default, data is moved into and out of this cache when files are opened and closed.

[Figure: processes, caches, and remote servers (GASS-server, http-server, ftp-server, HPSS-server)]

GASS Operation


Grid applications access remote files using GASS by opening and closing the files with specialized open and close calls:

globus_gass_open()

globus_gass_fopen()

globus_gass_close()

globus_gass_fclose()

Note: the GASS open and close calls act like their standard Unix I/O counterparts, except that a URL rather than a file name is used to specify the location of the file data.
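The open/close behaviour described above, where data is copied into a local cache on open and the application then reads the cached copy, can be sketched for the read-only case as follows. This is not the Globus C API: the class name is hypothetical, Python's urllib stands in for the GASS transport, and the cache entry is simply deleted on close rather than managed by a real cache policy.

```python
import os
import shutil
import tempfile
import urllib.request


class CachedRemoteFile:
    """Hypothetical stand-in for a GASS-style open/close pair (read-only)."""

    def __init__(self, url, cache_dir):
        self.url = url
        # On open: copy the remote data into a file in the local cache.
        fd, self.cache_path = tempfile.mkstemp(dir=cache_dir)
        with os.fdopen(fd, "wb") as cache_file:
            with urllib.request.urlopen(url) as remote:
                shutil.copyfileobj(remote, cache_file)
        self._handle = open(self.cache_path, "rb")

    def read(self, size=-1):
        # Reads are served from the local cached copy, not the network.
        return self._handle.read(size)

    def close(self):
        # On close: release the handle and drop the cached copy.
        self._handle.close()
        os.remove(self.cache_path)
```

An application would construct CachedRemoteFile with a URL in place of a file name, call read() as it would on a local file, and call close() when done. A write-mode variant would instead copy the cached file back to the URL on close.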


Integration With Globus Toolkit


The availability of GASS services has made it straightforward to extend the GRAM API:

Both executables and the standard input, output, and error streams can be named by URLs.

GASS mechanisms are used to fetch the URL-named executable and standard input into the cache, and to redirect standard output and error.