Secure, Reliable, and Efficient Data Replica Management in Grid Networks



Kelly Clynes and Caitlin Minteer

Radford University

Abstract

This paper describes two services that are fundamental to any Data Grid: high-speed transport and replica management. The replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. The Replica Location Service and a brief overview of the Globus Toolkit are also covered, along with the justification for the Data Management topic and the four main components of the Globus Toolkit.



1 Introduction


The term grid computing refers to the emerging computational and networking infrastructure that is designed to provide pervasive, uniform, and reliable access to data, computational, and human resources distributed over wide-area environments. Grid computing has developed in order to apply the resources of many computers in a network to a single scientific or technological problem at the same time. Data-intensive, high-performance computing applications in the grid require the efficient management and transfer of information in a wide-area, distributed computing environment using file sizes as large as terabytes or petabytes. Massive data sets must be shared by a large community of hundreds or thousands of researchers who are distributed around the world. These researchers require efficient transfer of large data sets to perform analyses at their local sites or at other remote resources. In many cases, the researchers create local copies or replicas of the experimental data sets to overcome wide-area data transfer latencies.

Grid computing uses software to divide and distribute pieces of a program to as many as several thousand computers. The Globus Toolkit is a technology for the "Grid": an open-source toolkit for building computing grids developed by the Globus Alliance. It lets people share computing power, databases, and other tools securely online across corporate, institutional, and geographic boundaries without sacrificing local independence. The toolkit includes not only software services but also libraries for resource monitoring, discovery, security, and file management. It is packaged as a set of components that can be used either independently or together to develop applications.

The Globus Toolkit has grown through an open-source strategy that encourages broader, more rapid adoption and leads to greater technical innovation, as the open-source community provides continual enhancements to the product.

In addition to being a central part of science and engineering projects that total nearly a half-billion dollars internationally, the Globus Toolkit is a substrate on which leading IT companies are building significant commercial Grid products. Every organization has different modes of operation, and collaboration between multiple organizations is affected by incompatibility of resources such as data archives, computers, and networks. The Globus Toolkit was created to remove obstacles that prevent seamless collaboration. Its core services, interfaces, and protocols allow users to access remote resources as if they were located within their own machine room, while at the same time preserving local control over who can use resources and when.

This paper discusses two fundamental data management components: Reliable File Transfer and the Replica Location Service. It also provides a brief overview of Data Management, GridFTP, and Replica Management.

2 Secure, Reliable Transfer


Data-intensive applications such as scientific applications require two fundamental data management components, upon which higher-level components can be built:

- Reliable File Transfer (RFT) is a web service that provides "job scheduler"-like functionality for data movement. Its transfer protocol is used in wide-area environments; ideally, this protocol would be universally adopted to provide access to the widest variety of available storage systems.



- Replica Location Service (RLS) includes services for registering and locating all physical locations for files and collections.

Higher-level services that can be built upon these fundamental components include reliable creation of a copy of a large data collection at a new location; selection of the best replica for a data transfer operation based on performance estimates provided by external information services; and automatic creation of new replicas in response to application demands.
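RFT's "job scheduler"-like behavior can be illustrated with a small sketch. This is not the Globus RFT API: the class and function names below are hypothetical, and the restart marker (resuming from the last committed byte offset after a fault) is modeled only as a counter.

```python
import time

class TransferJob:
    """Hypothetical RFT-style request: move one file from src to dst."""
    def __init__(self, src, dst):
        self.src, self.dst = src, dst
        self.bytes_done = 0    # models a restart marker: resume point after a fault
        self.state = "pending"

def run_job(job, transfer_fn, max_retries=3, backoff=1.0):
    """Scheduler-like loop: retry a failed transfer with exponential backoff,
    resuming from job.bytes_done instead of starting over."""
    for attempt in range(max_retries):
        try:
            job.state = "active"
            transfer_fn(job)   # performs (or resumes) the transfer; may raise
            job.state = "done"
            return job
        except IOError:
            job.state = "retrying"
            time.sleep(backoff * (2 ** attempt))
    job.state = "failed"
    return job
```

A real RFT service persists job state in a database so transfers survive client and service restarts; this sketch keeps state in memory only.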


3 The Globus Architecture and Data Management

There are four main components of the Globus Toolkit: the Grid Security Infrastructure (GSI), the Globus Resource Management architecture, the Information Management architecture, and the Data Management architecture.



The Grid Security Infrastructure (GSI) provides authentication and authorization services using public key certificates as well as Kerberos authentication.

The Globus Resource Management architecture provides a language for specifying application requirements and mechanisms for immediate and advance reservations of one or more computational components. This architecture also provides several interfaces for submitting jobs to remote machines.


The Globus Information Management architecture provides a distributed scheme for publishing and retrieving information about resources in the wide-area environment. A distributed collection of information servers is accessed by higher-level services that perform resource discovery, configuration, and scheduling.


The last major component of Globus is the Data Management architecture. The Globus Data Management architecture, or Data Grid, provides two fundamental components: a universal data transfer protocol for grid computing environments called GridFTP and a Replica Management infrastructure for managing multiple copies of shared data sets.



4 GridFTP: A Secure, Efficient Data Transport Mechanism

FTP is a widely implemented and well-understood IETF standard protocol. The FTP protocol was extended because it was observed that FTP is the protocol most commonly used for data transfer on the Internet and the most likely candidate for meeting the Grid's needs. As a result, there is a large base of code and expertise from which to build. The FTP protocol provides a well-defined architecture for protocol extensions and supports dynamic discovery of the extensions supported by a particular implementation. Numerous groups have added extensions through the IETF, and some of these extensions will be particularly useful in the Grid. In addition to client/server transfers, the FTP protocol also supports transfers directly between two servers, mediated by a third-party client (i.e., "third-party transfer").

There is a universal grid data transfer and access protocol called GridFTP that provides secure, efficient data movement in Grid environments. This protocol, which extends the standard FTP protocol, provides a superset of the features offered by the various Grid storage systems currently in use. Using GridFTP as a common data access protocol would be mutually advantageous to grid storage providers and users. Storage providers would gain a broader user base, because their data would be available to any client, while storage users would gain access to a broader range of storage systems and data.

The following diagrams illustrate GridFTP. The Control Channel (CC) is the path between client and server that is used to exchange all information needed to establish data channels. The Data Channel (DC) is the network pathway that the files flow over. The Control Channel Interpreter (CCI) is the server-side implementation of the control-channel functionality, and the client is the client-side implementation of that functionality. The Data Protocol Interpreter (DPI) handles the actual transferring of files.
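The third-party arrangement shown in Figure 2 can be modeled with a toy sketch: the client talks to the servers' CCIs over control channels, while the file bytes move directly between the two servers' DPIs. The `Server` class and the `STOR_TO` command name below are illustrative assumptions, not actual GridFTP commands.

```python
class Server:
    """Toy GridFTP server: a control-channel interpreter (CCI) that accepts
    commands, plus a data-protocol interpreter (DPI) that moves file bytes."""
    def __init__(self, name, files=None):
        self.name = name
        self.files = dict(files or {})  # filename -> contents

    def control(self, command, filename, peer):
        """CCI role: act on a command received over the control channel."""
        if command == "STOR_TO":        # hypothetical command name
            peer.receive(filename, self.files[filename])

    def receive(self, filename, data):
        """DPI role: endpoint of the data channel."""
        self.files[filename] = data

def third_party_transfer(src, dst, filename):
    """The client mediates via control channels only; the data channel
    runs server-to-server, so the bytes never pass through the client."""
    src.control("STOR_TO", filename, dst)
```

In the two-party case of Figure 1 the client itself plays one DPI role; the third-party case differs only in that both data-channel endpoints are servers.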


Figure 1: Simple Two Party Transfers


Figure 2: Simple Third Party Transfer






Figure 3: Striping


5 Replica Management

In this section, the Globus Replica Management architecture, which is responsible for managing complete and partial copies of data sets, will be discussed. Replica management is an important issue for a number of scientific applications. As an example, consider a data set that contains petabytes of experimental results for a particle physics application. While the complete data set may exist in one or possibly several physical locations, it is likely that many universities, research laboratories, or individual researchers will have insufficient storage to hold a complete copy. Instead, they will store copies of the most relevant portions of the data set on local storage for faster access.

Services provided by a replica management system include:

- Creating new copies of a complete or partial data set
- Registering these new copies in a Replica Catalog
- Allowing users and applications to query the catalog to find all existing copies of a particular file or collection of files
- Selecting the "best" replica for access based on storage and network performance predictions provided by a Grid information service
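The catalog side of these services can be sketched as a simple in-memory mapping from logical file names to physical locations. The class and method names below are illustrative, not the Globus Replica Catalog API:

```python
class ReplicaCatalog:
    """Maps each logical file name to the set of physical locations
    (URLs) that hold a copy of it."""
    def __init__(self):
        self._replicas = {}  # logical name -> set of physical URLs

    def register(self, logical_name, physical_url):
        """Record a new copy of a logical file."""
        self._replicas.setdefault(logical_name, set()).add(physical_url)

    def unregister(self, logical_name, physical_url):
        """Forget a copy, e.g. after it is deleted from local storage."""
        self._replicas.get(logical_name, set()).discard(physical_url)

    def locate(self, logical_name):
        """All known physical copies of a logical file, in sorted order."""
        return sorted(self._replicas.get(logical_name, set()))
```

Keeping the logical name separate from physical URLs is what lets researchers hold partial local copies: a site registers only the portions it stores, and queries still find every copy worldwide.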

The Globus replica management architecture is a layered architecture. At the lowest level is a Replica Catalog that allows users to register files as logical collections and provides mappings between logical names for files and collections and the storage system locations of one or more replicas of these objects. A Replica Catalog API in C has also been implemented, as well as a command-line tool; these functions and commands perform low-level manipulation operations for the replica catalog, including creating, deleting, and modifying catalog entries.

The basic replica management services that are provided can be used by higher-level tools to select among replicas based on network or storage system performance, or to create new replicas automatically at desirable locations. Some of these higher-level services will be implemented in the next generation of the replica management infrastructure.
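One such higher-level service, replica selection, can be sketched as picking the copy with the lowest estimated transfer time. The tuple format below stands in for the performance estimates a Grid information service would supply; it is an assumption for illustration, not an actual Globus interface.

```python
def best_replica(replicas, size_mb):
    """replicas: list of (url, bandwidth_MBps, latency_s) tuples, as a
    performance-information service might report them (illustrative).
    Returns the URL with the lowest estimated transfer time."""
    def est_time(replica):
        _url, bandwidth_mbps, latency_s = replica
        # estimated transfer time = startup latency + size / bandwidth
        return latency_s + size_mb / bandwidth_mbps
    return min(replicas, key=est_time)[0]
```

For large files the bandwidth term dominates, so a nearby low-bandwidth mirror can still lose to a distant high-bandwidth one.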




6 Summary

This paper has given a brief overview of the Globus Toolkit and has briefly covered the justification for the Data Management topic. The RLS and RFT topics were explained, as well as the four main components of the Globus Toolkit and replica management.















