Secure, Reliable, and Efficient Data Replica
Management in Grid Networks
Kelly Clynes and Caitlin Minteer
Radford University
Abstract
This paper describes two services that are fundamental to any Data Grid: high-speed transport and replica management. The replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. The Replica Location Service and the Globus Toolkit are briefly covered, and the justification for the Data Management topic and the four main components of the Globus Toolkit are also discussed.
1 Introduction
The term grid computing refers to the emerging computational and networking infrastructure that is designed to provide pervasive, uniform and reliable access to data, computational, and human resources distributed over wide area environments. Grid computing has developed in order to apply the resources of many computers in a network to a single scientific or technological problem at the same time.
Data-intensive, high-performance computing applications in the grid require the efficient management and transfer of information in a wide-area, distributed computing environment using file sizes as large as terabytes or petabytes. Massive data sets must be shared by a large community of hundreds or thousands of researchers who are distributed around the world. These researchers require efficient transfer of large data sets to perform analyses at their local sites or at other remote resources. In many cases, the researchers create local copies or replicas of the experimental data sets to overcome wide-area data transfer latencies.
Grid computing uses software to divide and distribute pieces of a program to as many as several thousand computers.
The Globus Toolkit is an open-source toolkit for building computing grids, developed by the Globus Alliance. It lets people share computing power, databases, and other tools securely online across corporate, institutional, and geographic boundaries without sacrificing local independence. The toolkit includes not only software services but also libraries for resource monitoring, discovery, security, and file management. It is packaged as a set of components that can be used either independently or together to develop applications.
The Globus Toolkit has grown through an open-source strategy that encourages broader, more rapid adoption and leads to greater technical innovation, as the open-source community provides continual enhancements to the product.
In addition to being a central part of science and engineering projects that total nearly a half-billion dollars internationally, the Globus Toolkit is a substrate on which leading IT companies are building significant commercial Grid products. Every organization has different modes of operation. Collaboration between multiple organizations is hindered by incompatibility of resources such as data archives, computers, and networks. The Globus Toolkit was created to remove obstacles that prevent seamless collaboration. Its core services, interfaces and protocols allow users to access remote resources as if they were located within their own machine room while at the same time preserving local control over who can use resources and when.
This paper discusses two fundamental data management components: Reliable File Transfer and the Replica Location Service. It also provides a brief overview of Data Management, GridFTP, and Replica Management.
2 Secure, Reliable Transfer
Data-intensive applications such as scientific applications require two fundamental data management components, upon which higher-level components can be built:
Reliable File Transfer (RFT) is a web service that provides “job scheduler”-like functionality for data movement. The transfer protocol is used in wide area environments. Ideally, this protocol would be universally adopted to provide access to the widest variety of available storage systems.
Replica Location Service (RLS) includes services for registering and locating all physical locations for files and collections.
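The "job scheduler"-like behavior described for RFT can be pictured as a queue of transfer requests that are retried until they succeed or a limit is reached. The following is only a minimal sketch under that assumption, not the RFT interface: `TransferJob`, `run_transfers`, the `copy` callback, and the retry limit are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TransferJob:
    """One queued transfer request (hypothetical structure, not the RFT schema)."""
    source: str
    destination: str
    attempts: int = 0
    status: str = "pending"  # becomes "done" or "failed"


def run_transfers(jobs: List[TransferJob],
                  copy: Callable[[str, str], None],
                  max_attempts: int = 3) -> List[TransferJob]:
    """Run each job, retrying after an OSError until max_attempts is reached.

    `copy` stands in for whatever actually moves the bytes; it is an
    assumption of this sketch and is expected to raise OSError on failure.
    """
    for job in jobs:
        while job.status == "pending":
            job.attempts += 1
            try:
                copy(job.source, job.destination)
                job.status = "done"
            except OSError:
                if job.attempts >= max_attempts:
                    job.status = "failed"
    return jobs
```

A caller would pass a list of jobs and a function that performs one transfer, then inspect each job's `status` and `attempts` afterwards. A real service would also persist the queue so that transfers survive a restart, which this sketch does not attempt.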
Higher-level services that can be built upon these fundamental components include reliable creation of a copy of a large data collection at a new location; selection of the best replica for a data transfer operation based on performance estimates provided by external information services; and automatic creation of new replicas in response to application demands.
3 The Globus Architecture and Data Management
There are four main components of the Globus Toolkit: the Grid Security Infrastructure (GSI), the Globus Resource Management architecture, the Information Management architecture, and the Data Management architecture.
The Grid Security Infrastructure (GSI) provides authentication and authorization services using public key certificates as well as Kerberos authentication.
The Globus Resource Management architecture provides a language for specifying application requirements and mechanisms for immediate and advance reservations of one or more computational components. This architecture also provides several interfaces for submitting jobs to remote machines.
The Globus Information Management architecture provides a distributed scheme for publishing and retrieving information about resources in the wide area environment. A distributed collection of information servers is accessed by higher-level services that perform resource discovery, configuration and scheduling.
The last major component of Globus is the Data Management architecture. The Globus Data Management architecture, or Data Grid, provides two fundamental components: a universal data transfer protocol for grid computing environments called GridFTP and a Replica Management infrastructure for managing multiple copies of shared data sets.
4 GridFTP: A Secure, Efficient Data Transport Mechanism
FTP is a widely implemented and well-understood IETF standard protocol. The FTP protocol was extended because it was observed that FTP is the protocol most commonly used for data transfer on the Internet and the most likely candidate for meeting the Grid’s needs. As a result, there is a large base of code and expertise from which to build. The FTP protocol provides a well-defined architecture for protocol extensions and supports dynamic discovery of the extensions supported by a particular implementation. Numerous groups have added extensions through the IETF, and some of these extensions will be particularly useful in the Grid. In addition to client/server transfers, the FTP protocol also supports transfers directly between two servers, mediated by a third party client (i.e. “third party transfer”).
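In standard FTP, dynamic discovery of extensions is done with the FEAT command (RFC 2389), whose multi-line reply lists one supported feature per line, each indented by a single space. The sketch below parses such a reply into a dictionary. The function name `parse_feat` and the sample reply are illustrative assumptions; which features a given GridFTP server would actually report is not stated here.

```python
from typing import Dict


def parse_feat(reply: str) -> Dict[str, str]:
    """Map each feature name in a FEAT reply to its parameter string.

    Feature lines start with one space; the "211-..." header and the
    "211 End" trailer do not, so they are skipped.
    """
    features: Dict[str, str] = {}
    for line in reply.splitlines():
        if not line.startswith(" "):
            continue
        name, _, params = line.strip().partition(" ")
        if not name:
            continue  # ignore lines that contain only whitespace
        features[name.upper()] = params
    return features


if __name__ == "__main__":
    sample = (
        "211-Extensions supported:\r\n"
        " MLST size*;create;modify*\r\n"
        " SIZE\r\n"
        " REST STREAM\r\n"
        "211 End\r\n"
    )
    print(parse_feat(sample))
```

With Python's standard `ftplib`, the reply text could be obtained from an open connection with `FTP.sendcmd("FEAT")`; a client can then check for a feature name before relying on the corresponding extension.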
GridFTP is a universal grid data transfer and access protocol that provides secure, efficient data movement in Grid environments. This protocol, which extends the standard FTP protocol, provides a superset of the features offered by the various Grid storage systems currently in use. Using GridFTP as a common data access protocol would be mutually advantageous to grid storage providers and users. Storage providers would gain a broader user base, because their data would be available to any client, while storage users would gain access to a broader range of storage systems and data.
The following diagrams illustrate GridFTP. The Control Channel (CC) is the path between client and server that is used to exchange all information needed to establish data channels. The Data Channel (DC) is the network pathway that the files flow over. The Control Channel Interpreter (CCI) is the server-side implementation of the control channel functionality. The Data Protocol Interpreter (DPI) handles the actual transferring of files, and the Client is the client-side implementation of the control channel functionality.
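To make the role of the control channels concrete, the sketch below lists the commands a client would send over two control channels to set up a third party transfer using only the base FTP commands from RFC 959 (PASV, PORT, STOR, RETR). It is a simplified illustration: GridFTP adds security and further extensions that are not shown, and the function names, labels "dest" and "source", and example paths are hypothetical.

```python
import re
from typing import List, Tuple


def parse_pasv(reply: str) -> str:
    """Return the 'h1,h2,h3,h4,p1,p2' address from a 227 PASV reply."""
    match = re.search(r"\((\d+(?:,\d+){5})\)", reply)
    if match is None:
        raise ValueError(f"not a PASV reply: {reply!r}")
    return match.group(1)


def third_party_plan(pasv_reply_from_dest: str,
                     src_path: str,
                     dst_path: str) -> List[Tuple[str, str]]:
    """List (server, command) pairs in the order the client sends them.

    The destination server listens (PASV), the source server is told to
    connect to it (PORT), and the data then flows directly between the
    two servers without passing through the client.
    """
    address = parse_pasv(pasv_reply_from_dest)
    return [
        ("dest", "PASV"),
        ("source", f"PORT {address}"),
        ("dest", f"STOR {dst_path}"),
        ("source", f"RETR {src_path}"),
    ]


if __name__ == "__main__":
    reply = "227 Entering Passive Mode (192,0,2,10,195,80)"
    for server, command in third_party_plan(reply, "/data/run1.dat", "/replica/run1.dat"):
        print(server, command)
```

The point of the sketch is that the client only exchanges short commands on the two control channels, while the file itself travels on a data channel between the servers.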
[Figure 1: Simple Two Party Transfers. Components shown: Client with DPI, server CCI and DPI, Control Channel, Data Channel.]
[Figure 2: Simple Third Party Transfer. Components shown: Client, two servers each with CCI and DPI, two Control Channels, one Data Channel.]
[Figure 3: Striping. Components shown: Client, two CCIs, multiple DPIs, two Control Channels, multiple Data Channels.]
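Figure 3 shows several DPIs and data channels serving a single transfer. One simple way to picture this, assuming striping means dividing a file into contiguous byte ranges with one range per data channel, is sketched below. The function `stripe_ranges` is hypothetical, and real implementations may partition the data differently, for example by interleaving fixed-size blocks.

```python
from typing import List, Tuple


def stripe_ranges(file_size: int, stripes: int) -> List[Tuple[int, int]]:
    """Split file_size bytes into `stripes` contiguous (offset, length) ranges.

    Any remainder is spread over the first ranges, so lengths differ by
    at most one byte.
    """
    if file_size < 0 or stripes < 1:
        raise ValueError("file_size must be >= 0 and stripes must be >= 1")
    base, extra = divmod(file_size, stripes)
    ranges: List[Tuple[int, int]] = []
    offset = 0
    for index in range(stripes):
        length = base + (1 if index < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    print(stripe_ranges(10, 3))
```

Each range could then be handed to a separate data channel, so that the transfer is limited by the combined rather than the individual channel throughput.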
5 Replica Management
This section discusses the Globus Replica Management architecture, which is responsible for managing complete and partial copies of data sets. Replica management is an important issue for a number of scientific applications. As an example, consider a data set that contains petabytes of experimental results for a particle physics application. While the complete data set may exist in one or possibly several physical locations, it is likely that many universities, research laboratories or individual researchers will have insufficient storage to hold a complete copy. Instead, they will store copies of the most relevant portions of the data set on local storage for faster access.
Services provided by a replica management system include:
Creating new copies of a complete or partial data set
Registering these new copies in a Replica Catalog
Allowing users and applications to query the catalog to find all existing copies of a particular file or collection of files
Selecting the "best" replica for access based on storage and network performance predictions provided by a Grid information service
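The last service in the list, selecting the "best" replica from performance predictions, can be sketched as choosing the replica with the lowest estimated transfer time. The cost model below (startup latency plus size divided by predicted bandwidth) and the names `ReplicaEstimate`, `estimated_time`, and `select_best` are assumptions made for illustration; the paper does not specify how predictions are combined.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReplicaEstimate:
    """Predicted performance for one replica (hypothetical fields)."""
    location: str
    bandwidth_mb_s: float  # predicted throughput, must be > 0
    latency_s: float       # predicted startup delay


def estimated_time(replica: ReplicaEstimate, size_mb: float) -> float:
    """Estimated seconds to fetch size_mb megabytes from this replica."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_s


def select_best(replicas: Iterable[ReplicaEstimate], size_mb: float) -> ReplicaEstimate:
    """Return the replica with the smallest estimated transfer time.

    Raises ValueError if `replicas` is empty.
    """
    return min(replicas, key=lambda r: estimated_time(r, size_mb))


if __name__ == "__main__":
    candidates = [
        ReplicaEstimate("site-a", bandwidth_mb_s=10.0, latency_s=1.0),
        ReplicaEstimate("site-b", bandwidth_mb_s=100.0, latency_s=5.0),
    ]
    print(select_best(candidates, size_mb=1000.0).location)
    print(select_best(candidates, size_mb=10.0).location)
```

Note that the best choice depends on the transfer size: a high-bandwidth replica with a long startup delay wins for large files, while a nearby low-latency replica can win for small ones.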
The Globus replica management architecture is a layered architecture. At the lowest level is a Replica Catalog that allows users to register files as logical collections and provides mappings between logical names for files and collections and the storage system locations of one or more replicas of these objects.
A Replica Catalog API in C has also been implemented, as well as a command-line tool; these functions and commands perform low-level manipulation operations for the replica catalog, including creating, deleting and modifying catalog entries.
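The catalog described above is essentially a mapping from logical names to storage locations, with operations to create, delete, and modify entries. The following in-memory sketch illustrates that idea only. It is written in Python rather than C, and the class and method names are hypothetical; they do not reproduce the actual Replica Catalog API.

```python
from typing import Dict, List, Set


class ReplicaCatalog:
    """Toy catalog: logical collections, logical files, replica locations."""

    def __init__(self) -> None:
        self._collections: Dict[str, Set[str]] = {}  # collection -> logical file names
        self._replicas: Dict[str, Set[str]] = {}     # logical file name -> locations

    def create_entry(self, collection: str, logical_name: str, location: str) -> None:
        """Register a replica of logical_name, stored at location."""
        self._collections.setdefault(collection, set()).add(logical_name)
        self._replicas.setdefault(logical_name, set()).add(location)

    def delete_entry(self, logical_name: str, location: str) -> None:
        """Remove one replica location; unknown entries are ignored."""
        self._replicas.get(logical_name, set()).discard(location)

    def modify_entry(self, logical_name: str, old_location: str, new_location: str) -> None:
        """Replace one replica location with another."""
        locations = self._replicas.get(logical_name, set())
        if old_location not in locations:
            raise KeyError(f"{logical_name} has no replica at {old_location}")
        locations.remove(old_location)
        locations.add(new_location)

    def locate(self, logical_name: str) -> List[str]:
        """Return all known locations of a logical file, sorted."""
        return sorted(self._replicas.get(logical_name, set()))

    def files_in(self, collection: str) -> List[str]:
        """Return the logical file names registered in a collection, sorted."""
        return sorted(self._collections.get(collection, set()))
```

A higher-level tool would call `locate` to obtain candidate replicas and then choose among them, which keeps the catalog itself free of any performance logic.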
The basic replica management services that are provided can be used by higher-level tools to select among replicas based on network or storage system performance or to automatically create new replicas at desirable locations.
Some of the higher-level services will be implemented in the next generation of the replica management infrastructure.
6 Summary
This paper has given a brief overview of the Globus Toolkit and has briefly covered the justification for the Data Management topic. The RLS and RFT topics were explained, as well as the four main components of the Globus Toolkit and replica management.