SDSC Technical Report 2003-06 Preservation of Data






Reagan W. Moore

San Diego Supercomputer Center

moore@sdsc.edu

Abstract:

Multiple communities are developing technologies for the management and preservation
of distributed data. Each community is tackling a different set of goals: the Semantic Web
community focuses on semantic-based access, the data grid community on management of
distributed data, the digital library community on publication, and the persistent
archive community on management of technology evolution. Not surprisingly, common
elements are used across all of these environments. The common elements provide ways
to automate data manipulation, data transformation, and data discovery. The common
elements make it possible to implement preservation environments for the long-term
management of massive data collections.


Keywords: Persistent archive, data grid, technology evolution, encoding format, digital
ontology, consistency constraint

I. Introduction


Data preservation appears to pose an insurmountable challenge, namely the long-term
access and display of digital entities while the underlying hardware and software systems
evolve. The technology we will use in the future to display a digital entity will be
different from the technology that was used to create the digital entity. Either the
hardware platforms will be different, or the display software will be different, or even the
encoding format of the digital entity will be different. Within the preservation
community, there is no consensus on an approach to preservation that will be able to
handle the massive amounts of data that are being created. Fortunately, the technologies
that are being developed within the data grid, digital library, and semantic web
communities do provide capabilities that can be used to implement preservation
environments. Each data management community has unique capabilities that can be
applied to the long-term preservation of data. Preservation environments can be, and are
being, built now that illustrate the feasibility of long-term access and display of digital
entities.


Preservation environments inherently rely upon the ability to automate all aspects of data
discovery, access, management, and manipulation. Automation requires the ability to
differentiate between the bits that comprise a digital entity, the encoding format that
defines the structures present within the digital entity, the data dictionary that defines the
semantic labels applied to the structures, the operations that are used to manipulate the
structures, and the software and hardware infrastructure that are used to store and display
the digital entity. The characterization of the digital entity is done using some variant of
a digital ontology that is applied to create an “archival form”. The management of the
hardware and software infrastructure is done using some variant of a data grid that
provides “infrastructure independent” data management. Basically, a characterization is
created for the digital entity that is based upon public, non-proprietary standards, and an
interoperability abstraction is created for interactions with hardware and software
platforms that can be used across all types of storage repositories.


A digital ontology is a characterization of the internal structures and semantic labels that
define what operations can be performed on a digital entity [1]. A simple example is to
consider a scientific data set. The bits can be organized into 8-bit bytes, the bytes can be
organized into 8-byte words, and the words can be organized into arrays, all of which
represent definable structures. Semantic labels can be applied to the arrays to identify the
physical variables that are being stored. The arrays can be mapped onto a coordinate
system for display and manipulation. The digital ontology specifies the named structures
within the digital entity and the relationships between the structures. The relationships
might be temporal in nature (successive arrays denote the same physical variable
recorded at different times), spatial in nature (successive arrays denote different physical
variables that are represented in the same coordinate system), semantic in nature (logical
mappings between the physical variable names), and functional in nature (the physical
variables may obey properties such as mass, energy, and momentum conservation).


The digital ontology defines the semantic terms and the relationships between the terms
that comprise the context needed to correctly interpret the digital entity.
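
The scientific data set example can be made concrete with a short sketch. The ontology
layout below (a Python dictionary), the variable names "temperature" and "pressure", and
the little-endian double encoding are all assumptions chosen for illustration; they are not
taken from any system described in this report. The sketch only shows how a separate
description of structures and labels lets a program interpret otherwise opaque bits.

```python
import struct

# Hypothetical digital ontology: layout, labels, and relationships are
# illustrative assumptions, not a format defined in the report.
ontology = {
    "word": {"bytes": 8, "encoding": "<d"},  # 8-byte little-endian IEEE 754 double
    "arrays": [
        {"label": "temperature", "offset_words": 0, "length": 3},
        {"label": "pressure", "offset_words": 3, "length": 3},
    ],
    "relationships": [
        # spatial relationship: both variables share one coordinate system
        ("temperature", "same_coordinate_system_as", "pressure"),
    ],
}


def interpret(bits: bytes, ontology: dict) -> dict:
    """Apply the ontology to raw bytes and return arrays keyed by semantic label."""
    word_size = ontology["word"]["bytes"]
    fmt = ontology["word"]["encoding"]
    result = {}
    for spec in ontology["arrays"]:
        start = spec["offset_words"] * word_size
        end = start + spec["length"] * word_size
        result[spec["label"]] = [v[0] for v in struct.iter_unpack(fmt, bits[start:end])]
    return result


# The "digital entity": six 8-byte words with no self-describing structure.
raw_bits = struct.pack("<6d", 280.0, 281.5, 283.0, 1013.0, 1012.5, 1011.0)
labelled = interpret(raw_bits, ontology)
```

In this sketch the bits never change; only the ontology would need to be rewritten if the
description had to move to a different encoding standard.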


A similar characterization can be done for documents created by office publication tools.
In this case, the bits are organized into 8-bit bytes, and the bytes are organized into
variable length words separated by white spaces. Words are organized into sentences
separated by punctuation marks, and sentences are organized into paragraphs. Semantic
labels are applied through headings. Specific types of operations are implied by
annotations such as time-track changes. The digital ontology provides a way to
characterize the information content (semantic labels) and knowledge content (structural
and functional relationships) present within the digital entity. A digital entity combined
with the digital ontology constitutes the archival form on which long-term preservation
can be based.


Data grids provide the abstraction mechanisms needed to deal with heterogeneous
hardware and software systems [2]. A typical problem is migrating data from a legacy
archival storage system to a new, more cost effective storage solution. In the migration
process, the physical file name may change, the access protocol may change, and the site
where the file is stored may change. For a viable preservation system, all of these
changes should be transparent. From the perspective of the preservation environment, the
naming convention for digital entities, and the access and manipulation mechanisms
should be invariant over time. Data grids provide a logical name space that is location
independent, a storage repository abstraction (standard set of manipulation operations)
for interacting with storage systems, and an access abstraction (standard set of access
operations) for retrieving data. The logical name space is implemented as a collection,
with attributes used to specify the mapping from the logical name to the physical file
name. The standard manipulation operations are typically implemented as Unix file
system operations for open, close, read, and write of files. The standard access
operations are a combination of metadata and data manipulation operations. Together
these abstractions provide the ability to interface between arbitrary storage systems and a
wide variety of application programming interfaces.


An example of a logical name space is shown in Figure 1. This is a snapshot of the
mySRB web interface to a collection managed by the Storage Resource Broker data grid
[2]. The particular collection is a set of photographs taken of the Yosemite Valley around
1880. The logical name is used to organize the photographs and to support
administrative and descriptive metadata about each photograph.




Figure 1. Logical name space used to register a photograph collection


Each photograph is registered into the collection using a name (Data Name column) that
will be held invariant as the photograph is moved between storage locations. The first
three records (I0024586A.jpg) show one photograph that has been replicated onto three
storage locations listed in the Resource column (test-unix, Unix file system; ora-sdsc,
binary large object in an Oracle database; and hpss-sdsc, file in an archive). All replicas
are accessible through the same access interface. Administrative metadata is listed that
describes the storage location, when the photograph was stored, the size of the
photograph, and the image type. Migration corresponds to making a replica on a new
storage system, and deleting the old copy.
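
The relationship between the invariant logical name and its physical replicas can be
sketched as follows. This is a minimal catalog-level illustration, not the Storage Resource
Broker interface: the class names, the physical paths, and the "new-disk" resource are
hypothetical, and no bytes are actually moved. The logical name and the resource names
test-unix and hpss-sdsc are the ones shown in Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # name of the storage system holding this copy
    physical_path: str  # location inside that storage system (hypothetical)


@dataclass
class LogicalNamespace:
    """Maps invariant logical names to one or more physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path):
        self.entries.setdefault(logical_name, []).append(Replica(resource, physical_path))

    def replicas(self, logical_name):
        return list(self.entries[logical_name])

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        """Migration: record a replica on the new system, then drop the old copy."""
        self.register(logical_name, new_resource, new_path)
        self.entries[logical_name] = [
            r for r in self.entries[logical_name] if r.resource != old_resource
        ]


ns = LogicalNamespace()
ns.register("I0024586A.jpg", "test-unix", "/data/photos/I0024586A.jpg")
ns.register("I0024586A.jpg", "hpss-sdsc", "/archive/yosemite/0001")
ns.migrate("I0024586A.jpg", "test-unix", "new-disk", "/pool/I0024586A.jpg")
resources = sorted(r.resource for r in ns.replicas("I0024586A.jpg"))
```

After the call to migrate, the photograph is still addressed as I0024586A.jpg; only the
attributes that map the logical name to physical locations have changed.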



The data grid provides the interoperability mechanisms needed to manage technology
evolution. Data grids were originally designed to manage access to data distributed
across multiple sites. For a preservation environment, the migration from old technology
to new technology is a similar task. At the point in time when the migration occurs, both
the old and new technology are present. The management of files distributed across
storage systems is equivalent to the migration of files across storage systems over time.
Data grids inherently provide the abstraction mechanisms needed for the management of
evolving technology.


II. Preservation Context


A digital entity without a defining context is useless. The ability to interpret the digital
entity, verify its authenticity, and even discover its location depends upon the
management of associated metadata. Traditionally, the context is managed as attributes
stored in a database, and organized as a collection. A preservation environment migrates
the collection context forward into the future. In essence, a preservation environment
provides a way to characterize and support a collection while the underlying hardware
and software systems evolve.


The preservation community has defined standard preservation terms as part of an Open
Archival Information System (OAIS) [3]. A Submission Information Package (SIP) is
used to transmit data to the persistent archive, an Archival Information Package (AIP) is
used to store the archival form of the data, and a Dissemination Information Package (DIP)
is used to send archived data to a user. The processing steps that are required to convert
from the SIP to the AIP, and then to the DIP, constitute the archival processes associated
with preservation. In the paper (hardcopy) world the archival processes correspond to:




- Appraisal, the decision for which digital entities to keep
- Accession, the process used to accept the SIPs and verify their transmittal
- Description, the process used to define the context associated with each digital entity
- Arrangement, the process used to define how the digital entities will be organized
- Preservation, the process used to create the archival form
- Access, the process used to discover and retrieve a digital entity


Each of these processes can be implemented for digital entities [4]. The challenge is
whether each of these processes can be automated. The data grid, digital library, and
persistent archive communities are aggressively developing data and information
management technology that can be used to build prototype persistent archives. Each
community is focusing on an aspect of the problem, or equivalently a subset of the
archival processes required for preservation. In particular, multiple communities are
investigating the abstraction mechanisms and standards needed to manage technology
evolution, from characterizations of digital entity structure and semantics, to
characterizations of standard operations on storage repositories and standard access
mechanisms.
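
The flow from SIP to AIP to DIP described in this section can be sketched as a small
pipeline. This is an illustration under stated assumptions, not an OAIS implementation:
the field names, the use of a SHA-256 checksum to verify the transmittal, and the
descriptive attributes are all hypothetical choices made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SIP:
    name: str
    content: bytes
    declared_sha256: str  # checksum supplied by the submitter (assumed convention)


@dataclass
class AIP:
    name: str
    content: bytes
    metadata: dict


def accession(sip: SIP) -> None:
    """Accession: verify the transmittal by comparing content with its checksum."""
    if hashlib.sha256(sip.content).hexdigest() != sip.declared_sha256:
        raise ValueError(f"checksum mismatch for {sip.name}")


def preserve(sip: SIP, description: dict) -> AIP:
    """Preservation: create the archival form, the original bits plus their context."""
    accession(sip)
    metadata = {"sha256": sip.declared_sha256, "size": len(sip.content), **description}
    return AIP(sip.name, sip.content, metadata)


def disseminate(aip: AIP) -> dict:
    """Access: build a DIP holding the content and the context needed to interpret it."""
    return {"name": aip.name, "content": aip.content, "context": dict(aip.metadata)}


payload = b"Yosemite Valley, circa 1880"
sip = SIP("photo-note.txt", payload, hashlib.sha256(payload).hexdigest())
aip = preserve(sip, {"creator": "unknown", "format": "text/plain"})
dip = disseminate(aip)
```

Each function corresponds to one archival process, which is the sense in which the
processes can be automated: a submission that fails the accession check never reaches
the archival form.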


III. Data Grid Community


The Global Grid Forum has promoted the formation of research groups for the explicit
purpose of assessing best community practices and summarizing experimental results
related to the management of distributed resources. The Global Grid Forum meets three
times a year to promote interactions between grid researchers and implementers. The last
meeting was held in Seattle, on June 25 to June 27, 2003.


Five groups from the Global Grid Forum are of particular interest for the creation of
preservation environments [5]:




- Persistent Archive Research Group
- Data Transport Research Group
- Data Access and Integration Services Working Group
- Data Format Description Language Research Group
- Grid Protocol Architecture Working Group


The Persistent Archive Research Group has specified the basic components of data grid
technology that can be used to implement a persistent archive [4]. The group has written
a specification of the capabilities that are needed to automate archival processes, and
demonstrated how these capabilities can be implemented using technology provided in
existing data grids. In one sense, this specification is a very aggressive assessment of the
systems needed to implement a persistent archive prototype. On the other hand, the
specification is based upon data grids that are being used in production, and thus
represents real software systems that can be used today.


The principal concepts described in the persistent archive specification are a set of
abstractions for dealing with specific components of data management systems:




- Use of a logical name space to create global, persistent identifiers
- Use of a storage repository abstraction to manage migration of digital entities onto
  new storage systems as they become available
- Use of an information repository abstraction to manage the migration of
  provenance, descriptive, and authenticity metadata to new databases as they
  become available
- Use of an access abstraction to simplify the development of new access
  mechanisms over time


The Grid Community is investigating the standardization of the storage repository
abstraction in the Data Transport Research Group. This is effectively the set of
operations that should be supported by any storage repository. An initial assessment
showed that only four operations are currently supported by all data grids, namely open,
close, read, and write. Of the grids examined, there were a total of 46 different
operations that had been developed for interacting with storage repositories. Over the
next year, the Global Grid Forum will propose a minimal set of these operations that are
needed to support representative applications. The emergence of object storage systems
will make this task even more important. Object storage systems support the
manipulation of data as files, rather than as sectors / blocks on disk. The operations that
can be performed on the files include format validation (archival accession process),
metadata extraction (archival description process), and format transformation (archival
preservation process).
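
A storage repository abstraction restricted to the four common operations can be sketched
as an abstract interface with interchangeable drivers. The class and method signatures
below are assumptions made for illustration; they are not the interface of any existing
data grid. The in-memory driver stands in for a real file system, database, or archive.

```python
from abc import ABC, abstractmethod


class StorageRepository(ABC):
    """The four operations reported as common to all surveyed data grids."""

    @abstractmethod
    def open(self, path: str, mode: str) -> int: ...

    @abstractmethod
    def read(self, handle: int, size: int = -1) -> bytes: ...

    @abstractmethod
    def write(self, handle: int, data: bytes) -> int: ...

    @abstractmethod
    def close(self, handle: int) -> None: ...


class InMemoryRepository(StorageRepository):
    """Toy driver used only to exercise the interface."""

    def __init__(self):
        self._files = {}    # path -> bytearray holding the file contents
        self._handles = {}  # handle -> [path, current position]
        self._next = 0

    def open(self, path, mode):
        if mode == "w":
            self._files[path] = bytearray()
        elif path not in self._files:
            raise FileNotFoundError(path)
        self._next += 1
        self._handles[self._next] = [path, 0]
        return self._next

    def read(self, handle, size=-1):
        path, pos = self._handles[handle]
        data = self._files[path]
        end = len(data) if size < 0 else min(len(data), pos + size)
        self._handles[handle][1] = end
        return bytes(data[pos:end])

    def write(self, handle, data):
        path, pos = self._handles[handle]
        self._files[path][pos:pos + len(data)] = data
        self._handles[handle][1] = pos + len(data)
        return len(data)

    def close(self, handle):
        del self._handles[handle]


def copy(src, src_path, dst, dst_path):
    """Copy between any two drivers using only the shared operations."""
    r = src.open(src_path, "r")
    w = dst.open(dst_path, "w")
    dst.write(w, src.read(r))
    src.close(r)
    dst.close(w)


old_store, new_store = InMemoryRepository(), InMemoryRepository()
h = old_store.open("/x", "w")
old_store.write(h, b"bits")
old_store.close(h)
copy(old_store, "/x", new_store, "/y")
h = new_store.open("/y", "r")
copied = new_store.read(h)
new_store.close(h)
```

Because copy depends only on the abstract operations, replacing either driver with one for
a new storage technology would not require changes to the code that moves the data.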


A similar effort is being pursued in the Data Access and Integration Services Working
Group, this time from the perspective of defining the set of operations that should be
developed for interacting with database technology. The operations are being defined
through the Open Grid Services Architecture (OGSA), and an implementation is being done
in the United Kingdom data grid. An extension to this effort will characterize the
operations required to support manipulation of collections within databases as well as
manipulation of records in collections. The expectation is that the Global Grid Forum
will provide a set of standard services for interacting with databases.


The development of standard operations for manipulating data in a storage repository or
metadata in a database only makes sense if the operations can be mapped onto the
structures present within digital entities. Towards this end, the Data Format Description
Language Research Group (DFDL) is investigating digital ontologies. A digital entity
can be characterized by a digital ontology, which is written using a standard relationship
management syntax such as the Resource Description Framework (RDF). Over time, the
digital ontology is migrated onto new relationship encoding standards, ensuring the
ability to continue to represent the structures present within the digital entity. In this
view, the digital entity is preserved as the original set of bits, and the digital ontology is
migrated to new encoding formats.


The DFDL is approaching the problem of digital ontologies through the creation of the
following characterizations of digital entities:



- Language for specifying the structures within a digital entity
- Language for mapping semantic labels onto the structures
- Language for mapping allowed operations onto the structures


A second mapping is needed from the relationships present within the digital entity, to
the standard operations that are supported by display and analysis applications. The
second mapping can be characterized as a display ontology. The set of operations that is
desired can be defined in terms of present display applications. This is an abstraction of
the concept of emulation. Instead of migrating forward in time the software that was
used to create a digital entity, the set of operations that support the manipulation of the
digital entity can be defined and the characterization migrated forward in time. Thus
abstractions can be created for both digital entity structures, and the operations that are
performed on digital entities. Both abstractions can be updated to conform to future
technology, while preserving the bits of the original digital entity. The Data Format
Description Language research group is planning to investigate the ability to characterize
digital entities, and to produce a simple version of such a characterization for scientific
data. The characterization of operations performed upon digital entities is a more
aggressive research question that is being pursued in the knowledge management community.


Finally, the Grid Protocol Architecture working group is now characterizing the
consistency constraints that are needed to assemble a working grid from differentiated
services and distributed state information registries. Each grid service uses a logical
name space to support global identifiers. A grid service creates distributed state
information to describe the results of operations at remote sites. The distributed state
information is mapped onto the logical name space and managed in a registry. In order to
ensure that the distributed state information remains consistent with the digital entities
that are registered into the logical name space, update constraints are imposed on the grid
service. A typical constraint is that the grid service is not marked as complete until the
associated distributed state information has been put into the appropriate registry.
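
The typical constraint described above, that a service is not marked complete until its
state information has been registered, can be sketched as an ordering rule inside the
service. The Registry and GridService classes are hypothetical and exist only for this
illustration; the failing registry simulates a registry that cannot be reached.

```python
class Registry:
    """Holds distributed state information keyed by logical name."""

    def __init__(self):
        self.state = {}

    def record(self, logical_name, info):
        self.state[logical_name] = info


class FailingRegistry(Registry):
    """Simulates an unreachable registry."""

    def record(self, logical_name, info):
        raise ConnectionError("registry unreachable")


class GridService:
    def __init__(self, registry):
        self.registry = registry
        self.completed = set()

    def run(self, logical_name, operation):
        """Run an operation; mark it complete only after its state is registered."""
        info = operation()                        # work performed at the remote site
        self.registry.record(logical_name, info)  # if this raises, the next line is skipped
        self.completed.add(logical_name)
        return info


ok = GridService(Registry())
ok.run("photo-1", lambda: {"replica": "hpss-sdsc"})

broken = GridService(FailingRegistry())
try:
    broken.run("photo-2", lambda: {"replica": "hpss-sdsc"})
except ConnectionError:
    pass  # the operation ran, but it is never reported as complete
```

The constraint is enforced purely by the order of the statements in run: completion is
recorded last, so a failed registry update leaves the operation visibly unfinished.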


Corresponding consistency constraints are needed when implementing archival
processes. Preservation metadata is generated by each archival process (whether
arrangement information, descriptive metadata, administrative metadata, or authenticity
metadata) and is mapped onto a logical name space that provides global persistent
identifiers for the archival objects. Consistency constraints ensure that when the archival
process completes, the associated metadata has been correctly updated. Hence the
technologies being specified in the grid forum are directly applicable to the development
of a persistent archive prototype.

IV. Digital Library Community



The ability to support discovery and access of material organized as a collection is being
actively developed by the digital library community. Three areas are directly applicable:
the development of metadata standards for compound objects [6], the development of
metadata interchange standards for retrieving information [7], and the development of
web crawling technology for the preservation of intellectual capital published on the web
[8].


The Metadata Encoding and Transmission Standard [6] for compound documents defines
a level of organization needed between the digital ontology that describes a digital entity,
and the file that is manipulated by grid technology. Each compound document has a
structure that can be defined, organized, and maintained independently of the separate
components. Thus the METS standard can be viewed as a variant of a digital ontology
that operates at the object level. The METS schema provides support for descriptive
metadata, administrative metadata, structural metadata, and behavioral metadata.









The Open Archives Initiative metadata harvesting standard is being used to support
publication of metadata from independent collections into a central repository. OAI
provides a mechanism for the retrieval of attributes for manipulation by programs. A set
of complementary services for database access is being defined in the DAIS Grid
Working Group. Again the integration of digital library technology and grid technology
is being sought to create a standard that can be used within persistent archives.


Web crawling technology is essential for retrieving digital entities from the web. An
example application is the preservation of Presidential Web sites, but the technology can
also be used to preserve the web sites of any institution. A typical crawl of four major
US governmental agencies results in the retrieval of 17 million digital entities, which
must then be arranged, described, and preserved. An example of an archival process
(arrangement) is the characterization of the logical structure that defines the links
between the multiple digital entities within a web site.
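
The arrangement step for a crawled web site, characterizing the links between its digital
entities, can be sketched with the Python standard library. This is not the WebBase
crawler: the two pages, their URLs, and the choice to keep only links that stay inside
the harvested set are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute targets of <a href="..."> tags in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def link_graph(pages):
    """pages maps URL to HTML text; returns URL -> sorted list of in-site link targets."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = sorted({target for target in parser.links if target in pages})
    return graph


# Hypothetical harvested pages standing in for the output of a crawl.
pages = {
    "http://example.org/": '<a href="about.html">About</a> <a href="http://other.example/">x</a>',
    "http://example.org/about.html": '<a href="/">Home</a>',
}
graph = link_graph(pages)
```

The resulting graph is one possible form of arrangement metadata: it records the logical
structure of the site independently of where the harvested pages are stored.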

V. Persistent Archive Community


Multiple communities are starting to create persistent archives of their material for
long-term preservation. Two examples are the NSF National Science Digital Library (NSDL)
and the California Digital Library (CDL). The NSDL [9] is using a combination of the
OAI harvesting protocol, the web-crawling technology, and the Storage Resource Broker
data grid [2], to implement a persistent archive of material published on the web. The
challenge is that the lifetime of material published on the web is about six months. For
an educator to be able to use material for several years, an archived version is needed that
can be maintained independently of the web site. In effect, the NSDL persistent archive
is a series of time snapshots of the material registered into the NSDL repository.


The CDL project is very similar, with the source of the material being academic and
federal web sites. Again the intent is to support the arrangement, description, and
preservation of material harvested from the web. The processes that accomplish these
tasks are the archival processes. The technologies that are being employed are the
Storage Resource Broker data grid and the Stanford WebBase web crawler. These
projects illustrate that academic groups are going ahead with the implementation of
archives, and in effect are developing their own standards for the long-term preservation
of material.


A prototype persistent archive has been implemented at the National Archives and
Records Administration [10]. The system is built on the Storage Resource Broker data
grid technology to manage heterogeneous databases and storage repositories, and to
support the automation of archival processes. By registering digital entities into a
collection, it becomes possible to manage authenticity attributes (audit trails, digital
signatures), administrative attributes (location, aggregation in containers), and
provenance attributes (creator, submission process).









VI. Future Research Activities


The challenges that drive the development of preservation environments can be
characterized as the management of ontologies. An active area of research is the
development of constraint-based collection management systems to support the
federation of name spaces across multiple collections through the application of
dynamically defined relationships. This requires the integration of ontology management
with information management. The relationships between the collections for access and
update control can be organized as an ontology.


Digital ontologies are also being applied as a combined migration/emulation approach to
preservation. The tools that interpret digital entities for presentation and manipulation
can be based upon the same constraint-based collection management software used to
integrate collections. A digital ontology represents the structural, semantic, spatial, and
temporal relationships inherent within a digital entity. A presentation tool applies the
relationships in the order defined by the digital ontology to interpret the digital entity.
Over time, the digital ontology is migrated to new relationship encoding formats. The
presentation tool is characterized by the set of operations that will be performed on the
relationships organized within the digital ontology. The set of operations is then
mapped onto future applications (emulated). It is also possible to extend the set of
operations to include new capabilities that are present in future applications.


Data grids implement the equivalent of a digital ontology for collections. The SRB
implements a logical name space that is used to define global, persistent identifiers that
are location independent. Archival services create archival state information that is
mapped onto the logical name space. The consistency of the archival state information is
managed by imposing constraints on the mapping. Each of the archival processes creates
a different set of information that is managed for each collection. In effect, archival
processes map provenance, administrative, descriptive, and authenticity attributes onto
the logical name space.


From this perspective, there is a unifying approach based upon the management and
organization of relationships. When we demonstrate the automation of archival
processes, we are demonstrating the ability to manage the mapping of the archival
process metadata onto the logical name space. When we develop constraint-based
collection management systems, we demonstrate that the mappings can be managed
consistently, through ontologies that organize the relationships that are needed to impose
consistent update of the information. The ability to manage the relationships makes it
possible to reapply archival processes, guaranteeing that not only the result of the
archival processes but also a description of their application can be preserved. Finally,
the application of ontologies to digital entities to organize the relationships needed to
interpret their structure and meaning makes it possible to guarantee the ability to
manipulate digital entities in the future. A common vision has emerged for the
preservation of all components of a persistent archive. A persistent archive can be
implemented using generic software systems that manage collections of structures within
a digital entity, collections of digital entities, sets of operations imposed by an archival
process, and the archival processes themselves.


VII. Acknowledgements


The ideas presented here were developed by members of the Data and Knowledge
Systems group at the San Diego Supercomputer Center: Michael Wan, Arcot Rajasekar,
Bertram Ludäscher, Amarnath Gupta, Richard Marciano, Ilya Zaslavsky, and Chaitan
Baru. This research was supported by NSF NPACI ACI-9619020 (NARA supplement),
NSF NSDL/UCAR Subaward S02-36645, NSF I2T EIA9983510, DOE SciDAC/SDM
DE-FC02-01ER25486, and NIH BIRN-CC3~P41~RR08605-08S1. The views and
conclusions contained in this document are those of the authors and should not be
interpreted as representing the official policies, either expressed or implied, of the
National Science Foundation, the National Archives and Records Administration, or the
U.S. government.

VIII. References


1. R. Moore, “The San Diego Project: Persistent Objects”, Proceedings of the
   Workshop on XML as a Preservation Language, Urbino, Italy, October 2002.

2. A. Rajasekar, M. Wan, R. Moore, “mySRB and SRB, Components of a Data Grid”,
   11th High Performance Distributed Computing conference, Edinburgh, Scotland, July
   2002.

3. OAIS - Reference Model for an Open Archival Information System (OAIS),
   submitted as ISO draft, http://www.ccsds.org/documents/pdf/CCSDS-650.0-R-1.pdf,
   1999.

4. R. Moore, A. Merzky, “Persistent Archive Basic Components”, Persistent Archive
   Research Group, Global Grid Forum, July 27, 2002.

5. Global Grid Forum, URL - http://www.gridforum.org/

6. Metadata Encoding and Transmission Standard - METS, URL -
   http://www.loc.gov/standards/mets/

7. Open Archives Initiative - OAI, URL - http://www.openarchives.org/

8. WebBase, parallel web crawling technology, URL -
   http://www-diglib.stanford.edu/~testbed/doc2/WebBase/

9. National Science Digital Library - NSDL, URL -
   http://nsdl.org/render.userLayoutRootNode.uP

10. National Archives and Records Administration supplement to the NSF National
    Partnership for Advanced Computational Infrastructure, URL -
    http://www.sdsc.edu/NARA/