Archival Asset Package Design Concept for an OAIS System



Quyen L. Nguyen
National Archives and Records Administration
8601 Adelphi Rd
College Park, MD 20740, USA
quyen.nguyen@nara.gov

Dyung Le
National Archives and Records Administration
8601 Adelphi Rd
College Park, MD 20740, USA
dyung.le@nara.gov



ABSTRACT

The core functions of an Open Archival Information System (OAIS) are to store, preserve, and provide access to digital records over the long term. Packaging digital records for preservation is the basic building block of an OAIS, and it is crucial for exchanging archival records between preservation systems. The scope of the challenges that a large OAIS must address is a quantifiable consequence of two factors. The first is system scalability, driven by the exponential growth of digital records as information and web technologies are increasingly adopted. The second is the variety of record types, in terms of both format and content category. Within this context, we propose the design concept of the Archival Asset Package (AAP), which serves as a generic object model for transferring records between the components of an OAIS and between peer systems. The key characteristics of the AAP are extensibility and uniformity. Furthermore, we show how the AAP design is supported by a scalable and evolvable Content Server Architecture.

Keywords

OAIS; electronic records; archival asset package; AIP.


1. INTRODUCTION

Any digital archiving system has the core functions of storing, preserving, and providing access to digital records for the long term. The first core element of such a system is an object model that fulfills these requirements. Moreover, this object model is crucial for exchanging archival records between preservation systems and even among the subsystems of an OAIS, which comprises Ingest, Data Management, Archival Storage, Preservation Planning, and Access. Figure 1 depicts the typical subsystems of an OAIS in the reference model [1]. The scope of the challenges that an OAIS must address is a quantifiable consequence of two factors. The first is system scalability, driven by the exponential growth of digital records as information and web technologies are increasingly adopted. The second is the variety of record types, in terms of both format and content category.

The OAIS Information Model [1] defines three kinds of Information Packages (IP): the Submission Information Package (SIP), used to ingest records into the system; the Archival Information Package (AIP), for storing self-describing packages in long-term storage; and the Dissemination Information Package (DIP), which is sent out to the Access client. In order to fulfill its core functions, a large digital archiving system has to cope with three main data challenges:



• Data Scalability, due to the exponential growth of digital records created by software applications in enterprises and organizations. These applications are becoming more and more complex as they follow the trend from monolithic to distributed architectures involving network messaging, web services, mashups, etc.

• Data Heterogeneity. There are two aspects of heterogeneity. One is the variety of format types proprietary to particular software applications. The other concerns the type of content, such as memos, research papers, and documentary films.

• Data Evolvability. Software applications that produce data also evolve. Consequently, the formats of the data they generate also change. A document in Microsoft Word 2003 has a different format than one written in Microsoft Word 2007. What is notable in this example is that the software update cycle spans only four years, compared to the long-term preservation that a digital archiving system has to support.

Thus, the design of any element of the digital archiving system, especially the object model for packaging digital records, should meet the system qualities of data scalability, heterogeneity, and evolvability. The contribution of this paper is the design concept of the Archival Asset Package (AAP), which offers a flexible, unified, and standardized data model for transmitting digital records within a digital archiving system. Moreover, we believe that the model can serve as a standardized interface to external systems for the ingest and access processes.


Figure 1. OAIS Reference Model.


The paper is organized as follows. Section 2 states the inherent and unique challenges that motivate the design of the AAP concept. Before describing and discussing the AAP in Section 5, we present the Content Server Architecture (CSA) in Section 3 and introduce the concept of the Asset Catalog Entry (ACE), which embodies the metadata of a digital record, in Section 4. Section 6 discusses related work performed by other researchers and organizations. Finally, the summary and conclusion are in Section 7.

2. DESIGN CHALLENGES AND GOALS

The main challenge of a large, long-term OAIS is to design a system that can evolve and adapt to future data, business rules, and technologies. In some sense, the system architect has to lay down structures and patterns while many specifications are not yet clearly and concretely defined. The challenges that an OAIS has to cope with can be grouped into four categories:

Evolving archival concepts. The notion of a digital record has been defined by InterPARES as “Document(s) produced or received by a person or organisation in the course of business, and retained by that person or organisation. A record may incorporate one or several documents (e.g. when one document has attachments), and may be on any medium in any format.” [9][10]. However, although it has been agreed that the record is a building block for digital archiving, the notion of a record is still debated in the archival community. According to the definition above, a record can consist of a single file, as in the case of a simple Microsoft Word document without embedded objects. A more complex record may comprise multiple files and folders. One well-known example is a web page with an HTML file and a folder containing the image files referenced in the HTML file. At the other end of the spectrum is a personnel record stored in a database. In this case, a to-be-archived file exported from the database will contain multiple records. Emails containerized in a large file by an email server also fall into this category. Another critical concept is the authenticity of a digital record. There is no definitive set of characteristics attributed to an authentic digital record that a system developer can encode in a rule engine for the digital preservation module.

Unknown data specifications. At design and implementation time, we do not know the specifications of the data that will be ingested and preserved in the system. First, the media for transferring data range from tapes, DVDs, and SAN storage to electronic transfer via FTP and HTTP upload. More important is the data format of the digital records, which greatly impacts the ingest process. For instance, some extraction and transformation may be desired, as in the case of individual emails containerized in a .pst file. File identification is also a challenge due to the lack of a comprehensive tool and algorithm that can identify all the formats and their derivatives that currently exist, let alone future formats.

Undetermined process. Both business and system-related processes are still evolving. The system has to deal with business rules that are new or changed due to current federal regulations. These rules will impact the ingest process and the level of digital preservation. Various possible preservation strategies, including emulation, migration, persistent format transformation (PFT), and the Universal Virtual Computer (UVC), were discussed in [11]. Migration and PFT are similar in that they use tools to transform digital records from one format to another. Emulation and UVC would require an elaborate extraction of technical attributes and specifications from the digital records. However, no preservation strategy is predominant or has been declared complete for all cases. Given this uncertainty, the system should be able to accommodate any preservation strategy or combination of strategies.

Evolving technology. Computing technologies and paradigms supporting an OAIS, such as preservation software tools, application servers, search engines, and storage, are changing at a rapid pace. For instance, instead of purchasing and maintaining storage servers in house, the system may archive digital records in a storage cloud after performing all the necessary ingest and preservation processing. Thus, the system should be able to evolve in order to take advantage of new technologies. Advances in discovery and access software products can be leveraged to provide enhanced search capability. While search is currently focused on textual documents or on the metadata of multimedia objects, the capability to search video or still images based on a multimedia clip or figure pattern is coming. With the growing popularity of Web 2.0, features such as tagging, discussion forums, and interest groups around a set of digital records are desired for the access component of an OAIS.

Within this context, the following considerations are the driving factors of the AAP design:

a) Commonality - A common object model is established for data exchange between the Content Server and other subsystems. The motivation is to encapsulate the object and its metadata together, and to leverage an object-oriented paradigm for reusability.

b) Evolvability - The standard metadata structure of the AAP allows for evolutionary changes to any of the subsystems or services autonomously of other subsystems or services.

c) Extensibility - New record types, data types, services, and full subsystems can be added to the system without perturbing unaffected subsystems.

Since we are dealing with an archive system, which presumably must not alter the original records, updates will occur in the metadata to record additional properties and lifecycle events such as virus scanning, preservation format creation, and storage technology migration. Therefore, the design should accommodate efficient updates to the metadata while providing a simple interface for placing an AAP in an archive and retrieving it. Before presenting the AAP design, we describe the supporting architecture and metadata framework.

3. CONTENT SERVER ARCHITECTURE

The Content Server Architecture (CSA) has been designed to manage digital records and their associated metadata in an efficient and flexible way. CSA has two major architectural components: the Global Federator and the Content Server. There is one Global Federator, but one or many Content Servers. Note that the singularity of the Global Federator has only a logical meaning with regard to its relationship with the Content Server instances. In reality, the Global Federator can be implemented by a cluster of nodes to avoid a single point of failure.

3.1. Federator

The Global Federator functions as a router among multiple Content Servers for inserting data into or retrieving data from them, exposing a common and unified interface to external clients. On the ingest side, the Global Federator forwards insert requests to the appropriate Content Server based on a policy dictated by business rules or system engineering criteria. The Global Federator applies similar logic to a retrieval request, which is directed to the correct Content Server. For the Global Federator to perform its functions, all Content Server instances must provide services via a uniform interface. The primary object crossing the boundary between the Global Federator interface and the Content Server is the AAP. The UML diagram in Figure 2 illustrates an example with one Global Federator and two Content Servers. The flow shows that the “Insert AAP” request is redirected to either Content Server 1 (CS1) or Content Server 2 (CS2), depending on the result of the resolution. Such resolution can be based on a policy preloaded into the Global Federator.



Figure 2. Global Federator: Processing an Insert Request.


By exposing a simple and common programming interface for applications and services to input and access digital records, the Global Federator satisfies the evolvability characteristic. Indeed, new Content Servers can be added to accommodate the volume growth of digital records. Moreover, a new Content Server may be needed when a new type of digital record appears that requires special handling and processing. Another use case for a new Content Server is the need to segregate a body of records from the rest to satisfy some business rule or regulation. These additions, as well as removals, will be transparent to the clients of the Global Federator.

A key component of the Global Federator is the policy that determines how the digital records are partitioned among Content Servers. The policy may be set using criteria such as semantic record groups, business domain, or level of service. Content Server partitioning may also be influenced by the levels of service performed on the data that the server contains. An example of differentiated levels of service is to provide full content search for a body of high-value records stored on fast-access disk, versus coarse-grained search of group-level descriptions for a low-demand set of records stored on tape. Policy configuration within the architecture is controlled by the Federator and can be influenced by the following factors:

a) Business-driven rule. For example, digital records containing privacy data have to be segregated from others and put under tight security access controls.

b) Data-driven rule. Due to the characteristics of some digital records, the system needs specialized Content Servers. For instance, we can have one Content Server consisting of geospatial application record data, which present special search capabilities, preservation challenges, and unique requirements for visualization, all necessitating a suite of specific services unlike those related to other kinds of electronic records. Another Content Server instance may be dedicated to providing services performed on Computer Aided Design (CAD) and Computer Aided Manufacturing (CAM) application data.

c) System-driven rule. Over time, there may exist a COTS product that can provide a turnkey solution to implement a Content Server. In this case, it can be introduced into CSA and co-exist with other Content Server instances, provided that its interface with the Global Federator is maintained.



Table 1 below lists the steps performed by the Global Federator to route an AAP to the correct Content Server (CS). Step 2 is a simple table lookup to find the algorithm, based on the rules presented above, that determines the CS to route to.



Table 1. AAP Routing Scheme for Ingest.

Route(AAP)
1. Get AAP.metadata.
2. algorithm ← Lookup policy.
3. CS ← algorithm.execute(AAP.metadata).
4. Route AAP to CS by invoking the PUT operation.
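
The four routing steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the AAP and ContentServer classes, the policy table keyed by a rule name, the put method, and the privacy rule are hypothetical names chosen for the example, not part of the system described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AAP:
    """Hypothetical package: metadata plus optional pointers to data files."""
    metadata: Dict[str, str]
    file_pointers: List[str] = field(default_factory=list)


@dataclass
class ContentServer:
    """Hypothetical Content Server exposing only a PUT operation."""
    name: str
    stored: List[AAP] = field(default_factory=list)

    def put(self, aap: AAP) -> None:
        self.stored.append(aap)


# A routing algorithm maps AAP metadata to a Content Server.
RoutingAlgorithm = Callable[[Dict[str, str]], ContentServer]


class GlobalFederator:
    def __init__(self, policy: Dict[str, RoutingAlgorithm], active: str) -> None:
        self.policy = policy   # policy table preloaded into the federator
        self.active = active   # key of the algorithm currently in force

    def route(self, aap: AAP) -> ContentServer:
        metadata = aap.metadata                # Step 1: get AAP.metadata
        algorithm = self.policy[self.active]   # Step 2: look up the policy
        cs = algorithm(metadata)               # Step 3: execute the algorithm
        cs.put(aap)                            # Step 4: invoke PUT on the CS
        return cs


cs1, cs2 = ContentServer("CS1"), ContentServer("CS2")


def privacy_rule(metadata: Dict[str, str]) -> ContentServer:
    # Assumed business-driven rule: segregate records holding privacy data.
    return cs1 if metadata.get("privacy") == "yes" else cs2


federator = GlobalFederator({"business": privacy_rule}, active="business")
target = federator.route(AAP({"title": "Personnel file", "privacy": "yes"}))
print(target.name)
```

Because clients only call route, a Content Server can be added or removed by changing the policy table, which is the transparency property claimed for the Global Federator.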



As an illustration, the algorithm in Table 2, called DTBasedRoute, essentially partitions the digital records based on the data types specified in the metadata. Step 1 extracts the data type (DT) value from the metadata encoded in XML. Depending on the DT found, DTBasedRoute then gets the Content Server instance CS by calling the function getCS(). The function getCS() in turn can be implemented using a table that maps the different data types to the Content Server instances. Table 3 gives an example of this mapping with three Content Servers.


Table 2. DTBasedRoute Algorithm.

DTBasedRoute(metadata)
1. DT ← Parse metadata using XPath.
2. CS ← getCS(DT).
3. Return CS.



Table 3. Content Server and Data Type Mapping Example.

Data Type     Content Server
CAD           CS1
Geospatial    CS2
Other         CS3
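
A minimal Python sketch of data-type-based routing over the mapping in Table 3 is given below. The XML element names (ace, intellectualEntity, dataType) are assumptions made for the example, since the actual metadata schema is not shown; the XPath step uses the subset of XPath supported by the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Table 3 as a lookup table; "Other" is the fallback entry.
DATA_TYPE_TO_CS = {"CAD": "CS1", "Geospatial": "CS2", "Other": "CS3"}


def get_cs(data_type: str) -> str:
    """Return the Content Server name mapped to a data type."""
    return DATA_TYPE_TO_CS.get(data_type, DATA_TYPE_TO_CS["Other"])


def dt_based_route(metadata_xml: str) -> str:
    root = ET.fromstring(metadata_xml)
    # Step 1: extract the data type with an XPath expression.
    # The element path is hypothetical.
    data_type = root.findtext("./intellectualEntity/dataType", default="Other")
    # Step 2: map the data type to a Content Server. Step 3: return it.
    return get_cs(data_type.strip())


sample = """
<ace>
  <intellectualEntity>
    <title>Bridge survey</title>
    <dataType>Geospatial</dataType>
  </intellectualEntity>
</ace>
"""
print(dt_based_route(sample))
```

Records whose data type is missing or unlisted fall through to the "Other" server, so a new specialized Content Server only requires a new row in the mapping.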


Note that a policy can also be a combination of these different kinds of rules, thus forming a multi-dimensional matrix of policy attributes to be considered simultaneously. However, the internal implementation of the Content Servers is no more complex than before, and the external interfaces of the Content Servers remain constant.

3.2. Content Server

A Content Server is a logical architectural construct that abstracts its internal implementation from the system or systems external to it. The encapsulation of the internals behind a uniform and constant interface allows a great degree of variability in the internal implementation, both initially and as required for system evolution. A Content Server consists of services to store digital assets, provide search capabilities and access to the assets, and allow the building and deployment of enhanced archival applications. With respect to the Reference Model for an OAIS, a Content Server encompasses part of the Data Management and Archival Storage components.

Significantly, a Content Server is a self-sufficient entity that manages a body of assets and provides applications and external services with a unified and standard interface to operate on the assets. The key aspect of its self-sufficiency is that a Content Server manages both the assets and their metadata internally. An archival repository system may contain multiple Content Servers as necessary.

In an implementation, a Content Server can be realized as a custom application deployed on an application server to provide the above services, or it can be a turnkey Commercial Off The Shelf (COTS) solution, assuming the proper encapsulation of the COTS solution behind the standard Content Server APIs.

At a minimum, a Content Server contains a Digital Object Manager and a Metadata Manager. The latter service is responsible for managing the metadata associated with the digital records. The Digital Object Manager performs the Insert, Retrieve, and Delete operations on the digital objects under the management of the Content Server. If there are different storage technologies within a Content Server, then the Digital Object Manager, functioning as a Local Federator, must have a policy for routing the AAP(s) to the correct physical storage implementation.

Figure 3 depicts the processing of an Insert request inside a Content Server. Upon receiving a request containing an AAP, the Content Server simultaneously invokes the Metadata Manager to insert the metadata and the Digital Object Manager to store the digital object of the record. The digital object is then passed to a Storage Manager instance chosen by an internal resolution. Implementation-wise, a Storage Manager can be realized by a COTS product that manages the physical archive storage. Note that the architecture allows the use of multiple products at the same time if needed.

As an example, Figure 3 shows two Storage Manager instances, namely Storage Manager 1 (SM1) and Storage Manager 2 (SM2). The result of the resolution internal to the Content Server determines which Storage Manager instance receives the forwarded digital objects to be archived. This underlines the benefit that a new Storage Manager implementation can be plugged in, either to replace old ones or to work in tandem with existing Storage Manager instances, when interfacing with the Digital Object Manager.
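
The Insert flow inside a Content Server can be sketched as below. The class and method names are hypothetical, the size-based resolution policy is an assumption introduced for the example, and the two managers are called one after the other rather than simultaneously to keep the sketch short.

```python
from typing import Dict, List


class StorageManager:
    """Stand-in for one storage product (disk array, tape library, cloud)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, bytes] = {}

    def store(self, object_id: str, data: bytes) -> None:
        self.objects[object_id] = data


class DigitalObjectManager:
    """Local Federator: resolves which Storage Manager receives an object."""

    def __init__(self, managers: List[StorageManager], large_object_bytes: int) -> None:
        self.sm_small, self.sm_large = managers
        self.large_object_bytes = large_object_bytes

    def insert(self, object_id: str, data: bytes) -> str:
        # Assumed resolution policy: route by object size.
        is_large = len(data) >= self.large_object_bytes
        target = self.sm_large if is_large else self.sm_small
        target.store(object_id, data)
        return target.name


class MetadataManager:
    def __init__(self) -> None:
        self.entries: Dict[str, Dict[str, str]] = {}

    def insert(self, object_id: str, metadata: Dict[str, str]) -> None:
        self.entries[object_id] = metadata


class ContentServer:
    def __init__(self, metadata_manager: MetadataManager,
                 object_manager: DigitalObjectManager) -> None:
        self.metadata_manager = metadata_manager
        self.object_manager = object_manager

    def insert(self, object_id: str, metadata: Dict[str, str], data: bytes) -> str:
        # Metadata and digital object are handled by separate managers.
        self.metadata_manager.insert(object_id, metadata)
        return self.object_manager.insert(object_id, data)


sm1, sm2 = StorageManager("SM1"), StorageManager("SM2")
server = ContentServer(
    MetadataManager(),
    DigitalObjectManager([sm1, sm2], large_object_bytes=1024),
)
print(server.insert("rec-1", {"title": "Memo"}, b"short memo"))
print(server.insert("rec-2", {"title": "Film"}, b"x" * 4096))
```

Replacing a storage product then amounts to passing a different StorageManager instance to the Digital Object Manager; the Content Server interface seen by the Global Federator does not change.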



Figure 3. Local Federator: Processing an Insert Request.


4. ASSET CATALOG ENTRY

In order to understand more clearly the underlying rationale behind the AAP design, it is helpful to examine the concepts of the Intellectual Entity and the Asset Catalog Entry (ACE), which contains ERA record metadata.

At a high level, an ACE has the following characteristics:

a) The content is compliant with the OAIS information model, containing information related to the unique identifier, provenance, context, fixity, and description. The metadata design also takes into account the flow of digital records through an OAIS system, from ingest to storage, preservation, access, and dissemination; i.e., elements are identified to be collected, extracted, and updated throughout the lifecycle processing.

b) The schema is based on the PREMIS [8] conceptual data model for digital preservation, which includes the main elements of Intellectual Entities (IEs), Objects, Rights, Events, and Agents.

c) The structure facilitates easy incorporation of external schemas in the archival domain, such as NISO MIX and the US National Archives’ LCDRG (Lifecycle Data Requirements Guide) [6].

d) Aggregation of metadata through processing phases. The processing of an archived digital object can be performed by different archivist groups using varied archival processing systems and technologies. Associated metadata will be collected at each stage and finally accumulated in the ACE.

Electronic records can be stored and managed using the concept of the IE, defined in the PREMIS data model as “a set of content that is considered a single intellectual unit for the purposes of management and description.” [8] At a logical level, an IE most commonly consists of the metadata associated with a digital record. However, an IE may also consist of a set of metadata for intellectually related files. It can also be metadata alone, without any directly associated asset object files, as for a Record Series. A ‘metadata only’ IE relates to other IEs within the system.

The data elements of an ACE are grouped according to the entities specified by PREMIS. An ACE has one Intellectual Entity and one or more Representations. While the IE has elements common to all representations, such as Title, Asset Type, Events, and Record Group, the technical metadata specific to each representation can be found in that Representation. A Representation may contain one or more objects or files.

Figure 4 depicts the structure of an ACE with a simple example of a digital record consisting of a document. The first Representation is the original format of the document in Microsoft Word, while the second Representation contains the metadata of the transformed document in PDF. In a digital preservation system, a new Representation is produced as the result of a transformation when the migration strategy is used. Note that both representations share a set of common metadata at the IE level. Additionally, each representation has its own specific set of metadata, which depends on the format types of the physical files. For the sake of simplicity, the diagram in Figure 4 shows each representation with one file. In reality, a representation may contain one or more files depending on the type of asset, as in the case of a web page.
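
The three-level ACE structure described above can be modelled with a few Python dataclasses. This is a sketch only: the field names follow the elements named in the text (Title, Asset Type, Events, Record Group), while the class names, the add_representation helper, and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectFile:
    file_name: str
    format_name: str


@dataclass
class Representation:
    label: str
    files: List[ObjectFile] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    # Metadata common to all representations.
    title: str
    asset_type: str
    record_group: str
    events: List[str] = field(default_factory=list)


@dataclass
class AssetCatalogEntry:
    entity: IntellectualEntity
    representations: List[Representation] = field(default_factory=list)

    def add_representation(self, representation: Representation, event: str) -> None:
        # A migration adds a Representation and records a lifecycle event;
        # the original Representation is left untouched.
        self.representations.append(representation)
        self.entity.events.append(event)


ace = AssetCatalogEntry(
    IntellectualEntity("Annual report", "document", "RG-64"),
    [Representation("original", [ObjectFile("report.doc", "Microsoft Word")])],
)
ace.add_representation(
    Representation("migrated", [ObjectFile("report.pdf", "PDF")]),
    event="format migration to PDF",
)
print([r.files[0].format_name for r in ace.representations])
```

A web page would be represented the same way, with one Representation holding several ObjectFile entries.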

In addition, there are two metadata components that we need to emphasize: Integrity Seal and Relationship.

The Integrity Seal is the digital hash of the digital object, used to ensure its authenticity. This seal is verified every time a digital record is accessed. Each newly migrated representation or object gets a recomputed integrity seal.
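
As a small illustration, an integrity seal can be computed and checked with Python's standard hashlib module. The choice of SHA-256 is an assumption made for the example; the hash algorithm used for the seal is not specified here.

```python
import hashlib
import hmac


def compute_seal(data: bytes) -> str:
    """Return the hex digest used as the integrity seal (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def verify_seal(data: bytes, seal: str) -> bool:
    """Recompute the seal on access and compare it with the stored value."""
    return hmac.compare_digest(compute_seal(data), seal)


original = b"contents of the original Word document"
stored_seal = compute_seal(original)

# Verification on access succeeds for unchanged bytes and fails otherwise.
print(verify_seal(original, stored_seal))
print(verify_seal(original + b" tampered", stored_seal))

# A migrated representation has different bytes, so it gets its own seal.
migrated = b"contents of the migrated PDF document"
migrated_seal = compute_seal(migrated)
print(migrated_seal != stored_seal)
```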

Elements under the Relationship group allow the system to express relationships between ACEs. The default taxonomy is the Archival Hierarchy, which comprises Record Groups, Collections, Series, File Units, and Items [7]. Other taxonomies can be built by expanding this Relationship element set. A virtual collection can be formed by linking records that originally belong to various series and share a common topic of interest to the collection.

Figure 4. ACE Structure.


5. ARCHIVAL ASSET PACKAGE

5.1 Concept

An Archival Asset Package is an object comprising metadata and optional data files. At the implementation level, these optional data files can be just pointers or URLs to the data files, or a set of structured folders containing those files. The design of the AAP is geared to create a transport structure and mechanisms that most efficiently standardize and facilitate the movement of the electronic assets - asset object files and the metadata associated with these asset object files - between the functional subsystems of ERA: Ingest, Content Servers within Data Management, Preservation, and Access.

AAP Types. The concept of the AAP offers an abstraction of the data object to be passed through the interfaces exposed by a Content Server. Practically, there are two categories of AAP in the system:

a) Category 1. An AAP can contain just metadata with no reference pointer to any physical data file. Examples of AAPs in this category are:

• An AAP containing the ACE of a series of records or collections of records.

• An AAP containing the ACE of a folder.

b) Category 2. An AAP in this category contains metadata and reference pointer(s) to one or more physical data files. The metadata of this kind of AAP comprises the Common Metadata at the IE level and the metadata specific to the representation manifested by these physical data files. An AAP always contains the metadata structure in compliance with the ACE redesign. However, not all metadata elements are completed in the AAP that is presented to the Content Server at the time of ingest.

A simple example in this category is an AAP containing metadata and a reference to an asset data file embodied by a single Microsoft Word document. Using PREMIS terminology, this Microsoft Word document is a representation of that data file.

In a more complex example, the representation of an asset can be a combination of N physical files. In this case, an AAP contains metadata and references to those N physical files.

These two categories of AAP differ only in the existence and nature of the reference pointers to other assets. While exposing a unified interface to subsystems and applications via the AAP concept model, the services operating on AAP objects can take advantage of the polymorphism feature of the object-oriented paradigm, so that the part of the processing specific to each category depends only on the reference pointers to other assets.
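
The polymorphism argument can be illustrated with a short Python sketch in which both categories share one interface and differ only in how they answer a request for their pointers. The class names and the describe method are hypothetical and serve only to show the idea.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AAP(ABC):
    """Common interface seen by subsystems, regardless of category."""

    def __init__(self, metadata: Dict[str, str]) -> None:
        self.metadata = metadata

    @abstractmethod
    def pointers(self) -> List[str]:
        """Return reference pointers to physical data files, if any."""

    def describe(self) -> str:
        # Shared processing; only pointers() differs between categories.
        return f"{self.metadata['title']}: {len(self.pointers())} file(s)"


class MetadataOnlyAAP(AAP):
    """Category 1: metadata with no pointer to any physical data file."""

    def pointers(self) -> List[str]:
        return []


class FileBackedAAP(AAP):
    """Category 2: metadata plus pointers to one or more data files."""

    def __init__(self, metadata: Dict[str, str], file_pointers: List[str]) -> None:
        super().__init__(metadata)
        if not file_pointers:
            raise ValueError("a Category 2 AAP needs at least one pointer")
        self._file_pointers = list(file_pointers)

    def pointers(self) -> List[str]:
        return list(self._file_pointers)


packages: List[AAP] = [
    MetadataOnlyAAP({"title": "Series description"}),
    FileBackedAAP({"title": "Web page"}, ["index.html", "images/logo.png"]),
]
for package in packages:
    print(package.describe())
```

A service that iterates over packages never needs to test the category explicitly, which is the benefit attributed to the unified AAP interface.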

Note that the metadata components of an ACE may relate to other metadata components. Record Series-level metadata is one such example. It refers to the metadata associated with distinct IEs - physical representations of distinct IEs, in a particular series. Each IE, of course, may consist of a number of asset object files. Therefore, an ACE at the record representation level refers to the record file-level ACEs for that record.

AAPs carry a set of data, comprising a mandatory standard metadata structure and an optional pointer to a physical asset file associated with the metadata, across all subsystems in ERA. The content of an AAP is always correlated to a specific ACE. Thus, an AAP of Category 1 is associated with an ACE that is referenced by other ACEs, rather than physical asset object files, and consists only of the mandatory metadata structure. An example of such an AAP is one that contains the description of a series or collection of digital records.



Figure 5. Common AAP Structure.


The more common type of AAP is associated with an ACE that refers to asset object files, and it contains pointers to those asset object files in addition to the metadata, as shown in Figure 5. When an AAP is constructed at Ingest time, the pointers contain sufficient information for the Content Server to find the object file(s) in the temporary working storage. For a Category 1 AAP, there is no link to any digital object.

5.2 Persistent URI

Individual metadata entries have a universal persistent identifier in correspondence with the associated physical asset, if any. Any number of schemes can be used for the assignment of persistent identifiers. However, all of them fall into two broad categories, or some hybrid of the two: synthetic and semantic.

A synthetic identifier scheme entails the generation of a unique hashed or random number identifier, while a semantic identifier scheme is usually a multipart assignment schema in which the component parts offer some degree of semantic meaning that is of value to the system owners or users. The choice of scheme is a matter of business preference. The system presupposes an assignment mechanism that ensures uniqueness across the entire system without the need for a centralized assigner service. Any central identifier assignment service could easily and quite quickly become a system bottleneck. Therefore, our design of a persistent URI requires that the identifier service be distributed. Furthermore, while not opting for a fully semantic scheme, we do need to inject some meaning into the identifier scheme for ease of processing and resolution. More specifically, the identifier has multiple segments that can be expanded if needed; each segment is inspected by a type of component in our layered architecture. This design strategy is not new; it is reminiscent of the layering in the OSI network model. One important feature is that each level of an ACE, namely Intellectual Entity, Representation, and Object, is identified by a unique identifier. So, the object level is one segment in the identifier scheme, and its different values as encoded in the ACE structure are listed in Table 4 below.


Table 4. Values for Object Level in ACE.

Value    Level in ACE
1        Intellectual Entity Level
2        Representation Level
3        Object File Level

As such, access can be performed at each of the three levels when the Object Level segment in the persistent URI is processed by a URI Resolution Service.

A further benefit of a persistent URI scheme that reaches down to each level in the ACE is that it facilitates Semantic Web applications in the future. Indeed, with persistent unique identifiers, it would be possible to construct different ontologies as overlays on top of the set of ACEs. Each level within an ACE can participate in a relationship since it is addressable by a URI.
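
To make the segmented identifier concrete, the following Python sketch shows one possible layout. The layout itself (an assigner node name, a per-node sequence number, and the object level, joined by slashes) is purely an assumption for illustration; only the three object level values come from Table 4. Uniqueness without a central service is obtained here by giving each assigner its own node name.

```python
from dataclasses import dataclass
from enum import IntEnum
from itertools import count


class ObjectLevel(IntEnum):
    # Values as listed in Table 4.
    INTELLECTUAL_ENTITY = 1
    REPRESENTATION = 2
    OBJECT_FILE = 3


@dataclass(frozen=True)
class PersistentId:
    node: str        # hypothetical segment: which assigner issued the id
    sequence: int    # hypothetical segment: counter local to that assigner
    level: ObjectLevel

    def __str__(self) -> str:
        return f"{self.node}/{self.sequence}/{int(self.level)}"

    @classmethod
    def parse(cls, text: str) -> "PersistentId":
        node, sequence, level = text.split("/")
        return cls(node, int(sequence), ObjectLevel(int(level)))


class IdAssigner:
    """One assigner per node; the node name keeps ids unique system-wide."""

    def __init__(self, node: str) -> None:
        self.node = node
        self._counter = count(1)

    def assign(self, level: ObjectLevel) -> PersistentId:
        return PersistentId(self.node, next(self._counter), level)


ingest_a, ingest_b = IdAssigner("ingest-a"), IdAssigner("ingest-b")
first = ingest_a.assign(ObjectLevel.INTELLECTUAL_ENTITY)
second = ingest_b.assign(ObjectLevel.INTELLECTUAL_ENTITY)
print(first, second)

# A resolution service can inspect the level segment to decide what to return.
resolved = PersistentId.parse("ingest-a/7/3")
print(resolved.level.name)
```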

5.3 Tree View

The Tree View of the ACE structure may help to appreciate the AAP concept. In Figure 6, we see the three levels of an ACE, and it is possible to make these three levels accessible via the AAP.

a) An AAP can be constructed with all the metadata contained in an ACE, including IE-level, representation-level, and file-level metadata, and all associated object files. In this case, the package contains all the files of all representations. This construction allows access at the topmost level of the ACE, which may be useful when exporting a record and its derived transformations.

b) Access can also be provided at the Representation level. In this case, the AAP contains the IE-level metadata, the metadata related to a specific representation, and all the files under this representation with their technical metadata. This access is the most commonly used.



Figure 6. Different Access Levels.


c) Access can be performed at the object file level as well. In this case, the AAP is constructed from the IE-level metadata, the specific representation-level metadata, and the object file-level metadata of the particular object file.
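
The three access levels can be sketched as a single function that assembles AAP content from an ACE. The nested-dictionary layout of the ACE, the build_aap name, and the sample values are assumptions made for the example.

```python
from typing import Any, Dict, Optional

# Hypothetical ACE with one IE and two representations.
ACE: Dict[str, Any] = {
    "ie": {"title": "Annual report"},
    "representations": {
        "word": {
            "metadata": {"format": "Microsoft Word"},
            "files": {"report.doc": {"size": 52000}},
        },
        "pdf": {
            "metadata": {"format": "PDF"},
            "files": {"report.pdf": {"size": 48000}},
        },
    },
}


def build_aap(ace: Dict[str, Any],
              representation: Optional[str] = None,
              file_name: Optional[str] = None) -> Dict[str, Any]:
    """Assemble AAP content at IE, Representation, or object file level."""
    if file_name is not None and representation is None:
        raise ValueError("file-level access needs a representation")
    # IE-level metadata is included at every access level.
    aap: Dict[str, Any] = {"ie": ace["ie"], "representations": {}}
    for rep_name, rep in ace["representations"].items():
        if representation is not None and rep_name != representation:
            continue
        files = {
            name: meta
            for name, meta in rep["files"].items()
            if file_name is None or name == file_name
        }
        aap["representations"][rep_name] = {
            "metadata": rep["metadata"],
            "files": files,
        }
    return aap


full = build_aap(ACE)                                  # IE level
rep_only = build_aap(ACE, "pdf")                       # Representation level
one_file = build_aap(ACE, "word", "report.doc")        # object file level
print(sorted(full["representations"]), sorted(rep_only["representations"]))
```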

As discussed above, the AAP serves to standardize the data
interfaces of the ERA subsystems and consists of a
standard metadata structure and an optional pointer to an
asset object file. While the metadata structure remains
constant at all times, the number of actual metadata
elements present in any one AAP varies, depending on the
specific operation that the AAP supports in the Create,
Read, Update, Delete, and Search (CRUDS) suite of operations
in the Content Server. The Content Server is a natural
reference point in the discussion of data transfers within
the CSA, because ultimately any operation that involves the
transfer of object files or metadata between subsystems
involves a CRUDS operation in the Content Server.
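
To make the idea of a constant structure with a varying number of
elements concrete, the sketch below models an AAP as a Python
dataclass. The field names, the `element_count` helper, and the
sample values are assumptions chosen for illustration; the paper does
not prescribe this layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AAP:
    """Fixed metadata structure; only the number of populated elements varies."""
    ie_metadata: Dict[str, str] = field(default_factory=dict)
    representation_metadata: Dict[str, str] = field(default_factory=dict)
    file_metadata: Dict[str, str] = field(default_factory=dict)
    object_pointer: Optional[str] = None  # optional pointer to the asset object file

    def element_count(self) -> int:
        """Number of metadata elements actually present in this AAP."""
        return (len(self.ie_metadata) + len(self.representation_metadata)
                + len(self.file_metadata))


# Sparse AAP as it might look right after ingest (values are made up).
at_ingest = AAP(ie_metadata={"identifier": "ace-0001"},
                object_pointer="file:///ingest/staging/f1.bin")
# Richer, metadata-only AAP for a later operation on the same asset.
later = AAP(ie_metadata={"identifier": "ace-0001", "title": "Sample record"},
            representation_metadata={"format": "PDF"})

print(at_ingest.element_count(), later.element_count())  # 1 3
```

Both instances expose the same three metadata sections, so a
consuming subsystem can rely on the structure even when a section is
empty.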

PUT Operation. Electronic records arrive at the ERA
system boundary as packaged shipments, each comprising
multiple files (asset object files). Once unpacked and
initially processed in the Ingest subsystem, the asset object
files are tagged with unique persistent identifiers. At that
point a set of metadata, as available, is also associated with
each asset object file via the URI. An AAP carries each
asset-object-specific metadata set and a pointer to the
associated object file from the Ingest subsystem to the
Content Server. The metadata is used in the Content Server
for the Create ACE operation. The pointer to the asset object
file is used to transfer the asset file from the Ingest area to
the Content Server for permanent storage.

It should be noted that, at this point, the set of metadata
elements in the AAP is perhaps the smallest that will ever be
associated with the asset object file, since the asset has
undergone only the initial processing. More metadata is
accreted during the record lifecycle beyond this point.

In particular, an AAP that carries only metadata across the
Content Server subsystem boundary into the Content Server is
used to create an ACE at the IE level.
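
The PUT behavior described above can be sketched as follows. This is
a simplified illustration in which the Content Server is an in-memory
dictionary; the function `put_aap`, the dictionary keys, and the
sample identifiers and paths are all hypothetical, and the actual
file transfer is deliberately left out.

```python
def put_aap(content_server: dict, aap: dict) -> None:
    """Create a new ACE at the IE level from an incoming AAP."""
    ace_id = aap["id"]  # persistent identifier assigned during Ingest
    if ace_id in content_server:
        raise ValueError(f"ACE {ace_id} already exists; PUT only creates")
    content_server[ace_id] = {
        "ie_metadata": dict(aap["metadata"]),
        "representations": [],
    }
    pointer = aap.get("pointer")
    if pointer is not None:
        # A real system would now copy the file from the Ingest area into
        # permanent storage; this sketch only records where it came from.
        content_server[ace_id]["representations"].append({"files": [pointer]})


server = {}
# AAP with metadata and a pointer to an asset object file.
put_aap(server, {"id": "ace-0001",
                 "metadata": {"source": "shipment-17"},
                 "pointer": "file:///ingest/staging/f1.bin"})
# Metadata-only AAP, which creates an ACE at the IE level only.
put_aap(server, {"id": "ace-0002", "metadata": {"title": "Series A"}})
```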

GET Operation. This operation results in AAP instances
crossing the Content Server boundary. Again, these AAPs
can be of Category 1 (metadata only) or Category 2
(metadata and a pointer to an associated asset object file).
At the implementation level, the asset object file itself can
be retrieved from its Content Server by the requesting client
using an efficient retrieval protocol as provided by the
underlying physical storage.
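
A minimal sketch of the two AAP categories returned by a GET is given
below. The function `get_aap`, the in-memory `server` dictionary, and
the storage path are assumptions for illustration; retrieval of the
file bytes is left to whatever protocol the storage layer offers and
is not modeled here.

```python
def get_aap(content_server: dict, ace_id: str, with_pointer: bool = False) -> dict:
    """Return a Category 1 (metadata only) or Category 2 (metadata and pointer) AAP."""
    entry = content_server[ace_id]
    aap = {"id": ace_id, "category": 1, "metadata": dict(entry["metadata"])}
    if with_pointer and entry.get("pointer") is not None:
        # The client follows this pointer to fetch the file in a separate step.
        aap["category"] = 2
        aap["pointer"] = entry["pointer"]
    return aap


server = {"ace-0001": {"metadata": {"title": "Sample record"},
                       "pointer": "file:///store/vol3/f1.bin"}}

metadata_only = get_aap(server, "ace-0001")
with_file = get_aap(server, "ace-0001", with_pointer=True)
```

Keeping the pointer separate from the file content means the AAP
itself stays small regardless of the size of the asset.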

POST Operation. A POST AAP operation is performed in
several cases. One scenario is the result of a preservation
transformation, where an asset encapsulated in an AAP is
retrieved and transformed, and the transformed
version/representation is inserted into its Content Server.
Note that during this insertion the metadata of the new AAP
is richer than the metadata of the AAP associated with this
asset when it was first ingested. Moreover, the result of
this insertion is the creation of a new additional
representation under the existing ACE, as well as an update
to the ACE, such as the addition of a pointer to the new
representation.

Another scenario is an update to an existing ACE, or more
precisely a metadata post operation, performed when
additional metadata is added to an existing ACE. Events
requiring an ACE update include the creation of additional
metadata elements, such as a record description for an
existing ACE entry, and pointer updates. Pointer updates can
be performed:

• Within an ACE, to associate an IE with a new
representation, or

• To establish an association between a parent ACE
(e.g., series) and child ACEs (e.g., assets within a
series).

If the AAP in a POST operation contains references to new
object files as a result of an event such as a preservation
transformation or a redaction, then the POST operation will
also result in the creation of new asset object files and
associations between ACEs or within an ACE.

The POST operation is similar to the PUT operation, with
the difference that PUT creates a new ACE at the IE level,
whereas POST only updates an existing one.
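
The contrast with PUT can be sketched as follows: a POST refuses to
create an ACE and instead enriches an existing one. The function
`post_aap`, the dictionary layout, and the sample values are
hypothetical and only illustrate the behavior described in this
section.

```python
def post_aap(content_server: dict, aap: dict) -> dict:
    """Update an existing ACE with extra metadata and, optionally, a new representation."""
    ace = content_server.get(aap["id"])
    if ace is None:
        raise KeyError(f"no ACE {aap['id']}; POST only updates existing ACEs")
    # Metadata accretes: new elements are added, existing ones are overwritten.
    ace["ie_metadata"].update(aap.get("metadata", {}))
    new_rep = aap.get("representation")
    if new_rep is not None:
        # E.g. the output of a preservation transformation or a redaction.
        ace["representations"].append(new_rep)
    return ace


server = {"ace-0001": {"ie_metadata": {"source": "shipment-17"},
                       "representations": [{"id": "rep-original",
                                            "files": ["f1.wpd"]}]}}

post_aap(server, {"id": "ace-0001",
                  "metadata": {"description": "Converted for preservation"},
                  "representation": {"id": "rep-pdf", "files": ["f1.pdf"]}})
```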

DELETE Operation. While a Delete operation is
implemented in the Content Server and both ACE entries
and asset files can be deleted, no AAP is constructed for a
Delete, since a list of identifiers is sufficient to indicate
the assets or metadata to be removed from the Content Server.

QUERY Operation. This operation will be used by the
Search application within the Access subsystem. It allows
the discovery of assets via a metadata search or content
search, depending on the level of service provided by the
Content Server. The result of the Query operation is a list
of search results with links to the AAP(s) that satisfy the
query criteria.
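
A metadata search of this kind might look like the sketch below,
which returns links rather than the AAPs themselves. It assumes
exact-match criteria over an in-memory dictionary; the function
`query`, the `/aap/<id>` link format, and the sample metadata are
hypothetical.

```python
def query(content_server: dict, **criteria) -> list:
    """Return links to the AAPs whose metadata matches every criterion exactly."""
    hits = []
    for ace_id, entry in sorted(content_server.items()):
        metadata = entry["metadata"]
        if all(metadata.get(key) == value for key, value in criteria.items()):
            hits.append(f"/aap/{ace_id}")  # link format is illustrative only
    return hits


server = {
    "ace-0001": {"metadata": {"creator": "Agency A", "format": "PDF"}},
    "ace-0002": {"metadata": {"creator": "Agency B", "format": "PDF"}},
    "ace-0003": {"metadata": {"creator": "Agency A", "format": "TIFF"}},
}

print(query(server, creator="Agency A"))
```

The client can then issue a GET for each returned link to obtain the
corresponding AAP.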

5.4 Relation with AIP

It is interesting to note that the AAP is actually a logical
Archival Information Package (AIP) as defined in the OAIS
reference information model. Thus, the AAP concept can
be said to generalize the AIP model in order to:

• Adapt to the ACE structure in the system.

• Provide a generic interface for both physical and
virtual assets. The latter can be series, collections,
record groups, and folder entities that do not have
references pointing directly to physical assets.
These virtual assets do have relationships to the
ACE instances grouped under them.

Moreover, the AAP also encompasses the AIP, AIU, and AIC
models, where AIU and AIC refer to the Archival
Information Unit and Archival Information Collection,
respectively. While the AIU offers granular access down to
the atomic object level, the AIC allows the aggregation, in
either a physical or virtual sense, of multiple AIPs. Our tree
view, endowed with persistent URIs as discussed above,
provides the system a way to access and operate at
all three levels as specified in the OAIS reference model.

At run time, what is guaranteed is the standard metadata
structure in an AAP. The number of elements within the
structure differs for different AAPs. Indeed, the first
time that an asset is ingested into a Content Server (ACE
creation), the metadata structure of its AAP may not
contain all the elements that will eventually be contained
in the later ACE of that asset, with its rich IE-level
metadata, although it should contain the elements necessary
for the creation and identification of the ACE. In
subsequent operations on the same asset, however, the AAP
can be constructed with a richer metadata set.

6. RELATED WORK

The Public Record Office Victoria (PROV) in Australia has
designed the VERS (Victorian Electronic Records Strategy)
Encapsulated Object (VEO) for packaging a digital record and
its associated metadata [2]. In the archive, the entire VEO
is encoded in XML, including any binary content, which is
converted into Base64. The main advantage of this approach
is that it provides a self-describing package that is
independent of any archive system. Moreover, since XML is
text-based by nature, a VEO can be preserved easily for the
long term. However, converting all content into XML may be
inefficient, if not infeasible, especially when the digital
object to be archived is very large, such as a
high-resolution space image or a database file. In addition,
although the digital content is in XML, we still have the
issue of interpreting the Base64 section, either directly or
after its re-conversion to binary format at access time.
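
The size cost of Base64 can be quantified directly: every 3 bytes of
binary content become 4 ASCII characters, so the encoded form is
roughly one third larger before any XML markup or line wrapping is
added. The short Python sketch below checks this with the standard
`base64` module; the helper name `base64_size` is ours and is not
part of VERS.

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Length in characters of the padded Base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


payload = bytes(range(256)) * 4          # 1024 bytes of binary content
encoded = base64.b64encode(payload)      # encoded without line breaks

assert len(encoded) == base64_size(len(payload))
print(len(payload), len(encoded))        # 1024 1368
```

For a 1 GiB object the same formula gives about 1.33 GiB of encoded
text, which illustrates why embedding very large objects in XML
becomes costly.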

The concept of an archive object named Trusted Digital
Object (TDO) was promoted in [4]. There, the author
emphasized the authenticity of the preserved object. A
TDO can be claimed to be authentic if it contains the
necessary safeguards, some of which are included in the
object itself in the form of a cryptographic seal, while
others are provided by a trusted repository. Note that our
proposed AAP also includes an integrity seal similar to the
one used in a TDO. Another common feature of the TDO and the
AAP is the identifier. A TDO is identified by a Digital
Resource Identifier (DRI), similar to the notion of a
persistent URI, which allows for references from external
entities.
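
The paper does not specify how the AAP integrity seal is computed.
Purely as an illustration of the idea, the sketch below derives a
seal from a SHA-256 digest over the metadata and the file content. A
plain digest is used here as a stand-in and only detects modification
when the seal itself is stored safely; a seal in the sense of [4]
would normally involve a digital signature. The function names and
the canonical JSON encoding are assumptions.

```python
import hashlib
import hmac
import json


def seal(metadata: dict, content: bytes) -> str:
    """Digest over canonically serialized metadata and the file content."""
    canonical = json.dumps(metadata, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256()
    digest.update(canonical)
    # Hash the content separately so large files could be digested in a stream.
    digest.update(hashlib.sha256(content).digest())
    return digest.hexdigest()


def verify(metadata: dict, content: bytes, expected: str) -> bool:
    """Recompute the seal and compare it in constant time."""
    return hmac.compare_digest(seal(metadata, content), expected)


metadata = {"identifier": "ace-0001", "title": "Sample record"}
content = b"example asset bytes"
stored_seal = seal(metadata, content)
```

Any later change to either the metadata or the content causes
`verify` to return `False`.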

In [3], the authors proposed a framework for packaging
complex digital objects based on the standard MPEG-21
Digital Item Declaration (MPEG-21 DID). They argued
that the advantage of such an approach lies in the fact that
the standard offers a data model, a normative
representation, and an XML schema. The elements of the
data model include Container, Item, Component, and the
associated Descriptor and identifier. In some sense, the
hierarchical view of the AAP, which is based on the PREMIS
structure, is very similar to the data model in MPEG-21
DID. However, we think that, by relying on PREMIS, the AAP
concept benefits from that standard's focus on digital
preservation, while at the same time being able to handle
the problem of AIP packaging.

7. CONCLUSION

In this paper, we have described the design concept of an
AAP as the core element in a digital archiving system. The
design takes into consideration both the functional and the
non-functional requirements exhibited in a large system for
digital archiving and preservation. On one hand, we have
shown that the concept is flexible and extensible enough to
adapt to a wide range of what defines a digital record, from
an atomic file to a set of files or a semantic record in a
database file. On the other hand, the implementation of the
AAP is supported by a scalable architecture and services
that provide a set of operations with uniform and
standardized interfaces.


8. DISCLAIMER

The opinions in this paper belong solely to the authors and
do not reflect any position of the National Archives and
Records Administration.

9. REFERENCES

[1] The Consultative Committee for Space Data Systems.
Reference Model for an Open Archival Information System
(OAIS), 2002. Available:
http://public.ccsds.org/publications/archive/650x0b1.pdf
[Feb. 16, 2009].

[2] Andrew Waugh. The Design of the VERS Encapsulated
Object Experience with an Archival Information Package.
International Journal on Digital Libraries (2006) 6(2):
184-191.

[3] Jeroen Bekaert, Emiel De Kooning, Rik Van de Walle.
Packaging Models for the Storage and Distribution of
Complex Digital Objects in Archival Information Systems: a
Review of MPEG-21 DID Principles. Multimedia Systems 10:
286-301 (2005).

[4] Henry M. Gladney. Trustworthy 100-year Digital Objects:
Evidence After Every Witness Is Dead. ACM Transactions
on Information Systems (TOIS), Volume 23, Issue 3 (July
2005), pp. 299-324.

[5] Robert Kahn and Robert Wilensky. A Framework for
Distributed Digital Objects. International Journal on Digital
Libraries (2006) 6(2): 115-123.

[6] Lifecycle Data Requirements Guide. Available:
http://www.archives.gov/research/arc/lifecycle-data-requirements.pdf

[7] Archival Descriptions from the Archival Research Catalog
(ARC). Available:
http://www.archives.gov/datasets/arc-archival-descriptions-xml.pdf

[8] PREMIS. Preservation Metadata. Available:
http://www.loc.gov/standards/premis/

[9] InterPARES 2.
http://www.interpares.org/ip2/ip2_terminology_db.cfm

[10] Sherry L. Xie. Foundation for Developing Digital
Preservation Policy: The InterPARES Policy Framework.
http://www.armaedfoundation.org/pdfs/iPRES2007_Beijing_Paper_SherryXie_Canada_20071011_.pdf

[11] Kenneth Thibodeau. Overview of Technological Approaches
to Digital Preservation and Challenges in Coming Years.
Conference Proceedings of the State of Digital Preservation:
An International Perspective. Washington, D.C., April 24-25,
2002. Council on Library and Information Resources, Pub 107.
Available:
http://www.clir.org/PUBS/reports/pub107/thibodeau.html