Digital Object Architecture:


Building Information Management
Infrastructure for Networks



20 September 2010




Larry Lannom

Corporation for National Research Initiatives

http://www.cnri.reston.va.us/

http://www.handle.net/

Three Initial Networks

About 30-35 years ago, DARPA funded the creation of three seminal packet networks:

  ARPANET, Packet Radio, Packet Satellite

The Internet came about from a desire to link the three of them.

Ethernet occurred in parallel, led by Xerox PARC researchers, and other network types followed.

The resulting architecture was independent of the number and type of networks or who ran them.


Key Decisions

The Internet would be a global information system.

An open architecture would be used to combine different networks based on open and well-known interfaces, protocols & objects.

A new communications-oriented host protocol (TCP/IP) would be created to replace the original ARPANET host protocol (NCP).

The concept of global addressing and IP addresses would be introduced to identify individual machines anywhere on the global Internet.


Comments on the Key Decisions


The architecture is robust in the presence of many
different network types and many outages.


Gateways provided IP routing and Network
"Impedance Matching".


TCP accommodated end-end protocol:

  different packet sizes, duplicates, error detection, losses due to tunnels, mountains, jamming, etc.


Separate network administrations were permitted,
which allowed the Net to grow.


DNS not technically critical, but helped users.


Understanding the Big Picture


Many things were done well from the outset; with
20/20 hindsight, some could have been done better.


The context was critical:


Mostly mainframes, few time-sharing systems


No PCs, workstations, LANs


One dominant carrier in the US


Government facility initially


What is important at the time may be only apparent
with hindsight; but also what seems important at
the time may not turn out to be so important later
on.


Infrastructure Development

What is so hard about it?

Making it scalable over platforms, size and time

Achieving critical mass

Getting buy-in:
  Pleasing many essential participants
  Displacing prior capabilities
  Structuring matters to deal with concerns about empire building

It's a lot easier to create brand new capabilities than to affect existing means of operation.


Infrastructure Creation is a Subtractive Process


Infrastructure reduces a common, shared capability
to its basic and essential attributes.


These attributes are not always recognized or
understood up front.


Upon further scrutiny, capabilities are usually deleted
from a well-conceived architecture over time.


Consensus develops when no more can be removed
without disabling the infrastructure.


What is the Information Management Problem?


Managing information in the Net over very long
periods of time


e.g., centuries or more.


Dealing with very large amounts of information in
the Net over time.


When information, its location(s) and even the
underlying systems may change dramatically over
time.


Respecting and protecting rights, interests and value.


A Meta-level Architecture

Allows for arbitrary types of information systems.

Allows for dynamic formatting and data typing.

Can accommodate interoperability between multiple different information systems.

Allows metadata schema to be identified and typed.

Digital Object Architecture: Motivation

To reformulate the Internet architecture around the notion of uniquely identifiable data structures.

Enabling existing and new types of information to be reliably managed and accessed in the Internet environment over long periods of time.

Providing mechanisms to stimulate innovation, the creation of dynamic new forms of expression, and to manifest older forms.

While supporting intellectual property protection and fine-grained access control, and enabling well-formed business practices to emerge.

Digital Object Architecture Technical Components

Digital Objects (DOs)
  Structured data, independent of the platform on which it was created
  Consisting of "elements" of the form <type,value>
  One of which is its unique, persistent identifier

Resolution of Unique Identifiers
  Maps an identifier into "state information" about the DO
  Handle System is a general purpose resolution system

Repositories from which DOs may be accessed
  And into which they may be deposited

Metadata Registries
  Repositories that contain general information about DOs
  Support multiple metadata schemes
  Can map queries into unique DO specifications (via handles)
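
One way to picture how these four components fit together is the toy walk-through below. It is only a sketch under stated assumptions: the three dictionaries stand in for a metadata registry, a resolution system and a repository, and every name in it (REGISTRY, RESOLVER, REPOSITORIES, the handle "12345/report-2009", the repository name "repo-a") is hypothetical and not taken from the slides or from any real Handle System API.

```python
# Metadata registry (hypothetical contents): descriptive metadata plus the handle.
REGISTRY = [
    {"title": "Annual Report 2009", "handle": "12345/report-2009"},
    {"title": "Field Notes", "handle": "12345/notes-7"},
]

# Resolution system: maps a handle to "state information" about the DO.
# Here the state information is just the name of the repository holding it.
RESOLVER = {
    "12345/report-2009": {"repository": "repo-a"},
    "12345/notes-7": {"repository": "repo-a"},
}

# Repositories: each holds digital objects as lists of (type, value) elements.
REPOSITORIES = {
    "repo-a": {
        "12345/report-2009": [
            ("identifier", "12345/report-2009"),
            ("content", "report text"),
        ],
        "12345/notes-7": [
            ("identifier", "12345/notes-7"),
            ("content", "notes text"),
        ],
    }
}


def find_handles(title_word):
    """Registry step: map a query onto the handles of matching objects."""
    word = title_word.lower()
    return [e["handle"] for e in REGISTRY if word in e["title"].lower()]


def resolve(handle):
    """Resolution step: map a handle onto state information about the DO."""
    return RESOLVER[handle]


def fetch(handle):
    """Repository step: access the DO from the repository named in its state."""
    state = resolve(handle)
    return REPOSITORIES[state["repository"]][handle]


if __name__ == "__main__":
    for handle in find_handles("report"):
        print(handle, fetch(handle))
```

The point of the sketch is the division of labour: the registry knows about descriptions, the resolver knows where things currently are, and the repository holds the objects, so any one of them can change without renaming the object.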


What is a Digital Object?

Defined data structure, machine independent.

Consisting of a set of elements:
  Each of the form <type,value>
  One of which is the unique identifier

Identifiers are known as "Handles":
  Format is "prefix/suffix"
  Prefix is unique to a naming authority
  Suffix can be any string of bits assigned by that authority

Data structure can be parsed; types can be resolved within the architecture.

Associated properties record, and transaction record, contain metadata and usage information.
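
As an illustration of the structure described on this slide, a digital object can be modelled as a list of <type,value> elements, one of which holds a "prefix/suffix" handle. This is a minimal sketch, not the CNRI implementation: the class names, the element type name "identifier" and the example handle are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Handle:
    """An identifier of the form "prefix/suffix"."""

    prefix: str  # unique to a naming authority
    suffix: str  # any string assigned by that authority

    @classmethod
    def parse(cls, text: str) -> "Handle":
        # Split on the first slash only, so the suffix may contain slashes.
        prefix, sep, suffix = text.partition("/")
        if not sep or not prefix or not suffix:
            raise ValueError(f"not a prefix/suffix handle: {text!r}")
        return cls(prefix, suffix)

    def __str__(self) -> str:
        return f"{self.prefix}/{self.suffix}"


@dataclass
class DigitalObject:
    """A set of <type, value> elements, one of which is the identifier."""

    elements: list[tuple[str, bytes]] = field(default_factory=list)

    # Hypothetical type name for the identifier element (class attribute).
    ID_TYPE = "identifier"

    @classmethod
    def create(cls, handle: Handle) -> "DigitalObject":
        return cls([(cls.ID_TYPE, str(handle).encode("utf-8"))])

    def add(self, type_: str, value: bytes) -> None:
        self.elements.append((type_, value))

    @property
    def handle(self) -> Handle:
        for type_, value in self.elements:
            if type_ == self.ID_TYPE:
                return Handle.parse(value.decode("utf-8"))
        raise LookupError("object has no identifier element")

    def values(self, type_: str) -> list[bytes]:
        """Return every value stored under the given element type."""
        return [v for t, v in self.elements if t == type_]


if __name__ == "__main__":
    do = DigitalObject.create(Handle.parse("12345/report/v1"))
    do.add("content", b"hello")
    print(do.handle, do.values("content"))
```

Values are kept as bytes so that the object stays independent of the platform that created it; interpreting a value is left to whoever understands its type.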

Interoperability & Federated Repositories

Create a cohesive interoperable collection of repository-based systems.

Initially, perhaps, around a core set of projects, content, applications and/or organizations

Demonstrate interoperability between different repository collections.

Develop procedures to ensure continued accessibility to key archival information.

Repository Notion

[Figure: a repository with any hardware & software configuration behind a logical external interface, the Digital Object Protocol (DOP).]

Repositories & Digital Objects

Each Digital Object has its own unique & persistent ID.

Content Providers assign IDs.

Could be upwards of trillions of DOs per Repository.

Objects may be replicated in multiple repositories.
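
The repository idea (any internal configuration behind one logical external interface, with objects replicated across repositories) can be sketched as follows. This does not reproduce the Digital Object Protocol itself; the interface with its two operations, `deposit` and `access`, and the helper `replicate` are hypothetical names chosen for the example.

```python
from typing import Protocol

Elements = list[tuple[str, bytes]]


class Repository(Protocol):
    """Logical external interface; the storage behind it can be anything."""

    def deposit(self, handle: str, elements: Elements) -> None: ...

    def access(self, handle: str) -> Elements: ...


class InMemoryRepository:
    """One possible configuration: objects kept in a dictionary."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, Elements] = {}

    def deposit(self, handle: str, elements: Elements) -> None:
        self._objects[handle] = list(elements)

    def access(self, handle: str) -> Elements:
        return list(self._objects[handle])


def replicate(handle: str, source: Repository, targets: list[Repository]) -> None:
    """Copy one object, under the same identifier, into other repositories."""
    elements = source.access(handle)
    for target in targets:
        target.deposit(handle, elements)


if __name__ == "__main__":
    primary = InMemoryRepository("primary")
    mirror = InMemoryRepository("mirror")
    primary.deposit("12345/abc", [("content", b"payload")])
    replicate("12345/abc", primary, [mirror])
    print(mirror.access("12345/abc"))
```

Because the identifier is assigned by the content provider rather than by the repository, the replica is the same object as far as clients are concerned.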

The Handle System

Distributed identifier service on the Internet

First general purpose resolution system

Can be used to locate repositories that contain digital objects given their handles

and more!
  Other indirect references
  Public Keys, Authentication information for DOs

Accommodates interoperability between many different information systems

Attributes of the Handle System

The basic architecture of the Handle System is flat, scalable, and extensible.

Logically central, but physically decentralized.

Supports Local Handle Services, if desired.

Handle resolutions return entire "handle records" or portions thereof.

Handle Records are also:
  digital objects
  signed by the servers
  doubly certificated by the system.
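
Two of these attributes, "logically central, but physically decentralized" and resolutions that return a whole handle record or a portion of it, can be illustrated with a small in-memory model. It is a sketch only: it does not speak the real Handle protocol, it omits signing and certification, and the class names, method names and value types shown are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandleValue:
    index: int
    type: str
    data: str


class HandleService:
    """One service holding the handle records for some prefixes."""

    def __init__(self) -> None:
        self._records: dict[str, list[HandleValue]] = {}

    def register(self, handle: str, values: list[HandleValue]) -> None:
        self._records[handle] = list(values)

    def resolve(self, handle: str, types=None) -> list[HandleValue]:
        record = self._records[handle]
        if types is None:
            return list(record)  # the entire handle record
        return [v for v in record if v.type in types]  # a portion of it


class GlobalResolver:
    """Logically central entry point that routes each request by prefix."""

    def __init__(self) -> None:
        self._by_prefix: dict[str, HandleService] = {}

    def delegate(self, prefix: str, service: HandleService) -> None:
        self._by_prefix[prefix] = service

    def resolve(self, handle: str, types=None) -> list[HandleValue]:
        prefix = handle.partition("/")[0]
        return self._by_prefix[prefix].resolve(handle, types)


if __name__ == "__main__":
    local = HandleService()
    local.register(
        "12345/abc",
        [
            HandleValue(1, "URL", "http://repo.example.org/abc"),
            HandleValue(2, "PUBLIC_KEY", "key material"),
        ],
    )
    resolver = GlobalResolver()
    resolver.delegate("12345", local)
    print(resolver.resolve("12345/abc"))
    print(resolver.resolve("12345/abc", types={"URL"}))
```

Clients only ever talk to the resolver, yet each prefix can be served by a different service, which is what lets the namespace stay flat while the operation stays distributed.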

Resolution Mechanism

[Figure: Handle System <www.handle.net> with multiple sites and multiple servers, and a Handle Record.]

Handle System is non-nodal

Scalable & Distributed

Supports global (and local) resolution

Has backup for reliability, mirroring for efficiency

Conclusions

Managing Digital Objects for long-term access is a key challenge.

Initial technology components are available; industry is expected to generate more over time.

Third-party value-added providers in the private sector will ultimately shape the long-term evolution.

Interoperability and reliable information access are critical objectives.

A diversity of applications (with user-friendly interfaces) needs to be developed & deployed.

Phone Guy Perspective

Purpose of Digital Object

Today's architectures and paradigms, including leading edge technology, operate on the circuit switched telephone equivalent of data storage.

  A "dumb" system for payload data storage ("the circuits").

  A separate system for management, control, and metadata ("the signaling network").

As a consequence, these systems are limited in robustness, security, interoperability, extensibility, cost effectiveness, vendor independence, and functionality.

Create the foundation for data storage and retrieval, equivalent to what packet data did for communication.

Urs Muller, Net-Scale

Today's Paradigms

[Figure labels: User, Request, Authentication, Data management, Data storage, Data.]

Data management:
  Access control
  Key management
  Provenance infrastructure
  Version control
  Metadata

Examples:
  Documentum (EMC)
  SharePoint, MOSS 2007 (Microsoft)
  FileNet (IBM)
  10g, Stellent (Oracle)
  LiveLink (OpenText)
  Alfresco (open source)

What Happens When Data Is Moved

[Figure labels: Data management, Data storage, Data.]

Loss of access control

Loss of key management

Loss of provenance infrastructure

Loss of version control

Loss of metadata


Limitations of Today's Paradigms

Use of separate and different systems for storage of the (payload) data and the data management.
  Creates a centralized system.
  Poor interoperability.
  Heavy vendor and product dependence.

The data management system is a huge, fragile single point of failure which requires heavy protection to make a solution usable.
  This is similar to the signaling network and out-of-band data in a traditional circuit-switched telephone network.

Poorly suited to reach these key requirements for the DoD:
  High degree of global data distribution and replication (a super robust network, data is available where needed).
  Vendor independence.
  Interoperability among vendors and multiple technology generations (like the Internet).
  Access control "travels" with the data and does not need to be replicated each time the data is copied onto a different system (e.g., a laptop).

Digital Object Architecture

[Figure labels: Digital Object Repository, Data.]

Each item of Data carries:
  Access control
  Key management
  Provenance infrastructure
  Version control
  Metadata

A Digital Object Is Moved

Data management remains intact:
  Access control
  Key management
  Provenance infrastructure
  Version control
  Metadata

[Figure labels: Digital Object Repository, Data.]

A Solid Foundation

The Digital Object Architecture provides a solid foundation for the creation of:


A highly distributed, robust, and scalable data storage and retrieval
infrastructure.


Digital Objects are self-contained and don't depend on a separate centralized
data management subsystem. This dramatically improves scalability.


A highly secure data storage and retrieval infrastructure.


By eliminating a centralized security paradigm, which is a single point of failure
and highly vulnerable to attacks.


Security is distributed. A successful attack yields very little reward (each
digital object has to be attacked separately).


A highly "future proof", extensible, interoperable, and vendor independent
data storage and retrieval infrastructure.


By greatly reducing the complexity for exchanging data without breaking
access control, provenance, version control, etc.

The Digital Object Architecture provides a far superior foundation for realizing
these essential properties compared to today's paradigms.
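
The claim that management information travels with a self-contained object can be made concrete with a small sketch. It rests on loud assumptions: the object is a plain dictionary, repositories are dictionaries keyed by identifier, the field names are invented for the example, and access control is a simple reader list checked in code rather than enforced cryptographically as a real system would need.

```python
import json


def make_object(handle, payload, readers):
    """Bundle the payload with its own management information."""
    return {
        "identifier": handle,
        "data": payload,
        "access_control": {"readers": sorted(readers)},
        "provenance": [f"created as {handle}"],
        "version": 1,
    }


def move(obj, source, target, target_name):
    """Move the whole object: everything is serialized, not just the payload."""
    wire = json.dumps(obj)
    del source[obj["identifier"]]
    copy = json.loads(wire)
    copy["provenance"].append(f"moved to {target_name}")
    target[copy["identifier"]] = copy
    return copy


def read(store, handle, user):
    """Check the access control carried by the object itself."""
    obj = store[handle]
    if user not in obj["access_control"]["readers"]:
        raise PermissionError(f"{user} may not read {handle}")
    return obj["data"]


if __name__ == "__main__":
    server, laptop = {}, {}
    obj = make_object("12345/plan", "contents of the plan", {"alice"})
    server[obj["identifier"]] = obj
    move(obj, server, laptop, "laptop")
    print(read(laptop, "12345/plan", "alice"))
    print(laptop["12345/plan"]["provenance"])
```

Nothing on the laptop had to be configured for the reader list or the provenance trail to survive the move, which is the contrast with a separate data management system drawn on the earlier slides.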



Comparison to Data Communication

Circuit Switched (old phone) (~ traditional architectures)

  Data has no "intelligence" and is managed by a large central system (signaling network).

Packet Based (Internet) (~ Digital Object Architecture)

  Data management information is embedded with the data itself (packet header).

  The packet itself knows what it is, where it is coming from and where it is going to.

  The network can be simpler, far more flexible and robust.

Today, few people dispute that packet routing is superior to circuit switching for data communication.

A few decades ago the differences were not so clear. After all, data can easily be exchanged over a circuit-switched network.

Compared with today's paradigms, the Digital Object Architecture will lead to far more flexibility, diversity, technology independence, and overall usage for data storage and retrieval.

Example From the Real World


Circuit switched past: When a 5ESS switch was down, all calls
to the affected area were out, leaving a whole region without
communication.


Current Internet: On December 19, 2008 three undersea cables
were cut between the Middle East and Europe. Data traffic was
severely impacted but communication remained intact.

We expect the Digital Object Architecture to create a paradigm
shift for data storage and retrieval similar to the impact the
Internet had on data communication.


Digital Object Architecture

Where Are We?

Handle System
  Up and running since the early 90s
  Core architecture stable from the late 90s
  www.handle.net

Digital Object Repository
  In daily use in multiple projects
  Available open-source since the start of 2010
  www.dorepository.org
  Introductory article in Jan/Feb D-Lib Magazine

Digital Object Registry
  In daily use in multiple projects
  Available open-source since May, 2010
  www.doregistry.org





Information Management on Networks

[Figure labels: Client; Resolution; Resource Discovery (Search Engines, Metadata Databases, Catalogues, Guides, etc.) holding XML description records; Repositories / Collections holding XML documents; Identifier Resolution System.]

Information Management on Networks

[Figure labels: Administrative Client; Repositories / Collections holding XML documents; Resource Discovery (Search Engines, Metadata Databases, Catalogues, Guides, etc.) holding XML description records; Identifier Resolution System.]


Federation

Federation in information systems makes sense when

a set of varying features exists across the federates, which is the reason for multiplicity
  Includes organizational boundaries, locations, content types, etc.

a set of common features exists across federates, which is usually the reason to perform federation
  Shared topics, common audience, etc.


Challenges - Conceptual

Identifying the type of aggregation:
  Aggregate objects ahead of time, before query?
  Merge search responses from federates by issuing a distributed query?
  Or, anything in between?

Identifying the level of semantic interoperability
  Enforce complete semantic interoperability across all the data stored in the federates?
  Use only the least common denominator (from a data semantics point of view) among the federates?

Federate topology
  Are all federates directly connected to each other? (fully-connected mode)
  Is each federate connected to only its neighbor? (peer-peer mode)

These criteria can be visualized as a Federation Spectrum


[Figure: Federation Spectrum. Axis: level of data interoperability, ranging from No Semantic Interoperability (Ad Hoc Mix) to Complete Semantic Interoperability.]
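
The two ends of the aggregation question (aggregate ahead of time, or merge responses to a distributed query) can be contrasted in a few lines. This is a sketch under simple assumptions: each federate is just a list of records with a "title" field, matching is a substring test, and all function names are hypothetical.

```python
def aggregate_ahead_of_time(federates):
    """Harvest every record into one common index before any query arrives."""
    index = []
    for name, records in federates.items():
        for record in records:
            index.append({**record, "source": name})
    return index


def search_index(index, term):
    """Answer a query from the pre-built index alone."""
    term = term.lower()
    return [r for r in index if term in r["title"].lower()]


def distributed_query(federates, term):
    """Send the query to every federate at query time and merge the responses."""
    term = term.lower()
    merged = []
    for name, records in federates.items():
        # Each federate searches only its own records.
        hits = [r for r in records if term in r["title"].lower()]
        merged.extend({**r, "source": name} for r in hits)
    return merged


if __name__ == "__main__":
    federates = {
        "library": [{"title": "Handle System Overview"}, {"title": "Cataloguing"}],
        "archive": [{"title": "Using the Handle System"}],
    }
    index = aggregate_ahead_of_time(federates)
    print(search_index(index, "handle"))
    print(distributed_query(federates, "handle"))
```

Both paths return the same hits here; what differs is where the work and the freshness risk sit. The index must be kept up to date with every federate, while the distributed query depends on every federate answering at query time.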

Challenges - Technical

Depending on the criteria chosen for federation, various technical requirements arise. These may include:

Designing a storage model to aggregate objects into a common store that identifies the relationship between multiple metadata instances describing a single object

Designing cross-walking algorithms to translate and map heterogeneous data into a common model

Designing a query model to gather and rank search results from multiple federates

Ensuring scalability, reliability, and security without compromising performance
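
Two of the requirements above, cross-walking heterogeneous records into a common model and gathering and ranking results from several federates, can be sketched together. Everything here is an assumption made for illustration: the field names, the CROSSWALKS table, and the toy relevance score (occurrences of the term in the title) stand in for whatever schemas and ranking a real federation would use.

```python
# Hypothetical crosswalks: each maps one federate's field names to a common model.
CROSSWALKS = {
    "library": {"dc_title": "title", "dc_creator": "author"},
    "archive": {"name": "title", "madeBy": "author"},
}


def crosswalk(source, record):
    """Translate one federate-specific record into the common model."""
    mapping = CROSSWALKS[source]
    common = {mapping[k]: v for k, v in record.items() if k in mapping}
    common["source"] = source
    return common


def score(record, term):
    """Toy relevance: how many times the term occurs in the title."""
    return record.get("title", "").lower().count(term.lower())


def federated_search(federates, term):
    """Gather matches from every federate, then rank them together."""
    results = []
    for source, records in federates.items():
        for record in records:
            common = crosswalk(source, record)
            relevance = score(common, term)
            if relevance > 0:
                results.append((relevance, common))
    # Highest score first; the sort is stable, so ties keep gathering order.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in results]


if __name__ == "__main__":
    federates = {
        "library": [{"dc_title": "Handle System", "dc_creator": "A"}],
        "archive": [
            {"name": "Handle to handle mapping", "madeBy": "B"},
            {"name": "Unrelated", "madeBy": "C"},
        ],
    }
    for hit in federated_search(federates, "handle"):
        print(hit)
```

Fields with no entry in a crosswalk are dropped, which is the "least common denominator" end of the semantic interoperability question; a richer common model would need a richer mapping.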

Existing technologies

Digital Object Registry (basis for ADL-R)
  Provides a data model to encapsulate related metadata instances together
  Enables aggregation of objects from fully-connected mode to peer-peer mode
  Uses the Handle System to uniquely identify objects and metadata instances across all federates
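
The data-model idea, keeping related metadata instances together under the object they describe and giving each its own handle, can be sketched as below. This is not the Digital Object Registry's actual data model or API; the class, its methods, the schema names and the handles are hypothetical.

```python
from collections import defaultdict


class Registry:
    """Groups metadata instances by the handle of the object they describe."""

    def __init__(self):
        # object handle -> {metadata handle: (schema name, fields)}
        self._entries = defaultdict(dict)

    def register(self, object_handle, metadata_handle, schema, fields):
        self._entries[object_handle][metadata_handle] = (schema, dict(fields))

    def metadata_for(self, object_handle):
        """All metadata instances, from any federate, describing one object."""
        return dict(self._entries.get(object_handle, {}))

    def query(self, field, value):
        """Map a query onto the handles of the objects that match it."""
        return sorted(
            object_handle
            for object_handle, instances in self._entries.items()
            if any(f.get(field) == value for _, f in instances.values())
        )


if __name__ == "__main__":
    registry = Registry()
    registry.register("12345/course-1", "12345/md-1", "schema-a", {"subject": "safety"})
    registry.register("12345/course-1", "67890/md-9", "schema-b", {"topic": "safety"})
    registry.register("12345/course-2", "12345/md-2", "schema-a", {"subject": "logistics"})
    print(registry.metadata_for("12345/course-1"))
    print(registry.query("subject", "safety"))
```

Because both the object and each metadata instance carry handles, two federates can describe the same object in different schemas without either description overwriting the other.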