
Feb 2, 2013


Stanford Digital Library Research and Development Activities



Overview of Stanford’s Digital Library Research & Development
Efforts for the CLIR
Stanford Meeting, December 2009

Stanford University Libraries and Academic Information Resources (SULAIR) is actively
developing a robust and interoperable ecosystem
of digital library systems and modules
that, taken together, will provide a full set of content, tools, services and supporting
infrastructures for Stanford’s scholarly community and SULAIR’s information

Most broadly, Stanford’s digital library development breaks into three spheres:

Management of digital resources, supported by a Digital Object Registry (DOR)
and a set of tailored applications and workflow services for digital content and
pipeline management

Preservation of digital assets in the Stanford Digital Repository (SDR)

Access to digital resources (including discovery, delivery & dissemination),
through a combination of a next generation discovery environment and a series
of rich web applications providing contextually appropriate delivery of digital

Taken all together, these three spheres of development address the full lifecycle of
information management, and the loose coupling of component parts within and across
these spheres allows modules to develop, evolve, swap and recombine to handle new
content and service needs as they arise without requiring full-scale replacement of
underlying infrastructure or applications.

In addition to these broad spheres of digital library development, each described more
fully below, SULAIR has active development underway in a number of specific programs.
These projects are enumerated below.

Management: Digital Object Registry (DOR)

A digital object registry (DOR) is the heart of SULAIR’s digital asset management sphere.
Based on Fedora, it serves as a registry, not a repository, for all of Stanford’s digital
library content. It serves to register, track and relate versions and instances of digital
content across the digital library and their full lifecycle, from deposit or capture through
processing, preservation and indexing for discovery or transformation for delivery. DOR
also tracks all component parts of digital objects (both content and metadata) and
serves to orchestrate Stanford’s suite of digital library web services into pipeline
workflows (e.g., fetch metadata, transform object, create derivative, send to SDR, etc.).
Finally, DOR serves as the management UI and reporting engine for Stanford’s digital
activities. DOR and its suite of supporting web services are under active development
and enhancement; it serves as both the engine and intelligence of SULAIR’s Google
Books processing workflow, its end-user repository services, and soon, SULAIR’s internal
digitization pipelines.
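
The idea of chaining small services into a pipeline workflow can be illustrated with a
minimal Python sketch. It is not DOR's actual code or API: the DigitalObject class, the
four step functions and the run_pipeline helper are all hypothetical names, and each step
only records placeholder values where a real service call would go.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DigitalObject:
    """Hypothetical stand-in for an object tracked by a registry."""
    object_id: str
    metadata: Dict[str, str] = field(default_factory=dict)
    derivatives: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)  # names of completed steps


Step = Callable[[DigitalObject], None]


# Each step stands in for a call to one small web service (assumed, not real).
def fetch_metadata(obj: DigitalObject) -> None:
    obj.metadata["title"] = f"Title for {obj.object_id}"  # placeholder value


def transform_object(obj: DigitalObject) -> None:
    obj.metadata["format"] = "normalized"


def create_derivative(obj: DigitalObject) -> None:
    obj.derivatives.append(f"{obj.object_id}-thumbnail")


def send_to_sdr(obj: DigitalObject) -> None:
    obj.metadata["preserved"] = "true"


PIPELINE: List[Step] = [fetch_metadata, transform_object, create_derivative, send_to_sdr]


def run_pipeline(obj: DigitalObject, steps: List[Step]) -> DigitalObject:
    """Run each step in order and record it, so progress can be reported."""
    for step in steps:
        step(obj)
        obj.history.append(step.__name__)
    return obj


if __name__ == "__main__":
    result = run_pipeline(DigitalObject("obj-001"), PIPELINE)
    print(result.history)
```

Because the workflow is just an ordered list of steps, a step can be swapped or a new one
added without touching the others, which is the kind of loose coupling described above.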


Management: Hydra

Hydra is a rapid application development framework for building digital library and
institutional repository applications atop Fedora. Hydra is a joint development project
among Stanford, University of Virginia, University of Hull and DuraSpace. It is based on
Fedora, ActiveFedora and Ruby on Rails, and makes extensive use of Blacklight for
searching, browsing and viewing objects. The spirit of Hydra is a common body
(framework), many heads (tailored user interfaces). The Hydra/Fedora hybrid will
produce IR applications which support the following common functions:


Edit & Annotate

Set Permissions / Access Levels

Manage collections




View Object

These different functions, when recombined, will form the basis of many different
“Hydra heads”, each of which provides some digital library and/or institutional repository
services. Using a common platform ensures rapid deployment of new services and a
high degree of reuse of technical components across projects. Stanford plans to
leverage Hydra to

meet the following programmatic needs over the next several

Electronic Theses and Dissertations (ETD)

Accessioning born digital electronic library materials (EEMs)

Digitization workflow and management tool

Deposit & dissemination of open access articles

Domain-specific data curation environments

General purpose research and data archiving


Stanford’s Hydra-based ETD solution went into production in November of 2009; a
digital archiving solution and a set of accessioning tools for born digital library materials
are slated for release in Winter of 2010. The Hydra collaboration is also poised to grow
to encompass another half dozen to dozen institutions with similar needs and shared
technical environments in the next year.

Preservation: The Stanford Digital Repository (SDR)

The Stanford Digital Repository is a mature, first-generation preservation repository
based on the OAIS reference model. Currently containing more than 80 TB of unique
data (including text, images, manuscripts, audio files and geospatial data), SDR is now
being redesigned to become a “version 2.0” preservation system. The primary objectives
of this redesign are to apply the lessons learned and best practices uncovered in three
years of operational experience; to increase the flexibility of the system and its ease of
operation; to increase its capacity for ingesting different collections and content types;
to scale up its storage layer; and to bring its functions in line with our evolving
understanding of the preservation services most critical to support.

The second-generation SDR is currently in development. The main points of focus are
adopting Fedora as a metadata management system, along with a RESTful set of supporting
web services for preservation and management workflows; scoping SDR’s preservation
functions to a core set of bit preservation and administrative services; integrating with
other services beyond the core to provide richer access and higher-level preservation
services; and introducing a new storage abstraction & management layer.

Access: Next Generation Discovery Environment

In 2008, SULAIR launched a sustained initiative to provide its patrons with a
next-generation discovery environment. Based on a keen understanding of the shortfalls of
its current OPAC and the numerous silos of content and application functionality across all
its resources, the explicit objectives of this effort were:

To make it easier for patrons to discover and use the wealth of information
resources held at SULAIR by providing a comprehensive place to search and



create a unified discovery interface to all SULAIR resources, including the OPAC,
full text resources, finding aids, images, databases, licensed journals and web


create a feature-rich, highly interactive tool for searchers, with support for key
functions: search, browse, visualize, refine, expand, extract, one click access, etc.


create a highly usable interface that achieves specific discovery tasks, using
user-centered design, user research and testing, and use metrics to guide ongoing


create an extensible platform that can be enriched over time with both more
resources and more functionality, such as personalization, alerts, and integration
with other tools

The centerpiece of SULAIR’s enhanced discovery environment is SearchWorks, a
next-generation OPAC with an intuitive interface, faceted browsing, and a host of helpful
features. SearchWorks is based on the open source application Blacklight, with an
underlying Solr index. Blacklight was originally created at the University of Virginia;
Stanford (and many other institutions) have since adopted it as the platform for their
discovery environments.
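
Faceted browsing over a Solr index comes down to a search request that also asks Solr to
count matching records per field value. The sketch below only builds such a request URL
using standard Solr query parameters (q, fq, facet, facet.field, rows, wt). The base URL,
the core name "catalog" and the field names "format" and "language" are assumptions made
for illustration; they are not SearchWorks' actual configuration.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode


def build_facet_query(
    base_url: str,
    text: str,
    facet_fields: List[str],
    filters: Optional[Dict[str, str]] = None,
    rows: int = 10,
) -> str:
    """Build a Solr select URL that returns results plus facet counts."""
    params = [
        ("q", text),          # the user's search terms
        ("rows", str(rows)),  # number of records to return
        ("wt", "json"),       # response format
        ("facet", "true"),    # turn faceting on
    ]
    # One facet.field parameter per field the interface wants counts for.
    params += [("facet.field", name) for name in facet_fields]
    # Each facet value the user has already clicked becomes a filter query.
    for name, value in (filters or {}).items():
        params.append(("fq", f'{name}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core and field names.
    print(build_facet_query(
        "http://localhost:8983/solr/catalog",
        "parker manuscripts",
        ["format", "language"],
        filters={"format": "Book"},
    ))
```

Selecting a facet value in the interface then amounts to repeating the same query with one
more fq filter, which is why the refine step stays cheap for the application layer.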


and solr have proven highly scalable, admirably easy to customize, and
flexible enough to work for everything from bibliographic metadata (ranging from MARC
to MODS to Dublin Core) to a front
end for digital repositories with full text and complex
. More information on Blacklight can be found at

The next stage of development for this effort will be the deliberate expansion of the
Blacklight community to include more contributors; increasing the functionality of the
core Blacklight code base (integrating JPEG2000 image streaming, e.g.); expanding
Stanford’s scope of covered resources (e.g., adding in Special Collections finding aids
and image bases) to SearchWorks; as well as tailoring domain-specific views and
applications (e.g., for Music, for digital manuscripts, etc.)

Access: Content Delivery via the “Digital Stacks”

We use the term “digital stacks” as a collective term for the suite of user
applications and modules that deliver digital objects directly to end users. This includes
page-turning applications, image viewing and manipulation services, and audio and video
streaming environments. It also includes more specialized environments for map
presentation, and download of various types of data resources.

Rather than developing a single application to provide all these functions in one
environment, we are seeking modular components of vendor and community
software that can be combined into a heterogeneous environment. This is an area where
we are actively seeking collaboration.

Digital Manuscripts

Stanford, along with Corpus Christi College, Cambridge and Cambridge University
Library, recently completed a five year project to digitize and make online the co
of the Matthew Parker Library, a collection of 559 Anglo Saxon manuscripts in
Cambridge. Based on the experience and expertise gained during the project, and
several explorations into interoperability of the Parker site with other digital
manuscript services, Stanford is
actively exploring extensions to its digital manuscript holdings and offerings. Principal
among these are interoperability of digital manuscript contents from
repositories, interoperable metadata to feed aggregated or federated discovery systems,
and interoperable services to allow for the seamless flow of scholarly research activities
across sites and web environments, regardless of the built-in capacities of a given
repository. Stanford is actively seeking collaborators with content, technical and/or
scholarly resources in this space.


Full Text Processing and Delivery Environment

A core capability of Stanford’s digital library must be the ability to handle and integrate
full text objects across their full lifecycle. This includes support for full text regardless
of source (local or Google-scanned, purchased and licensed content, institutional
repository content, etc.)

process full text locally, applying tailored algorithms against them to populate
discovery systems (full text indexing, vector analysis, entity extraction,
clustering & classification, structural markup, etc.)

deliver curated collections through a rich, robust page-turning environment

selectively load texts into processing environments for text mining and scholarly

preserve all locally loaded text in SDR to enable reuse and long-term access
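
Of the processing steps listed above, full text indexing is the simplest to illustrate. The
following is a minimal sketch of an inverted index with term counts, written with the Python
standard library only. It is an assumption-laden toy and not Stanford's pipeline: a
production system would hand this job to an engine such as Solr, and the function names and
sample documents here are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Map each term to the documents containing it, with occurrence counts."""
    index: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index


def search(index: Dict[str, Dict[str, int]], query: str) -> List[str]:
    """Return ids of documents containing every query term, best match first."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], {}))
    for term in terms[1:]:
        matches &= set(index.get(term, {}))
    # Rank by total term count; break ties by id so the order is stable.
    return sorted(matches, key=lambda d: (-sum(index[t][d] for t in terms), d))


if __name__ == "__main__":
    sample = {
        "a": "Parker manuscripts in Cambridge",
        "b": "Manuscripts and maps, manuscripts online",
        "c": "Geospatial data",
    }
    print(search(build_index(sample), "manuscripts"))
```

The same term-to-document counts are also the raw material for the vector analysis and
clustering steps, which is one reason to process text locally rather than only link to it.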

Specifically with regard to text mining, Stanford has identified and begun the
specification of an infrastructure and set of services to provide large-scale analysis of
texts using research algorithms for natural language processing, machine translation,
machine learning, and computational linguistics, as well as literary, linguistic and
historical analysis.


E-book and E-journal Record & Content Loading

Given the significant investment in licensed and purchased e-books and e-journals, fully
integrating these resources into our digital library is a high priority. Efforts on this front
include loading bibliographic metadata for e-book and other e-resources into our OPAC
and next generation discovery environment. We are also pursuing locally loading these
into our preservation system, as a guarantee of long-term access, and finally have

the groundwork to index these and include them in both our full-text discovery and text
mining environments. The structure and quality of these records and content are highly
variable across providers, making this a challenge on all fronts. Normalization of this
data and metadata into well defined structures with quality metadata, whether through
technical processing or the application of standardization or market forces, would be a
welcome development.

Data Curation: Geospatial, Social Science and Marine Science

Stanford University has widespread (and in many cases, highly distributed) data
production and use needs, and the Libraries’ data holdings and curation services are
beginning to blossom as part of an overall University-wide trend. SULAIR has active
programs underway in three distinct domains. These are geospatial data (building on
the Libraries’ National Geospatial Digital Archive (NGDA) project, funded by the Library of
Congress); social science data, for the deposit and management of, and access to, both
community data sets of broad interest to the social science community and
primary and/or partially processed data sets produced by local research groups; and
marine science data, documenting the current and historical environmental conditions
the Northern California coast. In each of these cases, Stanford’s librarians are working
with faculty and research groups on systematic ways to tap the Libraries’ digital library
infrastructure to provide domain-specific solutions to the particular needs of


Electronic Theses and Dissertations (ETD)

Stanford launched the first phase of its ETD solution in Fall of 2009. The highlights of
the overall solution are 1) a Fedora repository for the deposit, management and delivery
of the electronic theses and all associated files (both supplemental data and related
permissions letter); 2) integration with the University’s PeopleSoft Student
Administration system to manage the student record and degree conferral process; 3)
integration with the Stanford Digital Repository for long-term preservation of the ETDs;
4) integration with the Libraries’ integrated library system (ILS), for managing the
cataloging and traditional library workflows associated with theses and dissertations
(updated to account for ETDs); and 5) delivery of all ETDs to Google Scholar for free,
worldwide indexing and delivery. In addition to the Fedora repository, the solution uses
the Hydra application framework as a deposit and management front end, and in the
next phase of development, Blacklight for its search, browse and viewing functionality.

End-to-End Management of Everyday Electronic Materials

An important and growing component of Stanford’s collections consists of relatively simple
digital materials published via the Web. Generally PDFs, office documents or simple
HTML files, this class of object has been termed “EEMs”, or everyday electronic
materials. With funding from the Mellon Foundation, Stanford is building the tools and
infrastructure necessary to enable its selectors, acquisitions and cataloging staff to
accession these materials, preserve them over the long term, and provide access to them
via our digital stacks. As with ETDs, the solution will comprise integration of DOR,
Hydra, our ILS and SDR for the end-to-end management of this class of resources.

SALT & AIMS: Archival Solutions for Born Digital Materials

In anticipation of the ever-growing need for solutions to archive born digital special
collections, Stanford has two active programs for building its capacity in this space. The
first, SALT, is a project that aims to both reduce the cost of traditional archival
processing and increase the effectiveness of archival access through the application of
assisted tools (entity extraction, linked data for authority files and
visualization tools). The second, AIMS, is a collaborative project with the University of
Virginia, Yale and University of Hull. Funded by the Mellon Foundation, it will process,
preserve and present digital components of thirteen different special collections, and
help establish best practices for archiving digital materials as well as the tools to assist
with their appraisal, processing and presentation. Both projects will also deal with the
most effective way to present hybrid physical and digital collections.

Digital Forensics

Stanford began building its capacity to recover files from handheld and legacy media
(hard drives, floppy disks, tape, optical media, etc.) at the beginning of 2009. The
primary objective of this effort is to be able to recover and preserve for future access the
contents of thousands of digital information carriers in Stanford’s university archives
and special collections. This requires a special set of methods and tools to be able to
retrieve such files in a forensically sound manner, without corrupting or updating the
files in the process. It also raises new questions about the process and ethics of
performing such recoveries. Stanford is beginning to actively work with others in this
space, including the British Library, Emory University and the Maryland Institute for
Technology in the Humanities (MITH) on creating the capacity and process for handling
these materials.
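
One small part of working "without corrupting or updating the files" can be shown in code:
checking that a file's contents are identical before and after it is copied off a carrier.
The sketch below uses only the Python standard library, and the function names are
hypothetical. It is not a description of Stanford's tooling, and it assumes the medium is
already mounted read-only; real forensic practice typically also relies on hardware write
blockers and full disk images, which this example does not attempt.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, Union


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 digest by reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:  # opened for reading only
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_fixity(source: Union[str, Path], dest_dir: Union[str, Path]) -> Dict[str, str]:
    """Copy a file and confirm source and copy both match the original digest."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    before = sha256_of(source)

    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    shutil.copy2(source, target)  # copy2 also carries over file timestamps

    after = sha256_of(source)
    copied = sha256_of(target)
    if not (before == after == copied):
        raise ValueError(f"fixity check failed for {source}")
    # The digest can be stored alongside the copy for later audits.
    return {"file": source.name, "sha256": before}
```

Recording the digest at the moment of recovery gives later preservation steps a fixed
reference point for detecting any change to the recovered file.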

Drupal-Based Web Publishing

Stanford Libraries’ web presence comprises tens of thousands of (largely static) web
pages. The Libraries have adopted Drupal as a web publishing and content management
platform, and have been steadily migrating content to the new platform over the past
year. This migration will continue for the foreseeable future. Keeping current with
Drupal’s rapidly advancing technology base, while adopting the functions and templates
that best fit the Libraries’ needs for web publishing and social networking, is an area of
constant programmatic interest.

Mobile User Interfaces

Stanford has recently begun investigating the systematic development of mobile
user interfaces to its primary applications and content bases. This will be an area of
increased interest and activity in the coming quarters.

Technology Consolidation and Interoperation Framework

Consolidating the technologies supporting its digital library is an issue of paramount
importance for Stanford. This consolidation not only provides medium- and long-term
sustainability for services and infrastructure, but also gives short-term boosts in time to
release new functionality due to the high degree of reuse among projects.


The main pillars of Stanford’s digital library environment are Fedora, for metadata
management; Solr, as a performant indexed data store; Ruby on Rails, for rapid
development of rich user applications; and atomic web services presented in RESTful
form for back end processing.