An Introduction to the Merritt Curation Repository
University of California Curation Center Team
California Digital Library
June 9, 2011
UC3 Summer Webinar Series
First, a word about the webinar series…
• A forum for timely topics of interest to the UC community
  – Highlighting projects, services, and developments in the areas of digital preservation, web archiving, and data curation
  – Intended to raise awareness of issues, and provide information on useful resources and services available to the UC community
  – 2nd and 4th Thursday of the month, and as scheduled, featuring UC3 staff and UC librarians, content managers, and technologists
Teleconference: +1 (866) 740-1260, access code 9879016#
Webconference: http://bit.ly/jdjMAP
First, a word about the webinar series…
• Some logistics…
  – Participant phones will be muted during the formal presentation, but we will be monitoring the online chat
  – Slides, Q & A, and web and voice recordings will be posted after each presentation
  – Schedule available at http://www.cdlib.org/uc3/uc3webinars.html
  – Please suggest additional topics! uc3@ucop.edu
  – Take the short survey: http://www.surveymonkey.com/s/XSGWP8R
Now on with the show…
• Today’s topic is an introduction to the Merritt curation repository
  – Who is it for?
  – What can it do?
  – Why use it?
  – What does it cost?
  – Next steps?
  – Q & A
What keeps you up at night?
• Are there standards or best practices I should be aware of?
• How much will it cost?
• How can I transfer my content to an appropriate curation environment?
• How do I know my content is safe?
• What’s the best strategy to ensure permanent availability?
• Do I need to create new derivatives just for preservation purposes?
• How can I get a persistent reference to my content?
• What if my content needs to evolve over time?
• Can I control who can see my content?
• I have a good discovery platform; how can I add preservation services?
“There’s an app for that”
• Automatic replication and high-availability redundancy
• Periodic fixity audit
• Simple submission UI/API
• METS “feeder” duplicates existing DPR workflow
• Model free: no packaging, format, or metadata requirements
• Strongly versioned
• Integration with EZID and DataCite
• Curator-defined access control rules
• Modular micro-services “toolkit”
• UC3 consultation
• Storage at $1.04/GB/year
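The “periodic fixity audit” listed above can be pictured as recomputing each file's digest and comparing it with the value recorded at ingest. The following is a minimal sketch of that idea only, not Merritt's implementation; the function names, the choice of SHA-256, and the in-memory record of digests are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(recorded):
    """Compare recorded digests with freshly computed ones.

    `recorded` maps a file path to the digest stored at ingest time
    (a hypothetical structure). Returns the paths that fail the check,
    including files that have gone missing.
    """
    failures = []
    for path, expected in recorded.items():
        p = Path(path)
        if not p.is_file() or file_digest(p) != expected:
            failures.append(str(path))
    return failures


def demo():
    """Record a digest, audit, corrupt the file, and audit again."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "master.tif"
        target.write_bytes(b"original content")
        recorded = {str(target): file_digest(target)}
        before = audit(recorded)        # nothing has changed yet
        target.write_bytes(b"bit rot")  # simulate silent corruption
        after = audit(recorded)
        return before, after, str(target)


if __name__ == "__main__":
    print(demo()[:2])
```

Running such a check on a schedule, against every replica, is what turns a one-time checksum into an audit: a mismatch on one copy can then be repaired from another.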
Merritt repository
• Merritt is available for use by all members of the UC community
  – Libraries/archives/museums
  – ORU/MRUs
  – Faculty/staff
• Centrally hosted by UC3/CDL on behalf of the UC community
  – Economies of scale
  – Shared experience and expertise
Mediated through campus libraries
Modes of use: dark archive
• Proactive preservation, but no expectation of direct end user access
  – Legacy DPR content contributed by campus libraries
  – Cultural heritage texts, master images, sound, moving image, data sets
  – All DPR content will be automatically migrated to Merritt
Modes of use: bright archive
• Provide preservation and end user access
  – NIH Healthy Pathways project on bio-demographics
    • Multi-institutional: UC Davis, University of Colorado, University of Virginia, Syddansk University (Denmark)
    • Need to restrict access to project partners initially, with eventual public access
Modes of use: bright archive
• Content discovery: search
Modes of use: bright archive
• Content discovery: browse
Modes of use: preservation “back end”
• Preservation only; content discovery/delivery provided by well-known external systems
  – Using direct hooks into Merritt to retrieve content
  – eScholarship: open access publishing
  – Open Context: archaeological data publishing
  – Investigating integration with Islandora/Drupal and Alfresco
Modes of use: distributed data grids
• DataONE: “Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it”
More information
• Online help: http://merritt.cdlib.org/help
• FAQ: http://merritt.cdlib.org/docs/merritt_handout.pdf
• User’s guide: http://merritt.cdlib.org/docs/merritt_user_guide.pdf
• UC3 contact: http://www.cdlib.org/uc3/contact.html, uc3@ucop.edu
Merritt cost model
• UC3 provides technical infrastructure, data center hosting, staff, monitoring, maintenance, enhancements, help, outreach, consultation, etc.
• Contributors are charged only for storage used, at the UC3 recovery rate of $1.04/GB/year
• Developing an “endowment” model: Pay once, preserve forever
• Will soon extend model for non-UC contributors
How does this compare?
• Cost of a physical book in RLF †: $ 4.62/year
• Cost of a digital book in HathiTrust ‡: $ 0.15/year
• Cost of a digital book in Merritt: $ 0.06/year
† Gary Lawrence (2007), internal analysis, CDL; ‡ Paul Courant and Matthew Nielsen (2010), On the cost of keeping a book, HathiTrust.
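To make the comparison concrete, the three per-book figures can be expressed as multiples of the Merritt cost. This is only a worked-arithmetic sketch using the numbers quoted on the slide; the dictionary keys and function name are illustrative, and nothing here accounts for how each figure was originally derived.

```python
from decimal import Decimal

# Annual per-book costs quoted on the slide (US dollars per year).
COSTS = {
    "physical book in RLF": Decimal("4.62"),
    "digital book in HathiTrust": Decimal("0.15"),
    "digital book in Merritt": Decimal("0.06"),
}


def ratio_to_merritt(costs):
    """Express each cost as a multiple of the Merritt figure."""
    base = costs["digital book in Merritt"]
    return {name: cost / base for name, cost in costs.items()}


ratios = ratio_to_merritt(COSTS)
for name, r in ratios.items():
    print(f"{name}: {r:.1f}x the Merritt cost")
```

On these figures, the RLF cost is 77 times the Merritt cost and the HathiTrust cost is 2.5 times it.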
Average collection sizes and costs
Collection        Objects   Size       Annual cost
CA DOE reports      8,000   12.0 GB    $ 12.48
Cal Cultures          420   65.6 GB    $ 68.22
eScholarship       46,425   118.6 GB   $ 123.34
A “cost calculator” spreadsheet is available at
http://www.cdlib.org/uc3/docs/Merritt-cost-calculator-v3.xlsx
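The annual costs in the table follow directly from the $1.04/GB/year recovery rate. The sketch below reproduces that arithmetic; it is not the UC3 spreadsheet, and the function name and rounding rule (half-up to the nearest cent) are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_PER_GB_YEAR = Decimal("1.04")  # UC3 recovery rate quoted in the deck


def annual_cost(size_gb):
    """Annual storage cost in dollars for size_gb gigabytes, rounded to cents."""
    cost = Decimal(str(size_gb)) * RATE_PER_GB_YEAR
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Collection sizes from the table above, in GB.
collections = {
    "CA DOE reports": "12.0",
    "Cal Cultures": "65.6",
    "eScholarship": "118.6",
}
for name, size in collections.items():
    print(f"{name}: {size} GB -> ${annual_cost(size)}/year")
```

Decimal arithmetic is used rather than floats so that the cent-level rounding matches hand calculation.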
Average ETD size and cost
Campus            ETD titles   Size      Annual cost
Berkeley                 797   12.4 GB   $ 12.88
Davis                    837   13.0 GB   $ 13.52
Irvine                   390   6.1 GB    $ 6.30
Los Angeles              720   11.2 GB   $ 11.63
Riverside                192   2.9 GB    $ 3.10
San Diego                558   8.7 GB    $ 9.02
San Francisco *          560   8.7 GB    $ 9.05
Santa Barbara            325   5.0 GB    $ 5.25
Santa Cruz               155   2.4 GB    $ 2.50
Based on 2009 holdings in ProQuest
* UCSF based on total ETD holdings in Merritt
Average research data size and cost
• Almost 50% of all research data is less than 1 GB
Source: Science 331:6018 (February 11, 2011): 692-693 <DOI: 10.1126/science.331.6018.692>
Size              Percentage   Annual cost
< 1 GB            48.3 %       < $ 1.04
1 – 100 GB        32.0 %       $ 1.04 – 104.00
100 GB – 1 TB     12.1 %       $ 104.00 – 1,040.00
> 1 TB            7.6 %        > $ 1,040.00
Next steps
• UC3 is working with campus partners to determine ongoing development and collection priorities
[Diagram labels: Annotation; Notification; Transformation; Characterization; Fixity / Linked data; Replication; IdM / Authn / Authz; Ingest, Access; Inventory, Queuing; Storage and Identity; Technology watch; Metadata standards; Policy and business model; Data management guidelines; Object and collection modeling; New content acquisition]
Next steps
In production
• Model-free objects
• Submission via UI and API
• Persistent identifiers
• Format identification
• Version provenance
• Automated replication
• Automated fixity audit
• Role-based access control
• Collections
• Semantic index and search
• Object/version/file download
In progress
• Simplified update
• Enhanced characterization (JHOVE2)
• Faceted search and browse (XTF)
• CMS/DAMS-like function (Islandora)
In planning
• Simplified batch
• UCTrust integration
• Linked data
• Transformation
• Notification
• Annotation
• Support for NGTS/DLSTF recommendations
We welcome your feedback on needs and priorities!
http://www.cdlib.org/uc3/contact.html
uc3@ucop.edu
Simplified update
• Variant form of object update requiring the submission of only the changed components
• Client-side tools to simplify the creation of batch manifests
#%checkm_0.7
#%profile | http://uc3.cdlib.org/registry/ingest/mani
#%prefix | mrt: | http://merritt.cdlib.org/terms#
#%prefix | nfo: | http://www.semanticdesktop.org/onto
#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash
http://merritt.cdlib.org/samples/goldenDragon.jpg | m
http://merritt.cdlib.org/samples/tumbleBug.jpg | md5
http://merritt.cdlib.org/samples/generalDrapery.jpg |
http://merritt.cdlib.org/samples/generalDrapery.jpg |
#%eof
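A client-side tool of the kind described might work out which components changed between two versions and emit manifest rows for only those. The sketch below illustrates that idea under loud assumptions: it is not a Merritt tool, the function names and the example.org base URL are hypothetical, each row carries only the three fields visible in the sample (fileUrl, hashAlgorithm, hash), and the `#%` header lines are omitted because the profile URL in the sample is cut off. The real manifest profile may define additional fields.

```python
import hashlib


def md5_hex(data):
    """MD5 hex digest of a bytes payload."""
    return hashlib.md5(data).hexdigest()


def changed_components(previous, current):
    """Names in `current` that are new or whose digest differs from `previous`.

    Both arguments map a component name to its MD5 hex digest.
    Deleted components are not handled in this sketch.
    """
    return sorted(n for n, d in current.items() if previous.get(n) != d)


def manifest_rows(base_url, names, digests):
    """Format one 'fileUrl | hashAlgorithm | hash' row per named component."""
    return [f"{base_url}/{n} | md5 | {digests[n]}" for n in names]


# Hypothetical object with two files; only one changes between versions.
v1 = {"a.jpg": md5_hex(b"image bytes"), "notes.txt": md5_hex(b"first draft")}
v2 = {"a.jpg": md5_hex(b"image bytes"), "notes.txt": md5_hex(b"second draft")}

delta = changed_components(v1, v2)
rows = manifest_rows("http://example.org/samples", delta, v2)
print("\n".join(rows))
```

Here only `notes.txt` appears in the output, since the digest of `a.jpg` is unchanged between the two versions.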
Enhanced characterization
• JHOVE2 next-generation framework for format-aware characterization
  http://jhove2.org/
  – Automated extraction and inference of extensive technical metadata significant for preservation analysis and planning
"Module": {
  "scope": "ICCModule",
  "Header": {
    "scope": "ICCHeader",
    "ProfileSize": {
      "unit": "byte",
      "value": 60960
    },
    "ProfileVersionNumber": "4.2.0.0",
    "ProfileDeviceClass_raw": "spac",
    "ProfileDeviceClass_descriptive": "ColorSpace Conversion profile",
    "ColourSpace_raw": "RGB ",
    "ColourSpace_descriptive": "rgbData",
    "ProfileConnectionSpace_raw": "Lab ",
    "ProfileConnectionSpace_descriptive": "labData"
Enhanced discovery via XTF
• eXtensible Text Framework
  http://xtf.cdlib.org/
  – CDL developed/supported open source discovery platform
  – Robust, scalable faceted search and browse
CMS/DAMS-like function
• Many campuses are looking for CMS/DAMS solutions
• Investigating integration with Islandora to provide a Drupal CMS/DAMS front-end to Merritt
  http://islandora.ca/
  http://drupal.org/
Questions?
Upcoming webinars
Date/time                       Topic
Wednesday, June 15, 12:30 pm    Data Sharing by Scientists: Practices and Perceptions
                                Carol Tenopir, Univ. Tennessee; Mike Frame, USGS
Thursday, June 30, 2:00 pm      The Data Management Planning Tool (DMP Tool)
                                Trisha Cruse, UC3
Thursday, July 14, 2:00 pm      Data as Publication
                                John Kunze, UC3; Catherine Mitchell, CDL Publishing Program
Thursday, July 28, 2:00 pm      Merritt: Depositing Content and Providing Access
Thursday, August 11, 2:00 pm    DCXL (Data Curation Excel)
http://www.cdlib.org/uc3/uc3webinars.html
Please take the webinar survey
http://www.surveymonkey.com/s/XSGWP8R
For more information
UC Curation Center
http://www.cdlib.org/uc3
http://www.cdlib.org/uc3/contact.html
uc3@ucop.edu
Stephen Abrams
Margaret Low
Lisa Colvin
David Loy
Patricia Cruse
Mark Reyes
Scott Fisher
Tracy Seneca
Erik Hetzner
Joan Starr
Greg Janée
Marisa Strong
John Kunze
Perry Willett
UC3 webinar series
http://www.cdlib.org/uc3/uc3webinars.html
Merritt repository
http://merritt.cdlib.org/
http://merritt.cdlib.org/help
http://merritt.cdlib.org/docs/merritt_handout.pdf
http://merritt.cdlib.org/docs/merritt_user_guide.pdf