An Introduction to the Merritt Curation Repository

University of California Curation Center Team

California Digital Library

June 9, 2011

UC3 Summer Webinar Series


First, a word about the webinar series…


A forum for timely topics of interest to the UC
community


Highlighting projects, services, and developments in the
areas of digital preservation, web archiving, and data
curation


Intended to raise awareness of issues, and provide
information on useful resources and services available to
the UC community


2nd and 4th Thursday of the month, and as scheduled,
featuring UC3 staff and UC librarians, content managers,
and technologists


Teleconference

+1 (866) 740-1260, access code 9879016#

Webconference

http://bit.ly/jdjMAP



Some logistics…


Participant phones will be muted during the formal
presentation, but we will be monitoring the online chat


Slides, Q & A, and web and voice recordings will be posted
after each presentation


Schedule available at
http://www.cdlib.org/uc3/uc3webinars.html



Please suggest additional topics!

uc3@ucop.edu


Take the short survey

http://www.surveymonkey.com/s/XSGWP8R



Now on with the show…


Today’s topic is an introduction to the Merritt
curation repository


Who is it for?


What can it do?


Why use it?


What does it cost?


Next steps?


Q & A


What keeps you up at night?

Are there standards or
best practices I should
be aware of?

How much will
it cost?

How can I transfer my
content to an
appropriate curation
environment?

How do I know my
content is safe?

What’s the best
strategy to ensure
permanent availability?

Do I need to create
new derivatives just for
preservation purposes?

How can I get a
persistent reference
to my content?

What if my content
needs to evolve over
time?

Can I control
who can see my
content?


I have a good discovery
platform; how can I add
preservation services?

“There’s an app for that”

Automatic replication and
high-availability redundancy

Periodic fixity audit

Simple submission UI/API

METS “feeder” duplicates
existing DPR workflow

Model free

No packaging, format, or
metadata requirements

Strongly versioned

Integration with
EZID and DataCite

Curator-defined
access control rules

Modular micro-services “toolkit”

UC3 consultation

Storage at
$1.04/GB/year

Merritt repository
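
The periodic fixity audit listed among the features above can be illustrated with a minimal sketch. It is not Merritt's implementation: the function names, the choice of SHA-256, and the idea of keeping recorded digests in a plain mapping are all assumptions made for illustration. The sketch recomputes each file's digest and compares it with the digest recorded at deposit time.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(recorded, algorithm="sha256"):
    """Compare current digests against recorded ones.

    `recorded` is a hypothetical mapping of file path -> expected hex digest.
    Returns a list of (path, status) pairs, where status is
    "ok", "mismatch", or "missing".
    """
    results = []
    for path, expected in recorded.items():
        p = Path(path)
        if not p.is_file():
            results.append((str(path), "missing"))
        elif file_digest(p, algorithm) != expected:
            results.append((str(path), "mismatch"))
        else:
            results.append((str(path), "ok"))
    return results
```

Running such a check on a schedule, and repairing any "mismatch" or "missing" file from a replica, is one way replication and fixity auditing can work together.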


Merritt is available for use by all members of the UC
community


Libraries/archives/museums


ORU/MRUs


Faculty/staff


Centrally hosted by UC3/CDL on behalf of the UC
community


Economies of scale


Shared experience and expertise

Mediated through campus libraries

Modes of use: dark archive


Pro-active preservation, but no expectation of direct
end user access


Legacy DPR content contributed by campus libraries


Cultural heritage texts, master images, sound, moving
image, data sets






All DPR content will be automatically migrated to Merritt



Modes of use: bright archive


Provide preservation and end user access


NIH Healthy Pathways project on bio-demographics


Multi-institutional: UC Davis, University of Colorado, University of
Virginia, Syddansk University (Denmark)


Need to restrict access to project partners initially, with eventual
public access

Modes of use: bright archive


Content discovery: search

Modes of use: bright archive


Content discovery: browse

Modes of use: preservation “back end”


Preservation only; content discovery/delivery
provided by well-known external systems


Using direct hooks into Merritt to retrieve content



eScholarship

Open access publishing



Open Context

Archaeological data publishing



Investigating integration with Islandora/Drupal and Alfresco

Modes of use: distributed data grids


DataONE

Enable new science and knowledge creation
through universal access to data about life on earth and the
environment that sustains it


More information


Online help

http://merritt.cdlib.org/help


FAQ

http://merritt.cdlib.org/docs/merritt_handout.pdf


User’s guide

http://merritt.cdlib.org/docs/merritt_user_guide.pdf



UC3 contact

http://www.cdlib.org/uc3/contact.html

uc3@ucop.edu



Merritt cost model


UC3 provides technical infrastructure, data center
hosting, staff, monitoring, maintenance,
enhancements, help, outreach, consultation, etc.


Contributors are charged only for storage used, at
the UC3 recovery rate of $1.04/GB/year





Developing an “endowment” model: Pay once,
preserve forever


Will soon extend model for non-UC contributors

How does this compare?

Cost of a physical book in RLF          $4.62/year
Cost of a digital book in HathiTrust    $0.15/year
Cost of a digital book in Merritt       $0.06/year

† Gary Lawrence (2007), Internal analysis, CDL; ‡ Paul Courant and Matthew Nielsen (2010),
On the cost of keeping a book, HathiTrust.

Average collection sizes and costs

Collection        Objects   Size       Annual cost
CA DOE reports      8,000    12.0 GB   $12.48
Cal Cultures          420    65.6 GB   $68.22
eScholarship       46,425   118.6 GB   $123.34

A “cost calculator” spreadsheet is available at
http://www.cdlib.org/uc3/docs/Merritt-cost-calculator-v3.xlsx
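
The annual costs in the collection table above follow directly from the flat storage rate of $1.04/GB/year. The sketch below reproduces that arithmetic. The function name and the rounding rule (half-up to the nearest cent) are assumptions made for illustration, and it is not a substitute for the cost calculator spreadsheet.

```python
from decimal import Decimal, ROUND_HALF_UP

# UC3 recovery rate quoted in the slides, in dollars per GB per year.
RATE_PER_GB_YEAR = Decimal("1.04")


def annual_storage_cost(size_gb, rate=RATE_PER_GB_YEAR):
    """Return the yearly storage charge in dollars, rounded to cents.

    Rounding half-up is an assumption; the slides do not state a rule.
    """
    cost = Decimal(str(size_gb)) * rate
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    collections = [
        ("CA DOE reports", 12.0),
        ("Cal Cultures", 65.6),
        ("eScholarship", 118.6),
    ]
    for name, size_gb in collections:
        print(f"{name}: {size_gb} GB -> ${annual_storage_cost(size_gb)}/year")
```

Because only storage is charged, the cost scales linearly with size: a 1 TB collection, taken as 1,000 GB, comes to $1,040.00 per year under the same rate.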

Average ETD size and cost

Campus            ETD titles   Size      Annual cost
Berkeley          797          12.4 GB   $12.88
Davis             837          13.0 GB   $13.52
Irvine            390           6.1 GB   $6.30
Los Angeles       720          11.2 GB   $11.63
Riverside         192           2.9 GB   $3.10
San Diego         558           8.7 GB   $9.02
San Francisco *   560           8.7 GB   $9.05
Santa Barbara     325           5.0 GB   $5.25
Santa Cruz        155           2.4 GB   $2.50

Based on 2009 holdings in ProQuest

* UCSF based on total ETD holdings in Merritt

Average research data size and cost

Almost 50% of all research data is less than 1 GB

Source: Science 331:6018 (February 11, 2011): 692-693 <DOI: 10.1126/science.331.6018.692>

Size             Percentage   Annual cost
< 1 GB           48.3 %       < $1.04
1 to 100 GB      32.0 %       $1.04 to $104.00
100 GB to 1 TB   12.1 %       $104.00 to $1,040.00
> 1 TB            7.6 %       > $1,040.00

Next steps


UC3 is working with campus partners to determine
ongoing development and collection priorities



Annotation

Notification

Transformation

Characterization

Fixity / Linked data

Replication

IdM/Authn/Authz

Ingest, Access
Inventory, Queuing

Storage and Identity

Technology watch

Metadata standards

Policy and business model

Data management guidelines

Object and collection modeling

New content acquisition

Next steps

In production


Model-free objects


Submission via UI and API


Persistent identifiers


Format identification


Version provenance


Automated replication


Automated fixity audit


Role-based access control


Collections


Semantic index and search


Object/version/file download


In progress



Simplified update



Enhanced characterization (JHOVE2)


Faceted search and browse (XTF)


CMS/DAMS-like function (Islandora)

In planning



Simplified batch







UCTrust integration



Linked data




Transformation


Notification


Annotation


Support for NGTS/DLSTF
recommendations

We welcome your feedback on needs and priorities!

http://www.cdlib.org/uc3/contact.html

uc3@ucop.edu

Simplified update


Variant form of object update requiring the
submission of only the changed components


Client-side tools to simplify the creation of batch
manifests

#%checkm_0.7
#%profile | http://uc3.cdlib.org/registry/ingest/mani
#%prefix | mrt: | http://merritt.cdlib.org/terms#
#%prefix | nfo: | http://www.semanticdesktop.org/onto
#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash
http://merritt.cdlib.org/samples/goldenDragon.jpg | m
http://merritt.cdlib.org/samples/tumbleBug.jpg | md5
http://merritt.cdlib.org/samples/generalDrapery.jpg |
http://merritt.cdlib.org/samples/generalDrapery.jpg |
#%eof
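
A client-side tool for creating such a batch manifest might look like the following sketch. It is an illustration only, not a UC3 tool. The row layout (URL, hash algorithm, hash, separated by " | ") is inferred from the #%fields line above, and the use of MD5 from the fragment visible in the sample rows. The function names are hypothetical. The #%profile and #%prefix lines are truncated on the slide, so the sketch does not guess them and expects the caller to supply them through `extra_headers`.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def build_manifest(entries, extra_headers=()):
    """Build manifest text from (file_url, content_bytes) pairs.

    `extra_headers` holds any additional header lines (for example the
    profile and prefix declarations), passed through unchanged.
    """
    lines = ["#%checkm_0.7"]
    lines.extend(extra_headers)
    lines.append("#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash")
    for url, content in entries:
        lines.append(f"{url} | md5 | {md5_hex(content)}")
    lines.append("#%eof")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical URL and content, used only to show the output shape.
    print(build_manifest([("http://example.org/sample.txt", b"hello")]))
```

In practice the content bytes would be read from the files being deposited, and the URLs would point to where the repository can fetch them.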




Enhanced characterization


JHOVE2 next-generation framework for format-aware characterization
http://jhove2.org/



Automated extraction and inference of extensive technical
metadata significant for preservation analysis and planning


"Module": {
  "scope": "ICCModule",
  "Header": {
    "scope": "ICCHeader",
    "ProfileSize": {
      "unit": "byte",
      "value": 60960
    }
    ,"ProfileVersionNumber": "4.2.0.0"
    ,"ProfileDeviceClass_raw": "spac
    ,"ProfileDeviceClass_descriptive":
      "ColorSpace Conversion profile"
    ,"ColourSpace_raw": "RGB "
    ,"ColourSpace_descriptive": "rgbData
    ,"ProfileConnectionSpace_raw": "Lab "
    ,"ProfileConnectionSpace_descriptive": "labData

Enhanced discovery via XTF


eXtensible Text Framework
http://xtf.cdlib.org/



CDL developed/supported open source discovery platform


Robust, scalable faceted search and browse

CMS/DAMS-like function


Many campuses are looking for CMS/DAMS solutions


Investigating integration with Islandora to provide a Drupal
CMS/DAMS front-end to Merritt

http://islandora.ca/

http://drupal.org/

Questions?

Upcoming webinars

Date/time                       Topic

Wednesday, June 15, 12:30 pm    Data Sharing by Scientists: Practices and Perceptions
                                Carol Tenopir, Univ. Tennessee
                                Mike Frame, USGS

Thursday, June 30, 2:00 pm      The Data Management Planning Tool (DMP Tool)
                                Trisha Cruse, UC3

Thursday, July 14, 2:00 pm      Data as Publication
                                John Kunze, UC3
                                Catherine Mitchell, CDL Publishing Program

Thursday, July 28, 2:00 pm      Merritt: Depositing Content and Providing Access

Thursday, August 11, 2:00 pm    DCXL (Data Curation Excel)

http://www.cdlib.org/uc3/uc3webinars.html

Please take the webinar survey
http://www.surveymonkey.com/s/XSGWP8R

For more information

UC Curation Center

http://www.cdlib.org/uc3

http://www.cdlib.org/uc3/contact.html

uc3@ucop.edu

Stephen Abrams

Margaret Low

Lisa Colvin

David Loy

Patricia Cruse

Mark Reyes

Scott Fisher

Tracy Seneca

Erik Hetzner

Joan Starr

Greg Janée

Marisa Strong

John Kunze

Perry Willett

UC3 webinar series

http://www.cdlib.org/uc3/uc3webinars.html


Merritt repository

http://merritt.cdlib.org/

http://merritt.cdlib.org/help

http://merritt.cdlib.org/docs/merritt_handout.pdf

http://merritt.cdlib.org/docs/merritt_user_guide.pdf