Dr Birgit Plietzsch
Arts Computing Advisor
bp10@st-andrews.ac.uk
&
Swithun Crowe
Developer for Arts and Humanities Computing projects
cs2@st-andrews.ac.uk

IT Services, University of St Andrews

1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco

Digital Preservation is …
• the active management of digital information over time to ensure its accessibility
• long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required for.
• Long-term is defined as "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely".
• Retrieval means obtaining needed digital files from the long-term, error-free digital storage, without possibility of corrupting the continued error-free storage of the digital files.
• Interpretation means that the retrieved digital files, files that, for example, are of texts, charts, images or sounds, are decoded and transformed into usable representations. This is often interpreted as "rendering", i.e. making it available for a human to access. However, in many cases it will mean able to be processed by computational means.
(Source: Wikipedia)

• Legal requirements (e.g. Freedom of Information Act)
• Protection of institutional intellectual property
• Funding body requirements
  • until 2008: Arts and Humanities Data Service (national depository for arts and humanities research data)
  • no such body exists now for the Arts and Humanities
  • for other subjects, national support is patchy
• Moral obligations
  • protection of cultural and corporate memory

www.rps.ac.uk
• proceedings of the Scottish Parliament from the first surviving act of 1235 to the union of 1707
• 10 years of research
• no print publication
• c. 16.5m words
• issues:
  • inconsistent editorial practices
  • obsolescence of software originally used
  • long-term sustainability of research data

• Pilot project
• Scope:
  • data contained in electronic resources produced within the Faculty of Arts, University of St Andrews
• Aims:
  • ensure long-term sustainability of RPS data
  • investigate the requirements of digital archiving and obtain experience
  • meet funding body requirement
  • flexible implementation (to allow for additional future uses)

[Figure: Concepts and Properties of Archives and Hosting in the Strategy and their Relationships. © Charles Beagrie Ltd 2009. Creative Commons Attribution-Share Alike 3.0. Key: solid colour represents core properties and fading colour represents weaker properties of archives and hosting services.]

2. The DAP Open Archival Information System

• An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
• reference model: ISO 14721:2003

Seven functions
• Ingest
• Archival Storage
• Data Management
• Administration
• Preservation Planning
• Access
• Management

SIP: Submission Information Package
AIP: Archival Information Package
DIP: Dissemination Information Package

Implementation
• Content Information:
  • XML
  • TIFF
  • DOC
  • etc.
• Preservation Description Information:
  • PREMIS
• Descriptive Information:
  • MODS
• Packaging Information:
  • METS

• What needs to be preserved?
  • data
  • layout
  • functionality
  • user experience
• What are the significant properties?
  • generic low-level properties (e.g. basic data unit, byte-level encoding, data type, and logical schema)
  • data type specific properties (example: text)
    • underlying abstract forms (font, spacing, layout)
    • sub-properties (e.g. font type, style, family, size, colour)
• How do we preserve?
  • bit stream preservation
  • emulation
  • migration
• Adopted approach:
  • data is preserved
  • combination of bit stream preservation and file format migration upon ingest
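
Bit stream preservation is usually backed by fixity checks: a checksum is recorded at ingest and recomputed later to confirm that the stored bytes have not changed. The slides do not say how the DAP does this, so the following is only a minimal sketch. The class and method names (`FixityCheck`, `sha256Hex`, `isUnchanged`) and the choice of SHA-256 are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of the DAP code): SHA-256 fixity values for a bit stream. */
final class FixityCheck {

    /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** True if the content still matches the checksum recorded at ingest. */
    static boolean isUnchanged(byte[] content, String recordedChecksum) {
        return sha256Hex(content).equalsIgnoreCase(recordedChecksum);
    }

    public static void main(String[] args) {
        byte[] original = "archived content".getBytes(StandardCharsets.UTF_8);
        String recorded = sha256Hex(original);            // stored alongside the object at ingest
        byte[] damaged = "archived c0ntent".getBytes(StandardCharsets.UTF_8);
        System.out.println(isUnchanged(original, recorded));
        System.out.println(isUnchanged(damaged, recorded));
    }
}
```

In a real archive the recorded checksum would live in the preservation metadata rather than in memory, and the check would be re-run on a schedule.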
• description needs of different types of material
  • electronic resources
  • digital images
  • video
  • research papers
  • University records
  • etc.
• introduce flexibility
  • future wider uses of the archive

Resource Discovery Metadata
• expressed in MODS
• 3 layers
• use for pilot
• more models can be developed
[Diagram of the three layers: Project; Resource type (Research data, Documentation, Code); Digital object]

Monolithic approach
• DSpace
  • issues with Archival Storage and Data Management functions
• EPrints
  • issues with Administration and Access functions
• RODA
  • technical issues
No support for Preservation Planning

Breakdown into OAIS requirements
• Repository framework: Fedora Commons
  • issues with suitable front end for Ingest, Access, Preservation Planning, or Administration functions
  • highly customisable
• Metadata
  • MODS
  • METS
  • PREMIS
Access
• Plato
• Testbed

Software used
• Alfresco
  • www.alfresco.com
• Fedora Commons
  • fedora-commons.org
• Planets Suite
  • www.openplanetsfoundation.org
[Diagram mapping the software to OAIS functions. Labels: Ingest; Archival Storage & Data Management; Preservation Planning; Administration; Management; Share; Explorer; Records Management]

• Version control of AIPs
  • Alfresco / Fedora interaction?
• Access front end
  • Fedora Commons front ends do not normally support OAIS functions
• Can extra properties be added to folders and files in Records Management site?
We welcome ideas that might help us resolve the above three issues.

3. Developing the OAIS Ingest function in Alfresco

Introduction
• FITS and PREMIS
  • Technical metadata
• RPS and MODS
  • Resource discovery metadata
• Antivirus scanning
• METS
  • Wrapping files and metadata

Introduction
• FITS (File Information Tool Set)
  • http://code.google.com/p/fits/
  • Consolidates file format metadata from 3rd party tools
    • Jhove, DROID, NLNZ ME, Exiftool and others
  • Output as XML
• PREMIS (PREservation Metadata: Implementation Strategies)
  • http://www.loc.gov/standards/premis/
  • Data dictionary of semantic units, maps to XML
• Transform FITS XML to PREMIS using XSLT

The action
• Text property defined in custom aspect for storing FITS XML in node metadata
• Create temporary file containing content of node
• Run FITS on temporary file
• Put output into custom property
• Later on, transform this to PREMIS XML
• Can be run as space rule
• Compile to AMP using Alfresco SDK

fits-action-context.xml
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
    <bean id="fits-action-messages" class="org.alfresco.i18n.ResourceBundleBootstrapComponent">
        <property name="resourceBundles">
            <list><value>alfresco.module.FitsAction.fits-action-messages</value></list>
        </property>
    </bean>
    <bean id="fits-model-bootstrap" parent="dictionaryModelBootstrap" depends-on="dictionaryBootstrap">
        <property name="models">
            <list><value>alfresco/module/FitsAction/context/fitsModel.xml</value></list>
        </property>
    </bean>
    <bean id="fits-action" class="uk.ac.st_andrews.repo.action.executer.FitsActionExecuter" parent="action-executer">
        <property name="serviceRegistry"><ref bean="ServiceRegistry"/></property>
    </bean>
</beans>

FitsActionExecuter
package uk.ac.st_andrews.repo.action.executer;

// outline only: method bodies omitted
public class FitsActionExecuter extends ActionExecuterAbstractBase
{
    public void setServiceRegistry(ServiceRegistry serviceRegistry);
    protected void addParameterDefinitions(List<ParameterDefinition> paramList);
    protected void executeImpl(Action action, NodeRef actionedUponNodeRef);
}

FitsActionExecuter.executeImpl (fragment)
// make sure node exists
if (!nodeService.exists(actionedUponNodeRef))
{
    throw new Exception("no node");
}

// make sure that node has fits aspect
QName fitsAspect = QName.createQName(fitsURI, "fitsAspect");
if (!nodeService.hasAspect(actionedUponNodeRef, fitsAspect))
{
    this.nodeService.addAspect(actionedUponNodeRef, fitsAspect, null);
}

// create new FITS instance
Fits fits = new Fits();
Fits.allowRounding = true;
FitsOutput result = null;

FitsActionExecuter.executeImpl (fragment cont.)
// put input into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File inputFile =
    TempFileProvider.createTempFile("FitsActionExecuter_", "." + fileName);
reader.getContent(inputFile);

// transform into technical metadata
result = fits.examine(inputFile);
Document doc = result.getFitsXml();

// put result of transformation into output
XMLOutputter serializer = new XMLOutputter(Format.getPrettyFormat());
String output = serializer.outputString(doc);

// get property to write to
QName fitsProp = QName.createQName(fitsURI, "fitsOutput");
nodeService.setProperty(actionedUponNodeRef, fitsProp, output);

Fragment of FITS XML showing conflicting file formats
<identification status="CONFLICT">
    <identity format="Microsoft Word" mimetype="application/msword">
        <tool toolname="Exiftool" toolversion="8.25" />
        <tool toolname="file utility" toolversion="5.04" />
        <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
        <tool toolname="ffident" toolversion="0.2" />
    </identity>
    <identity format="OLE2 Compound Document Format" mimetype="application/octet-stream">
        <tool toolname="Droid" toolversion="3.0" />
        <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/111</externalIdentifier>
    </identity>
</identification>

Corresponding fragment of PREMIS XML
<premis:format>
    <premis:formatDesignation>
        <premis:formatName>Microsoft Word</premis:formatName>
    </premis:formatDesignation>
</premis:format>
<premis:format>
    <premis:formatDesignation>
        <premis:formatName>OLE2 Compound Document Format</premis:formatName>
    </premis:formatDesignation>
    <premis:formatRegistry>
        <premis:formatRegistryName>Droid (3.0)</premis:formatRegistryName>
        <premis:formatRegistryKey>fmt/111</premis:formatRegistryKey>
        <premis:formatRegistryRole>puid</premis:formatRegistryRole>
    </premis:formatRegistry>
</premis:format>
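
The FITS-to-PREMIS step can be illustrated with the standard Java XSLT API (`javax.xml.transform`). This is a sketch only: the stylesheet below is not the project's own, it handles just the `identity` and `externalIdentifier` elements shown in the fragment, and it assumes un-namespaced FITS input as printed on the slide. The PREMIS 2 namespace URI and the `premis:objectCharacteristics` wrapper (used here to give the output a single root element) are likewise assumptions, as is the class name `FitsToPremis`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: maps FITS identification data to PREMIS format elements. */
final class FitsToPremis {

    // Illustrative stylesheet, written for this example only.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:premis='info:lc/xmlns/premis-v2'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/identification'>"
      + "<premis:objectCharacteristics>"
      + "<xsl:apply-templates select='identity'/>"
      + "</premis:objectCharacteristics>"
      + "</xsl:template>"
      // one premis:format per FITS identity, so conflicting formats are all kept
      + "<xsl:template match='identity'>"
      + "<premis:format>"
      + "<premis:formatDesignation>"
      + "<premis:formatName><xsl:value-of select='@format'/></premis:formatName>"
      + "</premis:formatDesignation>"
      + "<xsl:apply-templates select='externalIdentifier'/>"
      + "</premis:format>"
      + "</xsl:template>"
      + "<xsl:template match='externalIdentifier'>"
      + "<premis:formatRegistry>"
      + "<premis:formatRegistryName>"
      + "<xsl:value-of select=\"concat(@toolname, ' (', @toolversion, ')')\"/>"
      + "</premis:formatRegistryName>"
      + "<premis:formatRegistryKey><xsl:value-of select='normalize-space(.)'/></premis:formatRegistryKey>"
      + "<premis:formatRegistryRole><xsl:value-of select='@type'/></premis:formatRegistryRole>"
      + "</premis:formatRegistry>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to a FITS identification fragment and returns the result as a string. */
    static String transform(String fitsXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(fitsXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("FITS to PREMIS transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String fits =
            "<identification status='CONFLICT'>"
          + "<identity format='Microsoft Word' mimetype='application/msword'/>"
          + "<identity format='OLE2 Compound Document Format' mimetype='application/octet-stream'>"
          + "<externalIdentifier toolname='Droid' toolversion='3.0' type='puid'>fmt/111</externalIdentifier>"
          + "</identity></identification>";
        System.out.println(transform(fits));
    }
}
```

Emitting one `premis:format` per `identity` mirrors the fragments above, where a FITS conflict produces two format entries rather than a single resolved one.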
Introduction
• Records of the Parliaments of Scotland marked up in thousands of XML documents
  • http://www.rps.ac.uk
• Using Text Encoding Initiative (TEI)
  • http://www.tei-c.org/index.xml
• TEI headers contain resource discovery metadata
• Extract metadata from documents and populate custom metadata fields
• Can be run as space rule
• Compile as AMP using Alfresco SDK

TEI example
<TEI.2 id="_william_and_mary_t1689_3_6_d6_trans" n="william_and_mary_trans">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>A committee appointed for controverted elections</title>
            </titleStmt>
            <editionStmt>
                <edition n="session">william_and_mary_t1689_3_1_d2_trans</edition>
            </editionStmt>
            <publicationStmt>
                <date>16890314</date>
            </publicationStmt>
        </fileDesc>
    </teiHeader>
    <text>...</text>
</TEI.2>
• id attribute: Unique ID for document
• n attribute: Document belongs to translated version of records from reign of William and Mary
• title: Main heading in document
• edition: Pointer to session that document belongs to
• date: Date of document, in YYYYMMDD format

RPSMetadataExtracter
package uk.ac.st_andrews.repo.content.metadata;

public class RPSMetadataExtracter
    extends org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
{
    public RPSMetadataExtracter();
    protected Map<String, Serializable> extractRaw(ContentReader reader);
}

RPSMetadataExtracter.extractRaw
// set up parser
SAXParser sp = spf.newSAXParser();
InputStream cis = reader.getContentInputStream();
InputSource is = new InputSource(cis);
RPSSaxParser teip = new RPSSaxParser();

// do parsing
teip.setProperties(map);
sp.parse(is, teip);
map = teip.getProperties();

// loop over properties found
Set s = map.entrySet();
Iterator it = s.iterator();
while (it.hasNext())
{
    Map.Entry m = (Map.Entry) it.next();
    putRawValue((String) m.getKey(), (String) m.getValue(), rawProperties);
}

RPSSaxParser
package uk.ac.st_andrews.repo.content.metadata;

public class RPSSaxParser extends org.xml.sax.helpers.DefaultHandler
{
    public void setProperties(Map<String, Serializable> prop);
    public Map<String, Serializable> getProperties();
    public void startElement(String uri, String localName, String qName, Attributes attributes);
    public void endElement(String uri, String localName, String qName);
    public void characters(char[] ch, int start, int length);
    private void handleID(String id);
    private void handleDate(String d);
}

RPSSaxParser
// property names
private static final String KEY_ID = "rpsID";
private static final String KEY_REIGN = "rpsReign";
private static final String KEY_VERSION = "rpsVersion";
private static final String KEY_HEADING = "rpsHeading";
private static final String KEY_SESSION = "rpsSession";
private static final String KEY_DATE = "rpsDate";
private static final String KEY_TITLE = "cmTitle";

// some properties get set in RPSSaxParser.characters
if (inTitle)
{
    rawProperties.put(KEY_TITLE, new String(ch, start, length));
}
else if (inSession)
{
    rawProperties.put(KEY_SESSION, new String(ch, start, length));
}
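
To show the extraction end to end without an Alfresco installation, here is a self-contained sketch of a SAX handler that pulls the same four values out of the TEI header. It is not the project's `RPSSaxParser`: the class name `TeiHeaderSketch` and the `extract` helper are hypothetical, and only the property names (`rpsID`, `cmTitle`, `rpsSession`, `rpsDate`) are taken from the slides. One deliberate difference: it buffers text in a `StringBuilder` and stores the value in `endElement`, because a SAX parser is allowed to deliver one run of text in several `characters` calls.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for RPSSaxParser: collects TEI header metadata into a map. */
final class TeiHeaderSketch extends DefaultHandler {
    private final Map<String, String> properties = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHeader;        // only header elements are of interest
    private boolean sessionEdition;  // true while inside <edition n="session">

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0);
        if ("TEI.2".equals(qName)) {
            properties.put("rpsID", attributes.getValue("id"));
        } else if ("teiHeader".equals(qName)) {
            inHeader = true;
        } else if ("edition".equals(qName)) {
            sessionEdition = "session".equals(attributes.getValue("n"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);  // may be called more than once per text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if ("teiHeader".equals(qName)) {
            inHeader = false;
        } else if (inHeader && "title".equals(qName)) {
            properties.put("cmTitle", value);
        } else if (inHeader && "edition".equals(qName) && sessionEdition) {
            properties.put("rpsSession", value);
        } else if (inHeader && "date".equals(qName)) {
            properties.put("rpsDate", value);
        }
        text.setLength(0);
    }

    /** Parses a TEI document held in a string and returns the extracted properties. */
    static Map<String, String> extract(String teiXml) {
        try {
            TeiHeaderSketch handler = new TeiHeaderSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(teiXml)), handler);
            return handler.properties;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse TEI document", e);
        }
    }

    public static void main(String[] args) {
        String tei = "<TEI.2 id='_william_and_mary_t1689_3_6_d6_trans' n='william_and_mary_trans'>"
            + "<teiHeader><fileDesc>"
            + "<titleStmt><title>A committee appointed for controverted elections</title></titleStmt>"
            + "<editionStmt><edition n='session'>william_and_mary_t1689_3_1_d2_trans</edition></editionStmt>"
            + "<publicationStmt><date>16890314</date></publicationStmt>"
            + "</fileDesc></teiHeader><text>...</text></TEI.2>";
        System.out.println(extract(tei));
    }
}
```

The reign and version values, which the project derives in `handleID`, are left out here because the slides do not show how the ID is split.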
RPSMetadataExtracter.properties
# Namespaces
namespace.prefix.rps=http://www.rps.ac.uk/ns/1.0
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

# Mapping of property names to Qualified names used in model
rpsID=rps:id
rpsReign=rps:reign
rpsSession=rps:session
rpsDate=rps:date
rpsVersion=rps:version
rpsHeading=rps:heading
cmTitle=cm:title
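
The mapping file pairs each raw property name with a prefixed name, and each prefix with a namespace URI. Alfresco resolves these itself; the sketch below only illustrates the idea with `java.util.Properties`. The class `MappingResolver` and the `{uri}localName` output form are assumptions for this example, not Alfresco API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical illustration of how a raw name resolves to a namespace-qualified name. */
final class MappingResolver {
    static final String PREFIX_KEY = "namespace.prefix.";

    /** Loads mapping text in the format of the properties file shown above. */
    static Properties load(String text) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    /** Resolves e.g. "rpsDate" via "rps:date" to "{http://www.rps.ac.uk/ns/1.0}date". */
    static String resolve(Properties mapping, String rawName) {
        String prefixed = mapping.getProperty(rawName);
        if (prefixed == null || prefixed.indexOf(':') < 0) {
            throw new IllegalArgumentException("No prefixed mapping for " + rawName);
        }
        int colon = prefixed.indexOf(':');
        String uri = mapping.getProperty(PREFIX_KEY + prefixed.substring(0, colon));
        if (uri == null) {
            throw new IllegalArgumentException("Unknown prefix in " + prefixed);
        }
        return "{" + uri + "}" + prefixed.substring(colon + 1);
    }

    public static void main(String[] args) {
        Properties mapping = load(
            "namespace.prefix.rps=http://www.rps.ac.uk/ns/1.0\n"
          + "rpsDate=rps:date\n");
        System.out.println(resolve(mapping, "rpsDate"));
    }
}
```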
rpsModel.xml (fragment showing aspect)
<aspect name="rps:metadata">
    <title>RPS Metadata</title>
    <properties>
        <property name="rps:id"><type>d:text</type></property>
        <property name="rps:reign"><type>d:text</type></property>
        <property name="rps:session"><type>d:text</type></property>
        <property name="rps:date"><type>d:text</type></property>
        <property name="rps:heading"><type>d:text</type></property>
        <property name="rps:version"><type>d:text</type></property>
    </properties>
</aspect>

webclient.properties
# I18N strings
rpsID=RPS ID
rpsReign=RPS Reign
rpsSession=RPS Session
rpsDate=RPS Date
rpsVersion=RPS Version
rpsHeading=RPS Heading

Using MODS
• Metadata Object Description Schema
  • http://www.loc.gov/standards/mods/
• MODS is a resource discovery metadata standard
• Working on defining MODS data models
  • For Project, Resource Type and Digital Object levels
• Will move RPS metadata into MODS fields

Introduction
• Creates an action for scanning files for viruses
• Uses ClamAV
  • http://www.clamav.net/lang/en/
  • Can be configured for other tools
• Emails creator of file if virus found
• Deletes file from repository if virus found
• Can be run as space rule
• Compile as AMP using Alfresco SDK

antivirus-action.xml (fragment)
<bean id="antivirus-action" class="uk.ac.st_andrews.repo.action.executer.AntivirusActionExecuter"
      parent="action-executer">
    <!-- services needed by bean -->
    <property name="contentService"><ref bean="contentService" /></property>
    <property name="nodeService"><ref bean="nodeService" /></property>
    <property name="templateService"><ref bean="templateService" /></property>
    <property name="actionService"><ref bean="actionService" /></property>
    <property name="personService"><ref bean="personService" /></property>
    <!-- person that email will come from, defined in alfresco-global.properties -->
    <property name="fromEmail">
        <value>${antivirus.mailer}</value>
    </property>
    <!-- path to Freemarker template, defined in alfresco-global.properties -->
    <property name="emailTemplate">
        <value>${antivirus.template}</value>
    </property>
antivirus-action.xml (fragment, cont.)
    <property name="command">
        <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <!-- command to run, ${antivirus.exe} set in alfresco-global.properties, ${source} in Java class -->
                    <entry key=".*" value="${antivirus.exe} ${source}"/>
                </map>
            </property>
            <property name="errorCodes">
                <value>1</value>
                <!-- exit code 1 indicates that virus was found -->
            </property>
        </bean>
    </property>
</bean>
AntivirusActionExecuter
package uk.ac.st_andrews.repo.action.executer;

public class AntivirusActionExecuter extends ActionExecuterAbstractBase
{
    public void setContentService(ContentService contentService);
    public void setNodeService(NodeService nodeService);
    public void setTemplateService(TemplateService templateService);
    public void setActionService(ActionService actionService);
    public void setPersonService(PersonService personService);
    public void setFromEmail(String fromEmail);
    public void setCommand(RuntimeExec command);
    public void setEmailTemplate(String emailTemplate);
    public void init();
    protected void addParameterDefinitions(List<ParameterDefinition> paramList);
    protected void executeImpl(final Action ruleAction, final NodeRef actionedUponNodeRef);
}

AntivirusActionExecuter.executeImpl (fragment)
// put content into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File sourceFile =
    TempFileProvider.createTempFile("anti_virus_check_", "_" + fileName);
reader.getContent(sourceFile);

// set source property for command
Map<String, String> properties = new HashMap<String, String>(1);
properties.put(VAR_SOURCE, sourceFile.getAbsolutePath());

// execute the antivirus command
ExecutionResult result = null;
try
{
    result = command.execute(properties);
}
catch (Throwable e)
{
    throw new AlfrescoRuntimeException("Antivirus check error: \n" + command, e);
}

AntivirusActionExecuter.executeImpl (fragment, cont.)
// try to get document creator's details
String creatorName = (String) nodeService.getProperty(actionedUponNodeRef,
    ContentModel.PROP_CREATOR);
if (null == creatorName || 0 == creatorName.length())
{
    throw new Exception("couldn't get creator's name");
}

NodeRef creator = personService.getPerson(creatorName);
if (null == creator)
{
    throw new Exception("couldn't get creator");
}

String creatorEmail = (String) nodeService.getProperty(creator,
    ContentModel.PROP_EMAIL);
if (null == creatorEmail || 0 == creatorEmail.length())
{
    throw new Exception("couldn't get creator's email address");
}

AntivirusActionExecuter.executeImpl (fragment, cont.)
// put together message
Map<String, Object> model = new HashMap<String, Object>(8, 1.0f);
model.put("filename", fileName);
model.put("message", result);

String emailMsg = templateService.processTemplate("freemarker", emailTemplate, model);

// send email message
Action emailAction = actionService.createAction("mail");
emailAction.setParameterValue(MailActionExecuter.PARAM_TO, creatorEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_FROM, fromEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_SUBJECT,
    "Virus found in " + fileName);
emailAction.setParameterValue(MailActionExecuter.PARAM_TEXT, emailMsg);
emailAction.setExecuteAsynchronously(true);
actionService.executeAction(emailAction, null);

// delete node
nodeService.addAspect(actionedUponNodeRef, ContentModel.ASPECT_TEMPORARY, null);
nodeService.deleteNode(actionedUponNodeRef);
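
The antivirus action hands the file to an external scanner and decides what to do from the exit code. Outside Alfresco, the same idea can be sketched with `ProcessBuilder`. The class `VirusScan`, its `Outcome` enum and the example command are hypothetical. The exit-code convention follows ClamAV's `clamscan` (0 = clean, 1 = virus found, anything else = error); other scanners would need a different mapping.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: run an external virus scanner and interpret its exit code. */
final class VirusScan {
    enum Outcome { CLEAN, INFECTED, ERROR }

    /** Maps a clamscan-style exit code to an outcome. */
    static Outcome interpret(int exitCode) {
        switch (exitCode) {
            case 0:  return Outcome.CLEAN;
            case 1:  return Outcome.INFECTED;
            default: return Outcome.ERROR;
        }
    }

    /** Runs the scanner command with the file path appended as the last argument. */
    static Outcome scan(List<String> scannerCommand, String filePath) {
        List<String> command = new ArrayList<>(scannerCommand);
        command.add(filePath);
        try {
            // inheritIO avoids blocking on unread scanner output
            Process process = new ProcessBuilder(command).inheritIO().start();
            return interpret(process.waitFor());
        } catch (IOException e) {
            return Outcome.ERROR;  // scanner missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.ERROR;
        }
    }

    public static void main(String[] args) {
        // Assumes clamscan is installed and on the PATH; otherwise the outcome is ERROR.
        String target = args.length > 0 ? args[0] : "example.doc";
        System.out.println(scan(Arrays.asList("clamscan", "--no-summary"), target));
    }
}
```

Treating every unexpected exit code as an error rather than as clean keeps a broken scanner from silently letting files through.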
Introduction
• Metadata Encoding and Transmission Standard (METS)
  • http://www.loc.gov/standards/mets/
• METS is a wrapper for other metadata documents
• Plan to generate METS documents containing/referencing:
  • Ingested files
  • Renderings of these files (thumbnails, reference copies, archival formatted versions etc.)
  • Resource discovery metadata
  • Technical metadata
• Fedora Commons can ingest METS documents as SIPs
  • http://fedora-commons.org/
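
Since METS generation is still only planned, the following is a speculative sketch of what a minimal wrapper could look like, built with the standard Java DOM API. It covers only file references (`fileSec`) and the mandatory `structMap`. The class name `MetsWrapper`, the `USE="original"` value and the `FILE1`, `FILE2` ID scheme are assumptions, and embedding the MODS and PREMIS sections is left out.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical sketch: wraps a list of file URLs in a minimal METS document. */
final class MetsWrapper {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    /** Creates a METS-namespaced child element and appends it to the parent. */
    private static Element child(Document doc, Element parent, String name) {
        Element element = doc.createElementNS(METS_NS, "mets:" + name);
        parent.appendChild(element);
        return element;
    }

    static String wrap(List<String> fileUrls) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element mets = doc.createElementNS(METS_NS, "mets:mets");
            mets.setAttributeNS(XMLNS_NS, "xmlns:mets", METS_NS);
            mets.setAttributeNS(XMLNS_NS, "xmlns:xlink", XLINK_NS);
            doc.appendChild(mets);

            Element fileGrp = child(doc, child(doc, mets, "fileSec"), "fileGrp");
            fileGrp.setAttribute("USE", "original");
            Element div = child(doc, child(doc, mets, "structMap"), "div");

            for (int i = 0; i < fileUrls.size(); i++) {
                String id = "FILE" + (i + 1);
                Element file = child(doc, fileGrp, "file");
                file.setAttribute("ID", id);
                Element locator = child(doc, file, "FLocat");
                locator.setAttribute("LOCTYPE", "URL");
                locator.setAttributeNS(XLINK_NS, "xlink:href", fileUrls.get(i));
                // the structural map points back at each file by its ID
                child(doc, div, "fptr").setAttribute("FILEID", id);
            }

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build METS document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap(Arrays.asList("file:///data/a.xml", "file:///data/a.tif")));
    }
}
```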
Project source code available on Alfresco Forge
• FITS in Alfresco
  • http://forge.alfresco.com/projects/fitsinalfresco/
• RPS Metadata Extracter
  • http://forge.alfresco.com/projects/rpsmetadata/
• Antivirus
  • http://forge.alfresco.com/projects/antivirus/
University of St Andrews Digital Archiving Project
• http://www.st-andrews.ac.uk/itsupport/academic/arts
