
Dr Birgit Plietzsch
Arts Computing Advisor
bp10@st-andrews.ac.uk

&

Swithun Crowe
Developer for Arts and Humanities Computing projects
cs2@st-andrews.ac.uk

IT Services, University of St Andrews



1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco

Digital Preservation is …

the active management of digital information over time to ensure its accessibility

long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required for.

Long-term is defined as "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely".

Retrieval means obtaining needed digital files from the long-term, error-free digital storage, without possibility of corrupting the continued error-free storage of the digital files.

Interpretation means that the retrieved digital files, files that, for example, are of texts, charts, images or sounds, are decoded and transformed into usable representations. This is often interpreted as "rendering", i.e. making it available for a human to access. However, in many cases it will mean able to be processed by computational means.

(Source: Wikipedia)


Legal requirements (e.g. Freedom of Information Act)

Protection of institutional intellectual property

Funding body requirements
  until 2008: Arts and Humanities Data Service (national depository for arts and humanities research data)
  no such body exists now for the Arts and Humanities
  for other subjects, national support is patchy

Moral obligations
  protection of cultural and corporate memory

www.rps.ac.uk

proceedings of the Scottish Parliament from the first surviving act of 1235 to the union of 1707
10 years of research
no print publication
c. 16.5m words
issues:
  inconsistent editorial practices
  obsolescence of software originally used
  long-term sustainability of research data


Pilot project

Scope:
  data contained in electronic resources produced within the Faculty of Arts, University of St Andrews

Aims:
  ensure long-term sustainability of RPS data
  investigate the requirements of digital archiving and obtain experience
  meet funding body requirement
  flexible implementation (to allow for additional future uses)

[Figure: Concepts and Properties of Archives and Hosting in the Strategy and their Relationships. © Charles Beagrie Ltd 2009. Creative Commons Attribution-Share Alike 3.0. Key: solid colour represents core properties and fading colour represents weaker properties of archives and hosting services.]

2. The DAP Open Archival Information System



An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.

reference model: ISO 14721:2003

Seven functions
  Ingest
  Archival Storage
  Data Management
  Administration
  Preservation Planning
  Access
  Management

SIP: Submission Information Package
AIP: Archival Information Package
DIP: Dissemination Information Package

Implementation
  Content Information: XML, TIFF, DOC, etc.
  Preservation Description Information: PREMIS
  Descriptive Information: MODS
  Packaging Information: METS


What needs to be preserved?
  data
  layout
  functionality
  user experience

What are the significant properties?
  generic low-level properties (e.g. basic data unit, byte-level encoding, data type, and logical schema)
  data type specific properties (example: text)
    underlying abstract forms (font, spacing, layout)
    sub-properties (e.g. font type, style, family, size, colour)

How do we preserve?
  bit stream preservation
  emulation
  migration

Adopted approach:
  data is preserved
  combination of bit stream preservation and file format migration upon ingest
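
Bit stream preservation means keeping the exact bytes of a file intact over time. One common way to check this is to record a checksum at ingest and recompute it later. The sketch below illustrates that idea only; the slides do not say how DAP verifies stored files, and the class name FixityCheck and the choice of SHA-256 are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the DAP code base.
class FixityCheck {

    /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** True if the bytes still match the checksum recorded at ingest. */
    static boolean isUnchanged(byte[] data, String recordedChecksum) {
        return sha256Hex(data).equals(recordedChecksum);
    }

    public static void main(String[] args) {
        byte[] ingested = "abc".getBytes(StandardCharsets.UTF_8);
        String recorded = sha256Hex(ingested);           // stored alongside the file
        byte[] damaged = "abd".getBytes(StandardCharsets.UTF_8);
        System.out.println(isUnchanged(ingested, recorded)); // same bytes
        System.out.println(isUnchanged(damaged, recorded));  // one byte differs
    }
}
```

A checksum only detects change; recovering from it still needs a second copy of the bit stream.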


description needs of different types of material
  electronic resources
  digital images
  video
  research papers
  University records
  etc.

introduce flexibility
  future wider uses of the archive



Resource Discovery Metadata
  expressed in MODS
  3 layers
  use for pilot
  more models can be developed

[Diagram labels: Project; Research data; Documentation; Code; Resource type; Digital object]

Monolithic approach
  DSpace
    issues with Archival Storage and Data Management functions
  EPrints
    issues with Administration and Access functions
  RODA
    technical issues
  No support for Preservation Planning

Breakdown into OAIS requirements
  Repository framework: Fedora Commons
    issues with suitable front end for Ingest, Access, Preservation Planning, or Administration functions
    highly customisable
  Metadata: MODS, METS, PREMIS
  Access
  Plato
  Testbed

Software used
  Alfresco (www.alfresco.com)
  Fedora Commons (fedora-commons.org)
  Planets Suite (www.openplanetsfoundation.org)

[Diagram labels: Archival Storage & Data Management; Management; Share; Explorer; Records Management; Ingest; Preservation Planning; Administration]


Version control of AIPs
  Alfresco / Fedora interaction?

Access front end
  Fedora Commons front ends do not normally support OAIS functions

Can extra properties be added to folders and files in Records Management site?

We welcome ideas that might help us resolve the above three issues.





3. Developing the OAIS Ingest function in Alfresco


Introduction

FITS and PREMIS
  Technical metadata
RPS and MODS
  Resource discovery metadata
Antivirus scanning
METS
  Wrapping files and metadata


Introduction

FITS (File Information Tool Set)
  http://code.google.com/p/fits/
  Consolidates file format metadata from 3rd party tools
    Jhove, DROID, NLNZ ME, Exiftool and others
  Output as XML

PREMIS (PREservation Metadata: Implementation Strategies)
  http://www.loc.gov/standards/premis/
  Data dictionary of semantic units, maps to XML

Transform FITS XML to PREMIS using XSLT


The action

Text property defined in custom aspect for storing FITS XML in node metadata
Create temporary file containing content of node
Run FITS on temporary file
Put output into custom property
Later on, transform this to PREMIS XML
Can be run as space rule
Compile to AMP using Alfresco SDK

fits-action-context.xml

<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>

  <!-- I18N messages for the action -->
  <bean id="fits-action-messages" class="org.alfresco.i18n.ResourceBundleBootstrapComponent">
    <property name="resourceBundles">
      <list><value>alfresco.module.FitsAction.fits-action-messages</value></list>
    </property>
  </bean>

  <!-- content model that defines the custom FITS aspect -->
  <bean id="fits-model-bootstrap" parent="dictionaryModelBootstrap" depends-on="dictionaryBootstrap">
    <property name="models">
      <list><value>alfresco/module/FitsAction/context/fitsModel.xml</value></list>
    </property>
  </bean>

  <!-- the action executer itself -->
  <bean id="fits-action" class="uk.ac.st_andrews.repo.action.executer.FitsActionExecuter" parent="action-executer">
    <property name="serviceRegistry"><ref bean="ServiceRegistry"/></property>
  </bean>

</beans>

FitsActionExecuter

package uk.ac.st_andrews.repo.action.executer;

public class FitsActionExecuter extends ActionExecuterAbstractBase
{
    // set by Spring from the serviceRegistry property in fits-action-context.xml
    public void setServiceRegistry(ServiceRegistry serviceRegistry);

    // declares the parameters the action accepts
    protected void addParameterDefinitions(List<ParameterDefinition> paramList);

    // runs FITS on the node the action is applied to
    protected void executeImpl(Action action, NodeRef actionedUponNodeRef);
}


FitsActionExecuter.executeImpl (fragment)

// make sure node exists
if (!nodeService.exists(actionedUponNodeRef))
{
    throw new Exception("no node");
}

// make sure that node has fits aspect
QName fitsAspect = QName.createQName(fitsURI, "fitsAspect");
if (!nodeService.hasAspect(actionedUponNodeRef, fitsAspect))
{
    nodeService.addAspect(actionedUponNodeRef, fitsAspect, null);
}

// create new FITS instance
Fits fits = new Fits();
Fits.allowRounding = true;
FitsOutput result = null;


FitsActionExecuter.executeImpl (fragment cont.)

// put input into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File inputFile =
    TempFileProvider.createTempFile("FitsActionExecuter_", "." + fileName);
reader.getContent(inputFile);

// transform into technical metadata
result = fits.examine(inputFile);
Document doc = result.getFitsXml();

// put result of transformation into output
XMLOutputter serializer = new XMLOutputter(Format.getPrettyFormat());
String output = serializer.outputString(doc);

// get property to write to
QName fitsProp = QName.createQName(fitsURI, "fitsOutput");
nodeService.setProperty(actionedUponNodeRef, fitsProp, output);

Fragment of FITS XML showing conflicting file formats

<identification status="CONFLICT">
  <identity format="Microsoft Word" mimetype="application/msword">
    <tool toolname="Exiftool" toolversion="8.25" />
    <tool toolname="file utility" toolversion="5.04" />
    <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
    <tool toolname="ffident" toolversion="0.2" />
  </identity>
  <identity format="OLE2 Compound Document Format" mimetype="application/octet-stream">
    <tool toolname="Droid" toolversion="3.0" />
    <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/111</externalIdentifier>
  </identity>
</identification>
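
A conflict like the one in this fragment can be detected programmatically before deciding which format to record. The following is a minimal sketch using only the JDK's DOM parser; the class FitsConflictCheck and its method names are hypothetical, and the sample input is a shortened copy of the fragment on this slide, not real FITS output with its namespace declarations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the FITS in Alfresco module.
class FitsConflictCheck {

    // Shortened copy of the slide fragment (assumption: no namespaces).
    static final String SAMPLE =
        "<identification status='CONFLICT'>"
        + "<identity format='Microsoft Word' mimetype='application/msword'/>"
        + "<identity format='OLE2 Compound Document Format'"
        + " mimetype='application/octet-stream'/>"
        + "</identification>";

    private static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** True if the first identification element is marked status="CONFLICT". */
    static boolean hasConflict(String fitsXml) {
        NodeList found = parse(fitsXml).getElementsByTagName("identification");
        if (found.getLength() == 0) {
            return false;
        }
        return "CONFLICT".equals(((Element) found.item(0)).getAttribute("status"));
    }

    /** Returns the format name of every identity element, in document order. */
    static List<String> formats(String fitsXml) {
        List<String> names = new ArrayList<>();
        NodeList identities = parse(fitsXml).getElementsByTagName("identity");
        for (int i = 0; i < identities.getLength(); i++) {
            names.add(((Element) identities.item(i)).getAttribute("format"));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(hasConflict(SAMPLE));
        System.out.println(formats(SAMPLE));
    }
}
```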

Corresponding fragment of PREMIS XML

<premis:format>
  <premis:formatDesignation>
    <premis:formatName>Microsoft Word</premis:formatName>
  </premis:formatDesignation>
</premis:format>
<premis:format>
  <premis:formatDesignation>
    <premis:formatName>OLE2 Compound Document Format</premis:formatName>
  </premis:formatDesignation>
  <premis:formatRegistry>
    <premis:formatRegistryName>Droid (3.0)</premis:formatRegistryName>
    <premis:formatRegistryKey>fmt/111</premis:formatRegistryKey>
    <premis:formatRegistryRole>puid</premis:formatRegistryRole>
  </premis:formatRegistry>
</premis:format>
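
The mapping from the FITS fragment to the PREMIS fragment can be sketched as a small XSLT 1.0 stylesheet run through the JDK's built-in transformer. This is not the project's stylesheet, which the slides do not show. The class name FitsToPremis, the wrapping premis:objectCharacteristics element, the PREMIS namespace URI info:lc/xmlns/premis-v2 and the namespace-free input are all assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch, not the stylesheet used by DAP.
class FitsToPremis {

    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version='1.0'",
        "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'",
        "    xmlns:premis='info:lc/xmlns/premis-v2'>",
        "  <xsl:output method='xml' omit-xml-declaration='yes'/>",
        "  <xsl:template match='/identification'>",
        "    <premis:objectCharacteristics>",
        "      <xsl:apply-templates select='identity'/>",
        "    </premis:objectCharacteristics>",
        "  </xsl:template>",
        "  <xsl:template match='identity'>",
        "    <premis:format>",
        "      <premis:formatDesignation>",
        "        <premis:formatName><xsl:value-of select='@format'/></premis:formatName>",
        "      </premis:formatDesignation>",
        "      <xsl:apply-templates select='externalIdentifier'/>",
        "    </premis:format>",
        "  </xsl:template>",
        "  <xsl:template match='externalIdentifier'>",
        "    <premis:formatRegistry>",
        "      <premis:formatRegistryName>",
        "        <xsl:value-of select=\"concat(@toolname, ' (', @toolversion, ')')\"/>",
        "      </premis:formatRegistryName>",
        "      <premis:formatRegistryKey><xsl:value-of select='.'/></premis:formatRegistryKey>",
        "      <premis:formatRegistryRole><xsl:value-of select='@type'/></premis:formatRegistryRole>",
        "    </premis:formatRegistry>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Shortened copy of the FITS fragment from the previous slide.
    static final String SAMPLE_FITS = String.join("\n",
        "<identification status='CONFLICT'>",
        "  <identity format='Microsoft Word' mimetype='application/msword'/>",
        "  <identity format='OLE2 Compound Document Format'",
        "            mimetype='application/octet-stream'>",
        "    <externalIdentifier toolname='Droid' toolversion='3.0'",
        "                        type='puid'>fmt/111</externalIdentifier>",
        "  </identity>",
        "</identification>");

    /** Applies the stylesheet to a FITS identification fragment. */
    static String transform(String fitsXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(fitsXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE_FITS));
    }
}
```

Like the PREMIS fragment above, the sketch records both candidate formats and does not try to resolve the conflict.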

Introduction

Records of the Parliaments of Scotland marked up in thousands of XML documents
  http://www.rps.ac.uk
Using Text Encoding Initiative (TEI)
  http://www.tei-c.org/index.xml
TEI headers contain resource discovery metadata
Extract metadata from documents and populate custom metadata fields
Can be run as space rule
Compile as AMP using Alfresco SDK

TEI example

<!-- id: unique ID for document -->
<!-- n: document belongs to translated version of records from reign of William and Mary -->
<TEI.2 id="_william_and_mary_t1689_3_6_d6_trans" n="william_and_mary_trans">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <!-- main heading in document -->
        <title>A committee appointed for controverted elections</title>
      </titleStmt>
      <editionStmt>
        <!-- pointer to session that document belongs to -->
        <edition n="session">william_and_mary_t1689_3_1_d2_trans</edition>
      </editionStmt>
      <publicationStmt>
        <!-- date of document, in YYYYMMDD format -->
        <date>16890314</date>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>...</text>
</TEI.2>
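
The header values in this example are compact strings that have to be unpacked before they are useful as metadata. As a minimal sketch, the date can be parsed with the JDK's BASIC_ISO_DATE formatter, and the n attribute can be split into reign and version. The class RpsHeaderValues is hypothetical, and the rule that the version is the last underscore-separated token is an assumption drawn from this one example, not from the project's handleID or handleDate code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the RPS metadata extracter.
class RpsHeaderValues {

    /** Parses a YYYYMMDD string such as "16890314". */
    static LocalDate parseDate(String yyyymmdd) {
        return LocalDate.parse(yyyymmdd, DateTimeFormatter.BASIC_ISO_DATE);
    }

    /**
     * Splits an n attribute such as "william_and_mary_trans" into
     * { reign, version }. Assumes the version is the final token.
     */
    static String[] splitReignAndVersion(String n) {
        int cut = n.lastIndexOf('_');
        if (cut < 0) {
            return new String[] { n, "" };
        }
        return new String[] { n.substring(0, cut), n.substring(cut + 1) };
    }

    public static void main(String[] args) {
        System.out.println(parseDate("16890314"));
        String[] parts = splitReignAndVersion("william_and_mary_trans");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```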

RPSMetadataExtracter

package uk.ac.st_andrews.repo.content.metadata;

public class RPSMetadataExtracter extends org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
{
    public RPSMetadataExtracter();

    // reads the TEI document and returns the raw property values found
    protected Map<String, Serializable> extractRaw(ContentReader reader);
}

RPSMetadataExtracter.extractRaw

// set up parser
SAXParser sp = spf.newSAXParser();
InputStream cis = reader.getContentInputStream();
InputSource is = new InputSource(cis);
RPSSaxParser teip = new RPSSaxParser();

// do parsing
teip.setProperties(map);
sp.parse(is, teip);
map = teip.getProperties();

// loop over properties found
for (Map.Entry<String, Serializable> m : map.entrySet())
{
    putRawValue(m.getKey(), (String) m.getValue(), rawProperties);
}

RPSSaxParser

package uk.ac.st_andrews.repo.content.metadata;

public class RPSSaxParser extends org.xml.sax.helpers.DefaultHandler
{
    public void setProperties(Map<String, Serializable> prop);
    public Map<String, Serializable> getProperties();

    // SAX callbacks
    public void startElement(String uri, String localName, String qName, Attributes attributes);
    public void endElement(String uri, String localName, String qName);
    public void characters(char[] ch, int start, int length);

    // helpers for the id attribute and the date element
    private void handleID(String id);
    private void handleDate(String d);
}


RPSSaxParser

// property names
private static final String KEY_ID = "rpsID";
private static final String KEY_REIGN = "rpsReign";
private static final String KEY_VERSION = "rpsVersion";
private static final String KEY_HEADING = "rpsHeading";
private static final String KEY_SESSION = "rpsSession";
private static final String KEY_DATE = "rpsDate";
private static final String KEY_TITLE = "cmTitle";

// some properties get set in RPSSaxParser.characters
if (inTitle)
{
    rawProperties.put(KEY_TITLE, new String(ch, start, length));
}
else if (inSession)
{
    rawProperties.put(KEY_SESSION, new String(ch, start, length));
}
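
The SAX approach outlined on these slides can be tried outside Alfresco with the JDK alone. The sketch below is not the project's RPSSaxParser: the class TeiHeaderHandler is hypothetical, it reuses only the key names shown above, and it buffers text until endElement because a SAX parser is allowed to deliver one element's text in several characters() calls.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone handler, not the project's RPSSaxParser.
class TeiHeaderHandler extends DefaultHandler {

    static final String SAMPLE =
        "<TEI.2 id='_william_and_mary_t1689_3_6_d6_trans' n='william_and_mary_trans'>"
        + "<teiHeader><fileDesc>"
        + "<titleStmt><title>A committee appointed for controverted elections</title></titleStmt>"
        + "<editionStmt><edition n='session'>william_and_mary_t1689_3_1_d2_trans</edition></editionStmt>"
        + "<publicationStmt><date>16890314</date></publicationStmt>"
        + "</fileDesc></teiHeader><text>...</text></TEI.2>";

    private final Map<String, String> properties = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHeader = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // start collecting text for this element
        if ("TEI.2".equals(qName)) {
            properties.put("rpsID", attributes.getValue("id"));
        } else if ("teiHeader".equals(qName)) {
            inHeader = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called more than once per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if ("teiHeader".equals(qName)) {
            inHeader = false;
        } else if (inHeader && "title".equals(qName)) {
            properties.put("cmTitle", value);
        } else if (inHeader && "edition".equals(qName)) {
            properties.put("rpsSession", value);
        } else if (inHeader && "date".equals(qName)) {
            properties.put("rpsDate", value);
        }
        text.setLength(0);
    }

    /** Parses a TEI document and returns the header values found. */
    static Map<String, String> extract(String xml) {
        try {
            TeiHeaderHandler handler = new TeiHeaderHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.properties;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse TEI document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
    }
}
```

The sketch ignores elements nested inside a title or edition; handling mixed content would need a stack of buffers.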

RPSMetadataExtracter.properties

# Namespaces
namespace.prefix.rps=http://www.rps.ac.uk/ns/1.0
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

# Mapping of property names to Qualified names used in model
rpsID=rps:id
rpsReign=rps:reign
rpsSession=rps:session
rpsDate=rps:date
rpsVersion=rps:version
rpsHeading=rps:heading
cmTitle=cm:title
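
To make the two-step lookup in this file concrete, the sketch below loads a mapping file of the same shape with java.util.Properties and expands a raw property name into a fully qualified name. It only illustrates the idea; Alfresco performs this resolution itself. The class MappingResolver is hypothetical, and the {uri}localName output notation is an assumption chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical illustration, not Alfresco's own mapping code.
class MappingResolver {

    // Shortened copy of the mapping file shown on the slide.
    static final String SAMPLE = String.join("\n",
        "# Namespaces",
        "namespace.prefix.rps=http://www.rps.ac.uk/ns/1.0",
        "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0",
        "# Mapping of property names to qualified names",
        "rpsID=rps:id",
        "cmTitle=cm:title");

    /**
     * Resolves a raw property name (e.g. "rpsID") to "{namespace-uri}localName",
     * or returns null if the mapping file has no entry for it.
     */
    static String resolve(String mappingFile, String rawName) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(mappingFile));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        String prefixed = props.getProperty(rawName);
        if (prefixed == null) {
            return null;
        }
        int colon = prefixed.indexOf(':');
        if (colon < 0) {
            return prefixed; // no prefix to expand
        }
        String uri = props.getProperty("namespace.prefix." + prefixed.substring(0, colon));
        return "{" + uri + "}" + prefixed.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(resolve(SAMPLE, "rpsID"));
        System.out.println(resolve(SAMPLE, "cmTitle"));
    }
}
```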

rpsModel.xml (fragment showing aspect)

<aspect name="rps:metadata">
  <title>RPS Metadata</title>
  <properties>
    <property name="rps:id"><type>d:text</type></property>
    <property name="rps:reign"><type>d:text</type></property>
    <property name="rps:session"><type>d:text</type></property>
    <property name="rps:date"><type>d:text</type></property>
    <property name="rps:heading"><type>d:text</type></property>
    <property name="rps:version"><type>d:text</type></property>
  </properties>
</aspect>

webclient.properties

# I18N strings
rpsID=RPS ID
rpsReign=RPS Reign
rpsSession=RPS Session
rpsDate=RPS Date
rpsVersion=RPS Version
rpsHeading=RPS Heading


Using MODS

Metadata Object Description Schema
  http://www.loc.gov/standards/mods/
MODS is a resource discovery metadata standard
Working on defining MODS data models
  For Project, Resource Type and Digital Object levels
Will move RPS metadata into MODS fields


Introduction

Creates an action for scanning files for viruses
Uses ClamAV
  http://www.clamav.net/lang/en/
Can be configured for other tools
Emails creator of file if virus found
Deletes file from repository if virus found
Can be run as space rule
Compile as AMP using Alfresco SDK

antivirus-action.xml (fragment)

<bean id="antivirus-action" class="uk.ac.st_andrews.repo.action.executer.AntivirusActionExecuter" parent="action-executer">

  <!-- services needed by bean -->
  <property name="contentService"><ref bean="contentService" /></property>
  <property name="nodeService"><ref bean="nodeService" /></property>
  <property name="templateService"><ref bean="templateService" /></property>
  <property name="actionService"><ref bean="actionService" /></property>
  <property name="personService"><ref bean="personService" /></property>

  <!-- person that email will come from, defined in alfresco-global.properties -->
  <property name="fromEmail">
    <value>${antivirus.mailer}</value>
  </property>

  <!-- path to Freemarker template, defined in alfresco-global.properties -->
  <property name="emailTemplate">
    <value>${antivirus.template}</value>
  </property>

antivirus-action.xml (fragment, cont.)

  <property name="command">
    <bean class="org.alfresco.util.exec.RuntimeExec">
      <property name="commandMap">
        <map>
          <!-- command to run, ${antivirus.exe} set in alfresco-global.properties, ${source} in Java class -->
          <entry key=".*" value="${antivirus.exe} ${source}"/>
        </map>
      </property>
      <property name="errorCodes">
        <value>1</value> <!-- exit code 1 indicates that virus was found -->
      </property>
    </bean>
  </property>
</bean>
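
The configuration above relies on the scanner's exit code to signal a detection. Outside Alfresco the same idea can be sketched with ProcessBuilder. The class ScanRunner is hypothetical and does not use Alfresco's RuntimeExec. Exit code 1 meaning "virus found" comes from the configuration comment; treating 0 as clean and every other code as an error is an assumption based on ClamAV's documented exit codes.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch, not the project's AntivirusActionExecuter.
class ScanRunner {

    enum ScanResult { CLEAN, INFECTED, ERROR }

    /** Maps a scanner exit code to a result (0 clean, 1 virus found, else error). */
    static ScanResult classify(int exitCode) {
        switch (exitCode) {
            case 0:
                return ScanResult.CLEAN;
            case 1:
                return ScanResult.INFECTED;
            default:
                return ScanResult.ERROR;
        }
    }

    /**
     * Runs "scannerExe sourcePath" and classifies the exit code.
     * Arguments are passed as a list, so the path is never interpreted by a shell.
     */
    static ScanResult scan(String scannerExe, String sourcePath) {
        List<String> command = Arrays.asList(scannerExe, sourcePath);
        try {
            Process process = new ProcessBuilder(command).inheritIO().start();
            return classify(process.waitFor());
        } catch (IOException e) {
            return ScanResult.ERROR; // scanner missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return ScanResult.ERROR;
        }
    }

    public static void main(String[] args) {
        if (args.length == 2) {
            System.out.println(scan(args[0], args[1]));
        } else {
            System.out.println("usage: java ScanRunner <scanner> <file>");
        }
    }
}
```

Keeping ERROR separate from INFECTED matters because the action deletes the file when a virus is found; a scanner failure should not trigger that path.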

AntivirusActionExecuter

package uk.ac.st_andrews.repo.action.executer;

public class AntivirusActionExecuter extends ActionExecuterAbstractBase
{
    // setters called by Spring, matching the properties in antivirus-action.xml
    public void setContentService(ContentService contentService);
    public void setNodeService(NodeService nodeService);
    public void setTemplateService(TemplateService templateService);
    public void setActionService(ActionService actionService);
    public void setPersonService(PersonService personService);
    public void setFromEmail(String fromEmail);
    public void setCommand(RuntimeExec command);
    public void setEmailTemplate(String emailTemplate);

    public void init();

    protected void addParameterDefinitions(List<ParameterDefinition> paramList);
    protected void executeImpl(final Action ruleAction, final NodeRef actionedUponNodeRef);
}

AntivirusActionExecuter.executeImpl (fragment)

// put content into temp file
ContentReader reader =
    contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
String fileName =
    (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
File sourceFile =
    TempFileProvider.createTempFile("anti_virus_check_", "_" + fileName);
reader.getContent(sourceFile);

// set source property for command
Map<String, String> properties = new HashMap<String, String>(1);
properties.put(VAR_SOURCE, sourceFile.getAbsolutePath());

// execute the antivirus command
ExecutionResult result = null;
try
{
    result = command.execute(properties);
}
catch (Throwable e)
{
    throw new AlfrescoRuntimeException("Antivirus check error: \n" + command, e);
}

AntivirusActionExecuter.executeImpl (fragment, cont.)

// try to get document creator's details
String creatorName = (String) nodeService.getProperty(actionedUponNodeRef,
    ContentModel.PROP_CREATOR);
if (creatorName == null || creatorName.isEmpty())
{
    throw new Exception("couldn't get creator's name");
}

NodeRef creator = personService.getPerson(creatorName);
if (creator == null)
{
    throw new Exception("couldn't get creator");
}

String creatorEmail = (String) nodeService.getProperty(creator,
    ContentModel.PROP_EMAIL);
if (creatorEmail == null || creatorEmail.isEmpty())
{
    throw new Exception("couldn't get creator's email address");
}

AntivirusActionExecuter.executeImpl (fragment, cont.)

// put together message
Map<String, Object> model = new HashMap<String, Object>(8, 1.0f);
model.put("filename", fileName);
model.put("message", result);

String emailMsg = templateService.processTemplate("freemarker", emailTemplate, model);

// send email message
Action emailAction = actionService.createAction("mail");
emailAction.setParameterValue(MailActionExecuter.PARAM_TO, creatorEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_FROM, fromEmail);
emailAction.setParameterValue(MailActionExecuter.PARAM_SUBJECT,
    "Virus found in " + fileName);
emailAction.setParameterValue(MailActionExecuter.PARAM_TEXT, emailMsg);
emailAction.setExecuteAsynchronously(true);
actionService.executeAction(emailAction, null);

// delete node
nodeService.addAspect(actionedUponNodeRef, ContentModel.ASPECT_TEMPORARY, null);
nodeService.deleteNode(actionedUponNodeRef);


Introduction

Metadata Encoding and Transmission Standard (METS)
  http://www.loc.gov/standards/mets/
METS is a wrapper for other metadata documents
Plan to generate METS documents containing/referencing:
  Ingested files
  Renderings of these files (thumbnails, reference copies, archival formatted versions etc.)
  Resource discovery metadata
  Technical metadata
Fedora Commons can ingest METS documents as SIPs
  http://fedora-commons.org/


University of St Andrews Digital Archiving Project

Project source code available on Alfresco Forge
  FITS in Alfresco
    http://forge.alfresco.com/projects/fitsinalfresco/
  RPS Metadata Extracter
    http://forge.alfresco.com/projects/rpsmetadata/
  Antivirus
    http://forge.alfresco.com/projects/antivirus/

http://www.st-andrews.ac.uk/itsupport/academic/arts
