The OAI Protocol for Metadata Harvesting - IVOA


a centre of expertise in digital information management

www.ukoln.ac.uk

The OAI Protocol for
Metadata Harvesting

Andy Powell

a.powell@ukoln.ac.uk

UKOLN, University of Bath

IVOA Registry Meeting, London

March 2003







Contents


a brief history of OAI


10 technical things you should know about the OAI-PMH







OAI roots


the roots of OAI lie in the development
of eprint archives…


arXiv, CogPrints, NACA (NASA), RePEc, NDLTD,
NCSTRL


each offered Web interface for deposit of articles and for end-user searches


difficult for end-users to work across archives without having to learn multiple different interfaces


recognised need for single search
interface to all archives


Universal Pre-print Service (UPS)







Searching vs. harvesting


two possible approaches to building a
single search interface to multiple eprint
archives…


cross-searching multiple archives based on protocol like Z39.50


harvesting metadata into one or more ‘central’
services


bulk move data to the user-interface


US digital library experience in this area indicated that cross-searching is not the preferred approach


distributed searching of N nodes viable, but only for
small values of N







Searching vs. harvesting

[Diagram: two alternative architectures for the search service, cross-searching …or… harvesting]







Harvesting requirements


in order that harvesting approach can work
there need to be agreements about…


transport protocols


HTTP vs. FTP vs. …


metadata formats


DC vs. MARC vs. …


quality assurance


mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice


intellectual property and usage rights


who can do what with the records


work in this area resulted in the “Santa Fe
Convention”







Development of OAI-PMH


2-year metamorphosis through various names


Santa Fe Convention, OAI-PMH versions 1.0, 1.1…


OAI Protocol for Metadata Harvesting 2.0


development steered by international
technical committee


inter-version stability helped developer confidence


move from focus on eprints to more
generic protocol


move from OAI-specific metadata schema to mandatory support for DC







Bluffer’s guide to OAI

1. OAI-PMH is a low-cost mechanism for harvesting metadata records


from ‘data providers’ to ‘service providers’

2. allows ‘service provider’ to say ‘give me some or all of your metadata records’

where ‘some’ is based on date-stamps, sets, metadata formats

3. not limited to repositories of eprints


images, museum artefacts, learning objects, …

4. based on HTTP and XML

simple, Web-friendly, autonomous


fast, flexible deployment

http://www.openarchives.org/
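
Point 2 above can be made concrete with a short sketch. It builds a ListRecords request that asks for "some" records by date-stamp, set and metadata format. The argument names (verb, metadataPrefix, from, until, set) are those of OAI-PMH 2.0; the base URL, the set name and the helper name list_records_url are assumptions made up for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a selective-harvesting ListRecords request URL."""
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if from_date:
        params.append(("from", from_date))    # lower date-stamp bound
    if until_date:
        params.append(("until", until_date))  # upper date-stamp bound
    if set_spec:
        params.append(("set", set_spec))      # restrict to one set
    return base_url + "?" + urlencode(params)


# Hypothetical repository and set name.
url = list_records_url("http://example.org/oai",
                       from_date="2003-01-01", set_spec="physics")
print(url)
```

Leaving out the optional arguments gives the "all of your metadata records" case.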







Bluffer’s guide to OAI

5. OAI-PMH is not a search protocol

but its use can underpin search-based services based on Z39.50 or SRW or SOAP or…

6. OAI-PMH carries only metadata

content (e.g. full-text or image) made available separately

typically at URL in metadata

7. mandates simple DC as record format


but extensible to any XML format


IMS, ONIX, MARC, METS, etc.

8. extensible framework for metadata about repository, resources, ‘items’, sets


can include rights metadata







Bluffer’s guide to OAI

9. metadata and ‘content’ often made freely available


but not a requirement


OAI-PMH can be used between closed groups


or, can make metadata available but restrict
access to content in some way

10. underlying HTTP protocol provides


access control


e.g. HTTP BASIC


compression mechanisms (for improving
performance of harvesters)


could, in theory, also provide encryption if
required
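
As a rough illustration of point 10, the sketch below prepares a harvesting request that uses HTTP BASIC access control and asks for a gzip-compressed response, using only the Python standard library. The URL, user name and password are hypothetical, and no request is actually sent. Encryption would come from using an https URL and is not shown.

```python
import base64
import gzip
import urllib.request


def build_request(url, user=None, password=None):
    """Prepare (but do not send) a harvesting request."""
    req = urllib.request.Request(url)
    # Ask the repository to compress the response if it can.
    req.add_header("Accept-Encoding", "gzip")
    if user is not None:
        # HTTP BASIC: base64 of "user:password" in the Authorization header.
        token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
        req.add_header("Authorization", "Basic " + token.decode("ascii"))
    return req


def decode_body(body, content_encoding):
    """Undo gzip compression if the response header says it was applied."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


# Hypothetical repository and credentials.
req = build_request("http://example.org/oai?verb=Identify",
                    "harvester", "secret")
```

A harvester would pass req to urllib.request.urlopen and feed the response body and its Content-Encoding header to decode_body.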








Resources, items and records

resource: ‘David’

item: all available metadata about David (item = identifier)

records: Dublin Core metadata, MARC metadata, SPECTRUM metadata
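
One way to picture the resource / item / record distinction is as a small data structure: an item is named by an identifier and can give out one record per metadata format. This is only an illustrative sketch; the class name, the identifier and the format labels 'marc' and 'spectrum' are assumptions, and the XML strings are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """All available metadata about one resource, named by an identifier."""
    identifier: str
    # Maps a metadata format label to that format's record (XML text).
    records: dict = field(default_factory=dict)

    def get_record(self, metadata_prefix):
        """Return the record in the requested format, or None if absent."""
        return self.records.get(metadata_prefix)


# Hypothetical item describing the resource 'David'.
david = Item("oai:example.org:david")
david.records["oai_dc"] = "<oai_dc:dc>...</oai_dc:dc>"
david.records["marc"] = "<record>...</record>"

print(david.get_record("oai_dc"))    # the Dublin Core record
print(david.get_record("spectrum"))  # None: format not available here
```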







Protocol requests


six different request types


Identify


ListMetadataFormats


ListSets


ListIdentifiers


ListRecords


GetRecord


harvester need not use all types


repository must implement all types


required and optional arguments on request types
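
The six request types and their arguments can be written down as a small table. The sketch below checks a request against that table and returns one of the OAI-PMH 2.0 error codes badVerb or badArgument. It is a simplification: the function name check_request is made up, and resumptionToken handling is left out.

```python
# verb -> (required arguments, optional arguments), per OAI-PMH 2.0.
# resumptionToken is deliberately omitted from this sketch.
VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}


def check_request(verb, args):
    """Return an OAI-PMH error code, or None if the request looks valid."""
    if verb not in VERBS:
        return "badVerb"
    required, optional = VERBS[verb]
    given = set(args)
    missing = required - given
    unknown = given - required - optional
    if missing or unknown:
        return "badArgument"
    return None


print(check_request("ListRecords", {"metadataPrefix": "oai_dc"}))
print(check_request("GetRecord", {"identifier": "oai:example.org:1"}))
print(check_request("Search", {}))
```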







Record structure


metadata about a resource in a
particular XML format


header (mandatory)


identifier (1)


datestamp (1)


setSpec elements (*)


status attribute for deleted item (?)


metadata (mandatory)


XML encoded metadata within root tag
which provides namespace and schema


repositories must support Dublin Core


about (optional)


rights statements


provenance statements
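
To show how the header and metadata parts fit together, here is a sketch that parses one record with Python's standard xml.etree.ElementTree. The namespace URIs are those used by OAI-PMH 2.0, oai_dc and Dublin Core; the sample record itself (identifier, set name, title, URL) is invented for illustration, and the optional about part is left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample record: header plus oai_dc metadata, no 'about' part.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:1234</identifier>
    <datestamp>2003-03-01</datestamp>
    <setSpec>physics</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example eprint</dc:title>
      <dc:identifier>http://example.org/eprints/1234.pdf</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Pull the header fields and DC titles out of one record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        # The status attribute is present only for deleted items.
        "deleted": header.get("status") == "deleted",
        "titles": [t.text for t in record.findall(".//dc:title", NS)],
    }


parsed = parse_record(SAMPLE)
print(parsed)
```

The dc:identifier element in the sample carries the URL at which the content itself would be found, since the protocol carries only metadata.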







Dublin Core


OAI-PMH mandates use of simple DC as lowest common denominator


agreed XML schema


‘oai_dc’


simple DC


15 metadata properties


all DC properties optional and repeatable

Title        Contributor  Source
Creator      Date         Language
Subject      Type         Relation
Description  Format       Coverage
Publisher    Identifier   Rights

http://dublincore.org/
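
Because every simple DC property is optional and repeatable, a record is naturally a list of (property, value) pairs. The sketch below turns such a list into an oai_dc element with Python's standard library. The namespace URIs are the real oai_dc and Dublin Core ones; the function name make_oai_dc and the sample values are assumptions, and the xsi:schemaLocation attribute that a full oai_dc record carries is omitted for brevity.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# The 15 simple DC properties, as lower-case element names.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def make_oai_dc(properties):
    """Build an oai_dc element from (property, value) pairs.

    Properties may repeat or be absent altogether.
    """
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, value in properties:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a simple DC property: {name}")
        ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return root


# Invented sample values; note the repeated creator.
dc = make_oai_dc([
    ("title", "An example eprint"),
    ("creator", "A. Author"),
    ("creator", "B. Author"),
])
xml_text = ET.tostring(dc, encoding="unicode")
print(xml_text)
```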







OAI demonstration


repository explorer demo







OAI and Google

[Diagram: Web site(s), multimedia database(s), eprint archive(s); DP9 gateway. OAI gateway makes harvested metadata available to Google…]







Implementing OAI


OAI protocol is relatively simple


implementation and deployment tends
to be very fast


lots of available toolkits


Java, Perl, PHP, etc.


complete tools also available


e.g. tools that sit in front of existing databases

see ‘tools’ area on the OAI Web site…







Creative Commons


CC is “devoted to expanding the range of creative work available for others to build upon and share”



provides ‘standard’ licences for content


attribution


noncommercial


no derivative works


share alike


mechanisms for indicating licence on
Web pages


need similar mechanism in OAI


http://www.creativecommons.org/







Questions…
