Markup and Metadata

Dec 8, 2013

Session 5
LIS 60639 Implementation of Digital Libraries
Dr. Yin Zhang
1. Guidance for Building Good Digital Collections: Metadata
NISO (2007). A Framework of Guidance for Building Good Digital Collections: Metadata-related sections.
Why do we need metadata for DLs?

One of the most challenging aspects of the digital environment is the
identification of resources available on the Web.

The existence of searchable descriptive metadata increases the
likelihood that digital content will be discovered and used.

The description of individual objects and sets of objects helps to
locate an object and collocate similar/related objects.

Examples of metadata-based access tools:

library catalogs,

archival finding aids,

museum inventory control systems, and

search utilities such as Google.
What is metadata and who should be responsible for creating it?

Metadata is structured information associated with an object for purposes of
discovery, description, use, management, and preservation.

Metadata creation is an incremental process that should be a shared
responsibility among various parts of an institution. Different types of
metadata can be added by different people at various stages of an
information object’s life cycle.

At the creation stage, metadata about an object’s authors, contributors, source,
and intended audience could be provided by the original authors. Creators of
digital objects should be encouraged to embed as much metadata as possible
within the object before it is shared or distributed.

At the organization stage, metadata about subjects, publishing history, and
access rights could be recorded by catalogers or indexers.

At the access and usage stage, evaluative information such as reviews and
annotations could be added by the user.
Types of metadata

Three basic types of metadata:

Descriptive metadata helps users find and obtain objects, distinguish one object or group of objects from another, and discover the subject or contents.

Administrative metadata helps collection managers keep track of objects for such purposes as file management, rights management, and preservation.

Structural metadata documents relationships within and among objects and enables users to navigate complex objects, such as the pages and chapters of a book.
Metadata standards

Various metadata standards have been developed for describing different types of objects and for different purposes.

A typology of data standards created by Anne
Gilliland helps us understand how different
metadata standards are related and work
together.

A metadata schema provides a labeling, tagging or coding system used for recording cataloging information or structuring descriptive records. A metadata schema establishes and defines data elements and the rules governing the use of data elements to describe a resource.
Source: http://gondolin.rutgers.edu/MIC/text/how/catalog_glossary.htm
See Table 4 for a list of DL metadata schemas

This typology of data standards
helps us understand how different
metadata standards are related
and work together.

Depending upon the nature of DL
collections, a single metadata
scheme may not suffice for all
needs. A combination of metadata
schemes may be the best solution.
[Figure: Typology of data standards, created by Anne Gilliland]
Metadata we should know from 60002

Metadata schemas:

MARC

Dublin Core

Data value standards:

LCSH

Data content standards:

AACR2

Data format/technical interchange standards:

MARC21
Choosing metadata standards

The decisions about which metadata standard(s) to
adopt and what levels of description to apply must be
made within the context of

the organization's purpose for creating the collection,

the available human and technical resources,

the users and intended usage, and

approaches adopted within the particular field of inquiry or
knowledge domain.
Metadata Principles
1. Good metadata conforms to community standards in a way that is appropriate to the materials in the collection, users of the collection, and current and potential future uses of the collection.
2. Good metadata supports interoperability.
3. Good metadata uses authority control and content standards to describe objects and collocate related objects.
4. Good metadata includes a clear statement of the conditions and terms of use for the digital object.
5. Good metadata supports the long-term curation and preservation of objects in collections.
6. Good metadata records are objects themselves and therefore should have the qualities of good objects, including authority, authenticity, archivability, persistence, and unique identification.
Discussion and Reflection

Issues raised in this reading

How such issues are addressed in your DL case
2. Metadata Formats and Sharing
Reese & Banerjee (2008)

Ch. 5 Metadata formats

Ch. 6 Sharing data: Metadata harvesting and
distribution

XML and metadata

Given the large quantity of metadata, workflows, and knowledge in
the development of bibliographic data such as MARC, an important
question is how these legacy records and systems will be moved
toward a more XML-centric metadata schema.

In the mid-1990s, the Library of Congress made an important first
step by offering an XML version of MARC in MARC21XML.
MARC21XML was developed to be MARC, but also in XML. It represented a lossless XML format for MARC data, with many of the benefits of XML.

MARCXML/MARC21XML homepage:
http://www.loc.gov/standards/marcxml/

MARC21XML
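
To make the "MARC, but also in XML" idea concrete, here is a minimal sketch in Python. The record is an invented example, not one taken from the reading, and `read_datafields` is a hypothetical helper name. It shows that the numeric MARC tags, indicators, and subfield codes are carried over unchanged as XML attributes.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC 21 slim") records.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">example-0001</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
    <subfield code="c">Jane Doe.</subfield>
  </datafield>
</record>"""


def read_datafields(xml_text):
    """Return {tag: [{subfield code: value}, ...]} for every datafield."""
    root = ET.fromstring(xml_text)
    fields = {}
    for datafield in root.findall("marc:datafield", MARC_NS):
        subfields = {
            sf.get("code"): sf.text
            for sf in datafield.findall("marc:subfield", MARC_NS)
        }
        fields.setdefault(datafield.get("tag"), []).append(subfields)
    return fields


fields = read_datafields(SAMPLE)
print(fields["245"][0]["a"])  # title proper, still addressed by tag 245 $a
```

Note that the reader still needs to know that 245 $a means "title"; the XML wrapper changes the serialization, not the vocabulary.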
Metadata Object Description Schema (MODS) metadata format

MODS homepage: http://www.loc.gov/standards/mods/

The Library of Congress recognized the need for a metadata schema that would be compatible
with the legacy MARC data while providing a new way of representing and grouping bibliographic
data.

These efforts led to the development of the Metadata Object Description Schema (MODS) metadata format. MODS represents:

the next natural step in the evolution of MARC into XML, and

a much simpler alternative that retains its compatibility with MARC.

Developed as a subset of the current MARC21 specification, MODS was created as a richer
alternative to other metadata schemas like Dublin Core.

Differences between MARC21XML and MODS

MARC21XML faithfully transferred MARC structures into XML; the structure of MODS allowed for metadata elements to be regrouped and reorganized within a metadata record.

MODS uses textual field labels rather than the numeric fields of MARC21XML. This change makes MODS records more readable than traditional MARC or MARC21XML records and promotes a design that allows for element descriptions that can be reused throughout the metadata schema.
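
The difference in labeling can be seen in a small sketch. The MODS record below is an invented example and `summarize_mods` is a hypothetical helper name; the element names (`titleInfo`, `title`, `name`, `namePart`) are standard MODS elements.

```python
import xml.etree.ElementTree as ET

# Namespace of MODS version 3 records.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record, for illustration only.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>An example title</title>
  </titleInfo>
  <name type="personal">
    <namePart>Doe, Jane</namePart>
    <role><roleTerm type="text">author</roleTerm></role>
  </name>
</mods>"""


def summarize_mods(xml_text):
    """Pull the title and the names out of a MODS record by element name."""
    root = ET.fromstring(xml_text)
    title = root.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    names = [
        name.findtext("mods:namePart", namespaces=MODS_NS)
        for name in root.findall("mods:name", MODS_NS)
    ]
    return {"title": title, "names": names}


summary = summarize_mods(SAMPLE_MODS)
print(summary)
```

Here the path `titleInfo/title` documents itself, whereas the MARC21XML equivalent would be addressed as datafield 245, subfield a.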

MODS applications

A number of digital library efforts are looking at formalizing MODS support to either replace or
augment Dublin Core only metadata systems. Likewise, groups like the Digital Library Federation
have started recommending that organizations and software designers provide MODS-based OAI
harvesting capability to allow for a higher level of metadata granularity.

E-print systems like DSpace have looked at ways of utilizing MODS either as an internal storage
format or as a supported OAI metadata format.

Digital repositories like Fedora currently utilize a MODS-like metadata schema as the internal
storage schema.

Interest has grown in using MODS for ILS development due in large part to the open-source
development work being done by the Evergreen open-source ILS built around MODS.

MODS was developed as a subset of a number of larger ongoing metadata initiatives at the
Library of Congress:

It was developed, in part, as an extension format to the Metadata Encoding and Transmission Standard
(METS) to provide a MARC-like bibliographic metadata component for METS-generated records.

The MODS schema was tapped as one of the registered metadata formats for SRU/SRW (Search/Retrieve via URL and Search/Retrieve Web Service), the next-generation communication format designed as a replacement for Z39.50.

While MODS was created to work as a stand-alone metadata format that could be used for original record
creation, translating MARC data into XML, or facilitating the harvesting of library materials, it was also
created as part of a larger ongoing strategy at the Library of Congress to create a set of more diverse,
lightweight XML formats that could be used with the library community's current legacy data.
METS (Metadata Encoding and Transmission Standard)

METS page: http://www.loc.gov/mets

METS is not a metadata format utilized for bibliographic description of objects.

METS acts as a container object for the many pieces of metadata needed to
describe a single digital object:


the individual who submitted the digital object may only be responsible for adding information to the bibliographic metadata,

the digital repository itself generates metadata related to the structural information of the digital object, that is, it assembles information about the files that make up the entire digital object (metadata, attached items, etc.).

METS provides a method for binding these objects together so that they can be
transferred to other systems or utilized within the local digital repository system
as part of a larger application profile.

A very basic METS document utilized at Oregon State University for archiving
structural information about digitized text in DSpace
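
As an illustration of the container idea, the sketch below assembles a very small METS document in Python. It is not the Oregon State University document referred to above; the function name `build_mets`, the IDs, and the example URLs are assumptions made for this sketch. It binds a MODS description (`dmdSec`), a file inventory (`fileSec`), and a page order (`structMap`) into one object.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("mods", MODS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def build_mets(title, page_urls):
    """Build a minimal METS element tree for a digitized text (sketch only)."""
    mets = ET.Element(f"{{{METS}}}mets")

    # Descriptive (bibliographic) metadata, wrapped as MODS.
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="MODS")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS}}}title").text = title

    # File inventory and structural map, filled in page by page.
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    struct = ET.SubElement(mets, f"{{{METS}}}structMap")
    book = ET.SubElement(struct, f"{{{METS}}}div", TYPE="book", DMDID="DMD1")
    for order, url in enumerate(page_urls, start=1):
        file_id = f"FILE{order}"
        file_el = ET.SubElement(group, f"{{{METS}}}file", ID=file_id)
        ET.SubElement(
            file_el,
            f"{{{METS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK}}}href": url},
        )
        page = ET.SubElement(book, f"{{{METS}}}div", TYPE="page", ORDER=str(order))
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=file_id)
    return mets


doc = build_mets(
    "An example title",
    ["http://example.org/page1.tif", "http://example.org/page2.tif"],
)
print(ET.tostring(doc, encoding="unicode"))
```

The person submitting the object would typically supply only the title part, while a repository could generate the file and structure sections automatically.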
Free metadata

The library community has historically promoted the idea that
metadata should be freely accessible between systems.

The Z39.50 protocol represented one such manifestation of this
belief.

Over the last decade, many information providers have "freed" their
metadata, embracing the mashup concept either as a business
model or as a user service.

Large information providers like search engines (Google, Yahoo!,
MSN) or social networking services like Flickr and del.icio.us have
moved to offer APIs (application programming interfaces) to provide
easier remote-system access and encourage use of their services
by remote users.
Sharing metadata

Nearly all digital repository platforms provide some
method of sharing metadata within the larger user
community.

Tools, methods, and protocols to crosswalk harvested
content to different metadata schemas:

XSLT (Extensible Stylesheet Language Transformations) is commonly used in XML metadata crosswalking. XSLT is a W3C technology designed to work with XSL (Extensible Stylesheet Language), a style-sheet language for XML. XSLT offers a simple method for transforming an XML document to other formats.

OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) makes metadata available for harvest.
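
A harvest under OAI-PMH is an ordinary HTTP request carrying a `verb` parameter, answered with an XML document. The sketch below builds a `ListRecords` request URL and reads a canned response offline. The base URL, the sample response (which omits the `responseDate` and `request` elements of a real reply), and the helper names are assumptions made for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository's OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Simplified, invented response used instead of a live network call.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvested_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


url = list_records_url("http://example.org/oai")
titles = harvested_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

Unqualified Dublin Core (`oai_dc`) is the metadata format every OAI-PMH repository must support, which is why richer local metadata is so often crosswalked to it before sharing.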
Challenges for metadata crosswalking

Metadata consistency
The crosswalking process must assume that metadata in one format has been consistently
created in order to develop rules/algorithms about how that information should be
represented in other metadata formats.

Schema granularity
Very rarely does crosswalking occur between two metadata schemas that share the same
level of granularity, for example:
Dublin Core: Creator

MARC21: 100, 110, 111, 700, 710, 711, 720

The “spare parts”
Because metadata crosswalking is rarely a lossless process, one often has to decide what
information is "lost" during the crosswalking process. The “spare parts” are the unmappable data
that cannot be carried through the crosswalk.

Dealing with localisms
Localisms are metadata added to make data sort or display in a specific way within a local
system.
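
The granularity and "spare parts" problems can be sketched in a few lines of Python. The creator tags are the ones listed above; the mapping of 245 to title, the sample field values, and the use of tag 949 as a local field are assumptions made for illustration, not a complete crosswalk.

```python
# MARC21 tags that all collapse into the single Dublin Core "creator" element.
CREATOR_TAGS = {"100", "110", "111", "700", "710", "711", "720"}

# Illustrative, deliberately incomplete MARC-to-Dublin-Core mapping.
CROSSWALK = {tag: "creator" for tag in CREATOR_TAGS}
CROSSWALK["245"] = "title"


def crosswalk(marc_fields):
    """Map (tag, value) pairs to Dublin Core and collect what cannot be mapped."""
    dublin_core = {}
    spare_parts = []
    for tag, value in marc_fields:
        element = CROSSWALK.get(tag)
        if element is None:
            spare_parts.append((tag, value))  # no target element: data is lost
        else:
            dublin_core.setdefault(element, []).append(value)
    return dublin_core, spare_parts


record = [
    ("100", "Doe, Jane"),
    ("700", "Roe, Richard"),
    ("245", "An example title"),
    ("949", "local sort key"),  # hypothetical localism
]
dc, spare = crosswalk(record)
print(dc)
print(spare)
```

After the crosswalk, the distinction between the main entry (100) and the added entry (700) is gone, and the local field has nowhere to go; mapping back from Dublin Core to MARC could not restore either.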
Discussion and Reflection

Issues raised in this reading

How such issues are addressed in your DL case
3. Markup and Metadata
Witten & Bainbridge (2003). Ch. 5 Markup and metadata: Elements of organization
Markup and metadata

If documents are the digital library's basic building blocks, markup and metadata are its basic elements of organization.

Markup is used to specify the structure of individual documents and control how they look when presented to the user.

Metadata is used to expedite access to relevant parts of the collection through searching and browsing.

Markup controls two complementary aspects of an electronic document: structure and appearance.

Structural markup makes certain aspects of the document structure explicit: typically section divisions, headings, subsection structure, and enumerated and bulleted lists. These structural items can be considered metadata for the document.

Appearance is controlled by presentation or formatting markup, which dictates how the document appears typographically: page size, page headers and footers, fonts, line spacing, how section headers look, where figures appear, and so on.

Structure and appearance are related by the design of the document, that is, a catalog, often called a style sheet, of how each structural item should be presented.
Markup languages and styles

HTML – HyperText Markup Language

XML - Extensible Markup Language

Presenting marked-up documents

Cascading style sheets: CSS for HTML

Extensible stylesheet language: XSL for XML
Extracting metadata

Automatic extraction of information from text, known as text mining, is a hot research topic.

Plain text documents are designed for people. Readers extract information by understanding their content. Fully automatic comprehension of arbitrary documents using computers and programs remains a challenge.

Structured markup languages such as XML help make key aspects
of documents accessible to computers and people alike.

Fortunately, it is often unnecessary to understand a document in
order to extract useful metadata from it.
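
For marked-up documents this can be as simple as reading the tags. The sketch below uses Python's standard `html.parser` module to collect the title and any named `<meta>` elements from an HTML page; the sample page and the class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html_text):
    """Return a dict of metadata found in the markup of an HTML document."""
    parser = MetadataExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


SAMPLE_PAGE = """<html><head><title>An example page</title>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="metadata, markup">
</head><body><p>Body text.</p></body></html>"""

page_metadata = extract_metadata(SAMPLE_PAGE)
print(page_metadata)
```

Nothing in this code interprets the body text; it relies only on the structure the markup makes explicit.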
Discussion and Reflection

Summary:

Issues raised in this reading

How such issues are addressed in your DL case
4. Trends in Metadata Practices
Carole L. Palmer, Oksana L. Zavalina, & Megan Mustafoff (2007). Trends in metadata practices: A longitudinal study of collection federation. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries (pp. 386-395). New York: ACM.
Background

With the increasing focus on interoperability for distributed digital
content, resource developers need to take into consideration how
they will contribute to large federated collections, potentially at the
national and international level.

At the same time, their primary objectives are usually to meet the
needs of their own institutions and user communities.

This tension between local practices and needs and the more global
potential of digital collections has been an object of study for the
IMLS Digital Collections and Content (IMLS DCC) project.
Aim and approach

Our practical aim has been to provide integrated access to over 160 IMLS-funded digital collections through a centralized collection registry and metadata repository (hereafter referred to as the “IMLS DCC”) based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

During the course of development, the research team has investigated how
collections and items can best be represented to meet the needs of local
resource developers and aggregators of distributed content, as well as the
diverse user communities they may serve.

Research methods:
Surveys, interviews, and case studies. Additional data were collected to
investigate item and collection description and subject access issues
through content analysis, focus groups, and usability studies.
Locally Developed Schemes

Whether used alone or alongside other schemes, locally developed schemes were applied by 29% of projects
in both 2003 (n=94) and 2006 (n=59).

Projects chose to apply a local scheme for a number of reasons:

customization was needed to capture information unique to the materials,
information already recorded in a database or some other local information
source was to be imported, or existing standards did not allow projects to adhere
to their goals.

All (100%, n=17) of the projects using locally developed schemes indicated
access as the primary purpose of their project in their grant proposals, while only
56.9% of other projects (n=51) listed access as a primary purpose of their grant.
Schemes for New Content and Mapping

Schemes for New Content

There are some significant differences with respect to scheme use
between projects that have added new types of data and those that
have not.

Use of Dublin Core was more frequent than MARC for projects adding
new content.

Mapping

63.4% (n=56) of projects have mapped their metadata.

Dublin Core was mapped to most often with 63% (n=35) of projects
mapping to Dublin Core.

26% (n=35) of projects have mapped to MARC.

“Other Standards” and MODS were the next highest at 14% and 12%
(n=35), respectively.

Overall, 41% (n=35) of projects that have done mapping have mapped
to multiple schemes.
Decision factors on choosing metadata

Choice of metadata scheme(s) was influenced by the
following factors:

the overall degree to which a standard had been adopted by peer
institutions was an important consideration

the compatibility with local systems

Content management system software also influenced text encoding
decisions

Knowledge and skill
Many library-based digital collection developers chose MARC because it allowed for
more granularity in description than Dublin Core while also being the easiest to
implement, since their staff were already proficient in using MARC.
Problems

The three most commonly reported problems with description were:

consistent application of the chosen metadata scheme within a project,

identification and application of controlled vocabularies, and

integration of sets of data, schemes, and vocabularies either within an
institution or among collaborators.

In addition, there were clear tensions between local practices and
what was perceived as the best for interoperability.

One project that began with Dublin Core decided against using it
part way into the grant, favoring MARC and TEI for representing the
texts in their collection. Later they ended up mapping their metadata
back to Dublin Core for OAI interoperability.
Problems and solutions

Some of the unique content in digital collections in the IMLS DCC cannot be
adequately described by existing metadata schemes.

Resource developers often look for examples of how similar content has been
described by other projects.

Local, home-grown metadata schemes often are developed when no
suitable standard can be identified.

Folksonomies and social tagging collected from the end-user community were
named as one of the term sources for such home-grown schemes.

Smaller digital collections that do not have resources to develop local
schemes sometimes end up compromising the richness of description to
implement Dublin Core.

The unstable standards environment has made it difficult to advance without
shifts, reconsiderations, and adaptations in original metadata plans to
support interoperability and shareability of metadata.
Discussion and Reflection

Issues raised in this reading

How such issues are addressed in your DL case