
NERC DataGrid

Of Metadata, Vocabularies,
Governance and the Road to

Roy Lowry

British Oceanographic Data Centre

NERC Data Management Workshop, 2007


Presentation Outline

Metadata Evolution

Controlled Vocabularies

Mapping and Vocabulary Entities

Ontology Technology


Metadata Evolution

In the beginning there was plaintext

In the 1980s the enlightened few saw the value and
potential of metadata

Most data suppliers saw metadata as an unnecessary
waste of their valuable time

The enlightened tried to change this by making
metadata creation as easy as possible

Plaintext is much easier to create than structured
metadata so metadata formats based on plaintext
fields such as EDMED were promoted and populated


Metadata Evolution

The problem is that, whilst humans have the intelligence to
read and understand plaintext, it is of very limited use to
a computer

Consider some example EDMED plaintext parameter descriptions:

A wide variety of chemical and biological parameters

CTD data

Amplitude de l'echo retrodiffuse

Cu, Zn, Fe, Pb, Cd, Cr, Ni in biota


Consequently, the chance of these data sets being
discovered by conventional parameter searching is
virtually zero
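The failure of conventional parameter searching can be illustrated with a minimal sketch. It uses the four example descriptions quoted above; the `search` helper and the search terms are assumptions made for illustration and are not part of EDMED or any real search system.

```python
# The four EDMED plaintext parameter descriptions quoted above.
descriptions = [
    "A wide variety of chemical and biological parameters",
    "CTD data",
    "Amplitude de l'echo retrodiffuse",
    "Cu, Zn, Fe, Pb, Cd, Cr, Ni in biota",
]


def search(term, records):
    """Return the records whose text contains the term (case-insensitive)."""
    needle = term.lower()
    return [record for record in records if needle in record.lower()]


# Hypothetical searches a user might reasonably try.
salinity_hits = search("salinity", descriptions)        # CTD data do include salinity
copper_hits = search("copper", descriptions)            # 'Cu' is copper
backscatter_hits = search("backscatter", descriptions)  # the French entry describes it

print(salinity_hits, copper_hits, backscatter_hits)
```

All three searches come back empty even though a human reader would judge one of the entries relevant in each case.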


Metadata Evolution

Plaintext kills interoperability stone dead

In 2005 I was asked to build Antarctic Portal DIFs
from Dutch EDMED entries

DIF is a structured format representing (before the
latest changes) each parameter thus:


<Category>EARTH SCIENCE</Category>





How does one derive this automatically from
the plaintext ‘CTD data’, ‘T/S’, ‘temp+salin’,
‘temperature and salinity’ and all the other
variants in EDMED?

Needless to say, it never happened…



The key to interoperability is to
replace plaintext descriptions by
keywords from controlled vocabularies
Controlled vocabularies

Ensure consistent spellings

Ensure entities are described using
the same words in the same order
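As a rough sketch of what a controlled vocabulary buys, the snippet below accepts only keywords that appear in a fixed list and rejects free-text variants. The keys, terms and the `validate` helper are hypothetical and chosen purely for illustration.

```python
# Hypothetical controlled list: keys and terms are illustrative assumptions.
CONTROLLED_TERMS = {
    "TEMP": "Temperature of the water column",
    "PSAL": "Salinity of the water column",
}


def validate(keywords, vocabulary):
    """Split keywords into those found in the vocabulary and those that are not."""
    accepted = [k for k in keywords if k in vocabulary]
    rejected = [k for k in keywords if k not in vocabulary]
    return accepted, rejected


accepted, rejected = validate(
    ["TEMP", "temp+salin", "PSAL", "T/S"], CONTROLLED_TERMS
)
print(accepted)
print(rejected)
```

Because every record carries the same key for the same concept, a search on that key finds all of them, which plaintext variants cannot guarantee.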


Controlled vocabularies have been around for
decades in the oceanographic domain


Content governance was total anarchy

Decisions made by individuals

even students

Terms were set up and used with inadequate thought
about their meaning and formal definitions were
conspicuous by their absence

Technical governance wasn’t much better

No formal maintenance or versioning

Vocabularies delivered on an ad hoc basis as CSV
files on FTP servers or web sites

Data models differed from one vocabulary to the next




All this has changed

Content governance

Infrastructure has been established

Project teams who care about vocabulary
content (e.g. NERC DataGrid and

CF Standard Names Committee and e-mail list
(climate science domain)

SeaDataNet and MarineXML Vocabulary
Content Governance Group (SeaVoX) e-mail list
(marine domain)



All this has changed

Technical governance

NERC DataGrid Vocabulary Server has been developed

Vocabulary data model:

Entries comprising a key, a term, an abbreviated
term and a definition

Aggregated into lists and super-lists
corresponding to real-world sub-classes and
classes

Vocabularies managed in an Oracle system with
automated versioning and audit trail maintenance

Vocabularies served through a Web Service API

Clients using this interface are available





Server in operational use for NERC DataGrid and EU
SeaDataNet distributed data system projects
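The vocabulary data model described above (entries with a key, a term, an abbreviated term and a definition, aggregated into lists) might be sketched as follows. This is an illustration only: the class names, field names and sample entry are assumptions, and it does not represent the Vocabulary Server's Oracle schema or its Web Service API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """One vocabulary entry: key, term, abbreviated term and definition."""
    key: str
    term: str
    abbreviated_term: str
    definition: str


@dataclass
class VocabularyList:
    """A named list of entries, indexed by key."""
    name: str
    entries: dict = field(default_factory=dict)

    def add(self, entry):
        # Keys must be unique within a list.
        if entry.key in self.entries:
            raise ValueError(f"duplicate key: {entry.key}")
        self.entries[entry.key] = entry

    def lookup(self, key):
        return self.entries[key]


# Hypothetical list and entry, for illustration only.
parameters = VocabularyList(name="example-parameters")
parameters.add(
    Entry(
        key="TEMP",
        term="Temperature of the water column",
        abbreviated_term="WC_Temp",
        definition="Illustrative definition text.",
    )
)
print(parameters.lookup("TEMP").term)
```

A super-list could be modelled in the same way, as a named collection of `VocabularyList` objects.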


Mapping and Vocabulary Entities

Distributed data systems need to:

Link into established legacy systems

Provide the basis for semantic
interoperability with other metadata and
data systems

Maps need to be built between legacy,
internal and external vocabularies if this
is to be achieved

Mapping is made difficult because
legacy vocabularies have poorly defined

But this is only part of the problem…


Mapping and Vocabulary Entities

A ‘Statement of the Obvious’

Each vocabulary entry describes an instance of a ‘thing’ or
entity in the real world

These ‘things’ should share a common set of attributes
making them members of a consistent level in a class

In other words, it should be possible to apply a common
formal definition to all the real-world ‘things’ described by
the entries in a list

This formal definition should be compatible with the
definition of the metadata field populated by the entries
from that list

Unfortunately, the obvious wasn’t obvious when some of
the vocabularies commonly used in oceanographic data
management were being developed


Mapping and Vocabulary Entities

Inconsistent entity definitions

Cruise Summary Report (ROSCOP) ‘parameters’

Atmospheric chemistry: a domain

Phosphate: a chemical species

Grab: a sampling device

Bottom photography: a shipboard activity

Sea level: a phenomenon

Bathythermograph: a sensor class

GCMD Instrument Keywords

AA Spectrophotometry: a sample analysis

Neuston Net: a sampling device

SEAWIFS: a platform

Altimeter: a sensor class

Makes mapping much more difficult or even impossible
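The inconsistency can be made explicit by recording the entity class of each entry and counting how many distinct classes a single list contains. The sketch below uses the ROSCOP examples listed above; the helper functions and the second, consistent list are assumptions for illustration.

```python
from collections import Counter

# ROSCOP 'parameter' examples from the list above, with the entity class of each.
roscop_parameters = {
    "Atmospheric chemistry": "domain",
    "Phosphate": "chemical species",
    "Grab": "sampling device",
    "Bottom photography": "shipboard activity",
    "Sea level": "phenomenon",
    "Bathythermograph": "sensor class",
}


def entity_classes(vocabulary):
    """Count how many entries fall into each entity class."""
    return Counter(vocabulary.values())


def is_consistent(vocabulary):
    """A list is consistent when all its entries share one entity class."""
    return len(entity_classes(vocabulary)) <= 1


# Hypothetical list in which every entry is the same kind of thing.
sampling_devices = {"Grab": "sampling device", "Neuston Net": "sampling device"}

print(len(entity_classes(roscop_parameters)))
print(is_consistent(roscop_parameters), is_consistent(sampling_devices))
```

Six entries spanning six entity classes means that no single formal definition can cover the list, which is what makes mapping its entries to another vocabulary so hard.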


Mapping and Vocabulary Entities

Data model ‘repairs’

One-to-one relationships turned into one-to-many by
creating vocabulary entries that are themselves lists

Examples from a ‘data type’ vocabulary

Moored profiling CTD + acoustic current meter

Turbulence/dissipation data

Multiple data types


Multiple data types


Makes mapping much more difficult or even impossible


Mapping and Vocabulary Entities


A shoe-horn is an implement for forcing a
large foot into a small shoe

In metadata terms, it means using a
structure or record to describe something
for which it wasn’t originally designed

This is generally accomplished through the
‘imaginative’ usage of vocabularies, which
inevitably corrupts the entity definition

Makes mapping much more difficult or even impossible


Mapping and Vocabulary Entities

These problems are surmountable: it just takes
a lot of effort, particularly when issues are
distributed throughout legacy data

The result of vocabulary development and
mapping work is a managed framework

A collection of lists covering the fields of our metadata

A population of well-understood and well-defined list entries

Mappings between the list entries



Ontologies are a technology that delivers
knowledge management for software

Consist of classes (lists), instances (list
entries) and relationships (mappings)

Ontology technology is relevant to

The implementation of interoperability
through development of semantic cross

Underpinning semantic data discovery (look
for pigments, find chlorophyll)
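The ‘look for pigments, find chlorophyll’ idea can be sketched as a search that first expands the query term through narrower-term relationships. The relationships, data set names and helper functions below are hypothetical; a real system would draw them from the ontology rather than a hard-coded dictionary.

```python
# Hypothetical broader -> narrower relationships between terms.
NARROWER = {
    "pigments": ["chlorophyll", "carotenoids"],
    "chlorophyll": ["chlorophyll-a"],
}

# Hypothetical data sets, each tagged with controlled keywords.
DATASETS = {
    "dataset-1": {"chlorophyll"},
    "dataset-2": {"salinity"},
    "dataset-3": {"chlorophyll-a"},
}


def expand(term, narrower):
    """Return the term plus every term reachable through narrower links."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(narrower.get(current, []))
    return found


def discover(term, datasets, narrower):
    """Return the names of data sets tagged with the term or any narrower term."""
    wanted = set(expand(term, narrower))
    return sorted(name for name, keywords in datasets.items() if keywords & wanted)


semantic_hits = discover("pigments", DATASETS, NARROWER)
literal_hits = discover("pigments", DATASETS, {})  # no relationships: exact match only
print(semantic_hits, literal_hits)
```

Without the relationships the search for ‘pigments’ finds nothing; with them it finds the chlorophyll data sets.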



XML technologies

Resource Description Framework (RDF)

Fundamentally comprises subject and object URIs
linked by a relationship (RDF triples)

Very powerful tool for managing mappings

Web Ontology Language (OWL)

RDF-based schema describing classes, instances
and relationships

W3C recommended ontology format

Simple Knowledge Organisation System (SKOS)

Another RDF-based schema that does what it says on
the tin

Very good set of concept mapping relationships
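To show how RDF triples can hold vocabulary mappings, here is a minimal sketch that stores each mapping as a (subject, relationship, object) tuple and uses SKOS mapping properties as the relationship. The `example.org` URIs and the `objects` helper are assumptions; a real deployment would use a triple store and SPARQL rather than a Python set.

```python
# SKOS namespace; exactMatch and broadMatch are SKOS mapping properties.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical concept URIs from two vocabularies.
triples = {
    (
        "http://example.org/vocabA/temperature",
        SKOS + "exactMatch",
        "http://example.org/vocabB/TEMP",
    ),
    (
        "http://example.org/vocabA/chlorophyll",
        SKOS + "broadMatch",
        "http://example.org/vocabB/pigments",
    ),
}


def objects(subject, predicate, store):
    """Return every object linked to the subject by the given relationship."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)


exact = objects("http://example.org/vocabA/temperature", SKOS + "exactMatch", triples)
broad = objects("http://example.org/vocabA/chlorophyll", SKOS + "broadMatch", triples)
print(exact, broad)
```

Each mapping is a single self-describing statement, so mappings between any number of vocabularies can be added without changing the data structure.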





Open source RDF triple store with inference engine,
query language (SPARQL) and much more


Open source ontology management system.
Cumbersome, but free.

TopBraid Composer

Commercial ontology management system including
an integrated inference engine and Subversion client
in an Eclipse environment. Slick but expensive.



Work is currently well underway in NERC DataGrid
and SeaDataNet on building a parameter
ontology based on terms from BODC, CF and the
NASA Global Change Master Directory

Work has started in SeaVoX on a ‘devices’
ontology to categorise data production tools
used in the oceanographic domain

specific thesaurus implemented in
NERC DataGrid Vocabulary Server to support
semantic discovery


Some Conclusions

Plaintext is a destroyer of interoperability

Multiple controlled vocabularies covering a
common domain topic are undesirable, but
repairable by mapping/ontology building

Controlled vocabularies need rigorous
management for operational mapping to be

Mapping issues are not confined to vocabulary
entries but also crop up in vocabulary topics,
making mapping between entries much more
difficult

That’s All Folks

Thank you for your attention

Any questions?