Federal Data Architecture Subcommittee Linked Open Government Vocabularies

Oct 22, 2013

Purpose


This paper discusses an approach to open government vocabularies as part of the World Wide Web. It is intended for technology leaders and policymakers in government who are concerned with vocabularies, open government or data interoperability.

What Is The Open Government Vocabulary Working Group?

Governments produce and consume vast quantities of structured information using schema and vocabularies that are particular to their systems and communities, yet overlapping in their concerns and coverage. The mission of the Open Government Vocabulary (OGV) Working Group (WG) of the Federal Data Architecture Subcommittee (DAS) is to develop and facilitate execution of an approach that enables better understanding, sharing and use of government data through well-defined vocabularies. The OGV WG will achieve its mission by providing guidance and adopting or adapting open standards supporting the accessibility, discoverability and federation of vocabularies and schema. Data.gov will be used as a resource and proving ground for federated open vocabularies.

More explicitly, the OGV WG intends to:

• Provide a catalog of vocabularies created by or recognized by the U.S. government
• Recommend enabling standards, best practices, processes and technologies
• Make recommendations on how to express and harmonize vocabularies
• Provide a forum for horizontal federation of vocabularies across domains
• Recommend a vocabulary for describing vocabularies and how vocabulary concepts are related
• Provide information on how vocabularies can provide value to government and address initiatives such as the open government initiative and the 25 point implementation plan to reform federal information technology management
• Describe how vocabularies facilitate shared data, interoperability, collaboration and shared services.

What Isn’t the Open Government Vocabulary Working Group?

As important as it is to understand what the OGV WG is, what we are not doing is equally important.

• We are not setting policy or creating standards.
• We are not creating a universal vocabulary or taking on the semantic integration of domain vocabularies.
• We are not suggesting any particular vocabulary as the “center”; in our view this is where some prior efforts have taken the wrong road. The alternative view is that there are multiple vocabularies that can play the role of reference vocabularies as appropriate for the context in question.
• We do not intend to provide a singular or exclusive data warehouse for vocabularies that attempts to hold every term and concept.

Background

The President’s open government initiative directs agencies to be more open, participatory and collaborative. A major enabler of this initiative is to make government data open and accessible to the public unless access must be restricted for privacy or security reasons.

Many government programs routinely define standard schema and vocabularies. These valuable efforts have the community and authority to define the terms, tags and concepts of their data, thus addressing data problems within their scope. However, on a global “internet scale” it is not practical to define a single vocabulary and schema that all must use. An alternative approach is required to ensure a common understanding of data from disparate and autonomous sources such as those developed by consensus standards groups and communities of interest.

The alternative approach the OGV WG suggests is to federate the data, which means each data set can be associated with other data without any changes to data properties.
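As a rough illustration of what associating data sets "without any changes to data properties" could look like, the sketch below leaves two data sets with their own property names and keeps the links to shared vocabulary terms in a separate table. The data sets, property names and example.org URIs are all hypothetical and are not drawn from any actual Data.gov data set.

```python
# Hypothetical vocabulary term URIs (example.org is a reserved example domain).
NAME = "http://example.org/vocabularies/facilities/name"
POSTAL_CODE = "http://example.org/vocabularies/facilities/postalCode"

# Two autonomous data sets, each with its own property names.
hospitals = [{"fac_name": "General Hospital", "zip": "20001"}]
clinics = [{"facilityTitle": "City Clinic", "postal_code": "20002"}]

# Links are kept beside the data, not in it: local property -> vocabulary term.
hospital_links = {"fac_name": NAME, "zip": POSTAL_CODE}
clinic_links = {"facilityTitle": NAME, "postal_code": POSTAL_CODE}


def federated_values(sources, term):
    """Collect values for one vocabulary term across all sources.

    `sources` is a list of (records, links) pairs. The records are only
    read, never modified.
    """
    values = []
    for records, links in sources:
        # Find the local property names that are linked to the term.
        local_names = [name for name, uri in links.items() if uri == term]
        for record in records:
            for name in local_names:
                if name in record:
                    values.append(record[name])
    return values


sources = [(hospitals, hospital_links), (clinics, clinic_links)]
print(federated_values(sources, NAME))
```

The point of the design is that a new data set joins the federation by publishing its own link table; neither existing data set has to rename a property.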



What Are Vocabularies and Schema?

The focus of the OGV WG is on provisioning vocabularies produced by the government on the semantic web: the terms, phrases and descriptions that describe what concepts, and therefore data, mean. Such vocabularies are independent of a particular application or structure, so are less specific than a schema, such as the schema for a Structured Query Language (SQL) Database Management System (DBMS) or Extensible Markup Language (XML) message. The intent of the OGV WG is to enable application-specific schema to reference open government vocabularies as a basis for federating the underlying data. Existing schema can also be mined for and related to their vocabularies, vocabularies that may prove useful in other domains and other schema. Vocabularies also typically provide capabilities for synonyms, antonyms, categorization and other relationships between terms and concepts that assist in their definition and understanding.

What Are Open Government Vocabularies?



• Vocabularies in general are sets of terms used by social groups to represent concepts.
• Open vocabularies are those defined by a community or authority but made available to a larger community, frequently open to all.
• Government vocabularies are defined by or recognized by a government authority.

Open government vocabularies are sets of terms representing concepts defined or recognized by a government authority and made available to a wider community.

The focus of the OGV WG is on vocabularies that are fully open and public. However, the OGV WG approach may also be applied to more restricted vocabularies of private communities and organizations.

While OGV may develop or recognize a few cross-cutting vocabularies, it is expected that government departments and agencies will be the authorities on specific vocabularies and recommend their adoption using the processes and technologies recommended by DAS.


Examples of Applying OGV to Open Government

Data.gov

Data.gov is the on-line portal of government datasets available to the public. To date, over 270,000 government datasets have been published and a large portion of these datasets comes from the geospatial community. The accessibility of government datasets is increased by the use of semantic web standards of the World Wide Web Consortium (W3C). The challenge is that many of these datasets use different schema and vocabularies in their definitions. Finding the data, understanding it, analyzing it, repurposing it and making connections between different datasets currently requires a substantial manual effort. Lack of schema and vocabulary federation is a key factor hindering information sharing in a networked world.

National Information Exchange Model (NIEM)

NIEM, the National Information Exchange Model [http://www.niem.gov], is a partnership of the U.S. Department of Justice, the U.S. Department of Homeland Security, and the U.S. Department of Health and Human Services. It is designed to develop, disseminate and support enterprise-wide information exchange standards and processes that can enable jurisdictions to effectively share critical information in emergency situations, as well as support the day-to-day operations of agencies throughout the nation.

The success of NIEM could be greatly enhanced by being able to embrace and integrate NIEM and exchange models based on other standards. One ingredient to doing this would be open and federatable vocabularies as envisioned by OGV. In addition, the technology independence of OGV would complement the XML focus of NIEM exchange packets.

DoD Universal Core (UCore)

Universal Core (UCore) is a Federal information sharing initiative that supports the National Information Sharing Strategy and all associated Departmental / Agency strategies. UCore enables information sharing by defining an implementable specification (XML Schema) containing agreed upon representations for the most commonly shared and universally understood concepts of who, what, when, and where.

Requirements

The requirements for open and federated vocabularies and schema are three-fold:



• Accessibility¹: vocabularies and schema must be accessible and visible so that they can be referenced, used and reused. For this purpose Data.gov provides us with a starting point; vocabularies and schema are just another kind of data, data which can and will be published on the web. Converting existing vocabularies and schema to web data standards makes vocabularies and schema directly accessible as standard web artifacts, just like a web page. But, unlike a web page, semantic web data can be understood and processed by software, not only humans. To support accessibility the OGV WG must find and make available existing government and government-endorsed vocabularies in a standard web data form.

• Discoverability: making something available does not necessarily make it easy to find or trust. The “metadata” about the government vocabularies and schema provides information about who developed them, what their purpose is and other facts that help us find government vocabularies and understand their usefulness to us. Data.gov already provides some metadata; this metadata can be extended for government vocabularies and schema in general.

• Federation: much overlap and redundancy exists between vocabularies and schema. Instead of trying to develop a “universal vocabulary”, the OGV WG suggestion is to connect and leverage the available vocabularies. The semantic web and other existing standards provide mechanisms to federate these schema and vocabularies so that we can understand other people’s data from our own perspective. By linking the terms and concepts of these vocabularies we provide the foundation for federating the data defined in those schema and vocabularies and, ultimately, the organizations, services and processes that consume and produce the data.
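The discoverability requirement above can be pictured as a small catalog of metadata records about vocabularies. The following is only a sketch under stated assumptions: the record fields (uri, title, publisher, purpose), the two catalog entries and the example.org URIs are hypothetical and do not reflect the actual Data.gov metadata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabularyRecord:
    """Metadata about one vocabulary (hypothetical fields)."""
    uri: str
    title: str
    publisher: str
    purpose: str


def find_vocabularies(catalog, keyword):
    """Return records whose title or purpose mentions the keyword."""
    needle = keyword.lower()
    return [
        record for record in catalog
        if needle in record.title.lower() or needle in record.purpose.lower()
    ]


catalog = [
    VocabularyRecord(
        uri="http://example.org/vocabularies/facilities",
        title="Health Facilities",
        publisher="Example Agency A",
        purpose="Terms for describing hospitals and clinics",
    ),
    VocabularyRecord(
        uri="http://example.org/vocabularies/geography",
        title="Place Names",
        publisher="Example Agency B",
        purpose="Terms for describing locations",
    ),
]

for record in find_vocabularies(catalog, "hospital"):
    print(record.uri, "published by", record.publisher)
```

Recording the publisher alongside each vocabulary is what lets a reader judge not only whether a vocabulary exists but whether to trust it.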


Guiding Principles and Approach

The following are the principles guiding our approach:



• Don’t re-invent: reuse and federate what exists. Adapt existing standards that are fit for purpose.
   o Reuse existing standards (be they Voluntary Consensus Standards Organizations (VCSO), de facto, or .gov oriented) where they exist, as they are, wherever/whenever possible.
   o Engage the .gov/.org/.com and VCSO Communities of Interest and Communities of Practice (CoI/CoP) for design, implementation input and usage requirements, leveraging a government collaborative environment.
   o Develop an approach for government stakeholders to publish (internal) or recognize (external) existing vocabularies.
   o Select or define vocabularies for representing vocabularies based on standards.

• Don’t be technology specific: create abstract syntaxes for reused and generated metadata/vocabularies using Unified Modeling Language (UML) and implement using tools that facilitate syntax transformations.
   o All vocabularies should be available in machine-processable concrete syntaxes.
   o Define and publish open source mappings/transformations to concrete syntax as required.
   o Define Resource Description Framework Schema (RDFS) as a concrete syntax available for all open government vocabularies.

• Federate vocabularies to expose common and related concepts in support of linking, mapping and analyzing the underlying data that may use non-federated vocabularies.
   o Develop a standards-based approach to link vocabularies.
   o Provide linkages for generic / cross-cutting vocabularies.
   o Provide information and capabilities to help domain-specific communities link their vocabularies.

• Support and leverage Data.gov.
   o Review existing standards for Data.gov purpose fitness (focusing on cross-domain concerns).
      ▪ Start with cross-domains in Data.gov that are machine processable.
   o Reuse/create cross-cutting domain vocabularies and data set schema on Data.gov for use by multiple efforts (publishing, licensing, organizations, catalogs, provenance, statistics, linking, etc.).
   o Publish vocabularies, vocabulary links and vocabulary mappings on Data.gov in RDF.

• Publish the resulting design approach and best practices for government.

¹ Accessibility in this context means being available as web information and should not be confused with 508 compliance.

Leveraging the Semantic Web Standards and Technologies

In our effort to leverage the semantic web, we need to distinguish two very different semantic technologies available: Web Ontology Language (OWL) and Resource Description Framework (RDF). OWL is a foundation for formal ontologies and RDF, or Linked Open Data, is a basis for machine readable web data (think of the web as a kind of Database Management System (DBMS)).

RDF (which is now about 10 years old) is a very simple structure for representing information on the web. The intent of this technology is to augment human readable web pages with machine readable data resources. For example, there is “DBpedia”, a semantic web based representation of the information in Wikipedia. The semantic web and RDF are the foundation for “linked open data”. We will not go into a technical description since there are many resources that provide additional information (look for “RDF triples”). But it is important to recognize the core concepts: EVERYTHING has a web URI and you can “dereference” those web URIs to get data about ANYTHING published by ANYONE you trust.

It is not very hard to publish existing data (from almost any form, such as SQL or XML) as linked open data in RDF. RDF data is described by RDF vocabularies, which are also RDF data. These are called RDF Schema, or RDFS. You can have any number of vocabularies in RDF.

To make that concrete for us, every term and concept in an open government vocabulary would have a web Uniform Resource Identifier (URI) that anyone can access from anywhere (assuming rights are granted). RDF is just a set of links between web resources. URIs are required over Uniform Resource Locators (URLs) because URIs provide an identifier for the resource while URLs provide location.
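To illustrate the idea that RDF is "just a set of links between web resources", here is a minimal sketch that holds a vocabulary term as subject, predicate, object triples and writes them out as N-Triples lines. The rdf:type, rdfs:Class and rdfs:label URIs are the standard W3C ones; the example.org vocabulary and its Hospital term are hypothetical, and marking literals with a ("literal", text) pair is a simplification made for this sketch. A real project would more likely use an existing RDF library.

```python
# Standard W3C URIs for the RDF and RDFS terms used below.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical vocabulary and term (example.org is a reserved example domain).
VOCAB = "http://example.org/vocabularies/facilities"
HOSPITAL = VOCAB + "/Hospital"

# Each triple is (subject, predicate, object). URIs are plain strings;
# literal values are marked as ("literal", text) to tell them apart.
triples = {
    (HOSPITAL, RDF_TYPE, RDFS_CLASS),
    (HOSPITAL, RDFS_LABEL, ("literal", "Hospital")),
}


def to_ntriples(triples):
    """Serialize triples as N-Triples lines, sorted for stable output."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, tuple):
            text = (o[1].replace("\\", "\\\\").replace('"', '\\"')
                        .replace("\n", "\\n").replace("\r", "\\r"))
            obj = f'"{text}"'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return sorted(lines)


def describe(triples, subject):
    """Return every (predicate, object) pair stated about one URI."""
    pairs = [(p, o) for s, p, o in triples if s == subject]
    return sorted(pairs, key=lambda po: (po[0], str(po[1])))


print("\n".join(to_ntriples(triples)))
```

Because the term is itself described with triples, the vocabulary is RDF data like any other, which is the sense in which RDF vocabularies "are also RDF data".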

OWL is a “logical” language based on description logics and first order logic. Initially it was only defined as an RDF vocabulary but OWL 2 has evolved and now has an RDF vocabulary as well as other syntaxes. You use OWL when you want to define deeper semantics as an “ontology”.

Note there are other RDF vocabularies that are standard, such as SKOS, which comes out of the library community. OWL is one vocabulary we could use in its entirety, in part or not at all. If we use RDFS we will need some vocabulary as a foundation and this group will recommend a set of such vocabularies. The challenge is in the fact that there are already many to choose from.

OGV WG Recommendation: Use RDF, RDFS and linked open data as the foundation for representing vocabularies as web data. RDF Linked Open Data is getting traction and will work well for our purpose. We are not convinced that OWL (or description logics in general) is the right foundation for us. Jim Hendler likes to say “a little semantics goes a long way”, and OWL may be more than a little.

So, this means that the OGV WG will pick the vocabulary/vocabularies we want to use to represent vocabularies, but we will be able to represent these as RDF linked open data. Our initial direction is to leverage the ISO standard vocabularies. We may borrow one or a few terms from OWL, but will not commit to a full ontological representation. The nice thing is that if someone wanted to add full OWL (or anything else) to a vocabulary they could do so without disturbing anything else.

We suggest adopting RDFS Linked Open Data as our core representation of vocabularies but that OWL not be fully embraced at this time.


These vocabularies can be published on Data.gov, but with the understanding that Data.gov can be one of many portals of access. Open government vocabularies can also be published in other portals on the web of data, allowing increased discovery and use of vocabularies of interest. The combination of Data.gov and an RDFS representation will provide a set of vocabularies that can be linked and cross referenced globally without a single point of control. This lightweight usage of Linked Open Data does not have to be hard or expensive.


Giving Vocabularies, Concepts and Terms a web ID

The foundational concept that binds the web together is the URI: everything on the web has a URI; these are usually the things that start with “HTTP://”, etc. The URI is the identifier for some web resource. Most people are familiar with the use of a URI for a web page, but anything can have a URI. My cat can have a URI. What that does, giving my cat a URI, is allow someone on the web to say something about it. How do they know it is my cat? Because it has the same URI! URIs give us a universal identifier scheme. That doesn’t mean that you can only have one URI; many resources on the web with many URIs could describe my cat, so something can have any number of these URIs, but the URI only refers to one thing.

The same can be true of a vocabulary: an entire vocabulary can have a URI. This allows us to talk about the vocabulary, who created it and when, who supports it, etc. This is the metadata about the vocabulary represented by that URI. So a vocabulary could have a URI like: http://modeldriven.org/vocabularies/aboutcats. What is interesting is that once this vocabulary has such a URI anyone can reference it; they can look it up, they could even “say something” about it, like “what a silly vocabulary http://modeldriven.org/vocabularies/aboutcats is”.

We can then go a bit deeper: each term in that vocabulary can have a URI as well; it could be something like http://modeldriven.org/vocabularies/aboutcats/Cat. Now, anyone can also look up the term “Cat” in this vocabulary; it is that vocabulary publisher’s “opinion” about the term “Cat”. It may or may not be the same as yours, but the fact that you can point at that particular term unambiguously is very powerful.

But how do you know how this vocabulary relates to anything else? One of the “facts” you can put in a vocabulary is that a concept means the same thing as something else; this can be done with the linking term “sameAs”. There may be a statement in my vocabulary that says that http://dbpedia.org/page/Cat is the sameAs http://modeldriven.org/vocabularies/aboutcats.rdf. This means that both resources describe the same thing. What is on http://dbpedia.org/page/Cat? This is data pulled from Wikipedia on cats. DBpedia is frequently used as a vocabulary of reference, but any vocabulary can play that role. So if five vocabulary terms all say they are the same as http://dbpedia.org/page/Cat, we know they are all talking about the same thing, a pet (and not the Broadway play, “Cats”).
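The way “sameAs” links let independent vocabularies converge on a shared meaning can be sketched in a few lines. The example below groups URIs that are connected by owl:sameAs statements, treating the link as symmetric and transitive, using a small union-find structure. The owl:sameAs URI is the standard W3C one and the DBpedia and modeldriven.org URIs come from the text above; the example.org terms are hypothetical, added only to show a second, unrelated group.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

DBPEDIA_CAT = "http://dbpedia.org/page/Cat"
ABOUTCATS_CAT = "http://modeldriven.org/vocabularies/aboutcats/Cat"
# Hypothetical terms from other vocabularies.
PETS_FELINE = "http://example.org/vocabularies/pets/Feline"
THEATER_CATS = "http://example.org/vocabularies/theater/CatsMusical"
SHOWS_CATS = "http://example.org/vocabularies/shows/Cats"


def same_as_groups(triples):
    """Group URIs that are connected, directly or indirectly, by sameAs."""
    parent = {}

    def find(uri):
        # Follow parent pointers to the group's representative URI.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # shorten the path as we go
            uri = parent[uri]
        return uri

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            root_s, root_o = find(subject), find(obj)
            if root_s != root_o:
                parent[root_s] = root_o

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


links = [
    (ABOUTCATS_CAT, OWL_SAME_AS, DBPEDIA_CAT),
    (PETS_FELINE, OWL_SAME_AS, DBPEDIA_CAT),
    (THEATER_CATS, OWL_SAME_AS, SHOWS_CATS),
]

for group in same_as_groups(links):
    print(sorted(group))
```

Neither pet vocabulary refers to the other, yet both land in the same group because each points at the same reference term; the two theater terms stay in a group of their own.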

In summary, all vocabularies, terms and concepts should have a URI. Each URI should point to both data and a web page that shows the data about that vocabulary, term or concept. Doing this one simple thing makes our vocabularies universally accessible on the semantic web.

Permanence of the URI

Imagine that 20,000 other things all pointed to http://dbpedia.org/page/Cat and then it just disappeared. That would be bad! Vocabularies that are intended as points of reference need permanence; they need the stewardship to make sure that the URIs retain the same meaning and that the supporting technology stays around for a very long time. A multi-faceted organization like a government needs to be able to “mint” these URIs in a way that they can be supported for a long time and are permanent. It doesn’t matter who ends up “managing” them or on what machine they reside; what is important is that the URI live on.
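One way to picture the separation between a permanent URI and the machine that happens to serve it is a small registry that maps each minted URI to its current location. This is only a sketch: the UriRegistry class, its method names and all example.org addresses are hypothetical, and on the web this role is typically played by HTTP redirects rather than an in-memory table.

```python
class UriRegistry:
    """Maps permanent URIs to wherever the data currently lives."""

    def __init__(self):
        self._locations = {}

    def mint(self, uri, location):
        """Register a new permanent URI; a URI can only be minted once."""
        if uri in self._locations:
            raise ValueError(f"URI already minted: {uri}")
        self._locations[uri] = location

    def relocate(self, uri, new_location):
        """Move the data to a new host; the URI itself never changes."""
        if uri not in self._locations:
            raise KeyError(uri)
        self._locations[uri] = new_location

    def resolve(self, uri):
        """Return the current location for a permanent URI."""
        return self._locations[uri]


registry = UriRegistry()
term = "http://example.org/id/vocabularies/aboutcats/Cat"  # hypothetical
registry.mint(term, "http://server-a.example.org/data/cat.rdf")
# Stewardship changes hands and the data moves, but references to
# `term` elsewhere on the web keep working.
registry.relocate(term, "http://server-b.example.org/data/cat.rdf")
print(registry.resolve(term))
```

Refusing to mint the same URI twice reflects the requirement that a URI retain one meaning for as long as it exists.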

The UK Government has done some great work in this area that we can build on; it can be accessed here: www.cabinetoffice.gov.uk/media/301253/puiblic_sector_uri.pdf

OGV WG Recommendation: Use the UK document as a basis for a US policy on minting URIs. All open government vocabularies should use permanent URIs.


Conclusion

The issues, inefficiencies and costs associated with poor data interoperability are well understood, not to mention the landscape of stovepiped solutions that hinder discovery of and access to government data. Utilizing semantic technologies and standards to publish and federate government vocabularies not only supports the President’s Open and Transparent Government Directive, it facilitates data interoperability throughout government. The approach suggested by the OGV WG will make data more open, accessible and usable by government and citizens alike, facilitating a more open and engaged government.