CONNECTING CLOSED WORLD RESEARCH INFORMATION SYSTEMS THROUGH THE LINKED OPEN DATA WEB


International Journal of Software Engineering and Knowledge Engineering
World Scientific Publishing Company

BRIGITTE JÖRG


IVÁN RUIZ-RUBE

Department of Computer Languages and Systems, University of Cádiz
C/ Chile 1, 11002, Cádiz, Spain
ivan.ruiz@uca.es

MIGUEL-ÁNGEL SICILIA

Department of Computer Science, University of Alcalá
Ctra. Barcelona km. 33.6 28871, Alcalá de Henares, Madrid, Spain
msicilia@uah.es


Research information systems (RIS) play a critical role in the sharing of scientific information and provide researchers, professionals and decision makers with the required data for their activities. Existing standards for RIS have provided data models conveying the main entities that are required for information exchange between systems. However, the distributed nature of research information across different systems calls for a mechanism for linking local entities in the closed world of concrete RIS with other entities that are exposed through the open world of linked data. This paper introduces the motivation for such a mechanism and demonstrates how an existing standard can be extended for that purpose. Further, the paper describes the main principles and techniques for exposing RIS data as linked open data.

Keywords: (Current) Research Information Systems; Linked Open Data; Ontologies.

1. Introduction

needs adaption!

Research activities are funded through public money and are consequently of interest to multiple stakeholders; research is also treated as an asset in itself. Research is becoming increasingly competitive and, as a consequence, research-related information is more and more of interest to multiple stakeholders. The recording of research output in the form of publications or patents is an underlying requirement in the innovation process and, among other things, a key indicator for measuring the transfer of knowledge. Nowadays, most institutions related in some way to scientific research have support systems for research management. These research information systems (RIS) play a critical role in the sharing of scientific information and in providing researchers, professionals and decision makers with the data required for their activities.


Existing standards for RIS have provided data models conveying the main entities that are required for information exchange between systems. The CERIF (Common European Research Information Format) data model [1] is the RIS standard recommended by the European Union to its Member States. Thanks to the CERIF model, it is possible to develop comprehensive research information systems (RIS) and, at the same time, to allow interoperability between different RIS.


EuroCRIS, the CERIF support organization, developed a few years ago a well-defined XML format for exchanging data between research information systems. To this end, it proposed a structure of XML files to represent the information stored in CERIF databases. Despite the use of CERIF XML as a standard format for data exchange, the access interfaces to the APIs of RIS are proprietary and therefore dependent on specific providers. Moreover, it has not been possible to link entities from different RIS to one another or to elements external to the CERIF model, such as external terminologies, bibliographic databases, and others.


The distributed nature of research information across different systems calls for a mechanism for linking local entities in the closed world of concrete RIS with other entities that are exposed through the open world of linked data. Linked Open Data (LOD) is the initiative to bring these "closed data islands" into a global, interconnected data space by publishing structured data on the Web, so that it can be interlinked and become more useful [2].



This paper demonstrates how an existing standard, such as CERIF, can be extended for exposing and integrating research information using LOD technologies. Our proposal provides a set of additional benefits to the different stakeholders involved in the research context. Further, the paper describes the main principles and techniques for exposing RIS data as linked open data, and finally discusses several issues that emerged during the design of our proposal.

2. Sharing research information: stakeholders and use cases

TO WRITE: atira, avedas, what is there, corresponding to Rec




Stakeholders: uses and benefits for several stakeholders of using research systems enriched with CERIF Linked Data: Higher Education Institutions (HEI) or R&D institutions, Funding Bodies (FB), Research Authorities (RA), Researchers, Research Information Enterprises (RIE), General public, Enterprises…


Three major patterns, which the new entity should cover (two are relatively straightforward but CRIS-CRIS is not): (1) CRIS connecting to another CRIS; (2) CRIS connecting to a local system (not a CRIS), e.g. a scholarly publication repository or finance system; (3) CRIS connecting to a KOS / Authority (for purposes such as validation, explanation or resolution).

3. Adapting CERIF to linking RIS through the Web

TO WRITE: Alcala Results + Follow-up Discussions

To be explained: Comments about closed and open world implications for the integrity of syntax and semantics. Open World: things are incomplete and we have to deal with it. Include the necessary information to ensure integrity across the boundary 'gap' from the closed world (integrity) to the open world (incompleteness and uncertainty).

Reuse topics of section 5.5 "Distributed datasets" of the LD-CERIF recommendations and meeting notes.

4. Exposing CERIF as linked open data: vocabularies and recipes

The proper linked data exposure of research information requires developing an ontology based on the CERIF logical model and a value vocabulary intended to provide additional semantics to the model. We have also developed a set of recipes for publishing CERIF-LD, which we present afterwards.

4.1. CERIF Ontology and semantic vocabulary

In order to formally publish metadata about the research context, it is necessary to define an ontology that collects the elements of the CERIF model. From the CERIF model, we have designed (using the tool Neologism§) an ontology in RDFS (available on the EuroCRIS Web Site**).


The ontology collects the base, result, 2nd level and link entities, as well as other elements required to support multiple language features and semantics. An overview of this ontology is presented in Figure 1 and an excerpt of its definition in RDFS is listed in Appendix A.


In general, CERIF entities of the logical model were translated into RDF classes, and their attributes into properties of those classes in our ontology. We used logical names instead of physical names, because the limitations imposed by a database no longer apply. Also, all properties and classes of our ontology are self-described using the standard annotations rdfs:label and rdfs:comment.

§ http://neologism.deri.ie/
** http://eurocris.org/cerif



Fig. 1. General overview of the CERIF Ontology.


During the ontology design it was essential not to "reinvent the wheel", so we tried to reuse existing elements from well-known and globally accepted ontologies. CERIF-LD datasets will publish RDF data according to our CERIF ontology and also using terms from other auxiliary ontologies such as FOAF, Dublin Core and the Bibliographic Ontology.


In addition to the above, we have designed a value vocabulary to gather all relevant types and roles that hold between the entities of the CERIF Ontology in a research context. This vocabulary includes several RDF classes to describe classification schemes for the CERIF entities (e.g. kinds of organisation units), as well as a number of RDF properties to semantically enrich the relationships between CERIF entities (e.g. roles of a person in a research project). The CERIF Semantic Vocabulary is also available on the EuroCRIS Web Site††, under a different namespace.

4.2. Recipes for CERIF-LD exposure

All RDF data exposed from research systems must meet the basic principles of the Linked Data paradigm [2], namely: use URIs as names for things; use HTTP URIs so that people can look up those names; when someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL); and, finally, include links to other URIs.


†† http://eurocris.org/semcerif



CERIF model entities must be exposed as RDF resources, which will be typed and enriched with metadata according to the CERIF ontology and other external vocabularies. Below, several recipes for exposing CERIF-LD from RIS are presented.

4.2.1. Designing CERIF URIs

Good URI design is essential to enable interoperability and understanding of linked data research resources, allowing us to discover similarities between different resources published in several CERIF-LD datasets. The proposed basic scheme for the design of CERIF URIs (in BNF syntax) is the following:


<URICerifEntity> ::= <URIBaseEntity> | <URIResultEntity> | <URI2ndLevelEntity> | <URILinkEntity>
<URIBaseEntity> ::= NAMESPACE "/" ENTITY_NAME "/" DESCRIPTOR
<URIResultEntity> ::= NAMESPACE "/" ENTITY_NAME "/" DESCRIPTOR
<URI2ndLevelEntity> ::= <URIUncoupled2ndLevelEntity> | <URICoupled2ndLevelEntity>
<URIUncoupled2ndLevelEntity> ::= NAMESPACE "/" ENTITY_NAME "/" DESCRIPTOR
<URICoupled2ndLevelEntity> ::= (<URIBaseEntity> | <URIResultEntity> | <URIUncoupled2ndLevelEntity>) "/" ENTITY_NAME "/" UUID
<URILinkEntity> ::= <URIRelationshipEntity> | <URIClassificationEntity>
<URIClassificationEntity> ::= (<URIBaseEntity> | <URIResultEntity> | <URI2ndLevelEntity>) "/" CLASS_SCHEME "/" UUID
<URIRelationshipEntity> ::= NAMESPACE "/" ENTITY_NAME "/" UUID


NAMESPACE is the base URI for all LD resources published by the research information system. ENTITY_NAME corresponds to a given entity name in the CERIF model and DESCRIPTOR is a unique, descriptive identifier for the selected entity, taken e.g. from the attributes acronym, name or title. This identifier is meaningful within the domain of the data set, which yields human-readable and memorable URIs. UUID refers to a universally unique identifier for the selected entity. Finally, CLASS_SCHEME represents a suitable classification scheme (a subtype of semcerif:classificationScheme).
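
As an illustration only, the URI scheme can be sketched in Python. The function names, the slug rule for DESCRIPTOR (spaces replaced by underscores, remaining characters percent-encoded) and the default base URI http://example.org/resource are assumptions made for this sketch; they are not part of the CERIF-LD recommendations.

```python
import uuid
from typing import Optional
from urllib.parse import quote

# Assumed base URI, taken from the examples used in this paper.
NAMESPACE = "http://example.org/resource"


def descriptor_uri(entity_name: str, descriptor: str, namespace: str = NAMESPACE) -> str:
    """Base, result and uncoupled 2nd level entities:
    NAMESPACE "/" ENTITY_NAME "/" DESCRIPTOR"""
    # Hypothetical slug rule: underscores for spaces, percent-encode the rest.
    slug = quote(descriptor.strip().replace(" ", "_"), safe="-_.~")
    return f"{namespace}/{entity_name}/{slug}"


def nested_uri(parent_uri: str, segment: str, uid: Optional[str] = None) -> str:
    """Coupled 2nd level entities and classification entities:
    parent URI "/" (ENTITY_NAME | CLASS_SCHEME) "/" UUID"""
    return f"{parent_uri}/{segment}/{uid or uuid.uuid4()}"


def relationship_uri(entity_name: str, uid: Optional[str] = None,
                     namespace: str = NAMESPACE) -> str:
    """Relationship (link) entities: NAMESPACE "/" ENTITY_NAME "/" UUID"""
    return f"{namespace}/{entity_name}/{uid or uuid.uuid4()}"


# Example usage with identifiers that appear in the examples of this paper.
project = descriptor_uri("projects", "VOA3R")
person = descriptor_uri("persons", "Miguel-Angel Sicilia")
classification = nested_uri(project, "agrovoc", "789ABC")
link = relationship_uri("proj_pers", "123XYZ")
```

When no identifier is passed, a random UUID is generated; a production system would more likely reuse a stable identifier already stored in the RIS so that URIs do not change between exports.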

4.2.2. Exposing common metadata

There are a number of attributes shared by all CERIF entities. The internal identifiers (e.g. a primary key in a table) of the entities managed by the research systems will be exposed using cerif:internalIdentifier, a literal property derived from the identifier property of the Dublin Core vocabulary. Also, the cfURI attribute, used to point to related websites, will be exposed through the foaf:homepage property.




In addition, the CERIF model supports multiple languages (with different translation types) for names, titles, descriptions, keywords and abstracts of the base, result and 2nd level entities. These language-dependent attributes will be exposed according to Table 1.



4.2.3. Exposing CERIF real-world entities

In the CERIF model we can distinguish between base, result and 2nd level entities. The CERIF base entities are Person, OrganisationUnit and Project, which are the main actors involved in the scientific context. With regard to the result entities, CERIF comprises Publication, Patent and Product (any result not classified as a publication or a patent). Finally, the 2nd level entities allow for the representation of the research context and can be considered as universally shared (e.g. countries), coupled (e.g. postal addresses) or uncoupled entities (e.g. events).


For the above types of entities, we should generate new RDF resources holding a set of specific axioms, in addition to the common metadata. These are described in the CERIF-LD specification document [?]. The universally shared entities do not need to be exposed as Linked Data from the CERIF datasets. By way of example, Table 2 shows the RDF metadata to generate for the CERIF Project entity.


Table 1. Mapping CERIF/RDF for common metadata.

RDF Property                                         CERIF Attribute
rdfs:label, foaf:name                                *.cfName
rdfs:label, dc:title (a), dcterms:alternative (b)    *.cfTitle
dc:description                                       *.cfDescr
dc:subject                                           *.cfKeyw
dcterms:abstract                                     *.cfAbstract

(a) The dc:title property is used in the case of an original title.
(b) The dcterms:alternative property is used in the case of a title translated by a human or a machine.






Let us assume that our research organization has its corporate RIS available at http://example.org, which publishes linked data based on the CERIF model. Here, we present an excerpt of the RDF description of a given research project:


@prefix cerif: <http://eurocris.org/cerif#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/resource/projects/VOA3R>
    a cerif:Project ;
    foaf:homepage <http://voa3r.eu/> ;
    dc:subject "aquaculture" , "ontology" , "agriculture" ;
    cerif:internalIdentifier "ff8080812ddb916a012ddb9170b60001" ;
    cerif:acronym "VOA3R" ;
    cerif:startDate "2010-06-01"^^xsd:date ;
    cerif:endDate "2013-05-31"^^xsd:date ;
    dc:title "Repositorio de Agricultura y Acuicultura de acceso abierto virtual"@es-ES ,
        "Virtual Open Access Agriculture & Aquaculture Repository"@en-GB ;
    dcterms:abstract "The general objective of the VOA3R project is to improve the spread of European agriculture and aquaculture research results by using an innovative approach to sharing open access research products."@en-GB .
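
To make the mapping of Table 2 concrete, the following Python sketch serialises a project record into Turtle of the kind shown in the example. The record layout (a dictionary from CERIF attribute names to lists of string values), the function names and the choice of xsd:date for the date attributes are assumptions of this sketch; language tags and prefix declarations are omitted for brevity.

```python
# CERIF attribute -> RDF property, following Table 2.
PROJECT_MAPPING = {
    "cfProj.cfProjId": "cerif:internalIdentifier",
    "cfProj.cfURI": "foaf:homepage",
    "cfProj.cfStartDate": "cerif:startDate",
    "cfProj.cfEndDate": "cerif:endDate",
    "cfProj.cfAcro": "cerif:acronym",
    "cfProjTitle.cfTitle": "dc:title",
    "cfProjKeyw.cfKeyw": "dc:subject",
    "cfProjAbstr.cfAbstr": "dcterms:abstract",
}

# Assumption: these attributes are rendered as IRIs or typed literals.
URI_ATTRIBUTES = {"cfProj.cfURI"}
DATE_ATTRIBUTES = {"cfProj.cfStartDate", "cfProj.cfEndDate"}


def turtle_object(attribute: str, value: str) -> str:
    """Render one value as a Turtle object term."""
    if attribute in URI_ATTRIBUTES:
        return f"<{value}>"
    # Escape backslashes, double quotes and newlines for a short string literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    if attribute in DATE_ATTRIBUTES:
        return f'"{escaped}"^^xsd:date'
    return f'"{escaped}"'


def project_to_turtle(uri: str, record: dict) -> str:
    """Serialise a project record (attribute -> list of values) as Turtle."""
    statements = ["a cerif:Project"]
    for attribute, prop in PROJECT_MAPPING.items():
        values = record.get(attribute, [])
        if values:
            objects = " , ".join(turtle_object(attribute, v) for v in values)
            statements.append(f"{prop} {objects}")
    return f"<{uri}>\n    " + " ;\n    ".join(statements) + " ."


# Hypothetical record, using values from the VOA3R example.
record = {
    "cfProj.cfAcro": ["VOA3R"],
    "cfProj.cfURI": ["http://voa3r.eu/"],
    "cfProj.cfStartDate": ["2010-06-01"],
    "cfProjKeyw.cfKeyw": ["aquaculture", "ontology"],
}
turtle = project_to_turtle("http://example.org/resource/projects/VOA3R", record)
```

Keeping the mapping in a single dictionary means that a change to Table 2 only requires editing one place; attributes missing from a record are simply skipped.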

Table 2. Mapping CERIF/RDF for Project entity.

RDF Class / Property        CERIF Entity / Attribute
cerif:Project               cfProj
cerif:internalIdentifier    cfProj.cfProjId
foaf:homepage               cfProj.cfURI
cerif:startDate             cfProj.cfStartDate
cerif:endDate               cfProj.cfEndDate
cerif:acronym               cfProj.cfAcro
dc:title                    cfProjTitle.cfTitle
dc:subject                  cfProjKeyw.cfKeyw
dcterms:abstract            cfProjAbstr.cfAbstr

4.2.4. Exposing CERIF link entities

Link entities are a mechanism designed in CERIF to maintain relationships between entities and to classify entities. In the first case, for each link entity, we should generate a new RDF statement relating two entities through a given predicate from an external LD vocabulary, such as the CERIF Semantic Vocabulary described above. We can see it in the following RDF example (the remaining namespace prefixes are as in the previous example), which states that a person is the coordinator of a research project:


@prefix semcerif: <http://eurocris.org/semcerif#> .

<http://example.org/resource/projects/VOA3R>
    a cerif:Project .

<http://example.org/resource/persons/Miguel-Angel_Sicilia>
    a cerif:Person ;
    semcerif:coordinator <http://example.org/resource/projects/VOA3R> .



In order to discover a suitable predicate, it is necessary to connect the classification schemes managed by the RIS with vocabularies that are publicly accessible on the Web.


Furthermore, the CERIF semantic features allow the definition of time intervals of validity for the relationships between entities (link entity attributes cfStartDate, cfEndDate), as well as the assignment of fractional values (cfFraction) that express a proportion or probability of a fact. There are two approaches when it comes to exposing these additional data:




- One can add annotations to facts as properties of a reified basic statement; this approach is described e.g. in [3].
- One can generate a new RDF node for each fact and link it with the involved entities. This approach is described as the Qualified Relation pattern in [4]. The additional information then goes into properties of the newly generated node.


The former approach avoids the need for a node in the middle between the two entities of the fact. However, this comes at the cost of added complexity in reaching the additional temporal information. Since the temporal information on facts is crucial to CERIF, we chose the latter approach. Building upon the previous example, let us add temporal and fraction information (using the RDF resource in the middle):


<http://example.org/resource/persons/Miguel-Angel_Sicilia>
    cerif:linksToProject <http://example.org/resource/proj_pers/123XYZ> .

<http://example.org/resource/projects/VOA3R>
    cerif:isLinkedByPerson <http://example.org/resource/proj_pers/123XYZ> .

<http://example.org/resource/proj_pers/123XYZ>
    a cerif:Relationship ;
    rdfs:label "Association between VOA3R and Miguel-Angel Sicilia" ;
    cerif:role <http://eurocris.org/semcerif#coordinator> ;
    cerif:startDate "1901-01-01 00:00:00.0" ;
    cerif:endDate "2099-12-31 23:59:59.0" ;
    cerif:fraction "0.75" .
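
The Qualified Relation pattern chosen here can be generated mechanically from a link entity. The Python sketch below is illustrative only: the function name and its parameters are hypothetical, the property names are the ones used in the example, and the literal values are assumed not to contain characters that would need escaping in Turtle.

```python
def qualified_relation(source_uri, source_property, target_uri, target_property,
                       link_uri, role_uri, start, end, fraction):
    """Return Turtle in which both entities point to an intermediate node
    carrying the role, the validity interval and the fraction of the fact."""
    return "\n".join([
        f"<{source_uri}>",
        f"    {source_property} <{link_uri}> .",
        "",
        f"<{target_uri}>",
        f"    {target_property} <{link_uri}> .",
        "",
        f"<{link_uri}>",
        "    a cerif:Relationship ;",
        f"    cerif:role <{role_uri}> ;",
        f'    cerif:startDate "{start}" ;',
        f'    cerif:endDate "{end}" ;',
        f'    cerif:fraction "{fraction}" .',
    ])


# Rebuilds the person/project relationship of the running example.
turtle = qualified_relation(
    source_uri="http://example.org/resource/persons/Miguel-Angel_Sicilia",
    source_property="cerif:linksToProject",
    target_uri="http://example.org/resource/projects/VOA3R",
    target_property="cerif:isLinkedByPerson",
    link_uri="http://example.org/resource/proj_pers/123XYZ",
    role_uri="http://eurocris.org/semcerif#coordinator",
    start="1901-01-01 00:00:00.0",
    end="2099-12-31 23:59:59.0",
    fraction="0.75",
)
```

Because the temporal and fraction values live on the intermediate node, a consumer can reach them with a single additional hop from either entity, which is the property that motivated choosing this pattern over reification.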


Aside from (binary) link entities that express facts, CERIF supports (unary) classifications of its information entities. These classifications may also have temporal and fraction information associated with them. The mechanism is similar, as can be seen in the following RDF example, which states that a certain research project is classified by a given term (in this case, c_550: Aquaculture) from the Agrovoc [5] vocabulary:


<http://example.org/resource/projects/VOA3R>
    a cerif:Project ;
    semcerif:agrovoc <http://aims.fao.org/aos/agrovoc/c_550> .


It is also possible to attach time intervals of validity and fractional values to the classification. In this case, the resulting RDF would be:


<http://example.org/resource/projects/VOA3R>
    a cerif:Project ;
    cerif:isClassifiedByAgrovoc <http://example.org/resource/projects/VOA3R/agrovoc/789ABC> .

<http://example.org/resource/projects/VOA3R/agrovoc/789ABC>
    a cerif:Classification ;
    rdfs:label "Classification for VOA3R" ;
    cerif:classification <http://aims.fao.org/aos/agrovoc/c_550> ;
    cerif:startDate "1901-01-01 00:00:00.0" ;
    cerif:endDate "2099-12-31 23:59:59.0" ;
    cerif:fraction "1" .


4.2.5. Publishing additional metadata

Before the public exposure of research information as LD, we must add additional metadata. It is highly important to describe the license or waiver under which the data are made available, including any restrictions that apply to their usage. Also, in order to enable applications to be sure about the origin of the data, data sources should publish provenance metadata together with the primary data. With this aim, we will enrich the descriptions of all CERIF RDF resources using some Dublin Core predicates, such as dc:creator, dc:publisher, dc:date and dc:rights.




Finally, it is important to indicate the dataset that publishes the RDF resources in order to allow crawlers to discover data on the Web. We encourage the use of the VoID vocabulary for this purpose.


@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix void: <http://rdfs.org/ns/void#> .

# Provenance metadata for all resources published by the CERIF dataset
# (attached here to one example resource; the subject is illustrative)
<http://example.org/resource/projects/VOA3R>
    dc:creator <http://example.org> ;
    dc:publisher <http://example.org> ;
    dc:date "2011-01-01"^^xsd:date ;
    dc:rights <http://creativecommons.org/licenses/by-nc/3.0> ;
    void:inDataset <http://example.org/dataset> .

# Dataset metadata
<http://example.org/dataset>
    a void:Dataset ;
    rdfs:label "RIS Dataset of Example Org. based on CERIF-LD" ;
    foaf:homepage <http://example.org/dataset.html> ;
    foaf:isPrimaryTopicOf <http://example.org/dataset.rdf> ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:vocabulary <http://xmlns.com/foaf/0.1/> ;
    void:vocabulary <http://purl.org/dc/elements/1.1/> ;
    void:vocabulary <http://purl.org/ontology/bibo/> ;
    void:vocabulary <http://eurocris.org/cerif> ;
    void:vocabulary <http://eurocris.org/semcerif> ;
    void:exampleResource <http://example.org/resource/organisationUnits/UAH> .

5. Case Study

Brigitte proposes to move this section to the front!

In VOA3R: To be explained: What is VOA3R? What parts of the CERIF model are managed by the VOA3R platform? Need for data normalization before the LD exposure. Architecture: CMS Portal + Bibliographic Repository (OWL), D2R Server for the CERIF database and D2R Server for the ad-hoc database of publications.
6. Related work

TO DO


7. Conclusion

To be explained: Benefits of our approach, Conclusions, Issues to tackle (version control, suitable evolution of CERIF components (MDE approach?)). Next steps. Etc.





Appendix A. Excerpt of the CERIF Ontology in RDF/Turtle

To include a meaningful excerpt of http://spi-fm.uca.es/neologism/cerif.


References

[1] B. Jörg, K.G. Jeffery, G. van Grootel, A. Asserson, J. Dvorak and H. Rasmussen, CERIF 2008 - 1.2 Full Data Model (FDM): Introduction and Specification. euroCRIS, 2010.

[2] C. Bizer, T. Heath and T. Berners-Lee, Linked data - the story so far, International Journal on Semantic Web and Information Systems 5(3) (2009) 1-22.

[3] J. Hoffart, F. M. Suchanek, K. Berberich and G. Weikum, YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia, Technical report MPI-I-2010-5-007 (Ed.: Saarbrücken: Max Planck Institute for Informatics, 2010). Available at http://domino.mpi-inf.mpg.de/internet/reports.nsf/c125634c000710d0c12560400034f45a/97dff14cb0fd1562c12577d9002c0d46/$FILE/MPI-I-2010-5-007.pdf

[4] L. Dodds and I. Davis, Linked Data Patterns - A Pattern Catalogue for Modelling, Publishing, and Consuming Linked Data (2010). Available at http://patterns.dataincubator.org/book/

[5] T. Baker and J. Keizer, Linked Data for Fighting Global Hunger: Experiences in setting standards for Agricultural Information Management. Linking Enterprise Data (Springer, 2010), pp. 177-201.