WP1-D1.1
Semantic Data
Project full title: Knowledge Driven Data Exploitation
Project acronym: K-Drive
Grant agreement no.: 286348
Project instrument: EU FP7/Marie Curie IAPP/PEOPLE WP 2011
Document type: D (deliverable)
Nature of document: S (Specification)
Dissemination level: PU (public)
Document number: ISOCO,UNIABDN/WP1-D1.1/D/PU/b3
Responsible editors: Panos Alexopoulos and Andrew Walker
Reviewers: Yuan Ren, Yuting Zhao and Jeff Z. Pan
Contributing participants: ISOCO, UNIABDN
Contributing workpackages: WP1
Contractual date of deliverable: 30 September 2012
Actual submission date: 30 September 2012
Abstract
The goal of this deliverable is to describe a set of semantic data sources, both in the form of Linked Data and in other forms, that are to be used within the K-Drive project in order to support the research and development activities of the subsequent work packages and to evaluate the quality of their outcomes.
Keyword List
Use Case Scenario, Semantic Datasets
Project funded by the European Commission within the 7th Framework Programme/Marie Curie Industry-Academia Partnerships and Pathways scheme/PEOPLE Work Programme 2011.
© K-Drive 2012.
Semantic Data
Panos Alexopoulos¹, Jose-Manuel Gomez-Perez¹ and Andrew Walker²
¹ iSOCO, Spain. Email: {palexopoulos,jmgomez}@isoco.com
² Department of Computing Science, Aberdeen University, UK. Email: andrew.walker.05@aberdeen.ac.uk
30 September 2012
Contents
1 Introduction
2 Use Case Scenario
2.1 Generic Description
2.2 Specific Implementation Scenarios
3 Semantic Datasets
3.1 Core Datasets
3.1.1 Enterprise Information Datasets
3.1.2 Tenders Electronic Daily Dataset
3.1.3 BUSCAMEDIA Multimedia Dataset
3.2 Horizontal Multi-Domain Datasets
3.2.1 DBPedia
3.2.2 YAGO2
3.3 Specialized Datasets
3.3.1 Geonames
3.3.2 LinkedGeoData
3.3.3 Research Datasets: ACM, IEEE, DBLP, Citeseer and BibBase
3.3.4 Open Corporates
3.4 The K-Drive Data Model
4 Conclusion
1 Introduction
The goal of this deliverable is to describe a set of semantic data sources, both in the form of Linked Data and in other forms, that are to be used within the K-Drive project in order to support the research and development activities of the subsequent work packages and to evaluate the quality of their outcomes.

Technically speaking, Linked Data refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and can in turn be linked to from external data sets [Bizer et al., 2009a]. The emergence in recent years of initiatives like Linked Open Data (LOD) [Bizer et al., 2009a] has led to a significant increase in the amount of such data on the Web, including rich semantic sources (e.g. FreeBase, http://www.freebase.com, or DBpedia, http://dbpedia.org/About [Bizer et al., 2009b]), public releases of governmental and other data (e.g. the data.gov initiative, http://data.gov.uk/, or the New York Times news vocabulary, http://data.nytimes.com), as well as semantically annotated Web content using RDFa (http://www.w3.org/TR/xhtml-rdfa-primer/) or Microformats (http://microformats.org/).
In the context of K-Drive, we are interested in datasets that may be used within a comprehensive use case scenario involving the provision of advanced methods, techniques and tools for their reuse and access in various knowledge-intensive tasks. The two basic research areas that this scenario will cover are semantic question answering and semantic data summarization. The first area studies the problem of transforming natural language questions into semantic queries and executing them against one or more datasets in order to derive answers [Lopez et al., 2007]. The second aims at facilitating a better understanding of a given dataset by providing a landscape view of it in an intuitive and non-technical way [Li and Motta, 2010].

Given that, the structure of the rest of this document is as follows. In the next section we describe in detail the K-Drive use case scenario, focusing on its objectives and the particular tasks it involves. Subsequently, in section 3 we describe a number of semantic datasets that we intend to use within this scenario. For each dataset we provide an overview of its key characteristics (domains, size, accessibility etc.) and the rationale behind its selection.
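To make the question answering task concrete, here is a minimal sketch of template-based translation from a natural language question to a SPARQL query. The question pattern, the dbo:team property and the query shape are illustrative assumptions, not K-Drive's actual method:

```python
import re

# One toy question pattern; a real system would need a far richer grammar.
PATTERN = re.compile(r"[Ww]hich team does (.+) play for\?")

# Hypothetical DBpedia-style query template; dbo:team is an assumption here.
SPARQL_TEMPLATE = """SELECT ?team WHERE {{
  ?player rdfs:label "{name}"@en .
  ?player dbo:team ?team .
}}"""

def question_to_sparql(question: str):
    """Translate a supported natural language question into a SPARQL query,
    or return None when the question does not match the toy grammar."""
    match = PATTERN.match(question)
    if match is None:
        return None
    return SPARQL_TEMPLATE.format(name=match.group(1))

print(question_to_sparql("Which team does Lionel Messi play for?"))
```

The systems cited in section 2.1 differ precisely in how they generalize this mapping step beyond fixed templates.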
2 Use Case Scenario
2.1 Generic Description
The typical use case scenario we consider for the K-Drive system assumes some organization, either public or private, that wants to reuse public semantic datasets from the Open Web in order to perform two tasks:
1. To enrich its own data with them, so as to increase the latter's usability and value.
2. To make the enriched data available to end-users, in an intuitive way, for some particular purpose (e.g. decision support).
The rst task,namely data enrichment,refers to the process of adding structured informa-
tion to a set of data in order to:
 Expand the data's coverage,both vertically (i.e.more instances) and horizontally (more
concepts and relations).For example,a sport news organization might want to enrich
its knowledge base of athletes and teams with additional information about tournaments,
competitions and records.
 Make the data's semantics clearer.For example,by replacing people's names with URIs
in order to make them unambiguous.
 Put the data into context.For example,by using a geographic ontology in order to relate
events to their locations and thus add a geographical dimension to them.
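The second bullet above (replacing names with URIs) can be sketched as follows; the lookup table is a hand-made stand-in for what would in practice be a query against a public dataset such as DBpedia:

```python
# Hypothetical name-to-URI mapping; a real enrichment step would resolve
# names against a semantic dataset rather than a hard-coded dict.
URI_LOOKUP = {
    "Lionel Messi": "http://dbpedia.org/resource/Lionel_Messi",
    "FC Barcelona": "http://dbpedia.org/resource/FC_Barcelona",
}

def enrich_record(record: dict) -> dict:
    """Replace ambiguous name literals with URIs where a mapping is known,
    leaving unmapped values untouched."""
    return {field: URI_LOOKUP.get(value, value) for field, value in record.items()}

print(enrich_record({"player": "Lionel Messi", "team": "FC Barcelona"}))
```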
Such tasks are typically performed within an organization by knowledge engineers, and the common problem associated with them is the so-called knowledge acquisition bottleneck, namely the high amount of time and effort required to acquire and maintain the needed knowledge [Szumlanski and Gomez, 2010] [Reichartz et al., 2010] for the enrichment task. Some of the most evident aspects of this problem, as documented in the literature [Hoppenbrouwers and Lucas, 2009] [Wagner, 2007], are the following:
• Knowledge is hard to make explicit, and even harder to formalise.
• The domain experts required for this job are usually not available for lengthy involvement in knowledge acquisition activities, nor do they possess the required skills.
• The expert knowledge engineers/modellers that could be hired to do the modelling job are few and expensive.
• Existing knowledge acquisition methodologies only provide limited help for non-expert modellers in the actual execution of their modelling tasks.
• In many cases domain-specific knowledge modelling requires intensive negotiation and validation of models by heterogeneous teams of stakeholders.
• The increasing availability of automated knowledge extraction methods from data has not managed to eliminate or at least significantly reduce the problem, as i) the application of these methods still requires expertise, and ii) their effectiveness is not yet high enough, thus creating the need for their outputs to be substantially validated and tweaked by humans.
• As the amount of acquired knowledge increases, the maintenance of the knowledge base becomes more difficult, especially in domains whose content changes at a fast pace.
Given the above, in K-Drive we consider the reuse of existing public semantic data as a promising way to (partially) alleviate the knowledge acquisition problem. One reason for this is that the volume and diversity of public semantic datasets are increasing at high rates [Bizer et al., 2009a] (see figures 1, 2, 3 and 4), resulting in a large amount of both generic and domain-specific knowledge that is available for use in various application scenarios. Another advantage of the reuse approach is that the maintenance and evolution of these datasets is the responsibility of their publishers, thus reducing the required efforts and costs for this task on the organization's side. Of course, a consequence of this is that the organization loses control over the quality and trustworthiness of the data, meaning that there is an inherent trade-off that needs to be evaluated by the organization.

Figure 1: Linked Data Cloud in 2007
As an example of this, consider the sports domain and a news organization that wants to create and maintain a knowledge base about the Spanish football league (teams, rosters, results etc.). The pace at which this knowledge changes is quite fast (e.g. team rosters change at least every year, sometimes even more frequently), meaning that the organization needs to have a dedicated team that constantly monitors these changes and updates the knowledge base accordingly. Nevertheless, much of this information is already available in public semantic data (e.g. DBPedia) and, more importantly, it is (almost) always up to date. This means that it would be better for the organization to reuse this public data instead of creating it from scratch and having to maintain it.

In fact, previous experience of this deliverable's authors in commercial projects has shown that reusing public semantic data can save a substantial amount of time and money for the client organization. Indicatively, we mention two such projects, one involving a Greek publishing organization specializing in history-related books, and another involving a Greek business directory (Yellow Pages). In both projects the goal was to enrich the existing data of these organizations with additional relevant semantic information so as to provide better access to it (e.g. better search of the history books through their annotation with the events and persons they cover, or better navigation of the directory of professional doctors by means of a disease taxonomy). Especially for the Yellow Pages organization, this task was enormous as its initial data covered more than 10 quite generic domains (e.g. health, education, entertainment). Yet, by reusing existing semantic data for these domains (primarily from Freebase but also from DBPedia) the enrichment process was performed in much less time.

Figure 2: Linked Data Cloud in 2008
Nevertheless, an important problem that inhibits the wider reuse of such public semantic data is the difficulty for knowledge engineers to decide whether a given dataset is actually suitable for their needs. This is because semantic datasets typically cover diverse domains, do not follow a unified way of organizing the knowledge, and differ in a number of features including size, coverage, granularity and descriptiveness. This makes it quite difficult to assess whether a dataset satisfies particular requirements (e.g. covering a particular domain adequately) and/or to compare different datasets in order to select which one is more suitable for a given purpose.

For instance, in the example mentioned above about data related to the Spanish football league, one may find such data in more than one source, including DBPedia and Freebase. To evaluate these sources, the knowledge engineer needs to examine and assess a variety of factors including:
• The domain's coverage, namely the degree to which the contained data cover the Spanish football league (e.g. one of the sources might not contain adequate data for a given year).
• The dataset's consistency, namely the absence of contradictions in the data (e.g. there might be statements suggesting that a player is currently playing for two clubs).
• The dataset's contemporariness, namely whether the dataset's contents are updated frequently enough so as to always reflect the current reality of the domain (e.g. it might be that some team rosters are not the latest ones).
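The consistency factor above can be checked mechanically. The following sketch, over a made-up triple list rather than a real RDF store, flags players asserted to currently play for more than one club:

```python
from collections import defaultdict

# Toy subject-predicate-object triples; the second one contradicts the first.
triples = [
    ("PlayerA", "playsFor", "ClubX"),
    ("PlayerA", "playsFor", "ClubY"),
    ("PlayerB", "playsFor", "ClubZ"),
]

def find_inconsistent_players(triples):
    """Return the set of players with more than one current club."""
    clubs = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "playsFor":
            clubs[subject].add(obj)
    return {player for player, cs in clubs.items() if len(cs) > 1}

print(find_inconsistent_players(triples))  # → {'PlayerA'}
```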
For that reason, an important dimension of the K-Drive use case scenario is the provision of the ability for the system's users to derive semantic data summaries, namely useful descriptions, measures and indicators that provide a landscape yet informative view of a dataset and enable the assessment of the latter's potential value.

Figure 3: Linked Data Cloud in 2009

This task of semantic data summarization is rather overlooked in the research community and has only been addressed by a few works [d'Aquin and Motta, 2011] [Peroni et al., 2008] [Presutti et al., 2011], each of which generates dataset summaries according to different data features and by applying different criteria.
Yet, the problem with these approaches is that they treat the summarization task in an application- and user-independent way by producing "objective" summaries whose usefulness is limited to a generic, very high-level overview of the data. By contrast, in our scenario we are mostly interested in facilitating the generation of requirements-oriented and task-specific summaries that may be significantly more helpful to knowledge engineers in their task of locating semantic data to reuse and exploit.

The second important dimension of our scenario is the facilitation of access to the enriched data by end users. Depending on the type of the organization, these users can be internal (e.g. business executives, consultants, salesmen etc.), who need this data in order to perform important job tasks, or external (e.g. clients), who again may access the data for purposes like product search or information finding. In both cases, the users' concern is the effectiveness of this access, both in terms of accuracy (i.e. finding the right information) and ease (i.e. accessing the information in an intuitive way).

Typically, access to semantic data is facilitated in two ways: by navigation and by querying. Navigation involves using appropriate structures, like hierarchies, facets, tag clouds etc., in order to enable the user to quickly locate the information he/she looks for without the need to search. The same structures may also be used for browsing when the user just wants to have an overview of the data.

Figure 4: Linked Data Cloud in 2011 (source: Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/)

Particularly for Linked Data, one may find browsers like Disco (http://www4.wiwiss.fu-berlin.de/bizer/ng4j/disco) or Tabulator [Berners-Lee et al., 2008].
The second way for a user to access a semantic dataset is by querying, i.e. by expressing his/her information need in some query language that is compatible with the dataset's underlying representation language (usually SPARQL, http://www.w3.org/TR/sparql11-query/, for RDF data) and having a system execute this query against the dataset. This way, however, requires the user to be technically savvy and to be aware of the structure and the particular characteristics of the data. For that reason, a great amount of research has focused in recent years on the task of semantic question answering (QA). The latter refers to the transformation of user queries expressed in natural language into formal queries that can be executed against the data [Lopez et al., 2011]. A variety of systems that try to achieve this have already been proposed in several works [Cimiano et al., 2007] [Ferrandez et al., 2009] [Unger and Cimiano, 2011] [Lopez et al., 2007] [Damljanovic et al., 2012].

The common characteristic of the above navigational and querying approaches is that they focus on providing generic, task-independent access to data. In contrast, in K-Drive's use case scenario the focus is on optimizing this access based on the particular characteristics and goals of the application scenario at hand. The reason is that these characteristics may significantly influence the effectiveness and quality of the access.
More specically,in many scenarios users access data with some goal in mind (e.g.to
educate themselves on an issue,to verify a fact,to take a decision etc.) and perhaps with some
special focus (e.g.a lm critic might be asking questions about lms and genres or a marketing
7
http://www4.wiwiss.fuberlin.de/bizer/ng4j/disco
8
http://www.w3.org/TR/sparql11-query/
6
manager about volume of sales).In navigational access this goal and focus should determine
what data and in what way should be presented to the user.In access by querying,in turn,these
two factors should determine which of the available data are more eective in answering the
user's questions than others (e.g.greater subject coverage,availability of particular relations
or concepts etc.).This is important as executing queries over all available data can have a
negative eect on both the eciency (larger knowledge base) and eectiveness (greater degree
of ambiguity,misleading information etc.) of the process.
Therefore, the second task of the K-Drive use case scenario, after the data enrichment, is the provision of user-driven and goal-oriented access to the enriched data in two ways:
• By means of a user-driven data summarization process, where the user is able to define and execute custom summarization tasks in order to generate data summaries that are useful to him/her.
• By means of a scenario-based query answering process, where the application scenario characteristics (domain, goals, focus) are exploited in order to better understand and thus answer the potential users' queries.

Based on the above, the typical execution flow of the K-Drive use case scenario consists of the following steps:
1. Given one or more (typically small) initial datasets to be enriched, the knowledge engineer uses, through an appropriate user interface, the data summarization capabilities of the overall system to generate targeted summaries for selected available datasets.
2. Based on the summaries, the engineer decides which datasets (or parts of them) are useful for his/her purposes and uses them to perform data enrichment.
3. The resulting data is made available to the end users of the system, who use the latter's data summarization and question answering capabilities for particular tasks and purposes.

A more detailed analysis of the functional and technological requirements for the implementation of the above scenario is contained in deliverable D1.2.
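The three steps above can be sketched as a simple pipeline; every function, dataset and threshold here is a hypothetical placeholder rather than a K-Drive component:

```python
# Step 1: produce a targeted summary (here, just an instance count).
def summarize(dataset: dict) -> dict:
    return {"name": dataset["name"], "n_instances": len(dataset["instances"])}

# Step 2: keep the datasets whose summaries meet the engineer's criteria.
def select_useful(datasets: list, min_instances: int) -> list:
    return [d for d in datasets if summarize(d)["n_instances"] >= min_instances]

# Step 3 (input side): merge the selected sources into the core dataset.
def enrich(core: dict, sources: list) -> dict:
    merged = dict(core)
    merged["instances"] = core["instances"] + [i for s in sources for i in s["instances"]]
    return merged

core = {"name": "enterprise", "instances": ["iSOCO"]}
candidates = [
    {"name": "sports", "instances": ["TeamA", "TeamB"]},
    {"name": "empty", "instances": []},
]
enriched = enrich(core, select_useful(candidates, min_instances=1))
print(enriched["instances"])  # → ['iSOCO', 'TeamA', 'TeamB']
```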
2.2 Specic Implementation Scenarios
For validation and evaluation purposes,the above generic use case scenario of data enrichment
and access will be exemplied by considering two specic application scenarios,one related to
knowledge management and decision support and another to intelligent content search.
In particular,the rst scenario will involve the enrichment of an enterprise ontology describ-
ing organizational related knowledge such as employees,projects,areas of business and research,
technologies etc.The candidate data to be used in this enrichment will typically involve other
companies,business and research areas and geographical information.The target users in this
scenario are the knowledge engineers who will need to do the enrichment and,perhaps,dene
dierent data modules per group of end users (e.g.human resource department) as well as
organization's employees and executives who will need the enriched information for particular
tasks.
An example of such a task is the evaluation of a tender call,namely an open request made
by some organization for a written oer concerning the procurement of goods or services at
7
Figure 5:Tender Call Evaluation Process
a specied cost or rate.The evaluation of such a call by a company refers to the process of
deciding whether it should devote resources for preparing a competitive tender in order to be
awarded the bid (see gure 5).This process is practically a decision making problem for which
the decision makers need to consider a variety of information and data regarding,among others,
the company's expertise,experience,resources etc.Other similar tasks include project stang,
recruiting,competition assessment,strategic planning etc.
The second scenario will involve the enrichment of a dataset describing multimedia items (e.g. videos) related to sports events such as football matches, F1 races etc. The need for enrichment in this scenario will arise from the fact that most of this dataset's information is inadequately semantically grounded (e.g. players are described by their names instead of URIs) and that useful additional information is missing (e.g. previous winners of a given competition). The users here will again be knowledge engineers, but also end users who will want to gain insightful access to these multimedia items, either by means of summaries (e.g. best scorers in a given period) or by direct queries (e.g. "I want all videos involving player X").
3 Semantic Datasets
In this section we describe the datasets that are to be used in the use case scenarios described above. For the purpose of this deliverable, as well as throughout the K-Drive project, we consider a semantic dataset to be "a set of RDF triples that are published, maintained or aggregated by a single provider" [Alexander et al., 2009]. This definition is important as it distinguishes the notion of "dataset" from that of an "RDF graph", which can consist of any arbitrary set of triples [Beckett, 2004]. In fact, a dataset typically has the following characteristics:
• It covers one or more certain topics (e.g. geography, biology etc.).
• It originates from a certain source (e.g. Wikipedia) or process.
• It is accessible on the Web, for example through resolvable HTTP URIs, through a SPARQL endpoint or through some dedicated API.
• It contains a sufficient number of RDF triples.

Perhaps the best-known and most important public semantic dataset, in terms of size and topic coverage, is currently DBPedia. Nevertheless, there are also many other datasets (http://linkeddata.org/data-sets) that range from horizontal and rather shallow coverage of multiple topics (e.g. Freebase) to deep coverage of very specialized domains (e.g. GeoSpecies, http://lod.geospecies.org/). A comprehensive list of public semantic data can be found at http://thedatahub.org/group/lodcloud.
Information about the characteristics and properties of a dataset is normally expected to be made available by the dataset publisher so as to enable its discovery and selection. In fact, data consumers, either humans or software agents, are mostly interested in the following types of information regarding a dataset:
• Its content, that is, what the dataset is mainly about.
• Its interlinking to other datasets, that is, to which other datasets and how the dataset is interlinked.
• The vocabularies and ontologies used in the dataset.
• The way it can be programmatically accessed: as an RDF dump, through a SPARQL endpoint or via any other protocol.

Such information can be expressed by means of the voiD vocabulary [Alexander et al., 2009], which is an ontology that allows linked RDF datasets to be formally described. Nevertheless, at the moment not all public datasets provide such descriptions. Furthermore, the granularity and richness of the provided metadata may differ significantly among different datasets.
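As an illustration of the kind of metadata voiD captures, the sketch below assembles a minimal void:Dataset description as a Turtle string. The namespaces are the real voiD and Dublin Core ones, but the description URI is invented for illustration; the endpoint and triple count are taken from the DBPedia overview later in this section:

```python
# Real voiD/Dublin Core namespaces; the description itself is illustrative.
PREFIXES = (
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
)

def void_description(uri: str, title: str, endpoint: str, triples: int) -> str:
    """Render a small void:Dataset description in Turtle."""
    return PREFIXES + (
        f"<{uri}> a void:Dataset ;\n"
        f'    dcterms:title "{title}" ;\n'
        f"    void:sparqlEndpoint <{endpoint}> ;\n"
        f"    void:triples {triples} .\n"
    )

desc = void_description(
    "http://example.org/void/dbpedia",  # hypothetical description URI
    "DBPedia",
    "http://dbpedia.org/sparql",
    447470256,
)
print(desc)
```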
In this deliverable, we describe a number of datasets that are to be used within the K-Drive project. The selection of these datasets has been driven by the need to have a balanced set that gathers the following characteristics:
• Relevance to the case study: The datasets need to be directly or indirectly relevant to the domains and application scenarios described in the use case scenario above.
• Heterogeneity and multi-domain coverage: Some datasets need to be heterogeneous, so that there is a need to apply summarization techniques on them in order to understand them better and determine the parts of them to reuse for a given task.
• Usage by the research community: Some datasets need to have already been used for evaluation purposes in projects and publications relevant to K-Drive, so as to be able to compare our methods and techniques against the relevant state of the art.

More specifically, the datasets for the K-Drive project fall into three categories. The first category includes relatively small core datasets that are directly relevant to the use case scenario and which are going to be the starting point for the question answering and summarization functionalities to be developed for the end-users. These datasets are not necessarily public nor part of the Linked Data cloud, but reflect typical data that are used within organizations and require enrichment in order to become more usable. The second category includes large heterogeneous datasets that cover multiple topics in a horizontal and rather shallow way. These datasets will be primarily used for enriching and complementing the core datasets with generic or specialized knowledge, as well as for developing and testing methods related to data summarization and selection. Finally, the third category includes vertical datasets which will be used for enriching the core datasets with specialized knowledge in particular domains. In the following sections we describe datasets from each of the above categories.
3.1 Core Datasets
3.1.1 Enterprise Information Datasets
These datasets consist of three ontologies describing company-related information about three companies: iSOCO, Telefonica and Repsol. The ontology about iSOCO is rather small, including information about employees, projects, areas of business and research, and technologies. The Telefonica and Repsol ontologies are richer, describing business processes, products, technologies, strategic goals etc. All ontologies are suitable for knowledge management and business intelligence scenarios as well as for enrichment with public semantic data (e.g. with data about other companies, business and research areas, geographical information). Table 1 summarizes the basic features of these datasets.
Table 1: Enterprise Datasets Overview

Dataset          | Telefonica                                  | Repsol                                   | iSOCO
No of Classes    | 139                                         | 356                                      | 6
No of Properties | 27                                          | 79                                       | 20
No of Instances  | 1163                                        | 1692                                     | 185
Main Classes     | Company, Technology, Client, Strategic Area | Company, Client, Installation, Personnel | Office, Technology, Project, Person
As already mentioned above, these three datasets will be the ones to be enriched in the scenario related to knowledge management and decision support.
3.1.2 Tenders Electronic Daily Dataset
The Tenders Electronic Daily (TED) dataset is the online version of the "Supplement to the Official Journal of the European Union", dedicated to European public procurement. It contains information about procurements mainly in the European area and is updated every 1 to 3 days. A comprehensive search interface is also provided for users to search by facets such as country, CPV (Common Procurement Vocabulary) code and deadline, or by writing a CCL (Common Command Language) query. Although the TED dataset does not have an RDF version yet, its online data (http://ted.europa.eu/TED/browse/browseByBO.do) can be easily extracted and converted into structured data. In fact, for each document published in TED, a structured data table is provided as well. Therefore a TED ontology can be constructed and updated accordingly. Table 2 summarizes the basic features of this dataset. Note that the instances included in this table are only the static instances, such as business types and countries. Dynamic instances, such as the procurement documents, their authority organisations etc., are not counted.
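The extraction-and-conversion step just described can be sketched as follows; the field names and the ted:/example.org URI scheme are invented for illustration, since this deliverable does not fix a concrete TED ontology vocabulary:

```python
# One TED notice as it might look after extraction from the online data table.
record = {
    "doc_id": "2012-123456",
    "country": "ES",
    "contract_type": "Services",
}

def record_to_triples(record: dict) -> list:
    """Map a flat TED record to subject-predicate-object triples."""
    subject = f"http://example.org/ted/{record['doc_id']}"  # hypothetical namespace
    return [
        (subject, "rdf:type", "ted:Contract"),
        (subject, "ted:country", record["country"]),
        (subject, "ted:contractType", record["contract_type"]),
    ]

for triple in record_to_triples(record):
    print(triple)
```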
Table 2: TED Dataset Overview

Dataset                | TED
No of Classes          | 30
No of Properties       | 22
No of Static Instances | 11554
Main Classes           | Country, Contract, Business Type, Main Activity
As already mentioned above, this dataset will be enriched in the scenario related to knowledge management and decision support. Its enrichment can be performed by linking to both the enterprise information datasets introduced above and specialized datasets such as Geonames, as will be introduced later in this deliverable.
3.1.3 BUSCAMEDIA Multimedia Dataset
This dataset includes semantic descriptions of approximately 2500 video fragments from sports events (football and basketball matches as well as Formula 1 races), represented according to the M3 ontology network (http://buscamedia.isoco.net/m3repository/). The latter represents knowledge related to multimedia information of any type of resource (text, video, audio, audiovisual or image) in several domains, taking into account a multilingual context. In particular, it covers the following three perspectives:
• The multimedia perspective (figure 6), which models media information at different layers: low-level information (e.g. MPEG-7 descriptors), multimedia structure information (e.g. objects), and information about the contents of the multimedia resource (related to the different domains).
• The multidomain perspective (figure 7), which provides general definitions across different domains (e.g. events, agents, actions) and information for different particular domains. The current version covers the sport domain and a set of its subdomains (Football, Basketball, and Formula One).
• The multilingual perspective, which represents the linguistic information necessary to support any of the co-official Spanish languages (Spanish, Catalan, Galician, and Basque) as well as English.
Table 3: Buscamedia Datasets Overview

Dataset          | Videos                            | Football                                          | Formula One
No of Classes    | 15                                | 59                                                | 18
No of Properties | 27                                | 152                                               | 34
No of Instances  | 2500                              | 134                                               | 36
Main Classes     | Video, VideoFragment, VideoFormat | FootballPlayer, FootballTeam, Stadium, MatchEvent | Circuit, RaceDriver, GrandPrix
Figure 6: M3 Multimedia Ontology
As already mentioned above, the Buscamedia dataset will be the one to be enriched in the scenario involving advanced access to sports videos. Table 3 summarizes the dataset's key characteristics.
Figure 7: M3 Multidomain Ontology

3.2 Horizontal Multi-Domain Datasets
3.2.1 DBPedia
DBpedia [Bizer et al., 2009b] is the result of an ongoing process of extracting structured information from Wikipedia, representing it in RDF format, and making it available on the Web. The data, which are published under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License, are structured according to the DBpedia Ontology (http://dbpedia.org/Ontology), namely a shallow, cross-domain ontology which has been manually created based on the most commonly used infoboxes within Wikipedia.

The DBPedia ontology currently defines about 350 classes which are organized in a subsumption hierarchy and are described by 1,775 distinct properties. Furthermore, it contains information about 2.3 million instances and features 6,200,000 external links into other RDF datasets, making it perhaps the most interlinked dataset in the Linked Data cloud.

DBPedia datasets are accessible through two SPARQL endpoints (http://dbpedia.org/sparql and http://live.dbpedia.org/sparql) as well as in the form of N-Triples and N-Quads datasets. The latter include a provenance URI for each statement, which denotes the origin of the extracted triple in Wikipedia. This URI is composed of the URI of the Wikipedia article from which the statement has been extracted and a number of parameters denoting the exact source line. Table 4 summarizes the key characteristics of DBPedia.
DBPedia is particularly relevant to the K-Drive use case scenario as it is a large, horizontal and heterogeneous dataset which:
• Contains information related to the domains of the use case scenario (e.g. football-related entities).
• Covers multiple domains, meaning that in order to be evaluated for reuse it will definitely require the application of summarization techniques.
• Has been used by the research community for evaluation purposes in tasks like question answering [Lopez et al., 2007], data summarization [Presutti et al., 2011] [d'Aquin and Motta, 2011] and semantic annotation [Mendes et al., 2011] (which is an important task in data enrichment).
Table 4: DBPedia Overview

URL: http://dbpedia.org
Namespace: http://dbpedia.org/resource
SPARQL endpoint: http://live.dbpedia.org/sparql
Availability: Public
Datasets interlinked with: About 30, including Freebase, OpenCyc, Geonames and YAGO2
No of classes: 350
No of relations: 1,775
No of entities: 2,350,000
No of facts: 447,470,256
Main entity types: People (764,000), Organizations (192,000), Places (687,414), Species (202,000)

3.2.2 YAGO2

YAGO2 [Hoffart et al., 2011a] is a knowledge base that is automatically built from Wikipedia, GeoNames, and WordNet and focuses on temporal and spatial knowledge. Temporal information is available for four major entity types, namely people (through the relations wasBornOnDate and diedOnDate), groups (through the relations wasCreatedOnDate and wasDestroyedOnDate), artifacts (through the relations wasCreatedOnDate and wasDestroyedOnDate) and events (through the relations startedOnDate and endedOnDate).

On the other hand, all entities possessing a permanent spatial extent on Earth (e.g. countries, cities, or mountains) are grouped together under the class yagoGeoEntity. The position of such an entity is then described by geographical coordinates, expressed through the yagoGeoCoordinates datatype, and linked to the entity through the hasGeoCoordinates relation. Table 5 summarizes the basic characteristics of the YAGO dataset.
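The spatial modelling described above can be sketched as a SPARQL query over the yagoGeoEntity class and the hasGeoCoordinates relation. The snippet only builds the query string; actually sending it to the YAGO endpoint (http://lod.openlinksw.com/sparql) is left out, and the exact prefix binding is an assumption based on the dataset's namespace.

```python
# Sketch: build (but do not execute) a SPARQL query for YAGO2 geo-entities.
# The yago: prefix URI is assumed from the dataset namespace in Table 5.

def yago_geo_query(limit=10):
    """Return a SPARQL query selecting geo-entities and their coordinates."""
    return f"""
PREFIX yago: <http://yago-knowledge.org/resource/>
SELECT ?entity ?coords WHERE {{
  ?entity a yago:yagoGeoEntity ;
          yago:hasGeoCoordinates ?coords .
}} LIMIT {limit}
"""

query = yago_geo_query(limit=5)
print(query)
```

Such a query could then be submitted to the public endpoint listed in Table 5 to retrieve entities together with their geographical coordinates.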
As with DBPedia, the YAGO dataset is particularly relevant to the K-Drive use case scenario, as it is also a large horizontal and heterogeneous dataset that covers multiple domains and has been used by the research community for evaluation purposes in tasks like question answering [Adolphs et al., 2011] and semantic annotation [Hoffart et al., 2011b] (which is an important task in data enrichment). Furthermore, since much of its content is shared with DBPedia, YAGO will be useful for testing and evaluating the task of dataset selection.
Table 5: YAGO Dataset Overview

URL: http://www.mpi-inf.mpg.de/yago-naga/yago2/
Namespace: http://yago-knowledge.org/resource
SPARQL endpoint: http://lod.openlinksw.com/sparql
Availability: Public
Datasets interlinked with: DBPedia, Geonames
No of classes: 365,372
No of relations: 104
No of entities: 2,648,387
No of facts: 124,333,521
Main entity types: People (872,155), Groups (316,699), Artifacts (212,003), Events (187,392), Locations (687,414)
3.3 Specialized Datasets
3.3.1 Geonames
The GeoNames geographical database (http://www.geonames.org/) integrates geographical data such as names of places, elevation, population and others from various sources. It covers all countries and contains over 10 million geographical names with over 8 million unique features. All features are categorized into one of nine feature classes and further subcategorized into one of 645 feature codes, according to the GeoNames ontology (available in OWL at http://www.geonames.org/ontology/ontology_v3.01.rdf). The data is accessible, under a Creative Commons Attribution license, through a number of web services and a daily database export. An overview of GeoNames characteristics is shown in Table 6.
Table 6: GeoNames Dataset Overview

URL: http://www.geonames.org/
Namespace: http://www.geonames.org/ontology#
Availability: Public under Creative Commons Attribution license
Web Service for search: http://www.geonames.org/export/geonames-search.html
No of entities: About 10 million
Main entity types: Countries, Cities, Towns, Villages, Mountains, Rivers, Postal Codes, Buildings, Airports
Geonames is considered in the K-Drive use case scenario because its comprehensive geographical information can be useful in virtually any domain whose content has a geographical dimension. Furthermore, it has been extensively used in research works relevant to geographical information processing and retrieval [Volz et al., 2007] [Andogah and Koster, 2008] [Andogah and Nerbonne, 2012].
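The GeoNames search web service referenced in Table 6 can be illustrated with a short sketch that builds a request URL for the JSON search endpoint. Only the URL is constructed; no network call is made, and the username value is a placeholder (GeoNames requires a registered account name for live requests).

```python
# Sketch: build a request URL for the GeoNames JSON search web service.
# The "demo" username is a placeholder, not a working credential.
from urllib.parse import urlencode

def geonames_search_url(place, max_rows=10, username="demo"):
    """Return a URL querying the GeoNames search service for a place name."""
    params = urlencode({"q": place, "maxRows": max_rows, "username": username})
    return "http://api.geonames.org/searchJSON?" + params

print(geonames_search_url("Aberdeen", max_rows=3))
# http://api.geonames.org/searchJSON?q=Aberdeen&maxRows=3&username=demo
```

The response (a JSON list of matching features with coordinates and feature codes) could then be used to enrich core datasets with a geographical dimension, as described above.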
3.3.2 LinkedGeoData
LinkedGeoData [Stadler et al., 2011] [Auer et al., 2009] (Table 7) uses the spatial information collected by the OpenStreetMap project (http://www.openstreetmap.org/) and makes it available as an RDF knowledge base according to the Linked Data principles. It consists of more than 1 billion nodes and 100 million ways, and the resulting RDF data consists of approximately 20 billion triples. The data is interlinked with DBpedia and GeoNames.
Data access is facilitated through a REST API as well as through two SPARQL endpoints: a static one (http://linkedgeodata.org/sparql) that contains data extracted from an OpenStreetMap file of a certain date, and a live one (http://live.linkedgeodata.org/sparql) that contains a synchronized version.
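Choosing between the static and the live endpoint described above can be sketched via the standard SPARQL protocol GET binding (the `?query=` parameter). The request is only built, not sent, and the example query is an illustrative assumption rather than a recommended LinkedGeoData query.

```python
# Sketch: address the static or live LinkedGeoData SPARQL endpoint via the
# SPARQL protocol GET binding. The example query is illustrative only.
from urllib.parse import urlencode

ENDPOINTS = {
    "static": "http://linkedgeodata.org/sparql",
    "live": "http://live.linkedgeodata.org/sparql",
}

def lgd_request_url(query, live=False):
    """Return a GET request URL against the static or the live endpoint."""
    endpoint = ENDPOINTS["live" if live else "static"]
    return endpoint + "?" + urlencode({"query": query})

url = lgd_request_url("SELECT * WHERE { ?s ?p ?o } LIMIT 1", live=True)
print(url)
```

Using the live endpoint trades reproducibility for freshness: the static endpoint reflects one OpenStreetMap snapshot, while the live one tracks the synchronized version.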
As with GeoNames, the LinkedGeoData dataset will provide the K-Drive use case scenario with specialized geographical information to be reused for data enrichment. It is a dataset that has been considered in several works [Van Aart et al., 2010] [Parundekar et al., 2011] [Fouad et al., 2010], and it will be used for comparison with GeoNames in the task of dataset evaluation and selection.
Table 7: LinkedGeoData Dataset Overview

URL: http://www.linkedgeodata.org/
Datasets interlinked with: DBPedia, Geonames
Availability: Public
SPARQL endpoints: http://linkedgeodata.org/sparql, http://live.linkedgeodata.org/sparql
Web Service for access: http://linkedgeodata.org/OnlineAccess/RestApi
Main entity types: Countries, Cities, Streets, Administrative regions, Buildings
3.3.3 Research Datasets: ACM, IEEE, DBLP, Citeseer and BibBase

These 5 datasets contain information about research publications, mainly in the area of computer science and informatics, and are all based on the AKT Reference Ontology (http://projects.kmi.open.ac.uk/akt/ref-onto/). In particular, the ACM dataset represents publications of the Association for Computing Machinery (ACM), along with details of their authors. Currently, the dataset is about 1.1 gigabytes in size and contains about 1.2 million triples. It is accessible from a SPARQL endpoint at http://acm.rkbexplorer.com/sparql/. Similarly, the IEEE dataset represents publications of the Institute of Electrical and Electronics Engineers; it contains about 120,000 triples and is available through an endpoint at http://ieee.rkbexplorer.com/sparql/.

On the other hand, the DBLP dataset is more generic and provides bibliographic information on major computer science journals and conference proceedings, containing more than 800,000 articles and 400,000 authors. It is derived from a D2R Server publishing the DBLP Bibliography Database as Linked Data. The complete RDF view on the database consists of approximately 15 million RDF triples. It is accessible from a SPARQL endpoint at http://dblp.rkbexplorer.com/sparql.
Citeseer (http://csxstatic.ist.psu.edu/about) is also a scientific literature digital library that focuses primarily on the literature in computer and information science. The corresponding dataset is about 1 GB with about 7.9 million triples, and it is accessible from a SPARQL endpoint at http://citeseer.rkbexplorer.com/sparql. Finally, BibBase is a collection of publications in the form of BibTeX files (http://www.bibtex.org/), a file format which is used to describe and process lists of references, mostly in conjunction with LaTeX documents. Its SPARQL endpoint is available at http://data.bibbase.org:2020/sparql.
All the above datasets (summarized in Table 8) are expected to be used in the K-Drive use case scenarios, and in particular in the one involving the enrichment and use of the enterprise information datasets for tasks like tender call evaluation, recruiting, project staffing etc. That is because all three companies that these enterprise datasets concern have intensive R&D activities requiring specialized knowledge like that contained in the 5 research datasets above.
Table 8: Research Datasets Overview

Dataset | URL | SPARQL Endpoint | No of Entities
ACM | http://acm.rkbexplorer.com | http://acm.rkbexplorer.com/sparql/ | 1.2 million
IEEE | http://ieee.rkbexplorer.com | http://ieee.rkbexplorer.com/sparql/ | 120,000
DBLP | http://dblp.rkbexplorer.com | http://dblp.rkbexplorer.com/sparql | 36.5 million
Citeseer | http://citeseer.rkbexplorer.com | http://citeseer.rkbexplorer.com/sparql | 7.9 million
BibBase | http://data.bibbase.org | http://data.bibbase.org:2020/sparql | 200,000
3.3.4 OpenCorporates

OpenCorporates (http://opencorporates.com/) (Table 9) provides information about corporate entities as open data under the share-alike attribution Open Database Licence (http://opendatacommons.org/licenses/odbl/1.0/). Currently, about 43.5 million companies are covered, each having been assigned a unique URL.
Table 9: OpenCorporates Dataset Overview

URL: http://opencorporates.com/
Namespace: http://opencorporates.com/companies
Datasets interlinked with: DBPedia
Availability: Public
Access API: http://api.opencorporates.com/
No of companies: 43.5 million
No of triples: 750,000,000
As with the research datasets, the OpenCorporates dataset will be utilized in the enterprise datasets scenario by providing useful knowledge about other companies.
3.4 The K-Drive Data Model
Figure 8 describes the K-Drive data model at a high level. It shows three categories of datasets and their relations.

Figure 8: The K-Drive Data Model

The top category in Figure 8 is called core datasets and contains 3 small core datasets, i.e., the Enterprise Information Datasets, TED, and the BUSCAMEDIA Multimedia Dataset. These datasets are especially relevant to the use case scenario and are going to be enriched and used as a starting point for the question answering and summarization functionalities to be developed for the end-users.
The second category is called Horizontal Multi-Domain Datasets and includes large heterogeneous datasets, i.e., DBPedia and YAGO2, covering multiple topics in a horizontal and rather shallow way. These datasets will be primarily used for enriching and complementing the core datasets with generic or specialized knowledge, as well as for developing and testing methods related to data summarization and selection.

The third category is called Specialized Datasets and includes Geonames, LinkedGeoData, the Research Datasets, and OpenCorporates. These are vertical datasets which will be used for enriching the core datasets with specialized knowledge in particular domains.
4 Conclusion
In this deliverable we described a set of semantic data sources that are intended to be used within the K-Drive project in order to support the research and development activities of the subsequent work packages and to evaluate the quality of their outcomes. The selection of these datasets was performed in accordance with a use case scenario involving the provision of advanced methods, techniques and tools for their reuse and access in various knowledge-intensive tasks.
Acknowledgement
This research has been funded by the European Commission within the 7th Framework Programme/Marie Curie Industry-Academia Partnerships and Pathways scheme/PEOPLE Work Programme 2011, project K-Drive, number 286348 (cf. http://www.kdrive-project.eu). We appreciate the support of Mr. Yuan Ren (UNIABDN) and Dr. Yuting Zhao (UNIABDN) in further enriching and polishing this deliverable.
References
[Adolphs et al., 2011] Adolphs, P., Theobald, M., Schäfer, U., Uszkoreit, H., and Weikum, G. (2011). YAGO-QA: Answering questions by structured knowledge queries. In Proceedings of the 2011 IEEE Fifth International Conference on Semantic Computing, ICSC '11, pages 158-161, Washington, DC, USA. IEEE Computer Society.

[Alexander et al., 2009] Alexander, K., Cyganiak, R., Hausenblas, M., and Zhao, J. (2009). Describing Linked Datasets - On the Design and Usage of voiD, the 'Vocabulary of Interlinked Datasets'. In WWW 2009 Workshop: Linked Data on the Web (LDOW 2009), Madrid, Spain.
[Andogah and Nerbonne, 2012] Andogah, G., Bouma, G., and Nerbonne, J. (2012). Every document has a geographical scope. Data and Knowledge Engineering.

[Andogah and Koster, 2008] Andogah, G., Bouma, G., Nerbonne, J., and Koster, E. (2008). Methodologies and resources for processing spatial language.
[Auer et al., 2009] Auer, S., Lehmann, J., and Hellmann, S. (2009). LinkedGeoData - adding a spatial dimension to the web of data. In Proc. of 8th International Semantic Web Conference (ISWC).

[Beckett, 2004] Beckett, D. (2004). RDF/XML Syntax Specification (Revised). W3C recommendation, W3C. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/.

[Berners-Lee et al., 2008] Berners-Lee, T., Hollenbach, J., Lu, K., Presbrey, J., and Schraefel, M. (2008). Tabulator Redux: Browsing and Writing Linked Data.

[Bizer et al., 2009a] Bizer, C., Heath, T., and Berners-Lee, T. (2009a). Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1-22.

[Bizer et al., 2009b] Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009b). DBpedia - a crystallization point for the web of data. Web Semant., 7(3):154-165.

[Cimiano et al., 2007] Cimiano, P., Haase, P., and Heizmann, J. (2007). Porting natural language interfaces between domains: an experimental user study with the ORAKEL system. In Proceedings of the 12th international conference on Intelligent user interfaces, IUI '07, pages 180-189, New York, NY, USA. ACM.
[Damljanovic et al., 2012] Damljanovic, D., Agatonovic, M., and Cunningham, H. (2012). FREyA: an interactive way of querying linked data using natural language. In Proceedings of the 8th international conference on The Semantic Web, ESWC '11, pages 125-138, Berlin, Heidelberg. Springer-Verlag.

[d'Aquin and Motta, 2011] d'Aquin, M. and Motta, E. (2011). Extracting relevant questions to an RDF dataset using formal concept analysis. In Musen, M. A. and Corcho, O., editors, K-CAP, pages 121-128. ACM.

[Ferrandez et al., 2009] Ferrandez, O., Izquierdo, R., Ferrandez, S., and Vicedo, J. L. (2009). Addressing ontology-based question answering with collections of user queries. Inf. Process. Manage., 45(2):175-188.
[Fouad et al., 2010] Fouad, R. A., Badr, N., Talha, H., and Hashem, M. (2010). On Location-Centric Semantic Information Retrieval in Ubiquitous Computing Environments. International Journal of Electrical & Computer Sciences IJECS-IJENS, 10(06).

[Hoffart et al., 2011a] Hoffart, J., Suchanek, F. M., Berberich, K., Lewis-Kelham, E., de Melo, G., and Weikum, G. (2011a). YAGO2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th international conference companion on World Wide Web, WWW '11, pages 229-232, New York, NY, USA. ACM.

[Hoffart et al., 2011b] Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Thater, S., and Weikum, G. (2011b). Robust disambiguation of named entities in text. In Proceedings of EMNLP 2011: Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, July 27-31, pages 782-792.
[Hoppenbrouwers and Lucas, 2009] Hoppenbrouwers, S. and Lucas, P. (2009). Attacking the knowledge acquisition bottleneck through games-for-modelling. In Proceedings of AISB 2009 workshop (AI and Games).

[Li and Motta, 2010] Li, N. and Motta, E. (2010). Evaluations of user-driven ontology summarization. In Proceedings of the 17th international conference on Knowledge engineering and management by the masses, EKAW '10, pages 544-553, Berlin, Heidelberg. Springer-Verlag.

[Lopez et al., 2007] Lopez, V., Uren, V., Motta, E., and Pasin, M. (2007). AquaLog: An ontology-driven question answering system for organizational semantic intranets. Web Semant., 5(2):72-105.

[Lopez et al., 2011] Lopez, V., Uren, V. S., Sabou, M., and Motta, E. (2011). Is question answering fit for the semantic web? A survey. Semantic Web, 2(2):125-155.

[Mendes et al., 2011] Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics '11, pages 1-8, New York, NY, USA. ACM.

[Parundekar et al., 2011] Parundekar, R., Ambite, J. L., and Knoblock, C. A. (2011). Aligning unions of concepts in ontologies of geospatial linked data. In Proceedings of the Terra Cognita 2011 Workshop in Conjunction with the 10th International Semantic Web Conference.
[Peroni et al., 2008] Peroni, S., Motta, E., and d'Aquin, M. (2008). Identifying key concepts in an ontology through the integration of cognitive principles with statistical and topological measures. In Third Asian Semantic Web Conference, Bangkok, Thailand.

[Presutti et al., 2011] Presutti, V., Aroyo, L., Adamou, A., Schopman, B., Gangemi, A., and Schreiber, G. (2011). Extracting core knowledge from linked data. In Proceedings of the Second Workshop on Consuming Linked Data, COLD 2011, workshop in conjunction with the 10th International Semantic Web Conference 2011 (ISWC 2011). CEUR-WS.

[Reichartz et al., 2010] Reichartz, F., Korte, H., and Paass, G. (2010). Semantic relation extraction with kernels over typed dependency trees. In KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 773-782, New York, NY, USA. ACM.

[Stadler et al., 2011] Stadler, C., Lehmann, J., Höffner, K., and Auer, S. (2011). LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal.

[Szumlanski and Gomez, 2010] Szumlanski, S. and Gomez, F. (2010). Automatically acquiring a semantic network of related concepts. In Proceedings of the 19th ACM international conference on Information and knowledge management, CIKM '10, pages 19-28, New York, NY, USA. ACM.

[Unger and Cimiano, 2011] Unger, C. and Cimiano, P. (2011). Pythia: compositional meaning construction for ontology-based question answering on the semantic web. In Proceedings of the 16th international conference on Natural language processing and information systems, NLDB '11, pages 153-160, Berlin, Heidelberg. Springer-Verlag.

[Van Aart et al., 2010] Van Aart, C., Wielinga, B., and Van Hage, W. R. (2010). Mobile cultural heritage guide: location-aware semantic search. In Proceedings of the 17th international conference on Knowledge engineering and management by the masses, EKAW '10, pages 257-271, Berlin, Heidelberg. Springer-Verlag.

[Volz et al., 2007] Volz, R., Kleb, J., and Mueller, W. (2007). Towards ontology based disambiguation of geographical identifiers. In WWW 2007, Banff, Canada.

[Wagner, 2007] Wagner, C. (2007). Breaking the knowledge acquisition bottleneck through conversational knowledge management. Innovative Technologies for Information Resources Management, page 200.