Report from Dagstuhl Seminar 12362

The Multilingual Semantic Web

Edited by Paul Buitelaar¹, Key-Sun Choi², Philipp Cimiano³, and Eduard H. Hovy⁴

1 National University of Ireland – Galway, IE, paul.buitelaar@deri.org
2 KAIST – Daejeon, KR, kschoi@kaist.ac.kr
3 Universität Bielefeld, DE, cimiano@cit-ec.uni-bielefeld.de
4 University of Southern California – Marina del Rey, US
Abstract

This document constitutes a brief report from the Dagstuhl Seminar on the "Multilingual Semantic Web", which took place at Schloss Dagstuhl between September 3rd and 7th, 2012 (see also http://www.dagstuhl.de/12362). The document states the motivation for the workshop as well as its main thematic focus. It describes the organization and structure of the seminar and briefly reports on the main topics of discussion and the main outcomes of the workshop.

Seminar 02.–09. September, 2012 – www.dagstuhl.de/12362

1998 ACM Subject Classification H.1.2 User/Machine Systems: Human Factors, H.2.3 Languages, H.5.2 User Interfaces: Standardization

Keywords and phrases Semantic Web, Multilinguality, Natural Language Processing

Digital Object Identifier 10.4230/DagRep.2.9.15
1 Executive Summary

Paul Buitelaar
Key-Sun Choi
Philipp Cimiano
Eduard H. Hovy

License: Creative Commons BY-NC-ND 3.0 Unported license
© Paul Buitelaar, Key-Sun Choi, Philipp Cimiano, and Eduard H. Hovy
The number of Internet users speaking native languages other than English has grown substantially in recent years. Statistics from 2010 in fact show that the number of non-English Internet users is almost three times the number of English-speaking users (1430 million vs. 536 million users). As a consequence, the Web is turning more and more into a truly multilingual platform in which speakers and organizations from different languages and cultural backgrounds collaborate, consuming and producing information at a scale without precedent. Originally conceived by Tim Berners-Lee et al. [5] as "an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation", the Semantic Web has seen an impressive growth in recent years in terms of the amount of data published on the Web using the RDF and OWL data models. The kind of data published nowadays on the Semantic Web or Linked Open Data (LOD) cloud is mainly of a factual nature and thus represents a basic body of knowledge
that is accessible to mankind as a basis for informed decision-making. The creation of a level playing field in which citizens from all countries have access to the same information and have comparable opportunities to contribute to that information is a crucial goal. Such a level playing field would also reduce information hegemonies and biases, increasing diversity of opinion. However, the semantic vocabularies used to publish factual data on the Semantic Web are mainly English, which creates a strong bias towards the English language and culture. As in the traditional Web, language represents an important barrier for information access, as it is not straightforward to access information produced in a foreign language. A big challenge for the Semantic Web therefore is to develop architectures, frameworks and systems that can help in overcoming language and national barriers, facilitating access to information originally produced for a different culture and language. An additional problem is that most of the information on the Web stems from a small set of countries where majority languages are spoken. This leads to a situation in which the public discourse is mainly driven and shaped by contributions from those countries. The Semantic Web vision bears an excellent potential to create a level playing field for users with different cultural backgrounds and native languages, originating from different geo-political environments. The reason is that the information available on the Semantic Web is expressed in a language-independent fashion and thus has the potential to be accessible to speakers of different languages if the right mediation mechanisms are in place. However, so far the relation between multilingualism and the Semantic Web has not received enough attention in the research community. Exploring and advancing the state of the art in information access to the Semantic Web across languages is the goal of this seminar. A Semantic Web in which information can be accessed across language and national barriers has important social, political and economic implications:

it would enable access to data in other languages and thus support direct comparisons (e.g. of public spending), creating an environment in which citizens feel well-informed, contributing to increased trust and participation in democratic processes and strengthening democracy and trust in government and public administration;

it would facilitate the synchronization and comparison of information and views expressed in different languages, thus contributing to opinion-forming processes free of biases or mainstream effects;

it would foster higher information transparency; the exchange of many data items is currently limited by national boundaries and national idiosyncrasies, as is the case, for example, with financial data, whose exchange is hampered by very different accounting procedures and reporting standards. Creating an ecosystem in which financial information can be integrated across countries can contribute to higher transparency of financial information, global cash flows and investments.
Vision, Goals and Topic: The vision underlying the workshop is the creation of a Semantic Web in which all languages have the same status, every user can perform searches in their own language, and information can be contrasted, compared and integrated across languages. As a main topic for the seminar, we intend to discuss to what extent the Semantic Web can be extended – from an infrastructural and conceptual point of view – in order to support access across languages. This leads us to the discussion of two main questions:

Ontological vocabularies that are available and used in the Semantic Web cover a broad number of domains and topics to varying degrees of detail and granularity. For one thing, we will discuss to what extent these vocabularies can indeed be seen as an interlingua (language-independent) representation. This includes the question of how, building on such an interlingual representation, the Semantic Web can indeed support access to semantic data across languages. This discussion extends to the question of which approaches are suitable to translate the user's information needs, expressed in natural language, into such a language-independent representation.

For another thing, we will discuss how the multilingual Semantic Web can be constructed by publishing and linking available multilingual lexical resources following the Linked Data paradigm. In this context, we will also discuss how natural language processing tools can benefit from such a linked ecosystem of lexico-semantic background knowledge.
Other topics that we anticipated would be discussed at the seminar include the following:

models for the integration of linguistic information with ontologies, i.e., models for multilingualism in knowledge representation, in particular OWL and RDF(S)
collaborative design of ontologies across languages and cultures
multilingual ontology alignment
multilingual and cross-lingual aspects of semantic search and querying of knowledge repositories
cross-lingual question answering over Linked Data
architectures and infrastructure for a truly Multilingual Semantic Web
localization of ontologies to multiple languages
automatic integration and adaptation of (multilingual) lexicons with ontologies
multi- and cross-lingual ontology-based information extraction and ontology population
multilingualism and linked data (generation, querying, browsing, visualization and presentation)
multilingual aspects of ontology verbalization
ontology learning across languages
NLP methods to construct the multilingual Semantic Web
Organization & Structure

The Dagstuhl seminar on the Multilingual Semantic Web took place at Schloss Dagstuhl from the 3rd to the 7th of September 2012. The organizers were Paul Buitelaar (National University of Ireland, Galway), Key-Sun Choi (KAIST), Philipp Cimiano (Bielefeld University) and Eduard Hovy (CMU).

The organizers asked participants to submit an abstract and to prepare a short presentation of about 10 minutes for the seminar. The schedule proposed by the organizers combined plenary sessions, daily panels, afternoon group work and invited talks, as described below.
The first day started with an introduction by the organizers, giving an overview of the main topics and goals of the seminar. Some guiding questions for the seminar as proposed by the organizers were the following:

Can we exploit the LOD for NLP?
Can we allow for multilingual access to the knowledge in the LOD?
Can we regard the LOD as an interlingua?
Can we apply Linked Data principles to the modelling of linguistic/lexical resources?
How can we facilitate the localization of (semantic) web sites to multiple languages?

As technical and research challenges for the field in the next years, the organizers highlighted the following:

Aggregating and summarizing content across languages
Repurposing and verbalizing content in multiple languages
Linking of information across languages
Detection of inconsistent views across languages
Translation of "objects" that have a context and are produced within some workflow
Large-scale and robust text analysis in multiple languages
Personalized and contextualized interpretation of NL [38]
Cross-lingual/cultural reconciliation of conceptualizations

Every day between 10:30 and 12:00, a panel took place in which attendees of the seminar had 10 minutes to present their view on the main challenges in the field, answering the following questions in particular:

1. What are in your view the most important challenges/barriers/problems and pressing needs with respect to multilingual access to the Semantic Web?
2. Why does the problem matter in practice? Which industry sectors or domains are concerned with the problem?
3. Which figures are suited to quantify the magnitude or severity of the problem?
4. Why do current solutions fall short?
5. What insights do we need in order to reach a principled solution? What could a principled solution look like?
6. How can standardization (e.g. by the W3C) contribute?

After each panel the organizers attempted to group participants into teams around a certain topic. The groups worked together on the topic in the afternoons between 13:30 and 15:30. They were asked to wrap up their discussion and produce a summary by 17:00. These summaries were then presented in a plenary session to all participants from Tuesday to Friday between 9:00 and 10:30.

Every day between 17:00 and 18:00 (just before dinner), we had an invited talk or special activity. On the first day, Kimmo Rossi from the European Commission shared his perspective on the challenges in our field. On the second day, there was a non-academic slot: first Jeroen van Grondelle showcased an industrial application of semantic, multilingual technologies; next, Christian Lieske and Felix Sasaki discussed perception and reality of the multilingual Semantic Web. On the third day we took a small walk to Noswendel (see Figure 1), and on the fourth day we organized a demo session, giving participants the opportunity to present their tools hands-on.

Figure 1 Walking to Noswendel.
2 Table of Contents

Executive Summary
Paul Buitelaar, Key-Sun Choi, Philipp Cimiano, and Eduard H. Hovy

Overview of Talks

Some reflections on the IT challenges for a Multilingual Semantic Web
Guadalupe Aguado de Cea and Elena Montiel Ponsoda

Accessibility to a Pervasive Web for the challenged people
Dimitra Anastasiou

Multilingual Computation with Resource and Process Reuse
Pushpak Bhattacharyya

Multilingual Semantic Web and the challenges of Open Language Data
Nicoletta Calzolari

Multilingual Web Sites
Manuel Tomas Carrasco Benitez

The Multilingual Semantic Web and the intersection of NLP and Semantic Web
Christian Chiarcos

The importance of Semantic User Profiles and Multilingual Linked Data
Ernesto William De Luca

Shared Identifiers and Links for the Linguistic Linked Data Cloud
Gerard de Melo

Abstract
Thierry Declerck

Supporting Collaboration on the Multilingual Semantic Web
Bo Fu

Cross-lingual ontology matching as a challenge for the Multilingual Semantic Web
Jorge Gracia

Abstract
Iryna Gurevych

Collaborative Community Processes in the MSW area
Sebastian Hellmann

Overcoming Linguistic Barriers to the Multilingual Semantic Web
Graeme Hirst

Interoperability in the MSW
Chu-Ren Huang

Rendering lexical and other resources as Linked Data
Nancy Ide

Leveraging MSW research for practical applications: what can we do?
Antoine Isaac

Practical challenges for the multilingual Semantic Web
Christian Lieske

Localization in the SW: the status quo
John McCrae

Localization and interlinking of Semantic Web resources
Elena Montiel Ponsoda and Guadalupe Aguado de Cea

Multilingual Word Sense Disambiguation and the Semantic Web
Roberto Navigli

Abstract
Sergei Nirenburg

Under-resourced languages and the MSW
Laurette Pretorius

How to make new domains, new languages, and new information accessible in the Multilingual Semantic Web?
Aarne Ranta

Abstract
Kimmo Rossi

Sustainable, organisational Support for bridging Industry and the Multilingual Semantic Web
Felix Sasaki

Abstract
Gabriele Sauberer

The Translingual Web – A Challenge for Language and Knowledge Technologies
Hans Uszkoreit

Problems and Challenges Related to the Multilingual Access of Information in the Context of the (Semantic) Web
Josef van Genabith

Towards Conceptually Scoped LT
Jeroen van Grondelle

What is the current state of the Multilingual Web of Data?
Asunción Gómez Pérez & Daniel Vila Suero

Exploiting Parallel Corpora for the Semantic Web
Martin Volk

Working Groups

Bootstrapping a Translingual Web (Day 1)
Content Representation and Implementation (Day 1)
Interaction between NLP & SW Communities (Day 1)
Language Resources for the Semantic Web and vice versa (Day 2)
How to Improve Linked Data by the Addition of LRs (Day 2)
Parallel Corpora (Day 2)
Under-resourced Languages and Diversity (Day 3)
The Telic Semantic Web, Cross-cultural Synchronization and Interoperability (Day 3)
Collaboration, Cross-domain Adaptation of Terminologies and Ontologies (Day 4)
Multilingual Web Sites (Day 4)
Use Cases for High-Quality Machine Translation (Day 4)
Scalability to Languages, User Generated Content, Tasks (Day 4)

Talk by Kimmo Rossi
Non-academic Session
Demonstration Session
Final Session
MLW standards
Lexicalization of Linked Data
Scalability to Languages, Domains and Tasks
Use Cases for Precision-oriented HLT/MT
Under-resourced Languages
Conclusion

Participants
3 Overview of Talks

3.1 Some reflections on the IT challenges for a Multilingual Semantic Web

Guadalupe Aguado de Cea and Elena Montiel Ponsoda (Universidad Politécnica de Madrid)

License: Creative Commons BY-NC-ND 3.0 Unported license
© Guadalupe Aguado de Cea and Elena Montiel Ponsoda
1. Most important challenges/barriers/problems and pressing needs with respect to multilingual access to the Semantic Web (SW):
Many attempts have been made to provide multilinguality to the Semantic Web, by means of annotation properties in Natural Language (NL), such as RDFS or SKOS labels, and other lexicon-ontology models, such as lemon, but there are still many issues to be solved if we want to have a truly accessible Multilingual Semantic Web (MSW). Among the most relevant problems are the reusability of monolingual resources (ontologies, lexicons, etc.), the accessibility of multilingual resources hindered by many formats, the reliability of ontological sources, disambiguation problems, and the presentation of all this information to the end user in NL. Unless this NL presentation is achieved, the MSW will remain restricted to IT experts, and even for them with great dissatisfaction and disenchantment.

2. Why does the problem matter in practice? Which industry sectors or domains are concerned with the problem?
Considering Linked Data as a step forward from the original Semantic Web, providing the possibility of accessing all the information gathered in all the ontological resources should become one significant objective, if we want every user to "perform searches in their own language", as mentioned in the motivation of the Dagstuhl Seminar. The globalization of work has widened the scope of possible domains and sectors interested in Linked Data and a true MSW. From governmental, political, administrative and economic issues to medicine, chemistry, pharmaceutics, car makers and other industries alike, all would hop on the bandwagon of the MSW if it provided them with the suitable information needed for their businesses. As long as we cannot retrieve the answer to a question in NL, even if the information is available in DBpedia and other ontological and knowledge resources, it will be difficult to beat Google and to get the most out of LD and the SW, no matter how many "semantic" resources we have.

3. Which figures are suited to quantify the magnitude or severity of the problem?
It is difficult for us to quantify the problem in figures, but it is clear that we can miss the boat if this issue remains unsolved. In the last few years the mobile industry has advanced at a greater speed, maybe because there were more chances to make money.

4. Why do current solutions fall short?
At the moment, we have complex models that SW-illiterate users would need to implement, many unsolved technological issues, and a lack of agreement with respect to the ontological-lexical linguistic knowledge to be provided to end users when using the SW to improve their resources.

5. What insights do we need in order to reach a principled solution? What could a principled solution look like?
Focusing on certain aspects that can be agreed upon by many key sectors (researchers, developers, industry, end users), some relevant problems could be approached with the aim of delimiting the wishes, needs and resources available. A principled solution should be based on simplicity, usefulness, wide coverage, and reusability.

6. How can standardization (e.g. by the W3C) contribute?
It can contribute because participation is open to many of the sectors involved. If all sectors cooperate, dissemination and promotion can be achieved more easily. Getting other standardization committees involved (e.g. ISO TC 37) can also widen the scope and contribute to dissemination. But it is important to get industry professionals involved to make them aware of the possibilities they have to make the most of their products.
3.2 Accessibility to a Pervasive Web for the challenged people

Dimitra Anastasiou (University of Bremen)

License: Creative Commons BY-NC-ND 3.0 Unported license
© Dimitra Anastasiou
1. What are in your view the most important challenges/barriers/problems and pressing needs with respect to multilingual access to the Semantic Web?
One need is to make people believe in its importance. Although some projects and workshops (including the Dagstuhl Workshop) bring this topic forward, there is still need for more interesting projects and initiatives in the community. As Semantic Web technologies are used by many domains, and multilingualism is also an aspect taken into account by many stakeholders, many people regard the Multilingual Semantic Web (MSW) as a vague concept, so a clear description, specifications, or even a standard would make the MSW more prominent. At the moment I am most interested in the accessibility of the Web and the MSW to seniors and people with disabilities. Moreover, and in relation to the Web for challenged people, I am interested in the pervasive Web in Ambient Assisted Living (AAL), which goes beyond the Web present on a PC monitor and is present in the invisible technology of smart homes.

2. Why does the problem matter in practice? Which industry sectors or domains are concerned with the problem?
The aging phenomenon is a reality today: according to the World Population Aging report, the world average of the 65+ age group was 7.6% in 2010 and will be 8.2% in 2015. The European Commission suggests demographic and epidemiological research on aging and disability, predicting the size of the future aging population and acquiring information as input to planning. Industries (and some academic groups) are mostly concerned with AAL, but the community researching the Web technology used particularly there is very small. Moreover, multilingualism plays a secondary role, though it is very important, as seniors today are often not foreign-language speakers and have to communicate with technology (Web or not). Whereas health informatics, HCI, HRI, sensing and recognition play an important role, the Semantic Web and multilingual support are not taken into serious consideration.

3. Which figures are suited to quantify the magnitude or severity of the problem?
The Working Draft of "Web Accessibility for Older Users: A Literature Review" (http://www.w3.org/TR/wai-age-literature/) gives very interesting insights about Web design and development and the aspects affecting the elderly.

4. Why do current solutions fall short?
Because the limitations of challenged people vary significantly, they cannot easily be categorized into specific groups, so high customization of software and high learning effort are needed, which results in information overload. The technology is too expensive and not yet affordable. Moreover, it is also very complex, so easier-to-use and more user-friendly methods should be developed.

5. What insights do we need in order to reach a principled solution? What could a principled solution look like?
More initiatives are needed, including common projects, community groups and workshops in the fields of AAL, multimodality, the Semantic Web, and language technology. A principled solution would allow elderly persons to speak in their mother tongue to turn their coffee machine on and off or switch lights on and off. When they speak in their mother tongue, they do not feel digitally intimidated but more natural and trusting. Ontologies could help dialogue systems trigger predictable actions in AAL smart homes, e.g. turning off the oven when not in use or reminding a person to make a phone call.

6. How can standardization (e.g. by the W3C) contribute?
Cooperation with the W3C Web Accessibility Initiative (http://www.w3.org/WAI/) would be very useful. It has released the Web Content Accessibility Guidelines (http://www.w3.org/WAI/intro/wcag.php), User Agent Accessibility Guidelines, and Authoring Tool Accessibility Guidelines.
3.3 Multilingual Computation with Resource and Process Reuse

Pushpak Bhattacharyya (Indian Institute of Technology Bombay)

License: Creative Commons BY-NC-ND 3.0 Unported license
© Pushpak Bhattacharyya
1. Introduction
Multilingual computation is the order of the day and is needed critically for the realization of the Semantic Web dream. It stands to reason that work done for one language should help computation in another language. For example, if through the investment of resources we have been able to detect named entities in one language, we should be able to detect them in another language too, through much smaller investments such as transliteration. The idea of projection from one language to another is a powerful and potent one and merits deep investigation. In the seminar I would like to expound on projection for multilingual NLP.

2. Challenges/barriers/problems and pressing needs with respect to multilingual access to the Semantic Web
Resource constraint is the most important challenge facing multilingual access to the Semantic Web. Over the years, through conscious decisions, English has built foundational tools and resources for language processing. Examples are the Penn Treebank (http://www.cis.upenn.edu/~treebank/), PropBank, rule-based and statistical parsers (http://nlp.stanford.edu/software/lex-parser.shtml), WordNet (http://wordnet.princeton.edu), corpora with various kinds of annotation, and so forth. No language comes anywhere close to English in terms of lexical resources and tools.

3. Why does the problem matter in practice?
It is impossible to do NLP without adequate lexical resources and foundational tools. For example, nobody thinks of building a parser today for a language without first creating a treebank for the language – constituency or dependency – and then training a probabilistic parser on it. However, creating treebanks requires years of effort. Everything in the language technology sector needs lexical resources. Information Extraction, Machine Translation and Cross-Lingual Search are some examples. E-Governance – a domain dealing with the automatization of administrative processes of the government in a large, multilingual country like India – is a large consumer of language technology.

4. Which figures are suited to quantify the magnitude or severity of the problem?
Lexical resources are typically quantified by the amount of annotated data, and foundational tools by their precision and recall figures. For example, the famous SemCor corpus (http://www.gabormelli.com/RKB/SemCor_Corpus) of sense-annotated data has about 100,000 words marked with WordNet ids. On the tools side, the CLAWS POS tagger for English has over 97% accuracy.

5. Why do current solutions fall short?
It takes years to build high-quality lexical resources. Both linguistic expertise and computational dexterity are called for, and it is not easy to find people with both linguistic and computational acumen. Large monetary investment is also called for.

6. Principled Solution
Projection is the way to go. Reuse of resources and processes is a must. Over the years, in our work on word sense disambiguation involving Indian languages, we have studied how sense distributions can be projected from one language to another for effective WSD [47, 48, 50, 49]. The idea of projection has also been applied to POS tagging (Dipanjan Das and Slav Petrov, Unsupervised Part-of-Speech Tagging with Bilingual Graph-based Projections, best paper award at ACL 2011). We have also used it to learn named entities in one language from the NE-tagged corpora of another language. (A minimal sketch of such tag projection appears after this list.)

7. How can standardization (e.g. by the W3C) contribute?
For projection to work at all, resources and tools need to be standardized for input/output, storage, API and so on. For example, WordNet building activity across the world follows the standard set by the Princeton WordNet.
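As a rough illustration of the projection idea named above, the following minimal Python sketch projects POS tags from a tagged source sentence onto an untagged target sentence through word alignments. It is a sketch only, not the cited systems: the function, the toy sentences and the alignment pairs are all invented for illustration, and it assumes word alignments have already been computed by some aligner.

    # A minimal sketch of cross-lingual annotation projection, assuming
    # word alignments are already available; all data here is illustrative.
    def project_pos_tags(src_tags, tgt_len, alignments):
        """Project POS tags onto a target sentence of length tgt_len.

        alignments: list of (src_index, tgt_index) pairs from a word aligner.
        Returns a list of tags for the target words (None where unaligned).
        """
        tgt_tags = [None] * tgt_len
        for s, t in alignments:
            tgt_tags[t] = src_tags[s]  # copy the tag along the alignment link
        return tgt_tags

    # Toy example: English source, Hindi (romanized) target.
    src_tokens = ["John", "eats", "mangoes"]
    src_tags = ["NOUN", "VERB", "NOUN"]
    tgt_tokens = ["John", "aam", "khata", "hai"]
    alignments = [(0, 0), (2, 1), (1, 2)]
    print(project_pos_tags(src_tags, len(tgt_tokens), alignments))
    # -> ['NOUN', 'NOUN', 'VERB', None]

Real projection systems add noise handling on top of this copying step (alignment errors, one-to-many links, unaligned words), which is where the cited graph-based approaches come in.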
3.4 Multilingual Semantic Web and the challenges of Open Language Data

Nicoletta Calzolari (Istituto di Linguistica Computazionale, Pisa)

License: Creative Commons BY-NC-ND 3.0 Unported license
© Nicoletta Calzolari
Language Technology (LT) is a data-intensive field and major breakthroughs have stemmed from a better use of more and more Language Resources (LRs). LRs and Open/Shared Language Data are therefore a great topic! New approaches are needed, both for data and for metadata (LRs and meta-LRs). My topics are linked to the layer of LRs and language services that serve LT, and especially open information on LRs and on research results. How can Linked Data contribute?

1. The Language Resource dimensions
In the FLaReNet (http://www.flarenet.eu) Final Blueprint, the actions recommended for a strategy for the future of the LR field are organised around nine dimensions: a) Infrastructure, b) Documentation, c) Development, d) Interoperability, e) Coverage, Quality and Adequacy, f) Availability, Sharing and Distribution, g) Sustainability, h) Recognition, i) International Cooperation. Taken together, as a coherent system, these directions contribute to a sustainable LR ecosystem. The Multilingual Semantic Web has strong relations with many of these dimensions, esp. a), b), d), f), g).

2. Language Resources and the Collaborative framework
The traditional LR production process is too costly. A new paradigm is pushing towards open, distributed language infrastructures based on sharing LRs, services and tools. It is urgent to create a framework enabling effective cooperation of many groups on common tasks, adopting the paradigm of accumulation of knowledge that has been so successful in more mature disciplines such as biology, astronomy and physics. This requires the design of a new generation of multilingual LRs, based on open content interoperability standards [12]. The Multilingual Semantic Web may help in determining the shape of the LRs of the future, consistent with the vision of an open distributed space of sharable knowledge available on the web for processing (see [11]). It may be crucial to the success of such an infrastructure, critically based on interoperability, aimed at improving the sharing of LRs and accessibility to multilingual content. This will better serve the needs of language applications, enabling building on each other's achievements, integrating results, and having them accessible to various systems, thus coping with the need for more and more 'knowledge-intensive' large-size LRs for effective multilingual content processing. This is the only way to make a great leap forward.

3. Open Documentation on LRs
Accurate and reliable documentation of LRs is an undisputable need: documentation is the gateway to discovery of LRs and a necessary step towards promoting the data economy. LRs that are not documented virtually do not exist: initiatives able to collect and harmonise metadata about resources represent a valuable opportunity for the NLP community.

LRE Map: The LRE Map is a collaborative bottom-up means of collecting metadata on LRs from authors. It is an instrument for enhancing the availability of information about LRs, either new or already existing ones, and a way to show the current LR landscape and its trends. As a measuring tool for monitoring various dimensions of LRs across places and times, it helps highlight evolutionary trends in LR use and related development by cataloguing not only LRs in a narrow sense (i.e. language data), but also tools, standards, and annotation guidelines. The Map contributes to the promotion of a movement towards an accurate and massive documentation of LRs.
4. Open Language Resource Repositories
The rationale behind the need for Open LR Repositories is that the accumulation of massive amounts of (high-quality) multi-dimensional data about many languages is the key to fostering advancement in our knowledge about language and its mechanisms. We must be coherent and take concrete actions leading to the coordinated gathering – in a shared effort – of as much (processed/annotated) language data as we are able to produce. This initiative compares to the accumulation by astronomers and astrophysicists of huge amounts of observation data for a better understanding of the universe.

Language Library: The Language Library is an experiment – started around parallel/comparable texts processed by authors at LREC 2012 – in building a facility for gathering and making available the linguistic knowledge the field is able to produce, putting in place new ways of collaboration within the community. It is collaboratively built by the community, which provides and enriches LRs by annotating/processing language data and freely using them. The multi-layer and multi-language annotation of the same parallel/comparable texts should foster comparability and equality among languages. The Language Library is conceived as a theory-neutral space which allows several annotation philosophies to coexist, but we must exploit the sharing trend to initiate a movement towards creating synergies and harmonisation among annotation efforts that are now dispersed. In a mature stage the Library could focus on enhancing interoperability, encouraging the use of common standards and schemes of annotation. Interoperability should not be seen as a superimposition of standards but rather as the promotion of a series of best practices that might help other contributors to better access and easily reuse the annotation layers provided. The Language Library could be seen as the beginning of a big Genome project for languages, where the community collectively deposits/creates increasingly rich and multi-layered LRs, enabling a deeper understanding of the complex relations between different annotation layers/language phenomena.

5. Open Repositories of Research Results
Disclosing data and tools related to published papers is another, "simpler" addition to the Language Library, contributing to the promotion of open repositories of LR research results. Moreover, LRs must be not only searchable/shareable, but also "citable" (linked to dimension h) Recognition above).

6. Open Language Data (OpenLanD)
Open Language Data – the set of items 2 to 5 above – aims at offering the community a series of facilities for easy and broad access to information about LRs in an authoritative and trustworthy way. By investing in data reusability, OpenLanD can store the information as a collection of coherent datasets compliant with the Linked Data philosophy. The idea is that by linking these data among themselves and by projecting them onto the wider background of Linked Data, new and undiscovered relations can emerge. OpenLanD must be endowed with functionalities for data analytics and smart visualisation. OpenLanD differs from existing catalogues in the breadth and reliability of information due to a community-based approach. The information made available covers usages and applications of LRs, their availability, as well as related papers, individuals and organisations involved in their creation or use, and standards and best practices followed or implemented. OpenLanD avoids the problem of the rapid obsolescence of other catalogues by adopting a bottom-up approach to metadata population.
3.5 Multilingual Web Sites

Manuel Tomas Carrasco Benitez (European Commission)

License: Creative Commons BY-NC-ND 3.0 Unported license
© Manuel Tomas Carrasco Benitez
1. Abstract
Multilingual Web Sites (MWS) are web sites that contain multilingual parallel texts, i.e., texts that are translations of each other. For example, most of the European Institutions' sites are MWS, such as Europa (http://europa.eu). The main points of view are:
Users should expect the same multilingual behaviour when using different browsers and/or visiting different web sites.
Webmasters should be capable of quickly creating high-quality, low-cost MWS.
This is a position paper for the Dagstuhl Seminar on the Multilingual Semantic Web. Personal notes on this event can be found on the web (http://dragoman.org/dagstuhl).

2. Relevance
MWS are of great practical relevance, as there are very important portals with many hits; they are also very complex and costly to create and maintain: Europa is in 23 languages and contains over 8 million pages. Current multilingual web sites are applications incompatible with each other, so facilitating and enjoying this common experience entails standardisation. There is a Multilingual Web Sites Community Group at the W3C (http://www.w3.org/community/mws).

3. Points of View
a. User. From a user's point of view, the most common usage is monolingual, though a site might be multilingual; i.e., users are usually interested in just one language of the several available at the server. The language selection is just a barrier to getting the appropriate linguistic version. One also has to consider that some users might really be interested in several linguistic versions. It is vital to agree on common behaviours for users: browser-side (language button) and server-side (language page).
b. Webmaster. Webmaster refers to all aspects of the construction of MWS: author, translator, etc. The objective is the creation of high-quality, low-cost MWS. Many existing applications have some multilingual facilities and (stating the obvious) one should harvest the best techniques around. Servers should expect the same application programming interface (API). The first API could be just a multilingual data structure. The absence of this data structure means that each application has to craft this facility; having the same data structure means that servers (or other programs) would know how to process this data structure directly (a minimal sketch of such a data structure appears at the end of this section). It is a case of production of multilingual parallel texts: the cycle of the Authorship, Translation and Publication chain (ATP-chain; see the open architecture described at http://arxiv.org/pdf/0808.3889).
4. Wider context
Language vs. non-language aspects: one should differentiate between aspects that are language-specific and non-language-specific. For example, the API between CMS and web server is non-language-specific and should be addressed in a different forum.
Language as a dimension: as in Transparent Content Negotiation (TCN, http://tools.ietf.org/rfc/rfc2295.txt), one should consider language a dimension and extend the concept to other areas such as linked data. Feature negotiation as in TCN should also be considered.
Linguistic versions: the speed (available now or later) and the translation technique (human or machine translation) should be considered in the same model.
Unification: the multilingual web is an exercise in unifying different traditions looking at the same object from different angles and with different requirements. For example, the requirements for processing a few web pages are quite different from those for processing a multilingual corpus of several terabytes of data.

5. Multidiscipline map
Web technology proper
  Content management systems (CMS), related to authoring and publishing
  Multilingual web sites (MWS)
  Linked data, a form of multilingual corpora and translation memories
  Previous versions in time, a form of archiving (Memento – Adding Time to the Web; http://mementoweb.org)
Traditional natural language processing (NLP)
  Multilingual corpora, a form of linked data (Multilingual Dataset Format; http://dragoman.org/muset)
    Machine translation, for end users and for preprocessing for translators
  Source documents and tabular transformations, the same data in different presentations
Translation
  Computer-aided translation (CAT)
    Preprocessing, from corpora, translation memories or machine translation
  Computer-aided authoring, as a help to have better source text for translation
  Localisation
  Translation memories (TM; TMX 1.4b Specification, http://www.gala-global.org/oscarStandards/tmx/tmx14b.html), related to corpora and linked data
Industrial production of multilingual parallel publications
  Integration of the Authorship, Translation and Publishing chain (ATP-chain)
  Generation of multilingual publications
  Official Journal of the European Union (http://publications.europa.eu/official/index_en.htm)

6. Disclaimer
This document represents only the views of the author and does not necessarily represent the opinion of the European Commission.
3.6 The Multilingual Semantic Web and the intersection of NLP and Semantic Web

Christian Chiarcos (Information Sciences Institute, University of Southern California)

License: Creative Commons BY-NC-ND 3.0 Unported license
© Christian Chiarcos
The premise of the Dagstuhl seminar is the question of which problems we need to overcome in order to enhance multilingual access to the Semantic Web, and how these are to be addressed.

Ultimately, the Semantic Web in its present stage suffers from a predominance of resources originating in the Western hemisphere, with English as their primary language. Eventually, this could be overcome by providing translated and localized versions of resources in other languages, thereby creating a critical mass of foreign-language resources that is sufficient to convince potential non-English-speaking users to (a) employ these resources, and (b) develop their own extensions or novel resources that are linked to these. On a large scale, this can only be done automatically, comparable to, say, the conversion of the English Wikipedia into Thai (http://www.asiaonline.net/portal.aspx#ThaiLaunch). Unlike the translation of plain text, however, this translation requires awareness of the conceptual structure of a resource, and is thus not directly comparable to text-oriented Machine Translation. A related problem is that post-editing the translation results in a massive crowdsourcing approach (as conducted for the Thai Wikipedia) may be problematic, because most laymen will not have the required level of technical understanding.

Therefore, the task of translating (and localizing) Semantic Web resources requires a higher level of automated processing than comparable amounts of plain text. This is an active research topic, but pursued by a relatively small community. One possible issue here is that the NLP and Semantic Web communities are relatively isolated from each other (for example, the LREC conference, http://www.lrec-conf.org, lists 11 publications for the topic "Semantic Web" for 2012, 11 for 2010, and 16 for 2008; similarly, the counts of ACL Anthology contributions, http://aclweb.org/anthology, containing the word "ontology" are consistently low: 2008: 5, 2009: 8, 2010: 15, 2011: 3, 2012: 7; both conferences have between 500 and 1000 participants, so, in terms of paper-participant ratio, this line of research is underrepresented), so that synergies between them are limited. A consequence is that many potentially interested NLP people are relatively unaware of developments in the Semantic Web community, and, moreover, that they do not consider Semantic Web formalisms to be relevant to their research. This is not only a problem for the progress of the Multilingual Semantic Web, but also for other potential fields of overlap. In the appendix I sketch two of them.

In my view, both the NLP community and the Semantic Web community could benefit from small- to mid-scale events co-located with conferences of the other community (or joint seminars, such as this workshop), and this may help to identify fields of mutual interest, including, among other topics, the translation of Semantic Web resources. In at least two other fields, such convergence processes may already be underway, as sketched below.
Questionnaire

1. Challenges/problems and needs with respect to multilingual access to the Semantic Web
For languages that are under-represented in the Semantic Web, the initial barrier to creating resources in their own language and in accordance with their own culture is substantially higher than for English, where synergy effects with existing resources can be exploited in the development of novel resources. To provide these languages with a basic repository of SW resources, massive automated translation is required. This task is, however, closer to the traditional realm of NLP than to that of the SW. The SW subcommunity working in this direction is thus relatively small, and may benefit from closer ties to the NLP community. (Which may be of mutual interest to both sides, also beyond the problem of Semantic Web multilingualism, see the appendix.)

2. Why does the problem matter in practice? Which industry sectors or domains are concerned with the problem?
The situation is comparable to the development of NLP tools for less-resourced languages. Without a basic set of language- and culture-specific resources (say, a WordNet and a DBpedia/Wikipedia with sufficient coverage), there will be little interest to develop and to invest in Semantic Web applications. A plain translation is an important first step, but for semantic resources, there may be important culture-specific differences that need to be taken into consideration. These efforts can be crowd-sourced to a certain extent, but only if a certain level of knowledge is already available in order to convince contributors that this is an effort that pays off.

3. Which figures are suited to quantify the magnitude or severity of the problem?
As for the primary problem of attracting potentially interested NLP people, this can be illustrated by the small number of Semantic Web contributions to NLP conferences (and vice versa); see the publication counts cited above.
4. Why do current solutions fall short?
The NLP community and the SW community are relatively isolated from each other, and often not aware of developments in the other community. For example, a recent discussion on an NLP mailing list showed that occasionally RDF (as an abstract data model) is confused with RDF/XML (as one RDF linearization) and rejected because of the verbosity of this linearization, even though other, more compact and more readable linearizations exist (a small illustration follows after this list).

5. What insights do we need in order to reach a principled solution? What could a principled solution look like?
Co-located and/or interdisciplinary events. (Simply continue and extend the series of Multilingual Semantic Web and OntoLex workshops.) Interdisciplinary community groups.

6. How can standardization (e.g. by the W3C) contribute?
Standardization is actually a key issue here. The NLP community developed its own standards within the ISO, and succeeded in integrating different groups from NLP, computational linguistics and computational lexicography. Semantic Web standards, however, are standardized by the W3C. Even though, say, GrAF and RDF (see the appendix) are conceptually very close, the potential synergies have been realized only recently. If these standardization initiatives could be brought into closer contact with each other, natural convergence effects are to be expected.
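To make the point about linearizations in item 4 concrete, the following sketch (using the rdflib Python library; the vocabulary under http://example.org/ is made up for illustration) serializes the same two triples once as RDF/XML and once as Turtle, showing how much more compact and readable the latter linearization of the identical graph is.

    # Contrasting RDF/XML with the compact Turtle linearization of the same
    # graph; the example namespace and property names are purely illustrative.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.bind("ex", EX)
    g.add((EX.word_house, EX.translationOf, EX.word_Haus))
    g.add((EX.word_house, EX.language, Literal("en")))

    print(g.serialize(format="xml"))     # verbose RDF/XML linearization
    print(g.serialize(format="turtle"))  # compact, human-readable Turtle

The graph is the same in both cases; only the surface syntax differs, which is exactly the distinction that was lost in the mailing-list discussion described above.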
Appendix: Possible Future Convergences between Semantic Web and NLP

From the perspective of Natural Language Processing and Computational Linguistics, one of the developments I would expect for the next 5-10 years is the accelerating convergence of both disciplines, at least in certain aspects. On the one hand, this includes adopting Linked Data as a representation formalism for linguistic resources; on the other hand, it includes the improved integration of NLP tools and pipelines in Semantic Web applications. Both developments can be expected to continue for the next decade.
The Prospective Role of Linked Data in Linguistics and NLP

In the last 20 years, Natural Language Processing (NLP) has seen a remarkable maturation, evident, for example, from the shift of focus of shared tasks from elementary linguistic analyses over semantic analyses to higher levels of linguistic description (CoNLL shared tasks: 1999-2003 flat annotations such as NP bracketing, chunking, clause identification and named entity recognition; 2004-2009 dependency parsing and semantic role labelling; 2010-2012 pragmatics and discourse, i.e. hedge detection and coreference). To a large extent, this development was driven by the increased adoption of statistical approaches during the 1990s. One necessary precondition for this development was the availability of large-scale corpora, annotated for the phenomena under discussion, and for the development of NLP tools for higher levels of description (say, semantics or anaphoric annotation), the number and diversity of annotations available (and necessary) increased continually.

During the same period, corpus linguistics has developed into a major line of research in linguistics, partially supported by the so-called "pragmatic shift" in theoretical linguistics, when scholars recognized the relevance of contextual factors. The study of these context factors favored the application of corpora in linguistics at a broader scale, which can now be considered an established research paradigm in linguistics.

Taken together, both communities created increasingly diverse and increasingly large amounts of data whose processing and integration, however, posed an interoperability challenge. In response to this, the NLP community developed generic formalisms to represent linguistic annotations, lexicons and terminology, namely in the context of ISO TC37. As far as corpora are concerned, a standard, GrAF [44], has been published this year. So far, GrAF is poorly supported with infrastructure and maintained by a relatively small community. However, its future application can benefit from developments in the Linked Data community, where RDF provides a data model that is similar in philosophy and genericity, but that comes with a rich technological ecosystem, including database implementations and query languages – which are currently not available for GrAF. Representing corpora in RDF, e.g., using an RDF representation of GrAF, yields a number of additional benefits, including the uncomplicated integration of corpus data with other RDF resources, including lexical-semantic resources (e.g., WordNet) and terminology resources (e.g., GOLD). A comparable level of integration of NLP resources within a uniform formalism has not been achieved before, and to an increasing extent this potential is recognized by researchers in NLP and linguistics, as manifested, for example, in the recent development of a Linguistic Linked Open Data cloud (http://linguistics.okfn.org/llod).
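The following sketch illustrates, under strong simplifications, what representing an annotated corpus token in RDF could look like. It uses the rdflib Python library; the namespaces and property names are invented for this illustration and are not the actual GrAF RDF vocabulary or any established Linguistic LOD scheme.

    # A simplified sketch of corpus annotation as RDF: one token with a POS
    # tag and a sense link to a (hypothetical) WordNet synset URI. All
    # namespaces below are invented; real work would use an established
    # vocabulary (e.g. an RDF rendering of GrAF).
    from rdflib import Graph, Literal, Namespace, RDF

    CORPUS = Namespace("http://example.org/corpus/")
    ANNO = Namespace("http://example.org/annotation#")
    WN = Namespace("http://example.org/wordnet/synset/")

    g = Graph()
    token = CORPUS["doc1/token42"]
    g.add((token, RDF.type, ANNO.Token))
    g.add((token, ANNO.surfaceForm, Literal("bank")))
    g.add((token, ANNO.pos, Literal("NN")))
    # The sense link is what makes the corpus annotation queryable together
    # with other Linked Data resources such as lexical-semantic ones.
    g.add((token, ANNO.sense, WN["bank-noun-2"]))

    print(g.serialize(format="turtle"))

Once corpus tokens, lexicon entries and terminology live in one RDF graph, a single SPARQL query can traverse all of them, which is the integration benefit argued for above.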
3.7 The importance of Semantic User Profiles and Multilingual Linked Data

Ernesto William De Luca (University of Applied Sciences Potsdam)

License: Creative Commons BY-NC-ND 3.0 Unported license
© Ernesto William De Luca

1. Introduction
Today, people use more and more different web applications. They manage their bookmarks in social bookmarking systems, communicate with friends on Facebook (http://www.facebook.com/)
and use services like Twitter (http://twitter.com/) to express personal opinions and interests. Thereby, they generate and distribute personal and social information like interests, preferences and goals [68]. This distributed and heterogeneous corpus of user information, stored in the user model (UM) of each application, is a valuable source of knowledge for adaptive systems like information filtering services. These systems can utilize such knowledge for personalizing search results, recommending products, or adapting the user interface to user preferences. Adaptive systems are highly needed, because the amount of information available on the Web is increasing constantly, requiring more and more effort to be adequately managed by the users. Therefore, these systems need more and more information about users' interests, preferences, needs and goals, as precise as possible. However, the personal and social information stored in the distributed UMs usually exists in different languages (language heterogeneity), due to the fact that we communicate with friends all over the world.
Therefore, we believe that the integration of multilingual resources into a user model aggregation process, enabling the aggregation of information in different languages, leads to better user models and thus to better adaptive systems.

a. The Use of Multilingual Linked Data
Because the Web is evolving from a global information space of linked documents to one where both documents and data are linked, a set of best practices for publishing and connecting structured data on the Web has emerged, known as Linked Data. The Linked Open Data (LOD) project [6] is bootstrapping the Web of Data by converting existing available "open datasets" into RDF and publishing them. In addition, LOD datasets often contain natural language texts, which are important to link and explore data not only in a broad LOD cloud vision, but also in localized applications within large organizations that make use of linked data [3, 66].

The combination of natural language processing and Semantic Web techniques has become important in order to exploit lexical resources directly represented as linked data. One of the major examples is the WordNet RDF dataset [73], which provides concepts (called synsets), each representing the sense of a set of synonymous words [32]. It has a low level of concept linking, because synsets are linked mostly by means of taxonomic relations, while LOD data are mostly linked by means of domain relations, such as parts of things, ways of participating in events or socially interacting, topics of documents, temporal and spatial references, etc. [66].

An example of interlinking lexical resources like EuroWordNet [77] or FrameNet (http://framenet.icsi.berkeley.edu/) [2] to the LOD Cloud is given in [19, 33]. Both create a LOD dataset that provides new possibilities for the lexical grounding of semantic knowledge, and boosts the "lexical linked data" section of LOD by linking e.g. EuroWordNet and FrameNet to other LOD datasets such as WordNet RDF [73]. This kind of resource opens up new possibilities to overcome the problem of language heterogeneity in different user models and thus allows a better user model aggregation [20].

2. Requirements for a User-Oriented Multilingual Semantic Web
Based on the idea presented above, some requirements have to be fulfilled:
Requirement 1:Ontology-based profile aggregation.We need an approach to aggregate
information that is both application independent and application overarching.This
requires a solution that allows to semantically define relations and coherences between
different attributes of different UMs.The linked attributes must be easily accessible
by applications such as recommender and information retrieval systems.In addition,
similarity must be expressed in these defined relations.
Requirement 2: Integrating semantic knowledge. A solution for handling multilingual information to enrich user profiles is needed. Hence, methods must be developed that incorporate information from semantic data sources such as EuroWordNet and that aggregate complete profile information.
a. Multilingual Ontology-based Aggregation
For the aggregation of user models, the information in the different user models has to be linked to multilingual information (as Multilingual Linked Data), since we want to leverage this information for more precise and qualitatively better user modeling. These resources can be treated as one huge semantic profile that is used to aggregate user models on the basis of multilingual information.
Figure 1 describes the general idea. The goal is to create one big semantic user profile containing all information from the user's three profiles, with the data interconnected. The first step is to add the multilingual information to the data contained in the different user models. This yields a first model in which the same data are linked together through the multilingual information.
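A minimal sketch of this first step, with hypothetical data and namespaces: attributes from two applications, labelled in different languages, are joined because both labels point to the same interlingual concept (in the spirit of EuroWordNet's Interlingual Index).

```python
# A minimal sketch, assuming hypothetical profile data: attributes from two
# user models are merged because their labels share an interlingual concept.
from collections import defaultdict
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()
# Application A stores an interest labelled in English, application B in
# Dutch; both labels are linked to the same interlingual concept.
g.add((EX.profileA_interest, EX.hasLabel, Literal("drive", lang="en")))
g.add((EX.profileB_interest, EX.hasLabel, Literal("rijden", lang="nl")))
g.add((EX.profileA_interest, EX.denotesConcept, EX.ili_drive))
g.add((EX.profileB_interest, EX.denotesConcept, EX.ili_drive))

# Attributes pointing to the same concept are merged in the aggregated profile.
by_concept = defaultdict(list)
for attr, _, concept in g.triples((None, EX.denotesConcept, None)):
    by_concept[concept].append(attr)
for concept, attrs in by_concept.items():
    if len(attrs) > 1:
        print(f"aggregate {attrs} under {concept}")
```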
b. Integrating Semantic Knowledge
The second step is to add links between data that are not linked through the multilingual information. The target is a semantic user model in which data are connected not only at the language level but also at a more semantic, similarity-based level (see Figure 2). The aggregation of information into semantic user models can be performed similarly to the approach described in [4], using components that mediate between the different models and recommendation frameworks that support semantic link prediction, like [69]. The combined user model should be stored in a commonly accepted ontology, like [37], so that the information can be shared with different applications.
With such a semantic user model that overcomes language barriers, adaptive systems have more information about the user and can use these data to adapt better to the user's preferences.
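As a stand-in for the link-prediction frameworks mentioned above, the following toy sketch scores attribute pairs by a common-neighbours count over the aggregated profile graph; the graph, the score, and the threshold are all illustrative assumptions.

```python
# A minimal sketch of the second step: propose links between attributes that
# share no multilingual label, using a simple common-neighbours score as a
# stand-in for a full semantic link-prediction framework such as [69].
from collections import defaultdict

edges = [  # hypothetical aggregated-profile graph
    ("cycling", "sport"), ("mountain_biking", "sport"),
    ("cycling", "outdoors"), ("mountain_biking", "outdoors"),
    ("opera", "music"),
]
nbrs = defaultdict(set)
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)

def common_neighbours(x: str, y: str) -> int:
    return len(nbrs[x] & nbrs[y])

score = common_neighbours("cycling", "mountain_biking")  # -> 2
if score >= 2:  # threshold is an arbitrary choice for this sketch
    print("propose semantic link: cycling <-> mountain_biking")
```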
3. Conclusions
Analyzing the problems described above, we believe that more contextual information about users is needed, enabling a context-sensitive weighting of the information used for profile enrichment. The increasing popularity of Social Semantic Web approaches and standards like FOAF (http://www.foaf-project.org/) can be one important step in this direction. On the other hand, multilingual semantic datasets themselves (for example, multilingual linked data) have to be enriched with more meta-information about the data. General quality and significance information, such as the prominence of nodes and weighted relations, can help semantic algorithms better compute the importance of paths between nodes. Enriching the quality of user profiles and the multilingual semantic representation of data can be helpful, because the two sides cover different needs required for the enhancement and consolidation of a multilingual Semantic Web.
Figure 2 Integrating semantic knowledge about multilingual dependencies with the information stored in the user models. (Figure omitted: it shows attributes of several user models joined into one aggregated profile via an RDF/OWL EuroWordNet-style Interlingual Index relating, e.g., English move/drive/ride, Dutch gaan/berijden/rijden, Italian guidare/cavalcare/andare, and Spanish conducir/mover/cabalgar.)
3.8 Shared Identifiers and Links for the Linguistic Linked Data Cloud
Gerard de Melo (ICSI Berkeley)
License
Creative Commons BY-NC-ND 3.0 Unported license
© Gerard de Melo
The Web of Data opens up new opportunities in many areas of science and technology, including linguistics and library science. Common data formats and protocols have made it easier than ever to work with information from different sources simultaneously. The true potential of Linked Data, however, can only be appreciated when shared identifiers and extensive cross-linkage engender interconnectedness across different data sets.
Examples of shared identifiers include those based on WordNet and Wikipedia. The UWN/MENTA multilingual knowledge base, for instance, integrates millions of words and names from different languages into WordNet and also uses Wikipedia-based identifiers [?]. This means that one buys into ecosystems that already carry a range of valuable pre-existing assets. WordNet, for instance, comes with sense-annotated corpora and mappings to other resources. Wikipedia-based identifiers are also used by DBpedia [1], YAGO [41], and numerous other Linked Data providers.
Lexvo.org's linguistic identifiers are another example. Consider a book written in a little-known, under-resourced language. If its bibliographic entry relies on identifiers from Lexvo.org, one can easily look up where that language is spoken and which other libraries carry significant numbers of books in the same language. Lexvo.org also serves as an example of cross-linkage between resources: the service provides a language hierarchy [21] that connects identifiers based on the ISO 639 language standards to relevant entries in DBpedia, WordNet, and several other data sets.
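For example, a Lexvo.org identifier embeds the ISO 639-3 code in its URI and can, assuming the service's usual content negotiation still applies, be dereferenced as RDF. A minimal sketch with rdflib:

```python
# A minimal sketch of dereferencing a Lexvo.org language identifier as RDF;
# requires network access, and assumes the service's content negotiation.
from rdflib import Graph

lang_uri = "http://lexvo.org/id/iso639-3/eng"  # ISO 639-3 code for English

g = Graph()
g.parse(lang_uri)  # fetches and parses the RDF served for this identifier
print(len(g), "triples about", lang_uri)
```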
The recent LINDA algorithm [7] shows how such links between identifiers can be discovered automatically in a scalable way. The algorithm was designed for the Hadoop distributed computing platform, which means that even very large crawls of the Web of Linked Data, with billions of triples, can be supplied as input. A new data set can therefore automatically be linked to many other data sets.
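A much-simplified sketch of this style of scalable link discovery, reduced to a single map/reduce-like pass in plain Python: entities from different datasets are grouped by a normalized label (the "map" key), and candidate owl:sameAs pairs are emitted per group (the "reduce" step). The records and the label-equality criterion are toy assumptions; the real LINDA algorithm is considerably more sophisticated.

```python
# A simplified, MapReduce-flavoured sketch of candidate link discovery:
# group entities by normalized label, then pair up members across datasets.
from collections import defaultdict
from itertools import combinations

entities = [  # hypothetical (dataset, URI, label) records
    ("dbpedia", "http://dbpedia.org/resource/Berlin", "Berlin"),
    ("geo", "http://sws.geonames.org/2950159/", "berlin"),
    ("dbpedia", "http://dbpedia.org/resource/Paris", "Paris"),
]

groups = defaultdict(list)
for dataset, uri, label in entities:
    groups[label.strip().lower()].append((dataset, uri))  # "map" phase

for key, members in groups.items():                        # "reduce" phase
    for (d1, u1), (d2, u2) in combinations(members, 2):
        if d1 != d2:
            print(f"candidate owl:sameAs: {u1} <-> {u2}")
```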
In conclusion, there are both incentives and tools for us to connect the data sets we build and use. As a community, we should seek to identify and support identifier schemes that can serve as de facto standards. Publishers of linguistic data are strongly encouraged to link their resources to other existing data sets, e.g., in the rapidly growing cloud of Linguistic Linked Data. These efforts will lead to a much more useful Web of Linked Data.
3.9 Abstract
Thierry Declerck (DFKI Saarbrücken)
License
Creative Commons BY-NC-ND 3.0 Unported license
© Thierry Declerck
1. What are, in your view, the most important challenges/barriers/problems and pressing needs with respect to multilingual access to the Semantic Web?
There is a (correct) statement that most knowledge is conveyed by human language, and therefore a criticism that the Semantic Web (I consider here mainly its LOD/LD instantiation) contains only structured, abstract knowledge representation. In response to this criticism, our work can stress that language processing has to structure language data too, and that one of our tasks would be to represent structured language data in the same way as knowledge objects, and to interlink the two more efficiently than has been done in the past, for example in SIMPLE/PAROLE or the Generative Lexicon, thus linking language data "in use" with knowledge data "in use".
2. Why does the problem matter in practice? Which industry sectors or domains are concerned with the problem?
The approach sketched under point 1 would be deployed in a multilingual fashion. If multilingual data is successfully attached to knowledge objects, then multilingual and cross-lingual retrieval of knowledge becomes feasible, based not (only) on machine translation, but rather on multilingual equivalents found linked to knowledge objects. In the end, not only knowledge of the world can be retrieved, but also knowledge of the words (or language) associated with that knowledge of the world. This knowledge of the language would be partial (no full grammar is to be expected), but it can serve in many applications.
3. Which figures are suited to quantify the magnitude or severity of the problem?
I cannot answer this question concretely. I also do not know whether there is a real "problem". We could go on as we are doing now (searching Google or the like, using domain-specific repositories, using question answering systems for accessing knowledge in text, etc.), but I expect a gain in efficiency in many natural-language-based applications dealing with the treatment of knowledge: semantic annotation, semantic disambiguation, information extraction, summarization, all in multi- and cross-lingual contexts. Terminology should also benefit from this approach (linking multilingual linguistic linked data with linked data), offering a better harmonization of the domain-specific terms used in various languages, while referring to established terms used in the LD/LOD.
4. Why do current solutions fall short?
Well: all the natural language expressions available in knowledge objects are not (yet) available in a structured form reflecting the knowledge of language. The linking of conceptual knowledge and language is therefore done in a non-equilibrated manner: structured data on the one side and analysed strings on the other.
5. What insights do we need in order to reach a principled solution? What could a principled solution look like?
See the comment under point 1.
6. How can standardization (e.g. by the W3C) contribute?
By giving a consensual view on the representation of the various types of knowledge and on ways to integrate them, whether by merging (OWL?) or by mapping/linking (SKOS, lemon-LMF).
My possible contribution to the workshop: describing the potential "LabelNet" that could result from generalising the linking of structured language knowledge to domain knowledge, i.e. generalizing the use of certain words/expressions (phrases, clauses, etc.) so that labels (or linguistically described terms) can be re-used in different knowledge contexts. There is also a specific domain I am working on, besides finance (XBRL; MFO) and radiology (RADLEX): Digital Humanities, more specifically two classification systems for tales and related genres. There I use the Thompson Motif Index and the Aarne-Thompson-Uther type index of tales, which I have transformed into explicit taxonomies. We are also currently working on representing the labels of such taxonomies in LMF/lemon. I could present current work in any of these three domains, if wished.
3.10 Supporting Collaboration on the Multilingual Semantic Web
Bo Fu (University of Victoria,British Columbia,Canada)
License
Creative Commons BY-NC-ND 3.0 Unported license
© Bo Fu
In relation to realising cross-lingual data access on the multilingual semantic web, particularly through the use of mappings, a lack of collaboration support in the current research field appears to be an important problem that is yet to be addressed.
One of the best examples of collaboration on the web during the past decade is Wikipedia, which has successfully demonstrated the value and importance of collaboratively building domain knowledge in a wide range of subject matters. Similarly, on the semantic web, knowledge bases (i.e. ontologies and other formal specifications), regardless of their representations or syntaxes, are the wisdom of communities and likely to involve the effort of individuals and groups from many different backgrounds. Given these characteristics, it is thus important to provide the necessary support for the collaborations that take place during various stages of work on the semantic web.
In recent years, we have seen ontology editors integrate collaboration features. For instance, WebProtégé [76] is designed to support the collaborative ontology editing process by providing an online environment for users to edit, discuss, and annotate ontologies. This trend towards collaboration support is not yet evident in other semantic web research fields. For example, research in ontology mapping generation and evaluation has to date focused on developing and improving algorithms, and little attention has been paid to supporting the collaborative creation and evaluation of mappings.
Proceeding forward, one of the challenges for the multilingual semantic web is to design and develop collaboration features for tools and services, in order for them to
- support social interactions around the data (data here can be any variable related to applications on the semantic web, e.g. the results of ontology localisation, ontology mapping, or the evaluations of such results), so that a group of collaborators working on the same dataset can provide commentary and discuss relevant implications on common ground;
- engage a wider audience and provide support for users to share and publish their findings, so that information is appropriately distributed for group decision making;
- support long-term use by people with distinct backgrounds and different goals, so that personal preferences can be fully elaborated; and
- enhance decision making by providing collaborative support from the beginning of the design process, so that collaborative features are included in the design of tools and services rather than being developed as an afterthought.
3.11 Cross-lingual ontology matching as a challenge for the
Multilingual Semantic Webmasters
Jorge Gracia (Universidad Politécnica de Madrid)
License
Creative Commons BY-NC-ND 3.0 Unported license
© Jorge Gracia
Recently, the Semantic Web has experienced significant advancements in standards and techniques, as well as in the amount of semantic information available online. Nevertheless, mechanisms are still needed to automatically reconcile information when it is expressed in different natural languages on the Web of Data, in order to improve access to semantic information across language barriers. In this context several challenges arise [34], such as: (i) ontology translation/localization, (ii) cross-lingual ontology mappings, (iii) representation of multilingual lexical information, and (iv) cross-lingual access and querying of linked data. In the following we focus on the second challenge: the necessity of establishing, representing, and storing cross-lingual links among semantic information on the Web. In fact, in a "truly" multilingual Semantic Web, semantic data with lexical representations in one natural language would be mapped to equivalent or related information in other languages, thus making navigation across multilingual information possible for software agents.
Dimensions of the problem
The issue of cross-lingual ontology matching can be explored along several dimensions:
1. Cross-lingual mappings can be established at different knowledge representation levels, each of them requiring its own mapping discovery/representation methods and techniques: (i) the conceptual level (links are established between ontology entities at the schema level), (ii) the instance level (links are established between the data underlying the ontologies), and (iii) the linguistic level (links are established between lexical representations of ontology concepts or instances).
2. Cross-lingual mappings can be discovered at runtime or offline. Owing to the growing size and dynamic nature of the Web, it is unrealistic to conceive a Semantic Web in which all possible cross-lingual mappings are established beforehand. Thus, scalable techniques that dynamically discover cross-lingual mappings on demand from semantic applications have to be investigated. Nevertheless, one can imagine application scenarios (in restricted domains, for a restricted number of languages) in which computing and storing mappings for later reuse is a viable option. In that case, suitable ways of storing and representing cross-lingual mappings become crucial. Mappings computed at runtime could also be stored and made available online, thus configuring a sort of pool of cross-lingual mappings that grows over time. Such online mappings should follow the linked data principles to favour their later access and reuse by other applications.
3. Cross-lingual links can be discovered either by projecting the lexical content of the mapped ontologies into a common language (one of the languages of the aligned ontologies, or a pivot language), e.g. using machine translation, or by comparing the different languages directly by means of cross-lingual semantic measures (e.g., cross-lingual explicit semantic analysis [74]). Both avenues have to be explored, compared, and possibly combined; a sketch of the first strategy follows this list.
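A minimal sketch of the first strategy, projecting labels into a pivot language and comparing them monolingually: the translate() stub and the string-overlap measure below are placeholders for a real MT system and a proper monolingual similarity measure.

```python
# A minimal sketch of translation-based cross-lingual label matching.
from difflib import SequenceMatcher

def translate(label: str, source: str, target: str) -> str:
    # Hypothetical MT call; replace with a real service or model.
    return {"Fluss": "river"}.get(label, label)

def label_similarity(label_a: str, lang_a: str,
                     label_b: str, lang_b: str, pivot: str = "en") -> float:
    # Project both labels into the pivot language, then compare them.
    a = translate(label_a, lang_a, pivot)
    b = translate(label_b, lang_b, pivot)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(label_similarity("Fluss", "de", "river", "en"))  # -> 1.0
```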
What is needed?
In summary, research has to be done on several aspects:
Cross-lingual ontology matching. Current ontology matching techniques could be extended with multilingual capabilities, and novel techniques should be investigated as well.
Multilingual semantic measures. The novel cross-lingual ontology matching techniques mentioned above have to be grounded in measures capable of evaluating similarity or relatedness between (ontology) entities documented in different natural languages.
Scalability of matching techniques. Although the scalability requirement is not inherent to the multilingual dimension of ontology matching, multilingualism exacerbates the problem, owing to the higher degree of heterogeneity it introduces and the possible explosion of compared language pairs.
Cross-lingual mapping representation. Do current techniques for representing lexical content and ontology alignments suffice to cover multilingualism? Novel ontology-lexicon representation models [55] have to be explored for this task.
3.12 Abstract
Iryna Gurevych (Technical University Darmstadt)
License
Creative Commons BY-NC-ND 3.0 Unported license
© Iryna Gurevych
We first outline a set of research directions for multilingual content processing on the Web, such as aggregating the knowledge in multiple documents, assessing the quality of information, engineering complex multilingual Web-based systems, and scaling machine-learning-based approaches to new tasks and domains. Then, we present some research initiatives at UKP Lab with immediate relevance to these research directions.
Research directions
The volume of text-based data on the Web, especially user-generated content in many languages, has been growing continuously. Typically, there are multiple documents of various origins describing individual facets of the same event. This entails redundancy, resulting in the need to aggregate the knowledge distributed across multiple documents. It involves tasks such as redundancy removal, information extraction, information fusion, and text summarization. Here, the intention of the user and the current interaction context play an important role.
Another fundamental issue on the Web is assessing the quality of information. A vast portion of the content is user-generated and thus not subject to editorial control, so judging its quality and credibility is an essential task. In this area, text classification methods have been applied and combined with social media analysis. Since information on the Web may quickly become outdated, advanced inference techniques should be put to use to detect outdated content and controversial statements found in the documents.
Due to advances in ubiquitous computing and the penetration of small computing devices into everyday life, the integration of multiple knowledge processing techniques operating across different modalities and different languages on huge amounts of data has become an important issue, one whose challenges must be addressed in software engineering. It requires standardization of the interface specifications of individual components; ensuring the scalability of approaches to large volumes of data, large user populations, and real-time processing; and solutions for the technical integration of multiple components into complex systems.
Current multilingual language processing systems make extensive use of machine learning. However, training data is lacking for many tasks and domains. To alleviate this problem, the use of semi-supervised and unsupervised techniques is an important research direction. For supervised settings, utilizing crowdsourcing and human computation, such as Amazon Mechanical Turk, games with a purpose, or wiki-based platforms for knowledge acquisition, is a current research direction [36]. Research is needed to find ways of efficiently acquiring the needed high-quality training data under time and budget constraints, depending on the properties of the task.
Research Initiatives at UKP Lab
The above research directions have been addressed in several projects by UKP Lab at the Technical University Darmstadt, described below.
Sense-linked lexical-semantic resources. We present a freely available, standardized, large-scale lexical-semantic resource for multiple languages called UBY (http://www.ukp.tu-darmstadt.de/uby) [26, 35]. UBY currently combines collaboratively constructed and expert-constructed resources for English and German. It is modeled according to the ISO standard Lexical Markup Framework (LMF). UBY contains standardized versions of WordNet, GermaNet, FrameNet, VerbNet, Wikipedia, Wiktionary, and OmegaWiki. A subset of the resources in UBY is linked at the word-sense level, yielding mono- and cross-lingual sense alignments between resources [25, 58, 65]. The UBY database can be accessed by means of a Java-based API available at Google Code (http://code.google.com/p/uby) and used for knowledge-rich language processing, such as word sense disambiguation.
Multilingual processing based on the Unstructured Information Management Architecture (UIMA). We put a strong focus on component-based natural language processing (NLP) systems. The resulting body of software is called the Darmstadt Knowledge Processing Software Repository (DKPro) [24]. Parts of DKPro have already been released to the public as open-source products, e.g.:
DKPro Core (http://code.google.com/p/dkpro-core-asl/) is an integration framework for basic linguistic preprocessing. It wraps a number of NLP tools and makes them usable via a common API based on the Apache UIMA framework. From the user perspective, the aim of DKPro Core is to provide a set of components that work off-the-shelf, but it also provides parameter-setting options for the wrapped tools. The roadmap for DKPro Core includes: packaging models for the different tools (parser, tagger, etc.) so that they can be logically addressed by name and version and downloaded automatically; covering more tagsets and languages; logically addressing corpora and resources by name and version and downloading them automatically; and providing transparent access to the Hadoop HDFS so that experiments can be deployed on a Hadoop cluster.
DKPro Lab (http://code.google.com/p/dkpro-lab/) is a framework to model parameter-sweeping experiments as well as experiments that require complex setups which cannot be modeled as a single UIMA pipeline. The framework is lightweight, provides support for declaratively setting up experiments, and integrates seamlessly with Java-based development environments. To reduce the computational effort of running an experiment with many different parameter settings, the framework uses dataflow dependency information to maintain and reuse intermediate results. DKPro Lab structures the experimental setup with three main goals: facilitating reproducible experiments; structuring experiments for better understandability; and structuring experiments into a workflow that can potentially be mapped to a cluster environment. The latter, in particular, is currently the focus of our attention.
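The following generic sketch illustrates the underlying idea of reusing intermediate results across parameter settings: cache each step's output under a hash of its name and parameters, so that sub-configurations shared between runs are computed only once. This is an illustration of the concept only, not DKPro Lab's actual API; all names are invented.

```python
# A generic sketch of dataflow-dependency caching for parameter sweeps.
import hashlib
import json
import os
import pickle

CACHE_DIR = "cache"

def run_cached(step_name, params, fn):
    # Key the cache entry by a stable hash of the step name and parameters.
    key = hashlib.sha1(
        json.dumps([step_name, params], sort_keys=True).encode()
    ).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(path):          # reuse a previously computed result
        with open(path, "rb") as f:
            return pickle.load(f)
    result = fn(**params)             # compute once, then persist
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# e.g. a preprocessing step shared across all classifier settings in a sweep
# (the lambda is a toy stand-in for real tokenisation):
tokens = run_cached("tokenize", {"corpus": "train.txt", "lang": "en"},
                    lambda corpus, lang: corpus.split())
```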
The DKPro software collection has been employed in many NLP projects. It yielded excellent performance in a series of recent language processing shared tasks and evaluations, such as:
- the Wikipedia Quality Flaw Prediction Task in the PAN Lab at CLEF 2012 [30];
- the Semantic Textual Similarity Task at SemEval-2012, held at *SEM (the First Joint Conference on Lexical and Computational Semantics) [8];
- the Cross-lingual Link Discovery Task (CrossLink) at the 9th NTCIR Workshop (NTCIR-9), Japan [51].
3.13 Collaborative Community Processes in the MSW area
Sebastian Hellmann (Leipzig University)
License
Creative Commons BY-NC-ND 3.0 Unported license
© Sebastian Hellmann
This presentation introduces three major data pools that have recently been made freely available as Linked Data through collaborative community processes: (1) the DBpedia Internationalization committee, which is concerned with the extraction of RDF from the language-specific Wikipedia editions; (2) a configurable extractor based on DBpedia, which is able to extract information from all language editions of Wiktionary with manageable effort; and (3) the Working Group on Open Linguistic Data, an Open Knowledge Foundation group with the goal of converting open linguistics data sets to RDF and interlinking them. The presentation highlights and stresses the role of open licenses and RDF for the sustenance of such pools. It also provides a short update on the recent progress of NIF (Natural Language Processing Interchange Format) in the LOD2 EU project. NIF 2.0 will have many new features, including interoperability with the above-mentioned data pools as well as major RDF vocabularies such as OLiA, lemon, and NERD. Furthermore, NIF can be used as an exchange format for Web annotation tools such as AnnotateIt, since it uses robust Linked-Data-aware identifiers for Website annotation.
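As a small illustration, a NIF-style annotation of a text span might look as follows; the vocabulary terms and the offset-based fragment identifier follow my reading of the NIF Core ontology and should be checked against the NIF 2.0 specification, and the document URI is invented.

```python
# A minimal sketch of a NIF-style annotation: a substring of a document is
# given a URI with character offsets and described with assumed NIF terms.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

NIF = Namespace(
    "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
)

text = "Berlin is a city."
base = "http://example.org/doc1"           # hypothetical document URI
span = URIRef(f"{base}#char=0,6")          # offsets for the substring "Berlin"

g = Graph()
g.bind("nif", NIF)
g.add((span, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((span, NIF.endIndex, Literal(6, datatype=XSD.nonNegativeInteger)))
g.add((span, NIF.anchorOf, Literal(text[0:6])))
print(g.serialize(format="turtle"))
```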
3.14 Overcoming Linguistic Barriers to the Multilingual Semantic Web
Graeme Hirst (University of Toronto)
License
Creative Commons BY-NC-ND 3.0 Unported license
© Graeme Hirst
Sometime between the publication of the original Semantic Web paper by Berners-Lee, Hendler, and Lassila [5] and Berners-Lee's "Linked Data" talk at TED (The Next Web, TED Conference, http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html), the vision of the Semantic Web contracted considerably. Originally, the vision was about "information"; now it is only about data. The difference is fundamental. Data has an inherent semantic structure and an a priori interpretation. Other kinds of information need not. In particular, information in linguistic form gains an interpretation only in context, and only for a specific reader or community of readers.
I do not mean to criticize the idea of restricting our Semantic Web efforts to data pro tem. It is still an extremely challenging problem, and the results will still be of enormous utility. At the same time, however, we need to keep sight of the broader goal, and we need to make sure that our efforts to solve the smaller problem are not just climbing trees to reach the moon.
In the original vision, "information is given well-defined meaning" (p. 37), implying that it didn't have "well-defined meaning" already. Of course, the phrase "well-defined meaning" lacks well-defined meaning, but Berners-Lee et al. are not saying that information on the non-Semantic Web is meaningless; rather, what they want is precision and lack of ambiguity in the semantic layer. In the case of linguistic information, this implies semantic interpretation into a symbolic knowledge representation language of the kind they talk about, which is a goal that exercised, and ultimately defeated, research in artificial intelligence and natural language understanding from the 1970s through to the mid-1990s.
One of the barriers that this earlier work ran into was the fact that traditional symbolic knowledge representations – the kind that we still see for the Semantic Web – proved to be poor representations for linguistic meaning, and hierarchical ontologies proved to be poor representations for the lexicon [39]. Near-synonyms, for example, form clusters of related and overlapping meanings that do not admit a hierarchical differentiation. And quite apart from lexical issues, any system for representing linguistic information must have the expressive power of natural language; we have nothing anywhere close to this as yet.
All these problems are compounded when we add multilinguality as an element. For example, different languages will often present different and mutually incompatible sets of word senses, as each language lexicalizes somewhat different categorizations or perspectives of the world, and each language has lexical gaps relative to other languages and to the categories of a complete ontology. It is rare even for words that are regarded as translation equivalents to be completely identical in sense; more usually, they are merely cross-lingual near-synonyms [39].
And then we have the problem of querying linguistic information on the Semantic Web, again in a natural language. Much of the potential value of querying the Semantic Web is that the system may act on behalf of the user, finding relevance in, or connections between, texts that go beyond anything the original authors of those texts intended. That is, it could take a reader-based view of meaning: "What does this text mean to me?" [38]. The present construal of the Semantic Web, however, is limited to a writer-based view of meaning. That is, semantic mark-up is assumed to occur at page-creation time, either automatically or semi-automatically with the assistance of the author [5]; a page has a single, fixed semantic representation that (presumably) reflects its author's personal and linguistic worldview and which therefore does not necessarily connect well with queries to which the text is potentially relevant.
But that's not to say that author-based mark-up isn't valuable, as many kinds of natural language queries take the form of intelligence gathering: "What are they trying to tell me?" [38]. Rather, we need to understand its limitations, just as we understand that the query "Did other people like this movie?" is an imperfect proxy for our real question, "Will I like this movie?".
This gives us a starting point for thinking about next steps for a monolingual or multilingual Semantic Web that includes linguistic information. We must accept that it will be limited, at least pro tem, to a static, writer-based view of meaning. Also, any semantic representation of text will be only partial, and will be concentrated on facets of the text for which a representation can be constructed that meets Berners-Lee et al.'s criterion of relative precision and lack of ambiguity, and for which some relatively language-independent ontological grounding has been defined. Hence, the representation of a text may be incomplete, patchy, and heterogeneous, with different levels of analysis in different places [40].
We need to recognize that computational linguistics and natural language processing have been enormously successful since giving up the goal of high-quality knowledge-based semantic interpretation 20 years ago. Imperfect methods based on statistics and machine learning frequently have great utility. Thus there needs to be space in the multilingual Semantic Web for these kinds of methods and the textual representations that they imply – for example, some kind of standardized lexical or ontolexical vector representation. We should expect to see symbolic representations of textual data increasingly pushed to one side as cross-lingual methods are developed in distributional semantics [29] and semantic relatedness. These representations don't meet the "well-defined meaning" criterion of being overtly precise and unambiguous, and yet they are the representations most likely to be at the centre of the future multilingual Semantic Web.
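To make the contrast concrete, here is a toy sketch of the kind of representation meant here: words from two languages embedded in one shared vector space and compared by cosine similarity, rather than through a discrete, "well-defined" symbolic sense inventory. The vectors are fabricated for illustration.

```python
# A toy sketch of cross-lingual distributional relatedness in a shared space.
import numpy as np

space = {  # fabricated embeddings for illustration only
    ("en", "river"): np.array([0.70, 0.20, 0.10]),
    ("de", "Fluss"): np.array([0.68, 0.22, 0.12]),
    ("en", "bank"):  np.array([0.10, 0.80, 0.30]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(space[("en", "river")], space[("de", "Fluss")]))  # high
print(cosine(space[("en", "river")], space[("en", "bank")]))   # lower
```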