From raw data to knowledge representation - methodologies for user-interactive
acquisition and processing of multilingual terminology


Barbara Dragsted and Benjamin Kjeldsen

Copenhagen Business School, Department of English

(bd.eng@cbs.dk, bek.eng@cbs.dk)


ABSTRACT

The Copenhagen Business School, Faculty of Modern Languages, recently
participated in SENSUS, an EU Language Engineering project the purpose of which
was to provide new technology aiming at analysing multilingual data and breaking
down language barriers between law enforcement agencies in Europe. The NLP
technologies developed and enhanced by the project rely heavily on machine
translation and domain-specific terminology. The CBS project contribution focused
on interpreting, communicating and operationalising the needs of human translators,
MT software developers and the intelligence community, all of whom formed part of
the project. This task was solved by triangulating the positions of the three groups
around a common interface for collecting, structuring and validating multilingual
police and intelligence-related terminology. The paper describes a partly new
approach to leveraging the potential of machine translation and intelligence
processing by involving human translators in the development process, and discusses
various aspects and practical experience related to the interaction with police
translators. Particular emphasis is placed on the challenge of coping with the very
different approaches, needs and expectations on the part of the different user groups.



1) Introduction to the SENSUS project

SENSUS is a Language Engineering project operating under the European
Commission Fifth Framework Programme. The project ran from 1998 to 2000 with
the overall aim of breaking down language barriers by facilitating efficient and fast
communication between law enforcement agencies in Europe. SENSUS endeavoured
to combine existing language technology products with new research and
development. Focus was on providing linguistic resources and technology for two
central areas: multilingual communication and multilingual document analysis.


With reference to multilingual communication, SENSUS supports the workflow of
international communication by providing translation (i.e. traditional translation as
well as machine translation), content analysis and visualisation tools which give a
faster and better understanding of what a given text is about. Different levels of
sophistication for the translation tools were explored and developed by the project,
from simple term replacement to full machine translation. Human translators can
access a terminology and knowledge database designed by CBS as part of the project,
and automated translation tools and other natural language processing (NLP)
applications can draw on the content of this database. The multilingual
communication component could be incorporated in an independent operational
communication network allowing police officers to exchange information across
borders, but the final SENSUS system does not include such a network, since this was
not the goal of the project.


From raw data to knowledge representation

Paper presented at the EST Conference in Copenhagen, 31 August 2001

Barbara Dragsted and Benjamin Kjeldsen

As to multilingual analysis, the focus of SENSUS was to support the fight against
crime by providing intelligence and police analysts with intelligent, multilingual data
analysis, data mining, visualisation and reporting tools in support of, and in
accordance with, their specific intelligence analysis requirements. Unstructured and
structured data elements are analysed by the project end application and made
available to the analyst by various tools.


In specific terms, the task of the project was to compose a system that would enable
police and intelligence services to quickly examine documents obtained from various
sources with a view to determining the content and relevance of the document in
question. Technically, this requires a powerful scanning device and multilingual OCR
tools (in the case of paper documents) as well as storage and retrieval capacity. Once
a document is in the system, it is indexed for rapid retrieval and possible
cross-referencing to other documents containing similar types of events, names of
persons, places, dates, cars etc.
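
The indexing and cross-referencing step described above can be illustrated with a
minimal sketch. It is not the SENSUS implementation, which the paper does not
describe. It assumes that documents have already been converted to text and that
entities such as names, places and licence plates have already been extracted; the
function names build_index and cross_references and all sample data are
hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each extracted entity to the set of document ids mentioning it.

    `documents` is a dict {doc_id: list of entity strings}.
    """
    index = defaultdict(set)
    for doc_id, entities in documents.items():
        for entity in entities:
            index[entity.lower()].add(doc_id)
    return index


def cross_references(index, doc_id, documents):
    """Return other documents sharing at least one entity with `doc_id`."""
    related = defaultdict(set)
    for entity in documents[doc_id]:
        for other in index[entity.lower()]:
            if other != doc_id:
                related[other].add(entity.lower())
    return dict(related)


# Hypothetical sample data, invented for illustration only.
documents = {
    "report_pt_001": ["AB-12-CD", "Lisbon", "J. Silva"],
    "report_de_007": ["AB-12-CD", "Munich"],
    "report_fr_003": ["Paris", "J. Silva"],
}
index = build_index(documents)
print(sorted(index["ab-12-cd"]))
print(cross_references(index, "report_pt_001", documents))
```

An inverted index of this kind makes a lookup by entity independent of the number
of stored documents, which is what "indexed for rapid retrieval" amounts to in
practice.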


The goal was to make it possible for, say, a German border control officer to check
whether a specific license plate is mentioned in a Portuguese police report, and if so,
what the police report is about. Through an intuitive graphical interface, the border
officer can also check whether a batch of documents seized from a suspect vehicle
contains any information useful to an ongoing investigation, such as names of persons,
events etc. It is possible to check the document for specific terms (including slang
expressions) that are indicative of criminal action or otherwise relevant to police
work.


The SENSUS project involves end users from police, security and intelligence
services of more than 10 European countries. The members of the SENSUS user group,
which was co-ordinated by Europol, were not only the intended users of the end
product, but also participated actively in the project from the development stage
onwards by providing terminology for the project.


The NLP software developers in the project are leading companies in the areas of
linguistics, retrieval, data mining, visualisation and software integration and workflow
applications.



2) CBS contribution to the SENSUS project

Both the user group and the technology group in the project were assisted by
academic research institutions, of which the Copenhagen Business School (CBS) was
one.


Drawing on our experience and knowledge in the domains of linguistics,
communication, terminology and translation and on experience from previous police
communication projects, the Business School formed the linguistic hub connecting
the real-life requirements of the user group and the technical work of the developers
of NLP technology for the project.


More specifically, CBS were responsible for collecting and processing terminology
for the project’s different users and uses. The participating police forces were asked to
submit relevant wordlists and texts that they might have in electronic format or on
paper. Our ‘working basis’ for the terminology collection process was thus a large
collection of wordlists and texts in a wide variety of electronic formats as well as hard
copies of domain-specific dictionaries and full texts. The material comprised very
different information items structured in a number of different ways, and the language
coverage varied greatly. Therefore, comprehensive re-structuring of the input was
required to make it conform with the agreed project formats. Once the terminology
was compiled into a common format, actual terminology work could begin.


The multi-faceted applicability of the system required linguistic resources of varying
degrees of sophistication. A police translator needs detailed encyclopaedic
monolingual and bilingual dictionaries for correct translation of specific terms in all
situations. The machine translation system has little room for synonyms and
near-synonyms in the target language and hence requires clearly distinguished
terminology with one-to-one translation equivalence. The CBS SENSUS Team were
responsible for catering for this variation in the users’ needs by identifying their
requirements and setting up a method for channelling and sorting linguistic resources
in the project.
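
The one-to-one requirement of the machine translation component can be made
concrete with a small check. The following is a sketch under stated assumptions
rather than a description of the project's tooling: term pairs are assumed to be
available as (source, target) tuples, the function name find_ambiguous is
hypothetical, and the English-German pairs are invented for illustration.

```python
from collections import defaultdict


def find_ambiguous(pairs):
    """Return source terms with several targets, and targets with several sources."""
    targets_by_source = defaultdict(set)
    sources_by_target = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source].add(target)
        sources_by_target[target].add(source)
    one_to_many = {s: sorted(t) for s, t in targets_by_source.items() if len(t) > 1}
    many_to_one = {t: sorted(s) for t, s in sources_by_target.items() if len(s) > 1}
    return one_to_many, many_to_one


# Hypothetical English-German pairs, invented for illustration only.
pairs = [
    ("burglary", "Einbruch"),
    ("burglary", "Einbruchdiebstahl"),
    ("police report", "Polizeibericht"),
]
one_to_many, many_to_one = find_ambiguous(pairs)
print(one_to_many)
print(many_to_one)
```

Entries flagged in this way would stay in the translators' dictionary with all their
synonyms, while the MT term list would need a single preferred equivalent to be
chosen for each.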


3) Different groups with different interests

It was clear from the outset that three different terminology user groups could be
identified: NLP software developers, police officers, and police translators. It was also
clear that these different groups had different kinds of needs that had to be met.


The NLP software developers were interested in feeding as much terminology as
possible into the machine translation component of the SENSUS system. Above all,
the focus of the NLP software developers was therefore on quantity more than on
quality. This group emphasised a need for vast amounts of terminological data in as
many languages as possible, with a specified amount of linguistic information,
including semantic categories (corresponding to those of already existing machine
translation systems), grammatical information such as part of speech, gender,
countability etc., but little or no information beyond the level of the lexeme in terms
of definitions, regional variance marking, terminologist and source identification etc.
In the course of the project, the degree of linguistic information required was
somewhat reduced.


The police translators took more or less the opposite approach with a focus on
quality above quantity. In order to provide this user group with an efficient tool, the
terminological data had to undergo comprehensive processing before they were of any
use. Again, language coverage was an important factor, but just as important was the
amount of information included in each terminological record. Preferably, each record
in the terminological database would include: subject area of the concept, any
synonyms and abbreviations, linguistic information (such as part of speech, gender,
number) and, very importantly, definitions of the concept in all languages, related
concepts of the terms and other conceptual relations to other entries in the database,
information on the origin and usage of the concept, and an example of context in
which the term appeared.


Another important piece of information to be included concerned the degree of
equivalence with concepts in other languages, especially since a good deal of the
terminological data concerned legal language with its inherent cultural, historical, and
regional distinctiveness, which often results in skewed conceptual scopes between
languages. In contrast, MT systems often have little room for such finely grained
cultural variation or other conceptual distinctions. It was recognised, though, that
qualitative terminology work can also generally be put to use in NLP applications
from machine translation to content analysis and retrieval mechanisms.
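
The record contents requested by the police translators can be summarised as a
data structure. The sketch below is only one possible reading of the list given
above: the class and field names are hypothetical, the three-level equivalence
scale is an assumption (the paper does not specify how the degree of equivalence
was encoded), and the sample values are invented.

```python
from dataclasses import dataclass, field
from enum import Enum


class Equivalence(Enum):
    # Assumed scale; the paper only speaks of a "degree of equivalence".
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


@dataclass
class TermEntry:
    """Language-specific information about one concept."""
    term: str
    part_of_speech: str = ""
    gender: str = ""
    definition: str = ""
    context: str = ""
    synonyms: list = field(default_factory=list)
    abbreviations: list = field(default_factory=list)


@dataclass
class ConceptRecord:
    """One terminological record: a concept with entries in several languages."""
    subject_area: str
    entries: dict = field(default_factory=dict)       # language code -> TermEntry
    related_concepts: list = field(default_factory=list)
    equivalence: dict = field(default_factory=dict)   # (lang, lang) -> Equivalence

    def missing_definitions(self):
        """Languages whose entry still lacks a definition."""
        return sorted(lang for lang, e in self.entries.items() if not e.definition)


# Hypothetical record, invented for illustration only.
record = ConceptRecord(
    subject_area="property crime",
    entries={
        "en": TermEntry(term="burglary", part_of_speech="noun",
                        definition="placeholder definition"),
        "de": TermEntry(term="Einbruch", part_of_speech="noun", gender="m"),
    },
    related_concepts=["burglar", "crowbar"],
    equivalence={("en", "de"): Equivalence.PARTIAL},
)
print(record.missing_definitions())
```

A record for the MT component would use only a small subset of these fields, which
is the contrast the two user groups represent.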


Finally, the police officers primarily wanted a tool that worked, i.e. a tool which
would provide them with quick solutions to any language and data analysis problem
they might have. The police officers would benefit from both the machine translation
and data analysis component of the system, on which the technology providers were
focusing, and from the terminology and knowledge database emphasised by the police
translators. However, unlike the translators, who have a natural understanding of the
importance of comprehensive terminological information, the police officers were not
used to doing terminological work and were therefore not immediately inclined to set
aside the large amount of time required for thorough terminological processing. On
the other hand, the knowledge possessed by this group was indispensable in the
process of gathering and validating factual information in the database.


Hence, it was necessary to direct attention towards an educational effort for this group
to shed light on the practice and applicability of terminology work. As primary end
users of the project, the police expert group and their needs also reflected the ultimate
goal of the project, but it could be difficult for the individual member of the group to
realise the potential of the final system and the complex processes involved in its
development. One challenge was thus to visualise the link between terminology work
and the efficiency of the final application for this group.


4) Challenges

Overall, the biggest concern in the terminology part of the SENSUS project was to
reconcile the sometimes highly conflicting needs and motivations of the three different
user groups. More specifically, we identified the following challenges:


First of all, there was a conflict between the interests of NLP software developers and
police translators. The NLP software developers were, as already mentioned, mostly
interested in large quantities of data without elaborate information on individual
terms, but with an approximated one-to-one equivalence, whereas the translators
wanted a database with comprehensive terminological records including both
linguistic and factual knowledge, and translational equivalence clarification. One of
our challenges was therefore to consider the interests of both these groups at the same
time. In short, we may call this challenge the quality-quantity conflict.


Furthermore, there was a conflict between the needs of the police translators on the
one hand and the motivation, and to some extent ability, of police officers to perform
terminological work on the other. Another challenge was therefore to introduce police
officers to the nature and benefits of terminological work, and then to teach them how
to do it. This challenge could be termed the user involvement challenge.


5) Approach

What did we do in response to these challenges? First of all, it was of course of great
importance that all three user groups were involved in the process, so we arranged
individual meetings with the different users to discover exactly what their needs and
expectations were. The meetings allowed end users with little or no experience with
terminology and computational aids in translation to obtain a realistic background on
which to base expectations of the final system, i.e. both the declared and the latent
needs and expectations of the user groups were explored. The NLP software
developers were also given the opportunity to voice their expectations and
requirements. Thus, through dialogue, all users became familiar with the project's
potential, and mutual expectations were exchanged, which created a motivational and
inspirational basis for further co-operation in the project.


Challenge 1 (quality <=> quantity conflict)

In order to consider the conflicting interests of both the NLP software developers and
the police translators, it was decided to divide the terminology collection process into
two parts: a quantity part which was to satisfy the NLP software developers’
requirement for vast amounts of terminological data, and a quality part which would
meet the needs of the police translators for an elaborate terminological database and
allow systematic maintenance and expansion of the database.


The quantity part implied that all the raw word lists we had received from the
different police officers and police translators were converted into a common format
and supplied only with the type of information required by the NLP software
developers. Furthermore, CBS extracted terms from texts received from users and,
again, added the information required by NLP software developers. All the
terminology processed in the quantity part also became an independent part of the
terminology database developed for manual consultation, even though it contained
only minimal information. Moreover, work was initiated to select a number of key
terms for further processing in the quality part of the project.
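
The conversion of differently structured word lists into a common format can be
sketched as follows. The agreed project format is not given in the paper, so the
field names in COMMON_FIELDS, the function to_common_format and the sample list
are all assumptions made for illustration; only Python's standard csv module is
used.

```python
import csv
import io

# Assumed minimal common format; the real project format is not documented here.
COMMON_FIELDS = ["term", "language", "part_of_speech", "source"]


def to_common_format(raw_text, delimiter, column_map, language, source):
    """Convert one raw word list into rows carrying the common fields.

    `column_map` maps the raw list's column names to common field names.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    rows = []
    for raw in reader:
        row = {name: "" for name in COMMON_FIELDS}
        for raw_name, common_name in column_map.items():
            row[common_name] = (raw.get(raw_name) or "").strip()
        row["language"] = language
        row["source"] = source
        if row["term"]:  # skip lines without a term
            rows.append(row)
    return rows


# Hypothetical semicolon-separated German word list.
raw_list = "Wort;Wortart\nEinbruch;Substantiv\n Tatort ;Substantiv\n;Substantiv\n"
rows = to_common_format(
    raw_list, ";", {"Wort": "term", "Wortart": "part_of_speech"}, "de", "list_a"
)
for row in rows:
    print(row)
```

Each incoming list needs only its own delimiter and column mapping; everything
downstream can then rely on a single record layout.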


The quality part was highly dependent on user interaction in order to gather elaborate
specialised information within the SENSUS domain. Our first initiative was therefore
to involve the police translators and police officers in the terminology collection
process by setting up a procedure for interactive terminology exchange aiming at
validating and expanding terminological entries. We termed our terminology
exchange methodology ‘the electronic basket procedure’, because the terminology
would travel between CBS and the individual users in small batches or ‘baskets’ in
electronic form. Technically, the format used for the baskets was Microsoft Excel. We
chose Excel because we wanted a tool which everybody could be expected to have,
and possibly know a little bit about. So for instance, we would send an Excel basket to
the French police containing around 20 terms for monolingual expansion. The French
police were thus tasked with adding definitions and other information to the terms.
Upon completion of their task, they would return the basket to us, and we would then
check the input and send it on to the German police, for instance, who would perform
the same task and so on until sufficient information had been added in all SENSUS
languages. After this process of filling linguistic and factual gaps in different
languages, the next step in the basket procedure was the validation process where the
baskets were sent out once again for final multilingual validation by experts. After
validation, the terms were then ready to enter the final SENSUS terminology
database. Apart from forming part of the database, the terminology collected through
the basket procedure could also be used in the MT and data analysis component.
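
The round trip of a basket (expansion per language, a check at CBS, then expert
validation) can be modelled as a small workflow. This is a simplified sketch of
the procedure as described, not the Excel-based exchange itself: the Basket class,
the function names and the rule that every term needs a definition before a basket
moves on are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Basket:
    terms: list                                   # around 20 terms in the project
    expanded: dict = field(default_factory=dict)  # language -> {term: definition}
    validated: set = field(default_factory=set)   # languages validated by experts


def expand(basket, language, definitions):
    """A language team returns definitions; the coordinator checks the input."""
    missing = [t for t in basket.terms if not definitions.get(t)]
    if missing:
        raise ValueError(f"{language}: no definition for {missing}")
    basket.expanded[language] = dict(definitions)


def validate(basket, language):
    """Experts can only validate a language that has already been expanded."""
    if language not in basket.expanded:
        raise ValueError(f"{language} has not been expanded yet")
    basket.validated.add(language)


def ready_for_database(basket, languages):
    """A basket may enter the database once every language is validated."""
    return all(lang in basket.validated for lang in languages)


# Hypothetical two-language run with placeholder definitions.
languages = ["fr", "de"]
basket = Basket(terms=["burglary", "crowbar"])
for lang in languages:
    expand(basket, lang, {t: f"placeholder definition ({lang})" for t in basket.terms})
validate(basket, "fr")
print(ready_for_database(basket, languages))
validate(basket, "de")
print(ready_for_database(basket, languages))
```

Keeping expansion and validation as separate states mirrors the two passes of the
procedure: gaps are filled language by language first, and only then is the whole
basket sent out again for multilingual validation.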



The terminology collected through the basket procedure was very elaborate, and in
this way satisfied the police translators’ need for quality terminology. However, the
process was also extremely time consuming, and we soon found a need to narrow
down the terminology exchange procedure to cover 200 highly relevant terms which
were to undergo extensive terminological processing. While this number may seem
relatively small, it provides a suitable testbed for the new methodology introduced,
and the number of high-quality terms actually exceeded this initial goal by far. The
terms were to be identified by the users themselves, and at this point a new and very
important aspect was added, namely the formulation of a knowledge model, cf. figure
1, the purpose of which was to 1) secure more systematic collection of data, 2)
discover any holes in the data collection, 3) structure the terminological data in a
logical way, and 4) parallelise multilingual and multicultural data.


Figure 1: SENSUS Knowledge Model


As appears from figure 1, the SENSUS Knowledge Model is structured around an
event, e.g. a criminal act such as burglary. The term burglary would thus be assigned
the knowledge model code event. The boxes surrounding the event relate to how and
when an event took place (the left side of the model), and to how it is investigated or
handled by the authorities (the right side of the model). So for instance the term
burglar would fit in the box perpetrator, and a crowbar could be an object involved in a
criminal act. The term crime scene investigation would be put in the box Procedures
under police activities, and Police report would fit in the Documents box. The boxes
surrounding an event can then be further elaborated by identifying subtypes of for
instance Documents related to police investigation.
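
The assignment of terms to boxes can be sketched in code. Figure 1 is not
reproduced here, so the sketch models only the boxes named in the text. The text
states that Procedures sits under police activities; where the other boxes attach
is an assumption, as are the names KNOWLEDGE_MODEL, assign, path_to_event and
uncovered_boxes.

```python
# Box -> parent box. Only boxes named in the text are included; the parents of
# all boxes except "procedures" are assumptions, since Figure 1 is not available.
KNOWLEDGE_MODEL = {
    "event": None,
    "perpetrator": "event",
    "object": "event",
    "police activities": "event",
    "procedures": "police activities",
    "documents": "event",
}


def assign(term_codes, term, box):
    """Give a term its knowledge model code, rejecting unknown boxes."""
    if box not in KNOWLEDGE_MODEL:
        raise KeyError(f"unknown knowledge model box: {box}")
    term_codes[term] = box


def path_to_event(box):
    """Follow parent links from a box up to the central event."""
    chain = [box]
    while KNOWLEDGE_MODEL[chain[-1]] is not None:
        chain.append(KNOWLEDGE_MODEL[chain[-1]])
    return chain


def uncovered_boxes(term_codes):
    """Boxes with no term assigned yet, i.e. holes in the data collection."""
    return sorted(set(KNOWLEDGE_MODEL) - set(term_codes.values()))


# The examples given in the text.
term_codes = {}
assign(term_codes, "burglary", "event")
assign(term_codes, "burglar", "perpetrator")
assign(term_codes, "crowbar", "object")
assign(term_codes, "crime scene investigation", "procedures")
assign(term_codes, "police report", "documents")

print(path_to_event("procedures"))
print(uncovered_boxes(term_codes))
```

Listing the boxes that have no terms yet is one simple way to serve the second
stated purpose of the model, discovering holes in the data collection.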


The overall characteristics and ideology behind the event-based knowledge model can
be applied to other domains and applications as well. The resulting structure of
terminology and knowledge may help a large corporation organise and share its
knowledge in one or more languages, which would secure consistency in use of
language and reduce the margin for error in communication, both internally in the
corporation and externally. For “Event”, substitute “Product”, “Project” or “Process”,
for instance, when talking about a corporation. The other labels in the model must be
substituted correspondingly to match the processes of the corporation in question.
Once a universal model for that particular corporation is built, new projects, products
or processes may be categorised using the same custom shell. In a non-commercial or
governmental organisation, the base point may be a “Process”, such as passing a law
in parliament, publishing a document, implementing a computer system etc. The
resulting organisation of sub-processes and relevant information pertaining to the
product, process or event will find a multitude of possible uses, ranging from pure
terminology or knowledge structuring over document routing applications or
knowledge sharing to NLP applications such as machine translation, search engine
facilities or other types of artificial intelligence.


The logical structure of the SENSUS Knowledge Model helps bridge the gap between
the requirements and capabilities of the three user groups. The model caters for the
semantic category modelling required by NLP developers and the terminological
information desired by police translators, and it lowers the comprehensibility barrier
seen by the police officers actually contributing their knowledge to the system. The
intuitive composition of the knowledge model provided police officers and translators
alike with a tool for input and organisation of knowledge. In many ways, the model is
a pre-fabricated conceptual system ready to structure information relevant to the user
groups as well as to the linguistic software system developers. By pre-fabricating a
conceptual system for any domain, you run the risk of simplifying conceptual
nuances, as the labels of individual terminological categories will be broadly defined.
On the other hand, you have a universal, accessible tool for organising terminology
and the conceptual knowledge underlying the terms. As demonstrated in the
following, this feature proved to be the decisive factor in securing user interaction and
creating a flexible, reliable, and efficient term bank upon which all users and
developers in the project could draw.



Challenge 2 (user involvement challenge)

In response to the challenge of involving police officers in terminological work, we
arranged two terminology workshops. The aim of these workshops was to explain the
principles and application of terminology to police officers, so that they would
become more motivated and better equipped to do terminological work. Moreover, the
workshops provided both police officers and police translators with hands-on
experience so that any methodological and technical problems in relation to the
processing of electronic baskets could be resolved as they arose. One of the technical
issues discussed at the workshops was the medium of the electronic baskets.
Microsoft Excel had been assumed to be a universally known or recognisable
application among European police users, but was in fact not supported at very many
user sites. Realising that the “familiarity” gain of a known application was lost anyway,
CBS decided to introduce users to a fully fledged terminology application, viz. Trados
MultiTerm. This allowed users flexible input options, ranging from plain text to direct
typing into MultiTerm. At the same time, users were able to track progress in the
database and start consulting it right away.

Apart from the two terminology workshops involving both police officers and
translators, we also held three seminars with individual users, both police translators
and police officers, here at CBS. The purpose of the seminars was mainly to arrive at
the most logical and workable structure of the knowledge model which would form
the basis of the identification of the first 200 key terms, and subsequently to actually
identify the 200 terms. As discussed, the knowledge model not only served the
purpose of systematising term collection and making the tasks easier for those new to
terminology work, but also helped route terminological data to the various NLP
applications.


The seminars with individual users were very effective and resulted in the formulation
of the knowledge model presented earlier, as well as 200 central terms in German and
English. The two terminology workshops were given quite a positive reception by
police officers as well as police translators, and were very important steps towards the
aim of collecting and validating the 200 key terms. Project software developers also
supported the proliferation of the knowledge model as a valuable tool for structuring
linguistic input for the various artificial intelligence applications in the project.


6) Output and conclusion

The CBS contribution to the SENSUS project was to collect, unify and structure
terminological data from different users for different purposes by means of user
interaction. The output of the CBS contribution was terminological data which are
used partly for automatic data analysis and machine translation in the final SENSUS
system (the quantity part) and partly as an independent look-up tool for translators and
police officers (the quality part). Technically, both types of terminological data are
combined in one joint Trados MultiTerm database. The database contains approx.
13,000 terminological records with a total of almost 50,000 terms with varying
degrees of conceptual and linguistic information and validation, covering anywhere
from 1 to 9 languages.


Out of these 13,000 records, some 200 are central concepts identified on the basis of
our SENSUS knowledge model tailored for police terminology. These 200 central
concepts serve as the basis for expanding qualitative terminology work in European
police co-operation projects. The concrete output of the terminology part of the
SENSUS project is, so far, implemented in only four languages: English, German,
French and Spanish.


Apart from this concrete output in the form of a terminological database, the
terminology project has contributed methodologies and experience which can be
drawn on in future similar projects:


Firstly, a methodology for exchange of terminology across countries and institutions
has been developed.


Secondly, a knowledge model has been drawn up, which is tailored for police
terminology and usable for identification and structuring of police terminology in the
future. As shown earlier, the knowledge model is an implementation of a
process-based or event-based knowledge modelling strategy developed by CBS, which
can be tailored to other processes, projects or products.


Thirdly, experience has been gained concerning the difficulties of joining and
motivating the diverse groups of police officers, human translators, and NLP
developers. The gap between human translators and machine translation developers is
particularly interesting, as it is increasingly becoming a challenge in the daily work of
both parties. The project succeeded in creating mutual understanding through
constructive dialogue and educational efforts. We resolved the challenge of
motivating all three identified groups to make a joint contribution from which all
benefited. By demonstrating, through dialogue and in workshops, that it was possible
to create a sum greater than its parts and by providing a shared medium for
interaction, it became possible to fully leverage the potential of the groups and,
consequently, that of the project in general.



REFERENCES:


CBS/SENSUS Project
http://www.cbs.dk/departments/vista/sensus/

EU-Project SENSUS
http://www.sensus-int.de/