Can indexing be automated? - the example of the Deutsche Nationalbibliothek



Ulrike Junger

Deutsche Nationalbibliothek

Frankfurt/Main


Leipzig






Abstract: The German subject headings authority file (Schlagwortnormdatei/SWD) provides a broad controlled vocabulary for indexing documents on all subjects. Traditionally it has been used for intellectual subject cataloguing, primarily of books. The Deutsche Nationalbibliothek (DNB, German National Library) has been working on developing and implementing procedures for the automated assignment of subject headings to online publications. This paper sketches the project, its results and its problems.




I. Introducing the Schlagwortnormdatei, the German subject headings authority file


The Deutsche Nationalbibliothek (German National Library, DNB for short) is celebrating its 100th anniversary this year. Working with controlled subject headings has had almost as long a tradition at the DNB, going back to the times of the card catalogue. When computers were introduced into library work and cataloguing was henceforth done in databases, the concept of authority files emerged. In Germany an authority file for subject headings was created starting in the mid-1980s. It is called Schlagwortnormdatei, SWD for short (this abbreviation will be used throughout this paper). Translated into English the name simply means “Authority file for subject headings”. The first part of this paper gives an idea of its character and organization.


The SWD contains records for various groups of headings: topical terms, geographic and ethnographic names, corporate bodies and work titles. Headings for persons are part of another authority file, the Personennamendatei (PND, File for Names of Persons [1]).

Currently there are around 610,000 records in the SWD. Over 170,000 are topical terms covering all sciences and subjects. The larger share of the headings are individual names, e.g. for companies, geographical entities etc. This predominance of individual names is a result of the set of rules underlying the creation of new subject headings and their use. These Regeln für die Schlagwortkatalogisierung (RSWK, meaning simply “Rules for subject indexing”) stipulate that the contents of a publication should be represented by the narrowest applicable terms.


Although DNB hosts all authority files and assumes a major editorial responsibility for them, they are in fact the result of a longstanding cooperation between DNB and partner libraries and library networks in Germany, Austria and Switzerland. All partners contribute new headings and fulfil editorial tasks.


The SWD has a thesaurus-like structure. Besides the preferred term (the actual heading) an authority record contains synonyms, superordinate and related terms, and, depending on the type of heading, codes for languages, countries etc. A simple homegrown classification is applied to most of the headings except geographic names. About 40,000 terms have been enhanced with DDC notations. Notes and information about the sources for a term complete the record.
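To make this record structure concrete, the following minimal Python sketch models the fields just listed. The class name, the field names and the sample values are illustrative assumptions; they do not reproduce the actual SWD data format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubjectHeadingRecord:
    """Illustrative model of a thesaurus-like authority record (hypothetical field names)."""
    preferred_term: str                                       # the actual heading
    heading_type: str                                         # e.g. "topical", "geographic"
    synonyms: list[str] = field(default_factory=list)
    broader_terms: list[str] = field(default_factory=list)    # superordinate terms
    related_terms: list[str] = field(default_factory=list)
    codes: dict[str, str] = field(default_factory=dict)       # language/country codes etc.
    classification: Optional[str] = None                      # homegrown classification
    ddc_notation: Optional[str] = None                        # only some terms carry one
    notes: str = ""
    sources: list[str] = field(default_factory=list)

    def entry_terms(self) -> set[str]:
        """All lower-cased strings under which the heading can be found."""
        return {t.lower() for t in [self.preferred_term, *self.synonyms]}


# Invented sample record, for illustration only.
record = SubjectHeadingRecord(
    preferred_term="Bibliothek",
    heading_type="topical",
    synonyms=["Bücherei"],
    broader_terms=["Informationseinrichtung"],
    ddc_notation="027",
)
print(sorted(record.entry_terms()))
```

Keeping synonyms inside the record means that a lookup by any entry term leads back to the single preferred heading, which is the property a dictionary-based indexing tool relies on.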




[1] The PND records are one of the data sets constituting the VIAF/Virtual International Authority File, http://viaf.org/.







Fig. 1: Example of an SWD record


New headings are created if needed for indexing a publication. Thus the continuous growth and the contents of the SWD depend to a great degree on the extent to which libraries do subject cataloguing. DNB and other libraries have been reducing this effort for a number of years now due to lack of capacity. One effect is that the maintenance of the topical terms in the natural sciences is insufficient, because about five years ago DNB stopped indexing doctoral theses, a major source of new terms.


How can SWD records be obtained? The DNB provides all of its authority files both in the German exchange format MAB and in MARC 21. In 2010 a linked data service was established; the data can be obtained free of charge. [2]


2012 is not only the year of the 100th anniversary of DNB but also the year in which the SWD will vanish and its contents will be integrated into one large authority file, the Gemeinsame Normdatei (GND, Consolidated Authority File). Besides the SWD, the authority files for persons, musical works and corporate bodies (used in descriptive cataloguing) are also incorporated. The format is close to MARC 21; record creation and use will follow the stipulations of RDA.



II. The use of SWD subject headings for automated indexing


As the national library for Germany DNB has the right to legal deposit. In 2006 a revised law on the DNB became effective. Along with a change of name (formerly Die Deutsche Bibliothek, now Deutsche Nationalbibliothek, German National Library) a major change in the mandate of DNB came into effect: it now also has to collect, catalogue, index and archive works in immaterial form, i.e. online publications.


The DNB had been collecting online publications for a number of years, predominantly doctoral theses, on average 20,000 works annually. But it soon became clear that DNB had to step up its efforts to increase the number of online publications collected. The development and implementation of new methods and channels for the submission of online publications, together with campaigns addressing publishers, led to a very significant increase in the collection of online publications. In 2011 DNB collected more than 187,000 online documents. [3] It can be expected that this number will grow at a fast pace in the coming years.




[2] See http://www.dnb.de/EN/Service/DigitaleDienste/LinkedData/linkeddata_node.html

[3] For comparison: DNB collected over 385,000 documents in physical form in 2011.




It also soon became clear that DNB does not have the capacity to catalogue and index all this material in the traditional way, i.e. intellectually/manually. Like many libraries, DNB has to do more (data and services) with less (staff). Still, DNB has the ambition and the need to provide catalogue data for every document in its collection; this includes subject data. In 2009 it was therefore decided to stop manually cataloguing online publications at the beginning of 2010 and to start a large project to develop methods for the automated processing of monographic online publications. [4] A major goal of the project was the automated assignment of SWD subject headings to online documents. This project, called Petrus [5], was conducted from 2009 to 2011, with follow-up projects in 2012-2013.


One could argue that in the era of full text indexing the assignment of controlled terms such as SWD subject headings is obsolete. But we are convinced that the use of controlled vocabulary serves an important purpose even in the Google age. A shared system of terms allows precise and comprehensive searches across all the collections a library holds and beyond. Indexing with terms taken from a controlled vocabulary has the advantage that a single term is embedded in a semantic net or context, which can additionally be used for retrieval. We also wanted to maintain continuity in subject cataloguing and in our data.


How did DNB approach this enterprise? One of the conditions of the project was that a system or software available on the market should be used and that homegrown software should be avoided. The first two years of Petrus were therefore dedicated to a market scan and thorough tests of several systems. These were chosen following a public invitation to tender.


In the end DNB decided to acquire and license the Averbis Extraction Platform, a system developed by the Averbis company located in Freiburg, Germany. This company is a spin-off of the University Hospitals in Freiburg and had so far been specializing in the automated indexing of medical publications.


The process the Averbis Extraction Platform performs on documents is basically this: First it executes a textual analysis of the online publications, based on various linguistic methods, in order to extract content-bearing terms from the textual parts and titles of the online publications. It then ranks the extracted terms according to their meaning and importance. Finally, the extracted terms are matched to the controlled vocabulary of the SWD.
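The three steps (extract, rank, match) can be illustrated with a deliberately simplified Python sketch. It is not the Averbis implementation: the tokenizer, the frequency-based ranking, the tiny vocabulary and all function names are assumptions made for illustration, and a real system would use linguistic analysis rather than plain string matching.

```python
import re
from collections import Counter

# Toy controlled vocabulary: entry term (preferred or synonym) -> preferred heading.
# All entries are invented for illustration.
VOCABULARY = {
    "bibliothek": "Bibliothek",
    "bücherei": "Bibliothek",
    "katalog": "Katalog",
    "schlagwort": "Schlagwort",
}


def extract_terms(text: str) -> list[str]:
    """Step 1: extract candidate terms (here: lower-cased word tokens)."""
    return re.findall(r"\w+", text.lower())


def rank_terms(terms: list[str]) -> list[tuple[str, int]]:
    """Step 2: rank candidates (here: simply by frequency, most frequent first)."""
    return Counter(terms).most_common()


def match_headings(ranked: list[tuple[str, int]], limit: int = 10) -> list[str]:
    """Step 3: map ranked terms to preferred headings of the vocabulary."""
    headings: list[str] = []
    for term, _count in ranked:
        heading = VOCABULARY.get(term)
        if heading is not None and heading not in headings:
            headings.append(heading)
    return headings[:limit]


text = "Die Bibliothek pflegt den Katalog. Jede Bücherei braucht einen Katalog."
print(match_headings(rank_terms(extract_terms(text))))
```

In this toy run the synonym "Bücherei" leads to the same preferred heading as "Bibliothek", so the heading is output only once.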


There are two main components in the system:

- the Averbis Concept Mapper: a configurable annotation tool based on a dictionary. It combines methods of machine learning with morphological and syntactic analysis. The dictionary is flexible and allows the integration of synonyms and various attributes for terms, e.g. classificatory information.

- the Dictionary Configurator: a user interface to create and modify user-specific concepts regarding the dictionary.


Various sections of the SWD/PND have been integrated into the dictionary as the basis for the assignment of subject headings. Currently these are: 170,000 topical terms, 153,000 records for geographic and ethnographic names and 311,000 records for persons.


As mentioned, the dictionary thus established can be configured with the Averbis Dictionary Configurator. This allows, for example, the realization of tailored indexing concepts for certain groups of objects/publications. Since the SWD is continually growing and being enhanced, the dictionary has to be updated regularly. It is a requirement for the Averbis system that concepts, once configured for the dictionary, are retained.

[4] Serials and other types of online documents were excluded because of specific difficulties associated with cataloguing and indexing these materials.

[5] Petrus is an acronym for „Prozessunterstützende Software für die digitale Deutsche Nationalbibliothek“, i.e. software supporting processes for the digital German National Library.


After the acquisition of the Averbis system a number of tests were conducted to explore which configurations would bring the best results. In the first tests only topical terms were integrated into the dictionary; geographic and personal names followed later.


Since the major part of the publications collected by DNB is of course in German, it was decided to concentrate first on monographic publications in this language.


A difficulty regarding the tests was the question of how to evaluate and measure the quality of the automatically generated subject headings. The discussions resulted in the decision to check the results intellectually, using a sample of titles with automatically assigned subject headings. This was done by the subject specialists in the department of subject cataloguing, i.e. the same persons who usually index and classify books and other publications. In full awareness of the fact that intellectual indexing has a subjective component and that inter-indexer consistency lies at around 50% on average, we decided to consider their judgment the “gold standard”.


An evaluation database was created which contained the author, the title, a link to the full text of the document and the list of automatically assigned subject headings. Each heading had to be judged on a 4-point scale: very useful, useful, less useful, harmful. [6] The raters were also asked to note missing aspects/headings.


This implied that the raters had partially “to forget” the rules they are supposed to follow when indexing intellectually. For example, the Regeln für den Schlagwortkatalog/RSWK contain a number of stipulations on how to arrange several subject headings to form a proper sequence.
























[6] This is meant with respect to retrieval.

Fig. 2: User interface of the evaluation database



The ratings were then statistically evaluated: generalized precision and generalized recall were calculated as measures for the usefulness and completeness of the assigned headings.
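How graded ratings can be turned into such measures is shown in the following Python sketch. The exact formulas and weights used in the evaluation are not stated in this paper, so the numeric weights for the four rating levels, the treatment of missing headings (each counted with full weight) and the function names are assumptions for illustration only.

```python
# Hypothetical numeric weights for the four rating levels (an assumption).
RATING_WEIGHTS = {
    "very useful": 1.0,
    "useful": 0.5,
    "less useful": 0.25,
    "harmful": 0.0,
}


def generalized_precision(ratings: list[str]) -> float:
    """Average graded usefulness of the headings that were assigned."""
    if not ratings:
        return 0.0
    return sum(RATING_WEIGHTS[r] for r in ratings) / len(ratings)


def generalized_recall(ratings: list[str], missing: int) -> float:
    """Graded usefulness found, relative to all usefulness that could have been found.

    Each heading the raters noted as missing is counted with full weight 1.0.
    """
    found = sum(RATING_WEIGHTS[r] for r in ratings)
    total = found + missing * 1.0
    return found / total if total else 0.0


# Invented ratings for one document with four assigned and two missing headings.
ratings = ["very useful", "useful", "useful", "harmful"]
print(generalized_precision(ratings))
print(generalized_recall(ratings, missing=2))
```

Under these assumptions a document scores high on precision when few of its assigned headings are rated poorly, and high on recall when the raters note few missing headings.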


As a showcase the results of a test conducted in October 2011 are described: The objects tested were German electronic full text documents, predominantly doctoral theses. All documents belonged to one of 12 subject areas, e.g. medicine [7]. Ten subject headings were assigned to each document. The configuration of the Averbis system used in this test comprised a multi-level procedure for the disambiguation of terms, the use of ignore/exception lists, the use of classificatory information attached to the headings, and a higher weight on title terms compared to terms taken from the text corpus. A sample of 30 processed documents was taken and evaluated by the librarians.
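Two elements of this configuration, the ignore list and the higher weight for title terms, can be sketched in Python as follows. The weight of 3.0, the entries of the ignore list and the function name are assumptions for illustration; the values actually used in the test are not given in this paper.

```python
from collections import defaultdict

TITLE_WEIGHT = 3.0                        # assumed value, not taken from the test
IGNORE_LIST = {"methode", "einleitung"}   # hypothetical ignore/exception list


def score_terms(title_terms: list[str], body_terms: list[str]) -> list[tuple[str, float]]:
    """Score candidate terms, counting title occurrences more heavily than body ones."""
    scores: dict[str, float] = defaultdict(float)
    for term in body_terms:
        scores[term] += 1.0
    for term in title_terms:
        scores[term] += TITLE_WEIGHT
    kept = [(t, s) for t, s in scores.items() if t not in IGNORE_LIST]
    # Highest score first; ties are broken alphabetically for a stable result.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Invented terms for one document.
title = ["herzinfarkt", "therapie"]
body = ["methode", "methode", "methode", "therapie", "patient", "patient"]
print(score_terms(title, body)[:10])
```

In this toy example the frequent but ignored term "methode" is dropped, and "herzinfarkt", which appears only in the title, outranks "patient", which appears twice in the body.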


The results are shown in the following figures:












[7] All publications collected by DNB are sorted into one of about 100 so-called subject groups based on the DDC. Since the distribution of online publications is very disparate across subjects, it proved necessary to concentrate on subject groups with a sufficient number of full text documents.



Although the values for recall all lie between 0.5 and 0.9, the precision is deficient, and the overall results achieved for automated indexing are not yet considered satisfactory. There are still too many incorrect and too few correct/useful headings assigned. This means that the productive stage has not yet been reached and it is still too early to establish a routine workflow.


A major reason for this is that ambiguous terms are still not discriminated well enough. Also, the SWD as a universal vocabulary contains many general terms (e.g. Methode/method) which occur frequently in documents but have no specific meaning unless combined with other headings. A solution to both problems could be the analysis of co-occurrences and the use of topical filters.
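One possible reading of these two ideas is sketched below in Python, under stated assumptions: a candidate heading is kept only if the subject groups attached to it include the subject group of the document (a topical filter), and a general term is kept only when it occurs together with at least one specific heading (a crude stand-in for co-occurrence analysis). The data, the list of general terms and the function name are hypothetical; this is not a method implemented by DNB or Averbis.

```python
# Hypothetical data: heading -> subject groups in which it is considered meaningful.
HEADING_GROUPS = {
    "Bank": {"economics"},       # the financial institution
    "Sitzbank": {"furniture"},   # the bench
    "Kredit": {"economics"},
    "Methode": {"economics", "medicine", "furniture"},
}
GENERAL_TERMS = {"Methode"}      # assumed list of unspecific headings


def topical_filter(candidates: list[str], subject_group: str) -> list[str]:
    """Keep headings that fit the document's subject group; drop isolated general terms."""
    fitting = [h for h in candidates if subject_group in HEADING_GROUPS.get(h, set())]
    specific = [h for h in fitting if h not in GENERAL_TERMS]
    # A general term is only kept in combination with at least one specific heading.
    return fitting if specific else []


print(topical_filter(["Bank", "Sitzbank", "Methode"], "economics"))
print(topical_filter(["Methode"], "medicine"))
```

The first call keeps the general term because it accompanies a specific heading of the right subject group; the second call returns nothing because the general term stands alone.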


Another general problem is the discrimination between names of persons, geographic names and topical terms. Here DNB is considering the use of Named Entity Recognition methods.


An additional factor in reaching better results could be taking into account a document’s formal structure. A doctoral thesis and a novel differ not only in their language but also in the typical arrangement of their text.


Another topic in need of further work is confidence values. The subject headings the Averbis software assigns have a confidence value, i.e. a value between 0 and 1 marking how sure the system is that the term is valid and correct with regard to the document. The intention is to use the confidence value to control which and how many headings are assigned to a document. Up to now the confidence value is not sufficiently informative. For this reason a fixed number of headings is produced in the tests, which is probably one of the reasons why so far many incorrect headings are output alongside some correct ones. The improvement of the methods for calculating the ranking and the confidence value therefore has a high priority.
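The difference between the fixed-number approach and the intended confidence-based selection can be contrasted in a short Python sketch. The threshold of 0.6, the cap of ten headings, the sample headings and their confidence values are assumptions for illustration only.

```python
def select_fixed(scored: list[tuple[str, float]], n: int = 10) -> list[str]:
    """Fixed-number approach: always output the n best-ranked headings."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [heading for heading, _conf in ranked[:n]]


def select_by_confidence(
    scored: list[tuple[str, float]],
    threshold: float = 0.6,
    max_headings: int = 10,
) -> list[str]:
    """Confidence-based approach: output only headings at or above the threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [heading for heading, conf in ranked if conf >= threshold][:max_headings]


# Invented headings with invented confidence values between 0 and 1.
scored = [("Herzinfarkt", 0.92), ("Therapie", 0.71), ("Methode", 0.30), ("Bank", 0.12)]
print(select_fixed(scored, n=3))
print(select_by_confidence(scored))
```

With an informative confidence value the second function suppresses weak candidates. If the value does not separate correct from incorrect headings, no threshold helps, which is why a fixed number is the fallback.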


Early in 2012 DNB decided to continue the project for two more years. It has become very clear that it is not easy to develop and implement a process using a universal controlled vocabulary for the automated indexing of a universal collection. But we are convinced it can be done with reasonable results as long as the claim is not that automated indexing produces the same results as rule-guided intellectual indexing. The benchmark should be usefulness for retrieval.





Author information:

Ulrike Junger

Head, Department of Subject Cataloguing

Deutsche Nationalbibliothek

Adickesallee 1

60322 Frankfurt/Main, Germany

Phone: +40-69-1525 1500

Mail: u.junger@dnb.de


Biography: Psychologist and theologian, academic librarian since 1994, various positions as subject librarian and authority file editor, head of the German Union Catalogue of Serials, since 2009 head of the Department of Subject Cataloguing at the Deutsche Nationalbibliothek