Data mining MARC to find FRBR?

Finnish-Norwegian project

ELAG, Rome, 17.4.2002

Eeva Murtomaa, Helsinki University Library

One year ago, at the beginning of March 2001, there were only questions to be
answered. These questions related to the implementations of the IFLA
(International Federation of Library Associations and Institutions) FRBR model
(Functional Requirements for Bibliographic Records) in bibliographic systems.

Knut Hegna from the University of Oslo Library and Eeva Murtomaa from the
Helsinki University Library started a project to get some answers.

Our first question was: could we find the FRBR entities work, expression, and
manifestation in the MARC records included in the Finnish and Norwegian
national bibliographies and in BIBSYS (a library system serving most university
and college libraries in Norway)? But this was not enough. We were also curious
to see what kinds of problems we face when looking at the MARC records, or at
the search results and displays, in the light of FRBR, and how to design better
hit lists based on the FRBR model.

After one year we had two main answers to our first question: yes and no. Yes,
because the FRBR model is to some extent present in the MARC record, and it can
partly be found by a computer program as well. Bibliographic records are usually
created at the manifestation level. This means that we can identify and separate
the elements describing works and manifestations in the bibliographic records.
Even data describing the expression can be found, depending on the level and
quality of description. In addition, we realized that the relationships between
the description and the main and added entries, as well as subject descriptors,
would help in the identification process.

Example 1.

008 941208s1946 fi j f                  008 941208s1946 se j f c
015 $a f521254                          015 $a f521255
041 $a swe                              041 $a swe
080 $a 839.79                           080 $a 839.79
1001 $a Jansson $h Tove                 1001 $a Jansson $h Tove
2452 $a Kometjakten                     2452 $a Kometjakten
     $d Tove Jansson                         $d Tove Jansson
260 $a Helsingfors                      260 $a Norrköping
    $b Söderström $c 1946                   $b Sörlin $c 1946
300 $a [2], 179 s. $c 8:o               300 $a [2], 179 s. $c 8:o

Example 2:

008 8700909s1967 gb j                   008 940214s1991 us j f c
015 f688998
021 $a 0 2 $c nid.
$a eng $c swe                           0411 $a eng $c swe
$a Jansson $h Tove                      $a Jansson $h Tove
$a Kometjakten
$a Comet in Moominland                  $a Comet in Moominland
$d Tove Jansson $e translated by Elizabeth Portch
$d written and ill. by Tove Jansson $e translated by Elizabeth Portch
250 $a [New ed.]
$a Harmondsworth $b Penguin books $c 1967
260 $a [New York, Ny] $b Farrar Straus, and Giroux $c 1991
$a 157 s. $b kuv.                       300 $a 192 s. $b kuv. $c 19 cm
490 $a A Puffin book                    490 $a A Sunburst book
555 $a 5. Impr. London & New York: Benn & Walk, 1970
1$a Portch $h Elizabeth                 $a Portch $h Elizabeth

Example 3:

008 941208s1946 fi j f                  008 941208s1946 se j f c
015 $a f521254                          015 $a f521255
041 $a swe                              041 $a fin $c swe
080 $a 839.79 3(024.7)                  080 $a 839.79
1001 $a Jansson $h Tove                 1001 $a Jansson $h Tove
2452 $a Kometjakten                     2452 $a Takaisin Muumilaaksoon
     $d Tove Jansson                         $d Tove Jansson
260 $a Helsingfors                      260 $a Porvoo $a Hki $a Juva
    $b Söderström $c 1946                   $b WSOY $c 1988
300 $a [2], 179 s. $c 8:o               300 $a 250 s. $b kuv. $c 30 cm
                                        500 $a Alkuteokset: Kometjakten
                                            Trollkarlens hatt …
                                            Sisältö: Muumipeikko ja
                                            pyrstötähti ; ...
                                        1 $a Järvinen $h Liisa

Example 4 (Identifying the

*008941107s1994 it j c
$a 88 X $c nid.
*0411 $a ita $c swe
*080 $a 839.79
*1001 $a Jansson $h Tove $c 1914
*241 $a Resa med lätt bagage
2452 $a Viaggio con bagaglio leggero
     $d Tove Jansson $e introduzione di Carmen Giorgetti Cima
     $e [traduzione dallo svedese di Carmen Giorgetti
*260 $a Milano $b Iperborea $c 1994
*300 $a 187, [1] s. $c 20 cm
*490 $a Iperborea $v 44 $y 0044
*70021$a Giorgetti Cima $h Carmen

Example 5

008880325 no
*02000 $a $b h. $c Nkr 60.00
*04110 $a
*08200 $a
*10010 $a Ibsen, Henrik $d 1828
*24510 $a (1879) $c Henrik Ibsen
       Odd Tangerud ; lingve kontrolita de
       Esperantista Verkista Asocio (EVA)
*26000 $a Hokksund $b Eldonejo Odd Tangerud $c 1987
*26900 $a [Drammen] : Tangen
*30000 $a [1], 57 s. $c 24 cm
*50000 $a Originaltittel: Et dukkehjem
       København : Gyldendal, 1879
*99100 $a Tangerud, Odd

Why was the answer also no? We realized that the cataloguing rules are designed
for card catalogues and printed bibliographies, not for displays based on the
FRBR model. The central information is often recorded in a way more suitable for
the human mind and eye than for a computer.

However, this was not the end of the questions. We had to know what kinds of
problems we meet when looking at the MARC records, or at the structure of the
hit lists and displays, in the light of FRBR. What should the hit lists look
like?


Our goal was to collocate similarities and to analyse differences. Therefore we
created strings for identifying works, expressions and manifestations by mapping
the FRBR attributes associated with the entities work, expression, and
manifestation to the elements included in the national MARC fields and
subfields. Only attributes and relationships of high or moderate value for
identifying the entities were taken into consideration. The idea was to bring
together identical strings and to separate different strings.

Table 1:

Table 2:

Our examples consisted mainly of single works, or collections of works, for
which a single person is responsible. From the tables above you can see that for
identifying the work we looked at the original titles or uniform titles. The
title of the work and the relation to the person(s) responsible were used as
attributes identifying the works.

With these attributes we could collocate the identical works existing in several
records and differentiate them from other works. When this was done, we used
other data to differentiate and collocate different expressions.

At the expression level, the language of the expression, supplemented with the
entity responsible for the expression (usually the translator), was selected.
For identifying the manifestation, the following elements from the MARC records
were used: publisher, date and extent of carrier.
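The identification strings described above can be sketched in Python. This is only an illustration under stated assumptions, not the program used in the project: the record layout (a dict mapping a field tag to a dict of subfields), the function names, and the exact choice of fields (241 for the original title with 245 as fallback, 041 $a for the language, 700 $a for the translator, 260 $b, 260 $c and 300 $a for the manifestation) are assumptions modelled on the Finnish examples in this paper.

```python
from collections import defaultdict


def sub(record, tag, code):
    """Return subfield `code` of field `tag`, or '' when it is absent."""
    return record.get(tag, {}).get(code, "").strip()


def work_key(record):
    # Work: person responsible plus original title (241), else title proper (245).
    author = f"{sub(record, '100', 'a')}, {sub(record, '100', 'h')}"
    title = sub(record, "241", "a") or sub(record, "245", "a")
    return f"{author} / {title}".lower()


def expression_key(record):
    # Expression: language of the text plus the person responsible (assumed: 700 $a).
    return f"{sub(record, '041', 'a')} / {sub(record, '700', 'a')}".lower()


def manifestation_key(record):
    # Manifestation: publisher, date and extent of carrier.
    parts = (("260", "b"), ("260", "c"), ("300", "a"))
    return " / ".join(sub(record, tag, code) for tag, code in parts).lower()


def collocate(records):
    """Bring identical strings together: work -> expression -> manifestations."""
    tree = defaultdict(lambda: defaultdict(list))
    for rec in records:
        tree[work_key(rec)][expression_key(rec)].append(manifestation_key(rec))
    return tree


# Simplified versions of the records in Examples 1 and 2.
records = [
    {"041": {"a": "swe"}, "100": {"a": "Jansson", "h": "Tove"},
     "245": {"a": "Kometjakten"},
     "260": {"b": "Söderström", "c": "1946"}, "300": {"a": "[2], 179 s."}},
    {"041": {"a": "swe"}, "100": {"a": "Jansson", "h": "Tove"},
     "245": {"a": "Kometjakten"},
     "260": {"b": "Sörlin", "c": "1946"}, "300": {"a": "[2], 179 s."}},
    {"041": {"a": "eng", "c": "swe"}, "100": {"a": "Jansson", "h": "Tove"},
     "241": {"a": "Kometjakten"}, "245": {"a": "Comet in Moominland"},
     "260": {"b": "Penguin books", "c": "1967"}, "300": {"a": "157 s."},
     "700": {"a": "Portch", "h": "Elizabeth"}},
]

tree = collocate(records)
for work, expressions in tree.items():
    print(work)
    for expression, manifestations in expressions.items():
        print("  ", expression, "->", manifestations)
```

In this sketch the three records fall under one work string, with the two Swedish editions sharing an expression string and the English translation forming a second one. Any difference in spelling or a missing original title would produce a separate string.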

Some results of our examples

Results of using the work reduction procedure on records from the Norwegian (n)
and the Finnish (f) national bibliographies.

In the table, the "Number of records" is the number of records in which the
author appears as main or added entry. The "Number of work lines" is the number
of work identifiers extracted from these records by the programme. The last
column shows the "Number of unique works" by the author. Looking at the
variation in the numbers, we have to realize that there are some discrepancies
between reality and the numbers given in the table.
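The relation between the three columns can be illustrated with a small Python sketch. It assumes that each record yields a list of zero or more work identifiers; the function name and the sample identifiers are hypothetical and do not come from the project's data.

```python
def work_statistics(work_ids_per_record):
    """Return (number of records, number of work lines, number of unique works).

    `work_ids_per_record` holds one list of extracted work identifiers per record.
    """
    work_lines = [w for ids in work_ids_per_record for w in ids]
    return len(work_ids_per_record), len(work_lines), len(set(work_lines))


# Hypothetical extraction result for five records of one author.
extracted = [
    ["jansson, tove / kometjakten"],
    ["jansson, tove / kometjakten"],
    # a collection: several works in the same manifestation
    ["jansson, tove / kometjakten", "jansson, tove / trollkarlens hatt"],
    # a record without an identifiable original title
    [],
    # a misprint in the original title is counted as a separate work
    ["jansson, tove / kometjackten"],
]

print(work_statistics(extracted))  # (5, 5, 3)
```

The sample shows how the counts can drift from reality: the record without an original title contributes no work line, and the misprinted title inflates the number of unique works.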

What kind of problems?

Of course, the results of the study depend on the quality of cataloguing. There
are inconsistencies in the logic of cataloguing caused by historical or
individual differences. In addition, there are other reasons for "lying
statistics", such as:

records that lack some information, or in which the information is wrong

records without original or uniform titles, or with unidentifiable titles

misprints and spelling differences in the original titles

inaccurate cataloguing

inconsistent registration of collections (several works in the same manifestation)

relationships, which are usually expressed in natural language (usually as notes)

lack of qualifying information (roles) in added entries (700 $e)
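Some of the purely formal differences listed above (case, punctuation, spacing) could be reduced by normalizing titles before comparing strings. The following Python sketch is an assumption added for illustration, not a step reported by the project, and the function name is hypothetical. It also shows the limit of such an approach: a real misprint still produces a different string.

```python
import re
import unicodedata


def normalize_title(title):
    """Fold case, replace punctuation with spaces and collapse whitespace."""
    text = unicodedata.normalize("NFC", title).casefold()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


# Formal variants collapse to the same string ...
print(normalize_title("Et  Dukkehjem.") == normalize_title("et dukkehjem"))
# ... but a misprint does not.
print(normalize_title("Kometjakten") == normalize_title("Kometjackten"))
```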

User interfaces

One of our questions concerned the meaning of the FRBR structure for hit lists
and displays. We thought that the hit list should be in line with the search and
the search results. In addition, we supposed that the search results for the
works of a single person should be arranged in alphabetical or chronological
order, or perhaps according to the function of the person in relation to the
work, expression or manifestation.

For designing the hit lists and displays, we looked at the attributes/elements
of the entities that are important in the selection process in FRBR. It seems
that at the work level the title and the relation to the creator are most
important. At the expression level the language and the relation to the person
responsible for the expression should be taken into consideration. Finally,
important elements at the manifestation level are the statements of
responsibility, edition, publisher, and date.

Card catalogues

In the card catalogue the hit list is presented as overlapping cards. The
headings are composed of the work title and the person responsible for the work.
The expressions are represented by the language of the expression and the title
of the manifestation. The original title is given first, and after that the
other titles in alphabetical order. At the end of the title of the expression,
the number of manifestations of this expression is given.

The manifestations are sorted according to the publishing year.

Tree structure

Here the work titles are sorted alphabetically according to the original title.
The number at the end of each title indicates the number of expressions
belonging to this work. The work nodes are expandable, with the expressions as
leaves. The expressions are sorted alphabetically according to the languages of
the original titles.
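A tree display of this kind can be sketched as follows. This is a minimal illustration, not the display built in the project: the nested input structure, the function name `render_tree`, and the choice to sort expressions by language code and manifestations by year are assumptions.

```python
def render_tree(works):
    """Render {work title: {language: [(year, publisher), ...]}} as indented lines."""
    lines = []
    for title in sorted(works, key=str.casefold):
        expressions = works[title]
        # The number after the work title is the number of expressions.
        lines.append(f"{title} ({len(expressions)})")
        for language in sorted(expressions):
            manifestations = sorted(expressions[language])
            # The number after the language is the number of manifestations.
            lines.append(f"  {language} ({len(manifestations)})")
            for year, publisher in manifestations:
                lines.append(f"    {year} {publisher}")
    return lines


# Simplified data taken from Examples 1 and 2.
works = {
    "Kometjakten": {
        "swe": [(1946, "Sörlin"), (1946, "Söderström")],
        "eng": [(1991, "Farrar"), (1967, "Penguin books")],
    },
}

print("\n".join(render_tree(works)))
```

In an interactive interface the expression and manifestation lines would be shown only when the work node is expanded; the sketch prints the fully expanded tree.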


First of all, I would like to stress the importance of good cataloguing. If bibliographic
records are created logically, it is possible to manipulate the data by computer.

That's why we have to put a question to ourselves: how, why, and for whom we are
cataloguing.

Our investigation showed that the importance of authority data and of language
codes should be stressed in cataloguing. We noticed that our analysis would have
been easier if original titles had been recorded in a more consistent way, e.g.
in separate, repeatable fields.

With the help of authority files we can give our customers the possibility to
navigate in the bibliographic universe. Besides authority files of names (of
persons, corporate bodies and series) and subject descriptors, there is a need
for work authorities for collocating the same work under one heading.

With the help of language codes we can identify a manifestation as a
translation. Language codes are also important attributes for identifying
different expressions of a single work.
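The use of language codes can be sketched in Python as follows. The sketch assumes the coding seen in the Finnish examples in this paper, where 041 $a holds the language of the text and 041 $c the original language; the record layout and the function name are hypothetical.

```python
def language_info(record):
    """Return (language of text, original language, is_translation) from field 041."""
    field = record.get("041", {})
    text_lang = field.get("a", "")
    original_lang = field.get("c", "")
    # Assumption: a record is a translation when $c is present and differs from $a.
    is_translation = bool(original_lang) and original_lang != text_lang
    return text_lang, original_lang or text_lang, is_translation


print(language_info({"041": {"a": "ita", "c": "swe"}}))  # ('ita', 'swe', True)
print(language_info({"041": {"a": "swe"}}))              # ('swe', 'swe', False)
```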

During the project, the role of functions became more and more important. The
function or role of a person or corporate body is usually indicated in the
description only. In an environment structured according to the FRBR model, a
function statement in the main or added entry field would be very helpful.
Search systems and the design of hit lists could make good use of function
statements. In addition, our users could benefit from the function statement in
their bibliographic navigation. That's why we suggested that the functions
should not be optional.

Suggestions for continuing work

After finishing the project, many aspects are still open for further
investigation.

One important topic concerns relationships, which provide additional information
for the user in making connections between the entity found and other entities
that are related to that entity. Relationships give new aspects for display
designers. They also offer new ways to create navigation possibilities for the
user.

We have to find out what kinds of relationships we can find in the descriptions.
Unfortunately, relationships are mainly indicated in the description, and
textual information is very much language dependent. In the near future this
problem may be partly solved by different kinds of identification numbers, which
link different entities with each other. In addition, the role of coded
information, subject headings, classification and authority files is worth
examining.