
Draft of November 15, 2003


Mixed Content and Mixed Metadata

Information Discovery in a Messy World


Caroline R. Arms

Library of Congress

caar@loc.gov


William Y. Arms

Cornell University

wya@cs.cornell.edu



Overview


As digital libraries grow in scale, heterogeneity becomes a fact of life. Content comes in a bewildering variety of formats; it is organized and managed in innumerable different ways. Metadata comes in a similarly broad variety of formats; its quality and completeness vary extensively. This diversity has considerable impact on the effectiveness of searching, browsing, and all aspects of information discovery.


When faced with heterogeneity, the conventional approach is to attempt to eliminate it, either by enforcing standardization or by converting disparate forms to a common lingua franca. Both alternatives are important, but both have limitations as the scale and diversity of collections increase. Digital libraries must find other ways to assemble materials from diverse collections, with differing metadata or with none, and provide coherent information discovery services to users.


To understand how this may be possible, it is important to distinguish the overall process of information discovery from the specific act of searching a catalog or index. The process by which an intelligent person discovers information includes a wide range of exploration of which searching and browsing, in the narrow sense of scanning lists, are only parts. If information discovery is considered synonymous with searching, the problem of heterogeneity is probably insuperable. By considering the full process of how users discover information, many of the difficulties can be overcome.


Mixed content, mixed metadata


This paper looks at the issues of heterogeneity in two parts. The first part explores why digital libraries have to wrestle with mixed content, why mixed metadata is inevitable, and the problems that this poses for information retrieval. The second part describes some techniques that are proving successful for information discovery in this messy new world.




Searching: the legacy of history


We begin by considering the legacy of history. Historically, developments in information discovery have emphasized the specific task of searching catalogs and indexes. Librarianship has concentrated on descriptive metadata; computer science on algorithms for information retrieval; abstracting and indexing services have drawn from both fields.


Many of the metadata systems that we use today were originally developed when the underlying resources were in physical form. In this context, descriptive metadata has two functions: for searching where the content itself cannot be indexed, and to describe the resources well enough that the user has a good understanding of the material before going to the labor of obtaining an actual copy.


The emphasis on metadata for searching made excellent sense in the days of printed materials accessed via manual indexes or early online systems, but does not necessarily apply to the digital libraries of today. In 1841, when Sir Anthony Panizzi formulated his Ninety-One Rules for cataloguing printed books, the book stacks of the British Museum Library were closed to readers. If a reader has to wait hours for a book to be retrieved from the stacks, it is vital to have an accurate description so as to be confident of requesting the correct item. One hundred and twenty years later, when the first computer-based abstracting and indexing services were developed for scientific and professional information, information resources were still physical items. Initially, computer costs were so high that users could rarely afford more than a single search. In the very early services, the index records were held on magnetic tape, user queries were batched together, and a single batch was processed once a day. Because of cost, these services were aimed primarily at researchers, with an emphasis on comprehensive searching; that is, the objective was high recall. The typical user was a medical researcher or lawyer who would pay good money to be sure of finding everything relevant to a topic. The aim was to achieve this high recall through a single, carefully formulated search.


Although there are wide differences in the details, the approaches that were developed for library catalogs and early information services have much in common: careful rules for human cataloging and indexing, heavily structured metadata, controlled vocabularies, and subject access via subject headings or classification schemes. These factors led to powerful but complex systems. They were designed with the underlying assumption that the users would be trained in their use or supported by professional librarians.


As recently as the early 1990s, most services had the following characteristics:


(a) Resources were separated into categories of related materials, such as the monographs in research libraries, papers in medical journals, social science data sets, or newspaper articles. Each category was organized, indexed and searched separately.



(b) Catalogs and indexes were built on tightly controlled metadata standards, such as MARC, MeSH headings, etc. With human-generated metadata, success depends on the consistency with which the metadata is generated and the skill of the user in formulating queries that match the terminology and the structure provided, hence the emphasis on metadata quality and formal training for both cataloguers and users.


(c) Search engines used Boolean operators and fielded searching. These techniques are effective in exploiting the precise vocabulary and structure that are inherent in a rich metadata record.

(d) Query languages and search interfaces assumed a trained user. The combination of complex metadata standards with search engines that offer fielded retrieval and Boolean operators is powerful, but not intuitive. To achieve consistently good results requires an understanding of both the search syntax and the underlying metadata: its elements and the rules that govern the values those elements can take.


(e) Most resources were physical items.



Today, each of these characteristics has changed dramatically.


The demand for mixed content


As digital libraries have become larger, they have begun to amalgamate materials from categories that were previously managed separately, and users have grown to expect one-stop access to information. In the past, libraries separated materials into categories, typically by mode of expression, sometimes by physical form. Thus the Library of Congress has separate divisions, such as Prints and Photographs, Manuscripts, Geography and Maps, etc. Each division manages a relatively homogeneous collection. Separate practices have been developed to catalog or index each category of information, using different tools and processes, each tailored to the nature of the materials and, in some cases, an assumed group of users. So long as the distinction between the categories is clear, the division poses no problem. For instance, no experienced user is confused that the National Library of Medicine provides a catalog of MARC records for its book collection and the Medline index to journal articles.


Specialist users tolerated having to search different sources for different categories of content, particularly when the content itself was held in different locations. However, to students, the general public, and scholars in areas not aligned with the category boundaries, the divisions were often frustrating and confusing. Once the content could be made available online, the appeal of being able to search across categories was obvious. Some digital libraries have been established explicitly to amalgamate materials from sources and categories that were previously managed separately. For example, the NSF's National Science Digital Library (NSDL) is collecting materials of value to scientific education, irrespective of their format or provenance [NSDL]. To give an appreciation of the variety, four NSDL collections are based at Cornell. They comprise: (a) Atlas, which has data sets of volcanoes, earthquakes, and other information that is used for the study of the continents, (b) the Reuleaux collection of digitized versions of mechanical models from the nineteenth century, (c) the Laboratory of Ornithology, with collections of sound recordings, images and videos of birds and many other animals, and (d) mathematical theorems and proofs. These are just a few of the hundreds of highly diverse collections that the NSDL already includes.


Even with more conventional library materials, similar diversity arises when the materials are brought together in a single digital library. American Memory at the Library of Congress includes millions of digital items of many different types: photographs, posters, published books, personal papers of presidents, maps, sound recordings, motion pictures, and much more [American Memory]. Materials have been digitized from the collections of the Library of Congress and, through the LC/Ameritech competition and other collaborations, from those of around thirty other libraries, museums, and historical societies. The library traditionally uses different methods to organize, index and present the various categories of material. Books have individual catalog records; collections of manuscripts are described by finding aids; geographical coverage is a fundamental field in describing maps; the richness of cataloging applied to photographs varies widely depending on the significance of the items and economic feasibility.


Every user of a digital library such as the NSDL or American Memory is trying to find things. Users want to explore the digital collections as a whole, without needing to learn different techniques for different categories of material. Yet the conventional methods of searching and browsing are poorly adapted for mixed content. Organizing lists of books by title or publication date makes sense. For maps, the geographical coverage is more important than title. For sheet music, it may be important to treat the first line or the chorus as if it were a title. Large numbers of records with the same subject terms for similar photographs can overwhelm the user, who may not find the other significant items related to the topic of interest.


Mixed content means mixed metadata


Given that mixed content is inevitable and that information discovery systems must reach across many formats and genres, a natural impulse is to seek for a unifying cataloguing and indexing standard. The dream would be a single, all-embracing metadata standard that suits every category of material and is adopted by every collection. However, this is an illusion. Mixed metadata appears to be as inevitable as mixed content. Moreover, it is inevitable that there is no useful metadata for many items.


There are good reasons why different metadata formats are used for different categories of resource. The descriptive details that matter for various categories of library materials are different. Consider books, journal articles, prints and photographs, manuscripts, music, maps, recorded sound, and so on. All libraries face the challenge that maps are different from photographs, and that sound recordings differ from journal articles. All libraries are faced with decisions of granularity, whether to index every article in a serial, or every photograph in a box. A set of photographs of a single subject taken on the same occasion may be impossible to distinguish usefully through metadata that will aid discovery. The user is best served by a group of thumbnails. Purely digital forms of expression, such as software, datasets, simulations, and web sites, call for different practices. Web sites often present the additional challenge of how to describe a resource that is continually changing.


When assembling a digital library from many sources, it becomes clear that metadata always reflects an intended context or use. The description of the 47,000 pieces of sheet music registered for copyright between 1870 and 1885 is brief, with an emphasis on the music genre and instrumentation; the audience in mind is musicians. In contrast, the 3,042 pieces of sheet music in an American Memory collection of Historic American Sheet Music drawn from the Rare Book, Manuscript, and Special Collections Library at Duke University were selected to present a significant perspective on American history and culture, and the cataloging includes detailed description of the illustrated covers and the advertising. In the NSDL, many of the best-managed collections were not intended for educational use; a taxonomy of animal behavior designed for researchers is of no value to school children.


Reconciling the variety of formats and genres would be a forbidding task if it were purely a matter of schemas and guidelines, but there is a deeper and more fundamental source of mixed metadata: the social context within which information is created and used. Recent history is littered with metadata projects that were technically excellent but failed to achieve widespread adoption for social and cultural reasons.


One social factor is economic. Well-funded fields, such as medical research, have resources to abstract and index individual items (journal articles) and to maintain tools such as controlled vocabularies and subject headings, but the rich disciplines are the exceptions. Even major research libraries cannot attempt to catalog every item. Typically, they catalog monographs individually, but archival collections are usually described by finding aids. Photographs provide a good example. The Prints & Photographs Division of the Library of Congress uses several approaches to limit the cost of describing pictorial resources. It uses catalog records for groups of pictures and creates very brief records for items in large collections. Other institutions take advantage of the archival finding aid structure to keep item-level description of individual photographs to a minimum. There are many reasons why the variety of descriptive practice is reasonable, but the fundamental reason is cost.


A second social factor is history. Large catalogs and indexes represent an enormous investment. The investment is much more than the costs of creating metadata. It includes the accumulated expertise of users, including reference librarians, and the development of computer systems. For instance, abstracting and indexing services such as Medline, Inspec or Chemical Abstracts were developed independently for the separate disciplines with little cross-fertilization. Unsurprisingly, each of these services has developed its own schemes for indexing and abstracting the journal articles that they cover. Whether this is intrinsic to the subject matter or an accident of history is hard to judge. It is clear, however, that any attempt to introduce a single unifying scheme would create a most unpopular upheaval.


Metadata consistency


The last decade or so provides many independent examples where benefits from metadata consistency have been recognized and steps taken to harmonize usage in limited areas. Many steps taken for particular reasons have contributed to a general trend towards metadata consistency, but only to the extent that economic value accrues to creators, distributors, custodians, or users of the content described. Value may come from reduced costs, the ability to reach new markets, or better services (as evidenced by usage or willingness to pay, depending on context).


Developments in the traditional library community


During the 1990s, two developments in relation to the MARC standard enhanced consistency. Format integration brought the variants of USMARC used for monographs, serials, music, visual materials, etc. into a single bibliographic format. This process followed a decision made in 1988 and was complete by 1995. "Format integration was envisioned as a means to simplify documentation, reduce redundancy across formats, enable catalogers to work more easily with multiple formats, and improve machine validation of MARC records." [http://wings.buffalo.edu/publications/mcjrnl/v3n2/glennan.html] In the late 1990s, the Library of Congress, the National Library of Canada, and the British Library agreed to pursue MARC harmonization in order to reduce the costs of cataloging, by making a larger pool of catalog records available for copy-cataloging. One outcome was the MARC21 format, which superseded USMARC and CAN/MARC. The motivation for these efforts was not explicitly to benefit users, but users have certainly benefited because systems are simpler to build when metadata elements are used consistently.


Several of the recent modifications to the MARC standard have been made with the aim of compatibility with other metadata schemas or interoperability efforts. Changes were made to support mappings between MARC and the Federal Geographic Data Committee's Content Standards for Geospatial Metadata, and mappings between MARC and the Dublin Core Metadata Element Set. Recently, changes were approved to support citations to journal articles in convenient machine-parsable form to support the OpenURL standard and to allow such detail to be preserved in conversions to MARC from schemas used in citation databases designed for journal articles.


More recently, the Network Development and MARC Standards Office at the Library of Congress has developed a new metadata schema in response to demand from the library community for a schema that is XML-based and compatible with MARC, but simpler. MODS (Metadata Object Description Schema) includes a subset of MARC elements and inherits MARC semantics for those elements. Particularly important aspects inherited from MARC, from the point of view of American Memory, include the ability to express the role of a creator (photographer, illustrator, etc.) and to specify placenames in explicitly tagged hierarchical form (e.g. <country><state><city>). A valuable simplification is in the treatment of types and genres for resources: a short list of high-level types is allowed in one element, and all other genre terms and material designators in another. Some constraints of the MARC syntax have been eliminated, including those that had proved most problematic for American Memory. MODS provides for much more explicit categorization of dates, and they can all be expressed in machine-readable encodings (iso8601 and w3cdtf). Being in XML, any MODS element can be tagged with the language of the element contents, and the full UNICODE character set is allowed. This permits the assembly of bilingual or multilingual records. MODS also permits the association of coordinates with a placename in the same <subject> element. Over the last few years, the descriptive needs for American Memory content for which MARC was not used have been harmonized. Based on this harmonization effort, a migration of all non-MARC descriptive records to MODS is expected over time. The treatment of types of resources in MODS is of particular appeal; the one consistent desire expressed by American Memory users is better capabilities for filtering by resource type, both in specifying queries and in organizing result lists.
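
To illustrate the kind of record this makes possible, the following sketch builds a skeletal MODS-like description with an explicit creator role, a w3cdtf-encoded date, and a hierarchical placename. It is a minimal illustration, not an authoritative MODS example; element names beyond those quoted above (for instance hierarchicalGeographic and originInfo) are drawn from the MODS schema as we understand it, and the record content is invented.

    # A minimal sketch of a MODS-like descriptive record (illustrative only).
    import xml.etree.ElementTree as ET

    MODS_NS = "http://www.loc.gov/mods/v3"
    ET.register_namespace("", MODS_NS)

    def q(tag):
        """Qualify a tag name with the MODS namespace."""
        return f"{{{MODS_NS}}}{tag}"

    mods = ET.Element(q("mods"))

    # Title
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = "View of the Capitol, Washington, D.C."

    # Creator with an explicit role (photographer, illustrator, etc.)
    name = ET.SubElement(mods, q("name"))
    ET.SubElement(name, q("namePart")).text = "Unknown photographer"
    role = ET.SubElement(name, q("role"))
    ET.SubElement(role, q("roleTerm"), type="text").text = "photographer"

    # Machine-readable date, using the w3cdtf encoding mentioned above
    origin = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin, q("dateCreated"), encoding="w3cdtf").text = "1902-05-14"

    # Hierarchical placename carried in a single <subject> element
    subject = ET.SubElement(mods, q("subject"))
    place = ET.SubElement(subject, q("hierarchicalGeographic"))
    ET.SubElement(place, q("country")).text = "United States"
    ET.SubElement(place, q("state")).text = "District of Columbia"
    ET.SubElement(place, q("city")).text = "Washington"

    print(ET.tostring(mods, encoding="unicode"))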


Federated searching vs. union catalog


Federated searching is a form of distributed searching in which one (client) system sends a query to several servers, each of which carries out a search on the indexes that apply to the data in its own collections and returns the results to the client, which combines and merges them for presentation to the user. This is sometimes called metasearch or broadcast search, because the query is broadcast to many servers. The library community has been making effective use of federated search using the protocol for search and retrieval developed in the early 1980s and first standardized as ANSI/NISO Z39.50 in 1988. The current version is also standardized as ISO 23950. Most use has been among library catalog systems and abstracting and indexing services.


Z39.50 is flexible and inclusive, as most international standards have to be to gain approval. In practice, its effectiveness is limited by incompatibility in either the metadata or the index configurations in the remote systems. In recent years, standard profiles for index configurations have been developed by the library community in the hope of persuading vendors to build more compatible systems. These profiles represent a compromise between what users (or libraries on their behalf) might hope for and what vendors believe can be built at a reasonable cost while delivering acceptable performance. The Bath Profile was originally developed in the UK and is now maintained by the National Library of Canada [http://www.nlc-bnc.ca/bath/tp-bath2-e.htm]. A comparable U.S. National Standard is under development under the auspices of NISO. Perhaps the key observation to make about these profiles is the focus on a few fields (Author, Subject, Title, Standard Number, Date of Publication). The highest level in the Bath profile for bibliographic search and retrieval also includes type of resource and language. A keyword search on any field is expected to cover everything else.


In the last few years, systems have emerged that attempt to incorporate a wider variety of resources into a federated search for library patrons. Often called portal applications, these usually attempt to use Z39.50 when the resources being searched support it, and other protocols or ad hoc methods as needed. These applications have to contend with many challenges in addition to inconsistent metadata. The portal system has to wait for responses from several servers and can only receive the first batch of results from each server before presenting results to a user. Busy users are frustrated by having to wait. Experienced users are frustrated by the inability to express complex queries. Another challenge is de-duplication when several sources return a record for the same item (say an article from a journal indexed by several services). Inconsistency in metadata makes it hard to cluster records that might be for the same work.
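
The broadcast-and-merge pattern behind these portal applications can be sketched as follows. This is an illustrative outline only: the two search functions stand in for Z39.50 or other protocol clients, and a real portal would also de-duplicate and cluster the merged records.

    # A minimal sketch of federated ("metasearch") broadcast and merge.
    import concurrent.futures as cf

    def search_catalog(query):          # placeholder for a Z39.50 target
        return [{"source": "catalog", "title": f"{query} (catalog record)"}]

    def search_article_index(query):    # placeholder for an A&I service
        return [{"source": "articles", "title": f"{query} (journal article)"}]

    TARGETS = [search_catalog, search_article_index]

    def federated_search(query, timeout=5.0):
        """Broadcast the query to every target; keep whatever returns in time."""
        merged = []
        with cf.ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
            futures = [pool.submit(target, query) for target in TARGETS]
            try:
                for future in cf.as_completed(futures, timeout=timeout):
                    try:
                        merged.extend(future.result())
                    except Exception:
                        pass   # a failing server should not block the user
            except cf.TimeoutError:
                pass           # give up on servers that have not answered in time
        return merged

    print(federated_search("coastal erosion"))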


American Memory and NSDL are both examples of a different approach to coherent access to distributed content. They follow the pattern of union catalogs in gathering metadata records from many sources and applying a common indexing strategy. The Open Archives Initiative Protocol for Metadata Harvesting was developed to support this approach to building services that provide access to distributed resources. Some of these services may be deliberately broad in scope while others cater to a specific audience and focus on a narrowly defined body of content. Both types of service will be limited by inconsistencies in metadata. What is interesting about this approach to building a service is the different sort of feedback that it provides to the supplier of the metadata. The items in American Memory primarily fall into a generic category known as cultural heritage, and a primary objective behind its development was to make the unique content owned by the Library of Congress available to the public. Hence, there is an interest in providing metadata that is compatible with the services that have compatible missions. What becomes clear in looking at the UIUC Digital Gateway to Cultural Heritage Materials [http://nergal.grainger.uiuc.edu/search/], OAIster [http://oaister.umdl.umich.edu/], or RLG Cultural Materials [http://cmi.rlg.org/] is that filtering by type of work is important to the users of those services too. As descriptive records are re-used in different contexts, the areas where consistency is valuable enough to be worth adjusting practice or developing better mappings may be more obvious.
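
The union-catalog pattern can be illustrated with a minimal harvesting sketch that pages through a repository's Dublin Core records using the OAI-PMH ListRecords verb and resumption tokens. The base URL below is a hypothetical data provider, and error handling and the central indexing step are omitted.

    # A minimal sketch of OAI-PMH harvesting for a union-catalog style service.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"
    BASE_URL = "https://example.org/oai"   # hypothetical data provider

    def harvest(base_url, metadata_prefix="oai_dc"):
        """Yield (identifier, title) pairs for every harvested record."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI + "record"):
                header = record.find(OAI + "header")
                identifier = header.findtext(OAI + "identifier")
                title = record.findtext(".//" + DC + "title")
                yield identifier, title
            # Follow the resumption token until the repository says it is done
            token = tree.findtext(".//" + OAI + "resumptionToken")
            if not token:
                break
            params = {"verb": "ListRecords", "resumptionToken": token}

    # for identifier, title in harvest(BASE_URL):
    #     print(identifier, title)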


Cross domain metadata: the Dublin Core


If MARC harmonization represents the efforts that have been made to reconcile closely related cataloging rules, Dublin Core represents the attempt to build a lingua franca, or at least a pidgin language, that can be used across domains. To comprehend how rapidly our understanding is changing, it is instructive to go back to the early days of the web. As recently as 1995, it was recognized that the methods of full text indexing that were used by the early web search engines, such as Lycos, would not scale up. The contemporary wisdom was that "... indexes are most useful in small collections within a given domain. As the scope of their coverage expands, indexes succumb to problems of large retrieval sets and problems of cross-disciplinary semantic drift. Richer records, created by content experts, are necessary to improve search and retrieval." [Weibel 1995] With the benefit of hindsight, we can now recognize that the opposite has happened. The web search engines have developed new techniques and have adapted to huge scale while cross-disciplinary metadata schemes have not.




For the first phase of the NSDL development, collections were encouraged to provide Dublin Core metadata for each item and to make these records available for harvesting via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). While this strategy enabled the first phase of the library to be implemented rapidly, it showed up some fundamental weaknesses of the Dublin Core approach. This early experience is described in [Arms 2003]. Metadata quality is highly variable. Although each component (Dublin Core and OAI-PMH) is intended to be simple, expertise is needed to understand the specifications and to implement them consistently, which places a burden on small, lightly staffed collections. Conversely, when collections have invested the effort to create fuller metadata (e.g., to one of the standards that are designed for learning objects), valuable information is lost when it is mapped into the NSDL variant of Dublin Core. The granularities and the types of the objects characterized by metadata vary greatly. For example, the results of a search may include records that correspond to individual items (e.g., web pages), to collections of items, or to entities (such as numerical datasets) that can be used only in the presence of suitable tools. The overall result has been disappointing information discovery.


The reason is simple. Search engines work by matching a query against the information
in the records being
searched. There is not much information in a Dublin Core record.


Information discovery in a messy world


In the first part of this paper we explored why mixed content and mixed metadata are inevitable in large-scale digital libraries, and described some work based on metadata consistency. However, as discussed earlier, with large-scale digital libraries, attempts to build uniform or consistent metadata for mixed content are limited in what they can hope to achieve. This section of the paper describes methods that are proving successful in the messy world of large-scale digital libraries.


Fortunately, the power of modern computing, which makes large-scale digital libraries possible, also provides new capabilities for information discovery in situations where there is mixed content and mixed metadata or none. Over the past decade, as computers and networks have become more powerful, it has become both feasible and desirable to seek for information across collections that have widely different characteristics or have been indexed using different metadata schemes. Two themes run through the methods that have been developed: brute force computation and making explicit use of the expertise of users.


(a) Brute force computation can be used to analyze the content of resources for features that provide useful clues, e.g., to index every word from textual materials, and to extract links and citations. Computing power can be used to combine information from various sources (e.g., a dictionary or thesaurus), to compute complex algorithms, to process huge amounts of data, or to provide flexible user interface services, e.g., visualizations.




(b) Powerful user interfaces and networks bring human expertise into the information discovery loop. They enable users to apply their knowledge and understanding through a continual process of exploring results, revising search strategies, and experimenting with varieties of queries and search strategies.


Advances in information retrieval


Search systems use computer algorithms to relate a user's description of an information need (known as a query) to descriptions of resources, which are held in computer indexes. Examples include library catalogs, abstracting and indexing services, web search engines, computer help systems, and so on. Historically, when collections were of physical materials, such as books or journal articles, the information in the indexes was created by experts following cataloguing or indexing rules.


When materials are in digital formats, as an alternative to creating metadata manually it is possible to extract information from the content by computer programs. Beginning in the 1960s, automated full-text indexing was developed as an information retrieval approach that uses no metadata [Salton 1983]. With full text indexing, the actual words used by the author are taken as the descriptors of the content. The philosophy is to treat every word in a document as a potential descriptor, and to measure the similarity between the terms in each document and the terms in the query. Full-text search engines provide ranked lists of how similar the terms in the documents are to those in the query.
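
The following sketch shows the term-vector idea in its simplest form: every word is a descriptor, and documents are ranked by the cosine similarity between their term vectors and the query's. The sample documents are invented, and production systems add term weighting (e.g., tf-idf) and many other refinements.

    # A minimal sketch of full-text, term-vector ranking.
    import math
    from collections import Counter

    documents = {
        "doc1": "maps of volcanoes and earthquakes of the continents",
        "doc2": "sound recordings and videos of birds",
        "doc3": "sheet music registered for copyright with notes on instrumentation",
    }

    def vector(text):
        """Raw term-frequency vector for a piece of text."""
        return Counter(text.lower().split())

    def cosine(v1, v2):
        dot = sum(v1[t] * v2[t] for t in v1)
        norm = math.sqrt(sum(c * c for c in v1.values())) * \
               math.sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    def rank(query):
        """Return documents ordered by similarity to the query."""
        q = vector(query)
        scores = {doc_id: cosine(q, vector(text)) for doc_id, text in documents.items()}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    print(rank("volcanoes and earthquakes"))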


As early as 1967, Cleverdon recognized that, in some circumstances, automated indexes could be at least as effective as those generated by skilled human indexers [Cleverdon 1967]. This counter-intuitive result is possible because an automated index containing every word in a textual document has more information than a catalog or index record created by hand. It may lack the quality control and structure of fields that are found in a catalog record, but statistically the much greater volume of information provided by the words that the author chose to describe a topic may be more useful than a shorter surrogate record.


By the early 1990s there were two well established methods for indexing and searching textual materials: fielded searching of metadata records and full-text indexing. Both built on the implicit expectation that information resources were divided into relatively homogeneous categories of material; search systems were tuned separately for each category. Until the development of the web, almost all information retrieval experiments studied homogeneous collections. For example, the classical Cranfield experiments studied papers in aeronautics [Cleverdon 1967]. More recently, when the TREC conferences in the 1990s carried out systematic studies of the performance of search engines [Vorhees 1999], the test corpora came from homogeneous sources, such as the Associated Press newswire, thus encouraging the development of algorithms that can be tuned to perform well on relatively homogeneous collections of documents.


Web search services combine a web crawler with a full-text indexing system. The first services were developed shortly after the release of the Mosaic browser in 1993. For example, the first version of Lycos used the Pursuit search engine developed by Mauldin at Carnegie Mellon [Mauldin 1997]. This was a conventional full-text system, using term ranking methods, which had done well in the TREC evaluations. The web search services were very popular from the start, but there were serious doubts about how they would manage as the scale of the web increased. There were two repeated complaints: simple searches resulted in thousands of hits, and many of the highly ranked hits were junk. Many developments have enabled the search services to improve their results, even as the web has grown spectacularly. Four, in particular, have general relevance beyond the specific but very important application of searching the web:


(a) Better understanding of why users seek for information.


(b) Relationships and context information.


(c) Multi-modal information discovery.


(d) User interfaces for exploring information.


The next four sections discuss each of these developments.


Understanding why users seek for information


The conventional measures of effectiveness, such as precision and recall, are based on the concept of relevance. This is a binary measure. A document is either relevant or not, and all relevant documents are considered equally important. With such criteria, the goal of a search system is to find all documents relevant to a query.


In their original Google paper, Brin and Page introduced a new criterion by which to evaluate the effectiveness of searching [Brin 1998]. They recognized that, with mixed content, some documents are likely to be much more useful than others. In a typical web search the underlying term vector model finds every document that matches the terms in the query, often hundreds of thousands. However, the user looks at only the most highly ranked batches of hits, rarely more than a hundred in total. Google's focus is on those first batches of hits. The traditional objective of high recall, i.e., finding all relevant documents, is not a goal. Indeed, the crawling strategy does not attempt to index all potentially useful documents.


With homogeneous content it is reasonable to assume that all documents are equally important. Therefore they are ranked by how similar they are to the query. With mixed content, many pages are relevant, but not all of them are useful to the user. Brin and Page give the example of a web page that contains three words, "Bill Clinton sucks." Using term matching to rank the hits, this page is undoubtedly an excellent match to the query "Bill Clinton". However, it is unlikely to be much use. Therefore, Google estimates the importance of each page, using criteria that are independent of any consideration of how well the page matches a query. The order in which pages are returned to the user is a combination of these two rankings: similarity to the query and importance.
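
The combination can be sketched as a weighted sum of a query-dependent similarity score and a query-independent importance score. The weighting below is purely illustrative; it is not Google's formula, and the scores are invented for the example.

    # A minimal sketch of combining query similarity with page importance.
    def combined_rank(hits, alpha=0.7):
        """hits: list of (page, similarity, importance), each score in [0, 1]."""
        scored = [(page, alpha * sim + (1 - alpha) * imp)
                  for page, sim, imp in hits]
        return sorted(scored, key=lambda item: item[1], reverse=True)

    hits = [
        ("bill-clinton-sucks.html", 0.99, 0.01),    # matches the terms, unimportant
        ("whitehouse-biography.html", 0.80, 0.90),  # matches and widely linked
    ]
    print(combined_rank(hits))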




Relationship and context


Information resources always exist in a context. Often the context is expressed as relationships among resources. A monograph is one of a series; an article cites other articles; customers who buy a certain book often buy related ones; reviews and annotations describe how people other than the authors view resources.


Science Citation Index was one of the pioneers in using relationships for information discovery [Garfield 1979]. The original purpose was to answer a simple question: which journal papers cite a given paper? For this purpose, Science Citation Index built a large database of citations extracted from the leading science journals. Once this database had been created it was used for many other purposes, such as identifying the most heavily cited papers and journals, trends in research, and so on.


On the web, hyperlinks provide relationships between pages that are analogous to citations between papers. Google's well-known PageRank algorithm estimates the importance of a web page by the number of other web pages that link to it, weighted by the importance of the linking pages and the number of links from each page. The Teoma search engine uses hyperlinks in a different way. After carrying out a text search, it analyses a set of several thousand of the highest ranking results and identifies authorities within that set: pages that many other pages link to.


These methods are examples of brute force computing. Calculating the PageRanks for billions of web pages, or the authorities for a results set while the user waits, requires iterative computations on a matrix with a row and a column for each page. Admittedly the matrices are very sparse, but still these require significant computation.

Google image search is an intriguing example of a system that relies entirely on context to search for images on the web. The content of images cannot be indexed reliably and the only metadata is the name of the file, but most images have considerable context. This context includes text in anchors that refer to the image, captions, terms in nearby paragraphs, etc. By indexing the terms in this contextual information, Google image search is often able to find useful images.


Citations and hyperlinks are examples of contextual information that is embedded within documents. Reviews and annotations are examples of external information. Amazon.com has been a leader in encouraging the general public to provide such information. The value of information contributed by outsiders depends on the reputation of the contributor. When a journal article is indexed in Medline, the value of the metadata is reinforced by the reputation of the National Library of Medicine for quality control.


Multimodal information discovery




With mixed content and mixed metadata, the amount of information about the various resources varies greatly. In addition, there are many useful features that can be extracted from some documents but not all. For example, a <title> field in a web page provides very useful information about the content of the page, but not all pages have <title> fields. Citations and hyperlinks are other features that are valuable when present, but not all documents have them. These features can be considered clues about those resources that have them. Multimodal information discovery methods combine information about various features of the collections, using the information that is available about each item. This information may be extracted from the content, e.g., terms in textual documents, may be in the form of metadata, or may be contextual. Whatever is available is used.


The term "multimodal information discovery" was coined by Carnegie Mellon's
Informedia project. Informedia has a homogeneous collection


segments of video from
television news programs. However, because it uses purely automated ind
exing and
retrieval of text extracted from the content, the metadata that it has about the segments
varies greatly. The search and retrieval process combines clues derived from each of
these sources. In describing the multimodal approach, the head of the
project Howard
Wactlar wrote, "The fundamental premise of the research was that the integration of these
technolgies, all of which are imperfect and incomplete, would overcome the limitations
of each, and improve the overall performance in the information
retrieval task" [Wactlar
2000].


Web search services also use a multimodal approach to ranking. While the technical details of each service are trade secrets, the underlying approach is known to combine conventional concepts from full text indexing with ranking schemes that draw on information beyond the terms in a document, such as PageRanks. They use every clue that they can find that might help in retrieving and ranking web pages. Such clues include anchor text, terms in titles, words that are emphasized or in larger font, and the proximity of terms to each other.


Multimodal information discovery with mixed content and metadata can be contrasted
with methods used for homogeneous content where great efforts are made to have the
same information about
all resources; searching and browsing tools assume that all
documents are described in the same way and in the same detail.


The main difficulty with multimodal methods of information discovery is that there is no general theory to build on. The most successful systems have been developed for a specific category of material and users.


User interfaces for exploring results


Good support for exploring the results of a search can compensate for many weaknesses in the search service, including indifferent or missing metadata. It is no coincidence that Informedia, where the quality of metadata is inevitably poor, has been one of the key research projects in the development of user interfaces for browsing visual materials.




Perhaps the most profound change in information discovery that has happened in the past decade is that the full content of many resources is now online. The time to retrieve a resource has gone from minutes, hours or even days, to a few seconds. When full collections are online, browsing and searching are interwoven. The value of an information discovery service is the aggregate of the quality of the searching and the support that it provides for exploration after the search. As a direct consequence, users are very satisfied with information retrieval systems that are deficient by traditional criteria.


The web search engines provide a supreme example. They are inefficient by all the traditional measures, but they provide quick and direct access to information sources that the user can then explore independently. A commonly observed pattern is for a user to type a few words into a web search service, glance through the list of hits, examine a few, try a different combination of search terms, and examine a new set of hits. This rapid interplay between the user's expertise and the computing tools is totally outside the formal analysis of single searches that is still the basis of most information retrieval research. One of the reasons for the success of Google is that the search results are intimately linked to the browsing process. The concept of ranking pages by importance often results in home pages of web sites being ranked highly. The most highly ranked page from a search with the single word "Cornell" is the Cornell University homepage. From that page a user can explore the entire set of Cornell web sites. Google has been used as a method to begin browsing, rather than solely as a search engine.


The user interface to RLG's Cultural Heritage collections provides a different example [RLG]. It consists of a simple search system, supported by elegant tools for exploring the results found by a search. The presumption is that a user will use these tools to explore the collections. The search system acts as a filter that reduces the number of records that the user is offered to explore. In the past, a similar system would almost certainly have provided at least two search interfaces. One would offer advanced features that a skilled user could use to specify a very precise search that would result in a small set of highly relevant hits. Instead RLG has a simple search interface and an intuitive interface for exploring the results. Neither requires a skilled user. This is another instance of an information discovery service that depends upon powerful computing to organize and display results quickly.


Yet another area where Google has advanced the state-of-the-art in information discovery lies in the short records that are returned for each hit. These are sometimes called "snippets". Each is a short extract from the web page, which summarizes it so that the user can decide whether to view it. Most services generate the snippets when the pages are indexed. For a given page, the user always receives the same snippet, whatever the query. Google generates snippets dynamically, to include the words on the page that were matched against the query. Usually, these dynamic snippets are much more helpful in guiding the user's exploration.
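
Dynamic snippet generation can be sketched as extracting a window of text around a matched query term. This is only a schematic illustration of the idea, with invented page text; production services score and merge several candidate windows.

    # A minimal sketch of query-dependent ("dynamic") snippet generation.
    def dynamic_snippet(text, query, window=8):
        words = text.split()
        terms = {t.lower() for t in query.split()}
        for i, word in enumerate(words):
            if word.lower().strip(".,;:\"'()") in terms:
                start = max(0, i - window)
                end = min(len(words), i + window + 1)
                return "... " + " ".join(words[start:end]) + " ..."
        return " ".join(words[:2 * window]) + " ..."   # fall back to the opening

    page = ("The Laboratory of Ornithology holds collections of sound "
            "recordings, images, and videos of birds and many other animals.")
    print(dynamic_snippet(page, "bird videos"))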


Case study: the NSDL




The NSDL provides an excellent example of many of the approaches discussed above. The library explicitly set out to support a spectrum of interoperability, recognizing that the content and the metadata about that content would vary enormously [Arms 2002].


The centerpiece of the architecture is an NSDL repository, which is intended to hold everything that is known about every item of interest. The long-term aim is to combine the information in this repository with information that can be extracted from content and contextual information. As with all real systems, as of 2003, the implementation provides only a small subset of the capabilities that are envisioned eventually. In the initial phases, the repository holds only metadata in a limited range of formats; considerable emphasis has been placed on Dublin Core records, both item-level and collection-level. The first search service combines fielded searching of these records with a full-text index of those textual documents that are openly accessible for indexing.

As discussed above, the limited information in a Dublin Core record does not lend itself to powerful information discovery services. Initially, the addition of full-text indexing of some documents is a mixed blessing because the ranking methods have not yet been tuned to accommodate the variety. A second problem lies with the snippets, which have been derived from the Dublin Core records; they are confusing when a hit is returned based on its content, not its metadata.


Three improvements are planned for the near term: expansion of the range of metadata formats that are accepted, improved ranking, and dynamic generation of snippets. In the medium term, the major development is the addition of contextual information, particularly annotations and relationships. For an educational digital library, recommendations based on practical experience in using the resources are extremely valuable. The expectation is that users will contribute an ever-increasing proportion of the information about resources in the NSDL repository.


Finally, the NSDL has a broad program of interface developments. Various experiments are under way to enhance the exploration of the collections; visualization tools are particularly promising. The target audiences are so broad that several portals into the same digital library are planned. Moreover, we hope and expect that users will discover NSDL resources in many ways, not only by using the tools that the NSDL provides. One natural development is to expose the contents of the repository to web crawlers, so that users of web search engines will be led to NSDL resources. Another is a browsing tool that enables users to see whether resources found in other ways are in the NSDL. A simple application of this tool would begin with a search in Google. When a page of results is returned, the user clicks the tool and each URL on the page that references a resource in the NSDL has a logo appended to it. Clicking on the logo takes the user to the corresponding record in the repository.
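
The core of such a browsing tool can be sketched as a simple check of result URLs against the repository. The set of repository URLs below is a hypothetical placeholder; the real tool would query the NSDL repository service and rewrite the result page in the browser.

    # A minimal sketch of flagging search-result URLs that are in the repository.
    NSDL_URLS = {
        "http://example.org/atlas/volcanoes",       # hypothetical repository entries
        "http://example.org/birdsongs/warbler",
    }

    def annotate(result_urls):
        """Return (url, in_repository) pairs for a page of search results."""
        return [(url, url in NSDL_URLS) for url in result_urls]

    results = [
        "http://example.org/atlas/volcanoes",
        "http://example.org/unrelated/page",
    ]
    for url, in_nsdl in annotate(results):
        print(url, "-> NSDL record" if in_nsdl else "")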


Implications for the future


In summary, as digital libraries grow larger, information discovery increasingly has the
following characteristics:




(a) Mixed content is the norm. Resources with highly diverse content, of many formats and genres, are grouped together. Mixed metadata is inevitable. Many different metadata standards coexist. Many resources have no item-level metadata, some have limited metadata and some have excellent. Overall, the proportion of resources with high-quality, human-generated metadata is steadily declining.


(b) Most searching and browsing is done by the end-users themselves. Information discovery services can no longer assume that users are trained in the nuances of cataloguing standards and complex search syntaxes, nor that they are assisted by reference librarians.


(c) Information discovery tasks have changed, with high recall rarely the dominant requirement. The criteria by which information discovery is judged have to recognize the full process of searching and browsing with the human in the loop.



(d) Multimodal methods of searching and browsing have to become the norm. More information about a resource, whether metadata or the full content, provides more opportunities for information discovery, but only if services are designed to use everything that is available.


Perhaps the most important conclusion is that successful information discovery services depend on the inter-relationship between three areas: the underlying information (content and metadata), computing tools such as search engines, and the human-computer interfaces that are provided. This book is about the first of these three, the relationship between content and metadata, but none of them can be studied in isolation.


Acknowledgements


This paper synthesizes ideas that we have gained in working on American Memory at the Library of Congress and the NSDL at Cornell, and with many other colleagues. This work was supported in part by the National Science Foundation, under NSF grant 0127308.


References


[American Memory] http://memory.loc.gov/

[Arms 2003] William Y. Arms, Naomi Dushay, Dave Fulker, and Carl Lagoze, A Case Study in Metadata Harvesting: the NSDL. Library Hi Tech, vol. 21, no. 2, 2003.

[Arms 2002] William Y. Arms, et al., A Spectrum of Interoperability: The Site for Science Prototype for the NSDL. D-Lib Magazine, vol. 8, no. 1, January 2002. http://www.dlib.org/dlib/january02/arms/01arms.html

[Brin 1998] Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International World Wide Web Conference, Brisbane, Australia, 1998. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

[Cleverdon 1967] Cyril William Cleverdon, The Cranfield tests on index language devices. ASLIB Proceedings, vol. 19, no. 6, pp 173-194, June 1967.

[Garfield 1979] Eugene Garfield, Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. Wiley, New York, 1979.

[Mauldin 1997] Michael L. Mauldin, Lycos: Design Choices in an Internet Search Service. IEEE Expert, vol. 12, no. 1, pp 8-11, 1997.

[NSDL] http://www.nsdl.org/

[RLG] http://www.rlg.org/culturalres/

[Salton 1983] Gerald Salton and Michael J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[Vorhees 1999] E. Voorhees and D. Harman, Overview of the Eighth Text REtrieval Conference (TREC-8). 1999. http://trec.nist.gov/pubs/trec8/papers/overview_8.ps

[Wactlar 2000] Howard Wactlar, Informedia - Search and Summarization in the Video Medium. Proceedings of Imagina 2000 Conference, Monaco, January 31 to February 2, 2000. http://www.informedia.cs.cmu.edu/documents/imagina2000.pdf

[Weibel 1995] Stuart Weibel, Metadata: the foundations of resource description. D-Lib Magazine, vol. 1, no. 1, July 1995. http://www.dlib.org/dlib/July95/07contents.html