EVALUATING NATURAL LANGUAGE PROCESSING TECHNIQUES IN INFORMATION RETRIEVAL: A TREC PERSPECTIVE

Tomek Strzalkowski (1), Fang Lin (1), Jose Perez-Carballo (2), and Jin Wang (1)

(1) GE Corporate Research & Development
(2) School of Communication, Information and Library Studies, Rutgers University
ABSTRACT
Natural language processing techniques may hold a tremendous potential for overcoming the
inadequacies of purely quantitative methods of text information retrieval, but the empirical evi-
dence to support such predictions has thus far been inadequate, and appropriate scale evaluations
have been slow to emerge. In this chapter, we report on the progress of the Natural Language
Information Retrieval project, a joint effort of several sites led by GE Research, and its evaluation
in a series of Text Retrieval Conferences (TREC) conducted since 1992 under the auspices of the
National Institute of Standards and Technology (NIST) and the Defense Advanced Research
Projects Agency (DARPA).
1 INTRODUCTION AND MOTIVATION
We report on the current status of the ongoing project to explore the uses of natural language processing in full-text document retrieval. Since its inception in 1991 at New York University, the main thrust of this project has been to demonstrate that robust, if relatively shallow, NLP can help to derive a better representation of text documents for indexing and search purposes than the simple word- and string-based methods commonly used in statistical full-text retrieval. This was based on the premise that linguistic processing can uncover certain critical semantic aspects of document content, something that simple word counting cannot do, thus leading to a more accurate representation. The project's progress has been rigorously evaluated in a series of five Text Retrieval Conferences (TRECs) organized by the U.S. Government under the guidance of NIST and DARPA. In 1995, the project center moved to GE Research Labs, and its scope widened substantially to include several parallel efforts at GE, Rutgers, Lockheed Martin Corporation, and New York University. At TREC we demonstrated that NLP can be done efficiently on a very large scale, and that it can have a significant impact on IR. At the same time, it became clear that exploiting the full potential of linguistic processing is harder than originally anticipated. Our initial
effort was directed at using NLP techniques to extract meaningful indexing terms and to assist a statistical information retrieval system in making proper use of them. To that end we had to rethink how NLP was to be done when, instead of a couple of megabytes of CACM abstracts,[1] we faced hundreds and thousands of megabytes of newspaper stories, patent disclosure statements, and government documents. By TREC-3 (1994), we were able to parse and otherwise process gigabytes of free text, and indeed to show gains in retrieval effectiveness against our own non-NLP baseline. However, while these gains in both recall and precision were not negligible (we recorded 10-25% increases), no breakthrough occurred either. In other words, while we could demonstrate performance gains, particularly in precision, the simple word-based statistical retrieval was never far behind. Moreover, the results remained inconclusive because the baseline which we used as an initial benchmark turned out to be significantly lower than that of the leading statistical IR systems. The main achievement of these early TREC experiments was therefore their very scale, which showed that NLP, in however tentative a form, was now available for serious IR research. By TREC-4 (1995), we were finally ready to turn our attention once again to the performance issues.
Not surprisingly, we soon discovered (re-discovered?) that the amount of improvement in recall and precision which we could attribute to NLP appeared to be related to the quality of the initial search request, which in turn seemed unmistakably related to its length (cf. Table 1). Long and descriptive queries responded well to NLP, while terse one-sentence search directives showed hardly any improvement. This was not particularly surprising or even new, considering that the shorter queries tended to contain highly discriminating words, and that was just enough to achieve optimal performance. This still left all the descriptive, imprecise queries to deal with, although such queries are more of an exception than the norm in IR. On the other hand, comparing various evaluation categories at TREC, it was also quite clear that the longer queries simply did better than the short ones, no matter what their level of processing. Furthermore, while the short queries needed no better indexing than simple words, their performance remained inadequate, and one definitely could use better queries. Therefore, we started looking into ways to build full-bodied search queries, either automatically or interactively, out of users' initial search statements. This is a challenging undertaking, one that promises to move the NLP-IR relationship to a more advanced level. The fact that NLP-based indexing is evidently able to further amplify the gains resulting from using better queries is also worth exploring.
[1] CACM-3204 is a collection of technical abstracts from Communications of the ACM, about 2 MBytes, including 50 or so queries. It was used as one of the standard test collections prior to TREC.
TABLE 1. Performance gains attributed to NLP indexing vs. query length

                 TREC-3 runs (query ~ 70 terms)    TREC-4 runs (query ~ 10 terms)
  RUNS           Base       Best NL Index          Base       Best NL Index
  AdHoc          0.2271     0.2735                 0.2082     0.2272
   % change                 +20.0                             +9.1
  Routing        0.2578     0.3244                 0.2715     0.2913
   % change                 +25.8                             +7.3
TREC-5 (1996), therefore, marks a shift in our approach away from text representation issues and towards query development problems. While our TREC-5 system still performs extensive text processing in order to extract phrasal and other indexing terms, our main focus has moved to query construction using words, sentences, and entire passages to expand initial search specifications in an attempt to cover their various angles, aspects, and contexts. Based on the observation that NLP is more effective with highly descriptive queries, we designed an expansion method in which entire passages from related, though not necessarily relevant, documents were quite liberally imported into the user queries. This method appears to have produced a dramatic improvement in the performance of two different statistical search engines that we tested (Cornell's SMART and NIST's Prise), boosting the average precision by at least 40%. Similar improvements were reported for the University of Massachusetts' INQUERY system when run on our expanded queries (Callan, 1996).
The other notable new feature of our TREC-5 system is the stream architecture. It is a system of parallel indexes built for a given collection, with each index reflecting a different text representation strategy. These indexes are called streams because they represent different streams of data derived from the underlying text archive. A retrieval process searches all or some of the streams, and the final ranking is obtained by merging individual stream search results. This allows for an effective combination of alternative document representation and retrieval strategies, in particular various NLP and non-NLP methods. The resulting meta-search system can be optimized by maximizing the contribution of each stream. It is also a convenient vehicle for an objective evaluation of streams against one another.
2 NLP-BASED INDEXING IN INFORMATION RETRIEVAL
In information retrieval (IR), a typical task is to fetch relevant documents from a large archive in response to a user's query, and rank these documents according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their content, and (b) create an inverted index file (or files) that provides easy access to documents containing these terms. A subsequent search process will attempt to match preprocessed user queries against term-based representations of documents, in each case determining a degree of relevance between the two which depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the crucial problem remains that of an adequate representation of content for both the documents and the queries.
In term-based representation, a document (as well as a query) is transformed into a collection of
weighted terms (or surrogates representing combinations of terms), derived directly from the doc-
ument text or indirectly through thesauri or domain maps. The representation is anchored on these
terms, and thus their careful selection is critical. Since each unique term can be thought of as adding a new dimension to the representation, it is equally critical to weigh terms properly against one another so that the document is placed at the correct position in the N-dimensional term space.[2]

[2] In a vector-space model, term weights are represented as coordinate values; in a probabilistic model, estimates of prior probabilities are used.
Our goal is to have documents on the same topic placed close together, while those on different topics are placed sufficiently far apart. This should hold for any topic, a daunting task indeed, which is additionally complicated by the fact that we often do not know how to compute term weights. The statistical weighting formulas based on term distribution within the database, such as tf*idf, are far from optimal, and the assumptions of term independence which are routinely made are false in most cases. The situation is even worse when single-word terms are intermixed with phrasal terms, as term independence becomes even harder to justify.
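As a point of reference, the sketch below shows one basic tf*idf variant; it is illustrative only, and the actual SMART weighting schemes used per stream are listed later in Table 3.

```python
# Sketch: a basic tf*idf term weight, tf(t, d) * log(N / df(t)).
# One of many variants; the weights used in the actual system followed SMART's schemes.
import math

def tf_idf(term, doc_terms, doc_freq, num_docs):
    """doc_terms: list of terms in one document; doc_freq: {term: number of docs containing it}."""
    tf = doc_terms.count(term)
    df = doc_freq.get(term, 0)
    return tf * math.log(num_docs / df) if df else 0.0

doc = ["joint", "venture", "joint", "venture", "company"]
df = {"joint": 50_000, "venture": 40_000, "company": 300_000}
print(tf_idf("joint", doc, df, num_docs=500_000))    # 2 * log(10)  ~ 4.61
print(tf_idf("company", doc, df, num_docs=500_000))  # 1 * log(5/3) ~ 0.51
```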
The simplest word-based representations of content, while relatively better understood, are usually inadequate, since single words are rarely specific enough for accurate discrimination and their grouping is often accidental. A better method, it may seem, is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, "joint venture" is an important term in the Wall Street Journal (WSJ henceforth) database, while neither "joint" nor "venture" is important by itself. In the retrieval experiments with the TREC database, we noticed that both "joint" and "venture" were dropped from the list of terms by the system because their inverted document frequency (idf) weights were too low. In large databases, it is quite common to eliminate very high frequency terms to conserve space, because of their minimal discriminating value. On the other hand, a phrase, even one made up of high frequency words, may make a good discriminator; therefore, the use of phrasal terms becomes not merely desirable but in fact necessary. This observation has been made by a growing number of IR researchers and practitioners; for example, many systems participating in TREC now use one or another form of phrase extraction.
There are a number of ways to obtain "phrases" from text. These include generating simple collocations, statistically validated N-grams, part-of-speech tagged sequences, syntactic structures, and even semantic concepts. Some of these techniques are aimed primarily at identifying multi-word terms that have come to function like ordinary words, for example "white collar" or "electric car", and at capturing other co-occurrence idiosyncrasies associated with certain types of texts. This simple approach has proven quite effective for some systems; for example, the Cornell group reported (Buckley, 1995) that adding simple collocations to the list of available terms can increase retrieval precision by as much as 10%.
Other, more advanced techniques of phrase extraction, including extended N-grams and syntactic parsing, attempt to uncover "concepts", which would capture underlying semantic uniformity across various surface forms of expression. Syntactic phrases, for example, appear to be reasonable indicators of content, arguably better than proximity-based phrases, since they can adequately deal with word order changes and other structural variations (e.g., "college junior" vs. "junior in college" vs. "junior college"). A subsequent regularization process, where alternative structures are reduced to a "normal form", helps to achieve the desired uniformity: for example, "college+junior" will represent a college for juniors, while "junior+college" will represent a junior in a college. A more radical normalization would also convert "verb object", "noun rel-clause", etc. into collections of such ordered pairs. This head+modifier normalization has been used in our system, and is further described in this chapter. In order to obtain head+modifier pairs of respectable quality, we used full-scale robust syntactic parsing. We need to note here that the use of even the fastest syntactic analysis tools severely pushes the limits of practicality of an information retrieval system because of the increased demand for computing power and storage.
At the same time, the gain in recall and precision which could be attributed to the higher quality phrases has not materialized in any consistent fashion.[3]

[3] For a broader comparison and analysis of NL-based indexing the reader is referred to Chapter 1 of this volume.
One possible explanation is that syntactic analysis is just not going far enough, or, perhaps more appropriately, that the semantic uniformity predictions made on the basis of syntactic structures (as in the case of head+modifier pairs) are less reliable than we had hoped for. Of course, the relatively low quality of parsing may be a major problem, although there is little evidence to support that. In other words, if we are shooting for some "semantic concept-based representation", are there other ways to get there?

This state of affairs has prompted us to take a closer look at the phrase selection and representation process. In TREC-3 we showed that an improved weighting scheme for compound terms, including phrases and proper names, leads to an overall gain in retrieval accuracy. The fundamental problem, however, remained the system's inability to recognize, in the documents searched, the presence or absence of the concepts or topics that the query was asking for. The main reason for this was, we noted, the limited amount of information that the queries could convey about the various aspects of the topics they represented. Therefore, we started experimenting with manual and automatic query building techniques. The purpose of this exercise was to devise a method for full-text query expansion that would allow for creating fuller (self-contained?) search queries such that: (1) the performance of any system using these queries would be significantly better than when the system is run using the original queries, and (2) the method could eventually be automated or semi-automated so as to be useful to a non-expert user. Our preliminary results from TREC-5 evaluations show that this approach is indeed very effective.
3 NLP IN INFORMATION RETRIEVAL: A PERSPECTIVE

Natural language processing has always seemed to offer the key to building an ultimate information retrieval system. Somehow we feel that the "bag-of-words" representations, prevalent among today's information retrieval systems, can hardly do justice to the complexities of the free, unprocessed text with which we have to deal. Some of the favorite examples include Venetian blind vs. blind Venetian, Poland is attacked by Germany vs. Germany attacks Poland, or car wash vs. automobile detailing. Natural language processing could provide solutions to at least some of these problems through lexical and syntactic analysis of text (e.g., Venetian is used as either an adjective or a noun), through the assignment of logical structures to sentences (e.g., Germany is a logical subject of attack), or through an advanced semantic analysis that may involve domain knowledge. Other important applications include discourse-level processing, resolution of anaphoric references, proper name identification, and more.

Unfortunately, a direct application of NLP techniques to information retrieval has met some rather severe obstacles, chief among which was a paralyzing lack of robustness and efficiency. Worse yet, the difficulties did not end with the linguistic processing itself but extended to the representation it produced: it wasn't at all clear how the complex structures could be effectively compared to determine relevance. A better approach, it seemed, was to use NLP to assist an IR system,
whether boolean, statistical, or probabilistic, in automatically selecting important terms, words
and phrases, which could then be used in representing documents for search purposes. This
approach provided extra maneuverability for softening any inadequacies of the NLP software
without incapacitating the entire system. Efficiency problems still prevented direct on-line processing of any large amount of text, but NLP could be gradually built into off-line database indexing.

There has been a considerable amount of interest in using NLP in information retrieval research, with specific implementations varying from word-level morphological analysis to syntactic parsing to conceptual-level semantic analysis. Some interesting insights have been gained; however, demonstrating the superiority of these techniques over simple statistical processing has proved harder than expected. One of the more comprehensive evaluations of NLP use in information retrieval was done by Fagan (1987), who used syntactic and 'statistical' phrases to index documents in five different test collections. He showed that phrases could be considerably more effective than single words in representing the semantic content of documents, but that phrases derived by statistical means were just as effective as those derived through parsing. Still, the overall improvements in retrieval effectiveness were relatively small (2-23%) and inconsistent, which may be attributed to the small scale of these experiments.
Further experiments with statistical phrases (Lewis & Croft, 1990) generally did not confirm these numbers, showing much smaller improvements (5-8%). Other notable attempts at using compound phrasal terms in indexing or retrieval include (Dillon and Gray, 1983), (Sparck Jones and Tait, 1984), (Smeaton & van Rijsbergen, 1988), (Metzler et al., 1989), (Ruge et al., 1991), and (Evans et al., 1993). Not all of these approaches were properly evaluated, and for those that were evaluated the results were generally discouraging. More advanced techniques aimed at overcoming the limitations of shallow linguistic processing, including semantic and discourse-level processing, proved unacceptably expensive and difficult to evaluate on established benchmarks, despite some spectacular results obtained in laboratory tests (e.g., Mauldin, 1991). It is interesting to see that these past attempts fell into one of two categories: either an advanced NLP system was applied to a small-scale task (e.g., Metzler et al.'s COP, or Mauldin's FERRET), or a relatively shallow NLP was applied to a larger task (e.g., Evans et al.'s CLARIT). There was essentially no middle ground to speak of, with the possible exception of DR-LINK (Liddy & Myaeng, 1993), an advanced conceptual IR system, which nonetheless failed to scale up its more ambitious features to TREC-level evaluations.
A common theme among the majority of NLP applications to information retrieval is to use linguistically motivated objects (stems, phrases, proper names, fixed terminology, lexical correlations, etc.) derived from documents and queries to create a 'value-added' representation, usually in the form of an inverted index. The bulk of this representation may still rest upon statistically weighted single-word terms, while additional terms (e.g., phrases) are included on the assumption that they can only make the representation richer and, in effect, improve the effectiveness of subsequent search. For example, if the search finds the phrase Venetian blind in a document, we should have more confidence in the relevance of this document than when Venetian and blind are found separately. This, however, does not seem to happen in any consistent manner. The problem is that phrases are not all alike, and their ability to reflect the content of the text varies greatly with the type of the phrase and its position within the text. Statistical weighting schemes developed for single-word terms, such as tf.idf, do not seem to extend to compound
terms. Moreover, compound terms derived through statistical means (e.g., co-occurrence, mutual
information) tend to behave differently from those derived with a grammar-based parser. A
retrieval model, which includes a weighting scheme for terms, is a crucial part of any information
retrieval system, and a wrong retrieval model can defeat even the most accomplished representa-
tion. Nonetheless, most work on NLP in IR concentrated on representation or compound term
matching strategies, with relatively little consideration given to term weighting and to scoring of
retrieved documents. Some commonly used strategies, where a phrase weight was a function of
weights of its components, did not produce uniform results (Fagan, 1987; Lewis and Croft,
1990). In fact, the lack of an established retrieval model which can handle linguistically-moti-
vated compound terms may be one of the more serious obstacles in evaluating the impact and fea-
sibility of natural language processing in information retrieval.
In recent years, we have noted a renewed interest in using NLP techniques in information retrieval, sparked in part by the sudden prominence, as well as the perceived limitations, of existing IR technology in rapidly emerging commercial applications, including on the Internet. This has also been reflected in what is being done at TREC: using phrasal terms and proper name annotations has become the norm among TREC participants, and a special interest track on NLP took off for the first time in TREC-5.

In the remainder of this chapter we discuss particulars of our present system and some of the observations made while processing TREC data. The above comments provide the background for situating our present effort, and the state of the art, with respect to where we should be in the future.
4 STREAM-BASED INFORMATION RETRIEVAL MODEL

The stream model was conceived to facilitate a thorough evaluation and optimization of various text content representation methods, including simple quantitative techniques as well as those requiring complex linguistic processing. Our system encompasses a number of statistical and natural language processing techniques that capture different aspects of document content, and combining these into a coherent whole was in itself a major challenge. Therefore, we designed a distributed representation model in which alternative methods of document indexing (which we call "streams") are strung together to perform in parallel. Streams are built using a mixture of different indexing approaches, term extraction and weighting strategies, and even different search engines.
The following term extraction methods correspond to some of the streams used in our system:
1. Elimination of stopwords: Original text words minus certain no-content and low-content stopwords are used to index documents. Included in the stopword category are closed-class words such as determiners, prepositions, pronouns, etc., as well as certain very frequent words.

2. Morphological stemming: Words are normalized across morphological variants (e.g., "proliferation", "proliferate", "proliferating") using a lexicon-based stemmer. This is done by chopping off a suffix (-ing, -s, -ment) or by mapping onto a root form in a lexicon (e.g., proliferation to proliferate).
3. Phrase extraction: Various shallow text processing techniques, such as part-of-speech tagging, phrase boundary detection, and word co-occurrence metrics, are used to identify relatively stable groups of words, e.g., joint venture.

4. Phrase normalization: "Head+Modifier" pairs are identified in order to normalize across syntactic variants such as weapon proliferation, proliferation of weapons, proliferate weapons, etc., and reduce them to a common "concept", e.g., weapon+proliferate.

5. Proper name extraction: Proper names are identified for indexing, including people's names and titles, location names, organization names, etc.
The final results are produced by merging ranked lists of documents obtained from searching all streams with appropriately preprocessed queries, i.e., phrases for the phrase stream, names for the names stream, etc. The merging process weights contributions from each stream using a combination that was found most effective in training runs. This allows for an easy combination of alternative retrieval and routing methods, creating a meta-search strategy which maximizes the contribution of each stream. The stream model is illustrated in Figure 1. Both Cornell's SMART version 11 (Buckley, Salton, 19xx) and NIST's Prise version 2 (Harman & Candella, 1991) were used as base search engines.
FIGURE 1. Stream organization concept. [Figure: text data is processed into parallel streams (stems, phrases, names, H+M pairs), each feeding its own index (index-1 through index-4); preprocessed search queries are matched against each index (match-1 through match-4) by a base search engine, and the individual results are merged into the final ranking.]
Among the advantages of the stream architecture we may include the following:

• Stream organization makes it easier to compare the contributions of different indexing features or representations. For example, it is easier to design experiments which allow us to decide if a certain representation adds information which is not contributed by other streams.
• It provides a convenient testbed to experiment with algorithms designed to merge the results obtained using different IR engines and/or techniques.
• It becomes easier to fine-tune the system in order to obtain optimum performance.
• It allows us to use any combination of IR engines without having to adapt them in any way.
The notion of combining evidence from multiple sources is not new in information retrieval. Sev-
eral researchers have noticed in the past that different systems may have similar performance but
retrieve different documents, thus suggesting that they may complement one another. It has been
reported that the use of different sources of evidence increases the performance of a hybrid sys-
tem (see for example, Callan et al., 1995; Fox et al.,1993; Saracevic & Kantor, 1988). Nonethe-
less, the stream model used in our system is unique in that it explicitly addresses the issue of
document representation as well as provides means for subsequent optimization.
5 ADVANCED LINGUISTIC STREAMS
5.1 Head+Modifier Pairs Stream

Our linguistically most advanced stream is the head+modifier pairs stream. In this stream, documents are reduced to collections of word pairs derived via syntactic analysis of text followed by a normalization process intended to capture semantic uniformity across a variety of surface forms; e.g., "information retrieval", "retrieval of information", "retrieve more information", "information that is retrieved", etc. are all reduced to the pair "retrieve+information", where "retrieve" is a head or operator, and "information" is a modifier or argument. It has to be noted that while the head-modifier relation may suggest semantic dependence, what we obtain here is strictly syntactic, even though the semantic relation is what we are really after. This means in particular that inferences of the kind where a head+modifier pair is taken as a specialized instance of the head are inherently risky, because the head is not necessarily a semantic head and the modifier is not necessarily a semantic modifier; in fact, the opposite may be the case. In the experiments that we describe here, we have generally refrained from semantic interpretation of the head-modifier relationship, treating it primarily as an ordered relation between otherwise equal elements. Nonetheless, even this simplified relationship has already allowed us to cut through a variety of surface forms, and to achieve what we thought was a non-trivial level of normalization. The apparent lack of success of linguistically motivated indexing in information retrieval may suggest that we still haven't gone far enough.
In this section we describe in detail how the head+modifier indexing has been generated in our system. Since several other researchers have used similar concepts in the past (Fagan; Dillon & Gray; Salton; Evans et al.), it will be necessary to relate the foregoing to these other approaches when comparing experimental results and the conclusions following from them.
In our system, the head+modifier pairs stream is derived through a sequence of processing steps that include:

• Part-of-speech tagging
• Lexicon-based word normalization (extended "stemming")
• Syntactic analysis with the TTP parser
• Extraction of head+modifier pairs
• Corpus-based disambiguation of long noun phrases
5.1.1 Part-of-speech tagging
Part-of-speech tagging allows for the resolution of lexical ambiguities in running text, assuming a known general type of text (e.g., newspaper, technical documentation, medical diagnosis, etc.) and a context in which a word is used. This in turn leads to a more accurate lexical normalization or stemming. It is also a basis for phrase boundary detection.

A part-of-speech tagger assigns a part-of-speech label to each word in a text depending on the labels assigned to the preceding words. Often, more than one part-of-speech tag is assigned to a single word, presumably reflecting some kind of ambiguity in the input. In the best-tag-only option, which we used in our experiments, only the top-ranked tag for each word is output. While alternative (but also less likely) readings of ambiguous sentences may be lost this way, the reduction in ambiguity translates into gains in speed and robustness for any subsequent process, including a parser.

We used a version of Brill's rule-based tagger (Brill, 1992), trained on Wall Street Journal texts, to preprocess the linguistic streams used by SMART. We also used BBN's stochastic POST tagger as part of our NYU-based Prise system. Both taggers are based on the Penn Treebank Tagset developed at the University of Pennsylvania, and have compatible levels of performance.
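For illustration only, the snippet below approximates the best-tag-only behavior with NLTK's Penn Treebank tagger; the system described here actually used Brill's tagger and BBN's POST.

```python
# Sketch: best-tag-only part-of-speech tagging over the Penn Treebank tagset.
# NLTK is used here only as a stand-in for the Brill / BBN POST taggers.
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

def tag_best_only(sentence: str):
    """Return one (word, tag) pair per token; alternative readings are discarded."""
    return nltk.pos_tag(nltk.word_tokenize(sentence))

print(tag_best_only("The Soviets have been notified"))
# e.g. [('The', 'DT'), ('Soviets', 'NNPS'), ('have', 'VBP'), ('been', 'VBN'), ('notified', 'VBN')]
```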
5.1.2 Lexicon-based word normalization
Word stemming has been an effective way of improving document recall, since it reduces words to their common morphological root, thus allowing more successful matches. On the other hand, stemming tends to decrease retrieval precision if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trimmer.[4]

The suffix trimmer performs essentially two tasks:

1. it reduces inflected word forms to their root forms as specified in the dictionary, and
2. it converts nominalized verb forms (e.g., "implementation", "storage") to the root forms of the corresponding verbs (i.e., "implement", "store").
[4] Dealing with prefixes is a more complicated matter, since they may have a quite strong effect upon the meaning of the resulting term; e.g., un- usually introduces explicit negation.
This is accomplished by removing a standard suffix, e.g., "stor+age", replacing it with a standard root ending ("+e"), and checking the newly created word against the dictionary, i.e., we check whether the new root ("store") is indeed a legal word. Below is a small example of text before and after stemming.
original:
While serving in South Vietnam, a number of U.S. Soldiers were reported
as having been exposed to the defoliant Agent Orange. The issue is veter-
ans entitlement, or the awarding of monetary compensation and/or medical
assistance for physical damages caused by Agent Orange.
stemmed:
serve south vietnam number u.s. soldier expose defoliant agent orange
veteran entitle award monetary compensate medical assist physical dam-
age agent orange
Please note that full proper names, such as South Vietnam and Agent Orange, are identified separately through the name extraction process described below. Note also that various "stopwords" (e.g., prepositions, conjunctions, articles, etc.) are removed from the text.
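A minimal sketch of such a dictionary-assisted suffix trimmer is given below; the suffix rules and the tiny lexicon are illustrative assumptions, not the actual rules used in the system.

```python
# Sketch: conservative, dictionary-assisted suffix trimming (illustrative only).
# LEXICON stands in for a real dictionary of legal root forms.
LEXICON = {"store", "implement", "proliferate", "serve", "expose", "entitle"}

# suffix to strip -> candidate root endings to try, a small illustrative subset
SUFFIX_RULES = [("ation", ["", "ate", "e"]), ("age", ["e", ""]),
                ("ment", [""]), ("ing", ["", "e"]), ("ed", ["", "e"]), ("s", [""])]

def trim(word: str) -> str:
    """Reduce a word to its root only if the result is a legal dictionary word."""
    if word in LEXICON:
        return word
    for suffix, endings in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for ending in endings:
                if stem + ending in LEXICON:   # accept only dictionary-confirmed roots
                    return stem + ending
    return word                                # otherwise leave the word untouched

print(trim("storage"))         # -> store
print(trim("implementation"))  # -> implement
print(trim("serving"))         # -> serve
```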
5.1.3 Syntactic analysis with TTP
Parsing reveals finer syntactic relationships between words and phrases in a sentence, relationships that are hard to determine accurately without a comprehensive grammar. Some of these relationships do convey semantic dependencies; e.g., in Poland is attacked by Germany the subject+verb and verb+object relationships uniquely capture the semantic relationship of who attacked whom. The surface word order alone cannot be relied on to determine which relationship holds. From the outset, we assumed that capturing semantic dependencies may be critical for accurate text indexing. One way to approach this is to exploit the syntactic structures produced by a fairly comprehensive parser.
TTP (Tagged Text Parser) is based on the Linguistic String Grammar developed by Sager (1981). The parser currently encompasses some 400 grammar productions, but it is by no means complete. The parser's output is a regularized parse tree representation of each sentence, that is, a representation that reflects the sentence's logical predicate-argument structure. For example, logical subject and logical object are identified in both passive and active sentences, and noun phrases are organized around their head elements. The parser is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the face of ill-formed input or under severe time pressure.[5] TTP has been shown to produce parse structures which are no worse than those generated by full-scale linguistic parsers when compared to hand-coded Treebank parse trees (Strzalkowski & Scheyen, 1996).
[5] When parsing the TREC-3 collection of more than 500 million words, we found that the parser's speed averaged between 0.17 and 0.26 seconds per sentence, or up to 80 words per second, on a Sun SparcStation 10.
TTP is a full grammar parser, and initially it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence. If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode in which it will try to "fit" the parse. While in the skip-and-fit mode, the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of the input in order to restart processing at the next unattempted constituent. In other words, the parser will favor reduction over backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions. The fragments skipped in the first pass are not thrown out; instead, they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure. Full details of the TTP parser have been described in the TREC-1 report (Strzalkowski, 1993a), as well as in other works (Strzalkowski, 1992; Strzalkowski & Scheyen, 1996).

As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical-level ambiguity must be removed from the input text prior to parsing. This is another reason for using part-of-speech tagging: in order to streamline the processing, we perform morphological normalization of words on the tagged text, but before parsing. This is possible because the part-of-speech tags retain the information about each word's original form. Thus the sentence The Soviets have been notified is transformed into the/dt soviet/nps have/vbp be/vbn notify/vbn before parsing commences. The tags are read as follows: dt is determiner, nps is proper name, vbp is tensed plural verb, and vbn is past participle.
5.1.4 Extracting head+modifier pairs

Syntactic phrases extracted from TTP parse trees are head+modifier pairs. The head in such a pair is the central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. It should be noted that the parser's output is a predicate-argument structure centered around the main elements of various phrases. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. This also gives the pair-based representation sufficient flexibility to effectively capture content elements even in complex expressions. There are of course exceptions. For example, the three-word phrase "former Soviet president" would be broken into two pairs, "former president" and "Soviet president", both of which denote things that are potentially quite different from what the original phrase refers to, and this fact may have a potentially negative effect on retrieval precision. This is one place where a longer phrase appears more appropriate. Below is a small sample of head+modifier pairs extracted (proper names are not included):
original text:
While serving in South Vietnam, a number of U.S. Soldiers were reported
as having been exposed to the defoliant Agent Orange. The issue is veter-
ans entitlement, or the awarding of monetary compensation and/or medical
assistance for physical damages caused by Agent Orange.
head+modifier pairs:
damage+physical, cause+damage, award+assist, award+compensate,
compensate+monetary, assist+medical, entitle+veteran
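The sketch below illustrates the four pair types on a toy, hand-built predicate-argument structure; the nested-dictionary format is an assumption made for the example and is not TTP's actual output format.

```python
# Sketch: extracting head+modifier pairs from a simplified predicate-argument structure.
def extract_pairs(node, pairs=None):
    """Walk a tree of {'head', 'left', 'right', 'subject', 'object'} nodes."""
    if pairs is None:
        pairs = []
    head = node["head"]
    for adj in node.get("left", []):            # (1) head noun + left adjective/noun adjunct
        pairs.append(f"{head}+{adj}")
    for right in node.get("right", []):         # (2) head noun + head of its right adjunct
        pairs.append(f"{head}+{right['head']}")
        extract_pairs(right, pairs)
    if "object" in node:                        # (3) main verb + head of its object phrase
        pairs.append(f"{head}+{node['object']['head']}")
        extract_pairs(node["object"], pairs)
    if "subject" in node:                       # (4) main verb + head of the subject phrase
        pairs.append(f"{head}+{node['subject']['head']}")
        extract_pairs(node["subject"], pairs)
    return pairs

# "physical damages caused by Agent Orange" (stemmed; proper names handled separately)
clause = {"head": "cause",
          "subject": {"head": "orange", "left": ["agent"]},
          "object": {"head": "damage", "left": ["physical"]}}
print(extract_pairs(clause))
# -> ['cause+damage', 'damage+physical', 'cause+orange', 'orange+agent']
```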
5.1.5 Corpus-based disambiguation of long noun phrases
The notorious structural ambiguity of nominal compounds remains a serious difficulty in obtaining quality head+modifier pairs. What it means is that word order information cannot be reliably used to determine relationships between words in complex phrases, which is required to decompose longer phrases into meaningful head+modifier pairs. In order to cope with ambiguity, the pair extractor looks at the distribution statistics of the compound terms to decide whether the association between any two words (nouns and adjectives) in a noun phrase is both syntactically valid and semantically significant. For example, we may accept language+natural and processing+language from "natural language processing" as correct; however, case+trading would make a mediocre term when extracted from "insider trading case". On the other hand, it is important to extract trading+insider to be able to match documents containing phrases such as "insider trading sanctions act" or "insider trading activity". Phrasal terms are extracted in two phases. In the first phase, only unambiguous head+modifier pairs are generated, while all structurally ambiguous noun phrases are passed to the second phase "as is". In the second phase, the distributional statistics gathered in the first phase are used to predict the strength of alternative modifier-modified links within the ambiguous phrases. For example, we may have multiple unambiguous occurrences of "insider trading", but very few of "trading case". At the same time, there are numerous phrases such as "insider trading case", "insider trading legislation", etc., where the pair "insider trading" remains stable while the other elements change, and significantly fewer cases where, say, "trading case" is constant and the other words change.
The phrase decomposition procedure is performed after the first phrase extraction pass, in which all unambiguous pairs (noun+noun and noun+adjective) and all ambiguous noun phrases are extracted. Any nominal string consisting of three or more words, of which at least two are nouns, is deemed structurally ambiguous. In the TREC corpus, about 80% of all ambiguous nominals were of length 3 (usually 2 nouns and an adjective), 19% were of length 4, and only 1% were of length 5 or more. The phrase decomposition algorithm has been described in detail in (Strzalkowski, 1995). The algorithm was shown to provide about 70% recall and 90% precision in extracting correct head+modifier pairs from noun groups of 3 or more words in TREC collection texts. In terms of the total number of pairs extracted unambiguously from the parsed text, the disambiguation step recovers an additional 10% to 15% of pairs, all of which would otherwise be either discarded or misrepresented.
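A minimal sketch of the second-phase decision follows, assuming pair counts collected during the first (unambiguous) pass; the counts and the simple frequency threshold are illustrative simplifications of the algorithm in (Strzalkowski, 1995).

```python
# Sketch: decomposing a structurally ambiguous nominal using distributional
# statistics of pairs observed unambiguously in the first extraction pass.
from itertools import combinations

PAIR_COUNTS = {("trading", "insider"): 412,   # (head, modifier) -> unambiguous count
               ("case", "trading"): 3,
               ("case", "insider"): 2}

def decompose(words, min_count=10):
    """Keep only pairs whose unambiguous support exceeds a threshold."""
    pairs = []
    for modifier, head in combinations(words, 2):   # in English NPs the head follows its modifier
        support = PAIR_COUNTS.get((head, modifier), 0)
        if support >= min_count:
            pairs.append(f"{head}+{modifier}")
    return pairs

print(decompose(["insider", "trading", "case"]))
# -> ['trading+insider']   ('case+trading' and 'case+insider' are rejected as weak)
```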
5.2 Simple Noun Phrase Stream
In contrast to the elaborate process of generating the head+modifier pairs, unnormalized noun groups are collected from part-of-speech tagged text using a few regular expression patterns. No attempt is made to disambiguate, normalize, or get at the internal structure of these phrases, other than the stemming which has been applied to the text prior to the phrase extraction step. The following phrase patterns have been used, with phrase length arbitrarily limited to a maximum of 7 words:
1. a sequence of modifiers (adjectives, participles, etc.) followed by at least one noun, such as: "cryonic suspension", "air traffic control system";
2. proper noun sequences modifying a noun, such as: "u.s. citizen", "china trade";
3. proper noun sequences (possibly containing '&'): "warren commission", "national air traffic controller".
The motivation for having a phrase stream is similar to that for head+modifier pairs, since both streams attempt to capture significant multi-word indexing terms. The main difference is the lack of normalization, which makes the comparison between these two streams particularly interesting.
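A rough sketch of pattern 1 applied to POS-tagged, stemmed text is shown below; the token format, tag names, and regular expression are illustrative assumptions.

```python
# Sketch: collecting simple noun phrases (pattern 1: modifiers followed by at least
# one noun) from word/TAG text with a single regular expression.
import re

TAGGED = "the/dt cryonic/jj suspension/nn of/in air/nn traffic/nn control/nn system/nn"

# one or more adjective/participle/noun modifiers, then a final noun
# (the 7-word length cap used in the real system is not enforced in this toy pattern)
NP = re.compile(r"((?:\S+/(?:jj|vbn|vbg|nn)\s+)+\S+/nn)")

def noun_phrases(tagged: str):
    return [" ".join(tok.split("/")[0] for tok in match.split())
            for match in NP.findall(tagged)]

print(noun_phrases(TAGGED))
# -> ['cryonic suspension', 'air traffic control system']
```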
5.3 Name Stream
Proper names, of people, places, events, organizations, etc., are often critical in deciding the relevance of a document. Since names are traditionally capitalized in English text, spotting them is relatively easy, most of the time. Many names are composed of more than a single word, in which case all words that make up the name are capitalized, except for prepositions and such, e.g., The United States of America. It is important that all names recognized in text, including those made up of multiple words, e.g., South Africa or Social Security, are represented as single tokens, and not broken into component words, e.g., South and Africa, which may turn out to be different names altogether by themselves. On the other hand, we need to make sure that variants of the same name are indeed recognized as such, e.g., U.S. President Bill Clinton and President Clinton, with a degree of confidence. One simple method, which we use in our system, is to represent a compound name dually, as a compound token and as a set of single-word terms. This way, if a corresponding full name variant cannot be found in a document, matches on its component words can still add to the document score. A more accurate, but arguably more expensive, method would be to use a substring comparison procedure to recognize variants before matching.
In our system, names are identified by the parser and then represented as strings, e.g., south+africa. The name recognition procedure is extremely simple, in fact little more than the scanning of successive words labeled as proper names by the tagger ("np" and "nps" tags). Single-word names are processed just like ordinary words, except that stemming is not applied to them. We also made no effort to assign names to categories, e.g., people, companies, places, etc., a classification which is useful for certain types of queries (e.g., "To be relevant a document must identify a specific generic drug company"). In the TREC-5 database, compound names make up about 8% of all terms generated. A small sample of the compound names extracted is listed below:
right+wing+christian+fundamentalism, u.s+constitution, gun+control+legislation,
national+railroad+transportation+corporation, u.s+government, united+states,
exxon+valdez, plo+leader+arafat, suzuki+samurai+soft_top+4wd
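A sketch of this dual representation (compound token plus component words) is shown below; it assumes the name has already been spotted by the tagger/parser.

```python
# Sketch: represent a compound proper name both as one compound token and as its
# component words, so partial matches can still contribute to the document score.
def name_terms(name_words):
    """['South', 'Africa'] -> ['south+africa', 'south', 'africa'] (names are not stemmed)."""
    words = [w.lower() for w in name_words]
    compound = ["+".join(words)] if len(words) > 1 else []
    return compound + words

print(name_terms(["South", "Africa"]))
# -> ['south+africa', 'south', 'africa']
print(name_terms(["U.S", "President", "Bill", "Clinton"]))
# -> ['u.s+president+bill+clinton', 'u.s', 'president', 'bill', 'clinton']
# a document mentioning only "President Clinton" still matches two component terms
```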
5.4 Other Streams
5.4.1 Stems Stream
The stems stream is the simplest, yet the most effective, of all streams, and the backbone of the multi-stream model. It consists of stemmed single-word tokens (plus hyphenated phrases) taken directly from the document text (exclusive of stopwords). The stems stream provides the most comprehensive, though not very accurate, image of the text it represents, and therefore it is able to outperform the other streams that we have used thus far. We believe, however, that this representation model has reached its limits, and that further improvement can only be achieved in combination with other text representation methods. This appears consistent with the results reported at TREC.
In one variation of this stream, we used occurrences of hyphenated phrases in text as a guide for extracting other multi-word terms for indexing. Many semi-fixed expressions in English are occasionally hyphenated, which may indicate that their non-hyphenated occurrences should also be treated as unit terms: alien-smuggling, quick-freeze, roto-router, heart-surgery, cigarette-smoking, lung-cancer, per-capita, etc. No performance gain was recorded for the modified stream, but we believe that context-of-use information may need to be exploited to accurately identify domain-specific terms.
In another variation, we tried to identify unambiguous single-sense words and give them premium weights as reliable discriminators. Many words, when considered out of context, display more than one sense in which they can be used. When such words are used in text they may assume any of their possible senses, thus leading to undesired matches. This has been a problem for word-based IR systems, and has spurred attempts at sense disambiguation in text indexing (Krovetz, 199?). Another way to address this problem is to focus on words that do not have multiple-sense ambiguities, and treat these as special, because they seem to be more reliable as content indicators. This modification has produced a slightly stronger stream.
5.4.2 Unstemmed-Word Stream
In some experiments, notably in routing, we also used a plain text stream. This stream was obtained by indexing the text of the documents "as is", without stemming or any other processing, and running the unprocessed text of the queries against that index. The purpose of having this stream was to see if and when the lexical form of words can help to increase precision, while possibly sacrificing recall in some types of queries. In routing, where queries are extensively tuned through training, having multiple word forms allows, in theory at least, finer-grained adjustments.
5.4.3 Fragments Stream
For the routing experiments we also used a stream of fragments. This was the result of splitting the documents of the stems stream into fragments of constant length (1024 characters) and indexing each fragment as if it were a separate document. The queries used with this stream were the same as with the stems stream. Unlike in the regular stream, where entire documents were retrieved, here each document fragment was scored and ranked independently. The rank of a document was determined by the highest-scoring fragment contained in this document. This stream was motivated by the large body of work on passage-level retrieval (Callan, Kwok), and its primary purpose was to provide a benchmark for the locality stream described next.
5.4.4 Locality Stream
One limitation of the fragments-based representation is that the fragments are rigid, both in their size and their locations. Moving-window versions (e.g., Callan, 1995) are more flexible, but also tend to be quite expensive to run. Efficiency considerations have led us to investigate an alternative approach in which the maximum number of terms on which a query is permitted to match a document is limited to the N highest weighted terms, where N can be the same for all queries or may vary from one query to another. Note that this is not the same as simply taking the N top terms from each query. Rather, for each document for which there are M matching terms with the query, only min(M,N) of them, namely those which have the highest weights, will be considered when computing the document score; moreover, only the global importance weights for terms are considered (such as idf). Locality retrieval has been consistently useful in our TREC experiments, adding on average 10% to overall precision. In the most recent, still preliminary, experiments it appeared to have outperformed the fragments stream.
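A minimal sketch of the locality scoring rule, assuming global idf-style term weights; the data structures and the value of N are illustrative.

```python
# Sketch: locality scoring -- a document is scored on at most N of its matching
# query terms, namely the min(M, N) with the highest global (idf-like) weights.
def locality_score(query_terms, doc_terms, global_weight, n=3):
    matching = query_terms & doc_terms                         # the M matching terms
    top = sorted(matching, key=lambda t: global_weight.get(t, 0.0), reverse=True)[:n]
    return sum(global_weight.get(t, 0.0) for t in top)

weights = {"insider": 4.1, "trading": 3.2, "market": 2.0, "case": 1.1, "the": 0.01}
query = {"insider", "trading", "case", "market", "the"}
doc   = {"insider", "trading", "case", "market", "the"}
print(locality_score(query, doc, weights))   # 9.3: only insider, trading, market count
```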
5.4.5 Foreign Country Tags Stream
For queries involving references to foreign countries, either direct or indirect (e.g., goods of foreign manufacture), we added special tokens for each reference to the concept 'foreign'. The identification is done simply by looking up certain key words and phrases (e.g., foreign, other countries, international, etc.). Using a list of foreign countries and major cities acquired from the Internet, we tagged the documents in the collection with the same special "foreign" token whenever a foreign country or city was mentioned.
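A sketch of the document-side tagging step, assuming a country/city list loaded from a gazetteer; the place list, cue words, and token name below are illustrative stubs.

```python
# Sketch: append a special "foreign" token to a document's term list for every
# mention of a foreign country, major city, or 'foreign' cue word.
FOREIGN_PLACES = {"japan", "germany", "poland", "tokyo", "warsaw"}   # stub gazetteer
FOREIGN_CUES = {"foreign", "international", "overseas"}

def add_foreign_tags(tokens):
    tags = ["foreign" for t in tokens if t in FOREIGN_PLACES or t in FOREIGN_CUES]
    return list(tokens) + tags

doc = ["toyota", "plan", "new", "plant", "in", "poland"]
print(add_foreign_tags(doc))   # [..., 'poland', 'foreign']
```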
Only 10 queries (out of 45) were affected. Compared with the base run, 9 out of these 10 show improvement, and only one shows a modest 5% performance loss. On average, the precision gain is 27% for those queries. The result may suggest that type-tagging of terms in general, e.g., people, organizations, locations, etc., may lead to significant improvement in retrieval accuracy, a subject that has been the focus of much debate in the Tipster community. The challenge is to identify a sufficient number of basic categories such that they can be found in a number of different queries, and such that an efficient object identifier can be implemented for them.
TABLE 2. How different streams perform relative to one another (11-pt avg. precision)

  STREAM       short queries    long queries
  Stems        0.1682           0.2626
  Phrases      0.1233           0.2365
  H+M Pairs    0.0755           0.2040
  Names        0.0844           0.0608
For streams using SMART indexing, we selected optimal term weighting schemes from among a dozen or so variants implemented with version 11 of the system. These schemes vary in the way they calculate and normalize basic term weights (see Table 3). The selection of one scheme over another can have a dramatic effect on the system's performance. For details the reader is referred to Buckley (1991).
6 STREAM MERGING & WEIGHTING
The results obtained from the different streams are lists of documents ranked in order of relevance: the higher the rank of a retrieved document, the more relevant it is presumed to be. In order to obtain the final retrieval result, the ranked lists obtained from each stream have to be combined by a process known as merging or fusion. The final ranking is derived by calculating the combined relevance scores for all retrieved documents. The following are the primary factors affecting this process:

1. document relevancy scores from each stream;
2. retrieval precision distribution estimates within ranks from various streams, e.g., projected precision between ranks 10 and 20, etc.;
3. the overall effectiveness of each stream (e.g., measured as average precision on training data);
4. the number of streams that retrieve a particular document; and
5. the ranks of this document within each stream.
Generally, a stronger (i.e., better performing) stream will have more effect on shaping the final ranking. A document which is retrieved at a high rank from such a stream is more likely to end up ranked high in the final result. In addition, the performance of each stream within a specific range of ranks is taken into account. For example, if the phrases stream tends to pack relevant documents between ranks 10 and 20 of its retrieved documents (but not so much into ranks 1-10), we would give premium weights to the documents found in this region of the phrase-based ranking.
TABLE 3. Term weighting across streams using SMART
STREAM Weighting Scheme
Stems lnc.ntn
Phrases ltn.ntn
H+M Pairs ltn.nsn
Names ltn.ntn
Table 4 gives some additional data on the effectiveness of stream merging; note that long text queries benefit more from linguistic processing. Further details are available in a TREC conference article.
6.1 Inter-stream merging using score calculation
There are many ways in which the information obtained from different streams can be merged in order to obtain the final results. In our experiments with PRISE we tried several methods and chose the ones that seemed to be the best for our official results. The experiments that helped us choose the best merging technique were performed using a 500-MByte dry-run collection that was created using past TREC data for which we already had relevance information available. Some of the methods we tried include:
• Basic linear combinations: multiply the score obtained from each stream by a constant stream coefficient A(i), then add the scores of all streams together, i.e.,

  \sum_{i} A(i) \times score_i(d)
• Sorted groups: the top group of documents are those retrieved by N streams, the second group are those retrieved by N-1 streams, etc. At the bottom of this ranking would be the documents retrieved by only one stream. The final score is nstreams(d) plus an internal rank within the group.
• A combination of the above two: multiply each stream by a different constant and add all streams together; then, for each document, multiply the score by a number which is a function of the number of streams in which that document was retrieved. The final score is therefore calculated as:

  finalscore(d) = \sum_{i} A(i) \times score_i(d) \times (0.9 + nstreams(d))

where A(i) is a coefficient for stream i; score_i(d) is the normalized score of the document against the query within stream i; and nstreams(d) is the number of streams in which d has been retrieved (at no lower than a certain rank, e.g., 1000).
The last method achieved the highest performance in experiments with the PRISE system, with precision increases of up to 40% when compared to the performance of the basic stems stream.
TABLE 4. Precision improvements over stems-only retrieval

  Streams merged          short queries    long queries
                          % change         % change
  All streams             +5.4             +20.94
  Stems+Phrases+Pairs     +6.6             +22.85
  Stems+Phrases           +7.0             +24.94
  Stems+Pairs             +2.2             +15.27
  Stems+Names             +0.6             +2.59
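A small sketch of the third (combined) merging rule under the formula above; the stream names, coefficients, and normalized scores are illustrative.

```python
# Sketch: merging per-stream results with
# finalscore(d) = sum_i A(i) * score_i(d) * (0.9 + nstreams(d)).
def merge(stream_results, coefficients):
    """stream_results: {stream: {doc_id: normalized_score}}; returns docs ranked by final score."""
    docs = {d for scores in stream_results.values() for d in scores}
    final = {}
    for d in docs:
        weighted = sum(coefficients[s] * scores.get(d, 0.0)
                       for s, scores in stream_results.items())
        nstreams = sum(1 for scores in stream_results.values() if d in scores)
        final[d] = weighted * (0.9 + nstreams)
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

results = {"stems":   {"d1": 0.82, "d2": 0.40, "d3": 0.35},
           "phrases": {"d1": 0.55, "d3": 0.60},
           "pairs":   {"d2": 0.70}}
coeffs = {"stems": 4, "phrases": 3, "pairs": 3}
print(merge(results, coeffs))   # d1 ranks first: strong scores in two streams
```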
6.2 Inter-stream merging using precision distribution estimates
In another approach to merging, we used the following two principal sources of information about each stream to weigh their relative contributions to the final ranking:

• an actual ranking obtained from a training run (training data, old queries);
• an estimated retrieval precision at certain ranges of ranks.

Precision estimates are used to order the results obtained from the streams, and this ordering may vary at different rank ranges. Table 5 shows precision estimates for selected streams at certain rank ranges, as obtained from a training collection derived from TREC-4 data.
This method of stream merging was used primarily to merge results obtained from SMART-indexed streams. The final score of a document d is calculated using the following formula:

  finalscore(d) = \sum_{i=1}^{N} A(i) \times score_i(d) \times prec(ranks(i) \mid rank(i,d) \in ranks(i))

where N is the number of streams; A(i) is the stream coefficient; score_i(d) is the normalized score of the document against the query within stream i; prec(ranks(i)) is the precision estimate from the precision distribution table for stream i, taken at the rank range containing rank(i,d); and rank(i,d) is the rank of document d in stream i.
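A sketch of this scoring rule using the Table 5 estimates; the coefficients and the per-document input scores and ranks are illustrative.

```python
# Sketch: merging with precision-distribution estimates (cf. Table 5). A document's
# normalized score in each stream is weighted by the stream coefficient and by the
# estimated precision of the rank range into which the document falls for that stream.
PREC = {  # stream -> [(low_rank, high_rank, estimated precision)], from Table 5
    "stems":   [(1, 5, 0.49), (6, 10, 0.42), (11, 20, 0.37), (21, 30, 0.33)],
    "phrases": [(1, 5, 0.45), (6, 10, 0.38), (11, 20, 0.32), (21, 30, 0.28)],
}

def prec_estimate(stream, rank):
    for low, high, p in PREC[stream]:
        if low <= rank <= high:
            return p
    return 0.0   # beyond the table: contributes nothing (simplifying assumption)

def final_score(per_stream, coeff):
    """per_stream: {stream: (normalized_score, rank)} for one document."""
    return sum(coeff[s] * score * prec_estimate(s, rank)
               for s, (score, rank) in per_stream.items())

print(final_score({"stems": (0.82, 3), "phrases": (0.55, 12)},
                  {"stems": 4, "phrases": 3}))   # 4*0.82*0.49 + 3*0.55*0.32 ~ 2.14
```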
6.3 Stream coefficients

For merging purposes, streams are assigned numerical coefficients, referred to as A(i) above, that have two roles:

1. They control the relative contribution of a document score assigned to it within a stream when calculating the final score for this document. This applies primarily to streams producing normalized document scores, such as SMART.
2. They change stream-to-stream document score relationships for un-normalized ranking systems, e.g., PRISE.
An example of a coefficient structure is shown in Table 6 below, which summarizes the stream
coefficient structures used in the TREC-5 experiments. The coefficients are obtained empirically
to maximize the performance of a specific combination of streams; typically, a new combination was
created for a given collection, retrieval mode (ad-hoc vs. routing), and set of search engines.

TABLE 6. Stream merging coefficient structures used in TREC-5 (0 = stream not used)

    RUNS             stems   phrases   pairs   names   words   locality   frags
    ad-hoc SMART       4        3        3       1       0        0         0
    ad-hoc PRISE       1        0        3       1       1        4         1
    routing SMART      4        3        3       1       0        0         0
    routing PRISE      1        0        3       1       1        4         1
7 Query Expansion Experiments
7.1 Why Query Expansion?
In the opening section of this chapter we argued that the quality of the initial search directive, or
user's information need statement, is the ultimate factor in the performance of an information
retrieval system. This means that the query must provide a sufficiently accurate description of
what constitutes the relevant information, as well as how to distinguish it from related but not
relevant information. We also pointed out that today's NLP techniques are not advanced enough
to deal effectively with semantics and meaning, and instead they rely on syntactic and other sur-
face forms to derive representations of content.
The purpose of query expansion is therefore to make the user query resemble more closely the
documents it is expected to retrieve. This includes both content, as well as some other aspects
such as composition, style, language type, etc. If the query is indeed made to resemble a "typical"
relevant document, then suddenly everything about this query becomes a valid search criterion:
words, collocations, phrases, various relationships, etc. Unfortunately, an average search query
does not look anything like this, most of the time. It is more likely to be a statement specifying the
semantic criteria of relevance. This means that except for the semantic or conceptual resemblance
(which we cannot model very well as yet) much of the appearance of the query (which we can
model reasonably well) may be, and often is, quite misleading for search purposes. Where can we
get the right queries?
In today's information retrieval, query expansion usually pertains to content and is typically
limited to adding, deleting, or re-weighting terms. For example, content terms from documents
judged relevant are added to the query, while the weights of all terms are adjusted in order to
reflect the relevance information. Thus, terms occurring predominantly in relevant documents will
have their weights increased, while those occurring mostly in non-relevant documents will have
their weights decreased. This process can be performed automatically using a relevance feedback
method, e.g., Rocchio's (1971), with the relevance information either supplied manually by the
user (Harman, 1988), or otherwise guessed, e.g., by assuming the top 10 documents to be relevant
(Buckley et al., 1995). A serious problem with this content-term expansion is its limited ability
to capture and represent many important aspects of what makes some documents relevant to the
query, including particular term co-occurrence patterns and other hard-to-measure text features,
such as discourse structure or stylistics. Additionally, relevance-feedback expansion depends on
inherently partial relevance information, which is normally unavailable or unreliable. Other types
of query expansion, including general-purpose thesauri or lexical databases (e.g., WordNet), have
been found generally unsuccessful in information retrieval (cf. Voorhees & Hou, 1993;
Voorhees, 1994).
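For concreteness, a generic Rocchio-style reweighting step can be sketched as follows. This is the
textbook formulation, not the parameter settings used in our experiments; the alpha, beta and
gamma values shown are conventional defaults and the function name is ours.

    # Generic Rocchio feedback: q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant).
    def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        # All vectors are dicts mapping term -> weight.
        expanded = {t: alpha * w for t, w in query_vec.items()}
        for docs, sign, coef in ((rel_docs, +1, beta), (nonrel_docs, -1, gamma)):
            if not docs:
                continue
            for doc in docs:
                for t, w in doc.items():
                    expanded[t] = expanded.get(t, 0.0) + sign * coef * w / len(docs)
        # Terms whose weight drops to zero or below are removed from the query.
        return {t: w for t, w in expanded.items() if w > 0}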
An alternative to term-only expansion is full-text expansion, which we tried for the first time in
TREC-5. In our approach, queries are expanded by pasting in entire sentences, paragraphs, and
other sequences taken directly from any text document. To make this process efficient, we first
perform a search with the original, un-expanded queries (short queries), and then use the top N
(10 or 20) returned documents for query expansion. These documents are not judged for relevance,
nor assumed relevant; instead, they are scanned for passages that contain concepts referred to in
the query. Expansion material can be found in both relevant and non-relevant documents, benefitting
the final query all the same. In fact, the presence of such text in otherwise non-relevant documents
underscores the inherent limitations of distribution-based term reweighting used in relevance
feedback. Subject to some further "fitness criteria", these expansion passages are then imported
verbatim into the query. The resulting expanded queries undergo the usual text processing steps
before the search is run again.
Full-text expansion can be accomplished manually, as we did initially to test the feasibility of
this approach, or automatically, as we tried later with promising results. We first describe the
manual process, focusing on guidelines set forth in such a way as to minimize and streamline human
effort and lay the groundwork for eventual automation. We then describe our first attempt at
automated expansion, and discuss the results from both.
The initial evaluations indicate that queries expanded manually following the prescribed guidelines
improve the system's performance (precision and recall) by as much as 40%. This appears to be true
not only for our own system, but also for others: we asked other groups participating in TREC-5 to
run searches using our expanded queries, and they reported nearly identical improvements. At this
time, automatic text expansion produces less effective queries than manual expansion, primarily
due to the relatively unsophisticated mechanism used to identify and match concepts in the queries.
7.2 Guidelines for manual query expansion
We have adopted the following guidelines for query expansion. They were constructed to observe
realistic limits of the manual process, and to prepare the ground for eventual automation.
1. NLIR retrieval is run using the 50 original "short" queries.
2. The top 10 documents retrieved by each query are retained for expansion, giving 50 expansion
sub-collections, one per query.
3. Each query is manually expanded using phrases, sentences, and entire passages found in any of
the documents from this query's expansion sub-collection. Text can be both added and deleted, but
care is taken to ensure that the final query has the same format as the original, and that all
added expressions are well-formed English strings, though not necessarily well-formed sentences.
A limit of 30 minutes per query, in a single block of time, has been observed.
4. Expanded queries are sent through all text processing steps necessary to run them against the
multiple stream indexes.
5. Rankings from all streams are merged into the final result.
There are two central decision-making points that affect the outcome of the query expansion
process following the above guidelines. The first is how to locate text passages that are worth
looking at: it is impractical, if not downright impossible, to read all 10 documents, some quite
long, in under 30 minutes. The second is to decide whether to include a given passage, or a
portion thereof, in the query. To facilitate passage spotting, we used a simple word search, using
key concepts from the query to scan down the document text. Each time a match was found, the
surrounding text (usually the paragraph containing it) was read and, if found "fit", imported into
the query. We also experimented with various "pruning" criteria: passages could be either imported
verbatim into the query, or "pruned" of "obviously" undesirable noise terms. In evaluating the
expansion effects on a query-by-query basis, we later found that the most liberal expansion mode,
with no pruning, was in fact the most effective. This suggests that relatively self-contained text
passages, such as paragraphs, provide a balanced representation of content that cannot be easily
approximated by selecting only some words.
7.3 Automatic Query Expansion
Queries obtained through full-text manual expansion proved to be overwhelmingly better than the
original search queries, providing as much as a 40% precision gain. These results were sufficiently
encouraging to motivate us to investigate ways of performing such expansions automatically.
One way to approximate the manual text selection process, we reasoned, was to focus on those text
passages that refer to some key concepts identified in the query, for example, "alien smuggling"
for query 252 below.
The key concepts (for now limited to simple noun groups) were identified either by their pivotal
location within the query (in the Title field) or by their repeated occurrences within the query's
Description and Narrative fields. As in the manual process, we ran a "short" query retrieval, this
time retaining the top 100 documents returned by each query. An automated process then scans these
100 documents for all paragraphs that contain occurrences, including some variants, of any of the
key concepts identified in the original query. The paragraphs are subsequently pasted verbatim
into the query. The original portion of the query may be saved in a special field to allow
differential weighting. Finally, the expanded queries were run to produce the final result.
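A simplified sketch of this paragraph-picking step is given below. It uses plain substring matching
and blank-line paragraph segmentation, ignores the concept variants mentioned above, and the
function name and query format are illustrative rather than those of the actual system.

    # Sketch of automatic full-text expansion: append to the query every paragraph from the
    # top-ranked documents that mentions one of the query's key concepts.
    def expand_query(query_text, key_concepts, top_documents):
        expansion = []
        for doc_text in top_documents:                 # e.g., the 100 top-ranked documents
            for para in doc_text.split("\n\n"):        # crude paragraph segmentation
                lowered = para.lower()
                if any(concept.lower() in lowered for concept in key_concepts):
                    expansion.append(para)
        # The original query is kept separate from the pasted text so it can be weighted
        # differently in later processing.
        return query_text + "\n" + "\n".join(expansion)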
The above, clearly simplistic technique has produced some interesting results. Out of the fifty
queries we tested, 34 underwent the expansion. Among these 34 queries, we noted precision gains in
13 and precision losses in 18, with 3 more basically unchanged. However, for those queries where
the improvement did occur it was very substantial indeed: the average gain was 754% in 11-point
precision, while the average loss (for the queries that lost precision) was only 140%. Overall, we
still see a 7% improvement across all 50 queries (vs. 40%+ when manual expansion is used).
Our experiments show that selecting the right paragraphs from documents to expand the queries can
dramatically improve the performance of a text retrieval system. This process can be automated;
the challenge, however, is to devise more precise automatic means of "paragraph picking".
8 SUMMARY OF RESULTS
8.1 Ad-hoc runs
Ad-hoc experiments were conducted in several subcategories, including automatic, manual, and
using different sizes of databases and different types of queries. An automatic run means that
there was no human intervention in the process at any time. A manual run means that some
human processing was done to the queries, and possibly multiple test runs were made to improve
the queries. A short query is derived using only one section of a TREC-5 topic, namely the
DESCRIPTION field. A long query is derived from any or all fields in the topic. An example
TREC-5 query is shown below; note that the Description field is what one may reasonably expect
to be an initial search query, while the Narrative provides some further explanation of what
relevant material may look like. The Topic field provides a single concept of interest to the
searcher; it was
not permitted in the short queries.
<top>
<num> Number: 252
<title> Topic: Combating Alien Smuggling
<desc> Description:
What steps are being taken by governmental or even private entities world-wide to stop
the smuggling of aliens.
<narr> Narrative:
To be relevant, a document must describe an effort being made (other than routine border
patrols) in any country of the world to prevent the illegal penetration of aliens across bor-
ders.
</top>
Table 7 summarizes selected runs performed with our NLIR system on the TREC-5 database using
queries 251 through 300. The SMART baseline was produced by Cornell's team using version 11 of
the system, including proximity bi-gram phrase indexing. Table 8 gives the performance of
Cornell's (now Sabir Inc.) SMART system version 12, using the advanced Lnu.ltu term-weighting
scheme, proximity bi-gram phrases, and query expansion through automatic relevance feedback
(rel.fbk), on the same database and with the same queries. Sabir used our long queries to obtain
the long-query run. Note the consistently large improvements in retrieval precision attributed to
the expanded queries.
TABLE 7. Precision improvement in NLIR system

                  short queries  full queries  full queries  long queries  long queries  long queries
                  SMART base     SMART base    NL index      auto + NL     SMART base    man. + NL
    11pt. avg       0.1235         0.1771        0.2083        0.2220        0.2664        0.3176
    %change                        +43.0         +69.0         +80.0         +116.0        +157.0
    @5 docs         0.1422         0.2222        0.2667        0.2578        0.3778        0.3867
    %change                        +56.0         +88.0         +81.0         +166.0        +172.0
    @100 docs       0.0482         0.0653        0.0713        0.0709        0.0920        0.0998
    %change                        +35.0         +48.0         +47.0         +91.0         +107.0
    Recall          0.54           0.64          0.65          0.64          0.75          0.77
    %change                        +19.0         +20.0         +19.0         +39.0         +43.0

TABLE 8. Results for Cornell's SMART v. 12 (no NLP indexing)

                  short queries  full queries  full queries   long queries
                                                rel. feedback  manual
    11pt. avg       0.1499         0.2142        0.2416         0.2983
    %change                        +43.0         +62.0          +99.0
    @5 docs         0.2178         0.2889        0.2756         0.3600
    %change                        +33.0         +27.0          +65.0
    @100 docs       0.0578         0.0709        0.0771         0.0904
    %change                        +23.0         +33.0          +56.0
    Recall          0.58           0.64          0.70           0.73
    %change                        +10.0         +21.0          +26.0
8.2 Routing Runs
Routing is a process in which a stream of previously unseen documents is filtered and distributed
among a number of standing profiles, also known as routing queries. In routing, documents can be
assigned to multiple profiles. In categorization, a type of routing, a single best-matching profile
is selected for each document. Routing is harder to evaluate in a standardized setup than
retroactive retrieval because of its dynamic nature; therefore, a simulated routing mode has been
used in TREC. A simulated routing mode (TREC-style) means that all routing documents are available
at once, but the routing queries (i.e., terms and their weights) are derived with respect to a
different training database, specifically TREC collections from previous evaluations. This way, no
statistical or other collection-specific information about the routing documents is used in
building the profiles, and the participating systems are forced to make assumptions about the
routing documents just as they would in real routing. However, no real routing occurs, and the
prepared routing queries are run against the routing database much the same way they would be in
an ad-hoc retrieval. Documents retrieved by each routing query, ranked in order of relevance,
become the content of its routing bin.
In SMART routing, automatic relevance feedback was performed to build routing queries using the
training data available from previous TRECs. The routing queries, split into streams, were then
run against the stream-indexed routing collection. The weighting scheme was selected in such a way
that no collection-specific information about the current routing data was used. Instead,
collection-wide statistics, such as idf weights, were those derived from the training data. The
routing was carried out in the following four steps:
1. A subset of the previous TREC collections was chosen as the training set, and four index
streams were built. Queries were also processed and run against the indexes; for each query, 1000
documents were retrieved. The weighting schemes used were lnc.ltc for stems, ltc.ntc for phrases,
ltc.ntc for head+modifier pairs, and ltc.ntc for names.
2. The final query vector was then updated through an automatic feedback step using the known
relevance judgements. Up to 350 terms occurring in the most relevant documents were added to each
query. Two alternative expanded vectors were generated for each query using different sets of
Rocchio parameters.
3. For each query, the best-performing expansion was retained. These were submitted to NIST as
official routing queries.
4. The final queries were run against the four-stream routing test collection, and the retrieved
results were merged.
Table 9 summarizes the average precision obtained on the routing queries with the SMART streams
alone, the PRISE streams alone, and the combined streams.

TABLE 9. Average precision on 45 routing queries

    STREAMS                11pt. Precision   R-Precision
    SMART streams only         0.2755           0.3145
    PRISE streams only         0.2099           0.2473
    COMBINED streams           0.3023           0.3359
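Step 3 above can be made concrete with a small sketch that scores each candidate expansion on the
training data and keeps the better one. Here run_query stands in for the retrieval engine, and the
function names are illustrative; this is not the code of the actual system.

    # Non-interpolated average precision of a ranked list against a set of relevant document ids.
    def average_precision(ranked_docs, relevant):
        hits, total = 0, 0.0
        for i, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                total += hits / i
        return total / len(relevant) if relevant else 0.0

    # Keep, for one topic, the expanded query vector that performs best on the training collection.
    def pick_best_expansion(candidates, run_query, relevant):
        # candidates: alternative expanded query vectors; run_query: returns a ranked doc-id list.
        return max(candidates, key=lambda q: average_precision(run_query(q), relevant))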
9 CONCLUSIONS
We presented in some detail our natural language information retrieval system, consisting of an
advanced NLP module and a "pure" statistical core engine. While many problems remain to be
resolved, including the question of the adequacy of term-based representation of document content,
we attempted to demonstrate that the architecture described here is nonetheless viable. In
particular, we demonstrated that natural language processing can now be done on a fairly large
scale and that its speed and robustness have improved to the point where it can be applied to real
IR problems.
The main observation is that natural language processing is not as effective as we once hoped at
obtaining better indexing and better term representations of queries. Using linguistic terms, such
as phrases, head-modifier pairs, names, or even simple concepts, does help to improve retrieval
precision, but the gains remain quite modest. On the other hand, full-text query expansion works
remarkably well. Our main effort in the immediate future will be to explore ways to achieve at
least partial automation of this process. An initial experiment in this direction has been
performed as part of the NLP Track (the genlp3 run), and the results are encouraging.
ACKNOWLEDGEMENTS. We would like to thank Donna Harman of NIST for making her PRISE system
available to us since the beginning of TREC. Will Rogers provided valuable assistance in installing
updated versions of PRISE at NYU and Rutgers. We would also like to thank Ralph Weischedel for
providing, and assisting in the use of, the BBN part-of-speech tagger at
NYU. This paper is based upon work supported in part by the Advanced Research Projects
Agency under Tipster Phase-2 Contract 94-FI57900-000, and the National Science Foundation
under Grant IRI-93-02615.
REFERENCES
Buckley, Chris, Amit Singhal, Mandar Mitra, and Gerard Salton. 1995. "New Retrieval Approaches
Using SMART: TREC-4." In Proceedings of TREC-4.
Guthrie, Louise and James Leistensnider. 1996. "A Simple Probabilistic Approach to Classification
and Routing." Proceedings of the TIPSTER Text Program Phase II Workshop, sponsored by the Defense
Advanced Research Projects Agency, May 6-8, 1996.
Quinlan, J. Ross. 1993. C4.5: Programs for Machine Learning. San Mateo: Morgan Kaufmann.
Strzalkowski, Tomek and Jose Perez-Carballo. 1994. "Recent Developments in Natural Language
Text Retrieval." Proceedings of the Second Text REtrieval Conference (TREC-2), NIST Special
Publication 500-215, pp. 123-136.
Strzalkowski, Tomek, Jose Perez-Carballo and Mihnea Marinescu. 1995. "Natural Language
Information Retrieval: TREC-3 Report." Proceedings of the Third Text REtrieval Conference
(TREC-3), NIST Special Publication 500-225, pp. 39-53.
Strzalkowski, Tomek, Jose Perez-Carballo and Mihnea Marinescu. 1996. "Natural Language
Information Retrieval: TREC-4 Report." Proceedings of the Fourth Text REtrieval Conference
(TREC-4), NIST Special Publication 500-2xx.
Strzalkowski, Tomek. 1995. "Natural Language Information Retrieval." Information Processing
and Management, Vol. 31, No. 3, pp. 397-417. Pergamon/Elsevier.
Strzalkowski, Tomek and Peter Scheyen. 1993. "An Evaluation of TTP Parser: A Preliminary
Report." In H. Bunt and M. Tomita (eds.), Recent Advances in Parsing Technology. Kluwer
Academic Publishers, pp. 201-220.