Concepts across the Interspace: Information Infrastructure for Community Knowledge





Bruce R. Schatz

CANIS (Community Architectures for Network Information Systems) Laboratory

Graduate School of Library and Information Science

University of Illinois at Urbana-Champaign

schatz@uiuc.edu, www.canis.uiuc.edu


Abstract

A global information infrastructure for knowledge manipulation must support effective analysis to correlate related objects. The Interspace is the coming global network, where knowledge manipulation is supported by concept navigation across community spaces. We have produced a working Interspace Prototype, an analysis environment supporting semantic indexing on community repositories. Scalable technologies have been implemented for concept extraction and concept spaces, which use semantic indexing to facilitate concept navigation. These technologies have been tested on discipline-scale, real-world document collections. The technologies use statistical clustering, based on the contextual frequency of document phrases within a collection. Computer trends show that semantic indexing technologies will be practical for everyday use on community knowledge in the foreseeable future. Thus, concept navigation across community repositories will become a routine operation.


Keywords: Interspace, semantic indexing, scalable semantics, concept spaces, concept navigation, concept switching






The most popular service in the Net has always been community sharing, at whatever level of functionality the current technology supports. Technologies such as electronic mail, bulletin boards, moderated newsgroups, bibliographic databases, preprint services, and Web sites are increasing steps along this path. The closer the technology for sharing results in documents gets to the technology of composing the documents themselves, the more heavily used the community sharing mechanisms will be.

The waves of the Net illustrate the increasing levels of functionality. As shown in Figure 1, each wave builds on the previous, then establishes a new higher level of standard infrastructure. In the upswing of a wave, the fundamental research is being done for a new level of functionality. Prototype research systems begin in the trough and evolve into mass commercial systems in the peak. The functionality of the current wave is polished commercially in the downswing period for mass propagation before the start of the next wave.





Figure 1. Waves of the Net.



To users living in the Net when a new wave is cresting, the environment feels completely different. This has already occurred during the transition from packets, which are raw bits and files, to objects, which contain display and interaction software for groups of packets. Electronic mail in the ARPAnet was a transformational experience for users in the First Wave, as has been document browsing in the Internet for users in the Second Wave.

The transition about to occur will involve concepts, which contain indexing and meaning for groups of objects. Concepts are useful for analysis of the content rather than search of the form. Concept navigation in the Interspace will be the transformational experience for users in the Third Wave of the Net.

The First Wave was the level of access, of transmission of data. It began with the coming of the ARPAnet and evolved through large-scale distributed file systems, roughly 10 years on the upswing and 10 on the down. The focus was on packets of bits, transparently transferring them from one machine to another.

The Second Wave is the level of organization, of retrieval of information. It began with distributed multimedia network information systems, such as the Telesophy system [1]. Telesophy was featured in my invited talk at the 20th Anniversary Symposium for the ARPAnet in 1989, as the example of future technology for worldwide information spaces. That same year saw the initiation of the World-Wide Web project at CERN, which, when coupled with the NCSA Mosaic interface, became the technology that brought worldwide information spaces into everyday reality. The information wave took about 10 years to peak; we are just finishing the 5 years of the consolidation phase. The focus is now on documents of objects, pushing towards Internet-wide operating systems.



The Third Wave will be the level of analysis, of correlation of knowledge. It will focus on paths of searches. It will move past search of individual repositories, to analysis of information across sources and subjects. The standard protocols for this information infrastructure will support collections residing directly on users’ machines. The beginnings of these peer-peer protocols are already apparent in the popularity of music swapping services, such as Napster. To support analysis, they must evolve to provide semantic indexing as standard infrastructure.

The technology for the Third Wave currently exists in large-scale prototypes in research laboratories. This wave, the Interspace, will have distributed services to transfer concepts across domains, just as the ARPANET had distributed services to transfer files across machines and the Internet has distributed services to transfer objects across repositories. The Interspace provides protocols to interconnect logical spaces, just as the Internet provides protocols to interconnect physical machines.

Telesophy became the internal inspiration for Mosaic, through my service as scientific advisor for information systems at NCSA. This paper describes the Interspace Prototype, which was at roughly the same state in 1999 as was the Telesophy Prototype in 1989. There is a fully-fledged research system running in the laboratory, which has semantically indexed large-scale real-world collections and supports concept navigation across multiple sources. The technology is ready for widespread deployment that will catalyze the worldwide Interspace for concept navigation, much as Mosaic catalyzed the worldwide Internet for document browsing.

Towards The Interspace

In 1989, the Telesophy Prototype had made it clear that universal interlinked objects were technically feasible. That is, it was possible to create a worldwide information space of objects, which were interlinked and could be transparently navigated. Five years later, in 1994, NCSA Mosaic made it clear that this paradigm could be implemented efficiently enough so that it would become the mass standard for information infrastructure in the Net.

From trends in network infrastructure, it was clear that communities of interest would quickly form, with personal web sites dominating archival web sites. The same phenomenon had happened earlier with electronic bulletin boards versus electronic file archives. From trends in information retrieval, it was clear that soon the resulting volume of documents would cause web search to break down. The same phenomenon had happened when bibliographic databases exceeded a certain size relative to their conceptual density (about a million items for a scientific discipline).

A decade ago, these trends indicated that the Net would need to evolve beyond the Web, into an infrastructure that directly supported community repositories with semantic indexing. When there were billions of documents on-line, navigating across fixed links would no longer suffice for effective navigation. There would be so many relevant documents for any particular situation that fine-grained links to related documents would need to be created dynamically during user sessions. This would require automatically identifying documents containing related concepts. Thus, the basic network infrastructure would need to support universal interlinked concepts (in the Interspace), just as the current network infrastructure supports universal interlinked objects (in the Internet).

The Net of the 21st Century will radically transform the interaction with knowledge. Online information has always been dominated by data centers with large collections indexed by trained professionals. The rise of the Web has rapidly developed the technologies for collections of independent communities. In the future, online information will be dominated by small collections maintained and indexed by the community members themselves. The great mass of objects will be stored in these community repositories.

Building the Interspace requires generating semantic indexes for community repositories with interactive support adequate for amateur classifiers, then correlating these indexes across multiple sources with interactive support adequate for amateur navigators. Since there will be so many sources indexed by non-professional indexers, the infrastructure itself must provide substantial support for semantic indexing. Since the sources in the Net will be dominated by small community repositories, the typical interaction will be navigating through many sources, retrieving and correlating objects relevant to the particular session. The infrastructure itself must accordingly provide substantial support for information analysis.

The Interspace will be the first generation of the Net to support analysis. For the first time, the standard infrastructure of the Net will support direct interaction with abstraction. The Internet supports search of objects, e.g. matching phrases within documents. The Interspace, in contrast, supports correlation of concepts, e.g. comparing related phrases in one repository to related phrases in another repository. There will be a quantum jump in functionality across the waves, from syntactic search to semantic correlation. Users will navigate within spaces of concepts, to identify relevant phrases, before they navigate within networks of objects, as at present.

The information infrastructure must explicitly support correlation across communities, by concept switching from specialty to specialty. Concept switching involves navigating from the repository of one community into the repository of another, by traversing bridges across related concepts. This will enable a distributed group of persons to form a specialized community living on the Net, yet be able to communicate effectively with related groups, via translation of concepts across specialties. The infrastructure would relate terminology from community to community, enabling navigation at the level of concepts.

The Interspace Prototype

A decade-long program of research, 1990-2000, has produced a working Interspace Prototype. This is an analysis environment with new protocols for information infrastructure, supporting semantic indexing on community collections.

The concept of the Interspace grew out of my experience with the Telesophy Prototype and was explicitly mentioned in the conclusions of my 1990 Ph.D. dissertation, evaluating the wide-area network performance of information spaces [2]. The Worm Community System, 1990-1994, was a complete implementation of an analysis environment in molecular biology, with custom technology pre-Web [3]. The algorithms for semantic indexing were developed, 1994-1998, as part of the Illinois Digital Library project [4].

Finally, the flagship contract in the DARPA Information Management program, 1997-2000, was specifically for the Interspace Prototype [5]. The two-fold goal was to develop a complete analysis environment, then test it on large real collections to demonstrate that the concept technology was indeed generic, independent of subject domain. These goals were successfully achieved, as described in subsequent sections.

The Interspace Prototype is composed of a suite of indexing services, which supports semantic indexing for community collections, and an analysis environment, which utilizes these indexes to navigate within and across collections at abstract levels. Our suite of components reproduces automatically, for any collection, equivalents to standard physical library indexes.



Some of these indexes represent abstract spaces for concepts and categories above concrete collections of units and objects. A “concept space” records the co-occurrence between units within objects, such as words within documents or textures within images. Much like a subject thesaurus, it is useful for suggesting other words while searching (if your specified word doesn’t retrieve desired documents, try another word which appears together with it in another context). A “category map” records the co-occurrence between objects within concepts, such as two documents with significant overlap of concept words. Much like a subject classification, it is useful for identifying clusters of similar objects for browsing (to locate which sub-collection should be searched for desired items).
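
To make the distinction concrete, the sketch below (in Python, with invented names; the Prototype itself was implemented in Smalltalk) builds a toy category map by relating documents that share concept words. It is a minimal stand-in for the actual clustering algorithms, assuming each document has already been reduced to its set of concept phrases.

    from itertools import combinations

    def category_map(doc_concepts, threshold=0.3):
        """Toy category map: relate documents with significant overlap
        of concept words.

        doc_concepts: dict mapping a document id to its set of concept phrases.
        Returns document pairs whose Jaccard overlap exceeds the threshold,
        ranked by decreasing overlap.
        """
        pairs = []
        for a, b in combinations(doc_concepts, 2):
            union = doc_concepts[a] | doc_concepts[b]
            if union:
                overlap = len(doc_concepts[a] & doc_concepts[b]) / len(union)
                if overlap >= threshold:
                    pairs.append((a, b, overlap))
        return sorted(pairs, key=lambda p: -p[2])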

The information infrastructure for uncovering emergent patterns from federated repositories relies on “scalable semantics”. This technology can index arbitrary collections in a semantic fashion. Scalable semantics attempts to be the golden mean for information retrieval: semantics pulls towards deep parsing for small collections, while scalable pulls towards shallow parsing for large collections. The parsing extracts generic units from the objects, while the indexing statistically correlates these uniformly across sources. For text documents, the generic units are noun phrases, while the statistical indexes record the co-occurrence frequency: how often each phrase occurs with each other phrase within a document within the collection.

We believe that concepts are the generic level for semantic protocols for the state of technology in the foreseeable future. Concepts provide some semantics with automatic indexing, and noun phrase extraction appears computationally feasible for large collections of diverse materials. Concept spaces actually provide semi-automatic categorization. They are useful for interactive retrieval, with suggestion by machine but selection by human. All our experience indicates that augmentation, not automation, of human performance is what is technologically feasible for semantic interoperability in digital libraries.

Implementing the Prototype

The Interspace Prototype demonstrates that it is technologically feasible to support concept navigation utilizing scalable semantics. The services in the Interspace Prototype generate the semantic indexes for all collections within the spaces. A common set of concepts is extracted across all services and these concepts are used for all indexes, such as concept spaces and category maps. The user sees an integrated analysis environment in which the individual indexes on the individual collections can be transparently navigated. For example, one can easily move from a category map to a concept space, to concepts, to documents, then back up again to higher levels of abstraction from the concepts mentioned in the document. See Sidebar 1 for examples.

The Interspace Prototype comprises an analysis environment across multiple indexes of multiple sources. Each source is separately processed, but all are available within the environment. When a source is processed, the concepts are uniformly extracted from each object. With documents, every noun phrase is parsed out and normalized, with its position being recorded. These noun phrases are then used for a series of indexes at different levels of abstraction. Having a common set of phrases implies that a single phrase can be referenced uniformly within multiple indexes. As the phrases represent concepts, this enables a concept to be transparently navigated across indexes and across sources.

Our current concept extractor was developed using standard components for noun phrase extraction over general text documents. We experimented with several research and commercial systems, before developing an effective parser from public domain source code. The noun phrase extractor is based upon the Brill tagger [6] and the noun phrase identification rules of NPtool [7]. The parser itself has three major parts: tokenization, part-of-speech tagging, and noun-phrase identification [8].
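
To illustrate that three-stage pipeline, here is a minimal sketch in Python, using the NLTK toolkit as a stand-in for the Brill tagger and the NPtool identification rules; the chunk grammar shown is a simplification for illustration, not the Prototype’s actual rule set.

    import nltk  # assumes the tokenizer and tagger models have been downloaded

    # Illustrative noun-phrase grammar: an optional determiner, any adjectives,
    # then one or more nouns. The real NPtool rules are far more elaborate.
    CHUNKER = nltk.RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>+}")

    def extract_noun_phrases(text):
        tokens = nltk.word_tokenize(text)   # 1. tokenization
        tagged = nltk.pos_tag(tokens)       # 2. part-of-speech tagging
        tree = CHUNKER.parse(tagged)        # 3. noun-phrase identification
        phrases = []
        for np in tree.subtrees(filter=lambda t: t.label() == "NP"):
            # Normalize: lowercase the words and drop leading determiners.
            words = [w.lower() for w, tag in np.leaves() if tag != "DT"]
            if words:
                phrases.append(" ".join(words))
        return phrases

    # e.g. extract_noun_phrases("Simple analgesics reduce gastrointestinal bleeding.")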

This software was chosen since it is generic: the trained lexicon was derived from several different sources, including the Wall Street Journal and Brown corpora, hence the lexicon has a fairly general coverage of the English language. It can be applied across subject domains without further domain customization while maintaining a comparable parsing quality. According to our studies, the noun phraser enhanced with the UMLS lexicon performed slightly better than the generic version on a collection of 630K MEDLINE abstracts, but the difference is not statistically significant. A similar parser also works well on certain classes of grayscale images, specifically aerial photographs, using texture density as the extracted units.

The generic nature and ease of customization enable the parser to fulfill the full range of noun phrase parsing. We did a careful evaluation of the parser with these general rules on biomedical literature. This research experiment parsed all the biomedical literature, extracting 45M (million) unique noun phrases from 10M MEDLINE abstracts, and its description won Best Paper at the 1999 annual meeting of the American Medical Informatics Association [9].

We have used concept space algorithms in numerous experiments to generate and integrate multiple semantic indexes. The space consists of the interrelationships between the concepts in the collection. Interactive navigation of the concept space is useful for locating related terms relevant to a particular search strategy. To create a concept space, first find the context of terms within documents using a noun phrase parser as above, then compute term (noun phrase) relationships using co-occurrence analysis.

The co-occurrence analysis computes the contextual relationships between the concepts (noun phrases) within the collections. The documents in the collections are processed one by one, with two concepts related whenever they occur together within the same document. Multiple-word terms are assigned heavier weights than single-word terms because multiple-word terms usually convey more precise semantic meaning than single-word terms. The relationships between noun phrases reflect the strengths of their context associations within a collection. Co-occurring concepts are ranked in decreasing order of similarity.
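
A minimal sketch of this computation appears below, in Python with hypothetical names; the Prototype’s actual weighting and normalization formulas are not reproduced here. It assumes each document has already been reduced to its set of noun phrases by a parser like the one above.

    from collections import defaultdict

    def build_concept_space(docs):
        """Weighted co-occurrence between noun phrases, document by document.

        docs: iterable of documents, each a set of noun phrases.
        Two concepts are related whenever they occur in the same document;
        multi-word terms earn heavier weight, a simple guess at the spirit
        of the weighting, since they carry more precise meaning.
        """
        space = defaultdict(lambda: defaultdict(float))
        for phrases in docs:
            for a in phrases:
                for b in phrases:
                    if a != b:
                        space[a][b] += len(b.split())
        return space

    def related_concepts(space, term, k=10):
        """Co-occurring concepts, ranked in decreasing order of association."""
        return sorted(space.get(term, {}).items(), key=lambda kv: -kv[1])[:k]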

Simulating the Interspace

We have performed several hero experiments, using high-end supercomputers as time machines to simulate the world of the future, ten years hence. These experiments took a large existing collection and partitioned it into many community repositories, to simulate typical future situations. Semantic indexing was then performed on each community repository, to investigate strategies for concept navigation within and across repositories.


On our NSF/DARPA/NASA Digital Library Initiative project in 1996, we demonstrated the feasibility of this approach to generating a large-scale testbed for semantic federation of community repositories [4]. The bibliographic database COMPENDEX was used to supply broad coverage across all of engineering, with 800K abstracts chosen from 600 categories from the hierarchical subject classification, while INSPEC was used to supply deep coverage for our core domains of physics, electrical engineering, and computer science, with 400K abstracts chosen from 300 categories.

This generated 4M bibliographic abstracts across 1000 repositories (each abstract was classified by indexer subject assignment into roughly 3 categories, so there was overlap across repositories). The NCSA 32-node HP/Convex Exemplar was used for 10 days of CPU time to compute concept spaces for each community repository. The final production run of the spaces, after all testing and debugging, took about 2 days of supercomputer time.

Our project in the DARPA Information Management program enabled us to carry out discipline-scale experiments in semantic indexes for community repositories. As one example, in 1998, we generated semantic indexes for all of MEDLINE.

MEDSPACE [10] was an Interspace composed of concept spaces across all of MEDLINE. The backfiles comprise 10M abstracts, in a database both broad and deep. Using the MeSH subject classification, we partitioned the collection into approximately 10K community repositories and computed concept spaces for each. The multiple classification for each abstract caused an expansion factor of about four in raw abstracts to repository abstracts. The complete MEDSPACE involved 400M phrase occurrences within 40M abstracts.

The Medicine computation was an order of magnitude bigger than the Engineering computation (40M versus 4M abstracts). The computing time required was about the same scale: 10 days for test debugging and 2 days for final production. This was possible because the 2-year period had made the high-end NCSA supercomputer an order of magnitude better. The 128-node, 64GB SGI/Cray Origin 2000 had 4 times more processors for this highly parallel computation, and the faster processors with bigger memories combined with optimized parallel algorithms to further improve the performance.

This computation demonstrates the feasibility of generating semantic indexes for entire disciplines, in a form deployable within a large-scale testbed. Some of the semantic indexes were used for experiments on concept switching. See examples in Sidebar 1. The experimental users included physicians for MEDLINE indexes and engineers for INSPEC indexes.

Concept Switching

Correlating across communities is where the power of global analysis lies. Mapping the concepts of their community into the concepts of related communities would enable users to locate relevant items from related research. The difficulty is how to locate related research, within the twin explosions of community terminologies and distributed repositories. Each community needs to identify the concepts within their repository and index these concepts in a generic fashion that can be compared to those in other repositories from other communities.

The principal function of the Interspace is thus concept switching. The concept switches of the Interspace serve much the same function as the packet switches of the Internet: they effectively map concepts in one repository of one community to concepts in another repository of another community, much as switching gateways in the Internet reliably transmit packets from one machine in one location to another machine in another location.

The technologies for concept switching are still immature. A specialized form called vocabulary switching has existed since the 1970s [11]. This form is largely manual: the concepts are taken from a human-generated subject thesaurus and the mapping across thesauri is performed by human subject experts. The Unified Medical Language System (UMLS) developed at the National Library of Medicine contains a modern example of this manual switching across subject thesauri, by relating biomedical vocabulary from multiple thesauri with the Metathesaurus [12].

Vocabulary switching is expensive to maintain, since it requires human tracking of the concepts in the thesauri by experts knowledgeable about both sides of the vocabulary map. Scalable semantics could potentially support full concept switching by parsing all concepts and computing all relationships. The promise of automatic methods is concept mapping at a viable cost for community-scale collections.

Future concept switching will rely on cluster-to-cluster mapping, rather than term-to-term. Then each concept will have an equivalence class of related concepts generated for it in the particular situation, and the equivalence class from one space will be mapped into the most relevant classes in other spaces. The simple example in Sidebar 1 uses the related terms in the concept space as the equivalence class for mapping a particular term. Full cluster-to-cluster mapping will likely use neural net technology, such as spreading activation on self-organizing maps of related terms in related documents.
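
A sketch of the simple, intersection-based form of the switch (the form used in the Sidebar 1 example) is given below, in Python with hypothetical names. It assumes concept spaces in the dict-of-dicts form of the earlier concept space sketch, and it omits the syntactic normalization the Prototype performs on phrases.

    def concept_switch(term, source_space, target_space, k=10):
        """Map a concept from one community repository into another.

        Form the equivalence class (the term plus its top co-occurring
        concepts in the source space), intersect that class with the
        vocabulary of the target space, then expand the surviving bridge
        terms inside the target space and rank the results.
        """
        def top_related(space, t):
            return sorted(space.get(t, {}).items(), key=lambda kv: -kv[1])[:k]

        equivalence = {term} | {t for t, _ in top_related(source_space, term)}
        bridges = equivalence & set(target_space)
        switched = {}
        for bridge in bridges:
            for t, w in top_related(target_space, bridge):
                switched[t] = max(switched.get(t, 0.0), w)
        return sorted(switched.items(), key=lambda kv: -kv[1])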

Concept switching supports a new and radical paradigm for information retrieval. Users rarely issue searches. Instead, they navigate from concept to concept, transparently within and across repositories, examining relevant objects and the contained concepts. If they can recognize relevant concepts when viewed during an interactive session, they need not know specialized terminology beforehand.

Scalable Semantics

The success of the Interspace revolves around the technologies of Scalable Semantics, which attempt to be the golden mean for information retrieval, in between scalable (broad, since it works in any subject domain) and semantics (deep, since it captures the underlying meaning). Traditional technologies have been either broad but shallow (e.g. full-text search) or deep but narrow (e.g. expert systems). The new wave of semantic indexing relies on statistical frequencies of the context of units within objects, such as words within documents, and is thus fully automatic.

The technology curves for computer power indicate that semantic indexing of any scale collection will shortly become routine. This observation is largely independent of which style of indexing is considered. Within 5 years, discipline-scale collections will be processed within hours on desktop computers. Within 10 years, the largest collections will be processed in real-time on desktop computers and within minutes on palmtops. In the near future, users will routinely perform semantic indexing on their personal collections using their personal computers.

This availability of computing power will push feasible levels of semantic indexing to deeper levels of abstraction. The technologies for handling concepts and categories seem already well understood. Concepts rely on phrases within documents as the units within objects. Categories rely on documents within concepts, clustering the concepts for the next more abstract level.

Even higher potential levels would involve perspectives and situations. Perspectives rely on concepts within categories, while situations rely on categories within collections. These move towards path matching, where the patterns of all the user’s searches are placed within the contexts of all their available knowledge.

These more abstract semantic levels push towards closer matching of the meanings in users’ minds to the meanings in the world’s objects. Each higher level groups larger units together across multiple relationships. The algorithmic implications are that many more objects must be correlated to generate the indexing, requiring increasing levels of computing power.

The early years of the new millennium will see the infrastructure of the Net evolve from the Internet to the Interspace. Each specialized community will maintain their own knowledge collections and semantically index these collections on their own machines. Pattern discovery across community sources will become routine, with concept navigation across the Interspace. Problem solving in the Net will be an everyday experience in the 21st century.





Acknowledgements

Thanks go to the members of the Interspace team, who have prototyped the Third Wave of the Net. The DARPA Information Management program provided financial support through contract N66001-97-C-8535, entitled “The Interspace Prototype: An Analysis Environment for Semantic Interoperability”, with program manager Ron Larsen. DARPA supported the Interspace Prototype project from 1997 to 2000, with Principal Investigator Bruce Schatz, and co-Principal Investigators Charles Herring (at the University of Illinois at Urbana-Champaign) and Hsinchun Chen (at the University of Arizona at Tucson). The systems research was performed at Illinois in the CANIS Laboratory (Community Architectures for Network Information Systems) under Schatz. The technical leads were Herring, Bill Pottenger, and Kevin Powell (who served as technical architect after co-writing the original architecture document with Schatz). The primary programmers were Conrad Chang, Les Tyrrell, Yiming Chung, Dan Pape, Qin He, and Nuala Bennett. The algorithms research was performed in the Artificial Intelligence Laboratory at Arizona under Chen. The technical leads were Dorbin Ng, Dmitri Roussinov, Marshall Ramsey, and Kris Tolle. At Illinois, Duncan Lawrie and Bob McGrath evaluated computer power for semantic indexing.


References

1. B. Schatz, “Telesophy: A System for Manipulating the Knowledge of a Community”, Proc. IEEE Globecom '87, Tokyo, Nov. 1987, pp. 1181-1186.

2. B. Schatz, Interactive Retrieval in Information Spaces Distributed across a Wide-Area Network, Ph.D. Dissertation, Technical Report 90-35, Department of Computer Science, University of Arizona, Tucson, Dec. 1990, 95 pp.

3. B. Schatz, “Building an Electronic Community System”, J. Management Information Systems, Vol. 8, Winter 1991-92, pp. 87-107. Reprinted in R. Baecker (ed), Readings in Groupware and Computer Supported Cooperative Work, Morgan Kaufmann, 1993, pp. 550-560 in Chapter 9.

4. B. Schatz, et al., “Federated Search of Scientific Literature”, Computer, Vol. 32, Feb. 1999, pp. 51-59.

5. B. Schatz, “High-Performance Distributed Digital Libraries: Building the Interspace on the Grid”, Proc. 7th IEEE Int'l Symp. High-Performance Distributed Computing, Chicago, Jul. 1998, pp. 224-234.

6. E. Brill, “Transformation-Based Error-Driven Learning and Natural Language Processing”, Computational Linguistics, Vol. 21, 1995, pp. 543-565.

7. A. Voutilainen, “NPtool: A Detector of English Noun Phrases”, Proc. Workshop on Very Large Corpora, Columbus, OH, June 22, 1993.

8. K. Tolle and H. Chen, “Comparing Noun Phrasing Techniques for Use with Medical Digital Library Tools”, J. Amer. Soc. Information Science, Vol. 51, Mar. 2000, pp. 380-393.

9. N. Bennett, et al., “Extracting Noun Phrases for all of MEDLINE”, 1999 Annual Meeting American Medical Informatics Assoc., Nov. 1999, pp. 681-688.

10. Y. Chung, et al., “Semantic Indexing for a Complete Subject Discipline”, 4th Int'l ACM Conf. Digital Libraries, Berkeley, CA, Aug. 1999, pp. 39-48.

11. R. Niehoff, “Development of an Integrated Energy Vocabulary and the Possibilities for On-line Subject Switching”, J. Amer. Soc. Information Science, Vol. 27, Jan-Feb 1976, pp. 3-17.

12. D. Lindberg, et al., “The Unified Medical Language System”, Methods Information Medicine, Vol. 32, 1993, pp. 281-291.







Sidebar 1: User Sessions performing Concept Navigation


The Interspace Prototype enables navigation across different levels of spaces: for documents, for concepts, for categories. The spaces can be navigated from concept to concept without the need for searching. The production interface is invocable from within a web browser and can be found at www.canis.uiuc.edu under Interspace under Demonstrations. The interface is implemented in Smalltalk, but the user interaction takes place via an emulator called ClassicBlend, which dynamically transforms Smalltalk graphics into Java graphics.


Figure 2 is a composite screendump of an illustrative session with the Interspace Remote Access (IRA) interface for the Interspace Prototype. This illustrates concept navigation within a community repository from MEDLINE. The user is a clinic physician who wants to find a drug for arthritis that reduces the pain (analgesic) but does not cause stomach (gastrointestinal) bleeding. In the upper left window, the community collections (subject domains) with semantic indexes are described.

In the lower left, the user selects the domain for “Rheumatoid Arthritis”, then searches for all concepts (noun phrases) mentioning the word “bleeding”. They then navigate in concept space to find a related term that might be relevant for their current need. From “gastrointestinal bleeding”, the related concepts include those that are general (“drug”) or artifacts (“ameliorating effect”). But the related concepts also include those of appropriate specificity for locating relevant items, such as names (“trang l”) and detailed concepts (“simple analgesic”).

Following further in the concept space from “simple analgesic” yields “maintenance therapy”, where the first document is displayed in the lower right. This document discusses a new drug, “proglumetacin”, that when used for treatment produces a patient whose “haematology and blood chemistry were not adversely affected”. Thus this drug does not cause bleeding. This document, however, would have been difficult to retrieve by a standard text search on MEDLINE, due to the difficulty of guessing beforehand the terminology actually used. The upper right lists another document on this drug, which was located by navigating from the concepts (noun phrases) in the current document via the selected concept.


Figure 3 gives an example of concept switching in the Interspace Prototype, where the relationships within the concept spaces are used to guide the navigation across community repositories for MEDLINE. The subject domains “Colorectal Neoplasms, Hereditary Nonpolyposis” and “Genes, Regulator” were chosen and their concept spaces were displayed in the middle and the right windows respectively. “Hereditary cancer” was entered as a search term in the first concept space and all concepts that are lexical permutations are returned. Indented levels in the display indicate the hierarchy of the co-occurrence list. Navigating in the concept space moves from “hereditary nonpolyposis colorectal cancer” to the related “mismatch repair genes”.

The user then tries to search for this desired term in another domain repository, “Genes, Regulator”. A straight text search at top right returns no hits. So Concept Switching is invoked to switch concepts from one domain to another across their respective concept spaces. The concept switch takes the term “mismatch repair genes” and all related terms from its indented co-occurrence list in the source concept space for “Colorectal Neoplasms” and intersects this set into the target concept space for “Genes, Regulator”.


After syntactic transformations, the concept switch produces the list (in the right-most window panel) of concepts computed to be semantically equivalent to “mismatch repair genes” within “Genes, Regulator”. Switching occurs by bridging across community repositories on the term “Polymerase Chain Reaction”, an experimental method common to both subject domains. Navigating the concept space down to the object (document) level locates the article displayed at the bottom. This article discusses a leukaemia inhibitory factor that is related to colon cancer. Note that this article was located without doing a search, by concept switching across repositories, starting with the broad term “hereditary cancer” and using common terms as bridges.





<EDITORIAL NOTE. In this sidebar, phrases within quotes, such as “Genes, Regulator”, are meant to be literals (strings actually entered by the user within the session). In previous articles in COMPUTER, such literals were placed into a distinct fixed-width font, rather than being quoted in print.>


Figure 2. Concept Navigation in the Interspace Prototype.


Figure 3. Concept Switching in the Interspace Prototype.



Sidebar 2: Technology Trends underlying Concept Navigation

Information Infrastructure evolves as better technology becomes available to support basic needs. For technology to be mature enough to be incorporated into standard infrastructure, it must be sufficiently generic. That is, the technology must be robust and readily adaptable to many different applications and purposes.

For Information Infrastructure to support Concept Navigation in a fundamental way, a number of new technologies must be incorporated into the standard support. The body of this article discusses the Interspace Prototype, an early system possible since these technologies are currently mature enough for a complete research system. The Interspace itself will become widespread when these underlying technologies further mature into commercial components. This sidebar tries to make explicit the major technologies that the Interspace Prototype (and the Interspace eventually) critically but implicitly relies on.

The rise of four technologies is critical, in particular: document protocols for information retrieval, extraction parsers for noun phrases, statistical indexers for context computations, and communications protocols for peer-to-peer retrieval. Together, these generic technologies support semantic indexing of community repositories.

A document can be stored in a standard representation. Concepts can be extracted from a document with some level of semantics. These concepts can be utilized to transform a document collection into a searchable repository, by indexing the documents with some level of semantics. Finally, the resultant indexing can be utilized to semantically federate the knowledge of a community, by concept navigation across distributed repositories that comprise relevant sources.



The Rise of the World-Wide Web has made it possible to store documents in a standard representation. Prior to the worldwide adoption of a single format to represent documents, collections were limited to those that could be administered by a single central organization. Prime examples were Dialog, for bibliographic databases consisting of journal abstracts, and Lexis/Nexis, for full-text databases consisting of magazine articles.

The widespread adoption of WWW protocols enabled global information retrieval, which in turn increased the volume to the point that semantic indexing has become necessary to enable effective retrieval. In particular, the current situation was caused by the universal distribution of servers that store documents in HTML and retrieve documents using HTTP. Many more organizations could now maintain their own collections, since the information retrieval technology was now standard enough to enable information providers to directly store their own collections, rather than transferring them to a central repository archive.

Standard protocols implied that a single program could retrieve documents from multiple sources. Thus the WWW protocols enabled the implementation of Web browsers. In particular, Mosaic proved to be the right combination of streamlined standards and flexible interfaces to attract millions of users to information retrieval for the first time [1].

As the number of documents increased, identifying the initial document to hypertext browse from became a major problem. Then, web searchers began to dominate web browsers as the primary interface to the global information space. These searches across so many documents with such variance showed the weakness of syntactic search, such as the word matching used within Dialog, and increased the demand for semantic indexing embedded within the infrastructure [2].



The Web at present is fundamentally a client-server model, with few large servers and many small clients. The clients are typically user workstations, which prepare queries to be processed at archival servers. The infrastructure has made the transition from files to documents. The primary functionality has made the transition from access, where a browser is used for directly fetching, to organization, where a searcher is used for initially selecting relevant documents.

As the number of servers increases and the size of collections decreases, the infrastructure will evolve into a peer-peer model, where user machines exchange data directly. In this model, each machine is both a client and a server at different times. This model is already popular, with services for music swapping such as Napster estimated to use 20% of present traffic in the Net. However, the functionality is still access to files, rather than organization of documents. This functionality will change as the technology for semantic indexing becomes mature.

Document standards eliminate the need for format converters for each collection. Extracting words becomes universally possible with a syntactic parser. But extracting concepts requires a semantic parser, which extracts the appropriate units from documents of any subject domain. Many years of research into information retrieval have shown that the most discriminating units for retrieval in text documents are multi-word noun phrases. Thus, the best concepts in document collections are noun phrases.


The Rise of Generic Parsing has made it possible to automatically extract concepts from arbitrary documents. The key to context-based semantic indexing is identifying the “right size” unit to extract from the objects in the collections. These units represent the “concepts” in the collection. The document collection is then processed statistically to compute the co-occurrence frequency of the units within each document.

Over the years, the feasible technology for concept extraction has become increasingly more precise. Initially, there were heuristic rules that used stop words and verb phrases to approximate noun phrase extraction. Then, there were simple noun phrase grammars for particular subject domains. Finally, the statistical parsing technology became good enough that extraction was computable without explicit grammars. These statistical parsers can extract noun phrases quite accurately for general texts, after being trained on sample collections [3].

This technology trend approximates meaning by statistical versions of context. The same trend has appeared in recent years in pattern recognition in many areas. Computers have now become powerful enough that rules can be practically replaced by statistics in many cases. Global statistics on local context have replaced deterministic parsing.

For example, in computational linguistics, the best noun phrase extractors no longer have an underlying definite grammar, but instead rely on neural nets trained on typical cases. The initial phases of the DARPA TIPSTER program, a $100M effort to extract facts from newspaper articles for intelligence purposes, were based upon grammars, but the final phases were based upon statistical parsers. Once the neural nets are trained on a range of collections, they can parse arbitrary texts with high accuracy. It is even possible to determine the type of the noun phrases, such as person or place, with high precision [4].

Once the units, such as noun phrases, are extracted, they can be used to approximate meaning. This is done by computing the frequency with which the units occur within each document across the collection. In the same sense that the noun phrases represent concepts, the contextual frequencies represent meanings.

These frequencies for each phrase form a space for the collection, where each concept is related to each other concept by co-occurrence. The concept space is used to generate related concepts for a given concept, which can be used to retrieve documents containing the related concepts. The space consists of the interrelationships between the concepts in the collection.

Concept navigation is enabled by a concept space computed from a document collection. The technology operates generically, independent of subject domain. The goal is to enable users to navigate spaces of concepts, instead of documents of words. Interactive navigation of the concept space is useful for locating related terms relevant to a particular search strategy.
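
In outline, the navigation loop is just repeated lookup in the space: the machine suggests related terms, the human picks one, and the descent to documents happens only at the end. Below is a small sketch in Python, under the same assumed dict-of-dicts concept space as the sketches in the main article; the names are illustrative.

    def suggest(space, term, k=10):
        """Machine suggests: the top co-occurring concepts for a term."""
        ranked = sorted(space.get(term, {}).items(), key=lambda kv: -kv[1])
        return [t for t, _ in ranked[:k]]

    def navigate(space, start, picks):
        """Walk the concept space: at each hop a pick function (standing in
        for the user's selection) chooses among the machine's suggestions."""
        term = start
        for pick in picks:
            term = pick(suggest(space, term))
        return term

    # e.g. always take the top suggestion for two hops:
    # navigate(space, "hereditary cancer", [lambda opts: opts[0]] * 2)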


The Rise of Statistical Indexing has made it possible to compute relationships between concepts within a collection. Algorithms for computing statistical co-occurrence have been studied within information retrieval since the 1960s [5]. But it is only in the last few years that the statistics involved for effective retrieval have been computationally feasible for real collections. These concept space computations combine artificial intelligence for the concept extraction, via noun phrase parsing, with information retrieval for the concept relationship, via statistical co-occurrence.

The technology curves of computer power are making statistical indexing feasible. The coming period is the decade that scalable semantics will become a practical reality. For the 40-year period from the dawn of modern information retrieval in 1960 to the present worldwide Internet search of 2000, statistical indexing has been an academic curiosity. Techniques such as co-occurrence frequency were well-known, but confined to collections of only a few hundred documents. The practical information retrieval on large-scale real-world collections of millions of documents relied instead on exact match of text phrases, such as embodied in full-text search.

The speed of machines is changing all this rapidly. The next 10 years, 2000-2010, will see the fall of indexing barriers for all real-world collections [6]. For many years, the largest computer could not semantically index the smallest collection. After the coming decade, even the smallest computer will be able to semantically index the largest collection.

The body of this article describes the hero experiment in the late 1990s, of semantically indexing the largest scientific discipline on the largest public supercomputer. Experiments of this scale will be routinely carried out by ordinary people on their watches (palmtop computers) less than 10 years later, in the late 2000s.

The TREC (Text REtrieval Conference) competition [7] is organized by the National Institute of Standards and Technology (NIST). It grew out of the DARPA TIPSTER evaluation program, starting in 1992, and is now a public indexing competition entered annually by international teams. Each team generates semantic indexes for gigabyte document collections using their statistical software.

Currently, semantic indexing can be computed by the appropriate community machine, but in batch mode. For example, a concept space for 1K documents is appropriate for a laboratory of 10 people and takes an hour to compute on a small laboratory server. Similarly, a community space of 10K documents for 100 people takes 3 hours on a large departmental server. Each community repository can be processed on the appropriate-scale server for that community. As the speed of machines increases, the time of indexing will decrease from batch to interactive, and semantic indexing will become feasible on dynamically specified collections.

When the technology for semantic indexing becomes routinely available, it will be possible to incorporate this indexing directly into the infrastructure. At present, the Web protocols make it easy to develop a collection to access as a set of documents. Typically, the collection is available for fetching but not for searching, except by being incorporated into web portals, which gather documents via crawlers for central indexing. Software is not commonly available for groups to maintain and index their own collections for web-wide search.


The Rise of Peer-Peer Protocols is making it possible to support distributed repositories for small communities. This trend is following the same pattern in the 2000s as email in the ARPAnet did in the 1970s, where person-person communications became the dominant service in infrastructure designed for station-station computations. Today, there are many personal web sites, even though traffic is dominated by central archives, such as home shopping and scientific databases, which drive the market.

There are already significant beginnings of peer-peer, where simpler protocols enable users to directly share their datasets. These are driven by the desires of specialized communities to directly share with each other, without the intervention of central authorities. The most famous example is Napster for music sharing, where files on a personal machine in a specified format can be made accessible to other peer machines, via a local program that supports the sharing protocol. The Napster service has now become so popular that the technology is breaking down due to lack of searching capability that can filter out copyrighted songs.

There are many examples in more scientific situations of successful peer-peer protocols. Typically, these programs implement a simple service on an individual user’s machine, which performs some small computation on small data that can be combined across many machines into a large computation on large data [8].

For example, the SETI software is running on a million machines across the world, each computing the results of a radio telescope survey from a different sky region. Computed results are sent to a central repository for a database seeking intelligent life across the entire universe. Similar net-wide distributed computation, with volunteer downloads of software onto personal machines, has computed large primes and broken encryption schemes. For-profit corporations have used peer-to-peer computing for public-service medical computations [9].

Generalized software to handle documents or databases currently exists at a primitive level for peer-peer protocols. A canned program can be run, which processes local data in a simple way. Functionality is still at the level of files rather than documents. The infrastructure supports access rather than organization.

Internet infrastructure, such as the Open Directory project [10], enables distributed subject curators to index web sites within assigned categories, with the entries being entire collections. In contrast, Interspace infrastructure, such as automatic subject assignment [11], will enable distributed community curators to index the documents themselves within the collections.

Increasing scale of community databases will force evolution of peer-peer protocols. Semantic indexing will mature and become infrastructure at whatever level technology will support generically. Community repositories will be automatically indexed, then aggregated to provide global indexes. Concept navigation will become a standard function of global infrastructure in 2010, much as document browsing has become in 2000. Then the Internet will have evolved into the Interspace.



References

1. B. Schatz and J. Hardin, “NCSA Mosaic and the World-Wide Web: Global Hypermedia Protocols for the Internet”, Science, Vol. 265, 12 Aug. 1994, pp. 895-901.

2. T. Berners-Lee, et al., “The Semantic Web”, Scientific American, Vol. 284, May 2001, pp. 35-43.

3. T. Strzalkowski, “Natural Language Information Retrieval”, Information Processing & Management, Vol. 31, 1996, pp. 397-417.

4. D. Bikel, et al., “NYMBLE: A High-Performance Learning Name Finder”, Proc. 5th Conf. Applied Natural Language Processing, Mar. 1998, pp. 194-201.

5. P. Kantor, “Information Retrieval Techniques”, Annual Review Information Science & Technology, Vol. 29, 1994, pp. 53-90.

6. B. Schatz, “Information Retrieval in Digital Libraries: Bringing Search to the Net”, Science, Vol. 275, 17 Jan. 1997, pp. 327-334.

7. D. Harman (ed), Text Retrieval Conferences (TREC), National Institute of Standards & Technology (NIST), http://trec.nist.gov

8. B. Hayes, “Collective Wisdom”, American Scientist, Vol. 86, Mar-Apr 1998, pp. 118-122.

9. Intel Philanthropic Peer-to-Peer Program, www.intel.com/cure

10. Open Directory Project, www.dmoz.org

11. Y. Chung, et al., “Automatic Subject Indexing Using an Associative Neural Network”, Proc. 3rd Int'l ACM Conference Digital Libraries, Pittsburgh, Jun. 1998, pp. 59-68.