Avoiding Information Overload: Knowledge Management on the Internet





Author: Dr Adam Bostock of Acro Logic

June 2002


TSW 02-02


Avoiding Information Overload:

Knowledge Management on the Internet

Contents

1   Executive Summary
2   The Technologies
3   Technology Watch Issue
4   Technical Overview
    4.1   Search Engines
    4.2   Web Browsers: Searching and Saving Individual Pages
    4.3   Knowledge Management Systems and Agents
    4.4   Knowledge Representation
    4.5   XML
5   Developments
    5.1   XML Search Engines
    5.2   Web Browsers
    5.3   Knowledge Management Systems and Agents
    5.4   Metadata
    5.5   XML Extensions and Applications
    5.6   Knowledge Application Tools
6   Assessment
7   References
    7.1   Agents
    7.2   Applications and Projects (Miscellaneous)
    7.3   Knowledge Management Technologies and Products
    7.4   Metadata
    7.5   Metadata (Dublin Core)
    7.6   Metadata (Miscellaneous)
    7.7   Metadata (RDF - Resource Description Framework)
    7.8   Ontology
    7.9   Search Engines and Directories
    7.10  Semantic Web
    7.11  Web Services
    7.12  XML
8   Glossary
9   Appendix A - Example of XML
10  Appendix B - Developing Your Web Site to Generate XML
    10.1  Converting an Existing Database Driven System
    10.2  Starting from Scratch
    10.3  Build Your Own XML Portal
    10.4  Benefits
11  Appendix C - Example Use of Namespaces
12  Appendix D - Information Presentation Issues
13  Appendix E - Metadata and Information Extraction
14  Appendix F - Internet Directories
15  Appendix G - Search Engines





1 Executive Summary

Keywords: search, knowledge management, XML, metadata, RDF, ontology, agent

It is estimated that there are over two billion Web pages, and thousands of newsgroups and forums, on the Internet - covering virtually every topic imaginable. However, many users find that searching the Internet can be a time consuming and tedious process. Even experienced searchers sometimes run into difficulties. To fully benefit from the potential opportunities of the Internet, both Web site developers and users need to be aware of the tools and techniques for managing and retrieving online knowledge.

This has driven the development of improved search and information retrieval systems. However, we now need sophisticated information extraction (and/or summary) capabilities to present the user only with the information they need, rather than a large set of relevant documents to read.

Search service providers, Web portals, and amalgamations of community Web sites could all help their users to benefit today, just by adopting the current generation of knowledge management systems, particularly those with effective information extraction capabilities.

Metadata has a very useful role to play, but it has limitations with regard to information extraction.

One of the key opportunities of the XML initiative is to allow structure and (indirectly) "meaning" to be embedded into the content of the resource itself. XML provides the much needed data structure for computer-to-computer interaction. The availability of good user-friendly, and "intelligent", tools will be critical in persuading the wider community to adopt XML as an alternative to HTML.

It is probably reasonable to state that the current generation of knowledge management systems is an interim measure, to be superseded by AI systems in the long term. Such systems will probably be able to process natural language and XML encoded content.

The success of Internet based knowledge management, and the Semantic Web, will require the development and integration of various data standards, ontology definitions, and knowledge management and agent technologies. It will take a concerted and significant effort to get there. The likely longer-term benefits are much more effective Internet searches and smart information extraction services, which present the user with concise relevant extracts.

In the meantime, perhaps we should also think about how authors represent knowledge and present information, and how users apply knowledge, in a more structured and meaningful way.

This report includes a glossary, reference section and appendices.

A concise, interactive, XML version of this report, plus extra features, can be found here: Acro-Report (www.acrologic.co.uk/cgi-bin/src.pl?r=ikm-16.446)

2 The Technologies

In the broadest sense knowledge management encompasses all aspects to do with creating, maintaining, organising, classifying, representing, storing, querying, retrieving, analysing and presenting knowledge. It encompasses people, procedures, processes, policies and technologies. An organisation's level of success in knowledge management is limited by the weakest link in that chain. In this report we primarily focus on the technology. This report provides an overview of the issues, and the tools and techniques available for managing and finding knowledge on the Internet. The aim is to avoid adding to the stress and frustration felt by Internet users, and to enable them to quickly acquire knowledge from the Internet. The resulting benefits are applicable to all types of Internet user. This report was written for a diverse audience and so jargon, purist definitions and detailed technical aspects have been kept to a minimum (though the references contain more detail).

A popular method for finding Internet resources is through search engines and directories. Directories allow a user to manually browse a hierarchy of categories to find appropriate Web sites. Search engines take a user's query and automatically search their database to return matching results.




Knowledge management (KM) systems and agents are two distinct topics but increasingly within this context they can be observed working together. These systems have more functionality than search engines, and some systems can match concepts derived from unstructured data.

For computer systems to communicate directly with each other, standard data formats are required. XML allows standard languages and data formats to be developed. Coupled with an appropriate ontology, systems can perform useful functions on the data they exchange with each other. KM systems can be enhanced to utilise XML to provide improved capabilities.

3 Technology Watch Issue

The key issues now for knowledge management and searching on the Internet are:

1. The volume of information on the Internet is such that it is not feasible to manually search for and retrieve all relevant sources of quality knowledge on a given topic.

2. Users would probably not have enough time to read all the relevant documents on the Internet.

3. This [2] requires information extraction, to present only the relevant parts of a document.

4. Knowledge management systems and search agents are needed to assist with the above.

5. HTML Web pages contain unstructured data, with no computer understandable "meaning".

6. XML and ontology provide an opportunity to address the above issue.

7. A critical mass within each community needs to co-operate to develop acceptable and workable standards for that community. Activities are needed not just on metadata but also on representing document content and the corresponding knowledge much more effectively.

4 Technical Overview

4.1 Search Engines

It is assumed readers are familiar with directories and search engines (if not, see Appendices F and G). Useful though they are, current search engines have limitations. The classification and searching algorithms are automated and lack intelligence. In particular, search engines have been vulnerable to abuse from "spammers" claiming to offer popular products or services to attract visitors to their site on false pretences. Unfortunately, because of this, the algorithms and rules of search engines have adapted quickly to try to guarantee relevant (non-spam) results for searchers. This means that Search Engine Optimisation (SEO) practices have changed quickly, and what used to be good or acceptable practice can, in some cases, now be detrimental to your Web site's rating on a search engine. [References]

This issue has also resulted in the role of metadata being downgraded in Web pages. Metadata may be used to describe a Web page, e.g. description and keywords. A few years ago metadata, particularly keywords, was used to determine how well a page matched a query. However, the spammers abused this and so search algorithms were modified to play down the role of metadata.
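
As an illustration, the kind of page-level metadata discussed here sits in the head of an HTML (or XHTML) page as meta-tags; the values below are purely illustrative:

    <head>
      <title>Knowledge Management on the Internet</title>
      <!-- Descriptive metadata: a summary and keywords for search engines -->
      <meta name="description"
            content="A report on tools and techniques for managing and finding knowledge online." />
      <meta name="keywords"
            content="knowledge management, search, XML, metadata" />
    </head>

As noted above, most search engines now give such tags little or no weight when ranking results.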

4.1.1 The Invisible Web

Even a determined search engine that attempts to index every Web page will not actually be able to find, or extract, all of the content potentially available to Web users. The reasons for this include:



- Web pages that have no links to them, and are not explicitly submitted to search engines
- Web pages that reside in password protected areas of a Web site
- Content stored in an unsupported format (e.g. images, animations)
- Dynamic Web pages generated from a database in response to a user's actions or queries.




The latter scenario is the one to which people often refer when they talk about the "invisible Web". Examples of dynamically generated pages are the results returned by a search engine, whereby the content is tailored to the specific query of a user. There are many sites that use this technique, and consequently a significant volume of knowledge may reside in the invisible Web.

There are relatively simple ways of making some of this hidden knowledge available to search engines:

1) A representative set of static pages can be extracted from the database and linked to in the normal way. This approach may be used to demonstrate a sample of what is available.

2) One or more Web pages containing index listings of the data can be derived from a selected set of database records. The index listing could contain links to dynamically generated pages. Those search engines that index dynamic pages then have a link to follow, which reveals some of the underlying data in the database. For complex systems this approach may not be feasible. (A sample index listing is sketched after this list.)

3) Generate static Web indexes that contain metadata on a selected set of records. Caution: remember the spammers and how they abuse metadata? Well, because this technique can be used to deceive search engines you may be automatically penalised (see: Cloaking).
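
As a minimal sketch of approach (2), an ordinary static page can list links to dynamically generated pages so that a crawler has something to follow; the script name and parameters here are hypothetical:

    <html>
      <head><title>Product index</title></head>
      <body>
        <h1>Product index</h1>
        <ul>
          <!-- Each link points to a dynamically generated page for one database record -->
          <li><a href="/cgi-bin/item?id=101">Green sphere</a></li>
          <li><a href="/cgi-bin/item?id=102">Blue cube</a></li>
          <li><a href="/cgi-bin/item?id=103">Red pyramid</a></li>
        </ul>
      </body>
    </html>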

4.1.2 On-site Searching

Some Web sites are very large, or contain hidden data, and so they provide an on-site search facility either via third party facilities, public search engines (with the search restricted to their site), or their own search system. (Many have an irritatingly small field in which to type your query.)

However, you may want to bear in mind that search systems (and their users) are not perfect. If a user searches for something provided on your site but they mistype it or use a different set of words to describe the object (e.g. "green ball" instead of "green sphere") the search will probably fail. The user may then assume that your site does not have any information on the object and leave the site. Some commercial Web site owners have reported sales going up when they have removed their on-site search feature! This may be because the failed search scenario has been removed, plus the user has to browse the site to find what they want. Whilst browsing they may find what they want, something similar, or something unexpected yet desirable.

4.1.3 Meta Search Engines

Until relatively recently no search engine came close to indexing all the pages on the Web, and each search engine had a different set of indexed pages (with some degree of overlap of course). Therefore, in order to conduct a comprehensive search of the Web a user would need to perform similar searches on a number of major search engines. Clearly this process can be cumbersome. This is where meta search engines come in. They take one query from the user and automatically pass that on to several search engines. The meta search engine then presents the combined results to the user.

4.1.4 Integrated Search and Directory Features

Users may be confused as to whether a particular site is a search engine or a directory because it may appear to offer both features. A directory will probably provide the option to browse by category and the option to search. The search feature looks for a match in the list of descriptions for each Web site in the directory. Conversely, a search engine may also offer a directory hierarchy for the user to browse through. The situation gets even more blurred today because some directories now extend their search features by calling upon the capabilities of a third party's search engine. For example, a search on the directory Yahoo! may return results generated by the Google search engine.

4.1.5 Other Types of Internet Resource

It is worth remembering that the Internet contains more than (HTML) Web pages. The Google search engine has been particularly proactive in recognising this and provides search results that refer to Newsgroups, Adobe Acrobat (PDF) files, and Office documents (Word and PowerPoint).




4.1.6 Other Methods of Finding Information

Other methods exist for finding information: ask someone (in the real world, or in one of the many online forums and newsgroups); Web logs (blogs) are Web sites where the author maintains a log of things happening which correspond to their own interest or expertise; and the technologies described later.

4.2 Web Browsers: Searching and Saving Individual Pages

4.2.1 Searching within a Page

Once a Web browser has loaded and displayed a page, the user may want to conduct a search of the text on that page, particularly if the page is very long. Most Web browsers provide some kind of search or find capability. However, a well designed and structured page should reduce the need for such actions, e.g. start with a contents list. Information presentation and Web design are key aspects of effective information retrieval, but are beyond the scope of this report.

4.2.2 Saving Pages

A Web browser will usually allow you to save a Web page to your computer. This saves one file for the HTML code, and additional files for the images. However, it would be useful to record where the Web page originates from, e.g. inserting a line that displays the source URI at the top of the saved page. Alternatively, you may want to use a database to record metadata on the page you have saved, including the saved file path and name, and the corresponding URI of the source. Most materials are protected by copyright, so your metadata database could record a link to the original page.

4.2.3 Bookmarks

Another method for "saving" a Web page is to bookmark it, or add it to your favourites list. This simply records the URL of the Internet resource, rather than copying the content itself. Most browsers will support this function. For those that have relatively fast Internet connections this may be a better option than saving, as it avoids issues of copyright and of using a saved local copy which may become outdated. However, the potential downside is that many Web pages eventually move to a new URL, or get deleted. (It would be nice if more Web site owners provided a link to the new address of the resource or some type of help, rather than just reporting Error 404 - page not found.)

4.3 Knowledge Management Systems and Agents

4.3.1 Introduction to Knowledge Management Systems

The technology underlying knowledge management (KM) systems should ideally provide support for all aspects of knowledge management (see the earlier Technologies chapter). KM systems can automatically search for, retrieve, organise and classify information. Some have the ability to:

- Extract relevant content from a document or page
- Summarise a document
- Automatically classify, cluster and match documents by concept.

The system administrator typically has control over which resources the system has access to (Intranet, Internet, documents, databases, etc.). Given the autonomous nature of its knowledge acquisition, its demand on busy resources (e.g. bandwidth) may become excessive. However, the administrator should have the option to control its level of activity and/or schedule activities for a more suitable time of day.

The system can retrieve documents relevant to a user's explicit search request, or those matching concepts in a selected document, or those matching a user's interests. However, given lots of potential matches from the billions of pages on the Web we need even more help. This is where information extraction can help, by presenting users with only the relevant extracts from documents. Extraction technology removes information from a variety of document formats. It may aggregate content into a single location, and translate text in documents into a structured format such as XML and/or database records. Document summarisation also represents a potentially useful feature, depending on its accuracy.

There are several knowledge management products available. These products may offer one or more of the following features:

- Document management - tracking metadata on documents such as topic category, location, author, date created and modified, version number and history, and features to index, classify and represent the content of a document
- Work flow - integrated tracking of business processes and the flow of information within an organisation
- Intranet and Internet knowledge management (Web pages, documents, email, newsgroups, forums, etc.)
- Communications tracking of internal and external communications through various forms of electronic media, possibly even including voice
- Database integration - existing database resources can be integrated, and of course database technology may provide the infrastructure of the knowledge management system
- Group-ware and collaboration tools to facilitate and track internal team based activities, project work and knowledge sharing (also useful for virtual organisations)
- Project management
- Management information systems
- Knowledge creation and representation tools (e.g. structured documents / databases, mind-maps, diagrams, formal schemes and process descriptions)

There is a degree of overlap between all of the above features and, of course, integration is highly desirable within this context. Some products may only provide one of the above features. Some IT manufacturers provide modules for each of the above which can be integrated into their existing product lines.

However, care should be taken when selecting a suitable knowledge management product for your organisation. "Knowledge management" has become one of the current fashionable trends and many manufacturers have jumped on the bandwagon. Consequently, there is a lot of hype and waffle around (you won't have any problem finding this on the Internet). Given that knowledge management is such a broad subject it is relatively easy for any IT provider to claim that they provide a knowledge management product or service. Whilst they probably do, it is worth asking yourself: does this product address all of my information and communication needs, is it sufficiently flexible, and is it capable of supporting new and emerging standards (e.g. XML)?

Further questions to ask are: does it support concept matching (as opposed to just keyword searches), document summarisation, and information extraction (as opposed to just information retrieval)? Is your KM system going to present you with concise relevant facts (extracts), or a long list of documents that you have to read? Weigh up what burden a new KM system will place upon the people within your organisation, and what benefits they will receive from it. Will they want to use it, or be forced into it and end up cutting corners (which negates the effectiveness of the system)?

The manufacturers that are dedicated to the explicit development of knowledge management systems may offer a core product that is designed specifically to provide most, or all, of the above features, and/or provide an infrastructure on which traditional IT manufacturers can build. Such an infrastructure may be able to handle all of an organisation's unstructured and structured information and communications, in a variety of media formats.

For example, Autonomy, an established UK knowledge management company, uses mathematical algorithms to perform its concept matching of documents. But the really interesting aspect is that it can work with any kind of data source, e.g. unstructured text in English or another language, or even speech.




Sticking with the above example, it identifies the patterns that naturally occur in text, based on the usage and frequency of words or terms that correspond to specific ideas or concepts. Based on the occurrences of one pattern over another in a piece of unstructured information, it enables computers to understand that there is an X per cent probability that the document in question is about a specific subject. This effectively extracts a document's digital essence, encoding the signature of the key concepts, which enables operations to be performed on that text, automatically.

The technology is based on Bayesian Inference and Claude Shannon's principles of information theory. Part of this is based on the work of Thomas Bayes, an 18th century English cleric, whose work included calculating the probabilistic relationship between multiple variables and determining the extent to which one variable impacts on another. A typical problem is to judge how relevant a document is to a given query or agent profile. Extensions of the theory go further than relevance information for a given query against a text. Adaptive probabilistic concept modelling analyses correlations between features found in documents relevant to an agent profile, finding new concepts and documents. Concepts important to sets of documents can be determined, allowing new documents to be accurately classified.

Claude Shannon discovered that information could be treated as a quantifiable value in communications, and represented mathematically using his information theory. Natural languages contain a high degree of redundancy. A conversation in a noisy room can be understood even when some of the words cannot be heard; the essence of a news article can be obtained by skimming over the text. Information theory provides a framework for extracting the concepts from the redundancy. Autonomy's approach to concept modelling relies on Shannon's theory that the less frequently a unit of communication occurs, the more information it conveys. Therefore, ideas that are more rare within the context of a communication tend to be more indicative of its meaning. It is this theory which enables its software to determine the most important (or informative) concepts within a document.

In terms of other approaches, artificial intelligence (AI) offers methods for managing knowledge. For example, neural networks can be trained to perform well in a variety of pattern matching tasks, where the data may be any multimedia format or real-world scenario. The subject of AI is vast and beyond the scope of this report. It is the author's belief that AI probably offers the long-term solution to knowledge management and knowledge application.

4.3.2 Introduction to Agents

Agents typically have the following attributes:

- A task or objective can be delegated to them
- They are autonomous, and
- They can make decisions.

An agent may also be capable of collaboration and learning. An agent may collaborate with servers and/or other agents. Such interactions form part of peer-to-peer networks (P2P) and "grid computing". Mobile agents are capable of roaming wide area networks or the Internet, interacting with foreign hosts, gathering information and reporting back, having performed the duties set by their user.

Such agents may be a software package on your local computer (or another device), and/or a facility that you access through a Web server. Their underlying technology can be almost any IT technology, from standard programming languages, to AI languages, neural networks, and even biological evolution theory, in which the "fittest" agents "reproduce" and only the fittest survive.

As with KM systems, the purists have various definitions that try to define the terminology more precisely. In particular, intelligent agents stir up many debates because of the difficulties associated with agreeing on what exactly intelligence means.

However, today, agents do not have a level of intelligence that allows them to process natural language, compare that against their own knowledge and understanding, and make decisions and take actions in the same way that people do. Therefore, to allow the current generation of agents to perform useful and "meaningful" tasks they have to operate in a highly structured and formalised environment, which presents little room for ambiguity. This means agents generally work best with well-defined data fields, values and structures, and a formalised set of rules.

There are various roles for agents, which currently include, but are not necessarily limited to:

- E-commerce - finding requested products or services, filtering those down by given criteria such as price, quality, delivery date, and specification details, negotiating the best price, and (optionally) actually making the decision to buy
- Managing emails and calendars, and setting up meetings
- Workflow management
- Network management, air-traffic control, business process re-engineering
- Data mining
- Education
- Personal digital assistants (PDAs)

These are at varying levels of development and capability, although some have been in use for years within specific domains.

Another example often quoted is the scenario where you ask your agent to arrange you a holiday. In this scenario it requires the agent to use its knowledge of your preferences, look up information in various online sources (e.g. holiday destinations, hotels, car hire, flights), make recommendations, and book it for you.

Some of the above decision making roles require us to consider what level of trust and responsibility we wish to assign to our agents. We can probably expect to hear more about this in the future, as they slowly start to take a bigger part in our lives.

Information agents have come about because of the sheer demand for tools to help us manage the explosive growth of information we are experiencing. Information agents perform the role of managing, manipulating or collating information from many distributed sources. Within the context of KM, the roles for agents include searching, resource discovery, information retrieval and extraction. It is anticipated that agents may eventually remove the need for directories and search engines. (Although, behind the scenes, a modified search engine may serve the agents.)

However, we tend to express knowledge in our natural language, whereas agents require a much more structured and formally defined environment. Ontology, described in the following section, may eventually bridge this significant gap between the two environments of human and machine. This represents one of the greatest challenges for agents. Currently, with the exception of the concept matching mentioned earlier, agents still have a long way to go, and their success will be dependent on the success of the initiatives mentioned in the following chapter.

4.4 Knowledge Representation

How much do we really know about knowledge? What's the meaning of meaning? What does the word "mean" mean? When you start to get serious about KM systems and intelligent agents it's time to start answering basic, yet fundamental, questions about the aspects we often take for granted. Some people approach the technological development of knowledge systems by emulating human cognitive processes or other organisational and evolutionary processes found in nature. Some take the less glamorous route and build on what they are most familiar with, step by step.

4.4.1 Terminology

We've already discussed directories and their hierarchical classification systems. The terminology used for such classification structures is taxonomy. In order to represent knowledge and imply meaning, a standard set of terms, vocabularies, rules, and data structures needs to be agreed upon. This is referred to as ontology. The next generation of the Web has been called the Semantic Web - a Web of resources that offer the possibility of communicating meaning to computer applications. A great deal of the development activity for this is based around metadata, RDF, ontology and XML.

4.4.2 Metadata

Metadata is data about data, or data that describes a resource, or describes processes and resources associated with that resource. For example, library cataloguing and classification systems provide metadata on books, e.g.: topic, title, author, publisher, date, and location. Sticking with this example, it is easy to see that one of the key purposes of metadata is to aid the search for a resource (also called resource discovery). Effective metadata systems offer the potential for far more effective searches on the Web. In some communities there is currently a resurgence of interest in metadata, and in RDF. For those that are familiar with meta-tags used in HTML Web pages, you should be able to relate to another metadata standard, the Dublin Core. This defines a basic set of metadata fields.
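
For illustration, Dublin Core fields are often embedded in a Web page using ordinary meta-tags (they can also be expressed in RDF/XML); the values below are made up for this example:

    <head>
      <!-- Dublin Core metadata expressed as HTML/XHTML meta elements -->
      <meta name="DC.title"   content="Avoiding Information Overload" />
      <meta name="DC.creator" content="Adam Bostock" />
      <meta name="DC.subject" content="knowledge management; XML; metadata" />
      <meta name="DC.date"    content="2002-06" />
      <meta name="DC.type"    content="Text" />
    </head>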

Metadata has the potential to improve the capabilities of search and information retrieval systems. However, metadata is probably less effective at information extraction; many of these limitations are related to human aspects (see Appendix E - Metadata and Information Extraction).

4.4.3 Resource Description Framework (RDF)

RDF is intended to describe Web resources, but it is not necessarily restricted to this role. It represents a basic framework, which can be extended to include domain specific terminology. It provides a basis for representing metadata and ontology. RDF can be represented diagrammatically, but it is also expressed using XML schemas, which support extensibility through namespaces. It is a generic building block, and there are lots of applications based on RDF.
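
As a small sketch (using the Dublin Core vocabulary mentioned above; the resource URL and property values are illustrative only), one Web resource can be described in RDF's XML syntax like this:

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- A description of one Web resource, identified by its URI -->
      <rdf:Description rdf:about="http://www.example.org/report.html">
        <dc:title>Avoiding Information Overload</dc:title>
        <dc:creator>Adam Bostock</dc:creator>
        <dc:subject>knowledge management</dc:subject>
      </rdf:Description>
    </rdf:RDF>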

4.5 XML

4.5.1 The Purpose of XML

We have all probably heard of HTML (Hyper-Text Mark-up Language) for designing Web pages. HTML has been very successful at presenting information to humans, but it has very little to offer in terms of intelligent computer to computer communications because most of the language focuses on defining how information should look rather than what the data means. That's where XML comes in.

Unlike HTML, XML does not focus on defining how information should be presented visually; instead it defines data types and structures. Used with a corresponding ontology we can effectively assign terms, vocabulary and possibly "meaning". This allows computer programs to interact with each other, without human intervention, and perform useful functions on the data.

4.5.2 What is XML?

The eXtensible Mark-up Language (XML) could be called a meta-language in that it is used to define other mark-up languages. For example, the mathematics community has created MathML, and the chemistry community CML. (There is even an XML representation of HTML, called XHTML.)

4.5.3 What does XML look like?

XML uses tags, as does HTML. XML allows you to create a new language with its own tags (or elements), attributes, and structure. (See Appendix A, and the XML version of this document.)
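
As a very small illustration (the element names are invented for this example; Appendix A contains a fuller one), an XML fragment describing a book might look like this:

    <book>
      <title>Avoiding Information Overload</title>
      <author>Adam Bostock</author>
      <published year="2002" month="06" />
    </book>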

4.5.4 XML Search Tools

XML can be used to provide much more powerful search capabilities than current search engines, which work with HTML data. It becomes possible for search tools and agents to know which data field types should be searched in order to match your query and return relevant results.

For example, an XML based search engine would theoretically be able to search for a person called "Joe Bloggs" and perform the search correctly by, say, only searching in the <name>, <first> and <last> fields of an Internet resource. Optionally, it could also check for the existence of other field definitions that also represent a person's name, and search those as well. Such an engine would not return a resource that contained a company called "Joe Bloggs Internet Company". Of course, if the searcher wanted to they could include such matches too.
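
To make the example concrete (this structure is a sketch implied by the field names above, not a defined standard), the two resources might contain:

    <!-- A resource describing a person: matched by a search on name fields -->
    <person>
      <name>
        <first>Joe</first>
        <last>Bloggs</last>
      </name>
    </person>

    <!-- A resource describing a company: not matched, because "Joe Bloggs"
         appears in a company field rather than a person's name -->
    <company>
      <companyName>Joe Bloggs Internet Company</companyName>
    </company>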

4.5.5 Document Type Definition (DTD)

To create a new XML language, a plain text file is created which contains the DTD. This defines the tags (or elements) and attributes of the language. Defining a DTD is not for the faint hearted or those lacking significant IT literacy skills. However, this may not be a relevant barrier to the success of XML. Given that a DTD is probably going to define a community-wide standard language, it will require the co-operation of a critical mass of key players and standards bodies. So although, in theory, anyone who is technically minded can create a DTD in isolation, in practice it may not be a wise thing to do. In the next chapter we look at a replacement for DTDs: schemas and namespaces.
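
For illustration, a DTD for the hypothetical person record sketched in the previous section might read as follows (again, an invented example rather than any agreed standard):

    <!-- person.dtd: a person has exactly one name, made of a first and last part -->
    <!ELEMENT person (name)>
    <!ELEMENT name   (first, last)>
    <!ELEMENT first  (#PCDATA)>
    <!ELEMENT last   (#PCDATA)>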

5 Developments

In this chapter we investigate the developments in technology and standards, but you may also like to view the practical suggestions of Appendix B - Developing Your Web Site to Generate XML.

5.1 XML Search Engines

Once a critical mass of Web sites begins to offer their content in a choice of traditional HTML format or as XML structured data, we can expect to see a significant development in the search capabilities of mainstream search engines, as illustrated in the previous chapter. However, there are some issues that influence the success of generic XML based search engines. XML allows anyone to create his or her very own new language. Here's where a potential danger lies for the Web.

The development of standard languages, data fields, vocabularies, structures and "meanings" is vital. That means a critical mass co-operating and agreeing on what the standard is. So far XML derived languages have been developed for use within specific domains (e.g. MathML, CML, and molecular biology). But developing a useful generic language may require much more thought and effort. However, we do not have to develop a complete generic language. As the name implies, XML is extensible, and this means that a generic core language (and ontology) could be developed, and domain specific extensions added as appropriate. For a generic XML search engine to quickly become successful it would help if the wider Internet community adopted such a core approach. (For example, having agreed definitions for things like a person's name, a company name, postal address, email address, etc.) If there is no agreement, then how would a generic search engine know that <nameFirst>Joe</nameFirst> in Language_1 means the same as <fn>Joe</fn> in Language_345?

It is possible to cater for a wide range of independently developed languages, but this would require a mapping methodology for each language. The development of the mapping process would almost certainly require manual input for each supported language, because it requires an intelligent understanding of the meanings in each language - the challenge for ontology. Provided the number of independent languages is relatively small this is feasible (a minimal mapping is sketched below). However, it probably makes more sense to agree on a common core in the first place. Fortunately, there are contenders for this role (RDF, Dublin Core, the Text Encoding Initiative and others).
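
As a minimal sketch of such a mapping (the element names follow the hypothetical Language_1 and Language_345 above), an XSLT template can translate one vocabulary into the agreed core one:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Map the Language_345 element <fn> onto the core element <nameFirst> -->
      <xsl:template match="fn">
        <nameFirst><xsl:value-of select="."/></nameFirst>
      </xsl:template>

      <!-- Copy everything else through unchanged -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
    </xsl:stylesheet>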

Some search engines already use XML behind the scenes, but we are now starting to see XML based services being opened up to the Internet community (see the section on SOAP).

5.2 Web Browsers

In the futuristic world of XML, knowledge management systems and intelligent agents, perhaps Web browsers, as we know them, will become redundant. Your own personal intelligent agent could be the only (technology) interface you require. It could handle all of your knowledge management activities and present the information in a format most suited to you, taking into account your physical access device, your choice of international language and any special needs. This is not far fetched. XML provides an excellent building block for knowledge systems, and XML provides international language support. XSLT provides the ability to retrieve an XML knowledge resource, extract relevant data sets and present those in whatever visual (or otherwise) format the user requires.

In the interim, Web browsers are still needed to cope with the transition from HTML to XML. At this half way house, developers can build systems that store data internally as XML and exchange this directly with other XML systems, but when a traditional Web browser makes a request XSLT can be used to transform the XML data into an HTML Web page. On some Web sites this already happens.

Some browsers (e.g. Internet Explorer 5+) can already display XML in its raw state, and no doubt we will see a growing selection of plug-in modules, scripts and applets designed to handle XML data.

5.3 Knowledge Management Systems and Agents

Earlier we described how knowledge management systems could perform functions on unstructured data (e.g. textual documentation). Some of these systems have been enhanced to work with XML structures, as input, storage and/or export formats. Adding structure and ontology to a KM system provides an opportunity for enhanced capabilities, and benefits to end users. It is still early days for ontology developments and applications, and so we have not yet seen the full benefits of this.

As with KM systems, artificial intelligence and intelligent agents represent a huge field of activity and specialisation which is far beyond the scope of this report. As the most promising developments in these areas become available it should be possible to eventually plug them into an XML framework. XML is effectively a low level data layer, to which higher intelligent application layers can be added.

5.4 Metadata

Metadata has been in use for years and it continues to play a significant role. Currently, various interest groups are using RDF to build a range of applications [references].

5.5 XML Extensions and Applications

5.5.1 Schema and Namespaces

So you have been slogging away, climbing the learning curve for DTD syntax, and some bright spark tells you that we do not do that anymore! Well, DTDs still exist and work, but there is a new alternative - the schema. A schema provides more functionality than a DTD, and seems to be the norm for implementing RDF and domain specific ontologies. Schemas support the definition of data fields (elements or tags), attributes, structure, data values and ranges, and extension through the use of namespaces. If you are familiar with object oriented programming and classes then getting used to the concepts of a schema will probably be easier, though not necessarily easy. DTD syntax is different to XML, whereas a schema is written in XML. Each schema is referenced via a namespace - a name that references a specific URI, which holds the schema definition. In this way you can extend the range of definitions available by including references to more than one schema. (See Appendix C.)
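
As a brief sketch of that last point (Appendix C has the report's own example; the "people" vocabulary below is invented), an instance document can draw definitions from more than one namespace:

    <!-- An instance document drawing definitions from two namespaces:
         a hypothetical "people" vocabulary and the Dublin Core elements -->
    <p:person xmlns:p="http://www.example.org/people"
              xmlns:dc="http://purl.org/dc/elements/1.1/">
      <p:first>Joe</p:first>
      <p:last>Bloggs</p:last>
      <dc:description>Author of the example report</dc:description>
    </p:person>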

5.5.2 XLink, XPointer, XPath, XForm and XQuery

For the sake of completeness, you may be interested to know that there are several XML extensions, both existing and under development. These include enhanced capabilities for dealing with links, and referencing sections within a document (XLink, XPointer and XPath). XForm is the XML equivalent of HTML forms, and I bet you can guess what XQuery deals with.

5.5.3 XSL and XSLT

Those who are familiar with HTML may know that cascading style sheets (CSS) can be used to control how a document will look. CSS can be used on XML documents, but XML has its own version of this, the eXtensible Stylesheet Language (XSL). This allows control over the formatting and it can be used to transform the data. This XSLT capability can extract subsets of data from an XML resource, rearrange the order of the data, or even transform it into another language (e.g. HTML or XHTML).
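
As a small sketch of the transformation role (reusing the person record invented earlier), a stylesheet can turn XML data into an HTML fragment for display:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Present each person record as an HTML paragraph: "Bloggs, Joe" -->
      <xsl:template match="person">
        <p>
          <b><xsl:value-of select="name/last"/></b>,
          <xsl:value-of select="name/first"/>
        </p>
      </xsl:template>
    </xsl:stylesheet>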




5.5.4 WSDL and SOAP

We saw earlier that XML allows computer systems to interact directly with each other, and this is where WSDL (Web Services Description Language) and SOAP (Simple Object Access Protocol) fit in. Using these, an XML application, such as an agent, can find specific online services and exchange data as XML objects. Google has released an API based on SOAP, which allows any Web developer to produce applications, or agents, which tap into Google's services (e.g. search and spell checking).
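
As an outline of what such an exchange looks like on the wire (the service namespace, operation and parameter names below are invented for illustration, not Google's actual API), a SOAP request is simply an XML document posted over HTTP:

    <?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <!-- A call to a hypothetical search operation with two parameters -->
        <search xmlns="urn:example-search">
          <query>knowledge management</query>
          <maxResults>10</maxResults>
        </search>
      </soap:Body>
    </soap:Envelope>

The matching WSDL document would describe, in XML, which operations the service offers and what data types they expect, so that an agent can discover and call the service without human intervention.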

5.5.5 DAML+OIL

The DARPA Agent Mark-up Language / Ontology Inference Layer provides a framework in which agents can perform useful functions on the data of a resource. This is the layer that helps to assign meaning to XML data, and it represents a key challenge in building the Semantic Web. [References]

5.6 Knowledge Application Tools

Will the above standards and technologies help to manage knowledge better? Yes, probably. Will they solve all the issues associated with knowledge management? Probably not. Until we have sophisticated intelligent agents there is another important aspect to knowledge management - humans. Key issues: the way we publish knowledge and present information (see Appendix D); the volume of unstructured information available and the time it takes us to read it all; and the learning time required to translate it into personal knowledge, understanding and skill.

Duplication of published knowledge is also one of the biggest issues facing knowledge management. For example, much has been written on XML, duplicating similar knowledge over and over again. It would be great if readers only had to encounter new and unique extensions to their knowledge. Currently a reader may have to read all chapters in every article, on the off chance that it may contain a new gem of knowledge. The concept of an extensible knowledge model (and application tool) aims to represent knowledge in such a way that new knowledge can be integrated into the model by any author, without duplicating content. This requires authors to conform to the structure of a knowledge model.

Research and learning consume significant amounts of time. The author is conducting primary research into knowledge models and knowledge application tools, with the aim of reducing the research and learning effort. Knowledge models can be built to represent a system or process, and knowledge application tools used to guide users through that process - without having to learn all the aspects of that process. This work builds on concepts such as wizards and expert systems.

6 Assessment

Search service providers, Web portals, and amalgamations of community Web sites could all help their users to benefit today, just by adopting the current generation of knowledge management systems. Search and information retrieval is only one part of Internet knowledge management. Because there is so much relevant information on the Internet, systems that perform effective information extraction may offer better and longer-term benefits than information retrieval alone.

Metadata has a useful role to play, especially for some purposes, but it cannot do everything. Metadata offers the potential to improve the functionality of Internet search tools, and resource discovery services, over the short- to medium-term. It also has a key long-term role in describing processes associated with a resource, and describing the content of resources that cannot easily express themselves (e.g. an image). (However, some knowledge management systems do have the capability to classify and match concepts in various multimedia formats.) Metadata standards development, agreement and implementation require significant effort and cost. Adding metadata is extra work, and many end users lack the time and skills. Where there has been an incentive to add metadata on the Web, some have abused the system, and consequently made metadata ineffective. There are groups looking at this, and at trust systems, but it is not yet clear if such technologies can outwit the spammer.

Although XML is about defining data types and structures, one of the key opportunities of the XML initiative is to (indirectly) allow "meaning" to be embedded into the content of the resource itself. In other words, in a totally XML content enabled world there would be less need for metadata. Fortunately, some of the metadata work offers a useful basis, or core, for XML-enabled content. For example, RDF provides a framework upon which ontologies have been built, which may be used either for metadata purposes or developed for use as content models.

XML is making its mark, and this could be the "XML decade". XML provides the much needed data structure for computer-to-computer interaction. However, the associated learning curve(s), particularly for schemas, can be steep. The availability of good user-friendly development tools will be critical in persuading the wider community to adopt XML as an alternative to HTML.

Even with user-friendly XML tools, though, there is still a fundamental issue for the Semantic Web. Many humans are not good at using lots of small data bricks (XML tags) to build robust, scalable and complex knowledge bases, which are populated with content. It presents an author with a potentially complex and tedious task, and under such circumstances some people tend to cut corners - which can bring the whole tower of data bricks tumbling down. Remember the old saying: "Garbage in, garbage out". To address this issue we probably need XML tools that "intelligently" assist the user.

There is a blurred boundary between knowledge management systems and artificial intelligence. It is probably reasonable to state that the current generation of knowledge management systems is an interim measure, to be eventually superseded by AI systems. Such systems will probably be able to process natural language and XML encoded content.

One thing is clear for the future of Internet based knowledge management and the Semantic Web: their success will require the development and integration of various data standards, ontology definitions, and knowledge management and agent technologies. It will take a concerted and significant effort to get there, but the benefits will be well worth it.

We should also think about how authors represent knowledge and present information, and how users apply knowledge. We have the Text Encoding Initiative; what about a knowledge encoding initiative? Be open minded, and always have the end goal in mind: the reason we want knowledge is to…?

There are already pockets of XML and knowledge management activity, but on the Internet as a whole it is still early days for both, with little overall uptake. Still, a lot can (and probably will) happen on the Internet in the next five years - just look at the growth of the Web over the last five years!

In the future, the Web may include the following features.

[Diagram omitted: a layered view of a possible future Web, with content data formats (blue), metadata (yellow) and applications (green).]

In the above diagram, blue represents the data formats used to store content, yellow represents metadata, and applications are in green. The layering attempts to represent where the greatest focus and development activity may be - though this should not be taken too literally. The benefit of the above scenario would be to increase the effectiveness of Internet searches. Search engines, knowledge management systems, and agents would be able to match queries against relevant data fields and so return much more relevant results.

The probability of this happening is very much dependent on what happens in the marketplace. For example, how will key IT manufacturers implement and adhere to the standards, how will they collaborate, and how will they market their products and services?

The response by end users, particularly authors and online publishers, is critical to the success or failure of the above scenario. Firstly, there has to be significant agreement by a critical mass in each community, and the Internet community as a whole, on key XML languages and ontologies. This also requires co-operation, awareness raising and education reaching a wider community than those currently involved in R&D in this area. It is also critically important that XML development tools, and authoring tools especially, are both user friendly and intelligently helpful. With regard to that latter point, tools need to simplify, and if possible automate, the tedious drudgery of having to fill in lots of fields that represent a particular XML document type. If this does not happen then in many cases end users may be overwhelmed by the apparent complexity and tediousness of some XML document structures, their syntax and the meanings in the associated vocabulary. The tools need to be as easy to use and understand as today's user-friendly Web page design tools - irrespective of the underlying complexity in the schemas and ontologies. All this is possible, but not easy. Finally, the benefits of adopting these new technologies have to be clear and significant, particularly where an apparent burden is being placed on the user. The issues with metadata have been discussed, and these need to be addressed if the above scenario is to become a reality.















[Diagram omitted: a second, longer-term layered scenario for the Web, in which information extraction services and agents build on the XML, metadata and ontology layers.]

The above scenario represents a much more exciting place than the 2007 scenario, because it provides an environment in which information extraction and agents come to the fore. Users would benefit from the presentation of concise relevant facts or extracts, and the burden of wading through pages and pages of "relevant" documents would be reduced. It also represents a changing emphasis away from writing text based reports, towards knowledge encoding and thinking more about why we want knowledge and how we apply it. This could see an emphasis on agents and tools that run services and deliver solutions, rather than so-called knowledge management tools that just deliver information. Will it happen? Well, on the Internet a lot can happen in ten years - so who knows?




7 References

A concise, interactive, XML version of this report can be found here: Acro-Report (www.acrologic.co.uk/cgi-bin/src.pl?r=ikm-16.446).

7.1 Agents

Introduction to Agents (http://www.labs.bt.com/projects/agents/publish/papers/review1.htm)

What is an Agent? (http://www.labs.bt.com/projects/agents/publish/papers/review2.htm)

An Overview of Different Agent Types (http://www.labs.bt.com/projects/agents/publish/papers/review3.htm)

Some General Issues and the Future of Agents (http://www.labs.bt.com/projects/agents/publish/papers/review4.htm)

The Agent Cities initiative (http://www.agentcities.org/), for instance San Francisco (http://sf.us.agentcities.net/)

7.2 Applications and Projects (Miscellaneous)

Business Pros flock to Weblogs (http://www.msnbc.com/news/737986.asp?cp1=1)

Capturing the State of Distributed Systems with XML (http://www.cs.caltech.edu/~adam/papers/xml/xml-for-archiving.html)

Chemical Markup Language (http://www.venus.co.uk/omf/cml/)

Collaborative Management Environment (CME) - use of XML as a cost-efficient alternative for acquiring, storing, querying, and publishing heterogeneous and distributed enterprise information; a system designed to automate and enhance the management of research proposal information from multiple independent research organisations (http://www.epm.ornl.gov/ctrc/cme/overview4.html)

Evolution of the .NET technologies (http://builder.com.com/article.jhtml;jsessionid=Y0GDJU5V3AYO4QD23UZCFFA?id=u00320020329gcn01.htm&vf=crg&rcode=u001)

General SGML/XML Applications (http://xml.coverpages.org/gen-apps.html)

Projects at Columbia (http://www.cs.columbia.edu/nlp/projects.html), including PERSIVAL: Personalised Search and Summarisation Over Multimedia Information (http://www.cs.columbia.edu/diglib/PERSIVAL/)

Report cautioning federal agencies to go slowly in adopting XML (http://zdnet.com.com/2100-1105-877856.html)

Text Encoding Initiative (http://www.tei-c.org/)

7.3 Knowledge Management Technologies and Products

Are computer programs smart enough to compete with human intelligence? (http://www.thestandard.com/article/0,1902,16805,00.html)




Artificial intelligence tools offer new ways to explore Web content (http://www.intelligentkm.com/feature/010507/feat1.shtml)

Autonomy and XML support (http://www.autonomy.com/autonomy_v3/Content/Technology/Technology_Benefits/XML)

[Example of multimedia support:] Intelligent Data Operating Layer (http://www.autonomy.com/Content/IDOL/)

[Issues with:] Manual Tagging (http://www.autonomy.com/Content/Technology/Other_Approaches/Manual_Tagging)

[Limitations of:] Keyword Searching or Boolean Query (http://www.autonomy.com/Content/Technology/Other_Approaches/Keyword_Searching)

Adding Intelligence to XML (http://www.autonomy.com/autonomy_v3/Content/Technology/Technology_Benefits/XML)

Collaborative Filtering or Social Agents (http://www.autonomy.com/Content/Technology/Other_Approaches/Collaborative_Filtering)

Brint Knowledge Management Portal (http://www.brint.com/km/)

Information Today (http://www.infotoday.com/)

InForum '99 - Improving the Visibility of R&D Information (http://www.osti.gov/inforum99/)

Intelligent Knowledge Management - Features (http://www.intelligentkm.com/feature/)

Introduction to Knowledge Management (http://www.kmresource.com/exp_intro.htm)

Knowledge Management Internet Resources (http://www.icasit.org/km/links.htm)

Open Directory: Reference: Knowledge Management (http://dmoz.org/Reference/Knowledge_Management/)

7.4 Metadata

Glossary (http://www.getty.edu/research/institute/standards/intrometadata/4_glossary/index.html)

Potential issues and limitations: Putting the torch to seven straw men of the meta-utopia (http://www.well.com/~doctorow/metacrap.htm)

7.5 Metadata (Dublin Core)

The Dublin Core Metadata Initiative (http://purl.org/dc/) (PURL) or http://dublincore.org

The names of the Dublin Core Metadata Initiative namespaces:
http://purl.org/dc/elements/1.1/
http://purl.org/dc/terms/
http://purl.org/dc/dcmitype/

The EOR Toolkit (http://eor.dublincore.org/)




7.6 Metadata (Miscellaneous)

RSS - Rich Site Summary (http://my.netscape.com/publish/)

The Warwick Framework (http://www.dlib.org/dlib/july96/lagoze/07lagoze.html)

7.7 Metadata (RDF - Resource Description Framework)

An introduction to RDF: Exploring the standard for Web-based metadata (http://www-106.ibm.com/developerworks/library/w-rdf/)

Resource Description Framework (RDF) (http://www.w3.org/RDF/)

RDF Primer (http://www.w3.org/TR/rdf-primer/)

RDF Model and Syntax Specification (http://www.w3.org/TR/REC-rdf-syntax)

RDF/XML Syntax Specification (http://www.w3.org/TR/rdf-syntax-grammar)

RDF Schema Specification (http://www.w3.org/TR/rdf-schema)

Open Directory: RDF: Applications (http://dmoz.org/Reference/Libraries/Library_and_Information_Science/Technical_Services/Cataloguing/Metadata/Resource_Description_Framework_-_RDF/Applications/)

RDF Parser (http://www.swi.psy.uva.nl/projects/SWI-Prolog/packages/sgml/online.html)

Subject Listing for RDF Applications (http://www.ilrt.bristol.ac.uk/discovery/rdf-dev/roads/subject-listing/rdfapps.html)

Survey of RDF data on the Web (http://www.i-u.de/schools/eberhart/rdf/rdf-survey.htm)

Thinking XML: Basic XML and RDF techniques for knowledge management (http://www-106.ibm.com/developerworks/library/x-think4/)

7.8 Ontology

Requirements for a Web Ontology Language - includes a definition of ontology (http://www.w3.org/TR/webont-req/)

OntoWeb Deliverables (http://www.ontoweb.org/deliverable.htm)

SENSUS with 70,000 nodes! Browse it with Ontosaurus (http://mozart.isi.edu:8003/sensus/sensus_frame.html)

About DAML - DARPA Agent Mark-up Language (http://www.daml.org/about.html)

Welcome to OIL (http://www.ontoknowledge.org/oil/)

OIL Semantics (http://www.ontoknowledge.org/oil/downl/semantics.pdf)




An Introduction to OilEd (http://www.cs.man.ac.uk/~robertsa/daml-oil-workshop/oiled-tutorial/index.html)

OilEd Publications (http://oiled.man.ac.uk/publications.shtml)

OIL Technical Report (http://www.ontoknowledge.org/oil/TR/oil.long.html)

OIL White Paper (http://www.ontoknowledge.org/oil/downl/oil-whitepaper.pdf)

OilEd - OIL Editor (http://oiled.man.ac.uk/)

DAML application: TAMBIS for molecular biology (http://www.cs.man.ac.uk/~horrocks/Ontologies/tambis.daml)

7.9 Search Engines and Directories

Beaucoup [List of Search Engines] (http://www.beaucoup.com/)

Open Directory (http://dmoz.org/)

Search Engine Watch (http://www.searchenginewatch.com/)

7.10 Semantic Web

Semantic Web Road map (http://www.w3.org/DesignIssues/Semantic.html)

The Languages of the Semantic Web (http://www.newarchitectmag.com/documents/s=2453/new1020218556549/)

The Semantic Web (abstract) (http://www.sciam.com/2001/0501issue/0501berners-lee.html)

Web Architecture from 50,000 feet (http://www.w3.org/DesignIssues/Architecture.html)

7.11 Web Services

REST could burst SOAP's bubble (http://searchwebservices.techtarget.com/qna/0,289202,sid26_gci823342,00.html)

SOAP (http://www.w3.org/TR/SOAP/)

SOAP Standards (ownership) (https://answers.google.com/answers/main?cmd=threadview&id=2020)

Develop Your Own Applications Using Google (http://www.google.com/apis/) and Google's Gaffe (http://www.xml.com/pub/a/2002/04/24/google.html)

WSDL (http://www.w3.org/TR/wsdl)




7.12 XML

XML (http://www.w3.org/XML/)

An Introduction to the Extensible Markup Language (XML) (http://www.personal.u-net.com/~sgml/xmlintro.htm)

The XML Cover Pages (http://www.oasis-open.org/cover/sgml-xml.html)

The Evolution of Web Documents: The Ascent of XML (http://www.cs.caltech.edu/~adam/papers/xml/ascent-of-xml.html)

The Extensible Markup Language (XML) (http://www.sgml.u-net.com/xml.htm)

X Marks the Spot (http://www.cs.caltech.edu/~adam/papers/xml/x-marks-the-spot.html)

XML: The Universal Publishing Format (http://www.oasis-open.org/cover/bosakParis9805/sld00000.htm)

DTD

Building an XML application, step 1: Writing a DTD (http://www-106.ibm.com/developerworks/library/buildappl/writedtd.html)

Declaring Elements and Attributes in an XML DTD (http://www.rpbourret.com/xml/xmldtd.htm)

Schema

W3C XML Schema Needs You (http://www.xml.com/pub/a/2002/03/27/deviant-schemas.html)

XML Schema Formal Description (http://www.w3.org/TR/xmlschema-formal/)

XML-Data (http://www.w3.org/TR/1998/NOTE-XML-data/)

XSL - eXtensible Stylesheet Language

A Proposal for XSL (http://www.w3.org/TR/NOTE-XSL.html)

W3C Recommendation (http://www.w3.org/TR/xsl/)

XPath - XML Path Language (http://www.w3.org/TR/WD-xptr)

XPointer - XML Pointer Language (http://www.w3.org/TR/xptr)

XLink - XML Linking Language (http://www.w3.org/TR/xlink/)

Searches and XML-Query

XML and Search (http://www.searchtools.com/info/xml.html)




XML Query Working Group (http://www.w3.org/XML/Query)

How to Apply XML:

Basic XML and RDF techniques for knowledge management (http://www-106.ibm.com/developerworks/library/x-think4/)

Building an XML application (http://www-106.ibm.com/developerworks/library/buildappl/writedtd.html)

Multidimensional files (http://www.webreview.com/1997/05_16/webauthors/05_16_97_4.shtml)

Peer-to-peer communications using XML (http://www-106.ibm.com/developerworks/xml/library/x-peer.html?loc=x)

8 Glossary

API: Application Program Interface - an interface between an application program and an IT resource or service, such as an operating system, database or another application.

ASP: Application Service Provider - a Web-based provider of IT resources, services and applications (e.g. database, knowledge management, agent, and office applications).

Blogs: Web logs - like an online diary of its author, noting activities, events and resources related to their area of interest.

Cloaking: A technique used to potentially mislead a search engine (spider) by presenting it with one set of information and presenting Internet users with another. Some search engines may frown upon this practice as it is a technique sometimes used by spammers.

DAML: DARPA Agent Mark-up Language. Also referred to as DAML+OIL, which includes work on the Ontology Inference Layer.

DTD: Document Type Definition, for the definition of XML based languages. A newer alternative is the schema.

Namespace: A URI which points to a schema. An XML document can point to more than one namespace and hence extend its vocabulary of terms and definitions.

Ontology: A common set of terms (and their relationships) used to describe and represent knowledge within a domain (e.g. education, chemistry, and cars). Ontologies can be used by automated tools to power advanced services such as more accurate Web search, intelligent software agents and knowledge management. Ontology has also been referred to as a 'concept thesaurus'. (See references for more information.)

PERL: A programming language, which is often used in Unix environments and CGI applications.

PURL: Permanent URL. A URL that will always refer to a given resource even if the hosting location (URL) of that resource changes. A permanent URL is used to redirect the user to the referenced resource. For the PURL to remain valid it must be updated by its administrator if the resource moves, and the PURL service provider must remain in existence. For example, see: http://purl.org

RDF: Resource Description Framework. A metadata model intended for representing Web resources. It is typically expressed using XML and the RDF Schema.

Schema: An alternative to the DTD, allowing the definition of an XML language through the creation of elements, attributes, values, data structures and classes.




Semantic Web: A Web which has meaning, or more particularly "meaning" to computer applications such as intelligent agents, search engines, Web services, knowledge management systems, e-business and e-commerce systems.

SEO: Search Engine Optimisation - the practice of designing Web pages with the aim of achieving high rankings on search engines. This may involve paying attention to metadata such as keywords, titles, links, and page content. However, thanks to spammers, metadata now carries little importance in some search engine algorithms.

SOAP: Simple Object Access Protocol. A wrapper or envelope (using XML) for exchanging data between applications.

Spammers: People who try to promote their services through mass email shots of unsolicited "junk mail", or other underhand practices such as cloaking.

Taxonomy: A hierarchical structure of classifications (or categories), e.g. the Yahoo! directory.

URI: A Uniform Resource Identifier (URI) supports the Uniform Resource Locator (URL) and the Uniform Resource Name (URN). Here is a technical explanation of URI.

URL: Uniform Resource Locator (URL), e.g. http://www.acrologic.co.uk. It consists of a protocol part (http, ftp, etc.), a domain name, and optionally a path, filename, and location within the file / page / resource.

URN: Uniform Resource Name (URN). This does not specify a protocol, just a name that references a resource.

Web Services: Services or resources made available through the Web to computer applications. See also some of the associated protocols for this: WSDL and SOAP.

WSDL: Web Services Description Language.

XML: eXtensible Mark-up Language, capable of representing new languages and facilitating extensions to those languages.

XPath: A method for accessing nodes (and the data) within XML documents.

XPointer: (and XLink) link and referencing language. See also XPath.

XSL: eXtensible Style-sheet Language has two capabilities: determining how information is presented; and the ability to transform data (XSLT).

XQuery: XML Query Language.

9 Appendix A - Example of XML

An extract of a hypothetical language for student data may be used like so:

<student>
  <name><first>Joe</first><last>Bloggs</last></name>
  <address>
    <street>123 The Roads</street>
    <city>The City</city>
    <postcode>CC 1XZ</postcode>
  </address>
  <course><topic>Physics</topic><code>P202</code></course>
</student>
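
To illustrate why such structure matters, the following is a minimal sketch (in Python, which is simply used here for illustration and is not prescribed by this report) of how an application could parse the example above and pick out individual facts. The element names are those of the hypothetical student language, not any real standard.

# A minimal sketch: parsing the hypothetical <student> example with Python's
# standard library.
import xml.etree.ElementTree as ET

STUDENT_XML = """
<student>
  <name><first>Joe</first><last>Bloggs</last></name>
  <address>
    <street>123 The Roads</street>
    <city>The City</city>
    <postcode>CC 1XZ</postcode>
  </address>
  <course><topic>Physics</topic><code>P202</code></course>
</student>
"""

student = ET.fromstring(STUDENT_XML)

# Unlike free text, the mark-up lets a program address individual facts directly.
first = student.findtext("name/first")
last = student.findtext("name/last")
topic = student.findtext("course/topic")
code = student.findtext("course/code")

print(f"{first} {last} is studying {topic} ({code})")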




10 Appendix B - Developing Your Web Site to Generate XML

10.1 Converting an Existing Database Driven System

Many medium and large-scale organisations could probably provide an XML option on their Web site. If dynamic Web pages are already generated from a database then two possible options are:

1. Install software that generates XML between the database server and the HTTP (Web) server; or

2. Replace the traditional database server with a specific XML object server, and import the data.

You also need to select, or develop, your chosen XML language and ontology. In the case of option 1 you will need to map database fields to corresponding XML elements, as sketched below.
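
As an illustration of option 1, here is a minimal sketch assuming a hypothetical relational table of students (the table, field names and data are invented for this example, not taken from any real system). It maps each database row onto the student XML language shown in Appendix A.

import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical student table - the schema below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (first TEXT, last TEXT, topic TEXT, code TEXT)")
conn.execute("INSERT INTO student VALUES ('Joe', 'Bloggs', 'Physics', 'P202')")

root = ET.Element("students")
for first, last, topic, code in conn.execute("SELECT first, last, topic, code FROM student"):
    student = ET.SubElement(root, "student")
    name = ET.SubElement(student, "name")
    ET.SubElement(name, "first").text = first
    ET.SubElement(name, "last").text = last
    course = ET.SubElement(student, "course")
    ET.SubElement(course, "topic").text = topic
    ET.SubElement(course, "code").text = code

# The XML produced here could be returned by a script sitting between the
# database server and the HTTP (Web) server, as described in option 1.
print(ET.tostring(root, encoding="unicode"))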

10.2 Starting from Scratch

For medium and large-scale organisations that do not already use databases and dynamic Web pages, you've got a lot of catching up to do! Although on the plus side, at least you are starting with a blank piece of paper and do not have to worry about legacy systems. Don't worry if you feel you don't have the necessary in-house expertise - you could contract the development work out, and/or make use of an Application Service Provider (ASP) who provides the XML expertise and the hosting of the data. For organisations with a permanent Internet connection, an ASP based solution could allow you to maintain your data through something as simple to use as good old Web forms. Everyone knows how to use those - don't they? As with any significant IT project make sure you are confident about the capabilities of your contractor, and make sure your organisation is able to deliver on its part of the project, including the subsequent operational and maintenance aspects.

10.3 Build Your Own XML Portal

If your organisation's role is to lead, participate in setting standards and disseminate large volumes of information electronically to your community, then perhaps you should consider developing an XML portal. The portal could provide a standard Web form to access an underlying XML search tool. Of course, the form would provide more sophisticated options than that of a conventional search engine. The portal could also support an XML channel for direct interfacing to agents and knowledge management systems.

10.4 Benefits

In the short-term the benefits of an XML Web site may be limited, unless other members of your community have already defined an XML language and are actively using it. (However, users of an XML portal would benefit immediately from day one.) In the medium-term and beyond, benefits should increase as public search engines and Internet users adopt XML technologies.

11 Appendix C - Example Use of Namespaces

For example (hypothetical extract):

<educ:student>
  <people:name>Joe Bloggs</people:name>
  <places:address>
    <places:street>123 The Roads</places:street>
    <places:city>The City</places:city>
    <places:postcode>CC 1XZ</places:postcode>
  </places:address>
  <educ:course>
    <educ:topic>Physics</educ:topic>
    <educ:code>P202</educ:code>
  </educ:course>
</educ:student>




The namespace prefixes have been highlighted to aid readability. Each schema (educ, people, and places) could be developed by independent interest groups. (On the subject of readability, yes, XML can look terribly verbose, and many actual implementations are far more cryptic looking than the above example. However, do not let that unduly worry you, as XML development tools can help to make this simpler for the developer, as HTML editors do for traditional Web pages.)
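
One detail the extract above leaves implicit is how each prefix is bound to a namespace URI. The following sketch (Python again, with invented example.org namespace URIs - real interest groups would publish their own) shows one way a tool can produce namespace-qualified mark-up; the xmlns declarations are added automatically on the root element when the tree is serialised.

import xml.etree.ElementTree as ET

# Hypothetical namespace URIs - in practice each interest group (educ, people,
# places) would publish and maintain its own schema at a stable address.
EDUC = "http://example.org/ns/educ"
PEOPLE = "http://example.org/ns/people"
PLACES = "http://example.org/ns/places"

for prefix, uri in (("educ", EDUC), ("people", PEOPLE), ("places", PLACES)):
    ET.register_namespace(prefix, uri)

student = ET.Element("{%s}student" % EDUC)
ET.SubElement(student, "{%s}name" % PEOPLE).text = "Joe Bloggs"
address = ET.SubElement(student, "{%s}address" % PLACES)
ET.SubElement(address, "{%s}city" % PLACES).text = "The City"

# Serialising emits xmlns:educ, xmlns:people and xmlns:places declarations,
# binding each prefix to its URI.
print(ET.tostring(student, encoding="unicode"))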

12 Appendix D - Information Presentation Issues

More could be done to educate authors in the way that they structure and present information. For example, masses of unstructured text without headings, without a summary and without a contents list forces readers to (skim) read an entire document in order to extract relevant gems of knowledge; whereas highly structured content allows the reader to quickly identify relevant sections.

The document type used to present information can cause difficulties on the Web too. Think about the potential issues and limitations encountered when trying to quickly extract information from the following: Acrobat (PDF) document, Word document, PowerPoint or any slide presentation, or any other document format.

Information overload is not solely due to issues with Internet searching. There are also significant issues regarding the presentation of information, particularly on a Web site. There are simple solutions to many of these issues, based on best practice and using the right tool (or document type) for the job. However, this is outside the scope of this report.

13 Appendix E - Metadata and Information Extraction

Metadata has the potential to improve the capabilities of search and information retrieval systems. However, metadata is probably less effective at information extraction; many of these limitations are related to human aspects.

Metadata represents a concise summary of a resource - it is not the resource itself. Metadata does not comprehensively describe the content of a resource. In the library example, metadata may help identify the most relevant shelf in the library but you will still have to read the books to find exactly what you are looking for. (Now imagine a Web library with over two billion URLs.)

Further aspects to consider:

- Ask ten people to summarise a resource and you will probably get ten different summaries.
- Many resources do not neatly fall into just one category of a rigid classification system.
- Many users cannot be bothered to use metadata (e.g. Word document summary).
- Maintaining effective metadata descriptions is a costly process, and requires trained staff.
- Metadata in HTML Web pages has been abused by spammers for years.
- Metadata cannot match the fine granularity of the information within the content of a resource.

It is not the intention of the author to rubbish metadata, only to point out potential issues and limitations. As a short to medium-term measure metadata has the opportunity to greatly enhance the capabilities of search tools on the Internet. Even in the long-term, metadata does have a key role to play in some areas, e.g.: describing processes associated with a resource; and describing resources that are unable to adequately express their own content (e.g. multimedia resources and abstract concepts).

Note: Although traditional metadata systems may not comprehensively represent the content of a resource, it does not mean that future metadata systems will be unable to do so. For example, in the same way that we have document summarisation systems and compression technologies, we may one day have sophisticated knowledge representation systems. Indeed, to an extent, some of today's knowledge management systems already classify the concepts embedded in the content of a document.




14 Appendix F - Internet Directories

Internet Directories contain information that is classified into categories or topics. These categories are arranged in some type of hierarchy, for example: Europe > UK > Cars > Sports. (See: Taxonomy.) Typically, each category is maintained manually by one or more editors. Their role is to perform classification checks and edit resource descriptions. There are many Internet directories [References]. Many people still use directories for finding information and they provide a great starting point for browsing through categories, particularly for novices in a given topic.

However, directories have limitations. Typically, a Web site will be squeezed into one category. That classification may not be exact, or a site may host content spanning several categories. The description given for a Web site is often very concise, covering just one or two lines. (Few can satisfactorily describe what they do in fewer than two lines.) Another issue is the maintenance of a comprehensive set of up-to-date listings, given that there are billions of Web pages with a significant rate of growth in new pages and changes to the content of existing pages.

15 Appendix G - Search Engines

Typically, search engines collect information from the Web automatically and index the textual content in a database. A user enters a query as a string of words and/or phrases. The search engine will then use its own search algorithm to find the best matches in its database. [Search engine References]
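
To make the index-and-match cycle concrete, the following is a toy sketch (in Python; the two example pages and the query are invented). Textual content is indexed word by word, and a query is answered by ranking pages on how many query words they contain. Real search engines add crawling, stemming, link analysis and far more sophisticated relevance algorithms.

# A toy illustration of a search engine's core cycle: index textual content,
# then match a query against the index. The pages and query are made up.
from collections import defaultdict

pages = {
    "http://example.org/xml": "xml is an extensible mark up language for structured data",
    "http://example.org/km": "knowledge management tools help users find relevant information",
}

# Build an inverted index: word -> set of pages containing that word.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query):
    # Rank pages by how many of the query words they contain.
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("xml knowledge management"))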

A search engine has a number of advantages over a directory:

- The automated process can index hundreds of millions to billions of Web pages.
- The search resolution or granularity is much finer, down to individual words on a page.
- The entire content of a Web site, or page, can be searched rather than a brief description.