
Alessandro Marcias · NLP Coursework · 2523889 · 2008-2009







The Semantic Web
How NLP can resolve the Chicken & Egg Issue

By Alessandro Marcias
Student No. 2523889
2008-2009
BSc BIT

Natural Language Processing Coursework

Tutor: Dave Inman





















Abstract

This research explores the meaning, main components and vision behind the Semantic Web. The report analyses the structure underpinning the Semantic Web, and we will look at the objects and technologies that make it possible. We will look at what has been done, what challenges we face in achieving the full development of such a "dream", and what tools and techniques can help facilitate and accelerate its fulfilment.

We can only express our intents properly in our natural language; keyword search engines cannot catch exactly what we really want them to fetch.

NLP techniques are thought to be a field of AI that can help the development of the Semantic Web. We will also look at some applications that implement NLP techniques and try to solve the problems facing the fulfilment of the Semantic Web's vision.





































Content Page:

Abstract ........................................................ 2
Discussion Notes ................................................ 4
    Question for Discussion ..................................... 4
Introduction .................................................... 5
The Semantic Web: an Overview ................................... 6
    (Definition and general views of the Semantic Web)
    The Vision .................................................. 6
    The Technology .............................................. 8
Problems ....................................................... 10
    (Why is the Semantic Web so difficult to implement)
    The Chinese Room Argument .................................. 11
        - Searle: The Chinese Room Experiment .................. 11
    NLP ........................................................ 12
        - Natural Language Processing
        - The Chatbot Approach ................................. 13
        - The Child Approach ................................... 13
        - The Adult Approach ................................... 14
    The Chicken & Egg .......................................... 15
In Action ...................................................... 15
    (Overview of some applications that tackle some of the Semantic Web issues)
    Powerset ................................................... 15
    (A natural language search company)
    Juice ...................................................... 23
    (An intelligent discovery engine)
    Calais ..................................................... 27
    (Connect everything)
    Furthermore ................................................ 29
    (Links to more interesting websites)
Conclusions .................................................... 29
References ..................................................... 30
References Recommendations ..................................... 31










Discussion Notes

In 2001 Tim Berners-Lee, James Hendler and Ora Lassila published an article about the Semantic Web in Scientific American [4].

The article describes how two people, a brother and sister named Pete and Lucy, are able to let their Semantic Web agents handle the booking of their mother's physical therapy sessions; how Pete is not happy with the first solution; and how Pete's web agent finally finds a solution suitable for him, though in doing so it reschedules some less important appointments.

For many thinkers and internet speculators this visionary scenario is not that far off, even though a lot of work still has to be done before it becomes feasible.

The first vision of the Semantic Web came precisely from Tim Berners-Lee, but after him many have embraced the idea of using technological innovations not only to control information but to convert it into intelligence. [2]

By using the word intelligence I mean that the data present within the web could be analysed and processed by the machine itself in such a way that, to our eyes, it will look as if the machines were actually thinking instead of us, helping us find intelligent solutions to our problems.

Question for Discussion

But how is this possible?
What is the Semantic Web?
What are the basic bricks that make the Semantic Web possible and alive?
Can computers understand us?
How can NLP help the development of the Semantic Web?
What is already out there to help us contribute to this fascinating and exciting perspective?








Introduction

The Semantic Web is seen as an evolution of the World Wide Web in which the information it holds can be used both by humans and by machines, and can be transformed into intelligence.

To do so, not only is communication necessary at a human/machine and machine/machine level, but a system of semantic knowledge is also needed in the background, and that system will be the Semantic Web itself.

"The World Wide Web Consortium (W3C) creates Web standards and its mission is to lead the Web to its full potential, which it does by developing technologies (specifications, guidelines, software, and tools) that will create a forum for information, commerce, inspiration, independent thought, and collective understanding." [21]

One of its seven goals and operating principles is the Semantic Web. On the Semantic Web people will be able to "communicate" with computers, to express themselves in a way that a machine can compute, elaborate and exchange information and knowledge, saving us from tedious and boring jobs such as fetching a receipt, checking travel times, looking up medical information, fixing an appointment, et cetera.

These operations, called Web Services, will be carried out by Web Agents.

What are Web Agents and Web Services?

A Web Service is defined by the W3C as:

"A software system designed to support interoperable machine-to-machine interaction over a network". [21]

In the vision of the Semantic Web these services will be carried out by Web Agents.

Web agents are very complex software systems operating on the web. They shall be able to fetch, compute, elaborate and exchange data both with humans and with machines. In other words, they are pieces of software working on behalf of people. They will have access to all the information on the web and they will be able, thanks to the Semantic Web, to carry out intelligent tasks.

Unfortunately the Semantic Web is far from being fully operative. To achieve its goal we need tools and technologies that are not yet optimal. Too many different companies are offering services that claim to be the final solution to the Semantic Web's problems.

In order to better understand what these services are and what they do, I will now give a brief introduction to the vision and technologies behind the concept of the Semantic Web.









The Semantic Web: An Overview

The Vision

The Semantic Web is thought to be an evolution of the World Wide Web.

When the Web was developed, it was meant to be used not only for human-to-human communication; machines too were intended to be part of this "information space", as defined by Tim Berners-Lee [2]. But the information on the web is designed to be meaningful for humans rather than for machines. In other words, the Internet was not designed to teach machines what the information they hold really means.

The long-term aim of this vision is to imbue the Web itself with meaning. That is, providing meaningful ways to describe the resources available on the Web and, perhaps more importantly, why there are links connecting them together [11].

A Web of resources with a semantics behind it: a sort of global database of inter-correlated knowledge, in which every item or entity has its own description, specification and annotations. Computers will be able to understand a particular question asked by the user and find related documents linked together by a semantic net, and not by simple keyword comparison or ranking.

"The aim of this vision consists in creating a metadata-rich Web of resources that can describe themselves not only by how they should be displayed (HTML) or syntactically (XML), but also by the meaning of the metadata." [22]






What is metadata?

"Metadata is data associated with objects which relieves their potential users of having full advance knowledge of their existence or characteristics." [8]

Metadata can be thought of as data about data.

During the annotation process we create metadata on a particular document. In the context of the Semantic Web, an annotation is a set of instantiations attached to an HTML document. [1]

In other words, we add extra comments to our documents or web pages, and this enables the computer or other people to better understand the nature of the resource and to link resources together.

Annotations will be a sort of summary describing the content of a resource, be it a document, an image, a web page and/or basically anything that can be stored online. Annotations will also be the ground on which intelligent systems will carry out their reasoning and inference processes in order to answer a query, e.g. fetch a particular document, or simply solve our particular needs and problems. Annotations can also take the form of human-readable descriptive data, helping us better understand online content, e.g. tags on a website.
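The idea of annotations as "data about data" can be sketched in a few lines of Python. This is only an illustration: the resource URLs, field names and tags below are invented, not part of any Semantic Web standard.

```python
# A minimal sketch of "data about data": each web resource carries a
# metadata record (its annotation) alongside the content itself.
# Resource URLs and field names are invented for illustration.

resources = {
    "http://example.org/report.pdf": {
        "content": "...the document body...",
        "metadata": {
            "type": "document",
            "title": "Annual Report",
            "tags": ["finance", "2008"],
        },
    },
    "http://example.org/photo.jpg": {
        "content": "...binary image data...",
        "metadata": {
            "type": "image",
            "title": "Office opening",
            "tags": ["event"],
        },
    },
}

def find_by_tag(tag):
    """Return the URLs of every resource annotated with the given tag."""
    return [url for url, r in resources.items()
            if tag in r["metadata"]["tags"]]

print(find_by_tag("finance"))  # the report is found via its annotation, not its content
```

Note that the lookup never touches the content itself: it works entirely on the annotations, which is exactly what makes the resource usable by a machine.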





Adding annotations to web resources is the essential step to produce semantics and knowledge within the web. Semantic annotations tag ontology class instance data and map it into ontology classes. [17]

Figure 1 represents five web resources: three web pages, a document and a library.

Each resource has some arrows linking it to the symbol I used to describe annotations. Thanks to the annotations it is possible to understand to which sub-class an object belongs. Sub-classes are instances of more general classes, which are described in ontologies. Therefore it is possible to understand the nature of a resource, assign it a class and link it to other objects related to it, as described in the ontology.

Fig.1, Ontology, Class and Annotations


The main aim of the Semantic Web is in fact to empower people: to give them a better tool to connect and sift knowledge within the web.

"The central principle behind the success of the giants born in the Web 1.0 era who have survived to lead the Web 2.0 era appears to be this, that they have embraced the power of the web to harness collective intelligence". [15]

There is far too much information on the web, and we have neither the capability nor the time to go through all of it in an effective way. With the Semantic Web and its technologies working properly we will be able to find what we need effectively and efficiently.

In 1998 the basis for the technologies and architectures necessary to the fulfilment of this vision was outlined. [3]





In the next section we will look at the technologies that support, and will make possible, the Semantic Web.

The Technology

Figure 2 gives a conceptualized vision of the various components of the Semantic Web. [11]

Fig2. The Semantic Web Structure

I will now give a brief description of the conceptual layers that constitute the Semantic Web, in a bottom-up approach. [11]

Unicode and URIs

URIs (Uniform Resource Identifiers) are unique identifiers for resources of any type: for instance a document, a person, a page on the web, et cetera. Unicode is the standard for computer character representation. Together they are the foundations of the Semantic Web's architecture.




XML (eXtensible Mark-up Language)

XML, with its related standards (e.g. XML Schemas), is a language that lets one write structured Web documents, and "it is particularly suitable for sending documents across the Web. XML is a mark-up language just like HTML and allows one to write some content and provide information about what role that content plays. Like HTML, XML is based on tags." [1]

The creation of XML is considered by many as what has made, or will make, the Semantic Web possible.


"XML is a structured set of rules for how one might define any kind of data to be shared on the web. It is called 'extensible' because it can be modified to suit any purpose...... The primary goal of XML is to describe information on the web." [6]

Thanks to a DTD (Document Type Definition) we can define the tags within our XML documents.
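The point that XML tags describe what content *is*, rather than how to display it, can be seen in a small sketch using Python's standard library. The document and its tag names are invented for illustration.

```python
# A tiny XML document whose tags name the role of each piece of content,
# parsed with Python's standard-library ElementTree module.
import xml.etree.ElementTree as ET

doc = """
<book>
  <title>The Semantic Web</title>
  <author>A. Author</author>
  <year>2001</year>
</book>
"""

root = ET.fromstring(doc)
# Unlike HTML, the tags say what the content is, not how it should look:
print(root.find("title").text)   # The Semantic Web
print(root.find("year").text)    # 2001
```

A program that has never seen this document before can still pick out the title or the year, because the structure itself carries that information.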


Resource Description Framework

The Resource Description Framework, or RDF, is a framework that enables the interchange and description of metadata. RDF is a data modelling language. [22] It allows modelling information through a variety of syntax formats and enables us to classify data on the web.

RDF is graph-based, but usually serialised as XML. Essentially, it consists of triples: subject, predicate, object. [7] It also represents the relationships between entities and resources via graph models.

"RDF is a basic data model, like the entity-relationship model, for writing simple statements about Web objects (resources)". [1]
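The triple model can be sketched without any RDF library at all, using plain tuples. In real RDF the subjects, predicates and objects would be URIs; the short names here are invented for illustration.

```python
# A minimal sketch of RDF's subject-predicate-object triple model.

triples = [
    ("Paris",  "isCapitalOf", "France"),
    ("France", "isA",         "Country"),
    ("Paris",  "isA",         "City"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the given pattern (None = wildcard)."""
    return [(s, p, o) for (s, p, o) in triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)]

# "What is the capital of France?" becomes a pattern match on the graph:
print(query(predicate="isCapitalOf", obj="France"))
# [('Paris', 'isCapitalOf', 'France')]
```

The answer comes out of the structure of the data, not out of keyword matching against a document.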


RDF Schema

RDF Schema is a knowledge representation language, and it is used to describe ontologies. It works in a class-subclass fashion and gives a reasoning framework for inferring the types of resources. [11]

RDF Schema defines the vocabulary used in RDF data models. In RDFS we can define the vocabulary, specify which properties apply to which kinds of objects and what values they can take, and describe the relationships between objects. [1]
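The class/sub-class reasoning that RDFS enables can be sketched as a walk up a sub-class chain: if City is a sub-class of Place, then every City is also a Place. The class names and hierarchy below are invented for illustration.

```python
# A sketch of sub-class inference: each class maps to its parent class.

subclass_of = {
    "Capital": "City",
    "City":    "Place",
    "Place":   "Resource",
}

def all_classes(cls):
    """Walk the sub-class chain upwards, collecting every inferred class."""
    classes = [cls]
    while cls in subclass_of:
        cls = subclass_of[cls]
        classes.append(cls)
    return classes

# An annotation saying only "Paris is a Capital" lets a machine infer the rest:
print(all_classes("Capital"))  # ['Capital', 'City', 'Place', 'Resource']
```

This is the kind of type inference that lets a resource annotated with a specific class be found by queries about any of its more general classes.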



Ontology or RDF Vocabularies

Philosophically speaking, ontology is "the study of the nature of being". It is about grouping entities and structuring them in a hierarchy, subdividing them by their similarities and differences. Ontologies are also referred to as metadata vocabularies, and they are systems that provide extra constraints on things such as entity types and their attributes.

"An Ontology is a shared conceptualization of the world. Ontologies consist of definitional aspects such as high-level schemas and assertional aspects such as entities, attributes, interrelationships between entities, domain vocabulary and factual knowledge all connected in a semantic manner. Ontologies provide a common understanding of a particular domain. They allow the domain to be communicated between people, organizations, and application systems. Ontologies provide the specific tools to organize and provide a useful description of heterogeneous content." [5]


Logic and Proof

This layer is where AI reasoning kicks in: the inference process takes place at this stage. It is in fact in this system that web agents take their decisions and make their deductions on the particular problems posed by users.

"The logical foundations of the Semantic Web allow us to construct proofs that can be used to improve transparency, understanding, and trust." [12]
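The inference step at this layer can be sketched as simple forward chaining: apply if-then rules until no new facts appear. The facts and rules below are invented for illustration and have no relation to a real reasoner's rule language.

```python
# A toy sketch of forward-chaining inference over if-then rules.

facts = {"booked(therapy)", "near(clinic, home)"}
rules = [
    # (premises, conclusion)
    ({"booked(therapy)", "near(clinic, home)"}, "acceptable(plan)"),
    ({"acceptable(plan)"}, "notify(user)"),
]

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)   # derive a new fact from the rule
            changed = True

print("notify(user)" in facts)  # True: the agent can justify its conclusion
```

Because every derived fact traces back through a chain of rules to the original facts, the system can, in principle, produce a proof of how it reached its conclusion, which is what the quote above is referring to.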


Trust


In order to gain the user's trust in the truthfulness of the data, resources and conclusions elaborated by the web agents, we need to address issues such as the validation of the evidence and facts used in the logical process of finding solutions to tasks. Therefore this layer is extremely dependent on the accuracy of the metadata.

Explanation facilities can help users gain confidence in the system. I discuss explanation facilities, especially relating to recommender systems, in my final year project.

Further discussions and recommendations for interested, curious or just masochist readers can be found at the following links (Recommendations References, page 30 of this paper):



- Best Practice Recipes for Publishing RDF Vocabularies [D]
- Extensible Mark-up Language (XML) 1.0 (Fifth Edition) [E]
- OWL Web Ontology Language Guide [Q]
- OWL Web Ontology Language Overview [O]
- OWL Web Ontology Language Reference [I]
- OWL Web Ontology Language Semantics and Abstract Syntax [P]
- OWL Web Ontology Language Test Cases [G]
- RDF Primer [N]
- RDF Semantics [K]
- RDF Test Cases [J]
- RDF Vocabulary Description Language 1.0: RDF Schema [F]
- RDF/XML Syntax Specification (Revised) [C]
- RDFa Primer [A]
- RDFa in XHTML [B]
- Resource Description Framework (RDF) [M]
- Web Ontology Language (OWL) [L]
- XML Schema Datatypes in RDF and OWL [H]


Problems

That was the basic structure of the Semantic Web and its vision. Of course we could go into much deeper detail, but the aim of this paper for now is to give a generic view of it and to focus on the problems of its making and some solutions that have been applied.

If everything has already been outlined and so well structured and thought out, why is the development of the Semantic Web taking so long? Why am I still writing this paper instead of talking to my laptop? Why do I spend hours searching for some relevant document, and why, after a keyword search on any search engine, are there documents at page four that have nothing to do with my intended search?

It seems that for the Semantic Web to work, by which I mean to actually be fully part of our lives, we require computers to think.


Or maybe just to process data in a more seemingly intelligent way?

Since its first days the study of AI was intended to be the way to create, one day, a machine with a mind. Humans would be able to assemble unanimated parts together and give them life, reason, thinking; in other words, they could be God for one day. Ambition is one of the strongest stimulating feelings that humans have, and I am not going to talk about what is right to think or what is not, but I will just say that sometimes ambition can be productive while sometimes it may just be counter-productive; yet without ambition humankind would never have got anywhere far from a tree and a bunch of bananas.

The history of AI studies is full of successes and failures. Most of the time failures are due to the scope of the particular project, often too big. Because of that, AI researchers tend to carry out a bottom-up approach to solving problems, which means subdividing the main problem into smaller chunks; once every little problem has been solved, the main goal will be achieved. One of these sub-problems is to get machines to understand our language (Natural Language Processing studies, or NLP studies). Language is in fact the first barrier between human beings and computers.

Can computers understand language?
Can computers fool us into believing they understand language?



The Chinese Room Argument

There are two main schools of thought in the AI world: one is called strong AI and the other weak AI. [19]

Thinkers who reside on the strong AI side believe that computers will one day acquire the capability to think just like us: they will be able to carry out any task that we associate with the human brain. In other words, computers will one day be able to do things such as dream, imagine and create on their own, just like we do. All these attributes and processes are in fact usually referred to as defining characteristics of the human brain.

The branch of research, or school of thought, called weak AI, on the other hand, believes that computers will never be able to gain our brain's peculiar attributes. What its proponents believe is that computers will only be able to compute a finite number of tasks, following some given rules or logic that will never comprise creativity or creation. In other words, they can only mimic the process of human thinking, and it is argued that computers are not conscious of what they are doing. Therefore the study of AI can only be thought of as a tool to better understand how the human brain actually works. AI models can in fact be used to mimic the mind and to better understand what goes on inside our brain.



Searle: the Chinese Room Experiment

John Searle is one of the promoters of weak AI; in 1980 he coined the term "strong AI" and devised a thought experiment against the beliefs of that school of thought. [19] The experiment is called the Chinese Room experiment.


Searle asks the reader to imagine a room with two holes, one marked "INPUT" and the other marked "OUTPUT". Inside the room there is a person who can only speak English, together with a number of books of Chinese language rules. Outside the room there is a Chinese person, who puts a paper with a question written in Chinese through the "INPUT" hole.

The English speaker inside the room reads the first symbol of the "Chinese query", gets the book with that symbol on the cover, then checks which book to take if the first symbol is followed by the second symbol, and so on. Reputedly, in this way, thanks to the books and their links to one another, he manages to answer the question perfectly. He then puts the answer through the "OUTPUT" hole.

The Chinese person outside the room will believe that inside the room there is someone who speaks and understands Chinese perfectly, but obviously, as we know, that is not the case.

In conclusion, for Searle a symbol-processing machine, despite the fact that it can make the user believe it actually understands the deep meaning of our queries, will never have a complete understanding of what has been asked. For Searle such machines will be mere manipulators of symbols, and they will never have conscious mental states about what they are saying.
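The room can be reduced to a few lines of code: a lookup table standing in for the rule books. The question/answer pairs below are invented for illustration; the point is that the program manipulates the symbols without any notion of what they mean.

```python
# The Chinese Room as pure symbol shuffling: the "rule books" are just
# a table mapping input symbols to output symbols.

rule_books = {
    "你好吗": "我很好",        # "How are you?" -> "I am fine"
    "你会说中文吗": "会",       # "Do you speak Chinese?" -> "Yes"
}

def room(chinese_question):
    """Return the scripted answer, exactly as the person in the room would."""
    return rule_books.get(chinese_question, "请再说一遍")  # "please say that again"

print(room("你好吗"))  # 我很好: a fluent answer, with zero understanding
```

From outside, the replies look fluent; inside, there is only table lookup, which is exactly Searle's point.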


For many, the issue is not to get machines to actually have cognition of what we are saying, but just to understand us, or in other words to give us a pertinent answer when we ask them something. If computers were able to better understand our language there would be a number of advantages.

Let's think, for example, how much easier, in terms of scalability and usability, it would be to communicate with a database if we could query it in spoken English instead of using some specific query language. In this scenario anyone who could phrase a query in English could get information out of the database without studying any particular programming language.

Let's think, as another example, of a medical expert system. If the system could actually process natural language there would be no need to add new rules every time something new is discovered. We would just give it journals and reports to process, and it would extract the new information, store it, and use it for future cases.

The potential benefits from successful natural language processing are amazing, but it seems that nothing has been properly achieved. (See the In Action chapter of this paper.) Natural Language Processing is difficult; it has been studied for many years and not many solutions have proved to be fully successful.

Why?


Natural Language Processing

Natural Language Processing, or NLP, is a subfield of Artificial Intelligence. Its main concerns and efforts are directed towards the "translation" of computer-readable data and information into a more human-friendly format.


A state-of-the-art NLP system will translate normal human spoken sentences into a format understandable by machines; its goal is therefore to allow users to "talk" to computers as if they were talking to a human, and vice versa.

Of course this process raises many issues and problems. A computer just doesn't have the necessary amount of experience, or sometimes the common sense, that allows us human beings to understand sentences and the meaning behind them.


Examples of such problems can be disarming. Instead of listing all the different types of problems, which can be found on this Wikipedia page [13], I will outline the three main approaches to solving them, and I will then focus on how one of these approaches can help the Semantic Web.

NLP: 3-way Solutions

There are many ways in which scientists and researchers have tried to overcome the NLP problems; in the following sections I will outline the three that seem most promising to me:


The Chat-Bot Approach

A chatterbot (chatbot) is a computer program designed to mimic a natural conversation with humans. It consists of an input textbox, where the user types a message, and an output region, where the software displays the answer.

Chatbots don't know anything about the structure of the language. What they do is a simple find-and-match. They are actually very good at that, but the problem is that they often make stupid mistakes. This approach can be compared to a situation where two people are dialoguing in English and one of them doesn't speak English fluently: he may fool the native English speaker for a couple of ice-breaker questions, but then, inevitably, if he tries to answer, or rather to guess, every question, he will give a silly answer.

Chatbots work on keywords and pattern-matching, and so far they have not given encouraging results for the mission of NLP.
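The find-and-match idea can be sketched in a few lines: scan the input for a known keyword pattern and emit a canned reply, with no grasp of grammar or meaning. The patterns and replies below are invented for illustration and are far simpler than a real chatbot's script.

```python
# A sketch of the keyword/pattern-matching chatbot approach.
import re

patterns = [
    (r"\bhello\b|\bhi\b",    "Hello! How can I help?"),
    (r"\bweather\b",         "I hear it's lovely outside."),
    (r"\bmy name is (\w+)",  r"Nice to meet you, \1!"),
]

def reply(user_input):
    """Return the first canned answer whose pattern matches the input."""
    for pattern, answer in patterns:
        match = re.search(pattern, user_input.lower())
        if match:
            return match.expand(answer)
    return "Tell me more."          # fallback when nothing matches

print(reply("my name is pete"))     # Nice to meet you, pete!
print(reply("will it rain?"))       # Tell me more.  (no keyword matched)
```

The second example shows exactly the failure mode described above: the question is clearly about the weather, but without the literal keyword the bot falls back to a stock evasion.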


The Child Approach

A different approach is to let computers learn a language the way a child does.

Some religions and philosophies believe that we already know everything about everything and we just need to dig it out of our brain: to unveil the knowledge that resides inside us. Others think of our brain as a tabula rasa [21]: a blank slate on which, day by day, we write down our experiences, knowledge and notions. Thinking in this way, a child, when born, doesn't know anything about the meaning or the structure of words, and therefore of language; but s/he manages to learn it. Thanks to examples and the context the child lives in, s/he will be able to pick up a language and to express him or herself with it.

Computers can be compared to a child: in fact they know nothing unless we put data into them.


We can give them lots of examples to process and to store, and they have the ability to process and compare them very fast. The example-based approach has turned out to be a good approach, but not the most satisfying.

The Adult Approach

The third, and personally the most promising, approach is the so-called adult approach. The way it tries to process a language is just like the way an adult person would proceed in learning a new language: Syntax + Vocabulary, Semantics, Pragmatics.


Syntax

When we study the rules and principles for constructing a sentence in any spoken language, we are studying the language's syntax, or grammar. Every language has a structure. For example, in an affirmative positive sentence most languages will use the following structure: subject first, followed by a verb and then one or more objects.

Different ways have been used in NLP studies to represent syntax, from parse trees to lists, and techniques such as re-write rules and transition networks have been used to represent the possible structures allowed in a grammar. Bottom-up and top-down approaches, and algorithms such as depth-first and breadth-first search, are examples of algorithms used to find out the structure of grammars.
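The re-write-rule idea can be sketched as a tiny top-down parser. The grammar and lexicon below are toy examples invented for illustration, nowhere near a real NLP grammar: a sentence (S) re-writes to a noun phrase (NP) followed by a verb phrase (VP), and so on down to the words themselves.

```python
# A sketch of top-down parsing with re-write rules.

grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Noun"], ["Name"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}
lexicon = {
    "Det":  {"the", "a"},
    "Noun": {"dog", "ball"},
    "Name": {"pete"},
    "Verb": {"chases", "sleeps"},
}

def parse(symbol, words):
    """Try to consume a prefix of `words` as `symbol`; return leftovers or None."""
    if symbol in lexicon:                       # terminal category: match one word
        if words and words[0] in lexicon[symbol]:
            return words[1:]
        return None
    for rule in grammar[symbol]:                # try each re-write rule in turn
        rest = words
        for part in rule:
            rest = parse(part, rest)
            if rest is None:
                break
        else:
            return rest
    return None

def is_sentence(text):
    return parse("S", text.split()) == []       # valid if every word is consumed

print(is_sentence("the dog chases a ball"))  # True
print(is_sentence("ball the chases"))        # False
```

This is the top-down, depth-first style mentioned above: the parser starts from S and recursively expands symbols until it either accounts for the whole sentence or runs out of rules.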


The study of syntax and the codification of its rules in a machine-understandable way are useless if we don't "teach" a computer how to understand the semantics behind the words.



Semantics

If I say "I could eat a horse!!!!", for any listener, apart from 2-to-5-year-old children, it will be obvious that I am just saying that I am hungry in a "colourful" way. Unfortunately computers lack the most obvious notions and common sense that we acquire in our lives.

Semantics is the branch of studies that tackles the meaning behind words and sentences, while pragmatics is more concerned with the figurative, allusive ways in which we can express ourselves: so many different ways, even when we are still saying the same thing.




It seems to me that to overcome NLP's problems and difficulties a hybrid approach is needed, where first we teach the machine the syntax, then we give it a huge amount of data/examples, and then it can elaborate on them in a pattern/example-based approach. Of course it is easy to think that if, for instance, we could feed the machine every possible combination of sentences, it would be able to understand everything we say; but that is, as a matter of fact, impossible. First because it would take an infinite amount of time, and secondly because languages evolve every day, with new words, compound words, terms and situations, just as human history is a continuous, yet cyclic, flow. [9, 23]





The Chicken & Egg: Problems! ...Solutions?!?!

What has to be done to get the Semantic Web going? The process of developing the Semantic Web is costly and time-consuming, and that's why people often prefer an egg (a quick solution that meets some requirements but is available now) to a chicken (a solution that meets all the requirements but will take years to fully develop and implement).

The chicken and egg problem also means that vendors and developers are waiting to put in real effort and money until the market is created, but the market won't create itself if there are no applications. Basically, those applications require data that is not yet out on the web, and for this data to be created it needs those applications, and so on. That is known as a chicken and egg problem: which came first, the egg or the chicken?

Despite these problems there has recently been an acceleration in the creation of such technologies.

In a very simple way, what the IT world, researchers of any type, and/or anyone interested in the development of these promising issues are trying to do is to give every entity or item that can be stored or linked to on the web a meaning, and/or a way to be referred to in a meaningful way. (See Fig.1.) To do so, people have to start writing data/documents in RDF format. Many claim to have built applications that use NLP to better search the web, or to have created applications able to automatically create annotations and metadata. In the next section we will look at some such applications.







In Action

In order to actually get the Semantic Web to work properly, data has to be published to the web in RDF format.

One way of doing so is "screen scraping". In the screen scraping process a computer basically extracts data from the output displayed by another program. [18] The purpose of this process is to format the data in a more manageable way.
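Screen scraping can be sketched with Python's standard library: pulling structured data back out of HTML that was only meant for display. The page snippet and the "price" class name are invented for illustration.

```python
# A sketch of screen scraping: recovering data from display-oriented HTML.
from html.parser import HTMLParser

page = """
<html><body>
  <span class="price">19.99</span>
  <span class="price">4.50</span>
  <span>not a price</span>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collect the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        self.in_price = (tag == "span" and ("class", "price") in attrs)

    def handle_endtag(self, tag):
        self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data))

scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)  # [19.99, 4.5]
```

Note how fragile this is: the scraper depends entirely on presentation details such as the class name, which is precisely why publishing the data directly in a machine-readable format like RDF is preferable.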


Another way is to get the data directly from a user's inputs or forms.

Many companies claim to have created applications that can help the Semantic Web's goals. During the research for this coursework I came across a number of these "statements", and I will now show some of my results.


Powerset: a Natural Language Search Company [15]

The creators of Powerset, a natural language search engine, state that their tool will enable us to carry out more meaningful searches within Wikipedia documents. This is clearly a limited data-set of what is on the web.




Fig.3 screenshot of the Powerset Homepage.

The Powerset search engine is meant to allow us a keyword-based search or a phrase-based search. In a video lecture Barney Pell, founder and CEO of Powerset (Barney Pell, Powerset, Inc.), states Powerset's techniques and principles, which are summarized in the following bullet points:







- Goal: enable people to interact with information and services as naturally and effectively as possible
- Combine deep NL and scalable search technology
- How?: natural language search
  - Interpret the web
  - Index
  - Interpret the query
  - Search… match
  - “The system creates and uses Semantic Web information in multiple ways”


He states that the main innovation Powerset brings to web searching and to the Semantic Web world is understanding how a document's intent is encoded.





- Goal: matching query intents with document intents
- Changes to the document model drive the largest innovations:
  - Proximity: shift from “doc as bag-of-keywords” to “doc as vector-of-keywords”
  - Anchor text: adding off-page text to the doc
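The bag-of-keywords versus vector-of-keywords shift mentioned above can be illustrated with a toy example. This is a sketch of the general idea, not Powerset's actual data model; the document, the positions dictionary and phrase_match are all invented for illustration.

```python
from collections import Counter

doc = "the capital of france is paris".split()

# Bag-of-keywords: only term frequencies survive; word order is lost.
bag = Counter(doc)

# Vector-of-keywords: each term keeps its position(s) in the document.
vector = {}
for pos, word in enumerate(doc):
    vector.setdefault(word, []).append(pos)

def phrase_match(vec, phrase):
    """True if the words of `phrase` occur consecutively in the document."""
    words = phrase.split()
    return any(all(p + i in vec.get(w, []) for i, w in enumerate(words))
               for p in vec.get(words[0], []))

# The bag cannot distinguish "capital of france" from "france of capital";
# the positional vector can, which is what makes proximity queries possible.
```

This is why the shift matters: once positions are kept, phrase and proximity queries become answerable, whereas a pure bag of keywords can only say whether the words occur somewhere in the document.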


How is it able to understand and better answer a query?

- “Parses each sentence on the page
- Extracts entities & semantic relationships
- Identifies and expands to similar entities, relationships & abstractions
- Indexes multiple facts for each sentence”
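The pipeline in the bullets above (parse each sentence, extract relationships, index the resulting facts) can be caricatured in a few lines. Real systems use full syntactic parsers; the copular pattern in extract_fact below is a deliberately naive stand-in, invented for illustration.

```python
import re

def extract_fact(sentence):
    """Naively split a copular sentence into a (subject, relation, object) fact."""
    m = re.match(r"(.+?) is (.+)", sentence.rstrip("."))
    if m:
        return (m.group(1).strip(), "is", m.group(2).strip())
    return None

sentences = ["Paris is the capital of France.",
             "France is a country in Europe."]

# Index: each extracted fact becomes searchable by its subject.
index = {}
for s in sentences:
    fact = extract_fact(s)
    if fact:
        index.setdefault(fact[0], []).append(fact)

# A query like "What is the capital of France?" can now be matched against
# indexed facts rather than against raw keyword occurrences.
```

Even this toy version shows the difference from keyword search: the index stores who relates to what, so a question and its affirmative answer can be matched structurally.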





Let’s see it in action.

If I type in the query “What is the capital of France?” I get the following results:



Fig. 4


If we look at the highlighted words in the results, we can notice that Powerset took the initial phrase “What is the capital of France?” and understood that it was a question; in fact, the words are now arranged in a different way: an affirmative one.

While Powerset clearly analyzes the query semantically, Google will just give us a series of documents in which the words contained in our query are present.

But, as we can see, Powerset also gives us results such as capital punishment in France, which is clearly unrelated to my original query.



If I do the same search on Google, the output is still acceptable:



Fig. 5: Google results for “What is the capital of France?”

A definition states that Paris is the capital of France.




I will now try an example to discredit Google, because after all what I am trying to show is whether these new technologies that claim to use and empower the Semantic Web are really helpful and have something new to offer.

If I type “rare wildlife of the Amazon”, Powerset gives me a list of papers, works and TV series that treat the subject of rare wildlife of the Amazon.


Fig. 6





















While Google, straight from the first page, gives me a link to an Amazon.com item, it takes Powerset 41 links before one of those appears in the list.







Fig. 7: Page 5, link 41: Powerset's first “wrong” translation of the term Amazon.

Fig. 8: First page, fourth link: Google is already showing some Amazon.com links.





A final example is given by typing the following query:

“Why is natural language processing difficult?”






Fig. 9


It is obvious from these results that Google doesn't have a clue what we are talking about; in fact, it just gives us a series of links with the NLP words in them, but it also asks us whether we meant a different query: “why is NLP difficulty?”

By selecting the suggested query, Google still doesn't give us more than a series of documents with the words “natural language processing” and “difficulty” in them.




















On the other hand, using Powerset we get a little bit closer to our goal (which is to find documents that explain why NLP is so difficult).





Fig. 10: Powerset results for the query “Why is natural language processing difficult?”



With Powerset the results are a bit more accurate than Google's, but there are still a lot of links unrelated to our query.

In both searches, most results point to pages where the word NLP and the word difficult simply appear together.


It is clear from the above examples how much more accurate the understanding of a query is in a search engine that implements NLP techniques compared to traditional keyword indexing and ranking; but it is also obvious how ineffective and inefficient those applications still are.

But something more accurate is already out there…








Another interesting search engine application which I came across during my research is Juice.


Juice [10]


Juice is an add-on for Firefox. An add-on is an application that extends Firefox with additional features.

Juice is meant to be an intelligent discovery engine.


What it does, thanks to an NLP system and a dictionary management system, is “help the semantic web by connecting keywords with the most relevant, rich content from third-party web services”. [10]
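A toy sketch of what such keyword-to-service dictionary management might look like. The SERVICES and KEYWORDS tables and the service names below are invented for illustration; this is the general shape of the idea, not Juice's actual implementation.

```python
# Hypothetical dictionary mapping topic categories to content services.
SERVICES = {
    "video": "example video search",
    "image": "example image search",
    "paper": "example scholarly search",
}

# Hypothetical keyword dictionary mapping words to topic categories.
KEYWORDS = {"nlp": ["paper"], "amazon": ["video", "image"], "wildlife": ["image"]}

def relevant_services(highlighted_text):
    """Return the services a Juice-style panel would query for this selection."""
    hits = set()
    for word in highlighted_text.lower().split():
        for topic in KEYWORDS.get(word.strip("?.,"), []):
            hits.add(SERVICES[topic])
    return sorted(hits)

# Dragging "rare wildlife of the Amazon" onto the panel would, under this
# toy dictionary, trigger image and video lookups.
```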


How does Juice work?

As we can see from Figure 11, Juice resides on the right-hand side of the screen but can easily be removed by clicking a button on the toolbar.





Fig. 11: Screenshot of how the Juice add-on looks.






One of the main characteristics of Juice is the ability to highlight a piece of text and drag & drop it onto the Juice panel. This action activates a search and automatically generates a list of links.

Juice also allows us to search for topics within its panel while keeping the page we started from in the main window.




Let's try the following: “Why is natural language processing difficult?”




Fig. 12




As the screenshot shows, Juice gives much more accurate results than our previous experiments with both Powerset and Google.

In fact, it actually finds a document with the title “The Difficulty of Natural Language” and another one titled “Natural Language Processing: Difficulties”.

In summary, Juice has found more pertinent documents on a query where our previously tested search engines failed miserably.


In addition to the better “understanding” of the query, Juice gives us the chance to navigate through its findings without losing the original webpage from which we started.


Fig. 13



Juice also allows us to search for images, news, videos and blogs.






Fig. 14: Results from the Juice search on why natural language processing is difficult.



Another interesting feature of Juice is that when there is a video on the screen, a small “drag me” button appears next to it, giving us the possibility of dropping the video into the Juice search panel and watching it there, leaving the main page free and “usable”.



Fig. 15




Juice will also store videos or images for future viewing.

In summary, it seems that Juice is much “cleverer” than the search engines analyzed earlier, and it offers a series of entertainment features besides.

I have to admit that many of the papers I found during this research, I found thanks to Juice. The ability to navigate without overwriting the initial page is also really useful; a user will never be carried too far away while surfing new pages and can always come back to the starting point.

I believe that is quite an intelligent way to search the web.




Until now I have shown applications that help us find information or resources on the World Wide
Web thanks to the Semantic Web.










The following section analyzes a way of helping the Semantic Web by adding metadata to documents.

Calais, or OpenCalais [14]


The following sentences are cited from the Calais website:

“Calais is a rapidly growing toolkit of capabilities that allow you to readily incorporate state-of-the-art semantic functionality within your blog, content management system, website or application.”

“Calais is a web service that uses natural language processing (NLP) technology to semantically tag text that is input to the service. The tags are delivered to the user who can then incorporate them into other applications - for search, news aggregation, blogs, catalogs et cetera.” [14]



Calais, in other words, enables users to create tags simply by copying and pasting a document into it; it will output tags, making the document linkable.
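The tag-then-serialize behaviour can be sketched with a toy gazetteer. Real Calais uses statistical NLP rather than dictionary lookup, and the ENTITIES table and the example.org URI namespace below are invented for illustration.

```python
# Toy gazetteer standing in for a real NLP entity recognizer.
ENTITIES = {"Paris": "City", "France": "Country", "Reuters": "Company"}

def tag_text(text):
    """Return (entity, type) pairs recognized in the text."""
    return [(name, etype) for name, etype in ENTITIES.items() if name in text]

def to_rdf(tags, base="http://example.org/ns#"):
    """Serialize the tags as simple N-Triples-style RDF statements."""
    return ["<{0}{1}> <{0}type> <{0}{2}> .".format(base, name, etype)
            for name, etype in tags]

text = "Paris is the capital of France."
triples = to_rdf(tag_text(text))
# Each triple asserts, in machine-readable form, that an entity in the
# pasted text has a given type, which is what makes the document linkable.
```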


Figure 16 shows a screenshot of the input form where we can paste our document in Calais.



Fig. 16



As input, for testing purposes, I used the definitions I gave earlier in this paper of XML, RDF and RDF Schema.


Calais will use different colours for different types of tags, such as green for cities, red for companies and so on. We can see which colour has been used for which type of tag in the left panel of the screen.






The output will be what is shown in figure 17:




Fig. 17


Calais has produced many tags for the given document, and if I hover over a tag with the mouse, a window appears giving some information and a description.

The interesting thing is that if the user presses the “show RDF” button, the system will show the document in RDF format, as we can see in Figure 18.
format as we can see in figure 18.



Fig. 18

Once again the results are not optimal. The tags that Calais annotates don't cover all the important words held in the text, but what they have managed to do is still impressive, and I believe there is a need to see these results in an optimistic way.

If you think about it, the majority of people who use the internet don't know anything about tags, RDF, XML and so on. Therefore there is a real need for sophisticated tools that do the annotation process for the users, or in other words that take care of the dirty work without the

unaware user having a clue of what is going on
in
the background. But those users are the one
s

that will be surprised the most when they see “magic” connections comin
g out apparently from
nowhere
.

Our community, by which I mean us human beings, cannot cope with everything we have at our hands, or virtual hands. So to those who hate computers because they are changing our world into a colder, electro-silicon-plastic scenario, I suggest that they just use them and abuse them, and remember that computers actually make our lives easier. After all, we created them, so let's make them do something useful, especially the things we prefer not to do, and concentrate on the more creative aspects of our lives. We can create; they cannot, and they never will.


Furthermore

Unfortunately and inevitably, when discussing topics so current it is impossible to cover everything; even during the writing of this paper, new technologies and new tools just kept popping up. Below is a list of some other very interesting applications and research groups that, due to time constraints, I wasn't able to discuss but which are of great interest:





- Metaweb: http://www.metaweb.com/ "an open, shared database of the world's knowledge";
- Twine: http://www.twine.com/ a smarter way to track interests;
- Garlik: http://www.garlik.com/ a UK company creating tools to control personal information on the Web;
- Joost: http://www.joost.com/ online TV provider;
- Talis: http://www.talis.com/ a vendor of software that makes data "available to share, remix and reuse";
- Mondeca: http://www.mondeca.com/ a European enterprise information integration company;
- Ontoprise: http://www.ontoprise.de/ a German vendor of ontology-related tools.

(Descriptions have been taken from the websites' homepages.)


Conclusion

I believe these types of tools are really helpful and are already making a big difference in the way we use the internet and its resources.

Tags, and what they link together, are changing the way we surf, research and discover the web and its resources.

Further discussion of how tags are changing our web browsing experience, and of the evolution of information visualization, can be found in my Information Retrieval coursework.

It is clear, though, that such tools are not yet optimal and fully efficient, but we have seen an enormous effort and development in the last decade.

I strongly believe that what is driving the world today, apart from the obvious money-making processes, is a community-driven fashion of applications, where people are able to share their resources, be that for knowledge-research purposes or simply to like or dislike items.

People have to feel connected and able to communicate with everything and anything in the world.

The Semantic Web is an amazing and exciting prospect that is trying to make this possible, and I believe that even if what was described in the article by T. B. Lee never becomes possible, at least not in my lifetime unfortunately, it will still make our lives a lot more interesting.

We are already able to discover a lot of new and interesting things thanks to these sorts of applications, and I am confident there are definitely more amazing things still to come.


Reference

[1] G. Antoniou, F. van Harmelen (2008). A Semantic Web Primer. The MIT Press, Cambridge, Massachusetts; London, England.
[2] T. Berners-Lee (1989). Information Management: A Proposal. CERN. Available at: http://www.w3.org/History/1989/proposal.html
[3] T. Berners-Lee (1998). Semantic Web Road Map. W3C. Available at: http://www.w3.org/DesignIssues/Semantic.html
[4] T. Berners-Lee, J. Hendler, and O. Lassila (2001). The Semantic Web. Scientific American. Available at: http://www.sciam.com/print_version.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21
[5] J. Cardoso (2006). Semantic Web Services: Theory, Tools and Applications. Information Science Reference.
[6] K. Darlington (2005). Effective Website Development: Tools and Techniques.
[7] J. Davies, R. Studer, P. Warren (2006). Semantic Web Technologies: Trends and Research in Ontology-based Systems. Wiley.
[8] L. Dempsey, R. Heery (1997). Metadata [DESIRE] Specification for resource description methods, Part 1: A review of metadata: a survey of current resource description formats. Available at: http://www.ukoln.ac.uk/metadata/desire/overview/
[9] J. Gleick (1988). Chaos: Making a New Science.
[10] Juice. Available at: http://www.juiceapp.com
[11] B. Matthews (2005). Semantic Web Technologies. Available at: http://www.jisc.ac.uk/uploaded_documents/jisctsw_05_02bpdf.pdf
[12] D. McGuinness, M. Dean (2005). Substance of the Semantic Web. SWANS. Available at: http://www.daml.org/meetings/2005/04/pi/Substance.pdf
[13] Natural Language Processing, Wikipedia. Available at: http://en.wikipedia.org/wiki/Natural_language_processing#Tasks_and_limitations
[14] OpenCalais. Available at: http://www.opencalais.com/
[15] T. O'Reilly (2005). What is Web 2.0. Available at: http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html?page=2
[16] Powerset. Available at: http://www.powerset.com/
[17] L. Reeve, H. Han, C. Chen (2006). Visualizing the Semantic Web: XML-Based Internet and Information Visualization.
[18] Screen scraping, Wikipedia. Available at: http://en.wikipedia.org/wiki/Screen_scraping
[19] J. Searle (1980). Minds, Brains and Programs. Behavioral and Brain Sciences.
[20] R. Studer, S. Grimm, A. Abecker (2007). Semantic Web Services: Concepts, Technologies, and Applications.
[21] Tabula rasa, Aristotle (unknown).
[22] W3C: World Wide Web Consortium homepage: http://www.w3.org/
[23] G. P. Williams (1997). Chaos Theory Tamed.





