HY-566 Semantic Web


21 Oct 2013

















Ontology Learning



















Μπαλάφα Κασσιανή

Πλασταρά Κατερίνα



Table of contents


1. Introduction

2. Data sources for ontology learning

3. Ontology Learning Process

4. Architecture

5. Methods for learning ontologies

6. Ontology learning tools

7. Uses/applications of ontology learning

8. Conclusion

9. References





























1. Introduction



1.1 Ontologies


Ontologies serve as a means for establishing a conceptually concise basis for
communicating knowledge for many purposes. In recent years, we have seen a
surge of interest in the discovery and automatic creation of complex,
multirelational knowledge structures.

Unlike knowledge bases, ontologies have “all in one”:

o  a formal or machine-readable representation
o  a full and explicitly described vocabulary
o  a full model of some domain
o  consensus knowledge: a common understanding of a domain
o  ease of sharing and reuse



1.2 Ontology learning


General


The main task of ontology learning is to automatically learn complicated
domain ontologies, a task that is usually performed only by humans. It applies
knowledge discovery techniques to different data sources (HTML, documents,
dictionaries, free text, legacy ontologies, etc.) in order to support the task
of engineering and maintaining ontologies. In other words, it is the machine
learning of ontologies.



Technical description


The manual building of ontologies is a tedious task, which can easily result in a
knowledge acquisition bottleneck. In addition, human expert modeling by hand is
biased, error prone and expensive. Fully automatic machine knowledge
acquisition remains in the distant future. Most systems are semi-automatic and
require human (expert) intervention and balanced cooperative modeling for
constructing ontologies.


Semantic Integration


The conceptual structures that define an underlying ontology provide the key to
machine-processable data on the Semantic Web. Ontologies serve as metadata
schemas, providing a controlled vocabulary of concepts, each with explicitly
defined and machine-processable semantics. Hence, the Semantic Web’s
success and proliferation depend on quickly and cheaply constructing
domain-specific ontologies. Although ontology-engineering tools have matured
over the last decade, manual ontology acquisition remains a tedious,
cumbersome task that can easily result in a knowledge acquisition bottleneck.
Intelligent support tools for an ontology engineer take on a different meaning than
the integration architectures for more conventional knowledge acquisition.


The figures below show the role that ontology learning plays in semantic
integration.





Semantic Information Integration





















Ontology Alignment and Transformations





























Ontology Engineering







2. Data sources for ontology learning


2.1 Natural languages


Natural language texts exhibit morphological, syntactic, semantic, pragmatic and
conceptual constraints that interact in order to convey a particular meaning to the
reader. Thus, the text transports information to the reader and the reader
embeds this information into their background knowledge. Through the
understanding of the text, data is associated with conceptual structures and new
conceptual structures are learned from the interacting constraints given through
language. Tools that learn ontologies from natural language exploit the
interacting constraints on the various language levels (from morphology to
pragmatics and background knowledge) in order to discover new concepts and
stipulate relationships between concepts.




2.1.1 Example




An example of extracting semantic information from natural-language text in the
form of an ontology is a methodology developed at Leipzig University, Germany.
This approach applies statistical analysis of large corpora to the problem of
extracting semantic relations from unstructured text. It is a viable method for
generating input for the construction of ontologies, as ontologies use
well-defined semantic relations as building blocks. The method creates classes
of terms (collocation sets) and then postprocesses these statistically generated
collocation sets in order to extract named relations. In addition, for different
types of relations, such as cohyponyms or instance-of relations, different
extraction methods as well as additional sources of information can be applied
to the basic collocation sets in order to verify the existence of a specific
type of semantic relation for a given set of terms.


The first step of this approach is to collect large amounts of unstructured text,
which will be processed in the following steps. The next step is to create the
collocation sets, i.e. the classes of similar terms. The occurrence of two or more
words within a well-defined unit of information (sentence, document) is called a
collocation. For the selection of meaningful and significant collocations, an
adequate collocation measure is defined based on probabilistic similarity metrics.
To calculate the collocation measure for all reasonable pairs of terms, the
joint occurrences of each pair are counted. This problem is complex in both time
and storage. Nevertheless, the collocation measure is calculated for any pair in
which each component has a total frequency of at least 3. The counting is based on
extensible ternary search trees, where a count can be associated with a pair of
word numbers. The memory overhead of the original implementation could be
reduced by allocating the space for chunks of 100,000 nodes at once. Even
when using this technique on a large-memory computer, more than one run
through the corpus may be necessary, taking care that every pair is counted only
once. The resulting word pairs above a threshold of significance are put into a
database where they can be accessed and grouped in many different ways.
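The counting and thresholding step can be illustrated with a minimal sketch. It is not the Leipzig implementation: the source does not name the collocation measure, so pointwise mutual information (PMI) is used here as a stand-in, a plain dictionary replaces the ternary search trees, and the function name and parameters (collocations, min_freq, min_score) are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def collocations(sentences, min_freq=3, min_score=0.0):
    """Return {(word_a, word_b): score} for significant sentence-level collocations.

    Assumption: PMI stands in for the unspecified collocation measure.
    """
    word_freq = Counter()   # number of sentences containing each word
    pair_freq = Counter()   # number of sentences containing both words
    for sentence in sentences:
        # Each word (and each pair) is counted once per sentence.
        words = sorted(set(sentence.lower().split()))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))

    n = len(sentences)
    scored = {}
    for (a, b), joint in pair_freq.items():
        # Only pairs whose components both reach the minimum frequency.
        if word_freq[a] < min_freq or word_freq[b] < min_freq:
            continue
        pmi = math.log2(joint * n / (word_freq[a] * word_freq[b]))
        if pmi > min_score:
            scored[(a, b)] = pmi
    return scored
```

The default min_freq=3 mirrors the frequency bound mentioned in the text; a real system would stream the corpus rather than hold all pair counts in memory.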


Besides the textual output of collocation sets, they can also be visualized as
graphs. The procedure is as follows:

A word is chosen and its collocates are arranged in the plane so that
collocations between collocates are taken into account. This results in graphs
that show homogeneity where words are interconnected and separation where
collocates have little in common. Polysemy is made visible (see figure below).
Line thickness represents the significance of the collocation. All words in the
graph are linked to the central word; the rest of the picture is automatically
computed, but represents semantic connectedness as well.

The relations between the words are just presented, but not yet named. The
figure shows the collocation graph for space. Three different meaning contexts
can be recognized in the graph:

o  real estate,
o  computer hardware, and
o  astronautics.

The connection between address and memory results from the fact that address
is another polysemous concept.




Collocation graph for space



The final step is to identify the relations between terms or collocation sets.
When the collocation sets are searched, some semantic relations appear more
often than others. The following basic types of relations can be identified:

o  cohyponymy
o  top-level syntactic relations, which translate to semantic ‘actor-verb’
   relations and often-used properties of a noun
o  instance-of
o  special relations given by multiwords (A prep/det/conj B), and
o  unstructured sets of words describing some subject area.

These types of relations may be classified according to the properties symmetry,
anti-symmetry, and transitivity. Additional relations between collocation sets can
be identified with the user’s contribution, such as:

o  Pattern-based extraction (user defined), e.g. (profession) ? (last name)
   implies that ? is in fact a first name.
o  Compound nouns. A semantic relation between the parts of a compound
   word can be found in most cases.

Term properties are derived in similar ways.
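The user-defined pattern "(profession) ? (last name)" can be sketched with a regular expression. This is only an illustration under assumptions: the function name extract_first_names and the two input sets are hypothetical, and the sets are taken as already known (for instance from earlier collocation sets).

```python
import re


def extract_first_names(text, professions, last_names):
    """Apply the pattern (profession) ? (last name) and return every word
    found in the ? slot, which the pattern treats as a first name."""
    prof = "|".join(map(re.escape, sorted(professions)))
    last = "|".join(map(re.escape, sorted(last_names)))
    # One arbitrary word between a known profession and a known last name.
    pattern = re.compile(rf"\b(?:{prof})\s+(\w+)\s+(?:{last})\b")
    return set(pattern.findall(text))
```

For example, with the professions {"architect", "painter"} and the last names {"Gehry", "Richter"}, the text "the architect Frank Gehry met the painter Gerhard Richter" yields the candidates Frank and Gerhard.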


A combination of the results of each of the steps described above forms the
ontology of terms included in the original text. The output of this approach may
be used for the automatic generation of semantic relations between terms in
order to fill and expand ontology hierarchies.




2.2 Ontology Learning from Semi-structured Data


With the success of new standards for document publishing on the web, there will
be a proliferation of semi-structured data and formal descriptions of
semi-structured data that are freely and widely available. HTML data, XML data,
XML Document Type Definitions (DTDs), XML Schemata, and the like add more or
less expressive semantic information to documents. A number of approaches
understand ontologies as a common generalizing level that may mediate between
the various data types and data descriptions. Ontologies play a major role in
allowing semantic access to these vast resources of semi-structured data. Though
only a few approaches exist so far, we believe that learning ontologies from
these data and data descriptions may considerably leverage the application of
ontologies and thus facilitate access to these data.



2.2.1 Example



An example of learning ontologies from both unstructured and semi-structured
text is the DODDLE system. This approach, which was implemented at Shizuoka
University, Japan, describes how to construct domain ontologies with taxonomic
and non-taxonomic conceptual relationships by exploiting a machine-readable
dictionary and domain-specific texts. The taxonomic relationships come from
WordNet (an online lexical database for the English language) in interaction
with a domain expert, using the following two strategies: match result analysis
and trimmed result analysis. The non-taxonomic relationships come from
domain-specific texts through the analysis of lexical co-occurrence statistics.


The DODDLE (Domain Ontology Rapid Development Environment) system
consists of two components: the taxonomic relationship acquisition module, which
uses WordNet, and the non-taxonomic relationship learning module, which uses
domain-specific texts. An overview of the system and its components is depicted
in Figure 1.





Figure 1: DODDLE overview





Taxonomic relationship acquisition module:

The taxonomic relationship acquisition module performs a spell match between
the input domain terms and WordNet. The spell match links these terms to
WordNet. Thus the initial model from the spell match results is a
hierarchically structured set of all the nodes on the paths from these terms
to the root of WordNet. However, the initial model contains unnecessary internal
terms (nodes) that do not contribute to keeping the topological relationships
among matched nodes, such as parent-child and sibling relationships.
These unnecessary internal nodes can be trimmed from the initial model
to obtain a trimmed model, as shown in Figure 2.




Figure 2: Trimming process
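The trimming step can be sketched as follows. This is one reading of the description, not DODDLE's actual code: a node is assumed to be necessary if it is a matched term, the root, or a branching point that preserves a sibling relationship; all other internal nodes are skipped. The function name trim and the child-to-parent dictionary are hypothetical.

```python
def trim(parent, matched):
    """Trim an initial model.

    parent  -- dict mapping each node to its parent (the root maps to None)
    matched -- set of spell-matched domain terms
    Returns a dict node -> nearest surviving ancestor (the trimmed model).
    """
    # Initial model: every node on a path from a matched term to the root.
    keep = set()
    for node in matched:
        while node is not None and node not in keep:
            keep.add(node)
            node = parent[node]

    # Children of each node within the initial model.
    children = {}
    for node in keep:
        if parent[node] is not None:
            children.setdefault(parent[node], []).append(node)

    def survives(node):
        # Assumption: matched terms, the root and branching nodes are necessary.
        return (node in matched or parent[node] is None
                or len(children.get(node, [])) > 1)

    trimmed = {}
    for node in keep:
        if survives(node):
            ancestor = parent[node]
            while ancestor is not None and not survives(ancestor):
                ancestor = parent[ancestor]   # skip trimmed internal nodes
            trimmed[node] = ancestor
    return trimmed
```

With a small hypothetical hierarchy entity > object > artifact > {structure > house, vehicle > car} and the matched terms house and car, the trimmed model keeps only entity, artifact, house and car.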


In order to refine the trimmed model, two strategies are applied in
interaction with a user: match result analysis and trimmed result analysis.

o  Match result analysis:

Looking at the trimmed model, it turns out that it is divided into PABs
(PAths including only Best spell-matched nodes) and STMs (Sub-Trees
that include best spell-matched nodes and other nodes and so should be
Moved), based on the distribution of best-matched nodes. On the one hand,
a PAB is a path that includes only best-matched nodes that make sense for
the given domain. Because all nodes in PABs have already been adjusted to
the domain, PABs can stay where they are in the trimmed model. On the
other hand, an STM is a sub-tree whose root is an internal node and whose
subordinates are only best-matched nodes. Because internal nodes have not
been confirmed to make sense for the given domain, an STM can be moved
within the trimmed model. Thus DODDLE identifies PABs and STMs in the
trimmed model automatically and then supports a user in constructing a
conceptual hierarchy by moving STMs. Figure 3 illustrates the
above-mentioned match result analysis.









Figure 3: Match Result Analysis


o  Trimmed result analysis:

In order to refine the trimmed model, DODDLE uses trimmed result
analysis as well as match result analysis. Among sibling nodes with the
same parent node, the number of nodes trimmed between each sibling and
the parent may differ considerably. When such a big difference occurs in
a sub-tree of the trimmed model, it may be better to change the structure
of the sub-tree. The system asks the user whether the sub-tree should be
reconstructed. Figure 4 illustrates the above-mentioned trimmed result
analysis.




Figure 4: Trimmed Result Analysis



Finally, DODDLE II completes the taxonomic relationships of the input domain
terms with additional manual modifications by the user.





Non-taxonomic relationship learning module

Non-taxonomic relationship learning is mostly based on WordSpace, a
multi-dimensional vector space (a set of vectors) that is derived from lexical
co-occurrence information in a large text corpus. The inner product
between two word vectors serves as the measure of their semantic
relatedness. When the inner product of two words exceeds some threshold,
they are candidates for having some non-taxonomic relationship between
them. WordSpace is constructed as shown in Figure 5.




Figure 5: Construction Flow of WordSpace


The main steps of the WordSpace construction process are: extraction of
high-frequency 4-grams, construction of the collocation matrix, construction of
context vectors, construction of word vectors, and construction of vector
representations of all concepts.
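The idea of scoring word pairs by the inner product of their vectors can be sketched as below. This is a deliberate simplification and not the WordSpace procedure itself: it builds word vectors directly from neighbouring words in a small window instead of going through 4-grams and context vectors, and it normalises the vectors before taking the inner product. All names (word_vectors, inner_product, related_pairs, window, threshold) are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def word_vectors(sentences, window=2):
    """Map each word to a sparse vector of co-occurrence counts."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def inner_product(u, v):
    """Inner product of the two vectors after length normalisation."""
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    if norm == 0:
        return 0.0
    return sum(u[k] * v[k] for k in u if k in v) / norm


def related_pairs(vectors, threshold):
    """Word pairs whose relatedness exceeds the threshold."""
    pairs = {}
    for a, b in combinations(sorted(vectors), 2):
        score = inner_product(vectors[a], vectors[b])
        if score > threshold:
            pairs[(a, b)] = score
    return pairs
```

Pairs returned by related_pairs are only candidates; as in the text, deciding which relationship (if any) holds between them is left to the user.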



After these two main, parallel modules have finished, all the resulting
concepts are compared for similarity. The user defines a threshold for this
similarity, and a concept pair whose similarity exceeds it is extracted as a
similar concept pair. A set of similar concept pairs becomes a concept
specification template. Both kinds of concept pairs, those whose meanings are
similar (with a taxonomic relation) and those that are relevant to each other
(with a non-taxonomic relation), are extracted together as concept pairs with
context similarity. However, by combining taxonomic information from the TRA
module with co-occurrence information, DODDLE marks the concept pairs that are
hierarchically closer to each other than the other pairs as TAXONOMY. A user
constructs a domain ontology by considering the relation for each concept pair
in the concept specification templates and by deleting unnecessary concept
pairs. Figure 6 shows the ontology editor (left window) and the concept graph
editor (right window).






Figure 6: The ontology editor



In order to evaluate how DODDLE performs in practice, case studies have
been carried out on a particular law, the Contracts for the International Sale
of Goods (CISG). Although this case study was small in scale, the results were
encouraging.




2.3 Ontology Learning from Structured Data


Ontologies have been firmly established as a means for mediating between
different databases. Nevertheless, the manual creation of a mediating ontology is
again a tedious, often extremely difficult, task that may be facilitated through
learning methods. The negotiation of a common ontology from a set of data and
the evolution of ontologies through the observation of data is a hot topic these
days. The same applies to the learning of ontologies from metadata, such as
database schemata, in order to derive a common high-level abstraction of
underlying data descriptions, an important precondition for data warehousing or
intelligent information agents.






3. Ontology Learning Process


A general framework of the ontology learning process is shown in the figure
below.



The ontology learning process


The basic steps in the engineering cycle are:

o  Merging existing structures or defining mapping rules between these
   structures allows importing and reusing existing ontologies. (For instance,
   Cyc’s ontological structures have been used to construct a domain-specific
   ontology.)
o  Ontology extraction models major parts of the target ontology, with
   learning support fed from Web documents.
o  The target ontology’s rough outline, which results from import, reuse, and
   extraction, is pruned to better fit the ontology to its primary purpose.
o  Ontology refinement profits from the pruned ontology but completes the
   ontology at a fine granularity (in contrast to extraction).
o  The target application serves as a measure for validating the resulting
   ontology.

Finally, the ontology engineer can begin this cycle again, for example to include
new domains in the constructed ontology or to maintain and update its scope.



3.1 Ontology learning process example




A variation of the ontology learning process described in the previous section
was implemented in a user-centered system for ontology construction called
Adaptiva, developed at the University of Sheffield (UK). In this approach, the
user selects a corpus of texts and sketches a preliminary ontology (or selects an
existing one) for a domain, with a preliminary vocabulary associated with the
elements in the ontology (lexicalisations). Examples of sentences involving such
lexicalisations (e.g. of an ISA relation) in the corpus are automatically
retrieved by the system. Retrieved examples are then validated by the user and
used by an adaptive Information Extraction system to generate patterns that
discover other lexicalisations of the same objects in the ontology, possibly
identifying new concepts or relations. New instances are added to the existing
ontology or used to tune it. This process is repeated until a satisfactory
ontology is obtained.


The process consists of three steps: bootstrapping, pattern learning and
user validation, and cleanup.

o  Bootstrapping. The bootstrapping process involves the user specifying a
   corpus of texts and a seed ontology. The draft ontology must be
   associated with a small thesaurus of words, i.e. the user must indicate at
   least one term that lexicalises each concept in the hierarchy.

o  Pattern Learning & User Validation. Words in the thesaurus are used by
   the system to retrieve a first set of examples of the lexicalisation of the
   relations among concepts in the corpus. These are then presented to the
   user for validation. The learner then uses the positive examples to induce
   generic patterns able to discriminate between them and the negative
   ones. Patterns are generalised in order to find new (positive) examples of
   the same relation in the corpus. These are presented to the user for
   validation, and user feedback is used to refine the patterns or to derive
   additional ones. The process terminates when the user feels that the
   system has learned to spot the target relations correctly. The final patterns
   are then applied to the whole corpus and the ontology is presented to the
   user for cleanup.

o  Cleanup. This step helps the user make the ontology developed by the
   system coherent. First, users can visualize the results and edit the
   ontologies directly. They may want to collapse nodes, establish that two
   nodes are not separate concepts but synonyms, split nodes, or change the
   hierarchical position of nodes with respect to each other. Also, the user
   may wish to 1) add further relations to a specific node; 2) ask the learner
   to find all relations between two given nodes; 3) refine/label relations
   discovered between given nodes. Corrections are returned to the IE system
   for retraining.

This methodology focuses the expensive user activity on sketching the initial
ontology, validating textual examples and the final ontology, while the system
performs the tedious activity of searching a large corpus for knowledge
discovery. Moreover, the output of the process is not only an ontology, but also a
system trained to rebuild and eventually retune the ontology, as the learner
adapts by means of the user feedback. This simplifies ontology maintenance, a
major problem in ontology-based methodologies.



4. Architecture


The general architecture of the ontology learning process is shown in the
following figure.




Ontology learning architecture for the Semantic Web


The ontology engineer only interacts via the graphical interfaces, which
comprise two of the four components: the Ontology Engineering Workbench
and the Management Component. Resource Processing and the Algorithm
Library are the architecture’s remaining components. These components are
described below.




Ontology Engineering Workbench

This component is a sophisticated means for manual modeling and refining
of the final ontology. The ontology engineer can browse the ontology
resulting from the ontology learning process and decide to follow, delete or
modify the proposals as the task requires.




Management component graphical user interface

The ontology engineer uses the management component to select input
data, that is, relevant resources such as HTML and XML documents,
DTDs, databases, or existing ontologies that the discovery process can
further exploit. Then, using the management component, the engineer
chooses from a set of resource-processing methods available in the
resource-processing component and from a set of algorithms available in
the algorithm library. The management component also supports the
engineer in discovering task-relevant legacy data; for example, an
ontology-based crawler gathers HTML documents that are relevant to a
given core ontology.




Resource processing

Depending on the available input data, the engineer can choose various
strategies for resource processing:

o  Index and reduce HTML documents to free text.
o  Transform semistructured documents, such as dictionaries, into a
   predefined relational structure.
o  Handle semistructured and structured schema data (such as DTDs,
   structured database schemata, and existing ontologies) by following
   different strategies for import, as described later in this article.
o  Process free natural text.

After first preprocessing data according to one of these or similar
strategies, the resource-processing module transforms the data into an
algorithm-specific relational representation.




Algorithm library

An ontology can be described by a number of sets of concepts, relations,
lexical entries, and links between these entities. An existing ontology
definition can be acquired using various algorithms that work on this
definition and the preprocessed input data. Although specific algorithms
can vary greatly from one type of input to the next, a considerable overlap
exists for underlying learning approaches such as association rules,
formal concept analysis, or clustering. Hence, algorithms can be reused
from the library for acquiring different parts of the ontology definition.
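The description of an ontology as sets of concepts, relations, lexical entries and links between them can be made concrete with a small data structure. This is an illustrative sketch only; the class, field and method names are assumptions and do not come from any of the tools discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set = field(default_factory=set)
    relations: set = field(default_factory=set)   # (name, domain, range) triples
    lexicon: dict = field(default_factory=dict)   # lexical entry -> set of concepts
    taxonomy: dict = field(default_factory=dict)  # concept -> superconcept

    def add_concept(self, concept, lexical_entries=(), parent=None):
        """Add a concept, link its lexical entries, optionally place it in the taxonomy."""
        self.concepts.add(concept)
        for entry in lexical_entries:
            self.lexicon.setdefault(entry, set()).add(concept)
        if parent is not None:
            self.concepts.add(parent)
            self.taxonomy[concept] = parent

    def add_relation(self, name, domain, range_):
        """Add a non-taxonomic relation between two concepts."""
        self.concepts.update((domain, range_))
        self.relations.add((name, domain, range_))

    def concepts_for(self, entry):
        """Concepts that a lexical entry may refer to."""
        return self.lexicon.get(entry, set())
```

Keeping the lexicon separate from the concepts lets different learning algorithms fill in different parts of the same structure, for example one proposing taxonomy links and another proposing relations.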




5. Methods for learning ontologies



Some methodologies used in the ontology learning process are described in
the following sections.



5.1 Association Rules



A basic method that is used in many ontology learning systems is the use of
association rules for ontology extraction. Association-rule-learning algorithms
are used for prototypical applications of data mining and for finding
associations that occur between items in order to construct ontologies
(extraction stage). ‘Classes’ are expressed by the expert as a free-text
conclusion to a rule. Relations between these ‘classes’ may be discovered from
existing knowledge bases, and a model of the classes (an ontology) is
constructed based on user-selected patterns in the class relations. This
approach is useful for solving classification problems by creating
classification taxonomies (ontologies) from rules.


A classification knowledge-based system using this method, with experimental
results based on medical data, was implemented at the University of New South
Wales, Australia. In this approach, Ripple Down Rules (RDR) were used to
describe classes and their attributes. The form of the rules is shown in the
following figure, which represents some rules for the class Satisfactory
lipid profile previous raised LDL noted. The first rule has the condition
Max(LDL) > 3.4 and the second rule has the condition Max(LDL) is HIGH,
where HIGH is a range between two real numbers.




An example of a class which is a disjunction of two rules


The conclusions of the rules form the classes of the classification ontology. The
expert using this methodology is allowed to specify the correct conclusion and
identify the attributes and values that justify this conclusion in case the system
makes an error.


The method applied in this approach includes three basic steps:

o  The first step is to discover class relations between rules. In this stage,
   three basic relations are taken into account:

   1. Subsumption/intersection: a class A subsumes/intersects with a class
      B if class A always occurs when class B occurs, but not the other way
      around.
   2. Mutual exclusivity: two classes are mutually exclusive if they never
      occur together.
   3. Similarity: two classes are similar if they have similar conditions in
      the rules they come from.

   Based on these relations the first classes of rule conclusions are formed.

o  The second step is to specify some compound relations which appear
   interesting, using the three basic relations. This step is performed in
   interaction with the expert.

o  The final step is to extract instances of these compound relations or
   patterns and assemble them into a class model (ontology).


The key idea in this technique is that it seems reasonable to use heuristic
quantitative measures to group classes and class relations. This then enables
possible ontologies to be explored on a reasonable scale.
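The three basic class relations can be sketched over a list of cases, each case being the set of classes (rule conclusions) that fired for it. This is a minimal illustration, not the University of New South Wales system: the function names are hypothetical, and the Jaccard overlap of rule conditions is an assumed stand-in for the heuristic similarity measure, which the text does not specify.

```python
from itertools import combinations, permutations


def subsumptions(cases):
    """Pairs (a, b) where a occurs whenever b occurs, but not the other way around."""
    classes = set().union(*cases)
    result = set()
    for a, b in permutations(classes, 2):
        with_a = [c for c in cases if a in c]
        with_b = [c for c in cases if b in c]
        if (with_b and all(a in c for c in with_b)
                and not all(b in c for c in with_a)):
            result.add((a, b))
    return result


def mutually_exclusive(cases):
    """Pairs of classes that never occur together in any case."""
    classes = sorted(set().union(*cases))
    return {(a, b) for a, b in combinations(classes, 2)
            if not any(a in c and b in c for c in cases)}


def similarity(conditions_a, conditions_b):
    """Assumed measure: Jaccard overlap of the conditions of two rules."""
    union = conditions_a | conditions_b
    return len(conditions_a & conditions_b) / len(union) if union else 0.0
```

The relations found this way are only raw material; as described above, the expert then selects the compound relations that look interesting.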



5.2 Clustering




Learning semantic classes


In the context of learning semantic classes, learning from syntactic contexts
exploits syntactic relations among words to derive semantic relations, following
Harris’ hypothesis. According to this hypothesis, the study of syntactic
regularities within a specialized corpus makes it possible to identify syntactic
schemata made out of combinations of word classes reflecting specific domain
knowledge. Using specialized corpora eases the learning task, given that we have
to deal with a limited vocabulary with reduced polysemy, and limited syntactic
variability.

In syntactic approaches, learning results can be of different types, depending on
the method employed. They can be distances that reflect the degree of similarity
among terms, distance-based term classes elaborated with the help of
nearest-neighbor methods, degrees of membership in term classes, class
hierarchies formed by conceptual clustering, or predicative schemata that use
concepts to constrain selection. The notion of distance is fundamental in all
cases, as it allows calculating the degree of proximity between two objects
(terms in this case) as a function of the degree of similarity between the
syntactic contexts in which they appear. Classes built by aggregation of near
terms can afterwards be used for different applications, such as syntactic
disambiguation or document retrieval. Distances are however calculated using the
same similarity notion in all cases, and our model relies on these studies
regardless of the application task.


Conceptual clustering


Ontologies are organized as multiple hierarchies that form an acyclic graph
where nodes are term categories described by intension, and links represent
inclusion. Learning through hierarchical classification of a set of objects can be
performed in two main ways: top-down, by incremental specialization of classes,
and bottom-up, by incremental generalization. The bottom-up approach is
preferable because of its smaller algorithmic complexity and because it is
easier for the user to understand in an interactive validation task.
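Bottom-up construction of a hierarchy can be sketched as simple agglomerative clustering over the syntactic contexts of terms. This is a generic illustration, not the method of any specific workbench: cosine distance and the summing of context vectors on merge are assumptions, and the function names and example contexts are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse context vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return 1.0 - dot / norm if norm else 1.0


def bottom_up(contexts):
    """Repeatedly merge the two closest classes (incremental generalization).

    contexts -- dict term -> {syntactic context: count}
    Returns the list of merges, most specific first; each merged class is
    named by the pair of classes it generalizes.
    """
    clusters = {term: Counter(ctx) for term, ctx in contexts.items()}
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters, key=str), 2),
                   key=lambda p: cosine_distance(clusters[p[0]], clusters[p[1]]))
        # The new class is described by the combined contexts of its members.
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
    return merges
```

Because each merge is a small, local decision, the list of merges can be shown to the user step by step, which is what makes interactive validation practical.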


The Mo’K workbench


A workbench that supports the development of conceptual clustering methods for
the (semi-)automatic construction of ontologies of a conceptual hierarchy type
from parsed corpora is the Mo’K workbench. The learning model it proposes
takes parsed corpora as input. No additional (terminological or semantic)
knowledge is used for labeling the input, guiding learning or validating the
learning results. Preliminary experiments showed that the quality of learning
decreases with the generality of the corpus. This makes the use of general
ontologies for guiding such learning somewhat unrealistic, as they seem too
incomplete and polysemic to allow for efficient learning in specific domains.




5.3 Ontology Learning with Information Extraction Rules


The figure below illustrates the overall idea of building ontologies with learned
information extraction rules. We start with:

1. An initial, hand-crafted seed ontology of reasonable quality which already
   contains the relevant types of relationships between ontology concepts in
   the given domain.

2. An initial set of documents which exemplarily represent (informally)
   substantial parts of the knowledge represented formally in the seed ontology.









The steps are then:

o  Take the pairs of (ontological statement, one or more textual
   representations) as positive examples of how specific ontological
   statements can be reflected in texts. There are two possibilities to extract
   such examples:

   1. Based on the seed ontology, the system looks up the signature of a
      certain relation R (for example, a relation between the concept
      classes Disease and Cure), searches all occurrences of instances of
      these concept classes within a certain maximum distance of each other,
      and regards these co-occurrences as positive examples for
      relationship R. This approach presupposes that the seed documents have
      some “definitional” character, like domain-specific lexica or textbooks.

   2. The user goes through the seed documents with a marker and manually
      highlights all interesting passages as instances of some relationship.
      This approach is more work-intensive, but promises faster learning and
      more precise results. We employed this approach already successfully
      in an industrial information extraction project.

o  Employ a pattern learning algorithm to automatically construct information
   extraction rules which abstract from the specific examples, thus creating
   general statements about which text patterns are evidence for a certain
   ontological relationship. In order to learn such information extraction
   rules, we need some prerequisites:

   (a) A sufficiently detailed representation of documents (in particular,
       including word positions, which is not usual in conventional,
       vector-based learning algorithms, WordNet synsets, and part-of-speech
       tagging).

   (b) A sufficiently powerful representation formalism for extraction patterns.

   (c) A learning algorithm which has direct access to background knowledge
       sources, like the already available seed ontology containing
       statements about known concept instances, or like the WordNet
       database of lexical knowledge linking words to their synonym sets,
       giving access to sub- and superclasses of synonym sets, etc.

o  Apply these learned information extraction rules to other, new text
   documents to discover new or not yet formalized instances of relationship
   R in the given application domain.
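The first way of collecting positive examples (co-occurrence of known instances within a maximum distance) can be sketched as follows. This is a minimal illustration under assumptions: the function name, the token-based distance and the example instance sets are hypothetical and are not taken from the FRODO project or any other system.

```python
def positive_examples(tokens, instances_a, instances_b, max_distance=5):
    """Collect text spans in which an instance of each concept class of a
    relation's signature occurs within max_distance tokens of the other.

    tokens       -- the document as a list of words
    instances_a  -- known instances of the first concept class (from the seed ontology)
    instances_b  -- known instances of the second concept class
    Returns a list of (instance_a, instance_b, text span) triples.
    """
    positions_a = [i for i, t in enumerate(tokens) if t in instances_a]
    positions_b = [i for i, t in enumerate(tokens) if t in instances_b]
    examples = []
    for i in positions_a:
        for j in positions_b:
            if abs(i - j) <= max_distance:
                lo, hi = min(i, j), max(i, j)
                examples.append((tokens[i], tokens[j], " ".join(tokens[lo:hi + 1])))
    return examples
```

The spans returned here would be the input to the pattern learner; a small max_distance keeps the examples precise, at the cost of missing relations expressed across longer passages.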



Compared to other ontology learning approaches, this technique is not restricted
to learning taxonomy relationships, but can learn arbitrary relationships in an
application domain.

A project that uses this technique is the FRODO ("A Framework for Distributed
Organizational Memories") project, which is about methods and tools for building
and maintaining distributed Organizational Memories in a real-world enterprise
environment. It is funded by the German National Ministry for Research and
Education and started with five scientific researchers in January 2000.



6. Ontology learning tools


6.1 TEXT-TO-ONTO


TEXT-TO-ONTO supports semi-automatic ontology learning from text. It tries to
overcome the knowledge acquisition bottleneck. It is based on a general
architecture for discovering conceptual structures and engineering ontologies
from text.
















Architecture


The process of semi-automatic ontology learning from text is embedded in an
architecture that comprises several core features described as a kind of pipeline.
The main components of the architecture are the following:





Text & Processing Management Component


The ontology engineer uses this component to select the domain texts exploited
in the further discovery process. The engineer can choose among a set of text
(pre-)processing methods available on the Text Processing Server and among a
set of algorithms available at the Learning & Discovering component. The former
module returns text that is annotated with XML, and this XML-tagged text is fed
to the Learning & Discovering component.





Text Processing Server


It contains a shallow text processor based on the core system SMES
(Saarbrücken Message Extraction System). SMES is a system that
performs syntactic analysis on natural language documents. It is organized in
modules, such as a tokenizer, morphological and lexical processing, and chunk
parsing, that use lexical resources to produce mixed syntactic/semantic
information. The results of text processing are stored in annotations using
XML-tagged text.




Lexical DB & Domain Lexicon


SMES accesses a lexical database with more than 120,000 stem entries and
more than 12,000 subcategorization frames that are used for lexical analysis and
chunk parsing. The domain-specific part of the lexicon associates word stems
with concepts available in the concept taxonomy and links syntactic information
with semantic knowledge that may be further refined in the ontology.




Learning & Discovering Component


It applies various discovery methods to the annotated texts, e.g. term
extraction methods for concept acquisition.




Ontology Engineering Environment - ONTOEDIT


It supports the ontology engineer in semi-automatically adding newly discovered
conceptual structures to the ontology. Internally, it stores modeled ontologies
using an XML serialization.















6.2 ASIUM




ASIUM overview



Asium is an acronym for "Acquisition of Semantic knowledge Using Machine
learning method". The main aim of Asium is to help the expert in the acquisition
of semantic knowledge from texts and to generalize the knowledge of the corpus.
It also provides the expert with a user interface which includes tools and
functionality for exploring the texts and then learning knowledge which is not in
the texts.


During the learning step, Asium helps the expert to acquire semantic
knowledge from the texts, like subcategorization frames and an ontology.
The ontology represents an acyclic graph of the concepts of the studied
domain. The subcategorization frames represent the use of the verbs in these
texts. For example, starting from cooking recipe texts, Asium should learn an
ontology with concepts of "Recipients", "Vegetables" and "Meat". It can also
learn, in parallel, the subcategorization frame of the verb "to cook", which
can be:



to cook:
    Object: Vegetable or Meat
    in: Recipients
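
To make the relation between such a frame and the ontology concrete, the
following minimal sketch represents the frame of "to cook" as a small data
structure. The class name SubcatFrame, the helper concept_of and the noun
lists are assumptions made for illustration; they are not ASIUM's internal
representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Tiny invented ontology: concept -> nouns that belong to it.
ONTOLOGY: Dict[str, Set[str]] = {
    "Vegetable": {"carrot", "leek"},
    "Meat": {"beef", "lamb"},
    "Recipients": {"pan", "pot"},
}


@dataclass
class SubcatFrame:
    """A verb together with the concepts allowed in each of its slots."""
    verb: str
    # Maps a syntactic function or preposition to the allowed concepts.
    slots: Dict[str, Set[str]] = field(default_factory=dict)

    def accepts(self, role: str, concept: Optional[str]) -> bool:
        """True if the concept may fill the given slot of this verb."""
        return concept in self.slots.get(role, set())


def concept_of(noun: str) -> Optional[str]:
    """Look up the concept a noun belongs to, or None if it is unknown."""
    for concept, nouns in ONTOLOGY.items():
        if noun in nouns:
            return concept
    return None


to_cook = SubcatFrame("to cook", {"object": {"Vegetable", "Meat"},
                                  "in": {"Recipients"}})
```

With this representation, "to cook a carrot in a pan" fits the frame, while
"to cook in beef" does not, because Meat is not allowed in the "in" slot.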




Methodology



The overall methodology that is implemented by ASIUM is depicted in the
following figure. The input for Asium is syntactically parsed texts from a
specific domain. It then extracts these triplets: verb, preposition/function
(if there is no preposition), lemmatized head noun of the complement. Next,
using factorization, Asium groups together all the head nouns occurring with
the same couple verb, preposition/function. These lists of nouns are called
basic clusters. They are linked with the couples verb, preposition/function
they come from. Asium then computes the similarity among all the basic
clusters. The nearest ones are aggregated and this aggregation is suggested to
the expert for creating a new concept. The expert defines a minimum threshold
for gathering classes into concepts. The distance computation alone is not
enough to learn the concepts of a domain. The help of the expert is necessary
because learned concepts can contain noise (mistakes in the parsing, for
example), some sub-concepts may not be identified, or over-generalization may
occur due to aggregations. Similarity is computed between all pairs of basic
clusters, and next the expert validates the list of classes learned by Asium.
After this, Asium will have learned the first level of the ontology. Similarity
is then computed again, but among all the clusters, both the old and the new
ones, in order to learn the next level of the ontology.

The advantages of this method are twofold:

First, the similarity measure identifies all concepts of the domain and the
expert can validate or split them. Next, the learning process is, in part,
based on these new concepts and suggests more relevant and more
general concepts.

Second, the similarity measure will offer the expert aggregations between
already validated concepts and new basic clusters in order to get more
knowledge from the corpus.

The cooperative process runs until there are no more possible aggregations.
The outputs of the process are the subcategorization frames and the ontology
schema.
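
One level of this process can be sketched as follows. The sketch builds the
basic clusters from triplets and proposes aggregations above a threshold; the
expert's validation step is left out. The Jaccard overlap used here is only a
stand-in for Asium's own similarity measure, which is not specified in this
text, and the triplets are invented.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]  # (verb, preposition/function)

# Invented triplets: (verb, preposition/function, lemmatized head noun).
triplets = [
    ("cook", "object", "carrot"), ("cook", "object", "beef"),
    ("cook", "in", "pan"),
    ("fry", "object", "beef"), ("fry", "object", "carrot"),
    ("fry", "object", "onion"),
    ("fry", "in", "pan"), ("fry", "in", "wok"),
]


def basic_clusters(data: List[Tuple[str, str, str]]) -> Dict[Key, Set[str]]:
    """Factorization: group head nouns by their (verb, prep/function) couple."""
    clusters: Dict[Key, Set[str]] = defaultdict(set)
    for verb, role, noun in data:
        clusters[(verb, role)].add(noun)
    return dict(clusters)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity: shared nouns divided by all nouns."""
    return len(a & b) / len(a | b)


def propose_aggregations(clusters: Dict[Key, Set[str]], threshold: float):
    """Return candidate concepts (score, key1, key2, merged nouns) whose
    similarity reaches the expert's threshold, most similar first."""
    proposals = []
    for k1, k2 in combinations(sorted(clusters), 2):
        score = jaccard(clusters[k1], clusters[k2])
        if score >= threshold:
            proposals.append((score, k1, k2, clusters[k1] | clusters[k2]))
    return sorted(proposals, key=lambda p: p[0], reverse=True)


clusters = basic_clusters(triplets)
proposals = propose_aggregations(clusters, threshold=0.5)
```

In the cooperative setting each proposal would be shown to the expert, who
accepts, rejects or splits it; the accepted concepts then join the basic
clusters as input for the next level.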



The ASIUM methodology




SYLEX


The preprocessing of the free text is performed by Sylex. Sylex, the
syntactic parser of the French company Ingénia, is used in order to parse
source texts in French or English. This parser is a toolbox of about 700
functions which have to be used in order to produce some results.

In ASIUM, the attachments between head nouns of complements and verbs
and the bounds are retrieved from the full syntactic parsing performed by
Sylex.


The file format that Asium uses to understand the parsing is the following:

----
(Sentence of the original text)
Verbe: (the verb)
kind of complement: (head noun of the complement)
    where the kind of complement is one of Sujet (Subject), COD (Object),
    COI (Indirect object), CC(position, manière, but, provenance, direction)
    (adjunct of position, manner, aim, provenance, direction)
Bornes_mot: (bounds of the noun in the sentence)
Bornes_GN: (bounds of the noun phrase in the sentence)
Prep: (optional) (the preposition)



The resulting parsed text is then provided to ASIUM for further elaboration.




The user interface



The user interface of ASIUM allows the user to manipulate and view the
ontology in every stage of the learning process. The following figures show some
of the basic windows of the interface.


















This window allows the expert to validate the concepts learned by Asium.



















This window displays the list of all the examples covered for the learned
concept.














This window displays the ontology as it actually is in memory, i.e.
learned concepts and concepts to be proposed for this level. Each blue
circle represents a class, which may or may not be labeled.


































This window allows the expert to split a class into several sub-concepts.
The left list represents the list of nouns the expert wants to split into
sub-concepts. The right list contains all the nouns for one sub-concept.









Uses of ASIUM


The kind of semantic knowledge that is produced by ASIUM can be very useful in
a lot of applications. Some of them are mentioned below:



Information Retrieval:

Verb subcategorization frames can be used in order to tag texts. Most of
the nouns occurring in the texts will then be tagged with their concept.
The search for the right text will be based on a query using domain
concepts instead of words. For example, a user interested in movie stars
would not search for the noun "star" but for the concept "Movie_stars",
which is clearly distinct from "Space_Stars".



Information Extraction:

Such subcategorization frames together with an ontology allow the expert
to write "semantic" extraction rules.



Text indexing:

After the learning of the ontology for one domain, the texts should be
enriched by the concepts. The ontology can then be used for indexing the
texts.



Text filtering:

As with information extraction, filtering should use rules based on
concepts and on the verbs used in the texts. The filtering quality should be
improved by this semantic knowledge.



Abstracts of texts:

The use of subcategorization frames and ontology concepts will allow the
texts to be tagged, which should be a valuable help for extracting
abstracts from texts.



Automatic translation:

Creating the ontologies and the subcategorization frames in both languages,
and then using a method to match the concepts of the verb frames in both
languages, should improve translators.



Syntactic parsing improvement:

Subcategorization frames and concepts of a domain should improve a
syntactic parser by letting it choose the right verb attachment with regard
to the ontology, and thus by letting it avoid a lot of ambiguities.













7. Uses/applications of ontology learning




The ontology learning process and methods described in the previous sections
can be used and applied in many domains concerning knowledge and
information extraction. Some uses and applications are described in this
section.




7.1 Knowledge sharing in multi agent systems



Discovering related concepts in a multi-agent system among agents with
diverse ontologies is difficult using existing knowledge representation languages
and approaches. In this section an approach for identifying candidate relations
between expressive, diverse ontologies using concept cluster integration is
described. In order to facilitate knowledge sharing between a group of interacting
information agents (i.e. a multi-agent system), a common ontology should be
shared. However, agents do not always commit a priori to a common, pre-defined
global ontology. This research investigates approaches for agents with
diverse ontologies to share knowledge by automated learning methods and
agent communication strategies. The goal is that agents who do not know the
relationships of their concepts to each other need to be able to teach each other
these relationships. If the agents are able to discover these concept relations,
this will aid them as a group in sharing knowledge even though they have diverse
ontologies. Information agents acting on behalf of a diverse group of users need
a way of discovering relationships between the individualized ontologies of users.
These agents can use these discovered relationships to help their users find
information related to their topic, or concept, of interest.


In this approach, semantic concepts are represented in each agent as concept
vectors of terms. Supervised inductive learning is used by agents to learn their
individual ontologies. The output of this ontology learning is semantic concept
descriptions (SCD) in the form of interpretation rules. This concept representation
and learning is shown in the following figure.




Supervised inductive learning produces ontology rules


The process of knowledge sharing between two agents, the Q (querying) and the
R (responding) agent, begins when the Q agent sends a concept-based query.
The R agent interprets this query and, if related concepts are found, a response
is sent to the Q agent. After that, the Q agent takes the following steps to perform
the concept cluster integration:

1. From the R agent response, determine the names of the concepts to cluster.

2. Create a new compound concept using the above names.

3. Create a new ontology category by combining instances associated with
   the compound concept.

4. Re-learn the ontology rules.

5. Re-interpret the concept-based query using the new ontology rules,
   including the new concept cluster description rules.

6. If the concept is verified, store the new concept relation rule.

In this way, an agent learns from the knowledge provided by another agent.

This methodology was implemented in the DOGGIE (Distributed Ontology
Gathering Group Integration Environment) system, which was developed at the
University of Iowa.
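
The six integration steps can be illustrated with a deliberately simplified
sketch. It is not the DOGGIE implementation: summed term vectors with cosine
matching stand in for the supervised inductive learner and its interpretation
rules, and all concept names, instances and function names are invented for
the example.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def vectorize(text: str) -> Counter:
    """Concept vector of terms for a single text."""
    return Counter(text.lower().split())


def learn(ontology: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Stand-in for rule learning: one summed term vector per concept."""
    return {name: sum((vectorize(t) for t in texts), Counter())
            for name, texts in ontology.items()}


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def interpret(text: str, rules: Dict[str, Counter]) -> str:
    """Assign a text to the concept whose vector it resembles most."""
    vec = vectorize(text)
    return max(rules, key=lambda name: cosine(vec, rules[name]))


def integrate(q_ontology: Dict[str, List[str]], q_concept: str,
              r_concept: str, r_instances: List[str],
              query_examples: List[str]) -> Optional[Tuple[str, str]]:
    compound = f"{q_concept}+{r_concept}"                     # steps 1 and 2
    merged = dict(q_ontology)
    merged[compound] = merged.pop(q_concept) + r_instances    # step 3
    rules = learn(merged)                                     # step 4
    hits = sum(interpret(e, rules) == compound                # step 5
               for e in query_examples)
    verified = hits > len(query_examples) / 2
    return (q_concept, r_concept) if verified else None       # step 6


q_ontology = {
    "Sci-Fi": ["robots explore distant planets", "alien starship invades earth"],
    "Cooking": ["roast the chicken with herbs", "simmer the soup slowly"],
}
r_instances = ["starship crew meets alien robots", "planets colonised by robots"]
relation = integrate(q_ontology, "Sci-Fi", "Science-Fiction",
                     r_instances, ["alien robots on distant planets"])
```

Here the Q agent ends up storing a relation between its own concept "Sci-Fi"
and the R agent's concept "Science-Fiction", because the query examples are
still interpreted as the compound concept after re-learning.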



7.2 Ontology based Interest Matching



Designing a general algorithm for interest matching is a major challenge in
building online community and agent-based communication networks. This
section presents an information theoretic concept-matching approach to measure
degrees of similarity among users. A distance metric is used as a measure of
similarity on users represented by concept hierarchies. Preliminary sensitivity
analysis shows that this distance metric has more interesting properties and is
more noise tolerant than keyword-overlap approaches. With the emergence of
online communities on the Internet, software-mediated social interactions are
becoming an important field of research. Within an online community, the history
of a user’s online behavior can be analyzed and matched against other users to
provide collaborative sanctioning and recommendation services to tailor and
enhance the online experience. In this approach the process of finding similar
users based on data from logged behavior is called interest matching.


Ontologies may take many forms. In the described method, an ontology is
expressed in a tree-hierarchy of concepts. In general, tree-representations of
ontologies are usually polytrees. However, for the purpose of simplicity, here the
tree representation is assumed to be singly connected and all child
nodes of a node are assumed to be mutually exclusive. Concepts in the hierarchy
represent the subject areas that the user is interested in. To facilitate ontology
exchange between agents, an ontology can be encoded in the DARPA Agent Markup
Language (DAML). The figure below illustrates a visualization of this sample
ontology.




An example of an ontology used


The root of the tree represents the interests of the user. Subsequent sub-trees
represent classifications of interests of the user. Each parent node is related to a
set of children nodes. A directed edge from the parent node to a child node
represents a (possibly exclusive) sub-concept. For example, in the figure,
Seafood and Poultry are both subcategories of the more general concept of
Food. However, if every user is to adopt the standard ontology, there
must be a way to personalize the ontology to describe each user. For each user,
each node has a weight attribute to represent the importance of the concept. In
this ontology, given the context of Food, the user tends to be more interested in
Seafood rather than Poultry. The weights in the ontology are determined by
observing the behavior of the user. The history of the user’s online readings and
explicit relevance feedback are excellent sources for determining the values of
the weights.


In this approach, a standard ontology is used to categorize the interests of
users. Using the standard ontology, the websites the user visits can be classified
and entered into the standard ontology to personalize it. A form of weight for
each category can then be derived: if a user frequents websites in that category
or an instance of that class, it can be assumed that the user will also be interested
in other instances of the class. With the weights, the distance metric can be used
to perform comparisons between the interests of different users and finally
categorize them. The effectiveness of the ontology matching algorithm is to be
determined by deploying it in various instances of online communities.
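
The idea of weighting a shared ontology per user and then comparing users can
be sketched as follows. The exact distance metric of the approach is not given
in this text, so the Jensen-Shannon divergence is used here purely as an
assumed, information-theoretic stand-in. The sketch also ignores the
hierarchy and compares only the weights of the leaf concepts; the visit counts
are invented.

```python
import math
from typing import Dict, List

# Leaf concepts of a tiny standard ontology (children of Food in the example).
LEAF_CONCEPTS: List[str] = ["Seafood", "Poultry"]


def weights_from_visits(visits: Dict[str, int],
                        concepts: List[str]) -> Dict[str, float]:
    """Personalize the standard ontology: turn visit counts into weights
    that sum to 1 over the given concepts."""
    total = sum(visits.get(c, 0) for c in concepts)
    if total == 0:
        raise ValueError("user has no recorded visits in the ontology")
    return {c: visits.get(c, 0) / total for c in concepts}


def _kl(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Kullback-Leibler divergence in bits; zero-weight terms contribute 0."""
    return sum(p[c] * math.log2(p[c] / q[c]) for c in p if p[c] > 0)


def interest_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence: 0 for identical users, at most 1."""
    m = {c: (p[c] + q[c]) / 2 for c in p}
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


alice = weights_from_visits({"Seafood": 8, "Poultry": 2}, LEAF_CONCEPTS)
bob = weights_from_visits({"Seafood": 7, "Poultry": 3}, LEAF_CONCEPTS)
carol = weights_from_visits({"Seafood": 1, "Poultry": 9}, LEAF_CONCEPTS)

d_alice_bob = interest_distance(alice, bob)
d_alice_carol = interest_distance(alice, carol)
```

Under these assumptions, Alice and Bob, who both prefer Seafood, come out
closer to each other than Alice and Carol do.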










7.3 Ontology learning for Web Directory Classification




Ontologies and ontology learning can also be used to create information
extraction tools for collecting general information from the free text of web pages
and classifying them in categories. The goal is to collect indicator terms from the
web pages that may assist the classification process. These terms can be
derived from directory headings of a web page as well as its content. The
indicator terms along with a collection of interpretation rules can result in a
hierarchy (ontology) of web pages. In this way, the Information Extraction and
Ontology Learning process can be applied to large web directories both for
information storage and knowledge mining.
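
A minimal sketch of this idea is given below, under strong simplifying
assumptions: the indicator terms are supplied by hand for each directory
heading, the only interpretation rule is "assign the page to the category with
the most indicator hits", and all headings, URLs and page texts are invented.

```python
import re
from collections import Counter
from typing import Dict, Optional, Set

# Hypothetical directory headings with hand-picked indicator terms.
INDICATORS: Dict[str, Set[str]] = {
    "Business/Hotels": {"hotel", "rooms", "booking"},
    "Business/Restaurants": {"restaurant", "menu", "chef"},
}


def tokenize(text: str):
    return re.findall(r"[a-z]+", text.lower())


def classify(page_text: str, indicators: Dict[str, Set[str]]) -> Optional[str]:
    """Interpretation rule: pick the category with the most indicator hits,
    or None if the page contains no indicator term at all."""
    words = Counter(tokenize(page_text))
    scores = {cat: sum(words[t] for t in terms)
              for cat, terms in indicators.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_hierarchy(assignments: Dict[str, str]) -> dict:
    """Turn 'A/B' category paths into a nested dict (a small page hierarchy)."""
    tree: dict = {}
    for url, category in assignments.items():
        node = tree
        for part in category.split("/"):
            node = node.setdefault(part, {})
        node.setdefault("_pages", []).append(url)
    return tree


pages = {
    "http://example.org/a": "Book rooms at our hotel online",
    "http://example.org/b": "Our chef presents a new menu",
    "http://example.org/c": "Weather forecast for the weekend",
}
labels = {url: classify(text, INDICATORS) for url, text in pages.items()}
hierarchy = build_hierarchy({u: c for u, c in labels.items() if c is not None})
```

Pages without any indicator term, like the third one, stay unclassified
instead of being forced into a category.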



7.4 E-mail classification



KMi Planet


“KMi Planet” is a web-based news server for communication of stories between
members of the Knowledge Media Institute. Its main goals are to classify an
incoming story, obtain the relevant objects within the story, deduce the
relationships between them, and populate the ontology with minimal help from
the user.


The system integrates a template-driven information extraction engine with an
ontology engine to supply the necessary semantic content. Its two primary
components are the story library and the ontology library. The Story Library
contains the text of the stories that have been provided to Planet by the
journalists. In the case of KMi Planet it contains stories which are relevant to
the institute. The Ontology Library contains several existing ontologies, in
particular the KMi ontology. PlanetOnto augments the basic publish/find
scenario supported by KMi Planet and supports the following activities:


1. Story submission. A journalist submits a story to KMi Planet using e-mail
   text. The story is then formatted and stored.

2. Story reading. A Planet reader browses through the latest stories using a
   standard Web browser.

3. Story annotation. Either a journalist or a knowledge engineer manually
   annotates the story using Knote (the Planet knowledge editor).

4. Provision of customized alerts. An agent called Newsboy builds user
   profiles from patterns of access to PlanetOnto and then uses these profiles
   to alert readers about relevant new stories.

5. Ontology editing. A tool called WebOnto provides Web-based visualisation,
   browsing and editing support for the ontology. The "Operational Conceptual
   Modelling Language" (OCML) is a language designed for knowledge modeling.
   WebOnto uses OCML and allows the creation of classes and instances in the
   ontology, along with easier development and maintenance of the knowledge
   models. This is the point at which ontology learning is involved.

6. Story soliciting. An agent called Newshound periodically solicits stories
   from the journalists.

7. Story retrieval and query answering. The Lois interface supports integrated
   access to the story archive.


Two other tools have been integrated in the architecture:



MyPlanet: an extension to Newsboy that helps story readers to read only
the stories that are of interest instead of reading all stories in the archive.
It uses a manually predefined set of cue-phrases for each of the research areas
defined in the ontology. For example, for genetic algorithms one cue-phrase is
"evolutionary algorithms". Consider the example of someone interested in the
research area Genetic Algorithms. A search engine will return all the stories that
talk about that research area. In contrast, MyPlanet (by using the ontological
relations) will also find all Projects that have research area Genetic Algorithms
and then search for stories that talk about these projects, thus returning them to
the reader even if the story text itself does not contain the phrase "genetic
algorithms".

An IE tool: a tool which extracts information from e-mail text and
connects with WebOnto to prove theorems using the KMi Planet ontology.
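
The difference between plain cue-phrase search and the ontology-aware search
of MyPlanet can be sketched as follows. The project name, the stories and the
function names are invented for illustration, and the ontological relation
"project has research area" is reduced to a simple dictionary.

```python
from typing import Dict, List, Set

# Manually predefined cue-phrases per research area (as in the example above).
CUE_PHRASES: Dict[str, List[str]] = {
    "Genetic Algorithms": ["genetic algorithms", "evolutionary algorithms"],
}

# Hypothetical ontological relation: project -> research area.
PROJECT_AREA: Dict[str, str] = {"EvoDesign": "Genetic Algorithms"}

STORIES: Dict[int, str] = {
    1: "New results on evolutionary algorithms presented at the seminar",
    2: "The EvoDesign project wins additional funding",
    3: "Canteen menu changes next week",
}


def keyword_search(area: str) -> Set[int]:
    """Stories whose text contains one of the cue-phrases of the area."""
    phrases = CUE_PHRASES.get(area, [])
    return {sid for sid, text in STORIES.items()
            if any(p in text.lower() for p in phrases)}


def ontology_search(area: str) -> Set[int]:
    """Cue-phrase hits plus stories that mention a project of that area."""
    projects = [p.lower() for p, a in PROJECT_AREA.items() if a == area]
    via_projects = {sid for sid, text in STORIES.items()
                    if any(p in text.lower() for p in projects)}
    return keyword_search(area) | via_projects


plain = keyword_search("Genetic Algorithms")
enriched = ontology_search("Genetic Algorithms")
```

Story 2 never mentions genetic or evolutionary algorithms, yet the
ontology-aware search returns it because the project it talks about is linked
to that research area.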



8. Conclusion




Ontology learning could add significant leverage to the Semantic Web because
it propels the construction of domain ontologies, which the Semantic Web needs
to succeed. We have presented a collection of approaches and methodologies
for ontology learning that crosses the boundaries of single disciplines, touching
on a number of challenges.

All these methods are still experimental and await further improvement and
analysis. So far, the results are rather discouraging compared to the final goal
that has to be achieved: fully automated, intelligent, knowledge-learning
systems. The good news is, however, that perfect or optimal support for
cooperative ontology modeling is not yet needed. Cheap methods in an integrated
environment can tremendously help the ontology engineer. While a number of
problems remain within individual disciplines, additional challenges arise that
specifically pertain to applying ontology learning to the Semantic Web. With the
use of XML-based namespace mechanisms, the notion of an ontology with
well-defined boundaries (for example, only definitions that are in one file)
will disappear. Rather, the Semantic Web might yield a primitive structure
regarding ontology boundaries because ontologies refer to and import each other.
However, what the semantics of these structures will look like is not yet known.
In light of these facts, the importance of methods such as ontology pruning and
crawling will drastically increase, and further approaches are yet to come.



9. References


[1] M. Sintek, M. Junker, L. van Elst, A. Abecker, Using Information
Extraction Rules for Extending Domain Ontologies, German Research Center for
Artificial Intelligence (DFKI)

[2] M. Vargas-Vera, J. Domingue, Y. Kalfoglou, E. Motta, S. Buckingham Shum,
Template-Driven Information Extraction for Populating Ontologies, Knowledge
Media Institute (UK)

[3] G. Bisson, C. Nedellec, Designing clustering methods for ontology building,
University of Paris

[4] A. Maedche, S. Staab, The TEXT-TO-ONTO Ontology Learning Environment,
University of Karlsruhe

[5] A. Maedche, S. Staab, Ontology Learning for the Semantic Web, University of
Karlsruhe

[6] H. Suryanto, P. Compton, Learning classification taxonomies from a
classification knowledge based system, University of New South Wales
(Australia)

[7] Proceedings of the First Workshop on Ontology Learning OL'2000,
Berlin, Germany, August 25, 2000

[8] Proceedings of the Second Workshop on Ontology Learning OL'2001,
Seattle, USA, August 4, 2001

[9] ASIUM web page:
http://www.lri.fr/~faure/Demonstration.UK/Presentation_Demo.html

[10] T. Yamaguchi, Acquiring Conceptual Relationships from domain specific
texts, Shizuoka University, Japan

[11] G. Heyer, M. Lauter, Learning Relations using Collocations, Leipzig
University, Germany

[12] C. Brewster, F. Ciravegna, Y. Wilks, User-centered ontology learning for
knowledge management, University of Sheffield, UK

[13] A. Williams, C. Tsatsoulis, An instance based approach for identifying
candidate ontology relations within a multi agent system, University of Iowa

[14] W. Koh, L. Mui, An information theoretic approach to ontology based interest
matching, MIT

[15] M. Kavalec, V. Svatek, Information extraction and ontology learning guided
by web directory, University of Prague

[16] C. Brewster, F. Ciravegna, Y. Wilks, Knowledge acquisition for knowledge
management, University of Sheffield, UK