Semantic Annotation of Web Documents and Ontology evolution with the MOMIS System


D. Beneventano and S. Bergamaschi¹ and F. Guerra

Dipartimento di Ingegneria dell’Informazione
Università di Modena e Reggio Emilia
Via Vignolese 905, Modena
{beneventano.domenico, bergamaschi.sonia, guerra.francesco}@unimo.it


Introduction

Nowadays the Web is a huge collection of information, and it is expanding at a very high rate. Users need new ways to exploit all this available information and its possibilities. The problem is that Web information is meaningless to a computer, so it is very hard to find what we are looking for. This context calls for a new vision of the Web; its inventor, Tim Berners-Lee, has called it the Semantic Web². Berners-Lee, who also built the infrastructure of the current Web, has imagined a Web where resources are annotated with machine-processable metadata that provide them with background knowledge and meaning. Moreover, this annotation could be exploited by users’ agents that carry out sophisticated tasks and searches, roaming from page to page.

This new scenario raises many expectations among users and information providers, but new issues and problems have to be solved before good results can be achieved. One of the main issues in this context is “dynamics”: the Web environment is highly changeable and is continuously updated and modified, yet users need to rely on the data they retrieve from the net. Another fundamental component is the ontology; this “explicit specification of a conceptualization” [3] might allow information providers to annotate their documents. The annotation phase is a crucial step in creating a semantically rich environment that users' agents can exploit and understand. Many studies are trying to find languages and standards that can help domain experts in the delicate task of expressing their knowledge in a formal way. But ontologies evolve just as data do, and therefore we again face the problem of managing dynamics, this time with respect to ontologies.

Another important aspect is source integration. Today the ability to integrate, merge and manipulate heterogeneous sources is fundamental: companies merge their departments, doctors gather their data, vendors want to create common marketplaces, etc. Much research effort on source integration has been carried out within the traditional data management approach of distributed databases and multi-database systems. The common assumption in this area is to have a global database schema, usually obtained by skilled database designers [16]. Many systems have been developed as heterogeneous distributed databases, often called multi-database systems, e.g. Multibase [17], MOMIS, Garlic [18], TSIMMIS [19], and Information Manifold [20]. In most of these systems, a user issues queries against a global schema and the system (called a mediator in [21]) maps the queries to sub-queries on the underlying data sources. Each data source has a wrapper able to map sub-queries into its native query language. A database designer is responsible for creating the global schema and the mappings with the data sources, and for maintaining the schema and mappings w.r.t. evolution (i.e. data sources entering and leaving the system).

¹ IEIIT-CNR, viale Risorgimento 2, 40136 Bologna, Italy
² http://www.w3.org/2001/sw/

It is clear by now that the Semantic Web vision implies research activity on many topics, such as information retrieval and integration, semantic enrichment, and dynamics management.

The main purpose of this work is to address the problem of ontology evolution when new sources are inserted. We solve this problem in the context of the MOMIS system, the Intelligent Information Integration system developed at the University of Modena and Reggio Emilia, for sources that are HTML/XML documents.


MOMIS (Mediator envirOnment for Multiple Information Sources) [1,2,15] has been conceived as a group of tools that provide an integrated virtual view of heterogeneous structured and semi-structured information sources. The result of the integration process is a Global Virtual View (GVV for short), which is a set of (global) classes that represent the information contained in the sources being used. The GVV thus conceptualizes the domain of the sources referred to, and may be thought of as a domain ontology [4] for the integrated sources. Therefore, the GVV generated by applying our integration methodology to a set of web documents (HTML/XML) may be thought of as a reference ontology for the involved sources. While the usual approach to the Semantic Web is based on the “a priori” existence of an ontology, or of a list of different versions of an ontology, our approach “builds” a domain ontology as the synthesis of the integration process: an incremental process whose result is enriched every time a new source is added.

The integration approach and the ontology have some interesting features:

- The ontology covers the whole domain of the aforementioned sources.
- The semi-automatic methodology for building the ontology guarantees that every interesting concept is represented in the GVV.
- The same methodology can be used to support the evolution of an ontology.


In this paper, we focus on applying our methodology to a particular kind of source, i.e. web documents, and show how the result of the integration process can be exploited to create a conceptualization of the domain. The obtained conceptualization is a domain ontology composed of the following elements:

- local ontologies of the sources: formal explicit descriptions, in a common language (ODLI3), of concepts (classes), properties of each concept (attributes), and restrictions on instances of classes (integrity constraints);
- annotations of the local sources: each term is annotated with its meaning w.r.t. a lexical ontology (at the moment we use WordNet [5]);
- Common Thesaurus: contains the relationships, expressible in ODLI3, among the terms coming from different local ontologies;
- Global Virtual View (GVV): consists of the GVV itself, expressed as a set of global classes, and the mappings between the GVV and the local ontologies. Each global class represents a concept of the domain, and each global attribute of a global class represents a property of that concept;
- annotations of the GVV: the GVV terms (classes and attributes) are annotated with their meaning w.r.t. a lexical ontology (at the moment we use WordNet).


The main contributions of this work w.r.t. previous papers about the MOMIS system are the following:

- extension of the kinds of sources to be integrated, i.e. HTML documents;
- annotation of the GVV. Annotating a GVV means providing global classes with a name (useful to recognize what the sources describe) and with meaning(s) (useful to manage concepts without ambiguity). Starting from the annotations of the local sources and the mappings between the GVV and the integrated local ontologies, we propose a semi-automatic methodology to generate the annotations;
- evolution of an ontology. Exploiting the GVV annotation and the MOMIS integration methodology, we propose a method for supporting the evolution caused by the insertion of a new source.


The outline of the paper is the following: Section 2 describes the MOMIS approach for creating a domain ontology from scratch and shows the result of the integration process (the GVV). Section 3 describes the semi-automatic annotation process of the GVV. Section 4 presents the methodology to support the evolution of the GVV when a new source is added to an ontology. Finally, Section 5 concludes the paper.


2. The MOMIS system

In this section, we describe the information integration process used to construct the GVV of a set of web pages, that is, the construction of the GVV and of the mapping description between the GVV itself and the integrated sources.

2.1 Common Thesaurus Generation

To develop intelligent techniques for semantic integration, intra- and inter-schema knowledge between information sources in the considered domain has to be identified and properly represented. For this purpose, MOMIS constructs a Common Thesaurus describing intra- and inter-schema knowledge in the form of intensional relationships (SYN, BT, NT, and RT) and extensional relationships (SYN_EXT, BT_EXT, and NT_EXT, between class names).

The Common Thesaurus is constructed through an incremental process in which relationships are added in the following order:

1. schema-derived relationships: intensional and extensional relationships holding at intra-schema level, extracted by analyzing each schema separately;

2. lexicon-derived relationships: these relationships are derived from the semantic relations between meanings coming from WordNet, according to the following mapping:

   Synonymy: corresponds to a SYN relationship
   Hypernymy: corresponds to a BT relationship
   Hyponymy: corresponds to an NT relationship
   Holonymy: corresponds to an RT relationship
   Meronymy: corresponds to an RT relationship
   Correlation: corresponds to an RT relationship

3. designer-supplied relationships: new relationships can be supplied directly by the designer to capture specific domain knowledge. This is a crucial operation, because the new relationships are forced to belong to the Common Thesaurus. This means that, if a nonsensical or wrong relationship is inserted, the subsequent integration process can produce a wrong global schema;

4. inferred relationships: Description Logics techniques of ODB-Tools [12] are exploited to infer new relationships.
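The lexicon-derived step above amounts to a lookup table from WordNet relations to thesaurus relationships. The following is a minimal sketch of that lookup, not MOMIS code: the names `WORDNET_TO_THESAURUS` and `lexicon_derived` are hypothetical, and we assume that "a is a hyponym of b" is passed as `(a, b, "hyponymy")`.

```python
# Illustrative sketch only: these names are hypothetical, not part of MOMIS.
# The table reproduces the WordNet-to-thesaurus mapping listed in the text.
WORDNET_TO_THESAURUS = {
    "synonymy": "SYN",
    "hypernymy": "BT",
    "hyponymy": "NT",
    "holonymy": "RT",
    "meronymy": "RT",
    "correlation": "RT",
}


def lexicon_derived(term_a, term_b, wordnet_relation):
    """Translate 'term_a is in wordnet_relation with term_b' into a
    Common Thesaurus triple (term_a, relationship, term_b)."""
    kind = WORDNET_TO_THESAURUS[wordnet_relation.lower()]
    return (term_a, kind, term_b)


# Assumed direction: the meaning of Article is a hyponym of the meaning of Publication.
example = lexicon_derived("UNI.Article", "CS.Publication", "hyponymy")
print(example)
```

Holonymy, meronymy and correlation all collapse to RT, so the direction of the original WordNet relation is lost for those three cases.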


2.2 WordNet

The WordNet database contains 146,350 lemmas organized in 111,223 synonym sets (synsets). WordNet’s starting point for lexical semantics is the conventional association between the forms of words (that is, the way in which words are pronounced or written) and the concepts or meanings they express. These associations give rise to several properties, including synonymy, polysemy, and so forth. The correspondence between word forms and their meanings is synthesized in the so-called Lexical Matrix M, in which word meanings are reported in rows (hence each row represents a synset) and columns represent word forms (base lemmas):



        F1      F2      F3      ...     Fn
M1      E1,1    E1,2
M2              E2,2
M3                      E3,3
...                             ...
Mm                                      Em,n

Table 1: WordNet word forms and meanings

Each element in the matrix implies that the form in that particular column can be used, in an appropriate context, to express the meaning in that particular row. Thus, entry E1,1 implies that word form F1 can be used to express word meaning M1. If there are at least two entries in the same column, then the corresponding word form is polysemous (i.e. it can be used to represent more than one meaning, exactly two in this case); if there are at least two entries in the same row, then the corresponding word forms are synonyms relative to a context.

An entry Ei,j of the matrix is denoted by Fi#j; in this way a meaning can also be identified as F#counter. Furthermore, each meaning can have more than one of these identifiers.
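The column and row tests described above can be checked mechanically. The sketch below is only an illustration under our own assumptions: the matrix is encoded as a list of (meaning, word form) pairs, the subscripts of each entry in Table 1 are read as (row, column), and the function names are hypothetical.

```python
from collections import defaultdict

# Entries of the lexical matrix of Table 1, encoded as (meaning, word form)
# pairs. Assumption: entry Ei,j sits in row Mi and column Fj.
ENTRIES = [("M1", "F1"), ("M1", "F2"), ("M2", "F2"), ("M3", "F3")]


def polysemous_forms(entries):
    """Word forms whose column holds at least two entries."""
    meanings_by_form = defaultdict(set)
    for meaning, form in entries:
        meanings_by_form[form].add(meaning)
    return {f for f, ms in meanings_by_form.items() if len(ms) >= 2}


def synonym_sets(entries):
    """Rows holding at least two entries: forms that are synonyms for that meaning."""
    forms_by_meaning = defaultdict(set)
    for meaning, form in entries:
        forms_by_meaning[meaning].add(form)
    return {m: fs for m, fs in forms_by_meaning.items() if len(fs) >= 2}


print(polysemous_forms(ENTRIES))
print(synonym_sets(ENTRIES))
```

With this encoding, F2 is the only polysemous form (it expresses M1 and M2), and F1 and F2 are synonyms with respect to M1.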

2.3 ODLI3 language

For a semantically rich representation of source schemas and object patterns, MOMIS uses an object-oriented language called ODLI3 [2]. ODLI3 is an extension of the ODL language³ and can be used to describe heterogeneous schemas of structured and semi-structured data sources. ODLI3 extends ODL with intensional and extensional relationships expressing intra- and inter-schema knowledge for the source schemas. In particular, ODLI3 extends ODL with the following relationships:

- SYN (synonym of) is a relationship defined between two terms ti and tj (where ti ≠ tj) that are synonyms in every involved source.
- BT (broader term) is a relationship defined between two terms ti and tj, where ti has a broader, more general meaning than tj. BT relationships are not symmetric. The opposite of BT is NT (narrower term).
- RT (related term) is a relationship defined between two terms ti and tj that are generally used together in the same context in the considered sources.

Lexical relationships extracted with WordNet are translated into ODLI3 relationships following the mapping given in Section 2.1.

³ http://www.odmg.org


2.4 Extracting data structure for sources.

The ontology development process starts with the acquisition of the descriptions of the information sources, aiming at constructing a representation of the sources by means of a common data model (ODMI3) and a common data language (ODLI3) [6].

We encapsulate each source with a wrapper that logically converts the underlying data structure into the ODLI3 information model. The wrapper architecture and interfaces are therefore crucial, because wrappers are the focal point for managing the diversity of data sources. A wrapper from XML/DTDs to ODLI3 has already been developed within the MOMIS system, but in the context of web documents, HTML documents are the majority. Thus we need a further extraction step.

2.4.1. Wrapping Web content

Information is available on the Web mainly in HTML format, which is human-readable but cannot easily be accessed and manipulated automatically. XML separates data structure from layout and provides a much more suitable data representation [7, 8].

To translate information from HTML into XML format, we tested and reviewed many research and commercial tools, such as Lixto [9], RoadRunner [10], and Andes [11], and we selected Lixto as the most suitable for our approach.

By providing a fully visual and interactive user interface, Lixto assists the user in creating wrapper programs in a semi-automatic way. Once the wrapper is built, it can be applied automatically to continually extract relevant information from a permanently changing web page.

When an HTML web source has to be added to the integration process, we create a wrapper program with Lixto to translate the information into XML format. We then use the MOMIS XML wrapper to represent the XML data in the ODLI3 language.

2.5 Annotations of a local source

In this step, the meanings of terms are established with respect to a lexical ontology. The integration designer has to manually choose the correct WordNet meaning for each schema element. This is a two-step process:

1. Word form choice. In this step, the WordNet morphologic processor aids the designer by suggesting a word form corresponding to the given term. More precisely, the morphologic processor stems the term (i.e. converts it to a common root form) and checks whether it exists as a word form.

2. Meaning choice. For each element, the designer has to choose the correct senses. Notice that the designer can only choose among the senses existing in WordNet; that is, he is not allowed to extend WordNet with new meanings.

This step assigns to each local class LC a name, LC_N, unambiguously identifying the local class, and a (possibly empty) set of meanings LC_M:

$LC = \langle LC_N, \{LC_{M_1}, \dots, LC_{M_k}\} \rangle, \quad k \ge 0$

2.6 GVV generation

The MOMIS methodology for constructing the GVV allows the semi-automatic identification of ODLI3 classes that describe the same or semantically related concepts in different sources, and gives a measure of the level of matching of their structures. This activity is performed by the ARTEMIS tool. ARTEMIS was conceived for the semi-automatic integration of heterogeneous structured databases [13]; in the context of MOMIS, the ARTEMIS affinity framework has been extended and applied to the analysis of ODLI3 schema descriptions [14].

The generation of Global Classes out of selected clusters is a synthesis activity performed interactively with the designer: a Global Class definition is given for each cluster. Let Cli be a selected cluster in the affinity tree provided by ARTEMIS, and let GCi be the global ODLI3 class defined for Cli. The GVV generation consists of two phases. First, we associate with GCi a set of global attributes corresponding to the union of the attributes of the classes belonging to Cli. Then, the attribute set is restricted on the basis of the following rules:

- for attributes that have a SYN relationship, only one term is selected as the name of the corresponding global attribute in GCi;
- for attributes that have a BT/NT relationship, a term that is a BT of all the others is selected as the name of the corresponding global attribute in GCi.

At this point, redundancies are eliminated in a semi-automatic way, taking into account the relationships stored in the Common Thesaurus. For each GCi, a persistent Mapping Table storing all the intensional mappings is generated; it is a table whose columns represent the local classes that belong to the cluster and whose rows represent the global attributes. An element MT[ga][L] represents the set of attributes of the local class L that are mapped into the global attribute ga: the value of ga is a function of the values assumed by the attributes in MT[ga][L]. Some simple and frequent cases of such a function are the following:

- identity: the ga value is equal to the value of a local attribute la; we denote this case as MT[ga][L] = la
- concatenation: the ga value is obtained as the concatenation of the values assumed by a set of local attributes lai of the local class L; we denote this case as MT[ga][L] = la1 and … and lan
- constant: ga assumes in the local class L a constant value set by the designer; we denote this case as MT[ga][L] = const
- undefined: ga is undefined in the local class L; we denote this case as MT[ga][L] = null.
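The four mapping cases listed above can be read as a small interpreter over Mapping Table entries. The sketch below illustrates that reading under our own assumptions: the tuple encoding of an entry, the function name `global_value`, the separator used for concatenation, and the sample record are all hypothetical and are not taken from MOMIS.

```python
# Hypothetical encoding of one Mapping Table entry MT[ga][L]:
#   ("identity", la)             value of the local attribute la
#   ("concat", [la1, ..., lan])  concatenation of the local values
#   ("const", value)             constant chosen by the designer
#   None                         undefined (null mapping)
def global_value(entry, local_record, separator=" "):
    """Compute the value of a global attribute from one record of local class L."""
    if entry is None:
        return None
    kind, arg = entry
    if kind == "identity":
        return local_record[arg]
    if kind == "concat":
        # The separator is an assumption; the text only says "concatenation".
        return separator.join(local_record[la] for la in arg)
    if kind == "const":
        return arg
    raise ValueError(f"unknown mapping kind: {kind}")


# Invented sample record, shaped like the CS.Professor class of the running example.
professor = {"first_name": "Ada", "last_name": "Rossi", "email": "ada@example.org"}

print(global_value(("concat", ["first_name", "last_name"]), professor))
print(global_value(("identity", "email"), professor))
```

A global attribute `name` could, for instance, be mapped by concatenation onto `first_name` and `last_name` in one source and by identity onto `name` in another.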

2.7 Running example

Suppose we have to create an ontology from two different web sites related to the university domain. By means of a wrapper, we translate the web pages into XML documents [9,10,11], where the DTD represents the schema of the information held in the page and the XML file holds the data. By applying this process to the university web pages, we obtain the following DTDs:


University Site (UNI)

<!ELEMENT UNI (People*)>
<!ELEMENT People (Research_Staff*, School_Member*)>
...
<!ELEMENT Research_Staff (name, e-mail, Section*, Article*)>
<!ELEMENT Section (name, year, period)>
<!ELEMENT Article (title, year, journal, conference)>
<!ELEMENT School_Member (name, e-mail)>
<!ELEMENT name (#PCDATA)> ...

Computer Science Site (CS)

<!ELEMENT CS (Person*)>
...
<!ELEMENT Person (Professor*, Student*)>
<!ELEMENT Professor (first_name, last_name, e-mail, Publication*)>
<!ELEMENT Student (name, e-mail)>
<!ELEMENT Course (denomination, Professor)>
<!ELEMENT Publication (title, year, journal, editor)>
<!ELEMENT School_Member (name, e-mail)>
<!ELEMENT name (#PCDATA)> ...

Table 2: The University (UNI) and Computer Science (CS) site DTDs.

These DTDs are translated into ODLI3 descriptions:

University Site (UNI)

Interface Research_Staff (Source Un_site.dtd)
{ attribute string name;
  attribute string email;
  attribute section section;
  attribute article Article;
}

Interface Section (Source Un_site.dtd)
{ attribute string name;
  attribute string year;
  attribute string period;
}

Computer Science Site (CS)

Interface Professor (Source Sc_site.dtd)
{ attribute string first_name;
  attribute string last_name;
  attribute string email;
  attribute publication Publication;
}

Interface Publication (Source Sc_site.dtd)
{ attribute string title;
  attribute string year;
  attribute string journal;
}

Table 3: The University (UNI) and Computer Science (CS) sites in ODLI3.

The integration process generates the Common Thesaurus by exploiting the annotation of the source descriptions w.r.t. the WordNet database. For example, the following terms of the schemata are annotated with the corresponding synsets in WordNet:

CS.Course: course#1 = education imparted in a series of lessons or class meetings
CS.Person: person#1 = a human being
CS.Professor: professor#1 = someone who is a member of the faculty at a college or university
CS.Student: student#1 = a learner who is enrolled in an educational institution

The Common Thesaurus consists of relationships between terms. For example, the following relationships are automatically obtained by MOMIS after the annotation and proposed to the integration designer:


CS.Professor NT CS.CS_Person
UNI.School_Member NT CS.CS_Person
UNI.Research_Staff SYN CS.Professor
UNI.Research_Staff NT CS.CS_Person
CS.Student NT CS.CS_Person
UNI.Article NT CS.Publication

If the designer accepts the above relationships, the integration process gives rise to three global classes:

Global1: (UNI.Section, CS.Course)
Global2: (UNI.Article, CS.Publication)
Global3: (UNI.Research_Staff, UNI.School_Member, CS.Person, CS.Professor, CS.Student)


For each global class a Mapping Table is generated. For example, the Mapping Table for Global2 is:

              UNI.Article   CS.Publication
Title         Title         Title
Year          Year          Year
Journal       Journal       Journal
Conference    Conference    null
Editor        null          Editor

Table 4: Mapping Table of the global class Global2 (Publication).

A more detailed example of the process can be found in [2,6].
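As an illustration of how Table 4 arises, the following sketch builds a Mapping Table for Global2 from the attribute lists of its two local classes. It is a simplification under stated assumptions: attributes with the same name are treated as the same attribute (identity mappings only), the SYN and BT/NT restriction rules are not modelled, and `build_mapping_table` is a hypothetical helper, not a MOMIS function.

```python
def build_mapping_table(cluster):
    """Build MT[ga][L] for a cluster given as {local class: [attribute names]}.

    Global attributes are the union of the local attributes, in first-seen
    order. Each cell holds the local attribute name (identity mapping) or
    None when the local class has no such attribute (null mapping).
    """
    global_attrs = []
    for attrs in cluster.values():
        for name in attrs:
            if name not in global_attrs:
                global_attrs.append(name)
    return {
        ga: {lc: (ga if ga in attrs else None) for lc, attrs in cluster.items()}
        for ga in global_attrs
    }


# Attribute lists taken from the Article and Publication elements of the DTDs.
global2 = {
    "UNI.Article": ["title", "year", "journal", "conference"],
    "CS.Publication": ["title", "year", "journal", "editor"],
}

table = build_mapping_table(global2)
for ga, row in table.items():
    print(f"{ga:<12}", *(f"{str(cell):<14}" for cell in row.values()))
```

The `conference` row maps only to UNI.Article and the `editor` row only to CS.Publication, which corresponds to the two null cells of Table 4.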

3. Global Virtual View Annotation

We propose a semi-automatic methodology to perform the annotation of the GVV.

3.1 Global Class Annotation

Annotating a global class GC means providing it with a representative name GCN (unique in the GVV) and a set of meanings GCM_i:

$GC = \langle GCN, \{GCM_1, \dots, GCM_p\} \rangle, \quad p \ge 0$


In order to semi-automatically associate an annotation with each global class GC, we consider the set of all its “broadest” local classes w.r.t. the relationships included in the Common Thesaurus, denoted by GC_B:

$GC_B = \{\, LC \in GC \mid \nexists\, LC' \in GC : LC_N \;\mathrm{NT}\; LC'_N \ \text{or}\ LC'_N \;\mathrm{BT}\; LC_N \,\}$

For example:

For GC1 = {CS.Course, UNI.Section} we have GC1_B = {CS.Course, UNI.Section}
For GC2 = {CS.Publication, UNI.Article} we have GC2_B = {CS.Publication}
For GC3 = {CS.CS_Person, CS.Professor, UNI.School_Member, UNI.Research_Staff, CS.Student} we have GC3_B = {CS.CS_Person, UNI.School_Member, UNI.Research_Staff}
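The set of broadest local classes can be computed directly from the NT/BT relationships of the Common Thesaurus. The sketch below follows the definition literally and reproduces the GC1 and GC2 examples; the function name `broadest` and the pair-based encoding of relationships are our own assumptions.

```python
def broadest(global_class, nt_pairs):
    """Return the local classes of global_class that are not narrower than
    another member of the same global class.

    nt_pairs is a set of (narrower, broader) pairs. A relationship
    'x BT y' is assumed to be stored as the pair (y, x).
    """
    members = set(global_class)
    return {
        lc
        for lc in members
        if not any(n == lc and b in members for n, b in nt_pairs)
    }


# Relationship from the running example: UNI.Article NT CS.Publication.
NT = {("UNI.Article", "CS.Publication")}

gc1 = {"CS.Course", "UNI.Section"}
gc2 = {"CS.Publication", "UNI.Article"}

print(broadest(gc1, NT))
print(broadest(gc2, NT))
```

No NT/BT relationship links the two members of GC1, so both are kept; in GC2 only CS.Publication survives because UNI.Article is narrower than it.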


On the basis of GC_B, the designer annotates the global class GC as follows:

- name choice: the integration designer is responsible for the choice of the GC name; the system only suggests a list of possible names. The designer may select a name from the proposed list or choose another name outside it. In particular, given the role of the global class name (allowing the designer to identify the global class and its contents), we consider the name as a label. Therefore, a name need not be a word form of WordNet (e.g. university member).
- meaning choice: the union of the meanings of the local class names in GC_B is proposed to the designer as the set of meanings of the global class. The designer can change this set by removing some meanings or adding others.
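The proposal step described above (candidate names plus the union of meanings) can be sketched as follows. This is a minimal illustration, not the MOMIS implementation: `propose_annotation` is a hypothetical name, and the annotation of UNI.Section with an empty set of meanings is an assumption made only so that the example matches the GC1 row of the running example.

```python
def propose_annotation(broadest_classes, annotations):
    """Suggest names and meanings for a global class.

    annotations maps each local class to a pair (name, set of meanings).
    The suggested names are the names of the broadest local classes; the
    suggested meanings are the union of their meanings.
    """
    names = sorted({annotations[lc][0] for lc in broadest_classes})
    meanings = set().union(*(annotations[lc][1] for lc in broadest_classes))
    return names, meanings


# Assumed local annotations (UNI.Section is assumed to have no meaning chosen).
annotations = {
    "CS.Course": ("course", {"course#1"}),
    "UNI.Section": ("section", set()),
}

names, meanings = propose_annotation({"CS.Course", "UNI.Section"}, annotations)
print(names, meanings)
```

The designer remains free to replace the suggested names with a label of their own and to add or remove meanings, as stated in the text.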




With respect to our example, the proposed annotations are the following:

GC     Names                          Meanings
GC1    course OR section              course#1
GC2    publication                    publication#1
GC3    person, researcher, student    person#1, student#1, professor#1

Table 5: University GVV annotation.

3.2 Global Attributes Annotation

Annotating a global attribute GA of a global class GC means providing it with a representative name GAN (unique in the GC) and a set of meanings GAM_i:

$GA = \langle GAN, \{GAM_1, \dots, GAM_p\} \rangle, \quad p \ge 0$

In order to propose to the designer a name and a set of meanings for the global attributes, we extend the previously used approach to consider all the mapping relations among global and local attributes.

Given a global attribute GA of the global class GC, we consider the set LGA of local attributes that are mapped into GA:

$LGA = \{\, LA \mid \exists\, LC \in GC,\ LA \in LC \ \text{and}\ MT[GA][LA] \ \text{is not null} \,\}$

We then consider the set of all its “broadest” local attributes, denoted by LGA_B:

$LGA_B = \{\, LA \in LGA \mid \nexists\, LA' \in LGA : LA_N \;\mathrm{NT}\; LA'_N \ \text{or}\ LA'_N \;\mathrm{BT}\; LA_N \,\}$

On the basis of LGA_B, the designer annotates the global attribute as described for the global classes. Moreover, according to the mapping functions, we may develop specific policies to ignore some meanings or to prefer others.

4. Adding a new source

Supporting the evolution of an ontology is a challenging issue for the MOMIS system. In particular, it means providing the system with the capability to correctly manage:

- insertion of new sources into the system;
- updates of the structure of existing sources;
- deletion of previously integrated sources.

If new sources are added, or if changes occur in the sources, the GVV has to change. The integration process is expensive both for the designer and for the system. For these reasons, we propose a methodology to integrate new sources that exploits the previous integration work, i.e. without restarting the integration process from scratch.

In the GVV building approach, all the sources to be integrated contribute to the process with the same weight. Therefore, if we consider an already built GVV and have to insert a new source that refers to the same ontology, we can assume that this source brings less semantics than the previously built GVV. For this reason, we devise an integration process that starts from the existing GVV and tries to integrate the new source into it.

In the following, we show how the evolution of a GVV caused by the insertion of a new source can be simplified by having available the lexicon-based knowledge we introduced with the GVV annotation.

4.1 Integration of new sources

The insertion of a new source is managed as an integration process between two schemata: the GVV and the new source schema. In other words, the global classes of the GVV are considered as local classes and are integrated with the local classes of the new source.

We illustrate the approach by analyzing all the phases of the integration of the GVV with the new source. We introduce the following notation:

- gcNew: a global class of the new integrated schema; it has a name (gcNewName) and a set of global attributes gcNewAtt_i;
- gcOld: a global class of the old integrated schema; it has a name (gcOldName) and a set of global attributes gcOldAtt_j;
- lcNew: a local class of the new source; it has a name (lcNewName) and a set of local attributes lcNewAtt_k.


According to the MOMIS integration methodology, we have to create a Common Thesaurus of the involved sources. In this case, the Common Thesaurus will contain the schema-derived relationships extracted from the analysis of the new source and the intra-schema lexicon-derived relationships obtained from the annotation of the new source. Further, the global classes of the old GVV have to be semantically enriched according to the annotation methodology shown in Section 3. The interesting point is that the annotation of the old GVV allows the discovery of inter-schema lexical relationships, which enrich the Common Thesaurus.

The next step is the generation of clusters, Global Classes and mapping tables. This phase has to provide mapping rules between Global Classes and new or old local classes. To achieve this result, we substitute the old Global Classes with their respective local classes, preserving all the previous mappings. In this way, new Global Classes that represent old local classes and new local classes are built. Thus, given

$gcNew = \{gcOld_1, \dots, gcOld_p, lcNew_1, \dots, lcNew_n\}$

the result of the rewriting step is:

$gcNew = \{lcOld_{11}, \dots, lcOld_{1z}, \dots, lcOld_{p1}, \dots, lcOld_{pn}, lcNew_1, \dots, lcNew_n\}$
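The rewriting step can be pictured as expanding every old global class in the new cluster into the local classes it was built from. The sketch below shows only this expansion, under our own assumptions: `rewrite` is a hypothetical helper, the old GVV is encoded as a dictionary from global class name to its local classes, and `LIB.Book` is an invented new local class used purely for illustration.

```python
def rewrite(gc_new, old_gvv):
    """Replace each old global class in gc_new by its local classes.

    gc_new  : list of member names (old global classes and new local classes)
    old_gvv : {old global class name: [local class names]}
    Members that are not old global classes are kept unchanged.
    """
    rewritten = []
    for member in gc_new:
        rewritten.extend(old_gvv.get(member, [member]))
    return rewritten


# Global2 comes from the running example; LIB.Book is a hypothetical new class.
old_gvv = {"Global2": ["UNI.Article", "CS.Publication"]}
gc_new = ["Global2", "LIB.Book"]

print(rewrite(gc_new, old_gvv))
```

The mappings already stored for UNI.Article and CS.Publication would be carried over unchanged; only the column for the new local class needs to be filled in.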



W.r.t. the Global Class generation, we can easily observe that, with the same clustering parameters, an old cluster {lc_1, …, lc_i, …, lc_n} changes only if the integration process places one or more new local classes (lcNew_i) into it. Therefore, we identify the following possible cases:


a) A new global class gcNew is composed of only one old global class (gcOld) and one or more new local classes (lcNew_i):

$gcNew = \{gcOld, lcNew_1, \dots, lcNew_i, \dots, lcNew_n\}$

The new global class (gcNew) may have new global attributes generated from the semantic contribution of the new local classes. New mapping rules define the connection between a global attribute and its corresponding local attribute(s). In this case, the global attributes belonging to gcOld (gcOldAtt_i) can map both local classes of the previous GVV and new local classes (see the columns associated with lcNew_t, for example). New global attributes can only map new local classes (null mappings in the following table).

              lcOld_1 … lcOld_k                 lcNew_1 … lcNew_t … lcNew_n
gcOldAtt_1
…             the same mappings as in gcOld     new mappings
gcOldAtt_m
gcNewAtt_1
…             null mappings                     new mappings
gcNewAtt_p

Table 6: New Mapping Table example.



b) A global class of the new integrated schema is composed of only new local classes:

$gcNew = \{lcNew_1, \dots, lcNew_i, \dots, lcNew_n\}$

This situation describes the case in which the schema is extended without interfering with the previous classes.

The new global class (gcNew) has a name (gcNewName) and a set of new global attributes (gcNewAtt_i), where the new global attributes map only new local classes.


c) A global class of the new integrated schema is composed of more than one global class of the GVV and at least one local class of the new source being integrated:

$gcNew = \{gcOld_1, \dots, gcOld_p, lcNew_1, \dots, lcNew_i, \dots, lcNew_n\}$

In this case the previous GVV is modified; side effects can influence the applications based on the previous schema.

The new global class (gcNew) has a name (gcNewName) and a set of new global attributes (gcNewAtt_i).
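The three cases above depend only on how many old global classes and how many new local classes end up in the same cluster. The sketch below makes that decision explicit; the function name `classify` and the "unchanged" label for clusters that receive no new local class are our own conventions, not part of the methodology.

```python
def classify(gc_new, old_global_classes):
    """Return which case (a, b or c) describes the new global class gc_new.

    gc_new             : iterable of member names of the new cluster
    old_global_classes : set of names of the global classes of the old GVV
    """
    old = [m for m in gc_new if m in old_global_classes]
    new = [m for m in gc_new if m not in old_global_classes]
    if not new:
        return "unchanged"  # no new local class was placed in the cluster
    if len(old) == 0:
        return "b"  # only new local classes: the schema is extended
    if len(old) == 1:
        return "a"  # one old global class enriched by new local classes
    return "c"      # several old global classes merged: the old GVV changes


# Global2 and Global3 come from the running example; LIB.* names are invented.
old = {"Global1", "Global2", "Global3"}
print(classify(["Global2", "LIB.Book"], old))
print(classify(["LIB.Loan"], old))
print(classify(["Global2", "Global3", "LIB.Author"], old))
```

Only case c alters the classes of the previous GVV, which is why it is the one that can affect applications built on the old schema.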

For a detailed example of the integration method, see [23].

5. Concluding remarks

In this paper, we presented a methodology for supporting the semi-automatic building and evolution of a domain ontology obtained by integrating web documents with the MOMIS system. Similar approaches have to face two different issues: the system overhead of keeping the built ontology aligned with the involved sources, and the insertion of a new source that may modify the existing ontology, with side effects on each application based on the ontology.

We tried to address both issues. The most relevant advantages of the methodology we proposed can be summarized as follows:

- the process is less expensive than restarting the integration process from scratch;
- the integration process starts from the semantically annotated results of pre-existing integration processes: the new GVV exploits the previous annotation and will be similar to the previous one.

Possible limitations are:

- mistakes of the previous integration process will propagate to the new GVV;
- the new GVV is based on the previous one, and therefore it might not perfectly represent all the sources.

The methodology will be adapted within the SEWASIE project⁴. SEWASIE (SEmantic Webs and AgentS in Integrated Economies) (EU IST-2001-34825) is a research project funded by the EU on the Semantic Web action line (May 2002 to April 2005), whose goal is to design and implement an advanced search engine enabling intelligent access to heterogeneous data sources on the web via semantic enrichment, to provide the basis of structured, secure web-based communication.


Acknowledgements

This work is supported in part by the 5th Framework IST programme of the European Community through the SEWASIE project within the Semantic Web Action Line. The SEWASIE consortium comprises, in addition to the authors’ organization (Sonia Bergamaschi is the coordinator of the project), the Universities of Aachen RWTH (M. Jarke), Roma La Sapienza (M. Lenzerini, T. Catarci), and Bolzano (E. Franconi), as well as IBM Italia, Thinking Networks AG and CNA as user organizations.

⁴ http://www.sewasie.org

References

[1] Beneventano D., Bergamaschi S., Castano S., Corni A., Guidetti R., Malvezzi
G., Melchiori M., Vincini M., Information Integration: the MOMIS Project
Demonstration, Proceedings of 26th Int. Conf. on Very Large Data Bases
(VLDB 2000), 2000, Cairo, Egypt.

[2] Bergamaschi S., Castano S., Beneventano D., Vincini M., Semantic Integration
of Heterogeneous Information Sources, Special Issue on Intelligent Information
Integration, Data & Knowledge Engineering, Vol. 36, Num. 1, 215-249, Elsevier
Science B.V. 2001.

[3] T. R. Gruber. A translation approach to portable ontologies. Knowledge
Acquisition, 5(2):199-220, 1993.

[4] Guarino, N. Formal Ontology in Information Systems. In N. Guarino (ed.) Formal
Ontology in Information Systems. Proceedings of FOIS'98, Trento, Italy, 6-8 June
1998. IOS Press, Amsterdam: 3-15.

[5] Miller, G.A. WordNet: a lexical database for English. Communications of the
ACM, 38(11):39-41, 1995.

[6] Benetti I., Beneventano D., Bergamaschi S., Guerra F., Vincini M. An
Information Integration Framework for E-Commerce, IEEE Intelligent Systems,
(Jan/Feb) 2002.

[7] A.Y. Levy and D.S. Weld. “Intelligent internet systems”. Artificial
Intelligence, 118(1-2), 2000.

[8] S. Abiteboul, P. Buneman, and D. Suciu. “Data on the Web: From Relations to
Semistructured Data and XML”. Morgan Kaufmann, 2000.

[9] Robert Baumgartner, Sergio Flesca, Georg Gottlob: Visual Web Information
Extraction with Lixto. VLDB 2001: 119-128.

[10] Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner:
automatic data extraction from data-intensive web sites. SIGMOD Conference 2002.

[11] Jussi Myllymaki: Effective Web data extraction with standard XML
technologies. WWW 2001: 689-696.

[12] Beneventano D., Bergamaschi S., Sartori C., Vincini M., ODB-Tools: A
Description Logics Based Tool For Schema Validation and Semantic Query
Optimization in Object Oriented Databases, Proc. of Int. Conf. on Data
Engineering (ICDE-97), Birmingham UK 1997.

[13] Silvana Castano, Valeria De Antonellis, Sabrina De Capitani di Vimercati:
Global Viewing of Heterogeneous Data Sources. TKDE 13(2): 277-297 (2001).

[14] S. Bergamaschi, S. Castano and M. Vincini, Semantic Integration of
Semistructured and Structured Data Sources, SIGMOD Record Special Issue on
Semantic Interoperability in Global Information 28(1) (1999) 54-59.

[15] S. Bergamaschi, S. Castano, M. Vincini, "Semantic Integration of
Semistructured and Structured Data Sources", SIGMOD Record Special Issue on
Semantic Interoperability in Global Information, Vol. 28, No. 1, March 1999.

[16] W. Litwin, L. Mark, N. Roussopoulos. Interoperability of multiple autonomous
databases. ACM Computing Surveys, 22(3):267-293, 1990.

[17] J.M. Smith, P.A. Bernstein, U. Dayal, N. Goodman, T. Landers, K.W.T. Lin,
and E. Wong. Multibase: integrating heterogeneous distributed database systems.
In Proceedings of 1981 National Computer Conference, pages 487-499. AFIPS Press,
1981.

[18] M.J. Carey, L.M. Haas, P.M. Schwarz, M. Arya, W.F. Cody, R. Fagin, M.
Flickner, A.W. Luniewski, W. Niblack, D. Petkovic, J. Thomas, J.H. Williams and
E.L. Wimmers, "Towards Multimedia Information System: The Garlic Approach",
IBM Almaden Research Center, San Jose, CA 95120.

[19] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou,
J. Ullman and J. Widom: The Tsimmis Project: Integration of Heterogeneous
Information Sources, Proceedings 10th Anniversary Meeting of the Information
Processing Society of Japan, Tokyo, Japan, 1994.

[20] A.Y. Levy, A. Rajaraman and J.J. Ordille: Querying Heterogeneous
Information Sources using Source Descriptions, Proceedings 22nd VLDB
Conference, Bombay, India, 1996.

[21] G. Wiederhold, Mediators in the architecture of future information systems.
IEEE Computer 25, 3 (March), 38-49, 1992.

[22] A. Fergnani, Ontology dynamics for Semantic Web: the MOMIS approach.
Degree thesis, available at http://www.dbgroup.unimo.it, 2003.