ONTOLOGY EXTRACTION FOR THE SEMANTIC WEB: A NEW FRAMEWORK


Proceedings of the 11th Annual Conference of Asia Pacific Decision Sciences Institute, Hong Kong, June 14-18, 2006, pp. 585-588.





Feng Li
School of Business Administration, South China University of Technology, Guangzhou, P.R. China
EMAIL: fenglee@scut.edu.cn


Ying Wei
Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
EMAIL: ywei@se.cuhk.edu.hk




ABSTRACT


Wrapping the traditional web with a semantic presentation is a challenging task. This paper proposes a new framework for ontology extraction that introduces a “self-learning” cycle and works even when pre-defined knowledge is lacking. The ontology is constructed in five steps: representing the website as knowledge, recalling existing ontologies from an ontology base to recognize concepts and relations, extracting concepts with a clustering algorithm for unrecognized concepts and relations, amending the proposed ontology, and storing the confirmed ontology back in the ontology base for future reuse. Further, this framework can be integrated with other intelligent technologies. At the end of this paper, we develop a prototype system that extracts the ontology of the website of the Department of Automation and Computer-Aided Engineering, the Chinese University of Hong Kong.


Keywords: Ontology Extraction, Semantic Web




INTRODUCTION


The Semantic Web (SW), proposed by Berners-Lee as the next-generation Web, provides great benefits in Web Services, Internet Commerce, and other promising application areas [1]. However, the SW is still in its early stage and has many unsolved problems. One of them is transferring a domain-specific ontology into structured data for machine understanding, which we call ontology extraction.


A naïve approach to ontology extraction is manual construction. Manual construction is time-consuming and error-prone, and it complicates future ontology maintenance. Most current research therefore focuses on methods that generate ontologies automatically or semi-automatically. PROMPT (formerly SMART) provides a semi-automatic approach to merging or aligning ontologies [2]. Similar work includes OBSERVER, ONIONS, OntoMorph, OntoMap, and GLUE. However, merging or aligning ontologies requires additional knowledge, e.g. instances, sharable instances, or linguistic ontologies. Another stream focuses on extracting ontologies from well-structured sources. For example, ERONTO is a tool that builds ontologies from extended E/R diagrams [3]. A further stream extracts ontologies from semi-structured or unstructured sources with certain auxiliary knowledge. A typical example is OntoMiner, which extracts ontologies from overlapping domain-specific websites that have been confirmed before the extraction [4]. Others recognize linguistic ontological information in plain-text sources using knowledge-poor algorithms. A typical algorithm is Latent Semantic Indexing, a vector space approach that captures term-term statistics [5].


To create ontologies from diverse sources automatically or semi-automatically, most existing research needs various auxiliary sources. This paper proposes a new framework that introduces a “self-learning” cycle and works even when pre-defined knowledge is lacking. The ontology is constructed in five steps: representing the website as knowledge, recalling existing ontologies from an ontology base to recognize concepts and relations, extracting concepts with a clustering algorithm for unrecognized concepts and relations, amending the proposed ontology, and storing the confirmed ontology for future reuse. Furthermore, this process is implemented semi-automatically and can be integrated with other open intelligent technologies.


The rest of this paper is organized as follows. Section 2 describes the process of extracting an ontology from a website. Section 3 presents a prototype system, using the website of the Department of Automation and Computer-Aided Engineering, the Chinese University of Hong Kong, as an example. Section 4 concludes the paper.




ONTOLOGY EXTRACTION FROM WEBSITE


Before formalizing the ontology extraction process, we specify the notations and assumptions as follows.


Assumption 1. Each website is assumed to be an ontology instance. This assumption is similar to the specification in [4]: a website is said to be ‘taxonomy-directed’ if it contains at least one taxonomy for organizing its key concepts.

Assumption 2. Each Web page is assumed to be a concept instance. If no further ontology extraction is performed within a Web page, every Web page belonging to a website is a concept instance inside the ontology instance represented by that website.

Assumption 3. Each hyperlink, along with its contiguous hypertext, represents a relation instance.

Assumption 4. Web pages are assumed to be placed together if they are instances of the same concept. In other words, they are always sibling nodes in the graph that represents the organization of the website.

Assumption 5. Instances of the same concept are assumed to be similar, while instances of different concepts are assumed to be dissimilar, because the concepts they belong to are themselves dissimilar.


Given these notations and assumptions, ontology extraction from a website consists of extracting the website organization into an ontology instance, web pages into concept instances, and hyperlinks into relation instances.
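The mapping stated by Assumptions 1 to 3 can be pictured with a small data model. The following is only an illustrative sketch in Python (the prototype described later is written in Java); the class names, field names, and example URLs are hypothetical and do not come from the prototype.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptInstance:
    """A Web page treated as a concept instance (Assumption 2)."""
    url: str
    label: str = ""


@dataclass
class RelationInstance:
    """A hyperlink plus its contiguous text, treated as a relation instance (Assumption 3)."""
    source_url: str
    target_url: str
    anchor_text: str = ""


@dataclass
class OntologyInstance:
    """A website treated as an ontology instance (Assumption 1)."""
    root_url: str
    concepts: dict = field(default_factory=dict)   # url -> ConceptInstance
    relations: list = field(default_factory=list)  # RelationInstance objects

    def add_page(self, url, label=""):
        self.concepts[url] = ConceptInstance(url, label)

    def add_link(self, source_url, target_url, anchor_text=""):
        self.relations.append(RelationInstance(source_url, target_url, anchor_text))


# Hypothetical two-page website.
site = OntologyInstance("http://example.org/")
site.add_page("http://example.org/", "Home")
site.add_page("http://example.org/staff.html", "Staff")
site.add_link("http://example.org/", "http://example.org/staff.html", "Staff")
```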


The Framework

By its nature, extracting an ontology from a website is a process of acquiring semantic knowledge from web documents. Common problems in knowledge management, such as storing, adapting, and standardizing knowledge, therefore also arise in ontology extraction. We address them in five iterative phases: (1) represent the organization of the website; (2) recall previous ontologies to recognize concepts and relations; (3) extract concepts and relations; (4) amend the generated ontology; and (5) maintain the revised ontology.



Figure 1. The framework of ontology extraction


As illustrated in Figure 1, the process is centered on an ontology-base, a repository for ontologies and their patterns. Once the URL of the target website is located, web pages are fetched to represent the organization of the website. Existing ontologies and their patterns are recalled from the ontology-base to recognize the concepts and relations in the fetched web pages. If no ontologies match, we call the concepts and relations unrecognized and use heuristic algorithms to group “similar” web pages and hyperlinks into clusters. Each cluster is an instance of a new concept or relation. The concepts or relations are then refined or revised by human experts or by other intelligent techniques. Finally, the accepted ontology is stored back in the ontology-base for future reuse. The ontology-base thus grows over its lifecycle.



As indicated by the dashed lines in Figure 1, common knowledge also plays an important role in each phase. Common knowledge here means general knowledge that is loosely coupled or domain-independent, as opposed to the specific and formal knowledge represented by the ontology.


Website Representation


Successful ontology extraction relies on how the organization of the target website is represented. Representing this organization takes several steps: fetch web pages and hyperlinks, label them, prune useless pages and hyperlinks, and finally construct the organization of the target website.


Fetching a website involves four steps: (1) fetch the Web pages inside the target website from the remote server; (2) transform the fetched pages into well-formed web documents; (3) parse the documents and filter out trivial HTML tags such as “CENTER” and “HR”; (4) store the fetched website in a local repository for later use.
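Step (3) can be illustrated with a small parser. This is a minimal sketch using Python's standard html.parser module, not the Xerces-based parsing of the prototype; the class name PageParser, the helper parse_page, and the sample page are hypothetical. It ignores the two trivial tags named above while keeping the "TITLE" text and each hyperlink with its anchor text.

```python
from html.parser import HTMLParser

# Only the two tags named in the text; any further entries would be assumptions.
TRIVIAL_TAGS = {"center", "hr"}


class PageParser(HTMLParser):
    """Collects the TITLE text and (href, anchor text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []      # list of (href, anchor_text)
        self.skipped = 0     # number of trivial tags filtered out
        self._in_title = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag in TRIVIAL_TAGS:
            self.skipped += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._anchor.append(data)


def parse_page(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


sample = ("<html><head><title>Dept</title></head><body>"
          "<center><a href='staff.html'>Staff</a></center><hr></body></html>")
page = parse_page(sample)
```

The collected links and anchor texts are the raw material for the labeling and hierarchy-construction steps that follow.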

After all web pages are fetched, each page needs to be labeled to identify its main content. However, labeling is not easy because of the loose grammar of HTML files. For example, the HTML “TITLE” tag is designed to provide information about an entire page, but our experimental results show that 30% of pages carry the same “TITLE” text as the home page. Therefore we consider both the information from incoming hyperlinks and the text embedded in the “TITLE” tag to identify pages.
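One simple way to combine the two sources of evidence is sketched below in Python. The rule used here is an assumption made for illustration and is not stated in the paper: keep the "TITLE" text unless it is empty or identical to the home page title, and otherwise fall back to the most frequent anchor text among incoming hyperlinks. The function name label_page is hypothetical.

```python
from collections import Counter


def label_page(title, incoming_anchor_texts, home_title):
    """Pick a label from the TITLE text and the incoming hyperlink texts."""
    title = (title or "").strip()
    anchors = [t.strip() for t in incoming_anchor_texts if t and t.strip()]
    # A title that differs from the home page title is taken as informative.
    if title and title != home_title.strip():
        return title
    # Otherwise use the most common incoming anchor text, if there is one.
    if anchors:
        return Counter(anchors).most_common(1)[0][0]
    return title


# The page reuses the home page title, so the anchor text wins.
reused = label_page("CUHK Dept", ["Staff", "Staff", "People"], "CUHK Dept")
# The page has its own title, so the title is kept.
own = label_page("Research", ["R"], "CUHK Dept")
```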


Once the useless pages and hyperlinks are pruned, we use a breadth-first algorithm to construct the organization of the website. We traverse the downloaded web pages on the assumption that website developers organize web pages in a hierarchical architecture. Starting from the URL of the website's home page, we establish the organization as follows: the home page of the website is the root node; the pages linked by outgoing hyperlinks of the home page are the first-generation (child) nodes; the pages linked by outgoing hyperlinks of the child nodes are the second-generation (grandchild) nodes; and so on. In this way, the organization of the website is represented as a hierarchical semantic structure.
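The breadth-first construction can be sketched as follows. This is a minimal Python illustration, assuming the pruned link structure is already available as a dictionary from each page to the pages it links to; the function name build_hierarchy and the page names are hypothetical. A page reached more than once keeps the parent through which it was first reached, which is one possible way to force a tree out of a link graph.

```python
from collections import deque


def build_hierarchy(home_url, outgoing_links):
    """Breadth-first traversal from the home page.

    outgoing_links maps a page to the list of pages it links to.
    Returns (parent, depth): the parent and the generation of every reachable page.
    """
    parent = {home_url: None}
    depth = {home_url: 0}
    queue = deque([home_url])
    while queue:
        url = queue.popleft()
        for child in outgoing_links.get(url, []):
            if child not in parent:   # first visit wins, so each page has one parent
                parent[child] = url
                depth[child] = depth[url] + 1
                queue.append(child)
    return parent, depth


links = {
    "home": ["people", "research"],
    "people": ["alice", "bob", "home"],   # the back-link to "home" is ignored
    "research": ["alice"],                # "alice" already belongs to "people"
}
parent, depth = build_hierarchy("home", links)
```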



Ontology Recall

Recognizing concepts and ontologies from historical patterns is easier than extracting them on the fly. The ontology recall process recalls ontologies and their patterns stored in the ontology-base. It consists of three tasks: (1) search the ontology-base for concepts and related patterns that match the labels and content of pages; (2) use the recognized concepts and the ontology-base to explore the pages that remain unidentified; (3) replace the recognized pages in the organization of the website with their concepts.


Intuitively, each concept of an ontology has some properties that can be used to identify its instances. Ideally, these properties are specific enough to distinguish its instances from those of other concepts. In practice, however, such properties are often not apparent enough to recognize the concept, and a concept may be represented by many related patterns. Therefore both ontologies and their related patterns are stored in the ontology-base. We use common regular expressions to represent patterns; for example, the concept “time” is represented by “[0-2]?[0-9]:[0-5]?[0-9][pPaA]?[.][mM][.]?”.
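Pattern-based recall (task 1 above) can be sketched with Python's re module. The in-memory dictionary standing in for the ontology-base and the function name recall_concepts are hypothetical; the regular expression is the one quoted in the text, used exactly as printed.

```python
import re

# Hypothetical in-memory ontology-base: concept name -> list of regex patterns.
ONTOLOGY_BASE = {
    "time": [r"[0-2]?[0-9]:[0-5]?[0-9][pPaA]?[.][mM][.]?"],
}


def recall_concepts(text, ontology_base=ONTOLOGY_BASE):
    """Return the concepts having at least one stored pattern that matches the text."""
    recognized = []
    for concept, patterns in ontology_base.items():
        if any(re.search(pattern, text) for pattern in patterns):
            recognized.append(concept)
    return recognized


hits = recall_concepts("Seminar starts at 2:30p.m. in room 5")
misses = recall_concepts("Opens at 14:00")
```

As printed, the pattern requires a period followed by "m" after the digits, so a string such as "2:30p.m." is recognized while a bare 24-hour time such as "14:00" is not. Covering such variants would need additional patterns for the same concept.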


Ontology Extraction

At the early stage of system operation, few concepts and patterns are available in the ontology-base, so pattern matching alone is not enough to recognize Web pages. According to Assumption 5, algorithms that cluster concept instances by their similarity are needed. It is easier to extract a concept from its instances once we have successfully identified and grouped those instances.


The similarity of concept instances is calculated by two approaches: syntactic similarity and semantic similarity. Syntactic similarity evaluation, also called the knowledge-poor approach, has the advantage of being simple and easy to implement in practice. For example, in the vector space model, similarity is calculated by comparing the frequencies of the terms that occur in two pages. Semantic similarity evaluation, also called the knowledge-rich approach, needs more domain-specific knowledge to explain why two pages are similar and how similar they are. This approach has advantages in precision and recall, if retrieval performance is measured with the same criteria used in information retrieval.
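The vector space comparison can be illustrated with cosine similarity over raw term frequencies. This is a minimal Python sketch; the tokenization rule (lowercased runs of letters and digits) and the choice of cosine as the comparison measure are assumptions, since the text only says that term frequencies are compared.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a page's text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


v1 = term_vector("professor of control engineering")
v2 = term_vector("professor of systems engineering")
v3 = term_vector("seminar announcement next friday")
similar = cosine_similarity(v1, v2)      # three of four terms shared
dissimilar = cosine_similarity(v1, v3)   # no terms shared
```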


We group pages based on their locations in the website organization and on their similarities as computed in the vector space model. Ambiguous patterns are then extracted from the grouped pages, which are assumed to be instances of the same concept. These patterns are used for further matching and grouping. Finally, more clear-cut patterns are generated from the clustered concept instances to represent their concepts, and are stored in the ontology-base after possible refinement. For pages that have no similar pages, concepts are extracted only from their labels.
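The grouping step can be sketched by combining Assumption 4 (instances of a concept are siblings) with Assumption 5 (instances of a concept are similar). The Python sketch below is an illustration only: the greedy strategy of comparing each page with the first member of a group, the threshold of 0.5, and all names are assumptions, not the heuristic used in the prototype.

```python
import math
import re
from collections import Counter, defaultdict


def _vector(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_siblings(pages, parent, threshold=0.5):
    """Group sibling pages whose text is similar.

    pages maps a page to its text; parent maps a page to its parent page.
    Returns a list of clusters, each a list of pages.
    """
    by_parent = defaultdict(list)
    for url in pages:
        by_parent[parent.get(url)].append(url)   # only siblings are compared

    clusters = []
    for siblings in by_parent.values():
        groups = []
        for url in siblings:
            vec = _vector(pages[url])
            for group in groups:
                # Join the first group whose first member is similar enough.
                if _cosine(vec, _vector(pages[group[0]])) >= threshold:
                    group.append(url)
                    break
            else:
                groups.append([url])   # no similar sibling: a cluster of its own
        clusters.extend(groups)
    return clusters


pages = {
    "alice": "professor of control engineering",
    "bob": "professor of systems engineering",
    "news": "seminar announcement next friday",
}
parent = {"alice": "people", "bob": "people", "news": "home"}
clusters = cluster_siblings(pages, parent)
```

In this example the two staff pages fall into one cluster, a candidate instance set for a single concept, while the news page stays alone and would be labeled from its own label only.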


At the beginning, the generated patterns are very specific, so they recognize few pages. However, as more patterns are generated and stored in the ontology base, the coverage increases.


Ontology Amendment

After an ontology is extracted and generated, it may need further refinement and revision before final output. This phase is called ontology amendment, and it requires both general knowledge and domain knowledge. Ontology amendment consists of three tasks: content amendment, structure amendment, and pattern amendment. The first redefines concepts and relations, while the second reconstructs the hierarchical structure of the ontology. Pattern amendment, which refines the patterns extracted in the ontology extraction stage, is optional. The amended ontology should satisfy accepted criteria for sharing, reusing, and disseminating ontologies. The revision may be repeated several times to assure quality. Moreover, amendment is complicated because it requires higher intelligence, such as guidance from domain experts or artificial intelligence techniques. We provide a friendly graphical user interface for manual ontology revision.


Ontology Maintenance

Ontology maintenance stores the extracted ontology back into the ontology-base. In general, it is a process of selecting which parts of the ontology to store and in what form. If the ontology describes a new concept, the entire new ontology is stored in the ontology-base as a new record. If the ontology is related to an existing ontology, the existing one is revised. Finally, if the ontology already exists in the ontology-base, it is discarded.


Moreover, ontology maintenance performs another important function: it standardizes the ontology using ontology representation languages. This standardization supports ontology sharing and reuse. Applications should be able to import other ontology resources, integrate them into the ontology-base, and publish the local ontology. However, the lack of a well-accepted ontology representation language hampers broader distribution of ontologies. Our system supports existing standards such as RDF and DAML+OIL.



IMPLEMENTATION


We implemented a prototype system in Java using J2SE Development Kit 5.0, with the Xerces2 Java Parser 2.5.0 plug-in for forming and parsing well-formed Web documents. MySQL database server 4.1 is used to build the ontology database with a local repository. We tested the prototype system on the website of the Department of Automation and Computer-Aided Engineering, the Chinese University of Hong Kong (http://www.acae.cuhk.edu.hk/en/). The stored ontology of “Department” is shown in Figure 2.



Figure 2. Example: Ontology extracted from ACAE department


CONCLUSION


In this paper we propose a new framework for semi-automatic ontology extraction. Being a self-learning process, this model can progressively enhance its extraction ability even with little or no initial pre-defined knowledge. Our preliminary experimental results are acceptable and are expected to become more convincing as the ontology-base grows. In addition, the model can be integrated with other open intelligent technologies, which is a direction for our future research.


REFERENCES


[1] Berners-Lee, T., Hendler, J., & Lassila, O. “The Semantic Web”, Scientific American, 2001, 284(5): 34-43.

[2] Noy, N., & Musen, M. “PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment”, In Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence, 2000, Austin, USA.

[3] Upadhyaya, S.R., & Kumar, P.S. “ERONTO: A Tool for Extracting Ontologies from Extended E/R Diagrams”, In Proceedings of the 2005 ACM Symposium on Applied Computing, 2005, Santa Fe, USA.

[4] Davulcu, H., Vadrevu, S., & Nagarajan, S. “OntoMiner: Bootstrapping Ontologies from Overlapping Domain Specific Web Sites”, In Proceedings of the 13th International World Wide Web Conference, 2004, New York, USA.

[5] Maddi, G.R., Velvadapu, C.S., Srivastava, S., & Lamadrid, J.G. “Ontology Extraction from Text Documents by Singular Value Decomposition”, In Proceedings of ADMI 2001, 2001, Hampton, USA.