A Formal Ontology Discovery from Web Documents

Norihiro Ogata
Faculty of Language and Culture, Osaka University
ogata@lang.osaka-u.ac.jp
keywords: Web, formal ontology, XML, logic programming, natural language
processing
1 Introduction
The huge number of documents distributed over the WWW can be regarded as
easily accessible resources of domain-specific knowledge. However, users can
also be overwhelmed by the sheer quantity, the uneven quality, and the
unfamiliarity of the contents of these documents, which arise from the easy
accessibility of specific domains and the unstructuredness of the WWW. One
possible solution to this problem is to specify and annotate the text
structure and the semantic content, i.e., the (formal) ontology (in the sense
used in Artificial Intelligence or knowledge engineering [5], i.e., a formal
specification of a conceptualization), of each document by giving top-down
definitions of their specifications using HTML, XML, and RDF (see [9]), as in
OntoBroker [2], SHOE (Simple HTML Ontology Extensions [7]), XOL (An XML-Based
Ontology Exchange Language [6]), OIL (The Ontology Inference Layer [3]), and
so on. However, these approaches have two drawbacks: the overall design,
definition, and consensus formation for each domain, which may be created and
revised frequently, are costly; and they are inflexible, in that each user's
preferences cannot be reflected in the definitions. Another possible solution
is to extract and discover the ontological information of each WWW document
by giving only loose and general forms of ontology and text structure. This
approach not only handles the problem but also makes possible the automatic
acquisition of domain-specific knowledge. This paper proposes a formal basis
and a system architecture, based on logic, Web technology, and Natural
Language Processing, for ontology discovery from WWW documents as an
efficient utilization of Web resources in these senses.
2 A Formal Ontology and FOML (Formal Ontology Markup Language)
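A formal ontology $\mathcal{O}$ is a triple $\langle O, C_O, \vdash_O \rangle$,
where $O$ is a classification, $C_O$ a concept lattice generated by $O$, and
$\vdash_O$ a set of Horn-constraints supported by $O$. A classification $O$ is
a triple $\langle Tok(O), Typ(O), \models_O \rangle$, where $Tok(O)$ is a set
of tokens, $Typ(O)$ a set of types, and
$\models_O : Tok(O) \times Typ(O) \to \{1, 0\}$. A concept lattice [4] $C_O$
generated by $O$ is a tuple $\langle C, \le_C \rangle$, where $C$ is a set of
concepts generated by $O$ and $\le_C$ an order between concepts, i.e.,
$c_1 \le_C c_2$ iff $tok_{c_1} \subseteq tok_{c_2}$ and
$typ_{c_2} \subseteq typ_{c_1}$, where $tok_c$ and $typ_c$ denote the token
set and the type set of a concept $c$. A concept
$c \in C_O \subseteq pow(Tok(O)) \times pow(Typ(O))$ over $O$ is a pair
$(c_{tok}, c_{typ})$ such that $tok(c_{typ}) = c_{tok}$ and
$typ(c_{tok}) = c_{typ}$, where
$tok(\alpha) = \{x \in Tok(O) \mid x \models_O \alpha\}$ and
$typ(x) = \{\alpha \in Typ(O) \mid x \models_O \alpha\}$, extended to sets by
intersection as in formal concept analysis [4]. A Horn-constraint
$h \in \vdash_O$ supported by a classification $O$ is a pair
$\langle \Gamma, \varphi \rangle$ of a set of types $\Gamma$ and a type
$\varphi$ such that, for every token $a \in Tok(O)$, if $a \models_O \alpha$
for all $\alpha \in \Gamma$, then $a \models_O \varphi$.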
The concept lattice generated by a classification classifies the information
on the hierarchical dependency of the concepts in the classification, whereas
the local logic generated by the classification classifies the information on
the specification and inference of the concepts.
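The definitions above can be made concrete with a small sketch. The Python
below is only an illustration under stated assumptions: the classification is
a hypothetical toy example (the token aspirin and the type drug are not taken
from the paper), the relation between tokens and types is stored as a set of
(token, type) pairs, and tok and typ are applied to sets by intersection. It
enumerates the concepts of the classification and checks the order between
them.

```python
from itertools import combinations

# Hypothetical toy classification: the pairs (token, type) for which the
# token is of the type. The data are illustrative, not from the paper.
HOLDS = {
    ("5-fluorouracil", "fluoropyrimidine"),
    ("5-fluorouracil", "drug"),
    ("5-fluorodeoxyuridine", "fluoropyrimidine"),
    ("5-fluorodeoxyuridine", "drug"),
    ("aspirin", "drug"),
}
TOKENS = frozenset(t for t, _ in HOLDS)
TYPES = frozenset(a for _, a in HOLDS)


def tok(types):
    """Tokens that are of every type in `types` (all tokens if it is empty)."""
    return frozenset(x for x in TOKENS if all((x, a) in HOLDS for a in types))


def typ(tokens):
    """Types shared by every token in `tokens` (all types if it is empty)."""
    return frozenset(a for a in TYPES if all((x, a) in HOLDS for x in tokens))


def concepts():
    """All pairs (c_tok, c_typ) with tok(c_typ) == c_tok and typ(c_tok) == c_typ."""
    found = set()
    for size in range(len(TYPES) + 1):
        for subset in combinations(sorted(TYPES), size):
            c_tok = tok(subset)
            found.add((c_tok, typ(c_tok)))
    return found


def leq(c1, c2):
    """c1 <= c2 iff the tokens of c1 are among those of c2 and the types of c2 among those of c1."""
    return c1[0] <= c2[0] and c2[1] <= c1[1]


if __name__ == "__main__":
    for c_tok, c_typ in sorted(concepts(), key=lambda c: len(c[0])):
        print(sorted(c_tok), sorted(c_typ))
```

Enumerating all subsets of types is exponential and is used here only because
it mirrors the definition directly; it is not meant as an efficient lattice
construction.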
To connect the formal ontologies defined above with Web documents, this
section introduces an XML-based markup language, FOML (Formal Ontology Markup
Language), which is sound with respect to those definitions; its DTD is given
in Appendix A. FOML is much simpler than other markup languages such as SHOE,
XOL, and OIL, and reflects only the information of a formal ontology, i.e.,
the triple of classification, constraints, and concept lattice. FOML can be
regarded as an XML layer mediating between an XML for natural language
processing and other ontological XMLs.
3 Discovery of Classifications from XML(SNLP)
Suppose that we have an HTML document with some XML tags produced by shallow
natural language processing such as part-of-speech tagging, partial parsing,
and noun phrase chunking, and call this an XML(SNLP) document. The XML(SNLP)
document is supposed to have tags such as <np> (noun phrase), <det>
(determiner), <cn> (common noun), <prep> (preposition), <pn> (proper name),
<pp> (past participle), <apposition/>, <comma>, <paren> (parentheses), and so
on. We can find ontological information in the XML(SNLP) document. For
example, given a fragment of a biochemical document, “fluoropyrimidines such
as 5-fluorouracil and 5-fluorodeoxyuridine,” the fragment can be processed
into the following XML(SNLP) document:
<np><np nform="fluoropyrimidine">fluoropyrimidines</np>
<prep>such as</prep> <np>5-fluorouracil</np>
<conj>and</conj> <np>5-fluorodeoxyuridine</np></np>
and furthermore processed as the following FOML+XML(SNLP) document:
<np><ontology>
<np nform="fluoropyrimidine">
<type>fluoropyrimidine</type>
fluoropyrimidines</np> <prep>such as</prep>
<np><token ofType="fluoropyrimidine">5-fluorouracil</token></np>
<conj>and</conj>
<np><token ofType="fluoropyrimidine">5-fluorodeoxyuridine</token></np>
</ontology></np>
and, as a result, we can find the following partial formal ontology:
<ontology>
<type>fluoropyrimidine</type>
<token ofType="fluoropyrimidine">5-fluorouracil</token>
<token ofType="fluoropyrimidine">5-fluorodeoxyuridine</token>
</ontology>
Such rule-based discovery of ontological information can be formalized as DOM
tree manipulations; the rules are listed in Appendix B.
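As an illustration of such a DOM tree manipulation, the following Python
sketch applies one rule, the "such as" pattern of the example above, with the
standard xml.etree.ElementTree module. It is a minimal sketch under
assumptions, not the system's implementation: the function names
annotate_such_as and extract_ontology and the set TRIGGERS are hypothetical,
and only the tag layout of the example (np, prep, conj) is handled.

```python
import xml.etree.ElementTree as ET

# Assumption: a subset of the cue phrases that trigger the rule.
TRIGGERS = {"such as", "like"}


def annotate_such_as(outer):
    """Apply a 'TYPE such as TOKEN ... and TOKEN' rule to one <np> element.

    Rewrites `outer` in place and returns True when the rule matches.
    """
    children = list(outer)
    if len(children) < 3 or children[0].tag != "np" or children[1].tag != "prep":
        return False
    if (children[1].text or "").strip() not in TRIGGERS:
        return False
    head = children[0]
    members = [c for c in children[2:] if c.tag == "np"]
    if not members:
        return False
    type_name = head.get("nform") or (head.text or "").strip()

    # Added node: <type> becomes the first child of the head noun phrase.
    type_el = ET.Element("type")
    type_el.text = type_name
    type_el.tail = head.text  # keep the surface form after <type>
    head.text = None
    head.insert(0, type_el)

    # Added nodes: the text of each member is wrapped in <token ofType="...">.
    for member in members:
        token_el = ET.Element("token", {"ofType": type_name})
        token_el.text, member.text = member.text, None
        member.insert(0, token_el)

    # Added node: <ontology> goes between the outer <np> and its children.
    ontology = ET.Element("ontology")
    for child in children:
        outer.remove(child)
        ontology.append(child)
    outer.append(ontology)
    return True


def extract_ontology(annotated):
    """Collect the <type> and <token> nodes into a standalone <ontology>."""
    result = ET.Element("ontology")
    for el in annotated.iter():
        if el.tag in ("type", "token"):
            copy = ET.SubElement(result, el.tag, dict(el.attrib))
            copy.text = el.text
    return result


SOURCE = (
    '<np><np nform="fluoropyrimidine">fluoropyrimidines</np> '
    "<prep>such as</prep> <np>5-fluorouracil</np> "
    "<conj>and</conj> <np>5-fluorodeoxyuridine</np></np>"
)

if __name__ == "__main__":
    root = ET.fromstring(SOURCE)
    if annotate_such_as(root):
        print(ET.tostring(extract_ontology(root), encoding="unicode"))
```

The two steps correspond to the two stages of the example: annotate_such_as
produces the FOML+XML(SNLP) form, and extract_ontology keeps only the FOML
nodes as a partial formal ontology.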
4 Deductive Discovery of Constraints
An XML(SNLP) document with FOML tags other than <constraint> can be mapped
into (Horn) clauses of logic programming, as follows:
F(<ontology><type>X</type>;Y+</ontology>)=("X:type.",F(Y)+)
F(<token ofType="Y">X</token>)="X:Y."
where X:Y means ‘X is of type Y’.
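The mapping F can be sketched in Python as follows. This is an illustrative
reading of the two equations above, not the author's implementation: the
function names foml_to_clauses and quote are hypothetical, and quoting every
name as a Prolog atom is an added assumption, made because a name such as
5-fluorouracil would otherwise be read as an arithmetic expression.

```python
import xml.etree.ElementTree as ET


def quote(name):
    """Quote a name as a Prolog atom (assumption: not part of the mapping F itself)."""
    return "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'"


def foml_to_clauses(ontology):
    """Map an <ontology> element to clause strings, following the mapping F."""
    clauses = []
    for el in ontology.iter():
        name = (el.text or "").strip()
        of_type = el.get("ofType")
        if el.tag == "type":
            # <type>X</type> becomes the clause "X:type."
            clauses.append(quote(name) + ":type.")
        elif el.tag == "token" and of_type:
            # <token ofType="Y">X</token> becomes the clause "X:Y."
            clauses.append(quote(name) + ":" + quote(of_type) + ".")
    return clauses


FOML = (
    "<ontology>"
    "<type>fluoropyrimidine</type>"
    '<token ofType="fluoropyrimidine">5-fluorouracil</token>'
    '<token ofType="fluoropyrimidine">5-fluorodeoxyuridine</token>'
    "</ontology>"
)

if __name__ == "__main__":
    for clause in foml_to_clauses(ET.fromstring(FOML)):
        print(clause)
```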
A method for discovering constraints from these clauses can be given as the
following logic program, in a way basically similar to inductive logic
programming [8], where X-->Y denotes a constraint.
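% Record every subset of the known types as a candidate constraint body.
body_make:- setof(X,type(X),C),powerset(C,B),asserta(body(B)),fail.
body_make:- true.
% Types --> Type holds if Type is not in the body and no token of a
% body type fails to be of type Type.
Types --> Type:- body(Types),type(Type),
\+ member(Type,Types),
\+ (member(X,Types),Token:X,\+ Token:Type).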
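The same search can be sketched in Python. Note that this sketch follows the
definition of a Horn-constraint in Section 2 (every token that is of all the
body types is also of the head type), whereas the logic program above tests
the body types one at a time; the two differ for bodies with more than one
type. The function name discover_constraints and the data are hypothetical,
and bodies that no token satisfies are skipped by assumption so that vacuous
constraints are not reported.

```python
from itertools import combinations

# Hypothetical facts in the spirit of the clauses "Token:Type.".
FACTS = {
    "5-fluorouracil": {"fluoropyrimidine", "antimetabolite"},
    "5-fluorodeoxyuridine": {"fluoropyrimidine", "antimetabolite"},
    "methotrexate": {"antimetabolite"},
}


def discover_constraints(facts):
    """Return (body, head) pairs: every token of all body types is of type head."""
    types = sorted(set().union(*facts.values()))
    found = []
    for size in range(1, len(types) + 1):
        for body in combinations(types, size):
            covered = [t for t, ts in facts.items() if set(body) <= ts]
            if not covered:
                continue  # assumption: ignore bodies that no token satisfies
            for head in types:
                if head not in body and all(head in facts[t] for t in covered):
                    found.append((body, head))
    return found


if __name__ == "__main__":
    for body, head in discover_constraints(FACTS):
        print(", ".join(body), "-->", head)
```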
5 Deductive Discovery of Concept Lattices
Concept lattices can also be discovered by logic programming, as follows:
(i) the generation of concepts:
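% Candidate concepts: the tokens shared by all types of a body.
make_concept:- body(Types),elements(Types,E),
intersection_all(E,Tokens),
asserta(concept0(Tokens,Types)),fail.
make_concept:-true.
% A candidate is a concept if no larger type set has the same tokens.
concept(Tokens,Types):- concept0(Tokens,Types),
\+ (concept0(Tokens,Types1),\+ Types1 == Types,
sublist(Types,Types1)).
% elements0: tokens of one type; elements: token lists of a body's types.
elements0(T,Elements):- setof(X,(X:T),Elements).
elements(Types,Elements):- body(Types),
setof(X,(T^(member(T,Types),elements0(T,X))),Elements).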
(ii) inheritance from subtypes to their supertypes:
X:T:- subtype(T1,T),X:T1.
(iii) the nonmonotonic subtyping:
subtype(X,Y):- must_subtype(X,Y),\+must_subtype(Y,X).
subtype(X,Y):- evidentially_subtype(X,Y),\+must_subtype(Y,X),
\+evidentially_subtype(Y,X).
subtype(X,Y):- may_subtype(X,Y),\+must_subtype(Y,X),
\+evidentially_subtype(Y,X).
evidentially_subtype(X,Y):- X:type,Y:type,most(X,Y),
\+most(Y,X),\+must_subtype(Y,X).
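% More than 80% of the tokens of type X are also tokens of type Y.
% X and Y are single types here, so elements0/2 is used.
most(X,Y):- elements0(X,E1),elements0(Y,E2),
intersection(E1,E2,I),length(I,N),length(E1,N1),
N > 0.8 * N1.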
(iv) the generation of order:
subconcept(c(Tok1,Typ1),c(Tok2,Typ2)):-
concept(Tok1,Typ1),concept(Tok2,Typ2),
sublist(Tok1,Tok2),sublist(Typ2,Typ1).
where the literal may_subtype(Y,X) is translated from the FOML tag
<subtype ofType="Y">X</subtype>; the prefix must_ expresses a stable
proposition and is used for background knowledge and the user's preferences;
evidentially_ is used for a proposition expected from the data; and may_
expresses a non-stable proposition and is used for a proposition discovered
from XML documents by the rules in Appendix B and the mapping in the previous
section. The nonmonotonicity of (iii) is implemented by negation as failure.
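The priority among the three sources in (iii) can be sketched in Python as
follows. This is a minimal reading of the rules above, with must facts
overriding evidence from the data and evidence overriding may facts; the
names make_subtype and ELEMENTS and the data are hypothetical, and the
threshold 0.8 is the one used in most/2.

```python
def most(x, y, elements, threshold=0.8):
    """True when more than `threshold` of the tokens of x are also tokens of y."""
    e1, e2 = elements.get(x, set()), elements.get(y, set())
    return len(e1 & e2) > threshold * len(e1)


def make_subtype(must, may, elements):
    """Build subtype(x, y) from prioritized sources: must > evidential > may."""

    def evidentially(x, y):
        return (x in elements and y in elements
                and most(x, y, elements) and not most(y, x, elements)
                and (y, x) not in must)

    def subtype(x, y):
        if (y, x) in must:
            return False  # a stable fact in the opposite direction blocks every rule
        if (x, y) in must:
            return True
        if evidentially(y, x):
            return False  # opposite evidence blocks the two weaker rules
        return evidentially(x, y) or (x, y) in may

    return subtype


# Hypothetical data: the token set of each type, as given by clauses Token:Type.
ELEMENTS = {
    "fluoropyrimidine": {"5-fluorouracil", "5-fluorodeoxyuridine"},
    "antimetabolite": {"5-fluorouracil", "5-fluorodeoxyuridine",
                       "methotrexate", "cytarabine", "gemcitabine"},
}

if __name__ == "__main__":
    subtype = make_subtype(must=set(), may=set(), elements=ELEMENTS)
    print(subtype("fluoropyrimidine", "antimetabolite"))
    # Adding a stable fact in the opposite direction withdraws that conclusion.
    subtype = make_subtype(must={("antimetabolite", "fluoropyrimidine")},
                           may=set(), elements=ELEMENTS)
    print(subtype("fluoropyrimidine", "antimetabolite"))
```

The second call shows the nonmonotonic behaviour: a conclusion drawn from the
data is retracted once a must fact contradicting it is added.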
In constructing ontological structures, top-level ontologies usually cannot
be extracted from domain-specific texts and require extra information such as
background knowledge. To construct well-formed ontological structures, we
must therefore combine in top-level ontologies, which are not written in
domain-specific texts. must_ formulas are used to specify such stable
information.
6 Conclusion
We have presented a theory of formal ontologies based on classifications,
their local logics, and their concept lattices; its XML implementation, FOML;
and its discovery method, implemented in logic programming, which has aspects
of both nonmonotonic reasoning and inductive learning. This system can be
refined into a practical knowledge acquisition and information retrieval
system on the WWW using the Java language. In fact, we have implemented a
Concept Lattice Viewer and a Constraint Viewer in Java, and FOML serves as
their shared data format, which can be constructed from Web documents.
References
1. Jon Barwise and Jerry Seligman. Information Flow: The Logic of Distributed
Systems. Cambridge University Press, Cambridge, 1997.
2. Stefan Decker, Michael Erdmann, Dieter Fensel, and Rudi Studer. Ontobroker:
Ontology based access to distributed and semi-structured information. In
R. Meersman et al., editors, Semantic Issues in Multimedia Systems.
Proceedings of DS-8, pages 351–369. Kluwer Academic Publisher, Boston, 1999.
3. D. Fensel, I. Horrocks, F. Van Harmelen, S. Decker, M. Erdmann, and
M. Klein. OIL in a nutshell. In R. Dieng et al., editors, Knowledge
Acquisition, Modeling, and Management, Proceedings of the European Knowledge
Acquisition Conference (EKAW-2000), pages 75–102. Springer-Verlag, Berlin,
2000.
4. Bernhard Ganter and Rudolf Wille. Formal Concept Analysis. Springer,
Berlin, 1999.
5. Tom R. Gruber. A translation approach to portable ontology specifications.
Knowledge Acquisition, 5(2):199–220, 1993.
6. Peter D. Karp, Vinay K. Chaudhri, and Jerome Thomere. XOL: An XML-based
ontology exchange language. http://ecocyc.panbio.com/xol/xol.html, 1999.
7. Sean Luke and Jeff Heflin. SHOE 1.01: Proposed specification.
http://www.cs.umd.edu/projects/plus/SHOE/spec.html, 2000.
8. Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic
Programming. Springer-Verlag, Berlin, 1997.
9. F. van Harmelen and D. Fensel. Practical knowledge representation for the
web.
Appendix A: The DTD of FOML
<!DOCTYPE foml [
<!ELEMENT ontology (domain*,type+,token*,subtype*,concept*,constraint*) >
<!ELEMENT domain (#PCDATA) > <!ELEMENT type (#PCDATA) >
<!ELEMENT token (#PCDATA) > <!ELEMENT subtype (#PCDATA) >
<!ELEMENT constraint (body,type) > <!ELEMENT body (type*) >
<!ELEMENT concept (token*,type*) >
<!ATTLIST ontology attr (partial|total) "partial">
<!ATTLIST domain attr (simple|complex|undef) "simple">
<!ATTLIST type alias CDATA #IMPLIED
morph (suffix|infix|affix|non|undef) "non">
<!ATTLIST subtype attr (direct|nondirect|undef) "undef"
ofType CDATA #REQUIRED>
<!ATTLIST token attr (direct|nondirect|undef) "undef"
ofType CDATA #REQUIRED>
]>
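As a usage illustration of the content model above, the following Python
sketch serializes a small ontology as FOML, emitting the children of
<ontology> in the order the DTD requires (types, then tokens, then
constraints). It is only a sketch: the function name build_foml is
hypothetical, the domain, subtype, and concept elements are omitted for
brevity, and no validation against the DTD is performed.

```python
import xml.etree.ElementTree as ET


def build_foml(types, tokens, constraints=()):
    """Serialize an ontology as FOML, emitting children in the DTD's order.

    types: iterable of type names (the DTD requires at least one)
    tokens: iterable of (token, type) pairs
    constraints: iterable of (body_types, head_type) pairs
    """
    root = ET.Element("ontology", {"attr": "partial"})
    for name in types:
        ET.SubElement(root, "type").text = name
    for token, of_type in tokens:
        ET.SubElement(root, "token", {"ofType": of_type}).text = token
    for body_types, head in constraints:
        # <constraint> holds a <body> of types followed by the head <type>.
        constraint = ET.SubElement(root, "constraint")
        body = ET.SubElement(constraint, "body")
        for name in body_types:
            ET.SubElement(body, "type").text = name
        ET.SubElement(constraint, "type").text = head
    return root


if __name__ == "__main__":
    doc = build_foml(
        types=["fluoropyrimidine"],
        tokens=[("5-fluorouracil", "fluoropyrimidine"),
                ("5-fluorodeoxyuridine", "fluoropyrimidine")],
    )
    print(ET.tostring(doc, encoding="unicode"))
```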
Appendix B: Transformation from XML(SNLP) to FOML+XML(SNLP)

Typing Context Rules

INPUT:  <np><det>;<cn>;<pp>X;<pn>;
OUTPUT: <np>[<ontology>]<det>;<cn>[<type>];<pp>X;<pn>[<token>];
        where X ∈ {termed, called, designated, known as}

INPUT:  <np><cn>;<apposition/>;<pn>;
OUTPUT: <np>[<ontology>]<cn>[<type>];<apposition/>;<pn>[<token>];

INPUT:  <np><det>;<cn>;<comma>;<p>X;<pn>;
OUTPUT: <np>[<ontology>]<det>;<cn>[<type>];<comma>;<p>X;<pn>[<token>];
        where X ∈ {to be known as, to be called}

INPUT:  <np><pn>;<comma>;<det>;<cn>
OUTPUT: <np>[<ontology>]<pn>[<token>];<comma>;<det>;<cn>[<type>];

INPUT:  <np><pn>;<hyphen>;<det>;<cn>
OUTPUT: <np>[<ontology>]<pn>[<token>];<hyphen>;<det>;<cn>[<type>];

INPUT:  <np><pn>;<paren><det>;<cn>
OUTPUT: <np>[<ontology>]<pn>[<token>];<paren><det>;<cn>[<type>];

INPUT:  <np><pn>+;<conj>;<adj>other;<cn>
OUTPUT: <np>[<ontology>]<pn>[<token>]+;<conj>;<adj>;<cn>[<type>];

INPUT:  <np><det>;<cn>;<p>X;(<pn>;<comma>)+;<conj>;<pn>;
OUTPUT: <np>[<ontology>]<det>;<cn>[<type>];<p>X;(<pn>[<token>];<comma>)+;<conj>;<pn>[<token>];
        where X ∈ {like, such as, except, other than}

Subtyping Context Rules

INPUT:  <np><det>;<cn>;<pp>X;<cn>;
OUTPUT: <np>[<ontology>]<det>;<cn>[<type>];<pp>X;<cn>[<subtype>];
        where X ∈ {termed, called, designated, known as}

INPUT:  <np><det>;<cn>;<comma>;<p>X;<np>;
OUTPUT: <np>[<ontology>]<det>;<cn>[<type>];<comma>;<p>X;<np>[<subtype>];
        where X ∈ {to be known as, to be called}

INPUT:  <np><np>+;<conj>;<adj>other;<cn>;
OUTPUT: <np>[<ontology>]<np>[<subtype>]+;<conj>;<adj>;<cn>[<type>];

INPUT:  <np><det>;<cn>;<p>X;(<np>;<comma>)+;<conj>;<np>;
OUTPUT: <np>[<ontology>]<det>;<cn>[<type>];<p>X;(<np>[<subtype>];<comma>)+;<conj>;<np>[<subtype>];
        where X ∈ {like, such as, except, other than}

where nodes in square brackets (gray boxes in the original typesetting) are
added DOM nodes, <X><Y> means that DOM node <X> subordinates DOM node <Y>,
<X>;<Y> means that DOM nodes <X> and <Y> are subordinated to the same DOM node
and <X> precedes <Y>, and X+ means n (> 0) occurrences of X.