Creating Ontologies from Web documents

David SÁNCHEZ , Antonio MORENO
Department of Computer Science and Mathematics
Universitat Rovira i Virgili (URV)
Avda. Països Catalans, 26. 43007 Tarragona.
{dsanchez, amoreno}@etse.urv.es

Abstract. In this paper we present a methodology to build an ontology automatically, extracting information from the World Wide Web starting from an initial keyword. This ontology represents a taxonomy of classes and gives the user a general view of the kinds of concepts, and the most significant sites, that can be found on the Web for the specified keyword's domain. The system makes intensive use of a publicly available search engine, extracts concepts (based on their relation to the initial one and on statistical data about their appearance) and represents the result in a standard way.

Keywords. Ontology building, information extraction, World Wide Web, OWL.

Introduction
In recent years, the growth of the Information Society has been very significant, providing a way for fast data access and information exchange all around the world. However, classical human-readable data resources (like electronic books or web sites) present serious problems for achieving machine interoperability. This is why a structured way of representing information is required, and ontologies [6] (machine-processable representations that contain the semantic information of a domain) can be very useful. They allow information to be transferred and processed effectively in a distributed environment. Moreover, many authors [3, 4, 12, 13] are using an ontology's semantic data to improve the search for information itself on unstructured documents (which represent almost 100% of the available resources). Therefore, the building of an ontology that represents a specified domain is a critical process that has to be performed carefully. However, manual ontology building is a difficult task that requires extensive knowledge of the domain (i.e. an expert) and, in many cases, the result can still be incomplete or inaccurate.
In order to ease the ontology construction process, automatic methodologies can be used [8]. The idea is to extract structured knowledge, like concepts and semantic relations, from unstructured resources that cover the main topics of the domain. Concretely, the solution that we propose in this paper is to use the information available on the World Wide Web to create the ontology. This method has the advantage that the ontology is built automatically and reflects the actual state of the art of a domain (based on the web pages that cover a specific topic).
In this paper, we present a methodology to extract information from the Web in order to build an ontology for a given domain. Moreover, during the building process, the most representative web sites for each ontology concept are retrieved. A prototype has been implemented to test this method.
The rest of the paper is organised as follows: section 1 describes the methodology developed to build the ontology and the standard language used to represent it. Section 2 contains information about the implemented prototype and some tests. Section 3 presents some conclusions and proposes future lines of work.


1. Ontology building methodology
In this section we describe the methodology used to discover and select representative concepts and web sites for a domain and to construct the final ontology.
The algorithm is based on analysing a large number of web sites in order to find important concepts for a domain by studying the initial keyword's neighbourhood (we assume that words appearing near the specified keyword are closely related to it). The candidate concepts are processed in order to select the most adequate ones by performing a statistical analysis. The selected classes are finally incorporated into the ontology. For each one, the main web sites from which it was extracted are stored, and the process is repeated recursively in order to find new terms and build a hierarchy of concepts.
The resulting taxonomy of terms can be the basis for finding more complex ontological relations between concepts [8], or it can be used to guide a search for information or a classification process over a document corpus [3, 4, 12, 13].
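To make the loop concrete, the following Java sketch shows the recursion; all names are ours (not the prototype's actual code), and the three abstract methods stand for the search, parsing and statistical-selection phases of figure 1.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the recursive building loop (illustrative names only).
public abstract class TaxonomyBuilder {

    // A taxonomy node: a class name plus the URLs it was extracted from.
    public static class OntClass {
        public final String name;
        public final List<String> sourceUrls = new ArrayList<String>();
        public final List<OntClass> subclasses = new ArrayList<OntClass>();
        public OntClass(String name) { this.name = name; }
    }

    // The three phases of figure 1, left abstract in this sketch.
    protected abstract List<String> searchWeb(String keyword);
    protected abstract List<String> extractCandidates(String keyword, List<String> pages);
    protected abstract List<String> selectClasses(List<String> candidates);

    // Expands "keyword" recursively until maxDepth or no class is selected.
    public OntClass build(String keyword, int depth, int maxDepth) {
        OntClass node = new OntClass(keyword);
        if (depth >= maxDepth) {
            return node;
        }
        List<String> pages = searchWeb(keyword);
        for (String cls : selectClasses(extractCandidates(keyword, pages))) {
            // e.g. "optic" + "biosensor" -> a new search for "optic biosensor"
            node.subclasses.add(build(cls + " " + keyword, depth + 1, maxDepth));
        }
        return node;
    }
}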

[Figure: the input parameters ("keyword", search constraints, selection constraints) feed the search engine; the returned web documents are parsed into candidate concepts with their attributes; the class selection step checks those attributes against the selection constraints; each selected class, with its list of URLs, is added to the OWL ontology and expanded recursively through a new "class+keyword" search.]
Figure 1. Ontology building algorithm
1.1 Building algorithm
In more detail, the algorithm's sequence, shown in figure 1, has the following phases:

• It starts with a keyword that has to be representative enough of a specific domain (e.g. biosensor) and a set of parameters that constrain the search and the concept selection (described below).
• Then, it uses a publicly available search engine (Google) in order to obtain the most representative web sites that contain that keyword. The search constraints specified are the following:
− Maximum number of pages returned by the search engine: this parameter constrains the size of the search. For a general keyword with a large number of results (10000 or more), analysing between 5% and 10% of them (beginning with the most popular ones) produces quite representative results.


− Filter of similar sites: for a general keyword (e.g. car), enabling this filter hides the web sites that belong to the same web domain, obtaining a set of results that covers a wider spectrum. For a specific word (e.g. biosensor) with a smaller number of results, disabling this filter returns the whole set of pages (even subpages of a domain), allowing wider searches.
• For each web site returned, an exhaustive analysis is performed in order to obtain useful information from it. Concretely:
− Different non-HTML document formats (pdf, ps, doc, ppt, etc.) are processed by obtaining the HTML version from Google's cache.
− For each "Not found" or "Unable to show" page, the parser tries to obtain the web site's data from Google's cache.
− Redirections are followed until the final site is reached.
− Frame-based sites are also considered, obtaining the complete set of texts by analysing each web subframe.
• The parser returns the useful text from each site (rejecting tags and visual information) and tries to find the initial keyword (e.g. biosensor). For each match, it analyses the word immediately before it (e.g. optical biosensor). If this word fulfils a set of prerequisites, it is selected as a candidate concept. Concretely, the parser verifies the following:
− Words must have a minimum size and must be represented with a standard ASCII character set (not Japanese, for example).
− They must be "relevant words": prepositions, determiners and very common words ("stop words") are rejected.
− Each word is analysed from its morphological root (e.g. fluorescence and fluorescent are considered as the same word and their attribute values - described below - are merged: for example, the numbers of appearances of both words are added). A stemming algorithm for the English language is used to conflate plurals, verbal forms, etc.
• For each selected candidate concept (some examples are shown in table 1), a statistical analysis is performed in order to select the most representative ones. Concretely, we consider the following attributes:
− Total number of appearances (over all the analysed web sites): this is a measure of the concept's relevance for the domain and allows the elimination of very specific concepts (e.g. company names like Questlink) or not directly related ones (e.g. novel).
− Number of different web sites that contain the concept at least once: this gives a measure of the word's generality within the domain (e.g. amperometric is quite common, but Dupont isn't).
− Estimated number of results returned by the search engine for the selected concept alone: this indicates the global generality of the word and allows very widely used ones (e.g. advanced) to be avoided.
− Estimated number of results returned by the search engine for the selected concept joined with the initial keyword: this is a measure of the association between the two terms (e.g. "optic biosensor" gives many results but "techs biosensor" doesn't).
− Ratio between the last two measures: this indicates the intensity of the relation between the concept and the keyword (e.g. "amperometric biosensor" is much more relevant than "government biosensor").


• Only the concepts (a small percentage of the candidate list) whose attributes fit a set of specified constraints (a range of values for each parameter) are selected (marked in bold in table 1); a sketch of this check is given after this list. For each one, a new keyword is constructed by joining the new concept with the initial one (e.g. "optic biosensor"), and the algorithm is executed again from the beginning. This process is repeated recursively until a selected depth level is reached or no more results are found (e.g. reusable quartz fiber optic biosensor has no subclasses). Each new execution has its own search and selection parameter values, because the searched keyword is more restrictive (constraints have to be relaxed in order to obtain a significant number of final results).
• The obtained result is a hierarchy that is stored as an ontology. Each class name is represented by its morphological root (e.g. optical = OPTIC). However, if a word has different derivative forms, all of them are evaluated independently (e.g. optic, optical). Moreover, each class stores the concept's attributes described previously and the set of URLs from which it was selected. The sites associated with these URLs are the most representative ones for each concept in the ontology (those from which the concept was selected).
• Finally, an ontology refinement process is performed in order to obtain a more compact taxonomy and avoid redundancy (see the sketch after table 1). In this process, classes and subclasses that have the same set of associated URLs are merged, because we consider them to be closely related: in the search process, the two concepts have always appeared together. For example, the hierarchy "optic -> fiber -> quartz -> reusable -> rapid" will result in "optic -> fiber -> rapid_reusable_quartz", because the last 3 subclasses have the same URL sets. Moreover, the lists of URLs are processed in order to avoid redundancies between the classes' sets (e.g. if a web address is stored in a subclass, it is deleted from the superclass's set).
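The selection step can be pictured with a short sketch. The following Java fragment (all names are ours, not the prototype's actual code) computes the ratio attribute and applies threshold constraints of the kind just described; the concrete values mirror the first-level biosensor run of section 2.1 and are illustrative only.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Sketch of the statistical class selection (illustrative names/thresholds).
public class ClassSelector {

    public static class Candidate {
        public String root;          // morphological root, e.g. "AMPEROMETR"
        public int appearances;      // total hits over the analysed pages
        public int differentPages;   // pages containing the word at least once
        public long resultsAlone;    // engine results for the word alone
        public long resultsJoined;   // engine results for "word keyword"

        // Ratio between joined and alone results (intensity of the relation).
        public double ratio() {
            return resultsAlone == 0 ? 0.0 : (double) resultsJoined / resultsAlone;
        }
    }

    // First-level thresholds, taken from the biosensor run of section 2.1.
    static final int MIN_APPEARANCES = 5;            // on at least MIN_PAGES sites
    static final int MIN_PAGES = 2;
    static final long MAX_RESULTS_ALONE = 10000000L; // reject too-general words
    static final long MIN_RESULTS_JOINED = 10;       // reject too-specific words
    static final double MIN_RATIO = 0.0001;          // keep closely related words

    // Keeps only the candidates whose attributes fit all the constraints.
    public static List<Candidate> select(Collection<Candidate> candidates) {
        List<Candidate> selected = new ArrayList<Candidate>();
        for (Candidate c : candidates) {
            if (c.appearances >= MIN_APPEARANCES
                    && c.differentPages >= MIN_PAGES
                    && c.resultsAlone <= MAX_RESULTS_ALONE
                    && c.resultsJoined >= MIN_RESULTS_JOINED
                    && c.ratio() >= MIN_RATIO) {
                selected.add(c);
            }
        }
        return selected;
    }
}

With the values of table 1, amperometric (505 joined results out of 9840, ratio 0.051) passes every constraint, while government (40 out of 5560000, ratio 7.1E-6) fails the minimum-ratio one.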

Table 1. Candidate concepts for the biosensor ontology. Words in bold represent all the selected classes (merged ones - with the same root - in italic). The other ones are a reduced list of some of the rejected concepts (attribute values that do not fulfil the selection constraints are shown in italic).
Concept           Morphological root   #Appear.   #Different pages   #Search results   #Joined results   Result ratio
fluorescence      FLUORESC                 1             1                410000               56          1.3E-4
fluorescent       FLUORESC                36            19                616000              130          2.1E-4
optic             OPTIC                    8             6                945000              450          4.6E-4
optical           OPTIC                   16            13               4270000             1100          2.5E-4
amperometric      AMPEROMETR              14            12                  9840              505          0.051
based             BASE                    14             9               8790000             1560          1.7E-4
electrochemical   ELECTROCHEM             11             9                126000              447          0.003
glucose           GLUCOS                  21            18                811000              737          9.1E-4
resonance         RESON                    6             6                803000              481          5.9E-4
salomon           SALOMON                 48            35                597000              326          5.4E-4
spr               SPR                      6             5                350000              306          8.7E-4
government        GOVERN                   6             5               5560000               40          7.1E-6
advanced          ADVANC                  16            12              10050000              425          4.2E-5
novel             NOVEL                    4             4               5150000              383          7.4E-5
techs             TECH                     9             3                301000                0          0.0
ambri             AMBRI                    2             2                  2400              170          0.07
dupont            DUPONT                   6             1                951000               17          1.7E-5
questlink         QUESTLINK                3             2                 44600                0          0.0
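The refinement step of section 1.1 can also be written as a short recursive pass. The Node type and the name-fusion rule below are our illustration of the merge just described (run bottom-up, so whole chains like quartz/reusable/rapid collapse into one class); they are not the prototype's actual code.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the refinement: a subclass with exactly the same URL set as its
// superclass is fused into it, and URLs already kept in a subclass are
// removed from the superclass's set.
public class Refiner {

    public static class Node {
        public String name;
        public Set<String> urls = new HashSet<String>();
        public List<Node> subclasses = new ArrayList<Node>();
    }

    public static void refine(Node node) {
        for (Node child : new ArrayList<Node>(node.subclasses)) {
            refine(child); // bottom-up, so chains collapse step by step
            if (child.urls.equals(node.urls)) {
                // same web set: fuse, e.g. "reusable" + "quartz" -> "reusable_quartz"
                node.name = child.name + "_" + node.name;
                node.subclasses.remove(child);
                node.subclasses.addAll(child.subclasses);
            }
        }
        // second pass: a URL stored in a subclass is deleted from the superclass
        for (Node child : node.subclasses) {
            node.urls.removeAll(child.urls);
        }
    }
}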



1.2 Ontology representation
The final ontology is stored in a standard representation language: OWL [16]. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web. It is developed as a vocabulary extension of RDF [17] (Resource Description Framework) and is derived from the DAML+OIL [14] Web Ontology Language. It is designed for use by applications that need to process the content of information, and it facilitates greater machine interpretability by providing additional vocabulary along with a formal semantics. All these features make it easy to find relations between the discovered classes and subclasses (equivalences, intersections, unions, etc.). Thus, the final hierarchy of terms is presented to the user in a refined way. Moreover, OWL is supported by many ontology visualizers and editors, like Protégé 2.0, allowing the user to explore, understand, analyse or even modify the resulting ontology easily.
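As an illustration, the fragment below shows the kind of OWL (RDF/XML syntax) that could be emitted for part of the biosensor taxonomy; the class identifiers follow the root-based naming of section 1.1, and attaching the source URLs through rdfs:seeAlso is just one possible choice, used here for the sketch.

<rdf:RDF xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="http://example.org/biosensor.owl">
  <owl:Class rdf:ID="BIOSENSOR"/>
  <owl:Class rdf:ID="OPTIC_BIOSENSOR">
    <rdfs:subClassOf rdf:resource="#BIOSENSOR"/>
    <!-- one of the representative source URLs stored for this class -->
    <rdfs:seeAlso rdf:resource="http://www.isb.vt.edu/brarg/brasym94/rogers.htm"/>
  </owl:Class>
</rdf:RDF>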
In order to evaluate the correctness of the results, a set of formal tests can be performed (Protégé provides tests for finding loops, inconsistencies or redundancies). However, the evaluation from a semantic point of view can only be made by comparing the results with other existing semantic studies or through an analysis performed by an expert.
Once the ontology is created, it is easy to obtain the most representative web sites for each category (concept), because their URLs are stored in each subclass frame. Moreover, they cover the full spectrum (or at least the most important part) of the web resources available at this time. This list can easily be updated by performing simple (and fast) individual searches for each subclass keyword.
2. The prototype
In order to test the performance of the ontology construction methodology, we have built a prototype that implements the algorithm described previously. The program has been fully implemented in Java because there is a large number of libraries that ease the retrieval and parsing of web pages and the construction of ontologies. Concretely, the tools used are the following:
• Stemmers 1.0: it provides a stemming algorithm that finds the morphological root of an English word.
• HTML Parser 1.4: a powerful HTML parser that allows text to be extracted from a web site and processed.
• Google Web APIs (beta): the library that the Google search engine provides to programmers for making queries and retrieving search results (see the sketch after this list).
• OWL API 1.2: one of the first libraries providing functions for constructing and managing ontologies in OWL.
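As a rough sketch of the retrieval step, the fragment below drives the Java client shipped with the Google Web APIs beta (package com.google.soap.search); the class and method names follow that client's documentation, while GoogleFetcher, the query and the loop bounds are ours. The service requires a personal license key and returns at most 10 results per request, so the crawl pages through start offsets.

import com.google.soap.search.GoogleSearch;
import com.google.soap.search.GoogleSearchFault;
import com.google.soap.search.GoogleSearchResult;
import com.google.soap.search.GoogleSearchResultElement;

// Sketch: page through Google results for a keyword and print the URLs.
public class GoogleFetcher {
    public static void main(String[] args) throws GoogleSearchFault {
        GoogleSearch search = new GoogleSearch();
        search.setKey("YOUR-LICENSE-KEY");   // key issued with the beta API
        search.setQueryString("biosensor");
        search.setFilter(true);              // hide similar sites (first level)
        for (int start = 0; start < 100; start += 10) { // 10 results per call
            search.setStartResult(start);
            GoogleSearchResult page = search.doSearch();
            for (GoogleSearchResultElement e : page.getResultElements()) {
                System.out.println(e.getURL());
                // Non-HTML formats or dead links can be recovered from the
                // cache, e.g.: byte[] cached = search.doGetCachedPage(e.getURL());
            }
        }
    }
}

The setFilter flag corresponds to the filter of similar sites of section 1.1, and the (commented) cached-page call is the hook that makes the recovery of non-HTML and unreachable pages possible.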

Moreover, we have used Protégé as an ontology visualization and editing tool, together with the ezOWL plug-in, which creates a visual representation of an OWL ontology.
2.1 Execution example
As an example, we have used the word "biosensor" as the initial keyword for the domain covering the different types of this device. In order to constrain the search according to this initial word, we have defined the following parameters:
• Candidate concepts must have a minimum length of 2 characters.
• The maximum number of web sites per search has been set to 1000 (a good value, because "biosensor" has over 20000 different results through Google's API).


• The maximum depth level has not been constrained: the system searches for subclasses until it finds no more.
• On the first level, the filter for similar pages has been enabled to obtain all the different web sites. For deeper levels, it has been disabled in order to obtain more results (even if they belong to the same web domain).
• The minimum number of total hits has been set to 5 (on at least 2 different web sites) for the first level. For deeper ones, it has been decreased, down to 1 occurrence for levels deeper than 3 (see the sketch after this list).
• The maximum number of results returned by Google for each new concept has been set to 10,000,000 (to avoid very general words), and the minimum number of results joining the initial keyword with the new one has been set to 10 (to avoid very specific words). The minimum ratio between these two numbers has been set to 0.0001 (to select only closely related words).
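This relaxation can be expressed as a small helper; only the endpoint values (5 at the first level, 1 below level 3) are fixed by the list above, so the intermediate value here is an assumption.

// Illustrative relaxation of the minimum-hits constraint with depth.
public class SelectionRelaxation {
    static int minAppearances(int depth) {
        if (depth == 0) return 5; // first level: 5 hits on at least 2 sites
        if (depth <= 3) return 2; // intermediate levels (assumed value)
        return 1;                 // deeper than level 3: one occurrence suffices
    }
}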

With these parameters, the search has been performed, obtaining the ontology shown in figure 3 (left), visualized with Protégé 2.0. An example of the candidate concepts for the first level of the search and their attribute values (used for the class selection) is shown in table 1. The resulting taxonomy is formally correct (it has passed all the ontology tests provided by Protégé) and quite accurate according to a biosensor classification that can be found at [18]. Concretely, all the basic classes specified in that document (amperometric/potentiometric, enzyme, optical and chemical) have been found.

OPTIC BIOSENSOR
   Optic Biosensor
      AWD Fiber Optic Biosensor
         http://mywebpages.comcast.net/tfs-jdownward/Web_Pages/TFS_HH01_Fluorometer.html
         http://www.roanoke.edu/Chemistry/JSteehler/Web/fiber.htm
         http://www.fbodaily.com/cbd/archive/1997/01(January)/15-Jan-1997/Aawd002.htm
      Fiber Optic Biosensor
         http://www.isb.vt.edu/brarg/brasym94/rogers.htm
         http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=146999&rendertype=abstract
         http://www.photonics.com/todaysheadlines/XQ/ASP/url.lookup/id.493/QX/today.htm
         http://www.bdt.fat.org.br/binas/Library/book/rogers.html
         http://www.age.psu.edu/FAC/IRUDAYARAJ/biosensors/Soojin3-FiberOpt.htm
         http://www.age.psu.edu/FAC/IRUDAYARAJ/biosensors/Soojin3-FiberOpt_files/slide0001.htm
         http://flux.aps.org/meetings/YR97/BAPSSES97/abs/S1900013.html
         http://www.baeg.engr.uark.edu/FACULTY/yanbinli/projects/project6.html
         http://www.verbund-sensorik.de/projects/Projekt13028_19_eng.pdf
      Model HHXX Series Fiber Optic Biosensor
         http://ic.net/~tfs/Web_Pages/TFS_HH01_Fluorometer.html
      Rapid Reusable Quartz Fiber Optic Biosensor
         http://www.nal.usda.gov/ttic/tektran/data/000006/79/0000067907.html
      Evanescent Wave Fiber Optic Biosensor
         http://ieeexplore.ieee.org/xpl/abs_free.jsp?arNumber=294007
      Chemiluminescence Fiber Optic Biosensor
         http://ift.confex.com/ift/2001/techprogram/paper_7902.htm
      Real-time Fiber Optic Biosensor
         http://www.imd3.org/tang.doc
   Optical Biosensor
      Affinity-based Optical Biosensor
         http://www.cfdrc.com/applications/biotechnology/biosensor.html
         http://www.cfdrc.com/applications/biotechnology/microspheres.html
         http://www-users.med.cornell.edu/~jawagne/surf.plasmon.res.biosensor.html
         http://www.nanoptics.com/biosensor.htm
         http://www1.elsevier.com/vj/microtas/46/show/indexes/all_externals.htt?KEY=Optical+biosensor
         http://www1.elsevier.com/vj/microtas/46/show/indexes/all_externals.htt?KEY=Optical+biosensor+microarray
         http://www.ch.ic.ac.uk/leatherbarrow/PDF/Edwards%20etal%20(1997)%20J%20Mol%20Recog%2010,%20128.pdf
         http://www.ibmh.msk.su/gpbm2002/ppt/archakov/
         http://www.ee.umd.edu/LaserLab/research.html
         http://www.biochem.utah.edu/files/Long_Chapter_Affinity_Chr.pdf
      Coupling Optical Biosensor
         http://www.iscpubs.com/articles/abl/b0011fit.pdf
      Fiber Optical Biosensor
         http://www.isb.vt.edu/brarg/brasym94/rogers.htm
         http://ieeexplore.ieee.org/xpl/abs_free.jsp?arNumber=294007
         http://www.bdt.fat.org.br/binas/Library/book/rogers.html
      Generic Optical Biosensor
         http://www2.elen.utah.edu/~blair/R/research.html
      Integrated Optical Biosensor
         http://cism.jpl.nasa.gov/events/workshop/abstracts/Swanson.pdf
         http://www.optics.arizona.edu/library/PublicationDetail.asp?PubID=10285
      Mobile Multi-channel Optical Biosensor
         http://www.ee.washington.edu/research/spr/projects.htm
      Multi-analyte Optical Biosensor
         http://instruct1.cit.cornell.edu/Courses/nsfreu/baeumner.htm
      Plastic Colorimetric Resonant Optical Biosensor
         http://www.pcm411.com/sensors/abstracts/abs043.pdf
Figure 3. Left: Biosensor ontology; Right: URL hierarchy for OPTIC Biosensor subclass.



As mentioned previously, OWL allows us to find interclass relations automatically, like intersections, inclusions or equalities. For example, in this case, we have found that the "amperometric biosensor" class includes the "glucose amperometric biosensor" subclass and the "glucose biosensor" class includes the "amperometric glucose biosensor" subclass. As "glucose amperometric biosensor" and "amperometric glucose biosensor" are equivalent (Protégé shows this fact automatically by marking the classes with a different colour), their subclass hierarchies are merged (obtaining a new taxonomy).
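In OWL, such an equivalence can be stated directly. A minimal fragment (the class identifiers are illustrative, following the root-based naming of section 1.1) would be:

<owl:Class rdf:ID="GLUCOS_AMPEROMETR_BIOSENSOR">
  <rdfs:subClassOf rdf:resource="#AMPEROMETR_BIOSENSOR"/>
  <owl:equivalentClass rdf:resource="#AMPEROMETR_GLUCOS_BIOSENSOR"/>
</owl:Class>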
Moreover, for each obtained class, a list of URLs has been stored, allowing the user to access the most representative web sites for each concept. An example of the URLs retrieved for the "OPTIC biosensor" subclass is shown in figure 3 (right). As can be seen, different non-HTML file types (e.g. PDFs) have been retrieved.
3. Conclusion and future work
Some authors have been working on ontology learning from different kinds of structured information sources (like databases, knowledge bases or dictionaries [7]). However, taking into account the amount of resources easily available on the Internet, we believe that ontology creation from unstructured documents like web pages is an important line of research.
In this sense, many authors [2, 5, 8, 11] are putting their effort into processing natural language texts. In most cases, an ontology of basic relations is used as a semantic repository (WordNet [1]) from which word meanings and senses can be extracted and linguistic analyses performed. Moreover, in some of these approaches, the ontology learning is performed over an existing representative ontology of the explored domain. In most cases, a carefully selected, relevant corpus of documents is used as a starting point.
In contrast, the proposed methodology does not start from any kind of predefined knowledge of the domain, and it only uses publicly available web search engines. By performing a statistical analysis, new knowledge is discovered and processed recursively, building a hierarchy of representative classes. The obtained taxonomy really represents the state of the art on the WWW for a given concept, and the hierarchically structured list of the most representative web sites for each class is a great help for finding and accessing the desired web resources.
As future lines of research, some topics can be proposed:
• To ease the definition of the search and selection parameters, a pre-analysis can be performed from the initial keyword in order to estimate the most adequate values for the domain. For example, the total number of results for this concept gives a measure of its generality (suggesting more restrictive or more relaxed constraints).
• In this first prototype, we have mainly used the word preceding the keyword to create the ontology. However, the words following it can indicate the domain where the main concept is applied, allowing more general searches and wider ontologies. For example, for the biosensor example, some of the words returned by an analysis of the following word are: design, group, research, technology, system, application, etc.
• Several executions from the same initial keyword at different times can give different taxonomies. A study of the changes can show how a domain evolves.
• For each class, an extended analysis of the relevant web sites could be performed to find possible attributes and values that describe important characteristics (e.g. the name of a company), or closely related words (like a topic signature [9]).
• More complex relations between classes could be extracted from an exhaustive analysis of the coincidence between the obtained URLs, the possible attributes (and their values), or the multiple subclass dependences (at different depth levels).


Acknowledgements
We would like to thank David Isern and Jaime Bocio, members of the hTechSight project
[4], for their help. This work has been supported by the "Departament d'Universitats,
Recerca i Societat de la Informació" of Catalonia.
References
[1] WordNet: a lexical database for the English language. Web page:
http://www.cogsci.princeton.edu/wn.
[2] O. Ansa, E. Hovy, E. Agirre, D. Martínez, Enriching very large ontologies using the
WWW, in Proceedings of the Workshop on Ontology Construction of the European
Conference on AI (ECAI-00), 2000.
[3] H. Alani, S. Kim, D. Millard, M. Weal, W. Hall, P. Lewis, and N. Shadbolt, Automatic
Ontology-Based Knowledge Extraction from Web Documents, 14-21, IEEE
Intelligent Systems, IEEE Computer Society, 2003.
[4] A. Aldea, R. Bañares-Alcántara, J. Bocio, J. Gramajo, D. Isern, J. Jiménez, A.
Kokossis, A. Moreno, and D. Riaño, An ontology-based knowledge management
platform, in Workshop on Information Integration on the Web (IIWEB’03) at
IJCAI’03, 177-182, 2003
[5] E. Alfonseca and S. Manandhar, An unsupervised method for general named entity
recognition and automated concept discovery, in Proceedings of the 1st International
Conference on Global WordNet, 2002.
[6] D. Fensel, Ontologies: A Silver Bullet for Knowledge Management and Electronic
Commerce, Volume 2, Springer Verlag, 2001.
[7] D. Manzano-Macho, A. Gómez-Pérez, A survey of ontology learning methods and
techniques, OntoWeb: Ontology-based Information Exchange Management and
Electronic Commerce, 2000.
[8] A. Maedche, R. Volz, J.U. Kietz, A Method for Semi-Automatic Ontology Acquisition
from a Corporate Intranet, EKAW’00 Workshop on Ontologies and Texts, 2000.
[9] C.Y. Lin, and E.H. Hovy, The Automated Acquisition of Topic Signatures for Text
Summarization, Proceedings of the COLING Conference, 2000.
[10] A. Maedche, Ontology Learning for the Semantic Web, volume 665, Kluwer
Academic Publishers, 2001.
[11] P. Velardi, R. Navigli, Ontology Learning and Its Application to Automated
Terminology Translation, 22-31, IEEE Intelligent Systems, 2003.
[12] A. Sheth, Ontology-driven information search, integration and analysis, Net Object
Days and MATES, 2003.
[13] L. Magnin, H. Snoussi, J. Nie, Toward an Ontology–based Web Extraction, The
Fifteenth Canadian Conference on Artificial Intelligence, 2002.
[14] DAML+OIL. W3C. Web page: http://www.w3c.org/TR/daml+oil-reference.
[15] Extensible Mark-up Language (XML). W3C. Web page:
http://www.w3c.org/TR/owl-features/.
[16] OWL. Web Ontology Language. W3C. Web: http://www.w3c.org/TR/owl-features/.
[17] Resource Description Framework (RDF). W3C. Web page: http://www.w3c.org/RDF
[18] M. Woods, Biosensors, volume 2 of The World and I, 176,
http://worldandI.com/public/1987/frebruary/ns2.cfm, 1987.