Learning Ontology from Web Documents to Support Web Queries



Fu-ren Lin a,*, Ju-fen Hsueh b

a Institute of Technology Management, National Tsing Hua University, Hsinchu, Taiwan 300
b Department of Information Management, National Sun Yat-sen University, Kaohsiung, Taiwan 804



Abstract

This paper proposes an ontology learning mechanism to build domain ontology from Web documents to facilitate query expansion. Three query expansion approaches with different term-selection strategies from the ontology are used to evaluate the learned ontology. Web documents were selected from Google's consulting category to train and test an ontology for the "data mining" domain. The performance of query expansion, in terms of recall and precision, was significantly enhanced by the learned ontology.


1. Introduction


The World Wide Web (WWW) has become the largest text resource in the world. Modern Web search engines have remarkably improved search efficiency through keyword search and ranking mechanisms (Yuwono and Lee, 1996). However, these mechanisms still cannot guarantee that the retrieved Web sites are relevant to users' queries. The situation worsens when users are not familiar with the target domain and have difficulty describing their needs (Davenport and Prusak, 1998).

Information overflow and ambiguous keyword search make knowledge acquisition on the WWW more difficult. Besides building a complicated search engine, one simple way to improve query performance is to reform query strings, known as query expansion. Query expansion, or query modification, comes from the information retrieval domain. Keywords can be expanded based on related knowledge domains. An explicit knowledge structure, i.e., an ontology, benefits users by reducing ambiguity and misunderstanding in retrieving information from a document set. Query expansion can select keywords from an ontology and expand query terms by relations.

To build a useful and specific ontology efficiently, a fully or semi-automatic generating mechanism is necessary. Even though general ontologies are still built by ontologists, a domain-specific ontology can be extracted from a document collection. A concept has different meanings in different domains. The expansion of possible keywords for an information query may depend not only on co-occurrence or term frequency, but also on domain boundaries. Domain ontology also provides an alternative way to understand knowledge structure. Domain ontologies have become the foundation of retrieving information on the Web.

This study proposes an ontology learning method to generate domain ontology from online Web documents. It extracts important and relevant concepts from HTML documents, and identifies concept relations by analyzing HTML structure. This study also designs different expansion approaches based on the constructed ontology. Evaluating the precision and recall of expanded queries shows the correctness of the learned ontology and the effectiveness of the ontology-based approach.

This paper is organized as follows. Section 2 reviews concepts and techniques related to ontology learning and query expansion. The ontology learning process dedicated to learning from HTML documents is introduced in Section 3. Section 4 depicts the knowledge-based query expansion which utilizes the learned ontology. Section 5 describes the procedure and results of the experiments which evaluate the application of the learned ontology to performing query expansion. Section 6 concludes this study and specifies research limitations.

* Corresponding author. Tel.: +886 3 5742216; fax: +886 3 5745310; E-mail address: frlin@mx.nthu.edu.tw


2. Literature review

2.1 Query expansion

In an information retrieval system, a query is conducted through a query model. Query models include the Boolean model, vector space model, extended Boolean model, fuzzy model, probabilistic model, and natural language model (Korfhage, 1997). The most used models are the Boolean and vector space models. Query expansion not only alters the original query terms, but also extends the possible terms to increase the probability of retrieving relevant documents. Ideally, query expansion could be achieved by human experts or by using a thesaurus that adds synonyms, broader terms, and other appropriate words. Unfortunately, it is unaffordable for human experts to manually expand query terms. A good query expansion mechanism should increase the recall rate while keeping precision as fixed as possible.

Query expansion approaches can be classified into user-assisted and automatic (Bodner and Song, 1996). User-assisted approaches mainly rely on the user's knowledge about the problem domain and the ability to select effective terms. Automatic approaches often use external knowledge resources. Knowledge-based query expansion is suitable for retrieving documents in a specific domain. It helps users identify important keywords and expand query terms by relations. Concept maps and semantic relations have been used for Web page searching (Carnot et al., 2001). Ontology-based information selection has also been applied to database queries (Khan, 2000).


2.2 Ontology engineering

Ontology was originally used in philosophy to describe the universe. For knowledge representation purposes, an ontology is a formal, explicit specification of a shared conceptualization (Gruber, 1993). A typical ontology is a taxonomy that defines classes (concepts) and relations, together with a set of inference rules for reasoning functions (Berners-Lee et al., 2001). Sometimes it also includes axioms and instances (Gruber, 1993). An ontology is a representation vocabulary, often specialized to some domain or subject matter (Chandrasekaran et al., 1999). Ontologies can be classified into subcategories along different dimensions, such as formality, purpose, dependence, and the reuse of knowledge (Ding, 2001; Mizoguchi and Ikeda, 1997).

Ontology engineering is a research domain focusing on building ontologies. Ontological engineering encompasses a set of activities conducted during the conceptualization, design, implementation, and deployment of an ontology (Devedzic, 2002). However, the ontology construction process consumes more time than expected. People may hold different concepts of the same thing even when they share the same background and experience. It is tough to build an ontology across domains. In a changing environment, a manual approach will meet the problems of ontology maintenance and evolution.

Though ontology-engineering tools have matured over the last decade, the manual building of ontologies remains a tedious and cumbersome task, which can easily result in a knowledge acquisition bottleneck (Fensel, 2001). Many applications require that the engineering of an ontology be completed quickly and easily. A method that has proven to be extremely beneficial for the knowledge acquisition task is the integration of knowledge acquisition with machine learning techniques. This process is also called ontology learning (Maedche and Staab, 2001).

The architecture of a general ontology learning mechanism consists of the following components (Kietz et al., 2000).

(1) A text processing component processes raw text to identify keywords of the domain. It first analyzes lexical relations, finds the named entities (e.g., names, locations, and organizations), and then retrieves domain-specific information through internal or external lexical databases.

(2) Learning and discovering components extract the relationships among the relevant terms received from the text processing component. These relations link concept terms to construct concept trees which represent the domain ontology. Any domain-unspecific concepts on the tree are pruned.

(3) An encoding component provides an interface or rules to transform the learned concept trees into a formal language.

In the ontology learning process, two methods are usually used to discover important concepts and relations: symbolic and statistics-based approaches (Maedche et al., 2002). The trade-off between these two methods is that statistics-based approaches allow for better scaling, while symbolic approaches may eventually turn out to be more precise. The symbolic approach uses lexico-syntactic patterns in the form of regular expressions for extracting semantic relations, such as semantic patterns (Velardi et al., 2001). Semantic pattern approaches are heuristic methods that scan the text for instances of distinguished lexico-syntactic patterns indicating a relation of interest, and then define a regular expression as a semantic structure to capture re-occurring expressions. This approach does not discover new relations from sentences, but finds concepts that match the given patterns. Semantic patterns can achieve better accuracy, but defining patterns manually is a time-consuming task.

Feature selection and association rule analysis are usually adopted in statistics-based approaches (Agirre et al., 2000). Term frequency and entropy are two measures commonly used to select important features to represent documents (Yang and Pedersen, 1997). Association rule analysis is a traditional data mining method, which analyzes the association of two items in a data set (Srikant and Agrawal, 1995). Association rule analysis often uses support and confidence rates to select association rules. High support and confidence represent a higher degree of relevance between terms.

A term with high frequency is often taken as a bad indexing term. In addition, a rare term is often taken as an unimportant term. However, a known problem of the statistical approach is that a feature in a document is not necessarily an important concept in the domain. Domain relevance measures the possibility of one single term appearing in different domains (Velardi et al., 2001).

Automatic ontology construction faces several basic problems. One is the determination of the ontology hierarchy in terms of the degree of abstraction and coverage. Most ontology learning research adopts an ontology backbone to reduce this problem (Ding and Foo, 2002). The other problem is the discovery of undefined relations. Pattern matching approaches can only identify defined relations, but statistics-based approaches may discover undefined relations, which may incur semantic conflicts and analogous problems.


2.3 Text/Web mining

Web mining is an integrated technology in which several research fields are involved, such as data mining, computational linguistics, and statistics. However, Web mining has two unique features compared with data mining. One is that the source of Web mining is Web documents. The other is that the identified pattern can be the content of the documents or the structure of the Web (Jichang et al., 1997). Many studies adopt a thesaurus as a lexical database to retrieve terms. Besides lexical analysis, text mining often uses the same technology developed for information retrieval to extract important terms from documents, such as term weighting or feature selection (Yang and Pedersen, 1997).


Structural information extraction differs from classical information extraction since it usually utilizes the meta-information (e.g., HTML tags, simple syntax, or delimiters) available inside semi-structured data. HTML documents use tags to define the layout of Web pages, which represents an author's writing structure. The concepts contained in one paragraph have a closer relation than concepts separated in different segments. By using the Web layout, a Web page source text can be divided into sentences or words, and then related facts can be grouped together to separate them from unrelated ones (Soderland, 1997). Terms can be re-weighted by tags, such as document title, author name, subtitle, etc., which may enhance the precision of information extraction from HTML documents.

In order to process HTML documents more efficiently, an HTML document is often transformed into a tree structure (DiPasquo, 1998). Some important properties of the HTML structure can be used to calculate the distance between two nodes: the relative depth (i.e., the number of nodes in the path from the root node), the absolute depth (i.e., the number of nodes from the root node to the current node), the number of levels (i.e., the average absolute depth of the nodes below the current node), and the number of leaves (Carchiolo et al., 2000).


3. Ontology learning process


This study proposes an ontology learning process that integrates a sequence of functions to generate an ontology from HTML documents. It is also possible to improve the ultimate results by replacing compatible methods at any step. In building domain ontology automatically, the process starts from data collection and ends at ontology generation, as shown in Figure 1.



Figure 1. The ontology learning process


In the first step, a software agent collects HTML documents from the Internet. Constraints such as limits on traversal depth and file size are used to obtain a manageable set of Web pages. After collecting enough Web data, a converter transforms the ordinary HTML format into the document object model (DOM) format to reduce noise and prepare for the later processes. The DOM-formatted data are analyzed by a lexical parser to tag word classes and phrases.

In the next step, important concepts are extracted as terms from the Web pages. Naturally, important concepts appear in domain-related documents. Parsing a large amount of domain documents ensures that the system receives most of the concepts in that domain. In the third step, an HTML structure-based method extracts the possible relations and calculates the distance between concept terms. By calculating term distances in a document, the corresponding concept distance can be estimated. The output ontology is stored in an ontology repository for further applications.


3.1 Data collection and preprocessing

Building domain ontology requires a set of domain documents as the training data set. This study chose data mining as the target domain because of our familiarity with this domain and the availability of data. Theoretically, this ontology learning process can be applied to any domain. Web documents as the data source are the most common and richest text format on the Internet. In addition, a markup document in HTML possesses a machine-readable structure annotated by tags. A tag defines how a document is presented, and indicates the importance of a text block within the document. A title is often more important than a heading, and a heading is more important than a paragraph. An HTML document can be transformed into a tree-like structure. Although XML may be a better markup language than HTML, HTML still accounts for the largest percentage of document formats on the Internet.

The data collection process is semi-automatic. First, every Web page or site is reviewed by humans. Human experts decide whether a page illustrates the important target concepts. After they validate each selected Web page, the relevant ones are extracted from the Internet. An open source Java package, WebSPHINX, crawls Web pages by following hyperlinks (Miller, 2002). WebSPHINX can filter Web documents by MIME type and limit the crawling depth. Most Web crawlers limit their depth to 6 or 7, and the default depth of WebSPHINX is 5. This default value is adopted in this study for efficiency reasons. Only pure HTML documents are kept at this stage. The final training data are collected from 37 different Web sites, and consist of 341 Web pages in total.
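As an illustration (not part of the original system), the following Python sketch mimics the kind of depth-limited, MIME-filtered crawl described above; it assumes the third-party requests package, and the seed URL and depth limit are parameters chosen by the caller.

    import urllib.parse
    from collections import deque
    from html.parser import HTMLParser
    import requests  # assumed available; any HTTP client would do

    class LinkParser(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_depth=5):
        """Breadth-first crawl keeping only text/html pages, as in Section 3.1."""
        pages, seen = {}, {seed}
        queue = deque([(seed, 0)])
        while queue:
            url, depth = queue.popleft()
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            if "text/html" not in resp.headers.get("Content-Type", ""):
                continue  # MIME-type filter: keep pure HTML documents only
            pages[url] = resp.text
            if depth < max_depth:
                parser = LinkParser()
                parser.feed(resp.text)
                for link in parser.links:
                    absolute = urllib.parse.urljoin(url, link)
                    if absolute not in seen:
                        seen.add(absolute)
                        queue.append((absolute, depth + 1))
        return pages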

Training pages are stored in the DOM structure. DOM, defined by the W3C, is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of documents (Wood et al., 1998). DOM organizes a document as a tree, and every node is connected according to its HTML structure. A node contains tag, text, and hierarchy information. Unimportant information within an HTML document is dropped while transforming HTML documents into the DOM structure. Unimportant information includes empty nodes and nodes for presentation purposes, such as <font> and <sub>. Structure nodes such as <ul> and <tr> are kept. A weight is assigned to each node that contains text to indicate the importance of the node. The weight assignment is listed in Table 1. Structure nodes are given an N/A mark because they carry no text. These weights are relative weights, and different Web design styles often affect the weight assignment.
Table 1. HTML tags and their assigned weights

HTML tag   Weight
<title>    2
<body>     1
<h>        2
<center>   2
<p>        1
<dl>       N/A
<dt>       2
<dd>       1
<ul>       N/A
<ol>       N/A
<li>       2
<th>       2
<tr>       N/A
<td>       1
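The node weighting of Table 1 can be sketched as follows, assuming the page has first been tidied into well-formed XHTML (the role the DOM parser plays in this study); the helper name weighted_nodes and the default weight of 1 for unlisted text-bearing tags are our assumptions.

    import xml.etree.ElementTree as ET

    # Table 1 weights; structure tags marked N/A carry no text and are skipped.
    TAG_WEIGHTS = {"title": 2, "body": 1, "h": 2, "center": 2, "p": 1,
                   "dt": 2, "dd": 1, "li": 2, "th": 2, "td": 1}

    def weighted_nodes(xhtml_text):
        """Yield (tag, text, weight) for every text-bearing node of a tidied page."""
        root = ET.fromstring(xhtml_text)
        for node in root.iter():
            if not isinstance(node.tag, str):
                continue                  # skip comments and processing instructions
            text = (node.text or "").strip()
            if not text:
                continue                  # empty and presentation-only nodes are dropped
            tag = node.tag.rsplit("}", 1)[-1].lower()  # strip any XHTML namespace
            if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
                tag = "h"                 # Table 1 lists heading tags collectively as <h>
            yield tag, text, TAG_WEIGHTS.get(tag, 1)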



We ignore tag attributes in this study. In some markup languages like XML, a tag attribute is meaningful. However, this study does not consider tag attributes since these attributes are often used for presentation purposes in HTML documents. Several open source packages can perform DOM parsing. JTidy, a well-known DOM parser developed since 1994, is a Java port of HTML Tidy (Raggett, 2003). JTidy takes charge of the transformation from HTML to DOM in this study.

Figure 3 shows an example of the DOM transformation of the HTML Web page shown in Figure 2.

Figure 2. An original HTML document

Figure 3. The HTML document transformed into the DOM structure

The last pre-processing step is to parse the sentences and words in each node by lexical analysis. Several lexical analysis tools have been developed in natural language processing research. The most common and popular tools are POS (part-of-speech) taggers and chunk parsers (also called shallow or partial parsers).

Words of a language can be grouped into classes which show similar syntactic behavior. Nouns, verbs, and adjectives are three important classes. These word classes are commonly called parts of speech (POS) (Manning and Schutze, 1999). A POS tagger analyzes a large corpus to build probabilistic models or learning algorithms, and then tags the word classes in sentences. Words can be organized into phrases. Probabilistic context-free grammars (PCFG) are the most common model to represent the regularities and constraints of word order and phrase structure.

In order to train this kind of grammar model, a large set of example phrases is particularly useful. The most widely used one is the Penn Treebank (Marcus et al., 1999). With this grammar model at hand, a sentence can be decomposed into several phrases with phrase types. Figure 4 demonstrates a portion of a Penn Treebank tree parsed from the example of Figure 3. After parsing, four noun phrases (NP) appear in this sentence, "Data mining techniques", "the result", "a long process", and "research and product development", which may be important concepts in this sentence. Each sentence in a DOM node is parsed by a PCFG parser to extract its syntactic information. The parsing precision rate is about 75% (Klein and Manning, 2002).


(S (NP (NNP Data) (NN mining) (NNS techniques))
   (VP (VBP are)
       (NP (NP (DT the) (NN result))
           (PP (IN of)
               (NP (NP (DT a) (JJ long) (NN process))
                   (PP (IN of)
                       (NP (NN research) (CC and) (NN product) (NN development)))))))
   (. .))

Figure 4. A Penn Treebank tree
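As a lightweight stand-in for the PCFG parser used in this study, the following sketch uses NLTK's POS tagger with a simple regular-expression chunk grammar to pull out noun phrases; the grammar is deliberately minimal and would, for example, miss the coordinated phrase "research and product development" that a full parser recovers.

    import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

    # A simple NP chunk grammar: optional determiner, any adjectives, then nouns.
    GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(GRAMMAR)

    def noun_phrases(sentence):
        """Return the noun phrases of a sentence, as in the Figure 4 example."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        return [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]

    print(noun_phrases("Data mining techniques are the result of a long process "
                       "of research and product development."))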


3.2 Extracting concepts from HTML documents

Terms parsed from the DOM structure and annotated by Penn Treebank trees constitute the ontology vocabulary as potential concept candidates. This study selects concept terms only from noun phrases (NP). A noun phrase is composed of a noun (singular or plural), adjectives, and a determiner. The determiner part of a noun phrase is further removed. Words like "the", "is", and "he" are removed from the concept candidate terms. In general, the number of removed noise terms is larger than that of true concept terms.

In the concept extraction process, we take the whole document collection as a single document that describes the target domain. If a term appears frequently in the collection, it is reasonable to assume that the term denotes at least a certain concept in this domain; moreover, its importance should increase with its frequency. Some feature selection methods are only suitable for a set of documents, whereas the tf method handles the term frequency in a single document, so for practical usage we use the tf method to extract important terms. Whether a term is important depends not only on its frequency, but also on the document structure. Since each node of the DOM tree carries tag information, we can rewrite the probabilistic term weighting formula by replacing the raw term frequency with the tag-weighted frequency tag(t), where tag(t) is the sum of the tag weights of the nodes containing term t in a document.

This new weight formula indicates that if a term appears in a node with tag weight 2, we treat it as appearing twice. When a term's weight w is greater than a threshold, the term is viewed as an important concept in the document collection. This study sets the threshold to 0.51, and extracts 196 terms from a total of 21,467 terms. Among the 196 terms, there are 168 single terms and 28 noun phrases.
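A minimal sketch of the tag-weighted frequency count, building on the weighted_nodes helper above; since the paper's exact weighting formula is not reproduced here, the normalization by the highest-weighted term is our assumption, and whitespace splitting stands in for the noun-phrase extraction.

    from collections import Counter

    def concept_candidates(pages, threshold=0.51):
        """Score terms by tag-weighted frequency over the whole collection and
        keep those whose normalized weight exceeds the threshold."""
        counts = Counter()
        for xhtml in pages:
            for tag, text, weight in weighted_nodes(xhtml):
                for term in text.lower().split():   # stand-in for NP extraction
                    counts[term] += weight          # each occurrence counts its tag weight
        top = max(counts.values())
        return {t: c / top for t, c in counts.items() if c / top > threshold}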



3.3 Finding relations between concepts

An HTML document is a semi-structured document. This structure can inspire several interesting heuristics for analyzing documents. External and internal Web structures are both worth exploring. On one hand, the internal Web structure is considered in terms of depth, indicating how the document context is organized and placed. A human writer tends to describe related context sequentially, or from an abstract idea to a specific one. On the other hand, dependent contexts will actually be closer than independent contexts, which implies that dependent concepts may have similar behaviors.


In an HTML document, any two sentences or words have two kinds of distance: hierarchical and physical. A hierarchical distance is the absolute distance a node travels through the tree to reach another node; it is denoted hd_ij, the distance from node i to node j. A physical distance is the real distance between two contexts in a document, which can be defined as the character length from one node to the other.

Take Figure 5 as an example: the hierarchical distances between nodes 1 and 9 and between nodes 1 and 13 are the same, 5. However, by the context sequence, the distance between nodes 1 and 13 should be longer than that between nodes 1 and 9 because there is more information between them. There are 417 characters between nodes 1 and 9, and 489 characters between nodes 1 and 13. Thus, the physical distance between nodes 1 and 9 is 417, and that between nodes 1 and 13 is 489. If two concepts appear in the same node, their physical distance is 0. To avoid the effect of the original document length on the distance, the character length is divided by the document length.
length on counting distance, the character length is divided by the document length.




Figure 5. HTML tree structure


This study defines charlength(i,j) as a function that calculates the character length between nodes i and j. The distance between nodes i and j is defined as

$d(i,j) = \frac{charlength(i,j)}{CHARLENGTH}$,

where charlength(i,j) denotes the character length between nodes i and j (including nodes i and j), and CHARLENGTH denotes the total number of characters in the document.

A term may appear more than once in a document, or in more than one document. The distance between two concepts in a document is the average of the distances between the nodes containing the concepts; that is,

$d_k(c_i, c_j) = \frac{1}{n \cdot m} \sum_{p=1}^{n} \sum_{q=1}^{m} d(i_p, j_q)$,

where k denotes the kth document, n denotes the number of nodes containing concept c_i in the document, and m denotes the number of nodes containing concept c_j in the document. The average distance between two concepts in a document collection is defined as

$dis(c_i, c_j) = \frac{1}{K} \sum_{k=1}^{K} d_k(c_i, c_j)$,

where K denotes the number of documents containing both concepts c_i and c_j.
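A sketch of the physical-distance computation under the assumption that each concept's node positions are recorded as character offsets; the function names are illustrative.

    from itertools import product

    def concept_distance(doc_positions, doc_length, c_i, c_j):
        """Average normalized character distance between all node pairs holding
        concepts c_i and c_j in one document (d_k in the text).
        doc_positions maps a concept to the character offsets of its nodes."""
        pairs = list(product(doc_positions[c_i], doc_positions[c_j]))
        return sum(abs(p - q) for p, q in pairs) / (len(pairs) * doc_length)

    def collection_distance(per_doc_distances):
        """Average d_k over the K documents containing both concepts."""
        return sum(per_doc_distances) / len(per_doc_distances)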

A total of 14,060 concept pairs were found in the training HTML page set. Because some concepts are connected only by occasional co-occurrence, this study adopts the support and confidence measures used in association rule analysis to reduce this situation. Higher support and confidence values are preferred when more precise relations are needed. In some ontology-based applications, low support and confidence values tend to produce broader results.
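A sketch of this pruning step, computing support and confidence from document-level co-occurrence; treating documents as the counting unit is our assumption.

    def relation_filter(docs_with, n_docs, c_i, c_j, min_support, min_confidence):
        """Keep the pair (c_i, c_j) only if its co-occurrence clears both thresholds.
        docs_with maps a concept to the set of documents containing it."""
        both = docs_with[c_i] & docs_with[c_j]
        support = len(both) / n_docs
        confidence = len(both) / len(docs_with[c_i])   # P(c_j | c_i)
        return support >= min_support and confidence >= min_confidence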


4. Knowledge-based query expansion

The characteristics of external knowledge source may improve the effectiveness of query
expansi
on. A well
-
defined ontology contains concepts, attributes, relations and rules. Attributes and
relations are features to describe concepts and the connection among them. Rules are used to determine
to which situation the relations can be applied. Since

ontology contains knowledge about the domain
documents,
the study

call
s

the query expansion model with ontology the knowledge
-
based query
expansion.


4.1 Query term selection


In adopting an ontology to enable knowledge-based query expansion, the first task is to select terms from the ontology. This study proposes three different term selection approaches.

Approach 1. Close frequency term (CFT)

This approach takes a user's initial term as an entry, and expands it by the relevant terms whose weights are closest to this initial term. We define w(u,v) as the frequency difference between two terms u and v. The ontology is represented as a graph G = (V, E), where a vertex in V denotes a term, and an edge denotes the frequency difference between two vertices. A minimum spanning tree (MST) algorithm can be used to calculate the total minimum weight. The number of terms constrains the search span. Figure 6 illustrates the close frequency term approach. In this example, T is the initial query term and its weight, representing term frequency, is 0.55. The number beside each line denotes the value of w(u,v). The first three expanding terms are those with the minimum frequency difference, e.g., C2, C7, and C6. If the initial term does not connect with any terms, the search returns no expanded terms.



Figure 6. The illustration of the CFT approach
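A sketch of CFT as Prim-style minimum-spanning-tree growth over frequency differences on a complete term graph; returning the first k terms reached from the initial term is our reading of the Figure 6 example.

    import heapq

    def cft_expand(freq, initial, k=3):
        """Grow an MST over frequency differences (Prim's algorithm on a complete
        graph) and return the first k terms reached from the initial term."""
        visited = {initial}
        heap = [(abs(freq[initial] - freq[t]), t) for t in freq if t != initial]
        heapq.heapify(heap)
        expanded = []
        while heap and len(expanded) < k:
            _, term = heapq.heappop(heap)
            if term in visited:
                continue
            visited.add(term)
            expanded.append(term)
            for t in freq:              # push the new frontier edges
                if t not in visited:
                    heapq.heappush(heap, (abs(freq[term] - freq[t]), t))
        return expanded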


Approach 2. Shortest distance term (SDT)

The shortest distance approach simply uses the original MST algorithm, with concept distances as edge weights, to search for expanded terms. Figure 7 illustrates the SDT approach with an example. In this example, T is the initial query term. The distance between two irrelevant nodes is marked as infinity. The first three expanded terms are the closest terms C1, C3, and C4. If term T does not connect with any terms, the search returns no expanded terms.

Figure 7. The illustration of the SDT approach


Approach 3. Maximum mediator term (MMT)

This approach is based on a common phenomenon in automatic query expansion methods, which often adopt co-occurrence analysis to identify associative terms as expansion candidates. However, in most documents, synonyms rarely appear together. A writer usually uses consistent terms when writing a series of documents. When one term has almost the same mediators as another, the two may be exchangeable.

To find paths of a fixed length k in a graph, we can use an adjacency matrix. To represent a graph as a matrix, the rows and columns are labeled by the vertices in the same order. If an edge exists between two vertices, the value is set to 1; otherwise it is 0. After squaring this matrix, the value in the ith row and jth column indicates the number of possible pathways between vertices i and j. Finally, a binary coefficient decides whether a path really exists.

Figure 8 shows the maximum mediator term approach. In this example, the initial term T has three different paths to reach term C8, so the mediator size is 3 (see the dotted lines on the left side). The second maximum paths are from term T to term C4, with mediator size 2 (see the other kind of dotted lines on the right side). By setting the minimum threshold value of the mediator size to 3, we identify the expanded term C8. If there is no connected term around term T, or all mediator sizes are less than 3, the search returns no expanded terms.


Figure 8. The illustration of the MMT approach
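The matrix step described above can be sketched directly with NumPy; thresholding the mediator counts stands in for the binary coefficient mentioned in the text.

    import numpy as np

    def mmt_expand(adj, initial_idx, min_mediators=3):
        """Entry (i, j) of adj @ adj counts the two-step paths, i.e., the shared
        mediators, between terms i and j in a 0/1 adjacency matrix. Terms sharing
        at least min_mediators mediators with the initial term are returned."""
        paths2 = adj @ adj
        mediators = paths2[initial_idx].copy()
        mediators[initial_idx] = 0   # drop paths that return to the term itself
        return np.flatnonzero(mediators >= min_mediators)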





4.2 Query modification

The goal of query expansion is to hold query precision at a fixed value while trying to increase the recall rate; that is, to relax the query results. The original query syntax is therefore preserved. Every single query term is expanded by the three proposed term-selection approaches, and each set of expanded terms is grouped with the Boolean operator OR. Figure 9 displays an example of query modification.

Although the query terms are enlarged, the original query structure still exists. In a term-weighted information retrieval system, users can specify their own term weights. Usually, expanded terms get a lower weight than the given keywords. This emphasizes the importance of the original keywords and avoids expanding so many terms that the query effectiveness is reduced. In this research, the effect of weighting is not a concern, so the weight of an original term is simply set to 2 and that of an expanded term to 1.


Original query: "term1 term2" AND (term3 OR term4)
Expanding "term1 term2": (term12-1 OR term12-2)
Expanding term3: (term3-1)
Expanding term4: (term4-1 OR term4-2 OR term4-3)
Modified query: ("term1 term2" OR term12-1 OR term12-2) AND ((term3 OR term3-1) OR (term4 OR term4-1 OR term4-2 OR term4-3))

Figure 9. Query modification in the Boolean query model
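A sketch of the modification rule of Figure 9; expressing the 2/1 weights as Lucene-style caret boosts is our assumption about the target IR system.

    def modify_query(terms, expansions):
        """Rebuild a conjunctive query, OR-ing each original term (boost 2)
        with its expanded terms (boost 1), as in Figure 9."""
        clauses = []
        for term in terms:
            quoted = f'"{term}"' if " " in term else term
            alternatives = [f"{quoted}^2"] + [f"{e}^1" for e in expansions.get(term, [])]
            clauses.append("(" + " OR ".join(alternatives) + ")")
        return " AND ".join(clauses)

    print(modify_query(["term1 term2", "term3"],
                       {"term1 term2": ["term12-1", "term12-2"], "term3": ["term3-1"]}))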


5. The evaluation of query expansion applications with learned ontology

This section evaluates the effects of a learned ontology on the effectiveness of three query expansion approaches. It is hard to measure the accuracy of a learned ontology directly. However, a useful ontology should enable knowledge-based query expansion to allocate relevant terms in addition to the query terms. Based on the three term-selection approaches proposed in Section 4, this study conducts several information retrieval experiments to evaluate whether the ontology-based expansion approaches obtain better results than the original query approach. The influence of the number of relations on the query effectiveness is also examined.

Among the methods of evaluating the accuracy of a learned ontology, the most commonly used are human expert judgment, a "golden standard", and applications (Maedche et al., 2002). The first method, human expert judgment, is the most straightforward approach. Humans can explain the meanings of abstract concepts well, but face some limitations in evaluating ontologies; for example, people may use different scales in explaining abstract concepts. The second method is to compare the constructed ontology with a so-called "golden standard" ontology, an existing ontology built by domain experts as the standard. However, this kind of ontology is rarely available. The third evaluation method is through ontology-based applications: measuring the effectiveness of the application indirectly evaluates the ontology. This research adopts the indirect application evaluation approach.


5.1 The collection of testing data

A straightforward way to examine the effectiveness of query expansion is to construct an information retrieval system, and then compare the expanded query results with the non-expanded ones. We adopt the IR software Lucene (Jakarta Lucene, http://jakarta.apache.org/lucene/) as the evaluation system. It is an open source, high-performance, and full-featured text search engine written entirely in Java. Lucene can perform document indexing and ranking. The query results are ranked by hits, where a hit means the relevance degree between a document and the query. Retrieved documents are ordered by rank. Lucene uses the tf*idf method to rank the results.

In order to collect a large amount of Web pages from the Internet, we use a Web directory as the starting point. Many Web sites provide Web directories to facilitate Web browsing. Few Web directories are built manually; most are categorized by machine learning or statistical algorithms. Google provides a Web directory that is classified by the PageRank algorithm (Page et al., 1998). We use a broader domain, "business consulting", to collect testing data. From different kinds of consulting Web sites, we collected 15 categories and 11,527 pages in total. The testing data set is summarized in Table 2. Web pages belonging to the "Data_Mining" category are taken as the target documents; there are 1,359 target Web pages in total.

Table 2. The summary of testing Web pages

Subcategories of business consulting domain   Number of pages
Travel_Industry                                 864
Housing_Consultants                             127
Materials_Handling                              159
Biotechnology_and_Pharmaceuticals               160
Data_Mining                                    1359
Data_Transfer                                   102
Chemical                                        970
Education                                      1350
Import_and_Export                               948
Retail_Trade                                   2258
Statistical_Consulting                         1204
Sales                                           437
Government                                     1102
Waste_Management                                247
Transportation_and_Logistics                    240
Total                                         11527

5.2 The experimental design

After the experimental information retrieval system and the testing data were available, we began to conduct experiments. The experimental procedure is shown in Figure 10. The initial query string is modified into a new query string under different expansion settings, and the resulting query terms are used to retrieve documents from the IR system. Retrieved documents are marked as relevant or irrelevant. Finally, the effectiveness of the query results, in terms of recall and precision, is evaluated.


Figure 10. The experimental procedure
Figure 10. The experimentation procedure

The initial query strings were decided before starting the query expansion experiments. A query string definitely affects the retrieval results, and how the query terms are grouped also affects the final output. To isolate the expansion effects on the query effectiveness, we use a single term as the initial string. There are 15 initial terms (Table 3), manually picked from the 196 ontology concepts. These terms are taken as relevant keywords for searching "data mining" Web pages. Each term is a single query string and is treated independently. These initial query terms therefore serve as the basis for evaluating the expansion results.

Table 3. The list of initial query terms

1. classification
2. clustering
3. data analysis
4. data mining process
5. data mining software
6. data mining techniques
7. data mining tool
8. data mining
9. data preparation
10. data warehouse
11. data warehousing
12. decision tree
13. knowledge discovery
14. machine learning
15. neural network




5.3 Query modification mechanisms

Four expansion settings are used in the experiments to expand the initial query terms: (1) no expansion (NOEXPANDING), (2) searching close frequency terms (CFT), (3) searching shortest distance terms (SDT), and (4) searching maximum mediator terms (MMT). In the NOEXPANDING setting, the initial query term is used directly to retrieve documents. In the other three settings, the three term-selection approaches are used to search for relevant terms, bounded by a maximum number of terms. If no terms are found to expand the query, the initial term is used. In the experiments, the number of terms is limited to 4, including the initial term.

Moreover, in order to understand the effect of the number of relations on the query effectiveness, the query expansion mechanisms are also combined with different support rates. A high support rate generates only a small number of relations, so a query expansion approach may include different terms when the ontology presents more or fewer relations. This research does not consider the effect of confidence on the number of relations, as we fixed the confidence level at 0.05.

After query expansion, the new set of query terms is grouped by the modification rules mentioned in Section 4. We combine these terms with the OR operation and set the weight of the initial terms to 2 and that of the expanded terms to 1. This new query string is then submitted to the IR system to retrieve relevant documents.


5.4 Experimental results and evaluation

Each initial term is expanded by the four query expansion mechanisms. In addition, ontologies built with seven different support values are used for query expansion in the last three settings. The IR system uses every new query string to retrieve documents. Recall and precision are known as the effectiveness measurements of an information retrieval system. Recall is defined as the proportion of relevant documents that are retrieved, and precision as the proportion of retrieved documents that are relevant. In other words, recall measures the breadth of the search, while precision measures the effectiveness of the search. High recall and high precision is the ideal situation; however, these two qualities often appear to be negatively correlated. Precision is calculated by

$Precision = \frac{|REL \cap RET|}{|RET|}$,

recall by

$Recall = \frac{|REL \cap RET|}{|REL|}$,

and the average precision at recall level r by

$\bar{P}(r) = \frac{1}{N_q} \sum_{i=1}^{N_q} P_i(r)$,

where REL denotes the set of relevant documents, RET denotes the set of retrieved documents, r denotes the recall level, N_q denotes the number of queries, and P_i(r) denotes the precision at recall level r for the ith query.
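The measures above translate directly into code; a minimal sketch:

    def precision_recall(relevant, retrieved):
        """Set-based precision and recall, matching the definitions above."""
        hits = len(relevant & retrieved)
        return hits / len(retrieved), hits / len(relevant)

    def average_precision(per_query_precision):
        """Mean of P_i(r) over the N_q queries at one fixed recall level r."""
        return sum(per_query_precision) / len(per_query_precision)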

In order to evaluate the effectiveness of query expansion, the average precision is calculated over the 15 query strings for the three expansion mechanisms under seven support rates (0.05, 0.04, 0.03, 0.02, 0.01, 0.007, and 0.005).

The recall and precision results indicate that the ontology-based approaches increase the query effectiveness over the non-expansion query method. The results show that all three term-selection methods outperform the no-expansion case. Figure 11 demonstrates the precision-recall curves of the four query methods under the seven support rates. They generally gain higher precision at recall levels 0.1 to 0.3. The most obvious improvement appears when the support rate is set to 0.02.

With higher support rates, term expansion is limited by fewer available relations. Under this situation, the three term-selection approaches only increase recall at low precision; even at the same recall level, they do not improve precision much. As the support rate decreases, these approaches expand query terms through richer relations and obtain more effective query terms that improve the precision.

In order to evaluate whether these expansion approaches obtain different results, we use a paired t-test (Hull, 1993) on the average precision of all pairs of methods. When the support values are 0.04, 0.03, 0.02, and 0.01, the three query expansion mechanisms differ significantly from NOEXPANDING (p-values are less than 0.05).
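A sketch of the paired t-test on per-query average precision using SciPy; the two vectors below are hypothetical placeholder values, not the paper's data.

    from scipy import stats  # assumes SciPy is available

    # Hypothetical average-precision values over the 15 queries at one support rate.
    cft = [0.61, 0.55, 0.48, 0.70, 0.52, 0.66, 0.59, 0.47,
           0.63, 0.58, 0.51, 0.60, 0.49, 0.57, 0.62]
    noexpanding = [0.52, 0.50, 0.41, 0.63, 0.45, 0.60, 0.51, 0.40,
                   0.55, 0.49, 0.44, 0.53, 0.42, 0.50, 0.54]

    t_stat, p_value = stats.ttest_rel(cft, noexpanding)  # paired t-test (Hull, 1993)
    print(p_value < 0.05)                                # significant at the 5% level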

CFT performs consistently well across the support rates, and results in better effectiveness than the other two query expansion settings. CFT simply expands the terms whose frequencies are closest to the initial term's. When the initial term is a high-frequency term, CFT tends to expand more high-frequency terms; conversely, it expands more low-frequency terms. In addition, CFT searches not only directly related terms, but also indirectly related terms through the minimum spanning tree algorithm. CFT thus addresses the problem that some related terms rarely appear together and cannot be expanded by co-occurrence analysis. SDT and MMT share this feature, too.



Figure 11. Precision-recall curves under seven support values: (a) support = 0.05, (b) 0.04, (c) 0.03, (d) 0.02, (e) 0.01, (f) 0.007, (g) 0.005

The difference between CFT and SDT may come from the concept distance used in SDT. SDT expands terms based only on concept distance, so it may expand a term that merely co-occurs occasionally with the initial term. In other words, the effectiveness of SDT may be improved by increasing the selection precision of the relations. The precision increases as the support rate decreases. From the precision and recall results, the value of the support rate appears to affect the query effectiveness.

Going a step further, we use the Friedman test to examine the multiple results within each query expansion method. The Friedman test (Hull, 1993) can test differences in the average scores of more than two methods. The Friedman test results are 0.007, 0.000, and 0.000 for CFT, SDT, and MMT, respectively, which indicates that all expansion settings are significantly affected by the different support rates. The three term-selection methods perform best at support rate 0.03; the query effectiveness then stays the same or becomes worse as the support rate increases.
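Correspondingly, a sketch of the Friedman test with SciPy; the three hypothetical vectors stand in for one method's per-query scores under three of the seven support rates.

    from scipy import stats  # assumes SciPy is available

    # Hypothetical per-query average precision of one method at three support rates.
    s005 = [0.50, 0.47, 0.52, 0.55, 0.49]
    s003 = [0.61, 0.58, 0.63, 0.66, 0.60]
    s001 = [0.57, 0.54, 0.59, 0.62, 0.56]

    stat, p_value = stats.friedmanchisquare(s005, s003, s001)
    print(p_value)   # a small p-value indicates the support rate affects the results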

At a higher support rate, there are not enough relations for expanding and selecting effective terms. At a lower support rate, too many irrelevant relations are accepted: it becomes possible to expand more effective terms, but also to reduce the expansion precision. Choosing a proper number of relations may maximize the query effectiveness. In this experiment, the positive correlation between the increase of relations and precision appears as the support rate decreases from 0.04 to 0.01, where the number of relations increases from 689 to 7,113, i.e., from 5% to 50% of the total relations.


6. Conclusions and research limitations

This research proposes an ontology learning process and applies it to three query expansion mechanisms. Experiments are used to indirectly evaluate the accuracy of the learned ontology in extracting concepts and determining term weights in a target domain by considering both concept similarity and document structure. Based on these extracted concept terms, potential relations are also confirmed by concept distance. The learned ontology helps expand meaningful query terms to obtain improved query results.

The common feature of the proposed query expansion methods is that they all search direct and indirect relations in the ontology. In contrast to co-occurrence analysis, an indirect relation connects two terms that may rarely appear together. Co-occurring terms often appear in the same document, so other relevant documents may contain none of these terms, and such relations are hard to extract from a whole document collection. When other approaches fail to extract more relevant documents, the proposed search approaches can retrieve documents through relations built from terms that rarely appear jointly. The results also show that different support rates identify different numbers of relations and result in various degrees of precision.
Machine learning and data mining methods often face several problems which may directly affect the final outputs. In this research, the quality of ontology learning is mainly affected by two issues: data quality and the lack of an initial ontology backbone. The Web documents used for training and testing are acquired from the Internet, and often contain noise which may not be easy to detect. Since cleaning documents is a time-consuming task, the system may not acquire sufficient training and testing data. Query accuracy is based on the Web site categories, which are constructed mainly by automatic methods and inevitably contain mis-categorized documents.

The second problem is the lack of an ontology backbone. Most ontology learning research expands an ontology according to a given ontology, called the backbone ontology. A backbone ontology tackles several basic difficulties, such as ontology hierarchy and conflict handling. However, a backbone ontology is seldom available. This research shows the possibility of growing an ontology from scratch.



Because of these research constraints, Web structure mining techniques were not fully adopted in this research. They may solve some known problems such as removing data noise and clarifying the extracted relations. The presentation form and tag attributes in HTML documents indicate emphasized terms and possible types. In addition, link analysis techniques could extract more structural information from HTML documents.

In the future, some methods may be designed to improve the ontology learning process. One direction is special training data formats. Q&A documents may be a good source for extracting explicit relations, since a question often contains the necessary attributes and the answer carries the possible instances. Text understanding is another direction, focusing on extracting relations and important concepts through linguistics and content semantics. Ontology engineering is now beginning to consider the maintenance and evolution of ontologies.


References

Agirre, E., Ausa, O., Havy, E., Martinez, D., 2000. Enriching very large ontologies using the WWW. In: ECAI-2000 Workshop on Ontology Learning, August 2000, Berlin. http://ol2000.aifb.uni-karlsruhe.de/

Berners-Lee, T., Hendler, J., Lassila, O., 2001. The semantic web. Scientific American, 284(5), 34-43.

Bodner, R., Song, F., 1996. Knowledge-based approaches to query expansion in information retrieval. In: McCalla, G. (Ed.), Advances in Artificial Intelligence. Springer, New York, pp. 146-158.

Carchiolo, V., Longheu, A., Malgeri, M., 2000. Extracting logical schema from the Web. In: Proceedings of the Workshop on Text and Web Mining, Melbourne, Australia, pp. 64-71.

Carnot, M.J., Dunn, B., Canas, A.J., 2001. Concept map-based versus Web page-based interfaces in search and browsing. In: Proceedings of the International Conference on Technology and Education, Tallahassee, FL. http://www.icte.org

Chandrasekaran, B., Josephson, R., Benjamins, R., 1999. What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1), 20-26.

Davenport, T.H., Prusak, L., 1998. Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press.

Devedzic, V., 2002. Understanding ontological engineering. Communications of the ACM, 45(4), 136-144.

Ding, Y., 2001. Ontology: the enabler for the semantic Web. Journal of Information Science, 27(6), 377-384.

Ding, Y., Foo, S., 2002. Ontology research and development: part 1, a review of ontology generation. Journal of Information Science, 28(2), 123-136.

DiPasquo, D., 1998. Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web. Senior Honors Thesis, School of Computer Science, CMU.

Fensel, D., 2001. Ontologies: Silver Bullet for Knowledge Management and Electronic Commerce. Springer-Verlag, Berlin.

Gruber, T.R., 1993. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.

Hull, D., 1993. Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, Pennsylvania, pp. 329-338.

Jichang, W., Huan, H., Gangshan, W., Fugan, Z., 1997. Web mining: knowledge discovery on the Web. In: Proceedings of the Ninth International Conference on Tools with Artificial Intelligence. http://ieeexplore.ieee.org

Khan, L., 2000. Ontology-based Information Selection. Ph.D. Dissertation, Department of Computer Science, University of Southern California.

Kietz, J.-U., Maedche, A., Volz, R., 2000. A method for semi-automatic ontology acquisition from a corporate intranet. In: Proceedings of the EKAW'00 Workshop on Ontologies and Text, Juan-Les-Pins, France.

Klein, D., Manning, C.D., 2002. Fast exact inference with a factored model for natural language parsing. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Advances in Neural Information Processing Systems 15 (NIPS 2002, Vancouver, British Columbia, Canada), pp. 3-10.

Korfhage, R., 1997. Information Storage and Retrieval. John Wiley, New York.

Maedche, A., Pekar, V., Staab, S., 2002. Ontology learning, part one: on discovering taxonomic relations from the Web. In: Zhong, N., et al. (Eds.), Web Intelligence. Springer.

Maedche, A., Staab, S., 2001. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 72-79.

Manning, C., Schutze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Marcus, M., et al., 1999. The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/home.html

Miller, R., 2002. WebSPHINX: A Personal, Customizable Web Crawler. http://www-2.cs.cmu.edu/~rcm/websphinx/

Mizoguchi, R., Ikeda, M., 1997. Towards ontology engineering. In: Proceedings of the Pacific-Asian Conference on Expert Systems (PACES'97), pp. 259-266.

Page, L., Brin, S., Motwani, R., Winograd, T., 1998. The PageRank citation ranking: bringing order to the Web. http://google.stanford.edu/~backrub/pageranksub.ps

Raggett, D., 2003. Clean up your Web pages with HTML TIDY. http://www.w3.org/People/Raggett/tidy/

Soderland, S., 1997. Learning to extract text-based information from the World Wide Web. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, pp. 251-254.

Srikant, R., Agrawal, R., 1995. Mining generalized association rules. In: Proceedings of the 21st International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland. Morgan Kaufmann, pp. 407-419.

Velardi, P., Missikoff, M., Fabriani, P., 2001. Using text processing techniques to automatically enrich a domain ontology. In: Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS 2001), Ogunquit, Maine, October 17-19, pp. 270-284.

Wood, L., et al., 1998. Document Object Model (DOM) Level 1 Specification. http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/

Yang, Y., Pedersen, J.P., 1997. A comparative study on feature selection in text categorization. In: Fisher, D.H. Jr. (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), Nashville, Tennessee, July 8-12.

Yuwono, B., Lee, D.L., 1996. Search and ranking algorithms for locating resources on the World Wide Web. In: Su, S. (Ed.), Proceedings of the Twelfth International Conference on Data Engineering (ICDE), New Orleans, LA. IEEE Computer Society, pp. 164-171.