Building Ontology Web Retrieval System Using Data Mining



Dr. Hadeel Showket Al-Obaidy
Computer Engineering Department, College of Engineering,
Ahlia University, Bahrain





Abstract:
Much of the information in science, engineering and business has been recorded in the
form of text.
Traditionally, this information would appear in journals or company reports, but
increasingly it can be found online on the World Wide Web. Tools to support information
access and discovery on the Internet are proliferating at an astonishing rate. Some of this
development reflects real progress, but there are also many exaggerated claims. This paper
reviews the important technologies for text-based information access on the Web and
describes the progress that is being made by researchers in these areas.
The Semantic Web provides a common framework that allows data to be shared
and reused across application, enterprise, and community boundaries. It envisions a
globally interconnected network of machine-processable information, made possible by
the sharing of semantic data models, or ontologies. Locating a suitable existing
ontology to capture the user-required information from the Internet is a big challenge in
current Semantic Web research.
This research presents an ontology-based search engine that matches retrieved
information content to a user's profile and provides search capabilities for a Web
application. The performance of a search can be measured by understanding the user's
interests and the relevance of the retrieved information.
The research uses data mining techniques to build an ontology retrieval system that
improves the quality of information access.
The system uses association rules to cluster the knowledge base and to index it
semantically in a Web database.
The Onto-Mining search engine in this research allows the user to perform
keyword searches on certain types of “ontology” files, and to visually inspect the files to
check their relevance. The Onto Search system is based on Java and ASP.

Keywords:
Data mining, World Wide Web, Web Information Retrieval System, Ontology server.



1. Introduction
Ten years ago, the primary technologies being used to construct large
information systems were database systems, information retrieval systems, and
information filtering systems. Database systems were used to handle large volumes
of structured data and to provide guarantees of reliability and consistency despite
systems failures and high volumes of update transactions. Information retrieval
systems were used to search large databases of text, such as scientific abstracts,
legal materials, or newspaper stories. Information filtering or “clipping” services
provided periodic updates in the form of text stories, mostly in the business
domain, based on user profiles. In the relatively short period since, there have been
many developments that have affected how information technology is talked about
and used.
The most important of these have been the growth of the Internet and the
availability of cheap hardware. The technologies for the large information systems
discussed today include the Internet (and intranets and extranets), Web search,
portals, agents, collaborative filtering, XML and metadata, and data mining, where
association rules are used to build large item sets that serve as keywords for web pages.

2. Search Engines
One of the major tools for information access is the search engine. Most
search engines use information retrieval techniques to rank Web pages in presumed
order of relevance based on a simple query. Compared to the bibliographic
information retrieval systems of the 1970s and 1980s, the new search engines must
deal with information that is much more heterogeneous, “messy”, more varied in
quality, and vastly more distributed or “linked”. In addition, most Web search
engines use a centralized architecture in which “Web crawlers” gather Web pages and
a single, very large index is created. Such an approach has inherent scalability
problems.


There has been a growing awareness that effective information retrieval is a
hard problem. Indeed, in a recent Turing Award lecture, it was identified as a
software “grand challenge”. To address this challenge, researchers in information
retrieval and related areas of computer science are proposing new retrieval models
and techniques to support distributed architectures, summarization, question
answering, cross-lingual retrieval, better interfaces, and multimodal search
[AL-Obaidy07a].
3. Information Filtering
Information filtering has been around for some time in the form of “current
awareness” systems. A number of Web tools provide this functionality (often under
the “agent” label). Most of the applications of this technology are in the business
and news domains. Many of these systems use simple Boolean matching
techniques, although there has been much research and a number of new companies
applying machine learning techniques to this problem. Effective filtering is,
however, as difficult as effective search, and the problems involved with
proactively sending users too much data that is not relevant have resulted in
varying levels of acceptance. Techniques being developed to improve search
will, however, also result in more effective filtering, so we can expect to see more
applications involving this technique in the future. Collaborative filtering is a
complementary technique based on matching user preferences that has become
popular in E-commerce applications [Bara99].

4. Ontology Web Site
An ontology is commonly defined as a formal, explicit specification of a shared
conceptualization. A "conceptualization" is an abstract model of a phenomenon, created by
identifying the relevant concepts of that phenomenon. "Explicit" means that the concepts,
the relations between them and the constraints on their use are explicitly defined.
"Formal" means that the ontology is machine-readable, which excludes the use of natural
language. For example, in medical domains the concepts are diseases and
symptoms, the relations between them are causal, and a constraint is that a disease
cannot cause itself. That an ontology is a "shared conceptualization" states that
ontologies aim to represent consensual knowledge intended for the use of a group
[Lee01]. Ideally an ontology captures knowledge independently of its use and in a
way that can be shared universally, but in practice different tasks and uses call for
different representations of the knowledge in an ontology. An ontology is sometimes
confused with a taxonomy, which is a classification of the data in a domain. The
two differ in two important respects:
1. An ontology has a richer internal structure, as it includes relations and constraints
between the concepts.
2. An ontology claims to represent a certain consensus about the knowledge in the
domain. This consensus is among the intended users of the knowledge [Palm01].

4.1 Semantic Web:

The Semantic Web provides a common framework that allows data to be shared
and reused across application, enterprise, and community boundaries. It envisions a
globally interconnected network of machine-processable information, made
possible by the sharing of semantic data models, also known as
ontologies. The Semantic Web is a collaborative effort led by the World Wide Web
Consortium with participation from a large number of researchers and industrial
partners. It is based on the Resource Description Framework (RDF), which
integrates a variety of applications using XML for syntax and URIs for naming.
Many people are working in this area to improve, extend and standardize the
Semantic Web, and many documents and tools have already been developed.
However, Semantic Web technologies are still in their infancy and there are many
challenges in this area. One of the most important issues is to locate suitable
existing ontologies to capture the user-required information from the Semantic
Web [AL-Obaidy05, Lee01].


5. Text Data Mining
A considerable amount of research is being carried out under the heading of
text data mining. This includes a variety of techniques such as information
extraction, clustering, and discovery of associations or “rules”. All of these
techniques combine statistical methods with some level of linguistic analysis. In
contrast to data mining using relational database systems, where a number of
commercial packages are available, text data mining is still an open research issue.
Evaluation of research in this area is also difficult, since many of the results are
presented with examples instead of statistical data.
Information extraction techniques are designed to extract “facts” from text. In
many cases, this means very simple facts such as names of companies, people, and
monetary amounts, but in general this technique can be used to extract more
complex information, such as filling a database according to a template or schema.
Extraction is a key component of text data mining since it provides the objects for
the statistical analysis. Much of the research in this area has been done with
newspaper text, but results with scientific text are beginning to be reported. There
has also been recent work focusing on information extraction based on the structure
of Web pages [Mitch97].
Clustering is used to group related information. This technique has been well-
studied in information retrieval but has recently been the subject of a number of
new papers. Information extraction and clustering can be used with other
techniques to discover interesting associations in text databases. The applications
of this type of discovery have been mostly based on business information, but it
may also be useful in scientific and engineering contexts [Godbout99].






5.1 Association Rules

Association rules are one of the most promising techniques of data mining as a knowledge
discovery tool and have been widely studied. They capture all
possible rules that explain the presence of some attributes in terms of the
presence of other attributes. An association rule, X => Y, is a statement of the
form "for a specified fraction of transactions, a particular value of an attribute
set X determines the value of an attribute set Y as another particular value with
a certain confidence". Association rules thus aim at discovering patterns of
co-occurrence of attributes in a database [Han01, Fay01].
5.1.1 Formal Definitions

The formal definition of an association rule is the following. Let $\Gamma = \{i_1, i_2, \ldots, i_m\}$
be a set of literals, called items. Let $D$ be a set of transactions, where
each transaction $T$ is a set of items such that $T \subseteq \Gamma$. Associated with each
transaction is a unique identifier, called its TID.
Definition (1):
An item set $X$ is a set of items in $\Gamma$. An item set $X$ is called a k-item set
if it contains $k$ items from $\Gamma$.
Definition (2):
A transaction $T$ satisfies an item set $X$ if $X \subseteq T$. The support of
an item set $X$ in $D$, $\mathrm{support}_D(X)$, is the number of
transactions in $D$ that satisfy $X$.
Definition (3):
An item set $X$ is called a large item set if the support of $X$ in $D$
is at least a minimum support threshold explicitly declared by
the user, and a small item set otherwise.
Definition (4):
An association rule is an implication of the form $X \Rightarrow Y$,
where $X \subset \Gamma$, $Y \subset \Gamma$, and $X \cap Y = \emptyset$. The support and
confidence of an association rule $X \Rightarrow Y$ are calculated by the
following two equations:

$$\mathrm{support}(X \Rightarrow Y) = \frac{\text{number of transactions in } D \text{ that contain } X \cup Y}{\text{total number of transactions in } D}$$

$$\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}_D(X \cup Y)}{\mathrm{support}_D(X)}$$

That is, the rule $X \Rightarrow Y$ has support $s$ in $D$ if the fraction $s$ of the transactions
in $D$ contain $X \cup Y$, and it holds in the transaction set $D$ with confidence $c$,
where $c = \mathrm{support}_D(X \cup Y) / \mathrm{support}_D(X)$.

A rule is accepted if its support and confidence are equal to or greater than the
user-specified minimum values. The goal of association rule mining is to find the
relationships between any combinations of items.

Example (1):
Consider the example transaction database ETDB in Table (2). There
are six transactions in the database, with transaction identifiers (TIDs) 1, 2, 3, 4, 5
and 6. The set of items is $\Gamma = \{A, B, C, D, E, F\}$, where each item is an abbreviation
of a book title in bookshop sales, as shown in Table (1). There are in total $2^6 - 1 = 63$
non-empty item sets (each non-empty subset of $\Gamma$ is an item set). {A} is a 1-item set,
{A, B} is a 2-item set, and so on [Zaki04].
Support(A) = 3, since three transactions include A. Let us assume that the
minimum support (minsup) is two (approximately 33% of the six transactions). Then
{A, B, C, D, E, AB, AC, AE, BC, BD, BE, CD, CE, DE, ABC, ABE, ACE, BCD,
BCE, BDE, CDE, ABCE, BCDE} is the set of large item sets, since their support
is greater than or equal to 2 (33% x 6), and the remaining ones are small item sets.
Two item sets, ABCE and BCDE, are called maximal item sets; all other large
item sets are subsets of one of them. Table (3) lists the large item sets with their
support. Let us assume that the minimum confidence (minconf) is set to 100%. Then
A => B is an association rule with respect to the specified minsup and minconf. Its
support is 3 of the 6 transactions:

$$\mathrm{support}(A \Rightarrow B) = \frac{\text{number of transactions that contain } A \text{ and } B}{\text{total number of transactions}} = \frac{3}{6} = 50\%$$

and its confidence is:

$$\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}_{ETDB}(AB)}{\mathrm{support}_{ETDB}(A)} \times 100 = \frac{3}{3} \times 100 = 100\%$$

On the other hand, the rule B => A is not a valid association rule, since its confidence is
50%. Table (4) lists the association rules that can be mined from the database ETDB
with 100% minimum confidence and a minsup value of 33% [Fay01].

Item   Book Title
A      From Here to Eternity
B      Love in the Time of Cholera
C      Gone with the Wind
D      The Moon and the Fences
E      The Tree and the Assassination of Marzooq
F      The Monster

Table (1) The item abbreviations of database ETDB

TID    Items (books)
1      B, C, E
2      B, C, D, E
3      A, B, C, D, E
4      B, C, D
5      A, B, F
6      A, B, C, E

Table (2) A transaction database
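
The support and confidence values used in Example (1) can be checked directly against the transactions of Table (2). The following is a minimal sketch; the names `ETDB`, `support` and `confidence` are illustrative choices and do not come from the original system.

```python
# Transactions of Table (2); item letters follow Table (1).
ETDB = {
    1: {"B", "C", "E"},
    2: {"B", "C", "D", "E"},
    3: {"A", "B", "C", "D", "E"},
    4: {"B", "C", "D"},
    5: {"A", "B", "F"},
    6: {"A", "B", "C", "E"},
}


def support(itemset, db=ETDB):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in db.values() if items <= transaction)


def confidence(x, y, db=ETDB):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(set(x) | set(y), db) / support(x, db)


print(support("A"))          # 3
print(confidence("A", "B"))  # 1.0
print(confidence("B", "A"))  # 0.5
```

Support is kept here as an absolute count, as in Definition (2); dividing by the number of transactions gives the percentage form used for minsup.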



Support    Item sets                                          No.
6 (100%)   B                                                  1
5 (83%)    C, BC                                              2
4 (67%)    E, BE, CE, BCE                                     4
3 (50%)    A, D, AB, BD, CD, BCD                              6
2 (33%)    AC, AE, DE, ABC, ABE, ACE, BDE, CDE, ABCE, BCDE    10

Table (3) Large item sets with minsup = 33% (2 transactions)

Rules (1)      Rules (2)       Rules (3)
A⇒B (3/3)      AC⇒B (2/2)      AC⇒BE (2/2)
C⇒B (5/5)      AE⇒B (2/2)      AE⇒BC (2/2)
D⇒B (3/3)      AC⇒E (2/2)      DE⇒BC (2/2)
E⇒B (4/4)      AE⇒C (2/2)      ABC⇒B (2/2)
D⇒C (3/3)      DE⇒B (2/2)      ABE⇒C (2/2)
E⇒C (4/4)      DE⇒C (2/2)      ACE⇒B (2/2)
ABE⇒C (2/2)    ACE⇒B (2/2)     ABC⇒E (2/2)

Table (4) Association rules with minconf = 100%
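
As a cross-check of the large item sets, the sketch below enumerates every non-empty subset of the six items by brute force and keeps those supported by at least two transactions. It is only practical for a toy database like this one; the function and variable names are assumptions made for illustration.

```python
from itertools import combinations

# Transactions of Table (2), written as sets of item letters.
TRANSACTIONS = [set("BCE"), set("BCDE"), set("ABCDE"),
                set("BCD"), set("ABF"), set("ABCE")]
ITEMS = "ABCDEF"
MINSUP = 2  # absolute count, about 33% of six transactions


def large_item_sets(transactions, items, minsup):
    """Map each large item set (as a string such as 'BCE') to its support."""
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count >= minsup:
                result["".join(combo)] = count
    return result


large = large_item_sets(TRANSACTIONS, ITEMS, MINSUP)

# A large item set is maximal if no other large item set strictly contains it.
maximal = sorted(x for x in large
                 if not any(set(x) < set(y) for y in large))

# Group the item sets by support, mirroring the layout of Table (3).
by_support = {}
for name, count in large.items():
    by_support.setdefault(count, []).append(name)

print(len(large))  # 23
print(maximal)     # ['ABCE', 'BCDE']
```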

Rule Generation Algorithm

for all large k-item sets l_k, k >= 2, in L do
begin
    H_1 = { consequents of rules from l_k with one item
            in the consequent }
    ap-genrules(l_k, H_1)
end

procedure ap-genrules(l_k, H_m)
if k > m+1 then
begin
    H_{m+1} = apriori-gen(H_m)
    for all h_{m+1} in H_{m+1} do
    begin
        conf = support_D(l_k) / support_D(l_k - h_{m+1})
        if conf >= minconf then
            add (l_k - h_{m+1}) => h_{m+1} to the rule set
        else
            delete h_{m+1} from H_{m+1}
    end
    ap-genrules(l_k, H_{m+1})
end
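
The rule generation procedure can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the implementation used in the proposed system: supports are recomputed by scanning the Table (2) transactions, item sets are sorted tuples, and the helper names are assumptions.

```python
from itertools import combinations

TRANSACTIONS = [set("BCE"), set("BCDE"), set("ABCDE"),
                set("BCD"), set("ABF"), set("ABCE")]


def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def apriori_gen(itemsets):
    """Join m-item tuples that share their first m-1 items, then prune."""
    prev = set(itemsets)
    out = set()
    for x in prev:
        for y in prev:
            if x[:-1] == y[:-1] and x[-1] < y[-1]:
                cand = x + (y[-1],)
                if all(s in prev for s in combinations(cand, len(cand) - 1)):
                    out.add(cand)
    return out


def ap_genrules(lk, hm, minconf, rules):
    """Grow the consequents of rules from l_k from m to m+1 items."""
    if not hm:
        return
    m = len(next(iter(hm)))
    if len(lk) > m + 1:
        h_next = apriori_gen(hm)
        for h in list(h_next):
            antecedent = tuple(i for i in lk if i not in h)
            conf = support(lk) / support(antecedent)
            if conf >= minconf:
                rules.append((antecedent, h, conf))
            else:
                h_next.discard(h)
        ap_genrules(lk, h_next, minconf, rules)


def generate_rules(large_itemsets, minconf):
    """Return (antecedent, consequent, confidence) triples."""
    rules = []
    for lk in large_itemsets:
        if len(lk) < 2:
            continue
        h1 = set()  # one-item consequents of rules that reach minconf
        for item in lk:
            antecedent = tuple(i for i in lk if i != item)
            conf = support(lk) / support(antecedent)
            if conf >= minconf:
                rules.append((antecedent, (item,), conf))
                h1.add((item,))
        ap_genrules(lk, h1, minconf, rules)
    return rules


print(generate_rules([("A", "B")], minconf=1.0))
```

For the item set AB only A => B is produced, because B => A has confidence 3/6.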







The candidate generation algorithm apriori_gen(L_{k-1})

C_k = ∅
for all item sets X ∈ L_{k-1} and Y ∈ L_{k-1} do
    if X_1 = Y_1 ∧ … ∧ X_{k-2} = Y_{k-2} ∧ X_{k-1} < Y_{k-1} then
    begin
        C = X_1 X_2 … X_{k-1} Y_{k-1}
        add C to C_k
    end
delete candidate item sets in C_k that have a subset which is not in L_{k-1}

The Apriori algorithm

Procedure Apriori()
L_1 = { large 1-item sets }
k = 2
while L_{k-1} ≠ ∅ do
begin
    C_k = apriori_gen(L_{k-1})
    for all transactions t in D do
    begin
        C_t = subset(C_k, t)
        for all candidates c ∈ C_t do
            c.count = c.count + 1
    end
    L_k = { c ∈ C_k | c.count >= minsup }
    k = k + 1
end
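
A compact Python sketch of the two procedures is given below. It follows the level-wise structure of the pseudocode but is not taken from the original system: item sets are represented as sorted tuples, supports are absolute counts, and the sample transactions are those of Table (2).

```python
from itertools import combinations


def apriori_gen(prev_large):
    """Candidate k-item sets built from the large (k-1)-item sets."""
    prev = set(prev_large)
    candidates = set()
    for x in prev:
        for y in prev:
            # Join step: equal on the first k-2 items, ordered on the last.
            if x[:-1] == y[:-1] and x[-1] < y[-1]:
                c = x + (y[-1],)
                # Prune step: every (k-1)-subset must itself be large.
                if all(s in prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup):
    """Return a dict mapping every large item set to its support count."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    result = dict(large)
    while large:
        candidates = apriori_gen(large)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        result.update(large)
    return result


transactions = [set("BCE"), set("BCDE"), set("ABCDE"),
                set("BCD"), set("ABF"), set("ABCE")]
large_sets = apriori(transactions, minsup=2)
print(len(large_sets))  # 23
```

On this database the level sizes are 5, 9, 7 and 2 item sets, and the loop stops when no 5-item candidate can be formed.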

6. The Proposed Design of the Onto-Mining Search Engine:
The proposed work designs an adaptive Web Information Retrieval (WIR) system
(the Ontology Adaptive Web Information Retrieval System). Websites have
traditionally been designed to fit the needs of a generic user, whereas the adaptive Web
Information Retrieval System uses association rules to cluster the user's keywords
and also to classify the content of the knowledge base.
Building the ontology search engine therefore requires adaptive personalization
based on the interests of the search engine's users.

[Figure components: User, Keywords, Concepts, Onto Server, Taxonomy/Thesaurus based
on association rules, Relevant Ontology, Ontology Library, Onto Final Ranking System]

Figure (1) The general architecture of the Onto Server

The proposed system needs a conceptual model in order to develop a conceptual
WIR system based on an ontology of the concept hierarchy of user interests. The general
architecture is shown in Figure (1) and the WIR model is shown in
Figure (2).



















[Figure components: User, Keywords, Select pages, Chosen Pages, Web Pages (RDF) + Metadata,
Final Onto-Mining Library, Onto-Mining Server, Taxonomy viewer using association rules.
Numbered steps: 1: Input keywords (data); 2: Search the DB for relevant pages;
3: Return result to user; 4: Select pages; 5: Viewing; 6: Return result as taxonomy view;
7: Select page based on structure; 8: Save ontology]

Figure (2) The general steps of the WIR system

To build this system, the system gathers information for each user and
analyzes it. From then on, whenever the user visits the web site, the system
analyzes his/her activity by storing tracks (cookies) and produces the user
profile. At this level the system needs contextual information for each user,
which means the proposed system needs three processing steps [AL-Obaidy07b]:
1. Generation of the history of each user, to identify the most relevant
web pages for the user based on his/her past activity.
2. Classification of the concepts as a tree of concepts (interests) and user
activity.
3. Classification of the user profile.
The general algorithm of the designed WIR system is as follows:
Ontology WIR Algorithm
Begin
Step 1: Call the Web Information Retrieval algorithm.
Step 2: For each user, create a personal tree featuring his/her own concepts and
hierarchy.
Step 3: Gather sample documents for each concept in the hierarchy.
Step 4: Use the association rule algorithm to classify, that is, to compare the user's
sample documents with the user's interests.
Step 5: Present the user with websites organized by using large
item sets as the user's concepts.
Step 6: End.








Web Information Retrieval (WIR) Algorithm [AL-Obaidy07b]
Begin
Step 1: Generate a user profile, where each visitor has:
• An individual file and an ID.
• The visitor's specified interests; the proposed system also gives
each interest an ID, which means that each subject has:
1. an ID for the user's documents
2. an ID for the subject itself
Step 2: Build a database for each category, which means that each category has:
• an ID for each user who has an interest in this category
• an ID for each interest (category).
This produces a document vector (containing the personal information and the user profile).
Step 3: Generate a super document: any subject has many visitors, so their
profiles are summed and the result is given an ID as its profile ID.
Step 4: Perform the classification process using association rules: compare
the results and generate, for each subject, the top and the leaves using
the concept hierarchy algorithm.
Step 5: The next time the visitor enters the web site, the proposed system builds a
Conceptual Profile (CP) for him/her, based on the classification of the user's
activity and of each document that exists in the database.
Step 6: The Onto Server takes the CP and the user request, ranks the results
to obtain an original rank, and then re-ranks them by combining the
original rank with the similarity between the keywords and the CP.
Step 7: Rate the web pages and decide which pages the system might
recommend based on the user's interests.
End.
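
Steps 5 and 6 rely on a conceptual profile and on a similarity between the query keywords and that profile, but the paper does not specify how either is computed. The sketch below is therefore only one plausible reading, under the assumption that the profile is a normalized count of the concepts attached to the documents a user visited and that similarity is the cosine between two weight vectors. All names are hypothetical.

```python
from collections import Counter
from math import sqrt


def conceptual_profile(visited_docs):
    """Build a profile from visited documents.

    `visited_docs` is a list of concept-label lists, one per visited document.
    Returns a dict mapping each concept to its share of all concept mentions.
    """
    counts = Counter(concept for doc in visited_docs for concept in doc)
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


profile = conceptual_profile([["ontology", "rdf"], ["ontology"]])
print(cosine({"ontology": 1.0}, profile))
print(cosine({"sports": 1.0}, profile))
```

A query about a concept the user has visited often scores higher than one about a concept that is absent from the profile, which is the behaviour Step 6 needs for re-ranking.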



In this algorithm the proposed system first generates an ID for each
user who visits the web site and stores his/her activities on the site. Second, the
proposed system builds a database for the concepts the user has used; this database holds
an ID for each subject and an ID for every document that belongs to the concept, and
acts as a thesaurus for the web site. After this point the proposed system
classifies the user's concepts using the concept hierarchy algorithm. When the user
requests anything from the web site, the Onto Server takes the request and rates the
web pages that relate to his/her interests, as well as the web pages recommended
by the Onto Server, using the response equation:

$$\text{Server Response} = \alpha \cdot \text{Onto Score} + (1 - \alpha) \cdot \text{Request Score} \qquad \text{eq. (1)}$$

where α has a value between 0 and 1. When α has a value of 0, the conceptual
rank is not given any weight, and the result is equivalent to pure request-based ranking.
If α has a value of 1, request-based ranking is ignored and only the conceptual rank
is considered. The conceptual and request-based rankings can be blended
by varying the value of α.
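
The blending in eq. (1) can be illustrated with a few lines of Python. This is a sketch of the formula only; the function names, the tuple layout of `pages` and the sample scores are assumptions for demonstration.

```python
def server_response(onto_score, request_score, alpha):
    """Blend the conceptual and request-based scores as in eq. (1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * onto_score + (1.0 - alpha) * request_score


def rerank(pages, alpha):
    """Sort (url, onto_score, request_score) tuples by blended score."""
    return sorted(pages,
                  key=lambda page: server_response(page[1], page[2], alpha),
                  reverse=True)


pages = [("page-a", 0.9, 0.1), ("page-b", 0.1, 0.9)]
print([url for url, _, _ in rerank(pages, alpha=0.0)])  # ['page-b', 'page-a']
print([url for url, _, _ in rerank(pages, alpha=1.0)])  # ['page-a', 'page-b']
```

With α = 0 the order follows the request score alone, and with α = 1 it follows the conceptual score alone, matching the two limiting cases described in the text.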
The proposed system was created using the PHP language with a MySQL relational
database. It includes a shortcuts table: when the user searches for a specific word and
that word does not exist in any of the sites stored in the sites table, the system goes to the
shortcuts table, which may contain subjects related to the word entered by the
user, and returns the contents of these related subjects, which may be as interesting
to the user as the original subject.
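
The fallback from the sites table to the shortcuts table might look like the sketch below. The original system is described as PHP with MySQL; Python with an in-memory SQLite database is used here only to keep the example self-contained, and the column names (`url`, `subject`, `word`, `related_subject`) are hypothetical, since the paper does not give the schema of these two tables.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with assumed table layouts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sites (url TEXT, subject TEXT);
        CREATE TABLE shortcuts (word TEXT, related_subject TEXT);
        INSERT INTO sites VALUES ('http://example.org/rdf', 'rdf');
        INSERT INTO sites VALUES ('http://example.org/ontology', 'ontology');
        INSERT INTO shortcuts VALUES ('owl', 'ontology');
    """)
    return conn


def search(conn, word):
    """Look the word up in sites first, then fall back to shortcuts."""
    rows = conn.execute(
        "SELECT url FROM sites WHERE subject = ? ORDER BY url", (word,)
    ).fetchall()
    if rows:
        return [row[0] for row in rows]
    # No direct hit: return the sites of subjects related to the word.
    rows = conn.execute(
        "SELECT s.url FROM shortcuts AS sc "
        "JOIN sites AS s ON s.subject = sc.related_subject "
        "WHERE sc.word = ? ORDER BY s.url",
        (word,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(search(conn, "rdf"))
print(search(conn, "owl"))
```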
A needs analysis carried out before the development of the search
engine showed that the Onto-WIR system would be of most use on a personal or
departmental level, answering the needs of special interest groups or communities.
The prototype for development had very specific characteristics and was
much more manageable than if a structure of specialized subject gateways had been
used, such as those used by general gateways like Yahoo! or dmoz. The first
objective of the design was to have the simplest possible search system, which
would reduce the problem of over-complicated interfaces for novice users,
reply faster with information in greater detail, and be more agile. The initial
architecture for the application planned for this project is shown in
Figure (3).






















[Figure components: MySQL database; PHP database server; tables data, types, users, dk,
keywords and pma_relation; client (web browser GUI) connected through HTML/HTTP]

Figure (3) The database architecture

The resulting relational schema consists of six tables: data, types, users, data-
keyword relations (dk), keywords and pma_relation (Figure (3) above). This last
relation proves essential when working with this version of MySQL, because it
enables the database management system to interpret relations with many-to-
many cardinality. The users table remains apart from the schema and is not related to
the other tables, as its only task is to administer the various types of
users who access the application.
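
The many-to-many relation between data records and keywords through the dk table can be sketched as follows. Only the table names come from the text; the column names are assumptions, and SQLite is used in place of MySQL so that the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    -- dk links each data record to any number of keywords and vice versa.
    CREATE TABLE dk (
        data_id INTEGER REFERENCES data(id),
        keyword_id INTEGER REFERENCES keywords(id),
        PRIMARY KEY (data_id, keyword_id)
    );
    INSERT INTO data VALUES (1, 'http://example.org/a'), (2, 'http://example.org/b');
    INSERT INTO keywords VALUES (1, 'ontology'), (2, 'mining');
    INSERT INTO dk VALUES (1, 1), (1, 2), (2, 1);
""")


def pages_for_keyword(conn, word):
    """Return the URLs of all data records tagged with `word`."""
    rows = conn.execute(
        "SELECT d.url FROM data AS d "
        "JOIN dk ON dk.data_id = d.id "
        "JOIN keywords AS k ON k.id = dk.keyword_id "
        "WHERE k.word = ? ORDER BY d.url",
        (word,),
    ).fetchall()
    return [row[0] for row in rows]


print(pages_for_keyword(conn, "ontology"))
print(pages_for_keyword(conn, "mining"))
```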

















Figure (4) Relations between tables in the database

Figure (5) WIR interface

Figure (6) The Onto search result


7. Conclusions

The Onto-Mining search engine in this research allows the user to perform
keyword searches on certain types of “ontology” files, and to visually inspect the
files to check their relevance. This research designed the Web Information
Retrieval (WIR) system by building a web search engine based on the user's
interests, which leads to more efficient web site design and to web site
personalization tailored to user needs.
The aim of this paper is to develop a search engine based on ontology
matching within the Semantic Web. It uses data in Semantic Web formats such as
DAML or RDF. When the user inputs a query, the program accepts the query and
transfers it to a machine learning agent. The agent then measures the similarity
between different ontologies and feeds the matched items back to the user.
The Web is a huge, relatively unstructured and sometimes unreliable source of information.
The development of XML and ontology standards for metadata will promote
sharing and introduce a limited amount of structure to the Web, but they are not the
whole solution to the information problem.












References:

[AL-Obaidy05]
Al-Obaidy, Hadeel, "Dynamic Analysis of Structuring Websites through
Websites Engineering", Proceedings of the 3rd Conference on Documentation &
Electronic Archiving, "Knowledge Investment & Management for Decision
Support", Dubai, September 17-19, 2005.
[AL-Obaidy07a]
"Building Ontology Search Engine Based on Concept Hierarchy of User
Interests", The 2007 International Conference on Semantic Web Services
(SWW'07), June 25-28, 2007, Las Vegas, USA.
[AL-Obaidy07b]
"Design Personalized Search Engine Based on Context Awareness", The SLA-
AGC 13th Annual Conference, Information and Knowledge Management in
the Arabian Gulf, Manama, Kingdom of Bahrain, April 3-5, 2007.
[Bara99]
De Bra, P., "Design Issues in Adaptive Web-Site Development", Proceedings of
the 2nd Workshop on Adaptive Systems and User Modeling on the
WWW, 1999.
[Fay01]
Fayyad et al., Advances in Knowledge Discovery and Data Mining, AAAI/MIT
Press, 1996.
[Godbout99]
Godbout, Alain J. (January 1999). Filtering Knowledge: Changing Information
into Knowledge Assets. Journal of Systemic Knowledge Management.
Accessed 01/02/2006. http://www.iste.org/L&L/26/8/summaries.html#Editorial
[Han01]
Han & Kamber, Data Mining: Concepts and Techniques, Morgan
Kaufmann Publishers, 2001.
[Lee01]
T. Berners-Lee, J. Hendler, and O. Lassila. Semantic Web. Scientific American,
284(5):35–43, 2001.
[Mitch97]
Mitchell, Machine Learning, McGraw-Hill, 1997.
[Palm01]
Sean B. Palmer, The Semantic Web: An Introduction, 2001-09.
http://infomesh.net/2001/swintro/#ontInference
[Zaki04]
M.J. Zaki. Mining Non-Redundant Association Rules. Data Mining and
Knowledge Discovery, 9:223-248, 2004.