Towards Knowledge Integration by Ontology-based Web Mining: An Analytical Study

Navigator001@gmail.com
3 The Data Mining

4 The Web Mining

Syntax
The World Wide Web; Architecture; Database Management Systems; Online Public Access Catalogue; Communication Forms; MARC

Duplication
Information Extraction
Web Mining
Ontology
Web Mining
- Web content mining
- Web structure mining
- Web usage mining

Ontology
Ontology-based Web mining


1 Kunder, M. d. (n.d.). WorldWideWebSize.com | The size of the World Wide Web (The Internet). Retrieved September 21, 2011, from http://www.worldwidewebsize.com.

2.1 Knowledge
Sunasee and Sewery
information integration


2 Merriam-Webster (2001). Merriam-Webster, n.d. Web. 28 Sept. 2012. http://www.merriam-webster.com
3 Sunasee and Sewery (2002). Introduction to Knowledge Modeling. Available at www.makhfi.com/KCM_intro.htm
4 Ibid.
5 Murray, K. S. (1996). KI: A tool for Knowledge Integration. Proceedings of the Thirteenth National Conference on Artificial Intelligence.
6 Ibid.


Data Models
2.2 Data Model
Method or Structure


The World Wide Web
Tim Berners-Lee
URL, TCP/IP, HTTP, HTML
o Web directories



7 Linn, M. C. (2006). The Knowledge Integration Perspective on Learning and Instruction. In R. Sawyer (Ed.), The Cambridge Handbook of the Learning Sciences. Cambridge, MA: Cambridge University Press.


o Web search engines
o Meta search engines
o Web portals
o Invisible web catalogue
o Web public access library catalogue
Data Structure
- Structured Query Language (SQL)
- HTML (Hypertext Markup Language)




2.3

2.3.1 Content Spamdexing

Silverstein
91

search engine spam


9 Lim, E.P., Sun, A. (2005). Web Mining - The Ontology Approach. In: Proceedings of The International Advanced Digital Library Conference (IADLC 2005), Nagoya, Japan (August 2005). Available at: http://iadlc.nul.nagoya-u.ac.jp/archives/IADLCpresen/Lim.pdf (accessed 7/3/2012).
10 C. Silverstein, M. R. Henzinger, J. Marais, and M. Moricz (1999). "Analysis of a very large AltaVista query log." ACM SIGIR Forum, 33, pp. 6-12.
11 Z. Gyongyi and H. Garcia-Molina (2005). Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb). Retrieved July 21, 2011, from http://www.airweb.cse.lehigh.edu/2005/gyongyi.pdf



o information needs
o Bandwidth
2.3.2 The Invisible Web
Web Mining
- Surface web
- Invisible web



12 A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference, pages 83-92. Available at http://www.research.microsoft.com/apps/pubs/default.aspx?id=65140
13 Sherman, C., & Price, G. (2001). The Invisible Web: uncovering information sources search engines can't see. Medford, N.J.: CyberAge Books.
14 Michael K. Bergman (2004). The Deep Web: Surfacing Hidden Value. Available at: http://brightplanet.com/images/uploads/12550176481-deepwebwhitepaper.pdf
15 Ibid.
2.3.3
Google, Yahoo!, MSN, AOL, Ask Jeeves



16 Michael K. Bergman (2004). The Deep Web: Surfacing Hidden Value. Available at: http://brightplanet.com/images/uploads/12550176481-deepwebwhitepaper.pdf
17 Jansen, B. J., & Spink, A. (2003). An analysis of web information seeking and use: Documents retrieved versus documents viewed. In Proceedings of the 4th International Conference on Internet Computing, pp. 65-69. Las Vegas, Nevada, 23-26 June.
18 Internet Statistics (2012). Available at: www.internetworldstats.com

3 Data Mining
3.1
Web Mining
Data Mining
Knowledge Discovery from Database
Web Mining
Data Mining
Data Mining
"Knowledge Discovery in Databases"
Data Mining
International Joint Conferences on Artificial Intelligence
U. Fayyad, G. P.-Shapiro, and P. Smyth
KDD
Data Mining


19 Ibid.
20 Jansen, B. J., & Spink, A. (2005). How are we searching the World Wide Web? A comparison of nine large search engine transaction logs. Information Processing and Management, 42(1), 248-263.
21 U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, Vol. 17, No. 3, pp. 37-54, Fall 1996.


KDD
Data Mining
3.2
Jiawei Han, Micheline Kamber, Jian Pei
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge representation (visualization)


22 David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.
23 Han, J., & Kamber, M. (2001). Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers.


3.3
Jiawei Han, Micheline Kamber, Jian Pei
- Data Repository (Database, World Wide Web, Data Warehouse)
- Knowledge Base
- Data mining engine



24 Ibid.
3.4
Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic
1. Anomaly detection
2. Association rule learning (dependency modeling)
3. Clustering
4. Classification
5. Regression
6. Summarization

25 Ibid.
26 Ibid.

3.5
Fayyad, Usama; Piatetsky-Shapiro
1- Model representation (patterns)
2- Model-evaluation criteria (parameters)
3- Search method
3.6
Institute of Electrical and Electronics Engineers (IEEE)


27 Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases". Retrieved 17 December 2008. Available at: http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf



International Conference on Data Mining (ICDM)
3.6.1 Vector space model
Gerard Salton
SMART (System for the Mechanical Analysis and Retrieval of Text)
Bag of words

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$$
$$q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$

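tf-idf
o TF
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
- $n_{i,j}$: number of occurrences of term $t_i$ in document $d_j$
- $\sum_k n_{k,j}$: total number of term occurrences in document $d_j$
Example: TF = 4/100 = 0.04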


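o IDF
$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
- $|D|$: total number of documents in the collection
- $|\{d : t_i \in d\}|$: number of documents containing term $t_i$
Example: IDF = log(1000000/1000) = 3

TF × IDF = 0.04 × 3 = 0.12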
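
The worked figures above (4/100, log(1000000/1000) and 0.04 × 3) can be reproduced with a short Python sketch. The function names are illustrative only, and the base-10 logarithm is an assumption inferred from log(1000000/1000) = 3.

```python
import math


def term_frequency(term_count, total_terms):
    """TF: occurrences of the term divided by all term occurrences in the document."""
    return term_count / total_terms


def inverse_document_frequency(num_documents, docs_with_term):
    """IDF: base-10 log of (collection size / documents containing the term)."""
    return math.log10(num_documents / docs_with_term)


# Figures taken from the worked example in the text.
tf = term_frequency(4, 100)
idf = inverse_document_frequency(1_000_000, 1_000)
score = tf * idf

print(f"tf={tf:.2f} idf={idf:.2f} tf-idf={score:.2f}")
```

A term that is frequent in one document but rare across the collection receives a high weight; a term found in every document gets an IDF of 0.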
3.6.2 C4.5 and beyond

Cases
Ross Quinlan


Sample
Vector
3.6.3 The k-means algorithm
Observations $(x_1, x_2, \ldots, x_n)$, each a d-dimensional real vector; K clusters
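
As a rough illustration of how k-means groups d-dimensional real vectors into K clusters, the following Python sketch runs the usual assign-then-update loop. The sample points, the fixed iteration count and the choice of the first K points as initial centroids are assumptions made for this example, not details taken from the text.

```python
def kmeans(points, k, iterations=10):
    """Partition points (tuples of floats) into k clusters by squared Euclidean distance."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters


# Hypothetical 2-dimensional observations forming two obvious groups.
sample = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(sample, k=2)
print(centroids)
```

Real implementations usually pick the initial centroids at random and stop once assignments no longer change.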
3.6.4 Naive Bayes
Bayes' theorem
4 Web Mining

4.1
Tim Berners-Lee



Ricardo Baeza-Yates
4.2
Ambiguity of information
Target


28 Alesso, H. P., & Smith, C. F. (2006). Thinking on the Web: Berners-Lee, Gödel, and Turing. Hoboken, N.J.: Wiley-Interscience. (p. 67).
29 Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: the concepts and technology behind search (2nd ed.). New York: Addison Wesley. (p. 11).
30 Berners-Lee, T. (n.d.). The Semantic Web: Scientific American. Science News, Articles and Information | Scientific American. Retrieved August 2, 2011, from http://www.scientificamerican.com/article.cfm?id=the-semantic-web
31 Kunder, M. d. (n.d.). WorldWideWebSize.com | The size of the World Wide Web (The Internet). Retrieved September 21, 2011, from http://www.worldwidewebsize.com.


Lexicon Handling

Killer application


32 Sanjib Kumar (March 2009). "Towards Semantic Web Based Search Engines". National Conference on "Advances in Computer Networks & Information Technology" (NCACNIT-09), March 24-25. Available at: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5974163 (accessed 7/2/2012).
Data Dynamic

- .com, .gov, .edu


33 Terrence A. Brooks. Web search: how the Web has changed information retrieval. Information Research, 8(3), paper no. 154, April 2003.
34 Andrew Hammond (2004). Arabic search engine may boost content. Available at http://www.abc.net.au.
35 http://www.translate-to-success.com/online-language-web-site-content.html


Language      Percentage
English       68.40%
Japanese      5.90%
German        5.80%
Chinese       3.90%
French        3.00%
Spanish       2.40%
Russian       1.90%
Italian       1.60%
Portuguese    1.40%
Korean        1.30%
Arabic        .60%

Web Data Mining
4.3 Web Mining
Jaideep Srivastava
Web Log
4.4 Web Mining
Web Graph
Web Mining
Data Mining


36 Top Ten Internet Languages - World Internet Statistics. (n.d.). Internet World Stats - Usage and Population Statistics. Retrieved July 20, 2011, from http://www.internetworldstats.com/stats7.htm
37 J. Srivastava, P. Desikan, and V. Kumar, "Web Mining: Accomplishments and Future Directions," Proc. US Nat'l Science Foundation Workshop on Next-Generation Data Mining (NGDM), Nat'l Science Foundation, 2002.


Kosala and Blockeel
4.4.1 Web structure mining
Hyperlinks
Data Mining
4.4.2 Web content mining
4.4.3 Web usage mining
Web Logs
4.5
Web Crawlers
Spiders, Robots
URLs


38 Kosala, R. and Blockeel, H. (2000). "Web Mining Research: A Survey." SIGKDD Explorations, 2(1), June 2000. Available at http://www.umiacs.umd.edu/~joseph/classes/enee752/Fall09/survey-2000.pdf




Christopher Olston
Marc Najork
Tokenization and Analysis
Web information extraction
HTML, PDF
models
Bag of Words
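
To make the tokenization step and the Bag of Words model concrete, here is a minimal Python sketch. The lower-casing rule, the regular expression and the sample sentence are assumptions chosen for the example; the text does not specify particular normalization rules.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Count token occurrences, ignoring word order."""
    return Counter(tokenize(text))


# Hypothetical page text after HTML tags have been stripped.
bag = bag_of_words("Web mining mines the Web.")
print(bag.most_common(2))
```

The resulting counts are the raw term frequencies that a vector space model would then weight, for example with tf-idf.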


39 Christopher Olston and Marc Najork (2010). Web Crawling. Foundations and Trends in Information Retrieval, Vol. 4, No. 3.
40 Pant, G., Srinivasan, P., & Menczer, F. (n.d.). Crawling the Web. University of Iowa. Retrieved July 21, 2011, from http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf
41 Ibid.
42 Liu, B. (2007). Web data mining: exploring hyperlinks, contents, and usage data. Berlin: Springer.


keywords
Text Normalization Operations
Web Data Model
Models
1. Taxonomy
2. Thesaurus (Hierarchical, Equivalence, Association)


43 Baeza-Yates, R. and Castillo, C. (2005) “ S a c ” P c d f d p Graphs (WAW), Vol. 3243 of Lecture Notes in Computer Science, pp. 156-167, Rome, Italy, Springer.
44 Lim, E. H., Liu, J. N., & Lee, R. S. (2011). Knowledge seeker: ontology modelling for information search and management: a compendium. New York: Springer.
45 Ibid.
46 Ibid.


3. Latent Semantic Index
Singular Value Decomposition (SVD)
Values
4. Topic Maps
ISO
information model
structured
Topic Space
Resource Space
association connection
occurrence connection


47 Ibid.


5. Semantic Network
NLP
LexiGuide
6. Web Ontology Language
Property
Classes
4.6
B. Liu
Web communities
Hyperlink-induced topic search (HITS)
Jon Kleinberg
- Hub nodes



48 http://www.lexiquest.com/
49 Ibid.
50 B. Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Data-Centric Systems and Applications, DOI 10.1007/978-3-642-19460-3_7, © Springer-Verlag Berlin Heidelberg 2011.
51 Ibid.
- Authority nodes
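
The interplay between hub and authority nodes in HITS can be sketched in a few lines of Python. This is a simplified illustration of the mutual-reinforcement idea, not the exact procedure described in the paper: the tiny link graph, the iteration count and the Euclidean normalization are all assumptions.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) scores for a graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auths = {p: sum(hubs[s] for s, ts in links.items() if p in ts) for p in pages}
        norm = math.sqrt(sum(v * v for v in auths.values())) or 1.0
        auths = {p: v / norm for p, v in auths.items()}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hubs = {p: sum(auths[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hubs.values())) or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, auths


# Hypothetical graph: A links to C and D, B links to C.
graph = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
hub_scores, authority_scores = hits(graph)
print(max(authority_scores, key=authority_scores.get))
```

In this toy graph C is pointed to by both linking pages, so it ends up as the strongest authority, while A, which links to both targets, is the strongest hub.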

PageRank
Sergey Brin, Lawrence Page
Web graph
Google
Example: nodes A, B, C and D


52 Levene, M. (2010). An introduction to search engines and web navigation (2nd ed.). Hoboken: Wiley.


Links: B, C, D → A
Contribution of B to A: 0.25/2 = 0.125
PageRank of A: 0.458
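
The surviving figures (0.25/2 = 0.125 and 0.458) match one propagation step of the simplified PageRank calculation on four nodes. The sketch below assumes a specific link structure that is not preserved in the text: B links to A and C, C links to A, and D links to A, B and C. It also assumes every node starts at 0.25 and that no damping factor is applied.

```python
def pagerank_step(links, ranks):
    """One simplified PageRank step: each page splits its rank evenly among its out-links."""
    new_ranks = {page: 0.0 for page in ranks}
    for source, targets in links.items():
        if not targets:
            continue  # pages without out-links pass nothing on in this sketch
        share = ranks[source] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Assumed link structure (hypothetical, chosen to be consistent with the figures in the text).
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
initial = {page: 0.25 for page in links}

after_one_step = pagerank_step(links, initial)
print(round(after_one_step["A"], 3))
```

Under these assumptions A receives 0.25/2 from B, 0.25/1 from C and 0.25/3 from D, which sums to roughly 0.458.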
5 Ontology-based Web mining
Ontology

5.1 The Ontology
Ontology
semantic compatibility
syntax compatibility
ontology
Rudolf Göckel
Jacob Lorhard
Knowledge Representation
Gruber
Harrod's

5.2 Syntax



53 Smith, B. and Welty, C. (2001). Ontology: towards a new synthesis. Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS 2001). ACM Press.
54 Gruber, T. R. (1995). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies, 43(5-6): 907-928.
55 Prytherch, R. J. (2005). Harrod's librarians' glossary and reference book: a directory of over 10,200 terms, organizations, projects and acronyms in the areas of information management, library science, publishing and archive management (10th ed.). Aldershot, Hants, England: Ashgate.


- Entity: Individuals
- Ideas: Classes (individuals, objects, subclasses)
- Properties: Attributes (of Classes and Individuals)
- Relationship
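
The four building blocks listed above (individuals, classes with subclasses, properties and relationships) can be pictured with a small Python data model. This is only an illustrative sketch: the class names, the sample individuals and the "links_to" relationship are hypothetical and do not come from the paper or from any ontology language.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """A class (idea); `parent` makes it a subclass of another class."""
    name: str
    parent: "OntologyClass | None" = None


@dataclass
class Individual:
    """An entity that belongs to a class and carries attribute values (properties)."""
    name: str
    cls: OntologyClass
    attributes: dict = field(default_factory=dict)


def is_instance_of(individual, target):
    """True if the individual's class is the target class or one of its subclasses."""
    current = individual.cls
    while current is not None:
        if current is target:
            return True
        current = current.parent
    return False


# Hypothetical classes and individuals.
document = OntologyClass("Document")
web_page = OntologyClass("WebPage", parent=document)
person = OntologyClass("Person")
home = Individual("home_page", web_page, {"language": "English"})
about = Individual("about_page", web_page)

# A relationship is modelled here as a (subject, predicate, object) triple.
relationships = [(home.name, "links_to", about.name)]
print(is_instance_of(home, document))
```

Languages such as OWL express the same ideas declaratively and add formal semantics that a sketch like this does not attempt to capture.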
5.3
1. OIL: Ontology Inference Layer
On-To-Knowledge
2. DAML: DARPA Agent Markup Language
Defense Advanced Research Projects Agency
Tim Berners-Lee
3. OWL: Web Ontology Language
W3C Recommendation


5.6 OWL
OWL Lite
OWL DL (description logics)


56 Geroimenko, V. (2004). Dictionary of XML technologies and the semantic Web. London: Springer.

OWL Full

5.7
Ee-Peng Lim and Aixin Sun
- Web page classification
- Web clustering (grouping)
- Web extraction (HTML elements)
Web Mining
5.7.1 Ontology-based Web clustering



57 McGuinness, D. L., & van Harmelen, F. W3C. OWL Web Ontology Language Overview. Available at: www.w3.org/TR/owl-features/ (accessed 8/3/2012).
58 Ee-Peng Lim and Aixin Sun (2005). "Web Mining - The Ontology Approach". Available at http://reference.kfupm.edu.sa/content/w/e/web_mining_____the_ontology_approach_61587.pdf



Web Clustering
A
A
Ontology-based Web clustering
HTML elements
Ontology-based Web site structure mining
5.7.2 Ontology-based Web classification
5.7.3 Ontology-based Web extraction
5.8


59 Ibid.


- Improved search of Web data
- Better browsing capabilities
- Personalization of Web data access

6

1


a


b


c

Tokenization


60 Ibid.



d

Spamdexing
information needs

e

Web Mining

7

Resource Description Framework
Web Ontology Web
The Semantic Web
Database Management System
Semantic Search Engines

8
2- A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference, pages 83-92. Available at http://www.research.microsoft.com/apps/pubs/default.aspx?id=65140
3- Alesso, H. P., & Smith, C. F. (2006). Thinking on the Web: Berners-Lee, Gödel, and Turing. Hoboken, N.J.: Wiley-Interscience. (p. 67).
4- Andrew Hammond (2004). Arabic search engine may boost content. Available at http://www.abc.net.au.
5- Berners-Lee, T. (n.d.). The Semantic Web: Scientific American. Science News, Articles and Information | Scientific American. Retrieved August 2, 2011, from http://www.scientificamerican.com/article.cfm?id=the-semantic-web
6- C. Silverstein, M. R. Henzinger, J. Marais, and M. Moricz (1999). "Analysis of a very large AltaVista query log." ACM SIGIR Forum, 33, pp. 6-12.
7- Christopher Olston and Marc Najork (2010). Web Crawling. Foundations and Trends in Information Retrieval, Vol. 4, No. 3.
8- David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.
9- Ee-Peng Lim and Aixin Sun (2005). "Web Mining - The Ontology Approach". Available at http://reference.kfupm.edu.sa/content/w/e/web_mining_____the_ontology_approach_61587.pdf
10- Geroimenko, V. (2004). Dictionary of XML technologies and the semantic Web. London: Springer.
11- Gruber, T. R. (1995). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies, 43(5-6): 907-928.
12- Han, J., & Kamber, M. (2001). Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers.
13- http://www.translate-to-success.com/online-language-web-site-content.html
14- Internet Statistics (2012). Available at: www.internetworldstats.com
15- J. Srivastava, P. Desikan, and V. Kumar, "Web Mining: Accomplishments and Future Directions," Proc. US Nat'l Science Foundation Workshop on Next-Generation Data Mining (NGDM), Nat'l Science Foundation, 2002.
16- Jansen, B. J., & Spink, A. (2003). An analysis of web information seeking and use: Documents retrieved versus documents viewed. In Proceedings of the 4th International Conference on Internet Computing, pp. 65-69. Las Vegas, Nevada, 23-26 June.
17- Jansen, B. J., & Spink, A. (2005). How are we searching the World Wide Web? A comparison of nine large search engine transaction logs. Information Processing and Management, 42(1), 248-263.
18- Kosala, R. and Blockeel, H. (2000). "Web Mining Research: A Survey." SIGKDD Explorations, 2(1), June 2000. Available at http://www.umiacs.umd.edu/~joseph/classes/enee752/Fall09/survey-2000.pdf



19- Kunder, M. d. (2012). WorldWideWebSize.com | The size of the World Wide Web (The Internet). Retrieved October 2, 2012, from http://www.worldwidewebsize.com
20- Levene, M. (2010). An introduction to search engines and web navigation (2nd ed.). Hoboken: Wiley.
21- Lim, E. H., Liu, J. N., & Lee, R. S. (2011). Knowledge seeker: ontology modelling for information search and management: a compendium. New York: Springer.
22- Lim, E.P., Sun, A. (2005). Web Mining - The Ontology Approach. In: Proceedings of The International Advanced Digital Library Conference (IADLC 2005), Nagoya, Japan (August 2005). Available at http://iadlc.nul.nagoya-u.ac.jp/archives/IADLCpresen/Lim.pdf (accessed 10/3/2012).
23- Linn, M. C. (2006). The Knowledge Integration Perspective on Learning and Instruction. In R. Sawyer (Ed.), The Cambridge Handbook of the Learning Sciences. Cambridge, MA: Cambridge University Press.
24- Liu, B. (2007). Web data mining: exploring hyperlinks, contents, and usage data. Berlin: Springer.
25- McGuinness, D. L., & van Harmelen, F. W3C. OWL Web Ontology Language Overview. Available at: www.w3.org/TR/owl-features/ (accessed 8/3/2012).
26- Merriam-Webster (2001). Merriam-Webster, n.d. Web. 28 Sept. 2012. http://www.merriam-webster.com.
27- Michael K. Bergman (2004). The Deep Web: Surfacing Hidden Value. Available at: http://brightplanet.com/images/uploads/12550176481-deepwebwhitepaper.pdf
28- Murray, K. S. (1996). KI: A tool for Knowledge Integration. Proceedings of the Thirteenth National Conference on Artificial Intelligence.
29- Pant, G., Srinivasan, P., & Menczer, F. (n.d.). Crawling the Web. University of Iowa. Retrieved October 2, 2012, from http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf
30- Prytherch, R. J. (2005). Harrod's librarians' glossary and reference book: a directory of over 10,200 terms, organizations, projects and acronyms in the areas of information management, library science, publishing and archive management (10th ed.). Aldershot, Hants, England: Ashgate.
31- Sanjib Kumar (March 2009). "Towards Semantic Web Based Search Engines". National Conference on "Advances in Computer Networks & Information Technology" (NCACNIT-09), March 24-25. Available at: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5974163 (accessed 10/2/2012).
32- Sherman, C., & Price, G. (2001). The Invisible Web: uncovering information sources search engines can't see. Medford, N.J.: CyberAge Books.
33- Smith, B. and Welty, C. (2001). Ontology: towards a new synthesis. Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS 2001). ACM Press.
34- Sunasee and Sewery (2002). Introduction to Knowledge Modeling. Available at www.makhfi.com/KCM_intro.htm
35- Terrence A. Brooks. Web search: how the Web has changed information retrieval. Information Research, 8(3), paper no. 154, April 2003.
36- Top Ten Internet Languages - World Internet Statistics. (n.d.). Internet World Stats - Usage and Population Statistics. Retrieved October 2, 2012, from http://www.internetworldstats.com/stats7.htm
37- U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, Vol. 17, No. 3, pp. 37-54, Fall 1996.
38- Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: the concepts and technology behind search (2nd ed.). New York: Addison Wesley. (p. 11).
39- Z. Gyongyi and H. Garcia-Molina (2005). Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb). Retrieved October 2, 2012, from http://www.airweb.cse.lehigh.edu/2005/gyongyi.pdf