Web Information Retrieval and Extraction


Chia-Hui Chang, Associate Professor

National Central University, Taiwan

chia@csie.ncu.edu.tw


Sep. 21, 2004


Course Content


Web Information Retrieval


Browsing via categories


Searching via search engines


Query answering


Web Information Integration


Web page collection


Data extraction from semi-structured Web pages


Data integration


Web Categories


Yahoo
http://www.yahoo.com


Fourteen categories and ninety subcategories


Categorization by humans


Technology


Document classification


Pros and Cons


Overview of the content in the database


Browsing without specific targets


Search Engines


Google
http://www.google.com


Search by keyword matching


Business model


Technology


Web Crawling


Indexing for fast search


Ranking for good results


Pros and Cons


Search engines locate documents, not answers
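
Of the three technologies listed on this slide, crawling is the easiest to sketch in a few lines. The example below is a minimal illustration, not how any particular search engine works: it assumes a hypothetical in-memory link graph (`LINKS`) instead of real HTTP fetching, and visits pages breadth-first from a seed.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
LINKS = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seed, links, max_pages=100):
    """Breadth-first traversal from the seed; returns pages in visit order."""
    seen = {seed}            # pages already queued, to avoid revisiting
    queue = deque([seed])    # frontier of pages still to visit
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("a.example/", LINKS)
```

A real crawler would replace the dictionary lookup with a page download and link parsing, and would add politeness delays and robots.txt handling, none of which is shown here.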


Question Answering


Ask Jeeves
http://www.ask.com


Input a question or keywords


Relevance feedback from users to clarify the targets


ExtrAns (Molla et al., 2003)


Technology


Text information extraction


Natural Language Processing


Web Page Collection


Metacrawler
http://www.metacrawler.com/


Google


Yahoo


Ask Jeeves


About


LookSmart


Overture


FindWhat


Ebay
http://www.ebay.com/


Information asymmetry between buyers and sellers


Technology


Program generators


WNDL, W4F, XWrap, Robomaker



Data Extraction from Semi-structured Documents


Example


Technology


Information Extraction Systems


WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA,
Roadrunner, EXALG, XWrap, W4F, etc.


Data Annotation


Wrapper induction is an excellent exercise of
machine learning technologies
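
To make the idea of a wrapper concrete, here is a minimal sketch of executing a left/right-delimiter wrapper, loosely in the spirit of the LR wrapper class associated with WIEN. The delimiters are written by hand here; learning them from labelled example pages is the induction step, which is not shown. The sample page and the function name `lr_extract` are assumptions made for illustration.

```python
def lr_extract(page, delimiters):
    """Apply an LR-style wrapper to a page.

    delimiters: list of (left, right) string pairs, one per attribute.
    Returns a list of tuples, one tuple per extracted record.
    """
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no further occurrence: stop
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # unterminated value: stop
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

# Hypothetical semi-structured page listing (country, calling code) pairs.
PAGE = ("<HTML><BODY>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "</BODY></HTML>")

rows = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
# rows == [("Congo", "242"), ("Egypt", "20")]
```

Wrapper induction then becomes a learning problem: given pages with the values marked, find delimiters for which this procedure returns exactly the marked values.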




Data Integration


Technology


Template based interface design


Microsoft


Visual Programming tools



Available Techniques


Artificial Intelligence


Search and Logic programming


Machine Learning


Supervised learning (classification)


Unsupervised learning (clustering)


Database and Warehousing


OLAP and Iceberg queries


Data Mining


Pattern mining from large data sets


Other Disciplines


Statistics, neural networks, genetic algorithms, etc.



Classical Tasks


Classification


Artificial Intelligence, Machine Learning


Clustering


Pattern recognition, neural networks


Pattern Mining


Association rules, sequential patterns, episode mining, periodic patterns, frequent continuities, etc.


Classification Methods


Supervised Learning (Concept Learning)


General-to-specific ordering


Decision tree learning


Bayesian learning


Instance-based learning


Sequential covering algorithms


Artificial neural networks


Genetic algorithms


Reference: Mitchell, 1997
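
As one concrete instance of the methods listed above, the sketch below shows instance-based learning in its simplest form, a k-nearest-neighbour classifier. The two-dimensional feature vectors are hypothetical term counts invented for illustration (say, occurrences of "goal" and "stock" in a document); nothing here comes from the course material itself.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training examples nearest to query.

    train: list of (feature_vector, label) pairs.
    Distance is Euclidean; ties in the vote are broken by first-seen order.
    """
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled documents as (count of "goal", count of "stock").
TRAIN = [
    ((5, 0), "sports"), ((4, 1), "sports"), ((6, 1), "sports"),
    ((0, 5), "finance"), ((1, 4), "finance"), ((1, 6), "finance"),
]

label_a = knn_classify(TRAIN, (5, 1))   # "sports"
label_b = knn_classify(TRAIN, (0, 4))   # "finance"
```

No model is built in advance: all work happens at query time, which is what distinguishes instance-based learning from, for example, decision tree learning.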



Clustering Algorithms


Unsupervised learning (comparative analysis)


Partition Methods


Hierarchical Methods


Model-based Clustering Methods


Density-based Methods


Grid-based Methods


Reference: Han and Kamber (Chapter 8)
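
A partition method can be illustrated with k-means, probably the best-known algorithm in that family. The following is a minimal sketch under simplifying assumptions: the initial centroids are passed in explicitly (so the run is deterministic), distance is Euclidean, and the sample points are made up.

```python
import math

def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [
            # Mean of each coordinate; an empty cluster keeps its old centroid.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, groups = kmeans(POINTS, [(0, 0), (10, 10)])
```

On this data the two groups are the three points near (1, 1) and the three near (8, 8). Hierarchical, density-based and grid-based methods differ mainly in how they define a cluster, not in this assign-and-update loop.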


Pattern Mining


Various kinds of patterns


Association Rules


Closed itemsets, maximal itemsets, non-redundant rules, etc.


Sequential patterns


Episode mining


Periodic patterns


Frequent continuities
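
For the first pattern type above, association rules, the two basic measures are support and confidence: support(X) is the fraction of transactions containing X, and confidence(X -> Y) = support(X ∪ Y) / support(X). The sketch below finds frequent itemsets with a level-wise, Apriori-style search. It is a small illustration on invented transactions, not an efficient implementation.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {frozenset: support} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        kept = {c: s for c in level
                if (s := support(c, transactions)) >= min_support}
        frequent.update(kept)
        size = len(next(iter(kept))) + 1 if kept else 0
        # Join step: unions of two frequent k-itemsets that have k+1 items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune step: every k-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))]
    return frequent

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent (antecedent must occur)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Hypothetical market-basket data.
BASKETS = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

found = frequent_itemsets(BASKETS, min_support=0.5)
rule_conf = confidence(frozenset({"butter"}), frozenset({"bread"}), BASKETS)
```

With a minimum support of 0.5, the pair {milk, butter} is dropped (it occurs in one basket of four), so the triple {bread, milk, butter} is pruned without being counted.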




Applications


Relational Data


E.g. Northern Group Retail (Business Intelligence)


Banking, Insurance, Health, others


Web Information Retrieval and Extraction


Bioinformatics


Multimedia Mining


Spatial Data Mining


Time-series Data Mining


Techniques from Information Retrieval (IR)


Text Operations


Lexical analysis of the text


Elimination of stop words


Index term selection


Indexing and Searching


Inverted files


Suffix trees and suffix arrays


Signature files


Ranking Models


Query Operations


Relevance feedback


Query expansion
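
Several of the items on this slide fit together in a few lines: lexical analysis and stop-word elimination (text operations), an inverted file (indexing), and a simple tf-idf score (one possible ranking model). The sketch below is illustrative only. The stop-word list is a tiny made-up sample, there is no stemming, and the scoring formula tf × log(N / df) is one common variant chosen here as an assumption.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}

def tokenize(text):
    """Lexical analysis: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

def build_index(docs):
    """Build an inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(query, index, n_docs):
    """Rank documents by summed tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; documents with zero score are not returned.
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
DOCS = {
    1: "Web crawling collects pages",
    2: "Inverted files index pages for fast search",
    3: "Ranking models order search results",
}
INDEX = build_index(DOCS)
ranked = search("fast search", INDEX, len(DOCS))
```

Document 2 ranks first because it matches both query terms and "fast" is rarer than "search". Relevance feedback and query expansion would then modify the query terms before a second call to `search`.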



Course Schedule


Techniques from Information Retrieval


Text Operations


Indexing and Searching


Ranking Models


Query Operations


Text Information Extraction for Query answering


AutoSlog, SRV, Rapier, etc.


Data extraction from semi-structured Web pages


WIEN, Softmealy, Stalker, IEPAD, DeLA, Roadrunner,
EXALG, OLERA, etc.


Web page collection


XWrap, W4F, Robomaker, etc.



Grading


Two projects (by groups): 50%


Chosen from the topics covered in the course


Presentation and reports


Paper reading (by yourself): 20%


Presentation


Information Integration Projects: 30%


Chosen freely


Presentation and reports



References


Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley.


Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.


Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.


Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.