Oct 15, 2013

Genres for Web Page Classification

Vedrana Vidulin, Mitja Luštrek, Matjaž Gams

Department of Intelligent Systems

Jožef Stefan Institute

vedrana.vidulin@ijs.si


Why genres?


A genre can be described as the style of a web page.

Training a Genre Classifier

1. Corpus of web pages annotated with genres

2. Set of features

3. Machine learning algorithm

20-Genre Corpus


1539 web pages divided into 20 genres


Multilabeled corpus

The Set of Features

1. URL features

   https, www, document type, URL words…

2. HTML features

   number of hyperlinks to the same domain / total number of hyperlinks, number of tags in a tag group / total number of tags, e.g. for the tag group “Interaction”…

3. Text features

   average number of characters per word, function words, punctuation symbols, content words, number of declarative sentences / total number of sentences…
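
The three feature groups above can be sketched with the Python standard library. This is a minimal illustration, not the authors' feature extractor: the function names, the contents of the hypothetical `INTERACTION_TAGS` set, and the rule that a sentence ending in a period counts as declarative are all assumptions made for the example.

```python
import re
import string
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed members of the "Interaction" tag group; the slides do not list them.
INTERACTION_TAGS = {"form", "input", "button", "select", "textarea"}


def url_features(url):
    """URL features: https, www, document type, URL words."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    last_segment = parsed.path.rsplit("/", 1)[-1]
    doc_type = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
    words = [w.lower() for w in re.split(r"[^A-Za-z]+", host + parsed.path) if w]
    return {
        "https": parsed.scheme == "https",
        "www": host.startswith("www."),
        "doc_type": doc_type,
        "url_words": words,
    }


class TagCounter(HTMLParser):
    """Counts start tags, interaction tags, and same-domain hyperlinks."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).hostname
        self.total_tags = 0
        self.interaction_tags = 0
        self.total_links = 0
        self.same_domain_links = 0

    def handle_starttag(self, tag, attrs):
        self.total_tags += 1
        if tag in INTERACTION_TAGS:
            self.interaction_tags += 1
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self.total_links += 1
            # Resolve relative links against the page URL before comparing hosts.
            if urlparse(urljoin(self.page_url, href)).hostname == self.page_host:
                self.same_domain_links += 1


def html_features(html, page_url):
    """HTML features expressed as ratios, as on the slide."""
    counter = TagCounter(page_url)
    counter.feed(html)
    counter.close()
    links, tags = counter.total_links, counter.total_tags
    return {
        "same_domain_link_ratio": counter.same_domain_links / links if links else 0.0,
        "interaction_tag_ratio": counter.interaction_tags / tags if tags else 0.0,
    }


def text_features(text):
    """Text features: word length, punctuation, declarative-sentence ratio."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    # Crude assumption: a sentence ending in a period is declarative.
    declarative = [s for s in sentences if s.endswith(".")]
    return {
        "avg_chars_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "declarative_ratio": len(declarative) / len(sentences) if sentences else 0.0,
    }
```

Expressing the HTML and text features as ratios rather than raw counts keeps them comparable between short and long pages.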

Machine Learning


Multilabeled classification


20 binary machine learning tasks


Machine learning algorithms:


Decision Tree


Bagging
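
The decomposition into one binary task per genre, and the bagging idea, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the setup used in the experiments: the base learner here is a single-split "stump" standing in for a full decision tree, and all function names and parameters (`n_models`, `seed`) are hypothetical.

```python
import random
from collections import Counter


def to_binary_tasks(page_labels, genres):
    """One yes/no target vector per genre (a page may carry several genres)."""
    return {g: [g in labels for labels in page_labels] for g in genres}


def train_stump(X, y):
    """Stand-in base learner: the single feature/threshold split with fewest errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for positive_above in (True, False):
                preds = [(row[f] > t) == positive_above for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, positive_above)
    _, f, t, positive_above = best
    return lambda row: (row[f] > t) == positive_above


def train_bagging(X, y, n_models=11, seed=0):
    """Bagging: train each model on a bootstrap sample, predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(X) examples with replacement.
        idx = [rng.randrange(len(X)) for _ in X]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

    def predict(row):
        votes = Counter(m(row) for m in models)
        return votes[True] > votes[False]

    return predict


def train_genre_classifiers(X, page_labels, genres):
    """Train one bagged binary classifier per genre."""
    tasks = to_binary_tasks(page_labels, genres)
    return {g: train_bagging(X, y) for g, y in tasks.items()}


def predict_genres(classifiers, row):
    """A page receives every genre whose binary classifier says yes."""
    return {g for g, clf in classifiers.items() if clf(row)}
```

Because each genre gets its own classifier, a page can be assigned zero, one, or several genres, which matches the multi-label corpus.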

Results

             Decision Tree   Bagging   Diff.   Paired t-test
Accuracy     93              95          2     7/13/0
Precision    53              70         17     5/15/0
Recall       43              40         -3     0/19/1
F-Measure    46              48          2     1/18/1
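
For reference, the four measures in the table can be computed for a single binary task as in the sketch below. The function name and the example counts are made up for illustration and are not taken from the experiments; they only show how accuracy can stay high while precision and recall are much lower when a genre has few positive pages.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for one binary genre task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical task: 100 pages, 5 of them in the genre.
# The classifier finds 2 of the 5 and raises 1 false alarm.
y_true = [True] * 5 + [False] * 95
y_pred = [True, True, False, False, False] + [True] + [False] * 94
metrics = binary_metrics(y_true, y_pred)
```

In this made-up case accuracy is 0.96 while recall is only 0.40, because the many correctly rejected pages dominate the accuracy figure. If the table's values are averages over the 20 binary tasks (an assumption), the averaged F-measure need not equal the harmonic mean of the averaged precision and recall.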