ApMl (All Purpose Machine Learning) Toolkit

ApMl (All Purpose Machine Learning) Toolkit

David W. Miller and Helen Howell


Semantic Web Final Project

Spring 2002


Department of Computer Science

University of Georgia



www.cs.uga.edu/~miller/SemWeb


www.cs.uga.edu/~helen/SemWeb/SemWeb.html






2

What Has Been Done


Extensive research into the effectiveness of machine learning algorithms has been performed

Systems are trained on an expert-created taxonomy with expert-specified documents

3

What We Did


Train the system on a domain-specific taxonomy

E.g., CNN's Sports Pages

Test the system's ability to correctly classify documents from a second, similar taxonomy

E.g., Yahoo! Sports Pages

4

Automatic Text Classification via Statistical Methods

Text Categorization is the problem of assigning predefined categories to free-text documents.

Statistical learning methods used in ApMl:

Bayes Method

Rocchio Method (most popular)

K-Nearest Neighbor Classification

Probabilistic Indexing

5

A Probabilistic Generative Model


Define a probabilistic generative model for documents with classes.

Bayes: P(c_j | d) = P(c_j) P(d | c_j) / P(d)

Example: the document "Reinforcement Learning: a Survey" ("This paper surveys the field of reinforcement learning from a computer science perspective. ...") is reduced to a "bag-of-words", a vector of word counts:

35 a, 1 block, 12 computer, 4 field, 1 leg, 7 machine, 44 of, 3 paper, 2 perspective, 1 rate, 5 reinforcement, 9 science, 2 survey, 56 the, 11 this, 1 underrated, ...

Automatic Text Classification through Machine Learning, McCallum et al.
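A minimal sketch of the bag-of-words step in Python (illustrative, not ApMl's actual code; the tokenization regex is an assumption):

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase, split into word tokens, and count occurrences;
    # word order is discarded, which is the point of "bag-of-words".
    return Counter(re.findall(r"[a-z]+", text.lower()))

doc = ("This paper surveys the field of reinforcement "
       "learning from a computer science perspective.")
print(bag_of_words(doc))
# Counter({'the': 1, 'this': 1, 'paper': 1, 'surveys': 1, ...})
```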

6

Bayes Method

Pick the most probable class, given the evidence:

c* = argmax_j P(c_j | d)

- c_j: a class (like "Planning")
- d: a document (like "language intelligence proof...")

Bayes Rule:

P(c_j | d) = P(c_j) P(d | c_j) / P(d)

the probability that category c_j should be assigned to document d

Automatic Text Classification through Machine Learning, McCallum et al.

7

Bayes Rule

P(c_j | d) = P(c_j) P(d | c_j) / P(d)

- P(c_j | d): probability that document d belongs to category c_j
- P(d): probability that a randomly picked document has the same attributes
- P(c_j): probability that a randomly picked document belongs to this category
- P(d | c_j): probability that category c_j contains document d
8

Bayes Method


Generates conditional probabilities of particular words occurring in a document, given that it belongs to a particular category.

A larger vocabulary generates better probability estimates.

Each category is given a threshold p against which the worthiness of a document to fall into that classification is judged.

Documents may fall into one, more than one, or no category.

9

Rocchio Method


Each document d is represented as a vector within a given vector space V:

Documents with similar content have similar vectors

Each dimension of the vector space represents a word selected via a feature selection process

10

Rocchio Method


Values d^(i) for a document d are calculated as a combination of the statistics TF(w, d) and DF(w)

TF(w, d) (Term Frequency) is the number of times word w occurs in document d

DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once

11

Rocchio Method


The inverse document frequency is calculated as

IDF(w) = log( |D| / DF(w) ), where |D| is the total number of documents

The value d^(i) of feature w_i for a document d is calculated as the product

d^(i) = TF(w_i, d) * IDF(w_i)

d^(i) is called the weight of the word w_i in the document d

12

Rocchio Method


Based on word-weight heuristics, the word w_i is an important indexing term for a document d if it occurs frequently in that document

However, words that occur frequently in many documents spanning many categories are rated as less important

13

K-Nearest Neighbor

Features

All instances correspond to points in an n-dimensional Euclidean space

Classification is delayed until a new instance arrives

Classification is done by comparing feature vectors of the different points

The target function may be discrete or real-valued

K-Nearest Neighbor Learning, Dipanjan Chakraborty

14

1-Nearest Neighbor

K-Nearest Neighbor Learning, Dipanjan Chakraborty

15

K-Nearest Neighbor

An arbitrary instance x is represented by (a_1(x), a_2(x), a_3(x), ..., a_n(x)), where a_i(x) denotes a feature

Euclidean distance between two instances:

d(x_i, x_j) = sqrt( sum_{r=1..n} ( a_r(x_i) - a_r(x_j) )^2 )

Find the k nearest neighbors whose distance from the test case falls within a threshold p

If x of those k nearest neighbors are in category c_i, then assign the test case to c_i; otherwise it is unmatched

K-Nearest Neighbor Learning, Dipanjan Chakraborty
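A sketch of this procedure, assuming numeric feature vectors; the function names and the tie-breaking between categories are illustrative assumptions:

```python
import math

def dist(xi, xj):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(test, training, k, p, x):
    # training: list of (feature_vector, category) pairs.
    # Keep the k nearest neighbors, discard any farther than threshold p,
    # and assign a category only if at least x of the remaining neighbors
    # share it; otherwise the test case is unmatched (None).
    nearest = sorted(training, key=lambda t: dist(test, t[0]))[:k]
    nearest = [(v, c) for v, c in nearest if dist(test, v) <= p]
    votes = {}
    for _, c in nearest:
        votes[c] = votes.get(c, 0) + 1
    best = max(votes, key=votes.get, default=None)
    return best if best is not None and votes[best] >= x else None
```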

16

Probabilistic Indexing


The goal is to estimate P(C | s_i, d_m)

The probability that the assignment of term s_i to the document d_m is correct

Once terms have been identified, assign a Form Of Occurrence (FOC)

Certainty that the term is correctly identified

Significance of the term

17

Probabilistic Indexing Cont.


If term t appears in document d and a term descriptor from t to s exists, s being an indexing term, then generate a descriptor indicator

The set of generated term descriptors can be evaluated and a probability calculated that document d lies in class c
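An illustrative, heavily simplified sketch of that last step, assuming descriptor indicators have already been scored with correctness probabilities; the actual Fuhr-Pfeifer model combines descriptions more carefully than this independence (noisy-OR) assumption:

```python
def class_probability(descriptor_probs):
    # descriptor_probs: P(correct) for each descriptor indicator linking
    # the document's terms to the class's indexing terms. Under an
    # independence assumption, the document lies in the class unless
    # every descriptor is wrong.
    p_all_wrong = 1.0
    for p in descriptor_probs:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong
```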

18

ApMl Toolkit


Built on top of and extends existing toolkits

rainbow (CMU)

Machine Learning

wget (GNU)

Web Crawler

4 Machine Learning Algorithms and 2 Classification Committees

Web Crawler and Document Retrieval

Automated Testing

19

Machine Learning Components


4 Machine Learning Algorithms (rainbow)

Naïve Bayes, Rocchio, KNN, Probabilistic Indexing

2 Classification Committees (ApMl)

Weights Assigned for Overall Accuracy

Weights Assigned for Accuracy Within Each Class of the Taxonomy
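A sketch of how such a weighted committee might combine the four classifiers' votes; the names and the dictionary layout are assumptions, not ApMl's actual interface:

```python
def committee_classify(predictions, weights):
    # predictions: {classifier_name: predicted_category}
    # weights: {classifier_name: weight}. For the first committee the
    # weight is the classifier's overall accuracy; for the second it is
    # its accuracy on the category it is currently predicting.
    scores = {}
    for clf, cat in predictions.items():
        scores[cat] = scores.get(cat, 0.0) + weights[clf]
    return max(scores, key=scores.get)
```

For the per-class committee, weights[clf] would be looked up per prediction, e.g. accuracy[clf][predictions[clf]], instead of being a single scalar per classifier.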


22

Document Retrieval


Web Crawler and Document Retrieval


Specify Starting URL


Specify Recursion Depth


Allow Multiple Domain Spanning


Specify Excluded Domains


Store all retrieved pages into a single directory (ApMl)
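These crawler options map directly onto GNU wget flags; a hypothetical Python wrapper of the sort ApMl presumably issues (the function name and argument layout are assumptions, but every flag is a real wget option):

```python
import subprocess

def crawl(start_url, depth, excluded_domains, out_dir):
    subprocess.run([
        "wget",
        "--recursive",                     # follow links from the starting URL
        f"--level={depth}",                # recursion depth
        "--span-hosts",                    # allow multiple-domain spanning
        f"--exclude-domains={','.join(excluded_domains)}",
        "--no-directories",                # flatten the site structure
        f"--directory-prefix={out_dir}",   # single output directory
        start_url,
    ], check=True)
```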


24

Automated Testing


Choose Algorithms to Test


Choose Test Directory


Specify Number of Tests


All results are placed into a persistent window for evaluation


26

Effectiveness: Contingency Table

                         In category   Not in category
Classified into it            a               b
Not classified into it        c               d

Machine Learning for Text Classification, David D. Lewis, AT&T Labs

27


Effectiveness Measures

precision = a / (a + b)

Documents classified correctly vs. all classified into a particular category

recall = a / (a + c)

Documents classified correctly vs. all that should have been classified into the category

accuracy = (a + d) / (a + b + c + d)

Documents classified correctly as positive or negative for a category vs. all documents

Machine Learning for Text Classification, David D. Lewis, AT&T Labs
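The three measures as a small Python helper (illustrative; cell names follow the contingency table above):

```python
def effectiveness(a, b, c, d):
    # a: in the category and classified into it (true positives)
    # b: classified into the category but not in it (false positives)
    # c: in the category but not classified into it (false negatives)
    # d: correctly kept out of the category (true negatives)
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, accuracy
```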

28

Test Plan


Choose two areas and select subcategories


Sports


Football


Tennis


Golf


NBA


Health


Children


Men


Women


29

Test Plan Continued


Sport Web Sites


www.sportsillustrated.com


sports.yahoo.com


www.usatoday.com/sports/sfront.htm


Health Web Sites


www.patient.co.uk


www.cdc.gov/health


www.bbc.co.uk/health

30

Test Plan Continued


Train the system on pages from one domain's taxonomy and test on another taxonomy for the same area

Determine contingency tables for each category

Compute effectiveness using precision, recall, and accuracy

31

Sports Test Results

ApMl Test Results

32

Health Test Results

ApMl Test Results

33

Comparison of Precision

ApMl Test Results

34

Comparison of Recall

ApMl Test Results

35

Comparison of Sports Additional Levels

ApMl Test Results

36

Comparison of Health Additional Levels

ApMl Test Results

37

Comparison of Accuracy

ApMl Test Results

38

Trends of Results


K-Nearest Neighbor effectiveness was significantly lower than that of the other algorithms

It continually assigned documents to the same category

The class of Health was much more difficult for the algorithms to correctly categorize

Children's health is a non-gender class

No improvement in our results with additional training

39

Conclusions


Results of automatic text categorization are subjective

Trends can occur because of various factors

Heterogeneous taxonomies can be used for automatic classification with acceptable effectiveness


More research needed

40

Resources

1. Dipanjan Chakraborty. "K-Nearest Neighbor Learning." A PowerPoint presentation.

2. Norbert Fuhr and Ulrich Pfeifer. "Combining Model-Oriented and Description-Oriented Approaches for Probabilistic Indexing." Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46-56. ACM, New York, 1991.

3. Thorsten Joachims. "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization." Technical Report, CMU, March 1996.

4. Fabrizio Sebastiani. "Machine Learning in Automated Text Categorization." ACM Computing Surveys, 34(1):1-47, 2002.

5. Amit Sheth, et al. "Semantic Web Content Management for Enterprises and the Web." In submission to IEEE Internet Computing.