
Tutorial Version 5/20/2010

TextTools: A Supervised Machine Learning Package in an R Wrapper

TextTools relies on pre-labeled document examples to train algorithms to classify other documents. This 'supervised learning' approach is widely used within computer science. Our example will be a sample of Spanish bill titles. Humans have assigned one of 19 primary topic codes to each bill title.


The TextTools package itself was created by Paul Wolfgang, based on earlier work by Dustin Hillard and Stephen Purpura. It includes four algorithms (SVM, MaxEnt, LingPipe, and Naïve Bayes) and customizable pre-processing components that are described in the TextTools documentation:

http://www.cis.temple.edu/~wolfgang/Text%20Classification%20Tools.pdf

The R wrapper that we will be using here (developed by Loren Collingwood) first prepares and sends jobs to TextTools for processing. We then import the results into R to take advantage of R's statistical offerings more generally.













Steps in the Process.

Create a training database > Train the algorithm > Test the algorithm against labeled documents > Refine through Active Learning > Classify unlabeled documents


Import/Create a training database. A set of pre-labeled documents is required. It must be large enough that part of it can be used to train the system while another part is used to evaluate system performance. There is no simple answer for how many labeled training cases are required (other than more are better!), but a rough starting point might be 100 examples per category.

[Workflow diagram: R Text Tools Wrapper (written by Loren Collingwood) > TextTools (written by Paul Wolfgang) > ODBC-registered Access database > Results]
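For a concrete picture, here is a minimal sketch of such a split in R. The column names (obs, title, Code) mirror the Spanish example used below, but the data frame itself is made up for illustration; in practice TextTools reads the train and test tables directly from the Access database.

#Hypothetical sketch: split a set of pre-labeled titles in half at random
labeled <- data.frame(obs   = 1:200,
                      title = paste("bill title", 1:200),
                      Code  = sample(1:19, 200, replace = TRUE))
train_rows <- sample(nrow(labeled), nrow(labeled) / 2)
train <- labeled[train_rows, ]    #used to train the algorithms
test  <- labeled[-train_rows, ]   #held out to evaluate performance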


Train and Classify. Training the algorithms involves several steps. The general goal is to identify the relevant features of documents that lead humans to assign them to different classes. A central challenge is that documents can contain irrelevant similarities (or differences) that lead to classification errors. Pre-processing helps to mitigate such errors by word stemming (e.g. eliminating suffixes such as 'ing') and by removing common 'stop words' (e.g. a, the, or). The next step is to train the algorithm(s). Then the trained algorithms are used to classify new documents.
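To make the pre-processing concrete, here is a small illustration in plain R. TextTools performs these steps internally; the SnowballC package is a separate CRAN package used here only to show the idea.

#Illustrative sketch only: stop-word removal and stemming by hand
install.packages('SnowballC')
library(SnowballC)
tokens <- tolower(unlist(strsplit("Regulating the licensing of nursing homes", "[^A-Za-z]+")))
tokens <- tokens[!(tokens %in% c("a", "an", "the", "of", "or"))]   #tiny stop-word list
wordStem(tokens, language = "english")   #e.g. "regulating" becomes "regul"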

Test the algorithm against labeled documents. It is important to know how well the system is working. This is typically done by predicting the labels of (i.e. classifying) another set of pre-labeled documents. The machine results can then be compared to the pre-existing labels. TextTools includes options to assess the overall performance of an algorithm; precision, recall, and bias (via a confusion matrix); and algorithm agreement. Typically, information acquired at this stage is used to improve performance through active learning.
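As a rough sketch of what such a comparison amounts to: once the human and machine labels sit side by side in an R data frame (like the 'spanish' data frame built later in this tutorial), agreement can also be checked by hand.

mean(spanish$Code == spanish$new_code_svm)   #share of machine labels that match the human labels
table(spanish$Code, spanish$new_code_svm)    #raw confusion matrix: human labels by predictions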

Active Learning. Active learning generally refers to using information obtained during testing to improve the training process. Which adjustments can improve performance most efficiently? For example, we could increase the size of the training set by classifying additional randomly selected documents. But if the errors are not randomly distributed, it might be more efficient to discover and make adjustments in those areas where the system is performing especially poorly. The confusion matrix, which compares actual and predicted labels, is well suited for this purpose.
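A hypothetical sketch of that targeted alternative: rather than sampling at random, draw extra titles for hand-coding from a topic where the confusion matrix shows many errors. The topic code 15 below is arbitrary, chosen only for illustration.

weak <- spanish[spanish$new_code_svm == 15, ]   #cases the machine assigned to a weak topic
to_code <- weak[sample(nrow(weak), min(25, nrow(weak))), c("obs", "title")]   #titles to send back to human coders
to_code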

Classify unlabeled documents. The active learning process continues until the researcher is satisfied with the system's performance. If overall performance remains unacceptable, an alternative approach is to separate out the cases that do meet a performance threshold. Multiple algorithms can sometimes be used as a measure of confidence in a label. If experiments indicate high reliability when multiple algorithms predict the same label for a case, then cases where the ensemble agrees can be set aside, substantially reducing the cases that require human attention.
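For example, using the agreement scores produced in the 'Algorithm Ensemble Agreement' section below, setting the agreed cases aside might look like this sketch:

agreed <- spanish_agree[spanish_agree$scores >= 3, ]   #3 or 4 algorithms agree: accept the machine label
review <- spanish_agree[spanish_agree$scores < 3, ]    #low agreement: route to human coders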

Performance Assessment. How do we know how well the algorithms are working? The common baseline is to ask whether the machine labels as reliably as the typical human, by comparing human-human agreement with machine-human agreement. However, both humans and machines should do better than what we would expect by chance. A variety of statistics are available that compare observed agreement with expected agreement. We will use Krippendorff's Alpha in this example. The confusion matrix provides additional information about agreement with respect to particular labels, as well as information about the distribution of errors (bias).

We might also want to test the sensitivity of our results by running experiments using different training sets, or by alternating the portion of the data that is used for testing. For some projects, it may also be possible to assess reliability using other indicators (for example, if another project has classified bill topics based on media coverage rather than titles).
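As a preview of the agreement check (the full code appears in the compiled example at the end of this tutorial), the concord package's kripp.alpha() computes the statistic once the labels are in an R data frame:

install.packages('concord')
library(concord)
kripp.alpha(rbind(spanish$Code, spanish$new_code_svm))   #human labels versus SVM predictions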

Setting up the R package

This is an R package for Windows.

1. Download and install R on your desktop. For information about how to do this:

http://faculty.washington.edu/jwilker/559/PAP/install_R.pdf

2. Download the Rtexttools zip file to your desktop, or wherever your browser downloads to. The latest version can be found here:

http://staff.washington.edu/lorenc2/software/index.html

3. Open R. You may need to install some additional packages. If you already have these installed in R, then skip the following two lines of code:

install.packages('tau')
install.packages('RODBC')

If for some reason these packages don't load, go to Help > CRAN home page and search for the package zip file. Download it and install it in the same manner as step 4 below. (Do not extract the zip file before trying to install it in R.)

Now load them:

library(tau)
library(RODBC)

4. At the top menu, go to Packages > Install packages from local zip files, then navigate to where you downloaded the zip file. The package will install in a subdirectory of R on your computer. (You may have to close and reopen R.)

This package may also take longer than normal packages to install because it is large. However, it shouldn't take longer than about 5-7 minutes. When it's finished, type in the R command line:

library(Rtexttools)   #this should load the package

If you get an error message about a package not being found, you will need to repeat step 3 above to install that missing package. Type library(Rtexttools) again. Repeat if necessary!


To access the documentation, type:

?Rtexttools

If this does not work, go to Help > HTML help and search for Rtexttools. The documentation will open in a separate browser. Take a quick look through the documentation and note the 'index' at the bottom.

Import a Train-Test Access database and ODBC register it

We are going to use an example database that is already part of the R package (Spanish.accdb). This is a smallish file of Spanish legislation that has been manually coded for topic. If you wanted to import your own (Access) database, it might be easiest to put it in the same working directory. It must be an Access database that is ODBC registered on the computer you are currently using.

1. Search 'ODBC' on your computer. When the program opens, click the 'Add' option; select the Microsoft Access Driver (*.mdb, *.accdb) option; click Finish; find and select the 'Spanish.accdb' file; click OK; name the Data Source 'spanish' (it could be anything, but this is what it is called in the example code below); click OK.

If you are using Windows 7 and the Access Driver is not listed, please go here for additional information:

http://Faculty.washington.edu/jwilker/559/ODBCWindows7.pdf

Now, so that you will appreciate what we are about to do...

2. Open the Spanish.accdb database. It will be in your R directory under library/Rtexttools/TextTools_V0.11/Spanish.accdb.

There are two tables in Spanish.accdb. We will use the labeled titles in the 'first' table to train the algorithms, and the titles in the 'second' table to test them. Note that both tables include titles and human-assigned labels ("Code"), but only the second table has columns to accept the algorithms' predictions (new_code_svm, etc.).


3. Clear the cells to the right of the titles in the 'second' table, but retain the variable names. Save and close.

Train and Test using labeled documents

To run an example of any of the Rtexttools functions found in the documentation (e.g. texttrain, textclassify), simply type example(functionname). However, we will do this as if we were working on our own project. First, we need to make sure that we are in the directory where the Access database is located:

getwd()   #shows the current working directory

setwd(system.file("TextTools_V0.11", package="Rtexttools"))


Next, we specify the data for training and the algorithm to be trained. We are only going to do this for one algorithm (SVM) here. The documentation has example code for the other algorithms. Copy and paste this to the R command line and run it:

tsvm <- texttrain(method="svm", datasource="spanish",
    table_name="spanish_bills_first", id_column="obs",
    text_column="title", code_column="Code", model="svm_spanish_first",
    remove_stopwords="spanish", do_stemming="Spanish")


Then, we specify the data to be classified and by what algorithm:

csvm <- textclass(method="SVM", datasource="spanish",
    table_name="spanish_bills_second", id_column="obs",
    text_column="title", code_column="Code",
    output_code_col="new_code_svm", model="svm_spanish_first")


Then we send the job to TextTools:

classrun(tsvm, csvm, ttpath=system.file("TextTools_V0.11", package="Rtexttools"))


The classrun command may take a while to run. When the job is finished, return to Spanish.accdb, open it, and look at the second table. There should now be values in the column to the right of the titles (new_code_svm).

You do not need to clear the new_code_svm cells to run TextTools again. The values will be automatically replaced. You do, however, have to CLOSE the Spanish.accdb database.

What have we done?

If you look at the R code above, a lot of it should be familiar. The method is SVM. The datasource is the ODBC name you gave for Spanish.accdb (spanish). The texttrain table is the first table in Spanish.accdb (the textclass table is the second table in the database). The text is located in the 'title' column of the respective tables, while the labels (codes) are found in the 'Code' column.

Because this is a Spanish language example, the code also includes commands to remove (Spanish) stopwords and to do (Spanish) stemming. If the text were English, these could be omitted. Finally, the textclass code specifies that the SVM predictions go into the column 'new_code_svm'.


Performance Assessment

Now we are going to grab information from Spanish.accdb and create an R data frame to explore our results. Make sure that you are in the working directory where the database is located:

getwd()
list.files()   #this will show you all the files in the directory

If not:

setwd(system.file("TextTools_V0.11", package="Rtexttools"))



Now, grab the Access data and create the new data frame (in this example it is 'spanish'):

spanish <- datagrab("spanish.accdb", "spanish_bills_second")

To check that the data frame has been brought into R, type:

head(spanish)   #shows the first six rows and all columns of the new data frame

Great! Now we can take advantage of R's extensive capabilities.



Confusion Matrix

The confusion matrix offers diagnostic information that may lead to more efficient strategies for improvement. To generate a confusion matrix, we must access the data frame (if we are not already there):

data(spanish)


alg_reliable: This option produces a confusion matrix that includes the Krippendorff's Alpha measure of reliability and Percent Correctly Predicted:

data(spanish)

nmax <- alg_reliable(spanish, c("Code", "new_code_svm"),
    method="ordinal", pct.correct=TRUE)
nmax

The output should look something like this (albeit with different labels):


              health   energy   environment
health            14        1             1
energy             0       28             2
environment       13        2             6


The rows are the human-assigned labels and the columns are the predicted labels. The diagonals are the cases of agreement, while the off-diagonals are the prediction errors. The rows tell us how many 'true' cases the system correctly recalled. For example, in the second row we learn that the system correctly labeled 28 of the 30 human-labeled energy cases. The columns tell us how many cases predicted to be about energy are actually about energy ('precision'). In this contrived example, 28 of the 31 cases labeled as energy were actually about energy.

Overall reliability is 72% (48/67). However, the value of the confusion matrix is that it also indicates that the system is doing better for some topics than for others. Environment has low recall (6/21) and better precision (6/9). Health, on the other hand, has higher recall but lower precision.
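These per-topic figures can be recovered directly from the matrix. A quick sketch using the numbers above:

cm <- matrix(c(14,  1, 1,
                0, 28, 2,
               13,  2, 6),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("health", "energy", "environment"),    #rows: human labels
                             c("health", "energy", "environment")))   #columns: predictions
diag(cm) / rowSums(cm)    #recall per topic (energy: 28/30)
diag(cm) / colSums(cm)    #precision per topic (energy: 28/31)
sum(diag(cm)) / sum(cm)   #overall agreement: 48/67, about 72%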

Knowing this can lead to more efficient approaches to improving overall performance. For example, if health and environment bills share common but irrelevant words, then we could expand our list of stopwords to include them. Another possibility is that some of the labels assigned by humans are incorrect. Fixing those human errors might also improve system performance.

Crosstabulate

We might also be curious about how the cases are distributed across the topics. The following creates an object (table) that includes the proportion of cases by topic:

install.packages('gmodels')   #or use the GUI packages option to install
library(gmodels)

CrossTable(spanish$new_code_svm)

To create a list of the proportions, first create an R object:

svm1 <- CrossTable(spanish$new_code_svm)

Then create the list:

svm2 <- t(svm1$prop.row)
svm2

Algorithm Ensemble Agreement

Confusion tables are helpful during the training process, but they are of little value once we predict to unlabeled cases (our ultimate goal). At this stage we can simply assume performance equivalent to what was learned through earlier experiments using labeled data. We can also do spot checking using random samples of the unlabeled data.

A third option is ensemble agreement. Algorithm agreement typically equates with higher precision, so this information can be used to distinguish the cases that are most likely to be incorrectly labeled.

We first create a data frame that includes the classification predictions for the four algorithms. (Here we are using spanish as our example.)


data(spanish)

spanish_agree <- textclassify(spanish, c("new_code_naive", "new_code_maxent",
    "new_code_ling", "new_code_svm"))

To see the results, type:

spanish_agree   #or fix(spanish_agree)

To summarize, type:

table(spanish_agree$scores)

  0   1   2   3   4
 20 178 128 400 228


If we know (from earlier experiments) that precision is very high when there is majority agreement (e.g. 3 or more algorithms agree), then we have managed to automatically and reliably classify two-thirds of our cases.
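A quick sketch of that calculation from the table above:

sc <- table(spanish_agree$scores)
sum(sc[c("3", "4")]) / sum(sc)   #share with majority agreement: (400 + 228) / 954, about 0.66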


FindClusters

This diagnostic tool can be used to isolate cases with similar text but inconsistent labels (codes). For example, it can be used to check that human coders have labeled similar cases consistently before proceeding to the next step of training the algorithms. Or it could be used to discover similar cases that an algorithm is labeling inconsistently.

The default approach generates a report that includes only those clusters with inconsistent labels. The code_column specifies the labels that interest you; here we are using the human-labeled examples. The cluster_column doesn't appear to affect the output. We are using the table that includes both human and machine labels, so we specify it to be the SVM machine predictions (new_code_svm). In the output below, inconsistent human labels (Code) are the focus, but reversing the specification would report only those clusters where the machine labels (new_code_svm) were inconsistent.

As mentioned, FindClusters only reports the clusters with inconsistent labels. If for some reason you wanted to examine every cluster, then you just need to specify a code_column variable that is unique, such as the id column "obs". This would then report all clusters (because all of the codes would be unique and therefore inconsistent).

findclusters(datasource="spanish", table_name="spanish_bills_second",
    id_column="obs", text_column="title", code_column="Code",
    cluster_column="new_code_svm", output_file="clustertest.doc",
    findclusters = "FindClusters", parameters = "Xmx1G", run = TRUE,
    ttpath = "C:\\Program Files\\R\\R2.11.0\\library\\Rtexttools\\TextTools_V0.11")

#the ttpath appears to be unnecessary if you are already in the working
#directory that contains TextTools_V0.11

If this runs successfully, the results are reported in a clustertest.doc file located in the current working directory. An example is below.

Bill   Title                                                                               Code   Cluster
1489   Proposición de Ley reguladora del contrato de aparcamiento de vehículos.              10        14
1285   Proposición de Ley sobre regulación de un contrato de aparcamiento de vehículos.      14        14
 364   Pensiones mínimas y su aplicación al régimen especial de trabajadores autónomos.      15        15
 316   Pensiones mínimas y su aplicación al régimen especial de trabajadores autónomos.      15        15
 239   Pensiones mínimas y su aplicación al régimen especial de trabajadores autónomos.       5        15


Exporting Results from R

If you would like to export an object (such as spanish_agree) as a .csv file:

write.csv(spanish_agree, file = "spanish.csv")

If you are trying to locate it, find out what working directory you are in:

getwd()

Or put it in a different directory first. For example:

setwd("c:/Users/John/Desktop")



--------------------------------------------------------------------------------------------------------------------

#R CODE FOR EXAMPLE COMPILED

install.packages('tau')
install.packages('RODBC')
library(tau)
library(RODBC)
library(Rtexttools)

?Rtexttools

getwd()   #shows the current working directory

setwd(system.file("TextTools_V0.11", package="Rtexttools"))



#RUNNING TEXTTOOLS

#TRAIN SVM

tsvm <- texttrain(method="svm", datasource="spanish",
    table_name="spanish_bills_first", id_column="obs",
    text_column="title", code_column="Code", model="svm_spanish_first",
    remove_stopwords="spanish", do_stemming="Spanish")

#CLASSIFY USING SVM

csvm <- textclass(method="SVM", datasource="spanish",
    table_name="spanish_bills_second", id_column="obs",
    text_column="title", code_column="Code",
    output_code_col="new_code_svm", model="svm_spanish_first")

#SEND THE JOB TO TEXTTOOLS

classrun(tsvm, csvm, ttpath=system.file("TextTools_V0.11", package="Rtexttools"))



#BRING THE RESULTS INTO AN R DATA FRAME

getwd()
list.files()   #this will show you all the files in the directory

setwd(system.file("TextTools_V0.11", package="Rtexttools"))

spanish <- datagrab("spanish.accdb", "spanish_bills_second")

head(spanish)   #shows the first six rows and all columns of the data frame



#CREATING A CONFUSION MATRIX

conf_svm <- confusion(spanish, c("Code", "new_code_svm"))
conf_svm

#KRIPPENDORFF'S ALPHA

install.packages('concord')
library(concord)
data(spanish)

code <- spanish$Code
svm <- spanish$new_code_svm

code_kripp <- rbind(spanish$Code, spanish$new_code_svm,
    spanish$new_code_ling, spanish$new_code_naive, spanish$new_code_maxent)

#For interrater reliability among ALL classifiers:
kripp.alpha(code_kripp)

#Compare one classifier at a time to the 'true' results (here, maxent versus the true code)
code_alg_kripp <- rbind(spanish$Code, spanish$new_code_maxent)
kripp.alpha(code_alg_kripp)

#Produce a confusion matrix that includes the Krippendorff's Alpha and Percent Correctly Predicted:
data(spanish)
nmax <- alg_reliable(spanish, c("Code", "new_code_svm"),
    method="ordinal", pct.correct=TRUE)
nmax



#CALCULATING AGREEMENT AMONG CLASSIFIERS

spanish_agree <- textclassify(spanish, c("new_code_naive", "new_code_maxent",
    "new_code_ling", "new_code_svm"))

spanish_agree   #or fix(spanish_agree)

table(spanish_agree$scores)



#FINDCLUSTERS

findclusters(datasource="spanish", table_name="spanish_bills_second",
    id_column="obs", text_column="title", code_column="Code",
    cluster_column="new_code_svm", output_file="clustertest.doc",
    findclusters = "FindClusters", parameters = "Xmx1G", run = TRUE,
    ttpath = "C:\\Program Files\\R\\R2.11.0\\library\\Rtexttools\\TextTools_V0.11")

#the ttpath appears to be unnecessary if you are currently in the working
#directory that contains TextTools_V0.11



Homework: To confirm that this is all making sense, try the following:

1. Re-run the above, specifying MaxEnt rather than SVM as your classifier.

If that worked, let's take it to a new level!

2. Find another labeled dataset to test. Remember that the data need to be put into an Access database that you ODBC register (you'll want train and test tables per above). To make things easier, put it in the same directory as the current Spanish.accdb file and use the same column labels for your predictions.

3. Bring your results into R and examine them. How did the algorithms perform compared to the Spanish database case?

Looking for labeled data? www.policyagendas.org or www.congressionalbills.org