mallet-eval

16 Oct 2013

MALLET

MAchine Learning for LanguagE Toolkit

Outline

About MALLET
Representing Data
Command Line Processing
Simple Evaluation
Conclusion


About MALLET

"MALLET: A Machine Learning for Language Toolkit."
  written by Andrew McCallum
  http://mallet.cs.umass.edu. 2002.
Implemented in Java, currently version 2.0.6
Motivation:
  Text classification and information extraction
  Commercial machine learning
  Analysis and indexing of academic publications

About MALLET

Main idea
  Text focus: data is discrete rather than continuous, even when values could be continuous
How to
  Command line scripts:
    bin/mallet [command] --[option] [value] …
    Text User Interface ("tui") classes
  Direct Java API
    http://mallet.cs.umass.edu/api
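
The command line pattern above can be illustrated with a small sketch. It only assembles an argument list of the shape bin/mallet [command] --[option] [value]; the helper name mallet_command is an assumption made for this example and is not part of MALLET.

```python
from typing import Dict, List


def mallet_command(command: str, options: Dict[str, str],
                   script: str = "bin/mallet") -> List[str]:
    """Build an argument list: bin/mallet [command] --[option] [value] ..."""
    argv = [script, command]
    for name, value in options.items():
        # Every option is written as "--name" followed by its value.
        argv.append("--" + name)
        argv.append(str(value))
    return argv


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting shape.
    argv = mallet_command("train-classifier", {"input": "training.mallet"})
    print(" ".join(argv))
```

The resulting list could be handed to subprocess.run if MALLET is installed; the sketch itself does not run anything.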

Representations

Transform text documents to vectors x1, x2
Elements of vector are called feature values
  Example: "Feature at row 345 is number of times "dog" appears in document"
Retain meaning of vector indices
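
A minimal sketch of the idea, assuming plain word counts as feature values. The function names build_index and to_vector are invented for this illustration and do not mirror MALLET's Java classes; the point is only that a fixed word-to-index map keeps the meaning of each vector position.

```python
from collections import Counter
from typing import Dict, List


def build_index(documents: List[List[str]]) -> Dict[str, int]:
    """Give each distinct word a fixed index, in order of first appearance."""
    index: Dict[str, int] = {}
    for tokens in documents:
        for token in tokens:
            if token not in index:
                index[token] = len(index)
    return index


def to_vector(tokens: List[str], index: Dict[str, int]) -> List[int]:
    """Count occurrences of each indexed word; unknown words are skipped."""
    vector = [0] * len(index)
    for token, count in Counter(tokens).items():
        if token in index:
            vector[index[token]] = count
    return vector


docs = [["the", "dog", "saw", "the", "cat"], ["dog", "dog", "barks"]]
index = build_index(docs)
vectors = [to_vector(doc, index) for doc in docs]
# index["dog"] names the row that holds the count of "dog" in every vector.
```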


Documents to Vectors

Instances

Command Line

Importing Data
Classification
Sequence Tagging
Topic Modeling

Importing Data

One instance per file
  files in the folder: sample-data/web/en or sample-data/web/de
  command line:
    bin/mallet import-dir --input sample-data/web/* --output web.mallet
One file, one instance per line
  file format: [URL] [language] [text of the page ...]
  command line:
    bin/mallet import-file --input /data/web/data.txt --output web.mallet
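
To make the one-instance-per-line format concrete, here is a small sketch that turns (URL, language, text) triples into lines of the form [URL] [language] [text of the page ...]. The function name and the example URL are hypothetical; the only assumption about the format is that fields are separated by whitespace and each instance fits on one line.

```python
from typing import Iterable, List, Tuple


def format_instances(pages: Iterable[Tuple[str, str, str]]) -> List[str]:
    """Return one '[URL] [language] [text]' line per page."""
    lines = []
    for url, language, text in pages:
        # Collapse newlines and repeated spaces so the instance stays on one line.
        one_line_text = " ".join(text.split())
        lines.append(f"{url} {language} {one_line_text}")
    return lines


pages = [("http://example.org/a", "en", "Hello\nworld")]
lines = format_instances(pages)
```

Writing "\n".join(lines) to a file would give an input suitable for the import-file command shown above.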


Classification

Training a classifier
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
Choosing an algorithm
  MaxEnt, NaiveBayes, C45, DecisionTree and many others.
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
Evaluation
  Randomly split the data into 90% training instances, which will be used to train the classifier, and 10% testing instances.
  bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
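
The random split described above can be sketched as follows. This is an illustration of the idea only, not MALLET's own implementation; the function name random_split and the fixed seed are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(instances: Sequence[T], training_portion: float,
                 seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the instances and cut it into training and testing parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]


train, test = random_split(list(range(100)), 0.9)
# 90 instances end up in train and the remaining 10 in test.
```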

Sequence Tagging

Sequence algorithms
  hidden Markov models (HMMs)
  linear chain conditional random fields (CRFs).
SimpleTagger
  a command line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger

Input file: [feature1 feature2 ... featuren label]
  Bill CAPITALIZED noun
  slept non-noun
  here LOWERCASE STOPWORD non-noun
Train a CRF
  An input file "sample"
  A trained CRF in the file "nouncrf"
  java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
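
The input format above (features first, label last on every line) can be read with a few lines of code. This sketch only parses the text; it is not part of MALLET, and the function name parse_tagger_line is invented for the example.

```python
from typing import List, Tuple


def parse_tagger_line(line: str) -> Tuple[List[str], str]:
    """Split 'feature1 ... featuren label' into (features, label)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    # The last field is the label; everything before it is a feature.
    return fields[:-1], fields[-1]


sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun"""
parsed = [parse_tagger_line(line) for line in sample.splitlines()]
```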


SimpleTagger

A file "stest" to be labeled
  CAPITAL Al
  slept
  here
Label the input
  java -cp "~/mallet/class:~/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
Output
  Number of predicates: 5
  noun CAPITAL Al
  non-noun slept
  non-noun here
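
As a sketch of how such output could be scored, the code below reads the predicted label from the first field of each line and computes token accuracy against gold labels. It assumes the output looks exactly like the lines shown above; the gold labels and both function names are hypothetical.

```python
from typing import List


def predicted_labels(output_lines: List[str]) -> List[str]:
    """Take the first field of every tagged line as the predicted label."""
    labels = []
    for line in output_lines:
        fields = line.split()
        # Skip blank lines and the "Number of predicates" status line.
        if not fields or line.startswith("Number of predicates"):
            continue
        labels.append(fields[0])
    return labels


def token_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


output = [
    "Number of predicates: 5",
    "noun CAPITAL Al",
    "non-noun slept",
    "non-noun here",
]
gold = ["noun", "non-noun", "non-noun"]  # hypothetical gold labels
accuracy = token_accuracy(predicted_labels(output), gold)
```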


Topic Modeling

Building Topic Models
  bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
--input [FILE]
--num-topics [NUMBER]
  The number of topics to use. The best number depends on what you are looking for in the model.
--num-iterations [NUMBER]
  The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
--output-state [FILENAME]
  This option outputs a compressed text file containing the words in the corpus with their topic assignments.



Demo

Methodology

Focus on sequence tagging module in MALLET
  CRF-based implementation
  Some scripts written for importing data and evaluating results
Small corpora collected from web
  Divided into two parts, 80% for training, 20% for test
Evaluate both POS Tagging and Named Entity Recognition
  The performance of training
  Accuracy (POS Tagging) and Precision, Recall and FB1 (NER)
All scripts, corpora and results can be found here
  http://mallet-eval.googlecode.com

A Survey of Named Entity Corpora

Well known named entity corpora
  Language-Independent Named Entity Recognition at CoNLL-2003
    A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    free and public, but needs the RCV1 raw texts as the input
  Message Understanding Conference (MUC) 6 / 7
    not free
  Automatic Content Extraction (ACE) Training Corpus
    not free
Other special purpose corpora
  Enron Email Dataset
    email messages in this corpus are tagged with person names, dates and times.
  A variety of biomedical corpora
    some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora

Two small corpora collected from web
  Penn Treebank Sample
    English POS tagging corpus, ~5% fragment of Penn Treebank, (C) LDC 1995.
    raw, tagged, parsed and combined data from Wall Street Journal
    148120 tokens, 36 standard Treebank POS tags
    http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
  HIT CIR LTP Corpora Sample
    Chinese NER corpora integrated
    10% of the whole corpora (open to public)
    23751 tokens, 7 kinds of named entities
    http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm



Environment

Hardware
  CPU: Q8300 Quad Core 2.50 GHz
  Memory: 3GB
Software
  Fedora 13 x86_64
  Java 1.6.0_18
  MALLET 2.0.6

Data Format and Labels

Data Format
  Each token one row, each feature one column
    Bill noun
    slept non-noun
    Here non-noun
Labels
  Standard Treebank POS tags
    CC   Coordinating conjunction
    CD   Cardinal number
    DT   Determiner
    EX   Existential there
    FW   Foreign word
    IN   Preposition or subordinating conjunction
    JJ   Adjective
    JJR  Adjective, comparative
    JJS  Adjective, superlative
    LS   List item marker
    MD   Modal
    NN   Noun, singular or mass
    NNS  Noun, plural
    ... (36 tags in all)
  HIT Named Entity
    O   not an NE
    S-  forms an NE by itself
    B-  beginning of an NE
    I-  middle of an NE
    E-  end of an NE
    Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun
    Example: 美国 B-Ni  洛杉矶 I-Ni  警察局 E-Ni  (United States / Los Angeles / Police Department)
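
The O / S- / B- / I- / E- scheme above marks where each entity starts and ends. A minimal sketch of recovering entity spans from a tag sequence follows; the function name extract_entities is hypothetical, and the choice to silently drop inconsistent sequences (for example an I- without a preceding B-) is an assumption of this example.

```python
from typing import List, Optional, Tuple


def extract_entities(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (type, start, end) spans, end exclusive, from O/S-/B-/I-/E- tags."""
    entities: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    current_type: Optional[str] = None
    for i, tag in enumerate(tags):
        prefix, _, ne_type = tag.partition("-")
        if prefix == "S":
            # A single token forms a complete entity.
            entities.append((ne_type, i, i + 1))
            start, current_type = None, None
        elif prefix == "B":
            start, current_type = i, ne_type
        elif prefix == "I" and start is not None and ne_type == current_type:
            continue
        elif prefix == "E" and start is not None and ne_type == current_type:
            entities.append((ne_type, start, i + 1))
            start, current_type = None, None
        else:
            # "O" or an inconsistent tag closes any open span without emitting it.
            start, current_type = None, None
    return entities


spans = extract_entities(["B-Ni", "I-Ni", "E-Ni", "O", "S-Nh"])
```

Comparing predicted spans with gold spans, rather than single tags, is what entity-level Precision and Recall are usually computed over.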

Evaluation

Stage     Measure      pos        chunking   ner
Training  Instance #   3982       8936       1286
Training  Tokens #     95767      211727     20913
Training  Time         308m 23s   190m 50s   17m 13s
Test      Tokens #     46452      47377      2829
Test      Accuracy     85.67%     93.97%     98.55%
Test      Precision    -          90.54%     86.89%
Test      Recall       -          89.89%     86.89%
Test      FB1          -          90.21      86.89
Test      Time         15.80s     4.43s      0.8s
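
FB1 is the harmonic mean of Precision and Recall, F = 2PR / (P + R). The short check below recomputes it from the reported Precision and Recall values; the function name f_beta1 is chosen for this example only.

```python
def f_beta1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F-measure with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and Recall as reported for the chunking and ner tasks.
chunking_fb1 = round(f_beta1(90.54, 89.89), 2)  # 90.21
ner_fb1 = round(f_beta1(86.89, 86.89), 2)       # 86.89
```

Both values agree with the reported FB1 figures.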

DEMO

Q&A