Overview of Machine Learning for NLP Tasks: part I


(based partly on slides by Kevin Small and Scott Yih)

Page 2

Goals of Introduction

Frame specific natural language processing (NLP) tasks as machine learning problems
Provide an overview of a general machine learning system architecture
Introduce a common terminology
Identify typical needs of an ML system
Describe some specific aspects of our tool suite with regard to the general architecture
Build some intuition for using the tools

Focus here is on supervised learning

Page 3

Overview

1. Some Sample NLP Problems
2. Solving Problems with Supervised Learning
3. Framing NLP Problems as Supervised Learning Tasks
4. Preprocessing: cleaning up and enriching text
5. Machine Learning System Architecture
6. Feature Extraction using FEX

Page 4

Context Sensitive Spelling [2]

A word-level tagging task:

  I would like a peace of cake for desert.
  I would like a piece of cake for dessert.

  In principal, we can use the solution to the duel problem.
  In principle, we can use the solution to the dual problem.

Page 5

Part of Speech (POS) Tagging

Another word-level task:

  Allen Iverson is an inconsistent player. While he can shoot very well, some nights he will score only a few points.

  (NNP Allen) (NNP Iverson) (VBZ is) (DT an) (JJ inconsistent) (NN player) (. .) (IN While) (PRP he) (MD can) (VB shoot) (RB very) (RB well) (, ,) (DT some) (NNS nights) (PRP he) (MD will) (VB score) (RB only) (DT a) (JJ few) (NNS points) (. .)

Page 6

Phrase Tagging: Named Entity Recognition

A phrase-level task:

  After receiving his M.B.A. from Harvard Business School, Richard F. America accepted a faculty position at the McDonough School of Business (Georgetown University) in Washington.

  After receiving his [MISC M.B.A.] from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] ([ORG Georgetown University]) in [LOC Washington].

Page 7

Some Other Tasks

Text Categorization
Word Sense Disambiguation
Shallow Parsing
Semantic Role Labeling
Preposition Identification
Question Classification
Spam Filtering
...

Page 8

Supervised Learning/SNoW

Page 9

Learning Mapping Functions

Binary Classification
Multi-class Classification
Ranking
Regression

{Feature, Instance, Input} Space: the space used to describe each instance; often R^d, {0,1}^d, or N^d
Output Space: the space of possible output labels; very dependent on the problem
Hypothesis Space: the space of functions that can be selected by the machine learning algorithm; algorithm dependent (obviously)

Page 10

Multi-class Classification [3,4]

One Versus All (OvA)
Constraint Classification
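
To make the one-versus-all scheme concrete, here is a minimal sketch (not the SNoW implementation) that trains one binary linear classifier per label over sparse feature vectors and predicts the label whose classifier scores highest; the perceptron-style update, learning rate, and data representation are illustrative assumptions.

    # One-versus-all over sparse binary feature vectors (illustrative sketch).
    from collections import defaultdict

    def train_ova(examples, labels, epochs=5, lr=1.0):
        # One sparse weight vector per label.
        weights = {y: defaultdict(float) for y in labels}
        for _ in range(epochs):
            for features, gold in examples:
                for y, w in weights.items():
                    score = sum(w[f] for f in features)
                    target = 1 if y == gold else -1      # one binary problem per label
                    if target * score <= 0:              # mistake-driven update
                        for f in features:
                            w[f] += lr * target
        return weights

    def predict(weights, features):
        # Winner-take-all over the per-label scores.
        return max(weights, key=lambda y: sum(weights[y][f] for f in features))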

Page 11

Online Learning [5]

SNoW algorithms include Winnow and Perceptron
Learning algorithms are mistake driven
Search for a linear discriminant along the function gradient (unconstrained optimization)
Provides the best hypothesis using the data presented up to the present example
Learning rate determines convergence
  Too small and it will take forever
  Too large and it will not converge
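
As an illustration of mistake-driven updates, the sketch below contrasts a Perceptron-style additive update with a Winnow-style multiplicative update on sparse binary features; the learning rate, promotion factor, threshold, and the {+1, -1} label encoding are assumptions for the example, not SNoW's exact parameterization.

    # Mistake-driven online updates on sparse binary features (illustrative sketch).
    from collections import defaultdict

    def perceptron_update(w, features, y, lr=0.1):
        # Additive update: change weights only when the current hypothesis errs.
        if y * sum(w[f] for f in features) <= 0:     # y is +1 or -1
            for f in features:
                w[f] += lr * y

    def winnow_update(w, features, y, alpha=1.5, theta=1.0):
        # Multiplicative update: promote or demote the active features on a mistake.
        pred = 1 if sum(w[f] for f in features) >= theta else -1
        if pred != y:
            factor = alpha if y == 1 else 1.0 / alpha
            for f in features:
                w[f] *= factor

    w_perc = defaultdict(float)          # Perceptron weights start at 0
    w_winn = defaultdict(lambda: 1.0)    # Winnow weights start positive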

Page 12

Framing NLP Problems as Supervised Learning Tasks

Page 13

Defining Learning Problems [6]

ML algorithms are mathematical formalisms, and problems must be modeled accordingly

Feature Space: the space used to describe each instance; often R^d, {0,1}^d, or N^d
Output Space: the space of possible output labels, e.g.
  the set of part-of-speech tags
  the correctly spelled word (possibly from a confusion set)
Hypothesis Space: the space of functions that can be selected by the machine learning algorithm, e.g.
  Boolean functions (e.g. decision trees)
  linear separators in R^d

Page 14

Context Sensitive Spelling

  Did anybody (else) want too sleep for to more hours this morning?

Output Space
  Could use the entire vocabulary: Y = {a, aback, ..., zucchini}
  Could also use a confusion set: Y = {to, too, two}
  Model as (single-label) multi-class classification

Hypothesis space is provided by SNoW
Need to define the feature space
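
A minimal sketch of this framing: each occurrence of a confusion-set word becomes one multi-class example whose label is the word that actually appears and whose features are drawn from the surrounding context. The tokenization and the two context features used here are illustrative assumptions.

    # Frame confusion-set spelling as multi-class examples (illustrative sketch).
    CONFUSION_SET = {"to", "too", "two"}

    def make_examples(tokens):
        examples = []
        for i, word in enumerate(tokens):
            if word.lower() in CONFUSION_SET:
                features = {f"w[-1]={tokens[i - 1]}" if i > 0 else "w[-1]=<S>",
                            f"w[+1]={tokens[i + 1]}" if i + 1 < len(tokens) else "w[+1]=</S>"}
                examples.append((features, word.lower()))   # (features, label)
        return examples

    tokens = "Did anybody else want too sleep for to more hours this morning ?".split()
    print(make_examples(tokens))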

Page 15

What are 'feature' and 'feature type', anyway?

A feature type is any characteristic (relation) you can define over the input representation.

Example: feature type = word bigrams
  Sentence: The man in the moon eats green cheese.
  Features: [The_man], [man_in], [in_the], [the_moon], ...

In natural language text, sparseness is often a problem
  How many times are we likely to see "the_moon"?
  How often will it provide useful information?
  How can we avoid this problem?
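
A minimal sketch of the word-bigram feature type described above, assuming simple whitespace tokenization (a simplification of what FEX actually does):

    # Extract word-bigram features from a sentence (illustrative sketch).
    def bigram_features(sentence):
        tokens = sentence.rstrip(".").split()
        return [f"[{a}_{b}]" for a, b in zip(tokens, tokens[1:])]

    print(bigram_features("The man in the moon eats green cheese."))
    # ['[The_man]', '[man_in]', '[in_the]', '[the_moon]', ...]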


Page 16

Preprocessing: cleaning up and enriching text

Assuming we start with plain text:

  The quick brown fox jumped over the lazy dog. It landed on Mr. Tibbles, the slow blue cat.

Problems:
  Often we want to work at the level of sentences and words
  Where are sentence boundaries? 'Mr.' vs. 'cat.'
  Where are word boundaries? 'dog.' vs. 'dog'

Enriching the text, e.g. POS-tagging:

  (DT The) (JJ quick) (NN brown) (NN fox) (VBD jumped) (IN over) (DT the) (JJ lazy) (NN dog) (. .)
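
As a rough illustration of the boundary problems above, here is a naive splitter that ends a sentence at every period unless it follows a known honorific, and separates trailing punctuation from words; the honorifics list and rules are assumptions, and the real segmenters introduced below handle many more cases.

    # Naive sentence and word splitting (illustrative sketch only).
    import re

    HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr."}

    def split_sentences(text):
        sentences, current = [], []
        for token in text.split():
            current.append(token)
            if token.endswith(".") and token not in HONORIFICS:
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    def split_words(sentence):
        # Separate trailing punctuation so 'dog.' becomes 'dog' and '.'
        return re.findall(r"\w+|[^\w\s]", sentence)

    text = ("The quick brown fox jumped over the lazy dog. "
            "It landed on Mr. Tibbles, the slow blue cat.")
    for s in split_sentences(text):
        print(split_words(s))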



Page 17

Download Some Tools

http://l2r.cs.uiuc.edu/~cogcomp/
  Software::tools, Software::packages

Sentence segmenter
Word segmenter
POS-tagger
FEX

NB: RIGHT-CLICK on the "download" link and select "save link as..."

Page 18

Preprocessing scripts

http://l2r.cs.uiuc.edu/~cogcomp/

sentence-boundary.pl
  ./sentence-splitter.pl -d HONORIFICS -i nyttext.txt -o nytsentence.txt

word-splitter.pl
  ./word-splitter.pl nytsentence.txt > nytword.txt

Invoking the tagger:
  ./tagger -i nytword.txt -o nytpos.txt

Check the output

Page 19

Problems running .pl scripts?

Check the first line:
  #!/usr/bin/perl

Find the perl location on your own machine; e.g. you might need:
  #!/local/bin/perl

Check file permissions:
  > ls -l sentence-boundary.pl
  > chmod 744 sentence-boundary.pl

Page 20

Minor Problems with install

Possible (system-dependent) compilation errors:
  doesn't recognize 'optarg'
    POS-tagger: change the Makefile in subdirectory snow/ where indicated
    sentence-boundary.pl: try 'perl sentence-boundary.pl'
  Link error (POS tagger): linker can't find lxnet
    remove the '-lxnet' entry from the Makefile

Generally, check the README and Makefile for hints

Page 21

The System View

Page 22

A Machine Learning System

[Diagram] Raw Text -> Preprocessing -> Formatted Text -> Feature Extraction -> Feature Vectors. Training examples (feature vectors plus labels) feed the Machine Learner, which outputs function parameters to the Classifier(s)/Inference stage; testing examples are then classified to produce labels.

Page 23

Preprocessing Text

Sentence splitting, word splitting, etc.
Put data in a form usable for feature extraction

  They recently recovered a small piece of a live Elvis concert recording.
  He was singing gospel songs, including "Peace in the Valley."

  0 0 0 They
  0 0 1 recently
  0 0 2 recovered
  0 0 3 a
  0 0 4 small
  piece 0 5 piece
  0 0 6 of
  :
  0 1 6 including
  0 1 7 QUOTE
  peace 1 8 Peace
  0 1 9 in
  0 1 10 the
  0 1 11 Valley
  0 1 12 .
  0 1 13 QUOTE
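
A rough sketch of producing this column format; the column order (label, sentence index, word index, word) is read off the listing above, while the target set and the QUOTE convention are assumptions made for the illustration.

    # Emit "label sentence-idx word-idx word" lines, labeling target words and
    # marking everything else with 0 (illustrative sketch of the format above).
    TARGETS = {"piece", "peace"}   # assumed target/confusion set for this example

    def to_columns(sentences):
        lines = []
        for s_idx, sentence in enumerate(sentences):
            for w_idx, word in enumerate(sentence):
                word = "QUOTE" if word == '"' else word
                label = word.lower() if word.lower() in TARGETS else "0"
                lines.append(f"{label} {s_idx} {w_idx} {word}")
        return lines

    sents = [["They", "recently", "recovered", "a", "small", "piece", "of", "a",
              "live", "Elvis", "concert", "recording", "."],
             ["He", "was", "singing", "gospel", "songs", ",", "including", '"',
              "Peace", "in", "the", "Valley", ".", '"']]
    print("\n".join(to_columns(sents)))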

Page 24

A Machine Learning System

[Diagram detail] Raw Text -> Preprocessing -> Formatted Text -> Feature Extraction -> Feature Vectors

Page 25

Feature Extraction with FEX

Page 26

Feature Extraction with FEX

FEX (Feature Extraction tool) generates abstract representations of text input
  Has a number of specialized modes suited to different types of problems
  Can generate very expressive features
  Works best when the text is enriched with other knowledge sources, i.e., you need to preprocess the text

  S = I would like a piece of cake too!

FEX converts the input text into a list of active features...

  1: 1003, 1005, 1101, 1330...

where each numerical feature corresponds to a specific textual feature:

  1:    label[piece]
  1003: word[like] BEFORE word[a]

Page 27

Feature Extraction

Converts formatted text into feature vectors
The lexicon file contains the feature descriptions

  0 0 0 They
  0 0 1 recently
  0 0 2 recovered
  0 0 3 a
  0 0 4 small
  piece 0 5 piece
  0 0 6 of
  :
  0 1 6 including
  0 1 7 QUOTE
  peace 1 8 Peace
  0 1 9 in
  0 1 10 the
  0 1 11 Valley
  0 1 12 .
  0 1 13 QUOTE

  ->  0, 1001, 1013, 1134, 1175, 1206
  ->  1, 1021, 1055, 1085, 1182, 1252

  (feature ids are defined in the Lexicon File)
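
A minimal sketch of the lexicon idea: each textual feature is assigned a stable numeric id the first time it is seen, and each example becomes the list of ids of its active features. The starting id and the feature strings are illustrative assumptions; the actual index conventions are shown on the Lexicon & Example slides below.

    # Map textual features to numeric ids via a growing lexicon (illustrative sketch).
    class Lexicon:
        def __init__(self, first_id=1001):
            self.ids = {}
            self.next_id = first_id

        def id_for(self, feature):
            # Reuse an existing id so the same feature never gets two indices.
            if feature not in self.ids:
                self.ids[feature] = self.next_id
                self.next_id += 1
            return self.ids[feature]

    lex = Lexicon()
    example = sorted(lex.id_for(f) for f in ["w[like]", "w[a]", "w[like]-w[a]"])
    print(example)     # [1001, 1002, 1003]
    print(lex.ids)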

Page 28

Role of FEX

  Why won't you accept the facts?
  No one saw her except the postman.

FEX (Feature Extraction) turns these into examples:

  1, 1001, 1003, 1004, 1006:   lab[accept], w[you], w[the], w[you*], w[*the]
  2, 1002, 1003, 1005, 1006:   lab[except], w[her], w[the], w[her*], w[*the]

Page 29

Four Important Files

FEX works with four files:
  Script:  1. controls FEX's behavior  2. defines the "types" of features
  Corpus:  a new representation of the raw text data
  Example: feature vectors for SNoW
  Lexicon: mapping of features to feature ids

Page 30

Corpus: General Linear Format

The corpus file contains the preprocessed input, with a single sentence per line.
When generating examples, FEX never crosses line boundaries.
The input can be any combination of:
  1st form: words separated by white space
  2nd form: tag/word pairs in parentheses
There is a more complicated 3rd form, but it is deprecated in view of an alternative, more general format (later).

Page 31

Corpus: Context Sensitive Spelling

  Why won't you accept the facts?
  (WRB Why) (VBD wo) (NN n't) (PRP you) (VBP accept) (DT the) (NNS facts) (. ?)

  No one saw her except the postman.
  (DT No) (CD one) (VBD saw) (PRP her) (IN except) (DT the) (NN postman) (. .)


Page 32

Script: Means of Feature Engineering

FEX does not decide on or find good features.
Instead, FEX gives you an easy way to define feature types, and extracts the corresponding features from the data.
Feature engineering is in fact very important in practical learning tasks.

Page 33

Script: Description of Feature Types

What can be good features? Let's try some combinations of words and tags.

Feature types in mind:
  Words around the target word (accept, except)
  POS tags around the target
  Conjunctions of words and POS tags?
  Bigrams or trigrams?
  Include relative locations?

Page 34

Graphical Representation

  position:  0    1    2    3    4       5    6      7
  tag:       WRB  VBD  NN   PRP  VBP     DT   NNS    .
  word:      Why  won  't   you  accept  the  facts  ?
  offset:    -4   -3   -2   -1   0       1    2      3
                                 (target)

  Window [-2,2] covers offsets -2 through 2 (positions 2-6).

Page 35

Script: Syntax

Syntax:
  targ [inc] [loc]: RGF [[left-offset, right-offset]]

targ: the target index
  If targ is -1, target-file entries are used to identify the targets
  If no target file is specified, then EVERY word is treated as a target
inc: use the actual target instead of the generic placeholder ('*')
loc: include the location of the feature relative to the target
RGF: defines "types" of features like words, tags, conjunctions, bigrams, trigrams, etc.
left-offset and right-offset: specify the window range

Page 36

Basic RGFs: Sensors (1/2)

  Type    Mnemonic  Interpretation                          Example
  Word    w         the word (spelling)                     w[you]
  Tag     t         part-of-speech tag                      t[NNP]
  Vowel   v         active if the word starts with a vowel  v[eager]
  Length  len       length of the word                      len[5]

A sensor is the fundamental method of defining "feature types."
It is applied to an element and generates active features.

Page 37

Basic RGFs: Sensors (2/2)

  Type        Mnemonic  Interpretation                              Example
  City List   isCity    active if the phrase is the name of a city  isCity[Chicago]
  Verb Class  vCls      returns Levin's verb class                  vCls[51.2]

More sensors can be found by looking at the FEX source (Sensors.h)
lab: a special RGF that generates labels, e.g. lab(w), lab(t), ...
Sensors are also an elegant way to incorporate our background knowledge.
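
As a rough illustration of what sensors do, the sketch below writes each mnemonic from the tables above as a small function from a (word, tag) element to its active features; the function signatures and the lab wrapper shown here are assumptions for the illustration, not FEX's internals.

    # "Sensors" as functions from a (word, tag) element to active features
    # (illustrative sketch; mnemonics follow the tables above).
    def w(word, tag):      return [f"w[{word}]"]
    def t(word, tag):      return [f"t[{tag}]"]
    def v(word, tag):      return [f"v[{word}]"] if word[0].lower() in "aeiou" else []
    def length(word, tag): return [f"len[{len(word)}]"]      # the 'len' mnemonic

    def lab(sensor):
        # lab(...) wraps another sensor so its output becomes a label feature.
        return lambda word, tag: [f"label[{feat}]" for feat in sensor(word, tag)]

    element = ("eager", "JJ")
    for sensor in (w, t, v, length, lab(w)):
        print(sensor(*element))
    # ['w[eager]'] ['t[JJ]'] ['v[eager]'] ['len[5]'] ['label[w[eager]]']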

Page 38

Complex RGFs

Existential Usage
  len(x=3), v(X)
Conjunction and Disjunction
  w&t; w|t
Collocation and Sparse Collocation
  coloc(w,w); coloc(w,t,w); coloc(w|t,w|t)
  scoloc(t,t); scoloc(t,w,t); scoloc(w|t,w|t)

Page 39

(Sparse) Collocation

  position:  0    1    2    3    4       5    6      7
  tag:       WRB  VBD  NN   PRP  VBP     DT   NNS    .
  word:      Why  won  't   you  accept  the  facts  ?
  offset:    -4   -3   -2   -1   0       1    2      3
                                 (target)

-1 inc: coloc(w,t)[-2,2]
  w['t]-t[PRP], w[you]-t[VBP], w[accept]-t[DT], w[the]-t[NNS]

-1 inc: scoloc(w,t)[-2,2]
  w['t]-t[PRP], w['t]-t[VBP], w['t]-t[DT], w['t]-t[NNS],
  w[you]-t[VBP], w[you]-t[DT], w[you]-t[NNS],
  w[accept]-t[DT], w[accept]-t[NNS],
  w[the]-t[NNS]
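
A rough Python sketch of the difference illustrated above: coloc pairs adjacent elements in the window, while scoloc ("sparse") pairs all ordered, possibly non-adjacent elements; the word-of-the-left-element / tag-of-the-right-element pattern follows the example, but this is an illustration, not FEX's implementation.

    # coloc vs. scoloc over a window around the target (illustrative sketch).
    from itertools import combinations

    def window(items, target_idx, left, right):
        lo, hi = max(0, target_idx + left), min(len(items), target_idx + right + 1)
        return items[lo:hi]

    def coloc_w_t(tagged, target_idx, left=-2, right=2):
        win = window(tagged, target_idx, left, right)
        # Adjacent pairs only: word of the left element, tag of the right element.
        return [f"w[{w1}]-t[{t2}]" for (w1, _), (_, t2) in zip(win, win[1:])]

    def scoloc_w_t(tagged, target_idx, left=-2, right=2):
        win = window(tagged, target_idx, left, right)
        # All ordered pairs, skipping intermediate elements ("sparse").
        return [f"w[{w1}]-t[{t2}]" for (w1, _), (_, t2) in combinations(win, 2)]

    sent = [("Why", "WRB"), ("won", "VBD"), ("'t", "NN"), ("you", "PRP"),
            ("accept", "VBP"), ("the", "DT"), ("facts", "NNS"), ("?", ".")]
    print(coloc_w_t(sent, target_idx=4))
    print(scoloc_w_t(sent, target_idx=4))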


Page 40

Examples: 2 Scripts

Download the examples from the tutorial page ('context sensitive spelling materials' link)

accept-except-simple.scr:
  -1: lab(w)
  -1: w[-1,1]

accept-except.scr:
  -1: lab(w)
  -1: w|t [-2,2]
  -1 loc: coloc(w|t,w|t) [-3,-3]

Page 41

Lexicon & Example (1/3)

Corpus:
  ... (NNS prices) (CC or) (VB accept) (JJR slimmer) (NNS profits) ...

Script: ae-simple.scr
  -1: lab(w)
  -1: w[-1,1]

Lexicon:
  1 label[w[except]]
  2 label[w[accept]]
  1001 w[or]
  1002 w[slimmer]

Example:
  2, 1001, 1002;
  (the label 2 is generated by lab(w); features 1001 and 1002 by w[-1,1])

Feature indices of lab start from 1.
Feature indices of regular features start from 1001.
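
To connect the pieces, here is a sketch that reproduces the example line above from the corpus snippet and the lexicon shown; the window handling (using only the words at offsets -1 and +1, with the target itself excluded, as in the lexicon above) is an assumption for the illustration, not the fex implementation.

    # Rebuild "2, 1001, 1002;" from the corpus snippet and lexicon above
    # (illustrative sketch).
    lexicon = {"label[w[except]]": 1, "label[w[accept]]": 2,
               "w[or]": 1001, "w[slimmer]": 1002}

    tagged = [("NNS", "prices"), ("CC", "or"), ("VB", "accept"),
              ("JJR", "slimmer"), ("NNS", "profits")]
    target_idx = 2                                   # the target word "accept"

    label = lexicon[f"label[w[{tagged[target_idx][1]}]]"]
    features = sorted(lexicon[f"w[{tagged[i][1]}]"]
                      for i in (target_idx - 1, target_idx + 1))   # w[-1,1]
    print(f"{label}, {', '.join(map(str, features))};")            # 2, 1001, 1002;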

Page 42

Lexicon & Example (2/3)

Target file: fex -t ae.targ ...
  accept
  except
  (we treat only these two words as targets)

Lexicon file
  If the file does not exist, fex will create it.
  If the file already exists, fex will first read it and then append the new entries to it.
  This is important because we don't want two different feature indices representing the same feature.

Page 43

Lexicon & Example (3/3)

Example file
  If the file does not exist, fex will create it.
  If the file already exists, fex will append new examples to it.
  Only active features and their corresponding lexicon items are generated.
  If the read-only lexicon option is set, only those features from the lexicon that are present (active) in the current instance are listed.

Page 44

Now practice

Change the script, run FEX, and look at the resulting lexicon/examples:

  > ./fex -t ae.targ ae-simple.scr ae-simple.lex short-ae.pos short-ae.ex

Page 45

Citations

1) F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
2) A. R. Golding and D. Roth. A Winnow-Based Approach to Spelling Correction. Machine Learning, 34:107-130, 1999.
3) E. Allwein, R. Schapire, and Y. Singer. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
4) S. Har-Peled, D. Roth, and D. Zimak. Constraint Classification: A New Approach to Multiclass Classification. In Proc. 13th Annual Intl. Conf. on Algorithmic Learning Theory, pp. 365-379, 2002.
5) A. Blum. On-Line Algorithms in Machine Learning. 1996.



Page 46

Citations

6) T. Mitchell. Machine Learning. McGraw Hill, 1997.
7) A. Blum. Learning Boolean Functions in an Infinite Attribute Space. Machine Learning, 9(4):373-386, 1992.
8) J. Kivinen and M. Warmuth. The Perceptron Algorithm vs. Winnow: Linear vs. Logarithmic Mistake Bounds when Few Input Variables are Relevant. UCSC-CRL-95-44, 1995.
9) T. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895-1923, 1998.