Machine learning for Natural Language Processing

Maria Simi, 2008/2009
Tasks and solutions

Given:
- A set of NL classification tasks
- A set of classification models

Experiment by applying one or more of the methods to one of the tasks:
- Design the experiment
- Evaluate the results
- Document the process
We concentrate on classification

- Classification is the task of choosing the correct class label for a given input
- Each input is considered in isolation from all other inputs
- The set of labels is defined in advance
- Specializations: multi-class classification, sequence classification
- Classification is a supervised learning task (vs. clustering)

Supervised classifier
Building a classifier

- Decide what features of the input are relevant (the feature set)
- Decide how to encode those features
- Build a feature extractor
- Build a classifier
- Train on the training corpus
- Evaluate on the test corpus
- Perform error analysis

The data are split into training, test, and (optionally) development corpora. A minimal sketch of the whole pipeline follows.
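A minimal sketch of the pipeline in Python with NLTK (NLTK is not prescribed by these notes; the toy corpus and the bag-of-words feature choice are hypothetical):

import nltk

def extract_features(text):
    # Feature extractor: encode the relevant properties of the input.
    # Here, a toy bag-of-words encoding (one possible choice).
    return {word: True for word in text.split()}

# Hypothetical labelled corpus of (text, label) pairs.
labelled_data = [("good match great goal", "sport"),
                 ("parliament vote budget", "politics")]
featuresets = [(extract_features(t), label) for (t, label) in labelled_data]
train_set, test_set = featuresets[:1], featuresets[1:]

classifier = nltk.NaiveBayesClassifier.train(train_set)  # train on training corpus
print(nltk.classify.accuracy(classifier, test_set))      # evaluate on test corpus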
Evaluation

Contingency table (or confusion matrix) for a category C:

              In C   Not in C
Assigned       TP      FP
Not assigned   FN      TN

- TP: correctly classified
- TN: correctly non-classified
- FP: incorrectly classified
- FN: incorrectly non-classified

Measures:
- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F-Measure = (2 * Precision * Recall) / (Precision + Recall)
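A small helper computing the four measures from the contingency counts defined above (a plain Python sketch, not code from the course):

def evaluate(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

# The WSD example later in these notes: 75 classified, 50 correct, 100 total.
print(evaluate(tp=50, fp=25, fn=50, tn=0))  # precision 2/3, recall 1/2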
Macro-averaging and micro-averaging

If we have several categories C1, ..., Cn, there are two ways to obtain a global evaluation:

1. Macro-averaging
   1. Evaluate classification w.r.t. each Ci (and its complement)
   2. Average the evaluation measures over the categories
2. Micro-averaging
   1. Build a single contingency table for all the data by summing the scores for each category

Macro-averaging gives equal weight to each category, while micro-averaging gives equal weight to each object, as the sketch below illustrates.
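A sketch contrasting macro- and micro-averaged precision over per-category contingency counts (the numbers are hypothetical):

counts = {"C1": (90, 10), "C2": (1, 9)}   # category -> (TP, FP)

macro = sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)
tp_sum = sum(tp for tp, fp in counts.values())
fp_sum = sum(fp for tp, fp in counts.values())
micro = tp_sum / (tp_sum + fp_sum)

print(macro)  # 0.5: each category weighs the same
print(micro)  # ~0.83: each object weighs the same, so the big C1 dominates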
Confusion matrices

A table where each cell [i, j] indicates how often class j was predicted when the correct label was i. Diagonal entries indicate classes that were correctly predicted; off-diagonal entries indicate errors.
Cross-validation

- Multiple evaluations on different test sets
- We subdivide the original corpus into N subsets called folds
- For each fold, we train a model using all of the data except the data in that fold, and then test the model on the fold
- Combine the scores
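A minimal sketch of N-fold cross-validation, assuming data is a list of labelled examples and train/evaluate are placeholders for a real learner and scorer:

def cross_validate(data, n_folds, train, evaluate):
    fold_size = len(data) // n_folds
    scores = []
    for i in range(n_folds):
        test = data[i * fold_size:(i + 1) * fold_size]          # the held-out fold
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        model = train(training)                                 # train on the rest
        scores.append(evaluate(model, test))                    # test on the fold
    return sum(scores) / n_folds                                # combine the scores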
Classification problems in NLP

- Preprocessing tasks: sentence splitting
- Grammatical tasks: PoS tagging, parsing
- Semantic tasks: word sense disambiguation, super sense tagging, Named Entity Recognition
- Document classification: deciding whether an email is spam or not, sorting by topic, sentiment classification (positive, negative, neutral)
A selection of tasks

Let us concentrate on:
- POS tagging
- Word Sense Disambiguation
- Document classification

Assuming the task is clear:
- Which data are available for training?
- Which features are important? How do I represent these features?
- Which model should I choose?
- How do I evaluate the results? Establish a baseline and choose evaluation metrics.
PoS tagging

- Resource for EVALITA POS tagging: http://medialab.di.unipi.it/evalita/
- train: repubblica.tanl (~112,500 tokens)
- test: wiki.tanl (~5,000 tokens)
- annotated with TANL morphed tags

Data format (one token per line, word followed by its tag):
è             VAip3s
stata         VApsfs
accolta       Vpsfs
in            E
genere        Sms
con           E
disinteresse  Sms
.             FS
A             E
ben           B
pensarci      Vfc
,             FF
l'            RDns
intervista    Sfs
dell'         EAns
on.           SA
Formica       SP
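A possible reader for this one-token-per-line format (a sketch: whitespace-separated word/tag columns, and a blank line as sentence boundary are assumptions about the file layout):

def read_tanl(path):
    sentences, current = [], []
    for line in open(path, encoding="utf-8"):
        line = line.strip()
        if not line:                      # blank line = sentence boundary
            if current:
                sentences.append(current)
                current = []
        else:
            word, tag = line.split()[:2]  # word followed by its PoS tag
            current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences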

PoS: features

- Feature size: unigrams (single tokens), bigrams, ...
- Context window size: within -2, +2
- Word features: word form (W), PoS of previous words
- Word position: ordered (-2, -1, 0, +1, +2); left/right
- Word shape: whether capitalized, whether first word in sentence

Feature representation (an example): a vector of strings, each coding a (position, feature, value) triple:
("-1%W%il", "-1%P%RDms", "-2%W%come", ...)
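A sketch of a feature extractor producing position%feature%value strings of the kind shown above, for a -2..+2 window around token i (the exact offset formatting and the shape features are assumptions):

def pos_features(words, tags, i):
    feats = []
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(words):
            pos = f"{offset:+d}" if offset else "0"
            feats.append(f"{pos}%W%{words[j]}")   # word form
            if offset < 0:                        # PoS of previous words only
                feats.append(f"{pos}%P%{tags[j]}")
    if words[i][0].isupper():
        feats.append("0%SHAPE%capitalized")       # word-shape feature
    if i == 0:
        feats.append("0%SHAPE%first")             # first word in sentence
    return feats

print(pos_features(["come", "il", "vento"], ["B", "RDms", "Sms"], 2))
# ['-2%W%come', '-2%P%B', '-1%W%il', '-1%P%RDms', '0%W%vento']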
TANL tagset

- http://medialab.di.unipi.it/wiki/Tanl_POS_Tagset
- 14 coarse-grain tags
- ~100 fine-grain tags (with some morphology)
Word Sense Disambiguation

We concentrate on a supervised approach to WSD. Alternatives are:
- Knowledge-based disambiguation: use of external resources such as WordNet (first sense, glosses, similarity measures, etc.)
- Unsupervised approaches (discrimination) with clustering methods

Two task settings:
- All-words WSD: resolve all the open-class words in a text
  e.g. "He put his suit over the back of the chair" (put, suit, back, chair)
- Target WSD (as in Senseval 1, 2, ...): disambiguate a set of target words
WSD: SemCor

SemCor corpus (a subset of the Brown corpus):
- 186 texts; 192,639 words; annotation of all open words (nouns, verbs, adjectives, and adverbs)
- annotated with POS, lemma, and WordNet 1.6 senses

<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>
<wf cmd=done pos=NN lemma=approval wnsn=1 lexsn=1:04:02::>approval</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Gov._Price_Daniel</wf>
<wf cmd=ignore pos=POS>'s</wf>
<punc>``</punc>
<wf cmd=done pos=JJ lemma=abandoned wnsn=1 lexsn=5:00:00:uninhabited:00>abandoned</wf>
<wf cmd=done pos=NN lemma=property wnsn=2 lexsn=1:21:00::>property</wf>
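A rough regex-based reader for <wf ...> lines like the ones above (a sketch: real SemCor files are SGML with more attribute variants than handled here):

import re

WF = re.compile(r"<wf ([^>]*)>([^<]*)</wf>")

def parse_wf(line):
    m = WF.search(line)
    if not m:
        return None
    attrs = dict(a.split("=", 1) for a in m.group(1).split())
    return m.group(2), attrs   # token and its annotations (pos, lemma, wnsn, ...)

print(parse_wf("<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>"))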
WSD: Senseval

- http://www.senseval.org/
- Contains training data for 35 ambiguous words in context (sentences), annotated with their sense
- The test set consists of sentences including those words, to be annotated
WSD: features

- Feature size: unigrams (single tokens), bigrams, ...
- Context window size: within -5, +5
- Word features: word form (W), lemma (L), fine-grain PoS tag (P) or coarse-grain PoS tag (CP)
- Word position: unordered (bag of words); ordered (-5, ..., -2, -1, +1, +2, ..., +5); left/right
- Which words? Only content words (less effective); all words (more effective)
- Word shape: whether capitalized, whether first word in sentence, other forms of word shape
  e.g. s = "Merrill Lynch& Co."; sh(s) = "Xx... Xx... &Xx.."
- WordNet first sense for unknown words (a knowledge-based heuristic)
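One possible word-shape function sh, mapping characters to X/x/d classes and collapsing repeats (an assumption: the slide does not fix the details of the mapping):

import re

def sh(s):
    shape = re.sub(r"[A-Z]", "X", s)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)   # collapse runs: "Xxxxxx" -> "Xx"

print(sh("Merrill Lynch& Co."))  # "Xx Xx& Xx." under these assumptions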
WSD: the problem of sparse data

Data are typically very sparse (unbalanced): some senses are never present or have low frequency counts.

Strategies:
- smoothing techniques such as the m-estimate
- when not able to disambiguate, select the most frequent sense from the training data
WSD: evaluation

Evaluation metrics:
- Precision = #correct / #classified
- Recall = #correct / #total

Example: test on 100 words; the system disambiguates 75 words, among them 50 are correct. Then Precision = 50/75 and Recall = 50/100.

If all occurrences are tagged, Precision = Recall.
WSD: links

- SemCor: http://www.cs.unt.edu/~rada/downloads.html
- SENSEVAL: http://www.senseval.org/
- WordNet: http://wordnet.princeton.edu/
Document classification: a classical benchmark

- Reuters-21578 collection of news articles
- 21,578 articles sorted into 135 categories by TOPIC
- Not all documents are categorized
- Each document may have multiple categories
- Some categories have no examples
- Used subsets:
  - R10 (the 10 most populated categories)
  - R90 (categories with at least 1 positive example)
- http://www.daviddlewis.com/resources/testcollections/reuters21578/
- Too complex?
Document classification: polarity or news

- Polarity dataset v2: a movie review collection sorted according to polarity (positive, negative)
  - 1000 positive reviews, 1000 negative reviews
  - http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 20 Newsgroups: news classified into 20 categories according to topic
  - 20 categories, ~1000 documents per category
  - http://people.csail.mit.edu/jrennie/20Newsgroups/
Common models for classification in NLP

- Decision tree classifiers
- Naïve Bayes classifiers
- Linear classifiers
- Maximum Entropy classifiers
- HMM classifiers
- SVM classifiers
- Nearest-neighbour classifiers
Decision tree (DT) classifiers

Example: a decision tree to classify names into the classes M and F.
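A possible reconstruction of this example with NLTK's decision tree learner (the feature choice here, the last letter of the name, and the toy training names are assumptions):

import nltk

def gender_features(name):
    return {"last_letter": name[-1].lower()}

train = [(gender_features(n), g) for n, g in
         [("Maria", "F"), ("Anna", "F"), ("Marco", "M"), ("Paolo", "M")]]
dt = nltk.DecisionTreeClassifier.train(train)
print(dt.classify(gender_features("Laura")))  # 'F' on this toy training set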
Naïve Bayes

An instance x is described as a conjunction of attribute values a1, a2, ..., an. The target function takes values in C.

The maximum a posteriori class:

    c_MAP = argmax_{c_j ∈ C} P(a1, a2, ..., an | c_j) P(c_j)

By the simplifying independence assumption:

    c_NB = argmax_{c_j ∈ C} P(c_j) ∏_i P(a_i | c_j)

Probability estimates:

    P(c_j) = |c_j| / N
    P(a_i | c_j) = |a_i, c_j| / |c_j|
Find by LS a line to separate

Figure 2.1 (© HTF 2001): A classification example in two dimensions. The classes are coded as a binary variable (GREEN = 0, RED = 1) and then fit by linear regression. The line is the decision boundary defined by x^T w = 0.5. The red shaded region denotes the part of input space classified as RED, while the green region is classified as GREEN. The decision boundary {x | x^T w = 0.5} is linear (and seems to make many errors on the training data). Is it true?

The corresponding classification rule:

    ŷ(x) = 1 if x^T w > 0.5, 0 otherwise
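The least-squares decision rule above as a short sketch with NumPy (the weight vector and test point are hypothetical):

import numpy as np

def classify(x, w):
    # y(x) = 1 (RED) if x.w > 0.5, else 0 (GREEN)
    return 1 if x @ w > 0.5 else 0

w = np.array([0.3, 0.9])                  # hypothetical weights fit by least squares
print(classify(np.array([0.2, 0.8]), w))  # 1: this point falls in the RED region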
1-NN

Figure 2.3 (© HTF 2001): the same classification example in two dimensions as in Figure 2.1 (GREEN = 0, RED = 1), classified by the 1-nearest-neighbour rule.

- Very flexible!
- No misclassifications in the training data (0 training error): what about test data?
- The decision boundary is not linear: it is quite irregular
- It may be unnecessarily noisy (e.g. for scenario 1)
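A minimal 1-nearest-neighbour classifier (Euclidean distance), matching the flexible, irregular decision boundary discussed above; the toy points are hypothetical:

import math

def one_nn(x, training):
    # training: list of (point, label) pairs; return the label of the closest point
    return min(training, key=lambda p: math.dist(x, p[0]))[1]

data = [((0.0, 0.0), "GREEN"), ((1.0, 1.0), "RED")]
print(one_nn((0.9, 0.8), data))   # "RED"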
The project matrix

                          DT    NB    LS    K-NN
PoS tagging
WSD
Document classification          x

Each cell is a possible task/model combination for the project; the x marks the combination worked out in the following slides (Naïve Bayes for document classification).
Naïve Bayes classifiers

- In naïve Bayes classifiers, every feature contributes to determining which class label should be assigned to a given input value.
- To choose a label for an input value, the naïve Bayes classifier begins by calculating the prior probability of each class, which is determined by checking the frequency of each label in the training set.
- The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label.
- The label whose likelihood estimate is the highest is then assigned to the input value.

Classifying news

Classification process
NB and text classification: formalization

- X: instance space, consisting of text documents; Docs: training set of classified text documents
- C = {c1, ..., cn}; Vocabulary: all the distinct words in X
- Bag-of-words representation for documents: {a1, a2, ..., ak} with aj ∈ Vocabulary, the set of word occurrences extracted from the corpus. Note: we assume position does not count.
- Target function: f : X → C
- Classification of a new document:

    c_NB = argmax_{c_j ∈ C} P(c_j) P(a1 | c_j) ... P(ak | c_j)
NB and text classification: example

- C = {like, dislike}
- X consists of 1000 documents: 700 like and 300 dislike
- Classify: x = {learning, approach, trouble, problem, ...}

    c_NB = argmax_{v_j ∈ {like, dislike}} P(v_j) P(a1 | v_j) ... P(ai | v_j)

Compute:

    P(like) P(learning | like) ... P(problem | like)
    P(dislike) P(learning | dislike) ... P(problem | dislike)

and return the class with the maximum value. The training set is used to estimate the probabilities. How? A sketch with hypothetical numbers follows.
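The like/dislike computation as a sketch; the conditional probabilities are invented for illustration (the slide leaves them to be estimated from training data):

from math import prod

priors = {"like": 0.7, "dislike": 0.3}
cond = {  # P(word | class), hypothetical numbers
    "like":    {"learning": 0.05, "approach": 0.04, "trouble": 0.001, "problem": 0.002},
    "dislike": {"learning": 0.01, "approach": 0.01, "trouble": 0.01,  "problem": 0.02},
}
x = ["learning", "approach", "trouble", "problem"]

scores = {c: priors[c] * prod(cond[c][w] for w in x) for c in priors}
print(max(scores, key=scores.get))  # the class with the maximum value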
NB and text classification: estimation of P's

P(c_j) = |Docs_j| / |Docs|, where Docs_j are the documents classified as c_j:

    P(like) = 700/1000 = 0.7
    P(dislike) = 300/1000 = 0.3

1. Estimation of P(a_i | like) with maximum likelihood: n_i / n, where
   n_i = number of times a_i occurs in Docs_like
   n = number of word positions in Docs_like

2. Smoothing: estimation of P(a_i | like) with the m-estimate:

    m-estimate = (n_i + m*p) / (n + m)

   Assuming m = |Vocabulary| (since a_i can be any word in Vocabulary) and p = 1/|Vocabulary| (uniform distribution over the m unseen examples):

    m-estimate = (n_i + 1) / (n + |Vocabulary|)
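The m-estimate as a function; with the slide's choices m = |Vocabulary| and p = 1/|Vocabulary| it reduces to add-one (Laplace) smoothing (the counts below are hypothetical):

def m_estimate(n_i, n, m, p):
    return (n_i + m * p) / (n + m)

V = 10000                                    # hypothetical vocabulary size
print(m_estimate(n_i=3, n=500, m=V, p=1/V))  # == (3 + 1) / (500 + 10000)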
Naïve Bayes learner: algorithm

Learn_Naïve_Bayes_text(Docs, C)
1. Vocabulary ← set of distinct words in Docs
2. Compute P(c_j) and P(w_k | c_j). For each c_j in C do:
   - Docs_j ← subset of Docs classified as c_j
   - P(c_j) ← |Docs_j| / |Docs|
   - Text_j ← a single document obtained by concatenating all of Docs_j
   - n ← total number of word positions in Text_j
   - For each word w_k in Vocabulary:
     - n_k ← number of occurrences of w_k in Text_j
     - P(w_k | c_j) ← (n_k + 1) / (n + |Vocabulary|)
Naïve Bayes classifier: algorithm

Classify_Naïve_Bayes_text(Doc)
- positions ← word positions in Doc whose words belong to Vocabulary
- Return c_NB, where:

    c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(a_i | c_j)
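A direct Python rendering of the two algorithms above (a sketch that keeps the slides' names and the add-one estimate; it sums logs instead of multiplying raw probabilities to avoid numeric underflow, which the slides do not mention):

from collections import Counter
from math import log

def learn_naive_bayes_text(docs):
    # docs: list of (list_of_words, class_label) pairs
    vocabulary = {w for words, _ in docs for w in words}
    prior, cond = {}, {}
    for c in {label for _, label in docs}:
        texts = [words for words, label in docs if label == c]
        prior[c] = len(texts) / len(docs)                # P(c_j) = |Docs_j| / |Docs|
        text_c = [w for words in texts for w in words]   # Text_j: concatenate Docs_j
        n = len(text_c)                                  # word positions in Text_j
        counts = Counter(text_c)
        cond[c] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, prior, cond

def classify_naive_bayes_text(doc, vocabulary, prior, cond):
    positions = [w for w in doc if w in vocabulary]      # keep known words only
    return max(prior, key=lambda c: log(prior[c]) +
               sum(log(cond[c][w]) for w in positions))

# Toy usage (hypothetical two-document training set):
docs = [(["fun", "great", "match"], "like"), (["boring", "trouble"], "dislike")]
model = learn_naive_bayes_text(docs)
print(classify_naive_bayes_text(["great", "trouble", "fun"], *model))  # "like"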