Global Inference in Learning for Natural Language Processing


Vasin Punyakanok
Department of Computer Science
University of Illinois at Urbana-Champaign

Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak

Page 2

Story Comprehension

1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925.

Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

Page 3

Stand Alone Ambiguity Resolution

Context Sensitive Spelling Correction
  Illinois' bored of education → board of education

Word Sense Disambiguation
  ... Nissan Car and truck plant is ...
  ... divide life into plant and animal kingdom ...

Part of Speech Tagging
  (This DT) (can N) (will MD) (rust V)    DT, MD, V, N

Coreference Resolution
  The dog bit the kid. He was taken to a hospital.
  The dog bit the kid. He was taken to a veterinarian.

Page 4

Textual Entailment

  Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
  → Yahoo acquired Overture.

Question Answering

  Who acquired Overture?

Page 5

Inference and Learning

- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes.
- Learned classifiers for different sub-problems.
- Incorporate the classifiers' information, along with constraints, in making coherent decisions: decisions that respect the local classifiers as well as domain- and context-specific constraints.
- Global inference for the best assignment to all variables of interest.

Page 6

Text Chunking

[Figure: x = "The guy standing there is so tall"; y = the chunk labels NP, VP, ADVP, VP, ADJP over the corresponding spans]

Page 7

Full Parsing

[Figure: x = "The guy standing there is so tall"; y = the full parse tree with nodes S, NP, VP, ADVP, ADJP, NP, VP]

Page 8

Outline

- Semantic Role Labeling Problem
  - Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs Local Training
  - Utility of Constraints in the Inference
- Conclusion

Page 9

Semantic Role Labeling

  I left my pearls to my daughter in my will.

  [I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC .

  A0: Leaver
  A1: Things left
  A2: Benefactor
  AM-LOC: Location

Page 10

Semantic Role Labeling

- PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
  - It adds a layer of generic semantic labels to Penn Treebank II.
  - (Almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA
  - different semantics for each verb
  - specified in the PropBank frame files
- 13 types of adjuncts labeled as AM-arg, where arg specifies the adjunct type

Page 11

Semantic Role Labeling

Page 12

The Approach

- Pruning
  - Use heuristics to reduce the number of candidates (modified from [Xue & Palmer '04])
- Argument Identification
  - Use a binary classifier to identify arguments
- Argument Classification
  - Use a multiclass classifier to classify arguments
- Inference
  - Infer the final output satisfying the linguistic and structural constraints

Each stage operates on the running example "I left my nice pearls to her"; a sketch of the pipeline appears below.
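A minimal sketch (not the authors' implementation) of how the four stages compose; `prune`, `identifier.is_argument`, `classifier.label_scores`, and `ilp_inference` are hypothetical stand-ins for the pruning heuristics, the binary identifier, the multiclass classifier, and the ILP step described on the following slides:

    # Hypothetical four-stage SRL pipeline; all component interfaces are assumed.
    def srl_pipeline(sentence, parse, prune, identifier, classifier, ilp_inference):
        candidates = prune(parse)                                 # 1. pruning heuristics
        candidates = [c for c in candidates
                      if identifier.is_argument(sentence, c)]     # 2. argument identification
        scores = {c: classifier.label_scores(sentence, c)         # 3. argument classification
                  for c in candidates}
        return ilp_inference(candidates, scores)                  # 4. constrained inference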

Page 13

Learning

- Both the argument identifier and the argument classifier are trained as phrase-based classifiers.
- Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse]
- Learning algorithm: SNoW
  - Sparse network of linear functions
  - Weights learned by a regularized Winnow multiplicative update rule with averaged weight vectors
  - Probability conversion is done via softmax: p_i = exp(act_i) / Σ_j exp(act_j)
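As a concrete illustration of the softmax conversion above (a sketch, not the SNoW internals), activations act_i map to probabilities as follows:

    import math

    # p_i = exp(act_i) / sum_j exp(act_j); subtracting the max is only for
    # numerical stability and does not change the result.
    def softmax(activations):
        m = max(activations)
        exps = [math.exp(a - m) for a in activations]
        z = sum(exps)
        return [e / z for e in exps]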

Page 14

Inference

- The output of the argument classifier often violates some constraints, especially when the sentence is long.
- Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
- Input:
  - the probability estimates from the argument classifier
  - structural and linguistic constraints
- This allows incorporating expressive (non-sequential) constraints on the variables (the argument types).

Page 15

Integer Linear Programming Inference

- For each argument a_i and type t (including null), set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.
- Goal is to maximize
    Σ_{i,t} score(a_i = t) · a_{i,t}
  subject to the (linear) constraints.
  - Any Boolean constraint can be encoded this way.
- If score(a_i = t) = P(a_i = t), the objective finds the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
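A minimal sketch of this formulation using the open-source PuLP modeling library (the slides do not show the authors' actual solver setup; `candidates`, `score`, and the label set below are hypothetical inputs):

    import pulp

    LABELS = ["A0", "A1", "A2", "AM-LOC", "V", "null"]   # illustrative subset

    def srl_ilp(candidates, score):
        prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
        # Boolean variable x[i, t] = 1 iff candidate a_i is labeled t
        x = {(i, t): pulp.LpVariable(f"x_{i}_{t.replace('-', '_')}", cat=pulp.LpBinary)
             for i in range(len(candidates)) for t in LABELS}
        # Objective: sum over i, t of score(a_i = t) * x[i, t]
        prob += pulp.lpSum(score[i][t] * x[i, t]
                           for i in range(len(candidates)) for t in LABELS)
        # Each candidate receives exactly one label (possibly null)
        for i in range(len(candidates)):
            prob += pulp.lpSum(x[i, t] for t in LABELS) == 1
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return {i: t for (i, t), var in x.items() if var.value() == 1}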

Page 16

Linear Constraints

- No overlapping or embedding arguments:
    for all a_i, a_j that overlap or embed:  a_{i,null} + a_{j,null} ≥ 1


Page 17

Constraints

- No overlapping or embedding arguments
- No duplicate argument classes for A0-A5
- Exactly one V argument per predicate
- If there is a C-V, there must be a V-A1-C-V pattern
- If there is an R-arg, there must be an arg somewhere
- If there is a C-arg, there must be an arg somewhere before it
- Each predicate can take only core arguments that appear in its frame file
  - More specifically, we check only the minimum and maximum ids
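Continuing the hypothetical PuLP sketch from the ILP slide, a few of these constraints encoded linearly (`spans` and `overlaps` are assumed helpers giving candidate spans and testing overlap/embedding; labels not present in the model are skipped):

    import pulp

    def add_structural_constraints(prob, x, n, labels, spans, overlaps):
        # No overlapping or embedding arguments:
        # if a_i and a_j overlap, at least one of them must be null.
        for i in range(n):
            for j in range(i + 1, n):
                if overlaps(spans[i], spans[j]):
                    prob += x[i, "null"] + x[j, "null"] >= 1
        # No duplicate argument classes for the core arguments A0-A5.
        for t in ("A0", "A1", "A2", "A3", "A4", "A5"):
            if t in labels:
                prob += pulp.lpSum(x[i, t] for i in range(n)) <= 1
        # Exactly one V argument per predicate.
        if "V" in labels:
            prob += pulp.lpSum(x[i, "V"] for i in range(n)) == 1
        # Referential arguments: an R-A0 anywhere requires an A0 somewhere.
        if "R-A0" in labels and "A0" in labels:
            for i in range(n):
                prob += x[i, "R-A0"] <= pulp.lpSum(x[j, "A0"] for j in range(n))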

Page 18

SRL Results (CoNLL-2005)

- Training: sections 02-21
- Development: section 24
- Test WSJ: section 23
- Test Brown: from the Brown corpus (very small)

                            Precision   Recall     F1
  Development   Collins       73.89      70.11    71.95
                Charniak      75.40      74.13    74.76
  Test WSJ      Collins       77.09      72.00    74.46
                Charniak      78.10      76.15    77.11
  Test Brown    Collins       68.03      63.34    65.60
                Charniak      67.15      63.57    65.31

Page 19

Inference with Multiple SRL Systems

- Goal is to maximize
    Σ_{i,t} score(a_i = t) · a_{i,t}
  subject to the (linear) constraints.
  - Any Boolean constraint can be encoded this way.
- Here score(a_i = t) = Σ_k f_k(a_i = t)
  - If system k has no opinion on a_i, use a prior instead.
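A minimal sketch of the combined score (with a hypothetical data layout: one score dictionary per system, keyed by candidate span and label, plus a label prior for systems with no opinion):

    # score(a_i = t) = sum_k f_k(a_i = t), backing off to a prior when
    # system k produced no candidate matching a_i.
    def combined_score(system_scores, prior, candidate, label):
        total = 0.0
        for scores_k in system_scores:
            if candidate in scores_k:
                total += scores_k[candidate].get(label, prior[label])
            else:
                total += prior[label]
        return total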

Page 20

Results with Multiple Systems (CoNLL-2005)

Page 21

Outline

- Semantic Role Labeling Problem
  - Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs Local Training
  - Utility of Constraints in the Inference
- Conclusion

Page 22

Learning and Inference

[Figure: inputs x_1 ... x_7 feed local classifiers f_1(x) ... f_5(x), which produce the outputs y_1 ... y_5; X and Y denote the input and output spaces]

- Training without constraints; at testing time, inference with constraints.
- IBT (Inference-based Training): learning the components together.

Which one is better? When and why?

Page 23

Comparisons of Learning Approaches

- Coupling (IBT)
  - Optimizes the true global objective function (this should be better in the limit)
- Decoupling (L+I)
  - More efficient
  - Reusability of classifiers
  - Modularity in training
  - No global examples required

Page 24

Claims

- When the local classification problems are "easy", L+I outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
- We will show experimental results and theoretical intuition to support these claims.

Page 25

Perceptron-based Global Learning

[Figure: inputs x_1 ... x_7 feed local classifiers f_1(x) ... f_5(x), which produce the structured output Y]

  True global labeling        Y  = (-1, 1, 1, -1, -1)
  Local predictions           Y' = (-1, 1, 1,  1,  1)
  After applying constraints  Y' = (-1, 1, 1,  1, -1)
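A minimal sketch of the "apply constraints" step in this picture: among the assignments allowed by C(Y), keep the one with the highest summed local score (outputs are ±1 as on the slide; `local_scores[i][y_i]` and the feasibility test `satisfies` are hypothetical):

    from itertools import product

    def constrained_prediction(local_scores, satisfies):
        best, best_score = None, float("-inf")
        for y in product((-1, 1), repeat=len(local_scores)):
            if not satisfies(y):                       # y must lie in C(Y)
                continue
            s = sum(local_scores[i][y_i] for i, y_i in enumerate(y))
            if s > best_score:
                best, best_score = y, s
        return best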


Page 26

Simulation

- There are 5 local binary linear classifiers.
- The global classifier is also linear:
    h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
- Constraints are randomly generated.
- The hypothesis is linearly separable at the global level, given that the constraints are known.
- The separability level at the local level is varied.
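A minimal sketch, under the simulated setup above, of one IBT training pass: inference runs inside the loop and only the components responsible for global mistakes are updated. L+I would instead train each w[i] against its own label y_i and ignore the constraints until test time. Here `w[i]` are assumed vector-valued weights, `phi(x, i, y_i)` a hypothetical feature map, and `infer` a constrained decoder (e.g. wrapping the brute-force search sketched earlier):

    # One epoch of inference-based training (structured-perceptron style).
    def ibt_epoch(data, w, infer, phi):
        for x, y in data:                      # y is the true global labeling
            y_hat = infer(w, x)                # inference WITH constraints
            for i, (y_i, yh_i) in enumerate(zip(y, y_hat)):
                if y_i != yh_i:                # promote the truth, demote the guess
                    w[i] = w[i] + phi(x, i, y_i) - phi(x, i, yh_i)
        return w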

Page 27

Bound Prediction

  Local:   ε ≤ ε_opt + ( (d log m + log 1/δ) / m )^{1/2}
  Global:  ε ≤ 0 + ( (c·d log m + c²·d + log 1/δ) / m )^{1/2}

Simulated Data

[Figure: learning curves on simulated data for ε_opt = 0, 0.1, 0.2]

L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.

Page 28

Relative Merits: SRL

[Figure: performance vs. difficulty of the learning problem (# features), ranging from easy to hard]

- L+I is better.
- When the problem is artificially made harder, the tradeoff is clearer.

Page 29

Summary

- When the local classification problems are "easy", L+I outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
- Why does inference help at all?


Page 30

About Constraints

- We always assume that global coherency is good.
- Constraints do help in real-world applications.
- Performance is usually measured on the local predictions.
- Depending on the performance metric, constraints can hurt.

Page 31

Results: Contribution of Expressive Constraints [Roth & Yih 05]

- Basic: learning with statistical constraints only
- Additional constraints added at evaluation time (for efficiency)

  F1                 CRF-D    diff     CRF-ML   diff
  basic (Viterbi)    69.14             66.46
  + no dup           69.74    +0.60    67.10    +0.64
  + cand             73.64    +3.90    71.78    +4.68
  + argument         73.71    +0.07    71.71    -0.07
  + verb pos         73.78    +0.07    71.72    +0.01
  + disallow         73.91    +0.13    71.94    +0.22

Page 32

Assumptions

- y = ⟨y_1, ..., y_l⟩
- Non-interactive classifiers f_i(x, y_i): each classifier does not use the outputs of other classifiers as inputs.
- Inference is a linear summation:
    h_un(x)  = argmax_{y ∈ Y}    Σ_i f_i(x, y_i)
    h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
- C(Y) always contains the correct outputs.
- No assumption is made on the structure of the constraints.

Page 33

Performance Metrics

- Zero-one loss
  - Mistakes are counted in terms of global mistakes: y is wrong if any of the y_i is wrong.
- Hamming loss
  - Mistakes are counted in terms of local mistakes.
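A minimal sketch of the two metrics on a single structured output (a sequence of local labels):

    # Zero-one loss: the whole structure is wrong if any local output is wrong.
    def zero_one_loss(y_true, y_pred):
        return 0 if all(t == p for t, p in zip(y_true, y_pred)) else 1

    # Hamming loss: the fraction of local outputs that are wrong.
    def hamming_loss(y_true, y_pred):
        return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)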

Page 34

Zero-One Loss

- Constraints cannot hurt.
  - Constraints never alter a correct global output: the correct output satisfies the constraints, so if the unconstrained argmax is correct it remains the constrained argmax.
- This is not true for Hamming loss.

Page 35

Boolean Cube: 4-bit Binary Outputs

[Figure: the 16 outputs 0000 ... 1111 plotted by score and Hamming loss, grouped by the number of mistakes (0 to 4) relative to the correct output 0000]

Page 36

Hamming Loss

[Figure: the same score vs. Hamming-loss plot; the unconstrained prediction h_un = 0011 is marked]

Page 37

Best Classifiers

[Figure: score vs. Hamming loss with the prediction 0011 marked and the margins δ_1, δ_2, δ_3, δ_4 (and δ_1 + δ_2) annotated]

Page 38

When Constraints Cannot Hurt

- δ_i: distance between the correct label and the 2nd-best label
- γ_i: distance between the predicted label and the correct label
- F_correct = { i | f_i is correct },  F_wrong = { i | f_i is wrong }
- Constraints cannot hurt if
    ∀ i ∈ F_correct:  δ_i > Σ_{j ∈ F_wrong} γ_j
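A minimal sketch of checking this condition (hypothetical per-variable inputs: `delta[i]` and `gamma[i]` as defined above, and `correct[i]` indicating whether f_i is correct):

    # Constraints cannot hurt if every correct classifier's margin delta_i
    # exceeds the total distance gamma summed over the wrong classifiers.
    def constraints_cannot_hurt(delta, gamma, correct):
        wrong_total = sum(g for g, ok in zip(gamma, correct) if not ok)
        return all(d > wrong_total for d, ok in zip(delta, correct) if ok)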

Page 39

An Empirical Investigation

- SRL system on the CoNLL-2005 WSJ test set

                     Without Constraints   With Constraints
  Local Accuracy           82.38%               84.08%

Page 40

An Empirical Investigation

                           Number    Percentage
  Total Predictions         19822       100.00
  Incorrect Predictions      3492        17.62
  Correct Predictions       16330        82.38
  Violate the condition      1833         9.25

Page 41

Good Classifiers

[Figure: score vs. Hamming loss for the 16 outputs with the prediction 0011 marked]

Page 42

Bad Classifiers

[Figure: score vs. Hamming loss for the 16 outputs with the prediction 0011 marked]

Page 43

Average Distance vs. Gain in Hamming Loss

[Figure: for good classifiers, high loss corresponds to low score (low gain)]

Page 44

Good Classifiers

[Figure: score vs. Hamming loss for the 16 outputs with the prediction 0011 marked]

Page 45

Bad Classifiers

[Figure: score vs. Hamming loss for the 16 outputs with the prediction 0011 marked]

Page 46

Average Gain in Hamming Loss vs. Distance

[Figure: for good classifiers, high score corresponds to low loss (high gain)]

Page 47

Utility of Constraints

- Constraints improve the performance because the classifiers are good.
- Good classifiers:
  - When the classifier is correct, there is a large margin between the correct label and the 2nd-best label.
  - When the classifier is wrong, the correct label is not far from the predicted one.

Page 48

Conclusions

- Showed how global inference can be used
  - Semantic Role Labeling
  - Tradeoffs between coupling vs. decoupling learning and inference
  - Investigation of the utility of constraints
- The analyses are very preliminary; future work includes:
  - Average-case analysis of the tradeoffs between coupling vs. decoupling learning and inference
  - Better understanding of how to use constraints
    - More interactive classifiers
    - Different performance metrics, e.g. F1
    - Relation with margin