Global Inference in Learning for Natural Language Processing
Vasin Punyakanok
Department of Computer Science
University of Illinois at Urbana-Champaign
Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak
Page 2
Story Comprehension
1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
Page 3
Stand Alone Ambiguity Resolution
Context Sensitive Spelling Correction
  Illinois' bored of education [board]
Word Sense Disambiguation
  ...Nissan Car and truck plant is ...
  ...divide life into plant and animal kingdom
Part of Speech Tagging
  (This DT) (can N) (will MD) (rust V) - DT, MD, V, N
Coreference Resolution
  The dog bit the kid. He was taken to a hospital.
  The dog bit the kid. He was taken to a veterinarian.
Page 4
Textual Entailment
Eyeing the huge market potential, currently led by Google, Yahoo took
over search company Overture Services Inc. last year.
Yahoo acquired Overture.
Question Answering
Who acquired Overture?
Page 5
Inference and Learning
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
Learned classifiers for different sub-problems.
Incorporate the classifiers' information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain- and context-specific constraints.
Global inference for the best assignment to all variables of interest.
Page 6
Text Chunking
[Figure: shallow parse of "The guy standing there is so tall" – x = the sentence, y = the chunk structure with labels NP, VP, ADVP, ADJP]
Page 7
Full Parsing
[Figure: full parse tree of "The guy standing there is so tall" – x = the sentence, y = the parse tree with nodes S, NP, VP, ADVP, ADJP]
Page 8
Outline
Semantic Role Labeling Problem
Global Inference with Integer Linear Programming
Some Issues with Learning and Inference
Global vs Local Training
Utility of Constraints in the Inference
Conclusion
Page 9
Semantic Role Labeling
I left my pearls to my daughter in my will .
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC .
A0: Leaver
A1: Things left
A2: Benefactor
AM-LOC: Location
Page 10
Semantic Role Labeling
PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
It adds a layer of generic semantic labels to Penn Tree Bank II.
(Almost) all the labels are on the constituents of the parse trees.
Core arguments: A0-A5 and AA
  different semantics for each verb
  specified in the PropBank Frame files
13 types of adjuncts labeled as AM-arg, where arg specifies the adjunct type
Page 11
Semantic Role Labeling
Page 12
The Approach
Example sentence: I left my nice pearls to her
Pruning: use heuristics to reduce the number of candidates (modified from [Xue&Palmer'04])
Argument Identification: use a binary classifier to identify arguments
Argument Classification: use a multiclass classifier to classify arguments
Inference: infer the final output satisfying linguistic and structure constraints
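The staged pipeline can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the candidate fields (`plausible`, `is_arg_score`, `label_scores`) and the 0.5 threshold are invented for the example, and the final ILP inference stage is omitted here.

```python
# Hypothetical SRL pipeline sketch; field names are illustrative only.
def prune(candidates):
    # Heuristically drop implausible candidate phrases.
    return [c for c in candidates if c["plausible"]]

def identify(candidates):
    # Binary decision: is this candidate an argument at all?
    return [c for c in candidates if c["is_arg_score"] > 0.5]

def classify(arguments):
    # Multiclass decision: pick the best-scoring label per argument.
    for a in arguments:
        a["label"] = max(a["label_scores"], key=a["label_scores"].get)
    return arguments

def srl_pipeline(candidates):
    return classify(identify(prune(candidates)))

candidates = [
    {"span": "I", "plausible": True, "is_arg_score": 0.9,
     "label_scores": {"A0": 0.8, "A1": 0.2}},
    {"span": "nice", "plausible": False, "is_arg_score": 0.1,
     "label_scores": {"A0": 0.5, "A1": 0.5}},
    {"span": "my nice pearls", "plausible": True, "is_arg_score": 0.8,
     "label_scores": {"A0": 0.1, "A1": 0.9}},
]
out = srl_pipeline(candidates)
```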
Page 13
Learning
Both the argument identifier and the argument classifier are trained as phrase-based classifiers.
Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse]
Learning Algorithm – SNoW
  Sparse network of linear functions
  weights learned by regularized Winnow multiplicative update rule with averaged weight vectors
Probability conversion is done via softmax: p_i = exp{act_i} / Σ_j exp{act_j}
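The softmax conversion above can be written directly; a minimal sketch, subtracting the max activation for numerical stability (which leaves the result unchanged):

```python
import math

def softmax(activations):
    # p_i = exp(act_i) / sum_j exp(act_j), stabilized by shifting by the max.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```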
Page 14
Inference
The output of the argument classifier often violates some constraints, especially when the sentence is long.
Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
Input:
  The probability estimates (from the argument classifier)
  Structural and linguistic constraints
Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Page 15
Integer Linear Programming Inference
For each argument a_i and type t (including null), set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.
Goal is to maximize Σ_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
Any Boolean constraints can be encoded this way.
If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
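To make the objective concrete, here is a toy version. The real system uses an ILP solver; this sketch instead enumerates assignments by brute force (feasible only for tiny problems), with invented scores and a single no-duplicate-core-label constraint standing in for the full constraint set.

```python
from itertools import product

# Toy scores P(a_i = t) for two candidate arguments over three labels.
LABELS = ["A0", "A1", "null"]
scores = [
    {"A0": 0.6, "A1": 0.3, "null": 0.1},   # argument a_0
    {"A0": 0.5, "A1": 0.4, "null": 0.1},   # argument a_1
]

def no_duplicate_core(assignment):
    # Constraint: a core label such as A0 may appear at most once.
    labels = [t for t in assignment if t != "null"]
    return len(labels) == len(set(labels))

def best_assignment(scores, feasible):
    # argmax over feasible assignments of sum_i score(a_i = t_i).
    best, best_val = None, float("-inf")
    for assignment in product(LABELS, repeat=len(scores)):
        if not feasible(assignment):
            continue
        val = sum(scores[i][t] for i, t in enumerate(assignment))
        if val > best_val:
            best, best_val = assignment, val
    return best
```

Without the constraint, both arguments would take A0 (0.6 + 0.5); the constraint forces the globally coherent ("A0", "A1") instead.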
Page 16
Linear Constraints
No overlapping or embedding arguments:
if a_i and a_j overlap or embed, then a_{i,null} + a_{j,null} ≥ 1
Page 17
Constraints
No overlapping or embedding arguments
No duplicate argument classes for A0-A5
Exactly one V argument per predicate
If there is a C-V, there must be a V-A1-C-V pattern
If there is an R-arg, there must be an arg somewhere
If there is a C-arg, there must be an arg somewhere before it
Each predicate can take only core arguments that appear in its frame file.
  More specifically, we check only the minimum and maximum ids.
Page 18
SRL Results (CoNLL-2005)
Training: sections 02-21
Development: section 24
Test WSJ: section 23
Test Brown: from Brown corpus (very small)

              Parser     Precision  Recall  F1
Development   Collins    73.89      70.11   71.95
              Charniak   75.40      74.13   74.76
Test WSJ      Collins    77.09      72.00   74.46
              Charniak   78.10      76.15   77.11
Test Brown    Collins    68.03      63.34   65.60
              Charniak   67.15      63.57   65.31
Page 19
Inference with Multiple SRL Systems
Goal is to maximize Σ_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
Any Boolean constraints can be encoded this way.
score(a_i = t) = Σ_k f_k(a_i = t)
If system k has no opinion on a_i, use a prior instead.
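The score combination with a prior fallback is straightforward; a minimal sketch with invented data structures (each system's scores keyed by argument index, missing keys meaning "no opinion"):

```python
def combined_score(system_scores, prior, i, t):
    # score(a_i = t) = sum_k f_k(a_i = t); fall back to a prior
    # when system k produced no candidate covering argument a_i.
    total = 0.0
    for scores_k in system_scores:
        if i in scores_k:
            total += scores_k[i].get(t, 0.0)
        else:
            total += prior.get(t, 0.0)
    return total

system_scores = [
    {0: {"A0": 0.7, "null": 0.3}},                               # system 1
    {0: {"A0": 0.6, "null": 0.4}, 1: {"A1": 0.8, "null": 0.2}},  # system 2
]
prior = {"A0": 0.1, "A1": 0.1, "null": 0.8}
```

For argument 1, system 1 has no opinion, so its contribution comes from the prior.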
Page 20
Results with Multiple Systems (CoNLL-2005)
Page 21
Outline
Semantic Role Labeling Problem
Global Inference with Integer Linear Programming
Some Issues with Learning and Inference
Global vs Local Training
Utility of Constraints in the Inference
Conclusion
Page 22
Learning and Inference
L+I: train without constraints; at test time, run inference with constraints.
IBT (Inference-based Training): learn the components together!
[Figure: inputs x_1 .. x_7 feed local classifiers f_1(x) .. f_5(x), producing outputs y_1 .. y_5; X is the input space and Y the output space]
Which one is better? When and why?
Page 23
Comparisons of Learning Approaches
Coupling (IBT)
  Optimizes the true global objective function (this should be better in the limit)
Decoupling (L+I)
  More efficient
  Reusability of classifiers
  Modularity in training
  No global examples required
Page 24
Claims
When the local classification problems are "easy", L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
We will show experimental results and theoretical intuition to support our claims.
Page 25
Perceptron-based Global Learning
[Figure: inputs x_1 .. x_7 feed local classifiers f_1(x) .. f_5(x) over input space X and output space Y]
True global labeling Y:   -1  1  1 -1 -1
Local predictions Y':     -1  1  1  1  1
Apply constraints, Y':    -1  1  1  1 -1
Page 26
Simulation
There are 5 local binary linear classifiers.
The global classifier is also linear: h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
Constraints are randomly generated.
The hypothesis is linearly separable at the global level, given that the constraints are known.
The separability level at the local level is varied.
Page 27
Bounds
Bound prediction:
  Local:  error ≤ opt + O( ((d log m + log 1/δ) / m)^(1/2) )
  Global: error ≤ 0 + O( ((cd log m + c²d + log 1/δ) / m)^(1/2) )
[Figure: learning curves on simulated data for opt = 0.2, 0.1, 0]
Simulated data, L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
Page 28
Relative Merits: SRL
[Figure: L+I vs. IBT performance as the difficulty of the learning problem (# features) varies from easy to hard]
L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
Page 29
Summary
When the local classification problems are "easy", L+I outperforms IBT.
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
Why does inference help at all?
Page 30
About Constraints
We always assume that global coherency is good.
Constraints do help in real-world applications.
Performance is usually measured at the local prediction level.
Depending on the performance metric, constraints can hurt.
Page 31
Results: Contribution of Expressive Constraints [Roth & Yih 05]
Basic: learning with statistical constraints only; additional constraints added at evaluation time (for efficiency)

                  CRF-D            CRF-ML
                  F1      diff     F1      diff
basic (Viterbi)   69.14            66.46
+ no dup          69.74   +0.60    67.10   +0.64
+ cand            73.64   +3.90    71.78   +4.68
+ argument        73.71   +0.07    71.71   -0.07
+ verb pos        73.78   +0.07    71.72   +0.01
+ disallow        73.91   +0.13    71.94   +0.22
Page 32
Assumptions
y = ⟨y_1, ..., y_l⟩
Non-interactive classifiers f_i(x, y_i): each classifier does not use as inputs the outputs of other classifiers.
Inference is linear summation:
  h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
  h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
C(Y) always contains the correct outputs.
No assumption on the structure of the constraints.
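The two inference rules differ only in the space they maximize over; a tiny sketch with made-up local scores (f[i][y_i] holds the score of component i taking value y_i):

```python
from itertools import product

def h(f, space):
    # argmax over y in the given space of sum_i f_i(x, y_i).
    return max(space, key=lambda y: sum(f[i][yi] for i, yi in enumerate(y)))

f = [{0: 0.2, 1: 0.8}, {0: 0.6, 1: 0.4}]   # two local binary classifiers
Y = list(product((0, 1), repeat=2))         # unconstrained space
C = [(0, 0), (1, 1)]                        # constrained space C(Y)
```

Here h_un picks (1, 0), the componentwise best, while h_con must settle for the best output inside C(Y), namely (1, 1).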
Page 33
Performance Metrics
Zero-one loss
  Mistakes are calculated in terms of global mistakes: y is wrong if any y_i is wrong.
Hamming loss
  Mistakes are calculated in terms of local mistakes.
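The two metrics can be stated in a few lines:

```python
def zero_one_loss(y_true, y_pred):
    # Global mistakes: 1 if any component is wrong, else 0.
    return int(any(t != p for t, p in zip(y_true, y_pred)))

def hamming_loss(y_true, y_pred):
    # Local mistakes: the number of wrong components.
    return sum(t != p for t, p in zip(y_true, y_pred))
```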
Page 34
Zero-One Loss
Constraints cannot hurt: constraints never break a correct global output. If the unconstrained prediction is correct, it lies in C(Y) and still maximizes the score there, so the constrained prediction is also correct.
This is not true for Hamming loss.
Page 35
4-bit Binary Outputs: Boolean Cube
[Figure: score vs. Hamming loss over the 16 outputs 0000-1111, grouped by number of mistakes (0 to 4) relative to the correct output]
Page 36
Hamming Loss
[Figure: score vs. Hamming loss; the unconstrained prediction h_un is 0011]
Page 37
Best Classifiers
[Figure: score vs. Hamming loss for the best classifiers over the 16 outputs 0000-1111, with the score gaps between loss levels annotated]
Page 38
When Constraints Cannot Hurt
δ_i: distance between the correct label and the 2nd-best label
Δ_i: distance between the predicted label and the correct label
F_correct = {i | f_i is correct}
F_wrong = {i | f_i is wrong}
Constraints cannot hurt if ∀ i ∈ F_correct: δ_i > Σ_{j ∈ F_wrong} Δ_j
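The condition can be checked mechanically. In this sketch the margins δ_i (for correct classifiers) and distances Δ_j (for wrong ones) are supplied directly as numbers; in the SRL system they would come from classifier scores.

```python
def constraints_cannot_hurt(deltas_correct, gaps_wrong):
    # deltas_correct[i]: margin delta_i between the correct label and the
    # second-best label, for each correct classifier.
    # gaps_wrong[j]: distance Delta_j between the predicted label and the
    # correct label, for each wrong classifier.
    # Condition: every delta_i exceeds the total Delta over wrong classifiers.
    return all(d > sum(gaps_wrong) for d in deltas_correct)
```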
Page 39
An Empirical Investigation
SRL system, CoNLL-2005 WSJ test set

                 Without Constraints   With Constraints
Local Accuracy   82.38%                84.08%
Page 40
An Empirical Investigation

                          Number   Percentage
Total Predictions         19822    100.00
Incorrect Predictions     3492     17.62
Correct Predictions       16330    82.38
  Violate the condition   1833     9.25
Page 41
Good Classifiers
[Figure: score vs. Hamming loss over the 16 outputs 0000-1111 for good classifiers; the prediction is 0011]
Page 42
Bad Classifiers
[Figure: score vs. Hamming loss over the 16 outputs 0000-1111 for bad classifiers; the prediction is 0011]
Page 43
Average Distance vs Gain in Hamming Loss
[Figure: for good classifiers, high loss → low score (low gain)]
Page 44
Good Classifiers
[Figure: score vs. Hamming loss over the 16 outputs 0000-1111 for good classifiers; the prediction is 0011]
Page 45
Bad Classifiers
[Figure: score vs. Hamming loss over the 16 outputs 0000-1111 for bad classifiers; the prediction is 0011]
Page 46
Average Gain in Hamming Loss vs Distance
[Figure: for good classifiers, high score → low loss (high gain)]
Page 47
Utility of Constraints
Constraints improve the performance because the classifiers are good.
Good classifiers:
  When the classifier is correct, it allows a large margin between the correct label and the 2nd-best label.
  When the classifier is wrong, the correct label is not far away from the predicted one.
Page 48
Conclusions
Showed how global inference can be used: Semantic Role Labeling.
Tradeoffs between coupling vs. decoupling learning and inference.
Investigation of the utility of constraints.
The analyses are very preliminary; still needed:
  Average-case analysis of the tradeoffs between coupling vs. decoupling learning and inference
  Better understanding of using constraints: more interactive classifiers, different performance metrics (e.g. F1), relation with margin