Natural Language Processing
COMPSCI 423/723
Rohit Kate
Conditional Random Fields
(CRFs) for Sequence Labeling
Some of the slides have been adapted from Raymond
Mooney’s NLP course at UT Austin.
Graphical Models
• If no assumption of independence is made, then an exponential number of parameters must be estimated
  – No realistic amount of training data is sufficient to estimate so many parameters
• If a blanket assumption of conditional independence is made, efficient training and inference is possible, but such a strong assumption is rarely warranted
• Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies and allow for less restrictive independence assumptions while limiting the number of parameters that must be estimated
  – Bayesian Networks: directed acyclic graphs that indicate causal structure
  – Markov Networks: undirected graphs that capture general dependencies
Bayesian Networks
• Directed Acyclic Graph (DAG)
  – Nodes are random variables
  – Edges indicate causal influences

[Figure: the burglar-alarm network — Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]
Conditional Probability Tables
• Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case).
  – Roots (sources) of the DAG that have no parents are given prior probabilities.

[Figure: the burglar-alarm network annotated with its CPTs]

  P(B) = .001     P(E) = .002

  B  E  P(A|B,E)        A  P(J|A)        A  P(M|A)
  T  T    .95           T   .90          T   .70
  T  F    .94           F   .05          F   .01
  F  T    .29
  F  F    .001
Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines (factors) a joint distribution:

  P(x1, x2, ..., xn) = ∏_{i=1..n} P(xi | Parents(Xi))

• Example:

  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
    = P(J|A) · P(M|A) · P(A|¬B,¬E) · P(¬B) · P(¬E)
    = 0.9 · 0.7 · 0.001 · 0.999 · 0.998
    = 0.00062
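The example factorization above can be checked numerically. The sketch below is not from the slides; it hard-codes the CPT values given earlier (variable names are mine) and multiplies the relevant entries.

```python
# Minimal sketch: the burglar-alarm joint probability as a product of CPT entries.
P_B = 0.001          # P(Burglary)
P_E = 0.002          # P(Earthquake)
# P(Alarm | B, E), keyed by (B, E)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

# P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
p = P_J[True] * P_M[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(p)   # ≈ 0.00062 (the slide truncates the value)
```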
Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net

  [Figure: Y with arrows to X1, X2, …, Xn]

• Priors P(Y) and conditionals P(Xi|Y) for Naïve Bayes provide CPTs for the network
HMMs as Bayesian Network
• The directed probabilistic graphical model for the random variables w1 to wn and t1 to tn with the independence assumptions:

  [Figure: chain t1 → t2 → t3 → … → tn with an emission edge ti → wi at each position;
   factors: P(t1), P(t2|t1), P(t3|t2), …, P(tn|tn−1) and P(w1|t1), P(w2|t2), P(w3|t3), …, P(wn|tn)]
Drawbacks of HMMs
• HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q) and thus only indirectly model P(Q|O), which is what is needed for the sequence labeling task (O: observation sequence, Q: label sequence)
• Can’t use arbitrary features related to the words (e.g. capitalization, prefixes, etc. that can help POS tagging) unless these are explicitly modeled as part of the observations
Undirected Graphical Model
• Also called Markov Network or Random Field
• Undirected graph over a set of random variables, where an edge represents a dependency
• The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X)
• Every node in a Markov Net is conditionally independent of every other node given its Markov blanket
Sample Markov Network
[Figure: the burglar-alarm graph with undirected edges: Burglary—Alarm, Earthquake—Alarm, Alarm—JohnCalls, Alarm—MaryCalls]
Distribution for a Markov Network
• The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions), φk, one for each clique, k, in the graph.
• For each joint assignment of values to the variables in clique k, φk assigns a non-negative real value that represents the compatibility of these values.
• The joint distribution of a Markov network is then defined by:

  P(x1, x2, ..., xn) = (1/Z) ∏_k φk(x{k})

  where x{k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:

  Z = Σ_x ∏_k φk(x{k})
Sample Markov Network
[Figure: the undirected burglar-alarm graph, with one potential table per edge]

  B  A  φ1        E  A  φ2        J  A  φ3        M  A  φ4
  T  T  100       T  T   50       T  T   75       T  T   50
  T  F    1       T  F   10       T  F   10       T  F    1
  F  T    1       F  T    1       F  T    1       F  T   10
  F  F  200       F  F  200       F  F  200       F  F  200

  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = (1 · 1 · 75 · 50) / Z
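For a graph this small, Z can be computed by brute force over all 2^5 joint assignments. A minimal sketch (the potential values come from the slide's tables; the variable ordering and function names are mine):

```python
# Brute-force normalization of the sample Markov network.
from itertools import product

phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}  # φ1(B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}  # φ2(E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}  # φ3(J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}  # φ4(M, A)

def score(b, e, a, j, m):
    """Unnormalized product of clique potentials."""
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

# Z sums the potential product over all 2^5 joint assignments
Z = sum(score(*v) for v in product([True, False], repeat=5))
p = score(False, False, True, True, True) / Z   # P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
print(p)
```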
Discriminative Markov Network or Conditional Random Field
• Directly models P(Y|X)
• The potential functions could be based on arbitrary features of X and Y, and they are expressed as exponentials

  P(y1, y2, ..., ym | x1, x2, ..., xn) = (1/Z(X)) ∏_k φk(y{k}, x{k})

  Z(X) = Σ_Y ∏_k φk(y{k}, x{k})
Random Field (Undirected Graphical Model) vs. Conditional Random Field (CRF)
• Random field over variables v1, ..., vn:

  P(v1, v2, ..., vn) = (1/Z) ∏_k φk(v{k})        Z = Σ_v ∏_k φk(v{k})

  [Figure: general undirected graph over v1, v2, v3, v4, …, vn]

• Conditional Random Field (CRF):

  P(y1, y2, ..., ym | x1, x2, ..., xn) = (1/Z(X)) ∏_k φk(y{k}, x{k})        Z(X) = Σ_Y ∏_k φk(y{k}, x{k})

  [Figure: output variables Y1, Y2, Y3, …, Yn connected to the inputs X1, X2, …, Xn]

• Two types of variables, x and y; there is no factor with only x variables
Linear-Chain Conditional Random Field (CRF)
[Figure: Y1 — Y2 — … — Yn form a chain; every Yi is also connected to the inputs X1, X2, …, Xn]

  P(y1, y2, ..., ym | x1, x2, ..., xn) = (1/Z(X)) ∏_k φk(yi−1, yi, x{k})

  Z(X) = Σ_Y ∏_k φk(yi−1, yi, x{k})

• Ys are connected in a linear chain
Logistic Regression as the Simplest CRF
• Logistic regression is a simple CRF with only one output variable

  [Figure: Y connected to X1, X2, …, Xn]

• Models the conditional distribution, P(Y|X), and not the full joint P(X,Y)
Simplification Assumption for MaxEnt
• The probability P(Y|X1..Xn) can be factored as:

  P(c|X) = exp( Σ_{i=0..N} wci fi(c, x) ) / Σ_{c'∈Classes} exp( Σ_{i=0..N} wc'i fi(c', x) )
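This factorization is just a softmax over per-class weighted feature sums. A minimal sketch, where the classes, feature functions, and weights are invented for illustration:

```python
# MaxEnt / multinomial logistic regression as a normalized exponential of feature sums.
import math

def maxent_prob(c, x, classes, weights, features):
    """P(c|X) = exp(Σ_i w_ci f_i(c,x)) / Σ_c' exp(Σ_i w_c'i f_i(c',x))."""
    def score(cls):
        return math.exp(sum(w * f(cls, x) for w, f in zip(weights[cls], features)))
    return score(c) / sum(score(cp) for cp in classes)

# Hypothetical classes, features, and weights
classes = ["NOUN", "VERB"]
features = [
    lambda c, x: 1.0 if c == "NOUN" and x[0].isupper() else 0.0,    # capitalized word, NOUN class
    lambda c, x: 1.0 if c == "VERB" and x.endswith("ed") else 0.0,  # "-ed" suffix, VERB class
]
weights = {"NOUN": [2.0, 0.0], "VERB": [0.0, 2.0]}
p = maxent_prob("VERB", "walked", classes, weights, features)
print(p)
```

Because the denominator sums over all classes, the probabilities for a fixed input always sum to 1.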
Generative vs. Discriminative Sequence Labeling Models
• HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q)
• HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task
• Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling. They model the conditional distribution P(Q|O)
Classification
[Figure: Naïve Bayes (generative): Y → X1, X2, …, Xn.
 Logistic Regression (discriminative, conditional): X1, X2, …, Xn → Y]

Sequence Labeling
[Figure: HMM (generative): chain Y1 → Y2 → … → YT with emissions Yt → Xt.
 Linear-chain CRF (discriminative, conditional): undirected chain Y1 — Y2 — … — YT, each Yt connected to Xt]
Simple Linear Chain CRF Features
• Modeling the conditional distribution is similar to that used in multinomial logistic regression.
• Create feature functions fk(Yt, Yt−1, Xt)
  – Feature for each state transition pair i, j
    • fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j, and 0 otherwise
  – Feature for each state observation pair i, o
    • fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o, and 0 otherwise
• Note: the number of features grows quadratically in the number of states (i.e. tags).
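The two feature templates above can be sketched as higher-order functions. The tag and word inventories here are invented for illustration; note the |tags|² transition features, matching the quadratic growth just mentioned.

```python
# Indicator feature templates for a linear-chain CRF.
def make_transition_feature(i, j):
    """f_{i,j}(y_t, y_prev, x_t) = 1 iff y_t == i and y_prev == j."""
    return lambda y_t, y_prev, x_t: 1 if y_t == i and y_prev == j else 0

def make_observation_feature(i, o):
    """f_{i,o}(y_t, y_prev, x_t) = 1 iff y_t == i and x_t == o."""
    return lambda y_t, y_prev, x_t: 1 if y_t == i and x_t == o else 0

tags = ["DET", "NOUN", "VERB"]
words = ["the", "dog", "barked"]
# |tags|^2 transition features plus |tags|*|words| observation features
features = [make_transition_feature(i, j) for i in tags for j in tags]
features += [make_observation_feature(i, o) for i in tags for o in words]

f = make_transition_feature("NOUN", "DET")
print(f("NOUN", "DET", "dog"), f("VERB", "DET", "dog"))   # 1 0
```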
Conditional Distribution for Linear Chain CRF
• Using these feature functions for a simple linear chain CRF, we can define:

  P(Y|X) = (1/Z(X)) exp( Σ_{t=1..T} Σ_{m=1..M} λm fm(Yt, Yt−1, Xt) )

  Z(X) = Σ_Y exp( Σ_{t=1..T} Σ_{m=1..M} λm fm(Yt, Yt−1, Xt) )
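For a tiny chain, this conditional can be evaluated by brute-force enumeration of all label sequences in Z(X). A minimal sketch with invented tags, weights, and indicator features (a real implementation computes Z with dynamic programming instead):

```python
# Brute-force linear-chain CRF conditional probability.
import math
from itertools import product

tags = ["N", "V"]

def feature_score(y_t, y_prev, x_t, weights):
    """Σ_m λ_m f_m(y_t, y_prev, x_t) with simple indicator features."""
    return (weights.get(("trans", y_prev, y_t), 0.0)   # transition feature weight
            + weights.get(("obs", y_t, x_t), 0.0))     # observation feature weight

def sequence_score(ys, xs, weights):
    # the first position conditions on a special START tag "<s>"
    prevs = ["<s>"] + list(ys[:-1])
    return sum(feature_score(y, p, x, weights) for y, p, x in zip(ys, prevs, xs))

def crf_prob(ys, xs, weights):
    # Z(X) enumerates all |tags|^T candidate label sequences (fine only for tiny T)
    Z = sum(math.exp(sequence_score(cand, xs, weights))
            for cand in product(tags, repeat=len(xs)))
    return math.exp(sequence_score(ys, xs, weights)) / Z

weights = {("obs", "N", "dog"): 2.0, ("obs", "V", "barks"): 2.0,
           ("trans", "N", "V"): 1.0}
p = crf_prob(("N", "V"), ["dog", "barks"], weights)
print(p)
```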
Adding Token Features to a CRF
• Can add token features Xi,j

  [Figure: each label Yt is connected to its own token features Xt,1, …, Xt,m]

• Can add additional feature functions for each token feature to model the conditional distribution.
Features in POS Tagging
• For POS tagging, use lexicographic features of tokens.
  – Capitalized?
  – Starts with a numeral?
  – Ends in a given suffix (e.g. “s”, “ed”, “ly”)?
Enhanced Linear Chain CRF (standard approach)
• Can also condition transitions on the current token features.

  [Figure: each transition Yt−1 — Yt is also connected to the token features Xt,1, …, Xt,m]

• Add feature functions:
  • fi,j,k(Yt, Yt−1, X) = 1 if Yt = i and Yt−1 = j and Xt−1,k = 1, and 0 otherwise
Supervised Learning (Parameter Estimation)
• As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the conditional log-likelihood (CLL) of the supervised training data
Sequence Tagging (Inference)
• A variant of the dynamic programming (Viterbi) algorithm can be used to efficiently, in O(TN²), determine the globally most probable label sequence for a given token sequence using a given log-linear model of the conditional probability P(Y|X)
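The O(TN²) decoding step above can be sketched as follows. This is not the slides' code: `score(t, prev_tag, tag)` stands in for the weighted feature sum at position t, and the toy scoring function is invented for the demo.

```python
# Viterbi decoding for a linear-chain model: O(N^2) work per position, T positions.
def viterbi(T, tags, score):
    """Return the highest-scoring tag sequence of length T.
    score(t, prev_tag, tag) gives the log-potential for position t."""
    best = [{y: score(0, "<s>", y) for y in tags}]   # log-score of best prefix ending in y
    back = []
    for t in range(1, T):
        col, ptr = {}, {}
        for y in tags:
            # try every previous tag for every current tag
            prev = max(tags, key=lambda p: best[-1][p] + score(t, p, y))
            col[y] = best[-1][prev] + score(t, prev, y)
            ptr[y] = prev
        best.append(col)
        back.append(ptr)
    y = max(tags, key=lambda q: best[-1][q])         # best final tag
    path = [y]
    for ptr in reversed(back):                       # follow backpointers
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# Toy log-potentials that reward changing tags between adjacent positions
tags = ["A", "B"]
score = lambda t, p, y: 1.0 if p != y else 0.0
print(viterbi(3, tags, score))
```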
Skip-Chain CRFs
• Can model some long-distance dependencies (i.e. the same word appearing in different parts of the text) by including long-distance edges in the Markov model.

  [Figure: linear chain over “Michael Dell said …” (Y1, Y2, Y3 over X1, X2, X3) and “… Dell bought” (Y100, Y101 over X100, X101), with a skip edge connecting the labels of the two occurrences of “Dell”]

• Additional links make exact inference intractable, so we must resort to approximate inference to try to find the most probable labeling.
CRF Results
• Experimental results verify that CRFs have superior accuracy on various sequence labeling tasks
  – Part-of-speech tagging
  – Noun phrase chunking
  – Named entity recognition
  – Semantic role labeling
• However, CRFs are much slower to train and do not scale as well to large amounts of training data
  – Training for POS on the full Penn Treebank (~1M words) currently takes “over a week.”
• Skip-chain CRFs improve results on IE
CRF Summary
• CRFs are a discriminative approach to sequence labeling, whereas HMMs are generative
• Discriminative methods are usually more accurate since they are trained for a specific performance task
• CRFs also easily allow adding additional token features without making additional independence assumptions
• Training time is increased since a complex optimization procedure is needed to fit the supervised training data
• CRFs are a state-of-the-art method for sequence labeling
Phrase Structure
• Most languages have a word order
• Words are organized into phrases, groups of words that act as a single unit or constituent
  – [The dog] [chased] [the cat].
  – [The fat dog] [chased] [the thin cat].
  – [The fat dog with red collar] [chased] [the thin old cat].
  – [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].
Phrases
• Noun phrase: A syntactic unit of a sentence which acts like a noun and in which a noun, called its head, is usually embedded
  – An optional determiner followed by zero or more adjectives, a noun head, and zero or more prepositional phrases
• Prepositional phrase: Headed by a preposition; expresses spatial, temporal, or other attributes
• Verb phrase: The part of the sentence that depends on the verb. Headed by the verb.
• Adjective phrase: Acts like an adjective.
Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
  – [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
  – [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
• Some applications need all the noun phrases in a sentence
Phrase Chunking as Sequence Labeling
• Tag individual words with one of 3 tags
  – B (Begin): word starts a new target phrase
  – I (Inside): word is part of the target phrase but not the first word
  – O (Other): word is not part of a target phrase
• Sample for NP chunking:
  – He/B reckons/O the/B current/I account/I deficit/I will/O narrow/O to/O only/B #/I 1.8/I billion/I in/O September/B ./O
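Given a B/I/O tag sequence, the chunks can be recovered deterministically. A minimal sketch using a prefix of the slide's NP-chunking example:

```python
# Recover chunks from a B/I/O tag sequence.
def bio_to_chunks(words, tags):
    chunks, current = [], []
    for w, t in zip(words, tags):
        if t == "B":                 # B starts a new chunk
            if current:
                chunks.append(current)
            current = [w]
        elif t == "I" and current:   # I extends the open chunk
            current.append(w)
        else:                        # O closes any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

words = "He reckons the current account deficit".split()
tags = ["B", "O", "B", "I", "I", "I"]
print(bio_to_chunks(words, tags))   # ['He', 'the current account deficit']
```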
Evaluating Chunking
• Per-token accuracy does not evaluate finding correct full chunks. Instead use:

  Precision = (Number of correct chunks found) / (Total number of chunks found)

  Recall = (Number of correct chunks found) / (Total number of actual chunks)

• Take the harmonic mean to produce a single evaluation metric called the F measure:

  F1 = 1 / ( (1/P + 1/R) / 2 ) = 2PR / (P + R)
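These chunk-level metrics can be computed by treating each chunk as a span, so only exact matches count as correct. A minimal sketch (the example spans are invented):

```python
# Chunk-level precision, recall, and F1 over (start, end, type) spans.
def chunk_f1(predicted, gold):
    pred, gold_set = set(predicted), set(gold)
    correct = len(pred & gold_set)             # only exact span matches are correct
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

gold = [(0, 1, "NP"), (2, 6, "NP"), (9, 13, "NP")]
pred = [(0, 1, "NP"), (2, 6, "NP"), (9, 12, "NP"), (14, 15, "NP")]
print(chunk_f1(pred, gold))
```

Note that the near-miss span (9, 12) earns no partial credit, which is exactly why per-token accuracy overstates chunking quality.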
Current Chunking Results
• Best system for NP chunking: F1 = 96%
• Typical results for finding a range of chunk types (CoNLL-2000 shared task: NP, VP, PP, ADV, SBAR, ADJP) are F1 = 92−94%