Natural Language Processing


COMPSCI 423/723

Rohit Kate

Conditional Random Fields (CRFs) for Sequence Labeling

Some of the slides have been adapted from Raymond Mooney’s NLP course at UT Austin.

Graphical Models


If no assumption of independence is made, then an exponential number of parameters must be estimated

No realistic amount of training data is sufficient to estimate so many parameters

If a blanket assumption of conditional independence is made, efficient training and inference is possible, but such a strong assumption is rarely warranted

Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies and allow for less restrictive independence assumptions while limiting the number of parameters that must be estimated

Bayesian Networks: Directed acyclic graphs that indicate causal structure

Markov Networks: Undirected graphs that capture general dependencies

Bayesian Networks


Directed Acyclic Graph (DAG)

Nodes are random variables

Edges indicate causal influences

[Figure: burglar-alarm network — Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]

Conditional Probability Tables


Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case).

Roots (sources) of the DAG that have no parents are given prior probabilities.

[Figure: burglar-alarm network annotated with CPTs]

P(B) = .001        P(E) = .002

 B   E  | P(A|B,E)
 T   T  |   .95
 T   F  |   .94
 F   T  |   .29
 F   F  |   .001

 A | P(J|A)        A | P(M|A)
 T |  .90          T |  .70
 F |  .05          F |  .01

Joint Distributions for Bayes Nets

A Bayesian Network implicitly defines (factors) a joint distribution:

P(x1, x2, ..., xn) = ∏_{i=1..n} P(xi | Parents(Xi))

Example:

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
  = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
  = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
  ≈ 0.00062
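
As a quick check of this factorization, here is a minimal Python sketch that multiplies the relevant CPT entries from the alarm network above (the dictionary layout and variable names are illustrative):

# Minimal sketch: joint probability in the alarm Bayes net via the chain-rule
# factorization P(x1..xn) = prod_i P(xi | Parents(Xi)).

P_B = {True: 0.001, False: 0.999}               # P(Burglary)
P_E = {True: 0.002, False: 0.998}               # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94, # P(Alarm=T | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                 # P(JohnCalls=T | Alarm)
P_M = {True: 0.70, False: 0.01}                 # P(MaryCalls=T | Alarm)

def p_alarm(a, b, e):
    p = P_A[(b, e)]
    return p if a else 1.0 - p

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) = P(j|a) P(m|a) P(a|b,e) P(b) P(e)."""
    pj = P_J[a] if j else 1.0 - P_J[a]
    pm = P_M[a] if m else 1.0 - P_M[a]
    return pj * pm * p_alarm(a, b, e) * P_B[b] * P_E[e]

print(joint(True, True, True, False, False))    # ~0.000628, the value quoted above as 0.00062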






Naïve Bayes as a Bayes Net


Naïve Bayes is a simple Bayes Net

[Figure: class node Y with directed edges to features X1, X2, ..., Xn]

Priors P(Y) and conditionals P(Xi|Y) for Naïve Bayes provide CPTs for the network

HMMs as Bayesian Network


The directed probabilistic graphical model for the random variables w1 to wn and t1 to tn, with the independence assumptions:

[Figure: tag chain t1 → t2 → t3 → ... → tn, with an edge from each ti to its word wi; the factors attached to the edges are P(t1), P(t2|t1), P(t3|t2), ..., P(tn|tn-1) and P(w1|t1), P(w2|t2), P(w3|t3), ..., P(wn|tn)]
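
Reading the joint distribution off this network (each node conditioned on its parents) gives P(t1, ..., tn, w1, ..., wn) = P(t1) P(w1|t1) ∏_{i=2..n} P(ti|ti-1) P(wi|ti). A minimal Python sketch, with made-up transition and emission tables:

# Minimal sketch: HMM joint probability P(tags, words) read off the Bayes net,
# i.e. P(t1) P(w1|t1) * prod_i P(ti|t_{i-1}) P(wi|ti).
# The tiny tables below are illustrative, not from the slides.

start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}                              # P(t1)
trans = {("DT", "NN"): 0.8, ("NN", "VB"): 0.5, ("VB", "DT"): 0.4}      # P(ti | ti-1)
emit  = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.05}  # P(wi | ti)

def hmm_joint(tags, words):
    p = start.get(tags[0], 0.0) * emit.get((tags[0], words[0]), 0.0)
    for i in range(1, len(tags)):
        p *= trans.get((tags[i - 1], tags[i]), 0.0)
        p *= emit.get((tags[i], words[i]), 0.0)
    return p

print(hmm_joint(["DT", "NN", "VB"], ["the", "dog", "barks"]))   # 0.00084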

Drawbacks of HMMs


HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O, Q) and thus only indirectly model P(Q|O), which is what is needed for the sequence labeling task (O: observation sequence, Q: label sequence)

Can't use arbitrary features related to the words (e.g. capitalization, prefixes, etc. that can help POS tagging) unless these are explicitly modeled as part of observations





Undirected Graphical Model


Also called Markov Network, Random Field

Undirected graph over a set of random variables, where an edge represents a dependency

The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X)

Every node in a Markov Net is conditionally independent of every other node given its Markov blanket



Sample Markov Network

[Figure: undirected network over Burglary, Earthquake, Alarm, JohnCalls, MaryCalls, with edges Burglary–Alarm, Earthquake–Alarm, Alarm–JohnCalls, Alarm–MaryCalls]

Distribution for a Markov Network

The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions), φ_k, one for each clique, k, in the graph.

For each joint assignment of values to the variables in clique k, φ_k assigns a non-negative real value that represents the compatibility of these values.

The joint distribution of a Markov network is then defined by:

P(x1, x2, ..., xn) = (1/Z) ∏_k φ_k(x_{k})

where x_{k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:

Z = Σ_x ∏_k φ_k(x_{k})

Sample Markov Network

[Figure: the same undirected network over Burglary, Earthquake, Alarm, JohnCalls, MaryCalls, with one potential per edge]

 B   A  | φ1          E   A  | φ2
 T   T  | 100         T   T  |  50
 T   F  |   1         T   F  |  10
 F   T  |   1         F   T  |   1
 F   F  | 200         F   F  | 200

 J   A  | φ3          M   A  | φ4
 T   T  |  75         T   T  |  50
 T   F  |  10         T   F  |   1
 F   T  |   1         F   T  |  10
 F   F  | 200         F   F  | 200

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = φ1(F,T) φ2(F,T) φ3(T,T) φ4(T,T) / Z = (1 × 1 × 75 × 50) / Z
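
A small Python sketch of this calculation, with the potential tables copied from the slide; Z is computed by summing the product of potentials over all 2^5 assignments:

# Minimal sketch: computing Z and P(J, M, A, ~B, ~E) for the sample Markov net
# by brute-force enumeration of all 2^5 assignments (argument order: B, E, A, J, M).
from itertools import product

phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}  # (B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}  # (E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}  # (J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}  # (M, A)

def score(b, e, a, j, m):
    # Unnormalized probability: product of the edge potentials.
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

Z = sum(score(*assignment) for assignment in product([True, False], repeat=5))
print(score(False, False, True, True, True) / Z)   # P(J and M and A and not B and not E)
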
Discriminative Markov Network or Conditional Random Field

Directly models P(Y|X)

The potential functions could be based on arbitrary features of X and Y and they are expressed as exponentials

P(y1, y2, ..., ym | x1, x2, ..., xn) = (1/Z(X)) ∏_k φ_k(y_{k}, x_{k})

Z(X) = Σ_Y ∏_k φ_k(y_{k}, x_{k})

Random Field (Undirected Graphical Model)

P(v1, v2, ..., vn) = (1/Z) ∏_k φ_k(v_{k})

Z = Σ_v ∏_k φ_k(v_{k})

[Figure: an undirected graph over variables v1, v2, v3, v4, ..., v10, ..., vn]

Conditional Random Field (CRF)

P(y1, y2, ..., ym | x1, x2, ..., xn) = (1/Z(X)) ∏_k φ_k(y_{k}, x_{k})

Z(X) = Σ_Y ∏_k φ_k(y_{k}, x_{k})

Two types of variables x & y; there is no factor with only x variables

[Figure: output variables Y1, Y2, Y3, ..., Yn in an undirected graph, all conditioned on the inputs X1, X2, ..., Xn]

Linear-Chain Conditional Random Field (CRF)

[Figure: Y1 — Y2 — ... — Yn connected in a chain, all conditioned on the inputs X1, X2, ..., Xn]

P(y1, y2, ..., ym | x1, x2, ..., xn) = (1/Z(X)) ∏_k φ_k(y_i, y_{i+1}, x_{k})

Z(X) = Σ_Y ∏_k φ_k(y_i, y_{i+1}, x_{k})

Ys are connected in a linear chain

Logistic Regression as a Simplest CRF

Logistic regression is a simple CRF with only one output variable

[Figure: single output Y conditioned on features X1, X2, ..., Xn]

Models the conditional distribution, P(Y | X), and not the full joint P(X, Y)

Simplification Assumption for MaxEnt

The probability P(Y|X1..Xn) can be factored as:

P(c | X) = exp( Σ_{i=0}^{N} w_{ci} f_i(c, x) ) / Σ_{c' ∈ Classes} exp( Σ_{i=0}^{N} w_{c'i} f_i(c', x) )
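
A minimal Python sketch of this softmax form (the weights, classes, and feature function below are illustrative, not from the slides):

# Minimal sketch: MaxEnt / multinomial logistic regression,
# P(c|x) proportional to exp(sum_i w_{c,i} * f_i(c, x)), normalized over all classes.
import math

def maxent_prob(c, x, classes, weights, features):
    """weights[cls][i] is w_{cls,i}; features(cls, x) returns the feature vector f(cls, x)."""
    def score(cls):
        return math.exp(sum(w * f for w, f in zip(weights[cls], features(cls, x))))
    return score(c) / sum(score(cls) for cls in classes)

# Illustrative usage with two classes and two features (suffix feature + bias):
classes = ["NN", "VB"]
weights = {"NN": [0.2, 1.0], "VB": [1.5, -0.2]}
features = lambda cls, x: [1.0 if x.endswith("ing") else 0.0, 1.0]
print(maxent_prob("VB", "running", classes, weights, features))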

Generative vs. Discriminative Sequence Labeling Models

HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O, Q)

HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task

Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling. They model the conditional distribution P(Q | O)

Classification

[Figure: Naïve Bayes — a generative model of Y and X1, X2, ..., Xn; Logistic Regression — the conditional (discriminative) counterpart, modeling Y given X1, X2, ..., Xn]

Sequence Labeling

[Figure: HMM — a generative model over Y1, Y2, ..., YT and X1, X2, ..., XT; Linear-chain CRF — the conditional (discriminative) counterpart, modeling Y1, Y2, ..., YT given X1, X2, ..., XT]

Simple Linear Chain CRF Features

Modeling the conditional distribution is similar to that used in multinomial logistic regression.

Create feature functions f_k(Y_t, Y_{t−1}, X_t)

Feature for each state transition pair i, j:
f_{i,j}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and Y_{t−1} = j, and 0 otherwise

Feature for each state observation pair i, o:
f_{i,o}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise

Note: the number of features grows quadratically in the number of states (i.e. tags).
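
A minimal Python sketch of these two indicator-feature families (the tag set and vocabulary are illustrative):

# Minimal sketch: indicator feature functions for a linear-chain CRF.
# Each feature looks at the current tag, the previous tag, and the current token.

def transition_feature(i, j):
    """f_{i,j}(y_t, y_prev, x_t) = 1 if y_t == i and y_prev == j, else 0."""
    return lambda y_t, y_prev, x_t: 1.0 if (y_t == i and y_prev == j) else 0.0

def observation_feature(i, o):
    """f_{i,o}(y_t, y_prev, x_t) = 1 if y_t == i and x_t == o, else 0."""
    return lambda y_t, y_prev, x_t: 1.0 if (y_t == i and x_t == o) else 0.0

tags = ["DT", "NN", "VB"]          # illustrative tag set
vocab = ["the", "dog", "barks"]    # illustrative vocabulary

# One transition feature per tag pair (quadratic in the number of tags),
# plus one observation feature per (tag, word) pair.
features = [transition_feature(i, j) for i in tags for j in tags]
features += [observation_feature(i, o) for i in tags for o in vocab]

print(features[0]("DT", "DT", "the"))   # 1.0: y_t = DT and y_prev = DT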

Conditional Distribution for Linear Chain CRF

Using these feature functions for a simple linear chain CRF, we can define:

P(Y | X) = (1/Z(X)) exp( Σ_{t=1}^{T} Σ_{m=1}^{M} λ_m f_m(Y_t, Y_{t−1}, X_t) )

Z(X) = Σ_Y exp( Σ_{t=1}^{T} Σ_{m=1}^{M} λ_m f_m(Y_t, Y_{t−1}, X_t) )
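
For a short sentence this definition can be checked by brute force: enumerate every candidate tag sequence to compute Z(X) exactly. A small Python sketch with two illustrative features and made-up λ weights:

# Minimal sketch: brute-force P(Y|X) for a linear-chain CRF on a short sentence,
# enumerating every candidate tag sequence to compute Z(X) exactly.
# The two features and their lambda weights are illustrative.
import math
from itertools import product

tags = ["DT", "NN", "VB"]

# f_1: current tag is NN and previous tag is DT; f_2: current tag is VB and word is "barks"
features = [
    lambda y_t, y_prev, x_t: 1.0 if (y_t == "NN" and y_prev == "DT") else 0.0,
    lambda y_t, y_prev, x_t: 1.0 if (y_t == "VB" and x_t == "barks") else 0.0,
]
weights = [1.5, 2.0]   # illustrative lambda_m values

def sequence_score(y_seq, x_seq):
    """sum_t sum_m lambda_m * f_m(y_t, y_{t-1}, x_t), using a dummy start tag."""
    total, prev = 0.0, "<S>"
    for y_t, x_t in zip(y_seq, x_seq):
        total += sum(w * f(y_t, prev, x_t) for w, f in zip(weights, features))
        prev = y_t
    return total

def crf_prob(y_seq, x_seq):
    Z = sum(math.exp(sequence_score(cand, x_seq))
            for cand in product(tags, repeat=len(x_seq)))
    return math.exp(sequence_score(y_seq, x_seq)) / Z

print(crf_prob(["DT", "NN", "VB"], ["the", "dog", "barks"]))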

Adding Token Features to a CRF

Can add token features X_{i,j}

[Figure: each position t now has a bank of token features X_{t,1}, ..., X_{t,m} feeding the corresponding label Y_t in the chain Y1, Y2, ..., YT]

Can add additional feature functions for each token feature to model the conditional distribution.

Features in POS Tagging

For POS tagging, use lexicographic features of tokens:

Capitalized?

Starts with a numeral?

Ends in a given suffix (e.g. “s”, “ed”, “ly”)?
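
A small Python sketch of such a token-feature extractor (the particular feature names are illustrative):

# Minimal sketch: lexicographic token features of the kind listed above,
# returned as a dict so they can be fed to a feature-based tagger.
def token_features(word):
    return {
        "capitalized": 1.0 if word[:1].isupper() else 0.0,
        "starts_with_digit": 1.0 if word[:1].isdigit() else 0.0,
        "suffix_s": 1.0 if word.endswith("s") else 0.0,
        "suffix_ed": 1.0 if word.endswith("ed") else 0.0,
        "suffix_ly": 1.0 if word.endswith("ly") else 0.0,
        "lower": word.lower(),          # the word form itself, lowercased
    }

print(token_features("Quickly"))   # capitalized and suffix_ly fire for this token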

Enhanced Linear Chain CRF (standard approach)

Can also condition transition on the current token features.

[Figure: the label chain Y1, Y2, ..., YT, where each position's token features X_{t,1}, ..., X_{t,m} feed both the label and the transition into it]

Add feature functions:

f_{i,j,k}(Y_t, Y_{t−1}, X) = 1 if Y_t = i and Y_{t−1} = j and X_{t−1,k} = 1, and 0 otherwise

Supervised Learning (Parameter Estimation)

As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the conditional log-likelihood (CLL) of the supervised training data
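
In practice one way to do this is with the third-party sklearn-crfsuite package (a wrapper around CRFsuite) rather than a hand-rolled optimizer; a rough sketch, assuming that package is installed and using a toy training set:

# Rough sketch: fitting a linear-chain CRF by L-BFGS with the third-party
# sklearn-crfsuite package (pip install sklearn-crfsuite). The tiny training
# set and feature extractor below are illustrative.
import sklearn_crfsuite

def token_features(word):
    return {"lower": word.lower(),
            "capitalized": 1.0 if word[:1].isupper() else 0.0,
            "suffix_ly": 1.0 if word.endswith("ly") else 0.0}

train_sentences = [["The", "dog", "barks"], ["Dogs", "bark", "loudly"]]
train_tags = [["DT", "NN", "VB"], ["NNS", "VBP", "RB"]]

X_train = [[token_features(w) for w in sent] for sent in train_sentences]
y_train = train_tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)                      # maximizes the (regularized) CLL
print(crf.predict([[token_features(w) for w in ["The", "cat", "barks"]]]))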

Sequence Tagging (Inference)

A variant of the dynamic programming (Viterbi) algorithm can be used to efficiently, in O(TN²), determine the globally most probable label sequence for a given token sequence using a given log-linear model of the conditional probability P(Y | X)
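
A minimal Python sketch of Viterbi decoding; the score function here is an illustrative stand-in for the CRF's Σ_m λ_m f_m(Y_t, Y_{t−1}, X_t) term at each position:

# Minimal sketch: Viterbi decoding for a linear-chain model, O(T * N^2).
# score(t, y_prev, y) returns the (log-space) score of choosing tag y at
# position t given previous tag y_prev.

def viterbi(sentence, tags, score, start_tag="<S>"):
    T = len(sentence)
    best = [{} for _ in range(T)]        # best[t][y] = best score of a path ending in y at t
    back = [{} for _ in range(T)]        # back-pointers
    for y in tags:                       # initialization from the dummy start tag
        best[0][y] = score(0, start_tag, y)
    for t in range(1, T):
        for y in tags:
            cands = [(best[t - 1][y_prev] + score(t, y_prev, y), y_prev) for y_prev in tags]
            best[t][y], back[t][y] = max(cands)
    # Follow back-pointers from the best final tag.
    y = max(tags, key=lambda tag: best[T - 1][tag])
    path = [y]
    for t in range(T - 1, 0, -1):
        y = back[t][y]
        path.append(y)
    return list(reversed(path))

# Illustrative usage: favor DT->NN and NN->VB transitions, plus one word cue.
tags = ["DT", "NN", "VB"]
sentence = ["the", "dog", "barks"]
def score(t, y_prev, y):
    s = {("<S>", "DT"): 2.0, ("DT", "NN"): 2.0, ("NN", "VB"): 2.0}.get((y_prev, y), 0.0)
    return s + (1.0 if (y == "VB" and sentence[t] == "barks") else 0.0)
print(viterbi(sentence, tags, score))    # ['DT', 'NN', 'VB']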

Skip-Chain CRFs

Can model some long-distance dependencies (i.e. the same word appearing in different parts of the text) by including long-distance edges in the Markov model.

[Figure: a linear chain Y1, Y2, Y3 over "Michael Dell said ..." with a skip edge connecting Y2 (the first "Dell", at X2) to Y100 (the second "Dell", at X100, in "... Dell bought ...", followed by X101/Y101)]

Additional links make exact inference intractable, so must resort to approximate inference to try to find the most probable labeling.

CRF Results

Experimental results verify that they have superior accuracy on various sequence labeling tasks:

Part of Speech tagging

Noun phrase chunking

Named entity recognition

Semantic role labeling

However, CRFs are much slower to train and do not scale as well to large amounts of training data

Training for POS on the full Penn Treebank (~1M words) currently takes “over a week.”

Skip-chain CRFs improve results on IE



CRF Summary

CRFs are a discriminative approach to sequence labeling whereas HMMs are generative

Discriminative methods are usually more accurate since they are trained for a specific performance task

CRFs also easily allow adding additional token features without making additional independence assumptions

Training time is increased since a complex optimization procedure is needed to fit the supervised training data

CRFs are a state-of-the-art method for sequence labeling

Phrase Structure

Most languages have a word order

Words are organized into phrases, groups of words that act as a single unit or constituent

[The dog] [chased] [the cat].

[The fat dog] [chased] [the thin cat].

[The fat dog with red collar] [chased] [the thin old cat].

[The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].

Phrases

Noun phrase: A syntactic unit of a sentence which acts like a noun and in which a noun, called its head, is usually embedded. Typically an optional determiner followed by zero or more adjectives, a noun head, and zero or more prepositional phrases

Prepositional phrase: Headed by a preposition; expresses spatial, temporal or other attributes

Verb phrase: The part of the sentence that depends on the verb. Headed by the verb.

Adjective phrase: Acts like an adjective.


Phrase Chunking

Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.

[NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].

[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]

Some applications need all the noun phrases in a sentence


Phrase Chunking as Sequence Labeling

Tag individual words with one of 3 tags:

B (Begin): word starts a new target phrase

I (Inside): word is part of the target phrase but not the first word

O (Other): word is not part of a target phrase

Sample for NP chunking (B = Begin, I = Inside, O = Other):

He/B reckons/O the/B current/I account/I deficit/I will/O narrow/O to/O only/B #/I 1.8/I billion/I in/O September/B ./O

Evaluating Chunking

Per-token accuracy does not evaluate finding correct full chunks. Instead use:

Precision = (Number of correct chunks found) / (Total number of chunks found)

Recall = (Number of correct chunks found) / (Total number of actual chunks)

Take the harmonic mean to produce a single evaluation metric called F measure:

F = 1 / ((1/P + 1/R) / 2) = 2PR / (P + R)
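
A small Python sketch of these chunk-level metrics; representing each chunk as a (start, end, type) span is an assumption, not something specified on the slide:

# Minimal sketch: chunk-level precision, recall, and F1.
# Chunks are compared as exact (start, end, type) spans; a found chunk is
# "correct" only if an identical span appears in the gold standard.
def chunk_prf(gold_chunks, found_chunks):
    gold, found = set(gold_chunks), set(found_chunks)
    correct = len(gold & found)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative spans over "He reckons the current account deficit ..."
gold = [(0, 1, "NP"), (2, 6, "NP"), (9, 13, "NP"), (14, 15, "NP")]
found = [(0, 1, "NP"), (2, 6, "NP"), (9, 12, "NP")]
print(chunk_prf(gold, found))   # (0.666..., 0.5, 0.571...)
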
Current Chunking Results

Best system for NP chunking: F1 = 96%

Typical results for finding a range of chunk types (CoNLL 2000 shared task: NP, VP, PP, ADV, SBAR, ADJP) are F1 = 92−94%