# Natural Language Processing


Nov 7, 2013


COMPSCI 423/723

Rohit Kate

Conditional Random Fields
(CRFs) for Sequence Labeling

Some of the slides have been adapted from Raymond
Mooney’s NLP course at UT Austin.

Graphical Models

If no assumption of independence is made, then an
exponential number of parameters must be estimated

No realistic amount of training data is sufficient to estimate so
many parameters

If a blanket assumption of conditional independence is
made, efficient training and inference is possible, but
such a strong assumption is rarely warranted

Graphical models

use directed or undirected graphs
over a set of random variables to explicitly specify
variable dependencies and allow for less restrictive
independence assumptions while limiting the number of
parameters that must be estimated

Bayesian Networks: Directed acyclic graphs that indicate causal structure

Markov Networks: Undirected graphs that capture general dependencies

Bayesian Networks

Directed Acyclic Graph (DAG)

Nodes are random variables

Edges indicate causal influences

(Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls)

Conditional Probability Tables

Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case).

Roots (sources) of the DAG that have no parents are given prior
probabilities.

(Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls)

P(B) = .001        P(E) = .002

B  E  | P(A|B,E)
T  T  | .95
T  F  | .94
F  T  | .29
F  F  | .001

A | P(J|A)         A | P(M|A)
T | .90            T | .70
F | .05            F | .01

Joint Distributions for Bayes
Nets

A Bayesian network implicitly defines (factors) a joint distribution:

$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i))$

Example:

$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = P(J \mid A)\,P(M \mid A)\,P(A \mid \neg B, \neg E)\,P(\neg B)\,P(\neg E)$

$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00062$
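The factored joint above can be checked numerically. A minimal sketch, with the CPTs from the tables transcribed into Python dictionaries (variable and function names are mine):

```python
# Joint probability in the burglary network via the chain rule:
# P(x1..xn) = prod_i P(xi | Parents(Xi)), using the CPTs above.
cpt = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}
parents = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}

def joint(assignment):
    """Multiply each variable's CPT entry given its parents' values."""
    p = 1.0
    for var, val in assignment.items():
        prob = cpt[var][tuple(assignment[q] for q in parents[var])]
        p *= prob if val else 1.0 - prob
    return p

# P(J and M and A and not-B and not-E) from the example
print(f"{joint({'B': False, 'E': False, 'A': True, 'J': True, 'M': True}):.6f}")  # → 0.000628
```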

Naïve Bayes as a Bayes Net

Naïve Bayes is a simple Bayes Net

(Network: Y → X_1, X_2, ..., X_n)

Priors P(Y) and conditionals P(X_i | Y) for Naïve Bayes provide the CPTs for the network.

HMMs as Bayesian Network

The directed probabilistic graphical model for the random variables w_1 to w_n and t_1 to t_n, with the independence assumptions:

(Figure: a chain t_1 → t_2 → t_3 → ... → t_n with emissions t_i → w_i; the factors are P(t_1), P(t_2|t_1), P(t_3|t_2), ..., P(t_n|t_{n-1}) and P(w_1|t_1), P(w_2|t_2), P(w_3|t_3), ..., P(w_n|t_n))
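The HMM factorization above multiplies an initial tag probability, tag-transition probabilities, and word-emission probabilities. A small sketch with toy tables (the numbers are illustrative assumptions, not from the slides):

```python
# Joint P(w, t) under the HMM factorization shown above:
# P(t1) * P(w1|t1) * prod_{i>1} P(t_i | t_{i-1}) * P(w_i | t_i).
init = {"D": 0.6, "N": 0.4}                       # P(t1) — toy numbers
trans = {("D", "N"): 0.9, ("D", "D"): 0.1,
         ("N", "N"): 0.5, ("N", "D"): 0.5}        # P(t_i | t_{i-1})
emit = {("D", "the"): 0.7, ("N", "dog"): 0.3}     # P(w_i | t_i)

def hmm_joint(tags, words):
    p = init[tags[0]] * emit[(tags[0], words[0])]
    for i in range(1, len(tags)):
        p *= trans[(tags[i - 1], tags[i])] * emit[(tags[i], words[i])]
    return p

print(hmm_joint(["D", "N"], ["the", "dog"]))  # 0.6 * 0.7 * 0.9 * 0.3 ≈ 0.1134
```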

Drawbacks of HMMs

HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O, Q) and thus only indirectly model P(Q|O), which is what is needed for the sequence labeling task (O: observation sequence, Q: label sequence).

They can't use arbitrary features of the words (e.g. capitalization, prefixes, etc. that can help POS tagging) unless these are explicitly modeled as part of the observations.

Undirected Graphical Model

Also called Markov Network, Random Field

Undirected graph over a set of random variables,
where an edge represents a dependency

The Markov blanket of a node, X, in a Markov net is the set of its neighbors in the graph (nodes that have an edge connecting to X).

Every node in a Markov net is conditionally independent of every other node given its Markov blanket.


Sample Markov Network

(Network: Burglary — Alarm — Earthquake; Alarm — JohnCalls, Alarm — MaryCalls)

Distribution for a Markov
Network

The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions), φ_k, one for each clique k in the graph.

For each joint assignment of values to the variables in clique k, φ_k assigns a non-negative real value that represents the compatibility of these values.

The joint distribution of a Markov network is then defined by:

$P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_k \varphi_k(x_{\{k\}})$

where $x_{\{k\}}$ represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:

$Z = \sum_{x} \prod_k \varphi_k(x_{\{k\}})$

Sample Markov Network

(Network: Burglary — Alarm — Earthquake; Alarm — JohnCalls, Alarm — MaryCalls)

B  A  | φ1(B,A)        E  A  | φ2(E,A)
T  T  | 100            T  T  | 50
T  F  | 1              T  F  | 10
F  T  | 1              F  T  | 1
F  F  | 200            F  F  | 200

J  A  | φ3(J,A)        M  A  | φ4(M,A)
T  T  | 75             T  T  | 50
T  F  | 10             T  F  | 1
F  T  | 1              F  T  | 10
F  F  | 200            F  F  | 200

$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = \varphi_1(\neg B, A)\,\varphi_2(\neg E, A)\,\varphi_3(J, A)\,\varphi_4(M, A)\,/\,Z = (1 \times 1 \times 75 \times 50)\,/\,Z$
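The clique-potential computation can be reproduced by brute force, enumerating all 2^5 assignments to obtain Z (the potential tables are transcribed from above; names are mine):

```python
# Markov-net probability by brute force: unnormalized product of the
# clique potentials, normalized by Z summed over all 2^5 assignments.
from itertools import product

phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}  # (B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}  # (E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}  # (J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}  # (M, A)

def score(b, e, a, j, m):
    """Unnormalized compatibility of one full assignment."""
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

Z = sum(score(*v) for v in product([True, False], repeat=5))
p = score(False, False, True, True, True) / Z  # P(J, M, A, not-B, not-E) = 75*50/Z
print(p)
```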
Discriminative Markov Network
or Conditional Random Field

Directly models P(Y|X).

The potential functions could be based on arbitrary features of X and Y, and they are expressed as exponentials.

$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \varphi_k(y_{\{k\}}, x_{\{k\}})$

$Z(X) = \sum_{Y} \prod_k \varphi_k(y_{\{k\}}, x_{\{k\}})$

Random Field

(Undirected Graphical Model)

$P(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_k \varphi_k(v_{\{k\}})$, where $Z = \sum_{v} \prod_k \varphi_k(v_{\{k\}})$

Conditional Random Field (CRF)

$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \varphi_k(y_{\{k\}}, x_{\{k\}})$, where $Z(X) = \sum_{Y} \prod_k \varphi_k(y_{\{k\}}, x_{\{k\}})$

Two types of variables, x and y; there is no factor with only x variables.

(Figures: a general random field over nodes v_1, ..., v_n, and a CRF with outputs Y_1, Y_2, ..., Y_n conditioned on inputs X_1, X_2, ..., X_n)

Linear-Chain Conditional Random Field (CRF)

(Figure: outputs Y_1, Y_2, ..., Y_n connected in a chain, all conditioned on inputs X_1, X_2, ..., X_n)

$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \varphi_k(y_i, y_{i-1}, x_{\{k\}})$

$Z(X) = \sum_{Y} \prod_k \varphi_k(y_i, y_{i-1}, x_{\{k\}})$

Ys are connected in a linear chain.

Logistic Regression as a
Simplest CRF

Logistic regression is a simple CRF with
only one output variable

(Network: a single output Y conditioned on X_1, X_2, ..., X_n)

Models the conditional distribution P(Y | X) and not the full joint P(X, Y).

Simplification Assumption for
MaxEnt

The probability P(Y|X_1..X_n) can be factored as:

$P(c \mid X) = \frac{\exp\left(\sum_{i=0}^{N} w_{ci}\, f_i(c, x)\right)}{\sum_{c' \in \mathrm{Classes}} \exp\left(\sum_{i=0}^{N} w_{c'i}\, f_i(c', x)\right)}$
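The factored form can be sketched as a small softmax over per-class feature scores; the two indicator features and their weights below are toy assumptions:

```python
# MaxEnt / multinomial logistic regression: score each class with a
# weighted sum of feature values, then normalize with a softmax.
import math

def maxent_prob(c, x, classes, weights, features):
    """P(c|x) = exp(sum_i w_ci * f_i(c,x)) / sum_c' exp(sum_i w_c'i * f_i(c',x))."""
    def score(cls):
        return math.exp(sum(w * f(cls, x) for w, f in zip(weights[cls], features)))
    return score(c) / sum(score(cp) for cp in classes)

# Toy example: two classes (tags), two indicator features
features = [
    lambda c, x: 1.0 if c == "NN" and x.endswith("s") else 0.0,
    lambda c, x: 1.0 if c == "VB" and x.endswith("ed") else 0.0,
]
weights = {"NN": [2.0, 0.0], "VB": [0.0, 2.0]}
p = maxent_prob("NN", "dogs", ["NN", "VB"], weights, features)
print(round(p, 3))  # → 0.881
```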

Generative vs. Discriminative

Sequence Labeling Models

HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O, Q).

HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task.

Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling. They model the conditional distribution P(Q | O).

Classification

(Figure: Naïve Bayes (generative) versus logistic regression (discriminative, conditional), both over Y and X_1, X_2, ..., X_n)

Sequence Labeling

(Figure: an HMM (generative) versus a linear-chain CRF (discriminative, conditional), both over labels Y_1, Y_2, ..., Y_T and observations X_1, X_2, ..., X_T)

Simple Linear Chain CRF
Features

Modeling the conditional distribution is similar to that used in multinomial logistic regression.

Create feature functions f_k(Y_t, Y_{t−1}, X_t)

Feature for each state transition pair i, j:

f_{i,j}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and Y_{t−1} = j, and 0 otherwise

Feature for each state-observation pair i, o:

f_{i,o}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise

Note: the number of features grows quadratically in the number of states (i.e. tags).
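The two kinds of indicator features can be written directly as small closures; the tag set and vocabulary below are toy assumptions:

```python
# Indicator feature functions for a linear-chain CRF: one per
# tag-transition pair (i, j) and one per tag-observation pair (i, o).
def make_transition_feature(i, j):
    """f_ij(y_t, y_prev, x_t) = 1 iff y_t == i and y_prev == j."""
    return lambda y_t, y_prev, x_t: 1.0 if y_t == i and y_prev == j else 0.0

def make_observation_feature(i, o):
    """f_io(y_t, y_prev, x_t) = 1 iff y_t == i and x_t == o."""
    return lambda y_t, y_prev, x_t: 1.0 if y_t == i and x_t == o else 0.0

tags = ["DT", "NN", "VB"]
vocab = ["the", "dog", "barks"]
# Quadratic in the number of tags, plus |tags| * |vocab| observation features
feats = [make_transition_feature(i, j) for i in tags for j in tags]
feats += [make_observation_feature(i, o) for i in tags for o in vocab]

f = make_transition_feature("NN", "DT")
print(f("NN", "DT", "dog"), f("VB", "DT", "dog"))  # → 1.0 0.0
```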


Conditional Distribution for

Linear Chain CRF

Using these feature functions for a simple linear chain CRF, we can define:

$P(Y \mid X) = \frac{1}{Z(X)} \exp\!\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)$

$Z(X) = \sum_{Y} \exp\!\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)$
CRF

(Figure: a CRF in which each token is represented by m features X_{t,1}, ..., X_{t,m}, using each token feature to model the conditional distribution)
Features in POS Tagging

For POS Tagging, use lexicographic
features of tokens.

Capitalized?

Ends in given suffix (e.g. “s”, “ed”, “ly”)?


Enhanced Linear Chain CRF

(standard approach)

Can also condition transition on the
current token features.


(Figure: a linear-chain CRF over Y_1, ..., Y_T in which transitions are also conditioned on the token features X_{t,1}, ..., X_{t,m})

f_{i,j,k}(Y_t, Y_{t−1}, X) = 1 if Y_t = i and Y_{t−1} = j and X_{t−1,k} = 1, and 0 otherwise

Supervised Learning

(Parameter Estimation)

As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the CLL (conditional log-likelihood) of the supervised training data.

Sequence Tagging

(Inference)

A variant of the dynamic programming (Viterbi) algorithm can be used to efficiently, in O(TN²), determine the globally most probable label sequence for a given token sequence, using a given log-linear model of the conditional probability P(Y | X).

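The O(TN²) Viterbi recurrence can be sketched as follows; the scoring function stands in for the weighted feature sum and is a toy assumption:

```python
# Viterbi decoding for a linear-chain CRF: O(T * N^2) dynamic program
# over additive log-potentials score(y_prev, y, x_t).
def viterbi(x, tags, score):
    """Return the most probable tag sequence under additive log-scores."""
    T = len(x)
    # best[t][y] = max log-score of any prefix ending in tag y at position t
    best = [{y: score(None, y, x[0]) for y in tags}]
    back = []
    for t in range(1, T):
        col, ptr = {}, {}
        for y in tags:
            prev = max(tags, key=lambda yp: best[t - 1][yp] + score(yp, y, x[t]))
            col[y] = best[t - 1][prev] + score(prev, y, x[t])
            ptr[y] = prev
        best.append(col)
        back.append(ptr)
    # Trace back from the best final tag
    y = max(tags, key=lambda yt: best[T - 1][yt])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# Toy scorer: prefer "B" on capitalized words, "I" after "B", small "O" bonus
def toy_score(y_prev, y, word):
    s = 0.0
    if word[0].isupper() and y == "B":
        s += 2.0
    if y_prev == "B" and y == "I":
        s += 1.0
    if y == "O":
        s += 1.5
    return s

print(viterbi(["John", "Smith", "runs"], ["B", "I", "O"], toy_score))  # → ['B', 'B', 'O']
```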

Skip-Chain CRFs

Can model some long-distance dependencies (i.e. the same word appearing in different parts of the text) by including long-distance edges in the Markov model.

(Figure: a linear chain Y_1, Y_2, Y_3 over "Michael Dell said", with a skip edge connecting to Y_100, Y_101 over "Dell bought")

Inference then becomes intractable, so one must resort to approximate inference to try to find the most probable labeling.

CRF Results

Experimental results verify that they have superior accuracy on various sequence labeling tasks:

Part of speech tagging

Noun phrase chunking

Named entity recognition

Semantic role labeling

However, CRFs are much slower to train and do not scale as well to large amounts of training data.

Training for POS on the full Penn Treebank (~1M words) currently takes "over a week."

Skip-chain CRFs improve results on IE.

CRF Summary

CRFs are a discriminative approach to sequence labeling
whereas HMMs are generative

Discriminative methods are usually more accurate since
they are trained for a specific performance task

Training time is increased since a complex optimization
procedure is needed to fit supervised training data

CRFs are a state-of-the-art method for sequence labeling.


Phrase Structure

Most languages have a word order.

Words are organized into phrases, groups of words that act as a single unit or a constituent.

[The dog] [chased] [the cat].

[The fat dog] [chased] [the thin cat].

[The fat dog with red collar] [chased] [the
thin old cat].

[The fat dog with red collar named Tom]
[suddenly chased] [the thin old white cat].

Phrases

Noun phrase: A syntactic unit of a sentence which acts like a noun and in which a noun is usually embedded, called its head. An optional determiner followed by zero or more prepositional phrases.

Prepositional phrase: Headed by a preposition; expresses spatial, temporal or other attributes.

Verb phrase: The part of the sentence that depends on the verb. Headed by the verb.

Phrase Chunking

Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.

[NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].

[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]

Some applications need all the noun
phrases in a sentence

Phrase Chunking as
Sequence Labeling

Tag individual words with one of 3 tags:

B (Begin): word starts a new target phrase

I (Inside): word is part of the target phrase but not its first word

O (Other): word is not part of a target phrase

Sample for NP chunking:

He/B reckons/O the/B current/I account/I deficit/I will/O narrow/O to/O only/B #/I 1.8/I billion/I in/O September/B ./O
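The B/I/O scheme makes chunk recovery mechanical: a sketch that converts a tag sequence back into phrase spans (function name is mine):

```python
# Recovering chunk spans from a BIO tag sequence: "B" opens a chunk,
# "I" continues it, "O" closes any open chunk.
def bio_to_chunks(words, tags):
    """Return chunks as (start, end) word-index spans (end exclusive)."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                chunks.append((start, i))
            start = None
        # tag == "I": current chunk continues
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks

words = ["He", "reckons", "the", "current", "account", "deficit"]
tags = ["B", "O", "B", "I", "I", "I"]
print([" ".join(words[s:e]) for s, e in bio_to_chunks(words, tags)])
# → ['He', 'the current account deficit']
```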

Evaluating Chunking

Per-token accuracy does not evaluate finding complete chunks.

Precision = (number of correct chunks found) / (total number of chunks found)

Recall = (number of correct chunks found) / (total number of actual chunks)

Take the harmonic mean to produce a single evaluation metric called the F measure:

$F = \frac{1}{\frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)} = \frac{2PR}{P + R}$
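Chunk-level precision, recall, and F measure can be computed by exact span matching; the gold and predicted spans below are toy assumptions:

```python
# Chunk-level P/R/F1: a found chunk counts as correct only if it exactly
# matches a gold chunk (same start and end).
def chunk_prf(gold, found):
    correct = len(set(gold) & set(found))
    p = correct / len(found) if found else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = [(0, 1), (2, 6), (9, 13)]    # gold NP spans (toy)
found = [(0, 1), (2, 6), (9, 12)]   # system output: last span is wrong
p, r, f = chunk_prf(gold, found)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.667 0.667
```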
Current Chunking Results

Best system for NP chunking: F1 = 96%

Typical results for finding a range of chunk types (CoNLL 2000 shared task: