# CS 479, section 1: Natural Language Processing


## Lecture #11: Language Model Smoothing and Interpolation

Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.

Licensed under Share Alike 3.0 Unported.

## Announcements

- Reading: M&S 6.3 to the end of ch. 6 (due Monday)
- Project #1, Part 1: build an interpolated language model
  - ASAP: work through the tutorial with your programming partner
  - Early: Wednesday
  - Due: next Friday

## Recap: Language Models

- What is the purpose of a language model?
- What do you think are the main challenges in building n-gram language models?

## Objectives

- Get comfortable with the process of factoring and smoothing a joint model of a familiar object: text!
- Motivate smoothing of language models
- Dig into linear interpolation as a method for smoothing
- Feel confident about how to use these techniques in Project #1, Part 1
- Discuss how to train interpolation weights

## Problem

Cause: sparsity

- New words appear all the time:
  - Synaptitute
  - 132,701.03
  - fuzzificational
- New bigrams: even more often
- Trigrams or larger: still worse!

What was the point of Zipf’s law for us? What will we do about it?

## Solution: Smoothing

- We often want to make predictions from sparse statistics
- Smoothing flattens distributions so they generalize better
- Very important all over NLP, but easy to do badly!

Example: counts of words observed after the history "denied the" (7 tokens total), before and after smoothing:

| w | raw count | smoothed count |
| --- | --- | --- |
| allegations | 3 | 2.5 |
| reports | 2 | 1.5 |
| claims | 1 | 0.5 |
| request | 1 | 0.5 |
| other (e.g., attack, man, outcome) | 0 | 2 |
| total | 7 | 7 |
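To see why the raw counts are a problem, here is a small sketch (not from the slides) of the unsmoothed maximum-likelihood estimate for P(w | denied the): any word never seen after this history gets probability exactly zero, which is what smoothing is meant to fix.

```python
from collections import Counter

# Raw counts of words observed after the history "denied the"
counts = Counter({"allegations": 3, "reports": 2, "claims": 1, "request": 1})
total = sum(counts.values())  # 7

def p_mle(word):
    """Unsmoothed maximum-likelihood estimate of P(word | 'denied the')."""
    return counts[word] / total

print(p_mle("allegations"))  # 3/7 ~= 0.43
print(p_mle("attack"))       # 0.0 -- unseen after this history, so MLE gives it no mass
```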

## Smoothing

Two approaches we will explore:

- Interpolation: combine multiple estimates to give probability mass to unseen events
  - Think: two heads are better than one!
  - Project 1.1
  - Today’s lecture
- Discounting: explicitly reserve mass for unseen events
  - Think: Robin Hood, who robbed from the rich to feed the poor!
  - Project 1.2
  - Next time

The two can be used in combination!

Back-off is a third option, but we won’t spend time on it.

## Interpolation

Idea: two heads are better than one

- i.e., combine a (less sparse) lower-order model with a higher-order model to get a more robust higher-order model

$$\hat{P}(w_i \mid w_{i-1}) = f\!\left(P(w_i \mid w_{i-1}),\ P(w_i),\ \frac{1}{|V|}\right)$$

That is, the smoothed bigram estimate combines the bigram estimate, the unigram estimate, and the uniform distribution over the vocabulary $V$.
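A minimal, runnable sketch of this idea in Python (an illustration, not the project's reference implementation): mix the bigram, unigram, and uniform estimates with fixed weights that sum to 1. The function names, toy corpus, and lambda values are made up for the example.

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def interp_prob(w, prev, unigrams, bigrams, lambdas=(0.6, 0.3, 0.1)):
    """P_hat(w | prev): convex mix of bigram, unigram, and uniform estimates.

    The lambda values are placeholders; in practice they are tuned on held-out data.
    """
    l_bi, l_uni, l_unif = lambdas                    # must be non-negative and sum to 1
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[w] / sum(unigrams.values())
    p_unif = 1.0 / len(unigrams)                     # crude |V|: observed vocabulary size
    return l_bi * p_bi + l_uni * p_uni + l_unif * p_unif

tokens = "the cat sat on the mat the cat ran".split()
unigrams, bigrams = train_counts(tokens)
print(interp_prob("sat", "cat", unigrams, bigrams))  # seen bigram: dominated by the bigram term
print(interp_prob("ran", "mat", unigrams, bigrams))  # unseen bigram: still gets nonzero mass
```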

## Convex, Linear Interpolation

- Convex: the interpolation constants are non-negative and sum to 1
- General linear interpolation (see the formula below): one interpolation coefficient per history and predicted word
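The slide's formula did not survive extraction; a standard way to write it, consistent with M&S ch. 6 and with the bullet above, gives each order-$k$ estimate a weight that may depend on the history $h$:

$$P_{\lambda}(w \mid h) = \sum_{k} \lambda_{k}(h)\, P_{k}(w \mid h), \qquad \lambda_{k}(h) \ge 0, \qquad \sum_{k} \lambda_{k}(h) = 1$$

In the most general form the weights may depend on the predicted word as well, i.e. $\lambda_{k}(h, w)$, though that many parameters are rarely practical to estimate.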

## Linear Interpolation

- The other extreme: a single global mixing weight (generally not ideal, but it works)
- Middle ground: different weights for classes of histories defined at other granularities (see the sketch after this list):
  - Bucket histories (and their weights) by count k: for each bucket k, have a weight λ(k)
  - Bucket histories by average count (better): for a range of buckets k through k+m, have a single weight
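A minimal sketch of the count-bucketing idea (the log-spaced bucket boundaries and the placeholder weight values are illustrative assumptions, not from the slides): histories whose training counts fall in the same bucket share one set of interpolation weights.

```python
import math

def count_bucket(history_count, num_buckets=8):
    """Map a history's training count to a bucket index; similar counts share one lambda vector."""
    # Log-spaced buckets are an illustrative choice, not prescribed by the lecture.
    return min(int(math.log2(history_count + 1)), num_buckets - 1)

# One (bigram, unigram, uniform) weight vector per bucket, e.g. tuned on held-out data.
lambdas_by_bucket = {b: (0.5, 0.4, 0.1) for b in range(8)}  # placeholder values

history_count = 37  # say, the training count of the history "fall into"
weights = lambdas_by_bucket[count_bucket(history_count)]
print(weights)
```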

## Example: Linear Interpolation

| history (h) | w | P3(w\|h) | P2(w\|h) | P1(w\|h) | interpolated |
| --- | --- | --- | --- | --- | --- |
| fall into | the | 0.30 | 0.5 | 0.030 | ? |
| fall into | a | 0.10 | 0.2 | 0.010 | ? |
| fall into | two | 0.00 | 0.0 | 0.001 | ? |
| fall into | `<other>` / `<OOV>` / `<UNK>` | 0.6 | 0.3 | 0.959 | ? |

Question: using the weights $\lambda_{3,\text{fall into}} = 0.1$, $\lambda_{2,\text{fall into}} = 0.5$, and $\lambda_{1,\text{fall into}} = 0.4$, how do you compute the combined, interpolated probabilities?
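One row worked out as a sanity check; the arithmetic follows directly from the table and the weights above:

$$P_{\text{interp}}(\text{the} \mid \text{fall into}) = 0.1 \times 0.30 + 0.5 \times 0.5 + 0.4 \times 0.030 = 0.03 + 0.25 + 0.012 = 0.292$$

Computed the same way, the remaining rows come out to 0.114 for "a", 0.0004 for "two", and 0.5936 for the `<other>` row; the four interpolated values sum to 1, as they must for a convex combination of probability distributions.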

## Learning the Weights

How?

## Tuning on Held-Out Data

Important tool for getting models to generalize: split the corpus into Training Data, Held-Out Data, and Test Data.

## Wisdom

"A cardinal sin in Statistical NLP is to test on your training data." (Manning & Schuetze, p. 206)

Corollary: "You should always eyeball the training data - you want to use your human pattern-finding abilities to get hints on how to proceed. You shouldn't eyeball the test data - that's cheating ..." (M&S, p. 207)


## Likelihood of the Data

- We want the joint probability of some data set under a model M trained from the training set
- Take the log(). Why?
- Distribute the log through the product
- Compare models using the log-likelihood function
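The equations on this slide did not survive extraction; a standard statement of the quantity, assuming data $w_1 \dots w_N$ and writing $h_i$ for the n-gram history of $w_i$, is:

$$\log P_M(w_1 \dots w_N) = \log \prod_{i=1}^{N} P_M(w_i \mid h_i) = \sum_{i=1}^{N} \log P_M(w_i \mid h_i)$$

Taking the log turns the product into a sum, which avoids numerical underflow and makes models easy to compare on the same data set.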

## Maximizing the Likelihood

- Situation: we have a small number of parameters $\lambda_1, \dots, \lambda_k$ that control the degree of smoothing
- Goal: set them to maximize the (log-)likelihood of held-out data
- Method: use any optimization technique (a small search sketch follows this list)
  - line search: easy, OK
  - EM (to be discussed later in this course)
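A minimal sketch of this search for three mixture weights (the held-out numbers and the coarse grid are illustrative assumptions, not project code): sweep weight settings on the probability simplex and keep whichever one gives the highest held-out log-likelihood.

```python
import math
from itertools import product

# Each held-out token is represented by its component probabilities under the
# already-trained bigram, unigram, and uniform models (illustrative numbers).
heldout = [
    (0.30, 0.050, 0.001),
    (0.00, 0.020, 0.001),
    (0.12, 0.010, 0.001),
    (0.00, 0.001, 0.001),
]

def heldout_loglik(l_bi, l_uni, l_unif):
    """Log-likelihood of the held-out tokens under the interpolated model."""
    ll = 0.0
    for p_bi, p_uni, p_unif in heldout:
        p = l_bi * p_bi + l_uni * p_uni + l_unif * p_unif
        if p <= 0.0:
            return float("-inf")  # a zero-probability token rules this weight setting out
        ll += math.log(p)
    return ll

# Coarse grid search over the probability simplex (a stand-in for line search or EM).
best_ll, best_weights = float("-inf"), None
for i, j in product(range(21), repeat=2):
    if i + j > 20:
        continue
    weights = (i / 20, j / 20, (20 - i - j) / 20)
    ll = heldout_loglik(*weights)
    if ll > best_ll:
        best_ll, best_weights = ll, weights

print("best (bigram, unigram, uniform) weights:", best_weights)
print("held-out log-likelihood:", round(best_ll, 3))
```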

## Tuning on Held-Out Data (revisited)

Important tool for getting models to generalize: the n-gram models are trained on the Training Data, the interpolation weights are chosen to maximize the log-likelihood (LL) on the Held-Out Data, and the Test Data is reserved for final evaluation.

## What’s Next

Upcoming lectures:

- Discounting strategies
- Reserving mass for the unknown word (UNK) and unseen n-grams, i.e., "open vocabulary"
- Speech recognition