# A fast and simple neural probabilistic language model for natural language processing

AI and Robotics

Oct 24, 2013


Presenter: Yifei Guo

Supervisor: David Barber

Statistical language model

Goal:

- Model the joint distribution of words in a sentence.
- Predict the next word Wn from the preceding n-1 words, called the context.

Example: the n-gram algorithm

- Pro: simplicity. Conditional probability tables for P(Wn | context) are estimated by smoothing word n-tuple counts.
- Con: the curse of dimensionality. E.g., modelling 10 words jointly with a vocabulary of 10,000 leads to 10,000^10 free parameters.
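The n-gram idea can be sketched with a toy bigram model. The corpus and the add-one smoothing choice below are illustrative, not taken from the slides:

```python
from collections import Counter

# Toy bigram language model with add-one (Laplace) smoothing.
corpus = "the cat is walking in the kitchen".split()
vocab = sorted(set(corpus))
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, context):
    """P(word | context) estimated from smoothed bigram counts."""
    return (bigrams[(context, word)] + 1) / (unigrams[context] + V)

# Smoothing gives nonzero probability even to unseen word pairs.
print(p_next("cat", "the"))      # seen bigram
print(p_next("kitchen", "cat"))  # unseen bigram, still > 0
```

Even in this tiny example the table has one entry per word pair; the number of parameters grows exponentially with the context length, which is the "curse" the slide refers to.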

Neural probabilistic language model

Neural architecture

Example: generalize from

- The cat is walking in the kitchen

to

- The dog was running in the room

and likewise to

- The cat was walking in the kitchen
- The cat was running in the kitchen
- The cat is running in the kitchen
- The cat is running in the room
- The dog was walking in the room
- The dog was walking in the kitchen
- The dog is running in the room
- The dog was running in the kitchen
- …

It uses distributed representations of words to address the curse of dimensionality.

NPLM

- Quantifies the compatibility between the next word and its context with a score function.
- Maps words to real-valued feature vectors learnt from data.
- Defines the distribution over the next word via the score function.
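The three bullets above can be sketched concretely. The sizes, random parameters, and the particular score function (a linear map on concatenated context embeddings) are illustrative assumptions, not the exact architecture from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: vocabulary of 10 words, 4-dim embeddings, 2-word context.
V, d, n_ctx = 10, 4, 2

R = rng.normal(size=(V, d))          # word feature vectors (learnt from data in practice)
W = rng.normal(size=(V, n_ctx * d))  # score-function weights
b = np.zeros(V)                      # per-word biases

def next_word_distribution(context_ids):
    """Softmax over scores s(w, context) for every word w in the vocabulary."""
    h = R[context_ids].reshape(-1)     # concatenate the context word vectors
    scores = W @ h + b                 # one compatibility score per candidate word
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

p = next_word_distribution([3, 7])     # distribution over the next word
```

Note that the normalization in the softmax touches every word in the vocabulary, which is exactly the cost discussed in the next section.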

Maximum likelihood learning

ML training of NPLMs is tractable but expensive:

- the likelihood (and its gradient) takes time linear in the vocabulary size.

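The per-example cost is easiest to see from the gradient of the log-likelihood under a softmax over scores. The derivation below is the standard one, reconstructed here rather than taken from the slides:

```latex
\frac{\partial}{\partial \theta} \log P_\theta(w \mid c)
  = \frac{\partial}{\partial \theta} s_\theta(w, c)
  - \sum_{w' \in V} P_\theta(w' \mid c)\, \frac{\partial}{\partial \theta} s_\theta(w', c)
```

The sum over the full vocabulary V in the second term is what makes each gradient step cost O(|V|).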

Importance sampling approximation (Bengio & Senécal, 2003)

- Sample words from a proposal distribution and use them to approximate the gradient.
- Stability issues: need either many samples or an adaptive proposal distribution.
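One common way to write the importance-sampling estimator of the expensive vocabulary sum, reconstructed here under the usual setup with proposal distribution Q and unnormalized scores (the slides do not show the formula):

```latex
\sum_{w'} P_\theta(w' \mid c)\, \nabla s_\theta(w', c)
  \approx \frac{1}{\sum_{j} v_j} \sum_{j=1}^{m} v_j\, \nabla s_\theta(x_j, c),
\qquad v_j = \frac{e^{s_\theta(x_j, c)}}{Q(x_j)}, \quad x_j \sim Q
```

The weights v_j are unbounded: when Q matches the model poorly, a few samples dominate the estimate, which is the instability the slide mentions.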

A fast and simple solution: noise-contrastive estimation

- Idea: fit a density model by discriminating between samples from the data distribution and samples from a known noise distribution.
- Fit the model to data: maximize the expected log-posterior of the data/noise label D.
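The posterior and objective below follow the standard NCE formulation (Gutmann and Hyvärinen, 2010) with k noise samples per data sample; they are reconstructed here since the slides omit the formulas:

```latex
P(D = 1 \mid w, c) = \frac{p_\theta(w \mid c)}{p_\theta(w \mid c) + k\, p_n(w)}

J(\theta) = \mathbb{E}_{w \sim p_{\text{data}}}\big[\log P(D = 1 \mid w, c)\big]
          + k\, \mathbb{E}_{w \sim p_n}\big[\log P(D = 0 \mid w, c)\big]
```

Crucially, p_θ here may be unnormalized: the mis-estimated normalizer is absorbed into the model, so no sum over the vocabulary is needed.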

The strength of NCE

- Allows working with unnormalized distributions.
- The gradient of the objective is better behaved than the importance-sampling gradient, since the weights are always between 0 and 1.
- As k tends to infinity, the NCE gradient converges to the maximum-likelihood gradient.
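The bounded-weights claim can be checked numerically on a toy model. The scores and the uniform noise distribution below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy check that NCE's gradient weights lie strictly in (0, 1).
V, k = 5, 3                       # vocabulary size, noise samples per data sample
s = rng.normal(size=V)            # unnormalized log-scores: no softmax required
p_noise = np.full(V, 1.0 / V)     # uniform noise distribution over the vocabulary

u = np.exp(s)                     # unnormalized model "probabilities"
# Posterior probability that each word came from the data rather than the noise;
# these posteriors act as the weights in the NCE gradient.
weights = u / (u + k * p_noise)

print(weights.min(), weights.max())  # both strictly inside (0, 1)
```

Unlike the importance weights, these posteriors cannot blow up no matter how badly the noise distribution matches the model, which is why NCE avoids the stability issues above.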

Application: MSR Sentence Completion Challenge

- Task: given a sentence with a missing word, find the most appropriate word among five candidate choices.
- Training dataset: five novels of Sherlock Holmes.

Thank you