A fast and simple neural probabilistic language model for natural language processing




Presenter: Yifei Guo

Supervisor: David Barber

Statistical language model


Goal: model the joint distribution of words in a sentence


Task: predict the next word w_n from the preceding n-1 words, called the context


Example: the n-gram model


Pro: Simplicity


conditional probability tables for P(w_n | context) are estimated by smoothing word n-tuple counts


Con: curse of dimensionality


e.g. modelling 10-word sentences with a vocabulary of 10,000 words leads to 10,000^10 free parameters
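To make the counting approach concrete on a toy scale, here is a minimal Python sketch of a bigram (n = 2) conditional probability table with add-one smoothing; the corpus and the smoothing constant are illustrative assumptions, not details from the slides. The full table over word n-tuples is exactly what blows up with n and the vocabulary size.

    from collections import defaultdict

    # Toy corpus; the slides do not prescribe any particular data or smoothing scheme.
    corpus = "the cat is walking in the kitchen the dog was running in the room".split()
    vocab = sorted(set(corpus))
    V = len(vocab)

    # Count bigrams and their contexts (the preceding word).
    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

    def prob(nxt, prev, alpha=1.0):
        # Add-alpha smoothed estimate of P(next word | previous word).
        return (bigram_counts[(prev, nxt)] + alpha) / (context_counts[prev] + alpha * V)

    print(prob("cat", "the"))   # ~0.14 on this toy corpus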




Neural probabilistic language model

Neural architecture

Example

Generalize from


The cat is walking in the kitchen

to


The dog was running in the room


and likewise to


The cat was walking in the kitchen


The cat was running in the kitchen


The cat is running in the kitchen


The cat is running in the room


The dog was walking in the room


The dog was walking in the kitchen


The dog is running in the room


The dog was running in the kitchen

…


Uses distributed representations of words to address the curse of dimensionality

NPLM


Quantifies the compatibility between the next word and its context with a score function


Words are mapped to real-valued feature vectors learnt from data


Distribution over the next word is defined by the score function
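For concreteness (the notation is introduced in this write-up, not taken verbatim from the slides): if s_\theta(w, h) denotes the score of word w in context h, the model distribution is the softmax of the scores,

    P_\theta(w \mid h) = \frac{\exp\big(s_\theta(w, h)\big)}{\sum_{w'} \exp\big(s_\theta(w', h)\big)}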



Maximum Likelihood learning


ML training of the NPLM is tractable but expensive


Computing the gradient of the log-likelihood takes time linear in the vocabulary size
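The cost is visible in the gradient of the log-likelihood (the standard softmax identity, in the notation above): the second term is an expectation over the entire vocabulary,

    \nabla_\theta \log P_\theta(w \mid h) = \nabla_\theta s_\theta(w, h) - \sum_{w'} P_\theta(w' \mid h)\, \nabla_\theta s_\theta(w', h)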




Importance sampling approximation (Bengio 2003)


Sample words from a proposal distribution and reweight the gradient


Stability issues: needs either many samples or an adaptive proposal distribution
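A sketch of the self-normalized importance sampling estimate of that expectation (the proposal q and the samples x_1, ..., x_k are notation introduced here):

    \nabla_\theta \log P_\theta(w \mid h) \approx \nabla_\theta s_\theta(w, h) - \sum_{j=1}^{k} \hat{v}_j\, \nabla_\theta s_\theta(x_j, h),
    \qquad \hat{v}_j = \frac{\exp(s_\theta(x_j, h)) / q(x_j)}{\sum_{l=1}^{k} \exp(s_\theta(x_l, h)) / q(x_l)}

If q is a poor match to the model distribution, a few samples dominate the weights, which is the source of the stability issues above.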



A fast and simple solution: noise-contrastive estimation


IDEA: fit a density model by discriminating between samples from the data distribution and samples from a known noise distribution





Fit the model to data: maximize the expected log-posterior probability of the data/noise labels D
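Written out (notation assumed in this write-up: q_n is the noise distribution and k the number of noise samples per data word), the posterior that word w in context h came from the data rather than the noise, and the NCE objective, are

    P(D = 1 \mid w, h) = \frac{P_\theta(w \mid h)}{P_\theta(w \mid h) + k\, q_n(w)},
    \qquad P(D = 0 \mid w, h) = 1 - P(D = 1 \mid w, h)

    J(\theta) = \mathbb{E}_{w \sim \text{data}}\big[\log P(D = 1 \mid w, h)\big] + k\, \mathbb{E}_{x \sim q_n}\big[\log P(D = 0 \mid x, h)\big]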

The strengths of NCE


Allows working with unnormalized distributions


The gradient of the objective is better behaved than the importance sampling gradient, since the weights are always between 0 and 1.


The NCE gradient converges to the ML gradient as k tends to infinity
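A minimal NumPy sketch of the per-example NCE objective under the formulation above (the names and toy numbers are assumptions of this write-up); note that the model scores enter unnormalized, which is the point of the first bullet:

    import numpy as np

    def nce_objective(score_data, scores_noise, log_qn_data, log_qn_noise, k):
        # score_data   : unnormalized log-score s_theta(w, h) of the data word
        # scores_noise : k unnormalized log-scores s_theta(x_j, h) of the noise words
        # log_qn_*     : log-probabilities of those words under the noise distribution q_n
        # P(D=1 | w, h) = sigmoid(s_theta(w, h) - log(k * q_n(w)))
        delta_data = score_data - (np.log(k) + log_qn_data)
        log_p_data = -np.logaddexp(0.0, -delta_data)      # log sigmoid(delta)

        delta_noise = scores_noise - (np.log(k) + log_qn_noise)
        log_p_noise = -np.logaddexp(0.0, delta_noise)     # log (1 - sigmoid(delta))

        return log_p_data + np.sum(log_p_noise)

    # Toy usage with made-up scores and noise probabilities.
    print(nce_objective(2.0, np.array([0.1, -0.3]), np.log(0.01),
                        np.log(np.array([0.02, 0.05])), k=2))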

Application: MSR Sentence Completion Challenge


Task: given a sentence with a missing word, find the most appropriate word among five candidate choices


Training dataset: five Sherlock Holmes novels
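One natural way to use the trained language model on this task (an assumption of this write-up, not a detail from the slides) is to score the sentence with each candidate substituted in and return the highest-scoring one:

    def complete(model_log_prob, sentence_with_blank, candidates):
        # model_log_prob is a hypothetical interface: it returns the log-probability
        # of a full sentence under the trained language model.
        scored = [(model_log_prob(sentence_with_blank.replace("___", c)), c)
                  for c in candidates]
        return max(scored)[1]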

Thank you