Machine Learning in Natural Language Processing

kettledoctorΤεχνίτη Νοημοσύνη και Ρομποτική

15 Οκτ 2013 (πριν από 3 χρόνια και 11 μήνες)

127 εμφανίσεις

Machine Learning in

Natural Language Processing

Noriko Tomuro

November 16, 2006


What is NLP?


Natural Language Processing (NLP) is a
field in Computer Science devoted to
creating computers that use natural
language as input and/or output.

NLP is AI
-
complete


“The most difficult problems in AI
manifest themselves in human
language phenomena.”


Use of language is the touchstone
of intelligent behavior.

NLP Tasks


NLP can be stand
-
alone applications or
components embedded in other
systems.


NLP essentially is to analyze
(natural
language)

sentences, linguistically. It
involves
(minimally)
:

1.
Determining the
part
-
of
-
speech

category of
words (e.g. noun, verb).

2.
Identifying the
syntactic structure

of a
sentence.

3.
Deriving the
meaning

of the sentence.



NLP Applications


Speech Recognition


Sentence Analysis


Parsing for syntactic analysis


Word sense disambiguation for semantic
analysis


Dialogue Systems, e.g.


Tutoring systems


Question
-
Answering Systems


Machine Translation


Automatic Text Summarization

(e.g.
Newsblaster

system at U of Columbia)


and much more

Part
-
Of
-
Speech (POS)


Identifying POS is as easy as one would
think


because of
ambiguity
.



He

authored

the
book
.”


“he”


DET


“author”


N or V


“the”


DET


“book”


N or V


Without knowing the structure (or
sometimes the meaning) of the sentence,
ambiguous categories cannot be correctly
disambiguated.

Stemming is NOT NLP


Note that
stemming

(e.g. Porter
stemmer) is a technique in Information
Retrieval; it’s NOT NLP.


Stemming only removes the common
morphological and inflectional endings
from words.


“police”, “policy”, “policies” => “polic”


“went” => “went”


Whereas proper NLP analysis yields:


“police” => “police”, “policies” => “policy”


“went” => “go”


Syntactic Structure


Grammar rules
(often in CFG)

specify
the grammatically correct ways to
form phrases (and a sentence).

Grammar

R0:

R1:

R2:

R3:

R4:

R5:

R6:

R7:

cake"

"





N

the"

"



Det

ate"

"





V

John"

"





NP



V



VG

NP

VG





VP

N



Det





NP

VP



NP





S

















S

NP

VP

V

NP

“John”

“ate”

“the”

Det

N

“cake”

Parsing Algorithms


Computational complexity
(with no
optimization)

is exponential.


Various parsing algorithms:


Top
-
down Parsing
--

(top
-
down) derivation


Bottom
-
up Parsing


Chart Parsing


Earley’s Algorithm


most efficient,
O(n
3
)


Left
-
corner Parsing


optimization of Earley’s


and lots more…


Demo using my CF parser

Semantics


Derive the meaning of a sentence.


Often applied on the result of syntactic
analysis.



John

ate

the cake
.”



NP V NP



((action INGEST)

; syntactic verb


(actor


JOHN
-
01)

; syntactic subj


(object FOOD))

; syntactic obj


To do semantic analysis, we need a
(semantic) dictionary (e.g. WordNet,
http://www.cogsci.princeton.edu/~wn/
).


NLP is Hard…


Understanding natural languages is hard
… because of inherent
ambiguity


Engineering NLP systems is also hard …
because of


Huge amount of
resources

needed
(e.g. grammar, dictionary, documents to
extract statistics from)


Computational
complexity

(intractable) of analyzing a sentence

Ambiguity


At last, a computer that
understands you like your mother.”


Three possible (syntactic)
interpretations:

1.
It understands you as well as your
mother understands you.

2.
It understands that you like your
mother.

3.
It understands you as well as it
understands your mother.

Empirical Approaches to NLP


Formal analysis of natural languages
is too hard and computationally
heavy.


Abandon complete/correct solutions,
and shoot for
“approximate”
solutions

by using cheaper/faster
techniques.


Collect data from documents
(corpus),
rather than building linguistic
resources manually.

Part
-
Of
-
Speech Tagging


Typically, from a corpus where words are
already tagged with POS categories, collect
statistics of n
-
word sequences (
n
-
gram
)

=>
probabilities of POS sequences
.


For a given untagged/test sentence, assign
the
most probable

POS tag for each word
starting from the beginning, based on the
POS tags of the preceding words.


Popular techniques:


Hidden Markov Model (HMM)



with parameter
estimation using Viterbi, EM algorithms


Transformation Rules

(e.g. Brill tagger)


Various ML classification algorithms

Probabilistic Parsing


For ambiguous sentences, we’d like to
know which parse tree is more likely than
others.


So we must assign probability to each
parse tree … but how?


A probability of a parse tree
t
is




where
r

is a rule used in
t
.


and
p(r)

is obtained from a (annotated) corpus.


Again, parameter estimation by using
various ML techniques.


Partial (Chunk) Parsing


Abandon full parsing; instead aim to
obtain just phrases (chunks).


Build a flat structure
(instead of a hierarchical tree
by full parsing)
.





Each chunk is identified by a regular
expression


a
finite
-
state automaton.

=> Polynomial time complexity


Then a chunker is a cascade of finite
-
state automata.

Semantic Analysis


Natural language sentences are
ambiguous, largely because a word often
has several meanings/senses.


e.g. “bass” (n) has two senses:

1.
a type of fish

2.
tones of low frequency



Which sense is used?


“The
bass

part of the song is very moving.”


“I went fishing for some sea
bass
.”


It’s easy for humans, but not for
computers.



Word Sense Disambiguation


A task of assigning the proper sense to
words in a sentence
(thus a classification task)
.


Using a training corpus annotated with
proper senses, obtain statistics of
n
-
word
sequences

(their word senses)
.


Apply classification algorithms, such as:


Decision Tree


Support Vector Machine


Conditional Random Fields (CRF)


Difficulty with
data sparseness
.
Techniques:


Smoothing


Backoff models

Other NLP’ish ML Tasks


Text Categorization


Classify a document into one of the document
categories
(See textbook on Naïve Bayes).


Multi
-
lingual Retrieval


Enter a query in one language, and retrieve
relevant documents of ANY language.


Machine Translation

part is done by ML.


Automatic construction of domain
ontology


Build a conceptual hierarchy of a specific
domain


and many, many more!



Analyzing Web Documents


Recently there have been many NLP
applications which analyze
(not just retrieve)

web documents


Blogs



for semantic analysis, sentiment
(polarity/opinion) identification


Email Spam Filtering


but most often
systems utilize simple word probability


A general approach “Web as a corpus”


web as the vast collection of documents