Natural Language Processing Part 1

Oct 25, 2013




General View





Phonetics: What sounds are used in human speech?

Phonology: How do languages use and combine sounds?

Morphology: How do languages form words?

Syntax: How do languages form sentences?

Semantics: How do languages convey meaning in sentences?

Pragmatics: How do people use language to communicate?


Research Problems

How can we construct the Lexicon?

What is the best Lexical Analyzer?

What is the suitable tag set?

How can we construct the CFG rules?

What is the best Syntax Analyzer?





Natural Language Processing Applications






Lexicon

To construct the Lexicon we have three methods:

Automatic: using a set of rules to tag the lexemes in the corpora. The main disadvantages are: it is not accurate, the corpora must be diacritized, and the corpora may not cover all the words. See "Constructing an Automatic Lexicon for Arabic Language".

Manual: inserting all the Arabic words by hand; there are approximately 10,000 roots, and the number of unique words (surface forms) is estimated to be 6,000,000. The disadvantages are that the insertion is difficult and the lexicon is not flexible for new words.

Semi-Manual: inserting the roots and the anomalies, and then using rules to derive the rest of the words.






















For the English language

Nominalization:
V + -ation: computerization
V + -er: killer
Adj + -ness: fuzziness

Negation:
un-: undo, unseen, ...
mis-: mistake, ...

Adjectivization:
V + -able: doable
N + -al: national




We call the smallest (meaningful or grammatical) parts of words morphemes.




Irregular word forms:

- Plural nouns add -s to the singular noun: book - books,
  but: box - boxes, fly - flies, child - children

- Past tense verbs add -ed to the infinitive: walk - walked,
  but: like - liked, leap - leapt


Finite state automata for morphology
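Such an automaton can be sketched as a small rule cascade: a minimal, illustrative pluralizer that covers the regular and irregular examples above. The rule ordering and the exception table are assumptions for illustration, not a complete account of English morphology.

```python
# Minimal finite-state-style pluralizer for the noun examples above.
# The rules and the exception table are illustrative, not exhaustive.

IRREGULAR = {"child": "children"}  # anomalies listed explicitly

def pluralize(noun: str) -> str:
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                     # box -> boxes
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"               # fly -> flies
    return noun + "s"                          # book -> books

for n in ["book", "box", "fly", "child"]:
    print(n, "->", pluralize(n))
```

A real finite-state transducer toolkit would encode each rule as an arc set, but the cascade above captures the same rule-plus-exception organization.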



Derivational Morphology

Lexeme = Root + Pattern

Word = Lexeme + Features
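The Lexeme = Root + Pattern idea can be sketched as slot filling: the root consonants fill the C1/C2/C3 slots of a pattern. The slot notation and the transliterated examples below are illustrative.

```python
# Root-and-pattern derivation sketch: slots C1, C2, C3 in a pattern
# are filled by the root consonants. The notation is an assumption
# made for illustration.

def derive(root: str, pattern: str) -> str:
    """Fill the C1/C2/C3 slots in `pattern` with the root consonants."""
    c1, c2, c3 = root.split("-")
    return pattern.replace("C1", c1).replace("C2", c2).replace("C3", c3)

# Arabic root k-t-b ("writing") with two common patterns:
print(derive("k-t-b", "C1aC2aC3a"))   # kataba  "he wrote"
print(derive("k-t-b", "C1aaC2iC3"))   # kaatib  "writer"
```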



Part-of-speech

Traditional: Noun, Verb, Particle

Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others



Noun-specific
• Number: singular, dual, plural, collective
• Gender: masculine, feminine, neutral
• Definiteness: definite, indefinite
• Case: nominative, accusative, genitive
• Possessive clitic

Verb-specific
• Aspect: perfective, imperfective, imperative
• Voice: active, passive
• Tense: past, present, future
• Mood: indicative, subjunctive, jussive
• Subject (Person, Number, Gender)
• Object clitic

Others
• Single-letter conjunctions
• Single-letter prepositions





Representation Units

















Part-of-Speech Tagging (lexeme analyzer)

The process of assigning a part-of-speech to each word in a sentence.








Choosing a tagset

• Need to choose a standard set of tags to do POS tagging
  One tag for each part of speech
• Could pick a very coarse tagset
  N, V, Adj, Adv, Prep
• More commonly used sets are finer-grained
  E.g., the UPenn TreeBank II tagset has 36 word tags
• Even more finely-grained tagsets exist































Comparison of different tag sets: verb, preposition, punctuation and symbol tags. An entry of 'not' means an item was ignored in tagging, or was not separated off as a separate token.





Why is POS tagging hard?

Ambiguity

"Plants/N need light and water."

"Each one plant/V one."

"Flies like a flower"
- Flies: noun or verb?
- like: preposition, adverb, conjunction, noun, or verb?
- flower: noun or verb?


interest

Ellen has a strong interest in computational linguistics.

Ellen pays a large amount of interest on her credit card.


بين

/bayyana/ Verb: he declared/demonstrated

/bayyanna/ Verb: they [feminine] declared/demonstrated

/bayyin/ Adj: clear/evident/explicit

/bayna/ Prep: between/among

/biyin/ Proper Noun: in Yen

/biyn/ Proper Noun: Ben


فهم: noun or verb?



POS Tagging Approaches

Rule-Based: human-crafted rules based on lexical and other linguistic knowledge; typically a large collection (> 1000) of constraints on what sequences of tags are allowable.

Learning-Based: trained on human-annotated corpora like the Penn Treebank.

Statistical models: Hidden Markov Model (HMM), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)

Rule learning: Transformation-Based Learning (TBL)

Generally, learning-based approaches have been found to be more effective overall, taking into account the total amount of human expertise and effort involved.





Sequence Labeling as Classification (Rule-Based)

Classify each token independently, but use information about the surrounding tokens as input features (sliding window).



Markov Model / Markov Chain (Learning-Based)

A finite state machine with probabilistic state transitions.

Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history.
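A toy Markov chain over POS tags makes the assumption concrete: sampling the next tag consults only the current tag. The transition probabilities below are made up for illustration.

```python
# Markov chain sketch: the next state is sampled from a distribution
# that depends only on the current state. All probabilities are
# made-up illustrative values.
import random

TRANSITIONS = {
    "Det": {"N": 0.8, "Adj": 0.2},
    "Adj": {"N": 0.9, "Adj": 0.1},
    "N":   {"V": 0.6, "N": 0.4},
    "V":   {"Det": 1.0},
}

def step(state):
    states, probs = zip(*TRANSITIONS[state].items())
    return random.choices(states, weights=probs)[0]

random.seed(0)
seq = ["Det"]
for _ in range(4):
    seq.append(step(seq[-1]))   # history beyond seq[-1] is never consulted
print(seq)
```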





Hidden Markov Model




Three Useful HMM Tasks

Observation Likelihood: to classify and order sequences.

Most Likely State Sequence (Decoding): to tag each token in a sequence with a label.

Maximum Likelihood Training (Learning): to train models to fit empirical training data.
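The decoding task is solved by the standard Viterbi algorithm; a minimal sketch follows. The toy tag set, probabilities, and sentence are made up for illustration.

```python
# Viterbi decoding sketch for the "most likely state sequence" task.
# The toy tag set, probabilities, and sentence are made-up examples.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of reaching s at time t, best predecessor)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p].get(s, 1e-6)
                 * emit_p[s].get(obs[t], 1e-6), p)
                for p in states)
            V[t][s] = (prob, prev)
    # Trace back the best path from the best final state.
    last = max(V[-1], key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"flies": 0.4, "flower": 0.5},
          "V": {"like": 0.6, "flies": 0.2}}
print(viterbi(["flies", "like", "flower"], states, start_p, trans_p, emit_p))
# -> ['N', 'V', 'N']
```

Dynamic programming keeps one best-score-so-far per state per position, so the cost is linear in sentence length rather than exponential in it.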


Most Likely Sequence

Of two or more possible sequences, which one was most likely generated by a given model?

Used to score alternative word-sequence interpretations in speech recognition.
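The score itself is the observation likelihood Pr(O | model), computed by the forward algorithm, which sums over all state paths. The toy model below is the same illustrative one used for decoding; its probabilities are made up.

```python
# Forward-algorithm sketch for the "observation likelihood" task:
# Pr(O | model), summed over all hidden state paths.
# All probabilities below are made-up illustrative values.

def forward(obs, states, start_p, trans_p, emit_p):
    alpha = {s: start_p[s] * emit_p[s].get(obs[0], 1e-6) for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p].get(s, 1e-6) for p in states)
                    * emit_p[s].get(o, 1e-6)
                 for s in states}
    return sum(alpha.values())

states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"flies": 0.4, "flower": 0.5},
          "V": {"like": 0.6, "flies": 0.2}}

# A higher likelihood means the model better explains that word sequence.
print(forward(["flies", "like", "flower"], states, start_p, trans_p, emit_p))
```

To compare two candidate word sequences, one would compute this likelihood for each and keep the higher-scoring one.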





Learning Problem: Given some training observation sequences O = {O1, O2, …, Om} and an HMM structure, determine the HMM parameters Q = (A, B, π) that best fit the training data, that is, maximize Pr(O|Q). Unfortunately, there is no feasible direct (optimal) solution; the Baum-Welch algorithm is a good approximate solution for this problem.





Other HMM Models