Natural Language Processing
Part 1
General View
Phonetics: What sounds are used in human speech?
Phonology: How do languages use and combine sounds?
Morphology: How do languages form words?
Syntax: How do languages form sentences?
Semantics: How do languages convey meaning in sentences?
Pragmatics: How do people use language to communicate?
Research Problems
How can we construct the Lexicon?
What is the best Lexical Analyzer?
What is the suitable Tag set?
How can we construct the CFG rules?
What is the best Syntax Analyzer?
Natural Language Processing Applications
Lexicon
To construct the Lexicon we have three methods:
Automatic: using a set of rules to tag the lexemes in the corpora. The main disadvantages are: it is not accurate, the corpora must be diacritized, and the corpora may not cover all the words. See "Constructing an Automatic Lexicon for Arabic Language".
Manual: inserting all the Arabic words; there are approximately 10000 roots, and the number of unique words (surface forms) is estimated to be 6,000,000. The disadvantages are that it is difficult to build and is not flexible for new words.
Semi-Manual: inserting the roots and the anomalous forms, and then using the rules to derive the rest of the words.
For the English language:
Nominalization:
V + -ation: computerization
V + -er: killer
Adj + -ness: fuzziness
Negation:
un-: undo, unseen, ...
mis-: mistake, ...
Adjectivization:
V + -able: doable
N + -al: national
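The derivation patterns above can be sketched as a small table of suffix rules. The rule table and the naive spelling adjustments below are illustrative only, not a complete morphological analyzer:

```python
# Sketch: English derivational suffix rules as a lookup table.
# (stem POS, suffix) -> (derived POS, ending). Illustrative, not exhaustive.
RULES = {
    ("V", "-ation"): ("N", "ation"),   # nominalization: computerize -> computerization
    ("V", "-er"):    ("N", "er"),      # nominalization: kill -> killer
    ("Adj", "-ness"): ("N", "ness"),   # nominalization: fuzzy -> fuzziness
    ("V", "-able"):  ("Adj", "able"),  # adjectivization: do -> doable
    ("N", "-al"):    ("Adj", "al"),    # adjectivization: nation -> national
}

def derive(stem: str, pos: str, suffix: str) -> tuple[str, str]:
    """Apply one derivational rule with two naive spelling adjustments."""
    new_pos, ending = RULES[(pos, suffix)]
    base = stem
    if base.endswith("e") and ending[0] in "aeiou":
        base = base[:-1]            # drop silent e: computerize + -ation
    elif base.endswith("y"):
        base = base[:-1] + "i"      # y -> i: fuzzy + -ness -> fuzziness
    return base + ending, new_pos
```

For example, `derive("fuzzy", "Adj", "-ness")` yields the noun "fuzziness".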
We call the smallest (meaningful or grammatical) parts of words morphemes.
Irregular word forms:
- Plural nouns add s to the singular noun: book - books, but: box - boxes, fly - flies, child - children
- Past tense verbs add ed to the infinitive: walk - walked, but: like - liked, leap - leapt
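The usual computational treatment of the exceptions above is a regular rule plus an exception list that is consulted first. A minimal sketch (the word lists are illustrative, not a full lexicon):

```python
# Sketch: regular inflection rules with exception lists for irregular forms.
IRREGULAR_PLURAL = {"child": "children"}
IRREGULAR_PAST = {"leap": "leapt"}

def pluralize(noun: str) -> str:
    if noun in IRREGULAR_PLURAL:          # exceptions checked first
        return IRREGULAR_PLURAL[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                # box -> boxes
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"          # fly -> flies
    return noun + "s"                     # book -> books

def past_tense(verb: str) -> str:
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"                 # like -> liked
    return verb + "ed"                    # walk -> walked
```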
Finite state automata for morphology
Derivational Morphology
– Lexeme = Root + Pattern
– Word = Lexeme + Features
– Part-of-speech
  • Traditional: Noun, Verb, Particle
  • Computational: N, PN, V, Adj, Adv, P, Pron, Num, Conj, Det, Aux, Pun, IJ, and others
– Noun-specific
  • Number: singular, dual, plural, collective
  • Gender: masculine, feminine, neutral
  • Definiteness: definite, indefinite
  • Case: nominative, accusative, genitive
  • Possessive clitic
– Verb-specific
  • Aspect: perfective, imperfective, imperative
  • Voice: active, passive
  • Tense: past, present, future
  • Mood: indicative, subjunctive, jussive
  • Subject (Person, Number, Gender)
  • Object clitic
– Others
  • Single-letter conjunctions
  • Single-letter prepositions
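The equation Lexeme = Root + Pattern can be sketched by interdigitation: slotting the root consonants into a pattern template. The digit-for-radical notation and the transliterated examples below are for illustration (katab 'wrote', kaatib 'writer', maktab 'office', all from the root k-t-b):

```python
# Sketch: derive an Arabic lexeme by slotting root consonants into a pattern.
# Patterns use the digits 1/2/3 as slots for the root's radicals (transliterated).
def apply_pattern(root: str, pattern: str) -> str:
    """root: a triliteral root such as 'ktb'; pattern: template such as '1a2a3'."""
    out = []
    for ch in pattern:
        if ch in "123":
            out.append(root[int(ch) - 1])   # substitute the i-th radical
        else:
            out.append(ch)                  # copy pattern material (vowels, affixes)
    return "".join(out)

lexeme = apply_pattern("ktb", "1a2a3")      # -> 'katab' ("wrote")
```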
Representation Units
Part-of-Speech Tagging (lexeme analyzer)
The process of assigning a part-of-speech to each word in a sentence.
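The process can be sketched with the simplest possible tagger: a most-frequent-tag baseline that learns each word's most common tag from an annotated corpus. The tiny corpus and coarse tagset below are invented for illustration:

```python
from collections import Counter, defaultdict

# Sketch: most-frequent-tag baseline tagger on a toy annotated corpus.
training = [
    [("plants", "N"), ("need", "V"), ("light", "N")],
    [("each", "Det"), ("one", "Num"), ("plant", "V"), ("one", "Num")],
]

counts = defaultdict(Counter)
for sentence in training:
    for word, pos in sentence:
        counts[word][pos] += 1

def tag(words, default="N"):
    """Assign each word its most frequent training tag; unseen words get `default`."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in words]
```

Note that this baseline ignores context entirely, which is exactly why the ambiguity examples later in these notes defeat it.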
Choosing a tagset
• Need to choose a standard set of tags to do POS tagging
  – One tag for each part of speech
• Could pick a very coarse tagset
  – N, V, Adj, Adv, Prep.
• The more commonly used sets are finer-grained
  – E.g., the UPenn TreeBank II tagset has 36 word tags
• Even more finely-grained tagsets exist
Comparison of different tag sets: verb, preposition, punctuation and symbol tags. An entry of 'not' means an item was ignored in tagging, or was not separated off as a separate token.
Why is POS tagging hard?
Ambiguity
– "Plants/N need light and water."
– "Each one plant/V one."
– "Flies like a flower"
  • Flies: noun or verb?
  • like: preposition, adverb, conjunction, noun, or verb?
  • flower: noun or verb?
• interest:
  Ellen has a strong interest in computational linguistics.
  Ellen pays a large amount of interest on her credit card.
بين
– /bayyana/ Verb: he declared/demonstrated
– /bayyanna/ Verb: they [feminine] declared/demonstrated
– /bayyin/ Adj: clear/evident/explicit
– /bayna/ Prep: between/among
– /biyin/ Proper Noun: in Yen
– /biyn/ Proper Noun: Ben
فهم: noun or verb?
POS Tagging Approaches
Rule-Based: human-crafted rules based on lexical and other linguistic knowledge; a large collection (> 1000) of constraints on what sequences of tags are allowable.
Learning-Based: trained on human-annotated corpora like the Penn Treebank.
– Statistical models: Hidden Markov Model (HMM), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)
– Rule learning: Transformation Based Learning (TBL)
Generally, learning-based approaches have been found to be more effective overall, taking into account the total amount of human expertise and effort involved.
Sequence Labeling as Classification (Rule-Based)
• Classify each token independently, but use as input features information about the surrounding tokens (sliding window).
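The sliding-window idea can be sketched as a feature extractor: for each token, the classifier sees the token itself plus its neighbors within a fixed window. The feature names and boundary markers below are arbitrary choices for illustration:

```python
# Sketch: sliding-window features for classifying one token at a time.
def window_features(tokens, i, size=1):
    """Features for token i: the token plus neighbors within `size` positions.
    Positions past the sentence edge get boundary markers <s> / </s>."""
    feats = {"word": tokens[i]}
    for d in range(1, size + 1):
        feats[f"prev-{d}"] = tokens[i - d] if i - d >= 0 else "<s>"
        feats[f"next+{d}"] = tokens[i + d] if i + d < len(tokens) else "</s>"
    return feats
```

Each feature dictionary would then be fed to any standard classifier; the classifier sees local context but still makes each tagging decision independently.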
Markov Model / Markov Chain (Learning-Based)
• A finite state machine with probabilistic state transitions.
• Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history.
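Under the Markov assumption, the probability of a whole tag sequence is just the product of one-step transition probabilities. A minimal sketch with invented probabilities over a toy tagset:

```python
# Sketch: a first-order Markov chain over POS tags (toy probabilities).
TRANS = {
    "<s>": {"Det": 0.6, "N": 0.4},   # <s> marks the start of the sequence
    "Det": {"N": 0.9, "Adj": 0.1},
    "N":   {"V": 0.7, "N": 0.3},
    "V":   {"Det": 0.5, "N": 0.5},
    "Adj": {"N": 1.0},
}

def sequence_prob(tags):
    """P(tag sequence) = product of transition probabilities (Markov assumption)."""
    p, prev = 1.0, "<s>"
    for t in tags:
        p *= TRANS[prev].get(t, 0.0)   # unseen transitions get probability 0
        prev = t
    return p
```

For example, P(Det N V) = 0.6 × 0.9 × 0.7 = 0.378 under this toy model.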
Hidden Markov Model
Three Useful HMM Tasks
• Observation Likelihood: to classify and order sequences.
• Most likely state sequence (Decoding): to tag each token in a sequence with a label.
• Maximum likelihood training (Learning): to train models to fit empirical training data.
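The observation-likelihood task is standardly solved with the forward algorithm, which sums over all state sequences in linear time. The toy HMM below (states, start, transition, and emission probabilities) is invented purely for illustration:

```python
# Sketch: the forward algorithm computes P(observations | model).
# Toy two-state HMM over POS tags; all probabilities are illustrative.
STATES = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"flies": 0.4, "flower": 0.6}, "V": {"flies": 0.7, "like": 0.3}}

def forward_likelihood(obs):
    """Sum over all state paths: alpha[s] = P(obs so far, current state = s)."""
    alpha = {s: START[s] * EMIT[s].get(obs[0], 0.0) for s in STATES}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s].get(o, 0.0)
                 for s in STATES}
    return sum(alpha.values())
```

Comparing `forward_likelihood` under two different models is exactly how one scores which model most likely generated a given sequence.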
Most Likely Sequence
• Of two or more possible sequences, which one was most likely generated by a given model?
• Used to score alternative word sequence interpretations in speech recognition.
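The decoding task — finding the single most likely state sequence for an observation sequence — is standardly solved with the Viterbi algorithm. A sketch over a toy two-state HMM with invented probabilities:

```python
# Sketch: Viterbi decoding for a toy HMM (illustrative probabilities).
STATES = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"flies": 0.4, "flower": 0.6}, "V": {"flies": 0.7, "like": 0.3}}

def viterbi(obs):
    """Return the most likely state (tag) sequence for the observations."""
    V = [{s: START[s] * EMIT[s].get(obs[0], 0.0) for s in STATES}]
    back = []                                     # backpointers per time step
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            # best predecessor state maximizes path prob into s
            best_prev = max(STATES, key=lambda p: V[-1][p] * TRANS[p][s])
            col[s] = V[-1][best_prev] * TRANS[best_prev][s] * EMIT[s].get(o, 0.0)
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    last = max(STATES, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):                    # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Under this toy model, decoding ["flies", "flower"] tags "flies" as a verb and "flower" as a noun.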
Learning Problem: Given some training observation sequences O = {O1, O2, …, Om} and an HMM structure, determine the HMM parameters Q = (A, B, π) that best fit the training data, that is, that maximize Pr(O | Q). Unfortunately, there is no feasible direct (optimal) solution; the Baum-Welch algorithm is a good approximate solution to this problem.
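One Baum-Welch iteration computes expected transition counts with the forward-backward recursions and renormalizes them. The sketch below shows the re-estimation of A only, for a two-state HMM with a binary observation alphabet; all starting values are invented toy numbers:

```python
# Sketch: one Baum-Welch (EM) re-estimation step for A on a toy 2-state HMM.
STATES = [0, 1]
A = [[0.7, 0.3], [0.4, 0.6]]     # transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]     # emission probabilities over symbols {0, 1}
PI = [0.5, 0.5]                  # initial state probabilities

def forward(obs):
    a = [[PI[s] * B[s][obs[0]] for s in STATES]]
    for o in obs[1:]:
        a.append([sum(a[-1][p] * A[p][s] for p in STATES) * B[s][o] for s in STATES])
    return a

def backward(obs):
    b = [[1.0, 1.0]]             # beta at the final time step
    for o in reversed(obs[1:]):
        b.insert(0, [sum(A[s][n] * B[n][o] * b[0][n] for n in STATES) for s in STATES])
    return b

def baum_welch_step(obs):
    """Re-estimate A from expected transition counts (the xi/gamma statistics)."""
    a, b = forward(obs), backward(obs)
    likelihood = sum(a[-1][s] for s in STATES)
    xi = [[0.0, 0.0], [0.0, 0.0]]
    gamma = [0.0, 0.0]
    for t in range(len(obs) - 1):
        for i in STATES:
            for j in STATES:
                # expected count of an i -> j transition at time t
                x = a[t][i] * A[i][j] * B[j][obs[t + 1]] * b[t + 1][j] / likelihood
                xi[i][j] += x
                gamma[i] += x
    return [[xi[i][j] / gamma[i] for j in STATES] for i in STATES]
```

A full implementation would also re-estimate B and π the same way and iterate until Pr(O | Q) stops improving; each iteration is guaranteed not to decrease the likelihood, but it converges only to a local optimum.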
Other HMM Models