CS 533: Natural Language Processing

Lecture 02: REs/FSAs and (English) Morphology
Professor McCarty

Formal Language Theory (Review)




Regular Grammars
The Chomsky Hierarchy: regular (Type 3) ⊂ context-free (Type 2) ⊂ context-sensitive (Type 1) ⊂ recursively enumerable (Type 0). Regular grammars have rules restricted to the forms A -> xB and A -> x, and generate exactly the regular languages.

Regular Expressions
Formally:
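The definition is the standard inductive one; a sketch in LaTeX (the lecture's exact notation may differ):

```latex
% Regular expressions over a finite alphabet $\Sigma$:
\begin{itemize}
  \item $\emptyset$, $\varepsilon$, and $a$ (for each $a \in \Sigma$) are regular expressions;
  \item if $r$ and $s$ are regular expressions, then so are the concatenation $(rs)$,
        the disjunction $(r \mid s)$, and the Kleene star $r^{*}$;
  \item nothing else is a regular expression.
\end{itemize}
```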
So for REs we just need three operators: concatenation, disjunction, and the Kleene star.

Regular Expressions
But regular languages are also closed under other operations (e.g., intersection, complement, difference, and reversal), so it is convenient to add more operators.




Regular Expressions
Common syntax (Unix grep, Perl, Python, ...): character classes [abc] and ranges [a-z], negated classes [^...], the counters ?, *, +, and {n,m}, the wildcard ., the anchors ^ and $, and disjunction |.
Precedence, from highest to lowest: parentheses, counters, sequences and anchors, disjunction.
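A few of these operators in Python (the patterns and test strings are illustrative, not from the lecture):

```python
import re

print(bool(re.search(r'colou?r', 'color')))          # ? = optional: True
print(bool(re.search(r'^ba+!$', 'baaa!')))           # + and the anchors ^ $: True
print(re.findall(r'[0-9]{2,4}', 'year 1999, no 7'))  # ranges and {n,m}: ['1999']
print(bool(re.search(r'cat|dog', 'hotdog')))         # | = disjunction: True
```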
Regular Expressions
Task: Write a regular expression to recognize the English article “the”.
/the/
Problem: This pattern will miss a capitalized “The”.
/[tT]he/
Problem: This pattern will incorrectly match “the” embedded in other words (e.g., “other”, “theology”).
/\b[tT]he\b/
Note: \b matches a word boundary, where “word” characters are letters, digits, and underscore (as in C identifiers).
Regular Expressions
Problem: Since \b treats underscores and digits as word characters, the pattern will not match “the” when it appears adjacent to them (e.g., “the_”, “the25”), which we might want.
/[^a-zA-Z][tT]he[^a-zA-Z]/
Problem: This pattern will not match “the” when it begins a line or ends a line.
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
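A quick check of the final pattern in Python (the test strings are my own):

```python
import re

# "the" or "The" delimited by non-letters, a line start, or a line end.
pattern = re.compile(r'(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)')

for s in ['the', 'The end', 'other', 'theology', 'the_', 'the25']:
    print(f'{s!r}: {bool(pattern.search(s))}')
# Matches 'the', 'The end', 'the_', 'the25'; rejects 'other' and 'theology'.
```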

Errors
The process we just went through was based on fixing two kinds of errors:
Matching strings that we should not have matched (“there”, “then”, “other”): false positives (Type I).
Not matching strings that we should have matched (“The”): false negatives (Type II).
Errors
This will become a familiar story, for many tasks. Reducing the error rate for a task often involves two antagonistic efforts:
Increasing accuracy, or precision (minimizing false positives).
Increasing coverage, or recall (minimizing false negatives).
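The two quantities can be made precise; a minimal sketch with hypothetical counts for the /the/ task:

```python
tp = 90   # true positives: real "the"s that we matched
fp = 10   # false positives (Type I): "other", "theology", ... that we matched
fn = 5    # false negatives (Type II): real "the"s that we missed

precision = tp / (tp + fp)   # fraction of our matches that were correct
recall    = tp / (tp + fn)   # fraction of the real "the"s that we found
print(f'precision={precision:.2f}, recall={recall:.2f}')  # 0.90, 0.95
```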

Finite State Automata
Sheep Talk: baa!, baaa!, baaaa!, baaaaa!, ...
Regular Expression: /baa+!/ or /baaa*!/
Finite State Automaton: a five-state chain q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a self-loop on a at q3; q4 is the accepting state.
Finite State Automata
Formally:
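The standard definition, sketched in LaTeX (the notation may differ from the slide):

```latex
$M = (Q, \Sigma, q_0, F, \delta)$, where $Q$ is a finite set of states,
$\Sigma$ is a finite input alphabet, $q_0 \in Q$ is the start state,
$F \subseteq Q$ is the set of accepting states, and
$\delta : Q \times \Sigma \to Q$ is the transition function.
```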

Finite State Automata
Transition Table:
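For the sheep-talk DFSA above, the table is:

State | b | a | !
  0   | 1 | - | -
  1   | - | 2 | -
  2   | - | 3 | -
  3   | - | 3 | 4
  4   | - | - | -    (4 is accepting; - means reject)

A minimal sketch of D-RECOGNIZE in Python, with the table encoded as a dict (the encoding is my own, not the lecture's):

```python
# Missing (state, symbol) entries play the role of the fail state / sink.
TRANSITIONS = {
    (0, 'b'): 1,
    (1, 'a'): 2,
    (2, 'a'): 3,
    (3, 'a'): 3,   # self-loop: any number of extra a's
    (3, '!'): 4,
}
ACCEPT = {4}

def d_recognize(tape: str) -> bool:
    state = 0
    for symbol in tape:
        if (state, symbol) not in TRANSITIONS:
            return False           # fell into the sink: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPT         # accept only if the tape ends in a final state

assert d_recognize('baa!') and d_recognize('baaaa!')
assert not d_recognize('ba!') and not d_recognize('baa')
```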




Finite State Automata
Adding a fail state, or sink:

Finite State Automata
This is a deterministic FSA (DFSA):




Finite State Automata
These are nondeterministic FSAs (NFSAs): for example, a sheep-talk machine where state 2 has arcs on a to both state 2 and state 3, or a variant with an ε-transition from state 3 back to state 2.

Finite State Automata
Transition Table: as before, except that the entry for state 2 on input a is now the set of states {2, 3}.




Finite State Automata
There are two ways to recognize strings with a nondeterministic FSA:
Convert the NFSA into a DFSA and use the algorithm D-RECOGNIZE. (Theorem: for any NFSA there exists a DFSA that recognizes exactly the same language.)
Work directly with the NFSA and manage the state-space search explicitly: ND-RECOGNIZE.
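A minimal sketch of ND-RECOGNIZE as an explicit state-space search, using the nondeterministic sheep-talk machine above (the dict encoding and the depth-first agenda discipline are my choices):

```python
from collections import deque

# delta now maps (state, symbol) to a SET of successor states.
# From state 2, reading a may stay in 2 or advance to 3: the machine must guess.
DELTA = {
    (0, 'b'): {1},
    (1, 'a'): {2},
    (2, 'a'): {2, 3},
    (3, '!'): {4},
}
ACCEPT = {4}

def nd_recognize(tape: str) -> bool:
    agenda = deque([(0, 0)])           # search nodes: (state, position) pairs
    while agenda:
        state, pos = agenda.pop()      # LIFO pop = depth-first search
        if pos == len(tape):
            if state in ACCEPT:
                return True            # one successful path suffices
            continue
        for nxt in DELTA.get((state, tape[pos]), ()):
            agenda.append((nxt, pos + 1))
    return False                       # every path failed

assert nd_recognize('baa!') and nd_recognize('baaaa!')
assert not nd_recognize('ba!')
```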

English Morphology
Morphology is the study of how words are built up from smaller meaningful units called morphemes.
We can usefully divide morphemes into two classes:
Stems: the core meaning-bearing units.
Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions. These could be prefixes, suffixes, infixes (Tagalog), or circumfixes (German).




English Morphology
We can further divide morphology up into two broad types:
Inflectional morphology.
Derivational morphology.
plus a more specialized operation:
Cliticization.
Morphological rules usually depend on the word class of the stem: noun, verb, adjective, ...

Inflectional Morphology
The term “inflectional morphology” refers to the combination of stems and affixes where the resulting word:
has the same word class as the original, and
serves a grammatical/semantic purpose that is different from the original, but is nevertheless transparently related to the original.




Nouns in English
Nouns are simple. They have markers for:
Plural
Possessive: llama / llama's, children / children's, llamas / llamas', Euripides / Euripides'

Verbs in English
Verbs are only slightly more complex: regular verbs have four forms (walk / walks / walking / walked), while irregular verbs have idiosyncratic paradigms (eat / eats / eating / ate / eaten).




Derivational Morphology
The term “derivational morphology” refers to the combination of stems and affixes where the resulting word:
has a different word class from the original, and
is quasi-systematically related to the original, but
the meaning of the resulting word is difficult to predict exactly.

Derivational Examples
Verbs to Nouns: -ation (computerize -> computerization), -ee (appoint -> appointee), -er (kill -> killer)
Adjectives to Nouns: -ness (fuzzy -> fuzziness)




Derivational Examples
Nouns to Adjectives: -al (computation -> computational), -less (clue -> clueless)
Verbs to Adjectives: -able (embrace -> embraceable)

Example: compute
Many paths are possible...
Start with compute:
computer -> computerize -> computerization
computer -> computerize -> computerizable
But not all paths/operations are equally good (allowable?).
Start with clue:
clue -> *clueize
clue -> *clueable
Cliticization
This term refers to the combination of a word stem with a clitic, which is:
syntactically like a word, but
phonologically like an affix.
English examples include the contracted auxiliaries in “I've” and “she'll”.

Goal: Morphological Parsing
We will do this in stages ...




Morphology and FSAs
Let's use the machinery provided by FSAs to capture the preceding facts about morphology, that is:
Accept strings that are in the language.
Reject strings that are not in the language.
And do so in a way that does not require us, in effect, to list all the words in the language explicitly.

Simple Rules for Nouns: a regular noun stem may optionally take the plural suffix -s; irregular singular and irregular plural nouns are accepted as single units.




Now Plug in the Words
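A minimal sketch of these rules in Python, with a toy lexicon plugged in (the word lists are illustrative):

```python
REG_NOUNS = {'fox', 'cat', 'dog'}
IRREG_SG_NOUNS = {'goose', 'mouse'}
IRREG_PL_NOUNS = {'geese', 'mice'}

def accept_noun(word: str) -> bool:
    # Any bare stem is a word of the language.
    if word in REG_NOUNS | IRREG_SG_NOUNS | IRREG_PL_NOUNS:
        return True
    # One extra arc: a regular noun followed by the plural suffix -s.
    return word.endswith('s') and word[:-1] in REG_NOUNS

assert accept_noun('cats') and accept_noun('geese')
assert not accept_noun('gooses')   # irregular nouns do not take -s
```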

Simple Rules for Verbs: a regular verb stem may take -s, -ing, or -ed; irregular verb stems and irregular past forms are accepted as single units.




Derivational Rules

Parsers and Generators
We can now run strings through these machines to recognize words in the language.
But recognition is usually not what we want:
Often, if we find some string in the language, we would like to assign a structure to it (parsing).
Or we might have some structure and would like to produce a surface form for it (production/generation).
Example:
From “cats” to “cat +N +PL” (parsing).
From “cat +N +PL” to “cats” (production/generation).
Finite State Transducers
The simple story:

Add another tape

Add extra symbols to the transitions

Finite State Transducers
Simple FST for Nouns: the same machine as the noun FSA, but each arc now carries an input:output pair, e.g., reg-noun:reg-noun and +PL:^s#.

Now Plug in the Words
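A minimal sketch of the transducer in the parsing direction, from the intermediate form to the lexical form (the dictionary encoding collapses the FST's paired arcs and is my own simplification):

```python
LEXICON = {
    'fox':   ('fox',   'reg-noun'),
    'cat':   ('cat',   'reg-noun'),
    'goose': ('goose', 'irreg-sg-noun'),
    'geese': ('goose', 'irreg-pl-noun'),
}

def parse_noun(intermediate: str) -> str:
    """Map e.g. 'fox^s#' -> 'fox +N +PL' and 'geese#' -> 'goose +N +PL'."""
    word = intermediate.rstrip('#')
    plural = word.endswith('^s')          # the ^s output emitted for +PL
    stem = word[:-2] if plural else word
    lemma, cls = LEXICON[stem]
    if cls == 'irreg-pl-noun':
        plural = True                     # plurality is in the stem itself
    return f"{lemma} +N {'+PL' if plural else '+SG'}"

print(parse_noun('fox^s#'))   # fox +N +PL
print(parse_noun('geese#'))   # goose +N +PL
```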




Finite State Transducers
Note: This FST does not recognize a surface string, but rather an intermediate form. How would we recognize “foxes” instead of “foxs”?

Multi-Tape Machines
Solution: Add another tape, and build a transducer between the surface level and the intermediate level.




Orthographic Rules
Here are some rules for English spelling: consonant doubling (beg -> begging), E deletion (make -> making), E insertion (watch -> watches), Y replacement (try -> tries), and K insertion (panic -> panicked).
We can implement each of these rules in a separate finite state transducer.

Orthographic Rules
Here is an FST for “E insertion”, the rule ε -> e / {x, s, z} ^ __ s # (insert an e after x, s, or z, at a morpheme boundary, before the suffix s and a word boundary), so that fox^s# surfaces as foxes:




Orthographic Rules
Here is a transition table for “E insertion”:

Cascaded FSTs: the lexical tape feeds the noun FST, whose intermediate output feeds the spelling-rule FSTs, which produce the surface tape.




Recognizing “foxes”
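A self-contained sketch of the cascade in the generation direction, “fox +N +PL” -> “fox^s#” -> “foxes” (parsing runs the same machines in reverse); the string encoding and the regex are mine, not the lecture's:

```python
import re

# Stage 1 (noun FST): lexical form -> intermediate form.
def lexical_to_intermediate(lexical: str) -> str:
    stem, *tags = lexical.split()               # e.g. 'fox +N +PL'
    return stem + ('^s#' if '+PL' in tags else '#')

# Stage 2 (E-insertion FST): insert e after x, s, z before the s suffix,
# then erase the ^ and # boundary symbols to reach the surface form.
def intermediate_to_surface(intermediate: str) -> str:
    with_e = re.sub(r'([xsz])\^(?=s#)', r'\1e^', intermediate)
    return with_e.replace('^', '').replace('#', '')

for lex in ('fox +N +PL', 'cat +N +PL', 'fox +N +SG'):
    inter = lexical_to_intermediate(lex)
    print(f'{lex!r} -> {inter!r} -> {intermediate_to_surface(inter)!r}')
# 'fox +N +PL' -> 'fox^s#' -> 'foxes'
# 'cat +N +PL' -> 'cat^s#' -> 'cats'
# 'fox +N +SG' -> 'fox#'   -> 'fox'
```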

Parsers and Generators
These cascaded FSTs can be used to map a string:
from the surface level to the lexical level (parsing), or
from the lexical level to the surface level (generation).
But parsing can be ambiguous ...




Ambiguity
Ambiguous inflectional morphology:
“foxes” -> “fox +N +PL”
“foxes” -> “fox +V +3Sg” (“That trickster foxes me every time”)
Ambiguous derivational morphology:
“unionizable” -> “union +ize +able”
“unionizable” -> “un +ion +ize +able”
These ambiguities can only be resolved by the syntactic or semantic context.

Ambiguity
There are also local ambiguities that arise during morphological parsing.
Example: “assess” -> “assess +V”
But our FST might attempt to parse this as “asses” -> “ass^s#” -> “ass +N +PL” until it sees the final “s”. Thus parsing with FSTs is nondeterministic.




Efficient Compilation
Intersection: the separate spelling-rule FSTs can be intersected into a single transducer.
Composition: a cascade of FSTs can be composed into a single transducer.