I want to have some cash

scarfpocketAI and Robotics

Oct 24, 2013 (3 years and 5 months ago)

78 views

Computational Linguistics

Yoad Winter




*

General overview



*

Examples:



Transducers; Stanford Parser;



Google Translate; Word
-
Sense

Disambiguation



*

Finite State Automata and Formal

Grammars

Linguistics
-

from Theory to Technology

Computational Linguistics

Theoretical Linguistics

Natural Language Processing

Language Technology

INFO

TLW

Industrie


Goals of CL:


* Foundations for Linguistics in Computer

Science (e.g. Formal Language theory)


* Computable linguistic theories (HPSG, LFG,



Categorial Grammar)


* Implementation of demos for linguistic



theories


* (Mathematical Linguistics)


Computational Linguistics


Goals of NLP


practical applications of CL:


* Speech recognition/synthesis


* Machine translation


* Summarization


* Question answering


* Text categorization


* Grammar checking


Statistical NLP:


* Unsupervised


* Supervised (corpus
-
based)



Natural Language Processing


Goals of LT:


* Useful linguistic resources (lexicons, grammar



rules, semantics webs)


* Implementation of most useful tools involving

language processing



(Google translation, Word spell checker,

MS Speech Recognizer etc.)

Language Technology

Computational Linguistics

Theoretical Linguistics

Natural Language Processing

Language Technology


Input:



Output:




J&M (2009)

Words in text

Part of speech (Noun/Verb);






Morphological Information


Speech Sound

Text

Wave





text



Speech Sound Wave


Sentence in text

Phrases in Sentences (noun





phrase, verb phrase)


Sentence/text

Action/Reasoning


Sentence/text

Translation


Language Processing
-

Tasks

I: Words

II: Speech

III: Syntax

IV: Semantics


& Pragmatics

V: Applications

Start with null information state I=0

Repeat while there is language to read:



-

Read a language token T



-

Recognize T: extract information I(T)



-

Update information state I using I(T)



-

Do some action using I(T)

Processing
-

General Idea

I want to cash a check:

-
Start from state 0

-
Read “I want to”, move to state 1, and
output nothing

-
Read “cash”, move to state 3, and output V

-
Read “a check”, move to state 4, and
output nothing


Example 1


Finite State Transducers

I want to have some |
ε

I want to |
ε

cash |
V

cash |
N

a check |
ε

0

2

1

3

4

I want to have some cash:

-
Start from state 0

-
Read “I want to have some”, move to state
2, and output nothing

-
Read “cash”, move to state 4, and output N


Example 1


Finite State Transducers

I want to have some |
ε

I want to |
ε

cash |
V

cash |
N

a check |
ε

0

2

1

3

4

Your query

I want to cash a check

Tagging

I/PRP want/VBP to/TO cash/VB a/DT check/NN

Parse


Example 2


Stanford Parser

http://nlp.stanford.edu/software/lex
-
parser.shtml
link

Your query

I want to have some cash

Tagging

I/PRP want/VBP to/TO have/VB some/DT cash/NN

Parse


Example 2


Stanford Parser

http://nlp.stanford.edu/software/lex
-
parser.shtml
link

Example 3


Google Translate

Example 3


Google Translate

Summary

We have seen ways to process:


-

words, word
-
by
-
word:




transducers


-

sentences, with a tree structure:




Stanford Parser

A word like CASH must be
disambiguated

for Noun or Verb, in order to have a
correct translation.

Other kinds of disambiguation?

Example 4
-

Word
-
Sense Disambiguation

the light blue car:


1.

de lichtblauwe auto


2.

de lichte blauwe auto

John likes the light blue car but not the deep blue
car

John was able to lift the light blue car but not the
heavy blue car

Google Translate: lichtblauwe auto in both cases

Word
-
Sense disambiguation: finding the right
sense of the word



Basic Model 1: Finite State Automata (FSA)

q0
-

start state

q4
-

accepting state

arrows


transitions, also defined by a
transition
table

FSA
-

formally

Tracing the execution of an FSA

“baaa!” is
accepted

because when taking the
input symbols one by one, we reached the
accepting state q4.

FSA’s as Grammars

An FSA possibly describes an
infinite set

of
strings over a
finite

input alphabet
Σ
.

We thus say that an FSA describes a
grammar

over
Σ
, which
derives

a
formal language

over
Σ
.

More officially:


Σ



a finite set


Σ
*


all the strings over
Σ

(infinite)


L(FSA) = the language of the FSA



is the set of strings S in
Σ
* that are derived

by the FSA.


Any set described by an FSA is called
regular
.


Non
-
regular languages and complexity

L = { ab, aabb, aaabbb, aaaabbbb, … }

can be shown to be
non
-
regular.


No FSA can derive this language L!


But there are grammars that can also generate
non
-
regular languages!


Are natural languages regular or non
-
regular?

How hard it is for a computer to recognize regular
and non
-
regular languages?

Are there different classes of formal languages in
terms of their complexity?



Another way to define regular langauges


regular expressions

A regular expression is a compact way for
describing a regular language.

Example:


baa(a*)!

descibes the same language as the FSA we saw.

We say that this regular expression
matches

any string in this
language, and does not match other strings.

Regular expressions
-

formally

Σ
-

a
finite

alphabet

1
-

Any string in
Σ

is a regular expression that matches
itself



“a” matches “a”; “b” matches “b”; etc.

2
-

If A and B are regular expressions then AB is a regular
expression that matches any
concatenation

of a string
that A matches
with

a string that B matches.



“ab” matches “ab”

3
-

If A and B are regular expressions then A|B is a regular
expression that matches any string that A
or

B match.



“a|b” matches both “a” and “b”

4
-

If A is regular expression then A* matches any string
that has zero or more As.



“a*” matches the empty string, “a”, “aa”, “aaa” etc.



Examples

Convention: we give
precedence

to *.


AB* = A(B*)

Convenience: we let ε match the empty string.


a|b
*

matches {ε, "a", "b", "bb", "
bbb
", ...}


(
a|b
)*

matches the set of all strings with no symbols other than "a"
and "b", including the empty string: {ε, "a", "b", "
aa
", "
ab
", "
ba
",
"bb", "
aaa
", ...}


ab
*(
c|ε
)

denotes the set of strings starting with "a", then zero or more
"
b"s

and finally optionally a "c": {"a", "ac", "
ab
", "
abc
", "
abb
",
"
abbc
", ...}


At home

Read 3.4
-
3.6 on Transducers as preparation
for Eva’s class.