CMSC 723 / LING 645: Intro to Computational Linguistics

September 8, 2004: Monz

Regular Expressions and Finite State Automata (J&M 2)

Prof. Bonnie J. Dorr

Dr. Christof Monz

TA: Adam Lee

Regular Expressions and Finite State Automata

REs: Language for specifying text strings

Search for document containing a string

Searching for “woodchuck”

“How much wood would a woodchuck chuck if a woodchuck would chuck wood?”

Searching for “woodchucks” with an optional final “s”

Finite-state automata (FSA) (singular: automaton)


Regular Expressions


Basic regular expression patterns


Perl-based syntax (slightly different from other notations for regular expressions)

Disjunctions /[wW]oodchuck/

Regular Expressions


Ranges [A-Z]

Negations [^Ss]
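The three bracket patterns so far (disjunction, range, negation) can be tried out directly. The sketch below uses Python's `re` module rather than Perl, on the assumption that its bracket syntax behaves the same way for these cases; the sample strings are made up for illustration.

```python
import re

# Disjunction of characters: either case of the first letter.
woodchuck = re.compile(r"[wW]oodchuck")
# Range: any single uppercase ASCII letter.
upper = re.compile(r"[A-Z]")
# Negation: any single character that is neither "S" nor "s".
not_s = re.compile(r"[^Ss]")

print(bool(woodchuck.search("Woodchucks chuck wood")))  # True
print(upper.findall("Prof. Bonnie J. Dorr"))            # ['P', 'B', 'J', 'D']
print(not_s.findall("Sass"))                            # ['a']
```

Note that the caret only negates when it is the first character inside the brackets.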

Regular Expressions


Optional characters ?, * and +

? (0 or 1)

/colou?r/ matches “color” or “colour”

* (0 or more)

/oo*h!/ matches “oh!” or “ooh!” or “ooooh!”

+ (1 or more)

/o+h!/ matches “oh!” or “ooh!” or “ooooh!”

* and + are the Kleene star and Kleene plus, named after Stephen Cole Kleene

Wild cards .

/beg.n/ matches “begin” or “began” or “begun”
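A quick way to check the counters and the wildcard is to test whole words against each pattern. This is a small sketch in Python (not Perl); the helper `full_matches` and the word lists are invented for the example.

```python
import re

def full_matches(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# ? makes the preceding "u" optional.
print(full_matches(r"colou?r", ["color", "colour", "colouur"]))
# ['color', 'colour']

# oo* needs one "o" followed by zero or more further "o"s.
print(full_matches(r"oo*h!", ["h!", "oh!", "ooh!", "ooooh!"]))
# ['oh!', 'ooh!', 'ooooh!']

# o+ is the shorter way to say the same thing.
print(full_matches(r"o+h!", ["h!", "oh!", "ooh!", "ooooh!"]))
# ['oh!', 'ooh!', 'ooooh!']

# . stands for exactly one character, so "begn" is rejected.
print(full_matches(r"beg.n", ["begin", "began", "begun", "begn"]))
# ['begin', 'began', 'begun']
```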


Regular Expressions


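Anchors ^ and $

/^[A-Z]/ matches the initial capital in “Ramallah, Palestine”

/^[^A-Z]/ matches the first character of “¿verdad?” and of “really?”

/\.$/ matches the final period of “It is over.”

/.$/ ?

Boundaries \b and \B

/\bon\b/ matches the “on” in “on my way” but not the “on” in “Monday”

/\Bon\b/ matches the “on” at the end of “automaton”

Disjunction |

/yours|mine/ matches “yours” and “mine” in “it is either yours or mine”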
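The anchor and boundary examples can be checked mechanically. The sketch below uses Python's `re.search`, assuming `^`, `$`, `\b` and `\B` behave as in Perl for these single-line strings; “Is it over?” is an extra sample string not taken from the slide.

```python
import re

# ^ anchors at the start of the string, $ at the end.
print(bool(re.search(r"^[A-Z]", "Ramallah, Palestine")))  # True
print(bool(re.search(r"^[^A-Z]", "really?")))             # True
print(bool(re.search(r"\.$", "It is over.")))             # True
print(bool(re.search(r"\.$", "Is it over?")))             # False: escaped dot is a literal period
print(bool(re.search(r".$", "Is it over?")))              # True: bare dot matches any final character

# \b is a word boundary, \B is a non-boundary.
print(bool(re.search(r"\bon\b", "on my way")))            # True
print(bool(re.search(r"\bon\b", "Monday")))               # False
print(re.search(r"\Bon\b", "automaton").span())           # (7, 9)

# | chooses between whole alternatives.
print(re.findall(r"yours|mine", "it is either yours or mine"))  # ['yours', 'mine']
```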


Disjunction, Grouping, Precedence

Column 1 Column 2 Column 3 …

How do we express this?

/Column [0-9]+ */

/(Column [0-9]+ +)*/
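To see why the parentheses matter, compare what each pattern consumes from the start of a line of column headers. This is a Python sketch; the sample line (with a trailing space) is an assumption made so that the second pattern can cover the last column too.

```python
import re

line = "Column 1 Column 2 Column 3 "

# Without grouping, * applies only to the space before it,
# so the pattern describes a single column header.
ungrouped = re.compile(r"Column [0-9]+ *")
# With grouping, * repeats the whole parenthesized unit.
grouped = re.compile(r"(Column [0-9]+ +)*")

print(repr(ungrouped.match(line).group()))  # 'Column 1 '
print(repr(grouped.match(line).group()))    # 'Column 1 Column 2 Column 3 '
```

Because the grouped pattern requires at least one space after each number, a line without the trailing space would only be matched up to and including “Column 2 ”.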


Precedence


From highest to lowest precedence:

Parentheses ()

Counters * + ? {}

Sequences and anchors the ^my end$

Disjunction |


REs are greedy!
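The precedence order and greediness can both be observed in a few lines. The sketch is in Python; the sample sentence is invented, and the non-greedy `*?` form is an extra shown only for contrast (it is not on the slide).

```python
import re

# Counters bind tighter than sequences: in the* only the "e" is repeated.
print(bool(re.fullmatch(r"the*", "theee")))     # True
print(bool(re.fullmatch(r"the*", "thethe")))    # False
print(bool(re.fullmatch(r"(the)*", "thethe")))  # True

# Sequences bind tighter than disjunction: the|any is "the" or "any".
print(bool(re.fullmatch(r"the|any", "any")))    # True
print(bool(re.fullmatch(r"the|any", "theny")))  # False

# Greedy matching: .* grabs as much text as it can.
text = "the faster they ran, the faster they fell"
greedy = re.search(r"the.*they", text).group()
lazy = re.search(r"the.*?they", text).group()
print(greedy)  # the faster they ran, the faster they
print(lazy)    # the faster they
```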


Perl Commands



while ($line = <STDIN>) {
    # Print every input line that contains "the".
    if ($line =~ /the/) {
        print "MATCH: $line";
    }
}


Writing correct expressions

Exercise: Write a Perl regular expression to match the English article “the”:

/the/

/[tT]he/

/\b[tT]he\b/

/[^a-zA-Z][tT]he[^a-zA-Z]/

/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
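Each refinement in the exercise fixes one kind of false positive or false negative. The sketch below runs all five patterns over a handful of invented test strings in Python, so the differences show up side by side; `hits` and `samples` are names made up for this example.

```python
import re

samples = ["the cat", "The cat", "other cats", "in the_box"]

def hits(pattern):
    """Return the sample strings in which the pattern is found."""
    return [s for s in samples if re.search(pattern, s)]

# Misses "The" and wrongly fires inside "other".
print(hits(r"the"))
# ['the cat', 'other cats', 'in the_box']

# Handles capitalization, still fires inside "other".
print(hits(r"[tT]he"))
# ['the cat', 'The cat', 'other cats', 'in the_box']

# \b treats "_" as a word character, so "the_box" is skipped.
print(hits(r"\b[tT]he\b"))
# ['the cat', 'The cat']

# Requires a non-letter on both sides, so it fails at the start of a line.
print(hits(r"[^a-zA-Z][tT]he[^a-zA-Z]"))
# ['in the_box']

# Allows either the start of the line or a non-letter before "the".
print(hits(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]"))
# ['the cat', 'The cat', 'in the_box']
```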

A more complex example

Exercise: Write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:

/$[0-9]+/

/$[0-9]+\.[0-9][0-9]/

/\b$[0-9]+(\.[0-9][0-9])?\b/

/\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/

/\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/

/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/

/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
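The price, speed and disk patterns can be tried on a sample advertisement. This is a Python sketch with two adjustments that are assumptions of the example, not part of the slide: the dollar sign is escaped as `\$` so it is read literally, and the leading `\b` is dropped from the price patterns because there is no word boundary between a space and `$`. The ad text is invented.

```python
import re

# Any dollar amount, with optional cents.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
# At most three digits before the cents, i.e. less than $1000.
under_1000 = re.compile(r"\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b")
# Processor speed and disk size, as on the slide.
speed = re.compile(r"\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b")
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

ad = "Pentium PC, 600 MHz, 40 Gb disk, only $899.99"

print(price.search(ad).group())           # $899.99
print(under_1000.search(ad).group())      # $899.99
print(under_1000.search("now $1299.00"))  # None
print(speed.search(ad).group())           # 600 MHz
print(disk.search(ad).group())            # 40 Gb
```

The patterns only locate the relevant substrings; deciding whether 600 is “more than 500” still requires converting the matched digits to a number.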

Advanced operators

should be _

Substitutions and Memory


Substitutions

s/colour/color/

s/colour/color/g (substitute as many times as possible!)

s/colour/color/i (case insensitive matching)

Memory ($1, $2, etc. refer back to matches; inside the pattern itself, write \1, \2)

s/([Cc])olour/$1olor/

/the (.*)er they were, the \1er they will be/

/the (.*)er they (.*), the \1er they \2/
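The substitution flags and memory references map onto Python's `re.sub` as sketched below. This is an approximation of the Perl behaviour under stated assumptions: `count=1` stands in for a substitution without `g`, `flags=re.IGNORECASE` for `i`, and Python writes `\1` in both the pattern and the replacement where Perl uses `$1` in the replacement. The sample text is invented.

```python
import re

text = "colour, colour, Colour"

# Like s/colour/color/ : only the first match is replaced.
print(re.sub(r"colour", "color", text, count=1))
# color, colour, Colour

# Like s/colour/color/g : every match is replaced.
print(re.sub(r"colour", "color", text))
# color, color, Colour

# Like s/colour/color/gi : case is ignored while matching.
print(re.sub(r"colour", "color", text, flags=re.IGNORECASE))
# color, color, color

# Memory: capture the first letter and reuse it in the replacement.
print(re.sub(r"([Cc])olour", r"\1olor", text))
# color, color, Color

# Memory inside the pattern: \1 must repeat whatever group 1 matched.
pattern = r"the (.*)er they were, the \1er they will be"
print(bool(re.search(pattern, "the bigger they were, the bigger they will be")))  # True
print(bool(re.search(pattern, "the bigger they were, the faster they will be")))  # False
```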

Eliza [Weizenbaum, 1966]

User: Men are all alike

ELIZA: IN WHAT WAY

User: They’re always bugging us about something or other

ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?

User: Well, my boyfriend made me come here

ELIZA: YOUR BOYFRIEND MADE YOU COME HERE

User: He says I’m depressed much of the time

ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions

Step 1: replace first person with second person references

s/\bI(’m| am)\b/YOU ARE/g

s/\bmy\b/YOUR/g

s/\bmine\b/YOURS/g

Step 2: use additional regular expressions to generate replies

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/

s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/

s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Step 3: use scores to rank possible transformations
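Steps 1 and 2 can be chained into a tiny responder. The Python sketch below is a rough illustration, not Weizenbaum's program. The function name `eliza_reply` is invented, the rules are tried in a fixed order instead of being scored (so step 3 is not implemented), only one of the two “depressed|sad” replies is kept, and the reply is upper-cased at the end to mimic the dialogue.

```python
import re

def eliza_reply(utterance):
    """Apply the first-to-second-person rewrites, then the first matching reply rule."""
    # Step 1: replace first person with second person references.
    text = re.sub(r"\bI('m|’m| am)\b", "YOU ARE", utterance)
    text = re.sub(r"\bmy\b", "YOUR", text)
    text = re.sub(r"\bmine\b", "YOURS", text)

    # Step 2: reply rules, tried in order (an assumption; the slide ranks by score).
    rules = [
        (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
        (r".* all .*", "IN WHAT WAY"),
        (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    ]
    for pattern, response in rules:
        if re.search(pattern, text):
            return re.sub(pattern, response, text).upper()
    return text.upper()

print(eliza_reply("Men are all alike"))
# IN WHAT WAY
print(eliza_reply("They're always bugging us about something or other"))
# CAN YOU THINK OF A SPECIFIC EXAMPLE
print(eliza_reply("He says I'm depressed much of the time"))
# I AM SORRY TO HEAR YOU ARE DEPRESSED
```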

Finite-state Automata

Finite-state automata (FSA)

Regular languages

Regular expressions

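Finite-state Automata (Machines)

/^baa+!$/

[Figure: automaton with states q0, q1, q2, q3, q4. Transitions: q0 to q1 on b, q1 to q2 on a, q2 to q3 on a, q3 to q3 on a, q3 to q4 on !. Callouts mark a state, a transition, and the final state q4.]

baa!

baaa!

baaaa!

baaaaa!

...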

Input Tape: a b a ! b

[Figure: the automaton (states 0 to 4; arcs b, a, a, ! and the a self-loop) starting in q0 at the first tape cell.]

REJECT

Input Tape: b a a a !

States visited: q0 q1 q2 q3 q3 q4

[Figure: the automaton (states 0 to 4; arcs b, a, a, ! and the a self-loop).]

ACCEPT

Finite-state Automata

Q: a finite set of N states

Q = {q0, q1, q2, q3, q4}

Σ: a finite input alphabet of symbols

Σ = {a, b, !}

q0: the start state

F: the set of final states

F = {q4}

δ(q,i): transition function

Given state q and input symbol i, return new state q'

δ(q3,!) → q4
State-transition Tables

         Input
State    b    a    !
0        1    Ø    Ø
1        Ø    2    Ø
2        Ø    3    Ø
3        Ø    3    4
4:       Ø    Ø    Ø
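One straightforward way to hold this table in a program is a dictionary keyed by (state, input symbol). The Python sketch below is one possible encoding, not the only one; the names `TRANSITIONS`, `START`, `FINAL`, `ALPHABET` and `delta` are chosen for this example, and a missing key plays the role of Ø.

```python
# Transition table for the automaton accepting baa!, baaa!, ...
# Only the non-empty cells are stored.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START = 0                   # q0, the start state
FINAL = {4}                 # F, the set of final states
ALPHABET = {"a", "b", "!"}  # the input alphabet

def delta(state, symbol):
    """Transition function: return the next state, or None for an empty cell."""
    return TRANSITIONS.get((state, symbol))

print(delta(3, "!"))  # 4
print(delta(0, "a"))  # None
```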

D-RECOGNIZE

function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
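The pseudocode translates almost line for line into Python. In this sketch the machine is passed as three separate arguments (a transition dictionary, a start state and a set of final states), which is a representation chosen for the example; `True` stands for accept and `False` for reject.

```python
def d_recognize(tape, transitions, start, finals):
    """Deterministic recognizer: return True (accept) or False (reject)."""
    index = 0              # beginning of tape
    current_state = start  # initial state of machine
    while True:
        if index == len(tape):
            # End of input: accept only if we stopped in a final state.
            return current_state in finals
        next_state = transitions.get((current_state, tape[index]))
        if next_state is None:
            # Empty table cell: no legal move on this symbol.
            return False
        current_state = next_state
        index += 1

# The automaton for baa!, baaa!, ... (missing keys are empty cells).
sheep = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

print(d_recognize("baaa!", sheep, 0, {4}))  # True
print(d_recognize("aba!b", sheep, 0, {4}))  # False
```

The two calls replay the ACCEPT and REJECT tapes from the earlier slides.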

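Adding a failing state

[Figure: the automaton with an added fail state qF. Besides the original arcs (b, a, a, ! and the a self-loop), every missing transition now leads to qF: a and ! from q0; b and ! from q1; b and ! from q2; b from q3; b, a and ! from q4.]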

Adding an “all else” arc

[Figure: the automaton with fail state qF; four arcs into qF, each standing for “all else”, replace the individually labeled failing arcs.]

Languages and Automata


Can use FSA as a generator as well as a
recognizer


Formal language L: defined by machine M
that both generates and recognizes all and
only the strings of that language.


L(M) = {baa!, baaa!, baaaa!, …}


Regular languages vs. non-regular languages
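The same transition table that recognizes strings can also be used to generate them, by following arcs outward from the start state. The Python sketch below enumerates L(M) up to a length bound; the function name `generate` and the bound `max_length` are assumptions of the example (the language itself is infinite).

```python
def generate(transitions, start, finals, max_length):
    """Yield every accepted string of at most max_length symbols, shortest first."""
    frontier = [(start, "")]  # (state reached, string produced so far)
    for _ in range(max_length + 1):
        next_frontier = []
        for state, prefix in frontier:
            if state in finals:
                yield prefix
            # Extend the prefix along every arc leaving this state.
            for (source, symbol), target in transitions.items():
                if source == state:
                    next_frontier.append((target, prefix + symbol))
        frontier = next_frontier

sheep = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

print(list(generate(sheep, 0, {4}, 6)))
# ['baa!', 'baaa!', 'baaaa!']
```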

Languages and Automata


Deterministic vs. Non-deterministic FSAs

Epsilon (ε) transitions

Using NFSAs to accept strings


Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.

Look-ahead: look ahead in input

Parallelism: look at alternatives in parallel

Using NFSAs

         Input
State    b    a    !    ε
0        1    Ø    Ø    Ø
1        Ø    2    Ø    Ø
2        Ø    2,3  Ø    Ø
3        Ø    Ø    4    Ø
4:       Ø    Ø    Ø    Ø
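Of the three strategies for running an NFSA, parallelism is the shortest to sketch: keep the set of all states the machine could be in and advance them together. The Python below is a minimal illustration for the table above. It assumes there are no ε-transitions to follow (the ε column is empty in this table), and the names `NFSA` and `nd_recognize` are invented for the example.

```python
# Non-deterministic table: each cell holds a set of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},  # the choice point: stay in 2 or move on to 3
    (3, "!"): {4},
}

def nd_recognize(tape, transitions, start, finals):
    """Track every state reachable so far; accept if any is final at end of input."""
    current = {start}
    for symbol in tape:
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, symbol), set())
        }
        if not current:
            # Every alternative has died out.
            return False
    return bool(current & finals)

print(nd_recognize("baa!", NFSA, 0, {4}))    # True
print(nd_recognize("baaaa!", NFSA, 0, {4}))  # True
print(nd_recognize("ba!", NFSA, 0, {4}))     # False
```

Tracking a set of states avoids the bookkeeping that backup needs for choice points, at the cost of updating several states per input symbol.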

Ø

Readings for next time


J&M Chapter 3