LING/C SC/PSYC 438/538

greenbeansneedlesSoftware and s/w Development

Dec 13, 2013 (3 years and 6 months ago)

72 views

LING/C SC/PSYC 438/538

Computational Linguistics

Sandiway Fong

Lecture 2: 8/23

Administrivia


Anyone try installing Perl yet?


Active State Perl


http://www.activestate.com


install the free version


(not the trial versions)

Background Reading


Textbook


Chapter 2: Regular Expressions and Automata


Section 2.1: Regular Expressions


Perl Regular Expressions (RE)


perlrequick

-

Perl regular expressions quick start


http://perldoc.perl.org/perlrequick.html


perlretut

-

Perl regular expressions tutorial


http://perldoc.perl.org/perlretut.html


Regular Expressions


regular expressions are used in string pattern
-
matching


important tool in automated searching


formally equivalent to finite
-
state automata (FSA) and regular grammars



popular implementations


Unix
grep


command line program


returns lines matching a regular expression


standard part of all Unix
-
based systems


including
MacOS X

(command
-
line interface in Terminal)


many shareware/freeware implementations available for Windows XP


just Google and see...


grep

functionality is built into many programming languages


e.g. Perl


wildcard search in Microsoft Word


limited version of regular expressions (not full power)


with differences in notation

Regular Expressions


Historical note


grep
:


name comes from Unix

ed

command



g/re/p



search
g
lobally for lines
matching the
r
egular
e
xpression, and
p
rint them



[Source:
http://en.wikipedia.org/wiki/Grep]


ed


is an obscure and difficult
-
to
-
use text edit program on Unix
systems


doesn’t need a screen display


would work on an ancient
teletype

wikipedia

Regular Expressions


Formally


a regular expression
(regexp) is formed from:


an
alphabet


(= set of characters)


operators


a
regexp

is shorthand for
a
set of strings


(possibly infinite set)


(strings are of finite
length)


Formally


a set of strings is
called a
language


a language that can
be defined by a
regular expression is
called a
regular
language


(not all languages are
regular)

Regular Expressions


alphabet


e.g. {a,b,c,...,z}


set of lower case English
letters

Note:
case is important


operators


a


single symbol a


a
n

exactly
n




occurrences of




a, n a positive




integer



a
n

a
3


aaa


a
*


zero or more




occurrences of a


a
+

one or more




occurrences of a


concatenation


two regexps may be
concatenated, the resulting
string is also a regexp


e.g. abc


disjunction


infix operator: |


(vertical bar)


e.g. a|b



parentheses may be used
for disambiguation


e.g. gupp(y|ies)

Regular Expressions


Technically, a
+

is not
necessary


aa* = a
+





a

concatenated with
a
*
(zero or more
occurrences of
a
)”

=


“one or more
occurrences of
a





Disjunction


[
set of characters
]


set of characters enclosed
in square brackets means
match one of the characters


e.g. [aeiou]


matches any of the vowels
a
,
e
,

i, o or
u

but not
d


dash (
-
) shorthand for a
range


e.g. [a
-
e]


matches
a
,
b
,
c
,
d

or
e

Regular Expressions


Range defined over a
computer character set


typically ASCII


originally a 7 bit
character set


2^7 = 128 (0
-
127)
different characters


ASCII =
American
Standard Code for
Information
Interchange


e.g. [A
-
z]




[0
-
9A
-
Za
-
z]


http://www.idevelopment.info/data/Programming/a
scii_table/PROGRAMMING_ascii_table.shtml

Regular Expressions:
Microsoft Word


terminology:


wildcard search


Regular Expressions:
Microsoft Word

Note:

zero or more times is missing in Microsoft Word

Regular Expressions


Perl uses the same notation as
grep



(textbook also uses
grep

notation)


More shorthand


question mark

(?) means the
previous regexp is optional


e.g. colou?r


matches
color

or
colour


metacharacters

or operators
like ? have a function


to match a question mark,
escape

it using a backslash (
\
)


e.g. why
\
?


? in Microsoft Word means
match any character


More shorthand


period

(.) stands for any
character


(except newline)


e.g. e.t


matches
eat

as well as
eet


caret sign

(^) as the first
character of a range of
characters


[^set of characters]


means don’t match any of the
characters mentioned (after
the caret)


e.g. [^aeiou]


any character except for one
of the vowels listed

Regular Expressions


Text files in Unix
consists of sequences
of lines separated by a
newline character


(LF = line feed)


Typically, text files are
read a line at a time by
programs


Matching in Perl and
grep is line
-
oriented

(can be changed in Perl)


Differences in platforms for
line breaking:


Unix: LF


Windows (DOS): CR LF


MacOS (X): CR


affects binary file transfers

Regular Expressions


Line
-
oriented metacharacters
:


caret (^) at the beginning of a
regexp string matches the
“beginning of a line”


e.g. ^The


matches lines beginning with


the sequence
The


Note: the caret is very


overloaded...


[^ab]


a^b


dollar sign ($) at the end
of a regexp string
matches the “end of the
line”


e.g. end
\
.$


matches lines ending in
the sequence
end.


e.g. ^$


matches blank lines only


e.g. ^ $


matches lines contains
exactly one space

Regular Expressions


Word
-
oriented metacharacters
:


a
word

is any sequence of digits [0
-
9],
underscores (_) and letters [a
-
zA
-
Z]


(historical reasons for this)


\
b matches a
word boundary
, e.g. a
space or beginning or end of a line or a
non
-
word character


e.g. the


matches
the
,
they, breathe

and
other


but
\
bthe will only match
the

and
they


the
\
b will match
the

and
breathe


\
bthe
\
b will only match
the


(
\
< and
\
> can also be used to match the
beginning and end of a word)


e.g.
\
b99


matches 99 but not 299


also matches $99

Regular Expressions


Note:


definition of
word

does not
include characters with accent
marks



extended ASCII character set


8 bit


characters 128
-
255


does not include non
-
Roman
characters



towards multilingual computing


Unicode


one massive table

Regular Expressions


Range abbreviations:


\
d (digit) = [0
-
9]


\
s (whitespace character) =
space (SP), tab (HT), carriage
return (CR), newline (LF) or
form feed (FF)


\
w (word character)


= [0
-
9a
-
zA
-
Z_]


uppercase versions denote
negation


e.g.
\
W means a non
-
word
character



\
D means a non
-
digit


Repetition abbreviations:


a?
a optional


a*
zero or more a’s


a+
one or more a’s


a{n,m}
between n and m a’s


a{n,}
at least n a’s


a{n}
exactly n a’s


e.g.
\
d{7,}


matches numbers with at
least 7 digits


e.g.
\
d{3}
-
\
d{4}


matches 7 digit telephone
numbers with a separating
dash

Perl


Run from the command line in Windows


Start > Run...


cmd

(brings up command line interpreter)


Running a Perl program:


perl
-
help

(gives options)



perl filename.pl



(runs Perl command file
filename.pl
)



perl filename.pl inputfile.txt



(runs Perl command file
filename.pl
,
inputfile.txt

is supplied to
filename.pl
)



e.g.
filename.pl

reads and processes
input file
inputfile.txt

Perl


Example Perl program
(
match.pl
) to read in a
text file and print lines
matching a regexp
enclosed by /...
/



Example input file
(
text.txt
)



Command

perl match.pl text.txt

open (F,$ARGV[0]) or die
"$ARGV[0] not found!
\
n";


while (<F>) {


print $_ if (/The/);

}

This is a test.

The cat sat on the mat.

These shoes are made for walking.

Otherwise, I t
??
hought it was
cold.

45

Perl


Program:

open (F,$ARGV[0]) or die "$ARGV[0]
not found!
\
n";


while (<F>) {


print $_ if (/The/);

}


while (<F>)

first evaluates
<F>


<F>

reads in a line from the file
referenced by

F
and places the
line in the program variable
$_


then it executes the program code
between the curly braces


then it goes back and reads
another line


it does this repeatedly while
<F>

produces a valid line


if we reach
the end of the file, the while loop
stops


print $_ if (/.../);

is
conditional code that means print
the contents of variable
$_

if the
regexp between the
/.../

can be
found in
$_



Program explained:



open the file referenced in
$AGRV[0]

for
input



$AGRV[0]

is the first command line
argument following the program name



F
is the file descriptor associated with the
opened file



if there is a problem opening the file, e.g.
file doesn’t exist, program execution dies
and prints the value of the string enclosed in
double quotes
"$ARGV[0] not found!
\
n"