Perl regular expressions



This PowerPoint file can be found at:
http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10

Kansas City Area SAS User Group (KCASUG)

October 5, 2004

Larry Hoyle

Policy Research Institute, The University of Kansas

Regular expressions


A regular expression is a pattern to be matched against some text (a string)

originally from neurophysiology

then in QED and grep

see: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp

Perl regular expressions


Practical Extraction and Report Language (Perl) implements a version of regular expressions that is something of a standard

see: http://www.perldoc.com/perl5.6.1/pod/perlre.html

SAS Documentation

Short syntax description

Some simple examples

/Baa/      matches the string "Baa"
/Baa\d/    matches "Baa" followed by any numeric digit


Using Perl Regular Expressions in SAS 9.1 and above

data cc;
  input c $;
  prxNum=prxParse('/Baa\d/');
  start=prxMatch(prxNum,c);
  if start then put c= 'is a match';
  else put c= 'does not match';
datalines;
Baa
Baa2
baa3
aaaaBaa3
;
run;

proc sql;
  select * from cc
  where prxmatch('/Baa\d/',c);
quit;

Documentation for PRX Functions and Call Routines in SAS HELP

CALL PRXCHANGE       Performs a pattern-matching replacement
CALL PRXDEBUG        Enables Perl regular expressions in a DATA step to send debug output to the SAS log
CALL PRXFREE         Frees unneeded memory that was allocated for a Perl regular expression
CALL PRXNEXT         Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string
CALL PRXPOSN         Returns the start position and length for a capture buffer
CALL PRXSUBSTR       Returns the position and length of a substring that matches a pattern
PRXCHANGE Function   Performs a pattern-matching replacement
PRXMATCH Function    Searches for a pattern match and returns the position at which the pattern is found
PRXPAREN Function    Returns the last bracket match for which there is a match in a pattern
PRXPARSE Function    Compiles a Perl regular expression (PRX) that can be used for pattern matching of a character value
PRXPOSN Function     Returns the value for a capture buffer

single character "wildcards"

.     matches any character
\d    matches a numeric character
\D    matches a non-numeric character
\w    matches a "word character" (letter, digit, or underscore)
\W    matches a non-word character
\s    matches white space (spaces or tabs)
\S    matches non-white space
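
To try the wildcards above on a single value, a minimal sketch follows. The test string and the variable names are made up for illustration; PRXMATCH is called with the pattern as a constant, as in the PROC SQL example earlier.

```sas
data _null_;
  length s $ 12;
  s = 'Ab_9 x';   /* hypothetical test value */
  /* PRXMATCH returns the position of the first match, or 0 if none */
  firstDigit    = prxmatch('/\d/', s);   /* first numeric character  */
  firstNonDigit = prxmatch('/\D/', s);   /* first non-numeric        */
  firstNonWord  = prxmatch('/\W/', s);   /* first non-word character */
  firstSpace    = prxmatch('/\s/', s);   /* first white space        */
  put firstDigit= firstNonDigit= firstNonWord= firstSpace=;
run;
```

The underscore counts as a word character, so the first non-word character in this value is the blank, which is also the first white space.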

Try a different pattern for expr

data myturn;
  retain expr '/Whatever/'; /* put your own expression here */
  retain prxNum;
  length c $80;
  input c $80.;
  if _n_=1 then do;
    prxNum=prxParse(expr);
    if prxNum=0 then put 'bad expression' expr= ;
  end;
  start=prxMatch(prxNum,c);
  put start= c= ;
datalines;
Whatever floats your boat
Now is the time
for
all-good
men 2
come to the
aid of their country.
the quick brown fox jumped over the lazy dog
The quick red fox jumped over the 3 lazy dogs
You could replace this with whatever text you wanted.
;
run;

find all the numbers

find the first space on each line

find any non word characters

sample expressions

find all the numbers                 /\d/
find the first space on each line    /\s/
find any non word characters         /\W/
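
PRXMATCH reports only the first match on a line. To list every number, one option is CALL PRXNEXT from the function list above. The sketch below is an illustration, not part of the original slides: the text value and variable names are assumptions, and the pattern uses \d+ (one or more digits) so that a multi-digit number is returned as one match.

```sas
data _null_;
  length text $ 60 found $ 10;
  text = 'The quick red fox jumped over the 3 lazy dogs 27 times';
  re = prxparse('/\d+/');
  start = 1;              /* where the search begins         */
  stop  = length(text);   /* last position to be searched    */
  /* each call finds the next match and advances start past it */
  call prxnext(re, start, stop, text, position, len);
  do while (position > 0);
    found = substr(text, position, len);
    put found= position= len=;
    call prxnext(re, start, stop, text, position, len);
  end;
run;
```

The loop ends when CALL PRXNEXT sets position to 0, meaning no further match was found.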

Anchors

^    beginning of the string
$    end of the string

Character Classes

[acB]              matches "a", "c" or "B"
[D-G]              matches "D", "E", "F", or "G"
[^aeiouyAEIOUY]    matches any non vowel

Search for words

data mywords;
  /* words starting with a-d */
  retain expr '/^[a-dA-D]/';
  retain prxNum;
  length word $50;
  input word $50.;
  if _n_=1 then do;
    prxNum=prxParse(expr);
    if prxNum=0 then put 'bad expression' expr= ;
  end;
  start=prxMatch(prxNum,word);
  put start= word= ;
  if start>0;
datalines;
a
boo
cwm
Dublin
oocyte
pneumonoultramicroscopicsilicovolcanoconiosis
qat
Washington
;
run;

find all the proper names

find words with a "q" not followed by a "u"

How about?


find all the proper names                      /[A-Z]/
find words with a "q" not followed by a "u"    /q[^u]/
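
One caveat with /q[^u]/: the class [^u] must match some character, so a "q" that is the very last character of the string is missed. SAS pads character variables with trailing blanks, and a blank satisfies [^u], so the difference only shows on trimmed values. A possible alternative is the negative look-ahead (?!u), which the SAS metacharacter tables list as supported. The sketch below compares the two; the word list and variable names are made up for illustration.

```sas
data _null_;
  length word $ 12;
  do word = 'qat', 'queen', 'Iraq';
    /* trim() removes the trailing blanks SAS pads the value with */
    classHit     = prxmatch('/q[^u]/',  trim(word));
    lookaheadHit = prxmatch('/q(?!u)/', trim(word));
    put word= classHit= lookaheadHit=;
  end;
run;
```

Both patterns should find "qat" and reject "queen"; only the look-ahead version should find the final "q" of the trimmed "Iraq".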

Multipliers

{n}      previous expression n times, e.g. {3}
{n,}     previous expression n or more times
{n,m}    previous expression from n to m times
{0,m}    previous expression m or fewer times
*        previous expression 0 or more times    {0,}
+        previous expression 1 or more times    {1,}
?        previous expression 0 or 1 times       {0,1}
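
As a small illustration of the multipliers combined with anchors, the sketch below checks for exactly three digits, a hyphen, and exactly four digits. The sample values and variable names are assumptions. The value is trimmed because SAS pads character variables with blanks, which would otherwise sit between the last digit and the $ anchor.

```sas
data _null_;
  length s $ 10;
  re = prxparse('/^\d{3}-\d{4}$/');
  do s = '555-1234', '55-1234', '555-12345';
    ok = prxmatch(re, trim(s));   /* 1 if the whole value fits, else 0 */
    put s= ok=;
  end;
run;
```

Only the first value should match: the second has too few leading digits, and the third has a fifth digit before the end of the string.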


from the word list

find words without vowels



/^[^aeiouyAEIOUY]+$/


"write only"?

document your expressions

find words without vowels


/^[^aeiouyAEIOUY]+$/


/*
  ^                   beginning of string
  [^aeiouyAEIOUY]+    one or more non-vowels
  $                   end of string
*/


Hangman Example


Suppose we want to code the sequence of guesses in the game of hangman by the use of inferred strategies

e.g. did the person guess the most frequently used letters first?

did the person guess vowels first?

Coding the strategies

%let ns=4; /* number of strategies */

data HangmanGuesses;
  drop i prxNum1-prxNum&ns;
  array expr{&ns} $80 ex1-ex&ns (
    '/^[aeiou]{3}/'
    '/^[etaoin]{6}/'
    '/^qwerty/'
    '/^[zqxjkv]{6}/'
  );
  array used{&ns} used1-used&ns;
  label used1= '3 vowels first'
        used2= 'letter frequency'
        used3= 'qwerty'
        used4= 'unusuals'
  ;
  array prx{&ns} prxNum1-prxNum&ns;
  retain used1-used&ns;     /* strategy indicators */
  retain ex1-ex&ns;         /* strategy expressions */
  retain prxNum1-prxNum&ns; /* prx numbers */
  length guess $13;
  /* the colon modifier stops reading guess at the first blank,
     so short guesses do not swallow the success flag */
  input guess : $13. success;
  guess=lowcase(guess);

  if _n_=1 then do i=1 to &ns;
    prx{i}=prxParse(expr{i});
    if prx{i}=0 then put 'expression ' i 'is bad ' expr{i}= ;
  end;

  do i=1 to &ns;
    used{i}=prxMatch(prx{i},guess);
  end;
datalines;
eaotwhnrbg 1
etaoinshrdlcu 0
etaoinshrdluc 0
qwertyuiopasd 0
vkjxqznmasdfg 0
asdfghjklzxcv 0
argbe 1
efghijklmnopq 0
abcdefghijklm 0
;
run;

We get dummy variables

Looking at expression 2

Memory within match

(pattern)    treat the pattern as a unit and remember the part of the string matched
\n           inside the match recall substring n

example /(\d{3})X\1/

matches 123X123

not 123X456
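
The back-reference can be checked with a short DATA step. This is a sketch with assumed variable names; it captures all three digits in one group, (\d{3}), so that \1 recalls the whole three-digit string rather than a single digit.

```sas
data _null_;
  length s $ 10;
  re = prxparse('/(\d{3})X\1/');
  do s = '123X123', '123X456';
    pos = prxmatch(re, s);   /* position of the match, 0 if none */
    put s= pos=;
  end;
run;
```

The first value should give pos=1 and the second pos=0, since 456 does not repeat the remembered 123.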

Memory outside match

(pattern)    treat the pattern as a unit and remember the part of the string matched
$n           outside the match recall substring n

example s/(\w+),(\w+)/$2 $1/

substitutes Doe,John

with John Doe
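
In SAS, one way to apply such a substitution is the PRXCHANGE function from the function list earlier. The sketch below is illustrative only; the variable names are assumptions, and each capture group holds a whole word, (\w+), so that $1 and $2 carry the full names.

```sas
data _null_;
  length name $ 20 swapped $ 20;
  name = 'Doe,John';
  /* second argument 1 = replace the first match only */
  swapped = prxchange('s/(\w+),(\w+)/$2 $1/', 1, trim(name));
  put name= swapped=;
run;
```

The log should show swapped=John Doe.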

Call log example

datalines;
I called Fred at 9:17 am at 785-555-1234
10:12 Called George - (913)-555-3213
816-555-9876 was Irving the time was 1:22 pm
751 555 1212 8384 3:33 Bob
;

Get the time


retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';

/*
  \d{1,2}:    one or two digits followed by a colon
  \d{2}\s?    two digits and optional space
  (pm|am)?    optional am or pm
*/
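
The time expression as written matches only lower-case "am" and "pm". If the notes might contain "AM" or "PM", one possible variation is the /i (ignore case) modifier. The sketch below is an assumption-laden illustration: the first note is altered to use "PM", and the variable names are made up.

```sas
data _null_;
  length note $ 60 time $ 8;
  re = prxparse('/\d{1,2}:\d{2}\s?(pm|am)?/i');   /* i = ignore case */
  do note = '816-555-9876 was Irving the time was 1:22 PM',
            '751 555 1212 8384 3:33 Bob';
    call prxsubstr(re, note, position, len);
    /* position is 0 when the pattern is not found */
    if position > 0 then time = substr(note, position, len);
    else time = ' ';
    put time= note=;
  end;
run;
```

Checking position before calling SUBSTR avoids an invalid-argument note in the log for lines that contain no time.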

Get the phone number

define 3 capture buffers

retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';

/*
  \(?            optional left paren
  ([2-9]\d\d)    3 digit area code (buffer 1)
  \)?            optional right paren
  [ -]           space or hyphen
  (\d\d\d)       3 digit exchange (buffer 2)
  [ -]           space or hyphen
  (\d{4})        last 4 digits (buffer 3)
*/

Use the expressions


retain prxTime prxPhone;

if _n_=1 then do;
  prxTime=prxParse(expTime);
  if prxTime=0 then put 'bad expression' expTime= ;

  prxPhone=prxParse(expPhone);
  if prxPhone=0 then put 'bad expression' expPhone= ;
end;

sequence=_n_;

call prxsubstr(prxTime, note, position, length);
time=substr(note,position,length);

call prxsubstr(prxPhone, note, position, length);
phone=substr(note,position,length);

CALL PRXPOSN (prxPhone, 1, position, length);
ac=substr(note,position,length);

CALL PRXPOSN (prxPhone, 2, position, length);
exchange=substr(note,position,length);

CALL PRXPOSN (prxPhone, 3, position, length);
last4=substr(note,position,length);

local=exchange||'-'||last4;
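
The same capture buffers can also be read with the PRXPOSN function, which returns the buffer's text directly instead of a position and length. The following is a self-contained sketch under stated assumptions: a single note is hard-coded, and the variable names and the use of CATX to rebuild the number are illustrative choices, not the slides' own code.

```sas
data _null_;
  length note $ 60 ac $ 3 exchange $ 3 last4 $ 4 phone $ 12;
  note = 'I called Fred at 9:17 am at 785-555-1234';
  re = prxparse('/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/');
  /* PRXPOSN needs a successful match first */
  if prxmatch(re, note) then do;
    ac       = prxposn(re, 1, note);   /* buffer 1: area code */
    exchange = prxposn(re, 2, note);   /* buffer 2: exchange  */
    last4    = prxposn(re, 3, note);   /* buffer 3: last four */
    phone    = catx('-', ac, exchange, last4);
  end;
  put ac= exchange= last4= phone=;
run;
```

Rebuilding the number from the three buffers gives one standard form whether the source used parentheses, blanks, or hyphens.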


Result

The time and phone number have been extracted.

The phone number is standardized.

Substitution expressions

s/match expression/replacement/

s/cat/hat/    changes cat to hat

s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/

changes Doe-Roe,John to John Doe-Roe



Call PRXCHANGE (Data Step only)

CALL PRXCHANGE (regular-expression-id, times, old-string
    <, new-string <, result-length <, truncation-value
    <, number-of-changes>>>>);

PRXCHANGE (Data Step, SQL, where clauses)

PRXCHANGE(perl-regular-expression | regular-expression-id, times, source)
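
The times argument controls how many matches are replaced; a value of -1 asks for all of them. A minimal sketch of the function form, with a made-up text value and variable names:

```sas
data _null_;
  length text $ 40 once $ 40 all $ 40;
  text = 'cat and cat and cat';
  once = prxchange('s/cat/hat/',  1, text);   /* first match only */
  all  = prxchange('s/cat/hat/', -1, text);   /* every match      */
  put once= / all=;
run;
```

The log should show "hat and cat and cat" for once and "hat and hat and hat" for all.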

data cc;
  length c $60 changedString $60;
  input c $60.;
  prxNum=prxParse('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/');
  CALL prxChange(prxNum, 1, c, changedString,
                 newLength, wasTruncated, numberChanges);
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;

PRXCHANGE example

s/
([a-zA-Z\-]+)    first word
,                comma
[ ]*             zero or more blanks
([a-zA-Z\-]+)    second word
/$2 $1/          switch words

PRXCHANGE example results