Perl 6 Update - PGE and Pugs



Dr. Patrick R. Michaud

April 26, 2005

Rules and Grammars

Perl 6 completely redesigns the regular
expression syntax

Regular expressions are now "rules"

Rules can call/embed other rules

Groups of rules can be combined into
Grammars

Current events in Perl 6

Parrot 0.1.2 released

The Perl Foundation receives $25,000 for
completion of Parrot milestones

New Parrot pumpking - Chip Salzenberg

New version of Parrot Grammar Engine (PGE
/ Perl 6 rules) to be released this week

Pugs - Autrijus Tang


Perl 6 test suite

Pugs

Perl 6 compiler written in Haskell

Started by Autrijus Tang

Compiles directly to Haskell or to Parrot AST

Being used to develop Perl 6 tests and
experiment with Perl 6 design

Available at http://pugscode.org

Discussion on perl6-compiler@perl.org
mailing list

Perl 6 rules / Parrot Grammar Engine

The heart of the Perl 6 compiler is the
Perl/Parrot Grammar Engine (PGE)

Implements the Perl 6 rules syntax, compiles
to Parrot code

Perl 6 rules compiler currently written in C

Bootstrap to Perl 6

Steps to Perl 6 compiler

Finish PGE bootstrap in C


Parse p6 "rule" statements and grammars

Use p6 rules to define the Perl 6 grammar

P6 grammar can be used to generate Parrot
abstract syntax trees from Perl 6 programs

Compile, (optimize), execute the abstract
syntax tree to get working Perl 6 program

Use Perl 6 to rewrite the grammar engine in
Perl 6 (faster)

Current state of PGE

Handles concatenation, alternation,
quantifiers, captures*, subpatterns, subrules

Capture semantics redefined in Dec 2004, still
not final

To be added next


Character classes (note: Unicode)


Patterns containing scalars, arrays, hashes


P6 rule syntax

Changes from perl 5


No more trailing /e, /x, /s options


[...] denotes non-capturing groups


^ and $ are beginning/end of string


^^ and $$ are beginning/end of line


. matches any character, including newline


\n and \N match newline/non-newline


# marks a comment (to end of line)


Quantifiers are *, +, ?, and **{m..n}

Character classes

[aeiou] changed to <[aeiou]>

[^0-9] now <-[0..9]>

Properties defined as


<alpha>


<digit>


<alnum>

Combine classes using +/- syntax:

<+<alpha>-[aeiou]>
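
For readers coming from Perl 5 style engines, the class arithmetic above can be approximated in Python's re module, which has no class subtraction operator. This is only a rough analogy and not how PGE works: the names CONSONANT and NON_DIGIT are made up for the example, and [a-z] stands in for <alpha>, which in Perl 6 is meant to be Unicode-aware.

```python
import re

# Rough analogue of <+<alpha>-[aeiou]>: a lowercase ASCII letter that is
# not a vowel. The negative lookahead plays the role of "-[aeiou]".
CONSONANT = re.compile(r'(?![aeiou])[a-z]')

# Rough analogue of <-[0..9]>: any single character that is not a digit.
NON_DIGIT = re.compile(r'[^0-9]')

print(CONSONANT.findall('perl six'))
print(NON_DIGIT.findall('a1b2'))
```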

Subrules

Patterns are now called "rules"

Analogous to subroutines and closures

Like {...}, /.../ compiles into a "rule"
subroutine

P6 rule statement allows named rules:



rule ident / [<alpha>|_] \w* /;

Named rules can be easily used in other
rules:



m / <ident> \:= (.*) /;



rule expr / <term> [ <[+-]> <term> ]* /;


Interpolation

Variables no longer interpolate directly, thus

/ $var /


matches the contents of $var literally, even if
it contains rule metacharacters.
(No \Q and \E)


To treat $var as a rule, use

/ <$var> /

Interpolated arrays match as an alternation:

/ @cmds /

is equivalent to

/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
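
The two interpolation behaviors above (literal scalars, arrays as alternations) can be mimicked in Python's re module. This is a minimal sketch for comparison only, not PGE code; the helper names literal and alternation are hypothetical.

```python
import re

def literal(var):
    # Like / $var / in Perl 6: metacharacters in the value lose their
    # special meaning and are matched as plain text.
    return re.escape(var)

def alternation(items):
    # Like / @cmds /: each element becomes one literal branch of a
    # non-capturing alternation.
    return '(?:' + '|'.join(re.escape(item) for item in items) + ')'

cmds = ['help', 'quit', 'a.b']
pattern = re.compile(alternation(cmds))

print(bool(pattern.fullmatch('a.b')))   # the dot is literal here
print(bool(pattern.fullmatch('axb')))
```

In Python the escaping has to be requested explicitly, whereas the slide describes it as the default in Perl 6 rules.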

Interpolation, cont'd

Hashes match the keys of the hash, and the
value of the hash is either


Executed if it is a closure


Treated as a subrule if it's a string or rule object


Succeeds if value is 1


Fails for any other value

Useful for parsed languages


rule expr / <term> [ %infixop <expr> ]? /


< metasyntax >

The < ... > introduce various forms of
metasyntax

A leading alphabetic character indicates a
subrule or grammatical assertion

<alpha>

<expr>

<before pattern>

<after pattern>

A leading ! negates the match

<!before pattern>

< metasyntax >, cont'd

Leading ' matches a literal string

<'match this exactly (whitespace matters)'>

Leading " matches an interpolated string

<"match $THIS exactly (whitespace matters)">

Leading '+' or '-' are character classes

/<-[a..z]> <-<alpha>>/



< metacharacters >

Leading '(' indicates code assertion

/(\d**{1..3}) <( $1 < 256 )>/

# (fail if $1 is not less than 256)
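
Python's re module has no code assertions, so the check has to happen outside the pattern. The sketch below imitates the example above under that assumption; match_octet is a hypothetical helper, and the explicit loop over lengths stands in for the backtracking a rule engine would do when the assertion fails.

```python
import re

def match_octet(text):
    # Try the longest digit run first (greedy), then back off, the way a
    # backtracking engine would retry after a failed assertion.
    for length in (3, 2, 1):
        m = re.match(r'\d{%d}' % length, text)
        # The "code assertion": reject the capture unless it is < 256.
        if m and int(m.group()) < 256:
            return m.group()
    return None

print(match_octet('192.168'))
print(match_octet('256'))
```

Note that for the input 256 the sketch falls back to the two-digit prefix rather than failing outright, because nothing after the assertion forces all three digits to be consumed.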

A $, @, or % indicates a variable subrule,
where each value (or key) is a subrule to be
matched

<$myrule>

<@cmds>

<%commands>





A cool and somewhat scary example


%cmd{'^\d+'} = { say "You entered a number" };
%cmd{'^hello'} = { say "world" };
%cmd{'^print\s (.*)'} = { say $1; };
%cmd{'^exit'} = { exit() };

while =$*IN {
    /<%cmd>/ || say "Unrecognized command";
}
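
A rough Python counterpart of this dispatch table, offered only as a sketch: the COMMANDS dictionary and the dispatch function are hypothetical names, handlers return strings instead of printing so they are easy to check, and the exit command is left out. Unlike the Perl 6 version, the lookup loop has to be written by hand.

```python
import re

# Each key is a pattern; each value is a handler that receives the match
# object, standing in for the Perl 6 closure stored in the hash.
COMMANDS = {
    r'^\d+': lambda m: 'You entered a number',
    r'^hello': lambda m: 'world',
    r'^print\s(.*)': lambda m: m.group(1),
}

def dispatch(line):
    # Try each key in insertion order; run the handler of the first hit.
    for pattern, handler in COMMANDS.items():
        m = re.search(pattern, line)
        if m:
            return handler(m)
    return 'Unrecognized command'

for line in ['42', 'hello', 'print hi there', 'bye']:
    print(dispatch(line))
```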



Backtracking control

A single colon prevents backtracking into the previous atom



m/ \( <expr> [ , <expr> ]* : \) /



(if we don't find closing paren, no point in trying to match
fewer <expr>s)

Two colons break an alternation:



m:w/ [ if :: <expr> <block>
     | for :: <list> <block>
     | loop :: <loop_controls>? <block>
     ] /


(once we've found "if", "for", or "loop", no point in trying the
other branches of the alternation)




Backtracking control, cont'd

Three colons (:::) fail the current rule

The <commit> assertion fails the entire
match (including any rules that called the
current rule)

The <cut> assertion matches successfully,
removes the matched portion of the string up
to the <cut>, and if backtracked over fails
the match entirely


Useful for throwing away successfully processed
input when matching from an input stream


Like, say, when writing a compiler :-)

Backslash

\L, \U, \Q, \E, \A, \z gone from rules

\n and \N match newline/not newline

\s matches any Unicode space

Backreferences are gone, use $1, $2, $3
(non-interpolated)

Perl 6 allows defining custom backslash
sequences for use in rules


Closures

Anything in curlies is executed as a Perl 6
closure

/ (\w+) { say "Got $1"; } /


Capture semantics

Captures are different in Perl 6

The result of a match is a "match object"

If a match succeeds, the match object has:



Boolean value true


Numeric value 1 (except for global matches)


String value the matched substring


Array component is matched subpatterns


Hash component is matched subrules
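
As a loose point of comparison, Python's re.Match object exposes similar pieces, though through separate method calls rather than one object that acts as boolean, string, array and hash at once. The group name last below is an arbitrary choice for the example.

```python
import re

m = re.match(r'Scooby (dooby) (?P<last>doo)!', 'Scooby dooby doo!')

truth = bool(m)           # true when the match succeeds
matched = m.group(0)      # the matched substring
positional = m.groups()   # every group in order (array-like view)
named = m.groupdict()     # named groups only (hash-like view)

print(truth, matched, positional, named)
```

One difference worth noting: in Python a named group also shows up in the positional list, whereas the slide describes subpatterns and subrules as separate array and hash components.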

Subpattern captures

Part of a rule in parentheses is a subpattern

Each subpattern produces its own match
object




/Scooby (dooby) (doo)!/
         $1      $2

Quantified subpatterns produce arrays of
match objects:




/Scooby (\w+\s+)* (doo)!/
         $1        $2

$1 is a (possibly empty) array of matches
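
By contrast, a quantified group in Python's re module keeps only its last repetition, so getting the full list takes a second pass. The sketch below shows one way to recover the "array of matches" behavior described above; scooby is a hypothetical helper name.

```python
import re

def scooby(text):
    # Capture the whole repeated run as one group, then split it into
    # its individual repetitions with a second pattern.
    m = re.match(r'Scooby ((?:\w+\s+)*)(doo)!', text)
    if m is None:
        return None
    repetitions = re.findall(r'\w+\s+', m.group(1))
    return repetitions, m.group(2)

print(scooby('Scooby dooby dabba doo!'))
print(scooby('Scooby doo!'))
```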







Non-capturing groups

Brackets do not capture, thus they don't
result in a match object




/Scooby [ (\w+\s+)* (doo) ]!/
           $1        $2

Quantified brackets replace nested
subpatterns with the last component
matched:




/Scooby [ (\w+\s+)* (doo) ]+ !/
           $1        $2



Nested capturing subpatterns

Each capturing subpattern introduces a new
lexical scope, with nested captures inside the
new match object:





/Scooby ( (\w+\s+)* (doo) ) !/
          $1[0]     $1[1]
        <------- $1 ------>




Alternations

Alternations introduce a new lexical scope,
thus subpatterns restart counting at zero for
each alternative branch (unlike p5):


           $1       $2
m/ Scooby (dooby)* (doo)!
 | Yabba (dabba)* (doo) /
          $1       $2



This avoids lots of empty subpatterns when
an alternation doesn't match.

Subrules

Subrules capture into a hash keyed by the
name of the subrule:




rule ident / [<alpha>|_] \w* /;

rule num / \d+ /;

m/ <ident> \:= <num> /;



places match objects into $<ident> and
$<num>

Quantified subrules

Like subpatterns, quantified subrules produce
arrays of matches


m:w / dir <file>* /


produces matches in $<file>[0], $<file>[1],
etc.


Nested parens in a subrule capture to the
subrule's match object



Named captures

Portions of a match can be captured directly
into a match object without a subrule:


m:w/ $<name> := \w+ , $<val> := \d+ /



captures the first sequence of alphanumerics
into $<name>, and digits following the
comma into $<val>.
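
The closest Python counterpart is the (?P<name>...) named group. The sketch below is an approximation only: the \s* pieces imitate what the :w (words) modifier does for whitespace in the rule, and PAIR is a made-up name.

```python
import re

# Named groups play the role of $<name> := ... and $<val> := ...
PAIR = re.compile(r'(?P<name>\w+)\s*,\s*(?P<val>\d+)')

m = PAIR.search('width, 80')
print(m.group('name'), m.group('val'))
```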


Grammars

Rules can be packaged together into separate
name spaces to form Grammars



grammar Perl6 {



rule ident { ... };


rule term { ... };


rule expr { ... };


}
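
To give a feel for rules built out of other rules, here is a small Python sketch that composes patterns by string interpolation. It is an assumption-laden stand-in, not a grammar engine: the names IDENT, NUM, TERM, EXPR and parses are hypothetical, and plain string composition cannot express recursive rules the way a real grammar can.

```python
import re

# Each "rule" is a pattern string; later rules embed earlier ones.
IDENT = r'[^\W\d]\w*'                    # like: rule ident / [<alpha>|_] \w* /
NUM = r'\d+'                             # like: rule num / \d+ /
TERM = rf'(?:{IDENT}|{NUM})'
EXPR = rf'{TERM}(?:\s*[+-]\s*{TERM})*'   # like: <term> [ <[+-]> <term> ]*

def parses(rule, text):
    # True when the whole text is matched by the given rule.
    return re.fullmatch(rule, text) is not None

print(parses(EXPR, 'a + 42 - b_1'))
print(parses(EXPR, '+ a'))
```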


:parsetree

The :parsetree flag to a rule causes the
grammar engine to keep all information about
a match.

Thus, one can do something like

$parse = ($source ~~ Perl6::program);


to get the entire parsetree for a program
(including comments)

Questions?