2 References
> B. Ford, "Packrat Parsing: Simple, Powerful, Lazy, Linear Time", ICFP (2002)
> B. Ford, "Parsing Expression Grammars: A Recognition-Based Syntactic Foundation", POPL (2004)
> K. Mizushima, A. Maeda & Y. Yamaguchi, "Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space", PASTE (2010)
> R. Redziejowski, "Applying Classical Concepts to Parsing Expression Grammar", Fundamenta Informaticae 93(1-3) (2009)
> R. Redziejowski, "BITES instead of FIRST for Parsing Expression Grammar", Fundamenta Informaticae 109(3) (2011)
3 Parsing Expression Grammars
4 Parsing Expression Grammars
> A Parsing Expression Grammar (PEG) is a formalization of recursive descent parsing with backtracking
> Similar in concept to a context-free grammar (CFG)
> Expressed as a set of matching rules N ← e, where each nonterminal N defines a function over its input that either matches, possibly consuming input, or fails, consuming no input
5 PEGs and CFGs
> PEGs can express all deterministic LR(k) languages
> PEGs can also express some non-context-free languages (e.g. $a^n b^n c^n$)
> It is conjectured that there are context-free languages that cannot be parsed by PEGs
> A linear-time algorithm exists for parsing PEGs, while the general CFG algorithms in common use (e.g. CYK, Earley) are cubic in the worst case
6 Nonterminals and Literals
> The expression N matches only if the nonterminal N matches, consuming whatever input the function representing N does
> The expression "xyz" matches and consumes the string xyz
> The expression . matches and consumes any single character
7 Ordered Choice
> An expression may be an ordered choice between subexpressions: E ← s ( / t )*
> Once a subexpression matches, the PEG never backtracks to try another
> Unlike the unordered choice of context-free grammars, a b / a and a / a b are different expressions
> The a b in the second version will never match
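The difference between the two orderings can be made concrete with a few parser combinators. This is a minimal sketch, not taken from the talk: the names `lit`, `seq` and `choice` are hypothetical helpers, a parser is assumed to be a function from `(input, position)` to the new position or `None` on failure, and `a` and `b` are assumed to be single-character literals.

```python
from typing import Callable, Optional

# A parser takes (input, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]

def lit(text: str) -> Parser:
    """Match the literal `text` at the current position."""
    def parse(s: str, i: int) -> Optional[int]:
        return i + len(text) if s.startswith(text, i) else None
    return parse

def seq(*parts: Parser) -> Parser:
    """Match every part in order; fail as soon as one part fails."""
    def parse(s: str, i: int) -> Optional[int]:
        pos: Optional[int] = i
        for part in parts:
            pos = part(s, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*alternatives: Parser) -> Parser:
    """Ordered choice: commit to the first alternative that matches."""
    def parse(s: str, i: int) -> Optional[int]:
        for alternative in alternatives:
            pos = alternative(s, i)
            if pos is not None:
                return pos
        return None
    return parse

ab_or_a = choice(seq(lit("a"), lit("b")), lit("a"))  # a b / a
a_or_ab = choice(lit("a"), seq(lit("a"), lit("b")))  # a / a b

print(ab_or_a("ab", 0))  # 2: the first alternative consumes both characters
print(a_or_ab("ab", 0))  # 1: the choice commits to "a"; a b is never tried
```

On the input `ab`, the second expression stops after one character because the first alternative already succeeded.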
8 Repetition and Option
> Repetition a* or a+ and option a? generally work as you would expect
> Repetition is possessive rather than merely greedy, though: a* a will never match, because a* consumes all the a's and the parser will not backtrack to try to match the final a
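The possessive behaviour is easiest to see next to a backtracking regular-expression engine. In this sketch the combinators `lit`, `seq` and `star` are hypothetical helpers written for illustration, and Python's `re` module stands in for a greedy-with-backtracking matcher.

```python
import re
from typing import Callable, Optional

Parser = Callable[[str, int], Optional[int]]

def lit(text: str) -> Parser:
    """Match the literal `text` at the current position."""
    def parse(s: str, i: int) -> Optional[int]:
        return i + len(text) if s.startswith(text, i) else None
    return parse

def seq(first: Parser, second: Parser) -> Parser:
    """Match `first`, then `second` on whatever input is left."""
    def parse(s: str, i: int) -> Optional[int]:
        mid = first(s, i)
        return None if mid is None else second(s, mid)
    return parse

def star(item: Parser) -> Parser:
    """PEG repetition: consume as many items as possible and never give any back."""
    def parse(s: str, i: int) -> Optional[int]:
        while True:
            nxt = item(s, i)
            if nxt is None or nxt == i:  # stop on failure (or on an empty match)
                return i
            i = nxt
    return parse

a_star_a = seq(star(lit("a")), lit("a"))  # a* a

print(a_star_a("aaa", 0))                        # None: a* has already eaten every a
print(re.fullmatch(r"a*a", "aaa") is not None)   # True: the regex engine backtracks
```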
9 Lookahead
> The lookahead operators &a and !a provide much of the power of PEGs
> &a matches if a does, but does not consume any input
> !a is similar, but matches if a fails
> Needed for the grammar for $a^n b^n c^n$ (n ≥ 1):
    S ← &(A "c") "a"* B !.
    A ← "a" A "b" / ε
    B ← "b" B "c" / ε
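A hand-written recognizer shows how the and-predicate checks the a/b balance without consuming input. This is a sketch under assumptions: the function names are hypothetical, and the predicate is taken to be `&(A "c")`. The `lookahead="!b"` variant is kept only for comparison: with `&(A !"b")`, A falls back to its ε alternative whenever the a's outnumber the b's, the predicate then succeeds at position 0, and an input such as `aabc` is accepted.

```python
def match_A(s: str, i: int) -> int:
    """A <- "a" A "b" / ε. Always succeeds; returns the position after the match."""
    if s.startswith("a", i):
        j = match_A(s, i + 1)
        if s.startswith("b", j):
            return j + 1
    return i  # ε alternative: consume nothing

def match_B(s: str, i: int) -> int:
    """B <- "b" B "c" / ε. Always succeeds; returns the position after the match."""
    if s.startswith("b", i):
        j = match_B(s, i + 1)
        if s.startswith("c", j):
            return j + 1
    return i  # ε alternative: consume nothing

def recognize(s: str, lookahead: str = "c") -> bool:
    """S <- &(A X) "a"* B !.   where X is "c" (default) or !"b"."""
    end_of_A = match_A(s, 0)
    if lookahead == "c":
        predicate_ok = s.startswith("c", end_of_A)
    else:  # the !"b" variant, kept for comparison only
        predicate_ok = not s.startswith("b", end_of_A)
    if not predicate_ok:
        return False
    i = 0                      # the predicate consumed nothing: restart at 0
    while s.startswith("a", i):
        i += 1                 # "a"*
    return match_B(s, i) == len(s)  # B followed by !.

print(recognize("aabbcc"))               # True
print(recognize("aabc"))                 # False
print(recognize("aabc", lookahead="!b")) # True: the unbalanced input slips through
```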
10 Combined Lexing and Parsing
    S     ← _ Expr* !.
    Expr  ← OPEN Expr+ CLOSE / Id / Num
    Id    ← [A-Za-z_]+ _
    Num   ← [0-9]+ _
    OPEN  ← "(" _
    CLOSE ← ")" _
    _     ← ( " " / "\t" / "\n" )*
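The grammar above can be turned into a scannerless recursive descent parser almost rule for rule. The following is one possible sketch, not the talk's implementation: the helper names are hypothetical, the character classes are matched with Python's `re` module for brevity, and the result representation (nested lists, `str` for Id, `int` for Num) is an assumption made for the example.

```python
import re
from typing import Any, List, Optional, Tuple

WS = re.compile(r"[ \t\n]*")    # _   <- ( " " / "\t" / "\n" )*
ID = re.compile(r"[A-Za-z_]+")  # body of Id
NUM = re.compile(r"[0-9]+")     # body of Num

def skip_ws(s: str, i: int) -> int:
    """The _ rule: always succeeds, returns the position after any whitespace."""
    return WS.match(s, i).end()

def token(pattern: "re.Pattern[str]", s: str, i: int) -> Optional[Tuple[str, int]]:
    """Match a token body followed by _ (trailing whitespace)."""
    m = pattern.match(s, i)
    return None if m is None else (m.group(), skip_ws(s, m.end()))

def symbol(text: str, s: str, i: int) -> Optional[int]:
    """OPEN / CLOSE: a literal followed by _."""
    return skip_ws(s, i + len(text)) if s.startswith(text, i) else None

def expr(s: str, i: int) -> Optional[Tuple[Any, int]]:
    """Expr <- OPEN Expr+ CLOSE / Id / Num"""
    j = symbol("(", s, i)
    if j is not None:
        items: List[Any] = []
        while (r := expr(s, j)) is not None:   # Expr+ (checked non-empty below)
            items.append(r[0])
            j = r[1]
        if items:
            k = symbol(")", s, j)
            if k is not None:
                return items, k
    r = token(ID, s, i)                        # second alternative
    if r is not None:
        return r
    r = token(NUM, s, i)                       # third alternative
    if r is not None:
        return int(r[0]), r[1]
    return None

def parse(s: str) -> Optional[List[Any]]:
    """S <- _ Expr* !."""
    i = skip_ws(s, 0)
    out: List[Any] = []
    while (r := expr(s, i)) is not None:
        out.append(r[0])
        i = r[1]
    return out if i == len(s) else None        # !. : reject trailing input

print(parse("(add x (mul 2 y)) 42"))  # [['add', 'x', ['mul', 2, 'y']], 42]
```

There is no separate token stream: every token rule consumes its own trailing whitespace, which is what lets lexing and parsing share one grammar.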
11 Combined Lexing and Parsing
> Can give typically non-recursive lexical tokens recursive structure
> Nested comments:
    Comment ← "(*" ( Comment / !"*)" . )* "*)"
> Expressions in string literals:
    Str ← ["] ( "${" Expr "}" / !["] . )* ["]
12 Unambiguous Parses
> There is only one parse tree for any string in a language represented by a PEG
> No dangling else problem:
    E ← "if" E "then" E "else" E / "if" E "then" E / ...
13 Quirks
> A PEG parser may match on any prefix of the input
> This can be solved by ending the start rule with !.
> PEG parsers do not natively support left recursion, whether direct or indirect
> Not a practical problem, as left recursion represents repetition, which the * operator handles
14 Formalisms
15 Desugaring
> a+ = a a*
> a? = a / ε
> &a = !(!a)
> "ab...z" = "a" "b" ... "z" (recursively applied)
> You do need an ε literal for the empty matcher
> [a-z] = "a" / ... / "z"
> . = [<entire alphabet>]
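The desugaring rules can be written as constructors that build only core expressions. This is a sketch under assumptions: expressions are represented as nested tuples tagged `"eps"`, `"chr"`, `"seq"`, `"alt"`, `"star"` and `"not"`, and all function names are hypothetical.

```python
from typing import Tuple

Expr = Tuple  # ("eps",) | ("chr", c) | ("seq", e1, e2) | ("alt", e1, e2) | ("star", e) | ("not", e)

EPS: Expr = ("eps",)

def seq(*es: Expr) -> Expr:
    """Right-nested binary sequence; the empty sequence is ε."""
    if not es:
        return EPS
    result = es[-1]
    for e in reversed(es[:-1]):
        result = ("seq", e, result)
    return result

def alt(*es: Expr) -> Expr:
    """Right-nested binary ordered choice; needs at least one alternative."""
    if not es:
        raise ValueError("an ordered choice needs at least one alternative")
    result = es[-1]
    for e in reversed(es[:-1]):
        result = ("alt", e, result)
    return result

def plus(e: Expr) -> Expr:
    return ("seq", e, ("star", e))       # a+ = a a*

def opt(e: Expr) -> Expr:
    return ("alt", e, EPS)               # a? = a / ε

def and_(e: Expr) -> Expr:
    return ("not", ("not", e))           # &a = !(!a)

def lit(text: str) -> Expr:
    return seq(*[("chr", c) for c in text])   # "ab...z" = "a" "b" ... "z"

def cls(lo: str, hi: str) -> Expr:
    """[lo-hi] = lo / ... / hi, for a contiguous code-point range."""
    return alt(*[("chr", chr(c)) for c in range(ord(lo), ord(hi) + 1)])

print(plus(("chr", "a")))  # ('seq', ('chr', 'a'), ('star', ('chr', 'a')))
print(lit(""))             # ('eps',) : the empty literal is why ε is needed
```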
16 Parsing Expression Grammar
> A parsing expression grammar $G$ is a 4-tuple $(N, \Sigma, R, e_S)$
> $N$ is the set of nonterminals
> $\Sigma$ is the set of characters $G$ is defined over
> $R$ is a map from each $A \in N$ to some parsing expression $e$
> $e_S$ is the expression $R(S)$ corresponding to the start nonterminal $S$
17 Parsing Expressions
> $\varepsilon$, the empty string
> $a$, a terminal, $a \in \Sigma$
> $A$, a nonterminal, $A \in N$
> $e_1 e_2$, a sequence
> $e_1 / e_2$, an ordered choice
> $e^*$, zero-or-more repetitions
> $!e$, a not-predicate
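18 Matching
> Define $e(s)$, $s \in \Sigma^*$, to be a function returning either $s' \in \Sigma^*$, a suffix of $s$ containing the unconsumed input from a match, or $\mathrm{fail}$, a failure
> $e$ matches on $s$ if $e(s) \in \Sigma^*$
> $e$ fails on $s$ if $e(s) = \mathrm{fail}$
> The language $L(G)$ of a grammar $G = (N, \Sigma, R, e_S)$ is $\{\, s \in \Sigma^* : e_S(s) \in \Sigma^* \,\}$
> Note that a match may consume only a prefix of $s$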
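19 Formal Definitions
> $\varepsilon(s) = s$
> $a(s) = \mathrm{rest}(s)$ if $a = \mathrm{first}(s)$, $\mathrm{fail}$ otherwise
> $A(s) = (R(A))(s)$
> $(e_1 e_2)(s) = \mathrm{fail}$ if $e_1(s) = \mathrm{fail}$, $e_2(e_1(s))$ otherwise
> $(e_1 / e_2)(s) = e_1(s)$ if $e_1(s) \neq \mathrm{fail}$, $e_2(s)$ otherwise
> $e^*(s) = (e\, e^* / \varepsilon)(s)$ (fixed point)
> $(!e)(s) = s$ if $e(s) = \mathrm{fail}$, $\mathrm{fail}$ otherwise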
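The definitions above translate nearly line for line into an interpreter. This is a minimal sketch with assumed conventions: expressions are nested tuples (the tags are hypothetical), a successful match returns the unconsumed suffix as a `str`, and failure is represented by `None`. Repetition is evaluated with a loop rather than literally as a fixed point, and the loop stops if the body matches without consuming input.

```python
from typing import Dict, Optional, Tuple

Expr = Tuple  # ("eps",) | ("chr", c) | ("nt", A) | ("seq", e1, e2) | ("alt", e1, e2) | ("star", e) | ("not", e)

def match(e: Expr, s: str, rules: Dict[str, Expr]) -> Optional[str]:
    """Return the unconsumed suffix of s if e matches, or None if e fails."""
    kind = e[0]
    if kind == "eps":                       # ε(s) = s
        return s
    if kind == "chr":                       # a(s) = rest(s) if a = first(s)
        return s[1:] if s[:1] == e[1] else None
    if kind == "nt":                        # A(s) = (R(A))(s)
        return match(rules[e[1]], s, rules)
    if kind == "seq":                       # e2 runs on what e1 left over
        rest = match(e[1], s, rules)
        return None if rest is None else match(e[2], rest, rules)
    if kind == "alt":                       # e2 is tried only if e1 fails
        rest = match(e[1], s, rules)
        return rest if rest is not None else match(e[2], s, rules)
    if kind == "star":                      # e* = e e* / ε, unrolled into a loop
        while True:
            rest = match(e[1], s, rules)
            if rest is None or rest == s:   # body failed, or made no progress
                return s
            s = rest
    if kind == "not":                       # succeeds, consuming nothing, iff e fails
        return s if match(e[1], s, rules) is None else None
    raise ValueError(f"unknown expression kind: {kind!r}")

# A <- "a" A "b" / ε
rules = {"A": ("alt", ("seq", ("chr", "a"), ("seq", ("nt", "A"), ("chr", "b"))), ("eps",))}

print(match(("nt", "A"), "aabbc", rules))  # c
```

Because the empty string is a legitimate suffix, the code compares against `None` explicitly instead of relying on truthiness.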
20 Packrat Parsing
21 Motivation
> A recursive descent parser for PEGs is simple to implement, but has $O(2^n)$ worst-case runtime
> By comparison, all LR(k) languages can be parsed in time linear in the size of the input string
22 Packrat Parsing
> To obtain a linear time bound on parsing a string with a PEG, desugar repetition expressions into right-recursive nonterminals and memoize the functions representing the nonterminals
> There are a constant number $|N|$ of nonterminal functions, each of which calls other nonterminal functions with either its input string or some suffix thereof
23 Packrat Parsing
> A terminal character can be parsed in time proportional to its length (which is 1)
> A fixed-length sequence or alternation of non-repetitive expressions can be parsed in constant time
> If each expression can be parsed in constant time once its subexpressions are parsed, runtime is bounded by the number of possible subexpression parses: a constant number of expressions times the $n + 1$ suffixes of the input, i.e. $O(n)$
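The gap between the exponential and linear bounds can be observed by counting how often a rule body runs with and without a memo table. The grammar used here, A ← "a" A "b" / "a" A "c" / ε, is a hypothetical worst case chosen for this sketch and not one from the talk; on an input consisting only of a's, both alternatives re-parse the same suffix before failing.

```python
from typing import Dict

def count_rule_evaluations(s: str, memoize: bool) -> int:
    """Count how many times the body of A is evaluated when parsing s from position 0."""
    evaluations = 0
    memo: Dict[int, int] = {}      # start position -> end position of A's match

    def A(i: int) -> int:
        # A <- "a" A "b" / "a" A "c" / ε   (always succeeds)
        nonlocal evaluations
        if memoize and i in memo:
            return memo[i]         # packrat: reuse the stored result
        evaluations += 1
        result = i                 # ε alternative, used if both others fail
        for closer in ("b", "c"):  # the two recursive alternatives, in order
            if s.startswith("a", i):
                j = A(i + 1)
                if s.startswith(closer, j):
                    result = j + 1
                    break
        if memoize:
            memo[i] = result
        return result

    A(0)
    return evaluations

n = 10
print(count_rule_evaluations("a" * n, memoize=False))  # 2047 = 2**(n + 1) - 1
print(count_rule_evaluations("a" * n, memoize=True))   # 11 = n + 1
```

With memoization the body runs once per input position, which matches the one-nonterminal case of the bound discussed above.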
24 Repetition & Left Recursion Elimination
> e* can be rewritten as a new nonterminal E ← e E / ε
> Direct left recursion of the form A ← A a / b can be rewritten as A ← b a*, which we've just shown how to convert to right-recursive form
> Indirect left recursion can be handled similarly, using techniques found in any compilers text
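A common use of the A ← b a* rewrite is left-associative operators. The sketch below assumes a hypothetical rule Expr ← Expr "-" Num / Num, rewritten as Expr ← Num ("-" Num)*, and rebuilds the left-leaning tree by folding inside the repetition loop; the tuple tree format and function name are illustrative choices.

```python
import re
from typing import Optional, Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]

NUM = re.compile(r"[0-9]+")

def parse_expr(s: str, i: int = 0) -> Optional[Tuple[Tree, int]]:
    """Expr <- Num ("-" Num)*, the rewritten form of Expr <- Expr "-" Num / Num."""
    m = NUM.match(s, i)                 # the base case b
    if m is None:
        return None
    tree: Tree = int(m.group())
    i = m.end()
    while s.startswith("-", i):         # each iteration is one a = "-" Num
        m = NUM.match(s, i + 1)
        if m is None:
            break                       # this iteration fails; keep what matched so far
        tree = ("-", tree, int(m.group()))   # fold left: earlier operands nest deeper
        i = m.end()
    return tree, i

print(parse_expr("10-3-2"))  # (('-', ('-', 10, 3), 2), 6)
```

The loop yields the same left-associative shape, (10 - 3) - 2, that the left-recursive rule describes, without the parser ever calling itself at the same input position.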
25 Improvements & Future Directions
26 Packrat Space Usage
> Packrat parsing takes $O(n)$ space to store the memoization table, while more traditional LR parsing methods only take space proportional to the recursion depth of the grammar, which is much smaller for many practical grammars and strings
> Whenever no untried alternatives remain at some point in the parse, the parser can throw away all earlier entries in the memoization table
27 Cut Operators
> Mizushima et al. propose a cut operator ^ which indicates that no later alternative will match; this allows backtracking options to be eliminated more aggressively
> E.g. once the "+" is matched here, the second alternative will never match, so it can be cut:
    E ← L "+" ^ E / L
    L ← [0-9]+
28 Cut Auto-insertion
> Determining whether the languages of two parsing expressions (e.g. e and g in e f / g) are disjoint is undecidable
> Therefore we can't statically insert all the possible cuts into a PEG
> We can compute a conservative approximation of disjointness, though, and insert cuts in those positions
29 FIRST for PEGs
> Redziejowski defines FIRST(e) as the set of terminals, one of which must match at the current position for e to succeed
> Mizushima et al. point out that e / f can be rewritten as !(FIRST(f)) ^ e / f if neither e nor f is nullable and no element of FIRST(e) is a prefix of an element of FIRST(f) or vice versa
30 Limitations of FIRST-based Automatic Cut Insertion
> Only one terminal of lookahead, analogous to LL(1) parsing
> Cannot automatically insert a cut after ":" in the following rule:
    A ← [a-z]+ ":" B / [a-z]+ ";"
> Redziejowski proposes BITES, a more powerful approximation that produces regular expressions of terminals rather than sets
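The FIRST-based check and its one-terminal limitation can be sketched with a small fixed-point computation. This is a classical-style approximation written for illustration, not Redziejowski's or Mizushima et al.'s exact definitions: terminals are assumed to be single characters (so the prefix condition reduces to set disjointness), expressions are nested tuples, and all names are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

Expr = Tuple  # ("eps",) | ("chr", c) | ("nt", A) | ("seq", e1, e2) | ("alt", e1, e2) | ("star", e) | ("not", e)

def analyze(rules: Dict[str, Expr]) -> Tuple[Callable[[Expr], bool], Callable[[Expr], Set[str]]]:
    """Compute nullability and FIRST for every nonterminal by fixed-point iteration."""
    nullable = {name: False for name in rules}
    first: Dict[str, Set[str]] = {name: set() for name in rules}

    def null(e: Expr) -> bool:
        kind = e[0]
        if kind in ("eps", "star", "not"):
            return True                     # may succeed without consuming input
        if kind == "chr":
            return False
        if kind == "nt":
            return nullable[e[1]]
        if kind == "seq":
            return null(e[1]) and null(e[2])
        return null(e[1]) or null(e[2])     # alt

    def fst(e: Expr) -> Set[str]:
        kind = e[0]
        if kind in ("eps", "not"):
            return set()
        if kind == "chr":
            return {e[1]}
        if kind == "nt":
            return set(first[e[1]])
        if kind == "star":
            return fst(e[1])
        if kind == "seq":                   # look past e1 only if it can be empty
            return fst(e[1]) | (fst(e[2]) if null(e[1]) else set())
        return fst(e[1]) | fst(e[2])        # alt

    changed = True
    while changed:                          # both tables only grow, so this terminates
        changed = False
        for name, body in rules.items():
            n, f = null(body), fst(body)
            if n != nullable[name] or f != first[name]:
                nullable[name], first[name] = n, f
                changed = True
    return null, fst

def can_insert_cut(rules: Dict[str, Expr], e: Expr, f: Expr) -> bool:
    """Conservative test for e / f: neither nullable, and FIRST sets disjoint."""
    null, fst = analyze(rules)
    return not null(e) and not null(f) and fst(e).isdisjoint(fst(f))

def plus(e: Expr) -> Expr:
    return ("seq", e, ("star", e))

letters: Expr = ("alt", ("chr", "a"), ("alt", ("chr", "b"), ("chr", "c")))  # stand-in for [a-z]
parens: Expr = ("seq", ("chr", "("), ("seq", ("nt", "E"), ("chr", ")")))
rules = {"Id": plus(letters), "E": ("alt", parens, ("nt", "Id"))}

# "(" E ")" / Id : FIRST sets {"("} and {a, b, c} are disjoint
print(can_insert_cut(rules, parens, ("nt", "Id")))  # True
# [a-z]+ ":" / [a-z]+ ";" : both alternatives start with a letter
print(can_insert_cut(rules, ("seq", plus(letters), ("chr", ":")),
                     ("seq", plus(letters), ("chr", ";"))))  # False
```

The second query fails for exactly the reason given above: the alternatives only diverge after the run of letters, which a single terminal of lookahead cannot see.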
