Parsing Expression Grammar and Packrat Parsing

Parsing Expression Grammar and Packrat Parsing (Survey) IPLAS Seminar Oct 27, 2009 Kazuhiro Inaba

This Talk is Based on These Resources
  The Packrat Parsing and PEG Page (by Bryan Ford), http://pdos.csail.mit.edu/~baford/packrat/ (was active till early 2008)
  A. Birman & J. D. Ullman, Parsing Algorithms with Backtrack, Information and Control (23), 1973
  B. Ford, Packrat Parsing: Simple, Powerful, Lazy, Linear Time, ICFP 2002
  B. Ford, Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, POPL 2004

Outline
  What is PEG?
    Introduce the core idea of Parsing Expression Grammars
  Packrat Parsing
    Parsing Algorithm for the core PEG
  Packrat Parsing Can Support More
    Syntactic predicates
  Full PEG
    This is what is called PEG in the literature.
  Theoretical Properties of PEG
  PEG in Practice

What is PEG?
  Yet Another Grammar Formalism
    Intended for describing grammars of programming languages (not for natural languages, nor for program analysis)
    As simple as Context-Free Grammars
    Linear-time parsable
    Can express:
      All deterministic CFLs (LR(k) languages)
      Some non-CFLs

What is PEG? Comparison to CFG
  (Predicate-Free) Parsing Expression Grammar
    A ← B C      Concatenation
    A ← B / C    Prioritized Choice: when both B and C match, prefer B
  Context-Free Grammar
    A → B C      Concatenation
    A → B | C    Unordered Choice: when both B and C match, either will do

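Example
  (Predicate-Free) Parsing Expression Grammar
    S ← A a b c
    A ← a A / a
    S fails on aaabc. Oops! A greedily consumes aaa and never backtracks to a shorter match, so the "a" that follows A in S has nothing left to match.
  Context-Free Grammar
    S → A a b c
    A → a A | a
    S recognizes aaabc (A derives aa, followed by abc).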

Another Example
  (Predicate-Free) Parsing Expression Grammar
    S ← E ; / while ( E ) S / if ( E ) S else S / if ( E ) S / ...
    if(x>0) if(x<9) y=1; else y=3; is unambiguous: the else binds to the nearest if.
  Context-Free Grammar
    S → E ; | while ( E ) S | if ( E ) S else S | if ( E ) S | ...
    if(x>0) if(x<9) y=1; else y=3; is ambiguous.

Formal Definition
  A predicate-free PEG G is <N, Σ, S, R>
    N : finite set of nonterminal symbols
    Σ : finite set of terminal symbols
    S ∈ N : start symbol
    R : N → rhs : rules, where
      rhs ::= ε
            | A (∈ N)
            | a (∈ Σ)
            | rhs / rhs
            | rhs rhs
  Note: A ← rhs stands for R(A) = rhs
  Note: Left-recursion is not allowed

Semantics
  [[ e ]] :: String → Maybe String, where String = Σ*

  [[ c ]] = λs → case s of        (for c ∈ Σ)
              c : t → Just t
              _     → Nothing

  [[ e1 e2 ]] = λs → case [[ e1 ]] s of
                  Just t  → [[ e2 ]] t
                  Nothing → Nothing

  [[ e1 / e2 ]] = λs → case [[ e1 ]] s of
                    Just t  → Just t
                    Nothing → [[ e2 ]] s

  [[ ε ]] = λs → Just s

  [[ A ]] = [[ R(A) ]]            (recall: R(A) is the unique rhs of A)

Example (Complete Consumption)
  S ← a S b / c
  [[S]] "acb" = Just ""
    [[aSb]] "acb" = Just ""
      [[a]] "acb" = Just "cb"
      [[S]] "cb" = Just "b"
        [[aSb]] "cb" = Nothing
          [[a]] "cb" = Nothing
        [[c]] "cb" = Just "b"
      [[b]] "b" = Just ""

Example (Failure, Partial Consumption)
  S ← a S b / c
  Failure:
    [[S]] "b" = Nothing
      [[aSb]] "b" = Nothing
        [[a]] "b" = Nothing
      [[c]] "b" = Nothing
  Partial consumption:
    [[S]] "cb" = Just "b"
      [[aSb]] "cb" = Nothing
        [[a]] "cb" = Nothing
      [[c]] "cb" = Just "b"

Example (Prioritized Choice)
  S ← A a
  A ← a A / a
  [[S]] "aa" = Nothing, because [[A]] "aa" = Just "", not Just "a":
    [[A]] "aa" = Just ""
      [[a]] "aa" = Just "a"
      [[A]] "a" = Just ""
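The semantics and the three examples above can be checked mechanically. The following is a minimal sketch of a direct (unmemoized) interpreter for predicate-free PEGs; the names Expr, Grammar, eval, g1 and g2 are illustrative and do not come from the talk, and a missing nonterminal is treated as failure purely for convenience.

```haskell
-- Predicate-free parsing expressions (constructor names are illustrative).
data Expr
  = Eps              -- ε
  | Term Char        -- a ∈ Σ
  | NonTerm String   -- A ∈ N
  | Seq Expr Expr    -- e1 e2
  | Alt Expr Expr    -- e1 / e2 (prioritized choice)

-- R : N → rhs, kept as an association list to stay within the Prelude.
type Grammar = [(String, Expr)]

-- eval g e s returns Just the unconsumed suffix, or Nothing on failure.
eval :: Grammar -> Expr -> String -> Maybe String
eval _ Eps         s = Just s
eval _ (Term c)    s = case s of
  (x:t) | x == c -> Just t
  _              -> Nothing
eval g (NonTerm a) s = case lookup a g of
  Just e  -> eval g e s
  Nothing -> Nothing          -- undefined nonterminal: treated as failure here
eval g (Seq e1 e2) s = case eval g e1 s of
  Just t  -> eval g e2 t
  Nothing -> Nothing
eval g (Alt e1 e2) s = case eval g e1 s of
  Just t  -> Just t           -- e2 is never tried once e1 succeeds
  Nothing -> eval g e2 s

-- S ← a S b / c
g1 :: Grammar
g1 = [ ("S", Alt (Seq (Term 'a') (Seq (NonTerm "S") (Term 'b'))) (Term 'c')) ]

-- S ← A a ;  A ← a A / a
g2 :: Grammar
g2 = [ ("S", Seq (NonTerm "A") (Term 'a'))
     , ("A", Alt (Seq (Term 'a') (NonTerm "A")) (Term 'a')) ]
```

In GHCi, eval g1 (NonTerm "S") "acb" gives Just "", eval g1 (NonTerm "S") "cb" gives Just "b", and eval g2 (NonTerm "S") "aa" gives Nothing, matching the three examples. Without memoization this interpreter can take exponential time on some grammars.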

Recognition-Based
  In generative grammars such as CFG, each nonterminal defines a language (the set of strings that it generates).
  In recognition-based grammars, each nonterminal defines a parser (a function from strings to results) that recognizes a prefix of its input or fails.

[Semantics] Parsing Algorithm for PEG
  Theorem: A predicate-free PEG can be parsed in time linear in the length of the input string.
  Proof: By memoization. All arguments and outputs of [[e]] :: String -> Maybe String are suffixes of the input string, so for an input of length n each expression has only n+1 possible arguments, and each result needs to be computed only once.

[Semantics] Parsing Algorithm for PEG
  How to Memoize?
    Tabular Parsing [Birman&Ullman73]: Prepare a table of size |G| × |input|, and fill it from right to left.
    Packrat Parsing [Ford02]: Use lazy evaluation.
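To make the tabular option concrete, here is a small sketch that fills the table from right to left for the single-rule grammar S ← a S / a. It is specialized by hand to that one grammar (so the table has one row, for S, with the terminal a inlined) and is not a general implementation of the Birman and Ullman algorithm; the name tableS and the choice to store end positions are assumptions of this sketch.

```haskell
-- Tabular (right-to-left) recognition for the grammar S ← a S / a.
-- Entry i of the result describes S applied at input position i:
--   Just j  means S matched input[i .. j-1] (j is the position after the match),
--   Nothing means S failed at position i.
-- The table has one entry per suffix, i.e. length input + 1 entries.
tableS :: String -> [Maybe Int]
tableS input = foldr step [Nothing] (zip [0 ..] input)
  where
    -- rest is the already-filled part of the table, for positions i+1 .. n;
    -- its head is the entry for position i+1, which is all that S at i needs.
    step (i, c) rest@(next : _)
      | c /= 'a'  = Nothing : rest            -- both alternatives start with a
      | otherwise = case next of
          Just j  -> Just j : rest            -- first alternative: a S
          Nothing -> Just (i + 1) : rest      -- second alternative: a
    step _ [] = []  -- unreachable: the accumulator always holds the entry for position n
```

Each entry is computed once from the entry to its right, so the whole table takes time linear in the input length. For example, tableS "aab" evaluates to [Just 2, Just 2, Nothing, Nothing]: S matches "aa" from position 0, "a" from position 1, and fails at positions 2 and 3.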

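[Semantics] Parsing PEG (1: Vanilla Semantics)
  S ← a S / a

    import Control.Monad (mplus)

    doParse :: String -> Maybe String
    doParse = parseS

    -- Terminal a: consume one 'a' and return the rest of the input.
    parseA :: String -> Maybe String
    parseA s = case s of
      'a':t -> Just t
      _     -> Nothing

    -- S ← a S / a: try the first alternative, fall back to the second.
    parseS :: String -> Maybe String
    parseS s = alt1 `mplus` alt2
      where
        alt1 = case parseA s of
          Just t  -> case parseS t of
            Just u  -> Just u
            Nothing -> Nothing
          Nothing -> Nothing
        alt2 = parseA s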

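[Semantics] Parsing PEG (2: Valued)
  S ← a S / a

    import Control.Monad (mplus)

    -- Each parser now also returns a value: the number of a's consumed.
    doParse :: String -> Maybe (Int, String)
    doParse = parseS

    parseA :: String -> Maybe (Int, String)
    parseA s = case s of
      'a':t -> Just (1, t)
      _     -> Nothing

    parseS :: String -> Maybe (Int, String)
    parseS s = alt1 `mplus` alt2
      where
        alt1 = case parseA s of
          Just (n, t) -> case parseS t of
            Just (m, u) -> Just (n+m, u)
            Nothing     -> Nothing
          Nothing -> Nothing
        alt2 = parseA s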

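[Semantics] Parsing PEG (3: Packrat Parsing)
  S ← a S / a

    type Result = Maybe (Int, Deriv)
    data Deriv  = D Result Result   -- results of S and of a at one input position

    -- Build the (lazy) derivation for the whole suffix s.
    doParse :: String -> Deriv
    doParse s = d
      where
        d       = D resultS resultA
        resultS = parseS d
        resultA = case s of
          'a':_ -> Just (1, next)
          _     -> Nothing
        next    = doParse (tail s)  -- only forced when s starts with 'a'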

[Semantics] Parsing PEG (3: Packrat Parsing, cont'd)
  S ← a S / a

    type Result = Maybe (Int, Deriv)
    data Deriv  = D Result Result

    -- The results of a and S are looked up in the Deriv instead of being recomputed.
    parseS :: Deriv -> Result
    parseS (D _ ra0) = alt1 `mplus` alt2
      where
        alt1 = case ra0 of
          Just (n, D rs1 _) -> case rs1 of
            Just (m, d) -> Just (n+m, d)
            Nothing     -> Nothing
          Nothing -> Nothing
        alt2 = ra0

  Compared with the valued version, the calls parseA s and parseS t are replaced by the memoized fields ra0 and rs1.

[Semantics] Packrat Parsing Can Do More
  Without sacrificing linear parsing time, more operators can be added. In particular, syntactic predicates:

  [[ &e ]] = λs → case [[ e ]] s of
               Just _  → Just s
               Nothing → Nothing

  [[ !e ]] = λs → case [[ e ]] s of
               Just _  → Nothing
               Nothing → Just s

Formal Definition of PEG
  A PEG G is <N, Σ, S, R : N → rhs> where
    rhs ::= ε
          | A (∈ N)
          | a (∈ Σ)
          | rhs / rhs
          | rhs rhs
          | &rhs
          | !rhs
          | rhs?   (eqv. to X where X ← rhs / ε)
          | rhs*   (eqv. to X where X ← rhs X / ε)
          | rhs+   (eqv. to X where X ← rhs X / rhs)

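Example: A Non Context-Free Language
  {a^n b^n c^n | n > 0} is recognized by
    S ← &(X !b) a* Y !a !b !c
    X ← a X b / a b
    Y ← b Y c / b c
  The predicate &(X !b) checks that the input starts with a^n b^n and that no further b follows, without consuming anything. Then a* skips the a's, Y matches b^m c^m, and !a !b !c requires the end of the input. The !b inside the predicate is needed: with &X alone, abbcc would also be accepted.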
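The grammar for {a^n b^n c^n | n > 0} can be tried out with a few parser combinators in the style of the semantics. This is a sketch under stated assumptions: the combinator names (char, .>, ./, andP, notP, star) are made up for illustration, and the lookahead is written &(X !b), because with &X alone the X check only guarantees at least n b's, so abbcc would be accepted.

```haskell
type Parser = String -> Maybe String   -- Just the remaining input, or Nothing

char :: Char -> Parser
char c (x:t) | x == c = Just t
char _ _              = Nothing

infixr 3 .>
(.>) :: Parser -> Parser -> Parser     -- sequence: e1 e2
(p .> q) s = p s >>= q

infixr 2 ./
(./) :: Parser -> Parser -> Parser     -- prioritized choice: e1 / e2
(p ./ q) s = case p s of
  Just t  -> Just t
  Nothing -> q s

andP, notP :: Parser -> Parser         -- syntactic predicates &e and !e
andP p s = case p s of { Just _ -> Just s;  Nothing -> Nothing }
notP p s = case p s of { Just _ -> Nothing; Nothing -> Just s }

star :: Parser -> Parser               -- e*, for an e that consumes input on success
star p s = case p s of
  Just t  -> star p t
  Nothing -> Just s

-- X ← a X b / a b      Y ← b Y c / b c
x, y :: Parser
x = (char 'a' .> x .> char 'b') ./ (char 'a' .> char 'b')
y = (char 'b' .> y .> char 'c') ./ (char 'b' .> char 'c')

-- S ← &(X !b) a* Y !a !b !c
anbncn :: Parser
anbncn = andP (x .> notP (char 'b'))
      .> star (char 'a')
      .> y
      .> notP (char 'a') .> notP (char 'b') .> notP (char 'c')

-- A word is recognized when S consumes it completely.
recognizes :: String -> Bool
recognizes w = anbncn w == Just ""
```

With these definitions, recognizes "aabbcc" is True, while recognizes "abbcc" and recognizes "aabbc" are both False. The combinators are not memoized, so this only illustrates the semantics, not the linear-time bound.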

Example: C-Style Comment
  Comment ← "/*" ((! "*/") Any)* "*/"
  (On the original slide, meta-symbols are colored for readability.)
  Though this is a regular language, it cannot be written this easily as a conventional regex.
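A hand-written recognizer for this rule might look as follows. It is a sketch of the rule Comment ← "/*" ((! "*/") Any)* "*/" specialized by hand; the function names comment and body are illustrative.

```haskell
import Data.List (isPrefixOf, stripPrefix)

-- Comment ← "/*" ((! "*/") Any)* "*/"
-- Returns Just the input after the comment, or Nothing if no comment starts here.
comment :: String -> Maybe String
comment s = stripPrefix "/*" s >>= body
  where
    -- body implements ((! "*/") Any)* "*/"
    body t
      | "*/" `isPrefixOf` t = Just (drop 2 t)  -- !"*/" fails: the repetition stops and "*/" matches
      | otherwise = case t of
          _ : t' -> body t'                    -- !"*/" holds: Any consumes one character
          []     -> Nothing                    -- input ended before the closing "*/"
```

For instance, comment "/* a * b */ rest" yields Just " rest", and comment "/* unterminated" yields Nothing. The negative predicate is what stops the repetition at the first "*/", so the comment always ends at the earliest possible closing delimiter.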

Theoretical Properties of PEG
  Two Topics
    Properties of Languages Defined by PEG
    Relationship between PEG and predicate-free PEG

Language Defined by PEG
  For a parsing expression e:
    [Ford04] F(e) = {w ∈ Σ* | [[e]] w ≠ Nothing}
    [BU73] B(e) = {w ∈ Σ* | [[e]] w = Just ""}
    [Redziejowski08] Investigation of the concatenation [[e1 e2]] of two PEGs:
      S(e) = {w ∈ Σ* | ∃u. [[e]] wu = Just u}
      L(e) = {w ∈ Σ* | ∀u. [[e]] wu = Just u}
  R. R. Redziejowski, Some Aspects of Parsing Expression Grammar, Fundamenta Informaticae (85), 2008

Properties of F(e) = {w ∈ Σ* | [[e]] w ≠ Nothing}
  F(e) is context-sensitive
  Contains all deterministic CFLs
  Trivially closed under Boolean operations
    F(e1) ∩ F(e2) = F( (&e1) e2 )
    F(e1) ∪ F(e2) = F( e1 / e2 )
    ~F(e) = F( !e )
  Undecidable Problems
    F(e) = ∅ ? is undecidable (the proof is similar to that of intersection emptiness of context-free languages)
    F(e) = Σ* ? is undecidable
    F(e1) = F(e2) ? is undecidable

Properties of B(e) = {w ∈ Σ* | [[e]] w = Just ""}
  B(e) is context-sensitive
  Contains all deterministic CFLs
  For predicate-free e1, e2:
    B(e1) ∩ B(e2) = B(e3) for some predicate-free e3
  For predicate-free & well-formed e1, e2, where well-formed means that [[e]] s is either Just "" or Nothing:
    B(e1) ∪ B(e2) = B(e3) for some predicate-free & well-formed e3
    ~B(e1) = B(e3) for some predicate-free e3
  Emptiness, universality, and equivalence are undecidable

Properties of B(e) = {w ∈ Σ* | [[e]] w = Just ""}
  Forms an AFDL, i.e., is closed under
    markedunion(L1, L2) = aL1 ∪ bL2
    markedrep(L1) = (aL1)*
    marked inverse GSM (inverse image of a string transducer with explicit endmarker)
  [Chandler69] An AFDL is closed under many other operations, such as left-/right-quotients and intersection with regular sets.
  W. J. Chandler, Abstract Families of Deterministic Languages, STOC 1969

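Predicate Elimination
  Theorem: Let G = <N, Σ, S, R> be a PEG such that F(S) does not contain ε. Then there is an equivalent predicate-free PEG.
  Proof (Key Ideas):
    [[ &e ]] = [[ !!e ]]
    [[ !e C ]] = [[ (e Z / ε) C ]] for ε-free C,
      where Z ← (σ1 / ... / σn) Z / ε and {σ1, ..., σn} = Σ
    (If e succeeds, Z consumes the rest of the input and the ε-free C fails; if e fails, ε is chosen and C runs on the original input.)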

Predicate Elimination
  Theorem: PEG is strictly more powerful than predicate-free PEG.
  Proof: We can show, for predicate-free e,
    ∀w. ( [[e]] "" = Just "" ⇒ [[e]] w ≠ Nothing )
  by induction on |w| and on the length of the derivation.
  Thus we have "" ∈ F(S) ⇒ F(S) = Σ*, but this is not the case for general PEG (e.g., S ← !a).

PEG in Practice Two Topics When is PEG useful? Implementations

When is PEG useful?
  When you want to unify lexer and parser
    For packrat parsers, it is easy.
    For LL(1) or LALR(1) parsers, it is not.
  list<list<string>>
    Error in C++98, because >> is RSHIFT, not two closing angle brackets
    OK in Java 5 and C++1x, but with a strange grammar
  (* nested (* comment *) *)
  s = "embedded code #{1+2+3} in string"

Implementations

Performance (Rats!)
  R. Grimm, Better Extensibility through Modular Syntax, PLDI 2006
  Parser generator for PEG, used, e.g., for Fortress
  Experiments on a Java 1.4 grammar, with sources of size 0.7 to 70 KB

PEG in Fortress Compiler
  Syntactic predicates are widely used (though I'm not sure whether they are essential, due to my lack of knowledge of Fortress)

    /* The operator "|->" should not be in the left-hand sides of map expressions and map/array comprehensions. */
    String mapstoop = !("|->" w Expr (w mapsto / wr bar / w closecurly / w comma)) "|->" ;

    /* The operator "<-" should not be in the left-hand sides of generator clause lists. */
    String leftarrowop = !("<-" w Expr (w leftarrow / w comma)) "<-" ;

Optimizations in Rats!

Summary
  Parsing Expression Grammar (PEG)
    has prioritized choice e1 / e2, rather than unordered choice e1 | e2.
    has syntactic predicates &e and !e, which can be eliminated if we assume ε-freeness.
    might be useful for a unified lexer-parser.
    can be parsed in O(n) time, by memoizing.