Cut points in PEG

Extended Abstract

Roman R. Redziejowski
roman.redz@swipnet.se


1 Introduction

This is a short note inspired by some ideas and results from [3, 6-8]. It is about Parsing Expression Grammars (PEGs), introduced by Ford in [1]. A PEG specifies a language by defining for it a parser with limited backtracking. All of the quoted papers contain a detailed introduction to PEG, so it is not repeated here.

Backtracking means that if e1 in a choice expression e1/e2 fails, the parser returns to the position it had before the attempt at e1 and tries e2 on the same input. Limited backtracking means that once e1 has succeeded, e2 will not be tried on the same input upon a subsequent failure.

In some cases, a failure of e1 may mean that e2 is also bound to fail, so there is no need to try it; one can terminate e1/e2 straight away and return failure. As an example, consider this grammar:

  S = E $
  E = T + E / T
  T = a / b

Suppose that at some point during the parse of S, expression E is applied to input w. It starts by calling T. Clearly, if T fails, w does not start with a or b, so the second alternative, being the same T, must also fail. Suppose now that T + succeeds, after which E fails. One can easily see that the only thing that can follow E in the parse of S is $. Thus, applying the second alternative to w will result in a successful parse only if w = a$ or w = b$. But T + succeeding on w means that w is neither of these. So, trying the second alternative will not result in a successful parse.

We have identified two points in the expression for E, indicated below by ↑ and ⇑, such that you do not need to backtrack if you fail before ↑ or after ⇑:

  E = T ↑ + ⇑ E / T

We shall refer to them as cut points.

Thanks to the backtracking being limited, one can use the so-called packrat technique to run the PEG parser in linear time. The technique consists in saving all results so that they can be reused in case of backtracking. It means buying speed at the cost of large memory consumption. Mizushima et al. [6] noted that after passing a cut point, one can discard the saved results that would be needed by e2. In this way, one can greatly reduce the memory requirement.
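As an illustration of the cut points just identified, here is a minimal hand-written parser for the example grammar, with the two cut points marked in the comments. It is only a sketch: the names (parse_E, parse_T, FAIL) and the treatment of $ as an ordinary last input character are choices made for this illustration, not taken from [6] or from any packrat implementation.

    FAIL = None  # returned when an expression fails

    def parse_T(inp, pos):
        # T = a / b
        if pos < len(inp) and inp[pos] in "ab":
            return pos + 1
        return FAIL

    def parse_E(inp, pos):
        # E = T ↑ + ⇑ E / T
        p = parse_T(inp, pos)
        if p is FAIL:
            # failure before ↑: the second alternative is the same T,
            # so it is bound to fail as well; do not backtrack
            return FAIL
        if p < len(inp) and inp[p] == "+":
            p2 = parse_E(inp, p + 1)
            if p2 is not FAIL:
                return p2
            # failure after ⇑: T + has succeeded, so the input is neither
            # a$ nor b$, and the second alternative cannot lead to a
            # successful parse of S; again, do not backtrack
            return FAIL
        # no "+" found: a failure between ↑ and ⇑, so ordinary limited
        # backtracking applies and the second alternative T is tried
        return parse_T(inp, pos)

    def parse_S(inp):
        # S = E $ ($ is kept as a literal last character of the input)
        p = parse_E(inp, 0)
        return p is not FAIL and p == len(inp) - 1 and inp[p] == "$"

    # a+b$, a+b+a$ and a$ parse; a+$ and c$ do not
    for w in ["a+b$", "a+b+a$", "a$", "a+$", "c$"]:
        print(w, parse_S(w))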

In a recent paper [3], Maidl et al. discuss the way in which a PEG parser may provide meaningful information on why it failed to parse a given input. The traditional way is to report the failure that occurred farthest down in the input. But some failures may be quite innocent, such as a failure of e1 in e1/e2 that is followed by a success of e2. In Section 5 of [3], the authors present an extension to PEG that helps to report real failures. Its use suggests that a real failure is one occurring in the "no return" zone, such as before ↑ and after ⇑ in the example above.

2 Finding cut points

As follows from the above, it is interesting to find cut points in a given grammar. To see how this can be done, we consider a minimal grammar without iteration and predicates, as used in [7, 8]. It has starting symbol S and end-of-input marker $. As in [7, 8], we denote by Tail(e) the set of all terminated strings that can follow an application of e in a parse starting with S. For convenience, we consider choice expressions of the form e0 e1 / e2; they can be easily desugared to the primitive form used in [7, 8]. The input alphabet is denoted by Σ.

The grammar may alternatively be interpreted as a grammar in Backus-Naur Form (BNF), with "/" denoting the unordered choice. We denote by L(e) the language defined by e when interpreted as BNF. The following has been shown in [5] (the proof is also found in [4, 7, 8]):

Proposition 1. If expression e succeeds on input w, it consumes a string belonging to L(e), meaning that w ∈ L(e)Σ*.

To say something about w in the case of e failing requires that each sequence expression e1 e2 satisfies this condition:

  ∀ x, y:  xy ∈ L(e1 e2)Σ* ∧ x ∈ L(e1)  ⇒  y ∈ L(e2)Σ*.    (1)

Using the formal method from [4, 5, 7, 8] one can verify the following:

Proposition 2. In a grammar satisfying (1), if expression e fails on input w then w ∉ L(e)Σ*.

Using Proposition 1 one can verify:

Proposition 3. A sufficient condition for ⇑ after e0 in A = e0 e1 / e2 is:

  L(e0)Σ* ∩ L(e2)Tail(A) = ∅.    (2)

Using Propositions 1 and 2 one can verify:

Proposition 4. A sufficient condition for ↑ after e0 in A = e0 e1 / e2, in a grammar satisfying (1), is:

  L(e2)Tail(A) ⊆ L(e0)Σ*.    (3)
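For the example grammar the languages involved are finite, so conditions (2) and (3) can be checked directly. The following small sketch does that; the sets are written out by hand rather than derived from the grammar, and it uses only the observation that w ∈ L(e)Σ* exactly when some prefix of w belongs to L(e).

    def has_prefix_in(w, lang):
        # w belongs to (lang)Σ* exactly when some prefix of w belongs to lang
        return any(w.startswith(x) for x in lang)

    L_T       = {"a", "b"}       # L(T)
    L_T_plus  = {"a+", "b+"}     # L(T +)
    Tail_E    = {"$"}            # the only terminated string that can follow E
    L_e2_Tail = {x + y for x in L_T for y in Tail_E}   # L(e2)Tail(E) = {a$, b$}

    # (2) with e0 = T + :  L(e0)Σ* ∩ L(e2)Tail(E) = ∅, justifying ⇑ after T +
    cond2 = not any(has_prefix_in(w, L_T_plus) for w in L_e2_Tail)

    # (3) with e0 = T :    L(e2)Tail(E) ⊆ L(e0)Σ*,     justifying ↑ after T
    cond3 = all(has_prefix_in(w, L_T) for w in L_e2_Tail)

    print(cond2, cond3)   # True True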

3 Using first expressions

As the inclusion and the emptiness of intersection of context-free languages are in general undecidable, there is no mechanical way to represent a given expression as e0 e1 satisfying (2) or (3). Mizushima et al. [6] use the set of first terminals as e0. This works if the grammar is LL(1). We extend this to a wider class of grammars by using first expressions.

A first expression of e is any expression first such that L(first) ⊆ Σ+ and L(e) ⊆ L(first)Σ*. Given A = e1 / e2, suppose there exist first expressions first1, first2 such that:

  L(e1) ⊆ L(first1)Σ*,                (4)
  L(e2)Tail(A) ⊆ L(first2)Σ*,         (5)
  L(first1)Σ* ∩ L(first2)Σ* = ∅.      (6)

One can easily see that then L(first1)Σ* ∩ L(e2)Tail(A) = ∅. If e1 = first1 e1' for some e1', we have, according to (2), a cut point ⇑ in e1 after first1. For E in our example, (4-6) are satisfied by first1 = T + and first2 = T $, identifying the cut point ⇑ after T +.

Suppose that instead of (6), first1 and first2 satisfy

  L(first2) ⊆ L(first1)Σ*.            (7)

We then have L(e2)Tail(A) ⊆ L(first1)Σ*. If e1 = first1 e1' for some e1', we have, according to (3), a cut point ↑ in e1 after first1. For E in our example, the conditions are satisfied by first1 = first2 = T, identifying the cut point ↑ after T.

The sets of first terminals used in [6] are a special case of first1 and first2. In that special case, (4-6) are the conditions for the grammar being LL(1). Allowing first1 and first2 to be arbitrary expressions extends the results from [6] to grammars referred to in [7, 8] as LL(kP). These are the grammars where a top-down parser can choose its way by examining the input within the reach of k parsing procedures. (Note that our example in the Introduction is LL(2P).)

4 External cut point

It is often difficult to represent e1 as first1 e1'. This is solved in [6] by replacing A = e1 / e2 with:

  A = (!first2) e1 / e2.              (8)

Obviously, e2 must fail once !first2 has succeeded, so we have ⇑ after (!first2). To verify that this also works for first expressions, we have to introduce the not-predicate "!" into our grammar. It can be done in a restricted way, by defining (!e0) e1 / e2 as a new expression and formally specifying its semantics.
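As an aside, the effect of the rewrite (8) on the example grammar can be sketched in code, with first2 = T $ as found in Section 3. The names and the input representation are again chosen only for this illustration; the comments record why passing the point after (!first2) makes a return to the second alternative unnecessary.

    FAIL = None

    def parse_T(inp, pos):
        # T = a / b
        return pos + 1 if pos < len(inp) and inp[pos] in "ab" else FAIL

    def parse_first2(inp, pos):
        # first2 = T $
        p = parse_T(inp, pos)
        if p is FAIL or p >= len(inp) or inp[p] != "$":
            return FAIL
        return p + 1

    def parse_E(inp, pos):
        # E = (!first2) (T + E) / T, the rewrite (8) with first2 = T $
        if parse_first2(inp, pos) is FAIL:
            # !first2 succeeded (consuming nothing): the input does not start
            # with T $, so e2 = T followed by Tail(E) = $ cannot succeed; a
            # later failure of e1 therefore need not fall back to e2 (⇑)
            p = parse_T(inp, pos)
            if p is FAIL or p >= len(inp) or inp[p] != "+":
                return FAIL
            return parse_E(inp, p + 1)
        # !first2 failed: the input starts with T $, so only e2 = T can succeed
        return parse_T(inp, pos)

    def parse_S(inp):
        # S = E $
        p = parse_E(inp, 0)
        return p is not FAIL and p == len(inp) - 1 and inp[p] == "$"

    # same results as for the original E = T + E / T on these inputs
    for w in ["a+b$", "a$", "a+$", "b+a+b$"]:
        print(w, parse_S(w))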

Assuming that the grammar satisfies (1) and that first1, first2 satisfy (4-6), one can use Propositions 1 and 2 to verify that:

- (!first2) e1 / e2 is equivalent to e1 / e2, in the sense that, on the same input, either both consume the same text or both fail;
- if !first2 succeeds, then e2 fails.

5 Labeled failures

In the standard version of PEG, a failing expression returns just an indication that it failed. In the modification suggested in Section 5 of [3], a failing expression returns a label, which may conveniently be a complete error message. One distinguished label is just "fail". A failing terminal returns fail by default. Other labels are created by the new expression ⇑l, which forces an immediate failure with label l.

The meaning of the choice e1 / e2 is redefined so that if e1 fails with label fail, e2 is tried in the normal way and the expression terminates with the result of e2. If e1 fails with a label other than fail, the whole expression fails immediately with that label, without trying e2. Of course, this can only be safe before the cut point ↑ and after the cut point ⇑. The technique can be used in our example like this:

  E = (T / ⇑t) + (E / ⇑e) / T

where t may be the message "Term expected" and e the message "Expression expected". The result is the message "Term expected" for a failure before ↑ and "Expression expected" for a failure after ⇑.

6 Problems

The set of first terminals of e can be mechanically computed by restricting to terminals the set First(e), where First is the relation describing which expressions appear as first in the definition of a given expression. Condition (6) then boils down to checking that two sets of letters are disjoint. This means that cut points in the form (8) can always be inserted automatically. But this works only for LL(1) grammars, while the backtracking of PEG is often used precisely to avoid the LL(1) restriction.

It is suggested in [7, 8] how to find first1, first2 satisfying (4, 5) among the subsets of First(e1) and First(e2). But checking (6) for these subsets cannot, in general, be done mechanically. It appears that finding cut points in non-LL(1) grammars must to a large extent be done manually. The same applies to the insertion of labeled failures, even if one solves the problem of mechanically generating meaningful message texts.

An anonymous reviewer pointed out that manually inserted cut points and labeled failures make the grammar completely unreadable. One has to find another way of conveying this information. It can, for example, be done in semantic procedures, which in some parser generators (such as the author's Mouse) are separated from the grammar.
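The mechanical check described at the beginning of Section 6 can be sketched for grammars in the minimal form of Section 2. The grammar encoding and the helper names below are chosen only for this illustration. The sketch collects the first terminals of each alternative and tests condition (6) restricted to terminals; for E the test fails, confirming that the example grammar is not LL(1) and that its cut points cannot be inserted by this method.

    # Alternatives are lists of symbols; symbols not defined in the grammar
    # (here: a, b, +, $) are terminals. As in the minimal grammar of
    # Section 2, no expression matches the empty string.
    GRAMMAR = {
        "S": [["E", "$"]],
        "E": [["T", "+", "E"], ["T"]],
        "T": [["a"], ["b"]],
    }

    def first_terminals(symbol, grammar, seen=frozenset()):
        # terminals that can appear first in a string derived from symbol
        if symbol not in grammar:
            return {symbol}
        if symbol in seen:          # guard against recursive definitions
            return set()
        result = set()
        for alternative in grammar[symbol]:
            result |= first_terminals(alternative[0], grammar, seen | {symbol})
        return result

    def cut_insertable(alt1, alt2, grammar):
        # condition (6) restricted to terminals: the two sets of letters must
        # be disjoint; then A = (!first2) e1 / e2 as in (8) can be generated
        return first_terminals(alt1[0], grammar).isdisjoint(
               first_terminals(alt2[0], grammar))

    e1, e2 = GRAMMAR["E"]
    print(cut_insertable(e1, e2, GRAMMAR))   # False: E is not LL(1)
    t1, t2 = GRAMMAR["T"]
    print(cut_insertable(t1, t2, GRAMMAR))   # True: a and b are disjoint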

A recent note [2] to the PEG discussion forum pointed out an important fact: cut points are local to a specific expression. Their use for discarding saved results and for generating diagnostics must be considered in the context in which the expression is invoked. Suppose expression e invoked e'. Passing a cut point in e' does not mean that e passed its cut point; e may still need its saved results. Similarly, a serious termination of e' may turn out to be an innocent one in e. This must be taken into account when designing any scheme using cut points.

References

1. Ford, B.: Parsing expression grammars: A recognition-based syntactic foundation. In: Jones, N.D., Leroy, X. (eds.) Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2004, pp. 111-122. ACM, Venice, Italy (14-16 January 2004)

2. Hobbelt, G.: Breaking the Mizushima(2010) cut operator. PEG mailing list archives (July 2014), https://lists.csail.mit.edu/pipermail/peg/2014-july/000629.html

3. Maidl, A.M., Medeiros, S., Mascarenhas, F., Ierusalimschy, R.: Error reporting in Parsing Expression Grammars. Tech. rep., PUC-Rio, UFRJ Rio de Janeiro, UFRN Natal, Brazil (2014), http://arxiv.org/pdf/1405.6646v1.pdf

4. Mascarenhas, F., Medeiros, S., Ierusalimschy, R.: On the relation between context-free grammars and Parsing Expression Grammars. Tech. rep., UFRJ Rio de Janeiro, UFS Aracaju, PUC-Rio, Brazil (2013), http://arxiv.org/pdf/1304.3177v1

5. Medeiros, S.: Correspondência entre PEGs e Classes de Gramáticas Livres de Contexto. Ph.D. thesis, Pontifícia Universidade Católica do Rio de Janeiro (Aug 2010)

6. Mizushima, K., Maeda, A., Yamaguchi, Y.: Packrat parsers can handle practical grammars in mostly constant space. In: Lerner, S., Rountev, A. (eds.) Proceedings of the 9th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE '10, Toronto, Ontario, Canada, June 5-6, 2010, pp. 29-36. ACM (2010)

7. Redziejowski, R.R.: From EBNF to PEG. Fundamenta Informaticae 128(1-2), 177-191 (2013)

8. Redziejowski, R.R.: More about converting BNF to PEG. Fundamenta Informaticae (2014), to appear, http://www.romanredz.se/papers/fi2014.pdf