Better Extensibility through Modular Syntax. Robert Grimm New York University




Syntax Matters
- More complex syntactic specifications
  - Extensions to existing programming languages: transactions, event-based code, network protocols, information flow, device drivers, ...
  - Several dialects of the same programming language: think K&R, ISO, and GCC for C
  - Distinct languages with considerable syntactic overlap: think C, C++, Objective-C, Java, C#
- More programming language tools
  - Compilers, interpreters, syntax-highlighting editors, API documentation generators, source measurement tools
- Need easily extensible syntax and parsers!

Three Desirable Properties
- Suitable formalism
  - Closed under composition to enable modularity
  - Unambiguous, as computer formats have one meaning
  - Scannerless to also provide extensibility at lexical level
- Expressive module system
  - Units of productions to provide encapsulation
  - Modifications of units to capture extensions and subsets
  - Flexible composition of units to maximize reuse
- Well-defined escape hatch
  - No formalism can capture all languages

CFGs Fall Short
- LR, LL parsing
  - LALR used by Yacc; LL used by ANTLR, JavaCC
  - But not closed under composition
- GLR, Earley, CYK parsing
  - GLR used by Bison, Elkhound, SDF2
  - Closed under composition
  - But not unambiguous
    - Building all possible trees is inefficient
    - Heuristically selecting one tree may result in the wrong one
    - Requiring explicit disambiguation adds complexity


PEGs Are More Suitable
- Parsing expression grammars (PEGs)
  - Basic theory introduced by Birman [PhD '70]
  - Fully developed by Ford [ICFP '02, POPL '04]
- Closed under composition, intersection, complement
- Ordered choices to avoid ambiguities
- Syntactic predicates to increase expressiveness [Parr '94]
  - Match but do not consume input
- Scannerless to avoid separate lexer
- Implemented by functional recursive descent "packrat parsers"
  - Memoize all results to ensure linear time performance
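To make ordered choice and syntactic predicates concrete, here is a minimal hand-written sketch in Java. It is not Rats! output; the class PegSketch, its method names, and the FAIL convention are assumptions made for illustration. Each method returns the index after the match, or FAIL.

```java
// Hypothetical sketch of two PEG operators as plain recursive descent.
final class PegSketch {
  static final int FAIL = -1;

  // Literal: match text at pos and return the index after it, or FAIL.
  static int literal(String in, int pos, String text) {
    return in.startsWith(text, pos) ? pos + text.length() : FAIL;
  }

  // Ordered choice: ">>=" / ">>" / ">".
  // The first alternative that matches wins, so longer operators come first.
  static int shiftOp(String in, int pos) {
    int r = literal(in, pos, ">>=");
    if (r != FAIL) return r;
    r = literal(in, pos, ">>");
    if (r != FAIL) return r;
    return literal(in, pos, ">");
  }

  // Not-followed-by predicate: "if" !Letter.
  // The predicate inspects the next character but never consumes it.
  static int keywordIf(String in, int pos) {
    int r = literal(in, pos, "if");
    if (r == FAIL) return FAIL;
    boolean letterFollows = r < in.length() && Character.isLetter(in.charAt(r));
    return letterFollows ? FAIL : r;
  }
}
```

Because the choice is ordered, ">>=" can never be misread as ">>" followed by "=", which is how a PEG avoids ambiguity without a separate lexer.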

My Work
- Rats!, a packrat parser generator for Java
  - Makes PEGs practical for imperative languages: concise syntax, aggressive optimizations
  - Provides expressive module system
  - Supports global state through lightweight transactions
[Figure: modules are resolved into a grammar, from which the parser is produced; labels: Resolve, Deduce values & optimize, Create code.]

Talk Outline
- Introduction
- Module system
- Parser implementation and optimizations
- Experimental evaluation
- Conclusions

Productions
- Basic format

    Attribute* Type Nonterminal = Expression ;

- Operators: EBNF-like notation
  - Literals; sequences; greedy choices ('/'), repetitions, options
- Syntactic predicates
  - Followed-by ('&'), not-followed-by ('!')
- Support for semantic values
  - Actions ("{ }"), bindings ("id:e"), predicates ("&{ }")
  - Not necessary when returning null, strings, generic tree nodes, or when passing value through

Modules
- Provide encapsulation
  - Group related productions
  - Track dependencies

    module xtc.util.symbol;

    import xtc.util.spacing;

    String Symbol = SymbolCharacters Spacing ;

    transient String SymbolCharacters =
        <GreaterGreaterEqual> ">>="
      / <LessLessEqual>       "<<="
      / <GreaterGreater>      ">>"
      / <LessLess>            "<<"
      /* and so on... */ ;


Visibility Control
- Productions declare visibility through attribute
  - "public" = top-level production, visible to outside
  - "protected" = inter-module production, default
  - "private" = intra-module, helper production
- For ambiguous nonterminals
  - Give precedence to productions defined in same module
  - If all productions in other modules, require qualified name
    - E.g., "xtc.util.spacing.spacing" instead of "Spacing"


Module Parameters
- Facilitate flexible composition
  - Represent module names, are replaced on instantiation
  - Delay provision of actual names until parser generation time

    module xtc.util.symbol(Spacing);

    import Spacing;

    String Symbol = SymbolCharacters Spacing ;

    transient String SymbolCharacters =
        <GreaterGreaterEqual> ">>="
      / <LessLessEqual>       "<<="
      / <GreaterGreater>      ">>"
      / <LessLess>            "<<"
      /* and so on... */ ;

Module Resolution
- Breadth-first search across dependency declarations
  - Includes explicit instantiations of parameterized modules

    instantiate xtc.lang.cconstant(xtc.lang.cspacing);
    instantiate xtc.lang.cspacing(xtc.lang.cstate, xtc.lang.cconstant);

  - Supports circular dependencies
- Best practice: Parameterize all modules, instantiate at top
[Figure: breadth-first search from module C to CConstant and CSpacing.]
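The resolution step can be pictured as an ordinary breadth-first traversal with a visited set, which is what makes circular dependencies harmless. The sketch below is an assumption about the general shape, not the Rats! implementation; ModuleResolution and resolve are hypothetical names, and the dependency map stands in for the import and instantiate declarations.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: breadth-first module resolution over a dependency map.
final class ModuleResolution {
  // Returns modules in the order they are first reached from the top-level module.
  static List<String> resolve(String top, Map<String, List<String>> deps) {
    List<String> order = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    seen.add(top);
    queue.add(top);
    while (!queue.isEmpty()) {
      String module = queue.remove();
      order.add(module);
      for (String dep : deps.getOrDefault(module, Collections.<String>emptyList())) {
        // A module already seen is skipped, so cycles terminate.
        if (seen.add(dep)) queue.add(dep);
      }
    }
    return order;
  }
}
```

With C depending on CConstant and CSpacing, and those two depending on each other, each module is still visited exactly once.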

Module Modifications
- Concisely express how modules differ from one another
  - Can add, override, or remove individual alternatives
  - Can also override entire productions, incl. attributes
- Result in new modules, combining deltas and bases

    module xtc.lang.javasymbol(Symbol);

    modify Symbol;

    String SymbolCharacters += <TripleGreaterEqual> ">>>=" / <GreaterGreaterEqual>... ;

    String SymbolCharacters += <TripleGreater> ">>>" / <GreaterGreater>... ;

Putting It All Together
- Three modules
  - xtc.util.symbol: symbols common to C and Java
  - xtc.lang.javasymbol: symbols unique to Java
  - xtc.lang.csymbol: symbols unique to C
- Considerable flexibility
  - JavaSymbol modifies Symbol -> all of Java's symbols
  - CSymbol modifies Symbol -> all of C's symbols
  - JavaSymbol modifies CSymbol modifies Symbol -> all of Java's and C's symbols

Parser Implementation
- Method for each production
- Character array for input
- Column array for memo table, with field per production
  - Null implies not tried before
- SemanticValue represents successful parse
  - <actual value, index of next column, possible parse error>
- ParseError represents failed parse
  - <error message, index of error>
- Result provides common interface to values and errors
[Figure: the input "Rats!"; as characters at indices 0 to 7, each with a column in the parser; a column has fields such as ExpressionStatement and StringLiteral, each referring to a Value.]
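The layout described above might look roughly like the following Java sketch. The class names Result, SemanticValue, ParseError, and Column come from the slide; everything else (MemoParser, pNumber, the field names, and the omission of the "possible parse error" slot in SemanticValue) is an assumption made to keep the example short.

```java
// Illustrative sketch: class names follow the slide, bodies are assumptions.
abstract class Result {
  final int index;  // next column for a value, error position for an error
  Result(int index) { this.index = index; }
  abstract boolean hasValue();
}

final class SemanticValue extends Result {
  final Object value;
  SemanticValue(Object value, int next) { super(next); this.value = value; }
  @Override boolean hasValue() { return true; }
}

final class ParseError extends Result {
  final String message;
  ParseError(String message, int index) { super(index); this.message = message; }
  @Override boolean hasValue() { return false; }
}

// One memo field per production; null means "not tried at this position".
final class Column {
  Result number;
}

final class MemoParser {
  final char[] input;
  final Column[] columns;
  int parses = 0;  // counts non-memoized parses, for demonstration only

  MemoParser(String text) {
    input = text.toCharArray();
    columns = new Column[input.length + 1];
  }

  // Memoizing method for a hypothetical production Number = [0-9]+.
  Result pNumber(int start) {
    if (columns[start] == null) columns[start] = new Column();
    Column c = columns[start];
    if (c.number == null) c.number = parseNumber(start);
    return c.number;
  }

  private Result parseNumber(int start) {
    parses++;
    int i = start;
    while (i < input.length && input[i] >= '0' && input[i] <= '9') i++;
    if (i == start) return new ParseError("number expected", start);
    return new SemanticValue(new String(input, start, i - start), i);
  }
}
```

Calling pNumber twice at the same position returns the same Result object and performs the underlying parse only once; failures are memoized in the same field as successes.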

Avoid Empty Fields [Ford '02]
- Insight: Most productions not tried for input position
  - Field remains null
- Break column object into chunks
  - Allocate only when needed
  - I.e., when one of chunk's productions is tried
[Figure: same input and columns as before; each column now refers to chunks, which hold fields such as ExpressionStatement and StringLiteral.]
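A minimal sketch of chunking, assuming a hypothetical grouping of four productions into two chunks (the grouping and all names below are illustrative, not taken from Rats!):

```java
// Hypothetical chunks: each groups the memo fields of a few productions.
final class ExprChunk {
  Object expressionStatement;
  Object stringLiteral;
}

final class TokenChunk {
  Object identifier;
  Object spacing;
}

// A column holds only chunk references; a chunk is allocated on first use.
final class ChunkedColumn {
  ExprChunk expr;    // null until one of its productions is tried here
  TokenChunk token;  // null until one of its productions is tried here

  ExprChunk exprChunk() {
    if (expr == null) expr = new ExprChunk();
    return expr;
  }

  TokenChunk tokenChunk() {
    if (token == null) token = new TokenChunk();
    return token;
  }
}
```

A column where only StringLiteral was tried pays for two references and one small chunk instead of one field per production in the grammar.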

Avoid Memoization
- Insight: Most productions are tried 0 or 1 times per input position
  - Token-level: Most helper productions, spacing
  - Hierarchical syntax: Look at tokens before references
    - If different, production can be tried at most once
- Give grammar writers control over memoization
  - "transient" attribute disables memoization
- Doubly effective
  - Eliminates rows & columns from memo table
  - Facilitates further optimizations
    - Preserve repetitions in transient productions as iterations
    - Turn direct left-recursions into equivalent right-iterations
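As an illustration of the last point, a directly left-recursive production such as Sub = Sub "-" Digit / Digit can be parsed as Digit ("-" Digit)* with a loop that folds to the left. The sketch below is hand-written under that assumption; LeftRecursionSketch and its grammar are hypothetical and not generated by Rats!.

```java
// Hypothetical sketch: left recursion rewritten as iteration.
final class LeftRecursionSketch {
  // Parses chains of single-digit subtractions such as "9-3-2".
  static int parseSub(String in) {
    int pos = 0;
    int value = digit(in, pos);
    pos++;
    // Iterate over ("-" Digit)*, folding left so that 9-3-2 means (9-3)-2.
    while (pos < in.length() && in.charAt(pos) == '-') {
      value -= digit(in, pos + 1);
      pos += 2;
    }
    if (pos != in.length()) {
      throw new IllegalArgumentException("unexpected input at " + pos);
    }
    return value;
  }

  // Digit = [0-9]
  static int digit(String in, int pos) {
    if (pos >= in.length() || in.charAt(pos) < '0' || in.charAt(pos) > '9') {
      throw new IllegalArgumentException("digit expected at " + pos);
    }
    return in.charAt(pos) - '0';
  }
}
```

The loop keeps the left-associative meaning of the original production while needing neither recursion nor a memo table entry per iteration.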

Improve Terminal Recognition
- Insight: Many alternatives in token-level productions start with different characters
  - Inline sole nonterminals (if productions are transient)
  - Combine common prefixes
  - Use switch statements for disjoint alternatives
- Also: Avoid dynamic instantiation of matched text
  - Use string if text can be statically determined
  - Use null if text is never used (i.e., never bound)
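For the four shift symbols used in the earlier module examples, these optimizations could produce code shaped like the following. This is a hand-written approximation, not actual generator output; SymbolSwitch, symbol, and at are hypothetical names.

```java
// Hypothetical sketch: switch on the first character, shared prefixes merged.
final class SymbolSwitch {
  static boolean at(String in, int pos, char c) {
    return pos < in.length() && in.charAt(pos) == c;
  }

  // Recognizes ">>=", ">>", "<<=", "<<" at pos; returns null on failure.
  // The result is a string constant, so no substring is created at run time.
  static String symbol(String in, int pos) {
    if (pos >= in.length()) return null;
    switch (in.charAt(pos)) {
      case '>':
        if (!at(in, pos + 1, '>')) return null;      // common prefix ">>"
        return at(in, pos + 2, '=') ? ">>=" : ">>";
      case '<':
        if (!at(in, pos + 1, '<')) return null;      // common prefix "<<"
        return at(in, pos + 2, '=') ? "<<=" : "<<";
      default:
        return null;
    }
  }
}
```

Each input character is inspected at most once per symbol, instead of retrying every alternative from the start.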

Suppress Unnecessary Results
- Insight: Many productions pass the value through
  - Example: 17 levels of expressions for C or Java, all of which must be invoked to parse a literal, identifier, ...
  - Only create new SemanticValue if contents differ
  - Otherwise, reuse passed-through value
- Insight: Most alternatives fail on first expression
  - Example: Statement production for C or Java
  - Only create new ParseError if subsequent expressions or entire production fail
  - Meanwhile, use generic error object
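The first insight can be sketched with a two-level expression grammar: when the outer production finds no operator, it returns the inner production's result object unchanged instead of wrapping it. PassThrough, Value, and the allocation counter below are assumptions for illustration only, and the error-object optimization is not shown.

```java
// Hypothetical sketch: reuse the passed-through result instead of copying it.
final class PassThrough {
  static int allocations = 0;  // counts Value objects, for demonstration only

  static final class Value {
    final int value;
    final int next;
    Value(int value, int next) {
      this.value = value;
      this.next = next;
      allocations++;
    }
  }

  // Primary = [0-9]
  static Value primary(String in, int pos) {
    if (pos >= in.length()) return null;
    char c = in.charAt(pos);
    return (c >= '0' && c <= '9') ? new Value(c - '0', pos + 1) : null;
  }

  // Additive = Primary "+" Additive / Primary
  static Value additive(String in, int pos) {
    Value left = primary(in, pos);
    if (left == null) return null;
    if (left.next < in.length() && in.charAt(left.next) == '+') {
      Value right = additive(in, left.next + 1);
      // Contents differ from the operand, so a new value is needed.
      if (right != null) return new Value(left.value + right.value, right.next);
    }
    return left;  // pass-through: same contents, so reuse the object
  }
}
```

With 17 expression levels, parsing a lone literal this way allocates one result rather than one per level.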


Experimental Evaluation
- Syntactic specification
  - Are grammars concise? Yes, see paper!
  - Are grammars extensible?
- Parser performance
  - What is optimizations' impact?
  - Are parsers fast enough?

Experimental Setup
- Five contestants, using Java 1.4 grammars

    Generator   Formalism   Language   Version
    Rats!       PEG         Java       1.8.0
    SDF2        GLR         C          2.3.3
    Elkhound    LALR/GLR    C++        2005.08.22b
    ANTLR       LL          Java       2.7.5
    JavaCC      LL          Java       4.0

- 41 source files: 1-135 KB, varying programming styles
- 1 judge: Apple iMac from fall '02
- Reporting least-squares-fit

Grammars Are Easily Extensible
- C4 (CrossCutting C Compiler)
  - Aspect-enhanced C to simplify Linux kernel extensions
  - 17 hours (4 learning, 13 writing & debugging)
  - 4 modules with 150 LoC + 1 Java class with 130 LoC
- Jeannie
  - Combination of Java and C to simplify JNI programming
  - Just write: jobject result = `obj.method( );
  - 45 hours (25 learning, 20 writing & debugging)
  - 4 modules with 230 LoC


Optimizations Are Effective
[Figure: throughput (KB/s) and heap utilization (X:1) for each optimization, from None, Chunks, Grammar, and Terminals through GNodes; annotated with the values 2.44, 3.64, 4.9, and 3.5.]


Parsers Perform Well
[Figure: throughput (KB/s, axis from 0 to 1,200) of recognizers and AST-building parsers for Rats!, SDF2, Elkhound, ANTLR, and JavaCC; annotated with the values 2.7 and 1.9.]

Conclusions
- Made PEGs practical for imperative languages
  - Concise syntax, global state, aggressive optimizations
- Built an expressive module system
  - Modules to encapsulate related productions
  - Modifications to concisely specify extensions and subsets
  - Parameters to enable flexible composition and reuse
- To good overall effect
  - Others can realize real-world extensions in little time, code
  - Parsers perform reasonably well

http://cs.nyu.edu/rgrimm/xtc/
Thank you: Martin Bravenboer, Marc Fiuczynski, Bryan Ford, Ben Goldberg, Laune Harris, Martin Hirzel, Trevor Jim, Scott McPeak, Marco Yuen