Validating LR(1) parsers

Similar documents
Validating LR(1) Parsers

Formal verification of program obfuscations

How much is a mechanized proof worth, certification-wise?

A Formally-Verified C static analyzer

Formal verification of a static analyzer based on abstract interpretation

Formal verification of an optimizing compiler

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

Conflicts in LR Parsing and More LR Parsing Types

Lecture 8: Deterministic Bottom-Up Parsing

Lecture 7: Deterministic Bottom-Up Parsing

Bottom-Up Parsing. Lecture 11-12

CMSC 330: Organization of Programming Languages. Context Free Grammars

Outline. Lecture 17: Putting it all together. Example (input program) How to make the computer understand? Example (Output assembly code) Fall 2002

LR Parsing LALR Parser Generators

Compiler Design 1. Bottom-UP Parsing. Goutam Biswas. Lect 6

In One Slide. Outline. LR Parsing. Table Construction

Wednesday, September 9, 15. Parsers

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs:

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised:

CST-402(T): Language Processors

Chapter 3: Lexing and Parsing

Towards efficient, typed LR parsers

Lecture 14: Parser Conflicts, Using Ambiguity, Error Recovery. Last modified: Mon Feb 23 10:05: CS164: Lecture #14 1

Compilers. Bottom-up Parsing. (original slides by Sam

A programming language requires two major definitions A simple one pass compiler

Topic 5: Syntax Analysis III

Compiler Design Overview. Compiler Design 1

Parser. Larissa von Witte. 11. Januar Institut für Softwaretechnik und Programmiersprachen. L. v. Witte 11. Januar /23

Comp 411 Principles of Programming Languages Lecture 3 Parsing. Corky Cartwright January 11, 2019

Bottom-Up Parsing. Lecture 11-12

Working of the Compilers

CS453 : JavaCUP and error recovery. CS453 Shift-reduce Parsing 1

Reachability and error diagnosis in LR(1) parsers

Compiler Construction

Verasco: a Formally Verified C Static Analyzer

Parsers. Xiaokang Qiu Purdue University. August 31, 2018 ECE 468

LR Parsing LALR Parser Generators

Compiler Construction

COP 3402 Systems Software Syntax Analysis (Parser)

A concrete memory model for CompCert

Compiler Construction

Compilers and computer architecture From strings to ASTs (2): context free grammars

LR Parsing. Table Construction

Lexical and Syntax Analysis. Bottom-Up Parsing

Simple LR (SLR) LR(0) Drawbacks LR(1) SLR Parse. LR(1) Start State and Reduce. LR(1) Items 10/3/2012

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

CSCE 314 Programming Languages

Syntax-Directed Translation. Lecture 14

LL(k) Parsing. Predictive Parsers. LL(k) Parser Structure. Sample Parse Table. LL(1) Parsing Algorithm. Push RHS in Reverse Order 10/17/2012

CS 406/534 Compiler Construction Putting It All Together

Parsing Expression Grammars and Packrat Parsing. Aaron Moss

SYED AMMAL ENGINEERING COLLEGE (An ISO 9001:2008 Certified Institution) Dr. E.M. Abdullah Campus, Ramanathapuram

Section A. A grammar that produces more than one parse tree for some sentences is said to be ambiguous.

Faculty of Electrical Engineering, Mathematics, and Computer Science Delft University of Technology

COMPILER (CSE 4120) (Lecture 6: Parsing 4 Bottom-up Parsing )

CS5363 Final Review. cs5363 1

Torben./Egidius Mogensen. Introduction. to Compiler Design. ^ Springer

More Bottom-Up Parsing

Syntax Analysis. Martin Sulzmann. Martin Sulzmann Syntax Analysis 1 / 38

RYERSON POLYTECHNIC UNIVERSITY DEPARTMENT OF MATH, PHYSICS, AND COMPUTER SCIENCE CPS 710 FINAL EXAM FALL 96 INSTRUCTIONS

Appendix Set Notation and Concepts

Syntax. 2.1 Terminology

Using an LALR(1) Parser Generator

G53CMP: Lecture 4. Syntactic Analysis: Parser Generators. Henrik Nilsson. University of Nottingham, UK. G53CMP: Lecture 4 p.1/32

CSCI312 Principles of Programming Languages!

Part III : Parsing. From Regular to Context-Free Grammars. Deriving a Parser from a Context-Free Grammar. Scanners and Parsers.

LR Parsing. Leftmost and Rightmost Derivations. Compiler Design CSE 504. Derivations for id + id: T id = id+id. 1 Shift-Reduce Parsing.

Certified compilers. Do you trust your compiler? Testing is immune to this problem, since it is applied to target code

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F).

Parsing Algorithms. Parsing: continued. Top Down Parsing. Predictive Parser. David Notkin Autumn 2008

Introduction to Parsing Ambiguity and Syntax Errors

Introduction to Lexing and Parsing

Abstract Syntax Trees & Top-Down Parsing

Introduction to Parsing. Lecture 8

CMSC 330: Organization of Programming Languages

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Winter /15/ Hal Perkins & UW CSE C-1

Bottom-Up Parsing II (Different types of Shift-Reduce Conflicts) Lecture 10. Prof. Aiken (Modified by Professor Vijay Ganesh.

Introduction to Parsing Ambiguity and Syntax Errors

Monday, September 13, Parsers

Syntax Analysis. Amitabha Sanyal. ( as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay

Building Compilers with Phoenix

JavaCC Parser. The Compilation Task. Automated? JavaCC Parser

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

S Y N T A X A N A L Y S I S LR

Abstract Syntax Trees & Top-Down Parsing

Abstract Syntax Trees & Top-Down Parsing

Context-free grammars

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis

CSCE 314 Programming Languages. Type System

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Spring UW CSE P 501 Spring 2018 C-1

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

Outline. Limitations of regular languages. Introduction to Parsing. Parser overview. Context-free grammars (CFG s)

CSE P 501 Exam 11/17/05 Sample Solution

Reusable Tools for Formal Modeling of Machine Code

3. Syntax Analysis. Andrea Polini. Formal Languages and Compilers Master in Computer Science University of Camerino

CS606- compiler instruction Solved MCQS From Midterm Papers

Compiler Construction: Parsing

Syntax and Parsing COMS W4115. Prof. Stephen A. Edwards Fall 2003 Columbia University Department of Computer Science

Consider a description of arithmetic. It includes two equations that define the structural types of digit and operator:

Transcription:

Validating LR(1) parsers Jacques-Henri Jourdan François Pottier Xavier Leroy INRIA Paris-Rocquencourt, projet Gallium Journées LTP/MTVV, 2011-10-28

Parsing: recap text or token stream abstract syntax tree 1 + 2 3 + 1 2 3

Parsing: problem solved? After 50 years of computer science: Foundations: Context-Free Grammars, Backus-Naur Form, LL(k), LR(k), Generalized LR, Parsing Expression Grammars,... Libraries: parsing combinators, Packrat,... Parser generators: Yacc, Bison, ANTLR, Menhir, Elkhound,...

The correctness issue How can we make sure that a parser (generated or hand-written) is correct? Application areas where it matters: Formally-verified compilers, code generators, static analyzers. Security-sensitive applications: SQL queries, handling of semi-structured documents (PDF, HTML, XML,... ).

CompCert: the formally verified part side-effects out type elimination CompCert C Clight C#minor of expressions loop simplifications Optimizations: constant prop., CSE, tail calls, (LCM), (Software pipelining) RTL CFG construction expr. decomp. register allocation (IRC) CminorSel stack allocation of & variables instruction Cminor selection (Instruction scheduling) linearization spilling, reloading LTL LTLin Linear of the CFG calling conventions layout of stack frames Asm asm code generation Mach

CompCert: the whole compiler C source lexing, parsing, construction of an AST type-checking, de-sugaring AST C Type reconstruction Graph coloring Code linearization heuristics Verified compiler Executable assembling linking Assembly printing of asm syntax AST Asm Part of the TCB Not part of the TCB Not proved (hand-written in Caml) Proved in Coq (extracted to Caml)

Correct with respect to what? Specification of a parser: a context-free grammar with semantic actions. Terminal symbols a Nonterminal symbols A Symbols X ::= a A Start symbol S Productions A X 1... X n {f } f : T (X 1 ) T (X n ) T (A) is a semantic action T (X ) : Type is the type of semantic values for symbol X. (lovely dependent types!)

Semantics of grammars X w/v (symbol X derives word w producing semantic value v) a (a, v)/v A X 1... X n {f } is a production X i w i /v i for i = 1,..., n A w 1... w n /f (v 1,..., v n )

Correctness of a parser A parser = a function token stream Reject Accept(semantic value, token stream) Soundness: if Parser(W ) = Accept(v, W ), there exists a word w such that W = w.w and S w/v. Non-ambiguity: if Parser(W ) = Accept(v, W ) and and S w/v, then W = w.w and v = v. Completeness: if S w/v then Parser(w.W ) = Accept(v, W ). (Note: completeness + determinism non-ambiguity.)

Verifying a parser, approach 1: a posteriori validation at every parse token stream untrusted parser verified validator parse tree Error OK(semantic value) Validator: trivially checks the parse tree & computes semantic value. Soundness: guaranteed. Nonambiguity: no guarantee. Completeness: no guarantee. : proved correct in Coq : not verified, untrusted

Verifying a parser, approach 2: deductive verification of the parser itself Apply program proof to the parser itself, showing soundness and completeness. Drawbacks: Long and tedious proof, especially if parser is generated as an automaton. Proof to be re-done every time the grammar changes.

Verifying a parser, approach 3: deductive verification of a parser generator (A. Barthwal and M. Norrish, Verified Executable Parsing, ESOP 2009) token stream grammar SLR(1) parser generator LR(1) automaton Pushdown interpreter Reject Accept(v) Barthwal & Norrish proved (in HOL) soundness and completeness for every parser successfully generated by their generator. Limitation: their generator only accepts SLR(1) grammars; the ISO C99 grammar is not SLR(1).

Our approach: verified validation of a parser generator Given a grammar G and an LR(1) automaton A, check that A is sound and complete w.r.t. G. Token stream Grammar Instrumented parser generator OK / error LR(1) automaton Grammar Certificate Validator Pushdown interpreter Semantic value Parser generation time / Compile-compile time Parse time The validator supports all flavors of LR(1) parsing: canonical LR(1), SLR(1), LALR(1), Pager s method,...

Refresher: LR automata A stack machine with 4 kinds of actions: accept, reject, shift (push the next token), and reduce (by a production) + goto another state.

Interpreting LR(1) automata in Coq Module Parser(G: Grammar) (A: Automaton). Inductive parse_result := Accept (v: G.semantic_type G.start_symbol) (rem: Stream token) Reject Internal_Error Timeout. Definition parse (input: Stream token) (fuel: nat) : parse_result :=... Note fuel parameter to guarantee termination (we can have infinite sequences of reduce actions). Note Internal_Error result caused by e.g. popping from an empty stack.

Soundness Theorem (Soundness) If parse W N = Accept v W, there exists a word w such that W = w.w and S w/v. Note that this theorem holds unconditionally for all automata: the parse function performs some dynamic checks and fails with Internal_Error in all cases where soundness would be compromised. Easy Coq proof (200 lines) using an invariant relating the current stack of the automaton with the word read so far.

Safety Theorem (Safety) If safety validator G A = true, then parse W N Internal error for every input stream W and fuel N. safety_validator (200 Coq lines) decides a number of properties (next slide) with the help of annotations produced by the parser generator. Proof of the theorem: 500 Coq lines.

The safety validator 1 For every transition, labeled X, of a state σ to a new state σ, pastsymbols(σ ) is a suffix of pastsymbols(σ)incoming(σ), paststates(σ ) is a suffix of paststates(σ){σ}. 2 For every state σ that has an action of the form reduce A α {f }, α is a suffix of pastsymbols(σ)incoming(σ), If paststates(σ){σ} is Σ n... Σ 0 and if the length of α is k, then for every state σ Σ k, the goto table is defined at (σ, A). (If k is greater than n, take Σ k to be the set of all states.) 3 For every state σ that has an accept action, σ init, incoming(σ) = S, paststates(σ) = {init}.

Weak completeness Theorem (Weak completeness) If completeness validator G A = true and if S w/v, then for every remaining input W and fuel N, parse (w.w ) N {Accept(v, W ), Timeout, Internal error} The proof amounts to showing that the automaton performs a depth-first traversal of the parse tree S w/v. completeness_validator (next slide): 200 Coq lines. Proof: 700 Coq lines.

The completeness validator 1 For every state σ, the set items(σ) is closed, that is, the following implication holds: A α 1 A α 2 [a] items(σ) A α {f } is a production a first(α 2 a) A α [a ] items(σ) 2 For every state σ, if A α [a] items(σ), where A S, then the action table maps (σ, a) to reduce A α {f }. 3 For every state σ, if A α 1 aα 2 [a ] items(σ), then the action table maps (σ, a) to shift σ, for some state σ such that: A α 1 a α 2 [a ] items(σ )

The completeness validator 1 For every state σ, if A α 1 A α 2 [a ] items(σ), then the goto table either is undefined at (σ, A ) or maps (σ, A ) to some state σ such that: A α 1 A α 2 [a ] items(σ ) 2 For every terminal symbol a, we have S S [a] items(init). 3 For every state σ, if S S [a] items(σ), then σ has a default accept action. 4 first and nullable are fixed points of the standard defining equations.

Towards full correctness Conjecture (Completeness) If completeness validator G A = true and S w/v, then there exists a fuel N 0 such that for all N N 0, parse (w.w ) N {Accept(v, W ), Internal Error}. Seems provable but requires reasoning on the size of the derivation S w/v (structural induction is not enough).

Towards termination Completeness shows termination for valid inputs, but what about invalid inputs? (We have examples of non-termination for automata that pass the safety and completeness validators.) Conjecture (Termination) Assuming some to-be-determined validation conditions hold, for all input streams W there exists a fuel N 0 such that for all N N 0, parse W N Timeout. A proof sketch in Aho and Ullman, but only for canonical LR(1) automata (which have a peculiar early failure property).

Experimental validation: ISO C 1999 Starting point: grammar from Appendix A of ISO C 99 standard. Removed old-style function declarations (unsupported by CompCert). Fixed / worked around several ambiguities (next slides). Grammar with 87 terminals, 72 nonterminals, 263 productions. Modified the Menhir parser generator to produce Coq output + certificates (500 lines of Caml). Pager s LR(1) automaton with 505 states. Plus 4.2 Mbytes of certificates (mostly, item sets).

Experimental validation: ISO C 1999 Running the validators on Menhir s Coq output: Executed within Coq (Eval vm_compute). Reading and type-checking Menhir s output: 32 s. Safety validator: 4 s. Completeness validator: 15 s. Replacing CompCert s unproved parser with our new parser: Parsing: 5 times slower. Total compilation time: +20%

Ambiguities in the C grammar 1- Dangling else if (cond1) if (cond2) x = 1; else x = 2; A classic problem: which if matches the else? ISO standard says the second if, but not reflected in grammar. A simple solution: rewrite the grammar to have two statement nonterminals, one for statements that can be followed by else, the other for statements that cannot.

Ambiguities in the C grammar 2- Type names a * b; In the scope of a typedef... a; declaration, this means Otherwise, this means Declare a variable b of type pointer to a. Compute a times b and throw result away. Must have two different terminals for type names and variable names. The lexer must classify identifiers into type names or variable names taking typedef declarations and block scopes into account.

Classic solution: the lexer hack. Ambiguities in the C grammar 2- Type names Semantic actions of the parser update a symbol table. The lexer consults this table to classify identifiers. Our approach: a pre-parser. The pre-parser keeps track of typedefs that are in scope, and adjusts the stream of tokens accordingly. In our implementation, the pre-parser is a full, non-verified C parser using the lexer hack. Lexer... ident... ident... Pre-parser... typename... varname... Parser

Ambiguities in the C grammar 3- Binding occurrences typedef double a; a a; First a is a type name, second a is a variable name in binding position, subsequent a s are variable names. Here, no other possible interpretation; but...

Ambiguities in the C grammar 3- Binding occurrences typedef double a; int f(int (a)); Could mean either: (in civilized Coq syntax) 1 f : forall (a: int), int 2 f : (a -> int) -> int Original ISO C99 standard leaves this ambiguity open. Technical Corrigendum 2 says interpretation #2 is correct. Again, we rely on the pre-parser for correct classification.

Conclusions Once more, the verified validator approach is a win: Reduced proof effort (2 500 lines versus Barthwal and Norrish s 20 000). Reusable with all known LR(1) constructions (from canonical to Pager s). Can also reuse existing, mature parser generator (e.g. Menhir and its excellent diagnostics).

Possible improvements Prove strong completeness, perhaps termination. Prove that the parser does not read more tokens than necessary. (Important for interactive applications, e.g. toplevel loops.) Speed up the pushdown interpreter by removing dynamic checks. ( much more dependent types?) Take precedences and associativity declarations into account.

Perspectives for CompCert A similar validation approach should work for the lexer as well. (Perhaps using Brozowski s derivatives.) Simplify the pre-parser by restricting typedef to global scope. (Very few C codes use local typedef.) The elaboration passes (between the parser and the input to the first Coq-proved pass) need work.