An ANTLR Grammar for Esterel

Similar documents
Abstract Syntax Trees

Implementing Actions

Syntax and Parsing COMS W4115. Prof. Stephen A. Edwards Fall 2003 Columbia University Department of Computer Science

Grammars and Parsing, second week

Compilers. Compiler Construction Tutorial The Front-end

Syntax and Parsing COMS W4115. Prof. Stephen A. Edwards Fall 2004 Columbia University Department of Computer Science

ASTs, Objective CAML, and Ocamlyacc

CS 4240: Compilers and Interpreters Project Phase 1: Scanner and Parser Due Date: October 4 th 2015 (11:59 pm) (via T-square)

Project 1: Scheme Pretty-Printer

Exercise ANTLRv4. Patryk Kiepas. March 25, 2017

Defining Program Syntax. Chapter Two Modern Programming Languages, 2nd ed. 1

CD Assignment I. 1. Explain the various phases of the compiler with a simple example.

Lecture 05 I/O statements Printf, Scanf Simple statements, Compound statements

JME Language Reference Manual

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CS664 Compiler Theory and Design LIU 1 of 16 ANTLR. Christopher League* 17 February Figure 1: ANTLR plugin installer

Introduction to Lexical Analysis

Compiler Construction D7011E

Typescript on LLVM Language Reference Manual

MP 3 A Lexer for MiniJava

Syntactic Analysis. The Big Picture Again. Grammar. ICS312 Machine-Level and Systems Programming

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

CSCI312 Principles of Programming Languages!

ML 4 A Lexer for OCaml s Type System

A Simple Syntax-Directed Translator

Perdix Language Reference Manual

DINO. Language Reference Manual. Author: Manu Jain

Lexical Analysis. Introduction

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Principles of Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

Project 2 Interpreter for Snail. 2 The Snail Programming Language

IPCoreL. Phillip Duane Douglas, Jr. 11/3/2010

CSE302: Compiler Design

Chapter 3: CONTEXT-FREE GRAMMARS AND PARSING Part2 3.3 Parse Trees and Abstract Syntax Trees

Syntax Analysis MIF08. Laure Gonnord

Peace cannot be kept by force; it can only be achieved by understanding. Albert Einstein

MATVEC: MATRIX-VECTOR COMPUTATION FINAL REPORT. John C. Murphy jcm2105 Programming Languages and Translators Professor Stephen Edwards

CSC 467 Lecture 3: Regular Expressions

Optimizing Finite Automata

MP 3 A Lexer for MiniJava

ASML Language Reference Manual

Lexical Analysis. Lecture 3. January 10, 2018

MATVEC: MATRIX-VECTOR COMPUTATION LANGUAGE REFERENCE MANUAL. John C. Murphy jcm2105 Programming Languages and Translators Professor Stephen Edwards

Lexing, Parsing. Laure Gonnord sept Master 1, ENS de Lyon

Ordering Within Expressions. Control Flow. Side-effects. Side-effects. Order of Evaluation. Misbehaving Floating-Point Numbers.

Control Flow COMS W4115. Prof. Stephen A. Edwards Fall 2006 Columbia University Department of Computer Science

DEMO A Language for Practice Implementation Comp 506, Spring 2018

Part 5 Program Analysis Principles and Techniques

MOHAWK Language Reference Manual. COMS W4115 Fall 2005

Compiler Construction LECTURE # 3

COP4020 Programming Assignment 2 - Fall 2016

Some Basic Definitions. Some Basic Definitions. Some Basic Definitions. Language Processing Systems. Syntax Analysis (Parsing) Prof.

Comp 411 Principles of Programming Languages Lecture 3 Parsing. Corky Cartwright January 11, 2019

PLT 4115 LRM: JaTesté

Building lexical and syntactic analyzers. Chapter 3. Syntactic sugar causes cancer of the semicolon. A. Perlis. Chomsky Hierarchy

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Syntax. Syntax. We will study three levels of syntax Lexical Defines the rules for tokens: literals, identifiers, etc.

RYERSON POLYTECHNIC UNIVERSITY DEPARTMENT OF MATH, PHYSICS, AND COMPUTER SCIENCE CPS 710 FINAL EXAM FALL 96 INSTRUCTIONS

Contents. Jairo Pava COMS W4115 June 28, 2013 LEARN: Language Reference Manual

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Parser Tools: lex and yacc-style Parsing

int foo() { x += 5; return x; } int a = foo() + x + foo(); Side-effects GCC sets a=25. int x = 0; int foo() { x += 5; return x; }

Syntax Intro and Overview. Syntax

SFU CMPT 379 Compilers Spring 2018 Milestone 1. Milestone due Friday, January 26, by 11:59 pm.

1 Lexical Considerations

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

Related Course Objec6ves

Monday, August 26, 13. Scanners

Wednesday, September 3, 14. Scanners

COMP 181 Compilers. Administrative. Last time. Prelude. Compilation strategy. Translation strategy. Lecture 2 Overview

A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer.

Funk Programming Language Reference Manual

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

GBIL: Generic Binary Instrumentation Language. Language Reference Manual. By: Andrew Calvano. COMS W4115 Fall 2015 CVN

Programming Assignment II

CS /534 Compiler Construction University of Massachusetts Lowell. NOTHING: A Language for Practice Implementation

CSE 401 Midterm Exam Sample Solution 2/11/15

COMPILER CONSTRUCTION LAB 2 THE SYMBOL TABLE. Tutorial 2 LABS. PHASES OF A COMPILER Source Program. Lab 2 Symbol table

Building a Parser Part III

Java Bytecode (binary file)

Mirage. Language Reference Manual. Image drawn using Mirage 1.1. Columbia University COMS W4115 Programming Languages and Translators Fall 2006

Programming Languages & Translators. XML Document Manipulation Language (XDML) Final Report

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

LL(k) Compiler Construction. Choice points in EBNF grammar. Left recursive grammar

A programming language requires two major definitions A simple one pass compiler

Language Reference Manual

CPSC 434 Lecture 3, Page 1

B The SLLGEN Parsing System

LL(k) Compiler Construction. Top-down Parsing. LL(1) parsing engine. LL engine ID, $ S 0 E 1 T 2 3

Programming Languages & Translators. XML Document Manipulation Language (XDML) Language Reference Manual

Compiling Regular Expressions COMP360

Introduction to Lexical Analysis

Introduction to Compiler Design

Topic 3: Syntax Analysis I

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Chapter 3: CONTEXT-FREE GRAMMARS AND PARSING Part 1

Building Compilers with Phoenix

Transcription:

An ANTLR Grammar for Esterel COMS W4115 Prof. Stephen A. Edwards Fall 2005 Columbia University Department of Computer Science

ANTLR Esterel.g class EsterelParser extends Parser; file : expr EOF!; class EsterelLexer extends Lexer; ID : LETTER (LETTER DIGIT)* ; EsterelParser.java public class EsterelParser extends antlr.llkparser implements EsterelParserTokenTypes {} EsterelLexer.java public class EsterelLexer extends antlr.charscanner implements EsterelParserTokenTypes, TokenStream {}

ANTLR Lexer Specifications Look like class MyLexer extends Lexer; options { option = value } Token1 : char char ; Token2 : char char ; Token3 : char ( char )? ; Tries to match all non-protected tokens at once.

ANTLR Parser Specifications Look like class MyParser extends Parser; options { option = value } rule1 : Token1 Token2 Token3 rule2 ; rule2 : (Token1 Token2)* ; rule3 : rule1 ; Looks at the next k tokens when deciding which option to consider next.

An ANTLR grammar for Esterel Esterel: Language out of France. Programs look like module ABRO: input A, B, R; output O; loop [ await A await B ]; emit O each R end module

The Esterel LRM Lexical aspects are classical: Identifiers are sequences of letters, digits, and the underline character, starting with a letter. Integers are as in any language, e.g., 123, and floating-point numerical constants are as in C++ and Java; the values 12.3,.123E2, and 1.23E1 are constants of type double, while 12.3f,.123E2f, and 1.23E1f are constants of type float. Strings are written between double quotes, e.g., "a string", with doubled double quotes as in "a "" double quote".

The Esterel LRM Keywords are reserved and cannot be used as identifiers. Many constructs are bracketed, like present... end present. For such constructs, repeating the initial keyword is optional; one can also write present... end. Simple comments start with % and end at end-of-line. Multiple-line comments start with %{ and end with }%.

A Lexer for Esterel Operators from the langauge reference manual:. # + - / * < >, = ; : := ( ) [ ]??? <= >= <> => Main observation: none longer than two characters. Need k = 2 to disambiguate, e.g.,? and??. class EsterelLexer extends Lexer; options { k = 2; }

A Lexer for Esterel Next, I wrote a rule for each punctuation character: PERIOD :. ; POUND : # ; PLUS : + ; DASH : - ; SLASH : / ; STAR : * ; PARALLEL : " " ;

A Lexer for Esterel Identifiers are standard: ID : ( a.. z A.. Z ) ( a.. z A.. Z _ 0.. 9 )* ;

A Lexer for Esterel String constants must be contained on a single line and may contain double quotes, e.g., "This is a constant with ""double quotes""" ANTLR makes this easy: annotating characters with! discards them from the token text: StringConstant : "! ( ( " \n ) ( "! " ) )* "! ;

A Lexer for Esterel I got in trouble with the operator, which inverts a character class. Invert with respect to what? Needed to change options: options { k = 2; charvocabulary = \3.. \377 ; exportvocab = Esterel; }

A Lexer for Esterel Another problem: ANTLR scanners check each recognized token s text against keywords by default. A string such as "abort" would scan as a keyword! options { k = 2; charvocabulary = \3.. \377 ; exportvocab = Esterel; testliterals = false; } ID options { testliterals = true; } : ( a.. z A.. Z ) /*... */ ;

Numbers Defined From the LRM: Integers are as in any language, e.g., 123, and floating-point numerical constants are as in C++ and Java; the values 12.3,.123E2, and 1.23E1 are constants of type double, while 12.3f,.123E2f, and 1.23E1f are constants of type float.

Numbers With k = 2, for each rule ANTLR generates a set of characters that can appear first and a set that can appear second. But it doesn t consider the possible combinations. I split numbers into Number and FractionalNumber to avoid this problem: If the two rules were combined, the lookahead set for Number would include a period (e.g., from.1 ) followed by end-of-token e.g., from 1 by itself). Example numbers:.1$.2 1$ First Second. EOT 1. 2 1

Number Rules Number : ( 0.. 9 )+ (. ( 0.. 9 )* (Exponent)? ( ( f F ) { $settype(floatconst); } /* empty */ { $settype(doubleconst); ) /* empty */ { $settype(integer); } ) ;

Number Rules Continued FractionalNumber :. ( 0.. 9 )+ (Exponent)? ( ( f F ) { $settype(floatconst); } /* empty */ { $settype(doubleconst); ) ; protected Exponent : ( e E ) ( + - )? ( 0.. 9 )+ ;

Comments From the LRM: Simple comments start with % and end at end-of-line. Multiple-line comments start with %{ and end with }%.

Comments Comment : % ( ( { ) => { ( // Prevent.* from eating the whole file options {greedy=false;}: ( ( \r \n ) => \r \n { newline(); } \r { newline(); } \n { newline(); } ( \n \r ) ) )* "}%" (( \n ))* \n { newline(); } ) { $settype(token.skip); } ;

A Parser for Esterel Esterel s syntax started out using ; as a separator and later allowed it to be a terminator. The language reference manual doesn t agree with what the compiler accepts.

Grammar from the LRM NonParallel: AtomicStatement Sequence Sequence: SequenceWithoutTerminator ; opt SequenceWithoutTerminator: AtomicStatement ; AtomicStatement SequenceWithoutTerminator ; AtomicStatement AtomicStatement: nothing pause...

Grammar from the LRM But in fact, the compiler accepts module TestSemicolon1: nothing; end module module TestSemicolon2: nothing; nothing; end module module TestSemicolon3: nothing; nothing end module Rule seems to be one or more statements separated by semicolons except for the last, which is optional.

Grammar for Statement Sequences Obvious solution: sequence : atomicstatement (SEMICOLON atomicstatement)* (SEMICOLON)? ; warning: nondeterminism upon k==1:semicolon between alt 1 and exit branch of block Which option do you take when there s a semicolon?

Nondeterminism sequence : atomicstatement (SEMICOLON atomicstatement)* (SEMICOLON)? ; Is equivalent to sequence : atomicstatement seq1 seq2 ; seq1 : SEMICOLON atomicstatement seq1 /* nothing */ ; seq2 : SEMICOLON /* nothing */ ;

Nondeterminism sequence : atomicstatement seq1 seq2 ; seq1 : SEMICOLON atomicstatement seq1 /* nothing */ ; seq2 : SEMICOLON /* nothing */ ; How does it choose an alternative in seq1? First choice: next token is a semicolon. Second choice: next token is one that may follow seq1. But this may also be a semicolon!

Nondeterminsm Solution: tell ANTLR to be greedy and prefer the iteration solution. sequence : atomicstatement ( options { greedy=true; } : SEMICOLON! atomicstatement )* (SEMICOLON!)? ;

Nondeterminism Delays can be A X A immediate A or [A and B]. delay : expr bsigexpr bsigexpr "immediate" bsigexpr ; bsigexpr : ID "[" signalexpression "]" ; expr : ID /*... */ ; Which choice when next token is an ID?

Nondeterminism delay : expr bsigexpr bsigexpr "immediate" bsigexpr ; What do we really want here? If the delay is of the form expr bsigexpr, parse it that way. Otherwise try the others.

Nondeterminism delay : ( (expr bsigexpr) => delaypair bsigexpr "immediate" bsigexpr ) ; delaypair : expr bsigexpr ; The => operator means try to parse this first. If it works, choose this alternative.

Greedy Rules The author of ANTLR writes I have yet to see a case when building a parser grammar where I did not want a subrule to match as much input as possible. However, it is particularly useful in scanners: COMMENT : "/*" (.)* "*/" ; This doesn t work like you d expect...

Turning Off Greedy Rules The right way is to disable greedy: COMMENT : "/*" (options {greedy=false;} :.)* "*/" ; This only works if you have two characters of lookahead: class L extends Lexer; options { k=2; } CMT : "/*" (options {greedy=false;} :.)* "*/" ;

The Dangling Else Problem class MyGram extends Parser; stmt : "if" expr "then" stmt ("else" stmt)? ; Gives ANTLR Parser Generator Version 2.7.1 gram.g:3: warning: nondeterminism upon gram.g:3: k==1:"else" gram.g:3: between alts 1 and 2 of block

Generated Code stmt : "if" expr "then" stmt ("else" stmt)? ; match(literal_if); expr(); match(literal_then); stmt(); if ((LA(1)==LITERAL_else)) { match(literal_else); /* Close binding else */ stmt(); } else if ((LA(1)==LITERAL_else)) { /* go on: else can follow a stmt */ } else { throw new SyntaxError(LT(1)); }

Removing the Warning class MyGram extends Parser; stmt : "if" expr "then" stmt (options {greedy=true;} :"else" stmt)? ;

A Simpler Language class MyGram extends Parser; stmt : "if" expr "then" stmt ("else" stmt)? "fi" ; match(literal_if); expr(); match(literal_then); stmt(); switch (LA(1)) { case LITERAL_else: match(literal_else); stmt(); break; case LITERAL_fi: break; default: throw new SyntaxError(LT(1)); } match(literal_fi);