Chapter 2 - Programming Language Syntax. September 20, 2017


Specifying Syntax: Regular expressions and context-free grammars Regular expressions are formed by the use of three mechanisms: concatenation, alternation, and repetition (aka Kleene closure)

Specifying Syntax: Regular expressions and context-free grammars A context-free grammar adds one more mechanism: recursion

Tokens and regular expressions Tokens are the basic lexical units of a programming language. In C, for instance, while and return are tokens, as are identifier names such as printf, as well as operators and parentheses. (N.B. - parentheses are bizarrely called punctuators in the C family of languages, which is neologistic at best, since previously punctuator merely referred to a person doing punctuation.)

Tokens and regular expressions Tokens are specified by regular expressions

Tokens and regular expressions Consider the regular set for numeric expressions given in the text on page 44:

number   → integer | real
integer  → digit digit*
real     → integer exponent | decimal ( exponent | ɛ )
decimal  → digit* ( . digit | digit . ) digit*
exponent → ( e | E ) ( + | - | ɛ ) integer
digit    → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
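Since these rules contain no recursion, they can be transcribed directly into a practical regular expression. Here is a sketch in Python; the names DIGIT, INTEGER, and so on are ours, introduced only to mirror the rules above, not anything from the text.

```python
import re

# Sketch of the "number" regular set as a Python regex.
# Each named piece mirrors one rule of the regular set.
DIGIT    = r"[0-9]"
INTEGER  = rf"{DIGIT}+"                              # digit digit*
EXPONENT = rf"[eE][+-]?{INTEGER}"                    # (e|E)(+|-|eps) integer
DECIMAL  = rf"(?:{DIGIT}*\.{DIGIT}+|{DIGIT}+\.{DIGIT}*)"
REAL     = rf"(?:{INTEGER}{EXPONENT}|{DECIMAL}(?:{EXPONENT})?)"
NUMBER   = re.compile(rf"^(?:{REAL}|{INTEGER})$")

for s in ["13", "3.14", "6e23", ".5", "2.", "e5"]:
    print(s, bool(NUMBER.match(s)))
```

Note that DECIMAL requires a digit adjacent to the decimal point on at least one side, exactly as the rule `digit* ( . digit | digit . ) digit*` does, so a bare "." is rejected.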

Tokens and regular expressions Notice that there is no recursion in the previous set of rules. Were there even indirect recursion, this would not describe a regular set but instead a context-free grammar.

Tokens and regular expressions You can consider the previous set of rules as a generator for the initial token number; starting with the first rule, you can expand each non-terminal in turn. For instance:

number ⇒ integer ⇒ digit digit* ⇒ 1 digit* ⇒ 1 3 digit* ⇒ 1 3

Character sets and formatting issues Some languages ignore case (most Basics, for instance); many impose rules about the case of keywords / reserved words / built-ins; others use case to imply semantic content (Prolog and Go, for instance) Some languages support more than just ASCII

Character sets and formatting issues Most languages are free format, with whitespace only providing separation, not semantics However, line breaks are given some significance in some languages, such as Python and Haskell

Character sets and formatting issues There are even modern languages like Python and Haskell that do care about indentation to one degree or another. Indeed, there is something of a move to using schemes such as anything starting in column 1 is a comment (e.g., see Bird Style Haskell).

Character sets and formatting issues In contrast, there were languages that had quite rigid rules about line formatting (older style Fortran and RPG come to mind here.)

Other applications of regular expressions Of course, regular expressions are widely used in scripting languages. (Perl's regular expressions in particular have had wide acceptance in other languages, and are generally available in the Linux environment via libpcre.) What Perl calls a regular expression is actually more than a regular expression in the formal sense; indeed, for some time the Perl developers had planned to rename them to just rules in Perl 6, but backed off of that decision.

Other applications of regular expressions One of the most popular extensions to regular expression syntax is the ability to capture portions of the match while matching: "a4b6" =~ /a([0-9])b([0-9])/

Other applications of regular expressions For instance, the previous Perl regular expression, when applied against the string a4b6, would set the variable $1 to 4 and the variable $2 to 6:

$ perl -e '"a4b6" =~ /a([0-9])b([0-9])/; print $1 . "\n"; print $2 . "\n";'
4
6
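The same capture-group mechanism is available in most languages that adopted Perl-style regular expressions. A sketch of the same example in Python, where groups are likewise numbered left to right by opening parenthesis:

```python
import re

# Capture groups, Python flavor: group(1) and group(2) correspond
# to Perl's $1 and $2 for the same pattern.
m = re.match(r"a([0-9])b([0-9])", "a4b6")
print(m.group(1))  # 4
print(m.group(2))  # 6
```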

Context-free grammars in Backus-Naur Form (BNF) While regular expressions are ideal for identifying individual tokens, adding recursion to the mix means that we can now recognize context-free grammars. Context-free grammars have the advantage of allowing us to specify more flexible structure than mere regular expressions. Each rule is called a production. The symbols that appear on the left-hand side of a production rule are called non-terminals.

Context-free grammars in Backus-Naur Form (BNF) Since we will use Lemon for generating parsers, it is a particularly good idea to note the discussion on page 47 about the equivalent meaning of the single production

op → + | - | * | /

Context-free grammars in Backus-Naur Form (BNF) and the set

op → +
op → -
op → *
op → /

Derivations and parse trees Using the grammar

expr → id | number | expr op expr | ( expr )
op   → + | - | * | /

we can derive

expr ⇒ expr op expr ⇒ expr op id ⇒ expr + id ⇒ expr op expr + id ⇒ expr op id + id ⇒ expr * id + id ⇒ id * id + id

Derivations and parse trees Or this can be written in summary fashion as expr ⇒* id * id + id

Derivations and parse trees (parse tree diagram for the derivation above)

Derivations and parse trees This grammar is ambiguous, but on page 50 there is an equivalent grammar given that is not:

expr   → term | expr addop term
term   → factor | term multop factor
factor → id | number | - factor | ( expr )
addop  → + | -
multop → * | /

Derivations and parse trees What's the difference in the two previous grammars? Why is one ambiguous and the other not?

Scanning Usually, scanning is the fast, easy part of syntax analysis; it is usually linear or very close to it; importantly, it removes the detritus of comments and preserves semantically significant strings such as identifiers. (Note the weasel word usually!) Examples of scanner generators are lex, flex, re2c, and Ragel.

Scanning However, since scanning is generally quite simple, it is often done ad hoc with custom code (for instance, see figure 2.5 in the text, or here.)

Scanning and automata The most general type of automaton is a non-deterministic finite automaton (NFA, sometimes NDFA) based on the 5-tuple (S, Σ, Move, S0, Final), where S is the set of all states in the NFA, Σ is the set of input symbols, Move is the transition function that maps a state/symbol pair to sets of states, S0 is the initial state, and Final is the set of accepting/final states. (See section 3.6 of the Dragon Book.)

Scanning and automata An NFA is entirely general since it allows both: From a given state, an NFA allows multiple exits by a single input From a given state, an NFA allows an epsilon (null) transition to another state

Scanning and automata A deterministic finite automaton (DFA) is a type of NFA where From a given state, a DFA allows only one exit by a single input From a given state, a DFA does not allow an epsilon (null) transition to another state NFAs and DFAs are equivalent in expressive power, and either can always be transformed into the other.

Generating a finite automaton Scanner generators allow us to automatically build a finite automaton. This is done by writing a specification for, say, flex or re2c. Then, typically, the scanner generator will first create an NFA from our specification, and then rewrite that to a DFA. It's likely that the generator will then attempt to reduce the number of states in the DFA.
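To make the end product concrete, here is a minimal hand-coded sketch in Python of the kind of table-driven DFA a generator might emit for integer literals (digit digit*); the state names and the shape of the transition function are our own invention, not actual flex or re2c output.

```python
# A hand-built DFA for integer literals: digit digit*
# States and names are illustrative only.
START, IN_INT = 0, 1
ACCEPTING = {IN_INT}

def move(state, ch):
    # Transition function: Move(state, symbol) -> state, or None (reject).
    if ch.isdigit():
        return IN_INT if state in (START, IN_INT) else None
    return None

def accepts(s):
    # Run the DFA over the whole input; accept iff we end in a final state.
    state = START
    for ch in s:
        state = move(state, ch)
        if state is None:
            return False
    return state in ACCEPTING

print(accepts("123"), accepts(""), accepts("12a"))  # True False False
```

In a real scanner the same machine would instead run until it can no longer advance, returning the longest accepted prefix as the token.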

Parsing A parser is a language recognizer. We generally break parsers up into two CFG families, LL and LR. Parsing has generally been the arena for context-free grammars, but other techniques have recently gained quite a bit of attention. Packrat parsing and PEGs have recently become an area of interest since the methodology seems to offer some benefits over CFGs. Surprisingly, the 4th edition does not cover any ground on this current topic.

Parsing LL parsers are top-down parsers, and usually written by hand (good old recursive-descent parsing!) LR parsers are bottom-up parsers, and are usually created with a parser generator. (Also see SLR parsing, as referenced in your book on page 70.)

Parsing LL stands for Left-to-right, Left-most derivation LR stands for Left-to-right, Right-most derivation

Parsing Consider

idlist     → id idlisttail
idlisttail → , id idlisttail
idlisttail → ;

This grammar can be parsed either way, top-down or bottom-up

Parsing A, B, C; with previous grammar, top-down step 1

Parsing A, B, C; with previous grammar, top-down step 2

Parsing A, B, C; with previous grammar, top-down step 3

Parsing A, B, C; with previous grammar, top-down step 4
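The top-down steps above can be sketched as a tiny recursive-descent recognizer for the idlist grammar. This is our own illustrative Python sketch; tokenization is simplified to a pre-made list of token kinds.

```python
# Recursive-descent recognizer for:
#   idlist     -> id idlisttail
#   idlisttail -> , id idlisttail | ;
def parse_idlist(tokens):
    pos = 0

    def match(expected):
        # Consume one token if it is the expected kind, else fail.
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1
        else:
            raise SyntaxError(f"expected {expected!r} at position {pos}")

    def idlist():            # idlist -> id idlisttail
        match("id"); idlisttail()

    def idlisttail():        # idlisttail -> , id idlisttail | ;
        if tokens[pos:pos + 1] == [","]:
            match(","); match("id"); idlisttail()
        else:
            match(";")

    idlist()
    return pos == len(tokens)

# "A, B, C;" tokenizes to the kinds below:
print(parse_idlist(["id", ",", "id", ",", "id", ";"]))  # True
```

Note how the choice between the two idlisttail productions is made by peeking at one token of lookahead, which is exactly what makes this grammar parsable top-down.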

Bottom-up parsing We reverse the search: we again start at the left, but we are now trying to match from the right The first match is when idlisttail matches ;

Bottom-up parsing The next rule that matches is then:

Bottom-up parsing The next rule that matches is then:

Bottom-up parsing The next rule that matches is then:

A better bottom-up grammar for lists As illustrated in figure 2.14 in the text, the following grammar saves a lot of initial shifting for LR parsing, though at the cost of not being usable via recursive descent:

idlist       → idlistprefix ;
idlistprefix → idlistprefix , id
idlistprefix → id

Step 1: Parsing A, B, C bottom-up

Step 2: Parsing A, B, C bottom-up

Step 3: Parsing A, B, C bottom-up

Step 4: Parsing A, B, C bottom-up

Recursive descent for the win As your book mentions on page 72, GCC is now using a hand-written recursive descent parser rather than bison. This is also true for clang.

Figure 2.15: Example LL grammar for simple calculator language

program  → stmtlist EOD
stmtlist → stmt stmtlist | ɛ
stmt     → id := expr | read id | write expr
expr     → term termtail
termtail → addop term termtail | ɛ

Figure 2.15 continued: Example LL grammar for simple calculator language

term       → factor factortail
factortail → multop factor factortail | ɛ
factor     → ( expr ) | id | number
addop      → + | -
multop     → * | /

Example 2.24 Sum and average program

read A
read B
sum := A + B
write sum
write sum / 2

Figure 2.17 Recursive Descent Parse of example 2.24

Figure 2.17 Lemon Parse of example 2.24

Recursive Descent Parser for example 2.24 in C parser.c

Lemon Parser for example 2.24 grammar.y

Recursive Descent (minimal) For each non-terminal, we write one function of the same name Each production rule is transformed: each non-terminal is turned into a call of that non-terminal's function, and each terminal is turned into a call to match If a production has alternatives, these alternatives must be distinguished by using the predict set.
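As a concrete illustration of this recipe, here is a Python sketch of the expression portion of the figure 2.15 grammar, one function per non-terminal. The ad hoc tokenizer, helper names, and error handling are our own simplifications; this is not the parser.c from the text.

```python
import re

def tokenize(src):
    # Crude lexer for the sketch: numbers, identifiers, and operators.
    return re.findall(r"[0-9]+|[A-Za-z_]\w*|[-+*/()]", src)

def parse_expr(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def match(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, saw {peek()!r}")
        pos += 1

    def expr():        # expr -> term termtail
        term(); termtail()

    def termtail():    # termtail -> addop term termtail | epsilon
        if peek() in ("+", "-"):
            match(peek()); term(); termtail()

    def term():        # term -> factor factortail
        factor(); factortail()

    def factortail():  # factortail -> multop factor factortail | epsilon
        if peek() in ("*", "/"):
            match(peek()); factor(); factortail()

    def factor():      # factor -> ( expr ) | id | number
        if peek() == "(":
            match("("); expr(); match(")")
        elif peek() is not None and peek() not in "+-*/()":
            match(peek())   # id or number
        else:
            raise SyntaxError(f"unexpected {peek()!r}")

    expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")

parse_expr(tokenize("sum + 2 * (A / B)"))  # accepts silently
```

The epsilon alternatives in termtail and factortail become the "do nothing" fall-through when the lookahead token is not in the predict set, which is the whole trick of LL(1) recursive descent.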

Recursive Descent (fully general) While figure 2.21 shows that for small examples like the grammar of example 2.24 it is easy to see the first and follow sets, it quickly grows more challenging as grammars get bigger. Your text on pp. 79-82 gives definitions and algorithms for computing predict, first, and follow sets.

Disambiguation rules As your text says, LL cannot parse everything. The famous Pascal if-then[-else] ambiguity problem is mentioned:

stmt       → if condition thenclause elseclause | otherstmt
thenclause → then stmt
elseclause → else stmt | ɛ

Disambiguation rules What's the solution? The same sort of idea as PEGs and Lemon use: give precedence in order of appearance (i.e., first applicable wins.)

Bottom-up parsing Shift and reduce are the fundamental actions in bottom-up parsing Page 87:... a top-down parser's stack contains a list of what the parser expects to see in the future; a bottom-up parser's stack contains a record of what the parser has already seen in the past.

Bottom-up parsing

Stack contents                        Remaining input
----------------------------------    ---------------
(nil)                                 A, B, C;
id(A)                                 , B, C;
id(A) ,                               B, C;
id(A) , id(B)                         , C;
id(A) , id(B) ,                       C;
id(A) , id(B) , id(C)                 ;
id(A) , id(B) , id(C) ;               (nil)
id(A) , id(B) , id(C) id_list_tail    (nil)
id(A) , id(B) id_list_tail            (nil)
id(A) id_list_tail                    (nil)
id_list                               (nil)

The last four lines, the reduction ones, correspond to the derivation

idlist ⇒ id idlisttail ⇒ id , id idlisttail ⇒ id , id , id idlisttail ⇒ id , id , id ;