Introduction to Syntax Analysis Recursive-Descent Parsing

Similar documents
Writing a Lexer. CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Monday, February 6, Glenn G.

Syntax-Directed Translation. Lecture 14

Writing an Interpreter Thoughts on Assignment 6

A Simple Syntax-Directed Translator

Comp 411 Principles of Programming Languages Lecture 3 Parsing. Corky Cartwright January 11, 2019

CS 230 Programming Languages

Thoughts on Assignment 4 Haskell: Flow of Control

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised:

Peace cannot be kept by force; it can only be achieved by understanding. Albert Einstein

10/5/17. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntax Analysis

Scheme: Data. CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Monday, April 3, Glenn G.

LECTURE 7. Lex and Intro to Parsing

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis

JavaCC Parser. The Compilation Task. Automated? JavaCC Parser

CSE 401 Midterm Exam Sample Solution 2/11/15

8 Parsing. Parsing. Top Down Parsing Methods. Parsing complexity. Top down vs. bottom up parsing. Top down vs. bottom up parsing

Haskell: Lists. CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Friday, February 24, Glenn G.

Context-Free Grammar. Concepts Introduced in Chapter 2. Parse Trees. Example Grammar and Derivation

10/4/18. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntactic Analysis

A simple syntax-directed

Lecture 8: Deterministic Bottom-Up Parsing

Time : 1 Hour Max Marks : 30

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

Parsing Algorithms. Parsing: continued. Top Down Parsing. Predictive Parser. David Notkin Autumn 2008

Review of CFGs and Parsing II Bottom-up Parsers. Lecture 5. Review slides 1

Introduction to Parsing. Lecture 8

CSE P 501 Compilers. LR Parsing Hal Perkins Spring UW CSE P 501 Spring 2018 D-1

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F).

COP 3402 Systems Software Syntax Analysis (Parser)

Recursive Descent Parsers

MIT Top-Down Parsing. Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology

Semantic Analysis. Lecture 9. February 7, 2018

Outline. Limitations of regular languages. Introduction to Parsing. Parser overview. Context-free grammars (CFG s)

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Syntax Analysis Part I

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

Building Compilers with Phoenix

Scheme: Expressions & Procedures

Lecture 10 Parsing 10.1

ECE251 Midterm practice questions, Fall 2010

Lecture 7: Deterministic Bottom-Up Parsing

Scheme: Strings Scheme: I/O

Syntax Analysis. Chapter 4

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

CSE 3302 Programming Languages Lecture 2: Syntax

CIT Lecture 5 Context-Free Grammars and Parsing 4/2/2003 1

Chapter 4. Lexical and Syntax Analysis

Lexical and Syntax Analysis

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Abstract Syntax Trees & Top-Down Parsing

2.2 Syntax Definition

COMPILER CONSTRUCTION LAB 2 THE SYMBOL TABLE. Tutorial 2 LABS. PHASES OF A COMPILER Source Program. Lab 2 Symbol table

Semantics of programming languages

Lexical and Syntax Analysis

The Parsing Problem (cont d) Recursive-Descent Parsing. Recursive-Descent Parsing (cont d) ICOM 4036 Programming Languages. The Complexity of Parsing

Introduction to Parsing

1. The output of lexical analyser is a) A set of RE b) Syntax Tree c) Set of Tokens d) String Character

Syntax Analysis. The Big Picture. The Big Picture. COMP 524: Programming Languages Srinivas Krishnan January 25, 2011

CS164: Midterm I. Fall 2003

Semantics of programming languages

Topic 3: Syntax Analysis I

Downloaded from Page 1. LR Parsing

Derivations vs Parses. Example. Parse Tree. Ambiguity. Different Parse Trees. Context Free Grammars 9/18/2012

Chapter 4. Lexical and Syntax Analysis. Topics. Compilation. Language Implementation. Issues in Lexical and Syntax Analysis.

CS 406/534 Compiler Construction Parsing Part I

Compiler Design Concepts. Syntax Analysis

Part 5 Program Analysis Principles and Techniques

CS2112 Fall Assignment 4 Parsing and Fault Injection. Due: March 18, 2014 Overview draft due: March 14, 2014

Section A. A grammar that produces more than one parse tree for some sentences is said to be ambiguous.

Decaf Language Reference

Chapter 3. Describing Syntax and Semantics ISBN

CMSC 330: Organization of Programming Languages

Some Thoughts on Grad School. Undergraduate Compilers Review

UNIVERSITY OF CALIFORNIA

Context-free grammars

Programming Language Syntax and Analysis

LECTURE 3. Compiler Phases

4. Lexical and Syntax Analysis

CSE302: Compiler Design

Lexical and Syntax Analysis. Top-Down Parsing

Revisit the example. Transformed DFA 10/1/16 A B C D E. Start

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Spring UW CSE P 501 Spring 2018 C-1

CS 4120 Introduction to Compilers

A programming language requires two major definitions A simple one pass compiler

Outline. Limitations of regular languages Parser overview Context-free grammars (CFG s) Derivations Syntax-Directed Translation

Parsing II Top-down parsing. Comp 412

Compiler Construction: Parsing

4. Lexical and Syntax Analysis

Abstract Syntax Trees & Top-Down Parsing

Abstract Syntax Trees & Top-Down Parsing

Homework & Announcements

ICOM 4036 Spring 2004

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

CSE 401 Compilers. LR Parsing Hal Perkins Autumn /10/ Hal Perkins & UW CSE D-1

CS1622. Today. A Recursive Descent Parser. Preliminaries. Lecture 9 Parsing (4)

SYED AMMAL ENGINEERING COLLEGE (An ISO 9001:2008 Certified Institution) Dr. E.M. Abdullah Campus, Ramanathapuram

Lecture Notes on Bottom-Up LR Parsing

Parsing III. (Top-down parsing: recursive descent & LL(1) )

Outline. Top Down Parsing. SLL(1) Parsing. Where We Are 1/24/2013

LR Parsing - The Items

Transcription:

Introduction to Syntax Analysis Recursive-Descent Parsing CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Friday, February 10, 2017 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks ggchappell@alaska.edu 2017 Glenn G. Chappell

Review Overview of Lexing & Parsing Parsing Character Stream cout << ff(12.6); Lexer Lexeme Stream cout << ff(12.6); id op id lit op Parser punct op AST or Error expr binop: << Two phases: Lexical analysis (lexing) Syntax analysis (parsing) The output of a parser is often an abstract syntax tree (AST). Specifications of these can vary. expr id: cout id: ff expr funccall expr numlit: 12.6 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 2

Review Introduction to Lexical Analysis There are essentially three ways to write a lexer. Automatically generated, based on regular grammars or regular expressions for each lexeme category. Hand-coded state machine using a table. Entirely hand-coded state machine. We wrote a lexer using the last method. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 3

Review Writing a Lexer Introduction, Design Decisions Our lexer is implemented as Lua module lexer. See lexer.lua. Function lexer.lex provides an iterator. A for-in loop gives each lexeme as a pair: text (string) and category (number). The latter is an index for lexer.catnames. Code to use our lexer: lexer = require "lexer" for lexstr, cat in lexer.lex(program) do catstr = lexer.catnames(cat) io.write(string.format("%-10s %s\n", lexstr, catstr)) end 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 4

Review Writing a Lexer Coding a State Machine Internally, our lexer runs as a state machine. A state machine has a current state, stored in variable state. It proceeds in a series of steps. At each step, it looks at the current character in the input and the current state. It then decides what state to go to next. As we write a state machine, an important question is when do we add a new state? A good guiding principle: Two situations can be handled by the same state if they would react identically to all future input. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 5

Review Writing a Lexer Notes [1/2] With our lexeme specification, it is tricky to handle +. and -.. For example, +.3 is a single lexeme (NumericLiteral), while +.x is three lexemes: (Operator, Operator, Identifier). We handle this using lookahead: peeking ahead in the input to get the information needed to make a decision. This is a common technique at all phases of parsing. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 6

Review Writing a Lexer Notes [2/2] Our lexeme specification results in some curious behavior. Input: k 4 Input: k 4 Output: Output: k Identifier k Identifier - Operator -4 NumericLiteral 4 NumericLiteral One possible way of altering this (not done in lexer.lua): change the longest-lexeme rule. Allow the caller to set a flag during lexing, indicating that the next lexeme, if it begins with + or, should always be interpreted as an Operator. An important point: sometimes the parser may need to guide the lexer. This can affect the design of a lexer. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 7

Introduction to Syntax Analysis Basics [1/2] We have covered lexical analysis. Now we look at the second stage of parsing, syntax analysis. In this stage: We determine whether the input is syntactically correct. If it is correct, then we usually output some representation of its structure: often an abstract syntax tree (AST). If it is not correct, we usually output information about the problem. As we have noted, syntax analysis is also called parsing; this is the narrow sense of the term. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 8

Introduction to Syntax Analysis Basics [2/2] Syntax analysis is usually based on a context-free grammar (CFG). Recall the notion of a derivation: begin with the start symbol, apply productions one by one, ending with a string of terminals. CFG (start symbol: item) item ( item ) item thing thing ID thing % Here, ID is a lexeme category. The actual string might be something like ((abc_39)). Derivation item (item) ((item)) ((thing)) ((ID)) A general method for doing parsing is a parsing algorithm. A typical parsing algorithm is usable with any number of CFGs. When we implement a parsing algorithm to handle a particular CFG, we have a parser: a code module that does parsing. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 9

Introduction to Syntax Analysis Categories of Parsers [1/3] Parsing algorithms can vary a great deal, but they come in two basic kinds: top-down and bottom-up. Every parser goes through the steps to find a derivation based on the CFG being used. (It will generally not output this derivation; it probably will not even store it anywhere, but it will go though the steps.) A top-down parsing algorithm goes through the derivation from top to bottom, beginning with the start symbol, expanding nonterminals as it goes, and ending with the string to be derived (the program?). A bottom-up parsing algorithm goes through the derivation from bottom to top, beginning with the string to be derived (the program?), collapsing substrings to nonterminals as it goes, and ending with the start symbol. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 10

Introduction to Syntax Analysis Categories of Parsers [2/3] Top-down parsers usually expand the leftmost nonterminal first. Thus, they usually produce leftmost derivations. Some top-down parsers are in a category known as LL parsers, the name coming from the fact that they handle their input in a strictly Left-to-right order, and they go through the steps to generate a Leftmost derivation. If a top-down parsing algorithm is not an LL parser, this is typically because it does multiple-lexeme lookahead, and so does not handle its input in a strictly left-to-right order. Top-down parsing code is often hand-coded although top-down parser generators do exist. We will look closely at a top-down parsing algorithm called Recursive Descent. This algorithm can be written to use multiple-lexeme lookahead, but if it does not, then it is in the LL category. An upcoming assignment will involve writing a Recursive-Descent parser. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 11

Introduction to Syntax Analysis Categories of Parsers [3/3] Bottom-up parsers usually collapse the leftmost nonterminal first. But thinking of the derivation from top to bottom, this would mean that the leftmost nonterminal is expanded last; the rightmost nonterminal is expanded first, resulting in a rightmost derivation. Thus, some bottom-up parsing algorithms are in a category known as LR parsers, the name coming from the fact that they handle their input in a strictly Left-to-right order, and they go through the steps to generate a Rightmost derivation. If a bottom-up parsing algorithm is not an LR parser, this is, again, typically because it does multiple-lexeme lookahead. Bottom-up parsing code is almost always automatically generated. We will look closely at a bottom-up parsing algorithm called Shift- Reduce. However, you will not be required to implement a Shift-Reduce parser. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 12

Introduction to Syntax Analysis Categories of Grammars [1/2] As a rule, a fast parsing algorithm is not capable of handling all CFGs. For each category of parsing algorithms (LL, LR, etc.), there is an associated category of grammars that such algorithms can handle. The grammars that LL parsers can handle are called LL grammars. Similarly, LR parsers can handle LR grammars. Curiously, every LL grammar is an LR grammar, while there are LR grammars that are not LL grammars. So LR parsing algorithms are more general. Note, however, that when we write a compiler for a programming language, we only need one grammar, and if it is an LL grammar, then an LL parser works fine. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 13

Introduction to Syntax Analysis Categories of Grammars [2/2] The following diagram shows the relationship between various categories of grammars. All Grammars CFGs LR Grammars LL Grammars 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 14

Recursive-Descent Parsing Introduction Now we look at a parsing algorithm called Recursive Descent. A top-down parsing algorithm. In the LL category as long as multiple-lexeme lookahead is not used. Almost always hand-coded. Has been known for decades. Still in common use. Algorithms you may be familiar with (Binary Search, Merge Sort, etc.) can be written once, and, if the implementation is suitably generic, never written again. But Recursive Descent is not like this. When we write a Recursive-Descent parser, we choose what functions to write based on our grammar. Thus a Recursive-Descent parser is tailored for a specific grammar; a different grammar requires writing a new parser. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 15

Recursive-Descent Parsing How It Works A Recursive-Descent parser has one parsing function for each nonterminal in the grammar. Each parsing function is responsible for parsing all strings that the corresponding nonterminal can be expanded into. So the function corresponding to the start symbol is the one we call to parse the entire input (program?). The code for a parsing function is essentially a translation into code of the right-hand side of the production for the nonterminal. Other nonterminals in the right-hand side become function calls and so the parsing functions are mutually recursive. A terminal in the right-hand side becomes a check that the input string contains the proper lexeme. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 16

Recursive-Descent Parsing Example #1: Simple [1/5] Let s write a Recursive-Descent parser based on the following CFG. Grammar 1 item ( item ) item thing thing ID thing % The start symbol is item. ID represents the Identifier category from our lexer. Our parser will be written in Lua. It will take input from our lexer (module lexer). 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 17

Recursive-Descent Parsing Example #1: Simple [2/5] Grammar 1 item ( item ) item thing thing ID thing % Recall that a Recursive-Descent parser has one parsing function for each nonterminal. So it is appropriate to begin by combining productions with a common left-hand side. Grammar 1a item ( item ) thing thing ID % 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 18

Recursive-Descent Parsing Example #1: Simple [3/5] Grammar 1a item ( item ) thing thing ID % Next we turn each production into code for a function. It is convenient to name each parsing function after the nonterminal it handles. So the parsing function for item will be parse_item. And the parsing function for thing will be parse_thing. A parser typically outputs an AST or an error message. For now, our parser will simply return true or false, depending on whether the input is syntactically correct. (We will eventually generate an AST.) 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 19

Recursive-Descent Parsing Example #1: Simple [4/5] I have written a framework for a Recursive-Descent parser that uses our lexer. This is a Lua module that exports a function parse. See rdparser1.lua. In a parsing function (e.g., parse_item or parse_thing), the current lexeme & category are in variables lexstr & lexcat, respectively. When the function starts, a lexeme is already in these variables. To move to the next lexeme, call advance. There are two convenience functions: matchcat & matchstring. Pass the former a lexeme category (e.g., lexer.id). If the current lexeme is in this category, then it calls advance and returns true; otherwise, it returns false. Function matchstring is similar, except that it takes a string to match (e.g., "%"). I have also written a simple program that uses the parser. See userdparser1.lua. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 20

Recursive-Descent Parsing Example #1: Simple [5/5] Grammar 1a item ( item ) thing thing ID % TO DO Write a Recursive-Descent parser based on Grammar 1a. Done. See rdparser1.cpp. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 21

Recursive-Descent Parsing Handling Incorrect Input [1/4] What output does our parser give for each of the following inputs? xyz 123 % ((abc_39)) (((((%))))) (a,b,c) (((x)) ((x))) a,b,c Are these outputs what we want them to be? For all but the last two, the output is what we expect. But the parser says the last two are correct. Obviously, they are not. I claim, however, that this is not a bug. Read on 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 22

Recursive-Descent Parsing Handling Incorrect Input [2/4] Our parser says the following are both syntactically correct: ((x))) a,b,c Why? Remember: Recursive Descent. Function parse_item is called to parse the entire input. It is also called, recursively, to parse an item between parentheses. When the latter invocation of the function sees ) following a correct parse, it must simply return, assuming that the ) is handled by its caller. So parse_item sees the first string above as a syntactically correct string ((x)) followed by extra characters: ). The second string is similar: correct a followed by extra,b,c. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 23

Recursive-Descent Parsing Handling Incorrect Input [3/4] Our parsing functions are acting correctly. But the parser is not giving us the information we need. What can we do about this? One common solution is to add another lexeme category: end of input. Standard notation for this: $. Then add a new start symbol, and augment the grammar with one more production, of the form newstartsymbol oldstartsymbol $. The following would be our new grammar, with start symbol all. Grammar 1b all item $ item ( item ) thing thing ID % This is only an example. I will not be using this idea in our parser. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 24

Recursive-Descent Parsing Handling Incorrect Input [4/4] Another solution is to add an extra check at the end of parsing, to see whether all lexemes have been read. If we do this, then the grammar is unchanged, and the parsing functions are the same. A correct parse of the entire input then requires two conditions: The parsing function for the start symbol indicates a correct parse. All lexemes have been read. The above solution works better with the interactive environment that we will eventually write. So I will be using it in all of our Recursive-Descent parsers. TO DO Modify rdparser1.lua so that it implements the above idea. Done. See rdparser1.cpp. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 25

Recursive-Descent Parsing TO BE CONTINUED Recursive-Descent Parsing will be continued next time. 10 Feb 2017 CS F331 / CSCE A331 Spring 2017 26