Parsing

A parser is an algorithm that determines whether a given input string is in a language and, as a side effect, usually produces a parse tree for the input. There is a procedure for generating a parser from a given context-free grammar.

Recursive-Descent Parsing

Recursive-descent parsing is one of the simplest parsing techniques used in practice. Recursive-descent parsers are top-down parsers, since they construct the parse tree top down (rather than bottom up).

The basic idea of recursive-descent parsing is to associate each non-terminal with a procedure. The goal of each such procedure is to read a sequence of input characters that can be generated by the corresponding non-terminal, and return a pointer to the root of the parse tree for the non-terminal. The structure of the procedure is dictated by the productions for the corresponding non-terminal.

The procedure attempts to "match" the right hand side of some production for a non-terminal.

To match a terminal symbol, the procedure compares the terminal symbol to the input; if they agree, then the procedure is successful, and it consumes the terminal symbol in the input (that is, moves the input cursor over one symbol).

To match a non-terminal symbol, the procedure simply calls the corresponding procedure for that non-terminal symbol (which may be a recursive call, hence the name of the technique).

Recursive-Descent Parser for Expressions

Consider the following grammar for expressions (we'll look at the reasons for the peculiar structure of this grammar later):

    1. <E>  --> <T> <E*>
    2. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
    3. <T>  --> <F> <T*>
    4. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
    5. <F>  --> ( <E> ) | number

We create a procedure for each of the non-terminals. According to production 1, the procedure to match expressions (<E>) must match a term (by calling the procedure for <T>), and then the rest of the expression (by calling the procedure for <E*>).

    procedure E;
        T;
        Estar;

Some procedures, such as the one for <E*>, must examine the input to determine which production to choose.

    procedure Estar;
        if NextInputChar = "+" or NextInputChar = "-" then
            read(NextInputChar);
            T;
            Estar;

We will append a special marker symbol (ENDM) to the input string; this marker symbol notifies the parser that the entire input has been seen. The parser must also recognize the end marker after seeing a complete expression. Because E is also called recursively from F (for a parenthesized subexpression), this check goes in a separate top-level procedure, S, which calls E, rather than in E itself.

Top-Down Parser for Expressions

    procedure S;
        E;
        if NextInputChar = ENDM then /* done */
        else print("syntax error")

    procedure E;
        T;
        Estar;

    procedure Estar;
        if NextInputChar = "+" or NextInputChar = "-" then
            read(NextInputChar);
            T;
            Estar;

    procedure T;
        F;
        Tstar;

    procedure Tstar;
        if NextInputChar = "*" or NextInputChar = "/" then
            read(NextInputChar);
            F;
            Tstar;

    procedure F;
        if NextInputChar = "(" then
            read(NextInputChar);
            E;
            if NextInputChar = ")" then read(NextInputChar)
            else print("syntax error")
        else if NextInputChar = number then read(NextInputChar)
        else print("syntax error")

Tracing the Parser

As an example, consider the following input: 1 + (2 * 3) / 4. We begin parsing from the start symbol; the trace picks up at the call to E, and the final step matches the end marker. Indentation shows the nesting of procedure calls.

    NextInputChar = "1"
    Call E
        Call T
            Call F                          /* Match 1 with F */
                NextInputChar = "+"
            Call Tstar                      /* Match epsilon */
        Call Estar                          /* Match + */
            NextInputChar = "("
            Call T
                Call F                      /* Match (, looking for E ) */
                    NextInputChar = "2"
                    Call E
                        Call T
                            Call F          /* Match 2 with F */
                                NextInputChar = "*"
                            Call Tstar      /* Match * */
                                NextInputChar = "3"
                                Call F      /* Match 3 with F */
                                    NextInputChar = ")"
                                Call Tstar  /* Match epsilon */
                        Call Estar          /* Match epsilon */
                    NextInputChar = "/"     /* Match ")" */
                Call Tstar                  /* Match "/" */
                    NextInputChar = "4"
                    Call F                  /* Match 4 with F */
                        NextInputChar = ENDM
                    Call Tstar              /* Match epsilon */
            Call Estar                      /* Match epsilon */
    /* Match ENDM */

Observations about Recursive-Descent Parser

In procedures Estar and Tstar, we match one of the productions with an arithmetic operator if we see such an operator in the input; otherwise we simply return. A procedure that returns without matching any symbols is, in effect, choosing the epsilon production.

In our expression parser, we choose the epsilon production only if NextInputChar doesn't match the first terminal on the right hand side of any other production for that non-terminal.
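The procedures traced above translate almost line for line into a runnable program. The following is a minimal Python sketch of such a recognizer, not a definitive implementation. The names `Recognizer`, `tokenize`, `accepts` and `ParseError` are illustrative assumptions, as is the choice to raise an exception where the pseudocode prints "syntax error". The end-marker check is made once at the top level, since E is also called recursively from F.

```python
import re

ENDM = "ENDM"  # end marker appended to every input (name taken from the text)


class ParseError(Exception):
    """Raised where the pseudocode would print "syntax error" (an assumption)."""


def tokenize(text):
    """Digit runs become the token "number"; any other non-blank character is its own token."""
    pairs = re.findall(r"(\d+)|(\S)", text)
    return ["number" if num else sym for num, sym in pairs] + [ENDM]


class Recognizer:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def next_input(self):
        return self.tokens[self.pos]

    def read(self):
        """Consume the current token; the end marker itself is never consumed."""
        if self.next_input != ENDM:
            self.pos += 1

    def parse(self):
        # Top level: match an expression, then require the end marker.
        self.E()
        if self.next_input != ENDM:
            raise ParseError(f"unexpected {self.next_input!r}")

    def E(self):
        self.T()
        self.Estar()

    def Estar(self):
        if self.next_input in ("+", "-"):
            self.read()
            self.T()
            self.Estar()
        # Otherwise return without consuming input: the epsilon production.

    def T(self):
        self.F()
        self.Tstar()

    def Tstar(self):
        if self.next_input in ("*", "/"):
            self.read()
            self.F()
            self.Tstar()
        # Otherwise return without consuming input: the epsilon production.

    def F(self):
        if self.next_input == "(":
            self.read()
            self.E()
            if self.next_input != ")":
                raise ParseError(f"expected ')', found {self.next_input!r}")
            self.read()
        elif self.next_input == "number":
            self.read()
        else:
            raise ParseError(f"unexpected {self.next_input!r}")


def accepts(text):
    """Return True if text is in the language of the expression grammar."""
    try:
        Recognizer(text).parse()
        return True
    except ParseError:
        return False
```

With these definitions, `accepts("1 + (2 * 3) / 4")` returns `True`, while `accepts("1 +")` and `accepts("(1")` return `False`.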

We never attempt to read beyond the end marker (ENDM), which is matched only at the end of an expression. In all other circumstances, the presence of the end marker signals a syntax error.

As written, our recursive-descent parser only determines whether or not the input string is in the language of the grammar; it does not give the structure of the string according to the grammar. We could easily build a parse tree incrementally during parsing.

Lookahead in Recursive-Descent Parsing

In order to implement a recursive-descent parser for a grammar, for each non-terminal in the grammar, it must be possible to determine which production to apply for that non-terminal by looking only at the current input symbol. (We want to avoid having the compiler or other text processing program scan ahead in the input to determine what action to take next.)

The lookahead symbol is simply the next terminal that we will try to match in the input. We use a single lookahead symbol to decide what production to match.

Consider a production: A --> X1...Xm. We need to know the set of possible lookahead symbols that indicate this production is to be chosen. This set consists of terminal symbols that can be produced by the symbols X1...Xm (which may be either terminals or non-terminals). Since a lookahead is only a single terminal symbol, we want the first (i.e., leftmost) symbol that could be produced by X1...Xm. We denote the set of symbols that could be produced first by X1...Xm as First(X1...Xm).

First Sets

To distinguish two productions with the same non-terminal on the left hand side, we examine the First sets for their corresponding right hand sides. Given the production A --> X1...Xm we must determine First(X1...Xm).

We first consider the leftmost symbol, X1. If this is a terminal symbol, then First(X1...Xm) = {X1}. If X1 is a non-terminal, then we compute the First sets for each right hand side of a production for X1 and take their union. In our expression grammar above:

    First(<E>) = First(<T> <E*>)
    First(<T> <E*>) = First(<T>)
    First(<T>) = First(<F> <T*>)
    First(<F> <T*>) = First(<F>) = {(, number}

If X1 can generate epsilon, then X1 can (in effect) be erased, and First(X1...Xm) depends on X2. If X2 is a terminal, it is included in First(X1...Xm). If X2 is a non-terminal, we compute the First sets for each of its corresponding right hand sides. Similarly, if both X1 and X2 can produce epsilon, we consider X3, then X4, etc.

Follow Sets

Suppose we are attempting to compute the lookahead symbols that suggest the production A --> X1...Xm. What if each of the Xi can produce epsilon? If the entire right hand side of a production can produce epsilon, then the lookahead for A is determined by those terminal symbols that can follow A in a parse. We denote the set of terminal symbols that can follow a non-terminal A in a parse as Follow(A).

We inspect the grammar for all occurrences of the non-terminal A. In each production, A is either:

    - followed by a terminal symbol x, so x is in Follow(A);
    - followed by a non-terminal symbol B, so Follow(A) includes First(B); or
    - at the end of a production for some non-terminal S (as in S --> Y1...YmA), in which case Follow(A) includes Follow(S).

First and Follow Sets for Expression Grammar

Computing the First and Follow sets for our expression grammar (as augmented with a new start symbol that includes the ENDM in the production):

    1. <S>  --> <E> ENDM
    2. <E>  --> <T> <E*>
    3. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
    4. <T>  --> <F> <T*>
    5. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
    6. <F>  --> ( <E> ) | number

    First(<E>) = First(<T> <E*>) = First(<T>)
    First(<E*>) = {+} U {-} U Follow(<E*>)
    Follow(<E*>) = Follow(<E>) = {), ENDM}
    First(<E*>) = {+, -, ), ENDM}
    First(<T>) = First(<F> <T*>) = First(<F>)
    First(<T*>) = {*} U {/} U Follow(<T*>)
    Follow(<T*>) = Follow(<T>) = First(<E*>)
    First(<T*>) = {*, /, +, -, ), ENDM}
    First(<F>) = {(, number}

LL(1) Grammars for Recursive-Descent Parsing

The set of lookahead symbols that will cause the selection (i.e., prediction) of the production A --> X1...Xm is

    Predict(A --> X1...Xm) = First(X1...Xm) U (if X1...Xm can produce epsilon then Follow(A) else the empty set)

That is, any symbol that can be the first symbol produced by the right hand side of a production will predict that production. Further, if the entire right hand side can produce epsilon, then symbols that can immediately follow the left hand side of a production will also predict that production.

If, for two productions

    1. A --> X1...Xm
    2. A --> Y1...Yn

we have some symbol s for which

    1. s is in Predict(A --> X1...Xm)
    2. s is in Predict(A --> Y1...Yn)

then we cannot in general know which production to select by looking at a single input symbol.

Recursive-descent parsing with one symbol of lookahead can only parse those CFGs that have disjoint predict sets for productions that share a common left hand side. CFGs that obey this restriction are called LL(1).
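The Predict sets and the LL(1) test described above can be computed mechanically by iterating to a fixed point. The Python sketch below is one way to do that, under stated assumptions: the function names `analyze` and `is_ll1` and the grammar encoding (a list of `(lhs, rhs)` pairs with an empty tuple for epsilon) are illustrative. Unlike the hand computation above, which folds Follow(<E*>) into First(<E*>), this sketch uses the common textbook convention of tracking "can produce epsilon" separately from First; the resulting Follow and Predict sets are the same.

```python
GRAMMAR = [
    ("S", ("E", "ENDM")),
    ("E", ("T", "E*")),
    ("E*", ("+", "T", "E*")),
    ("E*", ("-", "T", "E*")),
    ("E*", ()),                      # epsilon
    ("T", ("F", "T*")),
    ("T*", ("*", "F", "T*")),
    ("T*", ("/", "F", "T*")),
    ("T*", ()),                      # epsilon
    ("F", ("(", "E", ")")),
    ("F", ("number",)),
]


def analyze(grammar):
    """Return (first, follow, predict); predict[i] belongs to grammar[i]."""
    nonterminals = {lhs for lhs, _ in grammar}
    nullable = set()
    first = {a: set() for a in nonterminals}
    follow = {a: set() for a in nonterminals}

    def first_of(symbols):
        """First set of a symbol string, and whether it can produce epsilon."""
        result = set()
        for sym in symbols:
            if sym not in nonterminals:      # terminal: it starts the string
                result.add(sym)
                return result, False
            result |= first[sym]
            if sym not in nullable:
                return result, False
        return result, True                  # every symbol can be erased

    changed = True
    while changed:                           # repeat until no set grows
        changed = False
        for lhs, rhs in grammar:
            f, eps = first_of(rhs)
            if eps and lhs not in nullable:
                nullable.add(lhs)
                changed = True
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
            for i, sym in enumerate(rhs):
                if sym in nonterminals:
                    f, eps = first_of(rhs[i + 1:])
                    new = f | (follow[lhs] if eps else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True

    predict = []
    for lhs, rhs in grammar:
        f, eps = first_of(rhs)
        predict.append(f | (follow[lhs] if eps else set()))
    return first, follow, predict


def is_ll1(grammar, predict):
    """True if productions sharing a left hand side have disjoint Predict sets."""
    seen = {}
    for (lhs, _), lookaheads in zip(grammar, predict):
        if seen.setdefault(lhs, set()) & lookaheads:
            return False
        seen[lhs] |= lookaheads
    return True
```

For the augmented expression grammar, `analyze(GRAMMAR)` gives Follow(E*) = {), ENDM} and Follow(T) = {+, -, ), ENDM}, matching the sets computed by hand, and `is_ll1` returns `True`.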

From experience we know that it is usually possible to create an LL(1) CFG for a programming language. However, not all CFGs are LL(1), and a CFG that is not LL(1) may be parsable using some other (usually more complex) parsing technique.

Creating LL(1) Grammars

Recursive-descent parsing can only parse grammars that have disjoint predict sets for productions that share a common left hand side. Two common properties of grammars that violate this condition are:

    - Left recursion: any grammar containing productions with left recursion, that is, productions of the form A --> A X1...Xm, cannot be LL(1). The problem is that any symbol that predicts this production the first time will, of necessity, continue to predict this production forever (and never be matched).
    - Common prefix: any grammar containing two productions for the same non-terminal that share a common prefix on the right hand side cannot be LL(1). The problem is that any symbol that predicts the first production must also predict the second; since the predict sets for the two productions are not disjoint, the grammar is not LL(1).

Creating an LL(1) Grammar

Consider the following grammar for expressions:

    1. <E> --> <E> + <T>
    2. <E> --> <E> - <T>
    3. <E> --> <T>
    4. <T> --> <T> * <F>
    5. <T> --> <T> / <F>
    6. <T> --> <F>
    7. <F> --> ( <E> )
    8. <F> --> number

This grammar has left recursion, and therefore cannot be LL(1). We can replace the use of left recursion with right recursion as follows:

    1. <E> --> <T> + <E>
    2. <E> --> <T> - <E>
    3. <E> --> <T>
    4. <T> --> <F> * <T>
    5. <T> --> <F> / <T>
    6. <T> --> <F>
    7. <F> --> ( <E> )
    8. <F> --> number

The resulting grammar is still not LL(1); productions 1-3 share a common prefix, as do productions 4-6. We can eliminate the common prefix by deferring the decision as to which production to pick until after seeing the common prefix. This technique is called factoring the common prefix.

    1. <E>  --> <T> <E*>
    2. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
    3. <T>  --> <F> <T*>
    4. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
    5. <F>  --> ( <E> ) | number

Table-Driven Parsing

In recursive-descent parsing, the decision as to which production to choose for a particular non-terminal is hard-coded into the procedure for the non-terminal. The procedure uses the Predict sets (computed from the First and Follow sets) for the grammar to decide which production to choose based on the lookahead symbol.

The problem with recursive-descent parsing is that it is inflexible; changes in the grammar can cause significant (and in some cases non-obvious) changes to the parser.

Since recursive-descent parsing uses an implicit stack of procedure calls, it is possible to replace the parsing procedures and implicit stack with an explicit stack and a single parsing procedure that manipulates the stack. In this scheme, we encode the actions the parsing procedure should take in a table. This table can be generated automatically (with the grammar as input), which is why this approach adapts more easily to changes in the grammar.

A Table-Driven Parser

The parse table encodes the choice of production as a function of the current non-terminal of interest and the lookahead symbol.

    T: Non-terminals x Terminals -> Productions U {Error}

The entry T[A,x] gives the production number to choose when A is the non-terminal of interest and x is the current input symbol. The table is a mapping from non-terminals x terminals to productions.

    T[A,x] = A --> X1...Xm   if x is in Predict(A --> X1...Xm)
    T[A,x] = Error           otherwise

The driver procedure is very simple. It stacks symbols that are to be matched or expanded. Terminal symbols on the stack must match an input symbol; non-terminal symbols are expanded via the Predict function (which is encoded in the parse table).

Parse Table for Expressions

Here is an LL(1) expression grammar, augmented to include the end marker:

     1. <S>  --> <E> ENDM
     2. <E>  --> <T> <E*>
     3. <E*> --> + <T> <E*>
     4. <E*> --> - <T> <E*>
     5. <E*> --> epsilon
     6. <T>  --> <F> <T*>
     7. <T*> --> * <F> <T*>
     8. <T*> --> / <F> <T*>
     9. <T*> --> epsilon
    10. <F>  --> ( <E> )
    11. <F>  --> number

The table for this expression grammar is (where a blank entry corresponds to an error):

            (    )    +    -    *    /    number  ENDM
    S       1                             1
    E       2                             2
    E*           5    3    4                      5
    T       6                             6
    T*           9    9    9    7    8            9
    F       10                            11

This table is constructed from the Predict sets described earlier.
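Filling in the table from the Predict sets is a short loop: every lookahead symbol in Predict(A --> X1...Xm) gets that production's number in row A. The Python sketch below illustrates this, assuming the Predict sets are those that can be read off the rows of the table above. The names `PREDICT` and `build_table` and the dictionary representation keyed by `(non-terminal, terminal)` are assumptions made for the example; a missing key plays the role of a blank (error) entry.

```python
# (production number, left hand side, Predict set), one entry per production
# of the augmented grammar; the sets are read off the parse table above.
PREDICT = [
    (1, "S", {"(", "number"}),
    (2, "E", {"(", "number"}),
    (3, "E*", {"+"}),
    (4, "E*", {"-"}),
    (5, "E*", {")", "ENDM"}),
    (6, "T", {"(", "number"}),
    (7, "T*", {"*"}),
    (8, "T*", {"/"}),
    (9, "T*", {"+", "-", ")", "ENDM"}),
    (10, "F", {"("}),
    (11, "F", {"number"}),
]


def build_table(predict):
    """Map (non-terminal, lookahead terminal) to a production number."""
    table = {}
    for number, lhs, lookaheads in predict:
        for terminal in lookaheads:
            key = (lhs, terminal)
            if key in table:
                # Two productions for lhs are predicted by the same symbol,
                # so the grammar is not LL(1).
                raise ValueError(f"conflict at T[{lhs}, {terminal}]")
            table[key] = number
    return table


TABLE = build_table(PREDICT)
```

Here `TABLE[("E*", ")")]` is 5 and `TABLE[("T*", "*")]` is 7, while `("F", "+")` is absent, which corresponds to an error entry.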

Driver Procedure

Under table-driven parsing, there is a single procedure that "interprets" the parse table. This "driver" procedure takes the following form:

    procedure Parser;
        /* Push the start symbol S onto the stack */
        Push(S, stack)
        /* Initialize lookahead symbol */
        scanner(NextInputSymbol)
        while not Empty(stack) do
            top = Top(stack)
            if top is a nonterminal then
                action = ParseTable[top, NextInputSymbol]
                if action > 0 then
                    /* Pop top symbol */
                    Pop(stack)
                    /* Push RHS of production, rightmost symbol first, */
                    /* so that the leftmost symbol ends up on top      */
                    for each symbol on RHS #action, from right to left, do
                        Push(symbol, stack)
                else
                    print("syntax error"); return
            else if NextInputSymbol == top then
                /* Match terminal symbol in input */
                Pop(stack)
                /* Get next terminal symbol in input */
                scanner(NextInputSymbol)
            else
                print("syntax error"); return

Example Parse

Let's trace the parse for the input 1 + (2 * 3) / 4 ENDM. The top of the stack is on the left, N stands for the terminal number, and a numeric action is the number of the production used to expand the top symbol.

        Stack Contents          Current input         Action
     1: S                       1 + (2 * 3) / 4 ENDM  1
     2: E ENDM                  1 + (2 * 3) / 4 ENDM  2
     3: T E* ENDM               1 + (2 * 3) / 4 ENDM  6
     4: F T* E* ENDM            1 + (2 * 3) / 4 ENDM  11
     5: N T* E* ENDM            1 + (2 * 3) / 4 ENDM  Pop
     6: T* E* ENDM              + (2 * 3) / 4 ENDM    9
     7: E* ENDM                 + (2 * 3) / 4 ENDM    3
     8: + T E* ENDM             + (2 * 3) / 4 ENDM    Pop
     9: T E* ENDM               (2 * 3) / 4 ENDM      6
    10: F T* E* ENDM            (2 * 3) / 4 ENDM      10
    11: ( E ) T* E* ENDM        (2 * 3) / 4 ENDM      Pop
    12: E ) T* E* ENDM          2 * 3) / 4 ENDM       2
    13: T E* ) T* E* ENDM       2 * 3) / 4 ENDM       6
    14: F T* E* ) T* E* ENDM    2 * 3) / 4 ENDM       11
    15: N T* E* ) T* E* ENDM    2 * 3) / 4 ENDM       Pop
    16: T* E* ) T* E* ENDM      * 3) / 4 ENDM         7
    17: * F T* E* ) T* E* ENDM  * 3) / 4 ENDM         Pop
    18: F T* E* ) T* E* ENDM    3) / 4 ENDM           11
    19: N T* E* ) T* E* ENDM    3) / 4 ENDM           Pop
    20: T* E* ) T* E* ENDM      ) / 4 ENDM            9
    21: E* ) T* E* ENDM         ) / 4 ENDM            5
    22: ) T* E* ENDM            ) / 4 ENDM            Pop
    23: T* E* ENDM              / 4 ENDM              8
    24: / F T* E* ENDM          / 4 ENDM              Pop
    25: F T* E* ENDM            4 ENDM                11
    26: N T* E* ENDM            4 ENDM                Pop
    27: T* E* ENDM              ENDM                  9
    28: E* ENDM                 ENDM                  5
    29: ENDM                    ENDM                  Pop
    30: Done!
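The driver and the trace can be reproduced with a short program. The following Python sketch is one possible rendering, not a definitive implementation. It hard-codes the productions and parse table for the augmented expression grammar; the names `PRODUCTIONS`, `TABLE`, `tokenize` and `parse` are assumptions made for the example. It raises `ValueError` where the pseudocode reports a syntax error, and it returns the list of actions taken so the result can be compared with the Action column of the trace.

```python
import re

# Right hand side of each numbered production (empty list = epsilon).
PRODUCTIONS = {
    1: ["E", "ENDM"],
    2: ["T", "E*"],
    3: ["+", "T", "E*"],
    4: ["-", "T", "E*"],
    5: [],
    6: ["F", "T*"],
    7: ["*", "F", "T*"],
    8: ["/", "F", "T*"],
    9: [],
    10: ["(", "E", ")"],
    11: ["number"],
}

# Parse table: non-terminal -> lookahead -> production number.
# A missing entry corresponds to a blank (error) cell.
TABLE = {
    "S": {"(": 1, "number": 1},
    "E": {"(": 2, "number": 2},
    "E*": {")": 5, "+": 3, "-": 4, "ENDM": 5},
    "T": {"(": 6, "number": 6},
    "T*": {")": 9, "+": 9, "-": 9, "*": 7, "/": 8, "ENDM": 9},
    "F": {"(": 10, "number": 11},
}


def tokenize(text):
    """Digit runs become "number"; other non-blank characters are their own token."""
    pairs = re.findall(r"(\d+)|(\S)", text)
    return ["number" if num else sym for num, sym in pairs] + ["ENDM"]


def parse(text):
    """Run the table-driven driver; return the actions taken, or raise ValueError."""
    tokens = tokenize(text)
    pos = 0
    stack = ["S"]            # the end of the list is the top of the stack
    actions = []
    while stack:
        top = stack[-1]
        lookahead = tokens[pos]
        if top in TABLE:                         # non-terminal: expand it
            number = TABLE[top].get(lookahead)
            if number is None:
                raise ValueError(f"syntax error at {lookahead!r}")
            stack.pop()
            # Push the right hand side in reverse so its leftmost symbol is on top.
            stack.extend(reversed(PRODUCTIONS[number]))
            actions.append(number)
        elif top == lookahead:                   # terminal: match it
            stack.pop()
            pos += 1
            actions.append("Pop")
        else:
            raise ValueError(f"syntax error: expected {top!r}, found {lookahead!r}")
    return actions
```

Calling `parse("1 + (2 * 3) / 4")` returns the 29 actions listed in rows 1 through 29 of the trace, beginning `[1, 2, 6, 11, "Pop", 9, 3, "Pop", ...]`, whereas `parse("1 +")` raises `ValueError`.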