Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Parsing III (Top-down parsing: recursive descent & LL(1) ) (Bottom-up parsing) CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use.

Left Factoring (Generality) Question Answer Example By eliminating left recursion and left factoring, can we transform an arbitrary CFG to a form where it meets the LL(1) condition? (and can be parsed predictively with a single token lookahead?) Given a CFG that doesn t meet the LL(1) condition, it is undecidable whether or not an equivalent LL(1) grammar exists. {a n 0 b n n 1} {a n 1 b 2n n 1} has no LL(1) grammar

Language that Cannot Be LL(1) Example {a n 0 b n n 1} {a n 1 b 2n n 1} has no LL(1) grammar G aab abbb A aab 0 Problem: need an unbounded number of a characters before you can determine whether you are in the A group or the B group. B abbb 1

Recursive Descent (Summary) 1. Build FIRST (and FOLLOW) sets 2. Massage grammar to have LL(1) condition a. Remove left recursion b. Left factor it 3. Define a procedure for each non-terminal a. Implement a case for each right-hand side b. Call procedures as needed for non-terminals 4. Add extra code, as needed a. Perform context-sensitive checking b. Build an IR to record the code Can we automate this process?

FIRST and FOLLOW Sets FIRST(α) For some α T NT, define FIRST(α) as the set of tokens that appear as the first symbol in some string that derives from α That is, x FIRST(α) iff α * x γ, for some γ FOLLOW(α) For some α NT, define FOLLOW(α) as the set of symbols that can occur immediately after α in a valid sentence. FOLLOW(S) = {EOF}, where S is the start symbol To build FIRST sets, we need FOLLOW sets

Building Top-down Parsers Given an LL(1) grammar, and its FIRST & FOLLOW sets Emit a routine for each non-terminal Nest of if-then-else statements to check alternate rhs s Each returns true on success and throws an error on false Simple, working (, perhaps ugly,) code This automatically constructs a recursive-descent parser Improving matters Nest of if-then-else statements may be slow Good case statement implementation would be better I don t know of a system that does this What about a table to encode the options? Interpret the table with a skeleton, as we did in scanning

Building Top-down Parsers Strategy Encode knowledge in a table Use a standard skeleton parser to interpret the table Example The non-terminal Factor has three expansions ( Expr ) or Identifier or Number Table might look like: Terminal Symbols + - * / Id. Num. EOF Non-terminal Symbols Factor 10 11 Error on `+ Reduce by rule 10 on `Id

Building Top Down Parsers Building the complete table Need a row for every NT & a column for every T Need a table-driven interpreter for the table

LL(1) Skeleton Parser token next_token() push EOF onto Stack push the start symbol, S, onto Stack TOS top of Stack loop forever if TOS = EOF and token = EOF then break & report success exit on success else if TOS is a terminal then if TOS matches token then pop Stack // recognized TOS token next_token() else report error looking for TOS else // TOS is a non-terminal if TABLE[TOS,token] is A B 1 B 2 B k then pop Stack // get rid of A push B k, B k-1,, B 1 // in that order else report error expanding TOS TOS top of Stack

Building Top Down Parsers Building the complete table Need a row for every NT & a column for every T Need an algorithm to build the table Filling in TABLE[X,y], X NT, y T 1. entry is the rule X β, if y FIRST(β ) 2. entry is the rule X ε if y FOLLOW(X ) and X ε G 3. entry is error if neither 1 nor 2 define it If any entry is defined multiple times, G is not LL(1) This is the LL(1) table construction algorithm

Recursive Descent in Object-Oriented Languages Shortcomings of Recursive Descent Too procedural No convenient way to build parse tree Solution Associate a class with each non-terminal symbol Allocated object contains pointer to the parse tree Class NonTerminal { public: NonTerminal(Scanner & scnr) { s = &scnr; tree = NULL; } virtual ~NonTerminal() { } virtual bool ispresent() = 0; TreeNode * absyntree() { return tree; } protected: } Scanner * s; TreeNode * tree;

Non-terminal Classes Class Expr : public NonTerminal { public: Expr(Scanner & scnr) : NonTerminal(scnr) { } virtual bool ispresent(); } Class EPrime : public NonTerminal { public: EPrime(Scanner & scnr, TreeNode * p) : NonTerminal(scnr) { exprsofar = p; } virtual bool ispresent(); protected: TreeNode * exprsofar; } // definitions for Term and TPrime Class Factor : public NonTerminal { public: Factor(Scanner & scnr) : NonTerminal(scnr) { }; virtual bool ispresent(); }

Implementation of ispresent bool Expr::isPresent() { Term * operand1 = new Term(*s); if (!operand1->ispresent()) return FALSE; Eprime * operand2 = new EPrime(*s, NULL); if (!operand2->ispresent()) // do nothing; } return TRUE;

Implementation of ispresent bool EPrime::isPresent() { token_type op = s->nexttoken(); if (op == PLUS op == MINUS) { s->advance(); Term * operand2 = new Term(*s); if (!operand2->ispresent()) throw SyntaxError(*s); Eprime * operand3 = new EPrime(*s, NULL); if (operand3->ispresent()); //do nothing } return TRUE; } else return FALSE;

Abstract Syntax Tree Construction bool Expr::isPresent() { // with semantic processing Term * operand1 = new Term(*s); if (!operand1->ispresent()) return FALSE; tree = operand1->absyntree(); EPrime * operand2 = new EPrime(*s, tree); if (operand2->ispresent()) tree = operand2->abssyntree(); } // here tree is either the tree for the Term // or the tree for Term followed by EPrime return TRUE;

Abstract Syntax Tree Construction bool EPrime::isPresent() { // with semantic processing token_type op = s->nexttoken(); if (op == PLUS op == MINUS) { s->advance(); Term * operand2 = new Term(*s); if (!operand2->ispresent()) throw SyntaxError(*s); TreeNode * t2 = operand2->abssyntree(); tree = new TreeNode(op, exprsofar, t2); } Eprime * operand3 = new Eprime(*s, tree); if (operand3->ispresent()) tree = operand3->abssyntree(); return TRUE; } else return FALSE;

Factor bool Factor::isPresent() { // with semantic processing token_type op = s->nexttoken(); } if (op == IDENTIFIER op == NUMBER) { tree = new TreeNode(op, s->tokenvalue()); s->advance(); return TRUE; } else if (op == LPAREN) { s->advance(); Expr * operand = new Expr(*s); if (!operand->ispresent()) throw SyntaxError(*s); if (s->nexttoken()!= RPAREN) throw SyntaxError(*s); s->advance(); tree = operand->abssyntree(); return TRUE; } else return FALSE;

Parsing Techniques Top-down parsers (LL(1), recursive descent) Start at the root of the parse tree and grow toward leaves Pick a production & try to match the input Bad pick may need to backtrack Some grammars are backtrack-free (predictive parsing) Bottom-up parsers (LR(1), operator precedence) Start at the leaves and grow toward root As input is consumed, encode possibilities in an internal state Start in a state valid for legal first tokens Bottom-up parsers handle a large class of grammars

Bottom-up Parsing (definitions) The point of parsing is to construct a derivation A derivation consists of a series of rewrite steps S γ 0 γ 1 γ 2 γ n 1 γ n sentence Each γ i is a sentential form If γ contains only terminal symbols, γ is a sentence in L(G) If γ contains 1 non-terminals, γ is a sentential form To get γ i from γ i 1, expand some NT A γ i 1 by using A β Replace the occurrence of A γ i 1 with β to get γ i In a leftmost derivation, it would be the first NT A γ i 1 A left-sentential form occurs in a leftmost derivation A right-sentential form occurs in a rightmost derivation

Bottom-up Parsing A bottom-up parser builds a derivation by working from the input sentence back toward the start symbol S S γ 0 γ 1 γ 2 γ n 1 γ n sentence bottom-up To reduce γ i to γ i 1 match some rhs β against γ i then replace β with its corresponding lhs, A. (assuming the production A β) In terms of the parse tree, this is working from leaves to root Nodes with no parent in a partial tree form its upper fringe Since each replacement of β with A shrinks the upper fringe, we call it a reduction. The parse tree need not be built, it can be simulated parse tree nodes = words + reductions

Finding Reductions Consider the simple grammar And the input string abbcde The trick is scanning the input and finding the next reduction The mechanism for doing this must be efficient

Finding Reductions (Handles) The parser must find a substring β of the tree s frontier that matches some production A β that occurs as one step in the rightmost derivation ( β A is in RRD) Informally, we call this substring β a handle Formally, A handle of a right-sentential form γ is a pair <A β,k> where A β P and k is the position in γ of β s rightmost symbol. If <A β,k> is a handle, then replacing β at k with A produces the right sentential form from which γ is derived in the rightmost derivation. Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols the parser doesn t need to scan past the handle (very far)

Finding Reductions (Handles) Critical Insight (Theorem?) If G is unambiguous, then every right-sentential form has a unique handle. If we can find those handles, we can build a derivation! Sketch of Proof: 1 G is unambiguous rightmost derivation is unique 2 a unique production A β applied to derive γ i from γ i 1 3 a unique position k at which A β is applied 4 a unique handle <A β,k> This all follows from the definitions

Example (a very busy slide) The expression grammar Handles for rightmost derivation of x 2 * y This is the inverse of Figure 3.9 in EaC

Handle-pruning, Bottom-up Parsers The process of discovering a handle & reducing it to the appropriate left-hand side is called handle pruning Handle pruning forms the basis for a bottom-up parsing method To construct a rightmost derivation S γ 0 γ 1 γ 2 γ n 1 γ n w Apply the following simple algorithm for i n to 1 by 1 This takes 2n steps Find the handle <A i β i, k i > in γ i Replace β i with A i to generate γ i 1

Handle-pruning, Bottom-up Parsers One implementation technique is the shift-reduce parser push INVALID token next_token( ) repeat until (top of stack = Goal and token = EOF) if the top of the stack is a handle A β then // reduce β to A pop β symbols off the stack push A onto the stack else if (token EOF) then // shift push token token next_token( ) else // need to shift, but out of input report an error How do errors show up? failure to find a handle hitting EOF & needing to shift (final else clause) Either generates an error Figure 3.7 in EAC

Back to x - 2 * y Stack Input Handle Action $ id num * id none shift $ id num * id 1. Shift until the top of the stack is the right end of a handle 2. Find the left end of the handle & reduce

Back to x - 2 * y Stack Input Handle Action $ id num * id none shift $ id num * id 9,1 red. 9 $ Factor num * id 7,1 red. 7 $ Term num * id 4,1 red. 4 $ Expr num * id 1. Shift until the top of the stack is the right end of a handle 2. Find the left end of the handle & reduce

Back to x - 2 * y Stack Input Handle Action $ id num * id none shift $ id num * id 9,1 red. 9 $ Factor num * id 7,1 red. 7 $ Term num * id 4,1 red. 4 $ Expr num * id none shift $ Expr num * id none shift $ Expr num * id 1. Shift until the top of the stack is the right end of a handle 2. Find the left end of the handle & reduce

Back to x - 2 * y Stack Input Handle Action $ id num * id none shift $ id num * id 9,1 red. 9 $ Factor num * id 7,1 red. 7 $ Term num * id 4,1 red. 4 $ Expr num * id none shift $ Expr num * id none shift $ Expr num * id 8,3 red. 8 $ Expr Factor * id 7,3 red. 7 $ Expr Term * id 1. Shift until the top of the stack is the right end of a handle 2. Find the left end of the handle & reduce

Back to x - 2 * y Stack Input Handle Action $ id num * id none shift $ id num * id 9,1 red. 9 $ Factor num * id 7,1 red. 7 $ Term num * id 4,1 red. 4 $ Expr num * id none shift $ Expr num * id none shift $ Expr num * id 8,3 red. 8 $ Expr Factor * id 7,3 red. 7 $ Expr Term * id none shift $ Expr Term * id none shift $ Expr Term * id 1. Shift until the top of the stack is the right end of a handle 2. Find the left end of the handle & reduce

Back to x 2 * y Stack Input Handle Action $ id num * id none shift $ id num * id 9,1 red. 9 $ Factor num * id 7,1 red. 7 $ Term num * id 4,1 red. 4 $ Expr num * id none shift $ Expr num * id none shift $ Expr num * id 8,3 red. 8 $ Expr Factor * id 7,3 red. 7 $ Expr Term * id none shift $ Expr Term * id none shift $ Expr Term * id 9,5 red. 9 $ Expr Term * Factor 5,5 red. 5 $ Expr Term 3,3 red. 3 $ Expr 1,1 red. 1 $ Goal none accept 5 shifts + 9 reduces + 1 accept 1. Shift until the top of the stack is the right end of a handle 2. Find the left end of the handle & reduce

Example Goal Expr Expr Term Term Term * Fact. Fact. Fact. <id,y> <id,x> <num,2>

Shift-reduce Parsing Shift reduce parsers are easily built and easily understood A shift-reduce parser has just four actions Shift next word is shifted onto the stack Reduce right end of handle is at top of stack Locate left end of handle within the stack Pop handle off stack & push appropriate lhs Accept stop parsing & report success Error call an error reporting/recovery routine Accept & Error are simple Shift is just a push and a call to the scanner Reduce takes rhs pops & 1 push Handle finding is key handle is on stack finite set of handles use a DFA! If handle-finding requires state, put it in the stack 2x work

An Important Lesson about Handles To be a handle, a substring of a sentential form γ must have two properties: It must match the right hand side β of some rule A β There must be some rightmost derivation from the goal symbol that produces the sentential form γ with A β as the last production applied Simply looking for right hand sides that match strings is not good enough Critical Question: How can we know when we have found a handle without generating lots of different derivations? Answer: we use look ahead in the grammar along with tables produced as the result of analyzing the grammar. LR(1) parsers build a DFA that runs over the stack & finds them

An Important Lesson about Handles To be a handle, a substring of a sentential form γ must have two properties: It must match the right hand side β of some rule A β There must be some rightmost derivation from the goal symbol that produces the sentential form γ with A β as the last production applied We have seen that simply looking for right hand sides that match strings is not good enough Critical Question: How can we know when we have found a handle without generating lots of different derivations? Answer: we use look ahead in the grammar along with tables produced as the result of analyzing the grammar. o There are a number of different ways to do this. o We will look at two: operator precedence and LR parsing