COMP Lecture
Top-down Parsing
September, 00

Prelude
  What is the Tufts mascot? Jumbo the elephant.
  Why?
    P. T. Barnum was an original trustee of Tufts.
    He donated money for a natural history museum on campus: the Barnum Museum, later Barnum Hall.
    Jumbo was a famous circus elephant. After Jumbo died, he was stuffed and donated to Tufts.
    A fire later destroyed Barnum Hall and Jumbo.

Tufts University Computer Science

Last time
  Finished scanning
    Produces a stream of tokens
    Removes things we don't care about, like white space and comments
  Context-free grammars
    Formal description of language syntax
    Deriving strings using a CFG
    Depicting a derivation as a parse tree

Grammar issues
  Often: more than one way to derive a string
    Why is this a problem?
  Parsing: is the string a member of L(G)?
    We want more than a yes or no answer
  Key:
    Represent the derivation as a parse tree
    We want the structure of the parse tree to capture the meaning of the sentence

  Grammar:
    1  expr → expr op expr
    2       | number
    3       | id
    4  op   → +
    5       | -
    6       | *
    7       | /

  Rightmost derivation of x - 2 * y:
    Rule  Sentential form
    -     expr
    1     expr op expr
    3     expr op <id,y>
    6     expr * <id,y>
    1     expr op expr * <id,y>
    2     expr op <num,2> * <id,y>
    5     expr - <num,2> * <id,y>
    3     <id,x> - <num,2> * <id,y>

  [Figure: parse tree for x - 2 * y built from this derivation; it groups the input as (x - 2) * y]
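The rightmost derivation can be replayed mechanically: at each step, find the rightmost nonterminal in the sentential form and replace it with the right-hand side of the chosen rule. The sketch below assumes the rule numbering 1 to 7 for the expr/op grammar; the `RULES` table and the function name are illustrative, not part of the lecture.

```python
# Hypothetical encoding of the ambiguous expression grammar (rule numbers assumed).
RULES = {
    1: ("expr", ["expr", "op", "expr"]),
    2: ("expr", ["number"]),
    3: ("expr", ["id"]),
    4: ("op", ["+"]),
    5: ("op", ["-"]),
    6: ("op", ["*"]),
    7: ("op", ["/"]),
}
NONTERMINALS = {"expr", "op"}


def rightmost_derivation(rule_sequence, start="expr"):
    """Apply each rule to the rightmost nonterminal; return every sentential form."""
    form = [start]
    steps = [list(form)]
    for rule in rule_sequence:
        lhs, rhs = RULES[rule]
        # Position of the rightmost nonterminal in the current sentential form.
        pos = max(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[pos] != lhs:
            raise ValueError(f"rule {rule} does not apply to {form[pos]!r}")
        form[pos:pos + 1] = rhs
        steps.append(list(form))
    return steps


steps = rightmost_derivation([1, 3, 6, 1, 2, 5, 3])
for form in steps:
    print(" ".join(form))
```

The last line printed is `id - number * id`, the token-level shape of x - 2 * y.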
Abstract syntax tree
  Parse tree contains extra junk
    Eliminate intermediate nodes
    Move operators up to parent nodes
    Result: abstract syntax tree
  [Figure: parse tree for x - 2 * y next to the corresponding abstract syntax tree]

Left vs right derivations
  Two derivations of x - 2 * y

  Leftmost derivation
    Rule  Sentential form
    -     expr
    1     expr op expr
    3     <id,x> op expr
    5     <id,x> - expr
    1     <id,x> - expr op expr
    2     <id,x> - <num,2> op expr
    6     <id,x> - <num,2> * expr
    3     <id,x> - <num,2> * <id,y>

  Rightmost derivation
    Rule  Sentential form
    -     expr
    1     expr op expr
    3     expr op <id,y>
    6     expr * <id,y>
    1     expr op expr * <id,y>
    2     expr op <num,2> * <id,y>
    5     expr - <num,2> * <id,y>
    3     <id,x> - <num,2> * <id,y>

Derivations
  One captures meaning, the other doesn't
  [Figure: the two parse trees for x - 2 * y. The leftmost derivation groups the input as x - (2 * y); the rightmost derivation groups it as (x - 2) * y]

With precedence
  Last time: ways to force the right tree shape
    Add productions to represent precedence

  Original grammar:
    1  expr → expr op expr
    2       | number
    3       | id
    4  op   → +
    5       | -
    6       | *
    7       | /

  Grammar with precedence:
    1  expr   → expr + term
    2         | expr - term
    3         | term
    4  term   → term * factor
    5         | term / factor
    6         | factor
    7  factor → number
    8         | id

  [Figure: parse tree for x - 2 * y under the grammar with precedence]

Parsing
  What is parsing?
    Discovering the derivation of a string, if one exists
    Harder than generating strings (not surprisingly)
  Two major approaches
    Top-down parsing
    Bottom-up parsing
  Neither works on all context-free grammars
    Properties of the grammar determine parseability
    Our goal: make parsing efficient
    We may be able to transform a grammar
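Returning to the point that tree shape carries meaning: the two parse trees for x - 2 * y correspond to two different abstract syntax trees, and evaluating them gives different answers. The sketch below encodes each AST as a nested tuple `(operator, left, right)`; that encoding, the `evaluate` helper, and the sample values for x and y are assumptions made for illustration.

```python
import operator

# Map operator tokens to Python functions.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def evaluate(tree, env):
    """Evaluate an AST given as nested (op, left, right) tuples."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):      # identifier: look up its value
        return env[tree]
    return tree                    # numeric literal


minus_first = ("*", ("-", "x", 2), "y")   # groups as (x - 2) * y
times_first = ("-", "x", ("*", 2, "y"))   # groups as x - (2 * y)

env = {"x": 10, "y": 3}
print(evaluate(minus_first, env))   # 24
print(evaluate(times_first, env))   # 4
```

Only the second tree matches the usual precedence of * over -, which is what the extra term and factor productions enforce.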
Two approaches
  Top-down parsers: LL(1), recursive descent
    Start at the root of the parse tree and grow toward the leaves
    Pick a production and try to match the input
    Bad pick: may need to backtrack
  Bottom-up parsers: LR(1), operator precedence
    Start at the leaves and grow toward the root
    As input is consumed, encode possible parse trees in an internal state (similar to our NFA-to-DFA conversion)
    Bottom-up parsers handle a large class of grammars

Grammars and parsers
  LL(1) parsers
    Left-to-right input
    Leftmost derivation
    1 symbol of lookahead
    Grammars that these can handle are called LL(1) grammars
  LR(1) parsers
    Left-to-right input
    Rightmost derivation
    1 symbol of lookahead
    Grammars that these can handle are called LR(1) grammars
  Also: LL(k), LR(k), SLR, LALR, ...

Top-down parsing
  Start with the root of the parse tree
    Root of the tree: node labeled with the start symbol
  Algorithm:
    Repeat until the fringe of the parse tree matches the input string
      At a node A, select a production for A
        Add a child node for each symbol on the rhs
      If a terminal symbol is added that doesn't match, backtrack
      Find the next node to be expanded (a nonterminal)
  Done when:
    Leaves of the parse tree match the input string (success)
    All productions are exhausted in backtracking (failure)

Example
  Expression grammar (with precedence):
    1  expr   → expr + term
    2         | expr - term
    3         | term
    4  term   → term * factor
    5         | term / factor
    6         | factor
    7  factor → number
    8         | id
  Input string: x - 2 * y

Example
  ↑ marks the current position in the input stream

    Rule  Sentential form   Input string
    -     expr              ↑x - 2 * y
    1     expr + term       ↑x - 2 * y
    3     term + term       ↑x - 2 * y
    6     factor + term     ↑x - 2 * y
    8     <id> + term       ↑x - 2 * y
    -     <id,x> + term     x ↑- 2 * y

Backtracking
  After matching <id,x>, the sentential form calls for + but the next input token is -
  Problem: can't match the next terminal
    We guessed wrong when we chose expr → expr + term
    Roll back: undo all the productions applied since that choice
    Choose a different production for expr
    Continue
Retrying
    Rule  Sentential form    Input string
    -     expr               ↑x - 2 * y
    2     expr - term        ↑x - 2 * y
    3     term - term        ↑x - 2 * y
    6     factor - term      ↑x - 2 * y
    8     <id> - term        ↑x - 2 * y
    -     <id,x> - term      x ↑- 2 * y
    6     <id,x> - factor    x - ↑2 * y
    7     <id,x> - <num>     x - ↑2 * y

  Problem: more input to read
    After <num> matches 2, the sentential form is used up but * y remains
    Another cause of backtracking

Successful parse
    Rule  Sentential form              Input string
    -     expr                         ↑x - 2 * y
    2     expr - term                  ↑x - 2 * y
    3     term - term                  ↑x - 2 * y
    6     factor - term                ↑x - 2 * y
    8     <id> - term                  ↑x - 2 * y
    -     <id,x> - term                x ↑- 2 * y
    4     <id,x> - term * factor       x - ↑2 * y
    6     <id,x> - factor * factor     x - ↑2 * y
    7     <id,x> - <num> * factor      x - ↑2 * y
    -     <id,x> - <num,2> * factor    x - 2 ↑* y
    8     <id,x> - <num,2> * <id>      x - 2 * ↑y

  All terminals match: we're done

Other possible parses
    Rule  Sentential form                      Input string
    -     expr                                 ↑x - 2 * y
    1     expr + term                          ↑x - 2 * y
    1     expr + term + term                   ↑x - 2 * y
    1     expr + term + term + term            ↑x - 2 * y
    1     expr + term + term + term + term     ↑x - 2 * y

  Problem: termination
    A wrong choice leads to infinite expansion
    (More importantly: without consuming any input!)
    May not be as obvious as this
  Our grammar is left recursive

Left recursion
  Formally, a grammar is left recursive if there is a nonterminal A such that
    A →* A α   (for some string of symbols α)
  What does →* mean?
    Derives in zero or more steps; for left recursion, at least one step is needed
    Example of indirect left recursion:
      A → B x
      B → A y
    Here A → B x → A y x, so A →* A α with α = y x
  Bad news: top-down parsers cannot handle left recursion
  Good news: we can systematically eliminate left recursion

Notation
  Nonterminals
    Capital letters: A, B, C
  Terminals
    Lowercase, underlined: x, y, z
  Some mix of terminals and nonterminals
    Greek letters: α, β, γ
  Example:
    A → B + x
    A → B α, where α = + x

Eliminating left recursion
  Consider this grammar:
    foo → foo α
        | β
  Language is β followed by zero or more α
  Rewrite as:
    foo → β bar
    bar → α bar
        | ε
  bar is a new nonterminal
    The foo production gives you one β
    The two bar productions give you zero or more α
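The rewrite of foo → foo α | β into foo → β bar, bar → α bar | ε can be sketched as a small function. This is a minimal sketch that handles only immediate left recursion on a single nonterminal; representing productions as tuples of symbols and using an empty tuple for ε are assumptions made here, not the lecture's notation.

```python
EPSILON = ()  # the empty right-hand side


def eliminate_immediate_left_recursion(nonterminal, productions, new_name):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon."""
    # The alphas: what follows the leading A in each left-recursive production.
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    # The betas: every production that does not start with A.
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not alphas:
        return {nonterminal: list(productions)}  # nothing to rewrite
    return {
        nonterminal: [tuple(beta) + (new_name,) for beta in betas],
        new_name: [tuple(alpha) + (new_name,) for alpha in alphas] + [EPSILON],
    }


expr_rules = [("expr", "+", "term"), ("expr", "-", "term"), ("term",)]
result = eliminate_immediate_left_recursion("expr", expr_rules, "expr2")
for lhs, alternatives in result.items():
    for rhs in alternatives:
        print(lhs, "→", " ".join(rhs) if rhs else "ε")
```

Applied to the expr productions, this yields expr → term expr2 and expr2 → + term expr2 | - term expr2 | ε. Indirect left recursion (as in A → B x, B → A y) needs the more general algorithm, which this sketch does not attempt.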
Back to expressions
  Two cases of left recursion:
    expr → expr + term
         | expr - term
         | term
    term → term * factor
         | term / factor
         | factor
  Transform as follows:
    expr  → term expr2
    expr2 → + term expr2
          | - term expr2
          | ε
    term  → factor term2
    term2 → * factor term2
          | / factor term2
          | ε

Eliminating left recursion
  Resulting grammar
    All right recursive
    Retains the original language and associativity
    Not as intuitive to read
  Top-down parser
    Will always terminate
    May still backtrack
  There's a lovely algorithm to do this automatically, which we will skip

     1  goal   → expr
     2  expr   → term expr2
     3  expr2  → + term expr2
     4         | - term expr2
     5         | ε
     6  term   → factor term2
     7  term2  → * factor term2
     8         | / factor term2
     9         | ε
    10  factor → number
    11         | id

Top-down parsers
  Problem: left recursion
  Solution: technique to remove it
  What about backtracking?
    Current algorithm is brute force
  Problem: how to choose the right production?
    Idea: use the next input token (duh)
    How? Look at our right-recursive grammar

Right-recursive grammar
  (the 11-production grammar above)
  Two productions with no choice at all
  All other productions are uniquely identified by a terminal symbol at the start of the RHS
  We can choose the right production by looking at the next input symbol
    This is called lookahead
    BUT, this can be tricky

Lookahead
  Goal: avoid backtracking
    Look at future input symbols
    Use extra context to make the right choice
  How much lookahead is needed?
    In general, an arbitrary amount is needed for the full class of context-free grammars
    Use a fancy-dancy algorithm: the CYK algorithm, O(n^3)
  Fortunately,
    Many CFGs can be parsed with limited lookahead
    Covers most programming languages (but not C++ or Perl)

Top-down parsing
  Goal:
    Given productions A → α | β, the parser should be able to choose between α and β
  Trying to match A
    How can the next input token help us decide?
  Solution: FIRST sets (almost a solution)
    Informally: FIRST(α) is the set of tokens that could appear as the first symbol in a string derived from α
    Def: x is in FIRST(α) iff α →* x γ, for some γ
Top-down parsing
  Building FIRST sets
    We'll look at this algorithm later
  The LL(1) property
    Given A → α and A → β, we would like:
      FIRST(α) ∩ FIRST(β) = ∅
    The parser can make the right choice by looking at one lookahead token
    ...almost...

Top-down parsing
  What about ε productions?
    Complicates the definition of LL(1)
    Consider A → α and A → β, where α may be empty
    In this case there is no symbol to identify α
  Example:
    1  A → x B
    2    | y C
    3    | ε
    What is FIRST(3)? = { ε }
    What lookahead symbol tells us we are matching production 3?

Top-down parsing
  If A was empty
    What will the next symbol be?
    Must be one of the symbols that immediately follow an A
  Solution
    Build a FOLLOW set for each production with ε
    Extra condition for LL: FIRST(β) must be disjoint from FIRST(α) and FOLLOW(A)

FOLLOW sets
  Example:
    1  A → x B
    2    | y C
    3    | ε
    4  E → A z
    FIRST(1) = { x }
    FIRST(2) = { y }
    FIRST(3) = { ε }
  What can follow A?
    Look at the context of all uses of A
    FOLLOW(A) = { z }
  Now we can uniquely identify each production:
    If we are trying to match an A and the next token is z, then we matched production 3

More on FIRST and FOLLOW
  Notice:
    FIRST and FOLLOW may be sets
    FIRST may contain ε in addition to other symbols
  Example:
    1  A → B C
    2  B → x
    3    | y
    4  E → A z
    5  F → A w
    FIRST(1) = { x, y, ε }
    FOLLOW(A) = { z, w }
  Question: when would we care about FOLLOW(A)?
  Answer: if FIRST(C) contains ε, so that the right-hand side of A can derive the empty string

LL(1) property
  Including ε productions
    FOLLOW(A) = the set of terminal symbols that can immediately follow A
  Def: FIRST+(A → α) is
    FIRST(α) ∪ FOLLOW(A), if ε ∈ FIRST(α)
    FIRST(α), otherwise
  Def: a grammar is LL(1) iff, for all A → α and A → β,
    FIRST+(A → α) ∩ FIRST+(A → β) = ∅
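The definitions of FIRST, FOLLOW, and FIRST+ can be checked by computing them directly. The following is a minimal sketch using the standard fixed-point iteration, which is not necessarily the algorithm the lecture presents later. The dictionary encoding of the grammar, the function names, and the use of `$` as an end-of-input marker in FOLLOW of the start symbol are all assumptions made for illustration.

```python
EPS, EOF = "ε", "$"


def first_of(symbols, first):
    """FIRST of a symbol sequence, given FIRST sets for the nonterminals."""
    out = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})  # a terminal's FIRST is itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol can vanish, or the sequence is empty
    return out


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate until no set grows
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(EOF)
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # only nonterminals have FOLLOW sets
                    rest = first_of(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # everything after sym can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


def first_plus(nt, rhs, first, follow):
    f = first_of(rhs, first)
    return (f | follow[nt]) if EPS in f else f


def is_ll1(grammar, start):
    first = compute_first(grammar)
    follow = compute_follow(grammar, first, start)
    for nt, alternatives in grammar.items():
        sets = [first_plus(nt, rhs, first, follow) for rhs in alternatives]
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i] & sets[j]:
                    return False
    return True


RIGHT_RECURSIVE = {
    "goal": [["expr"]],
    "expr": [["term", "expr2"]],
    "expr2": [["+", "term", "expr2"], ["-", "term", "expr2"], []],
    "term": [["factor", "term2"]],
    "term2": [["*", "factor", "term2"], ["/", "factor", "term2"], []],
    "factor": [["number"], ["id"]],
}
LEFT_RECURSIVE = {
    "expr": [["expr", "+", "term"], ["expr", "-", "term"], ["term"]],
    "term": [["term", "*", "factor"], ["term", "/", "factor"], ["factor"]],
    "factor": [["number"], ["id"]],
}

print(is_ll1(RIGHT_RECURSIVE, "goal"))  # True
print(is_ll1(LEFT_RECURSIVE, "expr"))   # False
```

The left-recursive grammar fails the test because all three expr alternatives share the same FIRST set, which is another way of seeing why one token of lookahead cannot choose among them.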
LL(1) property
  Question
    Can there be two rules A → α and A → β in an LL(1) grammar such that ε ∈ FIRST(α) and ε ∈ FIRST(β)?
  Answer
    No. Both rules have the same left-hand side, so FIRST+(A → α) and FIRST+(A → β) would both contain ε and all of FOLLOW(A), and they could not be disjoint

Parsing LL(1) grammar
  Given an LL(1) grammar
    Code: simple, fast routine to recognize each production
  Given A → β1 | β2 | β3, with FIRST+(βi) ∩ FIRST+(βj) = ∅ for all i != j

    /* find rule for A */
    if (current token ∈ FIRST+(β1))
      select A → β1
    else if (current token ∈ FIRST+(β2))
      select A → β2
    else if (current token ∈ FIRST+(β3))
      select A → β3
    else
      report an error and return false

Predictive parsing
  The parser can predict the correct expansion
    Using lookahead and FIRST and FOLLOW sets
  Two kinds of predictive parsers
    Recursive descent
      Often hand-written
    Table-driven
      Generate tables from FIRST and FOLLOW sets

Recursive descent
     1  goal   → expr
     2  expr   → term expr2
     3  expr2  → + term expr2
     4         | - term expr2
     5         | ε
     6  term   → factor term2
     7  term2  → * factor term2
     8         | / factor term2
     9         | ε
    10  factor → number
    11         | id
    12         | ( expr )

  This produces a parser with six mutually recursive routines:
    Goal, Expr, Expr2, Term, Term2, Factor
  Each recognizes one nonterminal (NT) or terminal (T)
  The "descent" refers to the direction in which the parse tree is built

Example code
  Goal symbol:

    main()
      /* Match goal → expr */
      tok = nextToken();
      if (expr() && tok == EOF)
        then proceed to next step;
        else return false;

  Top-level expression:

    expr()
      /* Match expr → term expr2 */
      if (term() && expr2())
        return true;
      else
        return false;

Example code
  Match expr2:

    expr2()
      /* Match expr2 → + term expr2 */
      /* Match expr2 → - term expr2 */
      if (tok == '+' or tok == '-')
        tok = nextToken();
        if (term())
          then return expr2();
          else return false;
      /* Match expr2 → empty */
      return true;

  Check FIRST and FOLLOW sets to distinguish between the alternatives
Example code

    factor()
      /* Match factor → ( expr ) */
      if (tok == '(')
        tok = nextToken();
        if (expr() && tok == ')')
          tok = nextToken();
          return true;
        else
          syntax error: expecting )
          return false;
      /* Match factor → num */
      if (tok is a num)
        tok = nextToken();
        return true;
      /* Match factor → id */
      if (tok is an id)
        tok = nextToken();
        return true;
      /* No alternative matches */
      return false;

Top-down parsing
  So far:
    Gives us a yes or no answer
    We want to build the parse tree
    How?
  Add actions to the matching routines
    Create a node for each production
    How do we assemble the tree?

Building a parse tree
  Notice:
    Recursive calls match the shape of the tree
    [Figure: chain of calls main, expr, term, factor alongside the matching tree nodes]
  Idea: use a stack
    Each routine:
      Pops off the children it needs
      Creates its own node
      Pushes that node back on the stack

Building a parse tree
  With stack operations:

    expr()
      /* Match expr → term expr2 */
      if (term() && expr2())
        expr2_node = pop();   /* pushed last, so popped first */
        term_node = pop();
        expr_node = new Node(term_node, expr2_node);
        push(expr_node);
        return true;
      else
        return false;

Next time
  Finish top-down parsing
    Table-driven parsers
    Building FIRST and FOLLOW sets
  Start bottom-up parsing
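Pulling the example routines together, the following is a minimal runnable sketch of a recursive-descent parser for the right-recursive expression grammar that builds its parse tree with a stack, as described in the "Building a parse tree" slides. The tokenizer, the `Parser` class, the tuple representation of tree nodes, and the `_head`/`_tail` helpers are assumptions made for illustration; the lecture's own code uses separate routines and a Node class.

```python
import re


def tokenize(text):
    """Split text into (kind, lexeme) pairs: 'num', 'id', or the symbol itself."""
    tokens = []
    for lexeme in re.findall(r"\d+|[A-Za-z_]\w*|\S", text):
        if lexeme.isdigit():
            tokens.append(("num", lexeme))
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append(("id", lexeme))
        else:
            tokens.append((lexeme, lexeme))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text) + [("EOF", "")]
        self.pos = 0
        self.stack = []  # finished subtrees; each routine pushes its own node

    @property
    def tok(self):
        return self.tokens[self.pos][0]

    def advance(self):
        lexeme = self.tokens[self.pos][1]
        self.pos += 1
        return lexeme

    def _head(self, name, first, rest):   # name -> first rest
        if first() and rest():
            rest_node = self.stack.pop()  # pushed last, so popped first
            first_node = self.stack.pop()
            self.stack.append((name, first_node, rest_node))
            return True
        return False

    def _tail(self, name, ops, operand):  # name -> op operand name | ε
        if self.tok not in ops:           # lookahead selects the ε production
            self.stack.append((name,))
            return True
        op = self.advance()
        if operand() and self._tail(name, ops, operand):
            rest_node = self.stack.pop()
            operand_node = self.stack.pop()
            self.stack.append((name, op, operand_node, rest_node))
            return True
        return False

    def goal(self):
        return self.expr() and self.tok == "EOF"

    def expr(self):
        return self._head("expr", self.term, self.expr2)

    def expr2(self):
        return self._tail("expr2", ("+", "-"), self.term)

    def term(self):
        return self._head("term", self.factor, self.term2)

    def term2(self):
        return self._tail("term2", ("*", "/"), self.factor)

    def factor(self):                     # factor -> ( expr ) | num | id
        if self.tok == "(":
            self.advance()
            if self.expr() and self.tok == ")":
                self.advance()
                self.stack.append(("factor", self.stack.pop()))
                return True
            return False
        if self.tok in ("num", "id"):
            kind = self.tok
            self.stack.append(("factor", kind, self.advance()))
            return True
        return False


def parse(text):
    """Return the parse tree as nested tuples, or None on a syntax error."""
    parser = Parser(text)
    return parser.stack.pop() if parser.goal() else None


print(parse("x - 2 * y"))
print(parse("x - * y"))   # None: syntax error
```

Because every choice is made from the current token, no routine ever backtracks. The ε alternatives are taken whenever the lookahead is not one of the operators; a stricter version could also check the lookahead against FOLLOW to report errors earlier.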