COMP 181: Parsing. Specifying syntax with a grammar

COMP 181 Lecture: Parsing

Prelude. What is the common name of the fruit Synsepalum dulcificum? Miracle fruit, from West Africa. What is special about miracle fruit? It contains a protein called miraculin that, in an acidic environment, binds to the sweet receptors in taste buds, so sour becomes sweet: lemons and limes, Tabasco sauce, stinky cheese, dark beer. The flavor tripping lasts about an hour.

Next step. The front end so far: text (characters) goes into the lexical analyzer, tokens come out and go into the parser, which produces the IR; both phases report errors. Parsing organizes tokens into sentences: do the tokens conform to the language syntax? Good news: token types are just numbers. Bad news: language syntax is fundamentally more complex than a lexical specification. Good news: we can still do it in linear time in most cases.

Parser. Reads tokens from the scanner, checks the organization of tokens against a grammar, and constructs a derivation. The derivation drives construction of the IR.

Study of parsing. Parsing is discovering the derivation of a sentence, like diagramming a sentence in grade school. Formalization: a mathematical model of syntax, a grammar G, and an algorithm for testing membership in L(G). Roadmap: context-free grammars; top-down parsers (ad hoc, often hand-coded, recursive descent parsers); bottom-up parsers (automatically generated LR parsers).

Specifying syntax with a grammar. Can we use regular expressions? For the most part, no. The limitations of regular expressions mean we need something more powerful, but we still want a formal specification (for automation). A context-free grammar is a set of rules for generating sentences, expressed in Backus-Naur Form (BNF).

Context-free grammar. Example: SheepNoise → SheepNoise baa | baa (the "|" is shorthand for listing alternatives). Formally, a context-free grammar is G = (S, N, T, P): T is the set of terminals (provided by the scanner); N is the set of nonterminals (they represent structure); S ∈ N is the start or goal symbol; P : N → (N ∪ T)* is the set of production rules ("produces" or "generates").

Language L(G). L(G) is the set of all sentences generated from the start symbol. Generating sentences: use the productions as rewrite rules. Start with the goal (or start) symbol, a nonterminal. Choose a nonterminal and expand it to the right-hand side of one of its productions. When only terminal symbols are left, we have a sentence in L(G). Intermediate results are known as sentential forms.

Expressions. A language of expressions: numbers and identifiers, different binary operators, arbitrary nesting of expressions:

    Expr → Expr Op Expr | number | id
    Op   → + | - | * | /

Language of expressions. What is in this language? We can build the string x - 2 * y, so this string is in the language:

    Expr ⇒ Expr Op Expr
         ⇒ <id,x> Op Expr
         ⇒ <id,x> - Expr
         ⇒ <id,x> - Expr Op Expr
         ⇒ <id,x> - <num,2> Op Expr
         ⇒ <id,x> - <num,2> * Expr
         ⇒ <id,x> - <num,2> * <id,y>

Derivations. A sequence of rewrites is called a derivation; discovering a derivation for a string is parsing. Different derivations are possible: at each step we can choose any nonterminal to expand. Rightmost derivation: always choose the rightmost nonterminal. Leftmost derivation: always choose the leftmost nonterminal. (Other, random derivations are not of interest.)

Left vs. right derivations. Two derivations of x - 2 * y: the leftmost derivation above, and the rightmost derivation

    Expr ⇒ Expr Op Expr
         ⇒ Expr Op <id,y>
         ⇒ Expr * <id,y>
         ⇒ Expr Op Expr * <id,y>
         ⇒ Expr Op <num,2> * <id,y>
         ⇒ Expr - <num,2> * <id,y>
         ⇒ <id,x> - <num,2> * <id,y>
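To make "productions as rewrite rules" concrete, here is a minimal Python sketch of my own (not from the slides) that stores the expression grammar as a dictionary and replays the leftmost derivation of x - 2 * y; the hard-coded production choices simply mirror the derivation shown above.

    # A minimal sketch (not from the slides): a grammar as a dict from
    # nonterminal to a list of right-hand sides, and a leftmost rewrite step.
    GRAMMAR = {
        "Expr": [["Expr", "Op", "Expr"], ["number"], ["id"]],
        "Op":   [["+"], ["-"], ["*"], ["/"]],
    }

    def is_nonterminal(sym):
        return sym in GRAMMAR

    def leftmost_rewrite(form, choice):
        """Expand the leftmost nonterminal in the sentential form,
        using the production index given by `choice`."""
        for i, sym in enumerate(form):
            if is_nonterminal(sym):
                return form[:i] + GRAMMAR[sym][choice] + form[i + 1:]
        return form  # only terminals left: a sentence in L(G)

    # Production choices that reproduce the leftmost derivation of  x - 2 * y
    choices = [0, 2, 1, 0, 1, 2, 2]
    form = ["Expr"]
    for c in choices:
        form = leftmost_rewrite(form, c)
        print(" ".join(form))   # ends with: id - number * id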

Derivations and parse trees. We have two different derivations, and both are correct; do we care which one we use? Represent a derivation as a parse tree: leaves are terminal symbols, inner nodes are nonterminals. To depict a production α → β γ δ, show nodes β, γ, δ as children of α. The tree is used to build the internal representation.

Example (I): the rightmost derivation of x - 2 * y. Its parse tree is a concrete syntax tree: it shows all the details of the syntactic structure, with * near the root and the subtraction below it. What is the problem with this tree?

Abstract syntax tree. The parse tree contains extra junk. Eliminate the intermediate nodes and move the operators up to the parent nodes; the result is an abstract syntax tree. Problem: this tree evaluates the expression as (x - 2) * y.

Example (II): the leftmost derivation of x - 2 * y. Its parse tree puts the subtraction at the root and the multiplication below it. Solution: this tree evaluates the expression as x - (2 * y).

Derivations and semantics. For x - 2 * y, the leftmost and rightmost derivations give two different valid parse trees, and only one captures the meaning we want. (What specifically are we trying to capture here?) Key idea: the shape of the tree implies its meaning; notice that operations deeper in the tree are evaluated first. Can we express precedence in the grammar? Solution: add an intermediate production. The new production isolates the different levels of precedence and forces higher precedence deeper in the grammar.
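Since the two trees for x - 2 * y differ only in shape, a tiny evaluator makes the "shape implies meaning" point concrete. This is a sketch of my own (not from the slides), with nodes written as (operator, left, right) tuples and sample values chosen arbitrarily.

    # Sketch: evaluate an AST given as nested (op, left, right) tuples or numbers.
    def eval_ast(node):
        if isinstance(node, (int, float)):
            return node
        op, left, right = node
        a, b = eval_ast(left), eval_ast(right)
        return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]

    x, y = 10, 3
    wrong = ("*", ("-", x, 2), y)   # shape from Example (I):  (x - 2) * y
    right = ("-", x, ("*", 2, y))   # shape from Example (II): x - (2 * y)
    print(eval_ast(wrong), eval_ast(right))   # 24 versus 4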

Adding precedence. Two levels: level 1, lower precedence, sits higher in the tree; level 2, higher precedence, sits deeper in the tree:

    Expr   → Expr + Term | Expr - Term | Term
    Term   → Term * Factor | Term / Factor | Factor
    Factor → number | id

Observations: the grammar is larger and requires more rewriting to reach the terminals, but it produces the same parse tree under both leftmost and rightmost derivations.

Expression example. Rightmost derivation of x - 2 * y in the precedence grammar:

    Expr ⇒ Expr - Term
         ⇒ Expr - Term * Factor
         ⇒ Expr - Term * <id,y>
         ⇒ Expr - Factor * <id,y>
         ⇒ Expr - <num,2> * <id,y>
         ⇒ Term - <num,2> * <id,y>
         ⇒ Factor - <num,2> * <id,y>
         ⇒ <id,x> - <num,2> * <id,y>

With precedence, the rightmost derivation now also yields x - (2 * y): in its parse tree the multiplication sits below the subtraction.

In-class question. What if I want (x - 2) * y? The slide shows some common tree patterns for expressions (the shape with * at the root versus the shape with * deeper in the tree).

Another issue. Back to the original expression grammar,

    Expr → Expr Op Expr | number | id
    Op   → + | - | * | /

and our favorite string x - 2 * y. There are multiple leftmost derivations of this string: one expands the first Expr into Expr Op Expr again, which groups (x - 2) * y; the other expands the first Expr directly to <id,x>, which groups x - (2 * y). A grammar with multiple leftmost derivations for the same sentence is called ambiguous. Is this a problem? Yes: it makes parsing very hard to automate.

Ambiguous grammars. A grammar is ambiguous iff there are multiple leftmost or multiple rightmost derivations for a single sentential form. Note: the leftmost and rightmost derivations may differ, even in an unambiguous grammar. Intuitively, we can choose different nonterminals to expand, but each nonterminal should lead to a unique set of terminal symbols. What is the classic example? The if-then-else ambiguity.

If-then-else. Grammar:

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | other statements

Problem: in nested if-then-else statements, each if may or may not have an else. How do we match each else with its if?

If-then-else ambiguity. The statement "if expr then if expr then stmt else stmt" has two derivations: one attaches the else to the inner if, the other attaches it to the outer if, giving two different parse trees.

Removing ambiguity. Restrict the grammar: choose a rule (the else matches the innermost if) and codify it with new productions:

    stmt     → if expr then stmt
             | if expr then withelse else stmt
             | other statements
    withelse → if expr then withelse else withelse
             | other statements

Intuition: when we have an else, all preceding nested conditions must have an else.

Ambiguity. Ambiguity can take different forms. Grammatical ambiguity (the if-then-else problem). Contextual ambiguity: in C, "x * y;" could be a multiplication or, following "typedef int x;", a declaration; in Fortran, "x = f(y)" could be a function call or an array reference. Contextual ambiguity cannot be solved directly in the grammar; it is an issue of types (later in the course). Deeper question: how much can the parser do?

Parsing. What is parsing? Discovering the derivation of a string, if one exists. It is harder than generating strings, not surprisingly. Two major approaches: top-down parsing and bottom-up parsing. Neither works on all context-free grammars; properties of the grammar determine parseability. Our goal is to make parsing efficient, and we may be able to transform a grammar to that end.

Two approaches. Top-down parsers (LL(1), recursive descent): start at the root of the parse tree and grow toward the leaves; pick a production and try to match the input. What happens if the parser chooses the wrong one? Bottom-up parsers (LR(1), operator precedence): start at the leaves and grow toward the root. Issue: there might be multiple possible ways to do this. Key idea: encode the possible parse trees in an internal state (similar to our NFA-to-DFA conversion). Bottom-up parsers handle a large class of grammars.

Grammars and parsers. LL(1) parsers: Left-to-right input, Leftmost derivation, 1 symbol of lookahead; the grammars they can handle are called LL(1) grammars. LR(1) parsers: Left-to-right input, Rightmost derivation (in reverse), 1 symbol of lookahead; the grammars they can handle are called LR(1) grammars. Also: LL(k), LR(k), SLR, LALR, and so on.

Top-down parsing. Start with the root of the parse tree: a node labeled with the start symbol. Algorithm: repeat until the fringe of the parse tree matches the input string: at a node A, select one of A's productions and add a child node for each symbol on its right-hand side; then find the next node to be expanded (a nonterminal). Done when the leaves of the parse tree match the input string (success).

Example. The expression grammar with precedence,

    Expr   → Expr + Term | Expr - Term | Term
    Term   → Term * Factor | Term / Factor | Factor
    Factor → number | id

and the input string x - 2 * y.

Backtracking. Keep track of the current position in the input stream. Trying Expr → Expr + Term first:

    Expr
    Expr + Term
    Term + Term
    Factor + Term
    <id> + Term
    <id,x> + Term

Problem: we cannot match the next terminal; the input has "-" where the fringe has "+". We guessed wrong at an earlier step. What should we do now? If we cannot match the next terminal: roll back the productions, choose a different production for Expr, and continue.

Retrying. Input string x - 2 * y; this time try Expr → Expr - Term but expand the second Term too early:

    Expr
    Expr - Term
    Term - Term
    Factor - Term
    <id> - Term
    <id,x> - Term
    <id,x> - Factor
    <id,x> - <num,2>

Problem: there is more input to read (the "* y" remains). This is another cause of backtracking.

Successful parse. Input string x - 2 * y:

    Expr
    Expr - Term
    Term - Term
    Factor - Term
    <id,x> - Term
    <id,x> - Term * Factor
    <id,x> - Factor * Factor
    <id,x> - <num,2> * Factor
    <id,x> - <num,2> * <id,y>

Other possible parses. Problem: termination. The parser can also keep choosing Expr → Expr - Term forever:

    Expr
    Expr - Term
    Expr - Term - Term
    Expr - Term - Term - Term
    ...

A wrong choice leads to infinite expansion (and, more importantly, without consuming any input!). It may not be as obvious as this. Our grammar is left recursive.

Left recursion. Formally, a grammar is left recursive if there exists a nonterminal A such that A ⇒* A α (for some string of symbols α). What does ⇒* mean? Derivation in zero or more steps; for example, with A → B x and B → A y, A derives A y x, so the recursion can be indirect. Bad news: top-down parsers cannot handle left recursion. Good news: we can systematically eliminate left recursion.

Notation. Nonterminals: capital letters A, B, C. Terminals: lowercase, underlined: x, y, z. Some mix of terminals and nonterminals: Greek letters α, β, γ. Example: in A → B + x, viewed as A → B α, we have α = + x.

Eliminating left recursion. Fix this grammar:

    foo → foo α
        | β

The language is β followed by zero or more α. Rewrite it as

    foo → β bar
    bar → α bar
        | ε

where bar is a new nonterminal. The first production gives you one β; the two bar productions give you zero or more α.
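The foo/bar rewrite can be mechanized for the common case. Below is a sketch of my own (not the slides' algorithm) that handles only direct left recursion, splitting each nonterminal's alternatives into left-recursive and non-left-recursive ones and introducing a new primed nonterminal.

    # Sketch: eliminate *direct* left recursion  A -> A alpha | beta
    # by rewriting to  A -> beta A'  and  A' -> alpha A' | epsilon.
    EPSILON = ()   # the empty right-hand side stands for an epsilon production

    def eliminate_direct_left_recursion(grammar):
        new_grammar = {}
        for a, alternatives in grammar.items():
            recursive = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == a]
            others    = [rhs for rhs in alternatives if not rhs or rhs[0] != a]
            if not recursive:
                new_grammar[a] = alternatives
                continue
            a_prime = a + "'"
            new_grammar[a] = [beta + (a_prime,) for beta in others]
            new_grammar[a_prime] = [alpha + (a_prime,) for alpha in recursive] + [EPSILON]
        return new_grammar

    expr_grammar = {
        "Expr":   [("Expr", "+", "Term"), ("Expr", "-", "Term"), ("Term",)],
        "Term":   [("Term", "*", "Factor"), ("Term", "/", "Factor"), ("Factor",)],
        "Factor": [("number",), ("id",)],
    }
    for lhs, rhss in eliminate_direct_left_recursion(expr_grammar).items():
        print(lhs, "->", " | ".join(" ".join(r) if r else "epsilon" for r in rhss))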

Back to expressions. Two cases of left recursion in the expression grammar:

    Expr → Expr + Term | Expr - Term | Term
    Term → Term * Factor | Term / Factor | Factor

How do we fix these?

Eliminating left recursion. Apply the rewrite to both Expr and Term. The resulting grammar is all right recursive; it retains the original language and associativity, but it is not as intuitive to read. A top-down parser for it will always terminate, though it may still backtrack. (There is a lovely algorithm to do this elimination automatically, which we will skip.)

    0  Goal   → Expr
    1  Expr   → Term Expr'
    2  Expr'  → + Term Expr'
    3          | - Term Expr'
    4          | ε
    5  Term   → Factor Term'
    6  Term'  → * Factor Term'
    7          | / Factor Term'
    8          | ε
    9  Factor → number
    10         | identifier

Top-down parsers. Problem: left recursion. Solution: a technique to remove it. What about backtracking? The current algorithm is brute force. Problem: how do we choose the right production? Idea: use the next input token (duh). How? Look at our right-recursive grammar.

Right-recursive grammar. Two productions (1 and 5) offer no choice at all; every other production is uniquely identified by a terminal symbol at the start of its right-hand side. So we can choose the right production by looking at the next input symbol. This is called lookahead. BUT, this can be tricky.

Lookahead. Goal: avoid backtracking by looking at future input symbols, using extra context to make the right choice. How much lookahead is needed? In general, an arbitrary amount is needed for the full class of context-free grammars; there is a fancy-dancy algorithm for that case, the CYK algorithm, which runs in O(n^3) time. Fortunately, many CFGs can be parsed with limited lookahead; this covers most programming languages, though not C++ or Perl.

Top-down parsing. Goal: given productions A → α | β, the parser should be able to choose between α and β when trying to match A. How can the next input token help us decide? Solution (almost): FIRST sets. Informally, FIRST(α) is the set of tokens that could appear as the first symbol in a string derived from α. Definition: x ∈ FIRST(α) iff α ⇒* x γ.

Top-down parsing. Building FIRST sets: we will look at that algorithm later. The LL(1) property: given A → α and A → β, we would like FIRST(A → α) ∩ FIRST(A → β) = ∅. Then the parser can make the right choice with one lookahead token... almost. What are we not handling?

What about ε productions? They complicate the definition of LL(1). Consider A → α and A → β where α may be empty; in that case there is no symbol to identify α. Example:

    1  S → A z
    2  A → x B
    3    | y C
    4    | ε

What is FIRST(4)? It is { ε }. What would tell us we are matching production 4?

If A is empty, what will the next symbol be? It must be one of the symbols that immediately follows an A. Solution: build a FOLLOW set for each symbol that could produce ε. Extra condition for LL(1): FIRST(A → β) must be disjoint from FIRST(A → α) and from FOLLOW(A).

FOLLOW sets. Example: FIRST(2) = { x }, FIRST(3) = { y }, FIRST(4) = { ε }. What can follow A? Look at the context of all uses of A: FOLLOW(A) = { z }. Now we can uniquely identify each production: if we are trying to match an A and the next token is z, then we matched production 4.

More on FIRST and FOLLOW. Notice that FIRST and FOLLOW are sets, and FIRST may contain ε in addition to other symbols. Consider

    S → A z
    A → B C D
    B → x | y | ε
    C → ...

What is FIRST(A → B C D)? It starts from FIRST(B) = { x, y, ε }, and because ε ∈ FIRST(B) it also draws on FIRST(C), and so on. When would we care about FOLLOW(A)? Answer: when FIRST(C) (and the rest of the right-hand side) also contains ε.

LL(1) property. Key idea: build the parse tree top-down and use the lookahead token to pick the next production. Each production must be uniquely identified by the terminal symbols that may appear at the start of strings derived from it. Definition: FIRST+(A → α) is FIRST(α) ∪ FOLLOW(A) if ε ∈ FIRST(α), and FIRST(α) otherwise. Definition: a grammar is LL(1) iff for every pair of productions A → α and A → β, FIRST+(A → α) ∩ FIRST+(A → β) = ∅. (A small sketch of this check in code follows.)
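As promised above, here is a tiny Python sketch of my own (not from the slides) that applies the FIRST+ definition to the small example grammar, using the sets just stated; "eps" stands for ε.

    # Sketch: FIRST+ check for
    #   1  S -> A z    2  A -> x B    3  A -> y C    4  A -> eps
    # with FIRST(2)={x}, FIRST(3)={y}, FIRST(4)={eps}, FOLLOW(A)={z}.
    from itertools import combinations

    FIRST = {2: {"x"}, 3: {"y"}, 4: {"eps"}}
    FOLLOW_A = {"z"}

    def first_plus(prod):
        f = FIRST[prod]
        # FIRST+ adds FOLLOW(A) when the production can derive the empty string
        return (f - {"eps"}) | FOLLOW_A if "eps" in f else f

    sets = {p: first_plus(p) for p in (2, 3, 4)}
    print(sets)    # {2: {'x'}, 3: {'y'}, 4: {'z'}}  -- pairwise disjoint
    # Disjointness means the A-productions satisfy the LL(1) condition.
    assert all(not (sets[p] & sets[q]) for p, q in combinations(sets, 2))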

Parsing an LL(1) grammar. Given an LL(1) grammar, the code to recognize each production is simple and fast. Given A → β1 | β2 | β3, with FIRST+(βi) ∩ FIRST+(βj) = ∅ for all i ≠ j:

    /* find rule for A */
    if (current token ∈ FIRST+(β1))
        select A → β1
    else if (current token ∈ FIRST+(β2))
        select A → β2
    else if (current token ∈ FIRST+(β3))
        select A → β3
    else
        report an error and return false

Top-down parsing. We build the parse tree top down against a token stream t1 t2 t3 ... For a grammar such as

    G → A α B ζ
    A → β γ δ
    B → C D | F

when the parser is trying to expand B, it asks: is the remaining input derivable from C D? Consider all possible strings derivable from C D: what is the set of tokens that can appear at the start? The next token t is compared against FIRST(C D), FIRST(F), and FOLLOW(B). Are these sets disjoint?

FIRST and FOLLOW sets. FIRST(α): for some α ∈ (T ∪ NT)*, define FIRST(α) as the set of tokens that appear as the first symbol in some string that derives from α; that is, x ∈ FIRST(α) iff α ⇒* x γ, for some γ. Note: it may include ε. FOLLOW(A): for some A ∈ NT, define FOLLOW(A) as the set of symbols that can occur immediately after A in a valid sentence. FOLLOW(G) = { EOF }, where G is the start symbol.

Computing FIRST sets. Idea: use the FIRST sets of the right-hand side of the production. Cases for A → B1 B2 B3: FIRST(A → B1 B2 B3) starts from FIRST(B1). What does FIRST(B1) mean? The union of FIRST(B1 → γ) over all of B1's productions. What if ε ∈ FIRST(B1)? Then FIRST(A → B1 B2 B3) also includes FIRST(B2), repeating as needed. What if ε ∈ FIRST(Bi) for all i? Then FIRST(A → B1 B2 B3) includes { ε } as well (we leave the treatment of ε for later).

Algorithm, for one production p = A → β:

    if (β is a terminal t)
        FIRST(p) = { t }
    else if (β == ε)
        FIRST(p) = { ε }
    else                      // β = B1 B2 ... Bk
        i = 0
        do {
            i = i + 1
            FIRST(p) += FIRST(Bi) - { ε }
        } while (ε ∈ FIRST(Bi) && i < k)
        if (ε ∈ FIRST(Bi) && i == k)
            FIRST(p) += { ε }

Why do we need to remove ε from FIRST(Bi)?

Algorithm. For one production A → B1 B2 B3 we compute FIRST(A → B1 B2 B3) using FIRST(B1), FIRST(B2), and so on. How do we get FIRST(B)? What kind of algorithm does this suggest? A recursive one, like a depth-first search of the productions. Problem: what about recursion in the grammar, as in A → x B y and B → z A w?
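The per-production rule extends to whole-grammar FIRST sets once we iterate to a fixpoint (next slide). Here is a compact Python sketch of my own, using the string "eps" for ε and treating any symbol without productions as a terminal, applied to the right-recursive expression grammar.

    # Sketch: fixpoint computation of FIRST sets.
    # A grammar maps each nonterminal to a list of right-hand sides (tuples);
    # the empty tuple denotes an epsilon production.
    def compute_first(grammar):
        nts = set(grammar)
        first = {a: set() for a in nts}

        def first_of_rhs(rhs):
            result = set()
            for sym in rhs:
                if sym not in nts:            # terminal
                    result.add(sym)
                    return result
                result |= first[sym] - {"eps"}
                if "eps" not in first[sym]:   # sym cannot vanish, stop here
                    return result
            result.add("eps")                 # every symbol could derive epsilon
            return result

        changed = True
        while changed:                        # repeat until convergence (fixpoint)
            changed = False
            for a, alternatives in grammar.items():
                for rhs in alternatives:
                    f = first_of_rhs(rhs)
                    if not f <= first[a]:
                        first[a] |= f
                        changed = True
        return first

    right_recursive = {
        "Goal":   [("Expr",)],
        "Expr":   [("Term", "Expr'")],
        "Expr'":  [("+", "Term", "Expr'"), ("-", "Term", "Expr'"), ()],
        "Term":   [("Factor", "Term'")],
        "Term'":  [("*", "Factor", "Term'"), ("/", "Factor", "Term'"), ()],
        "Factor": [("number",), ("id",)],
    }
    print(compute_first(right_recursive))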

Algorithm. Solution to the recursion problem: start with FIRST(B) empty, compute FIRST(A) using the empty FIRST(B), then go back and compute FIRST(B). What if it is no longer empty? Then we recompute FIRST(A). What if the new FIRST(A) differs from the old FIRST(A)? Then we recompute FIRST(B) again. When do we stop? When no more changes occur: we reach convergence, where FIRST(A) and FIRST(B) both satisfy their equations. This is another fixpoint algorithm.

Using fixpoints:

    forall p: FIRST(p) = {}
    while (FIRST sets are changing)
        pick a random p
        compute FIRST(p)

Can we be smarter? Yes: visit the productions in a special order, a reverse postorder depth-first search, visiting all children (all right-hand sides) before visiting the left-hand side, whenever possible.

Example. For the right-recursive expression grammar:

    0  Goal   → Expr
    1  Expr   → Term Expr'
    2  Expr'  → + Term Expr'
    3          | - Term Expr'
    4          | ε
    5  Term   → Factor Term'
    6  Term'  → * Factor Term'
    7          | / Factor Term'
    8          | ε
    9  Factor → number
    10         | identifier

FIRST(2) = { + }, FIRST(3) = { - }, FIRST(4) = { ε }, FIRST(6) = { * }, FIRST(7) = { / }, FIRST(8) = { ε }. What is FIRST(1)? FIRST(1) = FIRST(5) = FIRST(0) = FIRST(9, 10) = { number, identifier }.

Computing FOLLOW sets. Idea: push FOLLOW sets down, using FIRST where needed. Cases for A → B1 B2 ... Bk: what is FOLLOW(B1)? FOLLOW(B1) = FIRST(B2); in general, FOLLOW(Bi) = FIRST(Bi+1). What about FOLLOW(Bk)? FOLLOW(Bk) = FOLLOW(A). What if ε ∈ FIRST(Bk)? Then FOLLOW(Bk-1) also includes FOLLOW(A), and this extends to Bk-2, and so on.

Example. For the grammar above: FOLLOW(Goal) = { EOF }; FOLLOW(Expr) = FOLLOW(Goal) = { EOF }; FOLLOW(Expr') = FOLLOW(Expr) = { EOF }. What is FOLLOW(Term)? FOLLOW(Term) += FIRST(Expr') = { +, -, ε }; because of the ε it also includes FOLLOW(Expr), so FOLLOW(Term) = { +, -, EOF }. Similarly FOLLOW(Term') += FOLLOW(Term). What is FOLLOW(Factor)? FOLLOW(Factor) += FIRST(Term') = { *, /, ε }, which together with FOLLOW(Term) gives FOLLOW(Factor) = { *, /, +, -, EOF }.
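FOLLOW can be computed in the same fixpoint style. This sketch of my own reuses the grammar representation and the "eps" convention from the FIRST sketch above; it pushes FOLLOW(A) onto the tail of each right-hand side and FIRST of the remaining suffix onto everything before it.

    # Sketch: fixpoint computation of FOLLOW sets, given FIRST sets
    # computed as in the previous sketch ("eps" marks epsilon).
    def compute_follow(grammar, first, start):
        nts = set(grammar)
        follow = {a: set() for a in nts}
        follow[start].add("EOF")

        def first_of_seq(seq):
            result = set()
            for sym in seq:
                sym_first = first[sym] if sym in nts else {sym}
                result |= sym_first - {"eps"}
                if "eps" not in sym_first:
                    return result
            result.add("eps")                  # the whole suffix can vanish
            return result

        changed = True
        while changed:
            changed = False
            for a, alternatives in grammar.items():
                for rhs in alternatives:
                    for i, sym in enumerate(rhs):
                        if sym not in nts:
                            continue
                        tail_first = first_of_seq(rhs[i + 1:])
                        add = tail_first - {"eps"}
                        if "eps" in tail_first:    # the tail can vanish, so
                            add |= follow[a]       # FOLLOW(A) follows sym too
                        if not add <= follow[sym]:
                            follow[sym] |= add
                            changed = True
        return follow

    # Hypothetical usage, with the grammar and FIRST sets from the previous sketch:
    # follow = compute_follow(right_recursive, compute_first(right_recursive), "Goal")
    # Expected: FOLLOW(Expr) = FOLLOW(Expr') = {EOF}, FOLLOW(Term) = {+, -, EOF}, ...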

Computing FOLLOW sets:

    FOLLOW(G) ← { EOF }
    for each A ∈ NT, FOLLOW(A) ← Ø
    while (FOLLOW sets are still changing)
        for each p ∈ P, of the form A → β1 β2 ... βk
            FOLLOW(βk) ← FOLLOW(βk) ∪ FOLLOW(A)
            TRAILER ← FOLLOW(A)
            for i ← k down to 2
                if ε ∈ FIRST(βi) then
                    FOLLOW(βi-1) ← FOLLOW(βi-1) ∪ { FIRST(βi) - { ε } } ∪ TRAILER
                else
                    FOLLOW(βi-1) ← FOLLOW(βi-1) ∪ FIRST(βi)
                    TRAILER ← Ø

LL(1) property. Definition: a grammar is LL(1) iff for every pair of productions A → α and A → β, FIRST+(A → α) ∩ FIRST+(A → β) = ∅. Problem: what if my grammar is not LL(1)? We may be able to fix it with transformations. Example:

    A → α β1 | α β2 | α β3        becomes        A → α Z
                                                  Z → β1 | β2 | β3

Left factoring. Expression example:

    Factor → identifier
           | identifier [ ExprList ]
           | identifier ( ExprList )

Here FIRST+ of all three alternatives is { identifier }: on seeing an identifier there is no basis for choice. After left factoring:

    Factor → identifier Post
    Post   → [ ExprList ]
           | ( ExprList )
           | ε

In this form the grammar has the LL(1) property: FIRST+(Factor → identifier Post) = { identifier }, FIRST+(Post → [ ExprList ]) = { [ }, FIRST+(Post → ( ExprList )) = { ( }, and FIRST+(Post → ε) = FOLLOW(Post) = { the operators, ... }. Graphically: before factoring, the identifier gives no basis for choice; after factoring, the next word determines the choice.

Left factoring. Question: using left factoring and left-recursion elimination, can we turn an arbitrary CFG into a form that meets the LL(1) condition? Answer: no. Given a CFG that does not meet the LL(1) condition, it is undecidable whether an equivalent LL(1) grammar exists. Example: { a^n 0 b^n | n ≥ 1 } ∪ { a^n 1 b^2n | n ≥ 1 } has no LL(1) grammar (for instance, aaa0bbb and aaa1bbbbbb are both in the language).

Limits of LL(1). There is no LL(1) grammar for the language above. A CFG for it is:

    G → a A b
      | a B bb
    A → a A b
      | 0
    B → a B bb
      | 1

Problem: the parser needs to see an unbounded number of a characters before it can determine whether it is in the A group or the B group.

Predictive parsing. A predictive parser can predict the correct expansion using lookahead and the FIRST and FOLLOW sets. Two kinds of predictive parsers: recursive descent, often hand-written; and table-driven, where the tables are generated from the FIRST and FOLLOW sets. (A small left-factoring sketch in code follows.)
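Left factoring is mechanical enough to sketch in a few lines. The toy function below is my own illustration, not the slides' algorithm (which appears at the end of the lecture): it factors only on a single shared leading symbol rather than the longest common prefix, and it invents a hypothetical nonterminal name, Factor_Post.

    # Sketch: left-factor alternatives that share a common first symbol.
    # Factor -> identifier | identifier [ ExprList ] | identifier ( ExprList )
    # becomes Factor -> identifier Post ; Post -> [ ExprList ] | ( ExprList ) | eps
    def left_factor_once(lhs, alternatives):
        by_prefix = {}
        for rhs in alternatives:
            by_prefix.setdefault(rhs[0] if rhs else None, []).append(rhs)
        new_rules = {lhs: []}
        for prefix, group in by_prefix.items():
            if prefix is None or len(group) == 1:
                new_rules[lhs].extend(group)         # nothing to factor
                continue
            new_nt = lhs + "_Post"                   # hypothetical new nonterminal
            new_rules[lhs].append((prefix, new_nt))
            new_rules[new_nt] = [rhs[1:] for rhs in group]   # () stands for eps
        return new_rules

    factor = [("identifier",),
              ("identifier", "[", "ExprList", "]"),
              ("identifier", "(", "ExprList", ")")]
    for nt, rhss in left_factor_once("Factor", factor).items():
        print(nt, "->", " | ".join(" ".join(r) if r else "eps" for r in rhss))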

Limits of LL() No LL() grammar for this language: {a n 0 b n n } {a n b n n } has no LL() grammar Predictive parsing Predictive parsing The parser can predict the correct expansion Using lookahead and FIRST and FOLLOW sets G a A b a B bb A a A b 0 B a B bb Problem: need an unbounded number of a characters before you can deine whether you are in the A group or the B group Two kinds of predictive parsers Recursive descent Often handwritten Tabledriven Generate tables from First and Follow sets Tufts University Computer Science Tufts University Computer Science 0 Recursive descent Example code 0 goal + or * or / or or number ( ) This produces a parser with six mutually recursive routines: Goal Expr Expr Term Term Factor Each recognizes one NT or T The descent refers to the direction in which the parse tree is built. Goal symbol: main() /* Match goal > */ tok = nexttoken(); if (() && tok == EOF) then proceed to next step; else return false; Toplevel ession () /* Match > */ if (() && ()); return true; else return false; Tufts University Computer Science Tufts University Computer Science Example code Match () /* Match > + */ /* Match > */ if (tok == + or tok == ) tok = nexttoken(); if (()) then if (()) return true; else return false; /* Match > empty */ return true; Check FIRST and FOLLOW sets to distinguish Example code or() /* Match or > ( ) */ if (tok == ( ) tok = nexttoken(); if (() && tok == ) ) return true; else syntax error: expecting ) return false /* Match or > num */ if (tok is a num) return true /* Match or > id */ if (tok is an id) return true; Tufts University Computer Science Tufts University Computer Science

Top-down parsing. So far the parser gives us a yes-or-no answer. Is that all we want? No: we want to build the parse tree. How? Add actions to the matching routines that create a node for each production. How do we assemble the tree? Notice that the recursive calls match the shape of the tree. Idea: use a stack. Each routine pops off the children it needs, creates its own node, and pushes that node back on the stack.

Building a parse tree with stack operations:

    Expr()
        /* Match Expr → Term Expr' */
        if (Term() && EPrime())
            eprime_node = pop();
            term_node = pop();
            expr_node = new Node(term_node, eprime_node);
            push(expr_node);
            return true;
        else return false;

Generating a top-down parser. For the right-recursive expression grammar there are two pieces: select the right right-hand side, and satisfy each part of it. The first piece is FIRST+ for each rule, a mapping NT × Σ → rule#. Look familiar?

The second piece keeps track of progress, like a depth-first search, using a stack. Idea: push Goal on the stack, then repeatedly pop the stack and either match a terminal symbol or apply the NT mapping and push the right-hand side onto the stack.

Table-driven approach. Encode the mapping in a table, with a row for each nonterminal and a column for each terminal symbol: Table[NT, symbol] = rule# if symbol ∈ FIRST+(NT → rhs(#)). For the expression grammar the table looks roughly like this (ε-productions appear as "do nothing" entries; every other blank entry is an error):

              +  -              *  /            id, num
    Expr      error             error           1: Expr → Term Expr'
    Expr'     2 or 3            error           error
    Term      error             error           5: Term → Factor Term'
    Term'     8 (do nothing)    6 or 7          error
    Factor    error             error           9 or 10

(A sketch of building this table in code follows.)
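Building the table is a short loop over productions once FIRST and FOLLOW are available. The sketch below is my own; it reuses the representation and the "eps" convention from the FIRST and FOLLOW sketches above, stores the right-hand side itself rather than a rule number for convenience, and raises an error on a conflict, i.e. when the grammar is not LL(1).

    # Sketch: build an LL(1) table  table[A][terminal] = right-hand side.
    def build_ll1_table(grammar, first, follow):
        nts = set(grammar)
        table = {a: {} for a in nts}

        def first_of_seq(seq):
            result = set()
            for sym in seq:
                f = first[sym] if sym in nts else {sym}
                result |= f - {"eps"}
                if "eps" not in f:
                    return result
            return result | {"eps"}

        for a, alternatives in grammar.items():
            for rhs in alternatives:
                lookaheads = first_of_seq(rhs)
                if "eps" in lookaheads:                 # FIRST+ adds FOLLOW(A)
                    lookaheads = (lookaheads - {"eps"}) | follow[a]
                for t in lookaheads:
                    if t in table[a]:
                        raise ValueError(f"not LL(1): conflict at [{a}, {t}]")
                    table[a][t] = rhs                   # missing entries mean error
        return table

    # Hypothetical usage, with the earlier sketches:
    # table = build_ll1_table(right_recursive, first, follow)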

Code for the table-driven parser:

    push the start symbol, G, onto Stack
    top ← top of Stack
    loop forever
        if top = EOF and token = EOF
            then break & report success
        if top is a terminal
            then if top matches token
                then pop Stack              // recognized top
                     token ← next_token()
                else ...                    // error
        else                                // top is a nonterminal
            if TABLE[top, token] is A → B1 B2 ... Bk
                then pop Stack              // get rid of A
                     push Bk, Bk-1, ..., B1 // in that order
        top ← top of Stack

(The else branches for the error conditions are missing. A Python sketch of this driver appears after the left-factoring algorithm below.)

Next time: bottom-up parsers. Why? They are more powerful, but the algorithm is more complex (you cannot write one by hand). They are widely used: yacc, bison, JavaCUP.

Left factoring. Algorithm:

    for each A ∈ NT, find the longest prefix α that occurs in two or more right-hand sides of A
    if α ≠ ε then
        replace all of the A productions
            A → α β1 | α β2 | ... | α βn | γ
        with
            A → α Z | γ
            Z → β1 | β2 | ... | βn
        where Z is a new element of NT
    Repeat until no common prefixes remain.
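As noted above, the driver loop translates almost directly into Python. This is my own sketch; it assumes a table of the shape built by the previous sketch (a nonterminal and a lookahead token map to a right-hand side) and a token stream of terminal categories such as "id" and "number".

    # Sketch: table-driven LL(1) parser driver.
    # `table[A][token]` gives the right-hand side to apply (see previous sketch);
    # terminals on the stack must match the input exactly.
    def ll1_parse(table, nonterminals, tokens, start):
        tokens = list(tokens) + ["EOF"]
        pos = 0
        stack = ["EOF", start]                 # start symbol ends up on top
        while stack:
            top = stack.pop()
            token = tokens[pos]
            if top == "EOF" and token == "EOF":
                return True                    # success
            if top not in nonterminals:        # terminal: must match the input
                if top != token:
                    return False               # error: unexpected token
                pos += 1
                continue
            rhs = table[top].get(token)
            if rhs is None:
                return False                   # error: no table entry for [top, token]
            stack.extend(reversed(rhs))        # push Bk ... B1 so B1 is on top
        return False

    # Hypothetical usage, with the table from the previous sketch; the tokens are
    # terminal categories, matching the grammar's terminals:
    # ll1_parse(table, set(right_recursive), ["id", "-", "number", "*", "id"], "Goal")  # True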