Syntax Analysis: Top-Down Parsing
CMPSC 470 Lecture 05

Topics:
    Overview
    Recursive-descent parser
    First and Follow
    Recursive predictive parser
    Error recovery

A. Overview

Top-down parsing constructs the parse tree for an input string starting from the root and creates the nodes of the tree in preorder (depth-first: visit the node, then its children from left to right). Equivalently, top-down parsing can be viewed as finding a leftmost derivation for the input.

Example) Consider the following grammar:

    E  → T E′
    E′ → + T E′ | ε
    T  → F T′
    T′ → * F T′ | ε
    F  → id

A top-down parser creates the parse tree by repeating the following steps:

1. Determine the production to be applied for the current nonterminal, say A.
2. Once an A-production is selected, match the terminal symbols in the production body against the input string, advancing to the next input token after each match.
Top-down parsers include the recursive-descent parser and the recursive predictive parser, which uses an LL(1) grammar.

B. Recursive-Descent Parser

In a recursive-descent parser, each nonterminal becomes a procedure (or function). In general this approach requires backtracking. The following example shows how such a parser can be implemented and how backtracking is handled.

Parser function:

Example) Consider the following grammar:

    S → c A d
    A → a A b | a

Its corresponding recursive-descent parser can be:

S()
    1. x ← input pointer location
    2. // production S → c A d
    3. match c with the input symbol (current token) and advance the input pointer (move to the next token)
    4. call A()
    5. match d, and advance
    6. if all of lines 3-5 succeed, return success
    7. // no more production rules
    8. return fail

A()
    1. x ← input pointer location
    2. // production A → a A b
    3. match a, and advance
    4. call A()
    5. match b, and advance
    6. if all of lines 3-5 succeed, return success
    7. // production A → a
    8. reset the input pointer location to x
    9. match a, and advance
    10. if line 9 succeeds, return success
    11. // no more production rules
    12. return fail
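Parsing:

Parsing starts by calling the procedure for the start symbol, S(). Because the input consumed by a failed alternative must be restored before the next alternative is tried, the parser requires backtracking.

Parsing steps for the input w = "cad":

1. Call S()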
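The backtracking steps can be reproduced with a small program. The following is a minimal Python sketch, not the lecture's own code: the names Parser, match, parse, and trace are hypothetical and chosen only for illustration. It assumes the grammar S → c A d, A → a A b | a and treats each character of the input as one token.

```python
# Backtracking recursive-descent recognizer (illustrative sketch) for
#   S -> c A d
#   A -> a A b | a
# All names below are assumptions made for this example.

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0        # input pointer
        self.trace = []     # log of procedure calls and backtracking steps

    def match(self, ch):
        """Match one terminal and advance; return False on a mismatch."""
        if self.pos < len(self.text) and self.text[self.pos] == ch:
            self.pos += 1
            return True
        return False

    def S(self):
        self.trace.append(f"S() at {self.pos}")
        x = self.pos                                  # remember the input pointer
        if self.match("c") and self.A() and self.match("d"):
            return True                               # S -> c A d succeeded
        self.pos = x                                  # no more productions
        return False

    def A(self):
        self.trace.append(f"A() at {self.pos}")
        x = self.pos
        if self.match("a") and self.A() and self.match("b"):
            return True                               # A -> a A b succeeded
        self.trace.append(f"backtrack to {x}")
        self.pos = x                                  # undo the consumed input
        if self.match("a"):
            return True                               # A -> a succeeded
        self.pos = x
        return False


def parse(text):
    """Return (accepted, trace); the whole input must be consumed."""
    p = Parser(text)
    accepted = p.S() and p.pos == len(text)
    return accepted, p.trace
```

For the input "cad", parse("cad") reports success with the trace S() at 0, A() at 1, A() at 2, backtrack to 2, backtrack to 1. The inner call to A() fails when it sees d, so the outer call resets the input pointer and succeeds with A → a.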
Example 2) How can the following production be implemented?

    E′ → + T E′ | ε

E′()
    1. x ← input pointer location
    2. // production E′ → + T E′
    3. match +, and advance
    4. call T()
    5. call E′()
    6. if all of lines 3-5 succeed, return success
    7. // production E′ → ε
    8. reset the input pointer location to x and return success (ε matches without consuming any input)

Note:

C. First and Follow

A recursive-descent parser requires backtracking, which is time-consuming. This can be avoided by using a recursive predictive parser. First() and Follow() are functions used in constructing both top-down (recursive predictive) and bottom-up parsers that do not require backtracking. In top-down parsing, First and Follow help the parser choose which production to apply.
Definition: First(α)

First(α) is the set of terminals that begin strings derived from α. If α ⇒* ε, then ε is also in First(α).

Example) Given the grammar A → aa | ba | a | b, the language is L(A) = { ______ } and First(A) = { ______ }, since ______

    Grammar: A → a
        First(A) = ______
    Grammar: A → a | ε
        First(A) = ______
    Grammar: A → B a | ε,  B → b
        First(A) = ______
    Grammar: A → A a | b | ε
        First(A) = ______
    Grammar: A → B C a | ε,  B → C b | ε,  C → c | ε
        First(A) = ______
    Grammar: A → a | ε,  B → b | ε,  C → c | ε
        First(ABC) = ______

Determine FIRST(X)

1. If X is a terminal, FIRST(X) = {X}.
2. If X → Y1 Y2 … Yk is a production, determine FIRST(X) as follows:
    1. Add all non-ε symbols of FIRST(Y1) to FIRST(X).
    2. If ε ∈ FIRST(Y1), add all non-ε symbols of FIRST(Y2) to FIRST(X).
    3. If ε ∈ FIRST(Y1) and ε ∈ FIRST(Y2), add all non-ε symbols of FIRST(Y3) to FIRST(X).
    …
    n. If ε ∈ FIRST(Y1), …, ε ∈ FIRST(Yk), add ε to FIRST(X).
3. If X → ε is a production, add ε to FIRST(X).
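The FIRST rules can be applied repeatedly until no set changes. The following is a minimal Python sketch of that fixed-point iteration, under the assumption that a grammar is stored as a dictionary from each nonterminal to a list of production bodies, where an empty body stands for ε. The names first_sets, first_of, and expr_grammar are hypothetical, and the example grammar is the expression grammar with F → ( E ) | id.

```python
EPS = "ε"

def first_of(symbols, first):
    """FIRST of a symbol string Y1 Y2 ... Yk, given the FIRST sets found so far."""
    result = set()
    for y in symbols:
        y_first = first[y] if y in first else {y}   # FIRST of a terminal is itself
        result |= y_first - {EPS}
        if EPS not in y_first:
            return result          # Yi cannot vanish, so later symbols do not matter
    result.add(EPS)                # every Yi can derive ε (or the string is empty)
    return result

def first_sets(grammar):
    """grammar: dict nonterminal -> list of bodies (lists of symbols, [] for ε)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                 # repeat until no FIRST set grows any more
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                before = len(first[nt])
                first[nt] |= first_of(body, first)
                changed |= len(first[nt]) != before
    return first

# Example grammar (an assumption for this sketch): the expression grammar.
expr_grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = first_sets(expr_grammar)
```

With this grammar, FIRST["E"], FIRST["T"], and FIRST["F"] all come out as {(, id}, while FIRST["E'"] is {+, ε} and FIRST["T'"] is {*, ε}. The iteration also terminates on a left-recursive rule such as A → A a | b | ε, because a set only ever grows.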
Concept) How to use First?

Consider the following grammar G:

    A → B | C
    B → b | c
    C → d | e

In G, FIRST(B) = {b, c} and FIRST(C) = {d, e} are disjoint sets. When parsing nonterminal A, if the next input symbol is b or c, the parser selects the production A → B. If the next input symbol is d or e, the parser selects the production A → C.

Definition: Follow(A)

FOLLOW(A), for a nonterminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form. That is, FOLLOW(A) is the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ for some α and β. If A can be the rightmost symbol in some sentential form (S ⇒* αA), then $ ∈ FOLLOW(A), where $ is a special endmarker symbol.

    Grammar: S → A b,  A → a | ε
        FOLLOW(A) = ______
    Grammar: S → b A,  A → a | ε
        FOLLOW(A) = ______
    Grammar: S → a B C d,  B → b | ε,  C → c | ε
        FOLLOW(B) = ______
        FOLLOW(C) = ______
    Grammar: S → a B C e,  B → b | ε,  C → c | ε
        FIRST(C) = ______
        FOLLOW(B) = ______
    Grammar: S → a B C f,  B → b | c | ε,  C → d | e | ε
        FIRST(C) = ______
        FOLLOW(B) = ______
    Grammar: S → a B C D,  B → b | ε,  C → c | ε,  D → d | ε
        FIRST(C) = ______
        FIRST(D) = ______
        FOLLOW(B) = ______

Determine FOLLOW(A)

1. Place $ in FOLLOW(S), where S is the start symbol.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where ε ∈ FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B).

Note:
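The three FOLLOW rules can also be applied until nothing changes. The following Python sketch is one possible way to do this, not a definitive implementation. It assumes the same grammar representation as a dictionary of production bodies (an empty body for ε), and the names follow_sets, first_sets, first_of, and abcd_grammar are hypothetical. The example grammar S → a B C d, B → b | ε, C → c | ε is used only for illustration.

```python
EPS, END = "ε", "$"

def first_of(symbols, first):
    """FIRST of a symbol string; terminals are symbols without a FIRST entry."""
    out = set()
    for y in symbols:
        f = first.get(y, {y})
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    while True:
        size = sum(map(len, first.values()))
        for nt, bodies in grammar.items():
            for body in bodies:
                first[nt] |= first_of(body, first)
        if size == sum(map(len, first.values())):
            return first                     # no set grew: fixed point reached

def follow_sets(grammar, start):
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                   # rule 1: $ is in FOLLOW(S)
    while True:
        size = sum(map(len, follow.values()))
        for a, bodies in grammar.items():
            for body in bodies:
                for i, b in enumerate(body):
                    if b not in grammar:
                        continue             # FOLLOW is defined for nonterminals only
                    beta_first = first_of(body[i + 1:], first)
                    follow[b] |= beta_first - {EPS}   # rule 2
                    if EPS in beta_first:             # rule 3: β is empty or nullable
                        follow[b] |= follow[a]
        if size == sum(map(len, follow.values())):
            return follow

# Example grammar (an assumption for this sketch).
abcd_grammar = {
    "S": [["a", "B", "C", "d"]],
    "B": [["b"], []],
    "C": [["c"], []],
}
FOLLOW = follow_sets(abcd_grammar, "S")
```

Here C can derive ε, so the terminal d that follows C in S → a B C d can also follow B. Rule 2 handles this case because FIRST is taken over the whole string β = C d rather than over C alone.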
D. Recursive Predictive Parser

a) Overview

Consider the following grammar:

    stmt → if ( expr ) stmt else stmt     (α)
         | while ( expr ) stmt            (β)
         | { stmt_list }                  (γ)

Given the next input symbol lah (the lookahead token), a production can be predicted and selected using the following rules:

1. If lah is if ∈ FIRST(α), then choose stmt → α.
2. If lah is while ∈ FIRST(β), then choose stmt → β.
3. If lah is { ∈ FIRST(γ), then choose stmt → γ.

The prediction rules can be written as a parsing table M[A, a]:

    Nonterminal   Input symbol (lookahead)   Entry M[A, a]
    stmt          if                         stmt → if ( expr ) stmt else stmt
    stmt          while                      stmt → while ( expr ) stmt
    stmt          {                          stmt → { stmt_list }

During recursive-descent parsing, if the current nonterminal is stmt and the input symbol lah is if, while, or {, the correct production can be selected from the prediction table M above, so no backtracking is needed.

b) LL(1) Grammar

A predictive parser (a recursive-descent parser that needs no backtracking) can be constructed for any LL(1) grammar. LL(1) stands for:

    L: the input is scanned from left to right
    L: a leftmost derivation is produced
    1: one input symbol of lookahead is used at each step to make parsing decisions
A grammar G is LL(1) only if

a. G is neither left-recursive nor ambiguous (a necessary condition, but not a sufficient one), and
b. the following conditions hold whenever A → α | β are two distinct productions of G:
    b1. α and β do not both derive strings beginning with the same terminal a.
    b2. At most one of α and β can derive the empty string.
    b3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).

Conditions b1 to b3 together are also sufficient: a grammar that satisfies them for every pair of alternatives is LL(1).
c) Construct predictive parse table

Idea) Given productions A → α | β:

1. If the next input symbol lah (the lookahead token) is in FIRST(α), then choose A → α.
2. If α = ε or α ⇒* ε, and lah ∈ FOLLOW(A) (including the case lah = $ with $ ∈ FOLLOW(A)), then also choose A → α.

Construction algorithm:

INPUT: Grammar G

    Example)
    E  → T E′
    E′ → + T E′ | ε
    T  → F T′
    T′ → * F T′ | ε
    F  → ( E ) | id

OUTPUT: Parsing table M

METHOD: For each production A → α, do the following:

1. Determine FIRST(α) and FOLLOW(A).
2. For each terminal a ∈ FIRST(α), add A → α to M[A, a].
3. If ε ∈ FIRST(α), then for each terminal b ∈ FOLLOW(A), add A → α to M[A, b]. If ε ∈ FIRST(α) and $ ∈ FOLLOW(A), add A → α to M[A, $] as well.
4. If, after performing the steps above, there is no production at all in M[A, a], then set M[A, a] to error (normally represented by an empty entry in the table).

The final parsing table M is:

                  Input symbol (lookahead)
    Nonterminal   id    +    *    (    )    $
    E
    E′
    T
    T′
    F

This table M means that:
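Steps 2 to 4 of the construction can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the lecture's implementation: the FIRST set of each production body and the FOLLOW set of each nonterminal for the expression grammar E → T E′, E′ → + T E′ | ε, T → F T′, T′ → * F T′ | ε, F → ( E ) | id are worked out by hand and supplied as data, and the names PRODUCTIONS, FOLLOW, and build_table are hypothetical.

```python
EPS, END = "ε", "$"

# (head, body, FIRST(body)) for each production; the sets are hand-computed assumptions.
PRODUCTIONS = [
    ("E",  ("T", "E'"),      {"(", "id"}),
    ("E'", ("+", "T", "E'"), {"+"}),
    ("E'", (),               {EPS}),
    ("T",  ("F", "T'"),      {"(", "id"}),
    ("T'", ("*", "F", "T'"), {"*"}),
    ("T'", (),               {EPS}),
    ("F",  ("(", "E", ")"),  {"("}),
    ("F",  ("id",),          {"id"}),
]
FOLLOW = {
    "E":  {")", END},
    "E'": {")", END},
    "T":  {"+", ")", END},
    "T'": {"+", ")", END},
    "F":  {"+", "*", ")", END},
}

def build_table(productions, follow):
    """Map (nonterminal, lookahead) to the list of production bodies placed there."""
    table = {}
    for head, body, first_alpha in productions:
        lookaheads = first_alpha - {EPS}          # step 2: terminals in FIRST(α)
        if EPS in first_alpha:
            lookaheads |= follow[head]            # step 3: FOLLOW(A), which may contain $
        for a in lookaheads:
            table.setdefault((head, a), []).append(body)
    return table                                  # step 4: a missing key is an error entry

M = build_table(PRODUCTIONS, FOLLOW)
```

Every entry of M holds exactly one body for this grammar. For example, M[("E'", ")")] contains only the empty body (E′ → ε), and the key ("E", "+") is absent, which marks an error entry. A list with two or more bodies would indicate that the grammar is not LL(1).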
Note:

For every LL(1) grammar, each parsing-table entry uniquely identifies a production or signals an error. If a grammar is left-recursive or ambiguous, then at least one entry of the parse table M will contain two or more productions.

Some languages have no LL(1) grammar at all, even after left-recursion elimination and left factoring are applied. The dangling-else problem is one example.

Dangling-else problem:

The following is an abstract form of the dangling-else grammar, after left-recursion elimination and left factoring have been applied:

    S  → i E t S S′ | a
    S′ → e S | ε
    E  → b

whose parse table is:

                  Input symbol (lookahead)
    Nonterminal   a    b    e    i    t    $
    S
    S′
    E
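One way to see where the dangling-else grammar fails is to compare, for each pair of alternatives, the lookahead symbols that would select them in the parse table. The Python sketch below does this for the grammar S → i E t S S′ | a, S′ → e S | ε, E → b. It is an illustration only: the FIRST and FOLLOW sets are worked out by hand and supplied as data, and the names ALTERNATIVES, FOLLOW, and ll1_conflicts are hypothetical.

```python
from itertools import combinations

EPS, END = "ε", "$"

# FIRST of each alternative, keyed by its text; hand-computed assumptions.
ALTERNATIVES = {
    "S":  {"i E t S S'": {"i"}, "a": {"a"}},
    "S'": {"e S": {"e"}, "ε": {EPS}},
    "E":  {"b": {"b"}},
}
FOLLOW = {"S": {"e", END}, "S'": {"e", END}, "E": {"t"}}

def ll1_conflicts(alternatives, follow):
    """Return (nonterminal, alt1, alt2, clashing lookaheads) for each overlap."""
    conflicts = []
    for head, alts in alternatives.items():
        for (a, first_a), (b, first_b) in combinations(alts.items(), 2):
            # Lookaheads that select each alternative: FIRST without ε,
            # plus FOLLOW(head) when the alternative can derive ε.
            select_a = (first_a - {EPS}) | (follow[head] if EPS in first_a else set())
            select_b = (first_b - {EPS}) | (follow[head] if EPS in first_b else set())
            if select_a & select_b:
                conflicts.append((head, a, b, select_a & select_b))
    return conflicts

CONFLICTS = ll1_conflicts(ALTERNATIVES, FOLLOW)
```

The only overlap reported is for S′ on the lookahead e: both S′ → e S and S′ → ε are candidates for M[S′, e], because e is in FIRST(e S) and also in FOLLOW(S′). This is the multiply defined entry that prevents the grammar from being LL(1).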
d) Recursive Predictive Parser

Given the following predictive parse table:

                  Input symbol (lookahead)
    Nonterminal   id          +             *              (            )          $
    E             E → T E′                                 E → T E′
    E′                        E′ → + T E′                               E′ → ε     E′ → ε
    T             T → F T′                                 T → F T′
    T′                        T′ → ε        T′ → * F T′                 T′ → ε     T′ → ε
    F             F → id                                   F → ( E )

its parser can be built easily as follows:

    // lah is the lookahead token; ID stands for the token code of id.
    // match(t) checks that lah == t and advances to the next token.
    // Eprime() and Tprime() implement the nonterminals E′ and T′.

    void E() {
        if      (lah == ID)  { T(); Eprime(); }
        else if (lah == '(') { T(); Eprime(); }
        else report("syntax error");
    }

    void Eprime() {
        if      (lah == '+') { match('+'); T(); Eprime(); }
        else if (lah == ')') { }    // E′ → ε: do nothing
        else if (lah == '$') { }    // E′ → ε: do nothing
        else report("syntax error");
    }

    void T() {
        if      (lah == ID)  { F(); Tprime(); }
        else if (lah == '(') { F(); Tprime(); }
        else report("syntax error");
    }

    void Tprime() {
        if      (lah == '+') { }    // T′ → ε: do nothing
        else if (lah == '*') { match('*'); F(); Tprime(); }
        else if (lah == ')') { }    // T′ → ε: do nothing
        else if (lah == '$') { }    // T′ → ε: do nothing
        else report("syntax error");
    }

    void F() {
        if      (lah == ID)  { match(ID); }
        else if (lah == '(') { match('('); E(); match(')'); }
        else report("syntax error");
    }
Let the input be id + id * id. When E() is called, the parser works as follows:
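The sequence of calls can be observed by recording each production at the moment it is chosen. The following Python sketch mirrors the procedures for E, E′, T, T′, and F for the expression grammar with F → ( E ) | id. It is a sketch under assumptions: the input is already split into tokens, and the names PredictiveParser, ParseError, choose, and parse are hypothetical.

```python
class ParseError(Exception):
    pass

class PredictiveParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks the end of the input
        self.pos = 0
        self.output = []                     # productions chosen, in order

    @property
    def lah(self):                           # current lookahead token
        return self.tokens[self.pos]

    def match(self, t):
        if self.lah != t:
            raise ParseError(f"expected {t!r}, found {self.lah!r}")
        self.pos += 1

    def choose(self, production):
        self.output.append(production)

    def E(self):
        if self.lah in ("id", "("):
            self.choose("E -> T E'")
            self.T()
            self.E_prime()
        else:
            raise ParseError(f"syntax error at {self.lah!r}")

    def E_prime(self):
        if self.lah == "+":
            self.choose("E' -> + T E'")
            self.match("+")
            self.T()
            self.E_prime()
        elif self.lah in (")", "$"):
            self.choose("E' -> ε")           # nothing to match
        else:
            raise ParseError(f"syntax error at {self.lah!r}")

    def T(self):
        if self.lah in ("id", "("):
            self.choose("T -> F T'")
            self.F()
            self.T_prime()
        else:
            raise ParseError(f"syntax error at {self.lah!r}")

    def T_prime(self):
        if self.lah == "*":
            self.choose("T' -> * F T'")
            self.match("*")
            self.F()
            self.T_prime()
        elif self.lah in ("+", ")", "$"):
            self.choose("T' -> ε")           # nothing to match
        else:
            raise ParseError(f"syntax error at {self.lah!r}")

    def F(self):
        if self.lah == "id":
            self.choose("F -> id")
            self.match("id")
        elif self.lah == "(":
            self.choose("F -> ( E )")
            self.match("(")
            self.E()
            self.match(")")
        else:
            raise ParseError(f"syntax error at {self.lah!r}")

def parse(tokens):
    p = PredictiveParser(tokens)
    p.E()
    if p.lah != "$":                         # all input must be consumed
        raise ParseError(f"unexpected {p.lah!r} after the expression")
    return p.output
```

Calling parse("id + id * id".split()) returns eleven productions, beginning with E -> T E', T -> F T', F -> id, T' -> ε, E' -> + T E'. Read in order, they form a leftmost derivation of the input, and no production is ever undone.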
e) Non-recursive Predictive Parser

A non-recursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls.

[Figure: model of a table-driven predictive parser, with input buffer a + b * c $, stack X Y Z $ (X on top), and the predictive parsing program.]

Given an input w, the parser starts in a configuration where the input buffer holds w$ and the stack holds the start symbol S of grammar G on top of $. The following program produces a predictive parse for the input w, using the predictive parsing table M.

    a ← the first symbol of w
    X ← the top stack symbol
    while ( X ≠ $ ) {    // stack is not empty
        if ( X = a ) {
            pop the stack
            a ← the next symbol of w
        }
        else if ( X is a terminal ) error()
        else if ( M[X, a] is an error entry ) error()
        else if ( M[X, a] = X → Y1 Y2 … Yk ) {
            output the production X → Y1 Y2 … Yk
            pop the stack
            push Yk, Yk−1, …, Y1 onto the stack, with Y1 on top
        }
        X ← the top stack symbol
    }
Consider the following parse table and the input id + id * id.

                  Input symbol (lookahead)
    Nonterminal   id          +             *              (            )          $
    E             E → T E′                                 E → T E′
    E′                        E′ → + T E′                               E′ → ε     E′ → ε
    T             T → F T′                                 T → F T′
    T′                        T′ → ε        T′ → * F T′                 T′ → ε     T′ → ε
    F             F → id                                   F → ( E )

Note: E ⇒lm T E′ ⇒lm …

The changes of configuration during parsing, together with the output generated, are:

    Matched    Stack    Input    Action
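The configuration changes can be generated mechanically from the table. The following Python sketch implements the stack-based loop for the expression grammar's table and records one row per step. It is a sketch under assumptions: the table is entered by hand as a dictionary, an empty list stands for an ε body, the names TABLE, NONTERMINALS, ParseError, and ll1_parse are hypothetical, and the final check for leftover input is an addition that the basic loop does not spell out.

```python
class ParseError(Exception):
    pass

# Parsing table for the expression grammar; [] is the body of an ε-production.
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens, table=TABLE, start="E"):
    """Return one (stack, input, action) row per step of the parse."""
    buf = list(tokens) + ["$"]
    stack = ["$", start]                  # the top of the stack is the end of the list
    pos, rows = 0, []
    while stack[-1] != "$":
        x, a = stack[-1], buf[pos]
        stack_text = " ".join(reversed(stack))
        input_text = " ".join(buf[pos:])
        if x == a:                        # terminal on top matches the input
            stack.pop()
            pos += 1
            action = f"match {a}"
        elif x not in NONTERMINALS:       # terminal on top does not match
            raise ParseError(f"expected {x}, found {a}")
        elif (x, a) not in table:         # blank table entry
            raise ParseError(f"no entry for M[{x}, {a}]")
        else:
            body = table[(x, a)]
            stack.pop()
            stack.extend(reversed(body))  # push Yk ... Y1 so that Y1 ends up on top
            rhs = " ".join(body) or "ε"
            action = f"output {x} -> {rhs}"
        rows.append((stack_text, input_text, action))
    if buf[pos] != "$":                   # added check: input left after the stack emptied
        raise ParseError(f"unexpected {buf[pos]} at end of parse")
    return rows
```

For id + id * id, ll1_parse produces 16 rows: 11 production outputs and 5 matches. The first row is stack E $, input id + id * id $, action output E -> T E', and the last row is stack E' $, input $, action output E' -> ε.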
E. Error Recovery

If a compiler had to process only correct programs, its design and implementation would be greatly simplified. However, a compiler is expected to help the programmer locate and track down errors.

a) Types of programming error

    Lexical errors: misspellings of identifiers, keywords, operators, etc.
    Syntactic errors: misplaced semicolons, extra braces, a case statement without an enclosing switch, etc.
    Semantic errors: type mismatches between operators and operands, such as returning an int value from a void function.
    Logical errors: anything resulting from incorrect reasoning on the part of the programmer.

b) Simplest (Error Recovery Mode)

When the first error is discovered, ______

c) Panic Mode Recovery

When an error is discovered, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found.

This recovery strategy can be implemented by adding synchronizing (sync) tokens to the parse table.

1. Add sync tokens to the parse table

                  Input symbol (lookahead)
    Nonterminal   id          +             *              (            )          $
    E             E → T E′                                 E → T E′
    E′                        E′ → + T E′                               E′ → ε     E′ → ε
    T             T → F T′                                 T → F T′
    T′                        T′ → ε        T′ → * F T′                 T′ → ε     T′ → ε
    F             F → id                                   F → ( E )
2. During parsing:

    If M[A, a] is blank, skip the input symbol a.
    If M[A, a] is sync, pop the nonterminal A from the stack.
    If the terminal A on top of the stack does not match the input symbol (A ≠ a), pop the terminal A from the stack.

Example) The input is ) id + id * id.

    Matched    Stack    Input    Action
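The three recovery rules can be added to the stack-based parsing loop. The following Python sketch is one possible arrangement, not the lecture's implementation. It makes several assumptions that are worth stating: sync is placed in every blank entry M[A, b] with b ∈ FOLLOW(A), using hand-computed FOLLOW sets for the expression grammar; when only the start symbol remains on the stack, a sync entry is treated like a blank entry so that parsing can continue; and the endmarker $ is never skipped. The names SYNC, TABLE, FOLLOW, and parse_with_recovery are hypothetical.

```python
SYNC = "sync"

# Parsing table for the expression grammar; [] is the body of an ε-production.
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
# FOLLOW sets, worked out by hand (an assumption of this sketch).
FOLLOW = {
    "E":  {")", "$"}, "E'": {")", "$"},
    "T":  {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F":  {"+", "*", ")", "$"},
}
NONTERMINALS = set(FOLLOW)

# Step 1: put sync into every blank entry M[A, b] with b in FOLLOW(A).
for A, follow_a in FOLLOW.items():
    for b in follow_a:
        TABLE.setdefault((A, b), SYNC)

def parse_with_recovery(tokens, start="E"):
    """Parse with panic-mode recovery and return the list of error messages."""
    buf = list(tokens) + ["$"]
    stack, pos, errors = ["$", start], 0, []
    while stack[-1] != "$":
        x, a = stack[-1], buf[pos]
        entry = TABLE.get((x, a))
        if x == a:                                   # terminal matches the input
            stack.pop()
            pos += 1
        elif x not in NONTERMINALS:                  # terminal mismatch: pop it
            errors.append(f"missing {x}: pop {x}")
            stack.pop()
        elif isinstance(entry, list):                # ordinary production
            stack.pop()
            stack.extend(reversed(entry))
        elif a != "$" and (entry is None or len(stack) == 2):
            # Blank entry, or sync with only the start symbol left (assumption).
            errors.append(f"unexpected {a}: skip {a}")
            pos += 1
        else:                                        # sync entry, or blank at end of input
            errors.append(f"sync on {a}: pop {x}")
            stack.pop()
    if buf[pos] != "$":
        errors.append(f"unexpected {buf[pos]}: input left after parse")
    return errors
```

For the input ) id + id * id, the sketch reports a single error, unexpected ): skip ), and then parses the rest normally. For an input such as ) id * + id, it reports a second error when F meets + and is popped through its sync entry.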
d) Phrase-level Recovery

On discovering an error, the parser may perform a local correction on the remaining input, for example by replacing some prefix of the input so that parsing can continue. This can be done by filling the blank entries of the parse table with pointers to error routines that add, remove, or replace input symbols (tokens), or pop symbols from the stack, and then issue error messages.