Parsing

For a given CFG G, parsing a string w is to check whether w ∈ L(G) and, if it is, to find a sequence of production rules which derives w. Since, for a given language L, there are many grammars which generate the same language L, parsing must be done with respect to a grammar G, not its language L(G). Consider the following two context-free grammars G1 and G2, which generate the same language { a^i b^i | i > 0 }.

G1: S → aSb | ab
G2: S → aA, A → Sb | b

Clearly, the following PDA recognizes this language. However, this PDA does not provide any information for identifying the grammar, or a way of generating a given string with one of the grammars.

[PDA diagram: the start state loops on (a, Z0/aZ0) and (a, a/aa); on (b, a/ε) it moves to a second state, which loops on (b, a/ε); on (ε, Z0/Z0) it moves to the accepting state.]
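The point can be made concrete with a minimal sketch (ours, not from the notes; Python is an assumption) of the PDA above. It answers only "accept or reject" and records nothing about which grammar rule produced each symbol:

```python
# Minimal simulation of the PDA above for { a^i b^i | i > 0 }.
# It decides membership only; it yields no derivation information.

def pda_accepts(w: str) -> bool:
    stack = ["Z0"]
    state = "push"            # "push": reading a's; "pop": matching b's
    for sym in w:
        if state == "push" and sym == "a":
            stack.append("a")             # (a, Z0/aZ0) and (a, a/aa)
        elif sym == "b" and stack[-1] == "a":
            stack.pop()                   # (b, a/ε)
            state = "pop"
        else:
            return False
    # (ε, Z0/Z0): accept iff only Z0 remains and at least one b was matched
    return state == "pop" and stack == ["Z0"]

assert pda_accepts("aabb") and not pda_accepts("aab")
```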

Two Derivation Rules

Recall that if a context-free grammar G is unambiguous, then for each string x in L(G) there is a unique parse tree that yields x. So a parse tree could be a good output form for parsing. However, it is not practical to output parse trees in two-dimensional form. How about representing them in one-dimensional form, i.e., as the sequence of production rules applied? There is a problem with this approach: in general there can be more than one sequence of production rules that generates the same string x. This is true even when the grammar is unambiguous. Recall that a string x is in L(G) if x can be derived by applying a sequence of production rules (one rule at a time) in G. Suppose that a sentential form (i.e., a string of terminals and nonterminals) containing more than one nonterminal symbol appears in the middle of such a sequence. The final result does not depend on which of its nonterminal symbols is chosen to derive the next sentential form. We should choose one from such multiple sequences of production rules, and the chosen sequence should be uniquely identifiable and effective to work with. The two derivation disciplines defined below are uniquely identifiable.

Leftmost (rightmost) derivation: A string is derived by iteratively applying a production rule to the leftmost (rightmost) nonterminal symbol of the current sentential form.

For the following grammar G, the leftmost and rightmost derivations are as shown below.

G: S → ABC, A → aa, B → a, C → cC | c

Leftmost derivation: S ⇒ ABC ⇒ aaBC ⇒ aaaC ⇒ aaacC ⇒ aaacc
Rightmost derivation: S ⇒ ABC ⇒ ABcC ⇒ ABcc ⇒ Aacc ⇒ aaacc

[Parse tree for aaacc: the root S has children A, B, C; A yields aa, B yields a, and C yields cC, whose child C yields c.]

Notice that the sequence of productions applied under the leftmost derivation rule corresponds to a top-down left-to-right traversal of the parse tree, and the reverse of the sequence applied under the rightmost derivation rule corresponds to a bottom-up left-to-right traversal of the parse tree.
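As a small sketch (ours; the rule numbering 1–5 is hypothetical, assigned here for illustration), the leftmost derivation above can be replayed mechanically by always rewriting the leftmost nonterminal:

```python
# Replay a sequence of productions as a leftmost derivation for the
# grammar G above. Nonterminals are uppercase, terminals lowercase.

RULES = {1: ("S", "ABC"), 2: ("A", "aa"), 3: ("B", "a"),
         4: ("C", "cC"), 5: ("C", "c")}

def leftmost_derive(rule_seq):
    form = "S"
    steps = [form]
    for r in rule_seq:
        lhs, rhs = RULES[r]
        # locate the leftmost nonterminal; it must match the rule's left side
        i = next(i for i, ch in enumerate(form) if ch.isupper())
        assert form[i] == lhs, f"rule {r} does not apply to {form}"
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps

# prints: S => ABC => aaBC => aaaC => aaacC => aaacc
print(" => ".join(leftmost_derive([1, 2, 3, 4, 5])))
```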

The Basic Strategy for LL(k) Parsing

Now we investigate how we can use a DPDA for parsing. Consider the following CFG G, which generates the language { a^10 x | x = b^i or x = c^i, i > 0 }. (For convenience, when we refer to a rule of G, we shall use the rule number shown before the rule.)

G: (1) S → AB  (2) S → AC  (3) A → aaaaaaaaaa  (4) B → bB  (5) B → b  (6) C → cC  (7) C → c

We want to design a DPDA which, given a string x ∈ {a, b, c}* on the input tape, outputs a sequence of production rules that generates x, if x ∈ L(G). We assume that the machine has an output port, as shown in the figure below, and that the grammar is stored in memory as a lookup table. Let's first try a simple greedy strategy of generating a string in the stack that matches the string x appearing on the input tape. Since any string in L(G) must be generated from the start symbol S, the machine initially pushes S onto the stack, entering a working state q1, and examines the input to choose a proper production rule for S. Recall that the conventional PDA sees the stack top, which is S, and decides whether or not it will read the next input symbol.

[Figure: DPDA with input tape contents aaaaaaaaaabbb, an output port, the grammar G stored in memory, current state q1, and stack contents SZ0.]

Without reading the input, the machine has no information available for choosing rule (1) or (2) for S. So we let the machine read the input. Suppose that the symbol read is a, as shown in the figure below. This information does not help, because both rules (1) and (2) generate the same leading a's (actually 10 a's). The b's located after the a's in the input string indicate that the first production rule to apply to generate the input string is rule (1), i.e., S → AB. Using the conventional DPDA, it is impossible to correctly choose this production rule.

[Figure: the same DPDA, its read head on the first a of input aaaaaaaaaabbb, state q1, stack contents SZ0.]

To overcome this problem we equip the DPDA with the capability of looking ahead in the input string by some constant k cells. For the current grammar the look-ahead length k should be at least 11, because the first symbol b appears 11 cells away from the current input position. (Notice that the count includes the current cell under the read head.) This symbol b is the nearest information in the input string that helps in choosing the correct production rule for S, rule (1) in this example.
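A quick check (ours, not from the notes) confirms the required look-ahead length: the shortest strings derivable via rules (1) and (2) first differ at position 10, i.e., the 11th cell counting the one under the read head.

```python
# Why k must be at least 11 for grammar G: the shortest yields of
# S -> AB and S -> AC agree on the first 10 symbols and differ at the 11th.

s1 = "a" * 10 + "b"   # shortest yield of S -> AB
s2 = "a" * 10 + "c"   # shortest yield of S -> AC
first_diff = next(i for i, (x, y) in enumerate(zip(s1, s2)) if x != y)
print(first_diff + 1)  # 11 cells must be visible to tell the rules apart
```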

Now, by looking 11 symbols ahead, the machine knows that the input string, if it is a string generated by grammar G, must be derived by applying production rule (1) first. So the machine replaces S at the stack top with the right-side string of rule (1) and outputs rule number (1), as shown in the figure below. (Notice that looking ahead does not involve any move of the read head.) Whenever a terminal symbol appears at the stack top, the machine reads the input symbol, compares it with the stack top, and pops it if they match. Otherwise, the input is rejected.

[Figure: (a) the DPDA after replacing S with AB on the stack, stack contents ABZ0, with rule (1) emitted at the output port; (b) the general machine configuration (q, α, β).]

For convenience, let (q, α, β) denote a configuration of the machine with current state q (including the stored grammar G), the input portion α yet to be read, and current stack contents β. From now on we shall use this triple for the machine configuration instead of a diagram.

The initial configuration (q0, aaaaaaaaaabbb, Z0) is routinely changed to the ready configuration (q1, aaaaaaaaaabbb, SZ0). Based on the information looked ahead 11 positions, this configuration is changed by applying rule (1) as shown below. Then, seeing A at the stack top, the machine replaces A with the right side of rule (3). For this operation the machine does not need to look ahead, because there is only one production rule for A. Now, the first 10 a's of the input can be successfully matched one by one with the 10 a's appearing at the stack top as follows. (The rule numbers attached to the arrows indicate the production applied.)

(q0, aaaaaaaaaabbb, Z0) ⊢ (q1, aaaaaaaaaabbb, SZ0) ⊢(1) (q1, aaaaaaaaaabbb, ABZ0) ⊢(3) (q1, aaaaaaaaaabbb, aaaaaaaaaaBZ0) ⊢ .... ⊢ (q1, abbb, aBZ0) ⊢ (q1, bbb, BZ0)

Now symbol B appears at the stack top. Which of the production rules B → bB | b should be applied to generate the next input symbol b? Since there is more than one b, the next input b must be generated by rule B → bB. To see whether there is more than one b, the machine needs to look ahead 2 cells. Thus, the machine applies rule B → bB whenever it sees two b's ahead, and applies rule B → b when it sees one b. This way the machine successfully parses the remaining input, as the following slide shows. The last configuration (q1, ε, Z0), with only Z0 remaining on the stack and no input left to parse, implies that the parsing has been completed successfully. The output is the sequence of production rules applied whenever a nonterminal symbol appeared at the stack top.

(q1, bbb, BZ0) ⊢(4) (q1, bbb, bBZ0) ⊢ (q1, bb, BZ0) ⊢(4) (q1, bb, bBZ0) ⊢ (q1, b, BZ0) ⊢(5) (q1, b, bZ0) ⊢ (q1, ε, Z0)

The sequence of productions applied by this machine is shown below, and it follows exactly the order of the leftmost derivation.

S ⇒(1) AB ⇒(3) aaaaaaaaaaB ⇒(4) aaaaaaaaaabB ⇒(4) aaaaaaaaaabbB ⇒(5) aaaaaaaaaabbb

We can easily see that the machine, given a string x on the input tape, correctly generates the sequence of production rules in the order applied for the leftmost derivation of x if and only if x is in L(G). This machine parses the input string reading left-to-right, looking ahead at most 11 cells, and generates the sequence of production rules applied according to the leftmost derivation. We call this machine an LL(11) parser. Conventionally, an LL(k) parser is represented by a table that shows, depending on the nonterminal symbol appearing at the stack top and the look-ahead contents, which production rule should be applied. Reading input symbols to match stack-top terminal symbols, and the popping operations, are usually omitted for convenience.

The parse table for the above example is shown below, where blank entries are for the undefined cases (i.e., the input is rejected), and x in the look-ahead contents is a don't-care (wild card) symbol. Each entry gives the right side with which the stack-top nonterminal is replaced.

Stack top | aaaaaaaaaab | aaaaaaaaaac | bbxxxxxxxxx | b | ccxxxxxxxxx | c | (no look-ahead)
S         | AB          | AC          |             |   |             |   |
A         |             |             |             |   |             |   | aaaaaaaaaa
B         |             |             | bB          | b |             |   |
C         |             |             |             |   | cC          | c |
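The table and the stack machine together fit in a short table-driven loop. Below is a sketch (ours; the encoding of the table as prefix/production pairs and the rule numbers are our own choices): each nonterminal maps to a list of (required look-ahead prefix, production) pairs, tried longest prefix first, with "" meaning no look-ahead is needed.

```python
# Table-driven sketch of the LL(11) parser for grammar G above.
# Nonterminals are uppercase, terminals lowercase; Z0 marks the stack bottom.

K = 11
TABLE = {
    "S": [("aaaaaaaaaab", (1, "AB")), ("aaaaaaaaaac", (2, "AC"))],
    "A": [("", (3, "aaaaaaaaaa"))],            # only one rule: no look-ahead
    "B": [("bb", (4, "bB")), ("b", (5, "b"))],  # longest prefix listed first
    "C": [("cc", (6, "cC")), ("c", (7, "c"))],
}

def ll_parse(w: str):
    stack, pos, output = ["Z0", "S"], 0, []
    while stack[-1] != "Z0":
        top = stack.pop()
        if top.islower():                       # terminal: match and advance
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
            continue
        lookahead = w[pos:pos + K]
        for prefix, (num, rhs) in TABLE[top]:
            if lookahead.startswith(prefix):
                output.append(num)
                stack.extend(reversed(rhs))     # push right side, leftmost on top
                break
        else:
            return None                         # blank table entry: reject
    return output if pos == len(w) else None

print(ll_parse("aaaaaaaaaa" + "bbb"))           # [1, 3, 4, 4, 5]
```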

Example 1. Construct an LL(k) parser for the following CFG with minimum k.

(1) S → aSb  (2) S → aabbb

This grammar generates the language { a^i aabbb b^i | i ≥ 0 }. Consider the string aaaaabbbbbb and its leftmost derivation:

S ⇒(1) aSb ⇒(1) aaSbb ⇒(1) aaaSbbb ⇒(2) aaaaabbbbbb

Notice that the aabbb at the center of this string is generated by rule S → aabbb. If we let our parser look ahead 3 cells, it can select the correct production rule that generates the next input symbol as follows. If it sees aaa, then the first a in this look-ahead contents must have been generated by rule S → aSb. If it sees aab, then this string aab, together with the two succeeding b's, must have been generated by production rule S → aabbb. Based on this observation, our LL(3) parser parses the string aaaaabbbbbb as follows. First the parser gets ready by pushing S onto the stack.

(q0, aaaaabbbbbb, Z0) ⊢ (q1, aaaaabbbbbb, SZ0)

Our parser, looking ahead aaa, applies rule (1) S → aSb and then, seeing a at the stack top, pops it while reading a from the input tape. Thus, the configuration changes as follows.

(q1, aaaaabbbbbb, SZ0) ⊢(1) (q1, aaaaabbbbbb, aSbZ0) ⊢ (q1, aaaabbbbbb, SbZ0)

Again, looking ahead aaa, the parser applies rule (1) S → aSb two more times as follows.

(q1, aaaabbbbbb, SbZ0) ⊢(1) (q1, aaaabbbbbb, aSbbZ0) ⊢ (q1, aaabbbbbb, SbbZ0) ⊢(1) (q1, aaabbbbbb, aSbbbZ0) ⊢ (q1, aabbbbbb, SbbbZ0)

Now, our parser looks ahead and sees aab, applies rule (2) S → aabbb, and then matches the remaining input symbols with the ones appearing at the stack top as follows.

(q1, aabbbbbb, SbbbZ0) ⊢(2) (q1, aabbbbbb, aabbbbbbZ0) ⊢ .... ⊢ (q1, ε, Z0)

The sequence of productions applied, (1)(1)(1)(2), is exactly the one applied in the leftmost derivation deriving aaaaabbbbbb. In fact, the parser derived the string in the stack according to the leftmost derivation rule. Clearly, this parser operates according to the following parsing table.

Stack top | aaa | aab
S         | aSb | aabbb
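The driver sketched earlier, instantiated with this two-entry LL(3) table (the encoding is again ours), reproduces the trace above:

```python
# Example 1's parser: the same table-driven loop with k = 3.

K = 3
TABLE = {"S": [("aaa", (1, "aSb")), ("aab", (2, "aabbb"))]}

def ll_parse(w):
    stack, pos, out = ["Z0", "S"], 0, []
    while stack[-1] != "Z0":
        top = stack.pop()
        if top.islower():                 # terminal: match and advance
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
            continue
        for prefix, (num, rhs) in TABLE[top]:
            if w[pos:pos + K].startswith(prefix):
                out.append(num)
                stack.extend(reversed(rhs))
                break
        else:
            return None
    return out if pos == len(w) else None

print(ll_parse("aaaaabbbbbb"))   # [1, 1, 1, 2]
```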

Example 2. Construct an LL(k) parser with minimum k for the following grammar.

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

We will build an LL(2) parser by examining how it can parse the string ababaaaa by deriving it in the stack according to the following leftmost derivation.

S ⇒(1) abA ⇒(3) abSaa ⇒(1) ababAaa ⇒(3) ababSaaaa ⇒(2) ababaaaa

Following the routine initialization operation, we have S at the stack top as follows.

(q0, ababaaaa, Z0) ⊢ (q1, ababaaaa, SZ0)

By looking ahead two cells on the tape, the parser sees ab. Definitely the next rule applied in the leftmost derivation is rule (1), which is the only rule producing ab at the left end. So the parser applies rule (1), and the configuration changes as follows;

(q1, ababaaaa, SZ0) ⊢(1) (q1, ababaaaa, abAZ0) ⊢ .. ⊢ (q1, abaaaa, AZ0)

For convenience the grammar is repeated here.

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

If rule (4) were applied for A, the terminal symbol appearing in the next cell would be b, not a. So the next input symbol must be generated by S, i.e., rule (3) must be applied next to generate the input string. Thus the configuration is changed as follows.

(q1, abaaaa, AZ0) ⊢(3) (q1, abaaaa, SaaZ0)

Now, the parser looks ahead two cells and sees ab. Rule (1) must be applied to derive the input string: if rule (2) were applied, the two-cell look-ahead contents would be either ε (for the case of null remaining input) or aa (generated by rule (3)). Thus the parser applies rule (1) for the S on the stack top and changes its configuration as follows.

(q1, abaaaa, SaaZ0) ⊢(1) (q1, abaaaa, abAaaZ0) ⊢ .. ⊢ (q1, aaaa, AaaZ0)

(1) S → abA  (2) S → ε  (3) A → Saa  (4) A → b

Again, since the next input symbol is a, the next rule applied cannot be rule (4); it must be rule (3). Thus the configuration changes as follows.

(q1, aaaa, AaaZ0) ⊢(3) (q1, aaaa, SaaaaZ0)

Now the parser looks ahead and sees aa, which cannot be generated by rule (1), whose right side begins with ab; the aa must be the aa pushed onto the stack earlier by rule (3). Thus the parser applies rule (2) as follows and then matches the remaining input with the string in the stack.

(q1, aaaa, SaaaaZ0) ⊢(2) (q1, aaaa, aaaaZ0) ⊢ .... ⊢ (q1, ε, Z0)

The sequence of rules applied by the parser at the stack top is exactly the same as the sequence applied in the leftmost derivation deriving ababaaaa.

Clearly, this parser can parse the language with the following parsing table, where B denotes a blank (end of input) and x is a don't-care symbol.

Stack top | ab  | aa  | bx | BB
S         | abA | ε   |    | ε
A         | Saa | Saa | b  |
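The only new ingredient here is the ε-production. A sketch of the same driver (our encoding: S → ε pushes nothing, and the blank look-ahead column BB is encoded as an empty prefix that matches only at end of input):

```python
# Example 2's parser with k = 2 and an epsilon production.

K = 2
TABLE = {
    "S": [("ab", (1, "abA")), ("aa", (2, "")), ("", (2, ""))],
    "A": [("ab", (3, "Saa")), ("aa", (3, "Saa")), ("b", (4, "b"))],
}

def applies(prefix: str, la: str) -> bool:
    # empty prefix encodes the blank (end-of-input) look-ahead column
    return la == "" if prefix == "" else la.startswith(prefix)

def ll_parse(w):
    stack, pos, out = ["Z0", "S"], 0, []
    while stack[-1] != "Z0":
        top = stack.pop()
        if top.islower():
            if pos >= len(w) or w[pos] != top:
                return None
            pos += 1
            continue
        la = w[pos:pos + K]
        for prefix, (num, rhs) in TABLE[top]:
            if applies(prefix, la):
                out.append(num)
                stack.extend(reversed(rhs))   # rhs == "" pushes nothing
                break
        else:
            return None
    return out if pos == len(w) else None

print(ll_parse("ababaaaa"))   # [1, 3, 1, 3, 2]
print(ll_parse(""))           # [2]: the empty string is in the language
```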

Example 3. The grammar below is not an LL(k) grammar for any fixed integer k.

S → A | B
A → aA | 0
B → aB | 1

Notice that the language of this grammar is { a^n x | n ≥ 0, x ∈ {0, 1} }. The strings in this language can have an arbitrarily large number of a's followed by either 0 or 1, depending on whether the string is generated by rule S → A or S → B, respectively. With a finite look-ahead range k it is impossible to see the crucial indicator (0 or 1) that is needed to decide which production rule must be applied to generate the input string. For this grammar, there is no LL(k) parser for any finite k.

[Figure: input tape a a a a .... a a 0, with the indicator 0 lying beyond any fixed look-ahead window; state q1, stack contents SZ0.]

It is easy to see that for the following grammar, which generates the same language, we can construct an LL(1) parser.

S → aS | D
D → 0 | 1
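For the refactored grammar, one symbol of look-ahead suffices: a selects S → aS, while 0 or 1 selects S → D. A sketch (ours, written as a recursive-descent parser rather than the stack machine; names are hypothetical):

```python
# LL(1) recursive-descent recognizer for S -> aS | D, D -> 0 | 1.

def parse(w: str) -> bool:
    pos = 0

    def peek():
        return w[pos] if pos < len(w) else None

    def parse_S():
        nonlocal pos
        if peek() == "a":        # S -> aS
            pos += 1
            return parse_S()
        return parse_D()         # S -> D

    def parse_D():
        nonlocal pos
        if peek() in ("0", "1"): # D -> 0 | 1
            pos += 1
            return True
        return False

    return parse_S() and pos == len(w)

assert parse("aaaa0") and parse("1") and not parse("aaa")
```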

Formal Definition of LL(k) Grammars

Notation: Let (k)ω denote the prefix of length k of string ω; if |ω| < k, then (k)ω = ω. For example, (2)ababaa = ab and (3)ab = ab.

Definition (LL(k) grammar). Let G = (V_T, V_N, P, S) be a CFG. Grammar G is an LL(k) grammar, for some fixed integer k, if it has the following property: for any two leftmost derivations

S ⇒* ωAα ⇒ ωβα ⇒* ωy and S ⇒* ωAα ⇒ ωγα ⇒* ωx,

where α, β, γ ∈ (V_T ∪ V_N)* and ω, x, y ∈ V_T*, if (k)x = (k)y, then β = γ.

If a CFG G has this property, then for every x ∈ L(G) we can decide the sequence of leftmost derivation steps which generates x by scanning x left to right, looking ahead at most k symbols. (If you are interested in the proof of this claim, see The Theory of Computation by D. Wood, or a book on compiler construction.)
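A small illustration of the definition (ours, not from the notes): for Example 3's grammar, S ⇒ A ⇒* a^k 0 and S ⇒ B ⇒* a^k 1 share the same k-symbol prefix for every k, yet the first derivation steps use different right sides (β = A vs. γ = B), so the grammar violates the LL(k) condition for every k.

```python
# Checking the LL(k) condition on Example 3's grammar for a few values of k.

def k_prefix(k: int, w: str) -> str:
    """(k)ω: the prefix of ω of length k, or ω itself if |ω| < k."""
    return w[:k]

for k in (1, 3, 10):
    x = "a" * k + "0"   # derived via S => A => ... (β = A)
    y = "a" * k + "1"   # derived via S => B => ... (γ = B)
    assert k_prefix(k, x) == k_prefix(k, y)   # look-aheads agree...
    # ...but β ≠ γ, so the LL(k) condition fails for this k
print("LL(k) condition fails for k = 1, 3, 10 (and, by the same pattern, all k)")
```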