LL Parsing, LR Parsing, Complexity, and Automata

R. Gregory Taylor
Department of Mathematics and Computer Science
Manhattan College
Riverdale, New York 10471-4098 USA
<gtaylor@manhattan.edu>

Abstract

It is well known that pushdown-stack automata find application within the syntactic analysis phase of compilation. Nonetheless, in most compiler design textbooks the relation between popular parsing algorithms and the theory of deterministic pushdown-stack automata remains implicit. We show that it is not difficult to implement these algorithms as deterministic automata. These implementations in turn yield instructive time/space analyses of the implemented algorithms.

1. LL(1) Parsing and Deterministic Pushdown-Stack Automata

1.1. LL(1) Expression Grammar

Example 1. It will be helpful for our discussion to focus on a particular example, and so let us consider the expression grammar G whose productions are as follows.

    expr     → term expr_aux
    expr_aux → add_op term expr_aux | ε
    term     → fac term_aux
    term_aux → mult_op fac term_aux | ε
    fac      → primary | - primary
    primary  → ( expr ) | tok_id
    add_op   → + | -
    mult_op  → * | /

Note that G is a left-factored grammar. It is obvious that G has no direct left recursion, and it is also quite easy to see that G has no indirect left recursion either. Table 1 presents part of an LL(1) parse table that is constructed, in accordance with a well-known technique, on the basis of grammar G (see [1], [2]).

Table 1

1.2. Implementation of an LL(1) Parser as a Pushdown-Stack Automaton

It can be shown that an LL(1) parser is, at bottom, a deterministic pushdown-stack automaton that accepts by empty stack. Of course, a stack component does play a role in the table-driven LL(1) parsers that one standardly considers. However, it is presumably not obvious how we may view a parser that uses both a stack and a table as a pushdown automaton, which, of course, appears to involve no table.
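Before turning to the automaton, it may help to recall how a table such as Table 1 arises: the standard construction begins from the FIRST (and FOLLOW) sets of G. The following Python sketch computes the FIRST sets of G by fixed-point iteration; the dictionary encoding and the marker "eps" for ε are our own, not the paper's.

```python
# FIRST-set computation for the grammar G of Example 1 (a sketch;
# the encoding below is ours: each nonterminal maps to a list of
# right-hand sides, and [] encodes the eps (empty) alternative).
GRAMMAR = {
    "expr":     [["term", "expr_aux"]],
    "expr_aux": [["add_op", "term", "expr_aux"], []],
    "term":     [["fac", "term_aux"]],
    "term_aux": [["mult_op", "fac", "term_aux"], []],
    "fac":      [["primary"], ["-", "primary"]],
    "primary":  [["(", "expr", ")"], ["tok_id"]],
    "add_op":   [["+"], ["-"]],
    "mult_op":  [["*"], ["/"]],
}

def first_sets(grammar):
    """Fixed-point iteration: grow the FIRST sets until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                nullable = True          # does the whole rhs derive eps?
                for sym in rhs:
                    new = first[sym] - {"eps"} if sym in grammar else {sym}
                    if not new <= first[nt]:
                        first[nt] |= new
                        changed = True
                    if sym in grammar and "eps" in first[sym]:
                        continue         # sym may vanish; look further right
                    nullable = False
                    break
                if nullable and "eps" not in first[nt]:
                    first[nt].add("eps")
                    changed = True
    return first
```

For G one obtains, for instance, FIRST(expr) = {(, tok_id, -} and FIRST(term_aux) = {*, /, ε}, from which the entries of a table such as Table 1 follow in the usual way.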
As the reader will have guessed, this table component is really just the tabular representation of the state diagram of a (deterministic) pushdown automaton. To see this, consider the automaton M of Figure 1 in conjunction with the following remarks. We shall use Table 1 as the basis upon which to construct M's state diagram, as depicted (partially) in Figure 1. The result will be that M accepts (or recognizes) the class of expressions generated by G. We assume that the input stream has been tokenized beforehand. The input alphabet of M then comprises the terminal symbols (token classes) of G together with end-of-input symbol #. The stack alphabet of M contains all terminals and nonterminals of G plus stack-initialization symbol #. As usual, we assume the latter symbol to be the only symbol on M's stack at the inception of execution.

SIGCSE Bulletin, Vol. 34, No. 4, December 2002, p. 71

An arc labeled a, b; c, d, e, say, is interpreted as follows: if symbol a is the current input symbol and symbol b is currently on top of the stack, then pop the stack and push symbols c, d, and e, in that order. If symbol a is ε, then the input stream is, in effect, being ignored. If what follows the semicolon is an occurrence of ε, this is to say that no symbol is pushed onto the stack. The states of M are two initialization states q_0 and q_1 and, in addition, eight states, one for each symbol of M's input alphabet, # included. It will be convenient to designate the latter states as q_tok_id, q_+, q_-, q_*, q_/, q_(, q_), and q_#. (State designation q_+ abbreviates q_tok_plus or the like.) The start state of M is q_0, and M has no accepting states. Each of the transitions from state q_1 corresponds to M's looking ahead one token in the input stream and then incorporating the encountered lookahead symbol into its state. For example, if the lookahead symbol at state q_1 is *, then M is seen to enter state q_* and then to manipulate its stack without reading further input (see the two self-loops at state q_*). (For the sake of simplicity in presenting the diagram, we use symbol a as a wildcard ranging over all members of M's stack alphabet.) Processing at state q_* continues until symbol * itself appears atop the stack, at which point it is popped and M re-enters state q_1. Again for simplicity, we show only two self-loops for each of the states q_tok_id, q_+, and so forth; the omitted self-loops are easily inferred from (the completion of) Table 1. It is rather easy to see that M is deterministic, primarily because the expression grammar on which it is based is left-factored.
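To make the preceding description concrete, here is a short Python sketch of M viewed as a table-driven stack machine that accepts by empty stack. The parse table below is our own reconstruction from the usual FIRST/FOLLOW computation for G (Table 1 is given only in part), and the token encoding is ours.

```python
# A table-driven LL(1) parser for G, viewed as a pushdown automaton
# accepting by empty stack.  TABLE is our own reconstruction (Table 1
# is only partially given); keys are (nonterminal, lookahead) pairs,
# and [] encodes expansion by the empty string.
TABLE = {
    ("expr", "("): ["term", "expr_aux"],
    ("expr", "tok_id"): ["term", "expr_aux"],
    ("expr", "-"): ["term", "expr_aux"],
    ("expr_aux", "+"): ["add_op", "term", "expr_aux"],
    ("expr_aux", "-"): ["add_op", "term", "expr_aux"],
    ("expr_aux", ")"): [], ("expr_aux", "#"): [],
    ("term", "("): ["fac", "term_aux"],
    ("term", "tok_id"): ["fac", "term_aux"],
    ("term", "-"): ["fac", "term_aux"],
    ("term_aux", "*"): ["mult_op", "fac", "term_aux"],
    ("term_aux", "/"): ["mult_op", "fac", "term_aux"],
    ("term_aux", "+"): [], ("term_aux", "-"): [],
    ("term_aux", ")"): [], ("term_aux", "#"): [],
    ("fac", "("): ["primary"], ("fac", "tok_id"): ["primary"],
    ("fac", "-"): ["-", "primary"],
    ("primary", "("): ["(", "expr", ")"],
    ("primary", "tok_id"): ["tok_id"],
    ("add_op", "+"): ["+"], ("add_op", "-"): ["-"],
    ("mult_op", "*"): ["*"], ("mult_op", "/"): ["/"],
}

def parse(tokens):
    """Return True iff the stack empties exactly at end of input."""
    stack = ["#", "expr"]           # '#' is the stack-initialization symbol
    tokens = tokens + ["#"]         # '#' marks end of input
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top == look:             # terminal (or #) atop stack: consume it
            i += 1
        elif (top, look) in TABLE:  # nonterminal atop stack: expand by table
            stack.extend(reversed(TABLE[(top, look)]))
        else:
            return False            # no move available: reject
    return i == len(tokens)
```

With identifiers tokenized as tok_id, parse(["tok_id", "*", "tok_id", "+", "tok_id"]) accepts while parse(["(", "tok_id", "+"]) rejects, matching the outcomes reported in the text.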
Tracing the computation of M for input string num1 * num2 + num3, as well as for (num1 +, say, reveals that the former is accepted whereas the latter is not. In fact, M's computation will be essentially that of an LL(1) parser, as recorded in Tables 2 and 3 (both partial). The student should then have no trouble believing that the table-driven LL(1) parser is, at root, the implementation of a pushdown automaton that accepts by empty stack.

Table 2

Table 3

The state diagram of Figure 1 may be used to clarify the role of the (single) lookahead in LL(1) parsing. Namely, at central state q_1, machine M must decide which peripheral state to enter and, by implication, how to expand the leftmost parse-tree node labeled by a nonterminal. It renders this decision based solely upon the lookahead: if the lookahead is (, then M enters state q_(, and so forth. Once at peripheral state q_(, say, M proceeds to expand the tree downward until terminal ( appears atop the stack, at which point the stack is popped and M returns to state q_1. It is at this point, and not before, that input token ( is said to have been consumed. In other words, lookahead ( is used as the basis for tree expansion before it is consumed. Incidentally, the topology of the state diagram of Figure 1 (a single central state surrounded by a ring of peripheral states with multiple self-loops) is characteristic of parsing automata that use a single lookahead. (How would we configure parsers using two lookaheads? Three lookaheads?)

Figure 1. Arc label a serves as a wildcard representing an arbitrary stack alphabet symbol.

1.3. The Complexity of LL(1) Parsing

Finally, reflection upon Figure 1 enables us to justify a certain claim regarding the efficiency of LL(1) parsing. Note that if token * is the current input symbol, then M reads that symbol, leaves state q_1, and enters state q_*. Once in state q_*, M will traverse ε-self-loops until symbol * appears atop its stack. It is easy to see that this will require, in the worst case, two steps. One more step will then bring M back to state q_1. The case of input token / is perfectly analogous, as are all the other possibilities: each peripheral state q has but finitely many ε-self-loops. Since left recursion was eliminated from the grammar on which M is based, none of these ε-self-loops will ever be traversed twice during any one stint at q. Moreover, the number of these self-loops in no way depends upon n, where we take n to be the number of tokens in the tokenized input stream. (What it does depend on is the grammar G of Example 1.) Finally, since the self-loops at q are all ε-moves, they never advance the input. It should now be apparent that, for each of n input tokens, automaton M enters some peripheral state and then computes there for O(1) steps, worst case, assuming the obvious notion of computation step. Moreover, each step increases the height of M's stack by O(1). Apparently, we have proved the following proposition.

Theorem 1. LL(1) parsing requires O(n) time and O(n) space, where n is the length of the token stream.

2. LR Parsing and Deterministic Pushdown-Stack Automata

In fact, we shall look only at so-called SLR(1) parsing, a particularly simple form of LR(1) parsing.

2.1. Example 2

Our discussion of SLR parsing will focus on the expression grammar G appearing below.

    (1.1)  expr    → term
    (1.2)  expr    → expr add_op term
    (2.1)  term    → fac
    (2.2)  term    → term mult_op fac
    (3.1)  fac     → ( expr )
    (3.2)  fac     → tok_int_lit
    (3.3)  fac     → - ( expr )
    (3.4)  fac     → - tok_int_lit
    (4.1)  add_op  → +
    (4.2)  add_op  → -
    (5.1)  mult_op → *
    (5.2)  mult_op → /

Note that G is left-recursive and not left-factored.

2.2. Implementation of an SLR Parser as a Pushdown-Stack Automaton

Once again we endeavor to show that a certain type of parser (this time an SLR parser for the expression grammar G of Example 2) is, in its essentials, the implementation of a deterministic pushdown automaton M that accepts by empty stack. A part of the transition diagram of M appears in Figure 2. Again, it is presumably not obvious that Tables 4 and 5 represent this machine. Consequently, some explanation will be required in order to make this plausible. This explanation will take the form of a step-by-step description of the construction of M based upon G or, rather, based upon the action and goto tables that are based, in turn, upon G.

(1) M will have ten states in total. There will be one for each of the seven terminal symbols of G and one more for end-of-input symbol #. In addition, there will be two initialization states q_0 and q_1. Again, it will be convenient to designate the peripheral states as q_tok_int_lit, q_+, q_-, q_*, q_/, q_(, q_), and q_#. M's start state will be q_0.

(2) M's stack alphabet will include all terminals and nonterminals of G plus stack-initialization symbol #. In addition, the stack alphabet will contain numerals representing each of the 22 rows in the complete action table (Table 4). We shall think of the numeral 46, say, as a single stack-alphabet symbol.

(3) The stack of M will contain symbols of G (terminals and nonterminals) alternating with numerals designating states of a certain finite-state automaton (not shown). This is potentially confusing, we admit: our talk of states here has nothing whatever to do with the states of the pushdown automaton M now under construction, of which there are only ten.
(4) A single arc from state q_0 to state q_1 pushes numeral 0 onto M's stack, which is assumed already to contain stack-initialization symbol #.

(5) For each terminal a of G and every state numeral n, we add an arc labeled a, n; n from state q_1 to state q_a. Similarly, for end-of-input symbol #, we add an arc labeled #, n; n from state q_1 to state q_#. (See Figure 2, where these arcs are indicated schematically.)

(6) Corresponding to each shift action within the action table, there will be an ε-move leading from a peripheral state back to state q_1. For instance, corresponding to the shift S5 in the upper left-hand corner of Table 4, there will be an arc labeled ε, 0; 0 tok_int_lit 5 from state q_tok_int_lit back to state q_1. There are 31 shift actions in the action table, but we have included only a very few of the corresponding arcs in the diagram of Figure 2: just the ones that we shall need to cite later.

(7) Corresponding to each reduce action in Table 4, there will be an ε-self-loop on one of the peripheral states in Figure 2. For example, corresponding to the reduce action R3.2 in the sixth row and second column (not shown), there will be a self-loop labeled ε, tok_int_lit 5; fac at state q_+. (The reader will need to check production (3.2) of G in order to make sense of this.) Again, we have presented only a very few of these arcs in Figure 2. The entire goto table (Table 5) will be represented by ε-self-loops on each of the peripheral states, state q_# included. Thus, corresponding to the three entries in the first row of that table, there will be arcs labeled ε, 0 expr; 0 expr 1 and ε, 0 term; 0 term 2 and ε, 0 fac; 0 fac 3 on state q_+ and on every other peripheral state.

(8) The single Accept action of the action table will be reflected in a self-loop labeled ε, # 0 expr 1; ε at state q_# (see Figure 2).

Table 4. Action Table (partial)

Table 5. Goto Table (partial)

Figure 2.
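The driver loop implicit in steps (1) through (8) can be sketched in a few lines of Python. Since the full action/goto tables for G run to 22 rows, we illustrate with the miniature grammar S → ( S ) | x (our own toy example, not the paper's), whose SLR(1) tables fit on the page; the driver itself is the same shift/reduce/goto/accept cycle described above.

```python
# SLR(1) driver sketch for the toy grammar  S -> ( S ) | x
# (the 22-state tables for G of Example 2 are too large to list here).
# ACTION[state][token] is ('s', next_state), ('r', lhs, rhs_len), or
# ('acc',); GOTO[state][nonterminal] is the state pushed after a reduce.
ACTION = {
    0: {"(": ("s", 2), "x": ("s", 3)},
    1: {"#": ("acc",)},
    2: {"(": ("s", 2), "x": ("s", 3)},
    3: {")": ("r", "S", 1), "#": ("r", "S", 1)},   # reduce S -> x
    4: {")": ("s", 5)},
    5: {")": ("r", "S", 3), "#": ("r", "S", 3)},   # reduce S -> ( S )
}
GOTO = {0: {"S": 1}, 2: {"S": 4}}

def slr_parse(tokens):
    stack = [0]                       # symbols alternating with state numerals
    tokens = tokens + ["#"]           # '#' marks end of input
    i = 0
    while True:
        act = ACTION.get(stack[-1], {}).get(tokens[i])
        if act is None:
            return False              # no table entry: reject
        if act[0] == "s":             # shift: push token and next state
            stack += [tokens[i], act[1]]
            i += 1
        elif act[0] == "r":           # reduce: pop rhs, push lhs and goto state
            _, lhs, n = act
            del stack[-2 * n:]
            goto = GOTO.get(stack[-1], {}).get(lhs)
            if goto is None:
                return False
            stack += [lhs, goto]
        else:                         # accept
            return True
```

Note that only shift moves consume input; reduce and goto moves (the ε-moves of M) merely rewrite the stack, which is the observation exploited in the complexity analysis below.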
2.3. The Complexity of SLR Parsing

As in the case of LL parsing, reflection upon deterministic pushdown automata such as machine M of Figure 2 enables us to give worst-case time and space analyses for SLR parsing.

(1) First, note that, after entering central state q_1, machine M makes a single-step transition from q_1 to some peripheral state q_a for each token a within its input stream. That transition does not itself alter M's stack. However, M will eventually make a transition from q_a back to q_1, simultaneously pushing a, together with some state numeral, onto its stack.

(2) Since such periphery-to-center transitions are the only instructions that strictly increase the height of the stack, one can readily see that the height of the stack at any point during M's computation is O(n), where n is the length of the token stream. Thus M computes in O(n) space.

(3) Further, while at peripheral state q_a, machine M executes a number of ε-moves. These ε-moves occur in pairs: an ε-move implementing a reduce action, followed by an ε-move implementing a goto action. Again, the number of such ε-moves executed during any one stint at q_a is O(1): it depends upon G and not upon n.

(4) Putting (1) through (3) together, we have established the following proposition.

Theorem 2. SLR(1) parsing executes in time and space that are linear in the length of the token stream. The same is true of LR(1) parsing.

Of course, Theorem 2 does not take into account the cost of the computations required to design the SLR parser itself. That is as it should be, since the cost of parser construction is a one-time cost that we should not charge to the parsing process itself. As for the generalization to LR(1) parsing, we remind the reader that the driver routines of SLR and LR parsing do not differ. Rather, the difference between LR and SLR parsing is a matter of the size of the respective parse tables.
In the present context, this means that the general structure of the transition diagrams of implementing pushdown automata will be the same. What will change is the number of ε-moves at peripheral states, as well as the number of center-to-periphery and periphery-to-center transitions. Consequently, the foregoing analyses for SLR parsing are applicable to LR parsing as well.

3. Summary

We have seen that both LL and LR parsing can be implemented by deterministic pushdown-stack automata and, moreover, that both algorithms can be carried out in linear time. A more careful statement of the situation is the following. Given any LL(1) (respectively, LR(1)) grammar G for context-free language L, a deterministic pushdown-stack automaton M_G can be constructed such that M_G parses an arbitrary string of length n, over the set of terminal symbols of G, in O(n) steps. This having been said, there do exist context-free languages generated by no LR(1) grammar, and the class of LL(1) grammars is still more restrictive. Of course, these grammar classes have not been defined above (see [3]). Suffice it to say here that they are precisely the grammars for which our deterministic automaton construction techniques work.

Acknowledgements

The author wishes to thank Jane Stanton for editorial assistance. He also acknowledges a debt to the late Matthew Smosna, in whose superb course at New York University he acquired a first exposure to compiler design theory.

References

[1] Fischer, Charles N., and LeBlanc, Richard J., Jr. Crafting a Compiler. Benjamin/Cummings, Menlo Park, California, 1988.
[2] Parsons, Thomas W. Introduction to Compiler Construction. Computer Science Press, New York, 1992.
[3] Sorenson, Paul G., and Tremblay, Jean-Paul. The Theory and Practice of Compiler Writing. McGraw-Hill, New York, 1985.