Parsing

A parser is an algorithm that determines whether a given input string is in a language and, as a side effect, usually produces a parse tree for the input. There is a procedure for generating a parser from a given context-free grammar.

Recursive-Descent Parsing

Recursive-descent parsing is one of the simplest parsing techniques used in practice. Recursive-descent parsers are top-down parsers: they construct the parse tree from the top down (rather than from the bottom up).

The basic idea of recursive-descent parsing is to associate each non-terminal with a procedure. The goal of each such procedure is to read a sequence of input characters that can be generated by the corresponding non-terminal, and to return a pointer to the root of the parse tree for that non-terminal. The structure of the procedure is dictated by the productions for the corresponding non-terminal.

The procedure attempts to "match" the right hand side of some production for a non-terminal.

To match a terminal symbol, the procedure compares the terminal symbol to the input; if they agree, then the procedure is successful, and it consumes the terminal symbol in the input (that is, moves the input cursor over one symbol).

To match a non-terminal symbol, the procedure simply calls the corresponding procedure for that non-terminal symbol (which may be a recursive call, hence the name of the technique).

Recursive-Descent Parser for Expressions

Consider the following grammar for expressions (we'll look at the reasons for the peculiar structure of this grammar later):

    1. <E>  --> <T> <E*>
    2. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
    3. <T>  --> <F> <T*>
    4. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
    5. <F>  --> ( <E> ) | number

We create a procedure for each of the non-terminals. According to production 1, the procedure to match expressions (<E>) must match a term (by calling the procedure for <T>) and then the rest of the expression (by calling the procedure for <E*>).

    procedure E;
        T;
        Estar;

Some procedures, such as the one for <E*>, must examine the input to determine which production to choose.

    procedure Estar;
        if NextInputChar = "+" or NextInputChar = "-" then
            read(NextInputChar);
            T;
            Estar;

We will append a special marker symbol (ENDM) to the input string; this marker symbol notifies the parser that the entire input has been seen. The parser must also recognize the end marker after seeing a complete expression. Because E is also called recursively from F for a parenthesized subexpression (where the next symbol is ")" rather than ENDM), the check is placed in a separate top-level procedure, S, which calls E and then looks for ENDM.

Top-Down Parser for Expressions

    procedure S;
        E;
        if NextInputChar = ENDM then
            /* done */
        else
            print("syntax error")

    procedure E;
        T;
        Estar;

    procedure Estar;
        if NextInputChar = "+" or NextInputChar = "-" then
            read(NextInputChar);
            T;
            Estar;

    procedure T;
        F;
        Tstar;

    procedure Tstar;
        if NextInputChar = "*" or NextInputChar = "/" then
            read(NextInputChar);
            F;
            Tstar;

    procedure F;
        if NextInputChar = "(" then
            read(NextInputChar);
            E;
            if NextInputChar = ")" then
                read(NextInputChar)
            else
                print("syntax error")
        else if NextInputChar = number then
            read(NextInputChar)
        else
            print("syntax error");

Tracing the Parser

As an example, consider the following input: 1 + (2 * 3) / 4. We begin by calling the procedure for <E>. Indentation shows the depth of the procedure calls.

    NextInputChar = "1"
    Call E
      Call T
        Call F                        /* Match 1 with F */
          NextInputChar = "+"
        Call Tstar                    /* Match epsilon */
      Call Estar                      /* Match + */
        NextInputChar = "("
        Call T
          Call F                      /* Match (, looking for E ) */
            NextInputChar = "2"
            Call E
              Call T
                Call F                /* Match 2 with F */
                  NextInputChar = "*"
                Call Tstar            /* Match * */
                  NextInputChar = "3"
                  Call F              /* Match 3 with F */
                    NextInputChar = ")"
                  Call Tstar          /* Match epsilon */
              Call Estar              /* Match epsilon */
            NextInputChar = "/"       /* Match ")" */
          Call Tstar                  /* Match "/" */
            NextInputChar = "4"
            Call F                    /* Match 4 with F */
              NextInputChar = ENDM
            Call Tstar                /* Match epsilon */
        Call Estar                    /* Match epsilon */
    /* Back at the top level: match ENDM */

Observations about Recursive-Descent Parser

In procedures Estar and Tstar, we match one of the productions with an arithmetic operator if we see such an operator in the input; otherwise we simply return. A procedure that returns without matching any symbols is, in effect, choosing the epsilon production. In our expression parser, we choose the epsilon production only if NextInputChar doesn't match the first terminal on the right hand side of any of the other productions.
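
The procedures traced above can be sketched in Python. This is only a minimal illustration, not the notes' own code: the `Parser` class, the `tokenize` helper, and the `accepts` wrapper are hypothetical names, the tokenizer assumes unsigned integers, and the end-marker check lives in a separate `parse` method so that `E` can be called recursively from `F`. Note how `Estar` and `Tstar` choose the epsilon production simply by returning.

```python
import re

ENDM = "ENDM"  # end marker appended to the token list


class ParseError(Exception):
    """Raised where the pseudocode prints "syntax error"."""


def tokenize(text):
    # Hypothetical scanner: unsigned integers, or any other single
    # non-blank character (unknown characters are rejected by the parser).
    return re.findall(r"\d+|\S", text) + [ENDM]


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def read(self):
        # Consume the current token (only called after a successful match).
        self.pos += 1

    def parse(self):
        # Top level: an expression followed by the end marker.
        self.E()
        if self.lookahead != ENDM:
            raise ParseError(f"unexpected {self.lookahead!r}")

    def E(self):
        self.T()
        self.Estar()

    def Estar(self):
        if self.lookahead in ("+", "-"):
            self.read()
            self.T()
            self.Estar()
        # Otherwise return without consuming input: the epsilon production.

    def T(self):
        self.F()
        self.Tstar()

    def Tstar(self):
        if self.lookahead in ("*", "/"):
            self.read()
            self.F()
            self.Tstar()
        # Otherwise: the epsilon production.

    def F(self):
        if self.lookahead == "(":
            self.read()
            self.E()
            if self.lookahead != ")":
                raise ParseError("expected ')'")
            self.read()
        elif self.lookahead.isdigit():
            self.read()
        else:
            raise ParseError(f"unexpected {self.lookahead!r}")


def accepts(text):
    """Return True if text is in the language of the expression grammar."""
    try:
        Parser(text).parse()
        return True
    except ParseError:
        return False
```

With these definitions, `accepts("1 + (2 * 3) / 4")` returns `True`, while `accepts("1 + ")` and `accepts("(1")` return `False`.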

We never attempt to read beyond the end marker (ENDM), which is matched only at the end of an expression. In all other circumstances, the presence of the end marker signals a syntax error.

As written, our recursive-descent parser only determines whether or not the input string is in the language of the grammar; it does not give the structure of the string according to the grammar. We could easily build a parse tree incrementally during parsing.

Lookahead in Recursive-Descent Parsing

In order to implement a recursive-descent parser for a grammar, it must be possible, for each non-terminal in the grammar, to determine which production to apply for that non-terminal by looking only at the current input symbol. (We want to avoid having the compiler or other text processing program scan ahead in the input to determine what action to take next.)

The lookahead symbol is simply the next terminal that we will try to match in the input. We use a single lookahead symbol to decide which production to match.

Consider a production A --> X1...Xm. We need to know the set of possible lookahead symbols that indicate this production is to be chosen. This set consists of the terminal symbols that can be produced by X1...Xm (each of which may be a terminal or a non-terminal). Since the lookahead is only a single terminal symbol, we want the first (i.e., leftmost) symbol that could be produced by X1...Xm. We denote the set of symbols that could be produced first by X1...Xm as First(X1...Xm).

First Sets

To distinguish two productions with the same non-terminal on the left hand side, we examine the First sets of their right hand sides. Given the production A --> X1...Xm, we must determine First(X1...Xm). We first consider the leftmost symbol, X1. If X1 is a terminal symbol, then First(X1...Xm) = {X1}. If X1 is a non-terminal, then we compute the First sets for each right hand side corresponding to X1. In our expression grammar above:

    First(<E>)       = First(<T> <E*>)
    First(<T> <E*>)  = First(<T>)
    First(<T>)       = First(<F> <T*>)
    First(<F> <T*>)  = First(<F>) = {(, number}

If X1 can generate epsilon, then X1 can (in effect) be erased, and First(X1...Xm) also depends on X2. If X2 is a terminal, it is included in First(X1...Xm). If X2 is a non-terminal, we compute the First sets for each of its corresponding right hand sides. Similarly, if both X1 and X2 can produce epsilon, we consider X3, then X4, etc.

Follow Sets

Suppose we are attempting to compute the lookahead symbols that suggest the production A --> X1...Xm. What if each of the Xi can produce epsilon? If the entire right hand side of a production can produce epsilon, then the lookahead for A is determined by those terminal symbols that can follow A in a parse. We denote the set of terminal symbols that can follow a non-terminal A in a parse as Follow(A).

We inspect the grammar for all occurrences of the non-terminal A. In each production, A is either:

- followed by a terminal symbol x, so x is in Follow(A);
- followed by a non-terminal symbol B, so Follow(A) includes First(B) (and, if B can produce epsilon, whatever can follow B in that production as well); or
- at the end of a production for some non-terminal S (as in S --> Y1...YmA), in which case Follow(A) includes Follow(S).

First and Follow Sets for Expression Grammar

We compute the First and Follow sets for our expression grammar, augmented with a new start symbol whose production includes ENDM:

    1. <S>  --> <E> ENDM
    2. <E>  --> <T> <E*>
    3. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
    4. <T>  --> <F> <T*>
    5. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
    6. <F>  --> ( <E> ) | number

    First(<E>)   = First(<T> <E*>) = First(<T>)
    First(<E*>)  = {+} U {-} U Follow(<E*>)
    Follow(<E*>) = Follow(<E>) = {), ENDM}
    First(<E*>)  = {+, -, ), ENDM}
    First(<T>)   = First(<F> <T*>) = First(<F>)
    First(<T*>)  = {*} U {/} U Follow(<T*>)
    Follow(<T*>) = Follow(<T>) = First(<E*>)
    First(<T*>)  = {*, /, +, -, ), ENDM}
    First(<F>)   = {(, number}

Note that, because <E*> and <T*> can produce epsilon, the sets listed for them here already include their Follow sets; that is, they are the complete sets of lookahead symbols for these non-terminals.

LL(1) Grammars for Recursive-Descent Parsing

The set of lookahead symbols that will cause the selection (i.e., prediction) of the production A --> X1...Xm is

    Predict(A --> X1...Xm) = First(X1...Xm) U Follow(A)    if X1...Xm can produce epsilon
                           = First(X1...Xm)                otherwise

That is, any symbol that can be the first symbol produced by the right hand side of a production will predict that production. Further, if the entire right hand side can produce epsilon, then symbols that can immediately follow the left hand side of the production will also predict that production.

If, for two productions

    1. A --> X1...Xm
    2. A --> Y1...Yn

we have some symbol s for which

    1. s is in Predict(A --> X1...Xm)
    2. s is in Predict(A --> Y1...Yn)

then we cannot in general know which production to select by looking at a single input symbol.

A recursive-descent parser with a single lookahead symbol can only parse those CFGs that have disjoint predict sets for productions that share a common left hand side. CFGs that obey this restriction are called LL(1).
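
The Predict sets and the LL(1) test can be computed mechanically. The sketch below is one possible way to do it in Python, not the notes' own procedure: it uses the common textbook convention in which First sets contain only terminals and the ability to produce epsilon is tracked in a separate `nullable` set, so Follow sets are added only when Predict is formed. All function names and the tuple-based grammar encoding are assumptions made for illustration.

```python
# Each production is (left hand side, right hand side); () is epsilon.
GRAMMAR = [
    ("S", ("E", "ENDM")),
    ("E", ("T", "E*")),
    ("E*", ("+", "T", "E*")),
    ("E*", ("-", "T", "E*")),
    ("E*", ()),
    ("T", ("F", "T*")),
    ("T*", ("*", "F", "T*")),
    ("T*", ("/", "F", "T*")),
    ("T*", ()),
    ("F", ("(", "E", ")")),
    ("F", ("number",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def compute_nullable():
    """Non-terminals that can produce epsilon (fixed-point iteration)."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def first_of(symbols, first, nullable):
    """Terminals that can begin a string produced by the symbol sequence."""
    result = set()
    for s in symbols:
        if s not in NONTERMINALS:      # a terminal starts the string
            result.add(s)
            break
        result |= first[s]
        if s not in nullable:          # s cannot be erased, so stop here
            break
    return result


def compute_first(nullable):
    first, changed = {n: set() for n in NONTERMINALS}, True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            new = first_of(rhs, first, nullable)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


def compute_follow(first, nullable):
    follow, changed = {n: set() for n in NONTERMINALS}, True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, s in enumerate(rhs):
                if s not in NONTERMINALS:
                    continue
                rest = rhs[i + 1:]
                new = first_of(rest, first, nullable)
                if all(r in nullable for r in rest):   # s can end the production
                    new |= follow[lhs]
                if not new <= follow[s]:
                    follow[s] |= new
                    changed = True
    return follow


def predict_sets():
    """Predict set of every production, in the order listed in GRAMMAR."""
    nullable = compute_nullable()
    first = compute_first(nullable)
    follow = compute_follow(first, nullable)
    result = []
    for lhs, rhs in GRAMMAR:
        p = first_of(rhs, first, nullable)
        if all(s in nullable for s in rhs):            # right hand side can vanish
            p |= follow[lhs]
        result.append(p)
    return result


def is_ll1(predict):
    """True if productions sharing a left hand side have disjoint predict sets."""
    seen = {}
    for (lhs, _), p in zip(GRAMMAR, predict):
        if seen.setdefault(lhs, set()) & p:
            return False
        seen[lhs] |= p
    return True
```

For this grammar, `is_ll1(predict_sets())` returns `True`, and the predict set of `<E*> --> epsilon` comes out as `{")", "ENDM"}`, which agrees with Follow(<E*>) as computed above.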

From experience we know that it is usually possible to create an LL(1) CFG for a programming language. However, not all CFGs are LL(1), and a CFG that is not LL(1) may still be parsable using some other (usually more complex) parsing technique.

Creating LL(1) Grammars

Recursive-descent parsing can only parse grammars that have disjoint predict sets for productions that share a common left hand side. Two common properties of grammars that violate this condition are:

- Left recursion: any grammar containing productions with left recursion, that is, productions of the form A --> A X1...Xm, cannot be LL(1). The problem is that expanding A consumes no input, so any symbol that predicts this production the first time will, of necessity, continue to predict this production forever (and never be matched).

- Common prefix: any grammar containing two productions for the same non-terminal that share a common prefix on the right hand side cannot be LL(1). The problem is that any symbol that predicts the first production must also predict the second; since the predict sets for the two productions are not disjoint, the grammar is not LL(1).

Creating an LL(1) Grammar

Consider the following grammar for expressions:

    1. <E> --> <E> + <T>
    2. <E> --> <E> - <T>
    3. <E> --> <T>
    4. <T> --> <T> * <F>
    5. <T> --> <T> / <F>
    6. <T> --> <F>
    7. <F> --> ( <E> )
    8. <F> --> number

This grammar has left recursion, and therefore cannot be LL(1). We can replace the use of left recursion with right recursion as follows:

    1. <E> --> <T> + <E>
    2. <E> --> <T> - <E>
    3. <E> --> <T>
    4. <T> --> <F> * <T>
    5. <T> --> <F> / <T>
    6. <T> --> <F>
    7. <F> --> ( <E> )
    8. <F> --> number

The resulting grammar is still not LL(1); productions 1-3 share a common prefix, as do productions 4-6. We can eliminate the common prefix by deferring the decision as to which production to pick until after seeing the common prefix. This technique is called factoring the common prefix.

    1. <E>  --> <T> <E*>
    2. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
    3. <T>  --> <F> <T*>
    4. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
    5. <F>  --> ( <E> ) | number

Table-Driven Parsing

In recursive-descent parsing, the decision as to which production to choose for a particular non-terminal is hard-coded into the procedure for the non-terminal. The procedure uses the Predict sets (computed from the First and Follow sets) for the grammar to decide which production to choose based on the lookahead symbol.

The problem with recursive-descent parsing is that it is inflexible; changes in the grammar can cause significant (and in some cases non-obvious) changes to the parser.

Since recursive-descent parsing uses an implicit stack of procedure calls, it is possible to replace the parsing procedures and implicit stack with an explicit stack and a single parsing procedure that manipulates the stack. In this scheme, we encode the actions the parsing procedure should take in a table. This table can be generated automatically (with the grammar as input), which is why this approach adapts more easily to changes in the grammar.

A Table-Driven Parser

The parse table encodes the choice of production as a function of the current non-terminal of interest and the lookahead symbol.

    T: Non-terminals x Terminals -> Productions U {Error}

The entry T[A,x] gives the production number to choose when A is the non-terminal of interest and x is the current input symbol. The table is a mapping from non-terminals x terminals to productions.

    T[A,x] == A --> X1...Xm    if x is in Predict(A --> X1...Xm)
    T[A,x] == Error            otherwise

The driver procedure is very simple. It stacks symbols that are to be matched or expanded. Terminal symbols on the stack must match an input symbol; non-terminal symbols are expanded via the Predict function (which is encoded in the parse table).

Parse Table for Expressions

Here is an LL(1) expression grammar, augmented to include the end marker:

     1. <S>  --> <E> ENDM
     2. <E>  --> <T> <E*>
     3. <E*> --> + <T> <E*>
     4. <E*> --> - <T> <E*>
     5. <E*> --> epsilon
     6. <T>  --> <F> <T*>
     7. <T*> --> * <F> <T*>
     8. <T*> --> / <F> <T*>
     9. <T*> --> epsilon
    10. <F>  --> ( <E> )
    11. <F>  --> number

The table for this expression grammar is (where a blank entry corresponds to an error):

            (     )     +     -     *     /     number   ENDM
    S       1                                   1
    E       2                                   2
    E*            5     3     4                          5
    T       6                                   6
    T*            9     9     9     7     8              9
    F       10                                  11

This table is constructed from the Predict sets described earlier.
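
The construction of the table from the Predict sets can be sketched in a few lines of Python. The Predict sets below are read off the table above; the names `PREDICT`, `ERROR`, `build_table`, and `lookup`, and the use of 0 for an error entry, are assumptions made for illustration.

```python
ERROR = 0  # assumed encoding of a blank (error) entry

# Production number -> (left hand side, Predict set of that production).
PREDICT = {
    1: ("S", {"(", "number"}),
    2: ("E", {"(", "number"}),
    3: ("E*", {"+"}),
    4: ("E*", {"-"}),
    5: ("E*", {")", "ENDM"}),
    6: ("T", {"(", "number"}),
    7: ("T*", {"*"}),
    8: ("T*", {"/"}),
    9: ("T*", {"+", "-", ")", "ENDM"}),
    10: ("F", {"("}),
    11: ("F", {"number"}),
}


def build_table(predict):
    """Set T[A, x] = n for every x in the Predict set of production n."""
    table = {}
    for number, (lhs, lookaheads) in predict.items():
        for x in lookaheads:
            if (lhs, x) in table:
                # Two productions for lhs are predicted by x: not LL(1).
                raise ValueError(f"conflict at T[{lhs}, {x}]")
            table[(lhs, x)] = number
    return table


def lookup(table, nonterminal, terminal):
    """Return the production number, or ERROR for a blank entry."""
    return table.get((nonterminal, terminal), ERROR)
```

A conflict raised by `build_table` corresponds exactly to two productions with the same left hand side whose Predict sets are not disjoint.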

Driver Procedure

Under table-driven parsing, there is a single procedure that "interprets" the parse table. This "driver" procedure takes the following form:

    procedure Parser;
        /* Push the start symbol S onto the stack */
        Push(S, stack)
        /* Initialize lookahead symbol */
        scanner(NextInputSymbol)
        while not Empty(stack) do
            top = Top(stack)
            if top is a non-terminal then
                action = ParseTable[top, NextInputSymbol]
                if action > 0 then
                    /* Pop top symbol */
                    Pop(stack)
                    /* Push RHS of production, rightmost symbol first,
                       so that the leftmost symbol ends up on top */
                    for each symbol on RHS #action, from right to left, do
                        Push(symbol, stack)
                else
                    print("syntax error"); stop
            else if NextInputSymbol == top then
                /* Match terminal symbol in input */
                Pop(stack)
                /* Get next terminal symbol in input */
                scanner(NextInputSymbol)
            else
                print("syntax error"); stop

Example Parse

Let's trace the parse for the input 1 + (2 * 3) / 4 ENDM. The top of the stack is at the left, and N stands for the terminal number. A numeric action is the number of the production applied; Pop matches the terminal on top of the stack against the input.

         Stack contents          Current input         Action
     1:  S                       1 + (2 * 3) / 4 ENDM  1
     2:  E ENDM                  1 + (2 * 3) / 4 ENDM  2
     3:  T E* ENDM               1 + (2 * 3) / 4 ENDM  6
     4:  F T* E* ENDM            1 + (2 * 3) / 4 ENDM  11
     5:  N T* E* ENDM            1 + (2 * 3) / 4 ENDM  Pop
     6:  T* E* ENDM              + (2 * 3) / 4 ENDM    9
     7:  E* ENDM                 + (2 * 3) / 4 ENDM    3
     8:  + T E* ENDM             + (2 * 3) / 4 ENDM    Pop
     9:  T E* ENDM               (2 * 3) / 4 ENDM      6
    10:  F T* E* ENDM            (2 * 3) / 4 ENDM      10
    11:  ( E ) T* E* ENDM        (2 * 3) / 4 ENDM      Pop
    12:  E ) T* E* ENDM          2 * 3) / 4 ENDM       2
    13:  T E* ) T* E* ENDM       2 * 3) / 4 ENDM       6
    14:  F T* E* ) T* E* ENDM    2 * 3) / 4 ENDM       11
    15:  N T* E* ) T* E* ENDM    2 * 3) / 4 ENDM       Pop
    16:  T* E* ) T* E* ENDM      * 3) / 4 ENDM         7
    17:  * F T* E* ) T* E* ENDM  * 3) / 4 ENDM         Pop
    18:  F T* E* ) T* E* ENDM    3) / 4 ENDM           11
    19:  N T* E* ) T* E* ENDM    3) / 4 ENDM           Pop
    20:  T* E* ) T* E* ENDM      ) / 4 ENDM            9
    21:  E* ) T* E* ENDM         ) / 4 ENDM            5
    22:  ) T* E* ENDM            ) / 4 ENDM            Pop
    23:  T* E* ENDM              / 4 ENDM              8
    24:  / F T* E* ENDM          / 4 ENDM              Pop
    25:  F T* E* ENDM            4 ENDM                11
    26:  N T* E* ENDM            4 ENDM                Pop
    27:  T* E* ENDM              ENDM                  9
    28:  E* ENDM                 ENDM                  5
    29:  ENDM                    ENDM                  Pop
    30:  (empty)                 (empty)               Done!
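
The driver procedure and the trace above can be reproduced with a short Python sketch. It assumes the input has already been scanned into terminal names (every numeric literal appears as the token "number"); the names `PRODUCTIONS`, `TABLE`, `ParseError`, and `parse` are hypothetical, and the table is written as a nested dictionary in which a missing entry means error.

```python
# Right hand sides of productions 1-11 of the augmented expression grammar.
PRODUCTIONS = {
    1: ["E", "ENDM"],
    2: ["T", "E*"],
    3: ["+", "T", "E*"],
    4: ["-", "T", "E*"],
    5: [],                      # epsilon
    6: ["F", "T*"],
    7: ["*", "F", "T*"],
    8: ["/", "F", "T*"],
    9: [],                      # epsilon
    10: ["(", "E", ")"],
    11: ["number"],
}

# Parse table: TABLE[non-terminal][lookahead] -> production number.
TABLE = {
    "S": {"(": 1, "number": 1},
    "E": {"(": 2, "number": 2},
    "E*": {")": 5, "+": 3, "-": 4, "ENDM": 5},
    "T": {"(": 6, "number": 6},
    "T*": {")": 9, "+": 9, "-": 9, "*": 7, "/": 8, "ENDM": 9},
    "F": {"(": 10, "number": 11},
}


class ParseError(Exception):
    """Raised where the pseudocode prints "syntax error"."""


def parse(tokens):
    """Parse a list of terminal names; return the sequence of actions taken."""
    tokens = list(tokens) + ["ENDM"]   # append the end marker
    pos = 0                            # index of the lookahead symbol
    stack = ["S"]                      # top of the stack is the end of the list
    actions = []
    while stack:
        top, lookahead = stack[-1], tokens[pos]
        if top in TABLE:               # non-terminal: expand via the table
            action = TABLE[top].get(lookahead, 0)
            if action == 0:
                raise ParseError(f"unexpected {lookahead!r} while expanding {top}")
            stack.pop()
            # Push the right hand side in reverse so its first symbol is on top.
            stack.extend(reversed(PRODUCTIONS[action]))
            actions.append(action)
        elif top == lookahead:         # terminal: match it and advance the input
            stack.pop()
            pos += 1
            actions.append("Pop")
        else:
            raise ParseError(f"expected {top!r}, found {lookahead!r}")
    return actions
```

Calling `parse(["number", "+", "(", "number", "*", "number", ")", "/", "number"])` yields the Action column of steps 1 through 29 in the trace: 1, 2, 6, 11, Pop, 9, 3, Pop, and so on, ending with the Pop that matches ENDM.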