CS 842 Ben Cassell University of Waterloo
Recursive Descent Re-Cap
- Top-down parser: works down the parse tree using the formal grammar.
- Built from mutually recursive procedures; typically these procedures represent the production rules of the grammar.
- Lends itself very well to implementation in both functional languages and simple non-functional equivalents.
Sample Recursive Descent Parser
Consider the language defined as follows:

A → a + B
B → A
B → a

Generates words such as:

a + a
a + a + a
a + a + a + a

Based on samples available at http://www.cs.engr.uky.edu/~lewis/essays/compilers/rec-des.html
parsea([parse, input]):
    if isfirst(input, 'a') and isfirst(rest(input), '+') then
        return parseb([parse, rest(rest(input))])
    else
        return [false, []]

parseb([parse, input]):
    if isfirst(input, '(') then
        let [parse2, input2] = parsea([parse, rest(input)])
        if parse2 and isfirst(input2, ')') then
            return [parse, rest(input2)]
        else
            return [false, []]
    else if isfirst(input, 'a') then
        return [parse, rest(input)]
    else
        return [false, []]
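The pseudocode above can be sketched in runnable Python. This is my own transliteration: a result is an (ok, remaining-tokens) pair, and parseb, following the linked samples, also accepts a parenthesized A.

```python
def parse_a(tokens):
    """A -> a + B: check the first two tokens, then hand off to B."""
    if len(tokens) >= 2 and tokens[0] == 'a' and tokens[1] == '+':
        return parse_b(tokens[2:])
    return (False, [])

def parse_b(tokens):
    """B -> ( A ) | a, mirroring parseb above."""
    if tokens and tokens[0] == '(':
        ok, rest = parse_a(tokens[1:])
        if ok and rest and rest[0] == ')':
            return (True, rest[1:])
        return (False, [])
    if tokens and tokens[0] == 'a':
        return (True, tokens[1:])
    return (False, [])

def accepts(text):
    """True when the whole token sequence is derivable from A."""
    ok, rest = parse_a(list(text.replace(' ', '')))
    return ok and not rest
```

For example, accepts("a + a") and accepts("a + (a + a)") hold, while a lone "a" is rejected because A always begins with "a +".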
Types of Recursive Descent Parsers
At a high level there are two primary types of recursive descent parsers:
- Predictive parsers.
- Backtracking parsers.
Predictive Parsing
- Parses the class of LL(k) grammars.
- Uses the k-token lookahead to determine which production rule to choose.
- Cannot handle some ambiguous grammars (depending on lookahead) nor left recursion, which is fine because LL(k) does not include any such grammars!
- It should be noted that you don't necessarily create an LL(k) grammar by removing left recursion.
- Predictive parsing of LL(k) grammars runs in linear time.
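As a minimal illustration (a toy grammar of my own, not from the slides): for the LL(1) grammar S → a S | b, a single token of lookahead is enough to pick the production, with no backtracking.

```python
def parse_s(tokens, i=0):
    """Return the index just past a parsed S, choosing the rule by lookahead."""
    if i < len(tokens) and tokens[i] == 'a':   # lookahead 'a' selects S -> a S
        return parse_s(tokens, i + 1)
    if i < len(tokens) and tokens[i] == 'b':   # lookahead 'b' selects S -> b
        return i + 1
    raise SyntaxError("no production matches")

def accepts(text):
    try:
        return parse_s(list(text)) == len(text)
    except SyntaxError:
        return False
```

Each token is examined once, which is where the linear running time comes from.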
Breaking Predictive Parsing
Left recursion:

A → A a
A → ϵ

Ambiguous grammar:

A → A + A
A → a
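A sketch of why the left-recursive rule defeats a naive recursive descent parser: translating A → A a directly yields a self-call that consumes no input. Python's RecursionError stands in for non-termination here (this demo is my own, not from the slides).

```python
def parse_a(tokens, i=0):
    """Naive translation of A -> A a | epsilon."""
    j = parse_a(tokens, i)      # left-recursive call at the SAME position
    if j < len(tokens) and tokens[j] == 'a':
        return j + 1
    return i                    # A -> epsilon (never reached)

def demo():
    try:
        parse_a(['a', 'a'])
        return "terminated"
    except RecursionError:
        return "stuck in left recursion"
```

The first action of parse_a is to call itself at the same position, so no progress is ever made.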
Backtracking Parsers
- Although k-lookahead parsers are very fast, they are also limited.
- A backtracking parser attempts production rules in turn, rewinding on an error and trying other alternatives.
- Backtracking parsers can handle a much larger variety of languages, but might not terminate on non-LL(k) languages.
- Even when they do terminate, backtracking parsers can require exponential time to run if implemented naively.
A Potentially Slow Example
Consider the ambiguous grammar:

A → A a A
A → ϵ

This will return an exponential number of valid parses for any series of a's!
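Reading the slide's grammar as A → A a A | ϵ (my reconstruction of the garbled rule, and the standard example of an exponentially ambiguous grammar), the number of parse trees for a string of n a's follows the Catalan recurrence, which a naive counting sketch makes plain:

```python
def count_parses(n):
    """Distinct parse trees for 'a' * n under A -> A a A | epsilon."""
    if n == 0:
        return 1                      # only A -> epsilon derives the empty string
    total = 0
    for k in range(n):                # which 'a' the outermost A a A consumes
        total += count_parses(k) * count_parses(n - 1 - k)
    return total
```

The counts run 1, 1, 2, 5, 14, 42, ... (the Catalan numbers), so a backtracking parser that enumerates every parse is doing exponential work.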
Memoization
Consider the following implementation of the Fibonacci numbers in Haskell:

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n - 2) + fib (n - 1)

As calls to this function get larger, they will become significantly slower.
Now, consider this implementation instead:

fib :: Int -> Integer
fib = (map fibrec [0 ..] !!)
  where fibrec 0 = 0
        fibrec 1 = 1
        fibrec n = fib (n - 2) + fib (n - 1)

Similar techniques can be manually applied to backtracking parsers to significantly improve performance. Why run the same production rule multiple times if you already know what it will produce?
Based on samples available at http://www.haskell.org/haskellwiki/memoization
Extending Memoization
Wouldn't it be great if the technique for memoization could be extended out into a general form that could simply be plugged in to our other functions? Maybe we could even use similar techniques to solve some of the problems with recursive descent parsers, if applied correctly!
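In Python, such a plug-in form is naturally a decorator; a minimal sketch (my own, assuming hashable positional arguments):

```python
import functools

def memo(fn):
    """Cache fn's results, keyed by the argument tuple."""
    table = {}
    @functools.wraps(fn)
    def wrapper(*args):
        if args not in table:
            table[args] = fn(*args)
        return table[args]
    return wrapper

@memo
def fib(n):
    return n if n < 2 else fib(n - 2) + fib(n - 1)
```

Because the decorator rebinds the name fib, the recursive calls inside fib also go through the cache, so even fib(100) finishes instantly.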
Papers
- Techniques for Automatic Memoization with Applications to Context-Free Parsing. Peter Norvig (UC Berkeley), 1991.
- Memoization in Top-Down Parsing. Mark Johnson (Brown), 1995.
Techniques for Automatic Memoization
- Claims that an algorithm similar to Earley's algorithm can be generated from a backtracking parser using memoization.
- Earley's algorithm:
  - Top-down dynamic programming algorithm.
  - Maintains a set of states to examine.
  - Starts with only the top rule; as input is processed, new states are added to the set by prediction, scanning and completion.
  - Has O(n^3) time complexity, where n is the length of the string, and O(n^2) time complexity when the grammar is unambiguous.
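A compact recognizer sketch of those three steps (my own encoding, not the paper's: the grammar is a dict from nonterminal to tuples of symbols, and anything that is not a key is a terminal):

```python
def earley_recognize(grammar, start, tokens):
    """Earley items are (lhs, rhs, dot, origin); chart[i] is the state set
    for string position i."""
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        changed = True
        while changed:                      # iterate until the state set is stable
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in grammar:
                    for prod in grammar[rhs[dot]]:            # prediction
                        item = (rhs[dot], prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            changed = True
                elif dot < len(rhs) and i < n and tokens[i] == rhs[dot]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))  # scanning
                elif dot == len(rhs):
                    for l2, r2, d2, o2 in list(chart[origin]):     # completion
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
    return any(l == start and d == len(r) and o == 0
               for l, r, d, o in chart[n])
```

Note that a left-recursive grammar such as {'S': [('S', '+', 'a'), ('a',)]} is handled without trouble, which is exactly what the memoized backtracking parsers below cannot yet do.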
Memoizing Functions in General
Consider the following code from the paper:

(defun memo (fn)
  (let ((table (make-hash-table)))
    #'(lambda (x)
        (multiple-value-bind (val found) (gethash x table)
          (if found
              val
              (setf (gethash x table) (funcall fn x)))))))
Problems with the Implementation
- The function fn being memoized is required to both take and return one value. This is probably too restrictive to be useful.
- Also, more importantly, what if fn makes any recursive calls? Recursive calls will go to the original version of fn, and not the memoized version. This mostly defeats the point of memoizing, especially for functional parsing.
One Possible Solution
Globally rebind what fn-name points to:

(defun memoize (fn-name)
  (setf (symbol-function fn-name)
        (memo (symbol-function fn-name))))

This is highly useful, but still has a limitation: memoized functions can only have one argument. We should accept arbitrary arguments, and be able to index on arbitrary combinations of them.
Updated Version

(defun memo (fn &key (key #'first) (test #'eql) name)
  (let ((table (make-hash-table :test test)))
    (setf (get name 'memo) table)
    #'(lambda (&rest args)
        (let ((k (funcall key args)))
          (multiple-value-bind (val found) (gethash k table)
            (if found
                val
                (setf (gethash k table) (apply fn args))))))))
Updated Version (Continued)

(defun memoize (fn-name &key (key #'first) (test #'eql))
  (setf (symbol-function fn-name)
        (memo (symbol-function fn-name)
              :name fn-name :key key :test test)))
Notes About the New Version
- The hash table is stored on the property list of the function name, meaning it can be inspected, cleared or otherwise modified. Useful when the working set changes.
- The default key function is first, which is fine for a single argument. In Lisp, identity can be used to hash on all the arguments.
- The test defaults to eql. It can be changed to equal or other tests as desired; equal, for instance, requires more computational overhead but will prevent duplicated lists.
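A hypothetical Python analogue of the keyed version: key picks what to index on (the first argument by default, like :key #'first), and the table is exposed for inspection and clearing, standing in for the property-list entry.

```python
def memo(fn, key=lambda args: args[0]):
    """Memoize fn, indexing the cache by key(args)."""
    table = {}
    def wrapper(*args):
        k = key(args)
        if k not in table:
            table[k] = fn(*args)
        return table[k]
    wrapper.table = table        # inspectable/clearable, like the plist entry
    return wrapper
```

With the default key, a second call that differs only in later arguments is served from the cache; key=lambda args: args hashes on all arguments, playing the role of identity in the Lisp version.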
Using Memoize to Parse: A Simple Top-Down Parser

(defun parse (tokens start-symbol)
  (if (eq (first tokens) start-symbol)
      (list (make-parse :tree (first tokens) :rem (rest tokens)))
      (mapcan #'(lambda (rule)
                  (extend-parse (lhs rule) nil tokens (rhs rule)))
              (rules-for start-symbol))))

(defun extend-parse (lhs rhs rem needed)
  (if (null needed)
      (list (make-parse :tree (cons lhs rhs) :rem rem))
      (mapcan #'(lambda (p)
                  (extend-parse lhs (append rhs (list (parse-tree p)))
                                (parse-rem p) (rest needed)))
              (parse rem (first needed)))))
Adding Memoization

(memoize 'rules-for)
(memoize 'parse :test #'equal :key #'identity)

(defun parser (tokens start-symbol)
  (clear-memoize 'parse)
  (mapcar #'parse-tree
          (remove-if-not #'null
                         (parse tokens start-symbol)
                         :key #'parse-rem)))

parse returns all valid parses of all prefixes of the input; parser looks for completeness.
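The same parse/extend-parse scheme can be sketched in Python (my transliteration: a parse is a (tree, remaining-tokens) pair, the grammar is a dict, and functools.lru_cache plays the role of memoize; like the original, it cannot handle left recursion):

```python
from functools import lru_cache

def make_parser(grammar):
    """grammar: dict mapping a category to a list of right-hand-side tuples."""

    @lru_cache(maxsize=None)
    def parse(tokens, cat):
        """All parses of prefixes of tokens as category cat."""
        results = []
        if tokens and tokens[0] == cat:                   # terminal match
            results.append((tokens[0], tokens[1:]))
        for rhs in grammar.get(cat, []):
            results.extend(extend(cat, (), tokens, rhs))
        return results

    def extend(lhs, trees, rem, needed):
        if not needed:                                    # rule fully matched
            return [((lhs,) + trees, rem)]
        out = []
        for tree, rest in parse(rem, needed[0]):
            out.extend(extend(lhs, trees + (tree,), rest, needed[1:]))
        return out

    def parser(tokens, start):
        """Keep only the complete parses, as the paper's parser does."""
        parse.cache_clear()                               # clear-memoize
        return [tree for tree, rem in parse(tuple(tokens), start) if not rem]

    return parser
```

For the right-recursive grammar {'A': [('a', '+', 'A'), ('a',)]}, parser(list("a+a"), 'A') yields the single tree ('A', 'a', '+', ('A', 'a')).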
Limitations
- The algorithm is equivalent to Earley's (not proved), but with O(n^4) complexity. The asymptotic complexity is worse because of the use of equal over the remaining tokens.
- O(n^3) is achieved by adding a different type of hash table (a compromise between eql and equal), and a memoize that allows user-specified hash getter and putter functions.
- Explicitly cannot handle left recursion (the author mentions this directly).
- Hash tables that are not very carefully implemented can result in very poor performance.
Silver Linings
- The parser is exceedingly simple (15 lines of code or less).
- The technique applies beyond parsing: the author shows automatic memoization applied across various languages, such as Scheme and Pascal.
Memoization in Top-Down Parsing
- Goal: Discover why left recursion fails for memoized parsers, and present a memoization technique that can handle it.
- Takeaway: Instead of returning the set of right string positions of a category as a single value, return them incrementally.
Symbols Used Throughout
Uses symbols similar to The Functional Treatment of Parsing (Leermakers, 1993):
- S: Sentence (S → NP VP)
- N: Noun ("student", "professor")
- V: Verb ("likes", "knows")
- Det: Determiner ("every", "no")
- PN: Proper Name ("Kim", "Sandy")
- NP: Noun phrase (NP → PN | Det N)
- VP: Verb phrase (VP → V NP | V S)
The VP rule in Figure 1 of the paper has a typo: (seq (V S)) should read (seq V S) to properly represent VP → V S.
Formalizing Grammars
Johnson creates a recursive descent parser quite similar to Norvig's, and defines higher-order functions to simplify the process:
- reduce: Recursively applies a function across a list of arguments. For example, (reduce f x '(1 2 3)) reduces to (f (f (f x 1) 2) 3).
- union: Constructs a unique list from two lists.
- terminal: Maps a substring to a terminal if it matches, otherwise empty.
- seq: Recognizes a concatenation of substrings recognized by two functions.
Formalizing Grammars (Continued)
- alt: Recognizes the union of substrings recognized by two functions.
- epsilon: Recognizes the empty string.
- opt: Recognizes optional elements.
- k*: Recognizes the Kleene star of an element.
- recognize: Returns true if the string passed in can be parsed from the start symbol.
Language Problem
Consider the following examples:

(define S (seq NP VP))
(define VP (alt (seq V NP) (seq V S)))

In Scheme, the way these are written incurs a mutual recursion issue (the binding will fail). The fix for this is fairly straightforward:

(define-syntax vacuous
  (syntax-rules ()
    ((vacuous fn) (lambda args (apply fn args)))))

(define S (vacuous (seq NP VP)))
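The same late-binding trick can be sketched in Python with suffix-returning combinators (toy vocabulary and names are my own; a registry dict plus a lambda plays the role of the vacuous macro by deferring the rule lookup until parse time):

```python
rules = {}

def rule(name):
    """Defer the lookup of a rule, like the vacuous macro."""
    return lambda pos: rules[name](pos)

def terminal(word):
    # pos is a tuple of remaining words; return suffixes after a match
    return lambda pos: [pos[1:]] if pos and pos[0] == word else []

def seq(p, q):
    return lambda pos: [r2 for r1 in p(pos) for r2 in q(r1)]

def alt(p, q):
    return lambda pos: p(pos) + q(pos)

# S -> NP VP ; VP -> V NP | V S, with a toy vocabulary:
rules['S']  = seq(rule('NP'), rule('VP'))
rules['NP'] = terminal('Kim')
rules['V']  = alt(terminal('likes'), terminal('knows'))
rules['VP'] = alt(seq(rule('V'), rule('NP')), seq(rule('V'), rule('S')))

def recognize(words):
    """A string is recognized if some parse consumes every word."""
    return () in rules['S'](tuple(words))
```

Because rule('VP') only consults the registry when called, S can be defined before VP exists, exactly the problem vacuous solves in Scheme.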
More Involved Problems
- The provided rules act as a top-down parser; results are returned as a list of suffixes.
- We get our usual left-recursion problems.
- There is a non-trivial amount of re-computation that occurs by default. Memoization can prevent the re-computation; the presented memo function is a Scheme version of the Norvig technique. Now we can write:

(define S (memo (vacuous (seq NP VP))))

- As in the Norvig paper, however, this still doesn't allow the parsing of left-recursive grammars.
What's the Fundamental Problem?
- To memoize a result, it first needs to be fully computed, which a left-recursive call never finishes doing.
- Instead, memoize calls as they are made, and lazily evaluate the results as needed.
- To do this, Johnson uses continuation-passing style (CPS) with memoization: provide a function to return each result to.
- This effectively reverses the direction of computation from bottom-up to top-down.
Sample CPS Function
Traditional definition:

(define (square x) (* x x))

CPS definition:

(define (square cont x) (cont (* x x)))
(square display 10)
How do the Parsing Functions Change?
- Rules for a category A are now represented by functions (A c l) that reduce in such a way that (c r) only reduces if A can derive the string from position l to r.
- In other words, (c r) is evaluated zero or more times, with r bound to right string positions.
- This implies that instead of returning a set of string positions, we simply call the continuation for each result position instead.
- The terminal "will" could now be written as:

(define (future-aux continuation pos)
  (if (and (pair? pos) (eq? (car pos) 'will))
      (continuation (cdr pos))))

- The rule VP → V NP | V S could be written:

(define (VP continuation pos)
  (begin
    (V (lambda (pos1) (NP continuation pos1)) pos)
    (V (lambda (pos1) (S continuation pos1)) pos)))
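The two CPS rules above can be sketched in Python, using integer positions rather than the list suffixes of the Scheme code (names and factoring are my own):

```python
def make_terminal(word, words):
    """CPS terminal: call the continuation with the position past `word`."""
    def parser(cont, pos):
        if pos < len(words) and words[pos] == word:
            cont(pos + 1)
    return parser

def make_vp(V, NP, S):
    """VP -> V NP | V S, in continuation-passing style."""
    def VP(cont, pos):
        V(lambda p1: NP(cont, p1), pos)    # VP -> V NP
        V(lambda p1: S(cont, p1), pos)     # VP -> V S
    return VP
```

Nothing is returned; a rule reports success only by invoking its continuation, once per right string position it can reach.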
Simplifying and Recognizing
- In the previous example, the lambda expression tells the function V to pass a parsed V's right string position into NP and S.
- Johnson redefines alt, seq and terminal to simplify this process.
- The recognize function now simply passes a continuation that marks whether or not a word sequence is successfully parsed. The example uses a set!, but this is not necessary.
- Problem: This still fails to terminate on left recursion, even in the memoized version!
The Secret Sauce: Updating Memo
- Memo table entries can't record fully reduced results; in CPS, the results are what are passed forward to the continuation.
- The new memo function stores, for a set of argument values, a list of caller continuations and a list of result values.
- Result values are propagated and updated as new values are returned. Values that are not subsumed by previous ones are added to the entry's result list.
- When a function is called, it looks up whether or not result values already exist to pass to the continuation, based on the provided arguments.
Left Recursion is Fixed!
- Unmemoized functions are never called more than once on the same arguments; callers can fall back on the lazily evaluated continuation list stored in the hash table.
- Even a left-recursive grammar can look up the continuation and result list and pass computation forward as the CPS-style calls enumerate the parses of the input string.
- Progress is always made because the left-recursive call does not need to drill down again to produce a concrete result, proving that the lazy shall inherit the Earth!
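The whole trick can be sketched in Python. This is a simplified rendering of the technique, not Johnson's code: the memo table maps a position to the continuations waiting on it plus the results found so far, new results are replayed to every waiting caller, and subsumption is reduced to plain deduplication. With it, the left-recursive grammar S → S a | a terminates.

```python
def memo_cps(parser):
    """Memoize a CPS parser: table[pos] = waiting continuations + results."""
    table = {}
    def memoized(cont, pos):
        if pos not in table:
            entry = {'conts': [cont], 'results': []}
            table[pos] = entry
            def collect(r):
                if r not in entry['results']:       # crude subsumption check
                    entry['results'].append(r)
                    for c in list(entry['conts']):  # replay to every waiter
                        c(r)
            parser(collect, pos)
        else:
            entry = table[pos]
            entry['conts'].append(cont)
            for r in list(entry['results']):        # lazily share old results
                cont(r)
    return memoized

def right_positions(words):
    """All positions reachable from S at 0, for S -> S 'a' | 'a'."""
    def term_a(cont, pos):
        if pos < len(words) and words[pos] == 'a':
            cont(pos + 1)

    def s_rules(cont, pos):
        S(lambda mid: term_a(cont, mid), pos)   # S -> S 'a' (left-recursive!)
        term_a(cont, pos)                        # S -> 'a'

    S = memo_cps(s_rules)
    found = []
    S(found.append, 0)
    return sorted(found)
```

right_positions(['a', 'a', 'a']) yields [1, 2, 3]: the left-recursive call at position 0 merely registers a continuation in the table, and each new result pushes computation forward instead of drilling down again.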
The End