Parsing Combinators: Introduction & Tutorial


Mayer Goldberg

October 21, 2017

Contents

1   Synopsis
2   Backus-Naur Form (BNF)
3   Parsing Combinators
4   Simple constructors
5   The parser stack
6   Recursive parsers
7   Packing
8   Builtin Parsers & Parser Constructors
9   Using the parsers
10  The power of parsing combinators

1 Synopsis

Parsing combinators are a technique for compositionally embedding top-down, recursive-descent parsers into programming languages that support higher-order abstraction. Put simply, this means that if you formulate your grammar in a specific way, which we shall discuss later, you can encode it directly, in an almost one-to-one fashion, in any functional or object-oriented programming language. This means that you can encode and implement

your parsers quickly, incrementally, and directly in your programming language, without having to learn a special language for describing the grammar, and without having to translate the grammar into source code in some programming language. This technique will let you implement sophisticated parsers quickly and correctly, with minimal pain, far faster than by other methods.

The drawback: The technique is an embedding of a grammar, rather than a translation of it, so no optimizations are performed on the grammar. This means that if your parsers are terribly inefficient, you will have to identify the causes yourself and change your grammar accordingly.

This tutorial describes the theory and use of parsing combinators. To use parsing combinators, you do not really need to understand how they are implemented, though doing so will help you and is not very difficult. Try to read through the full text, including the examples. If you find an error in the text, or have suggestions, please write me an email.

2 Backus-Naur Form (BNF)

Our journey begins with the Backus-Naur form, named after John Backus of the IBM Corporation, and Peter Naur of the University of København, Denmark. BNF is a notation used to specify context-free grammars, and has been in use since the late 1950s. BNF is a language that describes non-terminals using both terminals and non-terminals, together with the constructors of catenation and disjunction. BNF has been extended with syntactic sugar that includes the Kleene star (denoting the catenation of zero or more expressions), the Kleene plus (denoting the catenation of one or more expressions), the question mark (denoting either 0 or 1 occurrences of an expression), and parentheses for grouping sub-expressions. All these extensions can be translated into straight BNF if we add additional non-terminals.

Years ago, it was quite common for books on specific languages to include a grammar for the syntax of the language, either in BNF or in some extended version of BNF. Today this is rarely seen. Still, you should already be somewhat familiar with BNF notation. Here is an example of the definition of integers with no initial zeros:

  <digit-1-9>      ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  <digit-0-9>      ::= 0 | <digit-1-9>
  <natural-number> ::= <digit-1-9><digit-0-9>* | 0
  <integer>        ::= ( - | + )? <natural-number>

The ? means "at most once", so we may replace the rule

  <integer> ::= ( - | + )? <natural-number>

with

  <integer> ::= - <natural-number>
              | + <natural-number>
              | <natural-number>

3 Parsing Combinators

The basic ideas behind parsing combinators are:

- Terminals are encoded as parsers that recognize only their respective terminals.
- Non-terminals are encoded as parsers.
- The operators of catenation and disjunction are encoded as higher-order abstractions (either higher-order functions or instances of a class of parsers). That is:
  - In the functional world, catenation and disjunction are higher-order procedures that take parsers for grammars, and return a parser for the catenation or disjunction of these grammars.
  - In the object-oriented world, catenation and disjunction are static methods, factory methods, or factory classes that take parsers for grammars and construct new parsers for the catenation or disjunction of these grammars.
- Recursive non-terminals become either recursive functions or recursive methods.

So if you look at the abstract syntax tree for a grammar encoded in BNF (i.e., the AST of the BNF for that grammar), each node in that tree maps to a function or method call in the definition of a parser for that grammar constructed using parsing combinators.

But this is far from all there is to say about parsing combinators: Because parsers are just functions or objects, and because parsers constructed with parsing combinators are built on the fly, at runtime, it is simple to use either functional or object-oriented abstraction to create additional parsing combinators, i.e., procedures that take parsers and construct new parsers. In this way, a single derived parsing combinator can describe many rules in BNF. This is where parsing combinators are actually better at expressing grammars than BNF: it's like BNF with abstraction. We shall have more to say about this later on.
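To make this mapping concrete, here is a minimal sketch of the three core ideas, written in Python rather than the Scheme package used in the rest of this tutorial. Assume a parser is simply a function from a string to a pair (match, remaining) on success, or None on failure; the names and representation here are illustrative, not part of pc.scm.

```python
# A minimal sketch of parsing combinators: terminals, catenation, disjunction.
# A parser maps a string to (match, rest) on success, or None on failure.

def const(pred):
    """Terminal: match a single character satisfying pred."""
    def parse(s):
        if s and pred(s[0]):
            return (s[0], s[1:])
        return None
    return parse

def caten(*parsers):
    """Catenation: run the parsers in sequence, collecting their matches."""
    def parse(s):
        matches = []
        for p in parsers:
            r = p(s)
            if r is None:
                return None
            m, s = r
            matches.append(m)
        return (matches, s)
    return parse

def disj(*parsers):
    """Disjunction: return the result of the first parser that succeeds."""
    def parse(s):
        for p in parsers:
            r = p(s)
            if r is not None:
                return r
        return None
    return parse

alphabetic = const(str.islower)
digit = const(str.isdigit)
print(caten(alphabetic, digit)("a1bc"))  # (['a', '1'], 'bc')
```

Each BNF construct maps to one call: terminals to `const`, sequencing to `caten`, alternatives to `disj`, just as described above.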

4 Simple constructors

You are given the file pc.scm, which is the implementation of the parsing-combinators package. To begin using it, you must load the file into your Scheme session. You can either load it from the prompt, or, if you are using it within a larger project (e.g., writing a compiler), place the call to the load procedure at the top of your file:

  > (load "pc.scm")
  >

If the file is in the current directory in which the Scheme system is running, then this should be enough. Otherwise, you may need to know the path to the pc.scm file. Please make sure you use relative paths when specifying the file pc.scm; otherwise, your code will not be portable across Linux/Windows. The file loads without any visible output.

The most elementary parsing combinators are given by const, caten, and disj, for creating terminals, catenations, and disjunctions:

  > (const
      (lambda (ch)
        (and (char<=? #\a ch)
             (char<=? ch #\z))))
  #<procedure at pc.scm:483>

The constant parser constructor takes a predicate as its argument; the predicate takes a character (or token). const then returns a parser (which is itself a procedure) that matches any single character satisfying the predicate.

How might we test such a parser? We define <alphabetic> to be such a parser. Notice that we are following a BNF-like notational convention, whereby non-terminals are enclosed in angle brackets. The parsing-combinators package comes with a builtin procedure test-string for testing parsers. This is not how you deploy a parser, and you should use this procedure only for testing. That said, you should build and test your parsers incrementally, rather than attempt to construct from scratch the entire grammar for a large language:

  > (define <alphabetic>
      (const
        (lambda (ch)
          (and (char<=? #\a ch)
               (char<=? ch #\z)))))
  > (test-string <alphabetic> "")
  (failed with report:)
  > (test-string <alphabetic> "a")
  ((match #\a) (remaining ""))
  > (test-string <alphabetic> "abc")
  ((match #\a) (remaining "bc"))

As you can see, test-string takes two arguments: a parser and an input string. It then attempts to parse the head of the string using the parser. It either fails, returning a report, or succeeds, returning an expression and the remaining characters.

Notice that when recognizing an alphabetic character, we recognize only one such character. To recognize more, we need to pass a different parser:

  > (test-string (caten <alphabetic> <alphabetic>) "abc")
  ((match (#\a #\b)) (remaining "c"))
  > (test-string (caten <alphabetic> <alphabetic>) "ab")
  ((match (#\a #\b)) (remaining ""))
  > (test-string (caten <alphabetic> <alphabetic>) "a")
  (failed with report:)

We have now introduced the catenation combinator, which takes any number of parsers for some grammars, and returns a parser for the catenation of these grammars. Notice that the parser that recognizes two alphabetic characters cannot match a string containing only one such character, so the parser fails.

The parsing-combinators package contains extensive tools for reporting errors, but we shall not cover them just yet. This means that, for the time being, when we fail to match the head of the input string, we fail with an empty report.

Consider the disjunction combinator:

  > (define <alphabetic>
      (const
        (lambda (ch)
          (and (char<=? #\a ch)
               (char<=? ch #\z)))))
  > (define <digit>
      (const
        (lambda (ch)
          (and (char<=? #\0 ch)
               (char<=? ch #\9)))))
  > (test-string (disj <alphabetic> <digit>) "a")
  ((match #\a) (remaining ""))
  > (test-string (disj <alphabetic> <digit>) "3")
  ((match #\3) (remaining ""))
  > (test-string (disj <alphabetic> <digit>) "*")
  (failed with report:)

The disjunction of an alphabetic char and a digit char can recognize both digits and alphabetic characters, but not punctuation; hence we fail on *.

Writing parsers in this way can be very tedious. For example, recognizing the input HELLO would require a catenation of 5 different parsers! Luckily, the parsing-combinators package contains some more advanced combinators that help meet such common parsing needs:

- The procedure range takes two characters and returns a parser that recognizes characters in the given range, i.e., between the two characters.
- The procedure range-ci behaves like range, only in a case-insensitive manner, namely, it doesn't distinguish between uppercase and lowercase characters.
- The procedure word takes a string and returns a parser that matches that string.
- The procedure word-ci behaves like word, only in a case-insensitive manner.

  > (test-string (range #\a #\z) "a")
  ((match #\a) (remaining ""))
  > (test-string (range #\a #\z) "*")
  (failed with report:)
  > (test-string (range #\a #\z) "A")
  (failed with report:)
  > (test-string (range-ci #\a #\z) "A")
  ((match #\A) (remaining ""))
  > (test-string (range-ci #\a #\z) "c")
  ((match #\c) (remaining ""))
  > (test-string (word "HELLO") "hello")
  (failed with report:)
  > (test-string (word "HELLO") "HELL")
  (failed with report:)
  > (test-string (word "HELLO") "HELLO-WORLD!")
  ((match (#\H #\E #\L #\L #\O)) (remaining "-WORLD!"))
  > (test-string (word-ci "HELLO") "hello-world!")
  ((match (#\h #\e #\l #\l #\o)) (remaining "-world!"))

5 The parser stack

Writing complex parsers requires composing many smaller ones. This can be difficult, and the compositions can nest deeply. To simplify this task, we use postfix notation to describe the construction of complex parsers by means of a parser stack: The procedure new starts a new stack. Commands for the parser stack are preceded by an asterisk character (*). The sequence of commands ends with the command done: If, by the time the done command is executed, the parser

stack contains exactly one parser, then this parser is returned; otherwise an error message is generated.

Here's a simple example. Rather than write a complex parser such as:

  (define <base-10-integer>
    (disj (caten (range #\1 #\9)
                 (star (range #\0 #\9)))
          (not-followed-by (char #\0)
                           (range #\0 #\9))))

we can write it as a flat structure, using the parser stack, as follows:

  (define <base-10-integer>
    (new (*parser (range #\1 #\9))
         (*parser (range #\0 #\9)) *star
         (*caten 2)
         (*parser (char #\0))
         (*parser (range #\0 #\9))
         *not-followed-by
         (*disj 2)
         done))

The procedure star takes a parser and returns a parser that recognizes the Kleene star (i.e., zero or more occurrences) of any expression recognized by the original parser. The procedures *caten and *disj are the parser-stack equivalents of caten and disj; the argument they take is the number of parsers to pop off the stack and catenate or disjoin. The procedure not-followed-by takes two parsers p1 and p2, and returns a parser that recognizes all expressions recognized by p1, provided that the match is not followed by anything recognized by p2. Its parser-stack equivalent is *not-followed-by.

6 Recursive parsers

Regardless of whether you define your parsers by composing parsing combinators directly, or by using a parser stack to compose them, you define parsers through application. This means that the general form is always just a series of applications of various functions f1, f2, and so on, to various parsers <p1>, <p2>, <p3>, etc. For example, something like this:

  (define <parser>
    (f1 <p1>
        (f2 (f3 <p2> <p3>)
            (f3 <p4> (f4 <p5> <p6>))
            (f5 <p7>))))

If you consider how Scheme handles application, namely, how applicative order of evaluation works, you will realize that we are going to have a real problem defining recursive parsers. Suppose we wanted to define:

  (define <parser> (f1 <parser>))

This wouldn't work, since we need to have <parser> in order to define <parser>. This problem is much the same as that of defining a recursive function. When we define a recursive function, all we need is the address of the function, not its value. Consider the ubiquitous example of a recursive function, the factorial function:

  (define fact
    (lambda (n)
      (if (zero? n)
          1
          (* n (fact (- n 1))))))

The occurrence of the variable name fact within the body of the procedure requires only the address of fact, rather than its value. The value will only be needed later, when we apply fact; for the purpose of compiling fact, the address is sufficient.

Getting back to our parsing combinators: we need a mechanism that will let us define recursive parsers, and we get one by wrapping the name of the parser within a thunk, that is, a procedure of zero arguments: (lambda () <parser>). Of course, the interface of this thunk is now completely different from what our parsing combinators expect, so we need a way to bridge this delayed value with the standard parsing-combinator interface. This bridge is called delayed.
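The same delaying trick can be sketched in a language with strict evaluation such as Python (again, an illustration in the style of the minimal parsers above, not the pc.scm API): the thunk defers the variable lookup until parse time, by which point the recursive definition is complete.

```python
# Sketch of recursive parsers via a thunk. A parser maps a string to
# (match, rest) on success, or None on failure.

def char(c):
    def parse(s):
        return (c, s[1:]) if s[:1] == c else None
    return parse

def caten(p1, p2):
    def parse(s):
        r1 = p1(s)
        if r1 is None:
            return None
        m1, rest = r1
        r2 = p2(rest)
        if r2 is None:
            return None
        m2, rest = r2
        return ([m1, m2], rest)
    return parse

def disj(p1, p2):
    def parse(s):
        return p1(s) or p2(s)
    return parse

def delayed(thunk):
    # Bridge a thunk-wrapped parser back to the ordinary parser interface:
    # the thunk is only forced when input actually arrives.
    def parse(s):
        return thunk()(s)
    return parse

# S ::= a | b S
S = disj(char("a"), caten(char("b"), delayed(lambda: S)))
print(S("bba"))  # (['b', ['b', 'a']], '')
```

Without `delayed`, the reference to `S` on its own right-hand side would be evaluated before `S` exists, exactly the problem described above.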

Suppose we wish to parse the grammar S ::= a | b S. The grammar is recursive, so we wrap the recursive production in a thunk and delay it:

  (define <S>
    (disj (char #\a)
          (caten (char #\b)
                 (delayed (lambda () <S>)))))

7 Packing

So far, parsing has involved splitting the input stream of characters into the characters recognized by our grammar and the remaining characters. This is all that parsing theory is really interested in. However, for any practical use of parsing, we want to be able to do something with the matched characters: generally, to construct something, and, in the context of programming-language tools, to create an abstract syntax tree. The parsing-combinators package comes with procedures for performing post-processing on the matched input. These procedures are pack and pack-with, and their parser-stack equivalents *pack and *pack-with.

The procedure pack takes a parser and a unary callback function, and returns a parser that recognizes the exact same grammar as the original parser, the only difference being that the callback function is applied to the match for post-processing. To see how to use pack, let us return to our original parser for natural numbers:

  (define <base-10-integer>
    (disj (caten (range #\1 #\9)
                 (star (range #\0 #\9)))
          (not-followed-by (char #\0)
                           (range #\0 #\9))))

Testing this parser, we notice that the matching characters are returned as a list of two things: the first character, and the list of the remaining characters:

  > (test-string <base-10-integer> "12345")
  ((match (#\1 (#\2 #\3 #\4 #\5))) (remaining ""))

These lists are generated by the parser:

  (caten (range #\1 #\9)
         (star (range #\0 #\9)))

Can we combine these into a single list? To do so, we replace this parser with:

  (pack (caten (range #\1 #\9)
               (star (range #\0 #\9)))
        (lambda (first+rest)
          (cons (car first+rest)
                (cadr first+rest))))

Starting with the original parser, we pass it as an argument to pack. The callback function is a unary function taking the single parameter first+rest, which stands for a list of two things: the first character and the list of all the remaining characters. We simply apply cons to the elements of this list:

  > (test-string <base-10-integer> "12345")
  ((match (#\1 #\2 #\3 #\4 #\5)) (remaining ""))

So we now generate a single list. We are not yet quite satisfied: we still want to convert these characters into a number. To do this, we simply beef up the post-processing:

  (pack (caten (range #\1 #\9)
               (star (range #\0 #\9)))
        (lambda (first+rest)
          (string->number
            (list->string
              (cons (car first+rest)
                    (cadr first+rest))))))

The resulting code behaves as follows:

  > (test-string <base-10-integer> "12345moshe")
  ((match 12345) (remaining "moshe"))
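The idea behind pack is not specific to Scheme; in the style of the minimal Python sketches above (illustrative names, not the pc.scm API), it is a combinator that threads a callback over the matched value while leaving the recognized language unchanged:

```python
# Sketch of pack: wrap a parser with a post-processing callback.
# A parser maps a string to (match, rest) on success, or None on failure.

def const(pred):
    def parse(s):
        return (s[0], s[1:]) if s and pred(s[0]) else None
    return parse

def star(p):
    # Kleene star: zero or more matches of p.
    def parse(s):
        matches = []
        while (r := p(s)) is not None:
            m, s = r
            matches.append(m)
        return (matches, s)
    return parse

def pack(p, callback):
    # Same grammar as p; only the matched value is transformed.
    def parse(s):
        r = p(s)
        if r is None:
            return None
        m, rest = r
        return (callback(m), rest)
    return parse

digits = star(const(str.isdigit))
number = pack(digits, lambda ds: int("".join(ds)))
print(number("12345moshe"))  # (12345, 'moshe')
```

As in the Scheme version, the callback turns the raw character matches into a finished value (here, an integer) while the remaining input is untouched.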

The procedure pack-with is similar to pack, but is intended to give names to the elements of the list. For example, when, as in this case, the list is created by catenation, we may want to name each element of the list. We could do this with pack (and perhaps use let inside the callback function), but it is simpler to use pack-with:

  (pack-with (caten (range #\1 #\9)
                    (star (range #\0 #\9)))
             (lambda (first rest)
               (string->number
                 (list->string
                   (cons first rest)))))

The code behaves identically. We can also use *pack and *pack-with when writing parsers using the parser stack:

  (define <base-10-integer>
    (new (*parser (range #\1 #\9))
         (*parser (range #\0 #\9)) *star
         (*caten 2)
         (*pack-with
           (lambda (first rest)
             (string->number
               (list->string
                 (cons first rest)))))
         (*parser (char #\0))
         (*parser (range #\0 #\9))
         *not-followed-by
         (*disj 2)
         done))

8 Builtin Parsers & Parser Constructors

8.1 Builtin parsers

- <any-char>: Matches any character.
- <any>: A synonym for <any-char>.

- <end-of-input>: Matches the end of the input stream.
- <epsilon>: Matches the empty input. This is the unit for catenation.
- <fail>: Matches nothing. This is the unit for disjunction.

8.2 Parser constructors

- ^<separated-exprs>: Takes a parser <expr> for expressions and a parser <sep> for a separator, and returns a parser for a sequence of one or more expressions separated by the given separator.
- caten: Takes any number of parsers, and returns their catenation.
- char-ci: Takes a character, and returns a parser that matches that character in a case-insensitive manner.
- char: Takes a character, and returns a parser that matches that character.
- const: Takes a predicate, and returns a parser that matches anything satisfying the given predicate.
- delayed: Provides an interface to a thunk-wrapped parser. Used for embedding recursive production rules.
- diff: Takes two parsers <p1>, <p2>, and returns a parser that matches anything matched by <p1>, provided <p2> does not match the head of the same input characters.
- disj: Takes any number of parsers, and returns their disjunction.
- fence
- followed-by
- maybe
- not-followed-by
- one-of-ci
- one-of
- otherwise

- pack-with
- pack
- plus: Takes a parser <p>. (plus <p>) returns a parser that, for any string str recognized by <p>, recognizes the catenation of one or more copies of str.
- range-ci
- range
- star: Takes a parser <p>. (star <p>) returns a parser that, for any string str recognized by <p>, recognizes the catenation of zero or more copies of str.
- times: Takes a parser <p> and a natural number n. (times <p> n) returns a parser that, for each string str recognized by <p>, recognizes the catenation of n copies of str.
- word-ci
- word-suffixes-ci
- word-suffixes
- word

8.3 Parser-Stack procedures

- *caten
- *diff
- *disj
- *dup
- *fence
- *followed-by
- *guard
- *maybe
- *not-followed-by

- *otherwise
- *pack-with
- *pack
- *parser
- *plus
- *star
- *swap
- *times

9 Using the parsers

The procedure test-string is intended to help you test and develop your parsers interactively and incrementally; it is not how you should deploy your parsers. The purpose of this section is to show you how to invoke your parsers directly.

The parsers you write using the parsing-combinators package are procedures of 3 arguments:

- a list of input characters,
- a success continuation,
- a failure continuation.

If the parser succeeds in matching the head of the input characters to some grammatical form, the success continuation is called; otherwise, the failure continuation is called.

The success continuation is called with two arguments: the object matched and the remaining characters of the input stream. Initially, the object matched is the list of characters matched by the parser, but the use of post-processing callback functions (e.g., through pack) can result in other objects being returned. The list of remaining characters consists precisely of those characters left over after the returned object has been "read" from the input stream.

The failure continuation is invoked with a single argument, an error report, which is a list of strings. This tutorial does not [yet?] deal with error reporting, so we shall not explore this avenue at present.
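To make the continuation-passing interface concrete, here is a sketch in Python (illustrative names and error-report format; the real pc.scm parsers are Scheme procedures): a parser receives the input plus two callbacks, and calls exactly one of them.

```python
# Sketch of the continuation-passing parser interface: a parser takes the
# input characters, a success continuation, and a failure continuation.

def char(c):
    def parse(chars, succeed, fail):
        if chars and chars[0] == c:
            return succeed(chars[0], chars[1:])
        return fail(["expected %r" % c])  # report: a list of strings
    return parse

def caten(p1, p2):
    def parse(chars, succeed, fail):
        # Run p1; on success, run p2 on the leftovers; combine the matches.
        return p1(chars,
                  lambda m1, rest: p2(rest,
                                      lambda m2, rest2: succeed([m1, m2], rest2),
                                      fail),
                  fail)
    return parse

parser = caten(char("a"), char("b"))
print(parser(list("abc"),
             lambda e, rest: ("match", e, "".join(rest)),
             lambda errors: ("failed", errors)))
# ('match', ['a', 'b'], 'c')
```

Note that the combinators never inspect a return value to detect failure; control simply flows through whichever continuation applies, which is why the tutorial says to call your parser in tail position.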

In summary, if <parser> is some parser, you call it in tail position as follows:

  (<parser> s ; some list of characters
    (lambda (e remaining-chars)
      ...)
    (lambda (errors)
      ...))

Look at the source code for the test-string procedure in the pc.scm file to see how it works.

10 The power of parsing combinators

Parsing combinators are a way to embed a grammar into a programming language that is both compositional and direct. Unlike parser generators, parsing combinators perform absolutely no processing on the grammar. The following must therefore be done manually, before the parser can be implemented using parsing combinators:

- removal of left-recursive productions,
- any optimizations to the grammar.

That said, parsing combinators offer unique advantages for encoding complex parsers quickly and precisely:

- The parsing-combinators package contains powerful constructors that allow us to define grammars that are not context-free (!) with ease.
- The use of abstraction allows us to define meta-production-rules, each of which replaces many production rules, resulting in shorter, simpler, and more consistent implementations.
- Parsing combinators encourage an interactive, incremental, bottom-up development of parsers, which is very conducive to building large and complex parsers.
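The second point, meta-production-rules, can be sketched in the style of the minimal Python parsers used earlier (illustrative names, not the pc.scm API): a derived combinator, defined once, stands in for a whole family of BNF rules, much as ^<separated-exprs> does in the package.

```python
# Sketch of a "meta-production rule": a derived combinator defined once
# that replaces the BNF schema  <e> (<sep> <e>)*  for every choice of
# expression and separator. A parser maps a string to (match, rest) or None.

def char(c):
    def parse(s):
        return (c, s[1:]) if s[:1] == c else None
    return parse

def caten(p1, p2):
    def parse(s):
        r1 = p1(s)
        if r1 is None:
            return None
        m1, rest = r1
        r2 = p2(rest)
        if r2 is None:
            return None
        m2, rest2 = r2
        return ((m1, m2), rest2)
    return parse

def star(p):
    def parse(s):
        matches = []
        while (r := p(s)) is not None:
            m, s = r
            matches.append(m)
        return (matches, s)
    return parse

def separated(expr, sep):
    # One or more exprs separated by sep, keeping only the exprs.
    def parse(s):
        r = caten(expr, star(caten(sep, expr)))(s)
        if r is None:
            return None
        (first, rest_pairs), rest = r
        return ([first] + [e for _, e in rest_pairs], rest)
    return parse

ones = separated(char("1"), char(","))
print(ones("1,1,1;"))  # (['1', '1', '1'], ';')
```

Because parsers are ordinary values, `separated` is just a function; nothing comparable is expressible inside plain BNF, which is the sense in which combinators are "BNF with abstraction".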