Context-Free Grammars and Recursive Descent Parsing
Tim Dawborn
January 2018

Outline

1. Context-Free Grammars (CFGs)
2. Parsing
3. Recursive Descent Parsing
4. Calculator

Regular Expressions

Regular expressions are a useful tool for pattern matching, as you'll no doubt recall.

To warm up, write a regular expression to match strings containing any number of a's, followed by two b's, followed by one or more c's:

    /a*bbc+/

Now write a regular expression to match any number of a's followed by the same number of b's.

Cannot be done!

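As a quick check of the warm-up answer, the pattern can be tried out with Python's standard `re` module. This is only a sketch; the helper name `matches_warmup` is made up for illustration.

```python
import re

# Pattern from the slide: any number of a's, exactly two b's, one or more c's.
WARMUP = re.compile(r'a*bbc+')

def matches_warmup(s):
    """Return True if the whole string s matches a*bbc+."""
    return WARMUP.fullmatch(s) is not None

print(matches_warmup('aaabbc'))  # True
print(matches_warmup('bbccc'))   # True (zero a's is allowed)
print(matches_warmup('abc'))     # False (only one b)
```

`fullmatch` is used so that the whole string has to match, not just a prefix of it.
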
Regular Languages

There are limitations on what types of languages regular expressions are able to match.
Regular expressions are only able to match regular languages.
Context-free grammars (CFGs) have more expressive power than regular expressions.

[Figure: nested language classes, from outermost to innermost: recursively enumerable, context-sensitive, context-free, regular.]

Grammar Definitions

An alternative way to express the constraints on the valid strings of a language is to use a grammar.
A grammar consists of four things:

- Terminals (T)
- Non-terminals (V)
- Production rules (R : V → (T ∪ V)*)
- A start symbol, which is one of the non-terminals (S ∈ V)

Grammars

An example grammar for "micro-English":

    <sentence> ::= <subject> <verb> <object> "."
    <subject>  ::= "I" | "a" <noun> | "the" <noun>
    <object>   ::= "me" | "a" <noun> | "the" <noun>
    <noun>     ::= "cat" | "mat" | "rat"
    <verb>     ::= "like" | "see" | "is"

- Terminals are represented as "string literals".
- Non-terminals are delimited with <angle-brackets>.
- The <start-symbol> is where we begin, conventionally at the top left.
- The ::= is the "can be rewritten as" symbol.
- Alternative rewrite rules can be separated by vertical bars or put on separate lines.

For the micro-English grammar above:

- Terminals: T = {".", "I", "a", "the", "me", "cat", "mat", "rat", "like", "see", "is"}
- Non-terminals: V = {<sentence>, <subject>, <object>, <noun>, <verb>}
- Production rules: R = the five rules above
- Start symbol: S = <sentence>

The grammar rules define a set of rewrites:

    <sentence> ⇒ <subject> <verb> <object>.
               ⇒ I <verb> <object>.
               ⇒ I see <object>.
               ⇒ I see the <noun>.
               ⇒ I see the cat.

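The rewriting process can be mimicked in a few lines of Python. The sketch below assumes the grammar is stored as a dictionary (the names `GRAMMAR` and `leftmost_derivation` are illustrative, not from the slides) and always rewrites the leftmost non-terminal, using a list of indices to pick which alternative to apply at each step.

```python
# Each non-terminal maps to a list of alternatives; each alternative is a
# list of symbols. Anything that is not a key of GRAMMAR is a terminal.
GRAMMAR = {
    '<sentence>': [['<subject>', '<verb>', '<object>', '.']],
    '<subject>':  [['I'], ['a', '<noun>'], ['the', '<noun>']],
    '<object>':   [['me'], ['a', '<noun>'], ['the', '<noun>']],
    '<noun>':     [['cat'], ['mat'], ['rat']],
    '<verb>':     [['like'], ['see'], ['is']],
}

def leftmost_derivation(start, choices):
    """Rewrite the leftmost non-terminal once per entry in choices."""
    form = [start]
    steps = [' '.join(form)]
    for choice in choices:
        # Find the leftmost non-terminal in the current sentential form.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        # Replace it with the chosen alternative of its rule.
        form[i:i + 1] = GRAMMAR[form[i]][choice]
        steps.append(' '.join(form))
    return steps

steps = leftmost_derivation('<sentence>', [0, 0, 1, 2, 0])
for step in steps:
    print(step)
# The last line printed is: I see the cat .
```
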
Other sentences in micro-English:

    I like the cat.
    I see a rat.
    The cat like the mat.    (unfortunately not good English)
    The mat is the rat.      (syntactically valid string)

How many strings are in the language of micro-English?

    <verb>     = 3
    <noun>     = 3
    <object>   = 1 + 3 + 3 = 7
    <subject>  = 1 + 3 + 3 = 7
    <sentence> = 7 × 3 × 7 = 147

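The count can be double-checked by brute force. The following sketch assumes a dictionary encoding of the grammar (`GRAMMAR` and `expand` are illustrative names) and enumerates every string derivable from a symbol. This only terminates because micro-English has no recursive rules.

```python
from itertools import product

GRAMMAR = {
    '<sentence>': [['<subject>', '<verb>', '<object>', '.']],
    '<subject>':  [['I'], ['a', '<noun>'], ['the', '<noun>']],
    '<object>':   [['me'], ['a', '<noun>'], ['the', '<noun>']],
    '<noun>':     [['cat'], ['mat'], ['rat']],
    '<verb>':     [['like'], ['see'], ['is']],
}

def expand(symbol):
    """Return every terminal string (as a list of words) derivable from symbol."""
    if symbol not in GRAMMAR:  # a terminal expands only to itself
        return [[symbol]]
    results = []
    for alternative in GRAMMAR[symbol]:
        # Combine every expansion of each symbol in this alternative.
        for parts in product(*(expand(s) for s in alternative)):
            results.append([word for part in parts for word in part])
    return results

sentences = expand('<sentence>')
print(len(sentences))  # 147
```
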
Your turn

Write a grammar rule for a 0x-prefixed hexadecimal number.

    <hex> ::= "0" "x" ( "0" | ... | "9" | "A" | ... | "F" )+

Note the parentheses ( and ) for grouping, and the regex-like + for "one or more of".
(Admit it, you know you want to write it as a regular expression: 0x[0-9A-F]+)

Next, write a grammar rule for an integer (positive or negative).

    <integer> ::= "-"? ( "0" | "1" | ... | "9" )+

Your turn

Write a grammar to accept expressions of integers, which could contain zero or more additions (e.g. 23, 1+5+23, -23+23). You will need two grammar rules.

    <expr>    ::= <integer> ( "+" <expr> )?
    <integer> ::= "-"? ( "0" | "1" | ... | "9" )+

Note how the first rule is recursive: that means you can create any sequence of <integer> + <integer> + <integer> ...

Write a grammar to accept any number of a's followed by the same number of b's. Again, you'll need two rules.

    <string> ::= "a" <string> "b"
    <string> ::= ε

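The two rules for <string> translate almost directly into a recursive function. A minimal sketch (the name `is_balanced` is made up here) that peels one "a" off the front and one "b" off the back per recursive call:

```python
def is_balanced(s):
    """Return True if s is n a's followed by n b's, for some n >= 0."""
    if s == '':                        # <string> ::= ε
        return True
    if s[0] == 'a' and s[-1] == 'b':   # <string> ::= "a" <string> "b"
        return is_balanced(s[1:-1])
    return False

print(is_balanced('aabb'))  # True
print(is_balanced('aab'))   # False
```

The recursion is what lets the function "count": each call remembers one pending "b", which a regular expression has no way to do.
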
Regular Expression Language

The language of regular expressions is governed by a grammar.
You know that /a*b|bbc/ is valid and /a(*/ is invalid.
Imagine our own regular expression language only has support for OR, bracketing, and the Kleene star. How might we write this grammar?

Not your turn! My turn!

    <re>        ::= <simple-re> ( "|" <re> )?
    <simple-re> ::= <basic-re>+
    <basic-re>  ::= <elem-re> "*"?
    <elem-re>   ::= "(" <re> ")"
    <elem-re>   ::= "\" ( "*" | "(" | ")" | "|" | "\" )
    <elem-re>   ::= ¬( "*" | "(" | ")" | "|" | "\" )

The last rule matches any single character that is not one of the five special characters.

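To see the grammar in action, here is a hedged sketch of a syntax checker with one method per non-terminal. It assumes that the final <elem-re> rule means "any single character other than the five special ones"; the class and function names (`ReChecker`, `is_valid_re`) are invented for this example.

```python
SPECIAL = set('*()|\\')

class ReChecker:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def parse_re(self):        # <re> ::= <simple-re> ( "|" <re> )?
        if not self.parse_simple():
            return False
        if self.peek() == '|':
            self.pos += 1
            return self.parse_re()
        return True

    def parse_simple(self):    # <simple-re> ::= <basic-re>+
        if not self.parse_basic():
            return False
        # Keep going until something that must end a <simple-re>.
        while self.peek() is not None and self.peek() not in '|)':
            if not self.parse_basic():
                return False
        return True

    def parse_basic(self):     # <basic-re> ::= <elem-re> "*"?
        if not self.parse_elem():
            return False
        if self.peek() == '*':
            self.pos += 1
        return True

    def parse_elem(self):
        ch = self.peek()
        if ch is None:
            return False
        if ch == '(':          # <elem-re> ::= "(" <re> ")"
            self.pos += 1
            if not self.parse_re() or self.peek() != ')':
                return False
            self.pos += 1
            return True
        if ch == '\\':         # backslash followed by a special character
            self.pos += 1
            if self.peek() is None or self.peek() not in SPECIAL:
                return False
            self.pos += 1
            return True
        if ch in SPECIAL:      # an unescaped special cannot start an element
            return False
        self.pos += 1          # any ordinary character (assumed reading)
        return True

def is_valid_re(text):
    checker = ReChecker(text)
    return checker.parse_re() and checker.pos == len(text)

print(is_valid_re('a*b|bbc'))  # True
print(is_valid_re('a(*'))      # False
```
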
Parsing

We saw before how you can use the grammar rules to generate strings of the language of the grammar:

    <sentence> ⇒ <subject> <verb> <object>.
               ⇒ I <verb> <object>.
               ⇒ I see <object>.
               ⇒ I see the <noun>.
               ⇒ I see the cat.

Parsing is the opposite process. Given a string:

1. Check that the string is in the language of the grammar.
2. Construct the derivation tree (parse tree) for the string.

Parse tree for "I see the cat.":

    <sentence>
        <subject>  "I"
        <verb>     "see"
        <object>   "the"
            <noun>  "cat"
        "."

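Both parsing tasks can be sketched for micro-English in a small amount of Python. The code below is an illustration only: it assumes the input has already been split into a list of words, represents the parse tree as nested tuples, and uses made-up names (`parse_noun_phrase`, `parse_sentence`).

```python
NOUNS = {'cat', 'mat', 'rat'}
VERBS = {'like', 'see', 'is'}

def parse_noun_phrase(tokens, pos, label, pronoun):
    """Parse <subject> or <object>; return (tree, next_pos) or None."""
    if pos < len(tokens) and tokens[pos] == pronoun:
        return (label, pronoun), pos + 1
    if (pos + 1 < len(tokens) and tokens[pos] in ('a', 'the')
            and tokens[pos + 1] in NOUNS):
        return (label, tokens[pos], ('<noun>', tokens[pos + 1])), pos + 2
    return None

def parse_sentence(tokens):
    """Return the parse tree for tokens, or None if not in micro-English."""
    subj = parse_noun_phrase(tokens, 0, '<subject>', 'I')
    if subj is None:
        return None
    subj_tree, pos = subj
    if pos >= len(tokens) or tokens[pos] not in VERBS:
        return None
    verb_tree = ('<verb>', tokens[pos])
    obj = parse_noun_phrase(tokens, pos + 1, '<object>', 'me')
    if obj is None:
        return None
    obj_tree, pos = obj
    if tokens[pos:] != ['.']:  # exactly the final "." must remain
        return None
    return ('<sentence>', subj_tree, verb_tree, obj_tree, '.')

tree = parse_sentence(['I', 'see', 'the', 'cat', '.'])
print(tree)
print(parse_sentence(['I', 'see', '.']))  # None
```

A return value of `None` answers task 1 (the string is not in the language); any other return value is the tree asked for in task 2.
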
Parsing in our RE implementation

We need to be able to convert input strings of regular expressions into NFAs, i.e. we need to parse the language of regular expressions.

    /a(bc|d*)*e/

[Figure: parse tree for /a(bc|d*)*e/ under the regular expression grammar, with root <re> and leaves a, (, b, c, |, d, *, ), *, e.]

Parsing a calculator language

Here is a grammar for a basic calculator language:

    <e1> ::= <e2> ( "+" <e1> )?
    <e2> ::= <e3> ( "*" <e2> )?
    <e3> ::= "-"? ( "0" | "1" | ... | "9" )+
    <e3> ::= "(" <e1> ")"

- Come up with a string in the language of this grammar.
- Draw the parse tree for this string: 4 * (3 + 2)

Parse tree for 4*(3+2)

    <e1>
        <e2>
            <e3>  "4"
            "*"
            <e2>
                <e3>
                    "("
                    <e1>
                        <e2>
                            <e3>  "3"
                        "+"
                        <e1>
                            <e2>
                                <e3>  "2"
                    ")"

Evaluating 4*(3+2)

The parse tree is evaluated from the bottom up:

1. The inner <e1> (inside the parentheses) evaluates 3+2=5.
2. The <e2> at the top then evaluates 4*5=20.

Introduction

There is a standard parsing technique for context-free grammars using recursion: recursive descent parsing.
The idea is to come up with one function/method for each non-terminal in the grammar.
Here we will learn how to construct a recursive descent parser from a CFG.

Balanced Language Grammar

We want to write a parser which accepts strings of the grammar:

    <s> ::= "a" <s> "b"
    <s> ::= "e"

What we're aiming for:

    >>> Parser('aeb').parse()
    True
    >>> Parser('aaebb').parse()
    True
    >>> Parser('aaebbb').parse()
    False

Note: this is called an undecorated parser because it doesn't do any more than check the syntactic correctness of the input.

Step 1: Basic parsing framework

    class Parser:
        def __init__(self, tokens):
            self._tokens = tokens
            self._length = len(tokens)
            self._upto = 0  # index of the next unread token

        def end(self):
            # True once every token has been consumed.
            return self._upto == self._length

        def peek(self):
            # Look at the current token without consuming it (None at end).
            return None if self.end() else self._tokens[self._upto]

        def next(self):
            # Consume the current token.
            if not self.end():
                self._upto += 1

Step 2: Helper method for each non-terminal

        def _parse_s(self):
            if self.peek() == 'a':         # <s> ::= "a" <s> "b"
                self.next()                # move to the next input token
                ret = self._parse_s()      # recursively parse the <s>
                if not ret:
                    return False
                if self.peek() != 'b':     # assert the next input token
                    return False
                self.next()                # move to the next input token
            elif self.peek() == 'e':       # <s> ::= "e"
                self.next()
            else:                          # unknown input!
                return False
            return True

Step 3: Wrap the start-symbol parsing helper method

        def parse(self):
            # Succeed only if <s> parses and no input is left over.
            return self._parse_s() and self.end()

Step 4: Test it

    >>> Parser('ab').parse()
    False
    >>> Parser('cd').parse()
    False
    >>> Parser('e').parse()
    True
    >>> Parser('aeb').parse()
    True
    >>> Parser('aaebb').parse()
    True
    >>> Parser('aaebbb').parse()
    False
    >>> Parser('aaccaaebbddbb').parse()
    False

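The parser above is undecorated: it only answers yes or no. As a small contrast, here is a hedged sketch of a decorated variant that also computes something while parsing, namely how many a/b pairs wrap the "e". It is written with plain functions rather than the `Parser` class, and the names `parse_depth` and `nesting_depth` are hypothetical.

```python
def parse_depth(tokens, pos=0):
    """Parse <s> starting at pos; return (depth, next_pos) or None on failure."""
    if pos < len(tokens) and tokens[pos] == 'a':   # <s> ::= "a" <s> "b"
        inner = parse_depth(tokens, pos + 1)
        if inner is None:
            return None
        depth, pos = inner
        if pos < len(tokens) and tokens[pos] == 'b':
            return depth + 1, pos + 1
        return None
    if pos < len(tokens) and tokens[pos] == 'e':   # <s> ::= "e"
        return 0, pos + 1
    return None

def nesting_depth(tokens):
    """Return the number of a/b pairs, or None if tokens is not in the language."""
    result = parse_depth(tokens)
    if result is None or result[1] != len(tokens):
        return None
    return result[0]

print(nesting_depth('aaebb'))   # 2
print(nesting_depth('aaebbb'))  # None
```
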
Calculator

Let's build a calculator evaluator! (This is decorated.)
What we're aiming for:

    >>> Parser(['3','+','2']).parse().eval()
    5
    >>> Parser(['3','*','2']).parse().eval()
    6
    >>> Parser(['3','*','2','+','4']).parse().eval()
    10
    >>> Parser(['3','+','2','*','4']).parse().eval()
    11
    >>> Parser(['(','5','*','2',')','*','3']).parse().eval()
    30
    >>> Parser(['5','*','2','*','4']).parse().eval()
    40

Calculator

Two steps:

- Parse the input into something which can be evaluated.
- Evaluate the returned object.

Build an object tree while parsing!

Here's the grammar we will use in this example:

    1  <e1> ::= <e2> ( "+" <e1> )?
    2  <e2> ::= <e3> ( "*" <e2> )?
    3  <e3> ::= "-"? ( "0" | "1" | ... | "9" )+
    4  <e3> ::= "(" <e1> ")"

- Rules 1 and 2: "+" will be evaluated after "*".
- Rules 3 and 4: <e3> is either an integer or a compound expression.

Expression Tree: Abstract Base Class

Composite design: an expression tree.
We need an abstract base class:

    class Node:
        def __init__(self, left, right):
            self.left = left
            self.right = right

        def eval(self):
            # Subclasses must say how they are evaluated.
            raise NotImplementedError()

Expression Tree: Concrete Subclasses

We need concrete subclasses for each type of node:

    class AddNode(Node):
        def eval(self):
            return self.left.eval() + self.right.eval()

    class MultNode(Node):
        def eval(self):
            return self.left.eval() * self.right.eval()

    class LiteralNode(Node):
        def __init__(self, value):
            super().__init__(None, None)  # a literal has no children
            self.value = value

        def eval(self):
            return self.value

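To see how these classes evaluate a tree, the sketch below builds the tree for 4*(3+2) by hand and calls `eval()` on the root. The class definitions are repeated from the slides only so that the block runs by itself; the variable name `tree` is arbitrary.

```python
class Node:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def eval(self):
        raise NotImplementedError()

class AddNode(Node):
    def eval(self):
        return self.left.eval() + self.right.eval()

class MultNode(Node):
    def eval(self):
        return self.left.eval() * self.right.eval()

class LiteralNode(Node):
    def __init__(self, value):
        super().__init__(None, None)
        self.value = value

    def eval(self):
        return self.value

# Hand-built tree for 4*(3+2); a parser would normally construct this.
tree = MultNode(LiteralNode(4), AddNode(LiteralNode(3), LiteralNode(2)))
print(tree.eval())  # 20
```

Each `eval()` call first evaluates its children, so the addition inside the parentheses is computed before the multiplication at the root.
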
Step 1: Basic parsing framework

    import re

    class Parser:
        # An integer token: optional minus sign followed by digits.
        RE_NUMBER = re.compile(r'-?[0-9]+')

        def __init__(self, tokens):
            self._tokens = tokens
            self._length = len(tokens)
            self._upto = 0  # index of the next unread token

        def end(self):
            return self._upto == self._length

        def peek(self):
            return None if self.end() else self._tokens[self._upto]

        def next(self):
            if not self.end():
                self._upto += 1

Step 2: Helper method for each non-terminal (<e3>)

    <e3> ::= "-"? ( "0" | "1" | ... | "9" )+
    <e3> ::= "(" <e1> ")"

        def _parse_e3(self):
            token = self.peek()
            if token == '(':
                self.next()
                node = self._parse_e1()
                if self.peek() != ')':
                    raise Exception('Closing parenthesis not found!')
                self.next()
            elif token is not None and Parser.RE_NUMBER.fullmatch(token):
                # fullmatch rejects tokens such as '12abc' that only start with digits.
                node = LiteralNode(int(token))
                self.next()
            else:
                # End of input or an unrecognised token.
                raise Exception('Expected a number or "(" but found %r' % (token,))
            return node

Step 2: Helper method for each non-terminal (<e2>)

    <e2> ::= <e3> ( "*" <e2> )?

        def _parse_e2(self):
            node = self._parse_e3()
            if self.peek() == '*':
                self.next()
                node2 = self._parse_e2()
                node = MultNode(node, node2)
            return node

Step 2: Helper method for each non-terminal (<e1>)

    <e1> ::= <e2> ( "+" <e1> )?

        def _parse_e1(self):
            node = self._parse_e2()
            if self.peek() == '+':
                self.next()
                node2 = self._parse_e1()
                node = AddNode(node, node2)
            return node

Step 3: Wrap the start-symbol parsing helper method

        def parse(self):
            node = self._parse_e1()
            if not self.end():
                raise Exception('Extra content found at end of input!')
            return node

Step 4: Test it

    >>> Parser(['3','+','2']).parse().eval()
    5
    >>> Parser(['3','*','2']).parse().eval()
    6
    >>> Parser(['3','*','2','+','4']).parse().eval()
    10
    >>> Parser(['3','+','2','*','4']).parse().eval()
    11
    >>> Parser(['(','5','*','2',')','*','3']).parse().eval()
    30
    >>> Parser(['5','*','2','*','4']).parse().eval()
    40
    >>> Parser(['5','+','2','+','4']).parse().eval()
    11

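The tests above pass the parser a ready-made list of tokens. The slides do not show how such a list is produced, so the following is only one possible sketch of a tokenizer for this calculator language; the names `TOKEN_RE` and `tokenize` are assumptions, not part of the original code.

```python
import re

# Optional whitespace, then either an integer (with optional minus sign)
# or one of the single-character symbols the grammar uses.
TOKEN_RE = re.compile(r'\s*(-?[0-9]+|[+*()])')

def tokenize(text):
    """Split an expression string into the token list the parser expects."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f'Unexpected input at position {pos}')
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(tokenize('3 + 2*4'))    # ['3', '+', '2', '*', '4']
print(tokenize('(5*2)*3'))    # ['(', '5', '*', '2', ')', '*', '3']
```

With the `Parser` class from the slides, `Parser(tokenize('3 + 2*4')).parse().eval()` would then be expected to give 11, matching the hand-tokenized test above.
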