Introduction to Parsing
Lecture 5

Outline
  Regular languages revisited
  Parser overview
  Context-free grammars (CFGs)
  Derivations
  Ambiguity

Languages and Automata
  Formal languages are very important in CS, especially in programming languages
  Regular languages
    The weakest formal languages widely used
    Many applications (as we've seen)
  We will also study context-free languages and tree languages

Beyond Regular Languages
  The difficulty with regular languages is that many languages are not regular
    Some are very important
    They can't be expressed using REs and FAs
  Ex. Strings of balanced parentheses are not regular: { (^i )^i | i ≥ 0 }
    Note: given as a set, not an RE
    Note: this is fairly representative of lots of programming constructs

Beyond Regular Languages
  Ex. Nested arithmetic expressions: ((1+2) * 3)
  Ex. Nested if-then-else statements: if ... then if ... then if ... then ... fi fi fi
    "if" here acts like "(" in the previous example
    Note that even if a language doesn't have a closing "fi" as Cool does, it is usually implied

An Example To Help Understand the Limitations
  Consider the following DFA
    [Figure: DFA state diagram with transitions labelled 0 and 1]
  Ex: What does it recognize?
  Note: it doesn't have any way of knowing the length of the input string

Beyond Regular Languages
  In general: nesting constructs cannot be handled by regular expressions
  This raises the questions:
    What can be expressed?
    Why are REs insufficient for recognizing arbitrary nesting constructs?

What Can Regular Languages Express?
  Languages requiring counting modulo a fixed integer, e.g., parity
  Intuition: a finite automaton that runs long enough must repeat states
  A finite automaton can't remember the number of times it has visited a particular state
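
The contrast above can be sketched in a few lines of Python. The two-state machine below is a hypothetical DFA chosen for illustration (it is not necessarily the one in the figure): it tracks the parity of the 1s it has read, which is counting modulo 2. The second function recognizes { (^i )^i | i ≥ 0 } and needs counters that can grow without bound, which is exactly what a fixed set of states cannot provide. All names are assumptions made for this sketch.

```python
# Hypothetical two-state DFA: accepts binary strings with an even number of 1s.
EVEN, ODD = "even", "odd"
TRANSITIONS = {
    (EVEN, "0"): EVEN, (EVEN, "1"): ODD,
    (ODD, "0"): ODD, (ODD, "1"): EVEN,
}


def accepts_even_ones(text):
    """Run the DFA; the only memory is the current state."""
    state = EVEN
    for ch in text:
        state = TRANSITIONS[(state, ch)]
    return state == EVEN


def is_nested_parens(text):
    """Recognize { (^i )^i | i >= 0 } using unbounded counters (not a DFA)."""
    i = 0
    opens = 0
    while i < len(text) and text[i] == "(":
        opens += 1
        i += 1
    closes = 0
    while i < len(text) and text[i] == ")":
        closes += 1
        i += 1
    # Every character must have been consumed and the counts must agree.
    return i == len(text) and opens == closes
```

The DFA never needs more than two states regardless of input length, while `opens` can grow as large as the input.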

The Functionality of the Parser
  Input: sequence of tokens from the lexer
  Output: parse tree of the program
  (But some parsers never produce a parse tree...)

Example
  Cool: if x = y then 1 else 2 fi
  Parser input (from lexical analyzer): IF ID = ID THEN INT ELSE INT FI
  Parser output:
          IF-THEN-ELSE
         /     |      \
        =     INT     INT
       / \
     ID   ID

Example
  Note: the nesting structure has been made explicit by the tree
  Also the three components of the if-then-else:
    Predicate (the = subtree)
    Then branch (the first INT)
    Else branch (the second INT)

Comparison with Lexical Analysis
  Phase    Input                  Output
  Lexer    String of characters   String of tokens
  Parser   String of tokens       Parse tree
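
To make the lexer row of the table concrete, here is a minimal sketch of a tokenizer that turns the Cool example into the token string the parser receives. It handles only what that one example needs (four keywords, identifiers, integers and "="); the function name, the keyword set and the regular expression are assumptions for this illustration, not Cool's actual lexical specification.

```python
import re

KEYWORDS = {"if", "then", "else", "fi"}
# One alternative per token class: integer, word, or the "=" operator.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(=))")


def tokenize(source):
    """Map a string of characters to a list of token names."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        number, word, equals = match.groups()
        if number is not None:
            tokens.append("INT")
        elif word is not None:
            # Keywords become their own token; any other word is an ID.
            tokens.append(word.upper() if word in KEYWORDS else "ID")
        else:
            tokens.append("=")
        pos = match.end()
    return tokens
```

Calling `tokenize("if x = y then 1 else 2 fi")` yields the parser input shown on the example slide.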

A Couple of Things
  As mentioned, sometimes the parse tree is only implicit
    More on this later
    Many compilers do build a full parse tree, many do not
  There are compilers that combine the lexer and parser phases into one phase
    Everything is done by the parser
    Parsing technology is powerful enough to express lexical analysis in addition to parsing
    But most compilers use two phases, because REs are such a good match for lexical analysis

The Role of the Parser
  Not all strings of tokens are programs...
  ...the parser must distinguish between valid and invalid strings of tokens
    And give error messages for the invalid ones
  We need
    A language for describing valid strings of tokens
    An algorithm for distinguishing valid from invalid strings of tokens

Context-Free Grammars
  Programming language constructs have recursive structure
  An EXPR in Cool can be
    if EXPR then EXPR else EXPR fi
    while EXPR loop EXPR pool
    ...
  Note: each is recursively composed of other expressions
  Context-free grammars are a natural notation for this recursive structure

What is a Context-Free Grammar (CFG)?
  A CFG consists of
    A set of terminals T
    A set of nonterminals N
    A start symbol S (a nonterminal)
    A set of productions X → Y1 Y2 ... Yn
      where X ∈ N and Yi ∈ T ∪ N ∪ {ε}

Notational Conventions
  In these lecture notes
    Nonterminals are written uppercase
    Terminals are written lowercase
    The start symbol is the left-hand side of the first production
  This is standard for CFGs

Examples of CFGs
  S → ( S )
  S → ε
  What are the parts of the grammar?
    N = { S }
    T = { (, ) }
    Start = S (the only nonterminal)
    Productions: the two rules listed above
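
As a small illustration of these four parts, the sketch below stores the grammar as a list of productions and recovers N, T and the start symbol from it. The representation (a pair of left-hand side and right-hand-side tuple, with ε as the empty tuple) and the function name are assumptions for this example. It also assumes every nonterminal appears on some left-hand side, and it follows the convention that the start symbol is the left-hand side of the first production.

```python
# S -> ( S )
# S -> ε        (ε is represented by the empty tuple)
PRODUCTIONS = [
    ("S", ("(", "S", ")")),
    ("S", ()),
]


def grammar_parts(productions):
    """Return (nonterminals, terminals, start symbol) for a production list."""
    nonterminals = {lhs for lhs, _ in productions}
    terminals = {
        symbol
        for _, rhs in productions
        for symbol in rhs
        if symbol not in nonterminals
    }
    start = productions[0][0]  # convention: first production's left-hand side
    return nonterminals, terminals, start
```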

Examples of CFGs
  A fragment of Cool:
    EXPR → if EXPR then EXPR else EXPR fi
         | while EXPR loop EXPR pool
         | id

Examples of CFGs (Cont.)
  Simple arithmetic expressions:
    E → E + E
      | E * E
      | ( E )
      | id

The Language of a CFG
  Read productions as rules:
    X → Y1 ... Yn
  means X can be replaced by Y1 ... Yn
  That is, in general, the right-hand side can replace the left-hand side.

Key Idea
  1. Begin with a string consisting of the start symbol S
  2. Replace any nonterminal X in the string by the right-hand side of some production X → Y1 ... Yn
  3. Repeat (2) until there are no nonterminals in the string
  So note, the string is changing over time

The Language of a CFG (Cont.)
  More formally, write
    X1 ... Xi ... Xn → X1 ... Xi-1 Y1 ... Ym Xi+1 ... Xn
  if there is a production
    Xi → Y1 ... Ym
  and say that the left-hand side derives the right-hand side.
  This is one step of a context-free derivation.

The Language of a CFG (Cont.)
  Write
    X1 ... Xn →* Y1 ... Ym
  if
    X1 ... Xn → ... → ... → Y1 ... Ym
  in 0 or more steps
  We say the left-hand side rewrites in zero or more steps to the right-hand side

So, in general
  When we write X0 →* Xn, it is shorthand for saying that there is some sequence of individual productions (rules) that gets us from X0 to Xn in zero or more steps

The Language of a CFG
  Let G be a context-free grammar with start symbol S. Then the language L(G) of G is:
    L(G) = { a1 ... an | S →* a1 ... an and every ai is a terminal }

Terminals
  Terminals are so called because there are no rules for replacing them
  Once generated, terminals are a permanent feature of the string
  Terminals ought to be tokens of the language

Recall the Earlier Example
  L(G) is the language of CFG G
  Strings of balanced parentheses: { (^i )^i | i ≥ 0 }
  Two ways to write the grammar:
    S → ( S )
    S → ε
  OR
    S → ( S ) | ε
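
The replacement process from the Key Idea slide, together with the definition of L(G), can be sketched as a bounded search: start from S, repeatedly replace a nonterminal by a right-hand side, and collect the strings that contain only terminals. The sketch below always replaces the leftmost nonterminal and prunes any string with more than `max_len` terminals. The dictionary encoding and function name are assumptions for this example, and the pruning only guarantees termination for grammars like this one, where every production that keeps a nonterminal also adds terminals.

```python
from collections import deque

# S -> ( S ) | ε, keyed by nonterminal; ε is the empty tuple.
PRODUCTIONS = {"S": [("(", "S", ")"), ()]}


def language_up_to(productions, start, max_len):
    """Collect every terminal string of length <= max_len derivable from start."""
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal in the current string, if any.
        idx = next((i for i, s in enumerate(form) if s in productions), None)
        if idx is None:
            results.add("".join(form))  # only terminals left: a member of L(G)
            continue
        for rhs in productions[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            terminal_count = sum(1 for s in new_form if s not in productions)
            if terminal_count <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results
```

With `max_len=4` the search finds the empty string, `()` and `(())`, the first three members of { (^i )^i | i ≥ 0 }.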

Cool Example
  A fragment of Cool:
    EXPR → if EXPR then EXPR else EXPR fi
         | while EXPR loop EXPR pool
         | id
  Recall: nonterminals are written uppercase, terminals are written lowercase
  Also, this could have been written as three separate productions

Cool Example (Cont.)
  Some elements of the language (why?)
    id
    if id then id else id fi
    while id loop id pool
    if while id loop id pool then id else id fi
    if if id then id else id fi then id else id fi

Arithmetic Example
  Simple arithmetic expressions:
    E → E + E | E * E | ( E ) | id
  Some elements of the language:
    id
    id + id
    (id)
    id * id
    (id) * id
    id * (id)

Notes
  The idea of a CFG is a big step. But:
    Membership in a language is "yes" or "no"; we also need the parse tree of the input
    We must handle errors gracefully
    We need an implementation of CFGs (e.g., bison)

More Notes
  The form of the grammar is important
    Many grammars generate the same language
    Tools are sensitive to the grammar
  Note: tools for regular languages (e.g., flex) are sensitive to the form of the regular expression, but this is rarely a problem in practice

Derivations and Parse Trees
  A derivation is a sequence of productions
    S → ... → ... → ...
  A derivation can be drawn as a tree
    The start symbol is the tree's root
    For a production X → Y1 ... Yn, add children Y1 ... Yn to node X

Derivation Example
  Grammar:
    E → E + E | E * E | ( E ) | id
  String:
    id * id + id
  We wish to parse the string

Derivation Example (Cont.)
  E → E + E
    → E * E + E
    → id * E + E
    → id * id + E
    → id * id + id
  Parse tree (of the input string):
            E
          / | \
         E  +  E
       / | \   |
      E  *  E  id
      |     |
      id    id

Derivation in Detail
  The tree grows by one production per step:
    (1) E
    (2) → E + E
    (3) → E * E + E
    (4) → id * E + E
    (5) → id * id + E
    (6) → id * id + id

Some Interesting Things About Parse Trees
  A parse tree has
    Terminals at the leaves
    Nonterminals at the interior nodes
  An in-order traversal of the leaves is the original input
    Let's go back and take a look
  The parse tree shows the association of operations; the input string does not
  Note that * binds more tightly than + because the * node sits in a subtree below the + node
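
The claim that reading the leaves in order gives back the input can be checked with a short sketch. The `Node` class and helper names are assumptions made for this illustration; the tree is the one for id * id + id from the derivation example. A node with no children is treated as a terminal, which is adequate here because this grammar has no ε-productions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)


def leaves(node):
    """Return the leaf symbols of the tree, left to right."""
    if not node.children:
        return [node.symbol]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


def E(*children):
    """Build an interior node labelled with the nonterminal E."""
    return Node("E", list(children))


# Parse tree for id * id + id: the * subtree hangs below the + node.
TREE = E(
    E(E(Node("id")), Node("*"), E(Node("id"))),
    Node("+"),
    E(Node("id")),
)
```

`leaves(TREE)` returns the token sequence id, *, id, +, id, the original input.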

An Interesting Question
  How did I know to pick this particular parse tree for the derivation?
  It turns out that there is more than one

Leftmost and Rightmost Derivations
  The example we did is a leftmost derivation
    At each step, replace the leftmost nonterminal
  There is an equivalent notion of a rightmost derivation:
    E → E + E
      → E + id
      → E * E + id
      → E * id + id
      → id * id + id

Rightmost Derivation in Detail
  The tree grows by one production per step:
    (1) E
    (2) → E + E
    (3) → E + id
    (4) → E * E + id
    (5) → E * id + id
    (6) → id * id + id

Derivations and Parse Trees
  Note that the rightmost and leftmost derivations have the same parse tree in this case
    And this is not an accident
  The difference is the order in which branches are added
  Finally, the same parse tree also has other derivations that are neither leftmost nor rightmost
    But we are most interested in leftmost and rightmost
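
A short sketch shows how one parse tree gives rise to both derivations: the only difference is which unexpanded nonterminal is replaced next. The nested-tuple encoding of the tree and the function names are assumptions for this example; the tree is the one for id * id + id.

```python
# Interior nodes are ("E", [children]); terminals are plain strings.
TREE = ("E", [("E", [("E", ["id"]), "*", ("E", ["id"])]), "+", ("E", ["id"])])


def render(form):
    """Show a sentential form: unexpanded subtrees print as their nonterminal."""
    return " ".join(item[0] if isinstance(item, tuple) else item for item in form)


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by expanding the tree one node at a time."""
    form = [tree]
    steps = [render(form)]
    while True:
        pending = [i for i, item in enumerate(form) if isinstance(item, tuple)]
        if not pending:
            return steps
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1]  # replace the nonterminal by its children
        steps.append(render(form))
```

`derivation(TREE)` reproduces the leftmost derivation from the slides and `derivation(TREE, leftmost=False)` the rightmost one; both start at E and end at id * id + id.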

Summary of Derivations
  We are not just interested in whether s is in L(G)
    We need a parse tree for s
  A derivation defines a parse tree
    But one parse tree may have many derivations
  Leftmost and rightmost derivations are important in parser implementation

Ambiguity
  Grammar:
    E → E + E | E * E | ( E ) | id
  String:
    id * id + id

Ambiguity (Cont.)
  This string has two parse trees
    One groups the input as (id * id) + id, with + at the root
    The other groups it as id * (id + id), with * at the root

Ambiguity (Cont.)
  A grammar is ambiguous if it has more than one parse tree for some string
    Equivalently, there is more than one rightmost or leftmost derivation for some string
  Ambiguity is BAD
    It leaves the meaning of some programs ill-defined
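
One way to see the ambiguity is to count parse trees directly. The sketch below is written specifically for the grammar E → E + E | E * E | ( E ) | id (it is not a general CFG algorithm): it counts the trees for a token span by trying each production, splitting the span at every operator position. The function name and token encoding are assumptions for this illustration.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E + E | E * E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways E can derive tokens[lo:hi].
        total = 0
        if hi - lo == 1 and tokens[lo] == "id":
            total += 1                                   # E -> id
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            total += count(lo + 1, hi - 1)               # E -> ( E )
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):                # E -> E op E
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))
```

For `["id", "*", "id", "+", "id"]` the count is 2, matching the two trees above, while a single `id` has exactly one tree.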

Dealing with Ambiguity
  There are several ways to handle ambiguity
  The most direct method is to rewrite the grammar unambiguously
    E  → E' + E | E'
    E' → id * E' | id | ( E ) * E' | ( E )
  This enforces precedence of * over +
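
Because the rewritten grammar is right-recursive, it can be followed almost literally by a recursive-descent parser, one function per nonterminal. The sketch below is one possible rendering, not a prescribed implementation; the function names and the tuple encoding of trees are assumptions for this example. Note that the grammar as written makes both operators group to the right.

```python
def parse(tokens):
    """Parse tokens with E -> E' + E | E' and E' -> id * E' | id | ( E ) * E' | ( E )."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r} at position {pos}, found {peek()!r}")
        pos += 1

    def parse_e():
        left = parse_e_prime()
        if peek() == "+":              # E -> E' + E
            expect("+")
            return ("+", left, parse_e())
        return left                    # E -> E'

    def parse_e_prime():
        if peek() == "(":              # E' starts with ( E )
            expect("(")
            left = parse_e()
            expect(")")
        else:                          # E' starts with id
            expect("id")
            left = "id"
        if peek() == "*":              # optional "* E'" tail
            expect("*")
            return ("*", left, parse_e_prime())
        return left

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r} at position {pos}")
    return tree
```

Each input now has a single tree, and `*` always ends up below `+` unless parentheses say otherwise.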
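
Ambiguity in Arithmetic Expressions
  Recall the grammar
    E → E + E | E * E | ( E ) | int
  The string int * int + int has two parse trees:
    One groups the input as (int * int) + int, with + at the root
    The other groups it as int * (int + int), with * at the root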
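
Ambiguity: The Dangling Else
  Consider the grammar
    E → if E then E
      | if E then E else E
      | OTHER
  This grammar is also ambiguous

The Dangling Else: Example
  The expression
    if E1 then if E2 then E3 else E4
  has two parse trees
    First form: the else belongs to the outer if
      if E1 then (if E2 then E3) else E4
    Second form: the else belongs to the inner if
      if E1 then (if E2 then E3 else E4)
  Typically we want the second form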

The Dangling Else: A Fix
  else matches the closest unmatched then
  We can describe this in the grammar
    E   → MIF    /* all then are matched */
        | UIF    /* some then is unmatched */
    MIF → if E then MIF else MIF
        | OTHER
    UIF → if E then E
        | if E then MIF else UIF
  This describes the same set of strings

The Dangling Else: Example Revisited
  The expression if E1 then if E2 then E3 else E4
    A valid parse tree (for a UIF):
      if E1 then (if E2 then E3 else E4)
    Not valid, because the then expression is not a MIF:
      if E1 then (if E2 then E3) else E4
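
The rule "else matches the closest unmatched then" can also be sketched procedurally. The parser below does not encode the MIF/UIF grammar itself; it is a hand-written recursive-descent sketch that produces the same grouping by letting the innermost pending `if` claim an `else` as soon as it sees one. The function name and the token and tree encodings are assumptions for this example, with any token other than `if`, `then` and `else` standing in for OTHER.

```python
def parse_if(tokens):
    """Parse if/then/else tokens, attaching each else to the nearest open if."""
    pos = 0

    def parse_e():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token in ("then", "else"):
            raise SyntaxError(f"unexpected {token!r} at position {pos - 1}")
        if token != "if":
            return token                      # OTHER
        cond = parse_e()
        if pos >= len(tokens) or tokens[pos] != "then":
            raise SyntaxError(f"expected 'then' at position {pos}")
        pos += 1
        then_branch = parse_e()
        # The innermost if gets the first chance to consume a following else.
        if pos < len(tokens) and tokens[pos] == "else":
            pos += 1
            return ("if", cond, then_branch, parse_e())
        return ("if", cond, then_branch)

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return tree
```

For `if E1 then if E2 then E3 else E4` the result nests the else inside the inner if, which is the tree the grammar fix accepts.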

Ambiguity
  There are no general techniques for handling ambiguity
  It is impossible to automatically convert an ambiguous grammar to an unambiguous one
  Used with care, ambiguity can simplify the grammar
    It sometimes allows more natural definitions
    We need disambiguation mechanisms

Precedence and Associativity Declarations
  Instead of rewriting the grammar
    Use the more natural (ambiguous) grammar
    Along with disambiguating declarations
  Most tools allow precedence and associativity declarations to disambiguate grammars
  Examples follow

Associativity Declarations
  Consider the grammar
    E → E + E | int
  Ambiguous: two parse trees of int + int + int
    (int + int) + int
    int + (int + int)
  Left associativity declaration:
    %left +

Precedence Declarations
  Consider the grammar
    E → E + E | E * E | int
  and the string int + int * int, which has two parse trees
    (int + int) * int
    int + (int * int)
  Precedence declarations:
    %left +
    %left *
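
The effect of these two declarations can be imitated in a hand-written parser with precedence climbing. This is a swapped-in technique for illustration only: it is not how bison applies `%left` (bison uses the declarations to resolve conflicts in its generated tables). The table below assumes, as in the declarations above, that both operators are left-associative and that the later line (`*`) binds tighter; all names are hypothetical.

```python
# Higher number = binds tighter; mirrors "%left +" followed by "%left *".
PRECEDENCE = {"+": 1, "*": 2}


def parse_expr(tokens):
    """Parse int/+/* tokens into nested (op, left, right) tuples."""
    pos = 0

    def parse_level(min_prec):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "int":
            raise SyntaxError(f"expected 'int' at position {pos}")
        left = "int"
        pos += 1
        # Keep absorbing operators that bind at least as tightly as min_prec.
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            # Requiring strictly higher precedence on the right gives left associativity.
            right = parse_level(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_level(1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return tree
```

With this table, int + int * int groups as int + (int * int), and int + int + int groups as (int + int) + int.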
