EDA180: Compiler Construction. Context-free grammars. Görel Hedin. Revised: 2013-01-28


EDA180: Compiler Construction. Context-free grammars. Görel Hedin. Revised: 2013-01-28

Compiler phases and program representations:

source code -> Lexical analysis (scanning) -> tokens -> Syntactic analysis (parsing) -> AST -> Semantic analysis -> attributed AST -> Intermediate code generation -> intermediate code -> Optimization -> intermediate code -> Machine code generation -> machine code

Lexical, syntactic, and semantic analysis form the analysis part of the compiler; code generation and optimization form the synthesis part.

A closer look at the parser.

text -> Scanner -> tokens -> Parser -> AST

The Scanner is defined by regular expressions. The Parser consists of pure parsing, defined by a context-free grammar (this lecture), and AST building, defined by an abstract grammar. The concrete parse tree is implicit and never constructed: the parser builds the AST directly using semantic actions.

Regular expressions vs context-free grammars.

Example RE:
  ID = [a-z][a-z0-9]*

Example CFG:
  Stmt -> WhileStmt | AssignStmt
  WhileStmt -> WHILE LPAR Exp RPAR Stmt

An RE can have iteration. A CFG can also have recursion: it is possible to derive a symbol, e.g., Stmt, from itself.

Elements of a context-free grammar.

Example CFG:
  Stmt -> WhileStmt | AssignStmt
  WhileStmt -> WHILE LPAR Exp RPAR Stmt

Nonterminal symbols. Terminal symbols (tokens). Production rules: N -> s1 s2 ... sn, where each sk is a symbol (terminal or nonterminal). Start symbol: one of the nonterminals, usually the first one.

Exercise: Construct a grammar covering the following program.

Example program: while (k <= n) {sum = sum + k; k = k+1;}

CFG:
  Stmt -> WhileStmt | AssignStmt | CompoundStmt
  WhileStmt -> "while" "(" Exp ")" Stmt
  AssignStmt -> ID "=" Exp
  CompoundStmt ->
  Exp ->
  LessEq ->
  Add ->

Usually, simple tokens are written directly as text strings.

Solution: Construct a grammar covering the following program.

Example program: while (k <= n) {sum = sum + k; k = k+1;}

CFG:
  Stmt -> WhileStmt | AssignStmt | CompoundStmt
  WhileStmt -> "while" "(" Exp ")" Stmt
  AssignStmt -> ID "=" Exp
  CompoundStmt -> "{" (Stmt ";")* "}"
  Exp -> LessEq | Add | ID
  LessEq -> Exp "<=" Exp
  Add -> Exp "+" Exp

Real-world example: The Java Language Specification.

  CompilationUnit:
    PackageDeclaration_opt ImportDeclarations_opt TypeDeclarations_opt
  ImportDeclarations:
    ImportDeclaration
    ImportDeclarations ImportDeclaration
  TypeDeclarations:
    TypeDeclaration
    TypeDeclarations TypeDeclaration
  PackageDeclaration:
    Annotations_opt package PackageName ;

See http://docs.oracle.com/javase/specs/jls. Take a look at Chapter 2 about the Java grammar definitions, and look at some parts of the specification.

Parsing. Use the grammar to derive a tree for a program. Example program: sum = sum + k. The tree is rooted at the start symbol, Stmt, with the tokens sum = sum + k at the bottom.

Parse tree. Use the grammar to derive a parse tree for a program. Example program: sum = sum + k. The root is the start symbol Stmt, with AssignStmt and Add below it. Nonterminals are inner nodes; terminals are leaves.

Corresponding abstract syntax tree (will be discussed in Lecture 5). Example program: sum = sum + k. The AST has AssignStmt at the root with children Id(sum) and Add; Add in turn has children Id(sum) and Id(k).

Forms of CFGs.

EBNF:
  Stmt -> AssignStmt | CompoundStmt
  AssignStmt -> ID "=" Exp
  CompoundStmt -> "{" (Stmt ";")* "}"
  Exp -> Add | ID
  Add -> Exp "+" Exp

Canonical form:
  Stmt -> ID "=" Exp
  Stmt -> "{" Stmts "}"
  Stmts -> ε
  Stmts -> Stmt ";" Stmts
  Exp -> Exp "+" Exp
  Exp -> ID

Extended Backus-Naur Form: compact, easy to read and write. BNF has alternatives; EBNF also has repetition, optionals, and parentheses. It is the common notation for practical use: EBNF is used in JavaCC, and BNF is often used in LR parser generators.

Canonical form: the core formalism for CFGs. It replaces repetition with recursion, and replaces alternatives, optionals, and parentheses with several production rules. Useful for proving properties and explaining algorithms.

Formal definition of CFGs (canonical form).

A context-free grammar G = (N, T, P, S), where
  N is the set of nonterminal symbols
  T is the set of terminal symbols
  P is the set of production rules, each of the form X -> Y1 Y2 ... Yn, where X ∈ N and Yk ∈ N ∪ T
  S is the start symbol (one of the nonterminals), i.e., S ∈ N

So the left-hand side X of a rule is a nonterminal, and the right-hand side Y1 Y2 ... Yn is a sequence of nonterminals and terminals. If the right-hand side of a production is empty, i.e., n = 0, we write X -> ε.
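As an illustration (not part of the slides), the formal definition G = (N, T, P, S) translates directly into a data structure. The sketch below, in Python for concreteness, encodes the slide's example grammar and checks the side conditions of the definition; all names are our own.

```python
from collections import namedtuple

# A context-free grammar G = (N, T, P, S), following the formal definition.
# Each production is a pair (lhs, rhs) where rhs is a tuple of symbols;
# the empty tuple encodes an epsilon right-hand side.
Grammar = namedtuple("Grammar", ["N", "T", "P", "S"])

G = Grammar(
    N={"Stmt", "Stmts", "Exp"},
    T={"ID", "=", "{", "}", ";", "+"},
    P=[
        ("Stmt",  ("ID", "=", "Exp")),
        ("Stmt",  ("{", "Stmts", "}")),
        ("Stmts", ()),                       # Stmts -> ε
        ("Stmts", ("Stmt", ";", "Stmts")),
        ("Exp",   ("Exp", "+", "Exp")),
        ("Exp",   ("ID",)),
    ],
    S="Stmt",
)

# Sanity checks matching the definition: S ∈ N, every lhs ∈ N,
# and every rhs symbol is in N ∪ T.
assert G.S in G.N
assert all(lhs in G.N for lhs, _ in G.P)
assert all(sym in G.N | G.T for _, rhs in G.P for sym in rhs)
```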

A grammar G defines a language L(G).

A context-free grammar G = (N, T, P, S), where
  N is the set of nonterminal symbols
  T is the set of terminal symbols
  P is the set of production rules, each of the form X -> Y1 Y2 ... Yn, where X ∈ N and Yk ∈ N ∪ T
  S is the start symbol (one of the nonterminals), i.e., S ∈ N

G defines a language L(G) over the alphabet T. T* is the set of all possible sequences of T symbols. L(G) is the subset of T* that can be derived from the start symbol S by following the production rules P.

Exercise. G = (N, T, P, S)

P = { Stmt -> ID "=" Exp,
      Stmt -> "{" Stmts "}",
      Stmts -> ε,
      Stmts -> Stmt ";" Stmts,
      Exp -> Exp "+" Exp,
      Exp -> ID }

N = { }
T = { }
S =
L(G) = { }

Solution. G = (N, T, P, S)

P = { Stmt -> ID "=" Exp,
      Stmt -> "{" Stmts "}",
      Stmts -> ε,
      Stmts -> Stmt ";" Stmts,
      Exp -> Exp "+" Exp,
      Exp -> ID }

N = { Stmt, Stmts, Exp }
T = { ID, "=", "{", "}", ";", "+" }
S = Stmt

L(G) = { ID "=" ID,
         ID "=" ID "+" ID,
         ID "=" ID "+" ID "+" ID,
         "{" "}",
         "{" ID "=" ID ";" "}",
         "{" ID "=" ID ";" "{" "}" ";" "}",
         ... }

The sequences in L(G) are usually called sentences or strings.

Derivation step. If we have a sequence of terminals and nonterminals, e.g.,

  X a Y Y b

we can replace one of the nonterminals by applying a production rule. This is called a derivation step. (Swedish: härledningssteg.) Suppose there is a production Y -> X a and we apply it to the first Y in the sequence. We write the derivation step as follows:

  X a Y Y b => X a X a Y b

Derivation. A derivation is simply a sequence of derivation steps, e.g.:

  γ1 => γ2 => ... => γn   (n ≥ 0)

where each γi is a sequence of terminals and nonterminals. If there is a derivation from γ1 to γn, we can write this as γ1 =>* γn. This means it is possible to get from the sequence γ1 to the sequence γn by following the production rules.
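A single derivation step is mechanical enough to sketch in a few lines of Python (illustrative code, not from the slides; the helper name is our own). It reproduces the slide's example step X a Y Y b => X a X a Y b using the production Y -> X a.

```python
# A sketch of one derivation step (=>) on a sentential form: replace the
# nonterminal at a chosen position with the right-hand side of a production.
# `form` is a list of symbols; all names here are illustrative.

def derive_step(form, position, rhs):
    """Replace the symbol at `position` with the symbols in `rhs`."""
    return form[:position] + list(rhs) + form[position + 1:]

# The slide's example: X a Y Y b => X a X a Y b, using production Y -> X a
# applied to the first Y (index 2).
form = ["X", "a", "Y", "Y", "b"]
step = derive_step(form, 2, ["X", "a"])
assert step == ["X", "a", "X", "a", "Y", "b"]
```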

Definition of the language L(G). Recall that for G = (N, T, P, S), T* is the set of all possible sequences of T symbols, and L(G) is the subset of T* that can be derived from the start symbol S by following the production rules P. Using the concept of derivations, we can formally define L(G) as follows:

  L(G) = { w ∈ T* | S =>* w }
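The definition L(G) = { w ∈ T* | S =>* w } can be made concrete by brute force: search over sentential forms from S and collect the all-terminal ones. Below is a minimal Python sketch of this idea (our own illustration, only practical for tiny grammars and short sentences; the function name and bounds are invented).

```python
from collections import deque

# Brute-force sketch of L(G): breadth-first search over sentential forms
# reachable from the start symbol, collecting those that contain only
# terminals. Bounded by sentence length, so it only enumerates a finite
# fragment of the language.

def language(productions, nonterminals, start, max_len=4, max_forms=100000):
    seen, sentences = set(), set()
    queue = deque([(start,)])
    while queue and len(seen) < max_forms:
        form = queue.popleft()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        nts = [i for i, s in enumerate(form) if s in nonterminals]
        if not nts:                      # all terminals: a sentence of L(G)
            sentences.add(form)
            continue
        i = nts[0]                       # expand the leftmost nonterminal
        for lhs, rhs in productions:
            if lhs == form[i]:
                queue.append(form[:i] + rhs + form[i + 1:])
    return sentences

P = [("Exp", ("Exp", "+", "Exp")), ("Exp", ("ID",))]
L = language(P, {"Exp"}, "Exp")
assert ("ID",) in L and ("ID", "+", "ID") in L
```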

Exercise: Prove that a sentence belongs to a language. Prove that INT + INT * INT belongs to the language of the following grammar:

  p1: Exp -> Exp "+" Exp
  p2: Exp -> Exp "*" Exp
  p3: Exp -> INT

Proof (by showing all the derivation steps from the start symbol Exp):

  Exp =>

Solution: Prove that a sentence belongs to a language. Prove that INT + INT * INT belongs to the language of the following grammar:

  p1: Exp -> Exp "+" Exp
  p2: Exp -> Exp "*" Exp
  p3: Exp -> INT

Proof (by showing all the derivation steps from the start symbol Exp):

  Exp
  => Exp "+" Exp           (p1)
  => INT "+" Exp           (p3)
  => INT "+" Exp "*" Exp   (p2)
  => INT "+" INT "*" Exp   (p3)
  => INT "+" INT "*" INT   (p3)

Leftmost and rightmost derivations.

In a leftmost derivation, the leftmost nonterminal is replaced in each derivation step, e.g.:

  Exp => Exp "+" Exp => INT "+" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

In a rightmost derivation, the rightmost nonterminal is replaced in each derivation step, e.g.:

  Exp => Exp "+" Exp => Exp "+" Exp "*" Exp => Exp "+" Exp "*" INT => Exp "+" INT "*" INT => INT "+" INT "*" INT

LL parsing algorithms use leftmost derivations. LR parsing algorithms use rightmost derivations. These will be discussed in later lectures.
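The two strategies differ only in which occurrence of a nonterminal is replaced. A small Python sketch (our own illustration, not from the slides; the helper is hypothetical) replays both derivations of INT "+" INT "*" INT and shows they end in the same sentence.

```python
# Sketch: leftmost vs rightmost derivation under the ambiguous expression
# grammar Exp -> Exp "+" Exp | Exp "*" Exp | INT, recorded as a list of
# sentential forms.

def derive(start, script, leftmost=True):
    """Apply the right-hand sides in `script` in order, always substituting
    at the leftmost (or rightmost) occurrence of the nonterminal Exp."""
    form = [start]
    forms = [form]
    for rhs in script:
        idxs = [i for i, s in enumerate(form) if s == "Exp"]
        i = idxs[0] if leftmost else idxs[-1]
        form = form[:i] + list(rhs) + form[i + 1:]
        forms.append(form)
    return forms

plus = ["Exp", "+", "Exp"]
times = ["Exp", "*", "Exp"]
INT = ["INT"]

left = derive("Exp", [plus, INT, times, INT, INT], leftmost=True)
right = derive("Exp", [plus, times, INT, INT, INT], leftmost=False)
assert left[-1] == right[-1] == ["INT", "+", "INT", "*", "INT"]
```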

A derivation corresponds to building a parse tree.

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Exercise: build the parse tree (also called derivation tree) for the example derivation:

  Exp => Exp "+" Exp => INT "+" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

A derivation corresponds to building a parse tree.

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Example derivation:
  Exp => Exp "+" Exp => INT "+" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

Parse tree:
  Exp
    Exp
      INT
    "+"
    Exp
      Exp
        INT
      "*"
      Exp
        INT

Exercise: Can we do another derivation of the same sentence that gives a different parse tree?

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Another derivation:
  Exp =>

Solution: Can we do another derivation of the same sentence that gives a different parse tree?

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Another derivation:
  Exp => Exp "*" Exp => Exp "+" Exp "*" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

Parse tree:
  Exp
    Exp
      Exp
        INT
      "+"
      Exp
        INT
    "*"
    Exp
      INT

Which parse tree would we prefer?

Ambiguous context-free grammars. A CFG is ambiguous if some sentence in the language can be derived by two (or more) different parse trees. A CFG is unambiguous if each sentence in the language can be derived by only one parse tree. (Swedish: tvetydig, otvetydig.)

How can we know if a CFG is ambiguous? If we find an example of an ambiguity, we know the grammar is ambiguous. There are algorithms for deciding whether a CFG belongs to certain subsets of CFGs, e.g., LL, LR, etc. (see later lectures); these grammars are unambiguous. But in the general case, the problem is undecidable: it is not possible to construct a general algorithm that decides ambiguity for an arbitrary CFG.
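Although deciding ambiguity in general is impossible, finding a witness for one sentence is easy by brute force. The Python sketch below (our own illustration; names and the epsilon-free restriction are assumptions) counts distinct parse trees of INT "+" INT "*" INT under the expression grammar by trying every production and every split point; a count above one proves ambiguity.

```python
# Brute-force parse-tree counting for a single sentence: a witness check,
# not a general decision procedure (which cannot exist).

P = {"Exp": [("Exp", "+", "Exp"), ("Exp", "*", "Exp"), ("INT",)]}

def ways(symbols, toks):
    """Number of distinct ways the sequence `symbols` derives exactly the
    token tuple `toks`. Assumes an epsilon-free grammar, so every symbol
    consumes at least one token (this bounds the recursion)."""
    if not symbols:
        return 1 if not toks else 0
    head, rest = symbols[0], tuple(symbols[1:])
    if head not in P:                      # terminal: must match literally
        return ways(rest, toks[1:]) if toks and toks[0] == head else 0
    total = 0
    # head takes at least 1 token and leaves >= 1 per remaining symbol
    for k in range(1, len(toks) - len(rest) + 1):
        n = sum(ways(rhs, toks[:k]) for rhs in P[head])
        if n:
            total += n * ways(rest, toks[k:])
    return total

trees = ways(("Exp",), ("INT", "+", "INT", "*", "INT"))
assert trees == 2      # two parse trees: the grammar is ambiguous
```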

Strategies for eliminating ambiguities. We should try to eliminate ambiguities without changing the language. First, decide which parse tree is the desired one. Then either create an equivalent unambiguous grammar from which only the desired parse trees can be derived, or use additional priority and associativity rules to instruct the parser to derive the "right" parse tree (works for some ambiguities).

Equivalent grammars. Two grammars, G1 and G2, are equivalent if they generate the same language, i.e., each sentence in one of the grammars can also be derived in the other grammar: L(G1) = L(G2).

Example ambiguity: priority (also called precedence).

  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

There are two parse trees for INT "+" INT "*" INT: one with "+" at the root, corresponding to prio("*") > prio("+") (according to tradition), and one with "*" at the root, corresponding to prio("+") > prio("*") (which would be unexpected and confusing).

Example ambiguity: associativity.

  Exp -> Exp "+" Exp
  Exp -> Exp "-" Exp
  Exp -> Exp "**" Exp
  Exp -> INT

For operators with the same priority, how do several in a sequence associate? Left-associative (usual for most operators): INT "+" INT "-" INT is grouped as (INT "+" INT) "-" INT. Right-associative (usual for the power operator): INT "**" INT "**" INT is grouped as INT "**" (INT "**" INT).

Example ambiguity: non-associativity.

  Exp -> Exp "<" Exp
  Exp -> INT

For some operators, it does not make sense to have several in a sequence at all. They are non-associative. INT "<" INT "<" INT has two parse trees, (INT "<" INT) "<" INT and INT "<" (INT "<" INT), and we would like to forbid both, i.e., rule the sentence out of the language.

Disambiguating expression grammars. How can we change the grammar so that only the desired trees can be derived? Idea: restrict certain subtrees by introducing new nonterminals. Priority: introduce a new nonterminal for each priority level: Term, Factor, Primary, ... Left associativity: restrict the right operand so it can only contain expressions of higher priority. Right associativity: ...

Exercise. Ambiguous grammar:

  Expr -> Expr "+" Expr
  Expr -> Expr "*" Expr
  Expr -> INT
  Expr -> "(" Expr ")"

Equivalent unambiguous grammar:

Solution. Ambiguous grammar:

  Expr -> Expr "+" Expr
  Expr -> Expr "*" Expr
  Expr -> INT
  Expr -> "(" Expr ")"

Equivalent unambiguous grammar:

  Expr -> Expr "+" Term
  Expr -> Term
  Term -> Term "*" Factor
  Term -> Factor
  Factor -> INT
  Factor -> "(" Expr ")"

Here, we introduce a new nonterminal, Term, that is more restricted than Expr: from Term, we cannot derive any new additions. We use Term as the right operand in the addition production, to make sure no new additions will appear to the right. This gives left-associativity. We use Term, and the even more restricted nonterminal Factor, in the multiplication production, to make sure no additions can appear there without using parentheses. This gives multiplication higher priority than addition.
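A brute-force count of parse trees (illustrative Python, not from the slides; names and the epsilon-free assumption are ours) confirms that the layered grammar yields exactly one tree for the previously ambiguous sentence INT "+" INT "*" INT:

```python
# Counting distinct parse trees under the layered Expr/Term/Factor grammar.

P = {
    "Expr":   [("Expr", "+", "Term"), ("Term",)],
    "Term":   [("Term", "*", "Factor"), ("Factor",)],
    "Factor": [("INT",), ("(", "Expr", ")")],
}

def ways(symbols, toks):
    """Number of ways the symbol sequence derives exactly `toks`
    (epsilon-free grammar assumed, so each symbol takes >= 1 token)."""
    if not symbols:
        return 1 if not toks else 0
    head, rest = symbols[0], tuple(symbols[1:])
    if head not in P:                      # terminal: must match literally
        return ways(rest, toks[1:]) if toks and toks[0] == head else 0
    total = 0
    for k in range(1, len(toks) - len(rest) + 1):
        n = sum(ways(rhs, toks[:k]) for rhs in P[head])
        if n:
            total += n * ways(rest, toks[k:])
    return total

# Exactly one parse tree: the disambiguation worked for this sentence.
assert ways(("Expr",), ("INT", "+", "INT", "*", "INT")) == 1
```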

Writing the grammar in different notations.

Canonical form:
  Expr -> Expr "+" Term
  Expr -> Term
  Term -> Term "*" Factor
  Term -> Factor
  Factor -> INT
  Factor -> "(" Expr ")"

Equivalent BNF (Backus-Naur Form): use alternatives instead of several productions per nonterminal.

Equivalent EBNF (Extended BNF): use repetition instead of recursion, where possible.

Writing the grammar in different notations.

Canonical form:
  Expr -> Expr "+" Term
  Expr -> Term
  Term -> Term "*" Factor
  Term -> Factor
  Factor -> INT
  Factor -> "(" Expr ")"

Equivalent BNF (Backus-Naur Form), using alternatives instead of several productions per nonterminal:
  Expr -> Expr "+" Term | Term
  Term -> Term "*" Factor | Factor
  Factor -> INT | "(" Expr ")"

Equivalent EBNF (Extended BNF), using repetition instead of recursion where possible:
  Expr -> Term ("+" Term)*
  Term -> Factor ("*" Factor)*
  Factor -> INT | "(" Expr ")"
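The EBNF form maps naturally onto hand-written code: one function per nonterminal, with each * repetition becoming a loop. Below is a minimal Python sketch of such an evaluator (our own illustration; recursive descent is the topic of a later lecture, and all function names are invented). Folding the value as the loop runs gives left-associativity, and the Expr/Term/Factor layering gives "*" higher priority than "+".

```python
# Evaluator following Expr -> Term ("+" Term)*, Term -> Factor ("*" Factor)*,
# Factor -> INT | "(" Expr ")". Tokens are given as a list of strings.

def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok=None):
        nonlocal pos
        t = tokens[pos]
        assert tok is None or t == tok, f"expected {tok}, got {t}"
        pos += 1
        return t

    def expr():                 # Expr -> Term ("+" Term)*
        v = term()
        while peek() == "+":
            eat("+")
            v = v + term()      # fold left: left-associative
        return v

    def term():                 # Term -> Factor ("*" Factor)*
        v = factor()
        while peek() == "*":
            eat("*")
            v = v * factor()
        return v

    def factor():               # Factor -> INT | "(" Expr ")"
        if peek() == "(":
            eat("(")
            v = expr()
            eat(")")
            return v
        return int(eat())

    return expr()

assert evaluate(["1", "+", "2", "*", "3"]) == 7       # "*" binds tighter
assert evaluate(["(", "1", "+", "2", ")", "*", "3"]) == 9
```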

Translating EBNF to canonical form.

  EBNF:
    Top-level repetition:  X -> γ1 (γ2)* γ3
    Top-level alternative: X -> γ1 | γ2
    Top-level parentheses: X -> γ1 (...) γ2

  Equivalent canonical form:

where each γk is a sequence of terminals and nonterminals.

Translating EBNF to canonical form.

  Top-level repetition:
    EBNF:      X -> γ1 (γ2)* γ3
    Canonical: X -> γ1 N γ3,  N -> γ2 N,  N -> ε

  Top-level alternative:
    EBNF:      X -> γ1 | γ2
    Canonical: X -> γ1,  X -> γ2

  Top-level parentheses:
    EBNF:      X -> γ1 (...) γ2
    Canonical: X -> γ1 N γ2,  N -> ...

where each γk is a sequence of terminals and nonterminals, and N is a fresh nonterminal.
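The repetition rule is mechanical enough to express as code. The Python sketch below (our own illustration, not part of the course material; the encoding of a starred group as a tagged tuple and the fresh name "N" are assumptions) rewrites X -> γ1 (γ2)* γ3 into the three canonical productions:

```python
# Rewrite a production whose rhs contains one starred group into
# canonical-form productions with a fresh helper nonterminal.
# Plain symbols are strings; ("*", [...]) marks the starred group γ2*.

def eliminate_repetition(lhs, rhs, fresh="N"):
    """X -> γ1 (γ2)* γ3  becomes  X -> γ1 N γ3,  N -> γ2 N,  N -> ε."""
    star = next(i for i, s in enumerate(rhs) if isinstance(s, tuple))
    gamma2 = rhs[star][1]
    return [
        (lhs, rhs[:star] + [fresh] + rhs[star + 1:]),
        (fresh, gamma2 + [fresh]),
        (fresh, []),                     # N -> ε
    ]

# Example: Expr -> Term ("+" Term)*
rules = eliminate_repetition("Expr", ["Term", ("*", ["+", "Term"])])
assert rules == [("Expr", ["Term", "N"]),
                 ("N", ["+", "Term", "N"]),
                 ("N", [])]
```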

Exercise: Translate from EBNF to canonical form.

  EBNF: Expr -> Term ("+" Term)*

  Equivalent canonical form:

Solution: Translate from EBNF to canonical form.

  EBNF: Expr -> Term ("+" Term)*

  Equivalent canonical form:
    Expr -> Term N
    N -> "+" Term N
    N -> ε

Can we show that these are equivalent?

  EBNF: Expr -> Term ("+" Term)*

  Equivalent canonical form (trivial):
    Expr -> Term N
    N -> "+" Term N
    N -> ε

  Alternative equivalent canonical form (non-trivial):
    Expr -> Expr "+" Term
    Expr -> Term

Example proof.

  1. We start with this: Expr -> Term ("+" Term)*
  2. We can move the repetition: Expr -> (Term "+")* Term
  3. Introduce a nonterminal N: Expr -> N Term,  N -> (Term "+")*
  4. Eliminate the repetition: Expr -> N Term,  N -> N Term "+",  N -> ε
  5. Replace N Term by Expr in the second production: Expr -> N Term,  N -> Expr "+",  N -> ε
  6. Eliminate N: Expr -> Expr "+" Term,  Expr -> Term

Equivalence of grammars. Given two context-free grammars, G1 and G2, are they equivalent? I.e., is L(G1) = L(G2)? This is an undecidable problem: a general algorithm cannot be constructed. In the general case, we need to rely on our ingenuity to find out.

Regular expressions vs context-free grammars.

              RE                          CFG
  Alphabet:   characters                  terminal symbols (tokens)
  Language:   strings (char sequences)    sentences (token sequences)
  Used for:   tokens                      parse trees
  Power:      iteration                   recursion
  Recognizer: DFA                         NFA with stack
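The "NFA with stack" row is the key difference: a DFA has finitely many states and cannot count unbounded nesting, while a stack can. As a small illustration (our own sketch, not from the slides), here is a recognizer for balanced parentheses, a context-free but non-regular language, where a single depth counter stands in for the stack:

```python
# Why a CFG recognizer needs a stack: balanced parentheses cannot be
# recognized by any DFA, but one counter (a degenerate stack) suffices.

def balanced(s):
    depth = 0                      # stands in for a stack of "(" symbols
    for ch in s:
        if ch == "(":
            depth += 1             # push
        elif ch == ")":
            if depth == 0:
                return False       # pop from empty stack: reject
            depth -= 1             # pop
        else:
            return False           # only parentheses in this toy alphabet
    return depth == 0              # stack must be empty at the end

assert balanced("((()))") and balanced("")
assert not balanced("(()") and not balanced(")(")
```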

The Chomsky grammar hierarchy.

  Grammar             Rule patterns                     Type
  regular             X -> aY, X -> a, or X -> ε        3
  context-free        X -> γ                            2
  context-sensitive   αXβ -> αγβ                        1
  arbitrary           γ -> δ                            0

Each type is contained in the next: Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0. Regular grammars have the same power as regular expressions (tail recursion = iteration). Types 2 and 3 are of practical use in compiler construction; types 0 and 1 are only of theoretical interest.

Summary questions.

- Define a small example language with a CFG.
- What is a nonterminal symbol? A terminal symbol? A production? A start symbol? A parse tree?
- What is a left-hand side of a production? A right-hand side?
- Given a grammar G, what is meant by the language L(G)?
- What is a derivation step? A derivation? A leftmost derivation?
- How does a derivation correspond to a parse tree?
- When is a grammar ambiguous? Unambiguous?
- How can ambiguities for expressions be removed?
- When are two grammars equivalent?
- When should we use canonical form, and when EBNF?
- Translate an EBNF grammar to canonical form.
- Explain why context-free grammars are more powerful than regular expressions.
- What is the Chomsky hierarchy?

Readings.

F3: Context-free grammars, derivations, ambiguity, EBNF. Appel, chapter 3-3.1. Appel, relevant exercises: 3.1-3.3. Try to solve the problems in Seminar 2.

F4: Predictive parsing. Recursive descent. LL grammars and parsing. Left recursion and factorization. Appel, chapter 3.2.