EDA180: Compiler Construction. Context-free grammars. Görel Hedin. Revised: 2013-01-28


EDA180: Compiler Construction. Context-free grammars. Görel Hedin. Revised: 2013-01-28

Compiler phases and program representations:

source code -> Lexical analysis (scanning) -> tokens -> Syntactic analysis (parsing) -> AST -> Semantic analysis -> attributed AST -> Intermediate code generation -> intermediate code -> Optimization -> intermediate code -> Machine code generation -> machine code

Lexical, syntactic, and semantic analysis form the analysis part of the compiler; code generation and optimization form the synthesis part.

A closer look at the parser.

text -> Scanner -> tokens -> Parser -> AST

The Scanner is defined by regular expressions. The Parser consists of pure parsing, defined by a context-free grammar (this lecture), and AST building, defined by an abstract grammar. The concrete parse tree is implicit and never constructed: the parser builds the AST directly using semantic actions.

Regular expressions vs context-free grammars.

Example RE:
  ID = [a-z][a-z0-9]*

Example CFG:
  Stmt -> WhileStmt | AssignStmt
  WhileStmt -> WHILE LPAR Exp RPAR Stmt

An RE can have iteration. A CFG can also have recursion: it is possible to derive a symbol, e.g., Stmt, from itself.

Elements of a context-free grammar.

Example CFG:
  Stmt -> WhileStmt | AssignStmt
  WhileStmt -> WHILE LPAR Exp RPAR Stmt

Nonterminal symbols. Terminal symbols (tokens). Production rules: N -> s1 s2 ... sn, where each sk is a symbol (terminal or nonterminal). Start symbol: one of the nonterminals, usually the first one.

Exercise: Construct a grammar covering the following program.

Example program: while (k <= n) {sum = sum + k; k = k+1;}

CFG:
  Stmt -> WhileStmt | AssignStmt | CompoundStmt
  WhileStmt -> "while" "(" Exp ")" Stmt
  AssignStmt -> ID "=" Exp
  CompoundStmt ->
  Exp ->
  LessEq ->
  Add ->

Usually, simple tokens are written directly as text strings.

Solution: Construct a grammar covering the following program.

Example program: while (k <= n) {sum = sum + k; k = k+1;}

CFG:
  Stmt -> WhileStmt | AssignStmt | CompoundStmt
  WhileStmt -> "while" "(" Exp ")" Stmt
  AssignStmt -> ID "=" Exp
  CompoundStmt -> "{" (Stmt ";")* "}"
  Exp -> LessEq | Add | ID
  LessEq -> Exp "<=" Exp
  Add -> Exp "+" Exp

Real-world example: The Java Language Specification.

  CompilationUnit:
    PackageDeclaration_opt ImportDeclarations_opt TypeDeclarations_opt
  ImportDeclarations:
    ImportDeclaration
    ImportDeclarations ImportDeclaration
  TypeDeclarations:
    TypeDeclaration
    TypeDeclarations TypeDeclaration
  PackageDeclaration:
    Annotations_opt package PackageName ;

See http://docs.oracle.com/javase/specs/jls. Take a look at Chapter 2 about the Java grammar definitions, and look at some parts of the specification.

Parsing. Use the grammar to derive a tree for a program. Example program: sum = sum + k. The tree is rooted at the start symbol, Stmt, with the tokens sum = sum + k at the bottom.

Parse tree. Use the grammar to derive a parse tree for a program. Example program: sum = sum + k. The root is the start symbol Stmt, with AssignStmt and Add below it. Nonterminals are inner nodes; terminals are leaves.

Corresponding abstract syntax tree (will be discussed in Lecture 5). Example program: sum = sum + k. The AST has AssignStmt at the root with children Id(sum) and Add; Add in turn has children Id(sum) and Id(k).

Forms of CFGs.

EBNF:
  Stmt -> AssignStmt | CompoundStmt
  AssignStmt -> ID "=" Exp
  CompoundStmt -> "{" (Stmt ";")* "}"
  Exp -> Add | ID
  Add -> Exp "+" Exp

Canonical form:
  Stmt -> ID "=" Exp
  Stmt -> "{" Stmts "}"
  Stmts -> ε
  Stmts -> Stmt ";" Stmts
  Exp -> Exp "+" Exp
  Exp -> ID

Extended Backus-Naur Form: compact, easy to read and write. BNF has alternatives; EBNF also has repetition, optionals, and parentheses. It is the common notation for practical use: EBNF is used in JavaCC, and BNF is often used in LR parser generators.

Canonical form: the core formalism for CFGs. It replaces repetition with recursion, and replaces alternatives, optionals, and parentheses with several production rules. Useful for proving properties and explaining algorithms.

Formal definition of CFGs (canonical form).

A context-free grammar G = (N, T, P, S), where
  N is the set of nonterminal symbols
  T is the set of terminal symbols
  P is the set of production rules, each of the form X -> Y1 Y2 ... Yn, where X ∈ N and Yk ∈ N ∪ T
  S is the start symbol (one of the nonterminals), i.e., S ∈ N

So the left-hand side X of a rule is a nonterminal, and the right-hand side Y1 Y2 ... Yn is a sequence of nonterminals and terminals. If the right-hand side of a production is empty, i.e., n = 0, we write X -> ε.
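As an illustration (not part of the slides), the formal definition G = (N, T, P, S) translates directly into a data structure. The sketch below, in Python for concreteness, encodes the slide's example grammar and checks the side conditions of the definition; all names are our own.

```python
from collections import namedtuple

# A context-free grammar G = (N, T, P, S), following the formal definition.
# Each production is a pair (lhs, rhs) where rhs is a tuple of symbols;
# the empty tuple encodes an epsilon right-hand side.
Grammar = namedtuple("Grammar", ["N", "T", "P", "S"])

G = Grammar(
    N={"Stmt", "Stmts", "Exp"},
    T={"ID", "=", "{", "}", ";", "+"},
    P=[
        ("Stmt",  ("ID", "=", "Exp")),
        ("Stmt",  ("{", "Stmts", "}")),
        ("Stmts", ()),                       # Stmts -> ε
        ("Stmts", ("Stmt", ";", "Stmts")),
        ("Exp",   ("Exp", "+", "Exp")),
        ("Exp",   ("ID",)),
    ],
    S="Stmt",
)

# Sanity checks matching the definition: S ∈ N, every lhs ∈ N,
# and every rhs symbol is in N ∪ T.
assert G.S in G.N
assert all(lhs in G.N for lhs, _ in G.P)
assert all(sym in G.N | G.T for _, rhs in G.P for sym in rhs)
```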

A grammar G defines a language L(G).

A context-free grammar G = (N, T, P, S), where
  N is the set of nonterminal symbols
  T is the set of terminal symbols
  P is the set of production rules, each of the form X -> Y1 Y2 ... Yn, where X ∈ N and Yk ∈ N ∪ T
  S is the start symbol (one of the nonterminals), i.e., S ∈ N

G defines a language L(G) over the alphabet T. T* is the set of all possible sequences of T symbols. L(G) is the subset of T* that can be derived from the start symbol S by following the production rules P.

Exercise. G = (N, T, P, S)

P = { Stmt -> ID "=" Exp,
      Stmt -> "{" Stmts "}",
      Stmts -> ε,
      Stmts -> Stmt ";" Stmts,
      Exp -> Exp "+" Exp,
      Exp -> ID }

N = { }
T = { }
S =
L(G) = { }

Solution. G = (N, T, P, S)

P = { Stmt -> ID "=" Exp,
      Stmt -> "{" Stmts "}",
      Stmts -> ε,
      Stmts -> Stmt ";" Stmts,
      Exp -> Exp "+" Exp,
      Exp -> ID }

N = { Stmt, Stmts, Exp }
T = { ID, "=", "{", "}", ";", "+" }
S = Stmt

L(G) = { ID "=" ID,
         ID "=" ID "+" ID,
         ID "=" ID "+" ID "+" ID,
         "{" "}",
         "{" ID "=" ID ";" "}",
         "{" ID "=" ID ";" "{" "}" ";" "}",
         ... }

The sequences in L(G) are usually called sentences or strings.

Derivation step. If we have a sequence of terminals and nonterminals, e.g.,

  X a Y Y b

we can replace one of the nonterminals by applying a production rule. This is called a derivation step. (Swedish: härledningssteg.) Suppose there is a production Y -> X a and we apply it to the first Y in the sequence. We write the derivation step as follows:

  X a Y Y b => X a X a Y b

Derivation. A derivation is simply a sequence of derivation steps, e.g.:

  γ1 => γ2 => ... => γn   (n ≥ 0)

where each γi is a sequence of terminals and nonterminals. If there is a derivation from γ1 to γn, we can write this as γ1 =>* γn. This means it is possible to get from the sequence γ1 to the sequence γn by following the production rules.
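A single derivation step is mechanical enough to sketch in a few lines of Python (illustrative code, not from the slides; the helper name is our own). It reproduces the slide's example step X a Y Y b => X a X a Y b using the production Y -> X a.

```python
# A sketch of one derivation step (=>) on a sentential form: replace the
# nonterminal at a chosen position with the right-hand side of a production.
# `form` is a list of symbols; all names here are illustrative.

def derive_step(form, position, rhs):
    """Replace the symbol at `position` with the symbols in `rhs`."""
    return form[:position] + list(rhs) + form[position + 1:]

# The slide's example: X a Y Y b => X a X a Y b, using production Y -> X a
# applied to the first Y (index 2).
form = ["X", "a", "Y", "Y", "b"]
step = derive_step(form, 2, ["X", "a"])
assert step == ["X", "a", "X", "a", "Y", "b"]
```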

Definition of the language L(G). Recall that for G = (N, T, P, S), T* is the set of all possible sequences of T symbols, and L(G) is the subset of T* that can be derived from the start symbol S by following the production rules P. Using the concept of derivations, we can formally define L(G) as follows:

  L(G) = { w ∈ T* | S =>* w }
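The definition L(G) = { w ∈ T* | S =>* w } can be made concrete by brute force: search over sentential forms from S and collect the all-terminal ones. Below is a minimal Python sketch of this idea (our own illustration, only practical for tiny grammars and short sentences; the function name and bounds are invented).

```python
from collections import deque

# Brute-force sketch of L(G): breadth-first search over sentential forms
# reachable from the start symbol, collecting those that contain only
# terminals. Bounded by sentence length, so it only enumerates a finite
# fragment of the language.

def language(productions, nonterminals, start, max_len=4, max_forms=100000):
    seen, sentences = set(), set()
    queue = deque([(start,)])
    while queue and len(seen) < max_forms:
        form = queue.popleft()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        nts = [i for i, s in enumerate(form) if s in nonterminals]
        if not nts:                      # all terminals: a sentence of L(G)
            sentences.add(form)
            continue
        i = nts[0]                       # expand the leftmost nonterminal
        for lhs, rhs in productions:
            if lhs == form[i]:
                queue.append(form[:i] + rhs + form[i + 1:])
    return sentences

P = [("Exp", ("Exp", "+", "Exp")), ("Exp", ("ID",))]
L = language(P, {"Exp"}, "Exp")
assert ("ID",) in L and ("ID", "+", "ID") in L
```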

Exercise: Prove that a sentence belongs to a language. Prove that INT + INT * INT belongs to the language of the following grammar:

  p1: Exp -> Exp "+" Exp
  p2: Exp -> Exp "*" Exp
  p3: Exp -> INT

Proof (by showing all the derivation steps from the start symbol Exp):

  Exp =>

Solution: Prove that a sentence belongs to a language. Prove that INT + INT * INT belongs to the language of the following grammar:

  p1: Exp -> Exp "+" Exp
  p2: Exp -> Exp "*" Exp
  p3: Exp -> INT

Proof (by showing all the derivation steps from the start symbol Exp):

  Exp
  => Exp "+" Exp           (p1)
  => INT "+" Exp           (p3)
  => INT "+" Exp "*" Exp   (p2)
  => INT "+" INT "*" Exp   (p3)
  => INT "+" INT "*" INT   (p3)

Leftmost and rightmost derivations.

In a leftmost derivation, the leftmost nonterminal is replaced in each derivation step, e.g.:

  Exp => Exp "+" Exp => INT "+" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

In a rightmost derivation, the rightmost nonterminal is replaced in each derivation step, e.g.:

  Exp => Exp "+" Exp => Exp "+" Exp "*" Exp => Exp "+" Exp "*" INT => Exp "+" INT "*" INT => INT "+" INT "*" INT

LL parsing algorithms use leftmost derivations. LR parsing algorithms use rightmost derivations. These will be discussed in later lectures.
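The two strategies differ only in which occurrence of a nonterminal is replaced. A small Python sketch (our own illustration, not from the slides; the helper is hypothetical) replays both derivations of INT "+" INT "*" INT and shows they end in the same sentence.

```python
# Sketch: leftmost vs rightmost derivation under the ambiguous expression
# grammar Exp -> Exp "+" Exp | Exp "*" Exp | INT, recorded as a list of
# sentential forms.

def derive(start, script, leftmost=True):
    """Apply the right-hand sides in `script` in order, always substituting
    at the leftmost (or rightmost) occurrence of the nonterminal Exp."""
    form = [start]
    forms = [form]
    for rhs in script:
        idxs = [i for i, s in enumerate(form) if s == "Exp"]
        i = idxs[0] if leftmost else idxs[-1]
        form = form[:i] + list(rhs) + form[i + 1:]
        forms.append(form)
    return forms

plus = ["Exp", "+", "Exp"]
times = ["Exp", "*", "Exp"]
INT = ["INT"]

left = derive("Exp", [plus, INT, times, INT, INT], leftmost=True)
right = derive("Exp", [plus, times, INT, INT, INT], leftmost=False)
assert left[-1] == right[-1] == ["INT", "+", "INT", "*", "INT"]
```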

A derivation corresponds to building a parse tree.

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Exercise: build the parse tree (also called derivation tree) for the example derivation:

  Exp => Exp "+" Exp => INT "+" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

A derivation corresponds to building a parse tree.

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Example derivation:
  Exp => Exp "+" Exp => INT "+" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

Parse tree:
  Exp
    Exp
      INT
    "+"
    Exp
      Exp
        INT
      "*"
      Exp
        INT

Exercise: Can we do another derivation of the same sentence that gives a different parse tree?

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Another derivation:
  Exp =>

Solution: Can we do another derivation of the same sentence that gives a different parse tree?

Grammar:
  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

Another derivation:
  Exp => Exp "*" Exp => Exp "+" Exp "*" Exp => INT "+" Exp "*" Exp => INT "+" INT "*" Exp => INT "+" INT "*" INT

Parse tree:
  Exp
    Exp
      Exp
        INT
      "+"
      Exp
        INT
    "*"
    Exp
      INT

Which parse tree would we prefer?

Ambiguous context-free grammars. A CFG is ambiguous if some sentence in the language can be derived by two (or more) different parse trees. A CFG is unambiguous if each sentence in the language can be derived by only one parse tree. (Swedish: tvetydig, otvetydig.)

How can we know if a CFG is ambiguous? If we find an example of an ambiguity, we know the grammar is ambiguous. There are algorithms for deciding whether a CFG belongs to certain subsets of CFGs, e.g., LL, LR, etc. (see later lectures); these grammars are unambiguous. But in the general case, the problem is undecidable: it is not possible to construct a general algorithm that decides ambiguity for an arbitrary CFG.
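Although deciding ambiguity in general is impossible, finding a witness for one sentence is easy by brute force. The Python sketch below (our own illustration; names and the epsilon-free restriction are assumptions) counts distinct parse trees of INT "+" INT "*" INT under the expression grammar by trying every production and every split point; a count above one proves ambiguity.

```python
# Brute-force parse-tree counting for a single sentence: a witness check,
# not a general decision procedure (which cannot exist).

P = {"Exp": [("Exp", "+", "Exp"), ("Exp", "*", "Exp"), ("INT",)]}

def ways(symbols, toks):
    """Number of distinct ways the sequence `symbols` derives exactly the
    token tuple `toks`. Assumes an epsilon-free grammar, so every symbol
    consumes at least one token (this bounds the recursion)."""
    if not symbols:
        return 1 if not toks else 0
    head, rest = symbols[0], tuple(symbols[1:])
    if head not in P:                      # terminal: must match literally
        return ways(rest, toks[1:]) if toks and toks[0] == head else 0
    total = 0
    # head takes at least 1 token and leaves >= 1 per remaining symbol
    for k in range(1, len(toks) - len(rest) + 1):
        n = sum(ways(rhs, toks[:k]) for rhs in P[head])
        if n:
            total += n * ways(rest, toks[k:])
    return total

trees = ways(("Exp",), ("INT", "+", "INT", "*", "INT"))
assert trees == 2      # two parse trees: the grammar is ambiguous
```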

Strategies for eliminating ambiguities. We should try to eliminate ambiguities without changing the language. First, decide which parse tree is the desired one. Then either create an equivalent unambiguous grammar from which only the desired parse trees can be derived, or use additional priority and associativity rules to instruct the parser to derive the "right" parse tree (works for some ambiguities).

Equivalent grammars. Two grammars, G1 and G2, are equivalent if they generate the same language, i.e., each sentence in one of the grammars can also be derived in the other grammar: L(G1) = L(G2).

Example ambiguity: priority (also called precedence).

  Exp -> Exp "+" Exp
  Exp -> Exp "*" Exp
  Exp -> INT

There are two parse trees for INT "+" INT "*" INT: one with "+" at the root, corresponding to prio("*") > prio("+") (according to tradition), and one with "*" at the root, corresponding to prio("+") > prio("*") (which would be unexpected and confusing).

Example ambiguity: associativity.

  Exp -> Exp "+" Exp
  Exp -> Exp "-" Exp
  Exp -> Exp "**" Exp
  Exp -> INT

For operators with the same priority, how do several in a sequence associate? Left-associative (usual for most operators): INT "+" INT "-" INT is grouped as (INT "+" INT) "-" INT. Right-associative (usual for the power operator): INT "**" INT "**" INT is grouped as INT "**" (INT "**" INT).

Example ambiguity: non-associativity.

  Exp -> Exp "<" Exp
  Exp -> INT

For some operators, it does not make sense to have several in a sequence at all. They are non-associative. INT "<" INT "<" INT has two parse trees, (INT "<" INT) "<" INT and INT "<" (INT "<" INT), and we would like to forbid both, i.e., rule the sentence out of the language.

Disambiguating expression grammars. How can we change the grammar so that only the desired trees can be derived? Idea: restrict certain subtrees by introducing new nonterminals. Priority: introduce a new nonterminal for each priority level: Term, Factor, Primary, ... Left associativity: restrict the right operand so it can only contain expressions of higher priority. Right associativity: ...

Exercise. Ambiguous grammar:

  Expr -> Expr "+" Expr
  Expr -> Expr "*" Expr
  Expr -> INT
  Expr -> "(" Expr ")"

Equivalent unambiguous grammar:

Solution. Ambiguous grammar:

  Expr -> Expr "+" Expr
  Expr -> Expr "*" Expr
  Expr -> INT
  Expr -> "(" Expr ")"

Equivalent unambiguous grammar:

  Expr -> Expr "+" Term
  Expr -> Term
  Term -> Term "*" Factor
  Term -> Factor
  Factor -> INT
  Factor -> "(" Expr ")"

Here, we introduce a new nonterminal, Term, that is more restricted than Expr: from Term, we cannot derive any new additions. We use Term as the right operand in the addition production, to make sure no new additions will appear to the right. This gives left-associativity. We use Term, and the even more restricted nonterminal Factor, in the multiplication production, to make sure no additions can appear there without using parentheses. This gives multiplication higher priority than addition.
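A brute-force count of parse trees (illustrative Python, not from the slides; names and the epsilon-free assumption are ours) confirms that the layered grammar yields exactly one tree for the previously ambiguous sentence INT "+" INT "*" INT:

```python
# Counting distinct parse trees under the layered Expr/Term/Factor grammar.

P = {
    "Expr":   [("Expr", "+", "Term"), ("Term",)],
    "Term":   [("Term", "*", "Factor"), ("Factor",)],
    "Factor": [("INT",), ("(", "Expr", ")")],
}

def ways(symbols, toks):
    """Number of ways the symbol sequence derives exactly `toks`
    (epsilon-free grammar assumed, so each symbol takes >= 1 token)."""
    if not symbols:
        return 1 if not toks else 0
    head, rest = symbols[0], tuple(symbols[1:])
    if head not in P:                      # terminal: must match literally
        return ways(rest, toks[1:]) if toks and toks[0] == head else 0
    total = 0
    for k in range(1, len(toks) - len(rest) + 1):
        n = sum(ways(rhs, toks[:k]) for rhs in P[head])
        if n:
            total += n * ways(rest, toks[k:])
    return total

# Exactly one parse tree: the disambiguation worked for this sentence.
assert ways(("Expr",), ("INT", "+", "INT", "*", "INT")) == 1
```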

Writing the grammar in different notations.

Canonical form:
  Expr -> Expr "+" Term
  Expr -> Term
  Term -> Term "*" Factor
  Term -> Factor
  Factor -> INT
  Factor -> "(" Expr ")"

Equivalent BNF (Backus-Naur Form): use alternatives instead of several productions per nonterminal.

Equivalent EBNF (Extended BNF): use repetition instead of recursion, where possible.

Writing the grammar in different notations.

Canonical form:
  Expr -> Expr "+" Term
  Expr -> Term
  Term -> Term "*" Factor
  Term -> Factor
  Factor -> INT
  Factor -> "(" Expr ")"

Equivalent BNF (Backus-Naur Form), using alternatives instead of several productions per nonterminal:
  Expr -> Expr "+" Term | Term
  Term -> Term "*" Factor | Factor
  Factor -> INT | "(" Expr ")"

Equivalent EBNF (Extended BNF), using repetition instead of recursion where possible:
  Expr -> Term ("+" Term)*
  Term -> Factor ("*" Factor)*
  Factor -> INT | "(" Expr ")"
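The EBNF form maps naturally onto hand-written code: one function per nonterminal, with each * repetition becoming a loop. Below is a minimal Python sketch of such an evaluator (our own illustration; recursive descent is the topic of a later lecture, and all function names are invented). Folding the value as the loop runs gives left-associativity, and the Expr/Term/Factor layering gives "*" higher priority than "+".

```python
# Evaluator following Expr -> Term ("+" Term)*, Term -> Factor ("*" Factor)*,
# Factor -> INT | "(" Expr ")". Tokens are given as a list of strings.

def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok=None):
        nonlocal pos
        t = tokens[pos]
        assert tok is None or t == tok, f"expected {tok}, got {t}"
        pos += 1
        return t

    def expr():                 # Expr -> Term ("+" Term)*
        v = term()
        while peek() == "+":
            eat("+")
            v = v + term()      # fold left: left-associative
        return v

    def term():                 # Term -> Factor ("*" Factor)*
        v = factor()
        while peek() == "*":
            eat("*")
            v = v * factor()
        return v

    def factor():               # Factor -> INT | "(" Expr ")"
        if peek() == "(":
            eat("(")
            v = expr()
            eat(")")
            return v
        return int(eat())

    return expr()

assert evaluate(["1", "+", "2", "*", "3"]) == 7       # "*" binds tighter
assert evaluate(["(", "1", "+", "2", ")", "*", "3"]) == 9
```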

Translating EBNF to canonical form.

  EBNF:
    Top-level repetition:  X -> γ1 (γ2)* γ3
    Top-level alternative: X -> γ1 | γ2
    Top-level parentheses: X -> γ1 (...) γ2

  Equivalent canonical form:

where each γk is a sequence of terminals and nonterminals.

Translating EBNF to canonical form.

  Top-level repetition:
    EBNF:      X -> γ1 (γ2)* γ3
    Canonical: X -> γ1 N γ3,  N -> γ2 N,  N -> ε

  Top-level alternative:
    EBNF:      X -> γ1 | γ2
    Canonical: X -> γ1,  X -> γ2

  Top-level parentheses:
    EBNF:      X -> γ1 (...) γ2
    Canonical: X -> γ1 N γ2,  N -> ...

where each γk is a sequence of terminals and nonterminals, and N is a fresh nonterminal.
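The repetition rule is mechanical enough to express as code. The Python sketch below (our own illustration, not part of the course material; the encoding of a starred group as a tagged tuple and the fresh name "N" are assumptions) rewrites X -> γ1 (γ2)* γ3 into the three canonical productions:

```python
# Rewrite a production whose rhs contains one starred group into
# canonical-form productions with a fresh helper nonterminal.
# Plain symbols are strings; ("*", [...]) marks the starred group γ2*.

def eliminate_repetition(lhs, rhs, fresh="N"):
    """X -> γ1 (γ2)* γ3  becomes  X -> γ1 N γ3,  N -> γ2 N,  N -> ε."""
    star = next(i for i, s in enumerate(rhs) if isinstance(s, tuple))
    gamma2 = rhs[star][1]
    return [
        (lhs, rhs[:star] + [fresh] + rhs[star + 1:]),
        (fresh, gamma2 + [fresh]),
        (fresh, []),                     # N -> ε
    ]

# Example: Expr -> Term ("+" Term)*
rules = eliminate_repetition("Expr", ["Term", ("*", ["+", "Term"])])
assert rules == [("Expr", ["Term", "N"]),
                 ("N", ["+", "Term", "N"]),
                 ("N", [])]
```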

Exercise: Translate from EBNF to canonical form.

  EBNF: Expr -> Term ("+" Term)*

  Equivalent canonical form:

Solution: Translate from EBNF to canonical form.

  EBNF: Expr -> Term ("+" Term)*

  Equivalent canonical form:
    Expr -> Term N
    N -> "+" Term N
    N -> ε

Can we show that these are equivalent?

  EBNF: Expr -> Term ("+" Term)*

  Equivalent canonical form (trivial):
    Expr -> Term N
    N -> "+" Term N
    N -> ε

  Alternative equivalent canonical form (non-trivial):
    Expr -> Expr "+" Term
    Expr -> Term

Example proof.

  1. We start with this: Expr -> Term ("+" Term)*
  2. We can move the repetition: Expr -> (Term "+")* Term
  3. Introduce a nonterminal N: Expr -> N Term,  N -> (Term "+")*
  4. Eliminate the repetition: Expr -> N Term,  N -> N Term "+",  N -> ε
  5. Replace N Term by Expr in the second production: Expr -> N Term,  N -> Expr "+",  N -> ε
  6. Eliminate N: Expr -> Expr "+" Term,  Expr -> Term

Equivalence of grammars. Given two context-free grammars, G1 and G2, are they equivalent? I.e., is L(G1) = L(G2)? This is an undecidable problem: a general algorithm cannot be constructed. In the general case, we need to rely on our ingenuity to find out.

Regular expressions vs context-free grammars.

              RE                          CFG
  Alphabet:   characters                  terminal symbols (tokens)
  Language:   strings (char sequences)    sentences (token sequences)
  Used for:   tokens                      parse trees
  Power:      iteration                   recursion
  Recognizer: DFA                         NFA with stack
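The "NFA with stack" row is the key difference: a DFA has finitely many states and cannot count unbounded nesting, while a stack can. As a small illustration (our own sketch, not from the slides), here is a recognizer for balanced parentheses, a context-free but non-regular language, where a single depth counter stands in for the stack:

```python
# Why a CFG recognizer needs a stack: balanced parentheses cannot be
# recognized by any DFA, but one counter (a degenerate stack) suffices.

def balanced(s):
    depth = 0                      # stands in for a stack of "(" symbols
    for ch in s:
        if ch == "(":
            depth += 1             # push
        elif ch == ")":
            if depth == 0:
                return False       # pop from empty stack: reject
            depth -= 1             # pop
        else:
            return False           # only parentheses in this toy alphabet
    return depth == 0              # stack must be empty at the end

assert balanced("((()))") and balanced("")
assert not balanced("(()") and not balanced(")(")
```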

The Chomsky grammar hierarchy.

  Grammar             Rule patterns                     Type
  regular             X -> aY, X -> a, or X -> ε        3
  context-free        X -> γ                            2
  context-sensitive   αXβ -> αγβ                        1
  arbitrary           γ -> δ                            0

Each type is contained in the next: Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0. Regular grammars have the same power as regular expressions (tail recursion = iteration). Types 2 and 3 are of practical use in compiler construction; types 0 and 1 are only of theoretical interest.

Summary questions.

- Define a small example language with a CFG.
- What is a nonterminal symbol? A terminal symbol? A production? A start symbol? A parse tree?
- What is a left-hand side of a production? A right-hand side?
- Given a grammar G, what is meant by the language L(G)?
- What is a derivation step? A derivation? A leftmost derivation?
- How does a derivation correspond to a parse tree?
- When is a grammar ambiguous? Unambiguous?
- How can ambiguities for expressions be removed?
- When are two grammars equivalent?
- When should we use canonical form, and when EBNF?
- Translate an EBNF grammar to canonical form.
- Explain why context-free grammars are more powerful than regular expressions.
- What is the Chomsky hierarchy?

Readings.

F3: Context-free grammars, derivations, ambiguity, EBNF. Appel, chapter 3-3.1. Appel, relevant exercises: 3.1-3.3. Try to solve the problems in Seminar 2.

F4: Predictive parsing. Recursive descent. LL grammars and parsing. Left recursion and factorization. Appel, chapter 3.2.