Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}


Compiler Construction: Grammars
Lennart Andersson
Revision 2012 01 23
Compiler Construction 2012 F03

Parsing

[Figure: compiler front end. source code → scanner → tokens → parser → parse tree (implicit) → AST builder → AST (explicit). The scanner is specified with regular expressions (lexical analysis); the parser is specified with a context-free grammar (syntactic analysis).]

Regular expressions vs Grammars

Regular expressions are not powerful enough to describe the syntax of programming languages.

A simple generalization can be made: use names for regular expressions and allow these names to appear in expressions (recursively).

  expr → INT | expr + expr | ( expr )

Rules of the grammar G_Stmt

  while (k<=n) {sum = sum+k; k=k+1;}

  statement → whilestmt | assignment | compoundstmt
  whilestmt → while ...

Further nonterminals named on the slide: assignment, compoundstmt, expr, lessequal, add.

Grammar

A context-free grammar has four components, G = (N, T, P, S):

  N  finite set of nonterminal symbols
  T  finite set of terminal symbols (tokens)
  P  finite set of productions (rules)
  S  the start symbol (one of the nonterminals in N)

N and T are disjoint sets.

Each production has a left part that is a nonterminal symbol and a right part that is a regular expression with terminal and nonterminal symbols. There are three operations: alternative ( | ), concatenation, and iteration ( * ), in order of increasing precedence. Parentheses are used for grouping.

Derivation of strings

A string belongs to a language L(G) if it can be derived from the start symbol.

Derivation (Swedish: härledning):

  1. Start with the start symbol.
  2. Replace a nonterminal X using the right-hand side of a production for X:
     if there are alternatives ( | ), choose one;
     if there are iterations ( * ), repeat any number of times.
  3. Continue replacing nonterminals in this way until there are only terminal symbols left.

The resulting string belongs to the language.

Derivation

Derivation of while (k<=n) {sum = sum+k; k=k+1;}

  statement ⇒ whilestmt
            ⇒ while ( expr ) statement
            ⇒ while ( lessequal ) statement
            ⇒ while ( expr <= expr ) statement
            ⇒ while ( ID <= expr ) statement
            ⇒ while ( ID <= ID ) statement
            ⇒ while ( ID <= ID ) compoundstmt
            ⇒ ...

All the strings in the derivation are called sentential forms. The final string is called a sentence.

Parse tree

[Figure: parse tree with root statement and leaves while ( ID <= ID ) { ID = ID + ID ; ID = ID + INT ; }]
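The four-component definition G = (N, T, P, S) can be written down directly as a data structure. The following is a minimal sketch, assuming productions in canonical form (each right part is a plain sequence of symbols); the class name `Grammar`, the method `problems`, and the example object `g_expr` are hypothetical names chosen for illustration. The example grammar is expr → INT | expr + expr | ( expr ).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # N
    terminals: frozenset     # T
    productions: tuple       # P, as (left part, right part) pairs
    start: str               # S

    def problems(self):
        """Return a list of violated conditions (empty if well-formed)."""
        found = []
        if self.nonterminals & self.terminals:
            found.append("N and T are not disjoint")
        if self.start not in self.nonterminals:
            found.append("S is not in N")
        symbols = self.nonterminals | self.terminals
        for left, right in self.productions:
            if left not in self.nonterminals:
                found.append(f"left part {left!r} is not a nonterminal")
            for symbol in right:
                if symbol not in symbols:
                    found.append(f"unknown symbol {symbol!r} in a rule for {left!r}")
        return found


# Hypothetical example: expr -> INT | expr + expr | ( expr )
g_expr = Grammar(
    nonterminals=frozenset({"expr"}),
    terminals=frozenset({"INT", "+", "(", ")"}),
    productions=(
        ("expr", ("INT",)),
        ("expr", ("expr", "+", "expr")),
        ("expr", ("(", "expr", ")")),
    ),
    start="expr",
)

print(g_expr.problems())  # prints []
```

Keeping N and T as explicit sets, rather than inferring them from the rules, makes the disjointness requirement something that can be checked instead of assumed.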

Derivations

A derivation has the form

  γ₀ ⇒ γ₁ ⇒ ... ⇒ γₙ,  n ≥ 0

where each γᵢ is a string of terminal and nonterminal symbols obtained from the previous one by the use of one production. When there is such a derivation we write

  γ₀ ⇒* γₙ

The language L(G)

The language L(G), where G = (N, T, P, S), is the set of all strings that can be derived from S using the rules in P. G generates L(G).

  L(G) = { w ∈ T* | S ⇒* w }

T* denotes the set of all strings with symbols from T. G is finite but L(G) is usually an infinite set.

Leftmost derivation (Swedish: vänsterhärledning)

Grammar G_expr

  expr → expr + expr
  expr → expr * expr
  expr → INT

The leftmost nonterminal is replaced in each derivation step. Derive INT + INT * INT.

  expr ⇒ expr + expr
       ⇒ INT + expr
       ⇒ INT + expr * expr
       ⇒ INT + INT * expr
       ⇒ INT + INT * INT

Another leftmost derivation

Grammar G_expr

  expr → expr + expr
  expr → expr * expr
  expr → INT

Derive INT + INT * INT.

  expr ⇒ expr * expr
       ⇒ expr + expr * expr
       ⇒ INT + expr * expr
       ⇒ INT + INT * expr
       ⇒ INT + INT * INT

[Figure: parse tree]
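The two leftmost derivations of INT + INT * INT can be replayed mechanically. This is a small sketch, assuming G_expr is stored as a numbered list of productions; the names `G_EXPR`, `NONTERMINALS`, and `leftmost_derivation`, and the idea of describing a derivation by a list of production indices, are illustrative assumptions.

```python
# G_expr in canonical form; the list index is used to pick a production.
G_EXPR = [
    ("expr", ["expr", "+", "expr"]),  # production 0
    ("expr", ["expr", "*", "expr"]),  # production 1
    ("expr", ["INT"]),                # production 2
]
NONTERMINALS = {"expr"}


def leftmost_derivation(start, choices):
    """Apply the chosen productions, always rewriting the leftmost nonterminal.

    Returns the list of sentential forms, starting with the start symbol.
    """
    form = [start]
    forms = [" ".join(form)]
    for index in choices:
        left, right = G_EXPR[index]
        # Position of the leftmost nonterminal in the current sentential form.
        pos = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[pos] != left:
            raise ValueError(f"production {index} does not rewrite {form[pos]!r}")
        form = form[:pos] + right + form[pos + 1:]
        forms.append(" ".join(form))
    return forms


first = leftmost_derivation("expr", [0, 2, 1, 2, 2])
second = leftmost_derivation("expr", [1, 0, 2, 2, 2])
print(" => ".join(first))
print(" => ".join(second))
```

Both calls end in the same sentence, INT + INT * INT, although they start with different productions; that is the situation the parse-tree figures illustrate.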

Ambiguous grammar

A grammar is ambiguous if there is a string having more than one parse tree. A grammar is unambiguous if there is no string having more than one parse tree.

If a grammar is ambiguous we should try to find a grammar generating the same language that is unambiguous. In the current case we would prefer a parse tree that respects operator precedences.

Equivalent grammars

Two grammars, G₁ and G₂, are equivalent if they generate the same language, i.e. each sentence that can be derived using one of the grammars can also be derived using the other grammar.

  L(G₁) = L(G₂)

An equivalent grammar

  expr → expr + term
  expr → term
  term → term * factor
  term → factor
  factor → INT

Derive INT + INT * INT.

  expr ⇒ expr + term
       ⇒ term + term
       ⇒ factor + term
       ⇒ INT + term
       ⇒ INT + term * factor
       ⇒ INT + factor * factor
       ⇒ INT + INT * factor
       ⇒ INT + INT * INT

Other equivalent grammars

  expr → expr + term | term
  term → term * factor | factor
  factor → INT

  expr → term ( + term )*
  term → factor ( * factor )*
  factor → INT
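For a single sentence, ambiguity can be checked by counting parse trees. The sketch below is one possible way to do that, assuming canonical-form grammars without empty right-hand sides and without cycles of single-nonterminal rules (both grammars here satisfy this). The names `count_trees`, `AMBIGUOUS`, and `EQUIVALENT` are hypothetical; the general question "is G ambiguous?" cannot be answered this way, only the question for one given string.

```python
from functools import lru_cache

# Any symbol that is not a key of the dictionary is treated as a terminal.
AMBIGUOUS = {
    "expr": [("expr", "+", "expr"), ("expr", "*", "expr"), ("INT",)],
}
EQUIVALENT = {
    "expr": [("expr", "+", "term"), ("term",)],
    "term": [("term", "*", "factor"), ("factor",)],
    "factor": [("INT",)],
}


def count_trees(grammar, start, tokens):
    """Count the parse trees that derive `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of trees with root `symbol` covering tokens[i:j].
        if symbol not in grammar:
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(count_seq(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can cover tokens[i:j].
        head, rest = rhs[0], rhs[1:]
        if not rest:
            return count(head, i, j)
        # Every symbol must cover at least one token, so spans shrink.
        return sum(
            count(head, i, k) * count_seq(rest, k, j)
            for k in range(i + 1, j - len(rest) + 1)
        )

    return count(start, 0, len(tokens))


sentence = ["INT", "+", "INT", "*", "INT"]
print(count_trees(AMBIGUOUS, "expr", sentence))   # prints 2
print(count_trees(EQUIVALENT, "expr", sentence))  # prints 1
```

The grammar with term and factor admits only the tree in which * binds tighter than +, which is the tree that respects operator precedence.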

Different notations for CFGs

Canonical form (Swedish: kanonisk form)
  The simplest notation: neither alternatives nor iterations are permitted.
  Useful when proving formal properties of grammars and implementing parser generators.

BNF (Backus-Naur form)
  Includes a shorthand for writing several productions for the same nonterminal in the same rule.

EBNF (Extended Backus-Naur Form)
  Additional shorthand notations are introduced: regular expressions can be written in the right-hand side of a rule, e.g., (...)*
  Gives very compact grammars.
  Standard notation for describing the syntax of programming languages.
  Used in JavaCC.

Transformation to canonical form: alternatives

A grammar rule containing an alternative on the top level has the form

  X → γ₁ | γ₂

It can be transformed into

  X → γ₁
  X → γ₂

Transformation to canonical form: iteration

A grammar rule containing iteration on the top level has the form

  X → γ₁ γ₂* γ₃

where γᵢ is a regular expression on the alphabet N ∪ T. It can be transformed into

  X → γ₁ N γ₃
  N → γ₂ N
  N → ɛ

where N is a new nonterminal symbol.

Example

The rule

  expr → term (( + | - ) term)*

can be transformed into canonical form in the same way.
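The transformation to canonical form can be sketched in a few lines, applied here to the example rule expr → term (( + | - ) term)*. This is an illustration under stated assumptions, not the slides' own result: right-hand sides are tuples whose items are symbol names, `("star", body)` for iteration, or `("alt", body1, body2, ...)` for alternatives; fresh nonterminals are named N1, N2, ... and are assumed not to clash with existing names; alternatives nested inside a sequence are handled by copying the surrounding symbols into each alternative.

```python
import itertools


def to_canonical(rules):
    """Rewrite (left, right) rules until no "star" or "alt" items remain."""
    fresh = (f"N{i}" for i in itertools.count(1))  # hypothetical fresh names
    done = []
    work = list(rules)
    while work:
        left, right = work.pop(0)
        for pos, item in enumerate(right):
            if isinstance(item, tuple):
                before, after = right[:pos], right[pos + 1:]
                if item[0] == "alt":
                    # X -> a (b1 | b2) c   becomes   X -> a b1 c,  X -> a b2 c
                    work[:0] = [(left, before + body + after) for body in item[1:]]
                else:
                    # X -> a b* c   becomes   X -> a N c,  N -> b N,  N -> (empty)
                    n = next(fresh)
                    work[:0] = [
                        (left, before + (n,) + after),
                        (n, item[1] + (n,)),
                        (n, ()),
                    ]
                break
        else:
            done.append((left, right))  # already canonical
    return done


ebnf_rule = ("expr", ("term", ("star", (("alt", ("+",), ("-",)), "term"))))
for left, right in to_canonical([ebnf_rule]):
    print(left, "->", " ".join(right) if right else "ɛ")
```

For the example rule this yields four productions: expr → term N1, N1 → + term N1, N1 → - term N1, and N1 → ɛ.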

EBNF

In EBNF (JavaCC) other operators are permitted:

  γ+ is equivalent to γ γ*
  γ? and [γ] are equivalent to γ | ɛ

Example, dangling else

  statement → if expr then statement [ else statement ]

This production introduces ambiguity. Consider

  if e1 then if e2 then s1 else s2

which can be read in two ways:

  if e1 then {if e2 then s1} else s2
  if e1 then {if e2 then s1 else s2}

Dangling else remedies

Remedy 1: close every if statement with a keyword.

  statement → if expr then statement [ else statement ] fi

Remedy 2: distinguish matched and unmatched statements.

  statement → matched | unmatched
  matched → if expr then matched else matched
  matched → otherstmts
  unmatched → if expr then statement
  unmatched → if expr then matched else unmatched

Rightmost derivation

Grammar G_expr

  expr → expr + expr
  expr → expr * expr
  expr → INT

The rightmost nonterminal is replaced in each derivation step. Derive INT + INT * INT.

  expr ⇒ expr + expr
       ⇒ expr + expr * expr
       ⇒ expr + expr * INT
       ⇒ expr + INT * INT
       ⇒ INT + INT * INT
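The dangling-else ambiguity can be made concrete by enumerating every way the production statement → if expr then statement [ else statement ] reads a token sequence. The sketch below is a small backtracking enumerator written only for this production; it assumes well-formed input in which each condition and each simple statement is a single token, and the function name `readings` is hypothetical. Braces in the output mark which statement each branch belongs to, as in the two readings listed for the example.

```python
def readings(tokens, i=0):
    """Yield (text, next_index) for every way to read a statement at tokens[i:]."""
    if tokens[i] != "if":
        yield tokens[i], i + 1          # a simple statement such as s1
        return
    cond = tokens[i + 1]                # tokens[i + 2] is assumed to be "then"
    for then_part, j in readings(tokens, i + 3):
        # Alternative 1: this "if" has no else branch.
        yield f"if {cond} then {{{then_part}}}", j
        # Alternative 2: the next "else" belongs to this "if".
        if j < len(tokens) and tokens[j] == "else":
            for else_part, k in readings(tokens, j + 1):
                yield f"if {cond} then {{{then_part}}} else {{{else_part}}}", k


tokens = "if e1 then if e2 then s1 else s2".split()
# Keep only the readings that consume the whole input.
complete = [text for text, end in readings(tokens) if end == len(tokens)]
for text in complete:
    print(text)
```

Two complete readings are found, one attaching the else to the outer if and one to the inner if. A grammar that follows either remedy is meant to leave exactly one of them.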

LL and LR parsers

Parser algorithms will be the topic of the next lecture.

  An LL parser constructs a leftmost derivation.
  An LR parser constructs a rightmost derivation (in reverse).

Is L(G_Stmt) too large?

  statement → whilestmt | assignment | compoundstmt
  whilestmt → while ( expr ) statement
  assignment → ID = expr ;
  compoundstmt → { statement* }
  expr → lessequal | add | ID | INT
  lessequal → expr <= expr
  add → expr + expr

G_Stmt allows statements that would not be legal in Java:

  while (a+b) {...}
  x = x <= y;

How can this be handled?

What is meant by context-free?

A canonical grammar is context-free if every production has the form A → γ, where A is a nonterminal and γ is a string of terminals and nonterminals.

A production of the form αAβ → αγβ, where α and β are strings of terminals and nonterminals, is context-sensitive. A may be replaced by γ only in the context α _ β.

Comparing CFGs to REs

                    CFG                                RE
  Alphabet          terminal symbols (tokens)          characters
  Language          sentences (sequences of tokens)    strings (sequences of characters)
  Used for          describing syntax for              tokens
                    programming languages
  Expressive power  recursion                          iteration
  Recognizer        nondeterministic pushdown          deterministic finite
                    automaton (NFA with stack)         automata (DFA)
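That G_Stmt accepts while (a+b) {...} and x = x <= y; can be checked with a small recognizer. This is only a sketch under stated assumptions: it works on token names (ID, INT, and the punctuation of G_Stmt), and instead of the left-recursive expr rules it uses the iteration operand (( <= | + ) operand)*, with operand being ID or INT, which generates the same strings. The function name `accepts` and its helper names are hypothetical, and the parsing technique itself belongs to the next lecture.

```python
def accepts(tokens):
    """Return True if the token sequence is a statement of G_Stmt."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r} at position {pos}")
        pos += 1

    def operand():
        nonlocal pos
        if peek() not in ("ID", "INT"):
            raise SyntaxError(f"expected ID or INT at position {pos}")
        pos += 1

    def expr():
        # operand (( <= | + ) operand)*
        nonlocal pos
        operand()
        while peek() in ("<=", "+"):
            pos += 1
            operand()

    def statement():
        if peek() == "while":            # whilestmt
            expect("while"); expect("("); expr(); expect(")"); statement()
        elif peek() == "{":              # compoundstmt
            expect("{")
            while peek() != "}":
                statement()              # raises SyntaxError at end of input
            expect("}")
        else:                            # assignment
            expect("ID"); expect("="); expr(); expect(";")

    try:
        statement()
        return pos == len(tokens)
    except SyntaxError:
        return False


print(accepts("while ( ID + ID ) { }".split()))  # prints True
print(accepts("ID = ID <= ID ;".split()))        # prints True
```

Both statements are syntactically valid according to G_Stmt even though Java would reject them, so the restriction has to come from somewhere other than this grammar.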

The Chomsky hierarchy

  Grammar            Rule patterns          Type
  regular            X → aY or X → ɛ        3
  context free       X → γ                  2
  context sensitive  α X β → α γ β          1
  arbitrary          γ → δ                  0

Regular grammars have the same describing power as regular expressions.

Type 2 and 3 are of practical use in compiler construction. The others are only of theoretical interest.

Unsolvable problems

  Given two context-free grammars, G₁ and G₂. Is L(G₁) = L(G₂)?
  Given a context-free grammar, G. Is G ambiguous?

Parts of a typical CFG for Java

  CompilationUnit → [ PackageDeclaration ] ( ImportDeclaration )* ( TypeDeclaration )*
  PackageDeclaration → "package" Name ";"
  ImportDeclaration → "import" Name [ "." "*" ] ";"
  TypeDeclaration → ClassDeclaration | InterfaceDeclaration | ";"
  ClassDeclaration → ( "abstract" | "final" | "public" )* CleanClassDeclaration
  CleanClassDeclaration → "class" ID [ "extends" Name ] [ "implements" NameList ] ClassBody
  ClassBody → "{" ( ClassBodyDeclaration )* "}"
  ...
  Name → ID ( "." ID )*
  NameList → Name ( "," Name )*
  ...

Around 100 nonterminals and 200 rules.

Summary questions

  Define a small example language with a context-free grammar.
  What is a nonterminal symbol? A terminal symbol? A production? A start symbol?
  Given a grammar G, what is meant by the language L(G)?
  What is a derivation? A leftmost derivation?
  When are two grammars equivalent?
  What is the difference between a context-free grammar and a context-sensitive grammar?
  When should we use canonical form, and when EBNF?
  Translate an EBNF grammar to canonical form.
  Explain why context-free grammars are more powerful than regular expressions.
  Explain why tokens are usually defined by regular expressions rather than context-free grammars.

Some terms

  abbr.  English                          Swedish
  AST    abstract syntax tree             abstrakt syntaxträd
         terminal symbol                  terminalsymbol
         nonterminal symbol               icketerminalsymbol
  CFG    context-free grammar             kontextfri grammatik
         context-sensitive grammar        kontextberoende grammatik
         derivation                       härledning
         canonical form                   kanonisk form
  BNF    Backus-Naur form
  EBNF   Extended BNF                     utvidgad BNF
  RE     regular expression               reguljärt uttryck
  ɛ      the empty string                 tomma strängen
  DFA    deterministic finite automaton   deterministisk ändlig automat

Readings

  F3: Context-free grammars, derivations, ambiguity, EBNF. Appel, chapter 3-3.1.
  F4: Predictive parsing. Recursive descent. LL grammars and parsing. Left recursion and factorization. Appel, chapter 3.2.
  Seminar 2. Grammars.
  Programming assignment 1. Build a scanner. Instructions are on the web.