Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2


Formal Languages

Basis for the design and implementation of programming languages
Alphabet: a finite set Σ of symbols
String: a finite sequence of symbols
Empty string ε: the sequence of length zero
Σ* : the set of all strings over Σ (including ε)
Σ+ : the set of all non-empty strings over Σ
Language: a set of strings L ⊆ Σ*
E.g., for Java, Σ is Unicode, a string is a program, and L is defined by the grammar in the language specification

Formal Grammars

G = (N, T, S, P)
  Finite set of non-terminal symbols N
  Finite set of terminal symbols T
  Starting non-terminal symbol S ∈ N
  Finite set of productions P
Describes a language L ⊆ T*
Production: x → y, where x is a non-empty sequence of terminals and non-terminals, and y is a sequence of terminals and non-terminals
Applying a production: uxv ⇒ uyv

Example: Non-negative Integers

N = { I, D }
T = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
S = I
P = { I → D, I → DI, D → 0, D → 1, ..., D → 9 }

More Common Notation

I → D | DI (two production alternatives)
D → 0 | 1 | ... | 9 (ten production alternatives)
Terminals: 0 ... 9
Starting non-terminal: I (shown first in the list of productions)
Example of production applications, deriving 361:
  I ⇒ DI ⇒ DDI ⇒ D6I ⇒ D6D ⇒ 36D ⇒ 361

Languages and Grammars

String derivation: w1 ⇒ w2 ⇒ ... ⇒ wn, denoted w1 ⇒* wn
If n > 1, the derivation sequence is non-empty: w1 ⇒+ wn
Language generated by a grammar: L(G) = { w ∈ T* | S ⇒+ w }
Fundamental theoretical characterization: the Chomsky hierarchy (Noam Chomsky, MIT)
  Regular languages
  Context-free languages
  Context-sensitive languages
  Unrestricted languages
Regular languages in PL: used for lexical analysis
Context-free languages in PL: used for syntax analysis

Regular Languages (1/5)

Operations on languages:
  Union: L ∪ M = all strings in L or in M
  Concatenation: LM = all strings ab where a is in L and b is in M
  L^0 = { ε } and L^i = L^(i-1) L
  Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
  Positive closure: L+ = L^1 ∪ L^2 ∪ ...
Regular expressions: a notation for languages constructed with the help of such operations
Example: (0|1|2|3|4|5|6|7|8|9)+
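
These operations can be tried out directly on small finite languages; the following Java sketch (illustrative only, not part of the slides) builds union, concatenation, and a truncated closure from sets of strings:

    import java.util.*;

    public class LangOps {
        // Union: all strings in L or in M
        static Set<String> union(Set<String> l, Set<String> m) {
            Set<String> r = new HashSet<>(l);
            r.addAll(m);
            return r;
        }

        // Concatenation: every a in L followed by every b in M
        static Set<String> concat(Set<String> l, Set<String> m) {
            Set<String> r = new HashSet<>();
            for (String a : l)
                for (String b : m)
                    r.add(a + b);
            return r;
        }

        // Closure truncated at L^0 ∪ L^1 ∪ ... ∪ L^n (the full closure L* is the union over all i >= 0)
        static Set<String> closureUpTo(Set<String> l, int n) {
            Set<String> power = Set.of("");   // L^0 = { ε }
            Set<String> r = new HashSet<>(power);
            for (int i = 1; i <= n; i++) {
                power = concat(power, l);     // L^i = L^(i-1) L
                r.addAll(power);
            }
            return r;
        }

        public static void main(String[] args) {
            Set<String> digits = Set.of("0", "1", "2");
            // All strings of at most 3 symbols over {0,1,2}, including ε: 1 + 3 + 9 + 27 = 40
            System.out.println(closureUpTo(digits, 3).size());
        }
    }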

Regular Languages (2/5)

Given some alphabet, a regular expression is one of the following:
  The empty string ε
  Any symbol from the alphabet
  If r and s are regular expressions, so are r | s, rs, r*, r+, r?, and (r)
* / + / ? have higher precedence than concatenation, which has higher precedence than |
All operators are left-associative

Regular Languages (3/5)

Each regular expression r defines a language L(r):
  L(ε) = { ε }
  L(a) = { a } for an alphabet symbol a
  L(r | s) = L(r) ∪ L(s)
  L(rs) = L(r) L(s)
  L(r*) = (L(r))*
  L(r+) = (L(r))+
  L(r?) = { ε } ∪ L(r)
  L((r)) = L(r)
Example: what is the language defined by 0 (x|X) (0|1|...|9|a|b|...|f|A|B|...|F)+ ?
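
The expression at the end of this slide describes hexadecimal literals such as 0x1A or 0Xff; in java.util.regex syntax (an illustration, not part of the slides) the same language can be written as 0[xX][0-9a-fA-F]+:

    import java.util.regex.Pattern;

    public class HexCheck {
        // 0, then x or X, then one or more hex digits: the same language as
        // 0 (x|X) (0|1|...|9|a|...|f|A|...|F)+ from the slide
        private static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+");

        public static void main(String[] args) {
            System.out.println(HEX.matcher("0x1A").matches()); // true
            System.out.println(HEX.matcher("0x").matches());   // false: + requires at least one digit
            System.out.println(HEX.matcher("12AB").matches()); // false: no 0x prefix
        }
    }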

Regular Languages (4/5)

Regular grammars:
  All productions are of the form A → wB or A → w, where A and B are non-terminals and w is a sequence of terminals; this is a right-regular grammar
  Or all productions are of the form A → Bw or A → w; this is a left-regular grammar
Example: L = { a^n b | n > 0 } is a regular language, generated by S → Ab and A → a | Aa
I → D | DI and D → 0 | 1 | ... | 9: is this a regular grammar?

Regular Languages (5/5)

Equivalent formalisms for regular languages:
  Regular grammars
  Regular expressions
  Nondeterministic finite automata (NFA)
  Deterministic finite automata (DFA)
Additional details: Sections 2.2 and 2.4
What does this have to do with PLs? It is the foundation for the lexical analysis done by a scanner
You will have to implement a scanner for your interpreter project; Section 2.2 provides useful guidelines

Regular Languages in Compilers & Interpreters

stream of characters: w, h, i, l, e, (, a, 1, 5, >, b, b, ), d, o, ...
  ↓
Scanner (uses a regular grammar to perform lexical analysis)
  ↓
stream of tokens: keyword[while], leftparen, id[a15], op[>], id[bb], rightparen, keyword[do], ...
  ↓
Parser (uses a context-free grammar to perform syntax analysis)
  ↓
parse tree (each token is a leaf in the parse tree)
  ↓
more compiler/interpreter components
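
The token stream in this picture needs very little machinery to represent; the sketch below (hypothetical names, not the course's required interface) shows one way to model it in Java:

    // One possible token representation (illustrative only)
    enum TokenKind { KEYWORD, ID, OP, LEFTPAREN, RIGHTPAREN }

    record Token(TokenKind kind, String text) { }

    class TokenDemo {
        public static void main(String[] args) {
            // The token stream for the fragment: while ( a15 > bb ) do
            Token[] tokens = {
                new Token(TokenKind.KEYWORD, "while"),
                new Token(TokenKind.LEFTPAREN, "("),
                new Token(TokenKind.ID, "a15"),
                new Token(TokenKind.OP, ">"),
                new Token(TokenKind.ID, "bb"),
                new Token(TokenKind.RIGHTPAREN, ")"),
                new Token(TokenKind.KEYWORD, "do")
            };
            for (Token t : tokens) System.out.println(t);  // e.g. Token[kind=ID, text=a15]
        }
    }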

Uses of Regular Languages

Lexical analysis in compilers:
  E.g., an identifier token is a string from the regular language letter (letter | digit)*
  Each token is a terminal symbol for the context-free grammar of the parser
Pattern matching:
  stdlinux> grep a\+b foo.txt
  Finds every line in foo.txt that contains a string from the language L = { a^n b | n > 0 }, i.e., the language of the regular expression a+b
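
The grep example can be reproduced programmatically; this Java sketch (the file name foo.txt and all code are illustrative) prints every line of a file that contains a match of a+b:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.regex.Pattern;

    public class GrepAPlusB {
        public static void main(String[] args) throws IOException {
            // Same language as the shell command: grep a\+b foo.txt
            Pattern p = Pattern.compile("a+b");
            for (String line : Files.readAllLines(Path.of("foo.txt"))) {
                if (p.matcher(line).find())   // find(): the line *contains* a match
                    System.out.println(line);
            }
        }
    }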

Context-Free Languages

They subsume regular languages:
  Every regular language is a context-free language
  L = { a^n b^n | n > 0 } is context-free but not regular
Generated by a context-free grammar; each production is A → w, where A is a non-terminal and w is a sequence of terminals and non-terminals
BNF (Backus-Naur Form): the traditional alternative notation for context-free grammars
  John Backus and Peter Naur, for Algol-58 and Algol-60
  Backus was also one of the creators of Fortran
  Both are recipients of the ACM Turing Award

Example: Non-negative Integers

I → D | DI and D → 0 | 1 | ... | 9
In BNF:
  <integer> ::= <digit> | <digit><integer>
  <digit> ::= 0 | 1 | ... | 9
What if we wanted to disallow zeroes at the beginning? E.g., 509 is OK, but 059 is not
  Possible motivation: in C, a leading 0 means an octal constant
Propose a context-free grammar that achieves this
Is this grammar regular? If not, can you change it to make it regular?
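
One possible answer (a sketch, not the only one, and assuming a single 0 by itself should still be allowed):

    <integer> ::= 0 | <nonzero> | <nonzero> <digits>
    <digits>  ::= <digit> | <digit> <digits>
    <nonzero> ::= 1 | 2 | ... | 9
    <digit>   ::= 0 | 1 | ... | 9

As written this is not regular, but it becomes right-regular if each alternative is expanded so that every right-hand side is a single terminal optionally followed by one non-terminal, e.g. <integer> ::= 0 | 1 | ... | 9 | 1 <digits> | ... | 9 <digits> and <digits> ::= 0 | ... | 9 | 0 <digits> | ... | 9 <digits>.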

Derivation Tree for a String

Also called a parse tree or concrete syntax tree
  Leaf nodes: terminals
  Inner nodes: non-terminals
  Root: the starting non-terminal of the grammar
Describes a particular way to derive a string based on a context-free grammar
The leaf nodes, read from left to right, form the string
  To get this string: depth-first traversal of the tree, always visiting the leftmost unexplored branch

Example of a Derivation Tree

Grammar:
  <expr> ::= <term> | <expr> + <term>
  <term> ::= id | ( <expr> )
Parse tree for (x+y)+z, shown as an indented outline (children are listed left to right under their parent):
  <expr>
    <expr>
      <term>
        (
        <expr>
          <expr>
            <term>
              x
          +
          <term>
            y
        )
    +
    <term>
      z

Equivalent Derivation Sequences

The set of string derivations that are represented by the same parse tree
One derivation:
  <expr> ⇒ <expr> + <term> ⇒ <expr> + z ⇒ <term> + z ⇒ (<expr>) + z ⇒ (<expr> + <term>) + z ⇒ (<expr> + y) + z ⇒ (<term> + y) + z ⇒ (x + y) + z
Another derivation:
  <expr> ⇒ <expr> + <term> ⇒ <term> + <term> ⇒ (<expr>) + <term> ⇒ (<expr> + <term>) + <term> ⇒ (<term> + <term>) + <term> ⇒ (x + <term>) + <term> ⇒ (x + y) + <term> ⇒ (x + y) + z
Many more exist

Ambiguous Grammars

For some string, there are several different parse trees
An ambiguous grammar gives more freedom to the compiler writer, e.g. for code optimizations: choose the shape of the parse tree that leads to better performance
For real-world programming languages, we typically have non-ambiguous grammars: we need a deterministic specification of the parser
To remove the ambiguity: add non-terminals

Elimination of Ambiguity (1/2)

<expr> ::= <expr> + <expr> | <expr> * <expr> | id | ( <expr> )
Two possible parse trees for a + b * c, conceptually equivalent to (a + b) * c and a + (b * c)
Operator precedence: when several operators appear without parentheses, what is an operand of what?
  Is a+b an operand of *, or is b*c an operand of +?
Operator associativity: for several operators with the same precedence, do they group left-to-right or right-to-left?
  Is a - b - c equivalent to (a - b) - c or to a - (b - c)?

Elimination of Ambiguity (2/2)

In most languages, * has higher precedence than +, and both are left-associative
Problem: change
  <expr> ::= <expr> + <expr> | <expr> * <expr> | id | ( <expr> )
to eliminate the ambiguity and get the correct precedence and associativity
Solution: add new non-terminals
  <expr> ::= <expr> + <term> | <term>
  <term> ::= <term> * <factor> | <factor>
  <factor> ::= id | ( <expr> )
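
Under the rewritten grammar, a + b * c has exactly one parse tree; a leftmost derivation (a worked sketch, treating a, b, c as id tokens) is:

    <expr> ⇒ <expr> + <term>
           ⇒ <term> + <term>
           ⇒ <factor> + <term>
           ⇒ a + <term>
           ⇒ a + <term> * <factor>
           ⇒ a + <factor> * <factor>
           ⇒ a + b * <factor>
           ⇒ a + b * c

The * ends up grouping b and c (higher precedence), and the left recursion in <expr> and <term> makes + and * left-associative.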

The Dangling-else Problem

Ambiguity for else:
  <stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt>
  if a then if b then c:=1 else c:=2
Two possible parse trees: the else can attach to the inner if, as in if a then (if b then c:=1 else c:=2), or to the outer if, as in if a then (if b then c:=1) else c:=2
Traditional solution: match the else with the last unmatched then

Non-Ambiguous Grammar

<stmt> ::= <matched> | <unmatched>
<matched> ::= <non-if-stmt> | if <expr> then <matched> else <matched>
<unmatched> ::= if <expr> then <stmt> | if <expr> then <matched> else <unmatched>

Extended BNF (EBNF)

[ ] : optional element
  if <expr> then <stmt> [ else <stmt> ]
{ } : repetition (0 or more times); sometimes shown as { }*
  <IdList> ::= <id> { , <id> }
{ }+ : repetition (1 or more times)
  <block> ::= begin <stmt> { <stmt> } end
  <block> ::= begin { <stmt> }+ end
EBNF does not change the expressive power of the notation (we can always rewrite in plain BNF)
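
As an illustration of that last point (a sketch, not from the slides), the EBNF rules above can be rewritten in plain BNF by introducing a recursive non-terminal:

    <IdList>   ::= <id> | <id> , <IdList>
    <block>    ::= begin <StmtList> end
    <StmtList> ::= <stmt> | <stmt> <StmtList>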

Core: A Toy Imperative Language (1/2)

<prog> ::= program <decl-seq> begin <stmt-seq> end
<decl-seq> ::= <decl> | <decl> <decl-seq>
<stmt-seq> ::= <stmt> | <stmt> <stmt-seq>
<decl> ::= int <id-list> ;
<id-list> ::= id | id , <id-list>
<stmt> ::= <assign> | <if> | <loop> | <in> | <out>
<assign> ::= id := <expr> ;
<in> ::= input <id-list> ;
<out> ::= output <id-list> ;
<if> ::= if <cond> then <stmt-seq> endif ; | if <cond> then <stmt-seq> else <stmt-seq> endif ;

Core: A Toy Imperative Language (2/2)

<loop> ::= while <cond> begin <stmt-seq> endwhile ;
<cond> ::= <cmpr> | ! <cond> | ( <cond> AND <cond> ) | ( <cond> OR <cond> )
<cmpr> ::= [ <expr> <cmpr-op> <expr> ]
<cmpr-op> ::= < | = | != | > | >= | <=
<expr> ::= <term> | <term> + <expr> | <term> - <expr>
<term> ::= <factor> | <factor> * <term>
<factor> ::= const | id | - <factor> | ( <expr> )
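
As a quick sanity check of this grammar, here is one small program that it derives (illustrative only; the identifiers and layout are chosen arbitrarily):

    program
      int X, FACT;
    begin
      input X;
      FACT := 1;
      while [ X > 1 ]
      begin
        FACT := FACT * X;
        X := X - 1;
      endwhile;
      output FACT;
    end

It computes the factorial of its input; every statement form used here corresponds to one of the productions above.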

Parser vs. Scanner

id and const are terminal symbols for the grammar of the language: tokens that are provided by the scanner to the parser
But they are non-terminals in the regular grammar for lexical analysis, where the terminals are ASCII characters:
  <id> ::= <letter> | <id><letter> | <id><digit>
  <letter> ::= A | B | ... | Z | a | b | ... | z
  <const> ::= <digit> | <const><digit>
  <digit> ::= 0 | 1 | ... | 9
Note: as written, this grammar is not regular, but it can easily be changed to an equivalent regular grammar
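
For example (a sketch, not from the slides), <id> can be rewritten in right-regular form, with one pair of alternatives per character:

    <id>   ::= A | A <rest> | B | B <rest> | ... | z | z <rest>
    <rest> ::= A | A <rest> | ... | z | z <rest> | 0 | 0 <rest> | ... | 9 | 9 <rest>

Every production now has the form A → w or A → wB with w a single terminal, so the grammar is regular, and it still generates letter (letter | digit)*; <const> can be handled the same way.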

Notes for the Core Interpreter Project

Consider 9 - 5 + 4
  What is the parse tree? What is the problem?
  For ease of implementation, we will not fix this
  But if we wanted to fix it, how could we?
Manually writing a scanner for this language:
  Ad hoc approach (next slide)
  Systematic approach: write regular expressions for all tokens, convert them to an NFA, convert that to a DFA, minimize the DFA, and write code that mimics its transitions (Section 2.2)
There exist various tools to do this automatically, but you should not use them for the project (they will be used in CSE 5343)
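
A note on the question above (a standard observation, not spelled out on the slide): because <expr> ::= <term> - <expr> is right-recursive, the only parse tree for 9 - 5 + 4 groups it as 9 - (5 + 4), which evaluates to 0 instead of the expected (9 - 5) + 4 = 8. The usual fix is to make the rules left-recursive, as on the Elimination of Ambiguity slides, or to build left-leaning trees iteratively in the parser.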

Outline of a Scanner for Core (1/2)

The parser asks the scanner for the next token
Skip white spaces and read the next character x
If x is one of the single-character symbols ; , ( ) [ ] = + - * return the corresponding token
If x is :, read the next character y
  If y is not =, report an error; else return the token for :=
If x is !, peek at the next character y
  If y is not =, return the token for !
  If y is =, read it and return the token for !=
Peeking can be done easily in C, C++, and Java file I/O

Outline of a Scanner for Core (2/2)

If x is <, peek at the next character y
  If y is not =, return the token for <
  If y is =, read it and return the token for <=
Similarly when x is >
If x is a letter, keep reading characters; stop before the first character that is not a letter or a digit
  If the string is a keyword, return the keyword token
  Else return the token id with the string name attached
If x is a digit, keep reading characters; stop before the first non-digit character
  Return the token const with the integer value attached
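
As a rough illustration of this outline (a sketch, not the required project code; class and method names are invented, and tokens are simplified to plain strings), the "next token" logic might look like this in Java:

    import java.io.IOException;
    import java.io.PushbackReader;

    // Sketch of the ad hoc scanning approach described above.
    public class CoreScannerSketch {
        private final PushbackReader in;   // supports one-character peeking via unread()

        public CoreScannerSketch(PushbackReader in) { this.in = in; }

        public String nextToken() throws IOException {
            int c = in.read();
            while (c != -1 && Character.isWhitespace(c)) c = in.read();  // skip white space
            if (c == -1) return "EOF";
            char x = (char) c;
            if (";,()[]=+-*".indexOf(x) >= 0) return String.valueOf(x);  // single-character symbols
            if (x == ':') {                            // ':' must be followed by '='
                if (in.read() != '=') throw new IOException("':' not followed by '='");
                return ":=";
            }
            if (x == '!' || x == '<' || x == '>') {    // may be followed by '='
                int y = in.read();
                if (y == '=') return x + "=";          // !=, <=, >=
                if (y != -1) in.unread(y);             // put the peeked character back
                return String.valueOf(x);
            }
            if (Character.isLetter(x)) {               // keyword or identifier
                String word = readWhile(x, Character::isLetterOrDigit);
                return isKeyword(word) ? "keyword[" + word + "]" : "id[" + word + "]";
            }
            if (Character.isDigit(x)) {                // integer constant
                return "const[" + readWhile(x, Character::isDigit) + "]";
            }
            throw new IOException("illegal character: " + x);
        }

        private String readWhile(char first, java.util.function.IntPredicate keep) throws IOException {
            StringBuilder s = new StringBuilder().append(first);
            int y = in.read();
            while (y != -1 && keep.test(y)) { s.append((char) y); y = in.read(); }
            if (y != -1) in.unread(y);                 // stop *before* the first rejected character
            return s.toString();
        }

        private static boolean isKeyword(String w) {
            return java.util.Set.of("program", "begin", "end", "int", "if", "then", "else",
                    "endif", "while", "endwhile", "input", "output").contains(w);
        }
    }

A driver would wrap a FileReader in the PushbackReader and call nextToken() repeatedly until it returns EOF.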