COP4020 Programming Languages. Syntax Prof. Robert van Engelen


Overview
- Tokens and regular expressions
- Syntax and context-free grammars
- Grammar derivations
- More about parse trees
- Top-down and bottom-up parsing
- Recursive descent parsing

COP4020 Spring 2011

Tokens
- Tokens are the basic building blocks of a programming language: keywords, identifiers, literal values, operators, punctuation
- The first compiler phase (scanning) splits a character stream into tokens
- Tokens have a special role with respect to:
  - Free-format languages: the source program is a sequence of tokens, and the horizontal/vertical position of a token on a page is unimportant (e.g. Pascal)
  - Fixed-format languages: indentation and/or position of a token on a page is significant (early Basic, Fortran, Haskell)
  - Case-sensitive languages: upper- and lowercase are distinct (C, C++, Java)
  - Case-insensitive languages: upper- and lowercase are identical (Ada, Fortran, Pascal)

Defining Token Patterns with Regular Expressions
- The makeup of a token is described by a regular expression (RE)
- A regular expression r is one of:
  - A character (an element of the alphabet Σ), e.g. a, where Σ = {a, b, c}
  - Empty, denoted by ε
  - Concatenation: a sequence of regular expressions, r1 r2 r3 ... rn
  - Alternation: regular expressions separated by a bar, r1 | r2
  - Repetition: a regular expression followed by a star (Kleene star), r*

Example Regular Definitions for Tokens
  digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  unsigned_integer → digit digit*
  signed_integer → (+ | - | ε) unsigned_integer
  relop → < | <= | <> | > | >= | =
  letter → a | b | ... | z | A | B | ... | Z
  id → letter (letter | digit)*
Cannot use recursive definitions! For example, this is not a regular definition:
  digits → digit digits | digit
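The regular definitions above translate almost directly into the pattern syntax of a regex library. The following is a minimal sketch in Python, assuming the standard `re` module; the constant names and the helper `is_token` are illustrative and do not come from the slides.

```python
import re

# Regular definitions from the slide, written as Python regular expressions.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"
SIGNED_INTEGER = rf"[+-]?{UNSIGNED_INTEGER}"      # (+ | - | ε) unsigned_integer
RELOP = r"<=|<>|>=|<|>|="
ID = rf"{LETTER}(?:{LETTER}|{DIGIT})*"


def is_token(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the given token pattern."""
    return re.fullmatch(pattern, text) is not None
```

For example, `is_token(ID, "x1")` holds while `is_token(ID, "1x")` does not, because an id must start with a letter.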

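Finite State Machines = Regular Expression Recognizers
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
Transition diagram for relop (start state 0; states marked * are reached on a lookahead character that is not part of the token):
  state 0, on <      : go to state 1
  state 1, on =      : go to state 2,  return(relop, LE)
  state 1, on >      : go to state 3,  return(relop, NE)
  state 1, on other  : go to state 4*, return(relop, LT)
  state 0, on =      : go to state 5,  return(relop, EQ)
  state 0, on >      : go to state 6
  state 6, on =      : go to state 7,  return(relop, GE)
  state 6, on other  : go to state 8*, return(relop, GT)
Transition diagram for id (start state 9):
  state 9,  on letter          : go to state 10
  state 10, on letter or digit : stay in state 10
  state 10, on other           : go to state 11*, return(gettoken(), install_id())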
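The relop diagram can be turned into a small table-driven recognizer. This is a sketch under stated assumptions: the state numbers follow the slide, while the table names and the function `scan_relop` are hypothetical. The starred states 4 and 8 are modelled by accepting without consuming the lookahead character.

```python
# Transition table: state -> {input character -> next state}.
TRANSITIONS = {0: {"<": 1, "=": 5, ">": 6}, 1: {"=": 2, ">": 3}, 6: {"=": 7}}
# Accepting states that consume every character read so far.
ACCEPT = {2: "LE", 3: "NE", 5: "EQ", 7: "GE"}
# States whose "other" edge leads to a starred state (4 and 8 on the slide).
ACCEPT_ON_OTHER = {1: "LT", 6: "GT"}


def scan_relop(text, pos=0):
    """Return (relop_name, next_pos), or None if no relop starts at pos."""
    state = 0
    while True:
        if state in ACCEPT:
            return (ACCEPT[state], pos)
        ch = text[pos] if pos < len(text) else None
        nxt = TRANSITIONS.get(state, {}).get(ch)
        if nxt is None:
            if state in ACCEPT_ON_OTHER:
                # Lookahead character is left unread (the * on the slide).
                return (ACCEPT_ON_OTHER[state], pos)
            return None
        state = nxt
        pos += 1
```

Calling `scan_relop("a<>b", 1)` recognizes `<>` as NE and reports position 3 as the start of the next token.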

Non-Deterministic Finite State Automata
An NFA is a 5-tuple (S, Σ, δ, s0, F) where
- S is a finite set of states
- Σ is a finite set of symbols, the alphabet
- δ is a mapping from S × Σ to a set of states
- s0 ∈ S is the start state
- F ⊆ S is the set of accepting (or final) states
[Figure: transition diagram of an example NFA with S = {0,1,2,3}, Σ = {a,b}, s0 = 0, F = {3}; edges are labeled a and b.]
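An NFA given as a 5-tuple can be simulated by tracking the set of states it may be in after each input symbol. The sketch below assumes a transition function without ε-moves. The concrete table `DELTA` is an assumption: it encodes the textbook NFA for (a|b)*abb, which is consistent with the state set and final state on the slide, but the exact edges of the slide's figure are not recoverable from the transcript.

```python
def nfa_accepts(delta, start, finals, text):
    """Simulate an NFA without ε-moves by tracking the set of current states."""
    current = {start}
    for ch in text:
        current = {t for s in current for t in delta.get((s, ch), set())}
    return bool(current & finals)


# Assumed example: NFA for (a|b)*abb with S = {0,1,2,3}, s0 = 0, F = {3}.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
```

With this table, `nfa_accepts(DELTA, 0, {3}, "aabb")` is true because one of the possible state sequences ends in state 3.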

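From a Regular Expression to an NFA
[Figure: one NFA fragment per regular expression form, each with a start state i and a final state f.]
- ε: an ε-edge from i to f
- a: an edge labeled a from i to f
- r1 r2: N(r1) followed by N(r2), from i to f
- r1 | r2: ε-edges from i into N(r1) and N(r2), and ε-edges from each of them to f
- r*: ε-edges from i into N(r) and from N(r) to f, plus ε-edges for repeating and for skipping N(r)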

Grammars
A grammar has four components:
- T is a finite set of tokens (terminal symbols)
- N is a finite set of nonterminals
- P is a finite set of productions of the form α → β, where α ∈ (N ∪ T)* N (N ∪ T)* and β ∈ (N ∪ T)*
- S ∈ N is a designated start symbol
In a context-free grammar, the left-hand side α of every production is a single nonterminal (α ∈ N).

Context Free Grammars: BNF Notation
- Regular expressions cannot describe nested constructs, but context-free grammars can
- Backus-Naur Form (BNF) notation for productions:
    <nonterminal> ::= sequence of (non)terminals
  where
  - Each terminal in the grammar is also a token
  - A <nonterminal> defines a syntactic category
  - The symbol | denotes alternative forms in a production
  - The special symbol ε denotes empty

Example
  <Program> ::= program <id> ( <id> <More_ids> ) ; <Block> .
  <Block> ::= <Variables> begin <Stmt> <More_Stmts> end
  <More_ids> ::= , <id> <More_ids>
               | ε
  <Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables>
                | ε
  <More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>
                     | ε
  <Stmt> ::= <id> := <Exp>
           | if <Exp> then <Stmt> else <Stmt>
           | while <Exp> do <Stmt>
           | begin <Stmt> <More_Stmts> end
  <More_Stmts> ::= ; <Stmt> <More_Stmts>
                 | ε
  <Exp> ::= <num>
          | <id>
          | <Exp> + <Exp>
          | <Exp> - <Exp>

Extended BNF
- Extended BNF simplifies grammar definitions
- Extended BNF adds
  - Optional constructs with [ and ]
  - Repetitions with [ ]*
  - Some EBNF definitions also add [ ]+ for one or more repetitions
- Any extended BNF grammar can be rewritten into BNF

Example
  <Program> ::= program <id> ( <id> [ , <id> ]* ) ; <Block> .
  <Block> ::= [ <Variables> ] begin <Stmt> [ ; <Stmt> ]* end
  <Variables> ::= var [ <id> [ , <id> ]* : <Type> ; ]+
  <Stmt> ::= <id> := <Exp>
           | if <Exp> then <Stmt> else <Stmt>
           | while <Exp> do <Stmt>
           | begin <Stmt> [ ; <Stmt> ]* end
  <Exp> ::= <num>
          | <id>
          | <Exp> + <Exp>
          | <Exp> - <Exp>

Derivations
- From a grammar we can derive strings (= sequences of tokens)
- Derivation is the opposite process of parsing
- Starting with the grammar's designated start symbol, each derivation step replaces a nonterminal by a right-hand side of a production for that nonterminal

Example Derivation
Grammar:
  <expression> ::= identifier
                 | unsigned_integer
                 | - <expression>
                 | ( <expression> )
                 | <expression> <operator> <expression>
  <operator> ::= + | - | * | /
Derivation (the first line is the start symbol; every line is a sentential form):
  <expression>
  ⇒ <expression> <operator> <expression>
  ⇒ <expression> <operator> identifier
  ⇒ <expression> + identifier
  ⇒ <expression> <operator> <expression> + identifier
  ⇒ <expression> <operator> identifier + identifier
  ⇒ <expression> * identifier + identifier
  ⇒ identifier * identifier + identifier
Each step replaces a nonterminal with the right-hand side of one of its productions. The final string is the yield.

Rightmost versus Leftmost Derivations
When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step, the derivation is called right-most (left-most).
Rightmost derivation (the rightmost nonterminal is replaced in each step):
  <expression>
  ⇒ <expression> <operator> <expression>
  ⇒ <expression> <operator> identifier
Leftmost derivation (the leftmost nonterminal is replaced in each step):
  <expression>
  ⇒ <expression> <operator> <expression>
  ⇒ identifier <operator> <expression>

A Language Generated by a Grammar
- A context-free grammar is a generator of a context-free language
- The language defined by a grammar G is the set of all strings w that can be derived from the start symbol S:
    L(G) = { w | S ⇒* w }
Example 1:
  <S> ::= a | ( <S> )
  L(G) = { a, (a), ((a)), (((a))), ... }
Example 2:
  <S> ::= <B> | <C>
  <B> ::= <C> + <C>
  <C> ::= 0 | 1
  L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }
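For a grammar that generates a finite language, L(G) can be enumerated mechanically by exploring leftmost derivations breadth-first. The following is a small sketch for the second example grammar above; the dictionary encoding, the function name `language`, and the `limit` safety bound are illustrative choices and are not part of the slides.

```python
from collections import deque

# <S> ::= <B> | <C>,  <B> ::= <C> + <C>,  <C> ::= 0 | 1
GRAMMAR = {
    "S": [["B"], ["C"]],
    "B": [["C", "+", "C"]],
    "C": [["0"], ["1"]],
}


def language(grammar, start, limit=1000):
    """Collect the terminal strings reachable by leftmost derivations.

    `limit` bounds the number of sentential forms explored, so the search
    also stops for grammars that generate an infinite language.
    """
    results = set()
    queue = deque([(start,)])
    seen = {(start,)}
    steps = 0
    while queue and steps < limit:
        form = queue.popleft()
        steps += 1
        # Index of the leftmost nonterminal, if any.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            results.add("".join(form))   # only terminals left: a sentence
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return results
```

Running `language(GRAMMAR, "S")` yields the six strings listed for the second example.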

Parse Trees
- A parse tree depicts the end result of a derivation
- The internal nodes are the nonterminals
- The children of a node are the symbols (terminals and nonterminals) on a right-hand side of a production
- The leaves are the terminals
[Figure: parse tree with root <expression> and children <expression> <operator> <expression>; the left child <expression> expands again to <expression> <operator> <expression>. The leaves read identifier * identifier + identifier.]

Parse Trees
[Figure: the derivation of identifier * identifier + identifier from the Example Derivation slide, shown next to the parse tree it produces.]

Ambiguity
- There is another parse tree for the same grammar and input: the grammar is ambiguous
- This parse tree is not desired, since it makes + appear to have precedence over *
[Figure: parse tree with root <expression> and children <expression> <operator> <expression>; here the right child <expression> expands again to <expression> <operator> <expression>. The leaves read identifier * identifier + identifier.]
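The amount of ambiguity can be made concrete by counting parse trees. The sketch below assumes the input uses only the productions identifier and `<expression> <operator> <expression>` (no parentheses or unary minus), so a string is characterized by its number of operands; the function name `count_trees` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(n_operands: int) -> int:
    """Count parse trees for n operands joined by binary operators under
    <expression> ::= identifier | <expression> <operator> <expression>."""
    if n_operands == 1:
        return 1
    # The root splits the operands into a left part of size k and a right
    # part of size n - k; every split point is allowed by the grammar.
    return sum(
        count_trees(k) * count_trees(n_operands - k)
        for k in range(1, n_operands)
    )
```

For three operands, as in identifier * identifier + identifier, the count is 2, which corresponds to the two trees discussed above.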

Ambiguous Grammars
- Ambiguous grammar: more than one distinct derivation of a string results in different parse trees
- A programming language construct should have only one parse tree to avoid misinterpretation by a compiler
- For expression grammars, associativity and precedence of operators are used to disambiguate:
  <expression> ::= <term> | <expression> <add_op> <term>
  <term> ::= <factor> | <term> <mult_op> <factor>
  <factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )
  <add_op> ::= + | -
  <mult_op> ::= * | /

Ambiguous if-then-else: the Dangling Else
- A classical example of an ambiguous grammar is the set of productions for if-then-else:
  <stmt> ::= if <expr> then <stmt>
           | if <expr> then <stmt> else <stmt>
- It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design
- Ada uses different syntax to avoid ambiguity:
  <stmt> ::= if <expr> then <stmt> end if
           | if <expr> then <stmt> else <stmt> end if

Linear-Time Top-Down and Bottom-Up Parsing
- A parser is a recognizer for a context-free language
- A string (token sequence) is accepted by the parser, and a parse tree can be constructed, if the string is in the language
- For an arbitrary context-free grammar, parsing can take as much as O(n^3) time, where n is the size of the input
- There are large classes of grammars for which we can construct parsers that take O(n) time:
  - Top-down LL parsers for LL grammars (LL = Left-to-right scanning of input, Left-most derivation)
  - Bottom-up LR parsers for LR grammars (LR = Left-to-right scanning of input, Right-most derivation)

Top-Down Parsers and LL Grammars
- A top-down parser is a parser for the LL class of grammars
  - Also called a predictive parser
  - The LL class is a strict subset of the larger LR class of grammars
  - LL grammars cannot contain left-recursive productions (but LR grammars can), for example:
      <X> ::= <X> <Y>
    and
      <X> ::= <Y> <Z>
      <Y> ::= <X>
  - In LL(k), k is the lookahead depth; with k = 1 the parser cannot handle alternatives in productions with common prefixes, for example:
      <X> ::= a b | a c
- A top-down parser constructs a parse tree from the root down
- It is not too difficult to implement a predictive parser for an unambiguous LL(1) grammar in BNF by hand using recursive descent

Top-Down Parser in Action
  <id_list> ::= id <id_list_tail>
  <id_list_tail> ::= , id <id_list_tail>
                   | ;
[Figure: four snapshots of the parse tree for the input A, B, C; as it grows from the root down.]

Top-Down Predictive Parsing
Top-down parsing is called predictive parsing because the parser predicts what it is going to see:
1. As root, the start symbol of the grammar <id_list> is predicted
2. After reading A, the parser predicts that <id_list_tail> must follow
3. After reading , and B, the parser predicts that <id_list_tail> must follow
4. After reading , and C, the parser predicts that <id_list_tail> must follow
5. After reading ; the parser stops

An Ambiguous Non-LL Grammar for Language E
Consider a language E of simple expressions composed of +, -, *, /, (), id, and num:
  <expr> ::= <expr> + <expr>
           | <expr> - <expr>
           | <expr> * <expr>
           | <expr> / <expr>
           | ( <expr> )
           | <id>
           | <num>
Need operator precedence rules

An Unambiguous Non-LL Grammar for Language E
  <expr> ::= <expr> + <term>
           | <expr> - <term>
           | <term>
  <term> ::= <term> * <factor>
           | <term> / <factor>
           | <factor>
  <factor> ::= ( <expr> )
             | <id>
             | <num>

An Unambiguous LL(1) Grammar for Language E
  <expr> ::= <term> <term_tail>
  <term> ::= <factor> <factor_tail>
  <term_tail> ::= <add_op> <term> <term_tail>
                | ε
  <factor> ::= ( <expr> )
             | <id>
             | <num>
  <factor_tail> ::= <mult_op> <factor> <factor_tail>
                  | ε
  <add_op> ::= + | -
  <mult_op> ::= * | /
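The slides do not state how the LL(1) grammar was obtained from the left-recursive one. One standard technique is removal of immediate left recursion: productions A ::= A α | β become A ::= β A_tail and A_tail ::= α A_tail | ε. The sketch below applies that textbook rule; the function name and the `_tail` naming are assumptions, and the result matches the slide's grammar only up to naming (the slide additionally factors + and - into `<add_op>`).

```python
def eliminate_left_recursion(nonterminal, productions):
    """Remove immediate left recursion from the productions of one nonterminal.

    Each production is a list of symbols; [] stands for the ε production.
    """
    tail = nonterminal + "_tail"
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    others = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}
    return {
        nonterminal: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }
```

Applied to `<expr> ::= <expr> + <term> | <expr> - <term> | <term>`, it produces `expr ::= term expr_tail` and `expr_tail ::= + term expr_tail | - term expr_tail | ε`.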

Constructing Recursive Descent Parsers for LL(1)
- Each nonterminal has a function that implements the production(s) for that nonterminal
- The function parses only the part of the input described by the nonterminal

  <expr> ::= <term> <term_tail>

  procedure expr()
    term(); term_tail();

- When more than one alternative production exists for a nonterminal, the lookahead token should help to decide which production to apply

  <term_tail> ::= <add_op> <term> <term_tail>
                | ε

  procedure term_tail()
    case (input_token())
      of '+' or '-': add_op(); term(); term_tail();
      otherwise: /* no op = ε */

Some Rules to Construct a Recursive Descent Parser
- For every nonterminal with more than one production, find all the tokens that each of the right-hand sides can start with (called the FIRST set):
  <X> ::= a             starts with a
        | b a <Z>       starts with b
        | <Y>           starts with c or d
        | <Z> f         starts with e or f
  <Y> ::= c | d
  <Z> ::= e | ε
- Empty productions are coded as skip operations (nops)
- If a nonterminal does not have an empty production, the function should generate an error if no token matches
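The FIRST sets for the example above can be computed with a fixed-point iteration. This is a minimal sketch, assuming a grammar encoded as a dictionary from nonterminal to a list of right-hand sides, with `[]` for the empty production; the names `compute_first` and `first_of_sequence` are illustrative.

```python
EPS = "ε"

GRAMMAR = {
    "X": [["a"], ["b", "a", "Z"], ["Y"], ["Z", "f"]],
    "Y": [["c"], ["d"]],
    "Z": [["e"], []],          # [] is the empty production
}


def first_of_sequence(seq, first):
    """FIRST set of a sequence of symbols, given FIRST sets of nonterminals."""
    result = set()
    for sym in seq:
        sym_first = first[sym] if sym in first else {sym}   # terminal: itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)            # every symbol in seq can derive ε
    return result


def compute_first(grammar):
    """Iterate until no FIRST set grows any further."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first
```

With the resulting sets, `first_of_sequence(["Z", "f"], compute_first(GRAMMAR))` gives the tokens e and f, which agrees with the "starts with e or f" annotation.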

Example for E

  procedure expr()
    term(); term_tail();

  procedure term_tail()
    case (input_token())
      of '+' or '-': add_op(); term(); term_tail();
      otherwise: /* no op = ε */

  procedure term()
    factor(); factor_tail();

  procedure factor_tail()
    case (input_token())
      of '*' or '/': mult_op(); factor(); factor_tail();
      otherwise: /* no op = ε */

  procedure factor()
    case (input_token())
      of '(': match('('); expr(); match(')');
      of identifier: match(identifier);
      of number: match(number);
      otherwise: error;

  procedure add_op()
    case (input_token())
      of '+': match('+');
      of '-': match('-');
      otherwise: error;

  procedure mult_op()
    case (input_token())
      of '*': match('*');
      of '/': match('/');
      otherwise: error;
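The pseudocode for E can be written as a runnable recognizer. The following Python sketch keeps one method per nonterminal; the tokenizer, the end marker `$`, the class name `Parser`, and the helper `parses` are assumptions added for illustration, and `add_op`/`mult_op` are folded into a direct `match` of the lookahead token. The tokenizer silently skips characters it does not recognize, which a real scanner would report as errors.

```python
import re


def tokenize(text):
    # Numbers, identifiers, and single-character operators; other characters
    # (including whitespace) are skipped. "$" marks the end of the input.
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text) + ["$"]


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def input_token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.input_token() != expected:
            raise SyntaxError(
                f"expected {expected!r}, found {self.input_token()!r}"
            )
        self.pos += 1

    def expr(self):
        self.term()
        self.term_tail()

    def term_tail(self):
        if self.input_token() in ("+", "-"):
            self.match(self.input_token())      # add_op
            self.term()
            self.term_tail()
        # otherwise: ε production, nothing to do

    def term(self):
        self.factor()
        self.factor_tail()

    def factor_tail(self):
        if self.input_token() in ("*", "/"):
            self.match(self.input_token())      # mult_op
            self.factor()
            self.factor_tail()
        # otherwise: ε production, nothing to do

    def factor(self):
        tok = self.input_token()
        if tok == "(":
            self.match("(")
            self.expr()
            self.match(")")
        elif tok[0].isalnum() or tok[0] == "_":
            self.match(tok)                     # identifier or number
        else:
            raise SyntaxError(f"unexpected token {tok!r}")


def parses(text):
    """Return True if text is a complete expression of language E."""
    parser = Parser(text)
    try:
        parser.expr()
        parser.match("$")
        return True
    except SyntaxError:
        return False
```

For instance, `parses("1+2*3")` succeeds, while `parses("1+")` fails inside `factor` because no operand follows the operator.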

Recursive Descent Parser's Call Graph = Parse Tree
The dynamic call graph of a recursive descent parser corresponds exactly to the parse tree.
[Figure: call graph for the input string 1+2*3.]

Example
  <type> ::= <simple>
           | ^ id
           | array [ <simple> ] of <type>
  <simple> ::= integer
             | char
             | num dotdot num

Example (cont'd)
The FIRST sets of the right-hand sides:
  <type> ::= <simple>                        {integer, char, num}
           | ^ id                            {^}
           | array [ <simple> ] of <type>    {array}
  <simple> ::= integer                       {integer}
             | char                          {char}
             | num dotdot num                {num}

Example (cont'd)

  procedure match(t : token)
    if input_token() = t then
      nexttoken();
    else error;

  procedure type()
    case (input_token())
      of 'integer' or 'char' or 'num': simple();
      of '^': match('^'); match(id);
      of 'array': match('array'); match('['); simple(); match(']'); match('of'); type();
      otherwise: error;

  procedure simple()
    case (input_token())
      of 'integer': match('integer');
      of 'char': match('char');
      of 'num': match('num'); match('dotdot'); match('num');
      otherwise: error;

Parsing the input: array [ num dotdot num ] of integer
[Figure: eight slides that grow the call tree one call at a time while the lookahead moves through the input.]
  Step 1: type() checks the lookahead and calls match('array')
  Step 2: type() calls match('[')
  Step 3: type() calls simple(), which calls match('num')
  Step 4: simple() calls match('dotdot')
  Step 5: simple() calls match('num')
  Step 6: type() calls match(']')
  Step 7: type() calls match('of')
  Step 8: type() calls type(), which calls simple(), which calls match('integer')
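The step-by-step trace can be reproduced with a small recursive descent parser that records every successful match. This is a sketch under stated assumptions: tokens are plain strings, and the function `parse_type` with its nested helpers is a hypothetical rendering of the pseudocode for `<type>` and `<simple>`.

```python
def parse_type(tokens):
    """Parse tokens according to the <type> grammar.

    Returns the list of matched tokens in the order match() was called.
    """
    pos = 0
    matched = []

    def lookahead():
        return tokens[pos] if pos < len(tokens) else None

    def match(t):
        nonlocal pos
        if lookahead() != t:
            raise SyntaxError(f"expected {t!r}, found {lookahead()!r}")
        matched.append(t)
        pos += 1

    def simple():
        tok = lookahead()
        if tok in ("integer", "char"):
            match(tok)
        elif tok == "num":
            match("num")
            match("dotdot")
            match("num")
        else:
            raise SyntaxError(f"unexpected token {tok!r} in <simple>")

    def type_():
        tok = lookahead()
        if tok in ("integer", "char", "num"):    # FIRST of <simple>
            simple()
        elif tok == "^":
            match("^")
            match("id")
        elif tok == "array":
            match("array")
            match("[")
            simple()
            match("]")
            match("of")
            type_()
        else:
            raise SyntaxError(f"unexpected token {tok!r} in <type>")

    type_()
    if lookahead() is not None:
        raise SyntaxError(f"trailing input starting at {lookahead()!r}")
    return matched
```

Calling `parse_type("array [ num dotdot num ] of integer".split())` returns the tokens in the same order as the eight match steps.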

Bottom-Up LR Parsing
- A bottom-up parser is a parser for the LR class of grammars
- Difficult to implement by hand
- Tools (e.g. Yacc/Bison) exist that generate bottom-up parsers for LALR grammars automatically
- LR parsing is based on shifting tokens onto a stack until the parser recognizes a right-hand side of a production, which it then reduces to a left-hand side (nonterminal) to form a partial parse tree

Bottom-Up Parser in Action
  <id_list> ::= id <id_list_tail>
  <id_list_tail> ::= , id <id_list_tail>
                   | ;
[Figure: input, stack, and partial parse tree for the input A, B, C; at each step.]
Stack contents while shifting:
  A
  A,
  A,B
  A,B,
  A,B,C
  A,B,C;
The following snapshots show the reductions: first ; is reduced to <id_list_tail>, then the stack shrinks past C, B, and A as each , id <id_list_tail> is reduced, until the whole input is reduced to <id_list>.
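The shift and reduce actions for this grammar can be imitated with a naive loop that reduces whenever the top of the stack equals a right-hand side and shifts otherwise. This is only a sketch: a generated LR or LALR parser decides between shifting and reducing with state tables, whereas the code below relies on trying the longer right-hand side first, which happens to suffice for this small grammar. The names `PRODUCTIONS` and `shift_reduce` are illustrative.

```python
# Productions as (left-hand side, right-hand side); the longer right-hand
# side ending in id_list_tail is tried before the shorter one.
PRODUCTIONS = [
    ("id_list_tail", [",", "id", "id_list_tail"]),
    ("id_list_tail", [";"]),
    ("id_list", ["id", "id_list_tail"]),
]


def shift_reduce(tokens):
    """Return (accepted, trace) for a token list such as ['id', ',', 'id', ';']."""
    stack, trace = [], []
    remaining = list(tokens)
    while True:
        for lhs, rhs in PRODUCTIONS:
            if stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]           # pop the right-hand side
                stack.append(lhs)               # push the left-hand side
                trace.append(f"reduce to {lhs}")
                break
        else:
            if not remaining:
                break                           # nothing to reduce or shift
            stack.append(remaining.pop(0))
            trace.append(f"shift {stack[-1]}")
    return stack == ["id_list"], trace
```

For the token sequence of A, B, C; (three id tokens separated by commas and closed by a semicolon), the trace consists of six shifts followed by four reductions.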