Context-free grammars (CFG s)

Similar documents
Syntax Analysis/Parsing. Context-free grammars (CFG s) Context-free grammars vs. Regular Expressions. BNF description of PL/0 syntax

Syntactic Analysis. Syntactic analysis, or parsing, is the second phase of compilation: The token file is converted to an abstract syntax tree.

Compiler Passes. Syntactic Analysis. Context-free Grammars. Syntactic Analysis / Parsing. EBNF Syntax of initial MiniJava.

CSE 401: Introduction to Compiler Construction. Course Outline. Grading. Project

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Spring UW CSE P 501 Spring 2018 C-1

Parsing. Roadmap. > Context-free grammars > Derivations and precedence > Top-down parsing > Left-recursion > Look-ahead > Table-driven parsing

Wednesday, September 9, 15. Parsers

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs:

Lexical and Syntax Analysis. Top-Down Parsing

Parsing Algorithms. Parsing: continued. Top Down Parsing. Predictive Parser. David Notkin Autumn 2008

Syntax. In Text: Chapter 3

CSE 3302 Programming Languages Lecture 2: Syntax

Derivations vs Parses. Example. Parse Tree. Ambiguity. Different Parse Trees. Context Free Grammars 9/18/2012

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

8 Parsing. Parsing. Top Down Parsing Methods. Parsing complexity. Top down vs. bottom up parsing. Top down vs. bottom up parsing

CMPS Programming Languages. Dr. Chengwei Lei CEECS California State University, Bakersfield

Principles of Programming Languages COMP251: Syntax and Grammars

COP 3402 Systems Software Syntax Analysis (Parser)

3. Parsing. Oscar Nierstrasz

Parsers. Xiaokang Qiu Purdue University. August 31, 2018 ECE 468

LL(k) Compiler Construction. Top-down Parsing. LL(1) parsing engine. LL engine ID, $ S 0 E 1 T 2 3

Project 2 Interpreter for Snail. 2 The Snail Programming Language

Lecture 14: Parser Conflicts, Using Ambiguity, Error Recovery. Last modified: Mon Feb 23 10:05: CS164: Lecture #14 1

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Lexical and Syntax Analysis

EDA180: Compiler Construc6on. Top- down parsing. Görel Hedin Revised: a

MiniJava Parser and AST. Winter /27/ Hal Perkins & UW CSE E1-1

Programming Language Syntax and Analysis

Monday, September 13, Parsers

Context-Free Grammars

CPS 506 Comparative Programming Languages. Syntax Specification

Chapter 3. Parsing #1

Chapter 3. Describing Syntax and Semantics ISBN

CS 315 Programming Languages Syntax. Parser. (Alternatively hand-built) (Alternatively hand-built)

Defining syntax using CFGs

COP 3402 Systems Software Top Down Parsing (Recursive Descent)

Parsing II Top-down parsing. Comp 412

Top down vs. bottom up parsing

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Winter UW CSE P 501 Winter 2016 C-1

CSE P 501 Compilers. Parsing & Context-Free Grammars Hal Perkins Winter /15/ Hal Perkins & UW CSE C-1

CMSC 330: Organization of Programming Languages

Section A. A grammar that produces more than one parse tree for some sentences is said to be ambiguous.

ΕΠΛ323 - Θεωρία και Πρακτική Μεταγλωττιστών

A programming language requires two major definitions A simple one pass compiler

LL(k) Parsing. Predictive Parsers. LL(k) Parser Structure. Sample Parse Table. LL(1) Parsing Algorithm. Push RHS in Reverse Order 10/17/2012

LL(k) Compiler Construction. Choice points in EBNF grammar. Left recursive grammar

CMPT 755 Compilers. Anoop Sarkar.

Syntax. Syntax. We will study three levels of syntax Lexical Defines the rules for tokens: literals, identifiers, etc.

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

EDAN65: Compilers, Lecture 04 Grammar transformations: Eliminating ambiguities, adapting to LL parsing. Görel Hedin Revised:

Using an LALR(1) Parser Generator

Parsing III. (Top-down parsing: recursive descent & LL(1) )

CSCI 1260: Compilers and Program Analysis Steven Reiss Fall Lecture 4: Syntax Analysis I

CSCE 314 Programming Languages

Syntax/semantics. Program <> program execution Compiler/interpreter Syntax Grammars Syntax diagrams Automata/State Machines Scanning/Parsing

Syntactic Analysis. Top-Down Parsing

Context-Free Grammar. Concepts Introduced in Chapter 2. Parse Trees. Example Grammar and Derivation

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Wednesday, August 31, Parsers


Part III : Parsing. From Regular to Context-Free Grammars. Deriving a Parser from a Context-Free Grammar. Scanners and Parsers.

Syntax. A. Bellaachia Page: 1

Context-free grammars

Syntax-Directed Translation. Lecture 14

Review of CFGs and Parsing II Bottom-up Parsers. Lecture 5. Review slides 1

Chapter 3. Describing Syntax and Semantics

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages. Context Free Grammars

Building Compilers with Phoenix

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Describing Syntax and Semantics

Compiler Construction: Parsing

CS1622. Today. A Recursive Descent Parser. Preliminaries. Lecture 9 Parsing (4)

EECS 6083 Intro to Parsing Context Free Grammars

CS 406/534 Compiler Construction Parsing Part I

Bottom-Up Parsing. Lecture 11-12

Part 3. Syntax analysis. Syntax analysis 96

Syntax Errors; Static Semantics

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

CS 2210 Sample Midterm. 1. Determine if each of the following claims is true (T) or false (F).

Syntax Analysis Part I

CIT Lecture 5 Context-Free Grammars and Parsing 4/2/2003 1

CMSC 330: Organization of Programming Languages. Context Free Grammars

CMSC 330: Organization of Programming Languages

Compilers. Yannis Smaragdakis, U. Athens (original slides by Sam

Parsing #1. Leonidas Fegaras. CSE 5317/4305 L3: Parsing #1 1

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

Programming Language Definition. Regular Expressions

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

CMSC 330: Organization of Programming Languages. Context Free Grammars

Compilers. Predictive Parsing. Alex Aiken

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Dr. D.M. Akbar Hussain

Part 5 Program Analysis Principles and Techniques

Programming Language Specification and Translation. ICOM 4036 Fall Lecture 3

Chapter 4. Syntax - the form or structure of the expressions, statements, and program units

Specifying Syntax. An English Grammar. Components of a Grammar. Language Specification. Types of Grammars. 1. Terminal symbols or terminals, Σ

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

A Simple Syntax-Directed Translator

Transcription:

Syntax Analysis/Parsing Purpose: determine if tokens have the right form for the language (right syntactic structure) stream of tokens abstract syntax tree (AST) AST: captures hierarchical structure of input program a primary representation of program Susan Eggers 1 CSE 401 Context-free grammars (CFG s) Syntax specified using CFG s capture important structural characteristics Notation for CFG s: Backus Normal (Naur) Form (BNF) set of terminal symbols (tokens from lexical analysis) set of nonterminals (sequences of terminals &/or nonterminals): impose the hierarchical structure set of productions combine terminals & nonterminals nonterminal ::= nonterminals &/or terminals start symbol: nonterminal that denotes the language CFG: set of productions that define a language Susan Eggers 2 CSE 401

BNF description of PL/0 syntax Program ::= module Id ; Block Id. Block ::= DeclList begin StmtList end DeclList ::= { Decl ; } Decl ::= ConstDecl ProcDecl VarDecl ConstDecl ::= const ConstDeclItem {, ConstDeclItem } ConstDeclItem ::=Id : Type = ConstExpr ConstExpr ::= Id Integer VarDecl ::= var VarDeclItem {, VarDeclItem } VarDeclItem ::= Id : Type ProcDecl ::= procedure Id ( [ FormalDecl {, FormalDecl } ] ) ; Block Id FormalDecl ::= Id : Type Type ::= int StmtList ::= { Stmt ; } Stmt ::= CallStmt AssignStmt OutStmt IfStmt WhileStmt CallStmt ::= Id ( [ Exprs ] ) AssignStmt ::= LValue := Expr LValue ::= Id OutStmt ::= output := Expr IfStmt ::= if Test then StmtList end WhileStmt ::= while Test do StmtList end Test ::= odd Sum Sum Relop Sum Relop ::= <= <> < >= > = Exprs ::= Expr {, Expr } Expr ::= Sum Sum ::= Term { (+ -) Term } Term ::= Factor { (* /) Factor } Factor ::= - Factor LValue Integer input ( Expr ) Susan Eggers 3 CSE 401 Context-free grammars vs. Regular Expressions CFG can check everything a RE can but: not need CFG power for lexical analysis REs are a more concise notation for tokens lexical analyzers constructed automatically are more efficient more modular front end RE s not powerful enough for parsing nested constructs recursion Susan Eggers 4 CSE 401

Derivations & Parse Trees Derivation: define the language specified by the grammar sequence of expansion steps, beginning with start symbol, leading to a string of terminals production seen as rewriting rule: nonterminal replaced by the rhs Parsing: inverse of derivation given target string of terminals (tokens), want to recover nonterminals representing structure Can represent derivation as a: parse tree (concrete syntax tree) graphical representation for a derivation keeps the grammar symbols don t record the expansion order abstract syntax tree (AST) simpler representation precedence implied by the hierarchy Susan Eggers 5 CSE 401 Example grammar E ::= E Op E - E ( E ) id Op ::= + - * / (a + b * -c) * d E -> E op E -> E op id -> E * id -> (E) * id -> (E op E) * id -> (E op -E) * id -> (E op -id) * id -> (E * -id) * id -> (E op E * -id) * id -> (E op id * -id) * id -> (E + id * -id) * id -> (id + id * -id) * id Susan Eggers 6 CSE 401

AST s Abstract syntax trees represent only important aspects of concrete syntax trees no need for sign posts like ( ), ;, do, end rest of compiler only cares about abstract structure can regenerate concrete syntax tree from AST when needed Susan Eggers 7 CSE 401 AST representations in project Susan Eggers 8 CSE 401

AST extensions in project Expressions: true and false constants array index expression function call expression and, or operators tests are expressions constant expressions Statements: for statement return stmt if stmt with else array assignment stmt (similar to array index expression) Declarations: procedures with result type var parameters (passed by reference) Types: boolean type array type Susan Eggers 9 CSE 401 Parsing algorithms Given grammar, want to see if an input program can be generated by it check legality produce AST representing structure be efficient Kinds of parsing algorithms: top-down LL(k) grammar bottom-up LR(k) grammar Left to right scan on input Leftmost/Rightmost derivation can see k tokens at once Susan Eggers 10 CSE 401

Top-down parsing Build parse tree for input program from the top (start symbol) down to leaves (terminals) find leftmost derivation for an input string (replace the leftmost nonterminal at each step) create parse tree nodes in preorder Basic issue: when replacing a nonterminal with some rhs, how to pick which rhs? E.g. Stmt ::= Call Assign If While Call ::= Id Assign ::= Id := Expr If ::= if Test then Stmts end if Test then Stmts else Stmts end While ::= while Test do Stmts end Solution: look at input tokens to help decide Susan Eggers 11 CSE 401 Predictive parsing Predictive parser: top-down parser that can select correct rhs looking at at most k input tokens (the look-ahead) Efficient: no backtracking needed linear time to parse Implementation of predictive parsers: table-driven parser like table-driven FSA plus stack to hold productions, recursively recursive descent parser each nonterminal parsed by a procedure call other procedures to parse sub-nonterminals, recursively Susan Eggers 12 CSE 401

LL(k) grammars Can construct predictive parser automatically/easily if grammar is LL(k) Left-to-right scan of input, Leftmost derivation k tokens of look-ahead needed (k 1) to make parsing decisions Some restrictions: no ambiguity >1 parse tree (derivation) for a sentence in the language no common prefixes of length k S ::= if Test then Ss end if Test then Ss else Ss end... no left recursion E ::= E Op E... a few others Restrictions guarantee that, given k input tokens, can always select correct rhs to expand nonterminal Susan Eggers 13 CSE 401 Ambiguity Some grammars are ambiguous: multiple parse trees with same final string produces more than one leftmost or rightmost derivation Structure of parse tree captures much of meaning of program; ambiguity multiple possible meanings for same program Solutions: 1) add meta-rules 2) change the grammar 3) change the language Susan Eggers 14 CSE 401

Famous ambiguities: dangling else Stmt ::=... if Expr then Stmt if Expr then Stmt else Stmt if e 1 then if e 2 then s 1 else s 2 Susan Eggers 15 CSE 401 Resolving the ambiguity Option 1: add a meta-rule e.g. else associates with closest previous then + works + keeps original grammar intact - ad hoc and informal Susan Eggers 16 CSE 401

Resolving the ambiguity Option 2: rewrite the grammar to resolve ambiguity explicitly Stmt ::= MatchedStmt UnmatchedStmt MatchedStmt ::=... if Expr then MatchedStmt else MatchedStmt UnmatchedStmt ::= if Expr then Stmt if Expr then MatchedStmt else UnmatchedStmt + formal, no additional rules beyond syntax - sometimes obscures original grammar Susan Eggers 17 CSE 401 Resolving the ambiguity Option 3: redesign the language to remove the ambiguity Stmt ::=... if Expr then Stmt end if Expr then Stmt else Stmt end extra end required for every if + formal, clear, elegant - changing the language Susan Eggers 18 CSE 401

Another famous ambiguity: expressions E ::= E Op E - E ( E ) id Op ::= + - * / a + b * c Susan Eggers 19 CSE 401 Resolving the ambiguity Option 1: add some meta-rules, e.g. precedence and associativity rules Example: E ::= E Op E - E ( E ) id Op ::= + - * / = < and or operator precedence associativity prefix - highest right *, / left +, - left =, < none and left or lowest left Susan Eggers 20 CSE 401

Resolving the ambiguity Option 2: modify the grammar to explicitly resolve the ambiguity Strategy: create a nonterminal for each precedence level expr is lowest precedence nonterminal each nonterminal can be rewritten with a higher precedence operator highest precedence operator includes terminals at each precedence level, use: left recursion for left-associative operators right recursion for right-associative operators no recursion for non-associative operators Susan Eggers 21 CSE 401 Example Expr ::= Expr0 Expr0 ::= Expr0 or Expr1 Expr1 Expr1 ::= Expr1 and Expr2 Expr2 Expr2 ::= Expr3 (= <) Expr3 Expr3 Expr3 ::= Expr3 (+ -) Expr4 Expr4 Expr4 ::= Expr4 (* /) Expr5 Expr5 Expr5 ::= -Expr5 Expr6 Expr6 ::= id int... (Expr0) Susan Eggers 22 CSE 401

Eliminating common prefixes Can left factor common prefixes to eliminate them create new nonterminal for common prefix and/or different suffixes Before: If ::= if Test then Stmts end if Test then Stmts else Stmts end After: If ::= if Test then Stmts IfCont IfCont ::= end else Stmts end Grammar a bit uglier Easy to do by hand in recursive-descent parser Susan Eggers 23 CSE 401 Eliminating left recursion Can rewrite grammar to eliminate left recursion Before: E ::= E + T T T ::= T * F F F ::= id... After: E ::= T ECont ECont ::= + T ECont ε T ::= F TCont TCont ::= * F TCont ε F ::= id... right-recursive productions Susan Eggers 24 CSE 401

Transition Diagrams Railroad diagrams another more graphical notation for CFGs diagram per nonterminal look like FSAs, where arcs can be labelled with nonterminals as well as terminals if terminal: follow the arc parser gets a new token & compares it with the terminal on the arc if nonterminal: go to new diagram parser calls the procedure for the nonterminal (recursive descent parser) Susan Eggers 25 CSE 401 Table-driven predictive parser Can automatically convert grammar into parsing table PREDICT(nonterminal, input-sym) production selects the right production to take given a nonterminal to expand and the next token of the input Example: stmt ::= if expr then stmt else stmt while expr do stmt begin stmts end stmts ::= stmt ; stmts ε expr ::= id Parsing table if then else while do begin end id ; stmt 1 2 3 stmts 1 1 1 2 expr 1 Susan Eggers 26 CSE 401

Table-driven predictive parser Stack implementation depends on top of stack & current input token initial stack configuration: Start $ top of stack = input token pop terminal, advance to next token top of stack = nonterminal pop nonterminal pick production from parsing table push production top of stack = input token = $ halt Errors: input token not match terminal on top of stack table entry empty Susan Eggers 27 CSE 401 Constructing the parse table Compute FIRST for each rhs FIRST(RHS) = {all tokens that appear first in a derivation of RHS} rhs begins with a terminal: FIRST(RHS) = that terminal rhs begins with a nonterminal: FIRST(RHS) = FIRST(all the nonterminal s productions) rhs begins with ε: FIRST(RHS) = FOLLOW(RHS s LHS) (we re done expanding this rhs) Compute FOLLOW for each nonterminal FOLLOW(X) = {all tokens that can appear after a derivation of X} after the nonterminal is a terminal: FOLLOW(X) = that terminal after the nonterminal is a nonterminal: FOLLOW(X) = FIRST(nonterminal) after the nonterminal is ε: FOLLOW(X) = FOLLOW(LHS) (we re done expanding this rhs) Susan Eggers 28 CSE 401

Constructing PREDICT table FIRST FOLLOW S ::= if E then S else S while E do S begin Ss end Ss ::= S ; Ss ε E ::= id if then else while do begin end id ; stmt 1 2 3 stmts 1 1 1 2 expr 1 Susan Eggers 29 CSE 401 Constructing PREDICT table Start ::= S $ S ::= id If If ::= if E then S IfCont IfCont ::= else S ε Start ::= S $ S ::= id ::= If If ::= if E then S IfCont IfCont ::= else S ::= ε FIRST FOLLOW id if then else $ Start S If IfCont Susan Eggers 30 CSE 401

Another example S ::= E $ E ::= T E E ::= (+ -) T E ε T ::= F T T ::= (* /) F T ε F ::= - F id ( E ) FIRST (RHS) FOLLOW (X) S ::= E $ E ::= T E E ::= (+ -) T E ε T ::= F T T ::= (* /) F T ε F ::= - F id ( E ) Susan Eggers 31 CSE 401 PREDICT and LL(1) If PREDICT table has at most one entry in each cell, then grammar is LL(1) always exactly one right choice fast to parse and easy to implement LL(1) each column labelled by 1 token Can have multiple entries in each cell common prefixes left recursion ambiguity Susan Eggers 32 CSE 401

Recursive descent parsers Write subroutine for each non-terminal each subroutine first selects correct r.h.s. by peeking at input tokens then consume r.h.s. if terminal symbol, verify that it s next & then advance if nonterminal, call corresponding subroutine construct & return AST representing r.h.s. PL/0 parser is recursive descent PL/0 scanner routines: Token* Get(); Token* Peek(); Token* Read(SYMBOL expected_kind); bool CondRead(SYMBOL expected_kind); Susan Eggers 33 CSE 401 Example ParseExpr => ParseSum => ParseTerm => ParseFactor Sum ::= Term { (+ -) Term } Term ::= Factor { (* /) Factor } Expr* Parser::ParseSum() { Expr* expr = ParseTerm(); for (;;) { Token* t = scanner->peek(); if (t->kind() == PLUS t->kind() == MINUS) {scanner->get();...} else { break; } } } Expr* Parser::ParseTerm() { Expr* expr = ParseFactor(); for (;;) { Token* t = scanner->peek(); if (t->kind() == MUL t->kind() == DIVIDE) {scanner->get();...} else { break; } } } Susan Eggers 34 CSE 401

Example If ::= if Expr then Stmt [else Stmt] end ; IfStmt* Parser::ParseIfStmt() { scanner->read(if); } Expr* test = ParseExpr(); scanner->read(then); Stmt* then_stmt = ParseStmt(); Stmt* else_stmt; if (scanner->condread(else)) { else_stmt = ParseStmt(); } else { else_stmt = NULL; } scanner->read(semicolon); return new IfStmt(test, then_stmt, else_stmt); Susan Eggers 35 CSE 401 Example Stmt ::= IdStmt ; IdStmt ::= CallStmt AssignStmt; CallStmt ::= IDENT ( Expr ) AssignStmt ::= IDENT := Expr Stmt* Parser::ParseIdentStmt() { Token* t = scanner->read(ident); }... if (scanner->condread(lparen)) { // call stmt: parse argument list... args = ParseExprs(); scanner->read(rparen);... } else { // assign stmt: parse the rest... scanner->read(gets); Expr* expr = ParseExpr();... } Susan Eggers 36 CSE 401

Yacc yacc: yet another compiler-compiler Input: grammar possibly augmented with action code Output: C functions to parse grammar and perform actions LALR parser generator practical bottom-up parser more powerful than LL(1) used for parser generators yacc++, bison, byacc are modern updates of yacc Susan Eggers 37 CSE 401 Yacc input grammar Example declaration: %{ #include <stdio.h> %} %token INTEGER Example grammar productions: %% assignstmt: IDENT GETS expr ; ifstmt: IF test THEN stmts END IF test THEN stmts ELSE stmts END ; expr: term expr '+' term expr '-' term ; factor: '-' factor IDENT INTEGER INPUT '(' expr ')' ; %% Susan Eggers 38 CSE 401

Yacc with semantic actions Example grammar productions: assignstmt: IDENT GETS expr { $$ = new AssignStmt($1, $3); } ; ifstmt: IF test THEN stmtlist END { $$ = new IfStmt($2, $4); } IF test THEN stmts ELSE stmts END { $$ = new IfElseStmt($2, $4, $6); } ; expr: term { $$ = $1; } expr '+' term { $$ = new BinOp(PLUS, $1, $3); } expr '-' term { $$ = new BinOp(MINUS, $1, $3); } ; factor: '-' factor { $$ = new UnOp(MINUS, $2); } IDENT { $$ = new VarRef($1); } INTEGER { $$ = new IntLiteral($1); } INPUT { $$ = new InputExpr; } '(' expr ')' { $$ = $2; } ; Susan Eggers 39 CSE 401 Error handling How to handle syntax error: error recovery Option 1: quit compilation PL/0 + easy - inconvenient for programmer Option 2: do more before quit + try to catch as many errors as possible on one compile - avoid streams of spurious errors Option 3: error correction + fix syntax errors as part of compilation - hard!! Susan Eggers 40 CSE 401

Panic mode error recovery (option 2) When find a syntax error, skip tokens until reach a sign post sign posts in PL/0: end, ;, ), then, do,... once a sign post is found, will have gotten back on track + simple - not catch errors in program text that is skipped In top-down parser, maintain set of sign post tokens as recursive descent proceeds sign post selected from terminals later in production as parsing proceeds, set of sign posts will change, depending on the parsing context Susan Eggers 41 CSE 401