COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Overview
- Tokens and regular expressions
- Syntax and context-free grammars
- Grammar derivations
- More about parse trees
- Top-down and bottom-up parsing
- Recursive descent parsing
COP4020 Fall 2016

Tokens
- Tokens are the basic building blocks of a programming language
  - Keywords, identifiers, literal values, operators, punctuation
- We saw that the first compiler phase (scanning) splits up a character stream into tokens
- Tokens have a special role with respect to:
  - Free-format languages: the source program is a sequence of tokens and the horizontal/vertical position of a token on a page is unimportant (e.g. Pascal)
  - Fixed-format languages: indentation and/or position of a token on a page is significant (early Basic, Fortran, Haskell)
  - Case-sensitive languages: upper- and lowercase are distinct (C, C++, Java)
  - Case-insensitive languages: upper- and lowercase are identical (Ada, Fortran, Pascal)

Defining Token Patterns with Regular Expressions
- The makeup of a token is described by a regular expression (RE)
- A regular expression r is one of
  - A character (an element of the alphabet Σ), e.g. a where Σ = {a, b, c}
  - Empty, denoted by ε
  - Concatenation: a sequence of regular expressions, r1 r2 r3 … rn
  - Alternation: regular expressions separated by a bar, r1 | r2
  - Repetition: a regular expression followed by a star (Kleene star), r*

Example Regular Definitions for Tokens
- digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- unsigned_integer → digit digit*
- signed_integer → (+ | - | ε) unsigned_integer
- relop → < | <= | <> | > | >= | =
- letter → a | b | … | z | A | B | … | Z
- id → letter (letter | digit)*
- Cannot use recursive definitions! For example, this is not allowed:
    digits → digit digits | digit
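The regular definitions above can be tried out directly. The following is a minimal sketch in Python's `re` syntax; the constant names and the helper `matches` are introduced here for illustration and are not part of the slides.

```python
import re

# Regular definitions from the slide, transcribed into Python's re syntax.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"
SIGNED_INTEGER = rf"[+-]?{UNSIGNED_INTEGER}"   # (+ | - | ε) unsigned_integer
RELOP = r"<=|<>|>=|<|>|="                      # longer alternatives listed first
ID = rf"{LETTER}(?:{LETTER}|{DIGIT})*"


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of text is described by pattern."""
    return re.fullmatch(pattern, text) is not None
```

For instance, `matches(ID, "x1")` holds while `matches(ID, "9lives")` does not, because an id must start with a letter. Each definition only refers to definitions given before it, which is why no recursion is needed.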

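Finite State Machines = Regular Expression Recognizers

relop → < | <= | <> | > | >= | =
  state 0 (start): on < go to 1; on = go to 5; on > go to 6
  state 1: on = go to 2; on > go to 3; on other go to 4
  state 2: return(relop, LE)
  state 3: return(relop, NE)
  state 4*: return(relop, LT)
  state 5: return(relop, EQ)
  state 6: on = go to 7; on other go to 8
  state 7: return(relop, GE)
  state 8*: return(relop, GT)

id → letter ( letter | digit )*
  state 9 (start): on letter go to 10
  state 10: on letter or digit stay in 10; on other go to 11
  state 11*: return(gettoken(), install_id())

(States marked * are the ones entered on an "other" edge.)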
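The relop machine can be hand-coded as nested tests on the next one or two characters. This is a sketch only: the function name `scan_relop` and the `(attribute, next_position)` return convention are assumptions, and a starred state is taken to mean that the "other" character is not consumed, which is the usual reading of such diagrams.

```python
def scan_relop(text: str, pos: int = 0):
    """Recognize a relational operator starting at text[pos].

    Returns (attribute, next_pos), or None when no relop starts at pos.
    State numbers in the comments follow the slide's diagram.
    """
    def peek(i: int) -> str:
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)
    if c == "<":                        # state 0 -> 1
        d = peek(pos + 1)
        if d == "=":
            return ("LE", pos + 2)      # state 2
        if d == ">":
            return ("NE", pos + 2)      # state 3
        return ("LT", pos + 1)          # state 4*: 'other' is left unread
    if c == "=":
        return ("EQ", pos + 1)          # state 5
    if c == ">":                        # state 0 -> 6
        if peek(pos + 1) == "=":
            return ("GE", pos + 2)      # state 7
        return ("GT", pos + 1)          # state 8*: 'other' is left unread
    return None
```

Calling `scan_relop("x<>y", 1)` yields `("NE", 3)`, leaving position 3 as the place where scanning of the next token resumes.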

Non-Deterministic Finite State Automata
- An NFA is a 5-tuple (S, Σ, δ, s0, F) where
  - S is a finite set of states
  - Σ is a finite set of symbols, the alphabet
  - δ is a mapping from S × Σ to a set of states
  - s0 ∈ S is the start state
  - F ⊆ S is the set of accepting (or final) states
- Example: S = {0,1,2,3}, Σ = {a,b}, s0 = 0, F = {3}
  [Figure: transition diagram over states 0, 1, 2, 3 with edges labeled a and b]
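An NFA can be simulated by tracking the set of states it could be in after each input symbol. The sketch below uses the slide's S, Σ, s0 and F; the transition table `DELTA` is an assumption, since the diagram itself is not reproduced here. It is filled in with the familiar NFA for (a|b)*abb, which has the same states and final state.

```python
# Transition function δ: (state, symbol) -> set of next states.
# ASSUMPTION: these edges are the well-known NFA for (a|b)*abb; only
# S, Σ, s0 and F are given in the text of the slide.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
FINAL = {3}


def nfa_accepts(word: str) -> bool:
    """Simulate the NFA on word by following all possible moves at once."""
    current = {START}
    for symbol in word:
        # Union of δ(s, symbol) over every state s we might currently be in.
        current = set().union(*(DELTA.get((s, symbol), set()) for s in current))
    return bool(current & FINAL)
```

Because δ maps to a set of states, a missing entry simply contributes the empty set, and the word is accepted when at least one reachable state is in F.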

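From a Regular Expression to an NFA
- ε: start state i with an ε-edge to final state f
- a: start state i with an a-edge to final state f
- r1 | r2: new start state i with ε-edges into N(r1) and N(r2), and ε-edges from the ends of N(r1) and N(r2) to a new final state f
- r1 r2: start state i, N(r1) followed directly by N(r2), final state f
- r*: new start state i with an ε-edge into N(r), an ε-edge from the end of N(r) to a new final state f, an ε-edge from the end of N(r) back to its start, and an ε-edge from i directly to f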

Grammars
- A context-free grammar has four components:
  - T is a finite set of tokens (terminal symbols)
  - N is a finite set of nonterminals
  - P is a finite set of productions of the form α → β where α ∈ (N ∪ T)* N (N ∪ T)* and β ∈ (N ∪ T)* (in a context-free grammar, α is restricted to a single nonterminal)
  - S ∈ N is a designated start symbol

Context Free Grammars: BNF Notation
- Regular expressions cannot describe nested constructs, but context-free grammars can
- Backus-Naur Form (BNF) notation for productions:
    <nonterminal> ::= sequence of (non)terminals
  where
  - Each terminal in the grammar is also a token
  - A <nonterminal> defines a syntactic category
  - The symbol | denotes alternative forms in a production
  - The special symbol ε denotes empty

Example
<Program> ::= program <id> ( <id> <More_ids> ) ; <Block> .
<Block> ::= <Variables> begin <Stmt> <More_Stmts> end
<More_ids> ::= , <id> <More_ids>
  | ε
<Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables>
  | ε
<More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>
  | ε
<Stmt> ::= <id> := <Exp>
  | if <Exp> then <Stmt> else <Stmt>
  | while <Exp> do <Stmt>
  | begin <Stmt> <More_Stmts> end
<More_Stmts> ::= ; <Stmt> <More_Stmts>
  | ε
<Exp> ::= <num>
  | <id>
  | <Exp> + <Exp>
  | <Exp> - <Exp>

Extended BNF
- Extended BNF simplifies grammar definitions
- Extended BNF adds
  - Optional constructs with [ and ]
  - Repetitions with [ ]*
  - Some EBNF definitions also add [ ]+ for non-zero repetitions
- Any extended BNF grammar can be rewritten into BNF

Example
<Program> ::= program <id> ( <id> [ , <id> ]* ) ; <Block> .
<Block> ::= [ <Variables> ] begin <Stmt> [ ; <Stmt> ]* end
<Variables> ::= var [ <id> [ , <id> ]* : <Type> ; ]+
<Stmt> ::= <id> := <Exp>
  | if <Exp> then <Stmt> else <Stmt>
  | while <Exp> do <Stmt>
  | begin <Stmt> [ ; <Stmt> ]* end
<Exp> ::= <num>
  | <id>
  | <Exp> + <Exp>
  | <Exp> - <Exp>

Derivations
- From a grammar we can derive strings (= sequences of tokens)
  - The opposite process of parsing
- Starting with the grammar's designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal

Example Derivation
<expression> ::= identifier
  | unsigned_integer
  | - <expression>
  | ( <expression> )
  | <expression> <operator> <expression>
<operator> ::= + | - | * | /

<expression>    (start symbol)
⇒ <expression> <operator> <expression>
⇒ <expression> <operator> identifier
⇒ <expression> + identifier
⇒ <expression> <operator> <expression> + identifier
⇒ <expression> <operator> identifier + identifier
⇒ <expression> * identifier + identifier
⇒ identifier * identifier + identifier    (the final string is the yield)

- Each line is a sentential form
- Each step replaces a nonterminal with one of its productions

Rightmost versus Leftmost Derivations
- When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step, the derivation is called right-most (left-most)
- Rightmost derivation (the rightmost nonterminal is replaced in each step):
    <expression>
    ⇒ <expression> <operator> <expression>
    ⇒ <expression> <operator> identifier
- Leftmost derivation (the leftmost nonterminal is replaced in each step):
    <expression>
    ⇒ <expression> <operator> <expression>
    ⇒ identifier <operator> <expression>

A Language Generated by a Grammar
- A context-free grammar is a generator of a context-free language
- The language defined by a grammar G is the set of all strings w that can be derived from the start symbol S:
    L(G) = { w | S ⇒* w }
- Example:
    <S> ::= a | ( <S> )
    L(G) = { a, (a), ((a)), (((a))), … }
- Example:
    <S> ::= <C> | <C> + <C>
    <C> ::= 0 | 1
    L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }
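The definition L(G) = { w | S ⇒* w } can be made concrete by enumerating derivations. The sketch below performs a breadth-first search over leftmost derivations; the dictionary encoding of the grammar and the function name `language` are assumptions made for illustration. The length bound is only a valid way to cut off the search for grammars without ε-productions, which holds for both examples.

```python
from collections import deque

# Grammar: <S> ::= <C> | <C> + <C>,  <C> ::= 0 | 1
GRAMMAR = {
    "S": [["C"], ["C", "+", "C"]],
    "C": [["0"], ["1"]],
}


def language(grammar, start="S", max_len=5):
    """Collect every terminal string of at most max_len symbols derivable from start."""
    seen, result = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Index of the leftmost nonterminal in this sentential form, if any.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            result.add("".join(form))       # only terminals left: a sentence
            continue
        for rhs in grammar[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            # Without ε-productions a form never shrinks, so long forms can be dropped.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return result
```

With the first example grammar, `language({"S": [["a"], ["(", "S", ")"]]})` returns only the strings up to five symbols long, since that language is infinite.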

Parse Trees
- A parse tree depicts the end result of a derivation
  - The internal nodes are the nonterminals
  - The children of a node are the symbols (terminals and nonterminals) on a right-hand side of a production
  - The leaves are the terminals
- Parse tree for identifier * identifier + identifier (children are indented under their parent):
    <expression>
      <expression>
        <expression>: identifier
        <operator>: *
        <expression>: identifier
      <operator>: +
      <expression>: identifier

Parse Trees
The derivation
    <expression>
    ⇒ <expression> <operator> <expression>
    ⇒ <expression> <operator> identifier
    ⇒ <expression> + identifier
    ⇒ <expression> <operator> <expression> + identifier
    ⇒ <expression> <operator> identifier + identifier
    ⇒ <expression> * identifier + identifier
    ⇒ identifier * identifier + identifier
builds this parse tree:
    <expression>
      <expression>
        <expression>: identifier
        <operator>: *
        <expression>: identifier
      <operator>: +
      <expression>: identifier

Ambiguity
- There is another parse tree for the same grammar and input: the grammar is ambiguous
- This parse tree is not desired, since it appears that + has precedence over *
    <expression>
      <expression>: identifier
      <operator>: *
      <expression>
        <expression>: identifier
        <operator>: +
        <expression>: identifier

Ambiguous Grammars
- Ambiguous grammar: more than one distinct derivation of a string results in different parse trees
- A programming language construct should have only one parse tree to avoid misinterpretation by a compiler
- For expression grammars, associativity and precedence of operators are used to disambiguate:
    <expression> ::= <term> | <expression> <add_op> <term>
    <term> ::= <factor> | <term> <mult_op> <factor>
    <factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )
    <add_op> ::= + | -
    <mult_op> ::= * | /

Ambiguous if-then-else: the Dangling Else
- A classical example of an ambiguous grammar is the set of productions for if-then-else:
    <stmt> ::= if <expr> then <stmt>
      | if <expr> then <stmt> else <stmt>
- It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design
- Ada uses different syntax to avoid ambiguity:
    <stmt> ::= if <expr> then <stmt> end if
      | if <expr> then <stmt> else <stmt> end if

Linear-Time Top-Down and Bottom-Up Parsing
- A parser is a recognizer for a context-free language
- A string (token sequence) is accepted by the parser and a parse tree can be constructed if the string is in the language
- For an arbitrary context-free grammar, parsing can take as much as O(n^3) time, where n is the size of the input
- There are large classes of grammars for which we can construct parsers that take O(n) time:
  - Top-down LL parsers for LL grammars (LL = Left-to-right scanning of input, Left-most derivation)
  - Bottom-up LR parsers for LR grammars (LR = Left-to-right scanning of input, Right-most derivation)

Top-Down Parsers and LL Grammars
- A top-down parser is a parser for the LL class of grammars
  - Also called a predictive parser
  - The LL class is a strict subset of the larger LR class of grammars
  - LL grammars cannot contain left-recursive productions (but LR can), for example the directly left-recursive
      <X> ::= <X> <Y>
    and the indirectly left-recursive
      <X> ::= <Y> <Z>
      <Y> ::= <X>
  - LL(k), where k is the lookahead depth: if k=1 the parser cannot handle alternatives in productions with common prefixes, such as
      <X> ::= a b | a c
- A top-down parser constructs a parse tree from the root down
- It is not too difficult to implement a predictive parser for an unambiguous LL(1) grammar in BNF by hand using recursive descent

Top-Down Parser in Action
<id_list> ::= id <id_list_tail>
<id_list_tail> ::= , id <id_list_tail>
  | ;
[Figure: four stages of the parse tree growing from the root down over the input A, B, C;]

Top-Down Predictive Parsing
- Top-down parsing is called predictive parsing because the parser predicts what it is going to see:
  1. As root, the start symbol of the grammar <id_list> is predicted
  2. After reading A the parser predicts that <id_list_tail> must follow
  3. After reading , and B the parser predicts that <id_list_tail> must follow
  4. After reading , and C the parser predicts that <id_list_tail> must follow
  5. After reading ; the parser stops
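The prediction steps above map directly onto two mutually dependent functions, one per nonterminal. The following Python sketch assumes the input has already been split into a token list, and it treats any token other than "," and ";" as an id; both are simplifications for illustration, as is the name `parse_id_list`.

```python
def parse_id_list(tokens):
    """Predictive parse of <id_list>; returns the identifiers in order.

    tokens is a list such as ["A", ",", "B", ",", "C", ";"].
    """
    pos = 0
    ids = []

    def lookahead():
        return tokens[pos] if pos < len(tokens) else None

    def match_id():
        nonlocal pos
        tok = lookahead()
        if tok is None or tok in (",", ";"):
            raise SyntaxError(f"expected id at position {pos}")
        ids.append(tok)
        pos += 1

    def id_list_tail():
        nonlocal pos
        if lookahead() == ",":        # <id_list_tail> ::= , id <id_list_tail>
            pos += 1
            match_id()
            id_list_tail()
        elif lookahead() == ";":      # <id_list_tail> ::= ;
            pos += 1
        else:
            raise SyntaxError(f"expected ',' or ';' at position {pos}")

    match_id()                        # <id_list> ::= id <id_list_tail>
    id_list_tail()
    if pos != len(tokens):
        raise SyntaxError("trailing input after ';'")
    return ids
```

One token of lookahead is enough here because the two alternatives of `<id_list_tail>` start with different tokens.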

An Ambiguous Non-LL Grammar for Language E
- Consider a language E of simple expressions composed of +, -, *, /, (), id, and num
    <expr> ::= <expr> + <expr>
      | <expr> - <expr>
      | <expr> * <expr>
      | <expr> / <expr>
      | ( <expr> )
      | <id>
      | <num>
- Need operator precedence rules

An Unambiguous Non-LL Grammar for Language E
<expr> ::= <expr> + <term>
  | <expr> - <term>
  | <term>
<term> ::= <term> * <factor>
  | <term> / <factor>
  | <factor>
<factor> ::= ( <expr> )
  | <id>
  | <num>

An Unambiguous LL(1) Grammar for Language E
<expr> ::= <term> <term_tail>
<term> ::= <factor> <factor_tail>
<term_tail> ::= <add_op> <term> <term_tail>
  | ε
<factor> ::= ( <expr> )
  | <id>
  | <num>
<factor_tail> ::= <mult_op> <factor> <factor_tail>
  | ε
<add_op> ::= + | -
<mult_op> ::= * | /

Constructing Recursive Descent Parsers for LL(1)
- Each nonterminal has a function that implements the production(s) for that nonterminal
- The function parses only the part of the input described by the nonterminal

    <expr> ::= <term> <term_tail>

    procedure expr()
      term(); term_tail();

- When more than one alternative production exists for a nonterminal, the lookahead token should help to decide which production to apply

    <term_tail> ::= <add_op> <term> <term_tail>
      | ε

    procedure term_tail()
      case (input_token())
        of '+' or '-': add_op(); term(); term_tail();
        otherwise: /* no op = ε */

Some Rules to Construct a Recursive Descent Parser
- For every nonterminal with more than one production, find all the tokens that each of the right-hand sides can start with (called the FIRST set):
    <X> ::= a            starts with a
      | b a <Z>          starts with b
      | <Y>              starts with c or d
      | <Z> f            starts with e or f
    <Y> ::= c | d
    <Z> ::= e | ε
- Empty productions are coded as skip operations (nops)
- If a nonterminal does not have an empty production, the function should generate an error if no token matches
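The FIRST sets listed above can also be computed mechanically. The sketch below is a standard fixed-point iteration written for the slide's small grammar; the data layout and the names `first_of_sequence` and `compute_first` are assumptions introduced for this example.

```python
EPS = "ε"

# Grammar from the slide: lowercase names are tokens, X/Y/Z are nonterminals.
GRAMMAR = {
    "X": [["a"], ["b", "a", "Z"], ["Y"], ["Z", "f"]],
    "Y": [["c"], ["d"]],
    "Z": [["e"], []],                 # [] is the empty production ε
}


def first_of_sequence(seq, first):
    """FIRST of a symbol sequence, given the FIRST table built so far."""
    out = set()
    for sym in seq:
        if sym not in GRAMMAR:        # a terminal starts the string
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:     # this nonterminal cannot vanish
            return out
    out.add(EPS)                      # every symbol in seq can derive ε
    return out


def compute_first():
    """Iterate until no FIRST set of a nonterminal grows any further."""
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first
```

The right-hand side `<Z> f` shows why ε matters: because `<Z>` can be empty, the token `f` also belongs to the set of tokens that alternative can start with.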

Example for E

procedure expr()
  term(); term_tail();

procedure term_tail()
  case (input_token())
    of '+' or '-': add_op(); term(); term_tail();
    otherwise: /* no op = ε */

procedure term()
  factor(); factor_tail();

procedure factor_tail()
  case (input_token())
    of '*' or '/': mult_op(); factor(); factor_tail();
    otherwise: /* no op = ε */

procedure factor()
  case (input_token())
    of '(': match('('); expr(); match(')');
    of identifier: match(identifier);
    of number: match(number);
    otherwise: error;

procedure add_op()
  case (input_token())
    of '+': match('+');
    of '-': match('-');
    otherwise: error;

procedure mult_op()
  case (input_token())
    of '*': match('*');
    of '/': match('/');
    otherwise: error;
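The procedures above translate almost line for line into a runnable recognizer. The following Python version is a sketch under stated assumptions: the tokenizer, the `EOF` end marker, the `Parser` class and the `accepts` wrapper are additions for illustration, while the parsing methods mirror the slide's procedures.

```python
import re


def tokenize(text):
    """Split text into token kinds: 'num', 'id', or the symbol itself."""
    kinds = []
    for lexeme in re.findall(r"\d+|[A-Za-z]\w*|\S", text):
        if lexeme.isdigit():
            kinds.append("num")
        elif lexeme[0].isalpha():
            kinds.append("id")
        else:
            kinds.append(lexeme)
    kinds.append("EOF")                  # end-of-input marker (an assumption)
    return kinds


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def input_token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.input_token() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.input_token()!r}")
        self.pos += 1

    def expr(self):                      # <expr> ::= <term> <term_tail>
        self.term()
        self.term_tail()

    def term_tail(self):                 # <add_op> <term> <term_tail> | ε
        if self.input_token() in ("+", "-"):
            self.add_op()
            self.term()
            self.term_tail()

    def term(self):                      # <term> ::= <factor> <factor_tail>
        self.factor()
        self.factor_tail()

    def factor_tail(self):               # <mult_op> <factor> <factor_tail> | ε
        if self.input_token() in ("*", "/"):
            self.mult_op()
            self.factor()
            self.factor_tail()

    def factor(self):                    # ( <expr> ) | <id> | <num>
        tok = self.input_token()
        if tok == "(":
            self.match("(")
            self.expr()
            self.match(")")
        elif tok in ("id", "num"):
            self.match(tok)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    def add_op(self):
        tok = self.input_token()
        if tok not in ("+", "-"):
            raise SyntaxError(f"expected '+' or '-', found {tok!r}")
        self.match(tok)

    def mult_op(self):
        tok = self.input_token()
        if tok not in ("*", "/"):
            raise SyntaxError(f"expected '*' or '/', found {tok!r}")
        self.match(tok)


def accepts(text):
    """Return True when text is a sentence of language E."""
    parser = Parser(text)
    try:
        parser.expr()
        parser.match("EOF")              # the whole input must be consumed
    except SyntaxError:
        return False
    return True
```

The final `match("EOF")` is what rejects inputs such as `1 2`, where a valid expression is followed by leftover tokens.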

Recursive Descent Parser's Call Graph = Parse Tree
- The dynamic call graph of a recursive descent parser corresponds exactly to the parse tree
- [Figure: call graph for the input string 1+2*3]

Example
<type> ::= <simple>
  | ^ id
  | array [ <simple> ] of <type>
<simple> ::= integer
  | char
  | num dotdot num

Example (cont'd)
The FIRST set of each alternative:
<type> ::= <simple>                     {integer, char, num}
  | ^ id                                {^}
  | array [ <simple> ] of <type>        {array}
<simple> ::= integer                    {integer}
  | char                                {char}
  | num dotdot num                      {num}

Example (cont'd)

procedure match(t : token)
  if input_token() = t then
    nexttoken();
  else error;

procedure type()
  case (input_token())
    of 'integer' or 'char' or 'num': simple();
    of '^': match('^'); match(id);
    of 'array': match('array'); match('['); simple(); match(']'); match('of'); type();
    otherwise: error;

procedure simple()
  case (input_token())
    of 'integer': match('integer');
    of 'char': match('char');
    of 'num': match('num'); match('dotdot'); match('num');
    otherwise: error;

Parsing the input: array [ num dotdot num ] of integer
Each step checks the lookahead token and calls match; calls are listed under the procedure that makes them.

Step 1: type() calls match('array')
Step 2: type() calls match('[')
Step 3: type() calls simple(), which calls match('num')
Step 4: simple() calls match('dotdot')
Step 5: simple() calls match('num')
Step 6: type() calls match(']')
Step 7: type() calls match('of')
Step 8: type() calls type(), which calls simple(), which calls match('integer')

Call tree after step 8:
    type()
      match('array')
      match('[')
      simple()
        match('num')
        match('dotdot')
        match('num')
      match(']')
      match('of')
      type()
        simple()
          match('integer')
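The eight steps can be replayed by a small Python version of type() and simple() that records each match call. This is a sketch: the input is assumed to be a pre-split token list, and `parse_type`, `type_` and the `trace` list are names introduced for illustration.

```python
def parse_type(tokens):
    """Parse tokens as a <type> and return the sequence of match() calls made."""
    trace = []
    pos = 0

    def input_token():
        return tokens[pos] if pos < len(tokens) else None

    def match(t):
        nonlocal pos
        if input_token() != t:
            raise SyntaxError(f"expected {t!r}, found {input_token()!r}")
        trace.append(f"match({t})")
        pos += 1                          # nexttoken()

    def type_():
        tok = input_token()
        if tok in ("integer", "char", "num"):
            simple()
        elif tok == "^":
            match("^")
            match("id")
        elif tok == "array":
            match("array")
            match("[")
            simple()
            match("]")
            match("of")
            type_()
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    def simple():
        tok = input_token()
        if tok in ("integer", "char"):
            match(tok)
        elif tok == "num":
            match("num")
            match("dotdot")
            match("num")
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    type_()
    if pos != len(tokens):
        raise SyntaxError("unexpected input after <type>")
    return trace
```

Running it on `"array [ num dotdot num ] of integer".split()` produces the eight match calls in the same order as the steps.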

Bottom-Up LR Parsing
- A bottom-up parser is a parser for the LR class of grammars
- Also called shift-reduce parsing: shift tokens on a stack until the parser recognizes a right-hand side of a production, which it then reduces to a left-hand side (nonterminal) to form a partial parse tree
- Difficult to implement by hand
- Tools (e.g. Yacc/Bison) exist that generate bottom-up parsers for LALR grammars automatically

Shift-Reduce Parsing
Grammar:
    <S> → a <A> e
    <A> → <A> b c | b | d

Reducing a sentence (each step replaces a substring that matches a production right-hand side):
    a b b c d e
    a <A> b c d e
    a <A> d e
    a <A> e
    <S>

Shift-reduce corresponds to a rightmost derivation:
    <S> ⇒rm a <A> e ⇒rm a <A> d e ⇒rm a <A> b c d e ⇒rm a b b c d e

[Figure: the partial parse trees over a b b c d e after each reduction]

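Model of an LR Parser
[Figure: the LR parsing program (driver) reads the input a1 a2 … ai … an $, maintains a stack s0 … X(m-1) s(m-1) Xm sm with state sm on top, and produces the output. It consults a table with an action part (shift, reduce, accept, error) and a goto part (DFA), constructed with LR(1) or LALR(1).]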

Bottom-Up Parser in Action
<id_list> ::= id <id_list_tail>
<id_list_tail> ::= , id <id_list_tail>
  | ;

Input: A, B, C;
The stack contains the shifted input (DFA states not shown); the partial parse tree drawn next to each step is not reproduced.

action    stack
Shift     A
Shift     A,
Shift     A,B
Shift     A,B,
Shift     A,B,C
Shift     A,B,C;      (; matches a production right-hand side)
Reduce    A,B,C       (the top of the stack matches a production right-hand side)
Reduce    A,B         (the top of the stack matches a production right-hand side)
Reduce    A           (the top of the stack matches a production right-hand side)
Reduce    (empty)
Accept

LR(1) Table
- s# means shift and goto state #
- r# means reduce by production #
- acc means accept and stop

Grammar:
1. S' → S
2. S → L = R
3. S → R
4. L → * R
5. L → id
6. R → L

         char on input                goto state
state    id     *      =      $       S     L     R
0        s5     s4                    1     2     3
1                             acc
2                      s6     r6
3                             r3
4        s5     s4                          8     7
5                      r5     r5
6        s12    s11                         10    9
7                      r4     r4
8                      r6     r6
9                             r2
10                            r6
11       s12    s11                         10    13
12                            r5
13                            r4
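A table like this is all the LR driver needs. The sketch below encodes the action and goto entries as Python dictionaries and runs the standard shift-reduce loop; the dictionary layout, the function name `lr_parse`, and the choice to return the list of production numbers used are assumptions made for illustration. The stack holds DFA states only.

```python
# Productions as numbered on the slide: (left-hand side, length of right-hand side).
PRODUCTIONS = {
    1: ("S'", 1),   # S' -> S
    2: ("S", 3),    # S  -> L = R
    3: ("S", 1),    # S  -> R
    4: ("L", 2),    # L  -> * R
    5: ("L", 1),    # L  -> id
    6: ("R", 1),    # R  -> L
}

ACTION = {
    0: {"id": "s5", "*": "s4"},
    1: {"$": "acc"},
    2: {"=": "s6", "$": "r6"},
    3: {"$": "r3"},
    4: {"id": "s5", "*": "s4"},
    5: {"=": "r5", "$": "r5"},
    6: {"id": "s12", "*": "s11"},
    7: {"=": "r4", "$": "r4"},
    8: {"=": "r6", "$": "r6"},
    9: {"$": "r2"},
    10: {"$": "r6"},
    11: {"id": "s12", "*": "s11"},
    12: {"$": "r5"},
    13: {"$": "r4"},
}

GOTO = {
    0: {"S": 1, "L": 2, "R": 3},
    4: {"L": 8, "R": 7},
    6: {"L": 10, "R": 9},
    11: {"L": 10, "R": 13},
}


def lr_parse(tokens):
    """Return the production numbers in reduction order, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]         # append the end marker
    stack = [0]                           # start in state 0
    reductions = []
    i = 0
    while True:
        state, tok = stack[-1], tokens[i]
        act = ACTION[state].get(tok)
        if act is None:                   # blank table entry means error
            raise SyntaxError(f"unexpected {tok!r} in state {state}")
        if act == "acc":
            return reductions
        if act[0] == "s":                 # shift: push the new state, advance input
            stack.append(int(act[1:]))
            i += 1
        else:                             # reduce by production number act[1:]
            prod = int(act[1:])
            lhs, rhs_len = PRODUCTIONS[prod]
            del stack[-rhs_len:]          # pop one state per right-hand-side symbol
            stack.append(GOTO[stack[-1]][lhs])
            reductions.append(prod)
```

For the input `id = id`, the reductions come out as 5, 5, 6, 2, which read in reverse is the rightmost derivation S ⇒ L = R ⇒ L = L ⇒ L = id ⇒ id = id.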

The set of LR(1) items constructed for the grammar

I0:   [S' → •S, $]         goto(I0,S)=I1
      [S → •L=R, $]        goto(I0,L)=I2
      [S → •R, $]          goto(I0,R)=I3
      [L → •*R, =/$]       goto(I0,*)=I4
      [L → •id, =/$]       goto(I0,id)=I5
      [R → •L, $]          goto(I0,L)=I2
I1:   [S' → S•, $]
I2:   [S → L•=R, $]        goto(I2,=)=I6
      [R → L•, $]
I3:   [S → R•, $]
I4:   [L → *•R, =/$]       goto(I4,R)=I7
      [R → •L, =/$]        goto(I4,L)=I8
      [L → •*R, =/$]       goto(I4,*)=I4
      [L → •id, =/$]       goto(I4,id)=I5
I5:   [L → id•, =/$]
I6:   [S → L=•R, $]        goto(I6,R)=I9
      [R → •L, $]          goto(I6,L)=I10
      [L → •*R, $]         goto(I6,*)=I11
      [L → •id, $]         goto(I6,id)=I12
I7:   [L → *R•, =/$]
I8:   [R → L•, =/$]
I9:   [S → L=R•, $]
I10:  [R → L•, $]
I11:  [L → *•R, $]         goto(I11,R)=I13
      [R → •L, $]          goto(I11,L)=I10
      [L → •*R, $]         goto(I11,*)=I11
      [L → •id, $]         goto(I11,id)=I12
I12:  [L → id•, $]
I13:  [L → *R•, $]