CS2210 Compiler Construction: Syntax Analysis


Comparison with Lexical Analysis
Parsing is the second phase of compilation:
  Phase    Input                  Output
  Lexer    string of characters   string of tokens
  Parser   string of tokens       parse tree / AST

What Parse Tree?
A parse tree represents the program structure of the input. Programming language constructs usually have recursive structures:
  If-stmt → if ( EXPR ) then Stmt else Stmt fi
  Stmt → If-stmt | While-stmt | ...

A Parse Tree Example
Code to be compiled: ... if x==y then ... else ... fi
Lexer output (the parser's input): a sequence of tokens ... IF ID == ID THEN ... ELSE ... FI
Desired parser output: a tree with IF-STMT at the root and three children: the condition (== over the two IDs), the then-branch STMT, and the else-branch STMT.


How to Specify a Syntax
How can we specify a syntax with nested structures? Is it possible to use REs (Regular Expressions) / FAs (Finite Automata)?
RE/FA is not powerful enough. Example: matching parentheses, where the number of ( must equal the number of ):
  (x+y)*z
  ((x+y)+y)*z
  (...(((x+y)+y)+y)...)
A string such as ((x+y)+y)+y)*z, with an unmatched ), must be rejected.

RE/FA is Not Powerful Enough
Pumping Lemma for Regular Languages: all sufficiently long words in a regular language may be pumped, that is, have a substring repeated an arbitrary number of times. If an FA is to accept a string longer than its number of states, some state must repeat along the accepting path (q_s ... q_t). Let y be the substring read between the repeated states, so the accepted string is xyz. Then all strings xy^i z (i >= 0) are also accepted and are part of the language.

RE/FA is Not Powerful Enough
Matching parentheses is not a regular language. Suppose it were, and some FA accepted it. Construct a string with matching ()s long enough that there are more ('s than the number of states. Then there is a pumpable substring y consisting entirely of ('s, and every string xy^i z would also be accepted by the FA, even though for i != 1 its parentheses no longer match. Contradiction. QED. Similarly, any language with nested structures is not an RL.
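The contradiction can be made concrete: a recognizer for matching parentheses needs an unbounded counter, which is exactly the memory a finite automaton lacks. A minimal C sketch (the function name `balanced` is illustrative, not from the course):

```c
#include <stddef.h>

/* Recognizer for the "matching parentheses" language.
 * A DFA cannot do this: it would need a distinct state for every
 * possible nesting depth, and there are unboundedly many depths.
 * One unbounded counter (a degenerate stack) suffices. */
int balanced(const char *s) {
    long depth = 0;                   /* unbounded memory a DFA lacks */
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '(') depth++;
        else if (s[i] == ')') {
            if (depth == 0) return 0; /* ')' with no open '(' */
            depth--;
        }
        /* other characters (x, y, +, *) are ignored */
    }
    return depth == 0;                /* every '(' was closed */
}
```

With this, balanced("((x+y)+y)*z") is 1, while the unmatched string from the slide, "((x+y)+y)+y)*z", is rejected.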

What Language is C Syntax?
C syntax is an example of a Context-Free Language (CFL), a broader category of languages that includes languages with nested structures. Before discussing CFLs, we need to learn a more general way of specifying languages than REs, called grammars, which can specify both RLs and CFLs and more...

Grammar: General Language Specification
Recall the language definitions:
  Alphabet: a finite set of symbols
  Language: a set of strings over an alphabet
  Sentences: the strings in the language
A grammar is a general way to specify a language. E.g., the English grammar specifies the English language.

An Example
Language L = { any binary string with 00 at the end }
Grammar G = (T, N, s, δ) where T = { 0, 1 }, N = { A, B }, s = A, and the grammar rule set δ = { A → 0A | 1A | 0B, B → 0 }
Derivation: from grammar to language
  A → 0A → 00B → 000
  A → 1A → 10B → 100
  A → 0A → 00A → 000B → 0000
  A → 0A → 01A → ...
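Since this grammar is regular, it corresponds to a finite automaton. As a sketch of that correspondence, the language can be checked by simulating a three-state DFA (the function name `ends_in_00` and the state encoding are illustrative assumptions):

```c
#include <stddef.h>

/* DFA simulation for L = { binary strings ending in "00" },
 * the language generated by A -> 0A | 1A | 0B, B -> 0.
 * State 0: current suffix has no trailing '0'
 * State 1: ends in exactly one '0'
 * State 2: ends in "00" (accepting) */
int ends_in_00(const char *s) {
    int state = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '0')
            state = (state < 2) ? state + 1 : 2; /* one more trailing 0 */
        else
            state = 0;                           /* '1' resets the run */
    }
    return state == 2;
}
```

The derivations above all end in strings this DFA accepts, e.g. 000, 100, and 0000.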

Grammar
A grammar consists of 4 components (T, N, s, δ):
  T: set of terminal symbols; essentially tokens, the leaves of the parse tree
  N: set of non-terminal symbols; internal nodes of the parse tree, one or more symbols grouped into a higher construct (e.g. declaration, statement, loop, ...)
  s: the non-terminal start symbol where derivation starts
  δ: a set of production rules LHS → RHS (the left-hand side produces the right-hand side)

Production Rule and Derivation
Production Rule: LHS → RHS. Meaning: LHS can be replaced with RHS.
A derivation is a series of applications of production rules:
  β ⇒ α: string α is derived from β in 1 step
  β ⇒+ α: 1 or more steps
  β ⇒* α: 0 or more steps
Example: A ⇒ 0A ⇒ 00B ⇒ 000, so A ⇒* 000 and A ⇒+ 000

Chomsky Grammars
Noam Chomsky's classification of languages is based on the form of the grammar rules. Four (4) types of grammars:
  Type 0: unrestricted (recursively enumerable) grammar
  Type 1: context-sensitive grammar
  Type 2: context-free grammar
  Type 3: regular grammar

Type 0: Unrestricted Grammar
Type 0 grammar: unrestricted or recursively enumerable
Form of rules: α → β where α ∈ (N ∪ T)+, β ∈ (N ∪ T)*
Implied restriction: the LHS may not be ε
Examples:
  ab → acd  ; LHS is shorter than RHS
  aab → ab  ; LHS is longer than RHS
  A → ε     ; ε-productions are allowed
Computational complexity: unbounded. Derivation strings may contract and expand repeatedly (since the LHS may be longer or shorter than the RHS), so an unbounded number of productions may be applied before reaching the target string.
L(Unrestricted) = L(Turing Machine)

Type 1: Context-Sensitive Grammar
Type 1 grammar: context-sensitive grammar
Form of rules: αAβ → αγβ where A ∈ N, α, β ∈ (N ∪ T)*, γ ∈ (N ∪ T)+
Replace A by γ only when found in the context of α and β
Implied restrictions: the LHS is shorter than or equal to the RHS; no ε allowed on the RHS
Examples:
  aAB → aCB  ; replace A with C when between a and B
  A → C      ; replace A with C regardless of context
Computational complexity: membership is PSPACE-Complete. Derivation strings may only expand, so there is a bounded number of derivation steps before the target string.
L(CSG) = L(Linear Bounded Automaton)

Type 2: Context-Free Grammar
Type 2 grammar: context-free grammar
Form of rules: A → γ where A ∈ N, γ ∈ (N ∪ T)+
Replace A by γ (no context can be specified)
Implied restrictions: the LHS is a single non-terminal; no ε allowed on the RHS (sometimes relaxed to simplify the grammar, but the rules can always be rewritten to exclude ε-productions)
Example: A → abc  ; replace A with abc regardless of context
Computational complexity: polynomial, O(n^2.3728639), but most real-world CFGs parse in O(n)
L(CFG) = L(Pushdown Automaton)

Type 3: Regular Grammar
Type 3 grammar: regular grammar
Form of rules: A → α or A → αB where A, B ∈ N, α ∈ T
In terms of an FA: move from state A to state B on input α
Implied restrictions: the LHS is a single non-terminal; the RHS is a terminal or a terminal followed by a non-terminal
Example: A → 1A | 0
Computational complexity: linear, O(n). The derivation string length increases by 1 at each step.
L(RG) = L(Finite Automaton) = L(Regular Expression)

Differentiating Type 3 (RG) and Type 2 (CFG)
Language L1 = { [^i ]^j | i, j >= 1 } (i open brackets followed by j close brackets): regular grammar
  S → [S | [T
  T → ]T | ]
Language L2 = { [^i ]^i | i >= 1 }: context-free grammar
  S → [S] | []

Differentiating Type 1 (CSG) and Type 2 (CFG)
Type 2 grammar (context-free):
  S → D U
  D → int x; | int y;
  U → x=1; | y=1;
Type 1 grammar (context-sensitive):
  S → D U
  D → int x; | int y;
  int x; U → int x; x=1;
  int y; U → int y; y=1;


Differentiating Type 1 (CSG) and Type 2 (CFG)
Language from the type 2 grammar (CFG):
  S → DU → int x; U → int x; x=1;
  S → DU → int x; U → int x; y=1;
  S → DU → int y; U → int y; x=1;
  S → DU → int y; U → int y; y=1;
Language from the type 1 grammar (CSG):
  S → DU → int x; U → int x; x=1;
  S → DU → int y; U → int y; y=1;
If PLs are context-sensitive, why use CFGs for parsing?
  CSG parsers are provably inefficient (membership is PSPACE-Complete)
  Most PL constructs are context-free: if-statements, declarations, ...
  The remaining context-sensitive constructs can be analyzed in the semantic analysis stage (e.g. def-before-use, matching formal/actual parameters)


Language Classification
L(Regular Grammar) ⊂ L(CFG) ⊂ L(CSG) ⊂ L(Unrestricted Grammar), i.e. Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0.
However, L_y ⊂ L_x where L_x = { [^i ]^k } is regular and L_y = { [^i ]^i } is context-free. Is it a problem? (No: the hierarchy orders classes of languages, not individual languages; L_y is still not regular even though it is contained in a regular language.)

Context Free Grammars

Grammar and Derivation
A grammar is used to derive strings or to construct a parser. A derivation is a sequence of applications of rules, starting from the start symbol S and ending at a sentence: S ⇒ ... ⇒ sentence.
Leftmost and rightmost derivations: at each derivation step, a leftmost derivation always replaces the leftmost non-terminal symbol; a rightmost derivation always replaces the rightmost one.

Examples
Grammar: E → E * E | E + E | ( E ) | id
Leftmost derivation:
  E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ ... ⇒ id * id + id * id
Rightmost derivation:
  E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ ... ⇒ id * id + id * id


Parse Trees
Parse tree structure describes the program structure (defined by the rules applied). Internal nodes are non-terminals, leaves are terminals.
The structure depends on what production rules were applied; grammars should specify precisely which rules to apply, to avoid ambiguity in program structure.
The structure does not depend on the order of application: the previous rightmost and leftmost derivations yield the same tree, with + at the root and an id * id subtree on each side. The order of rule application is implementation dependent.

Different Parse Trees
Consider the string id * id + id * id. It can result in 3 different parse trees (and more) for this grammar: one with + at the root, grouping (id * id) + (id * id), and others with * at the root, grouping for example id * (id + (id * id)) or ((id * id) + id) * id.

Ambiguity
A grammar G is ambiguous if there exists a string str ∈ L(G) such that more than one parse tree derives str. In practice, we prefer unambiguous grammars. Fortunately, ambiguity is (often) a property of the grammar and not the language, so it is (often) possible to rewrite the grammar to remove ambiguity. Programming languages are designed not to be ambiguous; it is the job of compiler writers to express those rules accurately in the grammar.

How to Remove Ambiguity
Step I: Specify precedence. Build precedence into the grammar by recursion on a different non-terminal for each precedence level.
Insight: lower-precedence operators tend to be higher in the tree (close to the root); higher-precedence operators tend to be lower in the tree (far from the root). So place lower-precedence non-terminals higher up in the tree.
  E → E + E | E - E | E * E | E / E | E ^ E | ( E ) | id
Rewrite it to:
  E → E + T | E - T | T  ; lowest precedence: +, -
  T → T * F | T / F | F  ; middle precedence: *, /
  F → B ^ F | B          ; highest precedence: ^
  B → id | ( E )         ; even higher precedence: ()

How to Remove Ambiguity
Step II: Specify associativity. Allow recursion on only the left or only the right non-terminal:
  left associative: recursion on the left non-terminal
  right associative: recursion on the right non-terminal
In the previous example, E → E + E ... allows both left and right associativity. Rewrite it to:
  E → E + T | ...  ; + is left associative
  F → B ^ F | ...  ; exponent is right associative
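The layered grammar maps directly onto code: one function per precedence level, left-recursive rules become loops, and the right-recursive exponent rule stays a recursive call. A minimal C evaluator sketch under those assumptions (function names, the global cursor `p`, and the lack of whitespace/error handling are simplifications, not the course's code):

```c
#include <stdlib.h>

/* Evaluator for  E -> E+T | E-T | T,  T -> T*F | T/F | F,
 * F -> B^F | B,  B -> int | (E).  Input: NUL-terminated string
 * with no spaces, e.g. "2+3*4". */
static const char *p;      /* cursor into the input string */
static long expr(void);

static long base(void) {                 /* B -> int | ( E ) */
    if (*p == '(') { p++; long v = expr(); p++; /* skip ')' */ return v; }
    return strtol(p, (char **)&p, 10);
}
static long factor(void) {               /* F -> B ^ F | B  (right assoc) */
    long b = base();
    if (*p == '^') {
        p++;
        long e = factor(), r = 1;        /* recursion => right associative */
        while (e-- > 0) r *= b;
        return r;
    }
    return b;
}
static long term(void) {                 /* T -> T*F | T/F | F (left assoc) */
    long v = factor();
    while (*p == '*' || *p == '/') {     /* loop => left associative */
        char op = *p++;
        long f = factor();
        v = (op == '*') ? v * f : v / f;
    }
    return v;
}
static long expr(void) {                 /* E -> E+T | E-T | T (left assoc) */
    long v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        long t = term();
        v = (op == '+') ? v + t : v - t;
    }
    return v;
}
long eval(const char *s) { p = s; return expr(); }
```

Note how the loop in expr() makes 8-3-2 group as (8-3)-2, while the recursive call in factor() makes 2^3^2 group as 2^(3^2).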

Properties of Context-Free Grammars
Decidable: computable using a Turing Machine; in other words, guaranteed an answer in finite time.
Whether a string is in a context-free language: decidable. In fact we can run a CFL parser in polynomial time.
Whether a CFG is ambiguous: undecidable. In general, it is impossible to tell whether a CFG is ambiguous; we can only check whether it is ambiguous for a given string. In practice, parser generators like Yacc check for a more restricted class of grammars (e.g. LALR(1)) instead: LALR(1) grammars are a subset of the unambiguous grammars, and whether a grammar is LALR(1) is easily decidable.
Whether two CFGs generate the same language: undecidable. It is impossible to tell whether the language changed by tweaking the grammar, so parsers are frequently regression tested against a test set.


From Grammar to Parser
What exactly is parsing, or syntax analysis? It is to process an input string for a given grammar, and compose the derivation if the string is in the language. Two subtasks:
  determine whether the string is in the language
  construct a parse tree (a representation of the derivation)
How would you construct a parser from a grammar?

Types of Parsers
Universal parser: can parse any CFG, e.g. Earley's algorithm. Powerful, but can be very inefficient: O(N^3), where N is the length of the string.
Top-down parser: starts from the root and expands toward the leaves, trying to expand the start symbol to the input string; finds a leftmost derivation. Only works for a certain class of CFGs called LL, but provably works in O(n) time. The parser code structure closely mimics the grammar, so it is amenable to implementation by hand; automated tools also exist to convert a grammar to code (e.g. ANTLR).

Types of Parsers (cont.)
Bottom-up parser: starts at the leaves and builds up to the root, trying to reduce the input string to the start symbol; finds the reverse order of the rightmost derivation. Works for a wider class of CFGs called LR, and also provably works in O(n) time. The parser code structure is nothing like the grammar, so it is very difficult to implement by hand; automated tools exist to convert a grammar to code (e.g. Yacc, Bison).

What Output do We Want?
The output of parsing is a parse tree, or an abstract syntax tree. An abstract syntax tree is an abbreviated representation of a parse tree: it drops some details without compromising meaning. Terminal symbols that no longer contribute to the semantics are dropped (e.g. parentheses), and internal nodes may contain terminal symbols.

An Example
Consider the grammar E → int | ( E ) | E + E and the input 5 + ( 2 + 3 ). After lexical analysis, we have a sequence of tokens: INT 5 + ( INT 2 + INT 3 )

Parse Tree of the Input
[Parse tree: E at the root with children E + E; the left E derives INT 5; the right E derives ( E ), whose inner E derives E + E over INT 2 and INT 3.]
A parse tree traces the process of derivation and captures all the rules that were applied, but contains redundant information: parentheses and single-successor nodes.

Abstract Syntax Tree
The Abstract Syntax Tree (AST) for the input is PLUS(5, PLUS(2, 3)). The AST captures all that is semantically relevant but is more compact in representation: it abstracts from the parse tree (a.k.a. concrete syntax tree). ASTs are used in most compilers rather than parse trees.

How are ASTs Constructed in a Compiler?
Through the implementation of semantic actions: actions triggered on a rule match, already used in lexical analysis to return token tuples. For syntactic analysis, this involves computing the semantic value of the whole construct from the values of its parts. To construct an AST, attach an attribute to each symbol X; attributes contain the semantic value for each symbol, e.g. X.ast is the constructed AST for symbol X. Then attach semantic actions to the production rules:
  X → Y1 Y2 ... Yn { actions }
where the actions define X.ast using Yi.ast (1 <= i <= n).

Example
For the previous example, we have:
  E → int      { E.ast = mkleaf(int.lval) }
  E → E1 + E2  { E.ast = mkplus(E1.ast, E2.ast) }
  E → ( E1 )   { E.ast = E1.ast }
Two functions need to be defined:
  ptr1 = mkleaf(n): create a leaf node and assign it the value n
  ptr2 = mkplus(t1, t2): create a tree node, assign it the value PLUS, then attach the two child trees t1 and t2
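One way the two helpers could look in C (the node layout and field names are assumptions for illustration, not the course's actual definitions):

```c
#include <stdlib.h>

/* A possible AST node: either a PLUS node with two children,
 * or an int leaf carrying a value. */
typedef struct Ast {
    int is_plus;            /* 1 for a PLUS node, 0 for an int leaf */
    int value;              /* leaf value (unused for PLUS nodes) */
    struct Ast *left, *right;
} Ast;

Ast *mkleaf(int n) {        /* create a leaf node holding value n */
    Ast *a = malloc(sizeof *a);
    a->is_plus = 0; a->value = n; a->left = a->right = NULL;
    return a;
}
Ast *mkplus(Ast *t1, Ast *t2) {  /* create a PLUS node over t1 and t2 */
    Ast *a = malloc(sizeof *a);
    a->is_plus = 1; a->value = 0; a->left = t1; a->right = t2;
    return a;
}
/* Evaluating the AST shows it kept everything semantically relevant. */
int eval_ast(const Ast *a) {
    return a->is_plus ? eval_ast(a->left) + eval_ast(a->right) : a->value;
}
```

For the input 5 + ( 2 + 3 ), the semantic actions would build mkplus(mkleaf(5), mkplus(mkleaf(2), mkleaf(3))); note that the parentheses leave no trace in the tree.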

AST Construction Steps
For the input INT 5 + ( INT 2 + INT 3 ). The construction order given is for a bottom-up LR(1) parser (the order can change depending on the parser implementation):
  1. E1.ast = mkleaf(5)
  2. E2.ast = mkleaf(2)
  3. E3.ast = mkleaf(3)
  4. E4.ast = mkplus(E2.ast, E3.ast)
  5. E5.ast = mkplus(E1.ast, E4.ast)
The final AST is PLUS(5, PLUS(2, 3)).

Summary
Compilers specify program structure using CFGs. Most programming languages are not context-free, but the context-sensitive parts can easily be separated out into the semantic analysis phase. A parser uses a CFG to answer whether an input str ∈ L(G), build a parse tree (or an AST instead), and pass it to the rest of the compiler.

Parsing

Parsing
We will study two approaches:
  Top-down: easier to understand and implement manually
  Bottom-up: more powerful, but hard to implement manually

Example
Consider a CFG grammar G:
  S → A B
  A → a C
  C → c
  B → b D
  D → d
Actually, this language has only one sentence, i.e. L(G) = { acbd }
Leftmost derivation:  S ⇒ AB ⇒ aCB ⇒ acB ⇒ acbD ⇒ acbd
Rightmost derivation: S ⇒ AB ⇒ AbD ⇒ Abd ⇒ aCbd ⇒ acbd
A top-down parse grows the tree from the root downward (S; then A B; then a C under A, c under C, b D under B, d under D). A bottom-up parse grows it from the leaves a c b d upward (first C over c, then A, then D, then B, finally S). Both arrive at the same parse tree:
  S has children A and B; A has children a and C (with C → c); B has children b and D (with D → d).

Top-Down Parsers
Recursive descent parser: implemented using recursive calls to functions that implement the expansion of each non-terminal. Goes through all possible expansions by trial and error until a match with the input; backtracks when a mismatch is detected. Simple to implement, but may take exponential time.
Predictive parser: a recursive descent parser with prediction (no backtracking). Predicts the next rule by looking ahead k symbols; places restrictions on the grammar to avoid backtracking. Only works for the class of grammars called LL(k).
Nonrecursive predictive parser: a predictive parser with no recursive calls. Table-driven, so suitable for automated parser generators.


Recursive Descent Example
Grammar:
  E → T + E | T
  T → int * T | int | ( E )
Input string: int * int
Start symbol: E; the initial parse tree is just E.
Assume: when there are alternative rules, try the rightmost rule first.

Parsing Sequence (using Backtracking)
  E ⇒ T               pick rightmost rule E → T
  T: try ( E )        pick rightmost rule T → ( E ); ( does not match int: failure, backtrack one level
  T: try int          pick next rule T → int; int matches input int, but more tokens remain: failure, backtrack one level
  T: try int * T      pick next rule T → int * T; int * matches input int *
  T: try int * ( E )  pick rightmost rule T → ( E ) for the inner T; ( does not match input int: failure, backtrack one level
  T: try int * int    pick next rule T → int for the inner T; match, accept

Recursive Descent Parsing uses Backtracking

Approach: for a non-terminal in the derivation, productions are tried in some order until
- a production is found that generates a portion of the input, or
- no production is found that generates a portion of the input, in which case backtrack to the previous non-terminal

Terminals of the derivation are compared against the input
- Match: advance input, continue parsing
- Mismatch: backtrack, or fail

Parsing fails if no production for the start symbol generates the entire input

Implementation CS2210: Compiler Construction

Create a procedure for each non-terminal
1. For the RHS of each production rule:
   a. For a terminal, match it with the input symbol and consume it
   b. For a non-terminal, call the procedure for that non-terminal
   c. If the match succeeds for the entire RHS, return success
   d. If the match fails, regurgitate the input and try the next production rule
2. If the match succeeds for any rule, apply that rule to the LHS

Sample Code

Sample implementation of a parser for the previous grammar:

E → T + E | T
T → int * T | int | ( E )

int match(int token) { return getnext() == token; }

int expr() {
R1:   if (!term())        goto R2;
      if (!match(addnum)) goto R2;
      if (!expr())        goto R2;
      return 1;
R2:   /* backtrack R1 */
      if (!term())        goto Fail;
      return 1;
Fail: /* backtrack R2 */
      return 0;
}

int term() {
R1:   if (!match(intnum))  goto R2;
      if (!match(starnum)) goto R2;
      if (!term())         goto R2;
      return 1;
R2:   /* backtrack R1 */
      if (!match(intnum))  goto R3;
      return 1;
R3:   /* backtrack R2 */
      if (!match(leftparennum))  goto Fail;
      if (!expr())               goto Fail;
      if (!match(rightparennum)) goto Fail;
      return 1;
Fail: /* backtrack R3 */
      return 0;
}

Left Recursion Problem

Recursive descent doesn't work with left recursion (right recursion is okay)

Why is left recursion a problem? For the left-recursive grammar

A → A b | c

we may repeatedly choose to apply A → A b:

A ⇒ A b ⇒ A b b ⇒ ...

The sentential form can grow indefinitely without consuming input. How do you know when to stop the recursion and choose c?

Solution: rewrite the grammar so that it is right recursive, which expresses the same language

Remove Left Recursion

In general, we can eliminate all immediate left recursion:

A → A x | y

change to

A  → y A'
A' → x A' | ε

Not all left recursion is immediate; it may be hidden in multiple production rules:

A → B C | D
B → A E | F
...

See Section 4.3 for elimination of general left recursion (not required for this course)
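As a minimal sketch of why the rewrite works: the right-recursive form maps directly onto recursive-descent procedures that always consume a token before recursing, so they cannot loop forever. The single-character tokens 'x' and 'y', the global cursor, and the function names here are illustrative choices, not from the course material.

```c
/* Grammar A -> A x | y, rewritten as A -> y A', A' -> x A' | eps. */
const char *cur;               /* cursor into the token string */

int a_prime(void) {            /* A' -> x A' | eps */
    if (*cur == 'x') {         /* consume an x, then recurse */
        cur++;
        return a_prime();
    }
    return 1;                  /* eps alternative: always succeeds */
}

int parse_A(const char *input) {   /* A -> y A' */
    cur = input;
    if (*cur != 'y') return 0;     /* every A-sentence starts with y */
    cur++;
    return a_prime() && *cur == '\0';  /* must consume the whole input */
}
```

Note that a_prime consumes an 'x' before each recursive call, in contrast to a procedure for the left-recursive A → A x, which would call itself with nothing consumed.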

Summary of Recursive Descent

Recursive descent is a simple and general parsing strategy
- Left recursion must be eliminated first
- It can be eliminated automatically, as on the previous slide
- L(Recursive_Descent) = L(CFG) = CFL

However, it is not popular because of backtracking
- Backtracking requires re-parsing the same string, which is inefficient (can take exponential time)
- Undoing semantic actions may also be difficult, e.g. removing already-added nodes in the parse tree

Techniques used in practice do no backtracking, at the cost of restricting the class of grammar

Predictive Parsers

A top-down parser with no backtracking: choose the appropriate production given the next input terminal.

If the first terminal of every alternative production is unique, as in

A → a B D
B → b B | c C
C → b | e
D → d

then parsing input abced requires no backtracking.

Rewriting grammars for predictive parsing:
- Left factor to enable prediction: A → α β | α γ can be changed to A → α A'; A' → β | γ
- Eliminate left recursion, just like for recursive descent

LL(k) Parsers

LL(k) parser: a predictive parser that uses k lookahead tokens
- L: left-to-right scan
- L: leftmost derivation
- k: k symbols of lookahead

LL(k) grammar: a grammar that can be parsed using an LL(k) parser
LL(k) language: a language that can be expressed as an LL(k) grammar

LL(k) languages are a restricted subset of CFLs
- But many languages are LL(k); in fact, many are LL(1)!
- Can be implemented in a recursive or nonrecursive fashion

Nonrecursive Predictive Parser

[Figure: the parser driver reads input tokens (terminated by $) through a read head, consults parse table M[N,T], and manipulates a syntax stack (TOP ... $), producing output]

- The syntax stack holds the unmatched portion of the derivation string
- The parser driver chooses the next action based on (current token, stack top)
- A parse table entry M[A,b] contains a rule A → ... or error
- Table-driven: amenable to automatic code generation (just like lexers)

A Sample LL(1) Parse Table

      int          *         +         (         )        $
E     E → TX                           E → TX
X                            X → +E              X → ε    X → ε
T     T → int Y                        T → (E)
Y                  Y → *T    Y → ε               Y → ε    Y → ε

Implementation with a 2D parse table:
- The first column lists all non-terminals
- The first row lists all possible terminals and $
- A table entry contains one production: one action for each (non-terminal, input) combination
- It predicts the correct action based on one lookahead; no backtracking required

Algorithm for Parsing

X: symbol at the top of the syntax stack
a: current input symbol

Parsing is based on (X, a):
- If X == a == $, the parser halts with success
- If X == a != $, pop X from the stack and advance the input head
- If X != a:
  - Case (a): if X ∈ T, the parser halts with failure; input rejected
  - Case (b): if X ∈ N and M[X,a] = X → RHS, pop X and push RHS onto the stack in reverse order
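The case analysis above can be sketched as a table-driven driver for the running grammar. This is an illustrative implementation, not the course's reference code: tokens are single characters with 'i' standing for int, the table is encoded as a switch, and the empty string encodes an ε right-hand side.

```c
#include <string.h>

/* Parse table for E -> TX, X -> +E | eps, T -> int Y | (E), Y -> *T | eps.
   Returns the predicted RHS, "" for eps, or NULL for an error entry. */
static const char *table(char X, char a) {
    switch (X) {
    case 'E': if (a == 'i' || a == '(') return "TX"; break;
    case 'X': if (a == '+') return "+E";
              if (a == ')' || a == '$') return ""; break;
    case 'T': if (a == 'i') return "iY";
              if (a == '(') return "(E)"; break;
    case 'Y': if (a == '*') return "*T";
              if (a == '+' || a == ')' || a == '$') return ""; break;
    }
    return NULL;
}

int ll1_parse(const char *input) {     /* input must end with '$' */
    char stack[256];
    int top = 0;
    stack[top++] = '$';
    stack[top++] = 'E';                /* start symbol on top */
    while (top > 0) {
        char X = stack[--top], a = *input;
        if (X == a) {                  /* X == a: match */
            if (X == '$') return 1;    /* X == a == '$': halt with success */
            input++;                   /* advance the input head */
        } else if (X == 'E' || X == 'X' || X == 'T' || X == 'Y') {
            const char *rhs = table(X, a);   /* case (b): predict */
            if (!rhs) return 0;              /* error entry: reject */
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[top++] = rhs[k];       /* push RHS in reverse order */
        } else {
            return 0;                  /* case (a): terminal mismatch */
        }
    }
    return 0;
}
```

For example, ll1_parse("i*i$") follows exactly the recognition sequence shown a few slides later.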

Push RHS in Reverse Order

X: symbol at the top of the syntax stack
a: current input symbol

If M[X,a] = X → BcD, then pop X and push D, then c, then B, so that B ends up on top:

before:  X ... $
after:   B c D ... $

Applying LL(1) Parsing to a Grammar

Given our old grammar

E → T + E | T
T → int * T | int | ( E )

there is no left recursion, but it requires left factoring. After rewriting the grammar, we have

E → T X
X → + E | ε
T → int Y | ( E )
Y → * T | ε

Using the Parse Table

To recognize int * int:

input: int * int $
stack: E $

      int          *         +         (         )        $
E     E → TX                           E → TX
X                            X → +E              X → ε    X → ε
T     T → int Y                        T → (E)
Y                  Y → *T    Y → ε               Y → ε    Y → ε

The driver repeatedly looks up (stack top, current token) in the table; the full sequence of stack and input states appears in the Recognition Sequence slide.

Recognition Sequence

The parse can be written as an action list:

Stack          Input          Action
E $            int * int $    E → TX
T X $          int * int $    T → int Y
int Y X $      int * int $    match int
Y X $          * int $        Y → *T
* T X $        * int $        match *
T X $          int $          T → int Y
int Y X $      int $          match int
Y X $          $              Y → ε
X $            $              X → ε
$              $              halt and accept

Constructing the Parse Table

We need to know two sets:
- First(A): the set of terminals that can appear first in a string derived from A
- Follow(A): the set of terminals that can appear following A

Intuitive Meaning of First and Follow

If S ⇒* α A a β and A ⇒* c γ, then c ∈ First(A) and a ∈ Follow(A).

Why is the Follow set important?

First(α)

First(α) = the set of terminals that can start a string of terminals derived from α. Apply the following rules until no terminal or ε can be added:

1. If t ∈ T, then First(t) = {t}. For example, First(+) = {+}.
2. If X ∈ N and X → ε exists, then add ε to First(X). For example, First(Y) = {*, ε}.
3. If X ∈ N and X → Y1 Y2 Y3 ... Ym, then:
   - Add (First(Y1) − {ε}) to First(X).
   - If First(Y1), ..., First(Y(k−1)) all contain ε, then add (First(Yk) − {ε}) to First(X).
   - If First(Y1), ..., First(Ym) all contain ε, then add ε to First(X).

Follow(α)

Follow(A) = {t | S ⇒* α A t β}

Intuition: if X → A B, then First(B) ⊆ Follow(A). It is a little trickier because B may derive ε, i.e. B ⇒* ε.

Apply the following rules until no terminal can be added:

1. $ ∈ Follow(S), where S is the start symbol. E.g. Follow(E) = {$, ...}.
2. Look at each occurrence of a non-terminal on the right-hand side of a production that is followed by something: if A → α B β, then First(β) − {ε} ⊆ Follow(B).
3. Look at each non-terminal on the RHS that is not followed by anything: if A → α B, or A → α B β with ε ∈ First(β), then Follow(A) ⊆ Follow(B).
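The First and Follow rules above can be sketched as a fixpoint iteration: keep applying all the rules until no set changes. This is an illustrative encoding for the running grammar only, not general course code: sets are bitmasks over the terminals "i*+()$" ('i' stands for int), an extra EPS bit marks ε in a First set, and all names are my own.

```c
#include <string.h>

#define EPS (1 << 6)                       /* eps marker in First masks */

/* Rules: E->TX, X->+E, X->eps, T->int Y, T->(E), Y->*T, Y->eps */
static const char *LHS   = "EXXTTYY";
static const char *RHS[] = { "TX", "+E", "", "iY", "(E)", "*T", "" };
static const char *TERMS = "i*+()$";
static const char *NTS   = "EXTY";

int term_bit(char t) { return 1 << (int)(strchr(TERMS, t) - TERMS); }
int nt_index(char X) { return (int)(strchr(NTS, X) - NTS); }
int is_nt(char c)    { return c && strchr(NTS, c) != NULL; }

/* First of a symbol string b under the current First sets (rule 3). */
int first_of(const char *b, const int first[4]) {
    int s = 0;
    for (const char *p = b; *p; p++) {
        if (!is_nt(*p)) return s | term_bit(*p);   /* terminal stops the scan */
        s |= first[nt_index(*p)] & ~EPS;
        if (!(first[nt_index(*p)] & EPS)) return s;
    }
    return s | EPS;    /* empty b, or every symbol derives eps */
}

void first_follow(int first[4], int follow[4]) {
    memset(first, 0, 4 * sizeof first[0]);
    memset(follow, 0, 4 * sizeof follow[0]);
    follow[nt_index('E')] = term_bit('$');         /* $ in Follow(start) */
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int r = 0; r < 7; r++) {
            int A = nt_index(LHS[r]);
            int add = first[A] | first_of(RHS[r], first);
            if (add != first[A]) { first[A] = add; changed = 1; }
            for (const char *p = RHS[r]; *p; p++) {
                if (!is_nt(*p)) continue;
                int B = nt_index(*p);
                int suf = first_of(p + 1, first);
                int f = follow[B] | (suf & ~EPS);  /* rule 2: First(beta) */
                if (suf & EPS) f |= follow[A];     /* rule 3: Follow(A)   */
                if (f != follow[B]) { follow[B] = f; changed = 1; }
            }
        }
    }
}
```

Running this reproduces the First and Follow tables shown on the following slides: First(E) = First(T) = {int, (}, First(X) = {+, ε}, First(Y) = {*, ε}, Follow(E) = Follow(X) = {$, )}, Follow(T) = Follow(Y) = {$, ), +}.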

Informal Interpretation of First and Follow Sets

First set of X (focus on the LHS):
- If X ∈ T, then the terminal symbol itself
- If X → Y Z, then add First(Y)
- If X → ε, then add ε

Follow set of X (focus on the RHS):
- If X is the start symbol, add $
- If ... → ... X Y ..., then add First(Y)
- If Y → ... X, then add Follow(Y)

For the example

E → T X
X → + E | ε
T → int Y | ( E )
Y → * T | ε

First set (focus on LHS):
E → T X
X → + E
X → ε
T → int Y
T → ( E )
Y → * T
Y → ε

Follow set (focus on RHS):
$ (start symbol)
T → ( E )
X → + E
E → T X
Y → * T
T → int Y

Example

E → T X
X → + E | ε
T → int Y | ( E )
Y → * T | ε

Symbol   (    )    +    int    Y       X       T        E
First    (    )    +    int    *, ε    +, ε    (, int   (, int

Symbol   E      X      T        Y
Follow   $, )   $, )   $, ), +  $, ), +

Construction of the LL(1) Parse Table

To construct the parse table, we check each rule A → α:
- For each terminal a ∈ First(α), add A → α to M[A,a].
- If ε ∈ First(α), then for each terminal b ∈ Follow(A), add A → α to M[A,b].
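The two construction rules above can be sketched directly for the running grammar. This is an illustrative encoding, not course code: the First(α) and Follow(A) sets from the previous slides are hardcoded per rule as strings ('e' marks ε, 'i' marks int), and M[A][a] holds a rule index or -1 for an error entry.

```c
#include <string.h>

/* Rule r:        0       1       2        3          4        5       6
                  E->TX   X->+E   X->eps   T->int Y   T->(E)   Y->*T   Y->eps */
static const char *LHS_NT     = "EXXTTYY";
static const char *FIRST_RHS[]  = { "i(", "+", "e", "i", "(", "*", "e" };
static const char *FOLLOW_LHS[] = { "$)", "$)", "$)", "$)+", "$)+", "$)+", "$)+" };
static const char *TERMS = "i*+()$";
static const char *NTS   = "EXTY";

void build_table(int M[4][6]) {
    memset(M, -1, sizeof(int) * 4 * 6);            /* all entries: error */
    for (int r = 0; r < 7; r++) {
        int A = (int)(strchr(NTS, LHS_NT[r]) - NTS);
        for (const char *p = FIRST_RHS[r]; *p; p++) {
            if (*p == 'e') {                       /* eps in First(alpha): */
                for (const char *q = FOLLOW_LHS[r]; *q; q++)
                    M[A][strchr(TERMS, *q) - TERMS] = r;   /* use Follow(A) */
            } else {
                M[A][strchr(TERMS, *p) - TERMS] = r;       /* use First(alpha) */
            }
        }
    }
}
```

The result matches the sample parse table: e.g. M[E][int] and M[E][(] both hold E → TX, M[X][)] and M[X][$] hold X → ε, and M[E][*] stays an error entry.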

Example

E → T X
X → + E | ε
T → int Y | ( E )
Y → * T | ε

Symbol   (    )    +    int    Y       X       T        E
First    (    )    +    int    *, ε    +, ε    (, int   (, int

Symbol   E      X      T        Y
Follow   $, )   $, )   $, ), +  $, ), +

      int          *         +         (         )        $
E     E → TX                           E → TX
X                            X → +E              X → ε    X → ε
T     T → int Y                        T → (E)
Y                  Y → *T    Y → ε               Y → ε    Y → ε

Determine if Grammar G is LL(1)

Observation: if a grammar is LL(1), then each entry of its LL(1) table contains at most one rule. Otherwise, it is not LL(1).

Two methods to determine whether a grammar is LL(1):
(1) Construct the LL(1) table, and check whether any entry contains multiple rules, or
(2) Check each rule as if the table were being constructed. G is LL(1) iff for each rule A → α | β:
- First(α) ∩ First(β) = φ
- At most one of α and β can derive ε
- If β derives ε, then First(α) ∩ Follow(A) = φ

Non-LL(1) Grammars

Suppose a grammar is not LL(1). What then?

(1) The language may still be LL(1).
- Try to massage the grammar into an LL(1) grammar: apply left factoring, apply left-recursion removal
- Try to remove ambiguity in the grammar: encode precedence into rules, encode associativity into rules

(2) If (1) fails, the language may not be LL(1).
- It is impossible to resolve the conflict at the grammar level
- The programmer chooses which rule to use for the conflicting entry (if choosing that rule is always semantically correct)
- Otherwise, use a more powerful parser (e.g. LL(k), LR(1))

Non-LL(1) Grammar Example

Some grammars are not LL(1) even after left factoring and left-recursion removal:

S → if C then S | if C then S else S | a   (other statements)
C → b

change to

S → if C then S X | a
X → else S | ε
C → b

Problem sentence: if b then if b then a else a

First(X) = {else, ε} and else ∈ Follow(X), so (First(X) − {ε}) ∩ Follow(X) ≠ φ: on lookahead else, both X → else S and X → ε are predicted. Such grammars are potentially ambiguous.

Non-LL(1) Even After Removing Ambiguity

For the if-then-else example, how would you rewrite it?

S  → if C then S | if C then S2 else S
S2 → if C then S2 else S2 | a
C  → b

Now the grammar is unambiguous, but it is not LL(k) for any k:
- Must look ahead until the matching else to choose the rule for S
- That lookahead may be an arbitrary number of tokens, with arbitrary nesting of if-then-else statements
- Some languages are not LL(1) or even LL(k): no amount of lookahead can tell the parser whether this statement is an if-then or an if-then-else

Fortunately, in this case we can resolve the conflict in the table: always choose X → else S over X → ε on the else token. This is semantically correct, since an else should match the closest if.
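The table-level resolution above can be sketched with the driver from earlier slides. This is an illustrative abbreviation of my own, not course code: the grammar is compressed to single-character tokens ('i' = "if", 'b' = the condition, 't' = "then", 'e' = "else", 'a' = a plain statement), and the conflicting entry M[X, else] is hardcoded to X → else S.

```c
#include <string.h>

/* S -> i C t S X | a,  X -> e S | eps,  C -> b.
   On lookahead 'e', both X -> eS and X -> eps would be predicted;
   we keep only X -> eS, so each else binds to the closest if. */
static const char *predict(char N, char a) {
    switch (N) {
    case 'S': if (a == 'i') return "iCtSX";
              if (a == 'a') return "a"; break;
    case 'X': if (a == 'e') return "eS";   /* conflict resolved: prefer else */
              if (a == '$') return "";     /* X -> eps on end of input */
              break;
    case 'C': if (a == 'b') return "b"; break;
    }
    return NULL;
}

int parse_stmt(const char *input) {        /* input must end with '$' */
    char stack[256];
    int top = 0;
    stack[top++] = '$';
    stack[top++] = 'S';
    while (top > 0) {
        char N = stack[--top], a = *input;
        if (N == a) {                      /* terminal: match */
            if (N == '$') return 1;
            input++;
        } else if (N == 'S' || N == 'X' || N == 'C') {
            const char *rhs = predict(N, a);
            if (!rhs) return 0;
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[top++] = rhs[k];     /* push RHS in reverse order */
        } else {
            return 0;
        }
    }
    return 0;
}
```

For the problem sentence "if b then if b then a else a" (encoded "ibtibtaea$"), the driver attaches the else to the inner if without any lookahead beyond one token.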

LL(1) Time and Space Complexity

Linear time and space relative to the length of the input

Time: each input symbol is consumed within a constant number of steps. Why?
- If the symbol at the top of the stack is a terminal: matched immediately, in one step
- If the symbol at the top of the stack is a non-terminal: matched in at most N steps, where N = number of rules (since there is no left recursion, the parser cannot apply the same rule twice without consuming input)

Space: the stack stays smaller than the input (after removing X → ε). Why?
- The RHS is always at least as long as the LHS, so the derivation string expands monotonically
- The derivation string is always shorter than the final input string
- The stack is a subset of the derivation string (the unmatched portion)

Nonrecursive Parser Summary

First and Follow sets are used to construct predictive parsing tables. Intuitively, First and Follow sets guide the choice of rules:
- For non-terminal A and lookahead t, use the production rule A → α where t ∈ First(α), OR
- For non-terminal A and lookahead t, use the production rule A → α where ε ∈ First(α) and t ∈ Follow(A)

There can be only ONE such rule, or the grammar is not LL(1).

Questions

What is LL(0)? Why are LL(2) ... LL(k) not widely used?