Syntax Analysis. Martin Sulzmann. Martin Sulzmann Syntax Analysis 1 / 38

Syntax Analysis Martin Sulzmann Martin Sulzmann Syntax Analysis 1 / 38

Syntax Analysis Objective Recognize individual tokens as sentences of a language (beyond regular languages). Example 1 (OK) Program text x := x + 2 yields token stream (ident x ) assign (ident x ) plus (const 2) Example 2 (Not OK) Program text x x := + 2 yields token stream (ident x ) (ident x ) assign plus (const 2) We will discuss later if the OK example shall be semantically valid. Martin Sulzmann Syntax Analysis 2 / 38

Approach: Parsing with CFG Context-Free Grammar (CFG) Language described in terms of a context-free grammar: Command ident assign Expression... Expression const ident Expression plus Expression ident, const, plus being tokens where attributes are omitted. Note: CFG uses uppercase for non-terminal and lowercase for terminal symbols. Our parsing tool OCamlyacc just uses it the other way around! Parsing yields Abstract Syntax Tree (AST) Check if stream of tokens is valid and produce AST. Martin Sulzmann Syntax Analysis 3 / 38

Parsing with CFG Example Token Stream (ident x ) assign (ident x ) plus (const 2) AST assign ident plus x ident plus x 2 Martin Sulzmann Syntax Analysis 4 / 38

Types of Parsing Top-Down Recursive descent From root to leaves ANTRL style Bottom-Up From leaves to root yacc style Others Parser combinators Derivative-based parser... Martin Sulzmann Syntax Analysis 5 / 38

Error Handling Types of Errors lexical misspelled keyword whle cond... syntactic unbalanced parentheses x:= x+1) semantic incompatible types x:= x + True logical infinite loop while True... Provide useful feedback to user. Mostly ignored here. Martin Sulzmann Syntax Analysis 6 / 38

Context Free Grammar (CFG) Definition A CFG G is a 4-tuple (V t,v n,s,p) where 1 V t is the set of terminal symbols (commonly the set of tokens). 2 V n is the set of non-terminal symbols. 3 S is a distinguished non-terminal symbol call the start symbol. 4 P is a finite set of productions (aka rewrite rules) of the form: non-terminal (non-terminal terminal). Note: All sets are finite. Left-hand side (lhs) of production consists of a single non-terminal symbol. Right-hand side (rhs) consists of arbitrary combinations of terminal and non-terminal symbol described by a regular expression. Martin Sulzmann Syntax Analysis 7 / 38

CFG Example and Notation Example E E +T T T T F F F (E) ident Notation Vocabulary V = V t V n a,b,c,... V t A,B,C,... V n α,β,γ,... V u,v,w,... V t Martin Sulzmann Syntax Analysis 8 / 38

CFG as a Rewrite System Derivations by Rewriting If A γ P then αaβ αγβ is a one-step derivation using A γ. We assume that denotes a derivation of 0 steps. In a leftmost derivation (denoted L ) we replace the leftmost non-terminal in each step. In a rightmost derivation (denoted R ) we replace the rightmost non-terminal in each step. If S β then β is a sentential form of G. If S w and w V t then w is a sentence of G. L(G) = {w S w,w V t } denotes the language described by G. Martin Sulzmann Syntax Analysis 9 / 38

Example Derivation CFG E E +T T T T F F F (E) ident Derivation E ident +(ident) E E +T T +T F +T ident +T ident +F ident +(E) ident +(T) ident +(F) ident +(ident) Martin Sulzmann Syntax Analysis 10 / 38

Example Derivation (2) CFG E E +E E E (E) 0... 9 Derivation E 3+7 2 E E +E 3+E 3+E E 3+7 E 3+7 2 Things to think about Multiple derivations for the same input word? How to represent derivations? Martin Sulzmann Syntax Analysis 11 / 38

Parse Trees Purpose Compact representation of a derivation E w. Most often in terms of a tree-like structure (but other formats like bit-encodings are also possible). Different from AST! Parse tree represents concrete input word. AST is an abstract representation of input. Martin Sulzmann Syntax Analysis 12 / 38

Parse Tree Example CFG + Derivation E E +E E E (E) 0... 9 E E +E 3+E 3+E E 3+7 E 3+7 2 Parse Tree E 8 8888888 E + E 3 E E 7 2 Martin Sulzmann Syntax Analysis 13 / 38

AST versus Parse Tree Parse tree E ( E ) E + E 1 2 AST A + 2 2222222 ident ident Parse tree: Maintain all source (concrete) syntax info. AST: Abstract representation. Martin Sulzmann Syntax Analysis 14 / 38

Ambiguity Definition Multiple parse trees for the same grammar and input word. Example E E +E E E (E) 0... 9 E E E E +E E 3+E E 3+7 E 3+7 2 Another parse tree: E E E E + E 2 3 7 Martin Sulzmann Syntax Analysis 15 / 38

Ambiguity Example Ambiguous Grammar E E +E E E (E) ident Equivalent Non-Ambiguous Grammar E E +T T E E +E T T F F E E E F (E) ident E (E) ident Fact Some context-free languages are inherently ambiguous. Martin Sulzmann Syntax Analysis 16 / 38

Ambiguity Example (2) Exercise Consider Stmt if Expr then Stmt if Expr then Stmt else Stmt other Ambiguous? Does an equivalent unambiguous grammar exist? Martin Sulzmann Syntax Analysis 17 / 38

Parsing Approaches: Top-Down versus Bottom-Up Example E E +E E E (E) 0... 9 Top-Down ( ANTLR ) Build left-most derivation (strictly reduce left-most non-terminal symbols). E E +E 3+E 3+E E 3+7 E 3+7 2 Guess which rule to apply from left to right. Bottom-Down ( yacc ) Build right-most derivation. E E +E E +E E E +E 2 E +7 2 3+7 2 Seek for matches of right-hand side, replace by left-hand side. Martin Sulzmann Syntax Analysis 18 / 38

Top-Down Parsing Example E E +T T T T F F F (E) ident Approach Build left-most derivation starting from start symbol, use target string to choose productions. Challenge: Left Recursion E E +T E +T +T... Several possible right-hand sides. Some lead to dead ends. Backtracking? Too inefficient. Martin Sulzmann Syntax Analysis 19 / 38

Removal of Left Recursion Example Consider E E +T T T T F F F (E) ident Remove left recursion, e.g. E TE E +TE ǫ Martin Sulzmann Syntax Analysis 20 / 38

Removal Left Recursion (2) Direct Left Recursion Removal Left recursion can be eliminated as follows: Grammar rules A Aα 1... Aα n β 1... β m are replaced by the equivalent rules A β 1 A... β m A A α 1 A... α n A ǫ Indirect Left Recursion Removal Consider A BC B Aβ Left recursion involves several rules. Can be eliminated (reduced to direct case) via unfolding. Martin Sulzmann Syntax Analysis 21 / 38

Left Factoring Issue Remove common prefix of right hand sides. Left Factoring Left factoring of A αβ 1... αβ n γ 1... γ m (where α ǫ is the longest common prefix) yields Example A αa γ 1... γ m A β 1... β n S iets ietses a E b transformed to S ietss a S ǫ es E b Martin Sulzmann Syntax Analysis 22 / 38

Top-Down Parsing: Short Summary Predictive Top-Down Parsing Eliminate left recursion. Apply left factoring. Deterministic. No backtracking. Lookahead? How far to lookahead in the input string to (deterministically) apply a rule? Observe right-hand sides! Martin Sulzmann Syntax Analysis 23 / 38

First (matching symbols) Intuition Which terminals can start a string matching the non-terminal? Definition First(α) = {a V t {ǫ} α aβ} Inductive Definition {x} if α = xα, x V t First(α) = First(x) if α = xα, ǫ First(x), x V n First(x) First(α ) if α = xα, ǫ First(x), x V n Sufficient to consider First(X) where X is a non-terminal. Martin Sulzmann Syntax Analysis 24 / 38

First Algorithm Algorithm Build First(X) as follows: 1 If X is a terminal then First(X) = {X}. 2 If X ǫ then add ǫ to First(X). 3 If X Y 1...Y k : 1 Put First(Y 1 ) {ǫ} in First(X). 2 i.1 < i <= k, if ǫ First(Y 1 )... First(Y i 1 ) (i.e. Y 1...Y i 1 ǫ) then put First(Y i ) {ǫ} in First(X). 3 If ǫ First(Y 1 )... First(Y k ) then add ǫ to First(X). Repeat until no more additions can be made. Martin Sulzmann Syntax Analysis 25 / 38

First Example Example E TE E +TE E ǫ T FT T FT T ǫ F ident F (E) First(E) = {ident,(} First(E ) = {+,ǫ} First(T) = {ident,(} First(T ) = {,ǫ} First(F) = {ident,(} Martin Sulzmann Syntax Analysis 26 / 38

Follow Issue Which terminals can start a string matching after the nonterminal? Consider ǫ right hand sides. Definition Follow(A) = {w First(β) S waβ} Algorithm Build Follow(A) as follows: 1 Add $ to Follow(S) where S is the start symbol. 2 If A αbβ: 1 Put First(β) {ǫ} in Follow(B) 2 If β = ǫ (i.e. A αb) or ǫ First(β) (i.e. β ǫ) then put Follow(A) in Follow(B). Repeat until no more additions can be made ($ = EOF token). Martin Sulzmann Syntax Analysis 27 / 38

First and Follow Example Example S BAa A ǫ BcA B ǫ b First Follow S a,b,c $ A ǫ,b,c a B ǫ,b a,b,c Martin Sulzmann Syntax Analysis 28 / 38

LL(1) Grammar definition A grammar G is LL(1) iff for each set of productions A α 1... α n : 1 First(α 1 ),...,First(α n ) are all pairwise disjoint. 2 If α i ǫ then First(α j ) Follow(A) = forall 1 <= j <= n,i j. If G is ǫ-free (there are no ǫ productions), first condition is sufficient. Facts 1 No left-recursive grammar is LL(1). 2 No ambiguous grammar is LL(1). 3 Some languages have no LL(1) grammar. Martin Sulzmann Syntax Analysis 29 / 38

LL(1) Grammar (cont d) Example S as a is not LL(1) because First(aS) = First(a) = {a}. However, the following equivalent grammar is LL(1). S as S as ǫ Lemma A grammar is LL(1) iff for each two productions A α and A β we have that Look(A α) Look(A β) = where Look(A α) = { First(α)\{ǫ} if ǫ Follow(α) (First(α)\{ǫ}) Follow(A) otherwise Martin Sulzmann Syntax Analysis 30 / 38

Recursive Descent Parsing Example E TE E +TE E ǫ T FT T FT T ǫ F ident F (E) First(E) = {ident,(} First(E ) = {+,ǫ} First(T) = {ident,(} First(T ) = {,ǫ} First(F) = {ident,(} Recursive Descent Parser Grammar is LL(1). Introduce a function for each non-terminal symbol. Martin Sulzmann Syntax Analysis 31 / 38

Recursive Descent Parser Parser Ignore the parse tree, so strictly speaking we only build a matcher. e() { t() ; e (); if token!= EOF then HALT; } e () { if token == PLUS then get_next_token(); t(); e (); } t() { f(); t (); } t () { if token == TIMES then get_next_token(); f(); t (); } f () { if token == IDENT then get_next_token(); else if token == OPENPAR then get_next_token(); e(); if token==closepar then get_next_token(); else HALT; } Martin Sulzmann Syntax Analysis 32 / 38

Table Driven Parsing Fact A program making use of recursive function can be simulated by a non-recursive program (by making use of a stack). Table Driven Approach Input Stack elements are elements in our voc Input consists of terminal symbols Table encodes LL(1) actions Stack LL(1) parser Parsing table M Martin Sulzmann Syntax Analysis 33 / 38

Parse Table Construction Algorithm Input: Grammar G Output: Parsing table M Method: 1 For each production A α: 1 For each a First(α) add A α to M[A,a]. 2 If ǫ First(α): 1 For each b Follow(A) add A α to M[A,b]. 2 If $ Follow(A), add A α to M[A,$]. 2 Set each undefined entry of M to error (or leave empty). If there are multiple entries then grammar is not LL(1). Martin Sulzmann Syntax Analysis 34 / 38

Table Driven Parsing Algorithm Push $; Push start symbol while Top of stack is not $ { X:= top of stack; a:= input symbol if X is a terminal then if X == a then pop X; advance input else error else if M[X,a] == X->Y1... Yk then pop X; push Yk,..., Y1 else error } Martin Sulzmann Syntax Analysis 35 / 38

Example First and Follow Sets E TE First(E) = {ident,(} Follow(E) = {$,)} E +TE First(E ) = {+,ǫ} Follow(E ) = {$,)} E ǫ T FT First(T) = {ident,(} Follow(T) = {$,+,)} T FT First(T ) = {,ǫ} Follow(T ) = {$,+,)} T ǫ F ident First(F) = {ident,(} Follow(F) = {$,+,,)} F (E) Martin Sulzmann Syntax Analysis 36 / 38

Parse Table + Execution ident + ( ) $ E E TE E TE E E +TE E ǫ E ǫ T T FT T FT T T ǫ T FT T ǫ T ǫ F F ident F (E) Stack Input Rule Stack Input Rule $E id + (id) E TE $E T )E id) E TE $E T id + (id) T FT $E T )E T id) T FT $E T F id + (id) F id $E T )E T F id) F id $E T id id + (id) $E T )E T id id) $E T +(id) T ǫ $E T )E T ) T ǫ $E +(id) E +TE $E T )E ) E ǫ $E T+ +(id) $E T ) ) $E T (id) T FT $E T $ T ǫ $E T F (id) F (E) $E $ E ǫ $E T )E( (id) $ $ Martin Sulzmann Syntax Analysis 37 / 38

Top-Down Parsing Summary Predictive parsing (removal left recursion, left factoring). LL(1) grammars. Recursive descent parsing. Table driven parsing. How to construct parse tree? (later) Error recovery important (but no time!), please consult text book. ANTRL state-of-the art LL-style parser for Java. Martin Sulzmann Syntax Analysis 38 / 38