Syntax Analysis Part I Chapter 4: Context-Free Grammars Slides adapted from: Robert van Engelen, Florida State University
Position of a Parser in the Compiler Model [Diagram: the source program feeds the lexical analyzer, which returns the next token (token, tokenval) each time the parser requests one; the parser and the rest of the front-end produce the intermediate representation and consult the symbol table; lexical, syntax, and semantic errors are reported along the way.]
The Parser A parser implements a context-free grammar Check syntax (= string recognizer) Report syntax errors accurately Invoke semantic actions For static semantics checking, e.g. type checking of expressions, functions, etc. For syntax-directed translation of the source code to an intermediate representation
Syntax-Directed Translation One of the major roles of the parser is to produce an intermediate representation (IR) of the source program using syntax-directed translation methods Possible IR output: Abstract syntax trees (ASTs) Three-address code (3AC) Register transfer list notation (RTN)
Error Handling A good compiler should assist in identifying and locating errors Lexical errors: important, compiler can easily recover and continue Syntax errors: most important for compiler, can almost always recover Static semantic errors: important, can sometimes recover Dynamic semantic errors: hard or impossible to detect at compile time, runtime checks are required Logical errors: hard or impossible to detect
Viable-Prefix Property The viable-prefix property of LL/LR parsers allows early detection of syntax errors Goal: detection of an error as soon as possible without consuming further unnecessary input How: detect an error as soon as the prefix of the input does not match a prefix of any string in the language Examples: in for (;) the error is detected at the closing parenthesis; in DO 10 I = 1;0 the error is detected at the semicolon
Error Recovery Strategies Panic mode Discard input until a token in a set of designated synchronizing tokens is found Phrase-level recovery Perform local correction on the input to repair the error Error productions Augment grammar with productions for erroneous constructs Global correction Choose a minimal sequence of changes to obtain a global least-cost correction
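As a rough illustration of panic mode, the skip-to-synchronizing-token step can be sketched as follows (the token stream and the synchronizing set are made-up examples, not taken from any particular compiler):

```python
# Minimal sketch of panic-mode recovery: on a syntax error, discard input
# until a token from the designated synchronizing set is found, then let
# the parser resume there. SYNC_TOKENS is an assumed, illustrative set.
SYNC_TOKENS = {';', '}', 'EOF'}

def panic_mode_recover(tokens, pos):
    """Skip input starting at `pos` until a synchronizing token appears."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    return pos  # parsing resumes at the synchronizing token

tokens = ['id', '+', '+', 'id', ';', 'id']
resume = panic_mode_recover(tokens, 2)   # error detected at the second '+'
print(resume, tokens[resume])            # resumes at the ';'
```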
Context-Free Grammar: How It Works Write a grammar representing the structure of a thesis A thesis consists of a thesis title followed by one or more chapters A chapter consists of a chapter title followed by one or more sections A section consists of a section title followed by one or more lines of text
How It Works T t-title Cs Cs C C Cs C c-title Ss Ss S S Ss S s-title Ls Ls line line Ls
How It Works [Parse tree: the root T has children t-title and Cs; Cs expands into one or more chapters C, each with children c-title and Ss; each section S has children s-title and Ls, and Ls expands into one or more line leaves.]
Context-Free Grammar A context-free grammar (CFG) is a 4-tuple G = (N, T, P, S) where T is a finite set of tokens (terminal symbols) N is a finite set of nonterminals P is a finite set of productions of the form A → α where A ∈ N and α ∈ (N ∪ T)* S ∈ N is a designated start symbol
Example G = ({E, T, F}, {+, -, *, /, (, ), id}, P, E) Productions in P: E → E + T | E - T | T T → T * F | T / F | F F → ( E ) | id
Notational Conventions Terminals a, b, c, … ∈ T specific terminals: 0, 1, id, + Nonterminals A, B, C, … ∈ N specific nonterminals: expr, term, stmt
Notational Conventions Grammar symbols X, Y, Z ∈ (N ∪ T) Strings of terminals u, v, w, x, y, z ∈ T* Strings of grammar symbols α, β, γ ∈ (N ∪ T)*
Derivations Given a CFG we can determine the set of all strings (sequences of tokens) generated by the grammar using derivation We begin with the start symbol In each step, we replace one nonterminal in the current sentential form with one of the right-hand sides of a production for that nonterminal
Derivations Mathematically, the one-step derivation ⇒ is a binary relation defined by α A β ⇒ α γ β where A → γ is a production in the grammar
Derivations In addition, we define ⇒ is leftmost (⇒lm) if α does not contain a nonterminal ⇒ is rightmost (⇒rm) if β does not contain a nonterminal Reflexive-transitive closure ⇒* (zero or more steps) Positive closure ⇒+ (one or more steps) The language generated by G is defined by L(G) = { w ∈ T* | S ⇒+ w }
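Since L(G) is defined by the derivation relation, its shortest sentences can be enumerated mechanically: start from S, repeatedly apply ⇒ breadth-first, and keep the sentential forms that contain no nonterminal. A sketch for the illustrative grammar S → a S b | a b:

```python
from collections import deque

# Enumerate the shortest sentences of L(G) by breadth-first search over
# sentential forms, applying the one-step derivation relation.
# Example grammar: S -> a S b | a b, so L(G) = { a^n b^n | n >= 1 }.
grammar = {'S': [['a', 'S', 'b'], ['a', 'b']]}

def language(start, max_len):
    seen, out = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if len(form) > max_len or form in seen:
            continue
        seen.add(form)
        nts = [i for i, x in enumerate(form) if x in grammar]
        if not nts:                  # no nonterminal left: a sentence of L(G)
            out.add(''.join(form))
            continue
        i = nts[0]                   # expanding leftmost only is enough here
        for rhs in grammar[form[i]]:
            queue.append(form[:i] + tuple(rhs) + form[i + 1:])
    return sorted(out, key=len)

print(language('S', 6))
```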
Example Grammar G = ({E}, {+, *, (, ), -, id}, P, E) with productions P = E → E + E | E * E | ( E ) | - E | id Example derivations: E ⇒ - E ⇒ - id E ⇒rm E + E ⇒rm E + id ⇒rm id + id E ⇒ E * E ⇒ E * id ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id (a derivation that is neither leftmost nor rightmost)
Exercise Which of the strings are in the language of the given CFG? abcba acca aba abcbcba S → a X a X → ε | b Y Y → ε | c X c
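One way to check such questions mechanically is a brute-force derivation search that prunes any sentential form already longer than the target string; this is only a sketch for this tiny grammar, not a practical parser:

```python
# Brute-force membership check for the exercise grammar
#   S -> a X a,  X -> eps | b Y,  Y -> eps | c X c
grammar = {'S': [['a', 'X', 'a']],
           'X': [[], ['b', 'Y']],
           'Y': [[], ['c', 'X', 'c']]}

def derives(target):
    stack = [('S',)]
    seen = set()
    while stack:
        form = stack.pop()
        terminals = [x for x in form if x not in grammar]
        if len(terminals) > len(target) or form in seen:
            continue                 # already too long (or seen): prune
        seen.add(form)
        if all(x not in grammar for x in form):
            if ''.join(form) == target:
                return True
            continue
        i = next(j for j, x in enumerate(form) if x in grammar)
        for rhs in grammar[form[i]]:
            stack.append(form[:i] + tuple(rhs) + form[i + 1:])
    return False

for w in ['abcba', 'acca', 'aba', 'abcbcba']:
    print(w, derives(w))
```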
Parse Trees The root of the tree is labeled by the start symbol Each leaf of the tree is labeled by a terminal (= token) or ε Each interior node is labeled by a nonterminal If A → X1 X2 … Xn is a production, then node A has immediate children X1, X2, …, Xn where each Xi is a (non)terminal or ε
Example E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id ) [Parse tree: E has children - and E; that E has children (, E, ); the parenthesized E has children E, +, E, which derive id and id.]
Ambiguity An ambiguous grammar produces more than one leftmost derivation (or more than one parse tree) for the same sentence Consider the string id + id * id and the productions E → E + E, E → E * E, E → id E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
Ambiguity Different parse trees for the same sentence correspond to different interpretations, in this case for the precedence of the arithmetic operators [Two parse trees for id + id * id: in one, + is at the root with the * subtree below it, so multiplication binds tighter; in the other, * is at the root with the + subtree below it, so addition binds tighter.]
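Ambiguity on a particular sentence can also be exposed by counting its parse trees: an unambiguous grammar yields exactly one. A memoized counting sketch for E → E + E | E * E | id on id + id * id:

```python
from functools import lru_cache

# Count the parse trees deriving each span of the input from E, under
# E -> E + E | E * E | id (CKY-style split on each operator position).
TOKENS = ('id', '+', 'id', '*', 'id')

@lru_cache(maxsize=None)
def count_trees(i, j):
    """Number of parse trees deriving TOKENS[i:j] from E."""
    total = 0
    if TOKENS[i:j] == ('id',):
        total += 1                               # E -> id
    for k in range(i + 1, j - 1):                # split at an operator
        if TOKENS[k] in ('+', '*'):              # E -> E + E | E * E
            total += count_trees(i, k) * count_trees(k + 1, j)
    return total

print(count_trees(0, len(TOKENS)))  # 2: the two trees shown above
```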
Exercise Which of the following CFGs are ambiguous? S → S S | a | b E → E + E | id S → S a | S b | ε E → E + E | E - E | id | ( E )
Chomsky Hierarchy: Language Classification A grammar G is said to be Regular if it is right-linear, where each production is of the form A → w B or A → w, or left-linear, where each production is of the form A → B w or A → w (with w a string of terminals) Context-free if each production is of the form A → α where A ∈ N and α ∈ (N ∪ T)* Context-sensitive if each production is of the form α A β → α γ β where A ∈ N, α, β, γ ∈ (N ∪ T)*, |γ| > 0 Unrestricted otherwise
Chomsky Hierarchy L(regular) ⊂ L(context-free) ⊂ L(context-sensitive) ⊂ L(unrestricted) where L(T) = { L(G) | G is of type T } That is: the set of all languages generated by grammars G of type T Examples: Every finite language is regular! (construct an FSA that accepts exactly the strings in L(G)) L1 = { a^n b^n | n ≥ 1 } is context-free L2 = { a^n b^n c^n | n ≥ 1 } is context-sensitive
Parsing Parsing is the process of Determining whether a string of tokens can be generated by a grammar Producing a parse tree (or, for ambiguous grammars, a parse forest) for the string Top-down parsing constructs a parse tree from the root to the leaves Bottom-up parsing constructs a parse tree from the leaves to the root
Parsing Universal parsing algorithms work for any CFG Recursive descent uses backtracking and takes exponential time Tabular methods take O(n^3) time to parse a string of n tokens Cocke-Younger-Kasami (CYK) Earley
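The CYK tabular method fills a triangular table of nonterminal sets, one per input span, and requires the grammar in Chomsky normal form; the sketch below uses a made-up CNF grammar for { a^n b^n | n ≥ 1 }:

```python
from itertools import product

# CYK recognizer (O(n^3)) — the grammar must be in Chomsky normal form.
# Example CNF grammar generating { a^n b^n | n >= 1 }:
#   S -> A B | A C,  C -> S B,  A -> a,  B -> b
binary = {('A', 'B'): {'S'}, ('A', 'C'): {'S'}, ('S', 'B'): {'C'}}
unary = {'a': {'A'}, 'b': {'B'}}

def cyk(w, start='S'):
    n = len(w)
    if n == 0:
        return False
    # table[i][j] = set of nonterminals deriving the span w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = set(unary.get(ch, ()))
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            for k in range(1, length):        # split point inside the span
                for B, C in product(table[i][k - 1],
                                    table[i + k][length - k - 1]):
                    table[i][length - 1] |= binary.get((B, C), set())
    return start in table[0][n - 1]

print([cyk(w) for w in ['ab', 'aabb', 'aab', 'abab']])
```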
Parsing CFGs for programming languages are restricted (unambiguous, etc.) and can be parsed in linear time Two main family of algorithms LL parsing uses top-down strategy LR parsing uses bottom-up strategy
Push-Down Automata A push-down automaton (PDA) implements a context-free grammar Reads the input left to right from a buffer Uses an auxiliary store called a stack, which allows push and pop operations
Push-Down Automata A configuration of the PDA completely describes the state of the computation, and is a pair (σ, β) where σ is the stack and β is the buffer A transition, or action, takes the PDA from one configuration to the next A computation of the PDA is a sequence of configurations obtained by applying actions
Top-Down PDA 1. Predict: for each production A → X1 X2 … Xn, if A is at the top of the stack, replace it with Xn Xn-1 … X2 X1 (so that X1 ends up on top) 2. Match: if terminal symbol a is both the first symbol of the buffer and the topmost symbol of the stack, remove both symbols 3. The initial configuration is (S, a1 a2 … an) 4. The final configuration is (ε, ε)
Example Grammar: 1. E → E + T  2. E → T  3. T → T * F  4. T → F  5. F → ( E )  6. F → id

Stack (top at right)   Buffer     Action
E                      id*id+id   predict 1
T + E                  id*id+id   predict 2
T + T                  id*id+id   predict 3
T + F * T              id*id+id   predict 4
T + F * F              id*id+id   predict 6
T + F * id             id*id+id   match
T + F *                *id+id     match
T + F                  id+id      predict 6
T + id                 id+id      match
T +                    +id        match
T                      id         predict 4
F                      id         predict 6
id                     id         match
ε                      ε          accept
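The nondeterminism in the predict step can be resolved by brute force: try every applicable production and backtrack on failure, which is why this naive strategy takes exponential time in the worst case. A sketch for the grammar above (the pruning relies on every grammar symbol of this grammar deriving at least one token):

```python
# Backtracking simulation of the nondeterministic top-down PDA.
# Grammar: 1. E->E+T  2. E->T  3. T->T*F  4. T->F  5. F->(E)  6. F->id
RULES = [('E', ['E', '+', 'T']), ('E', ['T']), ('T', ['T', '*', 'F']),
         ('T', ['F']), ('F', ['(', 'E', ')']), ('F', ['id'])]
NONTERMINALS = {'E', 'T', 'F'}

def accepts(tokens):
    def search(stack, pos):
        # every symbol of this grammar derives at least one token, so a
        # stack taller than the remaining input can never be emptied
        if len(stack) > len(tokens) - pos:
            return False
        if not stack:
            return pos == len(tokens)        # final configuration (eps, eps)
        top = stack[-1]                      # top of stack is the last element
        if top in NONTERMINALS:              # predict: try every production
            return any(search(stack[:-1] + rhs[::-1], pos)
                       for lhs, rhs in RULES if lhs == top)
        # match: terminal on top of stack must equal next input token
        return (pos < len(tokens) and tokens[pos] == top
                and search(stack[:-1], pos + 1))
    return search(['E'], 0)

print(accepts(['id', '*', 'id', '+', 'id']),
      accepts(['id', '+', '*', 'id']))
```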
Bottom-Up PDA 1. Reduce: for each production A → X1 X2 … Xn, if X1 X2 … Xn is at the top of the stack, replace it with A 2. Shift: remove the first terminal symbol a from the buffer and push it onto the stack 3. The initial configuration is (ε, a1 a2 … an) 4. The final configuration is (S, ε)
Example Grammar: 1. E → E + T  2. E → T  3. T → T * F  4. T → F  5. F → ( E )  6. F → id

Stack (top at right)   Buffer     Action
ε                      id*id+id   shift
id                     *id+id     reduce 6
F                      *id+id     reduce 4
T                      *id+id     shift
T *                    id+id      shift
T * id                 +id        reduce 6
T * F                  +id        reduce 3
T                      +id        reduce 2
E                      +id        shift
E +                    id         shift
E + id                 ε          reduce 6
E + F                  ε          reduce 4
E + T                  ε          reduce 1
E                      ε          accept
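The same backtracking idea simulates the bottom-up PDA: at each configuration, try every possible reduce and then a shift. This is only a sketch of the nondeterministic machine; as the next slide notes, LR parsers avoid the search by consulting a parse table instead:

```python
# Backtracking simulation of the nondeterministic bottom-up PDA
# for the same grammar: 1. E->E+T 2. E->T 3. T->T*F 4. T->F 5. F->(E) 6. F->id
RULES = [('E', ('E', '+', 'T')), ('E', ('T',)), ('T', ('T', '*', 'F')),
         ('T', ('F',)), ('F', ('(', 'E', ')')), ('F', ('id',))]

def accepts(tokens):
    seen = set()                                 # memoize failing configs
    def search(stack, pos):
        if (stack, pos) in seen:
            return False
        seen.add((stack, pos))
        if stack == ('E',) and pos == len(tokens):
            return True                          # final configuration (S, eps)
        for lhs, rhs in RULES:                   # reduce: rhs on top of stack
            if stack[-len(rhs):] == rhs:
                if search(stack[:-len(rhs)] + (lhs,), pos):
                    return True
        if pos < len(tokens):                    # shift the next token
            return search(stack + (tokens[pos],), pos + 1)
        return False
    return search((), 0)

print(accepts(('id', '*', 'id', '+', 'id')),
      accepts(('id', '+', '*', 'id')))
```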
LL and LR Parsing The top-down and bottom-up PDAs are nondeterministic: several actions might be possible in a given configuration Most of LL and LR parsing can be understood as the previous PDAs extended with an oracle that provides the correct action at each step