Syntax Analysis Part I Chapter 4: Context-Free Grammars Slides adapted from: Robert van Engelen, Florida State University
Position of a Parser in the Compiler Model [Diagram: the source program feeds the lexical analyzer, which returns the next token (token, tokenval) each time the parser requests one; the parser and the rest of the front-end produce the intermediate representation and consult the symbol table; lexical, syntax, and semantic errors are reported along the way.]
The Parser A parser implements a context-free grammar Check syntax (= string recognizer) Report syntax errors accurately Invoke semantic actions For static semantics checking, e.g. type checking of expressions, functions, etc. For syntax-directed translation of the source code to an intermediate representation
Syntax-Directed Translation One of the major roles of the parser is to produce an intermediate representation (IR) of the source program using syntax-directed translation methods Possible IR output: Abstract syntax trees (ASTs) Three-address code (3AC) Register transfer list notation (RTN)
Error Handling A good compiler should assist in identifying and locating errors Lexical errors: important, compiler can easily recover and continue Syntax errors: most important for compiler, can almost always recover Static semantic errors: important, can sometimes recover Dynamic semantic errors: hard or impossible to detect at compile time, runtime checks are required Logical errors: hard or impossible to detect
Viable-Prefix Property The viable-prefix property of LL/LR parsers allows early detection of syntax errors Goal: detection of an error as soon as possible without consuming further unnecessary input How: detect an error as soon as the prefix of the input does not match a prefix of any string in the language Examples: in for (;) the error is detected at the closing parenthesis; in DO 10 I = 1;0 the error is detected at the semicolon
Error Recovery Strategies Panic mode Discard input until a token in a set of designated synchronizing tokens is found Phrase-level recovery Perform local correction on the input to repair the error Error productions Augment grammar with productions for erroneous constructs Global correction Choose a minimal sequence of changes to obtain a global least-cost correction
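As a rough illustration of panic mode, the skip-to-synchronizing-token step can be sketched as follows (the token stream and the synchronizing set are made-up examples, not taken from any particular compiler):

```python
# Minimal sketch of panic-mode recovery: on a syntax error, discard input
# until a token from the designated synchronizing set is found, then let
# the parser resume there. SYNC_TOKENS is an assumed, illustrative set.
SYNC_TOKENS = {';', '}', 'EOF'}

def panic_mode_recover(tokens, pos):
    """Skip input starting at `pos` until a synchronizing token appears."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    return pos  # parsing resumes at the synchronizing token

tokens = ['id', '+', '+', 'id', ';', 'id']
resume = panic_mode_recover(tokens, 2)   # error detected at the second '+'
print(resume, tokens[resume])            # resumes at the ';'
```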
Context-Free Grammar: How It Works Write a grammar representing the structure of a thesis A thesis consists of a thesis title followed by one or more chapters A chapter consists of a chapter title followed by one or more sections A section consists of a section title followed by one or more lines of text
How It Works T t-title Cs Cs C C Cs C c-title Ss Ss S S Ss S s-title Ls Ls line line Ls
How It Works [Parse tree: the root T has children t-title and Cs; Cs expands into one or more chapters C, each with children c-title and Ss; each section S has children s-title and Ls, and Ls expands into one or more line leaves.]
Context-Free Grammar A context-free grammar (CFG) is a 4-tuple G = (N, T, P, S) where T is a finite set of tokens (terminal symbols) N is a finite set of nonterminals P is a finite set of productions of the form A → α where A ∈ N and α ∈ (N ∪ T)* S ∈ N is a designated start symbol
Example G = ({E, T, F}, {+, -, *, /, (, ), id}, P, E) Productions in P: E → E + T | E - T | T T → T * F | T / F | F F → ( E ) | id
Notational Conventions Terminals a, b, c, … ∈ T specific terminals: 0, 1, id, + Nonterminals A, B, C, … ∈ N specific nonterminals: expr, term, stmt
Notational Conventions Grammar symbols X, Y, Z ∈ (N ∪ T) Strings of terminals u, v, w, x, y, z ∈ T* Strings of grammar symbols α, β, γ ∈ (N ∪ T)*
Derivations Given a CFG we can determine the set of all strings (sequences of tokens) generated by the grammar using derivation We begin with the start symbol In each step, we replace one nonterminal in the current sentential form with one of the right-hand sides of a production for that nonterminal
Derivations Mathematically, the one-step derivation ⇒ is a binary relation defined by α A β ⇒ α γ β where A → γ is a production in the grammar
Derivations In addition, we define ⇒ is leftmost (⇒lm) if α does not contain a nonterminal ⇒ is rightmost (⇒rm) if β does not contain a nonterminal Reflexive-transitive closure ⇒* (zero or more steps) Positive closure ⇒+ (one or more steps) The language generated by G is defined by L(G) = { w ∈ T* | S ⇒+ w }
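Since L(G) is defined by the derivation relation, its shortest sentences can be enumerated mechanically: start from S, repeatedly apply ⇒ breadth-first, and keep the sentential forms that contain no nonterminal. A sketch for the illustrative grammar S → a S b | a b:

```python
from collections import deque

# Enumerate the shortest sentences of L(G) by breadth-first search over
# sentential forms, applying the one-step derivation relation.
# Example grammar: S -> a S b | a b, so L(G) = { a^n b^n | n >= 1 }.
grammar = {'S': [['a', 'S', 'b'], ['a', 'b']]}

def language(start, max_len):
    seen, out = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if len(form) > max_len or form in seen:
            continue
        seen.add(form)
        nts = [i for i, x in enumerate(form) if x in grammar]
        if not nts:                  # no nonterminal left: a sentence of L(G)
            out.add(''.join(form))
            continue
        i = nts[0]                   # expanding leftmost only is enough here
        for rhs in grammar[form[i]]:
            queue.append(form[:i] + tuple(rhs) + form[i + 1:])
    return sorted(out, key=len)

print(language('S', 6))
```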
Example Grammar G = ({E}, {+, *, (, ), -, id}, P, E) with productions P = E → E + E | E * E | ( E ) | - E | id Example derivations: E ⇒ - E ⇒ - id E ⇒rm E + E ⇒rm E + id ⇒rm id + id E ⇒ E * E ⇒ E * id ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id (a derivation that is neither leftmost nor rightmost)
Exercise Which of the strings are in the language of the given CFG? abcba acca aba abcbcba S → a X a X → ε | b Y Y → ε | c X c
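One way to check such questions mechanically is a brute-force derivation search that prunes any sentential form already longer than the target string; this is only a sketch for this tiny grammar, not a practical parser:

```python
# Brute-force membership check for the exercise grammar
#   S -> a X a,  X -> eps | b Y,  Y -> eps | c X c
grammar = {'S': [['a', 'X', 'a']],
           'X': [[], ['b', 'Y']],
           'Y': [[], ['c', 'X', 'c']]}

def derives(target):
    stack = [('S',)]
    seen = set()
    while stack:
        form = stack.pop()
        terminals = [x for x in form if x not in grammar]
        if len(terminals) > len(target) or form in seen:
            continue                 # already too long (or seen): prune
        seen.add(form)
        if all(x not in grammar for x in form):
            if ''.join(form) == target:
                return True
            continue
        i = next(j for j, x in enumerate(form) if x in grammar)
        for rhs in grammar[form[i]]:
            stack.append(form[:i] + tuple(rhs) + form[i + 1:])
    return False

for w in ['abcba', 'acca', 'aba', 'abcbcba']:
    print(w, derives(w))
```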
Parse Trees The root of the tree is labeled by the start symbol Each leaf of the tree is labeled by a terminal (= token) or ε Each interior node is labeled by a nonterminal If A → X1 X2 … Xn is a production, then node A has immediate children X1, X2, …, Xn where each Xi is a (non)terminal or ε
Example E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id ) [Parse tree: E has children - and E; that E has children (, E, ); the parenthesized E has children E, +, E, which derive id and id.]
Ambiguity An ambiguous grammar produces more than one leftmost derivation (or more than one parse tree) for the same sentence Consider the string id + id * id and the productions E → E + E, E → E * E, E → id E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
Ambiguity Different parse trees for the same sentence correspond to different interpretations, in this case for the precedence of the arithmetic operators [Two parse trees for id + id * id: in one, + is at the root with the * subtree below it, so multiplication binds tighter; in the other, * is at the root with the + subtree below it, so addition binds tighter.]
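Ambiguity on a particular sentence can also be exposed by counting its parse trees: an unambiguous grammar yields exactly one. A memoized counting sketch for E → E + E | E * E | id on id + id * id:

```python
from functools import lru_cache

# Count the parse trees deriving each span of the input from E, under
# E -> E + E | E * E | id (CKY-style split on each operator position).
TOKENS = ('id', '+', 'id', '*', 'id')

@lru_cache(maxsize=None)
def count_trees(i, j):
    """Number of parse trees deriving TOKENS[i:j] from E."""
    total = 0
    if TOKENS[i:j] == ('id',):
        total += 1                               # E -> id
    for k in range(i + 1, j - 1):                # split at an operator
        if TOKENS[k] in ('+', '*'):              # E -> E + E | E * E
            total += count_trees(i, k) * count_trees(k + 1, j)
    return total

print(count_trees(0, len(TOKENS)))  # 2: the two trees shown above
```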
Exercise Which of the following CFGs are ambiguous? S → S S | a | b E → E + E | id S → S a | S b | ε E → E + E | E - E | id | ( E )
Chomsky Hierarchy: Language Classification A grammar G is said to be Regular if it is right-linear, where each production is of the form A → w B or A → w, or left-linear, where each production is of the form A → B w or A → w (with w a string of terminals) Context-free if each production is of the form A → α where A ∈ N and α ∈ (N ∪ T)* Context-sensitive if each production is of the form α A β → α γ β where A ∈ N, α, β, γ ∈ (N ∪ T)*, |γ| > 0 Unrestricted otherwise
Chomsky Hierarchy L(regular) ⊂ L(context-free) ⊂ L(context-sensitive) ⊂ L(unrestricted) where L(T) = { L(G) | G is of type T } That is: the set of all languages generated by grammars G of type T Examples: Every finite language is regular! (construct an FSA that accepts exactly the strings in L(G)) L1 = { a^n b^n | n ≥ 1 } is context-free L2 = { a^n b^n c^n | n ≥ 1 } is context-sensitive
Parsing Parsing is the process of Determining whether a string of tokens can be generated by a grammar Producing a parse tree (or, for ambiguous grammars, a parse forest) for the string Top-down parsing constructs a parse tree from the root to the leaves Bottom-up parsing constructs a parse tree from the leaves to the root
Parsing Universal parsing algorithms work for any CFG Recursive descent uses backtracking and takes exponential time Tabular methods take O(n^3) time to parse a string of n tokens Cocke-Younger-Kasami (CYK) Earley
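The CYK tabular method fills a triangular table of nonterminal sets, one per input span, and requires the grammar in Chomsky normal form; the sketch below uses a made-up CNF grammar for { a^n b^n | n ≥ 1 }:

```python
from itertools import product

# CYK recognizer (O(n^3)) — the grammar must be in Chomsky normal form.
# Example CNF grammar generating { a^n b^n | n >= 1 }:
#   S -> A B | A C,  C -> S B,  A -> a,  B -> b
binary = {('A', 'B'): {'S'}, ('A', 'C'): {'S'}, ('S', 'B'): {'C'}}
unary = {'a': {'A'}, 'b': {'B'}}

def cyk(w, start='S'):
    n = len(w)
    if n == 0:
        return False
    # table[i][j] = set of nonterminals deriving the span w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = set(unary.get(ch, ()))
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            for k in range(1, length):        # split point inside the span
                for B, C in product(table[i][k - 1],
                                    table[i + k][length - k - 1]):
                    table[i][length - 1] |= binary.get((B, C), set())
    return start in table[0][n - 1]

print([cyk(w) for w in ['ab', 'aabb', 'aab', 'abab']])
```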
Parsing CFGs for programming languages are restricted (unambiguous, etc.) and can be parsed in linear time Two main family of algorithms LL parsing uses top-down strategy LR parsing uses bottom-up strategy
Push-Down Automata A push-down automaton (PDA) implements a context-free grammar Reads the input left to right from a buffer Uses an auxiliary store called a stack, which allows push and pop operations
Push-Down Automata A configuration of the PDA completely describes the state of the computation, and is a pair (σ, β) where σ is the stack and β is the buffer A transition, or action, takes the PDA from one configuration to the next A computation of the PDA is a sequence of configurations obtained by applying actions
Top-Down PDA 1. Predict: for each production A → X1 X2 … Xn, if A is at the top of the stack, replace it with Xn Xn-1 … X2 X1 (so that X1 ends up on top) 2. Match: if terminal symbol a is both the first symbol of the buffer and the topmost symbol of the stack, remove both symbols 3. The initial configuration is (S, a1 a2 … an) 4. The final configuration is (ε, ε)
Example Grammar: 1. E → E + T  2. E → T  3. T → T * F  4. T → F  5. F → ( E )  6. F → id

Stack (top at right)   Buffer     Action
E                      id*id+id   predict 1
T + E                  id*id+id   predict 2
T + T                  id*id+id   predict 3
T + F * T              id*id+id   predict 4
T + F * F              id*id+id   predict 6
T + F * id             id*id+id   match
T + F *                *id+id     match
T + F                  id+id      predict 6
T + id                 id+id      match
T +                    +id        match
T                      id         predict 4
F                      id         predict 6
id                     id         match
ε                      ε          accept
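The nondeterminism in the predict step can be resolved by brute force: try every applicable production and backtrack on failure, which is why this naive strategy takes exponential time in the worst case. A sketch for the grammar above (the pruning relies on every grammar symbol of this grammar deriving at least one token):

```python
# Backtracking simulation of the nondeterministic top-down PDA.
# Grammar: 1. E->E+T  2. E->T  3. T->T*F  4. T->F  5. F->(E)  6. F->id
RULES = [('E', ['E', '+', 'T']), ('E', ['T']), ('T', ['T', '*', 'F']),
         ('T', ['F']), ('F', ['(', 'E', ')']), ('F', ['id'])]
NONTERMINALS = {'E', 'T', 'F'}

def accepts(tokens):
    def search(stack, pos):
        # every symbol of this grammar derives at least one token, so a
        # stack taller than the remaining input can never be emptied
        if len(stack) > len(tokens) - pos:
            return False
        if not stack:
            return pos == len(tokens)        # final configuration (eps, eps)
        top = stack[-1]                      # top of stack is the last element
        if top in NONTERMINALS:              # predict: try every production
            return any(search(stack[:-1] + rhs[::-1], pos)
                       for lhs, rhs in RULES if lhs == top)
        # match: terminal on top of stack must equal next input token
        return (pos < len(tokens) and tokens[pos] == top
                and search(stack[:-1], pos + 1))
    return search(['E'], 0)

print(accepts(['id', '*', 'id', '+', 'id']),
      accepts(['id', '+', '*', 'id']))
```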
Bottom-Up PDA 1. Reduce: for each production A → X1 X2 … Xn, if X1 X2 … Xn is at the top of the stack, replace it with A 2. Shift: remove the first terminal symbol a from the buffer and push it onto the stack 3. The initial configuration is (ε, a1 a2 … an) 4. The final configuration is (S, ε)
Example Grammar: 1. E → E + T  2. E → T  3. T → T * F  4. T → F  5. F → ( E )  6. F → id

Stack (top at right)   Buffer     Action
ε                      id*id+id   shift
id                     *id+id     reduce 6
F                      *id+id     reduce 4
T                      *id+id     shift
T *                    id+id      shift
T * id                 +id        reduce 6
T * F                  +id        reduce 3
T                      +id        reduce 2
E                      +id        shift
E +                    id         shift
E + id                 ε          reduce 6
E + F                  ε          reduce 4
E + T                  ε          reduce 1
E                      ε          accept
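The same backtracking idea simulates the bottom-up PDA: at each configuration, try every possible reduce and then a shift. This is only a sketch of the nondeterministic machine; as the next slide notes, LR parsers avoid the search by consulting a parse table instead:

```python
# Backtracking simulation of the nondeterministic bottom-up PDA
# for the same grammar: 1. E->E+T 2. E->T 3. T->T*F 4. T->F 5. F->(E) 6. F->id
RULES = [('E', ('E', '+', 'T')), ('E', ('T',)), ('T', ('T', '*', 'F')),
         ('T', ('F',)), ('F', ('(', 'E', ')')), ('F', ('id',))]

def accepts(tokens):
    seen = set()                                 # memoize failing configs
    def search(stack, pos):
        if (stack, pos) in seen:
            return False
        seen.add((stack, pos))
        if stack == ('E',) and pos == len(tokens):
            return True                          # final configuration (S, eps)
        for lhs, rhs in RULES:                   # reduce: rhs on top of stack
            if stack[-len(rhs):] == rhs:
                if search(stack[:-len(rhs)] + (lhs,), pos):
                    return True
        if pos < len(tokens):                    # shift the next token
            return search(stack + (tokens[pos],), pos + 1)
        return False
    return search((), 0)

print(accepts(('id', '*', 'id', '+', 'id')),
      accepts(('id', '+', '*', 'id')))
```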
LL and LR Parsing The top-down and bottom-up PDAs are nondeterministic: several actions might be possible in a given configuration Most of LL and LR parsing can be understood as the previous PDAs extended with an oracle that provides the correct action at each step