Compiler Design Concepts. Syntax Analysis


Introduction The first task is to break the source text into meaningful words called tokens. For example, the source line newval = oldval + 12 is converted by Lexical Analysis into the token stream id = id + num. The order of the tokens is not important at this stage; a line such as 12 + oldval = newval will also be accepted, because the Lexical Analyzer's purpose is simply to extract the tokens. The lexemes are recorded in the symbol table:

Token   Lexeme
id      newval
id      oldval
num     12

The only requirement is that there be no character combination that cannot pass as a token, e.g., 12oldval.

Syntax After verifying that there is no lexical error, it is time to check the order of the tokens. The Syntax Analysis phase receives the token stream id = id + num and should be able to say whether this arrangement is valid or not. Observe that the actual lexemes are not used here: Syntax Analysis is not interested in whether the statement was oldval = newval + 12 or newval = oldval + 12. Only the structure is important, just as Lexical Analysis was not interested in the order of the tokens.

Syntax But the compiler must not forget the lexemes; they will be used later. Each token in the stream id = id + num therefore carries a pointer to its symbol table entry:

Token   Lexeme
id      oldval
id      newval
num     12

Syntax Okay, now, how do we check whether the syntax is correct or not? There must be some rules defined which specify which combinations are valid. These rules are written down as formulas called productions, for example

S → id = id + num

This means that if the combination id = id + num occurs, it can be called a statement, symbolized as S. So to check the token stream, Syntax Analysis sees whether id = id + num matches such a production, and then whether S fits into the total scheme.

Syntax Most constructs in programming languages are easily expressed by Context Free Grammars (CFGs). Under a CFG, a program is seen as built from syntactic categories arranged in a proper order, much as natural languages are built from parts of speech. Examples of syntactic categories are expressions, statements, declarations, etc. Each syntactic category is a valid arrangement of tokens; a syntactic category can also be made of other syntactic categories, bottoming out in tokens. Syntactic categories are designated as non terminals. Recall that a non terminal can be derived into any combination of terminals and non terminals, but eventually it should be all tokens.

Syntax The entire source program can be considered one syntactic category, i.e., a non terminal, say P. A statement (of whatever type) can also be considered a syntactic category, i.e., a non terminal, say S. So, as a rule, we can write

P → S; S;

Now S, i.e., a statement, can have various expansions. For example, an assignment statement can look like

S → id := id + id * number ;

Syntax Let's take another string: myval = newval * 10. It will be converted to the token stream id = id * num. If there is another production

S → id = id * num

then this combination will also be considered valid.

Syntax The source code

newval=oldval+12; myval=newval*10;

is converted by Lexical Analysis into the token stream

id = id + num ; id = id * num ;

With the productions S → id = id + num and S → id = id * num, the stream will be reduced to S;S;. We can then check whether S;S; is valid: it will be, if there is a production P → S;S;. But combinations like S+S or S*S will not be valid.

Symbol Table:
Token   Lexeme
id      newval
id      oldval
num     12
id      myval
num     10

Syntax So, any combination of tokens that can be reduced, meaning one that exists on the right hand side of a production, is valid. But there are infinitely many valid combinations, e.g.,

id = id - id
id = id * id
id = id + id - id
id = id + id - num
id = id * id - id
...

It is impossible to list them all. We must have a limited set of rules from which all valid combinations can be generated. Just like English grammar: a finite number of words, but infinitely many combinations, that is, infinitely many sentences.

Syntax
This is the house that Jack built.
This is the malt that lay in the house that Jack built.
This is the rat that ate the malt that lay in the house that Jack built.
This is the cat that killed the rat that ate the malt that lay in the house that Jack built.
This is the dog that chased the cat that killed the rat that ate the malt that lay in the house that Jack built.

Syntax There are limited types of tokens, but their combinations are infinite. Take for example arithmetic expressions:

E → E + E
E → E - E
E → E * E
E → E / E
E → id
E → num

Using the above productions, we can validate any arithmetic expression containing variables, numbers, add, sub, mult and div. This is a context free grammar. E is a non terminal: it has to stay on the LHS of at least one production, and it can also stay on the RHS of some productions. id, num, +, -, *, /, = are terminals, which are tokens; they stay only on the RHS of productions.

Syntax Example derivations (in each step, one non terminal is chosen and an appropriate production is applied to it):

E ⇒ E + E ⇒ E + id ⇒ id + id
E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
E ⇒ E + E ⇒ E + E - E ⇒ E + E - id ⇒ E + id - id ⇒ id + id - id
E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num
E ⇒ E * E ⇒ E * E - E ⇒ E * E - id ⇒ E * id - id ⇒ id * id - id
E ⇒ E * E ⇒ E * E - E ⇒ E * E - E / E ⇒ id * E - E / E ⇒ id * id - E / E ⇒ id * id - id / E ⇒ id * id - id / id

One has to choose the appropriate production at each step.

Syntax Recursive use of productions over terminals and non terminals results in valid statements.

Defining a grammar: a Context Free Grammar consists of
1. A set of terminals (T)
2. A set of non terminals (V)
3. A set of productions (P)
4. A start symbol, which is a non terminal (S)

The start symbol is the non terminal from which the chain of derivations starts; there can be only one. In the example, E is the start symbol. A production is of the form

N → w

where N is a non terminal and w is a string of terminals and non terminals.
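The four components above can be written down directly as data. A minimal sketch in Python for the expression grammar from the previous slide (the variable names are illustrative, not from any particular library):

```python
# A CFG as plain Python data: terminals T, non terminals V,
# productions P (LHS -> list of alternative RHS symbol lists),
# and the start symbol S.
terminals = {"id", "num", "+", "-", "*", "/"}
non_terminals = {"E"}
productions = {
    "E": [["E", "+", "E"], ["E", "-", "E"], ["E", "*", "E"],
          ["E", "/", "E"], ["id"], ["num"]],
}
start_symbol = "E"

# Sanity check: every LHS is a non terminal, and every symbol on a
# right hand side is either a terminal or a non terminal.
for lhs, alternatives in productions.items():
    assert lhs in non_terminals
    for rhs in alternatives:
        assert all(s in terminals or s in non_terminals for s in rhs)
```

Representing the RHS as a list of symbols (rather than one string) keeps multi-character tokens like id and num unambiguous.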

Syntax A derivation happens when a non terminal is replaced by a string of terminals and non terminals, as defined by some production:

E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num

The combination of terminals and non terminals at each stage of a derivation is called a sentential form.

Let's get a little cryptic. Let N be a non terminal and let α, β, γ be strings of terminals and non terminals. If there exists a production N → γ, then in a sentential form N can be replaced by γ. So αNβ can be rewritten as αγβ.
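The αNβ ⇒ αγβ rule is small enough to execute directly. A sketch of one derivation step in Python, with sentential forms as lists of symbols (the function name is illustrative):

```python
# One derivation step: rewrite the first occurrence of non terminal n
# in the sentential form with gamma, the body of a production n -> gamma.
def derive(sentential_form, n, gamma):
    i = sentential_form.index(n)  # position of N (raises ValueError if absent)
    return sentential_form[:i] + gamma + sentential_form[i + 1:]

# E => E + E => id + E => id + num
form = ["E"]
form = derive(form, "E", ["E", "+", "E"])
form = derive(form, "E", ["id"])
form = derive(form, "E", ["num"])
print(" ".join(form))  # id + num
```

Because `index` finds the first occurrence, this sketch always rewrites the leftmost copy of N; a real parser may pick any occurrence, as the following slides discuss.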

Derivation Definition: Given a context-free grammar G with start symbol S, terminal symbols T and productions P, the language L(G) that G generates is defined to be the set of strings of terminal symbols that can be obtained by derivation from S using the productions in P, i.e., the set

L(G) = { w ∈ T* | S ⇒* w }

As an example, look at the grammar

T → R
T → aTc
R → ε
R → RbR

This grammar generates the string aabbbcc by the derivation shown next. For clarity, at each step the non terminal that is rewritten in the following step is the one indicated by the production applied.

Derivation One possible derivation of the string aabbbcc using the given grammar:

Production applied    Sentential form
                      T
1. T → aTc            aTc
2. T → aTc            aaTcc
3. T → R              aaRcc
4. R → RbR            aaRbRcc
5. R → ε              aabRcc
6. R → RbR            aabRbRcc
7. R → RbR            aabRbRbRcc
8. R → ε              aabbRbRcc
9. R → ε              aabbbRcc
10. R → ε             aabbbcc

In this derivation, we have applied derivation steps sometimes to the leftmost non terminal, sometimes to the rightmost, and sometimes to a non terminal that was neither.

Derivation - Parsing The Syntax Analysis phase checks the structure of the source code statements. This is called parsing. There are two common methods:

1. Trying to generate the statement from the start symbol by applying production rules. This is called top down parsing. We have generated the string aabbbcc from the start symbol T: T ⇒* aabbbcc.

2. Taking the string and applying productions in reverse to arrive at the start symbol. This is called bottom up parsing: aabbbcc is reduced step by step back to T.

Derivation However, since derivation steps are local, the order does not matter. So we might as well decide to always rewrite the leftmost non terminal:

Production applied    Sentential form
                      T
1. T → aTc            aTc
2. T → aTc            aaTcc
3. T → R              aaRcc
4. R → RbR            aaRbRcc
5. R → RbR            aaRbRbRcc
6. R → ε              aabRbRcc
7. R → RbR            aabRbRbRcc
8. R → ε              aabbRbRcc
9. R → ε              aabbbRcc
10. R → ε             aabbbcc

A derivation that always rewrites the leftmost non terminal is called a leftmost derivation. Similarly, a derivation that always rewrites the rightmost non terminal is called a rightmost derivation.
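The leftmost derivation above can be replayed mechanically: at each step, find the leftmost non terminal and apply the next production in the list. A small Python sketch (function and variable names are illustrative):

```python
# Replay a leftmost derivation of aabbbcc with the grammar
# T -> R | aTc,  R -> epsilon | RbR.
non_terminals = {"T", "R"}

def leftmost_rewrite(form, lhs, rhs):
    # Find the leftmost non terminal and check it matches the production.
    i = next(i for i, s in enumerate(form) if s in non_terminals)
    assert form[i] == lhs
    return form[:i] + rhs + form[i + 1:]

steps = [("T", list("aTc")), ("T", list("aTc")), ("T", ["R"]),
         ("R", list("RbR")), ("R", list("RbR")), ("R", []),   # R -> epsilon
         ("R", list("RbR")), ("R", []), ("R", []), ("R", [])]

form = ["T"]
for lhs, rhs in steps:
    form = leftmost_rewrite(form, lhs, rhs)
print("".join(form))  # aabbbcc
```

An epsilon production is simply an empty replacement list, which is why ε disappears from the derived string.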

Derivation - Trees Drawing the tree from production rules: we can draw a derivation as a tree. The root of the tree is the start symbol. For each derivation step, the symbols on the RHS of the chosen production are added as children below the rewritten non terminal. For example, when applying T → aTc, the symbols a, T and c are drawn as children below T:

    T
  / | \
 a  T  c

The leaves of the tree are terminals which, when read from left to right, form the derived string. ε is ignored.

Derivation - Trees The order of derivation does not matter; only the choice of rules does. The same syntax tree is obtained for the string aabbbcc irrespective of the order of derivation (in the figure, the three b leaves appear as the first, second and third b from the left).

Ambiguity But we may have an alternate tree for the same string: the choice of production matters. When a different rule is applied, a different tree can result. When a grammar permits several different syntax trees for some strings, we call the grammar ambiguous.

Ambiguity Ambiguity is not a problem for validating syntax: both parse trees show that aabbbcc is a valid string. The problem is elsewhere, namely when we evaluate the string. Let's take the example of an expression grammar:

E → E + E
E → E * E
E → num

Two different derivations of 2 + 3 * 4:

E ⇒ E + E ⇒ E + E * E ⇒* num + num * num
E ⇒ E * E ⇒ E + E * E ⇒* num + num * num

Ambiguity Derivation 1: E ⇒ E + E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + (3 * 4).
Evaluation: 3 * 4 = 12; 2 + 12 = 14.

Derivation 2: E ⇒ E * E ⇒ E + E * E ⇒* num + num * num, i.e., (2 + 3) * 4.
Evaluation: 2 + 3 = 5; 5 * 4 = 20.

NOTE: THE SUBTREES ARE EVALUATED FIRST.
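The two parse trees above can be sketched as nested structures and evaluated bottom-up, which makes the 14-versus-20 discrepancy concrete. An illustrative sketch (tuples stand in for tree nodes):

```python
# Evaluate an expression tree bottom-up: subtrees first, then the operator.
def evaluate(tree):
    if isinstance(tree, tuple):
        op, left, right = tree
        l, r = evaluate(left), evaluate(right)
        return l + r if op == "+" else l * r
    return tree  # a num leaf

# Tree from the derivation starting E => E + E: 2 + (3 * 4)
tree_plus_first = ("+", 2, ("*", 3, 4))
# Tree from the derivation starting E => E * E: (2 + 3) * 4
tree_times_first = ("*", ("+", 2, 3), 4)

print(evaluate(tree_plus_first))   # 14
print(evaluate(tree_times_first))  # 20
```

Same token string, two trees, two answers: this is why ambiguity must be resolved before evaluation.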

Ambiguity Resolution A parser cannot be built for an ambiguous grammar: the parser must build one tree while processing the token string. So ambiguity must be resolved, either by

1) using disambiguating / precedence rules while parsing, or
2) rewriting the grammar to make it unambiguous (with the language unchanged).

(i) Associativity
a - b - c will be processed as (a - b) - c : left associative
a ** b ** c will be processed as a ** (b ** c) : right associative
a > b > c will be invalid : non associative

Note: each of + and * could be treated as either right associative or left associative, but by convention they are made left associative (the parser has to follow one rule).

(ii) Precedence
a + b * c will be treated as a + (b * c).

Ambiguity Detection Ambiguity exists in a grammar if there exists a string which can result in two distinct parse trees. In general this is very hard, almost impossible, to detect. In many cases, however, it is not difficult to spot by looking at the grammar, e.g., a production of the form

N → NαN

Note: parsers can be built only from unambiguous grammars. Most ambiguity occurs in expression grammars such as

E → E op E
E → num    (num is a numeric literal)

Rewriting ambiguous grammar Expression grammar, rewritten as follows:

(a) For left associative operators (e.g., a - b - c): introduce a new non terminal E'

E → E op E'
E → E'
E' → num

Isolate the rightmost non terminal first, pushing it into a subtree. Derivation example:

E ⇒ E - E' ⇒ (E - E') - E' ⇒ (num - num) - num

There is an implicit parenthesization.

Rewriting ambiguous grammar (b) For right associative operators (e.g., a ** b ** c): introduce a new non terminal E'

E → E' op E
E → E'
E' → num

Derivation example:

E ⇒ E' ^ E ⇒ num ^ E ⇒ num ^ (E' ^ E) ⇒ num ^ (num ^ E) ⇒ num ^ (num ^ E') ⇒ num ^ (num ^ num)

There is an implicit parenthesization.
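Why the left/right distinction changes the computed value can be seen by folding the same operand list both ways. An illustrative sketch, not tied to any particular parser (function names are assumptions):

```python
# Fold a list of numbers with '-' left-to-right versus right-to-left,
# mirroring left associative (a - b) - c and right associative a - (b - c).
from functools import reduce

def fold_left_sub(nums):
    return reduce(lambda acc, x: acc - x, nums)

def fold_right_sub(nums):
    result = nums[-1]
    for x in reversed(nums[:-1]):
        result = x - result
    return result

print(fold_left_sub([8, 2, 1]))   # (8 - 2) - 1 = 5
print(fold_right_sub([8, 2, 1]))  # 8 - (2 - 1) = 7
```

The left recursive grammar of case (a) produces the first shape; the right recursive grammar of case (b) produces the second.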

Rewriting ambiguous grammar (c) For non associative operators (e.g., a < b):

E → E' op E'
E → E'
E' → num

Here a < b is allowed, but a < b < c is not.

Rewriting ambiguous grammar So far we have only handled the cases where an operator interacts with itself. This is easily extended to the cases where several operators with the same precedence and associativity interact:

E → E + E'
E → E - E'
E → E'
E' → num

+ and - are both left associative, hence a left recursive grammar is required.

Rewriting ambiguous grammar But if we mix left recursion with right recursion, the grammar will be ambiguous again:

E → E + E'
E → E' ^ E
E → E'
E' → num

As an example, we cannot even represent 2 + 3 ^ 4 using this grammar.

Rewriting ambiguous grammar Mixing operators with different precedence but equal associativity: we must know the precedence of the operators, and the higher precedence operator needs to be worked out first. Use a different non terminal for each precedence level:

E → E + E2
E → E - E2
E → E2
E2 → E2 * E3
E2 → E2 / E3
E2 → E3
E3 → num
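A grammar layered like this maps directly onto an evaluator with one function per precedence level; the left recursion becomes a loop. A runnable sketch (token handling is simplified to a list of strings; the structure, not the names, is the point):

```python
# One function per non terminal: e handles + and -, e2 handles * and /,
# e3 handles num.  Left recursion (E -> E + E2) becomes a while loop.
def evaluate(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def e3():
        num = tokens[pos[0]]          # E3 -> num
        pos[0] += 1
        return float(num)

    def e2():
        value = e3()                  # E2 -> E2 * E3 | E2 / E3 | E3
        while peek() in ("*", "/"):
            op = tokens[pos[0]]; pos[0] += 1
            value = value * e3() if op == "*" else value / e3()
        return value

    def e():
        value = e2()                  # E -> E + E2 | E - E2 | E2
        while peek() in ("+", "-"):
            op = tokens[pos[0]]; pos[0] += 1
            value = value + e2() if op == "+" else value - e2()
        return value

    return e()

print(evaluate(["2", "+", "3", "*", "4"]))  # 14.0
print(evaluate(["8", "-", "2", "-", "1"]))  # 5.0
```

Because e calls e2 for its operands and e2 calls e3, higher precedence operators are worked out first, and the loops make + - * / left associative, exactly what the rewritten grammar specifies.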

Other sources of ambiguity Example: if P then if Q then S1 else S2

The ambiguity is: which if is the else connected to? It might mean

if P then ( if Q then S1 else S2 )
or
if P then ( if Q then S1 ) else S2

Note: the else clause is optional; otherwise the construct would have been unambiguous.

Other sources of ambiguity Let's see why. The grammar is

stmt → <id> := <exp>
stmt → <stmt> ; <stmt>
stmt → if <exp> then <stmt> else <stmt>
stmt → if <exp> then <stmt>

According to this grammar, the single else can equally well match either if.

Other sources of ambiguity Two parse trees, indicating ambiguous grammar

Other sources of ambiguity Usual convention: an else matches the closest preceding if. We can enforce this rule by rewriting the grammar, introducing two new non terminals:

stmt → <matched>
stmt → <unmatched>
matched → if <exp> then <matched> else <matched>
matched → <id> := <exp>
unmatched → if <exp> then <matched> else <unmatched>
unmatched → if <exp> then <stmt>
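A tiny recursive-descent sketch shows the effect of the closest-if convention that the rewritten grammar enforces. This parser implements the convention directly by attaching an else greedily to the innermost open if (token handling and statement forms are heavily simplified; all names are illustrative):

```python
# Parse a simplified if/then/else language into nested tuples,
# attaching each 'else' to the nearest unmatched 'if'.
def parse(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat(expected=None):
        tok = tokens[pos[0]]; pos[0] += 1
        assert expected is None or tok == expected
        return tok

    def stmt():
        if peek() == "if":
            eat("if"); cond = eat(); eat("then")
            then_branch = stmt()
            if peek() == "else":      # greedy: nearest if claims the else
                eat("else")
                return ("if", cond, then_branch, stmt())
            return ("if", cond, then_branch)
        name = eat(); eat(":="); value = eat()   # <id> := <exp>
        return (":=", name, value)

    return stmt()

tree = parse(["if", "P", "then", "if", "Q", "then",
              "x", ":=", "1", "else", "y", ":=", "2"])
print(tree)
# ('if', 'P', ('if', 'Q', (':=', 'x', '1'), (':=', 'y', '2')))
```

The outer if node has no else branch while the inner one does, i.e., the string parses as if P then (if Q then S1 else S2), matching the matched/unmatched grammar above.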