Syntactic Analysis. ICS312 Machine-Level and Systems Programming

The Big Picture Again
The compiler pipeline: source code → Scanner → Parser → Opt1 → Opt2 → ... → Optn → Instruction Selection → Register Allocation → Instruction Scheduling → machine code.
(ICS312 Machine-Level and Systems Programming, Henri Casanova, henric@hawaii.edu)

Syntactic Analysis
Lexical analysis was about ensuring that we extract a set of valid words (i.e., tokens/lexemes) from the source code.
But nothing says that the words make a coherent sentence (i.e., a program).
Example: if while i == == == 12 + endif abcd
The lexer will produce a stream of tokens:
  <TOKEN_IF> <TOKEN_WHILE> <TOKEN_NAME, "i"> <TOKEN_EQUAL> <TOKEN_EQUAL> <TOKEN_EQUAL> <TOKEN_INTEGER, "12"> <TOKEN_PLUS, "+"> <TOKEN_ENDIF> <TOKEN_NAME, "abcd">
This program is lexically correct, but syntactically incorrect.

Grammar
Question: how do we determine that a sentence is syntactically correct?
Answer: we check it against a grammar!
A grammar consists of rules that determine which sentences are correct.
Example in English: a sentence must have a verb.
Example in C: a "{" must have a matching "}".
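
To make the division of labor concrete: the lexer's output for the bad example above is just a flat list of tokens, and it is the grammar-driven parser that must reject it. Here is a minimal Java sketch of that token stream (the Token class and TokenType names are invented for illustration; this is not the course's lexer interface):

  import java.util.*;

  // Hypothetical token representation, for illustration only.
  public class TokenStreamDemo {
      enum TokenType { IF, WHILE, NAME, EQUAL, INTEGER, PLUS, ENDIF }

      static class Token {
          final TokenType type;
          final String lexeme;
          Token(TokenType type, String lexeme) { this.type = type; this.lexeme = lexeme; }
      }

      public static void main(String[] args) {
          // What the lexer produces for "if while i == == == 12 + endif abcd":
          List<Token> tokens = Arrays.asList(
              new Token(TokenType.IF, "if"),       new Token(TokenType.WHILE, "while"),
              new Token(TokenType.NAME, "i"),      new Token(TokenType.EQUAL, "=="),
              new Token(TokenType.EQUAL, "=="),    new Token(TokenType.EQUAL, "=="),
              new Token(TokenType.INTEGER, "12"),  new Token(TokenType.PLUS, "+"),
              new Token(TokenType.ENDIF, "endif"), new Token(TokenType.NAME, "abcd"));
          // Every token is individually valid; only the parser, checking against a
          // grammar, can tell that the sequence as a whole is not a well-formed program.
          for (Token t : tokens)
              System.out.println("<TOKEN_" + t.type + ", \"" + t.lexeme + "\">");
      }
  }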

Grammar
Regular expressions are one way we have seen to specify a set of rules.
Unfortunately, they are not powerful enough to describe the syntax of programming languages.
Example: if we have 10 "{" then we must have 10 "}".
We can't implement this with regular expressions because they have no memory: there is no way of counting and remembering counts.
Therefore we need a more powerful tool.
This tool is called Context-Free Grammars (plus some additional mechanisms).

Context-Free Grammars
A context-free grammar (CFG) consists of a set of production rules.
Each rule describes how a non-terminal symbol can be replaced, or expanded, by a string that consists of non-terminal symbols and terminal symbols.
Terminal symbols are really tokens.
Rules are written with a syntax similar to that of regular expressions.
Rules can then be applied recursively.
Eventually one reaches a string of only terminal symbols, or so one hopes.
This string is syntactically correct according to the grammatical rules!
Let's see a simple example.

CFG Example
Set of non-terminals: A, B, C (uppercase)
Start non-terminal: S (uppercase)
Set of terminal symbols: a, b, c, d
Set of production rules:
  S → A | BC
  A → Aa | a
  B → bBcB | b
  C → dCcd | c
We can now start producing syntactically valid strings by doing derivations.
Example derivations:
  S ⇒ BC ⇒ bBcBC ⇒ bbcBC ⇒ bbcBdCcd ⇒ bbcbdCcd ⇒ bbcbdccd
  S ⇒ A ⇒ Aa ⇒ Aaa ⇒ Aaaa ⇒ aaaa

A Grammar for Expressions
  Expr → Expr Op Expr | Identifier | Number
  Identifier → Letter | Letter Identifier
  Letter → a | b | ... | z
  Op → + | - | * | /
  Number → Digit | Digit Number
  Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Example derivation of the string 34*ax:
  Expr ⇒ Expr Op Expr ⇒ Number Op Expr ⇒ Digit Number Op Expr ⇒ 3 Number Op Expr ⇒ 3 Digit Op Expr ⇒ 34 Op Expr ⇒ 34 * Expr ⇒ 34 * Identifier ⇒ 34 * Letter Identifier ⇒ 34 * a Identifier ⇒ 34 * a Letter ⇒ 34 * ax
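
To see mechanically what a derivation does, here is a small Java sketch (purely illustrative, not part of the course material) that starts from S and repeatedly expands the leftmost non-terminal of the example grammar until only terminals remain:

  import java.util.*;

  // Leftmost-derivation demo for the example grammar:
  //   S -> A | BC,  A -> Aa | a,  B -> bBcB | b,  C -> dCcd | c
  // Uppercase characters are non-terminals, lowercase characters are terminals.
  public class DerivationDemo {
      static final Map<Character, String[]> RULES = Map.of(
          'S', new String[] {"A", "BC"},
          'A', new String[] {"Aa", "a"},
          'B', new String[] {"bBcB", "b"},
          'C', new String[] {"dCcd", "c"});

      public static void main(String[] args) {
          Random rng = new Random();
          String form = "S";
          System.out.println(form);
          int steps = 0;
          while (form.chars().anyMatch(Character::isUpperCase)) {
              int i = 0;
              while (!Character.isUpperCase(form.charAt(i))) i++;   // leftmost non-terminal
              String[] alts = RULES.get(form.charAt(i));
              // Pick an alternative at random; after 20 steps force the last (shortest)
              // alternative so the demo is guaranteed to terminate.
              String replacement = (steps++ < 20) ? alts[rng.nextInt(alts.length)]
                                                  : alts[alts.length - 1];
              form = form.substring(0, i) + replacement + form.substring(i + 1);
              System.out.println("=> " + form);
          }
      }
  }

Each printed line is one sentential form; the run stops exactly when the string contains only terminals, i.e., when it has become a valid string of the language.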

What is Parsing?
What we just saw is the process of starting with the start symbol and, through a sequence of rule applications, obtaining a string of terminal symbols.
In this way we could generate all correct programs (an infinite set, though).
Parsing is the other way around: given a string of terminal symbols, it is the process of discovering a sequence of rule applications that produces this particular string.
When we say we can't parse a string, we mean that we can't find any legal way in which the string can be obtained from the start symbol through derivations.
What we want to build is a parser: a program that takes in a string of tokens (terminal symbols) and discovers a derivation sequence, thus validating that the input is a syntactically correct program.

Derivations as Trees
A convenient and natural way to represent a sequence of derivations is a syntax tree, or parse tree.
Example: the derivation of 34*ax above corresponds to the parse tree
  Expr
    Expr
      Number
        Digit: 3
        Number
          Digit: 4
    Op: *
    Expr
      Identifier
        Letter: a
        Identifier
          Letter: x

Derivations as Trees
In the parser, derivations are implemented as trees.
Often, we draw trees without the full derivations, e.g., just
  Expr
    34
    Op: *
    Identifier: ax

Ambiguity
We call a grammar ambiguous if a string of terminal symbols can be reached by two different derivation sequences.
In other terms, a string can have more than one parse tree.
It turns out that our expression grammar is ambiguous!
Let's show that the string 3*5+8 has two parse trees.
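
Representing parse trees as a tiny data structure makes the two readings easy to write down explicitly. Here is a bare-bones Java sketch (the ParseNode class is hypothetical, not what ANTLR or the course tools actually build) that constructs both abbreviated trees for 3*5+8:

  import java.util.*;

  // A bare-bones parse-tree node: interior nodes are labeled with an operator or
  // non-terminal, leaves carry a lexeme.
  public class ParseNode {
      final String label;
      final List<ParseNode> children;

      ParseNode(String label, ParseNode... kids) {
          this.label = label;
          this.children = Arrays.asList(kids);
      }

      // Pretty-print the tree with indentation, one node per line.
      void print(String indent) {
          System.out.println(indent + label);
          for (ParseNode child : children) child.print(indent + "  ");
      }

      public static void main(String[] args) {
          // Reading 1: (3*5)+8 -- the multiplication sits below the addition.
          ParseNode left = new ParseNode("+",
              new ParseNode("*", new ParseNode("3"), new ParseNode("5")),
              new ParseNode("8"));
          // Reading 2: 3*(5+8) -- the addition sits below the multiplication.
          ParseNode right = new ParseNode("*",
              new ParseNode("3"),
              new ParseNode("+", new ParseNode("5"), new ParseNode("8")));
          left.print("");
          System.out.println();
          right.print("");
      }
  }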

Ambiguity
Left parse tree: the * node sits below the + node, so the string groups as (3*5)+8.
Right parse tree: the + node sits below the * node, so the string groups as 3*(5+8).

Problems with Ambiguity
Problem: syntax impacts meaning.
For our example string we'd like to see the left tree, because we most likely want * to have a higher precedence than +.
We don't like ambiguity because it makes parsers difficult to design: we don't know which parse tree will be discovered when there are multiple possibilities.
So we often want to disambiguate grammars.
It turns out that it is possible to modify grammars to make them non-ambiguous:
  by adding non-terminals
  by adding/rewriting production rules
In the case of our expression grammar, we can rewrite the grammar to remove ambiguity and to ensure that parse trees match our notion of operator precedence.
We get two benefits for the price of one.
This would work for many operators and many precedence relations.

Non-Ambiguous Grammar
(The slide shows the rewritten, non-ambiguous expression grammar for the operators + - * /, together with the parse tree it produces for the example 4*5+3-8*9.)
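
The usual way to encode precedence is to give each precedence level its own non-terminal. As a rough illustration (this is the standard textbook layering, not necessarily the exact grammar from the slide, and the Precedence class is an invented sketch), the following Java code parses with that layering and prints a fully parenthesized version of its input, so you can see that * and / bind tighter than + and -:

  // One non-terminal per precedence level (standard textbook layering):
  //   Expr   -> Expr + Term | Expr - Term | Term
  //   Term   -> Term * Factor | Term / Factor | Factor
  //   Factor -> Number | Identifier
  // The sketch below uses the loop form of this layering and prints a fully
  // parenthesized rendering of its input.
  public class Precedence {
      private final String input;
      private int pos = 0;

      Precedence(String input) { this.input = input.replaceAll("\\s+", ""); }

      private boolean peekIs(String ops) {
          return pos < input.length() && ops.indexOf(input.charAt(pos)) >= 0;
      }

      String expr() {                                  // Expr -> Term (('+'|'-') Term)*
          String left = term();
          while (peekIs("+-")) {
              char op = input.charAt(pos++);
              left = "(" + left + " " + op + " " + term() + ")";
          }
          return left;
      }

      String term() {                                  // Term -> Factor (('*'|'/') Factor)*
          String left = factor();
          while (peekIs("*/")) {
              char op = input.charAt(pos++);
              left = "(" + left + " " + op + " " + factor() + ")";
          }
          return left;
      }

      String factor() {                                // Factor -> a run of digits or letters
          int start = pos;
          while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
          return input.substring(start, pos);
      }

      public static void main(String[] args) {
          System.out.println(new Precedence("3*5+8").expr());       // ((3 * 5) + 8)
          System.out.println(new Precedence("4*5+3-8*9").expr());   // (((4 * 5) + 3) - (8 * 9))
      }
  }

For 3*5+8 it prints ((3 * 5) + 8), i.e., exactly the left parse tree we wanted.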

Another Example Grammar
  ForStatement → for ( StmtCommaList ; ExprCommaList ; StmtCommaList ) { StmtSemicList }
  StmtCommaList → ε | Stmt | Stmt , StmtCommaList
  ExprCommaList → ε | Expr | Expr , ExprCommaList
  StmtSemicList → ε | Stmt | Stmt ; StmtSemicList
  Stmt → ...
  Expr → ...

Full Language Grammar Sketch
  Program → VarDeclList FuncDeclList
  VarDeclList → ε | VarDecl | VarDecl VarDeclList
  VarDecl → Type IdentCommaList
  IdentCommaList → Ident | Ident , IdentCommaList
  Type → int | char | float
  FuncDeclList → ε | FuncDecl | FuncDecl FuncDeclList
  FuncDecl → Type Ident ( ArgList ) { VarDeclList StmtList }
  StmtList → ε | Stmt | Stmt StmtList
  Stmt → Ident = Expr ; | ForStatement | ...
  Expr → ... | Ident | ...

Using * notation (not + here)
  Program → VarDeclList FuncDeclList
  VarDeclList → VarDecl*
  VarDecl → Type IdentCommaList
  IdentCommaList → Ident (, Ident)*
  Type → int | char | float
  FuncDeclList → FuncDecl*
  FuncDecl → Type Ident ( ArgList ) { VarDeclList StmtList }
  StmtList → Stmt*
  Stmt → Ident = Expr ; | ForStatement | ...
  Expr → ... | Ident | ...

Real-world CFGs
Some sample grammars found on the Web:
  LISP: 7 rules
  PROLOG: 19 rules
  Java: 30 rules
  C: 60 rules
  Ada: 280 rules
LISP is particularly easy because it has no operators, just function calls, and therefore no precedence or associativity; LISP is thus very easy to parse.
In the Java specification, the description of operator precedence and associativity takes 25 pages.
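
Returning to the two list notations above (the recursive IdentCommaList rule versus the starred form): they describe the same language and correspond to the two natural ways of hand-coding a list parser. An illustrative Java sketch (the helper code is hypothetical, not from the course):

  import java.util.*;

  // Two equivalent ways to recognize IdentCommaList, mirroring the two grammar styles:
  //   recursive rule:  IdentCommaList -> Ident | Ident , IdentCommaList
  //   starred rule:    IdentCommaList -> Ident (, Ident)*
  public class IdentListDemo {
      public static void main(String[] args) {
          List<String> toks = Arrays.asList("a", ",", "b", ",", "c");
          System.out.println(parseRecursive(new ArrayDeque<>(toks)));
          System.out.println(parseIterative(new ArrayDeque<>(toks)));
      }

      // Direct transcription of the recursive rule.
      static List<String> parseRecursive(Deque<String> toks) {
          List<String> idents = new ArrayList<>();
          idents.add(toks.removeFirst());                 // Ident
          if (",".equals(toks.peekFirst())) {             // optional ", IdentCommaList"
              toks.removeFirst();
              idents.addAll(parseRecursive(toks));
          }
          return idents;
      }

      // Direct transcription of the starred rule: a loop instead of recursion.
      static List<String> parseIterative(Deque<String> toks) {
          List<String> idents = new ArrayList<>();
          idents.add(toks.removeFirst());                 // Ident
          while (",".equals(toks.peekFirst())) {          // (, Ident)*
              toks.removeFirst();
              idents.add(toks.removeFirst());
          }
          return idents;
      }
  }

Both methods print [a, b, c]; tools like ANTLR accept the starred form directly.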

So What Now?
We want to write a compiler for a given language.
Lexing: we come up with a definition of the tokens, embodied in regular expressions, and we build a lexer using a tool. In the previous set of lecture notes we used ANTLR to do this.
Parsing: we come up with a definition of the syntax, embodied in a context-free grammar, and we build a parser using a tool.
Let's use ANTLR again, for a simple language!

Our Language
We have all the tokens we've already defined in our lexer:
  IF, ENDIF
  PRINT, INT, PLUS, LPAREN, RPAREN
  EQUAL, NOTEQUAL, ASSIGN, SEMICOLON
  INTEGER, NAME
We want a very limited language with:
  integer variable declarations
  assignments
  addition (only 2 operands)
  if (no else, only a test for equality)
  semicolon-terminated statements
White spaces, tabs, and carriage returns don't matter.
Let's look at an example program to get a sense of it.

Example Program
  int a;
  int b;
  a = 3;
  b = a + 1;
  if (b == 4)
    a = 2;
  endif
  if (a == 3)
    a = a + 1;
    b = b + 6;
  endif
  print a;
  print b;

Let's write/run the grammar
Root non-terminal: program.
Let us now write the grammar in class together using ANTLR syntax, using our simple lexer as a starting point.
A (hopefully similar) grammar is posted on the course Website.
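
Before writing the ANTLR grammar, it can help to see roughly what such a grammar looks like as a hand-written recognizer. The Java sketch below is only a guess at the structure (rule names and details are invented; the actual grammar is the one posted on the course site), shown to make the shape of the language concrete:

  import java.util.*;

  // Hypothetical hand-written recognizer for the mini language, assuming a grammar like:
  //   program     -> declaration* statement*
  //   declaration -> "int" NAME ";"
  //   statement   -> NAME "=" expr ";"
  //                | "if" "(" expr "==" expr ")" statement* "endif"
  //                | "print" NAME ";"
  //   expr        -> operand ("+" operand)?
  //   operand     -> NAME | INTEGER
  public class MiniParser {
      private final List<String> toks;
      private int pos = 0;

      MiniParser(String source) { toks = Arrays.asList(source.trim().split("\\s+")); }

      private String peek() { return pos < toks.size() ? toks.get(pos) : "<eof>"; }
      private String next() { return toks.get(pos++); }
      private void expect(String s) {
          if (!peek().equals(s))
              throw new RuntimeException("expected '" + s + "' but saw '" + peek() + "'");
          pos++;
      }

      void program() {
          while (peek().equals("int")) declaration();
          while (pos < toks.size()) statement();
      }
      void declaration() { expect("int"); next(); expect(";"); }
      void statement() {
          if (peek().equals("if")) {
              next(); expect("("); expr(); expect("=="); expr(); expect(")");
              while (!peek().equals("endif")) statement();
              next();                                   // consume "endif"
          } else if (peek().equals("print")) {
              next(); next(); expect(";");              // print NAME ;
          } else {
              next(); expect("="); expr(); expect(";"); // NAME = expr ;
          }
      }
      void expr()    { operand(); if (peek().equals("+")) { next(); operand(); } }
      void operand() { next(); }   // NAME or INTEGER (no lexical checks in this sketch)

      public static void main(String[] args) {
          String prog = "int a ; int b ; a = 3 ; b = a + 1 ; " +
                        "if ( b == 4 ) a = 2 ; endif print a ; print b ;";
          new MiniParser(prog).program();
          System.out.println("syntactically OK");
      }
  }

Each method corresponds to one grammar rule; ANTLR generates essentially this kind of recursive code for us.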

Code Generation
Now we have a parser that will reject syntactically incorrect code and generate a parse tree for correct code.
The next step toward building a compiler is to generate code.
One easy but limited option is to use syntax-directed translation:
  attach actions to the rules of the grammar
  attach attributes to non-terminals and terminals in the grammar
There is quite a bit of theory here, but instead we'll just do it by example using the ANTLR syntax.
First let's review a few basic elements of this syntax.

ANTLR Syntax-directed translation
Each time a grammar symbol is evaluated, you can insert Java code to be executed!
Example:
  program : {System.out.println("Declarations!");}
            declaration
            {System.out.println("Statement!");}
            statements
            {System.out.println("Done!");} ;

ANTLR Syntax-directed translation
Each (lexer) token has an attribute called text that contains its lexeme.
Example:
  declaration : INT NAME SEMICOLON
                {System.out.println("Declared " + $NAME.text);} ;

ANTLR Syntax-directed translation
You can give your own names to symbols, in case you have multiple occurrences.
Example:
  something : a=NAME EQUAL b=NAME SEMICOLON
              {System.out.println($a.text + "-" + $b.text);} ;
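
The underlying idea is simply "compute values while you parse." In plain Java terms (a hypothetical sketch, not ANTLR-generated code), an attribute is a value that a parsing routine computes and returns, and an action is ordinary code run at a specific point of the rule:

  // Synthesized attributes in plain Java terms: each parsing method returns the
  // "attribute" of its non-terminal, here the value of the sub-expression it recognized.
  public class AttrDemo {
      private final String[] toks;
      private int pos = 0;

      AttrDemo(String src) { toks = src.trim().split("\\s+"); }

      // expr -> INTEGER (PLUS INTEGER)?   (mirrors the mini language's 2-operand addition)
      int expr() {
          int value = Integer.parseInt(toks[pos++]);       // attribute of the first operand
          if (pos < toks.length && toks[pos].equals("+")) {
              pos++;
              value += Integer.parseInt(toks[pos++]);      // combine attributes, like an ANTLR action
          }
          return value;                                    // the non-terminal's attribute
      }

      public static void main(String[] args) {
          System.out.println(new AttrDemo("3 + 4").expr()); // prints 7
      }
  }

ANTLR's {...} actions, the $NAME.text attribute, and the returns [...] attributes shown next are this same mechanism, embedded in the grammar instead of hand-written.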

ANTLR Syntax-directed translation
You can create attributes for non-terminal grammar symbols and use them.
Example:
  something : ident SEMICOLON
              {System.out.println("stuff " + $ident.whatever);} ;
  ident returns [String whatever] : NAME
              {$whatever = "somestring" + $NAME.text;} ;

ANTLR Syntax-directed translation
And with all this we can now implement our compiler.
Our goal: have ANTLR produce x86 assembly code that we can run!
Let's do it in class right now.
A (hopefully) similar version is posted on the course Web site.
There will be mistakes, questions, hiccups, and confusion, but the goal is that we can all learn from this.
Off we go...

Conclusion
There is a LOT of depth to the topic of compilers; we've only scratched the surface here.
There are well-known books on compilers.