Left to right design

Left to right design

The left to right design method suggests that the structure of the program should closely follow the structure of the input. The method is effective when the structure of the input dominates the problem. Many problems in practice have complex input structure; even when that structure doesn't dominate the whole problem, the subproblem of handling the input can be solved using left-to-right design. Any program that reads input is a language recognizer, or parser.

The problem

Write a program to act as a simple calculator. Users type in arithmetic expressions, one per line; the program should print the value of each expression. Expressions may involve the four basic arithmetic operators and parentheses. For simplicity, assume all numbers are integers. Unix comes with two calculator programs, one (dc) for postfix expressions and the other (bc) for infix expressions; bc is built on top of dc.

Input Expression - Examples

3 + 5
5*3 - 7/2
2 + 5*2 - 3
5 * (4*2 - 2/3) - (2+3) * (4+5)
(2+4) / (3*5) - 4 + 5

Structure Description

Formal notation:

x,y    x then y
x*     zero or more repetitions of x
x+     one or more repetitions of x
x|y    either x or y
[x]    either x or nothing
x: y   x is defined as y

Data Description (grammar)

file:    line*, end
line:    newline | (expr, newline)
expr:    term, ((add|sub), term)*
term:    factor, ((mul|div), factor)*
factor:  number | (lparen, expr, rparen)
number:  digit+
add:     '+'
sub:     '-'
mul:     '*'
div:     '/'
lparen:  '('
rparen:  ')'
newline: '\n'
end:     EOF

White Space

This data description does not say where white space may appear in the input, because that would make the description unnecessarily complicated. Most programs accept white space in some places and not in others. In standard terminology, a token is a unit of input such that any spaces between tokens are not significant, and any spaces within a token (if they are allowed at all) are significant.

Tokens

We must decide what the tokens of our grammar are. Tokens are the smallest elements of the grammar. They must be defined:
- without reference to other tokens
- without recursion
- independent of preceding tokens
The tokens in this program are: number, add, sub, mul, div, lparen, rparen, newline, end.

Two stages

Traditionally, the task of recognizing the structure of the input has been done in two stages. (The option of using only one stage is discussed in a later section.) The first stage, called lexical analysis, scanning, or tokenizing, groups characters into tokens while ignoring white space and comments (both may appear anywhere and neither is significant).

Two stages

The second stage, syntactic analysis or parsing, groups tokens into higher-level entities such as expressions. In technical language, tokens are called terminal symbols, while the entities recognized by parsers are called nonterminal symbols.

Tokenizer operations

The tokenizer, lexical analyser, or scanner is a function. Each time it is called, it should read the next token and return an indication of which kind of token it is (number, add, sub, etc.); and, if there is more than one token of that kind, an indication of the one that was seen. For example, all plus signs are alike, but when the calculator reads in a number, it must know which number it is.

Indicators

The standard way to indicate the kind of a token is via an enumerated type:

typedef enum {
    ADD, SUB, MUL, DIV, LPAREN, RPAREN, NL, END, NUMBER
} TokenKind;

Indicators

In this case only one kind of token, NUMBER, needs an indicator that says which token of that kind was seen, so the value of the token can be put into an integer (we are not concerned with real numbers in this exercise).

Using unions

In general, more than one kind of token may have an associated value, and these values may be of different types. For example, some tokenizers must be able to recognize both integers and identifiers. The solution is to use a union:

typedef union {
    int number;
    char *ident;
} TokenValue;

Every value of type TokenValue will have enough storage to hold either an int or a char*, but not both.

Token representation

Conceptually, a token is a kind/value pair, and should be represented as a structure with two fields:

typedef struct {
    TokenKind kind;
    TokenValue value;
} Token;

Token representation

Token token;
token.kind == NUMBER          => value is in token.value.number
token.kind == IDENT           => value is in token.value.ident
token.kind is something else  => token has no associated value

However, for simplicity people often use two separate variables for kind and value.

Tokenizer structure

Tokenizer functions start with code that gets rid of nonsignificant white space and comments, if they are allowed:

c = getc(stdin);
while (c != EOF && c != '\n' && isspace(c))
    c = getc(stdin);

The first character left in the input is then often sufficient to find out what kind of token is next. (If it isn't, we must use techniques usually used for parsing.)

Consider all the rules for tokens:

number:  digit+
add:     '+'
sub:     '-'
mul:     '*'
div:     '/'
lparen:  '('
rparen:  ')'
newline: '\n'
end:     EOF

Each token begins with a different character, so we can switch on the first non-space character to decide the token kind.

Identifiers

Many grammars have some kind of identifier token. For our calculator, we might want to allow identifiers for variable names. Identifiers usually have a structure like:

ident: letter, (letter|digit)*

precisely to distinguish them from numbers by the first character.

The rest of the token

Once the tokenizer has found out what kind of token is next, it must read in the rest of the token. The structure of the code that does this should follow the structure of the data description of the rest of the token.

ident: letter, (letter|digit)*

/* c is known to be a letter */
buf[i++] = c;
c = getc(stdin);
while (isalpha(c) || isdigit(c)) {
    buf[i++] = c;
    c = getc(stdin);
}
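The slide leaves buf, i, and the surrounding function implicit. As a sketch (our reconstruction, not from the slides; the name read_ident is hypothetical), the fragment could be wrapped into a self-contained helper like this:

#include <stdio.h>
#include <ctype.h>

/* Reads the rest of an identifier whose first letter is c,
 * NUL-terminates it in buf, and pushes back the character
 * that follows the identifier. */
void read_ident(int c, char buf[], int size) {
    int i = 0;
    buf[i++] = c;                 /* c is known to be a letter */
    c = getc(stdin);
    while ((isalpha(c) || isdigit(c)) && i < size - 1) {
        buf[i++] = c;
        c = getc(stdin);
    }
    buf[i] = '\0';
    ungetc(c, stdin);             /* belongs to the next token */
}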

Tokenizer

TokenKind do_get_token(int *token_value) {
    int c, val;

    c = getc(stdin);
    while (c != EOF && c != '\n' && isspace(c))
        c = getc(stdin);
    switch (c) {
    case '+':  return ADD;
    case '-':  return SUB;
    case '*':  return MUL;
    case '/':  return DIV;
    case '(':  return LPAREN;
    case ')':  return RPAREN;
    case '\n': return NL;
    case EOF:  return END;

(continued)

Tokenizer (2)

    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        val = c - '0';
        c = getc(stdin);
        while (c != EOF && isdigit(c)) {
            val = val * 10 + c - '0';
            c = getc(stdin);
        }
        ungetc(c, stdin);
        *token_value = val;
        return NUMBER;
    default:
        /* handle the error */
    }
}
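To see the tokenizer in action, a minimal test driver could call do_get_token in a loop and print each token (this driver is our sketch, not part of the slides; it assumes the TokenKind enum and do_get_token defined above, with the names array matching the enum's order):

#include <stdio.h>

int main(void) {
    static const char *names[] = {
        "ADD", "SUB", "MUL", "DIV", "LPAREN",
        "RPAREN", "NL", "END", "NUMBER"
    };
    int value;
    TokenKind kind;
    do {
        kind = do_get_token(&value);
        if (kind == NUMBER)
            printf("NUMBER(%d)\n", value);   /* value is only set for NUMBER */
        else
            printf("%s\n", names[kind]);
    } while (kind != END);
    return 0;
}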

Pushback

do_get_token must remove exactly one token from the input, together with its preceding white space. We cannot find out whether a digit is the last character in a number or not until we have read the next character. This character may be, e.g., '+', which represents a token, so we must make sure that the next invocation of do_get_token processes it. Our code does this by calling ungetc, which arranges for the next call to getc on the same file to read the character pushed back by ungetc.
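A standalone illustration of the pushback idiom (our sketch, separate from the calculator's code): read one integer, then push back the first non-digit so that later reads still see it. For input "42+", this leaves '+' in the stream for the next getc.

#include <stdio.h>
#include <ctype.h>

int read_int(FILE *fp) {
    int c, val = 0;
    while ((c = getc(fp)) != EOF && isdigit(c))
        val = val * 10 + (c - '0');
    if (c != EOF)
        ungetc(c, fp);   /* the character belongs to the next token */
    return val;
}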

Recursive Descent Parsing

Recursive Descent Parsing

The parser has a function for each nonterminal in the grammar. The structure of this function is derived from the nonterminal's definition in the grammar. The translation scheme is:

grammar rule            -> function
nonterminal             -> function call
terminal                -> check token and consume
sequence (,)            -> sequence of statements
repetition (* and +)    -> while or do statement based on next token kind
alternative (| and [])  -> if or switch statement based on next token kind

Data Description (grammar)

file:    line*, end
line:    newline | (expr, newline)
expr:    term, ((add|sub), term)*
term:    factor, ((mul|div), factor)*
factor:  number | (lparen, expr, rparen)
number:  digit+
add:     '+'
sub:     '-'
mul:     '*'
div:     '/'
lparen:  '('
rparen:  ')'
newline: '\n'
end:     EOF

Fixed one-token lookahead

This scheme maintains the following invariant: when the function for a nonterminal is called, the global variables hold information about the first token that may be part of that nonterminal; and when the function returns, the global variables hold information about the first token beyond that nonterminal. As soon as a token is recognized, it should be consumed by a call to get_token, which sets the global variables according to the next token. Lookahead is an alternative to pushback.
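The slides use two global variables for the lookahead but never show their declarations; the following is a plausible sketch (the names are taken from how the later code uses them):

TokenKind next_token_kind;    /* kind of the current lookahead token */
int       next_token_value;   /* its value, meaningful when kind == NUMBER */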

The top-level function

To implement the lookahead, we must begin our program by looking ahead. Next we handle the top-level nonterminal in our grammar: file. We handle a nonterminal with a function call.

int main(void) {
    /* recognizes file: line*, end */
    get_token();
    get_file();
    return 0;
}

void get_token(void) {
    next_token_kind = do_get_token(&next_token_value);
}

file

file: line*, end

We translate the file grammar rule to a get_file() function whose body is the translation of the RHS of the rule. We translate a nonterminal, such as line, to a call to the function for that nonterminal, such as get_line(). We translate a * repetition into a while loop whose condition tests that the next token could be the first token of what is repeated, in this case NUMBER or LPAREN.

file

file:    line*, end
line:    newline | (expr, newline)
expr:    term, ((add|sub), term)*
term:    factor, ((mul|div), factor)*
factor:  number | (lparen, expr, rparen)

file

void get_file(void) {
    /* recognizes file: line*, end */
    while (next_token_kind == NUMBER || next_token_kind == LPAREN)
        get_line();
    if (next_token_kind != END)
        ... handle the error ...
    /* no need to get a token after END */
}

Error Conditions

We must consider what happens on invalid input. With this definition, if a line begins with, say, ADD, we get an error message and get_file() returns. It would usually be better to ignore the erroneous line and keep processing.

void get_file(void) {
    /* recognizes file: line*, end */
    while (next_token_kind != END) {
        if (next_token_kind == NUMBER || next_token_kind == LPAREN)
            get_line();
        else
            ... print error message and skip line ...
    }
    /* no need to get a token after END */
}
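One way to implement the "skip line" action is a small helper (our sketch; the name skip_line is hypothetical) that discards tokens up to and including the next newline, so parsing can resume at the start of the following line:

void skip_line(void) {
    while (next_token_kind != NL && next_token_kind != END)
        get_token();
    if (next_token_kind == NL)
        get_token();   /* consume the newline itself */
}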

line

line: newline | (expr, newline)

The alternative construct translates to an if or switch on the next token kind. A terminal is handled by checking it and getting the next token.

line cont.

void get_line(void) {
    if (next_token_kind == NL)
        get_token();
    else {
        get_expr();
        if (next_token_kind == NL)
            get_token();
        else
            ... error ...
    }
}

Consuming tokens

Code like

if (next_token_kind == something)
    get_token();
else
    handle a syntax error

is common enough that it's often worth writing a function or macro to handle it:

void consume(TokenKind tok) {
    if (next_token_kind == tok)
        get_token();
    else
        ... handle syntax error ...
}

Consuming tokens (2)

Using this function simplifies the get_line() function and makes its similarity to the grammar rule more apparent:

line: newline | (expr, newline)

void get_line(void) {
    if (next_token_kind == NL)
        get_token();
    else {
        get_expr();
        consume(NL);
    }
}

Recognizing an expression

expr: term, ((add|sub), term)*

void get_expr(void) {
    get_term();
    while (next_token_kind == ADD || next_token_kind == SUB) {
        get_token();   /* ADD or SUB */
        get_term();
    }
}

Code for get_term() is very similar; a sketch follows.
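Since the slide only states that get_term() is very similar, here is a sketch under that claim (our reconstruction, following term: factor, ((mul|div), factor)* and the same globals):

void get_term(void) {
    get_factor();
    while (next_token_kind == MUL || next_token_kind == DIV) {
        get_token();   /* MUL or DIV */
        get_factor();
    }
}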

Recognizing a factor

factor: number | (lparen, expr, rparen)

void get_factor(void) {
    switch (next_token_kind) {
    case NUMBER:
        get_token();
        break;
    case LPAREN:
        get_token();
        get_expr();
        consume(RPAREN);
        break;
    default:
        ... error ...
    }
}

Actions

This code does nothing but check the syntax of the input stream. But it is easy to extend it to perform whatever actions are required, for example:
- The action can compute the value of the expression.
- The action can create a tree structure to represent the expression.
- The action can generate code to evaluate the expression.

Actions (2)

We extend get_expr() to return the value of the expression:

int get_expr(void) {
    int val = get_term();
    while (next_token_kind == ADD || next_token_kind == SUB) {
        TokenKind op = next_token_kind;
        get_token();
        if (op == ADD)
            val += get_term();
        else
            val -= get_term();
    }
    return val;
}
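For this to work, get_term() and get_factor() must also return values. The slides do not show these versions; the following sketches are our reconstruction, using next_token_value and consume() from earlier slides:

int get_expr(void);   /* from the slide above */

int get_term(void) {
    int val = get_factor();
    while (next_token_kind == MUL || next_token_kind == DIV) {
        TokenKind op = next_token_kind;
        get_token();
        if (op == MUL)
            val *= get_factor();
        else
            val /= get_factor();   /* no divide-by-zero check in this sketch */
    }
    return val;
}

int get_factor(void) {
    int val = 0;
    switch (next_token_kind) {
    case NUMBER:
        val = next_token_value;    /* the value stored by the tokenizer */
        get_token();
        break;
    case LPAREN:
        get_token();
        val = get_expr();
        consume(RPAREN);
        break;
    default:
        /* ... error ... */
        break;
    }
    return val;
}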

Grammar Manipulation

Suppose we had defined expr this way:

expr: number
    | (lparen, expr, rparen)
    | (expr, add, expr)
    | (expr, sub, expr)
    | (expr, mul, expr)
    | (expr, div, expr)

This description is correct, but we cannot decide which alternative to apply just by looking at the first token of an expression. Therefore we cannot derive a working parser from it using the techniques of recursive descent parsing; we must transform the grammar first.

Left Factoring

Left factoring uses the rule that a,(b|c) = (a,b)|(a,c) to pull out a common initial part of several alternatives, so that it is not repeated. This gives us:

expr: number
    | (lparen, expr, rparen)
    | (expr, ((add, expr) | (sub, expr) | (mul, expr) | (div, expr)))

Left Factoring

We write this more manageably as:

expr: number | (lparen, expr, rparen) | (expr, rest)
rest: (add, expr) | (sub, expr) | (mul, expr) | (div, expr)

Left recursion

expr: number | (lparen, expr, rparen) | (expr, rest)

We cannot derive a working parser from this data description either. The problem is that one of the alternatives for expr starts with expr. If we wrote get_expr() following this grammar, then when the token was anything other than NUMBER or LPAREN, we would immediately call get_expr(). Since we would not have consumed any tokens, the current token would still not be NUMBER or LPAREN, so we would again immediately call get_expr(), and so on.

Left recursion elimination

Consider what our grammar rule will recognize:

NUMBER
or LPAREN expr RPAREN
or NUMBER rest
or LPAREN expr RPAREN rest
or NUMBER rest rest
or LPAREN expr RPAREN rest rest
or ...

We see a pattern here: each form begins with either NUMBER or LPAREN expr RPAREN, and follows with any number of repetitions of rest. So we can rewrite our rule as:

expr: factor, rest*
factor: number | (lparen, expr, rparen)

Left recursion elimination (2)

The general rule is to invent a new nonterminal for the non-left-recursive alternatives:

factor: number | (lparen, expr, rparen)

Then define another new nonterminal as all of the left-recursive alternatives, with the left-recursive nonterminal removed. In this case it's just rest.

Left recursion elimination (3)

Finally, replace the left-recursive rule with one that starts with the new non-left-recursive nonterminal (factor) and ends with zero or more repetitions of the other new nonterminal (just rest in this case). This gives us:

expr: factor, rest*
factor: number | (lparen, expr, rparen)
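Stated in general form, using the slides' own notation (this restatement is ours): a left-recursive rule

a: b | (a, r)

where b stands for the non-left-recursive alternatives and r for what follows the leading a in the left-recursive ones, is replaced by

a: b, r*

For expr, b is number | (lparen, expr, rparen), i.e. factor, and r is rest, which yields expr: factor, rest*.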

Precedence

This data description divides up the input 2 + 3 * 4 as 2, followed by + 3, followed by * 4; that is, (2 + 3) * 4. This would be fine if + and * had the same precedence, but they don't. We want the parser to treat 3 * 4 as a unit. In general, we want any sequence of factors with multiplicative operators between them to be treated as a unit. We call these units terms.

Fixing precedence

We must separate the multiplicative from the additive operators:

term: factor, restterm*
restterm: (mul, term) | (div, term)
expr: term, restexpr*
restexpr: (add, expr) | (sub, expr)

After substituting the definitions of restterm and restexpr for their uses, and some factoring:

term: factor, ((mul|div), term)*
expr: term, ((add|sub), expr)*
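To see that this gets the precedence right, consider 2 + 3 * 4 again: get_expr() first calls get_term(), which consumes the 2 and stops at the +, since + is not a multiplicative operator; get_expr() then consumes the + and calls get_term() again, which consumes all of 3 * 4. The multiplication is therefore grouped as a unit, and the value is 2 + 12 = 14 rather than (2 + 3) * 4 = 20.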

Associativity

When matching an input such as 10 - 1 + 2 against

expr: term, ((add|sub), expr)*

we don't want 1 + 2 to be considered an expr, because that would lead to evaluating 10 - (1 + 2), when what we want is (10 - 1) + 2. We can fix this by changing the grammar to:

term: factor, ((mul|div), factor)*
expr: term, ((add|sub), term)*

Compiler technology

Scanning and parsing are the best understood aspects of compiler technology. They have a large body of theory, much of it developed in the sixties and seventies. Many tools exist for the automatic creation of tokenizers and parsers; two of the best known are the scanner generator lex and the parser generator yacc, which are standard on Unix systems. The theories of scanning and parsing are covered in some detail in 433-255, and may be explored further in 433-361. These units should also introduce tools such as lex and yacc.

Parsing without tokenizing

A separate tokenizer is helpful if parts of the input are to be ignored (e.g. white space, comments) and if the code to check for and parse those parts would otherwise have to be repeated at several points in the program. If all of the input is significant, or if there are only a few places in the grammar where the parts to be ignored occur, we need not have a tokenizer; the parser can view each character as a token.

file: line*
line: name, colon, pw, colon, number, colon, users, nl
users: [user, (comma, user)*]
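A sketch of such a tokenizer-free parser for this grammar (our reconstruction, not from the slides; all function names are assumptions) keeps a one-character lookahead in the same style as next_token_kind above:

#include <stdio.h>

int next_char;                        /* one-character lookahead */

void advance(void) { next_char = getc(stdin); }

void expect(int c) {
    if (next_char == c)
        advance();
    /* else: handle syntax error */
}

void get_name(void) {                 /* run of characters up to a delimiter */
    while (next_char != ':' && next_char != ',' &&
           next_char != '\n' && next_char != EOF)
        advance();
}

void get_users(void) {                /* users: [user, (comma, user)*] */
    if (next_char == '\n') return;    /* the user list may be empty */
    get_name();
    while (next_char == ',') { advance(); get_name(); }
}

void get_pw_line(void) {              /* line: name,colon,pw,colon,number,colon,users,nl */
    get_name();  expect(':');         /* name */
    get_name();  expect(':');         /* pw */
    while (next_char >= '0' && next_char <= '9')   /* number: digit+ */
        advance();
    expect(':');
    get_users();
    expect('\n');
}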