CSE 413 Programming Languages & Implementation
Hal Perkins, Winter 2019
Grammars, Scanners & Regular Expressions

Agenda
- Overview of language recognizers
- Basic concepts of formal grammars
- Scanner theory
  - Regular expressions
  - Finite automata (to recognize regular expressions)
- Scanner implementation

And the point is...
How do we execute this?

    int nPos = 0;
    int k = 0;
    while (k < length) {
        if (a[k] > 0) {
            nPos++;
        }
        k++;
    }

Or, more concretely, how do we program a computer to understand and carry out a computation described in a programming language?

Compilers vs. Interpreters (recall)
- Interpreter: a program that reads a source program and executes that program
- Compiler: a program that translates a program from one language (the source) to another (the target)
For both of these we need to represent the program in some suitable data structure (usually a tree). With MUPL we started with the tree and didn't worry about where it came from.

Interpreter
- Execution engine: program execution interleaved with analysis

    running = true;
    while (running) {
        analyze next statement;
        execute that statement;
    }

- May involve repeated analysis of some statements (loops, functions)
- MUPL was a special case of this: a function to evaluate expressions under a given environment

Compiler
- Reads and analyzes the entire program
- Translates it to a semantically equivalent program in another language, presumably easier to execute or more efficient
- Usually improves the program in some fashion
- Offline process
- Tradeoff: compile-time overhead (a preprocessing step) vs. execution performance

Hybrid approaches
Well-known example: Java
- Compile Java source to byte codes, the Java Virtual Machine language (.class files)
- Execution:
  - Interpret byte codes directly (interpreter in the JVM), or
  - Compile some or all byte codes to native code with a Just-In-Time (JIT) compiler: detect hot spots and compile them on the fly to native code when a method is called

Compiler/Interpreter Structure
First approximation:
- Front end: analysis. Read the source program and understand its structure and meaning.
- Back end: synthesis. Execute, or generate an equivalent target program.

Source → Front End → Back End → Target

Common Issues
Compilers and interpreters both must read the input, a stream of characters, and understand it: analysis

w h i l e ( k < l e n g t h ) { <nl> <tab> i f ( a [ k ] > 0 ) <nl> <tab> <tab> { n P o s + + ; } <nl> <tab> }

Programming Language Specs
Since the 1960s, the syntax of every significant programming language has been specified by a formal grammar. This was first done in 1959 with BNF (Backus-Naur Form, or Backus Normal Form), used to specify the syntax of ALGOL 60, and was adapted from the linguistics community (Chomsky).

Grammar for a Tiny Language

program    ::= statement | program statement
statement  ::= assignstmt | ifstmt
assignstmt ::= id = expr ;
ifstmt     ::= if ( expr ) statement
expr       ::= id | int | expr + expr
id         ::= a | b | c | i | j | k | n | x | y | z
int        ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Context-Free Grammars
Formally, a grammar G is a tuple <N, Σ, P, S> where
- N is a finite set of non-terminal symbols
- Σ is a finite set of terminal symbols
- P is a finite set of productions, a subset of N × (N ∪ Σ)* (think of these as rules N → (N ∪ Σ)*)
- S, the start symbol, is a distinguished element of N; if not otherwise specified, it is usually assumed to be the non-terminal on the left of the first production
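To make the tuple concrete (an instantiation read off the tiny-language grammar above; not on the original slide): N = { program, statement, assignstmt, ifstmt, expr, id, int }; Σ = { if, (, ), =, +, ;, a, b, c, i, j, k, n, x, y, z, 0, 1, ..., 9 }; P is the set of productions just listed; and S = program.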

Productions
The rules of a grammar are called productions. Rules contain:
- Nonterminal symbols: grammar variables (program, statement, id, etc.)
- Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, (, {, ), }, ...)
The meaning of a production nonterminal ::= <sequence of terminals and nonterminals> is that, in a derivation, any instance of the nonterminal can be replaced by the sequence of terminals and nonterminals on the right of the production. Often there are two or more productions for a single nonterminal; any of them can be used at different points in a derivation.

Alternative Notations
There are several common notations for productions; all mean the same thing:

ifstmt ::= if ( expr ) stmt
ifstmt → if ( expr ) stmt
<ifstmt> ::= if ( <expr> ) <stmt>

Example Derivation

program    ::= statement | program statement
statement  ::= assignstmt | ifstmt
assignstmt ::= id = expr ;
ifstmt     ::= if ( expr ) statement
expr       ::= id | int | expr + expr
id         ::= a | b | c | i | j | k | n | x | y | z
int        ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Derive: a = 1 ; if ( a + 1 ) b = 2 ;
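The derivation appeared as a tree on the original slide; written out as one possible leftmost derivation (each step replaces the leftmost nonterminal using one production), it looks like this:

program
⇒ program statement
⇒ statement statement
⇒ assignstmt statement
⇒ id = expr ; statement
⇒ a = expr ; statement
⇒ a = int ; statement
⇒ a = 1 ; statement
⇒ a = 1 ; ifstmt
⇒ a = 1 ; if ( expr ) statement
⇒ a = 1 ; if ( expr + expr ) statement
⇒ a = 1 ; if ( id + expr ) statement
⇒ a = 1 ; if ( a + expr ) statement
⇒ a = 1 ; if ( a + int ) statement
⇒ a = 1 ; if ( a + 1 ) statement
⇒ a = 1 ; if ( a + 1 ) assignstmt
⇒ a = 1 ; if ( a + 1 ) id = expr ;
⇒ a = 1 ; if ( a + 1 ) b = expr ;
⇒ a = 1 ; if ( a + 1 ) b = int ;
⇒ a = 1 ; if ( a + 1 ) b = 2 ;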

Parsing
Parsing: reconstruct the derivation (syntactic structure) of a program. In principle, a single recognizer could work directly from the concrete, character-by-character grammar; in practice this is never done.

Parsing & Scanning
In real compilers the recognizer is split into two phases:
- Scanner: translates input characters to tokens; also reports lexical errors like illegal characters and illegal symbols
- Parser: reads the token stream and reconstructs the derivation
Typically there is a procedural interface: the parser asks the scanner for new tokens when needed.

source → Scanner → tokens → Parser

Scanner Example
Input text:

// this statement does very little
if (x >= y) y = 42;

Token stream:

IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) BECOMES INT(42) SCOLON

- Tokens are atomic items, not character strings
- Comments and whitespace are not tokens in most programming languages
- But sometimes whitespace does matter; examples: Python indentation, Ruby newlines

Parser Example
Input token stream:

IF LPAREN ID(x) GEQ ID(y) RPAREN ID(y) BECOMES INT(42) SCOLON

Abstract syntax tree:

         ifstmt
        /      \
      >=        assign
     /  \      /      \
  ID(x) ID(y) ID(y)  INT(42)

Why Separate the Scanner and Parser?
- Simplicity & separation of concerns: the scanner hides details from the parser (comments, whitespace, etc.), and the parser is easier to build because it has a simpler input stream (tokens)
- Efficiency: the scanner can use a simpler, faster design (but it still often consumes a surprising amount of the compiler's total execution time if you're not careful)

Tokens
Idea: we want a distinct token kind (lexical class) to represent each distinct terminal symbol in the programming language. Examine the grammar to find these.
Some tokens may have attributes. Examples:
- All integer constants are a single kind of token, but the actual value (17, 42, ...) is an attribute of the token
- Identifier tokens carry the actual identifier string as an attribute of the token

Typical Programming Language Tokens
- Operators & punctuation: + - * / ( ) { } [ ] ; : :: < <= == = != ! ... Each of these is a distinct lexical class.
- Keywords: if while for goto return switch void ... Each of these is also a distinct lexical class (not a string).
- Identifiers: a single ID lexical class, but parameterized by the actual identifier
- Integer constants: a single INT lexical class, but parameterized by the integer value
- Other constants (doubles, strings, ...), etc.

Principle of Longest Match
In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice. Example:

return iffy != dowhile;

should be recognized as 5 tokens

RETURN ID(iffy) NEQ ID(dowhile) SCOLON

and not more (i.e., not parts of words or identifiers, and not ! and = as separate tokens).

Formal Languages & Automata Theory (in one slide)
- Alphabet: a finite set of symbols
- String: a finite, possibly empty sequence of symbols from an alphabet
- Language: a set, often infinite, of strings
- Finite specifications of (possibly infinite) languages:
  - Automaton: a recognizer; a machine that accepts all strings in a language (and rejects all other strings)
  - Grammar: a generator; a system for producing all strings in the language (and no other strings)
- A particular language may be specified by many different grammars and automata
- A grammar or automaton specifies only one language
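A quick example (not on the original slide): over the alphabet Σ = {0, 1}, the set of all bit strings ending in 0 is a language. The regular expression (0|1)*0 generates it, a two-state DFA recognizes it, and many other grammars and automata specify this same language.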

Regular Expressions and FAs
The lexical grammar (structure) of most programming languages can be specified with regular expressions. (Not always, e.g., FORTRAN and some others, but we can usually cheat in the unusual corner cases.) Tokens can then be recognized by a deterministic finite automaton (DFA), which can be either table-driven or built by hand based on the lexical grammar.
Facts (er, theorems): any language that can be generated by a regular expression can be recognized by a corresponding DFA, and for every DFA there is a regular expression that generates the language it recognizes.

Regular Expressions
- Defined over some alphabet Σ; for programming languages, commonly ASCII or Unicode
- If re is a regular expression, L(re) is the language (set of strings) generated by re

Fundamental REs

re   L(re)   Notes
a    { a }   Singleton set, for each a in Σ
ε    { ε }   Empty string
∅    { }     Empty language

Operations on REs

re    L(re)        Notes
rs    L(r)L(s)     Concatenation
r|s   L(r) ∪ L(s)  Combination (union)
r*    L(r)*        0 or more occurrences (Kleene closure)

Precedence: * (highest), then concatenation, then | (lowest). Parentheses can be used to group REs as needed.

Abbreviations
The basic operations generate all possible regular expressions, but there are common abbreviations used for convenience. Typical examples:

Abbr.     Meaning        Notes
r+        (rr*)          1 or more occurrences
r?        (r|ε)          0 or 1 occurrence
[a-z]     (a|b|...|z)    1 character in the given range
[abxyz]   (a|b|x|y|z)    1 of the given characters

Examples

re        Meaning
+         single + character
!         single ! character
=         single = character
!=        2-character sequence
<=        2-character sequence
hogwash   7-character sequence

More Examples

re                       Meaning
[abc]+                   1 or more characters, each an a, b, or c
[abc]*                   0 or more characters, each an a, b, or c
[0-9]+                   1 or more digits (integer constants, leading zeros allowed)
[1-9][0-9]*              integer constant with no leading zero
[a-zA-Z][a-zA-Z0-9_]*    typical identifier: a letter, then letters, digits, or underscores

Abbreviations
Many systems allow named abbreviations to make writing and reading definitions easier:

name ::= re

Restriction: abbreviations may not be circular (recursive), either directly or indirectly (otherwise it would no longer be a regular expression; it would be a context-free grammar).

Example
Possible syntax for numeric constants:

digit  ::= [0-9]
digits ::= digit+
number ::= digits ( . digits )? ( [eE] (+|-)? digits )?
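As a quick sanity check (not on the original slides), the same definition can be written with java.util.regex; the class name here is made up for the example:

import java.util.regex.Pattern;

public class NumberRE {
    // number ::= digits ( . digits )? ( [eE] (+|-)? digits )?  with digits = [0-9]+
    private static final Pattern NUMBER =
        Pattern.compile("[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?");

    public static void main(String[] args) {
        System.out.println(NUMBER.matcher("42").matches());        // true
        System.out.println(NUMBER.matcher("3.14").matches());      // true
        System.out.println(NUMBER.matcher("6.02e+23").matches());  // true
        System.out.println(NUMBER.matcher("42.").matches());       // false: digits required after '.'
    }
}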

Recognizing REs
Finite automata can be used to recognize strings generated by regular expressions. They can be built by hand or automatically; the construction is not totally straightforward, but it can be done systematically. Tools like Lex, Flex, and JLex do this automatically from a set of REs read as input. Even if you don't use a finite automaton explicitly, it is a good way to think about the recognition problem.

Finite State Automaton (FSA)
- A finite set of states: one marked as the initial state, one or more marked as final states; states are sometimes labeled or numbered
- A set of transitions from state to state, each labeled with a symbol from Σ or with ε
- Operation: read input symbols (usually characters); a transition can be taken if it is labeled with the current symbol, and an ε-transition can be taken at any time
- Accept when a final state is reached and there is no more input
  - Difference in a scanner: start the scan in the initial state at the previous point in the input; when a final state is reached, recognize the token corresponding to that final state
- Reject if no transition is possible, or if there is no more input and the machine is not in a final state (DFA)

Example: FSA for "cat"

(start) --c--> ( ) --a--> ( ) --t--> ((final))
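Coded directly by hand (a sketch for illustration; the class and method names are made up, not course code), the three-transition machine becomes:

public class CatFSA {
    // States 0..3; state 0 is initial, state 3 is the only final state.
    public static boolean acceptsCat(String input) {
        int state = 0;
        for (char ch : input.toCharArray()) {
            switch (state) {
                case 0: state = (ch == 'c') ? 1 : -1; break;
                case 1: state = (ch == 'a') ? 2 : -1; break;
                case 2: state = (ch == 't') ? 3 : -1; break;
                default: state = -1; break;   // no transitions out of the final state
            }
            if (state == -1) return false;    // reject: no transition possible
        }
        return state == 3;                    // accept iff we end in the final state
    }
}

acceptsCat("cat") returns true; "ca" (input exhausted before the final state) and "cats" (no transition on 's') are rejected.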

DFA vs. NFA
- Deterministic Finite Automaton (DFA): no choice of which transition to take under any condition
- Non-deterministic Finite Automaton (NFA): a choice of transition in at least one case
  - Accept if there is some way to reach a final state on the given input
  - Reject if there is no possible way to reach a final state

FAs in Scanners
We want a DFA for speed (no backtracking). Conversion from regular expressions to an NFA is easy, and there is a well-defined procedure for converting an NFA to an equivalent DFA (the subset construction). See any formal language or compiler textbook for details (RE to NFA to DFA to minimized DFA).
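To make "table-driven" concrete, here is a minimal sketch (assuming just two character classes and the single token INT = [0-9]+ from the DFA example later in the lecture; the names are illustrative, not the course's implementation). It also applies the longest-match rule by remembering the last final state seen:

public class IntDFA {
    // Transition table: rows are states, columns are character classes.
    // Character class 0 = digit, class 1 = any other character; -1 = dead.
    private static final int[][] DELTA = {
        { 1, -1 },   // state 0: start
        { 1, -1 },   // state 1: one or more digits seen (accepting)
    };
    private static final boolean[] ACCEPTING = { false, true };

    // Returns the length of the longest INT token starting at 'start', or 0 if none.
    public static int longestIntMatch(String input, int start) {
        int state = 0;
        int lastAccept = 0;   // length of the longest accepted prefix so far
        for (int i = start; i < input.length(); i++) {
            int cls = Character.isDigit(input.charAt(i)) ? 0 : 1;
            state = DELTA[state][cls];
            if (state < 0) break;                        // no transition: stop scanning
            if (ACCEPTING[state]) lastAccept = i - start + 1;
        }
        return lastAccept;    // in effect, back up to the last final state visited
    }
}

For example, longestIntMatch("42;", 0) returns 2, so a scanner would consume "42" as INT(42) and leave ';' for the next call.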

Example: DFA for a Hand-Written Scanner
Idea: show a hand-written DFA for some typical programming language constructs, then use the DFA to construct a hand-written scanner.
Setting:
- The scanner is called whenever the parser needs a new token
- The scanner remembers the current position in the input file
- Starting there, it uses a DFA to recognize the longest possible input sequence that makes up a token, updates the current position, and returns that token

Scanner DFA Example (1)
State 0 is the start state; it loops on whitespace and comments, then:
- end of input → state 1: accept EOF
- ( → state 2: accept LPAREN
- ) → state 3: accept RPAREN
- ; → state 4: accept SCOLON

Scanner DFA Example (2)
- ! → state 5; from there, = → state 6: accept NEQ; other → state 7: accept NOT
- < → state 8; from there, = → state 9: accept LEQ; other → state 10: accept LESS

Scanner DFA Example (3)
- [0-9] → state 11; [0-9] loops back to state 11; other → state 12: accept INT

Scanner DFA Example (4)
- [a-zA-Z] → state 13; [a-zA-Z0-9_] loops back to state 13; other → state 14: accept ID or keyword

Strategies for handling identifiers vs. keywords:
- Hand-written scanner: look up identifier-like things in a table of keywords to classify them (a good application of perfect hashing)
- Machine-generated scanner: generate a DFA with appropriate transitions to recognize keywords; lots of states, but efficient (no extra lookup step)
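A minimal sketch of the hand-written lookup strategy (using a plain HashMap rather than perfect hashing; the kinds Token.IF and Token.RETURN are assumed additions to the Token class shown on the next slide):

import java.util.HashMap;
import java.util.Map;

// Sketch: classify an identifier-like lexeme as a keyword or a plain ID.
public class KeywordTable {
    private static final Map<String, Integer> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", Token.IF);          // assumed token kind
        KEYWORDS.put("while", Token.WHILE);
        KEYWORDS.put("return", Token.RETURN);  // assumed token kind
        // ... one entry per keyword in the language
    }

    // Returns the keyword's token kind, or Token.ID if s is not a keyword.
    public static int getKind(String s) {
        return KEYWORDS.getOrDefault(s, Token.ID);
    }
}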

Implementing a Scanner by Hand: Token Representation
A token is a simple, tagged structure. Something like:

public class Token {
    public int kind;       // token's lexical class
    public int intVal;     // integer value if class = INT
    public String id;      // actual identifier if class = ID

    // lexical classes (should really be an enum type)
    public static final int EOF = 0;     // end-of-file token
    public static final int ID = 1;      // identifier, not keyword
    public static final int INT = 2;     // integer
    public static final int LPAREN = 4;
    public static final int SCOLON = 5;
    public static final int WHILE = 6;
    // etc. etc. etc.
    // but use enums if you've got 'em
}

Simple Scanner Example

// global state and methods
static char nextch;    // next unprocessed input character

// advance to the next input character
void getch() { ... }

// skip whitespace and comments
void skipwhitespace() { ... }

Scanner gettoken() pseudocode

// return next input token
public Token gettoken() {
    Token result;
    skipwhitespace();
    if (no more input) {
        result = new Token(Token.EOF);
        return result;
    }
    switch (nextch) {
        case '(': result = new Token(Token.LPAREN); getch(); return result;
        case ')': result = new Token(Token.RPAREN); getch(); return result;
        case ';': result = new Token(Token.SCOLON); getch(); return result;
        // etc.

gettoken() (2)

        case '!':    // ! or !=
            getch();
            if (nextch == '=') {
                result = new Token(Token.NEQ); getch(); return result;
            } else {
                result = new Token(Token.NOT); return result;
            }
        case '<':    // < or <=
            getch();
            if (nextch == '=') {
                result = new Token(Token.LEQ); getch(); return result;
            } else {
                result = new Token(Token.LESS); return result;
            }
        // etc.

gettoken() (3)

        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            // integer constant
            String num = "" + nextch;
            getch();
            while (nextch is a digit) {
                num = num + nextch;
                getch();
            }
            result = new Token(Token.INT, Integer.parseInt(num));
            return result;

gettoken() (4)

        case 'a': ... case 'z':
        case 'A': ... case 'Z':
            // id or keyword
            String s = "" + nextch;
            getch();
            while (nextch is a letter, digit, or underscore) {
                s = s + nextch;
                getch();
            }
            if (s is a keyword) {
                result = new Token(keywordTable.getKind(s));
            } else {
                result = new Token(Token.ID, s);
            }
            return result;
    }
}

Alternatives
- Use a tool to build the scanner from the (regexp) grammar; this can often be more efficient than hand-coded!
- Build an ad-hoc scanner using the regular expression package in the implementation language (Ruby, Perl, Java, and many others). We suggest you use this for our project (a good excuse to learn the Ruby regexp package).
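The course suggests Ruby's regexp package for the project; purely to illustrate the same ad-hoc approach, here is a sketch in Java (the class name, token set, and pattern are assumptions for this example, not the project specification). Note the alternation lists <= before < so the regex engine picks the longest match:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of an ad-hoc regex-based scanner (illustrative only).
public class RegexScanner {
    private static final Pattern TOKEN = Pattern.compile(
        "\\s+"                                  // whitespace (skipped)
      + "|//[^\\n]*"                            // comment (skipped)
      + "|(?<INT>[0-9]+)"                       // integer constant
      + "|(?<ID>[a-zA-Z][a-zA-Z0-9_]*)"         // identifier (or keyword, classified later)
      + "|(?<OP><=|>=|!=|==|[-+*/=<>!;(){}])"   // operators & punctuation, longest first
    );

    public static void scan(String src) {
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            if (m.group("INT") != null)      System.out.println("INT(" + m.group() + ")");
            else if (m.group("ID") != null)  System.out.println("ID(" + m.group() + ")");
            else if (m.group("OP") != null)  System.out.println("OP(" + m.group() + ")");
            // whitespace and comments produce no tokens; a real scanner would
            // also report characters that match nothing as lexical errors
        }
    }

    public static void main(String[] args) {
        scan("if (x >= y) y = 42;   // example from the slides");
    }
}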