CSE 401 Compilers. Languages, Automata, Regular Expressions & Scanners. Hal Perkins, Winter 2015. UW CSE 401 Winter 2015, B-1



Administrivia
- Read: textbook ch. 1 and sec. 2.1-2.4
- First homework: out by end of the week, due next Thursday
  - Written problems on this week's material
- If you haven't already, please
  - Fill in the office hour doodle on the website
  - Post a followup on the discussion board
  - Pick a project partner
    - We'll post a link to a Catalyst form for ONE of you to send in partner info

Agenda
- Quick review of basic concepts of formal grammars
- Regular expressions
- Lexical specification of programming languages
- Using finite automata to recognize regular expressions
- Scanners and tokens
- General ideas in lecture, then examples, details, and compiler applications in sections

Programming Language Specs
- Since the 1960s, the syntax of every significant programming language has been specified by a formal grammar
  - First done in 1959 with BNF (Backus-Naur Form), used to specify ALGOL 60 syntax
  - Borrowed from the linguistics community (Chomsky)

Formal Languages & Automata Theory (a review in one slide)
- Alphabet: a finite set of symbols and characters
- String: a finite, possibly empty sequence of symbols from an alphabet
- Language: a set of strings (possibly empty or infinite)
- Finite specifications of (possibly infinite) languages
  - Automaton: a recognizer; a machine that accepts all strings in a language (and rejects all other strings)
  - Grammar: a generator; a system for producing all strings in the language (and no other strings)
- A particular language may be specified by many different grammars and automata
- A grammar or automaton specifies only one language

Language (Chomsky) hierarchy: quick reminder
- Regular (Type-3) languages are specified by regular expressions/grammars and finite automata (FSAs)
  - Specs and implementation of scanners
- Context-free (Type-2) languages are specified by context-free grammars and pushdown automata (PDAs)
  - Specs and implementation of parsers
- Context-sensitive (Type-1) languages aren't too important
- Recursively-enumerable (Type-0) languages are specified by general grammars and Turing machines
(Figure: nested language classes, from outermost to innermost: Turing, CSL, CFL, Regular)

Example: Grammar for a Tiny Language

    program    ::= statement | program statement
    statement  ::= assignstmt | ifstmt
    assignstmt ::= id = expr ;
    ifstmt     ::= if ( expr ) statement
    expr       ::= id | int | expr + expr
    id         ::= a | b | c | i | j | k | n | x | y | z
    int        ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Exercise: Derive a simple program

    program    ::= statement | program statement
    statement  ::= assignstmt | ifstmt
    assignstmt ::= id = expr ;
    ifstmt     ::= if ( expr ) statement
    expr       ::= id | int | expr + expr
    id         ::= a | b | c | i | j | k | n | x | y | z
    int        ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Program to derive:

    a = 1 ; if ( a + 1 ) b = 2 ;

Productions
- The rules of a grammar are called productions
- Rules contain
  - Nonterminal symbols: grammar variables (program, statement, id, etc.)
  - Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, =, (, ), ...)
- Meaning of
    nonterminal ::= <sequence of terminals and nonterminals>
  - In a derivation, an instance of nonterminal can be replaced by the sequence of terminals and nonterminals on the right of the production
- Often there are several productions for a nonterminal; any of them can be chosen in different parts of a derivation

Alternative Notations
- There are several syntax notations for productions in common use; all mean the same thing

    ifstmt ::= if ( expr ) statement
    ifstmt → if ( expr ) statement
    <ifstmt> ::= if ( <expr> ) <statement>

Parsing
- Parsing: reconstruct the derivation (syntactic structure) of a program
- In principle, a single recognizer could work directly from a concrete, character-by-character grammar
- In practice this is never done

Parsing & Scanning
- In real compilers the recognizer is split into two phases
  - Scanner: translate input characters to tokens
    - Also, report lexical errors like illegal characters and illegal symbols
  - Parser: read token stream and reconstruct the derivation
(Diagram: source → Scanner → tokens → Parser)

Why Separate the Scanner and Parser?
- Simplicity & separation of concerns
  - Scanner hides details from parser (comments, whitespace, input files, etc.)
  - Parser is easier to build; has simpler input stream (tokens) / narrow interface
- Efficiency
  - Scanner recognizes regular expressions, a proper subset of context-free grammars
  - (But still often consumes a surprising amount of the compiler's total execution time)

But ...
- Not always possible to separate cleanly
- Example: C/C++/Java type vs identifier
  - Parser would like to know which names are types and which are identifiers, but
  - Scanner doesn't know how things are declared
- So we hack around it somehow
  - Either use simpler grammar and disambiguate later, or communicate between scanner & parser
  - Engineering issue: try to keep interfaces as simple & clean as possible
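One way to picture the "communicate between scanner & parser" option is a shared table of declared type names that the scanner consults when it classifies a name. The sketch below is only an illustration under that assumption: the class `TypeNameFeedback`, its `Kind` enum, and both methods are hypothetical names, not part of the course code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the parser records declared type names and the
// scanner looks them up to classify identifier-like lexemes.
final class TypeNameFeedback {
    enum Kind { TYPE_NAME, ID }

    // Shared table; in this sketch the parser is assumed to fill it in.
    private final Set<String> typeNames = new HashSet<>();

    // Called by the (imagined) parser when it sees a type declaration.
    void declareType(String name) {
        typeNames.add(name);
    }

    // Called by the scanner after it has read an identifier-like lexeme.
    Kind classify(String lexeme) {
        return typeNames.contains(lexeme) ? Kind.TYPE_NAME : Kind.ID;
    }

    public static void main(String[] args) {
        TypeNameFeedback table = new TypeNameFeedback();
        System.out.println(table.classify("Point")); // not declared yet
        table.declareType("Point");
        System.out.println(table.classify("Point")); // now known as a type
    }
}
```

The cost of this design is exactly the interface issue the slide warns about: the scanner's output now depends on parser state.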

Typical Tokens in Programming Languages
- Operators & punctuation
  - + - * / ( ) { } [ ] ; : :: < <= == = != !
  - Each of these is a distinct lexical class
- Keywords
  - if while for goto return switch void
  - Each of these is also a distinct lexical class (not a string)
- Identifiers
  - A single ID lexical class, but parameterized by actual id
- Integer constants
  - A single INT lexical class, but parameterized by int value
- Other constants, etc.

Principle of Longest Match
- In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice
- Example
    return maybe != iffy;
  should be recognized as 5 tokens
    RETURN ID(maybe) NEQ ID(iffy) SCOLON
  i.e., != is one token, not two; iffy is an ID, not IF followed by ID(fy)
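The longest-match rule can be seen in a few lines of code. The following is a minimal sketch, not the course scanner: the class `LongestMatchDemo` and its `scan` method are made up for illustration, and only the handful of tokens from the example are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal longest-match sketch covering only the tokens in the example.
final class LongestMatchDemo {
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetter(c)) {
                // Consume the whole word before deciding what it is.
                int start = i;
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) {
                    i++;
                }
                String word = src.substring(start, i);
                if (word.equals("return")) tokens.add("RETURN");
                else if (word.equals("if")) tokens.add("IF");
                else tokens.add("ID(" + word + ")");
            } else if (c == '!') {
                // Prefer the two-character token != over a lone !.
                if (i + 1 < src.length() && src.charAt(i + 1) == '=') {
                    tokens.add("NEQ");
                    i += 2;
                } else {
                    tokens.add("NOT");
                    i++;
                }
            } else if (c == ';') {
                tokens.add("SCOLON");
                i++;
            } else {
                throw new IllegalArgumentException("illegal character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [RETURN, ID(maybe), NEQ, ID(iffy), SCOLON]
        System.out.println(scan("return maybe != iffy;"));
    }
}
```

Because the word is consumed in full before it is compared with the keywords, `iffy` can never be split into IF followed by ID(fy).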

Lexical Complications
- Most modern languages are free-form
  - Layout doesn't matter
  - Whitespace separates tokens
- Alternatives
  - Fortran: line oriented
  - Haskell, Python: indentation and layout can imply grouping
- And other confusions
  - In C++ or Java, is >> a shift operator or the end of two nested templates or generic classes?
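The `>>` question is easy to reproduce in Java, where the same two characters appear in both roles. The class and method names below are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// The characters ">>" play two different roles in this one small class.
final class ShiftOrGenerics {
    // Here ">>" closes two nested generic type argument lists.
    static List<List<String>> nested() {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        return result;
    }

    // Here ">>" is the right-shift operator.
    static int shifted(int x) {
        return x >> 2;
    }

    public static void main(String[] args) {
        System.out.println(nested().size() + " " + shifted(64)); // prints "1 16"
    }
}
```

A scanner that applies longest match blindly would hand the parser a single `>>` token in both places, so the two phases have to cooperate to read the first case as two closing brackets.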

Regular Expressions and FAs
- The lexical grammar (structure) of most programming languages can be specified with regular expressions
  - (Sometimes a little cheating is needed)
- Tokens can be recognized by a deterministic finite automaton
  - Can be either table-driven or built by hand based on lexical grammar

Regular Expressions
- Defined over some alphabet Σ
  - For programming languages, alphabet is usually ASCII or Unicode
- If re is a regular expression, L(re) is the language (set of strings) generated by re

Fundamental REs

    re    L(re)    Notes
    a     { a }    Singleton set, for each a in Σ
    ε     { ε }    Empty string
    ∅     { }      Empty language

Operations on REs

    re     L(re)          Notes
    rs     L(r)L(s)       Concatenation
    r|s    L(r) ∪ L(s)    Combination (union)
    r*     L(r)*          0 or more occurrences (Kleene closure)

- Precedence: * (highest), concatenation, | (lowest)
- Parentheses can be used to group REs as needed

Examples

    re             Meaning
    +              single + character
    !              single ! character
    =              single = character
    !=             2 character sequence "!="
    xyzzy          5 character sequence "xyzzy"
    (1|0)*         0 or more binary digits
    (1|0)(1|0)*    1 or more binary digits
    0|1(0|1)*      sequence of binary digits with no leading 0's, except for 0 itself
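The binary-digit examples can be checked with `java.util.regex`, whose syntax for concatenation, `|` and `*` matches the notation in the table. This is a small sketch; the class and constant names are invented for illustration.

```java
import java.util.regex.Pattern;

// Checks two of the binary-digit examples against sample strings.
final class BinaryNumberRegex {
    // (1|0)(1|0)* : one or more binary digits.
    static final Pattern ONE_OR_MORE = Pattern.compile("(1|0)(1|0)*");
    // 0|1(0|1)* : binary digits with no leading 0s, except for 0 itself.
    static final Pattern NO_LEADING_ZERO = Pattern.compile("0|1(0|1)*");

    // True only if the whole string is in the language of the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(ONE_OR_MORE, "0110"));     // true
        System.out.println(matches(NO_LEADING_ZERO, "0110")); // false: leading 0
        System.out.println(matches(NO_LEADING_ZERO, "0"));    // true: 0 itself
        System.out.println(matches(NO_LEADING_ZERO, "101"));  // true
    }
}
```

Note how precedence shows up in `0|1(0|1)*`: because `|` binds loosest, the expression reads as "0" or "1 followed by any binary digits".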

Abbreviations
- The basic operations generate all possible regular expressions, but there are common abbreviations used for convenience. Some examples:

    Abbr.      Meaning         Notes
    r+         (rr*)           1 or more occurrences
    r?         (r|ε)           0 or 1 occurrence
    [a-z]      (a|b|...|z)     1 character in given range
    [abxyz]    (a|b|x|y|z)     1 of the given characters

More Examples

    re                        Meaning
    [abc]+
    [abc]*
    [0-9]+
    [1-9][0-9]*
    [a-zA-Z][a-zA-Z0-9_]*

Abbreviations
- Many systems allow abbreviations to make writing and reading definitions or specifications easier
    name ::= re
- Restriction: abbreviations may not be circular (recursive) either directly or indirectly (else would be non-regular)

Example
- Possible syntax for numeric constants

    digit  ::= [0-9]
    digits ::= digit+
    number ::= digits ( . digits )? ( [eE] ( + | - )? digits )?

- How would you describe this set in English?
- What are some examples of legal constants (strings) generated by number?
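Since named abbreviations are not recursive, they can be expanded by plain string substitution into one regular expression. The sketch below does that with `java.util.regex`; the class name and the sample strings are illustrative choices, and the literal `.` is escaped because `.` is a wildcard in Java's regex syntax.

```java
import java.util.regex.Pattern;

// Expands the named abbreviations textually, then tests sample strings.
final class NumberConstants {
    // Each name is substituted into the next definition; none refers to itself.
    static final String DIGIT = "[0-9]";
    static final String DIGITS = DIGIT + "+";
    static final String NUMBER =
        DIGITS + "(\\." + DIGITS + ")?([eE][+-]?" + DIGITS + ")?";

    static final Pattern NUMBER_PATTERN = Pattern.compile(NUMBER);

    // True only if the whole string is generated by number.
    static boolean isNumber(String s) {
        return NUMBER_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "3.14", "6.02e+23", "1E5", ".5", "1.", "1e"}) {
            System.out.println(s + " -> " + isNumber(s));
        }
    }
}
```

The first four samples are accepted. The last three are rejected because the definition requires digits before the point, after the point, and after the exponent marker.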

Recognizing REs
- Finite automata can be used to recognize strings generated by regular expressions
- Can build by hand or automatically
  - Reasonably straightforward, and can be done systematically
  - Tools like Lex, Flex, JFlex et seq. do this automatically, given a set of REs

Finite State Automaton
- A finite set of states
  - One marked as initial state
  - One or more marked as final states
  - States sometimes labeled or numbered
- A set of transitions from state to state
  - Each labeled with symbol from Σ, or ε
- Operate by reading input symbols (usually characters)
  - Transition can be taken if labeled with current symbol
  - ε-transition can be taken at any time
- Accept when final state reached & no more input
  - Slightly different in a scanner, where the FSA is a subroutine that accepts the longest input string matching a token regular expression, starting at the current location in the input
- Reject if no transition possible, or no more input and not in final state (DFA)

Example: FSA for cat
(Diagram: a chain of states from the initial state to a final state, with transitions labeled c, a, t)
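The accept and reject rules for the cat automaton can be written out directly. This is a minimal sketch that assumes the states are numbered 0 (initial) through 3 (final); the numbering and the class name are not from the slides.

```java
// Simulates the FSA for "cat" with states 0 (initial) to 3 (final).
final class CatDfa {
    static boolean accepts(String input) {
        int state = 0;
        for (char ch : input.toCharArray()) {
            if (state == 0 && ch == 'c') state = 1;
            else if (state == 1 && ch == 'a') state = 2;
            else if (state == 2 && ch == 't') state = 3;
            else return false; // no transition possible: reject
        }
        // Accept only if the input ended in the final state.
        return state == 3;
    }

    public static void main(String[] args) {
        System.out.println(accepts("cat"));  // true
        System.out.println(accepts("ca"));   // false: input ended early
        System.out.println(accepts("cats")); // false: no transition on s
    }
}
```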

DFA vs NFA
- Deterministic Finite Automata (DFA)
  - No choice of which transition to take under any condition
  - No ε transitions (arcs)
- Non-deterministic Finite Automata (NFA)
  - Choice of transition in at least one case
  - Accept if some way to reach a final state on given input
  - Reject if no possible way to final state
  - i.e., may need to guess right path or backtrack
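Instead of guessing or backtracking, an NFA can also be simulated by tracking every state it could be in at once. The sketch below takes that approach for a small, made-up NFA that accepts strings over {a, b} ending in "ab". The NFA itself, the `EPS` marker, and all names are assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Simulates an NFA by tracking the set of states it could be in.
final class NfaSimulation {
    static final char EPS = '\0'; // stands in for an epsilon label in this sketch

    // Transitions as {from, label, to}. State 0 is initial, state 3 is final.
    static final int[][] DELTA = {
        {0, EPS, 1}, {1, 'a', 1}, {1, 'b', 1}, {1, 'a', 2}, {2, 'b', 3}
    };
    static final int FINAL = 3;

    // Adds every state reachable through epsilon transitions alone.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] t : DELTA) {
                if (t[1] == EPS && result.contains(t[0]) && result.add(t[2])) {
                    changed = true;
                }
            }
        }
        return result;
    }

    // Accepts if at least one path through the NFA ends in the final state.
    static boolean accepts(String input) {
        Set<Integer> current = closure(Set.of(0));
        for (char ch : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int[] t : DELTA) {
                if (t[1] == ch && current.contains(t[0])) next.add(t[2]);
            }
            current = closure(next);
        }
        return current.contains(FINAL);
    }

    public static void main(String[] args) {
        System.out.println(accepts("aab")); // true
        System.out.println(accepts("aba")); // false
    }
}
```

State 1 has two arcs labeled a, which is the "choice of transition" that makes this automaton non-deterministic.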

FAs in Scanners
- Want DFA for speed (no backtracking)
- But conversion from regular expressions to NFA is easy
- Fortunately, there is a well-defined procedure for converting an NFA to an equivalent DFA (subset construction)

From RE to NFA: base cases
(Diagram: two states joined by a single transition labeled a)

From RE to NFA: r s
(Diagram: the NFA for r followed by the NFA for s)

From RE to NFA: r | s
(Diagram: a choice between the NFA for r and the NFA for s)

From RE to NFA: r*
(Diagram: the NFA for r, with arcs that allow zero or more repetitions)

Exercise
- Draw the NFA for: b(at|ag) | bug

From NFA to DFA
- Subset construction
  - Construct a DFA from the NFA, where each DFA state represents a set of NFA states
- Key idea
  - State of the DFA after reading some input is the set of all NFA states that could have been reached after reading the same input
- Algorithm: example of a fixed-point computation
- If NFA has n states, DFA has at most 2^n states
  - => DFA is finite, can construct in finite # steps
- Resulting DFA may have more states than needed
  - See books for construction and minimization details
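The key idea can be sketched as a worklist loop that keeps creating DFA states (sets of NFA states) until no new set appears. To stay short, the sketch below assumes a hand-built NFA for b(at|ag) | bug that has no ε arcs, so the ε-closure step of the full algorithm is left out. The state numbering and all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Subset construction for an epsilon-free NFA (a simplifying assumption).
final class SubsetConstruction {
    // Hand-built NFA for b(at|ag) | bug as {from, label, to}; state 0 is initial.
    static final int[][] NFA = {
        {0, 'b', 1}, {1, 'a', 2}, {2, 't', 3},   // b a t
        {0, 'b', 4}, {4, 'a', 5}, {5, 'g', 6},   // b a g
        {0, 'b', 7}, {7, 'u', 8}, {8, 'g', 9}    // b u g
    };
    static final Set<Integer> NFA_FINAL = Set.of(3, 6, 9);
    static final char[] ALPHABET = {'a', 'b', 'g', 't', 'u'};

    // Each DFA state is a set of NFA states; the map holds its outgoing arcs.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build() {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> worklist = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>(Set.of(0));
        dfa.put(start, new HashMap<>());
        worklist.add(start);
        while (!worklist.isEmpty()) { // fixed point: stop when no new sets appear
            Set<Integer> current = worklist.remove();
            for (char symbol : ALPHABET) {
                Set<Integer> target = new TreeSet<>();
                for (int[] t : NFA) {
                    if (t[1] == symbol && current.contains(t[0])) target.add(t[2]);
                }
                if (target.isEmpty()) continue; // no arc: the DFA rejects here
                dfa.get(current).put(symbol, target);
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new HashMap<>());
                    worklist.add(target);
                }
            }
        }
        return dfa;
    }

    // Runs the constructed DFA; a DFA state is final if it contains a final NFA state.
    static boolean accepts(Map<Set<Integer>, Map<Character, Set<Integer>>> dfa, String input) {
        Set<Integer> state = new TreeSet<>(Set.of(0));
        for (char ch : input.toCharArray()) {
            state = dfa.get(state).get(ch);
            if (state == null) return false;
        }
        for (int s : state) {
            if (NFA_FINAL.contains(s)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = build();
        System.out.println(dfa.size() + " DFA states"); // 7, far below the 2^10 bound
        System.out.println(accepts(dfa, "bag"));        // true
    }
}
```

For this NFA the three b arcs out of state 0 collapse into the single DFA state {1, 4, 7}, and the two a arcs collapse into {2, 5}.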

Exercise
- Build DFA for b(at|ag) | bug, given the NFA

To Tokens
- A scanner is a DFA that finds the next token each time it is called
- Every final state of a DFA emits (returns) a token
- Tokens are the internal compiler names for the lexemes
    == becomes EQUAL
    ( becomes LPAREN
    while becomes WHILE
    xyzzy becomes ID(xyzzy)
- You choose the names
- Also, there may be additional data: \r\n might count lines; all tokens might include line #

DFA => Code
- Option 1: Implement by hand using procedures
  - one procedure for each token
  - each procedure reads one character
  - choices implemented using if and switch statements
- Pros
  - straightforward to write
  - fast
- Cons
  - a fair amount of tedious work
  - may have subtle differences from the language specification

DFA => Code [continued]
- Option 1a: Like option 1, but structured as a single procedure with multiple return points
  - choices implemented using if and switch statements
- Pros
  - also straightforward to write
  - faster
- Cons
  - a fair amount of tedious work
  - may have subtle differences from the language specification

DFA => Code [continued]
- Option 2: use tool to generate table-driven scanner
  - Rows: states of DFA
  - Columns: input characters
  - Entries: action
    - Go to next state
    - Accept token, go to start state
    - Error
- Pros
  - Convenient
  - Exactly matches specification, if tool generated
- Cons
  - Magic
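A table-driven recognizer is small enough to write out by hand for two token classes. The sketch below is an assumption-laden illustration, not generated scanner output: it uses character classes as columns (rather than one column per character) to keep the table small, and the state numbers, `ERR` marker, and names are all invented here.

```java
// Hand-written table in the style a generator might produce, for ID and INT only.
final class TableDrivenDfa {
    static final int ERR = -1;

    // Columns: 0 = letter, 1 = digit, 2 = underscore, 3 = anything else.
    static int classOf(char ch) {
        if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')) return 0;
        if (ch >= '0' && ch <= '9') return 1;
        if (ch == '_') return 2;
        return 3;
    }

    // Rows: DFA states. Entries: next state, or ERR if there is no transition.
    static final int[][] TABLE = {
        /* 0 start  */ {1,   2, ERR, ERR},
        /* 1 in ID  */ {1,   1, 1,   ERR},
        /* 2 in INT */ {ERR, 2, ERR, ERR},
    };

    // Token accepted in each state; null means the state is not final.
    static final String[] ACCEPT = {null, "ID", "INT"};

    // Returns the longest token starting at 'start', or null on a lexical error.
    static String nextToken(String input, int start) {
        int state = 0;
        int pos = start;
        while (pos < input.length()) {
            int next = TABLE[state][classOf(input.charAt(pos))];
            if (next == ERR) break;
            state = next;
            pos++;
        }
        if (ACCEPT[state] == null) return null;
        return ACCEPT[state] + "(" + input.substring(start, pos) + ")";
    }

    public static void main(String[] args) {
        System.out.println(nextToken("count1 = 42", 0)); // ID(count1)
        System.out.println(nextToken("x = 42", 4));      // INT(42)
    }
}
```

Changing the language means editing only `TABLE` and `ACCEPT`; the driver loop stays the same, which is what makes this form a good target for a generator.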

DFA => Code [continued]
- Option 2a: use tool to generate scanner
  - Transitions embedded in the code
  - Choices use conditional statements, loops
- Pros
  - Convenient
  - Exactly matches specification, if tool generated
- Cons
  - Magic
  - Lots of code: big but potentially quite fast
  - Would never write something like this by hand, but can generate it easily enough

Example: DFA for hand-written scanner
- Idea: show a hand-written DFA for some typical programming language constructs
  - Then use to construct hand-written scanner
- Setting: Scanner is called whenever the parser needs a new token
  - Scanner stores current position in input
  - From there, use a DFA to recognize the longest possible input sequence that makes up a token and return that token; save updated position for next time
- Disclaimer: Example for illustration only; you'll use tools for the course project

Scanner DFA Example (1)
(Diagram, written as transitions from start state 0)
- 0 -- whitespace or comments --> 0
- 0 -- end of input --> 1   Accept EOF
- 0 -- ( --> 2   Accept LPAREN
- 0 -- ) --> 3   Accept RPAREN
- 0 -- ; --> 4   Accept SCOLON

Scanner DFA Example (2)
(Diagram, written as transitions)
- 0 -- ! --> 5
- 5 -- = --> 6   Accept NEQ
- 5 -- [other] --> 7   Accept NOT
- 0 -- < --> 8
- 8 -- = --> 9   Accept LEQ
- 8 -- [other] --> 10   Accept LESS

Scanner DFA Example (3)
(Diagram, written as transitions)
- 0 -- [0-9] --> 11
- 11 -- [0-9] --> 11
- 11 -- [other] --> 12   Accept INT

Scanner DFA Example (4)
(Diagram, written as transitions)
- 0 -- [a-zA-Z] --> 13
- 13 -- [a-zA-Z0-9_] --> 13
- 13 -- [other] --> 14   Accept ID or keyword

Strategies for handling identifiers vs keywords
- Hand-written scanner: look up identifier-like things in table of keywords to classify (good application of perfect hashing)
- Machine-generated scanner: generate DFA with appropriate transitions to recognize keywords
  - Lots o' states, but efficient (no extra lookup step)
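The hand-written strategy amounts to one table lookup after the identifier-like lexeme has been read. In the sketch below an ordinary `Map` stands in for the perfect hash table the slide mentions; the keyword set, the string token names, and the class name are assumptions for illustration.

```java
import java.util.Map;

// Classifies an identifier-like lexeme as a keyword token or a plain ID.
final class KeywordLookup {
    // An ordinary map stands in for a perfect hash table in this sketch.
    static final Map<String, String> KEYWORDS = Map.of(
        "if", "IF",
        "while", "WHILE",
        "return", "RETURN"
    );

    // Keywords get their own lexical class; everything else is an ID.
    static String classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, "ID");
    }

    public static void main(String[] args) {
        System.out.println(classify("while")); // WHILE
        System.out.println(classify("iffy"));  // ID
    }
}
```

The lookup happens only after longest match has fixed the lexeme, so `iffy` is looked up as a whole and never confused with `if`.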

Implementing a Scanner by Hand: Token Representation
- A token is a simple, tagged structure

    public class Token {
        public int kind;    // token's lexical class
        public int intval;  // integer value if class = INT
        public String id;   // actual identifier if class = ID

        // lexical classes
        public static final int EOF = 0;    // "end of file" token
        public static final int ID = 1;     // identifier, not keyword
        public static final int INT = 2;    // integer
        public static final int LPAREN = 4;
        public static final int SCOLON = 5;
        public static final int WHILE = 6;
        // etc. etc. etc.
    }

Simple Scanner Example

    // global state and methods
    static char nextch;  // next unprocessed input character

    // advance to next input char
    void getch() { ... }

    // skip whitespace and comments
    void skipwhitespace() { ... }

Scanner gettoken() method

    // return next input token
    public Token gettoken() {
        Token result;

        skipwhitespace();

        if (no more input) {   // pseudocode condition
            result = new Token(Token.EOF); return result;
        }

        switch (nextch) {
            case '(': result = new Token(Token.LPAREN); getch(); return result;
            case ')': result = new Token(Token.RPAREN); getch(); return result;
            case ';': result = new Token(Token.SCOLON); getch(); return result;
            // etc.

gettoken() (2)

            case '!': // ! or !=
                getch();
                if (nextch == '=') {
                    result = new Token(Token.NEQ); getch(); return result;
                } else {
                    result = new Token(Token.NOT); return result;
                }

            case '<': // < or <=
                getch();
                if (nextch == '=') {
                    result = new Token(Token.LEQ); getch(); return result;
                } else {
                    result = new Token(Token.LESS); return result;
                }
            // etc.

gettoken() (3)

            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                // integer constant
                String num = String.valueOf(nextch);
                getch();
                while (nextch >= '0' && nextch <= '9') {
                    num = num + nextch; getch();
                }
                result = new Token(Token.INT, Integer.parseInt(num));
                return result;

gettoken() (4)

            case 'a': ... case 'z':
            case 'A': ... case 'Z':
                // id or keyword
                String s = String.valueOf(nextch);
                getch();
                while ((nextch >= 'a' && nextch <= 'z') || (nextch >= 'A' && nextch <= 'Z')
                        || (nextch >= '0' && nextch <= '9') || nextch == '_') {
                    s = s + nextch; getch();
                }
                if (s is a keyword) {   // pseudocode condition
                    result = new Token(keywordTable.getKind(s));
                } else {
                    result = new Token(Token.ID, s);
                }
                return result;

MiniJava Scanner Generation
- We'll use the jflex tool to automatically create a scanner from a specification file
- We'll use the CUP tool to automatically create a parser from a specification file
- Token class is shared by jflex and CUP. Lexical classes are listed in CUP's input file and it generates the token class definition.
- Details in next week's section

Coming Attractions
- First homework: paper exercises on regular expressions, automata, etc.
- Then: first part of the compiler assignment, the scanner
- Next topic: parsing
  - Will do LR parsing first (we need this for the project), then LL (recursive-descent) parsing, which you should also know