EDAN65: Compilers, Lecture 02 Regular expressions and scanning Görel Hedin Revised: 2014-09-01

Course overview: the specification formalisms (regular expressions, context-free grammar, attribute grammar) are used to build the compiler phases (lexical analyzer/scanner, syntactic analyzer/parser, semantic analyzer), which transform source code (text) into tokens, then into an AST (abstract syntax tree), then into an attributed AST. This lecture covers regular expressions and the scanner. Later phases: intermediate code generator, optimizer, and target code generator (producing target code for the machine), or alternatively an interpreter or a virtual machine running intermediate code and data, supported by the runtime system (activation records, objects, stack, garbage collection, heap). 2

Analyzing program text: the program text sum = sum + k is scanned into the tokens ID EQ ID PLUS ID, which are then parsed into a parse tree (an AssignStmt whose expression is an Add of two Exps). This lecture covers the token level. 3

Recall: Generating the compiler: Regular expressions, a context-free grammar, and an attribute grammar are fed to a scanner generator, a parser generator, and an attribute evaluator generator, which produce the lexical analyzer (scanner), the syntactic analyzer (parser), and the semantic analyzer, transforming text to tokens to tree. We will use a scanner generator called JFlex. 4

Some typical tokens (regular expressions in JFlex syntax):
Reserved words (keywords): IF THEN FOR. Example lexemes: if, then, for. Regular expressions: "if" "then" "for"
Identifiers: ID. Example lexemes: B, alpha, k10. Regular expression: [A-Za-z][a-zA-Z0-9]*
Literals: INT. Example lexemes: 123, 0, 99, 2014. Regular expression: [0-9]+
Literals: FLOAT. Example lexemes: 3.1416, 0.2. Regular expression: [0-9]+ "." [0-9]+
Literals: CHAR. Example lexemes: 'A', 'c'. Regular expression: \' [^\'] \'
Literals: STRING. Example lexemes: "Hello", "", "j". Regular expression: \" [^\"]* \"
Operators: PLUS INCR NE. Example lexemes: + ++ !=. Regular expressions: "+" "++" "!="
Separators: SEMI COMMA LPAREN. Example lexemes: ; , (. Regular expressions: ";" "," "("
5

Some typical non-tokens (regular expressions in JFlex syntax):
WHITESPACE. Example lexemes: blank, tab, newline. Regular expression: " " | \t | \n
ENDOFLINECOMMENT. Example lexemes: // comment. Regular expression: "//" [^\n\r]* (\n | \r | \r\n)?
Non-tokens are also recognized by the scanner, just like tokens. But they are not sent on to the parser. 6

Formal languages An alphabet, Σ, is a set of symbols (nonempty and finite). A string is a sequence of symbols (each string is finite). A formal language, L, is a set of strings (can be infinite). We would like to have rules or algorithms for defining a language, and for deciding if a certain string over the alphabet belongs to the language or not. 7

Example: Languages over binary numbers Suppose we have the alphabet Σ = {0, 1}. Example languages: The set of all possible combinations of zeros and ones: L0 = {0, 1, 00, 01, 10, 11, 000, ...} All binary numbers without unnecessary leading zeros: L1 = {0, 1, 10, 11, 100, 101, 110, 111, 1000, ...} All binary numbers with two digits: L2 = {00, 01, 10, 11} ... 8
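Membership in a regular language like L1 can be tested with an ordinary regular expression. A minimal sketch in Java, assuming the pattern 0|1[01]* as a formulation of "no unnecessary leading zeros" (the class and example strings are illustrative only):

import java.util.regex.Pattern;

public class BinaryNumbers {
    // L1: either the single digit 0, or a 1 followed by any sequence of binary digits.
    static final Pattern L1 = Pattern.compile("0|1[01]*");

    public static void main(String[] args) {
        System.out.println(L1.matcher("101").matches()); // true: in L1
        System.out.println(L1.matcher("010").matches()); // false: unnecessary leading zero
        System.out.println(L1.matcher("0").matches());   // true: in L1
    }
}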

Example: Languages over UNICODE Here, the alphabet Σ is the set of UNICODE characters Example languages: All possible Java keywords: { class, import, public,...} All possible lexemes corresponding to Java tokens. All possible lexemes corresponding to Java whitespace. All binary numbers... 9

Example: Languages over Java tokens Here, the alphabet Σ is the set of Java tokens. Example languages: All syntactically correct Java programs. All that are syntactically incorrect. All that are compile-time correct. All that terminate... (This last language cannot be computed: termination is undecidable: it is not possible to construct an algorithm that decides if a string is a terminating program or not.) 10

Defining languages using rules Increasingly powerful: Regular expressions (for tokens) Context-free grammars (for syntax trees) Attribute grammars (for arbitrary decidable languages) 11

Regular expressions (core notation)
a: the symbol a (where a is a symbol in the alphabet, e.g., {0,1} or UNICODE)
M | N: M or N (alternative)
M N: M followed by N (concatenation)
ε: the empty string (epsilon)
M*: zero or more M (repetition, Kleene star)
(M): grouping
Here M and N are regular expressions. Each regular expression defines a language over the alphabet (a set of strings that belong to the language). Priorities: M | N P* means M | (N (P*)). 12

Example a | b c* means {a, b, bc, bcc, bccc, ...} 13

Regular expressions (extended notation) In addition to the core notation:
M+: at least one M; means M M*
M?: optional M; means M | ε
[aou]: one of a, o, u (a character class); means a | o | u
[a-zA-Z]: one of the characters in the ranges; means a | b | ... | z | A | B | ... | Z
[^0-9] (Appel notation: ~[0-9]): one character, but not any of those listed
"a+b": the string a+b; means a \+ b
14

Exercise Write a regular expression that defines the language of all decimal numbers, like 3.14 0.75 4711 0 ... But not numbers lacking an integer part, and not numbers with a decimal point but lacking a fractional part. So not numbers like 17. or .236. Leading and trailing zeros are allowed, so the following are ok: 007 008.00 0.0 1.700 a) Use the extended notation. b) Then translate the expression to the core notation. c) Then write an expression that disallows leading zeros. 15

Solution a) [0-9]+ ("."[0-9]+)? b) (0|...|9)(0|...|9)* (ε | ("." (0|...|9)(0|...|9)*)) c) (0 | [1-9][0-9]*) ("."[0-9]+)? 16
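The solutions can be checked mechanically. A minimal sketch using java.util.regex (note that its syntax differs slightly from JFlex: a literal dot is escaped as \\. instead of being quoted; the class and test strings are my own):

import java.util.regex.Pattern;

public class DecimalNumbers {
    // a) all decimal numbers, leading and trailing zeros allowed
    static final Pattern A = Pattern.compile("[0-9]+(\\.[0-9]+)?");
    // c) as a), but without leading zeros in the integer part
    static final Pattern C = Pattern.compile("(0|[1-9][0-9]*)(\\.[0-9]+)?");

    public static void main(String[] args) {
        System.out.println(A.matcher("007").matches());  // true
        System.out.println(A.matcher("17.").matches());  // false: missing fractional part
        System.out.println(A.matcher(".236").matches()); // false: missing integer part
        System.out.println(C.matcher("007").matches());  // false: leading zeros
        System.out.println(C.matcher("0.0").matches());  // true
    }
}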

Escaped characters Use backslash to escape metacharacters and non-printing control characters. Metacharacters: \+ \* \( \) \| \\ ... Non-printing control characters: \n newline \r return \t tab \f formfeed ... 17

Some typical tokens:
Kind: Reserved words (keywords). Names: IF THEN FOR. Example lexemes: if, then, for. Regular expressions: "if" "then" "for"
Kind: Identifiers. Name: ID. Example lexemes: B, alpha, k10. Regular expression: [A-Za-z]([a-zA-Z0-9])*
Kind: Literals. Names: INT, FLOAT, CHAR, STRING. Example lexemes: 123 0 99; 3.1416 0.2; 'A' 'c'; "Hello" "" "j". Regular expressions: [0-9]+; [0-9]+ "." [0-9]+; \' [^\'] \'; \" [^\"]* \"
Kind: Operators. Names: PLUS INCR NE. Example lexemes: + ++ !=. Regular expressions: "+" "++" "!="
Kind: Separators. Names: SEMI COMMA LPAREN. Example lexemes: ; , (. Regular expressions: ";" "," "("
18

Some typical non-tokens:
WHITESPACE. Example lexemes: blank, tab, newline. Regular expression (JFlex): " " | \t | \n
ENDOFLINECOMMENT. Example lexemes: // comment. Regular expression: "//" [^\n]* (\n | \r | \r\n)?
Non-tokens are also recognized by the scanner, just like tokens. But they are not sent on to the parser. 19

JFlex: A scanner generator Generating a scanner for a language lang: the JFlex specification lang.jflex (with regular expressions) is fed to the scanner generator jflex.jar, which generates LangScanner.java. The generated scanner reads the characters of Program.lang and delivers tokens to LangParser.java. 20

A JFlex specification
package lang;       // the generated scanner will belong to the package lang
import lang.Token;  // Our own class for tokens
...
// ignore whitespace
" " | \t | \n | \r | \f   { /* ignore */ }
// tokens
"if"        { return new Token("IF"); }
"="         { return new Token("ASSIGN"); }
"<"         { return new Token("LT"); }
"<="        { return new Token("LE"); }
[a-zA-Z]+   { return new Token("ID", yytext()); }
...
Rules and lexical actions Each rule has the form: regular-expression { lexical action } The lexical action consists of arbitrary Java code. It is run when a token is matched. The method yytext() returns the lexeme (the token value). 21
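The Token class used in the actions above is our own, not something generated by JFlex. A minimal sketch of what it might look like (the field and method names are assumptions based on this and later slides; slide 34 instead gives kind() the type int):

package lang;

// A hand-written token class: a kind (e.g. "IF", "ID") and an optional
// value holding the lexeme (e.g. the identifier name).
public class Token {
    private final String kind;
    private final String value;

    public Token(String kind) { this(kind, ""); }

    public Token(String kind, String value) {
        this.kind = kind;
        this.value = value;
    }

    public String kind()  { return kind; }
    public String value() { return value; }

    public String toString() { return value.isEmpty() ? kind : kind + "(" + value + ")"; }
}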

Ambiguities?
package lang;       // the generated scanner will belong to the package lang
import lang.Token;  // Class for tokens
...
// ignore whitespace
" " | \t | \n | \r | \f   { /* ignore */ }
// tokens
"if"        { return new Token("IF"); }
"="         { return new Token("ASSIGN"); }
"<"         { return new Token("LT"); }
"<="        { return new Token("LE"); }
[a-zA-Z]+   { return new Token("ID", yytext()); }
...
Are the token definitions ambiguous? Which rules match "<="? Which rules match "if"? Which rules match "ifff"? Which rules match "xyz"? 22

Extra rules for resolving ambiguities Longest match If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible. Rule priority If two rules can be used to match the same sequence of characters, the first one takes priority. 23

Implementation of scanners Observation: Regular expressions are equivalent to finite automata (finite-state machines). (They can recognize the same class of formal languages: the regular languages.) Overall approach: Translate each token regular expression to a finite automaton. Label the final state with the token. Merge all the automata. The resulting automaton will in general be nondeterministic. Translate the nondeterministic automaton to a deterministic automaton. Implement the deterministic automaton, either using switch statements or a table. A scanner generator automates this process. 24

Construct an automaton for each token. For example: "if" gives an automaton with an i transition followed by an f transition to a final state labeled IF. [0-9]+ gives a 0-9 transition to a final state labeled INT, with a 0-9 loop on that state. " ", \n, and \t each give a transition to a final state labeled WHITESPACE. [a-zA-Z]+ gives an a-zA-Z transition to a final state labeled ID, with an a-zA-Z loop on that state. (Each automaton has a start state, transitions, and a final state.) 25

Merge the start states of the automata, so that one combined automaton recognizes IF, INT, WHITESPACE, and ID. Is the new automaton deterministic?

Deterministic finite automata In a deterministic finite automaton each transition is uniquely determined by the input. An automaton with the edges 1 -a-> 2 and 1 -a-> 3 is nondeterministic, since if we read a when in state 1, we don't know if we should go to state 2 or 3. An automaton with the edge 1 -ε-> 2 is nondeterministic, since when we are in state 1, we don't know if we should stay there, or go to state 2 without reading any input. (Epsilon denotes the empty string.) An automaton with the edges 1 -a-> 2 and 1 -b-> 3 is deterministic, since from state 1, the next input determines if we go to state 2 or 3. 27

DFA versus NFA Deterministic Finite Automaton (DFA) A finite automaton is deterministic if all outgoing edges from any given state have disjoint character sets and there are no epsilon edges. Can be implemented efficiently. Non-deterministic Finite Automaton (NFA) An NFA may have two outgoing edges with overlapping character sets, or epsilon edges. Every DFA is also an NFA. Every NFA can be translated to an equivalent DFA. 28

Translating an NFA to a DFA Simulate the NFA: keep track of a set of current NFA states, and follow ε edges to extend the current set (take the closure). Construct the corresponding DFA: each such set of NFA states corresponds to one DFA state. If any of the NFA states is final, the DFA state is also final, and is marked with the corresponding token. If there is more than one token to choose from, select the token that is defined first (rule priority). (Minimize the DFA for efficiency.) 29
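A minimal sketch of this subset construction in Java, assuming the NFA is available through hypothetical helper methods epsilonClosure and move (all names are illustrative; this is not JFlex code). Token selection by rule priority and DFA minimization are left out:

import java.util.*;

public class SubsetConstruction {
    // Hypothetical NFA interface, for illustration only.
    interface Nfa {
        Set<Integer> startStates();                        // NFA start states
        Set<Integer> epsilonClosure(Set<Integer> states);  // extend a set by following epsilon edges
        Set<Integer> move(Set<Integer> states, char c);    // states reachable on input c
        char[] alphabet();
    }

    // Each DFA state is a set of NFA states; the result maps a DFA state and
    // an input character to the next DFA state.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(Nfa nfa) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = nfa.epsilonClosure(nfa.startStates());
        dfa.put(start, new HashMap<>());
        work.push(start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.pop();
            for (char c : nfa.alphabet()) {
                Set<Integer> next = nfa.epsilonClosure(nfa.move(current, c));
                if (next.isEmpty()) continue;   // no transition: corresponds to the dead state
                if (!dfa.containsKey(next)) {   // a new DFA state was discovered
                    dfa.put(next, new HashMap<>());
                    work.push(next);
                }
                dfa.get(current).put(c, next);  // record the DFA transition
            }
        }
        return dfa;
    }
}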

Example NFA: start state 1, with 1 -i-> 2, 2 -f-> 3 (final, IF), 1 -[a-z]-> 4, and 4 -[a-z]-> 4 (final, ID). The corresponding DFA: start state 1, with 1 -i-> {2,4} (final, ID), {2,4} -f-> {3,4} (final, IF), 1 -[a-hj-z]-> 4 (final, ID), {2,4} -[a-eg-z]-> 4, {3,4} -[a-z]-> 4, and 4 -[a-z]-> 4. 30

Error handling Extend the DFA from the previous slide: Add a "dead state" (state 0), corresponding to erroneous input. Add transitions to the dead state for all erroneous input (here, ^a-z edges from each state). Generate an "ERROR token" when the dead state is reached. 31

Implementation alternatives for DFAs Table-driven Represent the automaton by a table. Additional table to keep track of final states and token kinds. A global variable keeps track of the current state. Switch statements Each state is implemented as a switch statement. Each case implements a state transition as a jump (to another switch statement). The current state is represented by the program counter. 32
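A minimal sketch of the switch/jump alternative for the small ID/IF automaton from the earlier slides. Each state becomes its own method dispatching on the next character (if-chains rather than literal switch statements, since Java cannot switch on character ranges); the class and method names are made up for illustration:

import java.io.*;

public class SwitchScanner {
    private final PushbackReader in;

    public SwitchScanner(Reader r) { in = new PushbackReader(r); }

    public String nextToken() throws IOException { return state1(); }

    private void unread(int ch) throws IOException {
        if (ch != -1) in.unread(ch);              // push back the lookahead character
    }

    private String state1() throws IOException {  // start state
        int ch = in.read();
        if (ch == 'i') return state2();
        if (ch >= 'a' && ch <= 'z') return state4();
        return "ERROR";                           // dead state: no valid token
    }
    private String state2() throws IOException {  // has read "i"
        int ch = in.read();
        if (ch == 'f') return state3();
        if (ch >= 'a' && ch <= 'z') return state4();
        unread(ch); return "ID";                  // final state: ID
    }
    private String state3() throws IOException {  // has read "if"
        int ch = in.read();
        if (ch >= 'a' && ch <= 'z') return state4();
        unread(ch); return "IF";                  // final state: IF
    }
    private String state4() throws IOException {  // inside a longer identifier
        int ch = in.read();
        if (ch >= 'a' && ch <= 'z') return state4();
        unread(ch); return "ID";                  // final state: ID
    }
}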

Table-driven implementation The DFA from the previous slides, extended with a PLUS token (1 -+-> 5), is represented by a transition table with one row per state, plus columns telling whether the state is final and, if so, the token kind:
state | ... | + | ... | a | ... | e | f | g | ... | h | i | j | ... | z | ... | final | kind
0     |  0  | 0 |  0  | 0 |  0  | 0 | 0 | 0 |  0  | 0 | 0 | 0 |  0  | 0 |  0  | true  | ERROR
1     |  0  | 5 |  0  | 4 |  4  | 4 | 4 | 4 |  4  | 4 | 2 | 4 |  4  | 4 |  0  | false |
2     |  0  | 0 |  0  | 4 |  4  | 4 | 3 | 4 |  4  | 4 | 4 | 4 |  4  | 4 |  0  | true  | ID
3     |  0  | 0 |  0  | 4 |  4  | 4 | 4 | 4 |  4  | 4 | 4 | 4 |  4  | 4 |  0  | true  | IF
4     |  0  | 0 |  0  | 4 |  4  | 4 | 4 | 4 |  4  | 4 | 4 | 4 |  4  | 4 |  0  | true  | ID
5     |  0  | 0 |  0  | 0 |  0  | 0 | 0 | 0 |  0  | 0 | 0 | 0 |  0  | 0 |  0  | true  | PLUS
33

Scanner implementation, design (class diagram): the Parser calls the Scanner (Token nextToken()), and the Scanner calls the File (char nextChar()). A Token has int kind() and String value(). 34
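Expressed as Java interfaces, the design might look like the following sketch (names assumed; Token is the class sketched earlier, there with a String-valued kind rather than the int shown in the diagram):

// The "File" box in the diagram: delivers one character at a time.
interface SourceFile {
    char nextChar();
}

// The "Scanner" box: builds tokens from the characters it reads from the file.
interface Scanner {
    Token nextToken();
}

// The Parser is then written against Scanner only, e.g.:
//   Token t = scanner.nextToken();
//   ... dispatch on t.kind() ...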

Scanner implementation, sketch Idea: Scan the next token by starting in the start state, scan characters until we reach a final state, and return a new token.
Token nextToken() {
  state = 1;                      // start state
  while (!isFinal[state]) {
    ch = file.readChar();
    state = edges[state][ch];
  }
  return new Token(kind[state]);
}
Needs to be extended with handling of: longest match, end of file, non-tokens (like whitespace), and token values (like the identifier name). 35

Extend to longest match, design (class diagram): the Parser calls the Scanner (Token nextToken()); the Scanner calls a PushbackFile (char readChar(), void pushback(String)), which wraps the File (char readChar()). Idea: When a token is matched, don't stop scanning. When the error state is reached, return the last token matched. Push read characters that are unused back into the file, so they can be scanned again. Use a PushbackFile to accomplish this. 36

Extend to handle longest match, sketch When a token is matched (a final state reached), don't stop scanning. Keep track of the currently scanned string, str. Keep track of the latest matched token (lastFinalState, lastTokenValue). Continue scanning until we reach the error state. Restore the input stream using PushbackFile. Return the latest matched token (or return the ERROR token if there was no latest matched token).
Token nextToken() {
  state = 1; str = ""; lastFinalState = 0; lastTokenValue = "";
  while (state != 0) {
    ch = pushbackFile.readChar();
    str = str + ch;               // In Java, StringBuilder would be more efficient
    state = edges[state][ch];
    if (isFinal[state]) {
      lastFinalState = state;
      lastTokenValue = str;
    }
  }
  pushbackFile.pushback(str.substring(lastTokenValue.length()));
  return new Token(kind[lastFinalState], lastTokenValue);
}
37

Handling End-of-file (EOF) and non-tokens EOF: construct an explicit EOF token when the EOF character is read. Non-tokens (whitespace and comments): view them as tokens of a special kind; scan them as normal tokens, but don't create token objects for them; loop in next() until a real token has been found. Errors: construct an explicit ERROR token to be returned when no valid token can be found. 38
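A minimal sketch of the "loop in next()" idea, as an extra method in the Scanner sketched above, assuming the kind names WHITESPACE and COMMENT for the non-tokens and the nextToken() method from the earlier sketches:

// Keep scanning until something other than a non-token is matched.
Token next() {
    while (true) {
        Token t = nextToken();   // the basic DFA-driven scan
        String k = t.kind();
        if (k.equals("WHITESPACE") || k.equals("COMMENT"))
            continue;            // recognized, but not sent on to the parser
        return t;                // a real token, or EOF, or ERROR
    }
}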

Specifying EOF and ERROR in JFlex
package lang;       // the generated scanner will belong to the package lang
import lang.Token;  // Class for tokens
...
// ignore whitespace
" " | \t | \n | \r | \f   { /* ignore */ }
// tokens
"if"        { return new Token("IF"); }
"="         { return new Token("ASSIGN"); }
"<"         { return new Token("LT"); }
"<="        { return new Token("LE"); }
[a-zA-Z]+   { return new Token("ID", yytext()); }
...
<<EOF>>     { return new Token("EOF"); }
[^]         { return new Token("ERROR"); }
<<EOF>> is a special regular expression in JFlex, matching end of file. [^] means any character. Due to rule priority, this will match any character not matched by previous rules. 39

Example scanner generators
tool: lex. Author: Schmidt and Lesk, 1975. Generates: C code.
tool: flex ("fast lex"). Author: Paxson, 1987. Generates: C code.
tool: jlex. Generates: Java code.
tool: jflex. Generates: Java code.
...
40

Lexical states Some tokens are difficult or impossible to define with regular expressions. Lexical states (sets of token rules) give the possibility to switch token sets (DFAs) during scanning: for example LEXSTATE1 with tokens T1 T2 T3 T4 ..., LEXSTATE2 with tokens T5 T6 T7 ... Useful for multi-line comments, HTML, scanning multi-language documents, etc. Supported by many scanner generators (including JFlex). 41

Example: multi-line comments Would like to scan the complete comment as one token: /* int m() { return 15 / 3 * 4 * 2; } */ Can be solved easily with lexical states: a default token set (ID, "if", "then", "/*", ...) and a token set used inside comments ("*/", [^]). Writing an ordinary regular expression for this is difficult: "/*"((\*+[^/*])|([^*]))*\**"*/" However, some scanner generators, like JFlex, have the special operator upto (~) that can be used instead: "/*" ~"*/" { /* Comment */ } 42
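A minimal sketch of how the lexical-state solution might be written in JFlex (%state, yybegin, and YYINITIAL are JFlex features; the rule set is abbreviated, and this version simply ignores the comment rather than delivering it as one token):

%%
%state COMMENT
%%
<YYINITIAL> {
  "/*"      { yybegin(COMMENT); }     // switch to the comment token set
  "if"      { return new Token("IF"); }
  ...                                  // other ordinary token rules
}
<COMMENT> {
  "*/"      { yybegin(YYINITIAL); }   // end of comment: back to the default token set
  [^]       { /* ignore comment contents */ }
}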

Course overview Regular expressions, context-free grammar, and attribute grammar are used to build the lexical analyzer (scanner) and syntactic analyzer (parser) (both part of Lab 1) and the semantic analyzer, transforming source code (text) to tokens (this lecture) to AST (abstract syntax tree, next lecture) to attributed AST. Later phases: intermediate code generator, optimizer, and target code generator (producing target code for the machine), or alternatively an interpreter or a virtual machine running intermediate code and data, supported by the runtime system (activation records, objects, stack, garbage collection, heap). 43

Summary questions What is a formal language? What is a regular expression? What is meant by an ambiguous lexical definition? Give some typical examples of ambiguities and how they may be resolved. What is a lexical action? Give an example of how to construct an NFA for a given lexical definition. Give an example of how to construct a DFA for a given NFA. What is the difference between a DFA and an NFA? Give an example of how to implement a DFA in Java. How is rule priority handled in the implementation? Longest match? EOF? Whitespace? Errors? What are lexical states? When are they useful? See the course web for exercises and study recommendations. EDAN65, 2014, Lecture 01 44