Implementation of Lexical Analysis

Similar documents
Implementation of Lexical Analysis

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Introduction to Lexical Analysis

Implementation of Lexical Analysis

Lexical Analysis. Chapter 2

Lexical Analysis. Lecture 2-4

Lexical Analysis. Implementation: Finite Automata

Lexical Analysis. Lecture 3-4

Implementation of Lexical Analysis. Lecture 4

Administrativia. Extra credit for bugs in project assignments. Building a Scanner. CS164, Fall Recall: The Structure of a Compiler

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

Kinder, Gentler Nation

Compiler course. Chapter 3 Lexical Analysis

Lexical Analysis. Finite Automata. (Part 2 of 2)

Chapter 3 Lexical Analysis

Lexical Analysis. Finite Automata

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

Lexical Analysis. Lecture 3. January 10, 2018

CSE450. Translation of Programming Languages. Automata, Simple Language Design Principles

Lexical Analysis. Finite Automata

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

Formal Languages and Compilers Lecture VI: Lexical Analysis

Introduction to Lexical Analysis

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

CSc 453 Lexical Analysis (Scanning)

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

ECS 120 Lesson 7 Regular Expressions, Pt. 1

Compiler Construction D7011E

Structure of Programming Languages Lecture 3

UNIT -2 LEXICAL ANALYSIS

Regular Languages. MACM 300 Formal Languages and Automata. Formal Languages: Recap. Regular Languages

Compiler phases. Non-tokens

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Lexical Analysis. Introduction

Writing a Lexical Analyzer in Haskell (part II)

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

CS308 Compiler Principles Lexical Analyzer Li Jiang

Lex Spec Example. Int installid() {/* code to put id lexeme into string table*/}

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

Compiler Construction LECTURE # 3

Compiler Construction

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

CS 432 Fall Mike Lam, Professor. Finite Automata Conversions and Lexing

The Language for Specifying Lexical Analyzer

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis

Lexical Analyzer Scanner

Lexical Analyzer Scanner

Figure 2.1: Role of Lexical Analyzer

CSC 467 Lecture 3: Regular Expressions

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Monday, August 26, 13. Scanners

Wednesday, September 3, 14. Scanners

Finite automata. III. Finite automata: language recognizers. Nondeterministic Finite Automata. Nondeterministic Finite Automata with λ-moves

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

CMSC 350: COMPILER DESIGN

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Compiler Construction

Front End: Lexical Analysis. The Structure of a Compiler

Lexical and Syntax Analysis

Dr. D.M. Akbar Hussain

CS415 Compilers. Lexical Analysis

10/4/18. Lexical and Syntactic Analysis. Lexical and Syntax Analysis. Tokenizing Source. Scanner. Reasons to Separate Lexical and Syntactic Analysis

Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

CS 314 Principles of Programming Languages. Lecture 3

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CSCI312 Principles of Programming Languages!

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

Lecture 9 CIS 341: COMPILERS

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

CSE 401/M501 Compilers

3.15: Applications of Finite Automata and Regular Expressions. Representing Character Sets and Files

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

CSE 3302 Programming Languages Lecture 2: Syntax

Automating Construction of Lexers

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

CSE302: Compiler Design

1. Lexical Analysis Phase

Introduction to Parsing Ambiguity and Syntax Errors

1. (10 points) Draw the state diagram of the DFA that recognizes the language over Σ = {0, 1}

Lexical Analysis 1 / 52

Tasks of the Tokenizer

Building Compilers with Phoenix

Zhizheng Zhang. Southeast University

Introduction to Parsing Ambiguity and Syntax Errors

Transcription:

Outline Implementation of Lexical nalysis Specifying lexical structure using regular expressions Finite automata Deterministic Finite utomata (DFs) Non-deterministic Finite utomata (NFs) Implementation of regular expressions RegExp NF DF Tables 2 Notation For convenience, we will use a variation (we will allow user-defined abbreviations) in regular expression notation Union: + B B Option: +? Range: a + b + + z [a-z] Excluded range: complement of [a-z] [^a-z] Regular Expressions in Lexical Specification Last lecture: a specification for the predicate s L(R) But a yes/no answer is not enough! Instead: partition the input into tokens We will adapt regular expressions to this goal 3 4

Regular Expressions Lexical Specifications. Select a set of tokens Integer, Keyword, Identifier, LeftPar,... Regular Expressions Lexical Specifications 3. Construct R, a regular expression matching all lexemes for all tokens 2. Write a regular expression (pattern) for the lexemes of each token Integer = digit + Keyword = if + else + Identifier = letter (letter + digit)* LeftPar = ( R = Keyword + Identifier + Integer + = R + R 2 + R 3 + Facts: If s L(R) then s is a lexeme Furthermore s L(R i ) for some i This i determines the token that is reported 5 6 Regular Expressions Lexical Specifications 4. Let input be x x n (x... x n are characters) For i n check x x i L(R)? 5. It must be that x x i L(R j ) for some j (if there is a choice, pick a smallest such j) 6. Remove x x i from input and go to previous step How to Handle Spaces and Comments?. We could create a token Whitespace Whitespace = ( + \n + \t ) + We could also add comments in there n input " \t\n 555 " is transformed into Whitespace Integer Whitespace 2. Lexical analyzer skips spaces (preferred) Modify step 5 from before as follows: It must be that x k... x i L(R j ) for some j such that x... x k- L(Whitespace) Parser is not bothered with spaces 7 8

mbiguities () There are ambiguities in the algorithm How much input is used? What if x x i L(R) and also x x K L(R) Rule: Pick the longest possible substring The maximal munch mbiguities (2) Which token is used? What if x x i L(R j ) and also x x i L(R k ) Rule: use rule listed first (j if j < k) Example: R = Keyword and R 2 = Identifier if matches both Treats if as a keyword not an identifier 9 Error Handling What if No rule matches a prefix of input? Problem: Can t just get stuck Solution: Write a rule matching all bad strings Put it last Lexical analysis tools allow the writing of: R = R +... + R n + Error Token Error matches if nothing else matches Summary Regular expressions provide a concise notation for string patterns Use in lexical analysis requires small extensions To resolve ambiguities To handle errors Good algorithms known (next) Require only single pass over the input Few operations per character (table lookup) 2

Regular Languages & Finite utomata Basic formal language theory result: Regular expressions and finite automata both define the class of regular languages. Thus, we are going to use: Regular expressions for specification Finite automata for implementation (automatic generation of lexical analyzers) Finite utomata finite automaton is a recognizer for the strings of a regular language finite automaton consists of finite input alphabet Σ set of states S start state n set of accepting states F S set of transitions state input state 3 4 Finite utomata Transition s a s 2 Is read In state s on input a go to state s 2 Finite utomata State Graphs state The start state If end of input (or no transition possible) If in accepting state accept Otherwise reject n accepting state transition a 5 6

Simple Example finite automaton that accepts only nother Simple Example finite automaton accepting any number of s followed by a single lphabet: {,} 7 8 nd nother Example lphabet {,} What language does this recognize? nd nother Example lphabet still {, } The operation of the automaton is not completely defined by the input On input the automaton could be in either state 9 2

Epsilon Moves nother kind of transition: -moves Machine can move from state to state B without reading input B Deterministic and Non-Deterministic utomata Deterministic Finite utomata (DF) One transition per input per state No -moves Non-deterministic Finite utomata (NF) Can have multiple transitions for one input in a given state Can have -moves Finite automata have finite memory Enough to only encode the current state 2 22 Execution of Finite utomata DF can take only one path through the state graph Completely determined by input NFs can choose Whether to make -moves Which of multiple transitions for a single input to take cceptance of NFs n NF can get into multiple states Input: Rule: NF accepts an input if it can get in a final state 23 24

NF vs. DF () NFs and DFs recognize the same set of languages (regular languages) NF vs. DF (2) For a given language the NF can be simpler than the DF DFs are easier to implement There are no choices to consider NF DF DF can be exponentially larger than NF (contrary to what is shown in the above example) 25 26 Regular Expressions to Finite utomata High-level sketch NF Regular Expressions to NF () For each kind of reg. expr, define an NF Notation: NF for regular expression M M Regular expressions Lexical Specification DF Table-driven Implementation of DF i.e. our automata have one start and one accepting state For For input a a 27 28

Regular Expressions to NF (2) For B Regular Expressions to NF (3) For * B For + B B 29 3 Example of Regular Expression NF conversion NF to DF. The Trick Consider the regular expression Simulate the NF (+)* Each state of DF The NF is = a non-empty subset of states of the NF Start state = the set of NF states reachable through -moves from NF start state B C D E F G H I J dd a transition S a S to DF iff S is the set of NF states reachable from any state in S after seeing the input a considering -moves as well 3 32

NF to DF. Remark NF to DF Example n NF may be in many states at any time How many different states? If there are N states, the NF must be in some subset of those N states How many subsets are there? 2 N - = finitely many B BCDHI C D E F FGBCDHI EJGBCDHI G H I J 33 34 Implementation Table Implementation of a DF DF can be implemented by a 2D table T One dimension is states Other dimension is input symbols For every transition S i a S k define T[i,a] = k S T U DF execution If in state S i and input a, read T[i,a] = k and skip to state S k Very efficient S T U T T U U T U 35 36

Implementation (Cont.) NF DF conversion is at the heart of tools such as lex, ML-Lex or flex But, DFs can be huge In practice, lex/ml-lex/flex-like tools trade off speed for space in the choice of NF and DF representations Theory vs. Practice Two differences: DFs recognize lexemes. lexer must return a type of acceptance (token type) rather than simply an accept/reject indication. DFs consume the complete string and accept or reject it. lexer must find the end of the lexeme in the input stream and then find the next one, etc. 37 38