Implementation of Lexical Analysis

Similar documents
Implementation of Lexical Analysis

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Introduction to Lexical Analysis

Lexical Analysis. Chapter 2

Lexical Analysis. Lecture 2-4

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Lexical Analysis. Implementation: Finite Automata

Lexical Analysis. Lecture 3-4

Implementation of Lexical Analysis. Lecture 4

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

CSE450. Translation of Programming Languages. Automata, Simple Language Design Principles

Kinder, Gentler Nation

Lexical Analysis. Finite Automata. (Part 2 of 2)

Compiler course. Chapter 3 Lexical Analysis

Chapter 3 Lexical Analysis

Lexical Analysis. Finite Automata

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

Lexical Analysis. Finite Automata

Lexical Analysis. Lecture 3. January 10, 2018

Formal Languages and Compilers Lecture VI: Lexical Analysis

Administrativia. Extra credit for bugs in project assignments. Building a Scanner. CS164, Fall Recall: The Structure of a Compiler

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Introduction to Lexical Analysis

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

UNIT -2 LEXICAL ANALYSIS

CSc 453 Lexical Analysis (Scanning)

Regular Languages. MACM 300 Formal Languages and Automata. Formal Languages: Recap. Regular Languages

Compiler phases. Non-tokens

Lex Spec Example. Int installid() {/* code to put id lexeme into string table*/}

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

CS 432 Fall Mike Lam, Professor. Finite Automata Conversions and Lexing

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Writing a Lexical Analyzer in Haskell (part II)

ECS 120 Lesson 7 Regular Expressions, Pt. 1

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Structure of Programming Languages Lecture 3

Compiler Construction

CS308 Compiler Principles Lexical Analyzer Li Jiang

Lexical Analysis. Introduction

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Front End: Lexical Analysis. The Structure of a Compiler

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Lexical Analyzer Scanner

Lexical Analyzer Scanner

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Compiler Construction D7011E

Figure 2.1: Role of Lexical Analyzer

CS415 Compilers. Lexical Analysis

Dr. D.M. Akbar Hussain

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Monday, August 26, 13. Scanners

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

Wednesday, September 3, 14. Scanners

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Compiler Construction

Automating Construction of Lexers

Compiler Construction LECTURE # 3

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

CMSC 350: COMPILER DESIGN

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

CS 314 Principles of Programming Languages. Lecture 3

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

1. (10 points) Draw the state diagram of the DFA that recognizes the language over Σ = {0, 1}

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

CSE 401/M501 Compilers

Zhizheng Zhang. Southeast University

The Language for Specifying Lexical Analyzer

CSE 130 Programming Language Principles & Paradigms Lecture # 5. Chapter 4 Lexical and Syntax Analysis

6 NFA and Regular Expressions

Lexical Analysis 1 / 52

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Lecture 9 CIS 341: COMPILERS

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

CSE302: Compiler Design

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

3.15: Applications of Finite Automata and Regular Expressions. Representing Character Sets and Files

Formal Languages and Grammars. Chapter 2: Sections 2.1 and 2.2

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson

G52LAC Languages and Computation Lecture 6

CSC 467 Lecture 3: Regular Expressions

2. Lexical Analysis! Prof. O. Nierstrasz!

2. Syntax and Type Analysis

Transcription:

Implementation of Lexical Analysis

Outline Specifying lexical structure using regular expressions Finite automata Deterministic Finite Automata (DFAs) Non-deterministic Finite Automata (NFAs) Implementation of regular expressions RegExp NFA DFA Tables Compiler Design (20) 2

Notation For convenience, we use a variation (allow userdefined abbreviations) in regular expression notation Union: A + B A B Option: A + A? Range: a + b + + z [a-z] Excluded range: complement of [a-z] [^a-z] Compiler Design (20) 3

Regular Expressions in Lexical Specification Last lecture: a specification for the predicate s L(R) But a yes/no answer is not enough! Instead: partition the input into tokens We will adapt regular expressions to this goal Compiler Design (20) 4

Regular Expressions Lexical Spec. (). Select a set of tokens Integer, Keyword, Identifier, OpenPar,... 2. Write a regular expression (pattern) for the lexemes of each token Integer = digit + Keyword = if + else + Identifier = letter (letter + digit)* OpenPar = ( Compiler Design (20) 5

Regular Expressions Lexical Spec. (2) 3. Construct R, matching all lexemes for all tokens R = Keyword + Identifier + Integer + = R + R 2 + R 3 + Facts: If s L(R) then s is a lexeme Furthermore s L(R i ) for some i This i determines the token that is reported Compiler Design (20) 6

Regular Expressions Lexical Spec. (3) 4. Let input be x x n (x... x n are characters) For i n check x x i L(R)? 5. It must be that x x i L(R j ) for some j (if there is a choice, pick a smallest such j) 6. Remove x x i from input and go to previous step Compiler Design (20) 7

How to Handle Spaces and Comments?. We could create a token Whitespace Whitespace = ( + \n + \t ) + We could also add comments in there An input \t\n 5555 is transformed into Whitespace Integer Whitespace 2. Lexer skips spaces (preferred) Modify step 5 from before as follows: It must be that x k... x i L(R j ) for some j such that x... x k- L(Whitespace) Parser is not bothered with spaces Compiler Design (20) 8

Ambiguities () There are ambiguities in the algorithm How much input is used? What if x x i L(R) and also x x K L(R) Rule: Pick the longest possible substring The maximal munch Compiler Design (20) 9

Ambiguities (2) Which token is used? What if x x i L(R j ) and also x x i L(R k ) Rule: use rule listed first (j if j < k) Example: R = Keyword and R 2 = Identifier if matches both Treats if as a keyword not an identifier Compiler Design (20) 0

Error Handling What if No rule matches a prefix of input? Problem: Can t just get stuck Solution: Write a rule matching all bad strings Put it last Lexer tools allow the writing of: R = R +... + R n + Error Token Error matches if nothing else matches Compiler Design (20)

Summary Regular expressions provide a concise notation for string patterns Use in lexical analysis requires small extensions To resolve ambiguities To handle errors Good algorithms known (next) Require only single pass over the input Few operations per character (table lookup) Compiler Design (20) 2

Regular Languages & Finite Automata Basic formal language theory result: Regular expressions and finite automata both define the class of regular languages. Thus, we are going to use: Regular expressions for specification Finite automata for implementation (automatic generation of lexical analyzers) Compiler Design (20) 3

Finite Automata A finite automaton is a recognizer for the strings of a regular language A finite automaton consists of A finite input alphabet Σ A set of states S A start state n A set of accepting states F S A set of transitions state input state Compiler Design (20) 4

Finite Automata Transition Is read s a s 2 In state s on input a go to state s 2 If end of input (or no transition possible) If in accepting state accept Otherwise reject Compiler Design (20) 5

Finite Automata State Graphs A state The start state An accepting state A transition a Compiler Design (20) 6

A Simple Example A finite automaton that accepts only Compiler Design (20) 7

Another Simple Example A finite automaton accepting any number of s followed by a single 0 Alphabet: {0,} 0 Compiler Design (20) 8

And Another Example Alphabet {0,} What language does this recognize? 0 0 0 Compiler Design (20) 9

And Another Example Alphabet still { 0, } The operation of the automaton is not completely defined by the input On input the automaton could be in either state Compiler Design (20) 20

Epsilon Moves Another kind of transition: -moves A B Machine can move from state A to state B without reading input Compiler Design (20) 2

Deterministic and Non-Deterministic Automata Deterministic Finite Automata (DFA) One transition per input per state No -moves Non-deterministic Finite Automata (NFA) Can have multiple transitions for one input in a given state Can have -moves Finite automata have finite memory Enough to only encode the current state Compiler Design (20) 22

Execution of Finite Automata A DFA can take only one path through the state graph Completely determined by input NFAs can choose Whether to make -moves Which of multiple transitions for a single input to take Compiler Design (20) 23

Acceptance of NFAs An NFA can get into multiple states 0 Input: 0 0 Rule: NFA accepts an input if it can get in a final state Compiler Design (20) 24

NFA vs. DFA () NFAs and DFAs recognize the same set of languages (regular languages) DFAs are easier to implement There are no choices to consider Compiler Design (20) 25

NFA vs. DFA (2) For a given language the NFA can be simpler than the DFA NFA 0 0 DFA 0 0 0 0 DFA can be exponentially larger than NFA Compiler Design (20) 26

Regular Expressions to Finite Automata High-level sketch NFA Regular expressions DFA Lexical Specification Table-driven Implementation of DFA Compiler Design (20) 27

Regular Expressions to NFA () For each kind of reg. expr, define an NFA Notation: NFA for regular expression M M For For input a a Compiler Design (20) 28

Regular Expressions to NFA (2) For AB A B For A + B B A Compiler Design (20) 29

Regular Expressions to NFA (3) For A* A Compiler Design (20) 30

Example of Regular Expression NFA conversion Consider the regular expression (+0)* The NFA is A B C D 0 E F G H I J Compiler Design (20) 3

NFA to DFA. The Trick Simulate the NFA Each state of DFA = a non-empty subset of states of the NFA Start state = the set of NFA states reachable through -moves from NFA start state Add a transition S a S to DFA iff S is the set of NFA states reachable from any state in S after seeing the input a considering -moves as well Compiler Design (20) 32

NFA to DFA. Remark An NFA may be in many states at any time How many different states? If there are N states, the NFA must be in some subset of those N states How many subsets are there? 2 N - = finitely many Compiler Design (20) 33

NFA to DFA Example A B ABCDHI 0 C D 0 E F FGABCDHI 0 EJGABCDHI G H I J 0 Compiler Design (20) 34

Implementation A DFA can be implemented by a 2D table T One dimension is states Other dimension is input symbols For every transition S i a S k define T[i,a] = k DFA execution If in state S i and input a, read T[i,a] = k and skip to state S k Very efficient Compiler Design (20) 35

Table Implementation of a DFA S 0 T 0 U 0 S T U 0 T T T U U U Compiler Design (20) 36

Implementation (Cont.) NFA DFA conversion is at the heart of tools such as lex, ML-Lex or flex But, DFAs can be huge In practice, lex/ml-lex/flex-like tools trade off speed for space in the choice of NFA and DFA representations Compiler Design (20) 37

Theory vs. Practice Two differences: DFAs recognize lexemes. A lexer must return a type of acceptance (token type) rather than simply an accept/reject indication. DFAs consume the complete string and accept or reject it. A lexer must find the end of the lexeme in the input stream and then find the next one, etc. Compiler Design (20) 38