Finite Automata and Scanners

A finite automaton (FA) can be used to recognize the tokens specified by a regular expression. FAs are simple, idealized computers that recognize strings belonging to regular sets. An FA consists of:

- A finite set of states
- A set of transitions (or moves) from one state to another, labeled with characters in Σ
- A special state called the start state
- A subset of the states called the accepting, or final, states
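These components translate directly into code. As an illustrative sketch (the class and all names below are ours, not from the notes), an FA can be represented in Java with its transitions stored as a map, plus a method that runs the automaton over an input string:

```java
import java.util.*;

// Illustrative sketch: a finite automaton as data, mirroring the four
// components listed above (states, labeled transitions, start state,
// accepting states).
class FA {
    final Set<Integer> states;
    final Map<Integer, Map<Character, Integer>> transitions;
    final int startState;
    final Set<Integer> acceptingStates;

    FA(Set<Integer> states,
       Map<Integer, Map<Character, Integer>> transitions,
       int startState, Set<Integer> acceptingStates) {
        this.states = states;
        this.transitions = transitions;
        this.startState = startState;
        this.acceptingStates = acceptingStates;
    }

    // Follow transitions while a move is possible; accept only if the
    // whole input is consumed and we end in an accepting state.
    boolean accepts(String input) {
        int state = startState;
        for (char c : input.toCharArray()) {
            Map<Character, Integer> row = transitions.get(state);
            if (row == null || !row.containsKey(c)) return false; // no move
            state = row.get(c);
        }
        return acceptingStates.contains(state);
    }
}
```

For instance, an FA with states {0, 1}, a single transition from 0 to 1 on a, start state 0, and accepting state 1 accepts exactly the string "a".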

These four components of a finite automaton are often represented graphically. [Figure omitted: the graphical symbols for a state, a transition, the start state, and an accepting state.]

Finite automata (the plural of automaton is automata) are represented graphically using transition diagrams. We start at the start state. If the next input character matches the label on a transition from the current state, we go to the state it points to. If no move is possible, we stop. If we finish in an accepting state, the sequence of characters read forms a valid token; otherwise, we have not seen a valid token. In this diagram, the valid tokens are the strings described by the regular expression (a b (c)+)+. [Transition diagram omitted: states connected by edges labeled a, b, and c.]

Deterministic Finite Automata

As an abbreviation, a transition may be labeled with more than one character (for example, Not(c)). The transition may be taken if the current input character matches any of the characters labeling the transition. If an FA always has a unique transition (for a given state and character), the FA is deterministic (that is, a deterministic FA, or DFA). Deterministic finite automata are easy to program and often drive a scanner. If there are transitions to more than one state for some character, then the FA is nondeterministic (that is, an NFA).

A DFA is conveniently represented in a computer by a transition table. A transition table, T, is a two-dimensional array indexed by a DFA state and a vocabulary symbol. Table entries are either a DFA state or an error flag (often represented as a blank table entry). If we are in state s, and read character c, then T[s,c] will be the next state we visit, or T[s,c] will contain an error marker indicating that c cannot extend the current token. For example, the regular expression

// Not(Eol)* Eol

which defines a Java or C++ single-line comment, might be translated into

[Transition diagram omitted: state 1 moves to state 2 on /, state 2 moves to state 3 on /, state 3 loops on Not(Eol) and moves to the accepting state 4 on Eol.]

The corresponding transition table is:

State   /   Eol   a   b
  1     2
  2     3
  3     3    4    3   3
  4

A complete transition table contains one column for each character. To save space, table compression may be used. Only non-error entries are explicitly represented in the table, using hashing, indirection or linked structures.
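As a concrete sketch (our own encoding, not code from the notes), this abbreviated table can be stored as a small two-dimensional Java array, with 0 serving as the blank error entry and only the four character classes above distinguished:

```java
class CommentTable {
    // Rows are DFA states 1..4 (row 0 unused); columns are the
    // character classes '/', Eol, 'a', 'b'. A 0 entry is the error flag.
    static final int[][] T = {
        {0, 0, 0, 0},  // row 0: unused
        {2, 0, 0, 0},  // state 1: '/' -> 2
        {3, 0, 0, 0},  // state 2: '/' -> 3
        {3, 4, 3, 3},  // state 3: Not(Eol) -> 3, Eol -> 4
        {0, 0, 0, 0},  // state 4: accepting, no moves
    };

    // Map a character to its column; '\n' stands in for Eol here.
    static int col(char c) {
        switch (c) {
            case '/':  return 0;
            case '\n': return 1;
            case 'a':  return 2;
            default:   return 3;  // treat everything else like 'b'
        }
    }

    static int next(int state, char c) {
        return T[state][col(c)];
    }
}
```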

All regular expressions can be translated into DFAs that accept (as valid tokens) the strings defined by the regular expressions. This translation can be done manually by a programmer or automatically using a scanner generator. A DFA can be coded in:

- Table-driven form
- Explicit control form

In the table-driven form, the transition table that defines a DFA's actions is explicitly represented in a run-time table that is interpreted by a driver program. In the explicit control form, the transition table that defines a DFA's actions appears implicitly as the control logic of the program.

For example, suppose CurrentChar is the current input character. End of file is represented by a special character value, eof. Using the DFA for the Java comments shown earlier, a table-driven scanner is:

State = StartState
while (true) {
    if (CurrentChar == eof)
        break
    NextState = T[State][CurrentChar]
    if (NextState == error)
        break
    State = NextState
    read(CurrentChar)
}
if (State in AcceptingStates)
    // Process valid token
else
    // Signal a lexical error
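A runnable version of this driver can be sketched in Java (our own rendering of the pseudocode; eof is modeled as the end of an input string, and the comment DFA is hardcoded as a next-state function):

```java
class TableScanner {
    static final int ERROR = 0, START = 1, ACCEPT = 4;

    // The comment DFA from the notes: state 1 --'/'--> 2 --'/'--> 3;
    // state 3 loops on Not(Eol) and moves to accepting state 4 on Eol.
    static int next(int state, char c) {
        if (state == 1 && c == '/') return 2;
        if (state == 2 && c == '/') return 3;
        if (state == 3) return (c == '\n') ? 4 : 3;
        return ERROR;
    }

    // The driver loop: run the DFA until eof or an error entry, then
    // check whether we stopped in an accepting state.
    static boolean scan(String input) {
        int state = START;
        int pos = 0;
        while (true) {
            if (pos >= input.length()) break;           // eof
            int nextState = next(state, input.charAt(pos));
            if (nextState == ERROR) break;
            state = nextState;
            pos++;                                      // read next char
        }
        return state == ACCEPT;  // in AcceptingStates?
    }
}
```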

This form of scanner is produced by a scanner generator; it is definition-independent. The scanner is a driver that can scan any token if T contains the appropriate transition table. Here is an explicit-control scanner for the same comment definition:

if (CurrentChar == '/') {
    read(CurrentChar)
    if (CurrentChar == '/')
        repeat
            read(CurrentChar)
        until (CurrentChar in {Eol, eof})
    else
        // Signal lexical error
} else
    // Signal lexical error
if (CurrentChar == Eol)
    // Process valid token
else
    // Signal lexical error

The token being scanned is hardwired into the logic of the code. The scanner is usually easy to read and is often more efficient, but is specific to a single token definition.
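For comparison, here is a runnable Java rendering of the explicit-control scanner above (our own adaptation; input is a string and eof is the end of that string):

```java
class ExplicitScanner {
    // Hardwired comment scanner, following the explicit-control
    // pseudocode: check for "//", then consume characters until
    // Eol ('\n') or eof, and accept only if we stopped at Eol.
    static boolean scanComment(String input) {
        int pos = 0;
        if (pos < input.length() && input.charAt(pos) == '/') {
            pos++;
            if (pos < input.length() && input.charAt(pos) == '/') {
                do {
                    pos++;
                } while (pos < input.length() && input.charAt(pos) != '\n');
            } else {
                return false;  // signal lexical error
            }
        } else {
            return false;      // signal lexical error
        }
        // Process valid token only if we stopped at Eol, not eof.
        return pos < input.length() && input.charAt(pos) == '\n';
    }
}
```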

More Examples

A FORTRAN-like real literal (which requires digits on either or both sides of a decimal point, or just a string of digits) can be defined as

RealLit = (D+ (λ | . )) | (D* . D+)

This corresponds to the DFA: [diagram omitted: one path accepts a digit run with an optional trailing decimal point; the other accepts an optional digit run, then a decimal point, then a required digit run.]
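The same definition can be checked with an ordinary regular-expression library. A hedged sketch in Java (the pattern is our transliteration of RealLit, with \d standing in for D):

```java
import java.util.regex.Pattern;

class RealLit {
    // RealLit = (D+ (λ | .)) | (D* . D+), transliterated to a Java
    // regex: digits with an optional trailing point, or an optional
    // digit run, a point, and a required digit run.
    static final Pattern REAL_LIT =
        Pattern.compile("\\d+\\.?|\\d*\\.\\d+");

    static boolean isRealLit(String s) {
        return REAL_LIT.matcher(s).matches();  // whole-string match
    }
}
```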

An identifier consisting of letters, digits, and underscores, which begins with a letter and allows no adjacent or trailing underscores, may be defined as

ID = L (L | D)* ( _ (L | D)+ )*

This definition includes identifiers like sum or unit_cost, but excludes _one and two_ and grand total. The DFA is: [diagram omitted: a letter leads to a state that loops on letters and digits; an underscore leads to a state that requires at least one letter or digit before continuing.]
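This definition, too, transliterates into a library regex. In the sketch below (our own, not from the notes), L is [A-Za-z] and D is [0-9]:

```java
import java.util.regex.Pattern;

class Ident {
    // ID = L (L | D)* ( _ (L | D)+ )*: a letter, then letters/digits,
    // then zero or more groups of an underscore followed by at least
    // one letter or digit (so no leading, adjacent, or trailing '_').
    static final Pattern ID =
        Pattern.compile("[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*");

    static boolean isId(String s) {
        return ID.matcher(s).matches();
    }
}
```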

Lex/Flex/JLex

Lex is a well-known Unix scanner generator. It builds a scanner, in C, from a set of regular expressions that define the tokens to be scanned. Flex is a newer and faster version of Lex. JLex is a Java version of Lex. It generates a scanner coded in Java, though its regular expression definitions are very close to those used by Lex and Flex. Lex, Flex and JLex are largely non-procedural. You don't need to tell the tools how to scan. All you need to do is tell them what you want scanned (by giving them definitions of valid tokens).

This approach greatly simplifies building a scanner, since most of the details of scanning (I/O, buffering, character matching, etc.) are automatically handled.

JLex

JLex is coded in Java. To use it, you enter

java JLex.Main f.jlex

Your CLASSPATH should be set to search the directories where JLex's classes are stored. (The CLASSPATH we gave you includes JLex's classes.) After JLex runs (assuming there are no errors in your token specifications), the Java source file f.jlex.java is created. (f stands for any file name you choose. Thus csx.jlex might hold token definitions for CSX, and csx.jlex.java would hold the generated scanner.)

You compile f.jlex.java just like any Java program, using your favorite Java compiler. After compilation, the class file Yylex.class is created. It contains the methods:

Token yylex(), which is the actual scanner. The constructor for Yylex takes the file you want scanned, so new Yylex(System.in) will build a scanner that reads from System.in. Token is the token class you want returned by the scanner; you can tell JLex what class you want returned.

String yytext(), which returns the character text matched by the last call to yylex.

A simple example of using JLex is in ~cs536-1/public/jlex. Just enter

make test