
Writing A Scanner

A Scanner should create a token stream from the source code. Here are 4 ways to do this:

A. Hack something together till it works. (Blech!)
B. Use the language specs to produce a Deterministic Finite Automaton (DFA) that identifies tokens, then write a program that models this DFA.
C. Write a table-driven Scanner (more on this later).
D. Use LEX (a standard Unix program that generates a Scanner).

It is not hard to guess which approach we'll take. We first need to do a little background work.

A regular expression is an expression made from the following rules. We start with an alphabet S.

a) If letter a is in S, then a is a regular expression.
b) If s and t are regular expressions, so are st and s|t.
c) If s is a regular expression, so are (s), s+ and s*.

Regular expressions are patterns that describe strings:

a) Any letter of the alphabet describes itself.
b) st describes the strings that can be made from something described by s, followed by (concatenated with) something described by t. s|t describes anything that is described by either s or t.
c) (s) describes the same strings as s. s+ describes all the strings that can be made by concatenating together one or more strings described by s. s* is the same, only concatenating 0 or more strings described by s.

For example, let dig represent the decimal digits and pos the nonzero digits:

    pos ::= 1 | 2 | ... | 9
    dig ::= 0 | pos

Then we could represent the positive integers as

    PosInt ::= pos dig*

We could represent the nonnegative integers as

    NonNeg ::= 0 | pos dig*

We could represent all integers with

    Int ::= 0 | pos dig* | - pos dig*
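As a quick sanity check, these definitions translate directly into Java's java.util.regex notation, which writes pos as the character class [1-9] and our alternation bar as |. (This is just a sketch to connect the notation to something executable; it is not part of the lecture's formalism.)

public class IntRegexDemo {
    public static void main(String[] args) {
        String posInt = "[1-9][0-9]*";              // PosInt ::= pos dig*
        String nonNeg = "0|[1-9][0-9]*";            // NonNeg ::= 0 | pos dig*
        String anyInt = "0|-?[1-9][0-9]*";          // Int ::= 0 | pos dig* | - pos dig*
        System.out.println("42".matches(posInt));   // true
        System.out.println("0".matches(nonNeg));    // true
        System.out.println("-17".matches(anyInt));  // true
        System.out.println("007".matches(anyInt));  // false: leading 0
    }
}

Note that String.matches() tests the entire string against the pattern, which is exactly the "describes" relation above.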

If we let letter be the alphabet of letters a-z, then letter* bob letter* is the set of strings that contain "bob" as a substring.

Here is a regular expression describing the strings of 0's and 1's that contain an even number of 1's: 0*(10*10*)*0*

In language theory a language is any set of strings. We say a language is regular if it is described by some regular expression. Here is why we care: in almost all programming languages, the set of tokens forms a regular language. Sadly, programming languages almost never specify the regular expressions that describe their tokens.

Fortunately, there is another way to describe regular languages, and it leads directly to programs. A deterministic finite automaton or DFA consists of a set of states and transitions between those states. Some people call this a "finite state" automaton. Rather than give a formal definition, here is a picture of one:

The states are represented by circles, the transitions by labeled edges. One state is labeled "Start". One or more states are represented by double circles; these are the "terminal" states.

We say this automaton "accepts" or "rejects" a string according to the following rules. We begin in the Start state at the start of the string. We move through the transitions of the automaton using the letters of the string to determine which transition to take. If we get to a state where there is no transition out corresponding to the next letter, we reject the string. At the end of the string, we look to see what state we are in. If it is a terminal (double circle) state we accept the string; if it is not a terminal state we reject the string.

For example, consider our automaton with the string 10010. We begin in the Start state. With the first 1 we go to state B. On the next two 0's we stay in state B. On the 1 we go to state A; on the final 0 we stay in state A. We are at the end of the string in a terminal state, so this automaton accepts string 10010.

It is easy to turn such an automaton into code. Here is a Java function that takes a string and returns true or false according to whether this automaton accepts or rejects the string. For simplicity I have numbered the states: Start is state 0, A is state 1, and B is state 2:

public static boolean Accept(String s) {
    int state = 0;                     // 0 = Start, 1 = A, 2 = B
    for (int i = 0; i < s.length(); i++) {
        char ch = s.charAt(i);
        if (state == 0) {
            if (ch == '0') state = 1;  // Start --0--> A
            else state = 2;            // Start --1--> B
        } else if (state == 1) {
            if (ch == '1') state = 2;  // A --1--> B; a 0 stays in A
        } else if (state == 2) {
            if (ch == '1') state = 1;  // B --1--> A; a 0 stays in B
        }
    }
    if (state == 2) return false;      // B is not terminal
    else return true;                  // Start and A are terminal
}
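A quick way to exercise this function (a sketch, assuming Accept is in scope in the same class) is to compare its verdict with Java's regex engine running the expression 0*(10*10*)*0* from earlier; the two should always agree:

public static void main(String[] args) {
    String[] tests = { "10010", "1", "", "0000", "1101" };
    for (String s : tests) {
        boolean byDfa   = Accept(s);
        boolean byRegex = s.matches("0*(10*10*)*0*");  // even number of 1's
        System.out.println("\"" + s + "\": DFA=" + byDfa + " regex=" + byRegex);
    }
}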

In the Theory course (CSCI 383) we show that a language is regular if and only if it is the language accepted by some DFA. The tokens of most programming languages form regular languages. It is usually an easy matter to draw out a DFA that accepts these tokens and then convert that DFA to code. In principle, that is your Scanner.
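The DFA-to-code translation can also be mechanized, which is the idea behind approach C from the opening list: instead of hard-coding the transitions in if-statements, store them in a table indexed by state and input character. Here is a minimal table-driven version of the even-number-of-1's automaton (a sketch only; real table-driven scanners build these tables from the regular expressions automatically):

public class TableDrivenDFA {
    // Rows are states (0 = Start, 1 = A, 2 = B); columns are the inputs '0' and '1'.
    private static final int[][] DELTA = {
        { 1, 2 },   // Start: '0' -> A, '1' -> B
        { 1, 2 },   // A:     '0' -> A, '1' -> B
        { 2, 1 },   // B:     '0' -> B, '1' -> A
    };
    private static final boolean[] TERMINAL = { true, true, false };

    public static boolean accept(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch != '0' && ch != '1') return false;   // not in the alphabet
            state = DELTA[state][ch - '0'];
        }
        return TERMINAL[state];
    }
}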

For example, suppose our language has 4 kinds of tokens: Identifiers, non-negative Numbers, + and *. Here are some definitions:

    alpha ::= a | b | ... | z
    digit ::= 0 | 1 | 2 | ... | 9
    Id    ::= alpha (alpha | digit)*
    Int   ::= digit+ (let's not worry about leading 0's)
    Plus  ::= +
    Times ::= *

Here is the DFA for this:

It is easy enough to turn this DFA into code. There is an issue here that is inherent to thinking of scanners as DFA's. We scan through the entire program; no one is giving us individual strings and asking if these strings are tokens. One of the tasks of our scanner is to find the boundaries where one token ends and another one starts.

Think of the expression 23*4. If we are scanning through the start of this we see the digit 2, and know we are looking at a number. We see the digit 3 and we are still looking at a number. We see the * character and we know we are done with the number and have found a token, representing 23. However, we have already read the start of the next token! We will generally follow the longest-match rule: a token is the longest string of input characters that results in a valid token. So "23skidoo" results in two tokens: the Num "23" and the Id "skidoo".

We still need to deal with the fact that we don't recognize that most tokens are finished until we have read the first character of the next token. The solution to this problem is to keep a lookahead character -- the next character that has not yet been used.

This can be handled in various ways. In C, where there is an ungetc() function that sticks the most recently read character back on the input buffer, it is easy to use the input buffer for the lookahead; when you realize you have read past the end of the current token, just unget the extra character.

Java, and many other languages, don't have an unget operation. I find that the easiest way to handle this issue in such languages is to read the source code one line at a time and keep an index to the current location on this line. When you use the current character from the line, increment this index. To "unget" a character, just don't increment the index. The end of the line terminates any token. When you are at the end of the line and need the next character, read the next line from the file and set the character index to 0.

We will structure our Scanner to have a getNextToken() method. This method reads through the source file until it has found the next token, which it assigns to a Token variable. The next two slides have such a function for the Token language consisting of Nums, Ids, and + (with four more lines we could add a case for * as well).

public void getNextToken() throws ScannerException {
    int i = 0;   // i points at the next character to handle
    int j = 0;
    // skip whitespace at the start of the current line
    while (i < currentLine.length() && Character.isWhitespace(currentLine.charAt(i)))
        i += 1;
    if (i == currentLine.length()) {
        // this line is used up; get the next line or signal end-of-file
        if (input.hasNextLine()) {
            currentLine = input.nextLine();
            lineNumber += 1;
            getNextToken();
        } else {
            nextToken = new Token(Token.T_EOF, "", lineNumber);
            currentLine = "";
        }
    }

    else {   // i < currentLine.length()
        char ch = currentLine.charAt(i);
        if (Character.isDigit(ch)) {
            // a Num token: take the longest run of digits
            j = i + 1;
            while (j < currentLine.length() && Character.isDigit(currentLine.charAt(j)))
                j += 1;
            String tokenString = currentLine.substring(i, j);
            nextToken = new NumberToken(Token.T_NUM, tokenString, lineNumber);
            currentLine = currentLine.substring(j);
        } else if (Character.isLetter(ch)) {
            // an Id token: a letter followed by letters and digits
            j = i + 1;
            while (j < currentLine.length() && Character.isLetterOrDigit(currentLine.charAt(j)))
                j += 1;
            String tokenString = currentLine.substring(i, j);
            nextToken = new Token(Token.T_ID, tokenString, lineNumber);
            currentLine = currentLine.substring(j);
        } else if (ch == '+') {
            nextToken = new Token(Token.T_PLUS, "+", lineNumber);
            currentLine = currentLine.substring(i + 1);
        }
    }
}
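A typical driver calls getNextToken() repeatedly until it sees the EOF token. This is only a sketch: it assumes the surrounding Scanner class has a public nextToken field, a constructor that takes a file name, and that Token exposes its kind in a field named type, none of which is pinned down above.

// Hypothetical driver; names Scanner(String), nextToken, and type are assumptions.
Scanner scanner = new Scanner("program.bpl");
scanner.getNextToken();
while (scanner.nextToken.type != Token.T_EOF) {
    System.out.println(scanner.nextToken);   // assumes Token has a useful toString()
    scanner.getNextToken();
}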

Most languages, including BPL, have some tokens that are keywords, such as "if", "while", "void", etc. You could make a complicated DFA that recognizes these, but an easier way to handle them is to make them part of the code that finds Identifiers. Before you decide that a string represents an Id token, run it through a filter that checks it against each of the keywords. If you find that your token string is "if", for example, you want to make it an IF token rather than an ID token.
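One way to write that filter is a lookup table consulted just before the Id token is created. This is a sketch: the token codes T_IF, T_WHILE, and T_VOID are hypothetical, and BPL's real keyword set is longer.

import java.util.Map;

public class Keywords {
    // Hypothetical token codes for a few sample keywords.
    private static final Map<String, Integer> KEYWORDS = Map.of(
            "if",    Token.T_IF,
            "while", Token.T_WHILE,
            "void",  Token.T_VOID);

    // Returns the keyword token type for s, or T_ID for an ordinary identifier.
    public static int classify(String s) {
        return KEYWORDS.getOrDefault(s, Token.T_ID);
    }
}

With this in place, the isLetter branch of getNextToken() creates its token with new Token(Keywords.classify(tokenString), tokenString, lineNumber) instead of hard-coding Token.T_ID.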