
3) Lexical Analysis

Input: Program to be compiled.
Output: Stream of (token, value) pairs.

Problem: Read in characters and group them into tokens (words). Produce a program listing. Do it efficiently.

Analysis of compiler performance shows that most of the execution time is spent in the lexical analysis phase.

Rationale:

1) Modular design, allowing the compiler to be partitioned into pieces that can be developed independently.
2) It is more efficient for the parser to deal with words, not characters. Incorrect words are never seen by the parser.
3) Isolates character set dependencies: ASCII vs. EBCDIC.
4) Isolates the representation of symbols: <> instead of .ne. or !=; { ... } instead of begin ... end.

What is a token?

A token is a placeholder for a logical entity in a programming language. Tokens include keywords, constants, operators, punctuation, and identifiers. Tokens do not include white space or comments.

Example of tokenizing

if( price + gst - rebate <= 10.00 )
    gift = false;

Token    Token #   Value     Comment
if       10                  keyword
(        20                  left parenthesis
price    50        price     identifier
+        1         +         add operator
gst      50        gst       identifier
-        1         -         add operator
rebate   50        rebate    identifier
<=       2         <=        relational operator
10.00    51        10.00     float constant
)        21                  right parenthesis
gift     50        gift      identifier
=        3                   assign operator
false    50        false     identifier
;        4                   separator
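The slides do not show a concrete data structure for tokens. A minimal C sketch of one possibility, using the token numbers from the table above (the names TokenKind, Token, and the value field are illustrative, not from the course code):

/* Hypothetical token representation matching the table above. */
typedef enum {
    TOK_ADD_OP = 1,    /* + and -              */
    TOK_REL_OP = 2,    /* <=, <, >, ...        */
    TOK_ASSIGN = 3,    /* =                    */
    TOK_SEP    = 4,    /* ;                    */
    TOK_IF     = 10,   /* keyword "if"         */
    TOK_LPAREN = 20,
    TOK_RPAREN = 21,
    TOK_ID     = 50,   /* identifiers          */
    TOK_FLOAT  = 51    /* float constants      */
} TokenKind;

typedef struct {
    TokenKind kind;          /* the token number */
    char      value[64];     /* the lexeme, e.g. "price" or "10.00" */
} Token;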

Simple tokenizer

There is an obvious way of recognizing tokens. Consider recognizing the tokens end, else and identifiers:

c = getchar();
if( c == 'e' ) {
    c = getchar();
    if( c == 'n' ) {
        c = getchar();
        if( c == 'd' ) {
            next = getchar();
            if( !isletter( next ) && !isdigit( next ) )
                return( KEYWORD_END );
            else {
                /* Read to end of identifier. */
                return( IDENTIFIER );
            }
        }
        else {
            /* Read to end of identifier. */
            return( IDENTIFIER );
        }
    }
    else if( c == 'l' ) {
        /* Look for else keyword or identifier */
    }
    else {
        /* Look for other keywords or identifiers */
    }
}

This form of coding is easy to do, but is tedious! We can make it more modular and easier to construct, at the cost of some efficiency:

token GetToken()
{
    SkipWhiteSpace();
    c = getchar();
    if( isletter( c ) )
        return( ScanForIdentifier() );
    if( isdigit( c ) )
        return( ScanForConstant() );
    switch( c ) {
        case '(': return( LEFT_PAREN );
        case ')': return( RIGHT_PAREN );
        case '+': return( ScanForAddorIncrement() );
        case '-': return( ScanForSuborDecrement() );
        case '=': return( ScanForEqualsorAssign() );
        case '/': return( ScanForCommentorDivide() );
        ...
        default:  return( ERROR );
    }
}
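The helper routines are not shown on the slides. A minimal sketch of what ScanForIdentifier might look like, assuming the first letter is already in c as in GetToken above, that one character can be pushed back with ungetc, and that the token codes and buffer size are illustrative:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Assumed context from the code above. */
extern int c;                                 /* current character, read by GetToken() */
enum { IDENTIFIER = 50, KEYWORD_END = 11 };   /* illustrative token codes */
typedef int token;

token ScanForIdentifier( void )
{
    char buf[64];
    int  n = 0;
    int  next;

    buf[n++] = (char) c;                      /* first letter was read by GetToken() */
    while( ( next = getchar() ) != EOF
           && ( isalpha( next ) || isdigit( next ) ) && n < 63 )
        buf[n++] = (char) next;               /* gather the rest of the identifier */
    buf[n] = '\0';
    if( next != EOF )
        ungetc( next, stdin );                /* give back the character that ended it */

    if( strcmp( buf, "end" ) == 0 )           /* keyword check; a table lookup in practice */
        return KEYWORD_END;
    return IDENTIFIER;
}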

We would like to automate this process, having a tool that builds a fast, compact lexical analyzer for us automatically. It turns out that most tokens can be easily defined by a regular grammar: the user defines tokens in a form equivalent to regular grammars, and the system converts the grammar into code. There is a variety of tools to do this, all similar in their approach to automating lexical analysis.

Regular expressions

Regular grammars can be expressed in several other forms. One popular form is regular expressions. Three operations:

    concatenation   a followed by b    a b
    alternation     a or b             a | b
    repetition      a a ... a          a*   (zero or more)
                    a ( a a ... a )    a+   (one or more)

Note that regular expressions are equivalent to regular grammars. All regular expressions can be expressed as a regular grammar; all regular grammars can be converted to an equivalent regular expression. It is easy to show the equivalence...

Examples:

Let's use the following 2 macros to simplify our solutions:

    LETTER = ( a | b | ... | z | A | B | ... | Z )
    DIGIT  = ( 0 | 1 | ... | 9 )

1) An identifier must begin with a letter and can be followed by an arbitrary number of letters and digits.

Regular grammar (where <> denotes the empty string):

    ID      : LETTER ID_REST
    ID_REST : LETTER ID_REST | DIGIT ID_REST | <>

Regular expression:

    ID : LETTER ( LETTER | DIGIT )*

[Syntax diagram: ID is a LETTER followed by a loop that repeats LETTER or DIGIT.]

2) A floating point number is one or more digits, followed by a decimal point, followed by one or more digits.

Regular grammar:

    FLOAT  : DIGIT FLOAT | . DIGITS
    DIGITS : DIGIT DIGITS | DIGIT

Not correct, since it allows there to be no digits to the left of the decimal point.

Regular expression:

    FLOAT : ( DIGIT+ ) '.' ( DIGIT+ )

[Syntax diagram: FLOAT is one or more DIGITs, a decimal point, then one or more DIGITs.]
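Hand-coding a check of this expression is straightforward. A minimal sketch in C (the function name matches_float is illustrative), assuming the candidate lexeme is already in a string:

#include <ctype.h>

/* Returns 1 if s matches DIGIT+ '.' DIGIT+, else 0. */
int matches_float( const char *s )
{
    const char *p = s;

    if( !isdigit( (unsigned char) *p ) ) return 0;   /* need at least one digit */
    while( isdigit( (unsigned char) *p ) ) p++;      /* DIGIT+ */

    if( *p != '.' ) return 0;                        /* the decimal point */
    p++;

    if( !isdigit( (unsigned char) *p ) ) return 0;   /* need at least one digit */
    while( isdigit( (unsigned char) *p ) ) p++;      /* DIGIT+ */

    return *p == '\0';                               /* nothing may follow */
}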

These rules allow infinitely long identifiers and infinitely precise numbers. In the real world, there have to be restrictions:

Identifiers: Some programming languages impose a limit on the length of an identifier. Fortran, for example, only considers the first 6 characters of an identifier's name. C used to recognize only the first 8 characters (caseless) for external names. The advantage? Simplicity of saving names in the symbol table. The disadvantage? Only to the user.

Numbers: Machines have finite precision. Therefore, a limit must be placed on the number of digits. Some compilers generate error messages if you use a number that is too large/small/precise. Others do not flag an error and give you a questionable alternative.

The lexical rules must be supplemented by additional language/hardware constraints.

UNIX and regular expressions

Regular expressions are an integral part of the UNIX tool set:

    editors (ed, ex, vi)
    sed
    awk
    grep / fgrep / egrep
    specifying file names to shells (sh, csh, tcsh)

For example:

    egrep "(John|Jonathan).*Schaeffer" *.c

where | means alternation, ( ) is used for grouping, . matches any character, and * causes the previous character to match an arbitrary number of times.
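The same pattern can also be used from a C program through the POSIX regular-expression interface; a minimal sketch (the pattern is the one from the slide, the test string and everything else is illustrative):

#include <regex.h>
#include <stdio.h>

int main( void )
{
    regex_t re;
    const char *line = "notes by Jonathan Schaeffer";

    /* Compile an extended (egrep-style) regular expression. */
    if( regcomp( &re, "(John|Jonathan).*Schaeffer", REG_EXTENDED ) != 0 )
        return 1;

    if( regexec( &re, line, 0, NULL, 0 ) == 0 )
        printf( "match\n" );
    else
        printf( "no match\n" );

    regfree( &re );
    return 0;
}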

Exercise 1

A real number consists of 2 parts:
1) The integer part, consisting of one or more digits. A number may not begin with a zero, unless the integer is just zero.
2) The decimal part, consisting of a decimal point followed by one or more digits.

Construct a regular expression for real numbers.

Solution:

Finite automata

Yet another form that is equivalent to a regular grammar is a finite automaton. Draw a diagram where terminal symbols are transitions and non-terminals are nodes. For example:

    S : a S | b S | a A
    A : a C
    C : a C | b C | b B
    B : b D | b
    D : a D | b D | a | b

[Diagram: the finite automaton for this grammar, with nodes S, A, C, B, D and a final node F, and transitions labelled a and b.]

Here we have added an F (final) state to acknowledge when we have reached a point where we know that the input is legal.

This diagram is equivalent to the regular expression:

    ( a | b )* a a ( a | b )* b b ( a | b )*

that is, any string containing two a's followed by two b's.

To determine if the input is accepted, move from state to state, guided by the input characters. However, this is a non-deterministic finite automaton: in state S, on an a, do you stay in state S or go to state A? In a deterministic finite automaton, each state has only one transition for each input character.

It turns out that regular grammars, regular expressions, non-deterministic finite automata (NDFA), and deterministic finite automata (DFA) are all equivalent.

Converting an NDFA to a DFA

    State   a       b
    S       S, A    S
    A       C       error
    B       error   D, F
    C       C       B, C
    D       D, F    D, F

In state S, on input a, do you go to state S or A? Don't make up your mind just yet; postpone the decision by going to a new state SA:

    State S on input a, go to state SA
    State S on input b, go to state S

In this new SA state, on input a, where do you go? If in state S, you would go to S or A. In state A on an a, you would go to state C. Create a new state SAC which reflects all 3 possibilities:

    State SA on input a, go to state SAC
    State SA on input b, go to state S

    State   a       b
    S       S, A    S
    A       C       error
    B       error   D, F
    C       C       B, C
    D       D, F    D, F

    State   a       b
    S       SA      S
    SA      SAC     S
    SAC     SAC     SBC
    SBC     SAC     SBCDF = F

[Diagram: the resulting DFA, with states S, SA, SAC, SBC and final state F, and transitions labelled a and b.]
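This subset construction can be carried out mechanically. A minimal sketch in C, assuming the NFA transition table from these slides encoded as bit sets (all names and the table layout are illustrative); unlike the slide's table, which stops once the final state F is reached, the sketch keeps expanding accepting sets, but both describe the same language:

#include <stdio.h>

/* NFA states S, A, B, C, D, F encoded as bits 0..5. */
enum { S = 1 << 0, A = 1 << 1, B = 1 << 2, C = 1 << 3, D = 1 << 4, F = 1 << 5 };

/* delta[state][input]: input 0 = 'a', 1 = 'b'; entries are bit sets of NFA states. */
static const int delta[6][2] = {
    /* S */ { S | A, S     },
    /* A */ { C,     0     },
    /* B */ { 0,     D | F },
    /* C */ { C,     B | C },
    /* D */ { D | F, D | F },
    /* F */ { 0,     0     },
};

/* Move: union of the transitions of every NFA state in the set. */
static int move( int set, int input )
{
    int result = 0;
    for( int i = 0; i < 6; i++ )
        if( set & ( 1 << i ) )
            result |= delta[i][input];
    return result;
}

int main( void )
{
    int dstates[64];          /* discovered DFA states, each an NFA-state set */
    int count = 0;

    dstates[count++] = S;     /* the DFA start state is the set { S } */

    /* Worklist loop: process each DFA state once, adding new sets as they appear. */
    for( int i = 0; i < count; i++ ) {
        for( int input = 0; input < 2; input++ ) {
            int target = move( dstates[i], input );
            int j;
            for( j = 0; j < count; j++ )      /* have we seen this set before? */
                if( dstates[j] == target )
                    break;
            if( j == count && target != 0 )
                dstates[count++] = target;    /* no: it becomes a new DFA state */
            printf( "state %d --%c--> set 0x%02x%s\n",
                    i, "ab"[input], (unsigned) target,
                    ( target & F ) ? "  (accepting)" : "" );
        }
    }
    return 0;
}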

Once we have a DFA, the code is easy:

S:   c = getchar();
     if( c == 'a' ) go to SA;
     if( c == 'b' ) go to S;
     error();

SA:  c = getchar();
     if( c == 'a' ) go to SAC;
     if( c == 'b' ) go to S;
     error();

SAC: c = getchar();
     if( c == 'a' ) go to SAC;
     if( c == 'b' ) go to SBC;
     error();

Or we could build a table-driven lexical analyzer:

token LexicalDriver( LexTable )
{
    state = laststate;
    for( ; ; ) {
        c = NextChar();
        state = LexTable[ state, c ];
        if( state != error && state != finalstate ) {
            AddToToken( c );
            AdvanceInput();
        }
        else
            break;
    }
    if( state != finalstate )
        return( ERROR );
    else
        return( Token[ finalstate ] );
}
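A concrete, runnable variant of the same idea for the DFA constructed above; a minimal C sketch (the state names and table layout are illustrative), which reads one line of a's and b's and reports whether it contains two consecutive a's followed by two consecutive b's. The row for F differs from the slide's table, which stops once F is reached: F loops on both inputs here, reflecting the trailing ( a | b )* of the regular expression.

#include <stdio.h>

/* DFA states from the subset construction: S, SA, SAC, SBC, F (accepting). */
enum { ST_S, ST_SA, ST_SAC, ST_SBC, ST_F, NSTATES };

/* table[state][input]: input 0 = 'a', 1 = 'b'. */
static const int table[NSTATES][2] = {
    /* S   */ { ST_SA,  ST_S   },
    /* SA  */ { ST_SAC, ST_S   },
    /* SAC */ { ST_SAC, ST_SBC },
    /* SBC */ { ST_SAC, ST_F   },
    /* F   */ { ST_F,   ST_F   },
};

int main( void )
{
    int state = ST_S;
    int c;

    while( ( c = getchar() ) != EOF && c != '\n' ) {
        if( c != 'a' && c != 'b' ) {           /* reject characters outside the alphabet */
            printf( "error: unexpected character '%c'\n", c );
            return 1;
        }
        state = table[state][ c == 'b' ];      /* index 0 for 'a', 1 for 'b' */
    }
    printf( state == ST_F ? "accepted\n" : "rejected\n" );
    return 0;
}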

Lexical analyzer generators

How does a lexical analyzer generator work?

1) Get input from the user, who defines the tokens in a form that is equivalent to regular grammars (usually regular expressions or syntax diagrams).
2) Turn the input into a non-deterministic finite automaton.
3) Convert the non-deterministic finite automaton to a deterministic finite automaton.
4) Generate code to recognize the deterministic finite automaton.

Exercise 2

Given the grammar:

    S : a S | a A | b S | b B
    A : b B | a C
    B : a C | b
    C : b S | b B

Draw the non-deterministic finite automaton represented by this grammar.

Construct the deterministic finite automaton:

    State   a       b

Draw a diagram of the deterministic finite automaton:

Output listing and lexical errors

A compiler must produce a listing of the program being compiled, augmented with informative error messages that are inserted near the locations of the errors. The usual technique for producing a listing is to have the lexical analyzer print the text as it is tokenizing. A complication is that errors should not be printed as they occur, since they would appear in the middle of lines. Instead, the errors should be queued and only output once a new-line is reached.

Once a lexical error occurs, the lexical analyzer must recover from it and continue to tokenize the input. There are two simple approaches to lexical error recovery:

1) Ignore all characters read as part of the erroneous token and start a new token.
2) Delete the first character read of the erroneous token and start re-reading the input after the deleted character. This has the extra complication that input has to be read and re-read.

One error that requires special handling is a runaway string. Be careful not to propagate error messages!
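One way to realize the queue-until-newline idea; a minimal sketch in C (the queue size, function names, and message format are all illustrative):

#include <stdio.h>

#define MAX_ERRORS 16

static char error_queue[MAX_ERRORS][128];
static int  error_count = 0;
static int  line_number = 1;

/* Remember an error; it is printed only after the current source line. */
static void QueueError( const char *msg )
{
    if( error_count < MAX_ERRORS )
        snprintf( error_queue[error_count++], 128,
                  "line %d: %s", line_number, msg );
}

/* Called by the lexical analyzer when a new-line has been read and the
   source line has been echoed: flush the queued messages beneath it. */
static void FlushErrors( void )
{
    for( int i = 0; i < error_count; i++ )
        printf( "*** %s\n", error_queue[i] );
    error_count = 0;
    line_number++;
}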

Lex - a lexical analyzer

Source code:    Look at the file ex1.l
Lexing:         lex ex1.l
Lex output:     Look at the file lex.yy.c. Does any of it make sense?
Compilation:    make
Execution:      ex1

Modify the code:
    Can you modify the rules so that constants cannot have a leading 0?
    What about arbitrarily long identifier names? Comments? Floating point numbers?
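The contents of ex1.l are not reproduced in this transcript. Purely as an illustration of what a small Lex specification of the kind discussed here looks like (not the actual ex1.l), a sketch with rules for identifiers, integer constants, and floating point numbers:

%{
#include <stdio.h>
%}
LETTER  [a-zA-Z]
DIGIT   [0-9]
%%
{LETTER}({LETTER}|{DIGIT})*   { printf( "IDENTIFIER %s\n", yytext ); }
{DIGIT}+"."{DIGIT}+           { printf( "FLOAT      %s\n", yytext ); }
{DIGIT}+                      { printf( "CONSTANT   %s\n", yytext ); }
[ \t\n]+                      { /* skip white space */ }
.                             { printf( "ERROR      %s\n", yytext ); }
%%
int main( void ) { yylex(); return 0; }
int yywrap( void ) { return 1; }

With a specification in this style, the leading-zero question from the slide amounts to changing the constant rule's pattern from {DIGIT}+ to 0|[1-9]{DIGIT}*.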