Lexical Analysis, V: Implementing Scanners. COMP 412, FALL 2017. Section 2.5 in EaC2e. (Diagram: source code → Front End → Optimizer → Back End → target code)


COMP 412, FALL 2017. Lexical Analysis, V: Implementing Scanners. (Diagram: source code → Front End → Optimizer → Back End → target code, with IR between the phases.) Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. Section 2.5 in EaC2e.

Table-Driven Scanners

A common strategy is to simulate DFA execution: a table plus a skeleton scanner. So far, we have used a simplified skeleton:

    state ← s0
    while (state ≠ exit) do
        char ← NextChar()          // read next character
        state ← δ(state, char)     // take the transition

In practice, the skeleton is more complex:
- Character classification for table compression
- Forming a string from the characters (recording the lexeme)
- Recognizing sub-expressions

Practice is to combine all the REs into one DFA. Must recognize individual words without hitting EOF. (DFA sketch: s0, on r then one or more digits 0-9, to final state sf.)
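The skeleton above can be sketched in Python. This is a minimal, hedged illustration, not the slides' implementation: the DFA recognizes register names of the form r followed by digits (the example used later in the lecture), and the names `run_dfa`, `delta`, and `EXIT` are invented for this sketch.

```python
# Minimal table-driven scanner skeleton: simulate DFA execution with a
# transition function and a driver loop. Recognizes r[0-9]+ register names.

EXIT = -1            # error/exit state
ACCEPTING = {2}      # s2 is the only accepting state

def delta(state, char):
    """Transition function, written as code instead of a literal table."""
    if state == 0 and char == 'r':
        return 1
    if state in (1, 2) and char.isdigit():
        return 2
    return EXIT

def run_dfa(word):
    """Simulate the DFA on a whole word; True iff the DFA accepts it."""
    state = 0
    for char in word:                 # char <- NextChar()
        state = delta(state, char)    # take the transition
        if state == EXIT:
            return False
    return state in ACCEPTING

print(run_dfa("r17"))   # True
print(run_dfa("r"))     # False: needs at least one digit
print(run_dfa("x1"))    # False
```

The driver loop is exactly the skeleton on the slide; everything interesting lives in δ.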

Table-Driven Scanners: Character Classification

Group characters together by their actions in the DFA:
- Combine identical columns in the transition table δ
- Indexing δ by a character's class shrinks the table

    state ← s0
    while (state ≠ exit) do
        char ← NextChar()          // read next character
        cat ← CharCat(char)        // classify character
        state ← δ(state, cat)      // take the transition

The idea works well in ASCII (or EBCDIC): compact, byte-oriented character sets with a limited range of values. It is not clear how it extends to larger character sets (Unicode). The obvious algorithm is O(|Σ|² |S|). Can you do better?

Table-Driven Scanners: Forming a String with the Lexeme

The scanner produces the syntactic category (part of speech); most applications want the lexeme (the word), too.

    state ← s0
    lexeme ← empty string
    while (state ≠ exit) do
        char ← NextChar()             // read next character
        lexeme ← lexeme + char        // concatenate onto lexeme
        cat ← CharCat(char)           // classify character
        state ← δ(state, cat)         // take the transition

Choose the syntactic category based on the final state (see Lecture 4, Slide 16). This problem is trivial: save the characters.

Table-Driven Scanners: Choosing a Category from an Ambiguous RE

We want one DFA, so we combine all the REs into one. Some strings may fit the RE for more than one syntactic category:
- Keywords versus general identifiers: [a-z]([a-z]|[0-9])* vs. for, while
- We would like to encode all of these distinctions into the RE & recognize them in a single DFA

The scanner must choose a category for ambiguous final states. Classic answer: specify priority by order of the REs (return the first one). ([a-z] is lex notation for the letters a through z in ASCII collating-sequence order.)

Keywords Folded into the RE: Example

The regular expression ( while | for | [a-z]([a-z]|[0-9])* ) leads to a DFA such as:

    s0 -f-> s2 -o-> s3 -r-> s4                    (s4: for)
    s0 -w-> s5 -h-> s6 -i-> s7 -l-> s8 -e-> s9    (s9: while)
    s0 -[a-e],[g-u],[x-z]-> s10                   (s10: general identifier, loops on [a-z]|[0-9])

Need transitions on [a-z]|[0-9] from the states along the for and while paths into s10, so that a word such as ford falls through to the identifier state.

Table-Driven Scanners: Choosing a Category from an Ambiguous RE

We want one DFA, so we combine all the REs into one. Some strings may fit the RE for more than one syntactic category: keywords versus general identifiers. We would like to recognize them all with one DFA. The scanner must choose a category for ambiguous final states. Classic answer: specify priority by order of the REs (return the first one).

Alternate Implementation Strategy (quite popular):
- Build a hash table of all identifiers & fold the keywords into the RE for identifier
- Preload the keywords into the hash table & set their categories

This makes sense if:
- the scanner will enter all identifiers in the table, or
- the scanner is hand coded

Otherwise, let the DFA handle them (O(1) cost per character). A separate keyword table can make matters worse.
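The preloaded-hash-table strategy can be sketched briefly. This is a hedged illustration: the identifier recognizer is a regex stand-in for the DFA, and the names `categories` and `categorize` are invented here. One O(1) lookup decides keyword versus general identifier.

```python
# Keyword table strategy: recognize every identifier-shaped word with one
# recognizer, then assign its category by lookup in a preloaded table.
import re

categories = {"for": "keyword", "while": "keyword"}   # preloaded keywords

def categorize(word):
    """Return the part of speech for a scanned word."""
    if not re.fullmatch(r"[a-z][a-z0-9]*", word):
        return "invalid"
    # setdefault also enters the identifier in the table, as the slide
    # suggests a hand-coded scanner might do
    return categories.setdefault(word, "identifier")

print(categorize("for"))    # keyword
print(categorize("ford"))   # identifier
```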

Table-Driven Scanners: Scanning a Stream of Words

Real scanners do not look for one word per input stream. We want the scanner to find all the words in the input stream, in order, and to return one word at a time.

Syntactic solution: we can insist on delimiters (blank, tab, punctuation, …). Do you want to force blanks everywhere? In expressions?

Implementation solution: run the DFA to an error or EOF, then back up to an accepting state. We need the scanner to return a token, not a boolean:
- A token is a <lexeme, part of speech> pair
- Use a map from the DFA's final state to the part of speech (PoS)
- for ends in a final state that maps to for
- ford ends in a final state that maps to identifier

(EaC1e note: this scanner differs significantly from the discussion in the 1st edition of EaC.)

Table-Driven Scanners: Handling a Stream of Words

    // recognize words
    state ← s0
    lexeme ← empty string
    clear stack
    push(bad)
    while (state ≠ se) do
        char ← NextChar()
        lexeme ← lexeme + char
        if state ∈ SA then clear stack
        push(state)
        cat ← CharCat(char)
        state ← δ(state, cat)
    end

    // clean up final state
    while (state ∉ SA and state ≠ bad) do
        state ← pop()
        truncate lexeme
        roll back the input one character
    end

    // report the results
    if (state ∈ SA)
        then return <PoS(state), lexeme>
        else return invalid

SA is the set of accepting (i.e., final) states; PoS maps a final state to its part of speech. Need a clever buffering scheme, such as double buffering, to support the roll back.
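The skeleton above can be rendered in Python. This is a hedged sketch, not the book's code: the DFA recognizes ab | (ab)*c (the rollback example used later in the lecture), the dictionary encoding of δ is a convenience, and the names `next_word`, `delta`, and `BAD` are invented here.

```python
# Stream-of-words skeleton: run the DFA until it gets stuck, then pop
# the state stack back to the most recent accepting state.

BAD, ERR = "bad", None                 # stack sentinel, error state
ACCEPTING = {"s2", "s5"}               # s2 accepts ab, s5 accepts (ab)*c
delta = {("s0", "a"): "s1", ("s1", "b"): "s2", ("s2", "a"): "s3",
         ("s2", "c"): "s5", ("s3", "b"): "s4", ("s4", "a"): "s3",
         ("s4", "c"): "s5", ("s0", "c"): "s5"}

def next_word(stream, pos):
    """Scan one word starting at pos; return (lexeme or None, new pos)."""
    state, lexeme, stack = "s0", "", [BAD]
    while state is not ERR and pos < len(stream):
        char = stream[pos]
        pos += 1
        lexeme += char
        if state in ACCEPTING:         # passing an accepting state:
            stack.clear()              # no need to roll back beyond it
        stack.append(state)
        state = delta.get((state, char), ERR)
    # clean up: pop back to the most recent accepting state
    while state not in ACCEPTING and state != BAD:
        state = stack.pop()
        if state != BAD:               # bad is the bottom sentinel
            lexeme = lexeme[:-1]       # truncate lexeme
            pos -= 1                   # roll the input back one character
    if state in ACCEPTING:
        return lexeme, pos
    return None, pos

print(next_word("abab", 0))            # ('ab', 2): rolled back two chars
```

On input abab the DFA runs to the end in a non-accepting state, then pops back to the accepting state reached after ab.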

Avoiding Excess Rollback

Not too pretty: some REs can produce quadratic rollback. Consider ab | (ab)* c and its DFA:

    s0 -a-> s1 -b-> s2      (s2 accepting)
    s2 -a-> s3 -b-> s4
    s4 -a-> s3
    s0 -c-> s5, s2 -c-> s5, s4 -c-> s5      (s5 accepting)

Input ababababc: s0, s1, s2, s3, s4, s3, s4, s3, s4, s5.

Input abababab:
    s0, s1, s2, s3, s4, s3, s4, s3, s4 — rollback 6 characters
    s0, s1, s2, s3, s4, s3, s4 — rollback 4 characters
    s0, s1, s2, s3, s4 — rollback 2 characters
    s0, s1, s2

This behavior is preventable: have the scanner remember paths that fail on particular inputs. A simple modification creates the maximal munch scanner.

Note that Exercise 2.16 on page 82 of EaC2e is worded incorrectly. You can do better than the scheme shown in Figure 2.15, but you cannot avoid, in the worst case, space proportional to the input string. (Alternative: a fixed upper bound on token length.)

Maximal Munch Scanner

    // recognize words
    state ← s0
    lexeme ← empty string
    clear stack
    push(<bad, bad>)
    while (state ≠ se) do
        char ← NextChar()
        InputPos ← InputPos + 1
        lexeme ← lexeme + char
        if Failed[state, InputPos] then break
        if state ∈ SA then clear stack
        push(<state, InputPos>)
        cat ← CharCat(char)
        state ← δ(state, cat)
    end

    // clean up final state
    while (state ∉ SA and state ≠ bad) do
        Failed[state, InputPos] ← true
        <state, InputPos> ← pop()
        truncate lexeme
        roll back the input one character
    end

    // report the results
    if (state ∈ SA)
        then return <PoS(state), lexeme>
        else return invalid

    InitializeScanner()
        InputPos ← 0
        for each state s in the DFA do
            for i ← 0 to |input| do
                Failed[s, i] ← false
            end
        end

Thomas Reps, "'Maximal-munch' tokenization in linear time," ACM TOPLAS, 20(2), March 1998, pp. 259-273.
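A hedged Python sketch of this algorithm follows. It keeps Failed as a set of (state, input position) pairs instead of a preallocated bit array (one possible "clever implementation" of the space reduction mentioned on the next slide), and reuses an illustrative DFA for ab | (ab)*c; the names are invented here.

```python
# Maximal munch scanner: like the rollback scanner, but remember
# (state, position) pairs that are known to fail, and stop early when
# the scan reaches one again.

BAD, ERR = "bad", None
ACCEPTING = {"s2", "s5"}
delta = {("s0", "a"): "s1", ("s1", "b"): "s2", ("s2", "a"): "s3",
         ("s2", "c"): "s5", ("s3", "b"): "s4", ("s4", "a"): "s3",
         ("s4", "c"): "s5", ("s0", "c"): "s5"}
failed = set()                         # remembered dead ends

def next_word(stream, pos):
    """Scan one word starting at pos; return (lexeme or None, new pos)."""
    state, lexeme, stack = "s0", "", [(BAD, pos)]
    while state is not ERR and pos < len(stream):
        if (state, pos) in failed:     # this path is known to fail: stop
            break
        char = stream[pos]
        pos += 1
        lexeme += char
        if state in ACCEPTING:
            stack.clear()
        stack.append((state, pos - 1))
        state = delta.get((state, char), ERR)
    # clean up: mark the failing suffix, pop back to an accepting state
    while state not in ACCEPTING and state != BAD:
        if state is not ERR:
            failed.add((state, pos))   # remember the dead end
        state, pos = stack.pop()
        lexeme = lexeme[:-1]
    if state in ACCEPTING:
        return lexeme, pos
    return None, pos
```

Scanning abababab now rolls back fully only once; the second call breaks out as soon as it reaches a remembered dead end, which is what makes the whole tokenization linear.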

Maximal Munch Scanner

- Uses a bit array, Failed, to track dead-end paths
- Initialize both InputPos & Failed in InitializeScanner()
- Failed requires space proportional to the input stream; a clever implementation can reduce the space requirement
- Avoids quadratic rollback and produces an efficient scanner

Can your favorite language cause quadratic rollback? If so, the solution is inexpensive. If not, you might encounter the problem in other applications of these technologies.

Table-Driven Versus Direct-Coded Scanners

Table-driven scanners make heavy use of indexing:
- Read the next character
- Classify it
- Find the next state
- Branch back to the top

    state ← s0
    while (state ≠ exit) do
        char ← NextChar()
        cat ← CharCat(char)
        state ← δ(state, cat)

Alternative strategy: direct coding
- Encode the state in the program counter; each state is a separate piece of code
- Do the transition tests locally and branch directly
- Generates ugly, spaghetti-like code
- More efficient than the table-driven strategy: fewer memory operations, though possibly more branches, and code locality as opposed to random access in δ

Table-Driven Versus Direct-Coded Scanners: Overhead of Table Lookup

Each lookup in CharCat or δ involves an address calculation and a memory operation:
- CharCat(char) becomes @CharCat0 + char × w, where w is sizeof(element of CharCat)
- δ(state, cat) becomes @δ0 + (state × cols + cat) × w, where cols is the number of columns in δ and w is sizeof(element of δ)

The references to CharCat and δ expand into multiple operations: a fair amount of overhead work per character. Avoid the table lookups and the scanner will run faster. (We will see these expansions in Ch. 7. A reference to an array or vector is almost always more expensive than a reference to a scalar variable.)

Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit*:

    start: accept ← se
           lexeme ← empty string
           count ← 0
           goto s0

    s0:    char ← NextChar()
           lexeme ← lexeme + char
           count ← count + 1
           if (char = 'r') then goto s1 else goto sout

    s1:    char ← NextChar()
           lexeme ← lexeme + char
           count ← count + 1
           if ('0' ≤ char ≤ '9') then goto s2 else goto sout

    s2:    char ← NextChar()
           lexeme ← lexeme + char
           count ← 1
           accept ← s2
           if ('0' ≤ char ≤ '9') then goto s2 else goto sout

    sout:  if (accept ≠ se) then
               for i ← 1 to count do RollBack()
               return <PoS, lexeme>
           else return <invalid, invalid>

- Fewer (complex) memory operations
- No character classifier
- Use multiple strategies for test & branch
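The direct-coded recognizer can be transliterated into Python as a sketch. Python has no goto, so the states become straight-line code plus one loop for s2, and the counted RollBack() becomes a saved accept cursor; `scan_register` is an invented name.

```python
# Direct-coded recognizer for r Digit Digit*: the state lives in the
# program counter (which line of code is executing), not in a table index.

def scan_register(stream):
    """Return the register-name lexeme at the front of stream, or None."""
    pos = 0
    # s0: the word must begin with the letter r
    if pos < len(stream) and stream[pos] == "r":
        pos += 1
    else:
        return None                    # goto s_out with accept = s_e
    # s1: the r must be followed by at least one digit
    if pos < len(stream) and stream[pos].isdigit():
        pos += 1
        accept = pos                   # s2 is an accepting state
    else:
        return None
    # s2: consume more digits, advancing the accept point each time
    while pos < len(stream) and stream[pos].isdigit():
        pos += 1
        accept = pos
    # s_out: report the accepted lexeme; anything after it is rolled back
    return stream[:accept]

print(scan_register("r17;"))           # r17
```

Each transition test is a local comparison and branch, with no lookups in CharCat or δ.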

Direct coding the maximal munch scanner is easy, too.

Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit* (as on the previous slide, here reporting success or failure):

    start: accept ← se
           lexeme ← empty string
           count ← 0
           goto s0

    s0:    char ← NextChar()
           lexeme ← lexeme + char
           count ← count + 1
           if (char = 'r') then goto s1 else goto sout

    s1:    char ← NextChar()
           lexeme ← lexeme + char
           count ← count + 1
           if ('0' ≤ char ≤ '9') then goto s2 else goto sout

    s2:    char ← NextChar()
           lexeme ← lexeme + char
           count ← 1
           accept ← s2
           if ('0' ≤ char ≤ '9') then goto s2 else goto sout

    sout:  if (accept ≠ se) then begin
               for i ← 1 to count do RollBack()
               report success
           end
           else report failure

If the end-of-state test is complex (e.g., many cases), the scanner generator should consider other schemes:
- Table lookup (with classification?)
- Binary search

What About Hand-Coded Scanners?

Many modern compilers use hand-coded scanners. Starting from a DFA simplifies design & understanding. There are things you can do that don't fit well into a tool-based scanner:
- Computing the value of an integer: in LEX or FLEX, many folks use sscanf() & touch the characters many times; you can use the old assembly trick and compute the value as it appears:
      value ← value * 10 + (digit - '0')
- Combine similar states (serial or parallel)

Scanners are fun to write: compact, comprehensible, easy to debug, … Don't get too cute (e.g., perfect hashing for keywords).
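The "old assembly trick" takes two lines in any language. A small sketch, with `scan_int` as an invented name:

```python
# Accumulate an integer's value while scanning its digits, instead of
# re-reading the finished lexeme with something like sscanf().

def scan_int(stream, pos=0):
    """Scan a run of digits starting at pos; return (value, new pos)."""
    value = 0
    while pos < len(stream) and stream[pos].isdigit():
        # value <- value * 10 + (digit - '0')
        value = value * 10 + (ord(stream[pos]) - ord("0"))
        pos += 1
    return value, pos

print(scan_int("412;"))                # (412, 3)
```

Each character is touched exactly once, which is the point of folding the conversion into the scan.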

Building Scanners: The Point

All this technology lets us automate scanner construction:
- The implementer writes down the regular expressions
- The scanner generator builds the NFA, the DFA, the minimal DFA, and then writes out the (table-driven or direct-coded) code

This reliably produces fast, robust scanners. For most modern language features, this works and works well. You should think twice before introducing a feature that defeats a DFA-based scanner. The ones we've seen (e.g., insignificant blanks, non-reserved keywords) have not proven particularly useful or long lasting.

Of course, not everything fits into a regular language, which leads us to parsing.