CSc 453 Lexical Analysis (Scanning)

Saumya Debray
The University of Arizona, Tucson

Overview

[Figure: source program -> lexical analyzer (scanner) -> tokens -> syntax analyzer (parser); a symbol table manager is shown alongside]

Main task: to read input characters and group them into tokens.

Secondary tasks:
- Skip comments and whitespace.
- Correlate error messages with the source program (e.g., line number of error).

Implementing Lexical Analyzers

Different approaches:
- Using a scanner generator, e.g., lex or flex. This automatically generates a lexical analyzer from a high-level description of the tokens. (Easiest to implement; least efficient.)
- Programming it in a language such as C, using the I/O facilities of the language. (Intermediate in ease and efficiency.)
- Writing it in assembly language and explicitly managing the input. (Hardest to implement, but most efficient.)

Lexical Analysis: Terminology

- token: a name for a set of input strings with related structure.
  Example: identifier, integer constant
- pattern: a rule describing the set of strings associated with a token.
  Example: a letter followed by zero or more letters, digits, or underscores.
- lexeme: the actual input string that matches a pattern.
  Example: count

Examples

Input: count = 123

Tokens:
  Token            Rule                       Lexeme
  identifier       letter followed by ...     count
  assg_op          =                          =
  integer_const    digit followed by ...      123

Attributes for Tokens

If more than one lexeme can match the pattern for a token, the scanner must indicate the actual lexeme that matched. This information is given using an attribute associated with the token.

Example: The program statement count = 123 yields the following token-attribute pairs:
  <identifier, pointer to the string "count">
  <assg_op, >
  <integer_const, the integer value 123>

Specifying Tokens: Regular Expressions

Terminology:
- alphabet: a finite set of symbols
- string: a finite sequence of alphabet symbols
- language: a (finite or infinite) set of strings

Regular operations on languages:
- Union: R ∪ S = { x | x ∈ R or x ∈ S }
- Concatenation: RS = { xy | x ∈ R and y ∈ S }
- Kleene closure: R* = R concatenated with itself 0 or more times
                     = {ε} ∪ R ∪ RR ∪ RRR ∪ ...
                     = strings obtained by concatenating a finite number of strings from the set R.
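A token-attribute pair can be sketched in C as a small tagged struct. This is only an illustration of the idea above: the type and function names (Token, TokenKind, make_identifier, and so on) are assumptions, not names used in the course.

```c
#include <stddef.h>

/* Token kinds from the running example "count = 123" (names are illustrative). */
typedef enum { TOK_IDENTIFIER, TOK_ASSG_OP, TOK_INTEGER_CONST } TokenKind;

/* A token together with its attribute. */
typedef struct {
    TokenKind kind;
    union {
        const char *lexeme; /* identifier: pointer to the string */
        int value;          /* integer_const: the integer value */
    } attr;                 /* assg_op carries no attribute */
} Token;

Token make_identifier(const char *lexeme) {
    Token t;
    t.kind = TOK_IDENTIFIER;
    t.attr.lexeme = lexeme;
    return t;
}

Token make_integer(int value) {
    Token t;
    t.kind = TOK_INTEGER_CONST;
    t.attr.value = value;
    return t;
}

Token make_assg_op(void) {
    Token t;
    t.kind = TOK_ASSG_OP;
    t.attr.lexeme = NULL;   /* no attribute needed: only one lexeme matches */
    return t;
}
```

The parser would inspect kind first and read the attribute only for kinds that have one.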

Regular Expressions

A pattern notation for describing certain kinds of sets of strings. Given an alphabet Σ:
- ε is a regular exp. (denotes the language {ε})
- for each a ∈ Σ, a is a regular exp. (denotes the language {a})
- if r and s are regular exps. denoting L(r) and L(s) respectively, then so are:
    (r) | (s)   (denotes the language L(r) ∪ L(s))
    (r)(s)      (denotes the language L(r)L(s))
    (r)*        (denotes the language L(r)*)

Common Extensions to r.e. Notation

- One or more repetitions of r: r+
- A range of characters: [a-zA-Z], [0-9]
- An optional expression: r?
- Any single character: .
- Giving names to regular expressions, e.g.:
    letter = [a-zA-Z_]
    digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    ident = letter ( letter | digit )*
    integer_const = digit+
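The named patterns ident and integer_const are simple enough to check by hand. The sketch below assumes ASCII input in the default "C" locale; the function names (is_letter, match_ident, match_integer_const) are made up for illustration. Each matcher returns the length of the longest matching prefix, or 0 if there is none.

```c
#include <ctype.h>
#include <stddef.h>

/* letter = [a-zA-Z_] */
int is_letter(int c) {
    return isalpha((unsigned char)c) || c == '_';
}

/* digit = [0-9] */
int is_digit(int c) {
    return isdigit((unsigned char)c) != 0;
}

/* ident = letter ( letter | digit )*
   Returns the length of the longest prefix of s that is an ident, or 0. */
size_t match_ident(const char *s) {
    size_t n;
    if (!is_letter(s[0]))
        return 0;
    n = 1;
    while (is_letter(s[n]) || is_digit(s[n]))
        n++;
    return n;
}

/* integer_const = digit+
   Returns the length of the longest prefix of s that is an integer_const, or 0. */
size_t match_integer_const(const char *s) {
    size_t n = 0;
    while (is_digit(s[n]))
        n++;
    return n;
}
```

For instance, match_ident applied to "count = 123" stops at the space after "count", since a space is neither a letter nor a digit.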

Recognizing Tokens: Finite Automata

A finite automaton is a 5-tuple (Q, Σ, T, q0, F), where:
- Σ is a finite alphabet;
- Q is a finite set of states;
- T: Q × Σ → Q is the transition function;
- q0 ∈ Q is the initial state; and
- F ⊆ Q is a set of final states.

Finite Automata: An Example

A (deterministic) finite automaton (DFA) to match C-style comments:

[Figure: state diagram of the DFA]
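Since the diagram is not available here, the following is one possible DFA for C-style comments, written as a transition function in C. The state names and the explicit trap state are assumptions chosen for this sketch; the slide's own automaton may number its states differently.

```c
/* Hypothetical state names; S_DEAD is a trap state that makes T total. */
enum { S_START, S_SLASH, S_BODY, S_STAR, S_DONE, S_DEAD };

/* Transition function T(state, c) for the comment DFA. */
int comment_step(int state, char c) {
    switch (state) {
    case S_START: return c == '/' ? S_SLASH : S_DEAD;  /* expect opening '/' */
    case S_SLASH: return c == '*' ? S_BODY : S_DEAD;   /* expect '*' after '/' */
    case S_BODY:  return c == '*' ? S_STAR : S_BODY;   /* inside the comment */
    case S_STAR:                                       /* saw '*' inside the comment */
        if (c == '/') return S_DONE;
        return c == '*' ? S_STAR : S_BODY;
    default:      return S_DEAD;  /* S_DONE has no outgoing transitions */
    }
}

/* Returns 1 if the whole string s is exactly one C-style comment, else 0. */
int is_c_comment(const char *s) {
    int state = S_START;
    for (; *s != '\0'; s++)
        state = comment_step(state, *s);
    return state == S_DONE;   /* F = { S_DONE } */
}
```

The S_STAR state is what lets a run of asterisks, as in "/***/", still close the comment correctly.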

Formalizing Automata Behavior

To formalize automata behavior, we extend the transition function to deal with strings:
  Ŧ : Q × Σ* → Q
  Ŧ(q, ε) = q
  Ŧ(q, aw) = Ŧ(r, w) where r = T(q, a)

The language accepted by an automaton M is
  L(M) = { w | Ŧ(q0, w) ∈ F }.

A language L is regular if it is accepted by some finite automaton.

Finite Automata and Lexical Analysis

The tokens of a language are specified using regular expressions. A scanner is a big DFA, essentially the aggregate of the automata for the individual tokens.

Issues:
- What does the scanner automaton look like?
- How much should we match? (When do we stop?)
- What do we do when a match is found?
- Buffer management (for efficiency reasons).
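The recursive definition of the extended transition function Ŧ given earlier translates almost line for line into C. The DFA used here (over {a, b}, accepting strings with an even number of a's) is a made-up example, and the names T_hat and accepts are assumptions for this sketch.

```c
/* Hypothetical DFA over {a, b}: accepts strings with an even number of 'a's.
   Q_DEAD is a trap state for symbols outside the alphabet. */
enum { Q_EVEN = 0, Q_ODD = 1, Q_DEAD = 2 };

/* The transition function T : Q x Sigma -> Q. */
int T(int q, char a) {
    if (q == Q_DEAD) return Q_DEAD;
    if (a == 'a') return q == Q_EVEN ? Q_ODD : Q_EVEN;
    if (a == 'b') return q;
    return Q_DEAD;
}

/* The extended transition function:
     T_hat(q, epsilon) = q
     T_hat(q, aw)      = T_hat(r, w) where r = T(q, a) */
int T_hat(int q, const char *w) {
    if (*w == '\0')
        return q;
    return T_hat(T(q, *w), w + 1);
}

/* w is in L(M) iff T_hat(q0, w) is in F, with q0 = Q_EVEN and F = { Q_EVEN }. */
int accepts(const char *w) {
    return T_hat(Q_EVEN, w) == Q_EVEN;
}
```

A production scanner would use a loop instead of recursion, but the recursive form mirrors the definition directly.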

Structure of a Scanner Automaton

[Figure: structure of a scanner automaton]

How much should we match?

In general, find the longest match possible. E.g., on input 123.45, match this as num_const(123.45) rather than num_const(123), ".", num_const(45).
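A common way to get the longest match is to keep scanning while transitions are possible and remember the last position at which the automaton was in a final state. The sketch below assumes the pattern for num_const is digit+ ( "." digit+ )?, which the slides do not spell out; the function name is also an assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the longest prefix of s that is a num_const,
   assuming num_const = digit+ ( '.' digit+ )?  (an assumed pattern). */
size_t longest_num_const(const char *s) {
    size_t i = 0;
    size_t last_accept;

    while (isdigit((unsigned char)s[i]))
        i++;
    if (i == 0)
        return 0;            /* no leading digit: no match at all */
    last_accept = i;         /* digit+ alone is already a match */

    if (s[i] == '.') {
        size_t j = i + 1;
        while (isdigit((unsigned char)s[j]))
            j++;
        if (j > i + 1)       /* at least one digit followed the '.' */
            last_accept = j; /* extend the match over the fraction */
    }
    return last_accept;      /* fall back to the last accepting position */
}
```

On "123.45" the whole string is consumed; on "123." the scanner backs up to the end of "123" because the "." did not lead to a final state.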

Input Buffering

Scanner performance is crucial:
- This is the only part of the compiler that examines the entire input program one character at a time.
- Disk input can be slow.
- The scanner accounts for ~25-30% of total compile time.

We need lookahead to determine when a match has been found. Scanners use double-buffering to minimize the overheads associated with this.

Buffer Pairs

- Use two N-byte buffers (N = size of a disk block; typically, N = 1024 or 4096).
- Read N bytes into one half of the buffer each time. If the input has fewer than N bytes, put a special EOF marker in the buffer.
- When one buffer has been processed, read N bytes into the other buffer ("circular buffers").

Buffer pairs (cont'd)

Code:
    if (fwd at end of first half)
        reload second half;
        set fwd to point to beginning of second half;
    else if (fwd at end of second half)
        reload first half;
        set fwd to point to beginning of first half;
    else
        fwd++;

This takes two tests for each advance of the fwd pointer.

Buffer pairs: Sentinels

Objective: Optimize the common case by reducing the number of tests to one per advance of fwd.

Idea: Extend each buffer half to hold a sentinel at the end.
- This is a special character that cannot occur in a program (e.g., EOF).
- It signals the need for some special action (fill the other buffer half, or terminate processing).
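A buffer pair with sentinels might look roughly like the following in C. Several things here are assumptions made to keep the sketch self-contained: the input comes from an in-memory string rather than a disk file, N is tiny so that reloads actually happen, the sentinel is the NUL character, and the names Scanner, reload, scanner_init, and next_char are invented.

```c
#include <stddef.h>
#include <string.h>

#define N 4            /* tiny half size for illustration; real scanners use e.g. 4096 */
#define SENTINEL '\0'  /* assumed never to occur inside the input */

typedef struct {
    char buf[2 * (N + 1)]; /* two halves, each followed by one sentinel slot */
    char *fwd;             /* the forward pointer */
    const char *src;       /* stand-in for the input file */
    size_t remaining;      /* bytes of src not yet loaded */
} Scanner;

/* Load up to N bytes into the half starting at `half` and terminate it
   with a sentinel (at offset N if the half is full, earlier at end of input). */
static void reload(Scanner *sc, char *half) {
    size_t n = sc->remaining < N ? sc->remaining : N;
    memcpy(half, sc->src, n);
    half[n] = SENTINEL;
    sc->src += n;
    sc->remaining -= n;
}

void scanner_init(Scanner *sc, const char *input) {
    sc->src = input;
    sc->remaining = strlen(input);
    reload(sc, sc->buf);
    sc->fwd = sc->buf;
}

/* Returns the next input character, or -1 at end of input.
   The common case performs a single test (c == SENTINEL). */
int next_char(Scanner *sc) {
    char c = *sc->fwd;
    if (c == SENTINEL) {                              /* special processing needed */
        if (sc->fwd == sc->buf + N) {                 /* end of first half */
            reload(sc, sc->buf + N + 1);
            sc->fwd = sc->buf + N + 1;
        } else if (sc->fwd == sc->buf + 2 * N + 1) {  /* end of second half */
            reload(sc, sc->buf);
            sc->fwd = sc->buf;
        } else {
            return -1;                                /* sentinel inside a half: end of input */
        }
        c = *sc->fwd;
        if (c == SENTINEL)
            return -1;                                /* the freshly loaded half is empty */
    }
    sc->fwd++;
    return (unsigned char)c;
}
```

This version tests the current character before advancing rather than after, which is a small departure from the usual presentation but keeps the same one-test-per-character property. It also omits the lexeme-start pointer that a full scanner would maintain.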

Buffer pairs with sentinels (cont'd)

Code:
    fwd++;
    if ( *fwd == EOF ) { /* special processing needed */
        if (fwd at end of first half)
            ...
        else if (fwd at end of second half)
            ...
        else /* end of input */
            terminate processing.
    }

The common case now needs just a single test per character.

Handling Reserved Words

1. Hard-wire them directly into the scanner automaton:
   - harder to modify;
   - increases the size and complexity of the automaton;
   - performance benefits unclear (fewer tests, but cache effects due to larger code size).
2. Fold them into the identifier case, then look up a keyword table:
   - simpler, smaller code;
   - table lookup cost can be mitigated using perfect hashing.
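The second approach to reserved words (scan as an identifier, then consult a keyword table) can be sketched as below. The keyword set, the token names, and the function classify_word are assumptions for illustration, and a plain linear search stands in for the perfect hashing mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for a small keyword set. */
typedef enum { TOK_IDENT, TOK_IF, TOK_ELSE, TOK_WHILE, TOK_RETURN } TokKind;

/* The keyword table; a real scanner might index this with a perfect hash. */
static const struct {
    const char *word;
    TokKind kind;
} keywords[] = {
    { "else",   TOK_ELSE   },
    { "if",     TOK_IF     },
    { "return", TOK_RETURN },
    { "while",  TOK_WHILE  },
};

/* Called after the identifier automaton has matched `lexeme`:
   returns the keyword's token kind, or TOK_IDENT if it is not reserved. */
TokKind classify_word(const char *lexeme) {
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(lexeme, keywords[i].word) == 0)
            return keywords[i].kind;
    }
    return TOK_IDENT;
}
```

Adding a reserved word then means adding one table row, with no change to the automaton.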

Implementing Finite Automata 1

Encoded as program code:
- each state corresponds to a (labeled) code fragment;
- state transitions are represented as control transfers.

E.g.:
    while ( TRUE ) {
        ...
      state_k:
        ch = NextChar(); /* buffer mgt happens here */
        switch (ch) {
            case ... : goto ...; /* state transition */
            ...
        }
        ...
      state_m: /* final state */
        copy lexeme to where parser can get at it;
        return token_type;
        ...
    }

Direct-Coded Automaton: Example

    int scanner()
    {
        char ch;
        while (TRUE) {
            ch = NextChar( );
          state_1:
            switch (ch) { /* initial state */
                case 'a' : goto state_2;
                case 'b' : goto state_3;
                default : Error();
            }
          state_2:
            ...
          state_3:
            switch (ch) {
                case 'a' : goto state_2;
                default : return SUCCESS;
            }
        } /* while */
    }
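For reference, here is a compilable variant of a direct-coded automaton in the same style. It is a sketch under assumptions: input is read from a string instead of NextChar(), every state reads its own next character, errors are reported by returning FAILURE, and the transitions of state_2 (elided on the slide) are guessed to be a -> state_2, b -> state_3.

```c
#define SUCCESS 1
#define FAILURE 0

/* Direct-coded automaton: each state is a labeled code fragment and
   each transition is a goto. `input` stands in for the input buffer. */
int scanner(const char *input) {
    const char *p = input;
    int ch;

    /* state_1: initial state */
    ch = *p++;
    switch (ch) {
    case 'a': goto state_2;
    case 'b': goto state_3;
    default:  return FAILURE;   /* includes end of input */
    }

state_2:                        /* assumed transitions: elided in the slide */
    ch = *p++;
    switch (ch) {
    case 'a': goto state_2;
    case 'b': goto state_3;
    default:  return FAILURE;
    }

state_3:
    ch = *p++;
    switch (ch) {
    case 'a': goto state_2;
    default:  return SUCCESS;   /* no transition enabled: match found */
    }
}
```

Every path returns as soon as it sees the terminating NUL, so the pointer never runs past the end of the string.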

Implementing Finite Automata 2

Table-driven automata (e.g., lex, flex):
- Use a table to encode transitions:
      next_state = T(curr_state, next_char);
- Use one bit in the state no. to indicate whether it's a final (or error) state. If so, consult a separate table for what action to take.

[Figure: transition table T, indexed by current state and next input character]

Table-Driven Automaton: Example

    #define isFinal(s) ((s) < 0)

    int scanner()
    {
        int ch; /* int rather than char, so that EOF can be represented */
        int currState = 1;
        while (TRUE) {
            ch = NextChar( );
            if (ch == EOF) return 0; /* fail */
            currState = T[currState][ch];
            if (isFinal(currState)) {
                return 1; /* success */
            }
        } /* while */
    }

Transition table T:

    state | input a | input b
    ------+---------+--------
      1   |    2    |    3
      2   |    2    |    3
      3   |    2    |   -1
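A self-contained version of the table-driven scanner could look like this. It assumes the flattened table values (2 3 2 3 2 -1) are read row by row for states 1 to 3, maps the characters 'a' and 'b' to columns 0 and 1, reads from a string instead of NextChar(), and treats any other character as a failure; all of these are choices made for the sketch.

```c
#define NUM_STATES 4            /* states 1..3; row 0 is unused */
#define isFinal(s) ((s) < 0)    /* negative state numbers mark final states */

/* Transition table: rows are states, column 0 is input 'a', column 1 is input 'b'.
   Values assume the slide's table is read row by row. */
static const int T[NUM_STATES][2] = {
    { 0,  0 },   /* unused */
    { 2,  3 },   /* state 1 */
    { 2,  3 },   /* state 2 */
    { 2, -1 },   /* state 3 */
};

/* Returns 1 on success (a final state was reached), 0 on failure. */
int scanner(const char *input) {
    int currState = 1;
    for (; *input != '\0'; input++) {
        int col;
        if (*input == 'a')
            col = 0;
        else if (*input == 'b')
            col = 1;
        else
            return 0;                       /* assumed: symbols outside {a, b} fail */
        currState = T[currState][col];      /* one table lookup per character */
        if (isFinal(currState))
            return 1;                       /* success */
    }
    return 0;                               /* input exhausted first: fail */
}
```

Compared with the direct-coded style, the control logic is a fixed loop and only the table changes from one automaton to another, which is why generators such as lex and flex favor this form.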

What do we do on finding a match?

A match is found when:
- The current automaton state is a final state; and
- No transition is enabled on the next input character.

Actions on finding a match:
- if appropriate, copy lexeme (or other token attribute) to where the parser can access it;
- save any necessary scanner state so that scanning can subsequently resume at the right place;
- return a value indicating the token found.
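The three actions listed above can be seen together in a small next_token routine. This is a simplified sketch: the Lexer struct, the token names, the fixed 64-byte lexeme buffer, and the handling of unrecognized characters are all assumptions, and only identifiers and integer constants are recognized.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_EOF, TOK_IDENT, TOK_INT, TOK_ERROR } Tok;

typedef struct {
    const char *pos;   /* saved scanner state: where the next call resumes */
    char lexeme[64];   /* where the parser can read the matched lexeme */
} Lexer;

Tok next_token(Lexer *lx) {
    const char *start;
    const char *p;
    size_t len;
    Tok kind;

    while (isspace((unsigned char)*lx->pos))   /* skip whitespace */
        lx->pos++;
    start = p = lx->pos;
    if (*p == '\0')
        return TOK_EOF;

    /* Advance while a transition is enabled; stop at the first character
       on which none is, which is when the match is found. */
    if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        kind = TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        kind = TOK_INT;
    } else {
        p++;                 /* consume one character so scanning can continue */
        kind = TOK_ERROR;
    }

    /* Action 1: copy the lexeme to where the parser can access it. */
    len = (size_t)(p - start);
    if (len >= sizeof lx->lexeme)
        len = sizeof lx->lexeme - 1;   /* truncate overly long lexemes */
    memcpy(lx->lexeme, start, len);
    lx->lexeme[len] = '\0';

    /* Action 2: save scanner state so the next call resumes in the right place. */
    lx->pos = p;

    /* Action 3: return a value indicating the token found. */
    return kind;
}
```

A caller sets pos to the start of the input and calls next_token repeatedly until it returns TOK_EOF.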