Lexical Analyzer Scanner


Lexical Analyzer (Scanner)
ASU Textbook Chapter 3.1, 3.3, 3.4, 3.6, 3.7, 3.5
Tsan-sheng Hsu
tshsu@iis.sinica.edu.tw
http://www.iis.sinica.edu.tw/~tshsu

Main tasks
Read the input characters and produce as output a sequence of tokens to be used by the parser for syntax analysis.
  Tokens: terminal symbols in the grammar.
  Lexeme: a sequence of characters matched by the pattern associated with a token.
Example: for the input  pi = 3.1416 ;
  lexemes:  pi    =       3.1416     ;
  tokens:   ID    ASSIGN  FLOAT-LIT  SEMI-COL
Patterns:
  identifier (variable name): starts with a letter or underscore, followed by letters, digits, or underscores;
  floating-point number: starts with a string of digits, followed by a dot, and terminates with another string of digits.
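As a concrete sketch of what the scanner hands to the parser, the C fragment below pairs each token code with its lexeme for the pi = 3.1416 ; example; the enum names and struct layout are illustrative choices, not prescribed by the slides.

#include <stdio.h>

/* Token codes for the example above; dashes in the slide's token names
 * become underscores in C.  This layout is illustrative only. */
enum token { ID, ASSIGN, FLOAT_LIT, SEMI_COL };

struct token_info {
    enum token kind;      /* which token was recognized */
    const char *lexeme;   /* the matched character sequence */
};

int main(void) {
    /* tokens produced for the input "pi = 3.1416 ;" */
    struct token_info out[] = {
        { ID, "pi" }, { ASSIGN, "=" }, { FLOAT_LIT, "3.1416" }, { SEMI_COL, ";" }
    };
    for (int i = 0; i < 4; i++)
        printf("token %d, lexeme \"%s\"\n", out[i].kind, out[i].lexeme);
    return 0;
}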

Strings
Definitions and operations:
  alphabet: a finite set of symbols (characters);
  string: a finite sequence of symbols from the alphabet;
  |S|: the length of a string S;
  empty string: ε;
  xy: the concatenation of strings x and y; εx = xε = x;
  exponentiation: s^0 = ε; s^i = s^(i-1) s for i > 0.

Parts of a string
Parts of a string, using the example string necessary:
  prefix: delete zero or more trailing characters; e.g., nece;
  suffix: delete zero or more leading characters; e.g., ssary;
  substring: delete a prefix and a suffix; e.g., ssa;
  subsequence: delete zero or more not necessarily contiguous symbols; e.g., ncsay.
A proper prefix, suffix, substring, or subsequence is one that is not equal to the original string.

Language
Language: any set of strings over an alphabet.
Operations on languages:
  union: L ∪ M = {s | s ∈ L or s ∈ M};
  concatenation: LM = {st | s ∈ L and t ∈ M}; L^0 = {ε};
  Kleene closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ...;
  positive closure: L+ = L^1 ∪ L^2 ∪ ...;
  note L* = L+ ∪ {ε}.
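A quick worked example: let L = {a, b} and M = {c}. Then L ∪ M = {a, b, c}, LM = {ac, bc}, L^2 = {aa, ab, ba, bb}, L* = {ε, a, b, aa, ab, ba, bb, aaa, ...}, and L+ = L* minus {ε} (the two closures differ only in ε here because ε ∉ L).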

Regular expressions
A regular expression r denotes a language L(r), also called a regular set.
Operations on regular expressions:
  regular expression    language denoted
  ∅                     {}   (the empty set)
  ε                     {ε}  (the set containing the empty string)
  a                     {a}, where a is a legal symbol
  r | s                 L(r) ∪ L(s)    (union)
  rs                    L(r)L(s)       (concatenation)
  r*                    (L(r))*        (Kleene closure)
Examples:
  a | b           denotes {a, b}
  (a | b)(a | b)  denotes {aa, ab, ba, bb}
  a*              denotes {ε, a, aa, aaa, ...}
  a | a*b         denotes {a, b, ab, aab, ...}
  C identifier:   (A | ... | Z | a | ... | z | _) ((A | ... | Z | a | ... | z | _) | (0 | 1 | ... | 9))*

Regular definitions
For simplicity, give names to regular expressions; format: name → regular expression.
  example 1: digit → 0 | 1 | 2 | ... | 9
  example 2: letter → a | b | ... | z | A | B | ... | Z
Notational shorthands:
  r+ (one or more): r* ≡ r+ | ε and r+ ≡ r r*
  r? ≡ r | ε
  [abc] ≡ a | b | c
  [a-z] ≡ a | b | ... | z
Example: C variable name: [A-Za-z_][A-Za-z0-9_]*

Non-regular sets
Balanced or nested constructs:
  example: if ... then ... else ...;
  recognized by context-free grammars.
Matching strings {wcw}, where w is a string of a's and b's and c is a legal symbol:
  cannot be recognized even using context-free grammars.
Remark: anything that needs to remember a non-constant amount of information about what happened in the past cannot be recognized by regular expressions.

Finite state automata (FA)
An FA is a mechanism used to recognize tokens specified by a regular expression.
Definition:
  a finite set of states, i.e., vertices;
  a set of transitions labeled by characters, i.e., labeled directed edges;
  a starting state, i.e., a vertex with an incoming edge marked with start;
  a set of final (accepting) states, i.e., vertices drawn as concentric circles.
Example: transition graph for the regular expression (abc+)+:
  states 0, 1, 2, 3; starting state 0; accepting state 3;
  edges 0 -a-> 1, 1 -b-> 2, 2 -c-> 3, 3 -c-> 3, and 3 -a-> 1.

Transition graph and table for FA
Transition graph: the graph for (abc+)+ shown on the previous slide.
Transition table:
          current state
            0    1    2    3
  a         1    -    -    1
  b         -    2    -    -
  c         -    -    3    3
Rows are input symbols, columns are current states, and entries are the resulting states.
Along with the table, a starting state and a set of accepting states are also given.
This is also called a GOTO table.

Types of FA's
Deterministic FA (DFA): has a unique next state for each transition and contains no ε-transitions, i.e., transitions that take ε as the input symbol.
Nondeterministic FA (NFA): either has more than one next state for some transition, or contains ε-transitions.
Example: aa | bb.
  (Figure: transition graph with states 0 through 4; one branch 0 -a-> 1 -a-> 3, the other 0 -b-> 2 -b-> 4; accepting states 3 and 4.)

How to execute a DFA
Algorithm:
  s ← starting state
  while there are more input symbols and s is a legal state do
      s ← Table[s, input]
  end while
  if s ∈ accepting states then ACCEPT else REJECT
Example: input abccabc on the DFA for (abc+)+; the accepting path is
  0 -a-> 1 -b-> 2 -c-> 3 -c-> 3 -a-> 1 -b-> 2 -c-> 3.
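A minimal C sketch of this driver loop, using the GOTO table for (abc+)+ from the previous slide; encoding a missing transition as -1 is an assumption made for the sketch.

#include <stdio.h>

/* GOTO table for (abc+)+: rows are states 0..3, columns are 'a','b','c';
 * -1 marks a missing transition (an encoding chosen for this sketch). */
static const int table[4][3] = {
    {  1, -1, -1 },   /* state 0 */
    { -1,  2, -1 },   /* state 1 */
    { -1, -1,  3 },   /* state 2 */
    {  1, -1,  3 },   /* state 3 */
};
static const int accepting[4] = { 0, 0, 0, 1 };   /* only state 3 accepts */

static int run_dfa(const char *input) {
    int s = 0;                                    /* starting state */
    for (int i = 0; input[i] != '\0' && s >= 0; i++) {
        int sym = input[i] - 'a';                 /* map 'a','b','c' to 0,1,2 */
        s = (sym >= 0 && sym < 3) ? table[s][sym] : -1;
    }
    return s >= 0 && accepting[s];
}

int main(void) {
    printf("abccabc: %s\n", run_dfa("abccabc") ? "ACCEPT" : "REJECT");  /* ACCEPT */
    printf("abca:    %s\n", run_dfa("abca")    ? "ACCEPT" : "REJECT");  /* REJECT */
    return 0;
}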

How to execute an NFA (informally)
An NFA accepts an input string x if and only if there is some path in the transition graph from the starting state to some accepting state such that the edge labels along the path spell out x.
There may be more than one such path. (Note that a DFA has at most one.)
Example: regular expression (a|b)*abb, input aabb.
  NFA: states 0, 1, 2, 3; starting state 0; accepting state 3;
  transitions: 0 on a -> {0, 1}; 0 on b -> {0}; 1 on b -> {2}; 2 on b -> {3}.
  One path: 0 -a-> 0 -a-> 1 -b-> 2 -b-> 3, which ends in an accepting state, so aabb is accepted.
  Another path: 0 -a-> 0 -a-> 0 -b-> 0 -b-> 0, which does not end in an accepting state.
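A small C sketch that simulates this particular NFA by keeping the set of currently reachable states in a bit set; this NFA has no ε-transitions, so no ε-closure step is needed (the bit-set encoding is a choice made for the sketch).

#include <stdio.h>
#include <stdint.h>

/* NFA for (a|b)*abb: nfa[state][symbol] is the set of next states as a bit set.
 * Symbol 0 is 'a', symbol 1 is 'b'. */
static const uint32_t nfa[4][2] = {
    { (1u << 0) | (1u << 1), 1u << 0 },   /* state 0: a -> {0,1}, b -> {0} */
    { 0,                     1u << 2 },   /* state 1: b -> {2} */
    { 0,                     1u << 3 },   /* state 2: b -> {3} */
    { 0,                     0       },   /* state 3: accepting, no moves */
};

static int run_nfa(const char *input) {
    uint32_t S = 1u << 0;                         /* start in {0} */
    for (int i = 0; input[i] != '\0'; i++) {
        int sym = (input[i] == 'a') ? 0 : 1;
        uint32_t next = 0;
        for (int s = 0; s < 4; s++)               /* union of moves from every state in S */
            if (S & (1u << s)) next |= nfa[s][sym];
        S = next;
    }
    return (S & (1u << 3)) != 0;                  /* accept if state 3 is reachable */
}

int main(void) {
    printf("aabb: %s\n", run_nfa("aabb") ? "ACCEPT" : "REJECT");  /* ACCEPT */
    printf("aaba: %s\n", run_nfa("aaba") ? "ACCEPT" : "REJECT");  /* REJECT */
    return 0;
}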

From regular expressions to NFA's
Build the NFA by structural decomposition of the regular expression:
  atomic items: ∅, ε, and a single legal symbol;
  r | s: a new starting state with branches into the NFA for r and the NFA for s;
  rs: convert all accepting states of the NFA for r into non-accepting states and add transitions from them to the starting state of the NFA for s;
  r*: add transitions from the accepting states of the NFA for r back to its starting state.
(The original slide gives these constructions as figures.)

Example: (a|b)*abb
(Figure: the NFA constructed for (a|b)*abb, with states 0 through 12 and ε-transitions; the state sets listed on the following slides refer to this NFA.)
This construction introduces nondeterminism only through ε-transitions; it never produces more than one transition out of a state on the same input symbol.
It is possible to remove all ε-transitions from an NFA and replace them with multiple transitions per input symbol, and vice versa.

Construction theorems
Theorem #1:
  Any regular expression can be expressed by an NFA.
  Any NFA can be converted into a DFA.
  That is, any regular expression can be expressed by a DFA.
How to convert an NFA into a DFA:
  find the set of states that can be reached from an NFA state using ε-transitions;
  find the set of states that can be reached from an NFA state on an input symbol.
Theorem #2:
  Every DFA can be expressed as a regular expression, and every regular expression can be expressed as a DFA.
  Hence DFA's and regular expressions have the same expressive power.
What about the relative power of DFA's and NFA's?

Converting an NFA to a DFA
Definitions: let T be a set of NFA states and a an input symbol.
  ε-closure(T): the set of NFA states reachable from some state s ∈ T using ε-transitions only.
  move(T, a): the set of NFA states to which there is a transition on the input symbol a from some state s ∈ T.
  Both can be computed using standard graph algorithms.
  ε-closure(move(T, a)): the set of states reachable from a state in T on the input a.
Example: for the NFA for (a|b)*abb constructed above (states 0 through 12):
  ε-closure({0}) = {0, 1, 2, 4, 6, 7}, i.e., the set of all possible starting states;
  move({2, 7}, a) = {3, 8}.
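A C sketch of ε-closure and move over bit-set state sets. The four-state NFA encoded below is a made-up toy for illustration, not the thirteen-state NFA from the slides.

#include <stdio.h>
#include <stdint.h>

/* Toy NFA (an assumption of this sketch): 0 -eps-> 1, 0 -eps-> 2,
 * 1 -a-> 3, 2 -b-> 3.  State sets are bit sets over 4 states. */
#define NSTATES 4

static const uint32_t eps_edges[NSTATES] = {
    (1u << 1) | (1u << 2),   /* 0 -eps-> {1,2} */
    0, 0, 0
};
static const uint32_t sym_edges[NSTATES][2] = {  /* column 0 = 'a', 1 = 'b' */
    { 0, 0 }, { 1u << 3, 0 }, { 0, 1u << 3 }, { 0, 0 }
};

/* eps-closure: repeatedly add states reachable by eps edges until no change */
static uint32_t eps_closure(uint32_t T) {
    uint32_t closure = T, prev;
    do {
        prev = closure;
        for (int s = 0; s < NSTATES; s++)
            if (closure & (1u << s)) closure |= eps_edges[s];
    } while (closure != prev);
    return closure;
}

/* move: union of the sym-successors of every state in T */
static uint32_t move(uint32_t T, int sym) {
    uint32_t U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) U |= sym_edges[s][sym];
    return U;
}

int main(void) {
    uint32_t start = eps_closure(1u << 0);            /* {0,1,2} */
    uint32_t after_a = eps_closure(move(start, 0));   /* {3} */
    printf("eps-closure({0}) = 0x%x, then on 'a': 0x%x\n",
           (unsigned)start, (unsigned)after_a);
    return 0;
}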

Subset construction algorithm
In the converted DFA, each state represents a subset of NFA states; the DFA transition on input a from state T is T --a--> ε-closure(move(T, a)).
Subset construction algorithm:
  initially, there is one unmarked state, labeled ε-closure({s0}), where s0 is the NFA starting state;
  while there is an unmarked state with label T do
      mark the state labeled T
      for each input symbol a do
          U ← ε-closure(move(T, a))
          if U is a set of states never seen before then add an unmarked state labeled U
      end for
  end while
New accepting states: those that contain an original accepting state.
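A C sketch of the subset construction, run on the ε-free NFA for (a|b)*abb used in the simulation example above; because that NFA has no ε-transitions, ε-closure is the identity and is omitted, and the bit-set encoding and fixed table sizes are choices made for the sketch.

#include <stdio.h>
#include <stdint.h>

#define NSTATES 4
#define NSYMS   2                       /* 0 = 'a', 1 = 'b' */
#define MAXDFA  (1 << NSTATES)

/* eps-free NFA for (a|b)*abb, as before; state 3 is accepting */
static const uint32_t nfa[NSTATES][NSYMS] = {
    { (1u << 0) | (1u << 1), 1u << 0 },
    { 0,                     1u << 2 },
    { 0,                     1u << 3 },
    { 0,                     0       },
};

static uint32_t move(uint32_t T, int sym) {
    uint32_t U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) U |= nfa[s][sym];
    return U;
}

int main(void) {
    uint32_t dstates[MAXDFA];           /* label (set of NFA states) of each DFA state */
    int delta[MAXDFA][NSYMS];           /* DFA GOTO table */
    int ndfa = 0, marked = 0;

    dstates[ndfa++] = 1u << 0;          /* start state: {0} */
    while (marked < ndfa) {             /* "while there is an unmarked state T" */
        uint32_t T = dstates[marked];
        for (int sym = 0; sym < NSYMS; sym++) {
            uint32_t U = move(T, sym);  /* eps-closure(move(T, a)); eps-closure omitted */
            int j = -1;
            for (int k = 0; k < ndfa; k++)
                if (dstates[k] == U) { j = k; break; }
            if (j < 0) { dstates[ndfa] = U; j = ndfa++; }   /* a new, unmarked DFA state */
            delta[marked][sym] = j;
        }
        marked++;                       /* mark T */
    }
    for (int k = 0; k < ndfa; k++)
        printf("DFA state %d = NFA set 0x%x: a -> %d, b -> %d%s\n",
               k, (unsigned)dstates[k], delta[k][0], delta[k][1],
               (dstates[k] & (1u << 3)) ? "   (accepting)" : "");
    return 0;
}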

Example
(The NFA for (a|b)*abb, states 0 through 12, as above.)
First step of the subset construction:
  ε-closure({0}) = {0, 1, 2, 4, 6, 7}
  move({0, 1, 2, 4, 6, 7}, a) = {3, 8};  ε-closure({3, 8}) = {0, 1, 2, 3, 4, 6, 7, 8, 9}
  move({0, 1, 2, 4, 6, 7}, b) = {5};     ε-closure({5}) = {0, 1, 2, 4, 5, 6, 7}
(Figure: the start DFA state {0,1,2,4,6,7}, with an a-transition to {0,1,2,3,4,6,7,8,9} and a b-transition to {0,1,2,4,5,6,7}.)

Example (cont'd)
(The NFA for (a|b)*abb, states 0 through 12, as above.)
DFA states:
  A = {0, 1, 2, 4, 6, 7}
  B = {0, 1, 2, 3, 4, 6, 7, 8, 9}
  C = {0, 1, 2, 4, 5, 6, 7, 10, 11}
  D = {0, 1, 2, 4, 5, 6, 7}
  E = {0, 1, 2, 4, 5, 6, 7, 12}
Transition table:
        a    b
  A     B    D
  B     B    C
  C     B    E
  D     B    D
  E     B    D
(Figure: the resulting five-state DFA; A is the start state and E, which contains the NFA's accepting state, is the accepting state.)

Algorithm for executing an NFA
Algorithm (s0 is the starting state, F is the set of accepting states):
  S ← ε-closure({s0})
  while the next input symbol a is not EOF do
      S ← ε-closure(move(S, a))
  end while
  if S ∩ F ≠ ∅ then ACCEPT else REJECT
Execution time is O(r^2 s), where r is the number of NFA states and s is the length of the input:
  each ε-closure(T) takes O(r^2) time, assuming an adjacency-matrix representation and a linear-time hashing routine to remove duplicated states.
Space complexity is O(r^2 c) using a standard adjacency-matrix representation for graphs, where c is the size of the alphabet.
There may be slightly better algorithms.

Trade-off in executing NFA's
One can also convert the NFA to a DFA and then execute the equivalent DFA:
  running time: linear in the input size;
  space requirement: linear in the size of the DFA.
Catch: converting an r-state NFA may produce O(2^r) DFA states, and the conversion algorithm may also take O(2^r) time.
Time-space trade-off:
         space        time
  NFA    O(r^2 c)     O(r^2 s)
  DFA    O(2^r c)     O(s)
If memory is cheap or the program will be used many times, use the DFA approach; otherwise, use the NFA approach.

LEX
A UNIX utility.
An easy way to use regular expressions to do lexical analysis.
Converts your LEX program into an equivalent C program; depending on the implementation, it may use the NFA or the DFA algorithms.
  file.l     --(lex file.l)-->       lex.yy.c
  lex.yy.c   --(cc -ll lex.yy.c)-->  a.out    (may produce a .o file instead if there is no main())
  input      --(a.out)-->            output sequence of tokens
Implementations and libraries may differ slightly.

LEX formats
Source format:
  declarations (a set of regular definitions, i.e., names and their regular expressions)
  %%
  translation rules (actions to be taken when patterns are encountered)
  %%
  auxiliary procedures
Predefined global names:
  yyleng: the length of the current string
  yytext: the current string
  yylex(): the scanner routine

LEX formats (cont'd)
Declarations:
  C code between %{ and %}: variables; manifest constants, i.e., identifiers declared to represent constants.
  Regular definitions.
Translation rules:
  P1    {action_1}
  If a string matching the regular expression P1 is encountered, then action_1 is performed.
LEX internals: regular expressions are converted into an NFA and, if needed, into a DFA.

test.l: Declarations
%{
/* some initial C programs */
#define BEGINSYM      1
#define INTEGER       2
#define IDNAME        3
#define REAL          4
#define STRING        5
#define SEMICOLONSYM  6
#define ASSIGNSYM     7
%}
Digit   [0-9]
Letter  [a-zA-Z]
IntLit  {Digit}+
Id      {Letter}({Letter}|{Digit}|_)*

test.l: Rules
%%
[ \t\n]              {/* skip white spaces */}
[Bb][Ee][Gg][Ii][Nn] {return(BEGINSYM);}
{IntLit}             {return(INTEGER);}
{Id}                 {printf("var has %d characters, ", yyleng);
                      return(IDNAME);}
({IntLit}[.]{IntLit})([Ee][+-]?{IntLit})? {return(REAL);}
\"[^\"\n]*\"         {stripquotes(); return(STRING);}
";"                  {return(SEMICOLONSYM);}
":="                 {return(ASSIGNSYM);}
.                    {printf("error --- %s\n", yytext);}

test.l: Procedures
%%
/* some final C programs */
stripquotes()
{   /* handle the string within a quoted string */
    int frompos, topos = 0, numquotes = 2;
    for (frompos = 1; frompos < yyleng; frompos++) {
        yytext[topos++] = yytext[frompos];
    }
    yyleng -= numquotes;
    yytext[yyleng] = '\0';
}

void main() {
    int i;
    i = yylex();
    while (i > 0 && i < 8) {
        printf("<%s> is %d\n", yytext, i);
        i = yylex();
    }
}

Sample run
austin% lex test.l
austin% cc lex.yy.c -ll
austin% cat data
Begin 123.3 321.4E21 x := 365; "this is a string"
austin% a.out < data
<Begin> is 1
<123.3> is 4
<321.4E21> is 4
var has 1 characters, <x> is 3
<:=> is 7
<365> is 2
<;> is 6
<this is a string> is 5
austin%

More LEX formats
Special format requirement for multi-line actions:
  P1    {
            action_1
        }
  Note: the { and } must be indented.
LEX special characters (operators):
  \  [  ]  ^  -  ?  .  *  +  |  (  )  $  /  {  }  %  <  >
When there is any ambiguity in matching, prefer
  the longest possible match, and
  the earlier expression if all matches are of equal length.

LEX internals
LEX code:
  regular expression #1   {action #1}
  regular expression #2   {action #2}
  ...
The expressions are combined into one automaton: from a common starting state there is a branch into the automaton for regular expression #1 (whose accepting states trigger action #1), another into the automaton for regular expression #2 (triggering action #2), and so on.

LEX internals (cont'd)
How to find the longest possible match when there are many legal matches?
  When an accepting state is encountered, do not accept immediately.
  Push this accepting state and the current input position onto a stack and keep going until no more matches are possible.
  Then pop the stack, execute the action for the popped accepting state, and resume scanning from the popped input position.
How to find the earliest match when all matches are of equal length?
  Assign numbers 1, 2, ... to the accepting states in the order the expressions appear (from top to bottom).
  If the automaton is in multiple accepting states, execute the action associated with the lowest-numbered accepting state.
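A simplified C sketch of the longest-match idea: run the automaton, remember the most recent accepting position, and back up to it when no further match is possible. The slides describe a stack of accepting states; remembering only the last one is a simplification, and the single-pattern DFA below, which recognizes {Digit}+, is a made-up example.

#include <stdio.h>

/* Hand-coded DFA for {Digit}+: state 0 is the start, state 1 accepts. */
static int is_accepting(int s) { return s == 1; }
static int step(int s, char ch) {
    if ((s == 0 || s == 1) && ch >= '0' && ch <= '9') return 1;
    return -1;                                   /* dead: no transition */
}

/* scan one token starting at input[pos]; return the length of the longest match */
static int scan_longest(const char *input, int pos) {
    int s = 0, last_accept_len = 0;
    for (int i = pos; input[i] != '\0'; i++) {
        s = step(s, input[i]);
        if (s < 0) break;                        /* no further match is possible */
        if (is_accepting(s)) last_accept_len = i - pos + 1;   /* remember it */
    }
    return last_accept_len;                      /* 0 means no match at all */
}

int main(void) {
    const char *input = "365;x";
    int len = scan_longest(input, 0);
    printf("longest match at 0 has length %d: \"%.*s\"\n", len, len, input);
    return 0;
}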

Practical considerations
Keyword vs. reserved word.
Keyword: a word that has a well-defined meaning in a certain context.
  Examples: FORTRAN, PL/1, ...
  In such languages a statement like
      if if then else = then;
  is legal: the second if, else, and the final then are identifiers.
  This makes the compiler work harder!
Reserved word: a word that, regardless of context, cannot be used for other purposes.
  Examples: COBOL, ALGOL, PASCAL, C, ADA, ...
  The task of the compiler is simpler: reserved words cannot be used as identifiers.
  However, listing every reserved word as a separate pattern is tedious for the scanner and also makes the scanner large.
  Solution: treat them as identifiers, and use a table to check whether a scanned identifier is a reserved word.
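A C sketch of the "scan as an identifier, then check a table" approach for reserved words; the word list and token codes below are illustrative only.

#include <stdio.h>
#include <string.h>

static const char *reserved[] = { "begin", "end", "if", "then", "else", "while" };
enum { TOK_IDENT = 100, TOK_RESERVED_BASE = 0 };

/* return a token code: the reserved word's index, or TOK_IDENT otherwise */
static int classify(const char *lexeme) {
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i]) == 0)
            return TOK_RESERVED_BASE + (int)i;   /* in practice: a hash table lookup */
    return TOK_IDENT;
}

int main(void) {
    printf("\"then\"  -> %d\n", classify("then"));   /* reserved word */
    printf("\"total\" -> %d\n", classify("total"));  /* ordinary identifier */
    return 0;
}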

Practical considerations (cont'd)
Multi-character lookahead: how many more characters must be read ahead in order to decide which pattern to match?
  FORTRAN: look ahead until a difference is seen, without counting blanks.
    DO 10 I = 1, 15   is a loop statement.
    DO 10 I = 1.15    is an assignment statement for the variable DO10I.
  PASCAL: look ahead 2 characters, with 2 or more blanks treated as one blank.
    10..100: we need to look 2 characters ahead to decide that the 10 is not part of a real number.
LEX lookahead operator /:
  r1/r2: match r1 only if it is followed by r2; note that r2 is not part of the match.
  This operator can be used to cope with multi-character lookahead.
  How is this implemented in LEX?
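A small C sketch of the Pascal two-character lookahead case: after scanning the digits of 10, peeking one more character past a '.' decides between a real number (10.5) and an integer followed by the subrange operator .. (10..100). The peek/advance helpers and token names are illustrative only, not part of any real scanner API.

#include <stdio.h>

static const char *src;   /* input buffer */
static int pos;           /* current position */

static int peek(int k)   { return src[pos + k] ? src[pos + k] : EOF; }  /* look ahead k chars */
static int advance(void) { return src[pos] ? src[pos++] : EOF; }

static void scan_number(void) {
    int start = pos;
    while (peek(0) >= '0' && peek(0) <= '9') advance();
    if (peek(0) == '.' && peek(1) >= '0' && peek(1) <= '9') {   /* real number: 10.5 */
        advance();
        while (peek(0) >= '0' && peek(0) <= '9') advance();
        printf("REAL-LIT  \"%.*s\"\n", pos - start, src + start);
    } else {                                                    /* integer: a following ".." is left for the next token */
        printf("INT-LIT   \"%.*s\"\n", pos - start, src + start);
    }
}

int main(void) {
    src = "10..100"; pos = 0;
    scan_number();                          /* INT-LIT "10", ".." is not consumed */
    printf("next char: '%c'\n", peek(0));
    src = "10.5"; pos = 0;
    scan_number();                          /* REAL-LIT "10.5" */
    return 0;
}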