Lexical Analysis
Readings: Chapter 1, Section 1.2.1; Chapter 3, Sections 3.1, 3.3, 3.4, 3.5; JFlex Manual

Inside the Compiler: Front End

- The lexical analyzer (a.k.a. scanner) converts ASCII or Unicode input to a stream of tokens.
- It provides input to the syntax analyzer (a.k.a. parser), which creates a parse tree from the token stream. Usually the parser calls the scanner: getNextToken().
- Other scanner functionality:
  - Removes comments: e.g. /* */ and //
  - Removes whitespace: e.g., space, newline, tab
  - May add identifiers to the symbol table
  - May maintain information about source positions (e.g., file name, line number, character number) to allow more meaningful error messages
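The parser-driven interaction described above can be sketched as follows. This is purely illustrative: the Token, Kind, and getNextToken names are assumptions, not any particular library's API.

```java
// Minimal sketch of parser-driven scanning (all names are illustrative).
import java.util.Iterator;
import java.util.List;

public class FrontEndSketch {
    // Token kinds the scanner can produce.
    enum Kind { ID, NUMBER, EOF }

    record Token(Kind kind, String lexeme, int line) {}

    // A trivial scanner over pre-made tokens, standing in for a real one.
    static class Scanner {
        private final Iterator<Token> it;
        Scanner(List<Token> tokens) { this.it = tokens.iterator(); }
        Token getNextToken() {
            return it.hasNext() ? it.next() : new Token(Kind.EOF, "", -1);
        }
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner(List.of(
            new Token(Kind.ID, "x", 1), new Token(Kind.NUMBER, "42", 1)));
        // The "parser" pulls tokens one at a time until EOF.
        Token t;
        while ((t = sc.getNextToken()).kind() != Kind.EOF) {
            System.out.println(t.kind() + " \"" + t.lexeme() + "\"");
        }
    }
}
```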

Basic Definitions

- Token: a token name and an optional attribute value.
  - E.g. token name if, no attribute: the if keyword
  - E.g. token name relop (relational operator), attribute in { LT, LE, EQ, NE, GT, GE }: represents <, <=, =, <>, >, >=
- The token name is an abstract symbol that is a terminal symbol for the grammar in the parser.
- Each token has a pattern: e.g. id has the pattern "letter followed by zero or more letters or digits".
- Lexeme: a sequence of input characters (ASCII or Unicode) that is an instance of the token: e.g. the characters getprice for token id.
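The token/attribute pairing can be modeled directly. A minimal sketch, in which the Token record and Relop enum are illustrative, not part of any real scanner API:

```java
// Illustrative model of tokens with optional attributes.
public class TokenSketch {
    enum Relop { LT, LE, EQ, NE, GT, GE }

    // A token: an abstract name plus an optional attribute value.
    record Token(String name, Object attribute) {
        @Override public String toString() {
            return attribute == null ? name : "(" + name + "," + attribute + ")";
        }
    }

    public static void main(String[] args) {
        Token ifTok = new Token("if", null);         // keyword: no attribute
        Token lessEq = new Token("relop", Relop.LE); // lexeme "<=" -> (relop,LE)
        Token id = new Token("id", "getprice");      // attribute: symbol-table key
        System.out.println(ifTok + " " + lessEq + " " + id);
    }
}
```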

Typical Categories of Tokens

- One token per reserved keyword; no attribute.
- One token per operator or per operator group.
- One token id for all identifiers; the attribute is a pointer to an entry in the symbol table.
  - Names of variables, functions, user-defined types, etc.
  - The symbol table has the lexeme & the source position where it was seen.
- One token for each type of constant; the attribute is the actual constant value.
  - E.g. (int_const,5) or (string_const,"alice")
- One token per punctuation symbol; no attribute.
  - E.g. left_parenthesis, comma, semicolon

Specifying Patterns for Lexemes

Standard terminology:
- Alphabet: a finite set of symbols (e.g., ASCII or Unicode)
- String over an alphabet: a finite sequence of symbols
- Language: a countable set of strings over an alphabet

Operations on languages:
- Union: L ∪ M = all strings in L or in M
- Concatenation: LM = all ab where a is in L and b is in M
- Powers: L^0 = { ε } and L^i = L^(i-1) L
- Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ...; positive closure: the same without L^0

Regular expressions: a notation to express languages constructed with the help of such operations.
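For finite languages, these operations can be computed directly. A sketch; since true closure is infinite, the closure here is truncated at a maximum string length:

```java
// Finite-language operations: union, concatenation, and length-bounded closure.
import java.util.HashSet;
import java.util.Set;

public class LangOps {
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> r = new HashSet<>(l);
        r.addAll(m);
        return r;
    }

    // LM = { ab : a in L, b in M }
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> r = new HashSet<>();
        for (String a : l) for (String b : m) r.add(a + b);
        return r;
    }

    // L* restricted to strings of length <= maxLen; L^0 = { "" }.
    static Set<String> closure(Set<String> l, int maxLen) {
        Set<String> r = new HashSet<>();
        r.add("");                        // L^0
        Set<String> power = Set.of("");   // L^i, starting at i = 0
        while (true) {
            Set<String> next = new HashSet<>();
            for (String s : concat(power, l))
                if (s.length() <= maxLen) next.add(s);
            if (next.isEmpty() || r.containsAll(next)) break;
            r.addAll(next);
            power = next;
        }
        return r;
    }

    public static void main(String[] args) {
        Set<String> l = Set.of("a"), m = Set.of("b");
        System.out.println(union(l, m));   // {a, b} in some order
        System.out.println(concat(l, m));  // [ab]
        System.out.println(closure(l, 3)); // "", a, aa, aaa in some order
    }
}
```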

Regular Expressions (1/2)

Given some alphabet, a regular expression is:
- The empty string ε; any symbol from the alphabet
- r | s, rs, r*, and (r), where r and s are regular expressions
- * has higher precedence than concatenation, which has higher precedence than |; all are left-associative

Each regular expression r defines a language L(r):
- L(ε) = { ε }; L(a) = { a } for an alphabet symbol a
- L(r | s) = L(r) ∪ L(s); L(rs) = L(r)L(s); L(r*) = (L(r))*; L((r)) = L(r)

Extended notation:
- r+ = rr*; + has the same precedence & associativity as *
- r? = r | ε; ? has the same precedence & associativity as * and +
- [a-zA-Z_] = a | b | ... | z | A | B | ... | Z | _

Regular Expressions (2/2)

For convenience, give names to expressions.

Example: identifiers in C
   letter_ = A | B | ... | Z | a | b | ... | z | _
   digit   = 0 | 1 | ... | 9
   id      = letter_ ( letter_ | digit )*

Example: unsigned numbers (integers or floating point), with optional fraction and exponent
   digit   = [0-9]
   digits  = digit+
   number  = digits (. digits)? ( E [+-]? digits )?
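These named patterns translate almost verbatim into java.util.regex syntax. A sketch; the Pattern strings below simply mirror the definitions above:

```java
// The C-identifier and unsigned-number patterns, as Java regexes.
import java.util.regex.Pattern;

public class NamedPatterns {
    // id = letter_ ( letter_ | digit )*
    static final Pattern ID = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");
    // number = digits ( . digits )? ( E [+-]? digits )?
    static final Pattern NUMBER =
        Pattern.compile("[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?");

    public static void main(String[] args) {
        System.out.println(ID.matcher("getprice").matches());    // true
        System.out.println(ID.matcher("2fast").matches());       // false: starts with a digit
        System.out.println(NUMBER.matcher("6.336E4").matches()); // true
        System.out.println(NUMBER.matcher("6.").matches());      // false: fraction needs digits
    }
}
```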

Recognition of Tokens

Finite automata (FA): automata that recognize strings defined by a regular expression.
- States, input symbols, transitions, start state, set of final states
- Transitions between states occur on specific input symbols
- Deterministic automata: only one transition per state on a given input; no transitions on the empty string

Example: the RE for integers, [0-9]+, corresponds to a two-state FA: from the start state A, any digit 0-9 leads to the final state B, and further digits 0-9 loop on B.
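The two-state automaton for [0-9]+ is small enough to code by hand. A sketch, with the states named A and B as in the description above:

```java
// Hand-coded DFA for the regular expression [0-9]+ (states A and B).
public class IntDfa {
    static boolean accepts(String input) {
        char state = 'A';                 // start state
        for (char c : input.toCharArray()) {
            boolean digit = c >= '0' && c <= '9';
            if (!digit) return false;     // no transition on this symbol: reject
            state = 'B';                  // A --digit--> B, B --digit--> B
        }
        return state == 'B';              // B is the only final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("2018")); // true
        System.out.println(accepts(""));     // false: needs at least one digit
        System.out.println(accepts("12a"));  // false
    }
}
```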

Languages and Automata

Language recognized by an automaton: the set of strings it accepts, starting in the start state, taking transitions on the symbols of the input string, and finishing in a final state.

Each pattern for a token gets its own FA. For example, the FA for relational operators: from the start state, on < we reach a state where = gives return(relop,LE), > gives return(relop,NE), and any other character gives backtrack; return(relop,LT). On = we return(relop,EQ). On > we reach a state where = gives return(relop,GE) and any other character gives backtrack; return(relop,GT).
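The relational-operator FA, including its "other: backtrack" edges, can be simulated with a couple of character tests. A sketch (the function name and return format are illustrative):

```java
// Simulation of the relational-operator FA, including backtracking edges.
public class RelopFa {
    // Returns the token for the relop at the start of s, consuming 1 or 2 chars.
    static String scanRelop(String s) {
        if (s.isEmpty()) return null;
        char c0 = s.charAt(0);
        char c1 = s.length() > 1 ? s.charAt(1) : '\0'; // '\0' stands for "other"
        switch (c0) {
            case '<':
                if (c1 == '=') return "(relop,LE)";
                if (c1 == '>') return "(relop,NE)";
                return "(relop,LT)"; // backtrack: second char not consumed
            case '=':
                return "(relop,EQ)";
            case '>':
                if (c1 == '=') return "(relop,GE)";
                return "(relop,GT)"; // backtrack
            default:
                return null;         // not a relop
        }
    }

    public static void main(String[] args) {
        System.out.println(scanRelop("<=x")); // (relop,LE)
        System.out.println(scanRelop("<y"));  // (relop,LT)
        System.out.println(scanRelop("<>"));  // (relop,NE)
        System.out.println(scanRelop(">"));   // (relop,GT)
    }
}
```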

Keywords, Identifiers, Numbers

- FA for the keyword then: start --t--> --h--> --e--> --n--> then, if the next character is not a letter, digit, or _, backtrack and accept.
- FA for id: letter or _, followed by letters, digits, or _; ends with something other than a letter, digit, or _; backtrack.
- FA for simple integers: one or more digits; ends with a non-digit; backtrack.
- FA for floating point numbers, with exponents, etc.: see Fig. 3.16.

Implementing a Scanner

The naïve approach: define an order in which to try each FA (e.g., first try the FA for each individual keyword, then the FA for id, then for numbers, etc.). For each FA, keep a state variable and use a big switch statement:

   state = 0;
   while (true) {
       switch (state) {
       case 0:
           c = nextchar();
           if (c == '<') state = 1;
           else if (c == '=') state = 2;
           else if (c == '>') state = 3;
           else fail();
           break;
       case 1:
           state = ... ; break;
       ...
       }
   }

The Easy Way

Do the code generation automatically, using a generator of lexical analyzers:
- High-level description of regular expressions and corresponding actions
- Automatic generation of transition tables, etc.
- Sophisticated lexical analysis techniques, better than what you can hope to achieve manually
- C world: lex and flex; Java world: JLex and JFlex

Can be used to generate:
- Standalone scanners
- Scanners integrated with automatically-generated parsers (using yacc, bison, CUP, etc.)

Simple JFlex Example

[Assignment: get it from the course web page under Resources and run it today!]

A standalone text substitution scanner: reads a name after the keyword "name", then substitutes all occurrences of "hello" with "hello <name>!".

   %%                   <- everything above %% is copied into the resulting Java
                           class (e.g., Java import, package, comments)
   %public              <- the generated Java class should be public
   %class Subst         <- the generated Java class will be called Subst.java
   %standalone          <- create a main method; no parser; unmatched text printed
   %unicode             <- capable of handling Unicode input text (not only ASCII)
   %{
     String name;       <- code copied verbatim into the generated Java class
   %}
   %%                   <- rules and actions start here
   "name "[a-zA-Z]+  { name = yytext().substring(5); }  <- yytext() returns the
                                                           lexeme as a String
   [Hh]"ello"        { System.out.print(yytext()+" "+name+"!"); }

Rules (Regular Expressions) and Actions

The scanner picks a regular expression that matches the input and runs the corresponding action.
- If several regular expressions match, the one with the longest lexeme is chosen. E.g., if one rule matches the keyword break and another rule matches the id breaking, the id wins.
- If there are several longest matches, the one appearing earlier in the specification is chosen.
- The action typically creates a new token for the matched lexeme.
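These two disambiguation rules (longest match wins; ties go to the earlier rule) can be demonstrated with a toy matcher. A sketch, not how JFlex is actually implemented; it tries each rule's regex at the start of the input:

```java
// Toy matcher: longest match wins; among equal lengths, the earlier rule wins.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatch {
    // Rules in specification order: keyword first, then identifier.
    static final String[][] RULES = {
        { "BREAK", "break" },
        { "ID",    "[A-Za-z]+" },
    };

    static String match(String input) {
        String bestName = null;
        int bestLen = -1;
        for (String[] rule : RULES) {
            Matcher m = Pattern.compile(rule[1]).matcher(input);
            // lookingAt() anchors the match at the start of the input
            if (m.lookingAt() && m.end() > bestLen) { // strict > keeps earlier rule on ties
                bestLen = m.end();
                bestName = rule[0];
            }
        }
        return bestName + ":" + input.substring(0, bestLen);
    }

    public static void main(String[] args) {
        System.out.println(match("break"));    // BREAK:break (tie: earlier rule wins)
        System.out.println(match("breaking")); // ID:breaking (longest match wins)
    }
}
```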

Regular Expressions in JFlex

- Character: any character except the meta characters | ( ) { } [ ] < > \ . * + ? ^ $ / " ~ !
- Escape sequences: \n \r \t \f \b, \x3f (hex ASCII), \u2ba7 (hex Unicode)
  - Warning: \n is not "end of line" but ASCII LF (line feed); use \r | \n | \r\n to match end of line
- Character classes: [a0-3\n] is {a,0,1,2,3,\n}; [^a0-3\n] is any character not in that set
  - Predefined classes: e.g. [:letter:], [:digit:], . (all characters except \n)
- "..." matches the exact text in double quotes; all meta characters except \ and " lose their special meaning inside a string

Regular Expressions in JFlex (continued)

- {MacroName}: a macro can be defined earlier, in the second part of the specification, e.g. LineTerminator = \r | \n | \r\n; in the third part, it can be used as {LineTerminator}
- Operations on regular expressions: a | b, ab, a*, a+, a?, !a, ~a, a{n}, a{n,m}, (a), ^a, a$, a/b, ...
- End of file: <<EOF>>

Assignment: http://jflex.de/manual.html
- Read "Lexical Specifications", subsection "Lexical rules"
- Read "A Simple Example: How to work with JFlex"

Lexical States

Definition and use:
- In part 2: %state STRING
- In part 3: <STRING> expr { action } or <STRING> { expr1 { action1 } expr2 { action2 } }

Lexical states can be used to refine a specification:
- E.g. if the scanner is in state STRING, only expressions that are preceded by <STRING> can be matched
- A regular expression can depend on more than one state; it is matched if the scanner is in any of these states
- YYINITIAL is predefined and is the starting state
- If an expression has no state, it is matched in all states
- See the example in the online documentation
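The effect of lexical states can be sketched in plain Java: the scanner keeps a mode variable and switches it the way a JFlex action would call yybegin(STRING) / yybegin(YYINITIAL). The example below is an illustrative simulation, not generated JFlex code; it collects string-literal contents:

```java
// Simulation of lexical states: rules behave differently per scanner mode.
public class StatesSketch {
    enum LexState { YYINITIAL, STRING }

    // Extracts string-literal contents, switching state on the quote character.
    static String scan(String input) {
        LexState state = LexState.YYINITIAL;
        StringBuilder literals = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == '"') {
                // the rule for '"' performs the state switch (like yybegin)
                state = (state == LexState.YYINITIAL) ? LexState.STRING
                                                      : LexState.YYINITIAL;
            } else if (state == LexState.STRING) {
                literals.append(c); // this "rule" only fires in state STRING
            }
        }
        return literals.toString();
    }

    public static void main(String[] args) {
        System.out.println(scan("x = \"abc\" + \"de\";")); // abcde
    }
}
```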

Interoperability with CUP (1/2)

CUP is a parser generator; the grammar is given in x.cup. Terminal symbols in the grammar are encoded in a CUP-generated sym.java:

   public class sym {
       public static final int MINUS = 4;
       public static final int NUMBER = 9;
   }

The CUP-generated parser (in parser.java) calls a method next_token on the scanner and expects to get back an object of java_cup.runtime.Symbol. A Symbol contains a token type (from sym.java) and optionally an Object with an attribute value, plus the source location (start & end position).

Interoperability with CUP (2/2)

Inside the lexical specification:
- import java_cup.runtime.Symbol;
- Add %cup in part 2
- Return instances of Symbol:

   "-"        { return new Symbol(sym.MINUS); }
   {IntConst} { return new Symbol(sym.NUMBER,
                    new Integer(Integer.parseInt(yytext()))); }

Workflow:
- Run JFlex to get Lexer.java
- Run CUP to get sym.java and parser.java
- Main.java: new parser(new Lexer(new FileReader( ... )));
- Compile everything (e.g. javac Main.java)

Assignment: read & run the calc example on the web page, under Resources.

Project 1

simplec, on the web page: a tiny scanner and parser for a subset of C (the parser is fake: no real parsing). Project: extend the functionality to handle
- All TODO comments in the specification file
- Any keywords, operators, etc. needed to handle two small C programs (already preprocessed with cpp)

Do not change MyLexer.java (a driver program) or MySymbol.java (a helper class, an extension of Symbol). The output from MyLexer will be used for grading.

Assignment: start working on it today!

Constructing JFlex-like Tools

Well-known and well-investigated algorithms exist for:
- Generating non-deterministic finite automata (NFA) from regular expressions (Sect. 3.7.4)
- Running an NFA on a given string (Sect. 3.7.2)
- Generating deterministic finite automata (DFA) from an NFA (Sect. 3.7.1)
- Generating a DFA from regular expressions (Sect. 3.9.5)
- Optimizing a DFA to reduce the number of states (Sect. 3.9.6)

We will not cover these algorithms in this class.

Building an actual tool:
- Compile the spec (e.g., z.flex) to transition tables for a single NFA (a new start node, with ε-transitions to the NFA for each rule)
- Run the NFA (or an equivalent DFA) on the input
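The "run the NFA on the input" step can be sketched compactly: keep the set of states the NFA could currently be in, and take ε-closures between input symbols. The hard-coded NFA below, recognizing (a|b)*ab, is purely illustrative:

```java
// On-the-fly NFA simulation: track the set of reachable states,
// taking epsilon-closures between input symbols.
import java.util.*;

public class NfaRun {
    // trans.get(state) maps a symbol (or EPS) to its successor states
    static final char EPS = 0;
    static final List<Map<Character, List<Integer>>> trans = new ArrayList<>();
    static void edge(int from, char sym, int to) {
        trans.get(from).computeIfAbsent(sym, k -> new ArrayList<>()).add(to);
    }

    static Set<Integer> epsClosure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> closure = new HashSet<>(states);
        while (!work.isEmpty())
            for (int s : trans.get(work.pop()).getOrDefault(EPS, List.of()))
                if (closure.add(s)) work.push(s);
        return closure;
    }

    static boolean accepts(String input, int start, int finalState) {
        Set<Integer> current = epsClosure(Set.of(start));
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current)
                next.addAll(trans.get(s).getOrDefault(c, List.of()));
            current = epsClosure(next);
        }
        return current.contains(finalState);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) trans.add(new HashMap<>());
        // NFA for (a|b)*ab: state 0 loops on a and b; a to 1; b to 2 (final)
        edge(0, 'a', 0); edge(0, 'b', 0);
        edge(0, 'a', 1); edge(1, 'b', 2);
        System.out.println(accepts("ab", 0, 2));   // true
        System.out.println(accepts("baab", 0, 2)); // true
        System.out.println(accepts("ba", 0, 2));   // false
    }
}
```

The nondeterminism shows up in the two edges out of state 0 on 'a': the simulation simply follows both at once, which is exactly what the subset (DFA) construction precomputes.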