Compiler phases. Non-tokens

Similar documents
EDAN65: Compilers, Lecture 02 Regular expressions and scanning. Görel Hedin Revised:

Lexical Analysis. Chapter 2

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

Chapter 3 Lexical Analysis

Lexical Analysis. Lecture 2-4

Lexical Analysis. Lecture 3-4

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Compiler course. Chapter 3 Lexical Analysis

Implementation of Lexical Analysis

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

CSCI312 Principles of Programming Languages!

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Lexical Analysis. Implementation: Finite Automata

Introduction to Lexical Analysis

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

Lexical Analysis. Lecture 3. January 10, 2018

ECS 120 Lesson 7 Regular Expressions, Pt. 1

2. Syntax and Type Analysis

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Lexical Analysis. Introduction

Syntax and Type Analysis

Zhizheng Zhang. Southeast University

Compiler Construction

Lexical Analysis 1 / 52

UNIT -2 LEXICAL ANALYSIS

Alternation. Kleene Closure. Definition of Regular Expressions

Figure 2.1: Role of Lexical Analyzer

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Monday, August 26, 13. Scanners

Wednesday, September 3, 14. Scanners

EDAN65: Compilers, Lecture 06 A LR parsing. Görel Hedin Revised:

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

Finite Automata and Scanners

CSc 453 Lexical Analysis (Scanning)

Compiler Construction

Writing a Lexical Analyzer in Haskell (part II)

Lexical Analysis. Finite Automata. (Part 2 of 2)

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

Implementation of Lexical Analysis

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Lexical Analysis - 2

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Formal Languages and Compilers Lecture VI: Lexical Analysis

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

Implementation of Lexical Analysis. Lecture 4

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 3

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Prof. Mohamed Hamada Software Engineering Lab. The University of Aizu Japan

Implementation of Lexical Analysis

Implementation of Lexical Analysis

Lexical Analysis. Finite Automata

Compilers CS S-01 Compiler Basics & Lexical Analysis

Dr. D.M. Akbar Hussain

CSE 401/M501 Compilers

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

CS164: Midterm I. Fall 2003

CSE 401 Midterm Exam Sample Solution 2/11/15

Cunning Plan. Informal Sketch of Lexical Analysis. Issues in Lexical Analysis. Specifying Lexers

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

CS 536 Introduction to Programming Languages and Compilers Charles N. Fischer Lecture 2

Lexical Error Recovery

Lexical Analysis (ASU Ch 3, Fig 3.1)

An introduction to Flex

Compilers CS S-01 Compiler Basics & Lexical Analysis

A Simple Syntax-Directed Translator

Programming Languages (CS 550) Lecture 4 Summary Scanner and Parser Generators. Jeremy R. Johnson

MIT Specifying Languages with Regular Expressions and Context-Free Grammars. Martin Rinard Massachusetts Institute of Technology

CS5363 Final Review. cs5363 1

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

CS 301. Lecture 05 Applications of Regular Languages. Stephen Checkoway. January 31, 2018

2068 (I) Attempt all questions.

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

CSE 3302 Programming Languages Lecture 2: Syntax

Lexical Analysis. Finite Automata

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

CS S-01 Compiler Basics & Lexical Analysis 1

Lexical Error Recovery

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University

CMSC 350: COMPILER DESIGN

Kinder, Gentler Nation

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

EDA180: Compiler Construc6on Context- free grammars. Görel Hedin Revised:

CS 406/534 Compiler Construction Putting It All Together

CS415 Compilers. Lexical Analysis

Torben./Egidius Mogensen. Introduction. to Compiler Design. ^ Springer

Transcription:

Compiler phases Compiler Construction Scanning Lexical Analysis source code scanner tokens regular expressions lexical analysis Lennart Andersson parser context free grammar Revision 2011 01 21 parse tree (implicit) 2011 AST builder AST (explicit) syntactic analysis Compiler Construction 2011 F02-1 Typical tokens Compiler Construction 2011 F02-2 Non-tokens Reserved words (keywords) Identifiers lexemes if then else begin i alpha k10 whitespace lexemes tab newline formfeed Literals 123 3.1416 A text comments /* comment */ //eol comment Operators + ++!= Separators ;, ( Compiler Construction 2011 F02-3 Compiler Construction 2011 F02-4

Language concepts Regular expression usage An alphabet is a nonempty finite set of symbols. A string is a finite sequence of symbols. A language is a set of strings. grep, egrep, fgrep: unix find utilities emacs: re-search-forward, replace-regexp sed: stream editor with re capabilities java.util.regex pattern matching classes L 1 L 2 contains all strings s 1 s 2 where s 1 L 1 and s 2 L 2 Compiler Construction 2011 F02-5 Utility scanners Compiler Construction 2011 F02-6 Regular expressions, canonical L[[M]] is the language described by the regular expression M. java.io.streamtokenizer java.util.scanner java.util.regex expression M language L[[M]] { a { "a" α β L[[α]] L[[β]] α β L[[α]]L[[β]] α { "" L[[α]] L[[α α]] L[[α α α]]... (α) L[[α]] α and β denote regular expressions. We will use ɛ as an abbreviation of. L[[ɛ]] = {. Compiler Construction 2011 F02-7 Compiler Construction 2011 F02-8

Operator precedence (binding power) operator precedence highest lowest expression interpretation language a b a (b ) a b c (a b) c a b c (a (b )) c and are associative. a (b c) is equivalent to (a b) c; they denote the same language. Activity Write a regular expression describing the language of all binary 1. strings that ends with 11. 2. strings that does not end with 11. 3. with an even number of 1 s Write a regular expression describing the language of all arithmetic expressions with binary numbers and the operators + and *, but without parentheses. Compiler Construction 2011 F02-9 Extended regular expressions Compiler Construction 2011 F02-10 JavaCC example... Appel JavaCC Canonical RE ɛ MN M N M N M + M+ M M M? M? M [a za Z] [ "a" - "z", "A" - "Z" ] a b etc. [ "0" "9" ] (too long to fit) "axb" a x b \n \n /* Skip whitespace */ SKIP : { " " "\t" "\n" "\r" /* Reserved words */ TOKEN [IGNORE_CASE]: { < IF: "if"> < THEN: "then"> /* Identifiers */ TOKEN: { < ID: (["A"-"Z", "a"-"z"])+ > Compiler Construction 2011 F02-11 Compiler Construction 2011 F02-12

... JavaCC example Ambiguous lexical definitions /* Literals */ TOKEN: { < INT: (["0"-"9"])+ > /* Operators and separators */ TOKEN: { < ASSIGN: "=" > < MULT: "*" > < LT: "<" > < LE: "<=" >... The scanned text can match tokens in more than one way Which tokens match "<="? Which tokens match "if"? Compiler Construction 2011 F02-13 Extra rules for resolving the ambiguities Compiler Construction 2011 F02-14 Implementation of scanners Finite automata Longest match If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible. Rule priority If two rules can be used to match the same sequence of characters, the first one takes priority. 1 i f 2 3 0 9 0 9 1 2 IF a state transition start state INT final state Compiler Construction 2011 F02-15 Compiler Construction 2011 F02-16

More automata Automaton for the whole language WHITESPACE ID Combine the automata for individual tokens merge the start states re-enumerate the states so they get unique numbers mark each final state with the token that is matched Compiler Construction 2011 F02-17 Example 1 Compiler Construction 2011 F02-18 Example 2 Combine IF and INT Combine IF and ID Compiler Construction 2011 F02-19 Compiler Construction 2011 F02-20

Translating a RE to an NFA a MN M N M ɛ DFA vs. NFA Deterministic Finite Automaton (DFA) A finite automaton is deterministic if all outgoing edges from any given node have disjoint character sets there are no ɛ transitions (the empty string) Can be implemented efficiently Non-deterministic Finite Automaton (NFA) An NFA may have two outgoing edges with overlapping character sets ɛ-edges DFA NFA Compiler Construction 2011 F02-21 Compiler Construction 2011 F02-22 Each NFA can be translated to a DFA Combine IF and ID NFA DFA Each state in the DFA will correspond to a set of states in the NFA There is a path in the NFA from state s to state t consuming the symbol a and any number ɛ transitions there will be an a transition from any state containing s to any state containing t in the DFA. The start state of the DFA will contain the start state of the NFA and all states in the NFA reachable from the start state via ɛ transitions (ɛ closure). Every state in the DFA containing a final state in the NFA will be a final state labeled with the same token. 1 i f 2 3 IF a z0 9 a z 4 ID Compiler Construction 2011 F02-23 Compiler Construction 2011 F02-24

Construction of a Scanner Implementation alternatives for DFAs Define all tokens as regular expressions Construct a finite automaton for each token Combine all these automata to a new automaton (an NFA in general) Translate to a corresponding DFA Minimize the DFA Implement the DFA Table-driven Represent the automaton by a table Additional table to keep track of final states and token kinds A global variable keeps track of the current state Switch statements Each state is implemented as a switch statement Each case implements a state transition as a jump (to another switch statement). The current state is represented by the program counter Compiler Construction 2011 F02-25 Table-driven implementation Compiler Construction 2011 F02-26 Error handling DFA for IF and ID (F2-23) 1 2,4 3,4 4.. a-e f g-h i j-z.. final kind true ERROR Add transitions to the dead state (state 0) for all erroneous input Generate an Error token when the dead state is reached Compiler Construction 2011 F02-27 Compiler Construction 2011 F02-29

Design Scanner (sketch) Token int kind String string class Scanner { PushbackReader reader; boolean isfinal [ ]; int edges [ ] [ ]; int kind [ ]; Token next() { Parser Scanner Token next() Reader int get() Compiler Construction 2011 F02-30 Handling End-Of-File (EOF) and non-tokens Compiler Construction 2011 F02-31 Handling longest match EOF construct an explicit EOF token when the EOF character is read Non-tokens (Whitespace & Comments) view as tokens of a special kind scan them as normal tokens, but don t create token objects for them loop in next() until a real token has been found Errors construct an explicit ERROR token to be returned when no valid token can be found. When a token is matched (a final state reached), don t stop scanning. Keep track of the latest matched token in a variable. Continue scanning until we reach the dead state Restore the input stream. Use PushbackReader to be able to do this. Return the latest matched token (or return the ERROR token if there was no latest matched token) Compiler Construction 2011 F02-33 Compiler Construction 2011 F02-34

Scanner with longest match JavaCC as a scanner generator class Scanner {... Token next() { toy.jj sum.toy javacc ToyTokenManager.java ToyConstants.java...java tokens Compiler Construction 2011 F02-35 Other scanner generators Compiler Construction 2011 F02-37 Additional functionality tool author implementation generates lex Schmidt, Lesk, 1975 table-driven C code flex Paxon. 1987 switch statements C code jlex table-driven Java code jflex switch statements Java code Lexical actions for adapting the generated scanner e.g., create special token objects, count tokens, keep track of position in input text (line and column numbers), etc. Lexical states ( start states ) gives the possibility to switch automata during scanning Compiler Construction 2011 F02-38 Compiler Construction 2011 F02-39

Summary questions What is meant by an ambiguous lexical definition? Give some typical examples of this and how they may be resolved. Construct an NFA for a given lexical definition Construct a DFA for a given NFA What is the difference between a DFA and an NFA? Implement a DFA in Java. How is rule priority handled in the implementation? How is longest match handled? EOF? Whitespace? Readings F2: Scanning. Regular expressions, deterministic and nondeterministic finite automata. Scanning with JavaCC. Appel, chapter 2-2.4 Hedin, Quick guide to writing scanning definitions in JavaCC, www.cs.lth.se/eda180/2007/tools/scanningguide. html F3: Parsing. Context-free grammars. Predictive parsing, Parsing with JavaCC. Appel, chapter 3-3.2 JavaCC documentation, javacc.dev.java.net/doc/docindex.html S1: Seminar 1. Lexical Analysis. Study F1, F2, and associated readings. L1: Programming assignment 1. Scanning Study the lab material Compiler Construction 2011 F02-40 Compiler Construction 2011 F02-41