CS415 Compilers. Lexical Analysis

Similar documents
Lexical Analysis. Introduction

Lexical Analysis - An Introduction. Lecture 4 Spring 2005 Department of Computer Science University of Alabama Joel Jones

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

Lexical Analysis. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Front End. Hwansoo Han

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Lecture 3: CUDA Programming & Compiler Front End

CS 406/534 Compiler Construction Putting It All Together

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Structure of Programming Languages Lecture 3

Parsing II Top-down parsing. Comp 412

CS 314 Principles of Programming Languages. Lecture 3

CS 314 Principles of Programming Languages

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

CS415 Compilers Overview of the Course. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

Introduction to Compiler

Part 5 Program Analysis Principles and Techniques

Implementation of Lexical Analysis

Syntax Analysis, VII One more LR(1) example, plus some more stuff. Comp 412 COMP 412 FALL Chapter 3 in EaC2e. target code.

CS 406/534 Compiler Construction Parsing Part I

Formal Languages and Compilers Lecture VI: Lexical Analysis

CS308 Compiler Principles Lexical Analyzer Li Jiang

EECS 6083 Intro to Parsing Context Free Grammars

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Announcements! P1 part 1 due next Tuesday P1 part 2 due next Friday

Lexical Analysis. Lecture 2-4

Implementation of Lexical Analysis

Figure 2.1: Role of Lexical Analyzer

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Compiler Optimisation

Introduction to Parsing

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Lexical Analysis. Lecture 3-4

CS415 Compilers. Instruction Scheduling and Lexical Analysis

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Compiler course. Chapter 3 Lexical Analysis

Finite automata. We have looked at using Lex to build a scanner on the basis of regular expressions.

G52LAC Languages and Computation Lecture 6

Syntax Analysis, III Comp 412

Introduction to Lexical Analysis

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Compiler Construction

Lexical Analysis. Chapter 2

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Chapter 3 Lexical Analysis

UNIT -2 LEXICAL ANALYSIS

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

CSE302: Compiler Design

Front End: Lexical Analysis. The Structure of a Compiler

2. Lexical Analysis! Prof. O. Nierstrasz!

Dr. D.M. Akbar Hussain

A simple syntax-directed

Lecture 4: Syntax Specification

Introduction to Parsing. Comp 412

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

Syntax Analysis, V Bottom-up Parsing & The Magic of Handles Comp 412

COLLEGE OF ENGINEERING, NASHIK. LANGUAGE TRANSLATOR

COP 3402 Systems Software Syntax Analysis (Parser)

CS 315 Programming Languages Syntax. Parser. (Alternatively hand-built) (Alternatively hand-built)

Week 2: Syntax Specification, Grammars

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Parsing and Pattern Recognition

Compiler Construction

Syntax Analysis, III Comp 412

Writing a Lexical Analyzer in Haskell (part II)

Syntax Analysis, VI Examples from LR Parsing. Comp 412

Lexical Analysis. Lecture 3. January 10, 2018

Zhizheng Zhang. Southeast University

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

Lexical Analyzer Scanner

Compilers. Yannis Smaragdakis, U. Athens (original slides by Sam

CSE450. Translation of Programming Languages. Lecture 20: Automata and Regular Expressions

Theoretical Part. Chapter one:- - What are the Phases of compiler? Answer:

Lexical Analyzer Scanner

Languages and Compilers

Lexical Analysis. COMP 524, Spring 2014 Bryan Ward

Goals for course What is a compiler and briefly how does it work? Review everything you should know from 330 and before

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Friday 20th May 2016 Time: 14:00-16:00

Lexical Analysis 1 / 52

Instruction Selection, II Tree-pattern matching

PRINCIPLES OF COMPILER DESIGN UNIT II LEXICAL ANALYSIS 2.1 Lexical Analysis - The Role of the Lexical Analyzer

CSE302: Compiler Design

Implementation of Lexical Analysis. Lecture 4

Implementation of Lexical Analysis

Structure of a compiler. More detailed overview of compiler front end. Today we ll take a quick look at typical parts of a compiler.

Scanners. Xiaokang Qiu Purdue University. August 24, ECE 468 Adapted from Kulkarni 2012

CSc 453 Lexical Analysis (Scanning)

Non-deterministic Finite Automata (NFA)

CSE450 Translation of Programming Languages. Lecture 4: Syntax Analysis

MidTerm Papers Solved MCQS with Reference (1 to 22 lectures)

THE COMPILATION PROCESS EXAMPLE OF TOKENS AND ATTRIBUTES

Chapter 3: Describing Syntax and Semantics. Introduction Formal methods of describing syntax (BNF)

Programming Languages and Compilers (CS 421)

CSCI312 Principles of Programming Languages!

Transcription:

CS415 Compilers Lexical Analysis These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University Lecture 7 1

Announcements First project and second homework have been posted. a) loadi 1024 => r0 first instruction in all test cases b) r0 is not an allocatable register (not part of k ) c) Latency of outputai is 3 Homeworks: No late submissions; however, you may ask for an extension in class Paper on register allocation on web site (8 th International Conference on Compiler Construction, CC 99, March 1999) cs415, spring 18 Lecture 7 2

Register allocation project ILOC simulator + ILOC files (1) Command line option for sim -i NUM0 NUM1... NUMn Treats NUM1 through NUMn as integer data items and initializes data memory by writing them into consecutive words of memory, starting at the address NUM0. (2) Inserted loadi 1024 => r0 to testblock and reportblock files; all spill code has to use loadai and storeai instructions; you should use two feasible registers for topdown, and no feasible registers for bottom-up cs415, spring 18 Lecture 7 3

Next topic Lexical Analysis Read EaC: Chapters 2.1 2.5; cs415, spring 18 Lecture 7 4

The Front End EaC Chapter 2 Source code Front End IR Back End Machine code Errors The purpose of the front end is to deal with the input language Perform a membership test: code Î source language? Is the program well-formed (semantically)? Build an IR version of the code for the rest of the compiler The front end is not monolithic cs415, spring 18 Lecture 7 5

The Front End Source code Scanner tokens Parser IR Scanner Maps stream of characters into words Basic unit of syntax x = x + y ; becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,; > Errors Speed is an issue in scanning Þ use a specialized recognizer Characters that form a word are its lexeme Its part of speech (or syntactic category) is called its token type Scanner discards white space & (often) comments cs415, spring 18 Lecture 7 6

The Front End Source code Scanner tokens Parser IR Parser Checks stream of classified words (parts of speech) for grammatical correctness Determines if code is syntactically well-formed Guides checking at deeper levels than syntax Builds an IR representation of the code We ll get to parsing in the next lectures Errors cs415, spring 18 Lecture 7 7

The Big Picture Language syntax is specified with parts of speech, not words Syntax checking matches parts of speech against a grammar Here is an example context free grammar (CFG) G: 1. goal expr 2. expr expr op term 3. term 4. term number 5. id 6. op + 7. S = goal T = { number, id, +, - } N = { goal, expr, term, op } P = { 1, 2, 3, 4, 5, 6, 7} G in BNF form G = (S, T, N, P) cs415, spring 18 Lecture 7 8

The Big Picture Why study lexical analysis? We want to avoid writing scanners by hand source code Scanner parts of speech & words (tokens) Goals: specifications Scanner Generator Specifications written as regular expressions tables or code Represent words as indices into a global table To simplify specification & implementation of scanners To understand the underlying techniques and technologies cs415, spring 18 Lecture 7 9

Regular Expressions Lexical patterns form a regular language *** any finite language is regular *** Regular expressions (REs) describe regular languages Ever type rm *.o a.out? Regular Expression (over alphabet S) e is a RE denoting the set {e} If a is in S, then a is a RE denoting {a} If x and y are REs denoting L(x) and L(y) then x y is an RE denoting L(x) È L(y) xy is an RE denoting L(x)L(y) x * is an RE denoting L(x)* (x) is an RE denoting L(x) Precedence is closure, then concatenation, then alternation cs415, spring 18 Lecture 8 10

Set Operations These definitions should be well known cs415, spring 18 Lecture 7 11

Examples of Regular Expressions Identifiers: Letter (a b c z A B C Z) Digit (0 1 2 9) Identifier Letter ( Letter Digit ) * Numbers: Integer (+ - e) (0 (1 2 3 9)(Digit * ) ) Decimal Integer. Digit * Real ( Integer Decimal ) E (+ - e) Digit * Complex ( Real, Real ) Numbers can get much more complicated! cs415, spring 18 Lecture 7 12

Regular Expressions (the point) Regular expressions can be used to specify the words to be translated to parts of speech by a lexical analyzer Using results from automata theory and theory of algorithms, we can automatically build recognizers from regular expressions Þ We study REs and associated theory to automate scanner construction! cs415, spring 18 Lecture 7 13

Example Consider the problem of recognizing ILOC register names Register r (0 1 2 9) (0 1 2 9) * Allows registers of arbitrary number Requires at least one digit RE corresponds to a recognizer (or DFA) r (0 1 2 9) (0 1 2 9) S 0 S 1 S 2 accepting state Recognizer for Register Transitions on other inputs go to an error state, cs415, spring 18 Lecture 7 14

Example (continued) DFA operation Start in state S 0 & take transitions on each input character DFA accepts a word x iff x leaves it in a final state (S 2 ) r (0 1 2 9) (0 1 2 9) S 0 S 1 S 2 accepting state Recognizer for Register So, r17 takes it through s 0, s 1, s 2 and accepts r takes it through s 0, s 1 and fails a takes it straight to error state (not shown here) cs415, spring 18 Lecture 7 15

Example (continued) To be useful, recognizer must turn into code Char next character State s 0 d r 0,1,2,3,4, 5,6,7,8,9 All others while (Char ¹ EOF) State d(state,char) Char next character s 0 s 1 s 1 s 2 if (State is a final state ) then report success else report failure s 2 s 2 Skeleton recognizer Table encoding RE cs415, spring 18 Lecture 7 16

Example (continued) To be useful, recognizer must turn into code Char next character State s 0 while (Char ¹ EOF) State d(state,char) perform specified action Char next character if (State is a final state ) then report success else report failure Skeleton recognizer d s 0 s 1 s 2 r s 1 start error error error 0,1,2,3,4, 5,6,7,8,9 error s 2 add s 2 add error Table encoding RE All others error error error error cs415, spring 18 Lecture 7 17

What if we need a tighter specification? r Digit Digit * allows arbitrary numbers Accepts r00000 Accepts r99999 What if we want to limit it to r0 through r31? Write a tighter regular expression Register r ( (0 1 2) (Digit e) (4 5 6 7 8 9) (3 30 31) ) Register r0 r1 r2 r31 r00 r01 r02 r09 Produces a more complex DFA Has more states Same cost per transition Same basic implementation cs415, spring 18 Lecture 7 18

Tighter register specification (continued) The DFA for Register r ( (0 1 2) (Digit e) (4 5 6 7 8 9) (3 30 31) ) (0 1 2 9) 0,1,2 S 2 S 3 r 3 0,1 S 0 S 1 S 5 S 6 4,5,6,7,8,9 S 4 Accepts a more constrained set of registers Same set of actions, more states cs415, spring 18 Lecture 7 19

Tighter register specification (continued) d r 0,1 2 3 4-9 All others s 0 s 1 s 1 s 2 s 2 s 5 s 4 s 2 s 3 s 3 s 3 s 3 s 3 s 4 s 5 s 6 Runs in the same skeleton recognizer s 6 Table encoding RE for the tighter register specification cs415, spring 18 Lecture 7 20

Constructing a Scanner - Quick Review source code Scanner parts of speech & words tables or code specifications Scanner Generator The scanner is the first stage in the front end Specifications can be expressed using regular expressions Build tables and code from a DFA cs415, spring 18 Lecture 7 21

Goal We will show how to construct a finite state automaton to recognize any RE Overview: Direct construction of a nondeterministic finite automaton (NFA) to recognize a given RE Require-transitions to combine regular subexpressions Construct a deterministic finite automaton (DFA) to simulate the NFA Use a set-of-states construction Minimize the number of states Hopcroft state minimization algorithm Generate the scanner code Additional specifications needed for details cs415, spring 18 Lecture 7 22

More Regular Expressions All strings of 1s and 0nding in a 1 ( 0 1 ) * 1 All strings over lowercase letters where the vowels (a,e,i,o, & u) occur exactly once, in ascending order Cons (b c d f g h j k l m n p q r s t v w x y z) Cons * a Cons * e Cons * i Cons * o Cons * u Cons * All strings of 1s and 0s that do not contain three 0s in a row: cs415, spring 18 Lecture 7 23

More Regular Expressions All strings of 1s and 0nding in a 1 ( 0 1 ) * 1 All strings over lowercase letters where the vowels (a,e,i,o, & u) occur exactly once, in ascending order Cons (b c d f g h j k l m n p q r s t v w x y z) Cons * a Cons * e Cons * i Cons * o Cons * u Cons * All strings of 1s and 0s that do not contain three 0s in a row: ( 1 * ( e 01 001 ) 1 * ) * ( e 0 00 ) cs415, spring 18 Lecture 7 24

Non-deterministic Finite Automata Each RE corresponds to a deterministic finite automaton (DFA) May be hard to directly construct the right DFA What about an RE such as ( a b ) * abb? a b e a b b S 0 S 1 S 2 S 3 S 4 This is a little different S 0 has a transition on e S 1 has two transitions on a This is a non-deterministic finite automaton (NFA) cs415, spring 18 Lecture 7 25

Non-deterministic Finite Automata An NFA accepts a string x iff $ a path though the transition graph from s 0 to a final state such that the edge labels spell x Transitions on e consume no input To run the NFA, start in s 0 and guess the right transition at each step Always guess correctly If some sequence of correct guesses accepts x then accept Why study NFAs? They are the key to automating the RE DFA construction We can paste together NFAs with e-transitions e NFA NFA becomes an NFA cs415, spring 18 Lecture 7 26

Relationship between NFAs and DFAs DFA is a special case of an NFA DFA has no e transitions DFA s transition function is single-valued Same rules will work DFA can be simulated with an NFA Obviously NFA can be simulated with a DFA Simulate sets of possible states Possible exponential blowup in the state space Still, one state per character in the input stream (less obvious) cs415, spring 18 Lecture 7 27

Automating Scanner Construction To convert a specification into code: 1 Write down the RE for the input language 2 Build a big NFA 3 Build the DFA that simulates the NFA 4 Systematically shrink the DFA 5 Turn it into code Scanner generators Lex and Flex work along these lines Algorithms are well-known and well-understood Key issue is interface to parser (define all parts of speech) You could build one in a weekend! cs415, spring 18 Lecture 7 28

Automating Scanner Construction RE NFA (Thompson s construction) Build an NFA for each term Combine them with e-moves NFA DFA (subset construction) Build the simulation DFA Minimal DFA Hopcroft s algorithm The Cycle of Constructions RE NFA DFA minimal DFA RE (Not part of the scanner construction) All pairs, all paths problem Take the union of all paths from s 0 to an accepting state DFA cs415, spring 18 Lecture 7 29

RE NFA using Thompson s Construction Key idea NFA pattern for each symbol and each operator Each NFA has a single start and accept state Join them with e moves in precedence order a S 0 S 1 a S 0 S 1 e b S 3 S 4 NFA for a NFA for ab S 0 e a S 1 S 2 e S 5 e S 0 S 1 e a e S 3 S 4 e b S 3 S 4 e e NFA for a * NFA for a b Ken Thompson, CACM, 1968 cs415, spring 18 Lecture 7 30

Homework and Next class More Lexical Analysis cs415, spring 18 Lecture 7 31