
The Front End

Source code -> Front End -> IR -> Back End -> Machine code (errors reported along the way)

The purpose of the front end is to deal with the input language
> Perform a membership test: is the code in the source language?
> Is the program well-formed (syntactically)?
> Build an IR version of the code for the rest of the compiler

from Cooper & Torczon

The Front End: the Scanner

Source code -> Scanner -> tokens -> Parser -> IR (errors reported along the way)

The scanner maps the stream of characters into words, the basic unit of syntax
> x = x + y ; becomes <id,x> <assignop,=> <id,x> <arithop,+> <id,y> ;
The characters that form a word are its lexeme
Its part of speech (or syntactic category) is called its token
The scanner discards white space and (often) comments
Speed is an issue in scanning, so use a specialized recognizer
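The x = x + y ; example can be reproduced with a small hand-written scanner. The category names <id>, <assignop>, and <arithop> follow the slide; the code itself is only an illustrative sketch, and the semi category name for ; is an assumption.

```python
# Minimal scanner sketch: maps a character stream into (category, lexeme) pairs.
# Categories follow the slide; 'semi' for ';' is an illustrative assumption.

def scan(source):
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                  # the scanner discards white space
            i += 1
        elif ch.isalpha():                # identifier: letter followed by alphanumerics
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            tokens.append(("id", source[i:j]))
            i = j
        elif ch == "=":
            tokens.append(("assignop", ch)); i += 1
        elif ch == "+":
            tokens.append(("arithop", ch)); i += 1
        elif ch == ";":
            tokens.append(("semi", ch)); i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens

print(scan("x = x + y ;"))
# [('id', 'x'), ('assignop', '='), ('id', 'x'), ('arithop', '+'), ('id', 'y'), ('semi', ';')]
```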

The Front End: the Parser

Source code -> Scanner -> tokens -> Parser -> IR (errors reported along the way)

The parser checks the stream of classified words (parts of speech) for grammatical correctness
> Determines whether the code is syntactically well-formed
> Guides checking at deeper levels than syntax
> Builds an IR representation of the code
We'll come back to parsing in a couple of lectures

The Big Picture

In natural languages, a word's part of speech is idiosyncratic
> Based on connotation & context
> Typically resolved with a table lookup
In formal languages, a word's part of speech is syntactic
> Based on denotation
> This makes it a matter of syntax, or micro-syntax
> We can recognize this micro-syntax efficiently
> Reserved keywords are critical (no context!)
Fast recognizers can map words into their parts of speech
We study formalisms to automate the construction of recognizers

The Big Picture

Why study lexical analysis? We want to avoid writing scanners by hand:

specifications -> Scanner Generator -> tables or code
source code -> Scanner -> parts of speech

Goals:
> To simplify the specification & implementation of scanners
> To understand the underlying techniques and technologies

Specifying Lexical Patterns (micro-syntax)

A scanner recognizes the language's parts of speech. Some parts are easy:

White space
> WhiteSpace -> blank | tab | WhiteSpace blank | WhiteSpace tab
Keywords and operators
> Specified as literal patterns: if, then, else, while, =, +, ...
Comments
> Opening and (perhaps) closing delimiters
> /* followed by */ in C
> // in C++
> % in LaTeX

Specifying Lexical Patterns (micro-syntax)

Some parts are more complex:

Identifiers
> An alphabetic character followed by alphanumerics (perhaps also _, &, $, ...)
> May have limited length
Numbers
> Integers: 0, or a digit from 1-9 followed by digits from 0-9
> Decimals: integer . digits from 0-9, or . digits from 0-9
> Reals: (integer or decimal) E (+ or -) digits from 0-9
> Complex: ( real , real )

We need a notation for specifying these patterns, and we would like the notation to lead to an implementation

Regular Expressions

These patterns form a regular language (any finite language is regular). Ever type rm *.o a.out?

Regular expressions (REs) describe regular languages.

Regular expression (over alphabet Σ):
> ε is a RE denoting the set {ε}
> If a is in Σ, then a is a RE denoting {a}
> If x and y are REs denoting L(x) and L(y), then
  - (x) is a RE denoting L(x)
  - x | y is a RE denoting L(x) ∪ L(y)
  - xy is a RE denoting L(x)L(y)
  - x* is a RE denoting L(x)*
Precedence is closure, then concatenation, then alternation

Set Operations (refresher)

> Union of L and M, written L ∪ M: L ∪ M = { s | s ∈ L or s ∈ M }
> Concatenation of L and M, written LM: LM = { st | s ∈ L and t ∈ M }
> Kleene closure of L, written L*: L* = ∪ over i ≥ 0 of L^i
> Positive closure of L, written L+: L+ = ∪ over i ≥ 1 of L^i

You need to know these definitions
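These definitions are easy to check on small finite languages. The sketch below computes union, concatenation, and a bounded approximation of the Kleene closure (the true closure is infinite); all function names are illustrative.

```python
# Set operations on small finite languages, following the definitions above.

def concat(L, M):
    """LM = { st | s in L and t in M }"""
    return {s + t for s in L for t in M}

def kleene(L, max_i):
    """Approximates L* by the union of L^i for 0 <= i <= max_i."""
    result = {""}                 # L^0 = { epsilon }
    power = {""}
    for _ in range(max_i):
        power = concat(power, L)  # L^(i+1) = L^i L
        result |= power
    return result

L = {"a"}
M = {"b", "c"}
print(L | M)                  # union: {'a', 'b', 'c'}
print(concat(L, M))           # {'ab', 'ac'}
print(sorted(kleene(L, 3)))   # ['', 'a', 'aa', 'aaa']
```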

Examples of Regular Expressions

Identifiers:
> Letter -> (a | b | c | ... | z | A | B | C | ... | Z)
> Digit -> (0 | 1 | 2 | ... | 9)
> Identifier -> Letter ( Letter | Digit )*

Numbers:
> Integer -> (+ | - | ε) (0 | (1 | 2 | 3 | ... | 9) Digit* )
> Decimal -> Integer . Digit*
> Real -> ( Integer | Decimal ) E (+ | - | ε) Digit*
> Complex -> ( Real , Real )

Numbers can get much more complicated!
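These REs carry over almost directly to a practical regex notation. The sketch below uses Python's re syntax as a stand-in: a character class like [1-9] abbreviates the alternation, and ? plays the role of (x | ε). The translations are assumptions faithful to the patterns above, not part of the original slides.

```python
import re

# Transliterations of the REs above into Python regex syntax (illustrative).
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9]*$")        # Letter (Letter | Digit)*
integer    = re.compile(r"[+-]?(0|[1-9][0-9]*)$")        # (+|-|eps)(0 | (1..9) Digit*)
decimal    = re.compile(r"[+-]?(0|[1-9][0-9]*)\.[0-9]*$")  # Integer . Digit*

print(bool(identifier.match("x27")))  # True
print(bool(identifier.match("2x")))   # False: must start with a letter
print(bool(integer.match("-42")))     # True
print(bool(integer.match("007")))     # False: the RE excludes leading zeros
print(bool(decimal.match("3.14")))    # True
```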

Regular Expressions (the point)

To make scanning tractable, programming languages differentiate between parts of speech by controlling their spelling (as opposed to dictionary lookup)

The difference between Identifier and Keyword is entirely lexical
> While is a Keyword
> Whilst is an Identifier

The lexical patterns used in programming languages are regular
Using results from automata theory, we can automatically build recognizers from regular expressions
We study REs to automate scanner construction!

Example

Consider the problem of recognizing register names:

Register -> r (0 | 1 | 2 | ... | 9) (0 | 1 | 2 | ... | 9)*
> Allows registers of arbitrary number
> Requires at least one digit

The RE corresponds to a recognizer (or DFA) for Register:

S0 --r--> S1 --(0|1|...|9)--> S2, with S2 looping on (0|1|...|9)
S2 is the accepting state; implicit transitions on other inputs go to an error state, se

Example (continued)

DFA operation:
> Start in state S0 and take a transition on each input character
> The DFA accepts a word x iff x leaves it in a final state (S2)

So,
> r17 takes it through S0, S1, S2 and accepts
> r takes it through S0, S1 and fails
> a takes it straight to se

Example (continued)

The recognizer translates directly into code:

  char <- next character
  state <- S0
  call action(state, char)
  while (char != eof)
      state <- δ(state, char)
      call action(state, char)
      char <- next character
  if Τ(state) = final
      then report acceptance
      else report failure

  action(state, char)
      switch (Τ(state))
          case start:  word <- char;        break
          case normal: word <- word + char; break
          case final:  word <- char;        break
          case error:  report error;        break
      end

Transition table δ (missing entries go to the error state se):

        r    0,1,2,3,4,5,6,7,8,9
  S0    S1   -
  S1    -    S2
  S2    -    S2

Classifier Τ: Τ(S0) = start, Τ(S1) = normal, Τ(S2) = final, Τ(se) = error

To change DFAs, just change the tables
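A runnable version of the skeleton might look like the following, with δ as an explicit table and Τ as a classifier map. Python stands in for the slide's pseudocode, and as a simplification the lexeme-building action is folded into the driver loop.

```python
# Table-driven recognizer: a generic driver plus tables delta and T.
# Tables encode the DFA for Register -> r (0|1|...|9)(0|1|...|9)*.
DIGITS = "0123456789"

delta = {("S0", "r"): "S1"}
for d in DIGITS:
    delta[("S1", d)] = "S2"
    delta[("S2", d)] = "S2"

T = {"S0": "start", "S1": "normal", "S2": "final", "se": "error"}

def recognize(chars):
    state, word = "S0", ""
    for char in chars:
        state = delta.get((state, char), "se")  # implicit transitions to se
        if T[state] == "error":
            return (False, word)
        word += char          # accumulate the lexeme (simplified action routine)
    return (T[state] == "final", word)

# To change DFAs, just change the tables; the driver stays the same.
print(recognize("r17"))  # (True, 'r17')
print(recognize("r"))    # (False, 'r')
print(recognize("a"))    # (False, '')
```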

What if we need a tighter specification?

r Digit Digit* allows arbitrary register numbers
> Accepts r00000
> Accepts r99999

What if we want to limit it to r0 through r31? Write a tighter regular expression:
> Register -> r ( (0 | 1 | 2) (Digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31) )
> Register -> r0 | r1 | r2 | ... | r31 | r00 | r01 | r02 | ... | r09

This produces a more complex DFA:
> It has more states
> Same cost per transition
> Same basic implementation

Tighter register specification (continued)

The DFA for Register -> r ( (0 | 1 | 2) (Digit | ε) | (4 | 5 | 6 | 7 | 8 | 9) | (3 | 30 | 31) ):

  S0 --r--> S1
  S1 --0,1,2--> S2, then S2 --0,1,...,9--> S3
  S1 --3--> S5, then S5 --0,1--> S6
  S1 --4,5,6,7,8,9--> S4
  (S2, S3, S4, S5, S6 are accepting states)

Accepts a more constrained set of registers; same set of actions, more states

Tighter register specification (continued)

To implement the recognizer:
> Use the same code skeleton
> Use transition and action tables for the new RE

Transition table δ (missing entries go to the error state se):

        r    0,1   2    3    4,5,6,7,8,9
  S0    S1   -     -    -    -
  S1    -    S2    S2   S5   S4
  S2    -    S3    S3   S3   S3
  S5    -    S6    -    -    -

Classifier Τ: Τ(S0) = start, Τ(S1) = normal, Τ(S2) = Τ(S3) = Τ(S4) = Τ(S5) = Τ(S6) = final, Τ(se) = error

Bigger tables, more space, same asymptotic costs
Better (micro-)syntax checking at the same cost
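The same driver works for the tighter specification once the tables are swapped. The table contents below are reconstructed from the slide's DFA; the surrounding code is an illustrative sketch.

```python
# Tables for Register -> r ( (0|1|2)(Digit|eps) | (4|...|9) | (3|30|31) ).
# Missing delta entries mean a transition to the error state se.
delta = {("S0", "r"): "S1"}
for d in "012":
    delta[("S1", d)] = "S2"       # r0..r2, possibly one more digit
delta[("S1", "3")] = "S5"         # r3, r30, r31
for d in "456789":
    delta[("S1", d)] = "S4"       # r4..r9, no second digit allowed
for d in "0123456789":
    delta[("S2", d)] = "S3"       # second digit after 0, 1, or 2
for d in "01":
    delta[("S5", d)] = "S6"       # only r30 and r31 beyond r3

FINAL = {"S2", "S3", "S4", "S5", "S6"}   # the states T classifies as 'final'

def accepts(word):
    state = "S0"
    for ch in word:
        state = delta.get((state, ch), "se")
        if state == "se":
            return False
    return state in FINAL

print([r for r in ("r0", "r31", "r09", "r32", "r99") if accepts(r)])
# ['r0', 'r31', 'r09']
```

Swapping tables while reusing the driver is exactly the "same basic implementation" point: the new DFA costs more space but the same time per transition.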