CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 2: Lexical Analysis 23 Jan 08

Similar documents
Outline CS4120/4121. Compilation in a Nutshell 1. Administration. Introduction to Compilers Andrew Myers. HW1 out later today due next Monday.

Compiler Construction LECTURE # 3

Lexical Analysis. Dragon Book Chapter 3 Formal Languages Regular Expressions Finite Automata Theory Lexical Analysis using Automata

Syntactic Analysis. CS345H: Programming Languages. Lecture 3: Lexical Analysis. Outline. Lexical Analysis. What is a Token? Tokens

2010: Compilers REVIEW: REGULAR EXPRESSIONS HOW TO USE REGULAR EXPRESSIONS

Lexical Analysis. Chapter 2

Lexical Analysis. Lecture 3-4

Introduction to Lexical Analysis

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

CSC 467 Lecture 3: Regular Expressions

Lexical Analysis. Lecture 2-4

Formal Languages and Compilers Lecture VI: Lexical Analysis

CSE 3302 Programming Languages Lecture 2: Syntax

CS321 Languages and Compiler Design I. Winter 2012 Lecture 4

DVA337 HT17 - LECTURE 4. Languages and regular expressions

Introduction to Lexical Analysis

Lexical Analysis. Chapter 1, Section Chapter 3, Section 3.1, 3.3, 3.4, 3.5 JFlex Manual

CSE 413 Programming Languages & Implementation. Hal Perkins Autumn 2012 Grammars, Scanners & Regular Expressions

CSE 413 Programming Languages & Implementation. Hal Perkins Winter 2019 Grammars, Scanners & Regular Expressions

Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Zhizheng Zhang. Southeast University

Chapter 4. Lexical analysis. Concepts. Lexical scanning Regular expressions DFAs and FSAs Lex. Lexical analysis in perspective

Lexical Analysis. Lecture 3. January 10, 2018

Lexical Analysis. Sukree Sinthupinyo July Chulalongkorn University

CMSC 350: COMPILER DESIGN

CSEP 501 Compilers. Languages, Automata, Regular Expressions & Scanners Hal Perkins Winter /8/ Hal Perkins & UW CSE B-1

Implementation of Lexical Analysis

CS 314 Principles of Programming Languages. Lecture 3

Administrivia. Lexical Analysis. Lecture 2-4. Outline. The Structure of a Compiler. Informal sketch of lexical analysis. Issues in lexical analysis

Lexical Analysis. Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!

Concepts Introduced in Chapter 3. Lexical Analysis. Lexical Analysis Terms. Attributes for Tokens

CSc 453 Lexical Analysis (Scanning)

COMP-421 Compiler Design. Presented by Dr Ioanna Dionysiou

Structure of Programming Languages Lecture 3

Lexical Analysis (ASU Ch 3, Fig 3.1)

ECS 120 Lesson 7 Regular Expressions, Pt. 1

COMPILER DESIGN UNIT I LEXICAL ANALYSIS. Translator: It is a program that translates one language to another Language.

Implementation of Lexical Analysis

Part 5 Program Analysis Principles and Techniques

CS Lecture 2. The Front End. Lecture 2 Lexical Analysis

Outline. 1 Scanning Tokens. 2 Regular Expresssions. 3 Finite State Automata

UNIT II LEXICAL ANALYSIS

CS 314 Principles of Programming Languages

Compiler phases. Non-tokens

Lexical Analysis 1 / 52

Lexical Analysis. Finite Automata

Languages, Automata, Regular Expressions & Scanners. Winter /8/ Hal Perkins & UW CSE B-1

1. INTRODUCTION TO LANGUAGE PROCESSING The Language Processing System can be represented as shown figure below.

Lexical Analysis. Introduction

Dr. D.M. Akbar Hussain

CS 403 Compiler Construction Lecture 3 Lexical Analysis [Based on Chapter 1, 2, 3 of Aho2]

CS308 Compiler Principles Lexical Analyzer Li Jiang

Compilers CS S-01 Compiler Basics & Lexical Analysis

Chapter Seven: Regular Expressions

Group A Assignment 3(2)

David Griol Barres Computer Science Department Carlos III University of Madrid Leganés (Spain)

CPSC 434 Lecture 3, Page 1

Lexical Analyzer Scanner

CSE302: Compiler Design

Lexical Analyzer Scanner

Lexical analysis. Syntactical analysis. Semantical analysis. Intermediate code generation. Optimization. Code generation. Target specific optimization

Lexical Analysis - 1. A. Overview A.a) Role of Lexical Analyzer

Compilers CS S-01 Compiler Basics & Lexical Analysis

Lexical Analysis. Finite Automata

Introduction to Compiler Design

Compiler Construction D7011E

CS 4120 and 5120 are really the same course. CS 4121 (5121) is required! Outline CS 4120 / 4121 CS 5120/ = 5 & 0 = 1. Course Information

Last lecture CMSC330. This lecture. Finite Automata: States. Finite Automata. Implementing Regular Expressions. Languages. Regular expressions

CSCI312 Principles of Programming Languages!

Introduction to Lexing and Parsing

Context-Free Grammar. Concepts Introduced in Chapter 2. Parse Trees. Example Grammar and Derivation

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

CSE Lecture 4: Scanning and parsing 28 Jan Nate Nystrom University of Texas at Arlington

CSCE 314 Programming Languages

Lecture 9 CIS 341: COMPILERS

Regular Expressions. Regular Expressions. Regular Languages. Specifying Languages. Regular Expressions. Kleene Star Operation

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

Compiler Construction

Compiler Construction

Compiler I: Syntax Analysis

Front End: Lexical Analysis. The Structure of a Compiler

Computer Science Department Carlos III University of Madrid Leganés (Spain) David Griol Barres

Administrativia. Extra credit for bugs in project assignments. Building a Scanner. CS164, Fall Recall: The Structure of a Compiler

The Front End. The purpose of the front end is to deal with the input language. Perform a membership test: code source language?

CS S-01 Compiler Basics & Lexical Analysis 1

Question Bank. 10CS63:Compiler Design

Lexical Analysis/Scanning

Scanning. COMP 520: Compiler Design (4 credits) Alexander Krolik MWF 13:30-14:30, MD 279

CMPSCI 250: Introduction to Computation. Lecture #28: Regular Expressions and Languages David Mix Barrington 2 April 2014

Interpreter. Scanner. Parser. Tree Walker. read. request token. send token. send AST I/O. Console

Implementation of Lexical Analysis

Compiler course. Chapter 3 Lexical Analysis

Week 2: Syntax Specification, Grammars

2. Lexical Analysis! Prof. O. Nierstrasz!

CS164: Programming Assignment 2 Dlex Lexer Generator and Decaf Lexer

2. λ is a regular expression and denotes the set {λ} 4. If r and s are regular expressions denoting the languages R and S, respectively

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

The Language for Specifying Lexical Analyzer

CS606- compiler instruction Solved MCQS From Midterm Papers

Compiler Construction

Transcription:

CS412/413 Introduction to Compilers Tim Teitelbaum Lecture 2: Lexical Analysis 23 Jan 08

Outline Review compiler structure What is lexical analysis? Writing a lexer Specifying tokens: regular expressions 2

Simplified Compiler Structure Source code if (b == 0) a = b; Understand source code Optimize Intermediate code Intermediate code Assembly code cmp $0, ecx cmovz edx, ecx Generate assembly code 3

Simplified Front End Structure Source code (character stream) if (b == 0) a = b; Lexical Analysis Syntax Analysis Semantic Analysis Errors (incorrect program) Correct program (AST representation) 4

More Precise Front End Structure Source code (character stream) if (b == 0) a = b; Lexical Analysis Syntax Analysis Semantic Analysis Errors (incorrect program) Correct program (AST representation) Intermediate code Intermediate Code Generation 5

How It Works Source code (character stream) Token stream Abstract syntax tree (AST) Decorated AST if (b == 0) a = b; if ( b == 0 ) a = b ; == b 0 boolean == if int b int 0 if a = b int = int a lvalue int b Lexical Analysis Syntax Analysis (Parsing) Semantic Analysis 6

How It Works, cont. Decorated AST Intermediate code Intermediate code Assembly code boolean == int b int 0 t = (b ==0) jump t, L a = b label L t = (b ==0) jump t, L a = 0 label L if int = int a lvalue cmp $0,ecx cmovz $0,[ebp+8] int b Intermediate Code Generation Optimizations Machine Optimizations and Code Generation 7

First Step: Lexical Analysis Source code (character stream) Token stream if (b == 0) a = b; if ( b == 0 ) a = b ; Lexical Analysis Syntax Analysis Semantic Analysis 8

Tokens Identifiers: x y11 elsen _i00 Keywords: if else while break Constants: Integer: 2 1000-500 5L 0x777 Floating-point: 2.0 0.00020.02 1. 1e5 0.e-10 String: x He said, \ Are you?\ \n Character: c \000 Symbols: + * { } ++ < << [ ] >= Whitespace (typically recognized and discarded): Comment: /** don t change this **/ Space: <space> Format characters: <newline> <return> 9

Ad-hoc Lexer Hand-write code to generate tokens How to read identifier tokens? Token readidentifier( ) { String id = ; while (true) { char c = input.read(); if (!identifierchar(c)) return new Token(ID, id, linenumber); id = id + String(c); } } Problems How to start? What to do with following character? How to avoid quadratic complexity of repeated concatenation? How to recognize keywords? 10

Look-ahead Character Scan text one character at a time Use look-ahead character (next) to determine what kind of token to read and when the current token ends char next; while (identifierchar(next)) { id = id + String(next); next = input.read (); } e l s e n next (lookahead) 11

Ad-hoc Lexer: Top-level Loop class Lexer { InputStream s; char next; Lexer(InputStream _s) { s = _s; next = s.read(); } Token nexttoken( ) { if (identifierfirstchar(next)) return readidentifier(); if (numericfirstchar(next)) return readnumber(); if (next == \ ) return readstringconst(); } } 12

Problems Might not know what kind of token we are going to read from seeing first character if token begins with i is it an identifier? if token begins with 2 is it an integer constant? interleaved tokenizer code hard to write correctly, harder to maintain in general, unbounded lookahead may be needed 13

Issues How to describe tokens unambiguously 2.e0 20.e-01 2.0000 x \\ \ \ How to break up text into tokens if (x == 0) a = x<<1; if (x == 0) a = x<1; How to tokenize efficiently tokens may have similar prefixes want to look at each character ~1 time 14

Principled Approach Need a principled approach lexer generator that generates efficient tokenizer automatically (e.g., lex, flex, JLex) a.k.a. scanner generator Approach Describe programming language s tokens with a set of regular expressions Generate scanning automaton from that set of regular expressions 15

Language Theory Review Let Σ be a finite set Σ called an alphabet a Σcalled a symbol Σ* is the set of all finite strings consisting of symbols from Σ A subset L Σ* is called a language If L 1 and L 2 are languages, then L 1 L 2 is the concatenation of L 1 and L 2, i.e., the set of all pair-wise concatenations of strings from L 1 and L 2, respectively 16

Language Theory Review, ctd. Let L Σ* be a language Then L 0 = {} L n+1 = L L n for all n 0 Examples if L = {a, b} then L 1 = L = {a, b} L 2 = {aa, ab, ba, bb} L 3 = {aaa, aab, aba, aba, baa, bab, bba, bbb} 17

Syntax of Regular Expressions Set of regular expressions (RE) over alphabet Σ is defined inductively by Let a Σand R, S RE. Then: a RE ε RE RE R S RE RS RE R* RE In concrete syntactic form, precedence rules, parentheses, and abbreviations 18

Semantics of Regular Expressions Regular expression T RE denotes the language L(R) Σ* given according to the inductive structure of T: L(a) ={a} the string a L(ε) = { } L( ) = {} L(R S) = L(R) L(S) the empty string the empty set alternation L(RS) = L(R) L(S) concatenation L(R*) = { } L(R) L(R 2 ) L(R 3 ) L(R 4 ) Kleene closure 19

Simple Examples L(R) = the language defined by R L( abc ) = { abc } L( hello goodbye ) = {hello, goodbye} L( 1(0 1)* ) = all non-zero binary numerals beginning with 1 20

Convienent RE Shorthand R + one or more strings from L(R): R(R*) R? optional R: (R ε) [abce] one of the listed characters: (a b c e) [a-z] one character from this range: (a b c d e y z) [^ab] anything but one of the listed chars [^a-z] one character not from this range abc the string abc \( the character (... id=r named non-recursive regular expressions 21

More Examples Regular Expression R Strings in L(R) digit = [0-9] 0 1 2 3 posint = digit+ 8 412 int = -? posint -42 1024 real = int ((. posint)?) -1.56 12 1.0 = (- ε)([0-9]+)((. [0-9]+) ε) [a-za-z_][a-za-z0-9_]* C identifiers else the keyword else 22

How To Break Up Text elsen = 0; 1 else n = 0 2 elsen = 0 REs alone not enough: need rule(s) for disambiguation Most languages: longest matching token wins Ties in length resolved by prioritizing tokens Lexer definition = RE s + priorities + longestmatching-token rule + token representation 23

Historical Anomalies PL/I Keywords not reserved IF IF THEN THEN ELSE ELSE; FORTRAN Whitespace stripped out prior to scanning DO 123 I = 1 DO 123 I = 1, 2 By and large, modern language design intentionally makes scanning easier 24

Summary Lexical analyzer converts a text stream to tokens Ad-hoc lexers hard to get right, maintain For most languages, legal tokens are conveniently and precisely defined using regular expressions Lexer generators generate lexer automaton automatically from token RE s, prioritization Next lecture: how lexer generators work 25

Reading IC Language spec JLEX manual CVS manual Links on course web home page Groups If you haven t got a full group lined up, hang around and talk to prospective group members today 26