CS4120/4121 Introduction to Compilers. Andrew Myers.

CS4120/4121 Introduction to Compilers
Andrew Myers
Lecture 2: Lexical Analysis
31 August 2009

Outline
  Administration
  Compilation in a nutshell (or two)
  What is lexical analysis?
  Writing a lexer
  Specifying tokens: regular expressions
  Writing a lexer generator
  Converting regular expressions to nondeterministic finite automata (NFAs)
  NFA to DFA transformation

Administration
  HW1 out later today, due next Monday.

Compilation in a Nutshell 1
  Source code (character stream):   if (b == 0) a = b;
    Lexical analysis
  Token stream:   if  (  b  ==  0  )  a  =  b  ;
    Parsing
  Abstract syntax tree (AST):   [figure: tree built from the nodes if, ==, b, 0, =, a, b]
    Semantic analysis
  Decorated AST:   [figure: the same tree with annotations boolean ==, int b, int 0, int =, int a, int b, lvalue]

Compilation in a Nutshell 2
  Decorated AST:   [figure: tree with annotations boolean ==, int b, int 0, int =, int a, int b, lvalue]
    Intermediate code generation
  SEQ(CJUMP(TEMP(b) == 0, L1, L2), LABEL(L1), TEMP(a) = TEMP(b) LABEL(L2))
    Code generation
  cmp rb, 0
  jnz L2
  L1: mov ra, rb
  L2:
    Register allocation, optimization
  cmp ecx, 0
  cmovz [ebp+8],ecx

First step: lexical analysis
  Source code (character stream):   if (b == 0) a = hi ;
    Lexical analysis
  Token stream:   if  (  b  ==  0  )  a  =  hi  ;
    Parsing
    Semantic analysis

Tokens
  Identifiers:     x  y11  elsex  _i00
  Keywords:        if  else  while  break
  Integers:        2  1000  -500  5L
  Floating point:  2.0  0.00020  .02  1.  1e5  0.e-10
  Symbols:         +  *  {  }  ++  <  <<  [  ]  >=
  Strings:         "x"  "He said, \"Are you?\""
  Comments:        /** don't change this **/

Ad-hoc lexer
  Hand-write code to generate tokens.
  How to read identifier tokens?

    Token readIdentifier() {
        String id = "";
        while (true) {
            char c = input.read();
            if (!identifierChar(c))
                return new Token(ID, id, lineNumber);
            id = id + String(c);
        }
    }
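The readIdentifier loop above can be tried out with a small self-contained sketch. The class name IdentifierReaderSketch, its string-backed read() method, and the definition of identifierChar are assumptions made for illustration; the slide's Token class and line numbers are left out. The point it shows: the character that ends the identifier has already been read when the loop returns, so it is lost to whoever reads next.

```java
public class IdentifierReaderSketch {
    private final String text;   // stands in for the slide's input stream (assumption)
    private int pos = 0;

    IdentifierReaderSketch(String text) { this.text = text; }

    /** Returns the next character, or -1 at end of input. */
    int read() { return pos < text.length() ? text.charAt(pos++) : -1; }

    /** Hypothetical helper: letters, digits and underscore are identifier characters. */
    static boolean identifierChar(int c) {
        return c != -1 && (Character.isLetterOrDigit(c) || c == '_');
    }

    /** Same loop shape as the slide: read until a non-identifier character shows up. */
    String readIdentifier() {
        StringBuilder id = new StringBuilder();
        while (true) {
            int c = read();
            if (!identifierChar(c)) {
                return id.toString();   // c was consumed and is not given back
            }
            id.append((char) c);
        }
    }

    public static void main(String[] args) {
        IdentifierReaderSketch in = new IdentifierReaderSketch("elsex=0");
        System.out.println(in.readIdentifier()); // elsex
        System.out.println((char) in.read());    // 0 (the '=' was swallowed by the loop)
    }
}
```

Keeping the most recently read character in a variable instead of discarding it is one way around this.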

Look-ahead character
  Scan text one character at a time.
  Use look-ahead character (next) to determine what kind of token to read and when the current token ends.

    char next;
    while (identifierChar(next)) {
        id = id + String(next);
        next = input.read();
    }

  [figure: input characters e l s e x, with next marking the current look-ahead character]

Ad-hoc lexer: top-level loop

    class Lexer {
        InputStream s;
        char next;
        Lexer(InputStream s_) { s = s_; next = s.read(); }
        Token nextToken() {
            if (identifierChar(next))
                return readIdentifier();
            if (numericChar(next))
                return readNumber();
            if (next == '\"')
                return readStringConst();
        }
    }

  The constructor preloads next.
  Alternative: define input streams that support lookahead automatically.

Problems
  Don't know what kind of token we are going to read from seeing first character:
    if token begins with "i", is it an identifier?
    if token begins with "2", is it an integer constant?
  Interleaved tokenizer code is hard to write correctly, harder to maintain.
  A more principled approach: lexer generator that generates efficient tokenizer automatically (e.g., lex, JLex) from a lexical specification.

Issues
  How to describe tokens unambiguously
    2.e0  20.e-01  2.0000  x \\ \ \
  How to break text up into tokens
    if (x == 0) a = x<<1;
    if (x == 0) a = x<1;
  How to tokenize efficiently
    tokens may have similar prefixes
    want to look at each character O(1) times
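The look-ahead scheme from the top-level loop slide can be sketched as a small runnable class. This is an illustration under assumptions, not the course's lexer: LookaheadLexer reads from a String instead of an InputStream, returns tokens as printable strings rather than Token objects, and treats every other character as a one-character symbol, so keywords, "==" and string constants are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadLexer {
    private final String text;
    private int pos = 0;
    private int next;   // look-ahead character, -1 at end of input

    LookaheadLexer(String text) {
        this.text = text;
        advance();      // preload the look-ahead character, as in the slide's constructor
    }

    private void advance() {
        next = pos < text.length() ? text.charAt(pos++) : -1;
    }

    private static boolean identifierStart(int c) {
        return c != -1 && (Character.isLetter(c) || c == '_');
    }

    private static boolean identifierChar(int c) {
        return c != -1 && (Character.isLetterOrDigit(c) || c == '_');
    }

    private static boolean numericChar(int c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the next token as a printable string, or null at end of input. */
    String nextToken() {
        while (next != -1 && Character.isWhitespace(next)) advance();
        if (next == -1) return null;
        StringBuilder sb = new StringBuilder();
        if (identifierStart(next)) {
            // The token ends when the look-ahead is no longer an identifier character;
            // that character stays in next for the following token.
            while (identifierChar(next)) { sb.append((char) next); advance(); }
            return "ID(" + sb + ")";
        }
        if (numericChar(next)) {
            while (numericChar(next)) { sb.append((char) next); advance(); }
            return "INT(" + sb + ")";
        }
        char symbol = (char) next;   // simplification: any other character is a symbol
        advance();
        return "SYM(" + symbol + ")";
    }

    static List<String> tokenize(String text) {
        LookaheadLexer lexer = new LookaheadLexer(text);
        List<String> tokens = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("elsex=0;"));
        // [ID(elsex), SYM(=), INT(0), SYM(;)]
    }
}
```

Even at this size the token kinds are interleaved in one method, which is the maintenance problem the Problems slide points at.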

How to Describe Tokens
  Programming language tokens can be described using regular expressions.
  A regular expression R describes a set of strings L(R); L(R) is the language defined by R.
    L(abc) = { abc }
    L(hello | goodbye) = { hello, goodbye }
    L([1-9][0-9]*) = all positive integer constants
    L(X(Y | Z)) = L(XY | XZ) = L(XY) ∪ L(XZ)
  Idea: define each kind of token using an RE.

Regular expression notation
  a      an ordinary character stands for itself
  ε      the empty string
  R|S    any string from either L(R) or L(S): L(R|S) = L(R) ∪ L(S)
  RS     string from L(R) followed by one from L(S): L(RS) = { rs | r ∈ L(R) ∧ s ∈ L(S) }
  R*     zero or more strings from L(R), concatenated: ε | R | RR | RRR | RRRR ... ("Kleene star")

Examples
  Regular expression    Strings in L(R)
  a                     "a"
  ab                    "ab"
  a|b                   "a"  "b"
  (ab)*                 ""  "ab"  "abab"  ...
  (a|ε)b                "ab"  "b"   (= a?b)

Convenient RE Shorthand
  R+       one or more strings from L(R): = R(R*)
  R?       an optional R: = (R|ε)
  [abce]   one of the listed characters: (a|b|c|e)
  [a-z]    one char from the range: (a|b|c|d|e|...)
  [^ab]    anything but one of the listed chars
  [^a-z]   one character not from the range
  R{n}     n repetitions of R (RRRR...)
  \x0a     ASCII 10 (newline)
  \n       also newline
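The examples on these slides can be checked mechanically. The sketch below uses java.util.regex, which is an assumption of convenience: it is a backtracking engine with a larger syntax than the notation above, but the patterns used here stay within the common subset. The helper name inLanguage is hypothetical; it tests whether a whole string is in L(R).

```java
import java.util.regex.Pattern;

public class RegexNotationDemo {
    /** True if the whole string s is in L(regex). */
    static boolean inLanguage(String regex, String s) {
        return Pattern.matches(regex, s);   // matches the entire input, not a substring
    }

    public static void main(String[] args) {
        System.out.println(inLanguage("hello|goodbye", "goodbye")); // true
        System.out.println(inLanguage("[1-9][0-9]*", "412"));       // true
        System.out.println(inLanguage("[1-9][0-9]*", "042"));       // false: leading 0
        System.out.println(inLanguage("(ab)*", ""));                // true: zero repetitions
        System.out.println(inLanguage("(ab)*", "abab"));            // true
        System.out.println(inLanguage("a?b", "b"));                 // true: the a is optional

        // L(X(Y|Z)) = L(XY|XZ): both patterns agree on every sample string.
        for (String s : new String[] {"ab", "ac", "a", "bc"}) {
            System.out.println(inLanguage("a(b|c)", s) == inLanguage("ab|ac", s)); // true
        }
    }
}
```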

More Examples (JFlex)
  Regular expression              Strings in L(R)
  digit = [0-9]                   "0"  "1"  "2"  "3"
  posint = {digit}+               "8"  "412"
  int = -? {posint}               "-42"  "1024"
  real = {int} (. {posint})?      "-1.56"  "12"  "1.0"
       = (-|ε)(0|...|9)(0|...|9)*(ε | (. (0|...|9)(0|...|9)*))
  [a-zA-Z_][a-zA-Z0-9_]*          C identifiers

  Lexer generators support abbreviations, but abbreviations cannot be recursive.
  Forbidden: foo = a{foo} | ε

Zero-width assertions
  Not strictly regular expressions.
  Not supported by all lexer generators.
  ^R      matches R if preceded by newline
  R$      matches R if followed by newline
  \b      match a word boundary (Perl)
  \A      match beginning of input (Perl)
  R1/R2   matches R1 if followed by something matching R2 (lex)

How to break up text
  elsex = 0;
    1:  else  x  =  0
    2:  elsex  =  0
  REs alone are not enough: need a rule for choosing.
  Most languages: longest matching token wins, even if a shorter token is the only way to parse the tokens.
  Exception: early FORTRAN (totally whitespace-insensitive).
  Ties in length are resolved by prioritizing tokens.
  REs + priorities + longest-matching token rule = lexer definition

Lexer Generator Spec
  Input to lexer generator:
    list of regular expressions in priority order
    associated action for each RE (generates appropriate kind of token, other bookkeeping)
  Output:
    program that reads an input stream and breaks it up into tokens according to the REs
    (or reports a lexical error: "Unexpected character")
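The "REs + priorities + longest-matching token rule" recipe can be sketched directly, without generating automata. The code below is a rough illustration, not how lex or JFlex work internally: it tries every rule at the current position with java.util.regex and keeps the longest match, with earlier rules winning ties. The class name, rule names, and token format are all assumptions. One caveat: java.util.regex alternation is ordered, so lookingAt() is not guaranteed to return the longest match for a pattern such as a|ab; the rules here are written so that this does not arise.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchLexer {
    // Rules in priority order: earlier entries win ties in length.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WS", Pattern.compile("[ \\t\\n\\r]+"));
        RULES.put("ELSE", Pattern.compile("else"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        RULES.put("INT", Pattern.compile("0|[1-9][0-9]*"));
        RULES.put("SHL", Pattern.compile("<<"));
        RULES.put("SYM", Pattern.compile("[=;<()]"));
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(text).region(pos, text.length());
                // lookingAt() anchors the match at the start of the region.
                // The strict > keeps the earlier (higher-priority) rule on a tie.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestName = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
            if (!bestName.equals("WS")) {   // whitespace is matched but discarded
                tokens.add(bestName + "(" + text.substring(pos, pos + bestLen) + ")");
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("elsex = 0;")); // [ID(elsex), SYM(=), INT(0), SYM(;)]
        System.out.println(tokenize("else x<<1"));  // [ELSE(else), ID(x), SHL(<<), INT(1)]
    }
}
```

Here elsex becomes one identifier because ID matches five characters and ELSE only four, while else alone is a keyword because both rules match four characters and ELSE is listed first. This sketch re-scans the same characters once per rule, so it does not meet the goal of looking at each character O(1) times.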

Example: JFlex

    %%
    %line
    %column
    digits = 0|[1-9][0-9]*
    letter = [A-Za-z]
    identifier = {letter}({letter}|[0-9_])*
    whitespace = [\ \t\n\r]+
    %%
    {whitespace}  { /* discard */ }
    {digits}      { return new IntegerConstant(Integer.parseInt(yytext())); }
    "if"          { return new IfToken(); }
    "while"       { return new WhileToken(); }
    {identifier}  { return new IdentifierToken(yytext()); }

Lexer states
  Most lexer generators allow conditioning on lexer state.
  Helps with long tokens (strings, comments):

    "/*"  { yybegin(COMMENT); }
    <COMMENT> {
        "*/"  { yybegin(YYINITIAL); }
        .|\n  { /* ignore */ }
    }

Summary
  Lexical analyzer converts a text stream to tokens.
  Ad-hoc lexers are hard to get right and hard to maintain.
  For most languages, legal tokens are conveniently and precisely defined using regular expressions.
  Lexer generators generate lexer code automatically from token REs and precedence.
  Next lecture: how lexer generators work.

Groups
  If you don't have a full group lined up, hang around and talk to prospective group members.
  Send mail to cs4120-l if you still cannot make a full group (can also post to cornell.class.cs4120).
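The lexer-state rules from the Lexer states slide can be mimicked by hand to see what the generated code has to do. The sketch below is an assumption-laden illustration in plain Java, not JFlex output: the class name and stripComments method are hypothetical, and only the two states from the slide, YYINITIAL and COMMENT, are modelled.

```java
public class CommentStateSketch {
    enum State { YYINITIAL, COMMENT }

    /** Returns the input with block comments removed. */
    static String stripComments(String text) {
        StringBuilder out = new StringBuilder();
        State state = State.YYINITIAL;
        int i = 0;
        while (i < text.length()) {
            if (state == State.YYINITIAL) {
                if (text.startsWith("/*", i)) {
                    state = State.COMMENT;     // plays the role of yybegin(COMMENT)
                    i += 2;
                } else {
                    out.append(text.charAt(i));
                    i++;
                }
            } else {
                if (text.startsWith("*/", i)) {
                    state = State.YYINITIAL;   // plays the role of yybegin(YYINITIAL)
                    i += 2;
                } else {
                    i++;                       // like the .|\n rule: ignore one character
                }
            }
        }
        if (state == State.COMMENT) {
            throw new IllegalArgumentException("Unterminated comment");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "a =  b;" with two spaces, one from each side of the removed comment.
        System.out.println(stripComments("a = /* don't change\n this */ b;"));
    }
}
```

Because the comment body is consumed one character at a time inside its own state, no single regular expression has to describe a whole comment.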