
Compiler Construction D7011E. Lecture 2: Lexical Analysis. Viktor Leijon. Slides largely by Johan Nordlander, with material generously provided by Mark P. Jones.

Basics of Lexical Analysis


Some definitions: Language = Syntax + Semantics; Syntax = Concrete + Abstract; Semantics = Static + Dynamic. Concrete syntax: the representation of a program text in its source form, as a sequence of bits/bytes/characters/lines.

Abstract syntax: the representation of program structure, independent of its written form.

Syntax Analysis: This is one of the areas where theoretical computer science has had a major impact on the practice of software development.


Syntax Analysis: the character stream "i f   x > 0   t h e n   b r e a k ;   e l s e   y = z + 2 ;" is first grouped into the token stream "if x > 0 then break ; else y = z + 2 ;", and then structured into a syntax tree: an if-node whose condition is x > 0, whose then-branch is break, and whose else-branch is the assignment y = z + 2.

Separate Lexical Analysis? It isn't necessary to separate lexical analysis from parsing, but it does have several potential benefits:
- Simpler design (separation of concerns);
- Potentially more efficient (parsing often uses more expensive techniques than lexical analysis);
- Isolates machine/character set dependencies;
- Good tool support.
Modern language specifications often separate lexical and grammatical syntax.

Examples. Plain English: the character sequence "M y  d o g  i s  l a z y" groups into the words "My dog is lazy". Mathematics: "2 m y + s i n ( α i ) - 1" groups into the tokens of the expression 2my + sin(αi) - 1. Morse code: runs of dots and dashes group into letters.


Lexical Analysis: A lexer carries out lexical analysis, transforming a character stream into a token stream. Goal: to recognize and identify the sequence of tokens represented by the characters in a program text. The definition of tokens, i.e., the lexical structure, is an important part of a language specification.

Basic Terminology: Lexeme: a particular sequence of characters that might appear together in an input stream as the representation of a single entity. Some lexemes in Java:
0.0  3.14  1e-97d  true  false  "false"  "hello world!"  if  then  ;  {  }  ';'  'a'  '\n'  class  String  main  +  &&  0xC0B0L  2000  temp
Token: a lexeme treated as a data atom.

Basic Terminology: Token type: a name for a group of lexemes (a description of what each lexeme represents). In Java:
- 0.0  3.14  1e-97d are double literals
- 0xC0B0L  2000 are integer literals
- true  false are boolean literals
- "false"  "hello world!" are string literals
- ';'  'a'  '\n' are character literals
- if  then  static are keywords
- ;  {  } are separators
- +  && are operators
- String  main  temp are identifiers
Tokens/lexemes are normally chosen so that each lexeme has just one token type.
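The code sketches later in these slides refer to token codes such as IDENT, SEMI and ENDINPUT. As a minimal sketch, plain integer constants suffice; the particular names and values below are illustrative, not taken from the course code:

    // Illustrative token codes for the integer-coded scheme used below.
    class Tokens {
        static final int ENDINPUT = 0;   // end of the input stream
        static final int IDENT    = 1;   // identifiers: String, main, temp
        static final int INTLIT   = 2;   // integer literals: 0xC0B0L, 2000
        static final int DBLLIT   = 3;   // double literals: 0.0, 3.14, 1e-97d
        static final int SEMI     = 4;   // the ';' separator
        static final int PLUS     = 5;   // the '+' operator
    }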

Basic Terminology: Pattern: a description of the way that lexemes are written. In Java: "an identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter. An identifier cannot have the same spelling as a keyword." We'll see how to make this kind of thing more precise soon.
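As a first taste of that precision, the identifier pattern can already be approximated with a regular expression. A sketch, assuming plain ASCII letters only (the real Java definition uses Character.isJavaIdentifierStart/isJavaIdentifierPart, which accept far more characters):

    import java.util.regex.Pattern;

    class IdentifierPattern {
        // ASCII approximation: a letter, then any number of letters or digits.
        static final Pattern IDENT = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

        public static void main(String[] args) {
            System.out.println(IDENT.matcher("temp").matches());   // true
            System.out.println(IDENT.matcher("2000").matches());   // false
        }
    }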

Token Attributes: Many token types have associated attributes:
- In many cases, the lexeme itself might be used as an attribute;
- For literals and constants, the corresponding value might be treated as an attribute;
- For error reporting, we might include positional information: name of source file, line/column number, etc.

Other Input Elements: Other elements that may appear in the input stream include:
- Whitespace: spaces, tabs, newline characters, etc., which typically have no significance in the language (other than as token separators);
- Illegal characters, which should not appear in any input;
- Comments, in various flavors.
These are filtered out during lexical analysis, and not passed as tokens to the parser.

Common Comment Types:
- Nesting brackets: (* Pascal, Modula-2, ML *), { Pascal }, {- Haskell -}
- Non-nesting brackets: /* Prolog, Java, C++, C */
- Single line: // C++, Java; -- occam, Haskell; ; Scheme, Lisp; % Prolog; # csh, bash, sh, make; C Fortran; REM Basic
Can you think of another?

Representing Token Streams: We could construct a new data object for each lexeme that we find, and build an array that contains all of the tokens for a given source, in order. Different types of token object are needed for different types of token, since the number and type of attributes vary. In practice, many compilers do not build token objects; instead they expect tokens to be read one at a time, on demand.
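A sketch of the first approach, one object per token; the field names here are illustrative, but the attributes are the ones discussed above:

    // One possible token object: a token code plus its attributes.
    class Token {
        final int type;       // a token code such as IDENT or SEMI
        final String lexeme;  // the matched text, if any
        final int row, col;   // position, for error reporting

        Token(int type, String lexeme, int row, int col) {
            this.type = type;
            this.lexeme = lexeme;
            this.row = row;
            this.col = col;
        }
    }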

A simple lexer (Java-like...):

    // Read the next token: advance to the next lexeme
    // and return a code for the token type.
    int nextToken();

    // Returns the token code for the current lexeme.
    int getToken();

    // Returns the text (if any) for the current lexeme.
    String getLexeme();

    // Returns the position of the current lexeme.
    Position getPos();

Recognizing Identifiers: Suppose that c contains the character at the front of the input stream:

    if (isIdentifierStart(c)) {
        do {
            c = readNextInputChar();
        } while (c != EOF && isIdentifierPart((char)c));
        return IDENT;
    }

IDENT is a symbolic constant, defined elsewhere, to represent identifier tokens.

Buffering: Input characters must be stored in a buffer, because we often need to store the characters that constitute a lexeme until we find its end. (Figure: the buffered input "i f   x > 0   t h e n   b r e", with shading indicating input that has been read and markers for the start and end of the current lexeme.)

Buffering: Input characters must be stored in a buffer, because we might not know that we've reached the end of a token until we've read the following character: in "i f   c o u n t > 0   t h e n", the identifier count ends only once the following '>' has been read.

Buffering: Input characters must be stored in a buffer, because we might need to look ahead to see what tokens are coming. The classic case is Fortran 77, where blanks are insignificant: "do 10 c = 1,10" is the header of a DO loop, but "do 10 c = 1.10" is an assignment to a variable named do10c, and the lexer cannot tell which until it reaches the ',' or the '.'. Now generally considered an example of really bad design!

Impact on Language Design: In some languages, only the first 32 characters of an identifier are significant, string literals cannot have more than 256 characters, etc.; this puts an upper bound on the size of a buffer. In most languages, only one or two characters of lookahead are required for lexical analysis. In most languages, whitespace cannot appear in the middle of a token.

Buffering in our Java-like lexer:

    BufferedReader src;        // source text, read line by line
    int row = 0;
    int col = 0;
    String line;               // the current input line
    int c;                     // the current input character

    void nextLine() {
        line = src.readLine();
        col = -1;
        row++;
        nextChar();
    }

    int nextChar() {
        if (line == null) {
            c = EOF;
        } else if (++col >= line.length()) {
            c = EOL;
        } else {
            c = line.charAt(col);
        }
        return c;
    }

Basic nextToken() function:

    int nextToken() {
        for (;;) {
            lexemeText = null;
            switch (c) {
                case EOF  : token = ENDINPUT; return token;
                case EOL  : nextLine(); break;
                case ' '  : nextChar(); break;
                case '\t' : nextChar(); break;
                case ';'  : nextChar(); token = SEMI; return token;
                // ... and so on for the other token types
            }
        }
    }
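To see how these pieces fit together, here is a possible driver loop. It assumes the fields and methods above live in a class Lexer whose constructor primes the buffer with nextLine(); that framing, and the Tokens class, are assumptions for this illustration:

    import java.io.BufferedReader;
    import java.io.FileReader;

    class Main {
        public static void main(String[] args) throws Exception {
            // Hypothetical usage: pull tokens on demand until end of input.
            Lexer lexer = new Lexer(new BufferedReader(new FileReader(args[0])));
            while (lexer.nextToken() != Tokens.ENDINPUT) {
                System.out.println(lexer.getToken() + " " + lexer.getLexeme());
            }
        }
    }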

How to identify Identifiers: The current input line also serves as a buffer for the lexeme text:

    c = line.charAt(col);
    if (isIdentifierStart(c)) {
        int start = col;
        do {
            c = (++col < line.length()) ? line.charAt(col) : EOL;
        } while (c != EOL && isIdentifierPart((char)c));
        lexemeText = line.substring(start, col);
        return IDENT;
    }

If there are lexemes that span multiple lines, then additional buffering must be provided.

Towards more structured lexing

Recognizing Identifiers: In Java: "an identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter. An identifier cannot have the same spelling as a keyword." How can we make this pattern more precise?

Recognizing Identifiers: Compare with the concrete implementation:

    c = line.charAt(col);
    if (isIdentifierStart(c)) {
        int start = col;
        do {
            c = (++col < line.length()) ? line.charAt(col) : EOL;
        } while (c != EOL && isIdentifierPart((char)c));
        lexemeText = line.substring(start, col);
        return IDENT;
    }

How can we avoid unnecessary details?

Recognizing Identifiers: Idea: describe the implementation using a state machine notation. (Diagram: a start state with a "letter" transition to an accepting state, which loops back to itself on "letter" or "digit".) If the token for every pattern were described as a state machine, implementation would be much easier. But there are some problems lurking...
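To make the notation concrete, the identifier machine above can be run directly as a loop over states. A hand-coded sketch of the idea, not the notation a generator would accept:

    class IdentifierMachine {
        // State 0 = start; state 1 = after the first letter (accepting).
        static boolean isIdentifier(String s) {
            int state = 0;
            for (int i = 0; i < s.length(); i++) {
                char ch = s.charAt(i);
                if (state == 0 && Character.isLetter(ch)) {
                    state = 1;                 // first letter seen
                } else if (state == 1 && Character.isLetterOrDigit(ch)) {
                    state = 1;                 // stay in the accepting state
                } else {
                    return false;              // no transition: reject
                }
            }
            return state == 1;                 // accept iff we end in state 1
        }
    }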

Maximal Munch: A widely used convention, the rule of the longest lexeme: if two different lexemes match at the beginning of the input stream, always choose the longer alternative. For example, "f o r ( i = 0" yields the keyword for, but "f o r w a r d" yields the single identifier forward rather than for followed by ward. Another classic: in C, "x + + + + + y" (x+++++y) is tokenized as x ++ ++ + y, which does not parse, even though x ++ + ++ y would be valid.


Implementation: Remember the last state and input position where a valid lexeme was detected. Continue reading ahead to look for a longer lexeme with the same prefix. If you find a longer one, update the state and position, and continue. If you don't find a longer one, go back to the last valid lexeme. (N.B. buffers play an important role here.)
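A concrete sketch of this recipe for just the two C tokens '+' and '++' (the method name longestPlus is invented for this illustration): scan ahead, remember the end of the last valid lexeme, and fall back to it when no longer match appears:

    class MaximalMunch {
        // Returns the position just after the longest '+'/'++' lexeme
        // starting at 'start', or -1 if there is none.
        static int longestPlus(String in, int start) {
            int lastEnd = -1;                          // last valid lexeme end
            if (start < in.length() && in.charAt(start) == '+') {
                lastEnd = start + 1;                   // '+' is a valid lexeme
                if (start + 1 < in.length() && in.charAt(start + 1) == '+')
                    lastEnd = start + 2;               // '++' is longer: keep it
            }
            return lastEnd;
        }

        public static void main(String[] args) {
            String s = "+++++";                        // the x+++++y classic
            int pos = 0;
            while (pos < s.length()) {
                int end = longestPlus(s, pos);
                System.out.println(s.substring(pos, end));
                pos = end;
            }                                          // prints: ++ ++ +
        }
    }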

Non-determinism: The lexeme if looks a lot like an identifier. (Diagram: from the start state, one branch reads 'i' then 'f' and accepts IF; another reads any letter, then loops on letters and digits, and accepts IDENT.) What do we do if the first character is 'i'? And what if the first character is 'i', we follow the top branch, and the next character is 'b'?

Solution 1: Backtracking. When faced with multiple alternatives:
- Explore each alternative in turn;
- Pick the alternative that leads to the longest lexeme.
Again, buffers play a key role. Complex to program and potentially expensive to execute.

Solution 2: Postprocessing. To begin with, just recognize if as a normal identifier. But before we return identifiers as tokens, check them against a built-in table of keywords, and return a different type of token as necessary. Simple and efficient (as long as we can look up entries in the keyword table without too much difficulty), but not always applicable.
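A minimal sketch of that check, assuming integer token codes; the table contents and code values here are illustrative:

    import java.util.Map;

    class Keywords {
        static final int IDENT = 1, IF = 10, THEN = 11, ELSE = 12;

        // The built-in keyword table: lexeme -> keyword token code.
        static final Map<String, Integer> TABLE =
            Map.of("if", IF, "then", THEN, "else", ELSE);

        // Called after an identifier-shaped lexeme has been read.
        static int classify(String lexeme) {
            return TABLE.getOrDefault(lexeme, IDENT);
        }
    }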

Solution 3: Delay the Decision! Find an equivalent, deterministic machine. (Diagram: from the start, 'i' leads to a state that accepts IDENT; from there, 'f' leads to a state that accepts IF, while any other letter or digit leads to the general identifier state; any letter except 'i' leads directly to the identifier state, which loops on letters and digits and accepts IDENT; from the IF state, any further letter or digit also moves to the identifier state.) Recognizes the same set of lexemes, without ambiguity. Hard to get the right machine by hand...

Another Example: How do we recognize a comment? It begins with /*. It ends with */. Any characters can appear in between (well, almost any characters).

Naïve Description: A simple first attempt to recognize comments might lead to the following state machine: (Diagram: transitions on '/', then '*', then a loop on any character, then '*', then '/'.) But this machine can fail if we follow the second '*' branch too soon, e.g. on the comment /* y = x * z */.


A More Careful Approach: The previous state machine was non-deterministic: there is a choice of distinct successor states for the character '*', and we can't tell which branch to take without looking ahead. An equivalent, deterministic machine: (Diagram: '/', then '*', then a state that loops on anything but '*'; on '*' it moves to a state that loops on further '*'s, exits on '/', and otherwise falls back to the main loop.)

Code to Read a Comment:

    if (c == '/') {
        nextChar();
        if (c == '*') {                    // skip bracketed comment
            nextChar();
            for (;;) {
                if (c == '*') {
                    do { nextChar(); } while (c == '*');
                    if (c == '/') {
                        nextChar();
                        return;            // end of comment reached
                    }
                }
                if (c == EOF) {
                    // ... report an unterminated comment ...
                }
                if (c == EOL) nextLine(); else nextChar();
            }
        }
    }

Further complications: error handling.

Handwritten Lexical Analyzers: Don't require sophisticated programming. Often require care to avoid non-determinism, or potentially expensive backtracking. Can be fine-tuned for performance and for the language concerned. But this might also be something we would want to automate...

Can a Machine do Better? It can be hard to write a (correct) lexer by hand. But that's not surprising: finite state machines are low level, an assembly language of lexical analysis. Can we build a lexical analyzer generator that will take care of all the dirty details, and let the humans work at a higher level? If so, what would its input look like?

The lex Family: lex is a tool for generating C programs that implement lexical analyzers. It can also be used as a quick way of generating simple text-processing utilities. lex dates from the mid-seventies and has spawned a family of clones: flex, ML-Lex, JLex, JFlex, ANTLR, etc. lex is based on ideas from the theory of formal languages and automata.

Reading instructions:
- Chapter 3 is on lexical analysis. You should have read up to 3.4 before this lecture.
- Next lecture covers the rest of chapter 3. If you want to move ahead, you can start reading chapter 4 on parsing (4.1-4.4 is a good start).
Please get started. Deadlines for the project:
- Phase 0: Sept. 13
- Phase 1: Sept. 24 (Lexing!)

Thank you.

An APL Example: a multiplication table. (The APL source glyphs on this slide did not survive transcription.)