Writing a Lexer CS F331 Programming Languages CSCE A331 Programming Language Concepts Lecture Slides Monday, February 6, 2017 Glenn G. Chappell Department of Computer Science University of Alaska Fairbanks ggchappell@alaska.edu © 2017 Glenn G. Chappell

Lua: Programming II Closures A closure is a function that carries with it (some portion of) the environment in which it was defined. In Lua, when we return a function from a function, we get a closure. A closure can form a simpler alternative to a traditional OO construction (class, objects), particularly when a class exists primarily to support a single member function. See prog2.lua. Closures are found in a number of PLs. Since the 2011 standard, C++ has had closures, in the form of lambda functions. See closure.cpp. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 2
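
Here is a minimal sketch of a Lua closure in the spirit of prog2.lua (the function make_counter and its details are illustrative, not taken from the posted code):

    -- make_counter: returns a function that, each time it is called,
    -- returns the next integer, starting at 1.
    -- The local variable n is captured in the returned closure.
    function make_counter()
        local n = 0
        return function()
            n = n + 1
            return n
        end
    end

    c1 = make_counter()
    print(c1())   -- 1
    print(c1())   -- 2
    c2 = make_counter()
    print(c2())   -- 1: c2 captures its own copy of n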

Lua: Programming III Coroutines [1/4] A coroutine is a function that can give up control (yield) at any point, and then later be resumed. Yielding typically involves sending a value back to the caller. Coroutines are available in Lua through the standard module coroutine. To write a coroutine, write an ordinary Lua function. Each time there is a value to send back to the caller, or when control should be given back to the caller, call coroutine.yield, passing the yielded value, if any. When the coroutine has finished, return as usual. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 3

Lua: Programming III Coroutines [2/4] To use a coroutine, first get the coroutine object, with coroutine.create. cor = coroutine.create(cfunc) Call coroutine.resume with the coroutine object to execute the coroutine. The first time resume is called, any additional parameters will be passed to the coroutine function. coroutine.resume returns an error code (true: okay, false: error) and the yielded value, if any. If the value of coroutine.status is not "dead", then the coroutine was successful, and any yielded value may be used. Then call coroutine.resume again to resume the coroutine and request another value, handling it as above. See prog3.lua. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 4
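
A minimal sketch of this create/resume pattern (the coroutine function three_values is illustrative, not from prog3.lua):

    -- three_values: coroutine function that yields 10, 20, 30.
    function three_values()
        coroutine.yield(10)
        coroutine.yield(20)
        coroutine.yield(30)
    end

    cor = coroutine.create(three_values)
    while true do
        local ok, value = coroutine.resume(cor)   -- ok: error code (true = okay)
        if coroutine.status(cor) == "dead" then
            break        -- finished normally, or an error occurred (ok is false)
        end
        print(value)     -- prints 10, then 20, then 30
    end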

Lua: Programming III Coroutines [3/4] Four questions from last time, about Lua coroutines. (1) coroutine.yield sends data out of a coroutine. Can we similarly send data into a coroutine? Yes! Any additional arguments passed to the 2nd and later calls to coroutine.resume (other than the coroutine object) become the return value(s) of coroutine.yield, inside the coroutine. (2) Can a coroutine be resumed, when its status is "dead"? No. To run the coroutine again, we must get a new coroutine object from coroutine.create. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 5

Lua: Programming III Coroutines [4/4] (3) Can a coroutine return a value? Yes! Any arguments to return become the values yielded when coroutine.status returns "dead". (4) Must we go through the rigamarole of checking both the error flag (ok) and coroutine.status? If there is an error (ok is false) then coroutine.status will return "dead". So only coroutine.status needs to be checked to determine whether to exit a loop going through the yielded values. We may wish to check ok after leaving the loop. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 6
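
A small sketch illustrating points (1) and (3) above: values passed to later calls to coroutine.resume come back out of coroutine.yield, and return values appear in the results of the final resume (the function echo_twice is illustrative):

    -- echo_twice: receives two values, one per resume, and prints them;
    -- returns a final value when done.
    function echo_twice(first)
        print("got", first)               -- first came from the 1st resume
        local second = coroutine.yield()  -- second comes from the 2nd resume
        print("got", second)
        return "all done"                 -- returned when the coroutine finishes
    end

    cor = coroutine.create(echo_twice)
    coroutine.resume(cor, "alpha")                  -- prints: got  alpha
    local ok, msg = coroutine.resume(cor, "beta")   -- prints: got  beta
    print(coroutine.status(cor), msg)               -- prints: dead  all done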

Lua: Programming III Custom Iterators [1/2] A Lua for-in loop uses an iterator. Here is the form of a custom iterator, using a simplified model.

    function XYZ(…)
        function iter(dummy1, dummy2)
            if … then
                return nil   -- Iterator exhausted
            end
            return ???       -- Next value (or values)
        end
        return iter, nil, nil
    end

6 Feb 2017 CS F331 / CSCE A331 Spring 2017 7

Lua: Programming III Custom Iterators [2/2] Here is an actual custom iterator, based on the simplified model. See prog3.lua.

    -- count: iterator. Given a, b. Counts from a up to b.
    function count(a, b)
        function iter(dummy1, dummy2)
            if a > b then
                return nil
            end
            local old_a = a
            a = a+1
            return old_a
        end
        return iter, nil, nil
    end

Code that uses this iterator:

    for i in count(2, 6) do
        io.write(i.." ")
    end
    io.write("\n")

The above prints: 2 3 4 5 6

6 Feb 2017 CS F331 / CSCE A331 Spring 2017 8

Overview of Lexing & Parsing Two phases: lexical analysis (lexing) and syntax analysis (parsing). The output of a parser is often an abstract syntax tree (AST). Specifications of these can vary. [Diagram: the character stream "cout << ff(12.6);" goes to the Lexer, which produces a lexeme stream: the same text split into lexemes, each categorized as id, op, lit, or punct. The lexeme stream goes to the Parser, which produces an AST or an error. The example AST is an expr (binop: <<) whose operands are an expr (id: cout) and an expr (funccall) applying an expr (id: ff) to an expr (numlit: 12.6).] 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 9

Introduction to Lexical Analysis Lexeme Categories [1/6] A lexer reads a character stream and outputs a lexeme stream. Each lexeme is generally placed into a category. We look at some common categories. [Diagram: the character stream "cout << ff(12.6);" goes to the Lexer, which produces a lexeme stream: the same text split into lexemes, each labeled id, op, lit, or punct.] 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 10

Introduction to Lexical Analysis Lexeme Categories [2/6] An identifier is a name a program gives to some entity: variable, function, type, namespace, etc. In the C++ code below, the identifiers are circled; they are MyClass, myfunc, type1, aa, bb, ii, and foo.

    class MyClass {
    public:
        void myfunc(type1 & aa, int bb) const {
            for (int ii = -37; ii <= bb; ++ii) {
                aa.foo();
            }
        }
    };

6 Feb 2017 CS F331 / CSCE A331 Spring 2017 11

Introduction to Lexical Analysis Lexeme Categories [3/6] A keyword is an identifier-like lexeme that has special meaning within a programming language. In the C++ code below, the keywords are circled; they are class, public, void, int, const, and for.

    class MyClass {
    public:
        void myfunc(type1 & aa, int bb) const {
            for (int ii = -37; ii <= bb; ++ii) {
                aa.foo();
            }
        }
    };

6 Feb 2017 CS F331 / CSCE A331 Spring 2017 12

Introduction to Lexical Analysis Lexeme Categories [4/6] An operator is a word that gives an alternate method for making what is essentially a function call. The arguments of an operator are called operands. Operators are often but not always placed between their operands; such an operator is an infix operator. The arity of an operator is the number of operands it has. A unary operator has one operand. A binary operator has two operands. In the following code, the operators are += and *, which are binary operators, and -, which is a unary operator. aaa += b * -c; 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 13

Introduction to Lexical Analysis Lexeme Categories [5/6] A literal is a bare value. Here are some C++ literals.

    Literal             Kind of Literal
    -100.8f             Numeric literal
    "Hello there!"      Character array literal
    'x'                 Character literal
    false               Boolean literal
    { 1, 2, 3 }         Initializer list literal (since C++11)

6 Feb 2017 CS F331 / CSCE A331 Spring 2017 14

Introduction to Lexical Analysis Lexeme Categories [6/6] Punctuation is the category for the extra lexemes in a program that do not fit into any of the previously mentioned categories. Punctuation in C++ includes braces ({ }), semicolons (;), an ampersand (&) indicating a reference, and a colon (:) after public or private. Lexeme categories mentioned: Identifier Keyword Operator Literal Punctuation 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 15

Introduction to Lexical Analysis Reserved Words [1/3] A reserved word is a word that fits the general specification of an identifier, but is not allowed as a legal identifier in a program. Note that, while this is an important concept, reserved word is not a lexeme category. In many PLs, the keywords and the reserved words are the same. However, it is not hard to envision a variant of (say) C in which the compiler could distinguish how a word is used, based only on its position in the code. Then there could be keywords that are not reserved words. Something like the following might be legal. for (for for = 10; for; --for) ; 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 16

Introduction to Lexical Analysis Reserved Words [2/3] Since the 2011 Standard, C++ has had two keywords that are not reserved words: override and final. These have special meaning when placed after the parentheses in a member-function declaration, but otherwise are simply identifiers. So the following is legal C++.

    class Derived : public Base {
        virtual void override() override;
            // Derived member function named "override"
            // Overrides Base member function "override"
    };

6 Feb 2017 CS F331 / CSCE A331 Spring 2017 17

Introduction to Lexical Analysis Reserved Words [3/3] The programming language Fortran traditionally has no reserved words. The following is, famously, legal code in at least some versions of Fortran. IF IF THEN THEN ELSE ELSE On the other hand, there can be reserved words that are not keywords. The Java standard specifies that goto is a reserved word. However, it is not a keyword. Thus, this word cannot legally be included in a Java program at all. Note: We will use the above definitions consistently in this class. Be aware, however, that the term keyword is sometimes used to mean what we mean by reserved word. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 18

Introduction to Lexical Analysis Lexer Operation There are essentially three ways to write a lexer. Automatically generated, based on regular grammars or regular expressions for each lexeme category. Hand-coded state machine using a table. Entirely hand-coded state machine. The first method might involve a software package like lex, which generates C code for a lexer, given input that consists mostly of regular expressions. We will write a lexer using the last method. When we are done, it should not be difficult to see how we might have used a table instead. A lexer outputs a series of lexemes. There is generally no need to store these lexemes in a data structure. Rather, the lexer can provide get-next-lexeme functionality, which the parser can then use. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 19

Writing a Lexer Introduction We wish to write a lexer in Lua for a hypothetical PL. Lexemes are described on the In-Class Lexeme Specification. Identifiers are essentially as in C/C++. No lexeme includes any whitespace. All whitespace will usually be treated the same. Lexemes may be arbitrarily long. There might not be any delimiter between lexemes. Comments are like multiline C/C++ comments. The keywords and reserved words are the same. Note that there are PLs in which some of the above do not apply. In Python, Haskell, JavaScript, and Go, a newline can sometimes serve as something like an end-of-statement marker. In Forth, consecutive lexemes are always separated by whitespace. Lua uses different syntax for comments. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 20

Writing a Lexer Design Decisions [1/2] We will write a Lua module, lexer, with the following interface. The module has a function lexer.lex, which is given a string (the program) and returns an iterator that goes through its lexemes. The iterator returns pairs: string and number. The string is the lexeme itself. The number represents the lexeme category. The category is an index for table lexer.catnames, whose values will be human-readable string forms of the category names. So the following prints the text & category of all lexemes in program.

    lexer = require "lexer"
    for lexstr, cat in lexer.lex(program) do
        catstr = lexer.catnames[cat]
        io.write(string.format("%-10s %s\n", lexstr, catstr))
    end

6 Feb 2017 CS F331 / CSCE A331 Spring 2017 21

Writing a Lexer Design Decisions [2/2] The function returned by lexer.lex will be a closure. Whatever data are necessary for lexing (the kind of thing we might store as data members in a C++/Java object) will be stored in this closure. Lua has no character type. We represent a character as a string of length one. I have written some character-testing functions (isletter, isdigit, iswhitespace). Each of these takes a string parameter. Each returns false if this parameter does not have length exactly one. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 22
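
A minimal sketch of what such character-testing helpers might look like (these definitions are assumptions; the versions provided with lexer.lua may differ in detail):

    -- isletter: true if c is a string of length 1 holding an ASCII letter.
    function isletter(c)
        if type(c) ~= "string" or c:len() ~= 1 then
            return false
        end
        return (c >= "A" and c <= "Z") or (c >= "a" and c <= "z")
    end

    -- isdigit: true if c is a string of length 1 holding a digit.
    function isdigit(c)
        if type(c) ~= "string" or c:len() ~= 1 then
            return false
        end
        return c >= "0" and c <= "9"
    end

    -- iswhitespace: true if c is a string of length 1 holding a
    -- space, tab, newline, carriage return, form feed, or vertical tab.
    function iswhitespace(c)
        if type(c) ~= "string" or c:len() ~= 1 then
            return false
        end
        return c == " " or c == "\t" or c == "\n" or c == "\r"
            or c == "\f" or c == "\v"
    end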

Writing a Lexer Coding a State Machine I Core Functionality Internally, our lexer will run as a state machine. A state machine has a current state, stored in variable state. It proceeds in a series of steps. At each step, it looks at the current character in the input and the current state. It then decides what state to go to next.

Variables & Utility Functions The input stream is in the given string: prog. The index of the next character to read is stored in variable pos, which starts at 1, since this is Lua. The state starts at START. The lexeme we are building is in string lexstr, which starts as an empty string (""). To add the current character to the lexeme, call add1(). To skip the character without adding it, call drop1(). When a complete lexeme has been found, set state to DONE, and set category appropriately (lexer.id, lexer.key, etc.). 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 23
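
A rough sketch of the state-machine loop these variables and utilities support, assuming illustrative state constants and category constants like lexer.id (the constant lexer.punct and the name getLexeme are guesses; the real lexer.lua has many more states and categories and is organized its own way):

    -- Shared state of the closure returned by lexer.lex:
    --   prog   the program text         pos     index of the next character
    --   state  current machine state    lexstr  lexeme built so far
    --   category  category of the completed lexeme

    local START, LETTER, DONE = 1, 2, 3    -- illustrative state names

    local function add1()       -- add the current character to the lexeme; advance
        lexstr = lexstr .. prog:sub(pos, pos)
        pos = pos + 1
    end

    local function drop1()      -- skip the current character without adding it
        pos = pos + 1
    end

    local function getLexeme()  -- the iterator: return the next lexeme + category
        while iswhitespace(prog:sub(pos, pos)) do
            drop1()             -- skip whitespace (the real lexer also skips comments)
        end
        if pos > prog:len() then
            return nil          -- input exhausted: end the for-in loop
        end
        lexstr = ""
        state = START
        while state ~= DONE do
            local ch = prog:sub(pos, pos)     -- "" if we are past the end
            if state == START then
                if isletter(ch) or ch == "_" then
                    add1(); state = LETTER
                else                          -- digits, operators, etc. go here
                    add1(); state = DONE; category = lexer.punct
                end
            elseif state == LETTER then       -- building an identifier or keyword
                if isletter(ch) or isdigit(ch) or ch == "_" then
                    add1()
                else
                    state = DONE; category = lexer.id   -- or lexer.key, if a keyword
                end
            end
        end
        return lexstr, category
    end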

Writing a Lexer Coding a State Machine I Issues [1/3] We need to be careful about invariants: statements that are always true at a particular point in a program. What should we expect to be true about variables (pos, in particular) when our iterator function is called? Whatever we decide, we need to ensure that it is true when this function returns. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 24
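
For instance, one reasonable invariant (an assumption for this sketch, not necessarily the exact one chosen in class) is that pos always indexes the first character not yet consumed whenever the iterator function is entered or returns:

    -- Invariant, on entry to and on return from the iterator function:
    --   1 <= pos and pos <= prog:len() + 1
    --   prog:sub(1, pos-1) has already been consumed (lexed or skipped)
    assert(1 <= pos and pos <= prog:len() + 1)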

Writing a Lexer Coding a State Machine I Issues [2/3] We need to be clear about what happens when we read past the end of the input. We use the string function sub to get single characters out of the input string. This function returns the empty string when it is asked to read past the end. And an empty string will always result in false when passed to a character-testing function, or when equality-compared with any single character. So anything we attempt to check about a past-the-end character will be false. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 25
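
A quick illustration of this behavior (assuming an isletter helper that, as described, returns false for anything that is not a single character):

    s = "abc"
    print(s:sub(4, 4))            -- prints an empty line: the result is ""
    print(s:sub(4, 4) == "x")     -- false: "" is not equal to any single character
    print(isletter(s:sub(4, 4)))  -- false: "" does not have length exactly 1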

Writing a Lexer Coding a State Machine I Issues [3/3] I follow the convention that each state is named after a short string that will put me in that state. If we have read a, then we are in state LETTER, since we have read a single letter. As we write a state machine, an important question is when do we add a new state? A good guiding principle: Two situations can be handled by the same state if they would react identically to all future input. Continuing from above, we have read a and are in state LETTER. Suppose the next character is 3. Are we still in state LETTER? Applying the above principle: yes. Because whatever follows a3, we handle it the same as we would if it followed a. For example, a3_xq6 is an identifier; and so is a_xq6. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 26

Writing a Lexer Coding a State Machine I CODE A skeleton for a lexer is in lexer.lua. This needs to be turned into a complete lexer for the lexemes specified in the In-Class Lexeme Description. We expanded lexer.lua to handle some of the lexeme categories. The lexer is still unfinished. It should be completed next time. See lexer.lua. I have also posted a simple main program for the lexer. See uselexer.lua. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 27

Writing a Lexer TO BE CONTINUED Writing a Lexer will be continued next time. 6 Feb 2017 CS F331 / CSCE A331 Spring 2017 28