MP 3 A Lexer for MiniJava

CS 421, Spring 2012
Revision 1.0
Assigned: Wednesday, February 1, 2012
Due: Tuesday, February 7, at 09:30
Extension: 48 hours (penalty 20% of total points possible)
Total points: 43

1 Change Log

1.1 Clarified the definition of Problem 3.
1.0 Initial release.

2 Overview

After completing this MP, you should understand how to implement a practical lexer using a lexer generator such as Lex. Hopefully you will also gain an appreciation for the availability of lexer generators, instead of having to code a lexer completely from scratch. The language we are writing a lexer for is called MiniJava, which is essentially a subset of Java.

3 Collaboration

Collaboration is allowed on this assignment.

4 Overview of Lexical Analysis (Lexing)

Recall from lecture that the process of transforming program code (i.e., ASCII or Unicode text) into an abstract syntax tree (AST) has two parts. First, the lexical analyzer (lexer) scans the text of the program and converts it into a sequence of tokens, usually as values of a user-defined datatype. These tokens are then fed into the parser, which builds the AST. Note that it is not the job of the lexer to check for correct syntax; that is done by the parser. In fact, our lexer will accept (and correctly tokenize) strings such as

    if if == if if else

which are not valid programs.

5 Lexer Generators

The tokens of a programming language are specified using regular expressions, so the lexing process involves a great deal of regular-expression matching. It would be tedious to take the specification for the tokens of our language, convert the regular expressions to a DFA, and then implement the DFA in code to actually scan the text. Instead, most languages come with tools that automate much of the process of implementing a lexer. To implement a lexer with these tools, you simply define the lexing behavior in the tool's specification language. The tool then compiles your specification into source code for an actual lexer that you can use. In this MP, we will use a tool called ocamllex to build our lexer.
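As noted in Section 4, tokens are values of a user-defined datatype. For this MP the actual type is defined in mp3common.mli (see Section 6); the following is only a hypothetical sketch of what such a datatype can look like, using a few of the constructor names from the table in Problem 1:

    (* Hypothetical sketch of a token datatype. The real definition lives
       in mp3common.mli and has many more constructors. *)
    type token =
      | IF | ELSE | WHILE | LPAREN | RPAREN | SEMICOLON | EQ | EOF
      | INTEGER_LITERAL of int
      | FLOAT_LITERAL of float
      | BOOLEAN_LITERAL of bool
      | STRING_LITERAL of string
      | IDENTIFIER of string

With such a type, the ill-formed string above would still lex happily, to something like IF; IF; EQ; EQ; IF; IF; ELSE (there is no == token in our MiniJava, so == lexes as two EQ tokens).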

5.1 ocamllex specification

ocamllex is documented here: http://caml.inria.fr/pub/docs/manual-ocaml/manual026.html

What follows below is only the short version. If it doesn't make sense, or you need more details, consult the link above. You will need to become especially familiar with ocamllex's regular expression syntax.

An ocamllex lexer specification is slightly reminiscent of an OCaml match expression:

    rule myrule = parse
        regex1 { action1 }
      | regex2 { action2 }

where rule and parse are required words, but myrule can be any OCaml identifier (as long as it starts with a lower-case letter). By the way, even though ocamllex uses the word parse, it is a misnomer; ocamllex generates lexers, not parsers.

When this specification is compiled, it creates a function called myrule that does the lexing. Whenever myrule finds something that matches regex1, it consumes that part of the input and returns the result of evaluating the expression action1. In our code, the lexing function should return the token it finds. The token type is described in the problem descriptions below, and will also be included in the file mp3common.mli.

The syntax and meaning of regular expressions are given in the ocamllex manual, and are described in Lecture 6. An action is any expression of type token, giving the token to be returned when the corresponding regular expression is matched. For example, if the action is { LSHIFT }, the lexer will return the LSHIFT token.

Within an action, you may refer to the variable lexbuf, which represents the part of the input remaining after the token is removed. Thus, if you simply want to ignore the token (as you might want to do, for example, with comments), the action would be myrule lexbuf, meaning: scan the remaining input and return whatever token it returns.

To bind part of a matched expression to a variable that you can use in the action, use the as keyword. For example:

    '+' (_* as s) '+' { s }

recognizes strings surrounded by +, and returns the characters inside. Here is a quick example:

    rule tokenize = parse
        [' ' '\t' '\n']       { tokenize lexbuf }
      | "<<"                  { LSHIFT }
      | (['x' 'y' 'z']+ as s) { IDENTIFIER s }
      | _                     { failwith ("Illegal character") }

The function tokenize matches the following: (1) a single whitespace character; it ignores it and returns the token obtained from the remainder of the input (which may also begin with whitespace that will be ignored similarly); or (2) a left-shift operator; it returns the LSHIFT token; or (3) a sequence of one or more of the characters x, y, and z; it returns an IDENTIFIER token constructed from the matched sequence. In all other cases, it raises an exception.

You can also define multiple lexing functions (they are referred to as "entrypoints"); see the online documentation for more details. From the action of one rule, you can then call a different lexing function. Think of the lexer as a whole as a big state machine, where you can change lexing behavior based on the state you are in (and transition to a different state after seeing a certain token). This is convenient for defining different behavior when lexing inside comments, strings, etc. The functions you define in this way can also have additional arguments, as in:

    rule aux n = parse
        regex1 { aux (n+1) lexbuf }

Technically, the lexing function tokenize created by the example above has type lexbuf -> token. How do you turn the input string into a lexbuf? There is a function for that, Lexing.from_string : string -> lexbuf. But you don't have to worry about it, because we provide the following functions (in mp3lex-skeleton.mll):

    (* lextest : string -> token *)
    let lextest s = tokenize (Lexing.from_string s)

    (* get_all_tokens : string -> token list *)
    let get_all_tokens s =
        let buf = Lexing.from_string (s ^ "\n") in
        let rec aux () =
            match tokenize buf with
              EOF -> []
            | t -> t :: aux ()
        in aux ()

lextest s returns the first token from s, while get_all_tokens s returns the list of all tokens in s. (The definition of get_all_tokens, or more specifically of aux, may be confusing, because we just keep calling aux () repeatedly and it is not clear why we are not in an infinite loop. The call tokenize buf not only finds the next token in buf, it also has a side effect on buf, removing the matched characters. Thus, in effect, each time this call is made, buf gets shorter, until it only contains the EOF character.)

6 Provided Code

mp3common.mli is a special file containing the definition of tokens, posted separately on the MP3 page. Please examine but do not modify it.

mp3common.cmi is a binary version of the same file. It must be in the folder where you execute your code.

mp3lex-skeleton.mll is the skeleton for the lexer specification (please rename it to mp3lex.mll). token is the name of the lexing rule that is already partially defined. There are also a few user-defined functions in the header and footer sections that you may find useful. This is the file you will modify and hand in.

7 Problems

1. (5 pts) Define all the keywords and operator symbols of our MiniJava language. Each of these tokens is represented by a constructor in our datatype (defined in mp3common.mli).

    Token       Constructor
    boolean     BOOLEAN
    break       BREAK
    case        CASE
    class       CLASS
    continue    CONTINUE
    else        ELSE
    extends     EXTENDS
    float       FLOAT
    default     DEFAULT
    int         INT
    new         NEW
    if          IF
    public      PUBLIC
    switch      SWITCH
    return      RETURN
    static      STATIC
    while       WHILE
    this        THIS
    null        NULL_LITERAL
    (           LPAREN
    )           RPAREN
    {           LBRACE
    }           RBRACE
    [           LBRACK
    ]           RBRACK
    ;           SEMICOLON
    ,           COMMA
    .           DOT
    =           EQ
    <           LT
    !           NOT
    :           COLON
    &&          ANDAND
    ||          OROR
    +           PLUS
    -           MINUS
    *           MULT
    /           DIV
    &           AND
    |           OR

Each token should have its own rule in the lexer specification. Be sure that, for instance, && is lexed as the ANDAND token and not as two AND tokens (remember that ocamllex tries the regular expression rules by the longest-match rule first, breaking ties by the order in which the rules appear, from top to bottom).

There are token categories for integers, booleans, strings, and floats, specifically: INTEGER_LITERAL of int, BOOLEAN_LITERAL of bool, STRING_LITERAL of string, and FLOAT_LITERAL of float. The following problems will handle these tokens. There is no token for comments, because comments don't produce a token, but you will need to recognize and discard them (as we did with whitespace in the example above).
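For illustration, a fragment of the Problem 1 rule might look like the sketch below (constructor names come from the table above; the rule name token follows the skeleton, and only a handful of the tokens are shown):

    rule token = parse
        "boolean" { BOOLEAN }
      | "while"   { WHILE }
      | "&&"      { ANDAND }
      | "||"      { OROR }
      | "&"       { AND }
      | "|"       { OR }
      | "("       { LPAREN }
      | ";"       { SEMICOLON }

Because ocamllex prefers the longest match, the input && matches the two-character pattern "&&" rather than the one-character pattern "&" twice, which is exactly the behavior Problem 1 asks for.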

2. (6 pts) Recognize integer constants. An integer is a string of 1 or more decimal digits. You may use int_of_string : string -> int to convert strings to integers, and the constructor INTEGER_LITERAL to convert integers to tokens. (Note that negative integer constants are not individual tokens; they consist of two tokens, a MINUS and an integer literal.)

3. (8 pts) Implement floats. A numeric sub-string (with at least one digit) represents a float (as opposed to an integer) if either or both of these conditions hold:

   - It contains exactly one decimal point, with at least one digit to the left or to the right of that decimal point.
   - An exponent is appended: the character e, optionally followed by a + or -, followed by an integer.

   There is a token constructor FLOAT_LITERAL that takes a float as an argument. (In real Java, the tokens we are describing are double literals, but in our version of Java we only have the float type, so we omit the f that Java uses to mark a float literal.) You may use float_of_string : string -> float to convert strings to floats.

4. (3 pts) Implement booleans. The patterns to be matched are false and true. The relevant constructor is BOOLEAN_LITERAL.

5. (6 pts) Recognize identifiers. An identifier is a letter followed by any number of alphanumeric characters (letters and decimal digits). Use the IDENTIFIER constructor, which takes a string argument (the name of the identifier).

6. (8 pts) Implement strings. A string begins with a double quote ("), followed by a sequence of characters, followed by a closing double quote ("). Double quotes, line breaks, and backslashes may not occur within string literals. You may find it most convenient to use ocamllex's [^ ...] syntax to create a character set containing all characters except those specified in the brackets. Hint: You probably want to use StringCharacter, which already does much of the work for you.

       # get_all_tokens "\"some string\"";;
       - : Mp3common.token list = [Mp3common.STRING_LITERAL "some string"]

   We don't want the quotation marks that surround the string in the input to be part of the token. One way to leave them out is to bind a name to only the part of the regular expression between the quotation marks (with the as keyword; see the ocamllex documentation). Or maybe you can find another way to do it. In any case, don't include the delimiting quotation marks in the token that is returned. Note: Assume that eof is not reached before the closing double quote.

7. (3 pts) Implement line comments. Line comments in MiniJava are made up of two slashes, //, and include all characters up to the next newline.

8. (4 pts) Implement traditional C-style comments. C-style comments begin with /* and end with */, and may contain any characters, including newlines. Unlike OCaml's comments, C-style comments cannot be nested. You must handle C-style comments by creating an additional lexing function after the original lexing function, tokenize. Hint: This is the only problem for which a separate lexing function is required. You may want to refer to the lecture notes for an example. You should raise an exception if eof is reached before the end of a comment; e.g., failwith "unterminated comment". A sketch follows.
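Putting the pieces together, here is a hedged sketch of how the literal rules (Problems 2 through 7) and a separate comment entrypoint (Problem 8) might fit into mp3lex.mll. The abbreviation names (digit, letter, exponent, float_lit) are illustrative, not required, and the string rule inlines a character set where the skeleton's StringCharacter would do much of the work for you:

    let digit = ['0'-'9']
    let letter = ['a'-'z' 'A'-'Z']
    let exponent = 'e' ['+' '-']? digit+
    let float_lit = digit+ '.' digit* exponent?
                  | digit* '.' digit+ exponent?
                  | digit+ exponent

    rule token = parse
        [' ' '\t' '\n'] { token lexbuf }                       (* skip whitespace *)
      | "//" [^ '\n']*  { token lexbuf }                       (* Problem 7: discard line comment *)
      | "/*"            { comment lexbuf }                     (* Problem 8: switch to comment state *)
      | "true"          { BOOLEAN_LITERAL true }               (* Problem 4 *)
      | "false"         { BOOLEAN_LITERAL false }
      | float_lit as s  { FLOAT_LITERAL (float_of_string s) }  (* Problem 3 *)
      | digit+ as s     { INTEGER_LITERAL (int_of_string s) }  (* Problem 2 *)
      | letter (letter | digit)* as s     { IDENTIFIER s }     (* Problem 5 *)
      | '"' ([^ '"' '\\' '\n']* as s) '"' { STRING_LITERAL s } (* Problem 6 *)

    and comment = parse
        "*/" { token lexbuf }                      (* comment over: resume normal lexing *)
      | eof  { failwith "unterminated comment" }   (* Problem 8's required error *)
      | _    { comment lexbuf }                    (* discard everything inside *)

Note that keywords such as true and false must appear before the identifier rule: both patterns match the same (longest) string, and ocamllex breaks such ties in favor of the rule listed first.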

That's it! See, it wasn't that bad, was it?

8 Compiling, Testing & Handing In

To compile your lexer specification (renamed to mp3lex.mll) to OCaml code, use the command:

    make

Now you can run ./grader. You also now have a file called student.ml (note that this file ends in .ml, not .mll). You can run tests on your lexer in OCaml using the tokenize function that is already included in mp3lex.mll. To see all the tokens producible from a string, use get_all_tokens:

    # #use "student.ml";;
    # get_all_tokens "some string to test";;
    - : Mp3common.token list = [ some list of tokens ]

To test the solution, use solution.cmo instead:

    # #load "solution.cmo";;
    # open Solution;;
    # get_all_tokens "some string to test";;
    - : Mp3common.token list = [ some list of tokens ]