

CS 541 Spring 2017 Programming Assignment 2: CSX Scanner

Your next project step is to write a scanner module for the programming language CSX (Computer Science experimental). Use the JFlex scanner-generation tool (based on Lex). Future assignments will involve a CSX parser, type checker and code generator.

The CSX Scanner

Generate the CSX scanner, a member of class Yylex, using JFlex. Your main task is to create the file csx.jflex, the input to JFlex. The jflex file specifies the regular expression patterns for all the CSX tokens, as well as any special processing required by tokens.

When a valid CSX token is matched by member function yylex(), it returns an object that is an instance of class java_cup.runtime.Symbol (the class our parser expects to receive from the scanner). Symbol contains an integer field sym that identifies the token class just matched. Possible values of sym are identified in the class sym [1]. Symbol also contains a field value, which contains token information beyond the token's identity. For CSX, the value field references an instance of class CSXToken (or a subclass of CSXToken).

[1] Java class names normally are capitalized. However, certain classes created by the tool Java CUP ignore this convention.

CSXToken contains the line number and column number at which each token was found. This information is necessary to frame high-quality error messages. The line number on which a token appears is stored in linenum. The column number at which a token begins is stored in colnum. The column number counts tabs as one character, even though they expand into several blanks when viewed.

You must also store auxiliary information for identifiers, integer literals, character literals and string literals. For identifiers, class CSXIdentifierToken, a subclass of CSXToken, contains the identifier's name in field identifiertext. For integer literals, class CSXIntLitToken, a subclass of CSXToken, contains the literal's numeric value in field intvalue. For character literals, class CSXCharLitToken, a subclass of CSXToken, contains the literal's character value in field charvalue. For string literals, class CSXStringLitToken, a subclass of CSXToken, contains a field stringtext, the full text of the string (with enclosing double quotes and internal escape sequences included as they appeared in the original string text that was scanned).

CSX Tokens

The CSX language uses the following classes of tokens:

Reserved Words. The reserved words of the CSX language are:

    bool break char class const continue else false if int read return true void while display

The break and continue reserved words are optional; compilers that include them receive extra credit.

Identifiers. An identifier is a sequence of letters and digits starting with a letter, excluding reserved words.

    Id = (A | B | ... | Z | a | b | ... | z) (A | B | ... | Z | a | b | ... | z | 0 | 1 | ... | 9)* - Reserved

Integer Literals. An integer literal is a sequence of digits, optionally preceded by a ~. A ~ denotes a negative value. Here λ denotes the empty string.

    IntegerLit = (~ | λ) (0 | 1 | ... | 9)+

String Literals. A string literal is any sequence of printable ASCII characters, delimited by double quotes. A double quote within the text of a string must be escaped (as \", to avoid being misinterpreted as the end of the string). Tabs and newlines within a string must be escaped (\n is a newline and \t is a tab). Backslashes within a string must also be escaped (as \\). No other escaped characters are allowed. Strings may not cross line boundaries.

    StringLit = " ( Not(" | \ | UnprintableChar) | \" | \n | \t | \\ )* "

Character Literals. A character literal is any printable ASCII character, enclosed within single quotes. A single quote within a character literal must be escaped (as \', to avoid being misinterpreted as the end of the literal). A tab or newline must be escaped ('\n' is a newline and '\t' is a tab). A backslash must also be escaped (as '\\'). No other escaped characters are allowed.
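The escape rules for string and character literals can be sketched in Java as a small decoder that turns a literal's full text into its effective value. This is only an illustration: the class and method names are hypothetical and do not come from the assignment skeleton, and the sketch assumes the text has already matched the scanner's pattern, so it always starts and ends with a quote character and contains only legal escapes.

```java
/** Hypothetical helper; the names are illustrative, not part of the skeleton files. */
final class EscapeDecoder {
    /**
     * Takes the full text of a CSX string or character literal, including
     * its enclosing quotes, and returns the value with escapes replaced.
     */
    static String effectiveValue(String literalText) {
        // Drop the opening and closing quote characters.
        String body = literalText.substring(1, literalText.length() - 1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            // Assumption: a backslash is always followed by another character,
            // because the scanner pattern only accepts complete escapes.
            char next = body.charAt(++i);
            switch (next) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '\\': out.append('\\'); break;
                case '"':  out.append('"');  break;  // legal only inside strings
                case '\'': out.append('\''); break;  // legal only inside character literals
                default:
                    throw new IllegalArgumentException("Illegal escape: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal below is the 9-character CSX text "hello\n" with its quotes.
        String value = effectiveValue("\"hello\\n\"");
        System.out.println(value.length());
    }
}
```

The decoder accepts both \" and \' regardless of the literal kind; it relies on the scanner's regular expressions to have rejected the wrong one before the text reaches this point.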

    CharLit = ' ( Not(' | \ | UnprintableChar) | \' | \n | \t | \\ ) '

Other Tokens. These are miscellaneous one- or two-character symbols representing operators and delimiters.

    ( ) [ ] = ; + - * / == != && || < > <= >= , ! { } :

End-of-File (EOF) Token. The EOF token is automatically returned by yylex() when it reaches the end of file while scanning the first character of a token.

Comments and white space, as defined below, are not tokens because they are not returned by the scanner. Nevertheless, they must be matched (and skipped) when they are encountered.

A Single Line Comment. As in C++ and Java, this style of comment begins with a pair of slashes and ends at the end of the current line. Its body can include any character other than an end-of-line.

    LineComment = // Not(Eol)* Eol

A Multi-Line Comment. This comment begins with the pair ## and ends with the pair ##. Its body can include any character sequence other than two consecutive #'s.

    BlockComment = ## ( (# | λ) Not(#) )* ##

White Space. White space separates tokens; otherwise it is ignored.

    WhiteSpace = (Blank | Tab | Eol)+

Any character that cannot be scanned as part of a valid token, comment or white space is invalid and should generate an error message.

Considerations/Requirements

Because reserved words look like identifiers, you must be careful not to mis-scan them as identifiers. You should include distinct token definitions for each reserved word before your definition of identifiers.

Upper- and lower-case letters are equivalent in reserved words and in identifiers. When you print a reserved word or an identifier, you may either print its original case or a conversion to standard case (such as lower case).

Print character and string literals as they are input, that is, with the escaped characters shown as \n, \\, or whatever, and with the surrounding quotes. However, you should also store the effective values of character and string literals, in which escaped characters are replaced by their meaning, and surrounding quotes are removed.

You should not assume any limit on the length of identifiers. You should not assume any limit on the length of input lines that are scanned.

You may use Java API classes to convert strings representing integer literals to their corresponding integer values. Be careful, though: in Java a minus sign, not ~, represents a negative value. You must detect and report overflow in a system-independent fashion, perhaps using the constants MIN_VALUE and MAX_VALUE in class Integer. Do not halt on overflow; print an error message and return MAX_VALUE or MIN_VALUE as the value of the literal.

An online reference manual for JFlex may be found in the Useful Programming Tools section of the class homepage. Although JFlex's regular expression syntax is designed to be very similar to that of Lex, it is not identical. Read the JFlex manual carefully. A blank should not be used within a character class (i.e., between [ and ]). You may use \040 (which is the character code for a blank). A double quote is meaningful within a character class (i.e., between [ and ]).

As was the case in project 1, javac requires an environment variable CLASSPATH to define the directories to be searched to find .class files stored in libraries. JFlex and Java CUP (in the next assignment) use CLASSPATH to tell Java where to find the classes that they use. Once again, the Makefile we supply places all .class files in subdirectory classes.

Skeleton files and a makefile are in the directory ~raphael/courses/cs541/public/proj2/startup; they are also available through the class homepage. Do not assume that the supplied Java is up to coding standard; use a style checker to improve the code.

What to hand in

Submit your project electronically by mailing it to raphael@cs.uky.edu. Please run make clean first to remove all class files.
Hand in: (1) the csx.jflex file you create, (2) any other classes you create, (3) the test data you use to test your scanner, (4) the outputs produced using your test data, (5) a README file, (6) a Makefile, and (7) all source files necessary to build an executable version of your program (.java files and a csx.jflex file). Name the class that contains your main() routine P2.java.

Your scanner test program should act like the test program illustrated below, reading a stream of characters from the file named on the command line and printing the tokens matched to the standard output, one per line, in the following format:

    line:column token

For identifiers, include the text of the identifier; for integer literals, include the literal's numeric value; for character and string literals, include the literal's full text (with enclosing quotes and escape sequences). Use the following format:

    line:column Token (value)

For example, if the contents of the file are:

    class T {
    // hello, this is
    // a test
    const
       cnst
    "hello\n"
    ^
    ~10;

you should produce:

    1:1 class
    1:7 Identifier (T)
    1:9 {
    4:1 const
    5:4 Identifier (cnst)
    6:1 String literal ("hello\n")
    7:1 Invalid token (^)
    8:1 Integer literal (-10)
    8:4 ;

Your program should follow this format as closely as possible to ease grading.

A significant fraction of your grade will be based on the quality of your test data. Please exercise your program in every possible way. Your program should print appropriate error messages if it scans an invalid token. You may handle strings that attempt to cross a line boundary in either of two ways: refuse to accept the initial double quote, which will lead to a cascade of error messages, or explicitly return an error token that contains all the input up to the line boundary.
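The output format, together with the earlier rule that an integer literal uses ~ for negation and is clamped to Integer.MAX_VALUE or Integer.MIN_VALUE on overflow, can be sketched as follows. This is one possible approach, not the supplied test program: every name here (TokenReport, intLitValue, line) is hypothetical, and using BigInteger for the range check is an assumption about how overflow might be detected in a system-independent way.

```java
import java.math.BigInteger;

/** Hypothetical sketch; none of these names come from the assignment skeleton. */
final class TokenReport {
    /** Converts CSX integer-literal text ("123" or "~45") to an int, clamping on overflow. */
    static int intLitValue(String text) {
        boolean negative = text.startsWith("~");
        // Java expects '-' for negation, so strip the '~' and negate explicitly.
        BigInteger value = new BigInteger(negative ? text.substring(1) : text);
        if (negative) {
            value = value.negate();
        }
        if (value.compareTo(BigInteger.valueOf(Integer.MAX_VALUE)) > 0) {
            System.err.println("Error: integer literal " + text + " is too large");
            return Integer.MAX_VALUE;
        }
        if (value.compareTo(BigInteger.valueOf(Integer.MIN_VALUE)) < 0) {
            System.err.println("Error: integer literal " + text + " is too small");
            return Integer.MIN_VALUE;
        }
        return value.intValue();
    }

    /** Formats a token that carries no auxiliary information, e.g. "1:1 class". */
    static String line(int linenum, int colnum, String token) {
        return linenum + ":" + colnum + " " + token;
    }

    /** Formats a token with auxiliary information, e.g. "1:7 Identifier (T)". */
    static String line(int linenum, int colnum, String token, Object value) {
        return line(linenum, colnum, token) + " (" + value + ")";
    }

    public static void main(String[] args) {
        System.out.println(line(1, 1, "class"));
        System.out.println(line(1, 7, "Identifier", "T"));
        // String literals are printed with their original quotes and escapes.
        System.out.println(line(6, 1, "String literal", "\"hello\\n\""));
        System.out.println(line(8, 1, "Integer literal", intLitValue("~10")));
    }
}
```

In a real scanner the line and column would come from the linenum and colnum fields of the CSXToken attached to each Symbol rather than from hard-coded arguments.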