CS143 Handout 05 Summer 2011 June 22, 2011 Programming Project 1: Lexical Analysis

Handout written by Julie Zelenski with edits by Keith Schwarz.

The Goal

In the first programming project, you will get your compiler off to a great start by implementing the lexical analyzer. For the first task of the front end, you will use flex to create a scanner for the Decaf programming language. Your scanner will transform the source file from a stream of bits and bytes into a series of meaningful tokens containing information that will be used by the later stages of the compiler.

This is a fairly straightforward assignment and most students don't find it too time-consuming. However, we are giving you more than a week to work on it. Don't let that lull you into procrastinating. Most of the work comes in figuring out how flex works. If you've never really played with flex, then start sooner rather than later so you don't find yourself using a late day on the easiest assignment of the quarter. Once you get up to speed, things should go relatively smoothly.

Due July 1st at 11:59 p.m.

Decaf Lexical Structure

The handout on the Decaf specification (Handout 02) contains a full description of the lexical structure of Decaf under the section Lexical Considerations. Your job in this assignment will be to write a scanner that implements this lexical specification.

When writing your scanner, you will need to convert integer and real-valued literals into integer and real-valued numeric data. That is, if you encounter the sequence of characters 3.1415E+3 while scanning the input, you should convert it into a double with that value. Similarly, when you see 137 you should convert it to an int. To save yourself some time and energy, you may want to use the strtol or strtod functions to do these conversions. You can learn more about them by typing man strtol or man strtod. Alternatively, if you're familiar with the C++ stringstream type, feel free to use that as well.

You may assume that any numeric literal, whether integer or real, can be converted to the appropriate type without errors, so don't worry about literals that overflow or underflow.
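
As a rough illustration of the conversions just described, here is a minimal C++ sketch built on strtol and strtod. The helper names ParseIntLiteral and ParseDoubleLiteral are hypothetical and are not part of the starter code, and the sketch assumes decimal integer literals only; check the Decaf specification for any other literal forms you must accept.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not in the starter files): convert the text of an
// integer literal to an int. Base 10 is an assumption made for this sketch.
int ParseIntLiteral(const char *text) {
    assert(text != nullptr);
    // The handout lets us assume the value fits, so no overflow check here.
    return static_cast<int>(std::strtol(text, nullptr, 10));
}

// Hypothetical helper: convert the text of a real-valued literal to a double.
// strtod understands exponent notation such as "3.1415E+3".
double ParseDoubleLiteral(const char *text) {
    assert(text != nullptr);
    return std::strtod(text, nullptr);
}
```

In a flex action you would typically pass the matched text (yytext) to a conversion like these and store the result in the appropriate field of yylval.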

Decaf adopts the two types of comments available in C++. A single-line comment starts with // and extends to the end of the line. Multi-line comments start with /* and end with the first subsequent */. Any symbol is allowed in a comment except the sequence */, which ends the current comment. Multi-line comments do not nest. Your scanner should consume any comments from the input stream and ignore them. If a file ends with an unterminated comment, the scanner should report an error.

Recall that the following operators and punctuation characters are used by Decaf:

    + - * / % < <= > >= = == != && || ! ; , . [ ] ( ) { } []

Note that [, ], and [] are three different tokens and that for the [] operator, as well as the other two-character operators, there must not be any space between the two characters.

Using flex

Inspect the CS143 web site. All the way to the right you'll find links to various flex resources. We may not cover all the details of the tools in lecture, so you'll need to forage in the online docs to learn what you need to know. A quick way to get help is using GNU info. (Try "info flex".)

flex is not exactly user friendly. If you put a space or newline in the wrong place, it will often print "syntax error" with no line number or hint of what the true problem is. It may take some delving into the manual, a little experimentation, and some patience to learn its shortcomings. Here are a few suggestions to help keep you sane:

- Be careful about spaces within patterns (it's easy to accidentally allow a space to be interpreted as part of the pattern, or to signal the end of the pattern prematurely, if you aren't careful).
- Never put newlines between a pattern and an action.
- When in doubt, parenthesize within the pattern to ensure you are getting the precedence you intend.
- Enclose each action in curly braces (although not required for a single-line action, you're better safe than sorry).
- Use the definitions section to define pattern substitutions (names like Digit, Exponent, etc.). It makes for much more readable rules that are easier to modify, extend, and debug. You must put curly braces around the definition name when you are using it in another definition or a pattern; without them it will only match the literal name.
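
To make the comment rules from the Decaf Lexical Structure section concrete outside of flex, here is a sketch in plain C++. It is not how you will write your scanner (you will express this with flex patterns), and StripResult and StripComments are hypothetical names invented for the illustration. The sketch assumes a string constant runs to the next double quote or the end of the line, and it only sets a flag where a real scanner would report the error.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct StripResult {
    std::string text;           // input with comments removed
    bool unterminated = false;  // true if a /* comment ran to end of input
};

// Remove // and /* */ comments, leaving string constants untouched.
StripResult StripComments(const std::string &src) {
    StripResult out;
    const std::size_t n = src.size();
    std::size_t i = 0;
    while (i < n) {
        if (src[i] == '"') {
            // Copy a string constant verbatim so that comment characters
            // inside it are not mistaken for comments.
            std::size_t j = i + 1;
            while (j < n && src[j] != '"' && src[j] != '\n') ++j;
            if (j < n && src[j] == '"') ++j;  // include the closing quote
            out.text.append(src, i, j - i);
            i = j;
        } else if (src.compare(i, 2, "//") == 0) {
            // Single-line comment: skip to the newline, which is kept.
            while (i < n && src[i] != '\n') ++i;
        } else if (src.compare(i, 2, "/*") == 0) {
            // Multi-line comment: ends at the FIRST following */ (no nesting).
            const std::size_t end = src.find("*/", i + 2);
            if (end == std::string::npos) {
                out.unterminated = true;
                break;
            }
            // Keep embedded newlines so later line numbers stay correct.
            for (std::size_t k = i; k < end; ++k) {
                if (src[k] == '\n') out.text += '\n';
            }
            i = end + 2;
        } else {
            out.text += src[i++];
        }
    }
    return out;
}
```

Note how the input /* a /* b */ c */ leaves " c */" behind: the first */ closes the comment, which is exactly what "do not nest" means.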

Starter files

The starting files can be found in /usr/class/cs143/assignments/pp1. The starting pp1 project contains the following files:

    Makefile        builds scanner
    main.cc         main for scanner
    scanner.h       type definitions and prototype declarations for scanner
    scanner.l       starting scanner skeleton
    errors.h/.cc    error messages you are to use
    utility.h/.cc   interface/implementation of various utility functions
    samples/        directory of test input files

Copy the entire directory to your home directory. Your first order of business is to read through all the files to learn the lay of the land as well as to absorb the helpful hints contained in the files.

You should not modify scanner.h, errors.h/.cc, or main.cc, since our grading scripts depend on your output matching our defined constants and behavior. You may (but don't need to) modify utility.h/.cc. You will definitely need to modify scanner.l.

You may use our sample Makefile as a start to build the project. The Makefile has a target to build the dcc executable. dcc must read input from stdin; therefore, you can use standard UNIX file redirection to read from a file. For example, to invoke your compiler (well, your scanner) on a particular input file, you would use:

    % ./dcc < samples/t1.decaf

Take care that comment characters inside string literals don't get misidentified as comments. If a file ends with an unclosed multi-line comment, report an error via a call to one of the methods in the ReportError class. In order to match our output exactly (which is important for our testing code), please use the standard error messages provided in errors.h.

Scanner implementation

The scanner.l lex input file in the starter project is where you'll do your work. The yylval global variable is used to record the value for each lexeme scanned, and the yylloc global records the lexeme position (line number and column). The action for each pattern will update the global variables and return the appropriate token code. Your goal is to modify scanner.l to:

- skip over white space;
- recognize all keywords and return the correct token from scanner.h;
- recognize punctuation and single-char operators and return the ASCII value as the token;
- recognize two-character operators and return the correct token;
- recognize int, double, bool, and string constants, return the correct token, and set the appropriate field of yylval;
- recognize identifiers, return the correct token, and set the appropriate fields of yylval;
- record the line number and first and last column in yylloc for all tokens; and
- report lexical errors for improper strings, lengthy identifiers, and invalid characters.

We recommend adding token types one at a time to scanner.l, testing after each addition. Be careful with characters that have special meaning to flex, such as * (see the docs for how and when to suppress special-ness). The patterns for integers, doubles, and strings will require careful testing to make sure all cases are covered (see the man pages for strtol/strtod for details on converting strings to numbers). For this assignment, you may assume that all integer constants can be represented by a 32-bit integer. Similarly, you can assume that it is safe to use strtod to convert double constants.

Recording the position of each lexeme requires you to track the current line and column numbers (you will need global variables) and update them as the scanner reads the file, most likely incrementing the line count on each newline and the column on each token. A tab character accounts for 8 columns. There is code in the starter file that installs a function to be automatically included with each action (that's much nicer than repeating the call everywhere!), and we strongly encourage you to use it for this purpose.

String constants do not allow C-style escape sequences. For instance, "\" is a perfectly valid string in Decaf, even though it would be an open string constant in C or C++.

Lastly, you need to be sure that your scanner reports the various lexical errors. The action for an error case should call our ReportError class with one of the standard error messages provided in errors.h. For each character that cannot be matched to any token pattern, report it and continue parsing with the next character. If a string erroneously contains a newline, report an error and continue at the beginning of the next line. If an identifier is longer than the Decaf maximum (31 characters), report the error, truncate the identifier to the first 31 characters (discarding the rest), and continue.

Testing your work

In the starting project, there is a samples directory containing various input files and matching .out files which represent the expected output. You should diff your output against ours as a first step in testing.

Now examine the test files and think about what cases aren't covered. Construct some of your own input files to test your scanner even more. What lexemes look like numbers but aren't? What sequences might confuse your processing of comments? This is precisely the sort of thought process a compiler writer must go through. Any sort of input is fair game, and you'll want to be sure your scanner can handle anything that comes its way, correctly tokenizing it if possible or reporting some reasonable error if not.

Note that lexical analysis is responsible only for correctly breaking up the input stream and categorizing each token by type. The scanner will accept syntactically bogus sequences such as:

    int if array + 4.5 [ bool}

In your next two assignments, you'll check input programs for these sorts of errors.

General requirements

Your executable should be named dcc and should require no command-line arguments. It should process input from stdin, write tokens to stdout, and report errors to stderr.

Compilers are notorious for leaking memory. Given that they run once and exit, this is not considered much of a problem. We will not examine your programs for leaks and do not expect them to free dynamically allocated memory. When debugging, you may want to run your program in valgrind to confirm that it does not have any memory corruption errors. If valgrind reports any memory leaks, that's fine, but if you're accidentally reading a garbage pointer, valgrind will be invaluable in helping pin down your error.
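
Going back to the bookkeeping described under Scanner implementation, the following C++ sketch shows one way the line and column tracking and the 31-character identifier limit might look. Everything here is hypothetical: Location, PositionTracker, TruncateIdentifier, and the constants are invented for the illustration and do not mirror the starter code or the layout of yylloc. It assumes 1-based lines and columns and reads "a tab character accounts for 8 columns" as adding 8 to the column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical position record (the real one lives in yylloc).
struct Location {
    int line;
    int first_column;
    int last_column;
};

// Hypothetical tracker; assumes lines and columns both start at 1.
class PositionTracker {
public:
    // Advance past one matched lexeme and return where it started and ended.
    // last_column is only meaningful for lexemes that do not end in a newline.
    Location Advance(const std::string &lexeme) {
        Location loc{line_, column_, column_};
        for (char c : lexeme) {
            if (c == '\n') {
                ++line_;
                column_ = 1;
            } else if (c == '\t') {
                column_ += kTabWidth;  // a tab accounts for 8 columns
            } else {
                ++column_;
            }
        }
        loc.last_column = column_ - 1;
        return loc;
    }

private:
    static constexpr int kTabWidth = 8;
    int line_ = 1;
    int column_ = 1;
};

// Decaf's maximum identifier length, as stated in the handout.
constexpr std::size_t kMaxIdentLen = 31;

// Keep the first 31 characters of an over-long identifier; shorter names
// are returned unchanged.
std::string TruncateIdentifier(const std::string &name) {
    return name.substr(0, kMaxIdentLen);
}
```

In the real scanner the equivalent of Advance would run once per matched lexeme, which is the job of the function the starter file installs for every action.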

Grading

This project is worth 10% of your overall course grade. Most of the points will be allocated for correctness, with some consideration for design and readability. We will run your program through the given test files from the samples directory, as well as other tests of our own, using diff -w to compare your output to that of our solution.

Deliverables

You are to electronically submit your entire project using an electronic submission process that I'll outline shortly. Be sure to include your README file, which is your chance to explain your design decisions and why you believe your program to be correct and robust, as well as to describe what to expect from your submission and its error handling.

Good Luck!