cmps104a 2002q4 Assignment 2 Lexical Analyzer page 1


$Id: asg2-scanner.mm,v 327.1 2002-10-03 13:20:15-07 - - $

1. The Scanner (Lexical Analyzer)

Write a main program, string table manager, and lexical analyzer for the language c0 that you will be compiling this quarter. The usage and options were described in the first assignment. The main program will scan the input with the following code, which will be removed in assignment 3 and replaced by a call to the parser:

    int token_code;
    yyin = fopen( /*... argv[?] or something...*/ );
    for(;;){
       token_code = yylex();
       if( token_code == YYEOF ) break;
       fprintf( stderr, "yylex() returned %d (yytext=%s).\n",
                token_code, yytext );
    }

Flex reads characters from the FILE* yyin, which must point at a valid file structure before yylex() is called. Whatever you called this file in the first assignment, change it to yyin. Note that yylex() returns YYEOF (which is 0) when it hits end of file. The scanner should dump its tokens itself from a semantic action when the -t flag is set.

Warning: This is where the course project really starts. The string table assignment was really just a Data Structures assignment, which you should have found rather easy. This assignment, together with the parser of the next assignment, is the «real stuff». A failing grade in the scanner or parser assignment will result in failing the course.

The scanner specification should be placed in a file with a .l suffix, such as scanner.l. At the beginning of this file, ensure that at least the following #includes are present in the C declarations:

    %{
    #include "yyexternals.h"
    #include "tokenast.h"
    %}

2. Options

You must implement all of the options from the previous assignment, and all options for any assignment must carry forward to future assignments. In this case, the -t option will cause the tokens to be dumped into program.tok and the -L option will cause the flex-generated scanner to produce its debug output by setting yy_flex_debug to 1.
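The two options named in this section could be handled with POSIX getopt(). The following is only a sketch: the struct scan_options and the function parse_scan_options() are hypothetical names, yy_flex_debug is defined here as a stand-in for the variable that the flex-generated scanner provides, and the options carried over from assignment 1 are not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <unistd.h>   /* POSIX getopt(), optind, opterr */

/* Stand-in for the variable defined by the flex-generated scanner.
 * In the real program this would be an extern declaration. */
int yy_flex_debug = 0;

/* Hypothetical record of the scanner-related options. */
struct scan_options {
    bool dump_tokens;   /* -t: dump tokens to the .tok file */
};

/* Parse -t and -L.  Returns the index of the first non-option
 * argument, or -1 if an unknown option is seen. */
int parse_scan_options(int argc, char **argv, struct scan_options *opts)
{
    int ch;
    opts->dump_tokens = false;
    optind = 1;   /* start scanning at argv[1] */
    opterr = 0;   /* let the caller report bad options */
    while ((ch = getopt(argc, argv, "tL")) != -1) {
        switch (ch) {
        case 't': opts->dump_tokens = true; break;
        case 'L': yy_flex_debug = 1;        break;
        default:  return -1;
        }
    }
    return optind;
}
```

The returned index would typically be used to pick the file name that is passed to fopen() before yyin is assigned.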
See assignment 1 for information about dbx.

3. Global interface

You will need a set of global declarations for communication among the various modules. Try not to make too much of a hash of things, and do not use globals when it is possible to avoid them. The file yyexternals.h should contain:

    int yylex( void );
    int yyparse( void );
    extern FILE *yyin;
    extern char *yytext;
    #define YYEOF 0

4. The Token AST ADT

You must also implement a Token Abstract Syntax Tree. For the current assignment, you don't need any tree implementation code, as each token is a stand-alone unit. For the parser assignment, you must add tree management code to your ADT. The file tokenast.h should contain:

    #define YYSTYPE TokenAST_ref
    typedef struct TokenAST *TokenAST_ref;
    #include "parser.h"

The ordering of things above is important. YYSTYPE is a macro which defines the type of the objects on the parser's semantic stack. It is used by parser.h, and must be defined before parser.h is included. Hence, including parser.h from inside tokenast.h ensures that things will always be defined in the correct order. A sample parser.h is to be found in the dummy-parser subdirectory.

With every token recognized there should be a semantic action which creates a new struct TokenAST with malloc() and initializes the various fields as appropriate. The external declaration yylval will automatically be generated from the scanner and will be of type TokenAST_ref, so an appropriate statement to create a token node is:

    yylval = malloc( sizeof (struct TokenAST) );

Then fill in the various fields. Note that you don't need to bother free()ing the nodes in this assignment. That, of course, leads to a storage leak, but in the next project, instead of abandoning the nodes, you will link them into a parse tree.

In your implementation file, you will declare the various fields:

    int token_code;

is a copy of the token code to be returned by yylex(). It will be useful later when walking the parse tree. It also means that every lexical semantic action that returns a token may terminate with the following statement (actually, you will need to write some access functions to have the equivalent effect):

    return yylval->token_code;

    int serial_nr;

is a token serial number consisting of line_nr * 1000 + offset, where offset is either the character number of the token within the current line or a unique integer within the current line. This will be used for two purposes: generating semantic error messages so that they can properly reference input line numbers; and choosing unique label numbers in the generated intermediate code.
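As an illustration of the serial number arithmetic, here is a minimal sketch. The struct shows only the two fields described so far, and the helper names (make_serial_nr, serial_line_nr, serial_offset, new_token) are hypothetical. The encoding is only unambiguous while offset stays below 1000, which is an assumption of this sketch.

```c
#include <stdlib.h>

/* Hypothetical layout, restricted to the two fields described so far. */
struct TokenAST {
    int token_code;   /* code returned by yylex() */
    int serial_nr;    /* line_nr * 1000 + offset */
};
typedef struct TokenAST *TokenAST_ref;

/* Pack a line number and an offset (assumed < 1000) into a serial number. */
int make_serial_nr(int line_nr, int offset) { return line_nr * 1000 + offset; }

/* Recover the two parts again. */
int serial_line_nr(int serial_nr) { return serial_nr / 1000; }
int serial_offset(int serial_nr)  { return serial_nr % 1000; }

/* Allocate a node and fill in its fields; returns NULL if malloc() fails. */
TokenAST_ref new_token(int token_code, int line_nr, int offset)
{
    TokenAST_ref node = malloc(sizeof *node);
    if (node == NULL) return NULL;
    node->token_code = token_code;
    node->serial_nr = make_serial_nr(line_nr, offset);
    return node;
}
```

For example, a token at line 16, offset 3 gets serial number 16003, and serial_line_nr(16003) gives back 16 for use in an error message.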
    StringNode_ref lex_info;

is a pointer to a string node created from the lexical information found by yylex(). Strictly speaking, this is unnecessary for tokens that carry no semantic information, but it is easier to include it in every token. When lexical information needs to be associated with a token, it can be done as follows, after the malloc() of a new token. Note: yytext is declared by the scanner to point at the text of the last-recognized token.

    yylval->lex_info = intern_stringtable( stringtable, yytext );

In the next assignment, struct TokenASTs will be the nodes in the abstract syntax tree, and hence a facility to enter them into an n-way tree will be needed. Note that the parser's semantic stack needs to have a uniform type, and so it should be made into a stack of TokenAST_refs.

5. Tokens in the c0 language

The language c0 has the following tokens in it:

special symbols: = + - * / % & == != > >= < <= ; , ( ) [ ] { }

reserved words: int char void return if else while

tokens with lexical information: identifiers and literals (character, integer, and string), all with C syntax. You do not need to interpret the semantics of literal tokens, just write a pattern to recognize them.

Comments in c0 are just like in C and are skipped over and never returned to the parser. They are not tokens. Comments also begin with the hash (#) character and continue up to but not including the newline character. Thus, C #includes are treated as comments as well. This is a hack so that gcc can compile c0 programs with the inclusion of appropriate header files.

According to the flex manual, here is a scanner which discards C comments and white space while maintaining the current input line counter:

    %x comment
    %%
    "/*"                   { BEGIN( comment ); }
    <comment>[^*\n]*       { }
    <comment>"*"+[^*/\n]*  { }
    <comment>\n            { line_count++; }
    <comment>"*"+"/"       { BEGIN( INITIAL ); }
    \n                     { line_count++; }
    [\t ]+                 { }

6. Dumping to the .tok file

The function make_token() should dump each node to the debug file as the node is created. Each token dumped to program.tok should have the format:

      16.003 264 TOK_KW_RETURN (return)
      16.010  61 = (=)
      20.008 258 TOK_IDENT (hello)
      20.010 271 TOK_LIT_INT (1234)
      25.002 123 { ({)
      26.008 272 TOK_LIT_STRING ("beep\007")

The first column contains (double) serial_nr / 1000.0 in %8.3f format, followed by the integer token_code, followed by the symbolic name of the token code. Lastly, if there is any lexical information associated with the token, it is printed between parentheses exactly as stored in the string table, except that any character that is not isgraph() is printed as three octal digits following a backslash, and the backslash is printed as two backslashes.

The following function, if it appears in the third part of the parser source, can be used to translate an integer symbol number into a symbolic name for a grammar symbol:

    const char *token_code_name( int token_code )
    /* input: numeric token code (symbol)
     * result: symbolic (char*) name of input token_code
     */
    {
       return yytname[ YYTRANSLATE( token_code ) ];
    }

Do not worry about the contents of the c0_lib.h file until the symbol table assignment. Specifically, the sample test data shows these symbols generated into the string table. They will not be there until such time as you have the symbol table assignment done. The sample output is thus a little advanced for the current assignment.

You should still link in the dummy parser in this project in order to make some undefined external references disappear at link time. Doing this is also necessary in order to make the function token_code_name() available to the scanner.
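The escaping rule and line format described in this section might be implemented along the following lines. This is a sketch under assumptions: the function names escape_lex_info() and format_token_line() are hypothetical, the lexical information is passed as a plain C string rather than a string table node, and only the %8.3f width of the first column comes from the text above (the width used for the token code is a guess).

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Copy src into dst, writing each backslash as two backslashes and each
 * other character that is not isgraph() as a backslash followed by three
 * octal digits.  Returns the length written, or -1 if dst is too small. */
int escape_lex_info(const char *src, char *dst, size_t dstsize)
{
    size_t used = 0;
    for (; *src != '\0'; ++src) {
        unsigned char ch = (unsigned char) *src;
        char piece[5];   /* at most "\ooo" plus the terminating NUL */
        if (ch == '\\') {
            strcpy(piece, "\\\\");
        } else if (isgraph(ch)) {
            piece[0] = (char) ch;
            piece[1] = '\0';
        } else {
            snprintf(piece, sizeof piece, "\\%03o", (unsigned) ch);
        }
        size_t len = strlen(piece);
        if (used + len + 1 > dstsize) return -1;
        memcpy(dst + used, piece, len);
        used += len;
    }
    if (dstsize == 0) return -1;
    dst[used] = '\0';
    return (int) used;
}

/* Format one .tok line into out; returns the snprintf() result, or -1
 * if the escaped lexical information does not fit the local buffer. */
int format_token_line(char *out, size_t outsize, int serial_nr,
                      int token_code, const char *name, const char *lex_info)
{
    char escaped[1024];
    if (escape_lex_info(lex_info, escaped, sizeof escaped) < 0) return -1;
    return snprintf(out, outsize, "%8.3f %3d %s (%s)",
                    serial_nr / 1000.0, token_code, name, escaped);
}
```

In make_token() the formatted line would then be written to the .tok file with fprintf() or fputs(), but only when the -t flag was given.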
The function token_code_name() must be defined in the parser file since it uses the macro YYTRANSLATE, which is defined therein. The following command can be used to generate the output C parser:

    bison -dtv -o parser.c parser.y

7. Debugging generated C programs

Why am I getting the following error message?

    /cats/gnu/sparclib/bison/bison.simple:270: parse error before )

This is a recurring problem caused by the stupid way that the C compiler works (or doesn't). It first runs a preprocessor over the program and then compiles the output thereof. In order to be «helpful», bison and flex put in #line directives to point errors at the original source, but with matchfix operator errors, this can lead to confusing error messages. bison.simple is the prototype parser into which your actions are merged. The message means that the error occurred somewhere in front of where it is reported, but that could be anywhere, and the printed line numbers are not necessarily of any use at all.

First, edit the parser.c file and delete all #line directives. Recompile, and see if the error message refers to a more meaningful location. The problem is in code you wrote in your .y file and which was propagated to the .c file.

Second, if that doesn't work, use the command:

    gcc -E y_parser.c > plain.c

This will preprocess the program so that you can see exactly what is being compiled when you gcc plain.c.

If that doesn't work, apply the binary search technique to the program. Comment out all your semantic actions:

    {/*... */} or /*{... }*/

and #ifdef out your section 3 code:

    #ifdef COMMENT_OUT
    your section 3 code
    #endif

Do the same in your section 1 %{... %} declarations. If you recompile, the error (hopefully) will be gone, because the offending code will be gone. Then put the code back in a little at a time until the error comes back. Especially: check for mismatched matchfix operators like {} [] () /**/.

Of course, if you use the options -ansi -Wall -pedantic when trying to compile the generated K&R code, you'll get a ton of warnings. So compile the generated code without those options and only use the «friendly» options when compiling code you wrote yourself.

8. Avoid keywords in the lexical grammar

The following is a very poor way of recognizing reserved words:

    "if"      { return make_token( KW_IF ); }
    "while"   { return make_token( KW_WHILE ); }
    ...etc...
    {IDENT}   { return make_token( IDENTIFIER ); }

A much better way to do it is as follows:

    {IDENT}   { return make_ident_token( IDENTIFIER ); }

where the function make_ident_token() first searches for yytext in a reserved word table and then returns either the code for IDENTIFIER or one of the keyword codes, as appropriate. Searching a keyword table can be done with the C library function bsearch(). A linear search is NOT acceptable, NOR is a sequence of if-else statements.

Alternatively, instead of a reserved word table, you could statically initialize an array of String_nodes and then insert them into the string table by a function similar to the intern function, but which does not allocate any new storage. That way, looking up a string in the string table will automatically distinguish between an identifier and a reserved word. Of course, it would require an extra bit in the string table.

As an experiment, let's take all of the C++ keywords and drop them into a scanner and see what is produced. If there are no keywords in the lexical grammar, flex produces the following:

    221/2000 NFA states
    57/1000 DFA states (266 words)
    509 state/nextstate pairs created
    101/408 unique/duplicate transitions
    57/1000 base-def entries created
    655/2000 (peak 0) nxt-chk entries created

    static const struct yy_trans_info yy_transition[683] =
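The bsearch() lookup recommended earlier in this section can be sketched as follows. The token code values, the struct keyword type, and the function ident_token_code() are hypothetical; in the real scanner the codes come from the bison-generated parser.h, and make_ident_token() would call something like this with yytext before building the token node.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes, for illustration only. */
enum { IDENTIFIER = 258, KW_CHAR, KW_ELSE, KW_IF, KW_INT,
       KW_RETURN, KW_VOID, KW_WHILE };

struct keyword {
    const char *name;
    int token_code;
};

/* The c0 reserved words.  The table must stay sorted in strcmp()
 * order, because bsearch() assumes a sorted array. */
static const struct keyword keywords[] = {
    { "char",   KW_CHAR   },
    { "else",   KW_ELSE   },
    { "if",     KW_IF     },
    { "int",    KW_INT    },
    { "return", KW_RETURN },
    { "void",   KW_VOID   },
    { "while",  KW_WHILE  },
};

/* bsearch() passes the search key first and a table element second. */
static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const struct keyword *) elem)->name);
}

/* Return the keyword code for text, or IDENTIFIER if it is not reserved. */
int ident_token_code(const char *text)
{
    const struct keyword *found =
        bsearch(text, keywords, sizeof keywords / sizeof keywords[0],
                sizeof keywords[0], compare_keyword);
    return found != NULL ? found->token_code : IDENTIFIER;
}
```

Because the table is a static constant array, the lookup allocates nothing and the scanner needs only the single {IDENT} pattern.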

If all of the C++ keywords are put in the lexical grammar, flex produces the following:

    632/2000 NFA states
    275/1000 DFA states (1536 words)
    6681 state/nextstate pairs created
    773/5908 unique/duplicate transitions
    275/1000 base-def entries created
    12271/14000 (peak 0) nxt-chk entries created

    static const struct yy_trans_info yy_transition[12323] =

The statistics come from the output of running flex, and the line of C declaration is from the generated scanner. As you can see, the numbers for the second scanner are MUCH larger: at 4 bytes per array entry, 2732 bytes for the scanner without keywords and 49292 bytes for the scanner with keywords. It does not take 46560 bytes of memory to store a keyword table. And these numbers are just for the array containing the FSM integer codes.

9. Flex options: -pp -8bdsv -CeF

A good set of options to use with flex is: -pp -8bdsv -CeF.

    -pp   generate a performance report for both major and minor performance losses.
    -8    generate an 8-bit clean scanner.
    -b    generate backup information.
    -d    compile the scanner in debug mode.
    -s    suppress the default rule to find holes in the rule set.
    -v    generate summary stats.
    -Ce   construct equivalence classes to reduce the scanner size.
    -CF   generate an alternate fast scanner.

You can use whichever options work for you.

10. The Error reporting module

You must have an error handling module which will accept error messages in various different formats. One of them must be called yyerror() with a specific format. Error messages should be printed in a format similar to that printed by gcc, namely with the filename, line number, and specific message. For the scanner and the parser, yyerror() will be used, and the current line number maintained by your scanner code can be printed. For other phases, the line number from the token node can be printed.
One thing you will need when you link in the dummy parser is a function:

    void yyerror( const char *message ){
       put_error( yylineno, message );
    }

It should in turn call your own error message function. You should have an error message function which prints to stderr the name of the file in error (i.e., the file whose name you got from argv[]), the line number in that file most closely associated with the error, and the text of the error message. It should also maintain an error count so that main() knows whether to return a zero or non-zero return code.

11. Gcc options

Both flex and bison produce old-style K&R code which, when compiled with the -ansi option, generates many warnings. Suppress this option when you compile the generated code, but only for that code. Also, never put more than the absolute minimum amount of C code in either the .l or the .y file. Use function calls and includes and put the code elsewhere whenever possible. This will tend to reduce the number of times the compiler fails to warn you about non-ANSI things in your code.

In addition, flex and bison do not understand C code. They simply take whatever you have in the semantic actions between squiggle brackets and in sections one and three and copy it to the output file. Errors in the C code will not show up during the flex or bison phase, but only when you get to compile the generated code.
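The error module of section 10 might look roughly like the sketch below. Only the name put_error() and its two arguments come from the yyerror() example; set_error_filename(), format_error(), get_error_count(), and the exact "file:line: message" layout are assumptions made for illustration.

```c
#include <stdio.h>

/* Name of the file being compiled, as taken from argv[] by main(). */
static const char *error_filename = "<unknown>";

/* Number of errors reported so far. */
static int error_count = 0;

void set_error_filename(const char *filename) { error_filename = filename; }

/* Build a gcc-like "file:line: message" string in out. */
int format_error(char *out, size_t outsize, int line_nr, const char *message)
{
    return snprintf(out, outsize, "%s:%d: %s",
                    error_filename, line_nr, message);
}

/* Print one error message to stderr and count it. */
void put_error(int line_nr, const char *message)
{
    char buffer[512];
    format_error(buffer, sizeof buffer, line_nr, message);
    fprintf(stderr, "%s\n", buffer);
    ++error_count;
}

/* Lets main() choose between a zero and a non-zero return code. */
int get_error_count(void) { return error_count; }
```

With this in place, main() could end with a statement such as return get_error_count() == 0 ? 0 : 1;.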