cmps104a 2002q4 Assignment 2 Lexical Analyzer page 1

$Id: asg2-scanner.mm,v 327.1 2002-10-03 13:20:15-07 - - $

1. The Scanner (Lexical Analyzer)

Write a main program, string table manager, and lexical analyzer for the language c0 that you will be compiling this quarter. The usage and options were described in the first assignment. The main program will scan the input with the following code, which will be removed in assignment 3 and replaced by a call to the parser:

   int token_code;
   yyin = fopen( /*... argv[?] or something...*/ );
   for(;;){
      token_code = yylex();
      if( token_code == YYEOF ) break;
      fprintf( stderr, "yylex() returned %d (yytext=%s).\n", token_code, yytext );
   };

Flex reads characters from the FILE* yyin, which must point at a valid file structure before yylex() is called. Whatever you called this file in the first assignment, change it to yyin. Note that yylex() returns YYEOF (which is 0) when it hits end of file. The scanner should dump its tokens itself from a semantic action when the -t flag is set.

Warning: This is where the course project really starts. The string table assignment was really just a Data Structures assignment, which you should have found rather easy. This assignment, together with the parser of the next assignment, is the «real stuff». A failing grade in the scanner or parser assignment will result in failing the course.

The scanner specification should be placed in a file with a .l suffix, such as scanner.l. At the beginning of this file, ensure that at least the following #includes are present in the C declarations:

   %{
   #include "yyexternals.h"
   #include "tokenast.h"
   %}

2. Options

You must implement all of the options from the previous assignment, and all options for any assignment must carry forward to future assignments. In this case, the -t option will cause the tokens to be dumped into program.tok and the -L option will cause the flex-generated scanner to produce its debug output by setting yy_flex_debug to 1.
See assignment 1 for information pointing at dbx.

3. Global interface

You will need a set of global declarations for communication among the various modules. Try not to make too much of a hash of things and do not use globals when it is possible to avoid them. The file yyexternals.h should contain:

   int yylex( void );
   int yyparse( void );
   extern FILE *yyin;
   extern char *yytext;
   #define YYEOF 0

4. The Token AST ADT

You must also implement a Token Abstract Syntax Tree. For the current assignment, you don't need any tree implementation code, as each token is a stand-alone unit. For the parser assignment, you must add tree management code to your ADT. The file tokenast.h should contain:

   #define YYSTYPE TokenAST_ref
   typedef struct TokenAST *TokenAST_ref;
   #include "parser.h"

The ordering of things above is important. YYSTYPE is a macro which defines the type of the objects on the parser's semantic stack. It is used by parser.h, and must be defined before parser.h is included. Hence, including parser.h from inside tokenast.h ensures that things will always be defined in the correct order. A sample parser.h is to be found in the dummy-parser subdirectory.

With every token recognized there should be a semantic action which creates a new struct TokenAST with malloc() and initializes the various fields as appropriate. The external declaration yylval will automatically be generated from the scanner and will be of type TokenAST_ref, so an appropriate statement to create a token node is:

   yylval = malloc( sizeof (struct TokenAST) );

Then fill in the various fields. Note that you don't need to bother free()ing the nodes in this assignment. That, of course, leads to a storage leak, but in the next project, instead of abandoning the nodes, you will link them into a parse tree.

In your implementation file, you will declare the various fields:

   int token_code;
      is a copy of the token code to be returned by yylex(). It will be useful later when walking the parse tree. It also means that every lexical semantic action that returns a token may terminate with the following statement (actually, you will need to write some access functions to have the equivalent effect):

         return yylval->token_code;

   int serial_nr;
      is a token serial number consisting of line_nr * 1000 + offset, where offset is either the character number of the token within the current line or a unique integer within the current line. This will be used for two purposes: generating semantic error messages so that they can properly reference input line numbers; and choosing unique label numbers in the generated intermediate code.
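As a worked example of the serial number arithmetic, the following sketch packs a line number and an offset into one integer and recovers the parts again. The helper names are hypothetical and not part of the required interface, and the sketch assumes that the offset always stays below 1000, since otherwise the line number could not be recovered by integer division.

```c
#include <stdio.h>

/* Hypothetical helper: combine a line number and an in-line offset
   (assumed to be in the range 0..999) into a single serial number. */
int make_serial_nr(int line_nr, int offset) {
    return line_nr * 1000 + offset;
}

/* Recover the line number, e.g. for use in an error message. */
int serial_line_nr(int serial_nr) {
    return serial_nr / 1000;
}

/* Recover the offset within the line. */
int serial_offset(int serial_nr) {
    return serial_nr % 1000;
}

/* Print the serial number in the line.offset form, using %8.3f. */
void print_serial_nr(FILE *out, int serial_nr) {
    fprintf(out, "%8.3f", (double) serial_nr / 1000.0);
}
```

For example, a token at offset 3 of line 16 gets the serial number 16003, and serial_line_nr(16003) gives back 16.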
   StringNode_ref lex_info;
      is a pointer to a string node created from the lexical information found by yylex(). Strictly speaking, this is unnecessary for tokens that carry no necessary semantic information, but it is easier to include it in every token. When lexical information needs to be associated with a token, it can be done as follows, after the malloc() of a new token. Note: yytext is declared by the scanner to point at the text of the last-recognized token.

         yylval->lex_info = intern_stringtable( stringtable, yytext );

In the next assignment, struct TokenASTs will be the nodes in the abstract syntax tree, and hence a facility to enter them into an n-way tree will be needed. Note that the parser's semantic stack needs to have a uniform type, and so it should be made into a stack of TokenAST_refs.

5. Tokens in the c0 language

The language c0 has the following tokens in it:

   special symbols:  = + - * / % & == != > >= < <= ; , ( ) [ ] { }
   reserved words:   int char void return if else while
   tokens with lexical information:  identifiers and literals (character, integer, and string), all with C syntax. You do not need to interpret the semantics of literal tokens, just write a pattern to recognize them.

Comments in c0 are just like in C and are skipped over and never returned to the parser. They are not tokens. Comments also begin with the hash (#) character and continue up to but not including the newline character. Thus, C #includes are treated as comments as well. This is a hack so that gcc can compile c0 programs with the inclusion of appropriate header files.

According to the flex manual, here is a scanner which discards C comments and white space while maintaining the current input line counter:

   %x comment
   %%
   "/*"                    { BEGIN( comment ); }
   <comment>[^*\n]*        { }
   <comment>"*"+[^*/\n]*   { }
   <comment>\n             { line_count++; }
   <comment>"*"+"/"        { BEGIN( INITIAL ); }
   \n                      { line_count++; }
   [\t ]+                  { }

6. Dumping to the .tok file

The function make_token() should dump each node to the debug file as the node is created. Each token dumped to program.tok should have the format:

     16.003  264  TOK_KW_RETURN   (return)
     16.010   61  =               (=)
     20.008  258  TOK_IDENT       (hello)
     20.010  271  TOK_LIT_INT     (1234)
     25.002  123  {               ({)
     26.008  272  TOK_LIT_STRING  ("beep\007")

The first column contains (double) serial_nr / 1000.0 in %8.3f format, followed by the integer token_code, followed by the symbolic name of the token code. Lastly, if there is any lexical information associated with the token, it is printed between parentheses exactly as stored in the string table, except that any character that is not isgraph() is printed as three octal digits following a backslash and the backslash is printed as two backslashes.

The following function, if it appears in the third part of the parser source, can be used to translate an integer symbol number into a symbolic name for a grammar symbol:

   const char *token_code_name( int token_code )
   /*
   * input:  numeric token code (symbol)
   * result: symbolic (char*) name of input token_code
   */
   {
      return yytname[ YYTRANSLATE( token_code ) ];
   }

Do not worry about the contents of the c0_lib.h file until the symbol table assignment. Specifically, the sample test data shows these symbols generated into the string table. They will not be there until you have the symbol table assignment done. The sample output is thus a little advanced for the current assignment.

You should still link in the dummy parser in this project in order to make some undefined external references disappear at link time. Doing this is also necessary in order to make the function token_code_name() available to the scanner.
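The escaping rule for lexical information in the .tok dump can be sketched as follows. This is only an illustration under stated assumptions: the function name escape_lex_info() and its buffer-based interface are hypothetical, and your own make_token() may just as well write directly to the debug file.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the required interface): copy src
   into dest, escaping characters as the .tok dump requires.
   A backslash becomes two backslashes; any other character that is
   not isgraph() becomes a backslash followed by three octal digits.
   Returns 0 on success, or -1 if dest is too small. */
int escape_lex_info(char *dest, size_t size, const char *src) {
    size_t used = 0;
    const unsigned char *p;
    for (p = (const unsigned char *) src; *p != '\0'; ++p) {
        char piece[5];  /* longest piece is \ooo plus the terminator */
        if (*p == '\\') {
            strcpy(piece, "\\\\");
        } else if (isgraph(*p)) {
            piece[0] = (char) *p;
            piece[1] = '\0';
        } else {
            sprintf(piece, "\\%03o", (unsigned) *p);
        }
        if (used + strlen(piece) + 1 > size) return -1;
        strcpy(dest + used, piece);
        used += strlen(piece);
    }
    if (size == 0) return -1;
    dest[used] = '\0';
    return 0;
}
```

Note that a space is not isgraph(), so under this rule the three characters a, space, b are dumped as a\040b.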
The function token_code_name() must be defined in the parser file since it uses the macro YYTRANSLATE, which is defined therein. The command

   bison -dtv -o parser.c parser.y

can be used to generate the output C parser.

7. Debugging generated C programs

Why am I getting the following error message?

   /cats/gnu/sparclib/bison/bison.simple:270: parse error before )

This is a recurring problem caused by the stupid way that the C compiler works (or doesn't). It first runs a preprocessor over the program and then compiles the output thereof. In order to be «helpful», bison and flex put in #line directives to point errors at the original source, but with matchfix operator errors, this can lead to confusing error messages. bison.simple is the prototype parser into which your actions are merged. The message means that the error occurred somewhere in front of where it is reported, but that could be anywhere, and the printed line numbers are not necessarily of any use at all.

First, edit the parser.c file and delete all #line directives. Recompile, and see if the error message refers to a more meaningful location. The problem is in code which you wrote in your .y file and which was propagated to the .c file.

Second, if that doesn't work, use the command:

   gcc -E parser.c > plain.c

This will preprocess the program so that you can see exactly what is being compiled when you gcc plain.c.

If that doesn't work, apply the binary search technique to the program. Comment out all your semantic actions: {/*... */} or /*{... }*/, and #ifdef out your section 3 code:

   #ifdef COMMENT_OUT
   your section 3 code
   #endif

Do the same in your section 1 %{... %} declarations. If you recompile, the error (hopefully) will be gone, because the offending code will be gone. Then put the code back in a little at a time until the error comes back. Especially: check for mismatched matchfix operators like {} [] () /**/.

Of course, if you are using the options -ansi -Wall -pedantic when trying to compile the generated K&R code, you'll get a ton of warnings. So compile the generated code without those options and only use the «friendly» options when compiling code you wrote yourself.

8. Avoid keywords in the lexical grammar

The following is a very poor way of recognizing reserved words:

   "if"      { return make_token( KW_IF ); }
   "while"   { return make_token( KW_WHILE ); }
   ...etc...
   {IDENT}   { return make_token( IDENTIFIER ); }

A much better way to do it is as follows:

   {IDENT}   { return make_ident_token( IDENTIFIER ); }

where the function make_ident_token() first searches for yytext in a reserved word table and then returns either the code for IDENTIFIER or one of the keyword codes, as appropriate. Searching a keyword table can be done with the C library function bsearch(). A linear search is NOT acceptable, NOR is a sequence of if-else statements.

Alternatively, instead of a reserved word table, you could statically initialize an array of String_nodes and then insert them into the string table by a function similar to the intern function, but which does not allocate any new storage. That way, looking up a string in the string table will automatically distinguish between an identifier and a reserved word. Of course, it would require an extra bit in the string table.

As an experiment, let's take all of the C++ keywords and drop them into a scanner and see what is produced. If there are no keywords in the lexical grammar, flex produces the following:

   221/2000 NFA states
   57/1000 DFA states (266 words)
   509 state/nextstate pairs created
   101/408 unique/duplicate transitions
   57/1000 base-def entries created
   655/2000 (peak 0) nxt-chk entries created

   static const struct yy_trans_info yy_transition[683] =
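Before the second measurement, here is a minimal sketch of the reserved word lookup that make_ident_token() could perform with bsearch(). It is an illustration, not the required implementation: the enum values are placeholders (in the real project the token codes come from parser.h), and the names struct keyword, compare_keyword(), and ident_token_code() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Placeholder token codes; the real ones are generated into parser.h. */
enum {
    IDENTIFIER = 258,
    KW_CHAR, KW_ELSE, KW_IF, KW_INT, KW_RETURN, KW_VOID, KW_WHILE
};

struct keyword {
    const char *name;
    int token_code;
};

/* The table must stay sorted in strcmp() order for bsearch() to work. */
static const struct keyword keywords[] = {
    { "char",   KW_CHAR   },
    { "else",   KW_ELSE   },
    { "if",     KW_IF     },
    { "int",    KW_INT    },
    { "return", KW_RETURN },
    { "void",   KW_VOID   },
    { "while",  KW_WHILE  },
};

/* bsearch() passes the search key first and a table element second. */
static int compare_keyword(const void *key, const void *element) {
    const struct keyword *entry = (const struct keyword *) element;
    return strcmp((const char *) key, entry->name);
}

/* Return the keyword code for text, or IDENTIFIER if it is not reserved. */
int ident_token_code(const char *text) {
    const struct keyword *found = bsearch(text, keywords,
        sizeof keywords / sizeof keywords[0], sizeof keywords[0],
        compare_keyword);
    return found != NULL ? found->token_code : IDENTIFIER;
}
```

A make_ident_token() built on this would call ident_token_code( yytext ) and store the result as the token code of the new node. The scanner then needs only the single {IDENT} rule, whatever the number of reserved words.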

If all of the C++ keywords are put in the lexical grammar, flex produces the following:

   632/2000 NFA states
   275/1000 DFA states (1536 words)
   6681 state/nextstate pairs created
   773/5908 unique/duplicate transitions
   275/1000 base-def entries created
   12271/14000 (peak 0) nxt-chk entries created

   static const struct yy_trans_info yy_transition[12323] =

The statistics come from the output of running flex, and the line of C declaration is from the generated scanner. As you can see, the numbers for the second scanner are MUCH larger: at 4 bytes per entry of yy_transition, that is 2732 bytes for the scanner without keywords and 49292 bytes for the scanner with keywords. It does not take 46560 bytes of memory to store a keyword table. And these numbers are just for the array containing the FSM integer codes.

9. Flex options: -pp -8bdsv -CeF

A good set of options to use with flex is: -pp -8bdsv -CeF.

   -pp   generate a performance report for both major and minor performance losses.
   -8    generate an 8-bit clean scanner.
   -b    generate backup information.
   -d    compile the scanner in debug mode.
   -s    suppress the default rule to find holes in the rule set.
   -v    generate summary stats.
   -Ce   construct equivalence classes to reduce the scanner size.
   -CF   generate an alternate fast scanner.

You can use whichever options work for you.

10. The Error reporting module

You must have an error handling module which will accept error messages in various formats. One of its functions must be called yyerror() and have a specific format. Error messages should be printed in a format similar to that printed by gcc, namely with the filename, line number, and specific message. For the scanner and the parser, yyerror() will be used, and the current line number maintained by your scanner code can be printed. For other phases, the line number from the token node can be printed.
One thing you will need when you link in the dummy parser is a function:

   void yyerror( const char *message ){
      put_error( yylineno, message );
   }

It should in turn call your own error message function. You should have an error message function which prints to stderr the name of the file in error (i.e., the file whose name you got from argv[]), the line number in that file most closely associated with the error, and the text of the error message. It should also maintain an error count so that main() knows whether to return a zero or non-zero return code.

11. Gcc options

Both flex and bison produce old-style K&R code which, when compiled with the -ansi option, generates many warnings. Suppress this option when you compile the generated code, but only for that code. Also, never put more than the absolute minimum amount of C code in either the .l or the .y file. Use function calls and includes and put the code elsewhere whenever possible. This will tend to reduce the number of times the compiler fails to warn you about non-ANSI things in your code.

In addition, flex and bison do not understand C code. They simply take whatever you have in the semantic actions between squiggle brackets and in sections one and three and copy it to the output file. Errors in the C code will not show up during the flex or bison phase, but only when you get to compile the generated code.
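The error message function of section 10 can be sketched as follows. Only the name put_error() and its two arguments come from the yyerror() example above; the helpers set_error_filename() and get_error_count(), and the exact gcc-like message layout, are assumptions made for illustration.

```c
#include <stdio.h>

/* Name of the file being compiled; hypothetical default until main()
   sets it from argv[]. */
static const char *error_filename = "-";

/* Number of errors reported so far. */
static int error_count = 0;

/* Hypothetical helper: remember the file name to print in messages. */
void set_error_filename(const char *filename) {
    error_filename = filename;
}

/* Print one message to stderr as filename:line: message, and count it. */
void put_error(int line_nr, const char *message) {
    fprintf(stderr, "%s:%d: %s\n", error_filename, line_nr, message);
    ++error_count;
}

/* Hypothetical helper: let main() decide on its return code. */
int get_error_count(void) {
    return error_count;
}
```

With this in place, main() could end with a statement such as return get_error_count() == 0 ? 0 : 1; so that the return code is non-zero whenever any error was reported.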