cmps104a 2002q4          Assignment 3          LALR(1) Parser
$Id: asg3-parser.mm,v 327.1 2002-10-07 13:59:46-07 - - $

1. Summary

Write a main program, string table manager, lexical analyzer, and parser for the language c0 that you will be compiling this quarter. The usage and options were described in the first assignment. Include options to generate the .str file and the .tok file, as before. In addition, dump the abstract syntax tree into the .ast file. The -a option will be used to request that the .ast file be generated. Remember that the -v option causes all lower-case options, including -a, to be turned on.

Use bison to generate the parser. The main program calls yyparse() with no arguments. The result is 0 for a successful parse and 1 for a failed parse. However, that result does not register your own errors. Include your own error handler which counts errors. Return zero to Unix on success and non-zero if any error messages were generated.

IMPORTANT: You must implement all of the options from previous assignments in this assignment, and options from this assignment must be implemented in later assignments. For debugging, you must implement both L and Y, which turn on, respectively, the scanner's debugger and the parser's debugger.

2. The Metagrammar

When reading the grammar of c0, it is important to distinguish between the grammar and the metagrammar. The metagrammar is the grammar that describes the grammar. You must also use your knowledge of C to fill in what is only implied. The metalanguage redundantly uses fonts and typography to represent concepts for the benefit of those reading this document via simple ASCII text. It looks prettier in the Postscript version. The notation used is:

   x...      Three dots mean that the preceding symbol occurs one or more times.
   [x]       Square brackets indicate that the symbol(s) occur zero times or once.
   [x]...    Square brackets and three dots mean that the symbol(s) occur zero or more times.
   x | y     A stick indicates alternation between its left and right operands.
   while     Symbols representing themselves are written in Courier-Bold.
   symbol    Nonterminal symbols in the grammar are written in lower-case Roman.
   IDENT     Token classes with lexical information are written in upper-case Italic.

3. The Grammar

Following is the context-free syntax of c0. You will need to translate it into an LALR(1) grammar acceptable to bison. You may, of course, take advantage of bison's capacity to handle ambiguous grammars. The dangling else problem in the grammar below is to be resolved in the usual way. Operator precedence is given in a separate table.

   program   ->  [ [decl] ; | function ]...
   decl      ->  type declobj
   function  ->  fntype fnobj params fnblock
   type      ->  char | int | void
   declobj   ->  * declobj | object
   fntype    ->  type
   fnobj     ->  * fnobj | IDENT
   params    ->  ( [ decl [ , decl ]... ] )
   fnblock   ->  { [ decl ; | stmt ]... }
   stmt      ->  { [stmt]... }
              |  while ( expr ) stmt
              |  if ( expr ) stmt [ else stmt ]
              |  return [expr] ;
              |  [expr] ;
   expr      ->  ( expr )
              |  expr BINOP expr
              |  UNOP expr
              |  IDENT ( [ expr [ , expr ]... ] )
              |  object
              |  CHAR_LIT
              |  INT_LIT
              |  STRING_LIT
   object    ->  IDENT [ [ [expr] ] ]

In the object rule, the middle pair of square brackets is the literal subscript operator; the outer and inner pairs are metasymbols.

Note that you will not be able to feed the grammar above to bison, because it will not be able to handle BINOP and UNOP as you might expect. You will need to explicitly enumerate all possible rules with operators in them. However, using bison's operator precedence declarations, the number of necessary rules will be reduced. Following is a table of operator precedence and associativity. It is the same as that of C, except that C has more operators.

   Operators     Arity      Associativity   Precedence
   if else       ternary    right           lowest
   =             binary     right           .
   < <= > >=     binary     left            .
   == !=         binary     left            .
   + -           binary     left            .
   * / %         binary     left            .
   + - & *       unary      right           .
   () []         variable   left            highest

4. Semantic Information

Void declarations are syntactically valid but semantically invalid, and will be caught later during symbol table insertion. That is, the declaration «void foo;» should be accepted by the parser, as it is too much trouble to suppress it. Your symbol table handler will figure out that this is wrong. In general, you should be fairly forgiving in the parser: accept things which are not strictly valid, and later add a semantic check to reject the error.

For example, the grammar above gives the idea that an expression is optional on a return statement. That is not true. It is either required or prohibited, depending on whether the function is void or non-void. However, you cannot determine that until you have a symbol table.

Declarations of objects above imply an indefinitely large number of pointer indirections. Example: «int ****x;» is syntactically valid and the parser will accept this. However, your symbol table manager might generate a semantic error rejecting this. Attempting to make the parser reject this is actually more work than accepting it.

5. Required output to the .ast file

If the -a option is set, then a file with a .ast suffix will be created with a symbolic representation of the Abstract Syntax Tree after the parse is complete. (Note: this is not the parse tree.) This file, like the string table file, can be opened, dumped to, and closed in the same function. This function will call a recursive tree-walker function that will dump to the file using a prefix depth-first walk. For example, if part of the input file is

   int *mul_int( char num[], int int_num ) {
      num[ 0 ] = num[ 0 ] + int_num;
   }

then part of the output file might be:
    0.000  3e088  {}
   14.003  3ea70     int
   14.007  3eae0        *
   14.008  3eb50           mul_int
   14.015  3ebc0        ()
   14.017  3ec30           char
   14.023  3eca0              []
   14.022  3ed10                 num
   14.027  3ed80           int
   14.031  3edf0              int_num
   15.003  3ee60        {}
   16.015  3eed0           =
   16.009  3ef40              []
   16.006  3efb0                 num
   16.011  3f020                 0
   16.025  3f090              +
   16.019  3f100                 []
   16.017  3f170                    num
   16.021  3f1e0                    0
   16.027  3f250                 int_num

The first column is the serial number divided by 1000, the second column is the node's address in hexadecimal (%p) format, and the final column is the lexical information from the token node, indented by three times the depth of the node from the root of the tree.

Note that if you examine the .ast files in the samples directory, you will see that they contain more information than you can generate at the present time. The numbers in parentheses following the token are references to the declaration of variables and cannot be generated without the symbol table. The information after that comes from the code generator.

6. Pictures of AST parts

Following are pictures of the abstract syntax trees to be constructed from each reduction. The mathematical structure of the trees is shown, not internal links, so which exact set of pointers you use will depend on your tree implementation. Nodes may have zero, one, two, three, or more than three child nodes. Pic has been used to draw pictures of each of them, but you can only see those pictures in the Postscript version. Pic more or less has a hissyfit when asked to generate ASCII, but makes a valiant, though inadequate, effort. In each of the picture labels, which is all you'll see in the ASCII version, the first word on the line is the root of the tree, and the rest are its child nodes. [1]

   [1] After all, this isn't rec.arts.ascii or alt.ascii-art.

6.1 Binary operator

The subscript operator, «[]», behaves as a binary operator when it has an expression in the subscript position.

   BINOP  expr  expr
   []  IDENT  expr

6.2 Unary operator

The unary operators are: «+», «-», «&», and «*». The subscript operator, «[]», behaves as a unary operator when it has no expression in the subscript position.

   UNOP  expr
   []  IDENT

6.3 Function call

The function call operator «()» has any number of arguments, but at least one. Its first argument is to the left of the parentheses, and its others, if any, are between the parentheses and separated by commas. Throw away the commas and the right parenthesis. Use the left parenthesis as the operator.

   ()  IDENT  expr  expr  expr

6.4 Object declaration

Each type mark serves as the root of a declaration subtree. The leaf of such a tree is the identifier. Any pointer indications are at intermediate levels of the tree. See the final diagram of a function for an example.

6.5 Block of declarations and statements

A block of declarations and statements begins and ends with braces and contains sequences of declarations and statements in between. When building a statement, discard the semi-colon. When building sequences thereof, discard the trailing «}» and use the leading «{}» as the operator which is the parent of the declarations and statements. Then chain them together in the expected way:

   {}  decl  decl  stmt  stmt

6.6 Control structures

The control structures «while» and «if» use the keyword as the root of a tree whose child nodes are arranged in the same way as for a binary operator. The expression at the start of the control structure is the first operand and the statement is the second operand. Note: one and only one statement is allowed, but this may be a block statement with a «{» at the root. In the case of an «if»-«else», the «if» operator is a ternary operator, with the operand of «else» as the third operand. The node «else» itself is discarded. An «if» without an «else» looks just like a «while».
   while  expr  stmt
   if  expr  stmt  stmt

6.7 Return statement

A «return» statement is either a unary or a nilary operator.

   return
   return  expr

6.8 Tree from a function declaration

A function declaration has the same tree as the corresponding variable declaration, except that the parameter list and statement block are hung off the type mark. For example:

   int *foo( int bar, int *baz ) {
      *baz = bar;
      return baz;
   }

would produce the following tree. Note that this tree is somewhat different from similar trees in the sample test data. When there is an inconsistency, follow the specifications here, and not the sample test data.
   int
      *
         foo
      ()
         int
            bar
         int
            *
               baz
      {}
         =
            *
               baz
            bar
         return
            baz

7. Beginning the Grammar

The first rules should be:

   start   : program         { $$ = ast_semantics( $1 ); }
           ;

   program : program decl    { $$ = adopt( $1, $2, NULL ); }
           | program func    { $$ = adopt( $1, $2, NULL ); }
           |                 { $$ = make_root_token( '{' ); }
           ;

That is, the parser will be called to parse a complete program. After this is done, the root node of the parse tree will be passed to the semantic routines. These semantic routines will walk the tree and dump it symbolically into the .ast file. Before doing that, they will walk the tree, annotating it with symbol table attributes (but not until project 4), and then walk it again generating intermediate code (in project 5).

Note that you will have a problem linking top-level constructs together, so you will need to create a root node under which to hang everything else. make_root_token() will just call make_token() from the previous project as a special-case kluge [2]. But before doing so, it must fiddle with yytext. This means that the final AST will be a list of decls and funcs linked by the follow links.

   [2] See /afs/cats.ucsc.edu/courses/cmps104a-wm/jargon/, the Jargon file, version 4.2.3. See «kluge».
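
The root-node arrangement in section 7 and the prefix walk in section 5 might be sketched in C as follows. This is only an illustration under stated assumptions, not the required implementation. The struct astnode layout, its field names, make_node(), and dump_ast() are hypothetical. adopt() is named in the rules above, but the signature used here (a parent followed by a NULL-terminated list of children) is a guess based on how those rules call it. The sketch prints only the lexical information; a real .ast dump would also print the serial number and the node address.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node layout: first-child / next-sibling ("follow") links. */
typedef struct astnode astnode;
struct astnode {
    const char *lexinfo;   /* token text, e.g. "{}" or "int" */
    astnode *first;        /* leftmost child, or NULL */
    astnode *last;         /* rightmost child, kept for O(1) append */
    astnode *follow;       /* next sibling, or NULL */
};

/* Allocate a leaf node; the caller keeps ownership of the lexinfo string. */
astnode *make_node(const char *lexinfo) {
    astnode *node = calloc(1, sizeof *node);
    if (node == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    node->lexinfo = lexinfo;
    return node;
}

/* Append every child in the NULL-terminated argument list to parent.
 * Assumed signature, modelled on the call adopt( $1, $2, NULL ). */
astnode *adopt(astnode *parent, ...) {
    va_list args;
    assert(parent != NULL);
    va_start(args, parent);
    for (;;) {
        astnode *child = va_arg(args, astnode *);
        if (child == NULL) break;
        if (parent->last == NULL) parent->first = child;
        else parent->last->follow = child;
        parent->last = child;
    }
    va_end(args);
    return parent;
}

/* Prefix depth-first walk: print the node, then each child in order,
 * indenting by three spaces per level of depth. */
void dump_ast(FILE *out, const astnode *node, int depth) {
    if (node == NULL) return;
    fprintf(out, "%*s%s\n", depth * 3, "", node->lexinfo);
    for (const astnode *child = node->first; child != NULL;
         child = child->follow) {
        dump_ast(out, child, depth + 1);
    }
}
```

With this layout, each reduction of the program rule appends one more top-level construct under the root, and a single call such as dump_ast( file, root, 0 ) after the parse prints the whole tree. Keeping a pointer to the last child is one possible design choice; it avoids rescanning the sibling chain on every adopt() call.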