Programming Project II
CS 322 Compiler Construction
Winter Quarter 2006
Due: Saturday, January 28, at 11:59pm

START EARLY!

Description

In this phase, you will produce a parser for our version of Pascal. Your parser will parse the token stream (output of the lexical analyzer) and build an abstract syntax tree. In particular, for this assignment you will

1. slightly modify your lexical analyzer and incorporate it with a BISON syntax analyzer.
2. modify the supplied grammar to recognize multidimensional arrays.
3. modify the supplied grammar so that it handles or resolves conflicts properly.
4. build and print out the syntax tree (in mixed infix and prefix notation).
5. perform some basic error recovery.

Preparation

START EARLY! Read this handout before you start writing code. Read the manual for BISON. Study the given code. Study the given header files and the grammar. Study the standard output files.

The Abstract Syntax Tree

As the input is parsed, an abstract syntax tree will be created. Each node of the tree represents a symbol (terminal or nonterminal). Every time the generated parser performs a reduction, it will need to create a new node for the symbol it is reducing to, as a parent of the symbols on the right-hand side of the production. To do that, you will need to specify a semantic action for each production (rule). The purpose of this action will be to compute the semantic value of the left-hand side, using the semantic values of the symbols on the right-hand side of the production.

Semantic values are values of certain attributes associated with symbols. In this stage, the semantic value of each symbol is of type Node*, representing a node in the abstract syntax tree that is being created. Note that the nodes for certain terminal symbols are created in flex.l.

Files

The following files are needed for the project:

flex.l

You will use the flex.l file you created for phase 1 of the compiler, with some modifications. Since the semantic value is a Node*, the yylval of tokens such as TUINT and TIDENT will now be a new node for that token. In addition, your flex should not echo the input and line number as it did before. The provided flex.l file already contains the most important of these modifications. You just have to

- Add any additional declarations from your flex.l
- Add the actions for comments (without echoing the comments this time)
- Add the actions for strings. Do not forget to set yylval.

Note: remove any code that prints line numbers and echoes the input. flex.l should only print when there is an error. The scanning function has been removed since the process will now be controlled by the parser. main() and yyerror() have been moved to the grammar file, for the same reason.

grammar.y

This is the grammar for our subset of Pascal. A skeleton file is provided. Notice that YYSTYPE is again defined at the beginning of grammar.y.
It specifies the data type (Node*) for the semantic values of the tokens. Thus, the constructs $$ and $n ($n = value of the nth component in the rule) are always Node*s. Study the grammar carefully to see how it produces the language.

Note the line yydebug = 1; in main(). If you uncomment it, you will get a trace of Bison's parsing actions (i.e., whether it is shifting or reducing, what state it is in, etc.). This is very useful in deciding whether your disambiguation is correct.
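
As a rough illustration of the node-building step described under The Abstract Syntax Tree, the following C++ sketch mimics by hand what a semantic action does on a reduction. Everything in it is an assumption made for illustration: SketchNode, makeLeaf, reduce, countNodes and destroy are hypothetical names and are not the classes or functions declared in ast.h. In the real parser the right-hand-side values arrive as $1, $2, ... and the result is stored in $$.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the course's Node class (the real one is in ast.h).
struct SketchNode {
    std::string label;                  // the symbol this node represents
    std::vector<SketchNode*> children;  // nodes of the right-hand-side symbols
};

// Roughly what the scanner does for a token such as TIDENT:
// create a leaf node and hand it to the parser (via yylval in the real code).
SketchNode* makeLeaf(const std::string& text) {
    return new SketchNode{text, {}};
}

// Roughly what a semantic action does on a reduction: build a parent node
// for the left-hand side whose children are the right-hand-side nodes.
SketchNode* reduce(const std::string& lhs, std::vector<SketchNode*> rhs) {
    assert(!lhs.empty());  // every node must represent some symbol
    return new SketchNode{lhs, std::move(rhs)};
}

// Count the nodes in the tree rooted at node (the root included).
std::size_t countNodes(const SketchNode* node) {
    std::size_t total = 1;
    for (const SketchNode* child : node->children) {
        total += countNodes(child);
    }
    return total;
}

// Free the whole tree; the sketch uses raw pointers to resemble Node*.
void destroy(SketchNode* node) {
    for (SketchNode* child : node->children) {
        destroy(child);
    }
    delete node;
}
```

For an input such as a + b * c, the leaves for a, b and c would be created first, then reduce("expr", {b, c}) for the product, and finally reduce("expr", {a, product}) for the sum, so the tree grows bottom-up in the order the parser reduces. The sketch leaves the operator tokens out of the tree for brevity.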

Hint: The command bison -v grammar.y will give you a file grammar.output that contains the conflicts, rules, parser states and the goto table of the LALR(1) parser.

You have to modify grammar.y as follows:

Complete the Actions

Most of the rules have actions that compute the value of the left-hand-side nonterminal from those on the right-hand side. In several cases, you need to decide for yourself whether an action is needed or not. Typically, an action will create an AST node of some type. The constructor arguments are either the children of the node (i.e., the nodes that were created earlier for certain symbols on the right-hand side of the production) or values related to those children. See the rules that we already wrote for simple type and type declaration part. Do not forget that the default semantic action is $$ = $1.

Recognize Multiple Subscripts

You will need to modify the grammar slightly to recognize multidimensional arrays (declarations and references). A multidimensional array may be declared as follows:

type x=array[1..2, 1..3] of integer

and an element of the array may be referenced in two ways:

- comma-separated subscripts. Example: arr[1,4]
- bracket-separated subscripts. Example: arr[1][4]

These two forms may be intermixed and are equivalent. For example, an element of a three-dimensional array may be referenced as arr[1,2][3]. You have to add new rules for parsing such statements.

When a multidimensional array is declared, a new MultiArrayType node should be created and then immediately split into a sequence of simple arrays. For example, the node for the array declaration shown above would essentially become

ARRAY [1..2] OF ARRAY [1..3] OF INTEGER

Multidimensional array references should be handled in a similar way.

Disambiguate the grammar

You will note that the supplied grammar is ambiguous. There are reduce-reduce conflicts as well as precedence-related ambiguities involving expressions.
Refer to the BISON documentation to see how you can resolve them, and modify the grammar accordingly. Some of the conflicts may be properly handled by Bison's default behavior. You don't have to do anything in those cases.

A note regarding the ambiguity caused by the three rules that handle variables and expressions that reduce to parameters: the goal is to instruct the parser to always reduce a variable directly to a parameter instead of reducing the variable to an expression and the expression to a parameter. A conflict may still exist in the final parser, but it must always be handled correctly. However, avoid having more than 3-4 conflicts in the final product.

Implement some error recovery

Your parser should be able to recover from the following types of error:

- Error in the condition of a while or if statement.
- Extra semicolon before the token TEND in a block of statements (in Pascal, semicolons separate statements; therefore, the last statement in a block is not followed by a semicolon).
- Missing comma or extra/wrong characters between variable names in lists of variables or parameters.

You will have to implement yyerror() with some slight modifications. It should print out not only the error message like before, but also some information about where the error occurred (the line as well as the approximate location). Bison provides special variables and macros that may be useful here. See the provided output files for an example.

ast.h

Contains class declarations for the abstract syntax tree. It is strongly recommended that you not modify this file, as it will be used throughout the project.

ast.cpp

Definitions of ast's member functions. The nodes of the tree must be visited depth-first. On a visit, the root may print something before visiting its children, between visits to its children, and/or after visits to its children. This way, when the tree is printed, it will look like the input Pascal program in infix/prefix form. Most of the debugging information is printed in infix notation, except binary expressions, which are in prefix for grading purposes. See the standard output files provided in the test directory.

You will notice several special symbols such as #, @, etc. We use these as markers so we can tell whether the correct type of node has been created (and whether your parser has reduced correctly).
For example, a single variable is represented by a Variable* node and, when it is visited, a # is printed in front of its name. If that variable is also reduced to an expression (this will not always be the case), a new VarExpr* node is created for it. When that node is visited, a @ is printed before the name.

We have provided a skeleton file with the functions you need to implement. The cerr statements are there for debugging purposes. You can use them to see in what order the nodes are visited.

symtab.h, symtab.cpp

Use your symtab.* from PA1, with no modifications.
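
The depth-first printing described for ast.cpp can be sketched as below. This is only an illustration under stated assumptions: the class names (SketchNode, SketchVariable, SketchVarExpr, SketchBinary, SketchAssign), the render helper, and the exact spacing of the output are all invented here and are not taken from ast.h or from the standard output files, which remain the authority on the required format. The sketch only shows the idea that a node may print before, between, or after the visits to its children, with # marking a variable, @ marking a variable used as an expression, and binary expressions written in prefix form.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical base class; the real hierarchy is declared in ast.h.
struct SketchNode {
    virtual ~SketchNode() = default;
    virtual void print(std::ostream& out) const = 0;
};

// A plain variable: prints the # marker in front of its name.
struct SketchVariable : SketchNode {
    std::string name;
    explicit SketchVariable(std::string n) : name(std::move(n)) {}
    void print(std::ostream& out) const override { out << '#' << name; }
};

// A variable that was reduced to an expression: prints the @ marker.
struct SketchVarExpr : SketchNode {
    std::string name;
    explicit SketchVarExpr(std::string n) : name(std::move(n)) {}
    void print(std::ostream& out) const override { out << '@' << name; }
};

// Binary expression, printed in prefix form: the operator comes
// BEFORE the visits to both children.
struct SketchBinary : SketchNode {
    std::string op;
    std::unique_ptr<SketchNode> left, right;
    SketchBinary(std::string o, std::unique_ptr<SketchNode> l,
                 std::unique_ptr<SketchNode> r)
        : op(std::move(o)), left(std::move(l)), right(std::move(r)) {
        assert(left && right);  // both operands must exist
    }
    void print(std::ostream& out) const override {
        out << op << ' ';
        left->print(out);
        out << ' ';
        right->print(out);
    }
};

// Assignment, printed in infix form: ":=" goes BETWEEN the two visits.
struct SketchAssign : SketchNode {
    std::unique_ptr<SketchNode> target, value;
    SketchAssign(std::unique_ptr<SketchNode> t, std::unique_ptr<SketchNode> v)
        : target(std::move(t)), value(std::move(v)) {
        assert(target && value);
    }
    void print(std::ostream& out) const override {
        target->print(out);
        out << " := ";
        value->print(out);
    }
};

// Convenience helper: run the depth-first print into a string.
std::string render(const SketchNode& root) {
    std::ostringstream out;
    root.print(out);
    return out.str();
}
```

With these hypothetical classes, the tree for x := a + b renders as #x := + @a @b: the assignment is infix, the sum is prefix, and the markers show which kind of node was built for each name.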

Testing

As before, we provide test cases as well as sample output which your code must match. YOUR CODE MUST MATCH THE TEST CASES EXACTLY! We will be using diff to compare your results to ours, so other than extra newlines, the rest must match.

Submitting your code

Submit flex.l, grammar.y, ast.* and symtab.* in a tarball. As usual, email your code to c22@cs.northwestern.edu. CAREFUL! cs.nwu.edu will bounce.