CS2383 Programming Assignment 3
October 18, 2014
Due: November 4, at the end of our class period.

Due to the midterm and the holiday, the assignment will be accepted with a 10% penalty until the end of our class period on November 13. After this, it will not be accepted. You will have been assigned Program 4 before this, and it is probably unwise to still be working on Program 3 after the midterm.

1 Introduction

In this program, you will make a simple compiler for expressions. The key data structure is an expression tree. You will read in an expression, perform some optimization on it, and then generate a low-level Java program that can evaluate the expression.

The website contains a skeleton of the Java code for my solution. It leaves out all the interesting stuff, but some of the more boring code has already been written for you. You are to complete the skeleton. Testing with JUnit is not required for this program.

2 Details

Details of the input parser, optimizer, and code generator follow.

2.1 Input Parser

The expression to be compiled is given in reverse Polish notation (RPN). If you have not encountered this concept, an RPN expression is what you get from a postorder traversal of an expression tree. For instance, the expression we would normally write in our familiar infix notation as 2*4 + c*d becomes 2 4 * c d * + in RPN.
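To make the link between postorder traversal and RPN concrete, here is a minimal sketch. The classes ExprNode and RpnDemo are hypothetical and exist only for this illustration; the skeleton on the website uses LinkedBinaryTree<String> instead, and its methods are not shown here.

```java
// Hypothetical minimal expression-tree node, for illustration only.
// (The assignment's skeleton uses LinkedBinaryTree<String> instead.)
class ExprNode {
    final String label;    // e.g. "2", "c", or "*"
    final ExprNode left;   // null for a leaf
    final ExprNode right;  // null for a leaf

    ExprNode(String label) {
        this(label, null, null);
    }

    ExprNode(String label, ExprNode left, ExprNode right) {
        this.label = label;
        this.left = left;
        this.right = right;
    }
}

class RpnDemo {
    // Postorder traversal: left subtree, then right subtree, then the node itself.
    static String toRpn(ExprNode node) {
        if (node.left == null && node.right == null) {
            return node.label;
        }
        return toRpn(node.left) + " " + toRpn(node.right) + " " + node.label;
    }

    public static void main(String[] args) {
        // Expression tree for the infix expression 2*4 + c*d.
        ExprNode tree = new ExprNode("+",
                new ExprNode("*", new ExprNode("2"), new ExprNode("4")),
                new ExprNode("*", new ExprNode("c"), new ExprNode("d")));
        System.out.println(toRpn(tree)); // prints 2 4 * c d * +
    }
}
```

The spaces in the output are only there for readability; they are added by this sketch and are not part of the tree.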
Our expressions involve 10 variables (lower-case a through j), constants 0 through 9, and operators +, -, * and /. These operators are for integer arithmetic. I.e., 5 6 / is zero, not 5/6. Since constants and variables are single characters, we do not allow spaces in our RPN expressions. So we would use 24*cd*+ to represent the infix expression 2*4 + c*d.

Why RPN? Well, if it is not a concept you know, it's a good one to learn. Also, it is easier to parse RPN than to parse ordinary infix expressions.

RPN can be parsed using a stack of expression trees. In my solution, I use a LinkedBinaryTree<String> to represent an expression tree; leaves store strings such as "500" and "x", whereas internal nodes store strings such as "*". Thus, my parser uses a stack of these: well, actually I use a Deque<LinkedBinaryTree<String>>.

The classical RPN parsing algorithm is as follows: the characters ("tokens") in the input are processed from left to right. When you see a constant or a variable token, you construct a 1-node expression tree and push it onto the stack. When you see an operator token, you pop the top two expression trees from the stack. You make a new tree having the operator as its root, and the two popped trees as its left and right children. (You have to get the order of things right, because 13- means 1-3, not 3-1: the tree popped first is the right operand.) Having made this new tree, you push it onto the stack. Once the last token has been processed, the stack should have a single tree on it; this is the parser's output.

Input to your program is delivered via the command-line arguments, accessed by reading the parameter that main() takes. If you were developing your program with the command-line tools, then when you ran your program, you could specify the input expression. E.g.

    prompt> java Compiler "23-ab+*"

With Eclipse, under the Run menu, there is an option for Run Configurations. There is a tab for Arguments, and you could type your 23-ab+* into the Program arguments box.
Other IDEs (at least NetBeans and JGrasp) have similar mechanisms.

2.2 Code Generation

Compilers usually generate either native machine code or assembly language. You'll learn about that in CS2253. The key idea is that the generated code consists of very simple statements that each do only one basic operation. We'll generate this kind of code, except as a bunch of Java statements that are in the body of a method whose parameters include our expression's variables. The course website shows a number of example outputs.

So the RPN expression a2+b* should lead to a sequence of statements
like the following:

    int temp1 = a;
    int temp2 = 2;
    int temp3 = temp1+temp2;
    int temp4 = b;
    int temp5 = temp3*temp4;

Your program should do the code generation from the expression tree, rather than trying to do something based directly on the RPN input. A (custom) postorder traversal of the expression tree will be required. Since it's a postorder traversal, it is not surprising that the generated code will follow the order of operations in the RPN. It does not make sense to attempt code generation until you can successfully parse the input. It should be possible to run the Java compiler on the code you have produced.

2.3 Optimization

A sophisticated compiler will do many optimizations to the given code. Your compiler will do constant folding and implement a few algebraic simplifications.

Constant folding occurs whenever the compiler can find an operator whose operands are constants. So, in the infix expression 2*a + (5*7 + 6) the compiler can first fold 5*7 into 35, then fold 35+6 into 41. Thus the expression now corresponds to 2*a + 41.

Every algebraic identity that a mathemagician can derive can lead to an opportunity for a compiler to do algebraic simplification. To keep things manageable, you need to implement simplifications based only on the following identities:

    X * 0 = 0 and 0 * X = 0, where X is any expression;
    X - X = 0, where X is any expression.

Note that one optimization can lead to an opportunity for another. For instance, in (5 / 6) * (a + b + c), the constant folding leads to 0 * (a + b + c), which can be further optimized to 0. After you have optimized the expression tree, there should be no remaining opportunities for constant folding or our limited set of algebraic simplifications.

The trickiest part of the optimizer is probably the last simplification, since we have to make sure that both the left and right subtrees of
the - operator are identical expressions. So (a+b)-(a+b) needs to be detected. However, your optimizer does not need to take into account properties such as commutativity: it does not need to detect (a+b) - (b+a), for instance. The subexpressions a+b and b+a are different subexpressions.

The optimizer can be tackled as soon as the input parser works. The optimizer and the code generator are independent of one another.

3 Difficulty Levels

A real compiler requires dozens of person-years of effort by highly skilled developers. So you might be worried about this homework. However, a real compiler would have expressions as a fairly minor component.

You should set aside at least 10 hours for this assignment. (I think CS students do not realize that programming assignments are inherently much more time consuming than the weekly homeworks in other courses. How much more time consuming depends on how badly you need additional programming practice.)

The skeleton omits about 15-20 (nonblank, noncomment) lines from my sample solution's parserpolish(). You've been told the basic algorithm, and so this should be the easiest part of the homework. Still, it is worth 50% of the homework.

The skeleton also omits code for code generation. Again, it is about 15 lines (mostly in a method invoked from codegenerate()). This code is trickier and involves a custom postorder traversal of the expression tree. It is worth 30% of the homework.

Finally, the skeleton omits optimization code. There are 15-20 lines in a method to determine whether two expressions are syntactically identical (needed for the subtract algebraic optimization). There are about 20 lines (some fairly long) that do the rest of the optimizations. You will probably find the full optimizer challenging to write, although basic constant folding is not too difficult.
The full optimizer is worth 20% of the homework; i.e., one cannot get an A on the program without getting at least some of the optimizer working, but anyone who has to choose between getting code generation working and optimization should choose the former.
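To illustrate how constant folding can open the door to the multiply-by-zero simplification from Section 2.3, here is a rough sketch that traces the (5 / 6) * (a + b + c) example. It is not the sample solution: FoldNode and FoldDemo are hypothetical names invented for this illustration, the skeleton's LinkedBinaryTree<String> is not used, and the X - X identity is deliberately left out. Leaving a division by a constant zero unfolded is also an assumption of this sketch; the assignment does not say how that case should be handled.

```java
// Hypothetical minimal expression-tree node, for illustration only.
class FoldNode {
    final String label;
    final FoldNode left;   // null for a leaf
    final FoldNode right;  // null for a leaf

    FoldNode(String label) {
        this(label, null, null);
    }

    FoldNode(String label, FoldNode left, FoldNode right) {
        this.label = label;
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }

    // A constant is a leaf whose label is an (optionally negative) integer.
    boolean isConstant() {
        return isLeaf() && label.matches("-?\\d+");
    }
}

class FoldDemo {
    static boolean isZero(FoldNode n) {
        return n.isConstant() && Integer.parseInt(n.label) == 0;
    }

    // Simplify bottom-up, so that a folded child can enable a rule at its parent.
    static FoldNode simplify(FoldNode n) {
        if (n.isLeaf()) {
            return n;
        }
        FoldNode l = simplify(n.left);
        FoldNode r = simplify(n.right);

        // Constant folding: both operands are constants.
        if (l.isConstant() && r.isConstant()) {
            int a = Integer.parseInt(l.label);
            int b = Integer.parseInt(r.label);
            switch (n.label) {
                case "+": return new FoldNode(String.valueOf(a + b));
                case "-": return new FoldNode(String.valueOf(a - b));
                case "*": return new FoldNode(String.valueOf(a * b));
                case "/":
                    if (b != 0) {
                        return new FoldNode(String.valueOf(a / b)); // integer division
                    }
                    break; // assumption: leave division by zero unfolded
                default:
                    break;
            }
        }

        // Algebraic simplification: X * 0 = 0 and 0 * X = 0.
        if (n.label.equals("*") && (isZero(l) || isZero(r))) {
            return new FoldNode("0");
        }
        return new FoldNode(n.label, l, r);
    }

    public static void main(String[] args) {
        // (5 / 6) * ((a + b) + c)
        FoldNode expr = new FoldNode("*",
                new FoldNode("/", new FoldNode("5"), new FoldNode("6")),
                new FoldNode("+",
                        new FoldNode("+", new FoldNode("a"), new FoldNode("b")),
                        new FoldNode("c")));
        System.out.println(simplify(expr).label); // prints 0
    }
}
```

Because the children are simplified before their parent is examined, a single pass is enough here: 5 / 6 folds to 0 first, and only then does the multiplication see a zero operand.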
4 What to Submit

Supply a printout of your source code. Electronically, I will use the final version checked into your subversion repository as your submission.