Summary: Direct Code Generation

1 Direct Code Generation

Code generation produces the target representation (object code) from the annotated parse tree (or Abstract Syntax Tree, AST) built by syntactic and semantic analysis. The output of code generation is typically assembler code, although compilers can also translate one high level language into another (source to source compiler) or a low level language into a high level one (decompiler). We will assume here that assembler is produced.

Code generation can be direct or indirect:

Direct code generation: The object code is produced directly from the syntax tree.

Indirect code generation: The code generator produces an intermediate representation, at a level of abstraction between the parse tree and the target form. A separate process maps the intermediate code into the target form. The advantage of this approach is that all processing up to the intermediate representation is machine independent, and only the final step is machine dependent.

Code generation can also be distinguished by how it is integrated into the syntactic analysis:

A single pass compiler performs semantic actions as syntactic rules are applied, and these semantic actions generate the code (either assembler directly, or the intermediate code).

A multiple pass compiler separates syntactic analysis and code generation. The parse tree is produced in its entirety, and this is the input to the code generation phase.

The rest of this summary assumes direct code generation and a single pass compiler, generating code via semantic actions.

2 NASM Assembler

2.1 Instructions

Each instruction of a NASM file has the following form:

    label: instruction operands ; comment

Typical example:

    fadd st1

All fields are optional, with some restrictions. White space is used fairly freely: labels may have white space before them, and instructions may have none.
The colon after a label is optional, but use it for clarity.

2.2 Data Space

The data space of a program can be divided between:

Addressable memory of the program: accessed by referring to a location, e.g., location 345.

Registers: defined locations which can hold one datum each.

The program itself occupies the addressable memory of the program. Some assembler pseudo-instructions do not end up as machine code instructions, but rather reserve space, e.g.

    buffer: resb 64

reserves 64 bytes of memory at this point of the program. The label buffer can be used to access this memory.

Registers: NASM predefines names for the processor's registers, including eight 16-bit and eight 32-bit general-purpose registers. Our examples will work with the following 32-bit registers: EAX (accumulator), EBX, ECX, EDX, ESP (stack pointer), EBP, ESI, EDI.

2.3 Operands

Operands of instructions can be constants, registers, or references to locations in the memory of the program.

Constants: E.g.,

    resw 64        (reserves 64 words of memory)
    mov eax,100    (stores 100 in register eax)

Registers:

    mov eax,ebx    (moves the contents of register ebx into register eax)

Indirect addressing: placing a register in [ ] brackets indicates that the location to use is the address contained in the register.

    mov [eax],ebx  (moves the contents of register ebx into the memory location whose address is in register eax)

Expressions: any operand can be replaced by an expression, e.g.,

    mov [eax+1],ebx  (moves the contents of register ebx into the memory location found by adding 1 to the content of register eax)

Labels as operands: Labels can be used in place of registers. When translated to machine code, the label is replaced by the memory location of the instruction associated with the label. E.g.,

    wordvar: resw 2
    mov [wordvar], eax

(reserves 2 words of memory and then moves the content of register eax into the first word of this space)
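
As an illustration of the operand forms listed in section 2.3, the following C sketch shows how a code generator might spell each kind of operand in NASM syntax. The names opd_kind, operand and format_operand are hypothetical and are not part of NASM or of the compiler described in this summary.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical operand kinds, mirroring section 2.3. */
typedef enum { OPD_CONST, OPD_REG, OPD_MEM_LABEL, OPD_MEM_REG } opd_kind;

typedef struct {
    opd_kind kind;
    long value;        /* OPD_CONST: the constant; OPD_MEM_REG: displacement */
    const char *name;  /* register or label name (unused for OPD_CONST) */
} operand;

/* Write the NASM spelling of op into buf and return buf. */
static char *format_operand(const operand *op, char *buf, size_t size) {
    int n = 0;
    switch (op->kind) {
    case OPD_CONST:     n = snprintf(buf, size, "%ld", op->value); break;
    case OPD_REG:       n = snprintf(buf, size, "%s", op->name);   break;
    case OPD_MEM_LABEL: n = snprintf(buf, size, "[%s]", op->name); break;
    case OPD_MEM_REG:
        if (op->value == 0)
            n = snprintf(buf, size, "[%s]", op->name);
        else  /* %+ld prints the sign, giving e.g. [eax+1] */
            n = snprintf(buf, size, "[%s%+ld]", op->name, op->value);
        break;
    }
    assert(n >= 0 && (size_t)n < size);  /* output must not be truncated */
    return buf;
}
```

With this sketch, an OPD_MEM_REG operand with name eax and value 1 is written as [eax+1], matching the expression example above.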

3 Direct Code Generation

3.1 How code is generated

There are several ways to generate code from the syntax tree. In this course, we will assume it is done via the semantic actions connected to the syntax rules. For instance,

    E :- E1 + E2 { GEN( add, E1.loc, E2.loc) }

The semantic action calls the function GEN with the arguments provided. GEN generates assembler code for its arguments, which is saved to the object code file for this source code. GEN would be C code defined elsewhere and provided to the Bison/Yacc compiler.

When this rule is applied, the current values of the attributes E1.loc and E2.loc are substituted. E1.loc and E2.loc were calculated as E1 and E2 were parsed. The loc attribute records which variable or temporary location holds the value of the CFG symbol. GEN is responsible for resolving how these memory locations are referred to in the assembler code.

3.2 Dealing with Registers

Operations can be performed more rapidly when the operands are in registers than when they are in addressable memory. Also, some operations require one operand to be in a particular register (e.g., mul, div). For these reasons, to generate assembler instructions, we sometimes need to load variables into registers.

For instance, to generate code for:

    A = B + C

we might produce the following code:

    mov EAX, [B] ; move the contents of variable B into register EAX
    add EAX, [C] ; add the contents of variable C to register EAX
    mov [A], EAX ; move the contents of register EAX to location A

Note that we need to generate three types of assembler instructions dealing with registers:

Instructions to move variables from a memory location into a register.
Instructions to perform operations on the registers.
Instructions to store register values into a memory location.

An important job of the code generator is to keep track of where variable values are at a given point of time.
In the example above, if the prior line of source code had left the value of B in the EAX register, then it would not be necessary to generate a line of code to move B into EAX. For this reason, the compiler maintains a variable AC, which records which variable's value is currently held in the EAX register. Before generating an instruction to place a variable's value into the register, the compiler checks which value is currently held in the register, and only generates the line if needed.

3.3 The CAC function

The CAC function is C code provided by the user for use in a YACC/Bison compiler, so that it can be referenced in semantic actions. It is called to ensure that particular values are placed in the EAX register. The EAX register is sometimes called the accumulator, and CAC thus stands for Control of ACcumulator.
The CAC function is called with two variables as arguments, and generates the assembler code needed to ensure one of them is in the EAX register. It returns 0 if the first of these variables ends up in the register, and 1 if the second ends up in the register.

The code for CAC is as follows:

    int CAC (opd *x, opd *y)
    {
        if (AC==y) return 1;
        if (AC!=x) {
            if (AC!=NULL) GEN ("MOV", AC, "EAX");
            GEN ("MOV", "EAX", x);
            AC=x;
        }
        return 0;
    }

x and y represent the two variables which CAC needs to deal with. AC is a variable maintained by the compiler, keeping track of which variable is currently loaded into the EAX register (NULL if the current value is not a variable value). NOTE: AC and CAC belong to the compiler; they are not part of the assembler code.

In the first line of the function, the program checks if y is already in EAX. If so, nothing needs to be done, so the program returns 1, indicating that the y value is in the register.

In the next lines of code, the program makes sure that the current value of the register is x. If it is not, code is generated to move the value currently in the register back to its place in memory, and then code is generated to move x into the register. The line AC=x tells the compiler that, at that point, the execution of the generated code would leave the variable x in EAX.

The CAC function can be used in other code as follows. Assume we wish to generate code to add two values, leaving the result z in EAX:

    if CAC(x, y)
        GEN( add, EAX, x)
    else
        GEN( add, EAX, y)
    AC=z

The user calls CAC, which returns 0 if x is in the EAX register, and 1 if y is. CAC itself would issue 0, 1 or 2 lines of code:

If x or y were already in EAX, no code would be generated by CAC.
If AC was NULL, then CAC would generate 1 line to move x into EAX:

    mov eax, [x]

If AC was not NULL, then CAC would generate 1 line to move EAX back to its location, and another to move x into EAX:

    mov [%AC], eax
    mov eax, [x]

(where %AC would be replaced at compile time with the contents of AC)

CAC will return 0 or 1. If 1 is returned, y is in EAX, so the code above generates a line to add x to EAX; otherwise x is in EAX, and it generates a line to add y to EAX.
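
The behaviour described in section 3.3 can be tried out with a small self-contained C sketch. It assumes a simplified opd type that only carries a variable name, a GEN that appends text to an in-memory buffer rather than an object file, and two hypothetical helpers, mem and gen_add; none of these details come from the original compiler.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct { const char *name; } opd;  /* assumption: a named variable */

static char out[512];   /* generated assembler text */
static opd *AC = NULL;  /* variable whose value EAX currently holds */

/* Append one "op dst, src" line to the generated text. */
static void GEN(const char *op, const char *dst, const char *src) {
    size_t used = strlen(out);
    int n = snprintf(out + used, sizeof out - used, "%s %s, %s\n", op, dst, src);
    assert(n >= 0 && (size_t)n < sizeof out - used);
}

/* NASM memory reference for a variable, e.g. "[x]" (static buffer). */
static const char *mem(const opd *v) {
    static char buf[64];
    snprintf(buf, sizeof buf, "[%s]", v->name);
    return buf;
}

/* Ensure x or y is in EAX; return 0 if x is there, 1 if y is. */
int CAC(opd *x, opd *y) {
    if (AC == y) return 1;                           /* y already in EAX */
    if (AC != x) {
        if (AC != NULL) GEN("mov", mem(AC), "eax");  /* save current value */
        GEN("mov", "eax", mem(x));                   /* load x */
        AC = x;
    }
    return 0;
}

/* Emit code for z = x + y, leaving the sum (now z) in EAX. */
void gen_add(opd *x, opd *y, opd *z) {
    if (CAC(x, y)) GEN("add", "eax", mem(x));  /* EAX holds y: add x */
    else           GEN("add", "eax", mem(y));  /* EAX holds x: add y */
    AC = z;
}
```

Calling gen_add for A = B + C with an empty accumulator emits the load of B and the add of C. A following D + A needs only a single add, because AC already records that A is in EAX.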

4 Generating Mathematical Expressions

This section deals with the generation of code for a mathematical expression, such as A + B or A * B. This would correspond to a grammar rule such as E :- E1 + E2. Each of the Es on the right hand side can correspond to a simple constant (int or float), an identifier (a variable), another mathematical expression, or a function call.

Let's assume a bottom-up parser with code generation performed at the same time as syntactic analysis. In this case:

We generate code for E :- E1 + E2 at the time of reduction of the rule.

The recognition of E1 and E2 will also have generated some code, which thus appears in the assembly program before the code for the current rule. This code calculates the values of the right-hand-side Es.

The code we generate for the rule E :- E1 + E2 depends on where the values calculated for E1 and E2 are left. If E1 is a variable B and E2 is a constant, 120, we might simply generate lines of code such as:

    MOV EAX, [B]
    ADD EAX, 120

If prior code left the value of B already in EAX, then we would not need to generate the first line. If E1 was itself a mathematical expression, we need to generate code keeping in mind where the previously generated code left the value of E1 (possibly in EAX itself).

The other problem here is that in many languages, mathematical expressions can combine data of different types (e.g., int, long, float). Often, the number of bytes of the operands determines the register which will be used to perform the operation.

One solution is to use conditional code to generate different assembly code depending on where the values of the expression are currently stored. The problem splits into two parts:

1. Getting the values into the correct locations (at least one in a register of the appropriate type for the operation, e.g., a float or int register).
2. Generating the assembler code to perform the operation (the operator needs to be float or int).

Below we give a possible implementation of E+E (sum). The example assumes we are dealing with only three data types:

Unsigned char: 1 byte
Int: 2 bytes
Double (float): 8 bytes

These numbers can come from three sources:

In the variable space
A constant
Already in a register

Where one of the numbers is an unsigned char, it is loaded into an int register, and this register is used instead of the original location. In the process described below, it is then treated as an int.

We need three distinct assembler operations:

If both of the numbers are int, we use ADD x, y to add the numbers, leaving an int in the location of x.

If both of the numbers are double, we use the FADD x operation. This operation assumes a stack used for storing results. The first number is assumed to be
at the top of the stack, and the operation adds the operand to this location, leaving the result in place of the original value (on top of the stack).

If one number is a double, and the other an integer, the double is placed on the top of the stack, and then an FIADD x operation is used, which adds its integer operand to the value on top of the stack, leaving the result in place of the original value.

The following table could be used by the compiler as part of the generation of the operation E+E. It allows two numbers, of whatever type, and wherever located, to be added together.

Type of x operand: unsigned char, int, Register int, Constant int, double, Register double
Type of y operand: unsigned char, int, Register int, Constant int, double, Register double

Load x Swap Re-enter Re-enter Load y Load y Re-enter Re-enter Load y Re-enter Swap Load x Load x Load x Re-enter Re-enter Re-enter Re-enter ADD y,x Load y Load x FIADD x Re-enter Re-enter MOV T,x Re-enter ADD x,y ADD x,y ADD x,y MOV T,x Swap Swap Swap Re-enter Re-enter Re-enter Load y Load y Swap Re-enter Re-enter Re-enter Swap Swap Swap Re-enter Re-enter Re-enter Re-enter - Load x FADD x Re-enter Swap Load y FADD x Re-enter Re-enter Swap Swap FADD y Re-enter Re-enter

The table assumes there is a function Load within the compiler which places the named value into a register of the appropriate type. This function is driven from the following table. It assumes that the value to load is either unsigned char, int, int constant or double. The table generates distinct code depending on whether you want to load the value into an int or double register.

Load into a register of type: int, double
Type of operand to load: unsigned char, int, int constant, double

XOR RH,RH MOV RX,x MOV RX,x FLD x MOV RL,x FISTP x XOR RH,RH MOV RL,x MOV T,RX FLD T FILD x MOV T,x FLD T MOV RX,x FLD x

Integer operations load their values into a 2 byte register RX.
Each byte of RX can be accessed individually: RH is the high byte, and RL is the low byte. The operation XOR RH,RH sets all bits of RH to 0 (since the exclusive or of two identical numbers is 0). If the number to load is an unsigned char, the high byte is cleared, and the char is loaded into the low byte. If the number to load is an integer, it is loaded into both bytes directly.
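
The integer cases of Load described above can be sketched in C as follows. Only loading into the int register is covered, and the names opd_type, emit and load_int are hypothetical. RX, RH and RL are kept as the placeholder register names used in this summary.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type tags for the operand to load. */
typedef enum { T_UCHAR, T_INT, T_INT_CONST } opd_type;

static char out[256];  /* generated assembler text */

/* Append one "op args" line to the generated text. */
static void emit(const char *op, const char *args) {
    size_t used = strlen(out);
    int n = snprintf(out + used, sizeof out - used, "%s %s\n", op, args);
    assert(n >= 0 && (size_t)n < sizeof out - used);
}

/* Emit code that loads operand x (already in assembler spelling)
   into the 2-byte int register RX. */
static void load_int(opd_type type, const char *x) {
    char args[64];
    if (type == T_UCHAR) {
        emit("XOR", "RH,RH");                     /* clear the high byte */
        snprintf(args, sizeof args, "RL,%s", x);  /* char goes in the low byte */
        emit("MOV", args);
    } else {                                      /* int or int constant */
        snprintf(args, sizeof args, "RX,%s", x);  /* fills both bytes */
        emit("MOV", args);
    }
}
```

For an unsigned char operand c, load_int emits XOR RH,RH followed by MOV RL,c. For an int or an int constant it emits a single MOV into RX.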

Float operations make use of the stack (an area of memory assigned for such operations). The FLD operation loads the float operand onto the top of the stack. The FILD operation loads an integer operand onto the top of the stack, where it takes 8 bytes of space.

Let's try an example. We start with the code S+7, where S is a variable of type float, and 7 is an integer constant.

On entry to the function, we have x (=S) as a double and y (=7) as a constant int. The code for this cell is Swap; Re-enter. This means that we swap the values of x and y, and then restart the procedure.

Now we have x (=7) and y (=S), which means we look at the cell for x = const int and y = double. The code for this cell is Load x; Re-enter.

Load x is called with x as a const int, which we want to put into a double register. We thus issue the assembler code:

    MOV T, 7
    FLD T

We then perform the Re-enter command, and restart the routine with x in a double register, and y still a double variable. The cell for this case gives the commands Swap; Re-enter.

We thus re-enter with x as a double variable, and y as a double register. We thus issue the assembler:

    FADD S

and are finished, with 3 assembler commands issued.

5 Generating Conditional Instructions

5.1 The Status Flags and conditional jumping

A special register exists called the FLAGS register. It consists of a sequence of bits, each of which is set (1) or unset (0). These flags are set as the result of arithmetic operations, e.g., ADD or SUB. (On x86, MUL and DIV leave ZF and SF undefined, and the floating-point arithmetic instructions do not set them.)

ZF (Zero Flag): set if the operation results in a zero value, unset otherwise.
SF (Sign Flag): set if the operation results in a negative value, unset otherwise.

These flags can be referenced in conditional jump operations, e.g.,

    jz L100 ; jump to L100 if last op resulted in zero

5.2 Integer Comparison: CMP

The NASM instruction CMP subtracts its second argument from the first. The result is not stored anywhere, but the ZF and SF flags are set as a result of the operation.
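The CMP instruction is thus usually followed by a conditional jump. Since CMP cannot take two memory operands, one of the values is first loaded into a register, e.g.,

    MOV EAX, [A]
    CMP EAX, [B]
    JZ L1 ; jump if cmp result was zero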

5.3 Simple If statements

If-then statements can be mapped into assembler as follows. Assume code like:

    if A == B then <stmt1>

Firstly, we generate code for the comparison (CMP cannot take two memory operands, so A is loaded into a register first), e.g.,

    MOV EAX, [A]
    CMP EAX, [B]

Then we generate code to jump over the code for <stmt1> if the test fails:

    JNZ L1 ; jump if cmp result was non-zero (A differs from B)

Then we put the code for <stmt1>, e.g.

    ADD X, Y

On the line following this, we put the label from above:

    L1:

In summary, the source

    if A == B then <block>

maps to

    MOV EAX, [A]
    CMP EAX, [B]
    JNZ L1
    ... CODE FOR <block>
    L1:

We can use semantic actions to generate the assembler for the source structure. #A1 and #A2 correspond to lambda rules with associated semantic actions, used as a means to generate code in the correct location (e.g., in parsing <stmt> :- if <exp> then #A1 <block> #A2, we reduce elements in the following order: <exp>, #A1, <block>, #A2 and then <stmt>, and thus the semantic actions to produce code are performed in that order).

Attribute Grammar:

    <stmt> :- if <exp> then #A1 <block> #A2
    #A1 :- λ { Generate code to jump if exp non-zero }
    #A2 :- λ { Generate line with label }
    <exp> :-...
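
The two semantic actions of section 5.3 could be written in C roughly as below. This is a sketch under assumptions: labels are numbered L1, L2, ..., a small stack remembers the label of each open if-statement so that nested ifs pair up correctly, and the names emit, action_A1 and action_A2 are hypothetical. The jump emitted by #A1 is taken when the comparison result is non-zero, as the attribute grammar states.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NESTING 32

static char out[512];                 /* generated assembler text */
static int next_label = 1;            /* counter for fresh label numbers */
static int label_stack[MAX_NESTING];  /* labels of the ifs being parsed */
static int depth = 0;

/* Append one line to the generated text. */
static void emit(const char *line) {
    assert(strlen(out) + strlen(line) + 2 <= sizeof out);
    strcat(out, line);
    strcat(out, "\n");
}

/* #A1: after <exp>, jump past the block when the result was non-zero. */
static void action_A1(void) {
    char line[32];
    assert(depth < MAX_NESTING);
    label_stack[depth] = next_label++;
    snprintf(line, sizeof line, "JNZ L%d", label_stack[depth]);
    depth++;
    emit(line);
}

/* #A2: after <block>, place the label that #A1 jumps to. */
static void action_A2(void) {
    char line[32];
    assert(depth > 0);
    snprintf(line, sizeof line, "L%d:", label_stack[--depth]);
    emit(line);
}
```

Because the labels are kept on a stack, an if nested inside another if receives its own label, and the inner label is placed before the outer one.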

5.4 If-else statements

If-else statements are a little more complex. A typical if-else statement, with the test A == B, might generate code like:

    MOV EAX, [A]
    CMP EAX, [B]
    JNZ L1 ; jump to the else part if cmp result was non-zero
    <block1 code>
    JMP L2
    L1:
    <block2 code>
    L2:

Attribute Grammar:

    <stmt> :- if <exp> then #A1 <block1> #A2 else <block2> #A3
    #A1 :- λ { Generate code to jump to start_else if exp non-zero }
    #A2 :- λ { Generate jmp to end_ifelse; then generate label for start_else }
    #A3 :- λ { Generate line with label for end_ifelse }

6 Generating Loops

6.1 While Loops

While loops map onto assembler in much the same way as if-statements. E.g., for

    while <exp> do <instructions> end

we have:

    <loop> :- while #A1 <exp> #A2 do <instructions> end #A3
    #A1 :- λ { Generate line with a unique label for loop start }
    #A2 :- λ { Generate line with jump to end if expr fails }
    #A3 :- λ { Generate jump back to start, and label for loop end }
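
The three while-loop actions can be sketched in C as follows. The sketch assumes that each loop gets two numbered labels (start and end), that open loops are kept on a small stack so loops can nest, and that a failed test leaves a non-zero comparison result, as in section 5.3. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NESTING 16

static char out[512];                 /* generated assembler text */
static int next_label = 1;            /* counter for fresh label numbers */
static int start_label[MAX_NESTING];  /* loop-start labels of open loops */
static int end_label[MAX_NESTING];    /* loop-end labels of open loops */
static int depth = 0;

/* Append one line to the generated text. */
static void emit(const char *line) {
    assert(strlen(out) + strlen(line) + 2 <= sizeof out);
    strcat(out, line);
    strcat(out, "\n");
}

/* Emit prefix + "L<label>" + suffix, e.g. "JMP L1" or "L2:". */
static void emit_with_label(const char *prefix, int label, const char *suffix) {
    char line[32];
    snprintf(line, sizeof line, "%sL%d%s", prefix, label, suffix);
    emit(line);
}

/* #A1: before <exp>, reserve both labels and place the loop-start label. */
static void while_A1(void) {
    assert(depth < MAX_NESTING);
    start_label[depth] = next_label++;
    end_label[depth] = next_label++;
    emit_with_label("", start_label[depth], ":");
    depth++;
}

/* #A2: after <exp>, leave the loop when the test failed (non-zero). */
static void while_A2(void) {
    assert(depth > 0);
    emit_with_label("JNZ ", end_label[depth - 1], "");
}

/* #A3: after the body, jump back to the start and place the end label. */
static void while_A3(void) {
    assert(depth > 0);
    depth--;
    emit_with_label("JMP ", start_label[depth], "");
    emit_with_label("", end_label[depth], ":");
}
```

Both labels are reserved in #A1 so that #A2 can already refer to the end label, which is only placed later by #A3.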

Below is code from a real while loop:

    topwhile:                        ;a label to mark the top of this WHILE loop
        mov eax, 3                   ;planning to invoke function 3: read from a file
        mov ebx, [infileid]          ;the file ID must be placed into register ebx
        mov ecx, mybyte              ;the address of memory to receive file content
                                     ;must be placed into register ecx
        mov edx, 1                   ;the number of bytes to read is placed in edx
        int 80h                      ;invokes a kernel function according to
                                     ;the number in register eax
        cmp eax, 0                   ;check whether a byte was read
        je dunwhile                  ;skip the body if no bytes were read
        xor byte [mybyte], 00001111b ;[] dereferences, thereby refers to the
                                     ;contents at mybyte
        mov eax, 4                   ;planning to invoke function 4: write to a file
        mov ebx, [outfileid]         ;the file ID must be placed into register ebx
        mov ecx, mybyte              ;the address of memory to write from must be
                                     ;placed into register ecx
        mov edx, 1                   ;the no. of bytes to write must be placed in edx
        int 80h                      ;invokes a kernel function according to no. in eax
        jmp topwhile                 ;go back to the top of the loop
    dunwhile:                        ;jump to here if no byte is read

6.2 Repeat Loops

    <loop> :- repeat #A1 <instructions> until <exp> #A2
    #A1: Generate unique label for loop start
    #A2: Generate jump to end if last result zero
         Generate unconditional jump back to beginning
         Generate label for loop end
    <instructions>: Code for <instructions> generated by other productions
    <exp>: Code for <exp> generated by other productions

7 Generating Code from Functions

This section covers the generation of code for functions. This includes the generation of function calls and the generation of the code of the function body itself. Three important issues here are:

1. How are the parameters passed to the function?
2. How are local variables represented within the function?
3. How are values returned from the function?

There are many possible ways to implement functions, and it is up to the person writing the code generator to decide how to do it. We describe here one of the more standard ways of generating functions and function calls.

7.1 The Stack Space

Our implementation of functions depends heavily on the use of a stack in the program memory. Many assemblers assign part of the addressable memory of the program to a stack to hold information about the current variable context. When we enter a function, space is allocated on top of the stack for the local variables, and when we exit from the function, this allocated space is popped off the stack. The stack thus represents the embedded block structure we discussed under symbol tables.

The stack typically starts at the top of addressable memory, and expands downwards. So, assuming we have 1000 bytes of addressable memory, the bottom of the stack will start at address 1000. If we push a 2-byte integer onto the stack, it will occupy memory range 999-1000. If we then push an 8-byte float value onto this stack, it will occupy bytes 991-998.

A register called SP (for Stack Pointer) indicates the top of the stack. In some systems, SP will point at the next free location in the stack.
In others, it points to the lowest byte of the top element of the stack. We will assume this last approach, so in the above case, after pushing the two numbers, SP would contain 991.
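
The stack arithmetic of section 7.1 can be checked with a small C simulation. It is only a model under assumptions: memory is a byte array addressed 1 to 1000 as in the text, SP points at the lowest byte of the top element, and an empty stack is represented by SP = 1001. The names push and pop are hypothetical helpers, not processor instructions.

```c
#include <assert.h>
#include <string.h>

#define MEM_SIZE 1000

/* Addresses 1..1000 as in the text; index 0 is unused. */
static unsigned char memory[MEM_SIZE + 1];
static int sp = MEM_SIZE + 1;  /* empty stack: one past the highest address */

/* Push size bytes: move SP down first, then copy the value in. */
static void push(const void *value, int size) {
    assert(size > 0 && sp - size >= 1);  /* guard against stack overflow */
    sp -= size;
    memcpy(&memory[sp], value, (size_t)size);
}

/* Pop size bytes from the top of the stack, then move SP back up. */
static void pop(void *value, int size) {
    assert(size > 0 && sp + size <= MEM_SIZE + 1);  /* guard against underflow */
    memcpy(value, &memory[sp], (size_t)size);
    sp += size;
}
```

Pushing a 2-byte value leaves sp at 999, and pushing an 8-byte value after it leaves sp at 991, which matches the figures given above.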

7.2 The Function Call

Before calling the function, parameters are pushed onto the stack. These can then be accessed by the called routine, from the top of the stack. So that the parameters are available in the required order, they are pushed onto the stack in reverse order.

The call in source code:

    rutina(a, b)

The call in assembler:

    ...
    PUSH b
    PUSH a
    CALL rutina
    ...

The NASM instruction CALL firstly pushes the address of the following instruction onto the stack. This will be used as the return address when the function call returns.

7.3 Entering the Routine

On entering the routine, the routine firstly establishes the boundaries of the local space of the stack. A register BP (Base Pointer) is used to indicate the lowest point of the stack which is part of the current context. Consequently, the first thing a routine does on entry is to store the old value of BP onto the stack (for later recovery and restoration), and then reset BP to point at the current top of the stack (which is the point from which the local context will grow).

The first lines of any routine will thus be something like the following:

    rutina:
        PUSH BP
        MOV BP,SP

rutina is the name of the function, represented as a label in assembler. The old value of BP is pushed onto the stack, and then BP is reset to the value of SP (top of the stack).
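
The call sequence of section 7.2 could be produced by a routine like the following C sketch. It assumes the arguments are already available as assembler operand strings, and the names emit and gen_call are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char out[512];  /* generated assembler text */

/* Append one line to the generated text. */
static void emit(const char *line) {
    assert(strlen(out) + strlen(line) + 2 <= sizeof out);
    strcat(out, line);
    strcat(out, "\n");
}

/* Emit the pushes (last argument first) followed by the CALL. */
static void gen_call(const char *name, const char *args[], int nargs) {
    char line[64];
    for (int i = nargs - 1; i >= 0; i--) {  /* reverse order */
        snprintf(line, sizeof line, "PUSH %s", args[i]);
        emit(line);
    }
    snprintf(line, sizeof line, "CALL %s", name);
    emit(line);
}
```

For rutina(a, b) this emits PUSH b, PUSH a and CALL rutina, so that a ends up nearest the top of the stack when the routine is entered.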

7.4 Allocating Space for Local Variables

The next step is to allocate stack space for the local variables. The compiler works out how many bytes of memory are required for the local variables, and decrements the stack pointer (the stack grows down, remember) by this amount. In the following example, each int takes 2 bytes and the double 8 bytes, a total of 14 bytes.

Source code:

    int rutina (int a, char *b)
    {
        int i, j, k;
        double r;
        ...
    }

Assembler code:

    rutina:
        PUSH BP
        MOV BP,SP
        SUB SP,14
        ...

7.5 Referring to parameters and local variables

In the body of the function, rather than referring to variables by name, one references them in terms of offsets from the base pointer.

Parameters: parameters were pushed onto the stack BEFORE the function was called, and thus are part of the previous context; they are thus above BP. In the above example, parameters a and b can be accessed using [BP+6] and [BP+8] (note the 6 bytes used to store the old BP and the return address: 2 bytes for the old BP at [BP] and 4 for the return address at [BP+2]).

Local variables: The local variables are available under BP in memory. i, j and k are thus available as, respectively, [BP-2], [BP-4] and [BP-6]. r starts at [BP-14].

The resulting stack layout is:

    [BP+8]    b
    [BP+6]    a
    [BP+2]    return address
    [BP]      old_bp            <- BP
    [BP-2]    i
    [BP-4]    j
    [BP-6]    k
    [BP-14]   r                 <- SP
              Free Memory

On entering the routine, space is allocated for the local variables of that routine. On leaving the routine, the part of the stack used by the routine can be popped. Recursive routines thus have separate memory space for each call.

7.6 Placing the function's code

After we generate the line to allocate space for the local variables, we then generate the code for the body of the function. After the body comes the code to leave the routine. Firstly, the line MOV SP,BP resets the stack pointer to
its value before the local variables were allocated (we thus pop all the local stack space off the stack).

At this point, the top element on the stack is the old BP. We can thus issue the command POP BP, which pops this element off the stack into BP, thus resetting BP to its prior value (SP is also moved up two bytes).

At this point, the element on top of the stack is the address where execution should resume in the calling context. The RET operator pops this element off the stack, and resumes processing from that address.

Back in the calling function, after the function call, we then need to wipe the function parameters off the stack. We do this simply with ADD SP,4.

The calling code:

    PUSH b
    PUSH a
    CALL rutina
    ADD SP,4

The routine code:

    rutina:
        PUSH BP
        MOV BP,SP
        SUB SP,14
        ...
        MOV SP,BP
        POP BP
        RET

7.7 Returning Values

A function may or may not return a value. There are various ways to return a value, and it is up to the compiler writer to decide how it is done. One way is to leave the returned value in the EAX register (if it fits), or in a float register for larger numbers. Alternatively, the number could be placed on the stack, to be popped later by the calling routine.

Source program:

    ... rutina(a, b) ...

    int rutina (int a, char *b)
    {
        int i, j, k;
        double r;
        ...
        return k;
    }

Object program:

        PUSH b
        PUSH a
        CALL rutina
        ADD SP,4
        ...

    rutina:
        PUSH BP
        MOV BP,SP
        SUB SP,14
        ...
        MOV AX, [BP-6] ; move k into AX, the low 16 bits of EAX (k is a 2-byte int)
        MOV SP,BP
        POP BP
        RET
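
The bookkeeping behind SUB SP,14 and the [BP-n] offsets can be sketched in C as below. This is a minimal illustration, assuming the compiler knows the size of each local variable in declaration order; the names local_var, layout_locals and gen_function are hypothetical, and the function body is represented by a placeholder comment line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    int size;    /* size in bytes */
    int offset;  /* filled in: offset from BP (negative for locals) */
} local_var;

static char out[512];  /* generated assembler text */

/* Append one line to the generated text. */
static void emit(const char *line) {
    assert(strlen(out) + strlen(line) + 2 <= sizeof out);
    strcat(out, line);
    strcat(out, "\n");
}

/* Give each local an offset below BP; return the total frame size. */
static int layout_locals(local_var *vars, int n) {
    int used = 0;
    for (int i = 0; i < n; i++) {
        used += vars[i].size;
        vars[i].offset = -used;  /* variable starts at [BP-used] */
    }
    return used;
}

/* Emit the entry code, a placeholder for the body, and the exit code. */
static void gen_function(const char *name, int frame_size) {
    char line[64];
    snprintf(line, sizeof line, "%s:", name);
    emit(line);
    emit("PUSH BP");
    emit("MOV BP,SP");
    snprintf(line, sizeof line, "SUB SP,%d", frame_size);
    emit(line);
    emit("; ... body ...");
    emit("MOV SP,BP");
    emit("POP BP");
    emit("RET");
}
```

For the locals i, j, k (2 bytes each) and r (8 bytes), layout_locals returns 14 and assigns the offsets -2, -4, -6 and -14, the same values as in section 7.5.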