Chapter 6: Processor Organization and Performance

6.1 The three-address format gives the addresses required by most operations: two addresses for the two input operands and one address for the result. However, some processors like the Pentium compromise by using the two-address format because operands in these processors can be located in memory (leading to longer addresses). This is not a problem with modern RISC processors, as they use the load/store architecture. In these processors, most instructions find their operands in registers; the result is also placed in a register. Since registers can be identified with a shorter address, using the three-address format does not unduly impact the instruction length. The following figure shows the difference in instruction sizes when we use register-based versus memory-based operands. We assume that there are 32 registers and that a memory address is 32 bits long.

    Register format (23 bits):
        Opcode (8 bits) | Rdest (5 bits) | Rsrc1 (5 bits) | Rsrc2 (5 bits)

    Memory format (104 bits):
        Opcode (8 bits) | destination address (32 bits) | source1 address (32 bits) | source2 address (32 bits)

6.2 Yes, the Pentium's use of the two-address format is justified for the following reason: operands in the Pentium can be located in memory, which implies longer addresses for these operands. Comparing the formats shown below, we see that we reduce the instruction length from 104 bits to 72 bits by moving from the three-address to the two-address format.
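The bit counts above follow directly from the field widths. A quick sketch (Python, assuming the field widths stated above: 8-bit opcode, 5-bit register numbers for 32 registers, 32-bit memory addresses):

```python
# Instruction length = opcode bits + (bits per address) * (number of addresses).
OPCODE = 8   # opcode field width (assumed above)
REG = 5      # 32 registers -> 5 bits per register address
MEM = 32     # 32-bit memory address

def length(addr_bits, num_addresses):
    """Total instruction length for a given operand kind and address count."""
    return OPCODE + addr_bits * num_addresses

print(length(REG, 3))   # three-address register format: 23 bits
print(length(MEM, 3))   # three-address memory format: 104 bits
print(length(REG, 2))   # two-address register format: 18 bits
print(length(MEM, 2))   # two-address memory format: 72 bits
```

The 104-to-72-bit reduction quoted above is just the disappearance of one 32-bit memory address from the instruction.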
    Three-address format:
        Register format (23 bits):
            Opcode (8 bits) | Rdest (5 bits) | Rsrc1 (5 bits) | Rsrc2 (5 bits)
        Memory format (104 bits):
            Opcode (8 bits) | destination address (32 bits) | source1 address (32 bits) | source2 address (32 bits)

    Two-address format:
        Register format (18 bits):
            Opcode (8 bits) | Rdest (5 bits) | Rsrc (5 bits)
        Memory format (72 bits):
            Opcode (8 bits) | destination address (32 bits) | source address (32 bits)

A further reason is that most instructions end up using an address twice. Here is the example we discussed in Section 6.2.1. Using the three-address format, the C statement

    A = B + C * D - E + F + A

is converted to the following code:

    mult  T,C,D    ; T = C*D
    add   T,T,B    ; T = B + C*D
    sub   T,T,E    ; T = B + C*D - E
    add   T,T,F    ; T = B + C*D - E + F
    add   A,T,A    ; A = B + C*D - E + F + A

Notice that all instructions, barring the first one, use an address twice. In the middle three instructions it is the temporary T; in the last one it is A. This also supports using two addresses.

6.3 In the load/store architecture, all instructions except load and store get their operands from the registers; the results produced by these instructions also go into the registers. This results in several advantages. The main ones discussed in this chapter are the following:
1. Since the operands come from the internal registers and results are stored in the registers, the load/store architecture speeds up instruction execution.

2. The load/store architecture also reduces the instruction length, as addressing registers takes far fewer bits than addressing a memory location.

3. Reduced processor complexity allows these processors to have a large number of registers, which improves performance.

There are some other advantages (such as fixed instruction length) that are discussed in Chapter 14.

6.4 In Section 6.2.5, we assumed that the stack operation (push or pop) does not require a memory access. Thus, we used two memory accesses for each push/pop instruction (one to read the instruction and the other to get the value to be pushed/popped). If the push/pop operations require a memory access, we need to add one additional memory access for each push/pop instruction. This implies we need 7 more memory accesses, leading to 19 + 7 = 26 memory accesses.

6.5 RISC processors use the load/store architecture, which assumes that the operands required by most instructions are in the internal registers. Load and store instructions are the only exceptions; these instructions move data between memory and registers. If we have few registers, we cannot keep the operands and results around for use by other instructions (we will be overwriting them frequently with data from memory). This does not exploit the basic feature of the load/store architecture. If we have more registers, we can keep data longer in the registers (e.g., a result produced by an arithmetic instruction that is required by another instruction), which reduces the number of memory accesses. Otherwise, we will be reading and writing data using the load and store instructions and lose the main advantage of the load/store architecture.
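The tradeoff described in 6.5 can be made concrete with a small tally (a sketch; the instruction traces are invented for illustration, assuming one memory access per instruction fetch and one per memory operand):

```python
# Count memory accesses for computing T = C*D; X = T + T on a
# load/store machine. Each instruction costs one access for its own
# fetch, plus one for each memory operand (loads and stores touch
# memory once; register-only ALU instructions touch it zero times).

def accesses(trace):
    # trace: list of per-instruction memory-operand counts
    return sum(1 + mem_ops for mem_ops in trace)

# Enough registers: T stays in a register between instructions.
# load C, load D, mult, add, store X
enough_regs = [1, 1, 0, 0, 1]

# Too few registers: T must be spilled to memory and reloaded.
# load C, load D, mult, store T, load T, add, store X
few_regs = [1, 1, 0, 1, 1, 0, 1]

print(accesses(enough_regs), accesses(few_regs))  # 8 vs 12
```

The spill-and-reload trace pays for two extra instructions and two extra data accesses, which is exactly the traffic a larger register file avoids.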
6.6 In normal branch execution, shown in the figure below, when the branch instruction is executed, control is transferred to the target immediately. The Pentium, for example, uses this type of branching. In delayed branch execution, control is transferred to the target after executing the instruction that follows the branch instruction. In the figure below, before control is transferred, instruction y (shown shaded) is executed. This instruction slot is called the delay slot. For example, the SPARC uses delayed branch execution; in fact, it also uses delayed execution for procedure calls. Why does this help? By the time the processor decodes the branch instruction, the next instruction has already been fetched. Thus, instead of throwing it away, we improve efficiency by executing it. This strategy requires reordering of some instructions.
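The two behaviors can be contrasted with a toy interpreter (a sketch; the instruction representation is invented for illustration, and the program is an abridged version of the sequence in the figure):

```python
# Each entry is a label ('target',), an ALU op ('op', name),
# or an unconditional jump ('jump', label).
program = [
    ('op', 'x'),
    ('jump', 'target'),
    ('op', 'y'),          # the delay-slot instruction
    ('op', 'z'),
    ('target',),          # label
    ('op', 'b'),
]

def run(delayed):
    labels = {ins[0]: i for i, ins in enumerate(program) if len(ins) == 1}
    executed, pc = [], 0
    while pc < len(program):
        ins = program[pc]
        if len(ins) == 1:            # label: fall through
            pc += 1
        elif ins[0] == 'op':
            executed.append(ins[1])
            pc += 1
        else:                        # unconditional jump
            if delayed:              # delayed branch: run the delay-slot op first
                executed.append(program[pc + 1][1])
            pc = labels[ins[1]]
    return executed

print(run(delayed=False))  # ['x', 'b']
print(run(delayed=True))   # ['x', 'y', 'b']
```

With delayed execution, instruction y runs even though it sits after the jump, which is why the compiler or assembler must pick (or reorder) a safe instruction into that slot.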
    [Figure: the same code sequence under (a) normal branch execution and
    (b) delayed branch execution; in (b), instruction y in the delay slot
    (shown shaded) is executed before control transfers to target.]

            instruction x
            jump target
            instruction y
            instruction z
            instruction a
    target: instruction b
            instruction c

6.7 In the set-then-jump design, condition testing and branching are separated (the Pentium, for example, uses this design). A condition code register communicates the test result to the branch instruction. The test-and-jump design, on the other hand, combines testing and branching into a single instruction. The first design is more general-purpose in the sense that all branching can be handled using this separation. The disadvantage is that two separate instructions need to be executed: in the Pentium, for example, a cmp (compare) instruction and a conditional jump instruction are used to implement a conditional branch. Furthermore, this design needs condition code registers to carry the test result. The test-and-jump design is useful only for certain types of branches, where testing can be made part of the instruction. However, there are situations where testing cannot be done as part of the branch instruction. For example, consider the overflow condition that results from an add operation. The status of the addition must be recorded in something like a condition code register or a flag for use by a branch instruction later on. Processors like the MIPS, which follow the test-and-jump design, must handle such scenarios some other way; the MIPS processor, for example, uses exceptions to flag these conditions.

6.8 The main advantage of storing the return address in a register is that simple procedure calls do not have to access memory. Thus, the overhead associated with a procedure invocation is reduced compared to processors like the Pentium that store the return address on the stack. However, the stack-based mechanism used by the Pentium is more general-purpose in that it can handle any type of procedure call.
In contrast, the register-based scheme can handle only simple procedure invocations. For example, recursive procedures cause problems for the register-based scheme.

6.9 The size of an instruction depends on the number of addresses and on whether these addresses identify registers or memory locations. Since RISC processors use register-based instructions and simple addressing modes, there is no variation in the type of information carried from instruction to instruction. This leads to fixed-size instructions. The Pentium, which is a CISC processor, encodes instructions that vary from one byte to several bytes. Part of the reason for using variable-length instructions is that CISC processors tend to provide complex addressing modes. For example, in the Pentium, if we use register-based operands, we need just 3 bits to identify a register. On the other hand, if we use a memory-based operand, we
need up to 32 bits. In addition, if we use an immediate operand, we need a further 32 bits to encode this value into the instruction. Thus, an instruction that uses a memory address and an immediate operand needs 8 bytes just for these two components. You can see from this description that providing flexibility in specifying an operand leads to dramatic variations in instruction sizes.

6.10 There are two main reasons for this:

1. Allowing both operands to be in memory leads to even greater variations in instruction lengths. Typically, a register in the Pentium can be identified using 3 bits, whereas a memory address takes 32 bits. This further complicates the encoding and decoding of instructions.

2. In addition, no one would want to work with all memory-based operands. Registers are extensively used by compilers to optimize code. By not allowing both operands to be in memory, such inefficient code is ruled out.

6.11 If the PC and IR are not connected to the system bus, we have to move the contents of the PC to the MAR using the A bus. Similarly, the instruction read from memory is placed in the MDR register, and must then be moved to the IR register. In both cases, one additional cycle is needed, which degrades processor performance. The amount of increase in overhead depends on the instruction being executed. For example, in the instruction fetch discussed in Section 6.5.2 (page 226), we need two additional cycles for the movement of data between the PC and MAR and between the MDR and IR. This accounts for an increase of 50%.

6.12 We assume that shl works on the B input of the ALU and shifts left by one bit position. To implement shl4, we need to execute shl four times. This is shown in the following table:

    Instruction       Step   Control signals
    shl4 %G7,%G5      S1     G5out: ALU=shl: Cin;
                      S2     Cout: ALU=shl: Cin;
                      S3     Cout: ALU=shl: Cin;
                      S4     Cout: ALU=shl: Cin;
                      S5     Cout: G7in: end;

6.13 We use add to perform multiplication by 10.
Our algorithm to multiply X by 10 is given below:

    X + X  = 2X    (store this result - we need it in the last step)
    2X + 2X = 4X
    4X + 4X = 8X
    8X + 2X = 10X

This algorithm is implemented as shown in the following table:
    Instruction       Step   Control signals
    mul10 %G7,%G5     S1     G5out: Ain;
                      S2     G5out: ALU=add: Cin;
                      S3     Cout: Ain: G5in;
                      S4     Cout: ALU=add: Cin;
                      S5     Cout: Ain;
                      S6     Cout: ALU=add: Cin;
                      S7     G5out: Ain;
                      S8     Cout: ALU=add: Cin;
                      S9     Cout: G7in: end;

As shown in this table, we need 9 cycles.

6.14 The implementation is shown below:

    Instruction       Step   Control signals
    mov %G7,%G5       S1     G5out: ALU=BtoC: G7in;

6.15 MIPS stands for millions of instructions per second. Although it is a simple metric, it is practically useless for expressing the performance of a system. Since instructions vary widely among processors, a simple instruction execution rate does not tell us anything about the system. For example, complex instructions take more clocks than simple instructions; thus, the instruction rate for complex instructions will be lower than that for simple instructions. The MIPS metric does not capture the actual work done by these instructions. The MIPS metric is perhaps useful in comparing various versions of processors derived from the same instruction set.

6.16 Synthetic benchmarks are programs specifically written for performance testing; the Whetstone and Dhrystone benchmarks are examples. Real benchmarks, on the other hand, use actual programs from the intended application to capture system performance. Therefore, they capture the system performance more accurately.

6.17 Whetstone is a synthetic benchmark in which performance is expressed in MWIPS, millions of Whetstone instructions per second. This benchmark is a small program, which may not measure the system performance for all applications. Another drawback of this benchmark is that it encouraged excessive optimization by compilers, which distorts the performance results.

6.18 Computer systems are no longer limited to number crunching. Modern computer systems are more complex, and they run a variety of different applications (3D rendering, string processing, number crunching, and so on).
Performance measured for one type of application may be
inappropriate for some other application. Thus, it is important to measure the performance of various components for different types of applications.