
Microprogram Control: Practice Problems (Cont.)

The following microinstructions are supported by each CW in the CS:

    RR ← ALU op x    The ALU performs operation x and puts the result in the RR.
    RA ← Rx          The ALU operand Register A is loaded with the contents of Rx.
    RB ← Rx          The ALU operand Register B is loaded with the contents of Rx.
    RB ← IR(adr)     RB is loaded with the contents of the address field of the IR. (Supports immediate addressing.)
    Rx ← RR          Rx is loaded from RR (the result of an ALU operation).
    Rx ← MDR         Rx is loaded from the MDR. (For a LOAD operation.)
    MDR ← RR         MDR is loaded from the ALU Result Register. (Result store.)
    MDR ← Rx         MDR is loaded from Rx. (For a STORE operation.)
    MAR ← IR(adr)    Load MAR from the address field of the IR. (Supports direct addressing.)
    MAR ← Rx         Load MAR from Rx. (Supports register indirect addressing.)
    PC ← IR(adr)     Load the PC from the address field of the IR. (For branches.)

Questions:

1. How many bits are in the CSA field?
2. How many bits are in the Branch Control field?
3. If every possible microinstruction has its own bit in each CW, how many bits are there in each CW, not counting the CSA and BC fields? Hint: take into account that there are several GPRs and several ALU operations, including NOP.
4. Since the ALU can only do one operation at a time, the ALU control microinstructions can be encoded into a single field. How many bits will it contain?
5. Besides the ALU operations, many of the other microinstructions are also mutually exclusive (cannot be active at the same time). A set of such instructions can be encoded into a single field to reduce the total number of bits in each CW. Using this technique, find the minimum number of fields a CW may have (including the CSA and BC fields) and the number of bits in each field.
6. Using the results from questions 4 and 5, and including all fields, how many total bits are there in a CW for this design?
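The arithmetic behind questions 1, 4, and 5 is just counting the mutually exclusive options in a group and taking the ceiling of a base-2 logarithm. Here is a minimal Python sketch of that counting; every count in it (the number of GPRs, the number of ALU operations, the control-store size) is an assumption standing in for the actual values given earlier in these notes:

import math

# All counts below are assumptions; substitute the values from the notes.
NUM_GPRS   = 8    # assumed number of general-purpose registers Rx
NUM_ALUOPS = 8    # assumed number of ALU operations, NOP included
CS_WORDS   = 256  # assumed control-store size; fixes the CSA field width

def field_bits(options: int) -> int:
    """Bits needed to encode `options` mutually exclusive choices."""
    return math.ceil(math.log2(options))

# Question 1: the CSA field must be able to address any CW in the CS.
print("CSA field:", field_bits(CS_WORDS), "bits")

# Question 4: one encoded ALU field (NOP is one of the encoded options).
print("ALU field:", field_bits(NUM_ALUOPS), "bits")

# Question 5, one example group: the loads of RA are mutually exclusive,
# so 'RA <- Rx' for each GPR plus an idle code fit in a single field.
print("RA source field:", field_bits(NUM_GPRS + 1), "bits")

# Another example group: RB can be loaded from any GPR, from IR(adr),
# or from neither (idle).
print("RB source field:", field_bits(NUM_GPRS + 2), "bits")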

7. Instruction Prefetch

Recall from our earlier discussion that one measure of performance is the number of cycles, on average, it takes for the CPU to execute a single instruction. This is expressed as Cycles per Instruction (CPI).[22] The ideal is to achieve an average of one cycle per instruction.[23]

[22] Other measures of performance include MIPS (Millions of Instructions per Second) and throughput, which can be expressed as the number of jobs per unit time.

[23] A moment's thought and you will realize that this implies either that all instructions take one cycle (unlikely), or that some instructions must take less than one cycle.

Now consider the typical Fetch-Execute cycle as described earlier. As presented, executing an instruction took five cycles (Fetch, Decode, Operand Fetch, Execute, and Putaway, which we will call F, D, O, E, and P in the following diagrams), so we may presume that such a CPU achieves a CPI of 5. In actuality, the fetch cycle requires a memory access which, if the instruction is in main storage (RAM), will generally take much more than one cycle. Similarly, the operand fetch cycle may also require a memory access, perhaps more than one. For the remainder of this discussion we will assume a register-based architecture that finds all operands in GPRs, which can be accessed in one cycle. That still leaves the memory access for the instruction itself. Let's call this memory access time $T_m$. Then the number of cycles it takes to execute an instruction is $T_m + 4$. If $T_m = 10$ (and memory access times in large mainframes might typically be much larger than this), then we have a CPI of 14, nowhere near the CPI of 1 we desire. In the figure below we show three such instructions being executed back to back. Notice that we gain some benefit from overlapping the fetch of an instruction with the execution of the previous instruction, since the execution hardware and the storage access hardware have no resources in common.

[Timing diagram: three instructions back to back, each F D O E P, with each instruction's F overlapped with the E and P cycles of the previous instruction.]

Let's suppose, however, that our memory organization has been designed so that more than one request can be handled at a time by the memory controls and arrays. Our CPU can only issue one memory request per cycle through the MAR, but there is no reason that after the first memory request is issued we cannot issue a second request on the cycle immediately following, and a third request on the cycle after that, and so on. Then the first two instructions in the above picture might look as follows:

[Timing diagram: the second instruction's F is issued one cycle after the first instruction's F and completes during the first instruction's Operand Fetch cycle.]

The second instruction becomes available to the CPU and ready to be decoded during the Operand Fetch cycle of the first instruction. Since the CPU is still busy with the previous instruction, the second instruction must be saved in a local storage location until it can be processed. This storage location is called the Instruction Prefetch Buffer (IPB), and the process of bringing a new instruction into the CPU before it is needed is called instruction prefetching. With the second instruction in the IPB, it is available for processing as soon as the first instruction completes its putaway cycle.

Let's do some calculations. Without prefetching, the execution of two instructions took $2(T_m + 4)$ cycles, for a CPI of $T_m + 4$. With prefetching, it now takes $T_m + 4 + 4$ cycles, for a CPI of $(T_m + 8)/2 = T_m/2 + 4$. If $T_m = 10$ this is a change in CPI from 14 to 9. If we add the third instruction to the picture, we have

[Diagram P1: three instructions with prefetching; each F is issued one cycle after the previous one.]

and the CPI $= (T_m + 12)/3 = T_m/3 + 4 \approx 7.33$.

We have shown what is actually the very beginning of an instruction sequence. In the steady state (that is, we have been fetching and executing instructions for a long time), we might see something like this:

[Diagram P2: steady state with prefetching; the F cycles are completely hidden and each instruction's D, O, E, P sequence follows the previous instruction's putaway immediately.]

This yields a CPI of 4, and represents the best case performance for a four-cycle instruction execution sequence. We might generalize, then, that the best achievable CPI is equal to the number of cycles in the Fetch/Execute cycle, assuming all instructions are immediately available from the IPB.
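To make the trend concrete, here is a tiny Python sketch of the same arithmetic. The formulas are exactly the ones in the text; only the loop values are illustrative:

def cpi_no_prefetch(tm: int) -> float:
    """Every instruction pays the full instruction fetch: Tm + 4 cycles."""
    return tm + 4

def cpi_prefetch(tm: int, n: int) -> float:
    """Only the first fetch is exposed; the other n-1 fetches overlap."""
    return (tm + 4 * n) / n

tm = 10
print("without prefetch:", cpi_no_prefetch(tm))          # 14
for n in (2, 3, 100):
    print(f"with prefetch, n={n}:", round(cpi_prefetch(tm, n), 2))
# n=2 -> 9.0, n=3 -> 7.33, n=100 -> 4.1: approaching the steady-state CPI of 4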

Of course, we have ignored a lot of practical difficulties. First, we have assumed (unstated) that there was no limit to the number of instructions we could prefetch. In practice, resource constraints will put an upper bound on the size of the IPB and the number of instructions we can hold in it. Second, we have not taken into consideration that instructions are not always executed in the order in which they reside in memory (which is the order we are fetching them in). For instance, any one of the instructions might be a branch, which requires execution to be continued from some other instruction location that has not been prefetched. In this case the prefetch buffer will need to be flushed (reset, emptied) and a new sequence of instructions will be fetched. We will come back to this later. Third, we have assumed that all instructions take the same number of cycles to execute; this is certainly not the case in reality. So, while it appears that we can approach a CPI equal to the average instruction processing time (counting only the time to Decode, Fetch Operands, Execute, and Putaway), we should not expect to achieve it with prefetching alone. On the other hand, we can do better, but we need more than prefetching.

Practice Problems

Consider the situation described above ($T_m = 10$ and all instructions take four cycles).

1. What is the minimum capacity of the IPB required to support a 4 CPI steady state performance?
2. At what maximum rate can we fetch instructions in the steady state, regardless of the size of the IPB?

Pipelining

We saw, during our discussion of CPU controls, that the Instruction Register is the source of decoded control signals for the entire sequence of instruction processing, and is unavailable to another instruction until the putaway cycle for the current instruction has completed. Let's suppose that we replicate the IR so that there are four IR registers, connected serially to each other as shown in Diagram P3. We will label them IR1 thru IR4. Further assume that the instruction is initially loaded into IR1, but that on the next cycle it is moved from IR1 to IR2, leaving IR1 empty and available. Similarly, on each subsequent cycle the instruction moves from IR2 to IR3, then from IR3 to IR4, and finally leaves the CPU altogether. While the instruction is in IR1, the op code field is decoded and the Decode cycle of the instruction execution sequence is completed. While it is in IR2, the operand fields are decoded and the Operand Fetch cycle is performed; while it is in IR3, the actual ALU operation is decoded off the Op Code field and execution takes place; and in IR4 the Putaway cycle occurs. This is diagrammed in [P3] below.

    IR1   Decode Op Code of Instruction
    IR2   Decode Operand Fields and Fetch Operand
    IR3   Decode Op Code and control ALU
    IR4   Decode Result Operand Field and Putaway

    Diagram P3

Each IRx is supplied with its own set of decoders, as appropriate. (Notice, by the way, that we have eliminated, in an expensive fashion, the need for a timing chain or counter to control the timing of control signals.) With this organization, we can see that when an instruction has moved from IR1 to IR2, IR1 is now available for the next instruction. On the next cycle, the first instruction moves to IR3, the second instruction moves to IR2, and we can immediately start a third instruction in IR1. Diagram [P1] can now be redrawn as follows:

[Diagram P1': with the pipeline, a new instruction enters IR1 (Decode) on every cycle once its fetch has completed.]

In the steady state this becomes:

[Diagram P2': steady-state pipeline; a new instruction enters IR1 and a putaway completes on every cycle.]

The register organization shown in Diagram P3 is called a pipeline, by analogy with any pipeline (such as an oil pipeline) into which things are sequentially loaded at one end and flow through the pipe until they come out, in the same order, at the other end. In the steady state diagram [P2'], notice that a putaway occurs on every cycle. It appears that we have finally achieved our goal of a CPI of 1. Of course, the concerns expressed previously still hold, especially the effect on the pipeline of branches in the instruction stream.
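A minimal cycle-by-cycle simulation of the IR1..IR4 organization makes the one-putaway-per-cycle claim easy to check. This sketch assumes the ideal case in which an instruction is always waiting in the IPB; the instruction names are made up:

# Simulate the four-register IR pipeline of Diagram P3.
program = ["I1", "I2", "I3", "I4", "I5", "I6"]
pipe = [None, None, None, None]          # IR1, IR2, IR3, IR4
cycles = putaways = 0

while program or any(pipe[:3]):          # run until the pipe drains
    cycles += 1
    pipe = [program.pop(0) if program else None] + pipe[:3]  # shift IR1 -> IR4
    if pipe[3] is not None:              # instruction in IR4: Putaway occurs
        putaways += 1
    print(f"cycle {cycles}: IR1..IR4 =", [i or "-" for i in pipe])

print("CPI =", cycles / putaways)        # 9/6 = 1.5 here; tends to 1 for long runs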

Practice Problem

In the steady state Diagram [P2']:

1. At what rate must instruction fetches be issued to memory?
2. What is the minimum size of IPB required?
3. What restriction, if any, must be placed on the length of $T_m$?

Branch Prediction

As we have seen, there are two types of branches: conditional and unconditional. In the case of an unconditional branch there is no question but that the instructions which follow the branch in memory will not be executed. If they have been prefetched, then the IPB contains instructions that apparently will never be executed; they have to be flushed out of the IPB and a new sequence of instructions must be fetched starting at the target of the branch.[24] In the case of conditional branches, a number of strategies are available. We will consider four of them.

[24] In some cases the unconditional branch may have as its target instructions that are, in fact, in the IPB, just not the next sequential one. With the addition of appropriate hardware this can be detected and flushing the IPB may be prevented. Of course, we can also prevent flushing a large percentage of the time by making the IPB large enough to hold most loops in their entirety.

1. Assume the branch is always taken. This strategy arises from the observation that many, if not most, branch instructions appear as the decision point in a loop structure. For instance, consider the following subset of instructions from some instruction sequence:

              load R, m
    loop_adr: instruction 1
              ..
              instruction n
              decrement R
              jnz loop_adr
              instruction k

In this sequence of instructions, register R is loaded with some positive integer m. Then instructions 1 through n are executed and the value in R is decreased (decremented) by 1. The jnz (Jump if Not Zero) instruction examines the Zero flag bit (which was set as a result of the decrement instruction). If the contents of R are not zero, then instruction execution continues with the instruction at loop_adr (instruction 1); otherwise instruction k (and the instructions that follow it) are executed. Since R contains some integer which is usually much larger than 1, in most cases the branch is taken. That is, instructions 1 through n are executed many times and instruction k is only executed once. It makes sense, then, in a pipelined machine, to assume that such conditional branches are, in fact, always taken. This design will make the right decision most of the time, and it is only when R goes to zero and we fall through the loop that the IPB and the pipeline will need to be flushed. In the best case, the IPB is large enough so that instructions 1 through n can all reside in it at the same time, eliminating the need for any instruction memory accesses while the loop is in operation. The benefits of this form of branch prediction are small, at best.

2. Predict depending on branch instruction. This is a refinement of strategy 1 described above. In strategy 1, the branch is assumed to be taken regardless of the condition being tested. Suppose the example above is modified as follows; the changed instructions are marked.[25]

              load R, m
    loop_adr: instruction 1
              ..
              instruction n
              decrement R
              jz next_adr       ; changed
              jmp loop_adr      ; changed
    next_adr: instruction k

It would not be a good idea here to assume the jump (jump on zero, jz) is always taken, since the programmer has here elected to write the program in such a way that it is only taken when the loop is finished. In fact, based on these two examples, we might guess that jnz should always be assumed to be taken, but jz should always be assumed to be not taken! It turns out that the percentage of time we guess correctly can be improved if we examine the kind of branch and base the decision about which instruction stream to fetch on that, rather than blindly always assuming the branch is taken, as was done with strategy 1. The branch penalty can be reduced to about 70% of that without branch prediction.

[25] This sequence performs exactly as the previous one. Such coding is sometimes required due to restrictions placed on the operands of conditional branches by a particular architecture. For instance, in Intel architectures the target of a conditional branch must be within 128 bytes of the branch instruction. If the loop is larger than 128 bytes, then the second sequence shown must be used instead of the first.

3. Branch History Table. A Branch History Table (BHT) keeps track of what occurred previously on the execution of each branch instruction. In the simplest case, a single bit can be associated with each kind of branch instruction. If a particular bit is set to 0, that means that on the previous execution of this instruction the branch was not taken. We would then guess that the next time the same instruction occurs we should continue fetching the in-line instruction stream. If the branch is taken, then the bit for that branch is set to 1, and the next time that branch occurs we will guess to take it, and start fetching the target instruction stream instead of the in-line instruction stream. In the loop example from strategy 1, for instance, the first time we encounter the jnz instruction we would incorrectly guess to continue with instruction k; however, we would set the bit in the BHT to 1, and the next time the jnz instruction was encountered (the second time through the loop) we would see the bit in the BHT is now set to 1 and guess to fetch instruction 1. We will continue to guess to fetch instruction 1 (correctly) until register R becomes zero, and then we will misguess again. The BHT bit for jnz will be set back to 0. A refinement of this scheme uses more than one bit for each branch instruction, so that more than just the last instance of the instruction's execution can be recorded. With two bits, for instance, we can record how many of the last four times a jnz instruction was encountered the branch was taken.

The above BHT schemes provide significant improvement over strategies 1 and 2, but none of these schemes takes into account that the same instruction type can occur in multiple places in a program. In the worst case, it is possible that an additional jnz instruction, for instance, may occur as one of the instructions 1 thru n in the loop, totally destroying the benefits of these schemes. For this reason, it is usual that a BHT not only record the previous actions for each branch instruction, but also the addresses associated with the branch instruction. Thus, a different guess can be made for the branch instruction at the address at the end of the loop than is made for a branch instruction at an address in the middle of the loop, or someplace else entirely. Further refinements and performance improvements can be achieved by recording both the address of the branch and the target address, along with the previous history. Using Branch History Tables, the branch penalty can be reduced to about 50% of the penalty without branch prediction.

4. Fetch both Instruction Streams. Another solution, which attempts to avoid the question of branch prediction altogether, is to simply prefetch both the target and in-line instruction streams. This implies two Instruction Prefetch Buffers, or some way of tagging instructions to indicate which instruction belongs to which stream. This solution, however, does not completely solve the branch problem, for the simple reason that either, or both, of the streams may contain additional branches. If an additional branch is encountered before the previous branch has been resolved, then we would need to start prefetching a third instruction stream, and so on. The benefits of this approach depend on the distribution of branches in the kinds of programs typically run on the machine under consideration. For instance, if branches do not occur too close together, this solution eliminates 100% of all branch penalties. But this is not the general case.

In practice, combinations of options 2, 3, and 4 are all implemented to minimize branch penalties in pipelines.
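As an illustration of scheme 3, here is a small Python sketch of a BHT indexed by branch address. It uses a two-bit saturating counter, one common form of the two-bit refinement mentioned above (not necessarily the exact scheme the notes have in mind); the address and the outcome sequence are made up:

class BranchHistoryTable:
    def __init__(self):
        self.counters = {}                   # branch address -> 2-bit counter

    def predict(self, addr: int) -> bool:
        """Guess 'taken' when the counter is in one of its two upper states."""
        return self.counters.get(addr, 0) >= 2

    def update(self, addr: int, taken: bool) -> None:
        """Saturating count: move toward 3 on taken, toward 0 on not taken."""
        c = self.counters.get(addr, 0)
        self.counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop branch (say, the jnz at address 0x40) taken 9 times, then falling through:
bht = BranchHistoryTable()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += bht.predict(0x40) == taken
    bht.update(0x40, taken)
print(f"{hits}/{len(outcomes)} correct")     # 7/10: two warm-up misses, one at loop exit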

Practice Problems

1. Assuming branches are encountered frequently in a given program, order the four branch penalty reduction strategies from least to most beneficial.
2. In general, lacking any knowledge of specific program mixes, which branch penalty reduction method is the best?

Practice Problems (Cont.)

As an example, suppose you are given the sequence of instructions ADD, MPY, BRC, ADD, ADD and are asked to draw the timing diagram for the case of a correctly guessed branch. The correct diagram is shown below. (Note that the branch cannot be resolved until the previous MPY has completed its execution cycle.)

[Timing diagram: rows for ADD, MPY, BRC, ADD, ADD; the BRC's execute cycle stalls (shown as --) until the MPY's E cycle completes, delaying the two ADDs that follow it.]

1. What is the CPI for the sequence shown? (The branch is counted as an instruction.)
2. Draw a similar diagram for the case when the BRC prediction is incorrect (misguessed). Assume a single MPY instruction is in the alternative (target) instruction stream, and that it takes one cycle to fetch the new instructions.
3. What is the CPI for the diagram in question 2? (The branch is counted as an instruction.)
4. Consider the following sequence of instructions. These are three-address instructions which are otherwise identical to the instructions described above.

       MPY R1, R6, R7    (R1 ← R6 × R7)
       ADD R8, R3, R2    (R8 ← R3 + R2)
       ADD Rx, Ry, Rz    (Rx ← Ry + Rz)

   a. Draw the timing diagram assuming out-of-order execution is not allowed.
   b. Draw three timing diagrams assuming out-of-order execution is allowed, there are separate multiply and add execution elements, and x, y, and z in the third ADD are actually the following (register) numbers:
      i.   x = 4, y = 5, z = 9
      ii.  x = 4, y = 1, z = 9
      iii. x = 1, y = 5, z = 9
   c. Identify the kind of dependency, if any, introduced by the assignments of x and y for each of i, ii, and iii in part b.
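For checking hand-drawn diagrams like the one above, a small printing helper can be handy. Everything here is a made-up illustration (a two-ADD schedule in the ideal pipeline), not the answer to any of the numbered questions:

def draw(schedule, width):
    """schedule: list of (name, {cycle: stage letter}) pairs."""
    print(" " * 8 + " ".join(f"{c:>2}" for c in range(1, width + 1)))
    for name, stages in schedule:
        cells = " ".join(f"{stages.get(c, '-'):>2}" for c in range(1, width + 1))
        print(f"{name:<8}{cells}")

# Two back-to-back ADDs in the four-stage pipe (D, O, E, P):
draw([("ADD1", {1: "D", 2: "O", 3: "E", 4: "P"}),
      ("ADD2", {2: "D", 3: "O", 4: "E", 5: "P"})], width=5)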

Out-of-Order Execution

In addition to branches, the efficiency and effectiveness of a pipelined machine can be seriously reduced when instructions cannot all be executed in the same number of cycles. Some instructions, such as a multiply (MPY) instruction, take much longer than others (SHIFT or ADD, for instance). In some cases, the resources needed to execute an instruction are not available. For instance, many machines improve the performance of a multiply instruction by having separate special-purpose hardware (in addition to the ALU) devoted to executing a multiply instruction. Floating Point (FLP) operations are also usually executed with dedicated hardware reserved for this purpose.[26] If a floating point instruction is decoded and the FLP execution unit is in the process of executing a previous FLP instruction, the pipeline is stalled until the FLP unit becomes available. In the diagram below, four instructions are shown: an ADD, two MPY instructions (which take five execute cycles using the ALU), and another ADD.

[Diagram P4: ADD, MPY, MPY, ADD through the D, O, E, P stages; each MPY occupies the ALU's E stage for five cycles, stalling the instructions behind it.]

In this diagram, the pipeline has stalled after the first MPY instruction has started executing, because the ALU is now busy for five cycles instead of just one. There is now a 4-cycle gap, or bubble, in the pipeline before the second MPY instruction can execute, and there is an 8-cycle bubble between the operand fetch of the second ADD and its execution. Let's suppose that the ALU is not required for the MPY instructions because there is additional special-purpose hardware available, and this hardware allows an MPY to be executed in just 2 execution cycles. Diagram [P4] now looks like this:

[Diagram P5: the same four instructions with a dedicated two-cycle multiply unit; the bubbles shrink accordingly.]

[26] Processors with multiple execution engines, such as floating point units, multiply units, shifters, etc., are referred to as superscalar processors. They are capable of executing multiple instructions simultaneously, one in each special unit.

The bubbles in the pipeline have been reduced from 4 and 8 cycles to 1 and 2, a 75% reduction in penalty. But notice that while the MPY instructions are executing, the ALU is idle. It would certainly be nice if the second ADD instruction could make use of the ALU without having to wait for the MPYs to finish. We would then have the following picture:

[Diagram P6: the second ADD executes on the idle ALU while the MPYs use the multiply unit, and completes before them.]

As far as the second ADD is concerned, there is no penalty incurred due to the extra time it takes to do an MPY instruction. We have executed the ADD out of order with respect to the order of the instructions in the program. This is best seen by looking at the order in which the Putaway cycles occur in time:

    The first ADD's putaway occurs on cycle 4
    The first MPY's putaway occurs on cycle 6
    The second MPY's putaway occurs on cycle 8
    The second ADD's putaway occurs on cycle 7

What do we have to be careful of if we want to be able to perform instructions out of order in this fashion? We need to study the data dependencies of the various instructions. The above diagrams assume that none of the instructions are using resources (registers, memory addresses) used by any other instruction. In reality this is virtually never true; it is in fact likely that each instruction builds on the results of previous instructions. Let's consider the kinds of dependencies that can exist between instructions.

1. Data Read Dependence. We cannot execute an instruction if its operands include the results of a previous instruction which has not yet completed. Accessing the operand(s) must be delayed. In the current example, we cannot execute the ADD instruction if its operands include the results of either of the MPYs. In the simplest organization, the O cycle of any instruction cannot occur before the P cycle of any previous instruction which provides results that become operands of the subsequent instruction. (Note that this is true even in the normal pipeline without out-of-order execution.) Here is a portion of diagram [P2'] modified to show what happens if the second instruction needs, as operands, the result of the first instruction:

[Diagram P7: the second instruction's O cycle is delayed until after the first instruction's P cycle.]

In this case, the operand fetch can't be done until the P cycle of the previous instruction is complete. In practice, hardware is frequently provided which examines the operand addresses of contiguous instructions and allows the results from the ALU to be fed back immediately into the ALU operand register, resulting in the following diagram:

[Diagram P8: with result forwarding, the second instruction proceeds without waiting for the first instruction's putaway.]

2. Data Store Dependence. We cannot putaway the results of an instruction if the result address is the same as the result address of a previous instruction. In the current example, if the second MPY's result address is the same as the second ADD's result address, the contents of that address will be the results of the MPY, not of the ADD. This violates a fundamental rule of computer architecture and design: the programmer (and his/her program) must observe results to occur in the order in which the instructions appear in the program. In practice, this problem is often resolved by allowing the execution of the ADD to proceed, but deferring the putaway until all previous instructions' putaways have been completed. This will still allow significant performance improvement.

[Diagram P5': the second ADD executes early, but its putaway is deferred until the putaways of both MPYs have completed.]

3. Data Store/Read Dependence. We cannot putaway the results of the second ADD if the result address is the same as the location of an operand for a previous instruction that has not yet executed. This is a relatively rare dependency, as it requires that an earlier instruction be delayed quite a bit, far enough so that its operands aren't even accessed before a later instruction has completed execution, something like:

[Diagram P9: a later instruction's P cycle falls before an earlier instruction's delayed O cycle.]

In diagram [P9] the second result is being stored in the same location as one of the first instruction's operands. It therefore cannot execute as shown, but must look like:

[Diagram P9': the second instruction's putaway is held until the first instruction's operands have been fetched.]

4. Resource Dependency. Two instructions cannot be executed out of order if they require the same resources (the ALU, for instance) at the same time.

We can summarize the first three dependencies by illustrating each with a pair of three-operand-address instructions and showing the relative positions of the shared data resource (Rx in each pair, highlighted in the original):

    Data Read Dependency:
        OP Rx, Rw, Rz    (Rx ← Rw OP Rz)
        OP Ry, Rx, Rz

    Data Store Dependency:
        OP Rx, Ry, Rz
        OP Rx, Rv, Rw

    Read/Store Dependency:
        OP Ry, Rx, Rz
        OP Rx, Rw, Rz

There are other dependencies which can occur that we will not go into here. The existence of dependencies such as those discussed here requires hardware to be implemented which examines all operand and result addresses of the instructions in the pipeline and allows or prevents out-of-order execution as appropriate, always observing the rule that, at the end of the day, the program and user should observe results in the order of the instructions in the program.
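The summary above translates directly into the check such hardware must perform. Here is a Python sketch that classifies the dependency between two three-address instructions written as (result, operand, operand) tuples; the test pairs are exactly those in the summary:

def classify(first, second):
    """Return the dependencies of `second` on `first` (result, src1, src2)."""
    r1, s1a, s1b = first
    r2, s2a, s2b = second
    kinds = []
    if r1 in (s2a, s2b):
        kinds.append("data read dependence (second reads first's result)")
    if r2 == r1:
        kinds.append("data store dependence (same result register)")
    if r2 in (s1a, s1b):
        kinds.append("store/read dependence (second overwrites first's operand)")
    return kinds or ["none"]

print(classify(("Rx", "Rw", "Rz"), ("Ry", "Rx", "Rz")))  # data read
print(classify(("Rx", "Ry", "Rz"), ("Rx", "Rv", "Rw")))  # data store
print(classify(("Ry", "Rx", "Rz"), ("Rx", "Rw", "Rz")))  # store/read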

Practice Problems

1. Calculate the CPI for the following diagrams in the notes:

       a. P4    b. P5    c. P6    d. P5'    e. P7    f. P8

2. Consider a CPU design with a four-stage pipeline (Decode, Operand Fetch, Execute, and Putaway), a multiply unit as well as an ALU, and an architecture with three instructions of different lengths (in cycles), as follows:

   Multiply (MPY) takes six cycles to process. The instruction format is MPY Rx, Ry and it performs the operation Rx ← [Rx] × [Ry]. [Timing diagram omitted.]

   Add (ADD) takes four cycles to process. Both operands are in GPRs and are made available in one cycle. The instruction format is ADD Rx, Ry and it performs the operation Rx ← [Rx] + [Ry].

   Branch on Condition (BRC) takes two cycles to process. During the execute cycle the results of the previous ALU operation are examined and either the target address or the next instruction is executed, as appropriate. Assume branch prediction of some sort. If the branch prediction is incorrect, assume that the fetch of the next instruction takes one cycle. The instruction format is BRC Target and it performs the operation PC ← Target if the condition is met.

   [Timing diagram: BRC occupies D and E; on a misguessed branch the target instruction's F, D, O sequence begins after the branch resolves.]

   Consider the following sequence of instructions. These are three-address instructions which are otherwise identical to the instructions described above.

       MPY R1, R6, R7    (R1 ← R6 × R7)
       ADD R8, R3, R2    (R8 ← R3 + R2)
       ADD Rx, Ry, Rz    (Rx ← Ry + Rz)

   a. Draw the timing diagram assuming out-of-order execution is not allowed.
   b. Draw three timing diagrams assuming out-of-order execution is allowed, there are separate multiply and add execution elements, and x, y, and z in the third ADD are actually the following (register) numbers:
      i.   x = 4, y = 5, z = 9
      ii.  x = 4, y = 1, z = 9
      iii. x = 1, y = 5, z = 9
   c. Identify the kind of dependency, if any, introduced by the assignments of x and y for each of i, ii, and iii in part b.


More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Data Paths and Microprogramming

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Data Paths and Microprogramming Computer Science 324 Computer Architecture Mount Holyoke College Fall 2007 Topic Notes: Data Paths and Microprogramming We have spent time looking at the MIPS instruction set architecture and building

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - 2014/2015 Von Neumann Architecture 2 Summary of the traditional computer architecture: Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

William Stallings Computer Organization and Architecture. Chapter 11 CPU Structure and Function

William Stallings Computer Organization and Architecture. Chapter 11 CPU Structure and Function William Stallings Computer Organization and Architecture Chapter 11 CPU Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data Registers

More information

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines 6.823, L15--1 Branch Prediction & Speculative Execution Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 6.823, L15--2 Branch Penalties in Modern Pipelines UltraSPARC-III

More information

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17" Short Pipelining Review! ! Readings!

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17 Short Pipelining Review! ! Readings! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 4.5-4.7!! (Over the next 3-4 lectures)! Lecture 17" Short Pipelining Review! 3! Processor components! Multicore processors and programming! Recap: Pipelining

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Pipeline Thoai Nam Outline Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information