CS 614 COMPUTER ARCHITECTURE II FALL 2004


Antony McKinney

CS 614 COMPUTER ARCHITECTURE II FALL 2004
DUE : October, 005
HOMEWORK II

READ :
- Portions of Chapters 5, 7, 8 and 9 of the Sima book and
- Portions of Chapter 3, 4 and Appendix A of the Hennessy book

ASSIGNMENT: There are three problems from the Hennessy book. Solve all homework and exam problems as shown in class and in the past exam solutions.

1) Consider the piece of code in Problem 4.8 of the Hennessy book. This code is for the DAXPY application we discussed in class. Note that according to Figure.3 on page 37 of the Hennessy book, there is no DSUBUI instruction, even though the code in the problem uses it. The code has to use the available DADDI instruction instead. See the past exam questions below for the usage of the DADDI instruction.

Assume that this is machine model number : the MIPS uses the Tomasulo algorithm of Section 3. and 3.3 of the Hennessy book, as discussed in class, with enough CDB buses to eliminate bottlenecks. In addition:
- There is a perfect memory with no stalls.
- The functional unit timings are as listed on page A-74 of the Hennessy book : double-precision FP operations ADD.D, MUL.D and DIV.D take 3, and 4 clock periods, respectively.
- There are enough functional units for integer instructions not to cause stalls.
- Branch predictions are correct for the duration of the loop execution discussed below. As we know, a branch instruction takes two clock periods to run.
- Store instructions complete in the WR stage.

In which clock period will the first iteration of the loop be completed? That is, what is the last clock period in which the Write-Result stage of an instruction from the first iteration is done? To answer the question, continue with the following table :

Instruction IF ID EX WR
L.D F0, 0(R)
MUL.D F0, F0, F 3 4/
Continue

Polytechnic University Page of 0 Handout No : 5 October 6, 004
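The table-filling procedure asked for in Problem 1 can be sketched in Python. This is only a simplified model, not the convention used in class: it assumes one instruction is fetched per clock period, ID follows IF by one period, an operand written in WR in period t can be used in EX from period t+1, WR follows the last EX period, and there are no structural hazards. The class `Instr`, the function `schedule` and the latencies in the usage lines are hypothetical and chosen for illustration; branches and the rule that stores complete in WR are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    name: str                 # mnemonic, used only for display
    dest: Optional[str]       # destination register, None for stores/branches
    srcs: Tuple[str, ...]     # source registers
    latency: int              # EX clock periods (assumed value, not from the handout)


def schedule(instrs: List[Instr]):
    """Return rows (name, IF, ID, EX_start, EX_end, WR) under the stated assumptions."""
    producer_wr = {}          # register -> clock period of its latest producer's WR
    rows = []
    for i, ins in enumerate(instrs):
        fetch = i + 1                     # one fetch per clock period, starting at 1
        issue = fetch + 1                 # ID directly after IF, never stalled here
        rs_enter = issue + 1              # earliest period the instruction could execute
        # Sources are looked up before the destination is recorded, which
        # mimics register renaming: each source sees its most recent producer.
        operands_ready = max((producer_wr.get(r, 0) for r in ins.srcs), default=0)
        ex_start = max(rs_enter, operands_ready + 1)
        ex_end = ex_start + ins.latency - 1
        wr = ex_end + 1
        if ins.dest is not None:
            producer_wr[ins.dest] = wr
        rows.append((ins.name, fetch, issue, ex_start, ex_end, wr))
    return rows


# Hypothetical two-instruction example with made-up latencies (1 and 4).
program = [
    Instr("L.D", "F0", ("R1",), 1),
    Instr("MUL.D", "F0", ("F0", "F2"), 4),
]
rows = schedule(program)
for row in rows:
    print(row)
```

In the handout's "a/b-c" notation for the EX column, `rs_enter` would correspond to a and `ex_start`/`ex_end` to b and c. The real latencies and any class-specific rules have to be substituted before the numbers mean anything.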

2) Consider the same DAXPY code given in Problem 4.8 of the Hennessy book again. Note the DSUBUI instruction case mentioned in Problem 1.

Assume that the MIPS is implemented as the scalar hardware-speculative Tomasulo algorithm machine as discussed in class. That is, this is machine model number 3. There are enough CDB buses to eliminate bottlenecks. In addition:
- There is a perfect memory with no stalls.
- The functional unit timings are as listed on page A-74 of the Hennessy book : double-precision FP operations ADD.D, MUL.D and DIV.D take 3, and 4 clock periods, respectively.
- There are enough functional units for integer instructions not to cause stalls.
- Branch predictions are correct for the duration of the loop execution discussed below. As we know, a branch instruction takes two clock periods to run.

In which clock period will the first iteration of the loop be completed? That is, what is the last clock period in which the Commit stage of an instruction from the first iteration is done? Also show which instructions are flushed from the pipeline. To answer the question, continue with the following table, without showing the hardware tables :

Instruction IF ID EX WR CM
L.D F0, 0(R)
MUL.D F0, F0, F 3 4/
Continue

3) Solve Problem 3.(b) of the Hennessy book. The question is on machine model number 3 discussed in class and concerns the process of fetching the operands of an instruction. As discussed in class, there are two alternatives :
- Issue-bound fetch : fetch the operands of the instruction at issue time and place them in an RS. Eventually, the instruction is scheduled for execution on a functional unit. RSs must therefore have long value fields (Vj/Vk) to keep the operand values until the instruction is scheduled. This is what the MIPS machine in section 3.7 does.
- Schedule-bound fetch : issue the instruction without fetching the operands. Eventually, the instruction is scheduled for execution and at that moment the operands are fetched. Therefore, there is no need for long value fields in RSs to keep the operands; only the shorter Qj/Qk fields are needed.

The scheme explored in Problem 3.(b) is that operands are still fetched during issue and there are still Qj/Qk fields in RSs, but there are no value fields (Vj/Vk) in RSs.
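The difference between the two operand-fetch alternatives can be illustrated with two reservation-station entry layouts. The following Python sketch is an illustration only: the class names, the `src_j`/`src_k` fields and the idea of reading a plain dictionary as the register file at schedule time are assumptions made here, not details taken from the Hennessy book or from Problem 3.(b).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class IssueBoundEntry:
    """RS entry with value fields: operands are captured at issue or from the CDB."""
    op: str
    vj: Optional[float] = None   # operand values (long fields)
    vk: Optional[float] = None
    qj: Optional[str] = None     # producer tags, None when the value is present
    qk: Optional[str] = None

    def capture(self, tag: str, value: float) -> None:
        # A CDB broadcast fills in any operand still waiting on this tag.
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


@dataclass
class ScheduleBoundEntry:
    """RS entry without value fields: only tags, operands are read when scheduled."""
    op: str
    src_j: str                   # hypothetical: where to read operand j later
    src_k: str
    qj: Optional[str] = None
    qk: Optional[str] = None

    def capture(self, tag: str) -> None:
        # The broadcast only clears the tag; the value is not stored in the RS.
        if self.qj == tag:
            self.qj = None
        if self.qk == tag:
            self.qk = None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None

    def operands(self, regfile: Dict[str, float]) -> Tuple[float, float]:
        # Operand fetch happens here, at schedule time.
        return regfile[self.src_j], regfile[self.src_k]


# Usage with made-up tags and values.
a = IssueBoundEntry("MUL.D", vj=2.0, qk="ADD1")
a.capture("ADD1", 3.0)

b = ScheduleBoundEntry("MUL.D", "F0", "F2", qk="ADD1")
b.capture("ADD1")
fetched = b.operands({"F0": 2.0, "F2": 3.0})
```

The second layout trades RS storage for an extra register read at schedule time, which is the design tension the problem asks about.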

RELEVANT QUESTIONS AND ANSWERS

Q1) The Section Tomasulo algorithm we discussed in class (machine model number ) is for a scalar processor with dynamic scheduling. It has drawbacks, two of which are that i) the CDB bus can carry only one value at a time, becoming a bottleneck, and ii) while an instruction is being issued to a reservation station, the instruction may fail to pick up an operand because the operand has just been put on the CDB by a functional unit and the issue logic is not aware of that. Thus, the instruction would wait indefinitely or would get an incorrect value. Suggest reasonable solutions for these two cases.

A1) i) In the first Tomasulo algorithm, seven functional units are used : one Load, one Store, three FP Add/Sub and two FP Mul/Div. We are not told the number of other integer functional units (other than the Load and Store integer functional units), so we will ignore them for this discussion. All, except the Store unit, need the CDB. Technically, the Store unit also needs the CDB, since a store in transit to memory can respond to a subsequent Load from the same location. All seven units can complete simultaneously and want to place their results on the CDB. So, the CDB needs to have a width of 7 (seven) words as opposed to one word. This way, it can carry up to seven values at the same time. This, however, means that there must be six write ports to the FP registers, six write ports to each of the FP Reservation Stations and six write ports for the Load unit and the Store unit, so that six different values can be written to FP registers, RSs, load buffer and store buffer entries at the same time. Similarly, for the integer registers (the GPRs), the number of write ports should be increased as the width of the integer CDB is increased.

ii) In our algorithm, the operand fetch is made during issue, not during scheduling. So, when the ID stage decodes an instruction in a particular clock period, it checks the Register Status table to see if the Qi field of an operand register needed by the instruction is blank. In the original implementation, the field is not blank and the value of Qi is FUn (Functional Unit n). This means that the current instruction in ID waits in its RS until that value is computed by FUn. However, in the same clock period in which the ID stage is working on that instruction, FUn places its result on the CDB and is in the process of updating the Register Status table to indicate that it has finished, i.e. the Qi entry is being written blank and the result is being written to the register. As you remember, all hardware tables as well as all registers are written at the end of the clock period. Thus, in that critical clock period, this Qi is not yet blank. So, when it issues the instruction, the ID stage places FUn in the Qx field of the reservation station entry instead of placing the operand value in the Vx field, even though the value is now on the CDB. The instruction would then get a wrong value when the same FUn computes a result for another instruction. In the worst case, if FUn is never needed again by the program, the instruction, and therefore the program, will wait indefinitely.

There are a number of solutions. One solution is that each functional unit has a new output called valid, which is 1 in the clock period in which the result is placed on the CDB. The issue stage then has to check both the Register Status table and these valid lines. If the Qi value of the needed register matches the number of a functional unit whose valid output line is 1, the ID stage instructs the RS to store the result on the CDB into the Vx field for the instruction. Finally, it must be noted that this solution works with any number of CDBs, one or seven.
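The valid-line check described above can be sketched as a small issue-time lookup. This Python fragment is an illustration only: the function name `fetch_operand_at_issue` and the use of dictionaries for the Register Status table, the register file and the CDB are assumptions made for the sketch. The CDB dictionary holds one entry per functional unit whose valid line is 1 in the current clock period, so the same code covers one CDB or several.

```python
from typing import Dict, Optional, Tuple, Union


def fetch_operand_at_issue(
    reg: str,
    reg_status: Dict[str, Optional[str]],   # Qi per register, None when blank
    reg_file: Dict[str, float],
    cdb_valid: Dict[str, float],            # FU tag -> result, only for FUs with valid = 1
) -> Tuple[str, Union[float, str]]:
    """Return ("V", value) if the operand is available, else ("Q", producer tag)."""
    tag = reg_status.get(reg)
    if tag is None:
        # Qi is blank: the register file already holds the value.
        return ("V", reg_file[reg])
    if tag in cdb_valid:
        # Qi is not blank yet, but the producer is broadcasting in this very
        # clock period: take the value from the CDB instead of waiting on the tag.
        return ("V", cdb_valid[tag])
    # The producer has not finished: record its tag in the Q field.
    return ("Q", tag)


# Usage with made-up register contents and tags.
status = {"F0": "ADD1", "F2": None}
regs = {"F0": 0.0, "F2": 1.5}

racing = fetch_operand_at_issue("F0", status, regs, {"ADD1": 7.0})
waiting = fetch_operand_at_issue("F0", status, regs, {})
plain = fetch_operand_at_issue("F2", status, regs, {})
```

Without the middle branch, the `racing` case would return the tag ADD1, which is exactly the lost-operand situation the answer describes.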

Q2) Consider the following piece of MIPS code written for its unpipelined version :

        DADDI R, R0, #(64) 0
                                      ; Memory accesses are commented below :
        L.D F0, 0(Rk)                 ; Rk points at constant k
loop:   LW Ra, 0(Rindexa)             ; Rindexa points at index vector for vector A
        L.D F, 0(Rb)                  ; Rb points at vector B
        L.D F4, 0(Rd)                 ; Rd points at vector D
        DADD.D F6, F, F0
        MUL.D F8, F6, F4
        S.D 0(Ra), F6                 ; Stores to vector A
        S.D 0(Rc), F8                 ; Stores to vector C
        DADDI R, R, #(-) 0
        DADDI Rindexa, Rindexa, #4    ; Rindexa is advanced
        DADDI Rb, Rb, #8              ; Rb is advanced
        DADDI Rc, Rc, #8              ; Rc is advanced
        DADDI Rd, Rd, #8              ; Rd is advanced
        BNEZ R, loop

Assume that the MIPS is scalar and uses the hardware speculative Tomasulo algorithm as discussed in class. That is, this is machine model number 3. It has enough buses to eliminate bottlenecks. In addition, there is a perfect memory with no stalls and the functional unit timings are as listed on page 304 of the textbook. There are enough functional units for integer instructions not to cause stalls. Finally, assume that only one instruction per clock period is committed from the Reorder Buffer.

Show the execution of the loop for the first two (2) iterations as we did in class. Show when (in which clock period) the loop will end. Finally, how many iterations of instructions are flushed out of the pipeline?

A2) The execution is as follows :

Iter # Instruction IF ID EX WR CM
DADDI R, R0, #(64)
L.D F0, 0(Rk)
LW Ra, 0(R indexa )
L.D F, 0(Rb)
L.D F4, 0(Rd)
DADD.D F6, F, F
MUL.D F8, F6, F /
S.D 0(Ra), F /8
S.D 0(Rc), F /9
DADDI R, R, #(-) /0

DADDI R indexa, R indexa, # / DADDI Rb, Rb, # / DADDI Rc, Rc, # /3 DADDI Rd, Rd, # /4 BNEZ R, loop 5 6 7/5 LW Ra, 0(R indexa ) /6 L.D F, 0(Rb) /7 L.D F4, 0(Rd) /8 DADD.D F6, F, F /9 MUL.D F8, F6, F4 0 / S.D 0(Ra), F /3 S.D 0(Rc), F /3 DADDI R, R, #(-) /33 DADDI R indexa, R indexa, # /34 DADDI Rb, Rb, # /35 DADDI Rc, Rc, # /36 DADDI Rd, Rd, # /37 BNEZ R, loop /38 3 LW Ra, 0(R indexa ) / LW Ra, 0(R indexa ) /83 64 L.D F, 0(Rb) / L.D F4, 0(Rd) / DADD.D F6, F, F / MUL.D F8, F6, F / S.D 0(Ra), F / S.D 0(Rc), F / DADDI R, R, #(-) / DADDI R indexa, R indexa, # / DADDI Rb, Rb, # /84 64 DADDI Rc, Rc, # /84 The execution completes in clock period DADDI Rd, Rd, # / BNEZ R, loop /844

65 LW Ra, 0(R indexa ) /
L.D F, 0(Rb) /
L.D F4, 0(Rd) /
DADD.D F6, F, F
MUL.D F8, F6, F /
S.D 0(Ra), F /
S.D 0(Rc), F /844
DADDI R, R, #(-)
DADDI R indexa, R indexa, #
DADDI Rb, Rb, #8 844
These instructions are flushed out in the 844th clock period.

The loop ends at clock period 844. At that point in time, there are 10 speculatively executed instructions of iteration 65 in the pipeline. They are discarded (flushed out). Thus, only one iteration of instructions is flushed out. Since FP latencies are short and successive instruction dependencies are few, not too many instructions accumulate in the Reservation Stations and the ROB.

Q3) Consider the following piece of MIPS code :

        LW R8, 0(R9)          ; R8 is loaded from the memory
        ADD.D F0, F, F4       ; F and F4 are already initialized
        DIV.D F6, F8, F0      ; F8 and F0 are already initialized
        MUL.D F, F4, F0       ; F4 is already initialized
        DADDI R8, R8, #(-) 0
        SUB.D F8, F, F6
        BNEZ R8, loop
        S.D 0(R0), F8         ; R0 is already initialized

This is old code, written for the unpipelined MIPS, i.e. there are no delayed loads and no delayed branches : this is machine model number 0. The old code is now run on the hardware speculative MIPS with the Tomasulo algorithm. This is machine model number 3. The latencies of the functional units are as listed on page A-74 of the Hennessy book. Show which instructions are flushed out of the pipeline. Show the timing of the instructions run until the loop is completed.

A3) Due to long FP latencies and back-to-back instruction dependencies, many instructions accumulate in the Reservation Stations and the ROB. They are eventually flushed out of the ROB : an undesirable situation. This MIPS ROB buffer has to have at least 8 entries not to stall any of the loop instructions in the ID stage due to the ROB-full structural hazard. Note that 8 is a large number! A solution to the large ROB size seems to be to retire more than one instruction at a time. Unfortunately, that will not help us in this application. The only effective solution for this code is the reduction of the long FP latencies. Try this code for the functional unit latencies listed on page 304 of the Hennessy book.

iter # IF ID EX WR CM
LW R8, 0(R9)
ADD.D F0, F, F
DIV.D F6, F8, F
MUL.D F, F4, F / /48
DADDI R8, R8, #(-) /49
SUB.D F8, F, F /
BNEZ R8, loop 7 8 9/5
ADD.D F0, F, F /8-0 /5
DIV.D F6, F8, F0 9 0 /
MUL.D F, F4, F0 0 / /9
DADDI R8, R8, #(-) /93
SUB.D F8, F, F6 3 4/
BNEZ R8, loop 3 4 5/95
3 ADD.D F0, F, F / /95
3 DIV.D F6, F8, F /
MUL.D F, F4, F / /95
3 DADDI R8, R8, #(-) /95
3 SUB.D F8, F, F /95
3 BNEZ R8, loop 9 0 /95
Iterations 4 through 5 will continue like above, then iteration 6 starts :
These instructions are flushed out in the 95th clock period
ADD.D F0, F, F /95
DIVD F6, F8, F
MULTD F, F4, F
SUBI R8, R8, # 95
S.D 0(R0), F
The execution completes in 00
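The claim that retiring more than one instruction per clock period does not shrink the required ROB can be checked with a small model of in-order commit. The Python sketch below is a simplification, not the class procedure: it assumes an instruction may commit no earlier than the period after its WR, that commit is strictly in program order, that an ROB entry is held from the ID period through the commit period, and that issue never stalls. The function names and the sample clock periods are hypothetical.

```python
from typing import List


def commit_times(wr: List[int], commits_per_cycle: int = 1) -> List[int]:
    """In-order commit periods, given each instruction's WR period in program order."""
    cm: List[int] = []
    for i, w in enumerate(wr):
        earliest = w + 1                                   # commit after write-result
        if i >= 1:
            earliest = max(earliest, cm[i - 1])            # never before an older instruction
        if i >= commits_per_cycle:
            # At most commits_per_cycle instructions may commit in one period.
            earliest = max(earliest, cm[i - commits_per_cycle] + 1)
        cm.append(earliest)
    return cm


def max_rob_occupancy(issue: List[int], cm: List[int]) -> int:
    """Largest number of ROB entries in use in any period, assuming issue never stalls."""
    if not issue:
        return 0
    return max(
        sum(1 for s, c in zip(issue, cm) if s <= t <= c)
        for t in range(min(issue), max(cm) + 1)
    )


# Made-up example: the second instruction has a long latency and blocks the ROB head.
issue = [2, 3, 4]
wr = [4, 20, 6]
one_wide = commit_times(wr, 1)
two_wide = commit_times(wr, 2)
print(one_wide, max_rob_occupancy(issue, one_wide))
print(two_wide, max_rob_occupancy(issue, two_wide))
```

In this example the younger instruction finishes early but still waits behind the long-latency one, so the peak occupancy is the same for both commit widths. That matches the argument that only shorter FP latencies reduce the ROB size needed here.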

Q4) Consider the following piece of old MIPS code for the unpipelined MIPS processor. That is, code without delayed loads, without delayed branches, without any consideration for the latencies of functional units, etc. :

        L.D F0, 0(R)            ; Load from M
        MUL.D F0, F0, F0        ; M[i] * M[i]
        L.D F, 0(R3)            ; Load from N
        L.D F, 0(R4)            ; Load from Q
        MUL.D F3, F, F          ; N[i] * Q[i]
        ADD.D F4, F0, F3        ; M[i] * M[i] + N[i] * Q[i]
        S.D 0(R), F4            ; Store to K
        DADDI R, R, #8          ; Advance the K pointer
        DADDI R, R, #8          ; Advance the M pointer
        DADDI R3, R3, #8        ; Advance the N pointer
        DADDI R4, R4, #8        ; Advance the Q pointer
        DADDI R5, R5, #(-) 0    ; Decrement the loop counter
        BNEZ R5, loop           ; Branch back if not the end

Assume that the MIPS is implemented as the scalar hardware-speculative Tomasulo algorithm machine as discussed in class. That is, this is machine model number 3. Assume that there are enough CDB buses to eliminate bottlenecks. In addition, assume that there is a perfect memory with no stalls and the functional unit timings are as listed on page 304 of the Hennessy book. Another assumption is that there are enough functional units for integer instructions not to cause stalls. There are separate address and branch units. A branch instruction takes two clock periods to run if its operands are ready (IF and ID stages). Otherwise, it is issued to the EX stage and waits there until its operands are ready. Finally, assume that only one instruction per clock period is committed from the Reorder Buffer.

Assume that the loop has two (2) iterations. In which clock period will the second iteration of the loop be completed : what is the last clock period in which the Commit stage of an instruction from the second iteration is done? Show which instructions are flushed out of the pipeline. If a new situation is encountered, indicate the assumption made and/or how it is handled.

A4) The execution of the loop for two iterations and the flushed out instructions are as follows :

Iteration Instruction IF ID EX WR CM
L.D F0, 0(R)
MUL.D F0, F0, F0 3 4/
L.D F, 0(R3) /
L.D F, 0(R4) /
MUL.D F3, F, F 5 6 7/8-3
ADD.D F4, F0, F /
S.D 0(R), F /8

Iteration Instruction IF ID EX WR CM DADDI R, R, # /9 DADDI R, R, # /0 DADDI R3, R3, # / DADDI R4, R4, # / DADDI R5, R5, #(-) /3 BNEZ R5, loop 3 4 5/4 L.D F0, 0(R) /5 MUL.D F0, F0, F /8-3/6 L.D F, 0(R3) /7 L.D F, 0(R4) /8 MUL.D F3, F, F 8 9 0/-4 5 6/9 ADD.D F4, F0, F3 9 0 / S.D 0(R), F /3 DADDI R, R, # /3 DADDI R, R, # /33 DADDI R3, R3, # /34 DADDI R4, R4, # /35 DADDI R5, R5, #(-) /36 BNEZ R5, loop 6 7 8/37 3 L.D F0, 0(R) /37 3 MUL.D F0, F0, F / /37 3 L.D F, 0(R3) /37 3 L.D F, 0(R4) /37 3 MUL.D F3, F, F / ADD.D F4, F0, F /37 3 S.D 0(R), F DADDI R, R, # DADDI R, R, # DADDI R3, R3, # DADDI R4, R4, #8 37 These instructions are flushed out at the end of the 37 th clock period

The second iteration of the loop ends at clock period 37. Eleven instructions of the third iteration are flushed out of the ROB when the loop completes.

Running the old code on the new processor shows why some old code runs slower than new code for the same application : instructions wait for each other due to data dependencies (the FP instructions above) while other instructions could be executed in the meantime (the DADDI instructions above). A contemporary compiler would move the DADDI instructions up between the FP instructions, so that stall cycles are reduced and useful work is done in their place.
