DAT105: Computer Architecture Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation



1 Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation Mafijul Islam Department of Computer Science and Engineering November 19, 2009

2 Study Period 2, 2009
Goals: To understand
- basic pipeline scheduling and loop unrolling
- the impact of control dependency on performance
- register renaming and dynamic scheduling
Case Studies/Assignments:
- Assignment 2 of the Exam on
- Assignments 2, 3 of the Exam on
- Assignment 3 of the Exam on
- Assignment 3 of the Exam on

3 Exam: Assignment 2
Assume:
- MIPS processor with a 5-stage pipeline (presented in Appendix A of the textbook)
- All memory accesses complete in a single cycle
- There is one branch delay slot
- Multiply operations are fully pipelined like all other arithmetic instructions, but the result is not available until the end of the Memory access stage

    for (int i = 0; i < n; i = i + 1) {
        x[i] = x[i] * x[i];
    }

- The x array holds long integers (8 bytes each), and n is always greater than zero
- Register R1 contains the start address of the array (x[0]), and you are allowed to change the value of R1
- The address of x[n] is available in register R2

    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4
          SD    R5, 0(R1)
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP
          NOP

4-6 Exam: Assignment 2(A)
How many cycles does this code take to execute per loop iteration of the original program? Specify all types of dependencies and unresolved hazards in this code.

    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4      # RAW dependency on R4 (written by LD)
          SD    R5, 0(R1)       # RAW dependency on R5 (written by DMUL)
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP    # RAW dependency on R1 (written by DADDI)
          NOP

- The data hazards cause one stall cycle each
- One iteration takes 9 cycles (6 instructions plus 3 stall cycles)

7-8 Exam: Assignment 2(B)
Modify the code to require as few clock cycles as possible. How many clock cycles does it take now?

Step 1: move DADDI up between LD and DMUL.

    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 8
          DMUL  R5, R4, R4
          SD    R5, 0(R1)
          BNE   R1, R2, LOOP
          NOP

Step 2: move SD into the branch delay slot in place of the NOP. R1 has already been incremented when SD executes, so the SD offset changes from 0 to -8.

    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 8
          DMUL  R5, R4, R4
          BNE   R1, R2, LOOP
          SD    R5, -8(R1)

- Complete elimination of the stall cycles
- One iteration now takes 5 cycles

9-10 Exam: Assignment 2(C)
Try to modify the code further (n is always an even number). How many clock cycles does it take now? Specify any remaining hazard.

- Unroll the loop once, so that each iteration does the work of two previous iterations
- Merge the two DADDIs
- Rearrange further to avoid stall cycles

Before:

    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 8
          DMUL  R5, R4, R4
          BNE   R1, R2, LOOP
          SD    R5, -8(R1)

After:

    LOOP: LD    R4, 0(R1)
          DADDI R1, R1, 16
          DMUL  R5, R4, R4
          LD    R4, -8(R1)
          SD    R5, -16(R1)
          DMUL  R5, R4, R4
          BNE   R1, R2, LOOP
          SD    R5, -8(R1)

- Complete elimination of stall cycles
- One new iteration takes 8 cycles
- The instructions corresponding to the old loop body now take 4 cycles
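
The cycle counts for the three versions of the loop can be cross-checked with a rough sketch. The stall rule below is an assumption chosen to match the slides ("data hazards cause one stall cycle each"): one stall whenever an instruction reads a register written by the instruction directly before it, if the producer is LD or DMUL or the consumer is BNE. The names ORIGINAL, SCHEDULED, UNROLLED and cycles_per_iteration are illustrative only, and hazards across the loop back-edge are ignored.

```python
# Each instruction: (opcode, destination register or None, source registers).
ORIGINAL = [
    ("LD", "R4", ["R1"]),
    ("DMUL", "R5", ["R4", "R4"]),
    ("SD", None, ["R5", "R1"]),
    ("DADDI", "R1", ["R1"]),
    ("BNE", None, ["R1", "R2"]),
    ("NOP", None, []),
]
SCHEDULED = [
    ("LD", "R4", ["R1"]),
    ("DADDI", "R1", ["R1"]),
    ("DMUL", "R5", ["R4", "R4"]),
    ("BNE", None, ["R1", "R2"]),
    ("SD", None, ["R5", "R1"]),      # fills the branch delay slot
]
UNROLLED = [
    ("LD", "R4", ["R1"]),
    ("DADDI", "R1", ["R1"]),
    ("DMUL", "R5", ["R4", "R4"]),
    ("LD", "R4", ["R1"]),
    ("SD", None, ["R5", "R1"]),
    ("DMUL", "R5", ["R4", "R4"]),
    ("BNE", None, ["R1", "R2"]),
    ("SD", None, ["R5", "R1"]),      # fills the branch delay slot
]


def cycles_per_iteration(body):
    """Return issue slots plus stalls under the assumed one-stall rule."""
    stalls = 0
    for (prev_op, prev_dest, _), (op, _, sources) in zip(body, body[1:]):
        depends = prev_dest is not None and prev_dest in sources
        # Assumed rule: LD/DMUL results arrive late; BNE needs R1 early.
        if depends and (prev_op in ("LD", "DMUL") or op == "BNE"):
            stalls += 1
    return len(body) + stalls


for name, body in [("original", ORIGINAL), ("scheduled", SCHEDULED), ("unrolled", UNROLLED)]:
    print(name, cycles_per_iteration(body))
```

Under these assumptions the sketch reports 9, 5, and 8 cycles, and the unrolled version processes two array elements per iteration.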

11 Exam: Assignment 3
Assume that a floating-point load, division, and multiplication take 10, 50, and 40 cycles, respectively.

    LF   F1, 0(R1)
    DIVF F4, F2, F1
    MULT F2, F1, F0

Compute the execution time of the above sequence on a processor that can issue:
(i) one instruction per cycle and that has no register renaming capability
(ii) three instructions per cycle and that has no register renaming capability
(iii) three instructions per cycle and that has register renaming capability

Disclaimer: If you feel that more assumptions have to be made, feel free to do so. If they are needed and reasonable, they will be accepted without any deduction on the score.

12 Exam: Assignment 3(i)
Compute execution time on a processor that can issue one instruction per cycle and that has no register renaming capability.

Identify the dependences:
- True data dependence: DIVF and MULT both read F1, which LF writes
- Name dependence: MULT writes F2, which DIVF reads (antidependence)

    LF   F1, 0(R1)      # 10 cycles
    DIVF F4, F2, F1     # 50 cycles
    MULT F2, F1, F0     # 40 cycles

Execution Time = 10 + 50 + 40 cycles = 100 cycles

13 Exam: Assignment 3(ii)
Compute execution time on a processor that can issue three instructions per cycle and that has no register renaming capability.

Identify the dependences:
- True data dependence: DIVF and MULT both read F1, which LF writes
- Name dependence: MULT writes F2, which DIVF reads (antidependence)

    LF   F1, 0(R1)      # 10 cycles
    DIVF F4, F2, F1     # 50 cycles
    MULT F2, F1, F0     # 40 cycles

Without renaming, the wider issue width does not shorten the sequence.

Execution Time = 10 + 50 + 40 cycles = 100 cycles

14 Exam: Assignment 3(iii)
Compute execution time on a processor that can issue three instructions per cycle and that has register renaming capability.

Identify the dependences (true data dependence, name dependence):

    LF   F1, 0(R1)      # 10 cycles
    DIVF F4, F2, F1     # 50 cycles
    MULT F2, F1, F0     # 40 cycles

Apply register renaming (MULT now writes S instead of F2):

    LF   F1, 0(R1)      # 10 cycles
    DIVF F4, F2, F1     # 50 cycles
    MULT S, F1, F0      # 40 cycles

    Cycle  Issued instruction
    1      LF   F1, 0(R1)
    11     DIVF F4, F2, F1
    11     MULT S, F1, F0

Execution Time = 10 + MAX(50, 40) cycles = 60 cycles
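
The three execution times can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning on the slides: it assumes that, without renaming, the three instructions run strictly one after another, and that with renaming DIVF and MULT overlap once the load has finished. The names LATENCY, time_without_renaming and time_with_renaming are illustrative.

```python
# Latencies in cycles, as given in the assignment.
LATENCY = {"LF": 10, "DIVF": 50, "MULT": 40}


def time_without_renaming():
    """Cases (i) and (ii): the three instructions execute back to back."""
    return LATENCY["LF"] + LATENCY["DIVF"] + LATENCY["MULT"]


def time_with_renaming():
    """Case (iii): DIVF and MULT both start once the load completes."""
    return LATENCY["LF"] + max(LATENCY["DIVF"], LATENCY["MULT"])


print(time_without_renaming())  # 100
print(time_with_renaming())     # 60
```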

15 Exam: Assignment 2(A)
Evaluating the impact of branch predictions on performance

Given:
- every fifth instruction is a conditional branch
- all conditional branches can be predicted with 100% accuracy, except for the "branch less than" category, which can be predicted with only 50% accuracy
- CPI = 1 for all instructions, including the branches that are correctly predicted
- the misprediction penalty is 10 cycles

What is the CPI for the integer applications?

16 Exam: Assignment 2(A)
Evaluating the impact of branch predictions on performance

Instruction mix for integer applications:
- conditional branches: 20% of all instructions
- "less than" branches: 35% of the conditional branches

Relative occurrence of mispredicted "less than" branches = 0.20 * 0.35 * 0.5 = 0.035
Misprediction penalty: 10 cycles
CPI: 1 for the 80% of instructions that are not branches, and 1 for the correctly predicted branches

CPI overall = 1 * (1 - 0.035) + 10 * 0.035 = 1.315

The first term covers the correctly predicted branches and all other instructions; the second term covers the mispredicted branches.
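
The CPI calculation can be checked numerically. The sketch below follows the formula on the slide, in which a mispredicted branch is charged 10 cycles and everything else 1 cycle; the variable and function names are illustrative.

```python
BRANCH_FRACTION = 0.20      # every fifth instruction is a conditional branch
LESS_THAN_SHARE = 0.35      # share of conditional branches that are "less than"
MISPREDICT_RATE = 0.5       # "less than" branches are predicted with 50% accuracy
CPI_NORMAL = 1
CPI_MISPREDICTED = 10


def overall_cpi():
    """Weighted average CPI over mispredicted branches and everything else."""
    mispredicted = BRANCH_FRACTION * LESS_THAN_SHARE * MISPREDICT_RATE
    return CPI_NORMAL * (1 - mispredicted) + CPI_MISPREDICTED * mispredicted


print(round(overall_cpi(), 3))  # 1.315
```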

17 Exam: Assignment 3(A)
The following MIPS program operates on an array with 64-bit elements. Register R1 points to the beginning of the array and register R2 points to the end. The array always contains 1000 elements.

          ANDI  R3, R3, 0
    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4
          DADD  R5, R3, R5
          SD    R5, 0(R1)
          DADDI R3, R4, 0
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP

How does a 2-bit branch prediction scheme work for the given code?

18-22 Exam: Assignment 3(A)
How does a 2-bit branch prediction scheme work for the given code?

For our example program we would have one entry corresponding to the BNE instruction at the end of the program. The prediction evolves as follows if we assume we start in state 00:

    Execution of BNE      State before  Prediction         State after (actual outcome)
    First                 00            Not taken (wrong)  01 (taken)
    Second                01            Not taken (wrong)  11 (taken)
    3rd, 4th, ..., 999th  11            Taken (correct)    11 (taken)
    1000th                11            Taken (wrong)      10 (not taken)
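
The evolution of the predictor can be replayed with a small simulation. The slides only show the transitions 00 to 01, 01 to 11, 11 to 11 on a taken branch and 11 to 10 on a not-taken branch; the remaining entries of TRANSITIONS below are an assumption that completes the table in the usual textbook way. All names in the sketch are illustrative.

```python
# (state, branch actually taken?) -> next state.
# Entries not shown on the slides are assumed.
TRANSITIONS = {
    ("00", True): "01", ("00", False): "00",
    ("01", True): "11", ("01", False): "00",
    ("10", True): "11", ("10", False): "00",
    ("11", True): "11", ("11", False): "10",
}


def predicts_taken(state):
    """States 10 and 11 predict taken; 00 and 01 predict not taken."""
    return state in ("10", "11")


def count_mispredictions(outcomes, state="00"):
    """Replay branch outcomes; return (mispredictions, final state)."""
    wrong = 0
    for taken in outcomes:
        if predicts_taken(state) != taken:
            wrong += 1
        state = TRANSITIONS[(state, taken)]
    return wrong, state


# BNE is taken 999 times and falls through on the 1000th execution.
outcomes = [True] * 999 + [False]
print(count_mispredictions(outcomes))  # (3, '10')
```

Starting from state 00, the loop branch is mispredicted three times in 1000 executions: the first two and the last one.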

23 Exam: Assignment 3
Dynamic scheduling: Tomasulo's Algorithm

    SUB F0, F1, F2    (8 cycles)
    DIV F3, F0, F4    (10 cycles)
    ADD F4, F5, F6    (6 cycles)

3(A): Show all data and name dependences in the code
3(B): Establish when the second addition instruction can start its execution

Assume that
- there are two addition functional units and a division functional unit
- a single instruction is issued every cycle

24 Exam: Assignment 3(A)
Show all data and name dependences in the code

    SUB F0, F1, F2    (8 cycles)
    DIV F3, F0, F4    (10 cycles)
    ADD F4, F5, F6    (6 cycles)

- True data dependence: DIV reads F0, which SUB writes
- Name dependence: ADD writes F4, which DIV reads (antidependence)

25 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 0

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2
    DIV F3, F0, F4
    ADD F4, F5, F6

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No
    Add2  No
    Div   No

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi

26 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 1

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    No
    ADD F4, F5, F6    No

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  No
    Div   No

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1

27 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 2

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    No

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  No
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div

28 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 3

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  Yes   ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div  Add2

29 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 4

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  Yes   SUB  Regs[F1]  Regs[F2]
    Add2  Yes   ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div  Add2

30 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 10

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No    SUB  Regs[F1]  Regs[F2]
    Add2  No    ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     Add1          Div  Add2

31 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 11

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No    SUB  Regs[F1]  Regs[F2]
    Add2  No    ADD  Regs[F5]  Regs[F6]
    Div   Yes   DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     -             Div  -

32 Exam: Assignment 3(B)
Establish when the second addition instruction can start its execution. Clock cycle: 21

    Instruction status
    Instruction       Issue  Execute  Write Result
    SUB F0, F1, F2    Yes
    DIV F3, F0, F4    Yes
    ADD F4, F5, F6    Yes

    Reservation stations
    Name  Busy  Op   Vj        Vk        Qj    Qk  A
    Add1  No    SUB  Regs[F1]  Regs[F2]
    Add2  No    ADD  Regs[F5]  Regs[F6]
    Div   No    DIV            Regs[F4]  Add1

    Register status
    Field  F0    F1  F2  F3   F4    F5  F6
    Qi     -             Div  -
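
The snapshot cycles above (1 to 4, 10, 11, 21) are consistent with a simple timing model, sketched here under stated assumptions: one instruction issues per cycle starting at cycle 1, execution starts the cycle after both issue and the write of every pending source operand, the result is written the cycle after execution ends, and there is no contention for functional units or for the result bus. These rules and the names INSTRUCTIONS and schedule are assumptions, not taken from the slides.

```python
# (name, destination, source registers, execution latency in cycles)
INSTRUCTIONS = [
    ("SUB", "F0", ["F1", "F2"], 8),
    ("DIV", "F3", ["F0", "F4"], 10),
    ("ADD", "F4", ["F5", "F6"], 6),
]


def schedule(instructions):
    """Return {name: (issue, first execute cycle, write-result cycle)}."""
    pending_write = {}  # register -> cycle in which its new value is written
    timeline = {}
    for issue, (name, dest, sources, latency) in enumerate(instructions, start=1):
        # Only earlier instructions are in pending_write, so DIV sees the
        # old F4 and the WAR hazard with ADD never delays anything.
        operands_ready = max((pending_write.get(reg, 0) for reg in sources), default=0)
        start = max(issue, operands_ready) + 1
        write = start + latency
        timeline[name] = (issue, start, write)
        pending_write[dest] = write
    return timeline


for name, (issue, start, write) in schedule(INSTRUCTIONS).items():
    print(f"{name}: issue {issue}, execute from {start}, write result {write}")
```

Under these assumptions ADD starts executing in cycle 4, without waiting for DIV, because the reservation station already holds DIV's copy of F4. SUB and ADD both write in cycle 10, and DIV executes from cycle 11 and writes in cycle 21.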
