Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4


Harry Hutchinson
5 years ago

PROBLEM 1:

An application running on a 1 GHz pipelined processor has the following instruction mix:

Instruction    Frequency    CPI
Load-store     55%          5
Arithmetic     30%          4
Branch         15%          4

a) Determine the overall CPI of the program.

b) An embedded version of the processor that operates at 600 MHz is used to run the same application. In this version, the CPI of branch instructions becomes 6, while the CPI of the other instruction types remains unchanged. A new compiler is used which eliminates 25% of the load-store instructions as well as 5% of the arithmetic instructions for this application.
   i. Determine the overall CPI of the program on the embedded processor with the new compiler.
   ii. Determine the factor by which the application on the embedded processor runs faster/slower.

Solution:

a) CPI = 0.55 × 5 + 0.30 × 4 + 0.15 × 4 = 2.75 + 1.20 + 0.60 = 4.55 cycles/instruction

b) First we calculate the new percentages for each type of instruction:

   i. Percentage of eliminated load-store from total instructions = 0.25 × 55% = 13.75%
      Percentage of eliminated arithmetic from total instructions = 0.05 × 30% = 1.5%
      Percentage of remaining instructions from total instructions = 100% − (13.75% + 1.5%) = 84.75%

      New percentage of load-store instructions = (55 − 13.75) / 84.75 = 48.67%
      New percentage of arithmetic instructions = (30 − 1.5) / 84.75 = 33.63%
      New percentage of branch instructions = 15 / 84.75 = 17.70%

      New CPI = 0.4867 × 5 + 0.3363 × 4 + 0.1770 × 6 ≈ 4.84 cycles/instr.

   ii. Execution time = instruction count × CPI / clock rate. For an original instruction count I:
       Original time = I × 4.55 / 1 GHz = 4.55 × I ns
       Embedded time = 0.8475 × I × 4.84 / 600 MHz ≈ 6.84 × I ns
       Original time / embedded time ≈ 0.67 (i.e. the program now is slower, by a factor of about 1.5)

PROBLEM 2:

Suppose a MIPS processor uses the simple 5-stage pipeline described in the text. Further suppose that:
- There is a single memory for both instructions and data, which can do one read or write each cycle.
- No forwarding is used.
- An instruction cannot be fed into the pipeline until the hardware knows for certain that the instruction is to be executed (no earlier than the end of the execution stage if the current instruction is a branch).
- In the absence of hazards, a new instruction can be fed into the pipeline each cycle.

For the following MIPS code:

    lw  R1, 0(R2)
    lw  R3, 12(R4)
    add R5, R1, R3
    beq R5, R5, L1
    sw  R5, 0(R3)
L1: sw  R5, 12(R4)

a) Show, using a diagram, how many cycles this code takes to complete.
b) Show, using a diagram, how different hazard-solving techniques can be used to decrease the total number of cycles for this program.

Solution:

a) As shown below, the code will take 15 cycles (stall cycles between stages are not shown):

lw  R1,0(R2)       IF  ID  EX  M  WB
lw  R3,12(R4)      IF  ID  EX  M  WB
add R5,R1,R3       IF  ID  EX  M  WB
beq R5,R5,L1       IF  ID  EX  M  WB
L1: sw R5,12(R4)   IF  ID  EX  M  WB

b) Using the following hazard-solving techniques:
- Forwarding (to resolve some data hazards)
- Separate instruction and data memories (to resolve some structural hazards)
- Branch prediction

Assuming the branch prediction turns out to be correct, the code will take 11 cycles (stall cycles between stages are not shown):

lw  R1,0(R2)       IF  ID  EX  M  WB
lw  R3,12(R4)      IF  ID  EX  M  WB
add R5,R1,R3       IF  ID  EX  M  WB
beq R5,R5,L1       IF  ID  EX  M  WB
L1: sw R5,12(R4)   IF  ID  EX  M  WB

PROBLEM 3:

A five-stage pipelined processor supports the following instruction types:

Instruction       Frequency
Load              25%
Store             15%
Integer           30%
Floating point    20%
Branch            10%

Assume the base CPI of the processor is equal to 1. Data hazards for floating point operations cause an average penalty of 0.9 stall cycles, branch instructions have a misprediction penalty of 1 stall cycle, and all other instructions run at the maximum possible throughput. For branch instructions, the processor uses the predicted-untaken scheme. If the branch prediction turns out to be correct 80% of the time, calculate the average CPI for this program.

Solution:

The average CPI = the base CPI + the average number of stalls per instruction
                = 1 + (0.20 × 0.9 + 0.10 × (1 − 0.80) × 1)
                = 1 + 0.18 + 0.02
                = 1.2 cycles/instr.
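The CPI arithmetic in Problems 1 and 3 is a frequency-weighted sum, so it can be cross-checked with a few lines of Python. This is only a sketch that uses the figures given in the problem statements; the helper name weighted_cpi and all variable names are illustrative and not part of the original solution.

```python
def weighted_cpi(mix):
    """Return the average CPI for a dict of name -> (fraction, cpi)."""
    return sum(fraction * cpi for fraction, cpi in mix.values())


# Problem 1 a): original processor at 1 GHz.
base_mix = {"load-store": (0.55, 5), "arithmetic": (0.30, 4), "branch": (0.15, 4)}
cpi_base = weighted_cpi(base_mix)

# Problem 1 b): the compiler removes 25% of load-stores and 5% of arithmetic.
remaining = {
    "load-store": 0.55 * (1 - 0.25),
    "arithmetic": 0.30 * (1 - 0.05),
    "branch": 0.15,
}
kept = sum(remaining.values())  # fraction of the original instruction count
embedded_cpi_per_type = {"load-store": 5, "arithmetic": 4, "branch": 6}
embedded_mix = {
    name: (remaining[name] / kept, embedded_cpi_per_type[name]) for name in remaining
}
cpi_embedded = weighted_cpi(embedded_mix)

# Execution time per original instruction = count * CPI / clock rate.
time_base = 1.0 * cpi_base / 1e9
time_embedded = kept * cpi_embedded / 600e6
slowdown = time_embedded / time_base

# Problem 3: base CPI plus average stall cycles per instruction.
cpi_problem3 = 1.0 + 0.20 * 0.9 + 0.10 * (1 - 0.80) * 1

print(round(cpi_base, 2))      # about 4.55
print(round(cpi_embedded, 2))  # about 4.84
print(round(slowdown, 2))      # about 1.5 (embedded version is slower)
print(round(cpi_problem3, 2))  # about 1.2
```

Normalising by the fraction of instructions kept is what turns the reduced counts back into percentages that sum to 100%.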

PROBLEM 4:

a) Identify all WAR, WAW and RAW dependencies in the following instruction sequence:

    LD   F2, 16(R6)
    ADDD F2, F2, F4
    DIVD F6, F2, F0
    SUBD F0, F2, F10
    SD   F6, 32(R3)

b) Fill in the blank templates for executing this code with and without Tomasulo's Algorithm for this instruction sequence. Assume the following execution times:

    LW:        2 cycles
    ADD/SUB:   2 cycles
    BNEZ:      3 cycles
    MULT/DIV:  4 cycles

For the original FP unit, assume one integer unit, one floating point multiply unit, one F.P. add unit and one F.P. divide unit.
For Tomasulo's, assume: three FP ADD units, 2 FP MULT units, 6 load buffers and three store buffers. (Same units as in the book example.)
Assume there is a cache miss causing a stall of 8 cycles on the execution of the first LD.
Assume FP adds/subs take 2 cycles, mults take 10 cycles and divides take 20 cycles.
Assume the store is a cache hit and executes in one cycle.
Assume many instructions can read from the register file simultaneously.
For the Tomasulo example, recall that only one instruction can drive the CDB at a time.

Solution:

Without Tomasulo's Algorithm, with the processor using forwarding:

Instruction     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
LD F2,16(R6)    IF    ID    EX    MEM1  MEM2  WB
ADDD F2,F2,F4         IF    ID    stall stall EX1   EX2   MEM   WB
DIVD F6,F2,F0               IF    stall stall ID    stall EX1   EX2   EX3   EX4   MEM   WB
SUBD F0,F2,F10                    stall stall IF    stall ID    stall stall stall EX1   EX2   MEM   WB
SD F6,32(R3)                            stall stall stall IF    stall stall stall ID    stall EX    MEM1  MEM2  WB

Notes: We assumed one execution unit and one memory unit, and we respected in-order execution and in-order completion when resolving the stalls, exactly as shown in slide 5 of the ILP chapter.

With Tomasulo's Algorithm:

We will use the same architecture shown in the lecture. Each snapshot below gives, for one clock cycle, the instruction status (issue, execution complete, write result), the load/store buffers, the reservation stations (remaining time, busy flag, operation, operand values Vj/Vk, operand sources Qj/Qk) and the register result status.

Clock 0:
  Instruction status: nothing issued.
  Load buffers: Load1, Load2, Load3 not busy.
  Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy.
  Register result status: no pending results.

Clock 1:
  Instruction status: LD issued at 1.
  Load buffers: Load1 busy (time 2), address 16(R2); Load2, Load3 not busy.
  Reservation stations: Add1, Add2, Add3, Mult1, Mult2 not busy.
  Register result status: F2 = Load1.

Clock 2:
  Instruction status: LD issued at 1; ADDD issued at 2.
  Load buffers: Load1 busy (time 1), address 16(R2); Load2, Load3 not busy.
  Reservation stations:
    Add1: busy, ADD, Vk = F4, Qj = Load1
    Add2: not busy
    Mult1: not busy
  Register result status: F2 = Add1.

Clock 3:
  Instruction status: ADDD issued at 2.
  Load buffers: Load1 busy, address 16(R2); Load2 not busy.
  Reservation stations:
    Add1: busy, ADD, Vk = F4, Qj = Load1
    Add2: not busy
    Mult1: busy, DIVD, Vk = F0, Qj = Add1
  Register result status: F2 = Add1, F6 = Mult1.

Clock 4:
  Instruction status: ADDD issued at 2; SUBD issued at 4.
  Load buffers: Load2 not busy.
  Reservation stations:
    Add1: busy (time 2), ADD, Vj = MEM(1), Vk = F4
    Add2: busy, SUBD, Vk = F10, Qj = Add1
    Mult1: busy, DIVD, Vk = F0, Qj = Add1
  Register result status: F0 = Add2, F2 = Add1, F6 = Mult1.

Clock 5:
  Instruction status: ADDD issued at 2.
  Load buffers: Load2 not busy.
  Reservation stations:
    Add1: busy, ADD, Vj = MEM(1), Vk = F4
    Add2: marked not busy, SUBD, Vk = F10, Qj = Add1
    Mult1: busy, DIVD, Vk = F0, Qj = Add1
  Register result status: F0 = Add2, F2 = Add1, F6 = Mult1.

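Clock 6:
  Instruction status: ADDD issued at 2, execution complete at 6.
  Load buffers: Load2 not busy.
  Reservation stations:
    Add1: busy, ADD, Vj = MEM(1), Vk = F4
    Add2: marked not busy, SUBD, Vk = F10, Qj = Add1
    Mult1: busy, DIVD, Vk = F0, Qj = Add1
  Register result status: F0 = Add2, F2 = Add1, F6 = Mult1.

Clock 7:
  Reservation stations:
    Add2: busy, SUBD, Vj = res1, Vk = F10
    Mult1: busy (time 4), DIVD, Vj = res1, Vk = F0
  Register result status: F0 = Add2, F2 = res1, F6 = Mult1.

Clock 8:
  Reservation stations:
    Add2: busy, SUBD, Vj = res1, Vk = F10
    Mult1: busy (time 3), DIVD, Vj = res1, Vk = F0
  Register result status: F0 = Add2, F2 = (RES), F6 = Mult1.

Clock 9:
  Reservation stations:
    Add2: marked not busy, SUBD, Vj = res1, Vk = F10
    Mult1: busy (time 3), DIVD, Vj = res1, Vk = F0
  Register result status: F0 = Add2, F2 = (RES), F6 = Mult1.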

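Clock 10:
  Reservation stations:
    Add1: not busy
    Add2: busy (time 0), SUBD, Vj = res1, Vk = F10
    Mult1: busy (time 2), DIVD, Vj = res1, Vk = F0
  Register result status: F0 = Add2, F2 = res1, F6 = Mult1.

Clock 11:
  Reservation stations:
    Add2: not busy
    Mult1: busy (time 1), DIVD, Vj = res1, Vk = F0
  Register result status: F0 = res2, F2 = res1, F6 = Mult1.

Clock 12:
  Load buffers: Load3 not busy.
  Reservation stations:
    Add2: not busy
    Mult1: busy (time 0), DIVD, Vj = res1, Vk = F0
  Register result status: F0 = res2, F2 = res1, F6 = Mult1.

Clock 13:
  Load buffers: Load3 not busy.
  Store buffer: busy (time 2), address 32(R3), value Res3.
  Reservation stations: Add2 and Mult1 not busy.
  Register result status: F0 = res2, F2 = res1, F6 = Res3.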

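Clock 14:
  Load buffers: Load3 not busy.
  Store buffer: busy (time 1), address 32(R3), value Res3.
  Reservation stations: Add2 and Mult1 not busy.
  Register result status: F0 = res2, F2 = res1, F6 = Res3.

Clock 15:
  Load buffers: Load3 not busy.
  Store buffer: busy (time 0), address 32(R3), value Res3.
  Reservation stations: Add2 and Mult1 not busy.
  Register result status: F0 = res2, F2 = res1, F6 = Res3.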
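Part a) of Problem 4 asks for the RAW, WAR and WAW dependencies, which can be enumerated mechanically from each instruction's destination and source registers. The sketch below is one way to do that and is not taken from the original solution; the Instr tuple, the find_dependencies helper and the hand-written register lists are illustrative assumptions. It reports each dependency against the most recent writer of a register (and the readers since that write).

```python
from collections import namedtuple

# Hypothetical representation: destination register (or None) and source registers.
Instr = namedtuple("Instr", "text dest srcs")

program = [
    Instr("LD F2, 16(R6)", "F2", ("R6",)),
    Instr("ADDD F2, F2, F4", "F2", ("F2", "F4")),
    Instr("DIVD F6, F2, F0", "F6", ("F2", "F0")),
    Instr("SUBD F0, F2, F10", "F0", ("F2", "F10")),
    Instr("SD F6, 32(R3)", None, ("F6", "R3")),
]


def find_dependencies(instrs):
    """Return a list of (kind, earlier_index, later_index, register)."""
    deps = []
    last_writer = {}  # register -> index of the most recent writer
    readers = {}      # register -> indices that read it since the last write
    for j, ins in enumerate(instrs):
        # RAW: this instruction reads a value produced by an earlier one.
        for reg in ins.srcs:
            if reg in last_writer:
                deps.append(("RAW", last_writer[reg], j, reg))
        if ins.dest is not None:
            reg = ins.dest
            # WAW: this instruction overwrites an earlier result.
            if reg in last_writer:
                deps.append(("WAW", last_writer[reg], j, reg))
            # WAR: an earlier instruction still has to read the old value.
            for i in readers.get(reg, []):
                deps.append(("WAR", i, j, reg))
            readers[reg] = []
            last_writer[reg] = j
        for reg in ins.srcs:
            readers.setdefault(reg, []).append(j)
    return deps


dependencies = find_dependencies(program)
for kind, i, j, reg in dependencies:
    print(f"{kind} on {reg}: {program[i].text} -> {program[j].text}")
```

For this sequence the sketch reports RAW dependencies on F2 (LD to ADDD, ADDD to DIVD, ADDD to SUBD) and on F6 (DIVD to SD), a WAW dependency on F2 (LD to ADDD) and a WAR dependency on F0 (DIVD to SUBD).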

PROBLEM 5:

Consider the following code. (The "..." marks indicate instructions that are ignored in this example.)

LOOP1: ADDI R4, R0, #4
       ...
LOOP2: SUBI R4, R4, #1
       ...
       BNEZ R4, LOOP2
       ...
       BEQZ R8, LOOP1
       ...

a) Focusing on the inner loop (LOOP2) only, analyze the branch behavior. Assume no other instruction changes the value of register R4. What percentage of the time is the BNEZ branch instruction taken and not taken?

Suppose the BNEZ branch is taken N times per execution of LOOP2. Then the branch is taken N times out of every N+1 executions, i.e. it is taken N/(N+1) of the time and not taken 1/(N+1) of the time. With R4 initialised to 4, N = 3, so the branch is taken 75% of the time and not taken 25% of the time.

b) Choose the best static branch prediction scheme for the BNEZ instruction. What percentage of the time will this static branch prediction be correct for LOOP2?

Predicting the branch as taken gives N correct predictions out of every N+1 decisions.

c) Now consider dynamic branch prediction. Draw the state machine for a one-bit branch predictor. Be sure to clearly identify or define the meaning of each state. For the inner loop (LOOP2), what will be the misprediction rate of the one-bit branch predictor?

[Figure: state diagram of the 1-bit predictor. Two states, Predict Taken and Predict Not Taken; a taken branch moves to (or stays in) Predict Taken, and a not-taken branch moves to (or stays in) Predict Not Taken.]

For the 1-bit branch predictor the FSM should look as above. Studying LOOP2 only:

Iteration             1           2           3       ...   N       N+1
Prediction Decision   Not Taken   Taken       Taken   ...   Taken   Taken
Final Decision        Taken       Taken       Taken   ...   Taken   Not Taken

So the prediction is wrong 2 times out of every N+1 decisions.

d) Now draw the state diagram for a 2-bit dynamic branch predictor. Again, clearly label all states. What will be the misprediction rate of the 2-bit branch predictor for LOOP2?

Iteration             1           2           3       ...   N       N+1
Prediction Decision   Not Taken   Not Taken   Taken   ...   Taken   Taken
Final Decision        Taken       Taken       Taken   ...   Taken   Not Taken

In this first pass through LOOP2 the prediction is wrong 3 times out of N+1 decisions (iterations 1, 2 and N+1).

e) Taking both loops into consideration, the question asks for the state diagram of a (2,2) correlating dynamic branch predictor.

We will not use the (2,2) predictor, as it is not described in the lecture, so we only consider the relation between LOOP1 and LOOP2. Suppose LOOP2 is executed N times in every LOOP1 iteration. Then the first LOOP1 iteration incurs 3 mispredictions, and every later iteration incurs only 1 misprediction until the end of LOOP1.
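The misprediction counts in parts c) to e) can be reproduced by simulating the predictors on the outcome stream of BNEZ. The sketch below assumes the 1-bit predictor starts in the not-taken state and the 2-bit predictor is a saturating counter that starts at 0 (strongly not taken), which is what the first columns of the tables suggest; the function names and the choice of N = 3 (from R4 = 4) are illustrative.

```python
def loop2_outcomes(n_taken, outer_iterations):
    """BNEZ outcomes: taken n_taken times, then not taken once, per LOOP1 pass."""
    return ([True] * n_taken + [False]) * outer_iterations


def mispredictions_one_bit(outcomes, predict_taken=False):
    """1-bit predictor: always predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if predict_taken != taken:
            misses += 1
        predict_taken = taken
    return misses


def mispredictions_two_bit(outcomes, counter=0):
    """2-bit saturating counter (0..3): predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


N = 3  # R4 starts at 4, so BNEZ is taken 3 times per pass
for passes in (1, 2, 5):
    stream = loop2_outcomes(N, passes)
    print(passes, mispredictions_one_bit(stream), mispredictions_two_bit(stream))
```

Under these assumptions the 1-bit predictor misses twice in every pass through LOOP2, while the 2-bit predictor misses three times in the first pass and once in each later pass, in line with the answer to part e).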
