CS 2410 Mid term (fall 2015)



Name:

Question 1 (10 points)
Indicate which of the following statements is true and which is false.

(1) SMT architectures reduce the thread context switch time by saving in memory only the registers that have been used by the threads.  True  False
(2) VLIW architectures rely on the compiler to avoid data hazards.
(3) Loop unrolling increases potential instruction-level parallelism in superscalars.
(4) Loop unrolling increases the cache hit rate.
(5) The hit time for an 8-way set associative cache is smaller than the hit time for a 4-way set associative cache of the same size.
(6) The hit rate for an 8-way set associative cache is smaller than the hit rate for a 4-way set associative cache of the same size.
(7) A TLB miss causes a page fault.
(8) DRAM cells have to be periodically refreshed.
(9) Auto-increment addressing allows efficient implementation of function calls.
(10) Variable-length op-code is used to simplify the decoding stage in a pipeline.

Question 2 (20 points)
If the EX stage is the bottleneck in the typical 5-stage pipeline (IF, ID, EX, MEM, WB), then it may be possible to increase the pipeline clock speed by breaking the EX stage into two stages, EX1 and EX2. The resulting 6-stage pipeline is shown below with forwarding paths implemented from C to A and from D to A.

[Figure: 6-stage pipeline with stages IF, ID/Reg, EX1, EX2, MEM, WB and points labeled A, B, C, D, E]

(a) Use load instructions (lw Reg, I(Reg)) and/or add instructions (add Reg, Reg, Reg) to demonstrate the use of the forwarding paths to avoid interlocks by showing:
    i) a sequence of instructions that uses the path from C to A
    ii) a sequence of instructions that uses the path from D to A

(b) What design decision eliminates the need for a forwarding path from E to A?

(c) For how many cycles will the pipeline stall if
    i) lw R1, 40(R2) is immediately followed by add R1, R1, R2
    ii) add R1, R1, R2 is immediately followed by lw R2, 60(R1)

(d) Assume that the CPIs for the 5-stage and the 6-stage pipelines are 3 and 3.2, respectively, for an instruction mix with 20% branches, 60% arithmetic/logic, 10% loads and 10% stores. If the clock speed of the 5-stage pipeline is 1 GHz, how fast should the clock of the 6-stage pipeline be in order to be more efficient than the 5-stage pipeline?

(e) Can you think of a situation in which a forwarding path from B to A would be useful?
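The comparison in part (d) can be sketched numerically. This is a minimal sketch, assuming that "more efficient" means a lower average time per instruction (CPI divided by clock frequency); the function name `break_even_clock_ghz` is hypothetical and not part of the exam.

```python
def break_even_clock_ghz(cpi_base, clock_base_ghz, cpi_new):
    """Clock frequency at which the new pipeline matches the base pipeline.

    Assumption: performance is measured as time per instruction,
    CPI / f. The new design wins when cpi_new / f_new < cpi_base / f_base,
    i.e. when f_new exceeds the value returned here.
    """
    return clock_base_ghz * cpi_new / cpi_base


# Values taken from the question: CPI 3 at 1 GHz versus CPI 3.2.
threshold = break_even_clock_ghz(cpi_base=3.0, clock_base_ghz=1.0, cpi_new=3.2)
print(f"6-stage clock must exceed about {threshold:.3f} GHz")
```

Under this reading, the instruction mix does not enter the calculation because the CPIs are already given as averages over that mix.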

Question 3 (15 points)
Consider a 9-deep instruction pipeline (stages: IF1, IF2, ID, EX1, EX2, EX3, M1, M2, WB) where branch target addresses are resolved at the end of the ID stage and branch conditions are resolved at the end of the EX3 stage. The typical workload has 20% conditional branches with 60% of the conditional branches taken, on average.

a) What is the additional CPI caused by control hazards if a "predict-not-taken" scheme is used?

b) What is the additional CPI caused by control hazards if a "predict-taken" scheme is used? Assume that no Branch Prediction Buffer/Table is used.

c) Consider the performance of a branch predictor on the following code:

    L1: ..
    L2:
    B2: BEQZ R1, L2
    B1: BEQZ R2, L1

and assume that the branch prediction table initially indicates that the two branches are not taken. If the outer loop executes 100 iterations and the inner loop executes 5 iterations, then the branch instruction at B2 will execute 500 times. How many of these 500 times will be incorrectly predicted if:
    (i) A one-bit predictor is used
    (ii) A two-bit predictor is used
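The behavior asked about in part (c) can be explored with a small simulation. This is a sketch under stated assumptions, not the exam's reference solution: it assumes B2 is taken on the first 4 executions of each inner loop and falls through on the 5th, and it models the two-bit predictor as a saturating counter. The exam may intend a different two-bit state machine or a different initial state, so the starting counter value is a parameter. The names `b2_outcomes` and `mispredictions` are hypothetical.

```python
def b2_outcomes(outer=100, inner=5):
    """Outcome stream for B2 (True = taken).

    Assumption: in each outer iteration the inner-loop branch is taken
    inner-1 times and then falls through once.
    """
    outcomes = []
    for _ in range(outer):
        outcomes.extend([True] * (inner - 1) + [False])
    return outcomes


def mispredictions(outcomes, bits, initial=0):
    """Count mispredictions of an n-bit saturating-counter predictor.

    The counter predicts taken when it is in the upper half of its range.
    With bits=1 this degenerates to "predict the last outcome".
    """
    top = (1 << bits) - 1
    threshold = 1 << (bits - 1)
    counter = initial
    misses = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            misses += 1
        # Move toward the observed outcome, saturating at 0 and top.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return misses


stream = b2_outcomes()
print("one-bit:", mispredictions(stream, bits=1))
print("two-bit, starting strongly not taken:", mispredictions(stream, bits=2, initial=0))
print("two-bit, starting weakly not taken:", mispredictions(stream, bits=2, initial=1))
```

Making the initial counter state explicit shows how much the two-bit count depends on what "initially not taken" is taken to mean.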

Question 4 (25 points)
In this question, you should trace the execution of the 2-issue, dynamically scheduled (with speculation) out-of-order processor shown in the figure attached at the end of the exam booklet (you can detach it). Assume a perfect cache (no misses), a very large number of reservation stations, six reorder buffers and a flexible interconnection that allows up to two instructions to be issued (even to the same execution unit), written back and committed in each cycle. The Int ALU is used for integer operations as well as for memory address computation of load and store instructions.

(a) Assuming that ROBs are used to rename the registers, indicate, in the provided space, the content of the register status table at the end of cycles 1-6, assuming that at cycle 0, the issue queue contains all nine instructions shown in the figure and all reservation stations and reorder buffers are empty.

At cycle 1:
    I0 and I1 are issued to the reservation stations and reserve ROB0 and ROB1.
    Cycle 1:   F0   F1   F2   ..   R1   ..

At cycle 2:
    I2 and I3 are issued and reserve ROB2 and ROB3.
    I0 uses Int ALU for address computation.
    Cycle 2:   F0   F1   F2   ..   R1   ..

At cycle 3:
    I4 and I5 are issued and reserve ROB4 and ROB5.
    I0 loads from the cache.
    I1 uses Int ALU for address computation.
    Cycle 3:   F0   F1   F2   ..   R1   ..

At cycle 4:
    I0 writes on the CDB and moves to ROB0, and I2 and I3 read data from the CDB.
    I1 loads from the cache.
    I4 executes on Int ALU.
    Cycle 4:   F0   F1   F2   ..   R1   ..

At cycle 5:
    I0 commits and frees ROB0.
    I1 writes on the CDB and moves to ROB1, and I3 reads data from the CDB.
    I2 starts execution.
    I4 writes on the CDB and moves to ROB4, and I5 reads data from the CDB.
    Cycle 5:   F0   F1   F2   ..   R1   ..

At cycle 6:
    I1 commits and frees ROB1, and I6 is issued and reserves ROB0.
    I3 starts execution.
    I5 uses Int ALU for address computation.
    Cycle 6:   F0   F1   F2   ..   R1   ..

(b) Follow the same style to trace the next three cycles of execution and show the content of the register status table at the end of each cycle:

At cycle 7:
    Cycle 7:   F0   F1   F2   ..   R1   ..

At cycle 8:
    Cycle 8:   F0   F1   F2   ..   R1   ..

At cycle 9:
    Cycle 9:   F0   F1   F2   ..   R1   ..

(c) Complete the following table, which describes the status of execution of each instruction in cycles 7, 8 and 9.

Instruction status in cycle C

    instruction  C=1     C=2        C=3        C=4        C=5        C=6        C=7  C=8  C=9
    I0           issued  executing  Mem        WB         commit
    I1           issued             executing  Mem        WB         commit
    I2                   issued     issued     issued     executing  executing
    I3                   issued     issued     issued     issued     executing
    I4                              issued     executing  WB         In ROB
    I5                              issued     issued     issued     executing
    I6                                                               issued
    I7

Question 5 (15 points)
Assume a system with a 4-way set associative L1 cache and an 8-way set associative L2 cache, both with a block size of 8 words. The hit times for the L1 cache and the L2 cache are 1 and 6 cycles, respectively. Assume also that the L1 miss rate is 5% and the local L2 miss rate is 20%. On an L2 miss, a block is fetched from memory, and it takes 92 cycles for the first word of the block to reach the L2 and 4 cycles for each of the 7 following words to reach the cache. That is, the L2 miss penalty is 120 cycles. Ignore any time needed to transfer blocks from L2 to L1 and to the CPU.

a) Compute the average memory access time (in cycles) given that the L2 miss penalty is 120 cycles.
    AMATa =

b) What would be the L2 miss penalty and the average memory access time if early restart and critical word first are used when a miss occurs in L2 and a block is fetched from memory?
    Miss penalty =
    AMATb =

c) Assume that early restart and critical word first are not used, but that way prediction is used for L1. With correct prediction, the L1 hit time is still 1 cycle, but a pseudo hit (way misprediction) takes 2 cycles. What is the AMAT if the way prediction is correct 80% of the time?
    AMATc =

d) If your answer is correct, then AMATc should be larger than AMATa. Why would anyone suggest the use of way prediction?
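The calculation in part (a) can be sketched with the usual two-level formula: AMAT = L1 hit time + L1 miss rate * (L2 hit time + local L2 miss rate * L2 miss penalty). The function name `amat_two_level` and its parameter names are hypothetical; the numbers are the ones given in the question.

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, l2_miss_penalty):
    """Average memory access time, in cycles, for an L1 + L2 hierarchy.

    Assumption: every access pays the L1 hit time, every L1 miss pays the
    L2 hit time, and every L2 miss additionally pays the L2 miss penalty.
    """
    return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_miss_penalty)


# Part (a): 8-word block, 92 cycles for the first word plus 4 for each of 7 more.
full_block_penalty = 92 + 7 * 4
amat_a = amat_two_level(l1_hit=1, l1_miss_rate=0.05, l2_hit=6,
                        l2_local_miss_rate=0.20, l2_miss_penalty=full_block_penalty)
print(f"L2 miss penalty = {full_block_penalty} cycles, AMAT = {amat_a:.2f} cycles")
```

Parts (b) and (c) can be explored by changing `l2_miss_penalty` or `l1_hit`, depending on how the optimizations are modeled.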

Question 6 (15 points)
Consider the execution of the following code segment on a CPU with a fully associative L1 cache whose size is n words and whose block size is 4 words:

    for (i=0 ; i < n ; i++) {
        for (j=0 ; j < n ; j++) {
            C = C + A[i,j] * B[j] ;
        }
    }

Assume that the cache, which is initially empty, uses LRU replacement and that C is allocated in a register. Assume also that the matrix A is allocated in memory in a row-wise fashion (A[0,0], A[0,1], ..., A[0,n-1], A[1,0], ...).

(a) How many cache misses result from executing the above code segment?

(b) How many cache misses result from executing the above code segment if the two loops are interchanged?

(c) Can you rewrite the code segment to reduce cache misses (use blocking)?

(d) How many misses would result from the execution of the code that you wrote in (c)?
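Miss counts for parts (a) and (b) can be checked empirically for a concrete n with a small simulator. This is a sketch under stated assumptions: n is a multiple of 4, A starts at word address 0, B is placed immediately after A (so both are block-aligned), and only accesses to A and B touch the cache. The function name `count_misses` is hypothetical.

```python
from collections import OrderedDict


def count_misses(n, interchanged=False, block_words=4):
    """Simulate a fully associative LRU cache of n words on the loop nest.

    Assumed layout: A[i,j] at word i*n + j (row-major), B[j] at n*n + j.
    """
    capacity = n // block_words          # number of blocks the cache holds
    cache = OrderedDict()                # keys ordered from LRU to MRU
    misses = 0

    def touch(word_addr):
        nonlocal misses
        block = word_addr // block_words
        if block in cache:
            cache.move_to_end(block)     # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True

    for outer in range(n):
        for inner in range(n):
            # Original order iterates i outside; interchanged iterates j outside.
            i, j = (inner, outer) if interchanged else (outer, inner)
            touch(i * n + j)             # A[i,j]
            touch(n * n + j)             # B[j]
    return misses


for n in (8, 16):
    print(n, count_misses(n), count_misses(n, interchanged=True))
```

Running it for a few values of n gives data points against which a closed-form answer can be compared.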

Detach this page from the rest of the booklet and use it when you are answering question 4.

    I0: L.D    F0, 0(R1)
    I1: L.D    F2, -8(R1)
    I2: MULT.D F1, F0, F1
    I3: ADD.D  F0, F0, F2
    I4: DADDI  R1, R1, -16
    I5: L.D    F2, -8(R1)
    I6: MULT.D F0, F0, F2
    I7: ADD.D  F0, F0, F1
    I8: S.D    F0, 0(R1)

The following table describes the status of execution of the above instructions on the shown architecture at cycles 1-6.

Instruction status in cycle C

    instruction  C=1     C=2        C=3        C=4        C=5        C=6        C=7  C=8  C=9
    I0           issued  executing  Mem        WB         commit
    I1           issued             executing  Mem        WB         commit
    I2                   issued     issued     issued     executing  executing
    I3                   issued     issued     issued     issued     executing
    I4                              issued     executing  WB         In ROB
    I5                              issued     issued     issued     executing
    I6                                                               issued
    I7

CS 2410 Mid term (fall 2018)

Name:

Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage pipeline operating with a 1 ns clock and the second a 12-stage pipeline operating with a 0.7 ns clock. Due to data