EECS 470 Final Exam
Fall 2015

Name: ____________________  unique name: ____________________

Sign the honor code:
I have neither given nor received aid on this exam nor observed anyone else doing so.

Scores:

Page #    Points
2         /17
3         /11
4         /13
5         /10
6         /11
7         /9
8         /6
9         /8
10&11     /15
Total     /100

NOTES:
- Open notes, open book.
- There are 11 pages including this one.
- Calculators are allowed, but no PDAs, portables, cell phones, etc.
- Don't spend too much time on any one problem. You have about 120 minutes for the exam.
- Be sure to show work and explain what you've done when asked to do so. Getting partial credit without showing work will be rare.

Page 1 of 11
1) Multiple choice/fill-in-the-blank. Pick the best answer. [10 points, -2 per wrong/blank answer, min 0]

a) You would expect a direct-mapped 256-byte cache with 32-byte cache lines to get a hit about ____% of the time on a memory access with a stack distance of 1.

b) In a superscalar processor, the hardware / compiler / operating system detects the data dependencies between concurrently fetched instructions. In a VLIW processor, the hardware / compiler / operating system does.

c) In a bus-based multi-processor system, a load request must go to the bus unless the cache has the data in the S or E or I / S or E or M / E or M / M state.

d) The fundamental cause of false register dependencies in a program is
   - imperfect branch prediction.
   - instructions which generate exceptions.
   - a finite number of architected registers.
   - an inability of the compiler to detect the false dependencies.

e) One advantage of a CISC ISA over a RISC ISA is that you would expect
   - more memory operations
   - the decode to be simpler
   - fewer complex instructions
   - a better Icache hit rate

f) The compiler often has difficulty moving a load above a branch in program order because
   - there might be a true dependence.
   - there might be a name dependence.
   - it could cause an exception that wouldn't otherwise occur.

2) Each of the following can be said to be a feature of the ISA or of the microarchitecture. Circle each of the following that can be said to be a feature of the ISA. [7 points, -2 per incorrectly circled or not circled answer, minimum 0]

   - The depth of an in-order pipeline
   - The size of the L1 cache
   - Existence of predicated instructions
   - The encoding of a given instruction
   - Gshare branch prediction
   - Number of CDBs
   - Number of physical registers
   - Number of architectural registers

Page 2 of 11
3) Short answer [11 points]

a) Which would be simpler to design, a 4-wide superscalar out-of-order machine or a 4-wide VLIW machine? Justify your answer. [4]

b) Consider the following program running on 4 different processors on the same multi-processor system. Each processor gets a unique result from the CPUID instruction (either 0, 1, 2 or 3). Notice that the array big is read-only.

    main(int argc, char * argv[])
    {
        int A[4];         // shared global array
        int big[400000];  // shared global array initialized elsewhere
        int x,i;          // local variables put into a register by the
                          // compiler. Each processor/thread has its own
        x=cpuid();        // gets the CPU number of the current processor
                          // so processor 1 returns "1", processor
                          // 0 returns "0" etc.
        for(i=0;i<400000;i++)
            A[x]+=big[i];
    }

i) Assuming the array big is initialized so all elements are 1, what would you expect the final value of the array A to be? (This is not a trick question.) [2]

    A[0]= ______  A[1]= ______  A[2]= ______  A[3]= ______

ii) When measured in the lab, it turns out that processors 0, 1 and 2 are each issuing around 400,000 BRILs on the bus when running this program, while processor 3 issues only one. What is likely causing all those reads for ownership, and why is processor 3 issuing so many fewer than the others? [5]

Page 3 of 11
4) Caches [13 points]

a) Provide the shortest possible reference stream where a 2-way associative cache will get a hit, while a direct-mapped cache will get a miss. Assume both are 4KB caches with 32-byte lines and provide the addresses in hex. [4]

b) Consider the following C code:

    int SIZE, STRIDE;
    int A[SIZE];  // ints are 4 bytes on this computer
    // Initialize SIZE and STRIDE here
    for(j=0;j<N;j++)
        for(i=0;i<SIZE;i=i+STRIDE)
            X+=A[i];

Assume N is a very large number. Approximately what hit rates would you expect to get on a 4 KB, two-way associative data cache with 32-byte lines given the following values for STRIDE and SIZE? You are to assume that every value other than the array A is kept in registers and that shorts are 2 bytes in size. [9 points, -3 per wrong/blank box, min 0]

              SIZE=4096     SIZE=3072 (2048*1.5)
STRIDE=1      ________      ________
STRIDE=4      ________      ________

Page 4 of 11
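As a study aid for part (a), the following minimal sketch splits an address into its set index and tag for a given cache organization. The function names (`num_sets`, `set_index`, `tag_of`) and the parameterization are assumptions made for illustration; they are not part of the exam, and the sketch assumes all sizes are powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not from the exam): decompose an address for a
 * cache of cache_bytes total size, line_bytes per line, and `ways` lines
 * per set. All three are assumed to be nonzero powers of two. */
uint32_t num_sets(uint32_t cache_bytes, uint32_t line_bytes, uint32_t ways)
{
    assert(line_bytes != 0 && ways != 0);
    assert(cache_bytes >= line_bytes * ways);
    return cache_bytes / (line_bytes * ways);
}

/* Set that the address maps to: line number modulo the number of sets. */
uint32_t set_index(uint32_t addr, uint32_t cache_bytes,
                   uint32_t line_bytes, uint32_t ways)
{
    return (addr / line_bytes) % num_sets(cache_bytes, line_bytes, ways);
}

/* Tag: the remaining high-order bits of the line number. */
uint32_t tag_of(uint32_t addr, uint32_t cache_bytes,
                uint32_t line_bytes, uint32_t ways)
{
    return (addr / line_bytes) / num_sets(cache_bytes, line_bytes, ways);
}
```

With 4 KB of capacity and 32-byte lines, `num_sets` gives 128 sets for the direct-mapped organization (`ways = 1`) and 64 sets for the 2-way organization (`ways = 2`), so the same address can carry a different tag in each.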
4) Consider a processor running a given application that performs 300 million loads and 100 million stores per second. Assume the following is true: [10 points]

   - The processor's multi-level cache system gets a 90% hit rate on both loads and stores.
   - Cache lines are 32 bytes in size.
   - There is no prefetching and the instruction cache never misses.
   - 20% of all lines evicted from the last level of the cache are dirty.
   - The cache is write-back and no-write-allocate.
   - All loads and stores are to 4-byte values.
   - The bus supports both 4-byte and 32-byte transactions.
   - There is no coherence traffic (only one processor).

a) What is the read bandwidth (bytes/second) on the bus? Show your work. [4]

b) What is the write bandwidth (bytes/second) on the bus? Show your work. [6]

Page 5 of 11
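One way to organize the arithmetic for this question is sketched below. It is an illustration under one reading of the stated assumptions, not an official solution: load misses fetch a full line, store misses bypass the cache as word-sized writes (no-write-allocate), and in steady state every line fill evicts one line. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical parameter bundle; field names are illustrative only. */
struct traffic {
    double loads_per_sec;
    double stores_per_sec;
    double miss_rate;       /* e.g. 0.10 for a 90% hit rate */
    double line_bytes;
    double word_bytes;
    double dirty_fraction;  /* fraction of evicted lines that are dirty */
};

/* Read traffic: each load miss fetches one full line from memory. */
double read_bandwidth(const struct traffic *t)
{
    assert(t->miss_rate >= 0.0 && t->miss_rate <= 1.0);
    return t->loads_per_sec * t->miss_rate * t->line_bytes;
}

/* Write traffic: word-sized writes for store misses, plus full-line
 * write-backs of dirty evicted lines. */
double write_bandwidth(const struct traffic *t)
{
    assert(t->dirty_fraction >= 0.0 && t->dirty_fraction <= 1.0);
    double store_miss_bytes = t->stores_per_sec * t->miss_rate * t->word_bytes;
    /* Assumption: one eviction per line fill, and only load misses fill. */
    double evictions = t->loads_per_sec * t->miss_rate;
    double writeback_bytes = evictions * t->dirty_fraction * t->line_bytes;
    return store_miss_bytes + writeback_bytes;
}
```

Keeping the store-miss term and the write-back term separate makes it easy to see which assumption each part of the write bandwidth depends on.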
5) Consider a case of having 3 processors using a snoopy MESI protocol where the memories can snarf data. All three have a 2-line direct-mapped cache with each line consisting of 16 bytes. The caches begin with all lines marked as invalid. Fill in the following tables indicating:

   - If the processor gets a hit or a miss in its cache.
   - If a HIT or HITM (or nothing) occurs on the bus during snoop.
   - What bus transaction(s) (if any) the processor performs (BRL, BWL, BRIL, BIL).
   - For misses only, indicate if the miss is compulsory, capacity, conflict, or coherence. A coherence miss is one where there would have been a hit, had some other processor not caused an invalidation of that line.

Finally, indicate the state of each processor's cache after all of these memory operations have completed. The operations occur in the order shown. [11 points, -1 per wrong or blank, minimum of 0]

Processor  Address  Read/Write  Hit/Miss  Bus transaction(s)  HIT/HITM  4C's miss type
1          0x100    Read
1          0x120    Read
1          0x100    Read
2          0x104    Read
1          0x100    Write
1          0x200    Read
1          0x118    Write
1          0x100    Read
2          0x100    Read
2          0x100    Write
3          0x110    Read

         Proc 1            Proc 2            Proc 3
         Address  State    Address  State    Address  State
Set 0
Set 1

Page 6 of 11
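For reference while filling in the table, the snoop-side behavior of a textbook MESI protocol can be sketched as below. This is a minimal illustration under assumed conventions, and the course's exact protocol may differ in details: a cache holding the line in M answers HITM, one holding it in E or S answers HIT, and (as a simplifying assumption) a response is modeled only for BRL and BRIL. The enum and function names are hypothetical.

```c
#include <assert.h>

enum mesi { MESI_I, MESI_S, MESI_E, MESI_M };
enum bus_op { BRL, BRIL, BIL, BWL };
enum snoop_resp { SNOOP_NONE, SNOOP_HIT, SNOOP_HITM };

/* What a snooping cache signals for a line it holds in state s.
 * Assumption: only BRL and BRIL are modeled as drawing a response. */
enum snoop_resp snoop_response(enum mesi s, enum bus_op op)
{
    if ((op != BRL && op != BRIL) || s == MESI_I)
        return SNOOP_NONE;
    return (s == MESI_M) ? SNOOP_HITM : SNOOP_HIT;
}

/* Next state of the snooping cache's copy of the line. */
enum mesi snoop_next_state(enum mesi s, enum bus_op op)
{
    switch (op) {
    case BRL:  return (s == MESI_I) ? MESI_I : MESI_S; /* readers share */
    case BRIL: /* fall through: both invalidate other copies */
    case BIL:  return MESI_I;
    default:   return s;                               /* BWL: no change */
    }
}

/* State the requester installs after a read miss serviced by a BRL. */
enum mesi fill_state_after_brl(enum snoop_resp r)
{
    return (r == SNOOP_NONE) ? MESI_E : MESI_S;
}
```

Separating the response from the next-state function mirrors the table's columns: the HIT/HITM column comes from the other caches' responses, while the final-state table comes from applying the transitions in order.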
6) For purposes of this problem, assume the power of a single processor is approximately proportional to performance cubed. Say that you have two designs for a die: [9 points]

   (1) A single processor on the die that does 10 BIPS while drawing 200W.
   (2) Three processors on a die. They draw a total of 200W and have the performance you'd expect from voltage/frequency scaling (per the assumption above).

a) On a trivially parallelizable benchmark, what performance in BIPS would (2) achieve? [3]

b) On a benchmark that cannot be parallelized, what performance in BIPS would (2) achieve? [3]

c) Say that power was the sole limiting factor on performance (area, cost, etc. are of no concern). How could you optimize performance to do well in both cases? [3]

Page 7 of 11
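The cubic relationship in this problem can be explored with the short sketch below. It assumes the power budget is split evenly among identical cores and that power is exactly proportional to performance cubed; the function names are hypothetical, and the cube root is computed with Newton's method only to keep the example free of library dependencies.

```c
#include <assert.h>

/* Cube root by Newton's method (valid for x > 0 only). */
double cube_root(double x)
{
    assert(x > 0.0);
    double r = (x > 1.0) ? x : 1.0;     /* starting guess at or above the root */
    for (int i = 0; i < 200; i++)
        r = (2.0 * r + x / (r * r)) / 3.0;
    return r;
}

/* Per-core performance when n identical cores evenly share the power
 * budget of a single core that delivers base_perf, assuming
 * power ~ perf^3 (so perf ~ power^(1/3)). */
double per_core_perf(double base_perf, int n)
{
    assert(n >= 1);
    return base_perf * cube_root(1.0 / (double)n);
}

/* Aggregate throughput if the workload scales perfectly across n cores. */
double parallel_perf(double base_perf, int n)
{
    return (double)n * per_core_perf(base_perf, n);
}
```

Calling `per_core_perf(10.0, 3)` and `parallel_perf(10.0, 3)` gives the single-thread and fully parallel figures for design (2) under these assumptions.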
7) Circle the correct answer [6 points, -2 per blank or wrong answer, min 0]

a) The physical memory is effectively a cache of the page table / TLB / disk / data cache.

b) The TLB is effectively a cache of page table / physical memory / disk / data cache.

c) If I have a virtually indexed, physically tagged cache with 32-byte blocks that is 4-way associative and the virtual memory system has 8KB pages, then I know that:
   - The cache index is 13 bits.
   - The cache index is 9 bits.
   - The cache size cannot be greater than 8KB.
   - The cache may suffer from the synonym problem.
   - None of the above.

d) If I have a cache with 32-byte blocks that is 4-way associative and the virtual memory system has 8KB pages, where the TLB comes after the cache, then I know that:
   - The cache index is 13 bits.
   - The cache index is 9 bits.
   - The cache size cannot be greater than 8KB.
   - The cache may suffer from the synonym problem.
   - None of the above.

Page 8 of 11
8) Your boss has asked you to design a module in SystemVerilog to determine whether a branch instruction should be taken. This module, "br_taken", outputs a 1-bit signal "taken". The module header and all necessary input/output signals are provided. You may assume the following:

   - A valid branch is either conditional or unconditional and uses a comparison operator if appropriate.
   - Your module must output correctly for all valid input. For invalid input you should output taken = 0.
   - Comparisons should be performed as: <op1> <comparison operator> <op2>

Your code must be synthesizable and should not produce any latches. [8 points]

    module br_taken(
        input  logic        cond_br,    // 1 if the branch is conditional,
                                        // otherwise 0
        input  logic        uncond_br,  // 1 if the branch is unconditional,
                                        // otherwise 0
        input  logic [63:0] op1,        // 64-bit, unsigned, integer operand
        input  logic [63:0] op2,        // 64-bit, unsigned, integer operand
        input  logic [1:0]  comp,       // 0: less-than, 1: equals,
                                        // 2: greater-than
        output logic        taken       // 0: not taken, 1: taken
    );
    //Your code here

Page 9 of 11
9) Consider the following tables that represent the state of a processor that implements what we have called the P6 algorithm:

RAT
Arch Reg #   ROB# (-- if in ARF)
0            --
1            4
2            1
3            --
4            2
5            --

ROB
Buffer Number   PC   Done with EX?   Dest. Arch Reg #   Value
0               20   N               4
1               24   N               2
2               28   Y               4                  100
3               32   Y               --                 --
4               36   N               1
5
6
7
8

RS
RS#   Op type   Op1 ready?   Op1 RoB/value   Op2 ready?   Op2 RoB/value   Dest ROB
0     +         Y            5               Y            6               0
1
2     *         N            0               Y            6               1
3     *         Y            9               Y            7               4
4

ARF
Reg#    0   1   2   3   4   5
Value   4   5   6   7   8   9

The instruction at PC 32 is a branch that has been predicted not-taken, but it is actually taken. The destination of the branch is PC 200, where the following code resides:

    R3=R3+R4 // A (PC 200)
    R1=R1+R3 // B
    R5=R1+R3 // C

Show the state of the above tables if instruction A has retired, inst B has not started executing, while C has progressed as far along as possible. Be sure to label the head and tail of the ROB. Please place instruction A in slot 5 of the ROB. When other arbitrary decisions need to be made, you are to just make them. Be sure to update the head and tail. [15]

(A second copy is available on the following page; please cross out the one you don't want graded!)

Page 10 of 11
(Extra copy, cross out if not used.)

Consider the following tables that represent the state of a processor that implements what we have called the P6 algorithm:

RAT
Arch Reg #   ROB# (-- if in ARF)
0            --
1            4
2            1
3            --
4            2
5            --

ROB
Buffer Number   PC   Done with EX?   Dest. Arch Reg #   Value
0               20   N               4
1               24   N               2
2               28   Y               4                  100
3               32   Y               --                 --
4               36   N               1
5
6
7
8

RS
RS#   Op type   Op1 ready?   Op1 RoB/value   Op2 ready?   Op2 RoB/value   Dest ROB
0     +         Y            5               Y            6               0
1
2     *         N            0               Y            6               1
3     *         Y            9               Y            7               4
4

ARF
Reg#    0   1   2   3   4   5
Value   4   5   6   7   8   9

The instruction at PC 32 is a branch that has been predicted not-taken, but it is actually taken. The destination of the branch is PC 200, where the following code resides:

    R3=R3+R4 // A (PC 200)
    R1=R1+R3 // B
    R5=R1+R3 // C

Show the state of the above tables if instruction A has retired, inst B has not started executing, while C has progressed as far along as possible. Be sure to label the head and tail of the ROB. Please place instruction A in slot 5 of the ROB. When other arbitrary decisions need to be made, you are to just make them. Be sure to update the head and tail. [15]

Page 11 of 11