EECS 470 Midterm Exam
Winter 2009

Name:
unique name:

Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so.

Scores:

#      Points
1      / 18
2      / 12
3      / 29
4      / 21
5      / 20
Total  / 100
Bonus  / 3

NOTES:
Closed book/notes.
Calculators are allowed, but no PDAs, portables, cell phones, etc.
Don't spend too much time on any one problem. You have about 90 minutes for the exam (avg. 18 minutes per problem).
There are 9 pages including this one. Please ensure you have all pages.
Be sure to show work and explain what you've done when asked to do so.
1) Short answer [18 points]

a) Does pipelining improve latency or throughput? [3 points]

Answer: Throughput.

b) Give 2 reasons why we don't build processors with massive register files (e.g., tens of thousands of registers). [6 points]

Answer:
Requires too many bits in the opcode.
Register file accesses would be extremely slow.

c) Give an example of a microprocessor component that exploits locality. [3 points]

Answer: Cache.

d) Most instruction sets include both PC-relative branches and branch-to-register instructions.

i) State an advantage of PC-relative branches and give an example of a software construct where they are used. [3 points]

Answer: PC-relative branches take fewer bits in the opcode and allow relocatable code. Used for loops.

ii) State an advantage of branch-to-register instructions and give an example of a software construct where they are used. [3 points]

Answer: They allow branching arbitrary distances and allow dynamic targets (e.g., lookup tables for targets). Used for function calls, virtual functions, switch statements, and function pointers.
2) Performance Analysis [12 points]

a) Suppose 80% of a program is parallelizable (performance scales linearly with the number of cores), while the other 20% is serial (must run on one core). What is the speedup over a uniprocessor when running the program on a quad-core machine? [6 points]

Answer: 1 / ((1 - 0.8) + (0.8 / 4)) = 2.5

b) Suppose you run two applications one after the other on a Core 2 Duo. The two applications contain the same number of instructions. The first application runs at 4 instructions per cycle (IPC), while the IPC of the second application is 2. What is the overall average IPC? [6 points]

Answer: 2 / (1/4 + 1/2) = 2.667
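The arithmetic in both answers above can be checked with a short Python sketch. The function names `amdahl_speedup` and `average_ipc` are illustrative and not part of the exam; the second function assumes all applications have the same instruction count, as the problem states, so total instructions over total cycles reduces to a harmonic mean.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only parallel_fraction scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def average_ipc(ipcs: list[float]) -> float:
    """Overall IPC for equal-length programs run back to back.

    With N instructions each, total cycles = sum(N / ipc), so
    total instructions / total cycles is the harmonic mean of the IPCs.
    """
    return len(ipcs) / sum(1.0 / ipc for ipc in ipcs)


print(round(amdahl_speedup(0.8, 4), 3))  # 2.5
print(round(average_ipc([4, 2]), 3))     # 2.667
```

Note that the arithmetic mean of the two IPCs (3.0) would overstate the result, because the slower application accounts for more of the total cycles.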
3) Reorder Buffers in the P6 microarchitecture [29 points]

c) Briefly explain the purpose of a reorder buffer. [3 points]

Answer: Enables precise state for speculation/exceptions.

d) What effect does a reorder buffer have on performance? [3 points]

Answer: It reduces performance because it introduces a new structural hazard.

e) Draw a diagram showing the contents of a single re-order buffer (ROB) entry for a P6-like microarchitecture (i.e., one having an architectural register file). Identify all the fields stored within the entry and label the width (in bits) of each. Don't forget to include any "instruction status" bits used by any pipeline stages. Assume a 32-bit machine with 32 architectural registers, a 64-entry ROB (you only need to draw one entry) and 16 reservation stations. [5 points]

Answer:

Field                            Width (bits)  Points
Value                            32            1
PC and/or calculated target      32            1
Dest Reg Name                    5             1
Executed                         1             1
Exception/Mispredict             1             1
(optional: opcode)
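As a rough sanity check on the field widths in the answer above, the sketch below totals the bits in one entry and derives the index widths from the structure sizes. The dictionary keys and the helper `index_bits` are hypothetical names chosen for illustration, and the 64-entry ROB size is an assumption consistent with 6-bit ROB tags.

```python
# Field widths in bits, copied from the answer above (optional opcode omitted).
ROB_ENTRY_FIELDS = {
    "value": 32,
    "pc_or_target": 32,
    "dest_reg_name": 5,
    "executed": 1,
    "exception_or_mispredict": 1,
}


def index_bits(entries: int) -> int:
    """Bits needed to name one of `entries` items (entries >= 2)."""
    return (entries - 1).bit_length()


entry_bits = sum(ROB_ENTRY_FIELDS.values())
print(entry_bits)      # 71
print(index_bits(32))  # 5: destination register name, 32 architectural registers
print(index_bits(64))  # 6: ROB tag, assuming a 64-entry ROB
```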
f) Finish the drawing below to indicate the input and output ports of the ROB module from part (e) for a 2-wide superscalar machine. For each port, label its width and indicate during which pipeline stage the port is used (assume the P6 pipeline stages discussed in class: Fetch, Decode, Issue, Execute, Complete, Retire). Assume that the head and tail pointers are maintained within the ROB module, and that the ROB does not support early branch resolution. (Note: this problem is significantly harder than it looks at first glance; think carefully about all the signals required to get instructions in and out of the ROB. I suggest doing this problem last.) [18 points]

Answer (width in bits is given in parentheses after the port name; the trailing parenthesized number is the point value):

Inputs:
Dispatch Enable (2) - Dispatch (0.5)
Dest Register x2 (5) - Dispatch (2)
PC x2 - Dispatch (0.5)
(optional: opcode - Dispatch)
(optional: retire enable x2)
(optional: squash)
CDB Value x2 (32) - Complete (2)
CDB Tag x2 (6) - Complete (1)
CDB Write Enable x2 (0.5)
CDB Exception/Mispredict x2 - Complete (0.5)
Source Operand Tag x4 (6) - Dispatch (1)
(optional: Clock)

Outputs:
Full - Dispatch (1)
Almost Full - Dispatch (0.5)
Source Op. Value x4 (32) - Dispatch (2)
Correct PC (32) - Retire (0.5)
Retirement Value x2 (32) - Retire (2)
Retirement Register x2 (5) - Retire (2)
Head complete bits x2 - Retire (0.5)
Head Mispredict/Exception x2 - Retire (0.5)
Tail pointer / next tag - Dispatch (1)
(optional: head pointer)

Bonus) In 1-2 sentences, explain how a history buffer differs from a reorder buffer. [+3 bonus]
4) Handling RAW Memory Dependences. [20 points]

a) Consider the following sequence of load and store instructions (the first operand contains the address for the load or store, the second is the source/destination register for the value):

(1) store [r1], r16
(2) store [r2], r17
(3) load [r4], r18
(4) store [r5], r19
(5) load [r6], r20

i) Explain the necessary/sufficient conditions to execute instruction #5 nonspeculatively. [3 points]

Answer: Stores (1), (2), and (4) have calculated their addresses; load (5)'s address can be calculated (its input registers are ready).

ii) Suppose we want to issue load (5) earlier, speculatively. What are the conditions to issue the load to the memory system? [3 points]

Answer: Its address has been calculated.

iii) What, precisely, are we speculating? (I.e., what is the hardware guessing about the values of the registers accessed by the loads and stores?) [3 points]

Answer: That r6 is different from r5, r2, and r1.
iv) Describe a sequence of events where load #5 is issued speculatively, but the speculation fails (i.e., the conditions you specify in your answer to (iii) turn out to be false). [3 points]

Answer: r5 resolves, the load issues, then r2 resolves to the same value as r6.

v) What does the processor core have to do to fix the mis-speculation? [3 points]

Answer: Squash and re-execute load (5) and all subsequent instructions.

b) A Memory Dependence Predictor is a piece of hardware that tries to reduce the frequency of mis-speculation events like the scenario you described in your answer to (a.iv). You can think of the predictor as a black box that takes in some information about a load instruction and tries to guess if the load should execute speculatively or not. Internally, the predictor stores some state about what it has observed in the past. Propose a high-level design for a memory dependence predictor. In particular, describe the inputs to your predictor black box and the state it contains. Briefly argue why your design will provide high prediction accuracy and require reasonable resources. [6 points]

Answer: Save the PCs of load instructions that receive their value via forwarding in a table. Predict that a load should not execute speculatively if it has an entry in the table.
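One way to picture the predictor described in the answer to (b) is the minimal Python sketch below. It is a simplification under stated assumptions, not the exam's reference design: the class name `LoadWaitTable`, the 1024-entry size, and the 4-byte instruction alignment are all hypothetical, and the table keeps a single bit per index instead of full load PCs, so two loads that map to the same index share a prediction.

```python
class LoadWaitTable:
    """Hypothetical PC-indexed predictor: loads that were forwarded to must wait."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.must_wait = [False] * entries  # one bit of state per table entry

    def _index(self, load_pc: int) -> int:
        # Drop the 2 low bits (4-byte instructions assumed), then wrap.
        return (load_pc >> 2) % self.entries

    def may_speculate(self, load_pc: int) -> bool:
        """Predict whether this load can issue ahead of unresolved stores."""
        return not self.must_wait[self._index(load_pc)]

    def record_dependence(self, load_pc: int) -> None:
        """Train: this load received its value from an earlier in-flight store."""
        self.must_wait[self._index(load_pc)] = True


predictor = LoadWaitTable()
print(predictor.may_speculate(0x4000))  # True: no history for this load yet
predictor.record_dependence(0x4000)
print(predictor.may_speculate(0x4000))  # False: the load now waits for older stores
print(predictor.may_speculate(0x4004))  # True: a different load is unaffected
```

The input is only the load's PC and the state is one bit per entry, which keeps the resource cost small; the accuracy argument rests on the observation that a given static load tends to repeat its dependence behavior.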
5) MIPS R10K Microarchitecture. [20 points]

On the next pages, you will find a set of charts showing a snapshot of a MIPS R10K-like microarchitecture after one cycle executing a sequence of instructions. You must advance this machine 5 additional clock cycles (to the end of cycle #6). Use the cycle-by-cycle state tables to record the contents of each hardware structure at the end of each clock cycle.

Assume the following:

Assume the machine is a 2-wide superscalar (i.e., it can issue, complete, and retire at most 2 instructions per cycle). If there are conflicts among instructions, the machine always selects the oldest instructions first.

Ignore the fetch stage. Assume all instructions have been fetched and are ready for dispatch whenever the out-of-order core allows.

This machine has 4 architectural registers, a 5-entry ROB, 4 reservation stations, and 9 physical registers.

There are 2 add functional units with a 1-cycle latency, and 1 fully-pipelined multiply functional unit with a 2-cycle latency (fully-pipelined means the multiply unit has 2 pipeline stages; it can issue a new multiply each cycle, however, multiplies take 2 cycles to execute).

Assume there is no bypassing in the X stage, but C will bypass to S through the physical register file.

Assume reservation stations are freed as early as possible and can be reused as soon as they are freed.

Note that there are 6 instructions, but the cycle-by-cycle tables only have space for the 5 ROB entries. Be sure to wrap back to the top of the ROB if you dispatch the 6th instruction.

Here is the instruction sequence:

(1) R3 = R1 * R2
(2) R4 = R4 * 10
(3) R1 = R3 + R2
(4) R2 = R2 + 5
(5) R3 = R4 + R2
(6) R1 = R1 + R3

Pay attention to the cycle number on each chart; be sure you fill them out in the correct order! If you make a mistake and need additional blank copies of the fill-in sheet, ask the exam proctor. Make sure the old sheets are torn up and the new ones are stapled to your exam!
SOLUTION

R10K Cycle #1

ROB
ht  #  Insn       T   Told  S  X    C
t   1  R3=R1*R2   p5  p3
h   2  R4=R4*10   p6  p4
    3
    4
    5

Map Table
Reg  T+
r1   p1+
r2   p2+
r3   p5
r4   p6

Free List: p7, p8, p9

Reservation Stations
#  op         T   T1   T2
1  R3=R1*R2   p5  p1+  p2+
2  R4=R4*10   p6  p4+
3
4

R10K Cycle #2

ROB
ht  #  Insn       T   Told  S  X    C
t   1  R3=R1*R2   p5  p3    2
    2  R4=R4*10   p6  p4
    3  R1=R3+R2   p7  p1
h   4  R2=R2+5    p8  p2
    5

Map Table
Reg  T+
r1   p7
r2   p8
r3   p5
r4   p6

Free List: p9

Reservation Stations
#  op         T   T1   T2
1  R3=R1*R2   p5  p1+  p2+
2  R4=R4*10   p6  p4+
3  R1=R3+R2   p7  p5   p2+
4  R2=R2+5    p8  p2+

R10K Cycle #3

ROB
ht  #  Insn       T   Told  S  X    C
t   1  R3=R1*R2   p5  p3    2  3
    2  R4=R4*10   p6  p4    3
    3  R1=R3+R2   p7  p1
    4  R2=R2+5    p8  p2    3
h   5  R3=R4+R2   p9  p5

Map Table
Reg  T+
r1   p7
r2   p8
r3   p9
r4   p6

Free List: (empty)

Reservation Stations
#  op         T   T1   T2
1  R3=R4+R2   p9  p6   p8
2  R4=R4*10   p6  p4+
3  R1=R3+R2   p7  p5   p2+
4  R2=R2+5    p8  p2+

R10K Cycle #4

ROB
ht  #  Insn       T   Told  S  X    C
t   1  R3=R1*R2   p5  p3    2  3-4
    2  R4=R4*10   p6  p4    3  4-
    3  R1=R3+R2   p7  p1
    4  R2=R2+5    p8  p2    3  4
h   5  R3=R4+R2   p9  p5

Map Table
Reg  T+
r1   p7
r2   p8
r3   p9
r4   p6

Free List: (empty)

Reservation Stations
#  op         T   T1   T2
1  R3=R4+R2   p9  p6   p8
2
3  R1=R3+R2   p7  p5   p2+
4

R10K Cycle #5

ROB
ht  #  Insn       T   Told  S  X    C
t   1  R3=R1*R2   p5  p3    2  3-4  5
    2  R4=R4*10   p6  p4    3  4-5
    3  R1=R3+R2   p7  p1    5
    4  R2=R2+5    p8  p2    3  4    5
h   5  R3=R4+R2   p9  p5

Map Table
Reg  T+
r1   p7
r2   p8+
r3   p9
r4   p6

Free List: (empty)
CDB: p5, p8

Reservation Stations
#  op         T   T1   T2
1  R3=R4+R2   p9  p6   p8+
2
3  R1=R3+R2   p7  p5+  p2+
4

R10K Cycle #6

ROB
ht  #  Insn       T   Told  S  X    C
h   1  R1=R1+R3   p3  p7
t   2  R4=R4*10   p6  p4    3  4-5  6
    3  R1=R3+R2   p7  p1    5  6
    4  R2=R2+5    p8  p2    3  4    5
    5  R3=R4+R2   p9  p5    6

Map Table
Reg  T+
r1   p3
r2   p8+
r3   p9
r4   p6+

Free List: (empty)
CDB: p6

Reservation Stations
#  op         T   T1   T2
1  R3=R4+R2   p9  p6+  p8+
2  R1=R1+R3   p3  p7   p9
3
4
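The T and Told columns of the ROB tables follow mechanically from the map table and free list. The Python sketch below replays only the renaming step for the six instructions; it does not model issue, execute, or complete timing. The class name `Renamer`, the initial mapping (rN to pN, with p5 through p9 free), and the point at which instruction 1 retires are assumptions inferred from the cycle #1 tables and from the fact that instruction 6 reuses p3.

```python
from collections import deque


class Renamer:
    """Map table plus free list, R10K-style (illustrative sketch only)."""

    def __init__(self, arch_regs: int, phys_regs: int):
        # Assumed initial state: Rn -> pn, remaining physical registers free.
        self.map_table = {f"R{i}": f"p{i}" for i in range(1, arch_regs + 1)}
        self.free_list = deque(
            f"p{i}" for i in range(arch_regs + 1, phys_regs + 1)
        )

    def rename(self, dest: str, sources: list[str]) -> tuple[str, str, list[str]]:
        """Return (T, Told, source tags); immediates pass through unchanged."""
        if not self.free_list:
            raise RuntimeError("no free physical register: dispatch must stall")
        # Sources are looked up before the destination mapping is overwritten.
        src_tags = [self.map_table.get(s, s) for s in sources]
        t_old = self.map_table[dest]
        t_new = self.free_list.popleft()
        self.map_table[dest] = t_new
        return t_new, t_old, src_tags

    def retire(self, t_old: str) -> None:
        """Retirement frees the previous mapping of the destination register."""
        self.free_list.append(t_old)


program = [
    ("R3", ["R1", "R2"]),
    ("R4", ["R4", "10"]),
    ("R1", ["R3", "R2"]),
    ("R2", ["R2", "5"]),
    ("R3", ["R4", "R2"]),
]
renamer = Renamer(arch_regs=4, phys_regs=9)
renamed = [renamer.rename(dest, srcs) for dest, srcs in program]
renamer.retire(renamed[0][1])  # instruction 1 retires and frees its Told
renamed.append(renamer.rename("R1", ["R1", "R3"]))
for number, (t, t_old, srcs) in enumerate(renamed, start=1):
    print(number, t, t_old, srcs)
```

Without the `retire` call, renaming instruction 6 raises the stall error, which mirrors why that instruction cannot dispatch until instruction 1 leaves the ROB.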