University of California, Berkeley
College of Engineering, Computer Science Division (EECS)
Spring 2007, John Kubiatowicz

CS252 Graduate Computer Architecture
Midterm I, March 21st, 2007

Your Name:
SID Number:

Problem   Possible   Score
   1         16
   2         21
   3         19
   4         20
   5         24
 Total      100
[ This page left for π ]
3.141592653589793238462643383279502884197169399375105820974944
Question #1: Short Answer [16 pts]

Problem 1a[2pts]: What is simultaneous multithreading and why is it useful?

Problem 1b[2pts]: What is a data flow architecture? How would it work?

Problem 1c[3pts]: What technological forces have caused Intel, AMD, Sun, and others to start putting multiple processors on a chip?

Problem 1d[2pts]: Name two components of a modern superscalar architecture whose delay scales quadratically with the issue width.
Problem 1e[2pts]: Most branches in a program are highly biased, i.e. they can be predicted by a simple one-level predictor. What can the compiler do to increase the number of branches that are in this category?

Problem 1f[3pts]: What is the difference between implicit and explicit register renaming? How is each implemented?

Problem 1g[2pts]: Why are vector processors more power efficient than superscalar processors when executing applications with a lot of data-level parallelism? Explain.
Problem #2: Superpipelining [21 pts]

Suppose that we have a single-issue, in-order pipeline with one fetch stage, one decode stage, multiple execution stages (which include memory access), and a single write-back stage. Assume that it has the following execution latencies (i.e. the number of stages that it takes to compute a value): multf (4 cycles), addf (3 cycles), divf (6 cycles), integer ops (1 cycle). Assume full bypassing and two cycles to perform memory accesses, i.e. loads and stores take a total of 3 cycles to execute (including address computation). Finally, branch conditions are computed by the first execution stage (integer execution unit).

Problem 2a[10pts]: Assume that this pipeline consists of a single linear sequence of stages in which later stages serve as no-ops for shorter operations. You should do the following on your diagram:
1. Draw each stage of the pipeline as a box and name each of the stages. Stages may have multiple functions, e.g. an execute stage + memory op. You will have a total of 9 stages.
2. Describe what is computed in each stage (e.g. EX1: Integer Ops, Address Compute, First stage of )
3. Show all of the bypass paths (as arrows between stages). Your goal is to design a pipeline which never stalls unless a value is not ready. Label each of these arrows with the types of instructions that will forward their results along these paths (i.e. use M for multf, D for divf, A for addf, I for integer operations, Ld for loads). [Hint: be careful to optimize for information feeding into store instructions!]
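The latency bookkeeping above can be sanity-checked with a small Python sketch (not part of the original exam; all names are hypothetical). It assumes most consumers read their operands at the first execute stage, while a store's data operand, following the hint about optimizing for stores, is assumed not to be needed until one stage later:

```python
# Hypothetical sketch of use-latency arithmetic for the 9-stage pipeline.
# Execution latency (cycles) per producer class; a load's 3 cycles cover
# address computation plus the two memory-access cycles.
EXEC_LAT = {"multf": 4, "addf": 3, "divf": 6, "int": 1, "load": 3}

# Execute-stage offset at which a consumer first needs its operand.
# Plain consumers read at the first execute stage (offset 0); a store's
# data operand is assumed (per the hint) to be needed one stage later.
READ_STAGE = {"default": 0, "store_data": 1}

def independent_insts_needed(producer, consumer_port="default"):
    """Independent instructions required between a producer and a
    dependent consumer so that, with full bypassing, no stall occurs."""
    return max(EXEC_LAT[producer] - 1 - READ_STAGE[consumer_port], 0)
```

Under these assumptions a load feeding a multf needs 2 intervening instructions and a multf feeding an addf needs 3; the separate store-data port is what makes store-bound dependences cheaper.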
Problem 2b[3pts]: How many extra instructions are required between each of these instruction combinations to avoid stalls (i.e. assume that the second instruction uses a value from the first)? Be careful!

Between a divf and a store:
Between a load and a multf:
Between two integer instructions:
Between a multf and an addf:
Between an addf and a divf:
Between an integer op and a store:

Problem 2c[2pts]: How many branch delay slots does this machine have? Explain.

Problem 2d[2pts]: Could branch prediction increase the performance of this pipeline? Why or why not?

Problem 2e[2pts]: In the 5-stage pipeline that we discussed in class, a load into a register followed by an immediate store of that register to memory would not require any stalls, i.e. the following sequence could run without stalls:
    lw r4, 0(r2)
    sw r4, 0(r3)
Explain why this was true for the 5-stage pipeline.

Problem 2f[2pts]: Is this still true for your superpipelined processor? Explain.
Problem 3: Tomasulo Architecture [20 pts]

Problem 3a[5pts]: Consider a Tomasulo architecture with a reorder buffer. This architecture replaces the normal 5 stages of execution with these 5 stages: Fetch, Issue, Execute, Writeback, and Commit. Explain what happens to an instruction in each of them (be as complete as you can):
a) Fetch:
b) Issue:
c) Execute:
d) Writeback:
e) Commit:

Problem 3b[3pts]: Name each of the three types of data hazards and explain how the Tomasulo architecture removes them:

Problem 3c[3pts]: Name three structural hazards that this architecture exhibits. Explain your answer.
Problem 3d[2pts]: Assume that you have a long chain of dependent instructions, such as the following:
    add $r1, $r2, $r3
    add $r3, $r1, $r4
    add $r7, $r3, $r5
Also assume that the integer execution unit takes one cycle for adds. What CPI would you achieve for this sequence with the basic Tomasulo architecture, assuming that each of the stages from (3a) is non-overlapped and takes a complete cycle?

Problem 3e[2pts]: Assume that associative matching on the CDB is a slow enough operation that it takes much of a cycle. How can you still get a throughput of one instruction per cycle for long dependent chains of operations such as those given in (3d)? Only well-thought-out answers will get credit.

Problem 3f[2pts]: The Tomasulo algorithm has one interesting bug in it. Consider the situation in which one instruction uses a value from another. Suppose an instruction is issued on the same cycle that the instruction it depends on is in writeback:
    add $r1, $r2, $r3    <- The result is broadcast...
    add $r4, $r1, $r1    <- This one is being issued
What is the problem? Can you fix it easily?

Problem 3g[3pts]: What changes would you have to make to the basic Tomasulo architecture (with reorder buffer) to enable it to average a CPI of 0.33?
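The cycle counting behind a question like 3d can be sketched in Python (a hypothetical helper, not the exam's expected answer). It assumes Execute and Writeback are non-overlapped single-cycle stages, and that a waiting instruction can capture a result from the CDB during the producer's Writeback cycle; the flag lets you explore the slower variant where it cannot:

```python
def chain_cpi(n, grab_during_wb=True):
    """Cycles per instruction for an n-long dependent add chain on a
    simplified Tomasulo machine.  Execute and Writeback each take one
    full cycle; if grab_during_wb, a waiting instruction snoops the
    CDB during the producer's Writeback and executes the next cycle."""
    wb = 0  # cycle in which the previous result appears on the CDB
    for i in range(n):
        if i == 0:
            ex = 1                                   # first add starts immediately
        else:
            ex = wb + 1 if grab_during_wb else wb + 2
        wb = ex + 1                                  # broadcast one cycle after execute
    return wb / n
```

With CDB snooping during Writeback, dependent results appear every 2 cycles, so the chain settles at a CPI of 2.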
Problem #4: Fixing the loops [21 pts]

For this problem, assume that we have a superpipelined architecture like that in problem (2) with the following use latencies (these are not the right answers for problem #2b!):

Between a multf and an addf:   3 insts
Between a load and a multf:    2 insts
Between an addf and a divf:    1 inst
Between a divf and a store:    6 insts
Between an int op and a store: 0 insts
Number of branch delay slots:  1

Consider the following loop which performs a restricted rotation and projection operation. In this code, F0 and F1 contain sin(θ) and cos(θ) for rotation. The array based at register r1 contains pairs of single-precision (32-bit) values which represent x,y coordinates. The array based at register r2 receives a projected coordinate along the observer's horizontal direction:

project:
    ldf   F3,0(r1)
    multf F10,F3,F0
    ldf   F4,4(r1)
    multf F11,F4,F1
    addf  F12,F10,F11
    divf  F13,F12,F2
    stf   0(r2),F13
    addi  r1,r1,#8
    addi  r2,r2,#4
    subi  r3,r3,#1
    bneq  r3,r0,project
    nop

Problem 4a[2pts]: How many cycles does this loop take per iteration? Indicate stalls in the above code by labeling each of them with a number of cycles of stall:

Problem 4b[4pts]: Reschedule this code to run with as few cycles per iteration as possible. Do not unroll it or software pipeline it. How many cycles do you get per iteration of the loop now?
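The use-latency table above can be turned into a small stall-counting helper for working through schedules like those in 4a and 4b. This is a hypothetical sketch (names invented here), encoding exactly the table from the problem statement:

```python
# Use latencies copied from the problem statement: minimum number of
# independent instructions between a producer and a dependent consumer.
USE_LAT = {
    ("multf", "addf"): 3,
    ("load",  "multf"): 2,
    ("addf",  "divf"): 1,
    ("divf",  "store"): 6,
    ("int",   "store"): 0,
}

def stalls(separation, producer, consumer):
    """Stall cycles incurred when `separation` independent instructions
    sit between the producer and the dependent consumer."""
    return max(USE_LAT[(producer, consumer)] - separation, 0)
```

For example, in the unscheduled loop the divf feeds the stf with no intervening instructions, which this helper scores as 6 stall cycles.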
Problem 4c[6pts]: Unroll the loop once and schedule it to run with as few cycles as possible per iteration of the original loop. How many cycles do you get per iteration now?

Problem 4e[3pts]: Your loop in (4c) will not run without stalls. Without going to the trouble of unrolling further, what is the minimum number of times that you would have to unroll this loop to avoid stalls? How many cycles would you get per iteration then?

Problem 4f[6pts]: Rewrite your code to utilize vector instructions and to run as fast as possible. Assume that the value in r3 is the vector length. Make sure to comment each instruction to say what it is doing. Assume full chaining, one instruction/cycle issue, and delays for instructions/memory that are the same as on the non-vector processor. How long does this code take to execute (you can use the original value of r3 in your expression)?
Problem 4g: [Extra Credit: 5pts] Assume that you have a Tomasulo architecture with functional units of the same execution latency (number of cycles) as our deeply pipelined processor (be careful to adjust use latencies to get the number of execution cycles!). Assume that it issues one instruction per cycle and has an unpipelined divider with a small number of reservation stations. Suppose the other functional units are duplicated with many reservation stations and that there are many CDBs. What is the minimum number of divide reservation stations needed to achieve one instruction per cycle with the optimized code of (4b)? Show your work. [Hint: assume that the maximum issue rate is sustained and look at the scheduling of a single iteration.]
Problem 5: Prediction [24 pts]

In this question, you will examine several different schemes for branch prediction, using the following code sequence for a MIPS-like ISA with no branch delay slots:

        addi r2, r0, #45     ; initialize r2 to 101101, binary
        addi r3, r0, #6      ; initialize r3 to 6, decimal
        addi r4, r0, #10000  ; initialize r4 to a big number
top:    andi r1, r2, #1      ; extract the low-order bit of r2
PC1-->  bnez r1, skip1       ; branch if the bit is set
        xor  r0, r0, r0      ; dummy instruction
skip1:  srli r2, r2, #1      ; shift the pattern in r2
        subi r3, r3, #1      ; decrement r3
PC2-->  bnez r3, skip2
        addi r2, r0, #45     ; reinitialize pattern
        addi r3, r0, #6
skip2:  subi r4, r4, #1      ; decrement loop counter
PC3-->  bnez r4, top

This sequence contains 3 branches, labeled PC1, PC2, and PC3.

Problem 5a[2pts]: Sketch a basic PAg predictor that might be used for prediction. Assume that we will be tracking 3 bits of history.

Problem 5b[2pts]: What is the minimum range of instruction address bits required to address the branch history table for your PAg predictor in order to avoid aliasing between PC1, PC2, and PC3? How many entries does this correspond to?
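As a cross-check for the PAg questions, here is a minimal Python sketch (hypothetical class name, not part of the exam). It assumes a per-branch 3-bit history table indexing a single shared pattern history table of 2-bit saturating counters, initialized to zero as the problem specifies:

```python
class PAgPredictor:
    """Minimal PAg sketch: per-address 3-bit branch history, one shared
    pattern history table (PHT) of 2-bit saturating counters."""

    def __init__(self, hist_bits=3):
        self.hist_bits = hist_bits
        self.bht = {}                       # pc -> history register bits
        self.pht = [0] * (1 << hist_bits)   # 2-bit counters, init to 0

    def predict(self, pc):
        hist = self.bht.get(pc, 0)
        return self.pht[hist] >= 2          # counter of 2 or 3 predicts taken

    def update(self, pc, taken):
        hist = self.bht.get(pc, 0)
        c = self.pht[hist]
        self.pht[hist] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.hist_bits) - 1
        self.bht[pc] = ((hist << 1) | int(taken)) & mask  # shift in outcome
```

Feeding each PC's taken/not-taken pattern through an instance of this class reproduces the per-counter table the hint in 5c asks you to draw.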
Problem 5c[6pts]: The following are the steady-state taken/not-taken patterns for each of the three branches (T indicates taken, N indicates not taken):

PC1: TTNTNT TTNTNT ...
PC2: NTNTNT NTNTNT ...
PC3: TTTTTN TTTTTN ...

Using the PAg predictor of 5a and assuming no aliasing (i.e. a correct answer to 5b), what is the steady-state prediction success rate (that is, the ratio of correctly predicted branches to total branches) for each branch? Assume that all 2-bit predictors are initialized to zero. Hint: Draw a table representing the values (T or N) fed to each entry of the pattern history table. After you get a repeating-pattern stream for each predictor, you should be able to tell how each 2-bit counter will predict:
Problem 5d[2pts]: Can you make a simple argument for why a version of PAg with 6 bits of history would have 100% prediction accuracy for this set of branch patterns?

Problem 5e[4pts]: Draw the following global predictors: GAg, GShare, GAs. What is the reason for using a GShare or GAs predictor instead of a GAg predictor?
Problem 5f[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period?

    1 4 7 10 13 16 19 22

Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?

Problem 5g[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period?

    1 3 3 7 10 1 3 3 7 10 1 3 3 7 10

Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?
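For a sequence like the first one, a natural candidate is a stride-based value predictor. A minimal sketch (hypothetical names; it assumes a single last-value register plus a stride register, the hardware analogue of two latches and a subtractor):

```python
class StridePredictor:
    """Last-value-plus-stride sketch: predicts the previous value plus
    the most recently observed difference between consecutive values."""

    def __init__(self):
        self.last = None   # last observed value (one hardware register)
        self.stride = 0    # last observed difference (second register)

    def predict(self):
        return None if self.last is None else self.last + self.stride

    def observe(self, value):
        if self.last is not None:
            self.stride = value - self.last  # subtractor output
        self.last = value
```

On an arithmetic sequence such as 1, 4, 7, 10, ... this sketch needs to see two values before every subsequent prediction is correct.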