Midterm I SOLUTIONS March 21st, 2007 CS252 Graduate Computer Architecture


University of California, Berkeley
College of Engineering
Computer Science Division EECS
Spring 2007
John Kubiatowicz

Midterm I SOLUTIONS
March 21st, 2007
CS252 Graduate Computer Architecture

Your Name:
SID Number:

Problem    Possible    Score
Total      100

[ This page left for π ]

Question #1: Short Answer [16 pts]

Problem 1a[2pts]: What is simultaneous multithreading and why is it useful?

Simultaneous multithreading is a technique that adds multiple threads to a multi-issue, out-of-order processor. Since the instructions of these threads can be interleaved in an arbitrary fashion (and are thus running simultaneously), it is called simultaneous multithreading. This technique is useful because it can utilize otherwise idle issue slots in a multi-issue processor.

Problem 1b[2pts]: What is a data flow architecture? How would it work?

A data flow architecture is one which attempts to exploit the maximum parallelism available in an algorithm. It provides a hardware execution of the dataflow graph. Such architectures typically work by placing operations (instructions) in some sort of physical store such that they are triggered immediately when their operands are available. Completion of a given operation generates data which flows to the operand inputs of dependent operations, thus triggering their execution, etc.

Problem 1c[3pts]: What technological forces have caused Intel, AMD, Sun, and others to start putting multiple processors on a chip?

Power consumption, limited instruction-level parallelism available in typical applications, memory access time (the "memory wall"), and other issues have caused attempts to improve single-thread performance to stall. The cost of improving single-thread performance reached a point where it wasn't worth the resulting gain. Chip manufacturers decided to back off from improving individual single-thread performance and instead start making multicore chips.

Problem 1d[2pts]: Name two components of a modern superscalar architecture whose delay scales quadratically with the issue-width.

Many things scale quadratically with issue width. For instance (a rough count for the forwarding network is sketched below):
1. The delay in the forwarding network scales quadratically with issue-width.
2. The instruction-issue (wakeup) logic scales quadratically with issue-width.
3. Register rename logic scales quadratically with issue-width.
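As a rough illustrative count (assuming S execute stages whose results may still need forwarding), the quadratic growth of the forwarding network can be seen by counting bypass paths: each of the W instructions entering execute must be able to select either of its operands from any in-flight result,

$$\text{bypass paths} \;\approx\; \underbrace{W}_{\text{consumers/cycle}} \times \underbrace{2}_{\text{operands}} \times \underbrace{S\,W}_{\text{in-flight producers}} \;=\; 2\,S\,W^{2},$$

so both the operand-multiplexer fan-in and the result-bus wiring grow quadratically with the issue width W.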

Problem 1e[2pts]: Most branches in a program are highly biased, i.e. they can be predicted by a simple one-level predictor. What can the compiler do to improve the number of branches that are in this category?

By factoring the code so that it duplicates branches that depend on other branches (called "node splitting"). By copying these branches into multiple instances, one for each path through the code, the new branches can become highly biased.

Problem 1f[3pts]: What is the difference between implicit and explicit register renaming? How are they implemented?

Implicit register renaming is what the original Tomasulo algorithm depended on to eliminate WAW and WAR hazards. When issuing an instruction, it replaced registers by either (1) a value or (2) the name of a slot in a reservation station. Thus, the renaming was implicit, since it merely replaced the register names as part of the scheduling algorithm. Explicit renaming, on the other hand, renames user-visible register names with physical register names before the instruction issue stage. The explicit renaming technique relies on a mapping-table/free-list mechanism to allocate registers.

Problem 1g[2pts]: Why are vector processors more power efficient than superscalar processors when executing applications with a lot of data-level parallelism? Explain.

If an application has lots of data-level parallelism, then this can be expressed directly with vector instructions. The vector processor can then spend power executing the actual data operations in parallel, with very little instruction overhead. Trying to execute the same algorithm with a superscalar processor wastes a lot of power extracting the parallelism: branch prediction, large instruction windows, complex issue logic, etc. are all required just to get multiple iterations executing in parallel.
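To make the mapping-table/free-list mechanism of explicit renaming concrete, here is a minimal Python sketch (register counts and names are arbitrary; freeing of physical registers at commit is omitted):

from collections import deque

NUM_ARCH, NUM_PHYS = 8, 16

map_table = {f"r{i}": f"p{i}" for i in range(NUM_ARCH)}        # arch reg -> phys reg
free_list = deque(f"p{i}" for i in range(NUM_ARCH, NUM_PHYS))  # unused physical regs

def rename(dest, src1, src2):
    """Rename one instruction before the issue stage."""
    ps1, ps2 = map_table[src1], map_table[src2]   # sources read the current mappings
    pd = free_list.popleft()                      # destination gets a fresh physical reg
    map_table[dest] = pd                          # later readers of dest will see pd
    return pd, ps1, ps2

# add r1,r2,r3 ; add r3,r1,r4 ; add r3,r2,r2 -- the two writes to r3 receive
# different physical registers, so the WAW hazard disappears by construction.
for instr in [("r1", "r2", "r3"), ("r3", "r1", "r4"), ("r3", "r2", "r2")]:
    print(instr, "->", rename(*instr))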

Problem #2: Superpipelining [21 pts]

Suppose that we have a single-issue, in-order pipeline with one fetch stage, one decode stage, multiple execution stages (which include memory access) and a single write-back stage. Assume that it has the following execution latencies (i.e. the number of stages that it takes to compute a value): multf (4 cycles), addf (3 cycles), divf (6 cycles), integer ops (1 cycle). Assume full bypassing and two cycles to perform memory accesses, i.e. loads and stores take a total of 3 cycles to execute (including address computation). Finally, branch conditions are computed by the first execution stage (integer execution unit).

Problem 2a[10pts]: Assume that this pipeline consists of a single linear sequence of stages in which later stages serve as no-ops for shorter operations. You should do the following on your diagram:
1. Draw each stage of the pipeline as a box and name each of the stages. Stages may have multiple functions: i.e. an execute stage + memory op. You will have a total of 9 stages.
2. Describe what is computed in each stage (e.g. EX1: Integer Ops, Address Compute, First stage of ...).
3. Show all of the bypass paths (as arrows between stages). Your goal is to design a pipeline which never stalls unless a value is not ready. Label each of these arrows with the types of instructions that will forward their results along these paths (i.e. use M for multf, D for divf, A for addf, I for integer operations, Ld for loads). [Hint: be careful to optimize for information feeding into store instructions!]

Answer -- stage functions (the original diagram shows the nine boxes F, D, EX1-EX6, W with bypass arrows labeled I,D,Ld,A,M / I,Ld,A,M / I,Ld,A,M / I,Ld,A / I / I):
F: Fetch next instruction
D: Decode instruction
EX1: Integer ops; address compute (for LD/ST); first stage of addf, multf, divf
EX2: First memory stage of LD/ST; second stage of addf, multf, divf
EX3: Last memory stage of LD/ST; last stage of addf; third stage of multf, divf
EX4: Last stage of multf; fourth stage of divf
EX5: Fifth stage of divf
EX6: Sixth stage of divf
W: Writeback stage
Notes on feedback paths: each result feeds from the beginning of the first stage after its completion back to the beginning of EX1; for data feeding stores, results also feed to the beginning of EX2.

Problem 2b[3pts]: How many extra instructions are required between each of these instruction combinations to avoid stalls (i.e. assume that the second instruction uses a value from the first). Be careful!

Between a divf and a store: 4 insts
Between a load and a multf: 2 insts
Between two integer instructions: 0 insts
Between a multf and an addf: 3 insts
Between an addf and a divf: 2 insts
Between an integer op and a store: 0 insts

Problem 2c[2pts]: How many branch delay slots does this machine have? Explain.

This machine has 2 branch delay slots. This can be seen simply by considering the fact that the result of a branch comparison must somehow affect (forward in time!) a fetch:

F  D  E1 E2 E3 E4 E5 E6 W
   F  D  E1 E2 E3 E4 E5 E6 W
      F  D  E1 E2 E3 E4 E5 E6 W
         F  D  E1 E2 E3 E4 E5 E6 W

The branch condition is known at the end of the branch's E1 stage, so the earliest fetch it can redirect is the one occurring three cycles after the branch's own fetch. As a result, there must be 2 intervening instructions between the branch and the first instruction fetched from the branch target.

Problem 2d[2pts]: Could branch prediction increase the performance of this pipeline? Why or why not?

Yes. Without some form of prediction, we must expose the 2 delay slots (part c) to the compiler. It is hard to do a good job of filling two delay slots with useful instructions. Thus, we could remove the delay slots from the ISA and use branch prediction to predict the instructions immediately after a branch. It is possible that a good branch prediction scheme could do a better job of finding two good instructions for these delay slots than the static compiler. Note, however, that this is very application dependent.

Problem 2e[2pts]: In the 5-stage pipeline that we discussed in class, a load into a register followed by an immediate store of that register to memory would not require any stalls, i.e. the following sequence could run without stalls:

lw r4, 0(r2)
sw r4, 0(r3)

Explain why this was true for the 5-stage pipeline.

This is true because we could forward the load data from the end of the memory stage to the beginning of the memory stage for the store.

Problem 2f[2pts]: Is this still true for your superpipelined processor? Explain.

No. As we can see from (2a), load results become available after EX3 and are forwarded to the beginning of EX1, or to the beginning of EX2 for store data. This means that there must be one instruction between a load and a subsequent store.
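The table in 2b can be cross-checked with a small Python helper (the encoding below is an assumption consistent with 2a: a result becomes forwardable the cycle after the producer's last execute stage, ordinary consumers need their operands at EX1, and stores need their store data only at EX2):

EXEC_STAGES = {"int": 1, "load": 3, "addf": 3, "multf": 4, "divf": 6}
USE_STAGE   = {"default": 1, "store_data": 2}

def insts_between(producer, consumer_use="default"):
    # instructions needed between producer and consumer to avoid a stall
    return max(0, EXEC_STAGES[producer] - USE_STAGE[consumer_use])

print(insts_between("divf", "store_data"))   # 4  (divf feeding a store)
print(insts_between("load"))                 # 2  (load feeding a multf)
print(insts_between("int"))                  # 0  (two integer instructions)
print(insts_between("multf"))                # 3  (multf feeding an addf)
print(insts_between("addf"))                 # 2  (addf feeding a divf)
print(insts_between("int", "store_data"))    # 0  (integer op feeding a store)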

Problem 3: Tomasulo Architecture [20 pts]

Problem 3a[5pts]: Consider a Tomasulo architecture with a reorder buffer. This architecture replaces the normal 5 stages of execution with 5 stages: Fetch, Issue, Execute, Writeback, and Commit. Explain what happens to an instruction in each of them (be as complete as you can):

a) Fetch: Fetch instructions from memory. Usually contains branch-prediction logic.
b) Issue: Get the instruction from the operation queue. If a reservation station is free and a reorder buffer slot is free, send the instruction to the reservation station and reorder buffer after renaming registers (to either values or tags).
c) Execute: When both operands are ready, execute; if not ready, watch the CDB for the result.
d) Writeback: Finish execution. Write the result on the Common Data Bus, using the reorder buffer number as the tag for the result. The result will be written to the reorder buffer slot. Mark the reservation station as free.
e) Commit: Update the actual register file with results from the head of the reorder buffer. Only completed instructions at the head of the reorder buffer will be committed. After commit, the reorder buffer slot is freed.

Problem 3b[3pts]: Name each of the three types of data hazards and explain how the Tomasulo architecture removes them:

RAW: Read-after-Write hazards. These are real data dependencies. Tomasulo removes them by tracking the dependencies during the issue stage, such that data-dependent instructions receive the tag of their source instructions. Results are broadcast on the CDB to instructions that need them, thereby removing the RAW hazard.

WAR: Write-after-Read hazards. Write-after-read hazards are removed because values from the register file are copied to the reservation stations on issue. Thus, there is no possibility for later instructions to overwrite operands read by earlier instructions.

WAW: Write-after-Write hazards. The register renaming mechanism prevents WAW hazards because the register file gets a program-order sequence of tags written to it. If two instructions that write to the same register are in the pipeline simultaneously, then only the tag of the later one is kept in the register file; hence there is no chance for an earlier value to persist over a later value.

Problem 3c[3pts]: Name three structural hazards that this architecture exhibits. Explain your answer.

Structural hazards on the reorder buffer, reservation stations, and the CDB. It cannot issue an instruction if there is no slot available in the reorder buffer and/or no slot available in an appropriate reservation station. Further, two instructions cannot do writeback at the same time if there is only one CDB.
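A toy Python sketch of the value-or-tag bookkeeping behind stages (b)-(e) above (an illustration under simplified assumptions: no functional-unit timing, a single CDB, and made-up structure names):

regs = {"r2": 10, "r3": 20, "r4": 5}      # register file entries hold a value or a tag
rob, stations = [], []                    # reorder buffer and reservation stations

def issue(op, dest, src1, src2):
    tag = f"ROB{len(rob)}"
    rob.append({"tag": tag, "dest": dest, "value": None, "done": False})
    # operands are renamed to values (if ready) or to the producing instruction's tag
    stations.append({"op": op, "tag": tag, "src1": regs[src1], "src2": regs[src2]})
    regs[dest] = tag                      # later readers will wait on this tag

def writeback(tag, value):
    for rs in stations:                   # CDB broadcast: waiting stations grab the value
        if rs["src1"] == tag: rs["src1"] = value
        if rs["src2"] == tag: rs["src2"] = value
    for entry in rob:                     # the result also lands in the ROB slot
        if entry["tag"] == tag:
            entry["value"], entry["done"] = value, True

def commit():
    if rob and rob[0]["done"]:            # in-order commit from the ROB head
        entry = rob.pop(0)
        if regs[entry["dest"]] == entry["tag"]:    # skip if a later write renamed dest
            regs[entry["dest"]] = entry["value"]

issue("add", "r1", "r2", "r3")            # r1 <- r2 + r3
issue("add", "r3", "r1", "r4")            # r3 <- r1 + r4 : src1 carries the tag ROB0
writeback("ROB0", 30)
commit()
print(regs)                               # r1 committed as 30; r3 still holds tag ROB1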

Problem 3d[2pts]: Assume that you have a long chain of dependent instructions, such as the following:

add $r1, $r2, $r3
add $r3, $r1, $r4
add $r7, $r3, $r5

Also assume that the integer execution unit takes one cycle for adds. What CPI would you achieve for this sequence with the basic Tomasulo architecture, assuming that each of the stages from (3a) are non-overlapped and take a complete cycle?

This would achieve a CPI of 2. Reason: assuming sufficient reservation space, each instruction would execute (one cycle), then broadcast its result (one cycle). Assuming that the writeback takes a complete cycle, there would be no overlap of the execute/writeback of subsequent instructions.

Problem 3e[2pts]: Assume that associative matching on the CDB is a slow enough operation that it takes much of a cycle. How can you still get a throughput of one instruction per cycle for long dependent chains of operations such as given in (3d)? Only well-thought-out answers will get credit.

Separate the CDB into a two-cycle operation: the first cycle for matching, the second cycle for data transmission. Then, during the cycle in which an execution was occurring, the functional unit could send out the promise of the value to be transmitted on the following cycle. This would permit the reservation station to recognize when it could execute on the following cycle, thereby setting up the operation to occur. The CDB data could be sent directly to the appropriate input from the functional unit on the next cycle. Note: you have to assume that the time to transmit the value is short enough to permit both transmission and execution in a single cycle.

Problem 3f[2pts]: The Tomasulo algorithm has one interesting bug in it. Consider the situation where one instruction uses a value from another one. Suppose the dependent instruction is issued on the same cycle that the one it depends on is in writeback:

add $r1, $r2, $r3    <- the result is being broadcast
add $r4, $r1, $r1    <- this one is being issued

What is the problem? Can you fix it easily?

The problem is that, at the beginning of the cycle, the issue logic looks in the register file and decides that the value isn't ready, thereby sending the dependent instruction to the reservation station with a tag. Meanwhile, by the end of the cycle, the register file is overwritten with an actual value. Now, there is a tag in the reservation station and a value in the register file, and the new instruction never executes. Fix: do the writeback in the first half of the cycle and look up the register file in the second half of the cycle (like the 5-stage pipeline).

Problem 3g[3pts]: Which changes would you have to make to the basic Tomasulo architecture (with reorder buffer) to enable it to average a CPI = 0.33?

Need register-rename logic that can look at three instructions at the same time. Need a register file that can handle 6 reads and 3 writes at the same time. Need three CDBs and appropriate logic to arbitrate for them. Need a reorder buffer that can accept three new instructions per cycle, take three writebacks per cycle, and commit three instructions per cycle. Also, need enough actual reservation stations and execution units so that you can have an average of three instructions running per cycle.
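A tiny Python illustration of the CPI = 2 result in (3d), under the stated assumption that Execute and Writeback are separate, non-overlapped, full-cycle stages:

exec_cycle = 0                           # the first add executes in cycle 0
for i in range(1, 5):                    # four more dependent adds
    writeback = exec_cycle + 1           # result broadcast one cycle after execute
    exec_cycle = writeback + 1           # dependent add executes the cycle after that
    print(f"dependent add #{i} executes in cycle {exec_cycle}")
# One execution completes every 2 cycles, giving a steady-state CPI of 2.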

Problem #4: Fixing the loops [21 pts]

For this problem, assume that we have a superpipelined architecture like that in problem (2) with the following use latencies (these are not the right answers for problem #2b!):

Between a multf and an addf: 3 insts
Between a load and a multf: 2 insts
Between an addf and a divf: 1 inst
Between a divf and a store: 6 insts
Between an int op and a store: 0 insts
Number of branch delay slots: 1

Consider the following loop which performs a restricted rotation and projection operation. In this code, F0 and F1 contain sin(θ) and cos(θ) for rotation. The array based at register r1 contains pairs of single-precision (32-bit) values which represent x,y coordinates. The array based at register r2 receives a projected coordinate along the observer's horizontal direction:

project: ldf   F3,0(r1)
         multf F10,F3,F0       ; 2 stall cycles
         ldf   F4,4(r1)
         multf F11,F4,F1       ; 2 stall cycles
         addf  F12,F10,F11     ; 3 stall cycles
         divf  F13,F12,F2      ; 1 stall cycle
         stf   0(r2),F13       ; 6 stall cycles
         addi  r1,r1,#8
         addi  r2,r2,#4
         subi  r3,r3,#1
         bneq  r3, r0, project
         nop
                               Total: 14 stall cycles

Problem 4a[2pts]: How many cycles does this loop take per iteration? Indicate stalls in the above code by labeling each of them with a number of cycles of stall.

This takes a total of 12 + 14 = 26 cycles/iteration (the stalls are marked above; each occurs before the labeled instruction).

Problem 4b[4pts]: Reschedule this code to run with as few cycles per iteration as possible. Do not unroll it or software pipeline it. How many cycles do you get per iteration of the loop now?

project: ldf   F3,0(r1)
         ldf   F4,4(r1)
         addi  r1,r1,#8
         multf F10,F3,F0
         multf F11,F4,F1
         addi  r2,r2,#4
         subi  r3,r3,#1
         addf  F12,F10,F11     ; 1 stall cycle
         divf  F13,F12,F2      ; 1 stall cycle
         bneq  r3, r0, project
         stf   -4(r2),F13      ; 5 stall cycles

7 stall cycles, 11 instructions: 18 cycles/iteration.
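The stall counts above can be checked with a short Python scheduling sketch (not the exam's required method; the instruction classes below and the zero use latency assumed for integer-to-integer and integer-to-branch dependences are assumptions):

USE_LATENCY = {                 # (producer class, consumer class) -> insts required between
    ("ldf", "multf"): 2,
    ("multf", "addf"): 3,
    ("addf", "divf"): 1,
    ("divf", "stf"): 6,
    ("int", "stf"): 0,
}

def schedule(instrs):
    """instrs: list of (text, class, [indices of producers]); returns issue cycles."""
    issue = []
    for text, cls, deps in instrs:
        cycle = issue[-1] + 1 if issue else 0
        for d in deps:
            gap = USE_LATENCY.get((instrs[d][1], cls), 0)
            cycle = max(cycle, issue[d] + gap + 1)
        issue.append(cycle)
    return issue

# Original loop body of problem 4a (branch delay slot filled with the nop).
loop = [
    ("ldf F3,0(r1)",       "ldf",   []),
    ("multf F10,F3,F0",    "multf", [0]),
    ("ldf F4,4(r1)",       "ldf",   []),
    ("multf F11,F4,F1",    "multf", [2]),
    ("addf F12,F10,F11",   "addf",  [1, 3]),
    ("divf F13,F12,F2",    "divf",  [4]),
    ("stf 0(r2),F13",      "stf",   [5]),
    ("addi r1,r1,#8",      "int",   []),
    ("addi r2,r2,#4",      "int",   []),
    ("subi r3,r3,#1",      "int",   []),
    ("bneq r3,r0,project", "int",   [9]),
    ("nop",                "int",   []),
]

cycles = schedule(loop)
total = cycles[-1] + 1
print(total, "cycles/iteration;", total - len(loop), "stall cycles")   # 26; 14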

Problem 4c[6pts]: Unroll the loop once and schedule it to run with as few cycles as possible per iteration of the original loop. How many cycles do you get per iteration now?

project: ldf   F3,0(r1)
         ldf   F4,4(r1)
         ldf   F5,8(r1)
         multf F10,F3,F0
         multf F11,F4,F1
         ldf   F6,12(r1)
         multf F12,F5,F0
         multf F13,F6,F1       ; 1 stall cycle
         addf  F14,F10,F11
         divf  F15,F14,F2      ; 1 stall cycle
         addf  F16,F12,F13
         divf  F17,F16,F2      ; 1 stall cycle
         addi  r1,r1,#16
         addi  r2,r2,#8
         subi  r3,r3,#2
         stf   -8(r2),F15
         bneq  r3, r0, project
         stf   -4(r2),F17      ; 1 stall cycle

So: total number of stalls = 4. Number of instructions = (7 x 2) + 4 = 18. Thus, we have 22/2 = 11 cycles/iteration. Note that the 3rd stall cycle adds an extra cycle between the first divide and the first store; hence there is no stall before the first store.

Problem 4e[3pts]: Your loop in (4c) will not run without stalls. Without going to the trouble to unroll further, what is the minimum number of times that you would have to unroll this loop to avoid stalls? How many cycles would you get per iteration then?

With 4 or more iterations, we can put all the loads together, all the multiplies together, etc. without any stalls until the stores. Then the tail of the schedule is D^n I^3 S^(n-1) B S (where D = divf, I = int op, S = store, and B = branch). To avoid a stall, we look between the first D and the first S, where there are (n-1)+3 = n+2 instructions; we need n+2 >= 6, so n = 4 iterations. So, we want 4 iterations. Cycles/iteration = [ (7 x 4) + 4 ] / 4 = 8 cycles/iteration.

Problem 4f[6pts]: Rewrite your code to utilize vector instructions and to run as fast as possible. Assume that the value in r3 is the vector length. Make sure to comment each instruction to say what it is doing. Assuming full chaining, one instruction/cycle issue, and delays for instructions/memory that are the same as the non-vector processor, how long does this code take to execute (you can use the original value of r3 in your expression)?

project: SVL   r3        ; Set vector length to r3
         LVF   V0,r1,8   ; Load single-float V0 from addr r1, stride 8
         addi  r1,r1,4   ; Increment base addr by 4
         LVF   V1,r1,8   ; Load single-float V1 from addr r1, stride 8
         MULTF V2,V0,F0  ; Multiply V0 by constant F0 -> V2
         MULTF V3,V1,F1  ; Multiply V1 by constant F1 -> V3
         ADDF  V4,V2,V3  ; Add vectors V2 and V3 -> V4
         DIVF  V5,V4,F2  ; Divide V4 by constant F2 -> V5
         SVF   V5,r2,4   ; Store single-float V5 at addr r2, stride 4

After the execution gets going, data flows through the functional units at full speed (assuming for now 1 lane). After the first stall (before the second MULTF), issue gets ahead of execution and doesn't impact time. So: 6 cycles up to that MULTF, then 4 cycles for MULTF, 2 cycles ADDF, 8 cycles DIVF, and 2 cycles SVF until the first value. Remaining values complete 1 per cycle. Time = 6 + 4 + 2 + 8 + 2 + (r3 - 1) = 21 + r3 cycles.
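Spelling the 4f timing out as a single expression (same numbers as above; in general, with full chaining the execution time is the startup latency through the chained units plus one cycle per remaining element):

$$T \;=\; \underbrace{6}_{\text{issue up to 2nd MULTF}} + \underbrace{4}_{\text{MULTF}} + \underbrace{2}_{\text{ADDF}} + \underbrace{8}_{\text{DIVF}} + \underbrace{2}_{\text{SVF}} + \underbrace{(r3 - 1)}_{\text{remaining elements}} \;=\; 21 + r3 .$$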

Problem 4g: [Extra Credit: 5pts] Assume that you have a Tomasulo architecture with functional units of the same execution latency (number of cycles) as our deeply pipelined processor (be careful to adjust the use latencies to get the number of execution cycles!). Assume that it issues one instruction per cycle and has an unpipelined divider with a small number of reservation stations. Suppose the other functional units are duplicated with many reservation stations and that there are many CDBs. What is the minimum number of divide reservation stations needed to achieve one instruction per cycle with the optimized code of (4b)? Show your work. [Hint: assume that the maximum issue rate is sustained and look at the scheduling of a single iteration.]

Answer: The best way to understand this is to actually look at the timing of the issue slots. First, we take the use latencies from the beginning of this problem to extract the execution latencies (number of execution stages) for the different operations:

Load: 3 cycles, Add: 2 cycles, Multiply: 4 cycles, Divide: 8 cycles (careful here!)

Next, we show the timing of two iterations, assuming that the WB (broadcast) of one operation and the scheduling of the next can occur in the same cycle. (The table lists, for each instruction of the 4b loop across the iterations -- ldf, ldf, addi, multf, multf, addi, subi, addf, divf, bne, stf, and so on -- its issue cycle, start-of-execution cycle, end-of-execution cycle, and writeback cycle.)

Looking at this table, we only need 2 reservation stations: one that is running, and one waiting.

Problem 5: Prediction [24 pts]

In this question, you will examine several different schemes for branch prediction, using the following code sequence for a MIPS-like ISA with no branch delay slot:

        addi r2, r0, #45      ; initialize r2 to 101101, binary
        addi r3, r0, #6       ; initialize r3 to 6, decimal
        addi r4, r0, #10000   ; initialize r4 to a big number
top:    andi r1, r2, #1       ; extract the low-order bit of r2
PC1-->  bnez r1, skip1        ; branch if the bit is set
        xor  r0, r0, r0       ; dummy instruction
skip1:  srli r2, r2, #1       ; shift the pattern in r2
        subi r3, r3, #1       ; decrement r3
PC2-->  bnez r3, skip2
        addi r2, r0, #45      ; reinitialize pattern
        addi r3, r0, #6
skip2:  subi r4, r4, #1       ; decrement loop counter
PC3-->  bnez r4, top

This sequence contains 3 branches, labeled by PC1, PC2, and PC3.

Problem 5a[2pts]: Sketch a basic PAg predictor that might be used for prediction. Assume that we will be tracking 3 bits of history.

(Diagram: selected instruction-address bits index a per-address branch history register table (PABHR); the selected 3-bit history then indexes a single global pattern history table (GPHT) of 2-bit counters, which supplies the prediction.)

Problem 5b[2pts]: What is the minimum range of instruction address bits required to address the branch history table for your PAg predictor in order to avoid aliasing between PC1, PC2, and PC3? How many entries does this correspond to?

Notice that the branches at PC1, PC2, and PC3 are separated by 4 instructions each. Assuming that instructions are 4 bytes (2 address bits), this means that we want instruction bits 5:4, since PC1, PC2, and PC3 are distinguished in these bits. Hence, our PABHR needs 4 entries. Note that you will still have some aliasing through the pattern history table regardless.
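A quick numeric check of the 5b answer (the byte addresses below are illustrative assumptions: 4-byte instructions with the branch at PC1 placed at address 0x04, so PC2 and PC3 fall 16 bytes apart at 0x14 and 0x24):

for name, pc in [("PC1", 0x04), ("PC2", 0x14), ("PC3", 0x24)]:
    print(name, format((pc >> 4) & 0b11, "02b"))   # bits 5:4 -> 00, 01, 10 (all distinct)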

Problem 5c[6pts]: The following are the steady-state taken/not-taken patterns for each of the three branches (T indicates taken, N indicates not taken):

PC1: TTNTNT TTNTNT...
PC2: NTNTNT NTNTNT...
PC3: TTTTTN TTTTTN...

Using the PAg predictor of 5a and assuming no aliasing (i.e. a correct answer to 5b), what is the steady-state prediction success rate (that is, the ratio of correctly predicted branches to total branches) for each branch? Assume that all 2-bit predictors are initialized to zero. Hint: Draw a table representing the values (T or N) fed to each entry of the pattern history table. After you get a repeating-pattern stream for each predictor, you should be able to know how each 2-bit counter will predict.

The trick here is to take groups of three branch outcomes together and look at the follow-on branch. For instance, we take the first three outcomes of each branch, assuming that we always take PC1, PC2, PC3 in that order as our pattern (since I didn't tell you otherwise). We see: TTN -> T for PC1, NTN -> T for PC2, and TTT -> T for PC3. For each of these results, a T is added to the corresponding line of the table below, labeled with a subscript indicating which branch it belongs to. (In the original table, a first rectangle marks this first set of follow-ons; a second rectangle marks the second set of branch values: TNT -> N, TNT -> N, TTT -> T.)

Pattern   Follow-on instances    Prediction
NNN       -                      ?
NNT       -                      ?
NTN       T2 T1 T2 T2            T
NTT       T1 T3                  T
TNN       -                      ?
TNT       N1 N2 T1 N2 T3 N2      N
TTN       T1 T3                  T
TTT       T3 T3 N3 N1            N N T N (see below)

Note that we've continued putting in instances until there is a repeating pattern. The only ambiguous prediction is for TTT; for the others, any initialization of the counters will be washed out after enough iterations. For TTT, we have utilized the initialization condition here: TTT starts at 00. First prediction: N; the T takes us to 01. Next prediction: N; the T takes us to 10. Next prediction: T; the N takes us to 01. Next prediction: N; the final N takes us back to 00, and the pattern repeats. Note that only the last TTT in the pattern is a correct prediction!

Now, we just work our way through the branches and get our success rate:
PC1: 5/6 = 83.3%, PC2: 6/6 = 100%, PC3: 2/6 = 33.3%
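For experimentation, a small PAg simulation sketch in Python (an illustration, not the graded derivation: per-branch 3-bit histories, one shared 8-entry table of 2-bit counters initialized to zero, and the three branches processed in program order; its steady-state rates can be compared against the hand analysis above):

PATTERNS = {"PC1": "TTNTNT", "PC2": "NTNTNT", "PC3": "TTTTTN"}
HIST_BITS = 3

pht = [0] * (1 << HIST_BITS)              # shared 2-bit counters, all start at 00
hist = {pc: 0 for pc in PATTERNS}         # per-address branch history registers
correct = {pc: 0 for pc in PATTERNS}
total = {pc: 0 for pc in PATTERNS}

WARMUP, MEASURE = 600, 600                # count accuracy only after a warm-up period
for i in range(WARMUP + MEASURE):
    for pc, pattern in PATTERNS.items():  # program order: PC1, PC2, PC3
        outcome = pattern[i % len(pattern)] == "T"
        idx = hist[pc]
        prediction = pht[idx] >= 2        # counter MSB supplies the prediction
        if i >= WARMUP:
            total[pc] += 1
            correct[pc] += (prediction == outcome)
        pht[idx] = min(3, pht[idx] + 1) if outcome else max(0, pht[idx] - 1)
        hist[pc] = ((hist[pc] << 1) | outcome) & ((1 << HIST_BITS) - 1)

for pc in PATTERNS:
    print(pc, f"{correct[pc] / total[pc]:.1%}")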

Problem 5d[2pts]: Can you make a simple argument why a version of PAg with 6 bits of history will have 100% prediction accuracy for this set of branch patterns?

Very simple answer: since every pattern repeats after 6 bits, we know that each 6-bit history pattern has only one possible outcome: the branch outcome at the beginning of the pattern. For instance, for a pattern B1 B2 B3 B4 B5 B6 the outcome is B1. Consequently, there is no ambiguity in the outcome, i.e. the particular 2-bit counter associated with B1 B2 B3 B4 B5 B6 will easily converge to predict B1. (There can't be a different outcome, because this would be a different 6-bit pattern and thus a different 2-bit counter.)

Problem 5e[4pts]: Draw the following global predictors: GAg, GShare, GAs. What is the reason for using a GShare or GAs predictor instead of a GAg predictor?

(Diagrams: GAg uses a single global branch history register (GBHR) to index one global pattern history table (GPHT); GAs uses the GBHR together with branch-address bits to select among per-address pattern history tables (PAPHT); GShare XORs the GBHR with branch-address bits to index a single GPHT.)

The reason to use a GShare or GAs predictor instead of a GAg predictor is to lessen the effects of aliasing in the PHT.
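A sketch of the index formation for the three schemes (the bit widths and the choice of address bits are arbitrary illustrations):

HIST_BITS, ADDR_BITS = 4, 2
MASK = (1 << HIST_BITS) - 1

def gag_index(ghr, pc):
    return ghr & MASK                                   # history only

def gas_index(ghr, pc):
    addr = (pc >> 4) & ((1 << ADDR_BITS) - 1)           # e.g. address bits 5:4, as in 5b
    return (addr << HIST_BITS) | (ghr & MASK)           # selects a per-address PHT

def gshare_index(ghr, pc):
    return (ghr ^ (pc >> 2)) & MASK                     # hash the history with the address

ghr = 0b1010
for pc in (0x04, 0x14, 0x24):        # three branches seeing the same global history
    print(hex(pc), gag_index(ghr, pc), gas_index(ghr, pc), gshare_index(ghr, pc))
# GAg maps all three to the same counter; GAs and GShare give them different counters,
# which is exactly the aliasing reduction mentioned above.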

Problem 5f[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period: Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?

Answer: a strided predictor of some sort will predict this properly. Here is an extremely simple version that only remembers one previous value. It will predict properly after seeing one value (one to initialize the "previous" register).

(Diagram: the incoming value and a register holding the previous value feed a subtractor that produces the stride; an adder then adds the stride to the incoming value to form the prediction.)

Problem 5g[4pts]: What is the simplest type of predictor that can predict the following sequence of data values without errors after some startup period: Draw a hardware diagram for it. How many data values must it see before it starts predicting correctly?

This requires a context predictor. To get uniqueness, we need a 2nd-order predictor, since we need the following predictions: [1, 3] -> 3, [3, 3] -> 7, [3, 7] -> 10, [7, 10] -> 1, [10, 1] -> 3. We will need to see 7 values before we can predict correctly.

(Diagram: a 2-entry history of the most recent values selects a row of a frequency table; the highest-frequency element in that row is output as the prediction, and on each new input the entry for the current value in the selected row is incremented.)
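Minimal Python sketches of the two predictor types (illustrative only: the context predictor is a simplified last-value variant of the frequency-table design described above, and the example value stream is inferred from the prediction pairs [1,3]->3, [3,3]->7, [3,7]->10, [7,10]->1, [10,1]->3 listed in the answer):

class StridePredictor:                    # for 5f-style constant-stride sequences
    def __init__(self):
        self.prev, self.stride = None, 0
    def predict(self):
        return None if self.prev is None else self.prev + self.stride
    def update(self, value):
        if self.prev is not None:
            self.stride = value - self.prev
        self.prev = value

class ContextPredictor:                   # for 5g: maps (v[n-2], v[n-1]) -> v[n]
    def __init__(self):
        self.table, self.hist = {}, []
    def predict(self):
        return self.table.get(tuple(self.hist[-2:]))
    def update(self, value):
        if len(self.hist) >= 2:
            self.table[tuple(self.hist[-2:])] = value
        self.hist.append(value)

cp = ContextPredictor()
for v in [1, 3, 3, 7, 10, 1, 3, 3, 7, 10, 1, 3]:
    print(cp.predict(), "->", v)          # predictions become correct once every
    cp.update(v)                          # pair has been observed (after 7 values)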

[ This page intentionally left blank! ]
