Week 6 out-of-class notes, discussions and sample problems


We conclude our study of ILP with a look at the limitations of ILP and the benefits and costs of dynamic versus compiler-based approaches to promoting ILP. Exploiting ILP has been a goal since the first pipelined processors in the 1960s. The RISC architecture brought about many advances in pipeline technology that were not implementable with CISC instruction sets. However, there is only a certain amount of parallelism available. Recall that roughly 1 in 7 instructions is a branch. If branches are fairly uniformly distributed, this means that there are at most about 6 instructions in any basic block before a branch occurs. Without moving instructions across branches (speculation, unrolling), there are few options to cover stalls that arise from RAW hazards and branch delays.

In 1993, Wall did a study of the issues that limit ILP, which the textbook authors elaborate upon. Here, we examine their findings. From chapter 3, we see that the ideal machine is a superscalar that can issue any combination of instructions at each clock cycle, with branch prediction and enough registers and temporary registers to handle all active instructions. We might define the optimal machine as having:

1. Infinite temporary registers for register renaming
2. Perfect branch prediction
3. Perfect jump prediction
4. Perfect memory address alias analysis to detect and handle all hazards
5. Perfect caches

If #2 and #3 hold, we can eliminate all control dependencies. If #1 and #4 hold, we can eliminate all name dependencies, that is, anti (WAR) and output (WAW) dependencies. This would leave only true (RAW) dependencies, which can be overcome with forwarding, reservation stations and enough time. If our functional units are pipelined, then coupled with assumption #5 we would be able to issue any instruction immediately after its predecessor; that is, there would be no stalls at the issue stage.

An ideal machine would permit a large number of instruction issues per cycle. How many? Looking at 6 SPEC 92 benchmarks, the authors obtain the following ideal issue rates: gcc: 55, espresso: 63, li: 18, fpppp: 75, doduc: 119, tomcatv: 150. The last three benchmarks are FP programs, which supports our intuition that there is a greater degree of ILP in FP programs because there are fewer control dependencies other than loops. The li program, a Lisp interpreter, has many short dependencies, meaning that there is little ILP to exploit before the next dependence arises.

Comparing our optimal machine to modern processors, we of course find that we cannot come close to the optimal specifications. The processor with the largest amount of instruction issue per cycle at the moment is probably the IBM Power7, which issues up to 6 instructions (micro-operations) per cycle. This is a disappointing result when compared to the potential optimal instruction issue rate. Shown in the figure at the top of the next page are the optimal numbers of instruction issues per cycle given different window sizes. Recall that the window size is the number of instructions that are examined in one cycle to determine which combination can be dynamically issued.

But it is obvious that our ideal machine is not practical for many reasons. For instance, window sizes of more than 8 are very challenging because of the complexity of the hardware required to track dependencies. Compiler scheduling could possibly help with that. In addition, while an issue rate of up to 150 instructions per cycle (tomcatv, with an infinite window size) sounds great, the reality is that we would quickly run out of temporary registers at that rate. Assuming we could issue 150 instructions per cycle, and assuming that instructions wait on RAW hazards for at least 3-4 cycles, we would need 150 * 3.5 * 2 = 1050 registers to accommodate this issue rate (let's round this down to 1024). Are 1024 temporary registers doable? We typically equip our computers with far fewer.

Examine the figure at the top of the next page. We see a comparison of the number of temporary registers to the amount of instruction issue. Notice how at 512 temporary registers, the performance is almost identical to that with an infinite number of registers (with the exception of doduc). Recall that both the Pentium and the i7 have 128 temporary registers available for renaming/reservation stations. Although adding more registers is not cost prohibitive today, the question is whether additional registers would actually lead to a greater instruction issue rate. Something else to keep in mind is the limited size of the reorder buffer: it makes no sense to try to issue more instructions than there are slots in the reorder buffer. The i7 has 128 entries, and the largest reorder buffer (IBM Power7) has fewer than 200.
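To see where the 1050-register estimate comes from, here is a quick Python sketch of the arithmetic. The 3.5-cycle average RAW wait comes from the 3-4 cycle assumption above; the extra factor of 2 is simply the multiplier used in the estimate and is not interpreted further here.

# Back-of-the-envelope estimate of rename registers needed to sustain an issue rate.
def rename_registers_needed(issue_rate, avg_raw_wait=3.5, factor=2):
    # registers ~ instructions in flight: issue rate * average wait * extra factor
    return issue_rate * avg_raw_wait * factor

print(rename_registers_needed(150))   # 1050.0 -> rounded down to the power of two 1024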

The figure below demonstrates the maximum instruction issue rate given different forms of branch prediction. Notice how drastically non-perfect branch prediction differs from perfect branch prediction for the 3 integer benchmarks. The FP benchmarks are more predictable. Tournament predictors of reasonable sizes are common and not cost prohibitive.

Here, we see the misprediction rate for the three forms of prediction shown in the last figure. The profile-based predictor is a static approach, although it outperforms the dynamic 2-bit counter. But by far, the tournament predictor is the way to go. Again, notice that accuracy is better on the FP benchmarks (this figure lists the same 6 benchmarks in reverse order from the previous graphs).

Finally, the next figure demonstrates the maximum instruction issue rate for our 6 benchmarks using different forms of alias analysis. Global/stack perfect means that there is perfect analysis for all global and local variables/parameters, but not heap memory. There is a large difference between perfect and the others, but perfect is completely unrealizable. For the FP benchmarks, there is little improvement from no analysis to inspection, but in all cases, global/stack perfect is preferred.
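To make the 2-bit counter mentioned above concrete, here is a minimal Python sketch of a 2-bit saturating counter predictor. The table size, indexing scheme and initial counter value are arbitrary illustrative choices, not a description of any particular processor.

class TwoBitPredictor:
    # Minimal 2-bit saturating counter branch predictor (illustrative only).
    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries          # counters range 0..3; start weakly not-taken

    def _index(self, pc):
        return pc % self.entries               # simple direct-mapped indexing by branch address

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2    # 2 or 3 means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Example: a loop branch taken 9 times and then not taken.
p = TwoBitPredictor()
mispredictions = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x400) != taken:
        mispredictions += 1
    p.update(0x400, taken)
print(mispredictions)                          # 2: one warm-up miss plus the loop exit

A tournament predictor combines two such predictors (for example, one using local history and one using global history) and uses another table of 2-bit counters to choose between them on a per-branch basis.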

We must scale back our expectations for our processor to something realizable. The authors suggest the following:

- Up to 64 instructions issued per cycle with no restrictions placed on what can be issued
  o this is more than 10 times that of the Power7
- Tournament branch prediction with 1K entries and a 16-entry return address predictor
  o this is comparable to what exists today and certainly could be expanded to larger sizes if there is any value in that
- Perfect memory reference disambiguation to handle memory aliases
  o this is utterly impossible, but the authors suggest that for a small window size it is attainable, or at least a reasonably accurate predictor is attainable; alternatively, use global/stack analysis for a small window size
- Register renaming with 64 additional integer and 64 additional FP registers
  o if there is any area that can be improved, it is this one; adding another 128 registers should be no problem, but the expense is that the space used for these could also be used for a slightly larger cache

With these limitations over the ideal processor in place, simulations indicate that the same 6 benchmarks as listed above could achieve between 8 and 47 instructions issued per cycle, so it is possible to come close to the 64 (for some benchmarks) with moderate improvements to our most aggressive processors.

If we consider historically where the focus of processing has been, we see the following. Before the early 90s, much processing was integer processing because processors either had no FP capabilities (requiring a coprocessor) or the lack of pipelining led to slow computation no matter what. Outside of computation-heavy processes, most software had little to no FP operations, and those it had (e.g., graphics) were simple enough. Starting in the Pentium era, with its on-chip pipelined FPU and later the MMX (multimedia) extensions, FP and multimedia operations were directly supported by the processor, and more and more graphics routines were found in software. However, while computer graphics are still a critical aspect of most modern software, the bigger emphasis today is on the world wide web and cloud computing, both of which are primarily integer-based. We couple this analysis with the realization that FP programs are more predictable in terms of branches and achieve higher instruction issue rates when we try to support them with additional registers, branch prediction and memory disambiguation. This leads to contradictory conclusions: support to promote ILP in FP benchmarks is available, but we are turning more and more toward integer-based computing. Thinking of Amdahl's law, the common case is to support integer execution, and therefore our conclusions from above imply the following.

1. Processors should work to improve on the 6-instructions-per-cycle issue rate by enlarging the window size, but not substantially (a 32-instruction window size is sufficient, at least for now)
2. Support dynamic loop unrolling through additional temporary registers (at least 128 additional registers)
3. Increase the size of load/store buffers such that we can still achieve highly accurate alias resolution, although the increase does not need to be substantial
4. Increase the size of the reorder buffer to at least 128 entries (the size of the Intel Core i7's), or better yet, double that

These modest increases in capability will improve processor performance. However, as we will see starting next week, cache limitations have as large or a larger impact than anything we can do with the dynamic issue rate, and so improving the cache will be of equal or greater importance. Additionally, vector processing/SIMD and true MIMD processing will also provide greater support. We will examine these at the end of the course.

Let's consider 3 hypothetical but not atypical processors running the SPEC gcc benchmark (keeping in mind that this benchmark was least impacted by the window size). This example comes from the textbook; however, I will try to elaborate on the solution over what is explained in the book. Here are our 3 processors:

1. A simple MIPS 2-issue static pipeline running at a clock rate of 4 GHz and achieving a pipeline CPI of 0.8. The processor has a cache system that yields .005 misses per instruction.
2. A deeply pipelined version of a 2-issue MIPS processor with a slightly smaller cache and a 5 GHz clock rate. Pipeline CPI is 1.0; cache misses are .0055 per instruction.
3. A 2.5 GHz speculative superscalar with a 64-entry window. It achieves one-half of the ideal issue rate (see figure 3.27). Its small cache leads to .01 misses per instruction, but 25% of the miss penalty is hidden because of dynamic scheduling.

Assume main memory access time is 50 ns. What is the relative performance of each processor?

Our solution requires that we compute CPU time = IC * CPI * clock cycle time. Since we want the relative difference, we can ignore IC. We are given the CPI of each processor less the impact of cache misses, so we have to modify the CPI. First, we have to translate the memory access time into a miss penalty in cycles by factoring in the clock speed.

Miss penalty = memory access time / clock cycle time
Miss penalty machine 1 = 50 ns / (1 ns / 4) = 200 cycles
Miss penalty machine 2 = 50 ns / (1 ns / 5) = 250 cycles
Miss penalty machine 3 = 50 ns / (1 ns / 2.5) = 125 cycles

However, for machine 3, dynamic scheduling hides 25% of the miss penalty, so we have .75 * 125 = 94 cycles.

One thing to notice at this point is that the faster clock yields a larger miss penalty in cycles. The 5 GHz processor is impacted to a much greater extent than the 2.5 GHz processor when there is a cache miss. Thus, the idea that the faster processor is always better is misleading. Let's continue now to see how the cache miss factors into the overall performance.

The cache miss penalty must be converted into a CPI contribution. This is done as cache CPI = misses per instruction * miss penalty. We then compute the overall CPI as processor CPI + cache CPI. Before we do this, we have to derive machine 3's processor CPI. We are told that it achieves one-half of the ideal issue rate for a 64-entry window. For gcc, the 64-entry window's ideal instruction issue rate is 9, so this processor achieves 4.5 instructions issued per cycle, or a CPI of 1 / 4.5 = .22.

CPI machine 1 = 0.8 + .005 * 200 = 1.8
CPI machine 2 = 1.0 + .0055 * 250 = 2.4 (2.375 rounded)
CPI machine 3 = .22 + .01 * 94 = 1.16

We finally combine the CPI with the CPU clock rate to obtain the relative performance. This in fact gives us the MIPS rating (gcc is an integer benchmark; MIPS here means millions of instructions per second). If we assume the benchmark has the same IC on each processor, this result is sufficient for comparison.

Execution rate = clock rate / CPI
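The same arithmetic, as a small Python sketch. The inputs are exactly the figures given above; note that the worked solution rounds some intermediate values (94 cycles, a CPI of 2.4), so its MIPS figures differ slightly from the unrounded values printed here.

MEM_TIME_NS = 50.0    # main memory access time

# machine: (clock GHz, base pipeline CPI, misses per instruction, fraction of miss penalty hidden)
machines = {
    1: (4.0, 0.80, 0.005,  0.00),
    2: (5.0, 1.00, 0.0055, 0.00),
    3: (2.5, 0.22, 0.010,  0.25),   # base CPI .22 = 1 / 4.5 instructions issued per cycle
}

for m, (ghz, base_cpi, misses, hidden) in machines.items():
    miss_penalty = MEM_TIME_NS * ghz * (1.0 - hidden)   # memory time / cycle time, in cycles
    cpi = base_cpi + misses * miss_penalty
    mips = ghz * 1000 / cpi                             # millions of instructions per second
    print(f"machine {m}: penalty {miss_penalty:.1f} cycles, CPI {cpi:.3f}, {mips:.0f} MIPS")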

Execution rate machine 1 = 4 GHz / 1.8 = 2222 MIPS
Execution rate machine 2 = 5 GHz / 2.4 = 2083 MIPS
Execution rate machine 3 = 2.5 GHz / 1.16 = 2155 MIPS

Notice that all three machines come very close to providing the same MIPS rating on gcc. However, machine 1, which to a large extent is the simplest of the 3 processors, gives us the best result. The shortened pipeline means less impact from stalls, so the CPI is lower than that of machine 2, and because there is less need for hardware to support such things as data forwarding, stall handling and logic to look for dependencies, there is room for a slightly larger cache, and thus machine 1 has the lowest cache miss rate. The aggressive CPU (machine 3) is penalized in two ways: it has a longer clock cycle time and thus is a slower machine, and it also has a substantially smaller on-chip cache, such that its miss rate is twice as bad as machine 1's.

Let's consider two additional machines. The first is a variation on machine 3, say the Core i7 processor. In this case, the processor runs at 3.8 GHz and can issue up to 6 micro-operations per cycle. We will assume it averages 4.2 micro-operations per cycle. We will have to convert micro-operations to instructions; assume that the average machine instruction consists of 3 micro-operations. The CPI is impacted by mis-speculations and cache misses. Mis-speculations result in an additional .0167 cycles of penalty per instruction. The cache has an average miss rate of .0065 misses per instruction.

The other machine is an EPIC processor. It has a clock rate of 2 GHz. The compiler sets up bundles that can execute up to 3 instructions per cycle. Stalls are inserted by the compiler in the form of stops. Assume the compiler is successful at bundling 3 instructions 65% of the time and 2 instructions 28% of the time, leaving only 1 instruction per cycle 7% of the time. Assume stops are only needed for 1 in 5 bundles and a stop costs 1 cycle. Stalls also arise from mis-speculation of code, which happens once in 15 cycles and results in a 3-cycle stall. The processor has a sizeable on-chip cache, resulting in a miss rate of only .003.

As before, we will start by computing the number of cycles that a cache miss requires. We then compute the CPIs and finally the MIPS ratings.

Miss penalty machine 4 = 50 ns / (1 ns / 3.8) = 190 cycles
Miss penalty machine 5 = 50 ns / (1 ns / 2) = 100 cycles

The CPI for machine 4 requires some effort to compute. First, we are given 4.2 micro-operations per cycle, but this does not mean a base CPI of 1/4.2, because these are micro-operations. Instead, we have to translate the micro-operations into machine instructions. Since the average machine instruction is 3 micro-operations, this gives us a CPI of 3 / 4.2 = .714. Further, the mis-speculation penalty adds .0167, giving .714 + .0167 = .731.

Machine 5 issues some combination of 1 to 3 instructions per cycle based on the bundle, and the compiler inserts 1-cycle stops in some cases. Thus, the issue CPI = 1 / (3 * .65 + 2 * .28 + 1 * .07) + 1/5 (for the stops) = .388 + .2 = .588. Mis-speculation also causes stalls equivalent to 3 clock cycles every 15 cycles, or a further .2, so the base CPI = .788.

CPI machine 4 = .731 + .0065 * 190 = 1.966
CPI machine 5 = .788 + .003 * 100 = 1.088

Execution rate machine 4 = 3.8 GHz / 1.966 = 1933 MIPS
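Continuing the Python sketch for the two additional machines. Treat this purely as an illustration of the arithmetic: a few of the inputs (notably machine 4's .0065 miss rate and .0167 mis-speculation penalty) are reconstructed above from the stated CPIs and final MIPS figures rather than given explicitly.

MEM_TIME_NS = 50.0

# Machine 4: i7-style dynamic superscalar
ghz4 = 3.8
base_cpi4 = 3 / 4.2 + 0.0167            # micro-ops per instruction / micro-ops per cycle, plus mis-speculation
cpi4 = base_cpi4 + 0.0065 * (MEM_TIME_NS * ghz4)     # miss penalty = 50 ns / (1/3.8 ns) = 190 cycles
print(f"machine 4: CPI {cpi4:.3f}, {ghz4 * 1000 / cpi4:.0f} MIPS")   # ~1.966, ~1933 MIPS

# Machine 5: EPIC-style statically scheduled processor
ghz5 = 2.0
avg_issue = 3 * 0.65 + 2 * 0.28 + 1 * 0.07           # 2.58 instructions per cycle when issuing
base_cpi5 = 1 / avg_issue + 1 / 5 + 3 / 15           # issue CPI + stop cycles + mis-speculation stalls
cpi5 = base_cpi5 + 0.003 * (MEM_TIME_NS * ghz5)      # miss penalty = 100 cycles
print(f"machine 5: CPI {cpi5:.3f}, {ghz5 * 1000 / cpi5:.0f} MIPS")   # ~1.088, ~1839 MIPS (1838 with the text's rounding)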

Execution rate machine 5 = 2 GHz / 1.088 = 1838 MIPS

While these results are perhaps not entirely convincing, they demonstrate that integer programs might best run on short pipelines. However, what none of these examples gives us is an indicator of the performance when lengthier floating point operations are involved. As we saw in Appendix C, issuing floating point operations in a simple pipeline is challenging, and out-of-order execution is almost required. Therefore, one of machines 3, 4 or 5 would be best utilized for floating point programs. However, the authors of Appendix H caution that static scheduling, as with EPIC, is not entirely successful, and so it appears that the dynamic-issue superscalar is the approach to take. Whether instructions are issued at the machine language level or the micro-operation level is an open question, but at least for the time being, Intel has no plans to abandon its current approach of using micro-operations.

It is clear that a short pipelined approach does have some advantages:
- Less hardware required
- Less impact from stalls
- For integer operations, no concern for WAW and WAR hazards

If we include floating point benchmarks and decide that we MUST utilize a dynamic-issue superscalar, we still have to worry about the following issues:
- WAW and WAR hazards that arise through memory
- Recurrences in code that cause dependencies that limit the amount of loop-level parallelism
- Data flow limitations (not discussed here)

We wrap up this material by considering this simple question: should speculation occur through hardware or software?

Software-based speculation provides tools that are not available with dynamic speculation, such as trace scheduling. Further, compiler-based strategies can support some amount of memory disambiguation, whereas hardware approaches support almost none.

Hardware-based speculation is better when control flow is unpredictable. Further, dynamic branch prediction outperforms static branch prediction. Interestingly, even compiler-scheduled approaches are now using dynamic branch prediction.

Hardware-based speculation maintains precise exceptions through a reorder buffer. Software-based approaches need additional hardware and software support to maintain precise exceptions, making them more complex and costly.

Window sizes can be far greater for compiler scheduling than for any dynamically scheduled approach, and this will almost certainly always be the case.

One might wonder if a good mix of hardware and compiler-based scheduling could outperform either independent approach. But when it comes down to it, if you are going to use any compiler-based scheduling, you need additional hardware resources, complicating matters. Therefore, minimal software-based scheduling is the best way to go (today). The only drawbacks are the smaller window size and the lack of memory disambiguation available.

Sample problems:

1. Symbolically unroll and schedule the following loop. Also describe any pre-loop and post-loop code that you would have to add.

Loop: L.D    F0, 0(R1)
      MUL.D  F2, F0, F1
      L.D    F3, 0(R2)
      ADD.D  F4, F2, F3

      S.D    F4, 0(R1)
      DSUBI  R1, R1, #8
      DSUBI  R2, R2, #8
      BNE    R1, R3, Loop

Solution:

Loop: S.D    F4, 32(R1)
      ADD.D  F4, F2, F3
      L.D    F0, 16(R1)
      MUL.D  F2, F0, F1
      L.D    F3, 0(R2)
      DSUBI  R1, R1, #8
      DSUBI  R2, R2, #8
      BNE    R1, R3, Loop

The startup code will require six L.Ds (four for the array pointed to by R2 and two for the array pointed to by R1), three MUL.Ds, and one ADD.D. The cleanup code will require four S.Ds, three ADD.Ds, two L.Ds (for the array pointed to by R1), and one MUL.D.

2. Given the following code, unroll and schedule it for the IA-64, including any stops (whether necessary for stalls or not). For simplicity, assume that the MUL.D takes only 4 cycles to execute and the ADD.D takes 3 cycles to execute. You may omit the registers and DADDI offsets from your code.

Loop: L.D    F0, 0(R1)
      L.D    F1, 0(R2)
      MUL.D  F2, F0, F1
      L.D    F3, 0(R3)
      ADD.D  F4, F3, F2
      S.D    F4, 0(R4)
      DADDI  R1, R1, #8
      DADDI  R2, R2, #8
      DADDI  R3, R3, #8
      DADDI  R4, R4, #8
      BNE    R4, R5, Loop

Solution: unroll the loop 4 iterations worth and schedule, one bundle per line:

L.D    L.D    L.D
L.D    ;;                     // the ;; indicates a stop
L.D    L.D    MUL.D
L.D    L.D    MUL.D
L.D    L.D    MUL.D
L.D    DADDI  MUL.D
L.D    DADDI  ADD.D           // ADD.D takes 3 cycles to execute, so a latency of
DADDI  ADD.D                  // 1 between ADD.D and S.D
S.D    DADDI  ADD.D           // instead of 2
S.D    ADD.D  ;;              // the last stop is between the S.D /
BNE                           // DADDI and the BNE
S.D

3. Given the following loops, determine using the GCD test if there are any dependencies. Normalize the array accesses when needed.

a. for(j=1;j<300;j+=3) a[2*j-1] = a[5*j+2] * c;
b. for(j=0;j<100;j++) x[3*j] = x[2*j+1] + 1;
c. for(m=0;m<200;m+=2) c[12*m+2] = c[21*m-2] * q;

Solution:

a. Normalized: for(j=1;j<100;j+=1) a[6*j-1] = a[15*j+2] * c;
   a = 6, b = -1, c = 15, d = 2. GCD(a, c) = 3, and (d - b) / GCD(a, c) = (2 - -1) / 3 = 3 / 3 = 1 with no remainder, so there may be a dependence. In fact, we have a dependence from j = 3 to j = 8 (both reference a[47]).

b. for(j=0;j<100;j++) x[3*j] = x[2*j+1] + 1;
   a = 3, b = 0, c = 2, d = 1. GCD(a, c) = 1, and (d - b) / GCD(a, c) = (1 - 0) / 1 = 1 with no remainder, so there may be a dependence. You might think that there is a dependence when j = 1, but this would be x[3] = x[3] + 1, so it is within the same iteration. We do have a dependence later, for instance from j = 5 to j = 7 on x[15].

c. Normalized: for(m=0;m<100;m+=1) c[24*m+2] = c[42*m-2] * q;
   a = 24, b = 2, c = 42, d = -2. GCD(a, c) = 6, and (d - b) / GCD(a, c) = (-2 - 2) / 6 = -4 / 6, which has a remainder, so there is no dependence.

4. Given the following if-else statement, rewrite it in MIPS as is, rewrite it in MIPS code speculating that the else clause will be taken, and rewrite it using conditional instructions if possible. Assuming a 1-cycle penalty for any branch (but no penalty for RAW hazards), and assuming that the else clause is taken 90% of the time, compute the average number of cycles each of your three approaches takes to execute. Assume x, y and z are already stored in registers R1, R2 and R3 respectively.

if(x < y) z = x; else z = y;

Solution: First, we have the non-speculative code:

      SLT   R4, R1, R2
      BEQZ  R4, else
      DADDI R3, R1, #0
      J     out
else: DADDI R3, R2, #0
out:

Speculating the else clause looks like this:

     SLT   R4, R1, R2
     DADDI R3, R2, #0      // speculate the else clause (z = y)
     BEQZ  R4, out
     DADDI R3, R1, #0      // mis-speculated: change R3's value from y to x
out:

Since we do not have conditional moves that test < or >= directly, we will use SLT and SGE:

     SLT   R4, R1, R2      // R4 == 0 means that R1 >= R2
     SGE   R5, R1, R2      // R5 == 0 means that R1 < R2
     CMOVZ R3, R1, R5      // if clause (z = x if R1 < R2)
     CMOVZ R3, R2, R4      // else clause (z = y if R1 >= R2)

Assuming that the else clause is taken 90% of the time, the original code takes 6 cycles for the if clause and 4 cycles for the else clause, requiring .1 * 6 + .9 * 4 = 4.2 cycles on average. The speculated code takes 4 cycles for the else clause and 5 cycles for the if clause, requiring .1 * 5 + .9 * 4 = 4.1 cycles. The code with the conditional moves takes exactly 4 cycles.
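For problem 4, a quick Python check of the expected-cycle arithmetic, and, looking back at problem 3, a small helper for the GCD test. The per-path cycle counts are the ones derived above; the GCD helper simply restates the test that a dependence may exist only when GCD(a, c) divides (d - b).

def expected_cycles(p_else, else_cycles, if_cycles):
    # weighted average over the two paths of the if-else
    return p_else * else_cycles + (1 - p_else) * if_cycles

print(expected_cycles(0.9, 4, 6))    # non-speculative code: 4.2 cycles
print(expected_cycles(0.9, 4, 5))    # else clause speculated: 4.1 cycles
# the conditional-move version always takes 4 cycles

from math import gcd

def gcd_test(a, b, c, d):
    # write index a*i + b, read index c*i + d: a dependence may exist iff gcd(a, c) divides (d - b)
    return (d - b) % gcd(a, c) == 0

print(gcd_test(6, -1, 15, 2))    # part a: True  -> a dependence may exist
print(gcd_test(3, 0, 2, 1))      # part b: True  -> a dependence may exist
print(gcd_test(24, 2, 42, -2))   # part c: False -> no dependence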
