Week 6 out-of-class notes, discussions and sample problems


We conclude our study of ILP with a look at the limitations of ILP and the benefits and costs of dynamic versus compiler-based approaches to promoting ILP. Exploiting ILP has been a goal since the first pipelined processors in the 1960s. The RISC architecture brought about many advances in pipeline technology that were not implementable with CISC instruction sets. However, there is only a certain amount of parallelism available. Recall that roughly 1 in 7 instructions is a branch. If branches are fairly uniformly distributed, this means that there are at most about 6 instructions in any basic block before a branch occurs. Without moving instructions across branches (speculation, unrolling), there are few options to cover stalls that arise from RAW hazards and branch delays.

In 1993, Wall did a study of the issues that limit ILP, which the textbook authors elaborate upon. Here, we examine their findings. From chapter 3, we see that the ideal machine is a superscalar that can issue any combination of instructions at each clock cycle, with branch prediction and enough registers and temporary registers to handle all active instructions. We might define the optimal machine as having:

1. Infinite temporary registers for register renaming
2. Perfect branch prediction
3. Perfect jump prediction
4. Perfect memory address alias analysis to detect and handle all hazards
5. Perfect caches

If #2 and #3 hold, we can eliminate all control dependencies. If #1 and #4 hold, we can eliminate all name dependencies, that is, anti (WAR) and output (WAW) dependencies. This would leave only true (RAW) dependencies, which can be overcome with forwarding, reservation stations and enough time. If our functional units are pipelined, then coupled with assumption #5 we would be able to issue any instruction immediately after its predecessor; that is, there would be no stalls at the issue stage.

An ideal machine would permit a large number of instruction issues per cycle. How many? Looking at 6 SPEC 92 benchmarks, the authors obtain the following ideal issue rates: gcc: 55, espresso: 63, li: 18, fpppp: 75, doduc: 119, tomcatv: 150. The last three benchmarks are FP programs, which supports our intuition that there is a greater degree of ILP in FP programs because there are fewer control dependencies other than loops. The li program, a Lisp interpreter, has many short dependencies, meaning that there is little ILP to exploit before the next dependence arises.

Comparing our optimal machine to modern processors, we of course find that we cannot come close to the optimal specifications. The processor with the largest amount of instruction issue per cycle at the moment is probably the IBM Power7, which issues up to 6 instructions (micro-operations) per cycle. This is a disappointing result when compared to the potential optimal instruction issue rate. Shown in the figure at the top of the next page are the optimal numbers of instruction issues per cycle given different window sizes. Recall that the window size is the number of instructions that are examined in one cycle to determine which combination can be dynamically issued.

But it is obvious that our ideal machine is not practical for many reasons. For instance, window sizes of more than 8 are very challenging because of the complexity of the hardware required to track dependencies. Compiler scheduling could possibly help with that. In addition, while an issue rate of up to 150 instructions per cycle (tomcatv, with an infinite window size) sounds great, the reality is that we would quickly run out of temporary registers at that rate. Assuming we could issue 150 instructions per cycle, and assuming that instructions wait on RAW hazards for at least 3-4 cycles, we would need 150 * 3.5 * 2 = 1050 registers to accommodate this issue rate (let's round this down to 1024). Are 1024 temporary registers doable? We typically equip our computers with far fewer.

Examine the figure at the top of the next page. We see a comparison of the number of temporary registers to the amount of instruction issue. Notice how at 512 temporary registers, the performance is almost identical to that with an infinite number of registers (with the exception of doduc). Recall that both the Pentium and the i7 have 128 temporary registers available for renaming/reservation stations. Although adding more registers is not cost prohibitive today, the question is whether additional registers would actually lead to a greater instruction issue rate. Something else to keep in mind is the limited size of the reorder buffer: it makes no sense to try to issue more instructions than there are slots in the reorder buffer. The i7 has 128 entries, and the largest reorder buffer (IBM Power7) has fewer than 200.
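To see where the 1050-register estimate comes from, here is a quick Python sketch of the arithmetic. The 3.5-cycle average RAW wait comes from the 3-4 cycle assumption above; the extra factor of 2 is simply the multiplier used in the estimate and is not interpreted further here.

# Back-of-the-envelope estimate of rename registers needed to sustain an issue rate.
def rename_registers_needed(issue_rate, avg_raw_wait=3.5, factor=2):
    # registers ~ instructions in flight: issue rate * average wait * extra factor
    return issue_rate * avg_raw_wait * factor

print(rename_registers_needed(150))   # 1050.0 -> rounded down to the power of two 1024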

The figure below demonstrates the maximum instruction issue rate given different forms of branch prediction. Notice how drastically non-perfect branch prediction differs from perfect branch prediction for the 3 integer benchmarks. The FP benchmarks are more predictable. Tournament predictors of reasonable sizes are common and not cost prohibitive.

Here, we see the misprediction rate for the three forms of prediction shown in the last figure. The profile-based predictor is a static approach, although it outperforms the dynamic 2-bit counter. But by far, the tournament predictor is the way to go. Again, notice that accuracy is better on the FP benchmarks (this figure lists the same 6 benchmarks in reverse order from the previous graphs).

Finally, the next figure demonstrates the maximum instruction issue rate for our 6 benchmarks using different forms of alias analysis. Global/stack perfect means that there is perfect analysis for all global and local variables/parameters, but not heap memory. There is a large difference between perfect and the others, but perfect is completely unrealizable. For the FP benchmarks, there is little improvement from no analysis to inspection, but in all cases, global/stack perfect is preferred.
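To make the 2-bit counter mentioned above concrete, here is a minimal Python sketch of a 2-bit saturating counter predictor. The table size, indexing scheme and initial counter value are arbitrary illustrative choices, not a description of any particular processor.

class TwoBitPredictor:
    # Minimal 2-bit saturating counter branch predictor (illustrative only).
    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries          # counters range 0..3; start weakly not-taken

    def _index(self, pc):
        return pc % self.entries               # simple direct-mapped indexing by branch address

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2    # 2 or 3 means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Example: a loop branch taken 9 times and then not taken.
p = TwoBitPredictor()
mispredictions = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x400) != taken:
        mispredictions += 1
    p.update(0x400, taken)
print(mispredictions)                          # 2: one warm-up miss plus the loop exit

A tournament predictor combines two such predictors (for example, one using local history and one using global history) and uses another table of 2-bit counters to choose between them on a per-branch basis.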

We must scale back our expectations for our processor to something realizable. The authors suggest the following:

- Up to 64 instructions issued per cycle with no restrictions placed on what can be issued
  o this is more than 10 times that of the Power7
- Tournament branch prediction with 1K entries and a 16-entry return address predictor
  o this is comparable to what exists today and certainly could be expanded to larger sizes if there is any value in that
- Perfect memory reference disambiguation to handle memory aliases
  o this is utterly impossible, but the authors suggest that for a small window size it is attainable, or at least a reasonably accurate predictor is attainable; alternatively, use global/stack analysis for a small window size
- Register renaming with 64 additional integer and 64 additional FP registers
  o if there is any area that can be improved, it is this one; adding another 128 registers should be no problem, but the expense is that the space used for these could also be used for a slightly larger cache

With these limitations over the ideal processor in place, simulations indicate that the same 6 benchmarks as listed above could achieve between 8 and 47 instructions issued per cycle, so it is possible to come close to the 64 (for some benchmarks) with moderate improvements to our most aggressive processors.

If we consider historically where the focus of processing has been, we see the following. Before the early 90s, much processing was integer processing because processors either had no FP capabilities (requiring a coprocessor) or the lack of pipelining led to slow computation no matter what. Outside of computation-heavy processes, most software had little to no FP operations, and those it had (e.g., graphics) were simple enough. Starting in the Pentium era, with its on-chip pipelined FPU and later the MMX (multimedia) extensions, FP and multimedia operations were directly supported by the processor, and more and more graphics routines were found in software. However, while computer graphics are still a critical aspect of most modern software, the bigger emphasis today is on the world wide web and cloud computing, both of which are primarily integer-based. We couple this analysis with the realization that FP programs are more predictable in terms of branches and achieve higher instruction issue rates when we try to support them with additional registers, branch prediction and memory disambiguation. This leads to contradictory conclusions: support to promote ILP in FP benchmarks is available, but we are turning more and more toward integer-based computing. Thinking of Amdahl's law, the common case is to support integer execution, and therefore our conclusions from above imply the following.

1. Processors should work to improve on the 6-instructions-per-cycle issue rate by enlarging the window size, but not substantially (a 32-instruction window size is sufficient, at least for now)
2. Support dynamic loop unrolling through additional temporary registers (at least 128 additional registers)
3. Increase the size of load/store buffers such that we can still achieve highly accurate alias resolution, although the increase does not need to be substantial
4. Increase the size of the reorder buffer to at least 128 entries (the size of the Intel Core i7's), or better yet, double that

These modest increases in capability will improve processor performance. However, as we will see starting next week, cache limitations have as large or a larger impact than anything we can do with the dynamic issue rate, and so improving the cache will be of equal or greater importance. Additionally, vector processing/SIMD and true MIMD processing will also provide greater support. We will examine these at the end of the course.

Let's consider 3 hypothetical but not atypical processors running the SPEC gcc benchmark (keeping in mind that this benchmark was least impacted by the window size). This example comes from the textbook; however, I will try to elaborate on the solution over what is explained in the book. Here are our 3 processors:

1. A simple MIPS 2-issue static pipeline running at a clock rate of 4 GHz and achieving a pipeline CPI of 0.8. The processor has a cache system that yields .005 misses per instruction.
2. A deeply pipelined version of a 2-issue MIPS processor with a slightly smaller cache and a 5 GHz clock rate. Pipeline CPI is 1.0; cache misses are .0055 per instruction.
3. A 2.5 GHz speculative superscalar with a 64-entry window. It achieves one-half of the ideal issue rate (see figure 3.27). Its small cache leads to .01 misses per instruction, but 25% of the miss penalty is hidden because of dynamic scheduling.

Assume main memory access time is 50 ns. What is the relative performance of each processor?

Our solution requires that we compute CPU time = IC * CPI * clock cycle time. Since we want the relative difference, we can ignore IC. We are given the CPI of each processor less the impact of cache misses, so we have to modify the CPI. First, we have to translate the memory access time into a miss penalty in cycles by factoring in the clock speed.

Miss penalty = memory access time / clock cycle time
Miss penalty machine 1 = 50 ns / (1 ns / 4) = 200 cycles
Miss penalty machine 2 = 50 ns / (1 ns / 5) = 250 cycles
Miss penalty machine 3 = 50 ns / (1 ns / 2.5) = 125 cycles

However, for machine 3, dynamic scheduling hides 25% of the miss penalty, so we have .75 * 125 = 94 cycles.

One thing to notice at this point is that the faster clock yields a larger miss penalty in cycles. The 5 GHz processor is impacted to a much greater extent than the 2.5 GHz processor when there is a cache miss. Thus, the idea that the faster processor is always better is misleading. Let's continue now to see how the cache miss factors into the overall performance.

The cache miss penalty must be converted into a CPI contribution. This is done as cache CPI = misses per instruction * miss penalty. We then compute the overall CPI as processor CPI + cache CPI. Before we do this, we have to derive machine 3's processor CPI. We are told that it achieves one-half of the ideal issue rate for a 64-entry window. For gcc, the 64-entry window's ideal instruction issue rate is 9, so this processor achieves 4.5 instructions issued per cycle, or a CPI of 1 / 4.5 = .22.

CPI machine 1 = 0.8 + .005 * 200 = 1.8
CPI machine 2 = 1.0 + .0055 * 250 = 2.4 (2.375 rounded)
CPI machine 3 = .22 + .01 * 94 = 1.16

We finally combine the CPI with the CPU clock rate to obtain the relative performance. This in fact gives us the MIPS rating (gcc is an integer benchmark; MIPS here means millions of instructions per second). If we assume the benchmark has the same IC on each processor, this result is sufficient for comparison.

Execution rate = clock rate / CPI
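The same arithmetic, as a small Python sketch. The inputs are exactly the figures given above; note that the worked solution rounds some intermediate values (94 cycles, a CPI of 2.4), so its MIPS figures differ slightly from the unrounded values printed here.

MEM_TIME_NS = 50.0    # main memory access time

# machine: (clock GHz, base pipeline CPI, misses per instruction, fraction of miss penalty hidden)
machines = {
    1: (4.0, 0.80, 0.005,  0.00),
    2: (5.0, 1.00, 0.0055, 0.00),
    3: (2.5, 0.22, 0.010,  0.25),   # base CPI .22 = 1 / 4.5 instructions issued per cycle
}

for m, (ghz, base_cpi, misses, hidden) in machines.items():
    miss_penalty = MEM_TIME_NS * ghz * (1.0 - hidden)   # memory time / cycle time, in cycles
    cpi = base_cpi + misses * miss_penalty
    mips = ghz * 1000 / cpi                             # millions of instructions per second
    print(f"machine {m}: penalty {miss_penalty:.1f} cycles, CPI {cpi:.3f}, {mips:.0f} MIPS")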

Execution rate machine 1 = 4 GHz / 1.8 = 2222 MIPS
Execution rate machine 2 = 5 GHz / 2.4 = 2083 MIPS
Execution rate machine 3 = 2.5 GHz / 1.16 = 2155 MIPS

Notice that all three machines come very close to providing the same MIPS rating on gcc. However, machine 1, which to a large extent is the simplest of the 3 processors, gives us the best result. The shortened pipeline means less impact from stalls, so the CPI is lower than that of machine 2, and because there is less need for hardware to support such things as data forwarding, stall handling and logic to look for dependencies, there is room for a slightly larger cache, and thus machine 1 has the lowest cache miss rate. The aggressive CPU (machine 3) is penalized in two ways: it has a longer clock cycle time and thus is a slower machine, and it also has a substantially smaller on-chip cache, such that its miss rate is twice as bad as machine 1's.

Let's consider two additional machines. The first is a variation on machine 3, say the Core i7 processor. In this case, the processor runs at 3.8 GHz and can issue up to 6 micro-operations per cycle. We will assume it averages 4.2 micro-operations per cycle. We will have to convert micro-operations to instructions; assume that the average machine instruction consists of 3 micro-operations. The CPI is impacted by mis-speculations and cache misses. Mis-speculations result in an additional .0167 cycles of penalty per instruction. The cache has an average miss rate of .0065 misses per instruction.

The other machine is an EPIC processor. It has a clock rate of 2 GHz. The compiler sets up bundles that can execute up to 3 instructions per cycle. Stalls are inserted by the compiler in the form of stops. Assume the compiler is successful at bundling 3 instructions 65% of the time and 2 instructions 28% of the time, leaving only 1 instruction per cycle 7% of the time. Assume stops are only needed for 1 in 5 bundles and a stop costs 1 cycle. Stalls also arise from mis-speculation of code, which happens once in 15 cycles and results in a 3-cycle stall. The processor has a sizeable on-chip cache, resulting in a miss rate of only .003.

As before, we will start by computing the number of cycles that a cache miss requires. We then compute the CPIs and finally the MIPS ratings.

Miss penalty machine 4 = 50 ns / (1 ns / 3.8) = 190 cycles
Miss penalty machine 5 = 50 ns / (1 ns / 2) = 100 cycles

The CPI for machine 4 requires some effort to compute. First, we are given 4.2 micro-operations per cycle, but this does not mean a base CPI of 1/4.2, because these are micro-operations. Instead, we have to translate the micro-operations into machine instructions. Since the average machine instruction is 3 micro-operations, this gives us a CPI of 3 / 4.2 = .714. Further, the mis-speculation penalty adds .0167, giving .714 + .0167 = .731.

Machine 5 issues some combination of 1 to 3 instructions per cycle based on the bundle, and the compiler inserts 1-cycle stops in some cases. Thus, the issue CPI = 1 / (3 * .65 + 2 * .28 + 1 * .07) + 1/5 (for the stops) = .388 + .2 = .588. Mis-speculation also causes stalls equivalent to 3 clock cycles every 15 cycles, or a further .2, so the base CPI = .788.

CPI machine 4 = .731 + .0065 * 190 = 1.966
CPI machine 5 = .788 + .003 * 100 = 1.088

Execution rate machine 4 = 3.8 GHz / 1.966 = 1933 MIPS
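Continuing the Python sketch for the two additional machines. Treat this purely as an illustration of the arithmetic: a few of the inputs (notably machine 4's .0065 miss rate and .0167 mis-speculation penalty) are reconstructed above from the stated CPIs and final MIPS figures rather than given explicitly.

MEM_TIME_NS = 50.0

# Machine 4: i7-style dynamic superscalar
ghz4 = 3.8
base_cpi4 = 3 / 4.2 + 0.0167            # micro-ops per instruction / micro-ops per cycle, plus mis-speculation
cpi4 = base_cpi4 + 0.0065 * (MEM_TIME_NS * ghz4)     # miss penalty = 50 ns / (1/3.8 ns) = 190 cycles
print(f"machine 4: CPI {cpi4:.3f}, {ghz4 * 1000 / cpi4:.0f} MIPS")   # ~1.966, ~1933 MIPS

# Machine 5: EPIC-style statically scheduled processor
ghz5 = 2.0
avg_issue = 3 * 0.65 + 2 * 0.28 + 1 * 0.07           # 2.58 instructions per cycle when issuing
base_cpi5 = 1 / avg_issue + 1 / 5 + 3 / 15           # issue CPI + stop cycles + mis-speculation stalls
cpi5 = base_cpi5 + 0.003 * (MEM_TIME_NS * ghz5)      # miss penalty = 100 cycles
print(f"machine 5: CPI {cpi5:.3f}, {ghz5 * 1000 / cpi5:.0f} MIPS")   # ~1.088, ~1839 MIPS (1838 with the text's rounding)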

Execution rate machine 5 = 2 GHz / 1.088 = 1838 MIPS

While these results are perhaps not entirely convincing, they demonstrate that integer programs might best run on short pipelines. However, what none of these examples gives us is an indicator of the performance when lengthier floating point operations are involved. As we saw in Appendix C, issuing floating point operations in a simple pipeline is challenging, and out-of-order execution is almost required. Therefore, one of machines 3, 4 or 5 would be best utilized for floating point programs. However, the authors of Appendix H caution that static scheduling, as with EPIC, is not entirely successful, and so it appears that the dynamic-issue superscalar is the approach to take. Whether instructions are issued at the machine language level or the micro-operation level is an open question, but at least for the time being, Intel has no plans to abandon its current approach of using micro-operations.

It is clear that a short pipelined approach does have some advantages:
- Less hardware required
- Less impact from stalls
- For integer operations, no concern for WAW and WAR hazards

If we include floating point benchmarks and decide that we MUST utilize a dynamic-issue superscalar, we still have to worry about the following issues:
- WAW and WAR hazards that arise through memory
- Recurrences in code that cause dependencies that limit the amount of loop-level parallelism
- Data flow limitations (not discussed here)

We wrap up this material by considering this simple question: should speculation occur through hardware or software?

Software-based speculation provides tools that are not available with dynamic speculation, such as trace scheduling. Further, compiler-based strategies can support some amount of memory disambiguation, whereas hardware approaches support almost none.

Hardware-based speculation is better when control flow is unpredictable. Further, dynamic branch prediction outperforms static branch prediction. Interestingly, even compiler-scheduled approaches are now using dynamic branch prediction.

Hardware-based speculation maintains precise exceptions through a reorder buffer. Software-based approaches need additional hardware and software support to maintain precise exceptions, making them more complex and costly.

Window sizes can be far greater for compiler scheduling than for any dynamically scheduled approach, and this will almost certainly always be the case.

One might wonder if a good mix of hardware and compiler-based scheduling could outperform either independent approach. But when it comes down to it, if you are going to use any compiler-based scheduling, you need additional hardware resources, complicating matters. Therefore, minimal software-based scheduling is the best way to go (today). The only drawbacks are the smaller window size and the lack of memory disambiguation available.

Sample problems:

1. Symbolically unroll and schedule the following loop. Also describe any pre-loop and post-loop code that you would have to add.

Loop: L.D    F0, 0(R1)
      MUL.D  F2, F0, F1
      L.D    F3, 0(R2)
      ADD.D  F4, F2, F3

      S.D    F4, 0(R1)
      DSUBI  R1, R1, #8
      DSUBI  R2, R2, #8
      BNE    R1, R3, Loop

Solution:

Loop: S.D    F4, 32(R1)
      ADD.D  F4, F2, F3
      L.D    F0, 16(R1)
      MUL.D  F2, F0, F1
      L.D    F3, 0(R2)
      DSUBI  R1, R1, #8
      DSUBI  R2, R2, #8
      BNE    R1, R3, Loop

The startup code will require six L.Ds (four for the array pointed to by R2 and two for the array pointed to by R1), three MUL.Ds, and one ADD.D. The cleanup code will require four S.Ds, three ADD.Ds, two L.Ds (for the array pointed to by R1), and one MUL.D.

2. Given the following code, unroll and schedule it for the IA-64, including any stops (whether necessary for stalls or not). For simplicity, assume that the MUL.D takes only 4 cycles to execute and the ADD.D takes 3 cycles to execute. You may omit the registers and DADDI offsets from your code.

Loop: L.D    F0, 0(R1)
      L.D    F1, 0(R2)
      MUL.D  F2, F0, F1
      L.D    F3, 0(R3)
      ADD.D  F4, F3, F2
      S.D    F4, 0(R4)
      DADDI  R1, R1, #8
      DADDI  R2, R2, #8
      DADDI  R3, R3, #8
      DADDI  R4, R4, #8
      BNE    R4, R5, Loop

Solution: unroll the loop 4 iterations worth and schedule, one bundle per line:

L.D    L.D    L.D
L.D    ;;                     // the ;; indicates a stop
L.D    L.D    MUL.D
L.D    L.D    MUL.D
L.D    L.D    MUL.D
L.D    DADDI  MUL.D
L.D    DADDI  ADD.D           // ADD.D takes 3 cycles to execute, so a latency of
DADDI  ADD.D                  // 1 between ADD.D and S.D
S.D    DADDI  ADD.D           // instead of 2
S.D    ADD.D  ;;              // the last stop is between the S.D /
BNE                           // DADDI and the BNE
S.D

3. Given the following loops, determine using the GCD test if there are any dependencies. Normalize the array accesses when needed.

a. for(j=1;j<300;j+=3) a[2*j-1] = a[5*j+2] * c;
b. for(j=0;j<100;j++) x[3*j] = x[2*j+1] + 1;
c. for(m=0;m<200;m+=2) c[12*m+2] = c[21*m-2] * q;

Solution:

a. Normalized: for(j=1;j<100;j+=1) a[6*j-1] = a[15*j+2] * c;
   a = 6, b = -1, c = 15, d = 2. GCD(a, c) = 3, and (d - b) / GCD(a, c) = (2 - -1) / 3 = 3 / 3 = 1 with no remainder, so there may be a dependence. In fact, we have a dependence from j = 3 to j = 8 (both reference a[47]).

b. for(j=0;j<100;j++) x[3*j] = x[2*j+1] + 1;
   a = 3, b = 0, c = 2, d = 1. GCD(a, c) = 1, and (d - b) / GCD(a, c) = (1 - 0) / 1 = 1 with no remainder, so there may be a dependence. You might think that there is a dependence when j = 1, but this would be x[3] = x[3] + 1, so it is within the same iteration. We do have a dependence later, for instance from j = 5 to j = 7 on x[15].

c. Normalized: for(m=0;m<100;m+=1) c[24*m+2] = c[42*m-2] * q;
   a = 24, b = 2, c = 42, d = -2. GCD(a, c) = 6, and (d - b) / GCD(a, c) = (-2 - 2) / 6 = -4 / 6, which has a remainder, so there is no dependence.

4. Given the following if-else statement, rewrite it in MIPS as is, rewrite it in MIPS code speculating that the else clause will be taken, and rewrite it using conditional instructions if possible. Assuming a 1-cycle penalty for any branch (but no penalty for RAW hazards), and assuming that the else clause is taken 90% of the time, compute the average number of cycles each of your three approaches takes to execute. Assume x, y and z are already stored in registers R1, R2 and R3 respectively.

if(x < y) z = x; else z = y;

Solution: First, we have the non-speculative code:

      SLT   R4, R1, R2
      BEQZ  R4, else
      DADDI R3, R1, #0
      J     out
else: DADDI R3, R2, #0
out:

Speculating the else clause looks like this:

     SLT   R4, R1, R2
     DADDI R3, R2, #0      // speculate the else clause (z = y)
     BEQZ  R4, out
     DADDI R3, R1, #0      // mis-speculated: change R3's value from y to x
out:

Since we do not have conditional moves that test < or >= directly, we will use SLT and SGE:

     SLT   R4, R1, R2      // R4 == 0 means that R1 >= R2
     SGE   R5, R1, R2      // R5 == 0 means that R1 < R2
     CMOVZ R3, R1, R5      // if clause (z = x if R1 < R2)
     CMOVZ R3, R2, R4      // else clause (z = y if R1 >= R2)

Assuming that the else clause is taken 90% of the time, the original code takes 6 cycles for the if clause and 4 cycles for the else clause, requiring .1 * 6 + .9 * 4 = 4.2 cycles on average. The speculated code takes 4 cycles for the else clause and 5 cycles for the if clause, requiring .1 * 5 + .9 * 4 = 4.1 cycles. The code with the conditional moves takes exactly 4 cycles.
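For problem 4, a quick Python check of the expected-cycle arithmetic, and, looking back at problem 3, a small helper for the GCD test. The per-path cycle counts are the ones derived above; the GCD helper simply restates the test that a dependence may exist only when GCD(a, c) divides (d - b).

def expected_cycles(p_else, else_cycles, if_cycles):
    # weighted average over the two paths of the if-else
    return p_else * else_cycles + (1 - p_else) * if_cycles

print(expected_cycles(0.9, 4, 6))    # non-speculative code: 4.2 cycles
print(expected_cycles(0.9, 4, 5))    # else clause speculated: 4.1 cycles
# the conditional-move version always takes 4 cycles

from math import gcd

def gcd_test(a, b, c, d):
    # write index a*i + b, read index c*i + d: a dependence may exist iff gcd(a, c) divides (d - b)
    return (d - b) % gcd(a, c) == 0

print(gcd_test(6, -1, 15, 2))    # part a: True  -> a dependence may exist
print(gcd_test(3, 0, 2, 1))      # part b: True  -> a dependence may exist
print(gcd_test(24, 2, 42, -2))   # part c: False -> no dependence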
