Dynamic Branch Prediction for a VLIW Processor

Jan Hoogerbrugge
Philips Research Laboratories, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands

Abstract

This paper describes the design of a dynamic branch predictor for a VLIW processor. The developed branch predictor predicts the direction of a branch, i.e., taken or not taken, and in the case of a taken prediction it also predicts the issue-slot that contains the taken branch. This information is used to perform the BTB lookup. We compare this method against a typical superscalar branch predictor and against a branch predictor developed for VLIWs by Intel and HP. For a 2K entry BHT, 512 entry BTB, gshare branch predictor we obtain a next-pc misprediction rate of 7.83%, while a traditional superscalar-type branch predictor of comparable cost achieves 10.3% and the Intel/HP predictor achieves 9.31%. In addition, we propose to provide both predicted and delayed branches in the ISA and to let the compiler select which type to apply. Simulations show performance improvements of 2-7% for benchmarks that are well known for their high misprediction rates. This paper also contributes an experiment to determine whether speculatively updating the predictor in the fetch stage and correcting it on mispredictions is really necessary for VLIWs, instead of updating when branches are resolved. Experiments show that the performance advantage of speculative updating is small.

1. Introduction

Dynamic branch prediction is usually associated with dynamically scheduled superscalar processors and not directly with statically scheduled very long instruction word (VLIW) processors. In fact, most VLIWs have delayed branches instead of dynamic branch prediction. The compiler is responsible for filling delay slots, often assisted by static branch prediction based on profiling. As can be expected, dynamic branch prediction is also applicable to VLIWs. It is useful in the cases where a compiler has difficulties filling branch delay slots. Nevertheless, while investigating dynamic branch prediction for our target VLIW, the TriMedia processor, we encountered two problems. First, when multiple conditional direct branches are scheduled in one VLIW instruction, the instruction becomes a multi-way branch with multiple branch targets. This does not work well for traditional BTBs (branch target buffers), which associate only one branch target address with an instruction address. Second, in many cases delayed branches are preferable to predicted branches, namely when delay slots can be filled effectively and/or the branch is hard to predict. Both problems are addressed in this paper.

The solutions that we propose for the two encountered problems are as follows. First, in addition to predicting the branch direction, i.e., taken or not taken, we also predict which issue-slot contains the taken branch. This issue-slot information is used for the BTB lookup. The BTB now associates a branch target address with the combination of instruction address and issue-slot. Second, we provide both delayed branches and predicted branches in the ISA. The compiler selects between those types based on how well it is able to fill delay slots and the expected branch prediction accuracy.

The paper is organised as follows. Section 2 gives background information on dynamic branch prediction. Section 3 describes the simulation environment and benchmarks that we used. Section 4 discusses branch prediction for VLIWs based on issue-slot prediction. Section 5 describes how delayed and predicted branches can be used to improve performance.
Section 6 discusses whether speculative update of the branch predictor is necessary. Finally, section 7 gives the conclusions.

2. Background

Dynamic branch prediction is typically implemented using two structures: the BTB (branch target buffer) and the BHT (branch history table). The BTB detects branches and predicts branch targets, while the BHT predicts the branch direction.

The BTB. The BTB is a cache in which instruction addresses are associated with branch targets [11]. If an instruction address hits in the BTB, we know that it is the address of a branch instruction and we have a prediction for its branch target. This branch target is the branch target of the last execution of the branch. This usually works very well, since most branches are direct branches with a static branch target. This is not the case for indirect branches. Many of the indirect branches are function returns. The prediction of function returns can be improved by maintaining a return address stack (RAS) [8]. Function call branches push the return address onto the RAS and function return branches pop values off the RAS. To determine the branch type, which is necessary to detect function returns in the fetch stage, the BTB usually also associates type information with instruction addresses. Alternatively, type information can be predecoded in the instruction cache.

The BHT. The BHT predicts the direction of conditional branches, i.e., whether a branch is taken or not. This is typically implemented by a table of two-bit saturating counters indexed by the lower part of the pc [18]. Such a counter is incremented when a resolved branch is taken and decremented when it is not taken. A branch is predicted as taken if the most significant bit of the corresponding two-bit counter is set. The four states of the two-bit counter have the following names: 0 = strongly not taken (SN), 1 = weakly not taken (WN), 2 = weakly taken (WT), and 3 = strongly taken (ST). Figure 1 shows the state diagram corresponding to the two-bit counter.

Figure 1. State diagram for direction prediction based on two-bit saturating counters.

The rationale for the weak and strong states is to introduce some form of hysteresis in the branch predictor. Whenever a branch that is biased in one direction is mispredicted, we should give it a second chance before changing the prediction. This is realised by moving from a strong to a weak state, but maintaining the same prediction. Whenever we mispredict the branch again, we change our prediction. In the case of a correct prediction, we move back to the strong state.

Because BHTs are tag-less tables, conflicts caused by mapping multiple branches onto the same counter are not detected. This is known as aliasing. Aliasing reduces prediction accuracy. A lot of research has been performed in the last decade to improve branch prediction accuracy. Probably the most important invention has been the exploitation of correlation between branch directions [20, 15] in the so-called two-level adaptive schemes. Global schemes have a branch history register, that is, a shift register into which the directions of resolved conditional branches are shifted. The content of the branch history register is combined with the lower part of the pc to access the table of two-bit counters. This exploits correlation between successively executed branches. Combining the branch history register and the lower part of the pc can be done by concatenating them, as is done in the GAs and GAp schemes, or by XORing them, as is done in the gshare scheme [12]. Per-branch schemes, also called local schemes, associate a branch history register with each branch. Such a register records the outcomes of the last executions of the corresponding branch. One can view this as the rhythm of a branch. This rhythm, possibly combined with the pc, is mapped by a table of two-bit counters onto a prediction. For example, a branch with history TTTNTTTNT (T = taken, N = not taken) will probably be predicted as taken, since it is not taken once every four executions.
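As an illustration, the following is a minimal C sketch of the two-bit saturating counter described above; it encodes the state diagram of figure 1, and the names are ours rather than anything from the TriMedia design.

    /* Two-bit saturating counter: 0 = SN, 1 = WN, 2 = WT, 3 = ST. */
    typedef unsigned char ctr2_t;                      /* holds 0..3 */

    static int ctr2_predict_taken(ctr2_t c) {
        return c >= 2;                                 /* MSB set: predict taken */
    }

    static ctr2_t ctr2_update(ctr2_t c, int taken) {
        if (taken) return (ctr2_t)(c < 3 ? c + 1 : 3); /* saturate at ST */
        else       return (ctr2_t)(c > 0 ? c - 1 : 0); /* saturate at SN */
    }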
Combining BTB, BHT, and RAS. Figure 2 shows how the BTB, BHT, and RAS are combined to implement a gshare branch predictor [12].

Figure 2. A gshare branch predictor [12] consisting of a BTB, BHT, and RAS.

The current pc is used to access the BTB and BHT. The pc is XORed with the branch history register to exploit branch correlation. The results of the BTB and BHT lookups are combined in the following way to predict the next pc:

    if (BTB missed)
        next pc = pc + size of instruction
    else if (BTB type is return)
        next pc = top of RAS
    else if (BTB type is cond. branch && BHT predicts taken)
        next pc = target returned by BTB
    else
        next pc = pc + size of instruction
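The gshare indexing and update path can be sketched in C as follows. The sizes match the configuration evaluated later in the paper (a 2K-entry BHT and a 12-bit global history register), but all names and details here are illustrative assumptions, not the paper's implementation.

    #include <stdint.h>

    #define BHT_ENTRIES 2048u              /* power of two, so masking works */
    #define HIST_BITS   12u

    static uint8_t  bht[BHT_ENTRIES];      /* two-bit counters, one byte each */
    static uint32_t ghr;                   /* global branch history register */

    static uint32_t gshare_index(uint32_t pc) {
        uint32_t hist = ghr & ((1u << HIST_BITS) - 1u);
        return (pc ^ hist) & (BHT_ENTRIES - 1u);       /* XOR pc with history */
    }

    static int gshare_predict_taken(uint32_t pc) {
        return bht[gshare_index(pc)] >= 2;
    }

    /* A real front end would remember the index used at prediction time;
       recomputing it here keeps the functional sketch simple. */
    static void gshare_update(uint32_t pc, int taken) {
        uint32_t i = gshare_index(pc);
        if (taken) { if (bht[i] < 3) bht[i]++; }
        else       { if (bht[i] > 0) bht[i]--; }
        ghr = (ghr << 1) | (taken ? 1u : 0u);          /* shift in the outcome */
    }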

3. Simulation Environment and Benchmarks

We used the compiler and simulation environment of the Philips TriMedia TM1000 VLIW mediaprocessor for our experiments. The TriMedia TM1000 is a five issue-slot VLIW with three issue-slots in which branches can be scheduled. Branches are resolved in the first execute stage, which is the fourth pipeline stage. This leads to three branch delay slots, i.e., three instructions following a taken branch are executed before the branch takes place.

For our experiments we use 10 benchmarks from SPECint92 and SPECint95. Simulation is limited to 20 million branch-containing instructions. The total number of simulated instructions is typically a few times larger, and the total number of simulated operations is again a few times larger than that. Due to technical difficulties, we are not able to simulate caching effects in combination with branch prediction. This is only relevant in section 5, where we present execution times instead of misprediction rates.

The benchmarks are compiled with the TriMedia version 2.0 production compiler at optimization level -O3 [6]. This level includes loop unrolling and if-conversion, which transforms branches into guarded execution in order to reduce the number of branches. Sophisticated scheduling techniques are employed to fill branch delay slots [6]. Except for the experiments described in section 5, the compiler optimizations do not take branch mispredictions into consideration.

Handling mispredictions for a VLIW is less complex than for a superscalar, where dynamic speculative execution is employed. In the case of a misprediction, which is detected in the fourth pipeline stage, the first three pipeline stages are flushed and refilled with instructions from the correct position. During this, the fourth stage and the stages that follow it are frozen. This ensures that the resource requirements pattern of an operation (i.e., when resources are used), as used by the instruction scheduler, is not affected by dynamic events. Alternatively, one can flush the pipeline and start refilling when no resource conflicts can occur anymore between operations before and after the mispredicted branch.

4. Dynamic Branch Prediction Based on Issue-Slot Prediction

4.1. Problem statement

Existing branch predictors known from the literature, of which the one described in section 2 is a typical example, are designed for sequential ISAs. These predictors do not perform well for VLIWs, where multiple branches can be scheduled in one instruction. The results in table 1 show that the scheduler actually schedules multiple branches per cycle on control-flow intensive code: many instructions that contain branch operations contain multiple branch operations, on average more than one third.

Table 1. Distribution of branches per instruction, for espresso, li, eqntott, compress, sc, gcc, go, m88ksim, ijpeg, perl, and the average.

The problem is that the BTB no longer works effectively, because a BTB can associate only one branch target and branch type with an instruction address. The following sequential code of three conditional branches illustrates the problem:

    br (x=1) L1;
    br (x=2) L2;
    br (x=3) L3;

After this sequential code has been executed with values 1, 2, and 3 for x, all three branches are stored in the BTB. From that moment on, all BTB accesses will be hits and will deliver the correct target. The only branch mispredictions are caused by imperfections of the BHT.
Now consider what happens when all three branches are scheduled into one VLIW instruction:

    br (x=1) L1, br (x=2) L2, br (x=3) L3;

We no longer have the situation where all BTB accesses deliver the correct target, because the VLIW instruction has three targets. Consider the case where the example VLIW instruction is repeatedly executed with different values for x, either 1, 2, or 3. All mispredictions in the scalar machine were caused by the BHT. In the VLIW case the instruction is always taken, but with different targets; therefore, all mispredictions are caused by the BTB. We have the feeling that this shift of the cause of mispredictions from the BHT to the BTB is typical. We could improve the BTB target prediction accuracy with a gshare-like technique, which has also been proposed to predict indirect branches [2]. Unfortunately, gshare would make more BTB entries necessary, and these are wide and thus expensive. Therefore, we tried to solve the problem in another way, using issue-slot based prediction, which is described in the next section.

4.2. Issue-slot based branch prediction

Our solution to the problem described in section 4.1 is to predict not only the direction of the branch, but also the issue-slot that contains the taken branch, in the case that the predicted direction is taken. Therefore, for our TriMedia processor with three branch issue-slots, we predict either not taken, taken by slot 1, taken by slot 2, or taken by slot 3. The prediction is made in a similar way to a direction-only prediction. A state machine is introduced with two states for every prediction outcome: a strong one and a weak one. The state diagram is shown in figure 3. The state machine no longer corresponds to a saturating counter, which is obviously not a problem.

Figure 3. State diagram for issue-slot prediction. Dashed transitions are taken on mispredictions; solid transitions are taken on correct predictions.

The direction and issue-slot predicted by the BHT are used to perform the BTB lookup. The BTB associates a branch target and type information with a combination of an instruction address and an issue-slot. For direct branches this information is constant for each combination of instruction address and issue-slot. This means that the BTB performs as it used to do. We implement the association with the combination of instruction address and issue-slot by extending the tags of the BTB with an issue-slot identifier, two bits in our case. The issue-slot identifier is not used to index (i.e., address) the BTB, however. That would create a potential cycle time problem, since the BHT and BTB would then have to be accessed sequentially. Figure 4 shows the organisation of the branch predictor.

Figure 4. An issue-slot based predictor.

The following differences can be seen when compared to figure 2: the information delivered by the BHT is different, information is passed from the BHT to the BTB, and the tags of the BTB are extended. The function of the branch history register has also changed. Instead of shifting in one bit representing the direction of a resolved branch, we now shift in two bits corresponding to one of four results: not taken, or taken by one of the three branch issue-slots. The same applies for per-branch two-level adaptive schemes. No shifting takes place when the instruction does not contain branches. The disadvantage of shifting in issue-slot information instead of direction information only is that it takes more space, and therefore the history is less deep. However, more information is recorded per branch instruction. We experimented with both schemes for capturing history information and came to the conclusion that recording issue-slot information gives the best results. The state machine used for issue-slot prediction is very similar to the state machine proposed by Menezes et al. for path prediction [13]. A path predictor predicts the outcomes of several successive branches in a single prediction [3, 16].
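One plausible C encoding of such a state machine is sketched below. It follows the strong/weak-per-outcome description in the text rather than the exact transition table of figure 3, so it should be read as an approximation.

    /* Outcome 0 = not taken, 1..3 = taken by that branch issue-slot. */
    typedef struct {
        unsigned char outcome;                 /* predicted outcome */
        unsigned char strong;                  /* 1 = strong state, 0 = weak state */
    } slot_pred_t;

    static unsigned char slot_predict(slot_pred_t s) {
        return s.outcome;                      /* also selects the slot id for the BTB tag compare */
    }

    static slot_pred_t slot_update(slot_pred_t s, unsigned char actual) {
        if (actual == s.outcome) {
            s.strong = 1;                      /* correct prediction: move to the strong state */
        } else if (s.strong) {
            s.strong = 0;                      /* first misprediction: weaken, keep the prediction */
        } else {
            s.outcome = actual;                /* second misprediction: switch to the new outcome */
            s.strong = 0;
        }
        return s;
    }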
To evaluate our ideas we compared the system shown in figure 2 with the system shown in figure 4. We used a 2K entry BHT, a 512 entry, 4-way set-associative BTB with LRU replacement, and a 32 entry RAS. We used a 12 bit branch history register for gshare. These numbers are typical for contemporary microprocessors. The first two sets of three columns of table 2 show the results.

Table 2. Misprediction rates for three types of branch predictors (direction prediction, direction and issue-slot prediction with a single-target BTB, and direction and issue-slot prediction with a multi-target BTB), for 008.espresso, li, eqntott, compress, sc, gcc, go, m88ksim, ijpeg, perl, and the average. Three numbers are given for every type: the misprediction rate of the BHT, i.e., the rate of predicting the wrong direction (or the wrong direction or issue-slot in the case of issue-slot prediction); the misprediction rate of the BTB in the case of a correct prediction by the BHT; and the combined misprediction rate of the branch prediction system.

The three columns show the BHT misprediction rate (wrong direction or issue-slot prediction), the BTB misprediction rate (wrong target given a correct direction or issue-slot prediction), and the total misprediction rate (wrong next-pc prediction). The high BTB misprediction rate relative to the BHT misprediction rate confirms the feeling that we described in section 4.1. In comparison to direction prediction, issue-slot based branch prediction has a higher BHT misprediction rate (5.31% vs. 6.62%) because more information is predicted, but the BTB misprediction rate is much lower (5.94% vs. 1.97%). The total next-pc misprediction rate is reduced from 10.3% to 7.83%. The improvement comes at the price of wider BHT and BTB entries: the BHT entries have been extended by one bit for issue-slot prediction, and the BTB entries have been extended by 2 tag bits for the issue-slot identifiers. The total size increased from 4.6 KBytes to 5.1 KBytes.

4.3. Comparison with multi-target BTBs

An alternative solution to the problem described in section 4.1 is the multi-target BTB, which is patented by Intel and HP in US Patent 5,903,750 [21]. In this solution, the BTB entries have multiple branch targets, one for each branch issue-slot. The issue-slot prediction is used to select one of the targets. Multi-target BTBs differ from our proposed system of single-target BTBs with issue-slot tags in storage efficiency.
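A rough, back-of-the-envelope comparison illustrates why a 256-entry multi-target BTB and a 512-entry single-target BTB end up at a comparable cost. The field widths below (20-bit tags, 30-bit targets, no type bits) are assumptions chosen for illustration only.

    #include <stdio.h>

    #define TAG_BITS    20
    #define SLOT_BITS   2                      /* issue-slot identifier added to the tag */
    #define TARGET_BITS 30
    #define SLOTS       3                      /* branch issue-slots */

    int main(void) {
        int single = (TAG_BITS + SLOT_BITS) + TARGET_BITS;   /* one target per entry */
        int multi  = TAG_BITS + SLOTS * TARGET_BITS;         /* one target per issue-slot */
        printf("single-target entry: %d bits x 512 entries = %d bits\n", single, single * 512);
        printf("multi-target entry:  %d bits x 256 entries = %d bits\n", multi, multi * 256);
        return 0;
    }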

Which one is more efficient depends on the number of active branches per instruction, where an active branch is a branch that is often taken. When there are many active branches per instruction, a multi-target BTB is more efficient, since the tag is shared by multiple branches. When there are few active branches per instruction, on the other hand, a multi-target BTB is less efficient, since most of the branch target fields are not used.

We implemented multi-target BTBs in our simulator to compare both schemes. For the comparison we use a multi-target BTB with 256 entries, since a multi-target BTB entry is about twice as wide as a single-target BTB entry for a three branch issue-slot configuration (a single-target BTB entry consists of one tag and one target, while a multi-target BTB entry consists of one tag and three targets in this case; with full resolution a multi-target entry is therefore approximately twice as wide as a single-target entry, e.g., ( ) vs. (20+30)). Both BTBs, a 256-entry multi-target and a 512-entry single-target BTB, therefore have approximately the same cost in terms of die area. The remaining parameters, including the set-associativity of the BTB, remain the same. The last set of three columns of table 2 lists the results. It is obvious that the BHT misprediction rate has not changed with respect to single-target issue-slot based branch prediction. The BTB misprediction rate is clearly higher for multi-target BTBs (1.97% vs. 3.44%), and therefore so is the total next-pc misprediction rate (7.83% vs. 9.31%). This makes single-target BTBs more favourable than multi-target BTBs.

4.4. Compatibility with other techniques

The usefulness of a technique, in our case branch prediction based on issue-slot prediction, also depends on how well it can be combined with other techniques. In this section we describe whether our technique is compatible with several other existing techniques.

Combination with variants of two-level adaptive branch predictors other than gshare is possible [20, 15]. Using Yeh and Patt's terminology, the two-bit counters in the pattern history table (PHT) have to be replaced by a state of a state machine as shown in figure 3. The branch history table/register (BHT/BHR) is changed such that issue-slot information is shifted into it instead of only direction information. In this paper we described our ideas for three branch issue-slots; however, it is straightforward to generalize this. For n branch issue-slots, the PHT entries are 1 + ⌈log2(n+1)⌉ bits wide, and ⌈log2(n+1)⌉ bits of history information are produced per branch-containing instruction. Hybrid branch predictors [12] also pose no problems: a meta predictor selects between the direction and issue-slot predictions of several predictors.
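As a quick sanity check of this generalization, the widths can be tabulated for a few values of n with a small helper; the output for n = 3 (3-bit PHT entries, 2 history bits per instruction) matches the configuration described earlier.

    #include <stdio.h>

    static int ceil_log2(unsigned v) {          /* bits needed to encode v distinct values */
        int b = 0;
        while ((1u << b) < v) b++;
        return b;
    }

    int main(void) {
        for (unsigned n = 1; n <= 8; n++) {
            unsigned outcomes = n + 1;          /* not taken, or taken by slot 1..n */
            printf("n=%u: PHT entry = %d bits, history per instruction = %d bits\n",
                   n, 1 + ceil_log2(outcomes), ceil_log2(outcomes));
        }
        return 0;
    }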
Several techniques exist that attempt to reduce the negative effect of aliasing in the BHT: the agree predictor [19], the filter predictor [1], the skewed predictor [14], the bi-mode predictor [9], and YAGS [4]. Two of them pose problems: the agree predictor and the filter predictor. Both pass information from the BTB to the BHT. Since we pass information in the opposite direction, this would create a dependence cycle between the BTB and BHT. This makes the agree and filter predictors incompatible. Fortunately, the comparisons between the five techniques described in [10, 4] indicate that the agree and filter predictors are the least accurate of the five.

5. Selective Dynamic Branch Prediction

VLIW processors typically expose micro-architectural details in their ISA that superscalar processors hide. VLIW processors expose them so that a compiler can make use of them, while superscalars hide them to maintain object code compatibility between different implementations of the same architecture. The micro-architectural feature relevant to this paper is the number of pipeline stages in front of the stage in which branches are resolved. This can be exposed by means of delayed branches and effectively hidden by means of branch prediction. When we have the opportunity to expose micro-architectural details, we can provide both delayed branches and predicted branches in the ISA. The compiler can then select between the two types by using different opcodes. Delayed branches are used when delay slots can be filled effectively and/or branches are hard to predict.

An additional advantage of this idea, which we call selective dynamic branch prediction, is that delayed branches do not need the BHT and BTB and therefore do not update them. This reduces the pressure on these tables, which makes smaller tables possible. Delayed branches do still update the branch history register, however (or the branch history table in the case of per-address schemes), in order to improve the prediction accuracy of predicted branches via branch correlation. Delayed branches also update the RAS, because functions are compiled independently of each other and a caller does not know whether the callee will return via a predicted branch or a delayed branch. In fact, a function can have multiple returns of different types.

We implemented selective dynamic branch prediction in our compiler and simulator. The selection is made at decision tree granularity. Decision trees are the scheduling units used by our compiler [6]. All branches in a decision tree are either delayed branches or predicted branches. Selecting branch types at a finer granularity would require more effort in the compiler, and would either be complex in hardware or have to be constrained by scheduling rules. Such a rule might be that predicted branches are not allowed to be scheduled in the delay slots of delayed branches, because mispredictions in delay slots cannot be handled.

The implementation in our compiler is as follows. Prior to scheduling a decision tree, the compiler selects between the branch types for that particular decision tree. This is done by estimating the schedule length that it will obtain for each of the two types. The schedule length is estimated by scheduling the decision tree with a simple and fast scheduler that is far less sophisticated than the actual scheduler. The simple scheduler does not check all resource constraints for this estimation, e.g., available write-back buses, and it does not perform integrated register allocation [6]. In the case of delayed branches, the estimated execution time corresponds to the estimated schedule length (cache effects are ignored). In the case of predicted branches, the estimated execution time is computed by multiplying the estimated schedule length by an estimated branch misprediction penalty factor. We currently use 1.08 as this factor, which was determined empirically.
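The resulting selection rule can be sketched as follows. The function and type names are invented for illustration, and the two schedule lengths are assumed to come from the fast estimating scheduler mentioned above.

    typedef enum { USE_DELAYED_BRANCHES, USE_PREDICTED_BRANCHES } branch_kind_t;

    #define MISPREDICT_FACTOR 1.08             /* empirically determined penalty factor */

    static branch_kind_t select_branch_kind(double est_len_delayed,
                                            double est_len_predicted) {
        double cost_delayed   = est_len_delayed;                        /* delay slots already in the schedule */
        double cost_predicted = est_len_predicted * MISPREDICT_FACTOR;  /* shorter schedule plus misprediction cost */
        return (cost_predicted < cost_delayed) ? USE_PREDICTED_BRANCHES
                                               : USE_DELAYED_BRANCHES;
    }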
Table 3 lists the outcomes of measurements to evaluate selective dynamic branch prediction. We used a branch misprediction penalty of 4 cycles for these experiments.

Table 3. Performance relative to delayed branches in three cases: 1) predicted branches only, 2) mixing predicted and delayed branches, and 3) mixing predicted and delayed branches employing profile feedback information; for 008.espresso, li, eqntott, compress, sc, gcc, go, m88ksim, ijpeg, perl, and the average, the table lists the relative performance and, for the mixed cases, the fraction of delayed branches. In the case of profile feedback, we use the same input data for profiling as for the actual measurement.

The first column shows the performance improvement of predicted branches over delayed branches. There is a large variation in improvements among the benchmarks. The only benchmark that performs worse is go; the reason is obviously its high misprediction rate (21%). The next column pair shows results for selective dynamic branch prediction. This turns out to be effective for go, which now performs slightly better than using only delayed branches. On average, however, the performance improvement is slightly negative. This is mainly caused by eqntott, for which the heuristic value of 1.08 is a bad choice. We can improve this by performing a profile run to measure mispredictions, which makes a better selection between branch types possible. The results are shown in the last column pair. In comparison with predicted branches only, the performance improved by almost 2%. Although this might seem low, it is most effective for the benchmarks that suffer from a high misprediction rate, such as gcc, go, and perl. Table 3 also lists the percentage of dynamic delayed branches, which is 33% on average. This is an indication of the reduction in table pressure.

6. The Necessity of Speculative Update

In the previous sections we did not discuss when the branch predictor is updated. We assumed that all previously executed branches update the branch predictor before a new prediction is made. This is not true in an instruction pipeline, where branch instructions can be in-flight between the fetch stage, where prediction takes place, and the pipeline stage where branches are resolved. The result is that the outcomes of the last few branches have not yet updated the predictor. The effect is most severe in deeply pipelined machines. Another effect, occurring only in out-of-order superscalar processors, is that the order in which branches are resolved and update the global branch history register varies because of dynamic events such as cache misses. Both effects make branch correlation less effective. Several researchers identified this problem [5, 7, 17] and proposed to update the branch predictor speculatively in the fetch stage and to correct the branch predictor on a misprediction. This complicates the design of the branch predictor and may affect the cycle time.

The time between fetching and resolving branches is typically shorter for VLIWs than for superscalars.

This is because there are no pipeline stages needed, for example, for register renaming, and instructions are not delayed in buffers and reservation stations. Furthermore, the second effect, varying resolve order, does not apply to VLIWs. It therefore makes sense to ask whether speculative update is necessary in our case.

To answer this question we modelled an instruction pipeline in our simulator where the branch predictor is updated in the fourth pipeline stage. Branches in the second, third, and also the fourth pipeline stage are not used for prediction in the fetch pipeline stage. Table 4 lists the result of a measurement that compares speculative update against resolve-time update. It includes the average number of updates that are missing when predicting the instruction currently being fetched. For resolve-time update this number corresponds to the number of branch-containing instructions that are in-flight between the first and fourth stage. For speculative update it is obviously zero.

Table 4. Speculative update vs. resolve-time update, for 008.espresso, li, eqntott, compress, sc, gcc, go, m88ksim, ijpeg, perl, and the average. For each scheme, the first three columns (issue-slot, target, and total miss rate) correspond to the columns of table 2; the fourth column lists the number of updates that are missing on average due to pipeline effects.

The results show that the effect of speculative update is small (7.83% vs. 8.43%), so it could be omitted to reduce hardware complexity. The results also show that both the issue-slot prediction and the branch target prediction are less accurate when speculative update is not applied. The former can be explained by less effective exploitation of branch correlation: the most recent branches, which are the most correlated, are missing. The latter can be explained by rapidly following function return branches, where the top-of-stack pointer has not been updated before the next function return branch is fetched. It also occurs when a function call branch is rapidly followed by a function return branch, and the function return branch is fetched before the function call has updated the RAS by pushing the return address onto it.
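One way such resolve-time update can be modelled in a trace-driven simulator is to buffer predictor updates and apply them only after a fixed number of further branch-containing instructions have been fetched, mimicking the three stages between fetch and resolve. The sketch below is ours; the callback and all names are invented for illustration.

    #define RESOLVE_DELAY 3                    /* branch-containing instructions in flight */

    typedef struct { unsigned pc; unsigned char outcome; } pending_update_t;
    typedef void (*apply_fn)(unsigned pc, unsigned char outcome);

    static pending_update_t pending[RESOLVE_DELAY];
    static int n_pending;

    /* Call before predicting a newly fetched branch-containing instruction. */
    static void drain_due_update(apply_fn apply) {
        if (n_pending == RESOLVE_DELAY) {
            apply(pending[0].pc, pending[0].outcome);   /* update BHT/BTB/RAS/history now */
            for (int i = 1; i < RESOLVE_DELAY; i++)
                pending[i - 1] = pending[i];
            n_pending--;
        }
    }

    /* Call with the resolved outcome of the instruction that was just predicted. */
    static void queue_update(unsigned pc, unsigned char outcome) {
        pending[n_pending].pc = pc;
        pending[n_pending].outcome = outcome;
        n_pending++;
    }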
7. Conclusions

This paper described branch prediction for VLIWs. We have shown that a branch predictor specially designed for VLIWs gives better results than the application of a branch predictor that is typically found in superscalar processors. In the proposed branch predictor, both the direction (taken or not taken) and the issue-slot containing the taken branch are predicted. This information is used to perform the BTB lookup. The BTB associates target and type information with the combination of an instruction address and an issue-slot. We have also shown that this system performs better than multi-target BTBs, which have been proposed before. Furthermore, we identified the possibility of mixing predicted and delayed branches in an ISA, with the compiler selecting between those types. This gave clear performance improvements for benchmarks which suffer from a high misprediction rate. Finally, we measured whether a speculative update of the branch predictor in the fetch stage, with correction on a misprediction, is really necessary for our target VLIW, the TriMedia. Measurements show that the effects are small.

References

[1] P.-Y. Chang, M. Evers, and Y. N. Patt. Improving Branch Prediction Accuracy by Reducing Pattern History Table Interference. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (PACT '96), pages 48-57, Boston, Massachusetts, October 20-23, 1996.
[2] P.-Y. Chang, E. Hao, and Y. N. Patt. Target Prediction for Indirect Jumps. In Proceedings of the 24th International Symposium on Computer Architecture, Denver, Colorado, June 1997.
[3] S. Dutta and M. Franklin. Control Flow Prediction with Tree-Like Subgraphs for Superscalar Processors. In Proceedings of the 28th Annual International Workshop on Microprogramming, Ann Arbor, Michigan, Nov. 1995.
[4] A. N. Eden and T. Mudge. The YAGS Branch Predictor. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 69-77, Dallas, Texas, Nov. 1998.
[5] E. Hao, P.-Y. Chang, and Y. N. Patt. The Effect of Speculatively Updating Branch History on Branch Prediction Accuracy, Revisited. In Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, November 30-December 2, 1994.
[6] J. Hoogerbrugge and L. Augusteijn. Instruction Scheduling for TriMedia. Journal of Instruction-Level Parallelism, 1(1), Feb. 1999.
[7] S. Jourdan, J. Stark, T.-H. Hsing, and Y. N. Patt. Recovery Requirements of Branch Prediction and Storage Structures in the Presence of Mispredicted-Path Execution. International Journal of Parallel Programming, 25(5), Oct. 1997.
[8] D. R. Kaeli and P. G. Emma. Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 34-42, Toronto, Ontario, May 27-30, 1991.
[9] C.-C. Lee, I.-C. K. Chen, and T. Mudge. The Bi-Mode Branch Predictor. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 4-13, Research Triangle Park, North Carolina, Nov. 1997.
[10] C.-C. Lee, I.-C. K. Chen, and T. Mudge. Design and Performance Evaluation of Global History Dynamic Branch Predictors. In Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics (SCI '98), Orlando, Florida, July 1998.
[11] J. K. F. Lee and A. J. Smith. Branch Prediction Strategies and Branch Target Buffer Design. IEEE Computer, 17(1):6-22, Jan. 1984.
[12] S. McFarling. Combining Branch Predictors. Technical Report TN-36, Western Research Laboratory, Palo Alto, California, June 1993.
[13] K. N. P. Menezes, S. A. Sathaye, and T. M. Conte. Path Prediction for High Issue-Rate Processors. In Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques (PACT '97), San Francisco, California, Nov. 1997.
[14] P. Michaud, A. Seznec, and R. Uhlig. Trading Conflict and Capacity Aliasing in Conditional Branch Predictors. In Proceedings of the 24th International Symposium on Computer Architecture, Denver, Colorado, June 1997.
[15] S.-T. Pan, K. So, and J. T. Rahmeh. Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 76-84, Boston, Massachusetts, October 12-15, 1992.
[16] D. N. Pnevmatikatos, M. Franklin, and G. S. Sohi. Control Flow Prediction for Dynamic ILP Processors. In Proceedings of the 26th Annual International Workshop on Microprogramming, Austin, Texas, Dec. 1993.
[17] K. Skadron, M. Martonosi, and D. W. Clark. Speculative Updates of Local and Global Branch History: A Quantitative Analysis. Journal of Instruction-Level Parallelism, 2, Jan. 2000.
[18] J. E. Smith. A Study of Branch Prediction Strategies. In Proceedings of the 8th Annual International Symposium on Computer Architecture, May 1981.
[19] E. Sprangle, R. S. Chappell, M. Alsup, and Y. N. Patt. The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference. In Proceedings of the 24th International Symposium on Computer Architecture, Denver, Colorado, June 1997.
[20] T.-Y. Yeh and Y. N. Patt. Two-Level Adaptive Training Branch Prediction. In Proceedings of the 24th Annual International Symposium on Microarchitecture, pages 51-61, Albuquerque, New Mexico, November 18-20, 1991.
[21] T.-Y. Yeh, M. Poplingher, W. Chen, and H. Mulder. Dynamic Branch Prediction for Branch Instructions with Multiple Targets. US Patent 5,903,750, May 1999 (filed November 20).


More information

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections ) Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections 2.2-2.6) 1 Predication A branch within a loop can be problematic to schedule Control

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Spring 2009 Prof. Hyesoon Kim

Spring 2009 Prof. Hyesoon Kim Spring 2009 Prof. Hyesoon Kim Branches are very frequent Approx. 20% of all instructions Can not wait until we know where it goes Long pipelines Branch outcome known after B cycles No scheduling past the

More information

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause

More information

EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE

EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE TEAM EKA Shaizeen Aga, Aasheesh Kolli, Rakesh Nambiar, Shruti Padmanabha, Maheshwarr Sathiamoorthy Department of Computer Science and Engineering University

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

A 64-Kbytes ITTAGE indirect branch predictor

A 64-Kbytes ITTAGE indirect branch predictor A 64-Kbytes ITTAGE indirect branch André Seznec To cite this version: André Seznec. A 64-Kbytes ITTAGE indirect branch. JWAC-2: Championship Branch Prediction, Jun 2011, San Jose, United States. 2011,.

More information

COSC 6385 Computer Architecture Dynamic Branch Prediction

COSC 6385 Computer Architecture Dynamic Branch Prediction COSC 6385 Computer Architecture Dynamic Branch Prediction Edgar Gabriel Spring 208 Pipelining Pipelining allows for overlapping the execution of instructions Limitations on the (pipelined) execution of

More information

ECE 4750 Computer Architecture, Fall 2017 T13 Advanced Processors: Branch Prediction

ECE 4750 Computer Architecture, Fall 2017 T13 Advanced Processors: Branch Prediction ECE 4750 Computer Architecture, Fall 2017 T13 Advanced Processors: Branch Prediction School of Electrical and Computer Engineering Cornell University revision: 2017-11-20-08-48 1 Branch Prediction Overview

More information

Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures

Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures Eric Hao, Po-Yung Chang, Marks Evers, and Yale N. Patt Advanced Computer Architecture Laboratory Department of Electrical

More information

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office

More information

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation Digital Systems Architecture EECE 343-01 EECE 292-02 Predication, Prediction, and Speculation Dr. William H. Robinson February 25, 2004 http://eecs.vanderbilt.edu/courses/eece343/ Topics Aha, now I see,

More information

Page 1 ILP. ILP Basics & Branch Prediction. Smarter Schedule. Basic Block Problems. Parallelism independent enough

Page 1 ILP. ILP Basics & Branch Prediction. Smarter Schedule. Basic Block Problems. Parallelism independent enough ILP ILP Basics & Branch Prediction Today s topics: Compiler hazard mitigation loop unrolling SW pipelining Branch Prediction Parallelism independent enough e.g. avoid s» control correctly predict decision

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction CS252 Graduate Computer Architecture Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue March 23, 01 Prof. David A. Patterson Computer Science 252 Spring 01 Review Tomasulo Reservations

More information

PowerPC 620 Case Study

PowerPC 620 Case Study Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance

More information

EECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018

EECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018 EECS 470 Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen,

More information

Control Flow Speculation in Multiscalar Processors

Control Flow Speculation in Multiscalar Processors Control Flow Speculation in Multiscalar Processors Quinn Jacobson Electrical & Computer Engineering Department University of Wisconsin qjacobso@ece.wisc.edu Steve Bennett 1 Measurement, Architecture and

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

EECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.)

EECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.) EECS 470 Lecture 6 Branches: Address prediction and recovery (And interrupt recovery too.) Announcements: P3 posted, due a week from Sunday HW2 due Monday Reading Book: 3.1, 3.3-3.6, 3.8 Combining Branch

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ECE 571 Advanced Microprocessor-Based Design Lecture 9 ECE 571 Advanced Microprocessor-Based Design Lecture 9 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 30 September 2014 Announcements Next homework coming soon 1 Bulldozer Paper

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 8

ECE 571 Advanced Microprocessor-Based Design Lecture 8 ECE 571 Advanced Microprocessor-Based Design Lecture 8 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 16 February 2017 Announcements HW4 Due HW5 will be posted 1 HW#3 Review Energy

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

CS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction

CS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction Outline CMSC Computer Systems Architecture Lecture 9 Instruction Level Parallelism (Static & Dynamic Branch ion) ILP Compiler techniques to increase ILP Loop Unrolling Static Branch ion Dynamic Branch

More information

On Pipelining Dynamic Instruction Scheduling Logic

On Pipelining Dynamic Instruction Scheduling Logic On Pipelining Dynamic Instruction Scheduling Logic Jared Stark y Mary D. Brown z Yale N. Patt z Microprocessor Research Labs y Intel Corporation jared.w.stark@intel.com Dept. of Electrical and Computer

More information

COSC3330 Computer Architecture Lecture 14. Branch Prediction

COSC3330 Computer Architecture Lecture 14. Branch Prediction COSC3330 Computer Architecture Lecture 14. Branch Prediction Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston opic Out-of-Order Execution Branch Prediction Superscalar

More information