Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps

José A. Joao*    Onur Mutlu†    Hyesoon Kim‡    Rishi Agarwal§    Yale N. Patt*
*ECE Department, The University of Texas at Austin    †Microsoft Research    ‡College of Computing, Georgia Institute of Technology    §IIT Kanpur

Abstract

Indirect jump instructions are used to implement increasingly common programming constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. The performance impact of indirect jumps is likely to increase because indirect jumps with multiple targets are difficult to predict even with specialized hardware. This paper proposes a new way of handling hard-to-predict indirect jumps: dynamically predicating them. The compiler (static or dynamic) identifies indirect jumps that are suitable for predication along with their control-flow merge (CFM) points. The hardware predicates the instructions between different targets of the jump and its CFM point if the jump turns out to be hard to predict at run time. If the jump would actually have been mispredicted, its dynamic predication eliminates a pipeline flush, thereby improving performance. Our evaluations show that Dynamic Indirect jump Predication (DIP) improves the performance of a set of object-oriented applications, including the Java DaCapo benchmark suite, by 37.8% compared to a commonly-used branch target buffer (BTB) based predictor, while also reducing energy consumption by 24.8%. We compare DIP to three previously proposed indirect jump predictors and find that it provides the best performance and energy-efficiency.

Categories and Subject Descriptors: C.1.0 [Processor Architectures]: General; C.5.3 [Computer System Implementation]: Microcomputers, Microprocessors; D.3.4 [Programming Languages]: Processors, Compilers

General Terms: Design, Performance

Keywords: Dynamic predication, indirect jumps, virtual functions, object-oriented languages, predicated execution.
1. Introduction

Indirect jumps are becoming more common as an increasing number of programs are written in object-oriented languages such as Java, C#, and C++. To support polymorphism [8], these languages include virtual function calls that are implemented using indirect jump instructions in the instruction set architecture (ISA). Previous research has shown that modern object-oriented languages result in significantly more indirect jumps than traditional languages [7]. In addition to virtual function calls, indirect jumps are commonly used in the implementation of programming language constructs such as switch-case statements, jump tables, and interface calls [2]. Unfortunately, current pipelined processors are not good at predicting the target address of an indirect jump if multiple different targets are exercised at runtime. Such hard-to-predict indirect jumps not only limit processor performance and cause wasted energy consumption but also contribute significantly to the performance difference between traditional and object-oriented languages [44]. The goal of this paper is to develop new architectural support to improve the performance of programming language constructs implemented using indirect jumps.

Figure 1 demonstrates the problem of indirect jumps in object-oriented Java (DaCapo [5]) and C++ applications. The figure shows the indirect and conditional jump mispredictions per 1000 retired instructions (MPKI) on a state-of-the-art Intel Core2 Duo 6600 [22] processor.

[Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASPLOS'08, March 1-5, 2008, Seattle, Washington, USA. Copyright © 2008 ACM ... $5.00]
The data is collected with hardware performance counters using VTune [23]. Note that the Intel Core2 Duo processor includes a specialized indirect jump predictor [16]. Despite this specialized hardware, 41% of all jump mispredictions in the examined applications are due to indirect jumps. Hence, hard-to-predict indirect jumps cause a large fraction of all mispredictions in object-oriented Java and C++ applications, and more sophisticated architectural support than target prediction is needed to reduce their negative impact on the performance of object-oriented applications.

Figure 1. Indirect and conditional jump mispredictions (MPKI) in object-oriented Java and C++ applications run using the Windows Vista operating system on an Intel Core2 Duo 6600.

Basic Idea: We propose a new way of handling hard-to-predict indirect jumps: dynamically predicating them. By dynamically predicating an indirect jump, the processor increases the probability that the correct target path of the jump is fetched. Our technique stems from the observation that program control-flow paths starting from different targets of some indirect jump instructions usually merge at some point in the program, which we call the control-flow merge (CFM) point. The static or dynamic compiler¹ identifies such indirect jump instructions along with their CFM points and conveys them to the hardware through modifications in the instruction set architecture. When the hardware fetches such a jump, it estimates whether or not the jump is hard to predict using a confidence estimator [25]. If the jump is hard to predict, the processor predicates the instructions between N targets of the indirect jump and the CFM point. We evaluate performance/complexity for different N, and find N=2 is the best trade-off.
When the processor reaches the CFM point on all N target paths, it inserts select-µops to reconcile the data values produced on each path and continues execution on the control-independent path.

[Footnote 1] In the rest of the paper, we use the term compiler to refer to either a static or dynamic compiler. Our scheme can be used in conjunction with both types of compilers.

When the indirect jump is resolved, the processor

stops dynamic predication and turns the instructions that correspond to the incorrect target address(es) into NOPs, as their predicate values are false. The instructions, if any, that correspond to the correct target address commit their results. As such, if the jump would actually have been mispredicted, its dynamic predication eliminates a full pipeline flush, thereby improving performance. Our experimental evaluation shows that Dynamic Indirect jump Predication (DIP) improves the performance of a set of indirect-jump-intensive object-oriented Java and C++ applications by 37.8% over a commonly-used branch target buffer (BTB) based indirect jump predictor, which is employed by most current processors. We compare DIP to three previously proposed indirect jump predictors [9, 13, 3] and find that it provides significantly better performance than all of them. Our results also show that DIP provides the largest improvements in energy-efficiency and energy-delay product. We analyze the hardware cost and complexity of DIP and show that if dynamic predication is already implemented to reduce the misprediction penalty due to conditional branches [31], DIP requires little extra hardware. Hence, DIP can be a promising, energy-efficient way to reduce the performance penalty of indirect jumps without requiring large specialized hardware structures for predicting indirect jumps.

Contributions. We make the following contributions:

1. We provide a new architectural approach to support indirect jumps, an important performance limiter in object-oriented applications. To our knowledge, DIP is the first mechanism that enables the predication of indirect jumps.

2. We extensively evaluate DIP in comparison to several previously proposed indirect jump prediction schemes and show that DIP provides the highest performance and energy improvements in modern object-oriented applications written in Java and C++.
Even when used in conjunction with sophisticated predictors, DIP significantly improves performance and energy-efficiency.

3. We show that DIP can be implemented with little extra hardware if dynamic predication is already implemented to reduce the misprediction penalty due to conditional branches. Hence, we propose using dynamic predication as a general framework for reducing the performance penalty due to unpredictability in program control-flow (be it due to conditional branches or indirect jumps).

2. Background on Dynamic Predication of Conditional Branches

Compiler-based predication [1] has traditionally been used to eliminate conditional branches (and hence conditional branch mispredictions) by converting control dependencies into data dependencies, but it is not used for indirect jumps. Dynamic predication was first proposed to eliminate the misprediction penalty of simple hammock branches [34] and was later extended to handle a large set of complex control-flow graphs [31]. Dynamic predication has advantages over static predication because (1) it does not require significant changes to the instruction set architecture, such as predicated instructions and architectural predicate registers, (2) it can adapt to dynamic changes in branch behavior, and (3) it is applicable to a much wider range of control-flow graphs and therefore provides higher performance [31]. Unfortunately, none of these previous static or dynamic predication approaches is applicable to indirect jumps. We first briefly review the previous dynamic predication mechanisms proposed for conditional branches [34, 31] to provide sufficient background and the terminology used in this paper. Figure 2 shows the control-flow graph (CFG) of a conditional branch and the dynamically predicated instructions. The candidate branches for dynamic predication are identified at runtime or marked by the compiler.
When the processor fetches a candidate branch, it estimates whether or not the branch is hard to predict using a branch confidence estimator [25]. If the branch prediction has low confidence, the processor generates a predicate using the branch condition and enters dynamic predication mode (dpred-mode). In this mode, the processor fetches both paths after the candidate branch and dynamically predicates the instructions with the corresponding predicate id. On each path, the processor follows the outcomes of the branch predictor. When the processor reaches a control-flow merge (CFM) point on both paths, it inserts c-moves [29] or select-µops [43], similar to the φ-functions in the static single-assignment (SSA) form [1], to reconcile the register data values produced on either side of the branch, and continues fetching from a single path.

Figure 2. Dynamic predication of a conditional branch: (a) source code, (b) CFG, (c) assembly code, (d) dynamically predicated instructions after register renaming (pr: physical register).

(a) source code:
    if (cond1) b = 0; else b = 1;
    c = b + a;

(b) CFG: the taken path (T) goes to block C, the not-taken path (NT) falls through; both paths merge at block D.

(c) assembly code:
        r0 = (cond1)
        branch r0, C
        mov r1 <- #1
        jump D
    C:  mov r1 <- #0
    D:  add r2 <- r1, r3

(d) dynamically predicated instructions:
        pr1 = (cond1)
        branch pr1, C       (p1 = pr1)
        mov pr21 <- #1      (!p1)
        mov pr31 <- #0      (p1)
        select-µop: pr41 = p1 ? pr31 : pr21
        add pr12 <- pr41, pr13

The processor exits dpred-mode either when it reaches a CFM point on both paths of the branch or when the branch is resolved. When the branch is resolved, the predicate value is also resolved. Instructions on the wrong path (i.e., predicated-false instructions) become NOPs, and they do not update the architectural state. If the candidate branch is actually mispredicted, the processor does not need to flush its pipeline and is able to make useful progress on the correct path, which provides improved performance.
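The reconciliation step above can be illustrated with a minimal software model of Figure 2's hammock: both sides execute under complementary predicates, and a select µop picks the live value, like a φ-function in SSA form. This is an illustrative sketch of the semantics, not the hardware; the function names are our own.

```cpp
#include <cassert>

// Select µop: picks the value produced on the path whose predicate is true.
int select_uop(bool p1, int val_taken, int val_not_taken) {
    return p1 ? val_taken : val_not_taken;
}

// Models Figure 2a: if (cond1) b = 0; else b = 1; c = b + a;
int predicated_hammock(bool cond1, int a) {
    bool p1 = cond1;      // predicate generated from the branch condition
    int b_taken = 0;      // block C (taken path), executes under p1
    int b_not_taken = 1;  // fall-through path, executes under !p1
    int b = select_uop(p1, b_taken, b_not_taken);  // reconcile at CFM point D
    return b + a;         // control-independent code after the CFM point
}
```

Either way the branch resolves, the value flowing past the CFM point is correct without a pipeline flush.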
3. Dynamic Predication of Indirect Jumps (DIP)

Traditionally, only conditional branches can be predicated because predication assumes that there are exactly two possible next instructions after a branch. This assumption does not hold for indirect jumps. Figure 3a shows an example virtual function call in the C++ language that is implemented as an indirect call (s->area()). Depending on the actual runtime type of the object pointed to by s, the corresponding overridden version of the area function will be called. There can be many different derived classes that override the function, and thus many different targets of the call. Even though there could be many different targets, usually only a few of them are concurrently used in each phase of the program. If the calls for different targets are interleaved in a complex way, it is usually difficult to predict exactly the correct target of each instance of the call using existing indirect jump predictors. In contrast, we found that it is much easier to estimate the two (or three) most likely targets, i.e., a small set of targets that includes the correct target with high probability. In DIP, if an indirect jump is found to be difficult to predict, the processor estimates the most likely targets. Using dynamic predication, the processor fetches and executes from these most likely targets until the dynamically-predicated paths eventually merge at the instruction after the call, when the function returns (as shown in Figure 3b,c). If one of the predicated targets is correct, the processor avoids a pipeline flush. The performance benefit of dynamically predicating the indirect jump can increase significantly if the control-flow merging point is close to the indirect jump (i.e., if the body of the function is small), so that the overhead of fetching the extra path(s) is not high.
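To make the s->area() call concrete, here is a runnable sketch of such a call site. The class names Circle and Rectangle and their area() bodies are our own illustrative assumptions; only the shape of the call matters, since the compiled call site is an indirect call through the vtable whose target depends on the dynamic type of *s.

```cpp
#include <cassert>

struct Shape {
    virtual int area() const = 0;
    virtual ~Shape() {}
};

struct Circle : Shape {
    int r;
    explicit Circle(int r) : r(r) {}
    int area() const override { return 3 * r * r; }  // coarse integer "pi"
};

struct Rectangle : Shape {
    int w, h;
    Rectangle(int w, int h) : w(w), h(h) {}
    int area() const override { return w * h; }
};

// The indirect call that DIP would dynamically predicate when the
// override that will run is hard to predict.
int call_area(const Shape* s) { return s->area(); }
```

When calls through Circle and Rectangle objects are interleaved unpredictably, DIP would fetch both area() bodies under predicates instead of guessing a single target.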
Figure 3b,c illustrates conceptually the dynamic predication process for the indirect call in Figure 3a, assuming that circle->area() and rectangle->area() are the most likely targets for an instance of the s->area() call. Our approach is inspired by the dynamic predication of conditional branches. However, there are two fundamental differences between the dynamic predication of conditional branches and indirect jumps: 1. There are exactly two possible paths after a conditional branch. In contrast, the number of possible paths after an indirect jump depends on the number of possible targets, which can be very large. For example, an indirect call in the Java DaCapo benchmark exercises 11 dynamic targets. Predicating a larger number of target

Figure 3. Dynamic predication of an indirect call: (a) source code, (b) CFG, (c) predicated code.

(a) source code:
    Shape *s = ...;
    a = s->area();
    ...

(b) CFG: the paths through dynamic target 1 (circle::area()) and dynamic target 2 (rectangle::area()) merge at the instruction after the call.

(c) predicated code: circle::area() and rectangle::area() are both fetched under their respective predicates, and execution merges when the call returns.

paths increases the likelihood that the correct path will be in the pipeline when the jump is resolved, but it also requires more complex hardware and increases the amount of wasted work due to predication, since at most one path is correct. Therefore, one important question is how to identify how many and which targets of a jump should be predicated. 2. The target of a conditional branch is always available at compile time. On the other hand, all targets of an indirect jump may not be available at compile time due to techniques like dynamic linking and dynamic class loading. Hence, a static compiler might not be able to convey to the hardware which targets of an indirect jump can profit from dynamic predication. Another important question, therefore, is who (the static or dynamic compiler, or the hardware) should determine the targets that should be dynamically predicated. We explore both options: the compiler can determine the targets to be predicated via profiling, or the hardware can determine them at runtime. Note that the latter option can adapt to runtime changes in the frequently executed targets of an indirect jump at the expense of higher hardware cost. In this paper we explore answers to these questions and propose an effective and cost-efficient implementation of DIP.

4. Why does DIP work?

We first examine code examples from Java applications to provide insight into why DIP can improve performance.

4.1 Virtual Function Call Example

Figure 4 shows a virtual function call in an output-independent print formatter Java application included in the DaCapo suite. The function computeValue is originally defined in the class Length, and is overridden in the derived classes LinearCombinationLength, MixedLength, and PercentLength.
This polymorphic function is called from a single call site: 32% of the time by objects of class Length, 34% of the time by objects of class LinearCombinationLength, and 34% of the time by objects of class PercentLength. The benchmark goes through two program phases. Only the first target is used at the beginning of the program, so the call is easy to predict. In the second phase, the targets from LinearCombinationLength and PercentLength are interleaved in a difficult-to-predict way. Dynamically predicating these two targets when the indirect call becomes hard to predict can eliminate most target mispredictions at the cost of executing useless instructions on one path. Since the bodies of the functions are small, the number of wasted instructions with dynamic predication is smaller than the number of wasted instructions on a pipeline flush due to a misprediction.

4.2 Switch-Case Statement Example

Figure 5 shows a switch statement in the function jjStopStringLiteralDfa of the class JavaParserTokenManager from the DaCapo benchmark. This class parses input tokens by implementing a deterministic finite automaton. Even though the switch statement has 11 cases, cases 0, 1, and 2 are executed for 59%, 25%, and 12% of the dynamic instances, respectively. The other 8 cases account for only 4% of the dynamic instances. The control flow reconverges after the switch statement.

 1: public int mvalue() {  // in Length class
 2:   if (!bIsComputed)
 3:     computeValue();    // call site
 4:   return millipoints;
 5: }
 6:
 7: protected void computeValue() {
 8:   // in LinearCombinationLength class, short computation...
 9:   setComputedValue(result);
10: }
11:
12: protected void computeValue() {  // in MixedLength class
13:   // short computation...
14:   setComputedValue(computedValue, bAllComputed);
15: }
16:
17: protected void computeValue() {  // in PercentLength class
18:   setComputedValue((int)(factor *
19:       (double)lbase.getBaseLength()));
20: }

Figure 4. A suitable indirect jump example from the DaCapo suite.
Dynamically predicating the first three target paths when the indirect jump is seen would eliminate almost all mispredictions at the cost of executing useless instructions. Note, however, that the number of instructions in each target path is relatively small (fewer than 30), so the amount of wasted work would be small compared to the amount of wasted work on a full pipeline/window flush due to a misprediction.

 1: switch (pos) {  // indirect jump
 2:   case 0:       // target 1
 3:     if ((active1 & 0x44L) != 0L)
 4:       r = 4;
 5:     else if (...)
 6:       r = 28;
 7:     else
 8:       r = -1;
 9:     break;
10:   case 1:       // target 2
11:     // code similar to case 0 (setting r on every path)
12:   case 2:       // target 3
13:     // code similar to case 0 (setting r on every path)
14:   // ... 8 other seldom-executed cases
15: }

Figure 5. A suitable indirect jump example from the DaCapo suite.

5. Mechanism and Implementation

There are two critical issues in implementing DIP: (1) determining which indirect jumps are candidates for dynamic predication, and (2) determining which targets of a candidate indirect jump should be predicated. This section first explains how our mechanism addresses these issues. Then, we describe the required hardware support, analyze its complexity, and explain the support required from the ISA.

5.1 Indirect Jump and CFM Point Selection

The compiler selects indirect jump candidates for dynamic predication using control-flow analysis and profiling. Control-flow analysis finds the CFM point for each indirect jump. The CFM point for an indirect call is the instruction after the call. The CFM point for an indirect jump implementing a switch statement is usually the instruction after the statement. The compiler profiles the application to characterize the indirect jumps. Highly mispredicted indirect jumps are good candidates for DIP even if there is no CFM point common to all the targets, or if the CFM point is so far from the jump that it is not reached until the indirect jump is resolved.
In this case, DIP could still provide performance benefit because it executes two possible paths after the jump, one of which might be the correct path. In other words, the benefit of DIP is similar to that of dual-path execution [17, 15] if a CFM point is not reached. For the experiments in this paper, the compiler selects all indirect jumps that result in at least 0.1% of all jump mispredictions in the profiling run on the baseline processor.² An indirect jump selected for dynamic predication is marked in the binary along with its CFM point. We call such a jump a DIP-jump.

[Footnote 2] We have experimented with several other profiling and selection heuristics based on compile-time cost-benefit analyses (including the ones described in [33, 27]), but we found that our simple selection heuristic provides the best performance.
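The 0.1%-of-mispredictions selection heuristic can be sketched as follows. The JumpProfile struct, the function name, and the integer-percentage arithmetic are our own illustrative assumptions about what a profiling pass might produce.

```cpp
#include <cstdint>
#include <vector>

// Profiling record for one static indirect jump (assumed layout).
struct JumpProfile {
    uint64_t pc;              // address of the indirect jump
    uint64_t mispredictions;  // mispredictions attributed to it in profiling
};

// Mark a jump as a DIP-jump if it accounts for at least 0.1% of all jump
// mispredictions observed in the profiling run (Section 5.1 heuristic).
std::vector<uint64_t> select_dip_jumps(const std::vector<JumpProfile>& jumps) {
    uint64_t total = 0;
    for (const auto& j : jumps) total += j.mispredictions;

    std::vector<uint64_t> selected;
    for (const auto& j : jumps) {
        // j.mispredictions / total >= 0.1%, computed without floating point.
        if (total > 0 && j.mispredictions * 1000 >= total)
            selected.push_back(j.pc);
    }
    return selected;
}
```

The selected PCs (and their CFM points, found by control-flow analysis) would then be marked in the binary.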

5.1.1 Return CFM Points

In some switch statements, one or more cases might end with a return instruction. For an indirect jump implementing such a switch statement, the first instruction after the statement might not be the CFM point. If all predicated paths after an indirect jump implementing a switch statement reach a return instruction that ends a case, the CFM point is actually the instruction executed after the return instruction. Unfortunately, the address of this CFM point is not known at code generation time because it depends on the caller position. We introduce a special type of CFM point, called return-CFM, to handle this case. When a DIP-jump is marked as having a return-CFM point, the processor does not look for a particular address to end dpred-mode, but for the execution of a return instruction at the same call depth as the DIP-jump. The processor ends dynamic predication mode when all the predicated paths reach return instructions.

5.2 Target Selection

DIP provides performance benefit only if the correct target of a jump is one of the predicated targets. Therefore, the choice of which targets to predicate is an important decision when dynamically predicating an indirect jump, since only a few targets can be predicated. This choice can be made by the compiler or the hardware. We first describe how target selection can be done assuming two targets can be predicated; the selection of more than two targets, assuming the hardware can support predicating all of them, is described separately below.

Compiler-based Target Selection

Even though an indirect jump can have many dynamically-exercised targets, we would expect the most frequently exercised targets to account for a significant fraction of the total dynamic jump instances and mispredictions [3, 27].
This assumption suggests a simple mechanism for target selection: the compiler profiles the program with a representative input set, determines the most frequently executed targets for each DIP-jump, and annotates the executable binary with the target information. Even though this mechanism requires more ISA support to supply the targets with an indirect jump, it does not require extra hardware for target selection. However, our results show that dynamic target selection mechanisms that can adapt to runtime program behavior can be much more effective, at the cost of extra hardware (see Section 7.4).

Hardware-based (Dynamic) Target Selection

The correct target of an indirect jump depends on the runtime behavior of the application. Changes in the runtime input set, the phase behavior of the program, and the control-flow path leading to the indirect jump affect the correct target, which is actually the reason why some indirect jumps are hard to predict. As the compiler does not have access to such fine-grain dynamic information, it is difficult for the compiler to select a set of targets that includes the correct target when the jump is predicated. In contrast, hardware has access to dynamic program information and can adapt to rapid changes in the behavior of indirect jumps. We therefore develop a mechanism that selects targets based on runtime information collected in hardware. We use a hardware table called the Target Selection Table (TST) for dynamic target selection. The purpose of the TST is to track and provide the two most frequently executed targets for a given DIP-jump. A TST entry is associated with each DIP-jump. Conceptually, each entry in the TST contains M targets and M frequency counters. A frequency counter is associated with each target and keeps track of how many times the target was taken.
When a fetched DIP-jump is estimated to be hard to predict (low-confidence), the processor accesses the TST entry for that jump and selects the two most frequently executed target addresses (i.e., the two target addresses with the highest frequency counters) in the entry. The TST is structured as a 4-way set-associative cache with a least-recently-used (LRU) replacement policy. We evaluated different indexing functions for the TST: using the address (i.e., program counter) of the DIP-jump alone, or the address of the DIP-jump XORed with the 16-bit global branch history register (GHR). We found that the latter indexing function provides more accurate target selection because the correct target of a jump depends on the control-flow path leading to the jump. To reduce the storage requirements of the TST, we: (1) limit the number of targets to the maximum number of targets that can be predicated plus one; (2) implement the frequency counters as 2-bit saturating counters³; (3) limit the tag to 7 bits; (4) limit the size of the TST to 2K entries; and (5) store the targets associated with a DIP-jump in the BTB (in different BTB entries), instead of storing them in the TST itself. The last optimization turns the TST into a low-cost indirection mechanism that stores only frequency counters to retrieve the most frequently executed targets of a branch, which are stored in the BTB.

Operation of the TST: When a fetched DIP-jump is estimated to be hard to predict, the target selection mechanism starts an iterative process to retrieve the two most frequently used targets from the BTB, one target per cycle.⁴ Figure 6 shows the basic structure of the TST and the logic required to access the BTB based on the information obtained from the TST.⁵ Algorithm 1 describes the target selection process. In each iteration iter, the control logic finds the position of the next frequency counter in descending order. If there are 3 counters stored in the TST, position can take only the values 1, 2, or 3.
The value used to access the BTB to retrieve a target is the same value used to index the TST, XORed with a randomized constant hash_value, which is specific to each position.⁶ For example, if f3 and f1 are the highest frequency counters, the targets will be retrieved by accessing the BTB with (PC xor GHR xor hash_value[3]) and (PC xor GHR xor hash_value[1]) in consecutive cycles. The iterative selection process stops when it has the required number of targets to dynamically predicate the jump (PRED_TARGETS), or after trying to retrieve as many targets as can be stored for one TST entry (MAX_TARGETS). PRED_TARGETS is 2 and MAX_TARGETS is 3 for 2-target selection. If enough targets are selected, the processor enters dpred-mode.

Figure 6. Target Selection Table (TST) used for selecting 2 targets to predicate. f1, f2, f3 denote the frequency counters for the three targets whose information is kept in the TST; the entry is indexed with PC xor GHR, and the position and hash_value are used to look up the corresponding target in the BTB (BTB_hit indicates success).

Update of the TST: When a DIP-jump commits, it updates the TST regardless of whether or not it was dynamically predicated. The TST entry for the (PC, GHR) combination is accessed, and the corresponding targets are retrieved from the BTB, one per cycle, and compared to the correct target taken by the jump. If the correct target is already stored in the BTB, the corresponding frequency counter is incremented. Otherwise, the correct target is inserted into any empty slot (i.e., for an iteration that misses in the BTB) or replaces the target with the smallest frequency counter value. Note that the TST update is not on the critical path of execution and can take multiple cycles as necessary.

[Footnote 3] To dynamically select 3 to 5 targets, we use 3-bit saturating frequency counters.
[Footnote 4] The performance impact of the extra cycles spent to retrieve targets from the BTB is 2%, as we show in the evaluation.
[Footnote 5] Figure 6 shows only the conceptual structure of the TST. In our actual implementation, the BTB index used to retrieve a target in an iteration is precomputed in parallel with the TST access. Therefore, our proposal does not increase the critical path of BTB access.
[Footnote 6] Note that the values used to access the BTB to store the TST targets can conflict with real jump/branch addresses in the program, increasing aliasing and contention in the BTB. The evaluation examines the impact of our mechanism on performance for different BTB sizes.
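Putting the selection, update, and aging rules together, one TST entry can be modeled in software as follows. This is an illustrative sketch, not RTL: for clarity it stores the target addresses in the entry itself, whereas the actual design keeps them in BTB entries and stores only the frequency counters in the TST.

```cpp
#include <array>
#include <cstdint>
#include <utility>

struct TSTEntry {
    std::array<uint64_t, 3> target{};  // tracked targets; 0 = empty slot (assumption)
    std::array<uint8_t, 3> freq{};     // 2-bit saturating counters, max value 3

    // Commit-time update: bump the counter of the correct target, or insert
    // it by filling an empty slot / replacing the least-frequent target.
    void update(uint64_t correct_target) {
        for (int i = 0; i < 3; i++) {
            if (target[i] == correct_target) {
                if (freq[i] < 3) freq[i]++;
                age();
                return;
            }
        }
        int victim = 0;
        for (int i = 0; i < 3; i++) {
            if (target[i] == 0) { victim = i; break; }
            if (freq[i] < freq[victim]) victim = i;
        }
        target[victim] = correct_target;
        freq[victim] = 1;
        age();
    }

    // Aging: if two counters are saturated, right-shift all counters so the
    // two most frequent targets stay distinguishable (Section 5.2).
    void age() {
        int saturated = 0;
        for (uint8_t f : freq) if (f == 3) saturated++;
        if (saturated >= 2)
            for (auto& f : freq) f >>= 1;
    }

    // The two most frequently executed targets, for 2-target predication.
    std::pair<uint64_t, uint64_t> top2() const {
        int a = 0, b = 1;
        if (freq[b] > freq[a]) std::swap(a, b);
        if (freq[2] > freq[a]) { b = a; a = 2; }
        else if (freq[2] > freq[b]) { b = 2; }
        return {target[a], target[b]};
    }
};
```

A real entry would additionally carry the 7-bit tag and live in a 4-way set-associative structure indexed by PC xor GHR.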

Algorithm 1: TST target selection.

    Inputs: PC, GHR
    iter <- 1; num_targets <- 0
    while (iter <= MAX_TARGETS) and (num_targets < PRED_TARGETS) do
        position <- position_descending_order(iter)
        target <- access_BTB(PC xor GHR xor hash_value[position])
        if (BTB_hit) then
            next_target_to_predicate <- target
            num_targets <- num_targets + 1
        end if
        iter <- iter + 1
    end while

The purpose of a TST entry is to provide a list of targets approximately ordered by recent execution frequency. As the saturating frequency counters are updated, if more than two counters saturate at the maximum value, it becomes impossible to distinguish the two most frequent targets. To avoid this problem, we implement a simple aging mechanism: if two of the frequency counters are found to be saturated when a TST entry is updated, all counters in the entry are right-shifted by one bit. In addition to avoiding the saturation problem, this aging mechanism also demotes targets that have not been used recently, keeping the TST content up to date for the current program phase.

Selecting More Than Two Targets

Unlike conditional branches, indirect jumps can have more than two frequently executed targets. When the likelihood of having the correct target in a set of two targets is not high enough, it might be profitable to predicate multiple targets, even though the overhead of predication would be higher. If we allow predication of more than two targets, we have to select which targets and how many targets to use for each low-confidence indirect jump. The TST holds one frequency counter for each of the targets that have been most frequently used in the recent past. The aging mechanism keeps these counters representative of the current phase of the program. Therefore, it is reasonable to select the targets with the highest frequency counts. To select multiple targets, the processor uses a greedy algorithm.
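This greedy selection, described next and formalized in Equation (1), can be sketched in software as follows. The function and its integer arithmetic are our own reconstruction of the rule from the surrounding text, not the paper's implementation.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Greedy multi-target selection: always take the two most frequent targets,
// then keep adding the i-th most frequent target (1-indexed) only while
// Freq_i * i >= sum of the frequencies of the targets already selected.
int num_targets_to_predicate(std::vector<int> freq) {
    std::sort(freq.begin(), freq.end(), std::greater<int>());
    int selected = std::min<int>(2, static_cast<int>(freq.size()));
    long long sum = 0;
    for (int j = 0; j < selected; j++) sum += freq[j];

    for (std::size_t i = selected; i < freq.size(); i++) {
        long long lhs = static_cast<long long>(freq[i]) *
                        static_cast<long long>(i + 1);
        if (lhs < sum) break;  // this target no longer adds significantly
        sum += freq[i];
        selected++;
    }
    return selected;
}
```

With nearly uniform counts a third target is worth predicating; with a skewed distribution the selection stops at two.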
It starts with the two targets with the highest frequency. Then, it chooses the i-th target in descending frequency order only if its frequency still adds significantly to the sum of the frequencies of the targets already selected. This happens when the following expression is satisfied:

    Select Target_i  if  Freq_i × i >= Σ_{j=1}^{i-1} Freq_j    (1)

5.2.4 Overriding the BTB-based Target Prediction

The TST has more information than a conventional BTB-based indirect jump predictor for DIP-jumps, because: (1) the TST distinguishes between targets based on the different control-flow paths leading to a jump because it is indexed with PC and branch history, while a BTB-based prediction simply provides the last seen target for the jump; (2) each entry in the TST can hold multiple targets for each combination of PC and branch history (i.e., multiple targets per jump), while a BTB-based predictor can hold only one target per jump; (3) the TST contains entries for only the DIP-jumps selected by the compiler, which reduces contention, whereas a BTB contains one entry for every indirect jump and taken conditional branch. Our main purpose for designing the TST is to use it as a mechanism to select two or more targets for dynamic predication. However, we also found that if a TST entry contains only one target, or if the most frequent target in the entry is significantly more frequent (Footnote 7) than the other targets, dynamic predication provides less benefit than simply predicting the most frequent target as the target of the jump. Therefore, if one of these conditions holds when a DIP-jump is fetched, the processor, instead of entering dynamic predication mode, simply overrides the BTB-based prediction for the indirect jump and uses the single predominant target specified by the TST as the predicted target for the jump.

Footnote 7: We found that a difference of at least 2 units in the 2-bit frequency counters is significant.

5.2.5 Dynamic Target Selection vs. Target Prediction

Dynamic target selection using the TST is conceptually different from dynamic target prediction. The TST selects more than one target to predicate for an indirect jump. In contrast, an indirect jump predictor chooses a single target and uses that as the prediction for the fetched indirect jump. DIP increases the probability of having the correct target in the processor by selecting extra targets and dynamically predicating multiple paths. Nevertheless, the proposed dynamic target selection mechanism can be viewed as both a target selector and a target predictor, especially since we sometimes use it to override target predictions as described in the previous section. As such, we envision future indirect jump predictors designed to work directly with DIP, selecting either a single target for speculative execution or multiple targets for dynamic predication.

5.3 Hardware Support for Predicated Execution

Once the targets to be predicated are selected, the dynamic predication process in DIP is similar to that in dynamic predication of conditional branches, which was described briefly in Section 2 and in detail by Kim et al. [31]. Here we describe the additional support required for DIP. If two targets are predicated in DIP, the additional support required is only in 1) the generation of the predicate values, and 2) the handling of a possible pipeline flush when the predicate values are resolved. When a low-confidence DIP-jump is fetched, the processor enters dpred-mode. Figure 7 shows an example of indirect jump predication with two targets. First, the processor assigns a predicate id to each path to be predicated (i.e., each selected target). Unlike in conditional branch predication, in which a single predicate value (and its complement) is generated based on the branch direction, in DIP there are multiple predicate values, based on the addresses of the predicated targets.
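Returning briefly to target selection, the greedy rule of Equation (1) can be sketched as follows. The function name and the plain-list input are illustrative assumptions:

```python
def greedy_select(freqs_desc):
    """Greedy multi-target selection per Equation (1).
    freqs_desc: per-target frequency counts, sorted in descending order."""
    selected = list(freqs_desc[:2])          # always start with the top two
    for i in range(3, len(freqs_desc) + 1):  # i is the 1-based frequency rank
        f = freqs_desc[i - 1]
        # Equation (1): take target i only if Freq_i * i is at least the
        # sum of the frequencies of the targets already selected.
        if f * i >= sum(selected):
            selected.append(f)
        else:
            break                            # later targets are rarer still
    return selected

print(greedy_select([40, 30, 25, 5]))   # prints [40, 30, 25]
print(greedy_select([40, 30, 20, 5]))   # prints [40, 30]
```

The two example calls illustrate the cutoff: a third target with count 25 still contributes enough (25 × 3 >= 40 + 30), while one with count 20 does not.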
The predicate value for a path is generated by comparing the predicated target address to the correct target address. The processor inserts compare micro-operations (µops) to generate predicate values for each path, as shown in Figure 7b. Unlike in conditional branch predication, where one of the predicated paths is always correct, in DIP both of the predicated paths might be incorrect. As a result, the processor has to flush the whole pipeline when none of the predicated target addresses is the correct target. To this end, the processor generates a flush µop. The flush µop checks the predicate values and triggers a pipeline flush if none of the predicate values turns out to be TRUE (i.e., if the correct target was not predicated). If any of the predicates is TRUE, the flush µop functions as a NOP. In the example of Figure 7b, the processor inserts a flush µop to check whether or not any of the predicated targets (TARGET1 or TARGET2) is correct. All instructions fetched during dpred-mode carry a predicate id, just like in dynamic predication for a conditional branch. Since select-µops are executed only if either TARGET1 or TARGET2 is the correct target, the select-µops can be controlled by just one of the two predicates. Note that the implementation of the select-µops is the same as in dynamic predication for conditional branches. We refer the reader to [34, 31] for details on the generation and implementation of select-µops.

[Figure 7. An example of how the instruction stream is dynamically predicated: (a) control flow graph, (b) dynamically predicated instructions after register renaming.]

5.3.1 Supporting More Than Two Targets

As we found that the predication of more than two targets does not provide significant benefits (shown and explained in Section 7.4), we only very briefly touch on hardware support for it, solely for completeness. Each predicated path requires its own context: PC (program counter), GHR (global history register), RAS (return address stack), and RAT (register alias table). Since each path follows the outcomes of the branch predictor and does not fork more paths, i.e., the processor cannot be in dpred-mode for two or more nested indirect jumps at the same time, the complexity of predicating more than two targets is significantly less than the complexity of multi-path (i.e., eager) execution [37, 35]. The predication of more than two targets requires 1) storage of more frequency counters in the TST and additional combinational logic for target selection, 2) generation of more than two predicates using more than two compare instructions, 3) minor changes to the flush µop semantics to handle multiple paths, and 4) extension of the select-µop generation mechanism to handle the reconvergence of more than two paths.

5.3.2 Nested Indirect Jumps

If the processor fetches another low-confidence DIP-jump during dpred-mode, it has two options: it can follow the low-confidence predicted target, or it can exit dpred-mode for the earlier jump and reenter dpred-mode for the later jump. If the jumps are nested, the overhead of predicating the later DIP-jump is usually smaller than the overhead of predicating the earlier jump. Also, if the processor decides to continue in dpred-mode for the earlier jump and the later jump is mispredicted, a potentially significant part of the benefit of predication can be lost when the pipeline is flushed.
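As an aside, the predicate-generation and flush-µop behavior of Section 5.3 can be modeled in a few lines of Python. This is a behavioral sketch only; the function names and address values are illustrative assumptions, not the µop implementation:

```python
def gen_predicates(correct_target, predicated_targets):
    """One compare µop per predicated path:
    p_i = (address of predicated target i == correct target address)."""
    return [t == correct_target for t in predicated_targets]

def flush_needed(predicates):
    """Flush µop semantics: trigger a full pipeline flush iff no predicate
    is TRUE (p1 NOR p2 ...); if any predicate is TRUE, it acts as a NOP."""
    return not any(predicates)

# TARGET2 turns out to be the correct target: no flush is needed.
print(flush_needed(gen_predicates(0x500, [0x400, 0x500])))   # prints False
# Neither predicated target is correct: the pipeline must be flushed.
print(flush_needed(gen_predicates(0x700, [0x400, 0x500])))   # prints True
```

The second call models the case unique to DIP: unlike conditional-branch predication, both predicated paths can be wrong, so the flush check is required.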
Therefore, our policy (called the reentry policy) is to exit dpred-mode for the earlier jump and reenter dpred-mode for the later DIP-jump. Our experimental results show that this choice provides significantly higher performance benefits (see Section 7.3).

5.3.3 Other Implementation Issues

We briefly discuss other important issues in implementing DIP. Note that the same issues exist in architectures that implement static or dynamic predication for conditional branches [36, 34, 31].

Stores and Loads: A dynamically predicated store is not sent to the memory system unless its predicate is known to be TRUE. The basic rule for the forwarding logic is that a store can forward to any younger load, except for stores guarded by an unresolved predicate register, which can only forward to younger loads with the same predicate id.

Interrupts and Exceptions: No special support is needed to handle interrupts and exceptions because dynamic predication state is speculative and is flushed before servicing the interrupt or exception. Predicate registers do not have to be saved and restored because they are not part of the ISA. Instructions with FALSE predicate values do not cause exceptions.

5.4 Hardware Cost and Complexity

The hardware required to dynamically predicate indirect jumps is very similar to that of the diverge-merge processor (DMP) [31, 32], which dynamically predicates conditional branches. The hardware support needed for dynamic predication (including the predicate registers, fetch/decode/rename/retirement support, and select-µops) and its cost are already described in detail by previous work [31]. We assume DIP would be cost-efficiently implemented on a baseline processor that already supports dynamic predication for conditional branches, which was shown to provide very large performance and energy benefits [31, 32]. DIP requires the following hardware modifications in addition to the required support for dynamic predication:

1. Target Selection Table (TST, Section 5.2.2): a 2K-entry, 4-way set-associative table with three 2-bit saturating counters per entry, i.e., a 1.5KB data store and a 2.1KB tag store (using 7-bit tags and a valid bit per entry, plus 2 bits per set for pseudo-LRU replacement).
2. A simple finite state machine to implement accessing and updating the targets in the BTB (block labeled Control in Figure 6).
3. A table with 32-bit constants (hash_value in Figure 6).
4. Modified predicate generation logic and flush µops (Section 5.3).
5. Optionally, support for more than 2 targets (see Section 5.3.1).

Hardware Cost: If dynamic predication hardware is already implemented for conditional branches, the cost of adding dynamic predication of indirect jumps with dynamic 2-target selection (our most efficient design point) is 3.6KB of storage (Footnote 8) and simple extra logic. We believe that it is not cost-effective to implement dynamic predication only for indirect jumps. Rather, dynamic predication hardware is a substrate that should be used for both conditional and indirect jumps.

Footnote 8: Extra storage can be further reduced to 1.5KB with an alternative design that stores a frequency counter and a there-is-next-target bit directly in each BTB entry, thus eliminating the need for a separate TST. To keep the implementation conceptually simple, we do not describe this option.

5.5 ISA Support

The indirect jumps selected for dynamic predication are identified in the executable binary with a different opcode for each flavor of DIP-jump (jumps or calls). The instruction format uses one bit to indicate whether or not the jump has a return CFM point. The instruction format also includes the CFM point, encoded as a 16-bit two's complement offset relative to the DIP-jump, which we determined is enough to encode all the CFM points we found in our set of benchmarks. When we use static target selection, the selected 32-bit targets follow the instruction in the binary. Even though these special instructions increase the code size and the pressure on the instruction cache, their impact is not significant because the number of static jumps selected for dynamic predication in our benchmarks is small (as shown in Table 2).

6. Experimental Methodology

6.1 Simulation Methodology

We use an iDNA-based [3] cycle-accurate x86 simulator to evaluate dynamic indirect jump predication. Table 1 shows our baseline processor's parameters. The baseline uses a 4K-entry BTB to predict indirect jumps [4, 18]. The simulator includes a Wattch-based power model [6], using 100nm technology at 4GHz, that faithfully accounts for the power consumption of all the additional structures needed by DIP.

Table 1. Baseline processor configuration

Front End: 64KB, 2-way, 2-cycle I-cache; fetch ends at the first predicted-taken branch; fetch up to 3 conditional branches or 1 indirect branch
Branch Predictors: 64KB (64-bit history, 121-entry) perceptron branch predictor [26]; 4K-entry, 4-way BTB with pseudo-LRU replacement; 64-entry return address stack; min. branch mispred. penalty is 30 cycles
Execution Core: 8-wide fetch/issue/execute/retire; 512-entry ROB; 384 physical registers; 128-entry LD-ST queue; 4-cycle pipelined wake-up and selection logic; scheduling window partitioned into 8 sub-windows of 64 entries each
On-chip Caches: L1 D-cache: 64KB, 4-way, 2-cycle, 2 ld/st ports; L2 unified cache: 1MB, 8-way, 8 banks, 10-cycle latency; all caches: LRU replacement and 64B lines
Buses and Memory: 300-cycle minimum memory latency; 32 DRAM banks; 32B-wide core-to-memory bus at 4:1 frequency ratio
Prefetcher: stream prefetcher [42] (32 streams and 16-cache-line prefetch distance)
Dyn. pred. support: 2KB (12-bit history, threshold 14) enhanced JRS confidence estimator [25]; 32 predicate registers; 1 CFM register

We evaluate DIP using benchmarks over multiple platforms. Most of the experiments are run using the 11 DaCapo benchmarks [5]

(Java), Matlab R2006a (C), the M5 simulator [4] (C++), and two interpreters written in C from the SPEC CPU 2000/2006 suites. We also show results for a set of 5 SPEC CPU2000 INT benchmarks written in C, 3 SPEC CPU2006 INT benchmarks written in C, and 1 SPEC CPU2006 FP benchmark written in C++. We use those benchmarks in the SPEC 2000 INT and 2006 INT/C++ suites that gain at least 5% performance with a perfect indirect jump predictor. Each benchmark is run for 200 million x86 instructions with the reference input set (SPEC CPU), the small input set (DaCapo), or a custom input set (Matlab and M5) (Footnote 9). The DaCapo benchmarks are run with the Sun J2SE 1.4 JRE on Windows Vista (Footnote 10). Matlab is run on Windows Vista. M5 is compiled with its default options using gcc 3.4.4, and run on Cygwin on Windows Vista. All SPEC binaries are compiled with Intel's production compiler (ICC) [21] using -O3 optimizations and run on Linux Fedora Core 5. Table 2 shows the characteristics of the simulated portions of our main set of benchmarks on the baseline processor.

Footnote 9: Matlab performs convolution on two images; M5 simulates the performance of gcc using its complex out-of-order processor model.

Footnote 10: At the time of this writing, iDNA [3] worked for only 7 of the 11 DaCapo benchmarks running on Sun Java SE 6. The average indirect jump MPKI for these benchmarks on Java 6 is 37% higher than on Java 1.4, but since we cannot use the full suite, we report the results for Java 1.4. We expect DIP would perform better on Java 6.

6.2 Compilation and Profiling Methodology

Our methodology targets an adaptive JIT compiler that is able to use recent profiling information to recompile the hot methods to improve performance. For the experiments in this paper, we developed a dynamic profiling tool that collects edge profiling information and computes the CFM points during the 200M instructions preceding the simulation points for each application. The algorithms to find the CFM points are similar to those described in [33]. After profiling, our tool applies the jump selection algorithm described earlier (Footnote 11).

Footnote 11: For the static target selection experiments, our dynamic profiling tool also applies the static target selection algorithm described earlier.

7. Results

7.1 Dynamic Target Distribution

Table 2 shows the average number of dynamic targets for an indirect jump in the simulated benchmarks. We found that in our set of object-oriented benchmarks, 61% of all dynamic indirect jumps have 16 or more targets, which is significantly higher than what was reported in previous work for SPEC CPU C/C++ applications [3]. Even though indirect jumps have many targets, the most frequently executed targets (over the whole run of the program) are taken by a significant fraction of all dynamic indirect jumps, as we show in Figure 8. On average, the two most frequent targets cover 59.9% of the dynamic indirect jumps, but only 39.5% of the indirect jump mispredictions. The contribution of less frequently executed targets steadily drops. This data shows that statically selecting two targets for dynamic predication would likely not be very successful in these object-oriented applications, where indirect jumps have a very large number of targets.

[Figure 8. Fraction of dynamic indirect jumps taking the most frequently executed N targets. y-axis: Percent of Executed Indirect Jumps (%).]

7.2 Performance of DIP

The first set of bars in Figure 9 shows the performance improvement of DIP over the baseline, using dynamic 2-target selection with a 3.6KB TST and all the techniques described in Section 5. The average IPC improvement of DIP is 37.8%, and is analyzed in Section 7.3. We also include five idealized experiments in Figure 9 to show the potential of DIP. The IPC improvement increases to 51% if an unrealistically large TST is used (a 64K-entry TST with local storage for 255 targets and 32-bit frequency counters, which has a total data storage size of 128MB). If the 128MB TST always ideally provides the correct target for predication among the 2 selected targets (2T perfect target), performance improves by only 0.2% more. This means that the principles of the TST are adequate for selecting the correct target among the two that are predicated. If the DIP mechanism were used ideally only when the DIP-jump is actually mispredicted (2T perfect confidence), IPC improves by an additional 2%. The combination of perfect confidence estimation and perfect target selection (2T perfect targ./conf.) adds only an extra 0.5%, showing that the maximum potential performance benefit of 2-target DIP is 53.8%. Perfect indirect jump prediction (perfect IJP) provides 72.2% performance improvement over the baseline, which is significantly higher than the maximum potential of DIP, because it does not have the overhead of dynamic predication. Our realistic implementation achieves 52% of the potential with perfect indirect jump prediction and 70% of the potential with the ideal 2-target DIP (2T perfect targ./conf.). Xalan and do not get as much of the potential as the other benchmarks because the TST miss rate is significantly high (43% for both).

[Figure 9. DIP performance and potential. Bars: 2T-dyn. 3.6KB TST; 2T-dyn. 128MB TST; 2T perfect target; 2T perfect confidence; 2T perfect targ./conf.; perfect IJP.]

7.3 Analysis of the Performance Improvement

The performance improvement of DIP comes from avoiding the full pipeline flushes caused by indirect jump mispredictions. DIP can improve performance only if it selects the correct target as one of the targets to predicate. Therefore, most of our design effort for DIP is focused on mechanisms to improve target selection. On average, DIP eliminates 47% of the pipeline flushes that occur with a BTB-based predictor (as shown in Table 3, row 8).
Furthermore, the overhead of executing the extra path is low: the average number of dynamically predicated wrong-path instructions is only 73.9 (Table 3, row 7), which is significantly smaller than the instruction window size of the processor. Hence, in the steady state, dynamic predication of a mispredicted jump results in only 73.9 wasted instruction slots, whereas the misprediction itself would have wasted all instruction slots in the window plus those in the front-end pipeline stages. The benefit of DIP depends on the combination of target selection and confidence estimation. We classify dynamic predication instances into four cases, based on whether or not the correct target is predicated and whether or not the jump was actually mispredicted:

1. Useful: A dynamic predication instance is useful (i.e., it successfully avoids a pipeline flush) if it predicates the correct target and the
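The four-way classification that begins above can be summarized as a small decision function. Only the "useful" label comes from the text; the other three labels are assumed descriptive placeholders for the remaining combinations of the two stated conditions:

```python
def classify_dpred_instance(correct_target_predicated, jump_mispredicted):
    """2x2 classification of a dynamic predication instance, keyed on
    whether the correct target was among the predicated targets and
    whether the jump would actually have been mispredicted."""
    if correct_target_predicated and jump_mispredicted:
        return "useful"                # a pipeline flush is avoided
    if correct_target_predicated and not jump_mispredicted:
        return "prediction-was-right"  # overhead, but no flush to save (assumed label)
    if jump_mispredicted:
        return "flush-anyway"          # correct target missed; flush still occurs (assumed label)
    return "overhead-only"             # predication was neither right nor needed (assumed label)

print(classify_dpred_instance(True, True))   # prints useful
```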


More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Lecture 13: Branch Prediction

Lecture 13: Branch Prediction S 09 L13-1 18-447 Lecture 13: Branch Prediction James C. Hoe Dept of ECE, CMU March 4, 2009 Announcements: Spring break!! Spring break next week!! Project 2 due the week after spring break HW3 due Monday

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Superscalar Processors Ch 14

Superscalar Processors Ch 14 Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

CS / ECE 6810 Midterm Exam - Oct 21st 2008

CS / ECE 6810 Midterm Exam - Oct 21st 2008 Name and ID: CS / ECE 6810 Midterm Exam - Oct 21st 2008 Notes: This is an open notes and open book exam. If necessary, make reasonable assumptions and clearly state them. The only clarifications you may

More information

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency? Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 8

ECE 571 Advanced Microprocessor-Based Design Lecture 8 ECE 571 Advanced Microprocessor-Based Design Lecture 8 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 16 February 2017 Announcements HW4 Due HW5 will be posted 1 HW#3 Review Energy

More information

Pipelining, Branch Prediction, Trends

Pipelining, Branch Prediction, Trends Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ECE 571 Advanced Microprocessor-Based Design Lecture 9 ECE 571 Advanced Microprocessor-Based Design Lecture 9 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 30 September 2014 Announcements Next homework coming soon 1 Bulldozer Paper

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

NAME: Problem Points Score. 7 (bonus) 15. Total

NAME: Problem Points Score. 7 (bonus) 15. Total Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 NAME: Problem Points Score 1 40

More information

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency Superscalar Processors Ch 13 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction 1 New dependency for superscalar case? (8) Name dependency (nimiriippuvuus) two use the same

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using

More information

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 Reminder: Lab Assignments Lab Assignment 6 Implementing a more

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Lecture 7: Static ILP and branch prediction. Topics: static speculation and branch prediction (Appendix G, Section 2.3)

Lecture 7: Static ILP and branch prediction. Topics: static speculation and branch prediction (Appendix G, Section 2.3) Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3) 1 Support for Speculation In general, when we re-order instructions, register renaming

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering dministration CS 1/13 Introduction to Compilers and Translators ndrew Myers Cornell University P due in 1 week Optional reading: Muchnick 17 Lecture 30: Instruction scheduling 1 pril 00 1 Impact of instruction

More information

Design of Digital Circuits Lecture 18: Branch Prediction. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 18: Branch Prediction. Prof. Onur Mutlu ETH Zurich Spring May 2018 Design of Digital Circuits Lecture 18: Branch Prediction Prof. Onur Mutlu ETH Zurich Spring 2018 3 May 2018 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed

More information

Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013 1

Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013 1 Chapter 3 (CONT) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007,2013 1 Dynamic Hardware Branch Prediction Control hazards are sources of losses, especially for processors

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Static Branch Prediction

Static Branch Prediction Announcements EE382A Lecture 5: Branch Prediction Project proposal due on Mo 10/14 List the group members Describe the topic including why it is important and your thesis Describe the methodology you will

More information

An Efficient Indirect Branch Predictor

An Efficient Indirect Branch Predictor An Efficient Indirect ranch Predictor Yul Chu and M. R. Ito 2 Electrical and Computer Engineering Department, Mississippi State University, ox 957, Mississippi State, MS 39762, USA chu@ece.msstate.edu

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

COSC 6385 Computer Architecture Dynamic Branch Prediction

COSC 6385 Computer Architecture Dynamic Branch Prediction COSC 6385 Computer Architecture Dynamic Branch Prediction Edgar Gabriel Spring 208 Pipelining Pipelining allows for overlapping the execution of instructions Limitations on the (pipelined) execution of

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

4.1.3 [10] < 4.3>Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used?

4.1.3 [10] < 4.3>Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used? 2.10 [20] < 2.2, 2.5> For each LEGv8 instruction in Exercise 2.9 (copied below), show the value of the opcode (Op), source register (Rn), and target register (Rd or Rt) fields. For the I-type instructions,

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

EE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University

EE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University EE382A Lecture 5: Branch Prediction Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 5-1 Announcements Project proposal due on Mo 10/14 List the group

More information

Hardware Speculation Support

Hardware Speculation Support Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 08: Caches III Shuai Wang Department of Computer Science and Technology Nanjing University Improve Cache Performance Average memory access time (AMAT): AMAT =

More information

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard. COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Instruction Fetching based on Branch Predictor Confidence. Kai Da Zhao Younggyun Cho Han Lin 2014 Fall, CS752 Final Project

Instruction Fetching based on Branch Predictor Confidence. Kai Da Zhao Younggyun Cho Han Lin 2014 Fall, CS752 Final Project Instruction Fetching based on Branch Predictor Confidence Kai Da Zhao Younggyun Cho Han Lin 2014 Fall, CS752 Final Project Outline Motivation Project Goals Related Works Proposed Confidence Assignment

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

Fall 2011 Prof. Hyesoon Kim

Fall 2011 Prof. Hyesoon Kim Fall 2011 Prof. Hyesoon Kim 1 1.. 1 0 2bc 2bc BHR XOR index 2bc 0x809000 PC 2bc McFarling 93 Predictor size: 2^(history length)*2bit predict_func(pc, actual_dir) { index = pc xor BHR taken = 2bit_counters[index]

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2018 Handling Memory Operations Review pipeline for out of order, superscalar processors

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE

EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE TEAM EKA Shaizeen Aga, Aasheesh Kolli, Rakesh Nambiar, Shruti Padmanabha, Maheshwarr Sathiamoorthy Department of Computer Science and Engineering University

More information