Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps

José A. Joao*    Onur Mutlu†    Hyesoon Kim‡    Rishi Agarwal§    Yale N. Patt*
*ECE Department, The University of Texas at Austin    †Microsoft Research    ‡College of Computing, Georgia Institute of Technology    §IIT Kanpur

Abstract

Indirect jump instructions are used to implement increasingly common programming constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. The performance impact of indirect jumps is likely to increase because indirect jumps with multiple targets are difficult to predict even with specialized hardware. This paper proposes a new way of handling hard-to-predict indirect jumps: dynamically predicating them. The compiler (static or dynamic) identifies indirect jumps that are suitable for predication along with their control-flow merge (CFM) points. The hardware predicates the instructions between different targets of the jump and its CFM point if the jump turns out to be hard to predict at run time. If the jump would actually have been mispredicted, its dynamic predication eliminates a pipeline flush, thereby improving performance. Our evaluations show that Dynamic Indirect jump Predication (DIP) improves the performance of a set of object-oriented applications, including the Java DaCapo benchmark suite, by 37.8% compared to a commonly-used branch target buffer (BTB) based predictor, while also reducing energy consumption by 24.8%. We compare DIP to three previously proposed indirect jump predictors and find that it provides the best performance and energy-efficiency.

Categories and Subject Descriptors: C.1.0 [Processor Architectures]: General; C.5.3 [Computer System Implementation]: Microcomputers, Microprocessors; D.3.4 [Programming Languages]: Processors, Compilers

General Terms: Design, Performance

Keywords: Dynamic predication, indirect jumps, virtual functions, object-oriented languages, predicated execution.
1. Introduction

Indirect jumps are becoming more common as an increasing number of programs are written in object-oriented languages such as Java, C#, and C++. To support polymorphism [8], these languages include virtual function calls that are implemented using indirect jump instructions in the instruction set architecture (ISA). Previous research has shown that modern object-oriented languages result in significantly more indirect jumps than traditional languages [7]. In addition to virtual function calls, indirect jumps are commonly used in the implementation of programming language constructs such as switch-case statements, jump tables, and interface calls [2]. Unfortunately, current pipelined processors are not good at predicting the target address of an indirect jump if multiple different targets are exercised at runtime. Such hard-to-predict indirect jumps not only limit processor performance and cause wasted energy consumption but also contribute significantly to the performance difference between traditional and object-oriented languages [44]. The goal of this paper is to develop new architectural support to improve the performance of programming language constructs implemented using indirect jumps.

Figure 1 demonstrates the problem of indirect jumps in object-oriented Java (DaCapo [5]) and C++ applications. The figure shows the indirect and conditional jump mispredictions per 1000 retired instructions (MPKI) on a state-of-the-art Intel Core2 Duo 6600 [22] processor.

[Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASPLOS'08, March 1-5, 2008, Seattle, Washington, USA. Copyright © 2008 ACM ... $5.00]
The data is collected with hardware performance counters using VTune [23]. Note that the Intel Core2 Duo processor includes a specialized indirect jump predictor [16]. Despite this specialized hardware, 41% of all jump mispredictions in the examined applications are due to indirect jumps. Hence, hard-to-predict indirect jumps cause a large fraction of all mispredictions in object-oriented Java and C++ applications, and more sophisticated architectural support than target prediction is needed to reduce their negative impact on the performance of object-oriented applications.

Figure 1. Indirect and conditional jump mispredictions (MPKI) in object-oriented Java and C++ applications run using the Windows Vista operating system on an Intel Core2 Duo 6600.

Basic Idea: We propose a new way of handling hard-to-predict indirect jumps: dynamically predicating them. By dynamically predicating an indirect jump, the processor increases the probability that the correct target path of the jump is fetched. Our technique stems from the observation that program control-flow paths starting from different targets of some indirect jump instructions usually merge at some point in the program, which we call the control-flow merge (CFM) point. The static or dynamic compiler¹ identifies such indirect jump instructions along with their CFM points and conveys them to the hardware through modifications in the instruction set architecture. When the hardware fetches such a jump, it estimates whether or not the jump is hard to predict using a confidence estimator [25]. If the jump is hard to predict, the processor predicates the instructions between N targets of the indirect jump and the CFM point. We evaluate performance/complexity for different N, and find N=2 is the best trade-off.
When the processor reaches the CFM point on all N target paths, it inserts select-µops to reconcile the data values produced on each path and continues execution on the control-independent path.

[Footnote 1] In the rest of the paper, we use the term compiler to refer to either a static or dynamic compiler. Our scheme can be used in conjunction with both types of compilers.

When the indirect jump is resolved, the processor

stops dynamic predication and turns the instructions that correspond to the incorrect target address(es) into NOPs, as their predicate values are false. The instructions, if any, that correspond to the correct target address commit their results. As such, if the jump would actually have been mispredicted, its dynamic predication eliminates a full pipeline flush, thereby improving performance. Our experimental evaluation shows that Dynamic Indirect jump Predication (DIP) improves the performance of a set of indirect-jump-intensive object-oriented Java and C++ applications by 37.8% over a commonly-used branch target buffer (BTB) based indirect jump predictor, which is employed by most current processors. We compare DIP to three previously proposed indirect jump predictors [9, 13, 3] and find that it provides significantly better performance than all of them. Our results also show that DIP provides the largest improvements in energy-efficiency and energy-delay product. We analyze the hardware cost and complexity of DIP and show that if dynamic predication is already implemented to reduce the misprediction penalty due to conditional branches [31], DIP requires little extra hardware. Hence, DIP can be a promising, energy-efficient way to reduce the performance penalty of indirect jumps without requiring large specialized hardware structures for predicting indirect jumps.

Contributions. We make the following contributions:

1. We provide a new architectural approach to support indirect jumps, an important performance limiter in object-oriented applications. To our knowledge, DIP is the first mechanism that enables the predication of indirect jumps.

2. We extensively evaluate DIP in comparison to several previously proposed indirect jump prediction schemes and show that DIP provides the highest performance and energy improvements in modern object-oriented applications written in Java and C++.
Even when used in conjunction with sophisticated predictors, DIP significantly improves performance and energy-efficiency.

3. We show that DIP can be implemented with little extra hardware if dynamic predication is already implemented to reduce the misprediction penalty due to conditional branches. Hence, we propose using dynamic predication as a general framework for reducing the performance penalty due to unpredictability in program control-flow (be it due to conditional branches or indirect jumps).

2. Background on Dynamic Predication of Conditional Branches

Compiler-based predication [1] has traditionally been used to eliminate conditional branches (and hence conditional branch mispredictions) by converting control dependencies into data dependencies, but it is not used for indirect jumps. Dynamic predication was first proposed to eliminate the misprediction penalty of simple hammock branches [34] and was later extended to handle a large set of complex control-flow graphs [31]. Dynamic predication has advantages over static predication because (1) it does not require significant changes to the instruction set architecture, such as predicated instructions and architectural predicate registers, (2) it can adapt to dynamic changes in branch behavior, and (3) it is applicable to a much wider range of control-flow graphs and therefore provides higher performance [31]. Unfortunately, none of these previous static or dynamic predication approaches is applicable to indirect jumps. We first briefly review the previous dynamic predication mechanisms proposed for conditional branches [34, 31] to provide sufficient background and the terminology used in this paper. Figure 2 shows the control-flow graph (CFG) of a conditional branch and the dynamically predicated instructions. The candidate branches for dynamic predication are identified at runtime or marked by the compiler.
When the processor fetches a candidate branch, it estimates whether or not the branch is hard to predict using a branch confidence estimator [25]. If the branch prediction has low confidence, the processor generates a predicate using the branch condition and enters dynamic predication mode (dpred-mode). In this mode, the processor fetches both paths after the candidate branch and dynamically predicates the instructions with the corresponding predicate id. On each path, the processor follows the outcomes of the branch predictor. When the processor reaches a control-flow merge (CFM) point on both paths, it inserts c-moves [29] or select-µops [43], similar to the φ-functions in the static single-assignment (SSA) form [1], to reconcile the register data values produced on either side of the branch, and continues fetching from a single path.

Figure 2. Dynamic predication of a conditional branch: (a) source code, (b) CFG, (c) assembly code, (d) dynamically predicated instructions after register renaming (pr: physical register).

(a) source code:
    if (cond1) b = 0; else b = 1;
    c = b + a;

(b) CFG: the taken path (T) goes to block C, the not-taken path (NT) falls through; both paths merge at block D.

(c) assembly code:
        r0 = (cond1)
        branch r0, C
        mov r1 <- #1
        jump D
    C:  mov r1 <- #0
    D:  add r2 <- r1, r3

(d) dynamically predicated instructions:
        pr1 = (cond1)
        branch pr1, C       (p1 = pr1)
        mov pr21 <- #1      (!p1)
        mov pr31 <- #0      (p1)
        select-µop: pr41 = p1 ? pr31 : pr21
        add pr12 <- pr41, pr13

The processor exits dpred-mode either when it reaches a CFM point on both paths of the branch or when the branch is resolved. When the branch is resolved, the predicate value is also resolved. Instructions on the wrong path (i.e., predicated-false instructions) become NOPs, and they do not update the architectural state. If the candidate branch is actually mispredicted, the processor does not need to flush its pipeline and is able to make useful progress on the correct path, which provides improved performance.
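The reconciliation step above can be illustrated with a minimal software model of Figure 2's hammock: both sides execute under complementary predicates, and a select µop picks the live value, like a φ-function in SSA form. This is an illustrative sketch of the semantics, not the hardware; the function names are our own.

```cpp
#include <cassert>

// Select µop: picks the value produced on the path whose predicate is true.
int select_uop(bool p1, int val_taken, int val_not_taken) {
    return p1 ? val_taken : val_not_taken;
}

// Models Figure 2a: if (cond1) b = 0; else b = 1; c = b + a;
int predicated_hammock(bool cond1, int a) {
    bool p1 = cond1;      // predicate generated from the branch condition
    int b_taken = 0;      // block C (taken path), executes under p1
    int b_not_taken = 1;  // fall-through path, executes under !p1
    int b = select_uop(p1, b_taken, b_not_taken);  // reconcile at CFM point D
    return b + a;         // control-independent code after the CFM point
}
```

Either way the branch resolves, the value flowing past the CFM point is correct without a pipeline flush.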
3. Dynamic Predication of Indirect Jumps (DIP)

Traditionally, only conditional branches can be predicated because predication assumes that there are exactly two possible next instructions after a branch. This assumption does not hold for indirect jumps. Figure 3a shows an example virtual function call in the C++ language that is implemented as an indirect call (s->area()). Depending on the actual runtime type of the object pointed to by s, the corresponding overridden version of the area function will be called. There can be many different derived classes that override the function, and thus many different targets of the call. Even though there could be many different targets, usually only a few of them are concurrently used in each phase of the program. If the calls for different targets are interleaved in a complex way, it is usually difficult to predict exactly the correct target of each instance of the call using existing indirect jump predictors. In contrast, we found that it is much easier to estimate the two (or three) most likely targets, i.e., a small set of targets that includes the correct target with high probability. In DIP, if an indirect jump is found to be difficult to predict, the processor estimates the most likely targets. Using dynamic predication, the processor fetches and executes from these most likely targets until the dynamically-predicated paths eventually merge at the instruction after the call, when the function returns (as shown in Figure 3b,c). If one of the predicated targets is correct, the processor avoids a pipeline flush. The performance benefit of dynamically predicating the indirect jump can increase significantly if the control-flow merging point is close to the indirect jump (i.e., if the body of the function is small), so that the overhead of fetching the extra path(s) is not high.
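To make the s->area() call concrete, here is a runnable sketch of such a call site. The class names Circle and Rectangle and their area() bodies are our own illustrative assumptions; only the shape of the call matters, since the compiled call site is an indirect call through the vtable whose target depends on the dynamic type of *s.

```cpp
#include <cassert>

struct Shape {
    virtual int area() const = 0;
    virtual ~Shape() {}
};

struct Circle : Shape {
    int r;
    explicit Circle(int r) : r(r) {}
    int area() const override { return 3 * r * r; }  // coarse integer "pi"
};

struct Rectangle : Shape {
    int w, h;
    Rectangle(int w, int h) : w(w), h(h) {}
    int area() const override { return w * h; }
};

// The indirect call that DIP would dynamically predicate when the
// override that will run is hard to predict.
int call_area(const Shape* s) { return s->area(); }
```

When calls through Circle and Rectangle objects are interleaved unpredictably, DIP would fetch both area() bodies under predicates instead of guessing a single target.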
Figure 3b,c illustrates conceptually the dynamic predication process for the indirect call in Figure 3a, assuming that circle->area() and rectangle->area() are the most likely targets for an instance of the s->area() call. Our approach is inspired by the dynamic predication of conditional branches. However, there are two fundamental differences between the dynamic predication of conditional branches and indirect jumps: 1. There are exactly two possible paths after a conditional branch. In contrast, the number of possible paths after an indirect jump depends on the number of possible targets, which can be very large. For example, an indirect call in the Java DaCapo benchmark exercises 11 dynamic targets. Predicating a larger number of target

Figure 3. Dynamic predication of an indirect call: (a) source code, (b) CFG, (c) predicated code.

(a) source code:
    Shape *s = ...;
    a = s->area();
    ...

(b) CFG: the paths through dynamic target 1 (circle::area()) and dynamic target 2 (rectangle::area()) merge at the instruction after the call.

(c) predicated code: circle::area() and rectangle::area() are both fetched under their respective predicates, and execution merges when the call returns.

paths increases the likelihood that the correct path will be in the pipeline when the jump is resolved, but it also requires more complex hardware and increases the amount of wasted work due to predication, since at most one path is correct. Therefore, one important question is how to identify how many and which targets of a jump should be predicated. 2. The target of a conditional branch is always available at compile time. On the other hand, all targets of an indirect jump may not be available at compile time due to techniques like dynamic linking and dynamic class loading. Hence, a static compiler might not be able to convey to the hardware which targets of an indirect jump can profit from dynamic predication. Another important question, therefore, is who (the static or dynamic compiler, or the hardware) should determine the targets that should be dynamically predicated. We explore both options: the compiler can determine the targets to be predicated via profiling, or the hardware can determine them at runtime. Note that the latter option can adapt to runtime changes in the frequently executed targets of an indirect jump at the expense of higher hardware cost. In this paper we explore answers to these questions and propose an effective and cost-efficient implementation of DIP.

4. Why does DIP work?

We first examine code examples from Java applications to provide insight into why DIP can improve performance.

4.1 Virtual Function Call Example

Figure 4 shows a virtual function call in an output-independent print formatter Java application included in the DaCapo suite. The function computeValue is originally defined in the class Length, and is overridden in the derived classes LinearCombinationLength, MixedLength, and PercentLength.
This polymorphic function is called from a single call site: 32% of the time by objects of class Length, 34% of the time by objects of class LinearCombinationLength, and 34% of the time by objects of class PercentLength. The benchmark goes through two program phases. Only the first target is used at the beginning of the program, so the call is easy to predict. In the second phase, the targets from LinearCombinationLength and PercentLength are interleaved in a difficult-to-predict way. Dynamically predicating these two targets when the indirect call becomes hard to predict can eliminate most target mispredictions at the cost of executing useless instructions on one path. Since the bodies of the functions are small, the number of wasted instructions with dynamic predication is smaller than the number of wasted instructions on a pipeline flush due to a misprediction.

4.2 Switch-Case Statement Example

Figure 5 shows a switch statement in the function jjStopStringLiteralDfa of the class JavaParserTokenManager from the DaCapo benchmark. This class parses input tokens by implementing a deterministic finite automaton. Even though the switch statement has 11 cases, cases 0, 1, and 2 are executed for 59%, 25%, and 12% of the dynamic instances, respectively. The other 8 cases account for only 4% of the dynamic instances. The control flow reconverges after the switch statement.

 1: public int mvalue() {  // in Length class
 2:   if (!bIsComputed)
 3:     computeValue();    // call site
 4:   return millipoints;
 5: }
 6:
 7: protected void computeValue() {
 8:   // in LinearCombinationLength class, short computation...
 9:   setComputedValue(result);
10: }
11:
12: protected void computeValue() {  // in MixedLength class
13:   // short computation...
14:   setComputedValue(computedValue, bAllComputed);
15: }
16:
17: protected void computeValue() {  // in PercentLength class
18:   setComputedValue((int)(factor *
19:       (double)lbase.getBaseLength()));
20: }

Figure 4. A suitable indirect jump example from the DaCapo suite.
Dynamically predicating the first three target paths when the indirect jump is seen would eliminate almost all mispredictions at the cost of executing useless instructions. Note, however, that the number of instructions in each target path is relatively small (fewer than 30), so the amount of wasted work would be small compared to the amount of wasted work on a full pipeline/window flush due to a misprediction.

 1: switch (pos) {  // indirect jump
 2:   case 0:       // target 1
 3:     if ((active1 & 0x44L) != 0L)
 4:       r = 4;
 5:     else if (...)
 6:       r = 28;
 7:     else
 8:       r = -1;
 9:     break;
10:   case 1:       // target 2
11:     // code similar to case 0 (setting r on every path)
12:   case 2:       // target 3
13:     // code similar to case 0 (setting r on every path)
14:   // ... 8 other seldom-executed cases
15: }

Figure 5. A suitable indirect jump example from the DaCapo suite.

5. Mechanism and Implementation

There are two critical issues in implementing DIP: (1) determining which indirect jumps are candidates for dynamic predication, and (2) determining which targets of a candidate indirect jump should be predicated. This section first explains how our mechanism addresses these issues. Then, we describe the required hardware support, analyze its complexity, and explain the support required from the ISA.

5.1 Indirect Jump and CFM Point Selection

The compiler selects indirect jump candidates for dynamic predication using control-flow analysis and profiling. Control-flow analysis finds the CFM point for each indirect jump. The CFM point for an indirect call is the instruction after the call. The CFM point for an indirect jump implementing a switch statement is usually the instruction after the statement. The compiler profiles the application to characterize the indirect jumps. Highly mispredicted indirect jumps are good candidates for DIP even if there is no CFM point common to all the targets, or if the CFM point is so far from the jump that it is not reached until the indirect jump is resolved.
In this case, DIP could still provide performance benefit because it executes two possible paths after the jump, one of which might be the correct path. In other words, the benefit of DIP is similar to that of dual-path execution [17, 15] if a CFM point is not reached. For the experiments in this paper, the compiler selects all indirect jumps that result in at least 0.1% of all jump mispredictions in the profiling run on the baseline processor.² An indirect jump selected for dynamic predication is marked in the binary along with its CFM point. We call such a jump a DIP-jump.

[Footnote 2] We have experimented with several other profiling and selection heuristics based on compile-time cost-benefit analyses (including the ones described in [33, 27]), but we found that our simple selection heuristic provides the best performance.
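The 0.1%-of-mispredictions selection heuristic can be sketched as follows. The JumpProfile struct, the function name, and the integer-percentage arithmetic are our own illustrative assumptions about what a profiling pass might produce.

```cpp
#include <cstdint>
#include <vector>

// Profiling record for one static indirect jump (assumed layout).
struct JumpProfile {
    uint64_t pc;              // address of the indirect jump
    uint64_t mispredictions;  // mispredictions attributed to it in profiling
};

// Mark a jump as a DIP-jump if it accounts for at least 0.1% of all jump
// mispredictions observed in the profiling run (Section 5.1 heuristic).
std::vector<uint64_t> select_dip_jumps(const std::vector<JumpProfile>& jumps) {
    uint64_t total = 0;
    for (const auto& j : jumps) total += j.mispredictions;

    std::vector<uint64_t> selected;
    for (const auto& j : jumps) {
        // j.mispredictions / total >= 0.1%, computed without floating point.
        if (total > 0 && j.mispredictions * 1000 >= total)
            selected.push_back(j.pc);
    }
    return selected;
}
```

The selected PCs (and their CFM points, found by control-flow analysis) would then be marked in the binary.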

5.1.1 Return CFM Points

In some switch statements, one or more cases might end with a return instruction. For an indirect jump implementing such a switch statement, the first instruction after the statement might not be the CFM point. If all predicated paths after an indirect jump implementing a switch statement reach a return instruction that ends a case, the CFM point is actually the instruction executed after the return instruction. Unfortunately, the address of this CFM point is not known at code generation time because it depends on the caller position. We introduce a special type of CFM point, called return-CFM, to handle this case. When a DIP-jump is marked as having a return-CFM point, the processor does not look for a particular address to end dpred-mode, but for the execution of a return instruction at the same call depth as the DIP-jump. The processor ends dynamic predication mode when all the predicated paths reach return instructions.

5.2 Target Selection

DIP provides performance benefit only if the correct target of a jump is one of the predicated targets. Therefore, the choice of which targets to predicate is an important decision when dynamically predicating an indirect jump, since only a few targets can be predicated. This choice can be made by the compiler or the hardware. We first describe how target selection can be done assuming two targets can be predicated; the selection of more than two targets, assuming the hardware can support predicating all of them, is described separately below.

Compiler-based Target Selection

Even though an indirect jump can have many dynamically-exercised targets, we would expect the most frequently exercised targets to account for a significant fraction of the total dynamic jump instances and mispredictions [3, 27].
This assumption suggests a simple mechanism for target selection: the compiler profiles the program with a representative input set, determines the most frequently executed targets for each DIP-jump, and annotates the executable binary with the target information. Even though this mechanism requires more ISA support to supply the targets with an indirect jump, it does not require extra hardware for target selection. However, our results show that dynamic target selection mechanisms that can adapt to runtime program behavior can be much more effective, at the cost of extra hardware (see Section 7.4).

Hardware-based (Dynamic) Target Selection

The correct target of an indirect jump depends on the runtime behavior of the application. Changes in the runtime input set, the phase behavior of the program, and the control-flow path leading to the indirect jump affect the correct target, which is actually the reason why some indirect jumps are hard to predict. As the compiler does not have access to such fine-grain dynamic information, it is difficult for the compiler to select a set of targets that includes the correct target when the jump is predicated. In contrast, hardware has access to dynamic program information and can adapt to rapid changes in the behavior of indirect jumps. We therefore develop a mechanism that selects targets based on runtime information collected in hardware. We use a hardware table called the Target Selection Table (TST) for dynamic target selection. The purpose of the TST is to track and provide the two most frequently executed targets for a given DIP-jump. A TST entry is associated with each DIP-jump. Conceptually, each entry in the TST contains M targets and M frequency counters. A frequency counter is associated with each target and keeps track of how many times the target was taken.
When a fetched DIP-jump is estimated to be hard to predict (low-confidence), the processor accesses the TST entry for that jump and selects the two most frequently executed target addresses (i.e., the two target addresses with the highest frequency counters) in the entry. The TST is structured as a 4-way set-associative cache with a least-recently-used (LRU) replacement policy. We evaluated different indexing functions for the TST: using the address (i.e., program counter) of the DIP-jump alone, or the address of the DIP-jump XORed with the 16-bit global branch history register (GHR). We found that the latter indexing function provides more accurate target selection because the correct target of a jump depends on the control-flow path leading to the jump. To reduce the storage requirements of the TST, we: (1) limit the number of targets to the maximum number of targets that can be predicated plus one; (2) implement the frequency counters as 2-bit saturating counters³; (3) limit the tag to 7 bits; (4) limit the size of the TST to 2K entries; and (5) store the targets associated with a DIP-jump in the BTB (in different BTB entries), instead of storing them in the TST itself. The last optimization turns the TST into a low-cost indirection mechanism that stores only frequency counters to retrieve the most frequently executed targets of a branch, which are stored in the BTB.

Operation of the TST: When a fetched DIP-jump is estimated to be hard to predict, the target selection mechanism starts an iterative process to retrieve the two most frequently used targets from the BTB, one target per cycle.⁴ Figure 6 shows the basic structure of the TST and the logic required to access the BTB based on the information obtained from the TST.⁵ Algorithm 1 describes the target selection process. In each iteration iter, the control logic finds the position of the next frequency counter in descending order. If there are 3 counters stored in the TST, position can take only the values 1, 2, or 3.
The value used to access the BTB to retrieve a target is the same value used to index the TST, XORed with a randomized constant hash_value, which is specific to each position.⁶ For example, if f3 and f1 are the highest frequency counters, the targets will be retrieved by accessing the BTB with (PC xor GHR xor hash_value[3]) and (PC xor GHR xor hash_value[1]) in consecutive cycles. The iterative selection process stops when it has the required number of targets to dynamically predicate the jump (PRED_TARGETS), or after trying to retrieve as many targets as can be stored for one TST entry (MAX_TARGETS). PRED_TARGETS is 2 and MAX_TARGETS is 3 for 2-target selection. If enough targets are selected, the processor enters dpred-mode.

Figure 6. Target Selection Table (TST) used for selecting 2 targets to predicate. f1, f2, f3 denote the frequency counters for the three targets whose information is kept in the TST; the entry is indexed with PC xor GHR, and the position and hash_value are used to look up the corresponding target in the BTB (BTB_hit indicates success).

Update of the TST: When a DIP-jump commits, it updates the TST regardless of whether or not it was dynamically predicated. The TST entry for the (PC, GHR) combination is accessed, and the corresponding targets are retrieved from the BTB, one per cycle, and compared to the correct target taken by the jump. If the correct target is already stored in the BTB, the corresponding frequency counter is incremented. Otherwise, the correct target is inserted into any empty slot (i.e., for an iteration that misses in the BTB) or replaces the target with the smallest frequency counter value. Note that the TST update is not on the critical path of execution and can take multiple cycles as necessary.

[Footnote 3] To dynamically select 3 to 5 targets, we use 3-bit saturating frequency counters.
[Footnote 4] The performance impact of the extra cycles spent to retrieve targets from the BTB is 2%, as we show in the evaluation.
[Footnote 5] Figure 6 shows only the conceptual structure of the TST. In our actual implementation, the BTB index used to retrieve a target in an iteration is precomputed in parallel with the TST access. Therefore, our proposal does not increase the critical path of BTB access.
[Footnote 6] Note that the values used to access the BTB to store the TST targets can conflict with real jump/branch addresses in the program, increasing aliasing and contention in the BTB. The evaluation examines the impact of our mechanism on performance for different BTB sizes.
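Putting the selection, update, and aging rules together, one TST entry can be modeled in software as follows. This is an illustrative sketch, not RTL: for clarity it stores the target addresses in the entry itself, whereas the actual design keeps them in BTB entries and stores only the frequency counters in the TST.

```cpp
#include <array>
#include <cstdint>
#include <utility>

struct TSTEntry {
    std::array<uint64_t, 3> target{};  // tracked targets; 0 = empty slot (assumption)
    std::array<uint8_t, 3> freq{};     // 2-bit saturating counters, max value 3

    // Commit-time update: bump the counter of the correct target, or insert
    // it by filling an empty slot / replacing the least-frequent target.
    void update(uint64_t correct_target) {
        for (int i = 0; i < 3; i++) {
            if (target[i] == correct_target) {
                if (freq[i] < 3) freq[i]++;
                age();
                return;
            }
        }
        int victim = 0;
        for (int i = 0; i < 3; i++) {
            if (target[i] == 0) { victim = i; break; }
            if (freq[i] < freq[victim]) victim = i;
        }
        target[victim] = correct_target;
        freq[victim] = 1;
        age();
    }

    // Aging: if two counters are saturated, right-shift all counters so the
    // two most frequent targets stay distinguishable (Section 5.2).
    void age() {
        int saturated = 0;
        for (uint8_t f : freq) if (f == 3) saturated++;
        if (saturated >= 2)
            for (auto& f : freq) f >>= 1;
    }

    // The two most frequently executed targets, for 2-target predication.
    std::pair<uint64_t, uint64_t> top2() const {
        int a = 0, b = 1;
        if (freq[b] > freq[a]) std::swap(a, b);
        if (freq[2] > freq[a]) { b = a; a = 2; }
        else if (freq[2] > freq[b]) { b = 2; }
        return {target[a], target[b]};
    }
};
```

A real entry would additionally carry the 7-bit tag and live in a 4-way set-associative structure indexed by PC xor GHR.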

Algorithm 1: TST target selection.

    Inputs: PC, GHR
    iter <- 1; num_targets <- 0
    while (iter <= MAX_TARGETS) and (num_targets < PRED_TARGETS) do
        position <- position_descending_order(iter)
        target <- access_BTB(PC xor GHR xor hash_value[position])
        if (BTB_hit) then
            next_target_to_predicate <- target
            num_targets <- num_targets + 1
        end if
        iter <- iter + 1
    end while

The purpose of a TST entry is to provide a list of targets approximately ordered by recent execution frequency. As the saturating frequency counters are updated, if more than two counters saturate at the maximum value, it becomes impossible to distinguish the two most frequent targets. To avoid this problem, we implement a simple aging mechanism: if two of the frequency counters are found to be saturated when a TST entry is updated, all counters in the entry are right-shifted by one bit. In addition to avoiding the saturation problem, this aging mechanism also demotes targets that have not been used recently, keeping the TST content up to date for the current program phase.

Selecting More Than Two Targets

Unlike conditional branches, indirect jumps can have more than two frequently executed targets. When the likelihood of having the correct target in a set of two targets is not high enough, it might be profitable to predicate multiple targets, even though the overhead of predication would be higher. If we allow predication of more than two targets, we have to select which targets and how many targets to use for each low-confidence indirect jump. The TST holds one frequency counter for each of the targets that have been most frequently used in the recent past. The aging mechanism keeps these counters representative of the current phase of the program. Therefore, it is reasonable to select the targets with the highest frequency counts. To select multiple targets, the processor uses a greedy algorithm.
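This greedy selection, described next and formalized in Equation (1), can be sketched in software as follows. The function and its integer arithmetic are our own reconstruction of the rule from the surrounding text, not the paper's implementation.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Greedy multi-target selection: always take the two most frequent targets,
// then keep adding the i-th most frequent target (1-indexed) only while
// Freq_i * i >= sum of the frequencies of the targets already selected.
int num_targets_to_predicate(std::vector<int> freq) {
    std::sort(freq.begin(), freq.end(), std::greater<int>());
    int selected = std::min<int>(2, static_cast<int>(freq.size()));
    long long sum = 0;
    for (int j = 0; j < selected; j++) sum += freq[j];

    for (std::size_t i = selected; i < freq.size(); i++) {
        long long lhs = static_cast<long long>(freq[i]) *
                        static_cast<long long>(i + 1);
        if (lhs < sum) break;  // this target no longer adds significantly
        sum += freq[i];
        selected++;
    }
    return selected;
}
```

With nearly uniform counts a third target is worth predicating; with a skewed distribution the selection stops at two.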
It starts with the two targets with the highest frequency. Then, it chooses the i-th target in descending frequency order only if its frequency still adds significantly to the sum of the frequencies of the targets already selected. This happens when the following expression is satisfied:

    Select Target_i  if  Freq_i × i >= Σ_{j=1}^{i-1} Freq_j    (1)

5.2.4 Overriding the BTB-based Target Prediction

The TST has more information than a conventional BTB-based indirect jump predictor for DIP-jumps, because: (1) the TST distinguishes between targets based on the different control-flow paths leading to a jump because it is indexed with PC and branch history, while a BTB-based prediction simply provides the last seen target for the jump; (2) each entry in the TST can hold multiple targets for each combination of PC and branch history (i.e., multiple targets per jump), while a BTB-based predictor can hold only one target per jump; (3) the TST contains entries for only the DIP-jumps selected by the compiler, which reduces contention, whereas a BTB contains one entry for every indirect jump and taken conditional branch. Our main purpose for designing the TST is to use it as a mechanism to select two or more targets for dynamic predication. However, we also found that if a TST entry contains only one target, or if the most frequent target in the entry is significantly more frequent (Footnote 7) than the other targets, dynamic predication provides less benefit than simply predicting the most frequent target as the target of the jump. Therefore, if one of these conditions holds when a DIP-jump is fetched, the processor, instead of entering dynamic predication mode, simply overrides the BTB-based prediction for the indirect jump and uses the single predominant target specified by the TST as the predicted target for the jump.

Footnote 7: We found that a difference of at least 2 units in the 2-bit frequency counters is significant.

5.2.5 Dynamic Target Selection vs. Target Prediction

Dynamic target selection using the TST is conceptually different from dynamic target prediction. The TST selects more than one target to predicate for an indirect jump. In contrast, an indirect jump predictor chooses a single target and uses that as the prediction for the fetched indirect jump. DIP increases the probability of having the correct target in the processor by selecting extra targets and dynamically predicating multiple paths. Nevertheless, the proposed dynamic target selection mechanism can be viewed as both a target selector and a target predictor, especially since we sometimes use it to override target predictions as described in the previous section. As such, we envision future indirect jump predictors designed to work directly with DIP, selecting either a single target for speculative execution or multiple targets for dynamic predication.

5.3 Hardware Support for Predicated Execution

Once the targets to be predicated are selected, the dynamic predication process in DIP is similar to that in dynamic predication of conditional branches, which was described briefly in Section 2 and in detail by Kim et al. [31]. Here we describe the additional support required for DIP. If two targets are predicated in DIP, the additional support required is only in 1) the generation of the predicate values, and 2) the handling of a possible pipeline flush when the predicate values are resolved. When a low-confidence DIP-jump is fetched, the processor enters dpred-mode. Figure 7 shows an example of indirect jump predication with two targets. First, the processor assigns a predicate id to each path to be predicated (i.e., each selected target). Unlike in conditional branch predication, in which a single predicate value (and its complement) is generated based on the branch direction, in DIP there are multiple predicate values, based on the addresses of the predicated targets.
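Returning briefly to target selection, the greedy rule of Equation (1) can be sketched as follows. The function name and the plain-list input are illustrative assumptions:

```python
def greedy_select(freqs_desc):
    """Greedy multi-target selection per Equation (1).
    freqs_desc: per-target frequency counts, sorted in descending order."""
    selected = list(freqs_desc[:2])          # always start with the top two
    for i in range(3, len(freqs_desc) + 1):  # i is the 1-based frequency rank
        f = freqs_desc[i - 1]
        # Equation (1): take target i only if Freq_i * i is at least the
        # sum of the frequencies of the targets already selected.
        if f * i >= sum(selected):
            selected.append(f)
        else:
            break                            # later targets are rarer still
    return selected

print(greedy_select([40, 30, 25, 5]))   # prints [40, 30, 25]
print(greedy_select([40, 30, 20, 5]))   # prints [40, 30]
```

The two example calls illustrate the cutoff: a third target with count 25 still contributes enough (25 × 3 >= 40 + 30), while one with count 20 does not.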
The predicate value for a path is generated by comparing the predicated target address to the correct target address. The processor inserts compare micro-operations (µops) to generate predicate values for each path, as shown in Figure 7b. Unlike in conditional branch predication, where one of the predicated paths is always correct, in DIP both of the predicated paths might be incorrect. As a result, the processor has to flush the whole pipeline when none of the predicated target addresses is the correct target. To this end, the processor generates a flush µop. The flush µop checks the predicate values and triggers a pipeline flush if none of the predicate values turns out to be TRUE (i.e., if the correct target was not predicated). If any of the predicates is TRUE, the flush µop functions as a NOP. In the example of Figure 7b, the processor inserts a flush µop to check whether or not any of the predicated targets (TARGET1 or TARGET2) is correct. All instructions fetched during dpred-mode carry a predicate id, just like in dynamic predication for a conditional branch. Since select-µops are executed only if either TARGET1 or TARGET2 is the correct target, the select-µops can be controlled by just one of the two predicates. Note that the implementation of the select-µops is the same as in dynamic predication for conditional branches. We refer the reader to [34, 31] for details on the generation and implementation of select-µops.

[Figure 7. An example of how the instruction stream is dynamically predicated: (a) control flow graph, (b) dynamically predicated instructions after register renaming.]

5.3.1 Supporting More Than Two Targets

As we found that the predication of more than two targets does not provide significant benefits (shown and explained in Section 7.4), we only very briefly touch on hardware support for it, solely for completeness. Each predicated path requires its own context: PC (program counter), GHR (global history register), RAS (return address stack), and RAT (register alias table). Since each path follows the outcomes of the branch predictor and does not fork more paths, i.e., the processor cannot be in dpred-mode for two or more nested indirect jumps at the same time, the complexity of predicating more than two targets is significantly less than the complexity of multi-path (i.e., eager) execution [37, 35]. The predication of more than two targets requires 1) storage of more frequency counters in the TST and additional combinational logic for target selection, 2) generation of more than two predicates using more than two compare instructions, 3) minor changes to the flush µop semantics to handle multiple paths, and 4) extension of the select-µop generation mechanism to handle the reconvergence of more than two paths.

5.3.2 Nested Indirect Jumps

If the processor fetches another low-confidence DIP-jump during dpred-mode, it has two options: it can follow the low-confidence predicted target, or it can exit dpred-mode for the earlier jump and reenter dpred-mode for the later jump. If the jumps are nested, the overhead of predicating the later DIP-jump is usually smaller than the overhead of predicating the earlier jump. Also, if the processor decides to continue in dpred-mode for the earlier jump and the later jump is mispredicted, a potentially significant part of the benefit of predication can be lost when the pipeline is flushed.
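As an aside, the predicate-generation and flush-µop behavior of Section 5.3 can be modeled in a few lines of Python. This is a behavioral sketch only; the function names and address values are illustrative assumptions, not the µop implementation:

```python
def gen_predicates(correct_target, predicated_targets):
    """One compare µop per predicated path:
    p_i = (address of predicated target i == correct target address)."""
    return [t == correct_target for t in predicated_targets]

def flush_needed(predicates):
    """Flush µop semantics: trigger a full pipeline flush iff no predicate
    is TRUE (p1 NOR p2 ...); if any predicate is TRUE, it acts as a NOP."""
    return not any(predicates)

# TARGET2 turns out to be the correct target: no flush is needed.
print(flush_needed(gen_predicates(0x500, [0x400, 0x500])))   # prints False
# Neither predicated target is correct: the pipeline must be flushed.
print(flush_needed(gen_predicates(0x700, [0x400, 0x500])))   # prints True
```

The second call models the case unique to DIP: unlike conditional-branch predication, both predicated paths can be wrong, so the flush check is required.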
Therefore, our policy (called the reentry policy) is to exit dpred-mode for the earlier jump and reenter dpred-mode for the later DIP-jump. Our experimental results show that this choice provides significantly higher performance benefits (see Section 7.3).

5.3.3 Other Implementation Issues

We briefly discuss other important issues in implementing DIP. Note that the same issues exist in architectures that implement static or dynamic predication for conditional branches [36, 34, 31].

Stores and Loads: A dynamically predicated store is not sent to the memory system unless its predicate is known to be TRUE. The basic rule for the forwarding logic is that a store can forward to any younger load, except for stores guarded by an unresolved predicate register, which can only forward to younger loads with the same predicate id.

Interrupts and Exceptions: No special support is needed to handle interrupts and exceptions because dynamic predication state is speculative and is flushed before servicing the interrupt or exception. Predicate registers do not have to be saved and restored because they are not part of the ISA. Instructions with FALSE predicate values do not cause exceptions.

5.4 Hardware Cost and Complexity

The hardware required to dynamically predicate indirect jumps is very similar to that of the diverge-merge processor (DMP) [31, 32], which dynamically predicates conditional branches. The hardware support needed for dynamic predication (including the predicate registers, fetch/decode/rename/retirement support, and select-µops) and its cost are already described in detail by previous work [31]. We assume DIP would be cost-efficiently implemented on a baseline processor that already supports dynamic predication for conditional branches, which was shown to provide very large performance and energy benefits [31, 32]. DIP requires the following hardware modifications in addition to the required support for dynamic predication:

1. Target Selection Table (TST, Section 5.2.2): a 2K-entry, 4-way set-associative table with three 2-bit saturating counters per entry, i.e., a 1.5KB data store and a 2.1KB tag store (using 7-bit tags and a valid bit per entry, plus 2 bits per set for pseudo-LRU replacement).
2. A simple finite state machine to implement accessing and updating the targets in the BTB (block labeled Control in Figure 6).
3. A table with 32-bit constants (hash_value in Figure 6).
4. Modified predicate generation logic and flush µops (Section 5.3).
5. Optionally, support for more than 2 targets (see Section 5.3.1).

Hardware Cost: If dynamic predication hardware is already implemented for conditional branches, the cost of adding dynamic predication of indirect jumps with dynamic 2-target selection (our most efficient design point) is 3.6KB of storage (Footnote 8) and simple extra logic. We believe that it is not cost-effective to implement dynamic predication only for indirect jumps. Rather, dynamic predication hardware is a substrate that should be used for both conditional and indirect jumps.

Footnote 8: Extra storage can be further reduced to 1.5KB with an alternative design that stores a frequency counter and a there-is-next-target bit directly in each BTB entry, thus eliminating the need for a separate TST. To keep the implementation conceptually simple, we do not describe this option.

5.5 ISA Support

The indirect jumps selected for dynamic predication are identified in the executable binary with a different opcode for each flavor of DIP-jump (jumps or calls). The instruction format uses one bit to indicate whether or not the jump has a return CFM point. The instruction format also includes the CFM point, encoded as a 16-bit two's complement offset relative to the DIP-jump, which we determined is enough to encode all the CFM points we found in our set of benchmarks. When we use static target selection, the selected 32-bit targets follow the instruction in the binary. Even though these special instructions increase the code size and the pressure on the instruction cache, their impact is not significant because the number of static jumps selected for dynamic predication in our benchmarks is small (as shown in Table 2).

6. Experimental Methodology

6.1 Simulation Methodology

We use an iDNA-based [3] cycle-accurate x86 simulator to evaluate dynamic indirect jump predication. Table 1 shows our baseline processor's parameters. The baseline uses a 4K-entry BTB to predict indirect jumps [4, 18]. The simulator includes a Wattch-based power model [6], using 100nm technology at 4GHz, that faithfully accounts for the power consumption of all the additional structures needed by DIP.

Table 1. Baseline processor configuration

Front End: 64KB, 2-way, 2-cycle I-cache; fetch ends at the first predicted-taken branch; fetch up to 3 conditional branches or 1 indirect branch
Branch Predictors: 64KB (64-bit history, 121-entry) perceptron branch predictor [26]; 4K-entry, 4-way BTB with pseudo-LRU replacement; 64-entry return address stack; min. branch mispred. penalty is 30 cycles
Execution Core: 8-wide fetch/issue/execute/retire; 512-entry ROB; 384 physical registers; 128-entry LD-ST queue; 4-cycle pipelined wake-up and selection logic; scheduling window partitioned into 8 sub-windows of 64 entries each
On-chip Caches: L1 D-cache: 64KB, 4-way, 2-cycle, 2 ld/st ports; L2 unified cache: 1MB, 8-way, 8 banks, 10-cycle latency; all caches: LRU replacement and 64B lines
Buses and Memory: 300-cycle minimum memory latency; 32 DRAM banks; 32B-wide core-to-memory bus at 4:1 frequency ratio
Prefetcher: stream prefetcher [42] (32 streams and 16-cache-line prefetch distance)
Dyn. pred. support: 2KB (12-bit history, threshold 14) enhanced JRS confidence estimator [25]; 32 predicate registers; 1 CFM register

We evaluate DIP using benchmarks over multiple platforms. Most of the experiments are run using the 11 DaCapo benchmarks [5]

(Java), Matlab R2006a (C), the M5 simulator [4] (C++), and two interpreters written in C from the SPEC CPU 2000/2006 suites. We also show results for a set of 5 SPEC CPU2000 INT benchmarks written in C, 3 SPEC CPU2006 INT benchmarks written in C, and 1 SPEC CPU2006 FP benchmark written in C++. We use those benchmarks in the SPEC 2000 INT and 2006 INT/C++ suites that gain at least 5% performance with a perfect indirect jump predictor. Each benchmark is run for 200 million x86 instructions with the reference input set (SPEC CPU), the small input set (DaCapo), or a custom input set (Matlab and M5) (Footnote 9). The DaCapo benchmarks are run with the Sun J2SE 1.4 JRE on Windows Vista (Footnote 10). Matlab is run on Windows Vista. M5 is compiled with its default options using gcc 3.4.4, and run on Cygwin on Windows Vista. All SPEC binaries are compiled with Intel's production compiler (ICC) [21] using -O3 optimizations and run on Linux Fedora Core 5. Table 2 shows the characteristics of the simulated portions of our main set of benchmarks on the baseline processor.

Footnote 9: Matlab performs convolution on two images; M5 simulates the performance of gcc using its complex out-of-order processor model.

Footnote 10: At the time of this writing, iDNA [3] worked for only 7 of the 11 DaCapo benchmarks running on Sun Java SE 6. The average indirect jump MPKI for these benchmarks on Java 6 is 37% higher than on Java 1.4, but since we cannot use the full suite, we report the results for Java 1.4. We expect DIP would perform better on Java 6.

6.2 Compilation and Profiling Methodology

Our methodology targets an adaptive JIT compiler that is able to use recent profiling information to recompile the hot methods to improve performance. For the experiments in this paper, we developed a dynamic profiling tool that collects edge profiling information and computes the CFM points during the 200M instructions preceding the simulation points for each application. The algorithms to find the CFM points are similar to those described in [33]. After profiling, our tool applies the jump selection algorithm described earlier (Footnote 11).

Footnote 11: For the static target selection experiments, our dynamic profiling tool also applies the static target selection algorithm described earlier.

7. Results

7.1 Dynamic Target Distribution

Table 2 shows the average number of dynamic targets for an indirect jump in the simulated benchmarks. We found that in our set of object-oriented benchmarks, 61% of all dynamic indirect jumps have 16 or more targets, which is significantly higher than what was reported in previous work for SPEC CPU C/C++ applications [3]. Even though indirect jumps have many targets, the most frequently executed targets (over the whole run of the program) are taken by a significant fraction of all dynamic indirect jumps, as we show in Figure 8. On average, the two most frequent targets cover 59.9% of the dynamic indirect jumps, but only 39.5% of the indirect jump mispredictions. The contribution of less frequently executed targets steadily drops. This data shows that statically selecting two targets for dynamic predication would likely not be very successful in these object-oriented applications, where indirect jumps have a very large number of targets.

[Figure 8. Fraction of dynamic indirect jumps taking the most frequently executed N targets. y-axis: Percent of Executed Indirect Jumps (%).]

7.2 Performance of DIP

The first set of bars in Figure 9 shows the performance improvement of DIP over the baseline, using dynamic 2-target selection with a 3.6KB TST and all the techniques described in Section 5. The average IPC improvement of DIP is 37.8%, and is analyzed in Section 7.3. We also include five idealized experiments in Figure 9 to show the potential of DIP. The IPC improvement increases to 51% if an unrealistically large TST is used (a 64K-entry TST with local storage for 255 targets and 32-bit frequency counters, which has a total data storage size of 128MB). If the 128MB TST always ideally provides the correct target for predication among the 2 selected targets (2T perfect target), performance improves by only 0.2% more. This means that the principles of the TST are adequate for selecting the correct target among the two that are predicated. If the DIP mechanism were used ideally only when the DIP-jump is actually mispredicted (2T perfect confidence), IPC improves by an additional 2%. The combination of perfect confidence estimation and perfect target selection (2T perfect targ./conf.) adds only an extra 0.5%, showing that the maximum potential performance benefit of 2-target DIP is 53.8%. Perfect indirect jump prediction (perfect IJP) provides 72.2% performance improvement over the baseline, which is significantly higher than the maximum potential of DIP, because it does not have the overhead of dynamic predication. Our realistic implementation achieves 52% of the potential with perfect indirect jump prediction and 70% of the potential with the ideal 2-target DIP (2T perfect targ./conf.). Xalan and do not get as much of the potential as the other benchmarks because the TST miss rate is significantly high (43% for both).

[Figure 9. DIP performance and potential. Bars: 2T-dyn. 3.6KB TST; 2T-dyn. 128MB TST; 2T perfect target; 2T perfect confidence; 2T perfect targ./conf.; perfect IJP.]

7.3 Analysis of the Performance Improvement

The performance improvement of DIP comes from avoiding the full pipeline flushes caused by indirect jump mispredictions. DIP can improve performance only if it selects the correct target as one of the targets to predicate. Therefore, most of our design effort for DIP is focused on mechanisms to improve target selection. On average, DIP eliminates 47% of the pipeline flushes that occur with a BTB-based predictor (as shown in Table 3, row 8).
Furthermore, the overhead of executing the extra path is low: the average number of dynamically predicated wrong-path instructions is only 73.9 (Table 3, row 7), which is significantly smaller than the instruction window size of the processor. Hence, in the steady state, dynamic predication of a mispredicted jump results in only 73.9 wasted instruction slots, whereas the misprediction itself would have wasted all instruction slots in the window plus those in the front-end pipeline stages. The benefit of DIP depends on the combination of target selection and confidence estimation. We classify dynamic predication instances into four cases, based on whether or not the correct target is predicated and whether or not the jump was actually mispredicted:

1. Useful: A dynamic predication instance is useful (i.e., it successfully avoids a pipeline flush) if it predicates the correct target and the
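The four-way classification that begins above can be summarized as a small decision function. Only the "useful" label comes from the text; the other three labels are assumed descriptive placeholders for the remaining combinations of the two stated conditions:

```python
def classify_dpred_instance(correct_target_predicated, jump_mispredicted):
    """2x2 classification of a dynamic predication instance, keyed on
    whether the correct target was among the predicated targets and
    whether the jump would actually have been mispredicted."""
    if correct_target_predicated and jump_mispredicted:
        return "useful"                # a pipeline flush is avoided
    if correct_target_predicated and not jump_mispredicted:
        return "prediction-was-right"  # overhead, but no flush to save (assumed label)
    if jump_mispredicted:
        return "flush-anyway"          # correct target missed; flush still occurs (assumed label)
    return "overhead-only"             # predication was neither right nor needed (assumed label)

print(classify_dpred_instance(True, True))   # prints useful
```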


More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Lecture 13: Branch Prediction

Lecture 13: Branch Prediction S 09 L13-1 18-447 Lecture 13: Branch Prediction James C. Hoe Dept of ECE, CMU March 4, 2009 Announcements: Spring break!! Spring break next week!! Project 2 due the week after spring break HW3 due Monday

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Superscalar Processors Ch 14

Superscalar Processors Ch 14 Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

CS / ECE 6810 Midterm Exam - Oct 21st 2008

CS / ECE 6810 Midterm Exam - Oct 21st 2008 Name and ID: CS / ECE 6810 Midterm Exam - Oct 21st 2008 Notes: This is an open notes and open book exam. If necessary, make reasonable assumptions and clearly state them. The only clarifications you may

More information

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency? Superscalar Processors Ch 14 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction PowerPC, Pentium 4 1 Superscalar Processing (5) Basic idea: more than one instruction completion

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 8

ECE 571 Advanced Microprocessor-Based Design Lecture 8 ECE 571 Advanced Microprocessor-Based Design Lecture 8 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 16 February 2017 Announcements HW4 Due HW5 will be posted 1 HW#3 Review Energy

More information

Pipelining, Branch Prediction, Trends

Pipelining, Branch Prediction, Trends Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ECE 571 Advanced Microprocessor-Based Design Lecture 9 ECE 571 Advanced Microprocessor-Based Design Lecture 9 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 30 September 2014 Announcements Next homework coming soon 1 Bulldozer Paper

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

NAME: Problem Points Score. 7 (bonus) 15. Total

NAME: Problem Points Score. 7 (bonus) 15. Total Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 NAME: Problem Points Score 1 40

More information

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency

Superscalar Processors Ch 13. Superscalar Processing (5) Computer Organization II 10/10/2001. New dependency for superscalar case? (8) Name dependency Superscalar Processors Ch 13 Limitations, Hazards Instruction Issue Policy Register Renaming Branch Prediction 1 New dependency for superscalar case? (8) Name dependency (nimiriippuvuus) two use the same

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using

More information

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 Reminder: Lab Assignments Lab Assignment 6 Implementing a more

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Lecture 7: Static ILP and branch prediction. Topics: static speculation and branch prediction (Appendix G, Section 2.3)

Lecture 7: Static ILP and branch prediction. Topics: static speculation and branch prediction (Appendix G, Section 2.3) Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3) 1 Support for Speculation In general, when we re-order instructions, register renaming

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering dministration CS 1/13 Introduction to Compilers and Translators ndrew Myers Cornell University P due in 1 week Optional reading: Muchnick 17 Lecture 30: Instruction scheduling 1 pril 00 1 Impact of instruction

More information

Design of Digital Circuits Lecture 18: Branch Prediction. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 18: Branch Prediction. Prof. Onur Mutlu ETH Zurich Spring May 2018 Design of Digital Circuits Lecture 18: Branch Prediction Prof. Onur Mutlu ETH Zurich Spring 2018 3 May 2018 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed

More information

Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013 1

Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013 1 Chapter 3 (CONT) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007,2013 1 Dynamic Hardware Branch Prediction Control hazards are sources of losses, especially for processors

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Static Branch Prediction

Static Branch Prediction Announcements EE382A Lecture 5: Branch Prediction Project proposal due on Mo 10/14 List the group members Describe the topic including why it is important and your thesis Describe the methodology you will

More information

An Efficient Indirect Branch Predictor

An Efficient Indirect Branch Predictor An Efficient Indirect ranch Predictor Yul Chu and M. R. Ito 2 Electrical and Computer Engineering Department, Mississippi State University, ox 957, Mississippi State, MS 39762, USA chu@ece.msstate.edu

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

COSC 6385 Computer Architecture Dynamic Branch Prediction

COSC 6385 Computer Architecture Dynamic Branch Prediction COSC 6385 Computer Architecture Dynamic Branch Prediction Edgar Gabriel Spring 208 Pipelining Pipelining allows for overlapping the execution of instructions Limitations on the (pipelined) execution of

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

4.1.3 [10] < 4.3>Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used?

4.1.3 [10] < 4.3>Which resources (blocks) produce no output for this instruction? Which resources produce output that is not used? 2.10 [20] < 2.2, 2.5> For each LEGv8 instruction in Exercise 2.9 (copied below), show the value of the opcode (Op), source register (Rn), and target register (Rd or Rt) fields. For the I-type instructions,

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

EE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University

EE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University EE382A Lecture 5: Branch Prediction Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 5-1 Announcements Project proposal due on Mo 10/14 List the group

More information

Hardware Speculation Support

Hardware Speculation Support Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 08: Caches III Shuai Wang Department of Computer Science and Technology Nanjing University Improve Cache Performance Average memory access time (AMAT): AMAT =

More information

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard. COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Instruction Fetching based on Branch Predictor Confidence. Kai Da Zhao Younggyun Cho Han Lin 2014 Fall, CS752 Final Project

Instruction Fetching based on Branch Predictor Confidence. Kai Da Zhao Younggyun Cho Han Lin 2014 Fall, CS752 Final Project Instruction Fetching based on Branch Predictor Confidence Kai Da Zhao Younggyun Cho Han Lin 2014 Fall, CS752 Final Project Outline Motivation Project Goals Related Works Proposed Confidence Assignment

More information

Chapter 7 The Potential of Special-Purpose Hardware

Chapter 7 The Potential of Special-Purpose Hardware Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture

More information

Fall 2011 Prof. Hyesoon Kim

Fall 2011 Prof. Hyesoon Kim Fall 2011 Prof. Hyesoon Kim 1 1.. 1 0 2bc 2bc BHR XOR index 2bc 0x809000 PC 2bc McFarling 93 Predictor size: 2^(history length)*2bit predict_func(pc, actual_dir) { index = pc xor BHR taken = 2bit_counters[index]

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Portland State University ECE 587/687. Memory Ordering

Portland State University ECE 587/687. Memory Ordering Portland State University ECE 587/687 Memory Ordering Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2018 Handling Memory Operations Review pipeline for out of order, superscalar processors

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE

EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE EECS 470 PROJECT: P6 MICROARCHITECTURE BASED CORE TEAM EKA Shaizeen Aga, Aasheesh Kolli, Rakesh Nambiar, Shruti Padmanabha, Maheshwarr Sathiamoorthy Department of Computer Science and Engineering University

More information