A Perfect Branch Prediction Technique for Conditional Loops

Virgil Andronache, Richard P. Simpson, and Nelson L. Passos
Department of Computer Science, Midwestern State University, Wichita Falls, TX 76308, USA

Abstract

One of the most stringent issues in obtaining optimal performance from today's computers is the decay in CPU performance that results from mispredicted branch instructions. As a result, a number of techniques have been proposed that attempt to forecast the decision of control instructions. These techniques, however, while effective, do not achieve perfect accuracy. This paper proposes new prediction instructions which, together with a loop retiming technique, provide 100% accuracy in most instances for single control instructions found in loop structures.

Keywords: Branch prediction, pipeline, conditional loop, retiming, pre-fetching.

1. Introduction

Control instructions may alter the linear execution of programs. If the execution path is thus altered, modern systems that make use of pre-fetching, pipelining and instruction-level parallelism are adversely affected, in that a number of instructions that are already executing have to be discarded and new instructions have to be brought to the CPU. The overall effect of this process is a deterioration in performance. Improving the performance of loop structures has been the focus of a large number of research projects over the last decade, and significant results have been obtained with both software and hardware advancements. An important optimization for loop execution is the ability to predict, with some degree of certainty, the flow of the program when conditional instructions are encountered. That degree of certainty, however, is always somewhat less than 100%.
This paper describes a transformation technique for obtaining perfect branch prediction within loops without loop-carried dependencies. A number of techniques exist that improve the performance of loop execution by working towards achieving the largest possible degree of fine-grain parallelism among loop instructions. Noteworthy among these are software pipelining [1], index shifting [13] and retiming [12]. However, these techniques are only effective if control instructions are not present. A different set of techniques, known as branch prediction techniques, deals exclusively with control instructions. The body of work done within the scope of branch prediction is extensive. There are two main approaches to branch prediction: static and dynamic [3]. In the static approach, each branch is assigned a value (taken or not taken) at compile time, based on such criteria as test run results and branch type. The value thus assigned is then used when pre-fetching instructions. It is easy to see that this approach will yield a large number of false decisions if the probability of the branch being taken is around 50%.

In dynamic branch prediction, several different implementations are possible [6,7,14,15,19,21]. The simplest implementation uses a counter to compute the percentage of loop executions in which the branch was taken; pre-fetching is then done along the side of the branch that appears more likely according to that computation. In this case, a sequence of alternating decisions may yield close to 0% accuracy. A special case, two-level adaptive branch prediction [21], gives added importance to the recent history of results from the branch to be predicted. This approach tends to have better results than the static or probabilistic approaches, but there are still circumstances in which it could have very low accuracy. Yet another approach to optimizing control structures, predicated execution, involves adding a tag to instructions whose execution depends on control structures [15,19]. This tag indicates whether the instruction is valid after the controlling instruction has been executed. All instructions are then executed, but only the valid ones are taken into account. This approach can be very effective, especially when used in conjunction with ordinary branch prediction techniques. However, some penalty is still incurred, for example the use of system resources on results that will be discarded. Using more than one branch prediction technique has also proved effective, at the cost of the additional overhead of choosing the technique to be used in particular cases [4]. Most dynamic branch prediction approaches [6,7,14] make use of a hardware buffer. According to the specific use, this buffer can be designed with different characteristics, such as a branch target buffer [2], elastic history buffer [18] or branch history buffer [16].
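As a minimal sketch of the failure mode just described (illustrative Python, not from the paper), a 1-bit last-outcome predictor collapses on a branch whose decisions strictly alternate:

```python
def one_bit_predictor(outcomes):
    """Predict each branch as equal to the previous outcome; return accuracy."""
    state = True          # assume 'taken' initially
    correct = 0
    for taken in outcomes:
        if state == taken:
            correct += 1
        state = taken     # 1-bit history: remember only the last outcome
    return correct / len(outcomes)

# Alternating taken/not-taken decisions defeat the predictor completely.
alternating = [i % 2 == 0 for i in range(100)]
print(one_bit_predictor(alternating))   # 0.01 -- only the lucky first guess
```

After the first guess, every prediction lags one iteration behind the actual decision, which is exactly the near-0% case the text mentions.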
This paper describes a process in which retiming concepts and new instruction constructs are used to obtain 100% accuracy in branch prediction occurring in loops that have no loop-carried dependencies [5]. In this process, loops are modeled by directed acyclic graphs where the nodes represent instructions, the edges represent dependencies between instructions and the labels on the edges represent delays in execution. Delays are computed as the difference between the iteration where some data value is produced and the iteration when that information is utilized. For example, the code

for i = 0 to 100 do
  if (a[i] = 0)
    b[i] = b[i-2] * 2
    a[i] = a[i-1] 7

can be represented in graph format as shown in figure 1, which is changed to the new form shown in figure 2 after applying the technique presented in this paper.

Figure 1: Graph representation of non-retimed code
Figure 2: Graph representation of retimed code

In figures 1 and 2, instructions B and C make use of the decision made in instruction A. However, in figure 1, that result is used in the same iteration in which it was computed, whereas in figure 2, the result of computing instruction A in the current iteration is used in the next iteration of the loop. This new view of the sequence of instructions allows the program to anticipate the branch decision.

The next section presents a set of basic concepts relevant to this paper. Section 3 consists of the problem model. Section 4 presents the process by which control statements are optimized. Finally, section 5 provides an example of how the technique can be used. A summary concludes the paper.

2. Background

Any program consists of two types of instructions: control and data instructions [8]. With regular constructs, all instructions are executed in sequential order and control instructions determine the

execution of data instructions usually found in close proximity. There are two approaches to instruction execution when control instructions are present. In the first, each instruction is executed only after the control decision that it depends on has been made. In the second, called speculative execution, an instruction is executed before that decision has been reached, often according to an algorithm that determines the most likely execution path. In considering control instruction handling, there are also two possible approaches: single and multiple control flow [10]. In single control flow, an instruction is not executed until all control instructions prior to it in the sequential execution of the program have been resolved or speculated. Multiple control flow allows execution prior to obtaining the information needed to determine whether the instruction would be on the current execution path.

In the ordinary pipeline model, a new iteration starts execution every n cycles, where n is called the initiation interval, and the execution periods of several iterations overlap. All iterations of an acyclic data flow graph can be executed simultaneously after proper transformation; therefore, where there is no resource constraint, the execution rate of an acyclic data flow graph can be made arbitrarily good. When a data flow graph is cyclic, the schedule cannot be overlapped arbitrarily, and it is not trivial to generate a schedule for a loop pipeline even without resource constraints. To resolve this difficulty, instead of scheduling the original iteration, a static schedule is constructed: a new loop body consisting of nodes from different original iterations. In the model presented, a loop pipeline is composed of a prologue, a repeating schedule and an epilogue. A repeating schedule, a schedule that is repeatedly executed, forms the new loop body.
The length of a repeating schedule corresponds to the initiation interval in the ordinary pipeline model. The goal of the process is to construct a repeating schedule of minimal length. The transition from the initial state of the old loop body to the new loop body is done by the prologue. The prologue and epilogue can be easily obtained once a static schedule is found. Under the pipeline model used in this paper, the focus is on the construction of short repeating schedules.

The retiming technique [12] is a convenient and effective tool for the optimization of synchronous systems due to its properties, algorithms and ease of manipulation. This paper uses retiming concepts to achieve perfect branch prediction and additional parallelism. Retiming, initially described in [12], alters loop structures in a manner that parallelizes instructions that were not written specifically for parallelism. For example, the code shown below

for i = 0 to 100
  a[i] = a[i-2] + 3
  b[i] = a[i] * 5

can be transformed to a parallel version such as:

a[0] = a[-2] + 3
for i = 0 to 99
  a[i+1] = a[i-1] + 3
  b[i] = a[i] * 5
b[100] = a[100] * 5

In order to apply retiming techniques, the loop body is modeled as a data flow graph. A data flow graph (DFG) G = (V, E, d) is a graph in which each node in the set of computation nodes V represents an instruction and each edge in E, the set of dependency edges, represents the relation between nodes, such that the destination node depends on the origin node. A function d from E to Z represents the delay between two nodes. Consider again the example:

for i = 0 to 100
  a[i] = a[i-2] + 3  /* node A */
  b[i] = a[i] * 5    /* node B */

The data flow graph for this code is shown in figure 3. From this graph it is clear that A and B must be executed sequentially. The retiming method modifies the delay between instructions in order to improve their parallel execution.
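The transformation above can be checked mechanically. The following sketch (illustrative Python; indices are offset by 2 so that the a[-2] and a[-1] references land inside a zero-initialized array) runs both versions and confirms they compute the same b:

```python
N = 101                      # iterations i = 0..100
OFF = 2                      # offset so a[i-2] is a valid index at i = 0

def original():
    a = [0] * (N + OFF)
    b = [0] * N
    for i in range(N):
        a[i + OFF] = a[i + OFF - 2] + 3   # a[i] = a[i-2] + 3
        b[i] = a[i + OFF] * 5             # b[i] = a[i] * 5
    return b

def retimed():
    a = [0] * (N + OFF)
    b = [0] * N
    a[OFF] = a[OFF - 2] + 3               # prologue: a[0] = a[-2] + 3
    for i in range(N - 1):                # i = 0..99
        a[i + 1 + OFF] = a[i - 1 + OFF] + 3   # next iteration's node A
        b[i] = a[i + OFF] * 5                 # current iteration's node B
    b[N - 1] = a[N - 1 + OFF] * 5         # epilogue: b[100] = a[100] * 5
    return b

assert original() == retimed()
```

Inside the retimed loop the two statements are independent: node B reads a value of a that the prologue or an earlier iteration already produced, so A and B can execute in parallel.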
Let r(n) be the retiming value for a node n in the graph and d(e) the delay of an edge e connecting two nodes u and v. The retiming function then modifies the graph such that:

    d_r(e) = r(u) - r(v) + d(e)

where d_r(e) is the delay of edge e after retiming. In this case, retiming node A by 1 and node B by 0 yields the graph in figure 4.
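As a small illustration (a hypothetical Python helper, not part of the paper), the retiming function can be applied edge by edge; below, the A to B edge of the example, together with the self-dependence of A (a[i] uses a[i-2]), is retimed with r(A) = 1 and r(B) = 0:

```python
def retime(edges, r):
    """Apply d_r(e) = r(u) - r(v) + d(e) to each edge (u, v, d)."""
    return [(u, v, r[u] - r[v] + d) for (u, v, d) in edges]

# A -> B with no delay, plus the self-dependence A -> A with delay 2.
edges = [("A", "B", 0), ("A", "A", 2)]
print(retime(edges, {"A": 1, "B": 0}))
# [('A', 'B', 1), ('A', 'A', 2)] -- the A->B edge now carries one delay,
# so B uses the A result of the previous iteration.
```

Self-loops are unchanged by any retiming, since r(u) - r(u) = 0; only delays between distinct nodes are redistributed.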

Figure 3: Graph representation of non-retimed loop body
Figure 4: Graph representation of retimed loop body

3. Problem model

As mentioned earlier, in order to apply retiming transformations, the loop is modeled as a data flow graph. In this paper, modified data flow graphs called conditional data flow graphs are used in order to represent loops that include conditional instructions.

DEFINITION 3.1 A conditional data flow graph (CDFG) G = (V, E, d, t) is a data flow graph in which nodes are differentiated by a function t according to their type into control nodes and regular ones.

In a conditional data flow graph, dependencies are either data or control dependencies. In a data dependence, the result of the destination instruction depends on the result of the origin instruction. In a control dependence, the execution of the destination instruction depends on the decision result of the origin instruction. In the figures, control instructions are represented by diamond-shaped symbols, while all others are represented by circles. For example, the structure in the loop

for i = 0 to 9
  if (x[i] = 0)
    a[i] = a[i-1] + 3   /* instruction A */
    a[i] = a[i-1] - 5   /* instruction B */

is represented by the graph in figure 5.

Figure 5: Conditional data flow graph

In this paper, only unit time CDFGs are considered. A unit time CDFG is a DFG in which the computation times for all nodes are identical. General time CDFGs will be considered in future research. For the sake of convenience, u -e-> v denotes that e is an edge from u to v. Dealing with loop structures, the following definition is needed:

DEFINITION 3.2 An iteration in a CDFG G = (V, E, d, t) is the execution of all nodes v in V exactly once.

Finally, due to the nature of the retiming process, the schedules produced by the process described in this paper are static schedules.

DEFINITION 3.3 A static schedule is a set of assignments of operations to processors in which every operation is assigned to a specific processor at a specific time unit.
4. Predicted if construct

Regardless of the method used, regular branch prediction is bound to yield less than perfect results. However, it is possible to obtain perfect branch prediction using a new construct, the "predicted if" (pif). The new technique is based on the fact that within regular loops, branches usually determine the path to be taken in close proximity to the branch. If, however, the conditional instruction determined the execution of constructs not immediately following it, then those instructions could be prefetched in the correct order. This has been proposed in techniques such as the delayed branch

[11]. In uniform loops that do not have dependencies across iterations, retiming makes it possible to place an entire loop iteration between an if statement and the instruction whose execution it determines. Thus, by the time the "next" instruction needs to be fetched, based on the outcome of the if, that result is already known. As a result, two sets of instructions need to be moved outside the loop: a prologue and an epilogue. The prologue consists of the control and data instructions that are executed so that all the necessary delays are introduced into the structure of the loop. The epilogue wraps up the execution of the loop.

The breakdown of the loop structure results in the need for three new constructs. The first is the predicted if (pif), whose decision is stored for a number of loop iterations, usually one, and is used throughout the restructured loop. At every occurrence of a pif instruction, the previous decision is retrieved from a branch register and used in fetching the target instructions, while the current decision is stored back in the branch register for a future iteration. The second is the auxiliary prediction instruction (paf), whose decision is stored until the first pif is encountered and which is used only in the prologue. This instruction does not cause a branch to be executed; it initializes the branch register. The ultimate predicted if (puf) performs no real action, but acts as a flag marking the location where the control instruction would have been for the instructions in the epilogue. In other words, puf makes no decision, but is used to retrieve the last stored decision and correctly fetch the target instructions.

Directed acyclic graphs do not require data from previous loop iterations in order to execute the control instruction. Therefore, since no cycles are present in the graph, the root node can be viewed as having an incoming edge with an infinite delay.
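The branch-register protocol of the three constructs can be sketched as follows (illustrative Python stand-ins for paf, pif and puf, not a real ISA; the loop body is the test loop used in section 5). paf initializes the register, each pif retrieves the previous decision while storing the next one, and puf consumes the last stored decision, so every retrieved decision is exact:

```python
def run(x):
    """Execute 'if (x[i] < 0) y[i]=y[i-1]+7 else y[i]=y[i-1]-5' for
    i = 1..len(x)-1, with each branch decision computed one iteration early
    and kept in a single-entry branch register."""
    n = len(x) - 1
    y = [0] * (n + 1)
    hits = 0
    branch_reg = (x[1] < 0)            # paf: store decision for iteration 1
    for i in range(1, n):
        taken = branch_reg             # pif: retrieve the stored decision...
        branch_reg = (x[i + 1] < 0)    # ...and store the next iteration's
        hits += (taken == (x[i] < 0))  # retrieved decision is always exact
        y[i] = y[i - 1] + 7 if taken else y[i - 1] - 5
    taken = branch_reg                 # puf: consume the last stored decision
    y[n] = y[n - 1] + 7 if taken else y[n - 1] - 5
    return y, hits

# Alternating signs: the worst case for history-based predictors.
x = [(-1) ** i for i in range(11)]     # 1, -1, 1, -1, ...
y, hits = run(x)
print(hits)                            # 9: all 9 in-loop decisions correct
```

Because the register holds a decision that was actually computed, not guessed, accuracy does not depend on the branch history at all.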
In the examples shown, the root of the graph is the control node. Retiming that control node by 1, the first control statement is executed prior to the start of the loop (paf) and each subsequent iteration executes the control statement for the next iteration (pif) in parallel with the branch sequence of the current iteration. Thus, the result of the control statement is always known one iteration before it is needed. In the example of the following section, the predicted decision is used in all but the last iteration of the loop.

The process of determining the retiming values that optimize the branch prediction can be summarized in the following algorithm.

Algorithm Predict
Input:  DAG G = (V, E, d, t)
Output: retimed G
Begin
  Q <- empty set
  for all v in V
    level(v) <- 0
  while V is not empty
    for all v in V
      if indegree(v) = 0 then
        Q <- Q + {v}
        V <- V - {v}
      endif
    while Q is not empty
      get(Q, u)
      if t(u) = condition then
        increment <- 1
      else
        increment <- 0
      endif
      for all v in V such that u -e-> v
        level(v) <- level(v) + increment
        indegree(v) <- indegree(v) - 1
    endwhile
  endwhile
  for all v in V
    retiming(v) <- max_i(level(i)) - level(v)
End

At the termination of this algorithm, the nodes whose level is greater than 0 are the nodes that constitute the prologue. The level of each node indicates the number of instances of that node that will be present in the prologue. As indicated above, all the nodes of the prologue are pifs. The number of puf statements at the end of the loop is given, for each conditional statement, by the maximum level minus its own level.

5. Example

In this section, a simple example is discussed in order to show the final form of the code. The example consists of an initial loop, which assigns alternating values to an array that is later tested in a second loop. This alternation is one of the worst situations faced by branch prediction schemes, since the history of the conditions is not useful in the prediction process. The following code segment represents such an example.
/* initialize array */
x[0] = 1
for i = 1 to 10 do
  x[i] = x[i-1] * (-1)

/* test the array values */

for i = 1 to 10 do
  if (x[i] < 0)
    y[i] = y[i-1] + 7
  else
    y[i] = y[i-1] - 5

Using the retiming method proposed in this paper and retiming the conditional loop, the following code is obtained:

/* initialize array */
x[0] = 1
for i = 1 to 10 do
  x[i] = x[i-1] * (-1)

/* test the array values */

/* prologue */
paf (x[1] < 0)

/* modified loop */
for i = 1 to 9 do
  pif (x[i+1] < 0)
    y[i] = y[i-1] + 7
  else
    y[i] = y[i-1] - 5

/* epilogue */
puf (x[10] < 0)
  y[10] = y[9] + 7
else
  y[10] = y[9] - 5

The application of this transformation to the given example results in 100% accuracy for the execution of the if statements. The use of a static branch predictor would result in an accuracy of 50%, while a dynamic branch prediction scheme using a 1-bit history would result in 0% correctness. Predication is not considered a viable scheme because it would require extra resources to be competitive in performance.

6. Summary

One of the major remaining problems in obtaining optimal performance from a processor is the inability to predict with certainty the results of control instructions. To provide a possible solution to this problem, this paper has presented three new instructions: the auxiliary predicted if (paf), the predicted if (pif) and the ultimate predicted if (puf). These instructions, used in conjunction with the technique of retiming, can obtain perfect branch prediction in loops with no dependencies across iterations. These techniques are also viable for use in nested one-dimensional loop structures and multi-dimensional cases.

7. Acknowledgements

This work was supported by the National Science Foundation under Grant No. MIP

References

[1] A. Aiken and A. Nicolau, "Resource-Constrained Software Pipelining," IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 12, December 1995, pp.
[2] B. K. Bray and M. J. Flynn, "Strategies for Branch Target Buffers," Proceedings of the 24th Microarchitecture, November 1991, pp.

[3] C. Burch, "PA-8000: A Case Study of Static and Dynamic Branch Prediction," Proceedings of the International Conference on Computer Design, 1997, pp.
[4] P. Chang and U. Banerjee, "Profile-Guided Multi-heuristic Branch Prediction," Proceedings of the 1995 International Conference on Parallel Processing, Vol. 1, 1995, pp.
[5] L.-F. Chao and E. H.-M. Sha, "Static Scheduling of Uniform Nested Loops," Proceedings of the 7th International Parallel Processing Symposium, Newport Beach, CA, April 1993, pp.
[6] I.-C. K. Chen et al., "Design Optimization for High-speed Per-address Two-level Branch Predictors," Proceedings of the International Conference on Computer Design, 1997, pp.
[7] T. M. Conte, B. A. Patel, and J. S. Cox, "Using Branch Handling Hardware to Support Profile-Driven Optimization," Proceedings of the 27th Microarchitecture, November 1994, pp.
[8] H. L. Dershem and M. J. Jipping, Programming Languages: Structures and Models, PWS Publishing Company.
[9] P. K. Dubey and R. Nair, "Profile-Driven Generation of Trace Samples," Proceedings of the International Conference on Computer Design, 1996, pp.
[10] P. K. Dubey and T. J. Watson, "Single-Thread ILP Limits and Compile-time Multithread Speculation," 1995 International Conference on Parallel Processing, Workshop on Challenges for Parallel Processors, 1995, pp.
[11] J. L. Hennessy and D. A. Patterson, Computer Organization & Design: The Hardware/Software Interface, Morgan Kaufmann Publishers Inc.
[12] C. E. Leiserson and J. B. Saxe, "Optimizing Synchronous Systems," Journal of VLSI and Computer Systems, Vol. 1, No. 1, 1983, pp.
[13] L.-S. Liu, C.-W. Ho and J.-P. Sheu, "On the Parallelism of Nested For-Loops Using Index Shift Method," International Conference on Parallel Processing, 1990, pp.
[14] Y. Liu and D. R. Kaeli, "Branch-Directed and Stride-Based Data Cache Prefetching," Proceedings of the International Conference on Computer Design, 1996, pp.
[15] S. A. Mahlke et al., "Characterizing the Impact of Predicated Execution on Branch Prediction," Proceedings of the 27th ACM/IEEE International Symposium on Microarchitecture, November 1994, pp.
[16] M. Sakamoto et al., "Microarchitecture Support for Reducing Branch Penalty in a Superscalar Processor," Proceedings of the International Conference on Computer Design, 1996, pp.
[17] Z. Tang et al., "GPMB: Software Pipelining Branch-Intensive Loops," Proceedings of the 26th Microarchitecture, November 1994, pp.
[18] M. D. Tarlescu, K. B. Theobald and G. R. Gao, "Elastic History Buffer: A Low-Cost Method to Improve Branch Prediction Accuracy," Proceedings of the International Conference on Computer Design, 1997, pp.
[19] G. S. Tyson, "The Effects of Predicated Execution on Branch Prediction," Proceedings of the 27th ACM/IEEE International Symposium on Microarchitecture, November 1994, pp.
[20] J. P. Vogel and B. K. Holmer, "Analysis of the Conditional Skip Instructions of the HP Precision Architecture," Proceedings of the 27th Microarchitecture, November 1994, pp.
[21] T.-Y. Yeh and Y. N. Patt, "Two-Level Adaptive Branch Prediction," Proceedings of the 24th Microarchitecture, 1991, pp.


Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Predicated Software Pipelining Technique for Loops with Conditions

Predicated Software Pipelining Technique for Loops with Conditions Predicated Software Pipelining Technique for Loops with Conditions Dragan Milicev and Zoran Jovanovic University of Belgrade E-mail: emiliced@ubbg.etf.bg.ac.yu Abstract An effort to formalize the process

More information

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency.

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency. Lesson 1 Course Notes Review of Computer Architecture Embedded Systems ideal: low power, low cost, high performance Overview of VLIW and ILP What is ILP? It can be seen in: Superscalar In Order Processors

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction Looking for Instruction Level Parallelism (ILP) Branch Prediction We want to identify and exploit ILP instructions that can potentially be executed at the same time. Branches are 5-20% of instructions

More information

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx

Practice Problems (Con t) The ALU performs operation x and puts the result in the RR The ALU operand Register B is loaded with the contents of Rx Microprogram Control Practice Problems (Con t) The following microinstructions are supported by each CW in the CS: RR ALU opx RA Rx RB Rx RB IR(adr) Rx RR Rx MDR MDR RR MDR Rx MAR IR(adr) MAR Rx PC IR(adr)

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions. Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation

More information

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

LIMITS OF ILP. B649 Parallel Architectures and Programming

LIMITS OF ILP. B649 Parallel Architectures and Programming LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump

More information

(Refer Slide Time: 00:02:04)

(Refer Slide Time: 00:02:04) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture - 27 Pipelined Processor Design: Handling Control Hazards We have been

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism

Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism Edwin H.-M. Sha Timothy W. O Neil Nelson L. Passos Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science Erik

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 26 Cache Optimization Techniques (Contd.) (Refer

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli

06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli 06-1 Vector Processors, Etc. 06-1 Some material from Appendix B of Hennessy and Patterson. Outline Memory Latency Hiding v. Reduction Program Characteristics Vector Processors Data Prefetch Processor /DRAM

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Out-of-Order Execution II Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15 Video

More information

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information

More information

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted

More information

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L10: Branch Prediction Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab2 and prelim grades Back to the regular office hours 2 1 Overview

More information

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Hardware Loop Buffering

Hardware Loop Buffering Hardware Loop Buffering Scott DiPasquale, Khaled Elmeleegy, C.J. Ganier, Erik Swanson Abstract Several classes of applications can be characterized by repetition of certain behaviors or the regular distribution

More information

Instruction Level Parallelism (Branch Prediction)

Instruction Level Parallelism (Branch Prediction) Instruction Level Parallelism (Branch Prediction) Branch Types Type Direction at fetch time Number of possible next fetch addresses? When is next fetch address resolved? Conditional Unknown 2 Execution

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

Lecture-18 (Cache Optimizations) CS422-Spring

Lecture-18 (Cache Optimizations) CS422-Spring Lecture-18 (Cache Optimizations) CS422-Spring 2018 Biswa@CSE-IITK Compiler Optimizations Loop interchange Merging Loop fusion Blocking Refer H&P: You need it for PA3 and PA4 too. CS422: Spring 2018 Biswabandan

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Using Retiming to Minimize Inter-Iteration Dependencies

Using Retiming to Minimize Inter-Iteration Dependencies Using Retiming to Minimize Inter-Iteration Dependencies Timothy W. O Neil Edwin H.-M. Sha Computer Science Dept. Computer Science Dept. University of Akron Univ. of Texas at Dallas Akron, OH 44325-4002

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Instruction Scheduling. Software Pipelining - 3

Instruction Scheduling. Software Pipelining - 3 Instruction Scheduling and Software Pipelining - 3 Department of Computer Science and Automation Indian Institute of Science Bangalore 560 012 NPTEL Course on Principles of Compiler Design Instruction

More information

Computer System Architecture Final Examination Spring 2002

Computer System Architecture Final Examination Spring 2002 Computer System Architecture 6.823 Final Examination Spring 2002 Name: This is an open book, open notes exam. 180 Minutes 22 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John Computer Performance Evaluation and Benchmarking EE 382M Dr. Lizy Kurian John Desirable features for modeling/evaluation techniques Accurate Not expensive Non-invasive User-friendly Fast Easy to change

More information

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 15-740/18-740 Computer Architecture Lecture 28: Prefetching III and Control Flow Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 Announcements for This Week December 2: Midterm II Comprehensive

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Instruction-Level Parallelism. Instruction Level Parallelism (ILP)

Instruction-Level Parallelism. Instruction Level Parallelism (ILP) Instruction-Level Parallelism CS448 1 Pipelining Instruction Level Parallelism (ILP) Limited form of ILP Overlapping instructions, these instructions can be evaluated in parallel (to some degree) Pipeline

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

A VHDL Design Optimization for Two-Dimensional Filters

A VHDL Design Optimization for Two-Dimensional Filters A VHDL Design Optimization for Two-Dimensional Filters Nelson Luiz Passos Jian Song Robert Light Ranette Halverson Richard Simpson Department of Computer Science Midwestern State University Wichita Falls,

More information

Performance Measures of Superscalar Processor

Performance Measures of Superscalar Processor Performance Measures of Superscalar Processor K.A.Parthasarathy Global Institute of Engineering and Technology Vellore, Tamil Nadu, India ABSTRACT In this paper the author describes about superscalar processor

More information

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog) Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis

More information

Old formulation of branch paths w/o prediction. bne $2,$3,foo subu $3,.. ld $2,..

Old formulation of branch paths w/o prediction. bne $2,$3,foo subu $3,.. ld $2,.. Old formulation of branch paths w/o prediction bne $2,$3,foo subu $3,.. ld $2,.. Cleaner formulation of Branching Front End Back End Instr Cache Guess PC Instr PC dequeue!= dcache arch. PC branch resolution

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking

A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking Bekim Cilku, Daniel Prokesch, Peter Puschner Institute of Computer Engineering Vienna University of Technology

More information

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS

ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS ait: WORST-CASE EXECUTION TIME PREDICTION BY STATIC PROGRAM ANALYSIS Christian Ferdinand and Reinhold Heckmann AbsInt Angewandte Informatik GmbH, Stuhlsatzenhausweg 69, D-66123 Saarbrucken, Germany info@absint.com

More information