On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling


On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling

by

John Michael Chase

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science

Waterloo, Ontario, Canada, 2006

© Michael Chase 2006

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

Modern architectures allow multiple instructions to be issued at once and have other complex features. To account for this, compilers perform instruction scheduling after generating the output code. The instruction scheduling problem is to find an optimal schedule given the limitations and capabilities of the architecture. While this problem can be solved optimally, a greedy algorithm known as list scheduling is used in practice in most production compilers. List scheduling is generally regarded as being near-optimal in practice, provided a good choice of heuristic is used. However, previous work comparing a list scheduler against an optimal scheduler either assumes an idealized architectural model or uses too few test cases to strongly support or refute the assumed near-optimality of list scheduling. It remains an open question whether list scheduling performs well when scheduling for a realistic architectural model.

Using constraint programming, we developed an efficient optimal scheduler capable of scheduling even very large blocks within a popular benchmark suite in a reasonable amount of time. I improved the architectural model and optimal scheduler by allowing for an issue width not equal to the number of functional units, instructions that monopolize the processor for one cycle, and non-fully pipelined instructions. I then evaluated the performance of list scheduling for this more realistic architectural model. I found that when scheduling basic blocks for a realistic architectural model, at most 6% of the schedules produced by a list scheduler are non-optimal, but when scheduling superblocks, at least 40% of the schedules produced by a list scheduler are non-optimal. Furthermore, when the list scheduler and optimal scheduler differed, the optimal scheduler was able to improve schedule cost by at least 5% on average, realizing maximum improvements of 82%. This suggests that list scheduling is a viable solution in practice only when scheduling basic blocks. When scheduling superblocks, the advantage of using a list scheduler is its speed, not the quality of the schedules produced, and alternatives to list scheduling should be considered.

Acknowledgments

I would like to thank my supervisor, Peter van Beek, for his guidance, mentoring, and support throughout my work as an undergraduate and graduate student with him. Thanks to Tyrel Russell and Abid Malik for their many contributions in the discussions we had as a research group. I am also grateful to Farhad Mavadat and Ondrej Lhotak for serving on my committee.

I would also like to thank my family and friends, and especially my wife, Jen, for encouraging and supporting me in my research and for helping me to balance school with other aspects of my life.

This work was made possible by IBM Corp. and by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET).

Dedication

This work is dedicated to Jen, for her constant support and love and for all that she did in order to allow me to pursue this degree.

Contents

1 Introduction
    Contributions
    Overview

2 Background
    Scheduling Units
    Instruction DAGs
    The Instruction Scheduling Problem
    List Scheduling
    Scheduling Heuristics and Features
    Register Pressure
    Constraint Programming

3 Related Work
    Theoretical Results
    Empirical Results
        Idealized Architectures
        Realistic Architectures
        Summary of Empirical Results

4 Local Instruction Scheduling
    Initial Model
    The Constraint Programming Model
        Latency Constraints
        Distance Constraints
        Dominance Constraints
        Predecessor and Successor Constraints
        Functional Unit Constraints
    Architectural Improvements
        Issue Width
        Non-Fully Pipelined Processor
        Serializing Instructions
    Architectural Features Not Modelled
    Evaluation
        Experimental Setup
        The Optimal Scheduler
        Results for Initial Architectural Models
        Results for Improved Architectural Models
        Summary of Results

5 Global Instruction Scheduling
    Initial Model
    Architectural Improvements
    Evaluation
        Experimental Setup
        The Optimal Scheduler
        Results for Initial Architectural Models
        Results for Improved Architectural Models
        Summary of Results

6 Conclusions and Further Work
    Further Work

List of Tables

2.1 Local instruction scheduling heuristics, showing the rank of features for each heuristic. For a pair of instructions, the instruction with the better value for the feature ranked number 1 is selected. If the two instructions have the same value for that feature, the instruction with the better value for the rank 2 feature is selected, and so on.

3.1 Results of experiments comparing list scheduling to optimal methods.

3.2 Architectural models used in experiments comparing list scheduling to optimal methods.

4.1 Architectural models used in scheduling experiments.

4.2 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.3 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.4 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.5 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.6 Local instruction scheduling for the initial architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.7 Local instruction scheduling for the initial architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.8 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.9 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.10 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.11 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.12 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

4.13 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

4.14 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.15 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.16 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.17 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.18 Local instruction scheduling for the improved architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.19 Local instruction scheduling for the improved architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.20 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.21 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.22 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.23 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.24 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

4.25 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

5.1 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.2 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.3 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.4 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.5 Global instruction scheduling for the initial architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.6 Global instruction scheduling for the initial architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.7 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.8 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.9 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.10 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.11 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

5.12 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

5.13 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.14 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.15 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.16 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.17 Global instruction scheduling for the improved architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.18 Global instruction scheduling for the improved architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.19 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.20 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.21 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.22 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.23 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

5.24 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

List of Figures

2.1 (a) DAG for a basic block from the SPEC 2000 compiler benchmark; (b) one possible schedule for a single-issue processor, where NOP denotes a cycle in which no instructions are scheduled; (c) an optimal schedule.

2.2 (a) DAG for a superblock from the SPEC 2000 compiler benchmark; (b) branch probabilities for side exits and the final exit; (c) exit probabilities for side exits and the final exit.

4.1 Example DAG used to illustrate constraints in the initial model. Text beside each node denotes the functional unit as integer (INT) or floating point (FP) and lower and upper bounds.

4.2 Example DAG with additional nodes B_1 and B_2 corresponding to pipeline variables.

4.3 Example DAG with serial instruction B.

Chapter 1

Introduction

Modern architectures are capable of issuing and executing multiple instructions at once, and have many other interesting properties. The code generation phase of a compiler produces a straight-line sequence of code, which must be scheduled in order to execute multiple instructions at once and take advantage of other architectural features. This process, known as instruction scheduling, is performed in practice by a non-optimal, greedy algorithm known as the list scheduling algorithm. List scheduling is widely assumed to be nearly optimal given an appropriate choice of heuristic with which to rank instructions that are ready to be issued. However, previous work comparing the list scheduling algorithm against an optimal scheduler either uses a small set of test data, producing inconclusive results, or makes many simplifying assumptions about the architectural model on which the scheduled code will be executed. It is an open question whether list scheduling is nearly optimal when scheduling for a realistic architectural model. Especially for embedded processors with limited computing power, it is essential that the compiled code be as efficient as possible. If the list scheduling algorithm does not provide good enough schedules, it may be necessary to consider other scheduling algorithms.

In this thesis, I take an optimal scheduler that initially assumes the same simple architectural model as list scheduling and improve the model, making it more realistic. I then compare the schedules produced by the list scheduling algorithm against those produced by the optimal scheduler. By using a realistic architectural model and evaluating with a sufficiently large set of test data, I provide conclusive evidence that speaks to the near-optimality of list scheduling. When scheduling basic blocks, the list scheduling algorithm produces optimal schedules at least 94%

of the time for the target architectures on which it was evaluated. When scheduling superblocks, list scheduling produces optimal schedules between 47% and 60% of the time. The difference between the list scheduler and the optimal scheduler is also significant: improved blocks cost between 5% and 8% less on average, with a maximum improvement of 82%. These results suggest that list scheduling performs sufficiently well on more complex architectures for local instruction scheduling only, and that global instruction scheduling algorithms must be given further consideration in order to produce better overall schedules.

1.1 Contributions

The most significant contribution of this thesis is to show that list scheduling is far from optimal when scheduling superblocks for a complex architectural model, although list scheduling basic blocks produces near-optimal results for a complex architectural model. When scheduling basic blocks, the list scheduling algorithm was optimal over 98% of the time for an idealized architectural model and over 94% of the time for a realistic architectural model. However, the list scheduler performs poorly when scheduling superblocks for a realistic architectural model, with between 39.9% and 52.3% of the schedules produced being non-optimal and an average improvement of 5.3% to 8.1%. These results are important because superblocks are frequently executed sequences of instructions, and any speed improvement on these will almost certainly result in a speed improvement for the whole program.

This thesis contains other relevant contributions. When comparing a heuristic given by Shieh and Papachristou [35] against the more common critical-path heuristic, Shieh and Papachristou's heuristic generally performed better.
As both heuristics use critical path as the primary feature, this suggests that the choice of secondary features is important to instruction scheduling heuristics, which motivates a re-evaluation of existing heuristics. There is also little difference in register usage between the schedules produced by a list scheduler and those produced by our optimal scheduler. This is significant because it shows that list scheduling not only produces near-optimal schedules, it also produces schedules with register requirements similar to those of an optimal scheduler, so there is little to gain from using an optimal scheduler in this respect.
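The notion of primary and secondary features can be made concrete: a list scheduling heuristic typically compares two ready instructions lexicographically over a ranked list of features, falling through to the next rank only on a tie. The sketch below is a hypothetical illustration of this comparison scheme; the feature names and data layout are inventions for the example, not taken from the thesis.

```python
def better(instr_a, instr_b, features):
    """Return the preferred instruction under a ranked feature list.

    features is a sequence of (key_function, prefer_larger) pairs,
    ordered from the rank-1 feature downward; ties on one feature fall
    through to the next rank, as in the heuristics discussed above.
    """
    for key, prefer_larger in features:
        va, vb = key(instr_a), key(instr_b)
        if va != vb:
            if prefer_larger:
                return instr_a if va > vb else instr_b
            return instr_a if va < vb else instr_b
    return instr_a  # complete tie: keep the first instruction

# Illustrative ranking: critical-path distance first, then number of
# successors in the DAG (both hypothetical field names).
RANKING = [
    (lambda i: i["critical_path"], True),
    (lambda i: i["successors"], True),
]
```

A heuristic with a different secondary feature changes only the second entry of the ranking, which is what makes such comparisons between heuristics easy to set up.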

1.2 Overview

This thesis is divided into six chapters. In Chapter 2, I formalize the instruction scheduling problem and the list scheduling algorithm, present directed acyclic graphs for instruction scheduling, discuss heuristics and features for list scheduling, and provide an introduction to constraint programming. Chapter 3 surveys work evaluating the optimality of the list scheduling algorithm. In Chapter 4, I examine instruction scheduling for basic blocks, presenting the initial architectural model as well as several improvements to it. I compare the list scheduler to an optimal scheduler for several architectures using both the initial and improved architectural models in order to evaluate the optimality of the list scheduling algorithm. In Chapter 5, I repeat the work of Chapter 4, but instead consider instruction scheduling for superblocks. I conclude in Chapter 6, summarizing the results of this thesis and discussing possibilities for future work.

Chapter 2

Background

The code generation phase in a compiler typically produces a straight-line sequence of code. It is often followed by an instruction scheduling phase, which rearranges the code to achieve better performance on modern architectures. Modern architectures have multiple functional units: processing units dedicated to a specific type of instruction, such as integer or branch instructions. In addition, some architectures allow multiple instructions to be issued in a single cycle. The maximum number of instructions that can be issued in one cycle is known as the processor's issue width. The issue width for a particular architecture must be less than or equal to the number of functional units.

Each instruction on a particular architecture has an execution time and a latency. Execution time is the number of cycles for which an instruction locks up a functional unit to the exclusion of all other instructions. The latency of an instruction is the number of cycles needed after an instruction is issued before its result is available to other instructions. An instruction's latency is always greater than or equal to its execution time. An instruction with an execution time of 1 is said to be fully pipelined: the functional unit is occupied only for the cycle in which the instruction is issued. Instructions with execution times greater than one are said to be not fully pipelined. Modern architectures are pipelined: an instruction with a latency greater than one need not tie up the functional unit on which it is executing until its result is ready [18]. Architectures use pipelining to overlap the execution of instructions. If an instruction with an execution time of 2 and a latency of 5 is issued in cycle 1, another instruction can be issued on the same functional unit as early as cycle 3,

even though the result from the first instruction is not available for use until cycle 6.

As an example of an architecture with all of these properties, consider the PowerPC 604 [19]. It has six functional units: a branch unit, two integer units for simple instructions, an integer unit for more complex instructions, a floating-point unit, and a load/store unit for data transfer to and from memory. The PowerPC 604 has an issue width of 4, so not every functional unit begins executing a new instruction every cycle. Most instructions are fully pipelined, and thus the PowerPC 604 can often dispatch a new instruction for execution each cycle on the same functional unit. However, some instructions are not fully pipelined and monopolize a functional unit for the entire duration of their execution. One such example for the PowerPC architecture is divw, one of several integer division instructions. It has an execution time of 19 cycles on the PowerPC 604 and a latency of 20 cycles. There is thus one cycle after the issue of a divw instruction during which the instruction is no longer executing but its result is not yet usable.

2.1 Scheduling Units

The most common straight-line unit of code is the basic block, a code sequence having a single entry point and a single exit point [29], where the only branching that occurs is a branch from the exit instruction to another basic block, or to the same basic block in order to create a loop. Not all compilers produce basic blocks having a single entry point, but this can be remedied by inserting a dummy instruction with zero execution time and forcing all existing entry-point instructions to follow the dummy instruction. A similar method can be used to ensure that basic blocks all have a single exit point. The other common scheduling unit used in compilers is the superblock [11, 20].
A superblock consists of a series of basic blocks B_0, B_1, ..., B_n such that:

- Each basic block either branches out of the superblock or branches to the next basic block in sequence (i.e., for k = 0, ..., n - 1, B_k branches out of the superblock or branches to B_{k+1}).

- There are no branches from any block B_i to any other block B_j, for j ≠ 0, other than the branch from B_i to B_{i+1} (i.e., there are no cycles in the superblock that do not pass through block B_0).

- There are no branches into any basic block within the superblock except for branches to B_0.

These disallowed branches are known as side entrances. Branch instructions in a superblock that both transfer control out of the superblock and occur before the final instruction are known as side exits, and the final instruction in the superblock is known as the final exit.

Superblock scheduling often makes use of profiling information for the program being compiled. Each side exit in a superblock can be assigned a probability that the side exit is taken. If the side exit is not taken, control proceeds to the following basic block. The final exit is always taken if it is reached. Branch probabilities are given in the form branch(i) = (P(i), 1 - P(i)), for 0 ≤ P(i) ≤ 1, where P(i) is the fraction of times the branch was taken given that the branch condition is evaluated. In other words, if a side exit has branch probability (0.3, 0.7), then the branch is taken 30% of the time the branch instruction is executed and not taken 70% of the time.

The formula for evaluating schedule cost for superblocks given in Section 2.3 requires a cost coefficient for each exit i. This coefficient is the probability that branch i is taken given that branches 0, ..., i - 1 were not taken. For example, suppose a superblock has two side exits. The first has branch probability (0.2, 0.8), and the second has branch probability (0.6, 0.4). The cost coefficient for the first side exit is 0.2, since there is a 20% chance the first branch is taken. The cost coefficient for the second side exit is 0.8 × 0.6 = 0.48, since the first branch is not taken 80% of the time and the second branch is taken 60% of the time. The final exit is always taken if it is reached, and so its cost coefficient is 0.8 × 0.4 = 0.32: a side exit is taken 68% of the time, and so the final exit must be taken the remaining 32% of the time.
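The cost coefficient calculation in the worked example above can be written as a short routine. This is a sketch, and the function name is mine, not the thesis's.

```python
# Each side exit i has a branch probability P(i): the probability the branch
# is taken given that it is executed. Its cost coefficient is the probability
# it is taken given that no earlier branch was taken; the final exit absorbs
# whatever probability mass remains.

def exit_probabilities(branch_probs):
    """branch_probs: P(taken) for each side exit, in order.
    Returns cost coefficients for the side exits plus the final exit."""
    coefficients = []
    not_taken_so_far = 1.0
    for p in branch_probs:
        coefficients.append(not_taken_so_far * p)
        not_taken_so_far *= (1.0 - p)
    coefficients.append(not_taken_so_far)  # final exit: always taken if reached
    return coefficients

# The worked example: side exits with branch probabilities 0.2 and 0.6.
coeffs = exit_probabilities([0.2, 0.6])
# → approximately [0.2, 0.48, 0.32], up to floating-point rounding
```

The coefficients always sum to 1, since control must leave the superblock through exactly one exit.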
Throughout this thesis, I refer to the probability that a branch is taken given that the branch instruction is executed as the branch probability, and the probability that a branch is taken given that all previous branches in the superblock were not taken as the exit probability. I use branch(i) and exit(i) to denote the branch and exit probabilities, respectively, for instruction i.

While basic blocks are considered to be local scheduling units and superblocks are labelled as global scheduling units, the latter are not global in the sense of incorporating an entire program. Superblocks are formed by first finding a trace [12]: a region of instructions identical to a superblock except that side entrances are permitted. To eliminate side entrances from the trace, the code following a side entrance is duplicated, a technique known as tail duplication [20]. The side entrances branch to the old copy of the code, and the trace branches to the new copy of the code. When all side entrances are removed in this manner, the trace is now a

superblock. Superblocks are preferable to traces in that traces require a significantly more complex compiler implementation in order to deal with side entrances, while superblocks do not [11]. For simplicity, this thesis refers to a block when the context allows for both basic blocks and superblocks without distinction.

2.2 Instruction DAGs

A common conceptual view of an instruction scheduling problem is a directed acyclic graph, or DAG (see [29]), also known as a Program Dependence Graph or Data Dependence Graph in the compiler literature. Figure 2.1 shows a sample DAG from a basic block selected from our testing data, and Figure 2.2 shows a DAG for a superblock, also selected from testing data. In an instruction DAG, each node corresponds to an instruction from the original straight-line basic block or superblock. Nodes are labeled alphabetically, beginning with A. Nodes are also assigned a sequence number 1 through n, with node i being the i-th instruction in the order given in the original block. If instruction i must be completed before instruction j in the original block, an edge is added between nodes i and j. This is known as a precedence constraint: i has precedence over j, and must be scheduled first in any correct schedule.

[Figure 2.1: (a) DAG for a basic block from the SPEC 2000 compiler benchmark; (b) One possible schedule for a single-issue processor, where NOP denotes a cycle in which no instructions are scheduled; (c) An optimal schedule.]

Edges in the DAG are also assigned a weight. If an instruction j must execute at least l(i, j) cycles after instruction i begins execution, edge (i, j) is assigned weight l(i, j). This is known as a latency constraint, and l(i, j) is said to be the latency [29] of instruction i. If l(i, j) = 1, instruction j can begin execution in the cycle immediately following instruction i. Other instructions can be issued between the cycles in which instructions i and j begin, provided all constraints are satisfied.

[Figure 2.2: (a) DAG for a superblock from the SPEC 2000 compiler benchmark; (b) Branch probabilities for side exits and the final exit; (c) Exit probabilities for side exits and the final exit.]

On the PowerPC 604 architecture, most instructions have a latency of 1 or 2, and the maximum latency of any instruction is 32. Only floating point instructions have a latency over 10; the majority of instructions produce their results in very few cycles. However, many edges with long-running instructions as the source node, including the 32-cycle fdiv instruction, have low latencies. This phenomenon is explained by Smotherman et al. [37]. Suppose an instruction i has an execution time of 3 cycles and a latency of 4 cycles. Any instruction j needing to read the result of i must wait at least 4 cycles after i begins execution to ensure that i has written its result. This is known as a read-after-write or RAW dependency, and edge (i, j) would have weight 4. Suppose instead that j writes to a register r that
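To illustrate how latency constraints force stall cycles, here is a minimal single-issue simulator (a sketch, not the thesis's scheduler). The edge weights below are my reading of the basic-block DAG in Figure 2.1: edges B→C and A→D with weight 2, and C→D with weight 1.

```python
def schedule_length(order, latency_edges):
    """Issue instructions in the given (topological) order on a single-issue,
    fully pipelined machine, delaying each instruction until all latency
    constraints from its predecessors are satisfied. Returns the cycle in
    which the last instruction issues, i.e., the schedule length."""
    issue = {}
    cycle = 0
    for instr in order:
        earliest = cycle + 1  # at most one instruction issues per cycle
        for (i, j), l in latency_edges.items():
            if j == instr and i in issue:
                earliest = max(earliest, issue[i] + l)
        issue[instr] = earliest
        cycle = earliest
    return cycle

# Assumed edge weights for the DAG of Figure 2.1.
edges = {("B", "C"): 2, ("A", "D"): 2, ("C", "D"): 1}

# Issuing A before B forces a NOP in cycle 3 (schedule (b), length 5);
# issuing B first yields the optimal 4-cycle schedule (c).
assert schedule_length(["A", "B", "C", "D"], edges) == 5
assert schedule_length(["B", "A", "C", "D"], edges) == 4
```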

i must read. If i takes only one cycle to read from r, j may be issued in the cycle following i, and so edge (i, j) would have weight 1: this is known as a write-after-read or WAR dependency. Edges may not reflect the true latency of an instruction for other reasons as well, and it is not the case that every edge in a DAG has the latency of the first node as its weight.

There is also one other type of scheduling constraint, one that is not explicitly captured in a DAG because such constraints are general to the target architecture rather than specific to a particular block. Resource constraints [29] are any constraints caused by the number and type of resources on a processor. For example, if a processor has two integer units, at most two integer instructions can begin execution each cycle. Resource constraints cannot be reflected in a DAG since a family of architectures may have the same latencies for each instruction but different resources. Thus, if a DAG is scheduled on several different architectures, schedules of different lengths may be produced.

2.3 The Instruction Scheduling Problem

To make the most of the advanced features of modern architectures, compilers perform instruction scheduling as an optimization after code generation. The goal of instruction scheduling is to find a minimum-cost schedule for a straight-line sequence of code, subject to several types of constraints [11, 29]. I present two versions of the instruction scheduling problem: the local instruction scheduling problem for basic blocks and the global instruction scheduling problem for superblocks. As mentioned in Section 2.1, the latter is not truly global. As the two differ only slightly, I first give a general version of the instruction scheduling problem, adapted from [38], and refine it to account for the differences between the two scheduling problems.
The definition of the instruction scheduling problem makes the assumptions that all instructions are fully pipelined and that either the machine has an infinite number of registers or register allocation has taken place before instruction scheduling.

Definition 1 (Instruction Scheduling Problem) Consider a DAG G = (V, E) representing a block, where each edge (i, j) has weight l(i, j). The target architecture has a global issue width W and a set of functional units U. There is a set T of types of functional units, and each unit is of some type t ∈ T. There are f(t) functional units of type t ∈ T, and each instruction i is of type u(i), where u(i) ∈ T. Let y_ik be a binary variable that takes on the value 1 if and only if instruction i is scheduled in cycle k. A feasible schedule S specifies an issue time

S(i) for all instructions i. S(i) and y_ik are related: for all i and k, y_ik = 1 iff S(i) = k. S must also satisfy the following constraints:

    for all cycles k:  Σ_{i=1..n} y_ik ≤ W    (global issue width constraint)

    for all cycles k and all t ∈ T:  Σ_{i : u(i)=t} y_ik ≤ f(t)    (functional unit constraints)

    for all (i, j) ∈ E:  S(j) ≥ S(i) + l(i, j)    (latency constraints)

The Instruction Scheduling Problem is to find a schedule S for which a cost function cost(S) is minimized.

There are several cost measures for the instruction scheduling problem. The actual execution time on a physical architecture is one such measure, although it is hard to evaluate while scheduling is performed. For basic blocks, the common metric is schedule length. The sooner all instructions in a DAG finish executing, the shorter the schedule will be. Schedule length can be thought of as the latest cycle in which an instruction is issued: that is, cost(S) = max_{i ∈ V} S(i). Figure 2.1 (b) shows a valid schedule for the DAG in Figure 2.1 (a), but schedule (b) is clearly suboptimal, as a better schedule is presented in Figure 2.1 (c).

Definition 2 (Local Instruction Scheduling Problem) The Local Instruction Scheduling Problem is to solve the Instruction Scheduling Problem where the DAG G represents a basic block, minimizing cost function cost(S) = max_{i ∈ V} S(i).

When scheduling superblocks, the measure of evaluation is the expected number of cycles executed within the superblock before a branch is taken or the final exit is reached.

Definition 3 (Superblock Schedule Cost) Let exit(i) be the exit probability for node i in the DAG, where 0 ≤ exit(i) ≤ 1 and exit(i) = 0 when node i is not a side exit or the final exit. For a schedule S, cost(S) = Σ_{i=1..n} exit(i) · S(i).

Definition 4 (Global Instruction Scheduling Problem) The Global Instruction Scheduling Problem is to solve the Instruction Scheduling Problem where the DAG G represents a superblock, minimizing cost function cost(S) = Σ_{i=1..n} exit(i) · S(i).
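The constraints of Definition 1 and the cost functions of Definitions 2 through 4 translate directly into code. This is a sketch with names of my own choosing, not the thesis's implementation.

```python
from collections import Counter

def is_feasible(S, edges, unit_type, num_units, issue_width):
    """S: instruction -> issue cycle; edges: (i, j) -> l(i, j);
    unit_type: instruction -> functional-unit type; num_units: t -> f(t);
    issue_width: the global issue width W."""
    # Global issue width constraint: at most W instructions per cycle.
    per_cycle = Counter(S.values())
    if any(n > issue_width for n in per_cycle.values()):
        return False
    # Functional unit constraints: at most f(t) instructions of type t
    # may begin execution in any one cycle.
    per_cycle_type = Counter((S[i], unit_type[i]) for i in S)
    if any(n > num_units[t] for (_, t), n in per_cycle_type.items()):
        return False
    # Latency constraints: S(j) >= S(i) + l(i, j) for every DAG edge.
    return all(S[j] >= S[i] + l for (i, j), l in edges.items())

def basic_block_cost(S):
    """Definition 2: schedule length, the latest issue cycle."""
    return max(S.values())

def superblock_cost(S, exit_prob):
    """Definition 3: expected cycles, weighted by exit probabilities."""
    return sum(exit_prob.get(i, 0.0) * S[i] for i in S)
```

For example, the optimal schedule of Figure 2.1 (c), expressed as {"B": 1, "A": 2, "C": 3, "D": 4}, is feasible on a single-issue machine with one integer unit and has basic block cost 4.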

2.4 List Scheduling

Finding an optimal solution to both the local and global instruction scheduling problems is NP-complete [17]. The emphasis of scheduling research has therefore been on approximation algorithms. Hennessy and Gross [17] made an early attempt at developing a polynomial algorithm for instruction scheduling. Their algorithm had a worst-case runtime of O(n^4). Gibbons and Muchnick [14] were able to refine the algorithm to provide a worst-case runtime of O(n^2), although the quality of the schedules produced is not necessarily as good. This refined algorithm, known as the list scheduling algorithm, has become the most popular instruction scheduling algorithm, and is used almost exclusively in production compilers.

The list scheduling algorithm is so called because of its use of a ready list. The algorithm iterates through machine cycles sequentially, and at each cycle it populates the ready list with the set of all candidate instructions that could begin execution in the current cycle. It then selects the best instructions from the ready list to begin execution in the current cycle, subject to resource constraints [29]. The algorithm also makes use of an execution list: whenever an instruction is issued, it is placed on the execution list, a list of all instructions currently being executed. When an instruction i finishes executing, any successor j of i becomes a candidate instruction as long as all other predecessors of j have also finished executing. The execution list is used to easily identify instructions that finish executing, so it can quickly be determined whether j may be added to the ready list. Selection is made according to a heuristic independent of the core list scheduling algorithm.

In Algorithm 1, I present a formal representation of the list scheduling algorithm, adapted from the presentation in [34]. The method selectBestInstruction encapsulates the particular heuristic used in an implementation of list scheduling.
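The ready-list loop just described can be sketched as follows. This is not a reproduction of Algorithm 1: it assumes a fully pipelined machine, omits functional-unit types for brevity, and uses a critical-path priority as one common choice of heuristic.

```python
def critical_path(node, edges, memo=None):
    """Heuristic priority: longest latency-weighted path from node to a leaf."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = max((l + critical_path(j, edges, memo)
                          for (i, j), l in edges.items() if i == node),
                         default=0)
    return memo[node]

def list_schedule(nodes, edges, issue_width):
    """edges: (i, j) -> l(i, j). Returns instruction -> issue cycle."""
    preds = {n: [(i, l) for (i, j), l in edges.items() if j == n] for n in nodes}
    S, cycle, memo = {}, 0, {}
    while len(S) < len(nodes):
        cycle += 1
        # Populate the ready list: unscheduled instructions all of whose
        # predecessors have issued long enough ago to satisfy latencies.
        ready = [n for n in nodes if n not in S and
                 all(i in S and S[i] + l <= cycle for i, l in preds[n])]
        # Issue up to issue_width instructions, best priority first.
        ready.sort(key=lambda n: critical_path(n, edges, memo), reverse=True)
        for n in ready[:issue_width]:
            S[n] = cycle
    return S

# On the DAG of Figure 2.1 (assumed edge weights), a single-issue list
# scheduler with this heuristic issues B first and finds the optimal schedule.
edges = {("B", "C"): 2, ("A", "D"): 2, ("C", "D"): 1}
schedule = list_schedule(["A", "B", "C", "D"], edges, issue_width=1)
```

Checking readiness against issue times and latencies, as above, plays the role the execution list plays in the thesis's presentation: both exist to decide cheaply when a successor may enter the ready list.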
For each instruction in the ready list, in descending order of priority according to the heuristic, it determines whether or not the instruction can be executed in the current cycle. In other words, it checks whether there is an available functional unit of the instruction's type and whether the number of instructions already scheduled to begin in the current cycle is less than the global issue width. If there is an available functional unit and the global issue width will not be exceeded by issuing the current instruction, that instruction is returned. If there is no instruction on the ready list that can begin execution in the current cycle, the selection method returns null.

It is worth pointing out that the list scheduling algorithm pays no attention to any other architectural features or hazards besides available functional units and issue width. In particular, instruction cache and data cache misses are ignored,


Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined

More information

IBM. Enterprise Systems Architecture/ Extended Configuration Principles of Operation. z/vm. Version 6 Release 4 SC

IBM. Enterprise Systems Architecture/ Extended Configuration Principles of Operation. z/vm. Version 6 Release 4 SC z/vm IBM Enterprise Systems Architecture/ Extended Configuration Principles of Operation Version 6 Release 4 SC24-6192-01 Note: Before you use this information and the product it supports, read the information

More information

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Traditional Three-pass Compiler

More information

COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY. Thesis by. Peter Anthony Leong. In Partial Fulfillment of the Requirements

COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY. Thesis by. Peter Anthony Leong. In Partial Fulfillment of the Requirements COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY Thesis by Peter Anthony Leong In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy California Institute

More information

PROBLEM SOLVING WITH FORTRAN 90

PROBLEM SOLVING WITH FORTRAN 90 David R. Brooks PROBLEM SOLVING WITH FORTRAN 90 FOR SCIENTISTS AND ENGINEERS Springer Contents Preface v 1.1 Overview for Instructors v 1.1.1 The Case for Fortran 90 vi 1.1.2 Structure of the Text vii

More information

A Note on Scheduling Parallel Unit Jobs on Hypercubes

A Note on Scheduling Parallel Unit Jobs on Hypercubes A Note on Scheduling Parallel Unit Jobs on Hypercubes Ondřej Zajíček Abstract We study the problem of scheduling independent unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between

More information

"Charting the Course to Your Success!" MOC D Querying Microsoft SQL Server Course Summary

Charting the Course to Your Success! MOC D Querying Microsoft SQL Server Course Summary Course Summary Description This 5-day instructor led course provides students with the technical skills required to write basic Transact-SQL queries for Microsoft SQL Server 2014. This course is the foundation

More information

PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE

PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE Yau-Tsun Steven Li Monterey Design Systems, Inc. Sharad Malik Princeton University ~. " SPRINGER

More information

CITY UNIVERSITY OF NEW YORK. Creating a New Project in IRBNet. i. After logging in, click Create New Project on left side of the page.

CITY UNIVERSITY OF NEW YORK. Creating a New Project in IRBNet. i. After logging in, click Create New Project on left side of the page. CITY UNIVERSITY OF NEW YORK Creating a New Project in IRBNet i. After logging in, click Create New Project on left side of the page. ii. Enter the title of the project, the principle investigator s (PI)

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

ALGORITHMIC ASPECTS OF DOMINATION AND ITS VARIATIONS ARTI PANDEY

ALGORITHMIC ASPECTS OF DOMINATION AND ITS VARIATIONS ARTI PANDEY ALGORITHMIC ASPECTS OF DOMINATION AND ITS VARIATIONS ARTI PANDEY DEPARTMENT OF MATHEMATICS INDIAN INSTITUTE OF TECHNOLOGY DELHI JUNE 2016 c Indian Institute of Technology Delhi (IITD), New Delhi, 2016.

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard. COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped

More information

A Bad Name. CS 2210: Optimization. Register Allocation. Optimization. Reaching Definitions. Dataflow Analyses 4/10/2013

A Bad Name. CS 2210: Optimization. Register Allocation. Optimization. Reaching Definitions. Dataflow Analyses 4/10/2013 A Bad Name Optimization is the process by which we turn a program into a better one, for some definition of better. CS 2210: Optimization This is impossible in the general case. For instance, a fully optimizing

More information

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks A Parallel Algorithm for Exact Structure Learning of Bayesian Networks Olga Nikolova, Jaroslaw Zola, and Srinivas Aluru Department of Computer Engineering Iowa State University Ames, IA 0010 {olia,zola,aluru}@iastate.edu

More information

Dissecting Execution Traces to Understand Long Timing Effects

Dissecting Execution Traces to Understand Long Timing Effects Dissecting Execution Traces to Understand Long Timing Effects Christine Rochange and Pascal Sainrat February 2005 Rapport IRIT-2005-6-R Contents 1. Introduction... 5 2. Long timing effects... 5 3. Methodology...

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors Definitions (1 of 2) What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Instruction

More information

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 Application of the Computer Capacity to the Analysis of Processors Evolution BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 arxiv:1705.07730v1 [cs.pf] 14 May 2017 Abstract The notion of computer capacity

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Strategy il il il il asm

More information

Empirical analysis of procedures that schedule unit length jobs subject to precedence constraints forming in- and out-stars

Empirical analysis of procedures that schedule unit length jobs subject to precedence constraints forming in- and out-stars Empirical analysis of procedures that schedule unit length jobs subject to precedence constraints forming in- and out-stars Samuel Tigistu Feder * Abstract This paper addresses the problem of scheduling

More information

Object Histories in Java

Object Histories in Java Object Histories in Java by Aakarsh Nair A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science in Electrical and Computer

More information

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing

More information

Computer Science 246 Computer Architecture

Computer Science 246 Computer Architecture Computer Architecture Spring 2009 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Compiler ILP Static ILP Overview Have discussed methods to extract ILP from hardware Why can t some of these

More information

Numerical analysis and comparison of distorted fingermarks from the same source. Bruce Comber

Numerical analysis and comparison of distorted fingermarks from the same source. Bruce Comber Numerical analysis and comparison of distorted fingermarks from the same source Bruce Comber This thesis is submitted pursuant to a Master of Information Science (Research) at the University of Canberra

More information

Fractal Compression. Related Topic Report. Henry Xiao. Queen s University. Kingston, Ontario, Canada. April 2004

Fractal Compression. Related Topic Report. Henry Xiao. Queen s University. Kingston, Ontario, Canada. April 2004 Fractal Compression Related Topic Report By Henry Xiao Queen s University Kingston, Ontario, Canada April 2004 Fractal Introduction Fractal is first introduced in geometry field. The birth of fractal geometry

More information

Fortran 90 Two Commonly Used Statements

Fortran 90 Two Commonly Used Statements Fortran 90 Two Commonly Used Statements 1. DO Loops (Compiled primarily from Hahn [1994]) Lab 6B BSYSE 512 Research and Teaching Methods The DO loop (or its equivalent) is one of the most powerful statements

More information

Column Generation Method for an Agent Scheduling Problem

Column Generation Method for an Agent Scheduling Problem Column Generation Method for an Agent Scheduling Problem Balázs Dezső Alpár Jüttner Péter Kovács Dept. of Algorithms and Their Applications, and Dept. of Operations Research Eötvös Loránd University, Budapest,

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

09-1 Multicycle Pipeline Operations

09-1 Multicycle Pipeline Operations 09-1 Multicycle Pipeline Operations 09-1 Material may be added to this set. Material Covered Section 3.7. Long-Latency Operations (Topics) Typical long-latency instructions: floating point Pipelined v.

More information

A Fast and Simple Algorithm for Bounds Consistency of the AllDifferent Constraint

A Fast and Simple Algorithm for Bounds Consistency of the AllDifferent Constraint A Fast and Simple Algorithm for Bounds Consistency of the AllDifferent Constraint Alejandro Lopez-Ortiz 1, Claude-Guy Quimper 1, John Tromp 2, Peter van Beek 1 1 School of Computer Science 2 C WI University

More information

Resource Allocation Strategies for Multiple Job Classes

Resource Allocation Strategies for Multiple Job Classes Resource Allocation Strategies for Multiple Job Classes by Ye Hu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer

More information

Lecture : Topological Space

Lecture : Topological Space Example of Lecture : Dr. Department of Mathematics Lovely Professional University Punjab, India October 18, 2014 Outline Example of 1 2 3 Example of 4 5 6 Example of I Topological spaces and continuous

More information

Lecture 2: Analyzing Algorithms: The 2-d Maxima Problem

Lecture 2: Analyzing Algorithms: The 2-d Maxima Problem Lecture 2: Analyzing Algorithms: The 2-d Maxima Problem (Thursday, Jan 29, 1998) Read: Chapter 1 in CLR. Analyzing Algorithms: In order to design good algorithms, we must first agree the criteria for measuring

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Register Allocation & Liveness Analysis

Register Allocation & Liveness Analysis Department of Computer Sciences Register Allocation & Liveness Analysis CS502 Purdue University is an Equal Opportunity/Equal Access institution. Department of Computer Sciences In IR tree code generation,

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

arxiv: v2 [cs.ds] 22 Jun 2016

arxiv: v2 [cs.ds] 22 Jun 2016 Federated Scheduling Admits No Constant Speedup Factors for Constrained-Deadline DAG Task Systems Jian-Jia Chen Department of Informatics, TU Dortmund University, Germany arxiv:1510.07254v2 [cs.ds] 22

More information

Frequently Asked Questions (FAQ)

Frequently Asked Questions (FAQ) You are requested to go through all the questions & answers in this section and also the Advertisement Notification before proceeding for Registration and subsequent submission of Online Application Form

More information

Survey of different Task Scheduling Algorithm

Survey of different Task Scheduling Algorithm 2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Survey of different Task Scheduling Algorithm 1 Viral Patel, 2 Milin Patel 1 Student, 2 Assistant Professor 1 Master in Computer Engineering, Parul Institute

More information

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 Assigned 2/28/2018 CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp18

More information

Structure of Computer Systems

Structure of Computer Systems Structure of Computer Systems Structure of Computer Systems Baruch Zoltan Francisc Technical University of Cluj-Napoca Computer Science Department U. T. PRES Cluj-Napoca, 2002 CONTENTS PREFACE... xiii

More information

CSE 101 Homework 5. Winter 2015

CSE 101 Homework 5. Winter 2015 CSE 0 Homework 5 Winter 205 This homework is due Friday March 6th at the start of class. Remember to justify your work even if the problem does not explicitly say so. Writing your solutions in L A TEXis

More information

"Charting the Course... MOC C: Querying Data with Transact-SQL. Course Summary

Charting the Course... MOC C: Querying Data with Transact-SQL. Course Summary Course Summary Description This course is designed to introduce students to Transact-SQL. It is designed in such a way that the first three days can be taught as a course to students requiring the knowledge

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Improving Achievable ILP through Value Prediction and Program Profiling

Improving Achievable ILP through Value Prediction and Program Profiling Improving Achievable ILP through Value Prediction and Program Profiling Freddy Gabbay Department of Electrical Engineering Technion - Israel Institute of Technology, Haifa 32000, Israel. fredg@psl.technion.ac.il

More information

Automated Planning for Open Network Architectures

Automated Planning for Open Network Architectures UNIVERSITY OF CALIFORNIA Los Angeles Automated Planning for Open Network Architectures A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information