On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling


On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling

by

John Michael Chase

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science

Waterloo, Ontario, Canada, 2006

© Michael Chase 2006

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

Modern architectures allow multiple instructions to be issued at once and have other complex features. To account for this, compilers perform instruction scheduling after generating the output code. The instruction scheduling problem is to find an optimal schedule given the limitations and capabilities of the architecture. While this problem can be solved optimally, a greedy algorithm known as list scheduling is used in practice in most production compilers. List scheduling is generally regarded as being near-optimal in practice, provided a good choice of heuristic is used. However, previous work comparing a list scheduler against an optimal scheduler either assumes an idealized architectural model or uses too few test cases to strongly support or refute the assumed near-optimality of list scheduling. It remains an open question whether list scheduling performs well when scheduling for a realistic architectural model.

Using constraint programming, we developed an efficient optimal scheduler capable of scheduling even very large blocks within a popular benchmark suite in a reasonable amount of time. I improved the architectural model and optimal scheduler by allowing for an issue width not equal to the number of functional units, instructions that monopolize the processor for one cycle, and non-fully pipelined instructions. I then evaluated the performance of list scheduling for this more realistic architectural model. I found that when scheduling basic blocks for a realistic architectural model, at most 6% of the schedules produced by a list scheduler are non-optimal, but when scheduling superblocks, at least 40% of the schedules produced by a list scheduler are non-optimal. Furthermore, when the list scheduler and optimal scheduler differed, the optimal scheduler was able to improve schedule cost by at least 5% on average, realizing maximum improvements of 82%. This suggests that list scheduling is a viable solution in practice only when scheduling basic blocks. When scheduling superblocks, the advantage of using a list scheduler is its speed, not the quality of the schedules produced, and alternatives to list scheduling should be considered.

Acknowledgments

I would like to thank my supervisor, Peter van Beek, for his guidance, mentoring, and support throughout my work as an undergraduate and graduate student with him. Thanks to Tyrel Russell and Abid Malik for their many contributions in the discussions we had as a research group. I am also grateful to Farhad Mavadat and Ondrej Lhotak for serving on my committee.

I would also like to thank my family and friends, and especially my wife, Jen, for encouraging and supporting me in my research and for helping me to balance school with other aspects of my life.

This work was made possible by IBM Corp. and by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET).

Dedication

This work is dedicated to Jen, for her constant support and love and for all that she did in order to allow me to pursue this degree.

Contents

1 Introduction
    Contributions
    Overview

2 Background
    Scheduling Units
    Instruction DAGs
    The Instruction Scheduling Problem
    List Scheduling
    Scheduling Heuristics and Features
    Register Pressure
    Constraint Programming

3 Related Work
    Theoretical Results
    Empirical Results
        Idealized Architectures
        Realistic Architectures
        Summary of Empirical Results

4 Local Instruction Scheduling
    Initial Model
    The Constraint Programming Model
        Latency Constraints
        Distance Constraints
        Dominance Constraints
        Predecessor and Successor Constraints
        Functional Unit Constraints
    Architectural Improvements
        Issue Width
        Non-Fully Pipelined Processor
        Serializing Instructions
    Architectural Features Not Modelled
    Evaluation
        Experimental Setup
        The Optimal Scheduler
        Results for Initial Architectural Models
        Results for Improved Architectural Models
        Summary of Results

5 Global Instruction Scheduling
    Initial Model
    Architectural Improvements
    Evaluation
        Experimental Setup
        The Optimal Scheduler
        Results for Initial Architectural Models
        Results for Improved Architectural Models
        Summary of Results

6 Conclusions and Further Work
    Further Work

List of Tables

2.1 Local instruction scheduling heuristics, showing the rank of features for each heuristic. For a pair of instructions, the instruction with the better value for the feature ranked number 1 is selected. If the two instructions have the same value for that feature, the instruction with the better value for the rank 2 feature is selected, and so on.

3.1 Results of experiments comparing list scheduling to optimal methods.

3.2 Architectural models used in experiments comparing list scheduling to optimal methods.

4.1 Architectural models used in scheduling experiments.

4.2 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.3 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.4 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.5 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.6 Local instruction scheduling for the initial architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.7 Local instruction scheduling for the initial architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.8 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.9 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.10 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.11 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.12 Local instruction scheduling for the initial architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

4.13 Local instruction scheduling for the initial architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

4.14 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.15 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

4.16 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.17 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of basic blocks with improved schedules, for various architectures.

4.18 Local instruction scheduling for the improved architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.19 Local instruction scheduling for the improved architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the basic blocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

4.20 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.21 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over Shieh and Papachristou's heuristic, and (b) Shieh and Papachristou's heuristic resulted in an improved schedule over critical path, for various architectures.

4.22 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.23 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

4.24 Local instruction scheduling for the improved architectural model before register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

4.25 Local instruction scheduling for the improved architectural model after register allocation. Number of basic blocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than Shieh and Papachristou's schedule, and (b) Shieh and Papachristou's schedule had lower register pressure than the critical path schedule, for various architectures.

5.1 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.2 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.3 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.4 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.5 Global instruction scheduling for the initial architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.6 Global instruction scheduling for the initial architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.7 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.8 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.9 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.10 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.11 Global instruction scheduling for the initial architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

5.12 Global instruction scheduling for the initial architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

5.13 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.14 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the optimal scheduler failed to complete within a 10-minute time limit, for various architectures.

5.15 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.16 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found an improved schedule over the best heuristic schedule, and (b) the percentage of superblocks with improved schedules, for various architectures.

5.17 Global instruction scheduling for the improved architectural model before register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.18 Global instruction scheduling for the improved architectural model after register allocation. Average and maximum percentage improvements in schedule length of the optimal schedule over the best heuristic schedule, for various architectures. The average is over only the superblocks in the SPEC 2000 benchmark suite for which the optimal scheduler found an improved schedule.

5.19 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.20 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) critical path resulted in an improved schedule over DHASY, and (b) DHASY resulted in an improved schedule over critical path, for various architectures.

5.21 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.22 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the optimal scheduler found a schedule with lower register pressure than both heuristic schedules, and (b) a heuristic schedule had lower register pressure than the optimal schedule, for various architectures.

5.23 Global instruction scheduling for the improved architectural model before register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

5.24 Global instruction scheduling for the improved architectural model after register allocation. Number of superblocks in the SPEC 2000 benchmark suite with more than two instructions where (a) the critical path schedule had lower register pressure than DHASY's schedule, and (b) DHASY's schedule had lower register pressure than the critical path schedule, for various architectures.

List of Figures

2.1 (a) DAG for a basic block from the SPEC 2000 compiler benchmark; (b) one possible schedule for a single-issue processor, where NOP denotes a cycle in which no instructions are scheduled; (c) an optimal schedule.

2.2 (a) DAG for a superblock from the SPEC 2000 compiler benchmark; (b) branch probabilities for side exits and the final exit; (c) exit probabilities for side exits and the final exit.

4.1 Example DAG used to illustrate constraints in the initial model. Text beside each node denotes the functional unit as integer (INT) or floating point (FP) and lower and upper bounds.

4.2 Example DAG with additional nodes B_1 and B_2 corresponding to pipeline variables.

4.3 Example DAG with serial instruction B.

Chapter 1

Introduction

Modern architectures are capable of issuing and executing multiple instructions at once, and have many other interesting properties. The code generation phase of a compiler produces a straight-line sequence of code, which must be scheduled in order to execute multiple instructions at once and take advantage of other architectural features. This process, known as instruction scheduling, is performed in practice by a non-optimal, greedy algorithm known as the list scheduling algorithm. List scheduling is widely assumed to be nearly optimal given an appropriate choice of heuristic with which to rank instructions that are ready to be issued. However, previous work comparing the list scheduling algorithm against an optimal scheduler either uses a small set of test data, producing inconclusive results, or makes many simplifying assumptions about the architectural model on which the scheduled code will be executed. It is an open question whether list scheduling is nearly optimal when scheduling for a realistic architectural model. Especially for embedded processors with limited computing power, it is essential that the compiled code be as efficient as possible. If the list scheduling algorithm does not provide good enough schedules, it may be necessary to consider other scheduling algorithms.

In this thesis, I take an optimal scheduler that initially assumes the same simple architectural model as list scheduling and improve the model, making it more realistic. I then compare the schedules produced by the list scheduling algorithm against those produced by the optimal scheduler. By using a realistic architectural model and evaluating with a sufficiently large set of test data, I provide conclusive evidence that speaks to the near-optimality of list scheduling. When scheduling basic blocks, the list scheduling algorithm produces optimal schedules at least 94%

of the time for the target architectures on which it was evaluated. When scheduling superblocks, list scheduling produces optimal schedules between 47% and 60% of the time. The difference between the list scheduler and the optimal scheduler is also significant: improved blocks cost between 5% and 8% less on average, with a maximum improvement of 82%. These results suggest that list scheduling performs sufficiently well on more complex architectures for local instruction scheduling only, and that global instruction scheduling algorithms must be given further consideration in order to produce better overall schedules.

1.1 Contributions

The most significant contribution of this thesis is to show that list scheduling is far from optimal when scheduling superblocks for a complex architectural model, although list scheduling basic blocks produces near-optimal results for a complex architectural model. When scheduling basic blocks, the list scheduling algorithm was optimal over 98% of the time for an idealized architectural model and over 94% of the time for a realistic architectural model. However, the list scheduler performs poorly when scheduling superblocks for a realistic architectural model, with between 39.9% and 52.3% of the schedules produced being non-optimal and an average improvement of 5.3% to 8.1%. These results are important because superblocks are frequently executed sequences of instructions, and any speed improvement on these will almost certainly result in a speed improvement for the whole program.

This thesis contains other relevant contributions. When comparing a heuristic given by Shieh and Papachristou [35] against the more common critical-path heuristic, Shieh and Papachristou's heuristic generally performed better.
As both heuristics use critical path as the primary feature, this suggests that the choice of secondary features is important to instruction scheduling heuristics, which motivates a re-evaluation of existing heuristics. There is also little difference in register usage between the schedules produced by a list scheduler and those produced by our optimal scheduler. This is significant because it shows that list scheduling not only produces near-optimal schedules, it also produces schedules with register requirements similar to those of an optimal scheduler, so there is little to gain from using an optimal scheduler in this respect.
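The notion of primary and secondary features can be made concrete: a list scheduling heuristic typically compares two ready instructions lexicographically over a ranked list of features, falling through to the next rank only on a tie. The sketch below is a hypothetical illustration of this comparison scheme; the feature names and data layout are inventions for the example, not taken from the thesis.

```python
def better(instr_a, instr_b, features):
    """Return the preferred instruction under a ranked feature list.

    features is a sequence of (key_function, prefer_larger) pairs,
    ordered from the rank-1 feature downward; ties on one feature fall
    through to the next rank, as in the heuristics discussed above.
    """
    for key, prefer_larger in features:
        va, vb = key(instr_a), key(instr_b)
        if va != vb:
            if prefer_larger:
                return instr_a if va > vb else instr_b
            return instr_a if va < vb else instr_b
    return instr_a  # complete tie: keep the first instruction

# Illustrative ranking: critical-path distance first, then number of
# successors in the DAG (both hypothetical field names).
RANKING = [
    (lambda i: i["critical_path"], True),
    (lambda i: i["successors"], True),
]
```

A heuristic with a different secondary feature changes only the second entry of the ranking, which is what makes such comparisons between heuristics easy to set up.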

1.2 Overview

This thesis is divided into six chapters. In Chapter 2, I formalize the instruction scheduling problem and the list scheduling algorithm, present directed acyclic graphs for instruction scheduling, discuss heuristics and features for list scheduling, and provide an introduction to constraint programming. Chapter 3 surveys work evaluating the optimality of the list scheduling algorithm. In Chapter 4, I examine instruction scheduling for basic blocks, presenting the initial architectural model as well as several improvements to it. I compare the list scheduler to an optimal scheduler for several architectures using both the initial and improved architectural models in order to evaluate the optimality of the list scheduling algorithm. In Chapter 5, I repeat the work of Chapter 4, but instead consider instruction scheduling for superblocks. I conclude in Chapter 6, summarizing the results of this thesis and discussing possibilities for future work.

Chapter 2

Background

The code generation phase in a compiler typically produces a straight-line sequence of code. It is often followed by an instruction scheduling phase, which rearranges the code to achieve better performance on modern architectures. Modern architectures have multiple functional units: processing units dedicated to a specific type of instruction, such as integer or branch instructions. In addition, some architectures allow multiple instructions to be issued in a single cycle. The maximum number of instructions that can be issued in one cycle is known as the processor's issue width. The issue width for a particular architecture must be less than or equal to the number of functional units.

Each instruction on a particular architecture has an execution time and a latency. Execution time is the number of cycles for which an instruction locks up a functional unit to the exclusion of all other instructions. The latency of an instruction is the number of cycles needed after an instruction is issued before its result is available to other instructions. An instruction's latency is always greater than or equal to its execution time. An instruction with an execution time of 1 is said to be fully pipelined: the functional unit is occupied only for the cycle in which the instruction is issued. Instructions with execution times greater than one are said to be not fully pipelined. Modern architectures are pipelined: an instruction with a latency greater than one need not tie up the functional unit on which it is executing until its result is ready [18]. Architectures use pipelining to overlap the execution of instructions. If an instruction with an execution time of 2 and a latency of 5 is issued in cycle 1, another instruction can be issued on the same functional unit as early as cycle 3,

even though the result from the first instruction is not available for use until cycle 6.

As an example of an architecture with all of these properties, consider the PowerPC 604 [19]. It has six functional units: a branch unit, two integer units for simple instructions, an integer unit for more complex instructions, a floating-point unit, and a load/store unit for data transfer to and from memory. The PowerPC 604 has an issue width of 4, so not every functional unit begins executing a new instruction every cycle. Most instructions are fully pipelined, and thus the PowerPC 604 can often dispatch a new instruction for execution each cycle on the same functional unit. However, some instructions are not fully pipelined and monopolize a functional unit for the entire duration of their execution. One such example for the PowerPC architecture is divw, one of several integer division instructions. It has an execution time of 19 cycles on the PowerPC 604 and a latency of 20 cycles. There is thus one cycle after the issue of a divw instruction during which the instruction is no longer executing but its result is not yet usable.

2.1 Scheduling Units

The most common straight-line unit of code is the basic block, a code sequence having a single entry point and a single exit point [29], where the only branching that occurs is a branch from the exit instruction to another basic block, or to the same basic block in order to create a loop. Not all compilers produce basic blocks having a single entry point, but this can be remedied by inserting a dummy instruction with zero execution time and forcing all existing entry-point instructions to follow the dummy instruction. A similar method can be used to ensure that basic blocks all have a single exit point. The other common scheduling unit used in compilers is the superblock [11, 20].
A superblock consists of a series of basic blocks B_0, B_1, ..., B_n such that:

- Each basic block either branches out of the superblock or branches to the next basic block in sequence (i.e., for k = 0, ..., n - 1, B_k branches out of the superblock or branches to B_{k+1}).

- There are no branches from any block B_i to any other block B_j, for j ≠ 0, other than the branch from B_i to B_{i+1} (i.e., there are no cycles in the superblock that do not pass through block B_0).

- There are no branches into any basic block within the superblock except for branches to B_0.

These disallowed branches are known as side entrances. Branch instructions in a superblock that both transfer control out of the superblock and occur before the final instruction are known as side exits, and the final instruction in the superblock is known as the final exit.

Superblock scheduling often makes use of profiling information for the program being compiled. Each side exit in a superblock can be assigned a probability that the side exit is taken. If the side exit is not taken, control proceeds to the following basic block. The final exit is always taken if it is reached. Branch probabilities are given in the form branch(i) = (P(i), 1 - P(i)), for 0 ≤ P(i) ≤ 1, where P(i) is the fraction of times the branch was taken given that the branch condition is evaluated. In other words, if a side exit has branch probability (0.3, 0.7), then the branch is taken 30% of the time the branch instruction is executed and not taken 70% of the time.

The formula for evaluating schedule cost for superblocks given in Section 2.3 requires a cost coefficient for each exit i. This coefficient is the probability that branch i is taken given that branches 0, ..., i - 1 were not taken. For example, suppose a superblock has two side exits. The first has branch probability (0.2, 0.8), and the second has branch probability (0.6, 0.4). The cost coefficient for the first side exit is 0.2, since there is a 20% chance the first branch is taken. The cost coefficient for the second side exit is 0.8 × 0.6 = 0.48, since the first branch is not taken 80% of the time and the second branch is taken 60% of the time. The final exit is always taken if it is reached, and so its cost coefficient is 0.8 × 0.4 = 0.32: a side exit is taken 68% of the time, and so the final exit must be taken the remaining 32% of the time.
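The cost coefficient calculation in the worked example above can be written as a short routine. This is a sketch, and the function name is mine, not the thesis's.

```python
# Each side exit i has a branch probability P(i): the probability the branch
# is taken given that it is executed. Its cost coefficient is the probability
# it is taken given that no earlier branch was taken; the final exit absorbs
# whatever probability mass remains.

def exit_probabilities(branch_probs):
    """branch_probs: P(taken) for each side exit, in order.
    Returns cost coefficients for the side exits plus the final exit."""
    coefficients = []
    not_taken_so_far = 1.0
    for p in branch_probs:
        coefficients.append(not_taken_so_far * p)
        not_taken_so_far *= (1.0 - p)
    coefficients.append(not_taken_so_far)  # final exit: always taken if reached
    return coefficients

# The worked example: side exits with branch probabilities 0.2 and 0.6.
coeffs = exit_probabilities([0.2, 0.6])
# → approximately [0.2, 0.48, 0.32], up to floating-point rounding
```

The coefficients always sum to 1, since control must leave the superblock through exactly one exit.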
Throughout this thesis, I refer to the probability that a branch is taken given that the branch instruction is executed as the branch probability, and the probability that a branch is taken given that all previous branches in the superblock were not taken as the exit probability. I use branch(i) and exit(i) to denote the branch and exit probabilities, respectively, for instruction i.

While basic blocks are considered to be local scheduling units and superblocks are labelled as global scheduling units, the latter are not global in the sense of incorporating an entire program. Superblocks are formed by first finding a trace [12]: a region of instructions identical to a superblock except that side entrances are permitted. To eliminate side entrances from the trace, the code following a side entrance is duplicated, a technique known as tail duplication [20]. The side entrances branch to the old copy of the code, and the trace branches to the new copy of the code. When all side entrances are removed in this manner, the trace is now a

superblock. Superblocks are preferable to traces in that traces require a significantly more complex compiler implementation in order to deal with side entrances, while superblocks do not [11]. For simplicity, this thesis refers to a block when the context allows for both basic blocks and superblocks without distinction.

2.2 Instruction DAGs

A common conceptual view of an instruction scheduling problem is a directed acyclic graph, or DAG (see [29]), also known as a Program Dependence Graph or Data Dependence Graph in the compiler literature. Figure 2.1 shows a sample DAG from a basic block selected from our testing data, and Figure 2.2 shows a DAG for a superblock, also selected from testing data. In an instruction DAG, each node corresponds to an instruction from the original straight-line basic block or superblock. Nodes are labeled alphabetically, beginning with A. Nodes are also assigned a sequence number 1 through n, with node i being the i-th instruction in the order given in the original block. If instruction i must be completed before instruction j in the original block, an edge is added between nodes i and j. This is known as a precedence constraint: i has precedence over j, and must be scheduled first in any correct schedule.

[Figure 2.1: (a) DAG for a basic block from the SPEC 2000 compiler benchmark; (b) One possible schedule for a single-issue processor, where NOP denotes a cycle in which no instructions are scheduled; (c) An optimal schedule.]

Edges in the DAG are also assigned a weight. If an instruction j must execute at least l(i, j) cycles after instruction i begins execution, edge (i, j) is assigned weight l(i, j). This is known as a latency constraint, and l(i, j) is said to be the latency [29] of instruction i. If l(i, j) = 1, instruction j can begin execution in the cycle immediately following instruction i. Other instructions can be issued between the cycles in which instructions i and j begin, provided all constraints are satisfied.

[Figure 2.2: (a) DAG for a superblock from the SPEC 2000 compiler benchmark; (b) Branch probabilities for side exits and the final exit; (c) Exit probabilities for side exits and the final exit.]

On the PowerPC 604 architecture, most instructions have a latency of 1 or 2, and the maximum latency of any instruction is 32. Only floating point instructions have a latency over 10; the majority of instructions produce their results in very few cycles. However, many edges with long-running instructions as the source node, including the 32-cycle fdiv instruction, have low latencies. This phenomenon is explained by Smotherman et al. [37]. Suppose an instruction i has an execution time of 3 cycles and a latency of 4 cycles. Any instruction j needing to read the result of i must wait at least 4 cycles after i begins execution to ensure that i has written its result. This is known as a read-after-write or RAW dependency, and edge (i, j) would have weight 4. Suppose instead that j writes to a register r that
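To illustrate how latency constraints force stall cycles, here is a minimal single-issue simulator (a sketch, not the thesis's scheduler). The edge weights below are my reading of the basic-block DAG in Figure 2.1: edges B→C and A→D with weight 2, and C→D with weight 1.

```python
def schedule_length(order, latency_edges):
    """Issue instructions in the given (topological) order on a single-issue,
    fully pipelined machine, delaying each instruction until all latency
    constraints from its predecessors are satisfied. Returns the cycle in
    which the last instruction issues, i.e., the schedule length."""
    issue = {}
    cycle = 0
    for instr in order:
        earliest = cycle + 1  # at most one instruction issues per cycle
        for (i, j), l in latency_edges.items():
            if j == instr and i in issue:
                earliest = max(earliest, issue[i] + l)
        issue[instr] = earliest
        cycle = earliest
    return cycle

# Assumed edge weights for the DAG of Figure 2.1.
edges = {("B", "C"): 2, ("A", "D"): 2, ("C", "D"): 1}

# Issuing A before B forces a NOP in cycle 3 (schedule (b), length 5);
# issuing B first yields the optimal 4-cycle schedule (c).
assert schedule_length(["A", "B", "C", "D"], edges) == 5
assert schedule_length(["B", "A", "C", "D"], edges) == 4
```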

i must read. If i takes only one cycle to read from r, j may be issued in the cycle following i, and so edge (i, j) would have weight 1: this is known as a write-after-read or WAR dependency. Edges may not reflect the true latency of an instruction for other reasons as well, and it is not the case that every edge in a DAG has the latency of the first node as its weight.

There is also one other type of scheduling constraint, one that is not explicitly captured in a DAG because such constraints are general to the target architecture rather than specific to a particular block. Resource constraints [29] are any constraints caused by the number and type of resources on a processor. For example, if a processor has two integer units, at most two integer instructions can begin execution each cycle. Resource constraints cannot be reflected in a DAG since a family of architectures may have the same latencies for each instruction but different resources. Thus, if a DAG is scheduled on several different architectures, schedules of different lengths may be produced.

2.3 The Instruction Scheduling Problem

To make the most of the advanced features of modern architectures, compilers perform instruction scheduling as an optimization after code generation. The goal of instruction scheduling is to find a minimum-cost schedule for a straight-line sequence of code, subject to several types of constraints [11, 29]. I present two versions of the instruction scheduling problem: the local instruction scheduling problem for basic blocks and the global instruction scheduling problem for superblocks. As mentioned in Section 2.1, the latter is not truly global. As the two differ only slightly, I first give a general version of the instruction scheduling problem, adapted from [38], and refine it to account for the differences between the two scheduling problems.
The definition of the instruction scheduling problem makes the assumptions that all instructions are fully pipelined and that either the machine has an infinite number of registers or register allocation has taken place before instruction scheduling.

Definition 1 (Instruction Scheduling Problem) Consider a DAG G = (V, E) representing a block, where each edge (i, j) has weight l(i, j). The target architecture has a global issue width W and a set of functional units U. There is a set T of types of functional units, and each unit is of some type t ∈ T. There are f(t) functional units of type t ∈ T, and each instruction i is of type u(i), where u(i) ∈ T. Let y_ik be a binary variable that takes on the value 1 if and only if instruction i is scheduled in cycle k. A feasible schedule S specifies an issue time

S(i) for all instructions i. S(i) and y_ik are related: for all i and k, y_ik = 1 iff S(i) = k. S must also satisfy the following constraints:

    for all cycles k:  Σ_{i=1..n} y_ik ≤ W    (global issue width constraint)

    for all cycles k and all t ∈ T:  Σ_{i : u(i)=t} y_ik ≤ f(t)    (functional unit constraints)

    for all (i, j) ∈ E:  S(j) ≥ S(i) + l(i, j)    (latency constraints)

The Instruction Scheduling Problem is to find a schedule S for which a cost function cost(S) is minimized.

There are several cost measures for the instruction scheduling problem. The actual execution time on a physical architecture is one such measure, although it is hard to evaluate while scheduling is performed. For basic blocks, the common metric is schedule length. The sooner all instructions in a DAG finish executing, the shorter the schedule will be. Schedule length can be thought of as the latest cycle in which an instruction is issued: that is, cost(S) = max_{i ∈ V} S(i). Figure 2.1 (b) shows a valid schedule for the DAG in Figure 2.1 (a), but schedule (b) is clearly suboptimal, as a better schedule is presented in Figure 2.1 (c).

Definition 2 (Local Instruction Scheduling Problem) The Local Instruction Scheduling Problem is to solve the Instruction Scheduling Problem where the DAG G represents a basic block, minimizing cost function cost(S) = max_{i ∈ V} S(i).

When scheduling superblocks, the measure of evaluation is the expected number of cycles executed within the superblock before a branch is taken or the final exit is reached.

Definition 3 (Superblock Schedule Cost) Let exit(i) be the exit probability for node i in the DAG, where 0 ≤ exit(i) ≤ 1 and exit(i) = 0 when node i is not a side exit or the final exit. For a schedule S, cost(S) = Σ_{i=1..n} exit(i) · S(i).

Definition 4 (Global Instruction Scheduling Problem) The Global Instruction Scheduling Problem is to solve the Instruction Scheduling Problem where the DAG G represents a superblock, minimizing cost function cost(S) = Σ_{i=1..n} exit(i) · S(i).
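The constraints of Definition 1 and the cost functions of Definitions 2 through 4 translate directly into code. This is a sketch with names of my own choosing, not the thesis's implementation.

```python
from collections import Counter

def is_feasible(S, edges, unit_type, num_units, issue_width):
    """S: instruction -> issue cycle; edges: (i, j) -> l(i, j);
    unit_type: instruction -> functional-unit type; num_units: t -> f(t);
    issue_width: the global issue width W."""
    # Global issue width constraint: at most W instructions per cycle.
    per_cycle = Counter(S.values())
    if any(n > issue_width for n in per_cycle.values()):
        return False
    # Functional unit constraints: at most f(t) instructions of type t
    # may begin execution in any one cycle.
    per_cycle_type = Counter((S[i], unit_type[i]) for i in S)
    if any(n > num_units[t] for (_, t), n in per_cycle_type.items()):
        return False
    # Latency constraints: S(j) >= S(i) + l(i, j) for every DAG edge.
    return all(S[j] >= S[i] + l for (i, j), l in edges.items())

def basic_block_cost(S):
    """Definition 2: schedule length, the latest issue cycle."""
    return max(S.values())

def superblock_cost(S, exit_prob):
    """Definition 3: expected cycles, weighted by exit probabilities."""
    return sum(exit_prob.get(i, 0.0) * S[i] for i in S)
```

For example, the optimal schedule of Figure 2.1 (c), expressed as {"B": 1, "A": 2, "C": 3, "D": 4}, is feasible on a single-issue machine with one integer unit and has basic block cost 4.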

2.4 List Scheduling

Finding an optimal solution to both the local and global instruction scheduling problems is NP-complete [17]. The emphasis of scheduling research has therefore been on approximation algorithms. Hennessy and Gross [17] made an early attempt at developing a polynomial algorithm for instruction scheduling. Their algorithm had a worst-case runtime of O(n^4). Gibbons and Muchnick [14] were able to refine the algorithm to provide a worst-case runtime of O(n^2), although the quality of the schedules produced is not necessarily as good. This refined algorithm, known as the list scheduling algorithm, has become the most popular instruction scheduling algorithm, and is used almost exclusively in production compilers.

The list scheduling algorithm is so called because of its use of a ready list. The algorithm iterates through machine cycles sequentially, and at each cycle it populates the ready list with the set of all candidate instructions that could begin execution in the current cycle. It then selects the best instructions from the ready list to begin execution in the current cycle, subject to resource constraints [29]. The algorithm also makes use of an execution list: whenever an instruction is issued, it is placed on the execution list, a list of all instructions currently being executed. When an instruction i finishes executing, any successor j of i becomes a candidate instruction as long as all other predecessors of j have also finished executing. The execution list is used to easily identify instructions that finish executing, so it can quickly be determined whether j may be added to the ready list. Selection is made according to a heuristic independent of the core list scheduling algorithm.

In Algorithm 1, I present a formal representation of the list scheduling algorithm, adapted from the presentation in [34]. The method selectBestInstruction encapsulates the particular heuristic used in an implementation of list scheduling.
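The ready-list loop just described can be sketched as follows. This is not a reproduction of Algorithm 1: it assumes a fully pipelined machine, omits functional-unit types for brevity, and uses a critical-path priority as one common choice of heuristic.

```python
def critical_path(node, edges, memo=None):
    """Heuristic priority: longest latency-weighted path from node to a leaf."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = max((l + critical_path(j, edges, memo)
                          for (i, j), l in edges.items() if i == node),
                         default=0)
    return memo[node]

def list_schedule(nodes, edges, issue_width):
    """edges: (i, j) -> l(i, j). Returns instruction -> issue cycle."""
    preds = {n: [(i, l) for (i, j), l in edges.items() if j == n] for n in nodes}
    S, cycle, memo = {}, 0, {}
    while len(S) < len(nodes):
        cycle += 1
        # Populate the ready list: unscheduled instructions all of whose
        # predecessors have issued long enough ago to satisfy latencies.
        ready = [n for n in nodes if n not in S and
                 all(i in S and S[i] + l <= cycle for i, l in preds[n])]
        # Issue up to issue_width instructions, best priority first.
        ready.sort(key=lambda n: critical_path(n, edges, memo), reverse=True)
        for n in ready[:issue_width]:
            S[n] = cycle
    return S

# On the DAG of Figure 2.1 (assumed edge weights), a single-issue list
# scheduler with this heuristic issues B first and finds the optimal schedule.
edges = {("B", "C"): 2, ("A", "D"): 2, ("C", "D"): 1}
schedule = list_schedule(["A", "B", "C", "D"], edges, issue_width=1)
```

Checking readiness against issue times and latencies, as above, plays the role the execution list plays in the thesis's presentation: both exist to decide cheaply when a successor may enter the ready list.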
For each instruction in the ready list, in descending order of priority according to the heuristic, it determines whether or not the instruction can be executed in the current cycle. In other words, it checks whether there is an available functional unit of the instruction's type and whether the number of instructions already scheduled to begin in the current cycle is less than the global issue width. If there is an available functional unit and the global issue width will not be exceeded by issuing the current instruction, that instruction is returned. If there is no instruction on the ready list that can begin execution in the current cycle, the selection method returns null.

It is worth pointing out that the list scheduling algorithm pays no attention to any other architectural features or hazards besides available functional units and issue width. In particular, instruction cache and data cache misses are ignored,


Background: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined

More information

IBM. Enterprise Systems Architecture/ Extended Configuration Principles of Operation. z/vm. Version 6 Release 4 SC

IBM. Enterprise Systems Architecture/ Extended Configuration Principles of Operation. z/vm. Version 6 Release 4 SC z/vm IBM Enterprise Systems Architecture/ Extended Configuration Principles of Operation Version 6 Release 4 SC24-6192-01 Note: Before you use this information and the product it supports, read the information

More information

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation

Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation Introduction to Optimization, Instruction Selection and Scheduling, and Register Allocation Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Traditional Three-pass Compiler

More information

COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY. Thesis by. Peter Anthony Leong. In Partial Fulfillment of the Requirements

COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY. Thesis by. Peter Anthony Leong. In Partial Fulfillment of the Requirements COMPUTATIONAL CHALLENGES IN HIGH-RESOLUTION CRYO-ELECTRON MICROSCOPY Thesis by Peter Anthony Leong In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy California Institute

More information

PROBLEM SOLVING WITH FORTRAN 90

PROBLEM SOLVING WITH FORTRAN 90 David R. Brooks PROBLEM SOLVING WITH FORTRAN 90 FOR SCIENTISTS AND ENGINEERS Springer Contents Preface v 1.1 Overview for Instructors v 1.1.1 The Case for Fortran 90 vi 1.1.2 Structure of the Text vii

More information

A Note on Scheduling Parallel Unit Jobs on Hypercubes

A Note on Scheduling Parallel Unit Jobs on Hypercubes A Note on Scheduling Parallel Unit Jobs on Hypercubes Ondřej Zajíček Abstract We study the problem of scheduling independent unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between

More information

"Charting the Course to Your Success!" MOC D Querying Microsoft SQL Server Course Summary

Charting the Course to Your Success! MOC D Querying Microsoft SQL Server Course Summary Course Summary Description This 5-day instructor led course provides students with the technical skills required to write basic Transact-SQL queries for Microsoft SQL Server 2014. This course is the foundation

More information

PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE

PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE PERFORMANCE ANALYSIS OF REAL-TIME EMBEDDED SOFTWARE Yau-Tsun Steven Li Monterey Design Systems, Inc. Sharad Malik Princeton University ~. " SPRINGER

More information

CITY UNIVERSITY OF NEW YORK. Creating a New Project in IRBNet. i. After logging in, click Create New Project on left side of the page.

CITY UNIVERSITY OF NEW YORK. Creating a New Project in IRBNet. i. After logging in, click Create New Project on left side of the page. CITY UNIVERSITY OF NEW YORK Creating a New Project in IRBNet i. After logging in, click Create New Project on left side of the page. ii. Enter the title of the project, the principle investigator s (PI)

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

ALGORITHMIC ASPECTS OF DOMINATION AND ITS VARIATIONS ARTI PANDEY

ALGORITHMIC ASPECTS OF DOMINATION AND ITS VARIATIONS ARTI PANDEY ALGORITHMIC ASPECTS OF DOMINATION AND ITS VARIATIONS ARTI PANDEY DEPARTMENT OF MATHEMATICS INDIAN INSTITUTE OF TECHNOLOGY DELHI JUNE 2016 c Indian Institute of Technology Delhi (IITD), New Delhi, 2016.

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard. COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped

More information

A Bad Name. CS 2210: Optimization. Register Allocation. Optimization. Reaching Definitions. Dataflow Analyses 4/10/2013

A Bad Name. CS 2210: Optimization. Register Allocation. Optimization. Reaching Definitions. Dataflow Analyses 4/10/2013 A Bad Name Optimization is the process by which we turn a program into a better one, for some definition of better. CS 2210: Optimization This is impossible in the general case. For instance, a fully optimizing

More information

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks A Parallel Algorithm for Exact Structure Learning of Bayesian Networks Olga Nikolova, Jaroslaw Zola, and Srinivas Aluru Department of Computer Engineering Iowa State University Ames, IA 0010 {olia,zola,aluru}@iastate.edu

More information

Dissecting Execution Traces to Understand Long Timing Effects

Dissecting Execution Traces to Understand Long Timing Effects Dissecting Execution Traces to Understand Long Timing Effects Christine Rochange and Pascal Sainrat February 2005 Rapport IRIT-2005-6-R Contents 1. Introduction... 5 2. Long timing effects... 5 3. Methodology...

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors Definitions (1 of 2) What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Instruction

More information

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 Application of the Computer Capacity to the Analysis of Processors Evolution BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 arxiv:1705.07730v1 [cs.pf] 14 May 2017 Abstract The notion of computer capacity

More information

Code generation for modern processors

Code generation for modern processors Code generation for modern processors What are the dominant performance issues for a superscalar RISC processor? Refs: AS&U, Chapter 9 + Notes. Optional: Muchnick, 16.3 & 17.1 Strategy il il il il asm

More information

Empirical analysis of procedures that schedule unit length jobs subject to precedence constraints forming in- and out-stars

Empirical analysis of procedures that schedule unit length jobs subject to precedence constraints forming in- and out-stars Empirical analysis of procedures that schedule unit length jobs subject to precedence constraints forming in- and out-stars Samuel Tigistu Feder * Abstract This paper addresses the problem of scheduling

More information

Object Histories in Java

Object Histories in Java Object Histories in Java by Aakarsh Nair A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science in Electrical and Computer

More information

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing

More information

Computer Science 246 Computer Architecture

Computer Science 246 Computer Architecture Computer Architecture Spring 2009 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Compiler ILP Static ILP Overview Have discussed methods to extract ILP from hardware Why can t some of these

More information

Numerical analysis and comparison of distorted fingermarks from the same source. Bruce Comber

Numerical analysis and comparison of distorted fingermarks from the same source. Bruce Comber Numerical analysis and comparison of distorted fingermarks from the same source Bruce Comber This thesis is submitted pursuant to a Master of Information Science (Research) at the University of Canberra

More information

Fractal Compression. Related Topic Report. Henry Xiao. Queen s University. Kingston, Ontario, Canada. April 2004

Fractal Compression. Related Topic Report. Henry Xiao. Queen s University. Kingston, Ontario, Canada. April 2004 Fractal Compression Related Topic Report By Henry Xiao Queen s University Kingston, Ontario, Canada April 2004 Fractal Introduction Fractal is first introduced in geometry field. The birth of fractal geometry

More information

Fortran 90 Two Commonly Used Statements

Fortran 90 Two Commonly Used Statements Fortran 90 Two Commonly Used Statements 1. DO Loops (Compiled primarily from Hahn [1994]) Lab 6B BSYSE 512 Research and Teaching Methods The DO loop (or its equivalent) is one of the most powerful statements

More information

Column Generation Method for an Agent Scheduling Problem

Column Generation Method for an Agent Scheduling Problem Column Generation Method for an Agent Scheduling Problem Balázs Dezső Alpár Jüttner Péter Kovács Dept. of Algorithms and Their Applications, and Dept. of Operations Research Eötvös Loránd University, Budapest,

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

09-1 Multicycle Pipeline Operations

09-1 Multicycle Pipeline Operations 09-1 Multicycle Pipeline Operations 09-1 Material may be added to this set. Material Covered Section 3.7. Long-Latency Operations (Topics) Typical long-latency instructions: floating point Pipelined v.

More information

A Fast and Simple Algorithm for Bounds Consistency of the AllDifferent Constraint

A Fast and Simple Algorithm for Bounds Consistency of the AllDifferent Constraint A Fast and Simple Algorithm for Bounds Consistency of the AllDifferent Constraint Alejandro Lopez-Ortiz 1, Claude-Guy Quimper 1, John Tromp 2, Peter van Beek 1 1 School of Computer Science 2 C WI University

More information

Resource Allocation Strategies for Multiple Job Classes

Resource Allocation Strategies for Multiple Job Classes Resource Allocation Strategies for Multiple Job Classes by Ye Hu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer

More information

Lecture : Topological Space

Lecture : Topological Space Example of Lecture : Dr. Department of Mathematics Lovely Professional University Punjab, India October 18, 2014 Outline Example of 1 2 3 Example of 4 5 6 Example of I Topological spaces and continuous

More information

Lecture 2: Analyzing Algorithms: The 2-d Maxima Problem

Lecture 2: Analyzing Algorithms: The 2-d Maxima Problem Lecture 2: Analyzing Algorithms: The 2-d Maxima Problem (Thursday, Jan 29, 1998) Read: Chapter 1 in CLR. Analyzing Algorithms: In order to design good algorithms, we must first agree the criteria for measuring

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Register Allocation & Liveness Analysis

Register Allocation & Liveness Analysis Department of Computer Sciences Register Allocation & Liveness Analysis CS502 Purdue University is an Equal Opportunity/Equal Access institution. Department of Computer Sciences In IR tree code generation,

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

arxiv: v2 [cs.ds] 22 Jun 2016

arxiv: v2 [cs.ds] 22 Jun 2016 Federated Scheduling Admits No Constant Speedup Factors for Constrained-Deadline DAG Task Systems Jian-Jia Chen Department of Informatics, TU Dortmund University, Germany arxiv:1510.07254v2 [cs.ds] 22

More information

Frequently Asked Questions (FAQ)

Frequently Asked Questions (FAQ) You are requested to go through all the questions & answers in this section and also the Advertisement Notification before proceeding for Registration and subsequent submission of Online Application Form

More information

Survey of different Task Scheduling Algorithm

Survey of different Task Scheduling Algorithm 2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Survey of different Task Scheduling Algorithm 1 Viral Patel, 2 Milin Patel 1 Student, 2 Assistant Professor 1 Master in Computer Engineering, Parul Institute

More information

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 Assigned 2/28/2018 CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp18

More information

Structure of Computer Systems

Structure of Computer Systems Structure of Computer Systems Structure of Computer Systems Baruch Zoltan Francisc Technical University of Cluj-Napoca Computer Science Department U. T. PRES Cluj-Napoca, 2002 CONTENTS PREFACE... xiii

More information

CSE 101 Homework 5. Winter 2015

CSE 101 Homework 5. Winter 2015 CSE 0 Homework 5 Winter 205 This homework is due Friday March 6th at the start of class. Remember to justify your work even if the problem does not explicitly say so. Writing your solutions in L A TEXis

More information

"Charting the Course... MOC C: Querying Data with Transact-SQL. Course Summary

Charting the Course... MOC C: Querying Data with Transact-SQL. Course Summary Course Summary Description This course is designed to introduce students to Transact-SQL. It is designed in such a way that the first three days can be taught as a course to students requiring the knowledge

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Improving Achievable ILP through Value Prediction and Program Profiling

Improving Achievable ILP through Value Prediction and Program Profiling Improving Achievable ILP through Value Prediction and Program Profiling Freddy Gabbay Department of Electrical Engineering Technion - Israel Institute of Technology, Haifa 32000, Israel. fredg@psl.technion.ac.il

More information

Automated Planning for Open Network Architectures

Automated Planning for Open Network Architectures UNIVERSITY OF CALIFORNIA Los Angeles Automated Planning for Open Network Architectures A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information