Utilising Parallel Resources by Speculation

A. Unger and E. Zehendner
Computer Science Department
Friedrich Schiller University
D-07740 Jena, Germany
{a.unger, zehendner}@acm.org

Th. Ungerer
Dept. Computer Design & Fault Tolerance
University of Karlsruhe
D-76128 Karlsruhe, Germany
ungerer@ira.uka.de

Abstract

This paper introduces Simultaneous Speculation Scheduling, a new compiler technique that enables speculative execution of alternative program paths. In our approach, concurrently executed threads are generated that represent alternative program paths. Each thread is the result of a speculation on the outcome of one or more branches. All threads are executed simultaneously although only one of them follows the eventually correct program path. Our technique goes beyond the capabilities of usual global instruction scheduling algorithms because we overcome most of the restrictions on speculative code motion. The architectural requirements are the ability to run two or more threads in parallel and an enhanced instruction set to control threads. Our technique aims at multithreaded architectures, in particular simultaneous multithreaded, nanothreaded, and microthreaded processors, but can be modified for multiscalar, datascalar, and trace processors. We evaluate our approach using program kernels from the SPECint benchmark suite.

1. Introduction

The continuous progress in microprocessor technology allows a growing number of execution units to be implemented on a single chip. With a single instruction stream, this processing potential can only be utilised if a sufficient amount of instruction-level parallelism (ILP) in the program to be executed is exposed to the processor. The compiler has to arrange the instruction sequences so that the processor can best exploit the available ILP. Missing information about the independence of instructions and about the data to be processed, the exponential complexity of the scheduling algorithms, and the possibility of exceptions hinder code transformations. Both the design of new processor architectures and the investigation of compiler techniques try to overcome some of these problems. Promising architectural solutions are out-of-order execution, predicated execution [8], numerous techniques to reduce instruction penalties [14], and the concurrent execution of more than one thread of control [9][13]. Current improvements in the field of scheduling techniques include enlarging the program sections treated by the algorithms [1][4][6][17], improving the heuristics used [5], enhancing the information made available by dataflow analysis [11], and better exploiting the processor properties [21].

Neither the above-mentioned improvements of scheduling techniques nor hardware techniques that implement some kind of dynamic scheduling can increase the ILP present in a source program. However, executing more than one thread of control simultaneously allows a multithreaded processor to use the combined ILP of all currently active threads. To improve the performance of a single program, all concurrently executed threads must be generated from this program. The compiler is therefore faced with another problem: making the needed coarse-grain parallelism available. Transforming loop iterations into concurrently executing threads of control is a well-understood technique for scientific applications [22].
In our research we focus on integer programs, which usually exhibit very small loop bodies, few iterations, many conditional branches, and frequent use of pointers and pointer arithmetic. Techniques for automatic loop parallelisation are therefore not applicable to integer programs. With integer programs in focus, we started to examine Simultaneous Speculation Scheduling, a technique that enables speculative execution of alternative program paths. Our technique should be used when static program transformations fail to expose enough ILP to completely utilise the execution units of a wide-issue processor. If the units cannot contribute to the execution of instructions on the currently taken program path, they should execute instructions from other program paths.

Candidate paths are the alternative continuations of a conditional statement. Simultaneous Speculation Scheduling replaces a branch by instructions that enable the instruction scheduler to expose more ILP to the processor. A separate thread of control is generated for each of the alternative continuations. Each thread contains instructions that calculate the condition associated with the former branch, an operation that terminates an incorrectly chosen thread, and a number of other instructions from the program path under consideration. The advantage of the proposed technique is a better utilisation of the execution units and thus a shorter execution time for a single program, achieved by filling otherwise idle instruction slots of a wide-issue processor with speculatively dispatched instructions. Depending on the architecture there may be an additional cost for the execution of thread-handling instructions. Consequently, the following preconditions have to be fulfilled for this approach to work: a very short time for switching control to another thread, and tightly coupled threads that allow fast data exchange. Target architectures are different kinds of multithreaded architectures. Requirements for a multithreaded base architecture suitable for our compiler technique, and architectural proposals that match these requirements, are described in Section 2. In Section 3 we introduce the Simultaneous Speculation Scheduling technique and examine its applicability to these architectures. Section 4 shows the results of translating several benchmarks using our technique.

2. Target Architectures

Simultaneous Speculation Scheduling is only applicable to architectures that fulfill certain requirements of a multithreaded base architecture. First, the architecture must be able to pursue two or more threads of control concurrently, i.e., it must provide two or more independent program counters. All concurrently executed threads of control share the same address space, preferably the same register set. Second, the instruction set must provide a number of thread-handling instructions. Here we consider the minimal requirements for multithreading: an instruction for creating a new thread (fork) and an instruction that conditionally stops its own execution or the execution of some other threads (sync). Whether the threads are explicit operands of the sync instruction or are implicitly given by the architecture strongly depends on the target architecture. We therefore use fork and sync as two abstract operations representing different implementations of these instructions. Creating a new thread with the fork instruction and joining threads with the sync instruction must be extremely fast, preferably single-cycle operations. Multithreaded architectures can achieve this last requirement because the different threads share a common set of processor resources (execution units, registers, etc.). Furthermore, they do not implement complex synchronisation operations but pass control over the steps necessary to carry out an interaction of threads to the compiler. The compiler can therefore minimise the arising cost by handling all thread management itself, which includes selecting the program sections to be executed speculatively, organising the data exchange between threads, and generating the instruction sequences required for the interaction.
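
The fork and sync operations above are assumed to be single-cycle hardware primitives; purely as a software analogy, the following C sketch mimics their effect with POSIX threads. The else continuation of a branch is evaluated speculatively while the main thread evaluates the then continuation, and the branch condition then selects which result is kept (instead of cancelling the losing thread, as the hardware sync would). All names and values are illustrative.

/* Software analogy of fork/sync speculation; compile with: cc -pthread */
#include <pthread.h>
#include <stdio.h>

static int b = 2, c = 7;
static int d_else;                        /* result of the speculative path */

/* Speculative thread: evaluates the else continuation of the branch. */
static void *speculate_else(void *arg)
{
    (void)arg;
    d_else = b * b * c;                   /* else path: d = b*b*c */
    return NULL;
}

int main(void)
{
    int a = 9;                            /* decides the branch condition a % 2 */
    pthread_t t;

    pthread_create(&t, NULL, speculate_else, NULL);  /* "fork"                  */
    int d_then = b * c;                   /* then path, evaluated in parallel   */
    pthread_join(t, NULL);                /* "sync": both paths have finished   */

    int d = (a % 2) ? d_then : d_else;    /* keep only the correct result       */
    printf("d = %d\n", d);
    return 0;
}
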
The primary target architectures of the proposed compiler technique are simultaneous multithreaded, microthreaded, and nanothreaded architectures, which can be classified as explicit multithreaded architectures because the existence of multiple program counters in the microarchitecture is visible in the architecture. However, implicit multithreaded architectures, which spawn and execute multiple threads implicitly and invisibly to the compiler, can also be modified to take advantage of Simultaneous Speculation Scheduling. Examples of such implicit multithreaded architectures are the multiscalar, the trace, and the datascalar approaches.

Simultaneous multithreaded (SMT) processors [19][12] combine the multithreading technique with a wide-issue processor such that the full issue bandwidth is utilised by potentially issuing instructions from different threads simultaneously. Another idea uses multithreading without the complexity of an SMT processor: nanothreading [7] dispenses with full multithreading in favour of a nanothread that executes in the same register set as the main thread. The nanothread is used to fill the instruction issue slots, as in simultaneous multithreading. The microthreading technique [2] is similar to nanothreading: all threads execute in the same register set, but the number of threads is not restricted to two. When a context switch arises, the program counter is stored in a continuation queue.

Simultaneous Speculation Scheduling aims at an architectural model that is most closely related to the SMT, nanothreaded, and microthreaded architectures. Creating a new thread and joining threads must be extremely fast operations, preferably performed in a single execution cycle. Here nanothreaded and microthreaded architectures may prove advantageous over the simultaneous multithreaded approach, because only a new instruction pointer must be activated in the nanothreaded and microthreaded approach, while a new register set has to be assigned additionally in the simultaneous multithreaded approach. All threads execute in the same register set in nanothreaded and microthreaded architectures.

Multiscalar processors [16] divide a single-threaded program into a collection of tasks that are distributed to a number of parallel processing units under control of a single hardware sequencer. Each of the processing units fetches and executes instructions belonging to its assigned task. A static program is represented as a control flow graph (CFG), where basic blocks are nodes and arcs represent flow of control from one basic block to another. Dynamic program execution can be viewed as walking through the CFG, generating a dynamic sequence of basic blocks that have to be executed for a particular run of the program. Simultaneous Speculation Scheduling can be applied to the multiscalar approach provided that the thread-handling instructions are included in the respective instruction set. Code blocks that represent threads for alternative program paths are speculatively assigned to different processing elements. Potentially, further code blocks that depend on either of these program paths can subsequently be assigned speculatively as well. Therefore, when executing a sync instruction, the multiscalar processor must be able to stop the execution of only the nullified block and all blocks dependent on it, while the correctly chosen block and all subsequent blocks continue execution. Propagation of register values between the processing elements that execute the speculatively assigned program paths must also be restricted to the dependence hierarchy and is therefore slightly more complex. Our technique can also be applied to speculation on data dependencies, which can be used directly to generate programs for a multiscalar processor [20].

Trace processors [15] partition a processor into multiple distinct cores, similar to multiscalar, and break the program into traces. Traces are collected by a trace cache, a special instruction cache that captures dynamic instruction sequences. One core of a trace processor executes the current trace while the other cores execute future traces speculatively. The Simultaneous Speculation Scheduling technique can be applied to trace processors in a similar style as for multiscalar processors.

A datascalar processor [3] runs the same sequential program redundantly across multiple processors using distributed data sets. Loads and stores are performed only locally on the processor whose memory owns the data, but a local load broadcasts the loaded value to all other processors. Speculation is an optimisation option, but is limited to data value speculation. However, every datascalar machine is a de facto multiprocessor: when codes contain coarse-grain parallelism, the datascalar machine can also run like a traditional multiprocessor. This ability can also be used by the Simultaneous Speculation Scheduling technique to run threads speculatively. When the datascalar processor executes a fork instruction, at least two of the processors start executing the alternative program paths. When the corresponding sync is reached, each processor executing the incorrect thread stops the execution of its own thread, receives the correct data from another processor, and then continues executing instructions from the correct thread. One of the processors that executed the correct thread when the most recent sync was executed is the lead processor. If the lead processor takes the wrong path at some point during the execution of the program, another processor becomes the lead processor.

3. Simultaneous Speculation Scheduling

3.1. Scheduling and Speculative Execution
Instruction scheduling techniques [11] are of great importance for exposing the ILP contained in a program to a wide-issue processor. The instructions of a given program are rearranged to avoid underutilisation of the processor resources caused by dependencies between the various operations (e.g. data dependencies, control dependencies, and the use of the same execution unit). As a result, the execution time of the program decreases. Conditional branches seriously hinder the scheduling techniques in moving instructions to unused instruction slots, because whether a conditional branch is taken or not cannot be decided at compile time. Global scheduling techniques such as Trace Scheduling [6], PDG Scheduling [1], Dominator-Path Scheduling [17], or Selective Scheduling [10] use various approaches to overcome this problem. However, investigations have shown that the conditions that must be fulfilled to safely move an instruction across basic block boundaries are very restrictive. Therefore the scheduling algorithms fail to gain enough ILP for a good utilisation of the execution units of a wide-issue processor. Global scheduling techniques have been extended to move instructions across branches and execute them speculatively; the performance increase gained by these speculative extensions is limited either by the overhead of correcting mispredictions or by restrictions on speculative code motion.

Our optimisation technique, Simultaneous Speculation Scheduling, is based on the consideration that execution units left unused due to the lack of ILP can execute instructions from alternative program paths to increase the execution speed of a program. The goal of the technique is to efficiently execute a single integer program on a multithreaded processor. We therefore generate threads to be executed in parallel from alternative program paths. Simultaneous Speculation Scheduling can be seen as an advancement of global instruction scheduling techniques. Like all of the global instruction scheduling techniques mentioned, we attempt to enlarge the hyperblocks within which instructions can be moved, but we are able to handle all branches efficiently.
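
The restrictions on speculative code motion mentioned above can be pictured with a generic example that is not taken from the paper; the function below is illustrative only. Hoisting a load above the branch that guards it may raise an exception on the path where the guard fails, so a conventional global scheduler has to leave the load where it is.

#include <stdio.h>
#include <stddef.h>

/* Original code: the load p[0] is control dependent on the test. */
static int sum_first(const int *p, int fallback)
{
    if (p != NULL)
        return p[0] + 1;
    return fallback;
}

/* An aggressive global scheduler would like to start the load early:
 *
 *     int t = p[0];               // speculative load: faults if p == NULL
 *     if (p != NULL) return t + 1;
 *     return fallback;
 *
 * Without hardware support for deferring or dismissing the exception,
 * this code motion is illegal -- one of the restrictions referred to above. */

int main(void)
{
    int x = 41;
    printf("%d %d\n", sum_first(&x, 0), sum_first(NULL, 0));
    return 0;
}
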

Information about execution probabilities can be used, but is not necessary. Besides enlarging the hyperblocks of the global scheduling technique, Simultaneous Speculation Scheduling reduces the penalties for misspeculation. We achieve this by simultaneously speculating on the alternative program paths and executing the generated threads in parallel on multithreaded architectures.

3.2. The Algorithm

Simultaneous Speculation Scheduling attempts to enlarge the hyperblocks of the global instruction scheduling algorithm by generating separate threads of control for alternative program paths. Branches are removed by the following approach: each branch is replaced by a fork instruction, which creates a new thread, and by two conditional sync instructions, one in each thread. The two threads active after the execution of the fork instruction evaluate the two program paths corresponding to the two branch targets. The compare instruction attached to the original branch remains in the program; it now calculates the condition for the sync instructions. The thread that reaches its sync instruction first either terminates or cancels the other thread. After a number of branches have been removed by generating speculative threads, the basic scheduling algorithm continues to process the new hyperblocks. Each thread is considered separately. The heuristic used by the scheduling algorithm is modified to keep the speculative code sections small and to avoid writes to memory within these sections. Therefore the sync instruction is moved upwards as far as the corresponding compare allows. The speculative sections are further reduced by combining identical sequences of instructions starting at the beginning of the speculative sections and moving them across the fork instruction. The program transformation described above is implemented by the following algorithm:

1. Determining basic blocks.
2. Calculating execution probabilities of the basic blocks.
3. Selection of the hyperblocks by the global instruction scheduling algorithm.
4. Selection of the conditional branches that cannot be handled by the basic scheduling technique, but can be resolved by splitting the thread of control; concatenation of the corresponding hyperblocks.
5. Generation of the required operations for starting and synchronising threads; if necessary, modification of the existing compare instructions.
6. Moving all store instructions out of the speculative program sections.
7. Further processing of the new hyperblocks by the global scheduling algorithm.
8. Modification of the weights attached to the nodes of the dataflow graph.
9. Scheduling of the operations within the hyperblocks.
10. Minimising the speculatively executed program sections by moving up common code sequences starting at the beginning of the sections.
11. Calculating a new register allocation, with insertion of move instructions if necessary.
12. Basic block scheduling.

Steps 1-3, 7, 9, and 12 can be performed by a common global scheduling technique; for our investigations we use the PDG scheduling technique [1]. A simple way to implement the modifications of the scheduling heuristic (steps 6 and 8) is to adjust the weights assigned to the operations by the scheduling algorithm [18] and to insert a number of artificial edges into the dataflow graph. This allows the formerly employed local scheduling technique (list scheduling), which arranges the instructions within the hyperblocks, to be used almost directly. Assigning a larger weight causes an instruction to be scheduled later.
The weights attached to the compare and the sync instructions are therefore decreased. Since these modifications directly influence the size of the speculative section, they must correspond to the properties of the processor that executes the generated program. The exact weights are not fixed by this algorithm but are parameters that have to be determined for each processor implementation. Generating separate threads of control for different program paths causes the duplication of a certain number of instructions. The number of redundant instructions grows with the length of the speculative sections, but remains small in the presence of small basic blocks. Since the number of speculatively executed threads is already limited by the hardware resources, only a small increase in program size results for integer programs.
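
A minimal sketch of the weight adjustment in step 8 follows. The struct, the classification of the nodes, and the base weights are illustrative only; the +10/+20 increments are the values reported for the base architecture in Section 3.3, and a larger weight means a later scheduling position.

#include <stdio.h>

enum region { KEEP_EARLY, IN_SPEC, AFTER_SYNC };

struct node {
    const char *insn;    /* instruction text                        */
    int         weight;  /* base list-scheduling weight (priority)  */
    enum region where;   /* position relative to the sync           */
};

/* Raising the weights of the speculative section (+10) and of everything
   following the sync (+20), while leaving the compare, the loads and the
   sync itself untouched, moves the sync up and keeps the speculative
   section small.                                                        */
static void adjust_weights(struct node *n, int count)
{
    for (int i = 0; i < count; i++) {
        if (n[i].where == IN_SPEC)
            n[i].weight += 10;
        else if (n[i].where == AFTER_SYNC)
            n[i].weight += 20;
    }
}

int main(void)
{
    struct node thread2[] = {
        { "ld    a,r3",       4, KEEP_EARLY },  /* load wanted early             */
        { "andcc r16,1,r18",  3, KEEP_EARLY },  /* condition for the sync        */
        { "sync  r18",        3, KEEP_EARLY },
        { "mul   r1,r1,r15",  2, IN_SPEC    },  /* b*b, dispatched speculatively */
        { "mul   r15,r2,r11", 2, AFTER_SYNC },  /* d = b*b*c                     */
        { "st    r11,d",      1, AFTER_SYNC },  /* store stays after the sync    */
    };
    int n = (int)(sizeof thread2 / sizeof thread2[0]);

    adjust_weights(thread2, n);
    for (int i = 0; i < n; i++)
        printf("%-18s weight %d\n", thread2[i].insn, thread2[i].weight);
    return 0;
}
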

3.3. Example

This section demonstrates the technique described above using a simple program. The sample program consists of a single if-then-else statement, enclosed by some assignments and surrounded by a loop to yield a measurable running time. It was constructed to demonstrate all essential steps of the algorithm. Figure 1 shows the source code of the program and Figure 2 the corresponding program dependency graph (PDG).

int a, b, c, d;

int main() {
    int i;
    a = 0; b = 2; c = 7;
    for (i = 0; i < 999; i++) {  /* D */
        a = a + b + c;           /* A */
        if (a % 2)               /* A */
            d = b * c;           /* B */
        else
            d = b * b * c;       /* C */
        b = b + c + a;           /* D */
        c = b + d;               /* D */
    }
}

Figure 1. Source code of the sample program

The comments in the source code indicate the basic block of each statement. The nodes of the PDG contain the assembly instructions generated from the sample program; the wide arrows show the control dependencies, the small edges the true data dependencies.

Figure 2. PDG of the sample program

The execution probability is assumed to be 50% for either program path. Thus a static scheduling technique should not speculatively move instructions upwards from basic block B or C into A. The assembly instructions are written in a simplified SPARC instruction set. For clarity, we use simplified register names and memory addresses, and we dispense with delayed branches. To loosen the strong binding of the compare to the branch instruction, the compare is given a destination register that stores the condition code; accordingly, the branch reads the condition from a general-purpose register. This is a common approach in modern microprocessors.

To evaluate the results of the speculative execution in the context of currently used scheduling techniques, and to have a basis for calculating the performance gain, we compiled the sample program with the global PDG scheduling algorithm (using egcs, release 1.0.1, optimisation level -O2). For the generated program we determined the cycles needed on our base processor (see Section 4). This led to 22 cycles to execute the loop body if the then path is taken and 26 cycles for the else path, yielding an average of 24 cycles per loop iteration.

We now explain the steps of the algorithm described in Section 3.2. Steps 1 to 3 work as in PDG scheduling. The branch corresponding to the if statement in the example is not handled by this global scheduling technique; therefore the branch is removed with our method by splitting the thread of control (step 4). After that step the basic blocks A, B, and D form thread 1, and the blocks A, C, and D form thread 2. The branch instruction (the last instruction in block A) is replaced by a pair of fork and sync (step 5), where the fork takes the first position in the hyperblock and the sync the position of the former branch. In the next step (step 6) the store instructions are moved out of the speculative section; artificial edges inserted into the dataflow graph prevent them from re-entering.

Now the PDG of the hyperblock is transformed into the extended dataflow graph by the PDG scheduling algorithm (step 7). In this representation all restrictions that are necessary to ensure a semantically correct schedule are coded as artificial edges. Before applying the instruction scheduling algorithm (list scheduling) to the hyperblock, the weights attached to the nodes are modified (step 8); this includes the nodes corresponding to the fork, the sync, and the load instructions. Decreasing the weights of the load and sync instructions causes the sync to be scheduled early, so the speculatively executed sections become smaller. Since the current weighting does not allow the values to be decreased directly, the weights of the surrounding instructions are increased instead to achieve the expected difference. Furthermore, the load instructions are moved out of the speculatively executed sections. For our base architecture we have increased the weights of all instructions within the speculative section by 10 and of all instructions following the sync by 20. Figure 3 shows the extended dataflow graph for thread 2 after executing step 8.

.LL5:
(1) ld    b,r1
(1) add   r5,1,r5      ; i++
(2) ld    c,r2
(5) add   r1,r2,r10    ; b+c
(5) ld    a,r3
(6) fork  .ll15,.ll25

Figure 3. Extended dataflow graph (thread 2)

Now the list scheduling algorithm is applied to both threads (step 9). To further decrease the size of the speculative sections, identical instruction sequences at the beginning of both threads are moved out of these sections (step 10). After this transformation the final positions of the fork and sync instructions are fixed and therefore the speculative sections are determined. At this point a new register allocation is calculated for these sections in both threads; in the example the registers r3, r9, r11, and r12 have to be renamed. The restoration of the values that are live at the end of the hyperblock to the original registers is implicitly done by the instructions following the sync, so there is no need for additional move operations. The assembly program generated after executing the whole algorithm of Section 3.2 is shown in Figure 4.

; Thread 1
.LL15:
(7)  mul   r1,r2,r11    ; b*c
(8)  add   r3,r1,r3     ; a+b
(9)  add   r3,r2,r3     ; a+b+c
(9)  cmp   r5,r6,r9     ; i<999
(10) andcc r3,1,r12     ; a%2
(11) sync  r12
(11) st    r3,a         ; a=...
(12) add   r10,r3,r3    ; b+c+a
(12) st    r11,d        ; d=...
(13) add   r3,r11,r11   ; b+d
(13) st    r3,b         ; b=...
(14) st    r11,c        ; c=...
(14) ble   .ll5,r9

; Thread 2
.LL25:
(7)  mul   r1,r1,r15    ; b*b
(8)  add   r3,r1,r16    ; a+b
(9)  add   r16,r2,r16   ; a+b+c
(9)  cmp   r5,r6,r17    ; i<999
(10) andcc r16,1,r18    ; a%2
(11) sync  r18
(11) mul   r15,r2,r11   ; b*b*c
(12) st    r16,a        ; a=...
(12) add   r10,r16,r3   ; b+c+a
(15) st    r11,d        ; d=...
(15) add   r3,r11,r11   ; b+d
(16) st    r3,b         ; b=...
(16) st    r11,c        ; c=...
(17) ble   .ll5,r17

Figure 4. Generated assembly code for the sample program

For each assembly instruction the corresponding C statement is given as a comment. The number at the beginning of each line gives the processor cycle in which the instruction executes, relative to the beginning of the loop iteration. Thus the two threads need 14 and 17 cycles, respectively. Again assuming an execution probability of 50%, the average time to process one loop iteration is 15.5 cycles. The program generated by our scheduling technique contains a speculative section of 6 instructions, none of which accesses memory.

4. Performance Evaluation

In the previous section we introduced our compilation technique. In this section we show a number of experimental results to examine the performance of the speculative execution of program paths. The benchmarks used belong to the SPECint benchmark suite. For two reasons we presently cannot translate the complete programs but have to focus on frequently executed functions. First, we do not yet have an implementation of the algorithm that can translate arbitrary programs. Second, the architectures under consideration are subjects of research: for some of them simulators are available, for others only the documentation of the architectural concepts is accessible. This means that both the process of applying our technique to the programs and the calculation of the execution time are partially done by hand. Since this is a time-consuming process, we can currently present results only for two programs from the SPECint benchmark suite and for the sample program from Section 3.3. The programs chosen from the SPEC benchmarks are go and compress. The translated program sections cover the inner loops of the function compress from the compress benchmark, and of mrglist and getefflibs from the go benchmark. The results are shown in Table 1.

Benchmark                  base machine (cycles)   multithreaded machine (cycles)   performance gain
compress                    9.6                     8.8                               9%
go (mrglist)               18.6                    13.3                              40%
go (getefflibs)            37.3                    31.4                              19%
sample program (Fig. 1)    24.0                    15.5                              55%

Table 1. Performance gain achieved by speculatively executed program paths

For the calculation of the execution time we use two processor models. The first one, the superscalar base processor, implements a simplified SPARC architecture and can execute up to four instructions per cycle. We assume latencies of one cycle for all arithmetic operations except multiplication, which takes four cycles, three cycles for load instructions, and one cycle for branches. The second processor model, the multithreaded base processor, enhances the superscalar base processor with the ability to execute up to four threads concurrently and with the fork and sync instructions for thread control. We expect these instructions to execute within a single cycle. The values presented in Table 1 were derived by counting the machine cycles that processors matching our machine models would need to execute the generated programs. The numbers shown are the cycles for the base processor, the cycles for the multithreaded processor, and the performance gain. The assembly programs for the base processor were generated by the compiler egcs. We have chosen this compiler since it implements PDG scheduling, the global scheduling technique our method is based on.
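
The averages and gains above follow from the assumed 50% path probability; as a check, the small C program below recomputes the numbers of Table 1 under the assumption that the reported gain is the ratio of the cycle counts minus one.

#include <stdio.h>

int main(void)
{
    /* Sample program: per-iteration cycles of the two paths. */
    double base_then = 22.0, base_else = 26.0;   /* PDG-scheduled code */
    double spec_then = 14.0, spec_else = 17.0;   /* fork/sync version  */
    double p = 0.5;                              /* path probability   */

    double base_avg = p * base_then + (1.0 - p) * base_else;   /* 24.0 */
    double spec_avg = p * spec_then + (1.0 - p) * spec_else;   /* 15.5 */
    printf("sample program    %.1f -> %.1f cycles, gain %.0f%%\n",
           base_avg, spec_avg, (base_avg / spec_avg - 1.0) * 100.0);

    /* The remaining rows of Table 1, gain = base/multithreaded - 1. */
    const char  *name[] = { "compress", "go (mrglist)", "go (getefflibs)" };
    const double base[] = {  9.6, 18.6, 37.3 };
    const double mt[]   = {  8.8, 13.3, 31.4 };
    for (int i = 0; i < 3; i++)
        printf("%-17s %.1f -> %.1f cycles, gain %.0f%%\n",
               name[i], base[i], mt[i], (base[i] / mt[i] - 1.0) * 100.0);
    return 0;
}
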
5. Conclusion

In this paper we have proposed a compiler technique that aims to speed up integer programs that cannot be treated successfully by purely static scheduling techniques. In the absence of a sufficient amount of ILP to completely utilise the superscalar execution units, we employ the ability of multithreaded processors to additionally exploit coarse-grain parallelism. The threads to be executed concurrently are generated from alternative program paths. Our technique improves the capability of global instruction scheduling techniques by removing conditional branches: a branch is replaced by a fork instruction, which splits the current thread of control, and by one sync instruction in each new thread, which terminates the incorrectly executed thread. This transformation increases the size of the hyperblocks, so global instruction scheduling is able to expose more ILP to the processor. We have started to evaluate our technique by translating programs from the SPECint benchmark suite. Based on preliminary results we expect our technique to achieve a performance gain of about 20% over purely static scheduling techniques when generating code for simultaneous multithreaded processors as well as for processors that employ nanothreading or microthreading. We are currently translating further programs from the SPEC benchmarks and examining how the technique can be adapted to the requirements of the trace processor, the multiscalar approach, and the datascalar processor.

References

[1] D. Bernstein and M. Rodeh. Global instruction scheduling for superscalar machines. In B. Hailpern, editor, Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 241-255, Toronto, ON, Canada, June 1991.
[2] A. Bolychevsky, C. R. Jesshope, and V. B. Muchnik. Dynamic scheduling in RISC architectures. IEE Proceedings - Computers and Digital Techniques, 143(5):309-317, 1996.
[3] D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar architectures. In Proceedings of the 24th International Symposium on Computer Architecture, pages 338-349, Boulder, CO, June 1997.
[4] P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu. IMPACT: An architectural framework for multiple-instruction-issue processors. In Proceedings of the 18th International Symposium on Computer Architecture, pages 266-275, New York, NY, June 1991.
[5] K. Ebcioglu, R. D. Groves, K. C. Kim, G. M. Silberman, and I. Ziv. VLIW compilation techniques in a superscalar environment. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 36-48, Toronto, ON, Canada, 1994.
[6] J. R. Ellis. A Compiler for VLIW Architectures. The MIT Press, Cambridge, MA, 1986.
[7] L. Gwennap. Dansoft develops VLIW design. Microdesign Resources, pages 18-22, 1997.
[8] W.-M. Hwu. Introduction to predicated execution. IEEE Computer, 31(1):49-50, 1998.
[9] J. L. Lo, S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(3):322-354, 1997.
[10] S. M. Moon and K. Ebcioglu. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Transactions on Programming Languages and Systems, 19(6):853-898, 1997.
[11] S. S. Muchnick. Advanced Compiler Design & Implementation. Morgan Kaufmann Publishers, San Francisco, 1997.
[12] U. Sigmund and T. Ungerer. Identifying bottlenecks in a multithreaded superscalar microprocessor. Lecture Notes in Computer Science, 1124:797-800, 1996.
[13] J. Silc, B. Robic, and T. Ungerer. Asynchrony in parallel computing: From dataflow to multithreading. Journal of Parallel and Distributed Computing Practices, 1, 1998.
[14] J. Silc, B. Robic, and T. Ungerer. Processor Architecture: From Dataflow to Superscalar and Beyond. Springer-Verlag, Berlin, 1999.
[15] J. E. Smith and S. Vajapeyam. Trace processors: Moving to fourth-generation microarchitectures. IEEE Computer, 30(9):68-74, Sept. 1997.
[16] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 414-425, Santa Margherita Ligure, Italy, 1995.
[17] P. H. Sweany and S. J. Beaty. Dominator-path scheduling: A global scheduling method. In Proceedings of the 25th International Symposium on Microarchitecture, Dec. 1992.
[18] M. D. Tiemann. The GNU instruction scheduler. Technical report, Free Software Foundation, Cambridge, MA, 1989.
[19] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 191-202, Philadelphia, PA, May 1996.
[20] A. Unger, T. Ungerer, and E. Zehendner. Static speculation, dynamic resolution. In Proceedings of the 7th Workshop on Compilers for Parallel Computing (CPC '98), Linköping, Sweden, June 1998.
[21] A. Unger and E. Zehendner. Tuning the GNU instruction scheduler to superscalar microprocessors. In Proceedings of the 23rd EUROMICRO Conference, pages 275-282, Budapest, Sept. 1997.
[22] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.