Utilising Parallel Resources by Speculation


A. Unger and E. Zehendner
Computer Science Department
Friedrich Schiller University
Jena, Germany

Th. Ungerer
Dept. of Computer Design & Fault Tolerance
University of Karlsruhe
Karlsruhe, Germany

Abstract

This paper introduces Simultaneous Speculation Scheduling, a new compiler technique that enables speculative execution of alternative program paths. In our approach, concurrently executed threads are generated that represent alternative program paths. Each thread is the result of a speculation on the outcome of one or more branches. All threads are executed simultaneously, although only one of them follows the eventually correct program path. Our technique goes beyond the capabilities of common global instruction scheduling algorithms because it overcomes most of the restrictions on speculative code motion. The architectural requirements are the ability to run two or more threads in parallel and an enhanced instruction set to control threads. Our technique aims at multithreaded architectures, in particular simultaneous multithreaded, nanothreaded, and microthreaded processors, but can be modified for multiscalar, datascalar, and trace processors. We evaluate our approach using program kernels from the SPECint benchmark suite.

1. Introduction

The continuous progress in microprocessor technology makes it possible to implement a growing number of execution units on a single chip. With a single instruction stream, this processing potential can only be utilised if a sufficient amount of instruction-level parallelism (ILP) in the program to be executed is exposed to the processor. The compiler has to arrange the instruction sequences so that the processor can exploit the ILP. Missing information about the independence of instructions and about the data to be processed, the exponential complexity of the scheduling algorithms, and the possibility of exceptions hinder code transformations. Both the design of new processor architectures and the investigation of compiler techniques try to overcome some of these problems. Promising architectural solutions are out-of-order execution, predicated execution [8], numerous techniques to reduce instruction penalties [14], and the concurrent execution of more than one thread of control [9][13]. Current improvements in the field of scheduling techniques are: an enlargement of the program sections treated by the algorithms [1][4][6][17], improved heuristics [5], richer information made available by dataflow analysis [11], and better exploitation of processor properties [21].

Neither the improvements of scheduling techniques mentioned above nor hardware techniques that implement some kind of dynamic scheduling can increase the ILP present in a source program. However, executing more than one thread of control simultaneously allows the combined ILP of all threads currently active in such a multithreaded processor to be used. To improve the performance of a single program, all concurrently executed threads must be generated from this program. The compiler is therefore faced with another problem: making the needed coarse-grain parallelism available. Transforming loop iterations into concurrently executing threads of control is a well-understood technique for scientific applications [22].

In our research we focus on integer programs, which typically have very small loop bodies, few iterations, many conditional branches, and frequent use of pointers and pointer arithmetic. Techniques for automatic loop parallelisation are therefore not applicable to integer programs. With integer programs in focus, we started to examine Simultaneous Speculation Scheduling, a technique that enables speculative execution of alternative program paths. Our technique should be used when static program transformations fail to expose enough ILP to completely utilise the execution units of a wide-issue processor. If the units cannot contribute to the execution of instructions that are on the currently taken program path, then they should execute instructions from other program paths.
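
To make these characteristics concrete, the following C fragment (purely illustrative; it is not taken from the SPEC sources) shows the kind of pointer-chasing, branch-heavy loop we have in mind:

/* Illustrative fragment (not from the SPEC sources): a pointer-chasing
   loop with a data-dependent branch, typical of the integer codes we
   target. Automatic loop parallelisation does not apply here, and each
   iteration exposes only a handful of independent instructions. */
struct node { int key; int count; struct node *next; };

int count_balance(const struct node *head)
{
    int balance = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->key % 2)            /* the two continuations of this branch */
            balance += n->count;   /* are the candidate paths for          */
        else                       /* speculative execution                */
            balance -= n->count;
    }
    return balance;
}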

Candidate paths are the alternative continuations of a conditional statement. Simultaneous Speculation Scheduling replaces a branch by instructions that enable the instruction scheduler to expose more ILP to the processor. A separate thread of control is generated for each of the alternative continuations. Each thread contains instructions that calculate the condition associated with the former branch, an operation that terminates an incorrectly chosen thread, and a number of other instructions from the program path under consideration.

The advantage of the proposed technique is a better utilisation of the execution units and thus an improvement of the execution time of a single program, achieved by filling otherwise idle instruction slots of a wide-issue processor with speculatively dispatched instructions. Depending on the architecture there may be an additional cost for the execution of thread handling instructions. Consequently, the following preconditions have to be fulfilled for this approach to work: a very short time for switching control to another thread, and tightly coupled threads that allow fast data exchange. Target architectures are different kinds of multithreaded architectures. Requirements for a multithreaded base architecture suitable for our compiler technique, and architectural proposals that match the requirements, are described in Section 2. In Section 3 we introduce the Simultaneous Speculation Scheduling technique and examine its applicability for these architectures. Section 4 shows the results of translating several benchmarks using our technique.

2. Target Architectures

Simultaneous Speculation Scheduling is only applicable to architectures that fulfil certain requirements of a multithreaded base architecture. First, the architecture must be able to pursue two or more threads of control concurrently, i.e., it must provide two or more independent program counters. All concurrently executed threads of control share the same address space, preferably the same register set. The instruction set must provide a number of thread handling instructions. Here we consider the minimal requirements for multithreading: an instruction for creating a new thread (fork) and an instruction that conditionally stops its own execution or the execution of some other threads (sync). Whether the threads are explicit operands of the sync instruction or are implicitly given by the architecture strongly depends on the target architecture. Therefore we use fork and sync as two abstract operations representing different implementations of these instructions. Creating a new thread by the fork instruction and joining threads by the sync instruction must be extremely fast, preferably single-cycle operations. Multithreaded architectures can achieve this last requirement because the different threads share a common set of processor resources (execution units, registers, etc.). Furthermore, they do not implement complex synchronisation operations but pass control of the steps necessary to carry out an interaction of threads to the compiler. Therefore the compiler can minimise the arising cost by maintaining the complete thread handling, which includes selecting the program sections to be executed speculatively, organising the data exchange between threads, and generating the instruction sequences required for the interaction.
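
To make the intended semantics of these abstract operations concrete, the following C program is a software analogue of fork and sync built on POSIX threads. It is only an expository sketch under the stated assumption that OS threads stand in for the single-cycle hardware primitives; the function and variable names are hypothetical, and the two alternative assignments mirror the if-then-else of the sample program in Section 3.3.

/* Software analogue of fork/sync: both continuations are computed into
   renamed locations, and the condition of the former branch decides at
   the "sync" point which result is committed.  OS threads are far too
   heavyweight for the real technique; this only illustrates the idea. */
#include <pthread.h>
#include <stdio.h>

static int a = 1, b = 2, c = 7;
static int d_then, d_else;        /* renamed destinations for each path */

static void *speculate_then(void *arg) { d_then = b * c;     return NULL; }
static void *speculate_else(void *arg) { d_else = b * b * c; return NULL; }

int main(void)
{
    pthread_t t;
    /* "fork": start the alternative continuation, follow the other one */
    pthread_create(&t, NULL, speculate_else, NULL);
    speculate_then(NULL);
    pthread_join(&t, NULL);

    /* "sync": the branch condition selects the surviving result,
       the speculative result of the other path is discarded        */
    int d = (a % 2) ? d_then : d_else;
    printf("d = %d\n", d);
    return 0;
}
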
The primary target architectures of the proposed compiler technique are simultaneous multithreaded, microthreaded, and nanothreaded architectures, which can be classified as explicit multithreaded architectures because the existence of multiple program counters in the microarchitecture is perceptible in the architecture. However, implicit multithreaded architectures that spawn and execute multiple threads implicitly, not visible to the compiler, can also be modified to take advantage of Simultaneous Speculation Scheduling. Examples of such implicit multithreaded architectures are the multiscalar, the trace, and the datascalar approaches.

Simultaneous multithreaded (SMT) processors [19][12] combine the multithreading technique with a wide-issue processor such that the full issue bandwidth is utilised by potentially issuing instructions from different threads simultaneously. Another idea uses multithreading without the complexity of an SMT processor. Nanothreading [7] dismisses full multithreading in favour of a nanothread that executes in the same register set as the main thread. The nanothread is used to fill the instruction issue slots as in simultaneous multithreading. The microthreading technique [2] is similar to nanothreading. All threads execute in the same register set; however, the number of threads is not restricted to two. When a context switch arises, the program counter is stored in a continuation queue.

Simultaneous Speculation Scheduling aims at an architectural model that is most closely related to the SMT, nanothreaded, and microthreaded architectures. Creating a new thread and joining threads must be extremely fast operations, which should preferably be performed in a single execution cycle. Here nanothreaded and microthreaded architectures may prove advantageous over the simultaneous multithreaded approach: only a new instruction pointer must be activated in the nanothreaded and microthreaded approaches, while a new register set has to be assigned additionally in the simultaneous multithreaded approach. All threads execute in the same register set in nanothreaded and microthreaded architectures.

Multiscalar processors [16] divide a single-threaded program into a collection of tasks that are distributed to a number of parallel processing units under control of a single hardware sequencer. Each of the processing units fetches and executes instructions belonging to its assigned task. A static program is represented as a control flow graph (CFG), where basic blocks are nodes and arcs represent flow of control from one basic block to another. Dynamic program execution can be viewed as walking through the CFG, generating a dynamic sequence of basic blocks that have to be executed for a particular run of the program. Simultaneous Speculation Scheduling can be applied to the multiscalar approach provided that the thread handling instructions are included in the respective instruction sets. Code blocks that represent threads for alternative program paths are speculatively assigned to different processing elements. Potentially, further code blocks that are dependent on either of these program paths can subsequently be assigned speculatively. Therefore, when executing a sync instruction the multiscalar processor must be able to stop the execution of a block and of all blocks dependent on the nullified block only, while the correctly chosen block and all subsequent blocks continue execution. Propagating register values between the processing elements that execute our speculatively assigned program paths must also be restricted to the dependence hierarchy and is therefore slightly more complex. Our technique of speculating on data dependences can be used directly to generate programs for a multiscalar processor [20].

Trace processors [15] partition a processor into multiple distinct cores, similar to multiscalar, and break the program into traces. Traces are collected by a trace cache, a special instruction cache that captures dynamic instruction sequences. One core of a trace processor executes the current trace while the other cores execute future traces speculatively. The Simultaneous Speculation Scheduling technique can be applied to trace processors in a similar way as for multiscalar processors.

A datascalar processor [3] runs the same sequential program redundantly across multiple processors using distributed data sets. Loads and stores are performed only locally on the processor memory that owns the data, but a local load broadcasts the loaded value to all other processors. Speculation is an optimisation option, but is limited to data value speculation. However, every datascalar machine is a de facto multiprocessor. When codes contain coarse-grain parallelism, the datascalar machine can also run like a traditional multiprocessor. This ability can also be used by the Simultaneous Speculation Scheduling technique to run threads speculatively. When the datascalar processor executes a fork instruction, at least two of the processors start executing the alternative program paths. When the corresponding sync is reached, each processor executing the incorrect thread stops the execution of its own thread, receives the correct data from another processor, and then continues executing instructions from the correct thread. One of the processors that executed the correct thread when the most recent sync was executed is the lead processor. If this processor takes the wrong path at some later point in the execution of the program, another processor becomes the lead processor.

3. Simultaneous Speculation Scheduling

3.1. Scheduling and Speculative Execution

Instruction scheduling techniques [11] are of great importance for exposing the ILP contained in a program to a wide-issue processor. The instructions of a given program are rearranged to avoid underutilisation of the processor resources caused by dependencies between the various operations (e.g. data dependencies, control dependencies, and the usage of the same execution unit). As a result, the execution time of the program decreases. Conditional branches seriously hinder the scheduling techniques from moving instructions to unused instruction slots, because whether a conditional branch is taken or not cannot be decided at compile time. Global scheduling techniques such as Trace Scheduling [6], PDG Scheduling [1], Dominator-path Scheduling [17], or Selective Scheduling [10] use various approaches to overcome this problem. However, investigations have shown that the conditions that must be fulfilled to safely move an instruction across basic block boundaries are very restrictive. Therefore the scheduling algorithms fail to gain enough ILP for a good utilisation of the execution units of a wide-issue processor. Global scheduling techniques have been extended to move instructions across branches and execute these instructions speculatively. The increase in performance gained by these speculative extensions is limited either by the overhead of correcting mispredictions or by restrictions on speculative code motion.

Our optimisation technique, Simultaneous Speculation Scheduling, is based on the consideration that execution units left unused due to the lack of ILP can execute instructions from alternative program paths to increase the execution speed of a program. The goal of the technique is to efficiently execute a single integer program on a multithreaded processor. Therefore we generate threads to be executed in parallel from alternative program paths. Simultaneous Speculation Scheduling can be seen as an advancement of global instruction scheduling techniques. Like the global instruction scheduling techniques mentioned above, we attempt to enlarge the hyperblocks in which instructions can be moved, but we are able to handle all branches efficiently.
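
As an illustration of the restrictive conditions for speculative code motion mentioned above, consider the following C fragment (purely illustrative, not taken from our benchmarks). A conventional global scheduler may not hoist the store on the then path above the branch, because memory must not be modified on the else path; with our technique both continuations become separate threads and, following step 6 of the algorithm in Section 3.2, stores are kept out of the speculative sections, so that only the surviving thread writes to memory.

/* Illustrative only: the store to *hits may not be executed before the
   branch outcome is known, so a conventional scheduler cannot move it
   (or the statements that depend on it) into the idle issue slots
   above the branch. */
void tally(int x, int limit, int *hits, int *sum)
{
    if (x > limit) {
        *hits += 1;      /* store: unsafe to execute speculatively */
        *sum  += limit;
    } else {
        *sum  += x;
    }
}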

Information about execution probabilities can be used, but is not necessary. Besides enlarging the hyperblocks of the global scheduling technique, Simultaneous Speculation Scheduling reduces the penalties for misspeculation. We achieve this by simultaneously speculating on the alternative program paths and executing the generated threads in parallel on multithreaded architectures.

3.2. The Algorithm

Simultaneous Speculation Scheduling attempts to enlarge the hyperblocks of the global instruction scheduling algorithm by generating separate threads of control for alternative program paths. Branches are removed by the following approach: each branch is replaced by a fork instruction, which creates a new thread, and by two conditional sync instructions, one in each thread. The two threads active after the execution of the fork instruction evaluate the two program paths corresponding to the two branch targets. The compare instruction attached to the original branch remains in the program; it now calculates the condition for the sync instructions. The thread that reaches its sync instruction first either terminates or cancels the other thread. After removing a number of branches by generating speculative threads, the basic scheduling algorithm continues to process the new hyperblocks. Each thread is considered separately. The heuristic used by the scheduling algorithm is modified to keep the speculative code sections small and to avoid writes to memory within the sections. Therefore the sync instruction is moved upwards as far as the corresponding compare allows. The speculative sections are further reduced by combining identical sequences of instructions starting at the beginning of speculative sections and moving them across the fork instruction. The program transformation described above is implemented by the following algorithm:

1. Determining basic blocks.
2. Calculating execution probabilities of the basic blocks.
3. Selection of the hyperblocks by the global instruction scheduling algorithm.
4. Selection of the conditional branches that cannot be handled by the basic scheduling technique, but can be resolved by splitting the thread of control; concatenation of the corresponding hyperblocks.
5. Generation of the required operations for starting and synchronising threads; if necessary, modification of the existing compare instructions.
6. Moving all store instructions out of the speculative program sections.
7. Further processing of the new hyperblocks by the global scheduling algorithm.
8. Modification of the weights attached to the nodes of the dataflow graph.
9. Scheduling of the operations within the hyperblocks.
10. Minimising the speculatively executed program sections by moving up common code sequences starting at the beginning of the sections.
11. Calculating a new register allocation, insertion of move instructions if necessary.
12. Basic block scheduling.

Steps 1-3, 7, 9, and 12 can be performed by a common global scheduling technique. For our investigations we use the PDG scheduling technique [1]. A simple way to implement the modifications of the scheduling heuristic (steps 6 and 8) is to adjust the weights assigned to the operations by the scheduling algorithm [18] and to insert a number of artificial edges into the dataflow graph. This allows the formerly employed local scheduling technique (List Scheduling), which arranges the instructions within the hyperblocks, to be used almost directly. Assigning a larger weight causes an instruction to be scheduled to execute later; a sketch of this weight adjustment is given below.
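
The following C sketch illustrates step 8 under some simplifying assumptions: the node structure, field names, and the adjust_weights function are hypothetical, the region of each node is assumed to be known from step 4, and the handling of load instructions and artificial edges is omitted. The offsets +10 and +20 are the values used for our base architecture in the example of Section 3.3; since the weighting scheme does not allow the compare and sync weights to be decreased directly, the surrounding instructions are penalised instead, which schedules the sync early and keeps the speculative section small.

/* Sketch of step 8 (modification of node weights), assuming a simple
   array-of-nodes representation of the extended dataflow graph.
   Field and function names are hypothetical. */
#include <stdio.h>
#include <stddef.h>

enum region { BEFORE_FORK, SPECULATIVE_SECTION, AFTER_SYNC };

struct dfg_node {
    const char *insn;      /* textual form of the instruction             */
    int weight;            /* priority for list scheduling; a larger      */
                           /* weight schedules the instruction later      */
    enum region region;    /* position relative to the fork and the sync  */
    int is_cond_or_sync;   /* compare or sync node: weight left unchanged */
};

static void adjust_weights(struct dfg_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].is_cond_or_sync)
            continue;                    /* compare/sync become relatively */
                                         /* cheaper, i.e. scheduled early  */
        if (nodes[i].region == SPECULATIVE_SECTION)
            nodes[i].weight += 10;       /* penalise speculative code      */
        else if (nodes[i].region == AFTER_SYNC)
            nodes[i].weight += 20;       /* penalise code after the sync   */
    }
}

int main(void)
{
    struct dfg_node nodes[] = {
        { "ld  b,r1",        5, BEFORE_FORK,         0 },
        { "mul r1,r2,r11",   5, SPECULATIVE_SECTION, 0 },
        { "andcc r3,1,r12",  5, SPECULATIVE_SECTION, 1 },
        { "st  r3,a",        5, AFTER_SYNC,          0 },
    };
    adjust_weights(nodes, sizeof nodes / sizeof nodes[0]);
    for (size_t i = 0; i < sizeof nodes / sizeof nodes[0]; i++)
        printf("%-16s weight %d\n", nodes[i].insn, nodes[i].weight);
    return 0;
}
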
In effect, the attached values of the compare and the sync instructions are decreased relative to their neighbours. Since these modifications directly influence the size of the speculative section, they must correspond to the implemented properties of the processor that executes the generated program. The exact weights that are assigned are not fixed by this algorithm but are parameters that have to be determined for each processor implementation. Generating separate threads of control for different program paths causes the duplication of a certain number of instructions. The number of redundant instructions grows with the length of the speculative sections, but remains small in the presence of small basic blocks. Since the number of speculatively executed threads is already limited by the hardware resources, only a small increase in program size results for integer programs.

3.3. Example

This section demonstrates the technique described above using a simple program. The sample program consists of a single if-then-else statement, enclosed by some assignments and surrounded by a loop to yield a measurable running time. The sample program was constructed to demonstrate all essential steps of the algorithm. Figure 1 shows the source code of the program and Figure 2 the corresponding program dependency graph (PDG). The comments in the source code give the basic block of each statement.

int a, b, c, d;

int main(){
  int i;
  a=0; b=2; c=7;
  for(i=0; i < 999; i++){  /* D */
    a = a+b+c;             /* A */
    if(a % 2)              /* A */
      d=b*c;               /* B */
    else
      d=b*b*c;             /* C */
    b=b+c+a;               /* D */
    c=b+d;                 /* D */
  }
}

Figure 1. Source code of the sample program

The nodes of the PDG contain the assembly instructions generated from the sample program. The wide arrows show the control dependencies, the thin edges the true data dependencies. The probability of execution is assumed to be 50% for either program path; thus a static scheduling technique should not speculatively move instructions upwards from the basic blocks B or C to A.

Figure 2. PDG of the sample program

The assembly instructions are written in a simplified SPARC instruction set. For clarity, we use simplified register names and memory addresses and we dispense with delayed branches. To release the strong binding of the compare to the branch instruction, the compare is enriched by a destination register that stores the condition code; accordingly, the branch reads the condition from a general purpose register. This is a common approach in modern microprocessors.

To evaluate the results of the speculative execution in the context of currently used scheduling techniques, and to have a base for calculating the performance gain, we compiled the sample program employing the global PDG scheduling algorithm (using egcs, release 1.0.1, optimisation level -O2). For the generated program the cycles to be executed on our base processor (see Section 4) were determined. This led to 22 cycles for the loop body if the then path is taken and 26 cycles for the else path, yielding an average of 24 cycles per iteration of the loop body.

We now explain the steps of the algorithm described in Section 3.2. Steps 1 to 3 work as in PDG scheduling. The branch corresponding to the if statement in the example is not handled by this global scheduling technique. Therefore the branch is removed by our method by splitting the thread of control (step 4). After that step the basic blocks A, B, and D form thread 1, and the blocks A, C, and D form thread 2. The branch instruction (the last instruction in block A) is replaced by a pair of fork and sync (step 5), where the fork takes the first position in the hyperblock and the sync the position of the former branch. In the next step the store instructions are moved out of the speculative section; by inserting artificial edges into the dataflow graph, a re-entry is prevented. Now the PDG of the hyperblock is transformed into the extended dataflow graph by the PDG scheduling algorithm (step 7).
In this representation all restrictions that are necessary to ensure a semantically correct schedule are coded as artificial edges. Before applying the instruction scheduling algorithm (List Scheduling) to the hyperblock, the weights attached to the nodes are modified (step 8). This concerns the nodes corresponding to the fork, the sync, and the load instructions. By decreasing the weights of the load and the sync instructions, the sync is scheduled to execute early; thus the speculatively executed sections become smaller. Since the current weighting does not allow a direct decrease of the values, the weights of the surrounding instructions have to be increased to achieve the expected difference. Furthermore, the load instructions are moved out of the speculatively executed sections. For our base architecture we have increased the weights of all instructions within the speculative section by 10, and of all instructions following the sync by 20. Figure 3 shows the extended dataflow graph for thread 2 after executing step 8.

Figure 3. Extended dataflow graph (thread 2)

Now the List Scheduling algorithm is applied to both threads (step 9). To further decrease the size of the speculative sections, identical instruction sequences at the beginning of both threads are moved out of these sections (step 10). After this transformation the final positions of the fork and sync instructions are fixed, and therefore the speculative sections are determined. At this point a new register allocation is calculated for these sections in both threads; in the example, the registers r3, r9, r11, and r12 have to be renamed. The restoration of the values that are live at the end of the hyperblock to the original registers is done implicitly by the instructions following the sync, so there is no need for additional move operations. The assembly program generated after executing the whole algorithm of Section 3.2 is shown in Figure 4. For each assembly instruction the corresponding C statement is given as a comment; the number at the beginning of each line gives the processor cycle in which the instruction is executed, relative to the beginning of the loop iteration. The common code at label .LL5 precedes the fork; the two threads start at .LL15 and .LL25.

.LL5:
(1)  ld    b,r1
(1)  add   r5,1,r5      ; i++
(2)  ld    c,r2
(5)  add   r1,r2,r10    ; b+c
(5)  ld    a,r3
(6)  fork  .LL15,.LL25

; Thread 1
.LL15:
(7)  mul   r1,r2,r11    ; b*c
(8)  add   r3,r1,r3     ; a+b
(9)  add   r3,r2,r3     ; a+b+c
(9)  cmp   r5,r6,r9     ; i< 999
(10) andcc r3,1,r12     ; a%2
(11) sync  r12
(11) st    r3,a         ; a=...
(12) add   r10,r3,r3    ; b+c+a
(12) st    r11,d        ; d=
(13) add   r3,r11,r11   ; b+c
(13) st    r3,b         ; b=...
(14) st    r11,c        ; c=...
(14) ble   .LL5,r9

; Thread 2
.LL25:
(7)  mul   r1,r1,r15    ; b*b
(8)  add   r3,r1,r16    ; a+b
(9)  add   r16,r2,r16   ; a+b+c
(9)  cmp   r5,r6,r17    ; i< 999
(10) andcc r16,1,r18    ; a%2
(11) sync  r18
(11) mul   r15,r2,r11   ; b*b*c
(12) st    r16,a        ; a=
(12) add   r10,r16,r3   ; b+c+a
(15) st    r11,d        ; d=...
(15) add   r3,r11,r11   ; b+c
(16) st    r3,b         ; b=...
(16) st    r11,c        ; c=...
(17) ble   .LL5,r17

Figure 4. Generated assembly code for the sample program

Thus the threads need 14 and 17 cycles, respectively. Again assuming an execution probability of 50% for each path, the average time to process one loop iteration is 15.5 cycles. The program generated by our scheduling technique contains a speculative section of 6 instructions, none of which accesses memory.

4. Performance Evaluation

In the previous section we introduced our compilation technique. In this section we show a number of experimental results to examine the performance of the speculative execution of program paths. The benchmarks used belong to the SPECint benchmark suite. For two reasons we presently cannot translate complete programs but have to focus on frequently executed functions. First, we currently do not have an implementation of the algorithm that allows arbitrary programs to be translated. Second, the architectures under consideration are subjects of research: for some of them simulators are available, for others only the documentation of the architectural concepts is accessible. This means that both the process of applying our technique to the programs and the calculation of the execution time are partially done by hand. Since this is a time-consuming process, we can currently present only results for two programs from the SPECint benchmark suite and for the sample program from Section 3.3. The programs chosen from the SPEC benchmarks are go and compress. The translated program sections cover the inner loops of the function compress from the compress benchmark, and of mrglist and getefflibs from the go benchmark. The results are shown in Table 1.

Benchmark                   base machine    multithreaded       performance
                            (cycles)        machine (cycles)    gain
compress                                                          9%
go (mrglist)                                                     40%
go (getefflibs)                                                  19%
sample program (Fig. 1)     24              15.5                 55%

Table 1. Performance gain achieved by speculatively executed program paths

For the calculation of the execution time we use two processor models. The first one, the superscalar base processor, implements a simplified SPARC architecture and is able to execute up to four instructions per cycle. We assume latencies of one cycle for all arithmetic operations except the multiplication, which takes four cycles, three cycles for load instructions, and one cycle for branches. The second processor model, the multithreaded base processor, enhances the superscalar base processor by the ability to execute up to four threads concurrently and by the instructions fork and sync for thread control. We expect these instructions to execute within a single cycle. The values presented in Table 1 were derived by counting the machine cycles that processors matching our machine models would need to execute the generated programs. The numbers shown are the cycles for the base processor, the cycles for the multithreaded processor, and the performance gain. The assembly programs for the base processor were generated by the compiler egcs. We have chosen this compiler since it implements the PDG scheduling, which is the global scheduling technique our method is based on.

5. Conclusion

In this paper we have proposed a compiler technique that aims to speed up integer programs that cannot be successfully treated by purely static scheduling techniques.
In the absence of a sufficient amount of ILP to completely utilise the superscalar execution units, we employ the ability of multithreaded processors to additionally gain from coarse-grain parallelism. The threads to be executed concurrently are generated from alternative program paths. Our technique improves the capability of global instruction scheduling techniques by removing conditional branches. A branch is replaced by a fork instruction, which splits the current thread of control, and by one sync instruction in each new thread, which terminates the incorrectly executed thread. This transformation increases the size of the hyperblocks; thus the global instruction scheduling is able to expose more ILP to the processor. We have started to evaluate our technique by translating programs from the SPECint benchmark suite. Considering preliminary results, we expect our technique to achieve a performance gain of about 20% over purely static scheduling techniques when generating code for simultaneous multithreaded processors as well as for processors that employ nanothreading or microthreading. We are currently translating further programs from the SPEC benchmarks and examining the capability of the technique to adapt to the requirements of the trace processor, the multiscalar approach, and the datascalar processor.

References

[1] D. Bernstein and M. Rodeh. Global instruction scheduling for superscalar machines. In B. Hailpern, editor, Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, ON, Canada, June 1991.
[2] A. Bolychevsky, C. R. Jesshope, and V. B. Muchnik. Dynamic scheduling in RISC architectures. IEE Proceedings Computers and Digital Techniques, 143(5).
[3] D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar architectures. In Proceedings of the 24th International Symposium on Computer Architecture, Boulder, CO, June 1997.
[4] P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu. IMPACT: An architectural framework for multiple-instruction-issue processors. In Proceedings of the 18th International Symposium on Computer Architecture, New York, NY, June 1991.
[5] K. Ebcioglu, R. D. Groves, K. C. Kim, G. M. Silberman, and I. Ziv. VLIW compilation techniques in a superscalar environment. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 36-48, Toronto, ON, Canada, 1994.
[6] J. R. Ellis. A Compiler for VLIW Architectures. The MIT Press, Cambridge, MA.
[7] L. Gwennap. Dansoft develops VLIW design. Microdesign Resources, pages 18-22.
[8] W.-M. Hwu. Introduction to predicated execution. IEEE Computer, 31(1):49-50, 1998.
[9] J. L. Lo, S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(3), 1997.
[10] S. M. Moon and K. Ebcioglu. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Transactions on Programming Languages and Systems, 19(6), 1997.
[11] S. S. Muchnick. Advanced Compiler Design & Implementation. Morgan Kaufmann Publishers, San Francisco, 1997.
[12] U. Sigmund and T. Ungerer. Identifying bottlenecks in a multithreaded superscalar microprocessor. Lecture Notes in Computer Science, 1124, 1996.
[13] J. Silc, B. Robic, and T. Ungerer. Asynchrony in parallel computing: From dataflow to multithreading. Journal of Parallel and Distributed Computing Practices, 1.
[14] J. Silc, B. Robic, and T. Ungerer. Processor Architecture: From Dataflow to Superscalar and Beyond. Springer-Verlag, Berlin, 1999.
[15] J. E. Smith and S. Vajapeyam. Trace processors: Moving to fourth-generation microarchitectures. IEEE Computer, 30(9):68-74, Sept. 1997.
[16] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, 1995.
[17] P. H. Sweany and S. J. Beaty. Dominator-path scheduling: A global scheduling method. In Proceedings of the 25th International Symposium on Microarchitecture, Dec. 1992.
[18] M. D. Tiemann. The GNU instruction scheduler. Technical report, Free Software Foundation, Cambridge, MA.
[19] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd International Symposium on Computer Architecture, Philadelphia, PA, May 1996.
[20] A. Unger, T. Ungerer, and E. Zehendner. Static speculation, dynamic resolution. In Proceedings of the 7th Workshop on Compilers for Parallel Computing (CPC '98), Linköping, Sweden, June 1998.
[21] A. Unger and E. Zehendner. Tuning the GNU instruction scheduler to superscalar microprocessors. In Proceedings of the 23rd EUROMICRO Conference, Budapest, Sept. 1997.
[22] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.


More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Instruction-Level Parallelism and Its Exploitation

Instruction-Level Parallelism and Its Exploitation Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Precise Exceptions and Out-of-Order Execution. Samira Khan

Precise Exceptions and Out-of-Order Execution. Samira Khan Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take

More information

Lecture 19: Instruction Level Parallelism

Lecture 19: Instruction Level Parallelism Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register

More information

TRIPS: Extending the Range of Programmable Processors

TRIPS: Extending the Range of Programmable Processors TRIPS: Extending the Range of Programmable Processors Stephen W. Keckler Doug Burger and Chuck oore Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/cart

More information

Control and Data Dependence Speculation in Multithreaded Processors

Control and Data Dependence Speculation in Multithreaded Processors Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC 98) Control and Data Dependence Speculation in Multithreaded Processors Pedro Marcuello and Antonio González

More information

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information