Control and Data Dependence Speculation in Multithreaded Processors


Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC 98)

Pedro Marcuello and Antonio González
Universitat Politècnica de Catalunya, Departament d'Arquitectura de Computadors
C/ Jordi Girona, 1-3, Mòdul D6, 08034 Barcelona, Spain
{pmarcue,antonio}@ac.upc.es

Keywords: Multithreaded Architecture, Runtime Generation of Threads, Control Speculation, Loop Detection, Data Dependence Speculation.

Abstract

Boosting instruction level parallelism in dynamically scheduled processors requires a large instruction window. The approach taken by current superscalar processors to build the instruction window has important limitations, such as the need for more powerful instruction fetch mechanisms and the increasing complexity and delay of the issue logic. In this paper we present a novel processor architecture, called DeSM, based on a multithreaded execution model that takes a completely different approach to managing a large instruction window. The idea is to identify at runtime sections of code that correspond to loops and to execute several iterations concurrently, even if they are dependent. Unlike in superscalar processors, instructions are not decoded in sequential order, so the dependence-checking mechanism of superscalar processors would not work for DeSM processors. In a DeSM processor, interthread dependences are speculated (i.e., they are predicted and instructions are executed obeying the predicted dependences). In addition, a DeSM processor significantly reduces the required instruction fetch bandwidth, since it takes advantage of the fact that the multiple threads of control share common code (they execute different iterations of the same loop). The novel features of the DeSM rely on hardware mechanisms that require neither any special feature in the instruction set architecture nor compiler support.

1. Introduction

In recent years, technology improvements have allowed processor designers to increase the instruction window size and the number of functional units, and thereby to increase the instruction issue width. Nevertheless, the approach used by current superscalar processors to build the instruction window is not scalable, and it faces two main problems.

Branch prediction. Building a large window requires speculating beyond a large number of branches. Even with very powerful prediction techniques, such as the hybrid predictor announced for the DEC 21264, the probability of being on the correct path after speculating on multiple branches is rather low. For instance, if we assume a 95% uniformly distributed success rate, as expected for the announced DEC 21264 on integer codes, and a 2% branch frequency, the probability of correctly predicting the control flow beyond 1000 instructions is only 0.36.

Complexity and delay of the issue logic grow with the instruction window size. It has been shown in a detailed study that the issue logic is likely to be one of the most critical parts of future processors [9].

On the other hand, multithreaded architectures have been studied to increase the instruction level parallelism in two nonexclusive ways: executing several independent threads, which may come from different programs, or executing dependent threads with the necessary synchronization to obey all the dependences among them.
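The arithmetic behind the branch-prediction figure quoted above can be checked with a few lines of C (a minimal sketch; the 1000-instruction window size is our reconstruction of a number lost in the transcription, since a 2% branch frequency over 1000 instructions gives 20 branches and 0.95^20 = 0.358):

    #include <math.h>
    #include <stdio.h>

    /* Probability of being on the correct path after a window of
     * `insts` instructions, given per-branch prediction accuracy
     * `acc` and branch frequency `freq` (fraction of instructions
     * that are branches). */
    static double correct_path_prob(double acc, double freq, double insts)
    {
        return pow(acc, freq * insts);
    }

    int main(void)
    {
        /* 0.95^(0.02 * 1000) = 0.95^20 = 0.358, i.e. the 0.36 above. */
        printf("%.2f\n", correct_path_prob(0.95, 0.02, 1000.0));
        return 0;
    }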
We propose a novel processor architecture, called the Dependence Speculative Multithreaded Architecture (DeSM), based on the second kind of multithreaded architecture. However, unlike previous proposals, the DeSM architecture enforces interthread dependences through runtime dependence speculation.

2. Related work

An important feature of the DeSM architecture is that, unlike previous works such as the Multiscalar [3][10], SPSM [2], Superthreaded [12] and Multithreaded Decoupled [1] architectures, it does not require any addition or extension to the ISA. The DeSM architecture obtains multiple threads of control dynamically from conventional sequential object code, without any support from the compiler.

This is done by speculating on highly predictable branches (e.g., branches that close loops).

Figure 1: General diagram of the DeSM architecture (single PC and single fetch unit, decode and renaming logic, control speculation logic, per-thread instruction queues, shared functional units and a large register file, an instruction cache, and a Multi-Value first-level data cache backed by a conventional data cache).

Executing multiple dependent threads requires mechanisms to enforce dependences among the different threads. Data dependence speculation refers to those techniques that execute parts of a code without complete knowledge of its data dependences. The DeSM architecture makes extensive use of it: dependences that are unknown are predicted, and the code is speculatively executed obeying the predicted dependences in addition to the known ones. Data dependence speculation has hardly been used in the past. Since dependences through registers are easily identified by both the compiler and the hardware, data dependence speculation has previously been used to speculate on dependences through memory. Memory references whose effective addresses are unknown are usually called ambiguous references. When memory instructions are executed out of order, a memory reference may be performed before the effective addresses of all previous references are known; that is, it is performed without completely disambiguating the reference. We call this scheme dynamic partial disambiguation. The most remarkable work on dynamic partial disambiguation is the address resolution buffer of the Multiscalar [4] and the address reorder buffer of the HP PA8000. Both approaches use a very simple speculation heuristic: they assume no dependence between an instruction and any previous instruction whose effective address is unknown. Recent results have shown that data dependence speculation can be significantly improved through more sophisticated dependence predictors [5][7][8][6]. Obviously, the compiler may help to improve the performance of the speculation techniques, but this issue is beyond the scope of this paper. The DeSM architecture speculates on both interthread memory and register dependences. The former are handled through address prediction, whereas the latter are managed by predicting the number of writes to each architected register.

3. The Dependence Speculative Multithreaded Architecture

The DeSM hardware architecture (see Figure 1) is based on a simultaneous multithreaded processor [14][15] with the capability to speculate on multiple threads of control (threads for short) obtained from a single sequential program, and to speculate on interthread data dependences through registers and memory. The distinguishing features of the novel architecture are thus the extensive use of speculation on control and data dependences, as reflected in its name. Both types of speculation are performed entirely by the hardware, without requiring any compiler support. Thus, the object code can be specified in any conventional instruction set architecture (ISA), without any extension. Each thread, as in the SMT architecture, executes out of order. The DeSM processor fetches instructions from a single program counter (PC). The instructions are decoded and renamed in the normal way, and they are also processed by the control speculation logic for the runtime generation of threads, as described in Section 3.1. The instructions are renamed separately for each thread (every thread has its own register map table, although all the threads share the register file) and put into different instruction queues.
The threads executed by the DeSM architecture need not be independent, so there can be dependences among them. The enforcement of interthread dependences through registers is described in Section 3.2.1; the enforcement of dependences through memory is performed by a special first-level data cache, called the Multi-Value cache, which is described in Section 3.2.2. Threads also share the functional units.
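To fix ideas, the per-thread state implied by this overview can be summarized in a short C sketch (the names are ours, not the paper's; the Rwrite counters are described in Section 3.2.1, and the shared structures are only hinted at in the comments):

    #define NUM_LOGICAL_REGS 32   /* assumption: a 32-register ISA             */
    #define NIL (-1)              /* "not mapped" marker used in Section 3.2.1 */

    /* Per-context state: a private logical-to-physical register map
     * (Rmap), a per-register count of remaining predicted writes
     * (Rwrite), and a small local reorder buffer. The physical
     * register file, the functional units and the Multi-Value cache
     * are shared among all contexts. */
    struct thread_ctx {
        int rmap[NUM_LOGICAL_REGS];    /* physical register or NIL             */
        int rwrite[NUM_LOGICAL_REGS];  /* remaining predicted writes           */
        int speculative;               /* 0 only for the nonspeculative thread */
        /* ... instruction queue, local reorder buffer, etc. ...               */
    };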

3.1. Control speculation

The control speculation mechanism is based on the technique we proposed in [13]. Below we summarize the main issues of that technique; the interested reader is referred to the original paper for more details. The idea of the control speculation technique is to detect loops at runtime and predict their number of iterations, in order to generate multiple threads from different iterations of the same loop. Control speculation is performed by means of a table, called the loop table, that is indexed by a loop identifier. A loop is identified by the target address of a backward conditional or unconditional branch (for most loops this is a precise identifier). A loop table entry contains information that allows the processor to predict the number of iterations that the next execution of the loop will likely perform. In particular, it has three fields: the number of iterations of the last execution of the loop (last count); the last observed stride, i.e., the difference between the numbers of iterations of the last two executions of the loop; and the number of iterations of the current execution so far (current count). A new entry in this table is allocated every time a new loop is started (in fact, this is detected when its first iteration finishes). The entry corresponding to the loop with the least recently started iteration is chosen for replacement.

Termination of loops is detected by means of an auxiliary structure called the loop stack. This structure is a stack that contains at any time the loops to which the last committed instruction belongs. The loops are ordered from innermost to outermost, the top of the stack being the innermost. In case of overflow in the loop stack, the outermost loop is lost when a new loop is introduced. The size of the loop stack should be equal to the deepest loop nest of the program in order not to lose information.

With the support of the loop table and the loop stack, control speculation proceeds as follows. When a new iteration of a loop is started (this is recognized by a backward taken branch), the loop table is looked up. If the loop is in the table, the number of iterations left to finish the loop is predicted as the last count plus the stride minus the current count. If there are enough free contexts and dependences can be speculated (this latter condition is analyzed in the following section), one thread for each remaining iteration can proceed in parallel. If the loop is not in the loop table, or the current iteration count is higher than the predicted one, the current thread proceeds with the next iteration and several new threads are allocated to iterations that may follow it. In this case, creating more threads implies more potential parallelism, but also more wasted resources if the threads turn out to be squashed. The analysis of this parameter is beyond the scope of this paper and will be addressed in future work. We assume here, for simplicity, that in this case as many threads are created as there are free contexts in the processor.

Figure 2: The iteration table (each entry is indexed by the target address of a backward branch and contains the LNRW and CNRW fields, one counter per architected register, the LSA field, one address and stride per store, the CSA field, one address per store, and the confidence field C).

Notice that there may be speculative threads for different loops at the same time. However, in this paper we consider a simple implementation in which any loop can be speculated, but only one loop can be speculated at a time; each speculated iteration may contain any number of nested loops and subroutine calls.
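A minimal C sketch of the loop-table bookkeeping and the remaining-iteration prediction just described (field and function names are ours):

    /* One loop-table entry, as described above. */
    struct loop_entry {
        unsigned target;    /* loop id: target address of the backward branch */
        int last_count;     /* iterations of the last execution of the loop   */
        int stride;         /* difference between the last two executions     */
        int current_count;  /* iterations of the current execution so far     */
    };

    /* Iterations predicted to remain when a new iteration of a known
     * loop starts: last count plus stride minus current count. */
    static int remaining_iterations(const struct loop_entry *e)
    {
        int left = e->last_count + e->stride - e->current_count;
        return left > 0 ? left : 0;
    }

One speculative thread per predicted remaining iteration can then be spawned, subject to the free-context and dependence-predictability conditions above.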
The objective is to provide a preliminary evaluation of the feasibility of the proposed architecture; optimizing the control speculation scheme is left for future work. In this simplified scheme, only the nonspeculative thread is allowed to create new speculative threads.

3.2. Data dependence speculation

The multiple threads that are simultaneously executed by the DeSM architecture are not necessarily independent: dependences among threads, either through memory or through registers, may be present. Dependences are predicted based on the history of each loop. In particular, dependences through memory are predicted by taking advantage of the high predictability of memory addresses, as shown in studies such as [5]. The memory address prediction mechanism is based on keeping track of the last effective address generated by each static store instruction and its last stride. Using this scheme, about 75% of the memory instructions executed by the SPEC95 programs can be correctly predicted [6]. Predicting dependences through registers is based on the observation that the number of writes to each architected register is usually the same across iterations of a given loop. This obviously holds for loops without conditionals in the loop body, and it may also hold in the presence of conditionals, for instance when the then and else parts assign different values to the same registers, or when the conditional branches usually take the same direction.

Data dependences are predicted by means of a table called the iteration table (see Figure 2). Each entry of this table keeps the information regarding the dependences among iterations of a loop (usually called loop-carried dependences). It is indexed by the loop identifier (the target address of a backward branch).

Each entry contains the following fields (an example of their management is shown in Figure 3):

LNRW (Last Number of Register Writes): indicates, for each architected register, the number of times it was written in the last iteration of the loop.

CNRW (Current Number of Register Writes): similar to the LNRW, but referring to the current iteration of the loop, which has not finished yet.

LSA (Last Store Addresses): an array with one entry per store instruction in the last iteration of the loop. For each store it contains the last effective address and the last stride (the difference between the effective addresses of the last two iterations of the loop). Stores are allocated in sequential order. To reduce complexity, this field can be implemented with a fixed number of entries; loops that have more stores than this limit are not speculated.

CSA (Current Store Addresses): contains the memory addresses of the store instructions executed so far in the current iteration.

C (Confidence): assigns confidence to the predictions, in a similar way as branch predictors do. In this paper we assume a 2-bit saturating counter to implement this field. It avoids spending resources when the data required by a speculative thread is not highly predictable.

Figure 3: Values of the iteration table at the end of an iteration of a sample loop (the loop body is: ld r1, A(r0); ld r2, B(r0); add r3, r1, r2; st C(r0), r3; sub r0, r0, #4; bz #24; the numeric values of the LNRW, LSA and C fields were lost in the transcription).

Although the size of each entry is large, a small number of entries is enough to predict most of the loops. For instance, we have shown that the hit ratio of this table is about 85% with just two entries for the SPEC95 suite. When a loop that is in the iteration table finishes (the termination of a loop is detected by means of the loop stack, as previously described), the predictability of the loop is determined by checking whether: a) the current iteration has performed the same number of writes to each register as the previous one; and b) the current iteration has performed the same number of stores as the previous one, and the addresses of these stores correspond to the addresses of the stores of the previous iteration plus the last observed stride. If the two conditions hold, the C field is increased; otherwise it is decreased. A loop is considered predictable if the most significant bit of C is set. Thus, in addition to the conditions imposed by the control speculation scheme described in Section 3.1, an additional requirement for creating a speculative thread to execute an iteration of a loop is that the loop is found in the iteration table and the most significant bit of its C field is set. At the end of a loop, CNRW is copied into LNRW, CSA is copied into LSA.address and, for each i, LSA[i].stride is updated with CSA[i] - LSA[i].address, if the most significant bit of the new value of C is zero. The CNRW and CSA fields are reset.
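Gathering the fields above, an iteration-table entry and its end-of-loop predictability check might look as follows (a hypothetical rendering; MAX_STORES stands for the fixed number of LSA/CSA entries mentioned above):

    #define NUM_LOGICAL_REGS 32   /* assumption: a 32-register ISA  */
    #define MAX_STORES 8          /* assumption: fixed LSA/CSA size */

    struct iter_entry {
        unsigned target;                 /* loop identifier            */
        int lnrw[NUM_LOGICAL_REGS];      /* writes per reg, last iter  */
        int cnrw[NUM_LOGICAL_REGS];      /* writes per reg, this iter  */
        struct { unsigned addr; int stride; } lsa[MAX_STORES];
        unsigned csa[MAX_STORES];        /* store addresses, this iter */
        int n_stores_last, n_stores_cur;
        unsigned conf;                   /* 2-bit saturating counter C */
    };

    /* End-of-loop check: the same per-register write counts, and each
     * store at its last address plus the last stride, increase the
     * confidence; anything else decreases it. */
    static void update_confidence(struct iter_entry *e)
    {
        int i, ok = (e->n_stores_cur == e->n_stores_last);
        for (i = 0; ok && i < NUM_LOGICAL_REGS; i++)
            ok = (e->cnrw[i] == e->lnrw[i]);
        for (i = 0; ok && i < e->n_stores_cur; i++)
            ok = (e->csa[i] == e->lsa[i].addr + e->lsa[i].stride);
        if (ok) { if (e->conf < 3) e->conf++; }
        else    { if (e->conf > 0) e->conf--; }
        /* speculate only when the MSB of C is set, i.e. conf >= 2 */
    }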
3.2.1. Dependences through registers

The processor has a large number of registers in order to store the multiple contexts corresponding to the multiple threads that are simultaneously executed. Such registers could be organized into a single register file or distributed into several files in order to reduce the number of ports. The study of the impact of the register file organization is beyond the scope of this paper; for the evaluation part of this paper, a single register file is considered.

Each thread has its own register map table, with as many entries as architected (also called logical) registers, that indicates the physical register currently allocated to each logical register. We will refer to these tables as Rmap. Notice that each map table reflects a different assignment of logical to physical registers, corresponding to a different point in time of the execution. An Rmap entry may contain a special value, NIL, indicating that the corresponding logical register is not currently mapped to any physical register. In addition, each thread has another table, called Rwrite (register write table), that contains for each logical register the number of remaining writes to that register. When the nonspeculative thread creates the speculative ones, each entry of their Rmap tables is initialized with the same physical register as the mapping used by the nonspeculative thread if no writes are expected to this register, or with the special value NIL otherwise. In other words, if Rwrite[r] is zero, then all instructions of speculative threads that have register r as a source operand are allowed to issue (once the value in the physical register is available). If there are not enough physical registers for a thread, that thread and the subsequent speculative threads are not created.
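The creation-time initialization just described can be sketched as follows, reusing the struct thread_ctx from the earlier sketch (the predicted write counts would come from the LNRW field of the iteration table; the completion-time updates are given in the next paragraph of the text):

    /* When the nonspeculative thread spawns a speculative one:
     * logical registers with no predicted writes left share the
     * parent's physical register; all the others start unmapped. */
    static void init_speculative_rmap(const struct thread_ctx *parent,
                                      struct thread_ctx *child,
                                      const int predicted_writes[NUM_LOGICAL_REGS])
    {
        int r;
        for (r = 0; r < NUM_LOGICAL_REGS; r++) {
            child->rwrite[r] = predicted_writes[r];
            child->rmap[r] = (predicted_writes[r] == 0) ? parent->rmap[r] : NIL;
        }
        child->speculative = 1;
    }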

When an instruction with destination register r finishes, the corresponding entry of Rwrite is decreased and the following actions are taken. If Rwrite becomes equal to 0, the corresponding entry in the Rmap of this thread is copied into the same entry in the table of the next thread (if it exists), provided that this latter entry is equal to NIL: from then on, they share the same physical register. If Rwrite becomes lower than 0, the current thread is about to perform a non-predicted write to a register. In this case, all subsequent threads are squashed, and all physical registers allocated to them that are not shared with the previous thread are released. (A selective squashing of only the dependent instructions would be more effective, but it is not considered in this paper.) When a thread finishes, if some entry of the Rwrite table is greater than 0, the thread has performed fewer writes to a register than predicted. If there is a subsequent thread executing the next iteration and the corresponding entry of its Rmap is NIL, then the entry of Rmap is copied into the same entry of the subsequent map table. Finally, those physical registers of the finished thread that are not shared with the next thread are freed.

3.2.2. Dependences through memory

In the same way as the processor provides support to store multiple states of the registers, each corresponding to the view that a different thread has, a similar type of support is provided for memory. This is achieved by a special first-level data cache. This cache is called the Multi-Value (MV) cache and has the particularity that its data words are replicated for each context (see Figure 4), since each thread may have a different view of the contents of memory.

Figure 4: The MV cache for a four-context DeSM processor (each line holds an address and, for each of the four contexts, a value together with its V and NW fields; the MV cache is backed by the L2 cache).

For each replicated word, the MV cache contains two additional fields: the number of expected writes by the thread (NW) and a bit (V) indicating whether the word contains a valid value for the corresponding thread. Another important feature is that it stores noncommitted values and does not allow the next memory level to be modified until they are committed. This is implemented by means of a write-back policy together with a particular replacement scheme. When the nonspeculative thread creates a set of speculative ones, the MV cache is initialized with all those words whose addresses are predicted to be written by these threads, and the number of expected writes is put in the NW field. Those addresses can be predicted by means of the LSA field of the iteration table, by adding to each address the stride as many times as the distance between the speculative thread and the nonspeculative one. The validity bit is initialized with a 1 for the first thread that is expected to write to that address and all the preceding ones, and with a 0 for the subsequent threads. Notice that the initialization of the MV cache may seem a costly task. However, it can be done before the creation of the speculative threads, in a distributed way, while the nonspeculative thread is executing the iteration prior to the creation of the speculative threads. An example of the initial values of the Multi-Value cache for a given loop is shown in Figure 5. Each thread executes memory instructions out of order using a total disambiguation scheme.
That is, memory instructions compute their effective address as soon as their operands are available and are then sent to a load/store buffer. Stores write to memory when all the previous instructions of the same thread have completed, whereas loads read from memory when the addresses of all previous stores of the same thread are known.

Figure 5: State of the Multi-Value cache for the sample loop of Figure 3 (ld r1, A(r0); ld r2, B(r0); add r3, r1, r2; st C(r0), r3; sub r0, r0, #4; bz #24). We assume that thread 0 is the nonspeculative one, that its store address is 2, and that no thread has performed its store yet.
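A sketch of the per-word Multi-Value cache state and its creation-time initialization, following Figure 4 and the initialization rules above (a hypothetical rendering; we assume one predicted write per word per iteration, as in the sample loop):

    #define NUM_CTX 4   /* four-context DeSM, as in Figure 4 */

    /* One MV-cache word: the value, the validity bit (V) and the
     * expected-write count (NW), replicated for each context. */
    struct mv_word {
        unsigned addr;
        unsigned value[NUM_CTX];
        unsigned char v[NUM_CTX];  /* valid for this context?        */
        int nw[NUM_CTX];           /* predicted writes still pending */
    };

    /* Initialization for one predicted store address, where thread
     * `writer` is the first (and here only) thread expected to write
     * it: the writer and all preceding threads see the current memory
     * value (V = 1); the subsequent threads must wait (V = 0). */
    static void mv_init_word(struct mv_word *w, unsigned cur_value, int writer)
    {
        int t;
        for (t = 0; t < NUM_CTX; t++) {
            w->value[t] = cur_value;
            w->v[t]  = (t <= writer);
            w->nw[t] = (t == writer);  /* assumption: one write per word */
        }
    }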

If the load matches a previous store address, the store data is forwarded to the load destination register; otherwise the read is performed from memory. When a thread performs a store instruction, the corresponding value of the MV cache is updated, the NW field is decreased and the V flag is set. If NW becomes 0, the data is copied to the succeeding threads, from the next one up to (but excluding) the first one that has either NW or V different from 0; the data is copied into this latter thread only if its V flag is reset. The V flags of all the threads to which the produced data is forwarded are set. If NW becomes lower than 0, all succeeding threads are squashed. (A selective squashing of only the dependent instructions will be considered in future work.) Besides, if the corresponding line is not in the MV cache, all succeeding threads are also squashed, because the write was not predicted and there is then no guarantee that all memory dependences of the succeeding threads have been obeyed. When a thread performs a read from memory, the MV cache is checked first; if the validity bit is set, the value is read; otherwise the load is cancelled and stored in a load queue. If the address is not found, the value is requested from the next memory level.

3.3. Other features: Management of precise exceptions and instruction fetching

Precise exceptions are implemented in the DeSM architecture by means of small reorder buffers, one per thread, along with the MV cache previously described. A reorder buffer holds instructions until they are partially committed. The first instruction of a thread can partially commit as soon as it finishes; any other instruction can partially commit when all previous instructions of the same thread have partially committed. Each reorder buffer allows for out-of-order execution and control speculation inside each thread, and therefore its size is of the same order as that of a conventional superscalar processor. In case of an asynchronous exception, a precise state can be recovered by squashing all the speculative threads (which free their physical registers as described in Section 3.2.1) and recovering a precise state in the nonspeculative thread through the support of its local reorder buffer, following the conventional approach used in superscalar processors. Any memory write performed by the speculative threads has been performed in the MV cache and has not modified the L2 cache. Therefore, the part of the MV cache corresponding to the nonspeculative thread, together with the next levels of the memory hierarchy, holds the committed values that constitute the recovered precise state.

Regarding instruction fetching, since all the concurrent threads are executing the same code, a single instruction fetch engine can fetch the instructions of the loop just once and replicate them as many times as the number of active threads. Each instruction is renamed using a different register map table and then dispatched to a shared instruction queue. In this way, the DeSM architecture significantly reduces the instruction memory pressure when compared to a conventional superscalar processor. A similar feature was also proposed in [17]. This organization overcomes one of the most important hurdles of multithreaded architectures: in those machines, the processor has to fetch from different program counters, simultaneously or alternately, which makes the fetch engine one of the critical parts of such architectures. In the DeSM architecture, instructions are always fetched from a single program counter. Even with this simple fetch organization, the DeSM architecture can build a large instruction window composed of several nonadjacent small windows.
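The replicate-after-fetch dispatch described above can be sketched in a few lines (hypothetical helper names; the renaming stage itself is stubbed out):

    struct instr { int opcode, dst, src1, src2; };  /* simplified decoded form */
    struct thread_ctx;                 /* per-thread state, as sketched earlier */

    /* Placeholder for the per-thread renaming stage: sources are
     * looked up in the thread's own Rmap, and the renamed copy goes
     * into the shared instruction queue. Details omitted here. */
    static void rename_and_enqueue(struct thread_ctx *t, const struct instr *i)
    {
        (void)t; (void)i;
    }

    /* Fetch each loop instruction once, then dispatch one copy per
     * active thread, each renamed through that thread's own Rmap. */
    static void dispatch_replicated(const struct instr *i,
                                    struct thread_ctx **threads, int n_active)
    {
        int t;
        for (t = 0; t < n_active; t++)
            rename_and_enqueue(threads[t], i);
    }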
4. Performance evaluation

The DeSM architecture has been evaluated through trace-driven simulation of the SPEC95 benchmark suite. The programs were compiled for a DEC AlphaStation 600 5/266 with a 21164 processor using the DEC compiler with full optimization, and instrumented by means of the Atom tool [11]. A cycle-by-cycle simulation is performed in order to obtain accurate timing results. Because of the level of detail at which the simulation is carried out, the simulator is slow, so we have simulated 25 million instructions for each benchmark, after skipping the initial part that corresponds to initializations. The evaluated DeSM processor has 4 contexts, an issue bandwidth of 4 instructions per cycle for each context, 4 entries in the iteration table, and a fully associative MV cache with 128 entries. Every context has a local reorder buffer with 64 entries. A rather limited fetch bandwidth has been considered: up to 4 instructions at consecutive addresses. The number of physical registers is 256. The numbers of read and write ports of the Rmap and Rwrite tables are 8 and 4, respectively. We have assumed perfect branch prediction for intrathread branches and an ideal L2 cache. The functional units are (latency in brackets): 8 simple integer (1), 4 integer multiplication (2), 4 memory (2), 6 simple FP (1), 3 FP multiplication (4) and 2 FP division (7).

4.1. Performance figures

The evaluation of the DeSM is summarized in Table 1. The first five rows correspond to FP programs and the last four to integer benchmarks. The first column of the table shows the average number of committed instructions per cycle (IPC). It can be seen that the IPC is significantly higher for the FP programs than for the integer programs.

Table 1: Performance measures for the DeSM processor. Columns: IPC, TPC, percentage of single-thread execution, and prediction hit ratio; rows: tomcatv, swim, hydro2d, mgrid and applu (FP), and m88ksim, vortex, go and ijpeg (integer). The numeric values were lost in the transcription.

For most of the FP programs, the DeSM architecture achieves an IPC even higher than the fetch bandwidth. This confirms the potential benefits of the fetch mechanism in terms of the reduction in peak fetch bandwidth requirements. The second column shows the average number of active threads per cycle (TPC) that are correctly speculated. This figure measures the ability of the control and dependence speculation mechanisms to dynamically overlap multiple iterations of a loop, and therefore to exploit ILP in a noncontiguous window; this is the main source of additional ILP provided by the novel features of the DeSM architecture. The average TPC for FP programs is 3.3, out of a maximum of 4, whereas for integer programs it is much lower (1.6). From these figures we can conclude that the speculation approach is very effective for FP programs but not as good for integer codes. This motivates our current work to investigate the main reasons for this and to come up with alternative speculation mechanisms more appropriate for integer programs. The third column shows the percentage of the execution time in which only a single thread (the nonspeculative one) is active. This figure is very low for all FP programs except applu, which is also the one that exhibits the lowest TPC among the FP programs. However, it is quite high for the integer programs. The fourth column shows the percentage of correctly speculated threads. This percentage is in general quite high, even for integer programs, which confirms that the speculation mechanism is quite accurate in identifying speculative threads (in terms of control and dependences). It remains to be proved whether it is too conservative and therefore fails to identify some speculative threads, which could be one of the reasons for the low TPC of the integer programs.

We have also evaluated the performance of an out-of-order superscalar processor with the same number of functional units and the same fetch bandwidth. We have also assumed a perfect data cache, and perfect branch prediction for all branches, which may favor the superscalar processor, since the DeSM processor has to predict loop-closing branches. Besides, with this perfect prediction the superscalar processor can overlap the execution of several iterations of a loop when they are small, as the DeSM processor does. With realistic branch prediction, the DeSM would still be able to overlap the same number of iterations of a loop, but the superscalar processor in general would not, due to branches that are difficult to predict. In spite of this, the DeSM processor is on average 28% faster for FP codes and 6% faster for integer codes, assuming the same clock cycle. Moreover, notice that superscalar processors have severe limitations in scaling up the issue logic [9], whereas the DeSM architecture has a distributed issue logic that is more scalable. The main objective of this preliminary evaluation is to confirm the potential benefits of the DeSM architecture. More exhaustive evaluations are required to assess the benefits of different configurations, to include the effects of realistic intrathread branch prediction and cache memory and, in general, to identify critical design parameters and propose alternative solutions.
5. Conclusions

We have presented a novel processor architecture that makes use of a new approach to dynamically extract and execute multiple threads of control from a sequential application, without any special feature in the ISA and without compiler support. This approach is based on speculating on highly predictable branches that are not necessarily adjacent, such as the closing branches of loops. In this way, the processor manages a large window that is composed of several small windows, each corresponding to a different iteration of a loop. The multiple instruction windows are simultaneously processed by different threads of control. To overcome the problems caused by the serialized instruction fetch and decode approach implemented by superscalar processors, the DeSM architecture speculates on interthread dependences when they are unknown, for both register and memory dependences. A preliminary evaluation of the DeSM architecture has shown that the interthread parallelism, measured as the average number of threads per cycle (TPC), is very high for FP programs (3.3 for a four-context machine) and moderate for integer ones (1.6). Besides, this additional source of parallelism does not require any increase in the fetch bandwidth, since the processor takes advantage of the fact that all the simultaneous threads execute the same code. In fact, it has been shown that the DeSM processor can achieve an IPC even higher than the peak fetch bandwidth, as observed for some programs.

6. Acknowledgments

This work has been supported by the Spanish Ministry of Education under contract CYCIT 429/95 and by the AP. The research described in this paper has been developed using the resources of the Center of Parallelism of Barcelona (CEPBA).

References

[1] M.N. Dorojevets and V.G. Oklobdzija, Multithreaded Decoupled Architecture, Int. J. of High Speed Computing, 7(3), 1995.
[2] P.K. Dubey, K. O'Brien, K.M. O'Brien and C. Barton, Single-Program Speculative Multithreading (SPSM) Architecture: Compiler-Assisted Fine-Grained Multithreading, in Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, 1995.
[3] M. Franklin and G.S. Sohi, The Expandable Split Window Paradigm for Exploiting Fine Grain Parallelism, in Proc. of the Int. Symp. on Computer Architecture, pp. 58-67, 1992.
[4] M. Franklin and G.S. Sohi, ARB: A Hardware Mechanism for Dynamic Reordering of Memory References, IEEE Transactions on Computers, May 1996.
[5] J. González and A. González, Memory Address Prediction for Data Speculation, in Proc. of the Third Int. Euro-Par Conf., 1997.
[6] J. González and A. González, Speculative Execution via Address Prediction and Data Prefetching, in Proc. of the 11th ACM Int. Conf. on Supercomputing, 1997.
[7] A. Moshovos, S.E. Breach, T.N. Vijaykumar and G.S. Sohi, Dynamic Speculation and Synchronization of Data Dependences, in Proc. of the Int. Symp. on Computer Architecture, pp. 181-193, 1997.
[8] A. Moshovos and G.S. Sohi, Streamlining Inter-operation Memory Communication via Data Dependence Prediction, in Proc. of the 30th Int. Symp. on Microarchitecture, 1997.
[9] S. Palacharla, N.P. Jouppi and J.E. Smith, Complexity-Effective Superscalar Processors, in Proc. of the Int. Symp. on Computer Architecture, pp. 206-218, 1997.
[10] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, Multiscalar Processors, in Proc. of the Int. Symp. on Computer Architecture, 1995.
[11] A. Srivastava and A. Eustace, ATOM: A System for Building Customized Program Analysis Tools, in Proc. of the 1994 Conf. on Programming Language Design and Implementation, 1994.
[12] J.-Y. Tsai and P.-C. Yew, The Superthreaded Architecture: Thread Pipelining with Run-Time Data Dependence Checking and Control Speculation, in Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, pp. 35-46, 1996.
[13] J. Tubella and A. González, Control Speculation in Multithreaded Processors through Dynamic Loop Detection, in Proc. of the 4th Int. Symp. on High-Performance Computer Architecture (HPCA-4), to appear.
[14] D.M. Tullsen, S.J. Eggers and H.M. Levy, Simultaneous Multithreading: Maximizing On-Chip Parallelism, in Proc. of the Int. Symp. on Computer Architecture, 1995.
[15] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo and R.L. Stamm, Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, in Proc. of the Int. Symp. on Computer Architecture, pp. 191-202, 1996.
[16] G. Tyson and T. Austin, Improving the Accuracy and Performance of Memory Communication Through Renaming, in Proc. of the 30th Int. Symp. on Microarchitecture, 1997.
[17] S. Vajapeyam and T. Mitra, Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences, in Proc. of the Int. Symp. on Computer Architecture, 1997.


More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

A Fine-Grain Multithreading Superscalar Architecture

A Fine-Grain Multithreading Superscalar Architecture A Fine-Grain Multithreading Superscalar Architecture Mat Loikkanen and Nader Bagherzadeh Department of Electrical and Computer Engineering University of California, Irvine loik, nader@ece.uci.edu Abstract

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Simultaneous Multithreading Processor

Simultaneous Multithreading Processor Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

Introduction. Introduction. Motivation. Main Contributions. Issue Logic - Motivation. Power- and Performance -Aware Architectures.

Introduction. Introduction. Motivation. Main Contributions. Issue Logic - Motivation. Power- and Performance -Aware Architectures. Introduction Power- and Performance -Aware Architectures PhD. candidate: Ramon Canal Corretger Advisors: Antonio onzález Colás (UPC) James E. Smith (U. Wisconsin-Madison) Departament d Arquitectura de

More information

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Improving Value Prediction by Exploiting Both Operand and Output Value Locality

Improving Value Prediction by Exploiting Both Operand and Output Value Locality Improving Value Prediction by Exploiting Both Operand and Output Value Locality Jian Huang and Youngsoo Choi Department of Computer Science and Engineering Minnesota Supercomputing Institute University

More information

Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Case Study IBM PowerPC 620

Case Study IBM PowerPC 620 Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

The Mitosis Speculative Multithreaded Architecture

The Mitosis Speculative Multithreaded Architecture John von Neumann Institute for Computing The Mitosis Speculative Multithreaded Architecture C. Madriles, C. García Quiñones, J. Sánchez, P. Marcuello, A. González published in Parallel Computing: Current

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Weld for Itanium Processor

Weld for Itanium Processor ABSTRACT Sharma, Saurabh Weld for Itanium Processor (Under the direction of Dr. Thomas M. Conte) This dissertation extends a WELD for Itanium processors. Emre Özer presented WELD architecture in his Ph.D.

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information