Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor


Liqiang He
Inner Mongolia University
Huhhot, Inner Mongolia, P.R. China
liqiang@imu.edu.cn

Abstract

A chip multicore processor provides the opportunity to boost sequential program performance with the duplicated hardware resources available in its cores. Previous results have shown that most sequential programs can benefit from a large and fast instruction window. In this paper, we propose a simple method to speed up sequential program execution on a chip multicore processor by dynamically organizing the unused instruction window entries of the other cores into a virtual large instruction window for the running program. The hardware budget of our method is small, and our initial analysis suggests that it is a promising way to improve sequential program performance on a chip multicore processor.

1. Introduction

Modern microprocessors achieve high performance by exploiting multiple levels of parallelism in running programs. Instruction-level parallelism (ILP) is the main objective that traditional superscalar out-of-order processors try to exploit, whereas thread-level parallelism (TLP) is the focus of today's multicore processors. With the duplicated hardware resources available in the cores, chip multicore processors provide new opportunities to boost sequential program performance by exploiting the ILP of the program more deeply.

Previous results have shown that most programs can benefit from a larger instruction issue window, because more independent instructions can be exposed for out-of-order execution. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of the increased parallelism. Chip multicore processors, on the contrary, already have duplicated issue window structures in their cores, and all of these structures have the same access latency. These resources can potentially be organized dynamically into a virtual larger instruction window for a specific running program, thus improving its performance.

In this paper, we propose a simple method to speed up sequential programs by dynamically allocating, on demand, unused window entries from the other cores and organizing them into a virtual larger window for the running program. As is well known, instructions that depend on a long latency operation (a load cache miss, for example) cannot execute until that operation completes. This allows us to separate the instructions in the instruction window into two categories: those that can execute in the near future, and those that will execute in the distant future. In our method, we group the entire chain of instructions dependent on a long latency operation into a core-to-core transfer message and send it to the other cores. If one of the other cores has free space in its local instruction window that can hold the whole or part of the chain, it responds to the message and saves the instructions in its local space.

Once a remote core responds to the message, the requesting core removes the instructions from its local instruction window, and the released window entries can then be used for newly entered instructions. Later, after the long latency operation completes, the requesting core broadcasts a similar message to the other cores, and the core that holds the dependent instructions responds and sends the instructions back to the original core. As in previous work [1], we leverage existing techniques to track the true dependencies when the grouped dependent instructions are sent out and back. In this paper we focus only on tolerating data cache misses and treat them as our only source of long latency operations. However, because of the communication delay between cores on the chip, some communication operations, for example when a sequential program performs activity migration [2], could also be treated as long latency operations and handled by our method. We leave this for future work.

Our method requires adding two flags to each window entry and an extra bit to each physical register. We also need five new types of core-to-core transfer messages for the chip multicore processor. The total hardware budget is small, and the implementation is straightforward. This paper is the starting point of our research on this topic: we explain the mechanism of our method and give a qualitative analysis; more detailed simulation will be done in the next step.

This paper is organized as follows. Section 2 discusses related work. Our method is presented in Section 3, and we give a qualitative analysis of it in Section 4. Section 5 summarizes the paper and presents future work.

2. Related Work

There has been extensive research on architecture designs that support large instruction windows. In the multiscalar [4] and trace processors [5], one large centralized instruction window is distributed as small windows among multiple parallel processing elements. Dynamic multithreading processors [6] deal with the complexity of a large window by employing a hierarchy of instruction windows. Clustering provides another approach, where a collection of small windows with associated functional units approximates a wider and deeper instruction window [7].

A recent study [8] dynamically fuses multiple cores together to build a more powerful processing unit on the chip. In that work, a core either belongs to a big fused core or works independently as a single core. This organization is quite different from ours: in our method, we dynamically allocate only part of the free resources of other cores to a requesting core to build a larger instruction window, and the other cores can still run other programs independently. Compared with [8], our method has better scalability and flexibility, and in terms of implementation complexity and hardware budget it is much simpler.

Some other studies [9, 10] investigate issue logic designs that attempt to support large instruction windows without hurting the clock rate. [11] observes that instructions dependent on long latency operations should not occupy issue queue space for a long time, and addresses this problem by prescheduling instructions based on data dependencies. Other researchers have also addressed the power consumption of large instruction window designs.
3. A Virtual Large Window Design

This section presents our technique for providing a virtual large instruction window on a chip multicore processor. We begin with an overview of the whole process of using our technique, followed by a detailed description of our particular design. We conclude the section with a discussion of various related issues.

3.1. Overview

As an example to explain our scheme, we use an Alpha-type processor as the basic building block of our chip multicore processor. Figure 1 shows the architecture of the multicore processor. Each core has seven pipeline stages: fetch, decode, rename, issue, execute, memory/writeback, and commit. Before entering the issue window (or issue queue), an instruction goes through the fetch, decode, and rename stages. When it enters the issue queue, the instruction is checked for operand readiness: if all of its source operands are ready, it is issued to a functional unit to execute; otherwise it waits in the queue until all of its source operands become ready and wake it up.

[Figure 1. Architecture of the chip multicore processor in this work.]

When a long latency operation is detected. In our scheme, when a long latency operation is detected, all the instructions in the issue queue that truly depend on it are grouped together and packed into a core-to-core REMOVE_REQ message. We discuss the details of the messages and the transfer process in subsection 3.4. In this paper we focus only on instructions in the dependence chain of a LOAD cache miss, because such a miss generally lasts hundreds of cycles in modern microprocessors and causes many dependent instructions to stall and wait in the queue.

The core that suffers the long latency operation broadcasts the packed REMOVE_REQ message to the other cores over an on-chip core-to-core bus. When the other cores receive the message, they check their local issue windows for enough free entries to hold all or part of the instructions in the message. Because of the communication delay between cores, we need to trade off the overhead of transferring a message against the number of entries a core can provide for part of the instruction group. If a core has only a few free entries available (meaning it is either busy with its own job or already holds many instructions from other cores), it is not a good candidate to hold the group at that time. Otherwise, if a core has enough free space in its local instruction window, it saves all or part of the group in its local space and responds with a REMOVE_RES message to the requesting core, indicating which instructions of the group it has saved. Once the requesting core receives the answering message, it removes those instructions from its local queue and releases their entries, so the freed entries can be used for further issued instructions.

In our design, the requesting core continues working normally after sending the request message. If there is no response to the request, the instructions of the group simply stay in the issue queue until their source operands become ready and release the waiting chain. On the other side, if a core receives a request, checks its local space, and finds no (or not enough) room for the message, it simply discards the message without taking any other action. Some instructions may have more than one source operand that is not ready; in this case, only the earliest detected operand that causes a long latency operation triggers a new request message, to avoid unnecessary duplicate messages.
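To make the removal flow concrete, the following C fragment sketches the sender side under our own assumptions: all type, constant, and function names (iq_entry_t, depends_on, bus_broadcast, the MAX_GROUP cap) are hypothetical illustrations, not interfaces defined in this paper.

    #include <stdint.h>

    #define IQ_SIZE   32               /* per-core issue queue entries (Table 1)  */
    #define MAX_GROUP 8                /* assumed cap on instructions per message */

    typedef struct {
        int      valid;
        int      core_id;              /* which core owns this entry              */
        int      group_id;             /* 0 = local; >0 = a removal group         */
        uint64_t inst;                 /* encoded instruction                     */
    } iq_entry_t;

    enum { REMOVE_REQ = 0 };

    typedef struct {
        int      type;                 /* REMOVE_REQ                              */
        int      src_core, group_id, count;
        uint64_t insts[MAX_GROUP];
    } remove_req_t;

    extern int  depends_on(uint64_t inst, int reg);  /* wakeup-logic hook         */
    extern void bus_broadcast(const void *msg);      /* core-to-core bus hook     */

    /* Called when a LOAD cache miss is signalled for destination register r.
     * Instructions truly dependent on r are grouped and broadcast; their
     * entries are released only when a REMOVE_RES answer arrives. */
    void on_load_miss(iq_entry_t iq[], int my_core, int *next_group, int r)
    {
        remove_req_t m = { REMOVE_REQ, my_core, ++*next_group, 0, {0} };
        for (int i = 0; i < IQ_SIZE && m.count < MAX_GROUP; i++)
            if (iq[i].valid && iq[i].core_id == my_core && depends_on(iq[i].inst, r))
                m.insts[m.count++] = iq[i].inst;
        if (m.count > 0)
            bus_broadcast(&m);         /* answered only by cores with enough room */
    }

A receiving core would run the complementary check described above: if its count of free entries is large enough, it copies the instructions in, tags them with the sender's core_id and group_id, and answers with a REMOVE_RES bit_vector; otherwise it drops the message silently.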

[Figure 2. Architecture and operation of the enhanced instruction window.]

Similarly to [1], the instructions in the issue queue that depend on the destination registers of the removed instructions can also trigger further request messages to the other cores, causing a series of request-sending and instruction-removal operations in the original core. Therefore, all instructions directly or indirectly dependent on the first long latency operation are identified and, where possible, saved in other remote cores. Note that each time we group into one message only the instructions that truly depend on one particular destination register. This avoids transferring a big message on the bus, and it also gives the group a better chance of finding available space in a remote core. For the load misses we focus on in this paper, the baseline Alpha core already generates the miss signal for load instructions.

When a long latency operation is completed. When a long latency operation completes, for instance when the requested data comes back from memory, a traditional instruction window design wakes up all the instructions in the issue queue that are waiting for the data, and these instructions then compete for the next issue opportunity. In our design, in addition to waking up the instructions in the local queue, the completion also triggers the broadcast of a GET_BACK message to the remote cores in order to get back the previously removed instructions. When a remote core receives such a GET_BACK message, it checks its local issue queue to see whether it holds some of the instructions the message wants back. If so, it sends a response message that includes those instructions back to the requesting core and releases the corresponding entries in its local queue; otherwise, it keeps silent and does nothing with the request. The key difference between the REMOVE_REQ and GET_BACK requests, from the requesting core's point of view, is that the latter must wait for someone's response. As we will discuss in subsection 3.3, each physical register is extended with an extra bit, rd, for Remote_Dependent. When this bit is set in a register, some dependent instructions are stored in other remote cores, so after the load miss is resolved, the core sends a GET_BACK message to retrieve those instructions if the flag in the destination register is set. Because the message is broadcast on the bus, the core that holds the instructions will eventually respond to it and send back the requested instructions.

When the response message containing the instructions arrives at the requesting core, the instructions are reinserted into the issue queue. As in [1], this reinsertion shares dispatch bandwidth with newly arrived instructions that are decoded and dispatched to the issue queue, and the dispatch logic gives priority to the reinserted instructions to ensure forward progress.
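The completion path can be sketched the same way; again, every name here (rd_bit, wakeup_local_dependents, the message struct) is our own illustration under assumed structures, not an interface fixed by the paper.

    #include <stdint.h>

    #define N_REG 128                         /* physical registers (Table 1)    */

    extern uint8_t rd_bit[N_REG];             /* the extra Remote_Dependent bits */
    extern void    wakeup_local_dependents(int reg);  /* baseline logic, unchanged */
    extern void    bus_broadcast(const void *msg);

    enum { GET_BACK = 2 };
    typedef struct { int type, src_core, reg; } get_back_t;

    /* Called when the long latency load writing register r completes. */
    void on_load_complete(int my_core, int r)
    {
        wakeup_local_dependents(r);           /* traditional wakeup, as before   */
        if (rd_bit[r]) {                      /* some dependents live remotely   */
            get_back_t m = { GET_BACK, my_core, r };
            bus_broadcast(&m);                /* the holding core must answer    */
            /* rd_bit[r] is cleared once the GET_BACK_RES arrives and the
             * returned instructions are re-dispatched with priority. */
        }
    }

Unlike a REMOVE_REQ, this broadcast must be answered: the requesting core cannot make progress on the dependence chain until the GET_BACK_RES returns and the instructions are reinserted.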

A virtual large window organization. In a chip multicore processor, when a core saves some instructions for a requesting core, it dedicates some entries of its local issue queue to the corresponding queue in the requesting core. The entries contributed by different cores, combined with the local entries, form a virtual large instruction window for the requesting core and help it exploit the ILP of the running program more deeply. The size of this virtual large instruction window varies with the REMOVE_REQ and GET_BACK messages that are sent: it grows when the other cores dedicate free entries to the requester, and shrinks when they send the saved instructions back. The virtual large window thus adapts dynamically to the usage of all the entries in the cores and maximizes the utilization of the total hardware budget. At the same time, each core maintains its own virtual large window and uses it to boost the performance of its running program.

3.2. Detecting Dependent Instructions

As in [1], we leverage the existing issue queue wakeup-select logic to identify the instructions that depend on a long latency operation. In this work, once a LOAD cache miss signal is generated, the truly dependent instructions are selected by the wakeup logic, then packed and sent in a request message. If one of the other cores responds to the message, the requesting core sets the rd bit in the register that the removed instructions depend on. The wakeup logic also searches further along the dependence chains of the destination registers of the removed instructions and triggers additional request messages for them.

3.3. Issue Window and Register File

To support our method, we add two flags to each entry of the issue window in every core. The first flag is core_id, which states which core the entry belongs to at that time. This flag needs log N bits, where N is the number of cores on the chip. We assign a logical number to each core and use it as the initial value of core_id. When a core broadcasts a REMOVE_REQ message to the other cores, it includes its core_id in the message, and the receiving core sets the flags of the entries it allocates according to this value. Later, when the instructions are sent back to the original core, these flags are reset to the host core's core_id.

The second flag added to each entry is group_id, a unique number maintained and sent by the requesting core. The initial value of every entry is zero, meaning the instructions are all in the local issue queue of the host core. When the core wants to send a group of instructions to a remote core, it increments group_id by one, sends the request message with this group_id, and also uses this number to update all the entries in its local queue that belong to the host core at that time. On the other side, the receiving core saves all the contents (core_id, group_id, and the instructions) in its local space. Updating all the owned entries in the requesting core with the new group_id is a relatively large cost of our method; however, the number of entries in a modern microprocessor's issue queue is generally small, typically 32, so we do not consider this update operation a big issue. Because the issue queue is small, we set the length of this flag to log N_entry bits, where N_entry is the size of the issue queue. We believe this is enough to hold all the concurrently pending instruction groups, and the flag operates in a saturating mode.
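As one concrete reading of this storage cost, the two per-entry flags and the per-register bit could be laid out as C bitfields for the Table 1 configuration (N = 4 cores, N_entry = 32); the type and field names are ours, not the paper's.

    #include <stdint.h>

    #define N_CORES 4                 /* -> core_id needs log2(4)  = 2 bits  */
    #define N_ENTRY 32                /* -> group_id needs log2(32) = 5 bits */

    typedef struct {
        uint8_t core_id  : 2;         /* logical owner of this entry          */
        uint8_t group_id : 5;         /* removal group; 0 = local, saturating */
    } iq_entry_tags_t;

    typedef struct {
        uint8_t rd : 1;               /* Remote_Dependent: set while dependents
                                         of this register sit in remote cores */
    } preg_tag_t;

With these widths, each 32-entry queue carries 32 x (2 + 5) = 224 extra tag bits, plus one rd bit per physical register, which is the per-core budget quantified below.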
Figure 2 shows the architecture and operation of our proposed instruction window. In the wakeup logic, we need two extra registers to hold the core_id and group_id being operated on; these two registers are used to select/enable the entries that can be woken up. In normal execution mode, the value in the core_id register is the identity of the host core: simple XOR logic enables the entries that belong to the host core, and the woken instructions are sent to the functional units. In the same way, final results are broadcast only to the entries that belong to the host core. In the other

case, when a host core accepts a packet holding instructions from another core, it updates the core_id register with the identity of the remote core and uses the new value, together with the new group_id and the instructions in the packet, to select and fill free entries in its local window. The same mechanism is used when the host core needs to select entries, fill them into the packet buffer, and send them back to the original core.

In addition to the above, we also need an extra bit, rd (remote dependent), in each physical register. If this bit is set, some dependent instructions are stored in remote cores and need to be retrieved after the long latency operation completes. The bit is reset after the requesting core obtains the instructions from the remote core. The total hardware budget for our method is N_entry (log N + log N_entry) + N_reg bits per core, where N is the number of cores, N_entry is the size of one issue queue, and N_reg is the total number of physical registers in a single core. For example, with N = 4 cores, a 32-entry queue, and 128 physical registers, this is 32 x (2 + 5) + 128 = 352 bits per core, i.e., a few tens of bytes.

3.4. Core to Core Transfer Messages

Besides the hardware budget, we also need five types of core-to-core communication messages. Real chip multicore processors already have mechanisms that support core-to-core communication, for example the cache coherence protocol, so we only need to add our message types to the existing infrastructure. The added message types are as follows (a data-structure sketch of these messages is given after subsection 3.5).

a. REMOVE_REQ message: sent by the requesting core; includes the core_id of the requesting core, the group_id of the particular group of instructions, and the instructions themselves. Based on our experience, we set the minimum number of instructions to be sent to log N and the maximum to N_entry / N.

b. REMOVE_RES message: sent by a remote core that will save some instructions of the group for the requesting core; includes the core_id of the requesting core, the group_id of the request message, and a bit_vector indicating which instructions are stored in the responding core. The length of the bit_vector equals the number of instructions in the request message, with each bit corresponding to an instruction in list order: if an instruction is saved in the responding core, its bit is set; otherwise it stays zero.

c. GET_BACK message: sent by a requesting core that has instructions stored in a remote core; includes its core_id and the number of the register whose rd bit is set and whose long latency operation has completed.

d. GET_BACK_RES message: sent by the remote core that stores instructions for the requesting core; includes the core_id of the requesting core and the list of instructions that came from the requesting core and depend on the rd register. The remote core needs to search for the requested instructions in its local issue queue; given the small size of the queue and the consecutive positions of the stored instructions, we do not expect this search to be a problem for our method. Moreover, if the remote core holds more than one group of instructions for the requesting core, finding the group that depends on the rd register is easy work for the wakeup logic of the issue window.

e. INVALIDATE message: sent by a requesting core to invalidate unused instructions stored in remote cores; includes its core_id and the rd register that the instructions depend on.
This message is used to squash instructions stored in remote cores after a branch misprediction.

3.5. Squashing Instructions from Remote Cores

When a branch misprediction requires squashing instructions from the queue, the traditional design can do so quickly and locally. In our method, if some of the instructions to be squashed are stored in remote cores, the original core broadcasts an INVALIDATE message, and the other cores simply discard their copies from their local queues and reset the corresponding flags of the released entries.
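For reference, the five message types of subsection 3.4 could be carried in a single tagged structure like the one below; the field layout and payload cap are our assumptions, since the paper fixes only the logical contents of each message.

    #include <stdint.h>

    enum msg_type {
        REMOVE_REQ,      /* requester -> all: a group of dependent instructions */
        REMOVE_RES,      /* holder -> requester: bit_vector of saved ones       */
        GET_BACK,        /* requester -> all: the rd register is now ready      */
        GET_BACK_RES,    /* holder -> requester: the stored instructions        */
        INVALIDATE       /* requester -> all: squash after a misprediction      */
    };

    #define MAX_GROUP 8                /* assumed cap on instructions per message */

    typedef struct {
        enum msg_type type;
        uint8_t  core_id;              /* all types: the requesting core          */
        uint8_t  group_id;             /* REMOVE_REQ, REMOVE_RES                  */
        uint8_t  reg;                  /* GET_BACK, GET_BACK_RES, INVALIDATE: rd  */
        uint32_t bit_vector;           /* REMOVE_RES: which insts were kept       */
        uint8_t  count;                /* number of valid entries in insts[]      */
        uint64_t insts[MAX_GROUP];     /* REMOVE_REQ, GET_BACK_RES payload        */
    } c2c_msg_t;

With this assumed layout, a full message is on the order of the 64-byte width of the core-to-core bus in Figure 1, so it would take roughly one or two bus transfers.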

3.6. Other Issues

3.6.1. Size of Other Related Structures

To benefit more from our method, we can adopt a larger reorder buffer than the traditional configuration, because the larger instruction issue window helps release reorder buffer entries faster. Since the reorder buffer is not on the critical path [3], we assume its size can be increased without affecting the clock cycle time. Without such enlargement, the issue queue could still be filled with instructions waiting for long latency operations, stalling further execution. We could also enlarge the fetch/decode/rename width to further improve the performance of the running program under our method, but that would affect the latency of the critical path; we leave this for future work.

3.6.2. Fairness and Quality of Service

Fairness and quality of service (QoS) are two perpetually conflicting requirements in a chip multicore processor. On one side, fairness tries to balance the performance obtained by each concurrently running program at a nearly equal scale; on the other side, QoS tries to guarantee the requirements of particular programs that generally have higher running priorities. In a general sense, a chip multicore processor cannot support both factors at the same time. Using our method to boost a single sequential program's performance is a natural way to provide QoS support for a program that has such a requirement. If the QoS requirement comes from multiple programs, each of them should have its own running priority, and a simple round-robin policy can be used to service their needs and provide multi-level QoS support.

On the contrary, stealing window entries from other cores and keeping them for a relatively long time while waiting for the long latency operation may hurt the performance of the programs running on those cores and thus affect the fairness of the whole chip. Although we already guard against this (a core only provides entries when it has enough free ones available), the future execution of the programs on those cores may still be hurt after the free entries have been allocated to the requestor. In this case, two solutions are possible. First, a set of registers can record the priorities of the cores, and only the program with the highest priority may steal entries from other cores; once overall fairness suffers because of this program, its priority is decreased until fairness recovers and its priority can be raised again. Second, as in the QoS scheme, one register can designate who may steal entries from other cores, and a simple round-robin or a more sophisticated scheduling method balances the needs of the concurrent programs so as to guarantee the fairness of the whole chip.

4. A Qualitative Analysis

As stated in the first section, this paper is the starting point of our work on this topic, and detailed simulation results are not ready yet, so we give a simple qualitative analysis here.

4.1. Limited Motivation Experiment Results

To gauge the performance potential of our method, we ran some limited motivation experiments on a 4-core CMP. The architecture of the whole chip is shown in Figure 1, and the detailed configuration of a single core, similar to an Alpha EV6, is shown in Table 1. First, we ran each of 26 programs from SPEC CPU2000 [12] alone for 100 million instructions on our 4-core processor and counted the number of issue-queue-full events during execution.
The simulated regions are selected in a manner similar to [13]. Figure 3 shows the average number of integer/floating point issue-queue-full events per 1K committed instructions. For the integer issue queue, half of the programs suffer relatively high full rates (close to or more than 500 events per 1K instructions) during execution. In bzip2, for example, for

every 1000 committed instructions, half must wait one extra time to be issued to the functional units because of the issue-queue-full condition, even when their operands are ready. There are two extreme cases, mcf and vpr, with very high full rates (on average about 5 and 1 waits per committed instruction, respectively) that significantly affect their further execution. As with the integer queue, four programs (applu, lucas, mgrid, and sixtrack) have high floating point issue queue full rates that may affect their overall performance. So, at least for some programs, the size of the issue queue is a bottleneck for execution, whereas other programs are not sensitive to the window size. Enlarging the instruction window therefore has the potential to improve the final performance of the first group of programs.

Table 1. Configuration of a single core.

  frequency: 3 GHz
  fetch queue: 16
  fetch/slot/map/commit width: 8 inst.
  pipeline length: 16
  issue width: 8 (int) / 2 (fp)
  issue queue: 32 (int) / 32 (fp)
  registers: 128 (int, fp), 2-cycle latency
  ld/st queue: 32 / 32 entries
  functional units: 4 int ALUs, 4 int mult/div, 1 fp ALU, 1 fp mult/div
  I-cache/D-cache: 64 KB, 2-way, 1 / 3 cycle latency
  L2: 2 MB, 8-way, 10-cycle latency
  L1-L2 bus: 64 B wide
  L2-memory bus: 8 B wide
  L2-L2 core-to-core bus: 64 B wide
  memory latency: min. 166, max. 255 cycles
  ITLB/DTLB: 2 KB, 4-way, 1 cycle
  branch predictor: bimod/gshare combined, 16 KB; 64-entry RAS
  ROB: 128 entries

[Figure 3. Times of issue queue full per 1K committed instructions (int_queue_full and fp_queue_full for each SPEC CPU2000 program).]

Second, we ran further experiments to see how the full rates decrease as the available window size grows in our tested 4-core processor. Here we still run only one program at a time and leave multi-program experiments for the future. Figure 4 shows the results. The full rates of both the integer and floating point issue queues decrease continuously as the window size is enlarged. In the extreme case, when all the window entries of the other cores can be dedicated to one core, the full rates of the floating point queue all drop to zero; in the integer queue, the full rates of most programs are close to zero, and only bzip2, fma3d, parser, twolf, and vortex retain fewer than 20 full events per 1000 committed instructions, still a large decrease from their original full rates. It is therefore clear that increasing the window size can eliminate most of the full events in the issue queue and relieve the pressure on this structure.

As the window size increases, the IPCs of most programs in the above experiments increase, but only slightly, which indicates that other resources (the register file, the reorder buffer, etc.) may become new performance bottlenecks once the window size problem is solved. As in subsection 3.6.1, since the reorder buffer is not on the critical path, we can assume it can be enlarged to a suitable size without affecting cycle time; we will investigate this size parameter in future work.
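The full-rate metric plotted in Figures 3 and 4 is straightforward to reproduce from two simulator counters; a minimal sketch follows, where the counter names are ours, not from the simulator used in the paper.

    #include <stdint.h>

    /* Queue-full events per 1000 committed instructions, the metric of
     * Figures 3 and 4.  full_events and committed are assumed simulator
     * counters sampled over the 100M-instruction run. */
    double queue_full_per_1k(uint64_t full_events, uint64_t committed)
    {
        return committed ? 1000.0 * (double)full_events / (double)committed
                         : 0.0;
    }

For mcf, which waits about five times per committed instruction, this metric is on the order of 5000 per 1K instructions, the worst case in Figure 3.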

[Figure 4. Times of issue queue full per 1K committed instructions as the window size increases: (a) integer issue queue; (b) floating point issue queue; bars show the baseline and the 2x, 3x, and 4x window sizes.]

The register file also needs to be scaled proportionally to the number of in-flight instructions. There are several alternative designs for large register files, including multi-cycle access, multi-level [14, 15], multi-banked [14, 16], and queue-based designs [17]. In [1], the authors use a two-level register file [14, 15] that operates on principles similar to a cache hierarchy. In this paper we do not commit to any of these designs for our scheme; we will investigate the candidate designs and give our conclusion in future work.

4.2. Communication Delay Overhead

A major concern about our method is the transfer latency overhead versus the performance improvement from the large instruction window. We only care about the latency of transferring instructions from the remote core back to the original core when they are woken up by a GET_BACK message, because the send-out latency is overlapped by the waiting time of the long latency operation. If the latency of sending instructions back is too long, it delays their further execution in the original core and degrades the final performance; if it is short enough, the sequential program benefits from our method and achieves better performance with the larger instruction window. In our experience, in a chip multicore processor running at 3 GHz, a memory access generally costs more than 200 cycles (with memory contention it can be much more), whereas a simple core-to-core transfer needs only about 20 cycles over the on-chip core-to-core bus (64 bytes wide) shown in Figure 1. Compared with the memory latency, the inter-core transfer latency is so small that we expect it to have only a minor effect on the final performance of the program. In addition, REMOVE_REQ and GET_BACK messages are sent only for truly dependent instructions, and REMOVE_RES, GET_BACK_RES, and INVALIDATE messages are sent only when necessary, so the core-to-core bus is used efficiently and the added messages do not put much pressure on it.

In summary of the above two subsections, we believe our method is a promising way to boost sequential program execution on a chip multicore processor, although several related issues need more detailed investigation. We will report detailed simulation results and solutions to the open issues in future work.

5. Conclusion

A chip multicore processor provides new opportunities to improve sequential program performance with the duplicated hardware resources in its cores. At the same time, a large instruction window can benefit most programs by exploiting instruction-level parallelism more deeply.
In this paper, we propose a simple method to boost sequential program performance by dynamically allocating unused entries in other cores and building a virtual large instruction window for the core that runs the program. Our method uses very few hardware resources, and its implementation is easy. Our initial qualitative analysis shows that it is a

promising way to improve sequential program performance, although some related issues still need to be investigated further.

Acknowledgement

The authors would like to thank the anonymous reviewers for their high quality comments on this paper.

References

[1] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, "A Large, Fast Instruction Window for Tolerating Cache Misses," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02), May 2002.

[2] P. Michaud, Y. Sazeides, A. Seznec, T. Constantinou, and D. Fetis, "A Study of Thread Migration in Temperature-Constrained Multi-Cores," ACM Transactions on Architecture and Code Optimization, June 2007.

[3] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, "Dynamically Allocating Processor Resources Between Nearby and Distant ILP," in Proceedings of the 28th Annual International Symposium on Computer Architecture, July 2001.

[4] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, "Multiscalar Processors," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

[5] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, "Trace Processors," in Proceedings of the 30th Annual International Symposium on Microarchitecture, Dec. 1997.

[6] H. Akkary and M. A. Driscoll, "A Dynamic Multithreading Processor," in Proceedings of the 31st Annual International Symposium on Microarchitecture, Dec. 1998.

[7] S. Palacharla, N. P. Jouppi, and J. E. Smith, "Complexity-Effective Superscalar Processors," in Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.

[8] E. Ipek, M. Kirman, N. Kirman, and J. Martinez, "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," in Proceedings of the 34th Annual International Symposium on Computer Architecture, June 2007.

[9] M. D. Brown, J. Stark, and Y. N. Patt, "Select-Free Instruction Scheduling Logic," in Proceedings of the 34th Annual International Symposium on Microarchitecture, Dec. 2001.

[10] D. S. Henry, B. C. Kuszmaul, G. H. Loh, and R. Sami, "Circuits for Wide-Window Superscalar Processors," in Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

[11] P. Michaud and A. Seznec, "Data-flow Prescheduling for Large Instruction Windows in Out-of-Order Processors," in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, Jan. 2001.

[12] J. L. Henning, "SPEC CPU2000: Measuring CPU Performance in the New Millennium," IEEE Computer, 33(7):28-35, July 2000.

[13] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2002), Oct. 2002, San Jose, California.

[14] J. L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham, "Multiple-Banked Register File Architectures," in Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

[15] J. Zalamea, J. Llosa, E. Ayguade, and M. Valero, "Two-Level Hierarchical Register File Organization For VLIW Processors," in Proceedings of the 33rd Annual International Symposium on Microarchitecture, Dec. 2000.

[16] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, "Reducing the Complexity of the Register File in Dynamic Superscalar Processors," in Proceedings of the 34th Annual International Symposium on Microarchitecture, Dec. 2001.

[17] B. Black and J. Shen, "Scalable Register Renaming via the Quack Register File," Technical Report CMuArt 00-1, Carnegie Mellon University, April 2000.


More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University {parki,vijay}@ecn.purdue.edu http://min.ecn.purdue.edu/~parki http://www.ece.purdue.edu/~vijay Abstract

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

Hierarchical Scheduling Windows

Hierarchical Scheduling Windows Hierarchical Scheduling Windows Edward Brekelbaum, Jeff Rupley II, Chris Wilkerson, Bryan Black Microprocessor Research, Intel Labs (formerly known as MRL) {edward.brekelbaum, jeffrey.p.rupley.ii, chris.wilkerson,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Suresh Kumar, Vishal Gupta *, Vivek Kumar Tamta Department of Computer Science, G. B. Pant Engineering College, Pauri, Uttarakhand,

More information

The Validation Buffer Out-of-Order Retirement Microarchitecture

The Validation Buffer Out-of-Order Retirement Microarchitecture The Validation Buffer Out-of-Order Retirement Microarchitecture S. Petit, J. Sahuquillo, P. López, and J. Duato Department of Computer Engineering (DISCA) Technical University of Valencia, Spain Abstract

More information

On Pipelining Dynamic Instruction Scheduling Logic

On Pipelining Dynamic Instruction Scheduling Logic On Pipelining Dynamic Instruction Scheduling Logic Jared Stark y Mary D. Brown z Yale N. Patt z Microprocessor Research Labs y Intel Corporation jared.w.stark@intel.com Dept. of Electrical and Computer

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel

Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Aalborg Universitet Efficient Resource Allocation on a Dynamic Simultaneous Multithreaded Architecture Ortiz-Arroyo, Daniel Publication date: 2006 Document Version Early version, also known as pre-print

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Implicitly-Multithreaded Processors

Implicitly-Multithreaded Processors Appears in the Proceedings of the 30 th Annual International Symposium on Computer Architecture (ISCA) Implicitly-Multithreaded Processors School of Electrical & Computer Engineering Purdue University

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Energy Efficient Asymmetrically Ported Register Files

Energy Efficient Asymmetrically Ported Register Files Energy Efficient Asymmetrically Ported Register Files Aneesh Aggarwal ECE Department University of Maryland College Park, MD 20742 aneesh@eng.umd.edu Manoj Franklin ECE Department and UMIACS University

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

A Power and Temperature Aware DRAM Architecture

A Power and Temperature Aware DRAM Architecture A Power and Temperature Aware DRAM Architecture Song Liu, Seda Ogrenci Memik, Yu Zhang, and Gokhan Memik Department of Electrical Engineering and Computer Science Northwestern University, Evanston, IL

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors

Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani, Naveen Muralimanohar, Rajeev Balasubramonian Department of Electrical and Computer Engineering School

More information

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors In Proceedings of the th International Symposium on High Performance Computer Architecture (HPCA), Madrid, February A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

Select-Free Instruction Scheduling Logic

Select-Free Instruction Scheduling Logic Select-Free Instruction Scheduling Logic Mary D. Brown y Jared Stark z Yale N. Patt y Dept. of Electrical and Computer Engineering y The University of Texas at Austin fmbrown,pattg@ece.utexas.edu Microprocessor

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification

Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification Alok Garg, M. Wasiur Rashid, and Michael Huang Department of Electrical & Computer Engineering University

More information

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information