Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor
Liqiang He
Inner Mongolia University, Huhhot, Inner Mongolia, P.R. China
liqiang@imu.edu.cn

Abstract

A chip multicore processor provides the opportunity to boost sequential program performance with the duplicated hardware resources available in its cores. Previous results have shown that most sequential programs can benefit from a large and fast instruction window. In this paper, we propose a simple method to speed up sequential program execution on a chip multicore processor by dynamically organizing the unused instruction window entries in the other cores into a relatively large virtual instruction window for the running program. The hardware budget of our method is small, and our initial analysis indicates that it is a promising way to improve sequential program performance on a chip multicore processor.

1. Introduction

Modern microprocessors achieve high performance by exploiting multiple levels of parallelism in running programs. Instruction level parallelism (ILP) is the main objective that the traditional superscalar out-of-order processor tries to exploit, whereas thread level parallelism (TLP) is the focus of today's multicore processors. With the duplicated hardware resources available in the cores, chip multicore processors provide new opportunities to boost sequential program performance by exploiting ILP in the program more deeply. Previous results have shown that most programs can benefit from a larger instruction issue window because more independent instructions can be exposed to execute out of order. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism.
Chip multicore processors, on the contrary, already have duplicated issue window structures in their cores, and all these structures have the same access latency. These available hardware resources have the potential to be organized dynamically into a larger virtual instruction window for a specific running program, and thus to improve its performance. In this paper, we propose a simple method to speed up sequential programs by dynamically allocating, on demand, unused window entries from the other available cores and organizing them into a larger virtual window for a running program. As is well known, instructions that depend on a long latency operation (a load cache miss, for example) cannot execute until that operation completes. This allows us to separate the instructions in the instruction window into two categories: those that can execute in the near future, and those that will execute in the distant future. In our method, we group the entire chain of instructions dependent on a long latency operation into a core-to-core transfer message and send it to the other cores. If one of the other cores has free space in its local instruction window that can hold the whole or part of the chain, it responds to the message and saves the instructions in its local space. Once a
remote core responds to the message, the requesting core removes the instructions from its local instruction window, and the released entries can then be used for newly entered instructions. Later, after the long latency operation completes, the core broadcasts a similar message to the other cores, and the core that holds the dependent instructions responds and sends the instructions back to the original core. As in previous work [1], we leverage existing techniques to track the true dependencies when sending the grouped dependent instructions out and back. In this paper we focus only on tolerating data cache misses, and treat these misses as our only source of long latency operations. However, given the communication delays between different cores on the chip, some communication operations, for example when a sequential program performs activity migration [2], can also be treated as long latency operations and handled by our method. We leave this for future work. Our method requires adding two flags to each window entry and an extra bit to each physical register. We also need five new types of core-to-core transfer messages for the chip multicore processor. The total hardware budget is small, and the implementation is straightforward. This paper is a starting point for our future research on this topic. We only explain the mechanism of our method and make some qualitative analysis here; more detailed simulation will be done in the next step. This paper is organized as follows. Section 2 discusses related work. Our method is presented in section 3, and we make a qualitative analysis of it in section 4. Section 5 summarizes this paper and presents future work.

2. Related Work

There has been extensive research on architecture designs for supporting large instruction windows.
In the multiscalar [4] and trace processors [5], one large centralized instruction window is distributed as small windows among multiple parallel processing elements. Dynamic multithreading processors [6] deal with the complexity of a large window by employing a hierarchy of instruction windows. Clustering provides another approach, where a collection of small windows with associated functional units is used to approximate a wider and deeper instruction window [7]. A recent research effort [8] dynamically joins multiple cores together to build a more powerful processing unit on the chip. In that work, a core either belongs to a big fused core or works independently as a single core. This organization is quite different from ours: in our method, we dynamically allocate only part of the free resources in other cores to a requesting core to build a larger instruction window, and the other cores can still run other programs independently. Compared with [8], our method has better scalability and flexibility, and in terms of implementation complexity and hardware budget it is much simpler. Other research [9, 10] investigates issue logic designs that attempt to support large instruction windows without hurting clock rate. [11] observes that instructions dependent on long latency operations should not occupy issue queue space for a long time, and addresses this problem by prescheduling instructions based on data dependencies. Other researchers also address the power consumption problem of large instruction window designs.

3. A Virtual Large Window Design

This section presents our technique for providing a virtual large instruction window on a chip multicore processor. We begin with an overview of the whole process of using our technique, followed by a detailed description of our particular design. We conclude this section with a discussion of various related issues.
3.1. Overview

As an example to explain our scheme, we use an Alpha-type processor as the basic block to build a chip multicore processor. Figure 1 shows the architecture of our multicore processor. Each core has seven stages: fetch, decode, rename, issue, execute, memory/writeback, and commit. Before entering the issue window (or issue queue), an instruction needs to go through the fetch, decode, and rename stages. When entering the issue queue, the instruction is checked for whether its operands are ready. If all the dependent operands are ready, the instruction is issued to a functional unit to execute; otherwise it waits in the queue until all the dependent operands are ready and wake it up.

Figure 1. Architecture of the Chip Multi-Core Processor in this work.

When a long latency operation is detected

In our scheme, if a long latency operation is detected, all the instructions in the issue queue that truly depend on the operation are grouped together and packed into a core-to-core REMOVE_REQ message. We discuss the details of the messages and the transfer process later in subsection 3.4. In this paper, we focus only on instructions in the dependence chain of a LOAD cache miss, because such a miss generally lasts hundreds of cycles in modern microprocessors and causes many dependent instructions to stall and wait in the queue. The core that suffers this long latency operation broadcasts the packed REMOVE_REQ message to the other cores through an on-chip core-to-core bus. When the other cores receive this message, they check their local issue windows to find whether there are enough free entries to hold all or part of the instructions in the message. Here, because of the communication delay between the cores, we need to balance the overhead of transferring a message against the number of entries that a core can provide for part of the grouped instructions.
If a core has only a few free entries available (meaning either that it is busy with its own job or that it already holds many instructions from other cores), then it is not a good candidate to hold the instructions of the group at this time. Otherwise, if a core has enough free space in its local instruction window, it saves all or part of the group in its local space and responds by sending back a REMOVE_RES message to the requesting core, indicating which instructions of the group are saved in its local space. Once the requesting core receives the answering message, it removes the instructions that are saved in the remote core from its local queue and releases the entries, so that the freed entries can be used for further issued instructions. In our design, after sending the request message, the requesting core continues working in the normal way. If there is no response to the request message, the instructions of the group simply stay in the issue queue until the dependent operands become ready and release the waiting chain. On the other hand, a core that receives a request from a remote core, checks its local space, and finds insufficient space for that message just discards the message without taking any other action. Some instructions may have more than one dependent operand that is not ready. In this case, only the earliest detected operand that causes a long latency operation triggers a new request message, in order to avoid unnecessary duplicated messages.
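The request/response handshake just described can be sketched in a few lines of Python. This is a toy model: the `Core` class, the first-responder resolution, and the `min_free` threshold are illustrative choices of ours, not details fixed by the paper.

```python
# Toy model of the REMOVE_REQ / REMOVE_RES handshake. A remote core
# accepts a group only if it has "relative enough" free space (modeled
# here by the illustrative min_free threshold); otherwise it discards
# the request silently.

class Core:
    def __init__(self, core_id, queue_size=32):
        self.core_id = core_id
        self.queue_size = queue_size
        self.local = []      # instructions in the local issue queue
        self.held = []       # (requester_id, instruction) saved for others

    def free_entries(self):
        return self.queue_size - len(self.local) - len(self.held)

    def handle_remove_req(self, requester_id, group, min_free=8):
        """Save all/part of a dependent-instruction group if there is
        enough space; return the accepted instructions (the REMOVE_RES
        payload), or an empty list if the message is discarded."""
        if self.free_entries() < min_free:
            return []                     # busy core: not a good candidate
        accepted = group[:self.free_entries()]
        self.held.extend((requester_id, i) for i in accepted)
        return accepted

def broadcast_remove_req(requester, others, group):
    """Broadcast the group; the first responder wins (a simplification
    of the bus broadcast), and the requester releases the corresponding
    local entries. With no response, the group simply stays put."""
    for core in others:
        accepted = core.handle_remove_req(requester.core_id, group)
        if accepted:
            requester.local = [i for i in requester.local
                               if i not in accepted]
            return core.core_id, accepted
    return None, []
```

The key property the sketch captures is that the requester never blocks on a REMOVE_REQ: either some core answers and entries are freed, or nothing happens and execution proceeds as in a conventional window.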
Figure 2. Architecture and operation of enhanced instruction window.

Similar to [1], the instructions in the issue queue that depend on the destination registers of a removed instruction can also trigger further request messages to the other cores, causing a series of request sendings and instruction removals in the original core. Therefore, all instructions directly or indirectly dependent on the first long latency operation are identified and can be saved in other remote cores. Note that each time we group only the instructions that truly depend on one particular destination register into a message. This avoids transferring a big message on the bus, and also gives the message a better chance of finding available space in a remote core. For the load miss that we focus on in this paper, the basic-block Alpha processor already generates the miss signal for load instructions.

When a long latency operation is completed

When a long latency operation is completed, for instance when the requested data comes back from memory, a traditional instruction window design wakes up all the instructions in the issue queue that are waiting for the data, and all these instructions then compete for the next issue opportunity. In our design, in addition to waking up the instructions in the local queue, the completion also triggers broadcasting a GET_BACK message to the remote cores in order to get back the previously removed instructions. Once a remote core receives such a GET_BACK message, it checks its local issue queue to find whether it contains some of the instructions that this message wants to get back. If yes, it sends a response message that includes the instructions back to the requesting core and releases the corresponding entries in its local queue. Otherwise, it keeps silent and does nothing for the request message.

The key difference for the requesting core between sending a REMOVE_REQ request and a GET_BACK request is that the latter must wait for someone's response. As we will discuss in subsection 3.4, each physical register is extended with an extra bit, rd for Remote_Dependent. When this bit is set in a register, some dependent instructions are stored in other remote cores. So after the load miss is resolved, the core sends a GET_BACK message to get back these instructions if the flag in the destination register is set. Because the message is broadcast on the bus, the core that contains the instructions will eventually respond to it and send back the requested instructions. When the response message containing the instructions comes back to the requesting core, the instructions need to be reinserted into the issue queue. As in [1], this reinsertion shares the same bandwidth with the newly arrived instructions that are decoded and dispatched to the issue queue; the dispatch logic gives priority to the instructions reinserted from the returned message to ensure forward progress.

A virtual large window organization
In a chip multicore processor, when a core saves some instructions for a requesting core, it dedicates some entries of its local issue queue to the corresponding queue in the requesting core. Such entries contributed by different cores, combined with the local entries in the queue, form a virtual large instruction window for the requesting core and help it exploit the ILP of the running program more deeply. The size of this virtual large instruction window varies with the REMOVE_REQ and GET_BACK messages being sent: it increases when the other cores dedicate free entries to the requester, and decreases when they send back the saved instructions. The virtual large window thus adapts dynamically to the usage of all the entries in the cores and maximizes the utilization of the total hardware budget. At the same time, each core maintains its own virtual large window and uses it to boost the performance of its running program.

Detecting Dependent Instructions

As in [1], we leverage the existing issue queue wakeup-select logic to identify the instructions that depend on a long latency operation. In this work, once a LOAD cache miss signal is generated, the truly dependent instructions are selected by the wakeup logic and then packed and sent in a request message. If one of the other cores responds to the message, the requesting core sets the rd bit in the register that the removed instructions depend on. The wakeup logic then further searches the dependence chains of the destination registers of the removed instructions and triggers further request messages for them.

Issue Window and Register File

To support our method, we need to add two flags to each entry of the issue window in all the cores. One flag is core_id, which states which core the entry belongs to at that time. This flag needs log2 N bits, where N is the number of cores on the chip. We can set a logical number for each core and use it as the initial value of core_id.
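In software terms, the transitive chain identification performed by the wakeup logic amounts to a walk over destination registers: instructions sourcing the missing load's destination form the first group, their own destinations seed the next groups, and so on, one message per destination register. The sketch below is a hypothetical model of that walk; the tuple-based instruction format is ours, not the hardware's.

```python
# Model of grouping the transitive dependence chain of a missing load.
# The real hardware reuses the wakeup-select logic; this software
# sketch walks destination registers instead.

def collect_dependent_chain(miss_dest_reg, queue):
    """queue: list of (instr_name, src_regs, dest_reg) tuples.

    Returns, per destination register, the group of instructions that
    truly depend on it -- one group (and hence one REMOVE_REQ message)
    per register, as in the paper's scheme."""
    groups = {}
    pending = [miss_dest_reg]
    seen = set()
    while pending:
        reg = pending.pop(0)
        if reg in seen:
            continue
        seen.add(reg)
        group = [ins for ins in queue if reg in ins[1]]
        if group:
            groups[reg] = [ins[0] for ins in group]
            # destinations of removed instructions seed further groups
            pending.extend(ins[2] for ins in group)
    return groups
```

Note that independent instructions never enter any group, which is exactly why removing the groups frees the window for instructions that can execute in the near future.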
Later, when a core broadcasts a REMOVE_REQ message to the other cores, it includes its core_id in the message, and the receiving core sets the flags according to this value. When the instructions are later sent back to the original core, these flags are reset to the host core_id value. The other flag added to each entry is group_id, a unique number maintained and sent by the requesting core. The initial value for all entries is zero, which means the instructions are all in the local issue queue of the host core. When the core wants to send a group of instructions to a remote core, it increments group_id by one, sends the request message with this group_id, and also uses this number to update all the entries in the local queue that belong to the host core at that time. On the other side, the receiving core saves all the contents (core_id, group_id, and instructions) in its local space. Using the new group_id to update all the owned entries in the requesting core is a relatively big cost of our method; however, in a modern microprocessor the number of entries in the issue queue is generally small, typically 32, so we think such an update operation is not a big issue. Because the size of the issue queue is small, we set the length of this flag to log2 Nentry bits, where Nentry is the size of the issue queue. We believe this is enough to hold all the concurrently pending instruction groups, and this flag operates in a saturating mode. Figure 2 shows the architecture and operation of our proposed instruction window. In the wakeup logic, we need two extra registers to hold the core_id and group_id information. These two registers are used to select/enable the entries that can be woken up. In normal execution mode, the value in the core_id register is the identity of the host core.
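As a sanity check on the storage cost of these flags, the per-core overhead can be tallied directly. The figures below assume a 4-core chip with a 32-entry issue queue and 128 physical registers, counting one queue and one register file; doubling for separate int/fp structures scales the result linearly.

```python
from math import ceil, log2

# Per-core storage overhead of the two issue-window flags plus the
# per-register rd bit: core_id takes ceil(log2 N) bits per entry,
# group_id takes ceil(log2 Nentry) bits per entry, rd takes 1 bit per
# physical register.

def flag_bits(n_cores, n_entry, n_reg):
    core_id = ceil(log2(n_cores))     # owner tag per entry
    group_id = ceil(log2(n_entry))    # group tag per entry
    rd = 1                            # remote-dependent bit per register
    return n_entry * (core_id + group_id) + n_reg * rd

# 4 cores, 32-entry queue, 128 physical registers:
print(flag_bits(4, 32, 128))  # -> 352 bits, i.e. 44 bytes per core
```

At 44 bytes per core, the overhead is small compared with the issue queue and register file themselves, which supports the claim that the hardware budget is modest.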
We can use simple XOR logic to enable the entries that belong to the host core, and send the woken-up instructions to the functional units. In the same way, the final results can also be broadcast to the entries that belong to the host core. In another
case, when a host core accepts a packet that holds some instructions from another core, it updates the core_id register with the identity of the remote core and uses the new value, combined with the new group_id and the instructions in the packet, to select and update the free entries in its local window. The host core works the same way when it needs to select entries, fill them into the packet buffer, and send them back to the original core. In addition to the above, we also need an extra bit in each physical register, rd for remote dependent. If this bit is set, some dependent instructions are stored in remote cores and need to be brought back after the long latency operation completes. The bit is reset after the requesting core obtains the instructions from the remote core. The total hardware budget of our method is Nentry x (log2 N + log2 Nentry) + Nreg bits per core, where Nentry is the size of one issue queue, Nreg is the total number of physical registers in a core, and N is the number of cores on the chip.

Core to Core Transfer Messages

Besides the hardware budget, we also need five types of core-to-core communication messages. A real chip multicore processor already has some mechanism to support core-to-core communication, for example the cache coherence protocol, so we just need to add our message types to the existing infrastructure. The added message types are as follows.

a. REMOVE_REQ message: sent by the requesting core; includes the core_id of the requesting core, the group_id of the particular group of instructions, and the instructions themselves. Based on our experience, we can set the minimum number of instructions to be sent to log2 N and the maximum to Nentry/N.

b. REMOVE_RES message: sent by the remote core that will save some instructions of the group for the requesting core. It includes the core_id of the requesting core, the group_id of the request message, and a bit_vector that indicates which instructions are stored in the responding core.
The length of the bit_vector equals the number of instructions in the request message, and each bit corresponds to the instruction at that position in the list. If an instruction is saved in the responding core, its bit is set; otherwise it remains zero.

c. GET_BACK message: sent by the requesting core, which has some instructions stored in a remote core. It includes the core_id, and the number of the register whose rd bit is set and whose corresponding long latency operation has completed.

d. GET_BACK_RES message: sent by the remote core that stores some instructions for the requesting core. It includes the core_id of the requesting core and the list of instructions that came from the requesting core and depend on the rd register. Here, the remote core needs to search for the requested instructions in its local issue queue. Again, given the small size of the queue and the consecutive storage positions of the instructions, we think this search is not a big problem for our method. In addition, if there is more than one group of instructions for the requesting core, finding the group that depends on the rd register is easy work for the wakeup logic in the issue window structure.

e. INVALIDATE message: sent by a requesting core to invalidate unused instructions stored in remote cores. It includes the core_id, and the rd register that the instructions depend on. This message is used to squash instructions in remote cores after a branch misprediction.

Squashing Instructions from Remote Cores

In case of a branch misprediction that needs to squash instructions from the queue, the traditional design can do it quickly and locally. In our method, if some of the instructions to be squashed are stored in remote cores, the original core broadcasts an INVALIDATE message to the remote cores, and each core that holds such instructions simply discards its copies from its local queue and resets the corresponding flags of the released entries.
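Packed into records, the five messages might look as follows. The field names follow the text; the concrete wire encodings are not specified in the paper, so this is an illustrative sketch only.

```python
from dataclasses import dataclass
from typing import List

# Illustrative encodings of the five core-to-core messages.

@dataclass
class RemoveReq:
    core_id: int              # requesting core
    group_id: int
    instructions: List[str]

@dataclass
class RemoveRes:
    core_id: int              # requester this response is addressed to
    group_id: int
    bit_vector: List[bool]    # one bit per instruction in the request

@dataclass
class GetBack:
    core_id: int
    reg: int                  # register whose rd bit is set

@dataclass
class GetBackRes:
    core_id: int
    instructions: List[str]   # the returned dependent instructions

@dataclass
class Invalidate:
    core_id: int
    reg: int                  # squash remote copies dependent on this reg

def saved_instructions(req: RemoveReq, res: RemoveRes) -> List[str]:
    """Decode the REMOVE_RES bit_vector: which instructions of the
    request the responder actually stored (and the requester may now
    release locally)."""
    return [ins for ins, bit in zip(req.instructions, res.bit_vector)
            if bit]
```

The bit_vector keeps the response small: its length equals the request's instruction count, so partial acceptance costs one bit per instruction rather than echoing the instructions back.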
3.4. Other Issues

Size of Other Related Structures

To benefit more from our method, we can adopt a larger reorder buffer than the traditional configuration, because the larger instruction issue window helps release reorder buffer entries faster. Since the reorder buffer is not on the critical path [3], we assume that we can increase its size without affecting clock cycle time. Without our method, by contrast, the issue queue can fill with instructions that wait for long latency operations and stall further execution. We could also enlarge the fetch/decode/rename width to further improve the performance of the running program, but that would affect the latency of the critical path; we leave this as future work.

Fairness and Quality of Service

Fairness and Quality-of-Service (QoS) are two perpetually contradictory requirements in a chip multicore processor. On one side, fairness tries to balance the performance obtained by each concurrently running program at a nearly equal scale. On the other side, QoS tries to guarantee the requirements of particular programs that generally have higher running priorities. In a general sense, a chip multicore processor cannot support both factors at the same time. Using our method to boost a single sequential program's performance is a natural way to provide QoS support for a program with such a requirement. If the QoS requirement comes from multiple programs, then each of them should have its own running priority, and a simple round-robin scheme can be used to service these needs and provide multi-level QoS support. On the contrary, stealing some window entries from other cores and keeping them for a relatively long time while waiting for the long latency operation may hurt the performance of the programs running on the cores that were stolen from, and thus affect the fairness of the whole chip.
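The round-robin arbitration mentioned above can be sketched with a single pointer register deciding which core may steal entries next; the function and variable names here are illustrative, not the paper's.

```python
# Sketch of round-robin arbitration over the cores that currently want
# to steal window entries. One pointer register suffices: scanning
# starts at the pointer, so every requesting core gets a turn before
# any core is granted twice.

def round_robin_grant(pointer, requesting, n_cores):
    """Return (granted_core, new_pointer), or (None, pointer) if no
    core is requesting this cycle."""
    for step in range(n_cores):
        core = (pointer + step) % n_cores
        if core in requesting:
            return core, (core + 1) % n_cores
    return None, pointer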
Although we partly take care of this issue already (the other cores only provide entries when they have enough free entries available), it is still possible to hurt the future execution of the programs on those cores after the free entries have been allocated to the requester. In this case, two possible solutions can be used. First, a set of registers can be added to record the priorities of the cores, and only the program with the highest priority can steal entries from other cores. Once the overall fairness is affected by this program, its priority is decreased until the fairness recovers and its priority can be increased again. Second, as in the QoS scheme, one register can be used to indicate who may steal entries from other cores, and a simple round-robin or a more complex scheduling method is used to balance the needs of the concurrent programs so as to guarantee the fairness of the whole chip.

4. A Qualitative Analysis

As we stated in the first section, this paper is a starting point of our work on this topic. Detailed simulation results are not ready at this time, so we present a simple qualitative analysis.

4.1 Limited Motivation Experiment Results

To gauge the performance potential of our method, we ran some limited motivation experiments on a 4-core CMP processor. The architecture of the whole chip is shown in Figure 1, and the detailed configuration of a single core, similar to the Alpha EV6, is shown in Table 1. First, we run each of 26 programs from SPEC CPU 2K [12] alone for 100 million instructions on our 4-core processor and count the issue-queue-full events during execution. The simulated regions are selected as in [13]. Figure 3 shows the average number of integer/floating point issue-queue-full events per 1K committed instructions. For the integer issue queue, half of the programs suffer relatively high full rates (close to or more than 500 events per 1K instructions) during execution. For example, in bzip2, for
every 1000 committed instructions, half must wait one extra time to be issued to the functional units because of the issue-queue-full condition, even when their operands are ready. There are two extreme cases, mcf and vpr, which have very high full rates (on average 5 waits and 1 wait per committed instruction, respectively) that greatly hamper their further execution. As with the integer queue, four programs, applu, lucas, mgrid, and sixtrack, have high floating point issue-queue-full rates that may affect their overall performance. So, at least for some programs, the size of the issue queue is a bottleneck for execution, whereas other programs are not sensitive to the window size. Enlarging the instruction window thus has the potential to improve the final performance of the first group of programs.

Table 1. Configuration of a single core.

  Parameter                      Setting
  frequency                      3 GHz
  fetch queue                    16
  fetch/slot/map/commit width    8 instructions
  pipeline length                16
  issue width                    8 (int) / 2 (fp)
  issue queue                    32 (int) / 32 (fp)
  registers                      128 (int, fp), 2-cycle latency
  ld/st queue                    32/32 entries
  functional units               4 int ALUs, 4 int mult/div, 1 fp ALU, 1 fp mult/div
  icache/dcache                  64KB, 2-way, 1/3-cycle latency
  L2                             2MB, 8-way, 10-cycle latency
  l1-l2 bus                      64B width
  l2-mem bus                     8B width
  l2-l2 core-to-core bus         64B width
  memory                         min. 166, max. 255 cycles
  itlb/dtlb                      2KB, 4-way, 1 cycle
  branch predictor               bimod/gshare combined, 16KB; 64-entry RAS
  ROB                            128

Figure 3. Times of issue queue full per 1K committed instructions.

Second, we run further experiments to see how the full rates decrease when we enlarge the available window size in our tested 4-core processor.
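For concreteness, the Figure 3 metric and the bzip2/mcf examples in the text reduce to a simple normalization; the raw counter values below are illustrative, not measured.

```python
# The metric of Figure 3: issue-queue-full events normalized per 1000
# committed instructions, from raw simulator counters.

def full_per_kilo(full_events, committed):
    return full_events * 1000.0 / committed if committed else 0.0

# bzip2 in the text: roughly one extra wait for every second committed
# instruction is ~500 events per 1K; mcf's five waits per committed
# instruction is 5000 per 1K.
print(full_per_kilo(50_000_000, 100_000_000))   # -> 500.0
print(full_per_kilo(500_000_000, 100_000_000))  # -> 5000.0
```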
Here, we still run only one program at a time on the processor and leave multi-program experiments for the future. Figure 4 shows the results. The full rates of the integer and floating point issue queues decrease continuously as the window size grows. In the extreme case, when all the window entries in the other cores can be dedicated to one core, all the floating point queue full rates drop to zero; in the integer queue, the full rates of most programs are close to zero, and only bzip2, fma3d, parser, twolf, and vortex have fewer than 20 full events per 1000 committed instructions, which is still a large decrease compared with the original full rates. It is thus clear that increasing the window size can potentially eliminate most of the full events in the issue queue and relieve the pressure on this structure. As the window size increases, the IPCs of most programs increase, but only slightly in these experiments, which indicates that other resources, such as the register file and reorder buffer, may become new performance bottlenecks once the window size problem is solved. As in 3.4.1, since the reorder buffer is not on the critical path, we can assume that we can enlarge it to a suitable size without affecting cycle time. We will investigate this size parameter in future work.
Figure 4. Times of issue queue full per 1K committed instructions as the window size increases: (a) integer issue queue; (b) floating point issue queue.

For the register file, we also need to scale it proportionally to the number of in-flight instructions. There are several alternative designs for large register files, including multi-cycle access, multi-level [14,15], multi-bank [14,16], and queue-based designs [17]. In [1], the authors use a two-level register file [14,15] that operates on principles similar to the cache hierarchy. In this paper, we do not commit to any of the above designs for our scheme; we will investigate these candidates and give our conclusion in future work.

4.2 Communication Delay Overhead

A big concern about our method is the transfer latency overhead versus the performance improvement from the large instruction window. Here, we only care about the latency of transferring instructions from the remote core back to the original core when they are woken up by the GET_BACK message, because the send-out latency is overlapped by the waiting time of the long latency operation. If the latency of sending instructions back is too long, it delays their further execution in the original core and degrades the final performance. If the latency is short enough, the sequential program benefits from our method and achieves better performance with the large instruction window.
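The trade-off can be framed as a simple ratio: since the outbound transfer hides under the miss, the only unhidden cost is the return transfer, and what matters is how large it is relative to the miss it tolerates. The ~20-cycle transfer and ~200-cycle miss figures below are the rough estimates discussed in the surrounding text.

```python
# Back-of-the-envelope view of the return-transfer overhead: the
# fraction of the tolerated miss latency spent getting instructions
# back. The send-out transfer is fully overlapped by the miss itself.

def return_latency_overhead(transfer_cycles, miss_cycles):
    return transfer_cycles / miss_cycles

# ~20-cycle core-to-core transfer vs. a ~200-cycle memory miss:
print(return_latency_overhead(20, 200))  # -> 0.1
```

A 10% overhead relative to the miss being tolerated, and less under memory contention, is the quantitative basis for expecting the transfer latency to have only a minor effect.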
In our experience, in a chip multicore processor running at 3GHz, a memory access generally costs more than 200 cycles (much more under memory contention), but a simple core-to-core transfer needs only about 20 cycles when using an on-chip core-to-core bus (64 bytes wide) as shown in Figure 1. Compared with the memory latency, the transfer latency between cores is so small that we expect it to have only a minor effect on the final performance of the program. In addition, the REMOVE_REQ and GET_BACK messages are sent only for truly dependent instructions, and the REMOVE_RES, GET_BACK_RES, and INVALIDATE messages are sent only when necessary, so the core-to-core bus is used very efficiently and the added messages do not put much pressure on it. In summary of the above two subsections, we believe our method is a promising way to boost sequential program execution on a chip multicore processor, although several related issues need to be investigated in more detail. We will report more detailed simulation results and solutions to the unsolved issues in future work.

5. Conclusion

A chip multicore processor provides new opportunities to improve sequential program performance with the duplicated hardware resources in its cores. At the same time, a large instruction window can benefit most programs by exploiting instruction level parallelism more deeply. In this paper, we propose a simple method to boost sequential program performance by dynamically allocating unused entries in other cores and building a virtual large instruction window for the core that runs the program. Our method uses very few hardware resources, and the implementation is easy. The initial qualitative analysis shows that it is a
promising way to improve sequential program performance, although some related issues still need to be investigated further.

Acknowledgement

The authors would like to thank the anonymous reviewers for their high quality comments on this paper.

References

[1] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, A Large, Fast Instruction Window for Tolerating Cache Misses, in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02).
[2] P. Michaud, Y. Sazeides, A. Seznec, T. Constantinou, and D. Fetis, A Study of Thread Migration in Temperature-Constrained Multi-Cores, ACM Transactions on Architecture and Code Optimization, June 2007.
[3] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, Dynamically Allocating Processor Resources Between Nearby and Distant ILP, in Proceedings of the 28th Annual International Symposium on Computer Architecture, July 2001.
[4] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, Multiscalar Processors, in Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.
[5] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, Trace Processors, in Proceedings of the 30th Annual International Symposium on Microarchitecture, Dec. 1997.
[6] H. Akkary and M. A. Driscoll, A Dynamic Multithreading Processor, in Proceedings of the 31st Annual International Symposium on Microarchitecture, Dec. 1998.
[7] S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-Effective Superscalar Processors, in Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
[8] E. Ipek, M. Kirman, N. Kirman, and J. Martinez, Core Fusion: Accommodating Software Diversity in Chip Multiprocessors, in Proceedings of the 34th Annual International Symposium on Computer Architecture, June 2007.
[9] M. D. Brown, J. Stark, and Y. N. Patt, Select-Free Instruction Scheduling Logic, in Proceedings of the 34th Annual International Symposium on Microarchitecture, Dec. 2001.
[10] D. S.
Henry, B. C. Kuszmaul, G. H. Loh, and R. Sami, Circuits for Wide-Window Superscalar Processors, in Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[11] P. Michaud and A. Seznec, Data-flow Prescheduling for Large Instruction Windows in Out-of-Order Processors, in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, Jan. 2001.
[12] J. L. Henning, SPEC CPU2000: Measuring CPU Performance in the New Millennium, IEEE Computer, 33(7):28-35, July 2000.
[13] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, Automatically Characterizing Large Scale Program Behavior, in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2002), San Jose, California, October 2002.
[14] J. L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham, Multiple-Banked Register File Architectures, in Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[15] J. Zalamea, J. Llosa, E. Ayguade, and M. Valero, Two-Level Hierarchical Register File Organization for VLIW Processors, in Proceedings of the 33rd Annual International Symposium on Microarchitecture, Dec. 2000.
[16] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, Reducing the Complexity of the Register File in Dynamic Superscalar Processors, in Proceedings of the 34th International Symposium on Microarchitecture, Dec. 2001.
[17] B. Black and J. Shen, Scalable Register Renaming via the Quack Register File, Technical Report CMuArt 00-1, Carnegie Mellon University, April 2000.