Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor


Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor

Liqiang He
Inner Mongolia University, Huhhot, Inner Mongolia 010021, P.R. China
liqiang@imu.edu.cn

Abstract

A chip multicore processor provides the opportunity to boost sequential program performance with the duplicated hardware resources available in its cores. Previous results have shown that most sequential programs can benefit from a large and fast instruction window. In this paper, we propose a simple method to speed up sequential program execution on a chip multicore processor by dynamically organizing the unused instruction window entries in the other cores into a virtual large instruction window for the running program. The hardware budget of our method is small, and our initial analysis suggests that it is a promising way to improve sequential program performance on a chip multicore processor.

1. Introduction

Modern microprocessors achieve high performance by exploiting multiple levels of parallelism in running programs. Instruction level parallelism (ILP) is the main objective that the traditional superscalar out-of-order processor tries to exploit, whereas thread level parallelism (TLP) is the focus of today's multicore processors. With the duplicated hardware resources available in the cores, chip multicore processors provide new opportunities to boost sequential program performance by exploiting the ILP in the program more deeply.

Previous results have shown that most programs can benefit from a larger instruction issue window because more independent instructions of the program can be exposed for out-of-order execution. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of the increased parallelism. Chip multicore processors, on the contrary, already have duplicated issue window structures in their cores, and all these structures have the same access latencies. These available hardware resources can potentially be organized dynamically into a virtual larger instruction window for a specific running program and thus improve its performance.

In this paper, we propose a simple method to speed up sequential programs by dynamically allocating, on demand, some unused window entries from the other available cores and organizing them into a virtual larger window for a running program. As is well known, instructions that depend on a long latency operation (a load cache miss, for example) cannot execute until the long latency operation completes. This allows us to separate the instructions in the instruction window into two categories: instructions that can execute in the near future, and instructions that will execute in the distant future. In our method, we group the entire chain of instructions dependent on a long latency operation into a core-to-core transfer message and send it to the other cores. If one of the other cores has free space in its local instruction window that can hold the whole or part of the chain of instructions, it responds to the message and saves the instructions in its local space.

Once a remote core responds to the message, the request core removes the instructions from its local instruction window, and the released window entries can then be used for newly entering instructions. Later, after the long latency operation completes, the request core broadcasts a similar message to the other cores, and the core that holds the dependent instructions responds to the message and sends the instructions back to the original core.

As in previous work [1], we leverage existing techniques to track the true dependencies when we send out and bring back the grouped dependent instructions. In this paper, we focus only on tolerating data cache misses and treat these misses as our only source of long latency operations. However, due to the communication delay between different cores on the chip, for example when a sequential program performs activity migration [2], some communication operations can also be treated as long latency operations and handled by our method. We leave this for future work.

Our method requires adding two flags to each window entry and an extra bit to each physical register. We also need five new types of core-to-core transfer message for the chip multicore processor. The total hardware budget is small, and the implementation is straightforward.

This paper is a starting point of our future research on this topic. We only explain the mechanism of our method and give a qualitative analysis; more detailed simulation will be done in the next step. The paper is organized as follows. Section 2 discusses related work. Our method is presented in Section 3, and we give a qualitative analysis of it in Section 4. Section 5 summarizes the paper and presents future work.

2. Related Work

There has been extensive research on architecture designs for supporting large instruction windows. In the multiscalar [4] and trace processors [5], one large centralized instruction window is distributed as small windows among multiple parallel processing elements. Dynamic multithreading processors [6] deal with the complexity of a large window by employing a hierarchy of instruction windows. Clustering provides another approach, where a collection of small windows with associated functional units is used to approximate a wider and deeper instruction window [7].

A recent work [8] dynamically fuses multiple cores together and builds a more powerful processing unit on the chip. In that work, a core either belongs to a big fused core or works independently as a single core. This organization is quite different from ours. In our method, we dynamically allocate only part of the free resources in other cores to a request core to build a larger instruction window; the other cores can still run other programs independently. Compared with [8], our method has better scalability and flexibility, and in terms of implementation complexity and hardware budget it is much simpler.

Other studies [9, 10] investigate issue logic designs that attempt to support large instruction windows without hurting the clock rate. [11] observes that instructions dependent on long latency operations should not occupy issue queue space for a long time, and addresses this problem by prescheduling instructions based on data dependencies. Other researchers also address the power consumption problem of large instruction window designs.
3. A Virtual Large Window Design

This section presents our technique for providing a virtual large instruction window on a chip multicore processor. We begin with an overview of the whole process of using our technique, followed by a detailed description of our particular design. We conclude this section with a discussion of various related issues.

3.1. Overview

As an example to explain our scheme, we use an Alpha 21264-type processor as the building block of a chip multicore processor. Figure 1 shows the architecture of our multicore processor. Each core has seven stages: fetch, decode, rename, issue, execute, memory/writeback and commit. Before entering the issue window (or issue queue), an instruction goes through the fetch, decode and rename stages. When it enters the issue queue, the instruction is checked to see whether its operands are ready. If all the operands it depends on are ready, the instruction is issued to a functional unit to execute; otherwise it waits in the queue until all its operands become ready and wake it up.

Figure 1. Architecture of the chip multicore processor in this work.

3.1.1. When a long latency operation is detected

In our scheme, when a long latency operation is detected, all the instructions in the issue queue that truly depend on the operation are grouped together and packed into a core-to-core REMOVE_REQ message. We discuss the details of the messages and the transfer process in subsection 3.4. In this paper, we focus only on instructions in the dependence chain of a LOAD cache miss, because such a miss generally lasts hundreds of cycles in modern microprocessors and causes many dependent instructions to stall and wait in the queue.

The core that suffers the long latency operation broadcasts the packed REMOVE_REQ message to the other cores through an on-chip core-to-core bus. When the other cores receive this message, they check their local issue windows to find whether there are enough free entries to hold all or part of the instructions in the message. Here, because of the communication delay between the cores, we need to trade off the overhead of transferring a message against the number of entries a core can provide for part of the grouped instructions. If a core has only a few free entries available (meaning either that it is busy with its own job or that it already holds many instructions from other cores), it is not a good candidate to hold the instructions in the group at this time. Otherwise, if a core has enough free space in its local instruction window, it saves all or part of the grouped instructions in its local space and responds by sending back a REMOVE_RES message to the request core indicating which instructions in the group are saved in its local space. Once the request core receives the answering message, it removes the instructions that were saved in the remote core from its local queue and releases their entries, so the freed entries can be used for subsequently issued instructions.

In our design, after sending the request message, the request core continues working in the normal way. If there is no response to the request message, the instructions of the group simply stay in the issue queue until the operands they depend on become ready and release the waiting chain. On the other hand, after receiving a request from a remote core and finding no, or not enough, space for it, the receiving core simply discards the message without any further action. Some instructions may have more than one operand that is not ready. In this case, only the earliest detected operand that causes a long latency operation triggers a new request message, in order to avoid unnecessary duplicated messages.

Figure 2. Architecture and operation of the enhanced instruction window.

Similar to [1], the instructions in the issue queue that depend on the destination registers of the removed instructions can also trigger further request messages to the other cores, causing a series of request and removal operations in the original core. Therefore, all instructions directly or indirectly dependent on the first long latency operation are identified and possibly saved in other remote cores. Note that each time we only group the instructions that truly depend on one particular destination register into a message. This avoids transferring a big message on the bus and also gives more chance of finding available space in remote cores. For the load misses that we focus on in this paper, the building-block Alpha 21264 processor already generates the miss signal for load instructions.

3.1.2. When a long latency operation is completed

When a long latency operation is completed, for instance when the requested data comes back from memory, the traditional instruction window design wakes up all the instructions in the issue queue that are waiting for the data, and these instructions then compete for the next issue opportunity. In our design, in addition to waking up the instructions in the local queue, the core also broadcasts a GET_BACK message to the remote cores in order to get back the previously removed instructions from them.

Once a remote core receives such a GET_BACK message, it checks its local issue queue to find whether it holds some of the instructions that this message wants to get back. If so, it sends a response message that includes those instructions back to the request core and releases the corresponding entries in its local queue. Otherwise, it keeps silent and does nothing for the request message.

The key difference between a REMOVE_REQ request and a GET_BACK request is that for the latter the request core must wait for someone's response. As we will discuss in subsection 3.4, each physical register is extended with an extra bit, rd for Remote_Dependent. When this bit is set in a register, some dependent instructions are stored in other remote cores. So after the load miss is resolved, the core sends a GET_BACK message to get these instructions back if the flag in the destination register is set. Because the message is broadcast on the bus, the core that holds the instructions will eventually respond to this message and send back the requested instructions. When the response message containing the instructions comes back to the request core, the instructions need to be reinserted into the issue queue. As in [1], this reinsertion shares the same bandwidth as newly arrived instructions that are decoded and dispatched to the issue queue. The dispatch logic gives priority to the instructions reinserted from the returned message to ensure forward progress.
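To make the message flow above concrete, the following C++ sketch models the per-core handlers for the REMOVE_REQ/REMOVE_RES and GET_BACK/GET_BACK_RES exchanges. It is a minimal behavioral illustration rather than the paper's hardware design; the class and function names (Core, onLoadMiss, etc.), the free-entry threshold, and the simplified instruction and message types are all assumptions made for the sketch.

```cpp
// Minimal behavioral sketch of the REMOVE_REQ / GET_BACK protocol.
// Names, the free-entry threshold, and the data types are illustrative assumptions.
#include <vector>

struct Instr { int dest_reg; };                 // simplified instruction

struct Message {
    enum Type { REMOVE_REQ, REMOVE_RES, GET_BACK, GET_BACK_RES } type;
    int core_id;                                // requesting core
    int group_id;                               // group of dependent instructions
    int rd_reg;                                 // register the chain depends on
    std::vector<Instr> instrs;                  // instruction payload
    std::vector<bool>  saved;                   // bit_vector in REMOVE_RES
};

class Core {
public:
    Core(int id, int queue_size) : id_(id), free_entries_(queue_size) {}

    // Long latency load miss detected: pack the truly dependent chain and broadcast it.
    Message onLoadMiss(int miss_dest_reg, const std::vector<Instr>& dep_chain) {
        return Message{Message::REMOVE_REQ, id_, ++next_group_, miss_dest_reg, dep_chain, {}};
    }

    // Remote side: accept all/part of the group only while enough local space remains.
    Message onRemoveReq(const Message& req) {
        Message res{Message::REMOVE_RES, req.core_id, req.group_id, req.rd_reg, {}, {}};
        for (const Instr& i : req.instrs) {
            bool take = free_entries_ > kReserve;   // keep entries for the local program
            if (take) { held_.push_back(i); --free_entries_; }
            res.saved.push_back(take);              // an all-false vector means "discard"
        }
        return res;
    }

    // Requester: release a local entry for every instruction the remote core saved,
    // and mark the destination register so the chain is reclaimed later.
    void onRemoveRes(const Message& res) {
        for (bool s : res.saved)
            if (s) { ++free_entries_; rd_bit_[res.rd_reg] = true; }
    }

    // Miss resolved: reclaim the chain (callers would first check rd_bit_[dest_reg]).
    Message onMissComplete(int dest_reg) {
        rd_bit_[dest_reg] = false;   // simplified; the paper clears it when instructions return
        return Message{Message::GET_BACK, id_, 0, dest_reg, {}, {}};
    }

    // Remote side: return every instruction held for that core's dependent chain.
    Message onGetBack(const Message& req) {
        Message res{Message::GET_BACK_RES, req.core_id, 0, req.rd_reg, held_, {}};
        free_entries_ += static_cast<int>(held_.size());
        held_.clear();
        return res;
    }

private:
    static constexpr int kReserve = 8;           // assumed threshold, not from the paper
    int id_, free_entries_, next_group_ = 0;
    bool rd_bit_[128] = {};                      // one rd bit per physical register
    std::vector<Instr> held_;                    // instructions held for other cores
};
```

The sketch keeps only one held chain per core for brevity; the paper's design tags each donated entry with core_id and group_id, as described in subsection 3.3.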

3.1.3. A virtual large window organization

In a chip multicore processor, when a core saves some instructions for a request core, it dedicates some entries of its local issue queue to the corresponding queue in the request core. Such added entries from different cores, combined with the local entries in the queue, form a virtual large instruction window for the request core and help it exploit the ILP of the running program more deeply. The size of this virtual large instruction window varies with the REMOVE_REQ and GET_BACK messages that are sent: it increases when the other cores dedicate free entries to the requester, and decreases when the other cores send back the saved instructions. The virtual large window can thus adapt dynamically to the usage of all the entries in the cores and maximize the utilization of the total hardware budget. At the same time, each core maintains its own virtual large window and uses it to boost the performance of its running program.

3.2. Detecting Dependent Instructions

As in [1], we leverage the existing issue queue wakeup-select logic to identify the instructions that depend on a long latency operation. In this work, once a LOAD cache miss signal is generated, the truly dependent instructions are selected by the wakeup logic and then packed and sent in a request message. If one of the other cores responds to the message, the request core sets the rd bit in the register that the removed instructions depend on. The wakeup logic then further searches the dependence chains of the destination registers of the removed instructions and triggers further request messages for them.

3.3. Issue Window and Register File

To support our method, we need to add two flags to each entry of the issue window in all the cores. The first flag is core_id, which states which core the entry belongs to at that time. This flag needs log2 N bits, where N is the number of cores on the chip. We can assign a logical number to each core and use it as the initial value of core_id. Later, when a core broadcasts a REMOVE_REQ message to the other cores, it includes its core_id in the message, and the receiving core sets the flags according to this value. When the instructions are later sent back to the original core, these flags are reset to the host core_id value.

The second flag added to each entry is group_id, a unique number maintained and sent by the request core. The initial value for all the entries is the same, zero, which means the instructions are all in the local issue queue of the host core. Later, when the core wants to send a group of instructions to a remote core, it increases the group_id by one, sends the request message with this group_id, and also uses this number to update all the entries in the local queue that belong to the host core at that time. On the other side, the receiving core saves all the contents (core_id, group_id and instructions) in its local space. Updating all the owned entries in the request core with the new group_id is a relatively large cost of our method; however, in modern microprocessors the number of entries in the issue queue is generally small, typically 32, so we do not consider this update operation a big issue. Because the issue queue is small, we set the length of this flag to log2 N_entry bits, where N_entry is the size of the issue queue. We believe this is enough to hold all the concurrently existing pending instruction groups, and this flag operates in a saturating mode.
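As a rough illustration of the per-entry bookkeeping just described, the sketch below packs the two flags into an issue queue entry and computes their widths for the 4-core configuration used later in the paper. The struct layout and field names are assumptions for illustration, not the paper's hardware design.

```cpp
// Sketch of the per-entry flags (core_id, group_id) and the resulting storage overhead.
// Field names and the packing are illustrative assumptions.
#include <cstdint>
#include <cmath>
#include <cstdio>

struct IssueQueueEntry {
    uint8_t  core_id;    // which core this entry currently belongs to (log2 N bits)
    uint8_t  group_id;   // pending instruction group, 0 = local (log2 N_entry bits), saturating
    uint32_t instr;      // encoded instruction payload (placeholder)
};

int main() {
    const int N       = 4;     // number of cores (Figure 1 configuration)
    const int N_entry = 32;    // issue queue size per core (Table 1)
    const int N_reg   = 128;   // physical registers per core (Table 1)

    int core_id_bits  = (int)std::ceil(std::log2(N));        // 2 bits
    int group_id_bits = (int)std::ceil(std::log2(N_entry));  // 5 bits
    // Per-core storage overhead: two flags per entry plus one rd bit per register.
    int bits_per_core = N_entry * (core_id_bits + group_id_bits) + N_reg;
    std::printf("extra storage per core: %d bits (~%d bytes)\n",
                bits_per_core, bits_per_core / 8);            // 352 bits, about 44 bytes
    return 0;
}
```

Under these assumed parameters the overhead is only a few tens of bytes per core, which is consistent with the paper's claim that the hardware budget is small.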
Figure 2 shows the architecture and operation of our proposed instruction window. In the wakeup logic, we need two extra registers to hold the core_id and group_id information. These two registers are used to select/enable the entries that can be woken up. In the normal execution mode, the value in the core_id register is the identity of the host core. We can use simple XOR logic to enable the entries that belong to the host core and send the woken-up instructions to the functional units. In the same way, the final results can also be broadcast to the entries that belong to the host core.

In the other case, when a host core accepts a packet that holds some instructions from another core, it updates the core_id register with the identity of the remote core and uses the new value, combined with the new group_id and the instructions in the packet, to select and update the free entries in its local window. The same mechanism is used when the host core needs to select entries, fill them into the packet buffer, and send them back to the original core.

In addition to the above, we also need an extra bit in each physical register, rd for Remote_Dependent. If this bit is set, some dependent instructions are stored in remote cores and need to be fetched back after the long latency operation is completed. The bit is reset after the request core obtains the instructions from the remote core. The total hardware budget of our method is N_entry x (log2 N + log2 N_entry) + N_reg bits per core, where N is the number of cores, N_entry is the size of one issue queue, and N_reg is the total number of physical registers in a single core.

3.4. Core to Core Transfer Messages

Besides the hardware budget, we also need five types of core-to-core communication message. A real chip multicore processor already has mechanisms to support core-to-core communication, for example the cache coherence protocol, so we only need to add our message types to the existing infrastructure. The added message types are as follows.

a. REMOVE_REQ message: sent by the request core; includes the core_id of the request core, the group_id of the particular group of instructions, and the instructions themselves. Based on our experience, we can set the minimum number of instructions to be sent to log2 N and the maximum to N_entry/N.

b. REMOVE_RES message: sent by a remote core that will save some instructions of the group for the request core. It includes the core_id of the request core, the group_id of the request message, and a bit_vector that indicates which instructions are stored in the responding core. The length of the bit_vector equals the number of instructions in the request message, and each bit corresponds to one instruction in list order: if an instruction is saved in the responding core, the bit is set; otherwise it stays zero.

c. GET_BACK message: sent by a request core that has some instructions stored in a remote core. It includes its core_id and the number of the register whose rd bit is set and whose corresponding long latency operation has completed.

d. GET_BACK_RES message: sent by the remote core that stores some instructions for the request core. It includes the core_id of the request core and the list of instructions that came from the request core and depend on the rd register. Here the remote core needs to search for the requested instructions in its local issue queue. Again, because the queue is small and the instructions are stored in consecutive positions, we do not expect this search to be a big problem for our method. In addition, if there is more than one group of instructions from the request core, finding the group that depends on the rd register is easy work for the wakeup logic in the issue window structure.

e. INVALIDATE message: sent by a request core to invalidate unused instructions stored in remote cores. It includes its core_id and the number of the register on which the instructions depend. This message is used to squash instructions held in remote cores after a branch misprediction.
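The five message types can be summarized as plain data structures. The sketch below is one possible encoding of the fields listed above; the type names, field widths, and payload representation are assumptions for illustration, not a wire format defined by the paper.

```cpp
// Possible encodings of the five core-to-core message types (a-e above).
// Field widths and names are illustrative assumptions.
#include <cstdint>
#include <vector>

using InstrWord = uint64_t;                      // placeholder instruction encoding

struct RemoveReq {                               // a. REMOVE_REQ
    uint8_t core_id;                             // requesting core
    uint8_t group_id;                            // group of dependent instructions
    std::vector<InstrWord> instrs;               // the dependent chain itself
};

struct RemoveRes {                               // b. REMOVE_RES
    uint8_t core_id;                             // requesting core the answer targets
    uint8_t group_id;
    std::vector<bool> saved;                     // bit_vector: one bit per instruction, in list order
};

struct GetBack {                                 // c. GET_BACK
    uint8_t core_id;
    uint8_t rd_reg;                              // register whose rd bit is set and whose miss completed
};

struct GetBackRes {                              // d. GET_BACK_RES
    uint8_t core_id;
    std::vector<InstrWord> instrs;               // instructions that depend on the rd register
};

struct Invalidate {                              // e. INVALIDATE (branch misprediction squash)
    uint8_t core_id;
    uint8_t rd_reg;                              // register on which the squashed instructions depend
};

int main() {
    // Example: a remote core accepted the 1st and 3rd of three offered instructions.
    RemoveRes res{/*core_id=*/0, /*group_id=*/1, {true, false, true}};
    return res.saved.size() == 3 ? 0 : 1;
}
```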
3.5. Squashing Instructions from Remote Cores

In case of a branch misprediction that requires squashing instructions from the queue, the traditional design can do so quickly and locally. In our method, if some of the instructions to be squashed are stored in remote cores, the original core broadcasts an INVALIDATE message to the remote cores, and each remote core simply discards its copies from its local queue and resets the corresponding flags of the released entries.
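A minimal sketch of this squash path, reusing the per-entry flags sketched earlier, might look as follows. The handler name and the matching rule are assumptions for illustration; for simplicity it discards every entry held for the requester, whereas the paper's INVALIDATE names a specific rd register.

```cpp
// Sketch: remote-side handling of an INVALIDATE message after a branch misprediction.
// Names are illustrative; matching by core_id only is a simplification of the paper's scheme.
#include <cstdint>
#include <vector>

struct Entry { uint8_t core_id, group_id; bool valid; };

// Discard every entry held on behalf of the mispredicting core and free it locally.
int onInvalidate(std::vector<Entry>& local_queue, uint8_t requester_core_id) {
    int freed = 0;
    for (Entry& e : local_queue) {
        if (e.valid && e.core_id == requester_core_id) {
            e.valid = false;                      // release the entry
            e.group_id = 0;                       // reset flags to their "local" defaults
            ++freed;
        }
    }
    return freed;                                 // number of entries released
}

int main() {
    std::vector<Entry> q = {{1, 3, true}, {2, 1, true}, {1, 3, true}};
    return onInvalidate(q, /*requester_core_id=*/1) == 2 ? 0 : 1;
}
```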

3.6. Other Issues

3.6.1. Size of other related structures

To benefit more from our method, we can adopt a larger reorder buffer than the traditional configuration, because the larger instruction issue window now helps release reorder buffer entries faster. Since the reorder buffer is not on the critical path [3], we assume that we can increase its size without affecting the clock cycle time; otherwise the issue queue could again fill with instructions waiting for long latency operations and stall further execution. We could also enlarge the fetch/decode/rename width to further improve the performance of the running program under our method, but that would affect the latency of the critical path. We leave this as future work.

3.6.2. Fairness and Quality of Service

Fairness and Quality of Service (QoS) are two often contradictory requirements in a chip multicore processor. On one side, fairness tries to balance the performance obtained by each concurrently running program at a nearly equal level. On the other side, QoS tries to guarantee the requirements of particular programs that generally have higher running priorities. In general, a chip multicore processor cannot support both factors at the same time. Using our method to boost a single sequential program's performance is a natural way to provide QoS support for a program with such a requirement. If the QoS requirement comes from multiple programs, each of them should have its own running priority, and a simple round-robin policy can be used to serve these needs and provide multi-level QoS support.

On the other hand, stealing some window entries from other cores and keeping them for a relatively long time while waiting for the long latency operation may hurt the performance of the programs running on the cores that were borrowed from, and thus affect the fairness of the whole chip. Although we already take care of this issue beforehand (the other cores only provide entries when they have enough free entries available), it is still possible to hurt the future execution of the programs on those cores after the free entries are allocated to the requester. In this case, two possible solutions can be used. First, a set of registers can be added to record the priorities of the cores, and only the program with the highest priority can steal entries from other cores. Once the overall fairness is affected by this program, its priority is decreased until fairness recovers, after which its priority can be increased again. Second, as in the QoS scheme, one register can be used to indicate which core may steal entries from other cores, and a simple round-robin or more sophisticated scheduling method can be used to balance the needs of the concurrent programs and guarantee the fairness of the whole chip.

4. A Qualitative Analysis

As stated in the first section, this is a starting point of our work on this topic, and detailed simulation results are not ready at this time, so we give a simple qualitative analysis in this paper.

4.1. Limited motivation experiment results

To gauge the performance potential of our method, we ran some limited motivation experiments on a 4-core CMP processor. The architecture of the whole chip is shown in Figure 1, and the detailed configuration of a single core, similar to an Alpha EV6, is shown in Table 1. First, we ran 100 million instructions of each of 26 programs from SPEC CPU 2K [12] individually on our 4-core processor and counted the number of issue-queue-full events during execution.
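As a rough illustration of this measurement, the following sketch shows one way a cycle-level simulator could accumulate the issue-queue-full statistic reported below. The hook names and counter structure are assumptions, not taken from the simulator actually used.

```cpp
// Sketch: counting issue-queue-full events per 1K committed instructions.
// The hook names (onDispatchBlocked, onCommit) are illustrative assumptions.
#include <cstdint>
#include <cstdio>

struct QueueFullStats {
    uint64_t int_queue_full = 0;   // events where the integer issue queue was full
    uint64_t fp_queue_full  = 0;   // events where the floating point issue queue was full
    uint64_t committed      = 0;   // committed instructions

    void onDispatchBlocked(bool is_fp) { is_fp ? ++fp_queue_full : ++int_queue_full; }
    void onCommit(uint64_t n) { committed += n; }

    // Normalized rates as plotted in Figure 3.
    double intFullPer1K() const { return committed ? 1000.0 * int_queue_full / committed : 0.0; }
    double fpFullPer1K()  const { return committed ? 1000.0 * fp_queue_full  / committed : 0.0; }
};

int main() {
    QueueFullStats s;
    // Toy numbers only: e.g. a bzip2-like run where roughly half of the committed
    // instructions see one extra wait due to a full integer queue.
    s.onCommit(1000);
    for (int i = 0; i < 500; ++i) s.onDispatchBlocked(/*is_fp=*/false);
    std::printf("int full per 1K inst: %.0f\n", s.intFullPer1K());  // ~500
    return 0;
}
```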
The simulated regions were selected similarly to [13]. Figure 3 shows the average number of integer/floating point issue-queue-full events per 1K committed instructions. For the integer issue queue, half of the programs suffer relatively high full rates (close to or more than 500 events) during execution.

For example, for bzip2, out of every 1000 committed instructions, half must wait one extra time before being issued to a functional unit because of the issue-queue-full condition, even though their operands are ready. There are two extreme cases, mcf and vpr: they have very high full rates (on average 5 and 1 extra waits per committed instruction, respectively) that strongly affect their execution. Similarly to the integer queue, four programs, applu, lucas, mgrid, and sixtrack, have high floating point issue-queue-full rates that may affect their overall performance. So, at this point, the size of the issue queue is a bottleneck for the execution of at least some programs, whereas other programs are not sensitive to the window size. Enlarging the instruction window therefore has the potential to improve the final performance of the first group of programs.

Table 1. Configuration of a single core.
frequency: 3 GHz
fetch queue: 16
fetch/slot/map/commit width: 8 inst.
pipeline length: 16
issue width: 8 (int) / 2 (fp)
issue queue: 32 (int) / 32 (fp)
registers: 128 (int, fp), 2 cycles lat.
ld/st queue: 32/32 entries
functional units: 4 int ALUs, 4 int mult/div, 1 fp ALU, 1 fp mult/div
icache/dcache: 64 KB, 2-way, 1 / 3 cycle lat.
L2: 2 MB, 8-way, 10 cycles lat.
L1-L2 bus: 64 B width
L2-mem bus: 8 B width
L2-L2 core-to-core bus: 64 B width
memory: min. 166, max. 255 cycles
itlb/dtlb: 2 KB, 4-way, 1 cycle
branch predictor: bimod/gshare combined, 16 KB, 64-entry RAS
ROB: 128

Figure 3. Times of issue queue full per 1K committed instructions (int_queue_full and fp_queue_full for each SPEC CPU2000 program).

Second, we ran further experiments to see how the full rates decrease as we enlarge the available window size in our 4-core processor. Here we still run only one program at a time and leave multi-program experiments for the future. Figure 4 shows the results. The full rates of the integer and floating point issue queues decrease continuously as the window size grows. In the extreme case, when all the window entries of the other cores can be dedicated to one core, all the floating point queue full rates drop to zero; for most programs the integer queue full rates are also close to zero, and only bzip2, fma3d, parser, twolf, and vortex still show fewer than 20 full events per 1000 committed instructions, which is nevertheless a large reduction compared with their original full rates. It is therefore clear that increasing the window size can potentially eliminate most of the full events in the issue queue and thus relieve the pressure on this structure.

Although the full rates drop, the IPCs of most programs increase only slightly in these experiments, which indicates that other resources (register file, reorder buffer, etc.) may become new performance bottlenecks once the window size problem is solved. As in Section 3.6.1, the reorder buffer is not on the critical path, so we assume it can be enlarged to a suitable size without affecting the cycle time; we will investigate the size parameter in future work.

Figure 4. Times of issue queue full per 1K committed instructions as the window size increases, for (a) the integer issue queue and (b) the floating point issue queue (baseline, 2x, 3x, and 4x window size).

For the register file, we also need to scale it proportionally to the number of in-flight instructions. There are several alternative designs for large register files, including multi-cycle access, multi-level [14, 15], multi-banked [14, 16], and queue-based designs [17]. In [1], the authors use a two-level register file [14, 15] that operates on principles similar to a cache hierarchy. In this paper we do not commit to any of these designs; we will investigate these candidates and give our conclusion in future work.

4.2. Communication delay overhead

A big concern about our method is the transfer latency overhead versus the performance improvement from the larger instruction window. Here we only care about the latency of transferring instructions from the remote core back to the original core when they are woken up by the GET_BACK message, because the send-out latency can be overlapped with the waiting time of the long latency operation. If the latency of sending instructions back is too long, it delays their further execution in the original core and degrades the final performance. If the latency is short enough, the sequential program can benefit from our method and achieve better performance with the larger instruction window. In our experience, in a chip multicore processor running at 3 GHz, a memory access generally costs more than 200 cycles (and much more under memory contention), whereas a simple core-to-core transfer needs only about 20 cycles when using an on-chip core-to-core bus (64 bytes wide) as shown in Figure 1. Comparing these two latencies, the transfer latency between cores is so small that we expect it to have only a minor effect on the final performance of the program. In addition, the REMOVE_REQ and GET_BACK messages are sent only for the truly dependent instructions, and the REMOVE_RES, GET_BACK_RES, and INVALIDATE messages are sent only when necessary, so the core-to-core bus is used very efficiently and the added messages do not put much pressure on it.

In summary of the above two subsections, we believe our method is a promising way to boost sequential program execution on a chip multicore processor, although several related issues need to be investigated in more detail. We will report more detailed simulation results and solutions to the unsolved issues in future work.
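The back-of-envelope comparison above can be written out explicitly. The sketch below uses the cycle counts quoted in this subsection (a 200+ cycle miss, roughly 20 cycles per core-to-core transfer); the bytes-per-instruction estimate and the helper output are assumptions added for illustration.

```cpp
// Back-of-envelope check of transfer latency vs. miss latency (Section 4.2 numbers).
// The message-size estimate and printed breakdown are illustrative assumptions.
#include <cstdio>

int main() {
    const int miss_latency_cycles = 200;   // typical miss-to-memory latency quoted above
    const int bus_width_bytes     = 64;    // core-to-core bus width (Figure 1 / Table 1)
    const int transfer_cycles     = 20;    // approximate one-way core-to-core transfer

    // The outbound REMOVE_REQ overlaps with the miss, so only the GET_BACK_RES
    // return path adds to the critical path of the woken-up dependent chain.
    double exposed_overhead = 100.0 * transfer_cycles / miss_latency_cycles;
    std::printf("exposed return latency: %d cycles (%.0f%% of a %d-cycle miss)\n",
                transfer_cycles, exposed_overhead, miss_latency_cycles);

    // One bus transaction moves 64 bytes, so a small group of instructions fits in a
    // handful of transfers; e.g. 8 instructions at an assumed 8 bytes each is one transfer.
    int group_bytes = 8 * 8;
    std::printf("8-instruction group = %d bytes = %d bus transfer(s)\n",
                group_bytes, (group_bytes + bus_width_bytes - 1) / bus_width_bytes);
    return 0;
}
```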
5. Conclusion

Chip multicore processors provide new opportunities to improve sequential program performance with the duplicated hardware resources in their cores. At the same time, a large instruction window can benefit most programs by exploiting instruction level parallelism more deeply. In this paper, we propose a simple method to boost sequential program performance by dynamically allocating unused entries in other cores and building a virtual large instruction window for the core that runs the program. Our method uses very few hardware resources, and the implementation is easy. Our initial qualitative analysis shows that it is a promising way to improve sequential program performance, although some related issues still need to be investigated further.

Acknowledgement

The authors would like to thank the anonymous reviewers for their high quality comments on this paper.

References

[1] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, A Large, Fast Instruction Window for Tolerating Cache Misses, in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02), pp. 59-70, 2002.

[2] P. Michaud, Y. Sazeides, A. Seznec, T. Constantinou, and D. Fetis, A Study of Thread Migration in Temperature-Constrained Multi-Cores, ACM Transactions on Architecture and Code Optimization, June 2007.

[3] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, Dynamically Allocating Processor Resources Between Nearby and Distant ILP, in Proceedings of the 28th Annual International Symposium on Computer Architecture, pp. 26-37, July 2001.

[4] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, Multiscalar Processors, in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 414-425, June 1995.

[5] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, Trace Processors, in Proceedings of the 30th Annual International Symposium on Microarchitecture, pp. 138-148, Dec. 1997.

[6] H. Akkary and M. A. Driscoll, A Dynamic Multithreading Processor, in Proceedings of the 31st Annual International Symposium on Microarchitecture, pp. 226-236, Dec. 1998.

[7] S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-Effective Superscalar Processors, in Proceedings of the 24th Annual International Symposium on Computer Architecture, pp. 206-218, June 1997.

[8] E. Ipek, M. Kirman, N. Kirman, and J. Martinez, Core Fusion: Accommodating Software Diversity in Chip Multiprocessors, in Proceedings of the 34th Annual International Symposium on Computer Architecture, June 2007.

[9] M. D. Brown, J. Stark, and Y. N. Patt, Select-Free Instruction Scheduling Logic, in Proceedings of the 34th Annual International Symposium on Microarchitecture, Dec. 2001.

[10] D. S. Henry, B. C. Kuszmaul, G. H. Loh, and R. Sami, Circuits for Wide-Window Superscalar Processors, in Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 236-247, June 2000.

[11] P. Michaud and A. Seznec, Data-flow Prescheduling for Large Instruction Windows in Out-of-Order Processors, in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, pp. 27-36, Jan. 2001.

[12] J. L. Henning, SPEC CPU2000: Measuring CPU Performance in the New Millennium, IEEE Computer, 33(7):28-35, July 2000.

[13] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, Automatically Characterizing Large Scale Program Behavior, in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2002), San Jose, California, October 2002.

[14] J. L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham, Multiple-Banked Register File Architectures, in Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 316-325, June 2000.

[15] J. Zalamea, J. Llosa, E. Ayguade, and M. Valero, Two-Level Hierarchical Register File Organization for VLIW Processors, in Proceedings of the 33rd Annual International Symposium on Microarchitecture, pp. 137-146, Dec. 2000.

[16] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, Reducing the Complexity of the Register File in Dynamic Superscalar Processors, in Proceedings of the 34th Annual International Symposium on Microarchitecture, Dec. 2001.
[17] B. Black and J. Shen, Scalable Register Renaming via the Quack Register File, Technical Report CMuArt 00-1, Carnegie Mellon University, April 2000.