Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread


Microprocessors and Microsystems 31 (2007)

Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread

Hou Rui, Longbing Zhang, Weiwu Hu

Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Available online 20 October 2006

* Corresponding author. E-mail addresses: hourui@ict.ac.cn (H. Rui), lbzhang@ict.ac.cn (L. Zhang), hww@ict.ac.cn (W. Hu).

Abstract

A Dynamic Prefetching Thread scheme is proposed in this paper to accelerate sequential programs on Chip Multiprocessors. This scheme belongs to the hardware-generated thread-based prefetching techniques and can decouple performance and correctness to some extent. This paper describes the necessary hardware infrastructure supporting Dynamic Prefetching Thread on traditional Chip Multiprocessors. Aiming at the loosely coupled nature of Chip Multiprocessors, we present the Shadow Register mechanism to support rapid register transportation among multiple cores and discuss the selection of the thread spawn time. Furthermore, two aggressive thread construction policies, known as Self-Loop and Fork-on-Recursive-Call, are proposed. The Self-Loop policy can greatly enlarge the prefetching range and issue more timely prefetches. The Fork-on-Recursive-Call policy can effectively accelerate applications that access trees or graphs via recursive calls. For a set of memory-limited benchmarks selected from the Olden benchmark suite, SPEC CPU2000 and the Stream benchmark, an average speedup of 3.8% is achieved on a dual-core CMP when constructing basic Dynamic Prefetching Threads, and this gain grows to 29.6% when adopting our aggressive thread construction policies. © 2006 Elsevier B.V. All rights reserved.

Keywords: Dynamic Prefetching Thread; Chip Multiprocessors

1. Introduction

Advances in integrated circuit technology afford great opportunities for Chip Multiprocessors (CMPs) [24,25]. Although a CMP provides more computation and memory resources, it is a real challenge to utilize its multiple cores to accelerate sequential programs. Many technologies have been proposed, such as parallelizing compilers [32] and thread-level speculation [20,21,30,31]. However, these technologies often require all parallel threads derived from a sequential program to obey the original program semantics. Such constraints lead to more complicated compiler or hardware support, as well as more conservative parallelization policies. Therefore, these technologies only succeed in a few fields and cannot be widely applied [33].

Fortunately, the helper thread technique can relax the constraints involved in the above technologies. Helper threads run in parallel with the main thread to pre-execute critical memory access instructions or precompute the results of hard-to-predict branch instructions ahead of the main thread. A correct helper thread helps the main thread hide long memory access latencies or decrease the penalty of mispredicted branches; a mistaken helper thread does not affect the main thread's correctness, only possibly its performance. So helper threads can decouple performance and correctness to some extent. This feature decreases system complexity significantly.

Almost all general-purpose processors are moving to CMP, including the Gemini, Niagara 1/2 and Panther chips from Sun, Power4/5/6 from IBM and recent announcements from Intel/AMD. Since such CMPs are soon going

to be almost universal in the general-purpose arena, and since many sequential programs are dominated by off-chip memory stalls, helper-thread prefetching on CMPs is an attractive proposition that needs further investigation. In this study, hardware-generated prefetching helper threads are executed on idle cores to accelerate sequential programs. Such threads are called Dynamic Prefetching Threads (DPT). Although several studies focus on thread-based prefetching techniques, most of them are based on SMT processors or additional execution pipelines. In essence, such SMT-like architectures [1] are tightly coupled, while a CMP is a loosely coupled architecture. The resource sharing and conflict-avoidance mechanisms of these two architectures are different. These differences require us to reconsider how to support prefetching helper threads on a CMP.

The main contributions of this work are: (1) We design a CMP architecture with Dynamic Prefetching Thread support; (2) Two effective prefetching thread construction policies, known as Self-Loop and Fork-on-Recursive-Call, are proposed; (3) A Shadow Register mechanism is designed to support rapid register transportation among multiple cores; (4) We make a deep investigation into the selection of the thread spawn time.

The rest of this paper is organized as follows: Section 2 introduces related work. Section 3 describes the methodology. Section 4 introduces Dynamic Prefetching Thread. Section 5 presents the hardware infrastructure supporting Dynamic Prefetching Thread on CMP. More aggressive thread construction policies are proposed in Section 6. Other considerations are evaluated in Section 7. Finally, we conclude in Section 8.

2. Related works

Compared with traditional hardware prefetchers [6,18,19] or software prefetchers [14], thread-based prefetching techniques can effectively accelerate applications with irregular memory access patterns. These techniques typically use additional execution pipelines or idle thread contexts in a multithreaded processor to execute helper threads that perform dynamic prefetching for the main thread [2,3,5,7,8,12,14,15]. Helper threads can be constructed by hardware or by the compiler [5,9]. However, many of these works are based on tightly coupled architectures. Our work focuses on implementing Dynamic Prefetching Thread on CMP, which is a loosely coupled architecture. The resource sharing mechanisms of these two architectures are different. For example, it is quite easy for a tightly coupled architecture to transfer register context from one core to another by copying the rename table or sharing the same physical registers, and multiple threads share the same memory hierarchy in such an architecture. Unfortunately, these issues become complicated when applying thread-based prefetching techniques to a CMP-like loosely coupled architecture.

Speculative Precomputation (SP) [7] and Dynamic Speculative Precomputation [8] focus on the benefits of precomputation on SMT-capable processors. The concept of the chaining trigger is proposed there; it allows a speculative thread to spawn further speculative threads to other thread contexts. The spawner thread must initialize the live-in registers of the spawned thread. This mechanism needs to adjust the code sequence and transport registers among cores. In contrast, our work is based on CMP processors. Our Self-Loop policy need not adjust the code sequence and thus can be easily implemented in hardware.
What's more, only the main thread is allowed to spawn prefetching threads in the Self-Loop policy, so the corresponding register transportation mechanism is simpler. From these perspectives, the Self-Loop policy is more suitable for CMP processors than the chaining trigger.

Slice Processors [12] dynamically construct instruction slices to prefetch delinquent loads. Such slices are constructed by back-end hardware and executed on specialized pipelines. Our work adopts a similar back-end hardware framework. However, our work reconsiders how to support prefetching helper threads on CMP, and more aggressive thread construction policies are proposed in this paper.

Roth and Sohi proposed the Speculative Data-Driven Multithreading (DDMT) execution model [2,3]. DDMT is a general model where performance-critical slices leading to branches or frequently missing loads are pre-executed. A register integration mechanism incorporates the results of pre-execution directly into the main sequential thread, avoiding the need to re-execute these instructions. Our work performs automatic and dynamic detection of slices leading to loads. Moreover, our prefetching threads impact the main sequential thread indirectly, and no integration mechanism is necessary.

Jeffery Brown applied thread-based prefetching techniques to CMP [10]. Two techniques, cache return broadcasting and cache cross-feeding, are proposed to hide some inter-core communication latency and achieve performance improvement. The prefetching threads are constructed by a mixture of hand coding and compiler support. Their work does not demonstrate how to transfer registers among cores. Jiwei Lu implemented prefetching threads on an UltraSPARC processor [27]. The prefetching threads are generated by a user-level monitor thread, and a software data structure named mailbox is used for register transportation. Our work presents the Shadow Register, used for quickly initializing thread contexts among cores, and we adopt automatic, hardware-based aggressive thread construction policies.

In Runahead processors [26], the processor state is checkpointed when a long-latency load stalls the head of the ROB; the load is allowed to retire and the processor continues to execute speculatively. When the data finally returns from memory, the processor rolls back and restarts execution from the load. Our work does not require a checkpoint or any other recovery mechanism. The dual-core execution paradigm (DCE) [13] proposed by Zhou utilizes idle cores of a CMP to speed up

single-threaded programs. The main difference is that our work tries to use idle cores for prefetching, while DCE extends the effective instruction window size by distributing the state of a single thread over the two cores.

3. Simulation methodology

The evaluation is performed using a detailed cycle-accurate execution-driven CMP architecture simulator based on SESC [28], implementing the MIPS ISA [23]. The CMP cores are out-of-order superscalar processors with private L1 instruction and data caches, a shared L2 cache and all lower-level memory hierarchy components. Table 1 lists the parameters in detail. To demonstrate the performance potential of our architecture, a dual-core configuration is used in this simulation.

The memory-limited benchmarks are selected from the Olden pointer-intensive programs [11], SPEC CPU2000 [29] and the Stream benchmark [22]. A large number of cache misses in these benchmarks are due to relatively irregular access patterns involving pointers, hash tables, trees/graphs, indirect or complicated array references, or a mix of them, which are typically difficult for prefetching. We omit the ILP-limited benchmarks since their performance is not dominated by memory access stalls. The train input sets are used for the SPEC benchmarks to achieve reasonable simulation times, and the other benchmarks use standard inputs. In addition, all benchmarks are compiled with gcc -O3 and simulated for one billion committed instructions after fast-forwarding through the initialization with cache warmup, as indicated by SimPoint.

Table 1
Simulated CMP processor parameters

Processor core
  Number of cores/frequency: 2 cores / 2 GHz
  Fetch/Issue/Commit width: 4/4/4
  I-window/ROB/LSQ size: 64/128/64
  Int/FP registers: 184
  LdSt/Int/FP units: 2/4/2
  Execution latencies: similar to MIPS R10000
  Branch predictor: 16K-entry gshare hybrid
  RAS entries: 16

Memory hierarchy
  Cache sizes: 32 KB IL1, 32 KB DL1, 512 KB L2
  Cache associativity: 4-way L1, 8-way L2
  Cache hit/miss latencies: L1: 2/3 cycles, L2: 9/11 cycles
  Cache line sizes: L1: 32 B, L2: 32 B
  Cache ports: L1: 2 ports, L2: 4 ports
  Cache coherence: snooping-based MESI between L1 and L2
  L2 cache store policy: write-back
  MSHRs: L1: 64, L2: 128
  Main memory latency: minimum 200 cycles

Dynamic Prefetching Thread hardware support
  Trace Buffer size: 256 entries
  Thread initialization time: 6 cycles
  DPT Cache size/associativity: 32 KB / 2-way
  Thread construction time: 100 cycles
  Shadow Register size/ports: 64 x 32 B, 4R/4W ports

4. Dynamic Prefetching Thread

Many researchers have found that a small number of static loads, known as delinquent loads, are responsible for the vast majority of memory stall cycles. Furthermore, not all instructions contribute to the address computation of a future delinquent load [7,8,12]. Motivated by these observations, we extract such sequences of instructions as prefetching threads from the executed instruction trace by means of hardware, and utilize idle cores to run these threads, which perform dynamic prefetching for the main thread. Such threads are called Dynamic Prefetching Threads (DPT); they are automatically constructed, triggered, spawned and managed by hardware. A Dynamic Prefetching Thread should exit when exceptions or interrupts occur; the operating system should make no response to these exceptions and interrupts except for TLB exceptions.

Fig. 1 illustrates the CMP architecture with Dynamic Prefetching Thread support. The black blocks are the necessary hardware infrastructure supporting Dynamic Prefetching Thread.

Fig. 1. The architecture of CMP with Dynamic Prefetching Thread support. (F, Fetch; D, Decode; I, Issue; E, Execute; C, Commit.)
The DPT Generator is in charge of extracting Dynamic Prefetching Threads and is located off the pipeline critical path. It has no effect on the pipeline frequency due to its back-end work mode. The Shadow Register is designed for quickly initializing the context of a newly spawned thread. This hardware infrastructure can be applied to most CMP-like loosely coupled architectures.

The organization of the DPT Generator is shown in Fig. 2. The committed load instructions of the original thread and their corresponding execution information (such as the L2 hit/miss flag) are sent to the back-end DPT Generator. These load instructions first probe the trigger point selector, the Spawn Table. Once a trigger point is identified, the corresponding prefetching thread stored in the DPT Cache is dispatched to an idle core and runs in parallel with the original thread to perform dynamic prefetching for the targeted delinquent loads; otherwise, the load queries and updates the Delinquent Load Table (DLT Table), which is in charge of identifying delinquent loads. When a delinquent load is identified, the DPT Generator begins to collect the committed instructions from the main core running the original program. This collection does not stop until the same delinquent load comes again or the Trace Buffer is full. Then the Thread Constructor extracts the sequence of instructions that produces the address of the targeted delinquent load, according to a certain thread construction policy. These extracted instructions are called a Dynamic Prefetching Thread and are stored in the DPT Cache. The DPT Cache can be located either in the DPT Generator or in the processor core; the former is chosen for simplicity.

Fig. 2. The structure of the DPT Generator.
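The control flow of this back-end loop can be summarized by the following C sketch. It is only an illustration of the mechanism described above, under our own assumptions; all identifiers (dpt_observe, spawn_table_hit, construct_thread and so on) are hypothetical, and the DLT counter parameters are those given in Section 5.

    /* dpt_generator.c - a minimal sketch of the DPT Generator's
     * commit-stream loop; helper names are our own invention.      */
    #include <stdbool.h>
    #include <stdint.h>

    #define TRACE_BUF   256     /* Trace Buffer entries (Table 1)   */
    #define DLT_ENTRIES 128     /* DLT Table entries (Section 5)    */

    typedef struct { uint32_t pc; bool is_load; bool l2_miss; } Commit;

    static uint8_t  dlt[DLT_ENTRIES];   /* 5-bit saturating counters */
    static Commit   trace_buf[TRACE_BUF];
    static int      trace_len = -1;     /* -1 means: not collecting  */
    static uint32_t target_pc;          /* delinquent load being traced */

    /* Stubs standing in for the Spawn Table, DPT Cache and Thread
     * Constructor; real hardware implements these as tables.        */
    static bool spawn_table_hit(uint32_t pc)      { (void)pc; return false; }
    static void dispatch_dpt(uint32_t trigger_pc) { (void)trigger_pc; }
    static void construct_thread(const Commit *t, int n, uint32_t pc)
                                                  { (void)t; (void)n; (void)pc; }

    /* +4 on an L2 load miss, -1 otherwise; delinquent at 31 (Section 5). */
    static bool dlt_train(uint32_t pc, bool l2_miss) {
        uint8_t *c = &dlt[(pc >> 2) % DLT_ENTRIES];
        if (l2_miss) *c = (uint8_t)((*c + 4 > 31) ? 31 : *c + 4);
        else if (*c > 0) *c -= 1;
        return *c == 31;
    }

    /* Called once per committed instruction of the original thread. */
    void dpt_observe(Commit c) {
        if (c.is_load && spawn_table_hit(c.pc)) {  /* trigger point   */
            dispatch_dpt(c.pc);                    /* run DPT on an idle core */
            return;
        }
        if (trace_len >= 0) {                      /* collecting a trace */
            trace_buf[trace_len++] = c;
            if ((c.is_load && c.pc == target_pc) || trace_len == TRACE_BUF) {
                construct_thread(trace_buf, trace_len, target_pc);
                trace_len = -1;                    /* done until next trigger */
            }
        } else if (c.is_load && dlt_train(c.pc, c.l2_miss)) {
            target_pc = c.pc;                      /* start collecting */
            trace_len = 0;
        }
    }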
5. Hardware support for Dynamic Prefetching Thread on CMP

This section describes the critical hardware infrastructure, including the thread spawn mechanism, the Shadow Register mechanism and the memory hierarchy. We only briefly introduce the identification of delinquent loads and the basic thread construction policy, since they are similar to previous research.
Delinquent loads are identified at commit time via a simple miss predictor named the DLT Table. It is a PC-indexed table with 128 entries, each holding a 5-bit counter. Each L2 cache load miss increases the corresponding counter by 4; otherwise the counter is decreased by 1. A delinquent load is selected once the counter value reaches 31. Predictor entries are allocated only when an L2 cache load miss (an off-chip memory access in our simulation) has occurred.

5.1. Basic thread construction policy

The basic thread construction policy is similar to those of the Slice Processor and Dynamic Speculative Precomputation [8,12]. The DPT Generator collects the committed instruction trace in the Trace Buffer as described in Section 4. Once the collection is finished, it performs a reverse walk of the committed instruction trace to extract the instructions that contribute to the address computation of the targeted delinquent load. Then a sequence containing these instructions in program order, oldest (lead) to youngest (candidate load), is extracted. This is the Dynamic Prefetching Thread. For simplicity, we only focus on register dependences and ignore both memory and control-flow dependences during the reverse analysis. The Slice Processor paper describes a circuit-level implementation sketch of this thread detection mechanism [12]. Further evaluation of and improvements on this policy are described later. To utilize resources more efficiently, the delinquent load in a Dynamic Prefetching Thread is transformed into a prefetch instruction so that it can be committed earlier.
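As a concrete illustration of this reverse walk, the following C sketch extracts a register-dependence slice from a recorded trace. It is our own minimal rendering under the simplifications stated above (register dependences only); the instruction representation is hypothetical. The registers still marked live when the walk finishes are exactly the thread's live-ins, which Section 5.3 initializes from the Shadow Register.

    /* slice.c - reverse-walk extraction of a Dynamic Prefetching Thread. */
    #include <stdbool.h>

    #define MAX_TRACE 256       /* assume n <= MAX_TRACE (Trace Buffer)  */
    #define NREGS     64        /* integer + floating-point registers    */

    typedef struct {
        unsigned pc;
        int dst;                /* destination register, -1 if none      */
        int src[2];             /* source registers, -1 if unused        */
    } Inst;

    /* Extract, in program order, the instructions that feed the address
     * computation of the delinquent load at trace[n-1]. Returns length. */
    int extract_slice(const Inst *trace, int n, Inst *slice) {
        bool live[NREGS]     = { false };
        bool keep[MAX_TRACE] = { false };

        keep[n - 1] = true;                      /* the delinquent load  */
        for (int j = 0; j < 2; j++)
            if (trace[n - 1].src[j] >= 0) live[trace[n - 1].src[j]] = true;

        for (int i = n - 2; i >= 0; i--) {       /* the reverse walk     */
            int d = trace[i].dst;
            if (d >= 0 && live[d]) {             /* youngest producer of
                                                    a live register      */
                keep[i] = true;
                live[d] = false;                 /* dependence satisfied */
                for (int j = 0; j < 2; j++)      /* its sources go live  */
                    if (trace[i].src[j] >= 0) live[trace[i].src[j]] = true;
            }
        }
        int len = 0;                             /* emit oldest..youngest */
        for (int i = 0; i < n; i++)
            if (keep[i]) slice[len++] = trace[i];
        return len;   /* registers still live in live[] are the live-ins */
    }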

5.2. Thread spawn mechanism

The Spawn Table is in charge of spawning new prefetching threads. It is important for the thread spawn mechanism to choose the appropriate trigger point and spawn time, since they closely affect the hardware implementation and the performance. The delinquent load itself is selected as the trigger point, and its commit time is chosen as the spawn time. A new prefetching thread is dispatched to an idle core whenever a committed load from the main core hits the Spawn Table.

The commit time is selected as the spawn time because it is more suitable for the CMP architecture: it is complicated and time-consuming for a CMP to share or transport the architectural register context from one core to another. Generally speaking, either decode time or commit time can be selected as the spawn time. The former has the advantage of an earlier spawn time, but has problems in transporting the register context among cores, because the value of an instruction's destination register is still unavailable at decode time. In an SMT-like tightly coupled architecture, it is feasible to solve this problem by copying the register rename table and sharing the input physical registers. Unfortunately, this approach seems difficult for a CMP-like loosely coupled architecture. By contrast, the register values are ready at commit time. Therefore, the commit time is selected as the spawn time; only the corresponding architectural registers need to be copied to initialize the new thread context at spawn time.

5.3. Shadow Register mechanism

The core running the original thread has to initialize the registers of the idle core when a prefetching thread is dispatched. The Shadow Register is an effective way to accomplish this initialization. The Shadow Register holds the same data as the main core's registers, just like their shadow. Its size equals the number of architectural registers, including integer and floating-point registers, since both may contribute to the address computation of the targeted delinquent load (e.g., CFC1 in the MIPS ISA). The main core running the original program has write privilege for the Shadow Register, whereas the other cores running helper threads are only allowed to read it. Being an additional register file, the Shadow Register has little effect on the main-core pipeline. The prefetching cores need to access the Shadow Register when uninitialized source registers are to be used.

Some modifications to the pipeline are needed to maintain data consistency between the Shadow Register and the main-core registers. The value and logical index of the destination register are attached to each issued instruction and kept in the ROB entries, so this information can be sent to the Shadow Register at commit time. By these means, data consistency is maintained. It should be pointed out that when data inconsistency exists, execution correctness is not affected; there are only possible performance losses.

Since prefetching threads can also write and update registers, there are two register sources for the new core. We must ensure that only first-time register reads are directed to the Shadow Register; all other accesses are directed to the local register file. Therefore, during the extraction of a Dynamic Prefetching Thread, the live-in registers are analyzed.
This information is used to mark flags in the renaming table of the new core so as to differentiate the two register sources. In the renaming phase, the flag is checked for each source logical register. If the register is a live-in and this is its first access, the Shadow Register is accessed; otherwise the local register is accessed. Once a register has been initialized from the Shadow Register, the corresponding flag in the renaming table is changed so that later instructions fetch the value from the local register. Any destination register is allocated from the local registers in the renaming phase.

The Shadow Register is shared by multiple cores. Its access contention is a potential challenge that scales with the number of CMP cores. Fortunately, the thread construction policies presented in Section 6 need only a small number of prefetching cores, which relieves this contention.
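A minimal sketch of this operand steering follows, assuming hypothetical structure and function names (RenameEntry, shadow_commit_update and so on); it illustrates the flag check described above, not the actual pipeline logic.

    /* shadow.c - steering source reads between the Shadow Register
     * and the local register file on a prefetching core.            */
    #include <stdbool.h>
    #include <stdint.h>

    #define NREGS 64                  /* integer + floating-point    */

    typedef struct {
        int  phys;                    /* local physical register     */
        bool live_in;                 /* marked during DPT extraction */
        bool fetched;                 /* already read from Shadow?   */
    } RenameEntry;

    static RenameEntry rat[NREGS];    /* rename table of the new core */
    static uint32_t shadow[NREGS];    /* written only by the main core */
    static uint32_t local[NREGS * 2]; /* local physical register file  */

    /* Main core, at commit: value and logical index come from the ROB. */
    void shadow_commit_update(int logical, uint32_t value) {
        shadow[logical] = value;
    }

    /* Prefetching core, at rename: read one source operand. */
    uint32_t read_source(int logical) {
        RenameEntry *e = &rat[logical];
        if (e->live_in && !e->fetched) {   /* first access: Shadow    */
            e->fetched = true;             /* later reads go local    */
            local[e->phys] = shadow[logical];
            return shadow[logical];
        }
        return local[e->phys];             /* all other accesses      */
    }

    /* Destination registers are always allocated locally; writing one
     * also clears its live-in flag for subsequent readers.           */
    void write_dest(int logical, int new_phys, uint32_t value) {
        rat[logical].phys    = new_phys;
        rat[logical].live_in = false;
        local[new_phys]      = value;
    }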
5.4. Memory hierarchy

A Dynamic Prefetching Thread does not affect the original data coherence, since such threads contain no store instructions and the selection of delinquent loads only targets cacheable data (not I/O addresses). For simplicity, we utilize the existing CMP memory hierarchy to store the prefetching results.

Prefetching requests are useless when they are illegal or issued later than the main thread's accesses. For useful prefetching requests there are three cases: first, the data has already been fetched into the private cache of the prefetching core; second, the data has been fetched into the shared cache; third, the request has been issued but the data has not yet been fetched back. A cache-missing load in the main core fetches the data from a different place in each of these three cases. In the first case, it finds the data in the other core's private cache through cache coherence (e.g., snooping-based MESI coherence); in the second, it finds the data in the lower-level shared cache; in the last, the cache miss request stalls in the MSHR queue of the lowest on-chip cache until the prefetching request returns. In conclusion, no modifications to the memory hierarchy are required in this work. A prefetching buffer could be added, but it would need a more complex hardware mechanism to keep data coherent; this is future work and is not discussed here.
5.5. The performance of the basic thread construction policy

The performance speedup is illustrated in Fig. 3. Most benchmarks can be accelerated with the basic thread construction policy. However, the performance improvement is only 3.8% on average; the basic policy needs to be improved.

Fig. 3. The performance speedup of the basic thread construction policy.

To understand the performance speedup, prefetching coverage and timeliness information is provided to give a closer look at the prefetching activity (Fig. 4). The prefetching coverage is defined as the ratio of the total number of useful prefetches to the total number of L2 cache load misses originally incurred by the application, and the timeliness indicates how much of the memory latency is hidden or saved by a Dynamic Prefetching Thread. All this information is integrated into one figure, as in Figs. 4 and 7: each bar is broken into eight segments according to the fraction of the miss latency hidden by prefetches, e.g., less than 10 cycles, between 10 and 50 cycles, and so on.

Fig. 4. The prefetching coverage and timeliness of the basic policy.

As shown in Fig. 4, the basic policy has low prefetching coverage for all benchmarks: all are lower than 20%, and for mgrid and treeadd the coverage is almost zero. This is why the speedups are generally quite low. For swim, mgrid, mcf and art, the prefetching requests are issued too late, although most of their prefetching threads successfully compute the addresses of the targeted delinquent loads. This greatly decreases the prefetching efficiency and leads to only tiny performance improvements. Some benchmarks, including treeadd, perimeter and bisort, are pointer-intensive benchmarks with little overlapping work between two consecutive delinquent loads. Although they have a high cache miss rate, significant improvements cannot be achieved. Such benchmarks often traverse a linked list or tree in a tight loop, and many iterations of these loops fit within the main core's instruction window. Coupled with high branch prediction accuracy,

little potential improvement is left for Dynamic Prefetching Threads.

In conclusion, the primary problems of the basic policy lie in two aspects: one is untimely prefetching, the other is inefficiency on dense pointer-intensive applications.

6. Aggressive thread construction policies

This section proposes two aggressive thread construction policies based on the above analysis.

6.1. The Self-Loop policy

There are several ways to issue more timely prefetching requests: (1) speed up thread initialization; (2) spawn the thread as early as possible; (3) prefetch delinquent loads farther ahead. The Self-Loop policy is proposed according to these factors.

In the basic policy, once a delinquent load reaches the top of the ROB, the corresponding thread is dispatched to an idle core to prefetch the next instance of the same delinquent load. That is to say, one Dynamic Prefetching Thread prefetches just one delinquent load. To improve performance, under the Self-Loop policy the next N instances of the same delinquent load are prefetched by the same Dynamic Prefetching Thread from one trigger point. This policy enlarges the prefetching range, helps the thread speculatively prefetch delinquent loads that are not yet seen in the current pipeline, and also decreases the cost of thread initialization. We accomplish this by adding a loop structure to the thread code constructed by the basic policy.

An example from mcf's core loop is shown in Fig. 5. The instruction lw v0, 28(s1) is delinquent. The Dynamic Prefetching Thread is extracted according to the basic policy, and the loop structure is then added around the extracted code (the marked instructions form the framework of the newly added loop structure). By these means, one prefetching thread can prefetch several future instances of the same delinquent load. Although there are many choices for the iteration count, we find 10 iterations suitable for our simulation.

Fig. 5. An example of the Self-Loop policy.

By prefetching the next N instances of the same delinquent load in one thread, several prefetching threads are merged into one. Therefore, the Self-Loop policy can achieve significant performance improvement while utilizing fewer cores. This preserves throughput when accelerating a sequential program with only part of the CMP cores, and relieves the access contention on the Shadow Register. Section 7.4 shows that 4 cores and a 64-entry Shadow Register with 4 read and 4 write ports are enough for our Dynamic Prefetching Thread mechanism. Furthermore, registers need not be transferred between consecutive prefetching threads, because Dynamic Prefetching Threads do not spawn further threads themselves. These characteristics suit the loosely coupled nature of CMP well. As a comparison, the chaining trigger method in [8] allows prefetching threads to be spawned by themselves and run on different thread contexts; such a method needs more prefetching cores and a more complex register transportation mechanism.

The framework of the added loop structure is fixed, so the hardware implementation of the Self-Loop policy is highly feasible. For any thread constructed by the basic policy, the first step is to analyze its register usage and select one unused register as the loop induction register of the new loop structure; this analysis can be done during the thread extraction phase of the basic policy. Second, the initialization instruction for the loop induction register is added at the head.
Then, the targeted delinquent load is transformed into a prefetch instruction in the prefetching thread, unless the load modifies its own address register (e.g., lw a0, 0(a0)). Lastly, the induction register update and the conditional branch instruction are added at the end of the original code.
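These four steps can be sketched in C over a simplified instruction representation. This is our own illustration with a hypothetical opcode set, and it uses N = 10 as in the paper; the branch-offset convention is left abstract.

    /* selfloop.c - wrapping a basic-policy slice in the Self-Loop frame. */
    #include <stdbool.h>

    #define NREGS  64
    #define N_ITER 10    /* iterations per trigger, as in the paper */

    typedef enum { OP_OTHER, OP_LOAD, OP_PREFETCH, OP_LI, OP_ADDI, OP_BNE } Op;

    typedef struct { Op op; int dst, src0, src1; int imm; } Inst;

    /* Step 1: pick a register the slice neither reads nor writes. */
    static int free_reg(const Inst *s, int n) {
        bool used[NREGS] = { false };
        for (int i = 0; i < n; i++) {
            if (s[i].dst  >= 0) used[s[i].dst]  = true;
            if (s[i].src0 >= 0) used[s[i].src0] = true;
            if (s[i].src1 >= 0) used[s[i].src1] = true;
        }
        for (int r = NREGS - 1; r > 0; r--)   /* skip the zero register */
            if (!used[r]) return r;
        return -1;
    }

    /* out must hold n + 3 instructions; returns new length, or -1. */
    int self_loop(const Inst *slice, int n, Inst *out) {
        int ind = free_reg(slice, n);         /* loop induction register */
        if (ind < 0) return -1;
        int len = 0;
        /* Step 2: initialize the induction register at the head.      */
        out[len++] = (Inst){ OP_LI, ind, -1, -1, N_ITER };
        for (int i = 0; i < n; i++) {
            Inst t = slice[i];
            /* Step 3: the delinquent load becomes a prefetch, unless it
             * modifies its own address register, e.g. lw a0, 0(a0),
             * whose result feeds the next iteration's address.         */
            if (i == n - 1 && t.op == OP_LOAD && t.dst != t.src0)
                t.op = OP_PREFETCH;
            out[len++] = t;
        }
        /* Step 4: decrement and branch back to the first slice
         * instruction while the counter is not zero.                   */
        out[len++] = (Inst){ OP_ADDI, ind, ind, -1, -1 };
        out[len++] = (Inst){ OP_BNE, -1, ind, 0, -(n + 1) };
        return len;
    }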
6.2. The Fork-on-Recursive-Call policy

Data structures such as lists, trees and graphs are called Linked Data Structures (LDS) in previous research. We further call such a structure a Dense Linked Data Structure (D-LDS) if there is little overlapping work between consecutive node accesses. It seems hard for both the basic policy and the Self-Loop policy to achieve significant improvement on D-LDS applications. Traditional prefetching approaches for LDS applications include jump pointers, prefetch arrays and so on [2,4,16,17], but these techniques often need to understand the program semantics quite well, so their implementations are usually compiler-based.

However, we find some interesting characteristics of trees and graphs. In fact, most nodes in a tree or graph connect two or more sub-trees or sub-graphs. This inherent memory parallelism can be used for prefetching: when the main program accesses one sub-tree or sub-graph, the otherwise idle cores can be utilized to speculatively access another sub-tree or sub-graph.
What's more, the recursive function is one of the primary ways such structures are accessed. Whenever a recursive call instruction is executed, a new prefetching thread is dispatched to an idle core, starting from the next instruction address; the idle core then begins to speculatively execute the following instructions, accessing another sub-tree or sub-graph for prefetching. This is the Fork-on-Recursive-Call policy.

A hardware stack (tuples of [target, instruction address], with the target as index) and a Recursive Call Table (tuples of [instruction address]) are designed for identifying and recording recursive calls. They work in the back-end and are placed in the DPT Generator. Any function call instruction (e.g., jal or jalr in the MIPS ISA) that reaches the top of the ROB of the original thread triggers the following steps:

(1) Look up the Recursive Call Table to determine whether this call is recursive. If an entry is found, go to (2); otherwise go to (3).
(2) The next instruction address (in the MIPS ISA, the PC following the call instruction's delay slot) is sent to an idle core. The N instructions starting from this address are speculatively executed (N = 200 in our simulation). The procedure exits here.
(3) The instruction looks up the previous entries of the stack using the target address as index. If an entry matches, a recursive call is identified and the instruction address is recorded in the Recursive Call Table. Otherwise, the instruction address and its target are stored at the top of the stack. The stack is emptied if it is full.

Any return instruction (e.g., jr in the MIPS ISA) of the original thread updates the stack at commit time: the top stack entry is popped unless the stack is empty.

A counter is used to control the execution distance of prefetching threads. Call instructions in prefetching threads also look up the Recursive Call Table. Once a recursive call is identified or a return instruction is executed, the counter starts and increases by one for each instruction. In this work, the prefetching thread exits when the counter exceeds 200 or an exception occurs. Store instructions are treated as nops, since such threads are only used for prefetching and must not modify the architectural state.
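The identification logic of steps (1)-(3) can be sketched as follows. This is a C illustration under our own assumptions: the table sizes and names such as fork_prefetch_thread are hypothetical, not the paper's actual hardware.

    /* forkrec.c - identifying recursive calls in the DPT Generator. */
    #include <stdbool.h>
    #include <stdint.h>

    #define STACK_DEPTH 16      /* assumed hardware stack depth      */
    #define RCT_SIZE    32      /* assumed Recursive Call Table size */

    typedef struct { uint32_t target, call_pc; } Frame;

    static Frame    stack[STACK_DEPTH];
    static int      sp = 0;
    static uint32_t rct[RCT_SIZE];     /* PCs of known recursive calls */
    static int      rct_len = 0;

    /* Stub: dispatch a helper that speculatively executes N = 200
     * instructions starting at next_pc on an idle core (step 2).     */
    static void fork_prefetch_thread(uint32_t next_pc) { (void)next_pc; }

    static bool rct_lookup(uint32_t pc) {
        for (int i = 0; i < rct_len; i++)
            if (rct[i] == pc) return true;
        return false;
    }

    /* Committed call instruction (jal/jalr) of the original thread.
     * Returns true when a prefetching thread is forked at next_pc,
     * the PC following the call's delay slot.                        */
    bool on_call(uint32_t call_pc, uint32_t target, uint32_t next_pc) {
        if (rct_lookup(call_pc)) {             /* steps (1) and (2)   */
            fork_prefetch_thread(next_pc);
            return true;
        }
        for (int i = 0; i < sp; i++)           /* step (3): same target
                                                  seen on the stack?  */
            if (stack[i].target == target) {
                if (rct_len < RCT_SIZE) rct[rct_len++] = call_pc;
                return false;
            }
        if (sp == STACK_DEPTH) sp = 0;         /* empty the full stack */
        stack[sp++] = (Frame){ target, call_pc };
        return false;
    }

    /* Committed return instruction (jr) pops the stack. */
    void on_return(void) { if (sp > 0) sp--; }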
6.3. The performance of the aggressive thread construction policies

In this section, we evaluate our aggressive thread construction policies. When the two policies are merged, the Fork-on-Recursive-Call policy is given higher priority than Self-Loop: when a Fork-on-Recursive-Call thread is to be dispatched to a core running a Self-Loop thread, the current thread is preempted and the new thread is spawned.

The performance speedups of the basic policy and the aggressive policies are compared in Fig. 6. Significant improvements are achieved with the aggressive policies: they achieve a 29.6% speedup on average, while the basic policy achieves only 3.8%.

Fig. 6. The performance speedup of the aggressive policies.

Fig. 7 presents the prefetching coverage and timeliness of the aggressive policies. Compared with the basic policy (Fig. 4), the aggressive policies increase the prefetching coverage significantly. For example, the coverage of swim is about 2% under the basic policy and increases to 47% under the aggressive policies; the performance speedup of swim accordingly increases from 0 to 34% with the improvements in prefetching coverage and timeliness.

For swim, art, mgrid, equake, mcf, stream and em3d, most of the speedup comes from the larger coverage and better timeliness achieved by the Self-Loop policy. By enlarging the prefetching range and the number of prefetches per Dynamic Prefetching Thread, the Self-Loop policy makes the threads generate more timely and farther-ahead prefetch requests, as illustrated in Fig. 7.

Fig. 7. The prefetching coverage and timeliness of the aggressive policies.

The Fork-on-Recursive-Call policy drives the performance improvements of treeadd, perimeter and bisort, since these benchmarks all access D-LDS data via recursive calls. This policy effectively exploits the memory parallelism exposed by the recursive calls and thereby improves the prefetching coverage and timeliness, especially for treeadd (30% performance improvement).

7. Other considerations

In this section, all experiments adopt the aggressive thread construction policies.

7.1. Dynamic Prefetching Thread statistics

To gain some insight into the inner workings of Dynamic Prefetching Thread, several kinds of detailed statistics are presented in this section, including the number of unique Dynamic Prefetching Threads detected, the instruction count of these threads and the size of the instruction traces collected.

Table 2 lists the number of unique Dynamic Prefetching Threads detected by the DPT Generator. It shows that few unique threads are detected, which indicates that most cache misses come from a few loads. It can also be observed that the average size of the Dynamic Prefetching Threads is small: all are no more than 20 instructions. For perimeter, treeadd and bisort, this value is not given because almost all of their Dynamic Prefetching Threads are generated by the Fork-on-Recursive-Call policy. These statistics indicate that the DPT Cache only needs a small size and a simple organization. Table 2 also shows that the collected instruction traces are small, all no more than 128 instructions, indicating that the Trace Buffer does not need to be large.

Table 2
The statistics of Dynamic Prefetching Thread (benchmark name; average size of trace; average size of DPT; unique DPTs detected) for swim, mgrid, art, mcf, vpr, bzip, equake, perimeter, treeadd, bisort, em3d and stream.

7.2. Time sensitivity

One of the main design principles is time insensitivity, which has two important aspects: the design should be insensitive to the Dynamic Prefetching Thread construction time, and its performance should be insensitive to the thread initialization time. Fig. 8 illustrates that performance barely decreases as the thread construction time increases from 100 to 500 cycles. Fig. 9 shows that the thread initialization time has little effect on performance for most benchmarks as it varies from 6 to 64 cycles; even for mgrid, the performance loss is acceptable considering the hardware implementation cost. The Self-Loop policy effectively tolerates these delays since it merges multiple threads into one by adding a loop structure, and under the Fork-on-Recursive-Call policy, prefetching the other sub-tree or sub-graph has good prefetching timeliness (Fig. 7), which helps tolerate the delays of thread construction and initialization.

Fig. 8. Performance of Dynamic Prefetching Thread with various thread construction times.

Fig. 9. Performance of Dynamic Prefetching Thread with various thread initialization times.

In conclusion, Figs. 8 and 9 demonstrate that our design is time-insensitive.

7.3. Memory bandwidth overhead

Memory bandwidth overhead is another important metric. In Table 3, we quantify the memory overhead added by Dynamic Prefetching Thread with the aggressive policies, in terms of the increased snooping-based L1-L2 bus traffic, the increased accesses to the L2 cache and the increased memory bus accesses. The traffic on the L1-L2 snooping bus increases noticeably, since each cache miss in a prefetching core must first query the other cores according to the MESI cache coherence protocol. Fortunately, the on-chip bus, with its high bandwidth, can tolerate this increased traffic. For treeadd, perimeter and bisort, the higher L2 cache access traffic is incurred by the greedy prefetching of the Fork-on-Recursive-Call policy. Table 3 demonstrates that our mechanism has only a slight impact on the L2 cache and off-chip memory bandwidth. Most of the additional traffic is due largely to invalid or useless prefetches caused by complicated control dependences in the original program, since our mechanism does not consider control and memory dependences during thread extraction. Although the average increase in off-chip traffic is quite small, it could be decreased further by adopting a feedback mechanism; this is future work and is not discussed in this paper.

7.4. Scalability

Scalability is also a concern. As shown in Fig. 10, performance does not scale well with the core count for most benchmarks. The reason is that the Self-Loop policy tends to merge several prefetching threads into one, so significant performance improvements can be achieved with only a few cores (usually one prefetching core).

Table 3
The increased memory bandwidth overhead (benchmark name; increased snooping bus traffic (%); increased L2 cache accesses (%); increased memory bus accesses (%)) for swim, mgrid, art, mcf, vpr, bzip, equake, perimeter, treeadd, bisort, em3d and stream.

However, for swim, mgrid, art and equake, there are two or more delinquent loads in one core loop. It is hard for a hardware implementation of the Self-Loop policy to merge threads targeting different delinquent loads, and more cores provide more chances to execute the different Dynamic Prefetching Threads simultaneously. Therefore, performance improves when the core count scales from 2 to 4. Yet performance improves only slightly when the core count scales from 4 to 8, since there are usually not that many delinquent loads in the same hot region. In conclusion, Fig. 10 indicates that for most benchmarks, significant performance improvement can be achieved with 4 or fewer cores under our mechanism. The other cores can be used to run other applications, which moreover relieves the access contention on the Shadow Register. Since DPT is insensitive to the thread initialization time, the number of necessary prefetching cores is usually less than 4, and the IPC of memory-limited programs is generally quite low, a 64-entry Shadow Register with 4 read and 4 write ports provides enough read and write bandwidth for the Dynamic Prefetching Thread mechanism.

Fig. 10. The scalability experiment.

8. Conclusion

A Dynamic Prefetching Thread scheme is proposed in this paper to accelerate sequential programs on Chip Multiprocessors. It belongs to the hardware-generated thread-based prefetching techniques and can decouple performance and correctness to some extent. This paper describes the necessary hardware infrastructure supporting Dynamic Prefetching Thread on a traditional Chip Multiprocessor. Aiming at the loosely coupled nature of Chip Multiprocessors, we propose the Shadow Register mechanism to support rapid register transportation among multiple cores and discuss the selection of the thread spawn time.

Two aggressive thread construction policies, Self-Loop and Fork-on-Recursive-Call, are proposed. Constructed by the Self-Loop policy, a Dynamic Prefetching Thread can prefetch the next N instances of the same delinquent load. This policy enlarges the prefetching range, helps the thread prefetch delinquent loads that are not yet seen in the current pipeline, and decreases the cost of thread initialization by merging multiple threads into one. The Fork-on-Recursive-Call policy makes use of the inherent memory parallelism of applications that access tree-like data structures via recursive calls: when the main core accesses one sub-tree or sub-graph, otherwise idle cores are utilized to access other sub-trees or sub-graphs. Furthermore, we describe the hardware implementations of these mechanisms based on traditional Chip Multiprocessors. Almost all the newly added hardware modules work in the back-end, so our mechanisms are time-insensitive and highly feasible.

For a set of memory-limited benchmarks selected from the Olden benchmark suite, SPEC CPU2000 and the Stream benchmark, an average speedup of 3.8% is achieved on a dual-core CMP when constructing simple Dynamic Prefetching Threads, and this gain grows to 29.6% when adopting our proposed aggressive thread construction policies.

Acknowledgements

We thank the anonymous reviewers for their advice. This work is supported by the National Basic Research Program of China (2005CB321600) and the National Natural Science Foundation of China (NSFC).

References

[1] D. Tullsen, S. Eggers, H. Levy, Simultaneous multithreading: maximizing on-chip parallelism, in: 22nd Annual International Symposium on Computer Architecture, 1995.
[2] A. Roth, A. Moshovos, G. Sohi, Dependence based prefetching for linked data structures, in: Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[3] A. Roth, G. Sohi, Speculative data-driven multithreading, in: Seventh International Symposium on High Performance Computer Architecture, 2001.
[4] A. Roth, G.S. Sohi, Effective jump-pointer prefetching for linked data structures, in: Proceedings of the 26th International Symposium on Computer Architecture, 1999.
[5] C. Zilles, G. Sohi, Execution-based prediction using speculative slices, in: 28th Annual International Symposium on Computer Architecture, 2001.
[6] T. Chen, An effective programmable prefetch engine for on-chip caches, in: 28th International Symposium on Microarchitecture, 1995.
[7] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, J. Shen, Speculative precomputation: long-range prefetching of delinquent loads, in: 28th Annual International Symposium on Computer Architecture, 2001.
[8] J.D. Collins, D.M. Tullsen, H. Wang, J.P. Shen, Dynamic speculative precomputation, in: Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
[9] S. Liao, P. Wang, H. Wang, G. Hoflehner, D. Lavery, J. Shen, Post-pass binary adaptation for software-based speculative precomputation, in: ACM Conference on Programming Language Design and Implementation, 2002.
[10] J.A. Brown, H. Wang, et al., Speculative precomputation on chip multiprocessors, in: 6th Workshop on Multithreaded Execution, Architecture, and Compilation (MTEAC-6).
[11] M. Carlisle, Olden: parallelizing programs with dynamic data structures on distributed-memory machines, PhD Thesis, Princeton University Department of Computer Science, 1996.
[12] A. Moshovos, D. Pnevmatikatos, A. Baniasadi, Slice processors: an implementation of operation-based prediction, in: 15th International Conference on Supercomputing, 2001.
[13] H. Zhou, Dual-core execution: building a highly scalable single-thread instruction window, in: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.
[14] T. Mowry, A. Gupta, Tolerating latency through software-controlled prefetching in shared-memory multiprocessors, Journal of Parallel and Distributed Computing, 1991.
[15] C. Luk, Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors, in: 28th Annual International Symposium on Computer Architecture, 2001.
[16] C.-K. Luk, T.C. Mowry, Compiler-based prefetching for recursive data structures, in: Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
[17] M. Karlsson, F. Dahlgren, P. Stenstrom, A prefetching technique for irregular accesses to linked data structures, in: 6th International Symposium on High-Performance Computer Architecture, 2000.
[18] N. Jouppi, Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, in: 17th Annual International Symposium on Computer Architecture, 1990.
[19] D. Joseph, D. Grunwald, Prefetching using Markov predictors, in: 24th International Symposium on Computer Architecture, 1997.
[20] G.S. Sohi, S.E. Breach, T.N. Vijaykumar, Multiscalar processors, in: Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.
[21] J. Steffan, T. Mowry, The potential for using thread-level data speculation to facilitate automatic parallelization, in: Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, 1998.
[22] J.D. McCalpin, STREAM: sustainable memory bandwidth in high performance computers.
[23] K. Yeager, The MIPS R10000 superscalar microprocessor, IEEE Micro 16 (1996).
[24] J. Huh, D. Burger, S. Keckler, Exploring the design space of future CMPs, in: 10th International Conference on Parallel Architectures and Compilation Techniques, 2001.
[25] D. Burger, J.R. Goodman, Billion-transistor architectures: there and back again, Computer 37 (3) (2004).
[26] O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt, Runahead execution: an alternative to very large instruction windows for out-of-order processors, in: Proceedings of the Ninth International Symposium on High-Performance Computer Architecture, 2003.
[27] J. Lu, A. Das, et al., Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor, in: 38th International Symposium on Microarchitecture (MICRO-38), 2005.
[28] J. Renau, B. Fraguela, J. Tuck, W. Liu, K. Strauss, et al., SESC simulator.
[29] SPEC CPU2000 benchmark suite.
[30] J. Renau, J. Tuck, W. Liu, et al., Tasking with out-of-order spawn in TLS chip multiprocessors: microarchitecture and compilation, in: Proceedings of the 19th Annual International Conference on Supercomputing, 2005.
[31] L. Hammond, M. Willey, K. Olukotun, Data speculation support for chip multiprocessors, in: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, 1998.
[32] M.W. Hall, et al., Maximizing multiprocessor performance with the SUIF compiler, Computer (1996).
[33] G.S. Sohi, A. Roth, Speculative multithreaded processors, Computer 34 (4) (2001).


More information

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami Dept. of Informatics, Kyushu

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science, University of Central Florida zhou@cs.ucf.edu Abstract Current integration trends

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

The Design Complexity of Program Undo Support in a General-Purpose Processor

The Design Complexity of Program Undo Support in a General-Purpose Processor The Design Complexity of Program Undo Support in a General-Purpose Processor Radu Teodorescu and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 Reminder: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

A Dynamic Multithreading Processor

A Dynamic Multithreading Processor A Dynamic Multithreading Processor Haitham Akkary Microcomputer Research Labs Intel Corporation haitham.akkary@intel.com Michael A. Driscoll Department of Electrical and Computer Engineering Portland State

More information

Performance Evaluation of data-push Thread on Commercial CMP Platform

Performance Evaluation of data-push Thread on Commercial CMP Platform Performance Evaluation of data-push Thread on Commercial CMP Platform Jianxun Zhang 1,2, Zhimin Gu 1, Ninghan Zheng 1,3, Yan Huang 1, Min Cai 1, Sicai Yang 1, Wenbiao Zhou 1 1 School of Computer, Beijing

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Slice-Processors: An Implementation of Operation-Based Prediction

Slice-Processors: An Implementation of Operation-Based Prediction Slice-Processors: An Implementation of Operation-Based Prediction Andreas Moshovos Electrical and Computer Engineering University of Toronto moshovos@eecg.toronto.edu Dionisios N. Pnevmatikatos Electronic

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Fetch Directed Instruction Prefetching

Fetch Directed Instruction Prefetching In Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO-32), November 1999. Fetch Directed Instruction Prefetching Glenn Reinman y Brad Calder y Todd Austin z y Department

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Dual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush

Dual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush Dual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush Fredrik Warg and Per Stenstrom Chalmers University of Technology 2006 IEEE. Personal use of this material is permitted. However,

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework In Proceedings of the International Symposium on Code Generation and Optimization (CGO 2006). A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework Weifeng Zhang Brad Calder Dean

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Supporting Speculative Multithreading on Simultaneous Multithreaded Processors

Supporting Speculative Multithreading on Simultaneous Multithreaded Processors Supporting Speculative Multithreading on Simultaneous Multithreaded Processors Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu, and Pen-Chung Yew Department of Computer Science, University

More information

Written Exam / Tentamen

Written Exam / Tentamen Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N.

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N. Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses Onur Mutlu Hyesoon Kim Yale N. Patt High Performance Systems Group Department of Electrical

More information

Precise Exceptions and Out-of-Order Execution. Samira Khan

Precise Exceptions and Out-of-Order Execution. Samira Khan Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take

More information

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Pull based Migration of Real-Time Tasks in Multi-Core Processors

Pull based Migration of Real-Time Tasks in Multi-Core Processors Pull based Migration of Real-Time Tasks in Multi-Core Processors 1. Problem Description The complexity of uniprocessor design attempting to extract instruction level parallelism has motivated the computer

More information

Multi-Version Caches for Multiscalar Processors. Manoj Franklin. Clemson University. 221-C Riggs Hall, Clemson, SC , USA

Multi-Version Caches for Multiscalar Processors. Manoj Franklin. Clemson University. 221-C Riggs Hall, Clemson, SC , USA Multi-Version Caches for Multiscalar Processors Manoj Franklin Department of Electrical and Computer Engineering Clemson University 22-C Riggs Hall, Clemson, SC 29634-095, USA Email: mfrankl@blessing.eng.clemson.edu

More information

Dynamic Performance Tuning for Speculative Threads

Dynamic Performance Tuning for Speculative Threads Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of

More information

The Use of Multithreading for Exception Handling

The Use of Multithreading for Exception Handling The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

Characterization of Repeating Data Access Patterns in Integer Benchmarks

Characterization of Repeating Data Access Patterns in Integer Benchmarks Characterization of Repeating Data Access Patterns in Integer Benchmarks Erik M. Nystrom Roy Dz-ching Ju Wen-mei W. Hwu enystrom@uiuc.edu roy.ju@intel.com w-hwu@uiuc.edu Abstract Processor speeds continue

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to

More information

The Effect of Executing Mispredicted Load Instructions in a Speculative Multithreaded Architecture

The Effect of Executing Mispredicted Load Instructions in a Speculative Multithreaded Architecture The Effect of Executing Mispredicted Load Instructions in a Speculative Multithreaded Architecture Resit Sendag, Ying Chen, and David J Lilja Department of Electrical and Computer Engineering Minnesota

More information

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Ying Chen, Resit Sendag, and David J Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information