Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread


Microprocessors and Microsystems 31 (2007)

Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread

Hou Rui, Longbing Zhang, Weiwu Hu

Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Available online 20 October 2006

* Corresponding author. E-mail addresses: hourui@ict.ac.cn (H. Rui), lbzhang@ict.ac.cn (L. Zhang), hww@ict.ac.cn (W. Hu).

Abstract

A Dynamic Prefetching Thread scheme is proposed in this paper to accelerate sequential programs on Chip Multiprocessors. This scheme belongs to the hardware-generated thread-based prefetching techniques and can decouple performance and correctness to some extent. This paper describes the necessary hardware infrastructure supporting Dynamic Prefetching Thread on traditional Chip Multiprocessors. Aiming at the loosely coupled nature of Chip Multiprocessors, we present the Shadow Register mechanism to support rapid register transportation among multiple cores and discuss the selection of the thread spawn time. Furthermore, two aggressive thread construction policies, known as Self-Loop and Fork-on-Recursive-Call, are proposed. The Self-Loop policy can greatly enlarge the prefetching range and issue more timely prefetches. The Fork-on-Recursive-Call policy can effectively accelerate applications that access trees or graphs via recursive calls. For a set of memory-limited benchmarks selected from the Olden benchmark suite, SPEC CPU2000 and the Stream benchmark, an average speedup of 3.8% is achieved on a dual-core CMP when constructing basic Dynamic Prefetching Threads, and this gain grows to 29.6% when adopting our aggressive thread construction policies. © 2006 Elsevier B.V. All rights reserved.

Keywords: Dynamic Prefetching Thread; Chip Multiprocessors

1. Introduction

Advances in integrated circuit technology afford great opportunities for Chip Multiprocessors (CMPs) [24,25]. Although a CMP provides more computation and memory resources, it is a real challenge to utilize its multiple cores to accelerate sequential programs. Many technologies have been proposed, such as parallelizing compilers [32] and thread-level speculation [20,21,30,31]. However, these technologies often require all parallel threads derived from a sequential program to obey the original program semantics. Such constraints lead to more complicated compiler or hardware support, as well as more conservative parallelization policies. Therefore, these technologies only succeed in a few fields and cannot be widely applied [33].

Fortunately, the helper thread technique can relax the constraints involved in the above technologies. Helper threads run in parallel with the main thread to pre-execute critical memory access instructions or precompute the results of hard-to-predict branch instructions ahead of the main thread. A correct helper thread helps the main thread hide long memory access latencies or decrease the penalty of mispredicted branches; a mistaken helper thread does not affect the main thread's correctness, only possibly its performance. So helper threads can decouple performance and correctness to some extent. This feature decreases system complexity significantly.

Almost all general-purpose processors are moving to CMP, including the Gemini, Niagara 1/2 and Panther chips from Sun, Power4/5/6 from IBM and recent announcements from Intel/AMD. Since such CMPs are soon going

to be almost universal in the general-purpose arena, and since many sequential programs are dominated by off-chip memory stalls, helper-thread prefetching on CMPs is an attractive proposition that needs further investigation. In this study, hardware-generated prefetching helper threads are executed on idle cores to accelerate sequential programs. Such threads are called Dynamic Prefetching Threads (DPT). Although several studies focus on thread-based prefetching techniques, most of them are based on SMT processors or additional execution pipelines. In essence, such SMT-like architectures [1] are tightly coupled, while a CMP is a loosely coupled architecture. The resource sharing and conflict-avoidance mechanisms of these two architectures are different. These differences require us to reconsider how to support prefetching helper threads on a CMP.

The main contributions of this work are: (1) We design a CMP architecture with Dynamic Prefetching Thread support; (2) Two effective prefetching thread construction policies, known as Self-Loop and Fork-on-Recursive-Call, are proposed; (3) A Shadow Register mechanism is designed to support rapid register transportation among multiple cores; (4) We make a deep investigation into the selection of the thread spawn time.

The rest of this paper is organized as follows: Section 2 introduces related work. Section 3 describes the methodology. Section 4 introduces Dynamic Prefetching Thread. Section 5 presents the hardware infrastructure supporting Dynamic Prefetching Thread on CMP. More aggressive thread construction policies are proposed in Section 6. Other considerations are evaluated in Section 7. Finally, we conclude in Section 8.

2. Related works

Compared with traditional hardware prefetchers [6,18,19] or software prefetchers [14], thread-based prefetching techniques can effectively accelerate applications with irregular memory access patterns. These techniques typically use additional execution pipelines or idle thread contexts in a multithreaded processor to execute helper threads that perform dynamic prefetching for the main thread [2,3,5,7,8,12,14,15]. Helper threads can be constructed by hardware or by the compiler [5,9]. However, many of these works are based on tightly coupled architectures. Our work focuses on implementing Dynamic Prefetching Thread on CMP, which is a loosely coupled architecture. The resource sharing mechanisms of these two architectures are different. For example, it is quite easy for a tightly coupled architecture to transfer register context from one core to another by copying the rename table or sharing the same physical registers, and multiple threads share the same memory hierarchy in such an architecture. Unfortunately, these issues become complicated when applying thread-based prefetching techniques to a CMP-like loosely coupled architecture.

Speculative Precomputation (SP) [7] and Dynamic Speculative Precomputation [8] focus on the benefits of precomputation on SMT-capable processors. The concept of the chaining trigger is proposed there; it allows a speculative thread to spawn further speculative threads to other thread contexts. The spawner thread must initialize the live-in registers of the spawned thread. This mechanism needs to adjust the code sequence and transport registers among cores. In contrast, our work is based on CMP processors. Our Self-Loop policy need not adjust the code sequence and thus can be easily implemented in hardware.
What's more, only the main thread is allowed to spawn prefetching threads in the Self-Loop policy, so the corresponding register transportation mechanism is simpler. From these perspectives, the Self-Loop policy is more suitable for CMP processors than the chaining trigger.

Slice Processors [12] dynamically construct instruction slices to prefetch delinquent loads. Such slices are constructed by back-end hardware and executed on specialized pipelines. Our work adopts a similar back-end hardware framework. However, our work reconsiders how to support prefetching helper threads on CMP, and more aggressive thread construction policies are proposed in this paper.

Roth and Sohi proposed the Speculative Data-Driven Multithreading (DDMT) execution model [2,3]. DDMT is a general model where performance-critical slices leading to branches or frequently missing loads are pre-executed. A register integration mechanism incorporates the results of pre-execution directly into the main sequential thread, avoiding the need to re-execute these instructions. Our work performs automatic and dynamic detection of slices leading to loads. Moreover, our prefetching threads impact the main sequential thread indirectly, and no integration mechanism is necessary.

Jeffery Brown applied thread-based prefetching techniques to CMP [10]. Two techniques, cache return broadcasting and cache cross-feeding, are proposed to hide some inter-core communication latency and achieve performance improvement. The prefetching threads are constructed by a mixture of hand coding and compiler support. Their work does not demonstrate how to transfer registers among cores. Jiwei Lu implemented prefetching threads on an UltraSPARC processor [27]. The prefetching threads are generated by a user-level monitor thread, and a software data structure named mailbox is used for register transportation. Our work presents the Shadow Register, used for quickly initializing thread contexts among cores, and we adopt automatic, hardware-based aggressive thread construction policies.

In Runahead processors [26], the processor state is checkpointed when a long-latency load stalls the head of the ROB; the load is allowed to retire and the processor continues to execute speculatively. When the data finally returns from memory, the processor rolls back and restarts execution from the load. Our work does not require a checkpoint or any other recovery mechanism. The dual-core execution paradigm (DCE) [13] proposed by Zhou utilizes idle cores of a CMP to speed up

single-threaded programs. The main difference is that our work tries to use idle cores for prefetching, while DCE extends the effective instruction window size by distributing the state of a single thread over the two cores.

3. Simulation methodology

The evaluation is performed using a detailed cycle-accurate execution-driven CMP architecture simulator based on SESC [28], implementing the MIPS ISA [23]. The CMP cores are out-of-order superscalar processors with private L1 instruction and data caches, a shared L2 cache and all lower-level memory hierarchy components. Table 1 lists the parameters in detail. To demonstrate the performance potential of our architecture, a dual-core configuration is used in this simulation.

The memory-limited benchmarks are selected from the Olden pointer-intensive programs [11], SPEC CPU2000 [29] and the Stream benchmark [22]. A large number of cache misses in these benchmarks are due to relatively irregular access patterns involving pointers, hash tables, trees/graphs, indirect or complicated array references, or a mix of them, which are typically difficult for prefetching. We omit the ILP-limited benchmarks since their performance is not dominated by memory access stalls. The train input sets are used for the SPEC benchmarks to achieve reasonable simulation times, and the other benchmarks use standard inputs. In addition, all benchmarks are compiled with gcc -O3 and simulated for one billion committed instructions after fast-forwarding through the initialization with cache warmup, as indicated by SimPoint.

Table 1
Simulated CMP processor parameters

Processor core
  Number of cores/frequency: 2 cores / 2 GHz
  Fetch/Issue/Commit width: 4/4/4
  I-window/ROB/LSQ size: 64/128/64
  Int/FP registers: 184
  LdSt/Int/FP units: 2/4/2
  Execution latencies: similar to MIPS R10000
  Branch predictor: 16K-entry gshare hybrid
  RAS entries: 16

Memory hierarchy
  Cache sizes: 32 KB IL1, 32 KB DL1, 512 KB L2
  Cache associativity: 4-way L1, 8-way L2
  Cache hit/miss latencies: L1: 2/3 cycles, L2: 9/11 cycles
  Cache line sizes: L1: 32 B, L2: 32 B
  Cache ports: L1: 2 ports, L2: 4 ports
  Cache coherence: snooping-based MESI between L1 and L2
  L2 cache store policy: write-back
  MSHRs: L1: 64, L2: 128
  Main memory latency: minimum 200 cycles

Dynamic Prefetching Thread hardware support
  Trace Buffer size: 256 entries
  Thread initialization time: 6 cycles
  DPT Cache size/associativity: 32 KB / 2-way
  Thread construction time: 100 cycles
  Shadow Register size/ports: 64 x 32 B, 4R/4W ports

4. Dynamic Prefetching Thread

Many researchers have found that a small number of static loads, known as delinquent loads, are responsible for the vast majority of memory stall cycles. Furthermore, not all instructions contribute to the address computation of a future delinquent load [7,8,12]. Motivated by these observations, we extract such sequences of instructions as prefetching threads from the executed instruction trace by means of hardware, and utilize idle cores to run these threads, which perform dynamic prefetching for the main thread. Such threads are called Dynamic Prefetching Threads (DPT); they are automatically constructed, triggered, spawned and managed by hardware. A Dynamic Prefetching Thread should exit when exceptions or interrupts occur; the operating system should make no response to these exceptions and interrupts except for TLB exceptions.

Fig. 1 illustrates the CMP architecture with Dynamic Prefetching Thread support. The black blocks are the necessary hardware infrastructure supporting Dynamic Prefetching Thread.

Fig. 1. The architecture of CMP with Dynamic Prefetching Thread support. (F, Fetch; D, Decode; I, Issue; E, Execute; C, Commit.)
The DPT Generator is in charge of extracting Dynamic Prefetching Threads and is located off the pipeline critical path. It has no effect on the pipeline frequency due to its back-end work mode. The Shadow Register is designed for quickly initializing the context of a newly spawned thread. This hardware infrastructure can be applied to most CMP-like loosely coupled architectures.

The organization of the DPT Generator is shown in Fig. 2. The committed load instructions of the original thread and their corresponding execution information (such as the L2 hit/miss flag) are sent to the back-end DPT Generator. These load instructions first probe the trigger point selector, the Spawn Table. Once a trigger point is identified, the corresponding prefetching thread stored in the DPT Cache is dispatched to an idle core and runs in parallel with the original thread to perform dynamic prefetching for the targeted delinquent loads; otherwise, the load queries and updates the Delinquent Load Table (DLT Table), which is in charge of identifying delinquent loads. When a delinquent load is identified, the DPT Generator begins to collect the committed instructions from the main core running the original program. This collection does not stop until the same delinquent load comes again or the Trace Buffer is full. Then the Thread Constructor extracts the sequence of instructions that produces the address of the targeted delinquent load, according to a certain thread construction policy. These extracted instructions are called a Dynamic Prefetching Thread and are stored in the DPT Cache. The DPT Cache can be located either in the DPT Generator or in the processor core; the former is chosen for simplicity.

Fig. 2. The structure of the DPT Generator.
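The control flow of this back-end loop can be summarized by the following C sketch. It is only an illustration of the mechanism described above, under our own assumptions; all identifiers (dpt_observe, spawn_table_hit, construct_thread and so on) are hypothetical, and the DLT counter parameters are those given in Section 5.

    /* dpt_generator.c - a minimal sketch of the DPT Generator's
     * commit-stream loop; helper names are our own invention.      */
    #include <stdbool.h>
    #include <stdint.h>

    #define TRACE_BUF   256     /* Trace Buffer entries (Table 1)   */
    #define DLT_ENTRIES 128     /* DLT Table entries (Section 5)    */

    typedef struct { uint32_t pc; bool is_load; bool l2_miss; } Commit;

    static uint8_t  dlt[DLT_ENTRIES];   /* 5-bit saturating counters */
    static Commit   trace_buf[TRACE_BUF];
    static int      trace_len = -1;     /* -1 means: not collecting  */
    static uint32_t target_pc;          /* delinquent load being traced */

    /* Stubs standing in for the Spawn Table, DPT Cache and Thread
     * Constructor; real hardware implements these as tables.        */
    static bool spawn_table_hit(uint32_t pc)      { (void)pc; return false; }
    static void dispatch_dpt(uint32_t trigger_pc) { (void)trigger_pc; }
    static void construct_thread(const Commit *t, int n, uint32_t pc)
                                                  { (void)t; (void)n; (void)pc; }

    /* +4 on an L2 load miss, -1 otherwise; delinquent at 31 (Section 5). */
    static bool dlt_train(uint32_t pc, bool l2_miss) {
        uint8_t *c = &dlt[(pc >> 2) % DLT_ENTRIES];
        if (l2_miss) *c = (uint8_t)((*c + 4 > 31) ? 31 : *c + 4);
        else if (*c > 0) *c -= 1;
        return *c == 31;
    }

    /* Called once per committed instruction of the original thread. */
    void dpt_observe(Commit c) {
        if (c.is_load && spawn_table_hit(c.pc)) {  /* trigger point   */
            dispatch_dpt(c.pc);                    /* run DPT on an idle core */
            return;
        }
        if (trace_len >= 0) {                      /* collecting a trace */
            trace_buf[trace_len++] = c;
            if ((c.is_load && c.pc == target_pc) || trace_len == TRACE_BUF) {
                construct_thread(trace_buf, trace_len, target_pc);
                trace_len = -1;                    /* done until next trigger */
            }
        } else if (c.is_load && dlt_train(c.pc, c.l2_miss)) {
            target_pc = c.pc;                      /* start collecting */
            trace_len = 0;
        }
    }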
5. Hardware support for Dynamic Prefetching Thread on CMP

This section describes the critical hardware infrastructure, including the thread spawn mechanism, the Shadow Register mechanism and the memory hierarchy. We only briefly introduce the identification of delinquent loads and the basic thread construction policy, since they are similar to previous research.
Delinquent loads are identified at commit time via a simple miss predictor named the DLT Table. It is a PC-indexed table with 128 entries, each holding a 5-bit counter. Each L2 cache load miss increases the corresponding counter by 4; otherwise the counter is decreased by 1. A delinquent load is selected once the counter value reaches 31. Predictor entries are allocated only when an L2 cache load miss (an off-chip memory access in our simulation) has occurred.

5.1. Basic thread construction policy

The basic thread construction policy is similar to those of the Slice Processor and Dynamic Speculative Precomputation [8,12]. The DPT Generator collects the committed instruction trace in the Trace Buffer as described in Section 4. Once the collection is finished, it performs a reverse walk of the committed instruction trace to extract the instructions that contribute to the address computation of the targeted delinquent load. Then a sequence containing these instructions in program order, oldest (lead) to youngest (candidate load), is extracted. This is the Dynamic Prefetching Thread. For simplicity, we only focus on register dependences and ignore both memory and control-flow dependences during the reverse analysis. The Slice Processor paper describes a circuit-level implementation sketch of this thread detection mechanism [12]. Further evaluation of and improvements on this policy are described later. To utilize resources more efficiently, the delinquent load in a Dynamic Prefetching Thread is transformed into a prefetch instruction so that it can be committed earlier.
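As a concrete illustration of this reverse walk, the following C sketch extracts a register-dependence slice from a recorded trace. It is our own minimal rendering under the simplifications stated above (register dependences only); the instruction representation is hypothetical. The registers still marked live when the walk finishes are exactly the thread's live-ins, which Section 5.3 initializes from the Shadow Register.

    /* slice.c - reverse-walk extraction of a Dynamic Prefetching Thread. */
    #include <stdbool.h>

    #define MAX_TRACE 256       /* assume n <= MAX_TRACE (Trace Buffer)  */
    #define NREGS     64        /* integer + floating-point registers    */

    typedef struct {
        unsigned pc;
        int dst;                /* destination register, -1 if none      */
        int src[2];             /* source registers, -1 if unused        */
    } Inst;

    /* Extract, in program order, the instructions that feed the address
     * computation of the delinquent load at trace[n-1]. Returns length. */
    int extract_slice(const Inst *trace, int n, Inst *slice) {
        bool live[NREGS]     = { false };
        bool keep[MAX_TRACE] = { false };

        keep[n - 1] = true;                      /* the delinquent load  */
        for (int j = 0; j < 2; j++)
            if (trace[n - 1].src[j] >= 0) live[trace[n - 1].src[j]] = true;

        for (int i = n - 2; i >= 0; i--) {       /* the reverse walk     */
            int d = trace[i].dst;
            if (d >= 0 && live[d]) {             /* youngest producer of
                                                    a live register      */
                keep[i] = true;
                live[d] = false;                 /* dependence satisfied */
                for (int j = 0; j < 2; j++)      /* its sources go live  */
                    if (trace[i].src[j] >= 0) live[trace[i].src[j]] = true;
            }
        }
        int len = 0;                             /* emit oldest..youngest */
        for (int i = 0; i < n; i++)
            if (keep[i]) slice[len++] = trace[i];
        return len;   /* registers still live in live[] are the live-ins */
    }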

5.2. Thread spawn mechanism

The Spawn Table is in charge of spawning new prefetching threads. It is important for the thread spawn mechanism to choose the appropriate trigger point and spawn time, since they closely affect the hardware implementation and the performance. The delinquent load itself is selected as the trigger point, and its commit time is chosen as the spawn time. A new prefetching thread is dispatched to an idle core whenever a committed load from the main core hits the Spawn Table.

The commit time is selected as the spawn time because it is more suitable for the CMP architecture: it is complicated and time-consuming for a CMP to share or transport the architectural register context from one core to another. Generally speaking, either decode time or commit time can be selected as the spawn time. The former has the advantage of an earlier spawn time, but has problems in transporting the register context among cores, because the value of an instruction's destination register is still unavailable at decode time. In an SMT-like tightly coupled architecture, it is feasible to solve this problem by copying the register rename table and sharing the input physical registers. Unfortunately, this approach seems difficult for a CMP-like loosely coupled architecture. By contrast, the register values are ready at commit time. Therefore, the commit time is selected as the spawn time; only the corresponding architectural registers need to be copied to initialize the new thread context at spawn time.

5.3. Shadow Register mechanism

The core running the original thread has to initialize the registers of the idle core when a prefetching thread is dispatched. The Shadow Register is an effective way to accomplish this initialization. The Shadow Register holds the same data as the main core's registers, just like their shadow. Its size equals the number of architectural registers, including integer and floating-point registers, since both may contribute to the address computation of the targeted delinquent load (e.g., CFC1 in the MIPS ISA). The main core running the original program has write privilege for the Shadow Register, whereas the other cores running helper threads are only allowed to read it. Being an additional register file, the Shadow Register has little effect on the main-core pipeline. The prefetching cores need to access the Shadow Register when uninitialized source registers are to be used.

Some modifications to the pipeline are needed to maintain data consistency between the Shadow Register and the main-core registers. The value and logical index of the destination register are attached to each issued instruction and kept in the ROB entries, so this information can be sent to the Shadow Register at commit time. By these means, data consistency is maintained. It should be pointed out that when data inconsistency exists, execution correctness is not affected; there are only possible performance losses.

Since prefetching threads can also write and update registers, there are two register sources for the new core. We must ensure that only first-time register reads are directed to the Shadow Register; all other accesses are directed to the local register file. Therefore, during the extraction of a Dynamic Prefetching Thread, the live-in registers are analyzed.
This information is used to mark flags in the renaming table of the new core so as to differentiate the two register sources. In the renaming phase, the flag is checked for each source logical register. If the register is a live-in and this is its first access, the Shadow Register is accessed; otherwise the local register is accessed. Once a register has been initialized from the Shadow Register, the corresponding flag in the renaming table is changed so that later instructions fetch the value from the local register. Any destination register is allocated from the local registers in the renaming phase.

The Shadow Register is shared by multiple cores. Its access contention is a potential challenge that scales with the number of CMP cores. Fortunately, the thread construction policies presented in Section 6 need only a small number of prefetching cores, which relieves this contention.
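A minimal sketch of this operand steering follows, assuming hypothetical structure and function names (RenameEntry, shadow_commit_update and so on); it illustrates the flag check described above, not the actual pipeline logic.

    /* shadow.c - steering source reads between the Shadow Register
     * and the local register file on a prefetching core.            */
    #include <stdbool.h>
    #include <stdint.h>

    #define NREGS 64                  /* integer + floating-point    */

    typedef struct {
        int  phys;                    /* local physical register     */
        bool live_in;                 /* marked during DPT extraction */
        bool fetched;                 /* already read from Shadow?   */
    } RenameEntry;

    static RenameEntry rat[NREGS];    /* rename table of the new core */
    static uint32_t shadow[NREGS];    /* written only by the main core */
    static uint32_t local[NREGS * 2]; /* local physical register file  */

    /* Main core, at commit: value and logical index come from the ROB. */
    void shadow_commit_update(int logical, uint32_t value) {
        shadow[logical] = value;
    }

    /* Prefetching core, at rename: read one source operand. */
    uint32_t read_source(int logical) {
        RenameEntry *e = &rat[logical];
        if (e->live_in && !e->fetched) {   /* first access: Shadow    */
            e->fetched = true;             /* later reads go local    */
            local[e->phys] = shadow[logical];
            return shadow[logical];
        }
        return local[e->phys];             /* all other accesses      */
    }

    /* Destination registers are always allocated locally; writing one
     * also clears its live-in flag for subsequent readers.           */
    void write_dest(int logical, int new_phys, uint32_t value) {
        rat[logical].phys    = new_phys;
        rat[logical].live_in = false;
        local[new_phys]      = value;
    }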
5.4. Memory hierarchy

A Dynamic Prefetching Thread does not affect the original data coherence, since such threads contain no store instructions and the selection of delinquent loads only targets cacheable data (not I/O addresses). For simplicity, we utilize the existing CMP memory hierarchy to store the prefetching results.

Prefetching requests are useless when they are illegal or issued later than the main thread's accesses. For useful prefetching requests there are three cases: first, the data has already been fetched into the private cache of the prefetching core; second, the data has been fetched into the shared cache; third, the request has been issued but the data has not yet been fetched back. A cache-missing load in the main core fetches the data from a different place in each of these three cases. In the first case, it finds the data in the other core's private cache through cache coherence (e.g., snooping-based MESI coherence); in the second, it finds the data in the lower-level shared cache; in the last, the cache miss request stalls in the MSHR queue of the lowest on-chip cache until the prefetching request returns. In conclusion, no modifications to the memory hierarchy are required in this work. A prefetching buffer could be added, but it would need a more complex hardware mechanism to keep data coherent; this is future work and is not discussed here.
5.5. The performance of the basic thread construction policy

The performance speedup is illustrated in Fig. 3. Most benchmarks can be accelerated with the basic thread construction policy. However, the performance improvement is only 3.8% on average; the basic policy needs to be improved.

Fig. 3. The performance speedup of the basic thread construction policy.

To understand the performance speedup, prefetching coverage and timeliness information is provided to give a closer look at the prefetching activity (Fig. 4). The prefetching coverage is defined as the ratio of the total number of useful prefetches to the total number of L2 cache load misses originally incurred by the application, and the timeliness indicates how much of the memory latency is hidden or saved by a Dynamic Prefetching Thread. All this information is integrated into one figure, as in Figs. 4 and 7: each bar is broken into eight segments according to the fraction of the miss latency hidden by prefetches, e.g., less than 10 cycles, between 10 and 50 cycles, and so on.

Fig. 4. The prefetching coverage and timeliness of the basic policy.

As shown in Fig. 4, the basic policy has low prefetching coverage for all benchmarks: all are lower than 20%, and for mgrid and treeadd the coverage is almost zero. This is why the speedups are generally quite low. For swim, mgrid, mcf and art, the prefetching requests are issued too late, although most of their prefetching threads successfully compute the addresses of the targeted delinquent loads. This greatly decreases the prefetching efficiency and leads to only tiny performance improvements. Some benchmarks, including treeadd, perimeter and bisort, are pointer-intensive benchmarks with little overlapping work between two consecutive delinquent loads. Although they have a high cache miss rate, significant improvements cannot be achieved. Such benchmarks often traverse a linked list or tree in a tight loop, and many iterations of these loops fit within the main core's instruction window. Coupled with high branch prediction accuracy,

little potential improvement is left for Dynamic Prefetching Threads.

In conclusion, the primary problems of the basic policy lie in two aspects: one is untimely prefetching, the other is inefficiency on dense pointer-intensive applications.

6. Aggressive thread construction policies

This section proposes two aggressive thread construction policies based on the above analysis.

6.1. The Self-Loop policy

There are several ways to issue more timely prefetching requests: (1) speed up thread initialization; (2) spawn the thread as early as possible; (3) prefetch delinquent loads farther ahead. The Self-Loop policy is proposed according to these factors.

In the basic policy, once a delinquent load reaches the top of the ROB, the corresponding thread is dispatched to an idle core to prefetch the next instance of the same delinquent load. That is to say, one Dynamic Prefetching Thread prefetches just one delinquent load. To improve performance, under the Self-Loop policy the next N instances of the same delinquent load are prefetched by the same Dynamic Prefetching Thread from one trigger point. This policy enlarges the prefetching range, helps the thread speculatively prefetch delinquent loads that are not yet seen in the current pipeline, and also decreases the cost of thread initialization. We accomplish this by adding a loop structure to the thread code constructed by the basic policy.

An example from mcf's core loop is shown in Fig. 5. The instruction lw v0, 28(s1) is delinquent. The Dynamic Prefetching Thread is extracted according to the basic policy, and the loop structure is then added around the extracted code (the marked instructions form the framework of the newly added loop structure). By these means, one prefetching thread can prefetch several future instances of the same delinquent load. Although there are many choices for the iteration count, we find 10 iterations suitable for our simulation.

Fig. 5. An example of the Self-Loop policy.

By prefetching the next N instances of the same delinquent load in one thread, several prefetching threads are merged into one. Therefore, the Self-Loop policy can achieve significant performance improvement while utilizing fewer cores. This preserves throughput when accelerating a sequential program with only part of the CMP cores, and relieves the access contention on the Shadow Register. Section 7.4 shows that 4 cores and a 64-entry Shadow Register with 4 read and 4 write ports are enough for our Dynamic Prefetching Thread mechanism. Furthermore, registers need not be transferred between consecutive prefetching threads, because Dynamic Prefetching Threads do not spawn further threads themselves. These characteristics suit the loosely coupled nature of CMP well. As a comparison, the chaining trigger method in [8] allows prefetching threads to be spawned by themselves and run on different thread contexts; such a method needs more prefetching cores and a more complex register transportation mechanism.

The framework of the added loop structure is fixed, so the hardware implementation of the Self-Loop policy is highly feasible. For any thread constructed by the basic policy, the first step is to analyze its register usage and select one unused register as the loop induction register of the new loop structure; this analysis can be done during the thread extraction phase of the basic policy. Second, the initialization instruction for the loop induction register is added at the head.
Then, the targeted delinquent load is transformed into a prefetch instruction in the prefetching thread, unless the load modifies its own address register (e.g., lw a0, 0(a0)). Lastly, the induction register update and the conditional branch instruction are added at the end of the original code.
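These four steps can be sketched in C over a simplified instruction representation. This is our own illustration with a hypothetical opcode set, and it uses N = 10 as in the paper; the branch-offset convention is left abstract.

    /* selfloop.c - wrapping a basic-policy slice in the Self-Loop frame. */
    #include <stdbool.h>

    #define NREGS  64
    #define N_ITER 10    /* iterations per trigger, as in the paper */

    typedef enum { OP_OTHER, OP_LOAD, OP_PREFETCH, OP_LI, OP_ADDI, OP_BNE } Op;

    typedef struct { Op op; int dst, src0, src1; int imm; } Inst;

    /* Step 1: pick a register the slice neither reads nor writes. */
    static int free_reg(const Inst *s, int n) {
        bool used[NREGS] = { false };
        for (int i = 0; i < n; i++) {
            if (s[i].dst  >= 0) used[s[i].dst]  = true;
            if (s[i].src0 >= 0) used[s[i].src0] = true;
            if (s[i].src1 >= 0) used[s[i].src1] = true;
        }
        for (int r = NREGS - 1; r > 0; r--)   /* skip the zero register */
            if (!used[r]) return r;
        return -1;
    }

    /* out must hold n + 3 instructions; returns new length, or -1. */
    int self_loop(const Inst *slice, int n, Inst *out) {
        int ind = free_reg(slice, n);         /* loop induction register */
        if (ind < 0) return -1;
        int len = 0;
        /* Step 2: initialize the induction register at the head.      */
        out[len++] = (Inst){ OP_LI, ind, -1, -1, N_ITER };
        for (int i = 0; i < n; i++) {
            Inst t = slice[i];
            /* Step 3: the delinquent load becomes a prefetch, unless it
             * modifies its own address register, e.g. lw a0, 0(a0),
             * whose result feeds the next iteration's address.         */
            if (i == n - 1 && t.op == OP_LOAD && t.dst != t.src0)
                t.op = OP_PREFETCH;
            out[len++] = t;
        }
        /* Step 4: decrement and branch back to the first slice
         * instruction while the counter is not zero.                   */
        out[len++] = (Inst){ OP_ADDI, ind, ind, -1, -1 };
        out[len++] = (Inst){ OP_BNE, -1, ind, 0, -(n + 1) };
        return len;
    }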
6.2. The Fork-on-Recursive-Call policy

Data structures such as lists, trees and graphs are called Linked Data Structures (LDS) in previous research. We further call such a structure a Dense Linked Data Structure (D-LDS) if there is little overlapping work between consecutive node accesses. It seems hard for both the basic policy and the Self-Loop policy to achieve significant improvement on D-LDS applications. Traditional prefetching approaches for LDS applications include jump pointers, prefetch arrays and so on [2,4,16,17], but these techniques often need to understand the program semantics quite well, so their implementations are usually compiler-based.

However, we find some interesting characteristics of trees and graphs. In fact, most nodes in a tree or graph connect two or more sub-trees or sub-graphs. This inherent memory parallelism can be used for prefetching: when the main program accesses one sub-tree or sub-graph, the otherwise idle cores can be utilized to speculatively access another sub-tree or sub-graph.
What's more, the recursive function is one of the primary ways such structures are accessed. Whenever a recursive call instruction is executed, a new prefetching thread is dispatched to an idle core, starting from the next instruction address; the idle core then begins to speculatively execute the following instructions, accessing another sub-tree or sub-graph for prefetching. This is the Fork-on-Recursive-Call policy.

A hardware stack (tuples of [target, instruction address], with the target as index) and a Recursive Call Table (tuples of [instruction address]) are designed for identifying and recording recursive calls. They work in the back-end and are placed in the DPT Generator. Any function call instruction (e.g., jal or jalr in the MIPS ISA) that reaches the top of the ROB of the original thread triggers the following steps:

(1) Look up the Recursive Call Table to determine whether this call is recursive. If an entry is found, go to (2); otherwise go to (3).
(2) The next instruction address (in the MIPS ISA, the PC following the call instruction's delay slot) is sent to an idle core. The N instructions starting from this address are speculatively executed (N = 200 in our simulation). The procedure exits here.
(3) The instruction looks up the previous entries of the stack using the target address as index. If an entry matches, a recursive call is identified and the instruction address is recorded in the Recursive Call Table. Otherwise, the instruction address and its target are stored at the top of the stack. The stack is emptied if it is full.

Any return instruction (e.g., jr in the MIPS ISA) of the original thread updates the stack at commit time: the top stack entry is popped unless the stack is empty.

A counter is used to control the execution distance of prefetching threads. Call instructions in prefetching threads also look up the Recursive Call Table. Once a recursive call is identified or a return instruction is executed, the counter starts and increases by one for each instruction. In this work, the prefetching thread exits when the counter exceeds 200 or an exception occurs. Store instructions are treated as nops, since such threads are only used for prefetching and must not modify the architectural state.
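The identification logic of steps (1)-(3) can be sketched as follows. This is a C illustration under our own assumptions: the table sizes and names such as fork_prefetch_thread are hypothetical, not the paper's actual hardware.

    /* forkrec.c - identifying recursive calls in the DPT Generator. */
    #include <stdbool.h>
    #include <stdint.h>

    #define STACK_DEPTH 16      /* assumed hardware stack depth      */
    #define RCT_SIZE    32      /* assumed Recursive Call Table size */

    typedef struct { uint32_t target, call_pc; } Frame;

    static Frame    stack[STACK_DEPTH];
    static int      sp = 0;
    static uint32_t rct[RCT_SIZE];     /* PCs of known recursive calls */
    static int      rct_len = 0;

    /* Stub: dispatch a helper that speculatively executes N = 200
     * instructions starting at next_pc on an idle core (step 2).     */
    static void fork_prefetch_thread(uint32_t next_pc) { (void)next_pc; }

    static bool rct_lookup(uint32_t pc) {
        for (int i = 0; i < rct_len; i++)
            if (rct[i] == pc) return true;
        return false;
    }

    /* Committed call instruction (jal/jalr) of the original thread.
     * Returns true when a prefetching thread is forked at next_pc,
     * the PC following the call's delay slot.                        */
    bool on_call(uint32_t call_pc, uint32_t target, uint32_t next_pc) {
        if (rct_lookup(call_pc)) {             /* steps (1) and (2)   */
            fork_prefetch_thread(next_pc);
            return true;
        }
        for (int i = 0; i < sp; i++)           /* step (3): same target
                                                  seen on the stack?  */
            if (stack[i].target == target) {
                if (rct_len < RCT_SIZE) rct[rct_len++] = call_pc;
                return false;
            }
        if (sp == STACK_DEPTH) sp = 0;         /* empty the full stack */
        stack[sp++] = (Frame){ target, call_pc };
        return false;
    }

    /* Committed return instruction (jr) pops the stack. */
    void on_return(void) { if (sp > 0) sp--; }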
6.3. The performance of the aggressive thread construction policies

In this section, we evaluate our aggressive thread construction policies. When the two policies are merged, the Fork-on-Recursive-Call policy is given higher priority than Self-Loop: when a Fork-on-Recursive-Call thread is to be dispatched to a core running a Self-Loop thread, the current thread is preempted and the new thread is spawned.

The performance speedups of the basic policy and the aggressive policies are compared in Fig. 6. Significant improvements are achieved with the aggressive policies: they achieve a 29.6% speedup on average, while the basic policy achieves only 3.8%.

Fig. 6. The performance speedup of the aggressive policies.

Fig. 7 presents the prefetching coverage and timeliness of the aggressive policies. Compared with the basic policy (Fig. 4), the aggressive policies increase the prefetching coverage significantly. For example, the coverage of swim is about 2% under the basic policy and increases to 47% under the aggressive policies; the performance speedup of swim accordingly increases from 0 to 34% with the improvements in prefetching coverage and timeliness.

For swim, art, mgrid, equake, mcf, stream and em3d, most of the speedup comes from the larger coverage and better timeliness achieved by the Self-Loop policy. By enlarging the prefetching range and the number of prefetches per Dynamic Prefetching Thread, the Self-Loop policy makes the threads generate more timely and farther-ahead prefetch requests, as illustrated in Fig. 7.

Fig. 7. The prefetching coverage and timeliness of the aggressive policies.

The Fork-on-Recursive-Call policy drives the performance improvements of treeadd, perimeter and bisort, since these benchmarks all access D-LDS data via recursive calls. This policy effectively exploits the memory parallelism exposed by the recursive calls and thereby improves the prefetching coverage and timeliness, especially for treeadd (30% performance improvement).

7. Other considerations

In this section, all experiments adopt the aggressive thread construction policies.

7.1. Dynamic Prefetching Thread statistics

To gain some insight into the inner workings of Dynamic Prefetching Thread, several kinds of detailed statistics are presented in this section, including the number of unique Dynamic Prefetching Threads detected, the instruction count of these threads and the size of the instruction traces collected.

Table 2 lists the number of unique Dynamic Prefetching Threads detected by the DPT Generator. It shows that few unique threads are detected, which indicates that most cache misses come from a few loads. It can also be observed that the average size of the Dynamic Prefetching Threads is small: all are no more than 20 instructions. For perimeter, treeadd and bisort, this value is not given because almost all of their Dynamic Prefetching Threads are generated by the Fork-on-Recursive-Call policy. These statistics indicate that the DPT Cache only needs a small size and a simple organization. Table 2 also shows that the collected instruction traces are small, all no more than 128 instructions, indicating that the Trace Buffer does not need to be large.

Table 2
The statistics of Dynamic Prefetching Thread (benchmark name; average size of trace; average size of DPT; unique DPTs detected) for swim, mgrid, art, mcf, vpr, bzip, equake, perimeter, treeadd, bisort, em3d and stream.

7.2. Time sensitivity

One of the main design principles is time insensitivity, which has two important aspects: the design should be insensitive to the Dynamic Prefetching Thread construction time, and its performance should be insensitive to the thread initialization time. Fig. 8 illustrates that performance barely decreases as the thread construction time increases from 100 to 500 cycles. Fig. 9 shows that the thread initialization time has little effect on performance for most benchmarks as it varies from 6 to 64 cycles; even for mgrid, the performance loss is acceptable considering the hardware implementation cost. The Self-Loop policy effectively tolerates these delays since it merges multiple threads into one by adding a loop structure, and under the Fork-on-Recursive-Call policy, prefetching the other sub-tree or sub-graph has good prefetching timeliness (Fig. 7), which helps tolerate the delays of thread construction and initialization.

Fig. 8. Performance of Dynamic Prefetching Thread with various thread construction times.

Fig. 9. Performance of Dynamic Prefetching Thread with various thread initialization times.

In conclusion, Figs. 8 and 9 demonstrate that our design is time-insensitive.

7.3. Memory bandwidth overhead

Memory bandwidth overhead is another important metric. In Table 3, we quantify the memory overhead added by Dynamic Prefetching Thread with the aggressive policies, in terms of the increased snooping-based L1-L2 bus traffic, the increased accesses to the L2 cache and the increased memory bus accesses. The traffic on the L1-L2 snooping bus increases noticeably, since each cache miss in a prefetching core must first query the other cores according to the MESI cache coherence protocol. Fortunately, the on-chip bus, with its high bandwidth, can tolerate this increased traffic. For treeadd, perimeter and bisort, the higher L2 cache access traffic is incurred by the greedy prefetching of the Fork-on-Recursive-Call policy. Table 3 demonstrates that our mechanism has only a slight impact on the L2 cache and off-chip memory bandwidth. Most of the additional traffic is due largely to invalid or useless prefetches caused by complicated control dependences in the original program, since our mechanism does not consider control and memory dependences during thread extraction. Although the average increase in off-chip traffic is quite small, it could be decreased further by adopting a feedback mechanism; this is future work and is not discussed in this paper.

7.4. Scalability

Scalability is also a concern. As shown in Fig. 10, performance does not scale well with the core count for most benchmarks. The reason is that the Self-Loop policy tends to merge several prefetching threads into one, so significant performance improvements can be achieved with only a few cores (usually one prefetching core).

Table 3
The increased memory bandwidth overhead (benchmark name; increased snooping bus traffic (%); increased L2 cache accesses (%); increased memory bus accesses (%)) for swim, mgrid, art, mcf, vpr, bzip, equake, perimeter, treeadd, bisort, em3d and stream.

However, for swim, mgrid, art and equake, there are two or more delinquent loads in one core loop. It is hard for a hardware implementation of the Self-Loop policy to merge threads targeting different delinquent loads, and more cores provide more chances to execute the different Dynamic Prefetching Threads simultaneously. Therefore, performance improves when the core count scales from 2 to 4. Yet performance improves only slightly when the core count scales from 4 to 8, since there are usually not that many delinquent loads in the same hot region. In conclusion, Fig. 10 indicates that for most benchmarks, significant performance improvement can be achieved with 4 or fewer cores under our mechanism. The other cores can be used to run other applications, which moreover relieves the access contention on the Shadow Register. Since DPT is insensitive to the thread initialization time, the number of necessary prefetching cores is usually less than 4, and the IPC of memory-limited programs is generally quite low, a 64-entry Shadow Register with 4 read and 4 write ports provides enough read and write bandwidth for the Dynamic Prefetching Thread mechanism.

Fig. 10. The scalability experiment.

8. Conclusion

A Dynamic Prefetching Thread scheme is proposed in this paper to accelerate sequential programs on Chip Multiprocessors. It belongs to the hardware-generated thread-based prefetching techniques and can decouple performance and correctness to some extent. This paper describes the necessary hardware infrastructure supporting Dynamic Prefetching Thread on a traditional Chip Multiprocessor. Aiming at the loosely coupled nature of Chip Multiprocessors, we propose the Shadow Register mechanism to support rapid register transportation among multiple cores and discuss the selection of the thread spawn time.

Two aggressive thread construction policies, Self-Loop and Fork-on-Recursive-Call, are proposed. Constructed by the Self-Loop policy, a Dynamic Prefetching Thread can prefetch the next N instances of the same delinquent load. This policy enlarges the prefetching range, helps the thread prefetch delinquent loads that are not yet seen in the current pipeline, and decreases the cost of thread initialization by merging multiple threads into one. The Fork-on-Recursive-Call policy makes use of the inherent memory parallelism of applications that access tree-like data structures via recursive calls: when the main core accesses one sub-tree or sub-graph, otherwise idle cores are utilized to access other sub-trees or sub-graphs. Furthermore, we describe the hardware implementations of these mechanisms based on traditional Chip Multiprocessors. Almost all the newly added hardware modules work in the back-end, so our mechanisms are time-insensitive and highly feasible.

For a set of memory-limited benchmarks selected from the Olden benchmark suite, SPEC CPU2000 and the Stream benchmark, an average speedup of 3.8% is achieved on a dual-core CMP when constructing simple Dynamic Prefetching Threads, and this gain grows to 29.6% when adopting our proposed aggressive thread construction policies.

Acknowledgements

We thank the anonymous reviewers for their advice. This work is supported by the National Basic Research Program of China (2005CB321600) and the National Natural Science Foundation of China (NSFC).

References

[1] D. Tullsen, S. Eggers, H. Levy, Simultaneous multithreading: maximizing on-chip parallelism, in: 22nd Annual International Symposium on Computer Architecture, 1995.
[2] A. Roth, A. Moshovos, G. Sohi, Dependence based prefetching for linked data structures, in: Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[3] A. Roth, G. Sohi, Speculative data-driven multithreading, in: Seventh International Symposium on High Performance Computer Architecture, 2001.
[4] A. Roth, G.S. Sohi, Effective jump-pointer prefetching for linked data structures, in: Proceedings of the 26th International Symposium on Computer Architecture, 1999.
[5] C. Zilles, G. Sohi, Execution-based prediction using speculative slices, in: 28th Annual International Symposium on Computer Architecture, 2001.
[6] T. Chen, An effective programmable prefetch engine for on-chip caches, in: 28th International Symposium on Microarchitecture, 1995.
[7] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, J. Shen, Speculative precomputation: long-range prefetching of delinquent loads, in: 28th Annual International Symposium on Computer Architecture, 2001.
[8] J.D. Collins, D.M. Tullsen, H. Wang, J.P. Shen, Dynamic speculative precomputation, in: Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
[9] S. Liao, P. Wang, H. Wang, G. Hoflehner, D. Lavery, J. Shen, Post-pass binary adaptation for software-based speculative precomputation, in: ACM Conference on Programming Language Design and Implementation, 2002.
[10] J.A. Brown, H. Wang, et al., Speculative precomputation on chip multiprocessors, in: 6th Workshop on Multithreaded Execution, Architecture, and Compilation (MTEAC-6).
[11] M. Carlisle, Olden: parallelizing programs with dynamic data structures on distributed-memory machines, PhD Thesis, Princeton University Department of Computer Science, 1996.
[12] A. Moshovos, D. Pnevmatikatos, A. Baniasadi, Slice processors: an implementation of operation-based prediction, in: 15th International Conference on Supercomputing, 2001.
[13] H. Zhou, Dual-core execution: building a highly scalable single-thread instruction window, in: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.
[14] T. Mowry, A. Gupta, Tolerating latency through software-controlled prefetching in shared-memory multiprocessors, Journal of Parallel and Distributed Computing, 1991.
[15] C. Luk, Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors, in: 28th Annual International Symposium on Computer Architecture, 2001.
[16] C.-K. Luk, T.C. Mowry, Compiler-based prefetching for recursive data structures, in: Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
[17] M. Karlsson, F. Dahlgren, P. Stenstrom, A prefetching technique for irregular accesses to linked data structures, in: 6th International Symposium on High-Performance Computer Architecture, 2000.
[18] N. Jouppi, Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, in: 17th Annual International Symposium on Computer Architecture, 1990.
[19] D. Joseph, D. Grunwald, Prefetching using Markov predictors, in: 24th International Symposium on Computer Architecture, 1997.
[20] G.S. Sohi, S.E. Breach, T.N. Vijaykumar, Multiscalar processors, in: Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.
[21] J. Steffan, T. Mowry, The potential for using thread-level data speculation to facilitate automatic parallelization, in: Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, 1998.
[22] J.D. McCalpin, STREAM: sustainable memory bandwidth in high performance computers.
[23] K. Yeager, The MIPS R10000 superscalar microprocessor, IEEE Micro 16 (1996).
[24] J. Huh, D. Burger, S. Keckler, Exploring the design space of future CMPs, in: 10th International Conference on Parallel Architectures and Compilation Techniques, 2001.
[25] D. Burger, J.R. Goodman, Billion-transistor architectures: there and back again, Computer 37 (3) (2004).
[26] O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt, Runahead execution: an alternative to very large instruction windows for out-of-order processors, in: Proceedings of the Ninth International Symposium on High-Performance Computer Architecture, 2003.
[27] J. Lu, A. Das, et al., Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor, in: 38th International Symposium on Microarchitecture (MICRO-38), 2005.
[28] J. Renau, B. Fraguela, J. Tuck, W. Liu, K. Strauss, et al., SESC simulator.
[29] SPEC CPU2000 benchmark suite.
[30] J. Renau, J. Tuck, W. Liu, et al., Tasking with out-of-order spawn in TLS chip multiprocessors: microarchitecture and compilation, in: Proceedings of the 19th Annual International Conference on Supercomputing, 2005.
[31] L. Hammond, M. Willey, K. Olukotun, Data speculation support for chip multiprocessors, in: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, 1998.
[32] M.W. Hall, et al., Maximizing multiprocessor performance with the SUIF compiler, Computer (1996).
[33] G.S. Sohi, A. Roth, Speculative multithreaded processors, Computer 34 (4) (2001).


More information

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami Dept. of Informatics, Kyushu

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science, University of Central Florida zhou@cs.ucf.edu Abstract Current integration trends

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

The Design Complexity of Program Undo Support in a General-Purpose Processor

The Design Complexity of Program Undo Support in a General-Purpose Processor The Design Complexity of Program Undo Support in a General-Purpose Processor Radu Teodorescu and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012

Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I. Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 15: Speculation I Prof. Onur Mutlu Carnegie Mellon University 10/10/2012 Reminder: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

A Dynamic Multithreading Processor

A Dynamic Multithreading Processor A Dynamic Multithreading Processor Haitham Akkary Microcomputer Research Labs Intel Corporation haitham.akkary@intel.com Michael A. Driscoll Department of Electrical and Computer Engineering Portland State

More information

Performance Evaluation of data-push Thread on Commercial CMP Platform

Performance Evaluation of data-push Thread on Commercial CMP Platform Performance Evaluation of data-push Thread on Commercial CMP Platform Jianxun Zhang 1,2, Zhimin Gu 1, Ninghan Zheng 1,3, Yan Huang 1, Min Cai 1, Sicai Yang 1, Wenbiao Zhou 1 1 School of Computer, Beijing

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Slice-Processors: An Implementation of Operation-Based Prediction

Slice-Processors: An Implementation of Operation-Based Prediction Slice-Processors: An Implementation of Operation-Based Prediction Andreas Moshovos Electrical and Computer Engineering University of Toronto moshovos@eecg.toronto.edu Dionisios N. Pnevmatikatos Electronic

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Fetch Directed Instruction Prefetching

Fetch Directed Instruction Prefetching In Proceedings of the 32nd Annual International Symposium on Microarchitecture (MICRO-32), November 1999. Fetch Directed Instruction Prefetching Glenn Reinman y Brad Calder y Todd Austin z y Department

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism

Design of Out-Of-Order Superscalar Processor with Speculative Thread Level Parallelism ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Dual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush

Dual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush Dual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush Fredrik Warg and Per Stenstrom Chalmers University of Technology 2006 IEEE. Personal use of this material is permitted. However,

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework In Proceedings of the International Symposium on Code Generation and Optimization (CGO 2006). A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework Weifeng Zhang Brad Calder Dean

More information

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont. History Table. Correlating Prediction Table

EECS 470. Lecture 15. Prefetching. Fall 2018 Jon Beaumont.   History Table. Correlating Prediction Table Lecture 15 History Table Correlating Prediction Table Prefetching Latest A0 A0,A1 A3 11 Fall 2018 Jon Beaumont A1 http://www.eecs.umich.edu/courses/eecs470 Prefetch A3 Slides developed in part by Profs.

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Supporting Speculative Multithreading on Simultaneous Multithreaded Processors

Supporting Speculative Multithreading on Simultaneous Multithreaded Processors Supporting Speculative Multithreading on Simultaneous Multithreaded Processors Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu, and Pen-Chung Yew Department of Computer Science, University

More information

Written Exam / Tentamen

Written Exam / Tentamen Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N.

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. Onur Mutlu Hyesoon Kim Yale N. Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses Onur Mutlu Hyesoon Kim Yale N. Patt High Performance Systems Group Department of Electrical

More information

Precise Exceptions and Out-of-Order Execution. Samira Khan

Precise Exceptions and Out-of-Order Execution. Samira Khan Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all instructions take the same amount of time for execution Idea: Have multiple different functional units that take

More information

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors Portland State University ECE 587/687 The Microarchitecture of Superscalar Processors Copyright by Alaa Alameldeen and Haitham Akkary 2011 Program Representation An application is written as a program,

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Pull based Migration of Real-Time Tasks in Multi-Core Processors

Pull based Migration of Real-Time Tasks in Multi-Core Processors Pull based Migration of Real-Time Tasks in Multi-Core Processors 1. Problem Description The complexity of uniprocessor design attempting to extract instruction level parallelism has motivated the computer

More information

Multi-Version Caches for Multiscalar Processors. Manoj Franklin. Clemson University. 221-C Riggs Hall, Clemson, SC , USA

Multi-Version Caches for Multiscalar Processors. Manoj Franklin. Clemson University. 221-C Riggs Hall, Clemson, SC , USA Multi-Version Caches for Multiscalar Processors Manoj Franklin Department of Electrical and Computer Engineering Clemson University 22-C Riggs Hall, Clemson, SC 29634-095, USA Email: mfrankl@blessing.eng.clemson.edu

More information

Dynamic Performance Tuning for Speculative Threads

Dynamic Performance Tuning for Speculative Threads Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of

More information

The Use of Multithreading for Exception Handling

The Use of Multithreading for Exception Handling The Use of Multithreading for Exception Handling Craig Zilles, Joel Emer*, Guri Sohi University of Wisconsin - Madison *Compaq - Alpha Development Group International Symposium on Microarchitecture - 32

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

Characterization of Repeating Data Access Patterns in Integer Benchmarks

Characterization of Repeating Data Access Patterns in Integer Benchmarks Characterization of Repeating Data Access Patterns in Integer Benchmarks Erik M. Nystrom Roy Dz-ching Ju Wen-mei W. Hwu enystrom@uiuc.edu roy.ju@intel.com w-hwu@uiuc.edu Abstract Processor speeds continue

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to

More information

The Effect of Executing Mispredicted Load Instructions in a Speculative Multithreaded Architecture

The Effect of Executing Mispredicted Load Instructions in a Speculative Multithreaded Architecture The Effect of Executing Mispredicted Load Instructions in a Speculative Multithreaded Architecture Resit Sendag, Ying Chen, and David J Lilja Department of Electrical and Computer Engineering Minnesota

More information

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor

Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Using Incorrect Speculation to Prefetch Data in a Concurrent Multithreaded Processor Ying Chen, Resit Sendag, and David J Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information